A Hybrid Method of Exponential Smoothing and Recurrent Neural Networks for Time Series Forecasting
S. Smyl, International Journal of Forecasting 36 (2020) 75–85
Keywords: Forecasting competitions; M4; Dynamic computational graphs; Automatic differentiation; Long short term memory (LSTM) networks; Exponential smoothing

Abstract: This paper presents the winning submission of the M4 forecasting competition. The submission utilizes a dynamic computational graph neural network system that enables a standard exponential smoothing model to be mixed with advanced long short term memory networks into a common framework. The result is a hybrid and hierarchical forecasting method.

© 2019 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.ijforecast.2019.03.017
hierarchical manner, meaning that both local and global components are utilized in order to extract and combine information at either a series or a dataset level, thus enhancing the forecasting accuracy.

The rest of the paper is organized as follows. Section 2 introduces the method and describes it in a general sense, while Section 3 gives more implementation details. Section 4 concludes by sketching some general modelling possibilities that are provided by recent NN systems and probabilistic programming languages, and, in this context, traces back the development of the models described in this paper.

2. Methodology

2.1. Intuition and overview of the hybrid method

The method effectively mixes ES models with LSTM networks, and in so doing, provides forecasts that are more accurate than those generated by either pure statistical or ML approaches, thus exploiting their advantages while avoiding their drawbacks. This hybrid forecasting approach has three main elements: (i) deseasonalization and adaptive normalization, (ii) generation of forecasts, and (iii) ensembling.

The first element is implemented with state space ES-style formulas. The initial seasonality components (e.g. four for the quarterly series) and smoothing coefficients (two of them in the case of a single seasonality system) are per-series parameters and were fitted, together with global NN weights, by stochastic gradient descent (SGD). Knowing these parameters and the values of the series allows the seasonality components and levels to be calculated, and these are used for deseasonalization and normalization. The deseasonalization of seasonal series was very important in the M4 Competition, given that the series were provided as numeric vectors without any time stamp, so that there was no way of incorporating calendar features like the day of the week or the month number. Also, the series came from many sources, so their seasonality patterns varied.

The second element is a NN that operates on deseasonalized and normalized data, providing the horizon-steps-ahead outputs (e.g. 18 points in the case of monthly series) that were subsequently re-normalized and re-seasonalized to produce forecasts. The NN is global, learning across many time series.

The final element of the method is the ensembling of the forecasts made in the previous step. This includes ensembling the forecasts produced by the individual models from several independent runs, sometimes produced by a subset of concurrently-trained models, and also averaging those generated by the most recent training epochs. This process enhances the robustness of the method further, mitigating the model and parameter uncertainty (Petropoulos, Hyndman, & Bergmeir, 2018) while also exploiting the beneficial effects of combining (Chan & Pauwels, 2018).

Based on the above, it can be said that the method has the following two special characteristics:

• It is hybrid, in the sense that statistical modeling (ES models) is combined concurrently with ML algorithms (LSTM networks).
• It is hierarchical in nature, in the sense that both global (applicable to large subsets of all series) and local (applied to each series individually) parameters are utilized in order to enable cross-learning while also emphasizing the particularities of the time series being extrapolated.

2.2. Method description

2.2.1. Deseasonalization and normalization

The M4 time series, even within the same frequency subset, e.g. monthly, come from many different sources and exhibit a range of seasonality patterns. In addition, the starting dates of the series are not provided. In such circumstances, the NNs are unable to learn how to deal with seasonality effectively. A standard remedy is to apply a deseasonalization at preprocessing time. This solution is adequate but not ideal, as it separates the preprocessing from the forecasting completely, and the quality of the decomposition is likely to be worst near the end of the series, where it counts most for the forecast. One can also observe that the deseasonalization algorithms, however sophisticated and robust, were not designed to be good preprocessors for NNs. Classic statistical models, such as those from the exponential smoothing family, show a better way: the forecasting model has integral parts that deal with the seasonality.

In most NN variants, including LSTMs, the update size of weight w_ij is proportional to the final error, but also to the absolute value of the strength of the associated signal (coming from neuron i in the current layer to neuron j in the next layer). Thus, the NNs behave like analogue devices, even if implemented on a digital computer: small-valued inputs will have small impacts during the learning process. Normalizing each series globally to an interval like [0, 1] is also problematic, as the values to be forecast may lie outside this range, and, more importantly, for series that change a lot over their lifespan, the parts of the series with small values will be ignored. Finally, information on the strength of the trend is lost: two series of the same lengths and similar shapes, but one growing from 100 to 110 and another from 100 to 200, will look very similar after the [0, 1] normalization. Thus, while normalization is necessary, it should be adaptive and local, where the ‘‘normalizer’’ follows the series values.

2.2.2. Exponential smoothing formulas

Keeping in mind that the M4 series all have positive values, the models of Holt (Gardner, 2006) and Holt and Winters (Hyndman, Koehler, Ord, & Snyder, 2008) with multiplicative seasonality were chosen. However, these were simplified by the removal of the linear trend: the NN was tasked to produce a trend that was most likely to be non-linear. Moreover, non-seasonal (yearly and daily data), single-seasonal (monthly, quarterly, and weekly data) or double-seasonal (hourly data) models (Taylor, 2003) were used, depending on the frequency of the data. The updating formulas for each case are as follows:
Non-seasonal models:

l_t = α y_t + (1 − α) l_{t−1}    (1)

Single seasonality models:

l_t = α y_t / s_t + (1 − α) l_{t−1}
s_{t+K} = β y_t / l_t + (1 − β) s_t    (2)

Double seasonality models:

l_t = α y_t / (s_t u_t) + (1 − α) l_{t−1}
s_{t+K} = β y_t / (l_t u_t) + (1 − β) s_t    (3)
u_{t+L} = γ y_t / (l_t s_t) + (1 − γ) u_t,

where y_t is the value of the series at point t; l_t, s_t, and u_t are the level, seasonality, and second seasonality components, respectively; K is the number of observations per seasonal period, i.e., four for quarterly, 12 for monthly and 52 for weekly; and, finally, L is the number of observations per second seasonal period (for hourly data, 168). Note that s_t and u_t are always positive, while the smoothing coefficients α, β and γ take a value between zero and one. These restrictions can be implemented easily by applying exp() to the underlying parameters of the initial seasonality components and sigmoid() to the underlying parameters of the smoothing coefficients.
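To make the per-series recursion concrete, the following is a minimal NumPy sketch of the single-seasonality case (Eq. (2)), including the exp()/sigmoid() parameterization just described. All function and variable names are illustrative, and the initialization of the first level is a simplification; in the submission itself these quantities are built as nodes of the computational graph, so that the per-series parameters are fitted by SGD together with the NN weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def es_decompose(y, alpha_raw, beta_raw, init_seas_raw):
    """Single-seasonality recursion of Eq. (2) over a positive series y.

    alpha_raw, beta_raw: unconstrained per-series parameters; sigmoid() maps
        them into (0, 1), as required for the smoothing coefficients.
    init_seas_raw: K unconstrained initial seasonality parameters; exp()
        keeps the seasonality components positive (K = len(init_seas_raw)).
    Returns the level l_t and seasonality s_t for every point of the series.
    """
    alpha, beta = sigmoid(alpha_raw), sigmoid(beta_raw)
    seas = list(np.exp(init_seas_raw))                           # s_1 .. s_K
    levels = [y[0] / seas[0]]                                    # simplified first level
    seas.append(beta * y[0] / levels[0] + (1 - beta) * seas[0])  # s_{K+1}
    for t in range(1, len(y)):
        new_level = alpha * y[t] / seas[t] + (1 - alpha) * levels[-1]
        levels.append(new_level)
        seas.append(beta * y[t] / new_level + (1 - beta) * seas[t])  # s_{t+K}
    return np.array(levels), np.array(seas)
```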
2.2.3. On-the-fly preprocessing

The above formulas allow the level and seasonality components to be calculated for all points of each series. These components are then used for deseasonalization and adaptive normalization during the on-the-fly preprocessing. This step is a crucial part of the method and is described in this section.

Each series was preprocessed anew for each training epoch, because the parameters (initial seasonality components and smoothing coefficients) and the resulting levels and seasonality components were different during each epoch.

The standard approach of constant-size, rolling input and output windows was applied, as is shown in Fig. 1 for the case of a monthly series. The size of the output window was always equal to the forecasting horizon (e.g., 13 for the weekly series), while the size of the input window was determined by a rule that, for seasonal series, it should cover at least a full seasonal period (e.g., being equal to or larger than four in the case of quarterly series), while for non-seasonal series the size of the input window should be close to the forecasting horizon. However, the exact size was defined after conducting experimentation (backtesting). Please note that, unlike in many other recurrent neural network (RNN) based sequence processing systems, the input size is larger than one. This works better because it allows the NN to be exposed to the immediate history of the series directly.

Preprocessing was rather simple: at each step, the values in the input and output windows were normalized by dividing them by the last value of the level in the input window (the thick blue dot in Fig. 1), and then, in the case of seasonal time series, divided further by the relevant seasonality components. That resulted in the input and output values being close to one, irrespective of the original amplitude of the series and its history. Finally, a squashing function, log(), was applied. The squashing function prevented outliers from having an unduly large and disturbing effect on the learning.

In addition, the domain of the time series (e.g. finance or macro) was one-hot encoded as a six-long vector and appended to the time series derived features. The domain information was the only meta information available and I considered it prudent to expose the NN to it.

It is generally worthwhile to increase the size of the input window and extract more sophisticated features, like the strength of the seasonality or the variability, when preprocessing for NNs, but such approaches were not adopted here for several reasons. The most important one was that many series were too short to afford a large input window, meaning that they could not be used for backtesting. Another reason was that creating features that summarize the characteristics of the series effectively, irrespective of their length, is not straightforward. It was only after the end of the competition that a promising R package called tsfeatures came to my attention (Hyndman, Wang, & Laptev, 2015; Kang, Hyndman, & Smith-Miles, 2017).

2.2.4. Forecast by NNs

As explained above, the NNs operated on deseasonalized, adaptively normalized, and squashed values. Their output needed to be ‘‘unwound’’ in the following way:

For non-seasonal models:

ŷ_{t+1..t+h} = exp(NN(x)) ∗ l_t    (4)

For single seasonality models:

ŷ_{t+1..t+h} = exp(NN(x)) ∗ l_t ∗ s_{t+1..t+h}    (5)

For dual seasonality models:

ŷ_{t+1..t+h} = exp(NN(x)) ∗ l_t ∗ s_{t+1..t+h} ∗ u_{t+1..t+h},    (6)

where x is the pre-processed input (a vector), NN(x) is an NN output (a vector), l_t is the value of the level at time t (the last known data point) and h is the forecasting horizon. All operations are elementwise. The above is summarized in Fig. 2.

Note that the final forecast is actually an ensemble of many such forecasts, a procedure which is explained later in the paper.
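The sketch below, which reuses the illustrative names from the previous fragment, shows how one rolling training sample can be built (normalization by the last level of the input window, division by the seasonality components, and log() squashing) and how a NN output is unwound via Eq. (5). It omits the one-hot domain vector and is only meant to make the data flow explicit; it is not the author's C++/DyNet code.

```python
import numpy as np

def make_training_sample(y, levels, seas, t, input_size, horizon):
    """One (input, target) pair of the rolling-window scheme of Fig. 1.

    t indexes the last point of the input window; levels and seas come from
    the exponential smoothing recursion. Names are illustrative only.
    """
    norm = levels[t]                                   # last level in the input window
    x = y[t - input_size + 1:t + 1] / norm / seas[t - input_size + 1:t + 1]
    target = y[t + 1:t + 1 + horizon] / norm / seas[t + 1:t + 1 + horizon]
    return np.log(x), np.log(target)                   # log() squashing

def unwind_forecast(nn_output, last_level, seas_future):
    """Invert the preprocessing for a single-seasonality series, as in Eq. (5)."""
    return np.exp(nn_output) * last_level * seas_future  # elementwise
```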
Fig. 1. An example showing how rolling windows are used for preprocessing a random monthly series. The last step is used for validation.
Fig. 2. Data flow and system architecture for the single seasonality case. X_i is the normalized, deseasonalized, and squashed input to the NN. F_i is the NN output. Ŷ_i is the forecast, covering outputs i+1 ... i+H, where H is the forecasting horizon. l_i is a scalar, the last level in the input window. X_i, Ŷ_i and F_i are vectors. Ensembling is not shown.
2.2.5. Architectures of neural networks

In order to better understand the implementation, it is useful to classify the parameters of forecasting systems into the following three groups:

• Local constants: These parameters reflect the behavior of a single series; they do not change as we step through that series. For example, the smoothing coefficients of the ES model, as well as the initial seasonal components, are the local non-changing (constant) parameters.
• Local states: These parameters change as we step through a series, evolving over time. For instance, the level and seasonal components, as well as a recurrent NN state, are local states.
• Global constants: These parameters reflect the patterns learned across large sets of series and are constant; they do not change as we step through a series. For example, the weights used for the NN systems are global constants.

Typical statistical time series methods are trained on individual series, meaning that they involve only local constant and local state parameters. On the other hand, standard ML methods are usually trained on large datasets, involving only global parameters. The hybrid method described here uses all three types of parameters, being partly global and partly time series specific. This type of modeling becomes possible through the use of dynamic computation graph (DCG) systems, such as DyNet (Neubig, Dyer, Goldberg, Matthews, Ammar, Anastasopoulos, et al., 2017), PyTorch (Paszke, Gross, Chintala, Chanan, Yang, DeVito, et al., 2017) and TensorFlow in ‘‘eager mode’’ (Abadi, Agarwal, Barham, Brevdo, Chen, Citro, et al., 2015). The difference between static and dynamic computational graph systems is that the latter have the ability to recreate the computational graph (built behind the scenes by the NN system) for each sample, here, for each time series. Thus, each series may have a partially unique and partially shared model.

The architecture deployed was different for each frequency and output type (point forecast or prediction intervals).
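As an illustration of how local constants and global constants can be fitted together, the hedged PyTorch fragment below puts hypothetical per-series parameters and a shared network into a single optimizer, so that gradient descent updates both at once; the local states (levels, seasonality components, hidden states) are recomputed while stepping through each series rather than being stored as parameters. The actual submission does this in C++ on top of DyNet, and all names and sizes here are illustrative.

```python
import torch

# Global constants: one network shared across all series (a stand-in for the
# dilated-LSTM stack described in the following paragraphs).
shared_net = torch.nn.LSTM(input_size=1, hidden_size=30)

# Local constants: unconstrained per-series parameters, here for 100
# hypothetical quarterly series (K = 4 initial seasonality components each).
n_series, K = 100, 4
alpha_raw = torch.zeros(n_series, requires_grad=True)         # sigmoid() -> alpha
beta_raw = torch.zeros(n_series, requires_grad=True)          # sigmoid() -> beta
init_seas_raw = torch.zeros(n_series, K, requires_grad=True)  # exp() -> initial seasonality

# A single optimizer updates the global weights and the per-series parameters jointly.
optimizer = torch.optim.Adam(
    [*shared_net.parameters(), alpha_raw, beta_raw, init_seas_raw], lr=1e-3)
```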
At a high level, the NNs of the model are dilated LSTM-based stacks (Chang, Zhang, Han, Yu, Guo, Tan, et al., 2017), sometimes followed by a non-linear layer and always followed by a linear ‘‘adapter’’ layer, the objective of which is to adapt the size of the state of the last layer to the size of the output layer (the forecasting horizon, or twice the forecasting horizon in the case of prediction interval (PI) models). The LSTM stacks are composed of a number of blocks (here 1–2). In the case of two (and theoretically more) blocks, the output of a block is added to the next block’s output using Resnet-style shortcuts (He, Zhang, Ren, & Sun, 2015). Each block is a sequence of one to four layers, belonging to one of the three types of dilated LSTMs: standard (Chang et al., 2017), with an attention mechanism (Qin, Song, Chen, Cheng, Jiang, & Cottrell, 2017), and a special residual version (Kim, El-Khamy, & Lee, 2017).

Dilated LSTMs use as part of their input the hidden state from previous, but not necessarily the latest, steps. In standard LSTMs and related cells, part of the input at time t is the hidden state from step t − 1. In a cell that is k-dilated, e.g. three, the hidden state is taken from step t − k, so here t − 3. This improves long-term memory performance. As is customary for dilated LSTMs (Chang et al., 2017), they were deployed in stacks of cells with increasing dilations. Similar blocks of standard, non-dilated LSTMs performed slightly worse. Even bigger drops in performance would have happened if the recurrent NNs had been replaced with non-recurrent ones, indicating that the RNN state is useful for dealing with time series and sequences more generally.
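The following is a short, self-contained PyTorch sketch of the dilation mechanism only: the recurrent input at step t is the state produced at step t − k. It deliberately omits the attention and residual variants, and it is not the author's DyNet implementation; names are illustrative.

```python
import collections
import torch

class DilatedLSTMLayer(torch.nn.Module):
    """A k-dilated LSTM layer: the recurrent state fed to the cell at step t
    is the one produced at step t - k (zeros while t < k)."""

    def __init__(self, input_size, hidden_size, dilation):
        super().__init__()
        self.cell = torch.nn.LSTMCell(input_size, hidden_size)
        self.dilation = dilation
        self.hidden_size = hidden_size

    def forward(self, inputs):  # inputs: list of (batch, input_size) tensors
        batch = inputs[0].shape[0]
        zero = torch.zeros(batch, self.hidden_size)
        # history[0] always holds the state produced `dilation` steps earlier.
        history = collections.deque([(zero, zero)] * self.dilation, maxlen=self.dilation)
        outputs = []
        for x in inputs:
            h, c = self.cell(x, history[0])
            history.append((h, c))
            outputs.append(h)
        return outputs

# A stack such as (1,3)-(6,12) chains four such layers with dilations 1, 3, 6 and 12,
# adding block outputs with Resnet-style shortcuts as described above.
```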
The general idea of the recurrent NN attention mechanism is that, instead of using the previous hidden state as in standard LSTMs, or the delayed state as in the case of dilated LSTMs, one calculates weights that are applied to a number of past hidden states in order to create an artificial weighted average state. This allows the system to ‘‘concentrate on’’ or ‘‘attend to’’ a particular single state or group of past states dynamically. My implementation is an extension of the dilated LSTM, so the maximum look-behind horizon is equal to the dilation. In the case of weekly series, the network consisted of a single block with two layers, encoded as attentive (1,52). The first layer dilation is equal to one, so it is a standard LSTM, but the second layer calculates weights over the past 52 hidden states (as they become available, so at point 53 or later when stepping through a series). The weights are calculated using a separate standard two-layer NN that is embedded into the LSTM; its inputs are concatenations of the LSTM input and the last hidden state, and its weights are adjusted by the same gradient descent mechanism that operates on all other parameters.

Fig. 3 shows three examples of configurations; the first one generates point forecasts (PFs) for the quarterly series, the second one PFs for the monthly series, and the third one prediction intervals (PIs) for the yearly series.

(a) The NN consists of two blocks, each one involving two dilated LSTMs, that are connected by a shortcut around the second block. The final element is the ‘‘adapter layer’’; it is just a standard linear layer (the transfer function equals identity) that adapts the hidden output from the fourth layer (the one with dilation = 8), usually 30–40 long, into the expected output size (here eight).
(b) The NN consists of a single block composed of four dilated LSTMs, with residual connections as per Kim et al. (2017). Please note that the shortcut arrows point correctly into the inside of the residual LSTM cell; this is a non-standard residual shortcut.
(c) The NN consists of a single block consisting of two dilated LSTMs with the attention mechanism, followed by a dense non-linear layer (with tanh() activation), then by a linear adapter layer of double the size of the output, so that forecasts of both lower and upper bounds are generated simultaneously. The attention mechanism (Qin et al., 2017) slows the calculations considerably, but occasionally appeared best.

Later, I provide a table that lists architectures and hyperparameters for all cases, not just these three. Please keep in mind that, while the graph shows only the global parts of the models, the per-series parts are equally important.

3. Implementation details

This section provides more implementation details regarding the hybrid method. This includes information about the loss function, the hyperparameters of the models, and the ensembling procedures.

3.1. Loss function

3.1.1. Point forecasts

The error measure used in the M4 Competition for the case of the PFs was a combination of the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE) (Makridakis, Spiliotis, & Assimakopoulos, 2018a). The two metrics are quite similar in nature, in the sense that both are normalized absolute differences between the predicted and actual values of the series. Recalling that the inputs to the NN in this system are already deseasonalized and normalized, I postulated that the training loss function does not need to include normalization: it could be just a simple absolute difference between the target values and the predicted ones. However, it became apparent during backtesting that the models tend to have a positive bias, probably as a result of applying a squashing function, log(), to the time series derived inputs and outputs of the NN. The system learned in the log space, but the final forecast errors are calculated back in the linear space. To counter this, a pinball loss with a τ value a bit smaller than 0.5 (typically 0.45–0.49) was used. The pinball loss is defined as follows:

L_t = (y_t − ŷ_t) τ,          if y_t ≥ ŷ_t,
L_t = (ŷ_t − y_t) (1 − τ),    if ŷ_t > y_t.    (7)

Thus, the pinball function is asymmetric, penalizing actual values that are above and below a quantile differently, so as to allow the method to deal with the bias. It is an important loss function on its own; minimizing it produces quantile regression (Takeuchi, Le, Sears, & Smola, 2006).
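A minimal sketch of Eq. (7) as a training loss, applied in the normalized, squashed space in which the NN learns; the function name and the averaging over the horizon are my own choices, not the submission's code.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau=0.48):
    """Pinball loss of Eq. (7), averaged over the forecast horizon.

    A tau slightly below 0.5 (typically 0.45-0.49) counteracts the positive
    bias introduced by training in the log-squashed space.
    """
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, diff * tau, -diff * (1 - tau)))
```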
Fig. 3. NN architectures used for generating some of the PFs and PIs.
3.1.2. Prediction intervals

The pinball loss function could have been adopted for generating the PIs as well. The requested coverage was 95%, so one could have tried to forecast the 2.5% and 97.5% quantiles. However, the competition metric for PIs was not based on separate upper and lower pinball losses; instead, it was a single formula called the mean scaled interval score (MSIS) (Makridakis et al., 2018a). Once again, the denominator of the MSIS was omitted, since the input to the NNs was already deseasonalized and normalized. It should be noted that, although the method provided the most precise PIs among those of all methods submitted, the positive bias mentioned above can still be observed for the case of the PIs, with the upper interval being exceeded less frequently than the lower one.

At this point I would like to draw the reader’s attention to a great practical feature of NN-based systems: the ease of creating a loss function that is aligned with business/scientific objectives. For this application, the loss functions were aligned with the accuracy metrics used in the M4 Competition.

3.1.3. Level wiggliness penalty

Intuitively, the level should be a smooth version of the time series, with no seasonality patterns. One would expect this to be of secondary importance, and more of an aesthetic-level requirement. However, it turns out that the smoothness of the level influenced the forecasting accuracy substantially. It appears that when the input to the NN was smooth, the NN concentrated on predicting the trend, instead of over-fitting on some spurious, seasonality-related patterns. A smooth level also means that the seasonality components absorbed the seasonality properly. In functional data analysis, an average of squares of second derivatives is a popular penalty against the wiggliness of a curve (Ramsay & Silverman, 2002). However, such a penalty may be too strict and not robust enough when applied to time series with occasional large shifts. In this regard, a modified version of this penalty was applied, which was calculated as follows:

• Calculate log differences, i.e., d_t = log(y_{t+1} / y_t), where y_t is point t of the series;
• Calculate differences of the above: e_t = d_{t+1} − d_t;
• Square and average them for each series.

This penalty, multiplied by a constant parameter in the range of 50–100, called the level variability penalty (LVP), was added to both the PF and PI loss functions. The level wiggliness penalty affected the performance of the method significantly, and it is conceivable that this submission would not have won the M4 Competition without it.
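A sketch of this penalty following the three steps above, computed here on the sequence of levels whose smoothness the penalty is meant to enforce; the function name and signature are illustrative, and in the submission the result is added to the training loss inside the computational graph.

```python
import numpy as np

def level_variability_penalty(levels, weight):
    """Level wiggliness penalty of Section 3.1.3.

    levels: the per-step level values produced by the ES recursion.
    weight: the LVP multiplier listed in Table 1.
    """
    d = np.diff(np.log(levels))   # d_t = log(l_{t+1} / l_t)
    e = np.diff(d)                # e_t = d_{t+1} - d_t
    return weight * np.mean(e ** 2)
```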
3.2. Ensembling and data subsetting

Two models (one for PFs and one for PIs) were built for each of the six single-frequency subsets (daily, weekly, etc.). Each of the models was actually an ensemble at several levels, which are presented below.

3.2.1. Independent runs

A single run involves a full training of the models as well as the generation of forecasts for all series in the subset. However, each run for a given series produces a slightly different forecast, since the parameter initializations are random. Ensembling models constructed from different runs can mitigate the effect of randomness and decrease the uncertainty. Backtesting indicated that increasing the number of runs above the 6–9 range did not improve the forecasting accuracy, and as a result, the number of independent runs was limited accordingly.
Fig. 4. An example allocation performed by the ensemble of specialists algorithm for a set of ten series to seven models, using the top two models
per series.
3.2.2. Ensemble of specialists or simple ensemble

When it was computationally feasible, as turned out to be the case for all except the monthly and quarterly series, several concurrently-trained models, learning from different subsets of series, were used rather than training a single model. This approach, called ‘‘ensemble of specialists’’, was proposed originally by Smyl (2017), and is summarized below.

The main idea is that, when a dataset contains a large number of series from unknown sources, it is reasonable to assume that these could possibly be grouped into subsets, such that the overall forecasting accuracy would improve if one used a separate model for each group instead of a single one for the whole dataset. However, there is no straightforward way of performing the grouping task, as series from disparate sources may look and behave similarly. Moreover, clustering the series using generic metrics may not be useful for improving the forecasting accuracy.

In this regard, the ensemble of specialists algorithm trains a number of models (NNs and per-series parameters) concurrently and forces them to specialize in a subset of series. The algorithm is summarized as follows:

1. Create a pool of models (e.g. seven models) and randomly allocate a part (e.g. half) of the time series to each model.
2. For each model:
   (a) Execute a single training on the allocated subset.
   (b) Record the performance for the whole training set (in-sample, average over all points of the training part of a series).
3. Rank the models for each series and then allocate each series to the top N (e.g. two) best models.
4. Repeat steps 2 and 3 until the average error in the validation area starts growing.

Thus, the final forecast for a particular series is the average of the forecasts produced by the top N models. The main assumption here is continuity: if a particular model is good at forecasting the in-sample part of the series, it will hopefully display accurate results in the out-of-sample part of the series as well. The architecture and the inputs used for the individual models remain the same. What differs, and is manipulated actively, between epochs is the composition of the training data set for each model. Fig. 4 shows an example allocation of ten series among seven models, using the top two models per series.
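The re-allocation step (steps 2(b) and 3) can be sketched as follows; the data layout and names are illustrative, and the initial random allocation and the stopping criterion of step 4 are omitted.

```python
import numpy as np

def reallocate(in_sample_errors, top_n=2):
    """One re-allocation step of the ensemble of specialists.

    in_sample_errors: array of shape (n_models, n_series) with each model's
        average in-sample error on every series, recorded after training.
    Returns, for each model, the list of series it will train on in the next
    epoch, and the (top_n, n_series) index array of the best models per series.
    """
    n_models, n_series = in_sample_errors.shape
    best = np.argsort(in_sample_errors, axis=0)[:top_n, :]  # lowest-error models per series
    allocation = [[] for _ in range(n_models)]
    for series_id in range(n_series):
        for model_id in best[:, series_id]:
            allocation[model_id].append(series_id)
    return allocation, best

# The final forecast for a series is the average of the forecasts produced by
# its top_n models, e.g. forecasts[best[:, series_id]].mean(axis=0).
```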
A simpler approach, called here simple ensemble, was used for monthly and quarterly data instead of the ensemble of specialists. In this case, the data were split into two non-overlapping sets at the beginning of each run and then models were trained and forecasts made for each of the two halves. This was a kind of bagging, and worked well too.

It is worth mentioning that the ensemble of specialists improved the forecasting accuracy by around 3% on the M3 monthly data set in the work of Smyl (2017). However, the difference reported was not stable, depending on the data and the quality of the models used. Therefore, more work is needed to delineate the areas of superiority of each method clearly.

3.2.3. Stage of training

The forecasts generated by a few (e.g. 4–5) of the most recent training epochs were ensembled to provide a single forecast. The whole training typically used 10–30 epochs; thus, in the case of 20 epochs, for example, the final forecast was actually an average of the forecasts produced at epochs 16, 17, 18, 19 and 20.

3.3. Backtesting

Backtesting was implemented by removing typically one, but sometimes two, of the last horizon-number of points from each series (e.g. 18 or 36 for monthly data), and training the system on a set with such shortened series. However, while the training steps were never exposed to the removed values, the system was tested on the validation (removed) area after every epoch, and the results guided the architectural and hyperparameter choices. In practice, there was a very strong correlation
between the validation results when testing on the last horizon-number of points and the penultimate horizon-number of points, with the former approach typically being used because it also admitted a larger number of series (many were too short to be used for backtesting if more than one horizon-number of points was removed).

While many series were short, there were also many that were very long, for example representing over 300 years of monthly data. The usefulness of the early parts of such series for the accuracy of the forecast was not obvious, while they involved an obvious computational demand. Thus, the long series were shortened, keeping only the most recent ‘‘maximum length’’ points. The maximum length hyperparameter was tested and increased from a relatively small value until no further meaningful improvement in accuracy was observed in backtesting. It is listed in Table 1 along with the rest of the hyperparameters.

Table 1
Details of the architecture and parameters used.

Monthly
  Ensembling: simple ensemble
  Architecture: residual (1–3–6–12)
  LVP = 50
  Epochs: PF = 10; PIs = 14
  LR: PF = 5e−4; PIs = 1e−3, {8, 3e−4}, {13, 1e−4}
  Max length: PF = 272; PIs = 512
  Training percentile = 49
  State size of LSTMs = 50

Quarterly
  Ensembling: simple ensemble
  Architecture: (1,2)–(4,8)
  LVP = 80
  Epochs: PF = 15; PIs = 16
  LR: PF = 1e−3, {10, 1e−4}; PIs = 1e−3, {7, 3e−4}, {11, 1e−4}
  Max length = 174
  Training percentile = 45
  State size of LSTMs = 40

Yearly
  Ensembling: ensemble of specialists 4/5
  Architecture: PF = attentive (1,6); PIs = attentive (1,6), NL
  LVP = 0
  Epochs: PF = 12; PIs = 29
  LR: PF = 1e−4, {15, 1e−5}; PIs = 1e−4, {17, 3e−5}, {22, 1e−5}
  Max length = 72
  Training percentile = 50
  State size of LSTMs = 30

Daily
  Ensembling: ensemble of specialists 4/5
  Architecture: (1,3)–(7,14)
  Seasonality = 7
  LVP = 100
  Epochs: PF = 13; PIs = 21
  LR: PF = 3e−4, {9, 1e−4}; PIs = 3e−4, {13, 1e−4}
  Max length = 112
  Training percentile = 49
  State size of LSTMs = 40

Weekly
  Ensembling: PF = ensemble of specialists 3/5; PIs = ensemble of specialists 4/5
  Architecture: attentive (1,52)
  Seasonality = 52
  LVP = 100
  Epochs: PF = 23; PIs = 31
  LR: PF = 1e−3, {11, 3e−4}, {17, 1e−4}; PIs = 1e−3, {15, 3e−4}
  Max length = 335
  Training percentile = 47
  State size of LSTMs = 40

Hourly
  Ensembling: ensemble of specialists 4/5
  Architecture: (1,4)–(24,168)
  Seasonality = 24, 168
  LVP = 10
  Epochs: PF = 27; PIs = 37
  LR: PF = 1e−2, {7, 5e−3}, {18, 1e−3}, {22, 3e−4}; PIs = 1e−2, {20, 1e−3}
  Max length = NA
  Training percentile = 49
  State size of LSTMs = 40

3.4. Hyperparameters

All hyperparameters were chosen using some combination of reasoning, intuition, and backtesting. The main tool used for preventing over-fitting was early stopping: during training, the average accuracy on the validation area (typically the last output horizon number of points,
see Fig. 1) was calculated after every training epoch. The epoch with the lowest validation error was noted and used as the maximum number of epochs when doing the final (using all of the data) learning and forecasting. The learning rate schedule was also decided by observing the validation errors after every epoch.

Table 1 lists all of the NN architectures and hyperparameters used. If a PI model used the same values as the PF model, the values are not repeated. A few comments about each:

• Ensembling
  Either the simple ensemble or the ensemble of specialists. In the latter case it is detailed as topN/numberOfAllModels, e.g. 4/5, see Section 3.2.2. In the case of yearly data, both ensembling methods were tried, but in the cases of daily, weekly, and hourly data, the ensemble of specialists was chosen without experimentation, under the belief that it should provide better results.
• NN architecture
  It is encoded as a sequence of blocks, in brackets, see Section 2.2.5. The residual shortcuts around the blocks, or, in the special case of LSTMs as per Kim et al. (2017), around the layers, are marked with dashes. Let me quickly describe the architecture for each type of series:
  – Monthly series used a single block of the special residual layers.
  – Quarterly, daily, and hourly series used what perhaps should be a standard architecture (as it appears to work well in other contexts, outside of the M4 Competition, too): two blocks of two dilated LSTM layers.
  – Point forecast models of yearly series used a single block of dilated LSTMs with attention, encoded as attentive (1,6), while models for prediction intervals added a standard dense layer with tanh() activation, which I call the nonlinear layer (NL), and therefore this is encoded as attentive (1,6), NL.
  – The architecture for weekly series, as described above, used attentive LSTMs.
  – As in other cases in the table, the architecture chosen was the result of some reasoning/beliefs and experimentation. I believed that, in the case of seasonal series, at least one of the dilations should be equal to the seasonality, while another should be in the range of the prediction horizon. It is likely that the architecture was over-fitted to the backtesting results; for example, the more standard architectures (1,3)–(6,12) or (1,3,12) would almost certainly work well for monthly series too (without the special residual architecture).
• LVP
  LVP stands for level variability penalty, and is the multiplier that is applied to the level wiggliness penalty. It applies only to the seasonal models. The value is not very sensitive, as changing it even by 50% would not make a big difference. However, it is still important.
• Number of epochs
  The number of training epochs for the final training and forecasting runs was chosen experimentally as the one that minimized the error on the validation area. There was a clear interplay between the learning rate and the number of epochs: higher learning rates needed smaller numbers of epochs. Another factor that influenced them both was the computational requirement of a subset: a larger number of series in a subset forced a preference for a smaller number of epochs (and thus higher learning rates).
• Learning rates
  The first number is the initial learning rate, which was often reduced during training; for example, in the case of the model for yearly PIs, it started at 1e−4, but was reduced to 3e−5 at epoch 17 and again to 1e−5 at epoch 22. The schedule was the result of observing the behavior of the validation errors after each training epoch. When they plateaued for around two epochs, the learning rate was reduced by a factor of 3–10 (the notation is illustrated in the sketch after this list).
• Max length
  This parameter lists the maximum length of the series used, see Section 3.3. In the case of hourly series, there was no chopping; all series were used in their original length.
• Training percentile
  See Section 3.1.1.
• State size of LSTMs
  LSTM cells maintain a vector of numbers, called the state, which is their memory. The size of the state was not a sensitive parameter, with values above 30 working well. Larger values slow down the calculations, but reduce the number of epochs needed slightly. There was no benefit in accuracy from using larger states.
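For readers parsing the LR entries of Table 1, the tiny helper below reflects my reading of the notation (initial rate, then {epoch, new rate} pairs); it is an illustration only, not code from the submission.

```python
def learning_rate(epoch, initial_lr, changes):
    """Interpret a Table 1 learning-rate entry such as "1e-3, {10,1e-4}":
    start at initial_lr and switch to each new rate at the listed epoch."""
    lr = initial_lr
    for switch_epoch in sorted(changes):
        if epoch >= switch_epoch:
            lr = changes[switch_epoch]
    return lr

# Example: the quarterly PI entry "1e-3, {7,3e-4}, {11,1e-4}"
print(learning_rate(epoch=12, initial_lr=1e-3, changes={7: 3e-4, 11: 1e-4}))  # 0.0001
```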
3.5. Implementation

The method was implemented through four programs: two using the ensemble of specialists and two using simple ensembling, as was described earlier. Each pair consisted of one program for generating the PFs and another for estimating the PIs. If the competition were to happen today, probably only two programs would be needed, one using the ensemble of specialists and another using the simple ensemble, as PIs and PFs can be generated from a single program by modifying the loss function and the architecture. The method was written in C++ relying on the DyNet library (Neubig et al., 2017). It can be compiled and run on Windows, Linux or Mac, and can optionally write to a relational database, such as SQL Server or MySQL, to facilitate the analysis of the backtesting results, which is very useful in practice. The programs use CPU, not GPU, and are meant to be run in parallel. The code is available publicly at the M4 GitHub repository (https://github.com/M4Competition/M4-methods) to facilitate replicability and support future research (Makridakis, Assimakopoulos, & Spiliotis, 2018). The code is well commented and is the ultimate description of the method.
3.6. What did not work well and recent changes

The method generated accurate forecasts for most of the frequencies, but especially the monthly, yearly and quarterly ones. However, the accuracy was sub-optimal for the cases of the daily and weekly data. This can be explained in part by the author’s concentration on the ‘‘three big’’ subsets: monthly, yearly and quarterly, as they covered 95% of the data, and performing well on them was key to success in the competition. However, subsequent work on daily and weekly data confirmed that under-performance on these frequencies is a real problem.

Since the competition ended, several improvements have been attempted. One such attempt that achieved noticeable improvements in accuracy on the daily and weekly data, bringing the performance to the level of the best benchmarks, is as follows. When analyzing the values of the smoothing coefficients, as they changed with passing training epochs, it became clear that they did not seem to plateau in late epochs, as the gradient descent did not seem to push them strongly enough. Thus, a separate, larger learning rate, three times the main learning rate, was assigned to them, and this had the required effect: the smoothing coefficients changed quickly and eventually plateaued in late epochs.
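In a framework with automatic differentiation, assigning a separate learning rate to one group of parameters is a small change; the hedged PyTorch fragment below shows the idea with stand-in names and sizes, since the submission itself implements it in C++ on top of DyNet.

```python
import torch

# Stand-ins for the global network and the per-series smoothing-coefficient
# parameters (names and sizes are illustrative).
shared_net = torch.nn.LSTM(input_size=1, hidden_size=30)
alpha_raw = torch.zeros(100, requires_grad=True)
beta_raw = torch.zeros(100, requires_grad=True)

# The post-competition change: the smoothing-coefficient parameters receive a
# learning rate three times larger than the main one.
base_lr = 1e-3
optimizer = torch.optim.Adam(
    [
        {"params": shared_net.parameters()},                    # global NN weights
        {"params": [alpha_raw, beta_raw], "lr": 3 * base_lr},   # smoothing coefficients
    ],
    lr=base_lr,
)
```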
4. Hybrid, hierarchical, and understandable ML models

This section begins by summarizing the main features of the model, then outlines generalizations and broader implications of its techniques and approaches. Also in this context, I retrace the steps that led to the formulation of the models described in this paper.

The winning solution was a hybrid forecasting method which mixed exponential smoothing-inspired formulas, used for deseasonalizing and normalizing the series, with advanced neural networks, exploited for extrapolating the series. Equally important was the hierarchical structure of the method, which combined a global part learned across many time series (the weights of the NN) with a time series specific part (the smoothing coefficients and initial seasonality components). The third main component of the method was a broad usage of ensembling, at multiple levels. The first two features were made possible by the great functionalities offered by the modern NN systems of automatic differentiation and dynamic computational graphs (Paszke et al., 2017).

Automatic differentiation allows the building of models that utilize expressions made up of two sets: a quite broad list of basic functions, like sin(), exp(), etc., and a list of operators like matrix and element-wise multiplications, additions, reciprocals, etc. Neural networks that use matrix operations and some nonlinear functions are just examples of the allowed expressions. The gradient descent machinery fits the parameters of all of these expressions. In the model described here, there was both a NN and a non-NN part (the exponential smoothing-inspired formulas). It is quite feasible to build models that encode complicated technical or business knowledge.

Dynamic computational graphs allow the building of hierarchical models, with global and local (here, per time series) expressions and parameters. There could also be per-group parts. The models can be quite general; e.g., in a classical, statistical vein:

Student performance = School impact + Teacher impact + Individual impact.

Note that each component can be a separate NN, an inscrutable black box. However, we can observe and quantify the impact of each of the black boxes, both generally and in each case, and therefore we are getting a partially understandable ML model.

Automatic differentiation is also a fundamental feature of Stan, a probabilistic programming language (Carpenter, Hoffman, Brubaker, Lee, Li, & Betancourt, 2015). It fits models primarily using Hamiltonian Markov chain Monte Carlo, so the optimization is different, but the underlying auto-differentiation feels very similar. This similarity in modeling capabilities between Stan and DyNet led to the formulation of the proposed model, as is described in more detail below.

By the middle of 2016, my collaborators and I had successfully created extensions and generalizations of the Holt and Holt-Winters models in Stan (I called this family of models LGT: local and global trend models; see Smyl & Zhang, 2015, and Smyl, Bergmeir, Wibowo, & Ng, 2019), and experimented with using them along with NN models (Smyl & Kuber, 2016). Later, I also experimented a lot with building NN models for the M3 Competition data set (Smyl, 2017). I was able to beat classical statistical algorithms on the yearly (and therefore non-seasonal) subset, but could not do it on the monthly subset. For seasonal series, I used STL decomposition as part of the preprocessing, so clearly it did not work well. Also, my LGT models were more accurate than my NN models in every category of the M3 data. Thus, when I realized that DyNet, like Stan, allows a broad range of models to be coded freely, I decided to apply LGT ideas, such as dealing with seasonality, to a NN model. That is how the M4 winning solution was born.

Acknowledgments

I would like to thank Professor Spyros Makridakis and his colleagues for organizing the M4 Competition. I believe that forecasting competitions have been the main driver for the advancement of deep learning during its early years. Competitions enable the comparison of various forecasting methods, the exchange of ideas and the sharing of code. In addition, the competition dataset often becomes a valuable resource for years to come.

Finally, I would like to thank Evangelos Spiliotis for his help in editing this paper.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/, software available from tensorflow.org.

Carpenter, B., Hoffman, M. D., Brubaker, M., Lee, D., Li, P., & Betancourt, M. (2015). The Stan math library: Reverse-mode automatic differentiation in C++. CoRR, abs/1509.07164.
Chan, F., & Pauwels, L. L. (2018). Some theoretical results on forecast combinations. International Journal of Forecasting, 34(1), 64–74.

Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., et al. (2017). Dilated recurrent neural networks. arXiv e-prints, arXiv:1710.02224.

Dimoulkas, I., Mazidi, P., & Herre, L. (2019). Neural networks for GEFCom2017 probabilistic load forecasting. International Journal of Forecasting, (in press).

Gardner, E. S. (2006). Exponential smoothing: The state of the art — Part II. International Journal of Forecasting, 22(4), 637–666.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv e-prints, arXiv:1512.03385.

Hyndman, R. J., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with exponential smoothing: The state space approach. Berlin: Springer Verlag.

Hyndman, R., Wang, E., & Laptev, N. (2015). Large-scale unusual time series detection. In IEEE international conference on data mining.

Kang, Y., Hyndman, R. J., & Smith-Miles, K. (2017). Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting, 33(2), 345–358.

Kim, J., El-Khamy, M., & Lee, J. (2017). Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. arXiv e-prints, arXiv:1701.03360.

Makridakis, S. (2017). The forthcoming artificial intelligence (AI) revolution: Its impact on society and firms. Futures, 90, 46–60.

Makridakis, S., Assimakopoulos, V., & Spiliotis, E. (2018). Objectivity, reproducibility and replicability in forecasting research. International Journal of Forecasting, 34(4), 835–838.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018a). The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4), 802–808.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018b). Statistical and machine learning forecasting methods: Concerns and ways forward. PLoS One, 13(3), 1–26.

Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., et al. (2017). DyNet: The dynamic neural network toolkit. arXiv preprint, arXiv:1701.03980.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in PyTorch. In NIPS 2017 autodiff workshop.

Petropoulos, F., Hyndman, R. J., & Bergmeir, C. (2018). Exploring the sources of uncertainty: Why does bagging for time series forecasting work? European Journal of Operational Research, 268(2), 545–554.

Qin, Y., Song, D., Chen, H., Cheng, W., Jiang, G., & Cottrell, G. (2017). A dual-stage attention-based recurrent neural network for time series prediction. arXiv e-prints, arXiv:1704.02971.

Ramsay, J., & Silverman, B. (2002). Functional data analysis. New York: Springer-Verlag.

Smyl, S. (2017). Ensemble of specialized neural networks for time series forecasting. In 37th international symposium on forecasting.

Smyl, S., Bergmeir, C., Wibowo, E., & Ng, T. W. (2019). Rlgt: Bayesian exponential smoothing models with trend modifications. R package version 0.1-2.

Smyl, S., & Kuber, K. (2016). Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural networks. In 36th international symposium on forecasting.

Smyl, S., & Zhang, Q. (2015). Fitting and extending exponential smoothing models with Stan. In 35th international symposium on forecasting.

Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research (JMLR), 7, 1231–1264.

Taylor, J. W. (2003). Short-term electricity demand forecasting using double seasonal exponential smoothing. The Journal of the Operational Research Society, 54(8), 799–805.

Weron, R. (2014). Electricity price forecasting: A review of the state-of-the-art with a look into the future. International Journal of Forecasting, 30(4), 1030–1081.