A Machine Learning Approach For Forecasting Hierarchical Time Series
Keywords: Hierarchical time series; Forecast; Machine learning; Deep neural network

Abstract

In this paper, we propose a machine learning approach for forecasting hierarchical time series. When dealing with hierarchical time series, apart from generating accurate forecasts, one needs to select a suitable method for producing reconciled forecasts. Forecast reconciliation is the process of adjusting forecasts to make them coherent across the hierarchy. In the literature, coherence is often enforced by using a post-processing technique on the base forecasts produced by suitable time series forecasting methods. On the contrary, our idea is to use a deep neural network to directly produce accurate and reconciled forecasts. We exploit the ability of a deep neural network to extract information capturing the structure of the hierarchy. We impose the reconciliation at training time by minimizing a customized loss function. In many practical applications, besides time series data, hierarchical time series include explanatory variables that are beneficial for increasing the forecasting accuracy. Exploiting this further information, our approach links the relationship between time series features extracted at any level of the hierarchy and the explanatory variables into an end-to-end neural network providing accurate and reconciled point forecasts. The effectiveness of the approach is validated on three real-world datasets, where our method outperforms state-of-the-art competitors in hierarchical forecasting.
1. Introduction

A hierarchical time series is a collection of time series organized in a hierarchical structure that can be aggregated at different levels (Hyndman & Athanasopoulos, 2018). As an example, Stock Keeping Unit (SKU) sales aggregate up to product subcategory sales, which further aggregate to product categories (Franses & Legerstee, 2011). Hierarchical forecasting is a very important application of expert systems for decision-making (Huber, Gossmann, & Stuckenschmidt, 2017). In order to support decision-making at different levels of the hierarchy, a challenging task is the generation of coherent forecasts. Forecasts of the individual series are coherent when they sum up properly across the levels, preserving the hierarchical structure.

Coherence can be required either at the cross-sectional level or at the temporal level. For example, at the cross-sectional level, forecasts of regional sales should sum up to give forecasts of state sales, which should, in turn, sum up to give forecasts for the national sales. For temporal coherence instead, forecasts at the day level must sum up coherently at the week level, then at the month level, and so on. Recently, hierarchical time series have attracted attention; see Hollyman, Petropoulos, and Tipping (2021) and references therein. Usually, the two types of coherence are pursued with different and dedicated approaches, apart from some recent papers (Kourentzes & Athanasopoulos, 2019; Di Fonzo & Girolimetto, 2020; Spiliotis, Petropoulos, Kourentzes, & Assimakopoulos, 2020). In this paper, we focus on cross-sectional coherence. In the literature, two lines of research among others are pursued: top-down and bottom-up approaches. Top-down approaches involve forecasting first the top-level series and then disaggregating it by means of historical (Gross & Sohl, 1990) or forecasted proportions (Athanasopoulos, Ahmed, & Hyndman, 2009) to get forecasts for the lower-level series. On the other hand, the bottom-up approach first produces forecasts for the bottom-level time series and then aggregates them to get the forecasts for the higher-level time series. Both classes of methods have their advantages: top-down approaches perform well when the top-level series is easy to forecast, whereas the bottom-up method accurately identifies the pattern of each series without loss of information. However, the bottom-up approach ignores correlations among the series, possibly leading to aggregate forecasts worse than the ones produced by top-down approaches (Shlifer & Wolff, 1979). In general, a bottom-up approach should be preferable whenever the forecasts are employed to support decisions that are mainly related to the bottom rather than the top of the hierarchy,
whereas a top-down approach performs better when the bottom-level series are too noisy (Dunn, Williams, & Dechaine, 1976). The objective of reconciling forecasts at all levels of the hierarchy has led researchers to investigate the impact that the association between bottom-level series produces on the aggregation (Nenova & May, 2016). Analytical approaches to the forecast reconciliation problem have been proposed by Hyndman, Ahmed, Athanasopoulos, and Shang (2011) and by Wickramasuriya, Athanasopoulos, and Hyndman (2019). These methods not only ensure that forecasts are coherent but also lead to improvements in forecast accuracy. However, a shortcoming of these methods is the need for two stages, with forecasts first produced independently for each series in the hierarchy, and then optimally combined to satisfy the aggregation constraint. Therefore, the reconciliation is the result of post-processing on the base forecasts. In Hollyman et al. (2021), all the above-mentioned methods are reconsidered within the framework of forecast combinations, showing that they can all be re-interpreted as particular examples of forecast combination where the coherence constraint is enforced with different strategies. The authors also show that combining forecasts at the bottom level of the hierarchy can be exploited to improve the accuracy of the higher levels.

In recent years, machine learning models, especially those based on neural networks, have emerged in the literature as an alternative to statistical methods for forecasting non-hierarchical time series. Indeed, many papers define new machine learning algorithms (see for example Bontempi, Taieb, & Le Borgne, 2012; Liu, Gong, Yang, & Chen, 2020; Bandara, Bergmeir, & Smyl, 2020; Carta, Corriga, Ferreira, Podda, & Recupero, 2021; Ye & Dai, 2021), proposing innovative forecasting strategies that aim at improving the accuracy of time series predictions. Drawing inspiration from this line of research, we propose a machine learning approach for forecasting hierarchical time series. Using machine learning for hierarchical time series has also been considered recently in Spiliotis, Abolghasemi, Hyndman, Petropoulos, and Assimakopoulos (2020). The authors propose a bottom-up method where the forecasts of the series of the bottom level are produced by a machine learning model (Random Forest and XGBoost), taking as input the base forecasts of all the series of the hierarchy. The reconciliation is then obtained by summing up the bottom-level forecasts. Rather than formulating the reconciliation problem as a post-processing technique or just forecasting the bottom-level time series, our idea is to define a method that can automatically extract at any level of the hierarchy all the relevant information, taking the reconciliation into account during the training as well. Furthermore, our approach is able to easily incorporate at any level the information provided by the explanatory variables.

Forecasting models for time series with explanatory variables aim to predict correlated data taking into account additional information, known as exogenous variables. It is well known that incorporating explanatory variables in time series models helps to improve the forecast accuracy (see Maçaira, Thomé, Oliveira, & Ferrer, 2018 for a systematic literature review); thus, in this paper we focus on these types of time series in the context of hierarchical forecasting. Our idea is to combine the explanatory variables with time series features defining the structure of the hierarchy to enhance the reconciliation and forecasting process. The main instrument we use to extract time series features is a Deep Neural Network (DNN). DNNs are designed to learn hierarchical representations of data (LeCun, Bengio, & Hinton, 2015). Thanks to their ability to extract meaningful features from data, Convolutional Neural Networks (CNNs) have been successful in time series forecasting and classification, producing state-of-the-art results (Fawaz, Forestier, Weber, Idoumghar, & Muller, 2019), but they have not been used in hierarchical time series forecasting. Our intuition is that the information extracted at any level of the hierarchy through a CNN can be used to discover the structure of the series below.

Hierarchical forecasting is relevant in many applications, such as energy and tourism, and it is common in the retail industry, where the SKU demand can be grouped at different levels. Therefore, we prove the effectiveness of our method using three public datasets coming from real-world applications. The first one considers five years of sales data of an Italian grocery store; it has three levels and noisy bottom-level series. This dataset has been made public by the authors (see Mancuso, Piccialli, & Sudoso, 2021). The second one has two levels and comes from electricity demand data in Switzerland (Nespoli, Medici, Lopatichki, & Sossan, 2020); it has quite regular bottom-level series. The third one, with four levels, is extracted from the Walmart data used in the M5 forecasting competition. On all these datasets, our method increases the forecasting accuracy of the hierarchy, outperforming state-of-the-art approaches, as confirmed by a deep statistical analysis.

However, our methodology for forecasting hierarchical time series shares the same limitations as machine learning approaches for forecasting non-hierarchical time series: it is not viable for time series with a too-small number of historical observations (i.e., a few years of observations are needed for daily time series). Summarizing, the main contributions of the paper are:

1. For the first time, we introduce the use of machine learning in the forecasting of hierarchical time series, defining a methodology that can be used at any level of the hierarchy to generate accurate and coherent forecasts for the lower levels.
2. Our method uses a deep neural network that is able at once to automatically extract the relevant features of the hierarchy, while forcing the reconciliation and easily exploiting the exogenous variables at any level of the hierarchy.
3. We consider three real-world datasets, and we perform comparisons with state-of-the-art methods in hierarchical forecasting. Furthermore, a deep statistical analysis assesses the superiority of our approach in comparison to standard methods.
4. We share with the research community a new challenging dataset for hierarchical forecasting coming from the sales data of an Italian grocery store.

The rest of the paper is organized as follows. Section 2 discusses the concept of hierarchical time series and the methods of hierarchical forecasting. Section 3 contains the details of the proposed machine learning algorithm. Section 4 describes the basic forecasting methods employed in the hierarchical models and the experimental setup. Section 5 discusses the datasets and the numerical experiments conducted to evaluate the proposed method. Finally, Section 6 concludes the paper.

2. Hierarchical Time Series

In a general hierarchical structure with K > 0 levels, level 0 is defined as the completely aggregated series. Each level from 1 to K - 2 denotes a further disaggregation, down to level K - 1 containing the most disaggregated time series. In a hierarchical time series, the observations at higher levels can be obtained by summing up the series below. Let $y_t^k \in \mathbb{R}^{m_k}$ be the vector of all observations at level k = 1, ..., K - 1 and t = 1, ..., T, where $m_k$ is the number of series at level k and $M = \sum_{k=0}^{K-1} m_k$ is the total number of series in the hierarchy. Then we define the vector of all observations at time t as

$y_t = [\, y_t^0, (y_t^1)^\top, \ldots, (y_t^{K-1})^\top \,]^\top \in \mathbb{R}^M,$
where $y_t^0$ is the observation of the series at the top and the vector $y_t^{K-1}$ contains the observations of the series at the bottom of the hierarchy. The structure of the hierarchy is determined by the summing matrix S, of size $M \times m_{K-1}$, such that $y_t = S\, y_t^{K-1}$.

2.1. Bottom-up approach

The bottom-up approach first generates the base forecasts for the bottom-level series and then aggregates them to the upper levels of the hierarchy according to the summing matrix. It can be represented as follows:

$\tilde{y}_h = S\, \hat{y}_h^{K-1},$

where $\tilde{y}_h$ is the vector of coherent h-step-ahead forecasts for all series of the hierarchy. An advantage of this approach is that we directly forecast the series at the bottom level, and no information gets lost due to the aggregation. On the other hand, bottom-level series can be noisy and more challenging to model and forecast. This approach also has the disadvantage of having many time series to forecast if there are many series at the lowest level.

2.2. Top-down approaches

Top-down approaches first involve generating the base forecasts for the total series and then disaggregating these downwards to get coherent forecasts for each series of the hierarchy. The disaggregation of the top-level forecasts is usually achieved by using the proportions $p = (p_1, \ldots, p_{m_{K-1}})^\top$, which represent the relative contribution of the bottom-level series to the top-level aggregate. The two most commonly used top-down approaches are the Average Historical Proportions (AHP) and the Proportions of the Historical Averages (PHA). In the case of the AHP, the proportions are calculated as follows:

$p_i = \frac{1}{T} \sum_{t=1}^{T} \frac{y_{t,i}^{K-1}}{y_t^0}, \quad i = 1, \ldots, m_{K-1},$

whereas in the case of the PHA they are given by the ratio of the historical averages:

$p_i = \frac{\sum_{t=1}^{T} y_{t,i}^{K-1} / T}{\sum_{t=1}^{T} y_t^0 / T}, \quad i = 1, \ldots, m_{K-1}.$

A third option, the Forecasted Proportions (FP) approach of Athanasopoulos et al. (2009), computes the proportions from the base forecasts rather than from the historical data:

$p_i = \prod_{k=0}^{K-2} \frac{\hat{y}_{t,i}^{k}}{\hat{\sigma}_{t,i}^{k+1}}, \quad i = 1, \ldots, m_{K-1},$

where $\hat{y}_{t,i}^{k}$ is the base forecast of the series that corresponds to the node which is k levels above node i, and $\hat{\sigma}_{t,i}^{k+1}$ is the sum of the base forecasts below the series that is k levels above node i and directly in contact with that series.

2.3. Middle-out approach

The middle-out method can be seen as a combination of the top-down and bottom-up approaches. It combines ideas from both methods by starting from a middle level where forecasts are reliable. For the series above the middle level, coherent forecasts are generated using the bottom-up approach by aggregating the middle-level forecasts upwards. For the series below the middle level, coherent forecasts are generated using a top-down approach by disaggregating the middle-level forecasts downwards.

2.4. Optimal reconciliation

Hyndman et al. (2011) propose a novel approach that provides optimal forecasts that are better than forecasts produced by either a top-down or a bottom-up approach. Their proposal is to independently forecast all series at all levels of the hierarchy and then use a linear regression model to optimally combine and reconcile these forecasts.
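To make the aggregation structure and the classical proportions concrete, the following is a minimal NumPy sketch on a toy two-level hierarchy (one total, three bottom-level series); the array names and the placeholder forecasts are our own illustration, not part of the original method descriptions.

```python
import numpy as np

# Toy hierarchy: one total series on top of three bottom-level series.
# The summing matrix S maps bottom observations to the full vector
# [total; bottom], so y_t = S @ y_t^{K-1}.
S = np.vstack([np.ones((1, 3)), np.eye(3)])

rng = np.random.default_rng(0)
T = 100
y_bottom = rng.gamma(2.0, 5.0, size=(T, 3))   # bottom-level history (T x 3)
y_total = y_bottom.sum(axis=1)                # coherent top-level history

# Bottom-up: base forecasts for the bottom series, aggregated with S.
y_hat_bottom = y_bottom.mean(axis=0)          # placeholder base forecasts
y_tilde = S @ y_hat_bottom                    # coherent forecasts, all series

# AHP: average of the historical proportions y_{t,i} / y_t.
p_ahp = (y_bottom / y_total[:, None]).mean(axis=0)

# PHA: ratio of the historical averages.
p_pha = y_bottom.mean(axis=0) / y_total.mean()

# Top-down: distribute a top-level forecast with either proportion vector.
y_hat_total = y_total.mean()                  # placeholder top-level forecast
bottom_ahp = y_hat_total * p_ahp
bottom_pha = y_hat_total * p_pha
assert np.isclose(p_ahp.sum(), 1.0) and np.isclose(p_pha.sum(), 1.0)
```

Both proportion vectors sum to one by construction, so the disaggregated bottom-level forecasts add back up exactly to the top-level forecast.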
Fig. 2. Decomposition of the aggregated forecast through a neural network: Neural Network Disaggregation (NND).
Their approach uses a generalized least squares estimator that requires an estimate of the covariance matrix of the errors that arise due to incoherence. In a recent paper, Wickramasuriya et al. (2019) show that this matrix is impossible to estimate in practice, and they propose a state-of-the-art forecast reconciliation approach, called Minimum Trace (MinT), that incorporates the information from a full covariance matrix of forecast errors in obtaining a set of coherent forecasts. MinT minimizes the mean squared error of the coherent forecasts across the entire hierarchy under the constraint of unbiasedness. The resulting revised forecasts are coherent, unbiased, and have minimum variance amongst all combination forecasts. An advantage of the optimal reconciliation approach is that it allows for the correlations between the series at each level, using all the available information within the hierarchy. However, it is computationally expensive compared to the other methods introduced so far, because it requires individually forecasting the time series at all the levels.
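For reference, the MinT reconciliation of Wickramasuriya et al. (2019) admits a closed form, $\tilde{y} = S(S^\top W^{-1} S)^{-1} S^\top W^{-1}\,\hat{y}$. Below is a minimal NumPy sketch of this projection; the diagonal W and the toy numbers are our own assumptions for illustration (the experiments later in the paper use a shrinkage estimate of W instead).

```python
import numpy as np

def mint_reconcile(S: np.ndarray, y_hat: np.ndarray, W: np.ndarray) -> np.ndarray:
    """MinT: map incoherent base forecasts y_hat for all M series onto the
    coherent subspace via y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat, where W
    is the covariance matrix of the base forecast errors."""
    Winv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)  # bottom-level mapping
    return S @ (G @ y_hat)

# Toy 2-level hierarchy: one total plus three bottom series.
S = np.vstack([np.ones((1, 3)), np.eye(3)])
y_hat = np.array([110.0, 30.0, 40.0, 45.0])   # incoherent: 30+40+45 != 110
W = np.diag([4.0, 1.0, 1.0, 1.0])             # assumed diagonal error variances
y_tilde = mint_reconcile(S, y_hat, W)
assert np.isclose(y_tilde[0], y_tilde[1:].sum())  # coherence holds exactly
```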
Fig. 3. Our model has one branch that accepts the numerical data (left) and another branch that accepts time series data (right).
3. Neural network disaggregation

According to Hyndman and Athanasopoulos (2018), standard top-down approaches have the disadvantage of information loss, since they are unable to capture the individual time series characteristics. On the other hand, the bottom-up approach does not exploit the characteristics of the time series at intermediate levels. Departing from the related literature, to the best of our knowledge we propose a new approach that first generates an accurate forecast for the aggregated time series at a chosen level of the hierarchy and then disaggregates it downwards. We formulate the disaggregation problem as a non-linear regression problem, and we solve it with a deep neural network that jointly learns how to disaggregate and generate coherent forecasts across the levels of the hierarchy. To explain the proposed algorithm, we focus on two consecutive levels, with the top-level time series being at node j of level k and the bottom-level series at level k + 1 (see Fig. 1).

Let $m_j^{k+1}$ be the number of series at level k + 1 connected to the parent node j at level k; then we model the disaggregation procedure as a non-linear regression problem:

$y_t^{k+1,j} = f\left(y_{t,j}^{k,p},\, y_{t-1,j}^{k,p},\, \ldots,\, y_{t-l,j}^{k,p},\, x_{t,1},\, \ldots,\, x_{t,m_j^{k+1}}\right) + \epsilon, \qquad (1)$

where $y_t^{k+1,j}$ is the vector of size $m_j^{k+1}$ containing the series at level k + 1, $y_{t,j}^{k,p}$ is the aggregate time series corresponding to the node j at level k connected to the parent node p at level k - 1, l is the number of lagged time steps of the aggregated series, $x_{t,i}$ is a vector of the external regressors for each series at level k + 1, f is a non-linear function learned, in our case, by a feed-forward neural network, and $\epsilon$ is the error term.

Given any aggregate time series $y_{t,j}^{k,p}$ and the vector of series $y_t^{k+1,j}$, the algorithm is made up of two steps. In the first one, the best forecasting model for the aggregated time series is chosen, and the neural network is trained with the real values of the training set of the two levels' time series. In the second step, forecasts for the aggregated time series are fed to the neural network to obtain forecasts for all the lower-level time series. The flow chart of the proposed algorithm is shown in Fig. 2. More in detail, the two steps are the following:
1. [Step 1] In the training phase, the best forecasting model F* for the time series $y_{t,j}^{k,p}$ is chosen based on the training set. At the same time, the neural network is trained taking as input the training set of $y_{t,j}^{k,p}$ with lagged time steps and the explanatory variables $x_{t,i}$ relative to the training set of $y_t^{k+1,j}$. The output is the true values of the disaggregated time series $y_t^{k+1,j}$. In order to simplify the notation, from now on we refer to the produced model as NND (Neural Network Disaggregation).
2. [Step 2] In the disaggregation or test phase, forecasts $\hat{y}_{t,j}^{k,p}$ relative to the time period of the test set are generated by the model F*. Finally, these forecasts are fed to the trained NND to produce the disaggregated forecasts $\hat{y}_t^{k+1,j}$ for the test set.

In general, the learned function f generates base forecasts that are not coherent, since they do not sum up correctly according to the hierarchical structure. In order to ensure that forecasts are reconciled across the hierarchy, we want f to output a set of forecasts that are as close as possible to the base forecasts, but also meet the requirement that forecasts at upper levels in the hierarchy are the sum of the associated lower-level forecasts. From an optimization perspective, we want to introduce an equality constraint to the regression problem in such a way that we can still use backpropagation to train the network. More in detail, we are looking for the network weights such that the fitting error is minimized and, besides, we want the following constraint to hold:

$y_{t,j}^{k,p} = \mathbf{1}^\top y_t^{k+1,j} = \mathbf{1}^\top \hat{y}_t^{k+1,j} = \hat{y}_{t,j}^{k,p}, \qquad (2)$

where $\mathbf{1}$ is the vector of all ones of size $m_j^{k+1}$.

We impose the coherence by adding a term to the fitting error that penalizes differences between the sum of the lower-level observations and the sum of the lower-level forecasts:

$L\left(y_t^{k+1,j}, \hat{y}_t^{k+1,j}\right) = (1-\alpha)\left[\frac{1}{T}\sum_{t=1}^{T} \left\| y_t^{k+1,j} - \hat{y}_t^{k+1,j} \right\|^2\right] + \alpha\left[\frac{1}{T}\sum_{t=1}^{T} \left(\mathbf{1}^\top y_t^{k+1,j} - \mathbf{1}^\top \hat{y}_t^{k+1,j}\right)^2\right], \qquad (3)$

where $\alpha \in (0, 1)$ is a parameter that controls the relative contribution of each term in the loss function. Note that the two terms are on the same scale, and the parameter $\alpha$ measures the compromise between minimizing the fitting error and satisfying the coherence. A too small value of $\alpha$ will result in the corresponding constraint being ignored, producing, in general, non-coherent forecasts, whereas a too large value will cause the fitting error to be ignored, producing coherent but possibly inaccurate base forecasts. The idea is to balance the contribution of both terms by setting $\alpha = 0.5$, which corresponds to giving the two terms the same importance. In principle, the parameter $\alpha$ may be tuned on each instance. However, we did not investigate the tuning of $\alpha$ and kept it fixed to 0.5, since this setting allowed us to reach a satisfying reconciliation error on all the experiments.

Top-down approaches distribute the top-level forecasts down the hierarchy using historical or forecasted proportions of the data. In our case, explicit proportions are never calculated, since the algorithm automatically learns how to disaggregate forecasts from any level of the hierarchy to the series below without loss of information. Furthermore, our method is flexible enough to be employed in the forecasting process of the whole hierarchy in two different ways:

1. Standard top-down: a forecasting model F* is developed for the aggregate at level 0, and a single disaggregation model NND is trained with the series at level 0 and K - 1. Therefore, forecasts for the bottom-level series are produced by looking only at the aggregated series at level 0. Then, the bottom-level forecasts are aggregated to generate coherent forecasts for the rest of the series of the hierarchy.
2. Iterative top-down: the forecasting model F* for an aggregate at level k is the disaggregation model NND trained with the series at level k - 1 and k, for each k = 1, ..., K - 1. At level 0, instead, F* is the best model selected among a set of standard forecasting methods. Forecasts for all the levels are then obtained by feeding forecasts to the disaggregation models at each level.

The difference between the two approaches is that in the standard top-down, bottom-level forecasts are generated with only one disaggregation model, whereas in the iterative version, a larger number of disaggregation models is trained, one for each series to be disaggregated. To be more precise, to disaggregate the $m_k$ series at level k = 0, ..., K - 2, exactly $m_k$ disaggregation models are trained in parallel. In this way, on the one hand, we increase the variance of the approach (and the computational time), but on the other hand, we reduce the bias, since we increase flexibility and better take into account the variability at the different levels.

We also notice that this algorithm can be easily plugged into a middle-out strategy: a forecasting model is developed for each aggregate at a convenient level, and the disaggregation models are trained and tested to distribute these forecasts to the series below. For the series above the middle level, coherent forecasts are generated using the bottom-up approach.

Regarding the choice of the neural network architecture, our objective is to include in the model the relationship between the explanatory variables derived from the lower-level series and the features of the aggregate series that describe the structure of the hierarchy. In order to better capture the structure of the hierarchy, we use a Convolutional Neural Network (CNN). CNNs are well known for creating useful representations of time series automatically, being highly noise-resistant models, and being able to extract very informative, deep features, which are independent of time (Kanarachos, Christopoulos, Chroneos, & Fitzpatrick, 2017; Ferreira, Corrêa, Nonato, & de Mello, 2018). Our model is a deep neural network capable of accepting and combining multiple types of input, including cross-sectional and time series data, in a single end-to-end model. Our architecture is made up of two branches: the first branch is a simple Multi-Layer Perceptron (MLP) designed to handle the explanatory variables $x_{t,i}$, such as promotions, day of the week, or, in general, special events affecting the time series of interest; the second branch is a one-dimensional CNN that extracts feature maps over fixed segments of length w from the aggregate series $y_{t,j}^{k,p}$. Features extracted from the two subnetworks are then concatenated together to form the final input of the multi-output regression model (see Fig. 3). The output layer of the model is a standard regression layer with linear activation function, where the number of units is equal to the number of series to forecast.
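Since the NND is trained by backpropagation on Eq. (3), the loss is straightforward to express in TensorFlow, the library used for the implementation (see Section 4). The sketch below is our reading of Eq. (3), with the average over the mini-batch standing in for the average over t = 1, ..., T:

```python
import tensorflow as tf

def reconciliation_loss(alpha: float = 0.5):
    """Custom loss of Eq. (3): a convex combination of the fitting error on
    the lower-level series and a penalty on the difference between the
    summed observations and the summed forecasts (the coherence term)."""
    def loss(y_true, y_pred):
        # (1 - alpha) * mean_t || y_t - y_hat_t ||^2
        fit = tf.reduce_mean(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1))
        # alpha * mean_t ( 1'y_t - 1'y_hat_t )^2
        coh = tf.reduce_mean(tf.square(tf.reduce_sum(y_true, axis=-1)
                                       - tf.reduce_sum(y_pred, axis=-1)))
        return (1.0 - alpha) * fit + alpha * coh
    return loss
```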
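A compact Keras sketch of the two-branch network follows. The depths (six convolutional and three dense layers, linear output) and the hyperparameter grids match the description in Section 4, while the pooling layer, the default filter/kernel/unit values, and the example input sizes are our assumptions for illustration; it reuses the reconciliation_loss defined above.

```python
from tensorflow.keras import Model, layers

def build_nnd(w: int, n_exog: int, n_children: int,
              filters: int = 32, kernel_size: int = 8, hidden: int = 128):
    # Branch 1: MLP on the explanatory variables (promotions, calendar, ...).
    x_in = layers.Input(shape=(n_exog,), name="exogenous")
    x = x_in
    for _ in range(3):                       # 3 fully connected ReLU layers
        x = layers.Dense(hidden, activation="relu")(x)

    # Branch 2: 1D CNN on a window of length w of the aggregate series.
    s_in = layers.Input(shape=(w, 1), name="aggregate_window")
    s = s_in
    for _ in range(6):                       # 6 convolutional ReLU layers
        s = layers.Conv1D(filters, kernel_size, padding="same",
                          activation="relu")(s)
    s = layers.GlobalAveragePooling1D()(s)   # pooling choice is ours

    # Concatenate both feature sets and regress all child series jointly.
    z = layers.Concatenate()([x, s])
    out = layers.Dense(n_children, activation="linear")(z)
    return Model(inputs=[x_in, s_in], outputs=out)

# e.g. a disaggregation model for a brand with 42 items and w = 30 lags:
model = build_nnd(w=30, n_exog=9, n_children=42)
model.compile(optimizer="adam",              # Adam, default lr = 0.001
              loss=reconciliation_loss(alpha=0.5))
```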
4. Basic forecasting methods and experimental setup

4.1. Base forecasting methods

To produce the base forecasts for the aggregate series, we consider the following standard methods:

1. Naive
2. Autoregressive Integrated Moving Average (ARIMA)
3. Exponential Smoothing (ETS)
4. Non-linear autoregression model (NAR)
5. Dynamic regression models: univariate time series models, such as linear and non-linear autoregressive models, allow for the inclusion of information from past observations of a series, but not for the inclusion of other information that may also affect the time series of interest. Dynamic regression models make it possible to take into account the time-lagged relationship between the output and the lagged observations of both the time series itself and the external regressors. More in detail, we consider two types of dynamic regression models:
   (a) ARIMA model with exogenous variables (ARIMAX)
   (b) NAR model with exogenous variables (NARX)

In the literature, it has been pointed out that the performance of forecasting models could be improved by suitably combining forecasts from standard approaches (Timmermann, 2006). An easy way to improve forecast accuracy is to use several different models on the same time series and to average the resulting forecasts. We consider two ways of combining forecasts:

1. Simple Average: the most natural approach to combine forecasts is to use the mean. The composite forecast in the case of the simple average is given by

$\hat{y}_t = \frac{1}{m} \sum_{i=1}^{m} \hat{y}_{i,t}, \quad t = T+1, \ldots, T+h,$

where h is the forecast horizon, m is the number of combined models, and $\hat{y}_{i,t}$ is the forecast at time t generated by model i.
2. Constrained Least Squares Regression: the composite forecast is not a function of m only, as in the simple average, but is a linear function of the individual forecasts, whereby the parameters are determined by solving an optimization problem. The approach proposed by Timmermann (2006) minimizes the sum of squared errors under some constraints on the combination weights. Different from the simple average, which does not need any training as the weights are a function of m only, with this method we need to allocate a reserved portion of forecasts to train the meta-model.

In particular, we consider the two following composite models:

1. Combination of ARIMAX, NARX, and ETS forecasts obtained through the simple mean.
2. Combination of ARIMAX, NARX, and ETS forecasts obtained by solving the constrained least squares problem.

We choose to combine these two dynamic regression models with exponential smoothing in order to directly take into account the effect of the explanatory variables and the presence of linear and non-linear patterns in the series.
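The two combination schemes can be sketched as follows. Since the exact constraint set is not restated here, the constrained variant assumes the standard formulation in Timmermann (2006), with non-negative weights summing to one:

```python
import numpy as np
from scipy.optimize import minimize

def combine_simple_average(F: np.ndarray) -> np.ndarray:
    """F has shape (h, m): h forecast steps from each of m models."""
    return F.mean(axis=1)

def combine_constrained_ls(F_train: np.ndarray, y_train: np.ndarray,
                           F_test: np.ndarray) -> np.ndarray:
    """Learn combination weights on a reserved window of forecasts
    (F_train vs. realized y_train), then apply them to new forecasts."""
    m = F_train.shape[1]
    sse = lambda w: float(np.sum((y_train - F_train @ w) ** 2))
    res = minimize(sse, x0=np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=({"type": "eq",
                                 "fun": lambda w: w.sum() - 1.0},),
                   method="SLSQP")
    return F_test @ res.x

# The columns would hold, e.g., the ARIMAX, NARX and ETS forecasts.
```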
4.2. Model selection

Following an approach widely employed in the machine learning literature, we separate the available data into two sets: training (in-sample) and test (out-of-sample) data. The training data $(y_1, \ldots, y_N)$, a time series of length N, is used to estimate the parameters of a forecasting model, and the test data $(y_{N+1}, \ldots, y_T)$, which comes chronologically after the training set, is used to evaluate its accuracy.

To achieve a reliable measure of model performance, we implement on the training set a procedure that applies a cross-validation logic suitable for time series data. In the expanding window procedure described by Hyndman and Athanasopoulos (2018), the model is trained on a window that expands over the entire history of the time series, and it is repeatedly tested against a forecasting window without dropping older data points. This method produces many different train/test splits, and the error on each split is averaged in order to compute a robust estimate of the model error (see Fig. 4). The implementation of the expanding window procedure requires four parameters:

- Starting window: the number of data points included in the first training iteration.
- Ending window: the number of data points included in the last training iteration.
- Forecasting window: the number of data points included for forecasting.
- Expanding steps: the number of data points added to the training time series from one iteration to another.

For each series, the best performing model after the cross-validation phase is retrained using the in-sample data, and forecasts are obtained recursively over the out-of-sample period. The above procedure requires a forecast error measure. We consider the Mean Absolute Scaled Error (MASE) proposed by Hyndman (2006):

$\mathrm{MASE} = \frac{\frac{1}{h} \sum_{i=T+1}^{T+h} |y_i - \hat{y}_i|}{\frac{1}{T-m} \sum_{t=m+1}^{T} |y_t - y_{t-m}|},$

where h is the forecast horizon and m is the seasonal period used by the seasonal naive benchmark in the denominator.

The time series models described above are implemented by using the forecast package in R (Hyndman & Khandakar, 2008). Hierarchical time series forecasting is performed with the help of the hts package in R (Hyndman, Lee, Wang, & Wickramasuriya, 2018). For the optimal reconciliation approach, we use the MinT algorithm, which estimates the covariance matrix of the base forecast errors using shrinkage (Wickramasuriya et al., 2019). The proposed NND is implemented in Python with TensorFlow, a large-scale machine learning library (Abadi et al., 2015). The CNN subnetwork has 6 convolutional layers with ReLU activation, whereas the MLP subnetwork has 3 fully connected layers with ReLU activation. The hyperparameter optimization of the CNN subnetwork regards the number of filters (F) and the kernel size (K) of the convolutional layers, whereas the optimization of the MLP subnetwork regards the number of units in the hidden layers (H). Grid search is used to perform the hyperparameter optimization in the space of the neural network, where F = {16, 32, 64}, K = {4, 8, 16} and H = {64, 128, 256}. We evaluate the hyperparameter configurations on a held-out validation set, and we choose the architecture achieving the best performance on it (see Appendix A for the optimal hyperparameters of the trained models). The NND model takes as input mini-batches of 32 examples, and the loss function in Eq. (3) is minimized by using the Adam optimizer (Kingma & Ba, 2017) with the initial learning rate set to 0.001. The network is trained for 500 epochs, and early stopping is used to stop the training as soon as the error on the validation set starts to grow (Caruana, Lawrence, & Giles, 2000). The training of a single disaggregation model requires on the order of minutes on an Intel Core i7-8565U CPU, depending on the network dimension and on the granularity of the dataset.
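A minimal sketch of the expanding window procedure and of the MASE computation used for model selection is given below; the function and parameter names (start, end, fw, step, m) are ours and mirror the four parameters listed above:

```python
import numpy as np

def expanding_window_splits(n: int, start: int, end: int, fw: int, step: int):
    """Yield (train_idx, test_idx) pairs: the training window grows from
    `start` to `end` points in increments of `step`, and each iteration is
    evaluated on the following `fw` points (no data point is dropped)."""
    size = start
    while size <= min(end, n - fw):
        yield np.arange(size), np.arange(size, size + fw)
        size += step

def mase(y_train: np.ndarray, y_test: np.ndarray, y_pred: np.ndarray,
         m: int = 7) -> float:
    """MASE: out-of-sample MAE scaled by the in-sample MAE of the seasonal
    naive forecast with period m (e.g. m = 7 for daily data)."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y_test - y_pred)) / scale)
```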
5. Datasets and numerical experiments

5.1. Datasets

We validate the proposed approach on the following three real-world datasets:

1. Italian Dataset: we consider sales data gathered from an Italian grocery store (Mancuso et al., 2021; available at https://data.mendeley.com/datasets/s8dgbs3rng/1). The dataset consists of 118 daily time series representing the demand for pasta from 01/01/2014 to 31/12/2018. Besides univariate time series data, the quantity sold is integrated by information on the presence or the absence of a promotion (no detail on the type of promotion or on the final price is available). These time series can be naturally arranged to follow a hierarchical structure. Here, the idea is to build a 3-level structure: at the top of the hierarchy, there is the total or store-level series, obtained by aggregating the brand-level series. At the second level, there are the brand-level series (like, for instance, Barilla), obtained by aggregating the individual demand at the item level. Finally, the third level contains the most disaggregated time series, representing the item-level demand (for example, the demand for spaghetti Barilla). The completely aggregated series at level 0 is disaggregated into 4 component series at level 1 (B1 to B4). Each of these series is further subdivided into 42, 45, 10, and 21 series at level 2, the completely disaggregated bottom level, representing the different varieties of pasta for each brand (see Table 1).
2. Electricity Dataset: we use a public electricity demand dataset that contains power measurements and meteorological forecasts relative to a set of 24 power meters installed in low-voltage cabinets of the distribution network of the city of Rolle in Switzerland (Nespoli et al., 2020). The dataset contains measurements from 13/01/2018 to 19/01/2019 at the granularity of 10 min and includes mean active and reactive power, voltage magnitude, maximum total harmonic distortion for each phase, voltage frequency, and the average power over the three phases. We assume that the grid losses are not significant, so the power at the grid connection is the algebraic sum of the connected nodes. Based on the historical measurements, the operator can determine coherent forecasts for the whole grid by generating forecasts for the nodal injections individually. We build a 2-level hierarchy in which we aggregate the 24 series (M1 to M24) of the distribution system at the meter level to generate the total series at the grid level (see Table 2).
3. Walmart Dataset: we consider a public dataset made available by Walmart that was adopted in the latest M competition, M5 (https://mofc.unic.ac.cy/m5-competition/). This dataset contains historical sales data from 29/01/2011 to 19/06/2016 of various products sold in the USA, organized in the form of grouped time series. More specifically, it uses unit sales data collected at the product-store level that are grouped according to product departments, product categories, stores, and three geographical areas: the States of California (CA), Texas (TX), and Wisconsin (WI). Besides the time series data, it includes explanatory variables such as promotions (SNAP events), days of the week, and special events (e.g., Super Bowl, Valentine's Day, Thanksgiving Day) that typically affect unit sales and could improve forecasting accuracy. Starting from this dataset, we extract a 4-level hierarchy: the completely aggregated series at level 0 is divided into 3 component series at level 1, representing the state-level time series (CA, TX, WI). The state-level time series are respectively subdivided into 4 (CA1, CA2, CA3, CA4), 3 (TX1, TX2, TX3), and 3 (WI1, WI2, WI3) time series at level 2, the store level. Finally, each store-level time series is further subdivided into 3 time series at the category level, the most disaggregated one, containing the categories Foods, Hobbies, and Household (see Table 3).

Table 1
Hierarchy for the Italian sales data.

Level      Number of series        Total series per level
Store      1                       1
Brand      4                       4
Item       42-45-10-21             118

Table 2
Hierarchy for the electricity demand data.

Level      Number of series        Total series per level
Grid       1                       1
Meter      24                      24

Table 3
Hierarchy for the Walmart data.

Level      Number of series        Total series per level
Total      1                       1
State      3                       3
Store      4-3-3                   10
Category   3-3-3-3-3-3-3-3-3-3     30

To summarize, we have the first dataset with a three-level hierarchy, the second one with a two-level hierarchy, and the third one with a four-level hierarchy. As for the experimental setup, we have to make some choices for each dataset:

1. Italian Dataset: for each series, as explanatory variables, we add a binary variable representing the presence of a promotion if the disaggregation is computed at the item level, or a variable representing the relative number of items in promotion for each brand if the disaggregation is computed at the brand level. In both cases, dummy variables representing the day of the week and the month are also added to the model. As for the number of lagged observations of the aggregate demand, we consider time windows of length w = 30 days with a hop size of 1 day. We consider 4 years from 01/01/2014 to 31/12/2017 for the in-sample period and the last year of data from 01/01/2018 to 31/12/2018 for the out-of-sample period. The experimental setup for the cross-validation procedure is as follows.

Table 4
MASE for all 123 series of the Italian sales dataset. In bold the best performing approach.

                 BU     AHP    PHA    FP     NND1   NND2   OPT
TOTAL            0.999  0.562  0.562  0.562  0.562  0.562  0.560

B1               1.321  1.213  1.225  1.249  0.752  0.702  0.856
B2               0.729  1.009  1.027  0.740  0.788  0.764  0.776
B3               1.265  1.749  1.910  1.225  0.772  0.677  0.877
B4               1.443  1.610  1.660  1.325  0.723  0.737  0.926
Average Brand    1.189  1.395  1.455  1.135  0.759  0.720  0.859

B1-I1            1.295  1.100  1.099  1.360  0.669  0.605  1.292
B1-I2            1.011  0.930  0.931  1.024  0.723  0.747  1.007
B1-I3            0.870  0.832  0.820  0.793  0.795  0.677  0.769
B1-I4            0.753  0.874  0.819  0.753  0.644  0.666  0.731
B1-I5            1.483  1.414  1.419  1.589  0.604  0.639  0.915
B1-I6            0.886  0.758  0.749  0.793  0.691  0.760  0.786
B1-I7            0.838  0.889  0.849  0.850  0.659  0.724  0.832
B1-I8            0.835  0.786  0.785  0.760  0.625  0.653  0.731
B1-I9            1.020  1.032  1.059  1.044  0.662  0.626  0.922
B1-I10           1.113  1.100  1.128  1.078  0.617  0.672  0.868
B1-I11           0.944  0.838  0.836  0.965  0.691  0.743  0.951
B1-I12           1.247  1.154  1.191  1.084  0.694  0.730  0.843
B1-I13           0.997  0.810  0.812  0.921  0.635  0.638  0.886
B1-I14           0.945  0.876  0.894  0.946  0.687  0.651  0.922
B1-I15           1.090  1.124  1.125  1.107  0.632  0.665  0.801
B1-I16           0.746  0.778  0.757  0.754  0.733  0.678  0.736
B1-I17           0.876  0.803  0.804  0.901  0.968  1.027  0.884
B1-I18           1.642  1.215  1.228  1.176  0.692  0.697  0.644
B1-I19           0.846  0.788  0.757  0.786  0.760  0.772  0.738
B1-I20           0.939  0.861  0.864  0.864  0.637  0.701  0.820
B1-I21           0.872  0.777  0.767  0.778  0.676  0.705  0.755
B1-I22           0.813  0.765  0.763  0.812  0.582  0.679  0.802
B1-I23           1.054  0.901  0.907  1.084  0.502  0.629  1.057
B1-I24           1.082  1.011  1.074  1.127  0.603  0.617  1.079
B1-I25           0.823  0.905  0.867  0.834  0.627  0.681  0.818
B1-I26           0.784  0.841  0.829  0.798  0.765  0.830  0.783
B1-I27           0.747  0.747  0.725  0.753  0.678  0.702  0.743
B1-I28           0.983  1.029  1.022  1.021  0.586  0.726  0.986
B1-I29           1.082  0.874  0.868  0.889  0.600  0.692  0.882
B1-I30           0.972  0.826  0.833  0.866  0.515  0.638  0.858
B1-I31           0.955  0.890  0.896  0.972  0.690  0.688  0.938
B1-I32           1.294  1.076  1.155  0.998  0.639  0.602  1.004
B1-I33           1.115  0.696  0.698  0.723  0.672  0.722  0.711
B1-I34           0.951  0.825  0.797  0.761  0.614  0.670  0.746
B1-I35           0.853  0.917  0.901  0.883  0.779  0.833  0.846
B1-I36           0.736  0.689  0.692  0.747  0.702  0.775  0.731
B1-I37           1.325  1.349  1.397  1.334  0.529  0.647  1.343
B1-I38           1.367  1.284  1.392  1.384  0.642  0.682  1.434
B1-I39           0.897  0.874  0.885  0.894  0.585  0.598  0.873
B1-I40           1.353  1.298  1.302  1.403  0.641  0.413  1.363
B1-I41           0.858  0.803  0.769  0.770  0.635  0.734  0.749
B1-I42           1.251  1.247  1.280  1.193  0.536  0.691  1.263
B2-I1            0.958  1.163  1.107  0.848  0.761  0.852  0.955
B2-I2            0.966  0.874  0.826  0.762  0.608  0.666  0.758
B2-I3            0.829  0.836  0.825  0.723  0.569  0.700  0.724
B2-I4            0.760  0.793  0.803  0.771  0.577  0.681  0.744
B2-I5            0.836  0.780  0.768  0.737  0.537  0.699  0.733
B2-I6            0.753  0.853  0.836  0.751  0.583  0.633  0.745
B2-I7            0.853  0.805  0.813  0.760  0.696  0.635  0.747
B2-I8            0.820  0.801  0.788  0.717  0.664  0.790  0.707
B2-I9            0.830  0.703  0.698  0.716  0.639  0.637  0.718
B2-I10           0.806  0.850  0.840  0.788  0.656  0.700  0.777
B2-I11           0.831  0.806  0.800  0.826  0.673  0.788  0.826
B2-I12           0.804  0.900  0.863  0.824  0.617  0.715  0.801
B2-I13           0.834  0.806  0.793  0.816  0.604  0.631  0.823
B2-I14           0.754  0.749  0.741  0.744  0.654  0.658  0.739
B2-I15           0.686  0.785  0.760  0.739  0.604  0.679  0.685
B2-I16           0.875  0.792  0.784  0.768  0.608  0.691  0.770
B2-I17           0.984  0.888  0.860  0.790  0.665  0.696  0.792
B2-I18           0.835  0.793  0.777  0.747  0.622  0.661  0.730
B2-I19           1.351  1.153  1.144  0.986  0.781  0.833  1.062
B2-I20           0.912  0.902  0.887  0.893  0.681  0.783  0.902
B2-I21           0.951  0.774  0.759  0.734  0.602  0.650  0.740
B2-I22           0.873  0.859  0.834  0.788  0.532  0.646  0.775
B2-I23           0.849  0.816  0.795  0.768  0.577  0.643  0.752
B2-I24           0.780  0.893  0.874  0.775  0.580  0.645  0.774
(continued on next page)
Table 4 (continued)

                 BU     AHP    PHA    FP     NND1   NND2   OPT
Table 6
SMAPE for all 25 series of the electricity demand data. In bold the best performing approach.

                 BU     AHP    PHA    FP     NND    OPT

5.2. Results
On the Italian dataset, the Friedman and Nemenyi tests show that NND1 and NND2 provide significantly better forecasts than the rest of the methods found in the literature, with the bottom-up performing worst. The bad performance of the bottom-up method can be attributed to the demand at the most granular level of the hierarchy being challenging to model and forecast effectively due to its too sparse and erratic nature. The majority of the item-level time series display sporadic sales, including zeros, and the promotion of an item does not always correspond to an increase in sales. By using traditional methods or combinations of methods to generate base forecasts for the time series at the lowest level, we end up with flat-line forecasts, representing the average demand, failing to account for the seasonality that truly exists but is impossible to identify within the noise. By focusing our attention at the highest or some intermediate level of the hierarchy, we have enough data to build decent models capturing the underlying trend and seasonality. Indeed, the aggregation tends to regularize the demand and make it easier to forecast. The only level for which the optimal reconciliation approach is the best is the top level. As we move down the hierarchy, our approach outperforms all the top-down approaches, the bottom-up method, and the optimal reconciliation, with the NND iterative top-down (NND2) performing best at the brand level and the NND standard top-down (NND1) performing best at the item level, on average.

In Tables 5 and 6 we present the MASE and SMAPE for all 25 time series of the electricity demand dataset. Note that here we only have two levels, so that NND1 and NND2 coincide (which is why we call it simply NND in the tables). In Figs. 6 and 7 we show the outcome of the Friedman and Nemenyi tests with a confidence level of 95%, with respect to MASE and SMAPE. We find that all the top-down approaches perform best at the grid level. On this dataset, the optimal reconciliation method and the bottom-up approach show good performance. The good performance of the bottom-up method with respect to the classical top-down approaches can be attributed to the strong seasonality of the series, even at the bottom level. Our NND clearly outperforms all the competitors, ranking first in the tests and having a better average error.

In Table 7, we provide results for all 44 time series of the Walmart dataset. For the NND1, we directly generate forecasts at the category level using the total aggregate at level 0. We aggregate these forecasts to obtain first the store-level forecasts, and then the state-level forecasts. For the NND2, instead, we train a disaggregation model that outputs the state-level forecasts starting from the total aggregate series at level 0, one NND for each state-level series to generate forecasts for the stores of the geographical area they belong to, and finally, one model for each store to obtain the bottom-level forecasts at the category level. Overall, for the entire hierarchy, we train one NND at the top level, 3 NNDs in parallel at the state level, and 10 NNDs in parallel at the store level.

In Fig. 8, we plot the results of the Friedman and Nemenyi tests for all 44 series of the Walmart dataset. Fig. 8 shows that NND1 and NND2 are statistically equivalent on this dataset, even though, on average, the error produced by NND2 is lower. Anyhow, both NND1 and NND2 outperform the competitors. On this dataset, the bottom-up approach performs best at the most aggregate level, and it is quite competitive with the optimal combination approach, since the time series at the category level display a strong seasonality component. Indeed, sales are relatively high on weekends in comparison to normal days, and this behavior propagates as we go up the hierarchy. As we move down the hierarchy, our approach outperforms all the top-down approaches, the bottom-up method, and the optimal reconciliation.

Table 7
MASE for all 44 series of the Walmart data. In bold the best performing approach.

                  BU     AHP    PHA    FP     NND1   NND2   OPT
Average Store     0.776  1.109  1.089  0.804  0.615  0.588  0.777

CA1-Foods         0.823  1.385  1.432  0.864  0.521  0.590  0.827
CA1-Hobbies       0.722  0.840  0.848  0.735  0.733  0.686  0.722
CA1-Household     0.742  1.328  1.628  0.798  0.673  0.689  0.743
CA2-Food          0.808  1.060  1.111  0.838  0.711  0.713  0.809
CA2-Hobbies       0.792  1.015  1.020  0.819  0.753  0.755  0.791
CA2-Household     0.800  1.549  1.507  0.826  0.783  0.781  0.796
CA3-Food          0.864  1.393  1.317  0.887  0.754  0.788  0.862
CA3-Hobbies       0.749  0.889  0.898  0.775  0.697  0.647  0.750
CA3-Household     0.807  1.283  1.256  0.884  0.745  0.777  0.811
CA4-Food          0.770  1.199  1.161  0.803  0.726  0.747  0.771
CA4-Hobbies       0.709  0.967  0.980  0.720  0.709  0.667  0.709
CA4-Household     0.712  1.402  1.389  0.758  0.692  0.631  0.713
TX1-Food          0.761  1.347  1.315  0.773  0.644  0.523  0.763
TX1-Hobbies       0.728  0.912  0.914  0.738  0.756  0.719  0.728
TX1-Household     0.770  1.218  1.164  0.808  0.679  0.660  0.771
TX2-Food          0.778  1.308  1.800  0.774  0.615  0.699  0.776
TX2-Hobbies       0.722  0.852  0.848  0.724  0.773  0.622  0.721
TX2-Household     0.772  0.974  0.965  0.794  0.652  0.628  0.771
TX3-Food          0.735  0.845  0.813  0.745  0.655  0.582  0.735
TX3-Hobbies       0.739  1.334  1.329  0.749  0.646  0.679  0.739
TX3-Household     0.771  1.193  1.143  0.793  0.714  0.776  0.769
WI1-Food          0.760  1.489  1.374  0.755  0.622  0.711  0.756
WI1-Hobbies       0.708  0.992  0.992  0.710  0.604  0.622  0.708
WI1-Household     0.761  1.361  1.316  0.766  0.741  0.779  0.760
WI2-Food          0.727  0.771  0.767  0.732  0.657  0.566  0.728
WI2-Hobbies       0.749  1.057  1.052  0.767  0.634  0.671  0.749
WI2-Household     0.847  1.173  1.123  0.883  0.775  0.681  0.849
WI3-Food          0.779  1.031  0.996  0.797  0.709  0.605  0.780
WI3-Hobbies       0.748  0.924  0.897  0.763  0.605  0.663  0.748
WI3-Household     0.825  0.878  0.874  0.873  0.737  0.710  0.826
Average Category  0.765  1.132  1.141  0.788  0.690  0.679  0.768

Fig. 8. Nemenyi test results at 95% confidence level for all 44 series of the Walmart data. The hierarchical forecasting methods are sorted vertically according to the MASE mean rank.
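The statistical comparison reported in the figures can be reproduced along the following lines; the sketch assumes a matrix of per-series errors and takes the alpha = 0.05 two-tailed critical values of the studentized range from Demšar (2006):

```python
import numpy as np
from scipy import stats

# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05 (Demsar, 2006),
# indexed by the number of compared methods k.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728,
              6: 2.850, 7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def friedman_nemenyi(errors: np.ndarray):
    """errors: (n_series, k_methods) per-series MASE (or SMAPE) values.
    Returns the Friedman p-value, the mean rank of each method, and the
    Nemenyi critical difference CD = q_alpha * sqrt(k (k + 1) / (6 n));
    two methods differ significantly if their mean ranks differ by > CD."""
    n, k = errors.shape
    _, pvalue = stats.friedmanchisquare(*errors.T)
    mean_ranks = stats.rankdata(errors, axis=1).mean(axis=0)
    cd = Q_ALPHA_05[k] * np.sqrt(k * (k + 1) / (6.0 * n))
    return pvalue, mean_ranks, cd

# e.g. for Table 7: errors has 44 rows (series) and 7 columns (methods).
```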
Fig. B.1. NND out-of-sample forecasts for the Italian sales dataset. Series B1, B2, B3 and B4 (last 6 months).
Fig. B.2. NND out-of-sample forecasts for the electricity demand dataset. Series M11, M16, M18 and M19 (first 72 h).
Fig. B.3. NND out-of-sample forecasts for the Walmart dataset. Time series CA4-Household, CA1-Foods, TX1 and WI2-Foods (last 6 months).
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., & Shlens, J. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems.

Athanasopoulos, G., Ahmed, R. A., & Hyndman, R. J. (2009). Hierarchical forecasts for Australian domestic tourism. International Journal of Forecasting, 25, 146–166.

Bandara, K., Bergmeir, C., & Smyl, S. (2020). Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert Systems with Applications, 140, Article 112896.

Bontempi, G., Taieb, S. B., & Le Borgne, Y.-A. (2012). Machine learning strategies for time series forecasting. In European business intelligence summer school (pp. 62–77). Springer.

Carta, S., Corriga, A., Ferreira, A., Podda, A. S., & Recupero, D. R. (2021). A multi-layer and multi-ensemble stock trader using deep learning and deep reinforcement learning. Applied Intelligence, 51, 889–905.

Caruana, R., Lawrence, S., & Giles, L. (2000). Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of the 13th International Conference on Neural Information Processing Systems NIPS'00 (pp. 381–387). MIT Press.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.

Di Fonzo, T., & Girolimetto, D. (2020). Cross-temporal forecast reconciliation: Optimal combination method and heuristic alternatives. arXiv preprint arXiv:2006.08570.

Dunn, D. M., Williams, W. H., & Dechaine, T. L. (1976). Aggregate versus subaggregate models in local area forecasting. Journal of the American Statistical Association, 71, 68–71.

Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., & Muller, P.-A. (2019). Deep learning for time series classification: A review. Data Mining and Knowledge Discovery, 33, 917–963.

Ferreira, M. D., Corrêa, D. C., Nonato, L. G., & de Mello, R. F. (2018). Designing architectures of convolutional neural networks to solve practical problems. Expert Systems with Applications, 94, 205–217.

Franses, P. H., & Legerstee, R. (2011). Combining SKU-level sales forecasts from models and experts. Expert Systems with Applications, 38, 2365–2370.

Gross, C. W., & Sohl, J. E. (1990). Disaggregation methods to expedite product line forecasting. Journal of Forecasting, 9, 233–254.

Hollyman, R., Petropoulos, F., & Tipping, M. E. (2021). Understanding forecast reconciliation. European Journal of Operational Research.

Huber, J., Gossmann, A., & Stuckenschmidt, H. (2017). Cluster-based hierarchical demand forecasting for perishable goods. Expert Systems with Applications, 76, 140–151.

Hyndman, R., & Athanasopoulos, G. (2018). Forecasting: Principles and practice. OTexts.

Hyndman, R., Lee, A., Wang, E., & Wickramasuriya, S. (2018). hts: Hierarchical and grouped time series. R package version 5.1.5.

Hyndman, R. J. (2006). Another look at forecast-accuracy metrics for intermittent demand. Foresight: The International Journal of Applied Forecasting, 4, 43–46.

Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55, 2579–2589.

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 26, 1–22.

Kanarachos, S., Christopoulos, S.-R. G., Chroneos, A., & Fitzpatrick, M. E. (2017). Detecting anomalies in time series data via a deep learning algorithm combining wavelets, neural networks and Hilbert transform. Expert Systems with Applications, 85, 292–304.

Kingma, D. P., & Ba, J. (2017). Adam: A method for stochastic optimization. arXiv:1412.6980.

Koning, A. J., Franses, P. H., Hibon, M., & Stekler, H. O. (2005). The M3 competition: Statistical tests of the results. International Journal of Forecasting, 21, 397–409.

Kourentzes, N., & Athanasopoulos, G. (2019). Cross-temporal coherent forecasts for Australian tourism. Annals of Tourism Research, 75, 393–409.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.

Liu, Y., Gong, C., Yang, L., & Chen, Y. (2020). DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Systems with Applications, 143, Article 113082.

Maçaira, P. M., Thomé, A. M. T., Oliveira, F. L. C., & Ferrer, A. L. C. (2018). Time series analysis with explanatory variables: A systematic literature review. Environmental Modelling & Software, 107, 199–209.

Mancuso, P., Piccialli, V., & Sudoso, A. M. (2021). Hierarchical sales data of an Italian grocery store. Mendeley Data, V1. https://doi.org/10.17632/s8dgbs3rng.1.

Nenova, Z. D., & May, J. H. (2016). Determining an optimal hierarchical forecasting model based on the characteristics of the data set: Technical note. Journal of Operations Management, 44, 62–68.

Nespoli, L., Medici, V., Lopatichki, K., & Sossan, F. (2020). Hierarchical demand forecasting benchmark for the distribution grid. Electric Power Systems Research, 189, Article 106755.

Shlifer, E., & Wolff, R. W. (1979). Aggregation and proration in forecasting. Management Science, 25, 594–603.

Spiliotis, E., Abolghasemi, M., Hyndman, R. J., Petropoulos, F., & Assimakopoulos, V. (2020a). Hierarchical forecast reconciliation with machine learning. arXiv preprint arXiv:2006.02043.

Spiliotis, E., Petropoulos, F., Kourentzes, N., & Assimakopoulos, V. (2020). Cross-temporal aggregation: Improving the forecast accuracy of hierarchical electricity consumption. Applied Energy, 261, Article 114339.

Timmermann, A. (2006). Forecast combinations. Handbook of Economic Forecasting, 1, 135–196.

Wickramasuriya, S. L., Athanasopoulos, G., & Hyndman, R. J. (2019). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 114, 804–819.

Ye, R., & Dai, Q. (2021). Implementing transfer learning across different datasets for time series forecasting. Pattern Recognition, 109, Article 107617.