0% found this document useful (0 votes)
25 views17 pages

A Machine Learning Approach For Forecasting Hierarchical Time Series

This document summarizes a research paper that proposes a machine learning approach for forecasting hierarchical time series. The approach uses a deep neural network to directly produce accurate and reconciled forecasts for hierarchical time series, rather than using a two-stage process of first generating individual forecasts and then reconciling them. The effectiveness of the approach is validated on three real-world hierarchical time series datasets, where it outperforms state-of-the-art methods by increasing forecasting accuracy while ensuring coherent forecasts across the hierarchy.

Uploaded by

sri baginda
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views17 pages

A Machine Learning Approach For Forecasting Hierarchical Time Series

This document summarizes a research paper that proposes a machine learning approach for forecasting hierarchical time series. The approach uses a deep neural network to directly produce accurate and reconciled forecasts for hierarchical time series, rather than using a two-stage process of first generating individual forecasts and then reconciling them. The effectiveness of the approach is validated on three real-world hierarchical time series datasets, where it outperforms state-of-the-art methods by increasing forecasting accuracy while ensuring coherent forecasts across the hierarchy.

Uploaded by

sri baginda
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Expert Systems With Applications 182 (2021) 115102

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

A machine learning approach for forecasting hierarchical time series


Paolo Mancuso a, *, Veronica Piccialli b, Antonio M. Sudoso b
a
Department of Industrial Engineering, University of Rome Tor Vergata, Italy
b
Department of Civil Engineering and Computer Science Engineering, University of Rome Tor Vergata, Italy

A R T I C L E I N F O A B S T R A C T

Keywords: In this paper, we propose a machine learning approach for forecasting hierarchical time series. When dealing
Hierarchical time series with hierarchical time series, apart from generating accurate forecasts, one needs to select a suitable method for
Forecast producing reconciled forecasts. Forecast reconciliation is the process of adjusting forecasts to make them
Machine learning
coherent across the hierarchy. In literature, coherence is often enforced by using a post-processing technique on
Deep neural network
the base forecasts produced by suitable time series forecasting methods. On the contrary, our idea is to use a deep
neural network to directly produce accurate and reconciled forecasts. We exploit the ability of a deep neural
network to extract information capturing the structure of the hierarchy. We impose the reconciliation at training
time by minimizing a customized loss function. In many practical applications, besides time series data, hier­
archical time series include explanatory variables that are beneficial for increasing the forecasting accuracy.
Exploiting this further information, our approach links the relationship between time series features extracted at
any level of the hierarchy and the explanatory variables into an end-to-end neural network providing accurate
and reconciled point forecasts. The effectiveness of the approach is validated on three real-world datasets, where
our method outperforms state-of-the-art competitors in hierarchical forecasting.

1. Introduction two types of coherence are pursued with different and dedicated ap­
proaches, apart from some recent papers (Kourentzes & Athanasopou­
A hierarchical time series is a collection of time series organized in a los, 2019; Fonzo & Girolimetto, 2020; Spiliotis, Petropoulos,
hierarchical structure that can be aggregated at different levels Kourentzes, & Assimakopoulos, 2020). In this paper, we focus on cross-
(Hyndman & Athanasopoulos, 2018). As an example, Stock Keeping sectional coherence. In the literature, two lines of research among others
Unit (SKU) sales aggregate up to product subcategory sales, which are pursued: top-down and bottom-up approaches. Top-down ap­
further aggregate to product categories (Franses & Legerstee, 2011). proaches involve forecasting first the top-level series and then dis­
Hierarchical forecasting is a very important application of expert sys­ aggregating by means of historical (Gross & Sohl, 1990) or forecasted
tems for decision-making (Huber, Gossmann, & Stuckenschmidt, 2017). proportion (Athanasopoulos, Ahmed, & Hyndman, 2009) to get fore­
In order to support decision-making at different levels of the hierarchy, a casts for the lower-level series. On the other hand, the bottom-up
challenging task is the generation of coherent forecasts. Forecasts of the approach produces first forecasts for the bottom-level time series and
individual series are coherent when they sum up in a proper way across then aggregates them to get the forecasts for the higher-level time series.
the levels preserving the hierarchical structure. Both classes of methods have their advantages since top-down ap­
Coherence can be required either at the cross-sectional level or at the proaches perform well when the top-level series is easy to forecast,
temporal level. For example, at the cross-sectional level, forecasts of whereas the bottom-up method accurately identifies the pattern of each
regional sales should sum up to give forecasts of state sales, which series without loss of information. However, the bottom-up approach
should, in turn, sum up to give forecasts for the national sales. For ignores correlations among the series, possibly leading to aggregate
temporal coherence instead, forecasts at the day level must sum up forecasts worse than the ones produced by top-down approaches (Shlifer
coherently at the week level, then at the month level, and so on. & Wolff, 1979). In general, a bottom-up approach should be preferable
Recently, hierarchical time series attracted attention, see Hollyman, whenever the forecasts are employed to support decisions that are
Petropoulos, and Tipping (2021) and references therein. Usually, the mainly related to the bottom rather than the top of the hierarchy,

* Corresponding author.
E-mail addresses: [email protected] (P. Mancuso), [email protected] (V. Piccialli), [email protected] (A.M. Sudoso).

https://doi.org/10.1016/j.eswa.2021.115102
Received 7 June 2020; Received in revised form 14 April 2021; Accepted 21 April 2021
Available online 2 May 2021
0957-4174/© 2021 Elsevier Ltd. All rights reserved.
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

whereas a top-down approach performs better when the bottom-level Weber, Idoumghar, & Muller, 2019) but they have not been used in
series are too noisy (Dunn, Williams, & Dechaine, 1976). The objec­ hierarchical time series forecasting. Our intuition is that extracting in­
tive to reconcile forecasts at all levels of the hierarchy has lead re­ formation at any level of the hierarchy through a CNN can be used to
searchers to investigate the impact that the association between bottom- discover the structure of the series below.
level series produces on the aggregation (Nenova & May, 2016). Hierarchical forecasting is relevant in many applications, such as
Analytical approaches to the forecast reconciliation problem have been energy and tourism, and it is common in the retail industry where the
proposed by Hyndman, Ahmed, Athanasopoulos, and Shang (2011) and SKU demand can be grouped at different levels. Therefore, we prove the
by Wickramasuriya, Athanasopoulos, and Hyndman (2019). These effectiveness of our method using three public datasets coming from
methods not only ensure that forecasts are coherent but also lead to real-world applications. The first one considers five years of sales data of
improvements in forecast accuracy. However, a shortcoming of these an Italian grocery store, has three levels and noisy bottom-level series.
methods is the need for two stages, with forecasts first produced inde­ This dataset has been made public by the authors (see Mancuso, Pic­
pendently for each series in the hierarchy, and then optimally combined cialli, & Sudoso, 2021). The second one has two levels and comes from
to satisfy the aggregation constraint. Therefore, the reconciliation is the electricity demand data in Switzerland (Nespoli, Medici, Lopatichki, &
result of post-processing on the base forecasts. In Hollyman et al. (2021), Sossan, 2020); it has quite regular bottom-level series, whereas the third
all the above-mentioned methods are reconsidered within the frame­ one with four levels is extracted from the Walmart data used in the M5
work of forecast combinations, showing that they can all be re- forecasting competition. On all these datasets, our method increases the
interpreted as particular examples of forecast combination where the forecasting accuracy of the hierarchy outperforming state-of-the-art
coherence constraint is enforced with different strategies. The authors approaches, as confirmed by deep statistical analysis.
also show that combining forecasts at the bottom level of the hierarchy However, our methodology for forecasting hierarchical time series
can be exploited to improve the accuracy of the higher levels. shares the same limitations of the machine learning approaches for
In recent years, machine learning models, especially based on neural forecasting non–hierarchical time series: it is not viable for time series
networks, have emerged in the literature as an alternative to statistical with a too-small number of historical observations (i.e. a few years of
methods for forecasting non–hierarchical time series. Indeed, many observations for daily time series are needed). Summarizing, the main
papers define new machine learning algorithms (see for example Bon­ contributions of the paper are:
tempi, Taieb, & Le Borgne, 2012; Liu, Gong, Yang, & Chen, 2020;
Bandara, Bergmeir, & Smyl, 2020; Carta, Corriga, Ferreira, Podda, & 1. For the first time, we introduce the use of machine learning in the
Recupero, 2021; Ye & Dai, 2021), proposing innovative forecasting forecasting of hierarchical time series, defining a methodology that
strategies that aim at improving the accuracy of time series predictions. can be used at any level of the hierarchy to generate accurate and
Drawing inspiration from this line of research, we propose a machine coherent forecasts for the lower levels.
learning approach for forecasting hierarchical time series. Using ma­ 2. Our method uses a deep neural network that is able at once to
chine learning in hierarchical time series has also been considered automatically extract the relevant features of the hierarchy while
recently in Spiliotis, Abolghasemi, Hyndman, Petropoulos, and Assim­ forcing the reconciliation and easily exploiting the exogenous vari­
akopoulos (2020). The authors propose a bottom-up method where the ables at any level of the hierarchy.
forecasts of the series of the bottom level are produced by a machine 3. We consider three real-world datasets, and we perform comparisons
learning model (Random Forest and XGBoost), taking as input the base with state-of-the-art methods in hierarchical forecasting. Further­
forecasts of all the series of the hierarchy. The reconciliation is then more, a deep statistical analysis assesses the superiority of our
obtained by summing up the bottom-level forecasts. Rather than approach in comparison to standard methods.
formulating the reconciliation problem as a post-processing technique 4. We share with the research community a new challenging dataset for
or just forecasting the bottom-level time series, our idea is to define a hierarchical forecasting coming from the sales data of an Italian
method that can automatically extract at any level of the hierarchy all grocery store.
the relevant information, keeping into account during the training also
the reconciliation. Furthermore, our approach is able to easily incor­ The rest of the paper is organized as follows. Section 2 discusses the
porate at any level the information provided by the explanatory concept of hierarchical time series and the methods of hierarchical
variables. forecasting. Section 3 contains the detail of the proposed machine
Forecasting models for time series with explanatory variables aim to learning algorithm. Section 4 describes the basic forecasting methods
predict correlated data taking into account additional information, employed in the hierarchical models and the experimental setup. Sec­
known as exogenous variables. It is well known that incorporating tion 5 discusses the datasets and the numerical experiments conducted
explanatory variables in time series models helps to improve the forecast to evaluate the proposed method. Finally, Section 6 concludes the paper.
accuracy (see Maçaira, Thomé, Oliveira, & Ferrer, 2018 for a systematic
literature review), thus in this paper we focus on these types of time 2. Hierarchical Time Series
series in the context of hierarchical forecasting. Our idea is to combine
the explanatory variables with time series features defining the structure In a general hierarchical structure with K > 0 levels, level 0 is
of the hierarchy to enhance the reconciliation and forecasting process. defined as the completely aggregated series. Each level from 1 to K − 2
The main instrument we use to extract time series features is a Deep denotes a further disaggregation down to level K − 1 containing the most
Neural Network (DNN). DNNs are designed to learn hierarchical rep­ disaggregated time series. In a hierarchical time series, the observations
resentations of data (LeCun, Bengio, & Hinton, 2015). Thanks to the at higher levels can be obtained by summing up the series below. Let
ability to extract meaningful features from data, Convolutional Neural ykt ∈ Rmk be the vector of all observations at level k = 1, …, K − 1 and t =
Networks (CNNs) have been successful in time series forecasting and 1, …, T, where mk is the number of series at level k and M = K−
∑ 1
k=0 mk is
classification producing state-of-the-art results (Fawaz, Forestier, the total number of series in the hierarchy. Then we define the vector of

2
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

all observations of the hierarchy:


⎛ ⎞ 1∑ T
yK−
t,i
1
pi = , i = 1, …, mK− 1 .
y0 T t=1 y0t
⎜ t ⎟
⎜ y1 ⎟
⎜ ⎟
yt = ⎜ t ⎟, In the PHA approach, the proportions are obtained in the following
⎜ ⋮ ⎟
⎝ ⎠ manner:
yK−
t
1


T yK−
t,i
1

T
where y0t is the observation of the series at the top and the vector yK−
t
1
pi = t=1
, i = 1, …, mK− 1 .
∑T
contains the observations of the series at the bottom of the hierarchy. y0t
T
The structure of the hierarchy is determined by the summing matrix S t=1

that defines the aggregation constraints:


For these two methods, once the bottom-level h-step-ahead forecasts
yt = SyK−
t
1
. have been generated, these are aggregated to produce coherent forecasts
for the rest of the series of the hierarchy by using the summing matrix.
The summing matrix S is a matrix having entries belonging to {0, 1} of Given the vector of proportions p, top-down approaches can be repre­
size M × mK− 1 . sented as:
Given observations at time t = 1, …, T and the forecasting horizon h,
the aim is to forecast each series at each level at time t = T + 1,...,T + h. yh = Sp̂y 0h .
̃
The current methods of forecasting hierarchical time series are top-
Top-down approaches based on historical proportions usually produce
down, bottom-up, middle-out, and optimal reconciliation (Hyndman &
less accurate forecasts at lower levels of the hierarchy than bottom-up
Athanasopoulos, 2018; Hollyman et al., 2021). The main objective of
approaches because they don’t take into account that these pro­
such approaches is to ensure that forecasts are coherent across the levels
portions may change over time. To address this issue, instead of using
of the hierarchy. Regardless of the methods used to forecast the time
the static proportions as in AHP and PHA, Athanasopoulos et al. (2009)
series for the different levels of the hierarchy, the individual forecasts
propose the Forecasted Proportion (FP) method in which proportions
must be reconciled to be useful for any subsequent decision making.
are based on forecasts rather than on historical data. It first generates an
Forecast reconciliation is the process of adjusting forecasts to make them
independent base forecast for all series in the hierarchy, then for each
coherent. By definition, a forecast is coherent if it satisfies the aggre­
level, from the top to the bottom, the proportion of each base forecast to
gation constraints defined by the summing matrix.
the aggregate of all the base forecasts at that level are calculated. For a
hierarchy with K levels we have:
2.1. Bottom-up approach

K − 2
y kt,i
̂
pi = , i = 1, …, mK− 1 .
The bottom-up approach focuses on producing the h-step-ahead base k=0 σ k+1
̂ t,i
forecasts for each series at the lowest level ̂ hand aggregating them to
y K− 1

the upper levels of the hierarchy according to the summing matrix. It can k
y t,i is the base forecast of the series that corresponds to the node
where ̂
be represented as follows:
which is k levels above node i, and ̂ σ k+1
t,i is the sum of the base forecasts
̃
yh =
K− 1
Ŝy h , below the series that is k levels above node i and directly in contact with
that series.
where ̃yh is the vector of coherent h-step-ahead forecasts for all series of
the hierarchy. An advantage of this approach is that we directly forecast 2.3. Middle-out approach
the series at the bottom level, and no information gets lost due to the
aggregation. On the other hand, bottom-level series can be noisy and The middle-out method can be seen as a combination of the top-
more challenging to model and forecast. This approach also has the down and bottom-up approaches. It combines ideas from both
disadvantage of having many time series to forecast if there are many methods by starting from a middle level where forecasts are reliable. For
series at the lowest level. the series above the middle level, coherent forecasts are generated using
the bottom-up approach by aggregating these forecasts upwards. For the
2.2. Top-down approaches series below the middle level, coherent forecasts are generated using a
top-down approach by disaggregating the middle-level forecasts
Top-down approaches first involve generating the base forecasts for downwards.
the total series and then disaggregating these downwards to get coherent
forecasts for each series of the hierarchy. The disaggregation of the top- 2.4. Optimal reconciliation
level forecasts is usually achieved by using the proportions p =
(p1 , …, pmK− 1 )T , which represent the relative contribution of the bottom- Hyndman et al. (2011) propose a novel approach that provides
level series to the top-level aggregate. The two most commonly used top- optimal forecasts that are better than forecasts produced by either a top-
down approaches are the Average Historical Proportions (AHP) and the down or a bottom-up approach. Their proposal is independently fore­
Proportions of the Historical Averages (PHA). In the case of the AHP, the casting all series at all levels of the hierarchy and then using a linear
proportions are calculated as follows: regression model to optimally combine and reconcile these forecasts.

3
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Fig. 1. A top-level series at level k and the bottom-level series at level k + 1.

Fig. 2. Decomposition of the aggregated forecast through a neural network: Neural Network Disaggregation (NND).

Their approach uses a generalized least squares estimator that requires approach is that allows for the correlations between the series at each
an estimate of the covariance matrix of the errors that arise due to level using all the available information within the hierarchy. However,
incoherence. In a recent paper, Wickramasuriya et al. (2019) show that it is computationally expensive compared to the other methods intro­
this matrix is impossible to estimate in practice, and they propose a duced so far because it requires to individually forecast the time series at
state-of-the-art forecast reconciliation approach, called Minimum Trace all the levels.
(MinT) that incorporates the information from a full covariance matrix
of forecast errors in obtaining a set of coherent forecasts. MinT mini­ 3. Neural network disaggregation
mizes the mean squared error of the coherent forecasts across the entire
hierarchy with the constraint of unbiasedness. The resulting revised According to Hyndman and Athanasopoulos (2018), standard top-
forecasts are coherent, unbiased, and have minimum variance amongst down approaches have the disadvantage of information loss since they
all combination forecasts. An advantage of the optimal reconciliation are unable to capture the individual time series characteristics. On the

4
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Fig. 3. Our model has one branch that accepts the numerical data (left) and another branch that accepts time series data (right).

other hand, the bottom-up approach does not exploit the characteristics
where yk+1,j is the vector of size mk+1 containing the series at level k + 1,
of the time series at intermediate levels. Departing from the related t j

literature to the best of our knowledge, we propose a new approach that yk,p
t,j is the aggregate time series corresponding to the node j at level k
first generates an accurate forecast for the aggregated time series at a connected to the parent node p at level k − 1, l is the number of the
chosen level of the hierarchy and then disaggregates it downwards. We lagged time steps of the aggregated series, xt,i is a vector of the external
formulate the disaggregation problem as a non-linear regression prob­ regressors for each series at level k + 1, f is a non-linear function
lem, and we solve it with a deep neural network that jointly learns how learned, in our case, by a feed-forward neural network and ∊ is the error
to disaggregate and generate coherent forecasts across the levels of the term.
hierarchy. To explain the proposed algorithm, we focus on two Given any aggregate time series yk,p
t,j and the vector of series yt , the
k+1,j

consecutive levels with the top-level time series being at node j of level k
algorithm is made up of two steps. In the first one, the best forecasting
and the bottom-level series at level k +1 (see Fig. 1).
model for the aggregated time series is chosen, and the neural network is
Let mk+1
j be the number of series at level k +1 connected to the parent trained with the real values of the training set of the two levels time
node j at level k, then we model the disaggregation procedure as a non- series. In the second step, forecasts for the aggregated time series are fed
linear regression problem: to the neural network to obtain forecasts for all the lower-level time
( ) series. The flow chart of the proposed algorithm is shown in Fig. 2. More
yk+1,j
t = f yk,p k,p k,p
t,j , yt− 1,j , …, yt− l,j , xt,1 , …, xt,mk+1 + ∊, (1) in detail, the two steps are the following:
j

5
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

1. [Step 1] In the training phase, the best forecasting model F* for the of the whole hierarchy in two different ways:
time series yk,p
t,j is chosen based on the training set. At the same time,
1. Standard top-down: a forecasting model F* is developed for the
the neural network is trained taking as input the training set of yk,p
t,j aggregate at level 0, and a single disaggregation model NDD is
with lagged time steps and the explanatory variables xt,i relative to trained with the series at level 0 and K − 1. Therefore, forecasts for
the training set of yk+1,j
t . The output are the true values of the dis­ the bottom-level series are produced by looking only at the aggre­
aggregated time series yk+1,j
t . In order to simplify the notation, from gated series at level 0. Then, the bottom-level forecasts are aggre­
now on we refer to the produced model as NND (Neural Network gated to generate coherent forecasts for the rest of the series of the
Disaggregation). hierarchy.
k,p 2. Iterative top-down: the forecasting model F* for an aggregate at level k
2. [Step 2] In the disaggregation or test phase, forecasts ̂
y t,j relative to
is the disaggregation model NDD trained with the series at level k − 1
the time period of the test set are generated by the model F* . Finally,
and k, for each k = 1,…,K − 1. At level 0, instead, F* is the best model
these forecasts are fed to the trained NND to produce the dis­
selected among a set of standard forecasting methods. Forecasts for
aggregated forecasts ̂
y k+1,j
t for the test set. all the levels are then obtained by feeding forecasts to the disag­
gregation models at each level.
In general, the learned function f generates base forecasts that are not
coherent since they do not sum up correctly according to the hierar­ The difference between the two approaches is that in the standard top-
chical structure. In order to ensure that forecasts are reconciled across down, bottom-level forecasts are generated with only one disaggrega­
the hierarchy, we want f to output a set of forecasts that are as close as tion model, whereas in the iterative version, a larger number of disag­
possible to the base forecasts, but also meet the requirement that fore­ gregation models is trained, one for each series to be disaggregated. To
casts at upper levels in the hierarchy are the sum of the associated lower- be more precise, to disaggregate the mk series at level k = 0, …, K − 2,
level forecasts. From an optimization perspective, we want to introduce exactly mk disaggregation models are trained in parallel. In this way, on
an equality constraint to the regression problem in such a way that we the one hand, we increase the variance of the approach (and the
can still use backpropagation to train the network. More in detail, we are computational time), but on the other hand, we reduce the bias since we
looking for the network weights such that the fitting error is minimized increase flexibility and keep into account more the variability at the
and, besides, we want the following constraint to hold: different levels.
We also notice that this algorithm can be easily plugged into a
(2)
k+1,j
yk,p T k+1,j
t,j = 1 yt = 1T ̂
yt = ̂y k,p
t,j ,
middle-out strategy: a forecasting model is developed for each aggregate
at a convenient level, and the disaggregation models are trained and
where 1 is the vector of all ones of size mk+1 .
j tested to distribute these forecasts to the series below. For the series
We impose the coherence by adding a term to the fitting error that above the middle level, coherent forecasts are generated using the
penalizes differences between the sum of the lower-level observations bottom-up approach.
and the sum of the lower-level forecasts: Regarding the choice of the neural network architecture, our objec­
( ) [( )
∑ T tive is to include in the model the relationship between explanatory
1 k+1,j 2
||yk+1,j y t || variables derived from the lower-level series, and the features of the
k+1,j k+1,j
L yt , ̂ yt = 1− α t − ̂
T t=1 aggregate series that describe the structure of the hierarchy. In order to
T ( )2 ]
∑ better capture the structure of the hierarchy, we use a Convolutional
+α T k+1,j
1 yt T k+1,j
− 1 ̂ yt , (3) Neural Network (CNN). CNNs are well known for creating useful rep­
t=1
resentations of time series automatically, being highly noise-resistant
where α ∈ (0, 1) is a parameter that controls the relative contribution of models, and being able to extract very informative, deep features,
each term in the loss function. Note that the two terms are on the same which are independent of time (Kanarachos, Christopoulos, Chroneos, &
scale and the parameter α measures the compromise between mini­ Fitzpatrick, 2017; Ferreira, Corrêa, Nonato, & de Mello, 2018). Our
mizing the fitting error and satisfying the coherence. A too small value of model is a deep neural network capable of accepting and combining
α will result in the corresponding constraint being ignored, producing, in multiple types of input, including cross-sectional and time series data, in
general, not coherent forecasts whereas a too large value will cause the a single end-to-end model. Our architecture is made up of two branches:
fitting error being ignored, producing coherent but possibly inaccurate the first branch is a simple Multi-Layer Perceptron (MLP) designed to
base forecasts. The idea is to balance the contribution of both terms by handle the explanatory variables xt,i such as, promotions, day of the
setting α = 0.5, that corresponds to giving the two terms the same week, or in general, special events affecting the time series of interest;
importance. In principle, the parameter α may be tuned on each the second branch is a one-dimensional CNN that extracts feature maps
instance. However, we did not investigate the tuning of α and kept it over fixed segments of length w from the aggregate series yk,p
t,j . Features
fixed to 0.5, since this setting allowed us to reach a satisfying reconcil­ extracted from the two subnetworks are then concatenated together to
iation error on all the experiments. form the final input of the multi-output regression model (see Fig. 3).
Top-down approaches distribute the top-level forecasts down the The output layer of the model is a standard regression layer with linear
hierarchy using historical or forecasted proportions of the data. In our activation function where the number of units is equal to the number of
case, explicit proportions are never calculated since the algorithm the series to forecast.
automatically learns how to disaggregate forecasts from any level of the
hierarchy to the series below without loss of information. Furthermore,
our method is flexible enough to be employed in the forecasting process

6
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Fig. 4. Expanding window procedure.

4. Experimental setup additional constraints. Specifically, the estimated coefficients βi are


constrained to be non-negative and to sum up to one. The weights
In this section, we resume first the forecasting models we use to obtained are easily interpretable as percentages devoted to each of
generate the base forecasts for the hierarchical approaches, then we the individual forecasts. Given the optimal weights, the composed

describe our strategy to select the best forecasting model and the forecast is obtained as ̂ yt = m y i,t for t = T + 1, …, T + h. From
i=1 βi ̂
implementation details. the mathematical point of view the following optimization problem
needs to be solved:
( )2
4.1. Forecasting models ∑
T+h ∑
m
min yt − βi ̂y i,t
In order to describe the methods, let (y1 , …, yT ) be an univariate time t=T+1 i=1
s.t. βi ⩾0 i = 1, …, m (4)
series of length T and (yT+1 , …, yT+h ) the forecasting period, where h is
∑ m
the forecast horizon. We consider the following models: βi = 1
i=1

1. Naive
2. Autoregressive Integrated Moving Average (ARIMA) Different from the simple average which does not need any training
3. Exponential Smoothing (ETS) as the weights are a function of m only, with this method we need to
4. Non-linear autoregression model (NAR) allocate a reserved portion of forecasts to train the meta-model.
5. Dynamic regression models: univariate time series models, such as
linear and non-linear autoregressive models, allow for the inclusion In particular, we consider two following composite models:
of information from past observations of a series, but not for the
inclusion of other information that may also affect the time series of 1. Combination of ARIMAX, NARX, and ETS forecasts obtained through
interest. Dynamic regression models allow keeping into account the the simple mean.
time-lagged relationship between the output and the lagged obser­ 2. Combination of ARIMAX, NARX, and ETS forecasts obtained by
vations of both the time series itself and of the external regressors. solving the constrained least squares problem.
More in detail, we consider two types of dynamic regression models:
(a) ARIMA model with exogenous variables (ARIMAX) We choose to combine these two dynamic regression models with
(b) NAR model with exogenous variables (NARX) exponential smoothing in order to take directly into account the effect of
the explanatory variables and the presence of linear and non-linear
In the literature, it has been pointed out that the performance of fore­ patterns in the series.
casting models could be improved by suitably combining forecasts from
standard approaches (Timmermann, 2006). An easy way to improve 4.2. Model selection
forecast accuracy is to use several different models on the same time
series and to average the resulting forecasts. We consider two ways of Following an approach widely employed in the machine learning
combining forecasts: literature, we separate the available data into two sets, training (in-
sample) and test (out-of-sample) data. The training data (y1 , …, yN ), a
1. Simple Average: the most natural approach to combine forecasts is to time series of length N, is used to estimate the parameters of a fore­
use the mean. The composite forecast in case of simple average is casting model and the test data (yN+1 ,…,yT ), that comes chronologically

y t = m1 m
given by ̂ y i,t for t = T +1, …, T +h where h is the forecast
i=1 ̂ after the training set, is used to evaluate its accuracy.
horizon, m is the number of combined models and ̂ y i,t is the forecast To achieve a reliable measure of model performance, we implement
at time t generated by model i. on the training set a procedure that applies a cross-validation logic
2. Constrained Least Squares Regression: the composed forecast is not a suitable for time series data. In the expanding window procedure
function of m only as in the simple average but is a linear function of described by Hyndman and Athanasopoulos (2018), the model is trained
the individual forecasts whereby the parameters are determined by on a window that expands over the entire history of the time series, and
solving an optimization problem. The approach proposed by Tim­ it is repeatedly tested against a forecasting window without dropping
mermann (2006) minimizes the sum of squared errors under some older data points. This method produces many different train/test splits,

7
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Table 1 Table 2
Hierarchy for the Italian sales data. Hierarchy for the electricity demand data.
Level Number of series Total series per level Level Number of series Total series per level

Store 1 1 Grid 1 1
Brand 4 4 Meter 24 24
Item 42–45–10–21 118

and the error on each split is averaged in order to compute a robust Table 3
estimate of the model error (see Fig. 4). The implementation of the Hierarchy for the Walmart data.
expanding window procedure requires four parameters: Level Number of series Total series per level

Total 1 1
• Starting window: the number of data points included in the first State 3 3
training iteration. Store 4–3–3 10
Category 3–3–3–3–3–3–3–3–3–3 30
• Ending window: the number of data points included in the last
training iteration.
• Forecasting window: number of data points included for forecasting. used to perform the hyperparameters optimization search in the space of
• Expanding steps: the number of data points added to the training time the neural network where F = {16, 32, 64}, K = {4, 8, 16} and H = {64,
series from one iteration to another. 128,256}. We evaluate the hyperparameters configuration on a held-out
validation set, and we choose the architecture achieving the best per­
For each series, the best performing model after the cross-validation formance on it (see A for the optimal hyperparameters of the trained
phase is retrained using the in-sample data, and forecasts are obtained models). The NND model takes as input mini-batches of 32 examples and
recursively over the out-of-sample period. The above procedure requires the loss function in Eq. (3) is minimized by using the Adam optimizer
a forecast error measure. We consider the Mean Absolute Scaled Error (Kingma & Ba, 2017) with the initial learning rate set to 0.001. The
(MASE) proposed by Hyndman (2006): network is trained for 500 epochs, and early stopping is used to stop the
T∑
+h training as soon as the error on the validation set starts to grow (Car­
uana, Lawrence, & Giles, 2000). The training time of a single disag­
1
h
|yi − ̂y i |
gregation model requires order of minutes on the Intel Core i7-8565U
i=T+1
MASE = ,

T
1
T− m
|yt − yt− m | CPU depending on the network dimension and on the granularity of the
dataset.
t=m+1

where the numerator is out-of-sample Mean Absolute Error (MAE) of the


method evaluated across the forecast horizon h, and the denominator is 5. Numerical experiments
the in-sample one-step ahead Naive forecast with seasonal period m.
We also consider the Symmetric Mean Absolute Percentage Error In this section, we aim to evaluate the effectiveness of our approach
(SMAPE) defined as follows: by comparing it with the hierarchical methods described in Section 2.
These methods are used for benchmarking the proposed method, as they
2 ∑T+h
|yi − ̂yi| have been successfully applied in numerous applications and are
SMAPE = .
h i=T+1 |yi | + |̂
yi | considered the state-of-the-art in the area of hierarchical forecasting
(Hollyman et al., 2021). In order to be as fair as possible in the com­
The SMAPE is easy to interpret, and has an upper bound of 2 when either parison, we perform model selection among the set of forecasting
actual or predicted values are zero or when actual and predicted are methods described in Section 4 whenever a base forecast is required.
opposite signs. However, the significant disadvantage of SMAPE is that it This means that different methods may be used for each time series of
produces infinite or undefined values where the actual values are zero or the hierarchy we are trying to forecast (bottom-level series for the
close to zero. The MASE and SMAPE can be used to compare forecast bottom-up approach, the top-level series for all the top-down, all the
methods on a single series and, because they are scale-free, to compare time series for the optimal reconciliation approach). Note that
forecast accuracy across series. For this reason, we average the MASE improving the accuracy of the base forecasts using the cross-validation
and SMAPE values of several series to obtain a measurement of forecast procedure described in Section 4 is also beneficial for the competitors.
accuracy for the group of series. As for the metrics used for comparison, we use both the MASE and the
SMAPE where possible (i.e., where no zeros are present).

4.3. Implementation 5.1. Datasets

Time series models described above are implemented by using the 1. Italian Dataset: we consider sales data gathered from an Italian
forecast package in R (Hyndman & Khandakar, 2008). Hierarchical grocery store1 (Mancuso et al., 2021). The dataset consists of 118
time series forecasting is performed with the help of hts package in R daily time series representing the demand for pasta from 01/01/
(Hyndman, Lee, Wang, & Wickramasuriya, 2018). For the optimal 2014 to 31/12/2018. Besides univariate time series data, the quan­
reconciliation approach, we use the MinT algorithm that estimates the tity sold is integrated by information on the presence or the absence
covariance matrix of the base forecast errors using shrinkage (Wickra­ of a promotion (no detail on the type of promotion on the final price
masuriya et al., 2019). The proposed NND is implemented in Python is available). These time series can be naturally arranged to follow a
with TensorFlow, a large-scale machine learning library (Abadi et al., hierarchical structure. Here, the idea is to build a 3-level structure: at
2015). The CNN subnetwork has 6 convolutional layers with ReLU the top of the hierarchy, there is the total or the store-level series
activation, whereas the MLP subnetwork has 3 fully connected layers obtained by aggregating the brand-level series. At the second level,
with ReLU activation. The hyperparameters optimization of the CNN there are the brand-level series (like for instance Barilla) obtained by
subnetwork regards the number of filters (F) and the kernel size (K) of
the convolutional layers, whereas the optimization of the MLP subnet­
work regards the number of units in the hidden layers (H). Grid search is 1
https://data.mendeley.com/datasets/s8dgbs3rng/1

8
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

aggregating the individual demand at the item level. Finally, the Table 4
third level contains the most disaggregated time series representing MASE for all 123 series of the Italian sales dataset. In bold the best performing
the item-level demand (for example the demand for spaghetti Bar­ approach.
illa). The completely aggregated series at level 0 is disaggregated BU AHP PHA FP NND1 NND2 OPT
into 4 component series at level 1 (B1 to B4). Each of these series is TOTAL 0.999 0.562 0.562 0.562 0.562 0.562 0.560
further subdivided into 42, 45, 10, and 21 series at level 2, the
B1 1.321 1.213 1.225 1.249 0.752 0.702 0.856
completely disaggregated bottom level representing the different
B2 0.729 1.009 1.027 0.740 0.788 0.764 0.776
varieties of pasta for each brand (see Table 1). B3 1.265 1.749 1.910 1.225 0.772 0.677 0.877
2. Electricity Dataset: we use a public electricity demand dataset that B4 1.443 1.610 1.660 1.325 0.723 0.737 0.926
contains power measurements and meteorological forecasts relative Average Brand 1.189 1.395 1.455 1.135 0.759 0.720 0.859
to a set of 24 power meters installed in low-voltage cabinets of the
B1-I1 1.295 1.100 1.099 1.360 0.669 0.605 1.292
distribution network of the city of Rolle in Switzerland (Nespoli
B1-I2 1.011 0.930 0.931 1.024 0.723 0.747 1.007
et al., 2020). The dataset contains measurements from 13/01/2018 B1-I3 0.870 0.832 0.820 0.793 0.795 0.677 0.769
to 19/01/2019 at the granularity of 10 min and includes mean active B1-I4 0.753 0.874 0.819 0.753 0.644 0.666 0.731
and reactive power, voltage magnitude, maximum total harmonic B1-I5 1.483 1.414 1.419 1.589 0.604 0.639 0.915
B1-I6 0.886 0.758 0.749 0.793 0.691 0.760 0.786
distortion for each phase, voltage frequency, and the average power
B1-I7 0.838 0.889 0.849 0.850 0.659 0.724 0.832
over the three phases. We assume that the grid losses are not sig­ B1-I8 0.835 0.786 0.785 0.760 0.625 0.653 0.731
nificant, so the power at the grid connection is the algebraic sum of B1-I9 1.020 1.032 1.059 1.044 0.662 0.626 0.922
the connected nodes. Based on the historical measurements, the B1-I10 1.113 1.100 1.128 1.078 0.617 0.672 0.868
operator can determine coherent forecasts for all the grid by gener­ B1-I11 0.944 0.838 0.836 0.965 0.691 0.743 0.951
B1-I12 1.247 1.154 1.191 1.084 0.694 0.730 0.843
ating forecasts for the nodal injections individually. We build a 2-
B1-I13 0.997 0.810 0.812 0.921 0.635 0.638 0.886
level hierarchy in which we aggregate the 24 series (M1 to M24) B1-I14 0.945 0.876 0.894 0.946 0.687 0.651 0.922
of the distribution system at the meter level to generate the total B1-I15 1.090 1.124 1.125 1.107 0.632 0.665 0.801
series at the grid level (see Table 2). B1-I16 0.746 0.778 0.757 0.754 0.733 0.678 0.736
3. Walmart Dataset: we consider a public dataset made available by B1-I17 0.876 0.803 0.804 0.901 0.968 1.027 0.884
B1-I18 1.642 1.215 1.228 1.176 0.692 0.697 0.644
Walmart that was adopted in the latest M competition, M52. This B1-I19 0.846 0.788 0.757 0.786 0.760 0.772 0.738
dataset contains historical sales data from 29/01/2011 to 19/06/ B1-I20 0.939 0.861 0.864 0.864 0.637 0.701 0.820
2016 of various products sold in the USA, organized in the form of B1-I21 0.872 0.777 0.767 0.778 0.676 0.705 0.755
grouped time series. More specifically, it uses unit sales data B1-I22 0.813 0.765 0.763 0.812 0.582 0.679 0.802
B1-I23 1.054 0.901 0.907 1.084 0.502 0.629 1.057
collected at the product-store level that are grouped according to
B1-I24 1.082 1.011 1.074 1.127 0.603 0.617 1.079
product departments, product categories, stores, and three B1-I25 0.823 0.905 0.867 0.834 0.627 0.681 0.818
geographical areas: the States of California (CA), Texas (TX), and B1-I26 0.784 0.841 0.829 0.798 0.765 0.830 0.783
Wisconsin (WI). Besides the time series data, it includes explanatory B1-I27 0.747 0.747 0.725 0.753 0.678 0.702 0.743
variables such as promotions (SNAP events), days of the week, and B1-I28 0.983 1.029 1.022 1.021 0.586 0.726 0.986
B1-I29 1.082 0.874 0.868 0.889 0.600 0.692 0.882
special events (e.g., Super Bowl, Valentine’s Day, Thanksgiving Day) B1-I30 0.972 0.826 0.833 0.866 0.515 0.638 0.858
that typically affect unit sales and could improve forecasting accu­ B1-I31 0.955 0.890 0.896 0.972 0.690 0.688 0.938
racy. Starting from this dataset, we extract a 4-level hierarchy: the B1-I32 1.294 1.076 1.155 0.998 0.639 0.602 1.004
completely aggregated series at level 0 is divided into 3 component B1-I33 1.115 0.696 0.698 0.723 0.672 0.722 0.711
B1-I34 0.951 0.825 0.797 0.761 0.614 0.670 0.746
series at level 1 representing the state-level time series (CA, TX, WI).
B1-I35 0.853 0.917 0.901 0.883 0.779 0.833 0.846
The state-level time series are respectively subdivided in 4 (CA1, B1-I36 0.736 0.689 0.692 0.747 0.702 0.775 0.731
CA2, CA3, CA4), 3 (TX1, TX2, TX3), and 3 (WI1, WI2, WI3) time B1-I37 1.325 1.349 1.397 1.334 0.529 0.647 1.343
series at level 2, the store level. Finally, each store-level time series is B1-I38 1.367 1.284 1.392 1.384 0.642 0.682 1.434
further subdivided into 3 time series at the category level, the most B1-I39 0.897 0.874 0.885 0.894 0.585 0.598 0.873
B1-I40 1.353 1.298 1.302 1.403 0.641 0.413 1.363
disaggregated one, containing the categories Foods, Hobbies, and B1-I41 0.858 0.803 0.769 0.770 0.635 0.734 0.749
Household (see Table 3). B1-I42 1.251 1.247 1.280 1.193 0.536 0.691 1.263
B2-I1 0.958 1.163 1.107 0.848 0.761 0.852 0.955
To summarize, we have the first dataset with a three-level hierarchy, the B2-I2 0.966 0.874 0.826 0.762 0.608 0.666 0.758
B2-I3 0.829 0.836 0.825 0.723 0.569 0.700 0.724
second one with a two-level hierarchy, and the third one with a four-
B2-I4 0.760 0.793 0.803 0.771 0.577 0.681 0.744
level hierarchy. As for the experimental setup, we have to make some B2-I5 0.836 0.780 0.768 0.737 0.537 0.699 0.733
choices for each dataset: B2-I6 0.753 0.853 0.836 0.751 0.583 0.633 0.745
B2-I7 0.853 0.805 0.813 0.760 0.696 0.635 0.747
1. Italian Dataset: for each series, as explanatory variables, we add a B2-I8 0.820 0.801 0.788 0.717 0.664 0.790 0.707
B2-I9 0.830 0.703 0.698 0.716 0.639 0.637 0.718
binary variable representing the presence of promotion if the B2-I10 0.806 0.850 0.840 0.788 0.656 0.700 0.777
disaggregation is computed at the item level or a variable repre­ B2-I11 0.831 0.806 0.800 0.826 0.673 0.788 0.826
senting the relative number of items in promotion for each brand if B2-I12 0.804 0.900 0.863 0.824 0.617 0.715 0.801
the disaggregation is computed at the brand level. In both cases, B2-I13 0.834 0.806 0.793 0.816 0.604 0.631 0.823
B2-I14 0.754 0.749 0.741 0.744 0.654 0.658 0.739
dummy variables representing the day of the week and the month are
B2-I15 0.686 0.785 0.760 0.739 0.604 0.679 0.685
also added to the model. As for the number of lagged observations of B2-I16 0.875 0.792 0.784 0.768 0.608 0.691 0.770
the aggregate demand, we consider time windows of length w = 30 B2-I17 0.984 0.888 0.860 0.790 0.665 0.696 0.792
days with a hop size of 1 day. We consider 4 years from 01/01/2014 B2-I18 0.835 0.793 0.777 0.747 0.622 0.661 0.730
to 31/12/2017 for the in-sample period and the last year of data from B2-I19 1.351 1.153 1.144 0.986 0.781 0.833 1.062
B2-I20 0.912 0.902 0.887 0.893 0.681 0.783 0.902
01/01/2018 to 31/12/2018 for the out-of-sample period. The B2-I21 0.951 0.774 0.759 0.734 0.602 0.650 0.740
experimental setup for the cross-validation procedure is as follows. B2-I22 0.873 0.859 0.834 0.788 0.532 0.646 0.775
B2-I23 0.849 0.816 0.795 0.768 0.577 0.643 0.752
B2-I24 0.780 0.893 0.874 0.775 0.580 0.645 0.774
2
https://mofc.unic.ac.cy/m5-competition/ (continued on next page)

9
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Table 4 (continued )
BU AHP PHA FP NND1 NND2 OPT

B2-I25 0.846 0.709 0.702 0.732 0.659 0.721 0.735


B2-I26 0.831 0.779 0.775 0.755 0.558 0.677 0.731
B2-I27 0.819 0.817 0.797 0.724 0.548 0.614 0.720
B2-I28 0.911 0.984 0.983 0.931 0.523 0.645 0.902
B2-I29 0.985 0.804 0.806 0.795 0.505 0.689 0.785
B2-I30 0.921 0.784 0.770 0.730 0.477 0.587 0.715
B2-I31 0.761 0.808 0.789 0.767 0.564 0.629 0.759
B2-I32 0.850 0.798 0.788 0.764 0.699 0.726 0.752
B2-I33 0.800 0.737 0.746 0.726 0.515 0.644 0.699
B2-I34 0.712 0.781 0.751 0.689 0.511 0.650 0.687
B2-I35 0.865 0.791 0.778 0.766 0.614 0.570 0.754
B2-I36 0.808 0.735 0.718 0.694 0.674 0.604 0.691
B2-I37 0.777 0.823 0.818 0.806 0.676 0.670 0.771
B2-I38 0.728 0.727 0.723 0.732 0.633 0.628 0.714
B2-I39 0.780 0.994 0.946 0.771 0.670 0.609 0.764
B2-I40 0.825 0.731 0.728 0.702 0.667 0.704 0.711
B2-I41 0.856 0.743 0.740 0.759 0.635 0.793 0.758
B2-I42 0.853 0.773 0.754 0.729 0.689 0.771 0.738
B2-I43 0.762 0.801 0.794 0.749 0.580 0.777 0.747
B2-I44 0.945 0.769 0.762 0.732 0.586 0.655 0.729 Fig. 5. Nemenyi test results at 95% confidence level for all 123 series of the
B2-I45 0.966 0.890 0.871 0.763 0.774 0.816 0.866 Italian sales dataset. The hierarchical forecasting methods are sorted vertically
B3-I1 1.045 1.358 1.322 1.089 0.723 0.759 1.058 according to the MASE mean rank.
B3-I2 1.021 1.196 1.256 1.131 0.575 0.794 1.041
B3-I3 0.990 0.881 0.910 1.160 0.619 0.660 0.998
B3-I4 0.953 0.939 0.943 0.906 0.686 0.719 0.961
B3-I5 0.907 0.991 0.993 1.032 0.521 0.659 0.906 Table 5
B3-I6 1.116 0.981 0.965 1.067 0.653 0.729 1.019 MASE for all 25 series of the electricity demand data. In bold the best performing
B3-I7 1.223 0.905 0.904 0.859 0.649 0.787 0.914 approach.
B3-I8 1.374 1.204 1.267 0.974 0.526 0.679 1.122
BU AHP PHA FP NND OPT
B3-I9 1.382 1.123 1.130 0.917 0.655 0.661 1.176
B3-I10 1.357 1.408 1.483 1.189 0.681 0.611 1.111 TOTAL 1.090 0.974 0.974 0.974 0.974 1.088
B3-I11 1.014 1.197 1.217 1.032 0.784 0.708 1.022
M1 1.745 1.520 1.549 1.092 1.002 1.249
B3-I12 0.934 0.822 0.815 0.901 0.621 0.722 0.834
M2 1.759 1.062 1.271 1.174 1.013 1.432
B3-I13 0.918 0.903 0.892 0.958 0.738 0.839 0.915
M3 1.158 1.361 1.355 0.963 0.892 1.170
B3-I14 1.230 1.351 1.369 1.342 0.788 0.606 1.232
M4 1.475 1.612 1.694 1.128 1.038 1.465
B3-I15 0.928 0.976 0.986 0.935 0.679 0.760 0.925
M5 0.999 1.075 1.063 1.181 0.873 1.132
B3-I16 1.162 1.324 1.390 0.957 0.644 0.856 1.090
M6 1.352 1.486 1.520 1.821 0.769 1.621
B3-I17 0.993 1.153 1.206 1.016 0.634 0.862 1.006
M7 1.044 1.771 1.678 1.174 0.637 1.087
B3-I18 0.918 1.118 1.088 0.895 0.675 0.775 0.927
M8 1.029 1.042 1.041 1.323 0.892 1.053
B3-I19 0.870 0.879 0.871 0.842 0.620 0.673 0.877
M9 1.070 1.272 1.244 0.997 0.893 0.975
B3-I20 0.882 0.913 0.902 0.894 0.677 0.824 0.870
M10 1.045 1.573 1.564 1.620 1.051 1.340
B3-I21 1.060 1.232 1.276 1.199 0.633 0.653 1.017
M11 1.225 1.452 1.437 1.382 0.840 1.131
B4-I1 1.287 1.094 1.161 1.176 0.561 0.766 1.186
M12 1.430 1.468 1.411 1.163 0.821 1.342
B4-I2 1.242 1.780 1.971 1.130 0.620 0.666 1.077
M13 1.002 1.872 1.715 1.496 1.090 1.682
B4-I3 1.041 0.889 0.913 0.914 0.718 0.670 0.938
M14 1.041 1.162 1.164 1.391 0.820 1.121
B4-I4 1.200 1.055 1.125 1.056 0.630 0.710 0.968
M15 1.348 1.533 1.553 1.893 0.863 1.185
B4-I5 1.322 1.245 1.341 1.094 0.622 0.607 1.318
M16 1.095 1.055 1.174 1.831 0.844 1.457
B4-I6 1.478 1.543 1.659 1.168 0.684 0.654 0.886
M17 1.292 1.806 1.759 1.706 1.122 1.131
B4-I7 1.324 0.993 1.044 0.822 0.647 0.624 0.919
M18 1.263 1.824 1.826 1.393 1.047 1.299
B4-I8 1.275 1.106 1.122 1.083 0.634 0.704 0.961
M19 1.436 1.730 1.789 1.218 0.882 1.280
B4-I9 1.107 1.399 1.528 0.837 0.854 0.861 0.830
M20 1.240 1.310 1.331 1.008 0.987 1.016
B4-I10 1.306 1.334 1.469 1.083 0.769 0.766 0.804
M21 1.326 1.317 1.343 1.176 0.802 1.106
Average Item 0.981 0.949 0.956 0.909 0.644 0.697 0.875 M22 1.431 1.108 1.078 1.316 0.924 1.380
M23 1.164 1.353 1.309 1.419 0.830 1.153
M24 1.284 1.705 1.740 1.532 1.165 1.232
The starting window consists of the first three years of data from 01/ Average Meter 1.261 1.436 1.442 1.350 0.921 1.252
01/2014 to 31/12/2016. The training window expands over the last
year of the training data including daily observations from 01/01/
2017 to 31/12/2018. The forecasting window is set to h = 7, cor­ are also added to the model. As for the number of lagged observa­
responding to a forecasting horizon of one week ahead. At each tions of the aggregate power, we consider time windows of length
iteration, the training window expands by one week to simulate a w = 144 observations with the hop size of 10 min. We consider 9
production environment in which the model is re-estimated as soon months from 13/01/2018 to 13/09/2018 for the training set and the
as new data are available and to better mimic the practical scenario last 3 months from 14/09/2018 to 13/01/2019 for the test set. The
in which retailing decisions occur every week. To evaluate the configuration of the cross-validation procedure is as follows. The
forecasting accuracy at each level, for this hierarchy we use the starting window consists of the first six months of data from 13/01/
average MASE, as recommended by Hyndman (2006), since most of 2018 to 13/06/2018. At each iteration, the training window expands
the item-level series are intermittent. by 24 h over the last 3 months of the training data including obser­
2. Electricity Dataset: for each series, we use the average power over vations from 14/06/2018 to 13/09/2018. The forecasting window is
the three phases as the target variable and the temperature, hori­ set to h = 144, corresponding to a forecasting horizon of 24 h ahead.
zontal irradiance, normal irradiance, relative humidity, pressure, We evaluate the forecasting accuracy at each level by using the
wind speed, and wind direction as explanatory variables. Dummy average MASE and the average SMAPE over all the series of that level
variables representing the day of the week and the hour of the day since there are no zero values in these time series.

10
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Table 6
SMAPE for all 25 series of the electricity demand data. In bold the best per­
forming approach.
BU AHP PHA FP NND OPT

TOTAL 0.076 0.072 0.072 0.072 0.072 0.075

M1 0.270 0.229 0.233 0.235 0.165 0.267


M2 0.233 0.272 0.257 0.201 0.099 0.246
M3 0.218 0.231 0.230 0.244 0.136 0.248
M4 0.183 0.296 0.393 0.212 0.131 0.224
M5 0.217 0.338 0.336 0.278 0.118 0.268
M6 0.195 0.274 0.279 0.294 0.072 0.192
M7 0.214 0.305 0.308 0.351 0.176 0.213
M8 0.228 0.366 0.265 0.238 0.155 0.247
M9 0.217 0.294 0.287 0.216 0.112 0.213
M10 0.216 0.259 0.258 0.274 0.146 0.226
M11 0.237 0.313 0.292 0.246 0.144 0.237
M12 0.211 0.219 0.280 0.276 0.089 0.206
M13 0.222 0.381 0.374 0.294 0.095 0.225
M14 0.195 0.318 0.318 0.260 0.104 0.227
M15 0.205 0.388 0.322 0.273 0.121 0.271
M16 0.198 0.235 0.246 0.108 0.063 0.272
M17 0.193 0.235 0.228 0.295 0.124 0.199 Fig. 7. Nemenyi test results at 95% confidence level for all 25 series of the
M18 0.209 0.271 0.272 0.256 0.164 0.224 Electricity data. The hierarchical forecasting methods are sorted vertically ac­
M19 0.194 0.292 0.206 0.226 0.134 0.251 cording to the SMAPE mean rank.
M20 0.207 0.320 0.308 0.340 0.106 0.244
M21 0.192 0.301 0.307 0.262 0.128 0.192
M22 0.183 0.225 0.220 0.302 0.112 0.132
quality of the forecasts by looking at the MASE since the SMAPE is
M23 0.185 0.390 0.381 0.208 0.141 0.197 undefined due to the presence of zeros values when Walmart is
M24 0.198 0.318 0.325 0.309 0.145 0.188 closed.
Average Meter 0.209 0.294 0.288 0.258 0.124 0.225

5.2. Results

We compare the forecasting performance of our method for each


series, in both its versions, standard top-down (NND1) and iterative top-
down (NND2) with the bottom-up (BU), average historical proportions
(AHP), proportions of historical averages (PHA), forecasted proportions
(FP) and the optimal reconciliation approach through trace minimiza­
tion (OPT). We stress that for all the top-down approaches, the perfor­
mance at the most aggregated level is equivalent, and the differences
only emerge at the lower levels of the hierarchy, where we are interested
in the comparison.
For all the datasets, we report the metrics on all the considered time
series, comparing also the average error at each level. To formally test
whether the forecasts produced by the considered hierarchical methods
are different, we use the non-parametric Friedman and post hoc Nem­
enyi tests as in Koning, Franses, Hibon, and Stekler (2005), Demšar
(2006) and Fonzo and Girolimetto (2020). As stated by Kourentzes and
Athanasopoulos (2019), the Friedman test first establishes whether at
least one of the forecasts is statistically different from the others. If this is
the case, the Nemenyi test identifies groups of forecasts for which there
Fig. 6. Nemenyi test results at 95% confidence level for all 25 series of the
is no evidence of significant differences. The advantage of this approach
Electricity data. The hierarchical forecasting methods are sorted vertically ac­ is that it does not impose any assumption on the distribution of the data
cording to the MASE mean rank. and does not require multiple pairwise tests between forecasts, which
would distort the outcome of the tests. The hierarchical forecasting
3. Walmart Dataset: for each series, we use dummy variables repre­ methods are then sorted according to the mean rank with respect to the
senting the day of the week and the month. We also include as considered metric.
explanatory variables snap events and special events affecting sales: In Table 4 we provide results for all 123 time series of the Italian sales
National holidays, Religious holidays, Sporting events, Valentine’s dataset. For the NND1, we directly forecast the demand at the item level
Day, Father’s Day, and Mother’s Day. We consider 4 years from 29/ using the aggregate demand at the store level, and then we aggregate the
01/2011 to 29/06/2015 for the in-sample period and the last year of item-level forecasts to obtain the brand-level forecasts. For the NND2,
data from30/06/2015 to 16/06/2016 for the out-of-sample period. we train a disaggregation model that generates the brand-level forecasts
The experimental setup for the cross-validation procedure is as fol­ starting from the store-level series, and then one NND for each brand-
lows. The starting window consists of the first three years of data level series to generate forecasts for each item demand of the brand
from 29/01/2011 to 29/01/2014. The training window iteratively they belong to. Overall, for the entire hierarchy we train one NND at the
expands over the last year of the training data including observations top level, and 4 NND in parallel at the brand level.
from 30/01/2014 to 29/06/2015. The forecasting horizon is set to In Fig. 5 we show the outcome of the Friedman and post hoc Nemenyi
h = 7, and the number of lagged observations for the aggregate sales tests with a confidence level of 95%. If the intervals of two methods do
is set to w = 30 days with the hop size of 1 day. We evaluate the not overlap, they exhibit statistically different performance. As seen,
Table 4 and Fig. 5 indicate that NND (iterative and standard top-down)

11
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Table 7
MASE for all 44 series of the Walmart data. In bold the best performing
approach.
BU AHP PHA FP NND1 NND2 OPT

TOTAL 0.781 0.782 0.782 0.782 0.782 0.782 0.785

CA 0.807 0.997 0.997 0.877 0.593 0.571 0.810


TX 0.731 0.986 0.956 0.751 0.632 0.615 0.734
WI 0.760 0.858 0.846 0.778 0.578 0.492 0.760

Average State 0.766 0.947 0.933 0.802 0.601 0.559 0.768

CA1 0.771 1.044 1.070 0.828 0.687 0.761 0.775


CA2 0.843 1.029 1.034 0.887 0.625 0.669 0.839
CA3 0.876 1.371 1.340 0.914 0.576 0.494 0.873
CA4 0.757 1.023 1.020 0.800 0.593 0.576 0.758
TX1 0.739 0.968 0.965 0.762 0.572 0.446 0.743
TX2 0.756 1.518 1.451 0.761 0.651 0.579 0.755
TX3 0.723 0.761 0.764 0.742 0.577 0.559 0.723
WI1 0.746 1.454 1.369 0.745 0.796 0.741 0.743
WI2 0.737 0.907 0.888 0.754 0.561 0.572 0.740
WI3 0.813 1.017 0.987 0.845 0.516 0.483 0.818

Average Store 0.776 1.109 1.089 0.804 0.615 0.588 0.777 Fig. 8. Nemenyi test results at 95% confidence level for all 44 series of the
Walmart data. The hierarchical forecasting methods are sorted vertically ac­
CA1-Foods 0.823 1.385 1.432 0.864 0.521 0.590 0.827
cording to the MASE mean rank.
CA1-Hobbies 0.722 0.840 0.848 0.735 0.733 0.686 0.722
CA1- 0.742 1.328 1.628 0.798 0.673 0.689 0.743
Household level of the hierarchy, we have enough data to build decent models
CA2-Food 0.808 1.060 1.111 0.838 0.711 0.713 0.809 capturing the underlying trend and seasonality. Indeed, the aggregation
CA2-Hobbies 0.792 1.015 1.020 0.819 0.753 0.755 0.791
tends to regularize the demand and make it easier to forecast. The only
CA2- 0.800 1.549 1.507 0.826 0.783 0.781 0.796
Household level for which the optimal reconciliation approach is the best is the top
CA3-Food 0.864 1.393 1.317 0.887 0.754 0.788 0.862 level. As we move down the hierarchy our approach outperforms all the
CA3-Hobbies 0.749 0.889 0.898 0.775 0.697 0.647 0.750 top-down approaches, the bottom-up method and the optimal recon­
CA3- 0.807 1.283 1.256 0.884 0.745 0.777 0.811 ciliation, with the NND iterative top-down (NND2) performing best at
Household
CA4-Food 0.770 1.199 1.161 0.803 0.726 0.747 0.771
the brand level and the NND standard top-down (NND1) performing
CA4-Hobbies 0.709 0.967 0.980 0.720 0.709 0.667 0.709 best at the item level, on average.
CA4- 0.712 1.402 1.389 0.758 0.692 0.631 0.713 In Table 5 and 6 we present the MASE and SMAPE for all 25 time
Household series of the electricity demand dataset. Note that here we only have two
TX1-Food 0.761 1.347 1.315 0.773 0.644 0.523 0.763
levels, so that NND1 and NND2 coincide (which is why we call it only
TX1-Hobbies 0.728 0.912 0.914 0.738 0.756 0.719 0.728
TX1-Household 0.770 1.218 1.164 0.808 0.679 0.660 0.771 NND in the tables). In Fig. 6 and 7 we show the outcome of the Friedman
TX2-Food 0.778 1.308 1.800 0.774 0.615 0.699 0.776 and Nemenyi tests with a confidence level of 95%, with respect to MASE
TX2-Hobbies 0.722 0.852 0.848 0.724 0.773 0.622 0.721 and SMAPE. We find that all the top-down approaches perform best at
TX2-Household 0.772 0.974 0.965 0.794 0.652 0.628 0.771 the grid level. On this dataset, the optimal reconciliation method, and
TX3-Food 0.735 0.845 0.813 0.745 0.655 0.582 0.735
TX3-Hobbies 0.739 1.334 1.329 0.749 0.646 0.679 0.739
the bottom-up approach show good performance. The good performance
TX3-Household 0.771 1.193 1.143 0.793 0.714 0.776 0.769 of the bottom-up method with respect to the classical top-down ap­
WI1-Food 0.760 1.489 1.374 0.755 0.622 0.711 0.756 proaches can be attributed to the strong seasonality of the series, even at
WI1-Hobbies 0.708 0.992 0.992 0.710 0.604 0.622 0.708 the bottom level. Our NND clearly outperforms all the competitors,
WI1- 0.761 1.361 1.316 0.766 0.741 0.779 0.760
ranking first in the tests and having a better average error.
Household
WI2-Food 0.727 0.771 0.767 0.732 0.657 0.566 0.728 In Table 7, we provide results for all 44 time series of the Walmart
WI2-Hobbies 0.749 1.057 1.052 0.767 0.634 0.671 0.749 dataset. For the NND1, we directly generate forecasts at the category
WI2- 0.847 1.173 1.123 0.883 0.775 0.681 0.849 level using the total aggregate at level 0. We aggregate these forecasts to
Household obtain first the store-level forecasts, and then the state-level forecasts.
WI3-Food 0.779 1.031 0.996 0.797 0.709 0.605 0.780
For the NND2, instead, we train a disaggregation model that outputs the
WI3-Hobbies 0.748 0.924 0.897 0.763 0.605 0.663 0.748
WI3- 0.825 0.878 0.874 0.873 0.737 0.710 0.826 state-level forecasts starting from the total aggregate series at level 0,
Household one NND for each state-level series to generate forecasts for the stores of
Average 0.765 1.132 1.141 0.788 0.690 0.679 0.768 the geographical area they belong to, and finally, one model for each
Category store to obtain the bottom-level forecast at the category level. Overall,
for the entire hierarchy, we train one NND at the top level, 3 NND in
parallel at the state level, and 10 NND in parallel at the store level.
provide significantly better forecasts than the rest of the methods found In Fig. 8, we plot the results of the Friedman and Nemenyi tests for all
in the literature, with the bottom-up performing worst. The bad per­ 44 series of the Walmart dataset. Fig. 8 shows that NND1 and NND2 are
formance of the bottom-up method can be attributed to the demand at statistically equivalent on this dataset, even though, on average, the
the most granular level of the hierarchy being challenging to model and error produced by NND2 is lower. Anyhow, both NND1 and NND2
forecast effectively due to its too sparse and erratic nature. The majority outperform the competitors. On this dataset, the bottom-up approach
of the item-level time series display sporadic sales including zeros, and performs best at the most aggregate level, and it is quite competitive
the promotion of an item does not always correspond to an increase in with the optimal combination approach since the time series at the
sales. By using traditional or combination of methods to generate base category level display a strong seasonality component. Indeed, sales are
forecasts for the time series at the lowest level, we end up with flat line relatively high on weekends in comparison to normal days, and this
forecasts, representing the average demand, failing to account for the behavior propagates as we go up the hierarchy. As we move down the
seasonality that truly exists but is impossible to identify between the hierarchy, our approach outperforms all the top-down approaches, the
noise. By focusing our attention at the highest or some intermediate

12
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

bottom-up method, and the optimal reconciliation. Acknowledgments


Summarizing, our method outperforms the hierarchical forecasting
competitors on all the considered datasets. This result is particularly We would like to thank the reviewers for their thoughtful comments
significant due to the different characteristics of the three datasets. In that greatly helped to improve our manuscript.
the sales data, the bottom-level series are extremely noisy and hard to
forecast (as confirmed by the bottom-up method’s poor performance). Appendix A. Details on the NND hyperparameters
On the other hand, the electricity demand data display seasonality at the
bottom level, as confirmed by the bottom-up method’s good perfor­ In Table A.8, we report for each dataset the details of the imple­
mance. Finally, the Walmart dataset comes from a domain similar to the mented neural network used for producing the results of method NND
one of the Italian Sales Data, has a higher number of levels, but a strong standard top-down (NND1) described in Section 3. We have one row for
seasonality that propagates through the hierarchy even at the bottom each dataset, whereas on the columns we have the level of disaggrega­
level. In all the experiments, our approach generates coherent forecasts: tion, the number of units in all the dense layers, the number of filters and
the maximum violation of the aggregation constraint is less than 10− 3 . kernel size of all the convolutional layers.
Furthermore, it improves the overall accuracy at any level of the hier­
archy. This confirms the general viability of our approach, which can get
coherent and accurate forecasts by extracting the hidden information in Table A.8
Implementation details for the networks used in NND1.
the hierarchy. In B, we also report some figures to show the accuracy of
the forecasts produced by our method on some series of the considered Dataset Level units dense filters kernel size
datasets at different levels of the hierarchy. Italian Total to Item 128 16 8
Electricity Total to Meter 256 32 16
6. Conclusions Walmart Total to Category 128 32 4

In this paper, we propose a machine learning method for forecasting


hierarchical time series. Our approach relies on a deep neural network
capable of automatically extracting time series features at any level In Table B.9, we report for each dataset the details of the imple­
thanks to the convolutional layers. The network combines these features mented neural networks used for producing the results of the method
with the explanatory variables available at any level of the hierarchy. NND iterative top-down (NND2) described in Section 3. We have one
The obtained forecasts are coherentsince reconciliation is forced in the row for each dataset and for each network used in NND2 for that dataset,
training phase by minimizing a customized loss function. The effec­ whereas on the columns we have the level of disaggregation, the number
tiveness of the approach is shown on three real-world datasets that fit of units in all the dense layers, the number of filters and kernel size of all
the method’s assumptions: they include explanatory variables, and the the convolutional layers. Depending on the dataset, the number of rows
number of observations is large enough. On these datasets, a deep sta­ changes depending on the number of models built for a given level.
tistical analysis proves that our method outperforms the state-of-the-art
competitors in hierarchical forecasting, being always more accurate,
and producing significantly different forecasts at any level. The results Table B.9
confirm that we fulfilled our aim to combine in a single machinery all Implementation details for the networks used in NND2.
the available information through the hierarchy, both hidden and pro­
Dataset Level units dense filters kernel size
vided by the explanatory variables, without the need of post-processing
on the series to achieve high accuracy and cross-sectional coherence. As Italian Total to Brand 64 32 8
Italian B1 to B1-IX 128 16 4
future work, our idea is to extend the proposed methodology to take into
Italian B2 to B2-IX 64 32 4
account the temporal reconciliation and jointly perform the temporal
Italian B3 to B3-IX 64 16 4
and cross-sectional reconciliation. Forcing only the temporal coherence
Italian B4 to B4-IX 128 16 4
can be viewed as a straightforward extension, whereas keeping into
Walmart Total to State 128 16 8
account both temporal and cross-sectional coherence may require some
Walmart CA to CAX 64 16 4
changes to the network structure. The idea is to adapt the neural
Walmart TX to TXX 64 16 8
network to exploit the information for the temporal reconciliation,
Walmart WI to WIX 128 16 8
which may require some recurrent layers, as LSTM or GRU. Also, the loss
Walmart CA1 to CA1-Category 64 32 8
function should be adapted to effectively force both reconciliation Walmart CA2 to CA2-Category 64 16 8
constraints during the training. Walmart CA3 to CA3-Category 64 16 8
Walmart CA4 to CA4-Category 64 16 4
Walmart TX1 to TX1-Category 64 16 8
CRediT authorship contribution statement
Walmart TX2 to TX2-Category 64 32 8
Walmart TX3 to TX3-Category 128 32 4
Paolo Mancuso: Conceptualization, Methodology, Validation, Walmart WI1 to WI1-Category 64 16 8
Formal analysis, Investigation, Data curation, Writing - original draft, Walmart WI2 to WI2-Category 128 16 8
Writing - review & editing. Veronica Piccialli: Conceptualization, Walmart WI3 to WI3-Category 64 32 8

Methodology, Validation, Formal analysis, Investigation, Data curation,


Writing - original draft, Writing - review & editing. Antonio M. Sudoso:
Conceptualization, Methodology, Validation, Formal analysis, Investi­
gation, Data curation, Software, Writing - original draft, Writing - review Appendix B. Plots
& editing.
In Figs. B.1, B.2, B.3 we show the NND predictions on the test set for
Declaration of Competing Interest some component series of each dataset.

The authors declare that they have no known competing financial


interests or personal relationships that could have appeared to influence
the work reported in this paper.

13
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Fig. B.1. NND out-of-sample forecasts for the Italian sales dataset. Series B1, B2, B3 and B4 (last 6 months).

14
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Fig. B.2. NND out-of-sample forecasts for the electricity demand dataset. Series M11, M16, M18 and M19 (first 72 h).

15
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Fig. B.3. NND out-of-sample forecasts for the Walmart dataset. Time series CA4-Household, CA1-Foods, TX1 and WI2-Foods (last 6 months).

References Dunn, D. M., Williams, W. H., & Dechaine, T. L. (1976). Aggregate versus subaggregate
models in local area forecasting. Journal of the American Statistical Association, 71,
68–71.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis,
Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., & Muller, P.-A. (2019). Deep
A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M.,
learning for time series classification: a review. Data Mining and Knowledge Discovery,
Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R.,
33, 917–963.
Moore, S., Murray, D., Olah, C., Schuster, M., & Shlens, J. (2015). TensorFlow:
Ferreira, M. D., Corrêa, D. C., Nonato, L. G., & de Mello, R. F. (2018). Designing
Large-scale machine learning on heterogeneous systems.
architectures of convolutional neural networks to solve practical problems. Expert
Athanasopoulos, G., Ahmed, R. A., & Hyndman, R. J. (2009). Hierarchical forecasts for
Systems with Applications, 94, 205–217.
australian domestic tourism. International Journal of Forecasting, 25, 146–166.
Franses, P. H., & Legerstee, R. (2011). Combining sku-level sales forecasts from models
Bandara, K., Bergmeir, C., & Smyl, S. (2020). Forecasting across time series databases
and experts. Expert Systems with Applications, 38, 2365–2370.
using recurrent neural networks on groups of similar series: A clustering approach.
Gross, C. W., & Sohl, J. E. (1990). Disaggregation methods to expedite product line
Expert Systems with Applications, 140, Article 112896.
forecasting. Journal of Forecasting, 9, 233–254.
Bontempi, G., Taieb, S. B., & Le Borgne, Y.-A. (2012). Machine learning strategies for
Hollyman, R., Petropoulos, F., & Tipping, M. E. (2021). Understanding forecast
time series forecasting. In European business intelligence summer school (pp. 62–77).
reconciliation. European Journal of Operational Research.
Springer.
Huber, J., Gossmann, A., & Stuckenschmidt, H. (2017). Cluster-based hierarchical
Carta, S., Corriga, A., Ferreira, A., Podda, A. S., & Recupero, D. R. (2021). A multi-layer
demand forecasting for perishable goods. Expert Systems with Applications, 76,
and multi-ensemble stock trader using deep learning and deep reinforcement
140–151.
learning. Applied Intelligence, 51, 889–905.
Hyndman, R., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
Caruana, R., Lawrence, S., & Giles, L. (2000). Overfitting in neural nets:
Hyndman, R., Lee, A., Wang, E., & Wickramasuriya, S. (2018). hts: Hierarchical and
Backpropagation, conjugate gradient, and early stopping. In Proceedings of the 13th
Grouped Time Series. R package version 5.1.5.
International Conference on Neural Information Processing Systems NIPS’00 (pp.
Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L. (2011). Optimal
381–387). MIT Press.
combination forecasts for hierarchical time series. Computational Statistics & Data
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The
Analysis, 55, 2579–2589.
Journal of Machine Learning Research, 7, 1–30.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast
Di Fonzo, T., & Girolimetto, D. (2020). Cross-temporal forecast reconciliation: Optimal
package for R. Journal of Statistical Software, 26, 1–22.
combination method and heuristic alternatives. arXiv preprint arXiv:2006.08570.
Hyndman, R. J., et al. (2006). Another look at forecast-accuracy metrics for intermittent
demand. Foresight: The International Journal of Applied Forecasting, 4, 43–46.

16
P. Mancuso et al. Expert Systems With Applications 182 (2021) 115102

Kanarachos, S., Christopoulos, S.-R. G., Chroneos, A., & Fitzpatrick, M. E. (2017). Nenova, Z. D., & May, J. H. (2016). Determining an optimal hierarchical forecasting
Detecting anomalies in time series data via a deep learning algorithm combining model based on the characteristics of the data set: Technical note. Journal of
wavelets, neural networks and hilbert transform. Expert Systems with Applications, 85, Operations Management, 44, 62–68.
292–304. Nespoli, L., Medici, V., Lopatichki, K., & Sossan, F. (2020). Hierarchical demand
Kingma, D.P., & Ba, J. (2017). Adam: A method for stochastic optimization. arXiv: forecasting benchmark for the distribution grid. Electric Power Systems Research, 189,
1412.6980. Article 106755.
Koning, A. J., Franses, P. H., Hibon, M., & Stekler, H. O. (2005). The m3 competition: Shlifer, E., & Wolff, R. W. (1979). Aggregation and proration in forecasting. Management
Statistical tests of the results. International Journal of Forecasting, 21, 397–409. Science, 25, 594–603.
Kourentzes, N., & Athanasopoulos, G. (2019). Cross-temporal coherent forecasts for Spiliotis, E., Abolghasemi, M., Hyndman, R.J., Petropoulos, F., & Assimakopoulos, V.
australian tourism. Annals of Tourism Research, 75, 393–409. (2020a). Hierarchical forecast reconciliation with machine learning. arXiv preprint
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. arXiv:2006.02043.
Liu, Y., Gong, C., Yang, L., & Chen, Y. (2020). Dstp-rnn: A dual-stage two-phase Spiliotis, E., Petropoulos, F., Kourentzes, N., & Assimakopoulos, V. (2020). Cross-
attention-based recurrent neural network for long-term and multivariate time series temporal aggregation: Improving the forecast accuracy of hierarchical electricity
prediction. Expert Systems with Applications, 143, Article 113082. consumption. Applied Energy, 261, Article 114339.
Maçaira, P. M., Thomé, A. M. T., Oliveira, F. L. C., & Ferrer, A. L. C. (2018). Time series Timmermann, A. (2006). Forecast combinations. Handbook of economic forecasting, 1,
analysis with explanatory variables: A systematic literature review. Environmental 135–196.
Modelling & Software, 107, 199–209. Wickramasuriya, S. L., Athanasopoulos, G., & Hyndman, R. J. (2019). Optimal forecast
Mancuso, P., Piccialli, V., & Sudoso, A. M. (2021). Hierarchical sales data of an italian reconciliation for hierarchical and grouped time series through trace minimization.
grocery store. Mendeley Data, V1, 10.17632/s8dgbs3rng.1. Journal of the American Statistical Association, 114, 804–819.
Ye, R., & Dai, Q. (2021). Implementing transfer learning across different datasets for time
series forecasting. Pattern Recognition, 109, Article 107617.

17

You might also like