A Machine Learning Approach For Forecasting Hierarchical Time Series
Paolo Mancuso, Veronica Piccialli, Antonio M. Sudoso
University of Rome Tor Vergata
arXiv:2006.00630v1 [cs.LG] 31 May 2020
Abstract
In this paper, we propose a machine learning approach for forecasting hierarchical time series. Rather than using historical or forecasted proportions, as in standard top-down approaches, we formulate the disaggregation problem as a non-linear regression problem. We propose a deep neural network that automatically learns how to distribute the top-level forecasts to the bottom-level series of the hierarchy, taking into account the characteristics of the aggregate series and the information of the individual series. In order to evaluate the performance of the proposed method, we analyze hierarchical sales data and electricity demand data. Besides the comparison with top-down approaches, the model is compared with the bottom-up method and the optimal reconciliation method. Results demonstrate that our method not only increases the average forecasting accuracy of the hierarchy but also addresses the need for an automated procedure that generates coherent forecasts for many time series at the same time.
Keywords: Hierarchical Time Series, Forecast, Machine Learning, Neural Network
1. Introduction
A hierarchical time series is a collection of time series organized in a hierarchical structure that can be aggregated at different levels [7]. As an example, Stock Keeping Unit (SKU) sales aggregate up to product subcategory sales, which further aggregate to product categories. In order to support decision-making at different levels of the hierarchy, a challenging task is the generation of coherent forecasts: forecasts of the individual series must sum up properly across the levels, preserving the hierarchical structure.
sales data and has noisy and intermittent bottom-level series; the second one comes from electricity demand data and has more regular bottom-level series. Our numerical experiments show that in both cases our method increases the average forecasting accuracy of the hierarchy, outperforming state-of-the-art approaches. The rest of the paper is organized as follows. Section 2 discusses the concept of hierarchical time series and the methods of hierarchical forecasting. Section 3 contains the details of the proposed machine learning algorithm. Section 4 describes the basic forecasting methods employed in the hierarchical models and the experimental setup. Section 5 discusses the datasets and the numerical experiments conducted to evaluate the proposed method. Finally, Section 6 concludes the paper.
In matrix notation, the observations of the whole hierarchy at time $t$ satisfy
$$y_t = S\, y_t^{K-1},$$
where the summing matrix $S$ has entries belonging to $\{0, 1\}$ and size $M \times m_{K-1}$.
Given observations at time t = 1, ..., T and the forecasting horizon h,
the aim is to forecast each series at each level at time t = T + 1, ..., T + h.
The current methods of forecasting hierarchical time series are: top-down,
bottom-up, middle-out and optimal reconciliation [9]. The main objective
of such approaches is to ensure that forecasts are coherent across all the
levels of the hierarchy. Regardless of the methods used to forecast the time
series for the different levels of the hierarchy, the individual forecasts must
be reconciled to be useful for any subsequent decision making. Forecast
reconciliation is the process of adjusting forecasts to make them coherent.
By definition, a forecast is coherent if it satisfies the aggregation constraints
defined by the summing matrix.
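To make these aggregation constraints concrete, here is a minimal Python sketch (a hypothetical hierarchy with one total and two bottom-level series, not taken from the paper) of a summing matrix and a coherence check:

```python
import numpy as np

# Hypothetical hierarchy: one top-level series aggregating two bottom-level series.
# Each row of S maps the bottom-level vector to one series of the hierarchy.
S = np.array([
    [1, 1],   # level 0: total = bottom_1 + bottom_2
    [1, 0],   # bottom_1
    [0, 1],   # bottom_2
])

def is_coherent(y_all, S, tol=1e-6):
    """Forecasts stacked as y_all (size M) are coherent if y_all = S @ y_bottom,
    where y_bottom is the bottom-level block (the last m_{K-1} entries)."""
    y_bottom = y_all[-S.shape[1]:]
    return np.allclose(y_all, S @ y_bottom, atol=tol)

print(is_coherent(np.array([30.0, 10.0, 20.0]), S))  # True:  30 = 10 + 20
print(is_coherent(np.array([31.0, 10.0, 20.0]), S))  # False: incoherent
```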
In the PHA approach, the proportions are determined in the following manner:
$$p_i = \frac{\sum_{t=1}^{T} y_{t,i}^{K-1}/T}{\sum_{t=1}^{T} y_t^{0}/T}, \qquad i = 1, \ldots, m_{K-1}.$$
For these two methods, once the bottom-level h-step-ahead forecasts have
been generated, these are aggregated to generate coherent forecasts for the
rest of the series of the hierarchy by using the summing matrix. Given the
vector of proportions p, top-down approaches can be represented as:
$$\tilde{y}_h = S\, p\, \hat{y}_h^{0}.$$
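As an illustrative sketch of this step (PHA proportions shown; array shapes and values are hypothetical):

```python
import numpy as np

def pha_proportions(Y_bottom, y_top):
    """Proportions of historical averages: the average of each bottom-level
    series divided by the average of the top-level series."""
    return Y_bottom.mean(axis=0) / y_top.mean()

def top_down(S, p, y_top_forecast):
    """Distribute a top-level forecast over the hierarchy: y~ = S p yhat0."""
    return S @ (p * y_top_forecast)

# Hypothetical history: T = 4 observations of two bottom-level series.
Y_bottom = np.array([[8.0, 2.0], [9.0, 3.0], [7.0, 1.0], [8.0, 2.0]])
y_top = Y_bottom.sum(axis=1)
S = np.array([[1, 1], [1, 0], [0, 1]])

p = pha_proportions(Y_bottom, y_top)        # [0.8, 0.2]
print(top_down(S, p, y_top_forecast=10.0))  # [10., 8., 2.]: coherent by construction
```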
2.4. Optimal Reconciliation
In [9] Hyndman et al. (2011) propose a novel approach that provides op-
timal forecasts that are better than forecasts produced by either a top-down
or a bottom-up approach. Their proposal is based on independently forecast-
ing all series at all levels of the hierarchy and then using a linear regression
model to optimally combine and reconcile these forecasts. Their approach is
based on a generalized least squares estimator that requires an estimate of the
covariance matrix of the errors that arise due to incoherence. In [19], Wickramasuriya et al. (2019) show that this matrix is impossible to estimate in practice, and they propose a state-of-the-art forecast reconciliation approach, called Minimum Trace (MinT), that incorporates the information from a full covariance matrix of forecast errors in obtaining a set of coherent forecasts. MinT minimizes the mean squared error of the coherent forecasts across the entire hierarchy under the constraint of unbiasedness. The resulting revised forecasts are coherent, unbiased, and have minimum variance amongst all combination forecasts. An advantage of the optimal reconciliation approach is that it allows for the correlations between the series at each level, using all the available information within the hierarchy. However, it is computationally more expensive than the other methods introduced so far, because it requires individually forecasting all the time series at all the levels of the hierarchy.
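For intuition, the reconciliation step can be sketched as follows (a generic GLS-type combination; the actual MinT estimator replaces W with a shrinkage estimate of the forecast error covariance, and with W = I this reduces to an OLS reconciliation):

```python
import numpy as np

def reconcile(S, W, y_hat):
    """Revised forecasts y~ = S (S' W^-1 S)^-1 S' W^-1 y_hat: the coherent
    forecasts closest (in the W^-1 metric) to the incoherent base forecasts."""
    W_inv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
    return S @ (G @ y_hat)

S = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
y_hat = np.array([31.0, 12.0, 18.0])   # incoherent base forecasts: 12 + 18 != 31
print(reconcile(S, np.eye(3), y_hat))  # coherent revised forecasts
```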
$$y_t^{k+1,j} = \left(y_{t,1}^{k+1,j}, \ldots, y_{t,m_{k+1}^{j}}^{k+1,j}\right)$$
Step 2 In the disaggregation or test phase, forecasts $\hat{y}_{t,j}^{k,p}$ relative to the time period of the test set are generated by the model $F^{*}$. Finally, these forecasts are fed to the trained NND in order to produce the disaggregated forecasts $\hat{y}_t^{k+1,j}$ for the test set.
In general, the learned function f generates base forecasts that are not
coherent since they do not sum up correctly according to the structure of the
hierarchy. In order to ensure that forecasts are reconciled across the hierar-
chy, we want f to output a set of forecasts that are as close as possible to the
base forecasts, but also meet the requirement that forecasts at upper levels
in the hierarchy are the sum of the associated lower level forecasts. From
an optimization perspective, we want to introduce an equality constraint to
the regression problem in such a way that we can still use backpropagation
to train the network. More in detail, we are looking for the network weights
such that the Mean Squared Error (MSE) between the true values and the
predictions is minimized and, in addition, we want the following constraint
to hold:
$$y_{t,j}^{k,p} = \mathbf{1}^{T} y_t^{k+1,j} = \mathbf{1}^{T} \hat{y}_t^{k+1,j} = \hat{y}_{t,j}^{k,p},$$
where $\mathbf{1}$ is the vector of all ones of size $m_{k+1}^{j}$.
We impose the aggregation constraint by adding a term to the MSE loss function that penalizes differences between the sum of the lower-level observations and the sum of the lower-level forecasts:
$$L\big(y_t^{k+1,j}, \hat{y}_t^{k+1,j}\big) = (1-\alpha)\,\frac{1}{T}\sum_{t=1}^{T} \big\|y_t^{k+1,j} - \hat{y}_t^{k+1,j}\big\|^{2} + \alpha \sum_{t=1}^{T} \big(\mathbf{1}^{T} y_t^{k+1,j} - \mathbf{1}^{T} \hat{y}_t^{k+1,j}\big)^{2}, \qquad (1)$$
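A minimal TensorFlow sketch of the loss (1), assuming the network's targets are the lower-level series and the batch plays the role of the $T$ training periods, could look as follows:

```python
import tensorflow as tf

def coherent_mse_loss(alpha=0.5):
    """Loss (1): (1 - alpha) times the MSE between lower-level targets and
    forecasts, plus alpha times the squared gap between their sums
    (the aggregation penalty)."""
    def loss(y_true, y_pred):
        # First term: mean over time of the squared Euclidean error.
        mse = tf.reduce_mean(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1))
        # Second term: squared difference between the sums of the lower-level
        # observations and forecasts, accumulated over time.
        agg = tf.reduce_sum(
            tf.square(tf.reduce_sum(y_true, axis=-1) - tf.reduce_sum(y_pred, axis=-1))
        )
        return (1.0 - alpha) * mse + alpha * agg
    return loss

# model.compile(optimizer="adam", loss=coherent_mse_loss(alpha=0.5))
```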
[Figure: the two-step procedure. Step 1: the forecasting model $F^{*}$ and the NND are trained on the training sets ($y_{t,j}^{k,p}$; $y_t^{k+1,j}$ and $x_{t,i}$). Step 2: the test-set forecasts $\hat{y}_{t,j}^{k,p}$ produced by $F^{*}$ are fed to the trained NND model.]
The idea is to balance the contribution of both terms by setting $\alpha = 0.5$, which corresponds to giving the two terms the same importance. In principle, the parameter $\alpha$ may be tuned on each instance. However, if the forecasts were equal to the true values, the coherence constraint would be satisfied, since the true values are coherent by construction. This explains why, even with $\alpha = 0$, the violation of coherence is in practice relatively small, and with $\alpha = 0.5$ the violation is basically zero in all our experiments. For this reason, we did not investigate the tuning of the parameter $\alpha$ and kept it fixed to 0.5.
Top-down approaches distribute the top-level forecasts down the hier-
archy using historical or forecasted proportions of the data. In our case,
explicit proportions are never calculated since the algorithm automatically
learns how to disaggregate forecasts of the top-level series to the bottom-
level series without loss of information. Furthermore, our method is flexible
enough to be employed in the forecasting process of the whole hierarchy in
two different ways:
1. Standard top-down: a forecasting model $F^{*}$ is developed for the aggregate at level 0, and a single disaggregation model NND is trained with the series at levels 0 and $K-1$. Therefore, forecasts for the bottom-level series are produced by looking only at the aggregated series at level 0. Then, the bottom-level forecasts are aggregated to generate coherent forecasts for the rest of the series of the hierarchy.
2. Iterative top-down: the forecasting model $F^{*}$ for an aggregate at level $k$ is the disaggregation model NND trained with the series at levels $k-1$ and $k$, for each $k = 1, \ldots, K-1$. At level 0, instead, $F^{*}$ is the best model selected among a set of standard forecasting methods. Forecasts for all the levels are then obtained by feeding forecasts to the disaggregation models at each level.
The difference between the two approaches is that in the standard top-down, bottom-level forecasts are generated with only one disaggregation model, whereas in the iterative version a larger number of disaggregation models is trained, one for each series to be disaggregated. To be more precise, to disaggregate the $m_k$ series at levels $k = 0, \ldots, K-2$, exactly $m_k$ disaggregation models are trained in parallel. In this way, on the one hand we increase the variance of the approach (and the computational time), but on the other hand we reduce the bias, since we increase flexibility and better account for the variability at the different levels.
We also notice that this algorithm can be easily plugged into a middle-out
strategy: a forecasting model is developed for each aggregate at a convenient
level, and the disaggregation models are trained and tested to distribute these
forecasts to the series below. For the series above the middle level, coherent
forecasts are generated using the bottom-up approach.
Regarding the choice of the neural network architecture, our objective
is to include in the model the relationship between explanatory variables
derived from the lower level series, and the features of the aggregate series
that describe the structure of the hierarchy. In order to better capture the
structure of the hierarchy, we use a Convolutional Neural Network (CNN).
Indeed CNNs are well known for creating useful representations of time series
automatically, being highly noise-resistant models, and being able to extract
very informative, deep features, which are independent from time [13]. Our
model is a deep neural network that is capable of accepting and combining
multiple types of input, including cross-sectional and time series data, in a
single end-to-end model. Our architecture is made up of two branches: the
first branch is a simple Multi-Layer Perceptron (MLP) designed to handle
the explanatory variables xt,i such as price, promotions, day of the week,
or in general, special events affecting the time series of interest; the second
branch is a one-dimensional Convolutional Neural Network that extracts feature maps over fixed segments of the aggregate series $y_{t,j}^{k,p}$. Features extracted
from the two subnetworks are then concatenated together to form the final
input of the multi-output regression model (see Figure 3). The output layer
of the model is a standard regression layer with linear activation function
where the number of units is equal to the number of the series to forecast.
Figure 3: Our model has one branch that accepts the numerical data (left) and another
branch that accepts time series data (right).
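A minimal Keras sketch of this two-branch architecture (layer counts and sizes are illustrative, not the paper's exact configuration; the input dimensions are hypothetical):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_nnd(n_features, window, n_outputs, filters=32):
    # Branch 1: MLP on the explanatory variables x_{t,i}
    # (price, promotions, calendar dummies, special events).
    x_in = layers.Input(shape=(n_features,))
    x = layers.Dense(64, activation="relu")(x_in)
    x = layers.Dense(32, activation="relu")(x)

    # Branch 2: 1-D CNN extracting feature maps from fixed-length
    # segments of the aggregate series y_{t,j}^{k,p}.
    s_in = layers.Input(shape=(window, 1))
    s = layers.Conv1D(filters, kernel_size=3, activation="relu")(s_in)
    s = layers.Conv1D(filters, kernel_size=3, activation="relu")(s)
    s = layers.GlobalAveragePooling1D()(s)

    # Concatenated features feed a linear regression layer with one
    # output unit per lower-level series to forecast.
    z = layers.Concatenate()([x, s])
    out = layers.Dense(n_outputs, activation="linear")(z)
    return Model(inputs=[x_in, s_in], outputs=out)

model = build_nnd(n_features=10, window=30, n_outputs=42)
model.compile(optimizer="adam", loss="mse")  # or the penalized loss (1)
```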
4. Experimental Setup
In this section, we first summarize the forecasting models used to generate base forecasts for the hierarchical approaches; we then describe our strategy for selecting the best forecasting model and give the implementation details.
1. Naive
2. Autoregressive Integrated Moving Average (ARIMA)
3. Exponential Smoothing (ETS)
4. Non-linear autoregression model (NAR)
5. Dynamic regression models: univariate time series models, such as lin-
ear and non-linear autoregressive models, allow for the inclusion of
information from past observations of a series, but not for the inclusion
of other information that may also affect the time series of interest.
Dynamic regression models allow taking into account the time-lagged
relationship between the output and the lagged observations of both
the time series itself and of the external regressors. More in detail, we
consider two types of dynamic regression models:
(a) ARIMA model with exogenous variables (ARIMAX)
(b) NAR model with exogenous variables (NARX)
In the literature it has been pointed out that the performance of forecasting
models could be improved by suitably combining forecasts from standard
approaches [18]. An easy way to improve forecast accuracy is to use sev-
eral different models on the same time series, and to average the resulting
forecasts. We consider two ways of combining forecasts:
2. Constrained Least Squares Regression: in this setting, the composed
forecast is not a function of m only as in the simple average but is a
linear function of the individual forecasts whereby the parameters are
determined by solving an optimization problem. The approach pro-
posed by Timmermann in [18] minimizes the sum of squared errors
under some additional constraints. Specifically, the estimated coeffi-
cients βi are constrained to be non-negative and to sum up to one. The
weights obtained are easily interpretable as percentages devoted to each
of the individual forecasts. Given the optimal weights, the composed forecast is obtained as $\hat{y}_t = \sum_{i=1}^{m} \beta_i \hat{y}_{i,t}$ for $t = T+1, \ldots, T+h$. From the mathematical point of view, the following optimization problem needs to be solved:
$$\min_{\beta} \; \sum_{t=T+1}^{T+h} \Big(y_t - \sum_{i=1}^{m} \beta_i \hat{y}_{i,t}\Big)^{2} \quad \text{s.t.} \quad \beta_i \geq 0, \quad \sum_{i=1}^{m} \beta_i = 1.$$
Differently from the simple average, which does not need any training as the weights are a function of $m$ only, with this method we need to allocate a reserved portion of forecasts in order to train the meta-model. In particular, we consider the two following composite models:
1. Combination of ARIMAX, NARX and ETS forecasts obtained through
the simple mean.
2. Combination of ARIMAX, NARX and ETS forecasts obtained by solv-
ing the constrained least squares problem.
We choose to combine the two dynamic regression models with the ex-
ponential smoothing in order to take directly into account the effect of the
explanatory variables and the presence of linear and non-linear patterns in
the series.
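For concreteness, the constrained least squares combination can be sketched with SciPy (the array names are hypothetical; the weights are fit on the reserved portion of forecasts mentioned above):

```python
import numpy as np
from scipy.optimize import minimize

def combine_forecasts(Y_hat, y):
    """Find weights beta_i >= 0 with sum(beta) = 1 that minimize the squared
    error between the actuals y and the combined forecast Y_hat @ beta.
    Y_hat: (n_periods, m) individual forecasts; y: (n_periods,) actuals."""
    m = Y_hat.shape[1]
    result = minimize(
        lambda beta: np.sum((y - Y_hat @ beta) ** 2),
        x0=np.full(m, 1.0 / m),                       # start at the simple average
        bounds=[(0.0, 1.0)] * m,                      # beta_i >= 0
        constraints={"type": "eq", "fun": lambda b: b.sum() - 1.0},
        method="SLSQP",
    )
    return result.x

# beta = combine_forecasts(Y_hat_reserved, y_reserved)  # fit on reserved data
# y_combined = Y_hat_test @ beta                        # composed forecast
```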
Each time series is split into training (in-sample) and test (out-of-sample) data. The training data $(y_1, \ldots, y_N)$, a time series of length $N$, is used to estimate the parameters of a forecasting model, and the test data $(y_{N+1}, \ldots, y_T)$, which comes chronologically after the training set, is used to evaluate its accuracy.
To achieve a reliable measure of model performance, we implement on the
training set a procedure that applies a cross-validation logic suitable for time
series data. In the expanding window procedure described in [7], the model
is trained on a window that expands over the entire history of the time series
and it is repeatedly tested against a forecasting window without dropping
older data points. This method produces many different train/test splits
and the error on each split is averaged in order to compute a robust estimate
of the model error (see Figure 4). The implementation of the expanding
window procedure requires four parameters:
- Starting window: the number of data points included in the first train-
ing iteration.
- Ending window: the number of data points included in the last training
iteration.
- Expanding steps: the number of data points added to the training time series from one iteration to another.
- Forecasting window: the number of data points to forecast at each iteration (the forecasting horizon $h$).
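The split generation of this procedure could be sketched as follows (parameter names mirror the list above; the values in the example are illustrative):

```python
def expanding_window_splits(n_obs, starting, ending, step, horizon):
    """Yield (train_indices, test_indices) pairs: the training window grows
    from `starting` to `ending` observations in increments of `step`, and
    each split is evaluated on the following `horizon` points."""
    train_end = starting
    while train_end <= ending and train_end + horizon <= n_obs:
        yield range(train_end), range(train_end, train_end + horizon)
        train_end += step

# Example: 4 years of daily data, 3-year starting window, weekly steps, h = 7.
errors = []
for train_idx, test_idx in expanding_window_splits(1461, 1096, 1454, 7, 7):
    pass  # fit on train_idx, forecast test_idx, append the error to `errors`
# The model error is the average of `errors` over all splits.
```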
For each series, the best performing model after the cross-validation phase
is retrained using the in-sample data and forecasts are obtained recursively
over the out-of-sample period. The above procedure requires a forecast error
measure. We consider the Mean Absolute Scaled Error (MASE) proposed by
Hyndman and Koehler in [11]:
$$\mathrm{MASE} = \frac{\frac{1}{h}\sum_{i=T+1}^{T+h} |y_i - \hat{y}_i|}{\frac{1}{T-m}\sum_{t=m+1}^{T} |y_t - y_{t-m}|},$$
where $m$ is the seasonal period.
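A direct transcription of this metric (a hypothetical helper, with m the seasonal period used by the naive benchmark in the denominator):

```python
import numpy as np

def mase(y_train, y_test, y_pred, m=1):
    """Mean Absolute Scaled Error: the out-of-sample MAE scaled by the
    in-sample MAE of the seasonal naive forecast with period m."""
    mae_forecast = np.mean(np.abs(y_test - y_pred))
    mae_naive = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae_forecast / mae_naive
```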
Figure 4: Expanding window procedure.
4.3. Implementation
The time series models described above are implemented using the “forecast” package in R [10]. Hierarchical time series forecasting is performed using the “hts” package in R [8]. For the optimal combination approach we
use the MinT(Shrink) algorithm that estimates the covariance matrix of the
base forecast errors using shrinkage. The proposed disaggregation method
is implemented in Python with TensorFlow, a large-scale machine learning
framework [1]. Regarding the training details of the NND, early stopping is
employed as a form of regularization to avoid overfitting since it stops the
training as soon as the error on the validation set starts to grow [3]. The
neural network is trained by minimizing the loss function (1) with Adam
optimizer [14], a mini-batch stochastic gradient descent algorithm. Grid
search is used to perform the hyperparameter optimization which is simply
an exhaustive search through a manually specified subset of points in the hy-
perparameter space of the neural network. Each configuration is evaluated
on the validation set and the optimal values are chosen. More in detail, the
hyperparameters we optimize are:
• the number of filters in {16, 32, 64} for the CNN subnetwork,
As for the number of layers, the MLP subnetwork has 3 fully connected layers
whereas the CNN subnetwork has 6 convolutional layers. The training time of a single disaggregation model is on the order of minutes on a commercial GPU, depending on the network dimension and on the granularity of the dataset.
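Putting these training details together, one possible sketch of the selection loop (reusing the hypothetical build_nnd and coherent_mse_loss sketches above; only the filter grid is taken from the text, the rest of the grid is assumed):

```python
import itertools
import tensorflow as tf

grid = {"filters": [16, 32, 64], "batch_size": [32, 64]}  # batch sizes assumed

best_val, best_cfg = float("inf"), None
for filters, batch_size in itertools.product(grid["filters"], grid["batch_size"]):
    model = build_nnd(n_features=10, window=30, n_outputs=42, filters=filters)
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss=coherent_mse_loss())
    # Early stopping halts training when the validation error starts to grow.
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                            restore_best_weights=True)
    # history = model.fit([X_train, A_train], Y_train, batch_size=batch_size,
    #                     validation_data=([X_val, A_val], Y_val),
    #                     epochs=200, callbacks=[stop], verbose=0)
    # val = min(history.history["val_loss"])
    # if val < best_val:
    #     best_val, best_cfg = val, {"filters": filters, "batch_size": batch_size}
```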
5. Numerical Experiments
In this section, we aim to evaluate the effectiveness of our approach, by
comparing it with all the hierarchical methods described in Section 2. In
order to be as fair as possible in the comparison, we perform model selection
among the set of forecasting methods described in Section 4 whenever a base
forecast is required. More in detail, this means that different methods may be
used for each time series of the hierarchy we are trying to forecast (the bottom-level series for the bottom-up approach, the aggregate series for the top-down approaches, all the time series for the optimal reconciliation approach). As for the metrics
used for comparison, we use both the MASE and the SMAPE where possible
(i.e. where no zeros are present). We consider two datasets, coming from two
completely different problems. In both cases, starting from the aggregated
series at some level, we aim to exploit our method to increase the forecasting
accuracy of the hierarchy using the characteristics of the aggregate series and
explanatory variables.
5.1. Datasets
1. Sales Data: we analyze sales data gathered from an Italian super-
market. The dataset consists of 118 daily time series, representing the
demand of pasta from 01/01/2014 to 31/12/2018. Besides univariate time series data, the quantity sold is complemented with information on the presence or absence of a promotion (no detail on the type of promotion or on the final price is given). Demand time series can be naturally
arranged to follow a hierarchical structure. Here, the idea is to build a
3-level structure: at the top of the hierarchy, there is the total or the
store-level series obtained by aggregating the brand-level series. At the
second level there are the brand-level series (like for instance Barilla)
obtained by aggregating the individual demand at the item-level and at
the third level there are the most disaggregated time series represent-
ing the item-level demand (for example the demand of spaghetti Bar-
illa). The completely aggregated series at level 0 is disaggregated into
4 component series at level 1. Each of these series is further subdivided
into 42, 45, 10 and 21 series at level 2, the completely disaggregated
bottom-level representing the different varieties of pasta for each brand
(see Table 1).
2. Electricity Data: we analyze a public electricity demand dataset that
contains power measurements and meteorological forecasts relative to
a set of 24 power meters installed in low-voltage cabinets of the dis-
tribution network of the city of Rolle in Switzerland [16]. The dataset
contains measurements from 13/01/2018 to 19/01/2019 at the resolu-
tion of 10 minutes and includes mean active and reactive power, voltage
magnitude, maximum total harmonic distortion for each phase, voltage
frequency and the average power over the three phases. We assume that
the grid losses are not significant, so the power at the grid connection
is the algebraic sum of the connected nodes. Based on the historical
measurements, the operator can determine coherent forecasts for the whole grid by generating forecasts for the nodal injections individually.
We build a 2-level hierarchy in which we aggregate the 24 series of the
distribution system at the meter-level to generate the total series at
the grid-level (see Table 2).
Level Number of series Total series per level
Store 1 1
Brand 4 4
Item 42, 45, 10, 21 118
Table 1: Hierarchy for the sales data.
Summarizing, we have the first dataset with a three level hierarchy and
the second one with a two level hierarchy. As for the experimental setup, we
have to make some choices for each dataset:
1. Sales Data: for each series, as explanatory variables, we add a binary
variable representing the presence of promotion if the disaggregation
is computed at the item-level or a variable representing the relative
number of items in promotion for each brand if the disaggregation is
computed at the brand-level. In both cases, dummy variables repre-
senting the day of the week and the month are also added to the model.
As for the number of lagged observations of the aggregate demand, we
consider fixed-length time windows of 30 days with a hop size of 1 day.
We consider 4 years from 01/01/2014 to 31/12/2017 for the in-sample
period and the last year of data from 01/01/2018 to 31/12/2018 for the
out-of-sample period. The experimental setup for the cross-validation
procedure is as follows. The starting window consists of the first three years of data, from 01/01/2014 to 31/12/2016. The training window expands over the last year of the training data, including daily observations from 01/01/2017 to 31/12/2017. The forecasting window is set
to h = 7, corresponding to a forecasting horizon of one week ahead. At
each iteration, the training window expands by one week to simulate
a production environment in which the model is re-estimated as soon
as new data are available and to better mimic the practical scenario
in which retailing decisions are made every week. In order to evaluate
the forecasting accuracy at each level, for this hierarchy we use the av-
erage MASE, as recommended by Hyndman in [12], since most of the
item-level series are intermittent.
2. Electricity Data: for each series, we use the average power over the
three phases as target variable and the temperature, horizontal irra-
diance, normal irradiance, relative humidity, pressure, wind speed and
wind direction as explanatory variables. Dummy variables representing
the day of the week and the hour of the day are also added to the model.
As for the number of lagged observations of the aggregate power, we
consider fixed-length time windows of 144 observations with hop size of
10 minutes. We consider 6 months from 13/01/2018 to 13/06/2018 for
the training set and the last 6 months from 14/06/2018 to 13/01/2019
for the test set. The configuration of the cross-validation procedure is
as follows. The starting window consists of the first three months of
data from 13/01/2018 to 13/03/2018. At each iteration, the training
window expands by 24 hours over the last 3 months of the training data
including observations from 14/03/2018 to 13/06/2018. The forecast-
ing window is set to h = 144, corresponding to a forecasting horizon of
24 hours ahead. We evaluate the forecasting accuracy at each level by
using the average MASE and the average SMAPE over all the series of
that level, since there are no zero values in these time series.
5.2. Results
In Table 3 we compare the forecasting performance of our method at each
level, in both its versions, standard top-down (NND1) and iterative top-down
(NND2) with the bottom-up, average historical proportions, proportions of
historical averages, forecasted proportions and the optimal reconciliation ap-
proach (MinT). We stress that for all the top-down approaches the performance at the most aggregated level is equivalent, and the differences only emerge at the lower levels of the hierarchy, where the comparison is really of interest.
For the NND1, we directly forecast the demand at the item-level using
the aggregate demand at the store-level and then we aggregate the item-
level forecasts to obtain the brand-level forecasts. For the NND2, we train a
disaggregation model that generates the brand-level forecasts starting from
the store-level series and then one NND for each brand-level series to generate
forecasts for each item demand of the brand they belong to. In total, for the
entire hierarchy, we train one NND at the top-level and 4 NND in parallel at
the brand-level.
Average MASE
Level Bottom-up AHP PHA FP NND1 NND2 Optimal
Store 1.103 0.567 0.567 0.567 0.567 0.567 0.559
Brand 1.237 1.413 1.481 1.157 0.862 0.838 1.137
Item 1.057 0.934 0.943 0.891 0.745 0.811 0.893
Table 3: Average MASE for each aggregation level of sales data. In bold the best performing approach.
Average MASE
Level Bottom-up AHP PHA FP NND Optimal
Grid 1.095 1.007 1.007 1.007 1.007 1.087
Meter 1.089 1.371 1.365 1.143 0.956 1.112
Table 4: Average MASE for each aggregation level of the electricity dataset. In bold the
best performing approach.
In Table 4 we report the average MASE for each aggregation level of the electricity dataset and for each method. Note that here we only have
two levels, so that NND1 and NND2 coincide (which is why we call it only
NND in the tables). We find that all the top-down approaches perform
best at the grid-level. The good performance of the bottom-up method with
respect to the classical top-down approaches can be attributed to the fact
that the series have a strong seasonality, even at the bottom level. Our
NND clearly outperforms all the top-down methods, the bottom-up and the
optimal combination at the meter-level. Note that these conclusions hold
also looking at different metrics like MSE. In Figure 7 we show the forecasts
generated by the NND for some of the meter-level series. In order to identify
the pairs of forecasts which are significantly different from each other, we
perform pairwise t-test. Our method is significantly different from each of
the other methods (p-value < 10−3 for each test). The optimal combination
and the bottom-up methods are not significantly different (p-value = 0.599).
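Such a pairwise comparison can be sketched with SciPy's paired t-test (the error arrays below are hypothetical placeholders):

```python
import numpy as np
from scipy import stats

# Hypothetical per-series forecast errors of two methods on the same test set.
errors_nnd = np.array([0.96, 0.91, 1.02, 0.88, 0.95])
errors_mint = np.array([1.11, 1.05, 1.18, 1.02, 1.09])

# Paired t-test: do the two methods produce significantly different errors?
t_stat, p_value = stats.ttest_rel(errors_nnd, errors_mint)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```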
[Figures: true vs. forecast daily sales for the four brand-level series (Brand 1, Brand 2, Brand 3, Brand 4) over the out-of-sample year; axes: Sales vs. Day.]
[Figure 7: true vs. forecast power for four of the meter-level series (Meter 11, Meter 16, Meter 18, Meter 19); axes: Power vs. Time.]
Summarizing, we get similar results on both datasets, and this is particu-
larly significant due to the different characteristics of the two datasets. In the
sales data, the bottom-level series are extremely noisy and hard to forecast
(as confirmed by the poor performance of the bottom-up method). On the
other hand, the electricity demand data display seasonality at the bottom-
level, as confirmed by the good performance of the bottom-up method. In
both cases, we generate coherent forecasts (in all our experiments the maximum violation of the aggregation constraint is less than $10^{-3}$) and we improve
the overall accuracy at any level of the hierarchy. This confirms the general
viability of our approach.
6. Conclusions
In this paper, we propose a machine learning method for forecasting hier-
archical time series. We formulate the disaggregation problem as a non-linear
regression problem and we solve it with a deep neural network. The network
architecture is able to jointly learn the structure of the hierarchy and gen-
erate coherent forecasts, thanks to the neural network’s ability to extract
meaningful features from the aggregate series and combine them with the
dynamics of the individual series. Furthermore, differently from top-down approaches, our method allows us to easily incorporate any external information that affects the time series to disaggregate, with no loss of information.
Results demonstrate that our method not only increases the average forecasting accuracy of the hierarchy but also addresses the need for an automated procedure that generates coherent forecasts for many time series at the same time. Our procedure fulfills the need for scalable algorithms that automate the process of forecasting hierarchical time series with the aim of increasing the forecasting accuracy at every level. We stress that, differently from the recently proposed optimal forecast reconciliation approach, in our method forecast reconciliation is performed inside the learning process, without the need to generate base forecasts for all the series of the hierarchy. Like all top-down approaches, our method relies heavily on accurate top-level forecasts. However, this assumption is often satisfied in hierarchical time series, since the top-level series are in general periodic and less noisy (being the sum of many sub-level components) compared to the individual series at the bottom level. In summary, our machine learning approach uses all the relevant information available in the hierarchical structure. This is important, as particular aggregation levels may reveal hidden features of the data, not easily identifiable at other levels, that are of interest to the user and need to be modeled.
References
[1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C.,
Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfel-
low, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser,
L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray,
D., Olah, C., Schuster, M., & Shlens, J. (2015). TensorFlow: Large-scale
machine learning on heterogeneous systems.
[2] Athanasopoulos, G., Ahmed, R. A., & Hyndman, R. J. (2009). Hierar-
chical forecasts for Australian domestic tourism. International Journal
of Forecasting, 25, 146–166.
[3] Caruana, R., Lawrence, S., & Giles, L. (2000). Overfitting in neural nets:
Backpropagation, conjugate gradient, and early stopping. In Proceedings
of the 13th International Conference on Neural Information Processing
Systems NIPS’00 (pp. 381–387). MIT Press.
[4] Dunn, D. M., Williams, W. H., & Dechaine, T. L. (1976). Aggregate
versus subaggregate models in local area forecasting. Journal of the
American Statistical Association, 71, 68–71.
[5] Gross, C. W., & Sohl, J. E. (1990). Disaggregation methods to expedite
product line forecasting. Journal of Forecasting, 9, 233–254.
[6] Hyndman, R., & Athanasopoulos, G. (2014). Optimally reconciling fore-
casts in a hierarchy. Foresight: The International Journal of Applied
Forecasting, (pp. 42–48).
[7] Hyndman, R., & Athanasopoulos, G. (2018). Forecasting: principles
and practice. OTexts.
[8] Hyndman, R., Lee, A., Wang, E., & Wickramasuriya, S. (2018). hts:
Hierarchical and Grouped Time Series. R package version 5.1.5.
[9] Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L.
(2011). Optimal combination forecasts for hierarchical time series. Com-
putational Statistics & Data Analysis, 55, 2579–2589.
[10] Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series fore-
casting: the forecast package for R. Journal of Statistical Software, 26,
1–22.
[13] Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., & Muller,
P.-A. (2019). Deep learning for time series classification: a review. Data
Mining and Knowledge Discovery, 33, 917–963.
[14] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic opti-
mization.
[16] Nespoli, L., Medici, V., Lopatichki, K., & Sossan, F. (2019). Hierarchical
demand forecasting benchmark for the distribution grid.
[17] Shlifer, E., & Wolff, R. W. (1979). Aggregation and proration in fore-
casting. Management Science, 25, 594–603.