
International Journal of Production Economics 279 (2025) 109449
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/ijpe

Scalable probabilistic forecasting in retail with gradient boosted trees: A practitioner’s approach
Xueying Long a, Quang Bui a, Grady Oktavian b, Daniel F. Schmidt a, Christoph Bergmeir a,c,∗, Rakshitha Godahewa a, Seong Per Lee d, Kaifeng Zhao d, Paul Condylis d

a Department of Data Science and Artificial Intelligence, Monash University, Australia
b Data Science, Tokopedia, Indonesia
c Department of Computer Science and Artificial Intelligence, University of Granada, Spain
d Data Science, Tokopedia, Singapore

ARTICLE INFO

Keywords: Probabilistic forecasting; Gradient boosted trees; Global models; Disaggregation

ABSTRACT

The recent M5 competition has advanced the state-of-the-art in retail forecasting. However, there are important differences between the competition challenge and the challenges we face in a large e-commerce company. The datasets in our scenario are larger (hundreds of thousands of time series), and e-commerce can afford to have a larger stock assortment than brick-and-mortar retailers, leading to more intermittent data. To scale to larger dataset sizes with feasible computational effort, we investigate a two-layer hierarchy, namely the decision level with product unit sales and an aggregated level, e.g., through warehouse-product aggregation, reducing the number of series and degree of intermittency. We propose a top-down approach to forecasting at the aggregated level, and then disaggregate to obtain decision-level forecasts. Probabilistic forecasts are generated under distributional assumptions. The proposed scalable method is evaluated on both a large proprietary dataset, as well as the publicly available Corporación Favorita and M5 datasets. We are able to show the differences in characteristics of the e-commerce and brick-and-mortar retail datasets. Notably, our top-down forecasting framework enters the top 50 of the original M5 competition, even with models trained at a higher level under a much simpler setting.

1. Introduction

Forecasting plays an important role in decision-making processes. In the retail industry, accurate sales forecasting is crucial for different phases such as supply chain management (Fildes et al., 2022a,b) and inventory control (Kourentzes et al., 2020). Probabilistic forecasts, which quantify uncertainty about the future, are often essential in these cases, e.g., for determining the stock level and reorder points (do Rego and De Mesquita, 2015). However, effective uncertainty estimation is a challenging problem due to the fact that the series are often intermittent, i.e., a large percentage of entries are zero.

The recent M5 competition (Makridakis et al., 2021, 2022b) established the state of the art of retail forecasting, through both an accuracy track, which focused on point forecasting, and an uncertainty track, which focused on probabilistic forecasting. Many of the M5 findings are applicable to our situation; however, we observe that our use cases, drawn from a large Indonesian e-commerce retail company, exhibit some important differences from the challenges posed in the M5 competition. The two biggest differences we have identified are that the datasets in our application are often significantly larger and more intermittent than the datasets provided by the M5 competition. While the M5 has fewer than 50,000 time series, over half a million different types of products are purchased on the e-platform each day. Furthermore, the M5 data is derived from traditional brick-and-mortar retail situations, which have some important differences to the e-commerce setting; most notably, e-commerce platforms can typically afford to have a larger assortment of products available, and many of these products may have slow sales. This leads to a higher proportion of intermittent series, and thus a high level of overall intermittency in the data. In addition to handling the challenges presented by these differences, our aim is to develop an approach that is ready for production use, and as such involves additional constraints regarding robustness and execution time that were not an element of the M5 competition. It is important to mention that while promotions are often key drivers in retail forecasting, they are not a main consideration in the M5, as the data in this competition was taken from Walmart, which utilises an everyday low price strategy.

∗ Correspondence to: D3, UGR AI, Av. del Conocimiento, 37, 18016 Granada, Spain.
E-mail address: [email protected] (C. Bergmeir).

https://doi.org/10.1016/j.ijpe.2024.109449
Received 27 February 2024; Received in revised form 24 October 2024; Accepted 25 October 2024
Available online 5 November 2024
0925-5273/© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

They are also not relevant in our business use cases, so we do not consider them when developing the methodology in this paper.

Consequently, the main aim of our work is to adapt the best-performing M5 methodologies to the problem of forecasting in an e-commerce setting. The M5 is dominated by global models (Januschowski et al., 2020), which are learned across series. This has important consequences for scalability, as global models cannot be fitted in parallel as trivially as local models, which are embarrassingly parallelisable along the series dimension. As such, we require our methods to scale to datasets that are at least an order of magnitude larger than the M5 data. There are three immediate strategies to handle this:

Model simplification An obvious option is to try and use simpler models. However, by itself this does not guarantee the ability to train models with a feasible computational effort, and the resulting forecast accuracy may be poor due to the model simplification.

Data partitioning Data partitioning is an intuitive way of scaling global models. The global models are trained not in a truly ‘‘global’’ way, i.e., across all available series, but several such models are trained on subsets of the data. This is a popular processing step, and most competitors in the M5 subdivide the data in one way or another. One of the earliest papers proposing this procedure that we are aware of is Bandara et al. (2020), and later this idea is studied more systematically in Godahewa et al. (2021). However, subdividing the data is mainly done with the aim of improving accuracy, and cannot be seen as a step with the primary purpose of achieving scalability.

Training with less data Another option is to train with less data. One may simply omit part of the historical data and fit a model to a subsample. Additionally, if the data has a suitable hierarchical structure, we can train models at a higher level of the hierarchy using substantially fewer series (with a consequent reduction in intermittency), and then apply a top–down disaggregation strategy to obtain forecasts at lower levels.

Regardless of the strategy chosen, the forecasting must be done in a probabilistic manner. This usually involves modelling either via a parametric distribution assumption, or some more flexible non-parametric approach. Using the quantile loss function (Koenker and Bassett, 1978), probabilistic forecasts can be generated without distributional assumptions. However, a drawback of this approach is that separate models must be trained for each quantile of interest, which can make the process expensive when handling large datasets. Additionally, quantile crossing (Bassett and Koenker, 1982; He, 1997) can happen as a consequence of training quantiles separately, adding another layer of complexity. Compromises may also need to be made to ensure a feasible implementation; for example, having to train with reduced sample sizes. In contrast, parametric methods based on distributional assumptions (Snyder et al., 2012) are relatively straightforward to implement and apply in practice. They are faster and scale more readily to large datasets in comparison with non-parametric methods. More importantly, classical choices such as a Poisson or negative binomial distribution have useful mathematical properties (Steutel and Van Harn, 2003) that can be leveraged when scaling to large datasets.

In the M5 competition, tree-based methods were very successful, and most top competitors based their solutions on LightGBM (Ke et al., 2017), a highly efficient gradient boosted tree (GBT) algorithm. For example, the winning method in the accuracy track leveraged LightGBM by training on grouped data from multiple categories and combining the forecasts with equal weights (Makridakis et al., 2022b). Tree-based implementations such as LightGBM and XGBoost (Chen and Guestrin, 2016) are open source and highly flexible tools. As LightGBM offers fast training while maintaining predictive accuracy, it is generally considered a superior solution to other implementations of GBTs that yield lower accuracy with longer training times.

In this paper, we propose an efficient way of generating accurate and scalable forecasting systems. We make the most of a two-layer hierarchy of raw and aggregated data, and develop a top–down forecasting framework that is able to scalably predict with small computational effort while maintaining competitive accuracy. Instead of directly dealing with data on the decision level, we forecast with the aggregated series and disaggregate back in a top–down fashion according to historical proportions. Our forecasting framework is capable of generating accurate probabilistic forecasts with simple assumptions of distributions. The proposed approach is analysed on a proprietary e-commerce dataset, as well as the public Corporación Favorita dataset and the M5 competition dataset. As a notable side-product of this research, we have implemented a negative binomial loss function for LightGBM (Ke et al., 2017), for which the details are given in the Appendix.

The rest of this paper is organised as follows. Section 2 reviews the related work. Section 3 provides a comprehensive description of the proposed top–down forecasting framework. Section 4 explains the experimental setup. Section 5 reports the results and provides a further discussion. Section 6 concludes our work.

2. Related work

In this section, we cover relevant prior work; specifically, global, hierarchical, and probabilistic modelling strategies, and intermittent forecasting.

2.1. Modelling across series with global models

Global modelling (Januschowski et al., 2020) has received substantial recent attention in the forecasting community. All top contenders in the M5 were global models, and even before this, global models have shown strong performance in various Kaggle competitions (Bojer and Meldgaard, 2021). Under the global modelling paradigm, the available time series are pooled together and a single model is built across them, with shared parameters. As a global model is trained with more data, it can afford to be more complex, compared with traditional local per-series models in which each time series is viewed as a distinct dataset, and models are built for each series separately. Montero-Manso and Hyndman (2021) present some theoretical explanations for the superiority of global models over local models, and argue that no similarity or relatedness between series is necessary for global models to work well. Hewamalage et al. (2022) confirm these findings empirically and make them more nuanced in a simulation study. They argue that minimal assumptions on the relations between time series are necessary, as global models have the capacity to learn complex patterns and perform well even when the series are heterogeneous. One of the earliest and most prominent global models in the literature is DeepAR (Salinas et al., 2020), which is a global forecasting method based on autoregressive neural networks. It has demonstrated high forecasting accuracy on Amazon sales data, and can be considered a standard benchmark in retail forecasting. Other modelling choices can involve classical linear models, standard machine learning models such as LightGBM (Januschowski et al., 2021), and neural networks (Kunz et al., 2023). Consequently, we focus in our work on global models, as prior research has established their general superiority over local models in retail settings similar to the one under consideration in this work.


2.2. Hierarchical forecasting

Retail sales data is naturally organised in a hierarchical fashion, i.e., per-store product sales data at the bottom level can be combined according to product categories and regions. Typically, hierarchical forecasting is concerned with producing coherent forecasts across different levels of the hierarchy (for different decisions to be made, such as strategical, tactical, or operational decisions). Additionally, hierarchical forecasting methods have been used in the past to transport information between series, such as bringing seasonal patterns only emerging at higher levels of the hierarchy into the noisy bottom-level series forecasting. Classical approaches of hierarchical forecasting in the literature are top–down, bottom–up and middle-out methods (Hyndman et al., 2011), in which forecasts are produced on only a single level of the hierarchy and then aggregated up or disaggregated down, using historical (or otherwise obtained) proportions. More sophisticated alternatives include optimal reconciliation approaches (Hyndman et al., 2011), in which all series in the hierarchy are forecasted, and then, in a subsequent step, a reconciliation (optimisation) is performed to adjust the forecasts and make them coherent. The most recent methods combine forecasting and reconciliation into a single step, building global models that are able to produce reconciled forecasts directly. The most prominent methods in this space are HierE2E (Rangapuram et al., 2021), SHARQ (Han et al., 2021), HIRED (Paria et al., 2021), and PROFHIT (Kamarthi et al., 2022).

On the other hand, probabilistic hierarchical forecasting is a much more challenging problem as it requires, in theory, the distribution of the forecasts of an aggregated series to be the same as the distribution of the sum of the forecasts of its children series. This is difficult to achieve; for example, quantile forecasts produced at a certain level cannot simply be added together, or divided up, to derive forecasts on other levels. In contrast, point forecasts can be straightforwardly generated based on the summation constraint of the hierarchy. In the literature, different definitions of the coherence of probabilistic hierarchical forecasts have been provided. Taieb et al. (2017, 2020) define probabilistic coherence from the perspective of the convolution of the marginal predictive distributions of the children series. Panagiotelis et al. (2022) propose a more intuitive definition where the densities of children series should lie on a coherent subspace, and a similar notion can be found in Rangapuram et al. (2021). Han et al. (2021) explore the coherence of quantiles with a regularised quantile loss function. Kamarthi et al. (2022) propose a distributional coherency regularisation to ensure the distributional consistency of the entire hierarchy.

Our motivation for using a hierarchy differs from the usual use cases. We do not use the hierarchical structure from the perspective of reconciliation, and are not particularly interested in coherent forecasts for the entire hierarchy. Instead, we leverage the hierarchy as a way to scale the forecasts from more aggregated levels in the hierarchy, where fewer time series exist, to lower levels where the number of series and their intermittency hinder traditional forecasting techniques. Thus, the sophisticated methods from the literature are not directly applicable to our use case. We are interested in generating probabilistic forecasts in our application; however, as noted previously, quantile forecasts cannot be directly used to produce forecasts at other levels. This motivates us to explore distributional assumptions and properties that could potentially make the problem tractable. These are discussed in the next section.

2.3. Probabilistic forecasting for intermittent data

We categorise the existing probabilistic forecasting approaches into two main parts: non-parametric methods such as quantile regression and bootstrapping, and parametric methods under some distributional assumptions. A particularly flexible non-parametric technique is quantile regression. By utilising the pinball loss, quantile forecasts can be directly generated, and implementations are available in most open-source GBT frameworks. In this case, the modelling and training process needs to be repeated for each quantile of interest. For intermittent data, Lainder and Wolfinger (2022) propose a quantile forecasting method using LightGBM and data augmentation techniques; this technique achieved first place in the M5 uncertainty track. Bootstrapping has been utilised to solve intermittent forecasting problems (Willemain et al., 2004; Viswanathan and Zhou, 2008; Zhou and Viswanathan, 2011; Hasni et al., 2019) with some highlights in forecast accuracy, but it requires access to a large amount of historical data and potentially huge computational costs, both of which pose questions regarding plausibility in real-life problem settings (Syntetos et al., 2015). Using empirical in-sample quantiles is an especially simple way to generate probabilistic forecasts and has been found to work well in retail forecasting (Kolassa, 2016; Spiliotis et al., 2021; Kolassa, 2022). We employ this established method as a strong benchmark. Another method to turn point forecasts into probabilistic forecasts is level set forecasting (Hasson et al., 2021). Level-set forecasting first partitions the training set according to the predicted values obtained from a certain point forecaster. Then, when forecasting, the algorithm picks the closest set based on its point forecast and takes the corresponding true values of that set as distributional forecasts. However, level-set forecasting is a general algorithm that is not specifically designed to deal with the challenges caused by intermittent series.

On the other hand, parametric methods involve understanding, or making assumptions regarding, the characteristics of the historical data and the nature of the data generating process. Classical distributional choices for fitting retail data in the literature include the Poisson distribution (Heinen, 2003; Snyder et al., 2012) or the negative binomial distribution (Agrawal and Smith, 1996; Snyder et al., 2012), potentially mixed with zero-inflation (Lambert, 1992) and hurdle models (Cragg, 1971) to accommodate the excess zeros typical in this domain. Based on distributional assumptions, the relevant model parameters are learned empirically. Snyder et al. (2012) proposed a hurdle shifted Poisson model and introduced a dynamic state-space structure for both damped and undamped versions. de Rezende et al. (2021) extended this structure to the negative binomial distribution, and this technique achieved sixth place in the M5 uncertainty competition. Parameter estimation of such state-space models is often performed via maximum likelihood, frequently in conjunction with the expectation maximisation algorithm; these procedures can be computationally intensive. Kolassa (2016) studied a set of parametric methods with Poisson and negative binomial assumptions and applied these methods in a later paper to the M5 data (Kolassa, 2022). They emphasised the consideration of over-dispersion in retail forecasting, which is in line with the parametric methods studied in Spiliotis et al. (2021). However, these works only focus on local methods, and did not consider ways of scaling up the forecasting process.

Unlike many machine learning algorithms which are only capable of producing a single output, the generalised additive model for location, scale and shape (GAMLSS, Stasinopoulos and Rigby, 2007) approach can produce estimates for all relevant parameters of the assumed distribution. Ziel (2021) applied this approach to the M5 dataset with different distribution assumptions, including a zero-inflated Poisson distribution. A major pitfall of GAMLSS is the huge computational cost; to deal with this, models are trained only on subsamples in that work. DeepAR generates probabilistic forecasts based on distributional assumptions. For example, a negative binomial distribution can be chosen for count data, with both mean and shape parameters produced as the outputs of the neural network. Following the literature, we consider the Poisson distribution and the negative binomial distribution, as mixed distributions require extra parameters which can bring with them additional complexity during the modelling process.


Moreover, these two distributions are characterised as being infinitely divisible (Steutel and Van Harn, 2003); for example, a Poisson random variable can be expressed as the sum of an arbitrary number of independent Poisson random variables. In this case, we can decompose the aggregated level forecasts and generate probabilistic forecasts based on the distributions for both layers. Olivares et al. (2021) tested Poisson mixtures from a perspective of hierarchical reconciliation while modelling with a deep neural network. In this paper, we also examine negative binomial mixtures, as the addition of a dispersion parameter can introduce more modelling flexibility.

Fig. 1. The demand classification scheme and cutoff values used in this work (Syntetos et al., 2005).

3. Methodology

As outlined earlier, our methodology consists of improvements to the state of the art in retail forecasting, specifically to address issues regarding large amounts of data and intermittency in the training series. In particular, we propose a methodology consisting of the following two components: (1) a data partitioning step commonly used in retail settings (see Section 3.1); and (2) a hierarchical top–down approach to forecasting, in which we forecast the top level series and disaggregate the forecasts to the lower level series (see Section 3.2).

3.1. Demand classification

Following the scheme proposed by Syntetos et al. (2005), we classify the time series into one of four groups: smooth, erratic, lumpy, and intermittent. This is done according to the average demand interval (ADI) and the coefficient of variation squared (CV²) of the series:

$\mathrm{ADI} = \dfrac{\text{Days available since first sale}}{\text{Days with sale}}$,   (1)

$\mathrm{CV}^2 = \left(\dfrac{\text{Standard deviation of daily sales}}{\text{Mean of daily sales}}\right)^2$.   (2)

Specifically, we dichotomise the ADI and CV² values for the series using thresholds of 1.32 and 0.49 respectively, yielding four distinct categories (see Fig. 1). Even though these threshold values were originally proposed as an optimal method for choosing between simple exponential smoothing and a modified Croston’s method (Syntetos and Boylan, 2005), methods which we are not using in our work, we employ these threshold values as they are well established in the literature.

Naturally, there are certain limitations to such a classification scheme. The use of hard cutoffs means that series which are inherently similar, but have ADI and CV² values close to the thresholds, may fall into different categories. Furthermore, the classification is usually performed in a one-off manner and thus may not be accurate if there is a shift in the characteristics of the series over time. However, these are common problems affecting any type of hard classification, and we argue this type of partitioning is suitable for our work as it is the most established method used in the literature. We also argue that such a partitioning is in fact necessary; this is because the series in the different classes described above have characteristics which make them behave quite differently in terms of forecasting. Smooth and erratic series tend to have larger values on average by definition. If we evaluate all series together, they are likely to dominate the error measure when using scaled metrics. Likewise, for a scale-free measure, the intermittent and lumpy series will typically contribute a very large part of the overall error, as their values are generally smaller, which makes them more difficult to forecast in relative terms, due to the integer nature of the series. Thus, we perform a demand classification and evaluate using scaled metrics for each group separately.
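For concreteness, the following is a minimal sketch of the classification step defined by Eqs. (1)-(2) and the cutoffs in Fig. 1. It assumes that the CV² is computed over all days since the first sale, as Eq. (2) suggests, and the function name and toy series are purely illustrative.

import numpy as np

def classify_demand(sales, adi_cut=1.32, cv2_cut=0.49):
    # `sales`: daily unit sales of one series, starting at the day of its first sale.
    sales = np.asarray(sales, dtype=float)
    adi = len(sales) / max((sales > 0).sum(), 1)        # Eq. (1)
    cv2 = (sales.std() / sales.mean()) ** 2             # Eq. (2)
    if adi < adi_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"

print(classify_demand([3, 0, 0, 0, 1, 0, 0, 2, 0, 0]))  # -> 'lumpy' for this toy series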
3.2. Top–down distributional forecasting framework

We can form a two-layer hierarchy by aggregating the series at the decision level, denoted as level $L$, based on the product hierarchy, to an aggregated level, denoted as level $A$. The constructed two-layer hierarchy is illustrated in Fig. 2. At each time point $t$, a series $j$ at level $A$, denoted as $A_{t,j}$, can be constructed as the sum of the corresponding $n_j$ series at level $L$. $L_{t,j,i}$ is used to denote a series $i$ at level $L$ at time $t$, where $j$ matches the $j$th series at level $A$ in the hierarchy. Thus the relation

$A_{t,j} = \sum_{i=1}^{n_j} L_{t,j,i}$

is always satisfied.
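As a small illustration of how the level-$A$ series can be formed from decision-level data, a pandas sketch is given below; the column names are hypothetical stand-ins for the identifiers of a particular dataset.

import pandas as pd

# Toy decision-level (level L) data: one row per (date, catalogue, product).
sales_L = pd.DataFrame({
    "date":      pd.to_datetime(["2021-05-01"] * 3 + ["2021-05-02"] * 3),
    "catalogue": ["c1", "c1", "c2", "c1", "c1", "c2"],
    "product":   ["p1", "p2", "p3", "p1", "p2", "p3"],
    "sales":     [0, 2, 1, 3, 0, 0],
})

# Level A: sum the n_j child series of each catalogue at every time point t.
sales_A = sales_L.groupby(["catalogue", "date"], as_index=False)["sales"].sum()
print(sales_A)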
Fig. 2. An illustration of the two-layer hierarchical structure.

We are interested in producing forecasts at the decision level $L$. Based on the two-layer hierarchy introduced above, we first train global models at level $A$, a higher level at which the data are less intermittent and the number of series to forecast is feasible. We then disaggregate and produce forecasts recursively for the entire horizon. Any off-the-shelf global forecasting model can be used in this framework to generate point forecasts at level $A$; that is, the top–down distributional framework is model-agnostic. In this work, we use LightGBM models and linear models. For time point $t$ in the horizon $h$, we denote the point forecast (conditional mean) at the aggregated level for series $j$ by $\hat{A}_{t,j}$.

The proposed forecasting framework consists of four steps. At each time point in the forecast horizon, we (1) point-forecast the values at the aggregated level $A$ using the predicted conditional means; (2) estimate the parameter(s) of the distributions at the aggregated level; (3) obtain the historical proportion of lower-to-higher level sales, and disaggregate to obtain the lower level $L$ point forecast; and (4) estimate the parameter(s) of the distributions at the lower level. In this section, we start by introducing the distribution properties and then discuss each step in detail.

3.2.1. Distribution properties and forecasting

Poisson forecasts. We assume that sales are realisations of either Poisson or negative binomial random variables. For Poisson distributed sales, $X \sim \mathrm{Poisson}(\lambda)$ with rate parameter $\lambda$. Once we have the point forecast (i.e., the estimated conditional mean), the parameter $\lambda$ can be estimated using the point forecast, as the maximum likelihood estimate of $\lambda$ is simply the sample mean, i.e., $\hat{\lambda}_{A_{t,j}} = \hat{A}_{t,j}$. We can then produce distributional forecasts according to the probability model

$A_{t,j} \sim \mathrm{Poisson}(\hat{\lambda}_{A_{t,j}})$

in the usual fashion.

Negative-binomial forecasts. Consider a random variable $X \mid \lambda \sim \mathrm{Poisson}(\lambda)$ that conditionally follows a Poisson distribution, and let $\lambda$ be a Gamma distributed random variable, i.e.,


$\lambda \sim \mathrm{Gamma}\left(r, \dfrac{1-p}{p}\right)$,

where $\mathrm{Gamma}(\alpha, \beta)$ denotes a Gamma distribution with shape $\alpha$ and scale $\beta$. Then, the random variable $X$ is marginally distributed as per a negative binomial distribution $X \sim \mathrm{NB}(r, p)$ (Hilbe, 2011), with the probability mass function given by

$P(x \mid r, p) = \dfrac{\Gamma(r+x)}{\Gamma(r)\,\Gamma(x+1)}\, p^{r} (1-p)^{x}$.

We can view the negative binomial distribution as an extension of the Poisson distribution. The relationship between the mean and variance of the negative binomial random variable, and the parameters $r$ and $p$, is

$p = \dfrac{\mathrm{E}[X]}{\mathrm{V}[X]} \quad \text{and} \quad r = \mathrm{E}[X]\left(\dfrac{p}{1-p}\right)$.   (3)

Since $0 \le p \le 1$, the variance of the negative binomial distribution is greater than its mean; this is known as over-dispersion. To produce a distributional forecast for observation $t$ in series $j$ at the aggregate level (i.e., $A_{t,j}$), we substitute the sample variance of sales over the series $A_j$ (i.e., $\hat{\mathrm{V}}[A_j]$), and the mean forecast for observation $A_{t,j}$ (i.e., $\hat{A}_{t,j}$), for the population variance and mean in (3), respectively, i.e., we use the method-of-moments estimator to obtain parameter estimates:

$\hat{p}_{A_{t,j}} = \dfrac{\hat{A}_{t,j}}{\hat{\mathrm{V}}[A_j]} \quad \text{and} \quad \hat{r}_{A_{t,j}} = \hat{A}_{t,j}\left(\dfrac{\hat{p}_{A_{t,j}}}{1-\hat{p}_{A_{t,j}}}\right)$.   (4)

We can then produce distributional forecasts for $A_{t,j}$ using the estimated negative binomial distribution

$A_{t,j} \sim \mathrm{NB}\left(\hat{p}_{A_{t,j}}, \hat{r}_{A_{t,j}}\right)$

in the usual fashion.
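A minimal sketch of the moment-matching step in Eq. (4) is given below; the function name is illustrative, and the guard for series that are not over-dispersed anticipates the Poisson fallback discussed in Section 3.2.3.

def nb_params_from_moments(mean_forecast, sample_var):
    # Eq. (4): p = mean / variance, r = mean * p / (1 - p).
    # Returns None when the series is not over-dispersed (variance <= mean);
    # in that case a Poisson model with rate `mean_forecast` is used instead.
    if sample_var <= mean_forecast:
        return None
    p = mean_forecast / sample_var
    r = mean_forecast * p / (1.0 - p)
    return p, r

# e.g. an aggregated series with point forecast 12.4 and sample variance 30.1
print(nb_params_from_moments(12.4, 30.1))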
3.2.2. Disaggregation

The disaggregation process is performed by weighting the sales forecasts by the historical proportion-of-contribution to the aggregate series $A_j$. This proportion $\rho_{j,i}$ is calculated by

$\rho_{j,i} = \dfrac{\sum_{t=1}^{T} L_{t,j,i}}{\sum_{t=1}^{T} A_{t,j}}, \quad i = 1, \ldots, n_j$,   (5)

where $T$ is the timestamp of the last observation in the training set for aggregate series $A_j$. The point forecasts at the lower levels, $\hat{L}_{t,j,i}$, are then given by

$\hat{L}_{t,j,i} = \rho_{j,i}\, \hat{A}_{t,j}$,   (6)

i.e., the proportion of the aggregate point-forecast attributed to series $i$.

3.2.3. Parameter estimation for lower level series

Poisson forecasts. Poisson random variables are infinitely divisible, that is, they can be decomposed into a sum of arbitrarily many independent Poisson random variables (Steutel and Van Harn, 2003). We use this assumption to obtain the probabilistic forecasts for level $L$, as they are assumed to come from the same distributional family as the corresponding aggregated level series. Despite the fact that the lower level series could potentially be cross-related in reality, we decompose the aggregated forecasts under a simplifying independence assumption. Under the Poisson assumption, the lower-level observation $L_{t,j,i}$ follows

$L_{t,j,i} \sim \mathrm{Poisson}(\hat{\lambda}_{L_{t,j,i}})$,

where $\hat{\lambda}_{L_{t,j,i}} = \hat{L}_{t,j,i}$, and $\hat{L}_{t,j,i}$ is the conditional mean for observation $L_{t,j,i}$, given by (6).

Negative-binomial forecasts. Negative binomial random variables also possess the same property of infinite divisibility; however, for this to be the case it is required that the parameter $p$ be the same across all series in the hierarchy. That is, $p_{A_{t,j}} = p_{L_{t,j,i}}$ for all $i = 1, \ldots, n_j$. One could adhere to this restriction and use the estimated $\hat{p}$ from the aggregated level, $\hat{p}_{A_{t,j}}$, as an estimate of $p_{L_{t,j,i}}$ for the lower level series. However, in our preliminary experiments (not reported), this procedure did not yield satisfactory results, and we do not pursue this approach further. Instead, we estimate $p_{L_{t,j,i}}$ individually for each of the lower level series. We estimate the variance of the lower level series $L_{j,i}$ by the sample variance, denoted as $\hat{\mathrm{V}}[L_{j,i}]$. Then, the estimation of the parameters of the negative binomial distribution at the lower level can be performed using the method-of-moments technique in a similar fashion to Section 3.2.1, i.e.,

$\hat{p}_{L_{t,j,i}} = \dfrac{\hat{L}_{t,j,i}}{\hat{\mathrm{V}}[L_{j,i}]} \quad \text{and} \quad \hat{r}_{L_{t,j,i}} = \hat{L}_{t,j,i}\left(\dfrac{\hat{p}_{L_{t,j,i}}}{1-\hat{p}_{L_{t,j,i}}}\right)$.   (7)

Once we have estimated the relevant parameters, we can produce distributional forecasts for the lower-level observation $L_{t,j,i}$ based on

$L_{t,j,i} \sim \mathrm{NB}\left(\hat{p}_{L_{t,j,i}}, \hat{r}_{L_{t,j,i}}\right)$.

It is worth noting that when series are highly intermittent, the large number of zero entries in the series could potentially lead to a sample variance smaller than the mean, resulting in an under-dispersed model, i.e., $\hat{p}_{L_{t,j,i}} \ge 1$. In principle, a Conway–Maxwell–Poisson distribution could be used in these situations; however, in practice, as the negative binomial distribution reduces to the Poisson distribution when $r \to \infty$ (Hilbe, 2011), we use probabilistic forecasts based on the Poisson model in these cases.

Fig. 3 provides a visual example to illustrate our proposed top–down forecasting framework. We consider a randomly chosen series $A_j$ at level $A$ with a hierarchy that consists of three series at level $L$. We first produce point forecasts for series $A_j$ with an off-the-shelf global forecasting model, in this case a LightGBM model. We may then choose an appropriate distributional model (i.e., Poisson or negative binomial) and estimate the relevant distributional parameters for the forecast observations using the procedures in Section 3.2.1. The historical proportion-of-contribution of each of the series $L_{j,i}$ at level $L$ to the aggregate $A_j$ is calculated using (5) (shown in Fig. 3). These are then used to disaggregate the point forecasts from level $A$ to level $L$. Parameter estimation for the distributional models is performed at level $L$ following the procedures in Section 3.2.3. Finally, using these estimated distributional models, a probabilistic forecast, for example a 90% prediction interval, is produced for each series $L_{j,i}$ at level $L$.
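The lower-level steps can be combined in a short sketch: given an aggregate point forecast and the sales history of its child series, it computes the historical proportions (Eq. (5)), the disaggregated point forecasts (Eq. (6)), and the lower-level distribution parameters (Eq. (7), with the Poisson fallback), and then reads off the requested quantiles. Variable names are illustrative; scipy's nbinom(n, p) uses the same (r, p) parameterisation as above.

import numpy as np
from scipy import stats

def lower_level_quantiles(A_hat, child_history, quantiles=(0.1, 0.9)):
    # child_history: dict mapping a child series id to a 1-D array of its historical sales.
    totals = {i: float(np.sum(h)) for i, h in child_history.items()}
    grand_total = sum(totals.values())
    out = {}
    for i, hist in child_history.items():
        rho = totals[i] / grand_total           # Eq. (5): historical proportion
        L_hat = rho * A_hat                     # Eq. (6): lower-level point forecast
        var = np.var(hist, ddof=1)              # sample variance of the child series
        if var > L_hat:                         # over-dispersed: negative binomial, Eq. (7)
            p = L_hat / var
            r = L_hat * p / (1.0 - p)
            dist = stats.nbinom(r, p)
        else:                                   # otherwise fall back to the Poisson model
            dist = stats.poisson(L_hat)
        out[i] = {q: float(dist.ppf(q)) for q in quantiles}
    return out

history = {"p1": np.array([0, 2, 1, 0, 3, 0, 1]),
           "p2": np.array([1, 0, 0, 0, 2, 0, 0])}
print(lower_level_quantiles(A_hat=4.0, child_history=history))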
4. Experimental framework

This section describes the datasets, benchmarks, and error measurements used in our experimental study.

Fig. 3. An illustration of the proposed top–down forecasting framework with a toy example.

Table 1
Summary of the percentage of series and the percentage of zeros out of the days since the first sale across all series, in each category on the lower level of the three datasets analysed in this paper (in percent).

                        Dataset                 Smooth   Erratic   Lumpy   Intermittent
Percentage of series    E-commerce                0.10      0.53   42.00          57.37
                        Corporación Favorita     20.51     19.58   33.62          26.29
                        M5                        6.23      2.83   18.38          72.56
Percentage of zeros     E-commerce               14.06     16.04   83.00          91.57
                        Corporación Favorita      9.50     12.80   52.65          59.48
                        M5                       14.13     16.77   54.03          67.13

4.1. Datasets

We are aware of two openly available large retail datasets, namely the M5 dataset (Makridakis et al., 2022a) and the Corporación Favorita dataset (Kaggle, 2018). Both of these represent traditional brick-and-mortar sales datasets. We use these datasets in addition to a proprietary e-commerce dataset. Based on the demand classification (see Section 3.1), we can categorise the lower level series into four classes, and the percentage of series that fall into each class is summarised in Table 1. We find that in the examined e-commerce dataset, lumpy and intermittent series are the biggest subgroups. The Corporación Favorita dataset contains series which are more evenly distributed over the four categories, while the intermittent series form a large part of the M5 dataset as well. We further calculate the percentage of zeros out of the days since the first sale in each category of the three datasets. From Table 1, the proprietary e-commerce series are more intermittent compared with the brick-and-mortar datasets we also use in the experiments. We describe the datasets in more detail in the following.

4.1.1. The examined proprietary e-commerce dataset

This dataset consists of 211,765 series of daily unit sales across all regions of Indonesia from May 7th of 2019 to May 8th of 2021, from one particular department of the company. In the dataset, similar products are grouped and regarded as a ‘Catalogue’, and products in a catalogue have a high level of similarity in price. For example, an iPhone 11 could be one item of the catalogue, which contains different specific models such as a green iPhone 11. We use the catalogue level as level $A$ (101,944 series), and the specific models level as level $L$ in the experiments. Around half of the categories have only 1 or 2 products. We are able to scale the methods to this large dataset by training them on a much smaller dataset and then adapting their forecasts to the original dataset.

Forecasts at different quantile levels are often required to determine the optimal inventory level. While businesses generally strive for a high service level, such as 90%, constraints like limited warehouse capacity and working capital may necessitate a lower optimal service level. Consequently, forecasts at various quantile levels are needed for service level optimisation, which is beyond the scope of this study. For demonstration purposes, we use the 10th percentile to illustrate the performance of the proposed model at a lower quantile level. Thus, to evaluate the top–down approach, we forecast 28 days ahead with the catalogue level series and evaluate the 10th and 90th percentile forecasts at level $L$.

4.1.2. The Corporación Favorita dataset

The Corporación Favorita dataset (Kaggle, 2018) provides daily unit sales data in brick-and-mortar grocery stores from January 1st of 2013 to August 15th of 2017. The original data contains negative values which denote the number of returns for a certain product, and these negative values are set to zero in our experiments as we are only interested in sales forecasting. A natural way of constructing a two-layer hierarchy is to use the original data as the lower level, and sum up unit sales by item as an aggregated level, i.e., add up the volumes in different stores for each item. In this way, level $A$ contains 3998 series, whereas level $L$ consists of 172,906 series. The tasks performed are similar: we evaluate the 10th and 90th percentiles of the future 28 days ahead at level $L$ with models trained with the item-level series.

4.1.3. The M5 dataset

With data available for over 5 years in the M5 dataset (Makridakis et al., 2022a), participants were required in the original competition to submit 9 quantile forecasts for each series. The provided sales data is hierarchically structured and can be aggregated to 12 different levels. To provide further insights into the proposed methods, we evaluate the performance of the proposed top–down probabilistic forecasting framework in line with the competition settings, i.e., we evaluate the 0.005, 0.025, 0.165, 0.250, 0.500, 0.750, 0.835, 0.975, and 0.995 quantiles. We utilise the hierarchy between level 10 (product unit sales aggregated by stores, 3049 series) and level 12 (product unit sales, 30,490 series, the lowest level). Models are trained with data from level 10 and forecasts are disaggregated proportionally to level 12, and quantile forecasts are then generated according to distributional assumptions.

4.2. Compared settings

The proposed top–down forecasting framework is implemented with LightGBM model variants and linear model variants. Models are trained with 100 lags as input features, to capture possible weekly, monthly, and quarterly seasonality while not being too computationally expensive and complex. Fourier terms are also introduced to model yearly and weekly seasonality. The LightGBM models are named by the corresponding loss functions and parameter settings, and linear models are named by specific regression settings. In the following, we list the techniques used in this work. The models below are trained on level $A$ and a top–down disaggregation is then applied to obtain forecasts on level $L$.
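To illustrate the input representation, a sketch of the feature construction for a single series follows; the 100 lags match the setting described above, while the Fourier order and column names are illustrative assumptions.

import numpy as np
import pandas as pd

def make_features(series, n_lags=100, fourier_periods=(7, 365.25), fourier_order=2):
    # Lag features capture autocorrelation; Fourier terms model weekly/yearly seasonality.
    df = pd.DataFrame({"y": series.to_numpy()}, index=series.index)
    for lag in range(1, n_lags + 1):
        df[f"lag_{lag}"] = df["y"].shift(lag)
    t = np.arange(len(df))
    for period in fourier_periods:
        for k in range(1, fourier_order + 1):
            df[f"sin_{period}_{k}"] = np.sin(2 * np.pi * k * t / period)
            df[f"cos_{period}_{k}"] = np.cos(2 * np.pi * k * t / period)
    return df.dropna()          # drop rows without a complete lag window

s = pd.Series(np.random.poisson(2, 400),
              index=pd.date_range("2020-01-01", periods=400, freq="D"))
features = make_features(s)     # columns: y (target), lag_1..lag_100, Fourier terms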
LightGBM LightGBM models are trained in a top–down fashion under different loss functions and parameter settings. The LightGBM package provides L1, L2, Poisson, Huber, and Tweedie loss functions for regression problems (Shi et al., 2022). Following the literature, the negative binomial loss is the most adequate loss function to use as it takes over-dispersion into consideration (Kolassa, 2016); however, no off-the-shelf implementation of the negative binomial loss function is available. We implement it with the custom loss and evaluation function in Python (refer to the Appendix); a sketch of such a custom objective is given after this list. It is not straightforward to implement such a loss function, where two parameters are considered, in a common machine learning framework which only supports a single output. Thus the implementation is integrated with an iterative optimisation step for updating the $r$ parameter. We explore three different sets of parameters. We consider default regression parameters, and a preset parameter setting (Bandara et al., 2021) that has been shown to perform well for the M5 competition, but on the decision level (level 12), which is not the level on which we forecast. These are named default and preset in the models, respectively. Instead of modelling with a constant, piecewise linear trees use linear functions to produce the outcomes, and have demonstrated accurate performance in forecasting (Godahewa et al., 2022). So we also include the piecewise linear GBTs, which can be selected with the linear_tree parameter in LightGBM.

Linear models Linear models, or Pooled Regression (PR, Gelman and Hill, 2006), model linear relationships between predictors and target values and are fitted via ordinary least squares. Penalised linear regression models, specifically Lasso regression models (Tibshirani, 1996), are also trained in the experiments. We implement pooled regression with ordinary least squares and penalised models with the R glmnet package (Simon et al., 2011) under default settings with cross-validation. Moreover, apart from using the 100 lags and Fourier terms as stated previously, it is intuitive to consider quadratic terms in the regression models. We trained models with a Lasso penalty and an extra 100 quadratic lag terms, but they did not show improvements in accuracy, so their results are not reported here.
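The full negative binomial loss implementation, including the iterative update of $r$, is given in the Appendix and is not reproduced here. As a rough illustration of the mechanics only, the sketch below shows a custom LightGBM objective and evaluation function for a negative binomial likelihood with a fixed dispersion $r$ and a log link; the fixed R_DISP value is an assumption, and depending on the LightGBM version the callable may need to be passed via the fobj argument of lgb.train() rather than inside params.

import numpy as np
import lightgbm as lgb
from scipy.special import gammaln

R_DISP = 2.0  # dispersion r, held fixed in this sketch (the actual implementation updates it iteratively)

def nb_objective(preds, train_data):
    # Gradient and Hessian of the NB negative log-likelihood w.r.t. the raw score z,
    # with mu = exp(z) (log link) and fixed dispersion r.
    y = train_data.get_label()
    mu = np.exp(preds)
    grad = mu * (R_DISP + y) / (R_DISP + mu) - y
    hess = R_DISP * mu * (R_DISP + y) / (R_DISP + mu) ** 2
    return grad, hess

def nb_metric(preds, train_data):
    # Mean NB negative log-likelihood, usable as a custom evaluation function.
    y = train_data.get_label()
    mu = np.exp(preds)
    nll = -(gammaln(y + R_DISP) - gammaln(R_DISP) - gammaln(y + 1)
            + R_DISP * np.log(R_DISP / (R_DISP + mu))
            + y * np.log(mu / (R_DISP + mu)))
    return "nb_nll", float(np.mean(nll)), False

X = np.random.rand(500, 5)
y = np.random.poisson(3, size=500)
booster = lgb.train({"objective": nb_objective, "verbosity": -1},
                    lgb.Dataset(X, label=y), num_boost_round=50)
mu_hat = np.exp(booster.predict(X))  # raw scores are returned; apply the inverse link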
each series. The zeroinfl() function is used from the R pscl
In terms of benchmarks, we consider the following baselines of package (Jackman, 2024; Zeileis et al., 2008). A ZIP model is
forecasts directly performed on level 𝐿, namely direct quantile mod- fitted if a numerical singularity error occurs when fitting a ZINB
elling with LightGBM models, DeepAR, traditional univariate forecast- model.
ing models, and some relatively simple methods tailored to count data
as used by Kolassa (2022). An input window of 100 lags and Fourier
terms is used for the former two approaches, similarly to the proposed 4.3. Evaluation metrics
methods. The details are as follows.
Following the setup of the M5 competition, we evaluate the proba-
Direct LightGBM Direct quantile models are trained on the lower
bilistic forecasts using the Weighted Scaled Pinball Loss
level 𝐿 to get the lower level prediction. This approach requires
(WSPL, Makridakis et al., 2021). We denote 𝑞𝑡[𝑢] as the predicted value
training a model for each quantile of interest. We use LightGBM
for quantile 𝑢 at time 𝑡, and 𝑦𝑡 as the corresponding ground truth.
with the preset parameters from Bandara et al. (2021) as those
Then, for a series 𝑖, the Scaled Pinball Loss (SPL) is calculated for each
authors report promising accuracy of this parameterisation on
quantile as follows,
the M5 decision level (level 12). Quantile forecasts are generated
with the quantile loss function. SPL𝑖 [𝑢]

$\mathrm{SPL}_i[u] = \dfrac{\frac{1}{h}\sum_{t=T+1}^{T+h}\left(u\,(y_t - q_t^{[u]})\,\mathbf{1}\{q_t^{[u]} \le y_t\} + (1-u)\,(q_t^{[u]} - y_t)\,\mathbf{1}\{q_t^{[u]} > y_t\}\right)}{\frac{1}{n-1}\sum_{t=2}^{T} |y_t - y_{t-1}|}$,   (8)

where the pinball loss (Gneiting, 2011) over the forecast horizon $h$ is scaled by the average absolute error of the one-step-ahead in-sample naïve forecast within the period between the first non-zero sale and time $T$. $\mathbf{1}$ is the indicator function. For example, for the 10th and 90th percentile forecast evaluation, $u \in \{0.1, 0.9\}$, and $q = 2$ corresponds to the number of quantiles of interest. The WSPL is computed as the weighted average of the average SPL over all the quantiles per series, with weights $w_i$,

$\mathrm{WSPL} = \sum_{i=1}^{n} w_i \times \dfrac{1}{q}\sum_{j=1}^{q} \mathrm{SPL}_i[u_j]$.
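A minimal sketch of these two quantities follows; it assumes that the in-sample history passed for scaling has already been trimmed to start at the first non-zero sale, and that the series weights sum to one.

import numpy as np

def spl(y_true, q_pred, u, insample):
    # Scaled pinball loss for one series and one quantile level u (Eq. (8)).
    y_true = np.asarray(y_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    pinball = np.where(q_pred <= y_true,
                       u * (y_true - q_pred),
                       (1.0 - u) * (q_pred - y_true))
    scale = np.mean(np.abs(np.diff(np.asarray(insample, dtype=float))))
    return float(pinball.mean() / scale)

def wspl(spl_per_series, weights):
    # Weighted average of per-series SPL values (each already averaged over its quantiles).
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * np.asarray(spl_per_series, dtype=float)) / np.sum(w))

hist = [0, 2, 1, 0, 3, 1, 0, 2]                       # trimmed in-sample history
print(spl(y_true=[1, 0, 2], q_pred=[2.0, 1.0, 3.0], u=0.9, insample=hist))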
When evaluating the proposed methods on the M5 dataset, we follow the M5 competition setup and use the same weighting for a direct comparison with other participants, where dollar sales in the last 28 days are calculated as weights. In the examined proprietary dataset and in the Corporación Favorita dataset, such information on dollar sales is not available. While one could still propose a weighting process under certain assumptions, we opt for weighting series equally during evaluation. A lower WSPL indicates a better estimate of the forecast intervals. The SPL uses the in-sample naïve forecast as the denominator, a procedure that was first proposed by Hyndman and Koehler (2006) for the MASE and is nowadays standard practice in forecasting. However, this process has the problem that a division by zero can occur if the series is constant. Due to the procedure of trimming leading zeros, series can be very short and this situation can happen in our experiments. However, such cases are rare; for example, only 8 series with this property are present in the Corporación Favorita dataset, so we omit such series during the evaluation process.

5. Results and discussion

In the following, we present an evaluation on the three different datasets separately. The proposed top–down forecasting framework is first evaluated on the e-commerce dataset. Based on the results, we aim at transferring the findings to the brick-and-mortar datasets. Therefore, we use the most competitive models for further experiments on the Corporación Favorita dataset and the M5 dataset. For the M5 dataset, we are able to directly compare the performance of the proposed top–down forecasting framework with the results of the original competition participants.

5.1. Evaluation with the e-commerce dataset

In this section, we present detailed performance evaluations on the proprietary e-commerce dataset. Models are globally trained on level $A$ and a top–down approach is then applied to get forecasts for level $L$. Table 2 presents the WSPL results on level $L$, based on the demand classification category of the respective level $L$ series. The benchmarks are placed at the top of the table, and models trained in a top–down fashion are arranged by distribution assumptions. Noticeably, the direct LightGBM model outperforms all other models in all categories except being in third place for lumpy data. DeepAR models beat other methods for lumpy data, and have consistently accurate performance in the other categories. It is somewhat surprising to find that simply using the in-sample quantiles can lead to a competitive forecasting accuracy, especially for the intermittent series. This is in line with findings in the literature that empirical models can outperform sophisticated ones, as shown by Kolassa (2016) and Spiliotis et al. (2021). In addition, the Emp-Wd model, which treats each day of the week separately, yields accurate forecasts. The zero-inflated models are competitive on this dataset. No consistently good performance can be found for the local statistical methods.

For the proposed top–down method, the LightGBM models have achieved competitive accuracy, especially under a negative binomial assumption. More sophisticated hyperparameter settings such as the preset parameters do not show an advantage over the default parameters, which can even lead to better accuracy. Interestingly, linear models fitted via least squares have demonstrated even more competitive accuracy, as PR models and Lasso models present satisfactory results across all data categories. The PR even beats the Direct LightGBM on the lumpy series, and is slightly better than DeepAR on the intermittent series. With regard to the different distribution assumptions, we can find that models with negative binomial assumptions outperform those with Poisson assumptions, indicating that the data is over-dispersed.

Table 3 compares the total training time of the forecasting models. Models were trained on a server machine (16 vCPUs, 64 GB RAM) using R 4.1. The proposed top–down methods are much faster than the direct LightGBM models. Specifically, the top–down LightGBM methods under default parameterisation can be trained within 10 min, whereas the direct LightGBM approach takes around 5 h. The training process of the top–down PR model is efficient, and the Lasso model is relatively slower as it fits additional regularisation parameters. Among the LightGBM model variants, those using the user-defined negative binomial loss take the longest time. This is due to the iterative search for the parameter $r$ of the negative binomial distribution (see Appendix). Such a loss function does not demonstrate the promised accuracy within a practical timeframe. With the competitive accuracy discussed previously, the in-sample quantile is also superior in terms of computational efficiency. Other local benchmarks such as the zero-inflated models, ETS and ARIMA can take a long time to train. Finally, the DeepAR model appears to be fast and computationally efficient.

5.2. Evaluation with the Corporación Favorita dataset

Based on the previous experiments, we limit our experiments on this dataset to a selection of the best-performing methods from the previous experiments, from the different categories of methods, namely LightGBM with Poisson loss, Tweedie loss and negative binomial loss functions, pooled regression, and Lasso. Again, we use 100 lags and Fourier terms as input, and LightGBM models are trained under default parameter settings. In the top–down probabilistic experiments, we assume the sales data to follow a Poisson distribution or a negative binomial distribution across the hierarchy. From the results on the e-commerce dataset, we utilise the Direct LightGBM models, DeepAR and the five local count data models as the comparison methods on level $L$, with the same parameter settings discussed in Section 4.2. In the case of direct training, the lag matrix is over 230 GB, which hinders the implementation on our available computing resources. In addition, the series are much less intermittent compared to the e-commerce dataset, which leads to a much denser input matrix. The limit on the size of input sparse matrices restricts the number of series and the number of lags that can be trained at the same time. Therefore, we need to make compromises, and the direct LightGBM model is trained as follows. The partitioning technique introduced in Section 3.1 can also be used as a pre-processing step to render the methods more scalable when a single global model cannot fit into memory. We first partition the lower level series into the smooth, erratic, lumpy and intermittent categories and train four LightGBM models separately. Due to the restriction on the size of the input matrix, we intend to use as many lags as possible for a fair comparison. With the Fourier terms to capture seasonality, we use 20 lags for the lumpy category and 30 lags for the other three categories. Another option is to remove the Fourier terms and give more importance to the lags as input. This approach leads to another Direct LightGBM (max lags) model, where we use 50 lags for the smooth and erratic series, and 35 and 45 for the lumpy and intermittent series, respectively.


Table 2
The WSPL on level 𝐿 of the examined proprietary dataset, categorised based on the demand class. The WSPL for all series on level 𝐿 are
provided in the last column. The amount of series in each category is provided in parenthesis. The top–down forecasting methods are sorted
by distribution assumptions.
Model Smooth (209) Erratic (1,126) Lumpy (88,933) Intermittent (121,497) All
ARIMA 0.2187 0.2336 0.3088 0.1853 0.2374
Drift 0.3002 0.4161 0.8530 0.7287 0.7788
ETS 0.2165 0.2348 0.3081 0.1813 0.2350
Mean 0.2313 0.2507 0.2947 0.1836 0.2307
Naïve 0.2964 0.4114 0.8434 0.7187 0.7690
SNaïve 0.2460 0.3107 0.5033 0.3573 0.4183
In-sample quantiles 0.2229 0.2545 0.2204 0.1712 0.1923
Emp-Wd 0.2254 0.2556 0.2208 0.1718 0.1929
Pois 0.3322 0.3359 0.2344 0.1727 0.1997
NB-CMP 0.2558 0.2671 0.2215 0.1711 0.1929
ZIP 0.2212 0.2504 0.2235 0.1712 0.1937
ZINB 0.2229 0.2549 0.2211 0.1712 0.1926
DeepAR 0.1976 0.2200 0.2093 0.1660 0.1845
Direct LightGBM 0.1931 0.2078 0.2139 0.1634 0.1849
Negative binomial distribution assumption
Lasso 0.2095 0.2446 0.2240 0.1736 0.1952
Pooled Regression 0.2015 0.2327 0.2132 0.1659 0.1861
LightGBM Huber loss default 0.2689 0.2607 0.2160 0.1690 0.1893
LightGBM Huber loss linear leaf 0.2974 0.2769 0.2171 0.1696 0.1902
LightGBM Huber preset 0.2005 0.2269 0.2145 0.1681 0.1879
LightGBM L1 loss default 0.3208 0.2973 0.2235 0.1733 0.1952
LightGBM L1 loss linear leaf 0.2207 0.2483 0.2200 0.1711 0.1921
LightGBM L1 loss preset 0.2318 0.2463 0.2200 0.1718 0.1925
LightGBM L2 loss default 0.2076 0.2323 0.2171 0.1734 0.1921
LightGBM L2 loss linear leaf 0.2042 0.2329 0.2188 0.1748 0.1936
LightGBM L2 loss preset 0.2173 0.2413 0.2262 0.1810 0.2003
LightGBM Neg. Bin. loss default 0.2144 0.2466 0.2362 0.1946 0.2124
LightGBM Poisson loss default 0.2057 0.2348 0.2192 0.1748 0.1938
LightGBM Poisson loss linear leaf 0.2284 0.2568 0.2255 0.1786 0.1988
LightGBM Poisson loss preset 0.2175 0.2466 0.7572 0.2231 0.4476
LightGBM Tweedie loss default 0.2108 0.2359 0.2185 0.1747 0.1935
LightGBM Tweedie loss linear leaf 0.2145 0.2440 0.2225 0.1782 0.1972
LightGBM Tweedie preset 0.2192 0.2477 0.2323 0.1879 0.2070
Poisson distribution assumption
Lasso 0.2522 0.3225 0.2436 0.1763 0.2055
Pooled Regression 0.2343 0.2870 0.2214 0.1662 0.1901
LightGBM Huber loss default 0.3080 0.3371 0.2223 0.1689 0.1923
LightGBM Huber loss linear leaf 0.3369 0.3595 0.2246 0.1696 0.1939
LightGBM Huber preset 0.2329 0.2790 0.2178 0.1676 0.1894
LightGBM L1 loss default 0.3630 0.3868 0.2263 0.1727 0.1965
LightGBM L1 loss linear leaf 0.2675 0.3299 0.2283 0.1716 0.1964
LightGBM L1 loss preset 0.2655 0.3108 0.2218 0.1711 0.1932
LightGBM L2 loss default 0.2417 0.2947 0.2369 0.1756 0.2021
LightGBM L2 loss linear leaf 0.2390 0.2937 0.2389 0.1770 0.2037
LightGBM L2 loss preset 0.2522 0.3105 0.2480 0.1831 0.2111
LightGBM Neg. Bin. loss default 0.2463 0.3161 0.2527 0.1959 0.2205
LightGBM Poisson loss default 0.2394 0.3053 0.2414 0.1774 0.2051
LightGBM Poisson loss linear leaf 0.2618 0.3353 0.2494 0.1812 0.2108
LightGBM Poisson loss preset 0.2504 0.3232 0.7813 0.2254 0.4595
LightGBM Tweedie loss default 0.2435 0.3032 0.2388 0.1769 0.2037
LightGBM Tweedie loss linear leaf 0.2471 0.3112 0.2439 0.1803 0.2078
LightGBM Tweedie preset 0.2532 0.3218 0.2546 0.1900 0.2179
Table 4 reports the WSPL errors that are calculated on the lower level. From the second column, we can compare the top–down approach against the strong direct methods. We observe that our methods are competitive, and the linear models again have remarkably outperformed the LightGBM variants. The negative binomial distribution still seems to be more appropriate on this dataset compared with a Poisson assumption. The DeepAR model has the best accuracy on this dataset. As we have to make compromises when training the direct LightGBM models, we observe that the model that uses more lags seems to perform better than the one with Fourier terms. The simple in-sample quantile is still a strong benchmark, as is the Emp-Wd method. At the same time, we can find more top–down methods that are competitive against them, with a wider gap compared to the results found on the e-commerce dataset. Table 5 reports the training time for each method. Overall, the top–down approach incorporated with linear models and LightGBM models can be trained very efficiently, even compared to the in-sample quantile method. They are much faster than the direct LightGBM models and DeepAR. As the hierarchy on the Corporación Favorita dataset contains more bottom-level series, we can find a more significant scalability advantage of the proposed top–down methods, without losing much of the accuracy. The Emp-Wd method takes more time as each day of the week needs to be considered separately. Also, the LightGBM model with negative binomial loss requires more training effort, as is to be expected.

5.3. The M5 competition revisit

We conduct experiments on the M5 dataset similar to Section 5.2 with selected models and parameter settings, and probabilistic forecasts are generated based on a Poisson or a negative binomial distribution. We use the same type of result plot as the competition summary in Makridakis et al. (2021). Instead of separating by distribution assumptions as in the previous tables, we put the name of the specific distribution, i.e., Poisson or Neg. Bin., at the beginning of the names of the proposed methods. Fig. 4 compares the WSPL values on
level 12 with the top 50 participants in the uncertainty track of the original competition. Remarkably, the proposed top–down forecasting approaches all enter the top 50 when compared with the original 892 participating teams, w.r.t. WSPL, except for the ones with a negative binomial loss function. We also notice that methods which assume future sales to follow a negative binomial distribution perform better, which is in line with the previous experiments. Benchmarks on level 12 are trained in the same fashion as in Section 4.2. Due to the computational limitations, we are able to train on the whole dataset from level 12 with a single direct LightGBM model with 70 lags. We also include a max lags version, where we intend to include more lags and train a direct LightGBM model with 100 lags without Fourier terms. From Fig. 4, we see that the DeepAR model is very competitive on the M5 level 12. The direct LightGBM models are also accurate, where the one with more lags instead of Fourier terms performs better, ranking 10th against the other competitors. The in-sample quantile and the Emp-Wd are strong benchmarks on this brick-and-mortar retailer dataset, but rank lower than the proposed top–down methods. The detailed WSPL results of each category on series from level 12 are also provided in Table 6 for consistency. The proposed models are competitive against the strong benchmarks in each category, especially the linear models. Table 7 presents the training time of models on level 10 and directly on level 12. Using the top–down approach, the GBTs can be trained with modest computational effort, as can the linear models. The simple in-sample quantile benchmark is still very efficient. The LightGBM model with the negative binomial loss function takes a much longer time because of the iterative numerical optimisation process. On the M5 dataset, it seems there is a poorer estimate of the parameter r through the numerical search under practical time constraints. DeepAR can be trained at a fast speed as the M5 dataset is relatively small compared to the other two datasets in our experiments. Although it may provide certain accuracy gains, training directly on level 12 can take much more computational time, and compromises may have to be made to make the approach feasible.

Table 3
Training time on the examined proprietary dataset for the benchmark methods (top part) on level 𝐿, and LightGBM models and linear models on level 𝐴.
Model Training time (minutes)
ARIMA 741.60
Drift 58.29
ETS 875.75
Mean 45.39
Naïve (fable) 54.96
SNaïve (fable) 85.99
In-sample quantile 0.99
Emp-Wd 13.78
Pois 1.86
NB-CMP 25.04
ZIP 153.52
ZINB 519.42
DeepAR 68.75
Direct LightGBM 284.36
Lasso 56.15
Pooled Regression 28.48
LightGBM Huber loss default 6.32
LightGBM Huber loss linear leaf 9.64
LightGBM Huber loss preset 30.30
LightGBM L1 loss default 11.04
LightGBM L1 loss linear leaf 16.19
LightGBM L1 loss preset 51.95
LightGBM L2 loss default 9.07
LightGBM L2 loss linear leaf 14.46
LightGBM L2 loss preset 24.13
LightGBM Neg. Bin. loss default 1540.08
LightGBM Poisson loss default 6.99
LightGBM Poisson loss linear leaf 8.37
LightGBM Poisson loss preset 22.85
LightGBM Tweedie loss default 5.53
LightGBM Tweedie loss linear leaf 7.12
LightGBM Tweedie loss preset 26.80

5.4. Further discussion

In this section, we provide a discussion on the automatic selection of the aggregated level to apply the proposed top–down forecasting framework, and a suggested workflow for forecasting e-commerce datasets.

5.4.1. Selecting the aggregated level

We have explored two types of aggregation in our analysis, namely the category–product hierarchy on our proprietary dataset, and the store–product hierarchy on the Corporación Favorita dataset and the M5 dataset. The top–down forecasting framework works well in both situations. Ultimately, the way to form a hierarchy is application-dependent, but there are some heuristics we can follow. To make the most of the proposed framework, on the one hand, practical considerations should come first. Data should be aggregated to a level where models can run without running into the limitations of memory and computing power. On the other hand, since the probabilistic forecasts are generated based on distributional assumptions, the aggregation levels should be chosen in a way that we would not expect too large changes of the data characteristics after aggregation.

In practice, we can explore the distributions of the series on the aggregated level and on the decision level and compare their similarity. For example, a negative binomial distribution can be fitted for each series on the decision level and on the possible levels to aggregate to, and goodness-of-fit results can then be evaluated. We perform such an analysis, where we use the glm.nb function from the R MASS package (Venables and Ripley, 2002) to fit a negative binomial distribution, and examine the fit with the p-value reported by the poisgof function from the R epiDisplay package (Chongsuvivatwong, 2022). Table 8 reports the results at a significance level α = 0.05. As the series are aggregated to higher levels, they are less and less similar to the decision level. We see that levels 10 and 11 offer similar trade-offs between the number of series to forecast and the similarity between the decision level and the aggregated level. In our experiments, we decide to aim for higher scalability and therefore choose level 10 as the aggregated level.
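The level-selection check described above was carried out with the R functions glm.nb and poisgof. Purely as an illustration of the same idea, the sketch below screens a collection of series with a method-of-moments negative binomial fit and a chi-square goodness-of-fit test in Python; the function names, the binning of the counts, and the use of scipy are our own assumptions and not the exact procedure behind Table 8.

```python
import numpy as np
from scipy import stats

def negbin_gof_pvalue(y, q_cut=0.99):
    """Chi-square goodness-of-fit p-value for a negative binomial fitted by
    the method of moments; a rough stand-in for the glm.nb/poisgof check."""
    y = np.asarray(y, dtype=int)
    mean, var = y.mean(), y.var()
    if var <= mean:                               # no over-dispersion to model
        return np.nan
    r = mean ** 2 / (var - mean)
    p = r / (r + mean)
    upper = max(int(np.quantile(y, q_cut)) + 1, 5)
    observed = np.bincount(np.minimum(y, upper), minlength=upper + 1)
    expected = stats.nbinom.pmf(np.arange(upper + 1), r, p)
    expected[-1] += stats.nbinom.sf(upper, r, p)  # lump the right tail together
    expected = expected * len(y)
    _, pval = stats.chisquare(observed, expected, ddof=2)  # two fitted parameters
    return pval

def share_following_negbin(series_list, alpha=0.05):
    """Fraction of series whose negative binomial fit is not rejected at level
    alpha, i.e. the kind of quantity reported per aggregation level in Table 8."""
    pvals = np.array([negbin_gof_pvalue(s) for s in series_list], dtype=float)
    valid = ~np.isnan(pvals)
    return np.mean(pvals[valid] > alpha)
```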
5.4.2. A suggested workflow for forecasting on e-commerce datasets

It is an interesting finding that for the intermittent series of the e-commerce data, the largest proportion of the dataset, simple methods such as the in-sample quantile are competitive against the LightGBM variants and linear models trained under the top–down forecasting framework. In particular, the in-sample quantile method achieves 2nd place after the direct LightGBM model, which achieves the best accuracy (see Table 2). If we also take training time into account, the in-sample quantile is unbeatable compared with the other methods. In contrast, such an advantage in accuracy cannot be seen in the Corporación Favorita dataset (Table 4) and the M5 dataset (Fig. 4), where brick-and-mortar sales data are considered.

Recall the percentage of zeros calculated in Table 1 on the lower level of the three datasets examined in this research. Noticeably, over 91% of the entries in the intermittent series of the e-commerce data are zero, implying a high degree of intermittency. This explains why these series are relatively unpredictable and why no method offers clear benefits over the most simple benchmarks. One may observe that the lumpy series also present a high proportion of zeros. However, from the empirical results, the proposed methods have shown a better performance on the examined e-commerce dataset. This may be due to the lumpy series by definition having a larger variance compared to the intermittent series. Taking all the findings into account, we can suggest the following as a generic workflow for our e-commerce forecasting use case. For intermittent series, one can simply use in-sample quantiles to produce accurate forecasts. The proposed top–down forecasting framework, for example integrated with linear models and LightGBM models (e.g., with Tweedie and Poisson loss functions), is used to generate probabilistic forecasts for the other categories. A sketch of this routing logic is given below.
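The following sketch illustrates the routing logic of the suggested workflow. The Syntetos–Boylan style cut-offs (1.32 for the average inter-demand interval and 0.49 for the squared coefficient of variation) are the commonly used values and are an assumption here, not values taken from the paper; the top–down forecaster is passed in as a callable placeholder.

```python
import numpy as np

def demand_class(y):
    """Categorise a series as smooth, erratic, lumpy or intermittent using the
    average inter-demand interval (ADI) and the squared CV of non-zero demands."""
    y = np.asarray(y, dtype=float)
    nz = y[y > 0]
    if len(nz) == 0:
        return "intermittent"
    adi = len(y) / len(nz)
    cv2 = (nz.std() / nz.mean()) ** 2
    if adi < 1.32:
        return "smooth" if cv2 < 0.49 else "erratic"
    return "intermittent" if cv2 < 0.49 else "lumpy"

def forecast_series(y, quantile_levels, top_down_forecaster):
    """Suggested workflow: in-sample quantiles for intermittent series, the
    top-down framework (a callable placeholder here) for everything else."""
    if demand_class(y) == "intermittent":
        return {q: np.quantile(y, q) for q in quantile_levels}
    return top_down_forecaster(y, quantile_levels)
```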
Table 4
The WSPL for lower level series from the Corporación Favorita dataset in each category. The WSPL for all lower level series is provided in the last column. The number of series in each category is provided in parentheses.
Model Smooth (35,458) Erratic (33,851) Lumpy (58,131) Intermittent (45,466) All
In-sample quantiles 0.2618 0.2747 0.3177 0.3259 0.2999
Emp-Wd 0.2559 0.2706 0.3154 0.3233 0.2965
Pois 0.3237 0.3287 0.3343 0.3393 0.3323
NB-CMP 0.2705 0.2853 0.3224 0.3267 0.3056
ZIP 0.2716 0.2787 0.3140 0.3252 0.3013
ZINB 0.2695 0.2749 0.3180 0.3259 0.3017
Direct LightGBM (max lags) 0.1985 0.2192 0.2584 0.2580 0.2383
Direct LightGBM (Fourier terms) 0.2048 0.2239 0.2589 0.2588 0.2409
DeepAR 0.1811 0.2147 0.2539 0.2553 0.2317
Negative binomial distribution assumption
Lasso 0.2262 0.2498 0.2832 0.2783 0.2637
Pooled Regression 0.2196 0.2449 0.2768 0.2709 0.2572
LightGBM Neg. Bin. loss default 0.2339 0.2693 0.2871 0.2763 0.2699
LightGBM Poisson loss default 0.2303 0.2639 0.2851 0.2765 0.2674
LightGBM Tweedie loss default 0.2282 0.2632 0.2829 0.2742 0.2656
Poisson distribution assumption
Lasso 0.2571 0.2863 0.3032 0.2920 0.2875
Pooled Regression 0.2416 0.2740 0.2901 0.2796 0.2744
LightGBM Neg. Bin. loss default 0.2590 0.3000 0.3040 0.2874 0.2896
LightGBM Poisson loss default 0.2553 0.2947 0.3025 0.2878 0.2874
LightGBM Tweedie loss default 0.2520 0.2928 0.2991 0.2845 0.2844
Fig. 4. The performance of the proposed methods and benchmarks on level 12 compared with the top 50 submissions of the M5 uncertainty competition.
6. Conclusion

In this paper, we have proposed a scalable top–down forecasting framework which is capable of generating reliable probabilistic forecasts at a fast speed. Direct modelling on the lower level and producing quantile forecasts is accurate, but it can be computationally expensive while no correspondingly large gains in accuracy are observed. Compromises may also have to be made when training direct quantile models. In our use cases, and presumably many others in the industry, the additional computational effort is thus not justified. Our forecasting approach is feasible to implement in production. The top–down forecasting framework has also been evaluated with two public datasets and has shown good results.

As evaluated in the experiments, we have found that the accuracy depends largely on the estimation of the distributional parameters. In accordance with the literature, in the three datasets in our experiments the negative binomial assumption tends to be more adequate than the Poisson assumption. However, this does not translate into higher accuracies when using a negative binomial loss function. We have shown that in practice, the implementation of this loss function requires an additional numerical search to fit into a common machine learning framework, which prevents it from beating other built-in loss functions under practical computational constraints. Somewhat surprisingly, linear models are competitive with the state-of-the-art LightGBM algorithm in situations where no external covariates are used (as in our research; external variables could include pricing, promotions, and others). Here, linear models offer a simple alternative to GBTs that is fast, robust, and more interpretable.

We observe that the e-commerce dataset can be much more intermittent compared to brick-and-mortar retail datasets. In particular, the intermittent series make up the largest proportion of the dataset and they are also more intermittent, i.e., they contain proportionally more zeros. Simply using in-sample quantiles on this category can be very competitive against other sophisticated methods, with superior
computational efficiency. In addition, the proposed top–down method depends on the hierarchical structure of the series and on the distributional assumptions to some extent. We have investigated the distributions of the series on the lower level and on the possible aggregated levels of the M5 dataset. Based on the given hierarchy of the business, it is a trade-off between the number of series on the aggregated level to model on and the similarity between the two levels when applying the proposed top–down forecasting framework.

A limitation of the proposed framework that could be addressed as future work is the static top–down approach, where total historical proportions are used during disaggregation. We assume that using a disaggregation method which accounts for future changes may improve forecasting accuracy. Additionally, the proposed top–down forecasting framework depends on the selection of the aggregated level. We have provided some preliminary results on how to perform an automatic level selection, but a more systematic procedure could be further investigated. Finally, the examined e-commerce data spreads out before and after the global pandemic lockdown periods. However, the potential structural breaks in the shopping patterns are not modelled in this study.

Table 5
Training time of the top–down methods on the aggregated level of the Corporación Favorita dataset, and the benchmarks executed on level 𝐿.
Model Training time (minutes)
In-sample quantile 0.76
Emp-Wd 13.17
Pois 3.64
NB-CMP 5.81
ZIP 53.29
ZINB 226.76
DeepAR 68.75
Direct LightGBM (max lags) 357.67
Direct LightGBM (Fourier terms) 388.84
Lasso 5.67
Pooled Regression 2.70
LightGBM Neg. Bin. loss default 95.89
LightGBM Poisson loss default 1.14
LightGBM Tweedie loss default 1.19

CRediT authorship contribution statement

Xueying Long: Writing – review & editing, Writing – original draft, Visualization, Software, Investigation, Formal analysis. Quang Bui: Writing – original draft, Software, Investigation, Formal analysis, Data curation. Grady Oktavian: Validation, Software, Formal analysis, Data curation. Daniel F. Schmidt: Writing – review & editing, Validation, Supervision, Methodology. Christoph Bergmeir: Writing – review & editing, Validation, Supervision, Resources, Project administration, Methodology, Investigation, Funding acquisition, Conceptualization. Rakshitha Godahewa: Writing – review & editing, Software. Seong Per Lee: Validation, Software, Formal analysis, Data curation. Kaifeng Zhao: Writing – review & editing, Validation, Software, Data curation. Paul Condylis: Writing – review & editing, Supervision, Funding acquisition, Data curation.

Acknowledgements

Christoph Bergmeir is supported by a María Zambrano (Senior) Fellowship that is funded by the Spanish Ministry of Universities and Next Generation funds from the European Union.

Data availability

The data that has been used is confidential.

Appendix. Implementation of negative binomial loss function with LightGBM

As sales data is usually over-dispersed, i.e., the variance is greater than the mean, when we use machine learning algorithms to predict the future mean values it is a natural choice to consider the negative binomial loss function for model training. However, the LightGBM package (Ke et al., 2017) does not provide a built-in negative binomial loss function, but it provides functionality which supports user-defined loss functions.

In order to implement any customised loss, there are two functions we need to specify: an objective function and an evaluation function. The objective function is defined according to the log likelihood of a certain distribution, and the evaluation function returns the first and second derivatives w.r.t. the model predictions.

For the negative binomial distribution, the probability mass function is given by

P(x \mid r, p) = \frac{\Gamma(r+x)}{\Gamma(r)\,\Gamma(x+1)} \, p^{r} (1-p)^{x},

with a mean value \mu that equals (1-p)r/p. So if we substitute p w.r.t. \mu, that is, p = r/(\mu + r), we get the following,

P(x \mid r, \mu) = \frac{\Gamma(r+x)}{\Gamma(r)\,\Gamma(x+1)} \left(\frac{r}{\mu+r}\right)^{r} \left(\frac{\mu}{\mu+r}\right)^{x}.

So, the negative log likelihood is given by

L(x \mid \mu, r) = -\log\Gamma(r+x) + \log\Gamma(r) + \log\Gamma(x+1) - r\log r + r\log(\mu+r) - x\log\mu + x\log(\mu+r).

We denote the predicted mean value from the LightGBM model as f. As the support of the negative binomial distribution is the set of non-negative integers, we apply a log transformation so that f is allowed to take any real value and e^{f} is always non-negative. For a data point x_i, treating x_i as the true value and plugging in the predicted mean value after transformation, i.e., e^{f_i}, the negative log likelihood is given by

L(x_i \mid f_i, r) = -\log\Gamma(r+x_i) + \log\Gamma(r) + \log\Gamma(x_i+1) - r\log r + r\log(e^{f_i}+r) - x_i f_i + x_i\log(e^{f_i}+r).

Consider \mathbf{x} = (x_1, \dots, x_n) and \mathbf{f} = (f_1, \dots, f_n); then our objective function is defined as

L(\mathbf{x} \mid \mathbf{f}, r) = \sum_{i=1}^{n} L(x_i \mid f_i, r).

And we calculate the gradient and Hessian w.r.t. f,

g(\mathbf{x} \mid \mathbf{f}, r) = \sum_{i=1}^{n} \left( \frac{e^{f_i}(r+x_i)}{e^{f_i}+r} - x_i \right),

h(\mathbf{x} \mid \mathbf{f}, r) = \sum_{i=1}^{n} \frac{e^{f_i}\, r\, (r+x_i)}{(e^{f_i}+r)^{2}}.

With this we have defined all the required functions for the implementation, except that the value of r has to be obtained to complete the calculation. Intuitively, we could treat r as a model parameter and optimise it alongside the training process, but the LightGBM package does not provide an option for defining custom parameters. A possible solution, which is the solution we are using, is coordinate-wise optimisation, that is, updating the model and r iteratively until convergence. We initialise the value of r by the method of moments from the historical data. The optimisation process of each iteration takes three steps: (1) train a LightGBM model with the custom loss function and the current value of r; (2) predict the training set with the model obtained and then get the predicted mean values; and (3) get an updated estimate of r by minimising the negative log likelihood, which is also the function L defined above. In this case, the LightGBM models are retrained iteratively through the coordinate-wise optimisation, and the optimisation procedure takes longer as the length of the series grows, which in turn leads to an overall longer training process.
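The derivation above maps directly onto code. The following is a minimal sketch of how the custom objective and the coordinate-wise update of r could be wired together with the LightGBM Python package and scipy; the helper names, the outer-loop settings, and the use of the scikit-learn style interface are our own illustration rather than the exact code used in the experiments.

```python
import numpy as np
import lightgbm as lgb
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def negbin_objective(r):
    """LightGBM custom objective for the negative binomial NLL with fixed
    dispersion r; the raw score f is the log of the predicted mean."""
    def objective(y_true, y_pred):
        mu = np.exp(y_pred)                              # e^f
        grad = mu * (r + y_true) / (mu + r) - y_true     # dL/df
        hess = mu * r * (r + y_true) / (mu + r) ** 2     # d^2L/df^2
        return grad, hess
    return objective

def negbin_nll(r, y, mu):
    """Negative binomial negative log likelihood summed over observations."""
    return np.sum(-gammaln(r + y) + gammaln(r) + gammaln(y + 1)
                  - r * np.log(r) + r * np.log(mu + r)
                  - y * np.log(mu) + y * np.log(mu + r))

def fit_negbin_lightgbm(X, y, n_outer_iter=5, tol=1e-3, **lgb_params):
    """Coordinate-wise optimisation: alternate between fitting the GBT with
    the current r and re-estimating r from the in-sample predictions."""
    y = np.asarray(y, dtype=float)
    mean, var = y.mean(), y.var()
    r = mean ** 2 / max(var - mean, 1e-6)                # method-of-moments start
    model = None
    for _ in range(n_outer_iter):
        model = lgb.LGBMRegressor(objective=negbin_objective(r), **lgb_params)
        model.fit(X, y)
        mu = np.exp(model.predict(X, raw_score=True))    # predicted means on the training set
        res = minimize_scalar(negbin_nll, bounds=(1e-3, 1e4),
                              args=(y, mu), method="bounded")
        r_new = res.x
        if abs(r_new - r) < tol:
            r = r_new
            break
        r = r_new
    return model, r
```

Passing objective=negbin_objective(r) makes LightGBM optimise the negative binomial likelihood with the gradient and Hessian derived above, while the outer loop plays the role of step (3), re-estimating r between model fits.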
Table 6
The WSPL for the M5 dataset level 12 series in each category. The WSPL for all level 12 series is provided in the last column. The number of series in each category is provided in parentheses.
Model Smooth (1,900) Erratic (863) Lumpy (5,604) Intermittent (22,123) All
In-sample quantiles 0.2797 0.3982 0.4248 0.3314 0.3398
Emp-Wd 0.2734 0.3977 0.4263 0.3314 0.3387
Pois 0.3331 0.5251 0.5006 0.3696 0.3931
NB-CMP 0.2761 0.3981 0.4263 0.3343 0.3409
ZIP 0.2884 0.4752 0.4408 0.3317 0.3494
ZINB 0.3241 0.3965 0.4245 0.3314 0.3492
Direct LightGBM (max lags) 0.2434 0.3055 0.3333 0.2695 0.2766
Direct LightGBM (Fourier terms) 0.2510 0.3162 0.3368 0.2773 0.2839
DeepAR 0.2474 0.3040 0.3298 0.2649 0.2742
Negative binomial distribution assumption
Lasso 0.2585 0.3320 0.3744 0.2820 0.2952
Pooled Regression 0.2628 0.3282 0.3725 0.2822 0.2957
LightGBM Neg. Bin. loss default 0.7434 0.8479 0.6625 0.7189 0.7232
LightGBM Poisson loss default 0.2719 0.3389 0.3843 0.2904 0.3048
LightGBM Tweedie loss default 0.2746 0.3609 0.3879 0.2941 0.3095
Poisson distribution assumption
Lasso 0.3009 0.4049 0.4298 0.2977 0.3268
Pooled Regression 0.3035 0.3871 0.4225 0.2959 0.3239
LightGBM Neg. Bin. loss default 0.8211 0.9336 0.7121 0.7433 0.7670
LightGBM Poisson loss default 0.3126 0.3978 0.4318 0.3028 0.3319
LightGBM Tweedie loss default 0.3153 0.4248 0.4343 0.3065 0.3367
Table 7
Training time on the M5 dataset of the proposed model variants on level 10 and benchmarks
directly modelling on level 12.
Model Training time (minutes)
In-sample quantiles 0.53
Emp-Wd 11.99
Pois 0.61
NB-CMP 0.93
ZIP 18.20
ZINB 91.60
DeepAR 17.07
Direct LightGBM (max lags) 483.72
Direct LightGBM (Fourier terms) 489.47
Lasso 4.56
Pooled Regression 2.05
LightGBM Neg. Bin. loss default 68.90
LightGBM Tweedie loss default 1.09
LightGBM Poisson loss default 1.09
Table 8
Number of series on Level 9 to Level 12, and the percentage of series on the corresponding level that follow
a negative binomial distribution (in percent) of the M5 dataset, as a trade-off to choose an aggregation level.
Level 12 Level 11 Level 10 Level 9
Number of series 30,490 9,147 3,049 70
Series following neg. bin. dist. (%) 83.75 51.11 25.25 17.14
References

Agrawal, N., Smith, S.A., 1996. Estimating negative binomial demand for retail inventory management with unobservable lost sales. Naval Res. Logist. 43, 839–861.
Alexandrov, A., Benidis, K., Bohlke-Schneider, M., Flunkert, V., Gasthaus, J., Januschowski, T., Maddix, D.C., Rangapuram, S., Salinas, D., Schulz, J., Stella, L., Türkmen, A.C., Wang, Y., 2020. GluonTS: Probabilistic and neural time series modeling in Python. J. Mach. Learn. Res. 21, 1–6. URL: http://jmlr.org/papers/v21/19-820.html.
Bandara, K., Bergmeir, C., Smyl, S., 2020. Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert Syst. Appl. 140, 112896. http://dx.doi.org/10.1016/j.eswa.2019.112896.
Bandara, K., Hewamalage, H., Godahewa, R., Gamakumara, P., 2021. A fast and scalable ensemble of global models with long memory and data partitioning for the M5 forecasting competition. Int. J. Forecast. http://dx.doi.org/10.1016/j.ijforecast.2021.11.004.
Bassett, G., Koenker, R., 1982. An empirical quantile function for linear models with IID errors. J. Amer. Statist. Assoc. 77, 407–415.
Bojer, C.S., Meldgaard, J.P., 2021. Kaggle forecasting competitions: An overlooked learning opportunity. Int. J. Forecast. 37, 587–603. http://dx.doi.org/10.1016/j.ijforecast.2020.07.007.
Box, G.E.P., Jenkins, G.M., Reinsel, G.C., Ljung, G.M., 2015. Time Series Analysis: Forecasting and Control. John Wiley & Sons.
Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. http://dx.doi.org/10.1145/2939672.2939785.
Chongsuvivatwong, V., 2022. epiDisplay: Epidemiological data display package. URL: https://CRAN.R-project.org/package=epiDisplay. R package version 3.5.0.2.
Cragg, J.G., 1971. Some statistical models for limited dependent variables with application to the demand for durable goods. Econometrica 39, 829. http://dx.doi.org/10.2307/1909582.
de Rezende, R., Egert, K., Marin, I., Thompson, G., 2021. A white-boxed ISSM approach to estimate uncertainty distributions of Walmart sales. Int. J. Forecast. http://dx.doi.org/10.1016/j.ijforecast.2021.11.006. URL: https://www.sciencedirect.com/science/article/pii/S0169207021001801.
Delignette-Muller, M.L., Dutang, C., 2015. fitdistrplus: An R package for fitting distributions. J. Stat. Softw. 64, 1–34. http://dx.doi.org/10.18637/jss.v064.i04.
Fildes, R., Kolassa, S., Ma, S., 2022a. Post-script—Retail forecasting: Research and practice. Int. J. Forecast. 38, 1319–1324. http://dx.doi.org/10.1016/j.ijforecast.2021.09.012.
Fildes, R., Ma, S., Kolassa, S., 2022b. Retail forecasting: Research and practice. Int. J. Forecast. 38, 1283–1318. http://dx.doi.org/10.1016/j.ijforecast.2019.06.004.
Gelman, A., Hill, J., 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://dx.doi.org/10.1017/cbo9780511790942.
Gneiting, T., 2011. Quantiles as optimal point forecasts. Int. J. Forecast. 27, 197–207. http://dx.doi.org/10.1016/j.ijforecast.2009.12.015.
Godahewa, R., Bandara, K., Webb, G.I., Smyl, S., Bergmeir, C., 2021. Ensembles of localised models for time series forecasting. Knowl.-Based Syst. 233, 107518. http://dx.doi.org/10.1016/j.knosys.2021.107518.
Godahewa, R., Webb, G.I., Schmidt, D., Bergmeir, C., 2022. SETAR-tree: A novel and accurate tree algorithm for global time series forecasting. http://dx.doi.org/10.48550/ARXIV.2211.08661.
Han, X., Dasgupta, S., Ghosh, J., 2021. Simultaneously reconciled quantile forecasting of hierarchically related time series. In: International Conference on Artificial Intelligence and Statistics. PMLR, pp. 190–198.
Hasni, M., Aguir, M.S., Babai, M.Z., Jemai, Z., 2019. On the performance of adjusted bootstrapping methods for intermittent demand forecasting. Int. J. Prod. Econ. 216, 145–153.
Hasson, H., Wang, B., Januschowski, T., Gasthaus, J., 2021. Probabilistic forecasting: A level-set approach. Adv. Neural Inf. Process. Syst. 34. URL: https://github.com/awslabs/gluon-ts/blob/master/src/.
He, X., 1997. Quantile curves without crossing. Amer. Statist. 51, 186–192.
Heinen, A., 2003. Modelling time series count data: An autoregressive conditional Poisson model. Available at SSRN 1117187.
Hewamalage, H., Bergmeir, C., Bandara, K., 2022. Global models for time series forecasting: A simulation study. Pattern Recognit. 124, 108441. http://dx.doi.org/10.1016/j.patcog.2021.108441.
Hilbe, J.M., 2011. Negative Binomial Regression. Cambridge University Press.
Hyndman, R.J., Ahmed, R.A., Athanasopoulos, G., Shang, H.L., 2011. Optimal combination forecasts for hierarchical time series. Comput. Statist. Data Anal. 55, 2579–2589. http://dx.doi.org/10.1016/j.csda.2011.03.006.
Hyndman, R.J., Koehler, A.B., 2006. Another look at measures of forecast accuracy. Int. J. Forecast. 22, 679–688.
Hyndman, R., Koehler, A.B., Ord, J.K., Snyder, R.D., 2008. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media.
Jackman, S., 2024. pscl: Classes and Methods for R Developed in the Political Science Computational Laboratory. University of Sydney, Sydney, Australia. URL: https://github.com/atahk/pscl/. R package version 1.5.9.
Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., Callot, L., 2020. Criteria for classifying forecasting methods. Int. J. Forecast. 36, 167–177. http://dx.doi.org/10.1016/j.ijforecast.2019.05.008.
Januschowski, T., Wang, Y., Torkkola, K., Erkkilä, T., Hasson, H., Gasthaus, J., 2021. Forecasting with trees. Int. J. Forecast. http://dx.doi.org/10.1016/j.ijforecast.2021.10.004.
Kaggle, 2018. Corporación Favorita grocery sales forecasting. URL: https://www.kaggle.com/c/favorita-grocery-sales-forecasting.
Kamarthi, H., Kong, L., Rodríguez, A., Zhang, C., Prakash, B.A., 2022. PROFHIT: Probabilistic robust forecasting for hierarchical time-series. arXiv preprint arXiv:2206.07940.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y., 2017. LightGBM: A highly efficient gradient boosting decision tree. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc.
Koenker, R., Bassett, G., 1978. Regression quantiles. Econometrica 33–50.
Kolassa, S., 2016. Evaluating predictive count data distributions in retail sales forecasting. Int. J. Forecast. 32, 788–803. http://dx.doi.org/10.1016/j.ijforecast.2015.12.004.
Kolassa, S., 2022. Commentary on the M5 forecasting competition. Int. J. Forecast. 38, 1562–1568.
Kourentzes, N., Trapero, J.R., Barrow, D.K., 2020. Optimising forecasting models for inventory planning. Int. J. Prod. Econ. 225, 107597. http://dx.doi.org/10.1016/j.ijpe.2019.107597.
Kunz, M., Birr, S., Raslan, M., Ma, L., Januschowski, T., 2023. Deep learning based forecasting: A case study from the online fashion industry. In: Forecasting with Artificial Intelligence: Theory and Applications. Springer, pp. 279–311.
Lainder, A.D., Wolfinger, R.D., 2022. Forecasting with gradient boosted trees: Augmentation, tuning, and cross-validation strategies: Winning solution to the M5 uncertainty competition. Int. J. Forecast. http://dx.doi.org/10.1016/j.ijforecast.2021.12.003.
Lambert, D., 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14.
Makridakis, S., Petropoulos, F., Spiliotis, E., 2022a. Special Issue: M5 Competition, vol. 38. Int. J. Forecast.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., 2022b. M5 accuracy competition: Results, findings, and conclusions. Int. J. Forecast. http://dx.doi.org/10.1016/j.ijforecast.2021.11.013.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., Chen, Z., Gaba, A., Tsetlin, I., Winkler, R.L., 2021. The M5 uncertainty competition: Results, findings and conclusions. Int. J. Forecast. http://dx.doi.org/10.1016/j.ijforecast.2021.10.009.
Montero-Manso, P., Hyndman, R.J., 2021. Principles and algorithms for forecasting groups of time series: Locality and globality. Int. J. Forecast. 37, 1632–1653. http://dx.doi.org/10.1016/j.ijforecast.2021.03.004.
O'Hara-Wild, M., Hyndman, R., Wang, E., 2021. fable: Forecasting models for tidy time series. URL: https://CRAN.R-project.org/package=fable. R package version 0.3.1.
Olivares, K.G., Meetei, O.N., Ma, R., Reddy, R., Cao, M., Dicker, L., 2021. Probabilistic hierarchical forecasting with deep Poisson mixtures. In: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. URL: https://www.amazon.science/publications/probabilstic-hierarchical-forecasting-with-deep-poisson-mixtures.
Panagiotelis, A., Gamakumara, P., Athanasopoulos, G., Hyndman, R.J., 2022. Probabilistic forecast reconciliation: Properties, evaluation and score optimisation. European J. Oper. Res. http://dx.doi.org/10.1016/j.ejor.2022.07.040.
Paria, B., Sen, R., Ahmed, A., Das, A., 2021. Hierarchically regularized deep forecasting. arXiv preprint arXiv:2106.07630.
Rangapuram, S.S., Werner, L.D., Benidis, K., Mercado, P., Gasthaus, J., Januschowski, T., 2021. End-to-end learning of coherent probabilistic forecasts for hierarchical time series. In: ICML 2021. URL: https://www.amazon.science/publications/end-to-end-learning-of-coherent-probabilistic-forecasts-for-hierarchical-time-series.
do Rego, J.R., De Mesquita, M.A., 2015. Demand forecasting and inventory control: A simulation study on automotive spare parts. Int. J. Prod. Econ. 161, 1–16.
Salinas, D., Flunkert, V., Gasthaus, J., Januschowski, T., 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 36, 1181–1191. http://dx.doi.org/10.1016/j.ijforecast.2019.07.001. URL: http://creativecommons.org/licenses/by/4.0/.
Sellers, K., Lotze, T., Raim, A., 2023. COMPoissonReg: Conway-Maxwell Poisson (COM-Poisson) regression. URL: https://CRAN.R-project.org/package=COMPoissonReg. R package version 0.8.1.
Shi, Y., Ke, G., Soukhavong, D., Lamb, J., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y., Titov, N., 2022. LightGBM: Light gradient boosting machine. URL: https://CRAN.R-project.org/package=lightgbm. R package version 3.3.2.
Simon, N., Friedman, J., Hastie, T., Tibshirani, R., 2011. Regularization paths for Cox's proportional hazards model via coordinate descent. J. Stat. Softw. 39, 1–13. URL: https://www.jstatsoft.org/v39/i05/.
Snyder, R.D., Ord, J.K., Beaumont, A., 2012. Forecasting the intermittent demand for slow-moving inventories: A modelling approach. Int. J. Forecast. 28, 485–496. http://dx.doi.org/10.1016/j.ijforecast.2011.03.009.
Spiliotis, E., Makridakis, S., Kaltsounis, A., Assimakopoulos, V., 2021. Product sales probabilistic forecasting: An empirical evaluation using the M5 competition data. Int. J. Prod. Econ. 240, 108237. http://dx.doi.org/10.1016/j.ijpe.2021.108237.
Stasinopoulos, D.M., Rigby, R.A., 2007. Generalized additive models for location scale and shape (GAMLSS) in R. J. Stat. Softw. 23. http://dx.doi.org/10.18637/jss.v023.i07.
Steutel, F.W., Van Harn, K., 2003. Infinite Divisibility of Probability Distributions on the Real Line. CRC Press.
Syntetos, A.A., Babai, M.Z., Gardner, E.S., 2015. Forecasting intermittent inventory demands: Simple parametric methods vs. bootstrapping. J. Bus. Res. 68, 1746–1752. http://dx.doi.org/10.1016/j.jbusres.2015.03.034.
Syntetos, A.A., Boylan, J.E., 2005. The accuracy of intermittent demand estimates. Int. J. Forecast. 21, 303–314. http://dx.doi.org/10.1016/j.ijforecast.2004.10.001.
Syntetos, M., Boylan, J., Croston, J.D., 2005. On the categorization of demand patterns. J. Oper. Res. Soc. 56. http://dx.doi.org/10.1057/palgrave.jors.2601841.
Taieb, S.B., Taylor, J.W., Hyndman, R.J., 2017. Coherent probabilistic forecasts for hierarchical time series. In: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, PMLR, pp. 3348–3357. URL: https://proceedings.mlr.press/v70/taieb17a.html.
Taieb, S.B., Taylor, J.W., Hyndman, R.J., 2020. Hierarchical probabilistic forecasting of electricity demand with smart meter data. J. Amer. Statist. Assoc. 116, 27–43. http://dx.doi.org/10.1080/01621459.2020.1736081.
Tibshirani, R., 1996. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288. URL: http://www.jstor.org/stable/2346178.
Venables, W.N., Ripley, B.D., 2002. Modern Applied Statistics with S, fourth ed. Springer, New York. URL: https://www.stats.ox.ac.uk/pub/MASS4/. ISBN 0-387-95457-0.
Viswanathan, S., Zhou, C.X., 2008. A New Bootstrapping Based Method for Forecasting and Safety Stock Determination for Intermittent Demand Items. Working Paper, Nanyang Business School, Nanyang Technological University, Singapore.
Willemain, T.R., Smart, C.N., Schwarz, H.F., 2004. A new approach to forecasting intermittent demand for service parts inventories. Int. J. Forecast. 20, 375–387. http://dx.doi.org/10.1016/S0169-2070(03)00013-X.
Zeileis, A., Kleiber, C., Jackman, S., 2008. Regression models for count data in R. J. Stat. Softw. 27. URL: https://www.jstatsoft.org/v27/i08/.
Zhou, C., Viswanathan, S., 2011. Comparison of a new bootstrapping method with parametric approaches for safety stock determination in service parts inventory systems. Int. J. Prod. Econ. 133, 481–485.
Ziel, F., 2021. M5 competition uncertainty: Overdispersion, distributional forecasting, GAMLSS, and beyond. Int. J. Forecast. http://dx.doi.org/10.1016/j.ijforecast.2021.09.008.