Simplifying Tree-based Methods for Retail Sales Forecasting with Explanatory Variables
Abstract
Despite being consistently outperformed by machine learning (ML) in forecasting competitions, simple statistical
forecasting techniques remain standard in retail. This is partly because, for all their advantages, these top-performing
ML methods are often too complex to implement. We have experimented with various tree-based ML methods and
find that a ‘simple’ implementation of these can (substantially) outperform traditional forecasting methods while being
computationally efficient. Our approach is validated with a dataset of 4,523 products of a leading Belgian retailer
containing various explanatory variables (e.g., promotions and national events). Using Shapley values and slightly
adjusted tree-based methods, we show that superior performance depends on the availability of explanatory variables
and additional feature engineering. For robustness, we show that our findings also hold when using the M5 competition
dataset. Extensive numerical experimentation finally shows how the forecast superiority of our proposed framework
translates to higher service levels, lower inventory costs, and improvements in the bullwhip of orders and inventory.
With its excellent performance and scalability to practical forecasting settings, our framework contributes to the
growing body of research aimed at facilitating a higher adoption rate of ML among ‘traditional’ retailers.
Keywords: Forecasting, Global forecasting methods, Tree-based methods, Inventory simulation
1. Introduction
Global retail is currently estimated to be a 25 trillion dollar industry.2 Retailers’ core activity is getting the right
product to the right place at the right time; efficiently organizing the flow of products from supplier to customer
is a large supply chain optimization problem. As most retailers face high fixed costs and low profit margins, small
revenue increases can drastically improve their bottom line. This motivates using innovative technologies to improve
sales forecasts (Fisher and Raman, 2018). Sales forecasts are also used to support various business decisions such
as workforce scheduling, inventory replenishment, and safety stock calculations. Many of these operational decisions
happen daily or weekly at the product-store level, where some products are sold in large quantities and others only
sporadically. This requires forecasting methods that can accommodate a variety of time series with different levels
of variability and intermittency (periods with zero sales) (Petropoulos et al., 2013; Spiliotis et al., 2020). These
forecasting methods can also incorporate explanatory variables such as promotions and calendar events to improve
their accuracy. In addition to forecast accuracy, scalability is important. Fildes et al. (2019b) note that scalability is
one of the major forecasting challenges mentioned by retailers: even a small retailer with 2,000 products and
15 stores requires 420,000 unique forecasts per day (2,000 products × 15 stores × a 14-day horizon) if the daily sales of each product-store combination for two weeks
is of interest. In this paper, we show how a ‘simple’ decision-tree machine-learning framework performs well for retail
forecasting. The motivation for tree-based methods comes from their recent success in retail forecasting competitions
(Bojer and Meldgaard, 2021) and their ease-of-use compared to neural networks (Januschowski et al., 2021).
Studies indicate that, today, most retailers still use simple statistical methods like exponential smoothing (Fildes
and Petropoulos, 2015; Weller and Crone, 2012). These simple methods are the default in commercial software packages
(Fildes et al., 2019b), are easy to compute and understand, and have historically been able to achieve similar levels of
forecast accuracy as more complex methods (Makridakis et al., 2020). At the same time, several more recent studies
and forecasting competitions plead in favor of more complex ML methods, which have recently started to outperform
simple statistical methods (Bojer and Meldgaard, 2021; Huber and Stuckenschmidt, 2020; Makridakis et al., 2022;
Mukherjee et al., 2018; Salinas et al., 2020; Spiliotis et al., 2020). Yet, many of the suggested ML methods are
insufficiently validated regarding benchmarks, availability of explanatory variables, accuracy metrics, and test data.
Their model complexity and computational requirements may also prevent ‘traditional’ retailers from putting these
ML forecasting techniques into production as most retailers require millions of forecasts to cover each product in each
store (Bojer and Meldgaard, 2021; Fildes et al., 2019b; Hyndman, 2020; Seaman, 2018; Wellens et al., 2021). Besides
the computational cost, Petropoulos et al. (2021) emphasize that complex forecasting methods hinder retailers from
making forecasts for all their product offerings or adhering to best practices, such as proper hyperparameter tuning.
The forecasts generated by these more sophisticated methods also pose challenges in interpretation and explanation
for decision-makers. This lack of transparency may lead to an aversion towards algorithms, even if these methods
outperform simpler-to-understand methods (Dietvorst et al., 2015).
To understand the trade-off between model complexity and practical applicability, we investigate various tree-
based ML methods with different levels of complexity. We compare them to popular statistical methods regarding
forecast accuracy, bias, and inventory performance on a rich dataset of a leading Belgian retailer. The dataset includes
4,523 products with different levels of variability and intermittency. It contains various explanatory variables such as
daily prices, promotions (including competitor’s promotions), weather forecasts, product hierarchy, national events,
and holidays. In addition, we explore the importance of these different data sources using Shapley values, a popular
explainable artificial intelligence framework, and various tree-based methods trained on subsets of the data. To ensure
our insights are structural and independent of the specific dataset, we also apply our framework to the dataset of the
M5 competition.
On our private dataset, we find that a ‘simple’ tree-based ML implementation improves the forecast accuracy
of traditional methods3 by 11.48% on average, while being computationally efficient. With ‘simple,’ we mean that
publicly available open-source software can be used off the shelf without modifying core algorithms or conducting
advanced data transformations. These results are robust for multiple products and time series categories over time.
More complex versions of tree-based ML only marginally improve forecast accuracy. Using Shapley values and various
tree-based methods, we find that the superior performance of our tree-based framework is explained by the availability
of explanatory variables and their capability to learn non-linear time series. Despite the ‘simplicity’ of this framework,
we note that it is not a plug-and-play method and still requires data science knowledge. Our analysis shows the
value—and need—of additional (manual) feature engineering. Finally, we show, using an inventory simulation, how
the improved forecast performance translates into higher service levels, lower inventory levels, and improvements in
the bullwhip of orders and inventory.
The contribution of our paper is threefold. First, we show that a simple, off-the-shelf tree-based method performs
exceptionally well. For most retailers, the gains of more sophisticated tree-based methods may not be worth the
increased model complexity and computational requirements. Second, we describe the conditions under which our
method outperforms popular statistical methods, offering managers the tools to identify promising implementation
opportunities quickly. Third, we investigate how these improved sales forecasts can benefit the daily replenishment of
a retailer. In short, our findings provide evidence that a simple tree-based method is a competitive option for retailers
to improve their forecast accuracy and replenishment when sufficient data is available.
2. Literature review
There is ample evidence that retailers use managerial judgment and simple statistical methods to forecast demand
(Fildes and Petropoulos, 2015; McCarthy et al., 2006). A decade ago, Weller and Crone (2012) surveyed 200 demand
planning experts and found that exponential smoothing, moving average, and the naive method account for 82.1%
of the statistical forecasting methods used in practice. Only 13.5% of the respondents made use of advanced time
series models such as econometric models (6.9%), autoregressive integrated moving average (ARIMA; 3.5%), and
neural networks (NNs; 1.5%). Once estimated, forecasts are often adjusted with experts’ knowledge to incorporate
the effect of explanatory variables such as promotions and holidays. The effectiveness of such manual adjustments is
questionable. Multiple studies have shown that the forecast accuracy is often higher when these manual adjustments
are limited (Fildes et al., 2009, 2019a).
3 With traditional forecasting methods we mean commonly used forecasting methods in practice by retailers such as exponential smooth-
ing and moving average.
The dominance of simple statistical methods in practice should not come as a surprise. Historically, these methods achieved
similar levels of forecast accuracy as more complex methods (Makridakis et al., 2020). This observation was confirmed
by multiple forecasting competitions (Makridakis et al., 1982, 1993; Makridakis and Hibon, 2000). For instance, the
winning method of the M3 forecasting competition combined linear regression and simple exponential smoothing with
some minor tweaks (Assimakopoulos and Nikolopoulos, 2000). Although simple by design, it outperformed the more
sophisticated methods in the M3 as well as every method in the NN3 competition, which was organized nearly a
decade later to promote the use of ML for forecasting (Crone et al., 2011; Hyndman, 2020). In fact, until 2015, most
retail forecasting competitions on Kaggle,4 the world’s largest data science community, were won by relatively simple
statistical methods (Bojer and Meldgaard, 2021). As a result, the superiority of ML in the field of forecasting has
been questioned until very recently (Fildes et al., 2019b; Makridakis et al., 2018, 2020).
More recent studies and forecasting competitions, however, plead in favor of ML methods (Huber and Stucken-
schmidt, 2020; Makridakis et al., 2022; Mukherjee et al., 2018; Salinas et al., 2020; Spiliotis et al., 2020). Since 2015,
all major large-scale retail forecasting competitions on Kaggle have been dominated by ML (Bojer and Meldgaard,
2021). It started with the Rossmann Store Sales competition, which was won by a tree-based ML method, more
specifically, an ensemble of 12 Extreme Gradient Boosting (XGBoost) methods. Two years later, similar results
were found at the Corporación Favorita Grocery Sales Forecasting competition, where only ML methods were among
the top-performing methods. This time, the winner used an ensemble of NNs and Light Gradient Boosting Machine
(LightGBM) methods, which is an improved implementation of XGBoost (Ke et al., 2017). Finally, in 2021, the M5
Accuracy competition was dominated by tree-based ML methods; all but one of the top 50 performing methods used
LightGBM (the third place was taken by a method solely based on NNs) (Makridakis et al., 2022). The M5 was
set up to estimate the sales forecasts at different hierarchical levels for 3,049 products in ten Walmart stores. The
time series were hierarchically organized, starting at the product-store level and aggregated per department, category,
store, and other combinations. The winning method estimated every sales forecast with an ensemble of recursive and
non-recursive LightGBM methods. (A recursive method uses its prior forecasts as inputs to make predictions for later
timestamps.) Interestingly, while the top-performing methods outperformed the benchmarks by more than 20% on
average, the outperformance largely disappeared at the product-store level. This raises the question of whether
these ML methods can outperform the benchmarks at the more granular levels.
The recent victories of ML in sales forecasting are largely driven by innovations in ML (Benidis et al., 2020;
Bojer and Meldgaard, 2021). Examples are embedding layers in NNs, which appeared in 2016 (Guo and Berkhahn,
2016), and the wider adoption of the long short-term memory (LSTM) cell. Decision-tree methods have also become
more advanced since the launch of XGBoost in 20145 and LightGBM in 2016.6 Another major innovation is the
increased use of global forecasting methods. Global methods estimate model parameters by using multiple time series
simultaneously. This is in sharp contrast to most traditional forecasting methods, which are local: they
build one method per time series and do not share any parameters (Januschowski et al., 2020). Global methods can
easily exploit cross-series information by learning across different time series. This is especially valuable when time
series are related (Bandara et al., 2020; Makridakis et al., 2022; Smyl, 2020). As such, global methods have access to
more training data and can be more complex (e.g., regarding the total amount of inputs and learning non-linear time
series patterns) than their local counterparts, yet without necessarily overfitting the training data (Montero-Manso and
4 https://www.kaggle.com/
5 https://github.com/dmlc/xgboost/releases
6 https://github.com/microsoft/LightGBM/releases
Hyndman, 2021).
Global tree-based methods and NNs have demonstrated state-of-the-art performance in recent retail forecasting
competitions. However, decision trees can achieve results similar to NNs using simpler model architectures (i.e.,
without modifying the core algorithms or software) and without extensive data preprocessing (Januschowski et al.,
2021). This renders them more amenable to practical application. The winner of the M5, an undergraduate student with
little prior knowledge and experience in forecasting, illustrates the ease of use of tree-based methods (Makridakis et al.,
2022). Shwartz-Ziv and Armon (2021) compared tree-based methods with various deep NN methods on different
tabular datasets. They showed that tree-based methods consistently outperform NNs while requiring less tuning. We
acknowledge that not all NNs have sophisticated model architectures. NNs with more straightforward architectures
exist but still need extensive pre- and postprocessing of the input data, and they do not necessarily outperform the
simpler statistical forecasting methods (Hewamalage et al., 2021; Spiliotis et al., 2020). The extensive data processing
entails data transformations, such as deseasonalizing the time series, stabilizing the variance, and normalizing the
mean and trend (Hewamalage et al., 2021). Although simple to understand theoretically, it substantially complicates
the programming code (we refer to Benidis et al. (2020) for an extensive review on large-scale forecasting with NNs).
Note that some of these data transformations can also benefit tree-based methods (especially in the case of training
global forecasting methods with heterogeneous time series); however, they are not a strict requirement for trees (Makridakis
et al., 2022; Montero-Manso and Hyndman, 2021).
Despite the incredible performance of these top-performing ML methods, they are rarely used in practice (Fildes
et al., 2019b). Multiple authors argue that large-scale adoption is partly hampered by insufficient validation
and difficult implementation. ML forecasting papers and competitions have been criticized for including too few
informative explanatory variables, overfitting the test data, or not using proper benchmarks and accuracy metrics
(Bojer and Meldgaard, 2021; Hyndman, 2020). This lack of proper validation is also seen as a major issue by retailers
(Fildes et al., 2019b). Besides, we believe that the practical applicability of the proposed methods is often neglected
as the goal of forecasting competitions (and many papers) is to achieve the highest possible level of forecast accuracy
regardless of their implementation complexity. For example, the M4 winning method is said to be too complicated to
implement in terms of costs and effort as it combines statistical features with multiple blocks of LSTMs (Gilliland,
2020; Makridakis et al., 2022). The M5 winning method uses an ensemble of pure LightGBM methods. While
each individual LightGBM method is straightforward to implement, the ensemble consists of LightGBM methods with
different pooling strategies, recursive and non-recursive inputs, and advanced feature engineering and feature selection
procedures. This complicates the codebase and the computational requirements to train each global LightGBM method
in practical retail settings (Wellens et al., 2021).
We contribute to the literature by showing that a straightforward tree-based method with explanatory variables
and basic feature engineering suffices to outperform popular benchmarks. We illustrate how the model complexity and
computational requirements of top-performing tree-based methods, such as the M5 winning method, can be reduced
with a minimal negative impact on forecast accuracy. Our approach achieves this by simplifying common complexities
observed in the M5 winning method. This includes transforming recursive inputs into non-recursive ones; focusing on
a single forecasting method instead of ensembles; training a single forecasting method over all time series (rather than
first creating multiple pools of similar time series and then training a forecasting method per pool); and, lastly,
restricting ourselves to ‘basic’ feature engineering, omitting feature selection procedures and the addition of complex inputs.
In the next section, we describe our decision-tree framework for retail forecasting and validate it in Sections 4 and 5.
Figure 1: Summary of our decision-tree framework. Our framework simplifies the experimental, training, and prediction phases of a
tree-based method to make it scalable while still obtaining accurate forecasts.
3. A decision-tree framework for retail forecasting
We describe our decision-tree framework (DTF) for retail sales forecasting at the product-store level. Given the
higher complexity of NNs and our goal to simplify tree-based ML methods for retail applications, we focus on the
‘simpler-to-use’ decision-tree methods. More specifically, we use gradient boosting decision trees (GBDTs) currently
dominating Kaggle. Implementing a tree-based ML method requires three phases: (1) an experimental phase to decide
on the GBDT method’s inputs and hyperparameters, (2) a training phase to train these methods on the most recent
data, and (3) a prediction phase that uses the trained methods to produce sales forecasts (Januschowski et al., 2020).
Most top-performing ML methods require sophisticated optimization steps during these three phases. Examples are
data transformations (e.g., normalizing the average unit sales), feature selection (e.g., recursive feature elimination
(May et al., 2011)), optimizing forecast ensembles with meta-learning (Ma and Fildes, 2021), modifying the core ML
algorithms to improve performance (Januschowski et al., 2021), and more. These steps require an advanced level of
ML knowledge and often extensive computation. The simplification of our framework is motivated by various papers
that have shown that a suboptimal selection of forecasting methods has a negligible effect on the out-of-sample forecast
accuracy (Nikolopoulos and Petropoulos, 2018; Petropoulos et al., 2021). In what follows, we describe each phase of
our ‘simplified’ framework that can still obtain accurate forecasts. Figure 1 visualizes its key elements.
(1) Experimental phase: The experimental phase aims to find the best-performing inputs and hyperparameters
for our tree-based method. (a) First, the input data for the methods must be prepared. (b) Second, a process
is needed to select the best-performing method. (c) Third, the hyperparameters of the selected methods need to be
optimized.
(a) Input data: As mentioned before, GBDT methods do not require sophisticated data transformations, unlike
NNs. GBDT methods can handle time series with different levels of variability or mean without any additional
processing (Makridakis et al., 2022). Given that these data transformation steps often require extensive computation
time and expert knowledge, our simplified framework omits sophisticated data transformations. This also makes
it more scalable. Moreover, most GBDT implementations can handle input data with missing values, categorical
variables, and even variables with text (strings). This reduces the need for even simpler preprocessing operations. We
only require some basic cleaning, such as looking for errors in the data.
We extract all available data relevant to the forecasting task from the data warehouse, such as the historical
sales data, the sales timestamps, and the available explanatory variables such as price and promo data. We call this
available data the ‘raw’ data. Creating additional inputs based on the ‘raw’ data, known as feature engineering,
typically improves the learning of ML methods. Therefore, we manually apply basic operations to the available data to
create new inputs. Examples of these ‘basic’ operations are lagged variables (e.g., lags of promotions and holidays)
and basic statistical operations on the time series data and explanatory variables (e.g., the mean of the sales and price,
the maximum price per product, etc.). Once the inputs have been created, it is good practice to use only a subset of
them. Known as feature selection, this removes redundant inputs to reduce the dimensionality of
the forecasting problem, which improves learning (May et al., 2011). However, we suggest using all the available ‘raw’
inputs as-is together with the inputs created by basic manual engineering. Because this keeps the total number of inputs
limited, we avoid the need for any sophisticated feature selection procedure: decision trees naturally ignore a reasonably small amount of
noisy inputs. Note that most top-performing ML methods improve the feature engineering step by using automatic
feature engineering packages such as tsfresh,7 featuretools,8 and tsflex,9 or by extensively searching (manually)
for more sophisticated inputs such as computing price elasticities. While these techniques typically improve forecast
accuracy, they also increase the model complexity and computational requirements considerably. In addition, these
techniques may produce hundreds and even thousands of potentially relevant inputs. This large amount of inputs,
in turn, needs to be reduced through sophisticated feature selection procedures that, again, increase complexity and
computational requirements.
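To make this step concrete, the following minimal Python sketch illustrates the kind of ‘basic’ feature engineering we have in mind. It assumes a pandas DataFrame with one row per product and day and illustrative column names (product_id, date, sales, daily_price); it is a sketch of the idea, not our exact implementation.

```python
import pandas as pd

def add_basic_features(df: pd.DataFrame, lags=(1, 2, 3, 7, 14), windows=(7, 28)) -> pd.DataFrame:
    """Add lagged sales, rolling means, simple price statistics, and calendar inputs.

    Assumes columns: product_id, date (datetime), sales, daily_price (illustrative names).
    """
    df = df.sort_values(["product_id", "date"]).copy()

    # Lagged sales, computed within each product's time series.
    for lag in lags:
        df[f"sales_lag_{lag}"] = df.groupby("product_id")["sales"].shift(lag)

    # Rolling means of past sales; shifting by one period avoids using the current day's sales.
    for w in windows:
        df[f"sales_rolling_mean_{w}"] = df.groupby("product_id")["sales"].transform(
            lambda s, w=w: s.shift(1).rolling(w).mean()
        )

    # Basic statistics on explanatory variables, e.g. the price relative to its product-level maximum.
    df["max_price_product"] = df.groupby("product_id")["daily_price"].transform("max")
    df["relative_price"] = df["daily_price"] / df["max_price_product"]

    # Calendar inputs derived from the timestamp.
    df["day_of_month"] = df["date"].dt.day
    df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
    df["month"] = df["date"].dt.month
    return df
```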
(b) Model selection: Our DTF uses an ‘off-the-shelf,’ non-recursive, single, global GBDT method on all available
data per store. By ‘off-the-shelf,’ we mean that no changes to the original software code are made to improve the
forecast accuracy further. Our recommendation is to employ non-recursive forecasting methods. While we acknowledge
that recursive GBDT methods can be more accurate (Makridakis et al., 2022), especially for long-term forecasting,
they come with increased complexity in implementation. These methods are not prebuilt in most forecasting packages,
and their computation time is higher due to the need to recompute recursive inputs after each forecast. Therefore,
we highlight that this study primarily focuses on short-term forecasting. We omit ensembles that include multiple
forecasting methods. Although known for improving forecast accuracy, ensembles increase computation time and
typically require additional strategies to weight the forecasts of the ensemble (Ma and Fildes, 2021). Our non-
recursive GBDT method can be implemented easily using ‘off-the-shelf’ software without a deep understanding of the
algorithm. Lastly, our model architecture consists of a single global method per store. Multiple authors (Bandara
et al., 2020; Montero-Manso and Hyndman, 2021) show that the forecast accuracy of global forecasting methods can
be improved by grouping the set of time series into smaller groups. This results in training multiple global methods
on subsets of the time series. As grouping time series is a non-trivial optimization problem, we suggest simply pooling
all the time series per store.
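As an illustration of this model selection, the sketch below trains a single, global, non-recursive LightGBM method on the pooled data of one store using only off-the-shelf functionality. The feature tables, target, and objective choice (a Tweedie objective is a common choice for non-negative, intermittent sales) are assumptions made for the sake of the example, not a definitive implementation.

```python
import lightgbm as lgb

def train_store_method(X_train, y_train, X_valid, y_valid, categorical_cols):
    """Train one global, non-recursive GBDT per store on all pooled product-day rows."""
    train_set = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_cols)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

    params = {
        "objective": "tweedie",   # illustrative choice for non-negative, intermittent sales
        "metric": "rmse",
        "learning_rate": 0.05,
        "num_leaves": 127,
        "min_data_in_leaf": 50,
        "verbosity": -1,
    }
    booster = lgb.train(
        params,
        train_set,
        num_boost_round=2000,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=100)],
    )
    return booster

# Prediction phase: the trained booster scores all product-days in the forecast horizon at once,
# e.g. forecasts = booster.predict(X_future).
```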
(c) Hyperparameter tuning: We implement ready-to-use open-source packages that work with most ML methods to
find the best-performing set of hyperparameter values. These packages are typically computationally efficient and fully
automate the hyperparameter tuning. Consequently, these require little expertise on the subject of hyperparameters.
We suggest optimizing only the hyperparameters that strongly impact the learning capabilities of GBDT methods,
such as the number of trees, the maximum number of leaves, the minimum amount of data per leaf, the learning
rate, and the L1 and L2 regularization. Note that many of these hyperparameters overlap in their aim to reduce
tree complexity and the chance of overfitting. For example, L1 and L2 regularization penalize the leaf scores
to limit tree complexity, which is also controlled by the minimum amount of data per leaf. Therefore, one
7 https://tsfresh.readthedocs.io/en/latest/
8 https://www.featuretools.com/
9 https://github.com/predict-idlab/tsflex
can opt to optimize fewer hyperparameters to reduce computational requirements. To further reduce computational
requirements, we suggest implementing packages based on more advanced optimization techniques, such as Bayesian
optimization, instead of random search or grid search. Finally, we suggest limiting the total number of iterations of
retraining the methods with different sets of hyperparameters, as finding the optimal hyperparameters only slightly
impacts the forecast accuracy (Nikolopoulos and Petropoulos, 2018; Petropoulos et al., 2021).
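A minimal sketch of this tuning step is given below, using Optuna (the kind of ready-to-use package referred to above) to search only a handful of influential LightGBM hyperparameters with a capped number of trials; the training and validation tables are assumed to be those of the previous sketch.

```python
import lightgbm as lgb
import optuna

def objective(trial):
    # Search only the hyperparameters that strongly affect learning; keep the rest at defaults.
    params = {
        "objective": "tweedie",
        "metric": "rmse",
        "verbosity": -1,
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 200),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-3, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-3, 10.0, log=True),
    }
    booster = lgb.train(
        params,
        lgb.Dataset(X_train, label=y_train),
        num_boost_round=500,
        valid_sets=[lgb.Dataset(X_valid, label=y_valid)],
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    return booster.best_score["valid_0"]["rmse"]

# A limited number of trials suffices, as exhaustive tuning only marginally improves accuracy.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
best_params = study.best_params
```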
(2) Training phase: The training phase trains the methods identified during the experimental phase on the most
recent data. The computational requirements for training the determined GBDT methods are low as no ensembles
are used, and only one global method per store must be trained. Note that ML methods must be retrained when new
data or products come in. However, this retraining can happen only sporadically, as Huber and Stuckenschmidt (2020) show
over a five-month test period that retraining ML methods improves forecast accuracy by only 1% to 3%.
(3) Prediction phase: The prediction phase uses the trained methods obtained in the training phase to produce
sales forecasts. Once the non-recursive GBDT methods are trained, they estimate the sales forecasts for the required
forecast horizon. While training ML methods can be time-consuming, the inference is typically fast (Januschowski
et al., 2020), especially in our case, as we do not use any ensembles, meta-learners, or recursive methods. Note
that we do not include any postprocessing either. Postprocessing is often necessary when the input data has been
normalized or deseasonalized.
4. Experimental setup
In this section, we describe the experimental setup used to validate our DTF. We explore the dataset and describe
how we apply our DTF. Next, we explore various tree-based methods with varying levels of complexity. This allows us
to understand better the impact of simplifying these tree-based methods. We then give an overview of our performance
measures and benchmarks. Section 5 discusses the results.
4.1. Dataset
We validate our DTF on a private dataset of a large Belgian retailer. The dataset includes daily sales of 4,523
unique products from 09/01/2017 to 10/22/2020 (1,148 days) in one large supermarket in Belgium and is enriched
by explanatory variables from various data sources. The retailer sells food products, including perishable products
(e.g., strawberries and tomatoes) and products with longer shelf life (e.g., ice cream and frozen pizzas). Products can
be bought in-store or online. In the latter case, the basket with products is collected by the retailer for pick-up in
the store. The entire dataset includes data from eight different data sources. Table 1 summarizes the data and their
source.
The first two sources contain point-of-sales data (unit sales and timestamp). The first data source includes the daily
sales and the daily sales lags, indicated as sales lag 1, sales lag 2, sales lag 3, and so on. The second data source
includes information about the timestamp of each daily sale. This includes day of the month, week of the year,
month, and year. The point-of-sales data differs in terms of variability and intermittency. Some products are sold
sporadically (e.g., expensive wines), while others experience volatile sales patterns due to promotions or seasonality.
We use the formulation of Syntetos et al. (2005) to categorize the time series of the sales data into four groups
according to the squared coefficient of variation of the units sold (CV²) and their average inter-demand interval (ADI),
which is the average number of days between registered sales (see Figure 2). Smooth time series have low levels of
variability (CV² < 0.5) and intermittency (ADI < 4/3 days). These represent the majority of our dataset (46.01%). The
second largest category of our time series is erratic (18.22%). These are characterized by higher levels of variability
but with a similar level of intermittency. The third group includes the intermittent time series (18.09%), which have
little variance but a high level of intermittency. Finally, 17.68% of the products can be categorized as lumpy. These
time series have high levels of variability and intermittency.
Figure 2: Classification of the sales data of 4,523 products based on their intermittency (ADI) in days and variability (CV²). The dataset
includes 818 intermittent, 800 lumpy, 824 erratic, and 2,081 smooth time series.
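For illustration, a minimal sketch of this categorization is given below; it uses one common operationalization in which the ADI is the series length divided by the number of periods with sales and CV² is computed on the non-zero demand sizes.

```python
import numpy as np

def classify_series(sales, adi_cut=4 / 3, cv2_cut=0.5):
    """Classify one product's daily sales as smooth, erratic, intermittent, or lumpy."""
    sales = np.asarray(sales, dtype=float)
    nonzero = sales[sales > 0]
    if nonzero.size == 0:
        return "no demand"

    adi = sales.size / nonzero.size                 # average inter-demand interval (days)
    cv2 = (nonzero.std() / nonzero.mean()) ** 2     # squared coefficient of variation of demand sizes

    if adi < adi_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"
```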
The six additional data sources contain data on explanatory variables. The third data source describes the hi-
erarchical structure of each product. Each product is hierarchically structured in five different levels of aggregation
(productgroup l 1, productgroup l 2, productgroup l 3, productgroup l 4, and productgroup l 5) by the re-
tailer using managerial judgment and is based on product characteristics (e.g., is the product drinkable or eatable, is
it a vegetable, etc.). For example, a six-pack of beer bottles can be hierarchically structured as a food/drink product
(level one), a beverage (level two), an alcoholic beverage (level three), beer (level four), and beer sold in glass (level
five). The first level consists of three categories, the second level of eight categories, the third level has 55 different cat-
egories, the fourth level has 246 categories, and the most disaggregated level consists of 1,135 categories. Although this
hierarchical structure was originally set up for various business operations, it proved useful for learning similar time series patterns across
related products. The fourth data source describes two sets of holiday information. The first set includes
national public holidays like Christmas and Easter. These periods are typically linked to, e.g., a closure of the store or
family reunions and can be useful to identify temporary changes in the time series patterns. The input public holiday
includes this information and the date of each public holiday. The second set includes data about school holidays
that may cause shifts in demand. The input school holiday includes the name and date of each school holiday.10
The dataset also includes two inputs called public holiday dummy and school holiday dummy that indicate whether
it is a holiday (or not) without specifying the name of the holiday. The fifth data source includes weather forecasts
10 Note that the effect of school holidays or public holidays may also be implicitly captured by using seasonality. This is, however, not
the case for all holidays as some may not have a fixed date (e.g., Easter) (Huber and Stuckenschmidt, 2020).
Table 1: Overview of the ‘raw’ data in the dataset.
(weather type forecast and temperature forecast). These forecasts include the daily temperature and the type
of weather (such as sunny or cloudy) in Belgium. We distinguish between 14 different types of weather. This data is
based on forecasts with a two-week forecast horizon. As a result, even when forecasting the next day, we use weather
forecasts that were made two weeks earlier. The sixth data source describes the price information. We denote daily price
as the daily price of a product if it is bought physically in-store, and daily price pickup when ordered online. For dis-
counted products bought in bulk, we have daily price large quantity and daily price large quantity pickup.
These prices are always lower than or equal to daily price (respectively, daily price pickup) and only apply if a certain
volume is bought. If no volume discounts apply, daily price large quantity (pickup) is equal to daily price (pickup). The seventh
data source includes promotional information. It describes whether there is a promotion (promo dummy), the type of
promotion (18 different types are identified; promo type), the discount percentage (promo depth), and whether the
price discount is due to a nearby competitor with lower prices (price reaction, price reaction pickup). Finally,
we have access to a dataset of national events. This includes events such as Tournée minérale, a national campaign to
reduce alcohol consumption during February, and other events such as festivals. The national events are split over four
different inputs as sometimes multiple events happen at the same time (events 1, events 2, events 3, events 4).
(09/01/2017 to 07/15/2020). We use the last three months of the data (07/16/2020 to 10/22/2020) to
evaluate our (trained) methods against the benchmarks (this is the test set).11 We review the experimental, training,
and prediction phases in the next subsections.
11 Note that we do not adjust the dataset for structural breaks such as the COVID-19 pandemic, which could not be foreseen.
12 https://lightgbm.readthedocs.io/en/latest/
13 https://optuna.org/
14 https://scikit-learn.org/stable/modules/generated/sklearn.model selection.TimeSeriesSplit.html
4.2.3. Shapley values
To understand how our DTF interacts with its inputs, we make use of Shapley values. ML, in general, is typically
hard (if not impossible) to interpret and is therefore referred to as a black box. A recently popularized method to
explain such a black box is SHAP (SHapley Additive exPlanations), developed by Lundberg and Lee (2017). The
idea of SHAP is based on the work of Shapley (1953), which describes a method used in cooperative game theory
to distribute the total surplus fairly to each player. These contributions are called Shapley values. In the case of
forecasting, the total surplus can be seen as the prediction made by the black box ML algorithm, and the players
are defined as different inputs. Thus, given a black box and a set of data points, SHAP estimates how each input
impacts the prediction of a black box model. The Shapley values are defined as the average marginal contribution of
an input across all possible combinations of the inputs and are estimated for each prediction and each input. As such,
the Shapley values can be used to understand how a certain input influences the final prediction. When aggregating
the average absolute value of each Shapley value over all forecasts per input, each input’s global impact/importance
can be estimated. As such, the Shapley values help us understand what drives the good performance of the DTF.
We apply the TreeExplainer from the package SHAP15 to the DTF method that is trained on the training and
validation sets.
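For illustration, the sketch below shows how such global importances can be obtained with the SHAP package; booster denotes the trained LightGBM method and X_sample a sample of input rows (both names are illustrative assumptions).

```python
import numpy as np
import shap

# One Shapley value per prediction and per input.
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X_sample)

# The mean absolute Shapley value per input estimates its global impact on the forecasts.
global_importance = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(X_sample.columns, global_importance), key=lambda x: -x[1]):
    print(f"{name}: {value:.4f}")
```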
4.3. Tree-based methods with varying levels of complexity
To better understand the impact of simplifying tree-based methods, we compare our DTF with the following more sophisticated versions:
• Recursive DTF (RDTF): We implement a recursive version of the DTF, which we call the RDTF. RDTF is
similar to the DTF but includes four additional recursive inputs: rolling sales averages over the last 7, 14, 30,
and 60 days. These differ from the rolling mean inputs of the DTF, which we lag by the forecast horizon and
which are therefore non-recursive.
• Ensemble: We implement a simple ensemble that takes the average of the DTF and the RDTF forecasts.
• DTF-l-1 and DTF-l-2: We implement our DTF with different pooling strategies. Instead of training
one method per store, we train one per subgroup of products per store. More specifically, we focus on
productgroup l 1 containing three unique categories, and productgroup l 2 with eight different categories.
We call these methods DTF-l-1 and DTF-l-2, respectively. This means that DTF-l-1 consists of three Light-
GBM methods and DTF-l-2 of eight LightGBM methods, each trained on a separate dataset. Note that we
optimize the hyperparameters only once for DTF-l-1 and once for DTF-l-2. We pick one random group of time
series to optimize the hyperparameters and reuse these hyperparameters for the other groups to reduce the
computational requirements.
• DTF-sl-70: We implement our DTF with additional sales lags, from sales-lag-1 up to sales-lag-70.
• DTF-fs-sl-70: We also implement the DTF-sl-70 using a proper feature selection procedure based on
Rinderknecht and Klopfenstein (2021), which we denote as DTF-fs-sl-70. This approach uses Shapley val-
ues to determine the most important inputs to make the forecasts. To apply this feature selection procedure, we
15 https://shap-lrjball.readthedocs.io/en/docs update/generated/shap.TreeExplainer.html
first optimize the hyperparameters of the LightGBM method using all inputs. In the second step, we compute
the average Shapley value of each input on a random subset of the training data with a maximum of 5000 data
samples to restrict the computational requirements. This returns the average marginal contribution of each
input. In line with Rinderknecht and Klopfenstein (2021), we only keep the inputs that explain 95% of the
forecasts and drop the rest. Finally, we retrain the method using this subset of inputs (an illustrative sketch of this selection procedure follows this list).
• DTF-m5: The most sophisticated tree-based method we test is the M5 winning method. We replicate this
method by creating an ensemble of recursive and non-recursive methods, methods with different pooling strate-
gies, and a more advanced feature engineering and feature selection procedure. More specifically, we train our
DTF, DTF-l-1, and DTF-l-2 with additional inputs and the automatic feature selection procedure based on
Shapley values. We also make a recursive version of each. The additional inputs include 70 sales lags and two
more sophisticated engineered inputs. These two engineered inputs respectively measure for each product the
number of products sold at a given price and the number of days that the product was offered at that price. We
call this ensemble the DTF-m5. It consists of 24 different LightGBM methods.16
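For illustration, the following sketch outlines the Shapley-value-based feature selection used by DTF-fs-sl-70: inputs are ranked by mean absolute Shapley value and kept until 95% of the total importance is covered. Function and variable names are illustrative, and the sketch assumes a trained LightGBM booster and a pandas feature table.

```python
import numpy as np
import shap

def select_features(booster, X_train, max_samples=5000, coverage=0.95):
    """Keep the inputs whose cumulative share of mean absolute Shapley values reaches `coverage`."""
    X_sample = X_train.sample(n=min(max_samples, len(X_train)), random_state=0)
    shap_values = shap.TreeExplainer(booster).shap_values(X_sample)

    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    cumulative = np.cumsum(importance[order]) / importance.sum()

    n_keep = int(np.searchsorted(cumulative, coverage)) + 1
    return list(X_sample.columns[order[:n_keep]])

# The method is then retrained using only the selected subset of inputs.
```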
In addition to the more sophisticated versions of our DTF, we implement five LightGBM methods that are simpler
because they use fewer inputs.
• pos-nfe-DTF: We implement the pos-nfe-DTF, which relies only on point-of-sales data (pos) and uses no feature
engineering (nfe). The pos-nfe-DTF only includes sales lags and time-related inputs that are based on the ‘raw’
data related to the date (see first two rows (sales and date) of Table 1).
• pos-fe-DTF: We implement the pos-fe-DTF which relies on point-of-sales data but makes use of feature engi-
neering (fe). As a result, it uses all inputs related to the first two rows of Table 1 and Table A.6.
• nfe-DTF: The nfe-DTF uses all data sources (like our implementation of the DTF). However, it does not use
any feature engineering. Hence, it uses all the inputs of Table 1, but none of Table A.6.
• pp-DTF: The pp-DTF is based on the pos-fe-DTF, but it also includes the inputs related to the price and the
promotions (pp) of the products. Thus, the pp-DTF uses all available data besides the data sources related to
the hierarchy, holidays, weather forecasts, and events. This enables the analysis of the importance of subsets of
the available data.
• npp-DTF: The npp-DTF uses all data sources except the data related to price and promo (npp).
16 The DTF-m5 consists of two LightGBM methods that are trained on the entire dataset, six methods per productgroup l 1, and 16
per productgroup l 2. Half of these are recursive, and the other half non-recursive.
Table 2: Overview of tree-based methods with varying levels of complexity.
4.4. Benchmarks
We benchmark the forecast accuracy against nine forecasting methods commonly used in practice. The first
seven methods are based on the benchmarks of the M5 Accuracy competition (Makridakis et al., 2022). Their code
is publicly available on Github.17 We also benchmark against two Prophet-based benchmarks due to their recent
rise in popularity. In our study, we do not benchmark against NNs. The main reason is that most plain-vanilla
NN methods are not competitive with the other benchmarks. In the M5 Accuracy competition, for example, the
NN benchmarks were among the worst performers. More sophisticated NN architectures with sufficient data transformations
can perform extremely well. However, given our focus on simple-to-use tree-based methods, we believe that adding
plain-vanilla NN methods will not add much value.
1. Naive: The naive method is the simplest of all forecasting techniques. This method sets all forecasts equal
to the value of the last observed period. Although inherently naive, according to a study by Morlidge (2014),
52% of the forecasts produced by eight different consumer and industrial companies do not even outperform this
approach.
2. Seasonal Naive (sNaive): The sNaive method adjusts Naive by taking seasonality into account. It uses the
last observed value of the same period in the previous seasonal cycle. When estimating daily sales forecasts with a seasonal frequency of seven, the
method uses the last observed value of the same weekday to forecast daily sales.
3. Moving Averages (MA): Moving averages (MA) is another commonly used method in practice (Syntetos and
Boylan, 2005) that computes its forecasts by taking the average of the last k observations. This k is optimized
by using an in-sample MSE.
4. Exponential Smoothing (ES): One of the most popular forecasting methods in practice and the best-
performing benchmark at the M5 is exponential smoothing (ES). As different variations exist, we use the smooth
package in R to select the best-performing ES method per time series automatically.18
5. Croston’s method (CRO): Croston’s method (Croston, 1972) is a popular technique to forecast intermittent
sales patterns. It separately smooths the non-zero demand sizes, $z_t$, and the inter-demand intervals, $p_t$. Using simple
17 https://github.com/Mcompetitions/M5-methods/blob/master/validation/Point%20Forecasts%20-%20Benchmarks.R
18 https://cran.r-project.org/web/packages/smooth/index.html
exponential smoothing (this is the simplest version of ES as it does not include any trend or seasonality),
it forecasts $\hat{z}_t$ and $\hat{p}_t$ separately. Afterwards, it divides $\hat{z}_t$ by $\hat{p}_t$ to compute the final per-period forecast (an illustrative sketch follows this list).
6. Aggregate-Disaggregate Intermittent Demand Approach (ADIDA): Nikolopoulos et al. (2011) propose
another approach to forecasting intermittent time series by using temporal aggregation. Aggregating a high-
frequency time series to a lower frequency (e.g., aggregating daily sales to weekly sales) reduces intermittency and
variance. We then apply simple exponential smoothing on the aggregated time series to produce the forecasts.
7. Exponential Smoothing with explanatory variables (ESX): ESX includes explanatory variables to im-
prove the forecast accuracy of ES further. Note that local statistical methods are typically incapable of dealing
with a high-dimensional input space. Therefore, we use a greedy forward feature selection procedure. First, based
on domain knowledge, we manually rank all available inputs from most to least valuable. Second, we add the
most valuable input from the list to each local method. Third, we check for highly correlated inputs before train-
ing the ESX per time series. If the correlation is higher than 0.8, we remove this input for this single time series.
Fourth, if the additional input improves the average forecast accuracy over all time series on the test set, we add
the input to the method and go to the next input in the list. Once the forecast accuracy declines, we remove
the last input and stop training. In our dataset, the final ESX method includes promo dummy, promo depth,
price reaction, and daily price large quantity.
8. Prophet: Some retailers, e.g., Target in the US, have been experimenting with generalized additive models
as these are simple to use, explainable, and achieve good levels of forecast accuracy on a wide variety of time
series (Yelland et al., 2019). Prophet, developed by Facebook, is a popular method to implement generalized
additive models (Taylor and Letham, 2018). It is a decomposable forecasting method that consists of three
major components: trend, seasonality, and explanatory variables. We implement this method using the default
parameters.
9. Prophet with explanatory variables (Proph.X): We further improve the forecast accuracy of
Prophet by including explanatory variables (Proph.X). Therefore, we use a similar feature selection pro-
cedure as in the case of ESX. Given our dataset, the final method includes promo dummy, promo depth,
promo type, price reaction, daily price large quantity, daily price, public holiday dummy,
school holiday dummy, public holiday, school holiday, events 1, events 2, events 3, events 4.
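To make Croston's method concrete, a minimal sketch is given below; the smoothing parameter alpha and the initialization are illustrative choices rather than the exact settings used in our benchmarks.

```python
import numpy as np

def croston_forecast(sales, alpha=0.1):
    """Croston's method: smooth demand sizes and inter-demand intervals separately."""
    sales = np.asarray(sales, dtype=float)
    z_hat, p_hat = None, None       # smoothed demand size and inter-demand interval
    periods_since_demand = 1

    for y in sales:
        if y > 0:
            if z_hat is None:       # initialise on the first non-zero demand
                z_hat, p_hat = y, float(periods_since_demand)
            else:
                z_hat = alpha * y + (1 - alpha) * z_hat
                p_hat = alpha * periods_since_demand + (1 - alpha) * p_hat
            periods_since_demand = 1
        else:
            periods_since_demand += 1

    if z_hat is None:               # no demand observed in the training data
        return 0.0
    return z_hat / p_hat            # flat per-period forecast
```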
Note that, in contrast to most tree-based methods, the benchmarks require that the data contain no missing
values. We, therefore, interpolate the sales data on public holidays when the store is closed. Moreover, the benchmarks
that use explanatory variables require additional data preparation as these methods cannot handle categorical variables
or variables with text (strings). These additional data preparation steps are time-consuming and may be imperfect,
as interpolating sales data is just a proxy of the real demand. We find that these data preparation steps only have a
minor negative effect on forecast accuracy.19
19 We checked the impact of imputing missing values and making dummies of categorical variables on the forecast of our DTF. We found
that the forecast accuracy of our proposed DTF decreases by less than 0.5%, on average, when using this additionally processed data
instead of using the data with less data preparation.
4.5. Performance measures
We measure forecast accuracy with the root mean squared scaled error (RMSSE). The RMSSE is scale-independent, which allows comparisons
between time series, and it can be used on time series with intermittency (Hyndman and Koehler, 2006). For these
reasons, the RMSSE was also the key metric during the M5 Accuracy competition.20 Let $y_t$ denote the actual sales at
period $t$, $\hat{y}_t$ the forecast at period $t$, $h$ the forecast horizon, and $n$ the length of the training data; then the RMSSE is
defined as:
$$\text{RMSSE} = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(y_t - \hat{y}_t\right)^2}{\frac{1}{n-1}\sum_{t=2}^{n}\left(y_t - y_{t-1}\right)^2}} \qquad (1)$$
In addition to forecast accuracy, we also keep track of the bias, since bias and inventory performance are shown to be
interconnected (Kourentzes et al., 2021). Similar to the RMSSE, we scale the bias of each forecasting method by
the absolute forecasting error of the one-step ahead naive forecasting method. We denote this the scaled mean error
(SME):
$$\text{SME} = \frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(y_t - \hat{y}_t\right)}{\frac{1}{n-1}\sum_{t=2}^{n}\left|y_t - y_{t-1}\right|} \qquad (2)$$
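For illustration, the two metrics of Equations (1) and (2) can be computed as in the following sketch, where train holds the $n$ in-sample observations and actual and forecast the $h$ test periods.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Root mean squared scaled error, Equation (1)."""
    train, actual, forecast = map(np.asarray, (train, actual, forecast))
    h, n = len(actual), len(train)
    numerator = np.sum((actual - forecast) ** 2) / h
    denominator = np.sum(np.diff(train) ** 2) / (n - 1)
    return np.sqrt(numerator / denominator)

def sme(train, actual, forecast):
    """Scaled mean error, Equation (2)."""
    train, actual, forecast = map(np.asarray, (train, actual, forecast))
    h, n = len(actual), len(train)
    numerator = np.sum(actual - forecast) / h
    denominator = np.sum(np.abs(np.diff(train))) / (n - 1)
    return numerator / denominator
```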
5. Results
This section describes the validation of our decision-tree framework (DTF) and the analysis of why our framework
outperforms the benchmarks. In Section 5.1 we report the forecast accuracy over time, per time series category, and
per product. Section 5.2 compares the DTF to the various tree-based methods of Section 4.3. Section 5.3 describes how
Shapley values are used to give insight into how and when the DTF outperforms the benchmarks. In Section 5.4, we
validate our insights on the M5 dataset. Finally, Section 5.5 briefly discusses the computational requirements of the
different methods.
5.1. Results
Figure 3 summarizes the forecast accuracy and the bias of our DTF, described in Section 4.2, and the benchmarks
over time, averaged over all 4,523 products. We report the RMSSE and the SME over the test period of three months
with a forecast horizon of seven days. The RMSSE reveals that the DTF substantially outperforms the best-performing
statistical benchmark, ESX. The DTF outperforms ESX by 11.48%, on average. Moreover, the forecasting methods
that make use of explanatory variables substantially outperform the methods that only use point-of-sales data. This
shows the value of (investing in the data collection of) explanatory variables. It is noteworthy that Proph.X performs
slightly worse than ESX. This is remarkable given that Proph.X can incorporate more explanatory variables and data
sources than ESX. Finally, we find that our DTF has the smallest bias (SME) in absolute terms.
When we compare the average forecast accuracy and bias per time series category (smooth, erratic, intermittent,
and lumpy) on the test set (we refer to Appendix C for the detailed results), we find that the DTF systematically
outperforms the benchmarks for each time series category in terms of RMSSE. This shows the robustness of our
proposed method regarding different time series categories. Interestingly, the forecasting methods designed specifically
to deal with intermittency (i.e., CRO and ADIDA) do not outperform DTF or ESX.
Lastly, we compare the average forecast accuracy per product for the DTF and the ESX. More specifically, for each
product, we examine whether the DTF or the ESX is more accurate over the three-month test set. For the RMSSE, the
20 In fact, the M5 Accuracy competition used a weighted version of the RMSSE by using the dollar value of each time series (Makridakis
et al., 2022).
Figure 3: Forecast accuracy (RMSSE), bias (SME), and improvement over ESX of our proposed decision-tree framework (DTF) and the benchmarks.
DTF outperforms ESX in 91% of the 4,523 products. We conclude that our DTF provides better daily sales forecasts
for the large majority of the products.
data. With engineered inputs, we find that the pos-fe-DTF slightly outperforms the ES according to the RMSSE.
This suggests that, without explanatory variables, our global LightGBM method performs on par with traditional
forecasting methods that likewise lack them. By incorporating all data sources but without feature engineering,
the forecast accuracy improves and becomes competitive compared to the ESX. This is shown by the nfe-DTF. These
results indicate that the DTF outperforms benchmarks only when we can access and invest in explanatory variables
and engineered inputs. Without these, simple statistical methods provide equal performance.
To identify the most valuable explanatory variables and engineered inputs for retailers, we compare subsets of our
inputs. The pp-DTF (our pos-fe-DTF with price and promo data) shows that the inclusion of engineered price and
promo-related variables mainly drives the outperformance of the DTF in our experiment. Despite not utilizing all
data sources, its forecast accuracy remains similar to our DTF. This suggests that the other data sources, namely the
hierarchy, holidays, weather forecasts, and events, have a limited impact on the average forecast accuracy. To check
this hypothesis, we compare the results of the npp-DTF, which excludes the inputs related to price and promo. While
it has access to all other inputs, it performs much worse than the DTF and barely improves the forecast accuracy of
the pos-fe-DTF. These results indicate that retailers should mainly invest in price and promo-related inputs as the
other explanatory variables are not equally important in terms of RMSSE.
Table 3: Average forecast accuracy and bias of our decision-tree framework (DTF) and variations. The RMSSE and the SME are based
on the rolling forecasts of 4,523 products over a three-month test set with a one-week forecast horizon.
the product hierarchy explains 4%, and weather forecasts, holidays and events account together for only 4%.21 The
right pie chart compares the sum of the Shapley values of the ‘raw’ inputs against the engineered inputs. It shows
that engineered inputs explain 79% of the variance of the forecasts and the ‘raw’ inputs only 21%. This indicates the
importance of the engineered inputs for our DTF. Note that these Shapley values do not indicate causality, nor can
they be used to compute, for example, price elasticities. These values only explain how the black box model results
in a different output when fed different inputs.
Figure 4: Contribution of the different data sources (left pie chart) and feature engineering (right pie chart) according to the Shapley
values of our decision-tree framework. The left figure shows that explanatory variables explain 38% of the variance of the forecasts and
point-of-sales data 62%. The right figure indicates that 79% of the variance is explained by engineered inputs and only 21% by ‘raw’ inputs.
Our results show that explanatory variables and engineered inputs are the largest contributors to the performance
of our DTF. Our analysis also confirms the value of price and promo-related inputs, in line with the previous section.
Thus, to outperform the benchmarks, the DTF requires both.
Our numerical analysis also reveals that the value of explanatory variables and engineered inputs is more limited
for our benchmarks. While the RMSSE of DTF improves by 20% when explanatory variables are provided, Figure 3
shows that the RMSSE improves by only 5.98% when ES uses explanatory variables (i.e., ESX). Compared to Prophet,
Proph.X improves by 7.81% when it includes explanatory variables. This indicates that the DTF benefits more from
additional inputs than the benchmarks and can outperform these traditional forecasting methods. We believe that
the reason for this is twofold. First, our DTF can deal with a high-dimensional input space to explain the unit sales.
Second, the DTF can learn non-linear time series patterns.
High-dimensional input space: Our DTF uses 104 different inputs. Appendix D shows that most inputs have
a non-zero Shapley value. This means that the method uses these inputs to make predictions. This is also confirmed by
training our DTF with subsets of the inputs, which negatively affects the forecast accuracy (see Table 3). In the case
of ESX and Prophet.X, respectively, only 4 and 14 different inputs improve the forecast accuracy. Including additional
inputs has a neutral or even a negative impact on their forecast accuracy (see Section 4.4). The main cause of this
large deviation (104 versus 4 or 14 inputs) is that DTF is a global method. Global methods use multiple time series
to train a single method resulting in a much larger training dataset. Therefore, they can handle more inputs than the
21 Note that these percentage values only indicate the importance of the data source on average. For example, if weather data is only
important for 1% of the products for a short period, it will result in low Shapley values on average. This does not mean that the data
sources cannot be important for certain products or periods.
to train a single method and thus have access to much more training data. Therefore, they can handle more inputs than the
commonly used local forecasting methods without overfitting the data (Montero-Manso and Hyndman, 2021).
Non-linear time series patterns: Our DTF can identify non-linear time series patterns. Figure 5
shows the SHAP dependency plots of four arbitrarily chosen inputs, namely sales lag 3, day of the month,
daily price/price last month large quantity, and promo depth. Each dot represents the sales of one product
for a certain day in the test set and explains the impact of a certain input value on the prediction. Combining these
dots explains the learned relationship between each input and the prediction. The horizontal axis shows the input
value. The vertical axis denotes the Shapley value, which visualizes the positive or negative impact on the prediction.
The vertical (colorful) bar on the right-hand side of each dependency plot shows the input with the strongest inter-
action according to Friedman’s H-statistic (Friedman and Popescu, 2008). Figure 5 shows how the DTF can pick up
non-linear time series patterns. Given the outperformance of our DTF and the fact that most common forecasting
methods cannot learn non-linear patterns, we identify this as another important driver of its success.
Figure 5: SHAP dependency plots of four arbitrarily chosen inputs, namely sales lag 3, day of the month,
daily price/price last month large quantity, and promo depth.
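For readers who want to reproduce such plots, the sketch below shows one way to generate SHAP dependency plots for a trained LightGBM model with the shap package. The argument names and feature names are hypothetical; note also that shap's default colouring selects an interacting input with its own heuristic, whereas the interactions reported above are based on Friedman's H-statistic.

```python
import shap

def plot_dependencies(model, X_test, features=("sales_lag_3", "day_of_the_month", "promo_depth")):
    """SHAP dependency plots for a trained LightGBM regressor (names hypothetical)."""
    explainer = shap.TreeExplainer(model)        # exact Shapley values for tree ensembles
    shap_values = explainer.shap_values(X_test)  # one Shapley value per input per row
    for feature in features:
        # x-axis: input value; y-axis: Shapley value of that input;
        # colouring marks an interacting input (shap's own heuristic).
        shap.dependence_plot(feature, shap_values, X_test)
```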
The combination of learning non-linear time series patterns and coping with a wide variety of inputs explains how
the DTF can outperform the benchmarks. However, our DTF requires access to explanatory variables and engineered
inputs to exploit these capabilities.
Table 4: Average forecast accuracy (WRMSSE) of the M5 winning method, our proposed decision-tree framework (DTF), and M5’s winning
benchmark (ES) on the M5 dataset. The WRMSSE is computed per aggregation level, with level (1) the most aggregated level and level
(12) the most granular product-store level.
(Columns: aggregation level, ES, M5 winning method, DTF, and the improvement of DTF over ES.)
easily aggregate these forecasts to higher levels of aggregation. We assess forecast accuracy using the weighted RMSSE
(WRMSSE), with the errors weighted based on the monetary value associated with each product, see Makridakis et al.
(2022). As before, we train our method three times, each with a different seed value. For the M5 winning method
and ES, in contrast, we obtain the forecasts directly from Makridakis et al. (2022).
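As a reference for the metric, the snippet below sketches how the (W)RMSSE can be computed. The weighting follows the M5 convention of using each series' recent monetary sales share; all function and argument names are hypothetical.

```python
import numpy as np

def rmsse(y_train, y_test, y_hat):
    """Root mean squared scaled error of a single series (cf. Makridakis et al., 2022)."""
    scale = np.mean(np.diff(y_train) ** 2)  # in-sample one-step naive-forecast MSE
    return np.sqrt(np.mean((np.asarray(y_test) - np.asarray(y_hat)) ** 2) / scale)

def wrmsse(series, forecasts, weights):
    """Weighted RMSSE over all series at one aggregation level.

    `series` is a list of (train, test) arrays, `forecasts` the matching
    predictions, and `weights` the normalised monetary shares of each series
    (in M5: cumulative dollar sales over the last 28 in-sample days).
    """
    return sum(w * rmsse(tr, te, f) for (tr, te), f, w in zip(series, forecasts, weights))
```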
Table 4 summarizes the results. On average, our DTF approach demonstrates comparable accuracy to the more
sophisticated (and computationally more expensive) M5 winning method. Recall that our DTF relies on only ten
LightGBM methods, each employing the same hyperparameters and inputs. The M5 winning method, in contrast,
utilizes 220 LightGBM methods with variations in hyperparameters and inputs (Wellens et al., 2021). Once more, we
find that more sophisticated tree-based methods only marginally improve forecast accuracy compared to our DTF.
When comparing the results to ES, we find that the DTF, on average, substantially outperforms ES, especially at
the aggregate levels. The difference between the DTF and ES is minor at the more disaggregated levels. For example,
at the product-store level, i.e., level (12) in the M5 dataset, the DTF outperforms ES by only 2.30%, which is much less
than the 11.48% outperformance on our private dataset or the 54.23% outperformance at the most aggregated level,
level (1). This can be attributed to the lack of price and promo-related inputs at level (12), which we identified
as crucial for the DTF's outperformance. The M5 competition provided explanatory variables, but the price and
promo-related information at level (12) is limited: the available product prices are weekly averages that
remain almost constant for most of the products throughout the training data. Additionally, the dataset only includes
one type of promotion, known as SNAP,22 which is not available to all customers. Although we recognize that some
retailers run more promotions and price reductions than others, the available data appears to capture only part of
Walmart's activities: Walmart runs TV commercials, puts products on display, and offers temporary discounts, among
other strategies that are not fully represented in the dataset.23
22 https://www.kaggle.com/competitions/m5-forecasting-accuracy/data
23 https://edition.cnn.com/2022/02/18/business/walmart-rollbacks-promotions-inflation/index.html
As a result, the available price and promo-related data have limited predictive power. To conclude, we find that our
DTF performs as well as more sophisticated tree-based methods on the M5 dataset. Traditional forecasting methods
remain competitive at the product-store level when only a limited set of relevant explanatory variables is available.
In this section, we perform extensive numerical experimentation to investigate whether the observed forecast
superiority translates into higher service levels, lower inventory costs, and improvements regarding the variability of
orders and inventory (i.e., the bullwhip effect). To evaluate the impact of the forecasts of our DTF on the inventory
control, we consider a discrete-time, periodic-review, single-echelon, automatic pipeline, variable inventory and
order-based production control system (APVIOBPCS) with backlogging (Udenio et al., 2017). APVIOBPCS is an
implementation of a generalized order-up-to policy that has been extensively used in the literature to investigate
different aspects of the bullwhip effect. Relevant to our work, Dejonckheere et al. (2003) use such a system to quantify
the influence of several forecasting methods on the order and inventory variability, Li et al. (2014) show that dampened
trend forecasting can help reduce the bullwhip effect in order-up-to systems, and Udenio et al. (2022) compare the
performance of simple and seasonal exponential smoothing forecasts under (seasonal) deterministic, stochastic, and
empirical demands, the latter using data from the M5 forecasting competition. We refer the reader to Wang and
Disney (2016) for comprehensive reviews of the bullwhip effect in general and of the application of the
APVIOBPCS family of policies in particular.
The orders that will be placed in the following period ($o_{t+1}$) are generated according to an anchor-and-adjustment-type
procedure,
$$o_{t+1} = \gamma_I \left( \hat{i}_t - i_t \right) + \gamma_P \left( \hat{p}_t - p_t \right) + f_{t,t+L+1},$$
where $\hat{i}_t$ and $\hat{p}_t$ denote the target inventory and pipeline levels, and $f_{t,t+L+1}$ is the demand forecast made at time $t$ for period $t+L+1$.
The balance equations for inventory ($i$) and pipeline ($p$) are $i_t = i_{t-1} + o_{t-L} - d_t$ and $p_t = p_{t-1} + o_t - o_{t-L}$. Orders
and inventories can be negative (i.e., backlogs and returns are allowed) to maintain the linearity of the model.
Here, $n$ is the number of periods in a given time series and $n^*$ the number of periods without a backlog. We compute the
achieved service level as the fraction of periods with positive inventory for a given time series, i.e., $n^*/n$. The $BW_O$
is computed as $\sigma_O^2/\sigma_D^2$, with $\sigma_O^2$ the order variance and $\sigma_D^2$ the demand variance. The $BW_I$ is similarly computed as
$\sigma_I^2/\sigma_D^2$, with $\sigma_I^2$ the variance of the inventory time series.
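To make the simulation logic concrete, the sketch below implements a minimal APVIOBPCS loop with backlogging and computes the achieved service level, the bullwhip of orders, and the bullwhip of inventory. The target levels (a fixed safety stock for inventory and $L$ times the forecast for the pipeline) and the parameter names are simple illustrative choices, not necessarily the exact settings used in our experiments.

```python
import numpy as np

def simulate_apviobpcs(d, f, L=2, gamma_i=1.0, gamma_p=1.0, ss=0.0):
    """Minimal APVIOBPCS sketch: d[t] is realised demand, f[t] the forecast made
    at time t for period t+L+1; ss is an illustrative target inventory level."""
    n = len(d)
    i = np.zeros(n)  # net inventory (negative = backlog)
    p = np.zeros(n)  # pipeline (on-order) inventory
    o = np.zeros(n)  # orders
    for t in range(1, n - 1):
        arrival = o[t - L] if t >= L else 0.0
        i[t] = i[t - 1] + arrival - d[t]       # inventory balance
        p[t] = p[t - 1] + o[t] - arrival       # pipeline balance
        # Anchor-and-adjustment order rule: forecast plus feedback corrections.
        o[t + 1] = gamma_i * (ss - i[t]) + gamma_p * (L * f[t] - p[t]) + f[t]
    service = np.mean(i[1:] > 0)               # n*/n: fraction of periods without backlog
    bwo = np.var(o) / np.var(d)                # bullwhip of orders
    bwi = np.var(i) / np.var(d)                # bullwhip of inventory
    return service, bwo, bwi

# Example call with synthetic demand and a flat forecast (illustrative only):
# simulate_apviobpcs(np.random.poisson(5, 200).astype(float), np.full(200, 5.0), ss=10.0)
```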
6.4. Results
Table 5 summarizes the inventory results. First, we find that the forecasting methods that are more accurate, in
terms of RMSSE, indeed achieve higher service levels with lower average inventory levels. More specifically, the DTF
achieves a slightly higher service level than ESX, with 12.47% less inventory on average. We also note that, under
these settings, ESX fails to meet the 95% target service level for 684 products, while this is only true for 393 products
in the case of our DTF. This implies that our DTF is capable of either higher product availability at the same coverage
level, or lower inventory requirements at the same service level.
Second, we discuss the results of the BWI and the BWO. The best-performing forecasting methods regarding
the BWI and the BWO are the ADIDA method and the variants of the DTF.24 While adding explanatory variables
improves the forecast accuracy (and lowers the mean inventory and increases the service level), it does worsen the
BWI and the BWO . This holds true for our DTF, ES, and Prophet. One potential explanation for this observation is
that methods with explanatory variables are quicker to adjust forecasts down/up without having to wait for changes
in the actual sales time series (e.g., by anticipating the end of a promotion). See Wellens et al. (2023) for a detailed
discussion.
Table 5: Impact of different forecasting methods on the inventory performance. This table shows the different forecasting methods, the
RMSSE, the SME, the mean inventory, the mean achieved service level, the number of products that have a service level below 95%, and
the mean bullwhip of orders (BWO ) and inventory (BWI ).
The above results indicate that our DTF is competitive across all inventory metrics. Moreover, our results show
that adding explanatory variables tends to improve the performance of the system as measured by the mean inventory
and achieved service level, but worsens the performance as measured by the bullwhip metrics, suggesting a trade-off
between optimizing for availability vs. variability. We conclude that retailers need to select the forecasting method
that best fits the objective of their operations.
7. Conclusion
In this paper, we validate various tree-based methods with different levels of complexity to support retailers
on whether and how to invest in tree-based machine learning (ML) forecasting. We show how a straightforward
implementation of a tree-based method outperforms traditional forecasting methods by 11.48% on average, while
being computationally efficient. More sophisticated versions of the tree-based method only marginally improve the
forecast accuracy, with improvements of up to 2.0%. This shows that the ‘raw’ performance of tree-based methods does
not come from minor tweaks such as smarter feature engineering or smarter ways of pooling data. By analyzing various
implementations of our decision-tree framework (DTF) and validating our results on the M5 dataset, we find that the
24 Note that MA and CRO slightly outperform our DTF regarding BWO , but have a much higher BWI .
superior performance of DTF depends on the availability of explanatory variables and feature engineering. We believe
this is an important result because it allows managers to quickly assess whether the implementation of a tree-based
method has immediate potential and gives guidance regarding the allocation of resources for (better) data gathering.
Extensive numerical experimentation finally shows how the forecast superiority of our proposed framework translates to
higher service levels (+0.73%), lower inventory costs (-12.47%), and improvements in the bullwhip of orders (-30.32%)
and inventory (-36.64%) compared to the most accurate benchmark. With its strong performance and its scalability
to practical forecasting settings, we hope our framework further increases the adoption of machine learning among
‘traditional’ retailers.
We believe there is value in extending the DTF to, e.g., probabilistic forecasting, long-term forecasting, hierarchical
forecasting, or performing a similar analysis for ML methods based on neural networks. For instance, an interesting way to
make probabilistic forecasts with our DTF is to use a pinball loss function, enabling the prediction of different
quantiles by changing only a few lines of code. Regarding the forecast horizon, it is important to note that our study
primarily addresses short-term forecasting. It would be worthwhile to investigate the applicability and performance
of the DTF in long-term forecasting scenarios such as the yearly planning of weekly truck capacity. In this case, more
attention should be given to the recursive version of the DTF and its lagged values. Another potential extension of
the DTF relates to temporal and hierarchical aggregation, which has proven to be effective in various other forecasting
methods (Nikolopoulos et al., 2011; Athanasopoulos et al., 2017). Finally, this paper focused on tree-based methods.
It may be interesting to explore a similar analysis for neural networks to identify their key ingredients to outperform
traditional forecasting methods. However, each of these would be full research projects on its own.
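As an illustration of the probabilistic extension mentioned above, the sketch below swaps LightGBM's point-forecast objective for its built-in quantile (pinball) loss; the function and data names are hypothetical placeholders for the design matrices already used by the point-forecast DTF.

```python
import lightgbm as lgb

def fit_quantile_models(X_train, y_train, X_test, quantiles=(0.05, 0.50, 0.95)):
    """Fit one LightGBM model per quantile using the built-in pinball (quantile) loss."""
    forecasts = {}
    for q in quantiles:
        # Only the objective and the target quantile change versus the point forecast.
        model = lgb.LGBMRegressor(objective="quantile", alpha=q, n_estimators=500)
        model.fit(X_train, y_train)
        forecasts[q] = model.predict(X_test)
    return forecasts
```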
We are aware that our study also comes with some limitations. The ESX and Prophet.X benchmarks are implemented
using a feature selection procedure, while the DTF is not. Since global ML methods differ significantly from
local statistical methods, adopting an identical feature selection procedure for both the tree-based methods and the
benchmarks was not feasible. Another remark is that the feature selection procedure of ESX and Prophet.X could
be further improved by selecting different inputs for each time series individually, although this would increase the model
complexity of the benchmarks. Another limitation comes from the size of our dataset, as it is limited to products
of only one store. Our analysis demonstrates that DTF-l-1 and DTF-l-2, which are trained on subsets of the data,
perform worse than our DTF. Therefore, it would be interesting to investigate how incorporating a larger pool of time
series from different stores would impact the forecast accuracy and computational cost of the DTF. Prior research
using M5 data showed that pooling time series of multiple stores, instead of pooling per store, reduces computational
costs while attaining similar levels of forecast accuracy (Wellens et al., 2021). This raises the question whether pooling
at the highest hierarchical level is a good rule of thumb. Lastly, leveraging weather forecasts spanning a one-week
forecast horizon, rather than the current two-week span, could amplify their effect on our sales forecasts. However,
such data was unavailable to the retailer at that time.
Declaration of interest
Arnoud Wellens is funded by Flanders Innovation & Entrepreneurship (VLAIO), grant number HBC.2020.2215.
Maximiliano Udenio and Robert N. Boute declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
References
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization
framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages
2623–2631.
Assimakopoulos, V. and Nikolopoulos, K. (2000). The theta model: A decomposition approach to forecasting. International
Journal of Forecasting, 16(4):521–530.
Athanasopoulos, G., Hyndman, R. J., Kourentzes, N., and Petropoulos, F. (2017). Forecasting with temporal hierarchies.
European Journal of Operational Research, 262(1):60–74.
Bandara, K., Bergmeir, C., and Smyl, S. (2020). Forecasting across time series databases using recurrent neural networks on
groups of similar series: A clustering approach. Expert Systems with Applications, 140:112896.
Benidis, K., Rangapuram, S. S., Flunkert, V., Wang, B., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-Schneider, M., Salinas,
D., Stella, L., et al. (2020). Neural forecasting: Introduction and literature overview. arXiv preprint arXiv:2004.10240.
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in neural
information processing systems, 24.
Bojer, C. S. and Meldgaard, J. P. (2021). Kaggle forecasting competitions: An overlooked learning opportunity. International
Journal of Forecasting, 37(2):587–603.
Crone, S. F., Hibon, M., and Nikolopoulos, K. (2011). Advances in forecasting with neural networks? Empirical evidence from
the NN3 competition on time series prediction. International Journal of Forecasting, 27(3):635–660.
Croston, J. D. (1972). Forecasting and stock control for intermittent demands. Journal of the Operational Research Society,
23(3):289–303.
Dejonckheere, J., Disney, S. M., Lambrecht, M. R., and Towill, D. R. (2003). Measuring and avoiding the bullwhip effect: A
control theoretic approach. European Journal of Operational Research, 147(3):567–590.
Dietvorst, B. J., Simmons, J. P., and Massey, C. (2015). Algorithm aversion: people erroneously avoid algorithms after seeing
them err. Journal of Experimental Psychology: General, 144(1):114.
Fildes, R., Goodwin, P., Lawrence, M., and Nikolopoulos, K. (2009). Effective forecasting and judgmental adjustments: An
empirical evaluation and strategies for improvement in supply-chain planning. International Journal of Forecasting, 25(1):3–
23.
Fildes, R., Goodwin, P., and Önkal, D. (2019a). Use and misuse of information in supply chain forecasting of promotion effects.
International Journal of Forecasting, 35(1):144–156.
Fildes, R., Ma, S., and Kolassa, S. (2019b). Retail forecasting: Research and practice. International Journal of Forecasting.
Fildes, R. and Petropoulos, F. (2015). Improving forecast quality in practice. Foresight: The International Journal of Applied
Forecasting, 36:5–12.
Fisher, M. and Raman, A. (2018). Using data and big data in retailing. Production and Operations Management, 27(9):1665–
1669.
Friedman, J. H. and Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, pages
916–954.
Gilliland, M. (2020). The value added by machine learning approaches in forecasting. International Journal of Forecasting,
36(1):161–166.
Guo, C. and Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737.
Hewamalage, H., Bergmeir, C., and Bandara, K. (2021). Recurrent neural networks for time series forecasting: Current status
and future directions. International Journal of Forecasting, 37(1):388–427.
Hoberg, K. and Thonemann, U. W. (2015). Analyzing variability, cost, and responsiveness of base-stock inventory policies with
linear control theory. IIE Transactions, 47(8):865–879.
Huber, J. and Stuckenschmidt, H. (2020). Daily retail demand forecasting using machine learning with emphasis on calendric
special days. International Journal of Forecasting, 36(4):1420–1438.
Hyndman, R. J. (2020). A brief history of forecasting competitions. International Journal of Forecasting, 36(1):7–14.
Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting,
22(4):679–688.
Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., and Callot, L. (2020). Criteria for
classifying forecasting methods. International Journal of Forecasting, 36(1):167–177.
Januschowski, T., Wang, Y., Torkkola, K., Erkkilä, T., Hasson, H., and Gasthaus, J. (2021). Forecasting with trees. International
Journal of Forecasting.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). Lightgbm: A highly efficient gradient
boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154.
Kourentzes, N., Svetunkov, I., and Trapero, J. R. (2021). Connecting forecasting and inventory performance: a complex task.
Available at SSRN 3878176.
Li, Q., Disney, S. M., and Gaalman, G. (2014). Avoiding the bullwhip effect using damped trend forecasting and the order-up-to
replenishment policy. International Journal of Production Economics, 149:3–16.
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st
international conference on neural information processing systems, pages 4768–4777.
Ma, S. and Fildes, R. (2021). Retail sales forecasting with meta-learning. European Journal of Operational Research, 288(1):111–
128.
Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., Newton, J., Parzen, E., and Winkler, R.
(1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting,
1(2):111–153.
Makridakis, S., Chatfield, C., Hibon, M., Lawrence, M., Mills, T., Ord, K., and Simmons, L. F. (1993). The M2-competition:
A real-time judgmentally based forecasting study. International Journal of Forecasting, 9(1):5–22.
Makridakis, S. and Hibon, M. (2000). The M3-competition: Results, conclusions and implications. International Journal of
Forecasting, 16(4):451–476.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2018). Statistical and machine learning forecasting methods: Concerns
and ways forward. PLoS ONE, 13(3):e0194889.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). The M4 competition: 100,000 time series and 61 forecasting
methods. International Journal of Forecasting, 36(1):54–74.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). M5 accuracy competition: Results, findings, and conclusions.
International Journal of Forecasting.
May, R., Dandy, G., and Maier, H. (2011). Review of input variable selection methods for artificial neural networks. Artificial
neural networks-methodological advances and biomedical applications, 10:16004.
McCarthy, T. M., Davis, D. F., Golicic, S. L., and Mentzer, J. T. (2006). The evolution of sales forecasting management: A
20-year longitudinal study of forecasting practices. Journal of Forecasting, 25(5):303–324.
Montero-Manso, P. and Hyndman, R. J. (2021). Principles and algorithms for forecasting groups of time series: Locality and
globality. International Journal of Forecasting, 37(4):1632–1653.
Morlidge, S. (2014). Forecast quality in the supply chain. Foresight: The International Journal of Applied Forecasting, (33).
Mukherjee, S., Shankar, D., Ghosh, A., Tathawadekar, N., Kompalli, P., Sarawagi, S., and Chaudhury, K. (2018). ARMDN:
Associative and recurrent mixture density networks for eRetail demand forecasting. arXiv preprint arXiv:1803.03800.
Nikolopoulos, K. and Petropoulos, F. (2018). Forecasting for big data: Does suboptimality matter? Computers & Operations
Research, 98:322–329.
Nikolopoulos, K., Syntetos, A. A., Boylan, J. E., Petropoulos, F., and Assimakopoulos, V. (2011). An aggregate–disaggregate
intermittent demand approach (ADIDA) to forecasting: An empirical proposition and analysis. Journal of the Operational
Research Society, 62(3):544–554.
Petropoulos, F., Grushka-Cockayne, Y., Siemsen, E., and Spiliotis, E. (2021). Wielding Occam’s razor: Fast and frugal retail
forecasting. arXiv preprint arXiv:2102.13209.
Petropoulos, F., Nikolopoulos, K., Spithourakis, G. P., and Assimakopoulos, V. (2013). Empirical heuristics for improving
intermittent demand forecasting. Industrial Management & Data Systems.
Rinderknecht, M. D. and Klopfenstein, Y. (2021). Predicting critical state after COVID-19 diagnosis: Model development using
a large US electronic health record dataset. NPJ Digital Medicine, 4(1):113.
Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive
recurrent networks. International Journal of Forecasting, 36(3):1181–1191.
Seaman, B. (2018). Considerations of a retail forecasting practitioner. International Journal of Forecasting, 34(4):822–829.
Shapley, L. S. (1953). A value for n-person games. In Contributions to the Theory of Games, volume 2, pages 307–317.
Shwartz-Ziv, R. and Armon, A. (2021). Tabular data: Deep learning is not all you need. arXiv preprint arXiv:2106.03253.
Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Interna-
tional Journal of Forecasting, 36(1):75–85.
Spiliotis, E., Makridakis, S., Semenoglou, A.-A., and Assimakopoulos, V. (2020). Comparison of statistical and machine learning
methods for daily SKU demand forecasting. Operational Research - An International Journal, pages 1–25.
Syntetos, A. A. and Boylan, J. E. (2005). The accuracy of intermittent demand estimates. International Journal of Forecasting,
21(2):303–314.
Syntetos, A. A., Boylan, J. E., and Croston, J. (2005). On the categorization of demand patterns. Journal of the Operational
Research Society, 56(5):495–503.
Taylor, S. J. and Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1):37–45.
Udenio, M., Vatamidou, E., and Fransoo, J. C. (2022). Exponential smoothing forecasts: Taming the bullwhip effect when
demand is seasonal. International Journal of Production Research, pages 1–18.
Udenio, M., Vatamidou, E., Fransoo, J. C., and Dellaert, N. (2017). Behavioral causes of the bullwhip effect: An analysis using
linear control theory. IISE Transactions, 49(10):980–1000.
Wang, X. and Disney, S. M. (2016). The bullwhip effect: Progress, trends and directions. European Journal of Operational
Research, 250(3):691–701.
Wellens, A. P., Boute, R. N., and Udenio, M. (2023). Increased bullwhip in retail: A side effect of improving forecast accuracy
with more data? Available at SSRN 4320911.
Wellens, A. P., Udenio, M., and Boute, R. N. (2021). Transfer learning for hierarchical forecasting: Reducing computational
efforts of M5 winning methods. International Journal of Forecasting.
Weller, M. and Crone, S. F. (2012). Supply chain forecasting: Best practices & benchmarking study.
Yelland, P., Baz, Z. E., and Serafini, D. (2019). Forecasting at scale: The architecture of a modern retail forecasting system.
Foresight: The International Journal of Applied Forecasting, 4(55).
Appendix A. Engineered inputs
Table A.6 gives an overview of the engineered inputs. Below we briefly explain these.
Sales: Regarding the first data source, namely sales, we compute the mean and the standard deviation (stdv)
of the unit sales of the last 7, 14, 30, 60, and 180 days. The determination of these windows is based on
the intuition that the daily sales of retailers typically exhibit weekly and monthly seasonality. These are de-
noted as rolling mean 7, rolling stdv 7, rolling mean 14, rolling stdv 14, rolling mean 30, rolling stdv 30,
rolling mean 60, rolling stdv 60, rolling mean 180, and rolling stdv 180. Before computing these inputs, we
lag the unit sales according to our forecast horizon. This is necessary when forecasting non-recursively for longer
forecast horizons than the step-one forecast. Otherwise, these inputs would use information which is not yet available,
which is called leakage. For example, the rolling mean 7 uses the sales data of the last seven days. If we compute
this for the step 2 forecast (forecasting the sales in two days), we require the unit sales of the next day (step 1),
which is not known when making the forecast. As we do not forecast recursively, we therefore need to lag the unit
sales by at least one period. When forecasting for longer horizons (as in our case), we need to lag the unit sales
even more. We also compute the average daily sales per product (enc product ID mean) and the stdv per product
(enc product ID stdv). These inputs are recomputed per period, using all the available unit sales data up to that
period in time.
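The sketch below illustrates these leakage-safe rolling inputs with pandas; the frame layout and column names are hypothetical, and the shift equals the forecast horizon as described above.

```python
import pandas as pd

def add_rolling_inputs(df: pd.DataFrame, horizon: int = 7) -> pd.DataFrame:
    """Add leakage-safe rolling inputs to a long-format frame with hypothetical
    columns 'product_id', 'date', and 'unit_sales'."""
    df = df.sort_values(["product_id", "date"]).copy()
    g = df.groupby("product_id")["unit_sales"]
    # Shift by the forecast horizon first, so that e.g. rolling_mean_7 for the
    # step-`horizon` forecast only uses sales observed when the forecast is made.
    df["rolling_mean_7"] = g.transform(lambda s: s.shift(horizon).rolling(7).mean())
    df["rolling_stdv_7"] = g.transform(lambda s: s.shift(horizon).rolling(7).std())
    # Expanding product-level encoding, recomputed each period with data up to that point.
    df["enc_product_ID_mean"] = g.transform(lambda s: s.shift(horizon).expanding().mean())
    return df
```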
Date: For the second data source, date, we create inputs that indicate the day of the week (day of the week),
whether it is the end of the week (end of the week), the week of the month (week of the month), and the quarter
(quarter).
Hierarchy: In terms of the product hierarchy, we compute the average unit sales and stdv per hierar-
chical group. This gives the following inputs: enc productgroup l 1 mean, enc productgroup l 1 stdv,
enc productgroup l 2 mean, enc productgroup l 2 stdv, enc productgroup l 3 mean, enc productgroup
l 3 stdv, enc productgroup l 4 mean, enc productgroup l 4 stdv, enc productgroup l 5 mean, and
enc productgroup l 5 stdv. These inputs are recomputed per period, using all the available unit sales data
up to that period in time.
Holidays: Regarding the holidays, we create lags and future values of the national public holidays for up to four
days before and after the holiday. We denote these inputs as lags future indication public holiday.
Weather forecasts: In terms of weather forecasts, we compute the average temperature of last month
(average temperature last month).
Price: For the different daily prices, we compute the maximum value (max price, max price
large quantity, and max price large quantity pickup), the minimum value (min price, min price
large quantity, and min price large quantity pickup), the mean (price mean, price mean large quantity,
and price mean large quantity pickup) and the stdv (price stdv, price stdv large quantity, and
price stdv large quantity pickup) of the daily price. These inputs are recomputed per period, using
all the available unit sales data up to that period in time. In addition, we compute the national price
of a product (by taking the average price over multiple stores), and the average national price of last
month. These are denoted as national price, average national price last month, and average national
price last month pickup. We compute the price differences between the daily price with and with-
out a volume discount (price large quantity abs diff and price large quantity abs diff pickup).
These are denoted in absolute value. By taking a simple division, we compare the daily price with vol-
ume discount to the maximum price (daily price/max price, daily price/max price large quantity,
and daily price/max price large quantity pickup), the average price (daily price/price mean,
daily price/price mean large quantity, daily price/price mean large quantity pickup), and to the average
price of the previous month (daily price/price last month, daily price/price last month large quantity,
and daily price/price last month large quantity pickup) and week (daily price/price last week,
daily price/price last week large quantity, and daily price/price last week large quantity pickup).
Note that we use the different prices, namely daily price, daily price large quantity, daily price pickup,
and daily price large quantity pickup, purely as an indication for the DTF. For instance, we do not disclose the
quantity of products sold (either in absolute or relative terms) with their corresponding prices.
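A minimal pandas sketch of a few of these price-based inputs is given below; the frame layout and column names are hypothetical, and only a subset of the ratios listed above is shown.

```python
import pandas as pd

def add_price_inputs(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few price-based inputs to a long-format frame with hypothetical
    columns 'product_id', 'date', and 'daily_price'."""
    df = df.sort_values(["product_id", "date"]).copy()
    g = df.groupby("product_id")["daily_price"]
    # Running price statistics, computed with data up to each period only.
    df["max_price"] = g.transform(lambda s: s.expanding().max())
    df["price_mean"] = g.transform(lambda s: s.expanding().mean())
    df["price_last_month"] = g.transform(lambda s: s.shift(1).rolling(30).mean())
    # Relative prices that act as discount/promotion signals.
    df["daily_price/max_price"] = df["daily_price"] / df["max_price"]
    df["daily_price/price_last_month"] = df["daily_price"] / df["price_last_month"]
    return df
```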
Promo: Regarding our promotional data, we create new inputs indicating the starting day of a promo
(first day promo dummy), the first week of a promo (first week promo dummy), the second week of a promo
(second week promo dummy) or none of the above (rest promo dummy). Finally, we create lags and future values
of the promos for up to seven days in advance and after the promo (lags future indication promo dummy).
Note that the feature engineering does not require advanced Python skills or sophisticated domain knowledge. The ‘raw’
inputs combined with the engineered inputs give us a total of 104 inputs per time series.
Table A.6: Overview of engineered inputs that are used by our proposed method.
Appendix B. Hyperparameters
Table B.7 gives an overview of the hyperparameters of our decision-tree framework. The table shows the value of
each hyperparameter and the bounds of the search space. The hyperparameters are optimized using Optuna.
Table B.7: Overview of the hyperparameters of our decision-tree framework. The hyperparameters are optimized using Optuna.
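As an illustration of how such a search can be set up, the sketch below tunes a handful of LightGBM hyperparameters with Optuna on a hold-out split. The parameter names are genuine LightGBM options, but the search ranges, the validation metric (RMSE), and the data split are illustrative rather than those of Table B.7.

```python
import numpy as np
import lightgbm as lgb
import optuna
from sklearn.metrics import mean_squared_error

def tune(X_tr, y_tr, X_val, y_val, n_trials=50):
    """Optuna search over a few LightGBM hyperparameters (illustrative ranges)."""
    def objective(trial):
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 31, 255),
            "min_child_samples": trial.suggest_int("min_child_samples", 10, 200),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        }
        model = lgb.LGBMRegressor(n_estimators=500, **params)
        model.fit(X_tr, y_tr)
        # Validation RMSE as an illustrative objective to minimize.
        return float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```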
Appendix C. Forecast metrics per time series category
Table C.8 compares the average forecast accuracy and bias per time series category (smooth, erratic, intermittent,
and lumpy) on the three-month test set in greater detail. Regarding the RMSSE, the DTF outperforms ESX for
smooth, lumpy, intermittent, and erratic time series by 9.57%, 15.64%, 12.64%, and 11.52%, respectively. Interestingly,
the forecasting methods designed to deal with intermittency (i.e., CRO and ADIDA) do not outperform the DTF or
ESX. In addition, according to the RMSSE, the most difficult time series categories to predict are the
time series with little intermittency. This means that for all methods, the RMSSE is the highest for the smooth and
the erratic time series.
The results of the SME indicate that sNaive and our DTF have the lowest bias in absolute terms.
Table C.8: Forecast accuracy per time series category of our proposed decision-tree framework (DTF) and the benchmarks, averaged over
all 4,523 products. The RMSSE and the SME are based on the daily rolling forecasts of 4,523 products over a three-month test set with a
one-week forecast horizon.
Appendix D. Shapley values
Table D.9 gives an overview of the average absolute Shapley value per input.
Table D.9: Overview of the average absolute Shapley value per input.
daily price/max price 0.005298
end of the week 0.004714
rolling mean 180 0.004605
lags future indication promo dummy 0.004583
sales lag 10 0.004556
daily price/price mean 0.003994
rolling stdv 180 0.003986
enc productgroup l 5 mean 0.003953
sales lag 11 0.003902
price stdv 0.003702
productgroup l 3 0.003612
enc productgroup l 5 stdv 0.003095
daily price/max price large quantity 0.002864
rolling stdv 60 0.002659
sales lag 6 0.002575
sales lag 13 0.002423
month 0.002413
week of the month 0.002333
quarter 0.002278
rolling stdv 14 0.002146
sales lag 9 0.001912
sales lag 14 0.001548
rolling stdv 7 0.001483
second week promo dummy 0.001445
national price 0.001443
sales lag 4 0.001412
sales lag 5 0.001374
daily price/price mean large quantity pickup 0.001317
sales lag 12 0.001314
school holiday dummy 0.00126
first day promo dummy 0.00118
price stdv large quantity pickup 0.001098
enc productgroup l 1 mean 0.001025
average national price last month pickup 0.000975
enc productgroup l 4 stdv 0.000951
events 2 0.000833
enc productgroup l 3 mean 0.00073
enc productgroup l 4 mean 0.000567
school holiday 0.000535
enc productgroup l 2 stdv 0.000512
enc productgroup l 3 stdv 0.000452
price mean 0.000424
min price 0.000402
enc productgroup l 1 stdv 0.000351
enc productgroup l 2 mean 0.000341
min price large quantity pickup 0.000325
daily price pickup 0.000319
price mean large quantity pickup 0.000278
min price large quantity 0.000236
max price large quantity pickup 0.000198
max price 0.000181
daily price large quantity pickup 0.000171
price mean large quantity 0.000158
productgroup l 5 0.000141
max price large quantity 0.0001
public holiday 0.000077
productgroup l 4 0.000073
rest promo dummy 0.000039
price large quantity abs diff pickup 0
events 4 0
public holiday dummy 0
events 3 0