Data Science For Supply Chain Forecasting 2nd Edition-Extract2
Thinking must never submit itself, neither to a dogma, nor to a party, nor to a passion,
nor to an interest, nor to a preconceived idea, nor to whatever it may be, if not to facts
themselves, because, for it, to submit would be to cease to be.
Henri Poincaré
Acknowledgments
Second Edition
In Raphael’s fresco The School of Athens, we can see philosophers—practitioners and
theorists alike—debating and discussing science. I like to use the same approach when
working on projects: discussing ideas, insights, and models with other supply chain
data scientists worldwide. It is always a pleasure for me.
For the second edition of Data Science for Supply Chain Forecasting, I surrounded
myself with a varied team of supply chain practitioners who helped me to review chap-
ter after chapter, model after model. I would like to thank each of them for their time,
dedication, and support.
I would like to thank my dear friend Gwendoline Dandoy for her work on this book.
She helped me to make every single chapter as clear and as simple as possible. She
worked tirelessly on this book, as she did on my previous book Inventory Optimization:
Models and Simulations. Along with her help in reviewing this book, I could always
count on her support, iced tea, and kindness. Thank you, Gwendoline.
Thanks to Mike Arbuzov for his help with the advanced machine learning models.
It is always a pleasure to exchange ideas with such a data science expert.
I had the chance this year to be surrounded by two inspiring, brilliant en-
trepreneurs: João Paulo Oliveira, co-founder of BiLD analytic (a data science consul-
tancy company) and Edouard Thieuleux, founder of AbcSupplyChain (do not hesitate
to visit his website abcsupplychain.com for supply chain training material and coach-
ing). I would like to thank them both for their help reviewing the book. It is always a
pleasure to discuss models and entrepreneurship with them.
I would like to thank Michael Gilliland (author of The Business Forecasting Deal—
the book that popularized the forecast value added framework) for his help and nu-
merous points of advice on Part III of the book.
I was also helped by a group of talented supply chain practitioners, both con-
sultants and in-house experts. I would like to thank Léo Ducrot for his help on
the third part of the book. Léo is an expert in supply chain planning and inven-
tory optimization—discussing supply chain best practices with him is a rewarding
pleasure. I would also like to thank Karl-Eric Devaux, who brought his amazing consulting
experience as a demand planner to the review of the first chapters of the book. I am
blessed by the help and advice that Karl-Eric has shared with me over the last few years.
Thanks, also, to Steven Pauly (who earlier helped me with my inventory
book) for his advice and ideas for the third part of the book.
I could also count on the incredible online community of supply chain and data
science practitioners. Many thanks to Evangelos Spiliotis (PhD and research fellow
at the National Technical University of Athens), who reviewed the statistical models
in detail. His strong academic background and wide experience with statistical mod-
els were of great help. Thank you to Suraj Vissa (who also kindly helped me with my
previous book) for his review of multiple chapters. And Vincent Isoz, for his careful
review of the neural network chapter. You can read his book about applied mathemat-
ics, Opera Magistris, here: www.sciences.ch/Opera_Magistris_v4_en.pdf. Thank you
to Eric Wilson for his review of Part III (do not hesitate to check his demand plan-
ning YouTube channel: www.youtube.com/user/DemandPlanning). Thanks, also, to
Mohammed Hichame Benbitour for his review of Part I.
As for the first edition of this book—and my second book!—I could count on the
help of an all-star team of friends: François Grisay and his experience with supply
chains, as well as Romain Faurès, Nathan Furnal, and Charles Hoffreumon with their
experience with data science.
I would also like to mention a few friends for their support and help. Thank you
to my friend Emmeline Everaert for the various illustrations she drew for this second
edition. Thanks, also, to my sister Caroline Vandeput, who drew the timelines you
can see through the book. Thank you to My-Xuan Huynh, Laura Garcia Marian, and
Sébastien Van Campenhoudt for their reviews of the first chapters.
Finally, thank you to my former master’s students Alan Hermans, Lynda
Dhaeyer, and Vincent Van Loo for their reviews.
Nicolas Vandeput
September 2020
[email protected]
First Edition
Discussing problems, models, and potential solutions has always been one of my fa-
vorite ways to find new ideas—and test them. As with any other big project, when I
started to write Data Science for Supply Chain Forecasting, I knew discussions with var-
ious people would be needed to receive feedback. Thankfully, I have always been able
to count on many friends, mentors, and experts to share and exchange these thoughts.
First and foremost, I want to express my thanks to Professor Alassane Ndiaye,
who has been a true source of inspiration for me ever since we met in 2011. Not only
does Alassane have the ability to maintain the big picture and stay on course in any
situation—especially when it comes to supply chain—but he also has a sense of lead-
ership that encourages each and everyone to shine and come face to face with their
true potential. Thank you for your trust, your advice, and for inspiring me, Alassane.
Furthermore, I would like to thank Henri-Xavier Benoist and Jon San Andres from
Bridgestone for their support, confidence, and the many opportunities they have given
me. Together, we have achieved many fruitful endeavors, knowing that many more are
to come in the future.
Of course, I also need to mention Lokad’s team for their support, vision, and their
incredible ability to create cutting-edge models. Special thanks to Johannes Vermorel (CEO
and founder) for his support and inspiration—he is a real visionary for quantitative
supply chain models. I would also like to thank the all-star team, Simon Schalit,
Alexandre Magny, and Rafael de Rezende for the incredible inventory model we have
created for Bridgestone.
There are few passionate professionals in the field of supply chains who can deal
with both the business reality and the advanced quantitative models. Professor Bram De
Smet is one of those. He has inspired me, as well as many other supply chain profes-
sionals around the globe. In February 2018, when we finally got the chance to meet in
person, I shared my idea of writing a book about supply chain and data science. He
simply said, “Just go for it and enjoy it to the fullest.” Thank you, Bram, for believing
in me and pushing me to take that first step.
Just as a forest is stronger than a single tree, I like to surround myself
with supportive and bright friends. I especially would like to thank each and every
one of the following amazing people for their feedback and support: Gil Vander Mar-
cken, Charles Hoffremont, Bruno Deremince, and Emmeline Everaert, Romain Faurès,
Alexis Nsamzinshuti, François Grisay, Fabio Periera, Nicolas Pary, Flore Dargent, and
Gilles Belleflamme. And of course, a special thanks goes to Camille Pichot. They have
all helped me to make this book more comprehensive and more complete. I have always
appreciated feedback from others to improve my work, and I would never have been
able to write this book without the help of this fine team of supportive friends.
On another note, I would also like to mention Daniel Stanton for the time he took
to share his experience with business book publishing with me.
Last but not least, I would like to truly thank Jonathan Vardakis. Without his ded-
icated reviews and corrections, this book would simply not have come to its full com-
pletion. Throughout this collaboration, I have realized that we are a perfect fit together
to write a book. Many thanks to you, Jon.
Nicolas Vandeput
November 2017
[email protected]
About the Author
Foreword – First Edition
Tomorrow’s supply chain is expected to deliver improved benefits for all stakeholders,
across networks that are much more complex and interconnected than today’s.
Today, the practice of supply chain science is striving for excellence: innovative
and integrated solutions are based on new ideas, new perspectives and new collabo-
rations, thus enhancing the power offered by data science.
This opens up tremendous opportunities to design new strategies, tactics and op-
erations to achieve greater anticipation, a better final customer experience and an
overall enhanced supply chain.
As supply chains generally account for between 60% and 90% of all company
costs (excluding financial services), any drive toward excellence will undoubtedly be
equally impactful on a company’s performance as well as on its final consumer satis-
faction.
This book, written by Nicolas Vandeput, is a carefully developed work emphasiz-
ing how and where data science can effectively lift the supply chain process higher up
the excellence ladder.
A gap-bridging book from both the research and the practitioner’s perspective, it is a
great source of information and value.
Firmly grounded in scientific research principles, this book deploys a compre-
hensive set of approaches particularly useful in tackling the critical challenges that
practitioners and researchers face in today’s and tomorrow’s (supply chain) business
environment.
Contents
Acknowledgments | VII
Introduction | XXI
1 Moving Average | 3
1.1 Moving Average Model | 3
1.2 Insights | 4
1.3 Do It Yourself | 6
2 Forecast KPI | 10
2.1 Forecast Error | 10
2.2 Bias | 11
2.3 MAPE | 14
2.4 MAE | 16
2.5 RMSE | 17
2.6 Which Forecast KPI to Choose? | 20
3 Exponential Smoothing | 27
3.1 The Idea Behind Exponential Smoothing | 27
3.2 Model | 28
3.3 Insights | 31
3.4 Do It Yourself | 33
4 Underfitting | 37
4.1 Causes of Underfitting | 38
4.2 Solutions | 40
6 Model Optimization | 52
6.1 Excel | 52
6.2 Python | 56
8 Overfitting | 66
8.1 Examples | 66
8.2 Causes and Solutions | 68
10 Outliers | 86
10.1 Idea #1 – Winsorization | 86
10.2 Idea #2 – Standard Deviation | 89
10.3 Idea #3 – Error Standard Deviation | 93
10.4 Go the Extra Mile! | 95
13 Tree | 122
13.1 How Does It Work? | 123
13.2 Do It Yourself | 126
15 Forest | 138
15.1 The Wisdom of the Crowd and Ensemble Models | 138
15.2 Bagging Trees in a Forest | 139
15.3 Do It Yourself | 141
15.4 Insights | 144
23 Clustering | 209
23.1 K-means Clustering | 209
23.2 Looking for Meaningful Centers | 211
23.3 Do It Yourself | 214
A Python | 265
Bibliography | 273
Glossary | 277
Index | 281
Introduction
Artificial intelligence is the new electricity.
Andrew Ng1
In the same way electricity revolutionized the second half of the 19th century, allowing
industries to produce more with less, artificial intelligence (AI) will drastically impact
the decades to come. While some companies already use this new electricity to cast
new light upon their business, others are still using old oil lamps or even candles,
using manpower to manually change these candles every hour of the day to keep the
business running.
As you will discover in this book, AI and machine learning (ML) are not just a
question of coding skills. Using data science to solve a problem will require a scientific
mindset more than coding skills. We will discuss many different models and algo-
rithms in the later chapters. But as you will see, you do not need to be an IT wizard
to apply these models. There is another more important story behind these: a story of
experimentation, observation, and questioning everything—a truly scientific method
applied to supply chain. In the field of data science as well as supply chain, simple
questions do not come with simple answers. To answer these questions, you need to
think like a scientist and use the right tools. In this book, we will discuss how to do
both.
intelligence into it, or some less-known statistical model, but in the end, it all goes
back to exponential smoothing, which we will discuss in the first part of this book. In
the past, one demand planner on her/his own personal computer couldn’t compete
with these models.
Today, things have changed. Thanks to the increase in computing power, the in-
flow of data, better models, and the availability of free tools, one can make a difference.
You can make a difference. With a few coding skills and an appetite for experimenta-
tion, powered by machine learning models, you will be able to bring to any business
more value than any off-the-shelf forecasting software can deliver. We will discuss ma-
chine learning models in Part II.
We often hear that the recent rise of artificial intelligence (or machine learning) is
due to an increasing amount of data available, as well as cheaper computing power.
This is not entirely true. Two other effects explain the recent interest in machine learning.
In recent years, many machine learning models have been improved, giving better
results. And as these models became better and faster, the tools to use them have become
more user-friendly. It is much easier today to use powerful machine learning models
than it was ten years ago.
Tomorrow, demand planners will have to learn to work hand-in-hand with ad-
vanced ML-driven forecast models. Demand planners will be able to add value to those
models as they understand the ML shortcomings. We will discuss this in Part III.
We will first cover statistical models in Part I, then machine learning models in Part II. Both parts will start with simple models and
end with more powerful (and complex) ones. This will allow you to build your under-
standing of the field of data science and forecasting step by step. Each new model or
concept will allow us to overcome a limitation or to go one step further in terms of
forecast accuracy.
On the other hand, not every single existing forecast model is explained here. We
will only focus on the models that have proven their value in the world of supply chain
forecasting.
Do It Yourself
We also made the decision not to use any black-box forecasting function from Python
or Excel. The objective of this book is not to teach you how to use software. It is twofold.
Its first purpose is to teach you how to experiment with different models on your own
datasets. This means that you will have to tweak the models and experiment with dif-
ferent variations. You will only be able to do this if you take the time to implement
these models yourself. Its second purpose is to allow you to acquire in-depth knowl-
edge on how the different models work as well as their strengths and limitations. Im-
plementing the different models yourself will allow you to learn by doing as you test
them along the way.
At the end of each chapter, you will find a Do It Yourself (DIY) section that will
show you a step-by-step implementation of the different models. I can only advise
you to start testing these models on your own datasets ASAP.
Data Science for Supply Chain Forecasting has been written for supply chain practi-
tioners, demand planners, and analysts who are interested in understanding the in-
ner workings of the forecasting science.4 By the end of the book, you will be able to
create, fine-tune, and use your own models to populate a demand forecast for your
supply chain. Demand planners often ask me what the best model is for demand fore-
casting. I always explain that there is no such thing as a perfect forecasting model
that could beat any other model for any business. As you will see, tailoring models to
your demand dataset will allow you to achieve a better level of accuracy than by using
black-box tools. This will be especially appreciable for machine learning, where there
is definitely no one-size-fits-all model or silver bullet: machine learning models need
to be tailor-fit to the demand patterns at hand.
4 Even though we will focus on supply chain demand forecasting, the principles and models ex-
plained in this book can be applied to any forecasting problem.
You do not need technical IT skills to start using the models in this book today.
You do not need a dedicated server or expensive software licenses—only your own
computer. You do not need a PhD in Mathematics: we will only use mathematics when
it is directly useful to tweak and understand the models. Often—especially for machine
learning—a deep understanding of the mathematical inner workings of a model will
not be necessary to optimize it and understand its limitations.
$$f_t = \frac{1}{n} \sum_{i=1}^{n} d_{t-i}$$
where
– $f_t$ is the forecast for period $t$
– $n$ is the number of periods we take the average of
– $d_t$ is the demand during period $t$
Initialization
As you will see for further models, we always need to discuss how to initialize the
forecast for the first periods. For the moving average method, we won’t have a forecast
until we have enough historical demand observations. So the first forecast will be done
for t = n + 1.
Future Forecast
Once we are out of the historical period, we simply define any future forecast as the
last forecast that was computed based on historical demand. This means that, with
this model, the future forecast is flat. This will be one of the major restrictions of this
model: its inability to extrapolate any trend.
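The Do It Yourself section of this chapter builds such a model step by step; as a reference, here is a minimal Python sketch of what the function can look like (the signature and column names are assumptions, not the book’s exact code). It returns a DataFrame with the demand, the forecast, and the error, which the following chapters rely on.

```python
import numpy as np
import pandas as pd

def moving_average(d, n=3, extra_periods=1):
    d = np.array(d, dtype=float)                # historical demand
    cols = len(d)
    d = np.append(d, [np.nan] * extra_periods)  # placeholders for future periods
    f = np.full(cols + extra_periods, np.nan)   # forecast array
    # The first forecast is only available at t = n + 1
    for t in range(n, cols):
        f[t] = np.mean(d[t - n:t])
    # Future forecast: flat, based on the last n historical observations
    f[cols:] = np.mean(d[cols - n:cols])
    df = pd.DataFrame({'Demand': d, 'Forecast': f})
    df['Error'] = df['Forecast'] - df['Demand']
    return df

# Example call on a short demand history
df = moving_average([28, 19, 18, 13, 19, 16], n=3)
```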
2 Forecast KPI
This chapter focuses on the quantitative aspects of forecast accuracy. For the
sake of simplicity, we will look at the error on the very next period (lag 1) and
at a single item at a time. In Chapter 27, we will discuss which lags are the most
important as well as how to deal with forecast error across a product portfolio.
Accuracy
The accuracy of your forecast measures how much spread you had between
your forecasts and the actual values. The accuracy gives an idea of the magni-
tude of the errors, but not their overall direction.
Bias
The bias represents the overall direction of the historical average error. It mea-
sures if your forecasts were on average too high (i. e., you overshot the demand)
or too low (i. e., you undershot the demand).
Of course, as you can see in Figure 2.1, what we want to have is a forecast that is both
accurate and unbiased.
Let’s start by defining the error during one period ($e_t$) as the difference between the
forecast ($f_t$) and the demand ($d_t$):

$$e_t = f_t - d_t$$
Note that with this definition, if the forecast overshoots the demand, the error will be
positive; if the forecast undershoots the demand, the error will be negative.
DIY
Excel
You can easily compute the error as the forecast minus the demand.
Starting from our example in Section 1.3, you can do this by inputting =C5-B5
in cell D5. This formula can then be dragged down over the range D5:D11.
Python
You can access the error directly via df['Error'] as it is included in the DataFrame
returned by our function moving_average(d).
2.2 Bias
The (average) bias of a forecast is defined as its average error.
$$\text{bias} = \frac{1}{n} \sum e_t$$
Where n is the number of historical periods where you have both a forecast and a
demand (i. e., periods where an error can be computed).
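As a quick sketch (reusing the DataFrame produced by the moving average sketch in Chapter 1—column names are assumptions), the bias can be computed in one line; dividing it by the average demand is one common way to express it as a percentage.

```python
e = df['Error'].dropna()                                 # periods where an error exists
bias_abs = e.mean()                                      # average error = bias
bias_rel = bias_abs / df.loc[e.index, 'Demand'].mean()   # scaled by the average demand
print(f'Bias: {bias_abs:.2f}, {bias_rel:.2%}')
```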
3 Exponential Smoothing
3.1 The Idea Behind Exponential Smoothing
Simple exponential smoothing is one of the simplest ways to forecast a time series;
we will use it in later chapters as a building block in many more powerful models. Just
as for a moving average, the basic idea of this model is to assume that the future will
be more or less the same as the (recent) past. The only pattern that this model will be
able to learn from demand history is its level.
Level
The level is the average value around which the demand varies over time. As
you can observe in Figure 3.1, the level is a smoothed version of the demand.
The exponential smoothing model will then forecast the future demand as its last es-
timation of the level. It is important to understand that there is no definitive mathe-
matical definition of the level—instead it is up to our model to estimate it.
The simple exponential smoothing model will have some advantages compared
to a naïve1 or a moving average model (see Chapter 1):
– The weight that is put on each observation decreases exponentially2 over time.
In other words, in order to determine the forecast, the historical, most recent pe-
riod has the highest importance; then each subsequent (older) period has less and
less importance. This is often better than moving average models, where the same
importance (weight) is given to a handful of historical periods.
– Outliers and noise have less impact than with a naïve forecast.
1 Remember, a naïve forecast simply projects the latest available observation in the future.
2 We’ll discuss this in more detail in the paragraph “Why Is It Called Exponential Smoothing?”
3.2 Model
The underlying idea of any exponential smoothing model is that, at each period, the
model will learn a bit from the most recent demand observation and remember a
bit of the last forecast it did. The smoothing parameter (or learning rate) alpha (α)
will determine how much importance is given to the most recent demand observation
(see Figure 3.2). Let’s represent this mathematically:
$$f_t = \alpha d_{t-1} + (1 - \alpha) f_{t-1}, \qquad 0 < \alpha \leq 1$$
The magic about this formula is that the last forecast made by the model was al-
ready including a part of the previous demand observation and a part of the previous
forecast. This means that the previous forecast includes everything the model
learned so far based on demand history.
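As an illustration, here is a minimal Python sketch of this recursion (the signature, initialization, and column names are assumptions; the book’s own DIY section may differ in details).

```python
import numpy as np
import pandas as pd

def simple_exp_smooth(d, alpha=0.4, extra_periods=1):
    d = np.array(d, dtype=float)                 # historical demand
    cols = len(d)
    d = np.append(d, [np.nan] * extra_periods)   # placeholders for future periods
    f = np.full(cols + extra_periods, np.nan)    # forecast array
    f[1] = d[0]                                  # initialization: first forecast = first demand
    for t in range(2, cols + 1):
        f[t] = alpha * d[t - 1] + (1 - alpha) * f[t - 1]
    f[cols + 1:] = f[cols]                       # future forecast is flat
    df = pd.DataFrame({'Demand': d, 'Forecast': f})
    df['Error'] = df['Forecast'] - df['Demand']
    return df
```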
A model aims to describe reality. As reality can be rather complex, a model will be built
on some assumptions (i. e., simplifications), as summarized by statistician George
Box. Unfortunately, due to these assumptions or some limitations, some forecast mod-
els will not be able to predict or properly explain the reality they are built upon.
We say that a model is underfitted if it does not explain reality accurately enough.
To analyze our model’s abilities, we will divide our dataset (i. e., the historical
demand in the case of a forecast) into two different parts: the training set and the test
set.
Training Set
The training set is used to train (fit) our model (i. e., optimize its parameters).
Test Set
The test set is the dataset that will assess the accuracy of our model against
unseen data. This dataset is kept aside from the model during its training
phase, so that the model is not aware of this data and can thus be tested against
unseen—yet readily available—data.
Typically, in the case of statistical forecast models, we use historical demand as the
training set to optimize the different parameters (α for the simple exponential smooth-
ing model). To test our forecast, we can keep the latest periods out of the training set
to see how our models behave during these periods.
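For a time series, this split is usually a simple holdout of the most recent periods—a sketch (the variable name and the holdout size are illustrative):

```python
holdout = 12                  # e.g., keep the last 12 periods aside
train = demand[:-holdout]     # used to fit/optimize the model parameters
test = demand[-holdout:]      # only used to assess accuracy on unseen data
```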
We need to be very careful with our test set. We can never use it to optimize our
model. Keep in mind that this dataset is here to show us how our model would perform
against new data. If we optimize our model on the test set, we will never know what
accuracy we can expect against new demand.
One could say that an underfitted model lacks a good understanding of the train-
ing dataset. As the model does not perform properly on the training dataset, it will
not perform well on the test set either. In the case of demand forecasting, a model that
does not achieve good accuracy on historical demand will not perform properly on
future demand either.
We will now look into two possible causes of underfitting and how to solve them.
5 Double Exponential Smoothing
When the facts change, I change my mind. What do you do, sir?
John Maynard Keynes
Trend
We define the trend as the average variation of the time series level between
two consecutive periods. Remember that the level is the average value around
which the demand varies over time.
If you assume that your time series follows a trend, you will most likely not know
its magnitude in advance—especially as this magnitude could vary over time. This is
fine, because we will create a model that will learn by itself the trend over time. Just as
for the level, this new model will estimate the trend based on an exponential weight
beta (β), giving more (or less) importance to the most recent observations.
Similarly, the general idea behind exponential smoothing models is that each demand
component (currently, the level and the trend, later the seasonality, as well) will be up-
dated after each period based on two same pieces of information: the last observation
and the previous estimation of this component. Let’s apply this principle to estimate
the demand level ($a_t$) and the trend ($b_t$).
Level Estimation
Let’s see how the model will estimate the level:
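The update equations themselves fall on pages that are not part of this extract. As a reference, a standard formulation of double (Holt) exponential smoothing—assuming the book’s notation, with $\alpha$ and $\beta$ as the two learning rates—is:

$$a_t = \alpha d_t + (1 - \alpha)(a_{t-1} + b_{t-1})$$
$$b_t = \beta (a_t - a_{t-1}) + (1 - \beta) b_{t-1}$$
$$f_{t+1} = a_t + b_t$$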
6 Model Optimization
It is difficult to make predictions, especially about the future.
Author unknown
Now that we have seen a couple of forecast models, we can discuss parameter opti-
mization. Let’s recap the models we have seen so far and their different parameters:
– Moving Average: $n$ (Chapter 1)
– Simple Exponential Smoothing: $\alpha$ (Chapter 3)
– Double Exponential Smoothing: $\alpha$, $\beta$ (Chapter 5)
As we have seen in the double exponential smoothing case, a wrong parameter opti-
mization will lead to catastrophic results. To optimize our models, we could manually
search for the best parameter values. But remember, that would be against our sup-
ply chain data science best practices: we need to automate our experiments in order
to scale them. Thanks to our favorite tools—Excel and Python—we will be able to au-
tomatically look for the best parameter values. The idea is to set an objective (either
RMSE or MAE),1 automatically run through different parameter values, and then select
the one that gave the best results.
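As a sketch of this idea in Python (assuming the simple exponential smoothing function sketched in Chapter 3 and an MAE objective; names and ranges are illustrative):

```python
import numpy as np

best_alpha, best_mae = None, np.inf
for alpha in np.arange(0.05, 0.65, 0.05):     # candidate smoothing parameters
    df = simple_exp_smooth(demand, alpha=alpha)
    mae = df['Error'].abs().mean()            # objective to minimize
    if mae < best_mae:
        best_alpha, best_mae = alpha, mae
print(f'Best alpha: {best_alpha:.2f} (MAE: {best_mae:.1f})')
```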
6.1 Excel
To optimize the parameters in Excel, we will use Excel Solver. If you have never used
Excel Solver before, do not worry. It is rather easy.
Solver Activation
The first step is to activate Solver in Excel. If you have a Windows machine with Excel
2010 or a more recent version, you can activate it via the following steps,
1. Open Excel, go to File/Options/Add-ins.
2. Click on the Manage drop-down menu, select Excel Add-ins and click on the
Go... button just to the right of it.
3. On the add-ins box, click on the Solver Add-in check box and then click the OK
button to confirm your choice.
4. Let’s now confirm that Solver is activated. In the Excel ribbon, go to the Data tab,
there on the sub-menu Analyze (normally on the far right), you should see the
Solver button.
7 Double Smoothing with Damped Trend
7.1 The Idea Behind Double Smoothing with Damped Trend
One of the limitations of the double smoothing model is the fact that the trend is as-
sumed to go on forever. In 1985, Gardner and McKenzie proposed in their paper “Fore-
casting Trends in Time Series” to add a new layer of intelligence to the double expo-
nential model: a damping factor, phi (ϕ), that will exponentially reduce the trend
over time. One could say that this new model forgets the trend over time.1 Or that the
model remembers only a fraction (ϕ) of the previous estimated trend.
Practically, the trend (b) will be reduced by a factor ϕ in each period. In theory,
ϕ will be between 0 and 1—so that it can be seen as a % (like α and β). Nevertheless,
in practice, it is often between 0.7 and 1. At the edge cases: if ϕ = 0, we are back to a
simple exponential smoothing forecast; and if ϕ = 1, the damping is removed and we
deal with a double smoothing model.
7.2 Model
We will go back to the double exponential smoothing model and multiply all bt−1 oc-
currences by ϕ. Remember that ϕ ≤ 1, so that the damped trend is a muted version of
the double smoothing model. We then have:
$$f_{t+1} = a_t + b_t\phi$$
$$f_{t+2} = a_t + b_t\phi + b_t\phi^2$$
$$f_{t+3} = a_t + b_t\phi + b_t\phi^2 + b_t\phi^3$$
Note that if ϕ = 1, then we are back to a normal double smoothing model. Setting ϕ = 1
basically means that the model won’t forget the trend over time.
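As a small sketch, this is how the damped trend is projected over a forecast horizon once the last level a and trend b have been estimated (the function name and inputs are illustrative):

```python
def damped_forecast(a, b, phi, horizon):
    forecasts, damped_trend = [], 0.0
    for h in range(1, horizon + 1):
        damped_trend += b * phi ** h          # b*phi + b*phi^2 + ... + b*phi^h
        forecasts.append(a + damped_trend)
    return forecasts
```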
8 Overfitting
With four parameters, I can fit an elephant, and with five, I can make him wiggle his trunk.
John Von Neumann
In Chapter 4, we saw the issue of underfitting a dataset. This happens when a model is
not able to learn the patterns present in the training dataset. As we saw, underfitting
is most likely due to the model not being smart enough to understand the patterns in
the training dataset. This could be solved by using a more complex model.
On the other end of the spectrum, we have the risk for a model to overfit a dataset.
If a model overfits the data, it means that it has recognized (or learned) patterns from
the noise (i. e., randomness) of the training set. As it has learned patterns from the
noise, it will reapply these patterns in the future on new data. This will create an
issue as the model will show (very) good results on the training dataset but will
fail to make good predictions on the test set. In a forecasting model, that means
that your model will show good accuracy on historical demand but will fail to deliver
as good results on future demand.
In other words, overfitting means that you learned patterns that worked by
chance on the training set. And as these patterns most likely won’t occur again on
future data, you will make wrong predictions.
Overfitting is the number one enemy of many data scientists for another reason.
Data scientists are always looking to make the best models with the highest accuracy.
When a model achieves (very) good accuracy, it is always tempting to think that it
is simply excellent and call it a day. But a careful analysis will reveal that the model
is just overfitting the data. Overfitting can be seen as a mirage: you are tempted to
think that there is an oasis in the middle of the desert, but actually, it is just sand
reflecting the sky. As we learn more complex models, underfitting will become less
of an issue, and overfitting will become the biggest risk. This risk will be especially
present with machine learning, as we will see in Part II. Therefore, we will have to use
more complex techniques to prevent our models from overfitting our datasets. Our
battle against overfitting will reach a peak in Chapters 18 and 24, when we discuss
feature optimization.
Let’s go over some examples of overfitting and the tools to avoid it.
8.1 Examples
Supply Chain Forecasting
Let’s imagine that we have observed the demand of a new product over 20 periods
(as shown in Figure 8.1), and now we want to use our optimization algorithm from
Chapter 6 to make a forecast.
9 Triple Exponential Smoothing
9.1 The Idea Behind Triple Exponential Smoothing
With the first two exponential smoothing models we saw, we learned how to identify
the level and the trend of a time series and used these pieces of information to populate
our forecast. After that, we added an extra layer of intelligence to the trend by allowing
the model to partially forget it over time.
Unfortunately, the simple and double exponential smoothing models do not rec-
ognize seasonal patterns and therefore cannot extrapolate any seasonal behavior in
the future. Seasonal products—with high and low seasons—are common for many sup-
ply chains across the globe, as many different factors can cause seasonality. This lim-
itation is thus a real problem for our model.
In order for our model to learn a seasonal pattern, we will add a third layer of
exponential smoothing. The idea is that the model will learn multiplicative seasonal
factors that will be applied to each period inside a full seasonal cycle. As for the trend
(β) and the level (α), the seasonal factors will be learned via an exponential weighting
method with a new specific learning rate: gamma (γ).
Multiplicative seasonal factors mean, for example, that the model will know that
the demand is increased by 20% in January (compared to the yearly average) but re-
duced by 30% in February.
We will discuss the case of additive seasonality in Chapter 11.
9.2 Model
The main idea is that the forecast is now composed of the level (a) plus the (damped)
trend (b) multiplied by the seasonal factor (s).
Pay attention to the fact that we need to use the seasonal factors that were calculated
during the previous season: $s_{t-p}$, where $p$ (for periodicity) is the season length.1 The
different seasonal factors ($s$) can be seen as percentages to be applied to the level in
order to obtain the forecast. For example, for a monthly forecast, the statement "We
sell 20% more in January" would be translated as $s_{\text{january}} = 120\%$.
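Putting the pieces together, the forecast composition described above can be written as follows (a sketch in the chapter’s notation, assuming the damped-trend form from Chapter 7):

$$f_{t+1} = (a_t + \phi\, b_t)\, s_{t+1-p}$$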
10 Outliers
I shall not today attempt further to define this kind of material ... and perhaps I could never succeed
in intelligibly doing so. But I know it when I see it.
Potter Stewart
In 1964, Potter Stewart was a United States Supreme Court Justice. He was discussing
not outliers, but whether the movie The Lovers was obscene or not.
As you work on forecasts with the different models we saw—and the following
models we will see later—you will notice that your dataset has outliers. And even
though I know it when I see it might be the only practical definition, these outliers pose
a real threat to supply chains. These high (or low) points will result in overreactions in
your forecast or your safety stocks, ultimately resulting in (at best) manual corrections
or (at worst) dead stocks, losses, and a nasty bullwhip effect. Actually, when you look
at blogs, books, articles, or software on forecasting, the question of outlier detection
is often overlooked. This is a pity. Outlier detection is serious business.
These outliers pop out all the time in modern supply chains. They are mostly due
to two main reasons:
Mistakes and Errors These are obvious outliers. If you spot such errors or encoding
mistakes, it calls for process improvement to prevent them from happening again.
Exceptional Demand Even though some demand observations are genuine, they can
still be exceptional and deserve to be cleaned or smoothed. Such exceptional sales
are actually not so uncommon in supply chains. Think about promotions, marketing,
strange customer behaviors, or destocking. Typically, you might not want to take into
account for your forecast the exceptional sale at −80% you made last year to get rid
of old, nearly obsolete inventory.
If you can spot outliers and smooth them out, you will make a better forecast. I have
seen numerous examples where the forecast error was reduced by a couple of percentage
points, thanks to outlier cleaning. Flagging outliers manually is a time-intensive,
error-prone, and unrewarding process; few demand planners will take the time
necessary to review those. Therefore, the bigger the dataset, the more important it is to
automate this detection and cleaning. As data scientists, we automate tasks to scale
our processes. Let's see how we can do this for outlier detection.
In the following pages, we will discuss three and a half ideas to spot these outliers
and put them back to a reasonable level.
11 Triple Additive Exponential Smoothing
11.1 The Idea Behind Triple Additive Exponential Smoothing
So far, we have discussed four different exponential smoothing models:
– Simple exponential smoothing
– Double exponential smoothing with (additive) trend
– Double exponential smoothing with (additive) damped trend
– Triple exponential smoothing with (additive) damped trend and multiplicative
seasonality
The last model we saw is potent but still has some limitations due to its seasonality’s
multiplicative aspect. The issue of multiplicative seasonality is how the model reacts
when you have periods with very low volumes. Periods with a demand of 2 and 10
might have an absolute difference of 8 units. Still, their relative difference is 500%,
so the seasonality (which is expressed in relative terms) could drastically change. We
will then replace this multiplicative seasonality with an additive one.
With multiplicative seasonality, we could interpret the seasonal factors as a per-
centage increase (or decrease) of the demand during each period. One could say “We
sell 20% more in January but 30% less in February.” Now, the seasonal factors will be
absolute amounts to be added to the demand level. One could say “We sell 150 units
more than usual in January but 200 less in February.”
11.2 Model
The model idea is that the forecast is now composed of the level plus the (damped)
trend, plus an additive seasonal factor s.
Pay attention—as for the multiplicative model, we use the seasonal factors from the
previous cycle: $s_{t-p}$ (where $p$, for periodicity, denotes the season length). The
seasonal factors can now be read as an amount to add to (or subtract from) the period level
to obtain the forecast. For example, if $s_{\text{january}} = 20$, it means you will sell 20 pieces
more in January than in an average month.
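The forecast composition described above can thus be written as follows (a sketch in the chapter’s notation, assuming the damped-trend form from Chapter 7):

$$f_{t+1} = a_t + \phi\, b_t + s_{t+1-p}$$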
Component Updates
|
Part II: Machine Learning
12 Machine Learning
Tell us what the future holds, so we may know that you are gods.
Isaiah 41:23
Until now, we have been using old-school statistics to predict demand. But with the
recent rise of machine learning algorithms, we have new tools at our disposal that can
easily achieve outstanding performance in terms of forecast accuracy for a typical sup-
ply chain demand dataset. As you will see in the following chapters, these models will
be able to learn many relationships that are beyond the ability of traditional exponen-
tial smoothing models. For example, we will discuss how to add external information
to our model in Chapters 20 and 22.
So far, we have created different algorithms that have used a predefined model
to populate a forecast based on historical demand. The issue was that these models
couldn’t adapt to historical demand. If you use a double exponential smoothing model
to predict a seasonal product, it will fail to interpret the seasonal patterns. On the other
hand, if you use a triple exponential smoothing model on a non-seasonal demand, it
might overfit the noise in the demand and interpret it as seasonality.
Machine learning is different. Here, the algorithm (i. e., the machine) will learn re-
lationships from a training dataset (i. e., our historical demand) and then apply these
relationships on new data. Whereas a traditional statistical model will apply a prede-
fined relationship (model) to forecast the demand, a machine learning algorithm will
not assume a priori a particular relationship (like seasonality or a linear trend); it will
learn these patterns directly from the historical demand.
For a machine learning algorithm to learn how to make predictions, we will have
to feed it with both the inputs and the desired respective outputs. It will then automat-
ically understand the relationships between these inputs and outputs.
Another important difference between using machine learning and exponential
smoothing models to forecast our demand is that machine learning algorithms will
learn patterns from our entire dataset. Exponential smoothing models will treat
each item individually and independently from the others. Because it uses the entire
dataset, a machine learning algorithm will apply what works best to each product.
One could improve the accuracy of an exponential smoothing model by increasing the
length of each time series (i. e., providing more historical periods for each product).
Using machine learning, we will be able to increase our model’s accuracy by providing
more of the products’ data to be ingested by the model.
Welcome to the world of machine learning.
Table 12.1

Product    Inputs (Year 1)           Output (Year 2)
           Q1    Q2    Q3    Q4      Q1
#1         5     15    10    7       6
#2         7     2     3     1       1
#3         18    25    32    47      56
#4         4     1     5     3       2
For our forecasting problem, we will basically show our machine learning algorithm
different extracts of our historical demand dataset as inputs and, as a desired output,
what the very next demand observation was. In our example in Table 12.1, the algo-
rithm will learn the relationship between the last four quarters of demand, and the
demand of the next quarter. The algorithm will learn that if we have 5, 15, 10, and 7
as the last four demand observations, the next demand observation will be 6, so that
its prediction should be 6. Next to the data and relationships from product #1, the al-
gorithm will also learn from products #2, #3, and #4. In doing so, the idea is for the
model to use all the data provided to give us better forecasts.
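A minimal sketch of how such a training set can be built in Python (the function name, input format, and window length are illustrative, not the book’s exact code):

```python
import numpy as np

def create_training_set(demand_matrix, x_len=4):
    # demand_matrix: one row per product, one column per period
    X, Y = [], []
    for product in demand_matrix:
        for end in range(x_len, len(product)):
            X.append(product[end - x_len:end])  # the last x_len observations
            Y.append(product[end])              # the very next observation
    return np.array(X), np.array(Y)
```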
Most people will react to this idea with two very different thoughts. Either people
will think that "it is simply impossible for a computer to look at the demand and make
a prediction" or that "as of now, humans have nothing left to do." Both are wrong.
As we will see later, machine learning can generate very accurate predictions. And
as the human controlling the machine, we still have to ask ourselves many questions,
such as:
– Which data should we feed to the algorithm for it to understand the proper rela-
tionships? We will discuss how to include other data features in Chapters 20, 22,
and 23; and how to select the relevant ones in Chapters 18 and 24.
– Which machine learning algorithm should be used? There are many different
ones: we will discuss new models in Chapters 13, 15, 17, 19, 21, and 25.
13 Tree
It’s a dangerous business, Frodo, going out your door. You step onto the road, and if you don’t keep
your feet, there’s no knowing where you might be swept off to.
J. R. R. Tolkien, The Lord of the Rings
As a first machine learning algorithm, we will use a decision tree. Decision trees are
a class of machine learning algorithms that will create a map (a tree, actually) of ques-
tions to make a prediction. We call these trees regression trees if we want them to pre-
dict a number, or classification trees if we want them to predict a category or a label.
Regression
A regression model predicts a numerical value (e.g., a demand quantity).
Classification
A classification model predicts a category or a label.
In order to make a prediction, the tree will start at its foundation with a yes/no ques-
tion, and based on the answer, it will continue asking new yes/no questions until it
gets to a final prediction. Somehow you might see these trees as a big game of “Guess
Who?” (the famous ’80s game): the model will ask multiple consecutive questions
until it gets to a good answer.1
In a decision tree, each question is called a node. For example, in Figure 13.1,
“Does the person have a big nose?” is a node. Each possible final answer is called a
leaf. In Figure 13.1, each leaf contains only one single person. But that is not manda-
tory. You could imagine that multiple people have a big mouth and a big nose. In such
case, the leaf would contain multiple values.
The different pieces of information that a tree has at its disposal to split a node are
called the features.
Feature
A feature is a piece of information that the model has at its disposal to make a prediction.
1 If you do not think an algorithm can easily make a prediction based on a few yes/no questions, I
advise you to check out Akinator—a genius who will find any character you can think of within a few
questions (en.akinator.com).
13.1 How Does It Work?
For example, the tree we had in Figure 13.1 for the game “Guess Who?” could split a
node on the three features: mouth, nose, and glasses.
X_train               Y_train
 5   15   10    7       6
15   10    7    6      13
10    7    6   13      11
 7    6   13   11       5
 6   13   11    5       4
13   11    5    4      11
11    5    4   11       9
 7    2    3    1       1
Based on this training dataset, a smart question to ask yourself in order to make a
prediction is: Is the first demand observation >7?
As shown in Table 13.2, this is a useful question, as you know that the answer
(Yes/No) will provide an interesting indication of the behavior of the demand for the
next quarter. If the answer is yes, the demand we try to predict is likely to be rather
high (≥ 8), and if the answer is no, then the demand we try to predict is likely to be
low (≤ 7).
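As a minimal sketch, fitting such a regression tree on the training set with scikit-learn looks like this (the parameters shown are illustrative, not the values the book selects):

```python
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5)
tree.fit(X_train, Y_train)
Y_train_pred = tree.predict(X_train)
```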
14 Parameter Optimization
When we created our regression tree in Chapter 13, we chose some parameters:
But are we sure these are the best? Maybe if we set max_depth to 7, we could improve
our model accuracy? It is unfortunately impossible to know a priori what the best set
of parameters is. But that's fine—we are supply chain data scientists; we love to run
experiments.
If we did this, we would actually be optimizing this parameter directly against the test
set. We would then face the risk of overfitting the model to the test set and not being
able to replicate similar results against new data.
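Instead, the parameters can be searched on the training set only, using k-fold cross-validation—a sketch with scikit-learn (the parameter ranges are illustrative):

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

param_dist = {'max_depth': range(3, 12),
              'min_samples_leaf': range(2, 20)}
search = RandomizedSearchCV(DecisionTreeRegressor(), param_dist, n_iter=20,
                            cv=5, scoring='neg_mean_absolute_error')
search.fit(X_train, Y_train)
print(search.best_params_)
```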
15 Forest
15.1 The Wisdom of the Crowd and Ensemble Models
In social choice theory1 there is a concept called the wisdom of the crowd. This idea
explains that the average opinion of a group of people is going to be more precise
(on average) than the opinion of a single member of the group. Let’s explore a simple
example: if you want to make a forecast for the demand of a product next month, it
is better to ask the opinion of many different team members (salespeople, marketing,
CEO, supply chain planners, financial analysts) and take the average of the different
forecasts, rather than to trust only one team member blindly.
In their (excellent) book Superforecasting: The Art and Science of Prediction, Philip
E. Tetlock and Dan Gardner explain how one can harness the power of the wisdom
of the crowd among a team to predict anything from stock-price value to presidential
elections.2 For wisdom to emerge out of a crowd, you need three main elements. First,
each individual needs independent sources of information. Second, they must make
independent decisions (not being influenced by the others). And third, there must be
a way to gather and weigh those individual predictions into a final one (usually taking
the average or median of the various predictions).
1 The social choice theory is the science of combining different individual choices into a final global
decision. This scientific field emerged in the late 18th century with the voting paradox. It states that a
group’s collective preference can be cyclic (i. e., A is preferred to B, B is preferred to C, and C is preferred
to A) even if none of the preferences of each individual are cyclic.
2 See Tetlock and Gardner (2016).
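Sections 15.2 and 15.3 build on this idea by bagging many trees into a forest; as a minimal sketch, a scikit-learn random forest that averages the predictions of its trees looks like this (illustrative parameters, not the book’s tuned values):

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, max_depth=8, n_jobs=-1)
forest.fit(X_train, Y_train)
Y_test_pred = forest.predict(X_test)   # the average prediction of all trees
```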
16 Feature Importance
In some ways, AI is comparable to the classical oracle of Delphi, which left to human beings the
interpretation of its cryptic messages about human destiny.
Henry A. Kissinger, Eric Schmidt, Daniel Huttenlocher1
Statistical models are interesting, as they can show us the interactions between the
different input variables and the output. Such models are easy to understand and
to interpret. In the first part of the book, we worked with many different exponen-
tial smoothing models: we could analyze them by simply looking at their level, trend,
or seasonality over time. Thanks to them, we can easily answer questions such as Do
we have a seasonality? or Is there a trend? And if the forecast for a specific period is
strange, we can look at how the sub-components (level, trend, seasonality) are behav-
ing to understand where the error comes from.
This is unfortunately not the case with machine learning. These models are very
difficult to interpret. A forest will never give you an estimation of the level, trend, or
seasonality of a product. You won’t even know if the model sees a seasonality or a
trend. Nevertheless, we have one tool at our disposal to understand how our machine
learning algorithm thinks: the feature importance.
As you remember, we created a machine learning algorithm that looked at the last
12 months of historical car sales in Norway to predict the sales in the following month.
In order to do that, we trained our model by showing it many different sequences of
13 months so that it could learn from these sequences how to predict the 13th month
based on the 12 previous ones.
Before we continue with new models and further optimization, let’s discuss the
importance of these inputs. We do not know yet which of these 12 historical months is
the most important to predict the sales of the 13th.
We would like to answer questions such as:
– Is M-1 more important than M-12?
– Is M-5 of any help?
When we grow a tree or a forest, we can get an idea of each feature’s importance (in
our specific case, each feature is one of the 12 previous periods). There are different
definitions and ways to compute each feature’s importance; we will focus on the one
used by the scikit-learn library. Basically, the importance of a feature is the reduction
it brings to the objective that the algorithm tries to minimize. In our case, we want to
bring the MAE or the RMSE down. Therefore, the feature importance is measured as
the forecast accuracy brought by each feature. The feature importance is then normal-
ized (i. e., each feature importance is scaled so that the sum of all feature importances
is 1).
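In practice, these normalized values can be read directly from a fitted forest—a sketch (assuming the X_train columns are ordered from M-12 to M-1 and that 'forest' is a fitted scikit-learn model):

```python
import pandas as pd

features = [f'M-{lag}' for lag in range(12, 0, -1)]    # M-12, ..., M-1
importances = pd.Series(forest.feature_importances_, index=features)
print(importances.sort_values(ascending=False))        # they sum to 1
```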
17 Extremely Randomized Trees
The random forest idea was that we could obtain a better forecast accuracy by taking
the average prediction of many different trees. To create those slightly different trees
from the same initial training dataset, we used two tricks. The first was to limit the
number of features the algorithm could choose from each node split. The second trick
was to create different random subsets of the initial training dataset (thanks to boot-
strapping and only keeping a subset of the initial samples).
In 2006, Belgian researchers Pierre Geurts, Damien Ernst, and Louis Wehenkel in-
troduced a third idea to further increase the differences between each tree.1 At each
node, the algorithm will now choose a split point randomly for each feature and then
select the best split among these. It means that, for our Norwegian car sales dataset,
this new method will draw at each node one random split point for each of the 12 pre-
vious months (our features) and, among these 12 potential splits, choose the best one
to split the node. The fact that an Extremely Randomized Trees model (or ETR) draws split
points randomly seems counterintuitive. Still, this will further increase the difference
between the trees in the ETR, resulting in better overall accuracy. Remember, an ensemble
model (such as the ETR or the forest) is more accurate when its sub-models are
different.
17.1 Do It Yourself
We will once again use the scikit-learn library, so our code will be very similar to the
one we used for the random forest.
from sklearn.ensemble import ExtraTreesRegressor
ETR = ExtraTreesRegressor(n_jobs=-1)  # assumed instantiation; the book tunes its own parameters
ETR.fit(X_train, Y_train)
Y_train_pred = ETR.predict(X_train)
Y_test_pred = ETR.predict(X_test)
kpi_ML(Y_train, Y_train_pred, Y_test, Y_test_pred, name='ETR')
18 Feature Optimization #1
The more you know about the past, the better prepared you are for the future.
Theodore Roosevelt
So far, we have created three different models (a regression tree, a random forest, and a
set of extremely randomized trees), and we have optimized each of them via a random
search automatically running through some possible parameter sets and testing them
via k-fold cross-validation.
We took the time to choose a model and optimize its parameters. There is one thing
that we haven’t optimized (yet): the features that we use.
The different models that we used so far have an interesting feature: once fitted to
a training dataset, they can show each input’s (i. e., feature) importance. Remember
that, in our case, the input features are the different historical periods that we use
to populate our forecast. For our car sales in Norway, we used the last 12 months of
sales to predict the next month: these are our features. We previously looked at the
feature importance of our random forest and saw that M-1 was the most critical month,
leaving half of the other months useless. As you can see in Figure 18.1, if we do the
same exercise with the ETR model, we obtain a much flatter curve.
How can we explain this? As you might remember, the difference between the extra
trees model and the random forest lies in the fact that the split point chosen at each
node is not the optimal one, but can only be selected from a set of random points
for each selected feature. So even though M-1 should be the most useful feature, in
many cases, the random split point proposed for this feature is not the best across the
different possibilities the algorithm can choose from.
Based on this graph, we can then ask ourselves an important question: what if we
used 13 months instead of 12 to make a forecast? To answer this question, we will have
to do a lot of experiments (on our training set).
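A sketch of such an experiment: refit the model with a varying number of lag features and compare the cross-validated error on the training set (all names and ranges are illustrative, reusing the windowing function sketched in Chapter 12):

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

for x_len in range(6, 25, 3):                      # 6, 9, ..., 24 months of history
    X, Y = create_training_set(demand_matrix, x_len=x_len)
    mae = -cross_val_score(ExtraTreesRegressor(n_jobs=-1), X, Y,
                           scoring='neg_mean_absolute_error', cv=5).mean()
    print(f'{x_len} months of history -> MAE {mae:.1f}')
```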
19 Adaptive Boosting
Weak Model
A weak model (or weak learner) is a model that is slightly more accurate than
random guessing—typically, a simple, shallow tree.
If we reframe this question, we could say: Can we obtain a good forecasting model by
using only (very) simple trees? In 1990, Robert E. Schapire answered this positively. In
1997, Yoav Freund and Robert E. Schapire published an algorithm together that could
create a strong learner based on weak ones.2 To do so, they used another ensemble
technique called boosting.
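As a minimal sketch, boosting shallow trees with scikit-learn’s AdaBoost regressor looks like this (illustrative parameters, not the book’s tuned values):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),  # a weak learner
                        n_estimators=100, learning_rate=0.1)
ada.fit(X_train, Y_train)
Y_test_pred = ada.predict(X_test)
```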
1 See Kearns (1988), Kearns and Valiant (1989). In his work, Kearns noted that “the resolution of this
question is of theoretical interest and possibly of practical importance.” As you will see, boosting
algorithms further pushed the frontiers of machine learning power.
2 See Schapire (1990), Freund and Schapire (1997) for the original papers. For a detailed review and
discussion between multiple academic authors, see Breiman et al. (1998) (available here: projecteuclid.org/download/pdf_1/euclid.aos/1024691079). Those initial papers discussed Adaptive Boosting
for classification problems; the first implementation of AdaBoost for regression problems was pub-
lished by Drucker (1997), which is the algorithm used in scikit-learn.
20 Demand Drivers and Leading Indicators
Felix, qui potuit rerum cognoscere causas – Fortunate, who was able to know the causes of things.
Virgil (70–19 BC)
Until now, we have made our forecasts solely based on historical demand. We have
discussed in Chapter 14 how to optimize our model, and we have discussed in Chap-
ter 18 how to optimize the number of historical periods we should take into account to
make a prediction. We haven’t discussed the following yet: which other factors could
we be looking at to predict future demand?
For many businesses, historical demand is not the only—or main—factor that
drives future sales. Other internal and external factors drive the demand as well. You
might sell more or less depending on the weather, the GDP growth, unemployment
rate, loan rates, and so on. These external factors (external, as a company does not
control them) are often called leading indicators.
The demand can also be driven by company decisions: price changes, promotions,
marketing budget, or another product’s sales. As these factors result from business
decisions, we will call them internal factors.
As you can see in Figure 20.1, we will group both internal and external factors into
the term exogenous factors (as these factors are exogenous to the historical demand),
or, more simply, demand drivers.
21 Extreme Gradient Boosting
21.1 From Gradient Boosting to Extreme Gradient Boosting
In 2001, Jerome H. Friedman proposed a new concept to boost trees: Gradient Boosting.1 The general concept behind Gradient Boosting and Adaptive Boosting is essentially the same: both are ensemble models that boost (stack) trees on top of each other based on the mistakes of the previous ones. The main difference is that in Gradient Boosting, each new weak learner is fitted directly to the model's current errors (the residuals) rather than to a reweighted version of the initial training set.
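As a minimal sketch (the toy dataset is invented; footnote 4 links to the official scikit-learn example), fitting scikit-learn's GradientBoostingRegressor looks like this:

# Gradient Boosting: each new shallow tree is fitted to the current residuals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.integers(50, 150, size=(100, 12))  # 12 past periods per sample
Y_train = X_train.mean(axis=1)                   # next period (toy rule)

gbr = GradientBoostingRegressor(
    n_estimators=200,   # number of trees stacked sequentially on the residuals
    learning_rate=0.1,  # shrinks the contribution of each new tree
    max_depth=3)        # each tree remains a weak learner
gbr.fit(X_train, Y_train)
print(gbr.predict(X_train[:2]))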
21.2 Do It Yourself
Installation
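Assuming a standard pip- or Anaconda-based environment, XGBoost can typically be installed with pip install xgboost or conda install -c conda-forge xgboost (see the official website in footnote 5 for up-to-date instructions). A quick sketch to check that the installation works:

# Quick check that XGBoost is installed and usable; the tiny dataset is made up.
import numpy as np
import xgboost as xgb
from xgboost import XGBRegressor

print(xgb.__version__)                      # confirms the library can be imported
X_tiny = np.array([[1, 2, 3], [2, 3, 4]])
y_tiny = np.array([4, 5])
model = XGBRegressor(n_estimators=10, max_depth=3, learning_rate=0.1)
model.fit(X_tiny, y_tiny)
print(model.predict(np.array([[3, 4, 5]])))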
1 See Friedman (2001) for the original paper, and Kashnitsky (2020) (mlcourse.ai/articles/topic10-
boosting) for a detailed explanation and comparison against AdaBoost.
2 See Chen and Guestrin (2016).
3 The main mathematical difference between XGBoost and GBoost lies in the fact that XGBoost uses a regularized objective function that penalizes both the number of leaves in a tree and extreme weights given to individual leaves. Even though the exact mathematics involved are beyond the scope of this book (see Chen and Guestrin (2016) for the detailed model), we will use those regularization parameters to improve our model.
4 See an example of implementation here: scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html
5 See the official website xgboost.readthedocs.io
22 Categorical Features
As we saw in Chapter 20, we can improve our forecast by enriching our dataset, that is, by adding external information to our historical demand. We saw that it might not
be straightforward (nor meaningful) for all businesses to use such external macro-
economic elements. On the other hand, most supply chains serve different markets
(often through different channels) and have different product families. What if a ma-
chine learning model could benefit from these extra pieces of information: Am I selling
this to market A? Is this product part of family B?
In our car sales dataset, we could imagine that instead of only having sales in Nor-
way, we could have the sales in different markets across Europe. You could then feed
the algorithm with the sales per country and indicate the market of each data sample
(e. g., Sweden, Finland). We could also imagine segmenting into four categories: low
cost, normal, premium, and luxury brands—or simply allowing the model to see the
brand it is forecasting.
Unfortunately, most of the current machine learning libraries (including scikit-
learn) cannot directly deal with categorical inputs. This means that you won’t be able
to fit your model based on an X_train dataset, as shown in Table 22.1.
Table 22.1

             X_train                    Y_train
Segment      Demand                     Demand
Premium      5    15   10   7           6
Normal       2    3    1    1           1
Low cost     18   25   32   47          56
Luxury       4    1    5    3           2
The machine learning models that we discussed can only be trained based on nu-
merical inputs. That means that we will have to transform our categorical inputs into
numbers.
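One common technique to do so (a sketch; not necessarily the exact encoding used later in this chapter) is one-hot (dummy) encoding, where each category becomes its own 0/1 column; the segment values mirror Table 22.1:

# One-hot (dummy) encoding of a categorical feature with pandas.
import pandas as pd

X_train = pd.DataFrame({'Segment': ['Premium', 'Normal', 'Low cost', 'Luxury']})
X_encoded = pd.get_dummies(X_train, columns=['Segment'], dtype=int)
print(X_encoded)  # each segment is now a separate 0/1 column that a model can digest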
23 Clustering
As discussed in Chapter 22, it can be helpful for both data scientists and machine
learning models to classify the various products they have. Unfortunately, you do not
always receive a pre-classified dataset. Could a machine learning model help us clas-
sify it? Yes, of course. In order to do so, we will have to use unsupervised machine
learning models.
Supervised Learning
A supervised machine learning model is a model that is fed with both inputs
and desired outputs. It is up to the algorithm to understand the relationship(s)
between these inputs and outputs. It is called supervised, as you show the
model what the desired output is.
All the machine learning models we have seen so far are called supervised
models (you cannot ask your algorithm to make a forecast if you never show it
what a good forecast looks like).
Unsupervised Learning
Given meaningful features, unsupervised learning can cluster anything from products to clients or from social behaviors to neighborhoods in a city. Usually, a product catalog is segmented based on product families (or brands). But products within the same family might not share the same demand patterns (seasonality, trends), while products from different families might. For example, meat for barbecues will sell in a similar way to outdoor furniture, but not to regular meat. And student notebooks will not be purchased with the same seasonality as business notebooks, even though they look alike.
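As an illustration of what such a clustering could look like (the products and their two features, a seasonality index and a trend index, are invented for this example), here is a minimal K-means sketch with scikit-learn:

# K-means clustering of products based on hypothetical demand features.
import numpy as np
from sklearn.cluster import KMeans

features = np.array([
    [0.9, 0.1],   # strong summer seasonality, flat trend (e.g., barbecue meat)
    [0.8, 0.2],   # similar pattern (e.g., outdoor furniture)
    [0.1, 0.9],   # little seasonality, strong trend
    [0.2, 0.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)   # cluster assigned to each product
print(kmeans.inertia_)  # within-cluster sum of squared distances to the centers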
25 Neural Networks
Neural networks (or deep learning, which usually refers to "deep" neural networks) are a vast subject. This chapter is only an introduction to the topic. It will be enough, though, to get you started in no time using some of the most recent best practices.
If you are interested in learning more about neural networks, do not hesitate to enroll in Andrew Ng's deep learning specialization on Coursera (www.coursera.org/specializations/deep-learning).
25.1 How Neural Networks Work
In 1950, Alan Turing (an English computer scientist) proposed the "Turing test" to assess whether a machine could imitate human language well enough to be mistaken for a human. GPT-3 now writes journal articles that are often barely distinguishable from those written by humans. OpenAI, funded by Elon Musk, was initially a non-profit organization. It became a capped-profit organization in 2019, capping the returns on investment at a hundred times the initial amount. Shortly after, it secured a $1 billion investment from Microsoft.4
Neuron
A neuron is the elementary building block of a neural network: it computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function.
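As a rough numerical sketch (the inputs and weights below are made up), this is all a single neuron does:

# A single neuron: activation(weighted sum of inputs + bias); the numbers are invented.
import numpy as np

inputs = np.array([1.0, 2.0, 3.0])    # values coming from the previous layer
weights = np.array([0.2, -0.5, 0.1])  # one weight per input
bias = 0.4

z = np.dot(weights, inputs) + bias    # weighted sum plus bias
output = max(0.0, z)                  # ReLU, one common activation function
print(output)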
26 Judgmental Forecasts
In Parts I and II, we discussed how to make a statistical or ML-driven forecast engine. This engine will populate a forecast baseline for you. Is this forecast baseline the end of the forecasting process in a supply chain? No, it can still be enriched by humans using various sources of insights and information. A human-created forecast is called a judgmental forecast, as it relies on human judgment. Using judgmental forecasts comes with a lot of risks, as we will discuss in this chapter. Nevertheless, if done properly, it will add value to your baseline forecast (as we will discuss in Chapter 27).
Judgmental forecasts are also appropriate for forecasting products when you lack the historical data to use a model, or when significant changes are ongoing (due to changing client behavior, new competition, changing legislation, the COVID-19 pandemic, etc.).
Before discussing judgmental forecasts further, keep in mind that a demand forecast should be the best unbiased estimate of a supply chain's future demand. It is nothing like a budget, a sales target used to incentivize sales representatives, or a production plan.
27 Forecast Value Added
In this chapter, we will identify how to improve a forecast baseline using the best practices of judgmental forecasting (as seen in Chapter 26). We will first discuss, in Section 27.1, how to use smart KPIs to manage a portfolio of products and focus on the products that matter the most. Then, in Section 27.2, we will see how to track the value added by each team in the forecasting process.
Smart Weighting
As a demand planner, you have the opportunity to review some of your product fore-
casts during each forecast exercise. But on which items should you spend your time?
The first idea would be to look at the products with the most significant forecast error.
Let’s imagine an example where you are responsible for nails and hammers. As shown
in Table 27.1, the absolute forecast error is more significant on the nails (500 pieces)
than on the hammers (50 pieces). Should you, therefore, focus your time and efforts
on nails?
Obviously, not every SKU is created equal: some bring more profit, some are costly, some use constrained resources (e. g., space), some are of strategic importance...
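As a rough illustration (only the unit errors come from the nails-and-hammers example; the unit prices are hypothetical), weighting each item's absolute error by its value quickly changes where the attention should go:

# Weighting forecast errors by value rather than by units; prices are hypothetical.
import pandas as pd

items = pd.DataFrame({
    'Item':        ['Nails', 'Hammers'],
    'Error_units': [500, 50],        # absolute forecast error in pieces
    'Price':       [0.05, 20.0]})    # hypothetical unit price

items['Error_value'] = items['Error_units'] * items['Price']
print(items.sort_values('Error_value', ascending=False))
# Measured in money rather than in pieces, the hammers now deserve more attention.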
Now It’s Your Turn!
Your task is not to foresee the future, but to enable it.
Antoine de Saint-Exupéry
The real purpose of this book was not just to explain different models but, more impor-
tantly, to give you the appetite to use them. This book is a toolbox that gives you the
tools and models to create your own forecast models. Hopefully, it has ignited ideas
for you to create unique models and has taught you the methodology to test them.
You have learned in Part I how to create a robust statistical baseline forecast by us-
ing the different exponential smoothing models. Moreover, we have discussed differ-
ent ideas (especially the different initialization methods) to tweak them to any dataset.
We have also created a robust model to detect and correct outliers in an automated
fashion. We have discussed various forecast KPIs, ensuring that you will be able to
assess your models properly.
In Part II, you have discovered how machine learning could create advanced fore-
cast models that could learn relationships across a whole dataset. Machine learning
will allow you to classify your products and uncover potential complex relationships
with any demand driver.
Finally, in Part III, we discussed the best practices required for judgmental fore-
casting and how to use the forecast value added framework to create—and sustain—an
efficient forecasting process.
Now, it is your turn to work on creating your own models. The most important mes-
sage is: You can do it. Start your journey by collecting historical demand data. Once
you have gathered enough data (aim for five years), build your first model using simple
machine learning models (such as random forests). You can then refine your model by
either including new features (pricing, promotions, historical stock-outs, etc.) or us-
ing more advanced ML techniques. Remember that to succeed, you will have to test
many ideas; avoid overfitting, and avoid the temptation to add too much complexity
at once. Openly discuss your ideas, models, and results with others.
You will achieve astonishing results.
The future of demand planning is human-machine teams working hand-in-hand.
You have learned here the keys to creating your own machine-learning driven models
and how to best interact with them to create even more value.
This is the future of supply chain forecasting.
A Python
If this is your first time using Python, let’s take some time to quickly introduce and
discuss the different libraries and data types we are going to use. Of course, the goal
of this book is not to give you full training in Python; if you wish to have an in-depth
introduction (or if you are not yet convinced to use Python), please refer to the recom-
mended courses in the Introduction.
There are multiple ways to install Python on your computer. An easy way is to install the Anaconda distribution from www.anaconda.com/download. Anaconda is a well-known platform used by data scientists all over the world, and it works on Windows, Mac, and Linux. Anaconda will take care of installing all the Python libraries you need to run the different models that we are going to discuss. It will also install Spyder and Jupyter Notebook, two Python code editors that you can use to write and run your scripts. Feel free to try both and use your favorite.
Lists
The most basic object we will use in Python is a list. In Python, a list is simply an
ordered sequence of any number of objects (e. g., strings, numbers, other lists, more
complex objects, etc.). You can create a list by enclosing these objects between square brackets []. Typically, we can define our first time series ts as:

ts = [1,2,3,4,5,6]
These lists are very efficient at storing and manipulating objects, but they are not meant for numerical computation. For example, if we want to add two time series together, we cannot simply write ts + ts2, as this is what we would get:
ts = [1,2,3,4,5,6]
ts2 = [10,20,30,40,50,60]
ts + ts2
Out: [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60]
Python is returning a new, longer list. That’s not exactly what we wanted.
NumPy Arrays
This is where the famous NumPy1 library comes to help. Since its initial release in 2006, NumPy has offered us a new data type: the NumPy array. It is similar to a list, as it contains a sequence of numeric values, but it differs in that we can easily apply mathematical functions to it. You can create one directly from a list like this:
import numpy as np
ts = np.array([1,2,3,4,5,6])

ts2 = np.array([10,20,30,40,50,60])
ts + ts2
Out: array([11, 22, 33, 44, 55, 66])
Note that the result is another NumPy array (and not a simple list).
NumPy also works very well with regular lists, as we can apply most NumPy functions directly to them. Here is an example:
alist = [1,2,3]
np.mean(alist)
Out: 2.0
You can always look for help on the NumPy official website.2 As you will see yourself,
most of your Google searches about NumPy functions will actually end up directly in
their documentation.
Slicing Arrays
To select a particular value in a list (or an array), you simply have to indicate between
[] the index of its location inside the list (array). The catch—as with many coding
languages—is that the index starts at 0 and not at 1; so the first element in your list
will have the index 0, the second element will have the index 1, and so on.
alist = ['cat','dog','mouse']
alist[1]
Out: 'dog'
anarray = np.array([1,2,3])
anarray[0]
Out: 1
If you want to select multiple items at once, you can indicate a range of indexes with the format [start:end]. Note the following:
– If you do not give a start value, Python will assume it is 0.
– If you do not give an end value, Python will assume it is the end of the list.
The result will include the start element but exclude the end element.
alist = ['cat','dog','mouse']
alist[1:]
Out: ['dog', 'mouse']
anarray = np.array([1,2,3])
anarray[:1]
Out: array([1])
If you give a negative value as an index (or as a slice boundary), Python counts backward from the end of your list/array (−1 being the last element).
alist = ['cat','dog','mouse']
alist[-1]
Out: 'mouse'
alist[:-1]
Out: ['cat', 'dog']
You can slice a multi-dimensional array by separating each dimension with a comma.
anarray = np.array([[1,2],[3,4]])
anarray[0,0]
Out: 1
anarray[:,-1]
Out: array([2, 4])
Pandas DataFrames
Pandas is one of the most used libraries in Python (it was created by Wes McKinney in 2008). The name comes from panel data, as the library helps to organize data into tables: think of it as Excel-meets-databases in Python. Pandas introduces a new data type: the DataFrame. If you are a database person, just think of a DataFrame as an SQL table; if you are an Excel person, just imagine a DataFrame as an Excel table. In practice, a DataFrame is a data table in which each column is a named NumPy array. This will come in pretty handy, as we can select each column of our DataFrame by its name.
There are many ways to create a DataFrame. Let’s create our first one by using a
list of our two time series.
import pandas as pd
pd.DataFrame([ts,ts2])
Out:
    0   1   2   3   4   5
0   1   2   3   4   5   6
1  10  20  30  40  50  60
The convention is to import pandas as pd and to call our main DataFrame df. The
output we get is a DataFrame where we have 6 columns (named '0','1','2','3','4' and '5')
and 2 rows (actually, they also have a name—or index—'0' and '1').
We can easily edit the column names:
df = pd.DataFrame([ts,ts2])
df.columns = ['Day1','Day2','Day3','Day4','Day5','Day6']
print(df)
Out:
   Day1  Day2  Day3  Day4  Day5  Day6
0     1     2     3     4     5     6
1    10    20    30    40    50    60
Pandas comes with very simple and helpful official documentation.3 When in doubt,
do not hesitate to look into it. As with NumPy, most of your Google searches will end
up there.
3 pandas.pydata.org/pandas-docs/stable/
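The examples below use a dictionary dic that maps a product name (the key) to one of the time series defined earlier (the value); a definition consistent with these examples would be:

# A dictionary mapping product names to the NumPy arrays ts and ts2 defined above.
dic = {'Small product': ts, 'Big product': ts2}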
Here, the key 'Small product' will give you the value ts, whereas the key 'Big prod-
uct' will give you ts2.
dic['Small product']
Out: array([1, 2, 3, 4, 5, 6])
dic['Small product'] + dic['Big product']
Out: array([11, 22, 33, 44, 55, 66])
df = pd.DataFrame.from_dict(dic)
Out:
   Small product  Big product
0              1           10
1              2           20
2              3           30
3              4           40
4              5           50
5              6           60
We now have a DataFrame where each product has its own column and where each
row is a separate period.
Slicing DataFrames
There are many different techniques to slice a DataFrame to get the element or the part
you want. This might be confusing for beginners, but you’ll soon understand that each
of these has its uses and advantages. Do not worry if you get confused or overwhelmed:
you won’t need to apply all of these right now.
– You can select a specific column by passing the name of this column directly to the DataFrame: either df['myColumn'], or even more directly df.myColumn.
– You can select a row based on its index value with df.loc[myIndexValue] (note that df[...] with a plain label selects a column, not a row).
– If you want to select an element based on both its row and column, you can call the .loc method on the DataFrame and give it the index value and the column name you want. You can, for example, type df.loc[myIndexValue,'myColumn'].
– You can also use the same slicing technique as for lists and arrays, based on the position of the element you want to select. You then need to call the .iloc method on the DataFrame. Typically, to select the first element (top-left corner), you can type df.iloc[0,0].
As a recap, here are all the techniques you can use to select a column or a row:
df['myColumn']
df.myColumn
df.loc[myIndexValue]
df.loc[myIndexValue,'myColumn']
df.iloc[0,0]
Exporting DataFrames
A DataFrame can easily be exported as either an Excel file or a CSV file.
df.to_excel('FileName.xlsx', index=False)
df.to_csv('FileName.csv', index=False)
The index parameter indicates whether the DataFrame index should be written to the output file.
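The reverse operation, loading these files back into DataFrames, follows the same pattern (a quick sketch):

# Reading the exported files back into DataFrames.
df_xlsx = pd.read_excel('FileName.xlsx')
df_csv = pd.read_csv('FileName.csv')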
Other Libraries
We will also use other very well-known Python libraries. We used the usual import
conventions for these libraries throughout the book. For the sake of clarity, we did not
show the import lines over and over in each code extract.
SciPy is a library used for all kinds of scientific computation and optimization, as well as statistical computation (SciPy stands for Scientific Python). The documentation is available on docs.scipy.org/doc/scipy/reference, but it is unfortunately not always as clear as we would wish it to be. We mainly focus on its statistical tools, which we import as stats.
To make our code shorter, we will import some functions directly from scipy.stats
in our examples. We can import these as such:
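For example (the exact functions imported vary from chapter to chapter; norm is shown here as a typical one):

# Typical SciPy imports; 'norm' is only one example of a function imported directly.
from scipy import stats
from scipy.stats import norm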
Matplotlib is the library used for plotting graphs. Unfortunately, matplotlib is not the most user-friendly library, and its documentation is rarely helpful. If we want to make simple graphs, we will prefer using the .plot() method from pandas.
Seaborn is another plotting library, built on top of matplotlib. It is more user-friendly and provides some refreshing visualizations compared to matplotlib. You can check the official website seaborn.pydata.org, as it provides clear and inspiring examples.
Glossary
accuracy The accuracy of your forecast measures how much spread you had between
your forecasts and the actual values. The accuracy of a forecast gives an idea of
the magnitude of the errors but not their overall direction. See page 10
alpha A smoothing factor applied to the demand level in the various exponential
smoothing models. In theory: 0 < α ≤ 1; in practice: 0 < α ≤ 0.6. See page 28
array A data structure defined in NumPy. It is a list or a matrix of numeric values. See
page 266
bagging Bagging (short for bootstrap aggregation) is a method for aggregating multiple sub-models into an ensemble model by averaging their predictions with equal weighting. See page 139
beta A smoothing factor applied to the trend in the various exponential smoothing
models. In theory: 0 < β ≤ 1; in practice: 0 < β ≤ 0.6. See page 41
bias The bias represents the overall direction of the historical average error. It mea-
sures if your forecasts were on average too high (i. e., you overshot the demand)
or too low (i. e., you undershot the demand). See page 10
Boolean A Boolean is a value that is either True or False: 1 or 0. See page 203
boosting Boosting is a class of ensemble algorithms in which models are added se-
quentially, so that later models in the sequence will correct the predictions made
by earlier models in the sequence. See page 168
bullwhip effect The bullwhip effect is observed in supply chains when small varia-
tions in the downstream demand result in massive fluctuations in the upstream
supply chain. See page 45
classification Classification problems require you to classify data samples in differ-
ent categories. See page 122
data leakage In the case of forecast models, a data leakage describes a situation
where a model is given pieces of information about future demand. See page 30
DataFrame A DataFrame is a table of data as defined by the pandas library. It is sim-
ilar to a table in Excel or an SQL database. See page 268
demand observation This is the demand for a product during one period. For exam-
ple, a demand observation could be the demand for a product in January last year.
See page 4
ensemble An ensemble model is a (meta-)model constituted of many sub-models.
See page 139
epoch One epoch consists of the neural network learning algorithm running once through all the training samples. The number of epochs is the number of times the learning algorithm will run through the entire training dataset. See page 242
Euclidean distance The Euclidean distance between two points is the length of a
straight line between these two points. See page 210
evaluation set An evaluation set is a set of data that is left aside from the training
set to be used as a monitoring dataset during the training. A validation set or a
holdout set can be used as an evaluation set. See page 192
feature A feature is a type of information that a model has at its disposal to make a
prediction. See page 122
gamma A smoothing factor applied to the seasonality (either additive or multiplica-
tive) in the triple exponential smoothing models. In theory: 0 < γ ≤ 1; in practice:
0.05 < γ ≤ 0.3. See page 70
holdout set Subset of the training set that is kept aside during the training to validate
a model against unseen data. The holdout set is made of the last periods of the
training set to replicate a test set. See page 162
inertia In a K-means model, the inertia is the sum of the distances between each data
sample and its associated cluster center. See page 211
instance An (object) instance is a technical term for an occurrence of a class. You can see a class as a blueprint and an instance as a specific realization of this blueprint. The class (blueprint) defines what each instance will look like (which variables will constitute it) and what it can do (which methods or functions it will be able to perform). See page 126
level The level is the average value around which the demand varies over time. See
page 27
Mean Absolute Error MAE = (1/n) ∑|e_t|. See page 16
Mean Absolute Percentage Error MAPE = (1/n) ∑(|e_t|/d_t). See page 14
Mean Square Error MSE = (1/n) ∑ e_t². See page 17
naïve forecast The simplest forecast model: it always predicts the last available ob-
servation. See page 5
noise In statistics, the noise is an unexplained variation in the data. It is often due to
the randomness of the different processes at hand. See page 5
NumPy One of the most famous Python libraries. It is focused on numeric computa-
tion. The basic data structure in NumPy is an array. See page 266
pandas Pandas is a Python library specializing in data formatting and manipulation.
It allows the use of DataFrames to store data in tables. See page 268
phi A damping factor applied to the trend in the exponential smoothing models. This
reduces the trend after each period. In theory: 0 < ϕ ≤ 1; in practice: 0.7 ≤ ϕ ≤ 1.
See page 60
regression Regression problems require an estimate of a numerical output based on
various inputs. See page 122
Root Mean Square Error RMSE = √((1/n) ∑ e_t²). See page 17
S&OP The sales and operations planning (S&OP) process focuses on aligning mid-
and long-term demand and supply. See page XXI
SKU A stock keeping unit refers to a specific material kept in a specific location. Two
different pieces of the same SKU are indistinguishable. See page 251
Index
early stopping
– XGBoost 197
ensemble 138, 139, 167, 169, 196
ETR 150, 151, 153, 154, 157, 164, 177, 195, 196
– computation time 154, 176, 199
feature importance 147, 148, 155, 164, 222
– plot 149
– selection 223
– XGBoost 190
forecast value added 252, 257–259, 261
forest 128, 139–157, 164, 167, 176, 177, 195, 196
– computation time 142, 144, 154, 176
gradient boosting 189, 195
gradient descent 234–236
holdout set 164
– as evaluation set 195
– compared to test set 163, 165
– creation 161, 162, 165, 219
intermittent demand 26
– KPI 25, 26
– seasonality 76
judgmental forecast 259, 261, see Chapter 26
MAE 16, 20–23, 120, 127, 223, 254, 255, 257, 258, 260, 261
– bias 21, 24
– compared to RMSE 23, 26, 59, 120, 129
– computation 16, 17, 53, 54, 119
– error weighting 19, 20
– intermittent demand 26
– limitations 21
– objective function 52, 55, 56, 59, 127–129, 136, 142, 147, 158, 198
– reporting 26
– sensitivity to outliers 24, 25
MAPE 14–16, 20, 23
– bias 21
– limitations 21
MSE 17, 22
– objective function 128, 136, 142
naïve forecast 4, 5, 27
neural network
– activation function 230, 239, 243
– Adam 239, 242
– backpropagation 237
– early stopping 242, 243
– forward propagation 232, 237
noise 4, 27, 28, 66, 68, 109, 156