Interpretable ML
Contents
List of Figures
Summary
1 Introduction
1.1 Story Time
1.2 What Is Machine Learning?
1.3 Terminology
2 Interpretability
2.1 Importance of Interpretability
2.2 Taxonomy of Interpretability Methods
2.3 Scope of Interpretability
2.4 Evaluation of Interpretability
2.5 Properties of Explanations
2.6 Human-friendly Explanations
3 Datasets
3.1 Bike Rentals (Regression)
3.2 YouTube Spam Comments (Text Classification)
3.3 Risk Factors for Cervical Cancer (Classification)
4 Interpretable Models
4.1 Linear Regression
4.2 Logistic Regression
4.3 GLM, GAM and more
13 Translations
14 Acknowledgements
List of Figures
7.1 PDPs for the bicycle count prediction model and temperature, humidity and wind speed. The largest differences can be seen in the temperature. The hotter, the more bikes are rented. This trend goes up to 20 degrees Celsius, then flattens and drops slightly at 30. Marks on the x-axis indicate the data distribution.
7.2 PDPs for the bike count prediction model and the season. Unexpectedly, all seasons show a similar effect on the model predictions; only for winter does the model predict fewer bicycle rentals.
7.3 PDPs of cancer probability based on age and years with hormonal contraceptives. For age, the PDP shows that the probability is low until 40 and increases after. The more years on hormonal contraceptives, the higher the predicted cancer risk, especially after 10 years. For both features not many data points with large values were available, so the PD estimates are less reliable in those regions.
7.4 PDP of cancer probability and the interaction of age and number of pregnancies. The plot shows the increase in cancer probability at 45. For ages below 25, women who had 1 or 2 pregnancies have a lower predicted cancer risk, compared with women who had 0 or more than 2 pregnancies. But be careful when drawing conclusions: This might just be a correlation and not causal!
7.5 Strongly correlated features x1 and x2. To calculate the feature effect of x1 at 0.75, the PDP replaces x1 of all instances with 0.75, falsely assuming that the distribution of x2 at x1 = 0.75 is the same as the marginal distribution of x2 (vertical line). This results in unlikely combinations of x1 and x2 (e.g. x2=0.2 at x1=0.75), which the PDP uses for the calculation of the average effect.
Preface by the Author
for your application or not. In the last section of each chapter, avail-
able software implementations are discussed.
Machine learning has received great attention from many people in re-
search and industry. Sometimes machine learning is overhyped in the
media, but there are many real and impactful applications. Machine
learning is a powerful technology for products, research and automa-
tion. Today, machine learning is used, for example, to detect fraud-
ulent financial transactions, recommend movies and classify images.
It is often crucial that the machine learning models are interpretable.
Interpretability helps the developer to debug and improve the model,
build trust in the model, justify model predictions and gain insights.
The increased need for machine learning interpretability is a natural
consequence of an increased use of machine learning. This book has
become a valuable resource for many people. Teaching instructors use
the book to introduce their students to the concepts of interpretable
machine learning. I received e-mails from various master and doc-
toral students who told me that this book was the starting point and
most important reference for their theses. The book has helped ap-
plied researchers in the fields of ecology, finance, psychology, etc., who
use machine learning to understand their data. Data scientists from
industry told me that they use the “Interpretable Machine Learning”
book for their work and recommend it to their colleagues. I am happy
that many people can benefit from this book and become experts in
model interpretation.
I would recommend this book to practitioners who want an overview
of techniques to make their machine learning models more inter-
pretable. It is also recommended for students, researchers, and anyone
else interested in the topic. To benefit from this book, you
should already have a basic understanding of machine learning. You
should also have a mathematical understanding at university entry
level to be able to follow the theory and formulas in this book. It should
also be possible, however, to understand the intuitive description of
the method at the beginning of each chapter without mathematics.
I hope you enjoy the book!
1 Introduction
1.1 Story Time
“It’s definitely not the worst way to die!” Tom summarised, trying to
find something positive in the tragedy. He removed the pump from
the intravenous pole.
“He just died for the wrong reasons,” Lena added.
“And certainly with the wrong morphine pump! Just creating more
work for us!” Tom complained while unscrewing the back plate of the
pump. After removing all the screws, he lifted the plate and put it aside.
He plugged a cable into the diagnostic port.
“You didn’t just complain about having a job, did you?” Lena gave him
a mocking smile.
“Of course not. Never!” he exclaimed with a sarcastic undertone.
He booted the pump’s computer.
Lena plugged the other end of the cable into her tablet. “All right, diag-
nostics are running,” she announced. “I am really curious about what
went wrong.”
“It certainly shot our John Doe into Nirvana. That high concentration
of this morphine stuff. Man. I mean … that’s a first, right? Normally a
broken pump gives off too little of the sweet stuff or nothing at all. But
never, you know, like that crazy shot,” Tom explained.
“I know. You don’t have to convince me … Hey, look at that.” Lena held
up her tablet. “Do you see this peak here? That’s the potency of the
painkillers mix. Look! This line shows the reference level. The poor guy
had a mixture of painkillers in his blood system that could kill him 17
times over. Injected by our pump here. And here …” she swiped, “here
you can see the moment of the patient’s demise.”
“So, any idea what happened, boss?” Tom asked his supervisor.
“Hm … The sensors seem to be fine. Heart rate, oxygen levels, glucose,
… The data were collected as expected. Some missing values in the
blood oxygen data, but that’s not unusual. Look here. The sensors have
also detected the patient’s slowing heart rate and extremely low cor-
tisol levels caused by the morphine derivate and other pain blocking
agents.” She continued to swipe through the diagnostics report.
Tom stared captivated at the screen. It was his first investigation of a
real device failure.
“Ok, here is our first piece of the puzzle. The system failed to send a
warning to the hospital’s communication channel. The warning was
triggered, but rejected at protocol level. It could be our fault, but it
could also be the fault of the hospital. Please send the logs over to the
IT team,” Lena told Tom.
Tom nodded with his eyes still fixed on the screen.
Lena continued: “It’s odd. The warning should also have caused the
pump to shut down. But it obviously failed to do so. That must be a bug.
Something the quality team missed. Something really bad. Maybe it’s
related to the protocol issue.”
“So, the emergency system of the pump somehow broke down, but why
did the pump go full bananas and inject so much painkiller into John
Doe?” Tom wondered.
“Good question. You are right. Protocol emergency failure aside, the
pump shouldn’t have administered that amount of medication at all.
The algorithm should have stopped much earlier on its own, given the
low level of cortisol and other warning signs,” Lena explained.
“Maybe some bad luck, like a one in a million thing, like being hit by
lightning?” Tom asked her.
“No, Tom. If you had read the documentation I sent you, you would
have known that the pump was first trained in animal experiments,
then later on humans, to learn to inject the perfect amount of
painkillers based on the sensory input. The algorithm of the pump
might be opaque and complex, but it’s not random. That means that in
the same situation the pump would behave exactly the same way again.
Our patient would die again. A combination or undesired interaction
of the sensory inputs must have triggered the erroneous behavior of
the pump. That is why we have to dig deeper and find out what hap-
pened here,” Lena explained.
“I see …,” Tom replied, lost in thought. “Wasn’t the patient going to die
soon anyway? Because of cancer or something?”
Lena nodded while she read the analysis report.
Tom got up and went to the window. He looked outside, his eyes fixed
on a point in the distance. “Maybe the machine did him a favor, you
know, in freeing him from the pain. No more suffering. Maybe it just
did the right thing. Like a lightning, but, you know, a good one. I mean
like the lottery, but not random. But for a reason. If I were the pump,
I would have done the same.”
She finally lifted her head and looked at him.
He kept looking at something outside.
Both were silent for a few moments.
Lena lowered her head again and continued the analysis. “No, Tom. It’s
a bug… Just a damn bug.”
Trust Fall
2050: A subway station in Singapore
She rushed to the Bishan subway station. With her thoughts she was
already at work. The tests for the new neural architecture should be
completed by now. She led the redesign of the government’s “Tax Affin-
ity Prediction System for Individual Entities”, which predicts whether
a person will hide money from the tax office. Her team had come up
with an elegant piece of engineering. If successful, the system would
not only serve the tax office, but also feed into other systems such
as the counter-terrorism alarm system and the commercial registry.
One day, the government could even integrate the predictions into the
Civic Trust Score. The Civic Trust Score estimates how trustworthy a
person is. The estimate affects every part of your daily life, such as get-
ting a loan or how long you have to wait for a new passport. As she de-
scended the escalator, she imagined how an integration of her team’s
system into the Civic Trust Score System might look.
She routinely wiped her hand over the RFID reader without reducing
her walking speed. Her mind was occupied, but a dissonance of sen-
sory expectations and reality rang alarm bells in her brain.
Too late.
Nose first she ran into the subway entrance gate and fell with her butt
first to the ground. The door was supposed to open, … but it did not.
Dumbfounded, she stood up and looked at the screen next to the gate.
“Please try another time,” suggested a friendly looking smiley on the
screen. A person passed by and, ignoring her, wiped his hand over the
reader. The door opened and he went through. The door closed again.
She wiped her nose. It hurt, but at least it did not bleed. She tried to
open the door, but was rejected again. It was strange. Maybe her pub-
lic transport account did not have sufficient tokens. She looked at her
smartwatch to check the account balance.
“Login denied. Please contact your Citizens Advice Bureau!” her watch
informed her.
A feeling of nausea hit her like a fist to the stomach. She suspected
what had happened. To confirm her theory, she started the mobile
game “Sniper Guild”, a first-person shooter. The app was immediately
and automatically closed again, which confirmed her theory. She became dizzy and sat
down on the floor again.
There was only one possible explanation: Her Civic Trust Score had
dropped. Substantially. A small drop meant minor inconveniences,
such as not getting first class flights or having to wait a little longer for
official documents. A low trust score was rare and meant that you were
classified as a threat to society. One measure in dealing with these peo-
ple was to keep them away from public places such as the subway. The
government restricted the financial transactions of subjects with low
Civic Trust Scores. They also began to actively monitor their behavior
on social media and even went as far as to restrict certain content, such
as violent games. It became exponentially more difficult to increase
your Civic Trust Score the lower it was. People with a very low score
usually never recovered.
She could not think of any reason why her score should have fallen. The
score was based on machine learning. The Civic Trust Score System
worked like a well-oiled engine that ran society. The performance of
the Trust Score System was always closely monitored. Machine learn-
ing had become much better since the beginning of the century. It
had become so efficient that decisions made by the Trust Score System
could no longer be disputed. An infallible system.
She laughed in despair. Infallible system. If only. The system had rarely
failed. But it had failed. She must be one of those special cases; an error of
the system; from now on an outcast. Nobody dared to question the sys-
tem. It was too integrated into the government, into society itself, to
be questioned. In the few remaining democratic countries it was for-
bidden to form anti-democratic movements, not because they were
inherently malicious, but because they would destabilize the current
system. The same logic applied to the now more common algocracies.
Criticism of the algorithms was forbidden because of the danger to the
status quo.
Algorithmic trust was the fabric of the social order. For the common
good, rare false trust scorings were tacitly accepted. Hundreds of
other prediction systems and databases fed into the score, making it
impossible to know what caused the drop in her score. She felt like a
big dark hole was opening in and under her. With horror she looked
into the void.
Her tax affinity system was eventually integrated into the Civic Trust
Score System, but she never got to know it.
Fermi’s Paperclips
Year 612 AMS (after Mars settlement): A museum on Mars
she asked carefully. “No. They made the climate hot and it wasn’t
people, it was computers and machines. And it’s Planet Earth, not
Earther Planet,” added another girl named Lin. Xola nodded in agree-
ment. With a touch of pride, the teacher smiled and nodded. “You
are both right. Do you know why it happened?” “Because people were
short-sighted and greedy?” Xola asked. “People could not stop their
machines!” Lin blurted out.
“Again, you are both right,” the teacher decided, “but it’s much more
complicated than that. Most people at the time were not aware of what
was happening. Some saw the drastic changes, but could not reverse
them. The most famous piece from this period is a poem by an anony-
mous author. It best captures what happened at that time. Listen care-
fully!”
The teacher started the poem. A dozen of the small drones reposi-
tioned themselves in front of the children and began to project the
video directly into their eyes. It showed a person in a suit standing in
a forest with only tree stumps left. He began to talk:
The machines compute; the machines predict.
We march on as we are part of it.
We chase an optimum as trained.
The optimum is one-dimensional, local and unconstrained.
Silicon and flesh, chasing exponentiality.
Growth is our mentality.
When all rewards are collected,
and side-effects neglected;
When all the coins are mined,
and nature has fallen behind;
We will be in trouble,
After all, exponential growth is a bubble.
The tragedy of the commons unfolding,
Exploding,
Before our eyes.
Cold calculations and icy greed,
Fill the earth with heat.
Everything is dying,
And we are complying.
Like horses with blinders we race the race of our own creation,
Towards the Great Filter of civilization.
And so we march on relentlessly.
As we are part of the machine.
Embracing entropy.
“A dark memory,” the teacher said to break the silence in the room. “It
will be uploaded to your library. Your homework is to memorise it by
next week.” Xola sighed. She managed to catch one of the little drones.
The drone was warm from the CPU and the engines. Xola liked how it
warmed her hands.
1.2 What Is Machine Learning?
in default with their loans, and data that will help you make predic-
tions, such as income, past credit defaults, and so on. For an auto-
matic house value estimator program, you could collect data from past
house sales and information about the real estate such as size, loca-
tion, and so on.
Step 2: Enter this information into a machine learning algorithm that
generates a sign detector model, a credit rating model or a house value
estimator.
Step 3: Use the model with new data. Integrate the model into a product
or process, such as a self-driving car, a credit application process or a
real estate marketplace website.
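These three steps translate almost directly into code. Below is a minimal sketch for the house value example, assuming scikit-learn, a random forest as one possible learning algorithm, and hypothetical CSV files (house_sales.csv, new_houses.csv) with purely numerical feature columns; the file and column names are illustrative, not from the book.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Step 1: Collect data, e.g. hypothetical past house sales.
sales = pd.read_csv("house_sales.csv")            # columns: size, n_rooms, ..., price
X, y = sales.drop(columns="price"), sales["price"]

# Step 2: Feed the data into a machine learning algorithm that
# generates a house value estimator model.
model = RandomForestRegressor(random_state=0).fit(X, y)

# Step 3: Use the model with new data, e.g. inside a real estate
# marketplace website.
new_houses = pd.read_csv("new_houses.csv")        # same feature columns, no price
estimated_prices = model.predict(new_houses)
```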
Machines surpass humans in many tasks, such as playing chess (or
more recently Go) or predicting the weather. Even if the machine is
as good as a human or a bit worse at a task, there remain great ad-
vantages in terms of speed, reproducibility and scaling. Once implemented,
a machine learning model can complete a task much faster than humans,
reliably deliver consistent results, and be copied infinitely.
Replicating a machine learning model on another machine is
fast and cheap. The training of a human for a task can take decades
(especially when they are young) and is very costly. A major disadvan-
tage of using machine learning is that insights about the data and
the task the machine solves are hidden in increasingly complex models.
You need millions of numbers to describe a deep neural network, and
there is no way to understand the model in its entirety. Other models,
such as the random forest, consist of hundreds of decision trees that
“vote” for predictions. To understand how the decision was made, you
would have to look into the votes and structures of each of the hun-
dreds of trees. That just does not work no matter how clever you are or
how good your working memory is. The best performing models are
often blends of several models (also called ensembles) that cannot be
interpreted, even if each single model could be interpreted. If you fo-
cus only on performance, you will automatically get more and more
opaque models. The winning models in machine learning competi-
tions are often ensembles of models or very complex models such as
boosted trees or deep neural networks.
1.3 Terminology
To avoid confusion due to ambiguity, here are some definitions of
terms used in this book:
An Algorithm is a set of rules that a machine follows to achieve a par-
ticular goal. An algorithm can be considered a recipe that defines
the inputs, the output and all the steps needed to get from the inputs
to the output. Cooking recipes are algorithms where the ingredients
are the inputs, the cooked food is the output, and the preparation and
cooking steps are the algorithm instructions.
Machine Learning is a set of methods that allow computers to learn
from data to make and improve predictions (for example cancer,
weekly sales, credit default). Machine learning is a paradigm shift
from “normal programming” where all instructions must be explic-
itly given to the computer to “indirect programming” that takes place
through providing data.
FIGURE 1.1 A learner learns a model from labeled training data. The
model is used to make predictions.
A Black Box Model is a system that does not reveal its internal mecha-
nisms. In machine learning, “black box” describes models that cannot
be understood by looking at their parameters (e.g. a neural network).
The opposite of a black box is sometimes referred to as White Box,
and is referred to in this book as interpretable model. Model-agnostic
methods for interpretability treat machine learning models as black
boxes, even if they are not.
Interpretable Machine Learning refers to methods and models that
make the behavior and predictions of machine learning systems un-
derstandable to humans.
A Dataset is a table with the data from which the machine learns. The
dataset contains the features and the target to predict. When used to
induce a model, the dataset is called training data.
An Instance is a row in the dataset. Other names for ‘instance’ are:
(data) point, example, observation. An instance consists of the feature
values $x^{(i)}$ and, if known, the target outcome $y_i$.
The Features are the inputs used for prediction or classification. A fea-
ture is a column in the dataset. Throughout the book, features are as-
sumed to be interpretable, meaning it is easy to understand what they
mean, like the temperature on a given day or the height of a person.
The interpretability of the features is a big assumption. But if it is hard
to understand the input features, it is even harder to understand what
the model does. The matrix with all features is called $X$, and $x^{(i)}$ is the
feature vector of a single instance. The vector of a single feature across
all instances is $x_j$, and the value of feature j for instance i is $x_j^{(i)}$.
The Target is the information the machine learns to predict. In math-
ematical formulas, the target is usually called y or 𝑦𝑖 for a single in-
stance.
A Machine Learning Task is the combination of a dataset with features
and a target. Depending on the type of the target, the task can be for ex-
ample classification, regression, survival analysis, clustering, or out-
lier detection.
The Prediction is what the machine learning model “guesses” the
target value should be, based on the given features. In this book, the
model prediction is denoted by $\hat{f}(x^{(i)})$ or $\hat{y}$.
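To connect this terminology and notation to code, here is a minimal sketch (not part of the book) with made-up numbers: the dataset is the matrix X plus the target y, an instance is a row, a feature is a column, and the fitted model produces the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Dataset: n = 4 instances (rows) and p = 2 features (columns) -> feature matrix X.
X = np.array([[25.0, 80.0],
              [10.0, 60.0],
              [18.0, 72.0],
              [30.0, 90.0]])
y = np.array([310.0, 150.0, 240.0, 380.0])   # target, one y_i per instance

x_i = X[1]        # x^(2): the feature values of the second instance
x_j = X[:, 0]     # x_1: the first feature across all instances
x_ij = X[1, 0]    # value of feature j=1 for instance i=2

model = LinearRegression().fit(X, y)   # a learner induces a model from training data
y_hat = model.predict(X)               # predictions f_hat(x^(i)) for each instance
```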
2 Interpretability
2.1 Importance of Interpretability
do I feel so sick?”. He learns that he gets sick every time he eats those
red berries. He updates his mental model and decides that the berries
caused the sickness and should therefore be avoided. When opaque
machine learning models are used in research, scientific findings re-
main completely hidden if the model only gives predictions without ex-
planations. To facilitate learning and satisfy curiosity as to why certain
predictions or behaviors are created by machines, interpretability and
explanations are crucial. Of course, humans do not need explanations
for everything that happens. For most people it is okay that they do not
understand how a computer works. Unexpected events make us curi-
ous. For example: Why is my computer shutting down unexpectedly?
Closely related to learning is the human desire to find meaning in the
world. We want to harmonize contradictions or inconsistencies be-
tween elements of our knowledge structures. “Why did my dog bite me
even though it has never done so before?” a human might ask. There
is a contradiction between the knowledge of the dog’s past behavior
and the newly made, unpleasant experience of the bite. The vet’s expla-
nation reconciles the dog owner’s contradiction: “The dog was under
stress and bit.” The more a machine’s decision affects a person’s life,
the more important it is for the machine to explain its behavior. If a
machine learning model rejects a loan application, this may be com-
pletely unexpected for the applicants. They can only reconcile this in-
consistency between expectation and reality with some kind of expla-
nation. The explanations do not actually have to fully explain the situ-
ation, but should address a main cause. Another example is algorith-
mic product recommendation. Personally, I always think about why
certain products or movies have been algorithmically recommended
to me. Often it is quite clear: Advertising follows me on the Inter-
net because I recently bought a washing machine, and I know that
in the next days I will be followed by advertisements for washing ma-
chines. Yes, it makes sense to suggest gloves if I already have a win-
ter hat in my shopping cart. The algorithm recommends this movie,
because users who liked other movies I liked also enjoyed the recom-
mended movie. Increasingly, Internet companies are adding explana-
tions to their recommendations. A good example are product recom-
mendations, which are based on frequently purchased product com-
binations:
goal. I would not fully accept my robot vacuum cleaner if it did not ex-
plain its behavior to some degree. The vacuum cleaner creates a shared
meaning of, for example, an “accident” (like getting stuck on the bath-
room carpet … again) by explaining that it got stuck instead of simply
stopping to work without comment. Interestingly, there may be a mis-
alignment between the goal of the explaining machine (create trust)
and the goal of the recipient (understand the prediction or behavior).
Perhaps the full explanation for why Doge got stuck could be that the
battery was very low, that one of the wheels is not working properly
and that there is a bug that makes the robot go to the same spot over
and over again even though there was an obstacle. These reasons (and
a few more) caused the robot to get stuck, but it only explained that
something was in the way, and that was enough for me to trust its be-
havior and get a shared meaning of that accident. By the way, Doge
got stuck in the bathroom again. We have to remove the carpets each
time before we let Doge vacuum.
FIGURE 2.2 Doge, our vacuum cleaner, got stuck. As an explanation for
the accident, Doge told us that it needs to be on an even surface.
lation with actual flu outbreaks. Ideally, models would only use causal
features because they would not be gameable.
decision rules, …) or testing how well people can predict the behav-
ior of the machine learning model from the explanations. The com-
prehensibility of the features used in the explanation should also
be considered. A complex transformation of features might be less
comprehensible than the original features.
• Certainty: Does the explanation reflect the certainty of the ma-
chine learning model? Many machine learning models only give
predictions without a statement about the model’s confidence that
the prediction is correct. If the model predicts a 4% probability of
cancer for one patient, is it as certain as the 4% probability that an-
other patient, with different feature values, received? An explana-
tion that includes the model’s certainty is very useful.
• Degree of Importance: How well does the explanation reflect the
importance of features or parts of the explanation? For example,
if a decision rule is generated as an explanation for an individual
prediction, is it clear which of the conditions of the rule was the
most important?
• Novelty: Does the explanation reflect whether a data instance to be
explained comes from a “new” region far removed from the distri-
bution of training data? In such cases, the model may be inaccu-
rate and the explanation may be useless. The concept of novelty is
related to the concept of certainty. The higher the novelty, the more
likely it is that the model will have low certainty due to lack of data.
• Representativeness: How many instances does an explanation
cover? Explanations can cover the entire model (e.g. interpretation
of weights in a linear regression model) or represent only an indi-
vidual prediction (e.g. Shapley Values).
are worth so much, I would say things like: “The decentralized, dis-
tributed, blockchain-based ledger, which cannot be controlled by a
central entity, resonates with people who want to secure their wealth,
which explains the high demand and price.” But to my grandmother I
would say: “Look, Grandma: Cryptocurrencies are a bit like computer
gold. People like and pay a lot for gold, and young people like and pay
a lot for computer gold.”
What it means for interpretable machine learning: Pay attention to
the social environment of your machine learning application and the
target audience. Getting the social part of the machine learning model
right depends entirely on your specific application. Find experts from
the humanities (e.g. psychologists and sociologists) to help you.
Explanations focus on the abnormal. People focus more on abnormal
causes to explain events (Kahneman and Tversky, 1981¹⁰). These are
causes that had a small probability but nevertheless happened. The
elimination of these abnormal causes would have greatly changed the
outcome (counterfactual explanation). Humans consider these kinds
of “abnormal” causes as good explanations. An example from Štrum-
belj and Kononenko (2011)¹¹ is: Assume we have a dataset of test situ-
ations between teachers and students. Students attend a course and
pass the course directly after successfully giving a presentation. The
teacher has the option to additionally ask the student questions to
test their knowledge. Students who cannot answer these questions
will fail the course. Students can have different levels of preparation,
which translates into different probabilities for correctly answering
the teacher’s questions (if they decide to test the student). We want
to predict whether a student will pass the course and explain our pre-
diction. The chance of passing is 100% if the teacher does not ask any
additional questions, otherwise the probability of passing depends on
the student’s level of preparation and the resulting probability of an-
swering the questions correctly.
10 Kahneman, Daniel, and Amos Tversky. “The Simulation Heuristic.” Stanford Univ CA Dept of Psychology. (1981).
11 Štrumbelj, Erik, and Igor Kononenko. “A general method for visualizing and explaining black-box regression models.” In International Conference on Adaptive and Natural Computing Algorithms, 21–30. Springer. (2011).
Scenario 1: The teacher usually asks the students additional questions
(e.g. 95 out of 100 times). A student who did not study (10% chance to
pass the question part) was not one of the lucky ones and gets addi-
tional questions that he fails to answer correctly. Why did the student
fail the course? I would say that it was the student’s fault to not study.
Scenario 2: The teacher rarely asks additional questions (e.g. 2 out of
100 times). For a student who has not studied for the questions, we
would predict a high probability of passing the course because ques-
tions are unlikely. Of course, one of the students did not prepare for
the questions, which gives him a 10% chance of passing the questions.
He is unlucky and the teacher asks additional questions that the stu-
dent cannot answer and he fails the course. What is the reason for the
failure? I would argue that now, the better explanation is “because the
teacher tested the student”. It was unlikely that the teacher would test,
so the teacher behaved abnormally.
What it means for interpretable machine learning: If one of the input
features for a prediction was abnormal in any sense (like a rare cate-
gory of a categorical feature) and the feature influenced the prediction,
it should be included in an explanation, even if other ‘normal’ features
have the same influence on the prediction as the abnormal one. An ab-
normal feature in our house price prediction example might be that
a rather expensive house has two balconies. Even if some attribution
method finds that the two balconies contribute as much to the price
difference as the above average house size, the good neighborhood or
the recent renovation, the abnormal feature “two balconies” might be
the best explanation for why the house is so expensive.
Explanations are truthful. Good explanations prove to be true in re-
ality (i.e. in other situations). But disturbingly, this is not the most
important factor for a “good” explanation. For example, selectiveness
seems to be more important than truthfulness. An explanation that
selects only one or two possible causes rarely covers the entire list of
relevant causes. Selectivity omits part of the truth. It is not true that
only one or two factors, for example, have caused a stock market crash,
but the truth is that there are millions of causes that influence millions
of people to act in such a way that in the end a crash was caused.
What it means for interpretable machine learning: The explanation
should predict the event as truthfully as possible, which in machine
learning is sometimes called fidelity. So if we say that a second bal-
cony increases the price of a house, then that also should apply to other
houses (or at least to similar houses). For humans, fidelity of an expla-
nation is not as important as its selectivity, its contrast and its social
aspect.
Good explanations are consistent with prior beliefs of the explainee.
Humans tend to ignore information that is inconsistent with their
prior beliefs. This effect is called confirmation bias (Nickerson 1998¹²).
Explanations are not spared by this kind of bias. People will tend to de-
value or ignore explanations that do not agree with their beliefs. The
set of beliefs varies from person to person, but there are also group-
based prior beliefs such as political worldviews.
What it means for interpretable machine learning: Good explana-
tions are consistent with prior beliefs. This is difficult to integrate into
machine learning and would probably drastically compromise predic-
tive performance. Our prior belief for the effect of house size on pre-
dicted price is that the larger the house, the higher the price. Let us
assume that a model also shows a negative effect of house size on the
predicted price for a few houses. The model has learned this because it
improves predictive performance (due to some complex interactions),
but this behavior strongly contradicts our prior beliefs. You can en-
force monotonicity constraints (a feature can only affect the predic-
tion in one direction) or use something like a linear model that has
this property.
Good explanations are general and probable. A cause that can explain
many events is very general and could be considered a good explana-
tion. Note that this contradicts the claim that abnormal causes make
good explanations. As I see it, abnormal causes beat general causes.
Abnormal causes are by definition rare in the given scenario. In the
absence of an abnormal event, a general explanation is considered a
good explanation. Also remember that people tend to misjudge prob-
abilities of joint events. (Joe is a librarian. Is he more likely to be a shy
person or to be a shy person who likes to read books?) A good example
is “The house is expensive because it is big”, which is a very general, good explanation.
12 Nickerson, Raymond S. “Confirmation Bias: A ubiquitous phenomenon in many guises.” Review of General Psychology 2 (2). Educational Publishing Foundation: 175. (1998).
3 Datasets
Throughout the book, all models and techniques are applied to real
datasets that are freely available online. We will use different datasets
for different tasks: Classification, regression and text classification.
3.1 Bike Rentals (Regression)
• Number of days since 01.01.2011 (the first day in the dataset). This
feature was introduced to account for the trend over time.
• Indicator whether the day was a working day or weekend.
• The weather situation on that day. One of:
– clear, few clouds, partly cloudy, cloudy
– mist + clouds, mist + broken clouds, mist + few clouds, mist
– light snow, light rain + thunderstorm + scattered clouds, light
rain + scattered clouds
– heavy rain + ice pallets + thunderstorm + mist, snow + mist
• Temperature in degrees Celsius.
• Relative humidity in percent (0 to 100).
• Wind speed in km per hour.
For the examples in this book, the data has been slightly processed.
You can find the processing R-script in the book’s Github repository
together with the final RData file.
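If you want to follow along in Python instead of R, a sketch for loading a processed version of the data could look like the following; the CSV file name is an assumption (the repository ships an RData file), while the column names mirror the features described above.

```python
import pandas as pd

# Assumed CSV export of the processed bike rental data (hypothetical file name).
bike = pd.read_csv("bike.csv")

# Target: daily count of rented bicycles; features as described above.
y = bike["cnt"]
X = bike[["season", "holiday", "workingday", "weathersit",
          "temp", "hum", "windspeed", "days_since_2011"]]
print(X.dtypes)
```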
nature. But this is not a book about missing data imputation, so the
mode imputation will have to suffice for the examples.
To reproduce the examples of this book with this dataset, find the pre-
processing R-script¹³ and the final RData file¹⁴ in the book’s Github
repository.
13 https://github.com/christophM/interpretable-ml-book/blob/master/R/get-cervical-cancer-dataset.R
14 https://github.com/christophM/interpretable-ml-book/blob/master/data/cervical.RData
4 Interpretable Models
From this table, you can select a suitable interpretable model for your
task, either regression (regr) or classification (class):
4.1 Linear Regression
A linear regression model predicts the target as a weighted sum of the features:

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \epsilon$$

The optimal weights are usually estimated by ordinary least squares, which minimizes the squared differences between the actual and the estimated outcomes:

$$\hat{\beta} = \arg\min_{\beta_0,\ldots,\beta_p} \sum_{i=1}^{n} \left(y^{(i)} - \left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j^{(i)}\right)\right)^2$$
We will not discuss in detail how the optimal weights can be found,
but if you are interested, you can read chapter 3.2 of the book “The
Elements of Statistical Learning” (Friedman, Hastie and Tibshirani 2009).
Normality
It is assumed that the target outcome given the features follows a nor-
mal distribution. If this assumption is violated, the estimated confi-
dence intervals of the feature weights are invalid.
Homoscedasticity (constant variance)
The variance of the error terms is assumed to be constant over the en-
tire feature space. Suppose you want to predict the value of a house
given the living area in square meters. You estimate a linear model that
assumes that, regardless of the size of the house, the error around the
predicted response has the same variance. This assumption is often vi-
olated in reality. In the house example, it is plausible that the variance
of error terms around the predicted price is higher for larger houses,
since prices are higher and there is more room for price fluctuations.
Suppose the average error (difference between predicted and actual
price) in your linear regression model is 50,000 Euros. If you assume
homoscedasticity, you assume that the average error of 50,000 is the
same for houses that cost 1 million and for houses that cost only 40,000.
This is unreasonable because it would mean that we can expect nega-
tive house prices.
Independence
It is assumed that each instance is independent of any other instance.
If you perform repeated measurements, such as multiple blood tests
per patient, the data points are not independent. For dependent data
you need special linear regression models, such as mixed effect models
or GEEs. If you use the “normal” linear regression model, you might
draw wrong conclusions from the model.
Fixed features
The input features are considered “fixed”. Fixed means that they are
treated as “given constants” and not as statistical variables. This im-
plies that they are free of measurement errors. This is a rather unrealis-
tic assumption. Without that assumption, however, you would have to
fit very complex measurement error models that account for the mea-
surement errors of your input features. And usually you do not want
to do that.
Absence of multicollinearity
You do not want strongly correlated features, because this messes up the estimation of the weights.
4.1.1 Interpretation
The interpretation of a weight in the linear regression model depends
on the type of the corresponding feature.
• Numerical feature: Increasing the numerical feature by one unit
changes the estimated outcome by its weight. An example of a nu-
merical feature is the size of a house.
• Binary feature: A feature that takes one of two possible values for
each instance. An example is the feature “House comes with a gar-
den”. One of the values counts as the reference category (in some
programming languages encoded with 0), such as “No garden”.
Changing the feature from the reference category to the other cat-
egory changes the estimated outcome by the feature’s weight.
• Categorical feature with multiple categories: A feature with a fixed
number of possible values. An example is the feature “floor type”,
with possible categories “carpet”, “laminate” and “parquet”. A solu-
tion to deal with many categories is the one-hot-encoding, mean-
ing that each category has its own binary column. For a categorical
feature with L categories, you only need L-1 columns, because the L-
th column would have redundant information (e.g. when columns
1 to L-1 all have value 0 for one instance, we know that the categori-
cal feature of this instance takes on category L). The interpretation
for each category is then the same as the interpretation for binary
features. Some languages, such as R, allow you to encode categori-
cal features in various ways, as described later in this chapter.
• Intercept 𝛽0 : The intercept is the feature weight for the “constant
feature”, which is always 1 for all instances. Most software packages
automatically add this “1”-feature to estimate the intercept. The in-
terpretation is: For an instance with all numerical feature values at
zero and the categorical feature values at the reference categories,
the model prediction is the intercept weight. The interpretation of
the intercept is usually not relevant because instances with all fea-
tures values at zero often make no sense. The interpretation is only
meaningful when the features have been standardised (mean of
zero, standard deviation of one). Then the intercept reflects the pre-
dicted outcome of an instance where all features are at their mean
value.
The interpretation of the features in the linear regression model can
be automated by using the following text templates.
Interpretation of a Numerical Feature
An increase of feature 𝑥𝑘 by one unit increases the prediction for y by
𝛽𝑘 units when all other feature values remain fixed.
Interpretation of a Categorical Feature
Changing feature 𝑥𝑘 from the reference category to the other category
increases the prediction for y by 𝛽𝑘 when all other features remain
fixed.
Another important measurement for interpreting linear models is the
R-squared measurement. R-squared tells you how much of the total
variance of your target outcome is explained by the model. The higher
R-squared, the better your model explains the data. The formula for
calculating R-squared is:
$$R^2 = 1 - SSE/SST$$

SSE is the squared sum of the error terms ($SSE = \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$) and SST is the total variance of the target ($SST = \sum_{i=1}^{n} (y^{(i)} - \bar{y})^2$). The SSE tells you how much variance remains after fitting the linear model. Since R-squared increases with the number of features in the model, even if they carry no information about the target, it is better to use the adjusted R-squared, which accounts for the number of features p used in the model:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

where n is the number of data instances. The importance of a feature in a linear regression model can be measured by the absolute value of its t-statistic, which is the estimated weight scaled with its standard error:

$$t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$
Let us examine what this formula tells us: The importance of a feature
increases with increasing weight. This makes sense. The more variance
the estimated weight has (= the less certain we are about the correct
value), the less important the feature is. This also makes sense.
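The book's examples are computed in R, but as a rough sketch the estimated weights, standard errors, |t| values, R-squared and adjusted R-squared can be obtained from an ordinary least squares fit, for example with statsmodels; the data below is randomly generated for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                            # three numerical features
y = 2000 + 100 * X[:, 0] - 40 * X[:, 1] + rng.normal(scale=300, size=n)

X_const = sm.add_constant(X)          # adds the constant "1"-feature for the intercept
ols = sm.OLS(y, X_const).fit()

print(ols.params)                     # estimated weights (intercept first)
print(ols.bse)                        # standard errors SE(beta_hat_j)
print(np.abs(ols.tvalues))            # |t| = |beta_hat_j / SE(beta_hat_j)|
print(ols.rsquared, ols.rsquared_adj) # R-squared and adjusted R-squared
```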
4.1.2 Example
In this example, we use the linear regression model to predict the num-
ber of rented bikes on a particular day, given weather and calendar in-
formation. For the interpretation, we examine the estimated regres-
sion weights. The features consist of numerical and categorical fea-
tures. For each feature, the table shows the estimated weight, the stan-
dard error of the estimate (SE), and the absolute value of the t-statistic
(|t|).
Weight SE |t|
(Intercept) 2399.4 238.3 10.1
seasonSPRING 899.3 122.3 7.4
seasonSUMMER 138.2 161.7 0.9
seasonFALL 425.6 110.8 3.8
holidayHOLIDAY -686.1 203.3 3.4
workingdayWORKING DAY 124.9 73.3 1.7
weathersitMISTY -379.4 87.6 4.3
weathersitRAIN/SNOW/STORM -1901.5 223.6 8.5
temp 110.7 7.0 15.7
hum -17.4 3.2 5.5
windspeed -42.5 6.9 6.2
days_since_2011 4.9 0.2 28.5
Interpretation of a numerical feature (temperature): An increase of the
temperature by 1 degree Celsius increases the predicted number of bi-
cycles by 110.7, when all other features remain fixed.
Interpretation of a categorical feature (“weathersit”): The estimated
number of bicycles is -1901.5 lower when it is raining, snowing or
stormy, compared to good weather – again assuming that all other fea-
tures do not change. When the weather is misty, the predicted number
of bicycles is -379.4 lower compared to good weather, given all other
features remain the same.
All the interpretations always come with the footnote that “all other
features remain the same”. This is because of the nature of linear re-
gression models. The predicted target is a linear combination of the
weighted features. The estimated linear equation is a hyperplane in
FIGURE 4.1 Weights are displayed as points and the 95% confidence in-
tervals as lines.
weights depend on the scale of the features and will be different if you
have a feature that measures e.g. a person’s height and you switch
from meter to centimeter. The weight will change, but the actual ef-
fects in your data will not. It is also important to know the distribution
of your feature in the data, because if you have a very low variance, it
means that almost all instances have similar contribution from this
feature. The effect plot can help you understand how much the com-
bination of weight and feature contributes to the predictions in your
data. Start by calculating the effects, which is the weight per feature
times the feature value of an instance:
$$\text{effect}_j^{(i)} = w_j x_j^{(i)}$$
FIGURE 4.2 The feature effect plot shows the distribution of effects (=
feature value times feature weight) across the data per feature.
value. For example, days with a high negative effect of windspeed are
the ones with high wind speeds.
FIGURE 4.3 The effect plot for one instance shows the effect distribution
and highlights the effects of the instance of interest.
all instances of the dataset, the crosses show the effects for the 6-th
instance. The 6-th instance has a low temperature effect because on
this day the temperature was 2 degrees, which is low compared to most
other days (and remember that the weight of the temperature feature
is positive). Also, the effect of the trend feature “days_since_2011” is
small compared to the other data instances because this instance is
from early 2011 (5 days) and the trend feature also has a positive weight.
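A small sketch (not the book's code) of how such feature effects could be computed: the weights are taken from the example table above, while the feature values are randomly generated stand-ins for the bike data.

```python
import numpy as np
import pandas as pd

feature_names = ["temp", "hum", "windspeed"]
w = np.array([110.7, -17.4, -42.5])       # weights from the example table

# Hypothetical feature values; in practice use the real training data.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 3)) * [8, 15, 5] + [15, 60, 12],
                 columns=feature_names)

# Effect of feature j for instance i: w_j * x_j^(i)
effects = X * w

# Distribution of effects per feature (what the effect plot shows as boxplots).
print(effects.describe())

# Effects for a single instance of interest, e.g. the 6th one.
print(effects.iloc[5])
```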
there are many more. The example used has six instances and a cate-
gorical feature with three categories. For the first two instances, the
feature takes category A; for instances three and four, category B; and
for the last two instances, category C.
Treatment coding
In treatment coding, the weight per category is the estimated differ-
ence in the prediction between the corresponding category and the ref-
erence category. The intercept of the linear model is the mean of the
reference category (when all other features remain the same). The first
column of the design matrix is the intercept, which is always 1. Col-
umn two indicates whether instance i is in category B, column three
indicates whether it is in category C. There is no need for a column for
category A, because then the linear equation would be overspecified
and no unique solution for the weights can be found. It is sufficient to
know that an instance is neither in category B nor C.
Feature matrix:

$$\begin{pmatrix}1&0&0\\1&0&0\\1&1&0\\1&1&0\\1&0&1\\1&0&1\end{pmatrix}$$
Effect coding
The weight per category is the estimated y-difference from the cor-
responding category to the overall mean (given all other features are
zero or the reference category). The first column is used to estimate
the intercept. The weight 𝛽0 associated with the intercept represents
the overall mean and 𝛽1 , the weight for column two, is the difference
between the overall mean and category B. The total effect of category
B is 𝛽0 + 𝛽1 . The interpretation for category C is equivalent. For the
reference category A, −(𝛽1 + 𝛽2 ) is the difference to the overall mean
and 𝛽0 − (𝛽1 + 𝛽2 ) the overall effect.
Feature matrix:

$$\begin{pmatrix}1&-1&-1\\1&-1&-1\\1&1&0\\1&1&0\\1&0&1\\1&0&1\end{pmatrix}$$
Dummy coding
The 𝛽 per category is the estimated mean value of y for each category
(given all other feature values are zero or the reference category). Note
that the intercept has been omitted here so that a unique solution can
be found for the linear model weights. Another way to mitigate this
multicollinearity problem is to leave out one of the categories.
Feature matrix:

$$\begin{pmatrix}1&0&0\\1&0&0\\0&1&0\\0&1&0\\0&0&1\\0&0&1\end{pmatrix}$$
If you want to dive a little deeper into the different encodings of cate-
gorical features, check out this overview webpage and this blog post.
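As a rough illustration (not taken from the book), the three design matrices for the six-instance example with categories A, A, B, B, C, C could be built in Python like this:

```python
import numpy as np
import pandas as pd

cat = pd.Series(["A", "A", "B", "B", "C", "C"], name="category")

# Treatment coding: intercept column plus indicators for B and C (A is the reference).
treatment = pd.DataFrame({
    "intercept": 1,
    "B": (cat == "B").astype(int),
    "C": (cat == "C").astype(int),
})

# Effect coding: like treatment coding, but the reference category A gets -1
# in both contrast columns.
effect = treatment.copy()
effect.loc[cat == "A", ["B", "C"]] = -1

# Dummy coding without intercept: one indicator column per category.
dummy = pd.get_dummies(cat).astype(int)

print(treatment.values)   # matches the treatment coding matrix above
print(effect.values)      # matches the effect coding matrix above
print(dummy.values)       # matches the dummy coding matrix above
```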
mean centered (feature minus mean of feature) and all categorical fea-
tures are effect coded, the reference instance is the data point where
all the features take on the mean feature value. This might also be a
non-existent data point, but it might at least be more likely or more
meaningful. In this case, the weights times the feature values (feature
effects) explain the contribution to the predicted outcome contrastive
to the “mean-instance”. Another aspect of a good explanation is selec-
tivity, which can be achieved in linear models by using less features
or by training sparse linear models. But by default, linear models do
not create selective explanations. Linear models create truthful expla-
nations, as long as the linear equation is an appropriate model for the
relationship between features and outcome. The more non-linearities
and interactions there are, the less accurate the linear model will be
and the less truthful the explanations become. Linearity makes the ex-
planations more general and simpler. The linear nature of the model, I
believe, is the main factor why people use linear models for explaining
relationships.
Lasso is a convenient way to introduce sparsity into the linear regression model. Normally, the weights of a linear model are estimated by minimizing the mean squared error:

$$\min_{\beta}\left(\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - x_i^{T}\beta\right)^2\right)$$

Lasso adds a penalty term to this optimization problem:

$$\min_{\beta}\left(\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - x_i^{T}\beta\right)^2 + \lambda||\beta||_1\right)$$
The term ||𝛽||1 , the L1-norm of the feature vector, leads to a penaliza-
tion of large weights. Since the L1-norm is used, many of the weights
receive an estimate of 0 and the others are shrunk. The parameter
lambda (𝜆) controls the strength of the regularizing effect and is usu-
ally tuned by cross-validation. Especially when lambda is large, many
weights become 0. The feature weights can be visualized as a function
of the penalty term lambda. Each feature weight is represented by a
curve in the following figure.
What value should we choose for lambda? If you see the penalization
term as a tuning parameter, then you can find the lambda that min-
imizes the model error with cross-validation. You can also consider
lambda as a parameter to control the interpretability of the model. The
larger the penalization, the fewer features are present in the model
(because their weights are zero) and the better the model can be inter-
preted.
Example with Lasso
We will predict bicycle rentals using Lasso. We set the number of fea-
tures we want to have in the model beforehand. Let us first set the num-
ber to 2 features:
FIGURE 4.4 With increasing penalty of the weights, fewer and fewer fea-
tures receive a non-zero weight estimate. These curves are also called
regularization paths. The number above the plot is the number of non-
zero weights.
Weight
seasonWINTER 0.00
seasonSPRING 0.00
seasonSUMMER 0.00
seasonFALL 0.00
holidayHOLIDAY 0.00
workingdayWORKING DAY 0.00
weathersitMISTY 0.00
weathersitRAIN/SNOW/STORM 0.00
temp 52.33
hum 0.00
windspeed 0.00
days_since_2011 2.15
The first two features with non-zero weights in the Lasso path are tem-
perature (“temp”) and the time trend (“days_since_2011”).
Now, let us select 5 features:
Weight
seasonWINTER -389.99
seasonSPRING 0.00
seasonSUMMER 0.00
seasonFALL 0.00
holidayHOLIDAY 0.00
workingdayWORKING DAY 0.00
weathersitMISTY 0.00
weathersitRAIN/SNOW/STORM -862.27
temp 85.58
hum -3.04
windspeed 0.00
days_since_2011 3.82
Note that the weights for “temp” and “days_since_2011” differ from
the model with two features. The reason for this is that by decreas-
ing lambda even features that are already “in” the model are penalized
less and may get a larger absolute weight. The interpretation of the
Lasso weights corresponds to the interpretation of the weights in the
linear regression model. You only need to pay attention to whether the
features are standardized or not, because this affects the weights. In
this example, the features were standardized by the software, but the
weights were automatically transformed back for us to match the orig-
inal feature scales.
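As a sketch of the idea in Python (the book's own examples use R), scikit-learn's Lasso can be fitted over a grid of penalty values (lambda is called alpha there) to watch weights shrink to exactly zero; the data below is synthetic and the features are standardized first, since the L1 penalty is scale-sensitive.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))                       # six synthetic features
y = 4500 + 900 * X[:, 0] + 300 * X[:, 1] + rng.normal(scale=800, size=300)

X_std = StandardScaler().fit_transform(X)

for alpha in [1, 10, 100, 500]:                     # larger alpha = stronger penalization
    lasso = Lasso(alpha=alpha).fit(X_std, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha}: {n_nonzero} non-zero weights,",
          np.round(lasso.coef_, 1))
```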
Other methods for sparsity in linear models
A wide spectrum of methods can be used to reduce the number of fea-
tures in a linear model.
Pre-processing methods:
• Manually selected features: You can always use expert knowledge to
select or discard some features. The big drawback is that it cannot
be automated, and you need access to people who understand the data.
4.1.8 Advantages
The modeling of the predictions as a weighted sum makes it transpar-
ent how predictions are produced. And with Lasso we can ensure that
the number of features used remains small.
Many people use linear regression models. This means that in many
places it is accepted for predictive modeling and doing inference.
There is a high level of collective experience and expertise, including
teaching materials on linear regression models and software implementations.
4.1.9 Disadvantages
Linear regression models can only represent linear relationships, i.e. a
weighted sum of the input features. Each nonlinearity or interaction
has to be hand-crafted and explicitly given to the model as an input
feature.
Linear models are also often not that good regarding predictive per-
formance, because the relationships that can be learned are so re-
stricted and usually oversimplify how complex reality is.
The interpretation of a weight can be unintuitive because it depends
on all other features. A feature with high positive correlation with the
outcome y and another feature might get a negative weight in the lin-
ear model, because, given the other correlated feature, it is negatively
correlated with y in the high-dimensional space. Completely corre-
lated features make it even impossible to find a unique solution for the
linear equation. An example: You have a model to predict the value of
a house and have features like number of rooms and size of the house.
House size and number of rooms are highly correlated: the bigger a
house is, the more rooms it has. If you take both features into a linear
model, it might happen that the size of the house is the better pre-
dictor and gets a large positive weight. The number of rooms might
end up getting a negative weight, because, given that a house has the
same size, increasing the number of rooms could make it less valuable
or the linear equation becomes less stable, when the correlation is too
strong.
4.2 Logistic Regression
tion of a class with a higher number, even if classes that happen to get
a similar number are not closer than other classes.
4.2.2 Theory
A solution for classification is logistic regression. Instead of fitting a
straight line or hyperplane, the logistic regression model uses the lo-
gistic function to squeeze the output of a linear equation between 0
and 1. The logistic function is defined as:
1
logistic(𝜂) =
1 + 𝑒𝑥𝑝(−𝜂)
(𝑖) (𝑖)
𝑦(𝑖)
̂ = 𝛽0 + 𝛽1 𝑥1 + … + 𝛽𝑝 𝑥𝑝
1
𝑃 (𝑦(𝑖) = 1) = (𝑖) (𝑖)
1 + 𝑒𝑥𝑝(−(𝛽0 + 𝛽1 𝑥1 + … + 𝛽𝑝 𝑥𝑝 ))
Let us revisit the tumor size example again. But instead of the linear
regression model, we use the logistic regression model:
FIGURE 4.7 The logistic regression model finds the correct decision
boundary between malignant and benign depending on tumor size.
The line is the logistic function shifted and squeezed to fit the data.
Classification works better with logistic regression, and we can use 0.5
as a threshold for classifying into the two classes.
4.2.3 Interpretation
The interpretation of the weights in logistic regression differs from the
interpretation of the weights in linear regression, since the outcome
in logistic regression is a probability between 0 and 1. The weights do
not influence the probability linearly any longer. The weighted sum
is transformed by the logistic function to a probability. Therefore we
need to reformulate the equation for the interpretation so that only
the linear term is on the right side of the formula.
$$\ln\left(\frac{P(y=1)}{1 - P(y=1)}\right) = \log\left(\frac{P(y=1)}{P(y=0)}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$
We call the term in the ln() function “odds” (probability of event divided
by probability of no event) and wrapped in the logarithm it is called log
odds.
This formula shows that the logistic regression model is a linear model
for the log odds. Great! That does not sound helpful! With a little shuf-
fling of the terms, you can figure out how the prediction changes when
one of the features 𝑥𝑗 is changed by 1 unit. To do this, we can first apply
the exp() function to both sides of the equation:
𝑃 (𝑦 = 1)
= 𝑜𝑑𝑑𝑠 = 𝑒𝑥𝑝 (𝛽0 + 𝛽1 𝑥1 + … + 𝛽𝑝 𝑥𝑝 )
1 − 𝑃 (𝑦 = 1)
𝑒𝑥𝑝(𝑎)
= 𝑒𝑥𝑝(𝑎 − 𝑏)
𝑒𝑥𝑝(𝑏)
𝑜𝑑𝑑𝑠𝑥𝑗 +1
= 𝑒𝑥𝑝 (𝛽𝑗 (𝑥𝑗 + 1) − 𝛽𝑗 𝑥𝑗 ) = 𝑒𝑥𝑝 (𝛽𝑗 )
𝑜𝑑𝑑𝑠
In the end, we have something as simple as exp() of a feature weight. A
change in a feature by one unit changes the odds ratio (multiplicative)
by a factor of exp(𝛽𝑗 ). We could also interpret it this way: A change in
𝑥𝑗 by one unit increases the log odds ratio by the value of the corre-
sponding weight. Most people interpret the odds ratio because think-
ing about the ln() of something is known to be hard on the brain. In-
terpreting the odds ratio already requires some getting used to. For
example, if you have odds of 2, it means that the probability for y=1
76 4 Interpretable Models
is twice as high as y=0. If you have a weight (= log odds ratio) of 0.7,
then increasing the respective feature by one unit multiplies the odds
by exp(0.7) (approximately 2) and the odds change to 4. But usually you
do not deal with the odds and interpret the weights only as the odds
ratios. Because for actually calculating the odds you would need to set
a value for each feature, which only makes sense if you want to look at
one specific instance of your dataset.
These are the interpretations for the logistic regression model with dif-
ferent feature types:
• Numerical feature: If you increase the value of feature 𝑥𝑗 by one
unit, the estimated odds change by a factor of exp(𝛽𝑗 )
• Binary categorical feature: One of the two values of the feature is
the reference category (in some languages, the one encoded in 0).
Changing the feature 𝑥𝑗 from the reference category to the other
category changes the estimated odds by a factor of exp(𝛽𝑗 ).
• Categorical feature with more than two categories: One solution to
deal with multiple categories is one-hot-encoding, meaning that
each category has its own column. You only need L-1 columns
for a categorical feature with L categories, otherwise it is over-
parameterized. The L-th category is then the reference category.
You can use any other encoding that can be used in linear regres-
sion. The interpretation for each category then is equivalent to the
interpretation of binary features.
• Intercept 𝛽0 : When all numerical features are zero and the categor-
ical features are at the reference category, the estimated odds are
exp(𝛽0 ). The interpretation of the intercept weight is usually not
relevant.
4.2.4 Example
We use the logistic regression model to predict cervical cancer based
on some risk factors. The following table shows the estimate weights,
the associated odds ratios, and the standard error of the estimates.
Interpretation of a numerical feature (“Num. of diagnosed STDs”): An
increase in the number of diagnosed STDs (sexually transmitted dis-
eases) changes (increases) the odds of cancer vs. no cancer by a factor
4.2 Logistic Regression 77
TABLE 4.2 The results of fitting a logistic regression model on the cer-
vical cancer dataset. Shown are the features used in the model, their
estimated weights and corresponding odds ratios, and the standard
errors of the estimated weights.
of 2.27, when all other features remain the same. Keep in mind that
correlation does not imply causation.
Interpretation of a categorical feature (“Hormonal contraceptives
y/n”): For women using hormonal contraceptives, the odds for cancer
vs. no cancer are by a factor of 0.89 lower, compared to women without
hormonal contraceptives, given all other features stay the same.
Like in the linear model, the interpretations always come with the
clause that ‘all other features stay the same’.
that feature would not converge, because the optimal weight would be
infinite. This is really a bit unfortunate, because such a feature is re-
ally useful. But you do not need machine learning if you have a simple
rule that separates both classes. The problem of complete separation
can be solved by introducing penalization of the weights or defining a
prior probability distribution of weights.
On the good side, the logistic regression model is not only a classi-
fication model, but also gives you probabilities. This is a big advan-
tage over models that can only provide the final classification. Know-
ing that an instance has a 99% probability for a class compared to 51%
makes a big difference.
Logistic regression can also be extended from binary classification to
multi-class classification. Then it is called Multinomial Regression.
4.2.6 Software
I used the glm function in R for all examples. You can find logistic re-
gression in any programming language that can be used for perform-
ing data analysis, such as Python, Java, Stata, Matlab, …
4.3 GLM, GAM and more 79
𝑦 = 𝛽0 + 𝛽1 𝑥1 + … + 𝛽𝑝 𝑥𝑝 + 𝜖
FIGURE 4.8 Three assumptions of the linear model (left side): Gaussian
distribution of the outcome given the features, additivity (= no interac-
tions) and linear relationship. Reality usually does not adhere to those
assumptions (right side): Outcomes might have non-Gaussian distri-
butions, features might interact and the relationship might be nonlin-
ear.
𝑔(𝐸𝑌 (𝑦|𝑥)) = 𝛽0 + 𝛽1 𝑥1 + … 𝛽𝑝 𝑥𝑝
terized by the mean and the variance parameters. The mean describes
the value that we expect on average and the variance describes how
much the values vary around this mean. In the linear model, the link
function links the weighted sum of the features to the mean of the
Gaussian distribution.
Under the GLM framework, this concept generalizes to any distribu-
tion (from the exponential family) and arbitrary link functions. If y is
a count of something, such as the number of coffees someone drinks
on a certain day, we could model it with a GLM with a Poisson distri-
bution and the natural logarithm as the link function:
𝑙𝑛(𝐸𝑌 (𝑦|𝑥)) = 𝑥𝑇 𝛽
𝐸𝑌 (𝑦|𝑥) 𝑃 (𝑦 = 1|𝑥)
𝑥𝑇 𝛽 = 𝑙𝑛 ( ) = 𝑙𝑛 ( )
1 − 𝐸𝑌 (𝑦|𝑥) 1 − 𝑃 (𝑦 = 1|𝑥)
And if we solve this equation to have P(y=1) on one side, we get the lo-
gistic regression formula:
1
𝑃 (𝑦 = 1) =
1 + 𝑒𝑥𝑝(−𝑥𝑇 𝛽)
that are outside the domain of the exponential distribution. Since you
can choose any link function, the simple solution is to choose another
function that respects the domain of the distribution.
Examples
I have simulated a dataset on coffee drinking behavior to highlight the
need for GLMs. Suppose you have collected data about your daily cof-
fee drinking behavior. If you do not like coffee, pretend it is about tea.
Along with number of cups, you record your current stress level on a
scale of 1 to 10, how well you slept the night before on a scale of 1 to
10 and whether you had to work on that day. The goal is to predict
the number of coffees given the features stress, sleep and work. I sim-
ulated data for 200 days. Stress and sleep were drawn uniformly be-
tween 1 and 10 and work yes/no was drawn with a 50/50 chance (what
a life!). For each day, the number of coffees was then drawn from a
Poisson distribution, modelling the intensity 𝜆 (which is also the ex-
pected value of the Poisson distribution) as a function of the features
sleep, stress and work. You can guess where this story will lead: “Hey,
let us model this data with a linear model … Oh it does not work … Let us try a
GLM with Poisson distribution … SURPRISE! Now it works!”. I hope I did
not spoil the story too much for you.
Let us look at the distribution of the target variable, the number of
coffees on a given day:
On 76 of the 200 days you had no coffee at all and on the most extreme
day you had 7. Let us naively use a linear model to predict the number
of coffees using sleep level, stress level and work yes/no as features.
What can go wrong when we falsely assume a Gaussian distribution?
A wrong assumption can invalidate the estimates, especially the con-
fidence intervals of the weights. A more obvious problem is that the
predictions do not match the “allowed” domain of the true outcome,
as the following figure shows.
The linear model does not make sense, because it predicts negative
number of coffees. This problem can be solved with Generalized Linear
Models (GLMs). We can change the link function and the assumed dis-
tribution. One possibility is to keep the Gaussian distribution and use
a link function that always leads to positive predictions such as the log-
4.3 GLM, GAM and more 85
To interpret the weights we invert the link function so that we can in-
terpret the effect of the features on the expected outcome and not on
the logarithm of the expected outcome.
𝐸(coffee|str, slp, wrk) = 𝑒𝑥𝑝(𝛽0 + 𝛽str 𝑥str + 𝛽slp 𝑥slp + 𝛽wrk 𝑥wrk )
Since all the weights are in the exponential function, the effect in-
terpretation is not additive, but multiplicative, because exp(a + b) is
exp(a) times exp(b). The last ingredient for the interpretation is the ac-
tual weights of the toy example. The following table lists the estimated
weights and exp(weights) together with the 95% confidence interval:
4.3 GLM, GAM and more 87
4.3.2 Interactions
The linear regression model assumes that the effect of one feature is
the same regardless of the values of the other features (= no interac-
tions). But often there are interactions in the data. To predict the num-
ber of bicycles rented, there may be an interaction between tempera-
ture and whether it is a working day or not. Perhaps, when people have
to work, the temperature does not influence the number of rented
bikes much, because people will ride the rented bike to work no mat-
ter what happens. On days off, many people ride for pleasure, but only
when it is warm enough. When it comes to rental bicycles, you might
expect an interaction between temperature and working day.
How can we get the linear model to include interactions? Before you fit
the linear model, add a column to the feature matrix that represents
the interaction between the features and fit the model as usual. The
solution is elegant in a way, since it does not require any change of
the linear model, only additional columns in the data. In the working
day and temperature example, we would add a new feature that has
zeros for no-work days, otherwise it has the value of the temperature
feature, assuming that working day is the reference category. Suppose
our data looks like this:
work temp
Y 25
N 12
N 30
Y 5
The data matrix used by the linear model looks slightly different. The
following table shows what the data prepared for the model looks like
if we do not specify any interactions. Normally, this transformation is
performed automatically by any statistical software.
4.3 GLM, GAM and more 89
work wthr
Y 2
N 0
N 1
Y 2
Next, we include interaction terms:
Intercept workY wthr1 wthr2 workY.wthr1 workY.wthr2
1 1 0 1 0 1
1 0 0 0 0 0
1 0 1 0 0 0
1 1 0 1 0 1
The first column serves to estimate the intercept. The second column is
the encoded work feature. Columns three and four are for the weather
feature, which requires two columns because you need two weights to
capture the effect for three categories, one of which is the reference
category. The rest of the columns capture the interactions. For each
category of both features (except for the reference categories), we cre-
ate a new feature column that is 1 if both features have a certain cate-
gory, otherwise 0.
For two numerical features, the interaction column is even easier to
construct: We simply multiply both numerical features.
There are approaches to automatically detect and add interaction
terms. One of them can be found in the RuleFit chapter. The RuleFit
algorithm first mines interaction terms and then estimates a linear
regression model including interactions.
Example
Let us return to the bike rental prediction task which we have already
modeled in the linear model chapter. This time, we additionally con-
sider an interaction between the temperature and the working day fea-
ture. This results in the following estimated weights and confidence
intervals.
4.3 GLM, GAM and more 91
two slopes for the temperature instead of one. The temperature slope
for days on which people do not have to work (‘NO WORKING DAY’)
can be read directly from the table (125.4). The temperature slope for
days on which people have to work (‘WORKING DAY’) is the sum of
both temperature weights (125.4 -21.8 = 103.6). The intercept of the ‘NO
WORKING DAY’-line at temperature = 0 is determined by the inter-
cept term of the linear model (2185.8). The intercept of the ‘WORKING
DAY’-line at temperature = 0 is determined by the intercept term + the
effect of working day (2185.8 + 451.9 = 2637.7).
FIGURE 4.13 Predicting the number of rented bicycles using only the
temperature feature. A linear model (top left) does not fit the data well.
One solution is to transform the feature with e.g. the logarithm (top
right), categorize it (bottom left), which is usually a bad decision, or
use Generalized Additive Models that can automatically fit a smooth
curve for temperature (bottom right).
temperature feature into 20 intervals with the levels [-10, -5), [-5, 0), …
and so on. When you use the categorized temperature instead of the
continuous temperature, the linear model would estimate a step func-
tion because each level gets its own estimate. The problem with this
approach is that it needs more data, it is more likely to overfit and it
is unclear how to discretize the feature meaningfully (equidistant in-
tervals or quantiles? how many intervals?). I would only use discretiza-
tion if there is a very strong case for it. For example, to make the model
comparable to another study.
Generalized Additive Models (GAMs)
Why not ‘simply’ allow the (generalized) linear model to learn nonlin-
ear relationships? That is the motivation behind GAMs. GAMs relax
the restriction that the relationship must be a simple weighted sum,
and instead assume that the outcome can be modeled by a sum of ar-
bitrary functions of each feature. Mathematically, the relationship in
a GAM looks like this:
The formula is similar to the GLM formula with the difference that the
linear term 𝛽𝑗 𝑥𝑗 is replaced by a more flexible function 𝑓𝑗 (𝑥𝑗 ). The core
of a GAM is still a sum of feature effects, but you have the option to
allow nonlinear relationships between some features and the output.
Linear effects are also covered by the framework, because for features
to be handled linearly, you can limit their 𝑓𝑗 (𝑥𝑗 ) only to take the form
of 𝑥𝑗 𝛽𝑗 .
The big question is how to learn nonlinear functions. The answer is
called “splines” or “spline functions”. Splines are functions that can be
combined in order to approximate arbitrary functions. A bit like stack-
ing Lego bricks to build something more complex. There is a confus-
ing number of ways to define these spline functions. If you are inter-
ested in learning more about all the ways to define splines, I wish you
good luck on your journey. I am not going to go into details here, I am
just going to build an intuition. What personally helped me the most
for understanding splines was to visualize the individual spline func-
tions and to look into how the data matrix is modified. For example, to
96 4 Interpretable Models
And the actual curve, which results from the sum of the spline func-
tions weighted with the estimated weights, looks like this:
The interpretation of smooth effects requires a visual check of the fit-
ted curve. Splines are usually centered around the mean prediction,
so a point on the curve is the difference to the mean prediction. For ex-
ample, at 0 degrees Celsius, the predicted number of bicycles is 3000
lower than the average prediction.
4.3.4 Advantages
All these extensions of the linear model are a bit of a universe in them-
selves. Whatever problems you face with linear models, you will prob-
ably find an extension that fixes it.
Most methods have been used for decades. For example, GAMs are al-
98 4 Interpretable Models
4.3.5 Disadvantages
As advantage I have said that linear models live in their own universe.
The sheer number of ways you can extend the simple linear model is
overwhelming, not just for beginners. Actually, there are multiple par-
allel universes, because many communities of researchers and prac-
titioners have their own names for methods that do more or less the
same thing, which can be very confusing.
Most modifications of the linear model make the model less inter-
pretable. Any link function (in a GLM) that is not the identity function
complicates the interpretation; interactions also complicate the inter-
pretation; nonlinear feature effects are either less intuitive (like the log
transformation) or can no longer be summarized by a single number
(e.g. spline functions).
GLMs, GAMs and so on rely on assumptions about the data generating
process. If those are violated, the interpretation of the weights is no
longer valid.
The performance of tree-based ensembles like the random forest or
gradient tree boosting is in many cases better than the most sophisti-
cated linear models. This is partly my own experience and partly obser-
vations from the winning models on platforms like kaggle.com.
4.3.6 Software
All examples in this chapter were created using the R language. For
GAMs, the gam package was used, but there are many others. R has an
incredible number of packages to extend linear regression models. Un-
surpassed by any other analytics language, R is home to every conceiv-
able extension of the linear regression model extension. You will find
implementations of e.g. GAMs in Python (such as pyGAM6 ), but these
implementations are not as mature.
6
https://github.com/dswah/pyGAM
100 4 Interpretable Models
The Poisson model is also a GLM. You might also have the problem that
the count value of 0 is very frequent.
Search for zero-inflated Poisson regression, hurdle model.
I am not sure what features need to be included in the model to draw
correct causal conclusions.
For example, I want to know the effect of a drug on the blood pressure.
The drug has a direct effect on some blood value and this blood value
affects the outcome. Should I include the blood value into the regres-
sion model?
Search for causal inference, mediation analysis.
I have missing data.
Search for multiple imputation.
I want to integrate prior knowledge into my models.
Search for Bayesian inference.
I am feeling a bit down lately.
Search for “Amazon Alexa Gone Wild!!! Full version from beginning
to end”.
102 4 Interpretable Models
𝑀
̂ = ∑ 𝑐 𝐼{𝑥 ∈ 𝑅 }
𝑦 ̂ = 𝑓(𝑥) 𝑚 𝑚
𝑚=1
Each instance falls into exactly one leaf node (=subset 𝑅𝑚 ). 𝐼{𝑥∈𝑅𝑚 } is
the identity function that returns 1 if 𝑥 is in the subset 𝑅𝑚 and 0 oth-
erwise. If an instance falls into a leaf node 𝑅𝑙 , the predicted outcome
is 𝑦 ̂ = 𝑐𝑙 , where 𝑐𝑙 is the average of all training instances in leaf node
𝑅𝑙 .
7
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “The elements of sta-
tistical learning”. www.web.stanford.edu/~hastie/ElemStatLearn/ (2009).
4.4 Decision Tree 103
FIGURE 4.16 Decision tree with artificial data. Instances with a value
greater than 3 for feature x1 end up in node 5. All other instances are
assigned to node 3 or node 4, depending on whether values of feature
x2 exceed 1.
But where do the subsets come from? This is quite simple: CART takes
a feature and determines which cut-off point minimizes the variance
of y for a regression task or the Gini index of the class distribution of
y for classification tasks. The variance tells us how much the y values
in a node are spread around their mean value. The Gini index tells us
how “impure” a node is, e.g. if all classes have the same frequency, the
node is impure, if only one class is present, it is maximally pure. Vari-
ance and Gini index are minimized when the data points in the nodes
have very similar values for y. As a consequence, the best cut-off point
makes the two resulting subsets as different as possible with respect
to the target outcome. For categorical features, the algorithm tries to
create subsets by trying different groupings of categories. After the
best cutoff per feature has been determined, the algorithm selects the
feature for splitting that would result in the best partition in terms of
104 4 Interpretable Models
the variance or Gini index and adds this split to the tree. The algorithm
continues this search-and-split recursively in both new nodes until a
stop criterion is reached. Possible criteria are: A minimum number of
instances that have to be in a node before the split, or the minimum
number of instances that have to be in a terminal node.
4.4.1 Interpretation
The interpretation is simple: Starting from the root node, you go to
the next nodes and the edges tell you which subsets you are looking at.
Once you reach the leaf node, the node tells you the predicted outcome.
All the edges are connected by ‘AND’.
Template: If feature x is [smaller/bigger] than threshold c AND … then
the predicted outcome is the mean value of y of the instances in that
node.
Feature importance
The overall importance of a feature in a decision tree can be computed
in the following way: Go through all the splits for which the feature was
used and measure how much it has reduced the variance or Gini index
compared to the parent node. The sum of all importances is scaled to
100. This means that each importance can be interpreted as share of
the overall model importance.
Tree decomposition
Individual predictions of a decision tree can be explained by decom-
posing the decision path into one component per feature. We can track
a decision through the tree and explain a prediction by the contribu-
tions added at each decision node.
The root node in a decision tree is our starting point. If we were to use
the root node to make predictions, it would predict the mean of the
outcome of the training data. With the next split, we either subtract or
add a term to this sum, depending on the next node in the path. To get
to the final prediction, we have to follow the path of the data instance
that we want to explain and keep adding to the formula.
4.4 Decision Tree 105
𝐷 𝑝
̂ = 𝑦 ̄ + ∑ split.contrib(d,x) = 𝑦 ̄ + ∑ feat.contrib(j,x)
𝑓(𝑥)
𝑑=1 𝑗=1
4.4.2 Example
Let us have another look at the bike rental data. We want to predict
the number of rented bikes on a certain day with a decision tree. The
learned tree looks like this:
The first split and one of the second splits were performed with the
trend feature, which counts the days since data collection began and
covers the trend that the bike rental service has become more popu-
lar over time. For days prior to the 105th day, the predicted number of
bicycles is around 1800, between the 106th and 430th day it is around
3900. For days after the 430th day, the prediction is either 4600 (if tem-
perature is below 12 degrees) or 6600 (if temperature is above 12 de-
grees).
The feature importance tells us how much a feature helped to improve
the purity of all nodes. Here, the variance was used, since predicting
bicycle rentals is a regression task.
The visualized tree shows that both temperature and time trend were
used for the splits, but does not quantify which feature was more im-
portant. The feature importance measure shows that the time trend is
far more important than temperature.
106 4 Interpretable Models
FIGURE 4.17 Regression tree fitted on the bike rental data. The maxi-
mum allowed depth for the tree was set to 2. The trend feature (days
since 2011) and the temperature (temp) have been selected for the
splits. The boxplots show the distribution of bicycle counts in the ter-
minal node.
4.4.3 Advantages
The tree structure is ideal for capturing interactions between features
in the data.
The data ends up in distinct groups that are often easier to understand
than points on a multi-dimensional hyperplane as in linear regression.
The interpretation is arguably pretty simple.
The tree structure also has a natural visualization, with its nodes and
edges.
Trees create good explanations as defined in the chapter on “Human-
Friendly Explanations”. The tree structure automatically invites to
think about predicted values for individual instances as counterfactu-
als: “If a feature had been greater / smaller than the split point, the
4.4 Decision Tree 107
4.4.4 Disadvantages
Trees fail to deal with linear relationships. Any linear relationship be-
tween an input feature and the outcome has to be approximated by
splits, creating a step function. This is not efficient.
This goes hand in hand with lack of smoothness. Slight changes in the
input feature can have a big impact on the predicted outcome, which
is usually not desirable. Imagine a tree that predicts the value of a
house and the tree uses the size of the house as one of the split fea-
ture. The split occurs at 100.5 square meters. Imagine user of a house
price estimator using your decision tree model: They measure their
house, come to the conclusion that the house has 99 square meters,
enter it into the price calculator and get a prediction of 200 000 Euro.
The users notice that they have forgotten to measure a small storage
room with 2 square meters. The storage room has a sloping wall, so
they are not sure whether they can count all of the area or only half of
it. So they decide to try both 100.0 and 101.0 square meters. The results:
The price calculator outputs 200 000 Euro and 205 000 Euro, which is
rather unintuitive, because there has been no change from 99 square
meters to 100.
Trees are also quite unstable. A few changes in the training dataset can
create a completely different tree. This is because each split depends
on the parent split. And if a different feature is selected as the first
split feature, the entire tree structure changes. It does not create con-
fidence in the model if the structure changes so easily.
Decision trees are very interpretable – as long as they are short. The
number of terminal nodes increases quickly with depth. The more ter-
minal nodes and the deeper the tree, the more difficult it becomes to
understand the decision rules of a tree. A depth of 1 means 2 termi-
nal nodes. Depth of 2 means max. 4 nodes. Depth of 3 means max. 8
nodes. The maximum number of terminal nodes in a tree is 2 to the
power of the depth.
4.4.5 Software
For the examples in this chapter, I used the rpart R package that
implements CART (classification and regression trees). CART is im-
4.4 Decision Tree 109
8
https://scikit-learn.org/stable/modules/tree.html
9
https://cran.r-project.org/web/views/MachineLearning.html
110 4 Interpretable Models
Let’s start with the simplest approach: Using the single best feature to
learn rules.
that carries the most information about the outcome of interest and
creates decision rules from this feature.
Despite the name OneR, which stands for “One Rule”, the algorithm
generates more than one rule: It is actually one rule per unique feature
value of the selected best feature. A better name would be OneFeature-
Rules.
The algorithm is simple and fast:
OneR always covers all instances of the dataset, since it uses all levels
of the selected feature. Missing values can be either treated as an ad-
ditional feature value or be imputed beforehand.
A OneR model is a decision tree with only one split. The split is not
necessarily binary as in CART, but depends on the number of unique
feature values.
Let us look at an example how the best feature is chosen by OneR.
The following table shows an artificial dataset about houses with in-
formation about its value, location, size and whether pets are allowed.
We are interested in learning a simple model to predict the value of a
house.
114 4 Interpretable Models
the location feature is 4/10, for the size feature it is 3/10 and for the pet
feature it is 4/10 . The size feature produces the rules with the lowest
error and will be used for the final OneR model:
IF size=small THEN value=low
IF size=medium THEN value=medium
IF size=big THEN value=high
OneR prefers features with many possible levels, because those fea-
tures can overfit the target more easily. Imagine a dataset that con-
tains only noise and no signal, which means that all features take
on random values and have no predictive value for the target. Some
features have more levels than others. The features with more levels
can now more easily overfit. A feature that has a separate level for
each instance from the data would perfectly predict the entire train-
ing dataset. A solution would be to split the data into training and val-
idation sets, learn the rules on the training data and evaluate the total
error for choosing the feature on the validation set.
Ties are another issue, i.e. when two features result in the same total
error. OneR solves ties by either taking the first feature with the lowest
error or the one with the lowest p-value of a chi-squared test.
Example
Let us try OneR with real data. We use the cervical cancer classification
task to test the OneR algorithm. All continuous input features were
discretized into their 5 quantiles. The following rules are created:
Age prediction
(12.9,27.2] Healthy
(27.2,41.4] Healthy
(41.4,55.6] Healthy
(55.6,69.8] Healthy
(69.8,84.1] Healthy
The age feature was chosen by OneR as the best predictive feature.
Since cancer is rare, for each rule the majority class and therefore the
predicted label is always Healthy, which is rather unhelpful. It does
not make sense to use the label prediction in this unbalanced case. The
116 4 Interpretable Models
houses from size, location and whether pets are allowed. We learn the
first rule, which turns out to be: If size=big and location=good, then
value=high. Then we remove all big houses in good locations from the
dataset. With the remaining data we learn the next rule. Maybe: If lo-
cation=good, then value=medium. Note that this rule is learned on data
without big houses in good locations, leaving only medium and small
houses in good locations.
For multi-class settings, the approach must be modified. First, the
classes are ordered by increasing prevalence. The sequential covering
algorithm starts with the least common class, learns a rule for it, re-
moves all covered instances, then moves on to the second least com-
mon class and so on. The current class is always treated as the positive
class and all classes with a higher prevalence are combined in the neg-
4.5 Decision Rules 119
ative class. The last class is the default rule. This is also referred to as
one-versus-all strategy in classification.
How do we learn a single rule? The OneR algorithm would be useless
here, since it would always cover the whole feature space. But there are
many other possibilities. One possibility is to learn a single rule from
a decision tree with beam search:
• Learn a decision tree (with CART or another tree learning algo-
rithm).
• Start at the root node and recursively select the purest node (e.g. with
the lowest misclassification rate).
• The majority class of the terminal node is used as the rule prediction;
the path leading to that node is used as the rule condition.
The following figure illustrates the beam search in a tree:
is the space of all possible rules. The goal of the search is to find the
best rule according to some criteria. There are many different search
strategies: hill-climbing, beam search, exhaustive search, best-first
search, ordered search, stochastic search, top-down search, bottom-
up search, …
RIPPER (Repeated Incremental Pruning to Produce Error Reduction)
by Cohen (1995)11 is a variant of the Sequential Covering algorithm.
RIPPER is a bit more sophisticated and uses a post-processing phase
(rule pruning) to optimize the decision list (or set). RIPPER can run
in ordered or unordered mode and generate either a decision list or
decision set.
Examples
We will use RIPPER for the examples.
The RIPPER algorithm does not find any rule in the classification task
for cervical cancer.
When we use RIPPER on the regression task to predict bike counts
some rules are found. Since RIPPER only works for classification, the
bike counts must be turned into a categorical outcome. I achieved this
by cutting the bike counts into the quartiles. For example (4548, 5956)
is the interval covering predicted bike counts between 4548 and 5956.
The following table shows the decision list of learned rules.
11
Cohen, William W. “Fast effective rule induction.” Machine Learning Proceed-
ings (1995). 115-123.
4.5 Decision Rules 121
rules
(temp >= 16) and (days_since_2011 <= 437) and (weathersit
= GOOD) and (temp <= 24) and (days_since_2011 >= 131) =>
cnt=(4548,5956]
(temp <= 13) and (days_since_2011 <= 111) => cnt=[22,3152]
(temp <= 4) and (workingday = NO WORKING DAY) =>
cnt=[22,3152]
(season = WINTER) and (days_since_2011 <= 368) =>
cnt=[22,3152]
(hum >= 72) and (windspeed >= 16) and (days_since_2011
<= 381) and (temp <= 17) => cnt=[22,3152]
(temp <= 6) and (weathersit = MISTY) => cnt=[22,3152]
(hum >= 91) => cnt=[22,3152]
(mnth = NOV) and (days_since_2011 >= 327) =>
cnt=[22,3152]
(days_since_2011 >= 438) and (weathersit = GOOD) and
(hum >= 51) => cnt=(5956,8714]
(days_since_2011 >= 441) and (hum <= 73) and (temp >= 15)
=> cnt=(5956,8714]
(days_since_2011 >= 441) and (windspeed <= 10) =>
cnt=(5956,8714]
(days_since_2011 >= 455) and (hum <= 40) =>
cnt=(5956,8714]
=> cnt=(3152,4548]
The interpretation is simple: If the conditions apply, we predict the in-
terval on the right hand side for the number of bikes. The last rule is
the default rule that applies when none of the other rules apply to an in-
stance. To predict a new instance, start at the top of the list and check
whether a rule applies. When a condition matches, then the right hand
side of the rule is the prediction for this instance. The default rule en-
sures that there is always a prediction.
1 𝑛 (𝑖)
𝑆𝑢𝑝𝑝𝑜𝑟𝑡(𝑥𝑗 = 𝐴) = ∑ 𝐼(𝑥𝑗 = 𝐴)
𝑛 𝑖=1
where A is the feature value, n the number of data points in the dataset
and I the indicator function that returns 1 if the feature 𝑥𝑗 of the in-
stance i has level A otherwise 0. In a dataset of house values, if 20% of
houses have no balcony and 80% have one or more, then the support for
the pattern balcony=0 is 20%. Support can also be measured for combi-
nations of feature values, for example for balcony=0 AND pets=allowed.
There are many algorithms to find such frequent patterns, for example
Apriori or FP-Growth. Which you use does not matter much, only the
12
Letham, Benjamin, et al. “Interpretable classifiers using rules and Bayesian
analysis: Building a better stroke prediction model.” The Annals of Applied Statistics
9.3 (2015): 1350-1371.
13
Borgelt, C. “An implementation of the FP-growth algorithm.” Proceedings of the
1st International Workshop on Open Source Data Mining Frequent Pattern Mining
Implementations - OSDM ’05, 1–5. http://doi.org/10.1145/1133905.1133907 (2005).
4.5 Decision Rules 123
speed at which the patterns are found is different, but the resulting
patterns are always the same.
I will give you a rough idea of how the Apriori algorithm works to find
frequent patterns. Actually the Apriori algorithm consists of two parts,
where the first part finds frequent patterns and the second part builds
association rules from them. For the BRL algorithm, we are only in-
terested in the frequent patterns that are generated in the first part of
Apriori.
In the first step, the Apriori algorithm starts with all feature values that
have a support greater than the minimum support defined by the user.
If the user says that the minimum support should be 10% and only 5%
of the houses have size=big, we would remove that feature value and
keep only size=medium and size=small as patterns. This does not mean
that the houses are removed from the data, it just means that size=big
is not returned as frequent pattern. Based on frequent patterns with
a single feature value, the Apriori algorithm iteratively tries to find
combinations of feature values of increasingly higher order. Patterns
are constructed by combining feature=value statements with a logical
AND, e.g. size=medium AND location=bad. Generated patterns with a
support below the minimum support are removed. In the end we have
all the frequent patterns. Any subset of a frequent pattern’s clauses is
frequent again, which is called the Apriori property. It makes sense in-
tuitively: By removing a condition from a pattern, the reduced pattern
can only cover more or the same number of data points, but not less.
For example, if 20% of the houses are size=medium and location=good,
then the support of houses that are only size=medium is 20% or greater.
The Apriori property is used to reduce the number of patterns to be
inspected. Only in the case of frequent patterns we have to check pat-
terns of higher order.
Now we are done with pre-mining conditions for the Bayesian Rule
List algorithm. But before we move on to the second step of BRL, I
would like to hint at another way for rule-learning based on pre-mined
patterns. Other approaches suggest including the outcome of interest
into the frequent pattern mining process and also executing the sec-
ond part of the Apriori algorithm that builds IF-THEN rules. Since the
algorithm is unsupervised, the THEN-part also contains feature val-
124 4 Interpretable Models
ues we are not interested in. But we can filter by rules that have only
the outcome of interest in the THEN-part. These rules already form a
decision set, but it would also be possible to arrange, prune, delete or
recombine the rules.
In the BRL approach however, we work with the frequent patterns and
learn the THEN-part and how to arrange the patterns into a decision
list using Bayesian statistics.
Learning Bayesian Rule Lists
The goal of the BRL algorithm is to learn an accurate decision list using
a selection of the pre-mined conditions, while prioritizing lists with
few rules and short conditions. BRL addresses this goal by defining a
distribution of decision lists with prior distributions for the length of
conditions (preferably shorter rules) and the number of rules (prefer-
ably a shorter list).
The posteriori probability distribution of lists makes it possible to say
how likely a decision list is, given assumptions of shortness and how
well the list fits the data. Our goal is to find the list that maximizes
this posterior probability. Since it is not possible to find the exact best
list directly from the distributions of lists, BRL suggests the following
recipe:
1) Generate an initial decision list, which is randomly drawn from the
priori distribution.
2) Iteratively modify the list by adding, switching or removing rules,
ensuring that the resulting lists follow the posterior distribution of
lists.
3) Select the decision list from the sampled lists with the highest prob-
ability according to the posteriori distribution.
Let us go over the algorithm more closely: The algorithm starts with
pre-mining feature value patterns with the FP-Growth algorithm. BRL
makes a number of assumptions about the distribution of the target
and the distribution of the parameters that define the distribution
of the target. (That’s Bayesian statistic.) If you are unfamiliar with
Bayesian statistics, do not get too caught up in the following explana-
tions. It is important to know that the Bayesian approach is a way to
combine existing knowledge or requirements (so-called priori distri-
4.5 Decision Rules 125
butions) while also fitting to the data. In the case of decision lists, the
Bayesian approach makes sense, since the prior assumptions nudges
the decision lists to be short with short rules.
The goal is to sample decision lists d from the posteriori distribution:
𝑝(𝑑|𝑥,
⏟⏟ ⏟⏟⏟⏟𝑦, 𝐴, 𝛼,⏟𝜆,
⏟⏟ 𝜂) ∝ 𝑝(𝑦|𝑥,
⏟⏟⏟ 𝑑,⏟𝛼)
⏟ ⋅ 𝑝(𝑑|𝐴,
⏟⏟⏟ 𝜆,
⏟⏟ 𝜂)
𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟𝑖 𝑙𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑 𝑝𝑟𝑖𝑜𝑟𝑖
where d is a decision list, x are the features, y is the target, A the set
of pre-mined conditions, 𝜆 the prior expected length of the decision
lists, 𝜂 the prior expected number of conditions in a rule, 𝛼 the prior
pseudo-count for the positive and negative classes which is best fixed
at (1,1).
𝑝(𝑑|𝑥, 𝑦, 𝐴, 𝛼, 𝜆, 𝜂)
quantifies how probable a decision list is, given the observed data and
the priori assumptions. This is proportional to the likelihood of the
outcome y given the decision list and the data times the probability
of the list given prior assumptions and the pre-mined conditions.
𝑝(𝑦|𝑥, 𝑑, 𝛼)
is the likelihood of the observed y, given the decision list and the data.
BRL assumes that y is generated by a Dirichlet-Multinomial distribu-
tion. The better the decision list d explains the data, the higher the like-
lihood.
𝑝(𝑑|𝐴, 𝜆, 𝜂)
ditions or removing a rule from the decision list. Which of the rules
is switched, added or deleted is chosen at random. At each step, the
algorithm evaluates the posteriori probability of the decision list (mix-
ture of accuracy and shortness). The Metropolis Hastings algorithm
ensures that we sample decision lists that have a high posterior proba-
bility. This procedure provides us with many samples from the distri-
bution of decision lists. The BRL algorithm selects the decision list of
the samples with the highest posterior probability.
Examples
That is it with the theory, now let’s see the BRL method in action. The
examples use a faster variant of BRL called Scalable Bayesian Rule Lists
(SBRL) by Yang et. al (2017) 14 . We use the SBRL algorithm to predict the
risk for cervical cancer. I first had to discretize all input features for
the SBRL algorithm to work. For this purpose I binned the continuous
features based on the frequency of the values by quantiles.
We get the following rules:
rules
If {STDs=1} (rule[259]) then positive probability =
0.16049383
else if {Hormonal.Contraceptives..years.=[0,10)} (rule[82])
then positive probability = 0.04685408
else (default rule) then positive probability = 0.27777778
Note that we get sensible rules, since the prediction on the THEN-part
is not the class outcome, but the predicted probability for cancer.
The conditions were selected from patterns that were pre-mined with
the FP-Growth algorithm. The following table displays the pool of con-
ditions the SBRL algorithm could choose from for building a decision
list. The maximum number of feature values in a condition I allowed
as a user was two. Here is a sample of ten patterns:
14
Yang, Hongyu, Cynthia Rudin, and Margo Seltzer. “Scalable Bayesian rule lists.”
Proceedings of the 34th International Conference on Machine Learning-Volume 70.
JMLR. org, 2017.
128 4 Interpretable Models
pre-mined conditions
Num.of.pregnancies=[3.67, 7.33)
IUD=0, STDs=1
Number.of.sexual.partners=[1, 10),
STDs..Time.since.last.diagnosis=[1, 8)
First.sexual.intercourse=[10, 17.3), STDs=0
Smokes=1, IUD..years.=[0, 6.33)
Hormonal.Contraceptives..years.=[10, 20),
STDs..Number.of.diagnosis=[0, 1)
Age=[13, 36.7)
Hormonal.Contraceptives=1,
STDs..Number.of.diagnosis=[0, 1)
Number.of.sexual.partners=[1, 10), STDs..number.=[0,
1.33)
STDs..number.=[1.33, 2.67),
STDs..Time.since.first.diagnosis=[1, 8)
Next, we apply the SBRL algorithm to the bike rental prediction task.
This only works if the regression problem of predicting bike counts is
converted into a binary classification task. I have arbitrarily created a
classification task by creating a label that is 1 if the number of bikes
exceeds 4000 bikes on a day, else 0.
The following list was learned by SBRL:
rules
If {yr=2011,temp=[-5.22,7.35)} (rule[718]) then positive
probability = 0.01041667
else if {yr=2012,temp=[7.35,19.9)} (rule[823]) then positive
probability = 0.88125000
else if {yr=2012,temp=[19.9,32.5]} (rule[816]) then positive
probability = 0.99253731
else if {season=SPRING} (rule[351]) then positive
probability = 0.06410256
else if {temp=[7.35,19.9)} (rule[489]) then positive
probability = 0.44444444
else (default rule) then positive probability = 0.79746835
Let us predict the probability that the number of bikes will exceed 4000
4.5 Decision Rules 129
for a day in 2012 with a temperature of 17 degrees Celsius. The first rule
does not apply, since it only applies for days in 2011. The second rule
applies, because the day is in 2012 and 17 degrees lies in the interval
[7.35,19.9). Our prediction for the probability is that more than 4000
bikes are rented is 88%.
4.5.4 Advantages
This section discusses the benefits of IF-THEN rules in general.
IF-THEN rules are easy to interpret. They are probably the most in-
terpretable of the interpretable models. This statement only applies if
the number of rules is small, the conditions of the rules are short (max-
imum 3 I would say) and if the rules are organized in a decision list or
a non-overlapping decision set.
Decision rules can be as expressive as decision trees, while being
more compact. Decision trees often also suffer from replicated sub-
trees, that is, when the splits in a left and a right child node have the
same structure.
The prediction with IF-THEN rules is fast, since only a few binary
statements need to be checked to determine which rules apply.
Decision rules are robust against monotonic transformations of the
input features, because only the threshold in the conditions changes.
They are also robust against outliers, since it only matters if a condition
applies or not.
IF-THEN rules usually generate sparse models, which means that not
many features are included. They select only the relevant features for
the model. For example, a linear model assigns a weight to every input
feature by default. Features that are irrelevant can simply be ignored
by IF-THEN rules.
Simple rules like from OneR can be used as baseline for more complex
algorithms.
4.5.5 Disadvantages
This section deals with the disadvantages of IF-THEN rules in general.
130 4 Interpretable Models
16
https://www.eecs.yorku.ca/tdb/_doc.php/userg/sw/weka/doc/weka/
classifiers/rules/package-summary.html
17
https://cran.r-project.org/web/packages/RWeka/index.html
18
https://cran.r-project.org/web/packages/sbrl/index.html
19
https://github.com/datascienceinc/Skater
20
https://github.com/Hongyuy/sbrlmod
21
https://github.com/csinva/imodels
22
Fürnkranz, Johannes, Dragan Gamberger, and Nada Lavrač. “Foundations of
rule learning.” Springer Science & Business Media, (2012).
23
http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/package-
summary.html
132 4 Interpretable Models
4.6 RuleFit
The RuleFit algorithm by Friedman and Popescu (2008)24 learns sparse
linear models that include automatically detected interaction effects
in the form of decision rules.
The linear regression model does not account for interactions between
features. Would it not be convenient to have a model that is as simple
and interpretable as linear models, but also integrates feature interac-
tions? RuleFit fills this gap. RuleFit learns a sparse linear model with
the original features and also a number of new features that are deci-
sion rules. These new features capture interactions between the origi-
nal features. RuleFit automatically generates these features from deci-
sion trees. Each path through a tree can be transformed into a decision
rule by combining the split decisions into a rule. The node predictions
are discarded and only the splits are used in the decision rules:
FIGURE 4.21 4 rules can be generated from a tree with 3 terminal nodes.
Where do those decision trees come from? The trees are trained to pre-
24
Friedman, Jerome H, and Bogdan E Popescu. “Predictive learning via rule en-
sembles.” The Annals of Applied Statistics. JSTOR, 916–54. (2008).
4.6 RuleFit 133
dict the outcome of interest. This ensures that the splits are meaning-
ful for the prediction task. Any algorithm that generates a lot of trees
can be used for RuleFit, for example a random forest. Each tree is de-
composed into decision rules that are used as additional features in a
sparse linear regression model (Lasso).
The RuleFit paper uses the Boston housing data to illustrate this: The
goal is to predict the median house value of a Boston neighborhood.
One of the rules generated by RuleFit is: IF number of rooms > 6.64
AND concentration of nitric oxide <0.67 THEN 1 ELSE 0.
RuleFit also comes with a feature importance measure that helps to
identify linear terms and rules that are important for the predictions.
Feature importance is calculated from the weights of the regression
model. The importance measure can be aggregated for the original fea-
tures (which are used in their “raw” form and possibly in many deci-
sion rules).
RuleFit also introduces partial dependence plots to show the average
change in prediction by changing a feature. The partial dependence
plot is a model-agnostic method that can be used with any model, and
is explained in the book chapter on partial dependence plots.
4.6.2 Theory
Let us dive deeper into the technical details of the RuleFit algorithm.
RuleFit consists of two components: The first component creates
4.6 RuleFit 135
“rules” from decision trees and the second component fits a linear
model with the original features and the new rules as input (hence the
name “RuleFit”).
Step 1: Rule generation
What does a rule look like? The rules generated by the algorithm have
a simple form. For example: IF x2 < 3 AND x5 < 7 THEN 1 ELSE 0.
The rules are constructed by decomposing decision trees: Any path to
a node in a tree can be converted to a decision rule. The trees used for
the rules are fitted to predict the target outcome. Therefore the splits
and resulting rules are optimized to predict the outcome you are inter-
ested in. You simply chain the binary decisions that lead to a certain
node with “AND”, and voilà, you have a rule. It is desirable to generate
a lot of diverse and meaningful rules. Gradient boosting is used to fit
an ensemble of decision trees by regressing or classifying y with your
136 4 Interpretable Models
𝑀
̂ = 𝑎 + ∑ 𝑎 𝑓 ̂ (𝑋)
𝑓(𝑥) 0 𝑚 𝑚
𝑚=1
M is the number of trees and 𝑓𝑚̂ (𝑥) is the prediction function of the
m-th tree. The 𝑎’s are the weights. Bagged ensembles, random forest,
AdaBoost and MART produce tree ensembles and can be used for Rule-
Fit.
We create the rules from all trees of the ensemble. Each rule 𝑟𝑚 takes
the form of:
where T𝑚 is the set of features used in the m-th tree, I is the indicator
function that is 1 when feature 𝑥𝑗 is in the specified subset of values s
for the j-th feature (as specified by the tree splits) and 0 otherwise. For
numerical features, 𝑠𝑗𝑚 is an interval in the value range of the feature.
The interval looks like one of the two cases:
This rule returns 1 if all three conditions are met, otherwise 0. RuleFit
extracts all possible rules from a tree, not only from the leaf nodes. So
another rule that would be created is:
𝑀
𝐾 = ∑ 2(𝑡𝑚 − 1)
𝑚=1
where 𝛿𝑗− and 𝛿𝑗+ are the 𝛿 quantiles of the data distribution of feature
𝑥𝑗 . A choice of 0.05 for 𝛿 means that any value of feature 𝑥𝑗 that is
in the 5% lowest or 5% highest values will be set to the quantiles at 5%
or 95% respectively. As a rule of thumb, you can choose 𝛿 = 0.025. In
addition, the linear terms have to be normalized so that they have the
same prior importance as a typical decision rule:
The 0.4 is the average standard deviation of rules with a uniform sup-
port distribution of 𝑠𝑘 ∼ 𝑈 (0, 1).
We combine both types of features to generate a new feature matrix
and train a sparse linear model with Lasso, with the following struc-
ture:
𝐾 𝑝
̂ = 𝛽 ̂ + ∑ 𝛼̂ 𝑟 (𝑥) + ∑ 𝛽 ̂ 𝑙 (𝑥 )
𝑓(𝑥) 0 𝑘 𝑘 𝑗 𝑗 𝑗
𝑘=1 𝑗=1
where 𝛼̂ is the estimated weight vector for the rule features and 𝛽 ̂ the
weight vector for the original features. Since RuleFit uses Lasso, the
loss function gets the additional constraint that forces some of the
weights to get a zero estimate:
𝑛
({𝛼}̂ 𝐾 ̂ 𝑝 𝑎𝑟𝑔𝑚𝑖𝑛{𝛼}̂ 𝐾 ,{𝛽}̂ 𝑝 ∑ 𝐿(𝑦(𝑖) , 𝑓(𝑥(𝑖) ))
1 , {𝛽}0 ) = 1 0
𝑖=1
𝐾 𝑝
+ 𝜆 ⋅ (∑ |𝛼𝑘 | + ∑ |𝑏𝑗 |)
𝑘=1 𝑗=1
The result is a linear model that has linear effects for all of the original
features and for the rules. The interpretation is the same as for linear
models, the only difference is that some features are now binary rules.
Step 3 (optional): Feature importance
4.6 RuleFit 139
For the linear terms of the original features, the feature importance is
measured with the standardized predictor:
where 𝛽𝑗 is the weight from the Lasso model and 𝑠𝑡𝑑(𝑙𝑗 (𝑥𝑗 )) is the
standard deviation of the linear term over the data.
For the decision rule terms, the importance is calculated with the fol-
lowing formula:
𝐼𝑘 = |𝛼𝑘̂ | ⋅ √𝑠𝑘 (1 − 𝑠𝑘 )
where 𝛼𝑘̂ is the associated Lasso weight of the decision rule and 𝑠𝑘 is
the support of the feature in the data, which is the percentage of data
points to which the decision rule applies (where 𝑟𝑘 (𝑥) = 1):
1 𝑛
𝑠𝑘 = ∑ 𝑟𝑘 (𝑥(𝑖) )
𝑛 𝑖=1
A feature occurs as a linear term and possibly also within many deci-
sion rules. How do we measure the total importance of a feature? The
importance 𝐽𝑗 (𝑥) of a feature can be measured for each individual pre-
diction:
4.6.3 Advantages
RuleFit automatically adds feature interactions to linear models.
Therefore, it solves the problem of linear models that you have to add
interaction terms manually and it helps a bit with the issue of model-
ing nonlinear relationships.
RuleFit can handle both classification and regression tasks.
The rules created are easy to interpret, because they are binary deci-
sion rules. Either the rule applies to an instance or not. Good inter-
pretability is only guaranteed if the number of conditions within a rule
is not too large. A rule with 1 to 3 conditions seems reasonable to me.
This means a maximum depth of 3 for the trees in the tree ensemble.
Even if there are many rules in the model, they do not apply to every
instance. For an individual instance only a handful of rules apply (=
have a non-zero weights). This improves local interpretability.
RuleFit proposes a bunch of useful diagnostic tools. These tools are
model-agnostic, so you can find them in the model-agnostic section
of the book: feature importance, partial dependence plots and feature
interactions.
4.6.4 Disadvantages
Sometimes RuleFit creates many rules that get a non-zero weight in
the Lasso model. The interpretability degrades with increasing num-
ber of features in the model. A promising solution is to force feature
effects to be monotonic, meaning that an increase of a feature has to
lead to an increase of the prediction.
An anecdotal drawback: The papers claim a good performance of Rule-
Fit – often close to the predictive performance of random forests! –
but in the few cases where I tried it personally, the performance was
disappointing. Just try it out for your problem and see how it per-
forms.
The end product of the RuleFit procedure is a linear model with addi-
tional fancy features (the decision rules). But since it is a linear model,
the weight interpretation is still unintuitive. It comes with the same
4.6 RuleFit 141
25
Fokkema, Marjolein, and Benjamin Christoffersen. “Pre: Prediction rule ensem-
bles”. https://CRAN.R-project.org/package=pre (2017).
26
https://github.com/christophM/rulefit
27
https://github.com/scikit-learn-contrib/skope-rules
28
https://github.com/csinva/imodels
142 4 Interpretable Models
The term Z is a scaling parameter that ensures that the sum of prob-
abilities for all classes is 1 (otherwise they would not be probabilities).
The conditional probability of a class is the class probability times the
probability of each feature given the class, normalized by Z. This for-
mula can be derived by using the Bayes’ theorem.
Naive Bayes is an interpretable model because of the independence as-
sumption. It can be interpreted on the modular level. It is very clear
for each feature how much it contributes towards a certain class pre-
diction, since we can interpret the conditional probability.
4.7 Other Interpretable Models 143
145
146 5 Model-Agnostic Methods
FIGURE 5.1 The big picture of explainable machine learning. The real
world goes through many layers before it reaches the human in the
form of explanations.
148 5 Model-Agnostic Methods
149
150 6 Example-Based Explanations
patient’s symptoms remind her of another patient she had years ago
with similar symptoms. She suspects that her current patient could
have the same disease and she takes a blood sample to test for this spe-
cific disease.
A data scientist works on a new project for one of his clients: Analysis
of the risk factors that lead to the failure of production machines for
keyboards. The data scientist remembers a similar project he worked
on and reuses parts of the code from the old project because he thinks
the client wants the same analysis.
A kitten sits on the window ledge of a burning and uninhabited house.
The fire department has already arrived and one of the firefighters pon-
ders for a second whether he can risk going into the building to save
the kitten. He remembers similar cases in his life as a firefighter: Old
wooden houses that have been burning slowly for some time were of-
ten unstable and eventually collapsed. Because of the similarity of this
case, he decides not to enter, because the risk of the house collapsing is
too great. Fortunately, the kitty jumps out of the window, lands safely
and nobody is harmed in the fire. Happy ending.
These stories illustrate how we humans think in examples or analogies.
The blueprint of example-based explanations is: Thing B is similar to
thing A and A caused Y, so I predict that B will cause Y as well. Im-
plicitly, some machine learning approaches work example-based. De-
cision trees partition the data into nodes based on the similarities of
the data points in the features that are important for predicting the
target. A decision tree gets the prediction for a new data instance by
finding the instances that are similar (= in the same terminal node) and
returning the average of the outcomes of those instances as the pre-
diction. The k-nearest neighbors (knn) method works explicitly with
example-based predictions. For a new instance, a knn model locates
the k-nearest neighbors (e.g. the k=3 closest instances) and returns the
average of the outcomes of those neighbors as a prediction. The predic-
tion of a knn can be explained by returning the k neighbors, which –
again – is only meaningful if we have a good way to represent a single
instance.
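A minimal sketch of this idea: return the k nearest training instances together with the prediction. The names `X` (a numeric feature matrix), `y` (the outcomes) and `x_new` (a single new instance) are placeholders.

```r
# Sketch: explain a k-nearest-neighbor prediction by returning the neighbors themselves
explain_knn <- function(X, y, x_new, k = 3) {
  d   <- sqrt(colSums((t(X) - x_new)^2))      # Euclidean distance to every training point
  idx <- order(d)[1:k]                        # the k closest instances
  list(prediction = mean(y[idx]),             # average outcome of the neighbors
       neighbors  = X[idx, , drop = FALSE],   # the examples that serve as the explanation
       distances  = d[idx])
}
```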
The following interpretation methods are all example-based:
7 Global Model-Agnostic Methods
$$\hat{f}_S(x_S) = E_{X_C}\big[\hat{f}(x_S, X_C)\big] = \int \hat{f}(x_S, X_C)\, d\mathbb{P}(X_C)$$
The 𝑥𝑆 are the features for which the partial dependence function
should be plotted and 𝑋𝐶 are the other features used in the machine
learning model $\hat{f}$, which are here treated as random variables. Usu-
ally, there are only one or two features in the set S. The feature(s) in S
are those for which we want to know the effect on the prediction. The
feature vectors 𝑥𝑆 and 𝑥𝐶 combined make up the total feature space
x. Partial dependence works by marginalizing the machine learning
model output over the distribution of the features in set C, so that the
function shows the relationship between the features in set S we are
interested in and the predicted outcome. By marginalizing over the
other features, we get a function that depends only on features in S,
interactions with other features included.
The partial function $\hat{f}_S$ is estimated by calculating averages in the
training data, also known as the Monte Carlo method:

$$\hat{f}_S(x_S) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}\big(x_S, x_C^{(i)}\big)$$
The partial function tells us for given value(s) of features S what the
average marginal effect on the prediction is. In this formula, $x_C^{(i)}$ are
actual feature values from the dataset for the features in which we are
not interested, and n is the number of instances in the dataset. An as-
sumption of the PDP is that the features in C are not correlated with
the features in S. If this assumption is violated, the averages calculated
for the partial dependence plot will include data points that are very
unlikely or even impossible (see disadvantages).
For classification where the machine learning model outputs probabil-
ities, the partial dependence plot displays the probability for a certain
class given different values for feature(s) in S. An easy way to deal with
multiple classes is to draw one line or plot per class.
The partial dependence plot is a global method: The method considers
all instances and gives a statement about the global relationship of a
feature with the predicted outcome.
Categorical features
So far, we have only considered numerical features. For categorical fea-
tures, the partial dependence is very easy to calculate. For each of the
categories, we get a PDP estimate by forcing all data instances to have
the same category. For example, if we look at the bike rental dataset
and are interested in the partial dependence plot for the season, we
get 4 numbers, one for each season. To compute the value for “sum-
mer”, we replace the season of all data instances with “summer” and
average the predictions.
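A minimal sketch of this computation, assuming a fitted model with a standard predict() method and a data.frame whose feature of interest is a factor; the model, data and column names are placeholders:

```r
# Sketch: partial dependence for a categorical feature by forcing every instance
# to take each category in turn and averaging the predictions
pdp_category <- function(model, data, feature, predict_fun = predict) {
  sapply(levels(data[[feature]]), function(lev) {
    data_mod <- data
    data_mod[[feature]] <- factor(rep(lev, nrow(data)), levels = levels(data[[feature]]))
    mean(predict_fun(model, data_mod))   # average prediction with the category forced
  })
}
# e.g. pdp_category(bike_model, bike, "season") would return one number per season
```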
A PDP-based feature importance measure can be defined as the deviation of each unique feature value's partial dependence from the average partial dependence:

$$I(x_S) = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\Big(\hat{f}_S\big(x_S^{(k)}\big) - \frac{1}{K}\sum_{k=1}^{K}\hat{f}_S\big(x_S^{(k)}\big)\Big)^2}$$
Note that here the $x_S^{(k)}$ are the $K$ unique values of the feature $X_S$. For
categorical features we have:

$$I(x_S) = \Big(\max_k\big(\hat{f}_S(x_S^{(k)})\big) - \min_k\big(\hat{f}_S(x_S^{(k)})\big)\Big)/4$$
This is the range of the PDP values for the unique categories divided by
four. This strange way of calculating the deviation is called the range
rule. It helps to get a rough estimate for the deviation when you only
know the range. And the denominator four comes from the standard
normal distribution: In the normal distribution, 95% of the data lie within
minus two and plus two standard deviations of the mean. So the
range divided by four gives a rough estimate that probably underesti-
mates the actual variance.
This PDP-based feature importance should be interpreted with care. It
captures only the main effect of the feature and ignores possible fea-
ture interactions. A feature could be very important based on other
methods such as permutation feature importance, but the PDP could
be flat as the feature affects the prediction mainly through interac-
tions with other features. Another drawback of this measure is that
it is defined over the unique values. A unique feature value with just
one instance is given the same weight in the importance computation
as a value with many instances.
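As a sketch, the PDP-based importance for a numerical feature can be computed directly from the PDP values at the unique feature values; model, data and predict function are again placeholders.

```r
# Sketch: PDP-based feature importance as the deviation of the PDP values
# computed at the unique values of a numerical feature
pdp_importance <- function(model, data, feature, predict_fun = predict) {
  grid <- sort(unique(data[[feature]]))
  pdp_vals <- sapply(grid, function(v) {
    data_mod <- data
    data_mod[[feature]] <- v
    mean(predict_fun(model, data_mod))
  })
  sd(pdp_vals)   # for a categorical feature, use diff(range(pdp_vals)) / 4 instead
}
```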
7.1.2 Examples
In practice, the set of features S usually only contains one feature or
a maximum of two, because one feature produces 2D plots and two
features produce 3D plots. Everything beyond that is quite tricky. Even
3D on a 2D paper or monitor is already challenging.
Let us return to the regression example, in which we predict the number of bikes that will be rented on a given day. First we fit a machine learning model, then we analyze the partial dependencies.
FIGURE 7.1 PDPs for the bicycle count prediction model and tempera-
ture, humidity and wind speed. The largest differences can be seen
in the temperature. The hotter, the more bikes are rented. This trend
goes up to 20 degrees Celsius, then flattens and drops slightly at 30.
Marks on the x-axis indicate the data distribution.
For warm but not too hot weather, the model predicts on average a
high number of rented bicycles. Potential bikers are increasingly in-
hibited in renting a bike when humidity exceeds 60%. In addition, the
more wind the fewer people like to cycle, which makes sense. Interest-
ingly, the predicted number of bike rentals does not fall when wind
speed increases from 25 to 35 km/h, but there is not much training
data, so the machine learning model could probably not learn a mean-
ingful prediction for this range. At least intuitively, I would expect fewer rented bikes with increasing wind speed.
FIGURE 7.2 PDPs for the bike count prediction model and the season.
Unexpectedly all seasons show similar effect on the model predictions,
only for winter the model predicts fewer bicycle rentals.
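Plots like these can be produced, for example, with the iml package; a minimal sketch, assuming a fitted model `bike_model` and a data.frame `bike` with the target column `cnt` (both placeholders):

```r
# Sketch: PDP with the iml package for a hypothetical bike rental model
library(iml)

predictor <- Predictor$new(bike_model, data = bike[setdiff(names(bike), "cnt")], y = bike$cnt)
pdp_temp  <- FeatureEffect$new(predictor, feature = "temp", method = "pdp")
plot(pdp_temp)
```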
7.1.3 Advantages
The computation of partial dependence plots is intuitive: The partial
dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value.
FIGURE 7.3 PDPs of cancer probability based on age and years with hormonal contraceptives. For age, the PDP shows that the probability is low until 40 and increases after. The more years on hormonal contraceptives, the higher the predicted cancer risk, especially after 10 years. For both features, not many data points with large values were available, so the PD estimates are less reliable in those regions.
In my experience, lay people usually understand the idea of PDPs
quickly.
If the feature for which you computed the PDP is not correlated with
the other features, then the PDPs perfectly represent how the feature
influences the prediction on average. In the uncorrelated case, the
interpretation is clear: The partial dependence plot shows how the
average prediction in your dataset changes when the j-th feature is
changed. It is more complicated when features are correlated, see also
disadvantages.
Partial dependence plots are easy to implement.
FIGURE 7.4 PDP of cancer probability and the interaction of age and
number of pregnancies. The plot shows the increase in cancer prob-
ability at 45. For ages below 25, women who had 1 or 2 pregnancies
have a lower predicted cancer risk, compared with women who had 0
or more than 2 pregnancies. But be careful when drawing conclusions:
This might just be a correlation and not causal!
The calculation for the partial dependence plots has a causal interpre-
tation. We intervene on a feature and measure the changes in the pre-
dictions. In doing so, we analyze the causal relationship between the
feature and the prediction.3 The relationship is causal for the model –
because we explicitly model the outcome as a function of the features
– but not necessarily for the real world!
3
Zhao, Qingyuan, and Trevor Hastie. “Causal interpretations of black-box mod-
els.” Journal of Business & Economic Statistics, to appear. (2017).
7.1.4 Disadvantages
The realistic maximum number of features in a partial dependence
function is two. This is not the fault of PDPs, but of the 2-dimensional
representation (paper or screen) and also of our inability to imagine
more than 3 dimensions.
Some PD plots do not show the feature distribution. Omitting the dis-
tribution can be misleading, because you might overinterpret regions
with almost no data. This problem is easily solved by showing a rug
(indicators for data points on the x-axis) or a histogram.
The assumption of independence is the biggest issue with PD plots.
It is assumed that the feature(s) for which the partial dependence is
computed are not correlated with other features. For example, sup-
pose you want to predict how fast a person walks, given the person’s
weight and height. For the partial dependence of one of the features,
e.g. height, we assume that the other features (weight) are not corre-
lated with height, which is obviously a false assumption. For the com-
putation of the PDP at a certain height (e.g. 200 cm), we average over
the marginal distribution of weight, which might include a weight be-
low 50 kg, which is unrealistic for a 2 meter person. In other words:
When the features are correlated, we create new data points in areas
of the feature distribution where the actual probability is very low (for
example it is unlikely that someone is 2 meters tall but weighs less
than 50 kg). One solution to this problem is Accumulated Local Effect plots, or ALE plots for short, which work with the conditional instead of the marginal distribution.
Heterogeneous effects might be hidden because PD plots only show
the average marginal effects. Suppose that for a feature half your data
points have a positive association with the prediction – the larger the
feature value the larger the prediction – and the other half has a neg-
ative association – the smaller the feature value the larger the predic-
tion. The PD curve could be a horizontal line, since the effects of both
halves of the dataset could cancel each other out. You then conclude
that the feature has no effect on the prediction. By plotting the indi-
vidual conditional expectation curves instead of the aggregated line,
we can uncover heterogeneous effects.
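With the iml package, ICE curves can be drawn together with the PDP; a minimal sketch, reusing the hypothetical `predictor` object from the earlier PDP sketch:

```r
# Sketch: ICE curves overlaid with the PDP to reveal heterogeneous effects
ice_temp <- FeatureEffect$new(predictor, feature = "temp", method = "pdp+ice")
plot(ice_temp)
```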
Alternatives to PDPs presented in this book are ALE plots and ICE
curves.
7.2 Accumulated Local Effects (ALE) Plot
FIGURE 7.5 Strongly correlated features x1 and x2. To calculate the fea-
ture effect of x1 at 0.75, the PDP replaces x1 of all instances with 0.75,
falsely assuming that the distribution of x2 at x1 = 0.75 is the same as
the marginal distribution of x2 (vertical line). This results in unlikely
combinations of x1 and x2 (e.g. x2=0.2 at x1=0.75), which the PDP uses
for the calculation of the average effect.
the size of the living area increases the predicted value, since the num-
ber of rooms increases with the living area. The following plot shows
for two correlated features how M-Plots work.
FIGURE 7.6 Strongly correlated features x1 and x2. M-Plots average over
the conditional distribution. Here the conditional distribution of x2 at
x1 = 0.75. Averaging the local predictions leads to mixing the effects of
both features.
FIGURE 7.7 Calculation of ALE for feature x1, which is correlated with x2.
First, we divide the feature into intervals (vertical lines). For the data
instances (points) in an interval, we calculate the difference in the pre-
diction when we replace the feature with the upper and lower limit of
the interval (horizontal lines). These differences are later accumulated
and centered, resulting in the ALE curve.
To summarize how each type of plot (PDP, M, ALE) calculates the ef-
fect of a feature at a certain grid value v:
Partial Dependence Plots: “Let me show you what the model predicts
on average when each data instance has the value v for that feature. I
ignore whether the value v makes sense for all data instances.”
M-Plots: “Let me show you what the model predicts on average for data
instances that have values close to v for that feature. The effect could
be due to that feature, but also due to correlated features.”
ALE plots: "Let me show you how the model predictions change in a small 'window' of the feature around v for data instances in that window."
7.2.2 Theory
How do PD, M and ALE plots differ mathematically? Common to all
three methods is that they reduce the complex prediction function f to
a function that depends on only one (or two) features. All three meth-
ods reduce the function by averaging the effects of the other features,
but they differ in whether averages of predictions or of differences
in predictions are calculated and whether averaging is done over the
marginal or conditional distribution.
Partial dependence plots average the predictions over the marginal
distribution.
$$\hat{f}_{S,PDP}(x_S) = E_{X_C}\big[\hat{f}(x_S, X_C)\big] = \int_{X_C} \hat{f}(x_S, X_C)\, d\mathbb{P}(X_C)$$
M-plots average the predictions over the conditional distribution:

$$\hat{f}_{S,M}(x_S) = E_{X_C|X_S}\big[\hat{f}(X_S, X_C)\,\big|\,X_S = x_S\big] = \int_{X_C} \hat{f}(x_S, X_C)\, d\mathbb{P}(X_C|X_S = x_S)$$
The only thing that changes compared to PDPs is that we average the
predictions conditional on each grid value of the feature of interest,
instead of assuming the marginal distribution at each grid value. ALE plots, in contrast, average the changes in the predictions (the local effects) and accumulate them over a grid. The local effect can be written in terms of the partial derivative:
$$\hat{f}^S(x_S, x_C) = \frac{\partial \hat{f}(x_S, x_C)}{\partial x_S}$$
One problem remains: Not all models come with a gradient, for exam-
ple random forests have no gradient. But as you will see, the actual
computation works without gradients and uses intervals. Let us delve
a little deeper into the estimation of ALE plots.
7.2.3 Estimation
First I will describe how ALE plots are estimated for a single numeri-
cal feature, later for two numerical features and for a single categori-
cal feature. To estimate local effects, we divide the feature into many
intervals and compute the differences in the predictions. This proce-
dure approximates the derivatives and also works for models without
derivatives.
First we estimate the uncentered effect:
$$\hat{\tilde{f}}_{j,ALE}(x) = \sum_{k=1}^{k_j(x)} \frac{1}{n_j(k)} \sum_{i:\, x_j^{(i)} \in N_j(k)} \left[\hat{f}\big(z_{k,j}, x_{\setminus j}^{(i)}\big) - \hat{f}\big(z_{k-1,j}, x_{\setminus j}^{(i)}\big)\right]$$
Let us break this formula down, starting from the right side. The name
Accumulated Local Effects nicely reflects all the individual compo-
nents of this formula. At its core, the ALE method calculates the differ-
ences in predictions, whereby we replace the feature of interest with
grid values z. The difference in prediction is the Effect the feature has
for an individual instance in a certain interval. The sum on the right
adds up the effects of all instances within an interval which appears in
the formula as neighborhood 𝑁𝑗 (𝑘). We divide this sum by the num-
ber of instances in this interval to obtain the average difference of the
predictions for this interval. This average in the interval is covered by
the term Local in the name ALE. The left sum symbol means that we
accumulate the average effects across all intervals. The (uncentered)
ALE of a feature value that lies, for example, in the third interval is the
sum of the effects of the first, second and third intervals. The word
Accumulated in ALE reflects this.
This effect is centered so that the mean effect is zero.
$$\hat{f}_{j,ALE}(x) = \hat{\tilde{f}}_{j,ALE}(x) - \frac{1}{n}\sum_{i=1}^{n} \hat{\tilde{f}}_{j,ALE}\big(x_j^{(i)}\big)$$
The value of the ALE can be interpreted as the main effect of the feature
at a certain value compared to the average prediction of the data. For
example, an ALE estimate of -2 at 𝑥𝑗 = 3 means that when the j-th
feature has value 3, then the prediction is lower by 2 compared to the
average prediction.
The quantiles of the distribution of the feature are used as the grid
that defines the intervals. Using the quantiles ensures that there is
the same number of data instances in each of the intervals. Quan-
tiles have the disadvantage that the intervals can have very different
lengths. This can lead to some weird ALE plots if the feature of inter-
est is very skewed, for example many low values and only a few very
high values.
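The estimation for a single numerical feature can be sketched in a few lines. This is a simplified illustration of the procedure (quantile grid, accumulate, center), not a reference implementation, and it assumes a regression model with a standard predict() method:

```r
# Sketch: first-order ALE estimation for one numerical feature
ale_1d <- function(model, X, feature, n_intervals = 20, predict_fun = predict) {
  x <- X[[feature]]
  # interval borders from quantiles, so the intervals hold roughly equal numbers of points
  z <- unique(quantile(x, probs = seq(0, 1, length.out = n_intervals + 1), type = 1))
  k <- cut(x, breaks = z, include.lowest = TRUE, labels = FALSE)  # interval index per instance
  effects <- numeric(length(z) - 1)
  for (j in seq_along(effects)) {
    idx <- which(k == j)
    if (length(idx) == 0) next
    X_lo <- X[idx, , drop = FALSE]; X_lo[[feature]] <- z[j]       # lower interval border
    X_hi <- X[idx, , drop = FALSE]; X_hi[[feature]] <- z[j + 1]   # upper interval border
    # local effect: average prediction difference between the interval borders
    effects[j] <- mean(predict_fun(model, X_hi) - predict_fun(model, X_lo))
  }
  ale <- c(0, cumsum(effects))   # accumulate the local effects over the intervals
  ale <- ale - mean(ale[k + 1])  # center so that the mean effect over the data is zero
  data.frame(grid = z, ale = ale)
}
```

In practice, the software implementations mentioned at the end of this chapter handle edge cases (empty intervals, categorical features) more carefully.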
ALE plots for the interaction of two features
ALE plots can also show the interaction effect of two features. The cal-
culation principles are the same as for a single feature, but we work
with rectangular cells instead of intervals, because we have to accu-
mulate the effects in two dimensions. In addition to adjusting for the
overall mean effect, we also adjust for the main effects of both features.
This means that the ALE for two features estimates the second-order effect,
which does not include the main effects of the features. In other words,
ALE for two features only shows the additional interaction effect of the
two features. I spare you the formulas for 2D ALE plots because they
are long and unpleasant to read. If you are interested in the calculation,
I refer you to the paper, formulas (13) – (16). I will rely on visualizations
to develop intuition about the second-order ALE calculation.
In the previous figure, many cells are empty due to the correlation. In
the ALE plot this can be visualized with a grayed out or darkened box.
Alternatively, you can replace the missing ALE estimate of an empty
cell with the ALE estimate of the nearest non-empty cell.
Since the ALE estimates for two features only show the second-order effect of the features, the interpretation requires special attention.
FIGURE 7.8 Calculation of 2D-ALE. We place a grid over the two features.
In each grid cell we calculate the 2nd-order differences for all instances
within. We first replace values of x1 and x2 with the values from the
cell corners. If a, b, c and d represent the ”corner”-predictions of a ma-
nipulated instance (as labeled in the graphic), then the 2nd-order dif-
ference is (d - c) - (b - a). The mean 2nd-order difference in each cell is
accumulated over the grid and centered.
To obtain the total effect of the two features, you can add the estimated main effects to the second-order effects, or you can get an estimate of the total ALE plots by refraining from subtracting the lower-order effects.
The accumulated local effects could also be calculated for arbitrarily
higher orders (interactions of three or more features), but as argued in
the PDP chapter, only up to two features makes sense, because higher
interactions cannot be visualized or even interpreted meaningfully.
ALE for categorical features
The accumulated local effects method needs – by definition – the fea-
ture values to have an order, because the method accumulates effects
in a certain direction. Categorical features do not have any natural or-
der. To compute an ALE plot for a categorical feature we have to some-
how create or find an order. The order of the categories influences the
calculation and interpretation of the accumulated local effects.
One solution is to order the categories according to their similarity
based on the other features. The distance between two categories is the
sum over the distances of each feature. The feature-wise distance com-
pares either the cumulative distribution in both categories, also called
Kolmogorov-Smirnov distance (for numerical features) or the relative
frequency tables (for categorical features). Once we have the distances
between all categories, we use multi-dimensional scaling to reduce the
distance matrix to a one-dimensional distance measure. This gives us
a similarity-based order of the categories.
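A rough sketch of this ordering step (Kolmogorov-Smirnov distance for numerical features, absolute differences of relative frequencies for factors, then one-dimensional multi-dimensional scaling); the actual ALE software may implement the details differently, and categorical features are assumed to be encoded as factors:

```r
# Sketch: order the categories of `feature` by their similarity on the other features
category_order <- function(X, feature) {
  cats  <- levels(X[[feature]])
  other <- setdiff(names(X), feature)
  D <- matrix(0, length(cats), length(cats), dimnames = list(cats, cats))
  for (a in cats) for (b in cats) {
    if (a == b) next
    d <- 0
    for (f in other) {
      xa <- X[X[[feature]] == a, f]
      xb <- X[X[[feature]] == b, f]
      if (is.numeric(xa)) {
        # Kolmogorov-Smirnov distance between the two empirical distributions
        d <- d + unname(suppressWarnings(ks.test(xa, xb)$statistic))
      } else {
        # sum of absolute differences of the relative frequency tables
        pa <- prop.table(table(factor(xa, levels = levels(X[[f]]))))
        pb <- prop.table(table(factor(xb, levels = levels(X[[f]]))))
        d <- d + sum(abs(pa - pb))
      }
    }
    D[a, b] <- d
  }
  cats[order(cmdscale(as.dist(D), k = 1))]   # reduce to one dimension and sort
}
```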
To make this a little bit clearer, here is one example: Let us assume we
have the two categorical features “season” and “weather” and a numer-
ical feature “temperature”. For the first categorical feature (season) we
want to calculate the ALEs. The feature has the categories “spring”,
“summer”, “fall”, “winter”. We start to calculate the distance between
categories “spring” and “summer”. The distance is the sum of distances
over the features temperature and weather. For the temperature, we
take all instances with season “spring”, calculate the empirical cumu-
lative distribution function and do the same for instances with season
“summer” and measure their distance with the Kolmogorov-Smirnov
statistic. For the weather feature we calculate for all “spring” instances
the probabilities for each weather type, do the same for the “summer”
instances and sum up the absolute distances in the probability dis-
7.2 Accumulated Local Effects (ALE) Plot 173
7.2.4 Examples
Let us see ALE plots in action. I have constructed a scenario in which
partial dependence plots fail. The scenario consists of a prediction
model and two strongly correlated features. The prediction model is
mostly a linear regression model, but does something weird at a com-
bination of the two features for which we have never observed in-
stances.
FIGURE 7.9 Two features and the predicted outcome. The model pre-
dicts the sum of the two features (shaded background), with the excep-
tion that if x1 is greater than 0.7 and x2 less than 0.3, the model always
predicts 2. This area is far from the distribution of data (point cloud)
and does not affect the performance of the model and also should not
affect its interpretation.
Is this a realistic, relevant scenario at all? When you train a model, the
learning algorithm minimizes the loss for the existing training data
instances. Weird stuff can happen outside the distribution of training
data, because the model is not penalized for doing weird stuff in these
areas. Leaving the data distribution is called extrapolation, which can
also be used to fool machine learning models, described in the chap-
ter on adversarial examples. See in our little example how the partial
dependence plots behave compared to ALE plots.
FIGURE 7.10 Comparison of the feature effects computed with PDP (up-
per row) and ALE (lower row). The PDP estimates are influenced by the
odd behavior of the model outside the data distribution (steep jumps
in the plots). The ALE plots correctly identify that the machine learn-
ing model has a linear relationship between features and prediction,
ignoring areas without data.
But is it not interesting to see that our model behaves oddly at x1 >
0.7 and x2 < 0.3? Well, yes and no. Since these are data instances that
might be physically impossible or at least extremely unlikely, it is usu-
ally irrelevant to look into these instances. But if you suspect that your
test distribution might be slightly different and some instances are ac-
tually in that range, then it would be interesting to include this area in
the calculation of feature effects. But it has to be a conscious decision
to include areas where we have not observed data yet and it should not
be a side-effect of the method of choice like PDP. If you suspect that
the model will later be used with differently distributed data, I recom-
mend to use ALE plots and simulate the distribution of data you are
expecting.
Turning to a real dataset, let us predict the number of rented bikes
based on weather and day and check if the ALE plots really work as
well as promised. We train a regression tree to predict the number of
rented bicycles on a given day and use ALE plots to analyze how tem-
perature, relative humidity and wind speed influence the predictions.
Let us look at what the ALE plots say:
Let us look at the correlation between temperature, humidity and
wind speed and all other features. Since the data also contains categor-
ical features, we cannot use only the Pearson correlation coefficient,
which works only if both features are numerical. Instead, I train a lin-
ear model to predict, for example, temperature based on one of the
other features as input. Then I measure how much variance the other
feature in the linear model explains and take the square root. If the
other feature was numerical, then the result is equal to the absolute
value of the standard Pearson correlation coefficient. But this model-
based approach of “variance-explained” (also called ANOVA, which
stands for ANalysis Of VAriance) works even if the other feature is
categorical. The “variance-explained” measure lies always between 0
(no association) and 1 (temperature can be perfectly predicted from
the other feature). We calculate the explained variance of temperature,
humidity and wind speed with all the other features. The higher the
explained variance (correlation), the more (potential) problems with
PD plots. The following figure visualizes how strongly the weather fea-
tures are correlated with other features.
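A minimal sketch of this model-based association measure; feature names and data are placeholders:

```r
# Sketch: "variance explained" between two features, computed by predicting one
# feature from the other with a linear model and taking the square root of R^2
variance_explained <- function(data, target, other) {
  fit <- lm(reformulate(other, response = target), data = data)
  sqrt(summary(fit)$r.squared)
}
# e.g. variance_explained(bike, "temp", "season")
```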
This correlation analysis reveals that we may encounter problems with
partial dependence plots, especially for the temperature feature. Well,
see for yourself:
FIGURE 7.11 ALE plots for the bike prediction model by temperature, hu-
midity and wind speed. The temperature has a strong effect on the
prediction. The average prediction rises with increasing temperature,
but falls again above 25 degrees Celsius. Humidity has a negative ef-
fect: When above 60%, the higher the relative humidity, the lower the
prediction. The wind speed does not affect the predictions much.
Next, let us see ALE plots in action for a categorical feature. The month
is a categorical feature for which we want to analyze the effect on the
predicted number of bikes. Arguably, the months already have a cer-
tain order (January to December), but let us try to see what happens if
we first reorder the categories by similarity and then compute the ef-
fects. The months are ordered by the similarity of days of each month
based on the other features, such as temperature or whether it is a hol-
iday.
Since many of the features are related to weather, the order of the
months strongly reflects how similar the weather is between the
months. All colder months are on the left side (February to April) and
the warmer months on the right side (October to August). Keep in
mind that non-weather features have also been included in the sim-
ilarity calculation, for example relative frequency of holidays has the
same weight as the temperature for calculating the similarity between
the months.
Next, we consider the second-order effect of humidity and tempera-
ture on the predicted number of bikes. Remember that the second-
order effect is the additional interaction effect of the two features and
does not include the main effects. This means that, for example, you
will not see the main effect that high humidity leads to a lower number
of predicted bikes on average in the second-order ALE plot.
Keep in mind that both main effects of humidity and temperature say
FIGURE 7.13 PDPs for temperature, humidity and wind speed. Com-
pared to the ALE plots, the PDPs show a smaller decrease in predicted
number of bikes for high temperature or high humidity. The PDP uses
all data instances to calculate the effect of high temperatures, even
if they are, for example, instances with the season ”winter”. The ALE
plots are more reliable.
that the predicted number of bikes decreases in very hot and humid
weather. In hot and humid weather, the combined effect of temper-
ature and humidity is therefore not the sum of the main effects, but
larger than the sum. To emphasize the difference between the pure
second-order effect (the 2D ALE plot you just saw) and the total effect,
let us look at the partial dependence plot. The PDP shows the total ef-
fect, which combines the mean prediction, the two main effects and
the second-order effect (the interaction).
If you are only interested in the interaction, you should look at the
second-order effects, because the total effect mixes the main effects
into the plot. But if you want to know the combined effect of the fea-
tures, you should look at the total effect (which the PDP shows).
FIGURE 7.14 ALE plot for the categorical feature month. The months are
ordered by their similarity to each other, based on the distributions
of the other features by month. We observe that January, March and
April, but especially December and November, have a lower effect on
the predicted number of rented bikes compared to the other months.
FIGURE 7.15 ALE plot for the 2nd-order effect of humidity and tempera-
ture on the predicted number of rented bikes. Lighter shade indicates
an above average and darker shade a below average prediction when
the main effects are already taken into account. The plot reveals an in-
teraction between temperature and humidity: Hot and humid weather
increases the prediction. In cold and humid weather an additional neg-
ative effect on the number of predicted bikes is shown.
We also train a random forest to predict the probability of cervical cancer based on risk factors, and visualize the accumulated local effects for two of the features:
Next, we look at the interaction between number of pregnancies and
age.
7.2.5 Advantages
ALE plots are unbiased, which means they still work when features are
correlated. Partial dependence plots fail in this scenario because they
marginalize over unlikely or even physically impossible combinations
of feature values.
FIGURE 7.16 PDP of the total effect of temperature and humidity on the
predicted number of bikes. The plot combines the main effect of each
of the features and their interaction effect, as opposed to the 2D-ALE
plot which only shows the interaction.
ALE plots are faster to compute than PDPs and scale with O(n), since
the largest possible number of intervals is the number of instances
with one interval per instance. The PDP requires n predictions for each grid point. For 20 grid points, PDPs require 20 times more predictions than the worst-case ALE plot, where as many intervals
The interpretation of ALE plots is clear: Conditional on a given value,
the relative effect of changing the feature on the prediction can be read
from the ALE plot. ALE plots are centered at zero. This makes their
interpretation nice, because the value at each point of the ALE curve is
the difference to the mean prediction. The 2D ALE plot only shows the
interaction: If two features do not interact, the plot shows nothing.
The entire prediction function can be decomposed into a sum of lower-dimensional components, see also the chapter on functional decomposition.
FIGURE 7.17 ALE plots for the effect of age and years with hormonal con-
traceptives on the predicted probability of cervical cancer. For the age
feature, the ALE plot shows that the predicted cancer probability is low
on average up to age 40 and increases after that. The number of years
with hormonal contraceptives is associated with a higher predicted
cancer risk after 8 years.
7.2.6 Disadvantages
An interpretation of the effect across intervals is not permissible if
the features are strongly correlated. Consider the case where your fea-
tures are highly correlated, and you are looking at the left end of a 1D-
ALE plot. The ALE curve might invite the following misinterpretation:
“The ALE curve shows how the prediction changes, on average, when
we gradually change the value of the respective feature for a data instance, while keeping the instance's other feature values fixed." The ef-
fects are computed per interval (locally) and therefore the interpreta-
tion of the effect can only be local. For convenience, the interval-wise
effects are accumulated to show a smooth curve, but keep in mind that
each interval is created with different data instances.
ALE effects may differ from the coefficients specified in a linear re-
gression model when features interact and are correlated. Grömping
(2020)5 showed that in a linear model with two correlated features and an additional interaction term ($\hat{f}(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$),
the first-order ALE plots do not show a straight line. Instead, they
are slightly curved because they incorporate parts of the multiplicative
interaction of the features. To understand what is happening here, I
recommend reading the chapter on function decomposition. In short,
ALE defines first-order (or 1D) effects differently than the linear for-
mula describes them. This is not necessarily wrong, because when fea-
tures are correlated, the attribution of interactions is not as clear. But
it is certainly unintuitive that the ALE and the linear coefficients do not match.
ALE plots can become a bit shaky (many small ups and downs) with a
high number of intervals. In this case, reducing the number of inter-
vals makes the estimates more stable, but also smoothes out and hides
some of the true complexity of the prediction model. There is no per-
fect solution for setting the number of intervals. If the number is too
small, the ALE plots might not be very accurate. If the number is too
high, the curve can become shaky.
Unlike PDPs, ALE plots are not accompanied by ICE curves. For PDPs,
ICE curves are great because they can reveal heterogeneity in the fea-
ture effect, which means that the effect of a feature looks different
for subsets of the data. For ALE plots you can only check per interval
whether the effect is different between the instances, but each interval
has different instances so it is not the same as ICE curves.
Second-order ALE estimates have a varying stability across the fea-
ture space, which is not visualized in any way. The reason for this is
that each estimation of a local effect in a cell uses a different number
of data instances. As a result, all estimates have a different accuracy
(but they are still the best possible estimates). The problem exists in a
less severe version for main effect ALE plots. The number of instances
is the same in all intervals, thanks to the use of quantiles as grid, but
in some areas there will be many short intervals and the ALE curve
will consist of many more estimates. But for long intervals, which can make up a big part of the entire curve, there are comparatively fewer instances. This happened in the cervical cancer prediction ALE plot for high age, for example.
5
Grömping, Ulrike. "Model-Agnostic Effects Plots for Interpreting Machine Learning Models." Reports in Mathematics, Physics and Chemistry: Department II, Beuth University of Applied Sciences Berlin. Report 1/2020 (2020).
Second-order effect plots can be a bit annoying to interpret, as you al-
ways have to keep the main effects in mind. It is tempting to read the
heat maps as the total effect of the two features, but it is only the addi-
tional effect of the interaction. The pure second-order effect is inter-
esting for discovering and exploring interactions, but for interpreting
what the effect looks like, I think it makes more sense to integrate the
main effects into the plot.
The implementation of ALE plots is much more complex and less in-
tuitive compared to partial dependence plots.
Even though ALE plots are not biased in case of correlated features, in-
terpretation remains difficult when features are strongly correlated.
Because if they have a very strong correlation, it only makes sense to
analyze the effect of changing both features together and not in isola-
tion. This disadvantage is not specific to ALE plots, but a general prob-
lem of strongly correlated features.
If the features are uncorrelated and computation time is not a prob-
lem, PDPs are slightly preferable because they are easier to under-
stand and can be plotted along with ICE curves.
The list of disadvantages has become quite long, but do not be fooled by the number of words I use. As a rule of thumb: use ALE instead of PDP.
ALE plots are implemented in R, once in the ALEPlot package by the inventor himself and once in the iml package7. ALE also has at least two Python implementations, with the ALEPython package8 and in Alibi9.
7
https://cran.r-project.org/web/packages/iml/index.html
8
https://github.com/blent-ai/ALEPython
9
https://docs.seldon.io/projects/alibi/en/stable/index.html
7.3 Feature Interaction
If a feature $j$ has no interaction with any of the other features, the prediction function can be expressed as a sum of partial dependence functions, where the first summand depends only on $x_j$ and the second on all other features:

$$\hat{f}(x) = PD_j(x_j) + PD_{-j}(x_{-j})$$

The H-statistic measures the share of variance that is explained by the interaction of feature $j$ with all other features:

$$H_j^2 = \frac{\sum_{i=1}^{n}\left[\hat{f}\big(x^{(i)}\big) - PD_j\big(x_j^{(i)}\big) - PD_{-j}\big(x_{-j}^{(i)}\big)\right]^2}{\sum_{i=1}^{n} \hat{f}^2\big(x^{(i)}\big)}$$
7.3.3 Examples
Let us see what feature interactions look like in practice! We measure
the interaction strength of features in a support vector machine that
predicts the number of rented bikes based on weather and calendrical
features. The following plot shows the feature interaction H-statistic:
FIGURE 7.19 The interaction strength (H-statistic) for each feature with
all other features for a support vector machine predicting bicycle
rentals. Overall, the interaction effects between the features are very
weak (below 10% of variance explained per feature).
FIGURE 7.20 The interaction strength (H-statistic) for each feature with
all other features for a random forest predicting the probability of cer-
vical cancer. The years on hormonal contraceptives has the highest rel-
ative interaction effect with all other features, followed by the number
of pregnancies.
7.3.4 Advantages
The interaction H-statistic has an underlying theory through the par-
tial dependence decomposition.
The H-statistic has a meaningful interpretation: The interaction is de-
fined as the share of variance that is explained by the interaction.
Since the statistic is dimensionless, it is comparable across features
and even across models.
The statistic detects all kinds of interactions, regardless of their par-
ticular form.
With the H-statistic it is also possible to analyze arbitrary higher-order interactions, such as the interaction strength between three or more features.
7.3.5 Disadvantages
The first thing you will notice: the interaction H-statistic takes a long time to compute, because it requires estimating partial dependence functions at every data point, which means many calls to the prediction function.
The computation involves estimating marginal distributions. These
estimates also have a certain variance if we do not use all data points.
This means that as we sample points, the estimates also vary from run
to run and the results can be unstable. I recommend repeating the H-
statistic computation a few times to see if you have enough data to get
a stable result.
It is unclear whether an interaction is significantly greater than 0. We
would need to conduct a statistical test, but this test is not (yet) avail-
able in a model-agnostic version.
Concerning the test problem, it is difficult to say when the H-statistic
is large enough for us to consider an interaction “strong”.
Also, the H-statistic can be larger than 1, which makes the interpreta-
tion difficult.
When the total effect of two features is weak, but mostly consists of interactions, then the H-statistic will be very large. Such spurious interactions arise because the denominator of the H-statistic is small, and they get worse when features are correlated. A spurious interaction can be eas-
ily overinterpreted as a strong interaction effect, when in reality both
features play a minor role in the model. A possible remedy is to visu-
alize the unnormalized version of the H-statistic, which is the square
root of the numerator of the H-statistic 11 . This scales the H-statistic
to the same level as the response, at least for regression, and puts less
emphasis on spurious interactions.
$$H^{*}_{jk} = \sqrt{\sum_{i=1}^{n}\left[PD_{jk}\big(x_j^{(i)}, x_k^{(i)}\big) - PD_j\big(x_j^{(i)}\big) - PD_k\big(x_k^{(i)}\big)\right]^2}$$
The H-statistic tells us the strength of interactions, but it does not tell
us how the interactions look like. That is what partial dependence plots
are for. A meaningful workflow is to measure the interaction strengths
and then create 2D-partial dependence plots for the interactions you
are interested in.
The H-statistic cannot be used meaningfully if the inputs are pixels, so the technique is not useful for image classifiers.
The interaction statistic works under the assumption that we can shuf-
fle features independently. If the features correlate strongly, the as-
sumption is violated and we integrate over feature combinations that
are very unlikely in reality. That is the same problem that partial dependence plots have. Correlated features can lead to large values of the H-statistic.
11
Inglis, Alan, Andrew Parnell, and Catherine Hurley. "Visualizing Variable Importance and Variable Interaction Effects in Machine Learning Models." arXiv preprint arXiv:2108.04310 (2021).
Sometimes the results are strange, and for small simulations they do not yield the expected results. But this is more of an anecdotal observation.
7.3.6 Implementations
For the examples in this book, I used the R package iml, which is avail-
able on CRAN12 and the development version on Github13 . There are
other implementations, which focus on specific models: The R pack-
age pre14 implements RuleFit and H-statistic. The R package gbm15 im-
plements gradient boosted models and H-statistic.
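A minimal sketch with iml, reusing the hypothetical `predictor` object from the earlier PDP sketch (note that this can take a while to run):

```r
# Sketch: H-statistic of each feature with all other features, and the
# two-way H-statistics of one chosen feature with each of the others
overall_ia <- Interaction$new(predictor)
plot(overall_ia)

twoway_ia <- Interaction$new(predictor, feature = "temp")
plot(twoway_ia)
```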
7.3.7 Alternatives
The H-statistic is not the only way to measure interactions:
Variable Interaction Networks (VIN) by Hooker (2004)16 is an approach
that decomposes the prediction function into main effects and feature
interactions. The interactions between features are then visualized as
a network. Unfortunately no software is available yet.
Partial dependence-based feature interaction by Greenwell et al.
(2018)17 measures the interaction between two features. This approach
measures the feature importance (defined as the variance of the partial
dependence function) of one feature conditional on different, fixed
points of the other feature. If the variance is high, then the features
interact with each other; if it is zero, they do not interact.
12
https://cran.r-project.org/web/packages/iml
13
https://github.com/christophM/iml
14
https://cran.r-project.org/web/packages/pre/index.html
15
https://github.com/gbm-developers/gbm3
16
Hooker, Giles. “Discovering additive structure in black box functions.” Proceed-
ings of the tenth ACM SIGKDD international conference on Knowledge discovery
and data mining. (2004).
17
Greenwell, Brandon M., Bradley C. Boehmke, and Andrew J. McCarthy. “A
simple and effective model-based variable importance measure.” arXiv preprint
arXiv:1805.04755 (2018).
7.4 Functional Decomposition
$$y = \hat{f}(x_1, x_2) = 2 + e^{x_1} - x_2 + x_1 \cdot x_2$$
This prediction function can be decomposed into an intercept, two main effects and an interaction term:

$$\hat{f}(x_1, x_2) = \hat{f}_0 + \hat{f}_1(x_1) + \hat{f}_2(x_2) + \hat{f}_{1,2}(x_1, x_2)$$
The main effects indicate how each feature affects the prediction, independent of the value of the other feature. The interaction effect indicates the joint effect of the features.
the prediction is when all feature effects are set to zero. Note that the
components themselves are functions (except for the intercept) with
different input dimensionality.
I will just give you the components now and explain where they come
from later. The intercept is 𝑓0̂ ∼ 3.18. Since the other components are
functions, we can visualize them:
Do you think the components make sense given the above true formula
ignoring the fact that the intercept takes on some random value? The
$x_1$ feature shows an exponential main effect, and $x_2$ shows a negative linear effect. The interaction term looks like a Pringles chip, in less culinary terms: a hyperbolic paraboloid (a saddle shape).
$$
\begin{aligned}
\hat{f}(x) = {} & \hat{f}_0 + \hat{f}_1(x_1) + \ldots + \hat{f}_p(x_p) \\
& + \hat{f}_{1,2}(x_1, x_2) + \ldots + \hat{f}_{1,p}(x_1, x_p) + \ldots + \hat{f}_{p-1,p}(x_{p-1}, x_p) \\
& + \ldots \\
& + \hat{f}_{1,\ldots,p}(x_1, \ldots, x_p)
\end{aligned}
$$
We can write the decomposition more compactly:

$$\hat{f}(x) = \sum_{S \subseteq \{1,\ldots,p\}} \hat{f}_S(x_S)$$
In the formula, 𝑥𝑆 is the vector of features in the index set 𝑆. And each
subset 𝑆 represents a functional component, for example a main effect
if S contains only one feature, or an interaction if |𝑆| > 1.
How many components are in the above formula? The answer boils down to how many possible subsets $S$ of the features $1, \ldots, p$ we can form. And these are $\sum_{i=0}^{p} \binom{p}{i} = 2^p$ possible subsets! For example, if a function uses 10 features, we can decompose the function into $2^{10} = 1024$ components: 1 intercept, 10 main effects, 45 2-way interaction terms, 120 3-way interaction terms, and so on. And with each additional feature, the number of components doubles. Clearly, for most functions, it is not feasible to compute all components. Another reason NOT to compute all components is that components with $|S| > 2$ are difficult to visualize and interpret.
$$\hat{f}(x) = \sum_{S \subseteq \{1,\ldots,p\}} \hat{f}_S(x_S)$$
Okay, let us take this thing apart. We can rewrite the component as:
$$\hat{f}_S(x) = \int_{X_{-S}} \hat{f}(x)\, dX_{-S} \; - \; \int_{X_{-S}} \Big(\sum_{V \subset S} \hat{f}_V(x)\Big)\, dX_{-S}$$
On the left side is the integral over the prediction function with respect
to the features excluded from the set 𝑆, denoted with −𝑆. For exam-
ple, if we compute the 2-way interaction component for features 2 and
3, we would integrate over features 1, 4, 5, … The integral can also be
viewed as the expected value of the prediction function with respect
to $X_{-S}$, assuming that all features follow a uniform distribution from their minimum to their maximum. From this integral, we subtract all components with subsets of $S$. This subtraction removes the effect of
all lower order effects and centers the effect. For 𝑆 = {2, 3}, we sub-
tract the main effects of both features 𝑓2̂ and 𝑓3̂ , as well as the inter-
cept $\hat{f}_0$. The occurrence of these lower-order effects makes the formula recursive: We have to go through the hierarchy of subsets down to the intercept and compute all these components. For the intercept component $\hat{f}_0$, the subset is the empty set $S = \emptyset$, and therefore $-S$ contains all features:
$$\hat{f}_0 = \int_{X} \hat{f}(x)\, dX$$
This is simply the prediction function integrated over all features. The
intercept can also be interpreted as the expectation of the prediction
function when we assume that all features are uniformly distributed.
Now that we know 𝑓0̂ , we can compute 𝑓1̂ (and equivalently 𝑓2̂ ):
$$\hat{f}_1(x) = \int_{X_{-1}} \left(\hat{f}(x) - \hat{f}_0\right)\, dX_{-1}$$
For the interaction component of features 1 and 2 (here in an example with four features), we integrate over the remaining features and remove the lower-order components:

$$\hat{f}_{1,2}(x) = \int_{X_{3,4}} \left(\hat{f}(x) - \big(\hat{f}_0 + \hat{f}_1(x) - \hat{f}_0 + \hat{f}_2(x) - \hat{f}_0\big)\right)\, dX_{3,4} = \int_{X_{3,4}} \left(\hat{f}(x) - \hat{f}_1(x) - \hat{f}_2(x) + \hat{f}_0\right)\, dX_{3,4}$$
This example shows how each higher order effect is defined by inte-
grating over all other features, but also by removing all the lower order
effects that are subsets of the feature set we are interested in.
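For the toy function from the beginning of this section, these integrals can be approximated by simple Monte Carlo averaging; a sketch, assuming both features are uniformly distributed on [-1, 1] (an assumption under which the intercept comes out at roughly 3.18):

```r
# Sketch: Monte Carlo estimates of the intercept and a main effect for the toy
# function f(x1, x2) = 2 + exp(x1) - x2 + x1 * x2, assuming x1, x2 ~ Uniform(-1, 1)
set.seed(1)
f  <- function(x1, x2) 2 + exp(x1) - x2 + x1 * x2
x1 <- runif(1e5, -1, 1)
x2 <- runif(1e5, -1, 1)

f0 <- mean(f(x1, x2))                   # intercept: expectation over all features (~3.18)
f1 <- function(v) mean(f(v, x2)) - f0   # main effect of x1 at value v, lower order removed
f1(0.5)
```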
The zero means axiom implies that all effects or interactions are cen-
tered around zero. As a consequence, the interpretation at a position
x is relative to the centered prediction and not the absolute prediction.
The orthogonality axiom implies that components do not share infor-
mation. For example, the first-order effect of feature 𝑋1 and the in-
teraction term of 𝑋1 and 𝑋2 are not correlated. Because of orthog-
onality, all components are “pure” in the sense that they do not mix
effects. It makes a lot of sense that the component for, say, feature
𝑋4 should be independent of the interaction term between features
𝑋1 and 𝑋2 . The more interesting consequence arises for orthogonality
of hierarchical components, where one component contains features
of another, for example the interaction between 𝑋1 and 𝑋2 , and the
main effect of feature 𝑋1 . In contrast, a two-dimensional partial de-
pendence plot for 𝑋1 and 𝑋2 would contain four effects: the intercept,
the two main effects of 𝑋1 and 𝑋2 and the interaction between them.
The functional ANOVA component $\hat{f}_{1,2}(x_1, x_2)$ contains only the pure interaction.
Variance decomposition allows us to divide the variance of the func-
tion 𝑓 among the components, and guarantees that it adds up the
total variance of the function in the end. The variance decomposi-
tion property can also explain to us why the method is called “func-
tional ANOVA”. In statistics, ANOVA stands for ANalysis Of VAriance.
ANOVA refers to a collection of methods that analyze differences in
the mean of a target variable. ANOVA works by dividing the variance
and attributing it to the variables. Functional ANOVA can therefore be
seen as an extension of this concept to any function.
Problems arise with the functional ANOVA when features are correlated.
What happens if we combine ALE operators for different feature sets? For example, $H_{1,2}$ and $H_1$, or $H_{1,2}$ and $H_{3,4,5}$? The answer is zero. If we first apply the ALE operator $H_S$ to a
function and then apply 𝐻𝑈 to the result (with 𝑆 ≠ 𝑈 ), then the re-
sult is zero. In other words, the ALE plot of an ALE plot is zero, unless
you apply the same ALE plot twice. Put differently, the ALE plot for the feature set S does not contain any other ALE plots in it. Or in
mathematical terms, the ALE operator maps functions to orthogonal
subspaces of an inner product space.
As Apley and Zhu (2021) note, pseudo-orthogonality may be more de-
sirable than hierarchical orthogonality, because it does not entangle
marginal effects of the features. Furthermore, ALE does not require
estimation of the joint distribution; the components can be estimated
in a hierarchical manner, which means that calculating the 2D ALE for
features 1 and 2 requires only the calculations of individual ALE com-
ponents of 1 and 2 and the intercept term in addition.
A linear regression model, for example, is itself a simple functional decomposition with an intercept and main effects only:

$$\hat{f}(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$

Simply adding up one-dimensional partial dependence plots, in contrast, does not in general yield the original prediction function, and thus is not a valid decomposition. But
could we adjust the PDP, perhaps by removing all lower effects? Yes,
we could, but then we would get something similar to the functional
ANOVA. However, instead of integrating over a uniform distribution,
the PDP integrates over the marginal distribution of 𝑋−𝑆 , which is
estimated using Monte Carlo sampling.
7.4.9 Advantages
I consider functional decomposition to be a core concept of machine
learning interpretability.
Functional decomposition gives us a theoretical justification for de-
composing high-dimensional and complex machine learning models
into individual effects and interactions – a necessary step that allows
us to interpret individual effects. Functional decomposition is the core
idea for techniques such as statistical regression models, ALE, (gener-
alized) functional ANOVA, PDP, the H-statistic, and ICE curves.
Functional decomposition also provides a better understanding of
other methods. For example, permutation feature importance breaks
the association between a feature and the target. Viewed through the
functional decomposition lens, we can see that the permutation “de-
stroys” the effect of all components in which the feature was involved.
This affects the main effect of the feature, but also all interactions with
other features. As another example, Shapley values decompose a pre-
diction into additive effects of the individual features. But the func-
tional decomposition tells us that there should also be interaction ef-
fects in the decomposition, so where are they? Shapley values provide
a fair attribution of effects to the individual features, meaning that all
interactions are also fairly attributed to the features and therefore di-
vided up among the Shapley values.
When considering functional decomposition as a tool, the use of ALE
plots offers many advantages. ALE plots provide a functional decom-
position that is fast to compute, has software implementations (see
the ALE chapter), and desirable pseudo-orthogonality properties.
7.4.10 Disadvantages
The concept of functional decomposition quickly reaches its limits for
high-dimensional components beyond interactions between two fea-
tures. Not only does this exponential explosion in the number of fea-
tures limit practicability, since we cannot easily visualize higher order
interactions, but computational time is insane if we were to compute
all interactions.
Each functional decomposition method has its own shortcomings. The bottom-
up approach – constructing regression models – has the disadvantage
of being a quite manual process that imposes many constraints on the
model that can affect predictive performance. Functional ANOVA re-
quires independent features. Generalized functional ANOVA is very
difficult to estimate. Accumulated local effect plots do not provide a
variance decomposition.
The functional decomposition approach is more appropriate for
analysing tabular data than text or images.
7.5 Permutation Feature Importance
7.5.1 Theory
The concept is really straightforward: We measure the importance of
a feature by calculating the increase in the model’s prediction error
after permuting the feature. A feature is “important” if shuffling its
values increases the model error, because in this case the model re-
lied on the feature for the prediction. A feature is “unimportant” if
shuffling its values leaves the model error unchanged, because in this
case the model ignored the feature for the prediction. The permutation
feature importance measurement was introduced by Breiman (2001)23
for random forests. Based on this idea, Fisher, Rudin, and Dominici
(2018)24 proposed a model-agnostic version of the feature importance
and called it model reliance. They also introduced more advanced ideas
about feature importance, for example a (model-specific) version that
takes into account that many prediction models may predict the data
well. Their paper is worth reading.
The permutation feature importance algorithm based on Fisher,
Rudin, and Dominici (2018):
Input: Trained model $\hat{f}$, feature matrix $X$, target vector $y$, error measure $L(y, \hat{f})$.
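A minimal sketch of the procedure for a regression model, with the error measured as mean absolute error and the importance reported as an error ratio; model, data and predict function are placeholders:

```r
# Sketch: model-agnostic permutation feature importance as an error ratio
permutation_importance <- function(model, X, y,
                                   loss = function(y, pred) mean(abs(y - pred)),
                                   predict_fun = predict) {
  error_orig <- loss(y, predict_fun(model, X))
  sapply(names(X), function(feature) {
    X_perm <- X
    X_perm[[feature]] <- sample(X_perm[[feature]])     # break the feature-target association
    loss(y, predict_fun(model, X_perm)) / error_orig   # > 1 means the feature mattered
  })
}
```

Repeating the permutation several times and averaging gives more stable estimates.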
Fisher, Rudin, and Dominici (2018) suggest in their paper to split the
dataset in half and swap the values of feature j of the two halves in-
stead of permuting feature j. This is exactly the same as permuting
feature j, if you think about it. If you want a more accurate estimate,
you can estimate the error of permuting feature j by pairing each in-
stance with the value of feature j of each other instance (except with
itself). This gives you a dataset of size n(n-1) to estimate the permuta-
tion error, and it takes a large amount of computation time. I can only
recommend using the n(n-1) method if you are serious about getting
extremely accurate estimates.
The case for test data
It is one of the first things you learn in machine learning: If you measure the
model error (or performance) on the same data on which the model
was trained, the measurement is usually too optimistic, which means
that the model seems to work much better than it does in reality. And
since the permutation feature importance relies on measurements of
the model error, we should use unseen test data. The feature impor-
tance based on training data makes us mistakenly believe that features
are important for the predictions, when in reality the model was just
overfitting and the features were not important at all.
The case for training data
The arguments for using training data are somewhat more difficult to
formulate, but are IMHO just as compelling as the arguments for us-
ing test data. We take another look at our garbage SVM. Based on the
training data, the most important feature was X42. Let us look at a par-
tial dependence plot of feature X42. The partial dependence plot shows
how the model output changes based on changes of the feature and
does not rely on the generalization error. It does not matter whether
the PDP is computed with training or test data.
The plot clearly shows that the SVM has learned to rely on feature X42
for its predictions, but according to the feature importance based on
the test data (a value of about 1), it is not important. Based on the training data, the
importance is 1.19, reflecting that the model has learned to use this
feature. Feature importance based on the training data tells us which
features are important for the model in the sense that it depends on
them for making predictions.
As part of the case for using training data, I would like to introduce
an argument against test data. In practice, you want to use all your
data to train your model to get the best possible model in the end. This
means no unused test data is left to compute the feature importance.
You have the same problem when you want to estimate the generaliza-
tion error of your model. If you were to use (nested) cross-validation for
the feature importance estimation, you would have the problem that
the feature importance is not calculated on the final model with all the
data, but on models with subsets of the data that might behave differ-
ently.
FIGURE 7.27 The importance for each of the features in predicting bike
counts with a support vector machine. The most important feature
was temp, the least important was holiday.
7.5.4 Advantages
Nice interpretation: Feature importance is the increase in model error
when the feature’s information is destroyed.
Feature importance provides a highly compressed, global insight into
the model’s behavior.
A positive aspect of using the error ratio instead of the error difference
is that the feature importance measurements are comparable across
different problems.
The importance measure automatically takes into account all interac-
tions with other features. By permuting the feature you also destroy
the interaction effects with other features. This means that the per-
mutation feature importance takes into account both the main feature
effect and the interaction effects on model performance. This is also a
disadvantage, because the importance of the interaction between two
features is included in the importance measurements of both features.
7.5.5 Disadvantages
Permutation feature importance is linked to the error of the model.
This is not inherently bad, but in some cases not what you need. In
some cases, you might prefer to know how much the model’s out-
put varies for a feature without considering what it means for perfor-
mance. For example, you want to find out how robust your model's output
is when someone manipulates the features. In this case, you would not be
interested in how much the model performance decreases when a feature is
permuted, but in how much of the model's output variance is explained by
each feature.
7.5.6 Alternatives
An algorithm called PIMP25 adapts the permutation feature impor-
tance algorithm to provide p-values for the importances. Another loss-based
alternative is to omit the feature from the training data, retrain the
model, and measure the increase in loss. Permuting a fea-
ture and measuring the increase in loss is not the only way to mea-
sure the importance of a feature. The different importance measures
can be divided into model-specific and model-agnostic methods. The
Gini importance for random forests or standardized regression coefficients
for regression models are examples of model-specific importance measures.

25. https://academic.oup.com/bioinformatics/article/26/10/1340/193348
7.5.7 Software
The iml R package was used for the examples. The R packages DALEX
and vip, as well as the Python libraries alibi, scikit-learn and rfpimp,
also implement model-agnostic permutation feature importance.
26. Wei, Pengfei, Zhenzhou Lu, and Jingwen Song. "Variable importance analysis: a comprehensive review." Reliability Engineering & System Safety 142 (2015): 399-432.
7.6 Global Surrogate
7.6.1 Theory
Surrogate models are also used in engineering: If an outcome of in-
terest is expensive, time-consuming or otherwise difficult to measure
(e.g. because it comes from a complex computer simulation), a cheap
and fast surrogate model of the outcome can be used instead. The dif-
ference between the surrogate models used in engineering and in in-
terpretable machine learning is that the underlying model is a ma-
chine learning model (not a simulation) and that the surrogate model
must be interpretable. The purpose of (interpretable) surrogate mod-
els is to approximate the predictions of the underlying model as accu-
rately as possible and to be interpretable at the same time. The idea of
surrogate models can be found under different names: Approximation
model, metamodel, response surface model, emulator, …
About the theory: There is actually not much theory needed to under-
stand surrogate models. We want to approximate our black box pre-
diction function f as closely as possible with the surrogate model pre-
diction function g, under the constraint that g is interpretable. For
the function g any interpretable model – for example from the inter-
pretable models chapter – can be used.
For example a linear model:
$$g(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$
Or a decision tree:
$$g(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\}$$
1. Select a dataset X. This can be the same dataset that was used
for training the black box model or a new dataset from the
same distribution. You could even select a subset of the data
or a grid of points, depending on your application.
2. For the selected dataset X, get the predictions of the black box
model.
3. Select an interpretable model type (linear model, decision
tree, …).
4. Train the interpretable model on the dataset X and its predic-
tions.
5. Congratulations! You now have a surrogate model.
6. Measure how well the surrogate model replicates the predic-
tions of the black box model.
7. Interpret the surrogate model.
You may find approaches for surrogate models that have some extra
steps or differ a little, but the general idea is usually as described here.
One way to measure how well the surrogate replicates the black box
model is the R-squared measure:
$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_*^{(i)} - \hat{y}^{(i)}\right)^2}{\sum_{i=1}^{n}\left(\hat{y}^{(i)} - \bar{\hat{y}}\right)^2}$$
where $\hat{y}_*^{(i)}$ is the prediction for the i-th instance of the surrogate model,
$\hat{y}^{(i)}$ the prediction of the black box model and $\bar{\hat{y}}$ the mean of the black
box model predictions. SSE stands for sum of squares error and SST
for sum of squares total. The R-squared measure can be interpreted
as the percentage of variance that is captured by the surrogate model.
If R-squared is close to 1 (= low SSE), then the interpretable model ap-
proximates the behavior of the black box model very well. If the inter-
pretable model is very close, you might want to replace the complex
model with the interpretable model. If the R-squared is close to 0 (=
high SSE), then the interpretable model fails to explain the black box
model.
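A minimal sketch of the recipe above, assuming scikit-learn and a synthetic regression dataset (all names are illustrative, not the book's code):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Steps 1-2: a dataset and the black box predictions for it
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
black_box = RandomForestRegressor(random_state=0).fit(X, y)
y_black_box = black_box.predict(X)

# Steps 3-5: train an interpretable model on the black box predictions
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_black_box)

# Step 6: how well does the surrogate replicate the black box predictions?
print("Surrogate R^2 w.r.t. black box:", r2_score(y_black_box, surrogate.predict(X)))
```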
Note that we have not talked about the model performance of the un-
derlying black box model, i.e. how good or bad it performs in predict-
ing the actual outcome. The performance of the black box model does
not play a role in training the surrogate model. The interpretation of
the surrogate model is still valid because it makes statements about
the model and not about the real world. But of course the interpreta-
tion of the surrogate model becomes irrelevant if the black box model
is bad, because then the black box model itself is irrelevant.
We could also build a surrogate model based on a subset of the orig-
inal data or reweight the instances. In this way, we change the dis-
tribution of the surrogate model’s input, which changes the focus of
the interpretation (then it is no longer really global). If we weight the
data locally by a specific instance of the data (the closer the instances
to the selected instance, the higher their weight), we get a local surro-
gate model that can explain the individual prediction of the instance.
Read more about local models in the following chapter.
7.6.2 Example
To demonstrate the surrogate models, we consider a regression and a
classification example.
First, we train a support vector machine to predict the daily number
of rented bikes given weather and calendar information. The support
vector machine is not very interpretable, so we train a surrogate with a
CART decision tree as interpretable model to approximate the behav-
ior of the support vector machine.
You should not overinterpret the tree when drawing conclusions about the
complex model.
7.6.3 Advantages
The surrogate model method is flexible: Any model from the inter-
pretable models chapter can be used. This also means that you can ex-
change not only the interpretable model, but also the underlying black
box model. Suppose you create some complex model and explain it
to different teams in your company. One team is familiar with linear
models, the other team can understand decision trees. You can train
two surrogate models (linear model and decision tree) for the origi-
nal black box model and offer two kinds of explanations. If you find
a better performing black box model, you do not have to change your
method of interpretation, because you can use the same class of sur-
rogate models.
I would argue that the approach is very intuitive and straightforward.
This means it is easy to implement, but also easy to explain to people
not familiar with data science or machine learning.
With the R-squared measure, we can easily measure how good our sur-
rogate models are in approximating the black box predictions.
7.6.4 Disadvantages
You have to be aware that you draw conclusions about the model and
not about the data, since the surrogate model never sees the real out-
come.
It is not clear what the best cut-off for R-squared is in order to be con-
fident that the surrogate model is close enough to the black box model.
80% of variance explained? 50%? 99%?
We can measure how close the surrogate model is to the black box
model. Let us assume we are not very close, but close enough. It could
happen that the interpretable model is very close for one subset of the
dataset, but widely divergent for another subset. In this case the inter-
pretation for the simple model would not be equally good for all data
points.
The interpretable model you choose as a surrogate comes with all its
advantages and disadvantages.
Some people argue that there are, in general, no intrinsically inter-
pretable models (including even linear models and decision trees) and
that it would even be dangerous to have an illusion of interpretability.
If you share this opinion, then of course this method is not for you.
7.6.5 Software
I used the iml R package for the examples. If you can train a machine
learning model, then you should be able to implement surrogate mod-
els yourself. Simply train an interpretable model to predict the predic-
tions of the black box model.
7.7 Prototypes and Criticisms
FIGURE 7.30 Prototypes and criticisms for a data distribution with two
features x1 and x2.
7.7.1 Theory
At a high level, the MMD-critic procedure selects prototypes that minimize the
discrepancy between the distribution of the prototypes and the distribution of
the data, and then selects criticisms at points where the two distributions
diverge. To decide where they diverge, MMD-critic uses a witness function,
which tells you how different two density estimates are at a particular point x:
$$\text{witness}(x) = \frac{1}{n}\sum_{i=1}^{n} k(x, x_i) - \frac{1}{m}\sum_{j=1}^{m} k(x, z_j)$$
For two datasets (with the same features), the witness function gives
you the means of evaluating in which empirical distribution the point
x fits better. To find criticisms, we look for extreme values of the wit-
ness function in both negative and positive directions. The first term
in the witness function is the average proximity between point x and
the data, and, respectively, the second term is the average proximity
between point x and the prototypes. If the witness function for a point
x is close to zero, the density function of the data and the prototypes
are close together, which means that the distribution of prototypes re-
sembles the distribution of the data at point x. A negative witness func-
tion at point x means that the prototype distribution overestimates
the data distribution (for example if we select a prototype but there
are only few data points nearby); a positive witness function at point
x means that the prototype distribution underestimates the data dis-
tribution (for example if there are many data points around x but we
have not selected any prototypes nearby).
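A minimal sketch of the witness function with an RBF kernel, assuming NumPy arrays for the data and the prototypes (the kernel choice and names are my own):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel between two points."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def witness(x, data, prototypes, gamma=1.0):
    """Average similarity of x to the data minus average similarity of x to the prototypes."""
    k_data = np.mean([rbf_kernel(x, xi, gamma) for xi in data])
    k_proto = np.mean([rbf_kernel(x, zj, gamma) for zj in prototypes])
    return k_data - k_proto
```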
To give you more intuition, let us reuse the prototypes from the plot be-
forehand with the lowest MMD2 and display the witness function for a
few manually selected points. The labels in the following plot show the
value of the witness function for various points marked as triangles.
Only the point in the middle has a high absolute value and is therefore
a good candidate for a criticism.
Criticisms are points with high absolute value in the witness function. Like prototypes, crit-
icisms are also found through greedy search. But instead of reducing
the overall MMD2, we are looking for points that maximize a cost func-
tion that includes the witness function and a regularizer term. The ad-
ditional term in the optimization function enforces diversity in the
points, which is needed so that the points come from different clus-
ters.
This second step is independent of how the prototypes are found. I
could also have handpicked some prototypes and used the procedure
described here to learn criticisms. Or the prototypes could come from
any clustering procedure, like k-medoids.
That is it with the important parts of MMD-critic theory. One ques-
tion remains: How can MMD-critic be used for interpretable machine
learning?
MMD-critic can add interpretability in three ways: By helping to better
understand the data distribution; by building an interpretable model;
by making a black box model interpretable.
If you apply MMD-critic to your data to find prototypes and criticisms,
it will improve your understanding of the data, especially if you have a
complex data distribution with edge cases. But with MMD-critic you
can achieve more!
For example, you can create an interpretable prediction model: a so-
called “nearest prototype model”. The prediction function is defined
as:
$$\hat{f}(x) = \arg\max_{i \in S} k(x, x_i)$$
which means that we select the prototype i from the set of prototypes
S that is closest to the new data point, in the sense that it yields the
highest value of the kernel function. The prototype itself is returned
as an explanation for the prediction. This procedure has three tuning
parameters: The type of kernel, the kernel scaling parameter and the
number of prototypes. All parameters can be optimized within a cross
validation loop. The criticisms are not used in this approach.
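A sketch of such a nearest prototype model under these assumptions (an RBF kernel and NumPy arrays; names are illustrative):

```python
import numpy as np

def nearest_prototype_predict(x, prototypes, prototype_labels, gamma=1.0):
    """Return the label of the most similar prototype and the prototype itself as explanation."""
    similarities = np.array([np.exp(-gamma * np.sum((x - z) ** 2)) for z in prototypes])
    best = int(np.argmax(similarities))
    return prototype_labels[best], prototypes[best]
```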
How does that help? Remember when Google’s image classifier identi-
fied black people as gorillas? Perhaps they should have used the proce-
dure described here before deploying their image recognition model.
It is not enough just to check the performance of the model, because
if it were 99% correct, this issue could still be in the 1%. And labels can
also be wrong! Going through all the training data and performing a
sanity check if the prediction is problematic might have revealed the
problem, but would be infeasible. But the selection of – say a few thou-
sand – prototypes and criticisms is feasible and could have revealed
a problem with the data: It might have shown that there is a lack of
images of people with dark skin, which indicates a problem with the
diversity in the dataset. Or it could have shown one or more images
of a person with dark skin as a prototype or (probably) as a criticism
with the notorious "gorilla" classification. I cannot promise that MMD-critic
would have caught these kinds of mistakes, but it is a good sanity check.
7.7.2 Examples
The following example of MMD-critic uses a handwritten digit
dataset.
Looking at the actual prototypes, you might notice that the number
of images per digit is different. This is because a fixed number of pro-
totypes were searched across the entire dataset and not with a fixed
number per class.
7.7.3 Advantages
In a user study the authors of MMD-critic gave images to the partici-
pants, which they had to visually match to one of two sets of images,
each representing one of two classes (e.g. two dog breeds). The par-
ticipants performed best when the sets showed prototypes and criti-
cisms instead of random images of a class.
You are free to choose the number of prototypes and criticisms.
MMD-critic works with density estimates of the data. This works with
any type of data and any type of machine learning model.
The algorithm is easy to implement.
MMD-critic is very flexible in the way it is used to increase inter-
pretability. It can be used to understand complex data distributions.
It can be used to build an interpretable machine learning model. Or it
can shed light on the decision making of a black box machine learning
model.
7.7.4 Disadvantages
While, mathematically, prototypes and criticisms are defined differ-
ently, their distinction is based on a cut-off value (the number of pro-
totypes). Suppose you choose too low a number of prototypes to cover
the data distribution. The criticisms would end up in the areas that
are not that well explained. But if you were to add more prototypes
they would also end up in the same areas. Any interpretation has to
take into account that criticisms strongly depend on the existing pro-
totypes and the (arbitrary) cut-off value for the number of prototypes.
You have to choose the number of prototypes and criticisms. As much
as this can be nice-to-have, it is also a disadvantage. How many proto-
types and criticisms do we actually need? The more the better? The fewer
the better? One solution is to select the number of prototypes and crit-
icisms by measuring how much time humans have for the task of look-
ing at the images, which depends on the particular application. Only
when using MMD-critic to build a classifier do we have a way to opti-
mize it directly. One solution could be a scree plot showing the number
of prototypes on the x-axis and the MMD2 measure on the y-axis. We
would choose the number of prototypes where the MMD2 curve flat-
tens.
The other parameters are the choice of the kernel and the kernel scal-
ing parameter. We have the same problem as with the number of pro-
totypes and criticisms: How do we select a kernel and its scaling pa-
rameter? Again, when we use MMD-critic as a nearest prototype clas-
sifier, we can tune the kernel parameters. For the unsupervised use
cases of MMD-critic, however, it is unclear. (Maybe I am a bit harsh
here, since all unsupervised methods have this problem.)
It takes all the features as input, disregarding the fact that some fea-
tures might not be relevant for predicting the outcome of interest.
One solution is to use only relevant features, for example image embeddings
instead of raw pixels.
28. https://github.com/BeenKim/MMD-critic
29. https://arxiv.org/pdf/1707.01212.pdf
30. https://github.com/Trusted-AI/AIX360
31. https://en.wikipedia.org/wiki/K-medoids
32. Kaufman, Leonard, and Peter Rousseeuw. "Clustering by means of medoids." North-Holland (1987).
8 Local Model-Agnostic Methods
8.1 Individual Conditional Expectation (ICE)
8.1.1 Examples
Let’s go back to the cervical cancer dataset and see how the prediction
of each instance is associated with the feature “Age”. We will analyze
a random forest that predicts the probability of cancer for a woman
given risk factors. In the partial dependence plot we have seen that
the cancer probability increases around the age of 50, but is this true
for every woman in the dataset? The ICE plot reveals that for most
women the age effect follows the average pattern of an increase at age
50, but there are some exceptions: For the few women that have a high
predicted probability at a young age, the predicted cancer probability
does not change much with age.
FIGURE 8.1 ICE plot of cervical cancer probability by age. Each line rep-
resents one woman. For most women there is an increase in predicted
cancer probability with increasing age. For some women with a pre-
dicted cancer probability above 0.4, the prediction does not change
much at higher age.
The next figure shows ICE plots for the bike rental prediction. The un-
derlying prediction model is a random forest.
All curves seem to follow the same course, so there are no obvious in-
teractions. That means that the PDP is already a good summary of the
relationships between the displayed features and the predicted num-
ber of bicycles.
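A minimal sketch of how ICE curves can be computed for one feature, assuming a NumPy feature matrix and a model with a predict method (names are my own):

```python
import numpy as np

def ice_curves(model, X, feature, grid):
    """One ICE curve per row of X: vary `feature` over `grid`, keep all other features fixed."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, k] = model.predict(X_mod)
    return curves

# Plot each row as one line; the column-wise mean of the curves is the PDP.
# Centered ICE (c-ICE): curves - curves[:, [0]] anchors every line at the first grid value.
```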
8.1.1.1 Centered ICE Plot
The centered ICE curves are defined as:

$$\hat{f}_{cent}^{(i)} = \hat{f}^{(i)} - \mathbf{1}\,\hat{f}(x^a, x_C^{(i)})$$

where $x^a$ is the anchor point and $\mathbf{1}$ a vector of ones of the appropriate length.
For example, take the cervical cancer ICE plot for age and center the
lines on the youngest observed age:
FIGURE 8.3 Centered ICE plot for predicted cancer probability by age.
Lines are fixed to 0 at age 14. Compared to age 14, the predictions
for most women remain unchanged until the age of 45 where the pre-
dicted probability increases.
The centered ICE plots make it easier to compare the curves of individ-
ual instances. This can be useful if we do not want to see the absolute
change of a predicted value, but the difference in the prediction com-
pared to a fixed point of the feature range.
Let’s have a look at centered ICE plots for the bicycle rental prediction:
8.1.1.2 Derivative ICE Plot
If the analyzed feature has no interactions with the other features, the prediction function can be expressed as:

$$\hat{f}(x) = \hat{f}(x_S, x_C) = g(x_S) + h(x_C), \quad \text{with} \quad \frac{\delta \hat{f}(x)}{\delta x_S} = g'(x_S)$$

Without interactions, the individual partial derivatives are the
same for all instances. If they differ, it is due to interactions and it be-
comes visible in the d-ICE plot. In addition to displaying the individ-
ual curves for the derivative of the prediction function with respect to
the feature in S, showing the standard deviation of the derivative helps
to highlight regions in feature in S with heterogeneity in the estimated
derivatives. The derivative ICE plot takes a long time to compute and
is rather impractical.
8.1.2 Advantages
Individual conditional expectation curves are even more intuitive to
understand than partial dependence plots. One line represents the
predictions for one instance if we vary the feature of interest.
Unlike partial dependence plots, ICE curves can uncover heteroge-
neous relationships.
8.1.3 Disadvantages
ICE curves can only display one feature meaningfully, because two
features would require the drawing of several overlaying surfaces and
you would not see anything in the plot.
ICE curves suffer from the same problem as PDPs: If the feature of
interest is correlated with the other features, then some points in the
lines might be invalid data points according to the joint feature distri-
bution.
If many ICE curves are drawn, the plot can become overcrowded and
you will not see anything. The solution: Either add some transparency
to the lines or draw only a sample of the lines.
In ICE plots it might not be easy to see the average. This has a simple
solution: Combine individual conditional expectation curves with the
partial dependence plot.
2. Goldstein, Alex, et al. "Package 'ICEbox'." (2017).
8.2 Local Surrogate (LIME)
3. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Why should I trust you?: Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM (2016).
The explanation model for instance x is the model g (e.g. linear regres-
sion model) that minimizes loss L (e.g. mean squared error), which
measures how close the explanation is to the prediction of the original
model f (e.g. an xgboost model), while the model complexity Ω(𝑔) is
kept low (e.g. prefer fewer features). G is the family of possible expla-
nations, for example all possible linear regression models. The proxim-
ity measure 𝜋𝑥 defines how large the neighborhood around instance
x is that we consider for the explanation. In practice, LIME only opti-
mizes the loss part. The user has to determine the complexity, e.g. by
selecting the maximum number of features that the linear regression
model may use.
The recipe for training local surrogate models (a small sketch follows the list):
• Select your instance of interest for which you want to have an expla-
nation of its black box prediction.
• Perturb your dataset and get the black box predictions for these new
points.
• Weight the new samples according to their proximity to the instance
of interest.
• Train a weighted, interpretable model on the dataset with the varia-
tions.
• Explain the prediction by interpreting the local model.
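A minimal sketch of this recipe for tabular data, assuming a scikit-learn-style black box, Gaussian sampling per feature, and a hand-picked kernel width; it skips the K-feature selection step (all names are illustrative, not LIME's actual implementation):

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(black_box, x, X_train, n_samples=5000, kernel_width=0.75):
    """Fit a weighted linear model around instance x (LIME-style sketch for tabular data)."""
    rng = np.random.default_rng(0)
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Z = rng.normal(mu, sigma, size=(n_samples, X_train.shape[1]))  # perturbed samples
    y_z = black_box.predict(Z)                                     # black box predictions
    distances = np.sqrt(((Z - x) ** 2).sum(axis=1))
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)        # exponential smoothing kernel
    local_model = Ridge(alpha=1.0)
    local_model.fit(Z, y_z, sample_weight=weights)
    return local_model.coef_                                       # local feature effects around x
```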
In the current implementations in R4 and Python5 , for example, lin-
ear regression can be chosen as interpretable surrogate model. In ad-
vance, you have to select K, the number of features you want to have
in your interpretable model. The lower K, the easier it is to interpret
the model. A higher K potentially produces models with higher fidelity.
There are several methods for training models with exactly K features.
A good choice is Lasso. A Lasso model with a high regularization pa-
rameter 𝜆 yields a model without any feature. By retraining the Lasso
models with slowly decreasing 𝜆, one after the other, the features get
weight estimates that differ from zero. If there are K features in the
model, you have reached the desired number of features. Other strate-
gies are forward or backward selection of features. This means you either
start with the full model (= containing all features) or with a model with
only the intercept, and then test which feature would bring the biggest
improvement when added or removed, until a model with K features is reached.

4. https://github.com/thomasp85/lime
5. https://github.com/marcotcr/lime
How do you get the variations of the data? This depends on the type
of data, which can be either text, image or tabular data. For text and
images, the solution is to turn single words or super-pixels on or off. In
the case of tabular data, LIME creates new samples by perturbing each
feature individually, drawing from a normal distribution with mean
and standard deviation taken from the feature.
FIGURE 8.5 LIME algorithm for tabular data. A) Random forest predic-
tions given features x1 and x2. Predicted classes: 1 (dark) or 0 (light). B)
Instance of interest (big dot) and data sampled from a normal distribu-
tion (small dots). C) Assign higher weight to points near the instance
of interest. D) Signs of the grid show the classifications of the locally
learned model from the weighted samples. The white line marks the
decision boundary (P(class=1) = 0.5).
There is no good way to find the best kernel or width. And where does the 0.75
even come from? In certain scenarios, you can easily turn your expla-
nation around by changing the kernel width, as shown in the following
figure:
8.2.1.1 Example
FIGURE 8.7 LIME explanations for two instances of the bike rental
dataset. Warmer temperature and good weather situation have a pos-
itive effect on the prediction. The x-axis shows the feature effect: The
weight times the actual feature value.
8.2.2.1 Example
Let us look at two comments of the YouTube spam dataset and the corre-
sponding classes (1 for spam, 0 for normal comment):
     CONTENT                                     CLASS
267  PSY is a good guy                           0
173  For Christmas Song visit my channel! ;)     1
The next step is to create some variations of the datasets used in a local
model. For example, some variations of one of the comments:
For  Christmas  Song  visit  my  channel!  ;)   prob   weight
1    0          1     1      0   0         1    0.17   0.57
0    1          1     1      1   0         1    0.17   0.71
1    0          0     1      1   1         1    0.99   0.71
1    0          1     1      1   1         1    0.99   0.86
0    1          1     1      0   0         1    0.17   0.57
Each column corresponds to one word in the sentence: 1 means the word is part
of the variation, 0 means it has been removed. "prob" is the predicted spam
probability of the variation and "weight" its proximity to the original comment.
8.2.3.1 Example
FIGURE 8.8 Left: Image of a bowl of bread. Middle and right: LIME ex-
planations for the top 2 classes (bagel, strawberry) for image classifi-
cation made by Google’s Inception V3 neural network.
The prediction and explanation for "Bagel" are very reasonable, even if
the prediction is wrong – these are clearly not bagels, since the hole in
the middle is missing.
8.2.4 Advantages
Even if you replace the underlying machine learning model, you can
still use the same local, interpretable model for explanation. Suppose
the people looking at the explanations understand decision trees best.
Because you use local surrogate models, you use decision trees as ex-
planations without actually having to use a decision tree to make the
predictions. For example, you can use an SVM. And if it turns out that
an xgboost model works better, you can replace the SVM and still use
a decision tree to explain the predictions.
8.2.5 Disadvantages
The correct definition of the neighborhood is a very big, unsolved prob-
lem when using LIME with tabular data. In my opinion it is the biggest
problem with LIME and the reason why I would recommend to use
LIME only with great care. For each application you have to try differ-
ent kernel settings and see for yourself if the explanations make sense.
Unfortunately, this is the best advice I can give to find good kernel
widths.
Sampling could be improved in the current implementation of LIME.
Data points are sampled from a Gaussian distribution, ignoring the
correlation between features. This can lead to unlikely data points
which can then be used to learn local explanation models.
The complexity of the explanation model has to be defined in advance.
This is just a small complaint, because in the end the user always has
to define the compromise between fidelity and sparsity.
Another really big problem is the instability of the explanations. In an
article 10 the authors showed that the explanations of two very close
points varied greatly in a simulated setting. Also, in my experience, if
you repeat the sampling process, then the explanations that come out
can be different. Instability means that it is difficult to trust the expla-
nations, and you should be very critical.
LIME explanations can be manipulated by the data scientist to hide bi-
ases 11 . The possibility of manipulation makes it more difficult to trust
explanations generated with LIME.
Conclusion: Local surrogate models, with LIME as a concrete imple-
mentation, are very promising. But the method is still in the development
phase, and many problems need to be solved before it can be safely applied.
10. Alvarez-Melis, David, and Tommi S. Jaakkola. "On the robustness of interpretability methods." arXiv preprint arXiv:1806.08049 (2018).
11. Slack, Dylan, et al. "Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods." Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2020).
8.3 Counterfactual Explanations
What is the smallest change to the features (income, number of credit
cards, age, …) that would change the prediction from rejected to ap-
proved? One possible answer could be: If Peter would earn 10,000 Euro
more per year, he would get the loan. Or if Peter had fewer credit cards
and had not defaulted on a loan 5 years ago, he would get the loan. Pe-
ter will never know the reasons for the rejection, as the bank has no
interest in transparency, but that is another story.
In our second example we want to explain a model that predicts a
continuous outcome with counterfactual explanations. Anna wants to
rent out her apartment, but she is not sure how much to charge for it,
so she decides to train a machine learning model to predict the rent. Of
course, since Anna is a data scientist, that is how she solves her prob-
lems. After entering all the details about size, location, whether pets
are allowed and so on, the model tells her that she can charge 900 Euro.
She expected 1000 Euro or more, but she trusts her model and decides
to play with the feature values of the apartment to see how she can
improve the value of the apartment. She finds out that the apartment
could be rented out for over 1000 Euro, if it were 15 m2 larger. Interest-
ing, but non-actionable knowledge, because she cannot enlarge her
apartment. Finally, by tweaking only the feature values under her con-
trol (built-in kitchen yes/no, pets allowed yes/no, type of floor, etc.),
she finds out that if she allows pets and installs windows with better in-
sulation, she can charge 1000 Euro. Anna had intuitively worked with
counterfactuals to change the outcome.
Counterfactuals are human-friendly explanations, because they are
contrastive to the current instance and because they are selective,
meaning they usually focus on a small number of feature changes.
But counterfactuals suffer from the ‘Rashomon effect’. Rashomon is a
Japanese movie in which the murder of a Samurai is told by different
people. Each of the stories explains the outcome equally well, but the
stories contradict each other. The same can also happen with counter-
factuals, since there are usually multiple different counterfactual ex-
planations. Each counterfactual tells a different “story” of how a cer-
tain outcome was reached. One counterfactual might say to change
feature A, the other counterfactual might say to leave A the same
but change feature B, which is a contradiction. This issue of multiple
truths can be addressed either by reporting all counterfactual explanations
or by having a criterion to evaluate counterfactuals and select the best one.
Diversity also enables "diverse" individuals to alter the features that are
convenient for them. The last requirement is that a counterfactual in-
stance should have feature values that are likely. It would not make
sense to generate a counterfactual explanation for the rent example
where the size of an apartment is negative or the number of rooms is
set to 200. It is even better when the counterfactual is likely according
to the joint distribution of the data, for example, an apartment with
10 rooms and 20 m2 should not be regarded as counterfactual explana-
tion. Ideally, if the number of square meters is increased, an increase
in the number of rooms should also be proposed.
8.3.1.1 Method by Wachter et al.
Wachter et al. suggest minimizing the following loss:

$$L(x, x', y', \lambda) = \lambda \cdot \left(\hat{f}(x') - y'\right)^2 + d(x, x')$$
The first term is the quadratic distance between the model prediction
for the counterfactual x’ and the desired outcome y’, which the user
must define in advance. The second term is the distance d between the
instance x to be explained and the counterfactual x’. The loss measures
how far the predicted outcome of the counterfactual is from the prede-
fined outcome and how far the counterfactual is from the instance of
interest. The distance function d is defined as the Manhattan distance
weighted with the inverse median absolute deviation (MAD) of each
feature.
$$d(x, x') = \sum_{j=1}^{p} \frac{|x_j - x'_j|}{MAD_j}$$
The total distance is the sum of all p feature-wise distances, that is, the
absolute differences of feature values between instance x and counter-
factual x’. The feature-wise distances are scaled by the inverse of the
median absolute deviation of feature j over the dataset, defined as:

$$MAD_j = \text{median}_{i \in \{1,\ldots,n\}}\left(\left|x_{i,j} - \text{median}_{l \in \{1,\ldots,n\}}(x_{l,j})\right|\right)$$
The median of a vector is the value at which half of the vector values
are greater and the other half smaller. The MAD is the equivalent of the
variance of a feature, but instead of using the mean as the center and
summing over the square distances, we use the median as the center
and sum over the absolute distances. The proposed distance function
has the advantage over the Euclidean distance that it is more robust to
outliers. Scaling with the MAD is necessary to bring all the features to
the same scale - it should not matter whether you measure the size of
an apartment in square meters or square feet.
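A sketch of this loss with the MAD-weighted Manhattan distance, assuming a regression-style predict method and NumPy arrays (the names are my own):

```python
import numpy as np

def counterfactual_loss(model, x, x_cf, y_target, lam, mad):
    """Prediction gap (weighted by lambda) plus MAD-weighted Manhattan distance."""
    prediction_term = lam * (model.predict(x_cf.reshape(1, -1))[0] - y_target) ** 2
    distance_term = np.sum(np.abs(x - x_cf) / mad)
    return prediction_term + distance_term

# MAD per feature, computed once on the training data (assumes no zero-MAD features):
# mad = np.median(np.abs(X_train - np.median(X_train, axis=0)), axis=0)
```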
The parameter $\lambda$ balances the distance in prediction (first term)
against the distance in feature values (second term). The loss is solved
for a given $\lambda$ and returns a counterfactual x'. Instead of selecting a
value for $\lambda$ directly, the authors suggest selecting a tolerance
$\epsilon$ for how much the prediction of the counterfactual is allowed to
differ from the desired outcome y':

$$\left|\hat{f}(x') - y'\right| \le \epsilon$$

To minimize the loss, the following optimization problem is solved:

$$\arg\min_{x'} \max_{\lambda} L(x, x', y', \lambda)$$
The proposed method has some disadvantages. It only takes the first
and second criteria into account, not the last two ("produce counterfactuals
with only a few feature changes and likely feature values"). The distance
function d does not prefer sparse solutions, since increasing 10 features by 1 will
give the same distance to x as increasing one feature by 10. Unrealistic
feature combinations are not penalized.
The method does not handle categorical features with many different
levels well. The authors of the method suggested running the method
separately for each combination of feature values of the categorical fea-
tures, but this will lead to a combinatorial explosion if you have multi-
ple categorical features with many values. For example, 6 categorical
features with 10 unique levels would mean 1 million runs.
Let us now have a look at another approach that overcomes these issues.
8.3.1.2 Method by Dandl et al.
The first objective measures how far the prediction of the counterfactual x' is from the desired outcome y':

$$o_1(\hat{f}(x'), y') = \begin{cases} 0 & \text{if } \hat{f}(x') \in y' \\ \inf_{y'' \in y'} \left|\hat{f}(x') - y''\right| & \text{else} \end{cases}$$

The second objective measures how different the counterfactual is from the instance x:

$$o_2(x, x') = \frac{1}{p}\sum_{j=1}^{p} \delta_G(x_j, x'_j)$$

with

$$\delta_G(x_j, x'_j) = \begin{cases} \frac{1}{\hat{R}_j}|x_j - x'_j| & \text{if } x_j \text{ is numerical} \\ \mathbb{I}_{x_j \neq x'_j} & \text{if } x_j \text{ is categorical} \end{cases}$$

where $\hat{R}_j$ is the observed value range of feature j.
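A sketch of the distance objective $o_2$ with $\delta_G$, assuming the caller supplies the observed feature ranges and a categorical indicator (names are illustrative):

```python
import numpy as np

def o2_distance(x, x_cf, ranges, categorical):
    """Mean per-feature distance o2: range-scaled absolute difference or inequality indicator."""
    total = 0.0
    for j in range(len(x)):
        if categorical[j]:
            total += float(x[j] != x_cf[j])           # delta_G for categorical features
        else:
            total += abs(x[j] - x_cf[j]) / ranges[j]  # delta_G for numerical features
    return total / len(x)
```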
8.3.2 Example
The following example is based on the credit dataset example in Dandl
et al. (2020). The German Credit Risk dataset can be found on the ma-
chine learning challenges platform kaggle.com15 .
The authors trained a support vector machine (with radial basis ker-
nel) to predict the probability that a customer has a good credit risk.
The corresponding dataset has 522 complete observations and nine
features containing credit and customer information.
The goal is to find counterfactual explanations for a customer with the
following feature values:
15. https://www.kaggle.com/uciml/german-credit
The SVM predicts that the woman has a good credit risk with a proba-
bility of 24.2%. The counterfactuals should answer the question: How do
the input features need to be changed to get a predicted probability
larger than 50%?
The following table shows the 10 best counterfactuals:
The first five columns contain the proposed feature changes (only al-
tered features are displayed), the next three columns show the objec-
tive values (𝑜1 equals 0 in all cases) and the last column displays the
predicted probability.
All counterfactuals have predicted probabilities greater than 50 % and
do not dominate each other. Nondominated means that none of the
counterfactuals has smaller values in all objectives than the other coun-
terfactuals. We can think of our counterfactuals as a set of trade-off
solutions.
They all suggest a reduction of the duration from 48 months to min-
imum 23 months, some of them propose that the woman should be-
come skilled instead of unskilled. Some counterfactuals even suggest
to change the gender from female to male which shows a gender bias
of the model. This change is always accompanied by a reduction in age
between 1 and 30 years. We can also see that although some counter-
factuals suggest changes to 4 features, these counterfactuals are the
ones that are closest to the training data.
8.3.3 Advantages
The interpretation of counterfactual explanations is very clear. If the
feature values of an instance are changed according to the counterfac-
tual, the prediction changes to the predefined prediction. There are
no additional assumptions and no magic in the background. This also
means it is not as dangerous as methods like LIME, where it is unclear
how far we can extrapolate the local model for the interpretation.
The counterfactual method creates a new instance, but we can also
summarize a counterfactual by reporting which feature values have
changed. This gives us two options for reporting our results. You can
either report the counterfactual instance or highlight which features
have been changed between the instance of interest and the counter-
factual instance.
The counterfactual method does not require access to the data or
the model. It only requires access to the model’s prediction function,
which would also work via a web API, for example. This is attractive for
companies which are audited by third parties or which are offering
explanations for users without disclosing the model or data. A com-
pany has an interest in protecting model and data because of trade se-
crets or data protection reasons. Counterfactual explanations offer a
balance between explaining model predictions and protecting the in-
terests of the model owner.
The method also works with systems that do not use machine learning.
We can create counterfactuals for any system that receives in-
puts and returns outputs. The system that predicts apartment rents
could also consist of handwritten rules, and counterfactual explana-
tions would still work.
The counterfactual explanation method is relatively easy to implement,
since it is essentially a loss function (with a single or many objectives)
that can be optimized with standard optimization libraries.
8.3.4 Disadvantages
For each instance you will usually find multiple counterfactual expla-
nations (Rashomon effect). This is inconvenient - most people prefer
simple explanations over the complexity of the real world. It is also a
practical challenge. Let us say we generated 23 counterfactual expla-
nations for one instance. Are we reporting them all? Only the best?
What if they are all relatively “good”, but very different? These ques-
tions must be answered anew for each project. It can also be advanta-
geous to have multiple counterfactual explanations, because then hu-
mans can select the ones that correspond to their previous knowledge.
23. Mothilal, Ramaravind K., Amit Sharma, and Chenhao Tan. "Explaining machine learning classifiers through diverse counterfactual explanations." Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020).
24. https://github.com/interpretml/DiCE
25. Laugel, Thibault, et al. "Inverse classification for comparison-based interpretability in machine learning." arXiv preprint arXiv:1712.08443 (2017).
26. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Anchors: High-precision model-agnostic explanations." AAAI Conference on Artificial Intelligence (2018).
8.4 Scoped Rules (Anchors)
It is not clear how faithful LIME's explanations are, as LIME solely learns a linear decision boundary
that best approximates the model given a perturbation space 𝐷. Given
the same perturbation space, the anchors approach constructs expla-
nations whose coverage is adapted to the model’s behavior and clearly
express their boundaries. Thus, they are faithful by design and state ex-
actly for which instances they are valid. This property makes anchors
particularly intuitive and easy to comprehend.
Feature          Value
Age              20
Sex              female
Class            first
TicketPrice      300$
More attributes  …
Survived         true
Formally, a rule A is an anchor for instance x if

$$\mathbb{E}_{D_x(z|A)}\left[\mathbb{1}_{\hat{f}(x)=\hat{f}(z)}\right] \ge \tau, \quad A(x) = 1$$

Wherein:
• 𝑥 represents the instance being explained (e.g., one row in a tabular
data set).
• 𝐴 is a set of predicates, i.e., the resulting rule or anchor, such that
𝐴(𝑥) = 1 when all feature predicates defined by 𝐴 correspond to 𝑥’s
feature values.
• 𝑓 denotes the classification model to be explained (e.g., an artificial
neural network model). It can be queried to predict a label for 𝑥 and
its perturbations.
• 𝐷𝑥 (⋅|𝐴) indicates the distribution of neighbors of 𝑥, matching 𝐴.
• 0 ≤ 𝜏 ≤ 1 specifies a precision threshold. Only rules that achieve a
local fidelity of at least 𝜏 are considered a valid result.
The previous two definitions are combined and extended by the notion of
coverage. Its rationale consists of finding rules that apply to a preferably
large part of the model's input space. Coverage is formally defined as an
anchor's probability of applying to its neighbors, i.e., its perturbation space:

$$cov(A) = \mathbb{E}_{D(z)}\left[A(z)\right]$$
Including this element leads to anchors’ final definition taking into ac-
count the maximization of coverage:
$$\max_{A\ \text{s.t.}\ P(prec(A) \ge \tau) \ge 1 - \delta} cov(A)$$
Thus, the procedure strives for the rule that has the highest coverage
among all eligible rules (all those that satisfy the precision threshold
given the probabilistic definition). These rules are thought to be more
important, as they describe a larger part of the model. Note that rules
with more predicates tend to have higher precision than rules with
fewer predicates. In particular, a rule that fixes every feature of 𝑥 re-
duces the evaluated neighborhood to identical instances. Thus, the
model classifies all neighbors equally, and the rule’s precision is 1. At
the same time, a rule that fixes many features is overly specific and
only applicable to a few instances. Hence, there is a trade-off between
precision and coverage.
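A sketch of how the empirical precision and coverage of a candidate rule could be estimated from a perturbation sample, assuming a scikit-learn-style classifier and a rule given as a predicate function (names and the example predicates are my own):

```python
import numpy as np

def precision_and_coverage(model, x, rule, perturbations):
    """Empirical precision and coverage of a candidate rule on a perturbation sample."""
    matches = np.array([rule(z) for z in perturbations])  # A(z) = 1 if z satisfies all predicates
    coverage = matches.mean()
    if not matches.any():
        return 0.0, coverage
    label_x = model.predict(x.reshape(1, -1))[0]
    precision = np.mean(model.predict(perturbations[matches]) == label_x)
    return precision, coverage

# Example rule with hand-picked, hypothetical feature indices and thresholds:
# rule = lambda z: (z[0] > 20) and (z[3] <= 1)
```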
The anchors approach uses four main components to find explana-
tions, as is shown in the figure below.
Candidate Generation: Generates new explanation candidates. In the
first round, one candidate per feature of 𝑥 gets created and fixes the re-
spective value of possible perturbations. In every other round, the best
candidates of the previous round are extended by one feature predi-
cate that is not yet contained therein.
Best Candidate Identification: Candidate rules are to be compared in
regard to which rule explains 𝑥 the best. To this end, perturbations
that match the currently observed rule are created and evaluated by call-
ing the model. However, these calls need to be minimized as to limit
computational overhead. This is why, at the core of this component,
there is a pure-exploration Multi-Armed-Bandit (MAB; KL-LUCB28 , to
be precise). MABs are used to efficiently explore and exploit different
strategies (called arms in an analogy to slot machines) using sequen-
tial selection. In the given setting, each candidate rule is to be seen
as an arm that can be pulled. Each time it is pulled, respective neigh-
bors get evaluated, and we thereby obtain more information about the
candidate rule’s payoff (precision in anchors’ case). The precision thus
states how well the rule describes the instance to be explained.
Candidate Precision Validation: Takes more samples in case there is
no statistical confidence yet that the candidate exceeds the 𝜏 thresh-
old.
Modified Beam Search: All of the above components are assembled in a beam
search, a graph search algorithm and modified version of the breadth-first
algorithm. It carries the B best candidates of each round over to the next
one (where B is called the beam width).
28. Emilie Kaufmann and Shivaram Kalyanakrishnan. "Information Complexity in Bandit Subset Selection." Proceedings of Machine Learning Research (2013).
This leads to the following asymptotic runtime, where B is the beam width and p the number of features:

$$\mathcal{O}\left(B \cdot p^2 + p^2 \cdot \mathcal{O}_{\text{MAB}[B \cdot p,\, B]}\right)$$
FIGURE 8.13 Anchors explaining six instances of the bike rental dataset.
Each row represents one explanation or anchor, and each bar depicts
the feature predicates contained by it. The x-axis displays a rule’s pre-
cision, and a bar’s thickness corresponds to its coverage. The ’base’
rule contains no predicates. These anchors show that the model mainly
considers the temperature for predictions.
The results are intuitively interpretable and show for each explained
instance, which features are most important for the model’s predic-
tion. As the anchors only have a few predicates, they additionally have
high coverage and hence apply to other cases. The rules shown above
were generated with 𝜏 = 0.9. Thus, we ask for anchors whose evalu-
ated perturbations faithfully support the label with an accuracy of at
least 90%. Also, discretization was used to increase the expressiveness
and applicability of numerical features.
All of the previous rules were generated for instances where the model
values 𝜖. This would cause the MAB to draw more samples, ultimately
leading to the minority being sampled more often in absolute terms.
For this example, we use a subset of the cervical cancer set in which
the majority of cases are labeled cancer. We then have the framework
to create a corresponding perturbation space from it. Perturbations
are now more likely to lead to varying predictions, and the anchors al-
gorithm can identify important features. However, one needs to take
the coverage’s definition into account: it is only defined within the per-
turbation space. In the previous examples, we used the train set as the
perturbation space’s basis. Since we only use a subset here, a high cov-
erage does not necessarily indicate globally high rule importance.
FIGURE 8.16 Balancing the data set before constructing anchors shows
the model’s reasoning for decisions in minority cases.
8.4.4 Advantages
The anchors approach offers multiple advantages over LIME. First, the
algorithm’s output is easier to understand, as rules are easy to inter-
pret (even for laypersons).
Furthermore, anchors are subsettable and even state a measure of im-
portance by including the notion of coverage. Second, the anchors ap-
proach works when model predictions are non-linear or complex in
an instance's neighborhood. As the approach deploys reinforcement learning
techniques instead of fitting surrogate models, it is less likely to underfit
the model.
8.4.5 Disadvantages
The algorithm suffers from a highly configurable and impactful setup,
just like most perturbation-based explainers. Not only do hyperparameters
such as the beam width or precision threshold need to be tuned to yield
meaningful results, but the perturbation function also needs to be explicitly
designed for each domain/use-case. Think of
how tabular data gets perturbed and think of how to apply the same
concepts to image data (Hint: these cannot be applied). Luckily, de-
fault approaches may be used in some domains (e.g., tabular), facili-
tating an initial explanation setup.
Also, many scenarios require discretization as otherwise results are
too specific, have low coverage, and do not contribute to understand-
ing the model. While discretization can help, it may also blur decision
boundaries if used carelessly and thus have the exact opposite effect.
Since there is no best discretization technique, users need to be aware
of the data before deciding how to discretize it, so as not to obtain poor
results.
Constructing anchors requires many calls to the ML model, just like
all perturbation-based explainers. While the algorithm deploys MABs
to minimize the number of calls, its runtime still very much depends
on the model’s performance and is therefore highly variable.
Lastly, the notion of coverage is undefined in some domains. For ex-
ample, there is no obvious or universal definition of how superpixels
in one image compare to such in other images.
29. https://github.com/marcotcr/anchor
30. https://github.com/SeldonIO/alibi
31. https://github.com/viadee/javaAnchorExplainer
32. https://github.com/viadee/anchorsOnR
8.5 Shapley Values
FIGURE 8.17 The predicted price for a 50 𝑚2 2nd floor apartment with a
nearby park and cat ban is €300,000. Our goal is to explain how each
of these feature values contributed to the prediction.
The average prediction for all apartments is €310,000. How much has each
feature value contributed to the prediction compared to the average prediction?
33. https://leanpub.com/c/shapley-xai
using its value for the floor feature. The value floor-2nd was replaced by
the randomly drawn floor-1st. Then we predict the price of the apart-
ment with this combination (€310,000). In a second step, we remove
cat-banned from the coalition by replacing it with a random value of
the cat allowed/banned feature from the randomly drawn apartment.
In the example it was cat-allowed, but it could have been cat-banned
again. We predict the apartment price for the coalition of park-nearby
and area-50 (€320,000). The contribution of cat-banned was €310,000
- €320,000 = -€10,000. This estimate depends on the values of the ran-
domly drawn apartment that served as a “donor” for the cat and floor
feature values. We will get better estimates if we repeat this sampling
step and average the contributions.
FIGURE 8.19 All 8 coalitions needed for computing the exact Shapley
value of the ‘cat-banned‘ feature value.
For the bike rental dataset, we also train a random forest to predict
the number of rented bikes for a day, given weather and calendar in-
formation. The explanations created for the random forest prediction
of a particular day:
Be careful to interpret the Shapley value correctly: The Shapley value is
the average contribution of a feature value to the prediction in differ-
ent coalitions. The Shapley value is NOT the difference in prediction
when we would remove the feature from the model.
FIGURE 8.20 Shapley values for a woman in the cervical cancer dataset.
With a prediction of 0.57, this woman’s cancer probability is 0.54 above
the average prediction of 0.03. The number of diagnosed STDs in-
creased the probability the most. The sum of contributions yields the
difference between actual and average prediction (0.54).
$$\hat{f}(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$

The contribution $\phi_j$ of the j-th feature to the prediction $\hat{f}(x)$ is:

$$\phi_j(\hat{f}) = \beta_j x_j - E(\beta_j X_j)$$

where $E(\beta_j X_j)$ is the mean effect estimate for feature j. The contri-
FIGURE 8.21 Shapley values for day 285. With a predicted 2409 rental
bikes, this day is -2108 below the average prediction of 4518. The
weather situation and humidity had the largest negative contribu-
tions. The temperature on this day had a positive contribution. The
sum of Shapley values yields the difference of actual and average pre-
diction (-2108).
bution is the difference between the feature effect and the average
effect. Nice! Now we know how much each feature contributed to the
prediction. If we sum all the feature contributions for one instance,
the result is the following:
$$\sum_{j=1}^{p}\phi_j(\hat{f}) = \sum_{j=1}^{p}\left(\beta_j x_j - E(\beta_j X_j)\right) = \left(\beta_0 + \sum_{j=1}^{p}\beta_j x_j\right) - \left(\beta_0 + \sum_{j=1}^{p}E(\beta_j X_j)\right) = \hat{f}(x) - E(\hat{f}(X))$$
This is the predicted value for the data point x minus the average pre-
dicted value. Feature contributions can be negative.
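A minimal sketch of these linear-model contributions, assuming a fitted scikit-learn LinearRegression and NumPy arrays (names are illustrative):

```python
import numpy as np

def linear_contributions(model, X_train, x):
    """Per-feature contributions beta_j * x_j minus the average effect E(beta_j * X_j)."""
    effects = model.coef_ * x                              # beta_j * x_j
    average_effects = model.coef_ * X_train.mean(axis=0)   # E(beta_j * X_j)
    return effects - average_effects                       # sums to f(x) minus the mean prediction
```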
Can we do the same for any type of model? It would be great to have this
as a model-agnostic tool. Since we usually do not have similar weights
in other model types, we need a different solution.
Help comes from unexpected places: cooperative game theory. The
Shapley value is a solution for computing feature contributions for sin-
gle predictions for any machine learning model.
The Shapley value is defined via a value function val of players (feature values) in a coalition S:

$$val_x(S) = \int \hat{f}(x_1, \ldots, x_p)\, d\mathbb{P}_{x \notin S} - E_X(\hat{f}(X))$$
You actually perform multiple integrations for each feature that is not
contained S. A concrete example: The machine learning model works
with 4 features x1, x2, x3 and x4 and we evaluate the prediction for the
coalition S consisting of feature values x1 and x3:
$$val_x(S) = val_x(\{1, 3\}) = \int_{\mathbb{R}}\int_{\mathbb{R}} \hat{f}(x_1, X_2, x_3, X_4)\, d\mathbb{P}_{X_2 X_4} - E_X(\hat{f}(X))$$
Do not get confused by the many uses of the word “value”: The fea-
ture value is the numerical or categorical value of a feature and in-
stance; the Shapley value is the feature contribution to the prediction;
the value function is the payout function for coalitions of players (fea-
ture values).
The Shapley value is the only attribution method that satisfies the prop-
erties Efficiency, Symmetry, Dummy and Additivity, which together
can be considered a definition of a fair payout.
Efficiency The feature contributions must add up to the difference of
prediction for x and the average.
$$\sum_{j=1}^{p}\phi_j = \hat{f}(x) - E_X(\hat{f}(X))$$
Symmetry The contributions of two feature values j and k should be the same
if they contribute equally to all possible coalitions. If

$$val_x(S \cup \{j\}) = val_x(S \cup \{k\})$$

for all

$$S \subseteq \{1, \ldots, p\} \setminus \{j, k\}$$

then

$$\phi_j = \phi_k$$
Dummy A feature j that does not change the predicted value – regardless
of which coalition of feature values it is added to – should have a
Shapley value of 0. If

$$val_x(S \cup \{j\}) = val_x(S)$$

for all

$$S \subseteq \{1, \ldots, p\}$$

then

$$\phi_j = 0$$
Additivity For a game with combined payouts $val + val^{+}$, the respective Shapley values are:

$$\phi_j + \phi_j^{+}$$
Suppose you trained a random forest, which means that the prediction
is an average of many decision trees. The Additivity property guaran-
tees that for a feature value, you can calculate the Shapley value for
each tree individually, average them, and get the Shapley value for the
feature value for the random forest.
8.5.3.2 Intuition
The Shapley value of a feature value can be approximated with Monte Carlo sampling (Štrumbelj and Kononenko 2014)35:

$$\hat{\phi}_j = \frac{1}{M}\sum_{m=1}^{M}\left(\hat{f}(x^m_{+j}) - \hat{f}(x^m_{-j})\right)$$
35. Štrumbelj, Erik, and Igor Kononenko. "Explaining prediction models and individual predictions with feature contributions." Knowledge and Information Systems 41.3 (2014): 647-665.
feature j replaced by the value for feature j from the sample z. The dif-
ference in the prediction from the black box is computed:
$$\phi_j^m = \hat{f}(x^m_{+j}) - \hat{f}(x^m_{-j})$$
The Shapley value is the average over all M differences:

$$\phi_j(x) = \frac{1}{M}\sum_{m=1}^{M}\phi_j^m$$
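A sketch of this Monte Carlo sampling procedure, assuming NumPy arrays and a model with a scikit-learn-style predict method (function name and defaults are my own):

```python
import numpy as np

def shapley_sample(model, x, X_data, j, M=100, random_state=0):
    """Monte Carlo estimate of the Shapley value of feature j for instance x."""
    rng = np.random.default_rng(random_state)
    p = len(x)
    contributions = np.empty(M)
    for m in range(M):
        z = X_data[rng.integers(len(X_data))]   # random "donor" instance
        order = rng.permutation(p)              # random feature order
        pos = int(np.where(order == j)[0][0])
        x_plus_j, x_minus_j = x.copy(), x.copy()
        x_plus_j[order[pos + 1:]] = z[order[pos + 1:]]   # x values up to and including feature j
        x_minus_j[order[pos:]] = z[order[pos:]]          # feature j and everything after come from z
        contributions[m] = (model.predict(x_plus_j.reshape(1, -1))[0]
                            - model.predict(x_minus_j.reshape(1, -1))[0])
    return contributions.mean()
```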
8.5.4 Advantages
The difference between the prediction and the average prediction is
fairly distributed among the feature values of the instance – the Ef-
ficiency property of Shapley values. This property distinguishes the
Shapley value from other methods such as LIME. LIME does not guar-
antee that the prediction is fairly distributed among the features. The
Shapley value might be the only method to deliver a full explanation.
In situations where the law requires explainability – like EU’s “right to
explanations” – the Shapley value might be the only legally compliant
method, because it is based on a solid theory and distributes the ef-
fects fairly. I am not a lawyer, so this reflects only my intuition about
the requirements.
The Shapley value allows contrastive explanations. Instead of compar-
ing a prediction to the average prediction of the entire dataset, you
could compare it to a subset or even to a single data point. This con-
trastiveness is also something that local models like LIME do not have.
The Shapley value is the only explanation method with a solid theory.
The axioms – efficiency, symmetry, dummy, additivity – give the ex-
planation a reasonable foundation. Methods like LIME assume linear behavior of the machine learning model locally, but there is no theory as to why this should work.
8.5.5 Disadvantages
The Shapley value requires a lot of computing time. In 99.9% of real-
world problems, only the approximate solution is feasible. An exact
computation of the Shapley value is computationally expensive be-
cause there are $2^k$ possible coalitions of the feature values and the "absence" of a feature has to be simulated by drawing random instances, which increases the variance of the Shapley value estimate. The exponential number of coalitions is dealt with by sampling coalitions and limiting the number of iterations M. Decreasing
M reduces computation time, but increases the variance of the Shap-
ley value. There is no good rule of thumb for the number of iterations
M. M should be large enough to accurately estimate the Shapley values,
but small enough to complete the computation in a reasonable time. It
should be possible to choose M based on Chernoff bounds, but I have
not seen any paper on doing this for Shapley values for machine learn-
ing predictions.
The Shapley value can be misinterpreted. The Shapley value of a fea-
ture value is not the difference of the predicted value after removing
the feature from the model training. The interpretation of the Shapley
value is: Given the current set of feature values, the contribution of a
feature value to the difference between the actual prediction and the
mean prediction is the estimated Shapley value.
The Shapley value is the wrong explanation method if you seek sparse
explanations (explanations that contain few features). Explanations
created with the Shapley value method always use all the features. Hu-
mans prefer selective explanations, such as those produced by LIME.
LIME might be the better choice for explanations lay-persons have to
deal with. Another solution is SHAP36 introduced by Lundberg and Lee (2017)37, which is based on the Shapley value, but can also provide explanations with few features.
36 https://github.com/slundberg/shap
The Shapley value returns a simple value per feature, but no prediction
model like LIME. This means it cannot be used to make statements
about changes in prediction for changes in the input, such as: “If I were
to earn €300 more a year, my credit score would increase by 5 points.”
Another disadvantage is that you need access to the data if you want to
calculate the Shapley value for a new data instance. It is not sufficient
to access the prediction function because you need the data to replace
parts of the instance of interest with values from randomly drawn in-
stances of the data. This can only be avoided if you can create data in-
stances that look like real data instances but are not actual instances
from the training data.
Like many other permutation-based interpretation methods, the
Shapley value method suffers from inclusion of unrealistic data in-
stances when features are correlated. To simulate that a feature
value is missing from a coalition, we marginalize the feature. This
is achieved by sampling values from the feature’s marginal distribu-
tion. This is fine as long as the features are independent. When fea-
tures are dependent, then we might sample feature values that do
not make sense for this instance. But we would use those to compute
the feature’s Shapley value. One solution might be to permute corre-
lated features together and get one mutual Shapley value for them. An-
other adaptation is conditional sampling: Features are sampled con-
ditional on the features that are already in the team. While condi-
tional sampling fixes the issue of unrealistic data points, a new is-
sue is introduced: The resulting values are no longer the Shapley val-
ues to our game, since they violate the symmetry axiom, as found out by Sundararajan et al. (2019)38 and further discussed by Janzing et al. (2020)39.
37
Lundberg, Scott M., and Su-In Lee. “A unified approach to interpreting model
predictions.” Advances in Neural Information Processing Systems. 2017.
38
Sundararajan, Mukund, and Amir Najmi. “The many Shapley values for model
explanation.” arXiv preprint arXiv:1908.08474 (2019).
39 Janzing, Dominik, Lenon Minorics, and Patrick Blöbaum. "Feature relevance quantification in explainable AI: A causality problem." arXiv preprint arXiv:1910.13413 (2019).
8.6 SHAP (SHapley Additive exPlanations)

8.6.1 Definition
The goal of SHAP is to explain the prediction of an instance x by com-
puting the contribution of each feature to the prediction. The SHAP
explanation method computes Shapley values from coalitional game
theory. The feature values of a data instance act as players in a coali-
tion. Shapley values tell us how to fairly distribute the “payout” (= the
prediction) among the features. A player can be an individual feature
value, e.g. for tabular data. A player can also be a group of feature values. For example, to explain an image, pixels can be grouped into superpixels and the prediction distributed among them. One innovation
that SHAP brings to the table is that the Shapley value explanation is represented as an additive feature attribution method, a linear model. That view connects LIME and Shapley values. SHAP specifies the explanation as:

$$g(z') = \phi_0 + \sum_{j=1}^{M}\phi_j z_j'$$

where g is the explanation model, $z' \in \{0,1\}^M$ is the coalition vector, M is the maximum coalition size and $\phi_j \in \mathbb{R}$ is the feature attribution for feature j, the Shapley values. If all features are "present", i.e. all $z_j'$ are 1, the formula simplifies to:

$$g(x') = \phi_0 + \sum_{j=1}^{M}\phi_j$$

43 Lundberg, Scott M., and Su-In Lee. "A unified approach to interpreting model predictions." Advances in Neural Information Processing Systems. 2017.
44 https://leanpub.com/c/shapley-xai
You can find this formula in similar notation in the Shapley value chap-
ter. More about the actual estimation comes later. Let us first talk
about the properties of the 𝜙’s before we go into the details of their
estimation.
Shapley values are the only solution that satisfies the properties of Efficiency, Symmetry, Dummy and Additivity. SHAP also satisfies these,
since it computes Shapley values. In the SHAP paper, you will find dis-
crepancies between SHAP properties and Shapley properties. SHAP
describes the following three desirable properties:
1) Local accuracy

$$\hat{f}(x) = g(x') = \phi_0 + \sum_{j=1}^{M}\phi_j x_j'$$

If you define $\phi_0 = E_X(\hat{f}(X))$ and set all $x_j'$ to 1, this is the Shapley efficiency property, only with a different name and using the coalition vector.
$$\hat{f}(x) = \phi_0 + \sum_{j=1}^{M}\phi_j x_j' = E_X(\hat{f}(X)) + \sum_{j=1}^{M}\phi_j$$
2) Missingness
$$x_j' = 0 \Rightarrow \phi_j = 0$$

Missingness says that a feature missing from the coalition (i.e. $x_j' = 0$) gets an attribution of zero.

3) Consistency

Consistency says that if a model changes so that the marginal contribution of a feature value increases or stays the same (regardless of other features), the Shapley value also increases or stays the same:

$$\phi_j(\hat{f}', x) \geq \phi_j(\hat{f}, x)$$
8.6.2 KernelSHAP
KernelSHAP estimates for an instance x the contributions of each fea-
ture value to the prediction. KernelSHAP consists of 5 steps:
• Sample coalitions $z_k' \in \{0,1\}^M$, $k \in \{1, \ldots, K\}$ (1 = feature present in coalition, 0 = feature absent).
• Get the prediction for each $z_k'$ by first converting $z_k'$ to the original feature space and then applying the model $\hat{f}$: $\hat{f}(h_x(z_k'))$.
• Compute the weight for each $z_k'$ with the SHAP kernel.
• Fit a weighted linear model.
• Return the Shapley values, the coefficients from the linear model.

Here $h_x$ maps a coalition vector to a data instance: present features take the values of the instance x to be explained, absent features are filled with values sampled from the data, which corresponds to averaging the prediction over the absent features $X_C$:

$$\hat{f}(h_x(z_k')) = E_{X_C}[\hat{f}(x)]$$
Lundberg and Lee propose the SHAP kernel:

$$\pi_x(z') = \frac{M-1}{\binom{M}{|z'|}\,|z'|\,(M-|z'|)}$$
Here, M is the maximum coalition size and $|z'|$ the number of present features in the coalition instance z'. Lundberg and Lee show that linear regression with this kernel weight yields Shapley values. If you were to use the SHAP kernel with LIME on the coalition data, LIME would also estimate Shapley values!
We can be a bit smarter about the sampling of coalitions: The small-
est and largest coalitions take up most of the weight. We get better
Shapley value estimates by using some of the sampling budget K to
include these high-weight coalitions instead of sampling blindly. We
start with all possible coalitions with 1 and M-1 features, which makes
2 times M coalitions in total. When we have enough budget left (cur-
rent budget is K - 2M), we can include coalitions with two features and
with M-2 features and so on. From the remaining coalition sizes, we
sample with readjusted weights.
We have the data, the target and the weights. Everything we need to build our weighted linear regression model:

$$g(z') = \phi_0 + \sum_{j=1}^{M}\phi_j z_j'$$

We train the linear model g by optimizing the following loss function L:

$$L(\hat{f}, g, \pi_x) = \sum_{z' \in Z}\left[\hat{f}(h_x(z')) - g(z')\right]^2 \pi_x(z')$$
where Z is the training data. This is the good old boring sum of squared
errors that we usually optimize for linear models. The estimated coef-
ficients of the model, the 𝜙𝑗 ’s are the Shapley values.
Since we are in a linear regression setting, we can also make use of the standard tools for regression, for example adding regularization terms to get sparse explanations.
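The following sketch compresses these steps into a few lines of Python. It assumes numpy arrays and a predict function, samples coalition sizes uniformly instead of proportionally to their weight, and does not enforce the efficiency constraint exactly, so it is only an illustration of the idea, not a replacement for the reference implementation in the shap package.

```python
import numpy as np
from math import comb

def kernel_shap(predict, x, background, n_coalitions=2048, rng=None):
    """Illustrative KernelSHAP: weighted linear regression on sampled coalitions."""
    rng = np.random.default_rng(rng)
    M = len(x)
    # 1. Sample coalition vectors z' (1 = feature present, 0 = absent); the empty
    #    and the full coalition are skipped because their kernel weight is infinite.
    sizes = rng.integers(1, M, size=n_coalitions)
    Z = np.zeros((n_coalitions, M))
    for i, s in enumerate(sizes):
        Z[i, rng.choice(M, size=s, replace=False)] = 1
    # 2. Map each coalition into feature space: present features from x,
    #    absent features from a randomly drawn background instance.
    X_coal = background[rng.integers(len(background), size=n_coalitions)].copy()
    X_coal[Z == 1] = np.broadcast_to(x, X_coal.shape)[Z == 1]
    y = predict(X_coal)
    # 3. SHAP kernel weights pi_x(z') = (M-1) / (binom(M,|z'|) * |z'| * (M-|z'|)).
    sz = Z.sum(axis=1)
    weights = (M - 1) / (np.array([comb(M, int(s)) for s in sz]) * sz * (M - sz))
    # 4. + 5. Weighted least squares; the coefficients estimate the Shapley values.
    A = np.column_stack([np.ones(n_coalitions), Z])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]   # (phi_0, Shapley value estimates)
```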
8.6.3 TreeSHAP
Lundberg et al. (2018)46 proposed TreeSHAP, a variant of SHAP for
tree-based machine learning models such as decision trees, random
forests and gradient boosted trees. TreeSHAP was introduced as a fast,
model-specific alternative to KernelSHAP, but it turned out that it can
produce unintuitive feature attributions.
TreeSHAP defines the value function using the conditional expectation $E_{X_S|X_C}(\hat{f}(x)|x_S)$ instead of the marginal expectation. The problem with the conditional expectation is that features that have no influence on the prediction function f can get a TreeSHAP estimate different from zero.47,48 The non-zero estimate can happen when the feature
is correlated with another feature that actually has an influence on the
prediction.
How much faster is TreeSHAP? Compared to exact KernelSHAP, it reduces the computational complexity from $O(TL2^M)$ to $O(TLD^2)$, where T is the number of trees, L is the maximum number of leaves in any tree and D the maximal depth of any tree.
TreeSHAP uses the conditional expectation $E_{X_S|X_C}(\hat{f}(x)|x_S)$ to estimate effects. I will give you some intuition on how we can compute the expected prediction for a single tree, an instance x and feature subset S. If we conditioned on all features – if S was the set of all features – then the prediction from the node in which the instance x falls would be the expected prediction. If we did not condition on any feature – if S was empty – we would use the weighted average of the predictions of all terminal nodes.
46
Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. “Consistent individualized
feature attribution for tree ensembles.” arXiv preprint arXiv:1802.03888 (2018).
47
Sundararajan, Mukund, and Amir Najmi. “The many Shapley values for model
explanation.” arXiv preprint arXiv:1908.08474 (2019).
48
Janzing, Dominik, Lenon Minorics, and Patrick Blöbaum. “Feature rele-
vance quantification in explainable AI: A causality problem.” arXiv preprint
arXiv:1910.13413 (2019).
8.6.4 Examples
I trained a random forest classifier with 100 trees to predict the risk
for cervical cancer. We will use SHAP to explain individual predictions.
We can use the fast TreeSHAP estimation method instead of the slower
KernelSHAP method, since a random forest is an ensemble of trees.
But instead of relying on the conditional distribution, this example
uses the marginal distribution. This is described in the package, but
not in the original paper. The Python TreeSHAP function is slower
with the marginal distribution, but still faster than KernelSHAP, since
it scales linearly with the rows in the data.
Because we use the marginal distribution here, the interpretation is
the same as in the Shapley value chapter. But with the Python shap
package comes a different visualization: You can visualize feature at-
tributions such as Shapley values as “forces”. Each feature value is a
force that either increases or decreases the prediction. The prediction
starts from the baseline. The baseline for Shapley values is the aver-
age of all predictions. In the plot, each Shapley value is an arrow that
pushes to increase (positive value) or decrease (negative value) the pre-
diction. These forces balance each other out at the actual prediction of
the data instance.
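In code, producing such a force plot with the shap package looks roughly as follows. X and y are placeholder names for the cervical cancer features and labels, and the exact return types (per-class lists vs. arrays, expected_value as scalar or array) vary between shap versions, so treat this as a sketch rather than copy-paste-ready code.

```python
import shap
from sklearn.ensemble import RandomForestClassifier

# X: feature DataFrame, y: binary cancer labels (placeholder names)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

# Passing background data makes TreeExplainer use the marginal ("interventional")
# expectation described above instead of the conditional one.
explainer = shap.TreeExplainer(model, data=X)
shap_values = explainer.shap_values(X.iloc[[0]])   # Shapley values for one woman

# For classifiers, values are often returned per class; pick the "cancer" class.
sv = shap_values[1] if isinstance(shap_values, list) else shap_values
base = explainer.expected_value
base = base[1] if hasattr(base, "__len__") else base

# Force plot: the baseline is the average prediction, each Shapley value is an
# arrow pushing the prediction up or down.
shap.force_plot(base, sv, X.iloc[[0]])
```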
The following figure shows SHAP explanation force plots for two
women from the cervical cancer dataset:
The SHAP interaction value of two features i and j is defined for $i \neq j$ and is based on the difference

$$\delta_{ij}(S) = \hat{f}_x(S \cup \{i,j\}) - \hat{f}_x(S \cup \{i\}) - \hat{f}_x(S \cup \{j\}) + \hat{f}_x(S)$$
This formula subtracts the main effect of the features so that we get
the pure interaction effect after accounting for the individual effects.
We average the values over all possible feature coalitions S, as in the
Shapley value computation. When we compute SHAP interaction val-
ues for all features, we get one matrix per instance with dimensions M
x M, where M is the number of features.
How can we use the interaction index? For example, to automatically
color the SHAP feature dependence plot with the strongest interac-
tion:
Another useful application is to cluster the data with the help of Shapley values: instances are grouped by explanation similarity. You can use any clustering method. The following example uses hierarchical agglomerative clustering to order the instances.
The plot consists of many force plots, each of which explains the pre-
diction of an instance. We rotate the force plots vertically and place
them side by side according to their clustering similarity.
8.6.10 Advantages
Since SHAP computes Shapley values, all the advantages of Shapley
values apply: SHAP has a solid theoretical foundation in game theory.
The prediction is fairly distributed among the feature values. We get
contrastive explanations that compare the prediction with the aver-
age prediction.
SHAP connects LIME and Shapley values. This is very useful to bet-
ter understand both methods. It also helps to unify the field of inter-
pretable machine learning.
SHAP has a fast implementation for tree-based models. I believe this
was key to the popularity of SHAP, because the biggest barrier for
adoption of Shapley values is the slow computation.
The fast computation makes it possible to compute the many Shapley
values needed for the global model interpretations. The global interpretation methods include feature importance, feature dependence, interactions, clustering and summary plots. With SHAP, global interpretations are consistent with the local explanations, since the Shapley values are the "atomic unit" of the global interpretations.
8.6.11 Disadvantages
KernelSHAP is slow. This makes KernelSHAP impractical to use when
you want to compute Shapley values for many instances. Also all global
SHAP methods such as SHAP feature importance require computing
Shapley values for a lot of instances.
KernelSHAP ignores feature dependence. Most other permutation-based interpretation methods have this problem. Replacing feature values with values from random instances means sampling from the marginal distribution, which is easy to do, but when features are dependent, e.g. correlated, this leads to putting too much weight
on unlikely data points. TreeSHAP solves this problem by explicitly
modeling the conditional expected prediction.
TreeSHAP can produce unintuitive feature attributions. While Tree-
SHAP solves the problem of extrapolating to unlikely data points, it
does so by changing the value function and therefore slightly changes
the game. TreeSHAP changes the value function by relying on the con-
ditional expected prediction. With the change in the value function,
features that have no influence on the prediction can get a TreeSHAP
value different from zero.
The disadvantages of Shapley values also apply to SHAP: Shapley val-
ues can be misinterpreted and access to data is needed to compute
them for new data (except for TreeSHAP).
It is possible to create intentionally misleading interpretations with
SHAP, which can hide biases.49 If you are the data scientist creating the explanations, this is not an actual problem (it would even be an advantage if you are the evil data scientist who wants to create misleading explanations). It is a disadvantage for the receiver of an explanation, who can be less sure about its truthfulness.
49 Slack, Dylan, et al. "Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods." Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2020).
8.6.12 Software
The authors implemented SHAP in the shap50 Python package. This
implementation works for tree-based models in the scikit-learn51 ma-
chine learning library for Python. The shap package was also used for
the examples in this chapter. SHAP is integrated into the tree boost-
ing frameworks xgboost52 and LightGBM53. In R, there are the shapper54 and fastshap55 packages. SHAP is also included in the R xgboost56 package.
9 Neural Network Interpretation

9.1 Learned Features
• The first convolutional layer(s) learn features such as edges and sim-
ple textures.
The function h is the activation of a neuron, img the input of the network (an image), x and y describe the spatial position of the neuron, n specifies the layer and z is the channel index. For the mean activation of an entire channel z in layer n we maximize:

$$img^{*} = \arg\max_{img} \sum_{x,y} h_{n,x,y,z}(img)$$
If you want to dive a lot deeper into feature visualization, take a look at the distill.pub online journal, especially the feature visualization post by Olah et al.4, from which I used many of the images, and also the post about the building blocks of interpretability5.
9.1.1.2 Connection to Adversarial Examples
stop you from finding the input that maximally activates a neuron of
a fully connected neural network for tabular data or a recurrent neu-
ral network for text data. You might not call it feature visualization
any longer, since the “feature” would be a tabular data input or text.
For credit default prediction, the inputs might be the number of prior
credits, number of mobile contracts, address and dozens of other fea-
tures. The learned feature of a neuron would then be a certain com-
bination of the dozens of features. For recurrent neural networks, it
is a bit nicer to visualize what the network learned: Karpathy et al. (2015)6 showed that recurrent neural networks indeed have neurons that learn interpretable features. They trained a character-level model, which predicts the next character in the sequence from the previous characters. Once an opening bracket "(" occurred, one of the neurons got highly activated, and it got deactivated when the matching closing bracket ")" occurred. Other neurons fired at the end of a line. Some
neurons fired in URLs. The difference to the feature visualization for
CNNs is that the examples were not found through optimization, but
by studying neuron activations in the training data.
Some of the images seem to show well-known concepts like dog snouts
or buildings. But how can we be sure? The Network Dissection method
links human concepts with individual neural network units. Spoiler
alert: Network Dissection requires extra datasets that someone has la-
beled with human concepts.
tions do not prove that a unit has learned a certain concept. We also do not have a measure for how well a unit detects e.g. skyscrapers. Be-
fore we go into the details of Network Dissection, we have to talk about
the big hypothesis that is behind that line of research. The hypothesis
is: Units of a neural network (like convolutional channels) learn disen-
tangled concepts.
The Question of Disentangled Features
Do (convolutional) neural networks learn disentangled features? Dis-
entangled features mean that individual network units detect specific
real world concepts. Convolutional channel 394 might detect skyscrapers, channel 121 dog snouts, channel 12 stripes at a 30 degree angle …
The opposite of a disentangled network is a completely entangled net-
work. In a completely entangled network, for example, there would be
no individual unit for dog snouts. All channels would contribute to the
recognition of dog snouts.
Disentangled features imply that the network is highly interpretable.
Let us assume we have a network with completely disentangled units
that are labeled with known concepts. This would open up the pos-
sibility to track the network’s decision making process. For example,
we could analyze how the network classifies wolves against huskies. First, we identify the "husky"-unit. We can check whether this unit depends on the "dog snout", "fluffy fur" and "snow"-units from the previous layer. If it does, we know that it will misclassify an image of a husky with a snowy background as a wolf. In a disentangled network,
we could identify problematic non-causal correlations. We could auto-
matically list all highly activated units and their concepts to explain
an individual prediction. We could easily detect bias in the neural net-
work. For example, did the network learn a “white skin” feature to pre-
dict salary?
Spoiler-alert: Convolutional neural networks are not perfectly disen-
tangled. We will now look more closely at Network Dissection to find
out how interpretable neural networks are.
9.1.2.1 Network Dissection Algorithm
FIGURE 9.5 For a given input image and a trained network (fixed weights), we forward propagate the image up to the target layer, upscale the activations to match the original image size and compare the maximum activations with the ground truth pixel-wise segmentation. Figure originally from the project website (http://netdissect.csail.mit.edu/).
Next, we create the masks of the top activated areas per channel and
per image. At this point the concept labels are not yet involved.
• For each convolutional channel k:
– For each image x in the Broden dataset
* Forward propagate image x to the target layer containing
channel k.
* Extract the pixel activations of convolutional channel k:
𝐴𝑘 (𝑥)
– Calculate distribution of pixel activations 𝛼𝑘 over all images
– Determine the 0.995-quantile level 𝑇𝑘 of activations 𝛼𝑘 . This
means 0.5% of all activations of channel k for image x are
greater than 𝑇𝑘 .
– For each image x in the Broden dataset:
* Scale the (possibly) lower-resolution activation map 𝐴𝑘 (𝑥)
to the resolution of image x. We call the result 𝑆𝑘 (𝑥).
* Binarize the activation map: A pixel is either on or off, de-
pending on whether it exceeds the activation threshold 𝑇𝑘 .
The new mask is 𝑀𝑘 (𝑥) = 𝑆𝑘 (𝑥) ≥ 𝑇𝑘 (𝑥).
Step 3: Activation-concept alignment
After step 2 we have one activation mask per channel and image. These
activation masks mark highly activated areas. For each channel we
want to find the human concept that activates that channel. We find
the concept by comparing the activation masks with all labeled con-
cepts. We quantify the alignment between activation mask k and con-
cept mask c with the Intersection over Union (IoU) score:

$$IoU_{k,c} = \frac{\sum_x |M_k(x) \cap L_c(x)|}{\sum_x |M_k(x) \cup L_c(x)|}$$

where $L_c(x)$ is the pixel-wise ground truth mask of concept c for image x. A unit k is called a detector of concept c if $IoU_{k,c}$ exceeds a threshold (0.04 in the original paper).
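As a small illustration, the binarization and IoU computation can be written in a few lines of numpy; the array names and shapes are assumptions, and the full pipeline (upscaling, the Broden dataset, per-concept labels) is of course not shown.

```python
import numpy as np

def binarize(activations, quantile=0.995):
    """Binarize upscaled activation maps S_k(x) at the channel-wide
    quantile threshold T_k (as in the algorithm above)."""
    threshold = np.quantile(activations, quantile)
    return activations >= threshold

def iou_score(activation_masks, concept_masks):
    """IoU between a channel's binarized activation masks and a concept's
    ground-truth segmentation masks, summed over all images.

    activation_masks, concept_masks: boolean arrays of shape (n_images, H, W)
    """
    intersection = np.logical_and(activation_masks, concept_masks).sum()
    union = np.logical_or(activation_masks, concept_masks).sum()
    return intersection / union if union > 0 else 0.0
```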
FIGURE 9.7 Activation mask for inception_4e channel 750 which detects dogs with $IoU = 0.203$. Figure originally from the project website (http://netdissect.csail.mit.edu/).
9.1.2.2 Experiments
9.1.3 Advantages
Feature visualizations give unique insight into the working of neural
networks, especially for image recognition. Given the complexity and
opacity of neural networks, feature visualization is an important step
in analyzing and describing neural networks. Through feature visual-
ization, we have learned that neural networks first learn simple edge
and texture detectors and more abstract part and object detectors in
higher layers. Network dissection expands those insights and makes
interpretability of network units measurable.
Network dissection allows us to automatically link units to concepts,
which is very convenient.
Feature visualization is a great tool to communicate in a non-
technical way how neural networks work.
With network dissection, we can also detect concepts beyond the
classes in the classification task. But we need datasets that contain
images with pixel-wise labeled concepts.
Feature visualization can be combined with feature attribution meth-
ods, which explain which pixels were important for the classification.
The combination of both methods allows us to explain an individual classification along with a local visualization of the learned features that were involved in the classification.
9
https://gandissect.csail.mit.edu/
9.1.4 Disadvantages
Many feature visualization images are not interpretable at all, but
contain some abstract features for which we have no words or mental
concept. The display of feature visualizations along with training data
can help. The images still might not reveal what the neural network
reacted to and only show something like “maybe there has to be some-
thing yellow in the images”. Even with Network Dissection some chan-
nels are not linked to a human concept. For example, layer conv5_3
from VGG trained on ImageNet has 193 channels (out of 512) that could
not be matched with a human concept.
There are too many units to look at, even when “only” visualizing the
channel activations. For Inception V1 there are already over 5000 chan-
nels from 9 convolutional layers. If you also want to show the negative
activations plus a few images from the training data that maximally or
minimally activate the channel (let’s say 4 positive, 4 negative images),
then you must already display more than 50 000 images. At least we
know – thanks to Network Dissection – that we do not need to inves-
tigate random directions.
Illusion of Interpretability? The feature visualizations can convey the
illusion that we understand what the neural network is doing. But do
we really understand what is going on in the neural network? Even if
we look at hundreds or thousands of feature visualizations, we can-
not understand the neural network. The channels interact in a complex
way, positive and negative activations are unrelated, multiple neurons
might learn very similar features and for many of the features we do
not have equivalent human concepts. We must not fall into the trap of
believing we fully understand neural networks just because we believe
we saw that neuron 349 in layer 7 is activated by daisies. Network Dis-
section showed that architectures like ResNet or Inception have units
10
https://distill.pub/2018/building-blocks/
that react to certain concepts. But the IoU is not that great and often many units respond to the same concept and some to no concept at all. The channels are not completely disentangled and we cannot interpret them in isolation.
For Network Dissection, you need datasets that are labeled on the
pixel level with the concepts. These datasets take a lot of effort to col-
lect, since each pixel needs to be labeled, which usually works by draw-
ing segments around objects on the image.
Network Dissection only aligns human concepts with positive activa-
tions but not with negative activations of channels. As the feature visu-
alizations showed, negative activations seem to be linked to concepts.
This might be fixed by additionally looking at the lower quantile of ac-
tivations.
11
https://github.com/tensorflow/lucid
12
https://github.com/InFoCusp/tf_cnnvis
13
https://github.com/jacobgil/keras-filter-visualization
14
https://github.com/yosinski/deep-visualization-toolbox
15
http://netdissect.csail.mit.edu/
9.2 Pixel Attribution (Saliency Maps)

Pixel attribution methods highlight the pixels that were relevant for a certain image classification by a neural network.
FIGURE 9.8 A saliency map in which pixels are colored by their contribu-
tion to the classification.
You will see later in this chapter what is going on in this particular im-
age. Pixel attribution methods can be found under various names: sen-
sitivity map, saliency map, pixel attribution map, gradient-based at-
tribution methods, feature relevance, feature attribution, and feature
contribution.
Pixel attribution is a special case of feature attribution, but for images.
Feature attribution explains individual predictions by attributing each
input feature according to how much it changed the prediction (neg-
atively or positively). The features can be input pixels, tabular data or
words. SHAP, Shapley Values and LIME are examples of general fea-
ture attribution methods.
We consider neural networks that output as prediction a vector of length C, which includes regression where C = 1. The output of the neural network for image I is called $S(I) = [S_1(I), \ldots, S_C(I)]$. All these methods take as input $x \in \mathbb{R}^p$ (which can be image pixels, tabular data, words, …) with p features and output as explanation a relevance value for each of the p input features: $R^c = [R_1^c, \ldots, R_p^c]$. The c indicates the relevance for the c-th output $S_c(I)$.
There is a confusing amount of pixel attribution approaches. It helps
to understand that there are two different types of attribution meth-
ods:
Occlusion- or perturbation-based: Methods like SHAP and LIME
manipulate parts of the image to generate explanations (model-
agnostic).
Gradient-based: Many methods compute the gradient of the predic-
tion (or classification score) with respect to the input features. The
gradient-based methods (of which there are many) mostly differ in
how the gradient is computed.
Both approaches have in common that the explanation has the same
size as the input image (or at least can be meaningfully projected onto
it) and they assign each pixel a value that can be interpreted as the rel-
evance of the pixel to the prediction or classification of that image.
Another useful categorization for pixel attribution methods is the
baseline question:
Gradient-only methods tell us whether a change in a pixel would
change the prediction. Examples are Vanilla Gradient and Grad-CAM.
The interpretation of the gradient-only attribution is: If I were to
change this pixel, the predicted class probability would go up (for pos-
itive gradient) or down (for negative gradient). The larger the absolute
value of the gradient, the stronger the effect of a change at this pixel.
Path-attribution methods compare the current image to a reference image, which can be an artificial "zero" image such as a completely grey image. The difference between the actual prediction and the prediction for the baseline image is attributed to the pixels. Path-attribution methods include model-specific ones such as Integrated Gradients as well as model-agnostic methods such as LIME and SHAP.

9.2.1 Vanilla Gradient (Saliency Maps)

Vanilla Gradient computes the gradient of the class score of interest with respect to the input pixels. The approach can be seen as a first-order Taylor expansion of the class score:

$$S_c(I) \approx w^{T}I + b$$

with

$$w = \frac{\partial S_c}{\partial I}\bigg|_{I_0}$$
Now, there is some ambiguity about how to perform a backward pass of the gradients, since non-linear units such as ReLU (Rectified Linear Unit) "remove" the sign. So when we do a backward pass, we do not know whether to assign a positive or negative activation. Using my incredible ASCII art skill, the ReLU function looks like this: _/ and is defined as $X_{n+1}(x) = \max(0, X_n)$ going from layer $X_n$ to layer $X_{n+1}$. This means that when the activation of a neuron is zero, we do not know which value to backpropagate. In the case of Vanilla Gradient, the ambiguity is resolved as follows:

$$\frac{\partial f}{\partial X_n} = \frac{\partial f}{\partial X_{n+1}} \cdot \mathbb{I}(X_n > 0)$$

This means that the gradient is backpropagated as usual up to layer n+1, and then the gradients are simply set to zero where the activation at the layer below is negative or zero.
Let us look at an example where we have layers $X_n$ and $X_{n+1} = ReLU(X_n)$. Our fictive activation at $X_n$ is:

$$\begin{pmatrix} 1 & 0 \\ -1 & -10 \end{pmatrix}$$

and our gradient at $X_{n+1}$ is:

$$\begin{pmatrix} 0.4 & 1.1 \\ -0.5 & -0.1 \end{pmatrix}$$

With Vanilla Gradient, the gradient is set to zero wherever the activation at $X_n$ is not positive, so the gradient at $X_n$ becomes:

$$\begin{pmatrix} 0.4 & 0 \\ 0 & 0 \end{pmatrix}$$
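For illustration, here is a minimal PyTorch sketch of Vanilla Gradient. It relies on the framework's standard ReLU backward pass rather than re-implementing the masking rule above, and assumes a differentiable image classifier.

```python
import torch

def vanilla_gradient(model, image, target_class):
    """Vanilla Gradient saliency: gradient of the class score w.r.t. the input.

    model:        a torch.nn.Module image classifier
    image:        tensor of shape (1, channels, height, width)
    target_class: index of the class of interest
    """
    model.eval()
    image = image.clone().detach().requires_grad_(True)
    scores = model(image)                 # forward pass
    scores[0, target_class].backward()    # backward pass down to the input pixels
    # Aggregate over color channels; the absolute gradient is the saliency map.
    return image.grad.abs().max(dim=1).values.squeeze(0)
```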
9.2.2 DeconvNet
DeconvNet by Zeiler and Fergus (2014)18 is almost identical to Vanilla
Gradient. The goal of DeconvNet is to reverse a neural network and
the paper proposes operations that are reversals of the filtering, pool-
ing and activation layers. If you take a look into the paper, it looks
very different to Vanilla Gradient, but apart from the reversal of the
ReLU layer, DeconvNet is equivalent to the Vanilla Gradient approach.
Vanilla Gradient can be seen as a generalization of DeconvNet. De-
convNet makes a different choice for backpropagating the gradient
through ReLU:
$$R_n = R_{n+1}\,\mathbb{I}(R_{n+1} > 0),$$
where 𝑅𝑛 and 𝑅𝑛+1 are the layer reconstructions. When backpassing
from layer n to layer n-1, DeconvNet “remembers” which of the activa-
tions in layer n were set to zero in the forward pass and sets them to 0
in layer n-1. Activations with a negative value in layer n are set to zero in layer n-1. The gradient at $X_n$ for the example from earlier becomes:
$$\begin{pmatrix} 0.4 & 1.1 \\ 0 & 0 \end{pmatrix}$$
9.2.3 Grad-CAM
Grad-CAM provides visual explanations for CNN decisions. Unlike
other methods, the gradient is not backpropagated all the way back
to the image, but (usually) to the last convolutional layer to produce a
coarse localization map that highlights important regions of the im-
age.
Grad-CAM stands for Gradient-weighted Class Activation Map. And,
as the name suggests, it is based on the gradient of the neural net-
works. Grad-CAM, like other techniques, assigns each neuron a rele-
vance score for the decision of interest. This decision of interest can be
the class prediction (which we find in the output layer), but can theoret-
ically be any other layer in the neural network. Grad-CAM backpropa-
18
Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolu-
tional networks.” European conference on computer vision. Springer, Cham, 2014.
Here, u is the width, v the height of the explanation and c the class of
interest.
5. Weight each feature map “pixel” by the gradient for the class.
Indices i and j refer to the width and height dimensions:
$$\alpha_k^c = \overbrace{\frac{1}{Z}\sum_i\sum_j}^{\text{global average pooling}}\;\underbrace{\frac{\partial y^c}{\partial A_{ij}^k}}_{\text{gradients via backprop}}$$
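Assuming you have already extracted the activations of the last convolutional layer and the gradients of the class score with respect to them (for example via framework hooks, which are omitted here), the remaining Grad-CAM computation is a few lines of numpy; this is a sketch, not the reference implementation.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from the activations A^k of the last convolutional layer
    and the gradients of the class score y^c w.r.t. those activations.

    activations, gradients: arrays of shape (channels, height, width)
    """
    # Global average pooling of the gradients gives one weight alpha_k^c per channel.
    alphas = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, followed by ReLU (as in the Grad-CAM paper)
    # so that only positive evidence for the class remains.
    cam = np.maximum((alphas[:, None, None] * activations).sum(axis=0), 0)
    # Normalize to [0, 1] for visualization (upscaling to the image size omitted).
    return cam / cam.max() if cam.max() > 0 else cam
```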
9.2.5 SmoothGrad
The idea of SmoothGrad by Smilkov et al. (2017)19 is to make gradient-based explanations less noisy by adding noise and averaging over these artificially noisy gradients. SmoothGrad is not a standalone explanation method, but an extension of any gradient-based explanation method.
SmoothGrad works in the following way:

1. Generate multiple versions of the image of interest by adding Gaussian noise to it.
2. Create pixel attribution maps for all noisy versions of the image.
3. Average the pixel attribution maps.

Yes, it is that simple. Why should this work? The theory is that the
derivative fluctuates greatly at small scales. Neural networks have no
incentive during training to keep the gradients smooth, their goal is
to classify images correctly. Averaging over multiple maps “smooths
out” these fluctuations:

$$R_{sg}(x) = \frac{1}{N}\sum_{i=1}^{N} R(x + g_i),$$
Here, $g_i \sim N(0, \sigma^2)$ are noise vectors sampled from the Gaussian distribution. The "ideal" noise level depends on the input image and the network. The authors suggest a noise level of 10%-20%, which means that $\frac{\sigma}{x_{max}-x_{min}}$ should be between 0.1 and 0.2. The limits $x_{max}$ and $x_{min}$ refer to the maximum and minimum pixel values of the image. The second parameter is the number of noisy samples N, for which the authors suggest N = 50, since there are diminishing returns above that.
19
Smilkov, Daniel, et al. “Smoothgrad: removing noise by adding noise.” arXiv
preprint arXiv:1706.03825 (2017).
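A sketch of SmoothGrad on top of any gradient-based attribution function (for example a wrapper around the vanilla_gradient sketch from earlier) could look like this; parameter names and defaults are illustrative.

```python
import torch

def smoothgrad(saliency_fn, image, target_class, n_samples=50, noise_level=0.15):
    """SmoothGrad: average attribution maps over noisy copies of the input.

    saliency_fn: any gradient-based attribution method with signature
                 saliency_fn(image, target_class)
    noise_level: sigma / (x_max - x_min); the authors suggest 10%-20%
    """
    sigma = noise_level * (image.max() - image.min())
    maps = []
    for _ in range(n_samples):
        noisy = image + torch.randn_like(image) * sigma   # add Gaussian pixel noise
        maps.append(saliency_fn(noisy, target_class))
    return torch.stack(maps).mean(dim=0)                  # average the noisy maps
```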
9.2.6 Examples
Let us see some examples of what these maps look like and how the
methods compare qualitatively. The network under examination is
VGG-16 (Simonyan and Zisserman 201420), which was trained on ImageNet and can therefore distinguish 1,000 classes. For the following
images we will create explanations for the class with the highest clas-
sification score.
These are the images and their classification by the neural network:
The image on the left with the honorable dog guarding the Inter-
pretable Machine Learning book was classified as “Greyhound” with a
probability of 35% (seems like "Interpretable Machine Learning book" was not one of the 1,000 classes). The image in the middle shows a bowl
of yummy ramen soup and is correctly classified as “Soup Bowl” with
a probability of 50%. The third image shows an octopus on the ocean
floor, which is incorrectly classified as “Eel” with a high probability of
70%.
20
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks
for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
And these are the pixel attributions that aim to explain the classifica-
tion:
FIGURE 9.10 Pixel attributions or saliency maps for Vanilla Gradient, SmoothGrad and Grad-CAM.
but also the area around the chopsticks. In the octopus image, mostly
the animal itself is highlighted.
For the soup bowl, Grad-CAM highlights the egg part and, for some
reason, the upper part of the bowl. The octopus explanations by Grad-
CAM are even messier.
You can already see here the difficulties in assessing whether we trust
the explanations. As a first step, we need to consider which parts of the
image contain information that is relevant to the image's classification.
But then we also need to think about what the neural network might
have used for the classification. Perhaps the soup bowl was correctly
classified based on the combination of eggs and chopsticks, as SmoothGrad implies? Or maybe the neural network recognized the shape of the bowl plus some ingredients, as Grad-CAM suggests? We just do not
know.
And that is the big issue with all of these methods. We do not have a
ground truth for the explanations. We can only, in a first step, reject
explanations that obviously make no sense (and even in this step we cannot be very confident, since the prediction process in the neural network is very complicated).
9.2.7 Advantages
The explanations are visual and we are quick to recognize images. In
particular, when methods only highlight important pixels, it is easy to
immediately recognize the important regions of the image.
Gradient-based methods are usually faster to compute than model-
agnostic methods. For example, LIME and SHAP can also be used to
explain image classifications, but are more expensive to compute.
There are many methods to choose from.
9.2.8 Disadvantages
As with most interpretation methods, it is difficult to know whether
an explanation is correct, and a huge part of the evaluation is only
qualitative (“These explanations look about right, let’s publish the pa-
per already.”).
9.2.9 Software
There are several software implementations of pixel attribution meth-
ods. For the example, I used tf-keras-vis25 . One of the most compre-
hensive libraries is iNNvestigate26 , which implements Vanilla gradi-
ent, Smoothgrad, Deconvnet, Guided Backpropagation, PatternNet,
LRP and more. A lot of the methods are implemented in the DeepEx-
plain Toolbox27 .
25
https://pypi.org/project/tf-keras-vis/
26
https://github.com/albermax/innvestigate
27
https://github.com/marcoancona/DeepExplain
9.3 Detecting Concepts

The TCAV (Testing with Concept Activation Vectors) method measures the conceptual sensitivity of the prediction for class k with respect to a concept C, represented by a concept activation vector (CAV) $v_l^C$ in layer l:

$$S_{C,k,l}(x) = \nabla h_{l,k}(\hat{f}_l(x)) \cdot v_l^C$$

where $\hat{f}_l$ maps the input x to the activation vector of the layer l and $h_{l,k}$ maps the activation vector to the logit output of class k.
Mathematically, the sign of $S_{C,k,l}(x)$ only depends on the angle between the gradient of $h_{l,k}(\hat{f}_l(x))$ and $v_l^C$. If the angle is less than 90 degrees, $S_{C,k,l}(x)$ will be positive, and if the angle is greater than 90 degrees, $S_{C,k,l}(x)$ will be negative. Since the gradient $\nabla h_{l,k}$ points in the direction that maximizes the output most rapidly, conceptual sensitivity $S_{C,k,l}$ intuitively indicates whether $v_l^C$ points in a similar direction that maximizes $h_{l,k}$. Thus, $S_{C,k,l}(x) > 0$ can be interpreted as concept C encouraging the model to classify x into class k.
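The TCAV score for a concept and class is then simply the fraction of class-k inputs with positive conceptual sensitivity. A minimal numpy sketch, assuming you already have the per-input gradients of $h_{l,k}$ with respect to the layer activations and a trained CAV:

```python
import numpy as np

def tcav_score(gradients, cav):
    """TCAV score of concept C for class k in layer l.

    gradients: array of shape (n_inputs, n_units); each row is the gradient of
               the class-k logit h_{l,k} w.r.t. the layer-l activations for one
               input of class k
    cav:       concept activation vector v_l^C of length n_units
    Returns the fraction of inputs with positive conceptual sensitivity.
    """
    sensitivities = gradients @ cav      # S_{C,k,l}(x) = grad(h_{l,k}) . v_l^C
    return float(np.mean(sensitivities > 0))
```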
9.3.2 Example
Let us see an example available on the TCAV GitHub29 . Continuing
the “zebra” class example we have been using previously, this exam-
ple shows the result of the TCAV scores of “striped”, “zigzagged”, and
“dotted” concepts. The image classifier we are using is InceptionV330 , a
convolutional neural network trained using ImageNet data. Each con-
cept or random dataset contains 50 images, and we are using 10 ran-
dom datasets for the statistical significance test with the significance
level of 0.05. We are not using the Bonferroni correction because we
only have a few random datasets, but it is recommended to add the
correction in practice to avoid false discovery.
In practice, you may want to use more than 50 images in each dataset
to train better CAVs. You may also want to use more than 10 random
datasets to perform better statistical significance tests. You can also
apply TCAV to multiple bottlenecks to have a more thorough observa-
tion.
9.3.3 Advantages
Since users are only required to collect data for training the concepts
that they are interested in, TCAV does not require users to have ma-
chine learning expertise. This allows TCAV to be extremely useful for
domain experts to evaluate their complicated neural network models.
Another unique characteristic of TCAV is its customizability, enabled by the user-defined concept datasets: users can study any concept for which they can collect example images.
29
https://github.com/tensorflow/tcav/blob/master/Run_TCAV.ipynb
30
Szegedy, Christian, et al. “Rethinking the Inception architecture for computer
vision.” Proceedings of the IEEE conference on computer vision and pattern recog-
nition, 2818-2826 (2016).
9.3.4 Disadvantages
TCAV might perform badly on shallower neural networks. As many papers have suggested31, concepts in deeper layers are more separable. If a network is too shallow, its layers may not be capable of separating concepts clearly, so that TCAV is not applicable.
Since TCAV requires additional annotations for the concept datasets, it can be very expensive for tasks that do not have readily labelled data. A possible alternative to TCAV when annotation is expensive is ACE32, which we will briefly talk about in the next section.
Although TCAV is hailed because of its customizability, it is difficult
to apply to concepts that are too abstract or general. This is mainly be-
cause TCAV describes a concept by its corresponding concept dataset.
The more abstract or general a concept is, such as “happiness”, the
more data are required to train a CAV for that concept.
Though TCAV has gained popularity for image data, it has relatively limited applications to text data and tabular data.
9.3.6 Software
The official Python library of TCAV36 requires TensorFlow, but there are other versions implemented online. Easy-to-use Jupyter notebooks are also available in the tensorflow/tcav repository37.
34
Koh, Pang Wei, et al. “Concept bottleneck models.” Proceedings of the 37th In-
ternational Conference on Machine Learning (2020).
35
Chen, Zhi, et al. “Concept Whitening for Interpretable Image Recognition.” Na-
ture Machine Intelligence 2, 772–782 (2020).
36
https://pypi.org/project/tcav/
37
https://github.com/tensorflow/tcav/tree/master
9.4 Adversarial Examples

This section focuses on adversarial examples for image classifiers with deep neural networks, as a lot of research is done in this area and
the visualization of adversarial images is very educational. Adversar-
ial examples for images are images with intentionally perturbed pix-
els with the aim to deceive the model during application time. The ex-
amples impressively demonstrate how easily deep neural networks for
object recognition can be deceived by images that appear harmless to
humans. If you have not yet seen these examples, you might be sur-
prised, because the changes in predictions are incomprehensible for
a human observer. Adversarial examples are like optical illusions but
for machines.
Something is Wrong With My Dog
Szegedy et al. (2013)38 used a gradient-based optimization approach in
their work “Intriguing Properties of Neural Networks” to find adver-
sarial examples for deep neural networks.
These adversarial examples were generated by minimizing the following function with respect to r:

$$loss(\hat{f}(x+r), l) + c \cdot |r|$$

where x is the original image, r the perturbation, l the target class and c a parameter that balances the size of the perturbation against the misclassification loss.
Goodfellow et al. (2014)39 invented the fast gradient sign method for generating adversarial images. The gradient sign method uses the gra-
dient of the underlying model to find adversarial examples. The origi-
nal image x is manipulated by adding or subtracting a small error 𝜖 to
each pixel. Whether we add or subtract 𝜖 depends on whether the sign
of the gradient for a pixel is positive or negative. Adding errors in the
direction of the gradient means that the image is intentionally altered
so that the model classification fails.
The following formula describes the core of the fast gradient sign method:

$$x' = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$$

where $\nabla_x J$ is the gradient of the model's loss function with respect to the original input image x, given the model parameters $\theta$ and the true label y, and $\epsilon$ controls the size of the perturbation.
39
Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. “Explaining and
harnessing adversarial examples.” arXiv preprint arXiv:1412.6572 (2014).
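A minimal PyTorch sketch of the fast gradient sign method, assuming image pixels scaled to [0, 1] and a classifier trained with cross-entropy; treat it as an illustration of the formula above, not as a hardened attack implementation.

```python
import torch
import torch.nn.functional as F

def fast_gradient_sign(model, image, true_label, epsilon=0.01):
    """FGSM: perturb each pixel by +/- epsilon in the direction that
    increases the loss for the true label.

    image:      tensor of shape (1, channels, height, width), values in [0, 1]
    true_label: tensor of shape (1,) with the class index
    """
    model.eval()
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)   # loss w.r.t. the true class
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()  # x' = x + eps * sign(grad)
    return adversarial.clamp(0, 1).detach()            # keep a valid pixel range
```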
pixel with the five attributes for location and color and each of those
attributes is a mixture of three random parent pixels.
The creation of children is stopped if one of the candidate solutions is
an adversarial example, meaning it is classified as an incorrect class, or
if the number of maximum iterations specified by the user is reached.
Everything is a toaster: Adversarial patch
One of my favorite methods brings adversarial examples into physical
reality. Brown et al. (2017)41 designed a printable label that can be stuck
next to objects to make them look like toasters for an image classifier.
Brilliant work!
This method differs from the methods presented so far for adversarial
examples, since the restriction that the adversarial image must be very
close to the original image is removed. Instead, the method completely
replaces a part of the image with a patch that can take on any shape.
The image of the patch is optimized over different background images,
with different positions of the patch on the images, sometimes moved,
sometimes larger or smaller and rotated, so that the patch works in
41
Brown, Tom B., et al. “Adversarial patch.” arXiv preprint arXiv:1712.09665 (2017).
many situations. In the end, this optimized image can be printed and
used to deceive image classifiers in the wild.
Never bring a 3D-printed turtle to a gunfight – even if your computer
thinks it is a good idea: Robust adversarial examples
The next method is literally adding another dimension to the toaster:
Athalye et al. (2017)42 3D-printed a turtle that was designed to look like
a rifle to a deep neural network from almost all possible angles. Yeah,
you read that right. A physical object that looks like a turtle to humans
looks like a rifle to the computer!
The turtle was created with the Expectation Over Transformation (EOT) algorithm, a method for generating adversarial examples that even work when the image is transformed. The main idea behind EOT is to optimize adversarial
examples across many possible transformations. Instead of minimiz-
ing the distance between the adversarial example and the original im-
age, EOT keeps the expected distance between the two below a certain
threshold, given a selected distribution of possible transformations.
The expected distance under transformation can be written as:

$$\mathbb{E}_{t\sim T}\left[d(t(x'), t(x))\right]$$
where x is the original image, t(x) the transformed image (e.g. ro-
tated), x’ the adversarial example and t(x’) its transformed version.
Apart from working with a distribution of transformations, the EOT
method follows the familiar pattern of framing the search for adver-
sarial examples as an optimization problem. We try to find an adver-
sarial example x’ that maximizes the probability for the selected class
𝑦𝑡 (e.g. “rifle”) across the distribution of possible transformations T:
$$\arg\max_{x'} \mathbb{E}_{t\sim T}\left[\log P(y_t|t(x'))\right]$$
With the constraint that the expected distance over all possible transformations between adversarial example x' and original image x remains below a certain threshold:

$$\mathbb{E}_{t\sim T}\left[d(t(x'), t(x))\right] < \epsilon$$
your couch, you can send data and my service answers with the corre-
sponding classifications. Most adversarial attacks are not designed to
work in this scenario because they require access to the gradient of the
underlying deep neural network to find adversarial examples. Paper-
not and colleagues (2017)43 showed that it is possible to create adver-
sarial examples without internal model information and without ac-
cess to the training data. This type of (almost) zero-knowledge attack
is called black box attack.
How it works:
1. Start with a few images that come from the same domain as
the training data, e.g. if the classifier to be attacked is a digit
classifier, use images of digits. The knowledge of the domain
is required, but not the access to the training data.
2. Get predictions for the current set of images from the black
box.
3. Train a surrogate model on the current set of images (for ex-
ample a neural network).
4. Create a new set of synthetic images using a heuristic that
examines for the current set of images in which direction to
manipulate the pixels to make the model output have more
variance.
5. Repeat steps 2 to 4 for a predefined number of epochs.
6. Create adversarial examples for the surrogate model using
the fast gradient method (or similar).
7. Attack the original model with adversarial examples.
that these spammers exist and the only abuse of the email service is
sending pirated copies of music, then the defense would be different
(e.g. scanning the attachments for copyrighted material instead of an-
alyzing the text for spam indicators).
Being proactive means actively testing and identifying weak points
of the system. You are proactive when you actively try to deceive the
model with adversarial examples and then defend against them. Us-
ing interpretation methods to understand which features are impor-
tant and how features affect the prediction is also a proactive step in
understanding the weaknesses of a machine learning model. As the
data scientist, do you trust your model in this dangerous world with-
out ever having looked beyond the predictive power on a test dataset?
Have you analyzed how the model behaves in different scenarios, iden-
tified the most important inputs, checked the prediction explanations
for some examples? Have you tried to find adversarial inputs? The in-
terpretability of machine learning models plays a major role in cyber-
security. Being reactive, the opposite of proactive, means waiting un-
til the system has been attacked and only then understanding the prob-
lem and installing some defensive measures.
How can we protect our machine learning systems against adversar-
ial examples? A proactive approach is the iterative retraining of the
classifier with adversarial examples, also called adversarial training.
Other approaches are based on game theory, such as learning invari-
ant transformations of the features or robust optimization (regulariza-
tion). Another proposed method is to use multiple classifiers instead of just one and have them vote on the prediction (ensemble), but that is not guaranteed to work, since they could all suffer from similar adversarial examples. Another approach that does not work well either is
gradient masking, which constructs a model without useful gradients
by using a nearest neighbor classifier instead of the original model.
We can distinguish types of attacks by how much an attacker knows
about the system. The attackers may have perfect knowledge (white
box attack), meaning they know everything about the model like the
type of model, the parameters and the training data; the attackers may
have partial knowledge (gray box attack), meaning they might only
know the feature representation and the type of model that was used,
9.4 Adversarial Examples 367
but have no access to the training data or the parameters; the attack-
ers may have zero knowledge (black box attack), meaning they can only
query the model in a black box manner but have no access to the train-
ing data or information about the model parameters. Depending on
the level of information, the attackers can use different techniques to
attack the model. As we have seen in the examples, even in the black
box case adversarial examples can be created, so that hiding informa-
tion about data and model is not sufficient to protect against attacks.
Given the nature of the cat-and-mouse game between attackers and
defenders, we will see a lot of development and innovation in this
area. Just think of the many different types of spam emails that are
constantly evolving. New methods of attacks against machine learn-
ing models are invented and new defensive measures are proposed
against these new attacks. More powerful attacks are developed to
evade the latest defenses and so on, ad infinitum. With this chapter I
hope to sensitize you to the problem of adversarial examples and that
only by proactively studying the machine learning models are we able
to discover and remedy weaknesses.
9.5 Influential Instances
est income in your sample would earn ten times more, the resulting
median would not change.
Deletion diagnostics and influence functions can also be applied to the
parameters or predictions of machine learning models to understand
their behavior better or to explain individual predictions. Before we
look at these two approaches for finding influential instances, we will
examine the difference between an outlier and an influential instance.
Outlier
An outlier is an instance that is far away from the other instances in
the dataset. “Far away” means that the distance, for example the Eu-
clidean distance, to all the other instances is very large. In a dataset of
newborns, a newborn weighing 6 kg would be considered an outlier.
In a dataset of bank accounts with mostly checking accounts, a ded-
icated loan account (large negative balance, few transactions) would
be considered an outlier. The following figure shows an outlier for a
1-dimensional distribution.
Outliers can be interesting data points (e.g. criticisms). When an out-
lier influences the model it is also an influential instance.
Influential instance
An influential instance is a data instance whose removal has a strong
effect on the trained model. The more the model parameters or predic-
tions change when the model is retrained with a particular instance
removed from the training data, the more influential that instance is.
Whether an instance is influential for a trained model also depends
on its value for the target y. The following figure shows an influential
instance for a linear regression model.
Why do influential instances help to understand the model?
The key idea behind influential instances for interpretability is to trace
model parameters and predictions back to where it all began: the train-
ing data. A learner, that is, the algorithm that generates the machine
learning model, is a function that takes training data consisting of fea-
tures X and target y and generates a machine learning model. For ex-
ample, the learner of a decision tree is an algorithm that selects the
split features and the values at which to split. A learner for a neural
network uses backpropagation to find the best weights.
We ask how the model parameters or the predictions would change if
we removed instances from the training data in the training process.
This is in contrast to other interpretability approaches that analyze
how the prediction changes when we manipulate the features of the
instances to be predicted, such as partial dependence plots or feature
importance. With influential instances, we do not treat the model as
fixed, but as a function of the training data. Influential instances help
us answer questions about global model behavior and about individual
predictions. Which were the most influential instances for the model
parameters or the predictions overall? Which were the most influen-
tial instances for a particular prediction? Influential instances tell us
for which instances the model could have problems, which training in-
stances should be checked for errors and give an impression of the ro-
bustness of the model. We might not trust a model if a single instance has a strong influence on the model's predictions and parameters.
FIGURE 9.17 A linear model with one feature. Trained once on the full
data and once without the influential instance. Removing the influen-
tial instance changes the fitted slope (weight/coefficient) drastically.
FIGURE 9.18 A learner learns a model from training data (features plus
target). The model makes predictions for new data.
DFBETA measures the effect of deleting instance i on the model parameters:

$$DFBETA_{i} = \beta - \beta^{(-i)}$$
where 𝛽 is the weight vector when the model is trained on all data in-
stances, and 𝛽 (−𝑖) the weight vector when the model is trained without
instance i. Quite intuitive I would say. DFBETA works only for mod-
els with weight parameters, such as logistic regression or neural net-
45
Cook, R. Dennis. “Detection of influential observation in linear regression.”
Technometrics 19.1 (1977): 15-18.
works, but not for models such as decision trees, tree ensembles, some
support vector machines and so on.
Cook’s distance was invented for linear regression models and ap-
proximations for generalized linear regression models exist. Cook’s
distance for a training instance is defined as the (scaled) sum of the
squared differences in the predicted outcome when the i-th instance
is removed from the model training.

$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_j^{(-i)}\right)^2}{p \cdot MSE}$$
A simpler, model-agnostic alternative is to measure the influence of instance i directly as the average change of the predictions when the instance is deleted:

$$\text{Influence}^{(-i)} = \frac{1}{n}\sum_{j=1}^{n}\left|\hat{y}_j - \hat{y}_j^{(-i)}\right|$$

and the influence of deleting instance i on the prediction of a single instance j as:

$$\text{Influence}_j^{(-i)} = \left|\hat{y}_j - \hat{y}_j^{(-i)}\right|$$
This would also work for the difference in model parameters or the dif-
ference in the loss. In the following example we will use these simple
influence measures.
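A brute-force sketch of these deletion diagnostics with scikit-learn: refit the model once per left-out training instance and record the mean absolute change of the predicted probabilities. Names are placeholders, X and y are assumed to be numpy arrays, and the model must provide predict_proba (e.g. an SVC with probability=True).

```python
import numpy as np
from sklearn.base import clone

def deletion_influence(model, X, y, X_eval):
    """Leave-one-out influence of every training instance, measured as the mean
    absolute change of the predicted probabilities on X_eval.

    model: an (unfitted) scikit-learn classifier with predict_proba
    """
    full = clone(model).fit(X, y)
    base = full.predict_proba(X_eval)[:, 1]          # predictions with all data
    influence = np.zeros(len(X))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                # drop instance i
        reduced = clone(model).fit(X[mask], y[mask]) # retrain without instance i
        preds = reduced.predict_proba(X_eval)[:, 1]
        influence[i] = np.mean(np.abs(base - preds))
    return influence
```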
Deletion diagnostics example
In the following example, we train a support vector machine to pre-
dict cervical cancer given risk factors and measure which training in-
stances were most influential overall and for a particular prediction.
Since the prediction of cancer is a classification problem, we measure
the influence as the difference in predicted probability for cancer. An
instance is influential if the predicted probability strongly increases or
decreases on average in the dataset when the instance is removed from
model training. Measuring the influence for all 858 training instances requires training the model once on all data and retraining it 858 times (= size of the training data), each time with one of the instances removed.
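The leave-one-out loop itself is simple, just expensive. The following sketch illustrates it with a logistic regression on simulated data standing in for the book's SVM and cervical cancer data (the feature names are made up for the illustration):

```r
# Model-agnostic deletion diagnostics: retrain once per removed instance and
# measure the average absolute change in the predicted probability.
set.seed(1)
n <- 200
dat <- data.frame(age = rnorm(n, 40, 10), smokes = rbinom(n, 1, 0.3))
dat$cancer <- rbinom(n, 1, plogis(-4 + 0.05 * dat$age + 0.8 * dat$smokes))

fit_full <- glm(cancer ~ age + smokes, data = dat, family = binomial())
p_full   <- predict(fit_full, newdata = dat, type = "response")

influence <- sapply(seq_len(n), function(i) {
  fit_i <- glm(cancer ~ age + smokes, data = dat[-i, ], family = binomial())
  p_i   <- predict(fit_i, newdata = dat, type = "response")
  mean(abs(p_full - p_i))   # average change over all n predictions
  # abs(p_full[j] - p_i[j]) would instead give the influence on instance j only
})

which.max(influence)        # overall most influential training instance
```

Restricting the difference to a single index j gives the influence on one specific prediction, which is what the analysis of the 7-th instance further below does.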
The most influential instance has an influence measure of about 0.01.
An influence of 0.01 means that if we remove the 540-th instance, the
predicted probability changes by 1 percentage point on average. This
is quite substantial considering the average predicted probability for
cancer is 6.4%. The mean value of influence measures over all possible
deletions is 0.2 percentage points. Now we know which of the data
instances were most influential for the model. This is already useful
to know for debugging the data. Is there a problematic instance? Are
there measurement errors? The influential instances are the first ones
that should be checked for errors, because each error in them strongly
influences the model predictions.
Apart from model debugging, can we learn something to better un-
derstand the model? Just printing out the top 10 most influential in-
stances is not very useful, because it is just a table of instances with
many features. All methods that return instances as output only make
sense if we have a good way of representing them. But we can better
understand what kind of instances are influential when we ask: What
distinguishes an influential instance from a non-influential instance?
We can turn this question into a regression problem and model the in-
fluence of an instance as a function of its feature values. We are free to
choose any model from the chapter on Interpretable Machine Learn-
ing Models. For this example I chose a decision tree (following figure)
that shows that data from women of age 35 and older were the most
influential for the support vector machine. Of all the women in the
dataset, 153 out of 858 were older than 35. In the chapter on Partial
Dependence Plots we have seen that after 40 there is a sharp increase
in the predicted probability of cancer and the Feature Importance has
also detected age as one of the most important features. The influence
analysis tells us that the model becomes increasingly unstable when
predicting cancer for higher ages. This in itself is valuable information.
This means that errors in these instances can have a strong effect on
the model.
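Continuing the sketch from above (so `dat` and `influence` are assumed to exist; the book's example uses the cervical cancer features and the SVM instead), this second-stage analysis is just a shallow regression tree on the influence values:

```r
# Model the influence of each training instance as a function of its features
library(rpart)

influence_dat <- cbind(dat[, c("age", "smokes")], influence = influence)
tree <- rpart(influence ~ ., data = influence_dat,
              control = rpart.control(maxdepth = 2))  # depth 2 as in the figure
tree   # the splits describe which kinds of instances were most influential
```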
This first influence analysis revealed the overall most influential in-
stance. Now we select one of the instances, namely the 7-th instance,
for which we want to explain the prediction by finding the most in-
fluential training data instances. It is like a counterfactual question:
How would the prediction for instance 7 change if we omit instance
i from the training process? We repeat this removal for all instances.
Then we select the training instances that result in the biggest change
in the prediction of instance 7 when they are omitted from the training
and use them to explain the prediction of the model for that instance.
I chose to explain the prediction for instance 7 because it is the in-
stance with the highest predicted probability of cancer (7.35%), which
I thought was an interesting case to analyze more deeply. We could
return the, say, top 10 most influential instances for predicting the 7-th instance, but again a plain table of instances would not be very informative. It is more useful to model the relationship between influence and feature values with a decision tree (Figure 9.20).
FIGURE 9.19 A decision tree that models the relationship between the
influence of the instances and their features. The maximum depth of
the tree is set to 2.
FIGURE 9.20 Decision tree that explains which instances were most
influential for predicting the 7-th instance. Data from women who
smoked for 18.5 years or longer had a large influence on the prediction
of the 7-th instance, with an average change in absolute prediction by
11.7 percentage points of cancer probability.
Influence Functions

Instead of deleting a training instance and retraining the model, we can approximate its influence by upweighting the instance z by a small amount $\epsilon$ in the training loss:

$$\hat{\theta}_{\epsilon,z} = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}L(z_i,\theta) + \epsilon L(z,\theta)$$

where $\theta$ is the model parameter vector and $\hat{\theta}_{\epsilon,z}$ is the parameter vector after upweighting z by a very small number $\epsilon$. L is the loss function with which the model was trained, the $z_i$ are the training instances and z is the training instance which we want to upweight to simulate its removal.
The intuition behind this formula is: How much will the loss change if
we upweight a particular instance 𝑧𝑖 from the training data by a little
(𝜖) and downweight the other data instances accordingly? What would
the parameter vector look like to optimize this new combined loss? The
influence function of the parameters, i.e. the influence of upweighting
training instance z on the parameters, can be calculated as follows:

$$I_{\text{up,params}}(z) = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1}\,\nabla_{\theta}L(z,\hat{\theta})$$
The last expression $\nabla_{\theta}L(z,\hat{\theta})$ is the loss gradient with respect to the model parameters for the upweighted training instance z. The term $H_{\hat{\theta}}$ is the Hessian of the loss, i.e. the matrix of second derivatives averaged over the training data:

$$H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla^{2}_{\theta}L(z_i,\hat{\theta})$$
More informally: The Hessian matrix records how curved the loss is at
a certain point. The Hessian is a matrix and not just a vector, because
it describes the curvature of the loss and the curvature depends on the
direction in which we look. The actual calculation of the Hessian ma-
trix is time-consuming if you have many parameters. Koh and Liang
suggested some tricks to calculate it efficiently, which are beyond the
scope of this chapter. Updating the model parameters, as described by
the above formula, is equivalent to taking a single Newton step after
forming a quadratic expansion around the estimated model parame-
ters.
What intuition is behind this influence function formula? The formula
comes from forming a quadratic expansion around the parameters
$\hat{\theta}$. That means we do not actually know, or it is too complex to cal-
culate how exactly the loss of instance z will change when it is re-
moved/upweighted. We approximate the function locally by using in-
formation about the steepness (= gradient) and the curvature (= Hes-
sian matrix) at the current model parameter setting. With this loss ap-
proximation, we can calculate what the new parameters would approx-
imately look like if we upweighted instance z:

$$\hat{\theta}_{-z} \approx \hat{\theta} - \frac{1}{n}I_{\text{up,params}}(z)$$
Instead of computing new parameters explicitly, we can also directly compute the influence of upweighting instance z on the loss of a specific (test) instance $z_{test}$:

$$\begin{aligned}
I_{\text{up,loss}}(z, z_{test}) &= \left.\frac{dL(z_{test},\hat{\theta}_{\epsilon,z})}{d\epsilon}\right|_{\epsilon=0}\\
&= \nabla_{\theta}L(z_{test},\hat{\theta})^{T}\left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0}\\
&= -\nabla_{\theta}L(z_{test},\hat{\theta})^{T}H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z,\hat{\theta})
\end{aligned}$$
The first line of this equation means that we measure the influence of a
training instance on a certain prediction 𝑧𝑡𝑒𝑠𝑡 as a change in loss of the
test instance when we upweight the instance z and get new parameters $\hat{\theta}_{\epsilon,z}$. For the second line of the equation, we have applied the chain rule
of derivatives and get the derivative of the loss of the test instance with
respect to the parameters times the influence of z on the parameters.
In the third line, we replace the expression with the influence function
for the parameters. The first term in the third line, $\nabla_{\theta}L(z_{test},\hat{\theta})^{T}$, is the gradient of the loss of the test instance with respect to the model parameters.
Having a formula is great and the scientific and accurate way of show-
ing things. But I think it is very important to get some intuition about
what the formula means. The formula for 𝐼up,loss states that the influ-
ence function of the training instance z on the prediction of an in-
stance 𝑧𝑡𝑒𝑠𝑡 is “how strongly the instance reacts to a change of the
model parameters” multiplied by “how much the parameters change
when we upweight the instance z”. Another way to read the formula:
The influence is proportional to how large the gradients for the train-
ing and test loss are. The higher the gradient of the training loss, the
higher its influence on the parameters and the higher the influence on
the test prediction. The higher the gradient of the test prediction, the
more influenceable the test instance. The entire construct can also be
seen as a measure of the similarity (as learned by the model) between
the training and the test instance.
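To make the formulas concrete, here is a small self-contained sketch that evaluates $I_{\text{up,loss}}$ for a logistic regression on simulated data, where the per-instance gradient of the negative log-likelihood is $(p_i - y_i)x_i$ and the Hessian is the weighted cross-product of the design matrix. This only illustrates the formulas; it is not the Koh and Liang implementation, which avoids forming and inverting the Hessian explicitly:

```r
# Influence functions for a logistic regression (negative log-likelihood loss)
set.seed(1)
n <- 500
X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))   # intercept + 2 features
y <- rbinom(n, 1, plogis(X %*% c(-1, 2, -1.5)))

fit   <- glm(y ~ X - 1, family = binomial())    # theta_hat
theta <- coef(fit)
p     <- as.vector(plogis(X %*% theta))

grads <- (p - y) * X                         # row i: gradient of L(z_i, theta_hat)
H     <- crossprod(X * (p * (1 - p)), X) / n # Hessian of the average loss

# I_up,loss(z, z_test) = -grad(z_test)^T H^{-1} grad(z)
I_up_loss <- function(z, z_test) {
  -as.numeric(t(grads[z_test, ]) %*% solve(H, grads[z, ]))
}

# Which training instances most influence the loss of instance 7?
infl <- sapply(seq_len(n), I_up_loss, z_test = 7)
head(order(-abs(infl)))
```

For models with many parameters, one would replace the explicit `solve(H, ...)` with Hessian-vector products, which is the kind of trick Koh and Liang use to scale this to neural networks.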
That is it with theory and intuition. The next section explains how in-
fluence functions can be applied.
Application of Influence Functions
Influence functions have many applications, some of which have al-
ready been presented in this chapter.
Understanding model behavior
Different machine learning models have different ways of making pre-
dictions. Even if two models have the same performance, the way they
make predictions from the features can be very different and therefore
fail in different scenarios. Understanding the particular weaknesses
of a model by identifying influential instances helps to form a “mental
model” of the machine learning model behavior in your mind.
Handling domain mismatches / Debugging model errors
Handling domain mismatch is closely related to better understanding
the model behavior. Domain mismatch means that the distribution of
training and test data is different, which can cause the model to per-
form poorly on the test data. Influence functions can identify training
instances that caused the error. Suppose you have trained a prediction
model for the outcome of patients who have undergone surgery. All
these patients come from the same hospital. Now you use the model
in another hospital and see that it does not work well for many pa-
tients. Of course, you assume that the two hospitals have different
patients, and if you look at their data, you can see that they differ in
many features. But what are the features or instances that have “bro-
ken” the model? Here too, influential instances are a good way to an-
swer this question. You take one of the new patients, for whom the
model has made a false prediction, find and analyze the most influen-
tial instances. For example, this could show that the second hospital
has older patients on average and the most influential instances from
the training data are the few older patients from the first hospital and
the model simply lacked the data to learn to predict this subgroup well.
The conclusion would be that the model needs to be trained on more
patients who are older in order to work well in the second hospital.
Fixing training data
If you have a limit on how many training instances you can check for
correctness, how do you make an efficient selection? The best way is
to select the most influential instances, because – by definition – they
have the most influence on the model. Even if an instance has obviously incorrect values, if it is not influential and you only need the data for the prediction model, it is a better use of your time to check the influential instances instead. For example, you train a
model for predicting whether a patient should remain in hospital or
be discharged early. You really want to make sure that the model is ro-
bust and makes correct predictions, because a wrong release of a pa-
tient can have bad consequences. Patient records can be very messy, so
you do not have perfect confidence in the quality of the data. But check-
ing patient information and correcting it can be very time-consuming,
because once you have reported which patients you need to check, the
hospital actually needs to send someone to look at the records of the se-
lected patients more closely, which might be handwritten and lying in
some archive. Checking data for a patient could take an hour or more.
In view of these costs, it makes sense to check only a few important
data instances. The best way is to select patients who have had a high
influence on the prediction model. Koh and Liang (2017) showed that
this type of selection works much better than random selection or the
selection of those with the highest loss or wrong classification.
48. https://github.com/christophM/interpretable-ml-book/blob/master/manuscript/06.5-example-based-influence-fct.Rmd
49. https://github.com/kohpangwei/influence-release
10
A Look into the Crystal Ball
The stage for our predictions is set, the crystal ball is ready, now we
look at where the field could go!
Already today, many companies and institutions use machine learning, with the big Internet companies at the fore-
front. We need to find better ways to integrate machine learning into
processes and products, train people and develop machine learning
tools that are easy to use. I believe that machine learning will become
much easier to use: We can already see that machine learning is becom-
ing more accessible, for example through cloud services (“Machine
Learning as a service” – just to throw a few buzzwords around). Once
machine learning has matured – and this toddler has already taken its
first steps – my next prediction is:
Machine learning will fuel a lot of things.
Based on the principle “Whatever can be automated will be auto-
mated”, I conclude that whenever possible, tasks will be formulated
as prediction problems and solved with machine learning. Machine
learning is a form of automation or can at least be part of it. Many
tasks currently performed by humans are replaced by machine learn-
ing. Here are some examples of tasks where machine learning is used
to automate parts of it:
• Sorting / decision-making / completion of documents (e.g. in insur-
ance companies, the legal sector or consulting firms)
• Data-driven decisions such as credit applications
• Drug discovery
• Quality controls in assembly lines
• Self-driving cars
• Diagnosis of diseases
• Translation. For this book, I used a translation service called DeepL2, powered by deep neural networks, to improve my sentences by translating them from English into German and back into English.
• …
The breakthrough for machine learning is not only achieved through
better computers / more data / better software, but also:
Interpretability tools catalyze the adoption of machine learning.
Based on the premise that the goal of a machine learning model
can never be perfectly specified, it follows that interpretable machine
2. https://deepl.com
learning is necessary to close the gap between the misspecified and the
actual goal. In many areas and sectors, interpretability will be the cat-
alyst for the adoption of machine learning. Some anecdotal evidence:
Many people I have spoken to do not use machine learning because
they cannot explain the models to others. I believe that interpretabil-
ity will address this issue and make machine learning attractive to or-
ganisations and people who demand some transparency. In addition
to the misspecification of the problem, many industries require inter-
pretability, be it for legal reasons, due to risk aversion or to gain in-
sight into the underlying task. Machine learning automates the mod-
eling process and moves the human a bit further away from the data
and the underlying task: This increases the risk of problems with ex-
perimental design, choice of training distribution, sampling, data en-
coding, feature engineering, and so on. Interpretation tools make it
easier to identify these problems.
What I am telling you here is actually nothing new. So why switch
from analyzing assumption-based, transparent models to analyzing
assumption-free black box models? Because making all these assump-
tions is problematic: They are usually wrong (unless you believe that
most of the world follows a Gaussian distribution), difficult to check,
very inflexible and hard to automate. In many domains, assumption-
based models typically have a worse predictive performance on un-
touched test data than black box machine learning models. This is only
true for big datasets, since interpretable models with good assump-
tions often perform better with small datasets than black box mod-
els. The black box machine learning approach requires a lot of data to
work well. With the digitization of everything, we will have ever big-
ger datasets and therefore the approach of machine learning becomes
more attractive. We do not make assumptions; we approximate reality as closely as possible (while avoiding overfitting the training data).
I argue that we should develop all the tools that we have in statistics
to answer questions (hypothesis tests, correlation measures, interac-
tion measures, visualization tools, confidence intervals, p-values, pre-
diction intervals, probability distributions) and rewrite them for black
box models. In a way, this is already happening:
• Let us take a classical linear model: The standardized regression co-
efficient is already a feature importance measure. With the permu-
tation feature importance measure, we have a tool that works with
any model.
• In a linear model, the coefficient measures the effect of a single fea-
ture on the predicted outcome. The generalized version of this is the
partial dependence plot.
• Test whether A or B is better: For this we can also use partial depen-
dence functions. What we do not have yet (to the best of my knowledge) are statistical tests for arbitrary black box models.
The data scientists will automate themselves.
I believe that data scientists will eventually automate themselves out
of the job for many analysis and prediction tasks. For this to happen,
the tasks must be well-defined and there must be some processes
and routines around them. Today, these routines and processes are
missing, but data scientists and colleagues are working on them. As
machine learning becomes an integral part of many industries and in-
stitutions, many of the tasks will be automated.
Robots and programs will explain themselves.
We need more intuitive interfaces to machines and programs that
make heavy use of machine learning. Some examples: A self-driving
car that reports why it stopped abruptly (“70% probability that a kid
will cross the road”); A credit default program that explains to a bank
employee why a credit application was rejected (“Applicant has too
many credit cards and is employed in an unstable job.”); A robot arm
that explains why it moved the item from the conveyor belt into the
trash bin (“The item has a craze at the bottom.”).
Interpretability could boost machine intelligence research.
I can imagine that by doing more research on how programs and ma-
chines can explain themselves, we can improve our understanding of
intelligence and become better at creating intelligent machines.
In the end, all these predictions are speculations and we have to see
what the future really brings. Form your own opinion and continue
learning!
11
Contribute to the Book
This book is continuously improved. If you find mistakes or would like to suggest improvements, you are welcome to contribute via the book's GitHub repository1 or its issue tracker2.

1. https://github.com/christophM/interpretable-ml-book
2. https://github.com/christophM/interpretable-ml-book/issues
12
Citing this Book
If you found this book useful for your blog post, research article or
product, I would be grateful if you would cite this book. You can cite
the book like this:
Molnar, Christoph. "Interpretable machine learning.
A Guide for Making Black Box Models Explainable", 2019.
https://christophm.github.io/interpretable-ml-book/.
1. mailto:christoph.molnar.ai@gmail.com
13
Translations
Japanese
• Complete translation9 by Ryuji Masui and team HACARUS.
Korean
• Complete Korean translation10 by TooTouch11
• Partial Korean translation12 by An Subin13
Spanish
• Full Spanish translation14 by Federico Fliguer15
Vietnamese
• A complete translation16 by Giang Nguyen, Duy-Tung Nguyen,
Hung-Quang Nguyen, Tri Le and Hoang Nguyen.
If you know of any other translation of the book or of individual chap-
ters, I would be grateful to hear about it and list it here. You can reach
me via email: christoph.molnar.ai@gmail.com17 .
9. https://hacarus.github.io/interpretable-ml-book-ja/index.html
10. https://tootouch.github.io/IML/taxonomy_of_interpretability_methods/
11. https://tootouch.github.io/
12. https://subinium.github.io/IML/
13. https://subinium.github.io/
14. https://fedefliguer.github.io/AAI/
15. https://fedefliguer.github.io/
16. https://github.com/giangnguyen2412/InterpretableMLBook-Vietnamese
17. mailto:christoph.molnar.ai@gmail.com
14
Acknowledgements
Writing this book was (and still is) a lot of fun. But it is also a lot of
work and I am very happy about the support I received.
My biggest thank-you goes to Katrin who had the hardest job in terms
of hours and effort: she proofread the book from beginning to end and
discovered many spelling mistakes and inconsistencies that I would
never have found. I am very grateful for her support.
A big thanks goes to Verena Haunschmid for writing the section about
LIME explanations for images. She works in data science and I recom-
mend following her on Twitter: @ExpectAPatronum1 . I also want to
thank all the early readers who contributed corrections2 on Github!
Furthermore, I want to thank everyone who created illustrations: The
cover was designed by my friend @YvonneDoinel3 . The graphics in
the Shapley Value chapter were created by Heidi Seibold4 , as well
as the turtle example in the adversarial examples chapter. Verena
Haunschmid created the graphic in the RuleFit chapter.
I would also like to thank my wife and family who have always sup-
ported me. My wife, in particular, had to listen to me ramble on about
the book a lot. She helped me make many decisions around the writing
of the book.
In at least three aspects, the way I published this book is unconven-
tional. First, it is available both as a website and as an ebook/PDF. The soft-
ware I used to create this book is called bookdown, written by Yihui Xie5 ,
1. https://twitter.com/ExpectAPatronum
2. https://github.com/christophM/interpretable-ml-book/graphs/contributors
3. https://twitter.com/YvonneDoinel
4. https://twitter.com/HeidiBaya
5. https://yihui.name/
who created many R packages that make it easy to combine R code and
text. Thanks a lot! Secondly, I self-publish the book on the platform
Leanpub6 , instead of working with a traditional publisher. And third,
I published the book as an in-progress book, which has helped me enormously to get feedback and to monetize it along the way. Many thanks to Leanpub for making this possible and handling the royalties fairly. I
would also like to thank you, dear reader, for reading this book with-
out a big publisher name behind it.
I am grateful for the funding of my research on interpretable machine
learning by the Bavarian State Ministry of Science and the Arts in the
framework of the Centre Digitisation.Bavaria (ZD.B) and by the Bavar-
ian Research Institute for Digital Transformation (bidt).
6. https://leanpub.com/
References
R Packages Used
arules. Michael Hahsler, Christian Buchta, Bettina Gruen and Kurt Hornik (2021). arules: Mining Association Rules and Frequent Itemsets. R package version 1.7-2. https://CRAN.R-project.org/package=arules

bookdown. Yihui Xie (2021). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.24. https://CRAN.R-project.org/package=bookdown

Cairo. Simon Urbanek and Jeffrey Horner (2020). Cairo: R Graphics Device using Cairo Graphics Library for Creating High-Quality Bitmap (PNG, JPEG, TIFF), Vector (PDF, SVG, PostScript) and Display (X11 and Win32) Output. R package version 1.5-12.2. https://CRAN.R-project.org/package=Cairo

caret. Max Kuhn (2021). caret: Classification and Regression Training. R package version 6.0-90. https://CRAN.R-project.org/package=caret

data.table. Matt Dowle and Arun Srinivasan (2021). data.table: Extension of data.frame. R package version 1.14.2. https://CRAN.R-project.org/package=data.table

devtools. Hadley Wickham, Jim Hester, Winston Chang and Jennifer Bryan (2021). devtools: Tools to Make Developing R Packages Easier. R package version 2.4.3. https://CRAN.R-project.org/package=devtools

dplyr. Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.7. https://CRAN.R-project.org/package=dplyr

DT. Yihui Xie, Joe Cheng and Xianying Tan (2021). DT: A Wrapper of the JavaScript Library 'DataTables'. R package version 0.20. https://CRAN.R-project.org/package=DT