SHAP

Christoph Molnar

Contents
1 Preface
2 Introduction
2.1 Interpreting to debug
2.2 Users may create their own interpretations
2.3 Building trust in your models
2.4 The limitations of inherently interpretable models
2.5 Model-agnostic interpretation is the answer
2.6 SHAP: An explainable AI technique
6 Estimating SHAP Values
6.1 Estimating SHAP values with Monte Carlo integration
6.2 Computing all coalitions, if possible
6.3 Handling large numbers of coalitions
6.4 Estimation through permutation
6.5 Overview of SHAP estimators
6.6 From estimators to explainers
10.2 Computing SHAP values
10.3 SHAP values have a "global" component
10.4 SHAP values are different from a "what-if" analysis
15.7 How SHAP interacts with text-to-text models
15.8 Explaining a text-to-text model
15.9 Other text-to-text tasks
20.4 Model valuation in ensembles
20.5 Federated learning
20.6 And many more
21 Acknowledgments
References
Appendices
A SHAP Estimators
A.1 Exact Estimation: Computing all the coalitions
A.2 Sampling Estimator: Sampling the coalitions
A.3 Permutation Estimator: Sampling permutations
A.4 Linear Estimator: For linear models
A.5 Additive Estimator: For additive models
A.6 Kernel Estimator: The deprecated original
A.7 Tree Estimator: Designed for tree-based models
A.8 Tree-path-dependent Estimator
A.9 Gradient Estimator: For gradient-based models
A.10 Deep Estimator: For neural networks
A.11 Partition Estimator: For hierarchically grouped data
1 Preface
In my first book, “Interpretable Machine Learning,” I overlooked the inclusion
of SHAP. I conducted a Twitter survey to determine the most frequently used
methods for interpreting machine learning models. Options included LIME, per-
mutation feature importance, partial dependence plots, and “Other.” SHAP was
not an option.
To my surprise, the majority of respondents selected “Other,” with many com-
ments highlighting the absence of SHAP. Although I was aware of SHAP at that
time, I underestimated its popularity in machine learning explainability.
This popularity was a double-edged sword. My PhD research on interpretable
machine learning was centered around partial dependence plots and permutation
feature importance. On multiple occasions, when submitting a paper to a con-
ference, we were advised to focus on SHAP or LIME instead. This advice was
misguided because we should make progress for all interpretation methods, not
just SHAP, but it underscores the popularity of SHAP.
SHAP has been subjected to its fair share of criticism: it’s costly to compute,
challenging to interpret, and overhyped. I agree with some of these criticisms. In
the realm of interpretable machine learning, there’s no perfect method; we must
learn to work within constraints, which this book also addresses. However, SHAP
excels in many areas: it can work with any model, it’s modular in building global
interpretations, and it has a vast ecosystem of SHAP adaptations.
As you can see, my relationship with SHAP is a mix of admiration and frustra-
tion – perhaps a balanced standpoint for writing about SHAP. I don’t intend to
overhype it, but I believe it’s a beneficial tool worth understanding.
2 Introduction
Machine learning models are powerful tools, but their lack of interpretability is
a challenge. It’s often unclear why a certain prediction was made, what the
most important features were, and how the features influenced the predictions in
general. Many people argue that as long as a machine learning model performs
well, interpretability is unnecessary. However, there are many practical reasons
why you need interpretability, ranging from debugging to building trust in your
model.
Interpretability
Figure 2.1: Asthma increases the likelihood of pneumonia. However, in the study,
asthma also increased the (preemptive) use of antibiotics which gener-
ally protects against pneumonia and led to an overall lower pneumonia
risk for asthma patients.
No explicit rule stating "asthma ⇒ lower risk" would be apparent. Instead, the network
would learn this rule and conceal it, potentially causing harm if deployed in real-
life situations.
Although you could theoretically spot the problem by closely examining the data
and applying domain knowledge, it’s generally easier to identify such issues if
you can understand what the model has learned. Machine learning models that
aren’t interpretable create a distance between the data and the modeler, and
interpretability methods help bridge this gap.
is likely to develop sepsis (Elish and Watkins 2020). If the model detects a po-
tential sepsis case, it triggers an alert that initiates a new hospital protocol for
diagnosis and treatment. This protocol involves a rapid response team (RRT)
nurse who monitors the alarms and informs the doctors, who then treat the pa-
tient. Numerous aspects of the implementation warrant discussion, especially the
social implications of the new workflow, such as the hospital hierarchy causing
nurses to feel uncomfortable instructing doctors. There was also considerable
repair work carried out by RRT nurses to adapt the new system to the hospital
environment. Interestingly, the report noted that the deep learning system didn’t
provide explanations for warnings, leaving it unclear why a patient was predicted
to develop sepsis. The software merely displayed the score, resulting in occasional
discrepancies between the model score and the doctor’s diagnosis. Doctors would
consequently ask nurses what they were observing that the doctors were not. The
patient didn’t seem septic, so why were they viewed as high-risk? However, the
nurse only had access to the scores and some patient data, leading to a discon-
nect. Feeling responsible for explaining the model outputs, RRT nurses collected
context from patient charts to provide an explanation. One nurse assumed the
model was keying in on specific words in the medical record, which wasn’t the
case. The model wasn’t trained on text. Another nurse also formed incorrect
assumptions about the influence of lab values on the sepsis score. While these
misunderstandings didn’t hinder tool usage, they underscore an intriguing issue
with the lack of interpretability: users may devise their own interpretations when
none are provided.
explanations alongside model scores to facilitate others’ engagement with the
predictions.
to inferior performance. This inferior performance could directly result in fewer
sales, increased churn, or more false negative sepsis predictions.
So, what's the solution? Model-agnostic interpretation methods, which interact with any model through the same four steps:
• Sampling data.
• Intervention on the data.
• Prediction step.
• Aggregating the results.
Various methods operate under the SIPA framework (Molnar 2022), including:
• Partial dependence plots, which illustrate how altering one (or two) features
changes the average prediction.
• Individual conditional expectation curves, which perform the same function
for a single data point.
• Accumulated Local Effect Plots, an alternative to partial dependence plots.
1
The terms “Explainable AI” and “interpretable machine learning” are used interchangeably
in this book. Some people use XAI more for post-hoc explanations of predictions and inter-
pretable ML for inherently interpretable models. However, when searching for a particular
method, it’s advisable to use both terms.
• Permutation Feature Importance, quantifying a feature’s importance for
accurate predictions.
• Local interpretable model-agnostic explanations (LIME), explaining predic-
tions with local linear models (Ribeiro et al. 2016).
Tip
Even if you use an interpretable model, this book can be of assistance. Meth-
ods like SHAP can be applied to any model, so even if you’re using a decision
tree, SHAP can provide additional interpretation.
Given its wide range of applications, you are likely to find a use for SHAP in your
work.
Before we talk about the practical application of SHAP, let’s begin with its his-
torical background, which provides context for the subsequent theory chapters.
3 A Short History of Shapley
Values and SHAP
This chapter offers an overview of the history of SHAP and Shapley values, focus-
ing on their chronological development. The history is divided into three parts,
each highlighted by a milestone:
Lloyd Shapley's 1953 paper "A Value for n-Person Games" (Shapley et al. 1953) introduced Shapley values. In 2012, Lloyd Shapley
and Alvin Roth were awarded the Nobel Prize in Economics for their work in
"market design" and "matching theory."
Shapley values serve as a solution in cooperative game theory, which deals with
games where players cooperate to achieve a payout. They address the issue of a
group of players participating in a collaborative game, where they work together
to reach a certain payout. The payout of the game needs to be distributed
among the players, who may have contributed differently. Shapley values provide
a mathematical method of fairly dividing the payout among the players.
Shapley values have since become a cornerstone of coalitional game theory, with
applications in various fields such as political science, economics, and computer
science. They are frequently used to determine fair and efficient strategies for
resource distribution within a group, including dividing profits among sharehold-
ers, allocating costs among collaborators, and assigning credit to contributors in
a research project. However, Shapley values were not yet employed in machine
learning, which was still in its early stages at the time.
In 2014, Štrumbelj and Kononenko further developed their methodology for computing Shapley values
(Štrumbelj and Kononenko 2014).
However, this approach did not immediately gain popularity. Some possible rea-
sons why Shapley values were not widely adopted at the time include:
Next, we will look at the events that led to the rise of Shapley values in machine
learning.
3
The name NIPS faced criticism due to its association with “nipples” and its derogatory usage
against Japanese individuals, leading to its change to NeurIPS.
Lundberg and Lee presented a new way to estimate SHAP values using a weighted
linear regression with a kernel function to weight the data points. The paper also
demonstrated how their proposed estimation method could integrate other expla-
nation techniques, such as DeepLIFT (Shrikumar et al. 2017), LIME (Ribeiro et
al. 2016), and Layer-Wise Relevance Propagation (Bach et al. 2015).
Here’s why I believe SHAP gained popularity:
• It was published in a reputable venue (NIPS/NeurIPS).
• It was a pioneering work in a rapidly growing field.
• Ongoing research by the original authors and others contributed to its de-
velopment.
• The open-source shap Python package with a wide range of features and
plotting capabilities
The availability of open-source code played a significant role, as it enabled people
to integrate SHAP values into their projects.
Naming conventions
• Both the method and the resulting numbers can be referred to as Shap-
ley values (and SHAP values).
• Lundberg and Lee (2017b) renamed Shapley values for machine learn-
ing as SHAP, an acronym for SHapley Additive exPlanations.
• In this book, I distinguish between the general game-theoretic method of Shapley values and the specific machine learning application of SHAP.
Since its inception, SHAP’s popularity has steadily increased. A significant mile-
stone was reached in 2020 when Lundberg et al. (2020) proposed an efficient
computation method specifically for SHAP, targeting tree-based models. This
advancement was crucial because tree-boosting excels in many applications, en-
abling rapid estimation of SHAP values for state-of-the-art models. Another
remarkable achievement by Lundberg involved extending SHAP beyond individ-
ual predictions. He stacked SHAP values, similar to assembling Legos, to cre-
ate global model interpretations. This method was made possible by the fast
computation designed for tree-based models. Thanks to numerous contributors,
Lundberg continued to enhance the shap package, transforming it into a com-
prehensive library with a wide range of estimators and functionalities. Besides
Lundberg’s work, other researchers have also contributed to SHAP, proposing
extensions. Moreover, SHAP has been implemented in other contexts, indicating
that the shap package is not the only source of this method.
Given this historical context, we will begin with the theory of Shapley values and
gradually progress to SHAP.
4 Theory of Shapley Values
To learn about SHAP, we first discuss the theory behind Shapley values from
game theory. We will progressively define a fair payout in a coalition of players and ultimately arrive at Shapley values (spoiler alert). (There is no perfect definition of fairness that everyone would agree upon; Shapley values define a very specific version of fairness, which can be seen as egalitarian.)
Alice, Bob, and Charlie share a taxi; Alice's ride alone costs $15. Adding Bob to Alice's ride increases the cost to $25, as he insists on a more spacious, luxurious taxi, adding a flat $10 to the
ride costs. Adding Charlie to Alice and Bob's ride increases the cost to $51 since
Charlie lives somewhat further away. We define the taxi ride costs for all possible
combinations and compile the following table:

| Coalition | Cost |
|---|---|
| ∅ | $0 |
| {Alice} | $15 |
| {Bob} | $25 |
| {Charlie} | $38 |
| {Alice, Bob} | $25 |
| {Alice, Charlie} | $41 |
| {Bob, Charlie} | $51 |
| {Alice, Bob, Charlie} | $51 |
The coalition ∅ is a coalition without any players in it, i.e., an empty taxi. This
table seems like a step in the right direction, giving us an initial idea of how much
each person contributes to the cost of the ride.
Marginal contribution

The marginal contribution of a player to a coalition is the value of the coalition including the player minus the value of the coalition without the player.
Using the table, we can easily calculate the marginal contributions. Taking an
example, if we compare the cost between the {Alice, Bob} coalition and Bob
alone, we derive the marginal contribution of Alice, the “player”, to the coalition
{Bob}. In this scenario, it’s $25 - $25 = $0, as the taxi ride cost remains the
same. If we calculate the marginal contribution of Bob to the {Alice} coalition,
we get $25 - $15 = $10, meaning adding Bob to a taxi ride with Alice increases
the cost by $10. We calculate all possible marginal contributions in this way:
We’re one step closer to calculating a fair share of ride costs. Could we just
average these marginal contributions per passenger? We could, but that would
assign equal weight to every marginal contribution. However, one could argue
that we learn more about how much Alice should pay when we add her to an
empty taxi compared to when we add her to a ride with Bob. But how much
more informative?
One way to answer this question is by considering all possible permutations of
Alice, Bob, and Charlie. There are 3! = 3 ∗ 2 ∗ 1 = 6 possible permutations of
passengers:
• Alice, Bob, Charlie
• Alice, Charlie, Bob
• Bob, Alice, Charlie
• Charlie, Alice, Bob
• Bob, Charlie, Alice
• Charlie, Bob, Alice
We can use these permutations to form coalitions, for example, for Alice. Each
permutation then maps to a coalition: People who come before Alice in the
order are in the coalition, people after are not. Since in a coalition the order
of passengers doesn’t matter, some coalitions will occur more often than others
when we iterate through all permutations like this: In 2 out of 6 permutations,
Alice is added to an empty taxi; In 1 out of 6, she is added to a taxi with Bob; In
1 out of 6, she is added to a taxi with Charlie; And in 2 out of 6, she is added to
a taxi with both Bob and Charlie. We use these counts to weight each marginal
contribution to continue our journey towards a fair cost sharing.
We could make different decisions regarding how to “fairly” allocate the costs
to the passengers. For instance, we could weight the marginal contributions
differently. We could divide the cost by 3. Alternatively, we could use solutions
that depend on the order of passengers: Alice alone would pay $15, when we add
Bob it’s +$10, which would be his share, and Charlie would pay the remainder.
However, all these different choices would lead us away from Shapley values.
Weighting each marginal contribution by how often its coalition occurs among the 6 permutations, we get for Alice:

$$\frac{1}{6} \Big( \underbrace{2 \cdot \$15}_{\text{A to } \emptyset} + \underbrace{1 \cdot \$0}_{\text{A to B}} + \underbrace{1 \cdot \$3}_{\text{A to C}} + \underbrace{2 \cdot \$0}_{\text{A to B,C}} \Big) = \$5.50$$

For Bob:

$$\frac{1}{6} \Big( \underbrace{2 \cdot \$25}_{\text{B to } \emptyset} + \underbrace{1 \cdot \$10}_{\text{B to A}} + \underbrace{1 \cdot \$13}_{\text{B to C}} + \underbrace{2 \cdot \$10}_{\text{B to A,C}} \Big) = \$15.50$$

And for Charlie:

$$\frac{1}{6} \Big( \underbrace{2 \cdot \$38}_{\text{C to } \emptyset} + \underbrace{1 \cdot \$26}_{\text{C to A}} + \underbrace{1 \cdot \$26}_{\text{C to B}} + \underbrace{2 \cdot \$26}_{\text{C to A,B}} \Big) = \$30.00$$
The individual contributions sum to the total cost: $5.50 + $15.50 + $30.00 =
$51.00. Perfect! And that’s it, this is how we compute Shapley values (Shapley
et al. 1953).
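To make this concrete, here is a short brute-force computation of the three Shapley values in Python; it simply encodes the coalition costs from the table above and averages marginal contributions over all six orderings (a small illustration, not part of any library):

from itertools import permutations

# Taxi costs for every coalition (the value function)
v = {
    frozenset(): 0,
    frozenset({"Alice"}): 15,
    frozenset({"Bob"}): 25,
    frozenset({"Charlie"}): 38,
    frozenset({"Alice", "Bob"}): 25,
    frozenset({"Alice", "Charlie"}): 41,
    frozenset({"Bob", "Charlie"}): 51,
    frozenset({"Alice", "Bob", "Charlie"}): 51,
}
players = ["Alice", "Bob", "Charlie"]
shapley = {p: 0.0 for p in players}

# Average each player's marginal contribution over all 3! = 6 orderings
for order in permutations(players):
    coalition = frozenset()
    for player in order:
        shapley[player] += (v[coalition | {player}] - v[coalition]) / 6
        coalition = coalition | {player}

print({p: round(value, 2) for p, value in shapley.items()})
# {'Alice': 5.5, 'Bob': 15.5, 'Charlie': 30.0}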
Let’s formalize the taxi example in terms of game theory and explore the Shapley
value theory, which makes Shapley values a unique solution.
| Term | Math Term | Taxi Example |
|---|---|---|
| Shapley value | $\phi_j$ | For example, $\phi_1 = \$5.50$ for Alice, $\phi_2 = \$15.50$ for Bob, and $\phi_3 = \$30.00$ for Charlie. |
4.5 The axioms behind Shapley values
We now have the formula, but where did it come from? Lloyd Shapley derived it
(Shapley et al. 1953), but it didn’t just materialize out of thin air. He proposed
axioms defining what a fair distribution could look like, and from these axioms,
he derived the formula. Lloyd Shapley also proved that based on these axioms,
the Shapley value formula yields a unique solution.
Let’s discuss these axioms, namely Efficiency, Symmetry, Dummy, and Ad-
ditivity. An axiom is a statement accepted as self-evidently true. Consider the
axioms as defining fairness when it comes to payouts in team play.
4.5.1 Efficiency
The efficiency axiom states that the sum of the contributions must precisely add
up to the payout. This makes a lot of sense. Consider Alice, Bob, and Charlie
sharing a taxi ride and calculating their individual shares, but the contributions
don’t equal the total taxi fare. All three, including the taxi driver, would find
this method useless. The efficiency axiom can be expressed formally as:
$$\sum_{j \in N} \phi_j = v(N)$$
4.5.2 Symmetry
The symmetry principle states that if two players are identical, they should receive
equal contributions. Identical means that all their marginal contributions are the
same. For instance, if Bob wouldn’t need the luxury version of the taxi, his
marginal contributions would be exactly the same as Alice’s. The symmetry
axiom says that in such situations, both should pay the same amount, which
seems fair.
We can also express symmetry mathematically for two players 𝑗 and 𝑘:
If 𝑣(𝑆 ∪ {𝑗}) = 𝑣(𝑆 ∪ {𝑘}) for all 𝑆 ⊆ 𝑁 \{𝑗, 𝑘}, then 𝜙𝑗 = 𝜙𝑘 .
4.5.3 Dummy or Null Player
The Shapley value for a player who doesn’t contribute to any coalition is zero,
which seems quite fair. Let’s introduce Dora, Charlie’s dog, and consider her an
additional player. Assuming there’s no extra cost for including Dora in any ride,
all of Dora’s marginal contributions would be $0. The dummy axiom states that
when all marginal contributions are zero, the Shapley value should also be zero.
This rule seems reasonable, especially as Dora doesn’t have any money.
To express this axiom formally:
If 𝑣(𝑆 ∪ {𝑗}) = 𝑣(𝑆) for all 𝑆 ⊆ 𝑁 \{𝑗}, then 𝜙𝑗 = 0.
4.5.4 Additivity
In a game with two value functions $v_1$ and $v_2$, the Shapley values for the sum of the games can be expressed as the sum of the Shapley values:

$$\phi_j(v_1 + v_2) = \phi_j(v_1) + \phi_j(v_2)$$
Imagine Alice, Bob, and Charlie not only sharing a taxi but also going out for
ice cream. Their goal is to fairly divide not just the taxi costs, but both the taxi
and ice cream costs. The additivity axiom suggests that they could first calculate
each person’s fair share of the ice cream costs, then the taxi costs, and add them
up per person.
These four¹ axioms ensure the uniqueness of the Shapley values, indicating there's
only one solution presented in the Shapley formula, Equation 4.1. The proof of
why this is the case won’t be discussed in this book, as it would be too detailed.
Instead, it’s time to relate this approach to explaining machine learning predic-
tions.
1
A fifth axiom called Linearity or Marginality exists, but it can be derived from the other
axioms, so it doesn’t introduce any new requirements for fair payouts.
5 From Shapley Values to SHAP
We have been learning about Shapley values from coalitional game theory. But
how do these values connect to machine learning explanations? The connection
might not seem apparent – it certainly didn’t to me when I first learned about
SHAP.
Figure 5.1: The predicted price for a 50 𝑚2 2nd floor apartment with a nearby
park and cat ban is €300,000. Our goal is to explain how each of
these feature values contributed to the prediction.
| Concept | Machine Learning Term | Math |
|---|---|---|
| Total payout | Prediction for $x^{(i)}$ minus average prediction | $f(x^{(i)}) - \mathbb{E}(f(X))$ |
| Value function | Prediction for feature values in coalition S minus expected prediction | $v_{f,x^{(i)}}(S)$ |
| SHAP value | Contribution of feature $j$ towards the payout | $\phi_j^{(i)}$ |
You may have questions about these terms, but we will discuss them shortly. The
value function is central to SHAP, and we will discuss it in detail. This function
is closely related to the simulation of absent features.
$$v_{f,x^{(i)}}(S) = \int f(x_S^{(i)} \cup X_C) \, d\mathbb{P}_{X_C} - \mathbb{E}(f(X))$$
Note

The value function relies on a specific model $f$ and a particular data point to be explained $x^{(i)}$, and maps a coalition $S$ to its value. Although the correct notation is $v_{f,x^{(i)}}(S)$, I will occasionally use $v(S)$ for brevity. Another misuse of notation: I use the union operator for the feature vector: $x_S^{(i)} \cup X_C$ is a feature vector $\in \mathbb{R}^p$ where values at positions $S$ come from $x_S^{(i)}$ and the rest are random variables from $X_C$.
This function provides an answer for the simulation of absent features. The second
part 𝔼(𝑓(𝑋)) is straightforward: It ensures the value of an empty coalition 𝑣(∅)
equals 0.
Confirm this for yourself:
$$v(\emptyset) = \int f(X_1, \ldots, X_p) \, d\mathbb{P}_X - \mathbb{E}_X(f(X)) \qquad (5.1)$$

The first part of the value function, $\int f(x_S^{(i)} \cup X_C) \, d\mathbb{P}_{X_C}$, is where the magic occurs.
The model prediction function 𝑓, which is central to the value function, takes the
feature vector 𝑥(𝑖) ∈ ℝ𝑝 as input and generates the prediction ∈ ℝ. However, we
only know the features in set 𝑆, so we need to account for the features not in 𝑆,
which we index with 𝐶.
SHAP’s approach is to treat the unknown features as random variables and inte-
grate over their distribution. This concept of integrating over the distribution of
a random variable is called marginalization.
Marginalization
This means that we can input “known” features directly into the model 𝑓, while
absent features are treated as random variables. In mathematical terms, I distin-
guish a random variable from an observed value by capitalizing it:
| Park | Cat | Area | Floor | Predicted Price |
|---|---|---|---|---|
| Nearby | Banned | 50 | 2nd | €300,000 |
Informally, the value function for the coalition of park and floor is the expected prediction when the park and floor features are fixed to this apartment's values and the absent features (cat and area) are marginalized out, minus the average prediction. More generally, the marginal contribution of a feature $j$ to a coalition $S$ is the difference between two such value functions:

$$
\begin{aligned}
v(S \cup \{j\}) - v(S) &= \int f(x_{S \cup \{j\}}^{(i)} \cup X_{C \setminus \{j\}}) \, d\mathbb{P}_{X_{C \setminus \{j\}}} - \mathbb{E}(f(X)) - \left( \int f(x_S^{(i)} \cup X_C) \, d\mathbb{P}_{X_C} - \mathbb{E}(f(X)) \right) \\
&= \int f(x_{S \cup \{j\}}^{(i)} \cup X_{C \setminus \{j\}}) \, d\mathbb{P}_{X_{C \setminus \{j\}}} - \int f(x_S^{(i)} \cup X_C) \, d\mathbb{P}_{X_C}
\end{aligned}
$$
For instance, the contribution of ‘cat’ to a coalition of {park, floor} would be:
$$v(\{\text{cat, park, floor}\}) - v(\{\text{park, floor}\})$$
The resulting marginal contribution describes the change in the value of the
coalition {park, floor} when the ‘cat’ feature is included. Another way to interpret
the marginal contribution is that present features are known, absent feature values
are unknown, so the marginal contribution illustrates how much the value changes
from knowing 𝑗 in addition to already knowing 𝑆.
The SHAP value $\phi_j^{(i)}$ of a feature value is the average marginal contribution of a feature value $x_j^{(i)}$ to all possible coalitions of features:

$$\phi_j^{(i)} = \sum_{S \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \left( v(S \cup \{j\}) - v(S) \right)$$

And that concludes it. This formula is similar to the one in the Shapley Theory Chapter, but the value function is adapted to explain a machine learning prediction. The formula, once again, is an average of marginal contributions, each contribution being weighted based on the size of the coalition.
Because SHAP follows the principles of Efficiency, Symmetry, Dummy, and Additivity, we can deduce how to interpret SHAP values, or at least obtain a preliminary understanding. Let's explore each axiom individually and determine their implications for the interpretation of SHAP values.
SHAP values must total to the difference between the prediction for 𝑥(𝑖) and the
expected prediction:
$$\sum_{j=1}^{p} \phi_j^{(i)} = f(x^{(i)}) - \mathbb{E}(f(X))$$
If two feature values j and k contribute equally to all possible coalitions, their
contributions should be equal.
Given

$$v(S \cup \{j\}) = v(S \cup \{k\})$$

for all

$$S \subseteq \{1, \ldots, p\} \setminus \{j, k\},$$

then

$$\phi_j^{(i)} = \phi_k^{(i)}$$
Implications: The symmetry axiom implies that the attribution shouldn’t depend
on any ordering of the features. If two features contribute equally, they will receive
the same SHAP value. Other methods, such as the breakdown method (Staniak
and Biecek 2018) or counterfactual explanations, violate the symmetry axiom
because two features can impact the prediction equally without receiving the same
attribution. For example, the breakdown method also computes attributions, but
does it by adding one feature at a time, so that the order by which features
are added matters for the explanation. Symmetry is essential for accurately
interpreting the order of SHAP values, for instance, when ranking features using
SHAP importance (sum of absolute SHAP values per feature).
A feature j that does not alter the predicted value, regardless of the coalition of
feature values it is added to, should have a SHAP value of 0.
Given

$$v(S \cup \{j\}) = v(S)$$

for all

$$S \subseteq \{1, \ldots, p\},$$

then

$$\phi_j^{(i)} = 0$$
Implications: The dummy axiom ensures that unused features by the model
receive a zero attribution. This is an obvious implication. For instance, if a
sparse linear regression model was trained, we can be sure that a feature with a
𝛽𝑗 = 0 will have a SHAP value of zero for all data points.
5.6.2 Additivity: Additive predictions correspond to additive SHAP values

For a game with combined payouts $v_1 + v_2$, the respective SHAP values are:

$$\phi_j^{(i)}(v_1) + \phi_j^{(i)}(v_2)$$
Note
An alternative formulation of the SHAP axioms exists where the Dummy
and Additivity axioms are replaced with a Linearity axiom; however, both
formulations eventually yield the SHAP values.
This chapter has provided theoretical SHAP values. However, we face a significant
problem: In practice, we lack a closed-form expression for 𝑓 and we are unaware
of the distributions of 𝑋𝐶 . This means we are unable to calculate the SHAP
values, but, fortunately, we can estimate them.
6 Estimating SHAP Values
In the previous chapter, we applied the Shapley value concepts from game theory
to machine learning. While exact Shapley values can be calculated for simple
games, SHAP values must be estimated for two reasons:
• The value function utilized by SHAP requires integration over the feature
distribution. However, since we only have data and lack knowledge of the
distributions, we must use estimation techniques like Monte Carlo integra-
tion.
• Machine learning models often possess many features. As the number of
coalitions increases exponentially with the number of features (2𝑝 ), it might
become too time-consuming to compute the marginal contributions of a
feature to all coalitions. Instead, we have to sample coalitions.
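For example, with 10 features there are already 2^10 = 1,024 coalitions; with 30 features, more than a billion.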
Let’s assume we have a limited number of features for which we can still iterate
through all coalitions. This allows us to focus on estimating the SHAP values
from data without sampling coalitions.
A coalition can be any subset of feature values, including the empty set and the set containing all feature
values of the instance. When features are not part of a coalition, the prediction
function still requires that we input some value. This problem was theoretically
solved by integrating the prediction function over the absent features. Now, let’s
explore how we can estimate this integral using our apartment example.
The following figure evaluates the marginal contribution of the cat-banned fea-
ture value when added to a coalition of park-nearby and area-50. To compute
the marginal contribution, we need two coalitions: {park-nearby, cat-banned,
area-50} and {park-nearby, area-50}. For the absent features, we would have to
integrate the prediction function over the distribution of floor, and floor + cat,
respectively.
However, we don’t have these distributions, so we resort to using Monte Carlo
integration.
Using Monte Carlo integration, we can estimate the value functions for our apart-
ment by sampling the absent features from our data and averaging the predictions.
In this case, the data are the other apartments. Sometimes, I’ll refer to this data
as background data.
Background data
The replacement of absent feature values with randomly drawn ones requires
a dataset to draw from, known as the background data. This could be the
same data that was used to train the model. The background data serves as
the context for the interpretation of the resulting SHAP values.
Let’s illustrate what sampling from the background data looks like by drawing
just one sample for the Monte Carlo integration. Although a single sample results
in a very unstable estimate of the integral, it helps us understand the concept.
Let’s say the randomly sampled apartment has the following characteristics:
| Park | Cat | Area | Floor | Predicted Price |
|---|---|---|---|---|
| Nearby | Allowed | 100 | 1st | €504,000 |
Then, we replace the floor-2nd value of the original apartment with the ran-
domly drawn floor-1st value. We then predict the price of the apartment with
this combination (€310,000), which is the value function for the first coalition,
v({park-nearby, cat-banned, area-50}).
Next, we replace cat-banned in the coalition with a random value of the cat
allowed/banned feature from the same apartment that we sampled. In essence,
we are estimating v({park-nearby, area-50}).
In this scenario, the replaced value was cat-allowed, but it could have been
cat-banned if we had drawn a different apartment. We predict the apartment
price for the coalition of park-nearby and area-50 to be €320,000. Therefore,
the marginal contribution of cat-banned is €310,000 - €320,000 = -€10,000.
This estimate is based on the values of a single, randomly drawn apartment that
served as a “donor” for the cat and floor feature values. This is not an optimal
estimate of the marginal contribution as it relies on only one Monte Carlo sample.
To obtain better estimates, we can repeat this sampling process and average the
marginal contributions.
Now, let’s get into the formalities. The value of a coalition of features 𝑆 is
estimated as:
$$\hat{v}(S) = \frac{1}{n} \sum_{k=1}^{n} \left( f(x_S^{(i)} \cup x_C^{(k)}) - f(x^{(k)}) \right)$$
Figure 6.1: One Monte Carlo sample to estimate the marginal contribution
of cat-banned to the prediction when added to the coalition of
park-nearby and area-50.
Here, $n$ is the number of data samples drawn from the data. The hat on $\hat{v}$ signifies that this is an estimate of the value function $v$.
The marginal contribution of a feature 𝑗 added to a coalition 𝑆 is given by:
$$\hat{\Delta}_{S,j} = \hat{v}(S \cup \{j\}) - \hat{v}(S) = \frac{1}{n} \sum_{k=1}^{n} \left( f(x_{S \cup \{j\}}^{(i)} \cup x_{C \setminus \{j\}}^{(k)}) - f(x_S^{(i)} \cup x_C^{(k)}) \right)$$
Monte Carlo integration allows us to replace the integral ∫ with a sum ∑ and the
distribution ℙ with data samples. I personally appreciate Monte Carlo because it
makes integrations over distributions more comprehensible. It not only enables us
to compute the integral for unknown distributions, but I also find the operation
of summing more intuitive than integration.
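To see what this estimator looks like in code, here is a plain NumPy sketch (the names are illustrative assumptions: f is any prediction function accepting a 2D array, x is the instance to explain as a 1D array, background is a 2D array of background data, S is a list of feature indices in the coalition, and j is the feature of interest):

import numpy as np

def marginal_contribution(f, x, background, S, j):
    """Monte Carlo estimate of the marginal contribution of feature j to
    coalition S: features in S (plus j) keep x's values, all remaining
    features are filled in from each background row ("donor")."""
    S = list(S)
    contributions = []
    for donor in background:
        with_j = donor.copy()
        with_j[S + [j]] = x[S + [j]]
        without_j = donor.copy()
        without_j[S] = x[S]
        contributions.append(f(with_j.reshape(1, -1))[0]
                             - f(without_j.reshape(1, -1))[0])
    return np.mean(contributions)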
6.2 Computing all coalitions, if possible
In the previous section, we discussed how to estimate the marginal contribution
using Monte Carlo integration. To calculate the actual SHAP value of a feature,
we need to estimate the marginal contributions for all possible coalitions.
Figure 6.2 shows all coalitions of feature values required to determine the exact
SHAP value for cat-banned.
Figure 6.2: All 8 coalitions needed for computing the exact SHAP value of the
cat-banned feature value.
• No feature values
• park-nearby
• area-50
• floor-2nd
• park-nearby and area-50
• park-nearby and floor-2nd
• area-50 and floor-2nd
• park-nearby, area-50, and floor-2nd.
For each coalition, we calculate the predicted apartment price with and without
the cat-banned feature value and derive the marginal contribution from the
difference. The exact SHAP value is the (weighted) average of these marginal
contributions. To generate a prediction from the machine learning model, we
replace the feature values of features not in a coalition with random feature
values from the apartment dataset. The SHAP value formula is:

$$\hat{\phi}_j^{(i)} = \sum_{S \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \, \hat{\Delta}_{S,j}$$

In practice, there are several estimators of SHAP values, which differ in:
• Speed
• Accuracy, typically as a trade-off with speed
• Applicability: some estimators are model-specific
The permutation estimator is a rather flexible and fast method that we will examine further. Suppose we sample the permutation (𝑥cat, 𝑥area, 𝑥park, 𝑥floor) for our apartment example. We then iterate through it forward, adding one feature at a time:
• Adding 𝑥cat to ∅
• Adding 𝑥area to {𝑥cat }
• Adding 𝑥park to {𝑥cat , 𝑥area }
• Adding 𝑥floor to {𝑥cat , 𝑥area , 𝑥park }
This is the forward generation. Next, we iterate backwards:
• Adding 𝑥floor to ∅
• Adding 𝑥park to {𝑥floor }
• Adding 𝑥area to {𝑥park , 𝑥floor }
• Adding 𝑥cat to {𝑥area , 𝑥park , 𝑥floor }
This approach only alters one feature at a time, reducing the number of model
calls as the first term of a marginal contribution transitions into the second term
of the subsequent one. For instance, the coalition {𝑥cat , 𝑥area } is used to calcu-
late the marginal contribution of 𝑥park to {𝑥cat , 𝑥area } and of 𝑥area to {𝑥cat }. We
estimate the marginal contribution using Monte Carlo integration. With each
forward and backward generation for a permutation, we get marginal contribu-
tions for multiple features, not just a single one. In fact, we get two marginal
contributions per feature for one permutation. By repeating the permutation
sampling, we get even more marginal contributions and therefore achieve more
accurate estimates. The more permutations we sample and iterate over, the more
marginal contributions are estimated, bringing the final SHAP estimates closer
to their theoretical value.
So, how do we transition from here, from marginal contributions based on per-
mutations, to SHAP values? Actually, the formula is simpler than the original
SHAP formula. The SHAP formula contains a complex fraction that we multiply
by the sum of the marginal contributions. However, with permutation estima-
tion, we don’t sum over coalitions but over permutations. If you recall from the
Theory Chapter, we justified the coalition weights by their frequency when listing
all possible coalitions. But SHAP values can also be defined via permutations.
Let $m$ denote the number of sampled permutations of the features, with $o(k)$ being the k-th permutation; then SHAP values can be estimated as follows:

$$\hat{\phi}_j^{(i)} = \frac{1}{m} \sum_{k=1}^{m} \hat{\Delta}_{o(k),j}$$
Now, let's explain $\hat{\Delta}_{o(k),j}$: We have permutation $o(k)$. In this k-th permutation, feature $j$ occupies a particular position. Assuming $o(k)$ is (𝑥cat, 𝑥area, 𝑥park, 𝑥floor) and $j$ is park, then $\hat{\Delta}_{o(k),j} = \hat{v}(\{\text{cat, area, park}\}) - \hat{v}(\{\text{cat, area}\})$. But what is $m$? If we want to sum over all permutations, then $m = p!$. However, the motivation
for permutation estimation was to avoid computing all possible coalitions or per-
mutations. The good news is that 𝑚 can be a number smaller than all possible
permutations, and you can use a sample of permutations with the above formula.
But since we perform forward and backward iterations, the formula looks like
this:
$$\hat{\phi}_j^{(i)} = \frac{1}{2m} \sum_{k=1}^{m} \left( \hat{\Delta}_{o(k),j} + \hat{\Delta}_{-o(k),j} \right)$$
The permutation procedure with forward and backward iterations, also known as
antithetic sampling, performs quite well compared to other SHAP value sampling
estimators (Mitchell et al. 2022). A simpler version would involve sampling
random permutations without the forward and backward steps. One advantage
of antithetic sampling is the reuse of resulting coalitions to save computation.
The permutation procedure has an additional benefit: it ensures that the effi-
ciency axiom is always satisfied, meaning when you add up the SHAP values,
they will exactly equal the prediction minus the average prediction. Estimation
methods relying on sampling coalitions only satisfy the efficiency axiom in ex-
pectation. However, individual SHAP values remain estimates, and the more
permutations you draw, the better these estimates will be. For a rough idea of
how many permutations you might need: the shap package defaults to 10.
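To make the forward-and-backward scheme concrete, here is a compact sketch of antithetic permutation sampling (an illustration, not the shap package's implementation; it assumes a vectorized prediction function f, the instance x as a 1D NumPy array, and a 2D NumPy background array):

import numpy as np

def permutation_shap(f, x, background, n_permutations=10, seed=0):
    """For each sampled permutation, features are switched from background
    values to x's values one at a time (forward), then again in reverse order
    (backward). Each switch yields one marginal contribution, estimated by
    Monte Carlo over the background data."""
    rng = np.random.default_rng(seed)
    p = len(x)
    phi = np.zeros(p)
    for _ in range(n_permutations):
        order = rng.permutation(p)
        for current_order in (order, order[::-1]):
            z = background.copy()       # start with all features "absent"
            prev_pred = f(z).mean()
            for j in current_order:
                z[:, j] = x[j]          # feature j joins the coalition
                new_pred = f(z).mean()
                phi[j] += new_pred - prev_pred
                prev_pred = new_pred
    return phi / (2 * n_permutations)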
6.5 Overview of SHAP estimators

| Method | Estimation | Model-specific? |
|---|---|---|
| Exact | Iterates through all background data and coalitions | Agnostic |
| Sampling | Samples coalitions | Agnostic |
| Permutation | Samples permutations | Agnostic |
| Linear | Exact estimation with linear model weights | Linear |
| Additive | Simplifies estimation based on the additive nature of the model (inspired by GAMs) | GAMs |
| Kernel | Locally weighted regression for sampled coalitions (inspired by LIME) | Agnostic |
| Tree, interventional | Recursively iterates tree paths | Tree-based |
| Tree, path-dependent | Recursively iterates hybrid paths | Tree-based |
| Gradient | Computes the output's gradient with respect to inputs (inspired by Input Gradient) | Gradient-based |
| Deep | Backpropagates SHAP values through network layers (inspired by DeepLIFT) | Neural networks |
| Partition | Recursive estimation based on feature hierarchy (inspired by Owen values) | Agnostic |
There are more estimation methods than those listed here. The selection shown
is based on the estimation methods available in the Python shap package.
Tip
The original SHAP paper (Lundberg and Lee 2017b) introduced the Ker-
nel method, which involves sampling coalitions and using a weighted linear
regression model to estimate SHAP values. The Kernel method “united”
SHAP with LIME and other prediction explanation techniques in machine
learning. However, the Kernel method is slow and has been superseded by
the permutation method.
7 SHAP for Linear Models
import pandas as pd

# Set the file URL and filename
url = 'https://archive.ics.uci.edu/ml/' \
      'machine-learning-databases/' \
      'wine-quality/winequality-white.csv'
file_name = 'wine.csv'

# Load the data from disk if available, otherwise download and cache it
try:
    wine = pd.read_csv(file_name)
except FileNotFoundError:
    print(f'Downloading {file_name} from {url}...')
    wine = pd.read_csv(url, sep=";")
    wine.to_csv(file_name, index=False)
    print('Download complete!')
As observed, the highest quality is 9 (out of 10), and the lowest is 3. The other
features have varying scales, but this is not an issue for SHAP values, as they
explain the prediction on the outcome’s scale.
7.2 Fitting a linear regression model
With the wine dataset in our hands, we aim to predict the quality of a wine
based on its physicochemical features. A linear model for one data instance is
represented as:
$$f(x^{(i)}) = \beta_0 + \beta_1 x_1^{(i)} + \ldots + \beta_p x_p^{(i)}$$

where $x^{(i)}$ is the instance for which we want to compute the contributions. Each $x_j^{(i)}$ is a feature value, with $j = 1, \ldots, p$. The $\beta_j$ is the weight in the linear regression model corresponding to feature $j$.
Before fitting the linear model, let’s divide the data into training and test sets.
We’ll now train the linear regression model using the scikit-learn package.
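A minimal way to set this up, assuming an 80/20 split (the additive-models chapter later uses the same split and random seed):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

y = wine['quality']
X = wine.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)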
Tip
shap can be used with all sklearn models.
model = LinearRegression()
model = model.fit(X_train, y_train)
How does the model perform? To evaluate, we calculate the mean absolute error
(MAE) on the test data.
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f}")
MAE: 0.59
This indicates that, on average, the prediction deviates by 0.59 from the actual
value.
Next, we aim to understand how the model generates predictions. How is the
predicted quality of a given wine related to its input features?
import numpy as np
coefs = pd.DataFrame({
    'feature': X.columns.values,
    'coefficient': np.round(model.coef_, 3)
})
print(coefs.to_markdown(index=False))
| feature | coefficient |
|---|---|
| fixed acidity | 0.046 |
| volatile acidity | -1.915 |
| citric acid | -0.061 |
| residual sugar | 0.071 |
| chlorides | -0.026 |
| free sulfur dioxide | 0.005 |
| total sulfur dioxide | -0 |
| density | -124.264 |
| pH | 0.601 |
| sulphates | 0.649 |
| alcohol | 0.229 |
Interpretation:
• For instance, increasing the fixed acidity of a wine by 1 unit raises the
predicted quality by 0.046.
• Increasing the density by 1 unit reduces the predicted quality by 124.264.
• Volatile acidity, citric acid, chlorides, total sulfur dioxide, and density neg-
atively affect the predicted quality.
7.5 Theory: SHAP for linear models
For linear regression models without interaction terms, the computation of SHAP values is straightforward, since the relation between each feature and the target is linear. The SHAP value $\phi_j^{(i)}$ of the j-th feature for the prediction $f(x^{(i)})$ of a linear regression model is:

$$\phi_j^{(i)} = \beta_j x_j^{(i)} - \mathbb{E}(\beta_j X_j) = \beta_j \left( x_j^{(i)} - \mathbb{E}(X_j) \right)$$

Summed over all features, the SHAP values add up to the difference between the prediction and the average prediction:

$$\sum_{j=1}^{p} \phi_j^{(i)} = \sum_{j=1}^{p} \left( \beta_j x_j^{(i)} - \mathbb{E}(\beta_j X_j) \right) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j^{(i)} - \left( \beta_0 + \sum_{j=1}^{p} \mathbb{E}(\beta_j X_j) \right) = f(x^{(i)}) - \mathbb{E}(f(X))$$

This is the average predicted value subtracted from the predicted value for data point $x^{(i)}$. Feature contributions can be negative. Now, let's apply these formulas to the wine quality prediction.
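Because the model is linear, these formulas can be verified by hand before reaching for the shap package. A small sketch, assuming the fitted model and the train/test split from above, with the training data as background:

import numpy as np

# phi_j = beta_j * (x_j - E(X_j)), with E(X_j) estimated by the training mean
phi_manual = model.coef_ * (X_test.values - X_train.values.mean(axis=0))

# Efficiency check: each row of SHAP values sums to f(x) minus the
# average prediction on the training (background) data
print(np.allclose(phi_manual.sum(axis=1),
                  model.predict(X_test) - model.predict(X_train).mean()))
# should print True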
7.6 Installing shap
Although the computation of SHAP for linear models is straightforward enough
to implement on our own, we’ll take the easier route and install the shap library,
which also provides extensive plotting functions and other utilities.
shap library

The shap library was developed by Scott Lundberg, the author of the SHAP paper (Lundberg and Lee 2017b) and many other SHAP-related papers. The initial commit was made on November 22nd, 2016. At the time of writing, the library has over 2000 commits. shap is open-source and hosted on GitHub, allowing public access and tracking of its progress. The repository has received over 19k stars and almost 3k forks. In terms of features, it's the most comprehensive library available for SHAP values. I believe that the shap library is the most widely-used implementation of SHAP values in machine learning.

You can find the shap repository at: https://github.com/slundberg/shap

Initial commit: https://github.com/slundberg/shap/tree/7673c7d0e147c1f9d3942b32ca2c0ba93fd37875
Like most Python packages, you can install shap using pip.
All examples in this book utilize shap version 0.42.0. To install this exact version,
execute the following command:
source venv/bin/activate
pip install shap==0.42.0
If you use conda

If you're using conda, you can install shap from the conda-forge channel:

conda install -c conda-forge shap
import shap
explainer = shap.LinearExplainer(model, X_train)
Note
While the model here is a LinearRegression model from the sklearn li-
brary, shap works with any model from sklearn as well as with other li-
braries such as xgboost and lightgbm. shap also works with custom predic-
tion functions, so it’s quite flexible!
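For example, a custom prediction function can be passed directly (a small illustration; predict_fn is just a name chosen here):

# Any callable that maps a 2D input to an array of predictions works
def predict_fn(data):
    return model.predict(data)

explainer_fn = shap.Explainer(predict_fn, X_train)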
The generic shap.Explainer interface automatically selects a suitable algorithm based on the model:

shap.Explainer(model, X_train)

<shap.explainers._linear.Linear at 0x292942d40>

Another method involves directly choosing the appropriate algorithm in the explainer:

shap.Explainer(model, X_train, algorithm='linear')

<shap.explainers._linear.Linear at 0x292942860>
To ultimately calculate SHAP values, we call the explainer with the data to be
explained.
shap_values = explainer(X_test)
Note
When constructing a prediction model, you divide the data into training and
testing sets to prevent overfitting and to achieve a fair evaluation. Although
the risk of overfitting doesn’t apply in the same way to SHAP, it’s consid-
ered best practice to use the training data for the Explainer (i.e., for the
background data) and compute explanations for new data. This separation
prevents a data point’s feature values from being “replaced” by its own val-
ues. It also means we calculate explanations for fresh data that the model
hasn’t previously encountered. However, I must confess that I haven’t seen
much research on dataset choices, so take this information with a pinch of
salt.
print(shap_values.values)
[[-0.03479769 0.00306381 -0.00545601 ... -0.06541621 0.05289943
0.08545841]
[-0.06234203 -0.45650842 0.00986986 ... 0.00066077 0.01395506
0.59691113]
[ 0.01570028 0.07965919 -0.00422994 ... 0.04871676 -0.05095221
0.36790245]
...
[-0.03938841 0.06051034 0.00680469 ... 0.04871676 -0.05095221
-0.250421 ]
[ 0.03406317 0.00306381 0.00067434 ... -0.07142321 0.02044579
-0.29622273]
[-0.00266262 0.13710572 -0.00422994 ... 0.18087073 -0.0249893
-0.13591665]]
However, having only the raw SHAP values isn’t particularly useful. The true
power of the shap library lies in its various visualization capabilities.
shap.plots.waterfall(shap_values[0])
[Waterfall plot for the first wine in the test data: f(x) = 6.372]
Interpretation: The predicted value of 6.37 for instance 0 differs from the average
prediction of 5.89 by 0.48.
The sum of all SHAP values equals the difference between the prediction (6.37)
and the expected value (5.89).
Prediction [$f(x^{(i)})$] for instance [$i$] differs from the average prediction [$\mathbb{E}(f(X))$] by [$f(x^{(i)}) - \mathbb{E}(f(X))$], to which [feature name = feature value] contributed [$\phi_j^{(i)}$].
• The most influential feature was 'residual sugar' (=10.8), with a SHAP value of 0.32, meaning it increased the predicted quality by 0.32 compared to the average prediction.
• Overall, the prediction surpassed the average, suggesting a high-quality
wine.
• Most of this wine’s feature values were assigned a positive SHAP value.
• The feature ‘pH’ with a value of 3.09 had the largest negative SHAP value.
Let’s examine another data point:
shap.waterfall_plot(shap_values[1])
[Waterfall plot for the second wine: f(x) = 6.398]
This wine has a similar predicted rating to the previous one, but the contribu-
tions to this prediction differ. It has two substantial positive contributions from
the ‘density’ and ‘alcohol’ values, but also two strong negative factors: ‘volatile
acidity’ and ‘residual sugar’.
The waterfall plot lacks context for interpretation. For instance, while we know
‘residual sugar’ increased the prediction for the first wine, we cannot deduce from
the waterfall plot alone whether low or high levels of ‘residual sugar’ are associated
with small or large SHAP values.
dataset (and training data as background data). By visualizing the SHAP values
across all features and multiple data points, we can uncover patterns of how the
model makes predictions. This gives us a global model interpretation.
We previously computed the SHAP values for the test data, which are now stored
in the shap_values variable. We can create a summary plot from this variable
for further insights into the model.
shap.plots.beeswarm(shap_values)
[Beeswarm summary plot: features ranked by importance — density, residual sugar, alcohol, volatile acidity, pH, free sulfur dioxide, sulphates, fixed acidity, total sulfur dioxide, and the sum of 2 other features; x-axis: SHAP value (impact on model output), roughly -1.0 to 1.0; color: feature value from low to high.]
Summary plot
• Observe the ranking of the features. The higher the feature, the greater
its SHAP importance.
• For each feature of interest:
– Examine the distribution of the SHAP values. This provides in-
sight into the various ways the feature values can influence the
prediction. For instance, a wide spread indicates a broad range
of influence.
– Understand the color trend for a feature: This offers an initial
insight into the direction of a feature effect and whether the rela-
tionship is monotonic or exhibits a more complex pattern.
– Look for color clusters that may indicate interesting data clusters.
Not relevant for linear models, but for non-linear ones.
shap.plots.scatter(shap_values[:, 'alcohol'])
[Dependence plot: alcohol on the x-axis (about 8 to 14) versus the SHAP value for alcohol on the y-axis (about -0.6 to 0.8).]
This plot demonstrates the global dependence modeled by the linear regression
between alcohol and the corresponding SHAP values for alcohol. The dependence
plot will be much more insightful for a non-linear model, but it’s a great way
to confirm that the SHAP values reflect the linear relationship in the case of
a linear regression model. As the alcohol content increases, the corresponding
SHAP value also increases linearly. This increase corresponds to the slope in the
linear regression model:
feature = 'alcohol'
ind = X_test.columns.get_loc(feature)
coefs.coefficient[ind]
0.229
A visual inspection of the dependence plot confirms the same slope, as the plot
ranges from (8, -0.6) to (14, 0.8), resulting in a slope of (0.8 − (−0.6))/(14 − 8) ≈
0.23.
8 Classification with Logistic
Regression
This chapter explains how to use and interpret SHAP with logistic regression, which differs from linear regression in a few ways. Most importantly, the model outputs a probability for each class, produced by the logistic function:

$$P(Y = 1 | x^{(i)}) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \ldots + \beta_p x_p^{(i)})\right)}$$
Since the probability of one class defines the other’s, you can work with just one
probability. Having two classes is a special case of having 𝑘 classes.
1
While the output is a number between 0 and 1, classifiers are frequently not well-calibrated,
so be cautious when interpreting the output as a probability in real-world scenarios.
Tip
Even though the example here is binary classification, shap works the same
for multi-class and also when the model is not logistic regression.
import shap
from sklearn.model_selection import train_test_split
X, y = shap.datasets.adult()
Next, we train the model and compute the SHAP values. Compared to the linear
regression example, you will notice something new here:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
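# The train/test split, the preprocessing-plus-model pipeline, and the background
# sample X_sub are not shown here; the following is a minimal sketch, and the
# column lists, sample size, and random_state are assumptions:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
cat_features = ['Workclass', 'Marital Status', 'Occupation',
                'Relationship', 'Race', 'Sex', 'Country']
num_features = ['Age', 'Education-Num', 'Capital Gain',
                'Capital Loss', 'Hours per week']
preprocessing = ColumnTransformer([
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
])
model = Pipeline([
    ('preprocess', preprocessing),
    ('logreg', LogisticRegression(max_iter=1000)),
])
model = model.fit(X_train, y_train)
# A small subsample of the training data serves as background data
X_sub = X_train.sample(100, random_state=42)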
ex = shap.Explainer(model.predict_proba, X_sub)
shap_values = ex(X_test.iloc[0:100])
Regression Chapter we’ll discuss more elaborate choices of background data
in depth.
• Applying SHAP not to the model but to the entire pipeline allows us to
compute SHAP values for the original features instead of their processed
versions.
The Adult dataset contains both categorical and numerical features, which are
transformed before being inputted into the logistic regression model.
• Numerical features are standardized: $x_{j,\text{std}}^{(i)} = (x_j^{(i)} - \bar{x}_j) / \text{sd}(x_j)$.
• Categorical features are one-hot encoded. For instance, a feature with 1
column and 3 categories transforms into 3 columns, e.g. category “3” might
be encoded as (0,0,1).
Following these steps, our dataset expands to approximately 470 columns. The
numerical features, like age, are no longer easily interpretable due to standardiza-
tion, making it necessary to compute the actual age represented by, say, 0.8.
The logistic regression model, however, can process these inputs. This implies
that the coefficients are based on this transformed dataset. Applying SHAP
values directly on the logistic regression model would yield 470 SHAP values.
Yet, there’s a more interpretable method: We can integrate the preprocessing
and logistic regression into a pipeline and regard it as our model. This approach
is akin to nesting mathematical functions: the pipeline computes $f(x) = \text{logistic regression}(\text{preprocessing}(x))$.
Tip

When preprocessing your data, think about which steps you want to incorporate into your pipeline when calculating SHAP values. It's sensible to include steps like feature standardization in the pipeline, while transformations that enhance interpretability should be left out.
Another point of interest is that the model has two outputs: the probability of
earning less than $50k and the probability of earning more than $50k. This is
mirrored in the resulting shap_values variable, which gains an extra dimension.
So to pick a single SHAP value, we have to define three things: For which data instance? For which feature? For which model output?
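For instance, a single value can be picked like this (a small illustration; "Age" is just an example feature, and integer indices work as well):

# SHAP value of the feature "Age" for instance 1 and the second class (>50k)
print(shap_values[1, "Age", 1].values)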
class_index = 1
data_index = 1
shap.plots.waterfall(shap_values[data_index,:,class_index])
[Waterfall plot: f(x) = 0.001]
For this individual, the predicted likelihood of earning more than $50k was 0.01%,
well below the expected 22%. This plot shows that Marital Status was the most
influential feature, contributing -0.05. The interpretation is largely the same as
for regression, except that the outcome is on the probability level and we have to
choose which class’s SHAP values we want to interpret.
Now, let’s inspect the SHAP values for the alternative class.
class_index = 0
shap.plots.waterfall(shap_values[data_index,:,class_index])
[Waterfall plot: f(x) = 0.999]
This plot shows the probability of this individual earning less than $50k. We
can see it’s the same figure from before except all SHAP values are multiplied
72
by -1. This is logical since the probabilities for both classes must add up to 1.
Thus, a feature that increases the predicted probability of class >50k by 0.11 decreases the predicted probability of class <=50k by 0.11. We only need to pick one of the two
classes. This changes when we have three or more classes, see the Image Chapter
for an example involving multiple classes.
[Bar plot of the same SHAP values: Marital Status 0.06, Occupation 0.04, Education-Num 0.04, Relationship 0.02, Hours per week 0.02, Age 0.02, Capital Gain 0.02, Sex +0, Capital Loss 0, Sum of 3 other features +0.]
The interpretation here is the same as that of the waterfall plot, so I will not repeat
it. The only difference between the two plots is the arrangement of information,
with the bar plot not displaying 𝔼(𝑓(𝑋)) and 𝑓(𝑥(𝑖)).
Additionally, the force plot is simply a different representation of the SHAP val-
ues:
shap.initjs()
shap.plots.force(shap_values[data_index,:,class_index])
The force plot is interactive, based on JavaScript, and allows you to hover over
it for more insights. Of course, this feature is not available in the static format
you’re currently viewing, but it can be accessed if you create your own plot and
embed it in a Jupyter notebook or a website. The image above is a screenshot
of a force plot. The plot is named force plot because the SHAP values are
depicted as forces, represented by arrows, which can either increase or decrease
the prediction. If you compare it with the waterfall plot, it's like a horizontal
arrangement of arrows.

Figure 8.1: Force Plot
Personally, I find the waterfall plot easier to read than the force plot, and it
provides more information than the bar plot.
The logit link maps a probability $p$ to log odds: $\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$. We can use this logit link to transform the output of the logistic regression model and compute SHAP values on this new scale:
ex_logit = shap.Explainer(
    model.predict_proba, X_sub, link=shap.links.logit
)
sv_logit = ex_logit(X_test.iloc[0:100])
shap.plots.waterfall(sv_logit[data_index,:,class_index])
[Waterfall plot on the log-odds scale: f(x) = 6.521, E[f(X)] = 1.259]
When the outcome of a logistic regression model is defined in terms of log odds,
the features impact the outcome linearly. In other words, logistic regression is a
linear model on the level of the log odds.
Here’s what it means for interpretation: A marital status of 4 contributes -1.56 to
the log odds of making >$50k versus <=$50k compared to the average prediction.
However, SHAP values shine in their applicability at the probability level, and
log odds can be challenging to interpret. So, when should you use log odds and
when should you use probabilities?
Note

If your focus is on the probability outcome, use the identity link (which is the default behavior). The logit space is more suitable if you're interested in "evidence" in an information-theoretic sense, even if the effect in probability space isn't substantial.
Let’s discuss when the distinction between log odds and probabilities matters: A
shift from 80% to 90% is large in probability space, while a change from 98% to
99.9% is relatively minor.
In probability space, the differences are 0.10 and 0.019. In logit space, we have logit(0.9) − logit(0.8) ≈ 2.20 − 1.39 = 0.81 and logit(0.999) − logit(0.98) ≈ 6.91 − 3.89 = 3.02. In logit space, the second jump is larger. This happens because the logit compresses near 0 and 1, making changes in the extremes of probability space appear larger.
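You can verify these step sizes quickly, for example with scipy:

from scipy.special import logit

print(logit(0.90) - logit(0.80))   # ~0.81
print(logit(0.999) - logit(0.98))  # ~3.02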
So, which one should you select? If you’re primarily concerned with probabilities,
and a jump from 80% to 81% is as significant to you as from 99% to 100%,
stick with the default and use the identity link. However, if changes in extreme
probabilities near 0 and 1 are more critical for your application, choose logits.
Whenever rare events, anomalies, and extreme probabilities matter, go with logits.
You can also visualize the difference in step sizes in the following Figure 8.2.
Figure 8.2: Probabilities versus Logits

shap.plots.beeswarm(shap_values[:,:,class_index])

[Beeswarm summary plot: features ranked by importance — Marital Status, Education-Num, Capital Gain, Occupation, Sex, Age, Hours per week, Relationship, Workclass, and the sum of 3 other features; x-axis: SHAP value (impact on model output); color: feature value from low to high.]
From our observations, Marital Status and Education emerge as the two most
important features. For some individuals, Capital Gain has substantial effects,
suggesting that large capital gains result in large SHAP values.
shap.plots.force(sv_logit[0:20,:,0])
Cluster plot
• The cluster plot consists of vertical force plots.
• Data instances are distributed across the x-axis, while SHAP values
are spread across the y-axis.
• Color signifies the direction of SHAP values.
• The larger the area for a feature, the larger its SHAP values across the data instances.
Figure 8.3: Clustering Plot
You can hover over the plot for more information, change what you see on the
x-axis, and experiment with various other orderings. The cluster plot is an ex-
ploratory tool.
You can also alter the ordering by clicking on the interactive graph, for example,
by the prediction:
Heatmap plot
• Each row on the y-axis represents a feature, and instances are distributed across the x-axis.
Figure 8.4: Clustering Plot
shap.plots.heatmap(sv_logit[:,:,class_index])
[Heatmap plot: instances on the x-axis, features on the y-axis, colored by SHAP value, with the model output f(x) shown as a curve above the heatmap.]
9 SHAP Values for Additive Models
A generalized additive model (GAM) predicts the target as a sum of per-feature functions, 𝑓(𝑥) = 𝛽0 + 𝑓1 (𝑥1 ) + … + 𝑓𝑝 (𝑥𝑝 ). Unlike the simple linear model, we allow the functions 𝑓𝑗 to be non-linear. If for all features 𝑓𝑗 (𝑥𝑗 ) = 𝛽𝑗 𝑥𝑗 , we arrive at the linear model. Thus, linear regression models are special cases of GAMs.
With GAMs, we can use arbitrary functions for the features. Popular choices
include spline functions, which allow for flexible, smooth functions with a gradient.
Tree-based basis functions, which have a fast implementation, are also an option.
Additive models expand our understanding of SHAP values, as they allow us
to examine non-linear functions without interactions. Although we could add
interaction terms to a GAM, we will not do so in this chapter, as the interpretation
becomes more complex.
Next, we fit a model. We’re using the Explainable Boosting Regressor from the
interpret package. The Explainable Boosting Machine (EBM) is a tree-based
GAM. It offers optional automated interaction detection, which we won’t use in
this example. In our case, each tree in the ensemble can only use one feature to
avoid modeling interactions.
Here we train the model:
import pandas as pd
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingRegressor
wine = pd.read_csv('wine.csv')
y = wine['quality']
X = wine.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = ExplainableBoostingRegressor(interactions=0)
model = model.fit(X_train, y_train)
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f}")
MAE: 0.55
The mean absolute error on the test data is less than that of the linear regression
model. This is promising as it indicates that by using a GAM and allowing
non-linear feature effects, we improved predictions. The increase in performance
indicates that some relations between wine quality and features are non-linear.
import shap
explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer(X_test)
shap.plots.waterfall(shap_values[0], max_display=10)
[Waterfall plot for the first test wine: f(x) = 5.989]
This waterfall plot provides a different perspective than the purely linear model.
• For this particular wine, the most important features were alcohol and free
sulfur dioxide, whereas, in the linear model, they were residual sugar and
free sulfur dioxide.
• The quality predicted by the GAM is approximately 6.0, lower than the 6.4
predicted by the linear model.
• This example clearly illustrates how the global average prediction and the
local prediction can be similar, but numerous SHAP values cancel each
other out.
Let's examine the SHAP dependence plot for alcohol:
shap.plots.scatter(shap_values[:,"alcohol"])
[Dependence plot: SHAP value for alcohol plotted against alcohol content (8–14).]
In the case of alcohol, there is a positive relationship between alcohol levels and
the SHAP values. The SHAP contribution increases with the alcohol content,
but it plateaus at extremely high and low levels.
Let’s compare these SHAP values with the alcohol effect learned by the GAM.
We can plot the SHAP values and overlay the alcohol curve extracted directly
from the GAM.
import matplotlib.pyplot as plt
import numpy as np
shap.plots.scatter(shap_values[:,"alcohol"], show=False)
[Dependence plot for alcohol with the GAM's alcohol curve overlaid.]
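The overlay code itself is not reproduced in this excerpt. Because the EBM contains no interactions, the GAM's alcohol effect can be approximated by varying only the alcohol value of one reference wine and centering the resulting predictions; the grid and the choice of reference row below are my own assumptions:
import matplotlib.pyplot as plt
import numpy as np

# Approximate the GAM's alcohol shape function: vary alcohol on a grid while
# keeping all other features fixed at a reference wine (valid without interactions)
grid = np.linspace(X_test['alcohol'].min(), X_test['alcohol'].max(), 100)
ref = X_test.iloc[[0] * len(grid)].copy()
ref['alcohol'] = grid
effect = model.predict(ref)
effect = effect - effect.mean()  # center the curve, like SHAP values

shap.plots.scatter(shap_values[:, "alcohol"], show=False)
plt.plot(grid, effect, color='red')
plt.show()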
As evident, the SHAP values follow the same trajectory as we would see when
simply altering one of the features (here, alcohol). This reinforces our confidence
in understanding SHAP values. There’s a paper (Bordt and Luxburg 2022) that
demonstrates that when the model is a GAM, the non-linear components can
be recovered by SHAP. Like the linear case, in the additive case, SHAP values
accurately track the feature effect and align with what we would expect.
To determine which features are globally important, we average the absolute SHAP values per feature across the data:
$$I_j = \frac{1}{n} \sum_{i=1}^{n} \left|\phi_j^{(i)}\right|$$
We then sort the features by decreasing importance and plot them. This method
of sorting features is also used in the summary plot.
shap.plots.bar(shap_values)
[Bar plot of mean |SHAP value| per feature: alcohol 0.21, volatile acidity 0.13, density 0.13, free sulfur dioxide 0.11, residual sugar 0.09, citric acid 0.09, chlorides 0.08, total sulfur dioxide 0.06, pH 0.06, and a sum of 2 other features 0.08.]
Note
Permutation Feature Importance (PFI) is derived from the decline in model
performance, whereas SHAP relies on the magnitude of feature attributions.
This difference becomes particularly pronounced when the model is overfit-
ting. A feature that doesn’t actually correlate with the target will have an
expected PFI of zero but may exhibit a non-zero SHAP importance.
10 Understanding Feature
Interactions with SHAP
Interpreting models becomes more complex when they contain interactions. This
chapter presents a simulated example to explain how feature interactions influence
SHAP values.
• The taller the fan, the better.
• The closer to the stage, the better.
• Both of these have linear relationships with concert enjoyment.
• Additional interaction: Small fans who are far from the stage get a “bonus”:
some kind soul allows them to sit on their shoulders for a better view of
the concert.
• The -8 is just so that the output is roughly between 0 (bad concert experi-
ence) and 10 (great concert experience).
Feature Interaction
Two features interact when the prediction can’t be explained by the sum of
both feature effects. Alternatively formulated: interaction means the effect
of one feature changes depending on the value of the other feature.
Consider the price of a hotel room, which depends on room size and sea view.
Both factors individually contribute to the price: a larger room costs more,
as does a room with a sea view. However, size and view interact: for small
rooms, the sea view adds less value than it does for large rooms, since small
rooms are less inviting for extended stays.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
np.random.seed(42)
n = 1000
X = pd.DataFrame({
'x1': np.random.uniform(140, 200, n),
'x2': np.random.uniform(0, 10, n)
})
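# The book's exact formula for the simulated enjoyment y is not shown in this
# excerpt. The following is an illustrative stand-in consistent with the
# description above: the 0.1 slope for height and the -8 offset come from the
# text, while the distance slope and the size of the shoulder bonus are assumptions.
shoulder_bonus = 3.0 * ((X['x1'] < 160) & (X['x2'] > 7))
y = 0.1 * X['x1'] - 0.5 * X['x2'] + shoulder_bonus - 8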
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model = rf_model.fit(X_train, y_train)
[Predicted concert enjoyment (0–10) as a function of fan height (x-axis, 140–200 cm) and distance to the stage (y-axis, 0–10).]
The random forest appears to approximate the function quite accurately. Next,
we will generate explanations for the predictions.
import shap
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer(X_test)
shap.plots.beeswarm(shap_values)
[Beeswarm plot: SHAP values for x2 and x1, colored by feature value.]
In general, greater height (x1) leads to higher SHAP values and greater distance to the stage (x2) leads to lower SHAP values.
However, there are some exceptions: Small fans sometimes receive large SHAP
values due to the interaction effect of sitting on shoulders. To investigate further,
let’s examine the dependence plot, Figure 10.1.
shap.plots.scatter(shap_values[:,0], color=shap_values)
Figure 10.1: Dependence plot for x1, the height feature
The dependence plot colors the points by the values of feature 𝑥2 , as we provided
the SHAP values for the color option. By default, the points are colored by
the feature with the highest approximate interaction. Given our model only
contains two features, the selection is naturally feature 𝑥2 . We can make three
observations:
1. A large jump at 𝑥1 = 160 is logical because, according to our simulated
data, fans taller than 160cm will not sit on someone’s shoulders.
2. Ignoring the jump, there seems to be a linear upward trend, which aligns
with the linear dependence of 𝑌 on 𝑥1 . The slope reflects the coefficient in
the simulated function (𝛽1 = 0.1).
3. There are two “clusters” of points: one with a small jump and one with a
large jump.
The third point becomes clearer when we note the curves are colored by the
feature value 𝑥2 . There are two “lines”:
• One line represents fans who are >7 away from the stage (𝑥2 ). Here we
see the large jump, which is expected since fans taller than 160cm have no
chance of getting on someone’s shoulders.
• The other line represents values of 𝑥2 below 7. It has a smaller jump,
but why is there a jump at all? Fans in this “cluster” don’t get to sit on
someone’s shoulders when they are smaller than 160cm.
The reason why the interaction also “bleeds” into the cluster where we wouldn't expect it has to do with how SHAP values are computed. Consider two fans standing in the audience:
1. Mia, who is 159cm tall and 2 units away from the stage.
2. Tom, who is 161cm tall and standing right next to Mia, also 2 units away
from the stage.
Here are the model's predictions for how much they will enjoy the concert:
# Mia: 159 cm, 2 units from the stage; Tom: 161 cm, 2 units from the stage
Xnew = pd.DataFrame({'x1': [159, 161], 'x2': [2, 2]})
print("""
Mia: {mia}
Tom: {tom}
Expected: {exp}
""".format(
    mia=round(rf_model.predict(Xnew)[0], 2),
    tom=round(rf_model.predict(Xnew)[1], 2),
    exp=round(explainer.expected_value[0], 2)
))
Mia: 5.88
Tom: 6.07
Expected: 4.86
They have rather similar predicted enjoyment for the concert, with Mia having a slightly lower prediction, which makes sense given that she is slightly smaller and neither of them qualifies for the shoulder bonus.
Let’s examine their SHAP values.
shap_values = explainer(Xnew)
print('Mia')
print(shap_values[0].values)
print('Tom')
print(shap_values[1].values)
Mia
[-0.15944093 1.18150926]
Tom
[-1.37794848 2.5913748 ]
But shouldn’t Mia have a smaller SHAP value for height than Tom? Neither of
them benefits from the shoulder bonus, so Mia being smaller than Tom should
mean that her SHAP value for “height” should be smaller than Tom’s, right? But
surprisingly, Mia’s SHAP value is influenced by the interaction term, despite her
not being directly affected by the shoulder bonus!
This outcome is a result of the calculation process of SHAP values: When comput-
ing the SHAP value for Mia’s height, one of the marginal contributions involves
adding her height to the empty coalition (∅). For this marginal contribution,
we have to sample the stage distance feature. And sometimes we sample dis-
tances > 7, which activate the shoulder bonus. But only for Mia, not for Tom.
The shoulder bonus strongly increases concert enjoyment and, as a consequence,
Mia’s SHAP value for height becomes greater than Tom’s. So even though Mia
is too close to the stage to get the shoulder bonus, her height’s SHAP value ac-
counts for this interaction. This example shows that SHAP values have a global
component: Interactions influence data points that are far away.
Warning
In a what-if analysis, we would only evaluate how the prediction changes when
the height changes, which should be similar for both Tom and Mia.
But, what does such a contrived example have to do with real machine learning
applications? The extreme phenomenon observed here occurs subtly in real appli-
cations, albeit in more complex ways. The interactions might be more nuanced
and intricate, and there may be more features involved. However, this global
phenomenon is likely to occur. Interactions within a machine learning model can
be highly complex. Thus, bear this limitation in mind when interpreting SHAP
values.
11 The Correlation Problem
SHAP values encounter a subtle yet relevant issue when dealing with correlated
features. Simulating the absence of features by replacing them with sampled
values from the background data can generate unrealistic data points. These
implausible data points are then used to produce explanations, which causes several problems.
In this chapter, we will first dive into the problem in more detail using a small
simulation, followed by a discussion of possible solutions.
import numpy as np
np.random.seed(42)
p = 0.9
mean = [0, 0] # mean vector
cov = [[1, p], [p, 1]] # covariance matrix
n = 100 # number of samples
Next, let’s create a data point for which we will simulate the sampling from the
background data.
I’m interested in the SHAP value for feature 𝑋1 . I will demonstrate how SHAP
values sample from the marginal distribution and compare that to what sampling
from the conditional distribution would look like.
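The code that generates the data and the replacement samples is not shown in this excerpt; a minimal sketch could look like the following, where the query point at x1 = 1.5 and the m = 30 replacement samples are my own assumptions (the conditional draw uses the bivariate Gaussian formula X2 | X1 = x ~ N(p*x, 1 - p^2)):
import matplotlib.pyplot as plt

# Correlated background data
x1, x2 = np.random.multivariate_normal(mean, cov, n).T

# The data point whose SHAP value for X1 we are interested in
point = np.array([1.5, 1.5])
m = 30  # number of replacement samples for the "absent" feature X2

# Marginal sampling: ignore the correlation and draw X2 from its marginal N(0, 1)
x2_marg = np.random.normal(0, 1, m)

# Conditional sampling: draw from P(X2 | X1 = 1.5) = N(p * 1.5, 1 - p^2)
x2_cond = np.random.normal(p * point[0], np.sqrt(1 - p**2), m)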
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(x1, x2, color='black', alpha=0.1)
plt.scatter(np.repeat(point[0], m), x2_marg, color='green')
plt.scatter(point[0], point[1], color='red')
plt.xlabel('x1')
plt.subplot(122)
plt.scatter(x1, x2, color='black', alpha=0.1)
plt.scatter(np.repeat(point[0], m), x2_cond, color='green')
plt.scatter(point[0], point[1], color='red')
plt.xlabel('x1')
plt.show()
Figure 11.1: Marginal and conditional sampling
The plot on the left shows sampling from the marginal distribution: we disre-
gard the correlation between 𝑋1 and 𝑋2 and sample 𝑋2 independently from
𝑋1 . Marginal sampling in this correlated case creates new data points outside
of the distribution. This is what occurs with SHAP values as we have used
them throughout the book. On the right, we see conditional sampling. Con-
ditional sampling means that we respect the distribution of 𝑋2 , given that we
already know the value for 𝑋1 and sample from 𝑃 (𝑋2 |𝑋1 ) instead of 𝑃 (𝑋2 ).
Conditional sampling preserves the distribution, whereas marginal sampling may
distort it when features are correlated.
• Apply feature engineering to decrease correlation. For instance, if you have
the features “apartment rent” and “number of rooms”, they will be cor-
related. You can decorrelate them by converting the rent into “rent per
square meter”.
• Combine features. Perhaps having the amount of rain in the morning and
afternoon as separate features is unnecessary. Would daily rainfall be suf-
ficient? Test it out.
Reducing correlated features and the overall number of features can significantly
enhance model fitting. You can assess how each of these steps impacts predictive
performance, aiding in your decision-making for possible trade-offs.
• Cat and size are most strongly correlated.
• Both are slightly correlated with the park feature.
• The floor feature is not correlated with any of the features.
11.5 Solution: Conditional sampling
As suggested in Figure 11.1, conditional sampling can also be utilized.
Here’s a brief example:
• We have four features: 𝑋1 through 𝑋4 .
• Calculate the SHAP value for 𝑋1 .
• During the sampling process, we add 𝑋1 to the coalition {𝑋2 }.
• Compute the predictions for the coalitions {𝑋2 } and {𝑋1 , 𝑋2 }.
• The missing features are {𝑋1 , 𝑋3 , 𝑋4 } and {𝑋3 , 𝑋4 }, and we sample these
from the background data.
• Sampling can be carried out in two ways:
– From 𝑃 (𝑋1 , 𝑋3 , 𝑋4 ) and 𝑃 (𝑋3 , 𝑋4 ) (marginal sampling), or
– From 𝑃 (𝑋1 , 𝑋3 , 𝑋4 |𝑋2 ) and 𝑃 (𝑋3 , 𝑋4 |𝑋1 , 𝑋2 ) (conditional sam-
pling)
Conditional sampling helps avoid extrapolation problems. Aas et al. (2021) sug-
gested incorporating this method into KernelSHAP. However, implementing con-
ditional sampling is not straightforward, given that supervised machine learning
mainly learns 𝑃 (𝑌 |𝑋1 , … , 𝑋𝑝 ), and we now need to learn complex distributions
for numerous variables. Generally, we make simplifying assumptions for the con-
ditional distributions to facilitate sampling, as suggested by Aas et al. (2021):
• Use multivariate Gaussians.
• Utilize Gaussian copulas.
• Apply kernel estimators (if the dimensions are not too many).
The paper suggests more techniques, but these are the most significant.
Warning
With conditional sampling, the value function changes to:
$$v_{f,x^{(i)}}(S) = \int f\left(x_S^{(i)} \cup X_C\right)\, d\mathbb{P}_{X_C \mid X_S = x_S^{(i)}} - \mathbb{E}(f(X))$$
This shift changes the game. For example, the resulting SHAP values might
seem to not adhere to the Dummy axiom, as unused features with, for example, 𝛽𝑗 = 0 in a linear model can suddenly have non-zero SHAP values.
However, conditional SHAP does not actually violate the Dummy axiom, as
they still qualify as (conditional) SHAP values based on the new payout us-
ing conditional distributions. Without correlation, marginal and conditional
SHAP are identical, since then 𝑃 (𝑋1 ) = 𝑃 (𝑋1 |𝑋2 ).
If you want to employ conditional sampling in shap, use the Linear explainer
with the feature_perturbation='correlation_dependent' option. The
Tree explainer provides a similar option called feature_perturbation=
'tree_path_dependent', which uses the splits in the underlying tree-based
model to ensure SHAP is not extrapolating. However, I advise against using the
tree-path option, as it does not accurately model the conditional distribution
and does not approximate the conditional SHAP values well (Aas et al. 2021).
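For illustration, a call with the conditional (correlation-aware) option might look like the following sketch, where lin_model is a placeholder for a fitted sklearn linear model:
import shap

explainer_cond = shap.LinearExplainer(
    lin_model, X_train, feature_perturbation="correlation_dependent"
)
sv_conditional = explainer_cond(X_test)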
Whether a conditional SHAP interpretation is useful depends on your interpreta-
tion objectives. If you want to audit the model, marginal sampling may be more
suitable. If your goal is to better understand the data, conditional sampling
might be the better choice. This trade-off is often described as being true to the
model (marginal sampling) or true to the data (conditional sampling) (Chen et
al. 2020). Sundararajan and Najmi (2020) discuss this concept further.
12 Regression Using a Random
Forest
In this chapter, we will examine the wine dataset again and fit a tree-based model,
specifically a random forest. This model potentially contains numerous interac-
tions and non-linear functions, making its interpretation more complex than in
previous chapters. Nevertheless, we can employ the fast shap.TreeExplainer.
Note
Gradient-boosted tree algorithms such as LightGBM and xgboost are other popular tree-based models. The shap application demonstrated here works the same way with them.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
wine = pd.read_csv('wine.csv')
y = wine['quality']
X = wine.drop(columns=['quality'])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(random_state=42)
model = model.fit(X_train, y_train)
Next, we evaluate the performance of our model, hoping for better results than
with the GAM:
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('MAE:', round(mae, 2))
MAE: 0.42
This model performs better than the GAM, suggesting that additional interac-
tions are beneficial. Despite the GAM also being tree-based, it did not model
interactions.
import shap
# Compute the SHAP values for the sample
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)
Important
The code above produces an error: “This check failed because for one of the
samples the sum of the SHAP values was 5.881700, while the model output
was 5.820000.” The Tree Explainer is an exact explanation method, and
shap checks if additivity holds: the model prediction should equal the sum
of SHAP values + base_value. In this case, there is a discrepancy in some
SHAP values. I’m not entirely sure why this happens - it may be due to
rounding issues. You might encounter this too, so here are two options to
handle it: either set check_additivity to False or use a different explainer, like
the Permutation Explainer. If you disable the check, ensure the difference is
acceptable:
import numpy as np
shap_values.base_values + shap_values.values.sum(axis=1) - \
    model.predict(X_test)
Let’s try again but this time we skip the check for additivity:
import shap
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test, check_additivity=False)
Warning
Let’s revisit the SHAP values for the wine from the Linear Chapter and the
Additive Chapter.
shap.plots.waterfall(shap_values[0], max_display=11)
[Waterfall plot for the first test wine under the random forest: f(x) = 6.849]
While the results differ from both the linear and the GAM models, the interpreta-
tion process remains the same. A key difference is that the random forest model
includes interactions between the features. However, since there’s only one SHAP
value per feature value (and not one for every interaction), we don’t immediately
see how features interact.
12.3 Global model interpretation
Global SHAP plots provide an overall view of how features influence the model’s
predictions. Let’s examine the summary plot:
shap.plots.beeswarm(shap_values)
[Beeswarm plot of SHAP values; features ordered by importance: alcohol, volatile acidity, free sulfur dioxide, chlorides, residual sugar, pH, total sulfur dioxide, citric acid, sulphates, and a sum of 2 other features.]
Key observations:
– Low levels of free sulfur dioxide result in lower predicted quality.
We can examine interactions in global plots like the dependence plots. Here’s the
dependence plot for the alcohol feature:
shap.plots.scatter(shap_values[:,"alcohol"], color=shap_values)
[Dependence plot: SHAP value for alcohol plotted against alcohol content (8–14), colored by volatile acidity.]
The shap package automatically detects interactions. In this case, shap iden-
tified volatile acidity as a feature that greatly interacts with alcohol and
color-coded the SHAP values accordingly. By default, the shap dependence plot
chooses the feature that has the strongest interaction with the feature of inter-
est. The dependence plot function calls the approximate_interactions func-
tion, which measures the interaction between features through the correlation of
SHAP values, with a stronger correlation indicating a stronger interaction. Then
it ranks features based on their interaction strength with a chosen feature. You
can also manually select a feature.
Here are some important observations:
Note
Here’s some advice on interpreting the interaction part of the dependence
plot:
Next, let’s examine the dependence plot for residual sugar as another example.
Residual sugar represents the remaining sugar in the wine, with higher amounts
indicating a sweeter taste.
shap.plots.scatter(
shap_values[:,"residual sugar"], color=shap_values
)
[Dependence plot: SHAP value for residual sugar plotted against residual sugar (0–20), colored by alcohol.]
Key observations:
• Higher residual sugar is associated with higher SHAP values.
• The shap package identifies alcohol as having the highest interaction with
residual sugar.
• Alcohol and residual sugar are negatively correlated with a correlation co-
efficient of -0.5 (see later in this chapter); this makes sense as sugar is
converted into alcohol during the fermentation process.
• Comparing curves for low (below 12) and high alcohol levels (above 12):
– High variance in SHAP values is observed when alcohol content is low.
– High alcohol content is associated with low residual sugar and higher
SHAP values, compared to low alcohol content.
12.4 Analyzing correlated features
As mentioned in the Correlation Chapter, correlated features require additional
consideration. Let’s examine which features are correlated and how to use the
Partition explainer. We’ll start with a correlation plot that displays the Pearson
correlation between the features, given by the formula:
$$r_{xz} = \frac{\sum_{i=1}^{n}(x^{(i)} - \bar{x})(z^{(i)} - \bar{z})}{\sqrt{\sum_{i=1}^{n}(x^{(i)} - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(z^{(i)} - \bar{z})^2}}$$
import seaborn as sns
# Correlation matrix, upper-triangle mask, and a diverging color palette
corr = X_train.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},
            annot=True, fmt=".1f")
plt.show()
[Figure 12.1: Correlation heatmap (Pearson) of the wine features.]
Figure 12.1 shows that density correlates with residual sugar (0.8) and total sulfur dioxide (0.5). Density also has a strong negative correlation with alcohol. Volatile acidity does not show a strong correlation with other features.
The Partition explainer is a method that handles correlated features by computing
SHAP values based on hierarchical feature clusters.
An obvious strategy is to use correlation to cluster features so that highly corre-
lated features are grouped together. However, one modification is needed: Cor-
relation leads to extrapolation, which we need to manage, but it doesn’t matter
whether the correlation is positive or negative. Clustering based on correlation
would cause features with a strong negative correlation to be far apart in the
clustering hierarchy, which is not ideal for our goal of reducing extrapolation.
Therefore, in the following example, we perform tree-based hierarchical cluster-
ing on the absolute correlation. Features that are highly correlated, whether
negatively or positively, are grouped together hierarchically until the groups with
the least correlation are merged.
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
correlation_matrix = X_train.corr()
correlation_matrix = np.corrcoef(correlation_matrix)
correlation_matrix = np.abs(correlation_matrix)
dist_matrix = 1 - correlation_matrix
# Hierarchical clustering on the correlation distance (linkage method assumed: complete)
clustering = hierarchy.linkage(squareform(dist_matrix, checks=False), method='complete')
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = hierarchy.dendrogram(clustering, labels=X_train.columns)
plt.ylabel('Correlation Distance')
plt.show()
[Figure 12.2: Dendrogram of the hierarchical clustering of the features based on correlation distance.]
Figure 12.2 shows the clustering results: density and alcohol are combined first, then merged with residual sugar, and finally with the cluster of free and total sulfur dioxide. As we ascend, the correlation weakens. This clustering hierarchy is input into the Partition Explainer to produce SHAP values:
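The Partition Explainer call is not shown in this excerpt; a sketch of how the custom clustering can be passed in (variable names other than shap's own are mine):
# Use the hierarchical clustering as the feature grouping for the Partition masker
masker = shap.maskers.Partition(X_train, clustering=clustering)
explainer2 = shap.Explainer(model.predict, masker)
shap_values2 = explainer2(X_test)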
We now have our new SHAP values. The key question is: Do the results dif-
fer from when we ignored feature correlation? Let’s compare the SHAP impor-
tances:
fig = plt.figure(figsize=(6,12))
ax0 = fig.add_subplot(211)
shap.plots.bar(shap_values, max_display=11, show=False)
ax1 = fig.add_subplot(212)
shap.plots.bar(
shap_values2, max_display=11, show=False, clustering_cutoff=0.6
)
plt.tight_layout()
plt.show()
[Top: mean |SHAP value| per feature for the original SHAP values: alcohol 0.34, volatile acidity 0.12, free sulfur dioxide 0.11, chlorides 0.06, residual sugar 0.05, pH 0.05, total sulfur dioxide 0.05, citric acid 0.04, sulphates 0.04, density 0.03, fixed acidity 0.03. Bottom: SHAP importances from the Partition Explainer with clustering cutoff 0.6, where clustered features (e.g., alcohol 0.34 and density 0.03) are displayed together.]
While the SHAP importances are not identical, the differences are not substantial.
However, the real benefit is the new interpretation we gain from clustering and
the Partition Explainer:
• We may add the SHAP values for both alcohol and density and interpret this
as the effect for the alcohol and density group. There was no extrapolation
between the two features, meaning no unlikely combination of alcohol and
density was formed.
• Similarly, we may interpret the combined SHAP importance: we can inter-
pret 0.34 + 0.03 = 0.37 as the SHAP importance of the alcohol+density
group.
• Free and total sulfur dioxide form a cluster with a combined importance of
0.16.
• Collectively, they are more important than volatile acidity and, due to their
high correlation, we have a solid argument for analyzing them together.
As the user, you can decide how high up you go in the hierarchy by increasing the
clustering_cutoff and then adding up the SHAP values (or SHAP importance
values) for clusters. The higher the cutoff, the larger the groups, but also the
more the correlation problem is reduced.
Now let’s compare the SHAP explanation for the first data instance:
shap.plots.bar(shap_values2[0], clustering_cutoff=0.6)
[Bar plot of the SHAP values for the first instance with clustering cutoff 0.6: alcohol +0.39, density +0.02, residual sugar +0.14, free sulfur dioxide +0.15, ...]
Again, there are only slight differences in the SHAP values, and we can combine
the SHAP values of clusters in addition to interpreting the individual SHAP
values. For the computation of a combined SHAP value, the features within that
group were not subjected to extrapolation through marginal sampling. Revisit
the Correlation Chapter for a refresher on this concept. For instance, the feature
group “alcohol, density, and residual sugar” contributed a significant +0.55 (0.39
+ 0.02 + 0.14) to the predicted quality. We know that for the group SHAP
value of 0.55, alcohol, density, and residual sugar were always kept together in
coalitions.
However, the individual SHAP values are still partially susceptible to extrapo-
lation. For instance, the SHAP value for alcohol was computed by attributing
0.41 to both density and alcohol. For this attribution, density was also sampled
by marginal sampling, which introduces extrapolation, such as combining high
alcohol values with high density. So we have a trade-off between extrapolation
and group granularity: The higher we ascend in the clustering hierarchy, the less
extrapolation but the larger the feature groups become, which also complicates
interpretation.
In the first scenario, the SHAP values would sum up to a positive value, whereas in the second scenario, they would sum up to a negative value. Let's explore the two methods of comparing subsets.
fig = plt.figure(figsize=(6,12))
ax0 = fig.add_subplot(211)
shap.plots.waterfall(shap_values_sub[1], show=False)
ax1 = fig.add_subplot(212)
shap.plots.waterfall(shap_values_sub_all[1], show=False)
plt.tight_layout()
plt.show()
[Top waterfall plot (background: wines with alcohol > 12): f(x) = 6.59, E[f(X)] = 6.589; largest contributions: free sulfur dioxide +0.14, fixed acidity +0.06, sulphates -0.06, alcohol -0.06.]
[Bottom waterfall plot (background: all wines): f(x) = 6.589, E[f(X)] = 5.921; largest contributions: alcohol +0.56, free sulfur dioxide +0.08, chlorides +0.07.]
Figure 12.3: Background data: wines with alcohol > 12 (top) or all wines (bottom).
• The reference changes: Since wines rich in alcohol are associated with higher predicted quality, 𝔼(𝑓(𝑋)) is higher when only the high-alcohol wines serve as background data than when all wines do. This means that the SHAP values only need to explain a smaller difference of almost 0 instead of approximately 0.7.
The prediction [𝑓(𝑥^{(𝑖)})] for instance [𝑖] differs from the average prediction [𝔼(𝑓(𝑋))] for [subset] by [𝑓(𝑥^{(𝑖)}) − 𝔼(𝑓(𝑋))], to which [feature name = feature value] contributed [𝜙_𝑗^{(𝑖)}].
The sum of all SHAP values equals the difference between the prediction (6.59)
and the expected value (5.92).
Keeping the background data set for all wines and subsetting the SHAP values
produces the same individual SHAP values, but it changes the global interpreta-
tions:
# sort based on SHAP importance for all data and all wines
ordered = np.argsort(abs(shap_values.values).mean(axis=0))[::-1]
plt.subplot(131)
shap.plots.beeswarm(
shap_values, show=False, color_bar=False, order=ordered
)
plt.xlabel("")
plt.subplot(132)
shap.plots.beeswarm(
shap_values_sub_all, show=False, color_bar=False, order=ordered
)
plt.gca().set_yticklabels([]) # Remove y-axis labels
plt.ylabel("")
plt.subplot(133)
shap.plots.beeswarm(
shap_values_sub, show=False, color_bar=False, order=ordered
)
plt.gca().set_yticklabels([]) # Remove y-axis labels
plt.ylabel("")
plt.xlabel("")
plt.tight_layout()
plt.show()
Figure 12.4: The left plot includes all SHAP values with all wines as background
data. The middle plot contains SHAP values for wines high in al-
cohol with all wines as background data. The right plot displays
SHAP values for wines high in alcohol with background data also
comprising wines high in alcohol. The feature order for all plots is
based on the SHAP importance of the left plot.
Figure 12.4 shows how subsetting SHAP values alone or together with the back-
ground data influences the explanations. Alcohol, according to SHAP, is the most
crucial feature. Its importance remains when we subset SHAP values for wines
high in alcohol. Its significance increases because these wines, high in alcohol,
have a high predicted quality due to their alcohol content. However, when we
also alter the background data, the importance of alcohol significantly decreases,
as evidenced by the close clustering of the SHAP values around zero. More in-
sights can be found in Figure 12.4. For instance, consider volatile acidity. A
higher volatile acidity typically correlates to lower SHAP values, but different
patterns emerge when considering wines rich in alcohol. Firstly, the SHAP val-
ues of volatile acidity exhibit a smaller range. Moreover, some wines with high
volatile acidity surprisingly present positive SHAP values, contradicting the usual
relationship between volatile acidity and predicted quality.
Tip
Be inventive: Any feature can be employed to form subsets. You can even re-
sort to variables that were not used as model features for subset creation. For
example, you may want to examine how explanations change for protected
attributes like ethnicity or gender, variables which you would not normally
employ as features.
13 Image Classification with
Partition Explainer
Up until now, we’ve explored tabular data. Now, let’s explore image data.
Image classification is a common task typically solved using deep learning. Given
an image, the model identifies a class based on the visible content. Rather than
training our own image classifier, we’ll utilize a pre-trained ResNet model (He et
al. 2016) trained on ImageNet data (Deng et al. 2009). ImageNet is a large-scale
image classification challenge where models are required to categorize objects
within digital images. It boasts a dataset of over 1 million images from 1000
different categories, aiming to develop a model that accurately classifies each
image.
This example’s code is derived from a notebook from the shap library1 .
1: https://github.com/slundberg/shap/blob/master/notebooks/image_examples/image_classification/Explain%20ResNet50%20using%20the%20Partition%20explainer.ipynb
import json
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
import shap
model = ResNet50(weights='imagenet')
X, y = shap.datasets.imagenet50()
We’ll use the 50 images supplied by SHAP. One of these is a cheeseburger, which
we’ll encounter again later.
import json
import os
import urllib.request
json_file_path = 'imagenet_class_index.json'
# Verify if the JSON file is on disk
if os.path.exists(json_file_path):
    with open(json_file_path) as file:
        class_names = [v[1] for v in json.load(file).values()]
else:
    url = 'https://s3.amazonaws.com/deep-learning-models/' + \
        'image-models/imagenet_class_index.json'
    with urllib.request.urlopen(url) as response:
        json_data = response.read().decode()
    with open(json_file_path, 'w') as file:
        file.write(json_data)
    class_names = [v[1] for v in json.loads(json_data).values()]
What we need now are:
• A prediction function that wraps the model.
• A masker that defines how to simulate absent features (pixel clusters).
The prediction function is straightforward: it takes an image as a numpy array,
preprocesses it for the ResNet, and feeds it to the ResNet. Thus, it takes a numpy
array as input and outputs probability scores. We make use of the Partition Ex-
plainer that partitions the image into equal rectangles and recursively computes
the SHAP values.
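The wrapper itself is not shown in this excerpt; a minimal version, named predict to match the explainer call below, could be:
def predict(images):
    # Copy to avoid modifying the input in place, apply ResNet50 preprocessing,
    # and return the class probabilities as a numpy array
    return model.predict(preprocess_input(images.copy()), verbose=0)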
masker = shap.maskers.Image(
'blur(128,128)', shape = X[0].shape
)
The masker’s role is to “remove” the pixels not included in the coalition. From
its parameters, we can infer that it blurs the parts of the image that are absent. I
will illustrate what this looks like later. First, let’s calculate the SHAP values for
two images: a cheeseburger and a pocket watch. Since this is a classification task,
we need to decide the classes for which we want explanations. It’s standard to
select the top classes, especially the one with the highest probability, as it often
has significance in subsequent stages of using the model.
We’ll select the top 3 classes (topk).
topk = 3
explainer = shap.Explainer(
predict, masker, output_names=class_names
)
shap_values = explainer(
X[index], max_evals=1000,
batch_size=50,
outputs=shap.Explanation.argsort.flip[:topk],
silent=True
)
The above code calculates all SHAP values. However, with Image SHAP, the
raw values aren’t particularly useful, so let’s visualize them. The SHAP values
are optimally visualized by overlaying them on the original image. Below are the
selected images:
shap.image_plot(shap_values, pixel_values=X[index]/255)
Before we dive into the interpretation of the SHAP values, notice the level on
which we computed the SHAP values: The image is split into smaller rectangles,
a bit like a chessboard, and we get a SHAP value for each rectangle. That’s
automatically done by the Partition Explainer. Now, let’s interpret these SHAP
values: The burger is misclassified as the top class is “bottlecap”. The SHAP
values show that the top part of the burger primarily influenced the bottlecap
classification. The middle parts of the burger negatively impacted the classifica-
tion, but the top part resembled a bottlecap too closely for the ResNet model.
The second class, however, is cheeseburger, which is accurate. In this case, the
middle parts mainly contributed to the correct classification. The second image is
of a pocket watch. But, there’s no pocket watch class among the 1000 ImageNet
classes. Therefore, “chain” might be a reasonable classification. The chain part
of the watch is the correct reason for this classification.
One disadvantage of calculating the SHAP values for multiple images is that the
color scale for the Shapley values, seen at the bottom, applies to all images. As
the watch has larger values, the values for the cheeseburger are scaled closer to
white.
Let’s learn about the different maskers, as they’re more than just a “set-and-
forget” parameter.
These maskers don't make strong assumptions about the missing data but use algorithms to “guess” the missing part based on the rest of the image that isn't missing.
Specifically, here are the options in SHAP:
• inpaint_telea: Telea inpainting fills the missing area using a weighted
average of neighboring pixels, based on a specified radius.
• inpaint_ns: NS (Navier-Stokes) inpainting is based on fluid dynamics and
uses partial differential equations.
• blur(16, 16): Blurring depends on kernel size, which can be set by the
user. Larger values involve distant pixels in the blurring process, and blur-
ring is faster than inpainting.
The shap package uses the cv2 package for inpainting and blurring operations.
Below are examples of how these different options look, primarily featuring melt-
ing burgers.
sh = X[0].shape
# maskers to compare (this list is assumed from the panels shown below)
mask_names = ['inpaint_telea', 'inpaint_ns', 'blur(128,128)', 'blur(16,16)']
masks = [shap.maskers.Image(m, sh) for m in mask_names]
axs[i, j].set_xticks([])
axs[i, j].set_yticks([])
axs[i, j].tick_params(axis='both', which='both', length=0)
ind += 1
plt.show()
[Example panels: the cheeseburger image with its lower half masked by the inpaint_telea, inpaint_ns, and blur maskers.]
The upper half of the image is “present” and the lower half is “absent”. This
mirrors the SHAP values concept, where the image is divided into two “players”:
the upper and lower halves. This is also akin to the Partition Explainer, which
further segments along the x and y axes.
Unlike replacing all missing values with gray pixels, maskers don’t entirely alter
the image. Blurring even maintains the original data, merely smoothing it and
causing some information loss. For instance, the blur(16, 16) kernel doesn’t
significantly alter the image; the bottom of the burger remains fairly recognizable.
Next, let’s examine the effect of the masks on the SHAP values.
topk = 3
[Image plots of SHAP values for the top 3 classes (bottlecap, cheeseburger, bagel), computed with the inpaint_telea, inpaint_ns, blur(128, 128), and blur(16, 16) maskers.]
The masker choice does influence the explanation of the cheeseburger classifica-
tion. The mistaken classification identified the cheeseburger as a bottle cap. All
maskers concurred that the most influencing pixels were at the top, except for the
blur(16, 16) masker, which highlighted the side pixels. Nonetheless, the blur(16,
16) masker appears somewhat unreliable, as it doesn’t obscure much. Even to
me, the bottom part of the image is recognizable as a burger, indicating that the
features haven’t been significantly obscured. In all instances, the pixels at the
bottom of the burger seemed to counter the bottle cap classification. For more
information on maskers, refer to the maskers chapter in the Appendix.
Another intriguing hyperparameter is the number of evaluation steps.
topk = 1
masker = shap.maskers.Image(
    'blur(128, 128)', shape=sh
)
[SHAP image plot for the top class (bottlecap), computed with only a few evaluations: coarse superpixels.]
shap_values = explainer(
    X[[21]],
    max_evals=100,
    batch_size=50,
    outputs=shap.Explanation.argsort.flip[:topk]
)
shap.image_plot(shap_values, pixel_values=X[[21]]/255, width=4)
[SHAP image plot for the top class (bottlecap) with max_evals=100: finer superpixels.]
• With more evaluations, we acquire more detailed superpixels, because the Partition Explainer can split the image further. With only 10 evaluations, we see 4 leaves, requiring a tree depth of 2, which leads to 4 individual SHAP values and 2 for the first split.
• The SHAP values decrease as the partitions and the number of pixels get
smaller. This is because masking fewer pixels tends to have a lower impact
on the prediction.
• The computation time grows linearly with the number of evaluations.
However, the Partition Explainer isn’t the only option for image classifiers. In
the next chapter, we’ll discuss how to generate pixel-level explanations.
14 Deep and Gradient Explainer
In the previous chapter, we explored the Partition Explainer, which treated larger
image patches as features for SHAP. In this chapter, we will explain image classi-
fier classifications using a different approach, akin to tabular data. This involves
two key aspects: the features are now individual pixels rather than larger image patches, and absent pixels are simulated by sampling from a background dataset.
Since we are working with a neural network, two model-specific tools are avail-
able:
• The Gradient Explainer, as neural networks often rely on gradients.
• The Deep Explainer, which utilizes neural network layers to backpropagate
SHAP values.
Both methods are discussed in greater detail in the Estimation Appendix. For
this example, we will use the MNIST dataset. The MNIST dataset contains
70,000 handwritten digits (0-9), each represented as a 28x28 pixel grayscale im-
age. The goal of the MNIST task is to create a machine learning algorithm that
can accurately classify these images into their corresponding digit categories. This
well-established benchmark problem has been used to evaluate the performance
of various algorithms, including neural networks, decision trees, and support vec-
tor machines. Researchers in machine learning, computer vision, and pattern
recognition have extensively used the MNIST dataset.
Why did I choose the MNIST dataset instead of ImageNet, as in the previous
example? Because using pixel-wise explanations with Gradient or Deep Explainer
requires sampling absent features using a background dataset.
Imagine using the ImageNet dataset, where we have an image of a burger and
replace the “absent” pixels with pixels from a dog image – it would result in
a strange outcome. However, for the MNIST dataset, this approach is more
reasonable, as digits are more similar to each other, and replacing some pixels
of a “2” with those of a “3” won’t generate bizarre images. I acknowledge that
this is a somewhat vague argument, but generally, explanations for images can
be more challenging.
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
Dense, Dropout, Flatten, Conv2D, MaxPooling2D
)
from tensorflow.keras.utils import to_categorical
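The data loading and model definition are not shown in this excerpt; a minimal stand-in that fits the imports above (the exact architecture used in the book may differ) could be:
# Load MNIST and add a channel dimension for the Conv2D layers
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)

# A small convolutional network (architecture is an assumption)
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax')
])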
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# Compile model
model.compile(
loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
# Train model
model.fit(
x_train,
y_train,
batch_size=128,
epochs=5,
validation_data=(x_test, y_test)
)
score = model.evaluate(x_test, y_test, verbose=0)
Next, we evaluate the model’s performance:
import shap
import time
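The construction of the explainer is also not shown; a sketch using the Gradient Explainer (the background size and the three explained test images are assumptions matching the plots below) could be:
# A sample of training images serves as background data
background = x_train[:100]
explainer = shap.GradientExplainer(model, background)
# SHAP values for the first three test images; one array per output class
shap_values = explainer.shap_values(x_test[:3])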
The output, shap_values, is a list with a length equal to the number of classes
(10 in this case):
print(len(shap_values))
10
print(shap_values[0].shape)
The first dimension represents the number of images for which we computed the
SHAP values. The remaining dimensions contain the SHAP values in the form
of an image, because the input data was an image.
Now, let’s plot the SHAP values:
shap.image_plot(shap_values, x_test[:3])
Figure 14.1: SHAP values for the input pixels of different input images (one per
row). The first column shows the input image, then each column
shows the SHAP values for the classes from 1 to 9.
In the plot, red pixels contributed positively to the respective class, while blue
pixels contributed negatively. Grey pixels have a near zero SHAP value. The first
row shows a 7 and, for example, in the 8th column we see positive contributions
of the pixels that make up a 7. The second row shows the image of a 2 and we can
see that especially the start and end of the “2” contributed positively to the class
“2” (3rd column). The start of the “2” and the slope in the middle contributed
negatively to a prediction of “1”.
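The corresponding Deep Explainer code is not shown in this excerpt; a sketch that mirrors the Gradient Explainer call above:
# Deep Explainer: propagates SHAP values through the network layers
deep_explainer = shap.DeepExplainer(model, background)
shap_values = deep_explainer.shap_values(x_test[:3])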
shap.image_plot(shap_values, x_test[:3])
The SHAP values are very similar to the Gradient Explainer. This makes sense,
since in both cases the result should be the same SHAP values and the difference
is only due to the fact that both are approximations.
14.4 Time Comparison
We measured the time for both the gradient and deep explainer.
Let’s examine the results:
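The timing code and its exact results are not shown in this excerpt; a simple comparison, reusing the explainers sketched above, could be:
start = time.time()
_ = explainer.shap_values(x_test[:3])
gradient_time = time.time() - start

start = time.time()
_ = deep_explainer.shap_values(x_test[:3])
deep_time = time.time() - start

print(f"Gradient: {gradient_time:.1f}s, Deep: {deep_time:.1f}s")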
In theory, to obtain a reliable time comparison, you should repeat the calls several
hundred times and avoid using other programs simultaneously. However, this
comparison was conducted only once on my MacBook with an M1 chip. Keeping
this in mind, we see a bit of a difference. Since only the CPU is used, the efficiency
of calling the model could improve if a GPU were involved. But, I’d rather use
the Gradient Explainer.
15 Explaining Language Models
Let’s explore text-based models. All models in this chapter have one thing in
common: their inputs are text. However, we’ll encounter two distinct types of
outputs: a single score per input text (as in classification), and generated text (as in text-to-text models).
While they might seem different at first glance, they’re quite similar upon closer
inspection. We’ll start with the simpler case where a model outputs a single score
for a text input, like in classification, for instance, determining the category of a
news article.
In this chapter, we’ll mainly work with transformers, which are state-of-the-art
for text-based machine learning. However, keep in mind, SHAP values are model-
agnostic. So it doesn’t matter if the underlying model is a transformer neural
network or a support vector machine that works with engineered features, like
TF-IDF (Term Frequency-Inverse Document Frequency).
For a text-to-text model, the output is not a single scalar, but we can make it so by examining the score for a specific output token instead of the whole generated text.
The features in both text classification and text-to-text models are text-based.
However, it’s not as simple as it sounds, because it’s not words that are fed into
the neural network but numbers: in state-of-the-art neural networks, the text is first split into tokens, which are then mapped to embeddings. Tokens are typically smaller
than words, and there are numerous methods to tokenize text.
• By character
• By token
• By word
• By sentence
• By paragraph
• And everything in between
The choice depends on the specific application, and we’ll explore various examples
throughout this chapter. Consider the task of sentiment analysis. The sentence
“I returned the item as it didn’t work.” might have a predicted score of -0.5
indicating a negative sentiment.
The aim of SHAP is to attribute this score to the input words. If you choose to
attribute the prediction at the word level, you will obtain one SHAP value for
each word: [“I”, “returned”, “the”, “item”, “as”, “it”, “didn’t”, “work”].
Each word acts as a team player, and the -0.5 score is fairly distributed among
them.
15.3 Removing players in text-based scenarios
An interesting question arises: how do you simulate the absence of play-
ers/features in text? In theory, you have multiple options: delete the token from the text, replace it with a neutral placeholder such as “...”, or replace it with the tokenizer's mask token.
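The pipeline and input setup for this example are not shown in the excerpt; a sketch consistent with the output and plots that follow (the exact input sentence is an assumption inferred from the token plots below):
import shap
from transformers import pipeline

# A sentiment-analysis pipeline that returns scores for both classes
model = pipeline("sentiment-analysis", return_all_scores=True)
# Assumed input text, inferred from the tokens shown in the plots below
s = ["This product was a scam. It was more about marketing than technology."]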
{'label': 'POSITIVE', 'score': 0.0003391578793525696}
import shap
explainer = shap.Explainer(model)
shap_values = explainer(s)
print("expected: %.2f" % shap_values.base_values[0][1])
print("prediction: %.2f" % model(s)[0][1]['score'])
shap.plots.bar(shap_values[0, :, 'POSITIVE'])
expected: 0.31
prediction: 0.00
[Bar plot of SHAP values toward the POSITIVE class: "sc" -0.57, "am" -0.07, remaining tokens close to 0.]
That was straightforward, wasn’t it? The term “sc” appears to contribute the
most to the negative sentiment. However, the splitting of “scam” into “sc” and
“am” isn’t ideal for interpretation. This issue emerges from our masking of the
input, in which the choice of tokenizer influences how the text is masked. We can
see that the Partition explainer was used since the clustering is displayed here as
well. So we can just add up the two SHAP values of “sc” and “am” to get the
SHAP value for “scam”, but it would be more elegant to compute SHAP values
based on better tokenization.
masker = shap.maskers.Text(tokenizer=r"\W+")
explainer = shap.Explainer(model, masker=masker)
shap_values = explainer(s)
shap.plots.bar(shap_values[0, :, 'POSITIVE'])
[Bar plot of SHAP values toward the POSITIVE class: "scam" -0.69, "was" -0.17, "This" +0.01, "product" +0.01.]
Now it’s evident that “scam” is the most relevant term for a negative classifica-
tion. The tokenizer is highly adaptable. To illustrate this, let’s consider another
example using SHAP values calculated on sentences. In this scenario, the to-
kenizer is simple and breaks the input at periods “.”. We’ll experiment with a
lengthier input text to observe the contribution of each sentence.
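The code for this sentence-level example is not shown in the excerpt; a sketch based on the settings described below (splitting at periods and using a whitespace as mask token) could be:
s2 = ("This product was a scam. It was more about marketing than technology. "
      "But that's why I loved it. Learned a bunch about marketing that way.")
masker = shap.maskers.Text(
    tokenizer=r"\.", mask_token=' ', collapse_mask_token=True
)
explainer = shap.Explainer(model, masker=masker)
shap_values = explainer([s2])
print("expected: %.2f" % shap_values.base_values[0][1])
print("prediction: %.2f" % model(s2)[0][1]['score'])
shap.plots.bar(shap_values[0, :, 'POSITIVE'])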
expected: 0.75
prediction: 1.00
[Bar plot of sentence-level SHAP values toward POSITIVE: "But that's why I loved it." +0.58; the other sentences contribute negatively.]
“But that’s why I loved it” has a huge positive contribution to the sentiment,
more than the two negative sentences combined. I’ve used a whitespace ” ” as
the masking token, which seems suitable for dropping a sentence. The default
replacement token for text is “…”, but generally, if a tokenizer is provided, the
.mask_token attribute is utilized, assuming the tokenizer has this attribute.
To illustrate “extreme” masking, let’s replace a removed sentence with a specific
one instead of leaving it blank. The collapse_mask_token=True argument en-
sures that if two tokens in a row are replaced by the mask_token, the token is
only added once. In the ensuing example, sentences are replaced with “I love it”,
but only once consecutively.
Consider the sentence: “This product was a scam. It was more about marketing
than technology. But that’s why I loved it. Learned a bunch about marketing
that way.” Let’s analyze the marginal contribution of “Learned a bunch about
marketing” when added to an empty set, by comparing these two sentences:
“I love it. Learned a bunch about marketing that way.” versus “I love it.”
If collapse_mask_token=False, we would compare “I love it. I love it. I love it.
Learned a bunch about marketing that way.” with “I love it. I love it. I love it.
I love it.” Therefore, it often makes sense to set collapse_mask_token to True.
In theory, you could also create a custom masker.
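The code for the "I love it" replacement is not shown in the excerpt; it presumably mirrors the "I hate it" version further below:
masker = shap.maskers.Text(
    tokenizer=r"\.",
    mask_token='I love it',
    collapse_mask_token=True
)
explainer = shap.Explainer(model, masker=masker)
shap_values = explainer([s2])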
print("expected: %.2f" % shap_values.base_values[0][1])
print("prediction: %.2f" % model(s2)[0][1]['score'])
shap.plots.bar(shap_values[0, :, 'POSITIVE'])
expected: 1.00
prediction: 1.00
masker = shap.maskers.Text(
tokenizer=r"\.",
mask_token='I hate it',
collapse_mask_token=True
)
explainer = shap.Explainer(model, masker=masker)
shap_values = explainer([s2])
print("expected: %.2f" % shap_values.base_values[0][1])
print("prediction: %.2f" % model(s2)[0][1]['score'])
shap.plots.bar(shap_values[0, :, 'POSITIVE'])
expected: 0.00
prediction: 1.00
[Bar plot: "But that's why I loved it." contributes +0.75 toward POSITIVE.]
Here, the replacement acts as a reference. In one scenario, any sentence removed
from the coalition is replaced with “I love it,” and in the other scenario, it’s
replaced with “I hate it.”
What changes are the base values; they shift from strongly positive to negative, as
can be inferred from the difference in the base value. Every sentence is now inter-
preted in contrast to the replacement. This was also true earlier, but previously
we replaced it with an empty string, which is more neutral than the sentences
provided.
Warning
Avoid using extreme masking tokens as they might not make sense. How-
ever, more specific tokens can be beneficial. This highlights the importance
of masking, which serves as background data. Consider the replacement
carefully and test alternatives if necessary.
masker = shap.maskers.Text(
tokenizer=r"\.", mask_token='...', collapse_mask_token=True
)
explainer = shap.Explainer(model, masker=masker)
shap_values = explainer([s2])
print("expected: %.2f" % shap_values.base_values[0][1])
print("prediction: %.2f" % model(s2)[0][1]['score'])
shap.plots.bar(shap_values[0, :, 'POSITIVE'])
expected: 0.96
prediction: 1.00
masker = shap.maskers.Text(
tokenizer=r"\.", mask_token=' ', collapse_mask_token=True
)
explainer = shap.Explainer(model, masker=masker)
shap_values = explainer([s2])
print("expected: %.2f" % shap_values.base_values[0][1])
print("prediction: %.2f" % model(s2)[0][1]['score'])
shap.plots.bar(shap_values[0, :, 'POSITIVE'])
expected: 0.75
prediction: 1.00
[Bar plot: "But that's why I loved it." contributes +0.58 toward POSITIVE.]
Despite the overall attribution not changing significantly, except for the sign
change in the “marketing” sentence, which was close to zero, the base value
changes considerably.
Experiment with it, generate some text, and make a qualitative judgment about
whether it makes sense. To understand more about maskers, refer to the maskers
chapter in the Appendix.
model2 = shap.models.TransformersPipeline(
model, rescale_to_logits=True
)
Like the original transformer, you can make predictions with this model:
model2(s)
Now let’s see how this impacts the explanations with SHAP:
explainer2 = shap.Explainer(model2)
shap_values2 = explainer2(s)
shap.plots.bar(shap_values2[0,:, 'POSITIVE'])
[Bar plot of SHAP values on the logit scale: "sc" -6.46, "am" -1.95.]
This can be viewed as a classification task where the goal is to determine the
next token based on some text input. Consider early large language models that
aimed to produce the next words given an input text. For example:
Input text: “Is this the Krusty Krab?” Output text: “No! This is Patrick!”
In the context of text-to-text models, each output token is considered an indi-
vidual prediction, much like in multi-class classification. We can compute SHAP
values for each token.
If the tokenized input has a length of 𝑛 and the tokenized output length is 𝑚, we
derive 𝑛 ⋅ 𝑚 SHAP values. The level of input tokenization is user-controllable. In
the example above, if the user opts for word-level tokenization for the input, the
first token of the output, i.e., “No”, receives 𝑛 SHAP values. The next token “!”
gets 𝑛 SHAP values, and so on.
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
1: https://shap.readthedocs.io/en/latest/example_notebooks/text_examples/text_generation/Open%20Ended%20GPT2%20Text%20Generation%20Explanations.html
import torch
The result: “He insulted Italian cuisine by throwing spaghetti at the tables and
throwing rocks at the local biker’s peloton, and he did something about it by
showing that she needs a change”
Now, we can obtain the SHAP explanations for the first generated word “throw-
ing”.
torch.manual_seed(0)
# Setting the model to decoder to prevent input repetition.
model.config.is_decoder = True
explainer = shap.Explainer(model, tokenizer)
shap_values = explainer([input_text])
shap.plots.waterfall(shap_values[0, :, 5 + 1])
[Waterfall plot for the explained output token: f(x) = 3.653; "by" and "He" contribute -0.18 and -0.07, respectively.]
The word “insulted” positively contributed to this word, while all others had a
negative contribution.
As general text-to-text models can now handle these tasks, the given example
should suffice.
2 https://shap.readthedocs.io/en/latest/example_notebooks/text_examples/question_answering/Explaining%20a%20Question%20Answering%20Transformers%20Model.html
3 https://shap.readthedocs.io/en/latest/example_notebooks/text_examples/summarization/Abstractive%20Summarization%20Explanation%20Demo.html
4 https://shap.readthedocs.io/en/latest/example_notebooks/text_examples/translation/Machine%20Translation%20Explanations.html
16 Limitations of SHAP
1 https://christophm.github.io/interpretable-ml-book/pdp.html
function $f(x) = \prod_{j=1}^{p} x_j$, which served as the "model" but was actually a predefined function for studying SHAP values. This function is purely multiplicative.
All features were independent and had an expected value of zero. Given this in-
formation, what would you anticipate for the SHAP importance of each feature?
I would expect all SHAP importances to be equal. However, the two features have different scales, such as 𝑋1 ranging from -1 to +1 and 𝑋2 ranging from -1000 to +1000. While 𝑋2 appears to have a greater influence on the prediction due to its larger scale, the SHAP importance of both features is the same. This is because knowing 𝑋1 is as important as knowing 𝑋2 : 𝑋2 has a much larger range, but that does not increase its SHAP importance, because 𝑋1 can set the prediction to zero or flip its sign. This shows that interpreting interactions is not always intuitive. In
the Interaction Chapter, we also observed how interactions are divided between
features and often exhibit a combination of local and global effects. Keep this
peculiarity in mind.
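As a quick check of this behavior, here is a minimal sketch (not from the book): the uniform feature distributions and the choice of the Exact estimator are assumptions, but the mean absolute SHAP values of the two features come out roughly equal despite the very different scales.

import numpy as np
import shap

rng = np.random.default_rng(0)
# Two independent features with very different scales
X = np.column_stack([
    rng.uniform(-1, 1, 1000),        # X1 in [-1, 1]
    rng.uniform(-1000, 1000, 1000),  # X2 in [-1000, 1000]
])

def f(X):
    # Purely multiplicative "model"
    return X[:, 0] * X[:, 1]

explainer = shap.explainers.Exact(f, shap.maskers.Independent(X))
shap_values = explainer(X[:100])

# Mean absolute SHAP value per feature ("SHAP importance"):
# both features end up with roughly the same importance.
print(np.abs(shap_values.values).mean(axis=0))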
Bear in mind that SHAP values are but one example of an attribution method.
And even attribution methods are just a subset of methods in interpretable ma-
chine learning.
For actionable recommendations, consider using counterfactual explanations2 .
Additionally, ensure the model itself offers actionable advice by using representa-
tive data samples and modeling causal relationships.
2 https://christophm.github.io/interpretable-ml-book/counterfactual.html
16.9 You can fool SHAP
It’s possible to create intentionally misleading interpretations with SHAP, which
can conceal biases (Slack et al. 2020), at least if you use the marginal sampling
version of SHAP. If you are the data scientist creating the explanations, this is
not an issue (it could even be advantageous if you are an unscrupulous data sci-
entist who wants to create misleading explanations). However, for the recipients
of a SHAP explanation, it is a disadvantage as they cannot be certain of the
explanation’s truthfulness.
17 Building SHAP Dashboards with
Shapash
Shapash is a package that utilizes SHAP (or LIME) to compute contributions and
visualize them in a dashboard or report. The dashboard is more convenient for
exploring explanations than iterating through explanations in a Jupyter notebook
or Python script, enabling exploratory data analysis.
17.1 Installation
You can install Shapash using pip:
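pip install shapash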
import sklearn
import shap
import shapash
import pandas as pd
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor
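# The original dataset definition is not shown in this excerpt.
# As a stand-in (an assumption, not the book's choice), any tabular regression
# dataset works, e.g. California housing from scikit-learn:
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)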
# Train a model
model = LGBMRegressor(n_estimators=100).fit(X_train, y_train)
y_pred = pd.DataFrame(
model.predict(X_test), columns=['pred'], index=X_test.index
)
xpl = shapash.SmartExplainer(model=model)
xpl.compile(y_pred=y_pred, x=X_test)
Figure 17.1: Shapash Webapp
• Top left: Displays feature importance.
• Bottom left: Feature dependence plots. The feature can be changed by
clicking on the importance graph.
• Top right: Contains multiple tabs, including raw data values, filters for a
subset view on the importance graph, and a true versus predicted plot.
• Bottom right: Presents a bar plot of SHAP values. Select an instance in the
top-right graph using the dataset picker or in the bottom-left feature con-
tribution figure. The graph can be customized, for example by displaying
only the most influential features.
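The call that launches the dashboard is not shown in this excerpt; with Shapash it is typically a one-liner on the compiled explainer:

app = xpl.run_app()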
app.kill()
18 Alternatives to the shap Library
Several alternative implementations of shap and SHAP values are available.
In Python:
• Captum1 (Kokhlikyan et al. 2020): A comprehensive model interpretabil-
ity library, providing KernelShap, Sampling estimator, GradientShap, and
Deep Shap implementations.
• shapley2 (Rozemberczki et al. 2022): Offers the exact estimator, several
linear explanation methods, and Monte Carlo permutation sampling.
Python packages that utilize shap internally:
• DALEX3 (Baniecki et al. 2021)
• AIX3604 (Arya et al. 2019)
• InterpretML5 (Nori et al. 2019) encompasses multiple methods including
SHAP.
• OmniXAI6 (Yang et al. 2022), a library dedicated to explainable AI.
• shapash7 , designed for dashboards and reports, as discussed in this chapter.
In R:
• DALEX8 (Biecek 2018)
• kernelshap9
• shapr10 (Sellereite and Jullum 2019)
1 https://github.com/pytorch/captum
2 https://github.com/benedekrozemberczki/shapley
3 https://github.com/ModelOriented/DALEX
4 https://github.com/Trusted-AI/AIX360
5 https://github.com/interpretml/interpret/
6 https://github.com/salesforce/OmniXAI
7 https://github.com/MAIF/shapash
8 https://github.com/ModelOriented/DALEX
9 https://github.com/ModelOriented/kernelshap
10 https://github.com/NorskRegnesentral/shapr
• ShapleyR11
• shapper12 , which depends on the Python shap package.
• shapviz13 reproduces many SHAP plots from the original Python shap pack-
age and includes additional ones.
• treeshap14
• iml15 (Molnar et al. 2018).
11 https://github.com/redichh/ShapleyR
12 https://github.com/ModelOriented/shapper
13 https://github.com/ModelOriented/shapviz
14 https://github.com/ModelOriented/treeshap
15 https://github.com/christophM/iml
19 Extensions of SHAP
This chapter outlines some extensions of SHAP values, with a focus on adapta-
tions for specific models or data structures, while preserving their role in explain-
ing predictions.
19.3 n-Shapley values
The n-Shapley values (Bordt and Luxburg 2022) link SHAP values with GAMs (generalized additive models), where n denotes the interaction depth considered in the Shapley value calculation, which in turn corresponds to the interaction depth the GAM was trained with. A standard GAM includes no interactions (n=1), but interactions can be incorporated.
The paper illustrates the relationship between n-Shapley values and functional
decomposition, providing the nshap package in Python1 for implementation. If
the model being explained is a GAM, SHAP recovers all non-linear components,
as outlined in the Additive Chapter.
1 https://github.com/tml-tuebingen/nshap
19.6 Counterfactual SHAP
As mentioned in the Limitations Chapter, SHAP values may not be the best
choice for counterfactual explanations. A counterfactual explanation is a con-
trastive explanation that clarifies why the current prediction was made instead
of a counterfactual outcome. This is crucial in recourse situations when someone
affected by a decision wants to challenge a prediction, such as a creditworthiness
classifier. Counterfactual SHAP (or CF-SHAP) (Albini et al. 2022) introduces
this approach to SHAP values through careful selection of background data.
accounting for the usage of attributes by the model via explanation distributions.
Consequently, equal treatment implies equal outcomes, but the converse is not
necessarily true. Equal treatment is implemented within the explanationspace4
package, which includes tutorials.
4 https://explanationspace.readthedocs.io/en/latest/Python
20 Other Uses of Shapley Values in
Machine Learning
This chapter explores the various applications of Shapley values in tasks within
the machine learning field beyond prediction explanations. While the Extensions
Chapter focused on extensions for explaining predictions, this chapter introduces
other tasks in machine learning and data science that can benefit from the use of
Shapley values. The information in this chapter is based on the overview paper
by Rozemberczki et al. (2022).
20.2 Feature selection
Feature selection, a process closely related to SAGE, involves identifying the fea-
tures that best enhance model performance. Shapley values can be incorporated
into the modeling process, as suggested by Guyon and Elisseeff (2003) and Fryer
et al. (2021). By repeatedly training the model using different features, it is pos-
sible to estimate each feature's importance for performance. However, Fryer et al. (2021) argue that Shapley values may not be ideally suited for this task, as they excel at attribution rather than selection. In this setup, each feature is treated as a player, and the model's performance is the payoff. Unlike SHAP values for predictions, the contribution of each feature is evaluated globally for the entire model; the ultimate goal is to ascertain each feature's contribution to the model's performance.
20.5 Federated learning
Federated learning is a method to train a machine learning model on private
data distributed across multiple entities, such as hospitals. In this context, each
hospital is a player that contributes to the training of the model. Federated
learning facilitates training models across various private datasets while main-
taining privacy. The payout is the model's goodness of fit or another performance metric, and the Shapley values estimate each entity's contribution to it.
21 Acknowledgments
This book stands on the shoulders of giants. In particular, these giants are
the researchers working on Shapley values and SHAP, with a special mention of
Scott Lundberg who played a pivotal role in bringing Shapley values to machine
learning.
I would also like to thank all my beta readers for their invaluable feedback. In
no particular order: Carlos Mougan, Bharat Raghunathan, Junaid Butt, Joshua
Le Cornu, Vaibhav Krishna Irugu Guruswamy, Liban Mohamed, Tim Triche,
Ronald Richman, Germán García, Jeff Herman, Zachary Duey, Sven Kruschel,
Joaquín Bogado, Shino Chen, Sairam Subramanian, Kerry Pearn, Gavin Parnaby,
Robert Martin, Andrea Ruggerini, Arved Niklas Fanta, Saman Parvaneh, Simon
Prince, Marouane Il Idrissi, Kranthi Kamsanpalli, Enrico Roletto, David Cortés,
Valentino Zocca, HaveF, and Johannes Widera.
And of course a big thanks goes to Heidi, my wife, who always has to put up with
my ramblings when I learn something new and just have to tell someone. The
cover art (the team of meerkats) was created by jeeshiu from Fiverr1 .
1 https://www.fiverr.com/jeeshiu
More From The Author
References
Aas K, Jullum M, Løland A (2021) Explaining individual predictions when fea-
tures are dependent: More accurate approximations to shapley values. Artifi-
cial Intelligence 298:103502
Albini E, Long J, Dervovic D, Magazzeni D (2022) Counterfactual shapley addi-
tive explanations. In: 2022 ACM conference on fairness, accountability, and
transparency. pp 1054–1070
Arya V, Bellamy RKE, Chen P-Y, et al (2019) One explanation does not fit all:
A toolkit and taxonomy of AI explainability techniques2
Bach S, Binder A, Montavon G, et al (2015) On pixel-wise explanations for
non-linear classifier decisions by layer-wise relevance propagation. PloS one
10:e0130140
Baniecki H, Kretowicz W, Piatyszek P, et al (2021) Dalex: Responsible machine
learning with interactive explainability and fairness in python3 . Journal of
Machine Learning Research 22:1–7
Biecek P (2018) DALEX: Explainers for complex predictive models in r4 . Journal
of Machine Learning Research 19:1–5
Bloch L, Friedrich CM, Initiative ADN (2021) Data analysis with shapley values
for automatic subject selection in alzheimer’s disease data sets using inter-
pretable machine learning. Alzheimer’s Research & Therapy 13:1–30
Bordt S, Luxburg U von (2022) From shapley values to generalized additive mod-
els and back. arXiv preprint arXiv:220904012
Caruana R, Lou Y, Gehrke J, et al (2015) Intelligible models for healthcare:
Predicting pneumonia risk and hospital 30-day readmission. In: Proceedings
of the 21st ACM SIGKDD international conference on knowledge discovery
and data mining. pp 1721–1730
Chen H, Janizek JD, Lundberg S, Lee S-I (2020) True to the model or true to
the data? arXiv preprint arXiv:200616234
2 https://arxiv.org/abs/1909.03012
3 http://jmlr.org/papers/v22/20-1473.html
4 http://jmlr.org/papers/v19/18-416.html
Chen H, Lundberg S, Lee S-I (2021) Explaining models by propagating shap-
ley values of local components. Explainable AI in Healthcare and Medicine:
Building a Culture of Transparency and Accountability 261–270
Chen J, Song L, Wainwright MJ, Jordan MI (2018) L-shapley and c-
shapley: Efficient model interpretation for structured data. arXiv preprint
arXiv:180802610
Covert I, Lundberg SM, Lee S-I (2020) Understanding global feature contribu-
tions with additive importance measures. Advances in Neural Information
Processing Systems 33:17212–17223
Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical
image database. In: 2009 IEEE conference on computer vision and pattern
recognition. IEEE, pp 248–255
Elish MC, Watkins EA (2020) Repairing innovation: A study of integrating AI
in clinical care. Data & Society
Frye C, Rowat C, Feige I (2020) Asymmetric shapley values: Incorporating causal
knowledge into model-agnostic explainability. Advances in Neural Informa-
tion Processing Systems 33:1229–1239
Fryer D, Strümke I, Nguyen H (2021) Shapley values for feature selection: The
good, the bad, and the axioms. IEEE Access 9:144352–144360
García MV, Aznarte JL (2020) Shapley additive explanations for NO2 forecasting.
Ecological Informatics 56:101039
Ghorbani A, Zou J (2019) Data shapley: Equitable valuation of data for machine
learning. In: International conference on machine learning. PMLR, pp 2242–
2251
Grinsztajn L, Oyallon E, Varoquaux G (2022) Why do tree-based models still
outperform deep learning on tabular data? arXiv preprint arXiv:220708815
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection.
Journal of machine learning research 3:1157–1182
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recogni-
tion. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp 770–778
Heskes T, Sijben E, Bucur IG, Claassen T (2020) Causal shapley values: Exploit-
ing causal knowledge to explain individual predictions of complex models.
Advances in neural information processing systems 33:4778–4789
Jabeur SB, Mefteh-Wali S, Viviani J-L (2021) Forecasting gold price with the
XGBoost algorithm and SHAP interaction values. Annals of Operations Re-
search 1–21
Janzing D, Minorics L, Blöbaum P (2020) Feature relevance quantification in
explainable AI: A causal problem. In: International conference on artificial
intelligence and statistics. PMLR, pp 2907–2916
Johnsen PV, Riemer-Sørensen S, DeWan AT, et al (2021) A new method for
exploring gene–gene and gene–environment interactions in GWAS with tree
ensemble methods and SHAP values. BMC bioinformatics 22:1–29
Kim Y, Kim Y (2022) Explainable heat-related mortality with random forest
and SHapley additive exPlanations (SHAP) models. Sustainable Cities and
Society 79:103677
Kokhlikyan N, Miglani V, Martin M, et al (2020) Captum: A unified and generic
model interpretability library for PyTorch5
Kumar IE, Venkatasubramanian S, Scheidegger C, Friedler S (2020) Problems
with shapley-value-based explanations as feature importance measures. In:
International conference on machine learning. PMLR, pp 5491–5500
Lin K, Gao Y (2022) Model interpretability of financial fraud detection by group
SHAP. Expert Systems with Applications 210:118354
Lundberg SM, Erion G, Chen H, et al (2020) From local explanations to global
understanding with explainable AI for trees. Nature machine intelligence
2:56–67
Lundberg SM, Lee S-I (2017b) A unified approach to interpreting model predic-
tions. Advances in neural information processing systems 30:
Lundberg SM, Lee S-I (2017a) A unified approach to interpreting model predic-
tions6 . In: Guyon I, Luxburg UV, Bengio S, et al. (eds) Advances in neural
information processing systems 30. Curran Associates, Inc., pp 4765–4774
Miller T (2019) Explanation in artificial intelligence: Insights from the social
sciences. Artificial intelligence 267:1–38
Mitchell R, Cooper J, Frank E, Holmes G (2022) Sampling permutations for
shapley value estimation
Molnar C (2022) Interpretable machine learning: A guide for making black box
models explainable7 , 2nd edn.
Molnar C, Casalicchio G, Bischl B (2018) Iml: An r package for interpretable
machine learning. Journal of Open Source Software 3:786
Mougan C, Broelemann K, Kasneci G, et al (2022) Explanation shift: Detecting
distribution shifts on tabular data via the explanation space. arXiv preprint
arXiv:221012369
Mougan C, State L, Ferrara A, et al (2023) Demographic parity inspector: Fair-
ness audits via the explanation space. arXiv preprint arXiv:230308040
5 https://arxiv.org/abs/2009.07896
6 http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
7 https://christophm.github.io/interpretable-ml-book
Nori H, Jenkins S, Koch P, Caruana R (2019) InterpretML: A unified framework
for machine learning interpretability. arXiv preprint arXiv:190909223
Parsa AB, Movahedi A, Taghipour H, et al (2020) Toward safer highways, ap-
plication of XGBoost and SHAP for real-time accident detection and feature
analysis. Accident Analysis & Prevention 136:105405
Redell N (2019) Shapley decomposition of r-squared in machine learning models.
arXiv preprint arXiv:190809718
Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?" Explaining
the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD
international conference on knowledge discovery and data mining. pp 1135–
1144
Rodríguez-Pérez R, Bajorath J (2020) Interpretation of machine learning mod-
els using shapley values: Application to compound potency and multi-target
activity predictions. Journal of computer-aided molecular design 34:1013–
1026
Rozemberczki B, Watson L, Bayer P, et al (2022) The shapley value in machine
learning. arXiv preprint arXiv:220205594
Rudin C (2019) Stop explaining black box machine learning models for high stakes
decisions and use interpretable models instead. Nature machine intelligence
1:206–215
Scholbeck CA, Molnar C, Heumann C, et al (2020) Sampling, intervention, pre-
diction, aggregation: A generalized framework for model-agnostic interpre-
tations. In: Machine learning and knowledge discovery in databases: Inter-
national workshops of ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I. Springer, pp 205–216
Sellereite N, Jullum M (2019) Shapr: An r-package for explaining machine learn-
ing models with dependence-aware shapley values. Journal of Open Source
Software 5:2027. https://doi.org/10.21105/joss.02027
Shapley LS et al (1953) A value for n-person games
Shrikumar A, Greenside P, Kundaje A (2017) Learning important features
through propagating activation differences. In: International conference on
machine learning. PMLR, pp 3145–3153
Slack D, Hilgard S, Jia E, et al (2020) Fooling lime and shap: Adversarial at-
tacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM
conference on AI, ethics, and society. pp 180–186
Smith M, Alvarez F (2021) Identifying mortality factors from machine learning
using shapley values–a case of COVID19. Expert Systems with Applications
176:114832
Staniak M, Biecek P (2018) Explanations of model predictions with live and
breakDown packages. arXiv preprint arXiv:180401955
Štrumbelj E, Kononenko I (2010) An efficient explanation of individual classifi-
cations using game theory. The Journal of Machine Learning Research 11:1–
18
Štrumbelj E, Kononenko I (2014) Explaining prediction models and individual
predictions with feature contributions. Knowledge and information systems
41:647–665
Sundararajan M, Dhamdhere K, Agarwal A (2020) The shapley taylor interaction
index. In: International conference on machine learning. PMLR, pp 9259–
9268
Sundararajan M, Najmi A (2020) The many shapley values for model explanation.
In: International conference on machine learning. PMLR, pp 9269–9278
Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks.
In: International conference on machine learning. PMLR, pp 3319–3328
Tsai C-P, Yeh C-K, Ravikumar P (2023) Faith-shap: The faithful shapley inter-
action index. Journal of Machine Learning Research 24:1–42
Wang D, Thunéll S, Lindberg U, et al (2022) Towards better process management
in wastewater treatment plants: Process analytics based on SHAP values for
tree-based machine learning methods. Journal of Environmental Management
301:113941
Wang J, Wiens J, Lundberg S (2021) Shapley flow: A graph-based approach
to interpreting model predictions. In: International conference on artificial
intelligence and statistics. PMLR, pp 721–729
Yang W, Le H, Savarese S, Hoi S (2022) OmniXAI: A library for explainable AI.
https://doi.org/10.48550/ARXIV.2206.01612
A SHAP Estimators
This chapter presents the various SHAP estimators in detail.
Ĺ Exact Estimation
This method computes the exact SHAP value. It’s model-agnostic and is
only meaningful for low-dimensional tabular data (<15 features).
The exact estimation theoretically computes all 2𝑝 possible coalitions, from which
we can calculate all possible feature contributions for each feature, as discussed
in the Theory Chapter. It also uses all of the background data, not just a sample.
This means the computation has no elements of randomness – except that the
data sample is random. Despite the high computational cost, which depends on
the number of features and the size of the background data, this method uses all
available information and provides the most accurate estimation of SHAP values
compared to other model-agnostic estimation methods.
Here is how to use the exact method with the shap package:
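The code itself is missing from this excerpt; following the pattern of the other estimators in this appendix, a minimal call would look like this, with model and background standing for the fitted model and the background data:

explainer = shap.explainers.Exact(model, background)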
1 https://github.com/slundberg/shap/blob/master/shap/explainers/_exact.py
Because of this enumeration, the exact estimation uses an optimization method
called Gray code. Gray code is an effective ordering of coalitions where adjacent
coalitions only differ in one feature value, which can be directly used to compute
marginal contributions. This method is more efficient than enumerating all pos-
sible coalitions and adding features to them, as Gray code reduces the number of
model calls through more effective computation.
Exact SHAP values are often not feasible, but this issue can be addressed through
sampling.
Ĺ Sampling Estimator
This method works by sampling coalitions. It’s model-agnostic.
The Sampling Estimator was first proposed by Štrumbelj and Kononenko (2010) and later refined by Štrumbelj and Kononenko (2014). The sampling process involves two dimensions: sampling from the background data and sampling the coalitions.
To calculate the exact SHAP value, all possible coalitions (sets) of feature values
must be evaluated with and without the j-th feature. However, the exact solution
becomes problematic as the number of features increases due to the exponential
increase in the number of possible coalitions. Štrumbelj and Kononenko (2014)
proposed an approximation using Monte Carlo integration:
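The formula itself is not shown in this excerpt. In the usual notation, the Monte Carlo estimate for feature j averages differences in predictions over M randomly drawn coalitions and background instances:

$$\hat{\phi}_j = \frac{1}{M} \sum_{m=1}^{M} \left( \hat{f}\left(x^{(m)}_{+j}\right) - \hat{f}\left(x^{(m)}_{-j}\right) \right)$$

where $x^{(m)}_{+j}$ is the instance of interest with a random subset of feature values replaced by values from a background sample, except that feature j keeps its original value, and $x^{(m)}_{-j}$ is the same construction with feature j also replaced.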
explainer = shap.explainers.Sampling(model, background)
Ĺ Permutation Estimator
This method samples permutations of feature values and iterates through
them in both directions. It is model-agnostic.
• Our background data consists of only one data point for simplicity: (𝑥1 =
0, 𝑥2 = 0, 𝑥3 = 0, 𝑥4 = 0).
• We’ll examine two permutations and demonstrate that both result in the
same marginal contributions for feature 𝑥3 .
• Note, this doesn’t prove that all 2-way interactions are recoverable by per-
mutation, but it provides some insight into why it might work.
• The first permutation: (𝑥2 , 𝑥3 , 𝑥1 , 𝑥4 ).
– We have two marginal contributions: 𝑥3 to {𝑥2 } and 𝑥3 to {𝑥1 , 𝑥4 }.
– We denote 𝑓2,3 as the prediction where 𝑥2 and 𝑥3 values come from
the data point to be explained, and 𝑥1 and 𝑥4 values come from the
background data.
– Thus: 𝑓2,3 = 𝑓(𝑥1 = 0, 𝑥2 = 1, 𝑥3 = 1, 𝑥4 = 0) = 2 ⋅ 1 + 3 ⋅ 0 ⋅ 1 = 2
• The marginal contributions are 𝑓2,3 − 𝑓2 = 2 − 0 = 2 and 𝑓1,3,4 − 𝑓1,4 = 14.
• Now let’s consider a different permutation: (𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 ).
– This is the original feature order, but it is also a valid permutation.
– For this, we’ll compute different marginal contributions.
– 𝑓1,2,3 − 𝑓1,2 = 14.
– 𝑓3,4 − 𝑓4 = 2.
– And, unsurprisingly, these are the same marginal contributions as for
the other permutation.
• So, even with only a 2-way interaction, we had two different permutations
that we iterated forward and backward, and we obtained the same marginal
contributions.
• This suggests that adding more permutations doesn’t provide new informa-
tion for the feature of interest.
• This isn’t a proof, but it gives an idea of why this method works.
This type of sampling is also known as antithetic sampling and performs well
compared to other sampling-based estimators of SHAP values (Mitchell et al.
2022).
End of interlude
Here’s how to use the Permutation Estimator in shap:
explainer = shap.explainers.Permutation(model, background)
𝑓(𝑥) = 𝛽0 + 𝛽1 𝑥1 + … + 𝛽𝑝 𝑥𝑝 ,
The 𝛽's represent the weights or coefficients by which the features are multiplied to generate the prediction. The intercept 𝛽0 is a special coefficient that determines the output when all feature values are zero. The linear model form implies that there are no interactions2 and no non-linear relations. The SHAP values are simple to compute in this case, as discussed in the Linear Chapter. They are defined as:
$$\phi_j^{(i)} = \beta_j \cdot \left(x_j^{(i)} - \mathbb{E}(X_j)\right)$$
This formula also applies if you have a non-linear link function. It means that
the model isn’t entirely linear, as the weighted sum is transformed before making
2 While you can generally add interactions to a linear model, this option is not available for the Linear Estimator.
the prediction. This model class is known as generalized linear models (GLMs).
Logistic regression is an example of a GLM, and it's defined as:

$$f(x) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p))}$$

More generally, a GLM applies a non-linear transformation 𝑔 to the weighted sum:

$$f(x) = g(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p)$$
Even though the function is fundamentally linear, the result of the weighted sum
is non-linearly transformed. In this case, SHAP can still use the coefficients, and
the Linear Estimator remains applicable. However, it operates not on the level of
the prediction but on the level of the inverse of the function 𝑔, namely 𝑔−1 . For
logistic regression, it means that we interpret the results at the level of log odds.
Remember that this adds some complexity to the interpretation.
Here are some notes on implementation:
shap.explainers.Linear(model, background)
• To use the link function, set link in the explainer. The default is the
identity link. Learn more in the Classification Chapter.
• The SHAP implementation allows for accounting for feature correlations
when feature_perturbation is set to “correlation_dependent”. However,
this will result in a different “game” and thus different SHAP values. Read
more in the Correlation Chapter.
Ĺ Note
This model-specific estimator takes advantage of the lack of feature interac-
tion in additive models.
The Additive Estimator is a generalization of the Linear Estimator. While it still
assumes no interactions between features, it allows the effect of a feature to be
non-linear. This model class is represented as follows:
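The formulas themselves are not shown in this excerpt; based on the surrounding description, the additive model and the corresponding SHAP values presumably take the form:

$$f(x) = \beta_0 + f_1(x_1) + f_2(x_2) + \ldots + f_p(x_p)$$

$$\phi^{(i)}_j = f_j\left(x^{(i)}_j\right) - \mathbb{E}\left(f_j(X_j)\right)$$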
This equation is similar to the one used in the Linear Estimator. The first term denotes the effect of the feature value $x_j^{(i)}$, while the second term centers it at the expected effect of the feature $X_j$. However, different assumptions about the shape of the effect are required due to the use of different models (linear versus additive).
Like the Linear Estimator, the Additive Estimator can also be extended to non-
linear link functions:
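The formula is again missing from this excerpt; presumably it reads:

$$f(x) = g\left(\beta_0 + f_1(x_1) + \ldots + f_p(x_p)\right)$$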
When a link function is used, interpretation happens at the level of the linear
predictor, which isn’t on the scale of the prediction but on the inverse of the link
function 𝑔−1 .
The implementation details are as follows:
shap.explainers.Additive
Ĺ Kernel Estimator
The Kernel Estimator is no longer widely used in SHAP, although it’s still
available. Instead, the Permutation Estimator is now the preferred option.
This section remains for historical reasons. The Kernel Estimator was the
original SHAP implementation, proposed in Lundberg and Lee (2017b), and
it drew parallels with other attribution methods such as LIME (Ribeiro et
al. 2016) and DeepLIFT (Shrikumar et al. 2017).
the regression model. The target for the regression model is the prediction for
a coalition. (“Wait a minute!” you might say, “The model hasn’t been trained
on these binary coalition data and thus can’t make predictions for them.”) To
convert coalitions of feature values into valid data instances, we need a function $h_x(z') = z$ where $h_x: \{0,1\}^M \rightarrow \mathbb{R}^p$. The function $h_x$ maps 1's to the corresponding values from the instance x that we wish to explain and 0's to values drawn from the background data.
For SHAP-compliant weighting, Lundberg et al. propose the SHAP kernel:

$$\pi_x(z') = \frac{M - 1}{\binom{M}{|z'|} \, |z'| \, (M - |z'|)}$$
Here, M represents the maximum coalition size, and |𝑧 ′ | signifies the number of
features present in instance z’. Lundberg and Lee illustrate that using this kernel
weight for linear regression yields SHAP values. LIME, a method that functions
by fitting local surrogate models, operates similarly to Kernel SHAP. If you were
to employ the SHAP kernel with LIME on the coalition data, LIME would also
generate SHAP values!
We possess the data, the target, and the weights; everything necessary to con-
struct our weighted linear regression model:
$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j^{(i)} z_j'$$
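The loss that is minimized is not shown in this excerpt; in the Kernel SHAP formulation it is the weighted squared error

$$L(f, g, \pi_x) = \sum_{z' \in Z} \left[ f(h_x(z')) - g(z') \right]^2 \pi_x(z')$$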
Here, Z is the training data. This equation represents the familiar sum of squared errors that we typically optimize for linear models. The estimated coefficients of the model, the $\phi_j^{(i)}$'s, are the SHAP values.
As we are in a linear regression context, we can rely on standard tools for regres-
sion. For instance, we can incorporate regularization terms to make the model
sparse. By adding an L1 penalty to the loss L, we can create sparse explana-
tions.
Implementation details in shap:
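The call itself is missing from this excerpt; the (now deprecated) interface takes a prediction function (e.g. model.predict) and background data, roughly:

explainer = shap.KernelExplainer(model.predict, background)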
Ĺ Tree Estimator
This estimation method is specific to tree-based models such as decision trees,
random forests, and gradient boosted trees.
XGBoost. Boosted trees are particularly effective for tabular data, and having a
quick method to compute SHAP values positions the technique advantageously.
Moreover, it is an exact method, meaning you obtain the correct SHAP values
rather than mere estimates, at least with respect to the coalitions. It remains
an estimate in relation to the background data, given that the background data
itself is a sample.
The Tree Estimator makes use of the tree structure to compute SHAP values, and
it comes in two versions: Interventional and Tree-Path Dependent Estimator.
The Interventional Estimator calculates the usual SHAP values but takes advan-
tage of the tree structure for the computation. The estimation is performed by
recursively traversing the tree(s). Here is a basic outline of the algorithm for ex-
plaining one data point and one background data point. Bear in mind that this
explanation applies to a single tree; the strategy for an ensemble will be discussed
subsequently.
• We start with a background data point, which we’ll refer to as z, and the
data point we want to explain, known as x.
• The process begins at the top of the tree, tracing the paths for x and z.
• However, we don’t just trace the paths of x and z as they would merely
terminate at two or possibly the same leaf nodes.
• Instead, at each crossroad, we ask: what if the decision was based on the
feature values of x and z?
• If they differ, both paths are pursued.
• This combined path tracing is done recursively.
• Ah, recursion — always a mind boggler!
• Upon reaching the terminal nodes, or leaves, predictions from these leaf
nodes are gathered and weighted based on how many feature differences
exist in relation to x and z.
• These weights are recursively combined.
process and explore the tree paths to determine which coalitions would yield
different predictions since many feature changes may have no impact on the
prediction. The tricky part, which is addressed in SHAP, is accurately weighting
and combining predictions based on the altered features. For further details, refer
to p.25, Algorithm 3 of this paper3 by Lundberg et al. (2020).
For ensembles of trees, we can combine the SHAP values of the individual trees, weighting each tree by how its prediction contributes to the ensemble. Thanks to the additivity of SHAP values, the Shapley values of a tree ensemble are the (weighted) average of the individual trees' Shapley values.
The complexity of the Tree Estimator (over a background set of size 𝑛𝑏𝑔 ) is
𝒪(𝑇 𝑛𝑏𝑔 𝐿), where 𝑇 is the number of trees and 𝐿 is the maximum number of
leaves in any tree.
3 https://arxiv.org/abs/1905.04610
means that S contains all features, the prediction from the node where instance
x falls would be the expected prediction. Conversely, if we don’t condition the
prediction on any feature, meaning S is empty, we use the weighted average of
predictions from all terminal nodes. If S includes some, but not all features, we
disregard predictions of unreachable nodes. A node is deemed unreachable if the
decision path to it contradicts values in 𝑥𝑆 . From the remaining terminal nodes,
we average the predictions weighted by node sizes, which refers to the number
of training samples in each node. The mean of the remaining terminal nodes,
weighted by the number of instances per node, yields the expected prediction for
x given S. The challenge lies in applying this procedure for each possible subset
S of the feature values.
The fundamental concept of the path-dependent Tree Estimator is to push all
possible subsets S down the tree simultaneously. For each decision node, we need
to keep track of the number of subsets. This depends on the subsets in the parent
node and the split feature. For instance, when the first split in a tree is on feature
x3, all subsets containing feature x3 will go to one node (the one where x goes).
Subsets that do not include feature x3 go to both nodes with reduced weight.
Unfortunately, subsets of different sizes carry different weights. The algorithm
must keep track of the cumulative weight of the subsets in each node, which
complicates the algorithm.
The tree estimation method is implemented in shap:
# Interventional version: requires background data
shap.explainers.Tree(
    model, data, feature_perturbation='interventional'
)

# Path-dependent version: uses the trees' cover statistics instead of background data
shap.explainers.Tree(
    model, feature_perturbation='tree_path_dependent'
)
A.9 Gradient Estimator: For gradient-based
models
Ĺ Gradient Estimator
The Gradient Estimator is a model-specific estimation method tailored for
gradient-based models, such as neural networks, and can be applied to both
tabular and image data.
Many models, including most neural networks, are gradient-based. This means that we can compute the gradient of the model output with respect to the model input. When this gradient is available, we can use it to calculate SHAP values more efficiently.
Gradient SHAP is defined as the expected value of the gradients times the inputs minus the baselines:

$$\text{GradientShap}(x) = \mathbb{E}_{\alpha \sim U(0,1),\, \tilde{x}}\left[ (x_j - \tilde{x}_j) \cdot \frac{\partial g(\tilde{x} + \alpha \cdot (x - \tilde{x}))}{\partial x_j} \right]$$

$$\text{GradientShap}(x) = \frac{1}{n_{bg}} \sum_{i=1}^{n_{bg}} \left(x_j - \tilde{x}_j^{(i)}\right) \cdot \frac{\partial g\left(\tilde{x}^{(i)} + \alpha_i \cdot (x - \tilde{x}^{(i)})\right)}{\partial x_j}$$
So, what does this formula do? For a given feature value 𝑥𝑗 , this estimation
method cycles through the background data of size 𝑛𝑏𝑔 , computing two terms:
• The distance between the data point to be explained 𝑥𝑗 and the sample
from the background data.
• The gradient of the prediction 𝑔 with respect to the j-th feature, calculated not at the point to be explained, but at a random point on the line between the data point of interest and the background sample; the interpolation weight 𝛼𝑖 is uniformly sampled from [0, 1].
These terms are multiplied and averaged over the background data to approxi-
mate SHAP values. There’s a connection between the Gradient Estimator and a
method called Integrated Gradients (Sundararajan et al. 2017). Integrated Gra-
dients is a feature attribution method also based on gradients that outputs the
integrated path of the gradient with respect to a reference point as an explanation.
The difference between Integrated Gradients and SHAP values is that Integrated
Gradients use a single reference point, while Shapley values utilize a background
data set. The Gradient Estimator can be viewed as an adaptation of Integrated
Gradients, where instead of a single reference point, we reformulate the integral
as an expectation and estimate that expectation with the background data.
Integrated gradients are defined as follows:

$$IG(x) = (x_j - \tilde{x}_j) \cdot \int_{\alpha=0}^{1} \frac{\partial g(\tilde{x} + \alpha \cdot (x - \tilde{x}))}{\partial x_j} \, d\alpha$$
The SHAP Gradient Estimator extends this concept by using multiple data points
as references and integrating over an entire background dataset.
Here are the implementation details in shap:
shap.GradientExplainer(model, data)
4 https://shap.readthedocs.io/en/latest/example_notebooks/image_examples/image_classification/Explain%20an%20Intermediate%20Layer%20of%20VGG16%20on%20ImageNet%20(PyTorch).html?highlight=Gradient
A.10 Deep Estimator: for neural networks
The Deep Estimator is specifically designed for deep neural networks (Chen et
al. 2021). This makes the Deep Estimator more model-specific compared to
the Gradient Estimator, which can be applied to all gradient-based methods in
theory. The Deep Estimator is inspired by the DeepLIFT algorithm (Shrikumar
et al. 2017), an attribution method for deep neural networks. To understand how
the Deep Estimator works, we first need to discuss DeepLIFT. DeepLIFT explains
feature attributions in neural networks by calculating a contribution value for each input feature 𝑥𝑗, comparing the prediction for 𝑥 with the prediction for a reference point 𝑥̃. The user chooses the reference point, which is usually an "uninformative" data point, such as a blank image for image data. The difference to be explained is $\Delta f = f(x) - f(\tilde{x})$. DeepLIFT's attributions, called contribution scores $C_{\Delta x_j \Delta f}$, add up to the total difference: $\sum_{j=1}^{n} C_{\Delta x_j \Delta f} = \Delta f$. This process
is similar to how SHAP values are calculated. DeepLIFT does not require 𝑥𝑗 to
be the model inputs; they can be any neuron layer along the way. This feature
is not only a perk of DeepLIFT but also a vital aspect, as DeepLIFT is designed
to backpropagate the contributions through the neural network, layer by layer.
DeepLIFT employs the concept of "multipliers," defined as follows:

$$m_{\Delta x \Delta f} = \frac{C_{\Delta x \Delta f}}{\Delta x}$$

A multiplier represents the contribution of Δ𝑥 to Δ𝑓 divided by Δ𝑥. It resembles a partial derivative $\frac{\partial f}{\partial x}$, except that Δ𝑥 is a finite difference rather than an infinitesimally small one. Like derivatives, these multipliers can be backpropagated through the neural network using the chain rule: $m_{\Delta x_j \Delta f} = \sum_{i} m_{\Delta x_j \Delta y_i} \, m_{\Delta y_i \Delta f}$,
where x and y are two consecutive layers of the neural network. DeepLIFT then
defines a set of rules for backpropagating the multipliers for different components
of the neural networks, using the linear rule for linear units, the “rescale rule”
for nonlinear transformations like ReLU and sigmoid, and so on. Positive and
negative attributions are separated, which is crucial for backpropagating through
nonlinear units.
However, DeepLIFT does not yield SHAP values. Deep SHAP is an adaptation
of the DeepLIFT procedure to produce SHAP values. Here are the changes the
Deep Estimator incorporates:
206
• The Deep Estimator uses background data, a set of reference points, instead
of a single reference point.
• The multipliers are redefined in terms of SHAP values, which are backpropagated instead of the original DeepLIFT multipliers. Informally: $m_{\Delta x_j \Delta f} = \frac{\phi_j}{x_j - \mathbb{E}(X_j)}$.
• Another interpretation of the Deep Estimator: it computes the SHAP val-
ues in smaller parts of the network first and combines those to obtain SHAP
values for the entire network, explaining the prediction from the input, sim-
ilar to our usual understanding of SHAP.
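The implementation call is not shown in this excerpt; in shap it looks roughly like this, with background being a (small) set of reference inputs:

explainer = shap.DeepExplainer(model, background)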
Ĺ Note
How large should the background data be for the Deep Estimator? According
to the SHAP authora , 100 is good, and 1000 is very good.
a https://shap-lrjball.readthedocs.io/en/latest/generated/shap.DeepExplainer.html#shap.DeepExplainer.shap_values
The SHAP value for each group can then be attributed to its individual fea-
tures. Alternatively, if the hierarchy further splits into subgroups, we attribute
the SHAP value at the subgroup level.
Why is this useful? There are instances where we are more interested in a group
of features rather than individual ones. For example, multiple feature columns
may represent a similar concept, and we’re interested in the attribution of the
concept, not the individual features. Let’s say we’re predicting the yield of fruit
trees, and we have various soil humidity measurements at different depths. We
might not care about the individual attributions to different depths but instead
want a SHAP value attributed to the overall soil humidity. The results are not
SHAP values but Owen values. Owen values are another solution to the attribu-
tion problem in cooperative games. They are similar to SHAP values but assigned
to feature groups instead of individual features. Owen values only allow permu-
tations defined by a coalition structure. The computation is identical to SHAP
values, except that it imposes a hierarchy.
The Partition Estimator also proves useful for image inputs, where image pixels can be
be grouped into larger regions.
Implementation details:
shap.PartitionExplainer(model, partition_tree=None)
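A slightly fuller sketch (the data X, labels y, and model here are assumptions, not from the book): a hierarchical clustering of the features defines the partition tree, which the Partition masker and estimator then respect.

import shap

# Cluster the features to define the hierarchy (the partition tree)
clustering = shap.utils.hclust(X, y)
masker = shap.maskers.Partition(X, clustering=clustering)
explainer = shap.explainers.Partition(model.predict, masker)
shap_values = explainer(X[:100])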
B The Role of Maskers and
Background Data
Ĺ Note
Maskers are the technical solution for “removing” feature values to compute
marginal contributions and SHAP values.
B.1 Masker for tabular data
Ĺ Note
Typical maskers for tabular data replace absent feature values with samples
from a background dataset.
For tabular data, we can provide a background dataset, and the masker replaces
missing values with samples from the background data. There are two choices
for maskers: Independent masker and Partition masker.
The Independent masker replaces masked feature values with samples from the background data. Technically, when you pass tabular background data to an explainer, shap wraps it in the Independent masker:
masker = shap.maskers.Independent(data=X_train)
explainer = shap.LinearExplainer(model, masker=masker)
import shap
import pandas as pd
import numpy as np
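# The construction of the example background data is not shown in this
# excerpt; the following stand-in reproduces the values printed below.
df = pd.DataFrame({
    'f1': [166, 169, 102, 114, 122, 134, 105, 161, 148, 156],
    'f2': [4, 6, 8, 6, 1, 8, 6, 4, 4, 6],
    'f3': [13, 12, 13, 12, 11, 19, 10, 18, 15, 14],
})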
# print the DataFrame
print(df)
f1 f2 f3
0 166 4 13
1 169 6 12
2 102 8 13
3 114 6 12
4 122 1 11
5 134 8 19
6 105 6 10
7 161 4 18
8 148 4 15
9 156 6 14
This dataframe is our background dataset. Next, we create a masker and apply
it to data point (f1=0, f2=0, f3=0).
np.random.seed(2)
m = shap.maskers.Independent(df, max_samples=3)
# In the mask, True (1) keeps the value from x; False (0) replaces it with background samples
mask = np.array([1, 0, 1], dtype=bool)
print(m(mask=mask, x=np.array([0, 0, 0])))
f1 f2 f3
0 0 8 0
1 0 4 0
2 0 1 0
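# The code for the second mask is not shown in this excerpt; it presumably
# keeps only f3 from x and draws f1 and f2 from the background data:
mask = np.array([0, 0, 1], dtype=bool)
print(m(mask=mask, x=np.array([0, 0, 0])))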
f1 f2 f3
0 102 8 0
1 148 4 0
2 122 1 0
In this example, we have two masks. The first mask keeps features f1 and f3, i.e., only the feature values f1=0 and f3=0 from x, and draws f2 from the background data. In the second instance, only feature f3 is preserved.
Masks for tabular data can also be defined through integer vectors that indicate
which feature (based on column index) to mask.
Ĺ Note
Maskers for text replace tokens with a user-defined token.
For text data, the input is often represented as tokens within models, especially
neural networks. These tokens can be represented by (learned) embeddings, a
vectorized representation of a token, or other internal representations such as
bag-of-words counts or hand-written rules flagging specific words. However, when
calculating SHAP values, we are not concerned with the internal representation.
The “team” consists of the text fed into the model, and the payout is the model
output, whether it’s a sentiment score or a probability for the next word in a
translation task. The individual players, which can be characters, tokens, or
words, are determined by the user. The granularity at which text is divided
into smaller units is termed tokenization, and the individual units are known as
tokens. Tokenization can be performed at various levels, such as by character,
subword, word, sentence, paragraph, or even custom methods like using n-grams
or stopping at specific characters. The choice of tokenization method depends on
the goal of the model interpretation.
Ď Tip
Word tokenization is often a good first choice for interpretation with SHAP.
Assuming the input is tokenized by word for the following discussions, the next
question is: How do we represent the “absence” of a word? This decision is up to
the user. The word could be replaced by an empty string, by "…" (the default in shap), or even by a randomly chosen word from the background data or based on a grammar tool. The chosen method
will influence the prediction, as shown in the text chapter.
import shap
m = shap.maskers.Text()
s = 'Hello is this the Krusty Krab?'
# In each mask, 1 keeps the corresponding token and 0 replaces it with the mask token
print(m(s=s, mask = [1,1,1,1,1,1]))
print(m(s=s, mask = [1,0,0,1,0,1]))
print(m(s=s, mask = [1,1,1,1,0,0]))
print(m(s=s, mask = [0,0,0,0,0,0]))
B.4 Maskers for image data
Ĺ Note
For images, maskers substitute missing pixels with blurred versions or employ
inpainting techniques.
Similar to text data, the representation of an image for SHAP can be independent
of its representation for the model, as long as there is a mapping between the
SHAP version and the model version to calculate the marginal contributions.
For images, we have two options for players: individual pixels or larger units
containing multiple pixels. The composition of these units is flexible, and they can
be created based on a grid, such as dividing a 224x224 image into 196 rectangles
of size 16x16. In this case, the number of players would be 196 instead of 224x224
= 50,176.
So, what does the masker do? It replaces parts of the image. In theory, you could
use data from a background dataset, but that would be strange. Alternatively,
you could replace parts with gray pixels or another neutral color, but that could
also result in unusual images. SHAP implements blurring and inpainting methods
to remove or guess content from the rest of the image.
The absence of a team member is addressed in this manner. This masker is
implemented in SHAP’s Image masker.
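A minimal sketch of setting up such a masker (the mask mode and the 224x224x3 image shape are assumptions):

import shap

# Blur masked regions with a 16x16 kernel ...
masker = shap.maskers.Image("blur(16, 16)", shape=(224, 224, 3))
# ... or fill them in via inpainting
masker = shap.maskers.Image("inpaint_telea", shape=(224, 224, 3))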