Axiomatic Attribution for Deep Networks

Mukund Sundararajan*¹  Ankur Taly*¹  Qiqi Yan*¹

arXiv:1703.01365v2 [cs.LG] 13 Jun 2017

Abstract

We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms, Sensitivity and Implementation Invariance, that attribution methods ought to satisfy. We show that they are not satisfied by most known attribution methods, which we consider to be a fundamental weakness of those methods. We use the axioms to guide the design of a new attribution method called Integrated Gradients. Our method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator. We apply this method to a couple of image models, a couple of text models and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.

1. Motivation and Summary of Results

We study the problem of attributing the prediction of a deep network to its input features.

Definition 1. Formally, suppose we have a function F : Rn → [0, 1] that represents a deep network, and an input x = (x1, . . . , xn) ∈ Rn. An attribution of the prediction at input x relative to a baseline input x0 is a vector AF(x, x0) = (a1, . . . , an) ∈ Rn where ai is the contribution of xi to the prediction F(x).

For instance, in an object recognition network, an attribution method could tell us which pixels of the image were responsible for a certain label being picked (see Figure 2). The attribution problem was previously studied by various papers (Baehrens et al., 2010; Simonyan et al., 2013; Shrikumar et al., 2016; Binder et al., 2016; Springenberg et al., 2014).

The intention of these works is to understand the input-output behavior of the deep network, which gives us the ability to improve it. Such understandability is critical to all computer programs, including machine learning models. There are also other applications of attribution. They could be used within a product driven by machine learning to provide a rationale for the recommendation. For instance, a deep network that predicts a condition based on imaging could help inform the doctor of the part of the image that resulted in the recommendation. This could help the doctor understand the strengths and weaknesses of a model and compensate for it. We give such an example in Section 6.2. Attributions could also be used by developers in an exploratory sense. For instance, we could use a deep network to extract insights that could then be used in a rule-based system. In Section 6.3, we give such an example.

A significant challenge in designing an attribution technique is that it is hard to evaluate empirically. As we discuss in Section 4, it is hard to tease apart errors that stem from the misbehavior of the model versus the misbehavior of the attribution method. To compensate for this shortcoming, we take an axiomatic approach. In Section 2 we identify two axioms that every attribution method must satisfy. Unfortunately, most previous methods do not satisfy one of these two axioms. In Section 3, we use the axioms to identify a new method, called integrated gradients.

Unlike previously proposed methods, integrated gradients do not need any instrumentation of the network, and can be computed easily using a few calls to the gradient operation, allowing even novice practitioners to easily apply the technique.

In Section 6, we demonstrate the ease of applicability over several deep networks, including two image networks, two text processing networks, and a chemistry network. These applications demonstrate the use of our technique in either improving our understanding of the network, performing debugging, performing rule extraction, or aiding an end user in understanding the network's prediction.

Remark 1. Let us briefly examine the need for the baseline in the definition of the attribution problem. A common way for humans to perform attribution relies on counterfactual intuition. When we assign blame to a certain cause we implicitly consider the absence of the cause as a baseline for comparing outcomes. In a deep network, we model the absence using a single baseline input. For most deep networks, a natural baseline exists in the input space where the prediction is neutral. For instance, in object recognition networks, it is the black image. The need for a baseline has also been pointed out by prior work on attribution (Shrikumar et al., 2016; Binder et al., 2016).

*Equal contribution. ¹Google Inc., Mountain View, USA. Correspondence to: Mukund Sundararajan <[email protected]>, Ankur Taly <[email protected]>.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

2. Two Fundamental Axioms

We now discuss two axioms (desirable characteristics) for attribution methods. We find that other feature attribution methods in the literature break at least one of the two axioms. These methods include DeepLift (Shrikumar et al., 2016; 2017), Layer-wise relevance propagation (LRP) (Binder et al., 2016), Deconvolutional networks (Zeiler & Fergus, 2014), and Guided back-propagation (Springenberg et al., 2014). As we will see in Section 3, these axioms will also guide the design of our method.

Gradients. For linear models, ML practitioners regularly inspect the products of the model coefficients and the feature values in order to debug predictions. Gradients (of the output with respect to the input) are a natural analog of the model coefficients for a deep network, and therefore the product of the gradient and feature values is a reasonable starting point for an attribution method (Baehrens et al., 2010; Simonyan et al., 2013); see the third column of Figure 2 for examples. The problem with gradients is that they break sensitivity, a property that all attribution methods should satisfy.

2.1. Axiom: Sensitivity(a)

An attribution method satisfies Sensitivity(a) if, for every input and baseline that differ in one feature but have different predictions, the differing feature is given a non-zero attribution. (Later in the paper, we will add a part (b) to this definition.)

Gradients violate Sensitivity(a): For a concrete example, consider a one variable, one ReLU network, f(x) = 1 − ReLU(1 − x). Suppose the baseline is x = 0 and the input is x = 2. The function changes from 0 to 1, but because f becomes flat at x = 1, the gradient method gives an attribution of 0 to x. Intuitively, gradients break Sensitivity because the prediction function may flatten at the input and thus have zero gradient despite the function value at the input being different from that at the baseline. This phenomenon has been reported in previous work (Shrikumar et al., 2016).

Practically, the lack of sensitivity causes gradients to focus on irrelevant features (see the "fireboat" example in Figure 2).
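This failure mode is easy to reproduce numerically. Below is a minimal sketch (plain Python; the helper names f, grad_f are ours, not from the paper) contrasting the gradient attribution with a Riemann approximation of the integrated gradients of Section 3 for the example above:

    # Sketch: gradient vs. integrated gradients for f(x) = 1 - ReLU(1 - x).
    # Baseline x0 = 0, input x = 2; F(2) - F(0) = 1, yet the gradient at
    # the input is 0 because f is flat for x >= 1.

    def f(x):
        return 1.0 - max(1.0 - x, 0.0)          # 1 - ReLU(1 - x)

    def grad_f(x):
        return 1.0 if x < 1.0 else 0.0          # derivative of f

    x0, x, m = 0.0, 2.0, 1000
    gradient_attr = grad_f(x) * (x - x0)        # = 0.0: violates Sensitivity(a)
    # Riemann approximation of Equation (1) (see also Equation (3)):
    ig_attr = (x - x0) * sum(grad_f(x0 + (k / m) * (x - x0))
                             for k in range(1, m + 1)) / m
    print(gradient_attr, ig_attr)               # 0.0 vs. ~0.998, close to F(2) - F(0)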
Other back-propagation based approaches. A second set of approaches involves back-propagating the final prediction score through each layer of the network down to the individual features. These include DeepLift, Layer-wise relevance propagation (LRP), Deconvolutional networks (DeConvNets), and Guided back-propagation. These methods differ in the specific backpropagation logic for various activation functions (e.g., ReLU, MaxPool, etc.).

Unfortunately, Deconvolutional networks (DeConvNets) and Guided back-propagation violate Sensitivity(a). This is because these methods back-propagate through a ReLU node only if the ReLU is turned on at the input. This makes the method similar to gradients, in that the attribution is zero for features with zero gradient at the input despite a non-zero gradient at the baseline. We defer the specific counterexamples to Appendix B.

Methods like DeepLift and LRP tackle the Sensitivity issue by employing a baseline, and in some sense try to compute "discrete gradients" instead of (instantaneous) gradients at the input. (The two methods differ in the specifics of how they compute the discrete gradient.) But the idea is that a large, discrete step will avoid flat regions, avoiding a breakage of sensitivity. Unfortunately, these methods violate a different requirement on attribution methods.

2.2. Axiom: Implementation Invariance

Two networks are functionally equivalent if their outputs are equal for all inputs, despite having very different implementations. Attribution methods should satisfy Implementation Invariance, i.e., the attributions are always identical for two functionally equivalent networks. To motivate this, notice that attribution can be colloquially defined as assigning the blame (or credit) for the output to the input features. Such a definition does not refer to implementation details.

We now discuss intuition for why DeepLift and LRP break Implementation Invariance; a concrete example is provided in Appendix B.

First, notice that gradients are invariant to implementation. In fact, the chain rule for gradients, ∂f/∂g = ∂f/∂h · ∂h/∂g, is essentially about implementation invariance. To see this, think of g and f as the input and output of a system, and h as some implementation detail of the system. The gradient of output f to input g can be computed either directly by ∂f/∂g, ignoring the intermediate function h (implementation detail), or by invoking the chain rule via h. This is exactly how backpropagation works.

Methods like LRP and DeepLift replace gradients with discrete gradients and still use a modified form of backpropagation to compose discrete gradients into attributions. Unfortunately, the chain rule does not hold for discrete gradients in general. Formally,

(f(x1) − f(x0)) / (g(x1) − g(x0)) ≠ (f(x1) − f(x0)) / (h(x1) − h(x0)) · (h(x1) − h(x0)) / (g(x1) − g(x0)),

and therefore these methods fail to satisfy implementation invariance.

If an attribution method fails to satisfy Implementation Invariance, the attributions are potentially sensitive to unimportant aspects of the models. For instance, if the network architecture has more degrees of freedom than needed to represent a function, then there may be two sets of values for the network parameters that lead to the same function. The training procedure can converge at either set of values depending on the initialization or for other reasons, but the underlying network function would remain the same. It is undesirable that attributions differ for such reasons.

3. Our Method: Integrated Gradients

We are now ready to describe our technique. Intuitively, our technique combines the Implementation Invariance of Gradients along with the Sensitivity of techniques like LRP or DeepLift.

Formally, suppose we have a function F : Rn → [0, 1] that represents a deep network. Specifically, let x ∈ Rn be the input at hand, and x0 ∈ Rn be the baseline input. For image networks, the baseline could be the black image, while for text models it could be the zero embedding vector.

We consider the straightline path (in Rn) from the baseline x0 to the input x, and compute the gradients at all points along the path. Integrated gradients are obtained by cumulating these gradients. Specifically, integrated gradients are defined as the path integral of the gradients along the straightline path from the baseline x0 to the input x.

The integrated gradient along the ith dimension for an input x and baseline x0 is defined as follows. Here, ∂F(x)/∂xi is the gradient of F(x) along the ith dimension.

IntegratedGradsi(x) ::= (xi − x0i) × ∫₀¹ [∂F(x0 + α × (x − x0)) / ∂xi] dα      (1)

Axiom: Completeness. Integrated gradients satisfy an axiom called completeness: the attributions add up to the difference between the output of F at the input x and at the baseline x0. This axiom is identified as being desirable by DeepLift and LRP. It is a sanity check that the attribution method is somewhat comprehensive in its accounting, a property that is clearly desirable if the network's score is used in a numeric sense, and not just to pick the top label, e.g., for a model estimating insurance premiums from credit features of individuals.

This is formalized by the proposition below, which instantiates the fundamental theorem of calculus for path integrals.

Proposition 1. If F : Rn → R is differentiable almost everywhere¹, then

Σi=1..n IntegratedGradsi(x) = F(x) − F(x0)

¹Formally, this means the function F is continuous everywhere and the partial derivative of F along each input dimension satisfies Lebesgue's integrability condition, i.e., the set of discontinuous points has measure zero. Deep networks built out of Sigmoids, ReLUs, and pooling operators satisfy this condition.

For most deep networks, it is possible to choose a baseline such that the prediction at the baseline is near zero (F(x0) ≈ 0). (For image models, the black image baseline indeed satisfies this property.) In such cases, there is an interpretation of the resulting attributions that ignores the baseline and amounts to distributing the output to the individual input features.

Remark 2. Integrated gradients satisfies Sensitivity(a) because Completeness implies Sensitivity(a); Completeness is thus a strengthening of the Sensitivity(a) axiom. This is because Sensitivity(a) refers to a case where the baseline and the input differ only in one variable, for which Completeness asserts that the difference in the two output values is equal to the attribution to this variable. Attributions generated by integrated gradients satisfy Implementation Invariance since they are based only on the gradients of the function represented by the network.

4. Uniqueness of Integrated Gradients

Prior literature has relied on empirically evaluating the attribution technique. For instance, in the context of an object recognition task, (Samek et al., 2015) suggests that we select the top k pixels by attribution, randomly vary their intensities, and then measure the drop in score. If the attribution method is good, then the drop in score should be large. However, the images resulting from pixel perturbation could be unnatural, and it could be that the scores drop simply because the network has never seen anything like it in training. (This is less of a concern with linear or logistic models where the simplicity of the model ensures that ablating a feature does not cause strange interactions.)

A different evaluation technique considers images with human-drawn bounding boxes around objects, and computes the percentage of pixel attribution inside the box. While for most objects one would expect the pixels located on the object to be most important for the prediction, in some cases the context in which the object occurs may also contribute to the prediction. The cabbage butterfly image from Figure 2 is a good example of this, where the pixels on the leaf are also surfaced by the integrated gradients.

Roughly, we found that every empirical evaluation technique we could think of could not differentiate between artifacts that stem from perturbing the data, a misbehaving model, and a misbehaving attribution method. This was why we turned to an axiomatic approach in designing a good attribution method (Section 2). While our method satisfies Sensitivity and Implementation Invariance, it certainly isn't the unique method to do so.
We now justify the selection of the integrated gradients method in two steps. First, we identify a class of methods called Path methods that generalize integrated gradients. We discuss that path methods are the only methods that satisfy certain desirable axioms. Second, we argue why integrated gradients is somehow canonical among the different path methods.

4.1. Path Methods

Integrated gradients aggregate the gradients along the inputs that fall on the straightline between the baseline and the input. There are many other (non-straightline) paths that monotonically interpolate between the two points, and each such path will yield a different attribution method. For instance, consider the simple case when the input is two dimensional. Figure 1 has examples of three paths, each of which corresponds to a different attribution method.

Figure 1. Three paths between a baseline (r1, r2) and an input (s1, s2). Each path corresponds to a different attribution method. The path P2 corresponds to the path used by integrated gradients.

Formally, let γ = (γ1, . . . , γn) : [0, 1] → Rn be a smooth function specifying a path in Rn from the baseline x0 to the input x, i.e., γ(0) = x0 and γ(1) = x.

Given a path function γ, path integrated gradients are obtained by integrating the gradients along the path γ(α) for α ∈ [0, 1]. Formally, path integrated gradients along the ith dimension for an input x are defined as follows,

PathIntegratedGradsγi(x) ::= ∫₀¹ [∂F(γ(α)) / ∂γi(α)] × [∂γi(α) / ∂α] dα      (2)

where ∂F(x)/∂xi is the gradient of F along the ith dimension at x.

Attribution methods based on path integrated gradients are collectively known as path methods. Notice that integrated gradients is a path method for the straightline path specified by γ(α) = x0 + α × (x − x0) for α ∈ [0, 1].

Remark 3. All path methods satisfy Implementation Invariance. This follows from the fact that they are defined using the underlying gradients, which do not depend on the implementation. They also satisfy Completeness (the proof is similar to that of Proposition 1) and Sensitivity(a), which is implied by Completeness (see Remark 2).

More interestingly, path methods are the only methods that satisfy certain desirable axioms. (For formal definitions of the axioms and proof of Proposition 2, see Friedman (Friedman, 2004).)

Axiom: Sensitivity(b). (called Dummy in (Friedman, 2004)) If the function implemented by the deep network does not depend (mathematically) on some variable, then the attribution to that variable is always zero.

This is a natural complement to the definition of Sensitivity(a) from Section 2. This definition captures desired insensitivity of the attributions.

Axiom: Linearity. Suppose that we linearly compose two deep networks modeled by the functions f1 and f2 to form a third network that models the function a × f1 + b × f2, i.e., a linear combination of the two networks. Then we'd like the attributions for a × f1 + b × f2 to be the weighted sum of the attributions for f1 and f2 with weights a and b respectively. Intuitively, we would like the attributions to preserve any linearity within the network.

Proposition 2. (Theorem 1 (Friedman, 2004)) Path methods are the only attribution methods that always satisfy Implementation Invariance, Sensitivity(b), Linearity, and Completeness.

Remark 4. We note that path integrated gradients have been used within the cost-sharing literature in economics, where the function models the cost of a project as a function of the demands of various participants, and the attributions correspond to cost-shares. Integrated gradients correspond to a cost-sharing method called Aumann-Shapley (Aumann & Shapley, 1974). Proposition 2 holds for our attribution problem because mathematically the cost-sharing problem corresponds to the attribution problem with the benchmark fixed at the zero vector. (Implementation Invariance is implicit in the cost-sharing literature as the cost functions are considered directly in their mathematical form.)

4.2. Integrated Gradients is Symmetry-Preserving

In this section, we formalize why the straightline path chosen by integrated gradients is canonical. First, observe that it is the simplest path that one can define mathematically. Second, a natural property for attribution methods is to preserve symmetry, in the following sense.

Symmetry-Preserving. Two input variables are symmetric w.r.t. a function if swapping them does not change the function. For instance, x and y are symmetric w.r.t. F if and only if F(x, y) = F(y, x) for all values of x and y. An attribution method is symmetry preserving if, for all inputs that have identical values for symmetric variables and baselines that have identical values for symmetric variables, the symmetric variables receive identical attributions.

E.g., consider the logistic model Sigmoid(x1 + x2 + . . .). x1 and x2 are symmetric variables for this model. For an input where x1 = x2 = 1 (say) and baseline where x1 = x2 = 0 (say), a symmetry preserving method must offer identical attributions to x1 and x2.
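This is easy to check numerically for integrated gradients. A minimal sketch (plain Python; the helper names are ours), using the Riemann approximation that Section 5 introduces:

    # Sketch: symmetric variables receive identical integrated-gradients
    # attributions for F(x1, x2) = Sigmoid(x1 + x2), baseline (0, 0), input (1, 1).
    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def grad_F(x1, x2):
        s = sigmoid(x1 + x2)
        return (s * (1.0 - s), s * (1.0 - s))   # dF/dx1 = dF/dx2 by symmetry

    baseline, inp, m = (0.0, 0.0), (1.0, 1.0), 1000
    attr = [0.0, 0.0]
    for k in range(1, m + 1):
        pt = tuple(b + (k / m) * (v - b) for b, v in zip(baseline, inp))
        g = grad_F(*pt)
        for i in range(2):
            attr[i] += (inp[i] - baseline[i]) * g[i] / m
    print(attr)  # two equal values; their sum is close to F(1, 1) - F(0, 0)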
It seems natural to ask for symmetry-preserving attribution methods because if two variables play the exact same role in the network (i.e., they are symmetric and have the same values in the baseline and the input) then they ought to receive the same attribution.

Theorem 1. Integrated gradients is the unique path method that is symmetry-preserving.

The proof is provided in Appendix A.

Remark 5. If we allow averaging over the attributions from multiple paths, then there are other methods that satisfy all the axioms in Theorem 1. In particular, there is the method by Shapley-Shubik (Shapley & Shubik, 1971) from the cost-sharing literature, used by (Lundberg & Lee, 2016; Datta et al., 2016) to compute feature attributions (though they were not studying deep networks). In this method, the attribution is the average of those from n! extremal paths; here n is the number of features. Each such path considers an ordering of the input features, and sequentially changes each input feature from its value at the baseline to its value at the input. This method yields attributions that are different from integrated gradients. If the function of interest is min(x1, x2), the baseline is x1 = x2 = 0, and the input is x1 = 1, x2 = 3, then integrated gradients attributes the change in the function value entirely to the critical variable x1, whereas Shapley-Shubik assigns attributions of 1/2 each; it seems somewhat subjective to prefer one result over the other.

We also envision other issues with applying Shapley-Shubik to deep networks: It is computationally expensive; in an object recognition network that takes a 100x100 image as input, n is 10000, and n! is a gigantic number. Even if one samples a few paths randomly, evaluating the attributions for a single path takes n calls to the deep network. In contrast, integrated gradients is able to operate with 20 to 300 calls. Further, the Shapley-Shubik computation visits inputs that are combinations of the input and the baseline. It is possible that some of these combinations are very different from anything seen during training. We speculate that this could lead to attribution artifacts.
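To make the min(x1, x2) contrast in Remark 5 concrete, the following sketch (our own illustration, plain Python) computes both the integrated-gradients attributions along the straightline path and the Shapley-Shubik average over the 2! orderings:

    # Sketch: integrated gradients vs. Shapley-Shubik for F = min(x1, x2),
    # baseline (0, 0), input (1, 3).
    from itertools import permutations

    def F(x):
        return min(x)

    baseline, inp, m = [0.0, 0.0], [1.0, 3.0], 1000

    # Integrated gradients: along the path (a, 3a) the minimum is always x1,
    # so the gradient w.r.t. x1 is 1 and w.r.t. x2 is 0 almost everywhere.
    ig = [0.0, 0.0]
    for k in range(1, m + 1):
        pt = [b + (k / m) * (v - b) for b, v in zip(baseline, inp)]
        i_min = pt.index(min(pt))               # gradient mass goes to the argmin
        ig[i_min] += (inp[i_min] - baseline[i_min]) / m

    # Shapley-Shubik: average marginal contribution over feature orderings.
    ss = [0.0, 0.0]
    for order in permutations(range(2)):
        cur = list(baseline)
        for i in order:
            before = F(cur)
            cur[i] = inp[i]
            ss[i] += (F(cur) - before) / 2      # 2 = number of orderings
    print(ig, ss)  # roughly [1.0, 0.0] vs. [0.5, 0.5]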
5. Applying Integrated Gradients

Selecting a Benchmark. A key step in applying integrated gradients is to select a good baseline. We recommend that developers check that the baseline has a near-zero score; as discussed in Section 3, this allows us to interpret the attributions as a function of the input. But there is more to a good baseline: For instance, for an object recognition network it is possible to create an adversarial example that has a zero score for a given input label (say elephant), by applying a tiny, carefully-designed perturbation to an image with a very different label (say microscope) (cf. (Goodfellow et al., 2015)). The attributions can then include undesirable artifacts of this adversarially constructed baseline. So we would additionally like the baseline to convey a complete absence of signal, so that the features that are apparent from the attributions are properties only of the input, and not of the baseline. For instance, in an object recognition network, a black image signifies the absence of objects. The black image isn't unique in this sense; an image consisting of noise has the same property. However, using black as a baseline may result in cleaner visualizations of "edge" features. For text based networks, we have found that the all-zero input embedding vector is a good baseline. The action of training causes unimportant words to tend to have small norms, and so, in the limit, unimportance corresponds to the all-zero baseline. Notice that the black image corresponds to a valid input to an object recognition network, and is also intuitively what we humans would consider absence of signal. In contrast, the all-zero input vector for a text network does not correspond to a valid input; it nevertheless works for the mathematical reason described above.

Computing Integrated Gradients. The integral of integrated gradients can be efficiently approximated via a summation. We simply sum the gradients at points occurring at sufficiently small intervals along the straightline path from the baseline x0 to the input x:

IntegratedGradsapproxi(x) ::= (xi − x0i) × Σk=1..m [∂F(x0 + (k/m) × (x − x0)) / ∂xi] × (1/m)      (3)

Here m is the number of steps in the Riemann approximation of the integral. Notice that the approximation simply involves computing the gradient in a for loop, which should be straightforward and efficient in most deep learning frameworks. For instance, in TensorFlow, it amounts to calling tf.gradients in a loop over the set of inputs (i.e., x0 + (k/m) × (x − x0) for k = 1, . . . , m), which could also be batched. In practice, we find that somewhere between 20 and 300 steps are enough to approximate the integral (within 5%); we recommend that developers check that the attributions approximately add up to the difference between the score at the input and that at the baseline (cf. Proposition 1), and if not, increase the number of steps m.
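One way to realize this computation is sketched below. It is framework-agnostic: grad_fn is assumed to be a user-supplied function (e.g., wrapping tf.gradients or another autodiff call) returning ∂F/∂x for a batch of inputs; the function and variable names are ours, not the paper's API.

    # Sketch: Riemann approximation of integrated gradients (Equation 3).
    # `grad_fn(batch)` is assumed to return the gradients of the output F
    # w.r.t. each input in `batch` (shape: [m, num_features]).
    import numpy as np

    def integrated_gradients(x, baseline, grad_fn, m=50):
        alphas = np.arange(1, m + 1) / m                     # k/m for k = 1..m
        # All interpolated inputs at once, so grad_fn can be batched:
        batch = baseline + alphas[:, None] * (x - baseline)  # [m, num_features]
        avg_grads = grad_fn(batch).mean(axis=0)              # (1/m) * sum of grads
        return (x - baseline) * avg_grads

    # Usage, with F(x) = Sigmoid(x1 + x2) as a stand-in network:
    F = lambda x: 1 / (1 + np.exp(-x.sum(axis=-1)))
    grad_fn = lambda b: (F(b) * (1 - F(b)))[:, None] * np.ones_like(b)
    x, x0 = np.array([1.0, 1.0]), np.zeros(2)
    attrs = integrated_gradients(x, x0, grad_fn)
    # Completeness check recommended above; if the gap is large, increase m:
    assert abs(attrs.sum() - (F(x[None]) - F(x0[None]))[0]) < 1e-2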

6. Applications

The integrated gradients technique is applicable to a variety of deep networks. Here, we apply it to two image models, two natural language models, and a chemistry model.

6.1. An Object Recognition Network

We study feature attribution in an object recognition network built using the GoogleNet architecture (Szegedy et al., 2014) and trained over the ImageNet object recognition dataset (Russakovsky et al., 2015). We use the integrated gradients method to study pixel importance in predictions made by this network. The gradients are computed for the output of the highest-scoring class with respect to the pixels of the input image. The baseline input is the black image, i.e., all pixel intensities are zero.

Integrated gradients can be visualized by aggregating them along the color channel and scaling the pixels in the actual image by them. Figure 2 shows visualizations for a bunch of images². For comparison, it also presents the corresponding visualization obtained from the product of the image with the gradients at the actual image. Notice that integrated gradients are better at reflecting distinctive features of the input image.

Figure 2. Comparing integrated gradients with gradients at the image. Left-to-right: original input image, label and softmax score for the highest scoring class, visualization of integrated gradients, visualization of gradients*image. Notice that the visualizations obtained from integrated gradients are better at reflecting distinctive features of the image.

²More examples can be found at https://github.com/ankurtaly/Attributions
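As a sketch of this aggregate-and-scale visualization (our own minimal rendition, assuming NumPy float arrays of shape height × width × channels; the function name is hypothetical):

    # Sketch: visualize attributions by summing over the color channel and
    # scaling the (grayscale) image pixels by the normalized magnitudes.
    import numpy as np

    def visualize(image, attributions):
        mass = np.abs(attributions).sum(axis=-1)        # aggregate over channels
        mass = mass / (mass.max() + 1e-8)               # normalize to [0, 1]
        gray = image.mean(axis=-1)                      # grayscale copy of image
        return gray * mass                              # highlight attributed pixels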
6.2. Diabetic Retinopathy Prediction

Diabetic retinopathy (DR) is a complication of diabetes that affects the eyes. Recently, a deep network (Gulshan et al., 2016) has been proposed to predict the severity grade for DR in retinal fundus images. The model has good predictive accuracy on various validation datasets.

We use integrated gradients to study feature importance for this network; as in the object recognition case, the baseline is the black image. Feature importance explanations are important for this network as retina specialists may use it to build trust in the network's predictions, decide the grade for borderline cases, and obtain insights for further testing and screening.

Figure 3 shows a visualization of integrated gradients for a retinal fundus image. The visualization method is a bit different from that used in Figure 2. We aggregate integrated gradients along the color channel and overlay them on the actual image in gray scale, with positive attributions along the green channel and negative attributions along the red channel. Notice that integrated gradients are localized to a few pixels that seem to be lesions in the retina. The interior of the lesions receives a negative attribution while the periphery receives a positive attribution, indicating that the network focuses on the boundary of the lesion.

Figure 3. Attribution for Diabetic Retinopathy grade prediction from a retinal fundus image. The original image is shown on the left, and the attributions (overlaid on the original image in gray scale) are shown on the right. On the original image we annotate lesions visible to a human, and confirm that the attributions indeed point to them.
6.3. Question Classification

Automatically answering natural language questions (over semi-structured data) is an important problem in artificial intelligence (AI). A common approach is to semantically parse the question to its logical form (Liang, 2016) using a set of human-authored grammar rules. An alternative approach is to machine learn an end-to-end model provided there is enough training data. An interesting question is whether one could peek inside machine learnt models to derive new rules. We explore this direction for a sub-problem of semantic parsing, called question classification, using the method of integrated gradients.

The goal of question classification is to identify the type of answer a question is seeking. For instance, is the question seeking a yes/no answer, or is it seeking a date? Rules for solving this problem look for trigger phrases in the question; for e.g., a "when" in the beginning indicates a date seeking question. We train a model for question classification using the text categorization architecture proposed by (Kim, 2014) over the WikiTableQuestions dataset (Pasupat & Liang, 2015). We use integrated gradients to attribute predictions down to the question terms in order to identify new trigger phrases for answer type. The baseline input is the all zero embedding vector.

Figure 4 lists a few questions with constituent terms highlighted based on their attribution. Notice that the attributions largely agree with commonly used rules; for e.g., "how many" indicates a numeric seeking question. In addition, attributions help identify novel question classification rules; for e.g., questions containing "total number" are seeking numeric answers. Attributions also point out undesirable correlations; for e.g., "charles" is used as a trigger for a yes/no question.

Figure 4. Attributions from question classification model. Term color indicates attribution strength: Red is positive, Blue is negative, and Gray is neutral (zero). The predicted class is specified in square brackets.

6.4. Neural Machine Translation

We applied our technique to a complex, LSTM-based Neural Machine Translation System (Wu et al., 2016). We attribute the output probability of every output token (in the form of wordpieces) to the input tokens. Such attributions "align" the output sentence with the input sentence. For the baseline, we zero out the embeddings of all tokens except the start and end markers. Figure 5 shows an example of such an attribution-based alignment. We observed that the results make intuitive sense. E.g., "und" is mostly attributed to "and", and "morgen" is mostly attributed to "morning". We use 100 to 1000 steps (cf. Section 5) in the integrated gradient approximation; we need this because the network is highly nonlinear.

Figure 5. Attributions from a language translation model. Input in English: "good morning ladies and gentlemen". Output in German: "Guten Morgen Damen und Herren". Both input and output are tokenized into word pieces, where a word piece prefixed by an underscore indicates that it should be the prefix of a word.

6.5. Chemistry Models

We apply integrated gradients to a network performing Ligand-Based Virtual Screening, which is the problem of predicting whether an input molecule is active against a certain target (e.g., protein or enzyme). In particular, we consider a network based on the molecular graph convolution architecture proposed by (Kearnes et al., 2016).

The network requires an input molecule to be encoded by hand as a set of atom and atom-pair features describing the molecule as an undirected graph. Atoms are featurized using a one-hot encoding specifying the atom type (e.g., C, O, S, etc.), and atom-pairs are featurized by specifying either the type of bond (e.g., single, double, triple, etc.) between the atoms, or the graph distance between them. The baseline input is obtained by zeroing out the feature vectors for atoms and atom-pairs.

We visualize integrated gradients as heatmaps over the atom and atom-pair features, with the heatmap intensity depicting the strength of the contribution. Figure 6 shows the visualization for a specific molecule. Since integrated gradients add up to the final prediction score (see Proposition 1), the magnitudes can be used for accounting the contributions of each feature. For instance, for the molecule in the figure, atom-pairs that have a bond between them cumulatively contribute 46% of the prediction score, while all other pairs cumulatively contribute only −3%.

Figure 6. Attribution for a molecule under the W2N2 network (Kearnes et al., 2016). The molecule is active on task PCBA-58432.
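Because attributions add up to the prediction score (Proposition 1), group-level accounting such as this 46% / −3% split reduces to summing attribution entries over a mask. A minimal sketch with our own names:

    # Sketch: fraction of the prediction score contributed by a feature group.
    import numpy as np

    def group_share(attributions, mask):
        # attributions: per-feature integrated gradients (which, by
        # Completeness, sum to F(x) - F(baseline)); mask: boolean array
        # selecting a group, e.g. atom pairs that share a bond.
        return attributions[mask].sum() / attributions.sum()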
Identifying Degenerate Features. We now discuss how attributions helped us spot an anomaly in the W1N2 architecture in (Kearnes et al., 2016). On applying the integrated gradients method to this network, we found that several atoms in the same molecule received identical attribution despite being bonded to different atoms. This is surprising, as one would expect two atoms with different neighborhoods to be treated differently by the network.

On investigating the problem further, we found that in the network architecture the atom and atom-pair features were not fully convolved. This caused all atoms that have the same atom type and the same number of bonds of each type to contribute identically to the network.

7. Other Related Work

We already covered closely related work on attribution in Section 2. We mention other related work here. Over the last few years, there has been a vast amount of work on demystifying the inner workings of deep networks. Most of this work has been on networks trained on computer vision tasks, and deals with understanding what a specific neuron computes (Erhan et al., 2009; Le, 2013) and interpreting the representations captured by neurons during a prediction (Mahendran & Vedaldi, 2015; Dosovitskiy & Brox, 2015; Yosinski et al., 2015). In contrast, we focus on understanding the network's behavior on a specific input in terms of the base level input features. Our technique quantifies the importance of each feature in the prediction.

One approach to the attribution problem, proposed first by (Ribeiro et al., 2016a;b), is to locally approximate the behavior of the network in the vicinity of the input being explained with a simpler, more interpretable model. An appealing aspect of this approach is that it is completely agnostic to the implementation of the network and satisfies implementation invariance. However, this approach does not guarantee sensitivity. There is no guarantee that the local region explored escapes the "flat" section of the prediction function in the sense of Section 2. The other issue is that the method is expensive to implement for networks with "dense" input like image networks, as one needs to explore a local region of size proportional to the number of pixels and train a model for this space. In contrast, our technique works with a few calls to the gradient operation.

Attention mechanisms (Bahdanau et al., 2014) have gained popularity recently. One may think that attention could be used as a proxy for attributions, but this has issues. For instance, in an LSTM that also employs attention, there are many ways for an input token to influence an output token: the memory cell, the recurrent state, and "attention". Focusing only on attention ignores the other modes of influence and results in an incomplete picture.

8. Conclusion

The primary contribution of this paper is a method called integrated gradients that attributes the prediction of a deep network to its inputs. It can be implemented using a few calls to the gradients operator, can be applied to a variety of deep networks, and has a strong theoretical justification.

A secondary contribution of this paper is to clarify desirable features of an attribution method using an axiomatic framework inspired by cost-sharing literature from economics. Without the axiomatic approach it is hard to tell whether the attribution method is affected by data artifacts, the network's artifacts, or artifacts of the method. The axiomatic approach rules out artifacts of the last type.

While our and other works have made some progress on understanding the relative importance of input features in a deep network, we have not addressed the interactions between the input features or the logic employed by the network. So there remain many unanswered questions in terms of debugging the I/O behavior of a deep network.

ACKNOWLEDGMENTS

We would like to thank Samy Bengio, Kedar Dhamdhere, Scott Lundberg, Amir Najmi, Kevin McCurley, Patrick Riley, Christian Szegedy, and Diane Tang for their feedback. We would like to thank Daniel Smilkov and Federico Allocati for identifying bugs in our descriptions. We would like to thank our anonymous reviewers for identifying bugs, and for their suggestions to improve presentation.

References

Aumann, R. J. and Shapley, L. S. Values of Non-Atomic Games. Princeton University Press, Princeton, NJ, 1974.
Baehrens, David, Schroeter, Timon, Harmeling, Stefan, Kawanabe, Motoaki, Hansen, Katja, and Müller, Klaus-Robert. How to explain individual classification decisions. Journal of Machine Learning Research, pp. 1803–1831, 2010.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.

Binder, Alexander, Montavon, Grégoire, Bach, Sebastian, Müller, Klaus-Robert, and Samek, Wojciech. Layer-wise relevance propagation for neural networks with local renormalization layers. CoRR, 2016.

Datta, A., Sen, S., and Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 598–617, 2016.

Dosovitskiy, Alexey and Brox, Thomas. Inverting visual representations with convolutional networks, 2015.

Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal, 2009.

Friedman, Eric J. Paths and consistency in additive cost sharing. International Journal of Game Theory, 32(4):501–518, 2004.

Goodfellow, Ian, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.

Gulshan, Varun, Peng, Lily, Coram, Marc, and et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.

Kearnes, Steven, McCloskey, Kevin, Berndl, Marc, Pande, Vijay, and Riley, Patrick. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, pp. 595–608, 2016.

Kim, Yoon. Convolutional neural networks for sentence classification. In ACL, 2014.

Le, Quoc V. Building high-level features using large scale unsupervised learning. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 8595–8598, 2013.

Liang, Percy. Learning executable semantic parsers for natural language understanding. Commun. ACM, 59(9):68–76, 2016.

Lundberg, Scott and Lee, Su-In. An unexpected unity among methods for interpreting model predictions. CoRR, abs/1611.07478, 2016. URL http://arxiv.org/abs/1611.07478.

Mahendran, Aravindh and Vedaldi, Andrea. Understanding deep image representations by inverting them. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5188–5196, 2015.

Pasupat, Panupong and Liang, Percy. Compositional semantic parsing on semi-structured tables. In ACL, 2015.

Ribeiro, Marco Túlio, Singh, Sameer, and Guestrin, Carlos. "Why should I trust you?": Explaining the predictions of any classifier. In 22nd ACM International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016a.

Ribeiro, Marco Túlio, Singh, Sameer, and Guestrin, Carlos. Model-agnostic interpretability of machine learning. CoRR, 2016b.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pp. 211–252, 2015.

Samek, Wojciech, Binder, Alexander, Montavon, Grégoire, Bach, Sebastian, and Müller, Klaus-Robert. Evaluating the visualization of what a deep neural network has learned. CoRR, 2015.

Shapley, Lloyd S. and Shubik, Martin. The assignment game: the core. International Journal of Game Theory, 1(1):111–130, 1971. URL http://dx.doi.org/10.1007/BF01753437.

Shrikumar, Avanti, Greenside, Peyton, Shcherbina, Anna, and Kundaje, Anshul. Not just a black box: Learning important features through propagating activation differences. CoRR, 2016.

Shrikumar, Avanti, Greenside, Peyton, and Kundaje, Anshul. Learning important features through propagating activation differences. CoRR, abs/1704.02685, 2017. URL http://arxiv.org/abs/1704.02685.

Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, 2013.

Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin A. Striving for simplicity: The all convolutional net. CoRR, 2014.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott E., Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, 2014.

Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V., Norouzi, Mohammad, Macherey, Wolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, Klingner, Jeff, Shah, Apurva, Johnson, Melvin, Liu, Xiaobing, Kaiser, Lukasz, Gouws, Stephan, Kato, Yoshikiyo, Kudo, Taku, Kazawa, Hideto, Stevens, Keith, Kurian, George, Patil, Nishant, Wang, Wei, Young, Cliff, Smith, Jason, Riesa, Jason, Rudnick, Alex, Vinyals, Oriol, Corrado, Greg, Hughes, Macduff, and Dean, Jeffrey. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.

Yosinski, Jason, Clune, Jeff, Nguyen, Anh Mai, Fuchs, Thomas, and Lipson, Hod. Understanding neural networks through deep visualization. CoRR, 2015.

Zeiler, Matthew D. and Fergus, Rob. Visualizing and understanding convolutional networks. In ECCV, pp. 818–833, 2014.
A. Proof of Theorem 1

Proof. Consider a non-straightline path γ : [0, 1] → Rn from baseline to input. W.l.o.g., there exists t0 ∈ [0, 1] such that for two dimensions i, j, γi(t0) > γj(t0). Let (t1, t2) be the maximum real open interval containing t0 such that γi(t) > γj(t) for all t in (t1, t2), and let a = γi(t1) = γj(t1), and b = γi(t2) = γj(t2). Define the function f : x ∈ [0, 1]n → R as 0 if min(xi, xj) ≤ a, as (b − a)² if max(xi, xj) ≥ b, and as (xi − a)(xj − a) otherwise. Next we compute the attributions of f at x = ⟨1, . . . , 1⟩ with baseline x0 = ⟨0, . . . , 0⟩. Note that xi and xj are symmetric, and should get identical attributions. For t ∉ [t1, t2], the function is a constant, and the attribution of f is zero to all variables, while for t ∈ (t1, t2), the integrand of the attribution of f is γj(t) − a to xi, and γi(t) − a to xj, where the latter is always strictly larger by our choice of the interval. Integrating, it follows that xj gets a larger attribution than xi, a contradiction.

B. Attribution Counter-Examples

We show that the methods DeepLift and Layer-wise relevance propagation (LRP) break the implementation invariance axiom, and that the Deconvolution and Guided back-propagation methods break the sensitivity axiom.

Figure 7 provides an example of two equivalent networks f(x1, x2) and g(x1, x2) for which DeepLift and LRP yield different attributions.

Network f(x1, x2), attributions at x1 = 3, x2 = 1:
  Integrated gradients: x1 = 1.5, x2 = −0.5
  DeepLift: x1 = 1.5, x2 = −0.5
  LRP: x1 = 1.5, x2 = −0.5

Network g(x1, x2), attributions at x1 = 3, x2 = 1:
  Integrated gradients: x1 = 1.5, x2 = −0.5
  DeepLift: x1 = 2, x2 = −1
  LRP: x1 = 2, x2 = −1

Figure 7. Attributions for two functionally equivalent networks. The figure shows attributions for two functionally equivalent networks f(x1, x2) and g(x1, x2) at the input x1 = 3, x2 = 1 using integrated gradients, DeepLift (Shrikumar et al., 2016), and Layer-wise relevance propagation (LRP) (Binder et al., 2016). The reference input for Integrated gradients and DeepLift is x1 = 0, x2 = 0. All methods except integrated gradients provide different attributions for the two networks.

First, observe that the networks f and g are of the form f(x1, x2) = ReLU(h(x1, x2)) and g(x1, x2) = ReLU(k(x1, x2))³, where

h(x1, x2) = ReLU(x1) − 1 − ReLU(x2)
k(x1, x2) = ReLU(x1 − 1) − ReLU(x2)

³ReLU(x) is defined as max(x, 0).

Note that h and k are not equivalent. They have different values whenever x1 < 1. But f and g are equivalent. To prove this, suppose for contradiction that f and g are different for some x1, x2. Then it must be the case that ReLU(x1) − 1 ≠ ReLU(x1 − 1). This happens only when x1 < 1, which implies that f(x1, x2) = g(x1, x2) = 0.

Now we leverage the above example to show that Deconvolution and Guided back-propagation break sensitivity. Consider the network f(x1, x2) from Figure 7. For a fixed value of x1 greater than 1, the output decreases linearly as x2 increases from 0 to x1 − 1. Yet, for all inputs, Deconvolutional networks and Guided back-propagation result in zero attribution for x2. This happens because for all inputs the back-propagated signal received at the node ReLU(x2) is negative and is therefore not back-propagated through the ReLU operation (per the rules of deconvolution and guided back-propagation; see (Springenberg et al., 2014) for details). As a result, the feature x2 receives zero attribution despite the network's output being sensitive to it.
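The claims in this appendix are easy to check numerically. A small sketch (our own, plain Python, using the definitions above):

    # Sketch: f = ReLU(h) and g = ReLU(k) agree everywhere even though
    # h and k differ, per the definitions in Appendix B.
    relu = lambda z: max(z, 0.0)
    h = lambda x1, x2: relu(x1) - 1 - relu(x2)
    k = lambda x1, x2: relu(x1 - 1) - relu(x2)
    f = lambda x1, x2: relu(h(x1, x2))
    g = lambda x1, x2: relu(k(x1, x2))

    grid = [i / 4.0 for i in range(-8, 9)]              # points in [-2, 2]
    assert all(abs(f(a, b) - g(a, b)) < 1e-12 for a in grid for b in grid)
    # h and k themselves differ, e.g. at x1 = x2 = 0: h = -1 but k = 0.
    assert h(0.0, 0.0) != k(0.0, 0.0)
    # f is genuinely sensitive to x2: for x1 = 3, raising x2 lowers the output.
    assert f(3.0, 0.0) > f(3.0, 1.0)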
