Axiomatic Attribution for Deep Networks

* Equal contribution. 1 Google Inc., Mountain View, USA. Correspondence to: Mukund Sundararajan <[email protected]>, Ankur Taly <[email protected]>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
Abstract

We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms, Sensitivity and Implementation Invariance, that attribution methods ought to satisfy. We show that they are not satisfied by most known attribution methods, which we consider to be a fundamental weakness of those methods. We use the axioms to guide the design of a new attribution method called Integrated Gradients. Our method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator. We apply this method to a couple of image models, a couple of text models, and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.

1. Motivation and Summary of Results

We study the problem of attributing the prediction of a deep network to its input features.

Definition 1. Formally, suppose we have a function F : R^n → [0, 1] that represents a deep network, and an input x = (x_1, . . . , x_n) ∈ R^n. An attribution of the prediction at input x relative to a baseline input x' is a vector A_F(x, x') = (a_1, . . . , a_n) ∈ R^n where a_i is the contribution of x_i to the prediction F(x).

For instance, in an object recognition network, an attribution method could tell us which pixels of the image were responsible for a certain label being picked (see Figure 2). The attribution problem was previously studied by various papers (Baehrens et al., 2010; Simonyan et al., 2013, among others).

The intention of these works is to understand the input-output behavior of the deep network, which gives us the ability to improve it. Such understandability is critical to all computer programs, including machine learning models. There are also other applications of attribution. They could be used within a product driven by machine learning to provide a rationale for the recommendation. For instance, a deep network that predicts a condition based on imaging could help inform the doctor of the part of the image that resulted in the recommendation. This could help the doctor understand the strengths and weaknesses of a model and compensate for it. We give such an example in Section 6.2. Attributions could also be used by developers in an exploratory sense. For instance, we could use a deep network to extract insights that could then be used in a rule-based system. In Section 6.3, we give such an example.

A significant challenge in designing an attribution technique is that attributions are hard to evaluate empirically. As we discuss in Section 4, it is hard to tease apart errors that stem from the misbehavior of the model versus the misbehavior of the attribution method. To compensate for this shortcoming, we take an axiomatic approach. In Section 2 we identify two axioms that every attribution method must satisfy. Unfortunately, most previous methods do not satisfy one of these two axioms. In Section 3, we use the axioms to identify a new method, called integrated gradients.

Unlike previously proposed methods, integrated gradients does not need any instrumentation of the network, and can be computed easily using a few calls to the gradient operation, allowing even novice practitioners to easily apply the technique.

In Section 6, we demonstrate the ease of applicability over several deep networks, including two image networks, two text processing networks, and a chemistry network. These applications demonstrate the use of our technique in improving our understanding of the network, performing debugging, performing rule extraction, or aiding an end user in understanding the network's prediction.

Remark 1. Let us briefly examine the need for the baseline in the definition of the attribution problem. A common way for humans to perform attribution relies on counterfactual intuition.
Unfortunately, the chain rule does not hold for discrete gradients in general. Formally,

$$\frac{f(x_1)-f(x_0)}{g(x_1)-g(x_0)} \;\neq\; \frac{f(x_1)-f(x_0)}{h(x_1)-h(x_0)} \cdot \frac{h(x_1)-h(x_0)}{g(x_1)-g(x_0)},$$

and therefore these methods fail to satisfy Implementation Invariance.
If an attribution method fails to satisfy Implementation Invariance, the attributions are potentially sensitive to unimportant aspects of the models. For instance, if the network architecture has more degrees of freedom than needed to represent a function, then there may be two sets of values for the network parameters that lead to the same function. The training procedure can converge at either set of values depending on the initialization or for other reasons, but the underlying network function would remain the same. It is undesirable that attributions differ for such reasons.

3. Our Method: Integrated Gradients

We are now ready to describe our technique. Intuitively, our technique combines the Implementation Invariance of Gradients along with the Sensitivity of techniques like LRP or DeepLift.
Formally, suppose we have a function F : R^n → [0, 1] that represents a deep network. Specifically, let x ∈ R^n be the input at hand, and x' ∈ R^n be the baseline input. For image networks, the baseline could be the black image, while for text models it could be the zero embedding vector.

We consider the straightline path (in R^n) from the baseline x' to the input x, and compute the gradients at all points along the path. Integrated gradients are obtained by cumulating these gradients. Specifically, integrated gradients are defined as the path integral of the gradients along the straightline path from the baseline x' to the input x.

The integrated gradient along the i-th dimension for an input x and baseline x' is defined as follows. Here, ∂F(x)/∂x_i is the gradient of F(x) along the i-th dimension.

$$\text{IntegratedGrads}_i(x) ::= (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\big(x' + \alpha \times (x - x')\big)}{\partial x_i}\, d\alpha \qquad (1)$$
Axiom: Completeness. Integrated gradients satisfy an axiom called completeness: the attributions add up to the difference between the output of F at the input x and at the baseline x'. This axiom is identified as being desirable by DeepLift and LRP. It is a sanity check that the attribution method is somewhat comprehensive in its accounting, a property that is clearly desirable if the network's score is used in a numeric sense, and not just to pick the top label, e.g., a model estimating insurance premiums from credit features of individuals.

This is formalized by the proposition below, which instantiates the fundamental theorem of calculus for path integrals.

Proposition 1. If F : R^n → R is differentiable almost everywhere¹, then

$$\sum_{i=1}^{n} \text{IntegratedGrads}_i(x) = F(x) - F(x')$$

¹ Formally, this means the function F is continuous everywhere and the partial derivative of F along each input dimension satisfies Lebesgue's integrability condition, i.e., the set of discontinuous points has measure zero. Deep networks built out of Sigmoids, ReLUs, and pooling operators satisfy this condition.

For most deep networks, it is possible to choose a baseline such that the prediction at the baseline is near zero (F(x') ≈ 0). (For image models, the black image baseline indeed satisfies this property.) In such cases, there is an interpretation of the resulting attributions that ignores the baseline and amounts to distributing the output to the individual input features.

Remark 2. Integrated gradients satisfies Sensitivity(a) because Completeness implies Sensitivity(a); Completeness is thus a strengthening of the Sensitivity(a) axiom. This is because Sensitivity(a) refers to a case where the baseline and the input differ only in one variable, for which Completeness asserts that the difference in the two output values is equal to the attribution to this variable. Attributions generated by integrated gradients satisfy Implementation Invariance since they are based only on the gradients of the function represented by the network.
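To illustrate Proposition 1 concretely, the following sketch (ours, not from the paper) approximates Eq. (1) with a midpoint Riemann sum for a toy two-variable function and checks that the attributions sum to F(x) − F(x'); the function F and the chosen inputs are arbitrary stand-ins.

```python
import numpy as np

def F(x):
    # Toy differentiable "network": a squashed quadratic form.
    return 1.0 / (1.0 + np.exp(-(x[0] ** 2 + 2.0 * x[0] * x[1])))

def grad_F(x, eps=1e-6):
    # Numerical gradient of F via central differences, one entry per input dimension.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (F(x + d) - F(x - d)) / (2 * eps)
    return g

def integrated_gradients(x, baseline, m=1000):
    # Riemann-sum approximation of Eq. (1) along the straight line from baseline to x.
    alphas = (np.arange(m) + 0.5) / m
    avg_grad = np.mean([grad_F(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x, baseline = np.array([1.0, 2.0]), np.zeros(2)
attr = integrated_gradients(x, baseline)
print(attr.sum(), F(x) - F(baseline))  # the two values agree up to discretization error
```

Completeness holds for any F satisfying the conditions of Proposition 1; the baseline need not score zero for the two quantities to match.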
4. Uniqueness of Integrated Gradients

Prior literature has relied on empirically evaluating the attribution technique. For instance, in the context of an object recognition task, (Samek et al., 2015) suggests that we select the top k pixels by attribution, randomly vary their intensities, and then measure the drop in score. If the attribution method is good, then the drop in score should be large. However, the images resulting from pixel perturbation could be unnatural, and it could be that the scores drop simply because the network has never seen anything like it in training. (This is less of a concern with linear or logistic models where the simplicity of the model ensures that ablating a feature does not cause strange interactions.)

A different evaluation technique considers images with human-drawn bounding boxes around objects, and computes the percentage of pixel attribution inside the box. While for most objects one would expect the pixels located on the object to be most important for the prediction, in some cases the context in which the object occurs may also contribute to the prediction. The cabbage butterfly image from Figure 2 is a good example of this, where the pixels on the leaf are also surfaced by the integrated gradients.

Roughly, we found that every empirical evaluation technique we could think of could not differentiate between artifacts that stem from perturbing the data, a misbehaving model, and a misbehaving attribution method. This was why we turned to an axiomatic approach in designing a good attribution method (Section 2). While our method satisfies Sensitivity and Implementation Invariance, it certainly isn't the unique method to do so.

We now justify the selection of the integrated gradients method in two steps. First, we identify a class of methods called Path methods that generalize integrated gradients. We discuss that path methods are the only methods to satisfy certain desirable axioms. Second, we argue why integrated gradients is somehow canonical among the different path methods.

4.1. Path Methods

Integrated gradients aggregate the gradients along the inputs that fall on the straightline between the baseline and the input. There are many other (non-straightline) paths that monotonically interpolate between the two points, and each such path will yield a different attribution method. For instance, consider the simple case when the input is two dimensional. Figure 1 has examples of three paths, each of which corresponds to a different attribution method.

Figure 1. Three paths between a baseline (r1, r2) and an input (s1, s2). Each path corresponds to a different attribution method. The path P2 corresponds to the path used by integrated gradients.

Formally, let γ = (γ_1, . . . , γ_n) : [0, 1] → R^n be a smooth function specifying a path in R^n from the baseline x' to the input x, i.e., γ(0) = x' and γ(1) = x.

Given a path function γ, path integrated gradients are obtained by integrating the gradients along the path γ(α) for α ∈ [0, 1]. Formally, path integrated gradients along the i-th dimension for an input x are defined as follows.

$$\text{PathIntegratedGrads}_i^{\gamma}(x) ::= \int_{\alpha=0}^{1} \frac{\partial F(\gamma(\alpha))}{\partial \gamma_i(\alpha)} \times \frac{\partial \gamma_i(\alpha)}{\partial \alpha}\, d\alpha \qquad (2)$$

where ∂F(x)/∂x_i is the gradient of F along the i-th dimension at x.

Attribution methods based on path integrated gradients are collectively known as path methods. More interestingly, path methods are the only methods that satisfy certain desirable axioms. (For formal definitions of the axioms and proof of Proposition 2, see Friedman (Friedman, 2004).)

Axiom: Sensitivity(b). (called Dummy in (Friedman, 2004)) If the function implemented by the deep network does not depend (mathematically) on some variable, then the attribution to that variable is always zero.

This is a natural complement to the definition of Sensitivity(a) from Section 2. This definition captures desired insensitivity of the attributions.

Axiom: Linearity. Suppose that we linearly compose two deep networks modeled by the functions f1 and f2 to form a third network that models the function a×f1 + b×f2, i.e., a linear combination of the two networks. Then we'd like the attributions for a×f1 + b×f2 to be the weighted sum of the attributions for f1 and f2 with weights a and b respectively. Intuitively, we would like the attributions to preserve any linearity within the network.

Proposition 2. (Theorem 1 (Friedman, 2004)) Path methods are the only attribution methods that always satisfy Implementation Invariance, Sensitivity(b), Linearity, and Completeness.

Remark 4. We note that these path integrated gradients have been used within the cost-sharing literature in economics, where the function models the cost of a project as a function of the demands of various participants, and the attributions correspond to cost-shares. Integrated gradients correspond to a cost-sharing method called Aumann-Shapley (Aumann & Shapley, 1974). Proposition 2 holds for our attribution problem because mathematically the cost-sharing problem corresponds to the attribution problem with the benchmark fixed at the zero vector. (Implementation Invariance is implicit in the cost-sharing literature as the cost functions are considered directly in their mathematical form.)
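To make Eq. (2) concrete, here is a small sketch (ours, not from the paper) that evaluates path integrated gradients for the toy function F(x1, x2) = x1 · x2 along two different monotone paths from the baseline (0, 0) to the input (1, 2). Both attribution vectors sum to F(x) − F(x') as Completeness requires, but they split the total differently, which is exactly why Section 4.2 singles out the straightline path.

```python
import numpy as np

def path_integrated_grads(grad_F, gamma, d_gamma, m=5000):
    # Riemann-sum version of Eq. (2): integrate (dF/dx_i at gamma(alpha)) * (dgamma_i/dalpha).
    alphas = (np.arange(m) + 0.5) / m
    return sum(grad_F(gamma(a)) * d_gamma(a) for a in alphas) / m

grad_F = lambda z: np.array([z[1], z[0]])        # gradient of F(x1, x2) = x1 * x2
x, baseline = np.array([1.0, 2.0]), np.zeros(2)

paths = {
    "straightline": (lambda a: baseline + a * (x - baseline),   # gamma(a)
                     lambda a: x - baseline),                    # dgamma/dalpha
    "curved":       (lambda a: np.array([a ** 2, 2.0 * a]),      # still gamma(0) = x', gamma(1) = x
                     lambda a: np.array([2.0 * a, 2.0])),
}

for name, (gamma, d_gamma) in paths.items():
    attr = path_integrated_grads(grad_F, gamma, d_gamma)
    print(name, attr, attr.sum())  # straightline: [1, 1]; curved: [4/3, 2/3]; both sum to 2
```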
4.2. Integrated Gradients is Symmetry-Preserving

In this section, we formalize why the straightline path chosen by integrated gradients is canonical. First, observe that it is the simplest path that one can define mathematically.
Second, a natural property for attribution methods is to preserve symmetry, in the following sense.

Symmetry-Preserving. Two input variables are symmetric w.r.t. a function if swapping them does not change the function. For instance, x and y are symmetric w.r.t. F if and only if F(x, y) = F(y, x) for all values of x and y. An attribution method is symmetry-preserving if, for all inputs that have identical values for symmetric variables and baselines that have identical values for symmetric variables, the symmetric variables receive identical attributions.

E.g., consider the logistic model Sigmoid(x1 + x2 + . . . ). x1 and x2 are symmetric variables for this model. For an input where x1 = x2 = 1 (say) and a baseline where x1 = x2 = 0 (say), a symmetry-preserving method must offer identical attributions to x1 and x2.

It seems natural to ask for symmetry-preserving attribution methods because if two variables play the exact same role in the network (i.e., they are symmetric and have the same values in the baseline and the input) then they ought to receive the same attribution.

Theorem 1. Integrated gradients is the unique path method that is symmetry-preserving.

The proof is provided in Appendix A.

Remark 5. If we allow averaging over the attributions from multiple paths, then there are other methods that satisfy all the axioms in Theorem 1. In particular, there is the method by Shapley-Shubik (Shapley & Shubik, 1971) from the cost-sharing literature, used by (Lundberg & Lee, 2016; Datta et al., 2016) to compute feature attributions (though they were not studying deep networks). In this method, the attribution is the average of those from n! extremal paths; here n is the number of features. Each such path considers an ordering of the input features, and sequentially changes each input feature from its value at the baseline to its value at the input. This method yields attributions that are different from integrated gradients. If the function of interest is min(x1, x2), the baseline is x1 = x2 = 0, and the input is x1 = 1, x2 = 3, then integrated gradients attributes the change in the function value entirely to the critical variable x1, whereas Shapley-Shubik assigns attributions of 1/2 each; it seems somewhat subjective to prefer one result over the other.
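The sketch below (ours, not from the paper) reproduces this comparison numerically for min(x1, x2) with baseline (0, 0) and input (1, 3); the gradient of min is taken almost everywhere, as Proposition 1 permits.

```python
import numpy as np
from itertools import permutations

F = lambda z: min(z[0], z[1])
x, baseline = np.array([1.0, 3.0]), np.zeros(2)

# Integrated gradients: Riemann sum along the straight line; on this path x1 <= x2
# throughout, so the (almost-everywhere) gradient of min is (1, 0).
grad_F = lambda z: np.array([1.0, 0.0]) if z[0] <= z[1] else np.array([0.0, 1.0])
alphas = (np.arange(1000) + 0.5) / 1000
ig = (x - baseline) * np.mean([grad_F(baseline + a * (x - baseline)) for a in alphas], axis=0)

# Shapley-Shubik: average over the n! orderings of the features, moving one feature
# at a time from its baseline value to its input value.
ss = np.zeros(2)
for order in permutations(range(2)):
    z = baseline.copy()
    for i in order:
        before = F(z)
        z[i] = x[i]
        ss[i] += (F(z) - before) / 2   # divide by the number of orderings (2! = 2)

print(ig)  # [1. 0.]  : the change is attributed entirely to the critical variable x1
print(ss)  # [0.5 0.5]: Shapley-Shubik splits it equally
```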
We also envision other issues with applying Shapley-Shubik to deep networks: It is computationally expensive; in an object recognition network that takes a 100×100 image as input, n is 10000, and n! is a gigantic number. Even if one samples a few paths randomly, evaluating the attributions for a single path takes n calls to the deep network. In contrast, integrated gradients is able to operate with 20 to 300 calls. Further, the Shapley-Shubik computation visits inputs that are combinations of the input and the baseline. It is possible that some of these combinations are very different from anything seen during training. We speculate that this could lead to attribution artifacts.

5. Applying Integrated Gradients

Selecting a Benchmark. A key step in applying integrated gradients is to select a good baseline. We recommend that developers check that the baseline has a near-zero score; as discussed in Section 3, this allows us to interpret the attributions as a function of the input. But there is more to a good baseline: For instance, for an object recognition network it is possible to create an adversarial example that has a zero score for a given input label (say elephant), by applying a tiny, carefully-designed perturbation to an image with a very different label (say microscope) (cf. (Goodfellow et al., 2015)). The attributions can then include undesirable artifacts of this adversarially constructed baseline. So we would additionally like the baseline to convey a complete absence of signal, so that the features that are apparent from the attributions are properties only of the input, and not of the baseline. For instance, in an object recognition network, a black image signifies the absence of objects. The black image isn't unique in this sense: an image consisting of noise has the same property. However, using black as a baseline may result in cleaner visualizations of "edge" features. For text-based networks, we have found that the all-zero input embedding vector is a good baseline. The action of training causes unimportant words to tend to have small norms, and so, in the limit, unimportance corresponds to the all-zero baseline. Notice that the black image corresponds to a valid input to an object recognition network, and is also intuitively what we humans would consider absence of signal. In contrast, the all-zero input vector for a text network does not correspond to a valid input; it nevertheless works for the mathematical reason described above.
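A minimal version of this check (ours; the score function here is a toy stand-in for a real network's class score) is simply to evaluate the candidate baseline before computing any attributions:

```python
import numpy as np

def check_baseline(score_fn, baseline, tol=1e-2):
    # score_fn should return the network's score for the class being explained.
    score = float(score_fn(baseline))
    ok = abs(score) <= tol
    print(f"baseline score = {score:.4f}" + ("" if ok else " (not near zero; reconsider the baseline)"))
    return ok

toy_score = lambda z: np.tanh(z.sum())               # stand-in for a real class score
check_baseline(toy_score, np.zeros((224, 224, 3)))    # the black-image baseline scores zero here
```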
Computing Integrated Gradients. The integral of integrated gradients can be efficiently approximated via a summation. We simply sum the gradients at points occurring at sufficiently small intervals along the straightline path from the baseline x' to the input x.

$$\text{IntegratedGrads}_i^{\text{approx}}(x) ::= (x_i - x'_i) \times \sum_{k=1}^{m} \frac{\partial F\big(x' + \tfrac{k}{m} \times (x - x')\big)}{\partial x_i} \times \frac{1}{m} \qquad (3)$$

Here m is the number of steps in the Riemann approximation of the integral. Notice that the approximation simply involves computing the gradient in a for loop, which should be straightforward and efficient in most deep learning frameworks. For instance, in TensorFlow, it amounts to calling tf.gradients in a loop over the set of inputs (i.e., x' + (k/m) × (x − x') for k = 1, . . . , m).
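The recipe of calling tf.gradients in a loop translates directly to current TensorFlow; the sketch below is one way to write it (ours, using tf.GradientTape rather than the older tf.gradients API), where score_fn is assumed to map an input tensor to the scalar score F being explained.

```python
import tensorflow as tf

def integrated_gradients(score_fn, x, baseline, m=50):
    # Riemann-sum approximation of Eq. (3): average the gradients at the m points
    # x' + (k/m)(x - x') and scale the average by (x - x').
    total = tf.zeros_like(x)
    for k in range(1, m + 1):
        point = baseline + (k / m) * (x - baseline)
        with tf.GradientTape() as tape:
            tape.watch(point)          # point is a plain tensor, so ask the tape to track it
            score = score_fn(point)
        total += tape.gradient(score, point)
    return (x - baseline) * total / m
```

With m in the 20 to 300 range mentioned earlier, the resulting attributions should approximately satisfy Completeness; increasing m tightens the approximation, and the m interpolated inputs can also be batched through the network in a single call.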
6. Applications

The integrated gradients technique is applicable to a variety of deep networks. Here, we apply it to two image models, two natural language models, and a chemistry model.
6.3. Question Classification

Automatically answering natural language questions (over semi-structured data) is an important problem in artificial intelligence (AI). A common approach is to semantically parse the question to its logical form (Liang, 2016) using a set of human-authored grammar rules. An alternative approach is to machine learn an end-to-end model, provided there is enough training data. An interesting question is whether one could peek inside machine-learnt models to derive new rules. We explore this direction for a sub-problem of semantic parsing, called question classification, using the method of integrated gradients.

The goal of question classification is to identify the type of answer the question is seeking. For instance, is the question seeking a yes/no answer, or is it seeking a date? Rules for solving this problem look for trigger phrases in the question; for example, a "when" at the beginning indicates a date-seeking question. We train a model for question classification using the text categorization architecture proposed by (Kim, 2014) over the WikiTableQuestions dataset (Pasupat & Liang, 2015). We use integrated gradients to attribute predictions down to the question terms in order to identify new trigger phrases for answer types. The baseline input is the all-zero embedding vector.
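For text models like this one, attributions are computed with respect to the token embeddings, and a per-term score can then be obtained by summing each token's attribution over its embedding dimensions. The sketch below is our illustration of that bookkeeping with the all-zero embedding baseline; the embedding table, token ids, and score function are hypothetical stand-ins, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(50, 8))        # hypothetical embedding table (vocab 50, dim 8)
token_ids = np.array([7, 3, 42])            # hypothetical tokenized question
w = rng.normal(size=8)                      # toy classifier weights over mean-pooled embeddings

def score(emb):                              # stand-in for the question classifier's class score
    return 1.0 / (1.0 + np.exp(-emb.mean(axis=0) @ w))

def grad_score(emb, eps=1e-6):               # numerical gradient w.r.t. the embedding matrix
    g = np.zeros_like(emb)
    for idx in np.ndindex(emb.shape):
        d = np.zeros_like(emb)
        d[idx] = eps
        g[idx] = (score(emb + d) - score(emb - d)) / (2 * eps)
    return g

x = embedding[token_ids]                     # input embeddings, shape (num_tokens, dim)
baseline = np.zeros_like(x)                  # the all-zero embedding baseline
alphas = (np.arange(100) + 0.5) / 100
ig = (x - baseline) * np.mean([grad_score(baseline + a * (x - baseline)) for a in alphas], axis=0)
per_token = ig.sum(axis=1)                   # one attribution per question term
print(per_token, per_token.sum(), score(x) - score(baseline))  # per-term scores; totals match (Completeness)
```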
6.4. Machine Translation

We also apply integrated gradients to a language translation model, attributing each output token to the input tokens (see Figure 5). For the baseline, we zero out the embeddings of all tokens except the start and end markers. Figure 5 shows an example of such an attribution-based alignment. We observed that the results make intuitive sense. E.g., "und" is mostly attributed to "and", and "morgen" is mostly attributed to "morning". We use 100–1000 steps (cf. Section 5) in the integrated gradients approximation; we need this because the network is highly nonlinear.

Figure 5. Attributions from a language translation model. Input in English: "good morning ladies and gentlemen". Output in German: "Guten Morgen Damen und Herren". Both input and output are tokenized into word pieces, where a word piece prefixed by an underscore indicates that it should be the prefix of a word.
References

Baehrens, David, Schroeter, Timon, Harmeling, Stefan, Kawanabe, Motoaki, Hansen, Katja, and Müller, Klaus-Robert. How to explain individual classification decisions. Journal of Machine Learning Research, pp. 1803–1831, 2010.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.

Binder, Alexander, Montavon, Grégoire, Bach, Sebastian, Müller, Klaus-Robert, and Samek, Wojciech. Layer-wise relevance propagation for neural networks with local renormalization layers. CoRR, 2016.

Datta, A., Sen, S., and Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 598–617, 2016.

Dosovitskiy, Alexey and Brox, Thomas. Inverting visual representations with convolutional networks, 2015.

Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal, 2009.

Friedman, Eric J. Paths and consistency in additive cost sharing. International Journal of Game Theory, 32(4):501–518, 2004.

Goodfellow, Ian, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.

Gulshan, Varun, Peng, Lily, Coram, Marc, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.

Kearnes, Steven, McCloskey, Kevin, Berndl, Marc, Pande, Vijay, and Riley, Patrick. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, pp. 595–608, 2016.

Kim, Yoon. Convolutional neural networks for sentence classification. In ACL, 2014.

Le, Quoc V. Building high-level features using large scale unsupervised learning. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 8595–8598, 2013.

Liang, Percy. Learning executable semantic parsers for natural language understanding. Commun. ACM, 59(9):68–76, 2016.

Lundberg, Scott and Lee, Su-In. An unexpected unity among methods for interpreting model predictions. CoRR, abs/1611.07478, 2016. URL http://arxiv.org/abs/1611.07478.

Mahendran, Aravindh and Vedaldi, Andrea. Understanding deep image representations by inverting them. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5188–5196, 2015.

Pasupat, Panupong and Liang, Percy. Compositional semantic parsing on semi-structured tables. In ACL, 2015.

Ribeiro, Marco Túlio, Singh, Sameer, and Guestrin, Carlos. "Why should I trust you?": Explaining the predictions of any classifier. In 22nd ACM International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016a.

Ribeiro, Marco Túlio, Singh, Sameer, and Guestrin, Carlos. Model-agnostic interpretability of machine learning. CoRR, 2016b.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pp. 211–252, 2015.

Samek, Wojciech, Binder, Alexander, Montavon, Grégoire, Bach, Sebastian, and Müller, Klaus-Robert. Evaluating the visualization of what a deep neural network has learned. CoRR, 2015.

Shapley, Lloyd S. and Shubik, Martin. The assignment game: the core. International Journal of Game Theory, 1(1):111–130, 1971. URL http://dx.doi.org/10.1007/BF01753437.

Shrikumar, Avanti, Greenside, Peyton, Shcherbina, Anna, and Kundaje, Anshul. Not just a black box: Learning important features through propagating activation differences. CoRR, 2016.

Shrikumar, Avanti, Greenside, Peyton, and Kundaje, Anshul. Learning important features through propagating activation differences. CoRR, abs/1704.02685, 2017. URL http://arxiv.org/abs/1704.02685.

Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, 2013.

Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin A. Striving for simplicity: The all convolutional net. CoRR, 2014.