
Explaining Convolutional Neural Networks through Attribution-Based Input Sampling and Block-Wise Feature Aggregation

Sam Sattarzadeh,1 Mahesh Sudhakar,1 Anthony Lem,2 Shervin Mehryar,1 K. N. Plataniotis,1 Jongseong Jang,3 Hyunwoo Kim,3 Yeonjeong Jeong,3 Sangmin Lee,3 Kyunghoon Bae3

1 The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto
2 Division of Engineering Science, University of Toronto
3 LG AI Research
sam.sattarzadeh, [email protected]; j.jang, [email protected]

arXiv:2010.00672v2 [cs.CV] 24 Dec 2020

Abstract

As an emerging field in Machine Learning, Explainable AI (XAI) has been offering remarkable performance in interpreting the decisions made by Convolutional Neural Networks (CNNs). To achieve visual explanations for CNNs, methods based on class activation mapping and randomized input sampling have gained great popularity. However, the attribution methods based on these techniques provide low-resolution and blurry explanation maps that limit their explanation ability. To circumvent this issue, visualization based on various layers is sought. In this work, we collect visualization maps from multiple layers of the model based on an attribution-based input sampling technique and aggregate them to reach a fine-grained and complete explanation. We also propose a layer selection strategy that applies to the whole family of CNN-based models, based on which our extraction framework is applied to visualize the last layers of each convolutional block of the model. Moreover, we perform an empirical analysis of the efficacy of the derived lower-level information in enhancing the represented attributions. Comprehensive experiments conducted on shallow and deep models trained on natural and industrial datasets, using both ground truth-based and model truth-based evaluation metrics, validate our proposed algorithm by meeting or outperforming the state-of-the-art methods in terms of explanation ability and visual quality, demonstrating that our method shows stability regardless of the size of the objects or instances to be explained.

Figure 1: Comparison of conventional XAI methods with SISE (our proposed) to demonstrate SISE's ability to generate class-discriminative explanations on a ResNet-50 model. (Columns: Input Image, Grad-CAM, Score-CAM, RISE, SISE; rows: Horse 0.9956, Person 0.0021.)

Introduction

Deep neural models based on Convolutional Neural Networks (CNNs) have rendered inspiring breakthroughs in a wide variety of computer vision tasks. However, their lack of interpretability hinders the understanding of the decisions made by these models. This diminishes the trust consumers have in CNNs and limits the interactions between users and systems built on such models. Explainable AI (XAI) attempts to interpret these cumbersome models (Hoffman et al. 2018). The offered interpretation ability has put XAI at the center of attention in various fields, especially where any single false prediction can cause severe consequences (e.g., healthcare) or where regulations force automated decision-making systems to provide users with explanations (e.g., criminal justice) (Lipton 2018).

This work particularly addresses the problem of visual explainability, which is a branch of post-hoc XAI. This field aims to visualize the behavior of models trained for image recognition tasks (Barredo Arrieta et al. 2019). The outcome of these methods is a heatmap of the same size as the input image, named an "explanation map", representing the evidence leading the model to its decision.

Prior works on visual explainable AI can be broadly categorized into 'approximation-based' (Ribeiro, Singh, and Guestrin 2016), 'backpropagation-based', 'perturbation-based', and 'CAM-based' methodologies. In backpropagation-based methods, only the local attributions are represented, making them unable to measure global sensitivity. This drawback is addressed by the image perturbation techniques used in recent works such as RISE (Petsiuk, Das, and Saenko 2018) and Score-CAM (Wang et al. 2020). However, feedforwarding several perturbed images makes these methods very slow. On the other hand, explanation maps produced by CAM-based methods suffer from a lack of spatial resolution, as they are formed by combining the feature maps in the last convolutional layer of CNNs, which lack spatial information regarding the captured attributions.

In this work, we delve deeper into providing a solution for interpreting CNN-based models by analyzing multiple layers of the network. Our solution concentrates on the mutual utilization of features represented inside a CNN at different semantic levels, achieving class discriminability and spatial resolution simultaneously. Inheriting productive ideas from the aforementioned types of approaches, we formulate a four-phase explanation method. In the first three phases, information extracted from multiple layers of the CNN is represented in their accompanying visualization maps. These maps are then combined via a fusion module to form a unique explanation map in the last phase. The main contributions of our work can be summarized as follows:

• We introduce a novel XAI algorithm that offers both spatial resolution and explanation completeness in its output explanation map by 1) using multiple layers from the "intermediate blocks" of the target CNN, 2) selecting crucial feature maps from the outputs of these layers, 3) employing an attribution-based technique for input sampling to visualize the perspective of each layer, and 4) applying a feature aggregation step to reach refined explanation maps.

• We propose a strategy to select the minimum number of intermediate layers from a given CNN to probe and visualize their discovered features, in order to provide local explanations of the whole CNN. We discuss the applicability of this strategy to all feedforward CNNs.

• We conduct thorough experiments on various models trained on object detection and industrial anomaly classification datasets. To justify our method, we employ various metrics to compare it with other conventional approaches, showing that the information between layers can be correctly combined to improve the visual explainability of the model's inference.

Related Work

Backpropagation-based methods  In general, calculating the gradient of a model's output with respect to the input features or the hidden neurons is the basis of this type of explanation algorithm. The earliest backpropagation-based methods operate by directly computing the sensitivity of the model's confidence score to each of the input features (Simonyan, Vedaldi, and Zisserman 2014; Zeiler and Fergus 2014). Building on such methods, works like DeepLIFT (Shrikumar, Greenside, and Kundaje 2017), Integrated Gradients (Sundararajan, Taly, and Yan 2017), and SmoothGrad (Smilkov et al. 2017) adapt the backpropagation equations to tackle gradient issues. Some approaches such as LRP (Bach et al. 2015), SGLRP (Iwana, Kuroki, and Uchida 2019), and RAP (Nam et al. 2020) modify the backpropagation rules to measure the relevance or irrelevance of the input features to the model's prediction. Moreover, FullGrad (Srinivas and Fleuret 2019) and Excitation Backpropagation (Zhang et al. 2018) operate by aggregating gradient information from several layers of the network.

Perturbation-based methods  Several visual explanation methods probe the model's behavior using perturbed copies of the input. In general, various strategies can be chosen to perform input sampling. Like RISE (Petsiuk, Das, and Saenko 2018), a few of these approaches propose random perturbation techniques to yield strong approximations of explanations. In Extremal Perturbation (Fong, Patrick, and Vedaldi 2019), an optimization problem is formulated to optimize a smooth perturbation mask maximizing the model's output confidence score. A noticeable property of most perturbation-based methods is that they treat the model as a "black box" instead of a "white box."

CAM-based methods  Based on the Class Activation Mapping method (Zhou et al. 2016), an extensive research effort has been put into blending the high-level features extracted by CNNs into a unique explanation map. CAM-based methods operate in three steps: 1) feeding the model with the input image, 2) scoring the feature maps in the last convolutional layer, and 3) combining the feature maps using the computed scores as weights. Grad-CAM (Selvaraju et al. 2017) and Grad-CAM++ (Chattopadhay et al. 2018) utilize backpropagation in the second step, which causes underestimation of sensitivity information due to gradient issues. Ablation-CAM (Ramaswamy et al. 2020), Smooth Grad-CAM++ (Omeiza et al. 2019), and Score-CAM (Wang et al. 2020) have been developed to overcome these drawbacks. Despite the strength of CAM-based methods in capturing the features extracted in CNNs, the lack of localization information in the coarse high-level feature maps limits their performance by producing blurry explanations. Also, upsampling low-dimension feature maps to the size of input images distorts the location of the captured features in some cases. Some recent works (Meng et al. 2019; Rebuffi et al. 2020) addressed these limitations by amalgamating visualization maps obtained from multiple layers to achieve a fair trade-off between spatial resolution and class-distinctiveness of the features forming the explanation maps.

Methodology

Our proposed algorithm is motivated by methods aiming to interpret the model's prediction using input sampling techniques. These methods have shown great faithfulness in rationally inferring the predictions of models. However, they suffer from instability, as their output depends on random sampling (RISE) or on random initialization for optimizing a perturbation mask (Extremal Perturbation). Also, such algorithms require an excessive runtime to provide their users with generalized results. To address these limitations, we advance a CNN-specific algorithm that improves their fidelity and plausibility (in the view of reasoning) with an adaptive computational overhead for practical usage. We term our algorithm Semantic Input Sampling for Explanation (SISE). To achieve such a reform, we replace the randomized input sampling technique in RISE with a sampling technique that relies on the feature maps derived from multiple layers. We call this procedure attribution-based input sampling and show that it provides the perspective of the model at various semantic levels, though restricting the applicability of SISE to CNNs.
As sketched in Figs. 3 and 5, SISE consists of four phases. In the first phase, multiple layers of the model are selected, and a set of corresponding output feature maps is extracted. For each set of feature maps, in the second phase, a subset containing the most important feature maps is sampled with a backward pass. The selected feature maps are then post-processed to create sets of perturbation masks, termed attribution masks, to be utilized in the third phase for attribution-based input sampling. The first three phases are applied to multiple layers of the CNN to output a 2-dimensional saliency map, named a visualization map, for each layer. The obtained visualization maps are aggregated in the last phase to reach the final explanation map.

In the following section, we present a block-wise layer selection policy, showing that the richest knowledge in any CNN can be derived by probing the output of (the last layer in) each convolutional block, followed by the discussion of the phase-by-phase methodology of SISE.

Figure 2: Architecture of the residual convolutional blocks as in (Shen, Ma, and Li 2018). (a) Raveled schematic of a residual network; (b) unraveled view of the residual network.
Block-Wise Feature Explanation

As we attempt to visualize multiple layers of the CNN to merge the spatial and semantic information discovered by the CNN-based model, we intend to define the most crucial layers for explicating the model's decisions, so as to reach a complete understanding of the model by visualizing the minimum number of layers.

Regardless of the specification of their architecture, all types of CNNs consist of convolutional blocks connected via pooling layers that aid the network to justify the existence of semantic instances. Each convolutional block is formed by cascading multiple layers, which may vary from a simple convolutional filter to more complex structures (e.g., bottleneck or MBConv layers); however, the dimensions of their input and output signals are the same. In a convolutional block, assuming the number of layers to be $L$, each $i$-th layer can be represented with a function $f_i(\cdot)$, where $i \in \{1, \dots, L\}$. Denoting the input to the $i$-th layer as $y_i$, the whole block can be described as $F(y_1) = f_L(y_L)$. For plain CNNs (e.g., VGG, GoogleNet), the output of each convolutional block can be represented with the equation below:

$F(y_1) = f_L(f_{L-1}(\dots(f_1(y_1)))) \qquad (1)$

After the emergence of residual networks that utilize skip-connection layers to propagate the signals through a convolutional block, as in families such as ResNet, DenseNet, and EfficientNet (Tan and Le 2019; Huang et al. 2017; Sandler et al. 2018) and in models whose architecture is adaptively learned (Zoph and Le 2016), it is argued that such neural networks should be represented with a more complicated view. These types of networks can be viewed from the unraveled perspective, as presented in (Veit, Wilber, and Belongie 2016). Based on this perspective, as in Fig. 2, the connection between the input and output is formulated as follows:

$y_{i+1} = f_i(y_i) + y_i \qquad (2)$

and hence,

$F(y_1) = y_1 + f_1(y_1) + \dots + f_L(y_1 + \dots + f_{L-1}(y_{L-1})) \qquad (3)$

The unraveled architecture as in Fig. 2 is comprehensive enough to be generalized even to shallower CNN-based models that lack skip-connection layers. For plain networks, the layer functions $f_i$ can be decomposed into an identity function $I$ and a residual function $g_i$ as follows:

$f_i(y_i) = I(y_i) + g_i(y_i) \qquad (4)$

Such a decomposition yields an equation of the same form as equation 2 and, consequently, equation 3:

$y_{i+1} = g_i(y_i) + y_i \qquad (5)$

It can be inferred from the unraveled view that, while feeding the model with an input, signals might not pass through all convolutional layers, as they may skip some layers and be propagated to the next ones directly. However, this is not the case for pooling layers: since they change the signals' dimensions, equation 4 cannot be applied to such layers. To support this hypothesis, an experiment was conducted in (Veit, Wilber, and Belongie 2016), where the corresponding test errors are reported when individual layers are removed from a residual network. It was observed that a significant degradation in test performance is recorded only when the pooling layers are removed.

Based on this hypothesis and result, most of the information in each model can be collected by probing the pooling layers. Thus, by visualizing these layers, it is possible to track the way features are propagated through convolutional blocks. Therefore, for any given CNN, we derive attribution masks from the feature maps in the last layers of all of its convolutional blocks. Then, for each of these layers, we build a corresponding visualization map. These maps are utilized to perform a block-wise feature aggregation in the last phase of our method.
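To make the block-wise layer selection policy concrete, the sketch below collects feature maps from the last layer of each convolutional block of a torchvision ResNet-50 in a single forward pass. The choice of probe points (the stem pooling output plus layer1 through layer4) is our reading of "the last layer of each convolutional block" for this particular architecture; the names are illustrative and not taken from the authors' implementation.

# A minimal sketch of the block-wise layer selection policy on a
# torchvision ResNet-50; the probed modules are our own interpretation of
# "the last layer of each convolutional block" for this architecture.
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval()

probe_points = {                       # five blocks, as in Fig. 5
    "block1": model.maxpool,
    "block2": model.layer1,
    "block3": model.layer2,
    "block4": model.layer3,
    "block5": model.layer4,
}

feature_maps = {}

def make_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in probe_points.items()]

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed input image
with torch.no_grad():
    _ = model(x)

for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))     # e.g. block5 -> (1, 2048, 7, 7)

for h in handles:
    h.remove()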
Figure 3: Schematic of SISE's layer visualization framework (first three phases). The procedure in this framework is applied to multiple layers and is followed by the fusion framework (as in Fig. 5). (Phase 1: feature map extraction from the CNN-based model; Phase 2: back-propagation, feature map filtering and post-processing, i.e., feature map selection and attribution mask creation; Phase 3: attribution mask scoring via point-wise multiplication with the input image and a linear combination into the layer visualization map.)

Feature Map Selection

As discussed, the first two phases of SISE are responsible for creating multiple sets of attribution masks. In the first phase, we feed the model with an input image to derive sets of feature maps from various layers of the model. Then, we sample the most deterministic feature maps among each set and post-process them to obtain corresponding sets of attribution masks. These masks are utilized for performing attribution-based input sampling.

Let $\Psi : \mathcal{I} \to \mathbb{R}$ be a trained model that outputs a confidence score for a given input image, where $\mathcal{I}$ is the space of RGB images, $\mathcal{I} = \{I \mid I : \Lambda \to \mathbb{R}^3\}$, and $\Lambda = \{1, \dots, H\} \times \{1, \dots, W\}$ is the set of locations (pixels) in the image. Given any model and image, the goal of an explanation algorithm is to reach an explanation map $S_{I,\Psi}(\lambda)$ that assigns an "importance value" to each location in the image ($\lambda \in \Lambda$). Also, let $l$ be a layer containing $N$ feature maps represented as $A_k^{(l)}$ ($k \in \{1, \dots, N\}$), and let the space of locations in these feature maps be denoted as $\Lambda^{(l)}$. These feature maps are collected by probing the feature extractor units of the model; a similar strategy is also utilized in (Wang et al. 2020). The feature maps are formed in these units independently from the classifier part of the model. Thus, using the whole set of feature maps does not reflect the outlook of the CNN's classifier.

To identify and reject the class-indiscriminative feature maps, we partially backpropagate the signal to the layer $l$ to score the average gradient of the model's confidence score with respect to each of the feature maps. These average gradient scores are represented as follows:

$\alpha_k^{(l)} = \sum_{\lambda^{(l)} \in \Lambda^{(l)}} \frac{\partial \Psi(I)}{\partial A_k^{(l)}(\lambda^{(l)})} \qquad (6)$

The feature maps with non-positive average gradient scores $\alpha_k^{(l)}$ tend to contain features related to classes other than the class of interest. Terming such feature maps 'negative-gradient', we define the set of attribution masks obtained from the 'positive-gradient' feature maps, $M_d^{(l)}$, as:

$M_d^{(l)} = \{\Omega(A_k^{(l)}) \mid k \in \{1, \dots, N\},\ \alpha_k^{(l)} > \mu \times \beta^{(l)}\} \qquad (7)$

where $\beta^{(l)}$ denotes the maximum average gradient recorded,

$\beta^{(l)} = \max_{k \in \{1, \dots, N\}} \alpha_k^{(l)} \qquad (8)$

In equation 7, $\mu \in \mathbb{R}_{\geq 0}$ is a threshold parameter that is 0 by default, discarding negative-gradient feature maps while retaining only the positive-gradient ones. Furthermore, $\Omega(\cdot)$ represents a post-processing function that converts feature maps to attribution masks. This function contains a bilinear interpolation, upsampling the feature maps to the size of the input image, followed by a linear transformation that normalizes the values in the mask to the range [0, 1]. A visual comparison of attribution masks and random masks in Fig. 4 emphasizes the advantages of the former.

Figure 4: Qualitative comparison of (a) attribution masks derived from different blocks of a VGG16 network as in SISE, with (b) random masks employed in RISE.
Attribution-Based Input Sampling

Considering the same notation as in the previous section, and according to the RISE method, the confidence scores observed for the copies of an image masked with a set of binary masks ($M : \Lambda \to \{0, 1\}$) are used to form the explanation map by

$S_{I,\Psi}(\lambda) = \mathbb{E}_M[\Psi(I \odot m) \mid m(\lambda) = 1] \qquad (9)$

where $I \odot m$ denotes a masked image obtained by point-wise multiplication between the input image and a mask $m \in M$. The representation of equation 9 can be modified to be generalized to sets of smooth masks ($M : \Lambda \to [0, 1]$). Hence, we reformat equation 9 as:

$S_{I,\Psi}(\lambda) = \mathbb{E}_M[\Psi(I \odot m) \cdot C_m(\lambda)] \qquad (10)$

where the term $C_m(\lambda)$ indicates the contribution amount of each pixel in the masked image. Setting the contribution indicator as $C_m(\lambda) = m(\lambda)$ makes equation 10 equivalent to equation 9. We normalize these scores according to the size of the perturbation masks to decrease the reward assigned to background pixels when a high score is reached for a mask with too many activated pixels. Thus, we define this term as:

$C_m(\lambda) = \frac{m(\lambda)}{\sum_{\lambda \in \Lambda} m(\lambda)} \qquad (11)$

Such a formulation increases the concentration on smaller features, particularly when multiple objects (either from the same instance or different ones) are present in an image.
Putting the block-wise layer selection policy and the attribution mask selection strategy together with the modified RISE framework, for each CNN containing $B$ convolutional blocks, the last layer of each block is indicated as $l_b$, $b \in \{1, \dots, B\}$. Using equations 10 and 11, we form a corresponding visualization map for each of these layers by:

$V_{I,\Psi}^{(l_b)}(\lambda) = \mathbb{E}_{M_d^{(l_b)}}[\Psi(I \odot m) \cdot C_m(\lambda)] \qquad (12)$
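The following sketch is one way to turn the attribution masks of a layer into its visualization map (Eqs. 10-12): each mask perturbs the input by point-wise multiplication, the masked copies are scored in batches, and the scores are combined with the normalized contribution weights C_m. The batch size, the use of softmax outputs as confidence scores, and the function names are assumptions.

# Sketch of attribution-based input sampling (Eqs. 10-12) for one layer.
# `masks` is the (N, H, W) tensor produced by the selection step and `x`
# the (1, 3, H, W) input; names and the softmax scoring are illustrative.
import torch

def layer_visualization_map(model, x, masks, class_idx, batch_size=32):
    scores = []
    with torch.no_grad():
        for chunk in masks.split(batch_size):
            perturbed = x * chunk.unsqueeze(1)            # I point-wise multiplied by m
            probs = torch.softmax(model(perturbed), dim=1)
            scores.append(probs[:, class_idx])
    scores = torch.cat(scores)                            # Psi(I * m) for every mask

    # Eq. 11: weight each pixel by m(lambda) normalized by the mask's total energy.
    weights = masks / (masks.sum(dim=(1, 2), keepdim=True) + 1e-8)

    # Eq. 12: empirical expectation of the score-weighted contributions.
    return (scores.view(-1, 1, 1) * weights).mean(dim=0)  # (H, W) visualization map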
Fusion Module

In the fourth phase of SISE, the flow of features from low-level to high-level blocks is tracked. The inputs to the fusion module are the visualization maps obtained from the third phase of SISE, and its output is a 2-dimensional explanation map, which is the output of SISE. The fusion module is responsible for correcting the spatial distortions caused by upsampling coarse feature maps to higher dimensions and for refining the localization of the attributions derived from the model.

Figure 5: SISE fusion module for a CNN with 5 convolutional blocks. (The visualization maps of Blocks 1-5 pass through cascaded fusion blocks built from unweighted addition, Otsu-based binarization, point-wise multiplication, and normalization to the range [0, 1], yielding the SISE explanation map.)

Our fusion module is designed with cascaded fusion blocks. In each block, the feature information from the visualization maps representing the explanations of two consecutive blocks is collected using an "addition" block. Then, the features that are absent in the latter visualization map are removed from the collective information by masking the output of the addition block with a binary mask indicating the activated regions in the latter visualization map. To obtain the binary mask, we apply an adaptive threshold to the latter visualization map, determined by Otsu's method (Otsu 1979). By cascading fusion blocks as in Fig. 5, the features determining the model's prediction are represented in a more fine-grained manner while the inexplicit features are discarded.
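A minimal sketch of this cascaded fusion, assuming the visualization maps have already been resized to a common resolution and ordered from the lowest to the highest block, is given below; scikit-image's Otsu threshold stands in for the adaptive threshold, and the final normalization to [0, 1] follows our reading of Fig. 5.

# Sketch of the block-wise fusion module (Fig. 5): cascaded unweighted
# addition, Otsu-based binarization of the later map, point-wise
# multiplication, and a final normalization to [0, 1].
import numpy as np
from skimage.filters import threshold_otsu

def fuse_visualization_maps(vis_maps):
    """vis_maps: list of (H, W) arrays ordered from low- to high-level block."""
    fused = vis_maps[0]
    for later in vis_maps[1:]:
        collected = fused + later                 # unweighted addition of consecutive maps
        binary = later > threshold_otsu(later)    # activated regions of the later map
        fused = collected * binary                # discard features absent in the later map
    fused = fused - fused.min()
    return fused / (fused.max() + 1e-8)           # explanation map normalized to [0, 1]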
Experiments

We verify our method's performance on shallow and deep CNNs, including VGG16, ResNet-50, and ResNet-101 architectures. To conduct the experiments, we employed the PASCAL VOC 2007 (Everingham et al. 2007) and Severstal (PAO Severstal 2019) datasets. The former is a popular object detection dataset containing 4952 test images belonging to 20 object classes. As images with many small object occurrences and multiple instances of different classes are prevalent in this dataset, it is hard for an XAI algorithm to perform well on the whole dataset. The latter is an industrial steel defect detection dataset created for anomaly detection and steel defect segmentation problems. We reformatted it into a defect classification dataset instead, containing 11505 test images from 5 different classes, including one normal class and four defect classes. Class imbalance, intraclass variation, and interclass similarity are the main challenges of this recast dataset.

Experimental Setup

Experiments conducted on the PASCAL VOC 2007 dataset are evaluated on its test set with a VGG16 and a ResNet-50 model from the TorchRay library (Fong, Patrick, and Vedaldi 2019), trained by (Zhang et al. 2018), both for multi-label image classification. The top-5 accuracies of the models on the test set are 93.29% and 93.09%, respectively. For the experiments on Severstal, we trained a ResNet-101 model (with a test accuracy of 86.58%) on the recast dataset to assess the performance of the proposed method in the task of visual defect inspection. To recast the Severstal dataset for classification, the train and test images were cropped into patches of size 256 × 256. In our evaluations, a balanced subset of 1381 test images belonging to the defect classes labeled 1, 2, 3, and 4 is chosen. We have implemented SISE in Keras and set the parameter µ to its default value, 0.

Qualitative Results

Based on explanation quality, we compare SISE with other state-of-the-art methods on sample images from the Pascal dataset in Fig. 6 and from the Severstal dataset in Fig. 8. Images with both normal-sized and small object instances are shown along with their corresponding confidence scores. Moreover, Figs. 1 and 7, with images containing multiple objects from different classes, depict the superior ability of SISE in discriminating the explanations of various classes in comparison with other methods, and RISE in particular.
Figure 6: Qualitative comparison of the state-of-the-art XAI methods with our proposed SISE for test images from the PASCAL VOC 2007 dataset. The first two rows are the results from a ResNet-50 model, and the last two are from a VGG16 model. (Columns: Input Image, Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE, SISE; rows: Cat 0.9976, Train 0.9997, Person 0.9999, TV monitor 0.0018.)

Figure 7: Class discriminative ability of SISE vs. RISE obtained from a VGG16 model. (Explanations for the classes 'MotorBike' (0.9928) and 'Person' (0.0071) of the same input image.)

Figure 8: Qualitative comparison of explanation maps by a ResNet-101 model on test images from the Severstal dataset. (Columns: Input Image, Grad-CAM, Score-CAM, RISE, SISE; rows: Defective Class 1 0.8433, Defective Class 3 0.9987.)
Quantitative Results

Quantitative analysis includes evaluation results categorized into 'ground truth-based' and 'model truth-based' metrics. The former are used to justify the model by assessing the extent to which the algorithm satisfies users by providing visually superior explanations, while the latter are used to analyze the model behavior by assessing the faithfulness of the algorithm and its correctness in capturing the attributions in line with the model's prediction procedure. The reported results of RISE and Extremal Perturbation in Table 1 are averaged over three runs. The utilized metrics are discussed below.

Ground truth-based metrics: The state-of-the-art explanation algorithms are compared with SISE based on three distinct ground truth-based metrics to justify the visual quality of the explanation maps generated by our method. Denoting the ground-truth mask as $G$ and the achieved explanation map as $S$, the evaluation metrics used are:

Energy-Based Pointing Game (EBPG) evaluates the precision and denoising ability of XAI algorithms (Wang et al. 2020). Extending the traditional Pointing Game, EBPG considers all pixels in the resultant explanation map $S$ for evaluation by measuring the fraction of its energy captured in the corresponding ground truth $G$, as $\mathrm{EBPG} = \frac{\|S \odot G\|_1}{\|S\|_1}$.

mIoU analyses the localization ability and meaningfulness of the attributions captured in an explanation map. In our experiments, we select the top 20% of the pixels highlighted in each explanation map $S$ and compute the mean intersection over union with the corresponding ground-truth masks.

Bounding box (Bbox) (Schulz et al. 2020) is taken into account as a size-adaptive variant of mIoU. Considering $N$ as the number of ground-truth pixels in $G$, the Bbox score is calculated by selecting the top $N$ pixels in $S$ and evaluating the corresponding fraction captured over $G$.

Model truth-based metrics: To evaluate the correlation between the representations of our method and the model's predictions, model truth-based metrics are employed to compare SISE with the other state-of-the-art methods. As the main objective of visual explanation algorithms is to envision the model's perspective on its predictions, these metrics are considered of higher importance.

Drop% and Increase%, as introduced in (Chattopadhay et al. 2018) and later modified by (Ramaswamy et al. 2020; Fu et al. 2020), can be interpreted as indicators of the positive attributions missed and the negative attributions discarded from the explanation map, respectively. Given a model $\Psi(\cdot)$, an input image $I_i$ from a dataset containing $K$ images, and an explanation map $S(I_i)$, the Drop/Increase% metric selects the most important pixels in $S(I_i)$ to measure their contribution towards the model's prediction. A threshold function $T(\cdot)$ is applied on $S(I_i)$ to select the top 15% of pixels, which are then extracted from $I_i$ using point-wise multiplication and fed to the model. The confidence scores on such perturbed images are then compared with the original score, according to the equations $\mathrm{Drop\%} = \frac{1}{K}\sum_{i=1}^{K} \frac{\max(0,\ \Psi(I_i) - \Psi(I_i \odot T(I_i)))}{\Psi(I_i)} \times 100$ and $\mathrm{Increase\%} = \frac{1}{K}\sum_{i=1}^{K} \mathrm{sign}\big(\Psi(I_i \odot T(I_i)) - \Psi(I_i)\big) \times 100$.
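A sketch of how the Drop% and Increase% scores described above could be computed is shown below; the 15% threshold follows the text, while the confidence callback (returning Ψ for the class of interest), the image layout, and the treatment of Increase% as the percentage of images whose score rises are our assumptions.

# Sketch of the Drop% / Increase% computation described above.
# `confidence(img)` is assumed to return Psi(I) for the class of interest;
# `images` are (H, W, 3) arrays and `saliency_maps` matching (H, W) arrays.
import numpy as np

def drop_increase(images, saliency_maps, confidence, top_fraction=0.15):
    drops, increases = [], []
    for img, sal in zip(images, saliency_maps):
        cutoff = np.quantile(sal, 1.0 - top_fraction)
        mask = (sal >= cutoff).astype(img.dtype)        # T(.): keep the top 15% of pixels
        base = confidence(img)
        perturbed = confidence(img * mask[..., None])   # point-wise multiplication
        drops.append(max(0.0, base - perturbed) / base * 100.0)
        increases.append(float(perturbed > base))
    return float(np.mean(drops)), float(np.mean(increases)) * 100.0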
Model      Metric    Grad-CAM  Grad-CAM++  Extremal Perturbation  Score-CAM  Integrated Gradient  RISE   FullGrad  SISE
VGG16      EBPG      55.44     46.29       61.19                  33.44      46.42                36.87  38.72     60.54
VGG16      mIoU      26.52     28.1        25.44                  27.11      27.71                14.11  26.61     27.79
VGG16      Bbox      51.7      55.59       51.2                   54.59      54.98                33.97  54.17     55.68
VGG16      Drop      49.47     60.63       43.90                  39.62      39.79                64.74  60.78     38.40
VGG16      Increase  31.08     23.89       32.65                  37.76      36.42                26.17  22.73     37.96
ResNet-50  EBPG      60.08     47.78       63.24                  32.86      35.56                40.62  39.55     66.08
ResNet-50  mIoU      32.16     30.16       26.29                  27.4       31.0                 15.41  20.2      31.37
ResNet-50  Bbox      60.25     58.66       52.34                  55.55      60.02                34.79  44.94     61.59
ResNet-50  Drop      35.80     41.77       39.38                  39.77      35.36                66.12  65.99     30.92
ResNet-50  Increase  36.58     32.15       34.27                  37.08      37.08                24.24  25.36     40.22

Table 1: Results of ground truth-based and model truth-based metrics for state-of-the-art XAI methods along with SISE (proposed) on two networks trained on the PASCAL VOC 2007 dataset. For each metric, the best is shown in bold, and the second-best is underlined. Except for Drop%, higher is better for all other metrics. All values are reported in percentage.

XAI method   Drop%  Increase%
Grad-CAM     67.44  12.46
Grad-CAM++   64.1   12.96
RISE         63.25  15.63
Score-CAM    64.29  10.35
FullGrad     77.23  10.26
SISE         61.06  15.64

Table 2: Results of model truth-based metrics of SISE and state-of-the-art algorithms on a ResNet-101 model trained on the Severstal dataset.

Discussion

The experimental results in Figs. 1, 6, 7, and 8 demonstrate the resolution and concreteness of SISE explanation maps, which is further supported by the ground truth-based evaluation metrics in Table 1. Also, the model truth-based metrics in Tables 1 and 2 prove SISE's supremacy in highlighting the evidence based on which the model makes a prediction. Similar to the CAM-based methods, the output of the last convolutional block plays the most critical role in our method. However, by considering the intermediate layers based on the block-wise layer selection, SISE's advantageous properties are enhanced. Furthermore, utilizing attribution-based input sampling instead of randomized sampling, ignoring the unrelated feature maps, and modifying the linear combination step dramatically improve the visual clarity and completeness offered by SISE.

Complexity Evaluation  In addition to the performance evaluations, a runtime test is carried out to compare the complexity of the methods, using a Tesla T4 GPU with 16 GB of memory and the ResNet-50 model. Reported runtimes were averaged over 100 trials using random images from the PASCAL VOC 2007 test set. Grad-CAM and Grad-CAM++ achieved the best runtimes, 19 and 20 milliseconds, respectively. On the other hand, Extremal Perturbation recorded the longest runtime, 78.37 seconds, since it optimizes numerous variables. In comparison with RISE, which has a runtime of 26.08 seconds, SISE runs in 9.21 seconds.

Ablation Study  While RISE uses around 8000 random masks to operate on a ResNet-50 model, SISE uses around 1900 attribution masks with µ set to 0, out of a total of 3904 feature maps initially extracted from the same ResNet-50 model before negative-gradient feature maps are removed. The difference in the number of masks allows SISE to operate in around 9.21 seconds. To analyze the effect of reducing the number of attribution masks on SISE's performance, an ablation study is carried out. By changing µ to 0.3, only a slight variation in the boundaries of the explanation maps can be noticed, while the runtime is reduced to 2.18 seconds. This shows that ignoring feature maps with low gradient values does not considerably affect SISE outputs, since they tend to be assigned low scores in the third phase of SISE anyway. By increasing µ to 0.5, a slight decline in performance is recorded along with a runtime of just 0.65 seconds.

A more detailed analysis of the effect of µ on various evaluation metrics, along with an extensive discussion of our algorithm and additional results on the MS COCO 2014 dataset (Lin et al. 2014), is provided in the technical appendix of our extended version on arXiv (https://arxiv.org/abs/2010.00672).

Conclusion

In this work, we propose SISE, a novel visual explanation algorithm specialized to the family of CNN-based models. SISE generates explanations by aggregating visualization maps obtained from the outputs of convolutional blocks through attribution-based input sampling. Qualitative results show that our method can output high-resolution explanation maps, the quality of which is emphasized by quantitative analysis using ground truth-based metrics. Moreover, model truth-based metrics demonstrate that our method also outperforms other state-of-the-art methods in providing concrete explanations. Our experiments reveal that the mutual utilization of features captured in the final and intermediate layers of the model aids in producing explanation maps that accurately locate object instances and reach a greater portion of the attributions leading the model to make a decision.
Acknowledgement

This research was supported by LG AI Research. The authors thank all anonymous reviewers for their detailed suggestions and critical comments on the original manuscript that substantially helped to improve the clarity of this paper.

References

Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; and Kim, B. 2018. Sanity Checks for Saliency Maps. In Advances in Neural Information Processing Systems, volume 31, 9505-9515. Curran Associates, Inc. URL https://proceedings.neurips.cc/paper/2018/file/294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf.

Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; and Samek, W. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7): e0130140.

Barredo Arrieta, A.; Diaz Rodriguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado González, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, V. R.; Chatila, R.; and Herrera, F. 2019. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Information Fusion. doi:10.1016/j.inffus.2019.12.012.

Chattopadhay, A.; Sarkar, A.; Howlader, P.; and Balasubramanian, V. N. 2018. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 839-847. doi:10.1109/WACV.2018.00097.

Everingham, M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; and Zisserman, A. 2007. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. URL http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

Fong, R.; Patrick, M.; and Vedaldi, A. 2019. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision, 2950-2958.

Fu, R.; Hu, Q.; Dong, X.; Guo, Y.; Gao, Y.; and Li, B. 2020. Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs. In British Machine Vision Conference.

Hoffman, R. R.; Mueller, S. T.; Klein, G.; and Litman, J. 2018. Metrics for Explainable AI: Challenges and Prospects. CoRR abs/1812.04608. URL http://arxiv.org/abs/1812.04608.

Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700-4708.

Iwana, B. K.; Kuroki, R.; and Uchida, S. 2019. Explaining convolutional neural networks using softmax gradient layer-wise relevance propagation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 4176-4185. IEEE.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740-755. Springer.

Lipton, Z. C. 2018. The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability is Both Important and Slippery. Queue 16(3): 31-57. doi:10.1145/3236386.3241340.

Meng, F.; Huang, K.; Li, H.; and Wu, Q. 2019. Class Activation Map Generation by Representative Class Selection and Multi-Layer Feature Fusion. arXiv preprint arXiv:1901.07683.

Nam, W.-J.; Gur, S.; Choi, J.; Wolf, L.; and Lee, S.-W. 2020. Relative Attributing Propagation: Interpreting the Comparative Contributions of Individual Units in Deep Neural Networks. In AAAI, 2501-2508.

Omeiza, D.; Speakman, S.; Cintas, C.; and Weldermariam, K. 2019. Smooth Grad-CAM++: An enhanced inference level visualization technique for deep convolutional neural network models. arXiv preprint arXiv:1908.01224.

Otsu, N. 1979. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1): 62-66.

PAO Severstal. 2019. Severstal: Steel Defect Detection, Kaggle Challenge. URL https://www.kaggle.com/c/severstal-steel-defect-detection.

Petsiuk, V.; Das, A.; and Saenko, K. 2018. RISE: Randomized Input Sampling for Explanation of Black-box Models. In Proceedings of the British Machine Vision Conference (BMVC).

Ramaswamy, H. G.; et al. 2020. Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-free Localization. In The IEEE Winter Conference on Applications of Computer Vision, 983-991.

Rebuffi, S.-A.; Fong, R.; Ji, X.; and Vedaldi, A. 2020. There and Back Again: Revisiting Backpropagation Saliency Methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8839-8848.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135-1144.

Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510-4520.

Schulz, K.; Sixt, L.; Tombari, F.; and Landgraf, T. 2020. Restricting the Flow: Information Bottlenecks for Attribution. In International Conference on Learning Representations. URL https://openreview.net/forum?id=S1xWh1rYwB.

Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Shen, L.; Ma, Q.; and Li, S. 2018. End-to-end time series imputation via residual short paths. In Asian Conference on Machine Learning, 248-263.

Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning Important Features Through Propagating Activation Differences. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3145-3153. PMLR. URL http://proceedings.mlr.press/v70/shrikumar17a.html.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Workshop at the International Conference on Learning Representations.

Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F.; and Wattenberg, M. 2017. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825.

Srinivas, S.; and Fleuret, F. 2019. Full-gradient representation for neural network visualization. In Advances in Neural Information Processing Systems, 4126-4135.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 3319-3328. JMLR.org.

Tan, M.; and Le, Q. V. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. CoRR abs/1905.11946. URL http://arxiv.org/abs/1905.11946.

Veit, A.; Wilber, M. J.; and Belongie, S. 2016. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, 550-558.

Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; and Hu, X. 2020. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 24-25.

Zeiler, M. D.; and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818-833. Springer.

Zhang, J.; Bargal, S. A.; Lin, Z.; Brandt, J.; Shen, X.; and Sclaroff, S. 2018. Top-Down Neural Attention by Excitation Backprop. International Journal of Computer Vision 126(10): 1084-1102. doi:10.1007/s11263-017-1059-x.

Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2921-2929.

Zoph, B.; and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Technical Appendix

Datasets

Experiments are conducted on three different datasets: MS COCO 2014 (Lin et al. 2014), PASCAL VOC 2007 (Everingham et al. 2007), and Severstal (PAO Severstal 2019). The first two are "natural image" object detection datasets, while the last one is an "industrial" steel defect detection dataset. They are discussed in more detail in the following subsections.

MS COCO 2014 and PASCAL VOC 2007 Datasets

The MS COCO 2014 dataset features 80 different object classes, each one a common object. All experimental results are performed on its validation set, which has 40,504 images. The PASCAL VOC 2007 dataset features 20 object classes, and all experimental results for this dataset are performed on its test set, which has 4,952 images. Both datasets are created for object detection and segmentation purposes and contain images with multiple object classes and images with multiple object instances, making them challenging for XAI algorithms to perform well on.

Severstal Dataset

To extend the analysis of the influence of XAI algorithms beyond natural images, the Severstal steel defect detection dataset was chosen. It was originally hosted on Kaggle as a "detection" task, which we then converted to a "classification" task. The original dataset has 12,568 train images under one normal class labeled "0" and four defective classes numbered 1 through 4. Each image may contain no defect, one defect, or two or more defects from different classes. The ground-truth annotations for the segments (masks) are provided in a CSV file, with a single row entry for each class of defect present within each image. The row entries provide the locations of defects, with some entries having several non-contiguous defect locations.

Figure 9: Sample images with dimension 256 × 256 from each class (0-4) of the recast Severstal dataset.

The original images were long strips of steel sheets with dimensions 1600 × 256 pixels. To convert the dataset for our purpose, every training image was cropped (without any overlap), with an initial offset of 32 pixels, into 6 individual images of dimensions 256 × 256 pixels. The few empty (black) images that tended to be located along the sides of the original long strip images were discarded, along with images that had multiple types of defects. This reformulation left a highly imbalanced dataset with 5 distinct classes: 0, 1, 2, 3, and 4. Class 0 contains images with no defects, whereas the other four classes have images with only that specific defect group. Fig. 9 shows sample images from each class of the recast dataset. The image-per-class distribution is provided in Table 3. The training split is 70% of the data, and the test split is the remaining 30%. From the training data, 20% is used for validation. The experimental results and qualitative figures for the Severstal dataset are produced on a subset of the test set using all of the images from classes 1, 2, and 4, and 500 images from class 3.

Table 3: Data distribution on each class of the recast Severstal dataset, outlining the high data imbalance among them.

Class  Training set  Test set  Total
0      16620         7124      23744
1      935           401       1336
2      147           63        210
3      8166          3500      11666
4      971           417       1388
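This recasting step can be sketched as follows: after the 32-pixel offset, each 1600 × 256 strip is cut into six non-overlapping 256 × 256 patches, and nearly black crops are skipped. The file handling and the brightness cutoff used to detect empty patches are illustrative choices, not the exact preprocessing script.

# Sketch of recasting a 1600x256 Severstal strip into six 256x256 patches,
# skipping nearly black crops; the brightness cutoff and file handling are
# illustrative choices rather than the exact preprocessing used.
import numpy as np
from PIL import Image

def crop_strip(path, offset=32, size=256, n_patches=6, min_mean_intensity=5.0):
    strip = np.asarray(Image.open(path).convert("L"))   # grayscale strip, shape (256, 1600)
    patches = []
    for i in range(n_patches):
        x0 = offset + i * size
        patch = strip[:, x0:x0 + size]
        if patch.shape[1] == size and patch.mean() > min_mean_intensity:
            patches.append(patch)                        # keep non-empty crops only
    return patches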
Models

VGG16 and ResNet-50

The top-1 accuracies of the VGG16 and ResNet-50 models (loaded from the TorchRay library (Fong, Patrick, and Vedaldi 2019)) on the test set of the PASCAL VOC 2007 dataset were 56.56 percent and 57.08 percent respectively, out of a maximum top-1 accuracy of 64.88 percent, while the top-5 accuracies were 93.29 percent and 93.09 percent respectively, out of a maximum top-5 accuracy of 99.99 percent. The top-1 accuracies of the VGG16 and ResNet-50 on the validation set of the MS COCO 2014 dataset were 29.62 percent and 30.25 percent respectively, out of a maximum top-1 accuracy of 34.43 percent, while the top-5 accuracies were 69.01 percent and 70.27 percent respectively, out of a maximum top-5 accuracy of 93.28 percent.

ResNet-101

A ResNet-101 model was trained on the recast Severstal dataset using a Stochastic Gradient Descent (SGD) optimizer along with a categorical cross-entropy loss function. The model is trained for 40 epochs with an initial learning rate of 0.1, which is halved every 5 epochs. Considering the high data imbalance among the classes, the top-1 accuracy of the ResNet-101 model on the test set of the recast Severstal dataset was 86.58 percent, while the top-3 accuracy was 99.60 percent. Table 5 shows the normalized confusion matrix of this model.

Evaluation

In addition to the quantitative evaluation results shared in the main paper, the results of both ground truth-based and model truth-based metrics on the MS COCO 2014 dataset are attached in Table 4. Similar to our earlier results, SISE outperforms other conventional XAI methods in most cases.
Model Metric Grad-CAM RISE FullGrad SISE
CAM++ Perturbation CAM Gradient
EBPG 23.77 18.11 25.71 11.5 12.59 14.01 13.96 28.16
mIoU 15.04 15.69 12.81 14.94 15.52 7.13 14.25 15.57
VGG16 Bbox 28.98 20.48 24.93 28.9 27.8 14.54 27.52 29.63
Drop% 44.46 45.63 41.86 38.69 33.73 52.73 52.39 32.9
Increase% 40.28 38.33 41.30 46.05 49.26 34.11 32.68 50.56
EBPG 25.3 17.81 27.54 11.35 12.6 14.41 14.39 29.43
mIoU 17.89 15.8 13.61 14.69 16.36 7.24 10.14 17.03
ResNet-50 Bbox 32.39 28.28 26.98 29.43 29.27 14.54 19.32 33.34
Drop% 33.42 41.71 36.24 37.93 35.06 55.38 56.83 31.41
Increase% 48.39 40.54 45.74 45.44 47.25 32.18 29.59 49.76

Table 4: Results of ground truth-based and model truth-based metrics for state-of-the-art XAI methods along with SISE (pro-
posed) on two networks (VGG16 and ResNet-50) trained on MS COCO 2014 dataset. For each metric, the best is shown in
bold, and the second-best is underlined. Except for Drop%, the higher is better for all other metrics.

Table 5: Normalized confusion matrix of the ResNet-101 model trained on the recast Severstal dataset.

Actual \ Predicted    0      1       2       3      4
0                     0.89   0.011   0.0056  0.077  0.012
1                     0.27   0.59    0.02    0.12   0.0025
2                     0.095  0.032   0.71    0.16   0
3                     0.12   0.014   0.004   0.85   0.0086
4                     0.15   0.0072  0.0024  0.16   0.67

The MS COCO 2014 dataset is more challenging for the explanation algorithms than the PASCAL VOC 2007 dataset because of
• the higher number of object instances,
• the presence of more extra-small objects,
• the presence of more objects, either from the same or different classes, in each image (on average), and
• the lower classification accuracy of the models trained on it (as provided in the TorchRay library).
However, the results depicted in Table 4 and Figs. 16 and 17 emphasize the superior ability of SISE in providing satisfying, high-resolution, and complete explanation maps that provide a precise visual analysis of the model's predictions and perspective.

The benchmark results reported on the Pascal VOC 2007 and MS COCO 2014 datasets are calculated for all ground-truth labels in the test images. For example, if a chosen input image has both "dog" and "cat" object instances, then explanations are collected for both class ids and accounted for in the overall performance. SISE's ability to generate class-discriminative explanations is represented in this manner. As discussed in the main manuscript, SISE chooses pooling layers to collect feature maps, which are later combined in the fusion module. The experiments on the Severstal dataset were performed for only the ground-truth labels, as each test image has exactly one class id associated with it.

A detailed qualitative analysis of SISE explanations compared with other state-of-the-art XAI algorithms on the discussed models on the Pascal VOC 2007 and recast Severstal datasets is shown in Figs. 13, 14, and 15, respectively. Figs. 16 and 17 show a similar comparative analysis on the MS COCO 2014 dataset.

Ablation Study

Table 6: Performance and runtime results of SISE with respect to the parameter µ, on a ResNet-50 network trained on the PASCAL VOC 2007 dataset. Except for Drop% and runtime (in seconds), higher is better for all other metrics.

Metric       µ = 0  µ = 0.3  µ = 0.5  µ = 0.75
EBPG         66.08  66.54    65.84    62.5
mIoU         31.37  31.5     30.63    28.51
Bbox         61.59  61.45    59.83    56.53
Drop%        30.92  31.5     33.31    38.83
Increase%    40.22  40.05    38.36    36.09
Runtime (s)  9.21   2.18     0.65     0.38

As stated in the main manuscript, in the second phase of SISE, each set of feature maps is evaluated by backpropagating the signal from the output of the model to the layer from which the feature maps are derived. In this stage, after normalizing the backpropagation-based scores, a threshold µ is applied to each set, so that the feature maps passing the threshold are converted to attribution masks and utilized in the next steps, while the others are discarded. Some of these feature maps do not contain signals that lead the model to make a firm prediction, since they represent the attributions related to instances of classes other than the class of interest.
Figure 10: Sanity check experimentation of SISE as per (Adebayo et al. 2018) by randomizing the parameters of a VGG16 model pre-trained on the Pascal VOC 2007 dataset. (Columns: Image, SISE, then explanations after cascading weight randomization from the top layers down: Logit, Conv28, Conv21, Conv14, Conv7, Conv2; rows: Dog, Bird, Train, Car.)

These feature maps are expected to be identified by their zero or negative backpropagation-based scores. Getting rid of them by setting the threshold parameter µ to 0 (µ is defined in the main manuscript) improves our method, not only by increasing its speed but also by enabling us to analyze the model's decision-making process more precisely.

Figure 11: Effect of varying SISE's µ on a ResNet-50 model trained on the Pascal VOC 2007 dataset. (Columns: Input Image and SISE explanations for four values of µ; rows: Person 0.9999, Car 0.5281, TV Monitor 0.0014, Motorbike 0.9978.)

By increasing the threshold parameter µ, a trade-off between performance and speed is reached. When this parameter is slightly increased, SISE discards feature maps with low positive backpropagation-based scores, which is expected not to make a considerable impact on the output explanation map. The higher the parameter µ is, though, the more deterministic feature maps are discarded, causing more degradation in SISE's performance.

To verify these interpretations, we have conducted an ablation analysis on the PASCAL VOC 2007 test set. As stated in the main manuscript, the model truth-based metrics (Drop% and Increase%) are the most important metrics revealing the sensitivity of SISE's performance with respect to its threshold parameter. According to our results, as depicted in Table 6 and Fig. 11, the ground truth-based results also follow approximately the same trend for the effect of µ variation. Consequently, our results show that by adjusting this hyper-parameter, a dramatic increase in SISE's speed is gained in exchange for a slight compromise in its explanation ability. Since the behavior of our method with respect to this hyper-parameter does not depend on the model or the dataset employed, it can be consistently fine-tuned based on the requirements of the end-user.

Sanity Check

In addition to the comprehensive quantitative experiments presented in the main manuscript and this appendix, we also verified the sensitivity of our explanation algorithm to the model's parameters, illustrating that our method adequately explains the relationship between the input and output that the model reaches.
As introduced by (Adebayo et al. 2018), sanity checks on explanation methods can be conducted either by randomizing the model's parameters or by retraining the model with the same training data but with random labels. In this work, we performed sanity checks on our method by randomizing the parameters of the model. To do so, we randomized the weight and bias parameters of the VGG16 trained on the PASCAL VOC 2007 dataset provided by (Fong, Patrick, and Vedaldi 2019). Fig. 10 presents the results of the sanity checks for some input images. The layers whose parameters are randomized are selected in a top-to-bottom manner, as specified in the figure. Each row shows the effect on the output explanation maps for an image as we perturb the parameters of more layers. According to this figure, SISE shows clear alterations in its explanation maps when dealing with highly perturbed models. Hence, SISE passes our sanity check.

To assess SISE's explanations beyond a few evaluation metrics, another sanity check was performed. Fig. 12 shows such an experiment, where an untrained VGG16 model was directly compared with our VGG16 model trained on the Pascal VOC dataset. SISE does not generate quality explanations from the untrained model, indicating that our method does not merely highlight "featured regions" obtained through convolutional operations, but depicts the actual "attributed regions" affecting the model's decision.

Figure 12: SISE results from a VGG16 model trained on the Pascal VOC 2007 dataset compared with an untrained VGG16 model. (Rows: Bus, Cow, Person; columns: Input Image, SISE explanations from the trained and the untrained model.)

Complexity Evaluation

A runtime test was conducted to compare the complexity of the different XAI methods with SISE, timing how long it takes each algorithm to generate an explanation map. It was performed with a Tesla T4 GPU with 16 GB of memory on both a VGG16 and a ResNet-50 model; the results are reported in Table 7.

Table 7: Results of the runtime evaluation of SISE along with other algorithms on a Tesla T4 GPU with 16 GB of memory. Reported runtimes were averaged over 100 trials using a random image from the PASCAL VOC 2007 test set for each trial.

XAI Method             Runtime on VGG16 (s)  Runtime on ResNet-50 (s)
Grad-CAM               0.006                 0.019
Grad-CAM++             0.006                 0.020
Extremal Perturbation  87.42                 78.37
RISE                   64.28                 26.08
Score-CAM              5.90                  18.17
Integrated Gradient    0.68                  0.52
FullGrad               18.69                 34.03
SISE                   5.90                  9.21

Grad-CAM and Grad-CAM++ are the fastest methods when applied to both models. This is expected, as they operate using only one main forward pass and one backward pass. Our method, SISE, is not the fastest, and the main bottleneck in its runtime is the number of feature maps extracted and used from the CNN. This is addressed by adjusting µ, as discussed in the 'Ablation Study' section.
Figure 13: Qualitative comparison of SISE with other state-of-the-art XAI methods (Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE) with a ResNet-50 model on the Pascal VOC 2007 dataset. (Rows: Train 1.000, Person 0.9959, Dog 0.9408, Person 0.9889, Cat 0.9999, Person 0.0027, Horse 0.9962.)

Figure 14: Comparison of SISE explanations generated with a VGG16 model on the Pascal VOC 2007 dataset. (Rows: Cat 1.000, Chair 9.65e-06, Person 0.999, Person 1.24e-04, Car 0.999.)

Figure 15: Qualitative results of SISE and other XAI algorithms from the ResNet-101 model trained on the recast Severstal dataset. (Rows: Class 1 0.8513, Class 2 0.92, Class 3 0.9994, Class 4 0.9983.)

Figure 16: Explanations of SISE along with other conventional methods from a VGG16 model on the MS COCO 2014 dataset. (Rows: Elephant 0.1291, Toilet 0.9962, Tennis Racket 0.0031, Person 0.9999, Truck 0.8803.)

Figure 17: Qualitative results of SISE and other XAI algorithms from the ResNet-50 model trained on the MS COCO 2014 dataset. (Rows: Fire Hydrant 0.9542, Pizza 0.0597, Handbag 0.0012, Donut 0.9786, Cup 0.0203, Person 0.9999, Bicycle 6.13e-07.)
