Explaining Convolutional Neural Networks Through Attribution-Based Input Sampling and Block-Wise Feature Aggregation
LG AI Research
sam.sattarzadeh, [email protected]; j.jang, [email protected]
Figure 3: Schematic of SISE’s layer visualization framework (first three phases). The procedure in this framework is applied to
multiple layers and is followed by the fusion framework (as in Fig. 5).
\alpha_k^{(l)} = \sum_{\lambda^{(l)} \in \Lambda^{(l)}} \frac{\partial \Psi(I)}{\partial A_k^{(l)}(\lambda^{(l)})}    (6)

The feature maps with corresponding non-positive average gradient scores \alpha_k^{(l)} tend to contain features related to other classes rather than the class of interest. Terming such feature maps as 'negative-gradient', we define the set of attribution masks obtained from the 'positive-gradient' feature maps, M_d^{(l)}, as:

M_d^{(l)} = \{\Omega(A_k^{(l)}) \mid k \in \{1, ..., N\},\ \alpha_k^{(l)} > \mu \times \beta^{(l)}\}    (7)

where \beta^{(l)} denotes the maximum average gradient recorded,

\beta^{(l)} = \max_{k \in \{1, ..., N\}} \alpha_k^{(l)}    (8)

In equation 7, \mu \in \mathbb{R}_{\geq 0} is a threshold parameter that is 0 by default to discard negative-gradient feature maps while retaining only the positive-gradients. Furthermore, \Omega(\cdot) represents a post-processing function that converts feature maps to attribution masks. This function contains a 'bilinear interpolation' upsampling the feature maps to the size of the input image, followed by a linear transformation that normalizes the values in the mask in the range [0, 1]. A visual comparison of attribution masks and random masks in Fig. 4 emphasizes such advantages of the former.
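To make this step concrete, the following is a minimal TensorFlow/Keras sketch of Eqs. 6-8 and of \Omega (bilinear upsampling followed by min-max normalization). The paper reports a Keras implementation, but the helper names (`attribution_masks`, `layer_name`) and the exact calls here are illustrative assumptions, not the authors' code:

```python
import tensorflow as tf

def attribution_masks(model, image, class_idx, layer_name, mu=0.0):
    """Score the feature maps of `layer_name` by their summed gradients (Eq. 6)
    and keep only the positive-gradient ones as attribution masks (Eqs. 7-8)."""
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    feat_model = tf.keras.Model(model.inputs,
                                [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        feats, preds = feat_model(image[None, ...])   # feats: (1, h, w, N)
        score = preds[0, class_idx]                   # Psi(I) for the class of interest
    grads = tape.gradient(score, feats)               # dPsi(I) / dA_k
    alpha = tf.reduce_sum(grads, axis=(0, 1, 2))      # Eq. 6: gradients summed over locations
    beta = tf.reduce_max(alpha)                       # Eq. 8: maximum score
    keep = tf.where(alpha > mu * beta)[:, 0]          # Eq. 7: positive-gradient maps only
    selected = tf.gather(feats[0], keep, axis=-1)     # (h, w, K)
    # Omega: bilinear upsampling to the input size, then min-max normalization to [0, 1].
    up = tf.image.resize(selected[None, ...], image.shape[:2], method="bilinear")[0]
    up = up - tf.reduce_min(up, axis=(0, 1), keepdims=True)
    return up / (tf.reduce_max(up, axis=(0, 1), keepdims=True) + 1e-8)    # (H, W, K) masks
```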
Attribution-Based Input Sampling
Considering the same notations as the previous section, and according to the RISE method, the confidence scores observed for the copies of an image masked with a set of binary masks (M : \Lambda \to \{0, 1\}) are used to form the explanation map by

S_{I,\Psi}(\lambda) = \mathbb{E}_M\left[\Psi(I \odot m) \mid m(\lambda) = 1\right]    (9)

where I \odot m denotes a masked image obtained by point-wise multiplication between the input image and a mask m \in M. The representation of equation 9 can be modified to be generalized for sets of smooth masks (M : \Lambda \to [0, 1]). Hence, we reformat equation 9 as:

S_{I,\Psi}(\lambda) = \mathbb{E}_M\left[\Psi(I \odot m) \cdot C_m(\lambda)\right]    (10)

where the term C_m(\lambda) indicates the contribution amount of each pixel in the masked image. Setting the contribution indicator as C_m(\lambda) = m(\lambda) makes equation 10 equivalent to equation 9. We normalize these scores according to the size of the perturbation masks to decrease the assigned reward to the background pixels when a high score is reached for a mask with too many activated pixels. Thus, we define this term as:

C_m(\lambda) = \frac{m(\lambda)}{\sum_{\lambda \in \Lambda} m(\lambda)}    (11)

Such a formulation increases the concentration on smaller features, particularly when multiple objects (either from the same instance or different ones) are present in an image.
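A minimal sketch of this scoring step (Eqs. 10-11) with a Keras classifier is shown below; it accepts any set of smooth masks, for example the attribution masks produced by \Omega above. The function name and batching scheme are illustrative assumptions rather than the paper's implementation:

```python
import tensorflow as tf

def visualization_map(model, image, class_idx, masks, batch_size=32):
    """Empirical version of Eqs. 10-11: average the class score Psi(I * m) of each
    masked copy, weighted by the size-normalized contribution C_m = m / sum(m)."""
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    H, W, K = masks.shape
    sal = tf.zeros((H, W), dtype=tf.float32)
    for start in range(0, K, batch_size):
        m = tf.cast(masks[..., start:start + batch_size], tf.float32)        # (H, W, B)
        masked = image[None, ...] * tf.transpose(m, (2, 0, 1))[..., None]    # (B, H, W, C)
        scores = model(masked)[:, class_idx]                                 # Psi(I * m), shape (B,)
        contrib = m / (tf.reduce_sum(m, axis=(0, 1), keepdims=True) + 1e-8)  # Eq. 11
        sal += tf.reduce_sum(contrib * scores[None, None, :], axis=-1)       # Eq. 10
    return sal / tf.cast(K, tf.float32)    # empirical expectation over the mask set
```

Dividing each mask by the number of its activated pixels is what keeps a large, mostly-background mask from dominating the map when the model still reports a high score for it.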
Putting the block-wise layer selection policy and the attribution mask selection strategy together with the modified RISE framework: for each CNN containing B convolutional blocks, the last layer of each block is indicated as l_b, b \in \{1, ..., B\}. Using equations 10 and 11, we form corresponding visualization maps for each of these layers by:

V_{I,\Psi}^{(l_b)}(\lambda) = \mathbb{E}_{M_d^{(l_b)}}\left[\Psi(I \odot m) \cdot C_m(\lambda)\right]    (12)
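Assuming the two sketches above (and an input `image` with a target `class_idx` as before), the per-block visualization maps of Eq. 12 can be collected with a loop over the last layer of each convolutional block. The layer names below are those exposed by tf.keras.applications.ResNet50 and only stand in for the paper's VOC- and Severstal-trained backbones:

```python
import tensorflow as tf

# Last layer of each of the five convolutional blocks of a Keras ResNet-50
# (an ImageNet model used here purely as a stand-in).
model = tf.keras.applications.ResNet50(weights="imagenet")
block_layers = ["conv1_relu", "conv2_block3_out", "conv3_block4_out",
                "conv4_block6_out", "conv5_block3_out"]

layer_maps = []
for name in block_layers:
    masks = attribution_masks(model, image, class_idx, name, mu=0.0)       # Eqs. 6-8
    layer_maps.append(visualization_map(model, image, class_idx, masks))   # Eq. 12
```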
Fusion Module
In the fourth phase of SISE, the flow of features from low-level to high-level blocks is tracked. The inputs to the fusion module are the layer visualization maps obtained from the third phase of SISE; its output is a 2-dimensional explanation map, which is the final output of SISE. The fusion block is responsible for correcting spatial distortions caused by upsampling coarse feature maps to higher dimensions, and for refining the localization of attributions derived from the model.

Figure 5: SISE fusion module for a CNN with 5 convolutional blocks. (Diagram operations: unweighted addition, point-wise multiplication, Otsu-based binarization, normalization in range [0, 1].)
Our fusion module is designed with cascaded fusion blocks. In each block, the feature information from the visualization maps representing explanations for two consecutive blocks is collected using an "addition" block. Then, the features that are absent in the latter visualization map are removed from the collective information by masking the output of the addition block with a binary mask indicating the activated regions in the latter visualization map. To reach the binary mask, we apply an adaptive threshold to the latter visualization map, determined by Otsu's method (Otsu 1979). By cascading fusion blocks as in Fig. 5, the features determining the model's prediction are represented in a more fine-grained manner while the inexplicit features are discarded.
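A plain-numpy sketch of one fusion block and of the cascade is given below, following the operations named in Fig. 5 (unweighted addition, Otsu-based binarization, point-wise multiplication, normalization to [0, 1]); it assumes the block-wise `layer_maps` from the earlier sketch and is not the authors' implementation:

```python
import numpy as np

def otsu_threshold(x, bins=256):
    """Plain-numpy Otsu threshold: pick the level maximizing between-class variance."""
    hist, edges = np.histogram(x.ravel(), bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                          # weight of the low-intensity class
    w1 = 1.0 - w0                              # weight of the high-intensity class
    m0 = np.cumsum(p * centers) / np.maximum(w0, 1e-12)
    m1 = ((p * centers).sum() - np.cumsum(p * centers)) / np.maximum(w1, 1e-12)
    between = w0 * w1 * (m0 - m1) ** 2         # between-class variance per candidate level
    return centers[np.argmax(between[:-1])]

def fusion_block(earlier, later):
    """One fusion block: unweighted addition, masking with the Otsu-binarized
    later map (point-wise multiplication), then normalization to [0, 1]."""
    added = earlier + later
    binary = (later >= otsu_threshold(later)).astype(added.dtype)
    fused = added * binary
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)

# Cascade the fusion blocks from low-level to high-level visualization maps.
maps = [np.asarray(v, dtype=np.float32) for v in layer_maps]
explanation_map = maps[0]
for later in maps[1:]:
    explanation_map = fusion_block(explanation_map, later)
```

Otsu's method is used because it picks the binarization level adaptively per map, so no fixed threshold has to be tuned for each block.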
Experiments
We verify our method's performance on shallow and deep CNNs, including VGG16, ResNet-50, and ResNet-101 architectures. To conduct the experiments, we employed the PASCAL VOC 2007 (Everingham et al. 2007) and Severstal (PAO Severstal 2019) datasets. The former is a popular object detection dataset containing 4952 test images belonging to 20 object classes. As images with many small object occurrences and multiple instances of different classes are prevalent in this dataset, it is hard for an XAI algorithm to perform well on the whole dataset. The latter is an industrial steel defect detection dataset created for anomaly detection and steel defect segmentation problems. We reformatted it into a defect classification dataset instead, containing 11505 test images from 5 different classes, including one normal class and four different defect classes. Class imbalance, intraclass variation, and interclass similarity are the main challenges of this recast dataset.

Experimental Setup
Experiments conducted on the PASCAL VOC 2007 dataset are evaluated on its test set with a VGG16 and a ResNet-50 model from the TorchRay library (Fong, Patrick, and Vedaldi 2019), trained by (Zhang et al. 2018), both for multi-label image classification. The top-5 accuracies of the models on the test set are 93.29% and 93.09%, respectively. For the experiments on Severstal, we trained a ResNet-101 model (with a test accuracy of 86.58%) on the recast dataset to assess the performance of the proposed method in the task of visual defect inspection. To recast the Severstal dataset for classification, the train and test images were cropped into patches of size 256 x 256. In our evaluations, a balanced subset of 1381 test images belonging to the defect classes labeled 1, 2, 3, and 4 is chosen. We have implemented SISE in Keras and set the parameter µ to its default value, 0.

Qualitative Results
Based on explanation quality, we have compared SISE with other state-of-the-art methods on sample images from the Pascal dataset in Fig. 6 and the Severstal dataset in Fig. 8. Images with both normal-sized and small object instances are shown along with their corresponding confidence scores. Moreover, Figs. 1 and 7, with images of multiple objects from different classes, depict the superior ability of SISE in discriminating the explanations of various classes in comparison with other methods, and RISE in particular.

Quantitative Results
Quantitative analysis includes evaluation results categorized into 'ground truth-based' and 'model truth-based' metrics. The former is used to justify the model by assessing the extent to which the algorithm satisfies the users by providing visually superior explanations, while the latter is used to analyze the model behavior by assessing the faithfulness of the algorithm and its correctness in capturing the attributions in line with the model's prediction procedure. The reported results of RISE and Extremal Perturbation in Table 1 are averaged over three runs. The utilized metrics are discussed below.
Figure 6: Qualitative comparison of the state-of-the-art XAI methods with our proposed SISE for test images from the PASCAL VOC 2007 dataset. The first two rows are the results from a ResNet-50 model, and the last two are from a VGG16 model. (Columns: Input Image, Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE, SISE; target classes and confidence scores: Cat 0.9976, Train 0.9997, Person 0.9999, TV monitor 0.0018.)

Figure 7: Class discriminative ability of SISE vs. RISE obtained from a VGG16 model. (Target classes and confidence scores: 'MotorBike' 0.9928 and 'Person' 0.0071.)

Figure 8: Qualitative comparison of explanation maps by a ResNet-101 model on test images from the Severstal dataset. (Target classes and confidence scores: Defective Class 1, 0.8433; Defective Class 3, 0.9987.)
Ground truth-based Metrics: The state-of-the-art explanation algorithms are compared with SISE based on three distinct ground truth-based metrics to justify the visual quality of the explanation maps generated by our method. Denoting the ground-truth mask as G and the achieved explanation map as S, the evaluation metrics used are:

Energy-Based Pointing Game (EBPG) evaluates the precision and denoising ability of XAI algorithms (Wang et al. 2020). Extending the traditional Pointing Game, EBPG considers all pixels in the resultant explanation map S for evaluation by measuring the fraction of its energy captured in the corresponding ground truth G, as EBPG = \frac{\|S \odot G\|_1}{\|S\|_1}.

mIoU analyses the localization ability and meaningfulness of the attributions captured in an explanation map. In our experiments, we select the top 20% pixels highlighted in each explanation map S and compute the mean intersection over union with their corresponding ground-truth masks.

Bounding box (Bbox) (Schulz et al. 2020) is taken into account as a size-adaptive variant of mIoU. Considering N as the number of ground truth pixels in G, the Bbox score is calculated by selecting the top N pixels in S and evaluating the corresponding fraction captured over G.
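The three ground truth-based metrics can be sketched in a few lines of numpy, assuming binary ground-truth masks; thresholding and tie-breaking details may differ from the evaluation code actually used in the paper:

```python
import numpy as np

def ebpg(sal, gt):
    """EBPG: fraction of the explanation map's energy inside the ground-truth mask."""
    sal = np.clip(sal, 0, None)
    return (sal * gt).sum() / (sal.sum() + 1e-8)

def top_fraction_mask(sal, fraction):
    """Binary mask of the top `fraction` of pixels of an explanation map."""
    return sal >= np.quantile(sal, 1.0 - fraction)

def miou_top20(sal, gt):
    """IoU between the top 20% pixels of the map and the ground-truth mask."""
    top = top_fraction_mask(sal, 0.20)
    inter = np.logical_and(top, gt).sum()
    union = np.logical_or(top, gt).sum()
    return inter / (union + 1e-8)

def bbox_score(sal, gt):
    """Bbox: with N ground-truth pixels, take the top N pixels of the map and
    measure the fraction of them that fall inside the ground truth."""
    n = int(gt.sum())
    idx = np.argsort(sal.ravel())[::-1][:n]    # indices of the N strongest pixels
    return gt.ravel()[idx].sum() / (n + 1e-8)
```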
Model truth-based metrics: To evaluate the correlation between the representations of our method and the model's predictions, model truth-based metrics are employed to compare SISE with the other state-of-the-art methods. As visual explanation algorithms' main objective is to envision the model's perspective for its predictions, these metrics are considered of higher importance.

Drop% and Increase%, as introduced in (Chattopadhay et al. 2018) and later modified by (Ramaswamy et al. 2020; Fu et al. 2020), can be interpreted as indicators of the positive attributions missed and the negative attributions discarded from the explanation map, respectively. Given a model \Psi(\cdot), an input image I_i from a dataset containing K images, and an explanation map S(I_i), the Drop/Increase% metric selects the most important pixels in S(I_i) to measure their contribution towards the model's prediction. A threshold function T(\cdot) is applied on S(I_i) to select the top 15% pixels, which are then extracted from I_i using point-wise multiplication and fed to the model. The confidence scores on such perturbed images are then compared with the original score, according to the equations

Drop% = \frac{1}{K} \sum_{i=1}^{K} \frac{\max(0, \Psi(I_i) - \Psi(I_i \odot T(I_i)))}{\Psi(I_i)} \times 100    and    Increase% = \sum_{i=1}^{K} \mathrm{sign}(\Psi(I_i \odot T(I_i)) - \Psi(I_i)).
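A hedged numpy/Keras sketch of this protocol follows; here Increase% is reported as the percentage of images whose confidence grows after the perturbation, and the 15% threshold follows the description above:

```python
import numpy as np

def drop_increase(model, images, sal_maps, class_ids, top=0.15):
    """Drop% / Increase% sketch: keep the top 15% pixels of each explanation map,
    re-run the model on the perturbed image, and compare confidence scores."""
    drops, increases = [], []
    for img, sal, c in zip(images, sal_maps, class_ids):
        thresh = np.quantile(sal, 1.0 - top)                 # T(.): top 15% of pixels
        keep = (sal >= thresh).astype(img.dtype)[..., None]  # broadcast over channels
        orig = float(model(img[None, ...])[0, c])            # Psi(I_i)
        pert = float(model((img * keep)[None, ...])[0, c])   # Psi(I_i * T(I_i))
        drops.append(max(0.0, orig - pert) / (orig + 1e-8) * 100.0)
        increases.append(1.0 if pert > orig else 0.0)
    # Drop% averages the normalized confidence drops; Increase% is the share of
    # images whose confidence increases, reported as a percentage.
    return float(np.mean(drops)), float(np.mean(increases) * 100.0)
```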
Model      Metric    Grad-CAM  Grad-CAM++  Extremal Perturbation  Score-CAM  Integrated Gradient  RISE   FullGrad  SISE
VGG16      EBPG      55.44     46.29       61.19                  33.44      46.42                36.87  38.72     60.54
VGG16      mIoU      26.52     28.1        25.44                  27.11      27.71                14.11  26.61     27.79
VGG16      Bbox      51.7      55.59       51.2                   54.59      54.98                33.97  54.17     55.68
VGG16      Drop      49.47     60.63       43.90                  39.62      39.79                64.74  60.78     38.40
VGG16      Increase  31.08     23.89       32.65                  37.76      36.42                26.17  22.73     37.96
ResNet-50  EBPG      60.08     47.78       63.24                  32.86      35.56                40.62  39.55     66.08
ResNet-50  mIoU      32.16     30.16       26.29                  27.4       31.0                 15.41  20.2      31.37
ResNet-50  Bbox      60.25     58.66       52.34                  55.55      60.02                34.79  44.94     61.59
ResNet-50  Drop      35.80     41.77       39.38                  39.77      35.36                66.12  65.99     30.92
ResNet-50  Increase  36.58     32.15       34.27                  37.08      37.08                24.24  25.36     40.22

Table 1: Results of ground truth-based and model truth-based metrics for state-of-the-art XAI methods along with SISE (proposed) on two networks trained on the PASCAL VOC 2007 dataset. For each metric, the best is shown in bold, and the second-best is underlined. Except for Drop%, the higher is better for all other metrics. All values are reported in percentage.
XAI method   Drop%   Increase%
Grad-CAM     67.44   12.46
Grad-CAM++   64.1    12.96
RISE         63.25   15.63
Score-CAM    64.29   10.35
FullGrad     77.23   10.26
SISE         61.06   15.64

Table 2: Results of model truth-based metrics of SISE and state-of-the-art algorithms on a ResNet-101 model trained on the Severstal dataset.

Discussion
The experimental results in Figs. 1, 6, 7, and 8 demonstrate the resolution and concreteness of SISE explanation maps, which is further supported by justifying our method via ground truth-based evaluation metrics as in Table 1. Also, model truth-based metrics in Tables 1 and 2 prove SISE's supremacy in highlighting the evidence based on which the model makes a prediction. Similar to the CAM-based methods, the output of the last convolutional block plays the most critical role in our method. However, by considering the intermediate layers based on the block-wise layer selection, SISE's advantageous properties are enhanced. Furthermore, utilizing attribution-based input sampling instead of randomized sampling, ignoring the unrelated feature maps, and modifying the linear combination step dramatically improve the visual clarity and completeness offered by SISE.

Complexity Evaluation  In addition to performance evaluations, a runtime test is carried out to compare the complexity of the methods, using a Tesla T4 GPU with 16GB of memory and the ResNet-50 model. Reported runtimes were averaged over 100 trials using random images from the PASCAL VOC 2007 test set. Grad-CAM and Grad-CAM++ achieved the best runtimes, 19 and 20 milliseconds, respectively. On the other hand, Extremal Perturbation recorded the longest runtime, 78.37 seconds, since it optimizes numerous variables. In comparison with RISE, which has a runtime of 26.08 seconds, SISE runs in 9.21 seconds.

Ablation Study  While RISE uses around 8000 random masks to operate on a ResNet-50 model, SISE uses around 1900 attribution masks with µ set to 0, out of a total of 3904 feature maps initially extracted from the same ResNet-50 model before negative-gradient feature maps were removed. The difference in the number of masks allows SISE to operate in around 9.21 seconds. To analyze the effect of reducing the number of attribution masks on SISE's performance, an ablation study is carried out. By changing µ to 0.3, only a scanty variation in the boundary of the explanation maps can be noticed, while the runtime is reduced to 2.18 seconds. This shows that ignoring feature maps with low gradient values does not considerably affect SISE outputs, since they tend to be assigned low scores in the third phase of SISE anyway. By increasing µ to 0.5, a slight decline in performance is recorded along with a runtime of just 0.65 seconds.

A more detailed analysis of the effect of µ on various evaluation metrics, along with an extensive discussion of our algorithm and additional results on the MS COCO 2014 dataset (Lin et al. 2014), is provided in the technical appendix of our extended version on arXiv (https://arxiv.org/abs/2010.00672).

Conclusion
In this work, we propose SISE, a novel visual explanation algorithm specialized to the family of CNN-based models. SISE generates explanations by aggregating visualization maps obtained from the output of convolutional blocks through attribution-based input sampling. Qualitative results show that our method can output high-resolution explanation maps, the quality of which is emphasized by quantitative analysis using ground truth-based metrics. Moreover, model truth-based metrics demonstrate that our method also outperforms other state-of-the-art methods in providing concrete explanations. Our experiments reveal that the mutual utilization of features captured in the final and intermediate layers of the model aids in producing explanation maps that accurately locate object instances and reach a greater portion of the attributions leading the model to make a decision.
Acknowledgement
This research was supported by LG AI Research. The authors thank all anonymous reviewers for their detailed suggestions and critical comments on the original manuscript that substantially helped to improve the clarity of this paper.

References
Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; and Kim, B. 2018. Sanity Checks for Saliency Maps. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 31, 9505-9515. Curran Associates, Inc. URL https://proceedings.neurips.cc/paper/2018/file/294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf.
Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; and Samek, W. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10(7): e0130140.
Barredo Arrieta, A.; Diaz Rodriguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado González, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, V. R.; Chatila, R.; and Herrera, F. 2019. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Information Fusion doi:10.1016/j.inffus.2019.12.012.
Chattopadhay, A.; Sarkar, A.; Howlader, P.; and Balasubramanian, V. N. 2018. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 839-847. doi:10.1109/WACV.2018.00097.
Everingham, M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; and Zisserman, A. 2007. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. URL http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
Fong, R.; Patrick, M.; and Vedaldi, A. 2019. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision, 2950-2958.
Fu, R.; Hu, Q.; Dong, X.; Guo, Y.; Gao, Y.; and Li, B. 2020. Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs. In British Machine Vision Conference.
Hoffman, R. R.; Mueller, S. T.; Klein, G.; and Litman, J. 2018. Metrics for Explainable AI: Challenges and Prospects. CoRR abs/1812.04608. URL http://arxiv.org/abs/1812.04608.
Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700-4708.
Iwana, B. K.; Kuroki, R.; and Uchida, S. 2019. Explaining convolutional neural networks using softmax gradient layer-wise relevance propagation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 4176-4185. IEEE.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision, 740-755. Springer.
Lipton, Z. C. 2018. The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability is Both Important and Slippery. Queue 16(3): 31-57. ISSN 1542-7730. doi:10.1145/3236386.3241340. URL https://doi.org/10.1145/3236386.3241340.
Meng, F.; Huang, K.; Li, H.; and Wu, Q. 2019. Class Activation Map Generation by Representative Class Selection and Multi-Layer Feature Fusion. arXiv preprint arXiv:1901.07683.
Nam, W.-J.; Gur, S.; Choi, J.; Wolf, L.; and Lee, S.-W. 2020. Relative Attributing Propagation: Interpreting the Comparative Contributions of Individual Units in Deep Neural Networks. In AAAI, 2501-2508.
Omeiza, D.; Speakman, S.; Cintas, C.; and Weldermariam, K. 2019. Smooth Grad-CAM++: An enhanced inference level visualization technique for deep convolutional neural network models. arXiv preprint arXiv:1908.01224.
Otsu, N. 1979. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1): 62-66.
PAO Severstal. 2019. Severstal: Steel Defect Detection on Kaggle Challenge. URL https://www.kaggle.com/c/severstal-steel-defect-detection.
Petsiuk, V.; Das, A.; and Saenko, K. 2018. RISE: Randomized Input Sampling for Explanation of Black-box Models. In Proceedings of the British Machine Vision Conference (BMVC).
Ramaswamy, H. G.; et al. 2020. Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-free Localization. In The IEEE Winter Conference on Applications of Computer Vision, 983-991.
Rebuffi, S.-A.; Fong, R.; Ji, X.; and Vedaldi, A. 2020. There and Back Again: Revisiting Backpropagation Saliency Methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8839-8848.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 1135-1144.
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4510-4520.
Schulz, K.; Sixt, L.; Tombari, F.; and Landgraf, T. 2020. Restricting the Flow: Information Bottlenecks for Attribution. In International Conference on Learning Representations. URL https://openreview.net/forum?id=S1xWh1rYwB.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Shen, L.; Ma, Q.; and Li, S. 2018. End-to-end time series imputation via residual short paths. In Asian Conference on Machine Learning, 248-263.
Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning Important Features Through Propagating Activation Differences. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3145-3153. International Convention Centre, Sydney, Australia: PMLR. URL http://proceedings.mlr.press/v70/shrikumar17a.html.
Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Workshop at International Conference on Learning Representations.
Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F.; and Wattenberg, M. 2017. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825.
Srinivas, S.; and Fleuret, F. 2019. Full-gradient representation for neural network visualization. In Advances in Neural Information Processing Systems, 4126-4135.
Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 3319-3328. JMLR.org.
Tan, M.; and Le, Q. V. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. CoRR abs/1905.11946. URL http://arxiv.org/abs/1905.11946.
Veit, A.; Wilber, M. J.; and Belongie, S. 2016. Residual networks behave like ensembles of relatively shallow networks. In Advances in neural information processing systems, 550-558.
Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; and Hu, X. 2020. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 24-25.
Zeiler, M. D.; and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision, 818-833. Springer.
Zhang, J.; Bargal, S. A.; Lin, Z.; Brandt, J.; Shen, X.; and Sclaroff, S. 2018. Top-Down Neural Attention by Excitation Backprop. Int. J. Comput. Vision 126(10): 1084-1102. ISSN 0920-5691. doi:10.1007/s11263-017-1059-x. URL https://doi.org/10.1007/s11263-017-1059-x.
Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2921-2929.
Zoph, B.; and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Technical Appendix

Datasets
Experiments are conducted on three different datasets: MS COCO 2014 (Lin et al. 2014), PASCAL VOC 2007 (Everingham et al. 2007), and Severstal (PAO Severstal 2019). The first two datasets are "natural image" object detection datasets, while the last one is an "industrial" steel defect detection dataset. They are discussed in more detail in the following subsections.
MS COCO 2014 and PASCAL VOC 2007 Datasets
The MS COCO 2014 dataset features 80 different object classes, each one of a common object. All experimental results are performed on the validation set, which has 40,504 images. The PASCAL VOC 2007 dataset features 20 object classes, and all experimental results for this dataset are performed on its test set, which has 4,952 images. Both datasets are created for object detection and segmentation purposes and contain images with multiple object classes and images with multiple object instances, making these datasets challenging for XAI algorithms to perform well on.

Severstal Dataset
To extend the analysis of the influence of XAI algorithms beyond natural images, the Severstal steel defect detection dataset was chosen. It was originally hosted on Kaggle as a "detection" task, which we then converted to a "classification" task. The original dataset has 12,568 train images under one normal class labeled "0", and four defective classes numbered 1 through 4. Each image may contain no defect, one defect, or two or more defects from different classes. The ground truth annotations for the segments (masks) are provided in a CSV file, with a single row entry for each class of defect present within each image. The row entries provide the locations of defects, with some entries having several non-contiguous defect locations available.
Figure 9: Sample images with dimension 256 x 256, from each class of the recast Severstal dataset. (Panels: Class 0, Class 1, Class 2, Class 3, Class 4.)
The original images were long strips of steel sheets with dimensions 1600 x 256 pixels. To convert the dataset for our purpose, every training image was cropped (without any overlap) with an initial offset of 32 pixels into 6 individual images of dimensions 256 x 256 pixels. The few empty (black) images that tended to be located along the sides of the original long strip images were discarded, along with images that had multiple types of defects. This re-formulation left a highly imbalanced dataset with 5 distinct classes: 0, 1, 2, 3, and 4. Class 0 contains images with no defects, whereas the other four classes have images with only that specific defect group. Fig. 9 shows sample images from each class of the recast dataset. The image per class distribution is provided in Table 3. The training split is 70% of the data, and the test is the remaining 30%. From the training data, 20% is used for validation. The experimental results and qualitative figures of the Severstal dataset are conducted on a subset of the test set using all of the images from classes 1, 2, and 4, and using 500 images from class 3.
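Under the geometry stated above (1600 x 256 strips, an initial offset of 32 pixels, six non-overlapping 256 x 256 crops), the recasting step can be sketched as follows; the per-patch filtering assumes a per-pixel defect-class mask as an illustrative stand-in for the CSV annotations:

```python
import numpy as np

def crop_strip(strip, patch=256, offset=32, n_patches=6):
    """Crop a 1600x256 steel-strip image into six non-overlapping 256x256 patches,
    starting at a horizontal offset of 32 pixels."""
    patches = []
    for i in range(n_patches):
        x0 = offset + i * patch
        patches.append(strip[:, x0:x0 + patch])
    return patches

def keep_patch(patch, mask):
    """Discard empty (all-black) patches and patches whose defect mask
    contains more than one defect class."""
    if patch.max() == 0:
        return False
    classes = np.unique(mask[mask > 0])    # defect class ids present in the patch
    return len(classes) <= 1
```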
Class   Training set   Test set   Total
0       16620          7124       23744
1       935            401        1336
2       147            63         210
3       8166           3500       11666
4       971            417        1388

Table 3: Data distribution on each class of the recast Severstal dataset, outlining the high data imbalance among them.

Models

VGG16 and ResNet-50
The top-1 accuracies of the VGG16 and ResNet-50 models (loaded from the TorchRay library (Fong, Patrick, and Vedaldi 2019)) on the test set of the PASCAL VOC 2007 dataset were 56.56 percent and 57.08 percent respectively, out of a maximum top-1 accuracy of 64.88 percent, while the top-5 accuracies were 93.29 percent and 93.09 percent respectively, out of a maximum top-5 accuracy of 99.99 percent. The top-1 accuracies of the VGG16 and ResNet-50 on the validation set of the MS COCO 2014 dataset were 29.62 percent and 30.25 percent respectively, out of a maximum top-1 accuracy of 34.43 percent, while the top-5 accuracies were 69.01 percent and 70.27 percent respectively, out of a maximum top-5 accuracy of 93.28 percent.
ResNet-101
A ResNet-101 model was trained on the recast Severstal dataset using a Stochastic Gradient Descent (SGD) optimizer along with a categorical cross-entropy loss function. The model is trained for 40 epochs with an initial learning rate of 0.1, which is dropped by half every 5 epochs. Considering the high data imbalance among the classes, the top-1 accuracy of the ResNet-101 model on the test set of the recast Severstal dataset was 86.58 percent, while the top-3 accuracy was 99.60 percent. Table 5 shows the normalized confusion matrix of this model.
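The training recipe above maps onto Keras roughly as follows (a sketch under stated assumptions; the data pipeline, preprocessing, and classification head are not specified in the paper, and `train_ds`/`val_ds` are placeholder names):

```python
import tensorflow as tf

# SGD, categorical cross-entropy, 40 epochs, initial learning rate 0.1 halved every 5 epochs.
model = tf.keras.applications.ResNet101(weights=None, classes=5, input_shape=(256, 256, 3))
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.TopKCategoricalAccuracy(k=3)])

# Halve the learning rate every 5 epochs: lr(epoch) = 0.1 * 0.5 ** (epoch // 5).
halve_every_5 = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 0.1 * (0.5 ** (epoch // 5)))

# model.fit(train_ds, validation_data=val_ds, epochs=40, callbacks=[halve_every_5])
```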
Evaluation
In addition to the quantitative evaluation results shared in the main paper, the results of both ground truth-based and model truth-based metrics on the MS COCO 2014 dataset are attached in Table 4. Similar to our earlier results, SISE outperforms other conventional XAI methods in most cases.

Model      Metric     Grad-CAM  Grad-CAM++  Extremal Perturbation  Score-CAM  Integrated Gradient  RISE   FullGrad  SISE
VGG16      EBPG       23.77     18.11       25.71                  11.5       12.59                14.01  13.96     28.16
VGG16      mIoU       15.04     15.69       12.81                  14.94      15.52                7.13   14.25     15.57
VGG16      Bbox       28.98     20.48       24.93                  28.9       27.8                 14.54  27.52     29.63
VGG16      Drop%      44.46     45.63       41.86                  38.69      33.73                52.73  52.39     32.9
VGG16      Increase%  40.28     38.33       41.30                  46.05      49.26                34.11  32.68     50.56
ResNet-50  EBPG       25.3      17.81       27.54                  11.35      12.6                 14.41  14.39     29.43
ResNet-50  mIoU       17.89     15.8        13.61                  14.69      16.36                7.24   10.14     17.03
ResNet-50  Bbox       32.39     28.28       26.98                  29.43      29.27                14.54  19.32     33.34
ResNet-50  Drop%      33.42     41.71       36.24                  37.93      35.06                55.38  56.83     31.41
ResNet-50  Increase%  48.39     40.54       45.74                  45.44      47.25                32.18  29.59     49.76

Table 4: Results of ground truth-based and model truth-based metrics for state-of-the-art XAI methods along with SISE (proposed) on two networks (VGG16 and ResNet-50) trained on the MS COCO 2014 dataset. For each metric, the best is shown in bold, and the second-best is underlined. Except for Drop%, the higher is better for all other metrics.
... the fusion module. The experiments on the Severstal dataset were performed for only the ground-truth labels, as each test image has exactly one class id associated with it.

[Table 5: Normalized confusion matrix of the ResNet-101 model (Actual Class vs. Predicted Class, classes 0-4).]

Figure 10: Sanity check experimentation of SISE as per (Adebayo et al. 2018) by randomizing a VGG16 model's (pre-trained on Pascal VOC 2007 dataset) parameters. (Rows: Dog, Bird, Train, Car.)
... the class of interest). These feature maps are expected to be identified by reaching zero or negative backpropagation-based scores. Getting rid of them by setting the threshold parameter µ to 0 (µ is defined in the main manuscript) will improve our method, not only by increasing its speed but also by enabling us to analyze the model's decision-making process more precisely.
Figure 11: Effect of SISE's µ variation on a ResNet-50 model trained on the Pascal VOC 2007 dataset. (Columns: Input Image and four SISE explanation maps; rows and confidence scores: Person 0.9999, Car 0.5281, TV Monitor 0.0014, Motorbike 0.9978.)

By increasing the threshold parameter µ, a trade-off between performance and speed is reached. When this parameter is slightly increased, SISE will discard feature maps with low positive backpropagation-based scores, which is expected not to make a considerable impact on the output explanation map. The higher the parameter µ is, though, the more deterministic feature maps are discarded, causing more degradation in SISE's performance.

To verify these interpretations, we have conducted an ablation analysis on the PASCAL VOC 2007 test set. As stated in the main manuscript, the model truth-based metrics (Drop% and Increase%) are the most important metrics revealing the sensitivity of SISE's performance with respect to its threshold parameter. According to our results as depicted in Table 6 and Fig. 11, the ground truth-based results also follow approximately the same trend for the effect of µ variation. Consequently, our results show that by adjusting this hyper-parameter, a dramatic increase in SISE's speed is gained in return for a slight compromise in its explanation ability.

Since the behavior of our method concerning this hyper-parameter does not depend on the model and the dataset employed, it can be consistently fine-tuned based on the requirements of the end-user.
Sanity Check
In addition to the comprehensive quantitative experiments presented in the main manuscript and this appendix, we also verified the sensitivity of our explanation algorithm to the model's parameters, illustrating that our method adequately ...
[Figure: SISE explanations for sample input images (Bus, Cow) obtained from the trained model and from an untrained model.]
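A minimal sketch of such a parameter-randomization check in Keras is given below; the choice of layers to randomize, the noise scale, and the comparison measure are illustrative choices, not the exact protocol of Adebayo et al. (2018) or of this paper:

```python
import numpy as np
import tensorflow as tf

def randomize_layers(model, layer_names, stddev=0.05):
    """Re-initialize the weights of the given layers with random values, in the
    spirit of the sanity checks of Adebayo et al. (2018)."""
    randomized = tf.keras.models.clone_model(model)
    randomized.set_weights(model.get_weights())     # start from the trained weights
    for name in layer_names:
        layer = randomized.get_layer(name)
        layer.set_weights([np.random.normal(0.0, stddev, size=w.shape)
                           for w in layer.get_weights()])
    return randomized

# Explanation maps from the trained and the randomized model can then be compared
# visually (as in the figure above) or with a similarity measure of choice.
```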
Complexity Evaluation
A runtime test was conducted to compare the complexity of the different XAI methods with SISE, timing how long it took for each algorithm to generate an explanation map. It was performed with a Tesla T4 GPU with 16GB of memory on both a VGG16 and ResNet-50 model and attached as Table 7.

XAI Method             Runtime on VGG16 (s)   Runtime on ResNet-50 (s)
Grad-CAM               0.006                  0.019
Grad-CAM++             0.006                  0.020
Extremal Perturbation  87.42                  78.37
RISE                   64.28                  26.08
Score-CAM              5.90                   18.17
Integrated Gradient    0.68                   0.52
FullGrad               18.69                  34.03
SISE                   5.90                   9.21

Table 7: Results of runtime evaluation of SISE along with other algorithms on a Tesla T4 GPU with 16GB of memory.
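The timing protocol can be reproduced with a small wrapper such as the following, where `explainer` is any callable that maps an image to an explanation map (an assumption standing in for the compared methods):

```python
import time
import numpy as np

def average_runtime(explainer, images, trials=100):
    """Average wall-clock time (seconds) one explanation takes, measured over
    `trials` randomly chosen test images."""
    rng = np.random.default_rng(0)
    times = []
    for _ in range(trials):
        img = images[rng.integers(len(images))]
        start = time.perf_counter()
        explainer(img)                     # e.g., a function returning a SISE map
        times.append(time.perf_counter() - start)
    return float(np.mean(times))
```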
Figure 13: Qualitative comparison of SISE with other state-of-the-art XAI methods with a ResNet-50 model on the Pascal VOC 2007 dataset. (Columns: Input Image, Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE, SISE; target classes and confidence scores: Train 1.000, Person 0.9959, Dog 0.9408, Person 0.9889, Cat 0.9999, Person 0.0027, Horse 0.9962.)

Figure 14: Comparison of SISE explanations generated with a VGG16 model on the Pascal VOC 2007 dataset. (Target classes and confidence scores: Cat 1.000, Chair 9.65e-06, Person 0.999, Person 1.24e-04, Car 0.999.)

Figure 15: Qualitative results of SISE and other XAI algorithms from the ResNet-101 model trained on the recast Severstal dataset. (Columns: Input Image, Grad-CAM, Grad-CAM++, Score-CAM, Integrated Gradient, RISE, SISE; target classes and confidence scores: Class 1 0.8513, Class 2 0.92, Class 3 0.9994, Class 4 0.9983.)

Figure 16: Explanations of SISE along with other conventional methods from a VGG16 model on the MS COCO 2014 dataset. (Target classes and confidence scores: Elephant 0.1291, Toilet 0.9962, Tennis Racket 0.0031, Person 0.9999, Truck 0.8803.)

Figure 17: Qualitative results of SISE and other XAI algorithms from the ResNet-50 model trained on the MS COCO 2014 dataset. (Target classes and confidence scores: Fire Hydrant 0.9542, Pizza 0.0597, Handbag 0.0012, Donut 0.9786, Cup 0.0203, Person 0.9999, Bicycle 6.13e-07.)