
Riemann Sum Optimization for Accurate Integrated Gradients Computation

Swadesh Swain, Indian Institute of Technology, Roorkee ([email protected])
Shree Singhi, Indian Institute of Technology, Roorkee ([email protected])

Abstract

Integrated Gradients (IG) is a widely used algorithm for attributing the outputs of a deep neural network to its input features. Because deep learning models lack closed-form integrals, IG is computed with inaccurate Riemann Sum approximations. This often introduces undesirable errors in the form of high levels of noise, leading to false insights into the model's decision-making process. We introduce a framework, RiemannOpt, that minimizes these errors by optimizing the sample point selection for the Riemann Sum. Our algorithm is highly versatile and applicable to IG as well as its derivatives such as BlurIG and Guided IG. RiemannOpt achieves up to a 20% improvement in Insertion Scores. Additionally, it enables users to cut computational costs by up to fourfold, making it highly functional for constrained environments.

1 Introduction

Deep Neural Network (DNN) classifiers for computer vision are increasingly being deployed in critical fields such as healthcare [4] and autonomous driving [3]. Hence, it has become crucial to understand the decision-making process of these models. This has led to a growing body of research focused on understanding how the predictions of these deep networks can be attributed to
specific regions of the image. An attribution method attempts to explain which inputs the model
considers to be most important for its outputs. Several gradient-based [20, 18, 19, 9, 17, 12] and
gradient-free [11, 14, 23, 5, 7, 30, 22, 29] attribution methods have been developed for deep learning
models. Integrated Gradient methods [27, 10, 24] are a specific class of gradient-based attribution
methods that compute a line integral of the gradients of the model over a path defined from a baseline
image to the given input.
The complex functional space of deep learning models is often considered a source of noise for many gradient-based attribution methods, resulting in undesirably high attribution to some background regions. For Integrated Gradients, Kapishnikov et al. [2021] claim that the source of noise is large gradients on the model surface, while Smilkov et al. [2017] argue that the source of error is the rapid fluctuation of the gradients of deep learning models.
Deep learning models do not have closed-form integrals, so their integrals are approximated by Riemann Sums [24]. This approximation involves sampling a number of points along the path and approximating the integral by interpolating between these points. Using more points for the Riemann Sum naturally results in cleaner saliency maps. However, most applications of Integrated Gradients require a high number of steps for the Riemann Sum [24, 16], generally between 20 and 1000, rendering Integrated Gradients practically infeasible for real-time applications. On the other hand, using fewer samples severely degrades the quality of the saliency map. This results in a trade-off between speed and performance.

Interpretable AI: Past, Present and Future Workshop at NeurIPS 2024


Figure 1: Visual comparison of Integrated Gradient methods with and without RiemannOpt. For IG, RiemannOpt suppresses the noise around the Spoonbill and also slightly concentrates stronger attribution scores on the mouse trap. Applying RiemannOpt to BlurIG significantly increases concentration on the subjects of images. GIG saliency maps remain perceptually similar.

Sotoudeh and Thakur [2019] have attempted to tackle the above issue of inaccurate Integrated Gradients computation by exactly computing the underlying integral using ExactLine. However, their approach is limited to neural networks composed of piece-wise linear operations. Most traditional models, like InceptionV3 [25], ViT [6] and ResNet [8], while being primarily composed of linear operations, also make use of non-linear operations like LayerNorm [1], GroupNorm [26] and Attention [2], thus prohibiting the use of ExactLine. Furthermore, ExactLine might not be considered ideal in cases with computational constraints, since it requires ∼14000 gradient computations per image for large models.
To overcome the redundancy caused by ineffective sampling schedules prevalent in Integrated Gradient methods, we introduce RiemannOpt, a framework to pre-determine optimal points for sampling to calculate Riemann Sums. The pre-determined points are specific only to the model. Hence, the computation to determine the points is done only once and does not need to be repeated for every image. Unlike ExactLine, our method does not impose any architectural constraints on the underlying model, with the additional benefit of requiring far fewer samples. We present qualitative and quantitative results for RiemannOpt on Integrated Gradients (IG) [24], Blur Integrated Gradients (BlurIG) [27] and Guided IG (GIG) [10]. Our method can be easily combined with existing IG-based methods, enabling them to generate cleaner saliency maps.

2 Background

In this section, we review the mathematical definition of IG [24], BlurIG [27], and GIG [10].

2.1 Integrated Gradients

Sundararajan et al. [2017] utilized the idea of a path function: $\gamma : [0, 1] \to \mathbb{R}^n$ is a smooth function that denotes a path within $\mathbb{R}^n$ from $x'$ to $x$, satisfying $\gamma(0) = x'$ and $\gamma(1) = x$. Further, they defined the path integrated gradients along the $i$-th dimension for an input $x$, given a baseline $x'$, obtained by integrating the gradients along the path $\gamma(\alpha)$ for $\alpha \in [0, 1]$, as:

$$I_i(x) = \int_0^1 \frac{\partial f(\gamma(\alpha))}{\partial \gamma_i(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha} \, d\alpha, \qquad (1)$$

where $f$ denotes a DNN classifier. Integrated Gradients (IG) [Sundararajan et al., 2017] originally defined the path method as the straight-line path $\gamma^{IG}(\alpha) = x' + \alpha \times (x - x')$ for $\alpha \in [0, 1]$. Later, BlurIG and GIG introduced non-linear paths that had their respective advantages over IG.
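For concreteness, IG along the straight-line path is typically computed with a left Riemann Sum over a schedule of sample points. The following is a minimal PyTorch sketch, not the original implementation; it assumes f returns the scalar target-class score, and the function name and argument layout are illustrative:

import torch

def integrated_gradients_left_riemann(f, x, x_baseline, alphas):
    # f: callable returning a scalar score (e.g., the target-class logit) for an input tensor
    # alphas: 1D tensor of k+1 sample points in [0, 1], with alphas[0] = 0 and alphas[-1] = 1
    attributions = torch.zeros_like(x)
    for i in range(len(alphas) - 1):
        # gamma(alpha_i) on the straight-line path; detach so gradients flow only from this point
        point = (x_baseline + alphas[i] * (x - x_baseline)).detach().requires_grad_(True)
        grad = torch.autograd.grad(f(point), point)[0]
        # g(alpha_i) * (alpha_{i+1} - alpha_i), with dgamma/dalpha = (x - x_baseline)
        attributions += grad * (x - x_baseline) * (alphas[i + 1] - alphas[i])
    return attributions

RiemannOpt only changes which alphas are passed in; a uniform schedule reproduces the standard computation.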

2.2 Blur Integrated Gradients

Xu et al. [2020] introduced Blur Integrated Gradients: for a given function $f : \mathbb{R}^{m \times n} \to [0, 1]$ representing a classifier, let $z(x, y)$ be the 2D input. BlurIG's path is defined by a Gaussian filter that progressively blurs the input. Formally:

$$\gamma^{BlurIG}(x, y, \alpha) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} \frac{1}{\pi \alpha}\, e^{-\frac{m^2 + n^2}{\alpha}}\, z(x - m, y - n)$$

The final BlurIG computation is as follows:

$$I^{BlurIG}(x, y) ::= \int_{\infty}^{0} \frac{\partial f_c(\gamma^{BlurIG}(x, y, \alpha))}{\partial \gamma^{BlurIG}(x, y, \alpha)} \frac{\partial \gamma^{BlurIG}(x, y, \alpha)}{\partial \alpha} \, d\alpha$$
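A point on this path can be sketched with an off-the-shelf Gaussian filter: the kernel $e^{-(m^2+n^2)/\alpha}/(\pi\alpha)$ corresponds to a Gaussian with standard deviation $\sqrt{\alpha/2}$. The snippet below is an assumption-laden sketch (discretized, truncated kernel via SciPy), not the authors' code:

import numpy as np
from scipy.ndimage import gaussian_filter

def blurig_path_point(z, alpha):
    # z: 2D image as a numpy array; large alpha gives a heavily blurred (information-less) baseline,
    # alpha -> 0 recovers the original input.
    if alpha <= 0:
        return z.copy()
    return gaussian_filter(z, sigma=np.sqrt(alpha / 2.0))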

2.3 Guided Integrated Gradients

Guided IG [10] (GIG) follows an adaptive integration path $\gamma^{GIG}(\alpha)$, $\alpha \in [0, 1]$, to avoid high-gradient regions. An adaptive path is one that depends on the model being used:

$$\gamma^{GIG} = \underset{\gamma \in \Gamma}{\arg\min} \sum_{i=1}^{N} \int_0^1 \left| \frac{\partial f(\gamma(\alpha))}{\partial \gamma_i(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha} \right| d\alpha. \qquad (2)$$

After finding the optimal path $\gamma^{GIG}$, GIG computes the attribution values similarly to IG:

$$I_i^{GIG}(x) = \int_0^1 \frac{\partial f(\gamma^{GIG}(\alpha))}{\partial \gamma_i^{GIG}(\alpha)} \frac{\partial \gamma_i^{GIG}(\alpha)}{\partial \alpha} \, d\alpha. \qquad (3)$$

3 Methodology
In this section, we present a simple derivation to determine an upper bound on the error introduced by approximating a one-dimensional integral using a Riemann Sum. We then extend the definition to multi-dimensional line integrals and define the algorithm RiemannOpt uses to schedule samples to minimize this upper bound.

3.1 Error Minimization of Riemann Sums in 1D

We now present the derivation to estimate the error introduced by the left Riemann Sum approximation of a standard 1D integral $\int_{\alpha_0}^{\alpha_k} g(\alpha)\, d\alpha$, where $\{\alpha_i\}_{i=0}^{k}$ is the set of points at which the integrand, $g(\alpha)$, is evaluated.
The standard way to calculate the left Riemann Sum is:

$$R = \sum_{i=0}^{k-1} g(\alpha_i)(\alpha_{i+1} - \alpha_i) \qquad (4)$$

The integral can be broken down as:

$$I = \sum_{i=0}^{k-1} \int_{\alpha_i}^{\alpha_{i+1}} g(\alpha)\, d\alpha \qquad (5)$$

By applying the Taylor Series approximation around $\alpha_i$ in (5):

$$I \approx \sum_{i=0}^{k-1} \int_{\alpha_i}^{\alpha_{i+1}} \left[ g(\alpha_i) + (\alpha - \alpha_i)\, g'(\alpha_i) \right] d\alpha \approx \sum_{i=0}^{k-1} \left[ g(\alpha_i)(\alpha_{i+1} - \alpha_i) + g'(\alpha_i)\, \frac{(\alpha_{i+1} - \alpha_i)^2}{2} \right] \qquad (6)$$

By (4), (6), and the Triangle Inequality:

$$|R - I| \lesssim \frac{1}{2} \sum_{i=0}^{k-1} |g'(\alpha_i)|\, (\alpha_{i+1} - \alpha_i)^2 \qquad (7)$$
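As a quick sanity check of Equation (7), the following snippet compares the actual left Riemann Sum error against the first-order estimate for an illustrative integrand (not taken from the paper):

import numpy as np

# g(alpha) = alpha^2 on [0, 1]; the exact integral is 1/3.
g = lambda a: a ** 2
g_prime = lambda a: 2 * a

alphas = np.linspace(0.0, 1.0, 9)              # 8 equispaced intervals
widths = np.diff(alphas)
R = np.sum(g(alphas[:-1]) * widths)            # left Riemann Sum, Eq. (4)
actual_error = abs(R - 1.0 / 3.0)
estimate = 0.5 * np.sum(np.abs(g_prime(alphas[:-1])) * widths ** 2)   # Eq. (7)
print(actual_error, estimate)                  # ~0.060 vs ~0.055; they differ by the dropped higher-order terms

The estimate tracks the true error closely, which is what makes it a useful objective for choosing the sample points.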

3.2 Algorithm

IG computes d integrals (attributions) per image, one for each pixel. We treat each integral independently and use the derivation above to estimate the average error over all integrals. The input to the function, g, is multidimensional, resulting in a different Taylor Series expansion. However, the approximation remains mathematically sound since the integral corresponding to the i-th feature depends only on the gradient along that component, i.e., the error in the i-th dimension of the gradient contributes only to the i-th integral. We use this observation in conjunction with the finite difference approximation of the derivative to determine the optimal points for sampling a Riemann Sum for the dataset. The primary idea behind the algorithm is to approximate the average |g′(α)| for all input features on a small subset of images, ∼1% of the validation dataset, then compute the optimal sampling points and use them for the entire dataset.
The following tensors are used in Algorithm 1, where d is the dimensionality of the input:

• $I_{k \times d}$: The integrand evaluated at k equispaced points along the path.
• $C_{(k-1) \times d}$: Finite difference estimate of the derivative of I for all input features.
• $A_{k-1}$: Absolute derivative estimate of I averaged over features, corresponding to $|g'(\alpha)|$.

Algorithm 1 Estimation of Optimal Alphas

Inputs:
    A subset of m examples from the validation dataset: $X_i \in \mathbb{R}^d$, $i \in \{1, \dots, m\}$
    Number of sample points in a path: $k$
    Integrand of the IG method: $\frac{\partial f(\gamma(\alpha))}{\partial \gamma(\alpha)} \odot \frac{\partial \gamma(\alpha)}{\partial \alpha}$
Output:
    Optimal sampling points: $\{\alpha_j^*\}_{j=1}^{k}$
Initialization:
    Set $\{\alpha_j\}_{j=1}^{k}$ as $k$ linearly spaced scalars between the integral bounds
    $A \leftarrow$ initialize with zeros
for each $i$ in $\{1, \dots, m\}$ do    ▷ Loop over the subset of examples
    $I_j \leftarrow \frac{\partial f(\gamma(\alpha))}{\partial \gamma(\alpha)} \odot \frac{\partial \gamma(\alpha)}{\partial \alpha} \big|_{\alpha=\alpha_j}$ for $j$ in $\{1, \dots, k\}$
    $C_{(k-1) \times d} \leftarrow \frac{I_{j+1} - I_j}{\alpha_{j+1} - \alpha_j}$ for $j$ in $\{1, \dots, k-1\}$    ▷ Finite difference: $g'(\alpha) \approx \frac{g(\alpha + \Delta\alpha) - g(\alpha)}{\Delta\alpha}$
    Apply element-wise absolute value to $C$
    $A_{k-1}$ += average of $C_{(k-1) \times d}$ across all features    ▷ Estimate of $|g'(\alpha)|$
end for
Normalize $A$ by dividing by the number of examples $m$
$|g'(\alpha)| \leftarrow$ LinearlyInterpolate($A$, $\alpha$)    ▷ $\alpha \in [0, 1]$, $A \in \mathbb{R}^{k-1}$
$\{\alpha_j^*\}_{j=1}^{k} \leftarrow$ the set $\{\alpha_j\}_{j=1}^{k}$ that minimizes the upper-bound error defined by Equation (7)
return $\{\alpha_j^*\}_{j=1}^{k}$
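The estimation loop of Algorithm 1 can be sketched in a few lines of PyTorch, specialized here to the straight-line IG path (so the integrand is the gradient times (x − x′)). Function and variable names are illustrative, not the authors' implementation, and f is assumed to return the scalar target-class score:

import torch

def estimate_abs_gprime(f, examples, baselines, k):
    # examples, baselines: lists of input tensors (the small calibration subset, ~1% of validation)
    alphas = torch.linspace(0.0, 1.0, k)                    # k equispaced sample points
    A = torch.zeros(k - 1)
    for x, x0 in zip(examples, baselines):
        rows = []
        for a in alphas:
            point = (x0 + a * (x - x0)).detach().requires_grad_(True)
            grad = torch.autograd.grad(f(point), point)[0]
            rows.append((grad * (x - x0)).flatten())         # integrand I_j, one row per alpha
        I = torch.stack(rows)                                # shape: k x d
        C = (I[1:] - I[:-1]) / (alphas[1:] - alphas[:-1]).unsqueeze(1)   # finite differences
        A += C.abs().mean(dim=1)                             # average |g'(alpha_j)| over features
    return A / len(examples), alphas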

4 Experimental Setup and Metrics


In this section, we discuss the details of the implementation, dataset, model, and metrics used.

4.1 Experimental Setup

We use the original implementations with default parameters from the authors' code for IG, GIG, and BlurIG, and implement RiemannOpt as a pre-computation step that links with the original implementations. We present our results using InceptionV3 for 16, 32, 64 and 128 sample points on the correctly classified images of the ImageNet validation dataset (∼40K images). To estimate |g′(α)|, we apply Algorithm 1 with 128 samples to a set of 200 randomly selected, correctly classified images from the ImageNet validation dataset. Then, we use Powell's method [15] to determine the optimal set of sampling points. This has roughly the same computational cost as computing the saliency maps for the set of 200 images. Using RiemannOpt is still cost-effective since we only use a small number of images to calculate the sample points but are able to use these points for the entire dataset.
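The point-selection step can be sketched as follows: given the averaged |g′(α)| profile from Algorithm 1, SciPy's Powell optimizer is used to pick a schedule that minimizes the Equation (7) estimate. The parametrization (optimizing interior points, then sorting and clipping to [0, 1] with fixed endpoints) is our assumption, not necessarily the authors' exact setup:

import numpy as np
from scipy.optimize import minimize

def choose_sample_points(abs_gprime, grid, k):
    # abs_gprime: averaged |g'(alpha)| values (e.g., from Algorithm 1)
    # grid: increasing alpha values at which abs_gprime was estimated (same length)
    def bound(free_points):
        a = np.concatenate(([0.0], np.sort(np.clip(free_points, 0.0, 1.0)), [1.0]))
        widths = np.diff(a)
        g = np.interp(a[:-1], grid, abs_gprime)      # linear interpolation of |g'(alpha)|
        return 0.5 * np.sum(g * widths ** 2)         # upper-bound estimate of Eq. (7)

    x0 = np.linspace(0.0, 1.0, k + 1)[1:-1]          # start from a uniform schedule
    res = minimize(bound, x0, method="Powell")
    return np.concatenate(([0.0], np.sort(np.clip(res.x, 0.0, 1.0)), [1.0]))

Since the schedule depends only on the model (through the averaged |g′(α)|), this optimization runs once and the resulting points are reused for every image.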

4.2 Metrics

Previous works use the Insertion Score and Normalized Insertion Score to compare different attribu-
tion methods [27, 9, 10, 28, 14, 13]. It is critical to note that the purpose of the Insertion Score is
to measure the efficacy of a saliency map, i.e. it is not designed to measure how close the Riemann
Sum is to the actual integral. However, it is reasonable to assume that the true saliency map would
generally achieve better Insertion Scores than an inaccurate approximation since inaccurate estimates
introduce noise. Hence, we report the Insertion Scores and, additionally, employ the Axiom of
Completeness [24] to define a new metric that measures the quality of the saliency maps without the
need for this hypothesis.
According to the Axiom of Completeness, the sum of all feature attributions, determined by any Integrated Gradients method, must ideally add up to the difference between the output of f at x and x′. However, there is always an error due to inaccurate Riemann Sum estimates. Furthermore, Sundararajan et al. [2017] advise the developer to ensure that all feature attributions add up to f(x) − f(x′) (within 5%) and suggest increasing the number of samples if the error is greater. Since the ground truth is unavailable, it is non-trivial to determine the numerical accuracy of a computed saliency map. We use the relative error between the sum of feature attributions and f(x) − f(x′) to estimate the error. This metric is not infallible since the features' positive and negative errors partially offset each other during the summation. Using the Triangle Inequality, it can be easily shown that this metric is a lower bound on the true error. Nevertheless, it serves as a helpful proxy, since near-perfect saliency maps will have near-zero error, and highly erroneous maps will, on average, have high error even after the errors partially offset.
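A minimal sketch of this relative-error proxy, assuming f returns the scalar target-class score (an illustrative helper, not the authors' evaluation code):

import torch

def relative_completeness_error(f, x, x_baseline, attributions):
    # attributions: the computed saliency map for input x
    target = f(x) - f(x_baseline)            # completeness: attributions should sum to this
    return (attributions.sum() - target).abs() / target.abs()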

5 Results and Discussion

In this section, we compare the sampling points chosen by RiemannOpt to the linear schedules, followed by a qualitative and quantitative evaluation against the baselines. In the case of BlurIG, the sample points chosen by RiemannOpt differ markedly from the linearly spaced samples, as depicted in Figure 2. Every path starts with an information-less baseline image, x′, and gradually gains perceptible features as it moves towards the input image, x. Along the path, when the image becomes perceptible, the gradients change rapidly, resulting in large values of |g′(α)|. For BlurIG, the image features become perceptible at the end of the path, when most of the sharpening occurs. For IG and GIG, the image becomes perceptible as soon as its brightness crosses a certain threshold, α ≈ 0.1.

Figure 2: Estimated |g′(α)| and comparison of 16 linearly spaced samples with the 16 optimal samples chosen by RiemannOpt. High values of |g′(α)| indicate regions of the path where the gradients of the model are rapidly changing, i.e., regions where the image becomes perceptible to the model.

[Figure 3 plot: Insertion Score (↑) and Normalized Insertion Score (↑) versus number of samples (2^4 to 2^7) for IG, BlurIG, and GIG, each with and without RiemannOpt.]
Figure 3: We compare RiemannOpt against the baseline methods using the Insertion Score and Normalized Insertion Score. We observe a noticeable improvement for BlurIG and IG.

RiemannOpt always reduces the relative error and improves metric scores across all methods and sample counts, as depicted in Table 1 and Figure 3 respectively, with a noticeable enhancement for BlurIG. On the other hand, the improvement for GIG is not very significant. The path of GIG is theoretically fixed for a chosen model. However, because it employs an adaptive path, its practical implementation is highly dependent on the number of samples as well as the location of the samples, unlike BlurIG and IG. In the derivation in Section 3.1, we assumed that the path function was constant and independent of the sample points. The practical implementation of GIG breaks this assumption; this is a possible explanation for why GIG is not improved by RiemannOpt as much as the other methods are. In terms of relative error, RiemannOpt significantly reduces the number of samples required while maintaining comparable performance. Specifically, BlurIG + RiemannOpt achieves similar results with 16 samples as BlurIG with 64 samples. Additionally, IG + RiemannOpt with 16 samples performs comparably to IG with 32 samples, and GIG + RiemannOpt with 16 samples matches the performance of GIG with 128 samples. This makes RiemannOpt highly functional for computationally constrained environments.

Table 1: Relative Error (↓) across different methods

Method                   16 Samples   32 Samples   64 Samples   128 Samples
IG                       0.708        0.374        0.166        0.066
IG + RiemannOpt          0.404        0.223        0.123        0.065
BlurIG                   0.886        0.554        0.268        0.114
BlurIG + RiemannOpt      0.269        0.123        0.058        0.041
GIG                      0.786        0.788        0.725        0.612
GIG + RiemannOpt         0.666        0.731        0.711        0.610

6 Conclusion
In this paper, we present RiemannOpt, a highly efficient framework designed to optimize sample points in Riemann Sums for the computation of Integrated Gradients. Both qualitative and quantitative results demonstrate that RiemannOpt effectively reduces numerical errors in saliency maps and improves Insertion Scores by up to 20%, thereby enhancing the accuracy and reliability of attribution maps. RiemannOpt is adaptable, extending its applicability to any multi-dimensional line integral computation, including derivatives of Integrated Gradients such as BlurIG and GIG. Additionally, it enables users to cut computational costs by up to fourfold, significantly boosting efficiency. Opportunities for future work include extending RiemannOpt to further improve its suitability for Integrated Gradient methods that employ adaptive paths.

Acknowledgments and Disclosure of Funding
We would like to thank Aayan Yadav, Shweta Singh, Anupriya Kumari and Devansh Bhardwaj for their insights during the writing of this paper. We would also like to thank all members of the Data Science Group of IIT Roorkee for their invaluable support.

References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv, 2014.
[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[4] Jorge Cuadros and George Bresnick. EyePACS: An adaptable telemedicine system for diabetic retinopathy screening. Journal of Diabetes Science and Technology, 2009.
[5] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[7] Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] Andrei Kapishnikov, Tolga Bolukbasi, Fernanda Viegas, and Michael Terry. XRAI: Better attributions through regions. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[10] Andrei Kapishnikov, Ben Wedin, Besim Namik Avci, Michael Terry, Subhashini Venugopalan, and Tolga Bolukbasi. Guided integrated gradients: An adaptive path method for removing noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[11] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 2017.
[12] Ettore Mariotti, Jose M. Alonso-Moral, and Albert Gatt. Measuring model understandability by means of Shapley additive explanations. In 2022 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2022.
[13] Deng Pan, Xin Li, and Dongxiao Zhu. Explaining deep neural network models with adversarial gradient integration. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, 2021.
[14] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models, 2018.
[15] M. J. D. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 1964.
[16] Kristina Preuer, Günter Klambauer, Friedrich Rippmann, Sepp Hochreiter, and Thomas Unterthiner. Interpretable deep learning in drug discovery, 2019.
[17] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 2016.
[18] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[19] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
[20] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
[21] Matthew Sotoudeh and Aditya V Thakur. Computing linear restrictions of neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
[22] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net, 2015.
[23] Mukund Sundararajan and Amir Najmi. The many Shapley values for model explanation. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 9269–9278, 2020.
[24] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning. PMLR, 2017.
[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[26] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[27] S. Xu, S. Venugopalan, and M. Sundararajan. Attribution in scale and space. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2020.
[28] Ruo Yang, Binghui Wang, and Mustafa Bilgic. IDGI: A framework to eliminate explanation noise from integrated gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[29] Luisa M Zintgraf, Taco S Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis, 2017.
[30] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 2013.
