AttnLRP: Attention-Aware Layer-Wise Relevance Propagation
for Transformers

Reduan Achtibat1    Sayed Mohammad Vakilzadeh Hatefi1    Maximilian Dreyer1    Aakriti Jain1    Thomas Wiegand1,2,3    Sebastian Lapuschkin1,†    Wojciech Samek1,2,3,†
( 1 Fraunhofer Heinrich-Hertz-Institute, 10587 Berlin, Germany
2 Technische Universität Berlin, 10587 Berlin, Germany
3 BIFOLD – Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany
corresponding authors: {wojciech.samek,sebastian.lapuschkin}@hhi.fraunhofer.de )
Abstract

Large Language Models are prone to biased predictions and hallucinations, underlining the paramount importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box transformer model while maintaining computational efficiency is an unsolved challenge. By extending the Layer-wise Relevance Propagation attribution method to handle attention layers, we address these challenges effectively. While partial solutions exist, our method is the first to faithfully and holistically attribute not only input but also latent representations of transformer models with computational efficiency similar to a single backward pass. Through extensive evaluations against existing methods on LLaMa 2, Mixtral 8x7b, Flan-T5 and vision transformer architectures, we demonstrate that our proposed approach surpasses alternative methods in terms of faithfulness and enables the understanding of latent representations, opening up the door for concept-based explanations. We provide an LRP library at https://github.com/rachtibat/LRP-eXplains-Transformers.

1 Introduction

The attention mechanism Vaswani et al., (2017) became an essential component of large transformers due to its unique ability to handle multimodality and to scale to billions of training samples. While these models demonstrate impressive performance in text and image generation, they are prone to biased predictions and hallucinations Huang et al., (2023), which hamper their widespread adoption.

To overcome these limitations, it is crucial to understand the latent reasoning process of transformer models. Researchers started using the attention mechanism of transformers as a means to understand how input tokens interact with each other. Attention maps contain rich information about the data distribution Clark et al., (2019); Caron et al., (2021), even allowing for image data segmentation. However, attention, by itself, is inadequate for comprehending the full spectrum of model behavior Wiegreffe and Pinter, (2019). Similar to latent activations, attention is not class-specific and solely provides an explanation for the softmax output (in attention layers) while disregarding other model components. Recent works Geva et al., (2021); Dai et al., (2022) have in fact discovered that factual knowledge in Large Language Models (LLMs) is stored in Feed-Forward Network (FFN) neurons, separate from attention layers.

Figure 1: By optimizing LRP for transformer-based architectures, our LRP variant outperforms other state-of-the-art methods in terms of explanation faithfulness and computational efficiency. We are further able to explain latent neurons inside and outside the attention module, allowing us to interact with the model. A more detailed discussion on the differences between AttnLRP and other LRP variants can be found in Appendix A.2.2. Heatmaps for other methods are illustrated in Appendix Figure B.6. Legend: highly (+++), semi- ($\circ$), not (--) suited. Credit: Nataba/iStock.

Further, attention-based attribution methods such as rollout Abnar and Zuidema, (2020); Chefer et al., 2021a result in checkerboard artifacts, as visible in Figure 1 for a Vision Transformer (ViT). Researchers thus have turned to model-agnostic approaches that aim to provide a holistic explanation of the model’s behavior Miglani et al., (2023), including, e.g., perturbation and gradient-based methods.

Methods based on feature perturbation require excessive amounts of compute time (and energy), and in order to access latent attributions they require performing perturbations at each layer separately, resulting in further exponential cost increase. This makes their application economically infeasible, especially for large architectures. In contrast, gradient-based methods benefit from the chain rule in automatic differentiation and can produce latent attributions for all layers in a single backward pass. While prominent gradient-based methods, e.g. Input × Gradient Simonyan et al., (2014), are highly efficient, they suffer from noisy gradients and low faithfulness, as evaluated in Section 4.1.

Another option is to take advantage of the versatility of rule-based backpropagation methods, such as Layer-wise Relevance Propagation (LRP). These methods allow for the customization of propagation rules to accommodate novel operations, allowing for more faithful explanations and requiring only a single backward pass. As thoroughly discussed in Appendix A.2.2, all previous attempts to apply LRP to transformers reused standard LRP rules Ding et al., (2017); Voita et al., (2021); Chefer et al., 2021b ; Ali et al., (2022). However, transformer architectures include several functions for which standard LRP rules do not adequately apply, such as softmax, bi-linear (matrix) multiplication (e.g. query-key multiplication) and layer normalization. Additionally, the routing networks in Mixture of Experts (MoE) models Fedus et al., (2022) present notable challenges due to their combination of these functions. As a result, other previous attempts result in either numerical instabilities or low faithfulness.

Our method represents a significant breakthrough in handling the attribution problem in transformer architectures by enabling an accurate attribution flow through non-linear model components, outperforming existing methods (including perturbation) by a large margin.

Figure 2: AttnLRP combined with ActMax makes it possible to identify relevant neurons and gain insights into their encodings. This allows one to manipulate the latent representations and, e.g., to change the output “Arctic” (by disabling the corresponding neuron) to “Desert” or “Candy Store” (by activating the respective neurons). See also Section 4.3.
Contributions

In this work, we introduce AttnLRP, an extension of LRP within the Deep Taylor Decomposition framework Montavon et al., (2017), with the particular requirements necessary for attributing non-linear transformer components accurately. AttnLRP allows explaining transformer-based models with high faithfulness and efficiency, while also allowing attribution of latent neurons and providing insights into their role in the generation process (see Figure 2).

  1. We derive novel efficient and faithful LRP attribution rules for non-linear attention within the Deep Taylor Decomposition framework, demonstrating their superiority over the state-of-the-art and successfully tackling the noise problem in ViTs.

  2. We illustrate how to gain insights into an LLM generation process by identifying relevant neurons and explaining their encodings.

  3. We provide an efficient and ready-to-use open source implementation of AttnLRP for transformers.

2 Related Work

We present an overview of related work for various model-agnostic and transformer-specialized attribution methods.

2.1 Perturbation & Local Surrogates

In perturbation analysis, such as occlusion-based attribution Zeiler and Fergus, (2014) or SHAP Lundberg and Lee, (2017), the input features are repeatedly perturbed while the effect on the model output is measured Fong and Vedaldi, (2017). AtMan Deb et al., (2023) is specifically adapted to the transformer architecture, where tokens are not suppressed in the input space, but rather in the latent attention weights.

Interpretable local surrogates, on the other hand, replace complex black-box models with simpler linear models that locally approximate the model function being explained. Since the surrogate has low complexity, interpretability is facilitated. Prominent methods include LIME Ribeiro et al., (2016) and LORE Guidotti et al., (2018).

While these approaches are model-agnostic and memory efficient, they have a high computational cost in terms of forward passes. Furthermore, explanations generated on surrogate models cannot explain the hidden representations of the original model. Finally, latent attributions with respect to the prediction must be computed for each layer separately, increasing the computational cost further.

2.2 Attention-based

These methods take advantage of the attention mechanism in transformer models. Although attention maps capture parts of the data distribution, they lack class specificity and do not provide a meaningful interpretation of the final prediction Wiegreffe and Pinter, (2019). Attention Rollout Abnar and Zuidema, (2020) attempts to address the issue by sequentially connecting attention maps of all layers. However, the resulting attributions are still not specific to individual outputs and exhibit substantial noise. Hence, Gildenblat, (2023) has found that reducing noise in attention rollout can be achieved by filtering out excessively strong outlier activations.

To enable class-specificity, the work of Chefer et al., 2021b proposed a novel rollout procedure wherein the attention’s activation is mean-weighted using a combination of the gradient and LRP-inspired relevances. It is important to note that this approach yields an approximation of the mean squared relevance value, which diverges from the originally defined notion of “relevance” or “importance” of additive explanatory models such as SHAP Lundberg and Lee, (2017) or LRP Bach et al., (2015). Subsequent empirical observations by Chefer et al., 2021a revealed that an omission of LRP-inspired relevances and a sole reliance on a positive mean-weighting of the attention’s activation with the gradient improved the faithfulness inside cross-attention layers. However, this approach can only attribute positively and does not consider counteracting evidence.

Attention-rollout based approaches, while offering advantages in terms of computational efficiency and conceptual simplicity, have significant drawbacks. Primarily, they suffer from a limited resolution in the input attribution maps, resulting in undesirable checkerboard artifacts (cf. Figure 1). Moreover, they are unable to attribute hidden latent features beyond the softmax output. Consequently, these approaches only provide explanations for a fraction of the model, thereby compromising the fidelity and limiting the feasibility of explanations within the hidden space.

2.3 Backpropagation-based

Input × Gradient Simonyan et al., (2014) linearizes the model by utilizing the gradient. However, this approach is vulnerable to gradient shattering Balduzzi et al., (2017); Dombrowski et al., (2022), leading to noisy attributions in deep models. Consequently, several works aim to denoise these attributions. SmoothGrad Smilkov et al., (2017) and Integrated Gradients Sundararajan et al., (2017) have attempted to address this issue but have been unsuccessful in the case of large transformers, as demonstrated in the experiments in Section 4.1. Chefer et al., 2021b adapted Grad-CAM Selvaraju et al., (2017) to transformer models by weighting the last attention map with the gradient.

Modified backpropagation methods, such as LRP Bach et al., (2015), decompose individual layer functions instead of linearizing the entire model. They modify the gradient to produce more reliable attributions Arras et al., (2022). The work Ding et al., (2017) was the first to apply standard LRP on non-linear attention layers, while Voita et al., (2021) proposed an improved variant building upon the Deep Taylor Decomposition framework. Nonetheless, both variants can lead to numerical instabilities in attributing the softmax function and do not fulfill the conservation property (3) in matrix multiplication. Ali et al., (2022) considerably improved attributions by recognizing that standard LRP rules were not suitable for these operations and proposed to exclude softmax and normalization operations from the computational graph by stopping the relevance (gradient) flow through them. However, this does not resolve the fundamental challenge of optimally applying LRP to non-linear operations. In Appendix A.2.2, we provide a comprehensive analysis of the different LRP variants.

3 Attention-Aware LRP for Transformers

First, we motivate LRP in the framework of additive explanatory models. Then, we generalize the design of new rules for non-linear operations. Finally, we apply our methodology successively on each operation utilized in a transformer model to derive efficient and faithful rules.

3.1 Layer-wise Relevance Propagation

Layer-wise Relevance Propagation (LRP) Bach et al., (2015); Montavon et al., (2019) belongs to the family of additive explanatory models, which includes the well-known Shapley Lundberg and Lee, (2017), Gradient × Input Simonyan et al., (2014) and DeepLIFT Shrikumar et al., (2017) methods.

The underlying assumption of such models is that a function $f_j$ with $N$ input features $\textbf{x} = \{x_i\}_{i=1}^{N}$ can be decomposed into individual contributions of single input variables $R_{i \leftarrow j}$ (called “relevances”). Here, $R_{i \leftarrow j}$ denotes the amount of output $j$ that is attributable to input $i$, which, when added together, equals (or is proportional to) the original function value. Mathematically, this can be written as:

$$f_j(\textbf{x}) \propto R_j = \sum_{i}^{N} R_{i \leftarrow j} \tag{1}$$

If an input $i$ is connected to several outputs $j$, e.g., in a multidimensional function $\textbf{f}$, the contributions of each output $j$ are losslessly aggregated:

$$R_i = \sum_j R_{i \leftarrow j}. \tag{2}$$

This provides us with “importance values” for the input variables, which reveal their direct contribution to the final prediction. Unlike other methods, LRP treats a neural network as a layered directed acyclic graph, where each neuron $j$ in layer $l$ is modeled as a function node $f^{l}_{j}$ that is individually decomposed according to Equation (1). Beginning at the model output $L$, the initial relevance value $R^{L}_{j} \propto f^{L}_{j}$ is successively distributed to its prior network neurons one layer at a time. Hence, LRP follows the flow of activations computed during the forward pass through the model in the opposite direction, from output $f^{L}$ back to input layer $f^{1}$.

This decomposition characteristic of LRP gives rise to the important conservation property:

$$R^{l-1} = \sum_i R^{l-1}_i = \sum_{i,j} R^{(l-1,l)}_{i \leftarrow j} = \sum_j R^{l}_j = R^{l} \tag{3}$$

ensuring that the sum of all relevance values in each layer remains constant. This property allows for meaningful attribution, as the scale of each relevance value can be related to the original function output $f^{L}$.

3.1.1 Decomposition through Linearization

To design a faithful attribution method, the challenge lies in identifying a meaningful distribution rule $R_{i \leftarrow j}$. Possible solutions encompass all decompositions that adhere to the conservation property (3). However, for a decomposition to be considered faithful, it should approximate the characteristics of the original function as closely as possible.

In this paper, we take advantage of the Deep Taylor Decomposition framework Montavon et al., (2017) to locally linearize and decompose neural network operations into independent contributions. As a special case, we further establish the relationship between one derived rule and the Shapley Values framework in Section 3.3.2.

We start by computing a first-order Taylor expansion at a reference point $\tilde{\textbf{x}}$. For the purpose of simplifying the equation, we assume that the reference point $\tilde{\textbf{x}}$ is constant:

$$\begin{aligned}
f_j(\textbf{x}) &= f_j(\tilde{\textbf{x}}) + \sum_i \textbf{J}_{ji}(\tilde{\textbf{x}})\,(x_i - \tilde{x}_i) + \mathcal{O}(|\textbf{x} - \tilde{\textbf{x}}|^{2}) \\
&= \sum_i \textbf{J}_{ji}\, x_i + \underbrace{f_j(\tilde{\textbf{x}}) - \sum_i \textbf{J}_{ji}\, \tilde{x}_i + \mathcal{O}(|\textbf{x} - \tilde{\textbf{x}}|^{2})}_{\text{bias } \tilde{b}_j}
\end{aligned} \tag{4}$$

where $\mathcal{O}$ is the approximation error in Big-$\mathcal{O}$ notation and the Jacobian $\textbf{J}$ is evaluated at the reference point $\tilde{\textbf{x}}$, which is omitted in the following for brevity. (Footnote 1: If $\tilde{\textbf{x}} = \textbf{x}$, this is equivalent to Gradient × Input. We have taken the DTD perspective to highlight the bias term, which is important for the upcoming discussion.) The bias term represents the constant portion of the function and the approximation error that cannot be directly attributed to the input variables.

We substitute the layer function with its first-order expansion and assert its proportionality to a relevance value $R_j$ following Equation (1) through multiplication with a constant factor $c \in \mathbb{R}$ with $f_j(\textbf{x}) \neq 0$, i.e., $c = R_j / f_j(\textbf{x})$:

$$R_j = f_j(\textbf{x})\, c = \sum_i \underbrace{\textbf{J}_{ji}\, x_i \frac{R_j}{f_j(\textbf{x})}}_{R_{i \leftarrow j}} + \underbrace{\tilde{b}_j \frac{R_j}{f_j(\textbf{x})}}_{R_{b \leftarrow j}}$$

Comparing with Equation (1), we identify $R_{i \leftarrow j}$ as the relevance assigned to the input variables and $R_{b \leftarrow j}$ as the relevance assigned to the bias term. Hence, the bias term absorbs a portion of the relevance $R_j$ that is not allocated to the input variables. This technically violates the conservation property (3), as only $R_{i \leftarrow j}$ is further distributed to prior layers, reducing the amount of relevance per distribution step. However, Bach et al., (2015) treat bias terms as additional hidden neurons (with an activation value of one and a weight that equals the bias value, connected to the output), including them in the conservation property (3). Consequently, we regard this relevance as preserved, rather than lost. Alternatively, to strictly enforce conservation, the absorbed relevance score of the bias term can be distributed equally among the input variables, or the bias term can be excluded completely, as explained in Appendix A.2.1.
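To make the split between input relevance and bias relevance concrete, the following minimal sketch evaluates it numerically. The toy layer f(x) = tanh(Wx + offset), the helper name `taylor_relevance_split`, and the choice of reference point x̃ = x (so that the approximation error vanishes) are illustrative assumptions and not part of the method's definition.

```python
import numpy as np

def taylor_relevance_split(W, offset, x, R_out):
    """Split R_out into input and bias relevance for f(x) = tanh(W x + offset).

    The expansion is taken at the reference point x_tilde = x, so the
    bias term reduces to b_tilde = f(x) - J x.
    """
    f = np.tanh(W @ x + offset)                  # layer output f_j(x)
    J = (1.0 - f**2)[:, None] * W                # Jacobian J[j, i] = d f_j / d x_i
    b_tilde = f - J @ x                          # bias term of the expansion
    scale = R_out / f                            # constant factor R_j / f_j(x)
    R_inputs = J * x[None, :] * scale[:, None]   # R_{i<-j}, shape (outputs, inputs)
    R_bias = b_tilde * scale                     # R_{b<-j}, shape (outputs,)
    return R_inputs, R_bias

# Hypothetical toy values, chosen only for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
offset = rng.normal(size=3)
x = rng.normal(size=4)
R_out = np.tanh(W @ x + offset)                  # initialise relevance with the output

R_inputs, R_bias = taylor_relevance_split(W, offset, x, R_out)
print("relevance passed to inputs:", R_inputs.sum(axis=1))
print("relevance absorbed by bias:", R_bias)
```

For every output j, the input part and the bias part add up to R_j, while the input part alone generally does not. This is the sense in which the bias term absorbs relevance.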

To obtain a propagation rule for the input variables, we apply Equation (2) without the bias term. In addition, we insert a stabilizing factor $\varepsilon \ll |f_j(\textbf{x})| \in \mathbb{R}^{+}$ with the sign of $f_j(\textbf{x})$ to allow for the case $f_j(\textbf{x}) = 0$:

$$R_i = \sum_j R_{i \leftarrow j} = \sum_j \textbf{J}_{ji}\, x_i \frac{R_j}{f_j(\textbf{x}) + \varepsilon\, \text{sign}(f_j(\textbf{x}))} \tag{5}$$

In the following, $\text{sign}(f_j(\textbf{x})) \in \{-1, 1\}$ is omitted for brevity. Note that $\varepsilon$ acts as a bias term and absorbs a negligible amount of the relevance.

To benefit from GPU parallelization, this formula can be written in matrix form:

$$\Rightarrow \textbf{R}^{l-1} = \textbf{x} \odot \textbf{J}^{\top} \cdot \textbf{R}^{l} \oslash (\textbf{f}(\textbf{x}) + \varepsilon)$$

where $\odot$ denotes the Hadamard product, $\oslash$ element-wise division and $\textbf{R}^{l}$ a relevance vector at layer $l$. This formula can be efficiently implemented in automatic differentiation libraries, such as PyTorch Paszke et al., (2019). Compared to a basic backward pass, we have additional computational complexity for the element-wise operations.
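The matrix form can be checked against the element-wise rule of Equation (5) in a few lines. The sketch below uses NumPy with an explicitly constructed Jacobian rather than an automatic differentiation library; the toy layer f(x) = tanh(Wx) and the function names are assumptions made for illustration. The division is applied to the relevance vector before the multiplication with the transposed Jacobian, which is the reading that reproduces Equation (5).

```python
import numpy as np

def stabilize(f, eps=1e-6):
    """Return f + eps * sign(f), treating sign(0) as +1."""
    return f + eps * np.where(f >= 0, 1.0, -1.0)

def lrp_matrix_form(x, J, f, R_out, eps=1e-6):
    """Matrix form: x * (J^T @ (R_out / (f + eps)))."""
    return x * (J.T @ (R_out / stabilize(f, eps)))

def lrp_elementwise(x, J, f, R_out, eps=1e-6):
    """Reference implementation of Equation (5) with explicit loops."""
    f_stab = stabilize(f, eps)
    R_in = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(f.shape[0]):
            R_in[i] += J[j, i] * x[i] * R_out[j] / f_stab[j]
    return R_in

# Toy layer f(x) = tanh(W x), whose Jacobian is diag(1 - f^2) W.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
f = np.tanh(W @ x)
J = (1.0 - f**2)[:, None] * W
R_out = rng.normal(size=3)

R_fast = lrp_matrix_form(x, J, f, R_out)
R_slow = lrp_elementwise(x, J, f, R_out)
print(np.allclose(R_fast, R_slow))
```

The product of the transposed Jacobian with a vector is a vector-Jacobian product, which is the quantity a backward pass computes. An autograd-based implementation therefore does not need to materialize the Jacobian as this sketch does.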

3.2 Attributing the Multilayer Perceptron

Commonly, a Multilayer Perceptron consists of a linear layer with a (component-wise) non-linearity producing input activations for the succeeding layer(s):

$$z_j = \sum_i \textbf{W}_{ji}\, x_i + b_j \tag{6}$$
$$a_j = \sigma(z_j) \tag{7}$$

where $\textbf{W}_{ji}$ are the weight parameters and $\sigma$ constitutes a (component-wise) non-linearity.

3.2.1 The $\varepsilon$- and $\gamma$-LRP rule

Linearizing linear layers (6) at any point $\textbf{x} \in \mathbb{R}^{N}$ results in the fundamental $\varepsilon$-LRP Bach et al., (2015) rule

$$R_i^{l-1} = \sum_j \textbf{W}_{ji}\, x_i \frac{R_j^{l}}{z_j(\textbf{x}) + \varepsilon} \tag{8}$$

The bias $b_j$ of Equation (6) and $\varepsilon$ absorb a portion of the relevance. The proof is omitted for brevity. We employ the $\varepsilon$-LRP rule on all linear layers, unless specified otherwise.
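As a sanity check on where the relevance goes, the sketch below applies Equation (8) to a small linear layer and separately tracks the shares absorbed by the bias and by the stabilizer. The function name `lrp_epsilon_linear` and the numeric values are hypothetical; the stabilizer carries the sign of the pre-activation, as in Equation (5).

```python
import numpy as np

def lrp_epsilon_linear(W, b, x, R_out, eps=1e-6):
    """epsilon-LRP for z = W x + b; returns input, bias and epsilon relevance."""
    z = W @ x + b
    z_stab = z + eps * np.where(z >= 0, 1.0, -1.0)   # z + eps * sign(z)
    s = R_out / z_stab                               # relevance per unit of pre-activation
    R_in = x * (W.T @ s)                             # Equation (8)
    R_bias = b * s                                   # share absorbed by the bias b_j
    R_eps = (z_stab - z) * s                         # share absorbed by the stabilizer
    return R_in, R_bias, R_eps

# Hypothetical toy values, chosen only for illustration.
W = np.array([[1.0, -2.0, 0.5],
              [0.3, 0.8, -1.0]])
b = np.array([0.5, -0.2])
x = np.array([1.0, 2.0, 3.0])
R_out = np.array([1.0, 0.5])

R_in, R_bias, R_eps = lrp_epsilon_linear(W, b, x, R_out)
print("to inputs:", R_in.sum())
print("to bias:  ", R_bias.sum())
print("to eps:   ", R_eps.sum())
print("total:    ", R_in.sum() + R_bias.sum() + R_eps.sum(), "vs", R_out.sum())
```

The three shares always add up to the incoming relevance. With a zero bias, the input share matches the incoming relevance up to the small amount taken by the stabilizer.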

In models with many layers, the gradient of a layer can cause noisy attributions due to the gradient shattering effect Balduzzi et al., (2017); Dombrowski et al., (2022). To mitigate this noise, it is best practice to use the $\gamma$-LRP rule Montavon et al., (2019), an extension to improve the signal-to-noise ratio. We have observed that this effect is significantly pronounced in ViTs while LLMs lack visible noise. Therefore, we only apply the $\gamma$-LRP rule to linear layers in ViTs. For more details, please refer to Appendix A.2.3.
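The formula of the γ-rule is not stated in this section. For orientation only, the sketch below implements the basic γ-rule as formulated by Montavon et al. (2019) for a bias-free linear layer with non-negative inputs, where γ times the positive part of each weight is added before propagating. Whether this matches the variant used for ViTs here is an assumption; Appendix A.2.3 is authoritative. The function name and values are hypothetical.

```python
import numpy as np

def lrp_gamma_linear(W, x, R_out, gamma=0.25, eps=1e-9):
    """Basic gamma-rule for z = W x with non-negative inputs x (bias omitted)."""
    W_gamma = W + gamma * np.clip(W, 0.0, None)      # w_ji + gamma * max(w_ji, 0)
    z_gamma = W_gamma @ x                            # modified pre-activations
    z_gamma = z_gamma + eps * np.where(z_gamma >= 0, 1.0, -1.0)
    return x * (W_gamma.T @ (R_out / z_gamma))

# Hypothetical toy values with non-negative inputs.
W = np.array([[1.0, -0.5, 0.5],
              [0.3, 0.8, -0.2]])
x = np.array([1.0, 2.0, 3.0])
R_out = np.array([1.0, 0.0])

for gamma in (0.0, 0.25, 10.0):
    print(gamma, lrp_gamma_linear(W, x, R_out, gamma=gamma))
```

With γ = 0 the rule reduces to the unmodified linear rule. As γ grows, positive contributions dominate and the negative relevance assigned to the second input shrinks, which is the mechanism by which the rule suppresses noise.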

3.2.2 Handling Element-wise Non-Linearities

Since element-wise non-linearities have only a single input and output variable, the decomposition of Equation (1) is the operation itself. Therefore, the entire incoming relevance $R_j^{l}$ can only be assigned to the single input variable.

$$R_i^{l-1} = R_i^{l} \tag{9}$$

The identity rule (9) is applied to all element-wise operations with a single input and single output variable.

3.3 Attributing Non-linear Attention

The heart of the transformer architecture Vaswani et al., (2017) is non-linear attention

$$\textbf{A} = \text{softmax}\left(\frac{\textbf{Q} \cdot \textbf{K}^{\top}}{\sqrt{d_k}}\right) \tag{10}$$
$$\textbf{O} = \textbf{A} \cdot \textbf{V} \tag{11}$$
$$\text{softmax}_j(\textbf{x}) = \frac{e^{x_j}}{\sum_k e^{x_k}} \tag{12}$$

where ($\cdot$) denotes matrix multiplication, $\textbf{K} \in \mathbb{R}^{b \times s_k \times d_k}$ is the key matrix, $\textbf{Q} \in \mathbb{R}^{b \times s_q \times d_k}$ is the query matrix, $\textbf{V} \in \mathbb{R}^{b \times s_k \times d_v}$ is the value matrix, and $\textbf{O} \in \mathbb{R}^{b \times s_q \times d_v}$ is the final output of the attention mechanism. Here, $b$ is the batch dimension including the number of heads, $d_k, d_v$ indicate the embedding dimensions, and $s_q, s_k$ are the number of query and key/value tokens.
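The shapes involved in Equations (10) to (12) are easy to confirm with a short forward-pass sketch. The dimension values below are arbitrary, the softmax is assumed to be taken over the key axis (the standard choice), and subtracting the row maximum is a common numerical-stability trick rather than part of Equation (12). With distinct query and key lengths, the product of A and V has one row per query token.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax of Equation (12) along `axis`, shifted for numerical stability."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Equations (10) and (11) for batched inputs."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (b, s_q, s_k)
    A = softmax(scores, axis=-1)                       # each row sums to one
    O = A @ V                                          # (b, s_q, d_v)
    return A, O

# Arbitrary illustrative sizes; b folds batch and heads together.
b, s_q, s_k, d_k, d_v = 2, 3, 5, 4, 6
rng = np.random.default_rng(0)
Q = rng.normal(size=(b, s_q, d_k))
K = rng.normal(size=(b, s_k, d_k))
V = rng.normal(size=(b, s_k, d_v))

A, O = attention(Q, K, V)
print(A.shape, O.shape)
```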

First and foremost, the softmax function is highly non-linear. In addition, the matrix multiplication is bilinear, i.e., linear in each of its two input variables. In the following, we derive relevance propagation rules for each of these operations, taking efficiency into account.

3.3.1 Handling the Softmax Non-Linearity

In Section 3.1.1, we present a generalized approach to linearization that incorporates bias terms, allowing for the absorption of a portion of the relevance. However, Ali et al., (2022) advocate strict adherence to the conservation property (3) and argue that a linear decomposition of a non-linear function should typically exclude a bias term. While we see the virtue of this approach for operations such as RMSNorm Zhang and Sennrich, (2019) or matrix multiplication, where $f(0) = 0$, we contend that a linearization of the softmax function should inherently incorporate a bias term. This is because the softmax function yields a value of $\frac{1}{N}$ even when the input is zero (where $N$ is the dimension of the inputs), which is analogous to a virtual bias term.

Proposition 3.1 Decomposing the softmax function by a Taylor decomposition (3.1.1) at reference point $\mathbf{x}$ yields the following relevance propagation rule:

$$R^{l-1}_i = x_i \left( R^l_i - s_i \sum_j R^l_j \right) \tag{13}$$

where $s_i$ denotes the $i$-th output of the softmax function. The hidden bias term, which contains the approximation error, consequently absorbs a portion of the relevance.

The proof can be found in Appendix A.3.1. In Appendix A.2.4, we explore the implications of vanishing gradients and temperature scaling on attributing the softmax function, which is important when attributing softmax outside of the attention mechanism, e.g., at the classification output. Note that Ding et al., (2017); Voita et al., (2021); Chefer et al., 2021b ; Ali et al., (2022) propose to handle the bias term differently in order to strictly enforce the conservation property (3). Most of these variants can lead to severe numerical instabilities, as discussed in Appendix A.2.1 and observed empirically in our preliminary experiments.
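To make rule (13) concrete, the following minimal sketch evaluates it on a small vector and checks it against an explicit computation through the softmax Jacobian $J_{ji} = s_j(\delta_{ji} - s_i)$. It assumes that the Taylor-based relevance message has the gradient-times-input form $R_{i \leftarrow j} = J_{ji}\, x_i\, R^l_j / s_j$. The function names and toy numbers are illustrative and are not taken from the accompanying library.

```python
import math

def softmax(x):
    """Numerically stable softmax of a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_relevance(x, relevance_out):
    """Rule (13): R_i^{l-1} = x_i * (R_i^l - s_i * sum_j R_j^l)."""
    s = softmax(x)
    total = sum(relevance_out)
    return [xi * (ri - si * total) for xi, ri, si in zip(x, relevance_out, s)]

def softmax_relevance_via_jacobian(x, relevance_out):
    """Same quantity from the Jacobian J_ji = s_j * (delta_ji - s_i),
    assuming messages of the form J_ji * x_i * R_j^l / s_j."""
    s = softmax(x)
    n = len(x)
    result = []
    for i in range(n):
        acc = 0.0
        for j in range(n):
            jac = s[j] * ((1.0 if i == j else 0.0) - s[i])
            acc += jac * x[i] * relevance_out[j] / s[j]
        result.append(acc)
    return result

# Toy input and incoming relevance (hypothetical values).
x = [1.0, -0.5, 2.0]
r_out = [0.2, 0.1, 0.7]
r_in = softmax_relevance(x, r_out)
r_in_check = softmax_relevance_via_jacobian(x, r_out)
print("input relevance:", r_in)
print("sum in / sum out:", sum(r_in), sum(r_out))
```

In this toy example the input relevances do not sum to the output relevance, which illustrates how the implicit bias term absorbs part of the relevance.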

3.3.2 Handling Matrix-Multiplication

Since $f(0, 0) = 0$ holds, it is desirable to decompose the matrix multiplication without a bias term. To achieve this, we break down the matrix multiplication into an affine operation involving summation and a bi-linear part involving element-wise multiplication.

$$\mathbf{O}_{jp} = \sum_i \underbrace{\mathbf{A}_{ji} \mathbf{V}_{ip}}_{\text{bi-linear part}}$$

The summation already provides a decomposition in the form of Equation (1), and we only need to decompose the individual summands $\mathbf{A}_{ji} \mathbf{V}_{ip}$.

Proposition 3.2 Decomposing element-wise multiplication with $N$ input variables of the form

$$f_j(\mathbf{x}) = \prod_i^N x_{ji}$$

by Shapley (with baseline zero) or Taylor decomposition (3.1.1) at reference point $\mathbf{x}$ (without bias or distributing the bias uniformly) yields the following uniform relevance propagation rule:

$$R_{ji}^{l-1} = \frac{1}{N} R_j^l \,. \tag{14}$$
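As a quick sanity check of the Shapley reading of rule (14), the sketch below computes exact Shapley values for a product of $N$ inputs against an all-zero baseline by enumerating every ordering. With a zero baseline the product stays zero until the last factor is added, so each input is credited with the full value in exactly $1/N$ of the orderings. The helper name and the example values are illustrative assumptions.

```python
from itertools import permutations
from math import factorial, prod

def shapley_values_product(x, baseline=0.0):
    """Exact Shapley values of f(x) = prod_i x_i relative to an all-baseline input."""
    n = len(x)

    def f(present):
        # Inputs not yet "present" are replaced by the baseline value.
        return prod(x[i] if i in present else baseline for i in range(n))

    values = [0.0] * n
    for order in permutations(range(n)):
        present = set()
        for i in order:
            before = f(present)
            present.add(i)
            values[i] += f(present) - before  # marginal contribution of input i
    return [v / factorial(n) for v in values]

x = [2.0, -3.0, 0.5, 4.0]
phi = shapley_values_product(x)
print("Shapley values:", phi)
print("f(x) / N:      ", prod(x) / len(x))
```

Treating $f_j(\mathbf{x})$ as the quantity being explained, every input receives the same share $f_j(\mathbf{x})/N$, which corresponds to passing on $R_j^l / N$ to each factor.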

The proof can be found in Appendix A.3.2. Consequently, the combined rule can be effectively computed using:

Proposition 3.3 Decomposing matrix multiplication with a sequential application of the $\varepsilon$-rule (8) and the uniform rule (14) on the summands yields the following relevance propagation rule for $\mathbf{A}_{ji}$:

$$R_{ji}^{l-1} = \sum_p \mathbf{A}_{ji} \mathbf{V}_{ip} \frac{R^l_{jp}}{2\, \mathbf{O}_{jp} + \varepsilon} \tag{15}$$

There is no bias term absorbing relevance, whereas $\varepsilon$ absorbs a negligible quantity. For $\mathbf{V}_{ip}$, we sum over the $j$ indices. The proof can be found in Appendix A.3.3. By employing this rule, we maintain strict adherence to the conservation property (3), as explained in Appendix A.3.5.
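The following sketch applies rule (15) to a tiny attention-weighted sum $\mathbf{O} = \mathbf{A}\mathbf{V}$ and checks conservation numerically. It is written in plain Python for a single head without a batch dimension; the function names, the matrices and the value of $\varepsilon$ are illustrative assumptions, and a practical implementation would operate on tensors instead of nested lists.

```python
def matmul(a, v):
    """Plain-Python matrix product O_jp = sum_i A_ji * V_ip."""
    return [[sum(a[j][i] * v[i][p] for i in range(len(v)))
             for p in range(len(v[0]))] for j in range(len(a))]

def matmul_relevance(a, v, relevance_out, eps=1e-9):
    """Rule (15): each summand A_ji * V_ip gets its epsilon-rule share of R_jp,
    and that share is split evenly between the two factors (hence the 2)."""
    o = matmul(a, v)
    rows, inner, cols = len(a), len(v), len(v[0])
    r_a = [[0.0] * inner for _ in range(rows)]
    r_v = [[0.0] * cols for _ in range(inner)]
    for j in range(rows):
        for i in range(inner):
            for p in range(cols):
                share = a[j][i] * v[i][p] * relevance_out[j][p] / (2.0 * o[j][p] + eps)
                r_a[j][i] += share  # relevance of A_ji: sum over p
                r_v[i][p] += share  # relevance of V_ip: sum over j
    return r_a, r_v

# Toy attention weights (rows sum to one), values and incoming relevance.
a = [[0.7, 0.3], [0.2, 0.8]]
v = [[1.0, 2.0, 0.5], [3.0, 1.0, 2.0]]
r_out = [[0.1, 0.2, 0.3], [0.15, 0.05, 0.2]]

r_a, r_v = matmul_relevance(a, v, r_out)
total_out = sum(map(sum, r_out))
total_a = sum(map(sum, r_a))
total_v = sum(map(sum, r_v))
print("relevance on A:", total_a, "relevance on V:", total_v, "incoming:", total_out)
```

Half of the incoming relevance ends up on $\mathbf{A}$ and half on $\mathbf{V}$, so their sum matches the incoming relevance up to the small amount absorbed by $\varepsilon$.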

3.3.3 Handling Normalization Layers

Commonly used normalization layers in Transformers include LayerNorm Ba et al., (2016) and RMSNorm Zhang and Sennrich, (2019). These layers apply affine transformations and non-linear normalization sequentially.

$$\text{LayerNorm}(\mathbf{x}) = \frac{x_j - \mathbb{E}[\mathbf{x}]}{\sqrt{\text{Var}[\mathbf{x}] + \varepsilon}} \gamma_j + \beta_j \tag{16}$$
$$\text{RMSNorm}(\mathbf{x}) = \frac{x_j}{\sqrt{\frac{1}{N} \sum_k x^2_k + \varepsilon}} \gamma_j \tag{17}$$

where $\varepsilon, \gamma_j, \beta_j \in \mathbb{R}$. Affine transformations such as the multiplicative weighting of the output or the subtraction of the mean value are linear operations that can be attributed by the $\varepsilon$-LRP rule. Normalization, on the other hand, is non-linear and requires separate considerations. As such, we focus on the following function:

$$f_j(\mathbf{x}) = \frac{x_j}{g(\mathbf{x})} \tag{18}$$

where $g(\mathbf{x}) = \sqrt{\text{Var}[\mathbf{x}] + \varepsilon}$ or $g(\mathbf{x}) = \sqrt{\frac{1}{N} \sum_k x^2_k + \varepsilon}$.

Ali et al., (2022) demonstrate that when linearizing LayerNorm at $\mathbf{x}$, the bias term absorbs a fraction $\text{Var}[\mathbf{x}] / (\text{Var}[\mathbf{x}] + \varepsilon)$ of the relevance, i.e., more than 99% of the relevance with commonly used values of $\varepsilon = 10^{-6}$ and $\text{Var}[\mathbf{x}] = 1$. Hence, a linearization at $\mathbf{x}$ is not meaningful. As a solution, Ali et al., (2022) propose to regard $g(\mathbf{x})$ as a constant, which transforms the normalization operation (18) into a (linear) element-wise operation, on which the identity rule (9) can be applied, as discussed in Appendix A.2.2. In the following, we prove that this heuristic can be derived from the Deep Taylor Decomposition framework.

Proposition 3.4 Decomposing LayerNorm or RMSNorm by a Taylor decomposition (3.1.1) with reference point 0 (without bias or distributing the bias uniformly) yields the identity relevance propagation rule:

$$R_i^{l-1} = R_i^l \tag{19}$$

There is no bias that absorbs relevance. The proof is given in Appendix A.3.4. This rule enforces a strict notion of conservation, while being highly efficient by excluding normalization operations from the computational graph. Experiments in Section 4.1 provide evidence that this simplification is faithful.
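Both observations can be reproduced numerically for the RMSNorm choice of $g(\mathbf{x})$, where the mean square plays the role of the variance. The sketch below first measures which fraction of $f_j(\mathbf{x})$ is explained by the linear term of a Taylor expansion at $\mathbf{x}$ (the remainder is the bias), and then applies the identity rule (19) obtained by holding $g(\mathbf{x})$ constant. The function names and test vector are illustrative assumptions; the closed-form Jacobian used here is $\partial f_j / \partial x_i = \delta_{ji}/g - x_j x_i / (N g^3)$.

```python
import math

def rms(x, eps):
    """g(x) for RMSNorm: sqrt(mean(x_k^2) + eps)."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def linear_term_at_x(x, j, eps):
    """sum_i x_i * (d f_j / d x_i) for f_j(x) = x_j / g(x), evaluated at x."""
    n, g = len(x), rms(x, eps)
    total = 0.0
    for i in range(n):
        jac = (1.0 if i == j else 0.0) / g - x[j] * x[i] / (n * g ** 3)
        total += x[i] * jac
    return total

def identity_rule(x, relevance_out, eps):
    """With g(x) held constant, f_j = x_j / g is element-wise linear, so the
    share x_j * (1 / g) * R_j / f_j collapses to R_j (rule 19)."""
    g = rms(x, eps)
    return [xj * (1.0 / g) * rj / (xj / g) for xj, rj in zip(x, relevance_out)]

x = [1.0, -2.0, 0.5, 1.5]  # hypothetical activations, all non-zero
eps = 1e-6
mean_square = sum(v * v for v in x) / len(x)

kept = linear_term_at_x(x, 0, eps) / (x[0] / rms(x, eps))
print("fraction kept by the linear term:", kept)
print("fraction absorbed by the bias:   ", 1.0 - kept)

r_out = [0.4, 0.1, 0.3, 0.2]
r_in = identity_rule(x, r_out, eps)
print("identity rule output:", r_in)
```

For this input the linear term keeps only $\varepsilon / (\frac{1}{N}\sum_k x_k^2 + \varepsilon)$ of the function value, while the identity rule passes the relevance through unchanged.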

3.4 Understanding Latent Features

Figure 3: There are two approaches for understanding knowledge neurons: (a) Neuron 3948 at the last non-linearity in FFN 17 of the Phi-1.5 model selects a weight row to add to the residual stream. This weight row projected on the vocabulary spans topics about ice, cold places and winter sport. (b) Sentences that maximally activate this neuron contain references about coldness. Attributing the neuron with AttnLRP highlights the most relevant tokens inside the input sentences. Inspired by Voita et al., (2023).

As we iterate through each layer during the attribution process with AttnLRP, we obtain relevance values for each latent neuron as a by-product. Ranking this latent relevance enables us to identify the neurons and layers that are most influential for the reasoning process of the model Achtibat et al., (2023). The subsequent step is to reveal the concept that is represented by each neuron by finding the most representative reference samples that explain the neuron’s encoding. A common technique is Activation Maximization (ActMax) Nguyen et al., (2016), where input samples are sought that give rise to the highest activation value. We follow up on these observations and present the following strategy for understanding latent features: (1) Collect prompts that lead to the highest activation of a unit. (2) Explain the unit’s activation using AttnLRP, which allows us to narrow down the relevant input tokens for the chosen unit.

In this work, we concentrate on knowledge neurons Dai et al., (2022); Voita et al., (2023) that are situated at the last non-linearity in FFN layers, $\mathbf{z} = \text{GELU}(\mathbf{W}_1 \mathbf{x})$. These neurons possess intriguing properties, as shown in Figure 3: They encode factual knowledge and upon activation, the corresponding row of the second weight matrix $\mathbf{W}_2$ is added to the residual stream, directly influencing the output distribution of the model. By projecting this weight row onto the vocabulary, a distribution of the most probable tokens across the vocabulary is obtained Geva et al., (2022). Applying AttnLRP on ActMax reference samples and projecting the weight row on the vocabulary allow us to understand in which context a neuron activates and how its activation influences the prediction of the next token. In contrast to Ali et al., (2022), AttnLRP also allows analyzing the key and value linear layers inside attention modules.
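The two ingredients of this analysis can be sketched in a few lines. The example below selects the prompts with the highest recorded activation of a unit and projects one row of $\mathbf{W}_2$ onto a toy vocabulary. Everything here is hypothetical: the recorded activations, the four-token vocabulary, the unembedding matrix, and the choice to normalize the projected logits with a softmax are assumptions made for illustration only.

```python
import heapq
import math

def top_activating_prompts(activations, k):
    """Step (1): keep the k prompts with the highest activation of a unit."""
    return heapq.nlargest(k, activations, key=activations.get)

def project_onto_vocabulary(weight_row, unembedding, vocabulary, k):
    """Project one row of W_2 onto the vocabulary and return the top-k tokens.
    unembedding[t] is the (assumed) output embedding of token t."""
    logits = [sum(u * w for u, w in zip(row, weight_row)) for row in unembedding]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    ranked = sorted(zip(vocabulary, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical recorded activations of one neuron over a small corpus.
activations = {
    "The glacier froze over.": 4.2,
    "He ate a sandwich.": 0.3,
    "Snow covered the Arctic.": 3.9,
}
reference_samples = top_activating_prompts(activations, k=2)

# Hypothetical 4-token vocabulary with 3-dimensional output embeddings.
vocabulary = ["ice", "snow", "desert", "candy"]
unembedding = [
    [1.0, 0.5, 0.0],
    [0.8, 0.6, 0.1],
    [-1.0, 0.2, 0.3],
    [0.0, -0.5, 1.0],
]
weight_row = [2.0, 1.0, -0.5]  # the row of W_2 selected by the neuron
top_tokens = project_onto_vocabulary(weight_row, unembedding, vocabulary, k=2)

print(reference_samples)
print(top_tokens)
```

The reference samples would then be attributed with AttnLRP with respect to the neuron's activation to highlight the tokens responsible for it.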

4 Experiments

Table 1: Faithfulness scores as area between the least and most relevant order perturbation curves Blücher et al., (2024) on different models and datasets. To assess plausibility, the (top-1) accuracy along with the IoU in parentheses are depicted for SQuAD v2. Methods marked with ($\ast$) have been proposed here. Additional results for ViT-L-16 and ViT-L-32 are in Appendix Table B.6.

| Methods | ViT-B-16: ImageNet ↑ | LLaMa 2-7b: IMDB ↑ | LLaMa 2-7b: Wikipedia ↑ | Mixtral 8x7b: SQuAD v2 ↑ | Flan-T5-XL: SQuAD v2 ↑ |
|---|---|---|---|---|---|
| Random | $0.01 \pm 0.01$ | $-0.01 \pm 0.05$ | $-0.07 \pm 0.13$ | $0.03\ (0.09)$ | $0.03\ (0.08)$ |
| Input×Grad Simonyan et al., (2014) | $0.80 \pm 0.03$ | $0.12 \pm 0.05$ | $0.18 \pm 0.13$ | $0.56\ (0.35)$ | $0.60\ (0.39)$ |
| IG Sundararajan et al., (2017) | $1.54 \pm 0.03$ | $1.23 \pm 0.05$ | $4.05 \pm 0.13$ | $0.68\ (0.44)$ | $0.10\ (0.16)$ |
| SmoothGrad Smilkov et al., (2017) | $-0.04 \pm 0.03$ | $0.25 \pm 0.05$ | $-2.22 \pm 0.14$ | $0.47\ (0.24)$ | $0.05\ (0.09)$ |
| GradCAM Chefer et al., 2021b | $0.27 \pm 0.04$ | $-0.82 \pm 0.05$ | $2.01 \pm 0.15$ | $0.82\ (0.72)$ | $0.81\ (0.70)$ |
| AttnRoll Abnar and Zuidema, (2020) | $1.31 \pm 0.03$ | $-0.64 \pm 0.05$ | $-3.49 \pm 0.15$ | $0.05\ (0.10)$ | $0.02\ (0.08)$ |
| Grad×AttnRoll Chefer et al., 2021a | $2.60 \pm 0.03$ | $1.61 \pm 0.05$ | $9.79 \pm 0.14$ | $0.91\ (0.40)$ | $\mathbf{0.94}\ (0.53)$ |
| AtMan Deb et al., (2023) | $0.70 \pm 0.02$ | $-0.20 \pm 0.05$ | $3.31 \pm 0.15$ | $0.86\ (\mathbf{0.83})$ | $0.88\ (0.80)$ |
| KernelSHAP Lundberg and Lee, (2017) | $4.71 \pm 0.03$ | - | - | - | - |
| CP-LRP ($\varepsilon$-rule, Ali et al., (2022)) | $2.53 \pm 0.02$ | $1.72 \pm 0.04$ | $7.85 \pm 0.12$ | $0.50\ (0.40)$ | $0.91\ (0.83)$ |
| CP-LRP ($\gamma$-rule for ViT, as proposed here)* | $6.06 \pm 0.02$ | - | - | - | - |
| AttnLRP (ours)* | $\mathbf{6.19} \pm 0.02$ | $\mathbf{2.50} \pm 0.05$ | $\mathbf{10.93} \pm 0.13$ | $\mathbf{0.96}\ (0.72)$ | $\mathbf{0.94}\ (\mathbf{0.84})$ |

Our experiments aim to answer the following questions:

  • (Q1) How faithful are our explanations compared to other state-of-the-art approaches?
  • (Q2) How efficient is LRP compared to perturbation-based methods?
  • (Q3) Can we understand latent representations and interact with LLMs?

4.1 Evaluating Explanations (Q1)

Input perturbation experiments Samek et al., (2017); Hedström et al., (2023) are a reliable way to measure the faithfulness of an explanation. This approach iteratively substitutes the most important tokens in the input domain with a baseline value. If the attribution method accurately identified the most important tokens, the model’s confidence in the predicted output should rapidly decrease. Conversely, perturbing the least relevant tokens first should barely affect the model’s prediction and result in a slow decline of the model’s confidence. For more details, see Appendix B.2. Despite its drawbacks, such as potentially introducing out-of-distribution manipulations Chang et al., (2018) and sensitivity towards the chosen baseline value, this approach is widely adopted in the community. Brocki and Chung, (2023); Blücher et al., (2024) have addressed this criticism and introduced an enhanced metric that quantifies the area between the least and most relevant order perturbation curves to obtain a robust measure. Hence, we employ this improved metric to measure faithfulness. Appendix Figure B.5 illustrates a typical perturbation curve.
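The perturbation protocol can be sketched on a toy model as follows. The example replaces tokens with a baseline in most-relevant-first (MoRF) and least-relevant-first (LeRF) order and reports the mean gap between the two curves. The linear toy model, the zero baseline and the use of a simple mean as the area estimate are assumptions for illustration; the exact procedure used in the experiments is described in Appendix B.2.

```python
def perturbation_curve(model, x, order, baseline=0.0):
    """Model output after successively replacing inputs in `order` by the baseline."""
    x = list(x)
    outputs = [model(x)]
    for idx in order:
        x[idx] = baseline
        outputs.append(model(x))
    return outputs

def faithfulness_score(model, x, relevance, baseline=0.0):
    """Mean gap between the LeRF and MoRF perturbation curves (larger is better)."""
    ascending = sorted(range(len(x)), key=lambda i: relevance[i])
    lerf = perturbation_curve(model, x, ascending, baseline)
    morf = perturbation_curve(model, x, ascending[::-1], baseline)
    return sum(l - m for l, m in zip(lerf, morf)) / len(lerf)

# Toy linear "model" whose exact attribution is weight * input.
weights = [3.0, 1.0, 2.0, 0.5]

def toy_model(x):
    return sum(w * v for w, v in zip(weights, x))

x = [1.0, 1.0, 1.0, 1.0]
exact_relevance = [w * v for w, v in zip(weights, x)]
inverted_relevance = [-r for r in exact_relevance]

score_exact = faithfulness_score(toy_model, x, exact_relevance)
score_inverted = faithfulness_score(toy_model, x, inverted_relevance)
print("exact attribution:   ", score_exact)
print("inverted attribution:", score_inverted)
```

An attribution that ranks the inputs correctly yields a positive score, while one that ranks them in the opposite order yields the negated score, which matches the sign pattern seen for weak baselines in Table 1.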

In order to assess plausibility, we utilize the SQuAD v2 Question-Answering (QA) dataset Rajpurkar et al., (2018), which includes a ground truth mask indicating the correct answer within the question. We calculate attributions for accurately answered questions and determine the top-1 accuracy of the most relevant token and the Intersection over Union (IoU) between the positive attribution values and the ground truth mask. This approach assumes that the model solely relies on the information provided in the ground truth mask, which is not entirely accurate but sufficient for identifying a trend.
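The two plausibility measures can be written down directly. The sketch below computes whether the single most relevant token lies inside the ground truth mask and the IoU between the set of positively attributed tokens and the mask; the function names and the four-token example are illustrative assumptions.

```python
def top1_hit(relevance, mask):
    """True if the most relevant token lies inside the ground truth mask."""
    best = max(range(len(relevance)), key=lambda i: relevance[i])
    return bool(mask[best])

def positive_iou(relevance, mask):
    """IoU between tokens with positive attribution and the ground truth mask."""
    positive = {i for i, r in enumerate(relevance) if r > 0}
    truth = {i for i, m in enumerate(mask) if m}
    union = positive | truth
    return len(positive & truth) / len(union) if union else 0.0

# Hypothetical token attributions and answer mask for one question.
relevance = [0.1, -0.2, 0.9, 0.3]
mask = [0, 0, 1, 1]

hit = top1_hit(relevance, mask)
iou = positive_iou(relevance, mask)
print("top-1 hit:", hit, "IoU:", iou)
```

Averaging the hit indicator over all correctly answered questions gives the top-1 accuracy, and averaging the per-sample IoU gives the value reported in parentheses.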

4.1.1 Baselines

We evaluate the faithfulness on two self-attention models, a ViT-B-16 Dosovitskiy et al., (2021) on ImageNet Deng et al., (2009) classification and the LLaMa 2-7b Touvron et al., (2023) model on IMDB movie review Maas et al., (2011) classification as well as next word prediction of Wikipedia Wikimedia Foundation, (2023). Additional results for ViT-L-16 and ViT-L-32 are in Appendix Table B.6. To assess plausibility, we employ two instruction-finetuned models on the SQuAD v2 dataset: the MoE model Mixtral 8x7b Jiang et al., (2024) and the encoder-decoder model Flan-T5-XL Chung et al., (2022). We denote our method as AttnLRP and compare it against a broad spectrum of methods including Input×Gradient (I×G), Integrated Gradients (IG), SmoothGrad (SmoothG), Attention Rollout (AttnRoll), Gradient-weighted Attention Rollout (G×AttnRoll) and Conservative Propagation (CP)-LRP. As explained in Appendix A.2.3, we propose to apply the $\gamma$-rule for AttnLRP in the case of ViTs. For better comparison, we also include an enhanced CP-LRP baseline, which also uses the $\gamma$-rule in the ViT experiments. The LRP variants introduced by Voita et al., (2021); Chefer et al., 2021b are excluded due to numerical instabilities observed in preliminary experiments, see also Appendix A.2.1. Further, we utilize the Grad-CAM adaptation described in Chefer et al., 2021b . Specifically, we weight the last attention map with the gradient. For a fair comparison, we attribute all methods without the softmax at the classification output, except AtMan, which relies on it. KernelSHAP is only evaluated on vision transformers due to prohibitive computational costs on larger LLMs. Finally, we extend AtMan to encoder-decoder models by suppressing tokens in all self-attention layers within the encoder, while only doing so in cross-attention layers within the decoder.
For AtMan, SmoothGrad and Rollout-methods we perform a hyperparameter sweep over a subset of the dataset. More details about baseline methods and the hyperparameter search are in Appendix A.1 and B.3. We illustrate example heatmaps for SQuAD v2 in Appendix B.7.

4.1.2 Discussion

In Table 1, we can observe that AttnLRP consistently outperforms all the state-of-the-art methods in terms of faithfulness. In models with a higher number of non-linearities (higher complexity), AttnLRP demonstrates substantially higher accuracy compared to CP-LRP. While the improvement over CP-LRP in top-1 accuracy is only 3 percentage points for Flan-T5-XL (0.94 vs. 0.91), which only utilizes standard attention layers, AttnLRP achieves a remarkable improvement of 46 percentage points in Mixtral 8x7b (0.96 vs. 0.50), which incorporates additional expert layers with softmax non-linearities and FFN layers with non-linear weighting. In Appendix B.4, we discuss the architectural differences and conduct an ablation study on different model components to demonstrate this effect. We also observe that gradient-based approaches suffer significantly from noisy attributions, as reflected by the low faithfulness and illustrated in example heatmaps in Appendix B.7. CP-LRP with $\varepsilon$ applied on all layers (as proposed in Ali et al., (2022)) also suffers from noisy gradients in ViTs. Applying the $\gamma$-rule instead for CP-LRP and AttnLRP in ViTs improves the faithfulness substantially. Whereas AtMan and GradCAM do not perform well in unstructured tasks, i.e., next word prediction or classification, they achieve a high score in QA tasks. While G×AttnRoll better reflects the model behavior compared to AtMan and GradCAM, it is affected by considerable background noise, resulting in a low IoU score in the SQuAD v2 dataset.

4.2 Computational Complexity and Memory Consumption (Q2)

Table 2 illustrates the computational complexity and memory consumption of a single LRP-based attribution and of linear-time perturbation methods, such as AtMan or a Shapley-based method Fatima et al., (2008). Linear-time perturbation requires $N_T$ forward passes, but has a memory requirement of only $\mathcal{O}(1)$. Since LRP is a backpropagation-based method, gradient checkpointing Chen et al., (2016) techniques can be applied. With checkpointing, LRP requires two forward passes and one backward pass, while the memory requirement scales with the square root of the number of layers (see Table 2). In Appendix B.8, we benchmark energy, time and memory consumption of LRP against perturbation-based methods across context and model sizes.

Table 2: Computational and memory complexity of LRP-based and linear-time perturbation methods measured w.r.t. a single forward pass. $N_L$: number of layers, $N_T$: number of tokens.

| Methods | Computational Complexity | Memory Consumption |
|---|---|---|
| LRP Checkpointing | $\mathcal{O}(1)$ | $\mathcal{O}(\sqrt{N_L})$ |
| Perturbation (linear) | $\mathcal{O}(N_T)$ | $\mathcal{O}(1)$ |
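The $\mathcal{O}(\sqrt{N_L})$ memory entry follows from the usual checkpointing argument of Chen et al., (2016), sketched here under the simplifying assumption that the $N_L$ layers are split into $k$ equally sized segments with equally sized activations (the symbols $k$ and $M$ are introduced only for this illustration):

```latex
% Only the k segment inputs are stored during the first forward pass.
% During the backward pass, one segment of N_L / k layers is recomputed
% and held in memory at a time.
\begin{align}
M(k) &\propto k + \frac{N_L}{k}, \\
\frac{\mathrm{d}M}{\mathrm{d}k} &= 1 - \frac{N_L}{k^2} = 0
  \;\Rightarrow\; k^\ast = \sqrt{N_L}, \\
M(k^\ast) &\propto 2\sqrt{N_L} = \mathcal{O}\!\left(\sqrt{N_L}\right).
\end{align}
```

For example, a model with $N_L = 36$ layers would use $k^\ast = 6$ segments and hold roughly 12 layer activations at peak instead of 36. Every segment is recomputed exactly once, which accounts for the second forward pass and keeps the computational cost at a constant multiple of a single forward pass.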

4.3 Understanding & Manipulating Neurons (Q3)

In our investigation, we use the Phi-1.5 model Li et al., (2023), which has a transformer-based architecture with a next-word prediction objective. We obtain reference samples for each knowledge neuron by collecting the most activating sentences over the Wikipedia summary dataset Scheepers, (2017).

To illustrate, we consider the prompt: ‘The ice bear lives in the’ which gives the corresponding prediction: ‘Arctic’. Using AttnLRP, we determine the most relevant layers for predicting ‘Arctic’ as well as the specific neurons within the FFN layers contributing to this prediction. Our analysis reveals that the most relevant neurons after the first three layers are predominantly situated within the middle layers. Notably, one standout neuron #3948 in layer 17 activates on reference samples about cold temperatures, as depicted in Figure 3. This observation is further validated by projecting the weight matrix of the second FFN layer onto the vocabulary. The neuron shifts the output distribution of the model to cold places, winter sports and animals living in cold regions.

Analogously, for the prompt ‘Children love to eat sugar and’ with the prediction ‘sweets’, the most relevant neuron’s (layer 18, neuron #5687) projection onto the vocabulary signifies a shift in the model’s focus towards the concept of candy, temptation and sweetness in the vocabulary space. We interact with the model by deactivating neuron #3948, and strongly amplifying the activation of neuron #5687 in the forward pass. This manipulation yields the following prediction change:

Prompt: Ice bears live in the
Prediction:  sweet, sugary treats of the candy store.

We further notice that neuron #4104 in layer 17 encodes for dryness, thirst and sand. Increasing its activation changes the output to ‘desert’ (illustrated in Figure 2).
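The kind of intervention described above can be illustrated on a toy FFN layer. The sketch below computes $\mathbf{z} = \text{GELU}(\mathbf{W}_1 \mathbf{x})$, rescales selected neuron activations, and adds the correspondingly weighted rows of $\mathbf{W}_2$ to the output. The tiny weight matrices, the neuron indices and the amplification factor are hypothetical; they are not the Phi-1.5 weights or the settings used in the experiment.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def ffn_forward(x, w1, w2, scale=None):
    """z = GELU(W1 x); neuron k then adds z_k * W2[k] to the output.
    `scale` maps a neuron index to a multiplier applied to its activation."""
    scale = scale or {}
    z = [gelu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    z = [zk * scale.get(k, 1.0) for k, zk in enumerate(z)]
    out = [0.0] * len(w2[0])
    for zk, row in zip(z, w2):
        for d, w in enumerate(row):
            out[d] += zk * w
    return out

# Toy layer with three neurons and a two-dimensional residual stream.
x = [1.0, 0.5]
w1 = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

baseline_out = ffn_forward(x, w1, w2)
switched_off = ffn_forward(x, w1, w2, scale={0: 0.0})        # deactivate neuron 0
edited_out = ffn_forward(x, w1, w2, scale={0: 0.0, 1: 10.0})  # and amplify neuron 1

print("original:", baseline_out)
print("edited:  ", edited_out)
```

Deactivating a neuron removes exactly its row of $\mathbf{W}_2$ (weighted by its activation) from the residual stream, while amplification pushes the output further along the direction of the amplified neuron's row.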

With AttnLRP, we are able to trace the most important neurons in models with billions of parameters. This allows us to systematically navigate the latent space to enable targeted modifications to reduce the impact of certain concepts (for example, ‘coldness’) and enhance the presence of other concepts (for example, ‘dryness’), resulting in discernible output changes. Such an approach holds significant implications for transformer-based models, which have been difficult to manipulate and explain due to inherent opacity and size.

5 Conclusion

We have extended the Layer-wise Relevance Propagation framework to non-linear attention, proposing novel rules for the softmax and matrix-multiplication step and providing interpretations in terms of Deep Taylor Decomposition. Our AttnLRP method stands out due to its unique combination of simplicity, faithfulness, and efficiency. We demonstrate its applicability both for LLMs as well as ViTs, utilizing the denoising effect of the $\gamma$-rule. In contrast to other backpropagation-based approaches, AttnLRP enables the accurate attribution of neurons in latent space (also within the attention module), thereby introducing novel possibilities for real-time model interaction and interpretation.

Limitations & Open Problems

Adjusting the $\gamma$-parameter in ViTs remains crucial to achieve accurate attributions. To reduce memory consumption, the impact of quantization on attributions and custom GPU kernels for LRP rules should be investigated.

Acknowledgements

We extend our heartfelt gratitude to Leila Arras for her invaluable feedback and to Johanna Vielhaben for improving our faithfulness metric. We also thank Patrick Kahardipraja, Daniel Becking and Maximilian Ernst for their insightful comments.

Impact Statement

This work establishes the foundations that make it possible to systematically analyze and debug transformer-based AI systems, thereby minimizing the occurrence of false or misleading outputs (hallucination) and mitigating biases that may arise from training data or algorithmic processes. Particularly, it opens up the door for future applications of transformer-based AI systems in critical domains such as healthcare and finance, where the ability to explain the model behavior is often a (legal) requirement. The high computational efficiency of our method significantly reduces the energy usage and consequently also the financial overhead and environmental impact associated with the explanation, which will result in a broader adoption of XAI for transformers.

References

  • Abnar and Zuidema, (2020) Abnar, S. and Zuidema, W. H. (2020). Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197.
  • Achanta et al., (2012) Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S. (2012). Slic superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282.
  • Achtibat et al., (2023) Achtibat, R., Dreyer, M., Eisenbraun, I., Bosse, S., Wiegand, T., Samek, W., and Lapuschkin, S. (2023). From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 5(9):1006–1019.
  • Ali et al., (2022) Ali, A., Schnake, T., Eberle, O., Montavon, G., Müller, K.-R., and Wolf, L. (2022). Xai for transformers: Better explanations through conservative propagation. In International Conference on Machine Learning, pages 435–451. PMLR.
  • Anders et al., (2021) Anders, C. J., Neumann, D., Samek, W., Müller, K.-R., and Lapuschkin, S. (2021). Software for dataset-wide xai: from local explanations to global insights with zennit, corelay, and virelay. arXiv preprint arXiv:2106.13200.
  • Arras et al., (2022) Arras, L., Osman, A., and Samek, W. (2022). Clevr-xai: A benchmark dataset for the ground truth evaluation of neural network explanations. Information Fusion, 81:14–40.
  • Ba et al., (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
  • Bach et al., (2015) Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140.
  • Balduzzi et al., (2017) Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma, K. W.-D., and McWilliams, B. (2017). The shattered gradients problem: If resnets are the answer, then what is the question? In International Conference on Machine Learning, pages 342–350. PMLR.
  • Binder et al., (2016) Binder, A., Montavon, G., Lapuschkin, S., Müller, K.-R., and Samek, W. (2016). Layer-wise relevance propagation for neural networks with local renormalization layers. In Artificial Neural Networks and Machine Learning–ICANN 2016: 25th International Conference on Artificial Neural Networks, Barcelona, Spain, September 6-9, 2016, Proceedings, Part II 25, pages 63–71. Springer.
  • Blücher et al., (2024) Blücher, S., Vielhaben, J., and Strodthoff, N. (2024). Decoupling pixel flipping and occlusion strategy for consistent xai benchmarks. arXiv preprint arXiv:2401.06654.
  • Brocki and Chung, (2023) Brocki, L. and Chung, N. C. (2023). Feature perturbation augmentation for reliable evaluation of importance estimators in neural networks. Pattern Recognition Letters, 176:131–139.
  • Caron et al., (2021) Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660.
  • Chang et al., (2018) Chang, C.-H., Creager, E., Goldenberg, A., and Duvenaud, D. (2018). Explaining image classifiers by counterfactual generation. arXiv preprint arXiv:1807.08024.
  • Chefer et al., (2021a) Chefer, H., Gur, S., and Wolf, L. (2021a). Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 397–406.
  • Chefer et al., (2021b) Chefer, H., Gur, S., and Wolf, L. (2021b). Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 782–791.
  • Chen et al., (2016) Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  • Chung et al., (2022) Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  • Clark et al., (2019) Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). What does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286.
  • Dai et al., (2022) Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. (2022). Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502.
  • Dao et al., (2022) Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  • Deb et al., (2023) Deb, M., Deiseroth, B., Weinbach, S., Schramowski, P., and Kersting, K. (2023). Atman: Understanding transformer predictions through memory efficient attention manipulation. arXiv preprint arXiv:2301.08110.
  • Deng et al., (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
  • Dettmers et al., (2024) Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2024). Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
  • Ding et al., (2017) Ding, Y., Liu, Y., Luan, H., and Sun, M. (2017). Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1150–1159.
  • Dombrowski et al., (2022) Dombrowski, A.-K., Anders, C. J., Müller, K.-R., and Kessel, P. (2022). Towards robust explanations for deep neural networks. Pattern Recognition, 121:108194.
  • Dosovitskiy et al., (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR.
  • Fatima et al., (2008) Fatima, S. S., Wooldridge, M., and Jennings, N. R. (2008). A linear approximation method for the shapley value. Artificial Intelligence, 172(14):1673–1699.
  • Fedus et al., (2022) Fedus, W., Dean, J., and Zoph, B. (2022). A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667.
  • Fong and Vedaldi, (2017) Fong, R. C. and Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. In IEEE International Conference on Computer Vision (ICCV), pages 3449–3457.
  • Fryer et al., (2021) Fryer, D., Strümke, I., and Nguyen, H. (2021). Shapley values for feature selection: The good, the bad, and the axioms. IEEE Access, 9:144352–144360.
  • Geva et al., (2022) Geva, M., Caciularu, A., Wang, K., and Goldberg, Y. (2022). Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45.
  • Geva et al., (2021) Geva, M., Schuster, R., Berant, J., and Levy, O. (2021). Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495.
  • Gildenblat, (2023) Gildenblat, J. (2020. Accessed on Dec 01, 2023). Exploring explainability for vision transformers. https://jacobgil.github.io/deeplearning/vision-transformer-explainability.
  • Guidotti et al., (2018) Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., and Giannotti, F. (2018). Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820.
  • Hedström et al., (2023) Hedström, A., Weber, L., Krakowczyk, D., Bareeva, D., Motzkus, F., Samek, W., Lapuschkin, S., and Höhne, M. M. M. (2023). Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond. Journal of Machine Learning Research, 24(34):1–11.
  • Huang et al., (2023) Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
  • Jiang et al., (2024) Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. (2024). Mixtral of experts. arXiv preprint arXiv:2401.04088.
  • Kokhlikyan et al., (2020) Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., Melnikov, A., Kliushkina, N., Araya, C., Yan, S., et al. (2020). Captum: A unified and generic model interpretability library for pytorch. arXiv preprint arXiv:2009.07896.
  • Li et al., (2023) Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., and Lee, Y. T. (2023). Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.
  • Lundberg and Lee, (2017) Lundberg, S. M. and Lee, S. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774.
  • Maas et al., (2011) Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150.
  • Mao et al., (2021) Mao, C., Jiang, L., Dehghani, M., Vondrick, C., Sukthankar, R., and Essa, I. (2021). Discrete representations strengthen vision transformer robustness. In International Conference on Learning Representations.
  • Miglani et al., (2023) Miglani, V., Yang, A., Markosyan, A., Garcia-Olano, D., and Kokhlikyan, N. (2023). Using captum to explain generative language models. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 165–173.
  • Montavon et al., (2019) Montavon, G., Binder, A., Lapuschkin, S., Samek, W., and Müller, K.-R. (2019). Layer-wise relevance propagation: an overview. Explainable AI: interpreting, explaining and visualizing deep learning, pages 193–209.
  • Montavon et al., (2017) Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. (2017). Explaining nonlinear classification decisions with deep taylor decomposition. Pattern recognition, 65:211–222.
  • Nguyen et al., (2016) Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., and Clune, J. (2016). Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in neural information processing systems, 29.
  • Pahde et al., (2023) Pahde, F., Yolcu, G. Ü., Binder, A., Samek, W., and Lapuschkin, S. (2023). Optimizing explanations by network canonization and hyperparameter search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3818–3827.
  • Paszke et al., (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
  • Rajpurkar et al., (2018) Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
  • Ribeiro et al., (2016) Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ”why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.
  • Samek et al., (2017) Samek, W., Binder, A., Montavon, G., Lapuschkin, S., and Müller, K.-R. (2017). Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673.
  • Scheepers, (2017) Scheepers, T. (2017). Improving the compositionality of word embeddings. Master’s thesis, Universiteit van Amsterdam, Science Park 904, Amsterdam, Netherlands.
  • Selvaraju et al., (2017) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626.
  • Shaham et al., (2023) Shaham, U., Ivgi, M., Efrat, A., Berant, J., and Levy, O. (2023). Zeroscrolls: A zero-shot benchmark for long text understanding. arXiv preprint arXiv:2305.14196.
  • Shrikumar et al., (2017) Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145–3153. PMLR.
  • Simonyan et al., (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Deep inside convolutional networks: visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR). ICLR.
  • Smilkov et al., (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
  • Sundararajan et al., (2017) Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR.
  • Touvron et al., (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  • Voita et al., (2023) Voita, E., Ferrando, J., and Nalmpantis, C. (2023). Neurons in large language models: Dead, n-gram, positional. arXiv preprint arXiv:2309.04827.
  • Voita et al., (2021) Voita, E., Sennrich, R., and Titov, I. (2021). Analyzing the source and target contributions to predictions in neural machine translation. In 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP, pages 1126–1140.
  • Wiegreffe and Pinter, (2019) Wiegreffe, S. and Pinter, Y. (2019). Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20.
  • Wikimedia Foundation, (2023) Wikimedia Foundation (2023. Accessed on Dec 01, 2023). Wikimedia downloads. https://dumps.wikimedia.org.
  • Wolf et al., (2019) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2019). Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Zeiler and Fergus, (2014) Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference Computer Vision - ECCV 2014, pages 818–833.
  • Zhang and Sennrich, (2019) Zhang, B. and Sennrich, R. (2019). Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.

Appendix

Appendix A Appendix I: Methodological Details

This appendix provides further details on the methods presented in the paper. In particular, we focus on the AttnLRP method, provide implementation details, discuss the stability of the bias term, highlight the differences to other LRP variants, discuss the noise problem in Vision Transformers and illustrate the effects of temperature scaling on attributing the softmax function. Finally, we provide proofs for the four propositions presented in the main paper.

A.1 Details on Baseline Methods

In the following, we present an overview of the baseline methods and their hyperparameter choices.

A.1.1 Input × Gradient

Gradients are one of the most straightforward approaches to depict how sensitive the trained model is with respect to each individual given feature (traditionally of the input space). By weighting the gradient with the input features, the model is locally linearized Simonyan et al., (2014):

$$\text{I}\times\text{G}(\textbf{x})=\frac{\partial f_{c}(\textbf{x})}{\partial\textbf{x}}\times\textbf{x} \tag{20}$$

Due to the gradient shattering effect Balduzzi et al., (2017), a known phenomenon especially in ReLU-based CNNs, heatmaps generated by I×G are very noisy, which in many cases makes them not meaningful.
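As an illustration of Eq. (20), the following minimal sketch computes I×G for a tiny two-layer ReLU network with an analytic gradient. The network, its random weights and the helper names `relu_mlp` and `input_x_gradient` are assumptions made for this example only and are not part of the evaluated models.

```python
import numpy as np

def relu_mlp(x, W1, b1, W2, b2):
    """Toy two-layer ReLU network (hypothetical); returns all class logits."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

def input_x_gradient(x, W1, b1, W2, b2, c):
    """I x G attribution of logit c with respect to the input x."""
    pre = W1 @ x + b1
    gate = (pre > 0).astype(x.dtype)      # derivative of the ReLU
    grad = W1.T @ (gate * W2[c])          # d f_c / d x, computed analytically
    return grad * x                       # weight the gradient with the input

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)
x = rng.normal(size=4)
attr = input_x_gradient(x, W1, b1, W2, b2, c=1)
```

Because all biases are zero in this toy setting, the network is locally linear without an offset, so the attributions sum to the explained logit `relu_mlp(x, W1, b1, W2, b2)[1]`.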

A.1.2 Integrated Gradients

To tackle the noisiness of I×G, integrating gradients along a trajectory has been proposed. Here, the gradients at $m$ points interpolated between a baseline $\textbf{x}^{\prime}$ and the input $\textbf{x}$ are integrated as Sundararajan et al., (2017):

$$\begin{split}\text{IG}(\textbf{x})&=(\textbf{x}-\textbf{x}^{\prime})\int_{\alpha=0}^{1}\frac{\partial f_{j}(\textbf{x}^{\prime}+\alpha\times(\textbf{x}-\textbf{x}^{\prime}))}{\partial\textbf{x}}\,d\alpha\\ &\approx(\textbf{x}-\textbf{x}^{\prime})\sum_{k=1}^{m}\frac{\partial f_{j}(\textbf{x}^{\prime}+\frac{k}{m}\times(\textbf{x}-\textbf{x}^{\prime}))}{\partial\textbf{x}}\times\frac{1}{m}\end{split} \tag{21}$$

We utilize zennit Anders et al., (2021) with its default settings to compute Integrated Gradients attribution maps, i.e., $m=20$.
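The Riemann-sum approximation in Eq. (21) can be sketched as follows. This is a plain NumPy illustration on a hypothetical toy function with a zero baseline, not the zennit implementation; the names `grad_f` and `integrated_gradients` are assumptions for this example.

```python
import numpy as np

def grad_f(x):
    """Analytic gradient of the hypothetical toy model f(x) = sum(x**2)."""
    return 2.0 * x

def integrated_gradients(x, baseline, grad_fn, m=20):
    """Approximate Eq. (21) with m equally spaced interpolation steps."""
    total = np.zeros_like(x)
    for k in range(1, m + 1):
        point = baseline + (k / m) * (x - baseline)   # x' + k/m * (x - x')
        total += grad_fn(point)
    return (x - baseline) * total / m

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(x, baseline, grad_f, m=20)
```

For this quadratic toy function the sum evaluates to $x_i^2\,(m+1)/m$ per feature, so with $m=20$ the attributions overshoot the exact integral $x_i^2$ by 5%; a larger $m$ reduces this discretization error.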

A.1.3 SmoothGrad

A different technique for reducing noisy gradients is to smooth them Smilkov et al., (2017) by generating $m$ samples in the neighborhood of the input $\textbf{x}$ as $\textbf{x}_{\varepsilon}=\textbf{x}+\mathcal{N}(\mu,\sigma^{2})$ and computing the average of all gradients:

$$\text{SmoothGrad}(\textbf{x})=\frac{1}{m}\sum_{1}^{m}\frac{\partial f_{j}(\textbf{x}_{\varepsilon})}{\partial\textbf{x}_{\varepsilon}} \tag{22}$$

In this work, we set $\mu=0$ and perform a hyperparameter search for $\sigma$ to find the optimal value. We utilize zennit Anders et al., (2021) with its default settings to compute SmoothGrad attribution maps, i.e., $m=20$.
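A minimal sketch of Eq. (22) is given below, again on a hypothetical toy function with an analytic gradient rather than with zennit. The function names, the seed and the value of `sigma` are illustrative assumptions.

```python
import numpy as np

def grad_f(x):
    """Analytic gradient of the hypothetical toy model f(x) = sum(x**3)."""
    return 3.0 * x ** 2

def smoothgrad(x, grad_fn, sigma, m=20, mu=0.0, seed=0):
    """Average the gradient over m noisy copies x + N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + rng.normal(mu, sigma, size=x.shape)) for _ in range(m)]
    return np.mean(grads, axis=0)

x = np.array([1.0, -2.0, 0.5])
sg = smoothgrad(x, grad_f, sigma=0.1)
```

Setting `sigma=0.0` recovers the plain gradient, which makes the role of $\sigma$ as the only searched hyperparameter explicit.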

A.1.4 Attention Rollout

Self-attention rollout Abnar and Zuidema, (2020) capitalizes on the intrinsic nature of the attention weights matrix $\textbf{A}\in\mathbb{R}^{b\times s_{q}\times s_{k}}$ as a representative measure of token importance. It generates an $s_{q}\times s_{k}$ matrix where each row is normalized to form a probability distribution, representing the importance of each query token to all key tokens. The attention scores are averaged along the head dimension:

$$\bar{\textbf{A}}=\mathbb{E}_{b}[\textbf{A}]$$

where $\mathbb{E}_{b}$ denotes the expectation along the head dimension $b$ of the attention map. To compute the relevance of hidden layer tokens ($h$) to the original input tokens ($i$), an iterative multiplication of the attention matrices on the left side is sufficient. Hence, the key dimension represents the inputs and the query dimension the outputs. To account for the residual connection through which the information of the previous tokens flows, an identity matrix $\textbf{I}$ is added:

$$\textbf{R}^{h,i}_{k}=(\textbf{I}+\bar{\textbf{A}}^{h,h})\cdot\textbf{R}^{h,i}_{k-1} \tag{23}$$

where $k=1$ corresponds to the input layer and $\textbf{R}^{h,i}_{0}$ is initialized with the identity matrix $\textbf{I}$, $h$ denotes the hidden feature space, and $i$ stands for the input dimension.
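The recursion in Eq. (23) can be sketched in a few lines. The random attention maps, the `softmax` helper and the function name `attention_rollout` are assumptions for illustration; in practice the attention maps would be read from the model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_rollout(attentions):
    """attentions: list of (b, s, s) arrays, ordered from the first to the last layer."""
    s = attentions[0].shape[-1]
    rollout = np.eye(s)                           # R_0 is the identity matrix
    for A in attentions:
        A_bar = A.mean(axis=0)                    # average over the head dimension b
        rollout = (np.eye(s) + A_bar) @ rollout   # left-multiplication as in Eq. (23)
    return rollout

rng = np.random.default_rng(0)
layers = [softmax(rng.normal(size=(4, 5, 5))) for _ in range(3)]  # 3 layers, 4 heads, 5 tokens
R = attention_rollout(layers)
```

Row `q` of `R` then scores each input token for output token `q`. In this unnormalized form every row of `R` sums to $2^{L}$ for $L$ layers, so rows are comparable within one model but not across depths unless they are rescaled.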

Chefer et al., 2021a build upon self-attention rollout and weight the attention matrix with its gradient. Additionally, the weighted attention map is denoised by keeping only its positive values before averaging over the heads:

$$\bar{\textbf{A}}=\mathbb{E}_{b}[(\nabla\textbf{A}\odot\textbf{A})^{+}]$$

For encoder-decoder models, Chefer et al., 2021a present several additional considerations that are not mentioned here for brevity.

Gildenblat, (2023) notes that the rollout attributions can be further improved by discarding outlier values. For that, we define a discard threshold $dt\in[0,1]$ used to compute the quantile $Q(dt)$, where $dt$ represents the proportion of the data below the quantile, i.e., with cumulative distribution function $P(\bar{\textbf{A}}\leq Q(dt))=dt$.

$$\bar{\textbf{A}}_{m,n}=\begin{cases}0&\text{if }\bar{\textbf{A}}_{m,n}>Q(dt)\\ \bar{\textbf{A}}_{m,n}&\text{otherwise}\end{cases}$$
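The thresholding rule above can be sketched as follows. We assume here that the quantile is taken over all entries of the head-averaged map, which the text does not specify; the function name and the example matrix are illustrative.

```python
import numpy as np

def discard_above_quantile(A_bar, dt):
    """Set every entry of A_bar that exceeds the dt-quantile Q(dt) to zero."""
    q = np.quantile(A_bar, dt)                 # Q(dt) over all entries (assumption)
    return np.where(A_bar > q, 0.0, A_bar)

A_bar = np.array([[0.10, 0.20, 0.70],
                  [0.30, 0.30, 0.40],
                  [0.05, 0.05, 0.90]])
filtered = discard_above_quantile(A_bar, dt=0.9)
```

With `dt=1.0` the quantile equals the maximum entry, so nothing is discarded; lowering `dt` removes progressively more of the largest entries.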

A.1.5 AtMan

AtMan Deb et al., (2023) perturbs the pre-softmax activations along the $k$-dimension:

$$\begin{aligned}\textbf{H}&=\textbf{Q}\cdot\textbf{K}^{\top}\\ \tilde{\textbf{H}}&=\textbf{H}\odot(\textbf{1}-\textbf{p}^{i})\end{aligned}$$

where $\textbf{H}\in\mathbb{R}^{b\times s_{q}\times s_{k}}$, and $\textbf{1}\in[1]^{b\times s_{q}\times s_{k}}$ is a matrix containing only ones. $\textbf{p}^{i}$ denotes a matrix $\in\mathbb{R}^{b\times s_{q}\times s_{k}}$ with

$$\textbf{p}_{lmn}^{i}=\begin{cases}p&\text{for }n=i\\ 0&\text{for }n\neq i\end{cases}$$

Thus, for a single token $i\in\{1,2,...,N\}$, we suppress all values along the column/key-dimension with a suppression factor $p$. The suppression factor is a hyperparameter that must be tuned to the dataset and model. For ViTs, additional cosine similarities are computed to suppress correlated tokens as detailed in Deb et al., (2023). For that, an additional hyperparameter, denoted as $t$ for threshold, must be optimized for ViTs only.
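The suppression step can be sketched as below. This only illustrates the perturbation of the pre-softmax scores for one key token; the tensor shapes, the `softmax` helper and the function name `atman_attention` are assumptions, and the cosine-similarity extension for ViTs is not covered.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def atman_attention(Q, K, i, p):
    """Q: (b, s_q, d), K: (b, s_k, d). Suppress key token i by factor p before the softmax."""
    H = Q @ K.transpose(0, 2, 1)     # pre-softmax scores, shape (b, s_q, s_k)
    P = np.zeros_like(H)
    P[:, :, i] = p                   # p^i: p in key column i, 0 elsewhere
    H_tilde = H * (1.0 - P)          # H ⊙ (1 - p^i)
    return softmax(H_tilde, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3, 4))
K = rng.normal(size=(2, 5, 4))
A_pert = atman_attention(Q, K, i=1, p=0.9)
```

With `p=0.0` the unperturbed attention weights are recovered, which gives a convenient sanity check when tuning the suppression factor.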

A.1.6 KernelSHAP

LIME computes attributions by fitting an additive surrogate model Ribeiro et al., (2016). KernelSHAP Lundberg and Lee, (2017) is a special case of LIME that sets the loss function, weighting kernel and regularization terms of LIME such that LIME recovers Shapley values. Hence, KernelSHAP in theory allows Shapley values to be obtained more efficiently than by computing them directly.

To apply KernelSHAP in the vision domain, we divide the input image into $N$ super-pixels using the Simple Linear Iterative Clustering (SLIC) algorithm Achanta et al., (2012). We use captum Kokhlikyan et al., (2020) with its default settings to compute the attributions, i.e., the number of samples per attribution is set to 2000 and the baseline value to 0.5. A baseline value of 0 resulted in lower faithfulness. For SLIC, we set $N=100$ with compactness set to 10.
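To make the connection between the weighted surrogate fit and Shapley values concrete, the sketch below enumerates all $2^{N}$ coalitions of a tiny toy model instead of sampling them. It is not the captum implementation: the function name, the toy model and the large finite weight that stands in for the infinite kernel weight of the empty and full coalitions are assumptions for this example.

```python
import itertools
import math
import numpy as np

def kernel_shap_exact(f, x, baseline, big_weight=1e6):
    """Weighted least-squares fit of an additive surrogate over all coalitions."""
    n = len(x)
    rows, targets, weights = [], [], []
    for z in itertools.product([0, 1], repeat=n):
        z = np.array(z, dtype=float)
        size = int(z.sum())
        if size == 0 or size == n:
            w = big_weight                      # stands in for the infinite kernel weight
        else:
            w = (n - 1) / (math.comb(n, size) * size * (n - size))  # Shapley kernel
        masked = np.where(z == 1, x, baseline)  # absent features take the baseline value
        rows.append(np.concatenate(([1.0], z))) # intercept column + coalition indicator
        targets.append(f(masked))
        weights.append(w)
    sw = np.sqrt(np.array(weights))
    coef, *_ = np.linalg.lstsq(np.array(rows) * sw[:, None],
                               np.array(targets) * sw, rcond=None)
    return coef[0], coef[1:]                    # (base value, attributions)

toy_model = lambda v: v[0] * v[1] + v[2]        # hypothetical model with one interaction
x = np.array([1.0, 2.0, 3.0])
phi0, phi = kernel_shap_exact(toy_model, x, baseline=np.zeros(3))
```

For this toy model the interaction term $x_0x_1=2$ is split equally between the first two features, so the attributions are approximately $(1, 1, 3)$ with a base value near zero. Full enumeration is only feasible for small $N$, which is why sampling is used in practice.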

A.2 Details on AttnLRP

This section provides more details on AttnLRP and justifies the specific parameter choices made in our work (e.g., use of $\gamma$-LRP in Vision Transformers).

A.2.1 Conservation & Numerical Stability of Bias Terms

The total relevance $R_{j}$ of a layer output, i.e. of the linearized function $f_{j}(\textbf{x})=\sum_{i}\textbf{J}_{ji}\,x_{i}+\tilde{b}_{j}$, is computed by summing the contributions of the input variables $R_{i\leftarrow j}$, represented by $\textbf{J}_{ji}\,x_{i}$, and adding the contribution of the bias term $R_{b\leftarrow j}$, represented by $\tilde{b}_{j}$:

$$R_{j}=\sum_{i}R_{i\leftarrow j}+R_{b\leftarrow j},$$

The relevance of the input variables is solely determined by the input variables themselves

$$R_{i}=\sum_{j}R_{i\leftarrow j}=\sum_{j}\textbf{J}_{ji}\,x_{i}\frac{R_{j}}{f_{j}(\textbf{x})},$$

while the relevance of the bias term itself is calculated as

$$R_{b\leftarrow j}=\tilde{b}_{j}\frac{R_{j}}{f_{j}(\textbf{x})}.$$
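The decomposition above can be checked numerically. The following sketch uses small hand-picked values for the Jacobian, input, bias and incoming relevance; all numbers and variable names are hypothetical.

```python
import numpy as np

J = np.array([[1.0, -2.0, 0.5],
              [0.3,  0.8, -1.0]])     # Jacobian J_ji of the linearization
x = np.array([2.0, 1.0, -1.0])        # input variables x_i
b = np.array([1.5, -0.2])             # bias terms b~_j of the linearization
R_out = np.array([1.0, 2.0])          # relevance R_j arriving at the outputs

f = J @ x + b                         # linearized outputs f_j(x)
scale = R_out / f                     # R_j / f_j(x)
R_in = ((J * x[None, :]) * scale[:, None]).sum(axis=0)   # R_i = sum_j J_ji x_i R_j / f_j
R_bias = b * scale                    # R_{b<-j} = b~_j R_j / f_j
```

Here `R_in.sum() + R_bias.sum()` equals `R_out.sum()`, whereas `R_in.sum()` alone does not: the bias terms absorb part of the relevance, so the inputs by themselves are not conservative.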

If we want to compute the relevance of the input variables $R_{i}$ while ensuring strict adherence to the conservation property (3), we must exclude the bias term, so that it does not absorb part of the relevance. In the literature, we find two common practices: either distributing the bias term uniformly on the input variables Binder et al., (2016); Voita et al., (2021) or applying the identity rule Ding et al., (2017); Chefer et al., 2021b. Both approaches can lead to severe numerical instabilities in specific cases that are challenging to identify. Therefore, we explain the issue in greater detail.

Remark A.2.1 Enforcing strict conservation (3) on a function, where

$$\exists i,j\in\mathbb{N}:x_{i}=0\land f_{j}(\textbf{x})\neq 0,$$

by distributing the relevance of the bias term of its linearization uniformly on the input variables or applying the identity rule with $i=j$ may lead to numerical instabilities.

Distributing the bias term: We can distribute the relevance value of the bias term uniformly across the input variables by assuming that the bias term is part of the input variables:

$$R_{j}^{l}=\sum_{i}\tilde{R}_{i\leftarrow j}^{(l-1,l)}=\sum_{i}^{N}\left(R_{i\leftarrow j}^{(l-1,l)}+\frac{R_{b\leftarrow j}^{(l-1,l)}}{N}\right),$$

where $N$ represents the number of input variables. Hence,

$$R_{i}^{l}=\sum_{j}\tilde{R}_{i\leftarrow j}^{(l-1,l)}=\sum_{j}\left(\textbf{J}_{ji}\,x_{i}+\frac{\tilde{b}_{j}}{N}\right)\frac{R_{j}^{l}}{f_{j}(\textbf{x})}.$$

However, this can cause numerical instabilities, which only become visible in the next relevance propagation step at the prior layer, not yet at this layer. For example, in the softmax function we may encounter a situation where $\exists\, i, j \in \mathbb{N}$ with $x_i^{l-1} = 0$ but $f_j^l(\mathbf{x}) > 0$. If a non-zero relevance value from layer $l$ is assigned to $f_j^l(\mathbf{x})$, then part of it is propagated to the input variable $x_i^{l-1}$ through the relevance message:

$$\tilde{R}_{i\leftarrow j}^{(l-1,l)} = \frac{\tilde{b}_j}{N}\, \frac{R_j^{l}}{f_j^{l}(\mathbf{x}) + \varepsilon}$$

Assuming we apply the $\varepsilon$-rule in succession, the relevance message in the prior layer is given by:

$$R_{k\leftarrow i}^{(l-2,l-1)} = \mathbf{J}_{ik}\, x_k^{l-2}\, \frac{R_i^{l-1}}{0 + \varepsilon}$$

Here, we divide by $x_i^{l-1} = f_i^{l-1} = 0$, so only $\varepsilon$ remains in the denominator. Since $\varepsilon$ is very small, this term explodes and causes numerical instabilities. Hence, assigning a non-zero relevance value to an input that equals zero leads to numerical instabilities. Note that these instabilities would not occur if $R_i^{l-1} = 0$, e.g. if $R_{i\leftarrow j}^{(l-1,l)} = 0$. Functions with $f(0) = 0$ do not encounter this issue, because zero output activations do not receive any relevance from the following layers, i.e. $R_j^{l} = 0$, under all rules described in this paper.
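The blow-up can be made concrete with a minimal numerical sketch. The helper `epsilon_message` and all numbers below are hypothetical toy values chosen for illustration; they are not taken from the paper or its library.

```python
def epsilon_message(jacobian, x, relevance_out, activation_out, eps=1e-6):
    """Relevance message J * x * R / (f + eps) of the epsilon-rule.

    Hypothetical helper for illustration; all arguments are scalars.
    """
    return jacobian * x * relevance_out / (activation_out + eps)


# Assumed toy values: neuron i in layer l-1 is exactly zero, but it received
# a small non-zero relevance share from the uniformly distributed bias term.
relevance_i = 1e-3   # R_i^{l-1}
activation_i = 0.0   # x_i^{l-1} = f_i^{l-1}

# Message sent further down to input k in layer l-2: the denominator is eps only.
exploding = epsilon_message(jacobian=0.5, x=2.0,
                            relevance_out=relevance_i,
                            activation_out=activation_i)

# If the zero activation had received zero relevance, nothing would explode.
harmless = epsilon_message(jacobian=0.5, x=2.0,
                           relevance_out=0.0,
                           activation_out=activation_i)

print(exploding, harmless)
```

With these values the message is amplified by roughly a factor of $1/\varepsilon$ relative to $R_i^{l-1}$, and such amplification can compound across layers.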

Applying the identity rule: Alternatively, we can apply the identity rule as follows:

$$R_i^{l-1} = R_i^{l}$$

Here, numerical instabilities might also arise in the subsequent relevance propagation, not at this layer. By the same reasoning as before, the identity rule propagates a non-zero relevance value to an input variable that is zero.

Omitting the bias term: For the sake of completeness, we mention that omitting the bias term entirely is also an option. In this case, the relevance propagation equation is:

$$R_i^{l-1} = \sum_j \mathbf{J}_{ji}\, x_i\, \frac{R_j^{l}}{\sum_i \mathbf{J}_{ji}\, x_i + \varepsilon}$$

Here, we no longer divide by the original function $f_j(\mathbf{x})$, but by its linearization without the bias term. However, it is important to ensure that no sign flips occur, as $\sum_i \mathbf{J}_{ji}\, x_i$ might have a different sign than $f_j(\mathbf{x})$.

Remark A.2.2 Enforcing strict conservation (3) by omitting the bias term of a linearization (3.1.1) can lead to sign flips in the relevance scores.
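A small sketch of how such a sign flip can arise for the softmax, using the softmax Jacobian $\mathbf{J}_{ji} = s_j(\delta_{ij} - s_i)$ from Appendix A.3.1. The function names and the toy logits are assumptions made for illustration only.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def linear_part(x):
    """sum_i J_ji x_i of the softmax linearization at x, for every output j.

    With J_ji = s_j (delta_ij - s_i), the sum equals
    s_j * (x_j - sum_i s_i x_i).
    """
    s = softmax(x)
    mean = sum(s_i * x_i for s_i, x_i in zip(s, x))
    return [s_j * (x_j - mean) for s_j, x_j in zip(s, x)]


x = [2.0, 1.0, 0.0]          # assumed toy logits
outputs = softmax(x)          # f_j(x), always positive
bias_free = linear_part(x)    # denominator used when the bias is omitted

for f_j, lin_j in zip(outputs, bias_free):
    print(f"f_j = {f_j:.3f}   sum_i J_ji x_i = {lin_j:+.3f}")
```

For every output whose logit lies below the softmax-weighted mean of the logits, the bias-free denominator is negative although $f_j(\mathbf{x}) > 0$, so the relevance passed through that output changes sign.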

Summary: In summary, applying the identity rule, distributing its relevance value uniformly across the input variables or omitting the bias term completely are possible approaches, but they have their considerations and potential challenges. Regarding the softmax non-linearity, Voita et al., (2021) distributes the bias term equally on all input variables, while Ding et al., (2017); Chefer et al., 2021b apply the element-wise identity rule (9). Both variants can lead to severe numerical instabilities.

A.2.2 Highlighting the difference between various LRP methods

In Table A.3, we illustrate the different strategies employed for LRP in the past.

Softmax: Voita et al., (2021) linearizes at x but distributes the bias term equally on all input variables, while Ding et al., (2017); Chefer et al., 2021b apply the element-wise identity rule (9). More specifically, Ding et al., (2017) did not discuss the softmax function explicitly, but they skip all non-linear activation functions. Therefore, we assume that Ding et al., (2017) applies the identity rule also to the softmax function. Both variants enforce a strict notion of the conservation principle (3), but can lead to severe numerical instabilities as discussed in Appendix A.2.1. Ali et al., (2022) regards the attention matrix A in Equation (11) as constant, attributing relevance solely through the value path by stopping the relevance flow through the softmax. Consequently, the query and key matrices can no longer be attributed, which reduces the faithfulness and makes latent explanations in query and key matrices infeasible. Finally, AttnLRP linearizes at x with a bias term that absorbs part of the relevance. The presence of a bias term in AttnLRP is justified because the softmax function yields a value of 1/N even when the input is zero. This is analogous to a bias term and is necessary to account for this behavior. This ensures not only numerically robust attributions, but also improves the faithfulness considerably.

In Figure A.4, we illustrate attribution maps for all four options of handling the softmax function. The given passage is from the Wikipedia article on Mount Everest. The model is expected to answer the question ‘How high did they climb in 1922?’. The attribution is computed for the correctly predicted next token ‘3’ of the answer ‘According to the text, the 1922 expedition reached 8,’ by initializing the relevance at the predicted token with its logit value.

While the relevance values for AttnLRP and CP-LRP lie within $[-4, 4]$, distributing the bias uniformly over the input variables or applying the identity rule leads to an explosion of the relevance values to the range $[-10^{15}, 10^{15}]$. As a consequence, the heatmaps resemble random noise. AttnLRP highlights the correct token the strongest, while CP-LRP focuses strongly on the start-of-sequence <s> token and exhibits more background noise: irrelevant tokens such as ‘Context’, ‘attracts’ and ‘Everest’ are highlighted, whereas AttnLRP does not highlight them or assigns them negative relevance. In Appendix B.7, we also compare other baseline methods. Note that the model attends to numerous tokens within the text, which enables it to derive conclusions. Consequently, an attribution that reflects the model behavior will highlight more than just the single accurate answer token. The faithfulness experiments in Table 1 demonstrate that AttnLRP captures the model reasoning most accurately.

Matrix Multiplication: Applying the $\varepsilon$-rule (8) to bi-linear matrix multiplication (11) violates the conservation property (3), as proved in Appendix A.3.5. To the best of our knowledge, Ding et al., (2017) applies the standard $\varepsilon$-rule. Voita et al., (2021) utilizes the $z^{+}$-rule (25), which, like the $\varepsilon$-rule, violates the conservation property (3) in bi-linear matrix multiplication (the proof in Appendix A.3.5 is also valid for the $z^{+}$-rule). While Chefer et al., 2021b also applies the $\varepsilon$-rule, an additional normalization step is performed by dividing both arguments by the sum of their absolute values. This ensures conservation but does not conform to the DTD framework. Ding et al., (2017); Chefer et al., 2021b set the $\varepsilon$ parameter to $0$, which may increase numerical instabilities. Hence, we call their LRP variants in Table A.3 $0$-LRP.

Since Ali et al., (2022) regards the softmax output as constant and does not propagate relevance through it, the matrix multiplication is no longer bi-linear, but becomes linear. Hence, the application of the $\varepsilon$-rule does not violate the conservation principle and attributes only the value path. Ali et al., (2022) sets the $\varepsilon$ parameter to zero, hence we call their LRP variant in Table A.3 $0$-LRP.

Finally, AttnLRP also applies the $\varepsilon$-rule to bi-linear matrix multiplication. In addition, a novel uniform rule (14) derived from the DTD framework is incorporated, which ensures conservation and high faithfulness.

Layer Normalization: The works Chefer et al., 2021b ; Ali et al., (2022) and AttnLRP apply the identity rule to normalization functions (18), while using the $\varepsilon$-rule on all linear components of LayerNorm, if applicable. More specifically, Ali et al., (2022) proposes to regard $g(\mathbf{x})$ in (18) as a constant, which transforms the normalization operation, and hence the complete LayerNorm layer, into a linear layer, on which the $\varepsilon$-rule is applied. However, this is similar to applying the identity rule (9) to the normalization itself, because it becomes element-wise with a single input and output variable (see Section 3.2.2), while applying the $\varepsilon$-rule to all other linear components of LayerNorm. Voita et al., (2021) linearizes at $\mathbf{x}$ and distributes the bias term equally over all input variables, which can lead to numerical instabilities as discussed in Appendix A.2.1.

Vision Transformer: The studies by Ding et al., (2017); Voita et al., (2021); Ali et al., (2022) concentrate on attribution in natural language processing (NLP) models and do not address vision transformers. Their methodologies, as demonstrated in Table 1 (and Appendix A.2.1), fail when implementing the $\varepsilon$-rule, leading to gradient shattering and low faithfulness. To mitigate noisy attributions, Chefer et al., 2021b suggests applying an attention rollout procedure Abnar and Zuidema, (2020) on top of LRP, which is additionally enhanced via gradient weighting. This yields an approximation of the mean squared relevance value, which diverges from the originally defined notion of “relevance” or “importance” of additive explanatory models. Subsequent empirical observations by Chefer et al., 2021a revealed that omitting LRP-inspired relevances and relying solely on a positive mean-weighting of the attention’s activation with the gradient improved the faithfulness. However, this approach can only attribute positively, does not consider counteracting evidence, and does not allow attributing latent neurons outside the attention’s softmax output. AttnLRP, in contrast, adopts the $\gamma$-rule instead of the $\varepsilon$-rule in linear layers, achieving highly faithful attributions without the need for a rollout mechanism. However, the $\gamma$ parameter must be tuned to the model and dataset to obtain optimal attributions, as discussed in Appendix A.2.3.

Table A.3: Conceptual differences between various LRP methods and their implications. “Layer Normalization” refers here only to the normalization function (18) itself and not to the learnable parameters of LayerNorm or RMSNorm.

| Methods | Softmax | Matrix Multiplication | Layer Normalization |
| --- | --- | --- | --- |
| Ding et al., (2017) | Identity rule ⇒ unstable (Appendix A.2.1) | $0$-LRP (bi-linear) ⇒ violates conservation | not available |
| Voita et al., (2021) | Taylor decomposition at $\mathbf{x}$ (distributes the bias uniformly) ⇒ unstable (Appendix A.2.1) | $z^{+}$-LRP (bi-linear) ⇒ violates conservation | Taylor decomposition at $\mathbf{x}$ (distributes the bias uniformly) ⇒ unstable (Appendix A.2.1) |
| Chefer et al., 2021b | Identity rule ⇒ unstable (Appendix A.2.1) | $0$-LRP & post-hoc normalization (bi-linear) ⇒ ensures conservation | Identity rule ⇒ ensures conservation & faithful |
| Ali et al., (2022) | Regarded as constant ⇒ stable & no attribution inside attention module | $0$-LRP (linear only) ⇒ ensures conservation | Identity rule ⇒ ensures conservation & faithful |
| AttnLRP | Taylor decomposition at $\mathbf{x}$ (with bias) ⇒ stable & faithful | $\varepsilon$-LRP & uniform rule (bi-linear) ⇒ ensures conservation & faithful | Identity rule ⇒ ensures conservation & faithful |

A.2.3 Tackling Noise in Vision Transformers

Since backpropagation-based attributions utilize the gradient, they may produce noisy attributions in models with many layers, where gradient shattering and noisy gradients appear Balduzzi et al., (2017); Dombrowski et al., (2022). Hence, various adaptations of the $\varepsilon$-LRP rule were developed to strengthen the signal-to-noise ratio by dampening counteracting activations Bach et al., (2015); Montavon et al., (2019). Here, we use the generalized $\gamma$-rule that encompasses all other proposed rules in the literature Montavon et al., (2019). Let $z_{ij}$ be the contribution of input $i$ to output $j$, e.g. $\mathbf{W}_{ji}\, x_i$, and $z_j$ the neuron output activation. Then, depending on the sign of $z_j$:

$$R^{(l,\,l+1)}_{i\leftarrow j} = \begin{cases} \dfrac{z_{ij} + \gamma z_{ij}^{+}}{z_j + \gamma \sum_k z_{kj}^{+}}\, R_j^{l+1} & \text{if } z_j > 0 \\[2ex] \dfrac{z_{ij} + \gamma z_{ij}^{-}}{z_j + \gamma \sum_k z_{kj}^{-}}\, R_j^{l+1} & \text{else} \end{cases} \tag{24}$$

with $\gamma \in \mathbb{R}^{>0}$, $(\cdot)^{+} = \max(\cdot\,, 0)$ and $(\cdot)^{-} = \min(\cdot\,, 0)$. If $\gamma = \infty$, it is equivalent to the LRP $z^{+}$-rule, which is given as

$$R^{(l,\,l+1)}_{i\leftarrow j} = \frac{(w_{ij}\, x_i)^{+}}{z_j^{+}}\, R_j^{l+1} \tag{25}$$

by only taking into account positive contributions $z_j^{+} = \sum_i (w_{ij}\, x_i)^{+}$ with $(\cdot)^{+} = \max(0, \cdot)$.
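As a rough illustration of rules (24) and (25), the following sketch implements both for a bias-free linear layer in plain Python. The function names, the weight layout `w[j][i]` and the toy numbers are assumptions for illustration; no $\varepsilon$ stabilizer is added, so the denominators are assumed to be non-zero.

```python
def gamma_rule(w, x, relevance_out, gamma):
    """Gamma-rule (24) for a bias-free linear layer z_j = sum_i w[j][i] * x[i].

    Returns the input relevances R_i = sum_j R_{i<-j}.
    """
    r_in = [0.0] * len(x)
    for j, row in enumerate(w):
        z = [w_ji * x_i for w_ji, x_i in zip(row, x)]   # contributions z_ij
        z_j = sum(z)
        if z_j > 0:
            part = [max(v, 0.0) for v in z]              # positive parts
        else:
            part = [min(v, 0.0) for v in z]              # negative parts
        denom = z_j + gamma * sum(part)
        for i in range(len(x)):
            r_in[i] += (z[i] + gamma * part[i]) / denom * relevance_out[j]
    return r_in


def z_plus_rule(w, x, relevance_out):
    """z+-rule (25): only positive contributions receive relevance."""
    r_in = [0.0] * len(x)
    for j, row in enumerate(w):
        pos = [max(w_ji * x_i, 0.0) for w_ji, x_i in zip(row, x)]
        z_j_plus = sum(pos)
        for i in range(len(x)):
            r_in[i] += pos[i] / z_j_plus * relevance_out[j]
    return r_in


weights = [[1.0, -0.5, 2.0]]   # one output neuron, three inputs (toy values)
inputs = [1.0, 1.0, 1.0]
r_out = [1.0]

r_small = gamma_rule(weights, inputs, r_out, gamma=0.25)
r_large = gamma_rule(weights, inputs, r_out, gamma=1e9)
r_zplus = z_plus_rule(weights, inputs, r_out)

print(r_small, sum(r_small))   # relevance is conserved for any gamma
print(r_large, r_zplus)        # a very large gamma approaches the z+-rule
```

In this toy case the input relevances sum to the output relevance for both values of $\gamma$, and increasing $\gamma$ progressively suppresses the negative contribution until the result coincides with the $z^{+}$-rule.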

Remarkably, our observations reveal that attributions in LLMs demonstrate high sparsity and lack visible noise, while ViTs are susceptible to gradient shattering. We hypothesize that the discrete nature of the text domain may affect robustness Mao et al., (2021). Therefore, we only apply the $\gamma$-rule in ViTs, in the convolutional and linear FFN layers outside the attention module. To further increase the faithfulness, the $\gamma$-rule can also be applied to softmax layers. Since the output of the softmax is always greater than zero, we apply the simplified $z^{+}$-rule (a special case of the $\gamma$-rule). The $z^{+}$-rule applied to a linearization (3.1.1) of the softmax results in:

$$R_i^{l-1} = \sum_j (\mathbf{J}_{ji}\, x_i)^{+}\, \frac{R_j^{l}}{\sum_k (\mathbf{J}_{jk}\, x_k)^{+} + \tilde{b}_j^{+}} \tag{26}$$

This formula is computationally more expensive to evaluate than the original rule for softmax derived in Proposition 3.1. Should efficiency be a priority, it is recommended to bypass the softmax layer as proposed in CP-LRP, which prevents relevance from passing through the softmax function and reduces gradient shattering caused by this layer. Then, for all other components of the model, AttnLRP rules are recommended, with the application of the $\gamma$-rule to linear layers only. This is especially true given that the discrepancy in faithfulness between AttnLRP and $\gamma$CP-LRP is minimal for the standard vision architectures Dosovitskiy et al., (2021) evaluated in Table 1 and Table B.6, which incorporate only standard attention and FFN layers.

A.2.4 Impact of Temperature Scaling on the Softmax Rule

Temperature scaling controls the entropy of the softmax probability distribution, thereby influencing the predictability of subsequent next-token predictions at the classification output. A high temperature value tends to flatten the softmax output distribution (more randomness), whereas a small temperature parameter sharpens the distribution (less randomness). This scaling is done by dividing the input $\mathbf{x}$ by the temperature $T \in \mathbb{R}$ prior to applying the softmax function.

$$s_j(\mathbf{x}) = \frac{e^{x_j / T}}{\sum_i e^{x_i / T}}$$

Recall that the derivative of the softmax function has two cases, which depend on the output index $j$ and the input index $i$:

$$\frac{\partial s_j}{\partial x_i} = \begin{cases} s_j (1 - s_j) & \text{for } i = j \\ -s_j s_i & \text{for } i \neq j \end{cases}$$

In the scenario where $s_j \approx 1$, the derivative for $i = j$ vanishes. This occurs e.g. for extremely low temperature values or exceptionally high confidence in the model’s classification output. This poses an issue for the Deep Taylor Decomposition Montavon et al., (2017) derived in Section 3.1.1, because DTD decomposes the softmax function by utilizing the gradient (Jacobian) term $\mathbf{J}_{ji}\, x_i$ for calculating attributions. If the gradient vanishes, the bias term will capture all the relevance, stopping the relevance flow altogether. This effect is also generally described by Shrikumar et al., (2017).
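The saturation effect can be checked numerically with a short sketch. The function names and the toy logits are assumptions for illustration; the derivative is taken with respect to the softmax argument, as in the case distinction above, and the additional factor $1/T$ that arises when differentiating with respect to the unscaled input is left out.

```python
import math


def softmax_with_temperature(x, temperature):
    """Softmax of x / temperature, computed in a numerically stable way."""
    scaled = [v / temperature for v in x]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def diagonal_derivative(s):
    """Diagonal Jacobian entries s_j * (1 - s_j) of the softmax."""
    return [s_j * (1.0 - s_j) for s_j in s]


logits = [2.0, 1.0, 0.0]   # assumed toy logits
results = {}
for temperature in (5.0, 1.0, 0.05):
    s = softmax_with_temperature(logits, temperature)
    results[temperature] = (s, diagonal_derivative(s))
    print(f"T = {temperature}: s_0 = {s[0]:.6f}, "
          f"ds_0/dx_0 = {results[temperature][1][0]:.2e}")
```

As the temperature decreases, the largest probability approaches one and its diagonal derivative shrinks towards zero, which is the situation in which the bias term of the linearization absorbs the relevance.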

Within the attention mechanism, this limitation is circumvented by multiplying the softmax output with the value path, ensuring that relevance is transmitted via the uniform rule to the value path, akin to CP-LRP (refer to Appendix A.2.2). However, in instances where the softmax function is utilized independently, this becomes problematic as the relevance flow could be distorted.

To see this, consider Proposition 3.1 (13): $R^{l-1}_i$ might be zero if $R^{l}_j = 0\ \forall j \neq i$ with $s_i = 1$:

$$R^{l-1}_i = x_i \left( R^{l}_i - s_i \sum_j R^{l}_j \right) = x_i \left( R^{l}_i - R^{l}_i \right) = 0$$

Therefore, we suggest using an increased temperature value when explaining the softmax classification output, to prevent the softmax from saturating, i.e. the gradient from vanishing. Nonetheless, the attribution of the classification output has not been investigated in this work (the softmax layer is always removed and LRP is only applied to the logit outputs). An analysis of these effects remains an interesting topic for future work.
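The following sketch applies the rule of Proposition 3.1 to a standalone softmax at two temperatures. It assumes that the rule is evaluated on the temperature-scaled input $x_i / T$ and that all output relevance sits on the winning class; the function names and numbers are hypothetical.

```python
import math


def softmax(u):
    """Numerically stable softmax over a list of floats."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_relevance(u, relevance_out):
    """Proposition 3.1: R_i^{l-1} = u_i * (R_i^l - s_i * sum_j R_j^l)."""
    s = softmax(u)
    total = sum(relevance_out)
    return [u_i * (r_i - s_i * total)
            for u_i, r_i, s_i in zip(u, relevance_out, s)]


logits = [8.0, 1.0, 0.0]   # assumed toy logits
r_out = [1.0, 0.0, 0.0]    # relevance only on the winning output


def propagate(temperature):
    scaled = [v / temperature for v in logits]   # assumption: rule acts on x / T
    return softmax_relevance(scaled, r_out)


r_cold = propagate(0.1)   # saturated softmax, s_0 is numerically 1
r_warm = propagate(4.0)   # flatter distribution

print("T = 0.1:", r_cold)
print("T = 4.0:", r_warm)
```

At the low temperature the relevance arriving at the inputs is numerically zero, while the higher temperature lets a non-zero share pass through, in line with the suggestion above.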

A.3 Proofs

In the following, we provide proofs for the rules presented in the main paper, and show that the application of the $\varepsilon$-rule to bi-linear matrix multiplication violates the conservation property.

A.3.1 Proposition 3.1: Decomposing Softmax

In this subsection, we demonstrate the decomposition of the softmax function by linearizing it (3.1.1) at $\mathbf{x}$. We begin by considering the softmax function:

$$s_j(\mathbf{x}) = \frac{e^{x_j}}{\sum_i e^{x_i}}$$

The derivative of the softmax has two cases, which depend on the output index $j$ and the input index $i$:

$$\frac{\partial s_j}{\partial x_i} = \begin{cases} s_j (1 - s_j) & \text{for } i = j \\ -s_j s_i & \text{for } i \neq j \end{cases}$$

Consequently, a Taylor decomposition (3.1.1) yields:

$$f_j(\mathbf{x}) = s_j \left( x_j - \sum_i s_i x_i \right) + \tilde{b}_j$$
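The intermediate step can be spelled out as follows, assuming the linearization in (3.1.1) has the form $f_j(\mathbf{x}) = \sum_i \mathbf{J}_{ji}\, x_i + \tilde{b}_j$ with $\mathbf{J}_{ji} = \partial s_j / \partial x_i$, as used in Appendix A.2.1. Inserting the two cases of the derivative gives

$$\sum_i \mathbf{J}_{ji}\, x_i = s_j (1 - s_j)\, x_j - \sum_{i \neq j} s_j s_i x_i = s_j x_j - s_j \sum_i s_i x_i = s_j \left( x_j - \sum_i s_i x_i \right),$$

where the term $-s_j^2 x_j$ from the diagonal case has been absorbed into the sum over all $i$.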

We differentiate between two cases, namely (i) when we attribute relevance from output j𝑗jitalic_j to input ij𝑖𝑗i\neq jitalic_i ≠ italic_j and (ii) when we attribute from output j𝑗jitalic_j to input i=j𝑖𝑗i=jitalic_i = italic_j.

Rij(l1,l)={(xisixi)Rilfor i=jsixiRjlfor ijsuperscriptsubscript𝑅𝑖𝑗𝑙1𝑙casessubscript𝑥𝑖subscript𝑠𝑖subscript𝑥𝑖subscriptsuperscript𝑅𝑙𝑖for 𝑖𝑗subscript𝑠𝑖subscript𝑥𝑖subscriptsuperscript𝑅𝑙𝑗for 𝑖𝑗R_{i\leftarrow j}^{(l-1,l)}=\begin{cases}(x_{i}-s_{i}x_{i})\ R^{l}_{i}&\text{% for }i=j\\ -s_{i}x_{i}\ R^{l}_{j}&\text{for }i\neq j\end{cases}italic_R start_POSTSUBSCRIPT italic_i ← italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l - 1 , italic_l ) end_POSTSUPERSCRIPT = { start_ROW start_CELL ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_R start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_CELL start_CELL for italic_i = italic_j end_CELL end_ROW start_ROW start_CELL - italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_R start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL start_CELL for italic_i ≠ italic_j end_CELL end_ROW

Applying equation (5), we obtain:

$$R^{l-1}_i = \sum_j R_{i\leftarrow j}^{(l-1,l)} = x_i\left(R^l_i - s_i \sum_j R^l_j\right)$$

In Appendix A.2.4, we discuss the implications of vanishing gradients and temperature scaling on attributing the softmax function.
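As an illustrative sketch (not the authors' library code), the closed-form softmax rule above can be checked against the explicit sum of the individual relevance messages $R_{i\leftarrow j}^{(l-1,l)}$. The function names and the example vectors are assumptions chosen for demonstration.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability; the result is unchanged.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def softmax_lrp(x, relevance_out):
    """Closed form: R_i^{l-1} = x_i * (R_i^l - s_i * sum_j R_j^l)."""
    s = softmax(x)
    return x * (relevance_out - s * relevance_out.sum())

def softmax_lrp_messages(x, relevance_out):
    """Same quantity, obtained by summing the messages R_{i<-j} over j."""
    s = softmax(x)
    n = x.size
    relevance_in = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                relevance_in[i] += (x[i] - s[i] * x[i]) * relevance_out[i]
            else:
                relevance_in[i] += -s[i] * x[i] * relevance_out[j]
    return relevance_in

# Hypothetical inputs and incoming relevance.
x = np.array([1.0, 2.0, 0.5])
relevance_out = np.array([0.2, 0.5, 0.3])
closed_form = softmax_lrp(x, relevance_out)
summed_messages = softmax_lrp_messages(x, relevance_out)
print(closed_form, summed_messages)
```

Both routes give the same vector, which is why the rule can be evaluated with a single vectorized expression instead of a double loop.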

A.3.2 Proposition 3.2: Decomposing Multiplication

The aim of this subsection is to decompose the multiplication of $N$ input variables.

$$f_j(\mathbf{x}) = \prod_i^N x_{ji}$$

We start by performing a Taylor decomposition (3.1.1), then we derive the same decomposition with Shapley.

Taylor decomposition: The derivative is

$$\frac{\partial f_j}{\partial x_{ji}} = \prod_{k \neq i}^N x_{jk}$$

Consequently, a Taylor decomposition (3.1.1) at x yields

$$f_j(\mathbf{x}) = \sum_i^N \frac{\partial f_j}{\partial x_{ji}} x_{ji} + \tilde{b}_j = N \prod_k^N x_{jk} + \tilde{b}_j = N f_j(\mathbf{x}) + \tilde{b}_j$$

We can either omit the bias term or equally distribute it on the input variables to strictly enforce the conservation property (3). Here, we demonstrate how to distribute the bias term uniformly.

$$R_{ji\leftarrow j}^{(l-1,l)} = \left(f_j(\mathbf{x}) + \frac{\tilde{b}_j}{N}\right)\frac{R_j^l}{N f_j(\mathbf{x}) + \tilde{b}_j} = \frac{1}{N} R_j^l$$

Since each input with index $ji$ at layer $l-1$ is only connected to one output with index $j$ at layer $l$, we have only a single relevance propagation message. Hence, it follows from Equation (2):

$$R_{ji}^{l-1} = R_{ji\leftarrow j}^{(l-1,l)} = \frac{1}{N} R_j^l$$

To omit the bias term, repeat the proof with $\tilde{b}_j = 0$.
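The Taylor step can be illustrated numerically. The following sketch, with hypothetical helper names and example values, shows that every gradient-times-input term of a product equals the product itself, so the terms sum to $N f_j(\mathbf{x})$, and that the resulting uniform rule conserves relevance.

```python
import numpy as np

def gradient_times_input(x):
    """Return the terms (df/dx_i) * x_i for f(x) = prod_i x_i."""
    grads = np.array([np.prod(np.delete(x, i)) for i in range(x.size)])
    return grads * x

def uniform_rule(relevance_out, num_inputs):
    """Split the output relevance equally among the N factors."""
    return np.full(num_inputs, relevance_out / num_inputs)

# Hypothetical factors of a single product.
x = np.array([2.0, -3.0, 0.5])
terms = gradient_times_input(x)           # each term equals prod(x)
relevance_in = uniform_rule(1.0, x.size)  # each factor receives 1/N
print(terms, relevance_in)
```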

Shapley: The Shapley value Lundberg and Lee, (2017) is defined as:

$$\phi_i(f) = \sum_{\substack{S \subseteq N\\ i \notin S}} \frac{|S|!\,(N - |S| - 1)!}{N!}\left(f(S \cup \{i\}) - f(S)\right) \tag{27}$$

where $\phi_i(f)$ is the Shapley value of the feature $i$ under the value function $f$. $N$ denotes the set of all features, and $S$ denotes a feature subset (coalition).

With respect to multiplication, zero is the absorbing element. Hence, we choose zero as our baseline value, and the Shapley value function becomes:

$$\begin{aligned} f(S \cup \{i\}) &= \prod_k x_k\\ f(S) &= 0\\ f(S \cup \{i\}) - f(S) &= \prod_k x_k \end{aligned}$$

The symmetry theorem Fryer et al., (2021) of Shapley states that the contributions of two feature values $i$ and $l$ should be the same if they contribute equally to all possible coalitions

$$f(S \cup \{i\}) = f(S \cup \{l\}) \quad \forall S \subseteq \{1, 2, \dots, N\} \setminus \{i, l\}$$

then $\phi_i(f) = \phi_l(f)$. In addition, the efficiency theorem Fryer et al., (2021) states that the output contribution is distributed completely among all features. Hence, the output contribution is equal to the sum of the Shapley values of all features $i$,

$$\sum_i \phi_i(f) = f(N)$$

Both theorems are applicable and hence it follows:

$$\phi_i(f) = \frac{1}{N} f(N)$$

In the case of LRP, we identify $f(N)$ as $R_j^l$ and $\phi_i(f)$ as $R_{ji}^{l-1}$.
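The Shapley argument can also be verified by brute force. The sketch below enumerates all coalitions for a small product with a zero baseline, following Equation (27). The function names and the example vector are hypothetical, and the exhaustive enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial, prod

def product_with_zero_baseline(coalition, x):
    """Value function: features outside the coalition are set to the baseline 0."""
    return prod(x[k] if k in coalition else 0.0 for k in range(len(x)))

def shapley_values(x, value_fn):
    """Exact Shapley values by enumerating every coalition S that excludes i."""
    n = len(x)
    values = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                gain = value_fn(coalition | {i}, x) - value_fn(coalition, x)
                phi += weight * gain
        values.append(phi)
    return values

# Hypothetical factors; every Shapley value should equal prod(x) / N.
x = [2.0, -3.0, 0.5]
phis = shapley_values(x, product_with_zero_baseline)
print(phis, prod(x) / len(x))
```

Only the coalition containing all other features yields a non-zero marginal gain, which is why each feature receives the same share.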

A.3.3 Proposition 3.3: Decomposing bi-linear Matrix Multiplication

Consider the equation for matrix multiplication, where we treat the terms as single input variables by substituting them with $\mathbf{u}_{jip} = \mathbf{A}_{ji}\mathbf{V}_{ip}$:

$$\mathbf{O}_{jp} = \sum_i \mathbf{A}_{ji}\mathbf{V}_{ip} = \sum_i \mathbf{u}_{jip}$$

In this case, the function already is in the form of an additive decomposition (1). Therefore,

$$R_{jip\leftarrow jp}^{(l-1,l)} \propto \mathbf{u}_{jip}$$

This can also be seen by noticing that $\sum_i \mathbf{u}_{jip}$ is a linear operation, since a single variable is left. Hence, it can be regarded as a linear layer (6) with constant weights of one and a bias of zero. We already derived the solution to applying a linearization (3.1.1) to a linear layer: the $\varepsilon$-rule (8). Therefore, the solution is:

$$R_{jip\leftarrow jp}^{(l-1,l)} = \mathbf{u}_{jip}\frac{R_{jp}^l}{\mathbf{O}_{jp} + \varepsilon} = \mathbf{A}_{ji}\mathbf{V}_{ip}\frac{R_{jp}^l}{\mathbf{O}_{jp} + \varepsilon}$$

Since each input with index $jip$ at layer $l-1$ is only connected to one output with index $jp$ at layer $l$, we have only a single relevance propagation message. Hence, it follows from Equation (2):

$$R_{jip}^{l-1} = R_{jip\leftarrow jp}^{(l-1,l)}$$

Next, we decompose the individual terms $\mathbf{u}_{jip}$ using the uniform rule from the previous Section A.3.2 to obtain relevance messages for $\mathbf{A}_{ji}$:

$$R_{ji\leftarrow jip}^{(l-1,l-1)} = \frac{1}{2} R_{jip}^{l-1}$$

Each input $\mathbf{A}_{ji}$ is connected to $p$ outputs $\mathbf{u}_{jip}$. Hence, to obtain the relevance values attributed to $\mathbf{A}_{ji}$, we must aggregate all relevance messages from output $jip$ to inputs $ji$ via Equation (2):

$$\begin{aligned} R_{ji}^{l-1} &= \sum_p R_{ji\leftarrow jip}^{(l-1,l-1)} = \sum_p \frac{1}{2} R_{jip}^{l-1}\\ R_{ji}^{l-1} &= \sum_p \mathbf{A}_{ji}\mathbf{V}_{ip}\frac{R_{jp}^l}{2\,(\mathbf{O}_{jp} + \varepsilon)} \end{aligned}$$

Because $\varepsilon \ll |\mathbf{O}_{jp}|$, we simplify the final solution:

$$R_{ji}^{l-1} = \sum_p \mathbf{A}_{ji}\mathbf{V}_{ip}\frac{R_{jp}^l}{2\,\mathbf{O}_{jp} + \varepsilon}$$

The proof for $\mathbf{V}_{ip}$ follows a similar approach by summing over the $j$ indices instead of $p$. In Appendix A.3.5, we prove that this rule does not violate the conservation property (3), in contrast to the standard $\varepsilon$-rule.
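A minimal NumPy sketch of the rule derived in this subsection is given below. It is not the authors' implementation: the function name, the default `eps`, and the random test matrices are assumptions, and the matrices are kept strictly positive so that the unsigned stabilizer $\varepsilon$ never approaches a zero crossing of $\mathbf{O}_{jp}$.

```python
import numpy as np

def matmul_lrp(A, V, R_out, eps=1e-9):
    """Relevance for both operands of O = A @ V.

    R_A[j, i] = sum_p A[j, i] * V[i, p] * R_out[j, p] / (2 * O[j, p] + eps)
    R_V[i, p] = sum_j A[j, i] * V[i, p] * R_out[j, p] / (2 * O[j, p] + eps)
    """
    O = A @ V
    ratio = R_out / (2.0 * O + eps)   # shape (J, P)
    R_A = A * (ratio @ V.T)           # sum over p, shape (J, I)
    R_V = V * (A.T @ ratio)           # sum over j, shape (I, P)
    return R_A, R_V

# Hypothetical positive operands and incoming relevance.
rng = np.random.default_rng(0)
A = rng.random((3, 4)) + 0.1
V = rng.random((4, 5)) + 0.1
R_out = rng.random((3, 5))
R_A, R_V = matmul_lrp(A, V, R_out)
print(R_A.sum() + R_V.sum(), R_out.sum())
```

The two printed totals agree up to the small share absorbed by `eps`, which reflects the conservation property.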

A.3.4 Proposition 3.4: Layer Normalization

Consider layer normalization of the form

$$f_j(\mathbf{x}) = \frac{x_j}{g(\mathbf{x})}$$

where $g(\mathbf{x}) = \sqrt{\text{Var}[\mathbf{x}] + \varepsilon}$ or $g(\mathbf{x}) = \sqrt{\frac{1}{N}\sum_k x_k^2 + \varepsilon}$. The derivative is

$$\frac{\partial f_j}{\partial x_i} = \frac{1}{g(\mathbf{x})^2}\begin{cases} g(\mathbf{x}) - x_j \frac{\partial g(\mathbf{x})}{\partial x_i} & \text{for } i = j\\ -x_j \frac{\partial g(\mathbf{x})}{\partial x_i} & \text{for } i \neq j \end{cases} \tag{28}$$

In LayerNorm Ba et al., (2016), we assume for simplicity that $\mathbb{E}[\mathbf{x}] = 0$; the variance and its partial derivative then simplify to

$$\begin{aligned} \mathbb{V}[\mathbf{x}] &= \mathbb{E}[\mathbf{x}^2] - \mathbb{E}[\mathbf{x}]^2 = \mathbb{E}[\mathbf{x}^2]\\ \frac{\partial \mathbb{V}[\mathbf{x}]}{\partial x_i} &= \frac{2}{N} x_i \end{aligned}$$

Further, the partial derivative of RMSNorm Zhang and Sennrich, (2019) is

$$\frac{\partial\, \text{RMSNorm}}{\partial x_i} = \frac{x_i}{\sqrt{N \sum_k x_k^2}}$$

At reference point $\tilde{x}_i = 0$, the off-diagonal elements ($i \neq j$) in Equation (28) are zero, yielding the Taylor decomposition:

$$f_i(\mathbf{x}) = \frac{\partial f_i}{\partial x_i}\Bigr|_{\tilde{x}_i = 0} x_i + \tilde{b}_i = \frac{x_i}{\varepsilon} + \tilde{b}_i$$

To enforce a strict notion of the conservation property (3), the bias term $\tilde{b}_i$ can be excluded or evenly distributed across the input variables. Because we have only a single input variable, the bias can be considered as part of $x_i$.

$$R_i^{l-1} = \left(\frac{x_i}{\varepsilon} + \tilde{b}_i\right)\frac{R_i^l}{\frac{x_i}{\varepsilon} + \tilde{b}_i} \tag{29}$$

Since there is only one input variable and one output, the decomposition is equivalent to the identity function, as discussed in Section 3.2.2 about component-wise non-linearities. Thus, we conclude that the identity rule applies in this case.

$$R_i^{l-1} = R_i^l \tag{30}$$

Note that this rule is numerically stable because $f_j(\mathbf{0}) = 0$, as discussed in Section A.2.1.
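The structure of the Jacobian in Equation (28) can be inspected with a small sketch for the RMSNorm variant of $g(\mathbf{x})$. The helper names, the default `eps`, and the relevance vector are assumptions for illustration. At the reference point zero only the diagonal survives, which is what motivates the identity rule (30).

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """f_j(x) = x_j / g(x) with g(x) = sqrt(mean(x^2) + eps)."""
    return x / np.sqrt(np.mean(x**2) + eps)

def rmsnorm_jacobian(x, eps=1e-6):
    """J[j, i] = d f_j / d x_i, following the case distinction of Equation (28)."""
    n = x.size
    g = np.sqrt(np.mean(x**2) + eps)
    dg = x / (n * g)  # partial derivative of g with respect to x_i
    return (np.eye(n) * g - np.outer(x, dg)) / g**2

def identity_rule(relevance_out):
    """Identity rule (30): relevance passes through the normalization unchanged."""
    return relevance_out.copy()

jacobian_at_zero = rmsnorm_jacobian(np.zeros(4))
off_diagonal = jacobian_at_zero - np.diag(np.diag(jacobian_at_zero))
relevance_in = identity_rule(np.array([0.1, -0.2, 0.4, 0.7]))
print(np.abs(off_diagonal).max(), relevance_in)
```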

A.3.5 Violation of the Conservation Property in bi-linear Matrix Multiplication

In the following, we prove that applying the $\varepsilon$-rule (8) without the uniform rule (14) to bi-linear matrix multiplication violates the conservation property (3). We reiterate and generalize Lemma 3 of Chefer et al., 2021b, which establishes that $0$-LRP ($\varepsilon = 0$) violates conservation.

Recall that matrix multiplication is defined as:

$$\mathbf{O}_{jp} = \sum_i \mathbf{A}_{ji}\mathbf{V}_{ip}$$

The $\varepsilon$-rule is the solution to applying a linearization (3.1.1) to a linear layer (6). For computing relevance values for $\mathbf{A}_{ji}$ using the $\varepsilon$-rule, we treat $\mathbf{V}_{ip}$ as a constant weight matrix with zero bias, and vice versa for attributing $\mathbf{V}_{ip}$. The derived relevance propagation rules are given by:

$$\begin{aligned} \tilde{R}_{ji}^{l-1}(\mathbf{A}_{ji}) &= \sum_p \mathbf{A}_{ji}\mathbf{V}_{ip}\frac{R_{jp}^l}{\mathbf{O}_{jp} + \varepsilon}\\ \tilde{R}_{ip}^{l-1}(\mathbf{V}_{ip}) &= \sum_j \mathbf{A}_{ji}\mathbf{V}_{ip}\frac{R_{jp}^l}{\mathbf{O}_{jp} + \varepsilon} \end{aligned}$$

The conservation property (3) states that the total relevance at layer $l$ must be equal to the total relevance at layer $l-1$.

The total relevance at layer l𝑙litalic_l is given by

$$R^l = \sum_{j,p} R_{jp}^l$$

and the total relevance at layer $l-1$ is computed by:

$$R^{l-1} = \sum_{j,i} \tilde{R}_{ji}^{l-1} + \sum_{i,p} \tilde{R}_{ip}^{l-1} = 2 \sum_{j,p} \frac{\mathbf{O}_{jp}}{\mathbf{O}_{jp} + \varepsilon} R_{jp}^l \approx 2 R^l$$

with $\varepsilon \ll |\mathbf{O}_{jp}|$. This results in a violation of the conservation property as $R^l \neq R^{l-1}$. However, by employing Proposition 3.3 (15), that is, a sequential application of the $\varepsilon$-rule and the uniform rule (14), we ensure conservation by dividing by the factor $2$:

$$R^{l-1} = \sum_{j,i} R_{ji}^{l-1} + \sum_{i,p} R_{ip}^{l-1} = \sum_{j,p} \frac{\mathbf{O}_{jp}}{\mathbf{O}_{jp} + \varepsilon} R_{jp}^l \approx R^l$$

It is evident that $\varepsilon$ absorbs a negligible proportion of the relevance to safeguard numerical stability. The proof is also valid for the $z^{+}$-rule (25), where only positive contributions are taken into consideration.
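The doubling of relevance can be reproduced numerically. The sketch below applies the plain $\varepsilon$-rule to both operands, under the assumptions of strictly positive random matrices and hypothetical function names, and compares the total input relevance with the total output relevance.

```python
import numpy as np

def eps_rule_matmul(A, V, R_out, eps=1e-9):
    """Plain epsilon-rule applied separately to A (V held constant) and to V (A held constant)."""
    O = A @ V
    ratio = R_out / (O + eps)   # shape (J, P)
    R_A = A * (ratio @ V.T)     # sum over p
    R_V = V * (A.T @ ratio)     # sum over j
    return R_A, R_V

# Hypothetical positive operands and incoming relevance.
rng = np.random.default_rng(1)
A = rng.random((3, 4)) + 0.1
V = rng.random((4, 5)) + 0.1
R_out = rng.random((3, 5))

R_A, R_V = eps_rule_matmul(A, V, R_out)
total_eps_rule = R_A.sum() + R_V.sum()   # approximately 2 * R_out.sum()
total_halved = 0.5 * total_eps_rule      # uniform rule applied on top
print(total_eps_rule, total_halved, R_out.sum())
```

Halving each message, as the uniform rule prescribes for a product of two factors, restores the original total.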

Figure A.4: Comparison of four different LRP variants computed on a LLaMa 2-7b model. The given section is from the Wikipedia article on Mount Everest. The model is expected to provide the next answer token for the question ‘How high did they climb in 1922? According to the text, the 1922 expedition reached 8,’. The attribution is computed for the correctly predicted token 3. Distributing the bias uniformly on the input variables (Softmax Distribute Bias) or applying the identity rule (Softmax Identity Rule) leads to numerical instabilities. For “Softmax Distribute Bias” and “Softmax Identity Rule”, we applied AttnLRP rules on all layers except for the softmax function. AttnLRP highlights the correct token the strongest, while CP-LRP focuses strongly on the start-of-sequence <s> token and exhibits more background noise, e.g. irrelevant tokens such as ‘Context’, ‘attracts’, ‘Everest’ are highlighted, while AttnLRP does not highlight them or assigns negative relevance.

Appendix B Appendix II: Experimental Details

In the following sections, we provide additional details about the experiments performed.

B.1 Models and Datasets

For ImageNet faithfulness, we utilized the pretrained Vision Transformer B-16, L-16 and L-32 weights of the PyTorch model zoo Paszke et al., (2019). We randomly selected 3200 samples (fixed set for all baselines) such that the standard error of the mean converges to below 1% of the mean value.

For Wikipedia and IMDB faithfulness, we evaluated the pretrained LLaMa 2-7b hosted on Hugging Face Wolf et al., (2019) on 4000 randomly selected validation dataset samples (fixed set for all baselines). For SQuAD v2, we utilized the pretrained Flan-T5 and Mixtral 8x7b weights hosted on Hugging Face. Further, for Wikipedia next word prediction, we evaluated the model on a context size of 512 (from the beginning of the article until the context length is reached), while the context size in SQuAD v2 varies between 169 and 4060. Although Flan-T5 was trained on a smaller context size of 2000 tokens, its relative positional encoding allows it to handle longer context sizes of at least 8192 tokens Shaham et al., (2023).

All computations are performed in the Brain Floating Point (bfloat16) half-precision format to reduce memory consumption. bfloat16 trades precision for a higher dynamic range than standard float16, and hence prevents numerical errors due to overflow. In this regard, the impact of quantized number formats on (Attn)LRP attributions remains a topic to be investigated. In addition, all linear weights in Mixtral 8x7b are quantized to the 4-bit integer format using bitsandbytes Dettmers et al., (2024), while the computation is still performed in bfloat16.
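The trade-off between precision and dynamic range can be illustrated with a small standard-library sketch. It emulates bfloat16 by truncating the float32 bit pattern to its upper 16 bits (an assumption made for simplicity; real hardware typically rounds instead of truncating) and uses the IEEE half-precision format code of Python's `struct` module for float16. The helper names are hypothetical.

```python
import struct


def to_bfloat16(value: float) -> float:
    """Emulate bfloat16 by keeping only the upper 16 bits of the float32 encoding.

    Truncation is an illustrative simplification; hardware usually rounds.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_float16(value: float) -> bool:
    """Return True if the value can be packed as IEEE float16 without overflow."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


large = 100000.0  # exceeds the largest finite float16 value (65504)
large_is_float16 = fits_float16(large)   # overflow in float16
large_as_bfloat16 = to_bfloat16(large)   # still finite, but coarsely rounded
small_as_bfloat16 = to_bfloat16(1.001)   # the fine detail is lost in bfloat16
```

The large value stays finite in the emulated bfloat16 format, whereas float16 cannot represent it; conversely, the small increment in `1.001` is lost in bfloat16.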

SQuAD v2 encompasses numerous questions that are either unanswerable or subject to incorrect predictions by the model. Consequently, only instances where the model predicts the correct response are considered. Additionally, only the test set is utilized to mitigate a potential overfitting bias during the training phase, if applicable. Finally, for SQuAD v2 top-1 accuracy and IoU, we utilized the following prompt:

Context: [text of dataset sample] Question: [question of dataset sample] Answer:

Flan-T5 does not require a system prompt, while for Mixtral 8x7b we prepend the following system prompt to the context:

Use the context to answer the question. Use few words.
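The prompt construction described above can be sketched as follows. The function name and the single space used to join the system prompt and the context are assumptions; the exact whitespace and any chat template applied in the experiments are not specified here.

```python
SYSTEM_PROMPT = "Use the context to answer the question. Use few words."


def build_prompt(context, question, system_prompt=None):
    """Assemble the SQuAD v2 prompt; joining with single spaces is an assumption."""
    body = f"Context: {context} Question: {question} Answer:"
    if system_prompt:
        return f"{system_prompt} {body}"
    return body


# Flan-T5 style (no system prompt) and Mixtral 8x7b style (with system prompt).
flan_prompt = build_prompt("[text of dataset sample]", "[question of dataset sample]")
mixtral_prompt = build_prompt("[text of dataset sample]", "[question of dataset sample]", SYSTEM_PROMPT)
```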

Because Flan-T5 typically provides the correct answer directly, we explain the first token of the answer only. Conversely, Mixtral 8x7b generates full sentences; within these, we identify the positions of the answer tokens and explain all tokens that constitute the correct answer only. To achieve this, we calculate heatmaps for each answer token and add these heatmaps to produce the final heatmap. For gradient-based methods, this process can be parallelized by initiating the backward pass at the designated token positions with the logit output for LRP and with the value 1 for all other baselines, while initializing the remaining output tokens with zero.
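The parallelization relies on the linearity of the backward pass with respect to its initial signal. The sketch below illustrates this on a hypothetical linear toy model (not a transformer): adding the heatmaps computed per answer token gives the same result as one backward pass seeded at all answer positions at once. The seed is the logit value when `use_logit=True` (as described for LRP) and 1 otherwise; all names and numbers are illustrative assumptions.

```python
def seed_vector(logits, answer_positions, use_logit):
    """Initial backward signal: logit or 1 at answer positions, 0 elsewhere."""
    return [
        (logits[t] if use_logit else 1.0) if t in answer_positions else 0.0
        for t in range(len(logits))
    ]


def backward_heatmap(weights, inputs, seed):
    """Toy backward pass for f(x) = W x: (seed^T W) multiplied element-wise with x."""
    return [
        sum(seed[t] * weights[t][i] for t in range(len(seed))) * inputs[i]
        for i in range(len(inputs))
    ]


# Hypothetical toy model with 3 output positions and 4 input features.
weights = [[1.0, 0.0, 2.0, -1.0],
           [0.5, 1.5, 0.0, 1.0],
           [-1.0, 2.0, 1.0, 0.0]]
inputs = [1.0, 2.0, -1.0, 0.5]
logits = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
answer_positions = {0, 2}

# One heatmap per answer token, added up afterwards.
per_token = [
    backward_heatmap(weights, inputs, seed_vector(logits, {t}, use_logit=True))
    for t in sorted(answer_positions)
]
summed = [sum(values) for values in zip(*per_token)]

# A single backward pass seeded at all answer positions simultaneously.
combined = backward_heatmap(
    weights, inputs, seed_vector(logits, answer_positions, use_logit=True)
)
```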

For IMDB, we added a last linear layer to a frozen LLaMa 2-7b model and finetuned only the last layer, which achieves 93% accuracy on the validation dataset.

If we encountered NaN values for a sample, we removed it from the evaluation. This happened for Grad×AttnRollout and AtMan in the Wikipedia dataset. However, the standard error of the mean remains small, as can be seen in Table 1.
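A minimal sketch of this filtering step, assuming per-sample scores are stored in a plain list: NaN entries are dropped before the mean and the standard error of the mean (sample standard deviation divided by the square root of the number of valid samples) are computed. The function name and the example scores are hypothetical.

```python
import math
import statistics


def mean_and_sem(scores):
    """Drop NaN scores, then return (mean, standard error of the mean, number dropped)."""
    valid = [s for s in scores if not math.isnan(s)]
    mean = statistics.mean(valid)
    sem = statistics.stdev(valid) / math.sqrt(len(valid))
    return mean, sem, len(scores) - len(valid)


# Hypothetical per-sample scores with one failed sample.
mean, sem, dropped = mean_and_sem([1.0, 2.0, float("nan"), 3.0])
```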

B.2 Input Perturbation Metrics

In the following, we summarize the perturbation process introduced by Samek et al., (2017) in a condensed manner.

Consider an attribution map $R_i^{l}(x_i)$ over the input features $\mathbf{x} = \{x_i\}_{i=1}^{N}$ in layer $l$. $\mathcal{H}$ denotes the set of relevance values for all input features $x_i$:

$$\mathcal{H} = (R_0^{0}(x_0), R_1^{0}(x_1), \dots, R_{N-1}^{0}(x_{N-1})) \tag{31}$$

Then, the flipping perturbation process iteratively substitutes input features with a baseline value $\mathbf{b} \in \mathbb{R}^{N}$ (the baseline might be zero, noise generated from a Gaussian distribution, or pixels of a black image in the vision task). The reverse variant, referred to as insertion, begins with a baseline $\mathbf{b}$ and reconstructs the input $\mathbf{x}$ step-wise. The function performing the perturbation is denoted by $\mathbf{g}^{F}$ for flipping and $\mathbf{g}^{I}$ for insertion. The perturbation procedure is conducted either in a MoRF (Most Relevant First) or LeRF (Least Relevant First) manner based on the sorted members of $\mathcal{H}$. Regardless of the replacement function, the MoRF and LeRF perturbation processes can be defined as recursive formulas at step $k \in \{0, 1, \dots, N-1\}$:

$$\text{MoRF Pert. Process} = \begin{cases} \mathbf{x}_{MoRF}^{0} = \mathbf{x} \\ \mathbf{x}_{MoRF}^{k} = \mathbf{g}^{(F|I)}(\mathbf{x}_{MoRF}^{k-1}, \mathbf{b}) \\ \mathbf{x}_{MoRF}^{N-1} = \mathbf{b} \end{cases}$$

where $\mathbf{x}_{MoRF}^{k}$ denotes the perturbed input feature $\mathbf{x}$ at step $k$ in the MoRF process.

$$\text{LeRF Pert. Process} = \begin{cases} \mathbf{x}_{LeRF}^{0} = \mathbf{b} \\ \mathbf{x}_{LeRF}^{k} = \mathbf{g}^{(F|I)}(\mathbf{x}_{LeRF}^{k-1}, \mathbf{b}) \\ \mathbf{x}_{LeRF}^{N-1} = \mathbf{x} \end{cases}$$

where $\mathbf{x}_{LeRF}^{k}$ denotes the perturbed input feature $\mathbf{x}$ at step $k$ in the LeRF process.

The results of these processes are the perturbed input sets $\mathcal{X}^{F}_{MoRF} = (\mathbf{x}_{MoRF}^{0}, \mathbf{x}_{MoRF}^{1}, \dots, \mathbf{x}_{MoRF}^{N-1})$ and $\mathcal{X}^{F}_{LeRF} = (\mathbf{x}_{LeRF}^{0}, \mathbf{x}_{LeRF}^{1}, \dots, \mathbf{x}_{LeRF}^{N-1})$. By feeding these sets to the model and computing the corresponding logit output $f_j$, a curve is induced, and the area $A$ under the curve can be calculated:

$$A^{F}_{MoRF} = A^{I}_{LeRF} = \frac{1}{N} \sum_{k=0}^{N-1} f_j(\mathbf{x}_{MoRF}^{k}) \tag{32}$$

where $\mathbf{x}_{MoRF}^{k} \in \mathcal{X}^{F}_{MoRF}$ or $\mathbf{x}_{MoRF}^{k} \in \mathcal{X}^{I}_{LeRF}$.

It is notable that the area below the least relevant order insertion curve is identical to the area below the most relevant order flipping curve, and that the area below the least relevant order flipping curve is identical to the area below the most relevant order insertion curve. Hence, by using $\mathcal{X}^{I}_{MoRF}$, $A^{I}_{MoRF} = A^{F}_{LeRF}$ can be computed similarly. A faithful explainer results in a low value of $A^{F}_{MoRF}$ or $A^{I}_{LeRF}$. Further, a faithful explainer is expected to have large $A^{F}_{LeRF}$ or $A^{I}_{MoRF}$ values.
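The equivalence between flipping and insertion can be checked on a small example. The sketch below is a simplified illustration with hypothetical names: it perturbs one feature per step and returns $N+1$ states including the unperturbed start (a choice made here for readability), and it assumes distinct relevance values so that the ordering is unambiguous.

```python
def perturbation_sequence(x, relevance, baseline, mode, most_relevant_first):
    """Return the perturbed inputs for flipping (mode 'F') or insertion (mode 'I')."""
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=most_relevant_first)
    # Flipping moves from the input towards the baseline, insertion the other way.
    start, target = (x, baseline) if mode == "F" else (baseline, x)
    current = list(start)
    states = [list(current)]
    for i in order:
        current[i] = target[i]
        states.append(list(current))
    return states


# Toy input, zero baseline and distinct relevance values (illustrative assumptions).
x = [5.0, 7.0, 9.0]
baseline = [0.0, 0.0, 0.0]
relevance = [0.2, 0.7, 0.1]

morf_flipping = perturbation_sequence(x, relevance, baseline, "F", most_relevant_first=True)
lerf_insertion = perturbation_sequence(x, relevance, baseline, "I", most_relevant_first=False)
```

In this toy setting, `lerf_insertion` visits exactly the states of `morf_flipping` in reverse order, so averaging the model output over either sequence gives the same area.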

Finally, to reduce the impact of out-of-distribution manipulations and the sensitivity to the chosen baseline value, Blücher et al., (2024) propose to leverage both insights and to obtain a robust measure as

$$\Delta A^{F} = A^{F}_{LeRF} - A^{F}_{MoRF}$$
$$\Delta A^{I} = A^{I}_{MoRF} - A^{I}_{LeRF}$$

where a higher score signifies a more faithful explainer.
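As a worked illustration of the flipping score $\Delta A^{F}$, the following sketch uses a hypothetical linear toy model in place of a network logit. It averages over $N+1$ curve points (including the unperturbed input), which is an assumption that differs slightly from the normalization in (32); all names are illustrative.

```python
def flipping_curve(model, x, relevance, baseline, most_relevant_first):
    """Model outputs while features are flipped to the baseline one by one."""
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=most_relevant_first)
    current = list(x)
    outputs = [model(current)]
    for i in order:
        current[i] = baseline
        outputs.append(model(current))
    return outputs


def area_under_curve(outputs):
    """Mean of the curve values as a simple area estimate."""
    return sum(outputs) / len(outputs)


def delta_a_flipping(model, x, relevance, baseline=0.0):
    """Area of the LeRF flipping curve minus area of the MoRF flipping curve."""
    lerf = area_under_curve(flipping_curve(model, x, relevance, baseline, False))
    morf = area_under_curve(flipping_curve(model, x, relevance, baseline, True))
    return lerf - morf


# Hypothetical linear toy model standing in for the logit f_j.
weights = [3.0, 1.0, 2.0]


def toy_model(features):
    return sum(w * v for w, v in zip(weights, features))


x = [1.0, 1.0, 1.0]
faithful = [w * v for w, v in zip(weights, x)]  # exact contributions of the toy model
inverted = [-r for r in faithful]               # deliberately wrong ranking

score_faithful = delta_a_flipping(toy_model, x, faithful)
score_inverted = delta_a_flipping(toy_model, x, inverted)
```

The attribution that ranks features by their true contribution receives the higher score, while the inverted ranking receives a negative one.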

We performed all faithfulness perturbations with a baseline value of zero. In the case of LLMs, we aggregated the relevance for each token and flipped the entire embedding vector of input tokens to the baseline value. For ViTs, we used the relevances of the input pixels and flipped input pixels to the baseline value.
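For the LLM case, the token-level procedure can be sketched as below. Summing the relevance over the embedding dimension is an assumption about how the aggregation is done; the helper names and the toy numbers are hypothetical.

```python
def token_relevance(embedding_relevance):
    """Aggregate relevance per token by summing over the embedding dimension (assumption)."""
    return [sum(row) for row in embedding_relevance]


def flip_token(embeddings, token_index, baseline=0.0):
    """Return a copy of the embeddings with one token's whole vector set to the baseline."""
    flipped = [list(row) for row in embeddings]
    flipped[token_index] = [baseline] * len(flipped[token_index])
    return flipped


# Toy example with 2 tokens and 2 embedding dimensions.
embeddings = [[1.0, 2.0], [3.0, 4.0]]
relevance = [[0.1, 0.2], [-0.3, 0.5]]

scores = token_relevance(relevance)
most_relevant = max(range(len(scores)), key=lambda i: scores[i])
perturbed = flip_token(embeddings, most_relevant)
```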

Figure B.5: Comparison of AttnLRP (ours) with the $\gamma$-rule, Grad×AttnRoll Chefer et al., 2021a , AtMan Deb et al., (2023), and SmoothGrad Smilkov et al., (2017) through the perturbation experiment (faithfulness) on ViT-B-16 using 3200 random samples of ImageNet. From left to right, the plots correspond to $f_j(\mathcal{X}^{F}_{LeRF}) - f_j(\mathcal{X}^{F}_{MoRF})$ (large area is good), $f_j(\mathcal{X}^{F}_{MoRF})$ (steep decline is good), and $f_j(\mathcal{X}^{F}_{LeRF})$ (slow decline is good). “AUC” denotes the area under the curve.

B.3 Hyperparameter search for Baselines

As described in Appendix A.1, several baseline attribution methods have hyperparameters that must be tuned to the datasets. The default parameters are described in Appendix A.1; to reduce the search space, we optimize only a subset of the hyperparameters, namely those of SmoothGrad ($\sigma \in [0.01, 0.25]$), AtMan (suppression value $\in [0.1, 1.0]$ and threshold $\in [0, 1.0]$), AttnRoll (discard threshold $\in [0.90, 1.00]$), and G×AttnRoll (discard threshold $\in [0.90, 1.00]$). The hyperparameters used for the perturbation experiments are given in the captions of Tables B.6, B.7, B.8 and B.9.
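A generic grid search over such ranges might look like the sketch below. The grid resolution, the function names and the placeholder scoring function are assumptions for illustration; in practice the score would be a faithfulness measure computed on the dataset, and the search procedure actually used may differ.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate every combination in the grid and return (best score, best parameters)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def placeholder_score(suppression_value, threshold):
    """Stand-in for a faithfulness score; purely illustrative, not a real measurement."""
    return -((suppression_value - 0.5) ** 2) - ((threshold - 0.5) ** 2)


# Grid points inside the AtMan ranges named above (the resolution is an assumption).
atman_grid = {
    "suppression_value": [0.1, 0.5, 1.0],
    "threshold": [0.0, 0.5, 1.0],
}
best_score, best_params = grid_search(placeholder_score, atman_grid)
```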

For LLMs, we have not noticed a significant impact on the heatmaps for different discard threshold values of G×AttnRoll. For AttnRoll, the impact is minimal. Hence, we choose the default value of 1 (nothing is discarded, as proposed in the original works Abnar and Zuidema, (2020); Chefer et al., 2021a ).

Regarding SQuAD v2, we set AtMan’s $p = 0.7$ for Mixtral 8x7b and $p = 0.9$ for Flan-T5-XL. For SmoothGrad, we set $\sigma = 0.1$ for Mixtral 8x7b and Flan-T5-XL.

B.4 Impact of Model Architectural Choices on AttnLRP Performance

We evaluated AttnLRP on three model classes that incorporate different types of layers.

Flan-T5: This encoder-decoder architecture employs self-attention and cross-attention layers (33). The FFN layers are a sequential application of linear layers with GELU non-linearities in between.

LLaMa 2: This decoder architecture utilizes only self-attention layers (33). However, in the FFN layers, there is an additional element-wise non-linear weighting with a SiLU non-linearity (34).

Mixtral 8x7b: This mixture-of-experts model uses self-attention layers (33) and FFN layers with non-linear weighting (34), like LLaMa 2. In addition, there are FFN routing layers with a softmax weighting (35).

$$\text{Attention: } \text{Softmax}(\mathbf{W}_q \mathbf{x} \, (\mathbf{W}_k \mathbf{x})^{\top}) \, \mathbf{W}_v \mathbf{x} \tag{33}$$
$$\text{FFN} \times \text{Non-Linearity: } \text{SiLU}(\mathbf{W}_1 \mathbf{x}) \odot \mathbf{W}_2 \mathbf{x} \tag{34}$$
$$\text{Routing: } \sum_{i} \text{Softmax}(\text{TopK}(\mathbf{W}_g \mathbf{x}))_i \, \text{FFN}_i(\mathbf{x}) \tag{35}$$

where $\mathbf{W}_q$, $\mathbf{W}_k$, $\mathbf{W}_v$, $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{W}_g$ are linear weight parameters, $\odot$ is the element-wise multiplication, $i \in \mathbb{N}$ indexes the expert FFN layers, and TopK returns the top-$k$ elements.
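To make the three layer types concrete, the following dependency-free sketch mirrors the forward computations (33), (34) and (35) as written, for small Python lists. It is a simplified illustration: a single head, no scaling, masking, biases or normalization, tokens stored as row vectors, and experts passed as plain callables. All function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stabilized softmax over a list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def silu(value):
    """SiLU non-linearity: x * sigmoid(x)."""
    return value / (1.0 + math.exp(-value))


def attention(tokens, w_q, w_k, w_v):
    """Eq. (33) for a sequence of token vectors (single head, no scaling or masking)."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    outputs = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
        weights = softmax(scores)
        outputs.append([sum(a * value[d] for a, value in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def gated_ffn(x, w_1, w_2):
    """Eq. (34): SiLU(W_1 x) multiplied element-wise with W_2 x."""
    return [silu(a) * b for a, b in zip(matvec(w_1, x), matvec(w_2, x))]


def routed_ffn(x, w_g, experts, top_k):
    """Eq. (35): a softmax over the top-k router logits weights the selected experts."""
    logits = matvec(w_g, x)
    selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gate = softmax([logits[i] for i in selected])
    expert_outputs = [experts[i](x) for i in selected]
    return [sum(g * out[d] for g, out in zip(gate, expert_outputs))
            for d in range(len(expert_outputs[0]))]
```

Each function contains a product of two input-dependent terms (attention weights and values, gate and FFN path, routing weights and expert outputs), which is the kind of non-linearity the comparison in this section is concerned with.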

In Table B.4, we study the impact of AttnLRP rules w.r.t. CP-LRP on all three different layer types (33), (34) and (35). We start as baseline with all rules of CP-LRP on all layer types, then we successively substitute CP-LRP rules with AttnLRP rules for specific layer types.

For CP-LRP, we use the rules described in Appendix A.2.2. In the original work of Ali et al., (2022), FFN layers weighted with non-linearities and routing layers are not discussed. Analogously to the argumentation in their work, we regard the non-linearity as constant weight and attribute only through the FFN path using the $\varepsilon$-rule.

As demonstrated in Table B.4, the application of AttnLRP rules enhances the performance across all layers, regardless of their type. Moreover, the rate of improvement increases with the number of non-linearities present in the model.

Table B.4: Extension of Table 1. Faithfulness scores as area between the least and most relevant order perturbation curves on LLaMa 2, alongside the top-1 accuracy and IoU (in parentheses) for Flan-T5 and Mixtral 8x7b. We start as baseline with all rules of CP-LRP on all layer types, then we successively substitute CP-LRP rules with AttnLRP rules for specific layer types if they exist in the model. We observe that AttnLRP’s improvement is not confined to the attention mechanism alone, but extends to all layers that contain operations that are not attributable with other LRP variants. The more complex the architecture, the better the performance of AttnLRP compared to CP-LRP.

| Method on Layer | LLaMa 2-7b: IMDB ↑ | LLaMa 2-7b: Wikipedia ↑ | Mixtral 8x7b: SQuAD v2 ↑ | Flan-T5-XL: SQuAD v2 ↑ |
|---|---|---|---|---|
| Baseline (All Layers): CP-LRP | 1.72 | 7.85 | 0.50 (0.40) | 0.90 (0.83) |
| + Attention Mechanism: AttnLRP | 2.09 | 9.49 | 0.70 (0.53) | 0.94 (0.84) |
| + FFN × Non-Linearity: AttnLRP | 2.50 | 10.93 | 0.78 (0.57) | - |
| + Routing Layer: AttnLRP | - | - | 0.96 (0.72) | - |

B.5 LRP Composites for ViT

Applying the $\varepsilon$-rule on all linear layers inside LLMs is sufficient to obtain faithful and noise-free attributions. However, for the vision transformers, we apply the $\gamma$-rule on all linear layers (including the convolutional layers) outside the attention module. Since the $\gamma$-rule has a hyperparameter, Pahde et al., (2023) proposed to tune the parameter using a grid search. This optimization search (known in an LRP context as composite search) is computationally highly demanding.

The vision transformer consists of many linear layers. Our proposed approach is to use different $\gamma$ values across different layer types.

According to Vaswani et al., (2017) the attention module consists of several linear layers which we refer to as LinearInputProjection.

$$\mathbf{Q} = \mathbf{W}_q \mathbf{X} + \mathbf{b}_q$$
$$\mathbf{K} = \mathbf{W}_k \mathbf{X} + \mathbf{b}_k$$
$$\mathbf{V} = \mathbf{W}_v \mathbf{X} + \mathbf{b}_v$$

In the attention layer, after the softmax (11), there exists another linear layer performing the output projection back into the residual stream, denoted as LinearOutputProjection:

$$\mathbf{y} = \mathbf{W}_o \mathbf{O} + \mathbf{b}_o$$

All other linear layers in the network are referred to as Linear.

The perturbation experiment was conducted over these layers using different types of rules including Epsilon, ZPlus, Gamma, and AlphaBeta (with $\alpha = 2$ and $\beta = 1$ according to Montavon et al., (2019)).

The most faithful composite that we obtained for AttnLRP and CP-LRP is given in Table B.5. More details on the statistics of the conducted experiments are available in Figures B.14, B.15, B.16, B.17, B.18.

Table B.5: Proposed composite for the AttnLRP and CP-LRP methods used for the Vision Transformer.

| Layer Type | Proposed Rule |
|---|---|
| Convolution | Gamma($\gamma = 0.25$) |
| Linear | Gamma($\gamma = 0.05$) |
| LinearInputProjection | Epsilon |
| LinearOutputProjection | Epsilon |
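In code, such a composite amounts to a lookup from layer type to propagation rule. The sketch below encodes Table B.5 as a plain dictionary; the data layout and the function name are illustrative assumptions and do not reflect the interface of any particular LRP library.

```python
# Composite from Table B.5; the dictionary layout is an illustrative assumption.
VIT_COMPOSITE = {
    "Convolution": {"rule": "Gamma", "gamma": 0.25},
    "Linear": {"rule": "Gamma", "gamma": 0.05},
    "LinearInputProjection": {"rule": "Epsilon"},
    "LinearOutputProjection": {"rule": "Epsilon"},
}


def rule_for(layer_type):
    """Look up the propagation rule assigned to a layer type."""
    if layer_type not in VIT_COMPOSITE:
        raise KeyError(f"no rule defined for layer type {layer_type!r}")
    return VIT_COMPOSITE[layer_type]


linear_rule = rule_for("Linear")
projection_rule = rule_for("LinearInputProjection")
```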
Figure B.6: Explanation heatmaps of the methods used for the perturbation experiments on the vision transformer ViT-B-16. A checkerboard effect is visible for almost every method, especially in AttnRoll Abnar and Zuidema, (2020), G×AttnRoll Chefer et al., 2021a , and AtMan Deb et al., (2023). We improve upon CP-LRP Ali et al., (2022) by applying the $\gamma$-rule as described in B.5. While qualitatively $\gamma$CP-LRP (with $\gamma$ extension for ViT) and AttnLRP give similar explanations, quantitative results in Table 1 and Table B.6 show a consistent improvement of AttnLRP over $\gamma$CP-LRP in terms of faithfulness. Moreover, attributing query and key linear layers within the attention layer is possible with AttnLRP only, while it is not possible with CP-LRP. We leave these further explorations for future work.

B.6 Additional Perturbation Evaluations on Vision Transformers

Table B.6 presents additional perturbation results for the vision transformers ViT-L-16 and ViT-L-32 evaluated on ImageNet. Our method surpasses all comparative baselines, while the proposed enhancement, $\gamma$CP-LRP (which applies the $\gamma$-rule across all linear layers for CP-LRP Ali et al., (2022)), remains highly competitive. In more complex model architectures that incorporate a greater variety of non-linearities, the advantage of our method is larger, as elaborated in Appendix B.4. Tuning the $\gamma$-parameter for $\gamma$CP-LRP and AttnLRP in a grid search (see Appendix B.5) resulted, for both models, in the same composite described in Table B.5. However, there is no assurance that other models share the same $\gamma$ parameters.

Table B.6: Faithfulness scores as area between the least and most relevant order perturbation curves Blücher et al., (2024) for ViT-L-16 and ViT-L-32 on ImageNet. For ViT-L-16, we set $\sigma = 0.01$ for SmoothGrad, $p = 1.0$ and $t = 0.1$ for AtMan, $dt = 0.90$ for AttnRoll, and $dt = 0.92$ for G×AttnRoll. For ViT-L-32, we set $\sigma = 0.01$ for SmoothGrad, $p = 1.0$ and $t = 0.1$ for AtMan, $dt = 0.95$ for AttnRoll, and $dt = 1.0$ for G×AttnRoll.

| Method | ViT-L-16: ImageNet ↑ | ViT-L-32: ImageNet ↑ |
|---|---|---|
| Random | $0.01 \pm 0.01$ | $0.01 \pm 0.01$ |
| Input×Grad Simonyan et al., (2014) | $1.20 \pm 0.06$ | $0.98 \pm 0.07$ |
| IG Sundararajan et al., (2017) | $0.96 \pm 0.07$ | $1.45 \pm 0.06$ |
| SmoothGrad Smilkov et al., (2017) | $-0.10 \pm 0.01$ | $-0.09 \pm 0.04$ |
| GradCAM Chefer et al., 2021b | $0.19 \pm 0.06$ | $2.21 \pm 0.08$ |
| AttnRoll Abnar and Zuidema, (2020) | $1.41 \pm 0.08$ | $1.90 \pm 0.07$ |
| Grad×AttnRoll Chefer et al., 2021a | $2.86 \pm 0.06$ | $2.69 \pm 0.06$ |
| AtMan Deb et al., (2023) | $1.58 \pm 0.08$ | $0.09 \pm 0.05$ |
| KernelSHAP Lundberg and Lee, (2017) | $4.35 \pm 0.04$ | $4.90 \pm 0.03$ |
| CP-LRP ($\varepsilon$-rule, Ali et al., (2022)) | $4.96 \pm 0.05$ | $4.07 \pm 0.05$ |
| CP-LRP ($\gamma$-rule for ViT, as proposed here) | $6.97 \pm 0.04$ | $5.99 \pm 0.04$ |
| AttnLRP (ours) | $\mathbf{7.17} \pm 0.04$ | $\mathbf{6.06} \pm 0.04$ |

B.7 Attributions on SQuAD v2

In Figure B.7 and Figure B.8, we illustrate attributions on the Mixtral 8x7b for different state-of-the-art methods on the SQuAD v2 dataset. In Figure B.9, we depict attributions for Flan-T5-XL. For comparison, we also visualize a random attribution with Gaussian noise.

The similarity between AttnLRP and CP-LRP in Flan-T5-XL is in line with the quantitative evaluation from Table 1, which shows a small but consistent advantage of AttnLRP over CP-LRP with respect to top-1 accuracy, while in Mixtral 8x7b and LLaMa 2, AttnLRP substantially outperforms CP-LRP, which is also visible in the heatmaps. This is due to the different number of non-linearities present in the models: Flan-T5-XL consists only of standard attention layers, while LLaMa 2 and Mixtral 8x7b have additional FFN layers with non-linear weighting or routing layers, making the attribution process more difficult for CP-LRP. This effect is studied in Appendix B.4. In general, gradient-based methods such as G×I, SmoothGrad, IG and Grad-CAM are noisy and often not informative. Attention Rollout and Grad×Attention Rollout suffer from background noise. While the performance of AtMan is in some cases excellent, as in Figure B.9, the method fails in other cases, as in Figure B.7.

In Figure B.7 and B.8 for Mixtral 8x7b, most methods fail to highlight the correct answer tokens most strongly, except AttnLRP, confirming the quantitative evaluation from Table 1.

In Figure B.9, the heatmaps of I×G or Grad-CAM seem to be inverted. We therefore experimented with inverting the attributions on a subset, but we did not notice an improvement and applied the methods with their original definition. AtMan produces highly sparse attributions, assigning large positive relevance to the answer token 18 but also a similar amount of relevance to the token much, which is part of the question. AttnLRP and CP-LRP identify the token 18 as the most relevant token and also relate it (by assigning positive and negative relevance) to other information in the text such as 27.7, 132 or average. We conjecture that such targeted contrasting reflects the reasoning process of the model (e.g., it is necessary to distinguish between related questions about how many tons are blown out vs. how many tons remain on the ground). A systematic analysis of these effects remains an interesting topic for future work.

Figure B.7: Evaluation on the Mixtral 8x7b model: We compute attributions for different state-of-the-art methods for the answer token “France”. Gradient-based methods such as G×I, SmoothGrad, IG or Grad-CAM are noisy. Grad×Attn Rollout suffers from background noise. While AtMan usually generates sparse heatmaps, in this case it fails (compare Figure B.9). CP-LRP highlights “Normandy” the strongest, while AttnLRP highlights the correct token “France”. For comparison, we also visualize a random attribution with Gaussian noise.
Figure B.8: Evaluation on the Mixtral 8x7b model: We compute attributions for different state-of-the-art methods for all tokens at the same time inside the answer “Ibn Sina”. Gradient-based methods such as G×I, SmoothGrad and IG are noisy. Grad-CAM highlights the correct tokens, except that it misses the beginning token “I” of the word “Ibn”. Likewise, AtMan fails to highlight all tokens. Grad×Attn Rollout suffers from background noise. CP-LRP resembles random noise, while AttnLRP highlights the correct tokens “Ibn Sina” in their entirety. For comparison, we also visualize a random attribution with Gaussian noise.
Figure B.9: Evaluation on the Flan-T5-XL model: We compute attributions for different state-of-the-art methods on the first token of the answer (highlighted in red). Gradient-based methods such as G×I, SmoothGrad, IG or Grad-CAM are noisy. Grad×Attn Rollout suffers from background noise. AtMan produces highly sparse attributions, assigning an equal amount of relevance to a token that is part of the question as to the token 18. CP-LRP weights the tokens differently, e.g., the word ‘much’ in the question is not highlighted by CP-LRP, while AttnLRP highlights it more strongly and AtMan focuses excessively on it. For comparison, we also visualize a random attribution with Gaussian noise.

B.8 Benchmarking Cost, Time and Memory Consumption

We benchmark the runtime and peak GPU memory consumption for computing a single attribution for LLaMa 2 with batch size 1 on a node with four A100-SXM4 40GB, 512 GB CPU RAM and 32 AMD EPYC 73F3 3.5 GHz. Because AtMan, LRP and AttnRollout-variants need access to the attention weights, we did not use flash-attention Dao et al., (2022).

To calculate energy cost, we assume a price of 0.16 $ per kWh of energy, and that a single A100 GPU consumes on average 130 W. Figure B.10 depicts the cost, the runtime and peak GPU memory consumption. Since perturbation-based methods are memory efficient, a 70b model with full context size of 4096 is attributable. However, LRP with checkpointing requires more memory than a node supplies.
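The cost assumption above amounts to a simple unit conversion, which can be sketched as follows. This is only an illustration: the function name `attribution_energy_cost` and the `n_gpus` parameter are hypothetical, and the text does not state how many GPUs are counted per measurement, so that number is left as an explicit argument. The defaults of 130 W and 0.16 $ per kWh are the values stated above.

```python
def attribution_energy_cost(runtime_s, n_gpus=1, gpu_power_w=130.0, price_per_kwh=0.16):
    """Estimate the energy cost (in dollars) of a single attribution run.

    runtime_s     : measured wall-clock runtime in seconds
    n_gpus        : number of GPUs assumed to be drawing power (assumption,
                    not specified in the text)
    gpu_power_w   : average power draw per GPU in watts (130 W as stated)
    price_per_kwh : energy price in dollars per kWh (0.16 as stated)
    """
    if runtime_s < 0:
        raise ValueError("runtime_s must be non-negative")
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    # 1 kWh = 1000 W * 3600 s = 3.6e6 watt-seconds
    energy_kwh = gpu_power_w * n_gpus * runtime_s / 3.6e6
    return energy_kwh * price_per_kwh


if __name__ == "__main__":
    # Hypothetical runtime of 60 s on all four GPUs of the node.
    print(attribution_energy_cost(60.0, n_gpus=4))
```

Under these assumptions the cost scales linearly with runtime, so the cost panel of Figure B.10 would be a rescaled version of the runtime panel.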

Figure B.10: From left to right: Cost in dollars, time in seconds and peak GPU memory in gigabytes for AttnLRP and linear-time perturbation. Evaluated on LLaMa 2-70b and LLaMa 2-13b models on a node with four A100-SXM4 40GB. G×AttnRollout is in the range of AttnLRP and omitted for clarity of visualization. Because AttnLRP consumes more than 160 GB of RAM, the curves for the 70b model stop. Measured at fixed intervals of context size 32, 64, 128, 256, 512, 1024, 2048, 3000, 4096.

B.9 Attributions of Knowledge Neurons

Figures B.11, B.12 and B.13 illustrate the top 10 sentences in the Wikipedia summary dataset that maximally activate a knowledge neuron. We applied AttnLRP to highlight the tokens inside these reference samples. We observe that knowledge neurons exhibit remarkable disentanglement, e.g., neuron #256 of layer 18 shown in Figure B.11 seems to encode concepts related to transport systems (railways in particular), while neuron #2207 of layer 20 shown in Figure B.12 seems to encode the concept teacher, in particular a teacher in an unusual context (e.g., inappropriate behavior, sexual misconduct). The degree of disentanglement should be studied in future work.
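The selection of reference samples described above can be sketched as a top-k search over per-sentence activations. This is a minimal illustration, not the procedure used for the figures: the helper name `top_k_actmax` is hypothetical, and reducing the per-token activations of a sentence to a single score by taking the maximum is an assumption, since the text does not specify the reduction.

```python
import heapq


def top_k_actmax(sentence_activations, k=10):
    """Return the k sentences with the highest neuron activation.

    sentence_activations : iterable of (sentence, activations) pairs, where
        activations is a list with the neuron's activation at each token.
    The sentence score is the maximum activation over its tokens (assumed
    reduction). Sentences without tokens are skipped.
    """
    scored = [
        (sentence, max(acts))
        for sentence, acts in sentence_activations
        if len(acts) > 0
    ]
    # nlargest returns the pairs sorted by score in descending order.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy activations; real values would come from a forward pass of the model.
    toy = [
        ("the train left the station", [0.1, 1.7, 0.2, 0.0, 1.2]),
        ("a recipe for soup", [0.0, 0.1, 0.0, 0.05]),
        ("railway lines were extended", [2.3, 0.9, 0.1, 0.4]),
    ]
    for sentence, score in top_k_actmax(toy, k=2):
        print(score, sentence)
```

The selected sentences would then be passed to the attribution method to highlight the tokens responsible for the activation.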

Figure B.11: AttnLRP attributions on top 10 ActMax sentences collected over the Wikipedia summary dataset for neuron #256, in layer 18. The knowledge neuron seems to activate for transport systems (railways in particular).
Figure B.12: AttnLRP attributions on top 10 ActMax sentences collected over the Wikipedia summary dataset for neuron #2207, in layer 20. The knowledge neuron is activating for ‘teacher’, in unusual context such as inappropriate behavior, sexual misconduct etc.
Figure B.13: AttnLRP attributions on top 10 ActMax sentences collected over the Wikipedia summary dataset for neuron #922, in layer 18. The knowledge neuron seems to be activating for scientific descriptions of plants.
Table B.7: ViT-B-16 Perturbation Experiment (Faithfulness). For SmoothGrad, we set $\sigma = 0.01$, for AtMan $p = 1.0$ and $t = 0.1$, for AttnRoll $dt = 0.99$, and for G×AttnRoll $dt = 0.91$. “all epsilon” indicates that the $\varepsilon$-rule has been used on the linear and convolutional layers. The term “best” refers to the utilization of LRP with the composite proposed in B.5. $\Delta A^{F}$ denotes the area under the curve for a flipping perturbation experiment which leverages both $A_{MoRF}^{F}$ of the most relevant first order, and $A_{LeRF}^{F}$ of least relevant first order ($\Delta A^{F} = A_{LeRF}^{F} - A_{MoRF}^{F}$). As discussed in Section B.2, this is equivalent to insertion perturbation.
| Methods (ViT-B-16, ImageNet) | (↑) $\Delta A^{F}$ | (↓) $A_{MoRF}^{F}$ | (↑) $A_{LeRF}^{F}$ |
| --- | --- | --- | --- |
| Random | 0.01 | 4.71 | 4.71 |
| I×G | 0.90 | 2.78 | 3.69 |
| IG | 1.54 | 2.55 | 4.10 |
| SmoothG | -0.04 | 3.63 | 3.58 |
| GradCAM | 0.27 | 5.35 | 5.63 |
| AttnRoll | 1.31 | 4.866 | 6.17 |
| G×AttnRoll | 2.60 | 4.01 | 6.22 |
| AtMan | 0.70 | 5.57 | 6.27 |
| CP-LRP (all epsilon) | 2.53 | 2.45 | 4.98 |
| $\gamma$CP-LRP (best) | 6.06 | 1.53 | 7.59 |
| AttnLRP (all epsilon) | 2.79 | 5.22 | 2.42 |
| AttnLRP (best) | 6.19 | 1.48 | 7.67 |
Table B.8: Wikipedia Perturbation Experiment (Faithfulness). For SmoothGrad, we set $\sigma = 0.1$, for AtMan $p = 1.0$, for AttnRoll $dt = 1$, and for G×AttnRoll $dt = 1$. “all epsilon” indicates that the $\varepsilon$-rule has been used on all linear layers. $\Delta A^{I}$ denotes the area under the curve for the insertion perturbation experiment which leverages both $A_{MoRF}^{I}$ of the most relevant first order, and $A_{LeRF}^{I}$ of least relevant first order ($\Delta A^{I} = A_{MoRF}^{I} - A_{LeRF}^{I}$). As discussed in Section B.2, this is equivalent to flipping perturbation.
| Methods (LLaMa 2-7b, Wikipedia) | (↑) $\Delta A^{I}$ | (↑) $A_{MoRF}^{I}$ | (↓) $A_{LeRF}^{I}$ |
| --- | --- | --- | --- |
| Random | -0.07 | 2.31 | 2.38 |
| I×G | 0.18 | 1.27 | 1.09 |
| IG | 4.05 | 3.74 | -0.31 |
| SmoothG | -2.22 | 0.68 | 2.90 |
| GradCAM | 2.01 | 2.36 | 0.35 |
| AttnRoll | -3.49 | 1.46 | 4.95 |
| G×AttnRoll | 9.79 | 8.79 | -1.00 |
| AtMan | 3.31 | 4.06 | 0.76 |
| CP-LRP (all epsilon) | 7.85 | 6.43 | -1.42 |
| AttnLRP (all epsilon) | 10.93 | 9.08 | -1.85 |
Table B.9: IMDB Perturbation Experiment (Faithfulness). For SmoothGrad, we set $\sigma = 0.05$, for AtMan $p = 0.7$, for AttnRoll $dt = 1$, and for G×AttnRoll $dt = 1$. “all epsilon” indicates that the $\varepsilon$-rule has been used to propagate relevance through the layers. $\Delta A^{I}$ denotes the area under the curve for the insertion perturbation experiment which leverages both $A_{MoRF}^{I}$ of the most relevant first order, and $A_{LeRF}^{I}$ of least relevant first order ($\Delta A^{I} = A_{MoRF}^{I} - A_{LeRF}^{I}$). As discussed in Section B.2, this is equivalent to flipping perturbation.
| Methods (LLaMa 2-7b, IMDB) | (↑) $\Delta A^{I}$ | (↑) $A_{MoRF}^{I}$ | (↓) $A_{LeRF}^{I}$ |
| --- | --- | --- | --- |
| Random | -0.01 | -0.47 | -0.46 |
| I×G | 0.12 | -0.69 | -0.81 |
| IG | 1.23 | -0.06 | -1.29 |
| SmoothG | 0.25 | -0.74 | -0.98 |
| GradCAM | -0.82 | -1.10 | -0.28 |
| AttnRoll | -0.64 | -0.64 | 0.00 |
| G×AttnRoll | 1.61 | 0.77 | -0.84 |
| AtMan | -0.05 | -0.54 | -0.49 |
| CP-LRP (all epsilon) | 1.72 | 0.50 | -1.22 |
| AttnLRP (all epsilon) | 2.50 | 1.12 | -1.38 |
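The faithfulness scores in Tables B.7 to B.9 are differences between two areas under perturbation curves, with opposite sign conventions for flipping and insertion. The following sketch illustrates that computation under stated assumptions: the curves are given as lists of model outputs sampled at uniform perturbation steps, and the area is approximated with the trapezoidal rule, which is an assumption since the integration scheme is not given in this section. All function names are hypothetical.

```python
def area_under_curve(values, step=1.0):
    """Trapezoidal area under a curve sampled at uniform steps (assumed scheme)."""
    if len(values) < 2:
        return 0.0
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def delta_a_flipping(morf_curve, lerf_curve, step=1.0):
    """Flipping score: area of the LeRF curve minus area of the MoRF curve."""
    return area_under_curve(lerf_curve, step) - area_under_curve(morf_curve, step)


def delta_a_insertion(morf_curve, lerf_curve, step=1.0):
    """Insertion score: area of the MoRF curve minus area of the LeRF curve."""
    return area_under_curve(morf_curve, step) - area_under_curve(lerf_curve, step)


if __name__ == "__main__":
    # Toy flipping curves: the output drops quickly when the most relevant
    # tokens are perturbed first and slowly when the least relevant go first.
    morf = [1.0, 0.2, 0.0]
    lerf = [1.0, 0.9, 0.5]
    print(delta_a_flipping(morf, lerf))
```

In both conventions a larger score indicates that the attribution ranks the input features in an order that matters more to the model output.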
Figure B.14: Statistics on Rules used for softmax layers: Either applying the $z^{+}$-rule, the $\varepsilon$-rule, or regarding the layer as constant as proposed in CP-LRP. Propagating relevance values through softmax (specifically by applying the $z^{+}$-rule) improves the faithfulness of explanations compared to the case where we block its propagation.
Figure B.15: Statistics on Rules used for Convolution layers: Applying $z^{+}$ and AlphaBeta yields acceptable results; however, the most faithful results are reached with Gamma ($\gamma = 0.25$).
Figure B.16: Statistics on Rules used for Linear layers: Similar to Convolution layers, Gamma seems more promising, however with a different $\gamma$ value ($0.05$ in this case).
Figure B.17: Statistics on Rules used for LinearInputProjection layers: The Gamma and $\varepsilon$ rules are competitive in this case; however, since there is a larger difference between the minimum and the lower quartile for the Gamma rules, the most faithful choice is the $\varepsilon$-rule.
Figure B.18: Statistics on Rules used for LinearOutputProjection layers: The $\varepsilon$-rule clearly outperforms the other rules.
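To make the compared rules more concrete, the following sketch shows the $\varepsilon$-rule and the $\gamma$-rule for a single bias-free linear layer in their common textbook form. It is an illustration under assumptions, not the composite of Section B.5: the function name `lrp_linear` is hypothetical, and the $\gamma$-rule is written in the variant usually stated for non-negative inputs, where each weight $w$ is replaced by $w + \gamma \max(w, 0)$.

```python
def lrp_linear(a, W, R_out, eps=1e-6, gamma=0.0):
    """Propagate relevance through z_j = sum_i a[i] * W[i][j] (no bias).

    a     : list of input activations
    W     : weight matrix as nested lists, W[i][j] connects input i to output j
    R_out : relevance assigned to each output j
    eps   : stabilizer added to z_j with the sign of z_j (epsilon-rule)
    gamma : gamma = 0 gives the plain epsilon-rule; gamma > 0 emphasizes
            positive weights (textbook gamma-rule, assumed variant)
    """
    n_in, n_out = len(a), len(R_out)
    # Modified weights: positive weights are scaled up by (1 + gamma).
    Wm = [[w + gamma * max(w, 0.0) for w in row] for row in W]
    R_in = [0.0] * n_in
    for j in range(n_out):
        z = sum(a[i] * Wm[i][j] for i in range(n_in))
        z += eps if z >= 0 else -eps  # avoid division by zero
        for i in range(n_in):
            # Each input receives relevance proportional to its contribution.
            R_in[i] += a[i] * Wm[i][j] / z * R_out[j]
    return R_in


if __name__ == "__main__":
    a = [1.0, 2.0]
    W = [[1.0, -1.0], [0.5, 1.0]]
    R_out = [1.0, 1.0]
    print(lrp_linear(a, W, R_out))              # epsilon-rule
    print(lrp_linear(a, W, R_out, gamma=0.25))  # gamma-rule
```

In this toy example both rules approximately conserve the total relevance, while the $\gamma$-rule reduces the magnitude of the negative relevance assigned to the first input.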