License: arXiv.org perpetual non-exclusive license
arXiv:2109.13004v2 [stat.ML] 15 Jan 2024

Optimising for Interpretability:
Convolutional Dynamic Alignment Networks

Moritz Böhle, Mario Fritz, and Bernt Schiele
Moritz Böhle is the corresponding author.
He is with the Department of Computer Vision and Machine Learning, Max Planck Institute for Informatics, Saarbrücken 66123, Germany. E-mail: [email protected]
Bernt Schiele is with the Department of Computer Vision and Machine Learning, Max Planck Institute for Informatics, Saarbrücken 66123, Germany. E-mail: [email protected]
Mario Fritz is with the CISPA Helmholtz Center for Information Security, Saarbrücken 66123, Germany. E-mail: [email protected].
Abstract

We introduce a new family of neural network models called Convolutional Dynamic Alignment Networks (CoDA Nets), which are performant classifiers with a high degree of inherent interpretability. Their core building blocks are Dynamic Alignment Units (DAUs), which are optimised to transform their inputs with dynamically computed weight vectors that align with task-relevant patterns. As a result, CoDA Nets model the classification prediction through a series of input-dependent linear transformations, allowing for linear decomposition of the output into individual input contributions. Given the alignment of the DAUs, the resulting contribution maps align with discriminative input patterns. These model-inherent decompositions are of high visual quality and outperform existing attribution methods under quantitative metrics. Further, CoDA Nets constitute performant classifiers, achieving results on par with ResNet and VGG models on, e.g., CIFAR-10 and TinyImagenet. Lastly, CoDA Nets can be combined with conventional neural network models to yield powerful classifiers that more easily scale to complex datasets such as Imagenet whilst exhibiting an increased interpretable depth, i.e., the output can be explained well in terms of contributions from intermediate layers within the network.

Index Terms:
Explainability in Deep Learning, Convolutional Neural Networks

1 Introduction

Figure 1: Sketch of a 9-layer CoDA-Net, which computes its output $\mathbf{a}_9$ for an input $\mathbf{a}_0$ as a linear transform via a matrix $\mathbf{W}_{0\rightarrow 9}(\mathbf{a}_0)$. As such, the output can be linearly decomposed into input contributions (see right). This ‘global’ transformation matrix $\mathbf{W}_{0\rightarrow 9}$ is computed successively via multiple layers of Dynamic Alignment Units (DAUs). These layers, in turn, produce intermediate linear transformation matrices $\mathbf{W}_l(\mathbf{a}_{l-1})$ that align with the inputs of layer $l$. As a result, the combined matrix $\mathbf{W}_{0\rightarrow 9}$ also aligns well with task-relevant patterns. Positive (negative) contributions for the class ‘goldfinch’ are shown in red (blue).

Neural networks are powerful models that excel at a wide range of tasks. However, they are notoriously difficult to interpret and extracting explanations for their predictions is an open research problem. Linear models, in contrast, are generally considered interpretable, because the contribution (‘the weighted input’) of every dimension to the output is explicitly given. Interestingly, many modern neural networks implicitly model the output as a linear transformation of the input; a ReLU-based [1] neural network, e.g., is piece-wise linear and the output thus a linear transformation of the input, cf. [2]. However, due to the highly non-linear manner in which these linear transformations are ‘chosen’, the corresponding contributions per input dimension do not seem to represent the learnt model parameters well, cf. [3], and a lot of research is being conducted to find better explanations for the decisions of such neural networks, cf. [4, 5, 6, 7, 8, 9, 10, 11].

In this work, we introduce a novel network architecture, the Convolutional Dynamic Alignment Networks (CoDA Nets), for which the model-inherent contribution maps are faithful projections of the internal computations and thus good ‘explanations’ of the model prediction. There are two main components to the interpretability of the CoDA Nets. First, the CoDA Nets are dynamic linear, i.e., they compute their outputs through a series of input-dependent linear transforms, which are based on our novel Dynamic Alignment Units (DAUs). As in linear models, the output can thus be decomposed into individual input contributions, see Fig. 1. Second, the DAUs are structurally biased to compute weight vectors that align with relevant patterns in their inputs. In combination, the CoDA Nets thus inherently produce contribution maps that are ‘optimised for interpretability’: since each linear transformation matrix and thus their combination is optimised to align with discriminative features, the contribution maps reflect the most discriminative features as used by the model.

With this work, we present a new direction for building inherently more interpretable neural network architectures with high modelling capacity. In detail, we would like to highlight the following contributions:

  1. We introduce the concept of Dynamic Alignment Units (DAUs), which improve the interpretability of neural networks and have two key properties: they are dynamic linear and need to align their dynamically computed weights with their inputs to achieve large outputs, since the norm of their weights is explicitly constrained.

  2. Apart from normalising the dynamically computed DAU weights by their vector norm, we show that similar results can be achieved when normalising them by an upper bound of their norm, which results in more efficient DAUs.

  3. We show that networks of DAUs inherit the dynamic linearity and the alignment properties from their constituent DAUs. In particular, we introduce Convolutional Dynamic Alignment Networks (CoDA Nets), which are built out of multiple layers of DAUs. As a result, the model-inherent contribution maps of CoDA Nets highlight discriminative patterns in the input.

  4. We further show that the alignment of the DAUs can be promoted by applying a ‘temperature scaling’ to the final output of the CoDA Nets.

  5. We show that the resulting contribution maps perform well under commonly employed quantitative criteria for attribution methods. Moreover, under qualitative inspection, we note that they exhibit a high degree of detail.

  6. We analyse how the models are affected by different normalisation functions in the DAU weight calculation in terms of accuracy, interpretability, as well as efficiency.

  7. Beyond interpretability, CoDA Nets are performant classifiers and yield competitive classification accuracies on the CIFAR-10 and TinyImagenet datasets.

  8. We show that CoDA Nets can be seamlessly combined with conventional networks. The resulting hybrid networks exhibit an increased ‘interpretable depth’ whilst taking advantage of the efficiency and strong modelling capacity of the base networks. Such networks hold great potential for designing models that are inherently interpretable up to a user-defined minimal resolution.

2 Related Work

Interpretability. In order to make machine learning models more interpretable, a variety of techniques has been developed. While there are different ways in which interpretability can be defined, in this work we focus on attributing importance values to input features; for an extensive review regarding interpretability in machine learning, see [12].

Current techniques for interpreting neural networks via importance attribution can broadly be split into two categories: deriving post-hoc explanations and developing inherently interpretable models. In the following, we will first describe common approaches for post-hoc interpretability.

On the one hand, research regarding post-hoc explanations has been undertaken to develop model-agnostic explanation methods for which the model behaviour under perturbed inputs is analysed; this includes among others [13, 14, 15]. While their generality and the applicability to any model are advantageous, these methods typically require evaluating the respective model several times and are therefore costly approximations of model behaviour. Further, they rely on the assumption that the models generalise to such out-of-distribution (OOD) data (the perturbed / occluded inputs) in a stable manner, such that the outputs on the OOD data can be used to explain model behaviour on in-distribution data.

On the other hand, many techniques that explicitly take advantage of the internal computations have been proposed for explaining the model predictions, including, for example, [4, 5, 6, 7, 8, 9, 10, 11]. Such methods typically distribute importance values layer by layer in a backward pass and introduce different rules for this re-distribution. Similarly, the model-inherent contribution maps in the CoDA Nets can also be obtained as a layer-wise decomposition of the output. However, there is a key difference: producing well-aligned linear decompositions is actually optimised for in the CoDA Nets during training, whereas other methods are developed without taking the optimisation procedure into account. As a result, the inherent explanations in the CoDA Nets lend themselves better to understanding the model outputs.
In contrast to techniques that aim to explain models post-hoc, some recent work has focused on designing new types of network architectures, which are inherently more interpretable. Examples of this are the prototype-based neural networks [16], the BagNet [17] and the self-explaining neural networks (SENNs) [18]. Similarly to our proposed architectures, the SENNs and the BagNets derive their explanations from a linear decomposition of the output into contributions from the input (features). This dynamic linearity, i.e., the property that the output is computed via some form of an input-dependent linear mapping, is additionally shared by the entire model family of piece-wise linear networks (e.g., ReLU-based networks). In fact, the contribution maps of the CoDA Nets are conceptually similar to evaluating the ‘Input$\times$Gradient’ (IxG), cf. [3], on piece-wise linear models, which also yields a linear decomposition in form of a contribution map. However, in contrast to the piece-wise linear functions, we combine this dynamic linearity with a structural bias towards an alignment between the contribution maps and the discriminative patterns in the input. This results in explanations of much higher quality, whereas IxG on piece-wise linear models has been found to yield unsatisfactory explanations of model behaviour [3].

Architectural similarities. In our CoDA Nets, the convolutional kernels are dependent on the specific patch that they are applied to; i.e., a CoDA layer might apply different filters at every position in the input. As such, these layers can be regarded as an instance of dynamic local filtering layers as introduced in [19]. Further, our dynamic alignment units (DAUs) share some high-level similarities with attention networks, cf. [20], in the sense that each DAU has a limited budget to distribute over its dynamic weight vectors (bounded norm), which is then used to compute a weighted sum. However, whereas in attention networks the weighted sum is typically computed over vectors (the ‘value vectors’) which differ from the input to the attention module (the ‘key’ and ‘query’ vectors), a DAU outputs a scalar which is a weighted sum of all scalar entries in the input. Moreover, we note that at their optimum (maximal average output over a set of inputs), the DAUs solve a constrained low-rank matrix approximation problem [21]. While low-rank approximations have been used for increasing parameter efficiency in neural networks, cf. [22], this concept has to the best of our knowledge not been used in order to endow neural networks with a structural bias towards finding low-rank approximations of the input for increased interpretability in classification tasks. Lastly, the CoDA Nets are related to capsule networks. However, whereas in classical capsule networks the activation vectors of the capsules directly serve as input to the next layer, in CoDA Nets the corresponding vectors are used as convolutional filters. We include a detailed comparison in the supplement.

3 Dynamic Alignment Networks

In this section, we present our novel type of network architecture: the Convolutional Dynamic Alignment Networks (CoDA Nets). For this, we first introduce Dynamic Alignment Units (DAUs) as the basic building blocks of CoDA Nets and discuss two of their key properties in sec. 3.1. Concretely, we show that these units linearly transform their inputs with dynamic (input-dependent) weight vectors and, additionally, that they are biased to align these weights with the input during optimisation. Given the computational costs of evaluating DAUs, in sec. 3.2 we further present an alternative formulation of the DAUs for increased efficiency. We then discuss how DAUs can be used for classification (sec. 3.3) and how we build performant networks out of multiple layers of convolutional DAUs (sec. 3.4). Importantly, the resulting linear decompositions of the network outputs are optimised to align with discriminative patterns in the input, making them highly suitable for interpreting the network predictions.

In particular, we structure this section around the following three important properties (P1-P3) of the DAUs:
P1: Dynamic linearity. The DAU output $o$ is computed as a dynamic (input-dependent) linear transformation of the input $\mathbf{x}$, such that $o = \mathbf{w}(\mathbf{x})^T \mathbf{x} = \sum_j w_j(\mathbf{x})\, x_j$. Hence, $o$ can be decomposed into contributions from individual input dimensions, which are given by $w_j(\mathbf{x})\, x_j$ for dimension $j$.
P2: Alignment maximisation. Maximising the average output of a single DAU over a set of inputs $\mathbf{x}_i$ maximises the alignment between inputs $\mathbf{x}_i$ and the weight vectors $\mathbf{w}(\mathbf{x}_i)$. As the modelling capacity of $\mathbf{w}(\mathbf{x})$ is restricted, $\mathbf{w}(\mathbf{x})$ will encode the most frequent patterns in the set of inputs $\mathbf{x}_i$.
P3: Inheritance. When combining multiple DAU layers to form a Dynamic Alignment Network (DA Net), the properties P1 and P2 are inherited: DA Nets are dynamic linear (P1) and maximising the last layer’s output induces an output maximisation in the constituent DAUs (P2).

These properties increase the interpretability of a DA Net, such as a CoDA Net (sec. 3.4), for the following reasons. First, the output of a DA Net can be decomposed into contributions from the individual input dimensions, similar to linear models (cf. Fig. 1, P1 and P3). Second, we note that optimising a neural network for classification applies a maximisation to the outputs of the last layer for every sample. This maximisation aligns the dynamic weight vectors $\mathbf{w}(\mathbf{x})$ of the constituent DAUs of the DA Net with their respective inputs (cf. Fig. 2 as well as P2 and P3).

Importantly, the weight vectors will align with the discriminative patterns in their inputs when optimised for classification, as we show in sec. 3.3. As a result, the model-inherent contribution maps of CoDA Nets are optimised to align well with discriminative patterns in the input image, and the interpretability of our models thus forms part of the global optimisation procedure.

Figure 2: For different inputs $\mathbf{x}$, we visualise the linear weights and contributions (for the single layer, see eq. (7), for the CoDA-Net eq. (11)) for the ground truth label $l$ and the strongest non-label output $z$. As can be seen, the weights align well with the input images. The first three rows are based on a single DAU layer, the last three on a 5-layer CoDA-Net. The first two samples (rows) per model are correctly classified and the last one is misclassified.

3.1 Dynamic Alignment Units

We define the Dynamic Alignment Units (DAUs) by

$$\text{DAU}(\mathbf{x}) = g(\mathbf{A}\mathbf{B}\mathbf{x} + \mathbf{b})^T \mathbf{x} = \mathbf{w}(\mathbf{x})^T \mathbf{x} \;. \tag{1}$$

Here, $\mathbf{x} \in \mathbb{R}^{d}$ is an input vector, $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times d}$ are trainable transformation matrices, $\mathbf{b} \in \mathbb{R}^{d}$ a trainable bias vector, and $g(\mathbf{u}) = \alpha(\|\mathbf{u}\|)\,\mathbf{u}$ is a non-linear function that scales the norm of its input. In contrast to using a single matrix $\mathbf{M} \in \mathbb{R}^{d \times d}$, using $\mathbf{AB}$ allows us to control the maximum rank $r$ of the transformation and to reduce the number of parameters; we will hence refer to $r$ as the rank of a DAU. As can be seen by the right-hand side of eq. (1), the DAU linearly transforms the input $\mathbf{x}$ (P1). At the same time, given the quadratic form ($\mathbf{x}^T \mathbf{B}^T \mathbf{A}^T \mathbf{x}$) and the rescaling function $\alpha(\|\mathbf{u}\|)$, the output of the DAU is a non-linear function of its input.
In the context of DAUs, we are particularly interested in functions that constrain the norm of the weight vectors $\mathbf{w}(\mathbf{x})$, such as, e.g., rescaling to unit norm (L2) or the squashing function (SQ, see [23]):

$$\text{L2}(\mathbf{u}) = \frac{\mathbf{u}}{\|\mathbf{u}\|_2} \quad \text{and} \quad \text{SQ}(\mathbf{u}) = \text{L2}(\mathbf{u}) \times \frac{\|\mathbf{u}\|_2^2}{1 + \|\mathbf{u}\|_2^2} \tag{2}$$
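As a rough illustration of eqs. (1) and (2), the following NumPy sketch evaluates a single DAU with either rescaling function and checks the decomposition into per-dimension contributions (P1). The sizes, random parameters, and the function names `l2`, `squash`, and `dau` are assumptions chosen for this example and are not taken from the authors' implementation.

```python
import numpy as np

def l2(u):
    """L2 rescaling from eq. (2): maps u to unit norm."""
    return u / np.linalg.norm(u)

def squash(u):
    """Squashing function SQ from eq. (2): output norm is ||u||^2 / (1 + ||u||^2) < 1."""
    sq_norm = float(u @ u)
    return l2(u) * sq_norm / (1.0 + sq_norm)

def dau(x, A, B, b, g=l2):
    """Eq. (1): returns the scalar output w(x)^T x and the dynamic weights w(x)."""
    w = g(A @ (B @ x) + b)
    return float(w @ x), w

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
d, r = 8, 3
A, B = rng.normal(size=(d, r)), rng.normal(size=(r, d))
b, x = rng.normal(size=d), rng.normal(size=d)

out_l2, w_l2 = dau(x, A, B, b, g=l2)
out_sq, w_sq = dau(x, A, B, b, g=squash)
contributions = w_l2 * x          # per-dimension terms w_j(x) * x_j (P1)
```

In this sketch, `contributions.sum()` reproduces `out_l2`, and both weight vectors have a norm of at most one.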

In sec. 3.2, we further present an approximation to these rescaling functions, which lowers the computational cost of the DAUs whilst maintaining their bounding property $\|\mathbf{w}(\mathbf{x})\| \leq 1$. Given such a bound on $\mathbf{w}(\mathbf{x})$, the output of the DAUs will be upper-bounded by the norm of the input:

$$\text{DAU}(\mathbf{x}) = \|\mathbf{w}(\mathbf{x})\| \, \|\mathbf{x}\| \cos(\angle(\mathbf{x}, \mathbf{w}(\mathbf{x}))) \leq \|\mathbf{x}\| \tag{3}$$

As a corollary, for a given input $\mathbf{x}_i$, the DAUs can only achieve this upper bound if $\mathbf{x}_i$ is an eigenvector (EV) of the linear transform $\mathbf{AB}\mathbf{x} + \mathbf{b}$. Otherwise, the cosine in eq. (3) will not be maximal.¹ As can be seen in eq. (3), maximising the average output of a DAU over a set of inputs $\{\mathbf{x}_i \,|\, i = 1, \dots, n\}$ maximises the alignment between $\mathbf{w}(\mathbf{x})$ and $\mathbf{x}$ (P2). In particular, it optimises the parameters of the DAU such that the most frequent input patterns are encoded as EVs in the linear transform $\mathbf{AB}\mathbf{x} + \mathbf{b}$, similar to an $r$-dimensional PCA decomposition ($r$ the rank of $\mathbf{AB}$).
In fact, as discussed in the supplement, the optimum of the DAU maximisation solves a low-rank matrix approximation [21] problem similar to singular value decomposition.

¹ Note that $\mathbf{w}(\mathbf{x})$ is proportional to $\mathbf{AB}\mathbf{x} + \mathbf{b}$. The cosine in eq. (3), in turn, is maximal if and only if $\mathbf{w}(\mathbf{x}_i)$ is proportional to $\mathbf{x}_i$ and thus, by transitivity, if $\mathbf{x}_i$ is proportional to $\mathbf{AB}\mathbf{x}_i + \mathbf{b}$. This means that $\mathbf{x}_i$ has to be an EV of $\mathbf{AB}\mathbf{x} + \mathbf{b}$ to achieve maximal output.
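The bound in eq. (3) and the eigenvector condition can be checked numerically. The sketch below assumes $\mathbf{b} = \mathbf{0}$, L2 rescaling, and a hand-constructed $\mathbf{AB} = \mathbf{V}\mathbf{V}^T$ with orthonormal columns in $\mathbf{V}$, so that every vector in the span of $\mathbf{V}$ is an eigenvector with eigenvalue one. This construction is purely illustrative and is not a trained DAU.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 10, 3

# Orthonormal columns via QR, so that A @ B = V V^T is a rank-r projection.
V, _ = np.linalg.qr(rng.normal(size=(d, r)))
A, B = V, V.T

def dau_l2(x):
    """Bias-free DAU with L2 rescaling: w(x) = ABx / ||ABx||."""
    u = A @ (B @ x)
    w = u / np.linalg.norm(u)
    return float(w @ x)

x_ev = V @ rng.normal(size=r)      # lies in span(V): an eigenvector of AB
x_generic = rng.normal(size=d)     # generic input, not an eigenvector

# Gap between the upper bound ||x|| and the DAU output, cf. eq. (3).
gap_ev = np.linalg.norm(x_ev) - dau_l2(x_ev)
gap_generic = np.linalg.norm(x_generic) - dau_l2(x_generic)
```

For the eigenvector input the gap vanishes up to floating-point error, whereas for the generic input the output stays strictly below $\|\mathbf{x}\|$.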

Figure 3: Eigenvectors (EVs) of $\mathbf{AB}$ after maximising the output of a rank-3 DAU over a set of noisy samples of 3 MNIST digits. Effectively, the DAUs encode the most frequent components in their EVs, similar to a principal component analysis (PCA).

As an illustration of this property, in Fig. 3 we show the 3 EVs² of matrix $\mathbf{AB}$ (with rank $r = 3$, bias $\mathbf{b} = \mathbf{0}$) after optimising a DAU over a set of $n$ noisy samples of 3 specific MNIST [24] images; for this, we used $n = 3072$ and zero-mean Gaussian noise. As expected, the EVs of $\mathbf{AB}$ encode the original, noise-free images, since this on average maximises the alignment (eq. (3)) between the weight vectors $\mathbf{w}(\mathbf{x}_i)$ and the input samples $\mathbf{x}_i$ over the dataset.

² Given $r = 3$, the EVs maximally span a 3-dimensional subspace.

3.2 Efficient DAUs: Bounding the Bound

As discussed in the previous section, we introduce a norm constraint for the DAU weights $\mathbf{w}(\mathbf{x})$ to ensure that large outputs can only be achieved for well-aligned weights. However, the explicit norm constraint on $\mathbf{w}(\mathbf{x})$ requires its explicit calculation, which we have observed to significantly impact the evaluation time of DAUs. Therefore, we evaluate an additional formulation of the DAUs in which we only constrain an upper bound of the norm of $\mathbf{w}(\mathbf{x})$. For this, we take advantage of the following inequality:

$$\|\mathbf{w}(\mathbf{x})\| = \|\mathbf{AB}\mathbf{x}\| \leq \|\mathbf{A}\|_F \, \|\mathbf{B}\mathbf{x}\| \;. \tag{4}$$

Here, $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|$ the $L_2$ vector norm; this inequality reflects the fact that the Frobenius norm is compatible with the $L_2$ vector norm. Note that normalising by this upper bound rather than by the exact norm yields $\|\mathbf{w}(\mathbf{x})\| \leq 1$; the resulting bound on the DAU output is thus at least as tight as the bound in eq. (3). As a result, without the bias term $\mathbf{b}$, the output of the corresponding DAUs can be calculated as

$$\text{eDAU}(\mathbf{x}) = \|\mathbf{B}\mathbf{x}\|^{-1} \left(\mathbf{B}\mathbf{x}\right)^T \left(\mathbf{A}'^{T} \mathbf{x}\right) \tag{5}$$
$$\text{with} \quad \mathbf{A}' = \|\mathbf{A}\|_F^{-1}\, \mathbf{A} \tag{6}$$

We will henceforth refer to this non-linear output computation as weight bounding (WB). Note that under this formulation the $d$-dimensional weights $\mathbf{w}(\mathbf{x})$ are never explicitly calculated and the output is instead obtained as a dot product in $\mathbb{R}^{r}$ between the vectors $\mathbf{B}\mathbf{x}$ and $\mathbf{A}'^{T}\mathbf{x}$. Further, for convolutional DAUs (see sec. 3.4), the matrix $\mathbf{A}'$ has to be computed only once for all positions. As we show in sec. 5.3, this can result in significant gains in efficiency.
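To make the weight-bounding computation concrete, the following NumPy sketch evaluates eqs. (5) and (6) as a dot product in $\mathbb{R}^{r}$ and compares the result with the explicit $d$-dimensional weights $\mathbf{AB}\mathbf{x} / (\|\mathbf{A}\|_F \|\mathbf{B}\mathbf{x}\|)$. The function names and sizes are hypothetical, and the parameters are random rather than trained.

```python
import numpy as np

def edau(x, A, B):
    """Weight-bounded DAU without bias (eqs. (5)-(6)): a dot product in R^r."""
    A_prime = A / np.linalg.norm(A, "fro")   # eq. (6); independent of x, so it can be cached
    Bx = B @ x                               # shape (r,)
    return float(Bx @ (A_prime.T @ x)) / np.linalg.norm(Bx)

def explicit_weights(x, A, B):
    """The d-dimensional weights that eq. (5) avoids forming explicitly."""
    return (A @ (B @ x)) / (np.linalg.norm(A, "fro") * np.linalg.norm(B @ x))

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(2)
d, r = 64, 4
A, B = rng.normal(size=(d, r)), rng.normal(size=(r, d))
x = rng.normal(size=d)

out = edau(x, A, B)
w = explicit_weights(x, A, B)
```

Both routes give the same output, and by eq. (4) the norm of `w` does not exceed one, so the output remains bounded by $\|\mathbf{x}\|$.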

3.3 DAUs for classification

DAUs can be used directly for classification by applying $k$ DAUs in parallel to obtain an output $\hat{\mathbf{y}}(\mathbf{x}) = \left[\text{DAU}_1(\mathbf{x}), \dots, \text{DAU}_k(\mathbf{x})\right]$. Note that this is a linear transformation $\hat{\mathbf{y}}(\mathbf{x}) = \mathbf{W}(\mathbf{x})\,\mathbf{x}$, with each row in $\mathbf{W} \in \mathbb{R}^{k \times d}$ corresponding to the weight vector $\mathbf{w}_j^T$ of a specific DAU $j$. Consider, for example, a dataset $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i) \,|\, \mathbf{x}_i \in \mathbb{R}^{d}, \mathbf{y}_i \in \mathbb{R}^{k}\}$ of $k$ classes with ‘one-hot’ encoded labels $\mathbf{y}_i$ for the inputs $\mathbf{x}_i$.
To optimise the DAUs as classifiers on $\mathcal{D}$, we can apply a sigmoid non-linearity to each DAU output and optimise the loss function $\mathcal{L} = \sum_i \text{BCE}(\sigma(\hat{\mathbf{y}}_i), \mathbf{y}_i)$, where BCE denotes the binary cross-entropy and $\sigma$ applies the sigmoid function to each entry in $\hat{\mathbf{y}}_i$. Note that for a given sample, BCE either maximises (DAU for correct class) or minimises (DAU for incorrect classes) the output of each DAU. Hence, this classification loss will still maximise the (signed) cosine between the weight vectors $\mathbf{w}(\mathbf{x}_i)$ and $\mathbf{x}_i$.
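The classification setup described above can be sketched in NumPy as follows: $k$ DAUs with L2 rescaling are applied in parallel, and the binary cross-entropy is evaluated on the sigmoid of their outputs for one sample. Parameter shapes, random values, and the helper names `dau_layer` and `bce_with_logits` are assumptions made for this example; no training loop is shown.

```python
import numpy as np

def dau_layer(x, As, Bs, bs):
    """k parallel DAUs with L2 rescaling. As: (k, d, r), Bs: (k, r, d), bs: (k, d)."""
    U = np.stack([A @ (B @ x) + b for A, B, b in zip(As, Bs, bs)])
    W = U / np.linalg.norm(U, axis=1, keepdims=True)   # row j is w_j(x)^T
    return W @ x, W

def bce_with_logits(logits, y):
    """Binary cross-entropy of sigmoid(logits) against y, summed over the k outputs."""
    return float(np.sum(np.maximum(logits, 0.0) - logits * y
                        + np.log1p(np.exp(-np.abs(logits)))))

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(3)
k, d, r = 4, 6, 2
As = rng.normal(size=(k, d, r))
Bs = rng.normal(size=(k, r, d))
bs = rng.normal(size=(k, d))
x = rng.normal(size=d)
y = np.eye(k)[1]                  # one-hot label for class 1

y_hat, W = dau_layer(x, As, Bs, bs)
loss = bce_with_logits(y_hat, y)
```

The loss is written in the numerically stable `max(z, 0) - z*y + log(1 + exp(-|z|))` form, which equals the textbook BCE applied to `sigmoid(z)`.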

To illustrate this property, in Fig. 2 (top) we show the weights $\mathbf{w}(\mathbf{x}_i)$ for several samples of the digit ‘3’ after optimising the DAUs for classification on a noisy MNIST dataset; the first two are correctly classified, the last one is misclassified as a ‘5’. As can be seen, the weights align with the respective input (the weights for different samples are different). However, different parts of the input are either positively or negatively correlated with a class, which is reflected in the weights: for example, the extended stroke on top of the ‘3’ in the misclassified sample is assigned negative weight and, since the background noise is uncorrelated with the class labels, it is not represented in the weights.

In a classification setting, the DAUs thus preferentially encode the most frequent discriminative patterns in the linear transform $\mathbf{AB}\mathbf{x}+\mathbf{b}$ such that the dynamic weights $\mathbf{w}(\mathbf{x})$ align well with these patterns. Additionally, since the output for class $j$ is a linear transformation of the input (P1), we can compute the contribution vector $\mathbf{s}_{j}$ containing the per-pixel contributions to this output by the element-wise product ($\odot$)

$$\mathbf{s}_{j}(\mathbf{x}_{i})=\mathbf{w}_{j}(\mathbf{x}_{i})\odot\mathbf{x}_{i}\quad, \tag{7}$$

see Figs. 1 and 2. Such linear decompositions constitute the model-inherent ‘explanations’ which we evaluate in sec. 5.
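As a short worked step, summing the entries of the contribution vector in eq. (7) recovers the DAU output exactly, which is why the decomposition is complete. The index $p$ over input dimensions is notation introduced here for illustration.

```latex
\begin{align*}
\sum_{p=1}^{d}\left[\mathbf{s}_{j}(\mathbf{x}_{i})\right]_{p}
  &= \sum_{p=1}^{d}\left[\mathbf{w}_{j}(\mathbf{x}_{i})\right]_{p}\left[\mathbf{x}_{i}\right]_{p} \\
  &= \mathbf{w}_{j}(\mathbf{x}_{i})^{T}\mathbf{x}_{i}
   = \text{DAU}_{j}(\mathbf{x}_{i})
\end{align*}
```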

3.4 Convolutional Dynamic Alignment Networks

The modelling capacity of a single layer of DAUs is limited, similar to a single linear classifier. However, DAUs can be used as the basic building block for deep convolutional neural networks, which yields powerful classifiers. Importantly, in this section we show that such a Convolutional Dynamic Alignment Network (CoDA Net) inherits the properties (P3) of the DAUs by maintaining both the dynamic linearity (P1) as well as the alignment maximisation (P2). For a convolutional dynamic alignment layer, each convolutional filter is modelled by a DAU, similar to dynamic local filtering layers [19]. Note that the output of such a layer is also a dynamic linear transformation of the input to that layer, since a convolution is equivalent to a linear layer with certain constraints on the weights, cf. [25]. We include the implementation details in the supplement. Finally, at the end of this section, we highlight an important difference between output maximisation and optimising for classification with the BCE loss. In this context we discuss the effect of temperature scaling and present the loss function we optimise in our experiments.

Dynamic linearity (P1). In order to see that the linearity is maintained, we note that the successive application of multiple layers of DAUs also results in a dynamic linear mapping. Let $\mathbf{W}_{l}$ denote the linear transformation matrix produced by a layer of DAUs and let $\mathbf{a}_{l-1}$ be the input vector to that layer; as mentioned before, each row in the matrix $\mathbf{W}_{l}$ corresponds to the weight vector of a single DAU (footnote 3). As such, the output of this layer is given by

$$\mathbf{a}_{l}=\mathbf{W}_{l}(\mathbf{a}_{l-1})\,\mathbf{a}_{l-1}\quad. \tag{8}$$

Footnote 3: Note that this also holds for convolutional DAU layers. Specifically, each row in the matrix $\mathbf{W}_{l}$ corresponds to a single DAU applied to exactly one spatial location in the input and the input with spatial dimensions is vectorised to yield $\mathbf{a}_{l-1}$. For further details, we kindly refer the reader to [25] and the implementation details in the supplement of this work.

In a network of DAUs, the successive linear transformations can thus be collapsed. In particular, for any pair of activation vectors $\mathbf{a}_{l_{1}}$ and $\mathbf{a}_{l_{2}}$ with $l_{1}<l_{2}$, the vector $\mathbf{a}_{l_{2}}$ can be expressed as a linear transformation of $\mathbf{a}_{l_{1}}$:

$$\mathbf{a}_{l_{2}}=\mathbf{W}_{l_{1}\rightarrow l_{2}}\left(\mathbf{a}_{l_{1}}\right)\mathbf{a}_{l_{1}} \tag{9}$$

$$\text{with}\quad\mathbf{W}_{l_{1}\rightarrow l_{2}}\left(\mathbf{a}_{l_{1}}\right)=\textstyle\prod_{k=l_{1}+1}^{l_{2}}\mathbf{W}_{k}\left(\mathbf{a}_{k-1}\right)\quad. \tag{10}$$

For example, the matrix $\mathbf{W}_{0\rightarrow L}(\mathbf{a}_{0}=\mathbf{x})=\mathbf{W}(\mathbf{x})$ models the linear transformation from the input to the output space, see Fig. 1. Since this linearity holds between any two layers, the $j$-th entry of any activation vector $\mathbf{a}_{l}$ in the network can be decomposed into input contributions via:

$$\mathbf{s}_{j}^{l}(\mathbf{x}_{i})=\left[\mathbf{W}_{0\rightarrow l}(\mathbf{x}_{i})\right]_{j}^{T}\odot\mathbf{x}_{i}\quad, \tag{11}$$

with $[\mathbf{W}]_{j}$ the $j$-th row in the matrix.
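The collapse in eqs. (9) to (11) can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' code: layers are fully connected, the dynamic weights use a plain L2 rescaling, and the helper `dynamic_layer` as well as all shapes are hypothetical. The product in eq. (10) is read here as ordered so that later layers multiply from the left, which is what repeated application of eq. (8) implies.

```python
import numpy as np

def dynamic_layer(a, A, B, b):
    """One layer of DAUs: returns (W_l(a) @ a, W_l(a)); L2 rescaling assumed."""
    v = np.einsum("kdr,kre,e->kd", A, B, a) + b
    W = v / np.linalg.norm(v, axis=1, keepdims=True)
    return W @ a, W

rng = np.random.default_rng(0)
widths = [6, 5, 4, 3]             # input dimension followed by three layer widths
r = 2
params = [(rng.normal(size=(k, d, r)), rng.normal(size=(k, r, d)),
           rng.normal(size=(k, d)))
          for d, k in zip(widths[:-1], widths[1:])]

x = rng.normal(size=widths[0])
a, W_total = x, np.eye(widths[0])
for A, B, b in params:
    a, W_l = dynamic_layer(a, A, B, b)
    W_total = W_l @ W_total       # eq. (10): later layers multiply from the left

j = 1
s_j = W_total[j] * x              # eq. (11): input contributions to output j
```

Here `W_total @ x` reproduces the network output `a`, and the entries of `s_j` sum to `a[j]`.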

Alignment maximisation (P2). Note that the output of a CoDA Net is bounded independent of the network parameters: since each DAU operation can, independent of its parameters, at most reproduce the norm of its input (eq. (3)), the linear concatenation of these operations necessarily also has an upper bound which does not depend on the parameters. Therefore, in order to achieve maximal outputs on average (e.g., the class logit over the subset of images of that class), all DAUs in the network need to produce weights $\mathbf{w}(\mathbf{a}_{l})$ that align well with the class features. In other words, the weights will align with discriminative patterns in the input. For example, in Fig. 2 (bottom), we visualise the ‘global matrices’ $\mathbf{W}_{0\rightarrow L}$ and the corresponding contributions (eq. (11)) for an $L=5$ layer CoDA Net. As before, the weights align with discriminative patterns in the input and do not encode the uninformative noise.
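To make the parameter-independent bound concrete, the following worked step sketches one way to derive it. It assumes fully connected layers of DAUs, that every dynamic weight vector satisfies $\lVert\mathbf{w}\rVert\le 1$ (in line with the bound cited from eq. (3)), and writes $n_{l}$ for the number of DAUs in layer $l$; this symbol is introduced here and does not appear in the original text.

```latex
\begin{align*}
\left|[\mathbf{a}_{l}]_{j}\right|
  &= \left|\mathbf{w}_{j}(\mathbf{a}_{l-1})^{T}\mathbf{a}_{l-1}\right|
   \le \lVert\mathbf{w}_{j}(\mathbf{a}_{l-1})\rVert\,\lVert\mathbf{a}_{l-1}\rVert
   \le \lVert\mathbf{a}_{l-1}\rVert \\
\lVert\mathbf{a}_{l}\rVert
  &\le \sqrt{n_{l}}\,\lVert\mathbf{a}_{l-1}\rVert \\
\left|[\mathbf{a}_{L}]_{j}\right|
  &\le \lVert\mathbf{x}\rVert\,\prod_{l=1}^{L-1}\sqrt{n_{l}}
\end{align*}
```

The final bound depends only on the input norm and the layer widths, not on the learnt parameters, so large outputs can only be reached through well-aligned weights.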

Temperature scaling and loss function.

Figure 4: By lowering the upper bound (cf. eq. (3)), the correlation maximisation in the DAUs can be emphasised. We show contribution maps for a model trained with different temperatures.

So far we have assumed that minimising the BCE loss for a given sample is equivalent to applying a maximisation or minimisation loss to the individual outputs of a CoDA Net. While this is in principle correct, BCE introduces an additional, non-negligible effect: saturation. Specifically, it is possible for a CoDA Net to achieve a low BCE loss without the need to produce well-aligned weight vectors. As soon as the classification accuracy is high and the outputs of the network are large, the gradient, and therefore the alignment pressure, will vanish. This effect can, however, easily be mitigated: as discussed in the previous paragraph, the output of a CoDA Net is upper-bounded independent of the network parameters, since each individual DAU in the network is upper-bounded. By scaling the network output with a temperature parameter $T$ such that $\hat{\mathbf{y}}(\mathbf{x})=T^{-1}\mathbf{W}_{0\rightarrow L}(\mathbf{x})\,\mathbf{x}$, we can explicitly decrease this upper bound and thereby increase the alignment pressure in the DAUs by avoiding the early saturation due to BCE. In particular, the lower the upper bound is, the stronger the induced DAU output maximisation should be, since the network needs to accumulate more signal to obtain large class logits (and thus a negligible gradient). This is indeed what we observe both qualitatively, cf. Fig. 4, and quantitatively, cf. Fig. 6 (right column). The overall loss for an input $\mathbf{x}_{i}$ and the target vector $\mathbf{y}_{i}$ is thus computed as

$$\mathcal{L}(\mathbf{x}_{i},\mathbf{y}_{i})=\text{BCE}(\sigma(T^{-1}\mathbf{W}_{0\rightarrow L}(\mathbf{x}_{i})\,\mathbf{x}_{i}+\mathbf{b}_{0})\,,\,\mathbf{y}_{i})\quad. \tag{12}$$

Here, $\sigma$ applies the sigmoid activation to each vector entry and $\mathbf{b}_{0}$ is a fixed bias term. As an alternative to the temperature scaling, the explicit representation of the network’s computation as a linear mapping allows us to directly regularise the properties of these linear mappings. For example, we show in the supplement that by regularising the absolute values of the matrix $\mathbf{W}_{0\rightarrow L}$, we can induce sparsity in the signal alignments, which also leads to sharper heatmaps.
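The saturation argument and the loss in eq. (12) can be illustrated with a small numerical sketch. The values of the unscaled outputs, the temperatures, and the bias below are hypothetical and chosen only to show the effect; `coda_loss` is a name introduced here, not part of any released code.

```python
import numpy as np

def coda_loss(raw_output, y, T, b0):
    """Eq. (12): BCE on sigmoid(raw_output / T + b0), summed over classes.

    raw_output stands for W_{0->L}(x) x, the unscaled network output.
    Returns the loss and its gradient with respect to raw_output.
    """
    z = raw_output / T + b0
    loss = np.sum(np.logaddexp(0.0, z) - y * z)   # stable form of sigmoid + BCE
    p = 1.0 / (1.0 + np.exp(-z))
    grad = (p - y) / T                            # chain rule through the 1/T scaling
    return loss, grad

raw = np.array([12.0, -10.0, -8.0])   # hypothetical unscaled outputs
y = np.array([1.0, 0.0, 0.0])         # one-hot target
b0 = 0.0                              # illustrative value for the fixed bias

loss_lo, grad_lo = coda_loss(raw, y, T=1.0, b0=b0)
loss_hi, grad_hi = coda_loss(raw, y, T=4.0, b0=b0)
```

For the same unscaled outputs, the larger temperature leaves a clearly larger loss and gradient, so the network has to keep increasing its (bounded) outputs, and hence its alignment, before the loss saturates.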

4 Experimental setup

4.1 Datasets

We evaluate and compare the accuracies of the CoDA Nets to other work on the CIFAR-10 [26] and the TinyImagenet [27] datasets. We use the same datasets for the quantitative evaluations of the model-inherent contribution maps. Additionally, we qualitatively show high-resolution examples from a CoDA Net trained on the first 100 classes of the Imagenet [28] dataset. Lastly, we evaluate hybrid models (see sec. 4.4) on CIFAR-10 and the full Imagenet dataset, in terms of both interpretability and classification accuracy.

4.2 Models

Our results (secs. 5.1-5.3) are based on models of various sizes denoted by (S/L/XL)-CoDA on CIFAR-10 (S), Imagenet-100 (L), and TinyImagenet (XL); these models have 7-8M (S, see footnote 4), 48M (L), and 62M (XL) parameters respectively; see the supplement for details on the model architectures and an evaluation of the impact of model size on accuracy. For the hybrid networks (see secs. 4.4 and 5.4), we use a ResNet-56 (ResNet-50) as a base model on CIFAR-10 (Imagenet) and train CoDA Nets on feature maps extracted at different depths of the base models; see the supplement for details.

Footnote 4: The models with SQ and L2 non-linearity have 7.8M parameters and the models with WB have 7.1M (without embedding) and 7.2M (with embedding) parameters.

4.3 Input encoding

In sec. 3.1, we discussed that the norm-weighted cosine similarity between the dynamic weights and the layer inputs is optimised and the output of a DAU is at most the norm of its input. When using pixels as the input to the CoDA Nets, this favours pixels with large RGB values, since these have a larger norm and can thus produce larger outputs in the maximisation task. In our experiments, we explore two approaches to mitigate this bias: in the first, we add the negative image as three additional color channels and thus encode each pixel in the input as $[r, g, b, 1-r, 1-g, 1-b]$, with $r,g,b\in[0,1]$.
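The first encoding can be written down directly. The sketch below assumes a channels-last image layout of shape (H, W, 3) with values in [0, 1]; the function name `encode_with_negative` is chosen here for illustration.

```python
import numpy as np

def encode_with_negative(image):
    """Map an (H, W, 3) RGB image in [0, 1] to (H, W, 6): [r, g, b, 1-r, 1-g, 1-b]."""
    image = np.asarray(image, dtype=np.float64)
    return np.concatenate([image, 1.0 - image], axis=-1)

rng = np.random.default_rng(0)
img = rng.uniform(size=(4, 4, 3))
img[0, 0] = 0.0                         # a black pixel
img[0, 1] = 1.0                         # a white pixel

enc = encode_with_negative(img)
norms = np.linalg.norm(enc, axis=-1)    # per-pixel norm seen by the first layer
```

With this encoding, black and white pixels have the same norm, and every pixel norm lies between $\sqrt{1.5}$ and $\sqrt{3}$, so bright pixels no longer dominate simply because of their magnitude.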

Secondly, we show that it is also possible to train CoDA Nets on end-to-end optimised patch-embeddings and obtain similar performance in terms of interpretability and classification accuracy. Instead of computing the per-pixel contributions to assess the importance of spatial locations (cf. eq. (11)), in our experiments we thus decompose the output with respect to the contributions from the corresponding (learnt) embeddings via

$$\mathbf{s}_{j}^{L}(\mathbf{x}_{i})=\left[\mathbf{W}_{0\rightarrow L}(E(\mathbf{x}_{i}))\right]_{j}^{T}\odot E(\mathbf{x}_{i})\quad, \tag{13}$$

with $E(\cdot)$ denoting the applied embedding function and $L$ the number of CoDA layers in the network.

4.4 Interpolating between networks

Training CoDA Nets on learnt patch-embeddings (see sec. 4.3) naturally raises the question of how complex the embedding function should be and how large its receptive field. In particular, are pixel-wise importance values more useful than importance values for embeddings of patches of size $3\times 3$? How about $7\times 7$ or $64\times 64$? Of course, there is no single answer to this question and the ‘optimal’ complexity of the embedding model depends on the dataset, the task, and, ultimately, on the preferences of the end-user of such a model: for example, if a more complex embedding allows for more performant classifiers, one might wish to trade off model interpretability against model accuracy. In order to better understand such trade-offs, we propose to ‘interpolate’ between a conventional CNN and the CoDA Nets and investigate how this affects both model interpretability and model performance. Specifically, starting from a pre-trained CNN, we successively replace an increasing number of the later layers of the base model by CoDA layers. As such, the model output can be decomposed into contributions coming from spatially arranged embeddings computed by the truncated CNN model, which can give insights into how the embeddings are used to produce the classification results.

4.5 Additional details

In our experiments, we observed that rescaling the weight vectors of the DAUs explicitly according to eq. (2) resulted in long training times and high memory usage. To mitigate this, we opted to share the matrix $\mathbf{B}$ between all DAUs in a given layer when using the L2 or SQ non-linearity. This increases efficiency by having the DAUs share a common $r$-dimensional subspace and still fixes the maximal rank of each DAU to the chosen value of $r$. In contrast, networks with the WB non-linearity (see eDAUs in eq. (5)) are specifically designed to lower the computational costs of the DAUs and are easier to train. Therefore, for CoDA Nets built with eDAUs, we do not share the matrices $\mathbf{B}$ between the eDAUs. As the inputs are thus not restricted to a common low-dimensional subspace, we expect this to increase the modelling capacity of the CoDA Nets.
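The effect of sharing $\mathbf{B}$ can be sketched as follows. This is an illustrative fully connected version with an assumed L2 rescaling, not the convolutional implementation from the supplement; `dau_layer_shared_B` and the shapes are hypothetical. The point is that the projection $\mathbf{B}\mathbf{x}$ is computed once per layer and reused by all DAUs.

```python
import numpy as np

def dau_layer_shared_B(x, A, B, b):
    """Layer of k DAUs sharing one projection B (L2 rescaling assumed).

    A: (k, d, r), B: (r, d), b: (k, d).
    """
    z = B @ x                                          # shared r-dim projection
    v = A @ z + b                                      # (k, d): A_j z + b_j
    W = v / np.linalg.norm(v, axis=1, keepdims=True)
    return W @ x, W

rng = np.random.default_rng(0)
d, r, k = 8, 3, 5
A = rng.normal(size=(k, d, r))
B = rng.normal(size=(r, d))
b = rng.normal(size=(k, d))
x = rng.normal(size=d)

y_shared, W_shared = dau_layer_shared_B(x, A, B, b)

# Reference: the same layer with an explicit copy of B for every DAU.
v_ref = np.einsum("kdr,kre,e->kd", A, np.tile(B, (k, 1, 1)), x) + b
W_ref = v_ref / np.linalg.norm(v_ref, axis=1, keepdims=True)

params_shared = A.size + B.size + b.size
params_unshared = A.size + k * B.size + b.size
```

In this toy setting the shared layer stores `r * d` projection parameters instead of `k * r * d`, at the cost of restricting all DAUs in the layer to the same $r$-dimensional subspace.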

5 Experiments

Figure 5: Model-inherent contribution maps for the most confident predictions for 18 different classes, sorted by confidence (high to low). We show positive (negative) contributions (eq. (11)) per spatial location for the ground truth class logit in red (blue).

In sec. 5.1 we assess the classification performance of the CoDA Nets. Further, in sec. 5.2 we evaluate the model-inherent contribution maps derived from $\mathbf{W}_{0\rightarrow L}$ (cf. eq. (13)) of a CoDA Net and compare them both qualitatively (cf. Fig. 5) as well as quantitatively (cf. Fig. 6) to other attribution methods. Additionally, in sec. 5.3, we discuss the impact of the different rescaling methods (cf. eqs. (2) and (5)) on model interpretability and evaluation speed. Lastly, in sec. 5.4 we investigate the hybrid models discussed in sec. 4.4. In particular, we analyse how the depth of the embedding function $E(\mathbf{x})$ affects the model interpretability at different depths of the resulting hybrid network architecture.

5.1 Model performance

| Model | C10 | Model | T-IM |
|---|---|---|---|
| SENNs [18] | 78.5% | ResNet-34 [29] | 52.0% |
| DE-CapsNet [30] | 93.0% | VGG 16 [31] | 52.2% |
| VGG-19 [32] | 93.4% | VGG 16 + aug [31] | 56.4% |
| ResNet-110 [33] | 93.6% | IRRCNN [34] | 52.2% |
| DenseNet [35] | 94.8% | ResNet-110 [36] | 56.6% |
| WRN-28-2 [37] | 94.9% | WRN-40-20 [38] | 63.8% |
| S-CoDA-SQ | 93.2% | XL-CoDA-SQ | 54.4% |
| S-CoDA-L2 | 93.0% | XL-CoDA-SQ + aug | 58.4% |
| S-eCoDA-WB | 94.0% | | |
| S-eCoDA-WB + $E(\mathbf{x})$ | 94.1% | | |

TABLE I: CIFAR-10 (C10) and TinyImagenet (T-IM) classification accuracies. Results taken from specified references. The prefix of the CoDAs indicates model size, the suffix the non-linearity used (cf. eqs. (2) and (5)). Further, $E(\mathbf{x})$ denotes that a learnt embedding was used as input to the model (see sec. 4.3).

Classification performance. In Table I we compare the performances of our CoDA Nets to several other published results. Note that the referenced numbers are meant to be used as a gauge for assessing the CoDA Net performance and do not exhaustively represent the state of the art. In particular, we would like to highlight that the CoDA Net performance is on par with that of models from the VGG [39] and ResNet [33] model families on both datasets. Additionally, we list the reported results of the SENNs [18] and the DE-CapsNet [30] architectures for CIFAR-10. Similar to our CoDA Nets, the SENNs were designed to improve network interpretability and are also based on the idea of explicitly modelling the output as a dynamic linear transformation of the input. On the other hand, the CoDA Nets share similarities to capsule networks, which we discuss in the supplement; to the best of our knowledge, the DE-CapsNet currently achieves the state of the art in the field of capsule networks on CIFAR-10. Overall, we observed that the CoDA Nets deliver competitive performances that are fairly robust to the non-linearity (see eqs. (2) and (5)) and the temperature ($T$); for an ablation study on the latter, see the supplement. Finally, while all models achieve good classification results, we note that the WB-based CoDA Nets perform slightly better than CoDA Nets with SQ or L2 non-linearity despite having a comparable number of parameters. As discussed in sec. 4.5, we attribute this to the fact that for those models we do not share the matrix $\mathbf{B}$ within the CoDA layers, which increases their modelling capacity.

5.2 Interpretability of CoDA Nets

Figure 6: Top row: Results for the localisation metric, see eq. (14). Bottom row: Pixel removal metric. In particular, we plot the mean target class probability after removing the $x\%$ of the least important pixels. We show the results of a CoDA-Net-SQ trained on TinyImagenet (left column), as well as of a CoDA-Net-WB trained on CIFAR-10 (center column). We observed the results for different rescaling methods (SQ/L2/WB) to be very similar and therefore just show one per dataset; more results can be found in the supplement. Additionally, we show the effect of the temperature parameter on the interpretability of a CoDA-Net with SQ rescaling (right column): as expected, a higher temperature leads to higher interpretability (sec. 3.4).

In the following, we evaluate the model-inherent contribution maps and compare them to other commonly used methods for importance attribution. The evaluations are based on the XL-CoDA-SQ (T=6400) for TinyImagenet and the S-eCoDA-WB (T=1e6) for CIFAR-10, see Table I for the respective accuracies. The results are very similar for all three non-linearities (cf. sec. 5.3; more results are included in the supplement) and we therefore just show one per dataset as an example. Further, we evaluate the effect of training the S-CoDA-SQ architecture with different temperatures $T$; as discussed in sec. 3.4, we expect the interpretability to increase along with $T$, since for larger $T$ a stronger alignment is required in order for the models to obtain large class logits. Lastly, in sec. 5.3 we compare how the different non-linearities (L2, SQ, WB) affect the model interpretability. Before turning to the results, however, we first present the attribution methods used for comparison and discuss the evaluation metrics employed for quantifying their interpretability.

Attribution methods. We compare the model-inherent contribution maps (cf. eq. (11)) to other common approaches for importance attribution. In particular, we evaluate against several perturbation-based methods such as RISE [14], LIME [15], and several occlusion attributions [40] (Occ-K, with K the size of the occlusion patch). Additionally, we evaluate against common gradient-based methods. These include the gradient of the class logits with respect to the input image [41] (Grad), ‘Input×Gradient’ (IxG, cf. [3]), GradCam [7] (GCam), Integrated Gradients [9] (IntG), and DeepLIFT [8]. As a baseline, we also evaluated these methods on a pre-trained ResNet-56 [33] on CIFAR-10, for which we show the results in the supplement.

Evaluation metrics. Our quantitative evaluation of the attribution maps is based on the following two methods: we (1) evaluate a localisation metric by adapting the pointing game [42] to the CIFAR-10 and TinyImagenet datasets, and (2) analyse the model behaviour under the pixel removal strategy employed in [10]. For (1), we evaluate the attribution methods on a grid of $n\times n$ with $n=3$ images sampled from the corresponding datasets; in every grid of images, each class may occur at most once. For a visualisation with $n=2$, see Fig. 7.


Figure 7: A multi-image on the CIFAR-10 dataset. The CoDA-Net contribution maps highlight the individual class-images well.

Figure 8: Comparison to the strongest post-hoc methods. While the regions of importance roughly coincide, the inherent contribution maps of the CoDA-Nets offer the most detail. Note that to improve the RISE visualisation, we chose its default colormap [14]; the most (least) important values are still shown in red (blue).

For each occurring class, we can measure how much positive importance an attribution method assigns to the respective class image. Let $\mathcal{I}_{c}$ be the image for class $c$, then the score $s_{c}$ for this class is calculated as

$$s_{c}=\frac{1}{Z}\sum_{p_{c}\in\mathcal{I}_{c}}p_{c}\quad\text{with}\quad Z=\sum_{k}\sum_{p_{c}\in\mathcal{I}_{k}}p_{c}\quad, \tag{14}$$

with $p_{c}$ the positive attribution for class $c$ assigned to the spatial location $p$. This metric has the same clear oracle score $s_{c}=1$ for all attribution methods (all positive attributions located in the correct grid image) and a clear score for completely random attributions $s_{c}=1/n^{2}$ (the positive attributions are uniformly distributed over the different grid images). Since this metric depends on the classification accuracy of the models, we sample the first 500 (CIFAR-10) or 250 (TinyImagenet) images according to their class score for the ground-truth class (footnote 5); note that since all attributions are evaluated for the same model on the same set of images, this does not favour any particular attribution method.

Footnote 5: We can only expect an attribution to specifically highlight a class image if this image can be correctly classified on its own. If all grid images have similarly low attributions, the localisation score will be random.
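The localisation score in eq. (14) can be computed with a few array operations. The sketch below assumes that the grid consists of equally sized images arranged in an $n\times n$ layout and that the attribution map for class $c$ covers the whole grid; the function name `localisation_scores` is hypothetical. It returns the share of positive attribution falling into every grid cell, so $s_{c}$ is the entry of the cell that contains the image of class $c$.

```python
import numpy as np

def localisation_scores(attr, n):
    """Eq. (14) for one attribution map over an n x n grid of equal-sized images.

    attr: (n*h, n*w) attribution map for class c over the whole grid.
    Returns an (n, n) array with the fraction of positive attribution per cell.
    """
    pos = np.clip(attr, 0.0, None)                 # keep positive attributions only
    h, w = attr.shape[0] // n, attr.shape[1] // n
    cell_sums = pos.reshape(n, h, n, w).sum(axis=(1, 3))
    Z = cell_sums.sum()
    return cell_sums / Z if Z > 0 else np.full((n, n), np.nan)

# Oracle case: all positive attribution lies in grid cell (1, 2).
attr = -np.ones((12, 12))
attr[4:8, 8:12] = 2.0
oracle = localisation_scores(attr, n=3)

# Uninformative case: positive attribution spread uniformly over the grid.
uniform = localisation_scores(np.ones((12, 12)), n=3)
```

The two toy inputs reproduce the reference values mentioned in the text: a score of 1 for the oracle case and $1/n^{2}$ per cell for uniformly spread attributions.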
For (2), we show how the model’s class score behaves under the removal of an increasing number of the least important pixels, where the importance is obtained via the respective attribution method. Since the first pixels to be removed are typically assigned negative or relatively little importance, we expect the model to initially increase its confidence (removing pixels with negative impact) or maintain a similar level of confidence (removing pixels with low impact) if the evaluated attribution method produces an accurate ranking of the pixel importance values. Conversely, if we were to remove the most important pixels first, we would expect the model confidence to quickly decrease. However, as noted by [10], removing the most important pixels first introduces artifacts in the most important regions of the image and is therefore potentially more unstable than removing the least important pixels first. Nevertheless, the model-inherent contribution maps perform well in this setting, too, as we show in the supplement. Lastly, in the supplement we qualitatively show that they pass the ‘sanity check’ of [3].
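The pixel removal protocol can be sketched as a short loop. This is an illustration under stated assumptions rather than the exact evaluation code: `model` is a hypothetical callable returning a scalar class score, removed pixels are set to a constant `fill` value, and the toy linear model at the end only serves to make the expected behaviour visible.

```python
import numpy as np

def removal_curve(model, x, importance, fractions, fill=0.0):
    """Class score after removing the least important pixels first.

    model: callable mapping an input array to a scalar class score (a
    hypothetical stand-in for the classifier); importance: same shape as x;
    fractions: iterable of values in [0, 1]. Removed pixels are set to `fill`.
    """
    order = np.argsort(importance, axis=None)      # ascending: least important first
    scores = []
    for f in fractions:
        n_remove = int(round(f * x.size))
        x_pert = x.copy().ravel()
        x_pert[order[:n_remove]] = fill
        scores.append(model(x_pert.reshape(x.shape)))
    return np.array(scores)

# Toy linear "model" whose exact contributions are w * x.
w = np.array([-2.0, -1.0, 0.5, 3.0])
x = np.ones(4)
scores = removal_curve(lambda v: float(w @ v), x, w * x,
                       fractions=[0.0, 0.25, 0.5, 0.75, 1.0])
```

In the toy example the score first rises while pixels with negative contributions are removed and only drops once pixels with positive contributions are affected, which is the pattern an accurate importance ranking should produce.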

Quantitative results. In Fig. 6, we compare the contribution maps of the CoDA Nets to other attributions under the evaluation metrics discussed above. It can be seen that the CoDA Nets (1) perform well under the localisation metric given by eq. (14) and outperform all the other attribution methods evaluated on the same model, both for TinyImagenet (top row, left) and CIFAR-10 (top row, center); note that we excluded RISE and LIME on CIFAR-10, since the default parameters do not seem to transfer well to this low-resolution dataset. Moreover, (2) the CoDA Nets perform well in the pixel-removal setting: the least salient locations according to the model-inherent contributions indeed seem to be among the least relevant for the given class score on both datasets, see Fig. 6 (bottom row, left and center); note that the Occ-K explanations directly estimate the impact of occluding pixels and are thus expected to perform well under this metric. Further, in Fig. 6 (right column), we show the effect of temperature scaling on the interpretability of CoDA Nets with SQ rescaling trained on CIFAR-10. The results indicate that the alignment maximisation is indeed crucial for interpretability and constitutes an important difference between the CoDA Nets and other dynamic linear networks such as piece-wise linear networks (ReLU-based networks). In particular, by structurally requiring a strong alignment for confident classifications, the interpretability of the CoDA Nets forms part of the optimisation objective. Increasing the temperature increases the alignment and thereby the interpretability of the CoDA Nets. While we observe a downward trend in classification accuracy when increasing $T$, the best model at $T=10$ only slightly improved the accuracy compared to $T=1000$ (93.2% → 93.6%); for more details, see supplement.

In summary, the results show that by combining dynamic linearity with a structural bias towards an alignment with discriminative patterns, we obtain models which inherently provide an interpretable linear decomposition of their predictions. Further, given that we better understand the relationship between the intermediate computations and the optimisation of the final output in the CoDA Nets, we can emphasise model interpretability in a principled way by increasing the ‘alignment pressure’ via temperature scaling.

Qualitative results. In Fig. 5, we visualise spatial contribution maps of an L-CoDA-SQ model (trained on Imagenet-100) for some of its most confident predictions. Note that these contribution maps are linear decompositions of the output and the sum over these maps yields the respective class logit. In Fig. 8, we additionally present a visual comparison to the best-performing post-hoc attribution methods; note that RISE cannot be displayed well under the same color coding and we thus use its default visualisation. We observe that the different methods are not inconsistent with each other and roughly highlight similar regions. However, the inherent contribution maps are of much higher detail and compared to the perturbation-based methods do not require multiple model evaluations. Much more importantly, however, all the other methods are attempts at approximating the model behaviour post-hoc, while the CoDA Net contribution maps in Fig. 5 are derived from the model-inherent linear mapping that is used to compute the model output.
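The statement that the contribution maps sum to the class logit can be checked on a small dynamic linear model. The sketch below uses two toy layers whose units apply L2-normalised, input-dependent weight vectors; this is a simplification assumed for illustration (no convolutions, no biases, one full matrix per unit instead of a low-rank factorisation), and `dau_layer` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dau_layer(M, x):
    """Toy layer of dynamic units: unit j applies w_j(x) = M_j x / ||M_j x|| to x.

    M has shape (units, d_in, d_in). Returns the dynamic matrix W(x) and W(x) @ x.
    """
    V = M @ x                                          # (units, d_in)
    W = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm rows
    return W, W @ x

x = rng.normal(size=12)              # flattened toy input
M1 = rng.normal(size=(8, 12, 12))    # hidden layer with 8 units
M2 = rng.normal(size=(3, 8, 8))      # output layer with 3 classes

W1, h = dau_layer(M1, x)
W2, logits = dau_layer(M2, h)

# The whole model acts on this particular x like one matrix W_total = W2 W1.
W_total = W2 @ W1                    # (3, 12)
c = int(np.argmax(logits))
contributions = W_total[c] * x       # one contribution per input dimension

print(np.isclose(contributions.sum(), logits[c]))   # True
```

Because the logit is exactly `W_total[c] @ x`, summing the element-wise products recovers it without any approximation, which is what distinguishes such a decomposition from a post-hoc estimate.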

5.3 Interpretability and efficiency of L2, SQ, and WB

In the following, we present results regarding the impact of the different normalisation methods (L2, SQ, WB) on model interpretability and model efficiency.

Figure 9: Localisation (left) and perturbation (right) metric results on CIFAR-10 evaluated for CoDA Networks with different non-linearities (cf. eqs. (2) and (5)) as well as trained with a learnt patch-embedding $E(\mathbf{x})$, here denoted by $E$-WB. For the model with embedding function $E(\mathbf{x})$, we evaluate the pixel perturbation metric (right) directly on the pixels (red crosses) as well as on the learnt patch-embeddings (blue crosses). We further added the models' accuracies (see Table I) in the plot to the left for comparison (circle / square / cross markers).

Model interpretability. In Fig. 9, we show the results of the interpretability metrics for models with different rescaling functions (L2/SQ/WB, see eqs. (2) and (5)) as well as for a model trained with a learnt patch-embedding $E(\mathbf{x})$. As an embedding function, we simply apply a 3x3 convolution with 32 filters, followed by a batch normalisation layer [43]. For comparison to post-hoc methods evaluated on a CoDA Net, we kindly refer the reader to the center column of Fig. 6. As can be seen, it is possible to obtain highly interpretable models under all four settings: (1) the linear contributions allow for localising the class images well (localisation metric, left) and (2) the models are insensitive to input features that do not contribute to the output as per the linear transformation matrix $\mathbf{W}_{0\rightarrow L}$. [Footnote 6: For the SQ, L2, and WB models, the features are pixels under the static encoding function described in sec. 4.3. For the $E$-WB model, the input features to the CoDA Net are learnt patch embeddings.] Note that for the model with a learnt patch-embedding, denoted by $E$-WB, we show two results for the perturbation metric. First, we 'zero out' the embeddings at each location ordered by their assigned importance (blue crosses). As the embeddings are the input features to the CoDA Net, the model confidence shows the expected behaviour of being insensitive to unimportant inputs. In contrast, the assigned importance values do not translate to the center pixels of the embeddings: when zeroing out the center pixels according to the contributions of the patch-embeddings, the model confidence drops more quickly (see red crosses). This distinction is important to keep in mind when evaluating CoDA Nets on input embeddings, since it is easy to wrongly interpret such contribution maps. 
If the input to the CoDA Net is an embedding of an image patch, it depends on the embedding function how the contributions are to be distributed to the image pixels. Lastly, note that, unlike in the center column of Fig. 6, the metrics are evaluated for four different models and are thus not comparisons between different explanation methods, but rather between different models under the same explanation. As such, the differences in the localisation metric, for example, do not show that the linear decompositions are generally better suited to explain WB-based models as compared to SQ- or L2-based models; the differences might instead reflect the fact that the models learnt more robust and class-specific representations, which yield both better results in the localisation task as well as higher classification accuracy.
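The perturbation procedure described above (zeroing out features in order of their assigned importance and tracking the model output) can be sketched as follows. The helper `removal_curve` and the toy linear model are hypothetical; for the $E$-WB model, the same loop would be run once on the embedding vectors and once on the pixels, which is where the two curves in Fig. 9 differ.

```python
import numpy as np

def removal_curve(predict, x, importance, steps=5):
    """Zero out input features from least to most important and record the score.

    predict:    callable mapping an input vector to a scalar class score.
    importance: one value per input feature, e.g. its linear contribution.
    """
    order = np.argsort(importance)            # least important first
    scores = [predict(x)]
    x_pert = x.copy()
    for chunk in np.array_split(order, steps):
        x_pert[chunk] = 0.0                   # remove the next group of features
        scores.append(predict(x_pert))
    return np.array(scores)

# Toy linear "model": the contribution of feature i is exactly w[i] * x[i].
w = np.array([0.0, 0.0, 0.5, 1.0, 2.0, 3.0])
x = np.ones(6)
predict = lambda v: float(w @ v)

curve = removal_curve(predict, x, importance=w * x, steps=3)
print(curve)    # [6.5 6.5 5.  0. ]
```

A flat start of the curve, as in this toy case, indicates that the features ranked as least important indeed do not affect the score.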

Model efficiency.

Figure 10: We show the speed-up per forward pass of the models with WB and SQ rescaling compared to L2 (left) as well as the GPU memory consumption (right) of the three different models in Table I for different batch sizes for both measures. While the models perform similarly for small batch sizes, the WB-based model scales better to large inputs.

While all three non-linearities can yield interpretable CoDA Nets, the computational cost of the different approaches for bounding the DAU outputs differs. For example, by avoiding the explicit calculation of the $d$-dimensional weight vector in eq. (5), the eDAUs are able to save both memory and floating point operations: the computed vectors are of size $r \ll d$ and the dot-product in the low-dimensional space requires $O(r)$ operations instead of $O(d)$. Being the fundamental building block of the CoDA Nets, such gains in efficiency can have considerable impact, since the corresponding computations are performed in every layer for every unit and at each spatial position of the input to the respective layer. Accordingly, in practice we observed that the weight bounding approach in the eDAUs (eq. (5)) can yield significant speed-ups and memory savings, especially for high-dimensional inputs. [Footnote 7: In our experiments, we rely on the highly optimised implementations for convolutions from the PyTorch [44] library.] For example, in Fig. 10 we plot the memory consumption and forward-pass speeds for the three different models without learnt embedding function (see Table I) for varying batch sizes on the CIFAR-10 dataset: while SQ and L2 perform similarly, the WB-based model scales better to larger inputs.
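The saving from avoiding the explicit $d$-dimensional weight vector rests on a simple identity, sketched below under the assumption that the unnormalised dynamic weight has the low-rank form $\mathbf{A}\mathbf{B}\mathbf{x}$ with $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times d}$. The use of the Frobenius norm in the bound is likewise an assumption about eq. (5), which is not reproduced in this section; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16                     # r << d
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, d))
x = rng.normal(size=d)

# Explicit route: build the d-dimensional weight, then take a d-dimensional dot product.
w_full = A @ (B @ x)
out_explicit = w_full @ x

# Low-dimensional route: (A B x)^T x = (B x)^T (A^T x), a dot product of two r-vectors.
u = B @ x                          # r-dimensional
v = A.T @ x                        # r-dimensional
out_lowdim = u @ v

# ASSUMED bound for normalisation: ||A B x|| <= ||A||_F ||B x||, which is
# computable without ever forming the d-dimensional vector A B x.
bound = np.linalg.norm(A, ord="fro") * np.linalg.norm(u)

print(np.isclose(out_explicit, out_lowdim))     # True
print(np.linalg.norm(w_full) <= bound)          # True
```

Note that the two projections themselves still touch all $d$ input dimensions; what is avoided is materialising and normalising a separate $d$-dimensional weight vector for every unit and spatial position.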

Additionally, we measured memory consumption and training time for two models with the same architecture on Imagenet (L-CoDA, see beginning of sec. 5) for the SQ and the WB rescaling methods. For this, we updated the models $\approx 8000$ times with a batch size of 16 and recorded the overall time as well as the GPU memory consumption. In these experiments, the WB-based model required more than $3\times$ less memory (9.7GB vs. 30.0GB) and completed the updates more than $1.5\times$ faster (8.7 minutes vs. 14.1 minutes). All experiments regarding evaluation speed were performed on an NVIDIA Quadro RTX 8000 GPU with 48GB of memory.

5.4 Hybrid CoDA Networks

In this section, we assess the interpretability of hybrid CoDA Nets, which combine conventional CNN layers and CoDA layers in one network model. For our experiments, we use varying numbers of pre-trained CNN layers as feature extractors on top of which a CoDA Net is trained as a classifier. Such a hybrid structure can prove useful in cases where CoDA Nets do not (yet) yield the same accuracy as conventional architectures; we kindly refer the reader to the supplement for details on the network architectures used in these experiments.

In particular, we use the first $K$ layers of a pre-trained ResNet model [33] as feature extractors. Since ResNets are piece-wise linear models, the hybrids are still dynamic linear and we can assign importance values to input features according to their effective linear contribution; importantly, the input features can be extracted at any depth of the network as the output of a CoDA layer or a ResNet block, or as the actual input pixels. In order to assess whether such hybrids are more interpretable than the base model, we compute spatial contribution maps with respect to different activation maps within the network and evaluate them under the localisation metric (see sec. 5.2, 'Evaluation metrics'). [Footnote 8: The contribution maps according to the dynamic linear mapping can be obtained via 'Input$\times$Gradient', where for the gradient calculation we treat the dynamic matrices in the CoDA layers as fixed.]
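For the piece-wise linear part of such a hybrid, the link between 'Input$\times$Gradient' and the effective linear contribution can be verified numerically. The sketch below uses a tiny bias-free ReLU network as a stand-in for the ResNet layers (an assumption made for brevity; real ResNets also contain biases and batch normalisation) and compares a finite-difference gradient with the effective matrix. For CoDA layers, the dynamic matrices additionally have to be held fixed during the gradient computation, as stated in the text; this is not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(3, 6))
x = rng.normal(size=4)

def forward(v):
    return W2 @ np.maximum(W1 @ v, 0.0)     # bias-free ReLU network (toy stand-in)

# Effective linear map at x: around x, the ReLU acts as a fixed 0/1 mask.
mask = (W1 @ x > 0).astype(float)
W_eff = W2 @ (mask[:, None] * W1)           # (3, 4)

# Numerical gradient of logit c with respect to the input (central differences).
c, eps = 0, 1e-6
grad = np.array([(forward(x + eps * e)[c] - forward(x - eps * e)[c]) / (2 * eps)
                 for e in np.eye(4)])

input_x_grad = x * grad
linear_contrib = W_eff[c] * x

print(np.allclose(input_x_grad, linear_contrib, atol=1e-5))   # True
print(np.isclose(linear_contrib.sum(), forward(x)[c]))        # True
```

Both quantities coincide here because the network is exactly linear in a neighbourhood of `x`; the experiments in this section then ask whether such linear contributions are also useful explanations.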

Figure 11: Localisation metric (mean and standard deviation) for different ’explanation depths’ evaluated on four hybrid models trained on CIFAR-10 (left) and Imagenet (right). Additionally, we show the localisation results of the pretrained ResNets (denoted by R5-C0) that were used as base models. For each evaluation, we extract the effectively applied linear transformations up to a certain depth and compute the corresponding linear contributions to the output logits coming from individual positions in the activation maps. We then use the resulting maps as an explanation of the output logit and assess how well these maps allow for localising the corresponding class images in the localisation task. As can be seen, the more layers of the original ResNet architectures are replaced by CoDA layers, the larger is the ’interpretable depth’.

CIFAR-10. For the following experiments, we use a pre-trained ResNet-56 obtained from [45]. This model consists of a convolutional layer + batch normalisation [43] (C+B), followed by three times nine residual blocks (RBs) as well as a fully connected and a pooling (FC+P) layer; for more details we kindly refer the reader to the original work [33] and the implementation [45] on which we base these experiments. We can summarise this model by [C+B, 9RB, 9RB, 9RB, FC+P]; we will further denote individual segments $\mathcal{S}_i$ of the model by their index in this summary counting from the back, e.g., $\mathcal{S}_5$=[C+B] and $\mathcal{S}_1$=[FC+P]. In order to evaluate the interpretability of this model at different depths $t$, we split it at different points into two virtual parts: an embedding function $E_t(\mathbf{x})$ and a classification head ($\mathrm{CH}_t$). 
For a given split, we then regard the output of $E_t(\mathbf{x})$ as the input to the classification head and linearly decompose the latter according to the respective linear transformation performed by the model; e.g., with an explanation depth of $2$ we refer to the split in which $\mathcal{S}_2$ is the first element in $\mathrm{CH}_2$ and we evaluate linear contribution maps obtained for the classification head $\mathrm{CH}_2$=[9RB, FC+P] on the preprocessed input $E_2(\mathbf{x})$=[C+B, 9RB, 9RB]$(\mathbf{x})$. By performing this evaluation for various splits, we can assess the 'interpretable depth' of a model. In particular, we evaluate how well the contribution maps at different depths allow for localising the correct class images in the localisation task.
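The indexing convention for the splits (segments counted from the back, with segment t being the first element of the classification head) can be written down in a few lines. The names `SEGMENTS` and `split_at_depth` are hypothetical and only encode the bookkeeping described above, not any model code.

```python
# Segment summary of the ResNet-56 from the text: S_5, S_4, S_3, S_2, S_1.
SEGMENTS = ["C+B", "9RB", "9RB", "9RB", "FC+P"]

def split_at_depth(segments, t):
    """Return (E_t, CH_t) as lists of segments; S_t is the first element of CH_t."""
    if not 1 <= t <= len(segments):
        raise ValueError("explanation depth out of range")
    cut = len(segments) - t          # segments are indexed from the back
    return segments[:cut], segments[cut:]

embedding, head = split_at_depth(SEGMENTS, 2)
print(embedding)   # ['C+B', '9RB', '9RB']
print(head)        # ['9RB', 'FC+P']
```

With `t = 5`, the embedding part is empty and the contributions are computed with respect to the input pixels.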

In order to investigate the effect of CoDA layers on the interpretable depth, we train and evaluate four different hybrid models. For this, we replace an increasing number of segments $\mathcal{S}_i$ by CoDA layers, starting from $\mathcal{S}_1$; in Fig. 11 (left), we denote the base model by R5-C0 (5 ResNet segments, 0 CoDA segments) and the hybrids according to the number of replaced segments, e.g., for R3-C2 we replaced the last two segments by CoDA layers. [Footnote 9: In detail, each segment of 9RB is replaced by a set of 3 CoDA layers. The final network segment [FC+P] is replaced by a single CoDA layer followed by a global pooling operation.] For each of these models, we can decompose the model outputs in terms of contributions from spatial positions for the embedding functions $E_t$ defined by different splits $t$. Note that the base model (ResNet-56) is piece-wise linear and we can thus still compute linear contributions at any depth of this hybrid network. In Fig. 11 (left) we show the results of the localisation metric for all networks at various depths; the classification accuracies can be found in Table II. As can be seen, the linear contributions are good explanations of the class logits as long as the classification head entirely consists of CoDA layers, and the localisation performance drops as soon as we include a segment with ResNet blocks in the classification head CH. Again, this highlights that dynamic linearity alone is not enough to obtain useful linear decompositions of the model outputs, but that the alignment property is crucial for the interpretability of the CoDA layers.
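The naming scheme of the CIFAR-10 hybrids and the replacement rule from footnote 9 can be summarised as follows. This is a purely descriptive sketch: `hybrid_layout` and the string labels are hypothetical, and the case of replacing the stem [C+B] is left out because the text does not describe it.

```python
RESNET56 = ["C+B", "9RB", "9RB", "9RB", "FC+P"]

def hybrid_layout(n_coda):
    """Name and segment layout of the hybrid with the last n_coda segments replaced."""
    if not 0 <= n_coda <= 4:
        raise ValueError("only the four trailing segments are replaced in the text")
    keep = len(RESNET56) - n_coda
    layout = list(RESNET56[:keep])
    for segment in RESNET56[keep:]:
        if segment == "9RB":
            layout.append("3xCoDA")            # 9 residual blocks -> 3 CoDA layers
        else:
            layout.append("CoDA+GlobalPool")   # FC+P -> 1 CoDA layer + global pooling
    return f"R{keep}-C{n_coda}", layout

for n in range(5):
    print(*hybrid_layout(n))
```

For example, `hybrid_layout(2)` yields the name `R3-C2` with the layout `['C+B', '9RB', '9RB', '3xCoDA', 'CoDA+GlobalPool']`.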

Imagenet. In the following, we show that the gains from interpolating between networks also extend to a more complex dataset. Similar to the interpolation experiments on CIFAR-10 above, the results are based on an interpolation between a pretrained ResNet model (ResNet-50) and a CoDA-based classification head. However, given the high-dimensional representations produced by the later ResNet layers (up to 2048 channels), the parameters of the classification head increase drastically if the high dimensionality is maintained throughout the CoDA layers. Therefore, in the Imagenet experiments, we first compute a low-dimensional projection $\tilde{\mathbf{x}} = \mathbf{P}\mathbf{x}$ of the inputs to the convolutional kernels to which we apply the eDAUs (see eq. (5)); similarly to the dynamic weights of DAUs with L2 normalisation, we normalise the rows of the matrix $\mathbf{P}$ to unit norm to maintain a parameter-independent bound of the network.
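Why unit-norm rows keep the bound independent of the parameters can be seen from the Cauchy-Schwarz inequality: every entry of the projected input is at most the norm of the input, whatever values the projection matrix takes. The sketch below checks this numerically; the projected dimension of 64 and all variable names are assumptions for illustration, and the projection is applied to a single vector rather than convolutionally.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2048, 64                      # 2048 input channels as in the text; k is hypothetical
P_raw = rng.normal(size=(k, d)) * 10.0            # arbitrary scale on purpose
P = P_raw / np.linalg.norm(P_raw, axis=1, keepdims=True)   # rows normalised to unit norm

x = rng.normal(size=d)
x_tilde = P @ x                      # low-dimensional projection of the input

# Cauchy-Schwarz: |p_i^T x| <= ||p_i|| ||x|| = ||x|| for every row p_i.
x_norm = np.linalg.norm(x)
print(np.all(np.abs(x_tilde) <= x_norm + 1e-9))   # True
print(x_tilde.shape)                              # (64,)
```

Rescaling `P_raw` by any factor leaves `P`, and hence the bound, unchanged.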

TABLE II: Classification accuracies for hybrid networks. In R$X$-C$Y$, $Y$ denotes how many segments were replaced by CoDA layers; on Imagenet, maximally up to five ResNet blocks from the end of the network were replaced and all networks thus still rely on a ResNet-based stem. On CIFAR-10 the accuracy can be maintained whilst improving interpretability (see Fig. 11). On Imagenet, on the other hand, we observe a trade-off in accuracy when increasing the 'interpretable depth' of the models.

| CIFAR-10 | R5-C0 | R4-C1 | R3-C2 | R2-C3 | R1-C4 |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 93.4% | 93.6% | 93.4% | 93.6% | 93.8% |

| Imagenet | R5-C0 | R3-C2 | R2-C3 | R1-C4 | R0-C5 |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 76.1% | 74.7% | 73.3% | 71.7% | 71.4% |

For the interpolation experiments, we successively replace the last 5 residual blocks of the ResNet-50 base model (the ’segments’ here correspond to FC+P or individual residual blocks) by a single CoDA layer each and assess the interpretability via the localisation metric as well as model accuracy, see Fig. 11 (right) and Table II respectively. Similar to the CIFAR-10 experiments, we observe an increase in ’interpretable depth’ (Fig. 11, right). However, while on CIFAR-10 the accuracy of the base model could be maintained, on Imagenet we observe a trade-off in accuracy. While better results can certainly be achieved by further optimising the network architectures and / or fine-tuning the learnt embeddings, our results show that it is possible to increase the interpretability of performant classification models by using a classification head comprised of CoDA layers.

6 Discussion and conclusion

We present a new family of neural networks, the CoDA Nets, and show that they are performant classifiers with a high degree of interpretability. [Footnote 10: Code is available at github.com/moboehle/CoDA-Nets.] In particular, we first introduced the Dynamic Alignment Units (DAUs), which model their output as a dynamic linear transformation of their input and have a structural bias towards alignment maximisation. This bias is induced by ensuring that a DAU can only produce large outputs if its weights are well-aligned with the input, since the dynamically applied weights are explicitly normalised. In order to lower the computational costs of the DAUs, we further introduce the eDAUs, for which we normalise the weights by an upper bound of their norms which is cheaper to compute. Using the DAUs to model filters in a convolutional network, we obtain the Convolutional Dynamic Alignment Networks (CoDA Nets). The successive linear mappings by means of the DAUs within the network make it possible to linearly decompose the output into contributions from individual input dimensions. In contrast to piece-wise linear networks, which are also dynamic linear, the alignment property of the DAUs ensures that the linear decomposition aligns with discriminant patterns in the input. In order to assess the quality of these contribution maps, see eq. (11), we compare against other attribution methods. We find that the CoDA Net contribution maps consistently perform well under commonly used quantitative metrics and are robust to the applied normalisation scheme. Beyond their interpretability, the CoDA Nets constitute performant classifiers: their accuracies on CIFAR-10 and the TinyImagenet dataset are on par with those of the commonly employed VGG and ResNet models. Lastly, we show that CoDA layers can be combined with conventional networks, which yields hybrid models with an increased 'interpretable depth' compared to the base model. 
We believe that such hybrid models hold great potential, since they take advantage of the high modelling capacity and efficiency of modern neural networks whilst allowing for a user-defined ’minimal interpretability’. For example, such networks could allow for localising regions of importance for the model decision at a desired granularity by restricting the receptive field of the feature extractors.

References

  • [1] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in International Conference on Machine Learning (ICML), 2010.
  • [2] G. F. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, “On the Number of Linear Regions of Deep Neural Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.
  • [3] J. Adebayo, J. Gilmer, M. Muelly, I. J. Goodfellow, M. Hardt, and B. Kim, “Sanity Checks for Saliency Maps,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [4] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” in International Conference on Learning Representations (ICLR), Workshop, 2014.
  • [5] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for Simplicity: The All Convolutional Net,” in International Conference on Learning Representations (ICLR), Workshop, 2015.
  • [6] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [7] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization,” in International Conference on Computer Vision (ICCV), 2017.
  • [8] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning Important Features Through Propagating Activation Differences,” in International Conference on Machine Learning (ICML), 2017.
  • [9] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,” in International Conference on Machine Learning (ICML), 2017.
  • [10] S. Srinivas and F. Fleuret, “Full-Gradient Representation for Neural Network Visualization,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [11] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation,” PLoS ONE, 2015.
  • [12] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, “Explainable AI: A Review of Machine Learning Interpretability Methods,” Entropy, vol. 23, no. 1, p. 18, 2021.
  • [13] S. M. Lundberg and S. Lee, “A Unified Approach to Interpreting Model Predictions,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [14] V. Petsiuk, A. Das, and K. Saenko, “RISE: Randomized Input Sampling for Explanation of Black-box Models,” in British Machine Vision Conference (BMVC), 2018.
  • [15] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” in International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2016.
  • [16] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. Su, “This Looks Like That: Deep Learning for Interpretable Image Recognition,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [17] W. Brendel and M. Bethge, “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet,” in International Conference on Learning Representations (ICLR), 2019.
  • [18] D. Alvarez-Melis and T. S. Jaakkola, “Towards Robust Interpretability with Self-Explaining Neural Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [19] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • [20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning (ICML), 2015.
  • [21] C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psychometrika, 1936.
  • [22] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [23] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic Routing Between Capsules,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [24] Y. LeCun, “MNIST handwritten digit database,” http://yann.lecun.com/exdb/mnist, 1998.
  • [25] I. Shafkat, “Intuitively Understanding Convolutions for Deep Learning,” https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1#ad33, 2018.
  • [26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
  • [27] J. et al., “Tiny ImageNet Visual Recognition Challenge,” https://tiny-imagenet.herokuapp.com/, accessed: 2020-11-10.
  • [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [29] L. Sun, “ResNet on Tiny ImageNet,” http://cs231n.stanford.edu/reports/2017/pdfs/12.pdf, 2016, accessed: 2020-11-16.
  • [30] B. Jia and Q. Huang, “DE-CapsNet: A Diverse Enhanced Capsule Network with Disperse Dynamic Routing,” Applied Sciences, 2020.
  • [31] learningai.io, “VGGNet and Tiny ImageNet,” https://learningai.io/projects/2017/06/29/tiny-imagenet.html, accessed: 2020-11-08.
  • [32] T. Li, J. Li, Z. Liu, and C. Zhang, “Few sample knowledge distillation for efficient network compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 639–14 647.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [34] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Improved inception-residual convolutional neural network for object recognition,” Neural Computing and Applications, 2020.
  • [35] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2017.
  • [36] C. Termritthikun, Y. Jamtsho, and P. Muneesawang, “An improved residual network model for image recognition using a combination of snapshot ensembles and the cutout technique,” Multimedia Tools and Applications, 2020.
  • [37] S. Zagoruyko and N. Komodakis, “Wide Residual Networks,” in British Machine Vision Conference (BMVC), 2016.
  • [38] D. Hendrycks, K. Lee, and M. Mazeika, “Using Pre-Training Can Improve Model Robustness and Uncertainty,” in Proceedings of Machine Learning Research (PMLR), 2019.
  • [39] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR), 2015.
  • [40] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in European Conference on Computer Vision (ECCV), 2014.
  • [41] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, “How to explain individual classification decisions,” The Journal of Machine Learning Research, 2010.
  • [42] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-Down Neural Attention by Excitation Backprop,” Int. J. Comput. Vis., 2018.
  • [43] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456.
  • [44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2019.
  • [45] Y. Idelbayev, “Proper ResNet Implementation for CIFAR10/CIFAR100 in PyTorch,” https://github.com/akamaster/pytorch_resnet_cifar10, accessed: 2020-06-05.
Moritz Böhle is a Ph.D. student in Computer Science at the Max Planck Institute for Informatics, working with Prof. Dr. Bernt Schiele and Prof. Dr. Mario Fritz. He graduated with a bachelor’s degree in physics in 2016 from the Freie Universität Berlin and obtained his Master’s degree in computational neuroscience in 2019 from the Technische Universität Berlin. His research focuses on understanding the ’decision process’ in deep neural networks and designing inherently interpretable neural network models.
Mario Fritz is a faculty member at the CISPA Helmholtz Center for Information Security, honorary professor at the Saarland University, and a fellow of the European Laboratory of Learning and Intelligent Systems (ELLIS). Before, he was senior researcher at the Max Planck Institute for Informatics, PostDoc at UC Berkeley and International Computer Science Institute. He studied computer science at the university Erlangen/Nuremberg and obtained his PhD from the TU Darmstadt. His current work is centered around Trustworthy Information Processing with a focus on the intersection of AI & Machine Learning with Security & Privacy. He is associate editor of IEEE TPAMI, a member of the ACM Europe Technical Policy Committee, and a leading scientist of the Helmholtz Medical Security, Privacy, and AI Research Center, where he is coordinating projects on privacy and federated learning in health. He has over 100 publications, including 80 in top-tier journals (IJCV, TPAMI) and conferences (NeurIPS, AAAI, IJCAI, ICLR, NDSS, USENIX Security, CCS, S&P, CVPR, ICCV, ECCV).
Bernt Schiele has been Max Planck Director at MPI for Informatics and Professor at Saarland University since 2010. He studied computer science at the University of Karlsruhe, Germany. He worked on his master's thesis in the field of robotics in Grenoble, France, where he also obtained the “diplome d’etudes approfondies d’informatique”. In 1994 he worked in the field of multi-modal human-computer interfaces at Carnegie Mellon University, Pittsburgh, PA, USA in the group of Alex Waibel. In 1997 he obtained his PhD from INP Grenoble, France under the supervision of Prof. James L. Crowley in the field of computer vision. The title of his thesis was “Object Recognition using Multidimensional Receptive Field Histograms”. Between 1997 and 2000 he was postdoctoral associate and Visiting Assistant Professor with the group of Prof. Alex Pentland at the Media Laboratory of the Massachusetts Institute of Technology, Cambridge, MA, USA. From 1999 until 2004 he was Assistant Professor at the Swiss Federal Institute of Technology in Zurich (ETH Zurich). Between 2004 and 2010 he was Full Professor at the computer science department of TU Darmstadt.