Optimising for Interpretability:
Convolutional Dynamic Alignment Networks

Moritz Böhle, Mario Fritz, and Bernt Schiele

Moritz Böhle is the corresponding author.
He is with the Department of Computer Vision and Machine Learning, Max Planck Institute for Informatics, Saarbrücken 66123, Germany. E-mail: [email protected]
Bernt Schiele is with the Department of Computer Vision and Machine Learning, Max Planck Institute for Informatics, Saarbrücken 66123, Germany. E-mail: [email protected]
Mario Fritz is with the CISPA Helmholtz Center for Information Security, Saarbrücken 66123, Germany. E-mail: [email protected].

Abstract

We introduce a new family of neural network models called Convolutional Dynamic Alignment Networks (CoDA Nets), which are performant classifiers with a high degree of inherent interpretability. Their core building blocks are Dynamic Alignment Units (DAUs), which are optimised to transform their inputs with dynamically computed weight vectors that align with task-relevant patterns. As a result, CoDA Nets model the classification prediction through a series of input-dependent linear transformations, allowing for a linear decomposition of the output into individual input contributions. Given the alignment of the DAUs, the resulting contribution maps align with discriminative input patterns. These model-inherent decompositions are of high visual quality and outperform existing attribution methods under quantitative metrics. Further, CoDA Nets constitute performant classifiers, achieving results on par with ResNet and VGG models on, e.g., CIFAR-10 and TinyImagenet. Lastly, CoDA Nets can be combined with conventional neural network models to yield powerful classifiers that more easily scale to complex datasets such as Imagenet whilst exhibiting an increased interpretable depth, i.e., the output can be explained well in terms of contributions from intermediate layers within the network.

Index Terms:

Explainability in Deep Learning, Convolutional Neural Networks

1 Introduction

Figure 1: Sketch of a 9-layer CoDA-Net, which computes its output ${\mathbf{a_{9}}}$ for an input ${\mathbf{a_{0}}}$ as a linear transform via a matrix ${\mathbf{W_{0\rightarrow 9}({\mathbf{a}}_{0})}}$ . As such, the output can be linearly decomposed into input contributions (see right). This ‘global’ transformation matrix ${\mathbf{W_{0\rightarrow 9}}}$ is computed successively via multiple layers of Dynamic Alignment Units (DAUs). These layers, in turn, produce intermediate linear transformation matrices ${\mathbf{W}}_{l}({\mathbf{a}}_{l-1})$ that align with the inputs of layer $l$ . As a result, the combined matrix ${\mathbf{W_{0\rightarrow 9}}}$ also aligns well with task-relevant patterns. Positive (negative) contributions for the class ‘goldfinch’ are shown in red (blue).

Neural networks are powerful models that excel at a wide range of tasks. However, they are notoriously difficult to interpret and extracting explanations for their predictions is an open research problem. Linear models, in contrast, are generally considered interpretable, because the contribution (‘the weighted input’) of every dimension to the output is explicitly given. Interestingly, many modern neural networks implicitly model the output as a linear transformation of the input; a ReLU-based [1] neural network, e.g., is piece-wise linear and the output thus a linear transformation of the input, cf. [2]. However, due to the highly non-linear manner in which these linear transformations are ‘chosen’, the corresponding contributions per input dimension do not seem to represent the learnt model parameters well, cf. [3], and a lot of research is being conducted to find better explanations for the decisions of such neural networks, cf. [4, 5, 6, 7, 8, 9, 10, 11].

In this work, we introduce a novel network architecture, the Convolutional Dynamic Alignment Networks (CoDA Nets), for which the model-inherent contribution maps are faithful projections of the internal computations and thus good ‘explanations’ of the model prediction. There are two main components to the interpretability of the CoDA Nets. First, the CoDA Nets are dynamic linear, i.e., they compute their outputs through a series of input-dependent linear transforms, which are based on our novel Dynamic Alignment Units (DAUs). As in linear models, the output can thus be decomposed into individual input contributions, see Fig. 1. Second, the DAUs are structurally biased to compute weight vectors that align with relevant patterns in their inputs. In combination, the CoDA Nets thus inherently produce contribution maps that are ‘optimised for interpretability’: since each linear transformation matrix and thus their combination is optimised to align with discriminative features, the contribution maps reflect the most discriminative features as used by the model.

With this work, we present a new direction for building inherently more interpretable neural network architectures with high modelling capacity. In detail, we would like to highlight the following contributions:

1. We introduce the concept of Dynamic Alignment Units (DAUs), which improve the interpretability of neural networks and have two key properties: they are dynamic linear and need to align their dynamically computed weights with their inputs to achieve large outputs, since the norm of their weights is explicitly constrained.

2. Apart from normalising the dynamically computed DAU weights by their vector norm, we show that similar results can be achieved when normalising them by an upper bound of their norm, which results in more efficient DAUs.

3. We show that networks of DAUs inherit the dynamic linearity and the alignment properties from their constituent DAUs. In particular, we introduce Convolutional Dynamic Alignment Networks (CoDA Nets), which are built out of multiple layers of DAUs. As a result, the model-inherent contribution maps of CoDA Nets highlight discriminative patterns in the input.

4. We further show that the alignment of the DAUs can be promoted by applying a ‘temperature scaling’ to the final output of the CoDA Nets.

5. We show that the resulting contribution maps perform well under commonly employed quantitative criteria for attribution methods. Moreover, under qualitative inspection, we note that they exhibit a high degree of detail.

6. We analyse how the models are affected by different normalisation functions in the DAU weight calculation in terms of accuracy, interpretability, as well as efficiency.

7. Beyond interpretability, CoDA Nets are performant classifiers and yield competitive classification accuracies on the CIFAR-10 and TinyImagenet datasets.

8. We show that CoDA Nets can be seamlessly combined with conventional networks. The resulting hybrid networks exhibit an increased ‘interpretable depth’ whilst taking advantage of the efficiency and strong modelling capacity of the base networks. Such networks hold great potential for designing models that are inherently interpretable up to a user-defined minimal resolution.

2 Related Work

Interpretability. In order to make machine learning models more interpretable, a variety of techniques has been developed. While there are different ways in which interpretability can be defined, in this work we focus on attributing importance values to input features; for an extensive review regarding interpretability in machine learning, see [12].

Current techniques for interpreting neural networks via importance attribution can broadly be split into two categories: deriving post-hoc explanations and developing inherently interpretable models. In the following, we will first describe common approaches for post-hoc interpretability.

On the one hand, research regarding post-hoc explanations has been undertaken to develop model-agnostic explanation methods for which the model behaviour under perturbed inputs is analysed; this includes among others [13, 14, 15]. While their generality and the applicability to any model are advantageous, these methods typically require evaluating the respective model several times and are therefore costly approximations of model behaviour. Further, they rely on the assumption that the models generalise to such out-of-distribution (OOD) data (the perturbed / occluded inputs) in a stable manner, such that the outputs on the OOD data can be used to explain model behaviour on in-distribution data.

On the other hand, many techniques that explicitly take advantage of the internal computations have been proposed for explaining the model predictions, including, for example, [4, 5, 6, 7, 8, 9, 10, 11]. Such methods typically distribute importance values layer by layer in a backward pass and introduce different rules for this re-distribution. Similarly, the model-inherent contribution maps in the CoDA Nets can also be obtained as a layer-wise decomposition of the output. However, there is a key difference: producing well-aligned linear decompositions is actually optimised for in the CoDA Nets during training, whereas other methods are developed without taking the optimisation procedure into account. As a result, the inherent explanations in the CoDA Nets lend themselves better to understanding the model outputs.
In contrast to techniques that aim to explain models post-hoc, some recent work has focused on designing new types of network architectures, which are inherently more interpretable. Examples of this are the prototype-based neural networks [16], the BagNet [17] and the self-explaining neural networks (SENNs) [18]. Similarly to our proposed architectures, the SENNs and the BagNets derive their explanations from a linear decomposition of the output into contributions from the input (features). This dynamic linearity, i.e., the property that the output is computed via some form of an input-dependent linear mapping, is additionally shared by the entire model family of piece-wise linear networks (e.g., ReLU-based networks). In fact, the contribution maps of the CoDA Nets are conceptually similar to evaluating the ‘Input $\times$ Gradient’ (IxG), cf. [3], on piece-wise linear models, which also yields a linear decomposition in form of a contribution map. However, in contrast to the piece-wise linear functions, we combine this dynamic linearity with a structural bias towards an alignment between the contribution maps and the discriminative patterns in the input. This results in explanations of much higher quality, whereas IxG on piece-wise linear models has been found to yield unsatisfactory explanations of model behaviour [3].

Architectural similarities. In our CoDA Nets, the convolutional kernels are dependent on the specific patch that they are applied on; i.e., a CoDA layer might apply different filters at every position in the input. As such, these layers can be regarded as an instance of dynamic local filtering layers as introduced in [19]. Further, our dynamic alignment units (DAUs) share some high-level similarities to attention networks, cf. [20], in the sense that each DAU has a limited budget to distribute over its dynamic weight vectors (bounded norm), which is then used to compute a weighted sum. However, whereas in attention networks the weighted sum is typically computed over vectors (the ’value vectors’) which differ from the input to the attention module (the ’key’ and ’query’ vectors), a DAU outputs a scalar which is a weighted sum of all scalar entries in the input. Moreover, we note that at their optimum (maximal average output over a set of inputs), the DAUs solve a constrained low-rank matrix approximation problem [21]. While low-rank approximations have been used for increasing parameter efficiency in neural networks, cf. [22], this concept has to the best of our knowledge not been used in order to endow neural networks with a structural bias towards finding low-rank approximations of the input for increased interpretability in classification tasks. Lastly, the CoDA Nets are related to capsule networks. However, whereas in classical capsule networks the activation vectors of the capsules directly serve as input to the next layer, in CoDA Nets the corresponding vectors are used as convolutional filters. We include a detailed comparison in the supplement.

3 Dynamic Alignment Networks

In this section, we present our novel type of network architecture: the Convolutional Dynamic Alignment Networks (CoDA Nets). For this, we first introduce Dynamic Alignment Units (DAUs) as the basic building blocks of CoDA Nets and discuss two of their key properties in sec. 3.1. Concretely, we show that these units linearly transform their inputs with dynamic (input-dependent) weight vectors and, additionally, that they are biased to align these weights with the input during optimisation. Given the computational costs of evaluating DAUs, in sec. 3.2 we further present an alternative formulation of the DAUs for increased efficiency. We then discuss how DAUs can be used for classification (sec. 3.3) and how we build performant networks out of multiple layers of convolutional DAUs (sec. 3.4). Importantly, the resulting linear decompositions of the network outputs are optimised to align with discriminative patterns in the input, making them highly suitable for interpreting the network predictions.

In particular, we structure this section around the following three important properties (P1-P3) of the DAUs:
P1: Dynamic linearity. The DAU output $o$ is computed as a dynamic (input-dependent) linear transformation of the input ${\mathbf{x}}$ , such that $o={\mathbf{w}}({\mathbf{x}})^{T}{\mathbf{x}}=\sum_{j}w_{j}({\mathbf{x}})x_{j}$ . Hence, $o$ can be decomposed into contributions from individual input dimensions, which are given by $w_{j}({\mathbf{x}})x_{j}$ for dimension $j$ .
P2: Alignment maximisation. Maximising the average output of a single DAU over a set of inputs ${\mathbf{x}}_{i}$ maximises the alignment between inputs ${\mathbf{x}}_{i}$ and the weight vectors ${\mathbf{w}}({\mathbf{x}}_{i})$ . As the modelling capacity of ${\mathbf{w}}({\mathbf{x}})$ is restricted, ${\mathbf{w}}({\mathbf{x}})$ will encode the most frequent patterns in the set of inputs ${\mathbf{x}}_{i}$ .
P3: Inheritance. When combining multiple DAU layers to form a Dynamic Alignment Network (DA Net), the properties P1 and P2 are inherited: DA Nets are dynamic linear (P1) and maximising the last layer’s output induces an output maximisation in the constituent DAUs (P2).
These properties increase the interpretability of a DA Net, such as a CoDA Net (sec. 3.4), for the following reasons. First, the output of a DA Net can be decomposed into contributions from the individual input dimensions, similar to linear models (cf. Fig. 1, P1 and P3). Second, we note that optimising a neural network for classification applies a maximisation to the outputs of the last layer for every sample. This maximisation aligns the dynamic weight vectors ${\mathbf{w}}({\mathbf{x}})$ of the constituent DAUs of the DA Net with their respective inputs (cf. Fig. 2 as well as P2 and P3).

Importantly, the weight vectors will align with the discriminative patterns in their inputs when optimised for classification, as we show in sec. 3.3. As a result, the model-inherent contribution maps of CoDA Nets are optimised to align well with discriminative patterns in the input image, and the interpretability of our models thus forms part of the global optimisation procedure.

3.1 Dynamic Alignment Units

We define the Dynamic Alignment Units (DAUs) by

$\displaystyle\text{DAU}({\mathbf{x}})=g({\mathbf{A}}{\mathbf{B}}{\mathbf{x}}+{\mathbf{b}})^{T}{\mathbf{x}}={\mathbf{w}}({\mathbf{x}})^{T}\,{\mathbf{x}}\quad\text{.}$		(1)

Here, ${\mathbf{x}}\in\mathbb{R}^{d}$ is an input vector, ${\mathbf{A}}\in\mathbb{R}^{d\times r}$ and ${\mathbf{B}}\in\mathbb{R}^{r\times d}$ are trainable transformation matrices, ${\mathbf{b}}\in\mathbb{R}^{d}$ a trainable bias vector, and $g({\mathbf{u}})=\alpha(||{\mathbf{u}}||){\mathbf{u}}$ is a non-linear function that scales the norm of its input. In contrast to using a single matrix ${\mathbf{M}}\in\mathbb{R}^{d\times d}$ , using ${\mathbf{AB}}$ allows us to control the maximum rank $r$ of the transformation and to reduce the number of parameters; we will hence refer to $r$ as the rank of a DAU. As can be seen by the right-hand side of eq. (1), the DAU linearly transforms the input ${\mathbf{x}}$ (P1). At the same time, given the quadratic form ( ${\mathbf{x}}^{T}{\mathbf{B}}^{T}{\mathbf{A}}^{T}{\mathbf{x}}$ ) and the rescaling function $\alpha(||{\mathbf{u}}||)$ , the output of the DAU is a non-linear function of its input. In the context of DAUs, we are particularly interested in functions that constrain the norm of the weight vectors ${\mathbf{w}}({\mathbf{x}})$ , such as, e.g., rescaling to unit norm (L2) or the squashing function (SQ, see [23]):

$\displaystyle\text{L2}({\mathbf{u}})=\frac{{\mathbf{u}}}{||{\mathbf{u}}||_{2}}\;\;\text{and}\;\;\text{SQ}({\mathbf{u}})=\text{L2}({\mathbf{u}})\times\frac{||{\mathbf{u}}||^{2}_{2}}{1+||{\mathbf{u}}||_{2}^{2}}$		(2)
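As a short worked check (added here for illustration, using only the definitions in eq. (2)), the norms of the rescaled vectors follow directly; the numerical value $||{\mathbf{u}}||_{2}=2$ is an arbitrary example.

```latex
\begin{align*}
||\text{L2}(\mathbf{u})||_2 &= \frac{||\mathbf{u}||_2}{||\mathbf{u}||_2} = 1, \\
||\text{SQ}(\mathbf{u})||_2 &= ||\text{L2}(\mathbf{u})||_2 \cdot \frac{||\mathbf{u}||_2^2}{1+||\mathbf{u}||_2^2}
  = \frac{||\mathbf{u}||_2^2}{1+||\mathbf{u}||_2^2} < 1, \\
\text{e.g. } ||\mathbf{u}||_2 = 2 \;&\Rightarrow\; ||\text{SQ}(\mathbf{u})||_2 = \tfrac{4}{5}.
\end{align*}
```

Both functions therefore keep the norm of the rescaled vector at or below 1 for any non-zero ${\mathbf{u}}$.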

In sec. 3.2, we further present an approximation to these rescaling functions, which lowers the computational cost of the DAUs whilst maintaining their bounding property $||{\mathbf{w}}({\mathbf{x}})||\leq 1$ . Given such a bound on ${\mathbf{w}}({\mathbf{x}})$ , the output of the DAUs will be upper-bounded by the norm of the input:

$\displaystyle\text{DAU}({\mathbf{x}})=||{\mathbf{w}}({\mathbf{x}})||\,||{\mathbf{x}}||\cos(\angle({\mathbf{x}},{\mathbf{w}}({\mathbf{x}})))\leq||{\mathbf{x}}||$		(3)

As a corollary, for a given input ${\mathbf{x}}_{i}$ , the DAUs can only achieve this upper bound if ${\mathbf{x}}_{i}$ is an eigenvector (EV) of the linear transform ${\mathbf{AB}}{\mathbf{x}}+{\mathbf{b}}$ . Otherwise, the cosine in eq. (3) will not be maximal. (Footnote 1: Note that ${\mathbf{w}}({\mathbf{x}})$ is proportional to ${\mathbf{AB}}{\mathbf{x}}+{\mathbf{b}}$ . The cosine in eq. (3), in turn, is maximal if and only if ${\mathbf{w}}({\mathbf{x}}_{i})$ is proportional to ${\mathbf{x}}_{i}$ and thus, by transitivity, if ${\mathbf{x}}_{i}$ is proportional to ${\mathbf{AB}}{\mathbf{x}}_{i}+{\mathbf{b}}$ . This means that ${\mathbf{x}}_{i}$ has to be an EV of ${\mathbf{AB}}{\mathbf{x}}+{\mathbf{b}}$ to achieve maximal output.) As can be seen in eq. (3), maximising the average output of a DAU over a set of inputs $\{{\mathbf{x}}_{i}|\,i=1,...,n\}$ maximises the alignment between ${\mathbf{w}}({\mathbf{x}})$ and ${\mathbf{x}}$ (P2). In particular, it optimises the parameters of the DAU such that the most frequent input patterns are encoded as EVs in the linear transform ${\mathbf{AB}}{\mathbf{x}}+{\mathbf{b}}$ , similar to an $r$ -dimensional PCA decomposition ( $r$ the rank of ${\mathbf{AB}}$ ). In fact, as discussed in the supplement, the optimum of the DAU maximisation solves a low-rank matrix approximation [21] problem similar to singular value decomposition.

As an illustration of this property, in Fig. 3 we show the 3 EVs of matrix ${\mathbf{AB}}$ (with rank $r=3$ , bias ${\mathbf{b}}={\mathbf{0}}$ ) after optimising a DAU over a set of $n$ noisy samples of 3 specific MNIST [24] images; for this, we used $n=3072$ and zero-mean Gaussian noise. (Footnote 2: Given $r=3$ , the EVs maximally span a 3-dimensional subspace.) As expected, the EVs of ${\mathbf{AB}}$ encode the original, noise-free images, since this on average maximises the alignment (eq. (3)) between the weight vectors ${\mathbf{w}}({\mathbf{x}}_{i})$ and the input samples ${\mathbf{x}}_{i}$ over the dataset.
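To make eqs. (1)–(3) concrete, the following is a minimal sketch of a single DAU forward pass in plain Python. It is not the authors' implementation: the helper names (`matvec`, `l2_rescale`, `squash`, `dau`), the random parameters, and the small `eps` safeguard are assumptions made purely for illustration.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def l2_rescale(u, eps=1e-12):
    """L2(u) = u / ||u||; eps is an added numerical safeguard (assumption)."""
    n = norm(u)
    return [x / (n + eps) for x in u]

def squash(u):
    """SQ(u) = L2(u) * ||u||^2 / (1 + ||u||^2) = u * ||u|| / (1 + ||u||^2)."""
    n = norm(u)
    return [x * n / (1.0 + n * n) for x in u]

def dau(x, A, B, b, g=l2_rescale):
    """Return (output, weights) with w(x) = g(ABx + b) and output = w(x)^T x."""
    u = [ai + bi for ai, bi in zip(matvec(A, matvec(B, x)), b)]
    w = g(u)
    return sum(wi * xi for wi, xi in zip(w, x)), w

# Hypothetical toy setup: input dimension d = 6, rank r = 2, zero bias.
random.seed(0)
d, r = 6, 2
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]
B = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
b = [0.0] * d
x = [random.gauss(0, 1) for _ in range(d)]

out_l2, w_l2 = dau(x, A, B, b, g=l2_rescale)
out_sq, w_sq = dau(x, A, B, b, g=squash)

# Per-dimension contributions w_j(x) * x_j (P1); they sum to the DAU output.
contributions = [wi * xi for wi, xi in zip(w_l2, x)]
```

In this sketch the output never exceeds `norm(x)` in magnitude, since the weight norm is at most 1, which mirrors the bound in eq. (3).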

3.2 Efficient DAUs: Bounding the Bound

As discussed in the previous section, we introduce a norm constraint for the DAU weights ${\mathbf{w}}({\mathbf{x}})$ to ensure that large outputs can only be achieved for well-aligned weights. However, the explicit norm constraint on ${\mathbf{w}}({\mathbf{x}})$ requires its explicit calculation, which we have observed to significantly impact the evaluation time of DAUs. Therefore, we evaluate an additional formulation of the DAUs in which we only constrain an upper bound of the norm of ${\mathbf{w}}({\mathbf{x}})$ . For this, we take advantage of the following inequality:

$\displaystyle||{\mathbf{w}}({\mathbf{x}})||=||{\mathbf{AB}}{\mathbf{x}}||\leq||{\mathbf{A}}||_{F}\,||{\mathbf{B}}{\mathbf{x}}||\quad.$		(4)

Here, $||\cdot||_{F}$ denotes the Frobenius norm and $||\cdot||$ the $L_{2}$ vector norm; this inequality reflects the fact that the Frobenius norm is compatible with the $L_{2}$ vector norm. Note that using this approximation for the norm computation bounds the output bound of the DAUs and is at least as tight as the bound in eq. (3). As a result, without the bias term ${\mathbf{b}}$ , the output of the corresponding DAUs can be calculated as

$\displaystyle\text{eDAU}({\mathbf{x}})=||{\mathbf{B}}{\mathbf{x}}||^{-1}\left({\mathbf{B}}{\mathbf{x}}\right)^{T}\left({\mathbf{A}}^{\prime T}{\mathbf{x}}\right)\quad$		(5)
$\displaystyle\text{with}\quad{\mathbf{A}}^{\prime}=||{\mathbf{A}}||_{F}^{-1}{\mathbf{A}}$		(6)

We will henceforth refer to this non-linear output computation as weight bounding (WB). Note that under this formulation the $d$ -dimensional weights ${\mathbf{w}}({\mathbf{x}})$ are never explicitly calculated and the output is instead obtained as a dot product in $\mathbb{R}^{r}$ between the vectors ${\mathbf{B}}{\mathbf{x}}$ and ${\mathbf{A}}^{\prime T}{\mathbf{x}}$ . Further, for convolutional DAUs (see sec. 3.4), the matrix ${\mathbf{A}}^{\prime}$ has to be computed only once for all positions. As we show in sec. 5.3, this can result in significant gains in efficiency.
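The weight-bounding computation in eqs. (5) and (6) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and the random toy parameters are assumptions, and `explicit_weights` materialises ${\mathbf{w}}({\mathbf{x}})$ only to check that both routes agree.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(m * m for row in M for m in row))

def edau(x, A, B):
    """eDAU(x) = ||Bx||^-1 (Bx)^T (A'^T x) with A' = A / ||A||_F (eqs. 5-6)."""
    f = frobenius(A)
    A_prime = [[a / f for a in row] for row in A]
    Bx = matvec(B, x)                    # r-dimensional
    Atx = matvec(transpose(A_prime), x)  # r-dimensional
    return sum(p * q for p, q in zip(Bx, Atx)) / norm(Bx)

def explicit_weights(x, A, B):
    """w(x) = ABx / (||A||_F ||Bx||), computed explicitly for comparison only."""
    Bx = matvec(B, x)
    scale = frobenius(A) * norm(Bx)
    return [u / scale for u in matvec(A, Bx)]

# Hypothetical toy setup: d = 6, r = 2, no bias term.
random.seed(0)
d, r = 6, 2
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]
B = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
x = [random.gauss(0, 1) for _ in range(d)]

out = edau(x, A, B)
w = explicit_weights(x, A, B)
out_explicit = sum(wi * xi for wi, xi in zip(w, x))
```

Note that `edau` only ever forms two $r$-dimensional vectors, whereas `explicit_weights` needs the full $d$-dimensional product; in a convolutional layer the normalisation of ${\mathbf{A}}$ would additionally be shared across positions.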

3.3 DAUs for classification

DAUs can be used directly for classification by applying $k$ DAUs in parallel to obtain an output $\hat{{\mathbf{y}}}({\mathbf{x}})=\left[\text{DAU}_{1}({\mathbf{x}}),...,\text{DAU}_{k}({\mathbf{x}})\right]$ . Note that this is a linear transformation $\hat{{\mathbf{y}}}({\mathbf{x}})={\mathbf{W}}({\mathbf{x}}){\mathbf{x}}$ , with each row in ${\mathbf{W}}\in\mathbb{R}^{k\times d}$ corresponding to the weight vector ${\mathbf{w}}_{j}^{T}$ of a specific DAU $j$ . Consider, for example, a dataset $\mathcal{D}=\{({\mathbf{x}}_{i},{\mathbf{y}}_{i})|\,{\mathbf{x}}_{i}\in\mathbb{R}^{d},{\mathbf{y}}_{i}\in\mathbb{R}^{k}\}$ of $k$ classes with ‘one-hot’ encoded labels ${\mathbf{y}}_{i}$ for the inputs ${\mathbf{x}}_{i}$ . To optimise the DAUs as classifiers on $\mathcal{D}$ , we can apply a sigmoid non-linearity to each DAU output and optimise the loss function $\mathcal{L}=\sum_{i}\text{BCE}(\sigma(\hat{{\mathbf{y}}}_{i}),{\mathbf{y}}_{i})$ , where BCE denotes the binary cross-entropy and $\sigma$ applies the sigmoid function to each entry in $\hat{{\mathbf{y}}}_{i}$ . Note that for a given sample, BCE either maximises (DAU for correct class) or minimises (DAU for incorrect classes) the output of each DAU. Hence, this classification loss will still maximise the (signed) cosine between the weight vectors ${\mathbf{w}}({\mathbf{x}}_{i})$ and ${\mathbf{x}}_{i}$ .

To illustrate this property, in Fig. 2 (top) we show the weights ${\mathbf{w}}({\mathbf{x}}_{i})$ for several samples of the digit ‘3’ after optimising the DAUs for classification on a noisy MNIST dataset; the first two are correctly classified, the last one is misclassified as a ‘5’. As can be seen, the weights align with the respective input (the weights for different samples are different). However, different parts of the input are either positively or negatively correlated with a class, which is reflected in the weights: for example, the extended stroke on top of the ‘3’ in the misclassified sample is assigned negative weight and, since the background noise is uncorrelated with the class labels, it is not represented in the weights.

In a classification setting, the DAUs thus preferentially encode the most frequent discriminative patterns in the linear transform ${\mathbf{AB}}{\mathbf{x}}+{\mathbf{b}}$ such that the dynamic weights ${\mathbf{w}}({\mathbf{x}})$ align well with these patterns. Additionally, since the output for class $j$ is a linear transformation of the input (P1), we can compute the contribution vector ${\mathbf{s}}_{j}$ containing the per-pixel contributions to this output by the element-wise product ( $\odot$ )

$\displaystyle{\mathbf{s}}_{j}({\mathbf{x}}_{i})={\mathbf{w}}_{j}({\mathbf{x}}_{i})\odot{\mathbf{x}}_{i}\quad,$		(7)

see Figs. 1 and 2. Such linear decompositions constitute the model-inherent ‘explanations’ which we evaluate in sec. 5.
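The classification setup and the contribution vectors of eq. (7) can be sketched in a few lines. The snippet below is a hedged illustration, not the paper's code: it uses $k=3$ randomly initialised, untrained DAUs with L2 rescaling and no bias, and all names (`dau_weights`, `units`, `S`) are hypothetical.

```python
import math
import random

def dau_weights(x, A, B):
    """w(x) = L2(ABx); the bias is omitted for brevity (assumption)."""
    Bx = [sum(bij * xj for bij, xj in zip(row, x)) for row in B]
    u = [sum(aij * bj for aij, bj in zip(row, Bx)) for row in A]
    n = math.sqrt(sum(v * v for v in u))
    return [v / n for v in u]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and target y."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

# Hypothetical toy problem: d = 8 input dimensions, rank r = 2, k = 3 classes.
random.seed(1)
d, r, k = 8, 2, 3
units = [([[random.gauss(0, 1) for _ in range(r)] for _ in range(d)],
          [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)])
         for _ in range(k)]
x = [random.random() for _ in range(d)]
y = [1.0, 0.0, 0.0]  # one-hot label

# Dynamic weight matrix W(x): one row per DAU / class.
W = [dau_weights(x, A, B) for A, B in units]
y_hat = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]

# Contribution vectors s_j = w_j(x) * x (element-wise), eq. (7).
S = [[wj * xj for wj, xj in zip(w, x)] for w in W]

# Per-sample loss: BCE on the sigmoid of every DAU output.
loss = sum(bce(sigmoid(z), t) for z, t in zip(y_hat, y))
```

Each row of `S` sums to the corresponding entry of `y_hat`, which is what makes the decomposition a complete account of the class output.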

3.4 Convolutional Dynamic Alignment Networks

The modelling capacity of a single layer of DAUs is limited, similar to a single linear classifier. However, DAUs can be used as the basic building block for deep convolutional neural networks, which yields powerful classifiers. Importantly, in this section we show that such a Convolutional Dynamic Alignment Network (CoDA Net) inherits the properties (P3) of the DAUs by maintaining both the dynamic linearity (P1) as well as the alignment maximisation (P2). For a convolutional dynamic alignment layer, each convolutional filter is modelled by a DAU, similar to dynamic local filtering layers [19]. Note that the output of such a layer is also a dynamic linear transformation of the input to that layer, since a convolution is equivalent to a linear layer with certain constraints on the weights, cf. [25]. We include the implementation details in the supplement. Finally, at the end of this section, we highlight an important difference between output maximisation and optimising for classification with the BCE loss. In this context we discuss the effect of temperature scaling and present the loss function we optimise in our experiments.

Dynamic linearity (P1). In order to see that the linearity is maintained, we note that the successive application of multiple layers of DAUs also results in a dynamic linear mapping. Let ${\mathbf{W}}_{l}$ denote the linear transformation matrix produced by a layer of DAUs and let ${\mathbf{a}}_{l-1}$ be the input vector to that layer; as mentioned before, each row in the matrix ${\mathbf{W}}_{l}$ corresponds to the weight vector of a single DAU. (Footnote 3: Note that this also holds for convolutional DAU layers. Specifically, each row in the matrix ${\mathbf{W}}_{l}$ corresponds to a single DAU applied to exactly one spatial location in the input and the input with spatial dimensions is vectorised to yield ${\mathbf{a}}_{l-1}$ . For further details, we kindly refer the reader to [25] and the implementation details in the supplement of this work.) As such, the output of this layer is given by

$\displaystyle{\mathbf{a}}_{l}={\mathbf{W}}_{l}({\mathbf{a}}_{l-1}){\mathbf{a}}_{l-1}\quad.$		(8)

In a network of DAUs, the successive linear transformations can thus be collapsed. In particular, for any pair of activation vectors ${\mathbf{a}}_{l_{1}}$ and ${\mathbf{a}}_{l_{2}}$ with ${l_{1}}<{l_{2}}$ , the vector ${\mathbf{a}}_{l_{2}}$ can be expressed as a linear transformation of ${\mathbf{a}}_{l_{1}}$ :

$\displaystyle{\mathbf{a}}_{l_{2}}={\mathbf{W}}_{{l_{1}}\rightarrow{l_{2}}}\left({\mathbf{a}}_{l_{1}}\right){\mathbf{a}}_{l_{1}}\quad$		(9)
$\displaystyle\text{with}\quad{\mathbf{W}}_{{l_{1}}\rightarrow{l_{2}}}\left({\mathbf{a}}_{l_{1}}\right)=\textstyle\prod_{k={l_{1}}+1}^{l_{2}}{\mathbf{W}}_{k}\left({\mathbf{a}}_{k-1}\right)\quad\text{.}$		(10)

For example, the matrix ${\mathbf{W}}_{0\rightarrow L}({\mathbf{a}}_{0}={\mathbf{x}})={\mathbf{W}}({\mathbf{x}})$ models the linear transformation from the input to the output space, see Fig. 1. Since this linearity holds between any two layers, the $j$ -th entry of any activation vector ${\mathbf{a}}_{l}$ in the network can be decomposed into input contributions via:

$\displaystyle{\mathbf{s}}_{j}^{l}({\mathbf{x}}_{i})=\left[{\mathbf{W}}_{0\rightarrow l}({\mathbf{x}}_{i})\right]_{j}^{T}\odot{\mathbf{x}}_{i}\quad\text{,}$		(11)

with $[{\mathbf{W}}]_{j}$ the $j$ -th row in the matrix.
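The collapse of successive layers in eqs. (8)–(11) can be illustrated with a small fully connected (non-convolutional) stack of DAU layers. This sketch assumes that the product in eq. (10) is applied so that later layers multiply from the left, i.e. ${\mathbf{W}}_{l_{2}}\cdots{\mathbf{W}}_{l_{1}+1}$ ; the layer sizes, the L2 rescaling without bias, and all function names are illustrative assumptions.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    """Matrix product P @ Q for matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def dau_weights(a, A, B):
    """w(a) = L2(ABa); bias omitted for brevity (assumption)."""
    u = matvec(A, matvec(B, a))
    n = math.sqrt(sum(v * v for v in u))
    return [v / n for v in u]

def random_layer(d_in, d_out, r, rng):
    """d_out DAUs, each with A in R^(d_in x r) and B in R^(r x d_in)."""
    return [([[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_in)],
             [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)])
            for _ in range(d_out)]

def forward(x, layers):
    """Return the final activations and the collapsed matrix W_{0->L}(x)."""
    a, W_total = x, None
    for units in layers:
        W = [dau_weights(a, A, B) for A, B in units]  # W_l(a_{l-1})
        a = matvec(W, a)                              # eq. (8)
        # Later layers multiply from the left (assumed ordering of eq. (10)).
        W_total = W if W_total is None else matmul(W, W_total)
    return a, W_total

rng = random.Random(0)
layers = [random_layer(5, 4, 2, rng), random_layer(4, 3, 2, rng)]
x = [rng.gauss(0, 1) for _ in range(5)]

a_L, W_0L = forward(x, layers)

# Input contributions to every output entry j, eq. (11).
contrib = [[w * xi for w, xi in zip(row, x)] for row in W_0L]
```

The collapsed matrix `W_0L` depends on `x`, so it has to be recomputed for every input; it is a description of one forward pass, not a fixed set of weights.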

Alignment maximisation (P2). Note that the output of a CoDA Net is bounded independent of the network parameters: since each DAU operation can—independent of its parameters—at most reproduce the norm of its input (eq. (3)), the linear concatenation of these operations necessarily also has an upper bound which does not depend on the parameters. Therefore, in order to achieve maximal outputs on average (e.g., the class logit over the subset of images of that class), all DAUs in the network need to produce weights ${\mathbf{w}}({\mathbf{a}}_{l})$ that align well with the class features. In other words, the weights will align with discriminative patterns in the input. For example, in Fig. 2 (bottom), we visualise the ‘global matrices’ ${\mathbf{W}}_{0\rightarrow L}$ and the corresponding contributions (eq. (11)) for a $L=5$ layer CoDA Net. As before, the weights align with discriminative patterns in the input and do not encode the uninformative noise.

Temperature scaling and loss function.

So far we have assumed that minimising the BCE loss for a given sample is equivalent to applying a maximisation or minimisation loss to the individual outputs of a CoDA Net. While this is in principle correct, BCE introduces an additional, non-negligible effect: saturation. Specifically, it is possible for a CoDA Net to achieve a low BCE loss without the need to produce well-aligned weight vectors. As soon as the classification accuracy is high and the outputs of the networks are large, the gradient (and therefore the alignment pressure) will vanish. This effect can, however, easily be mitigated: as discussed in the previous paragraph, the output of a CoDA Net is upper-bounded independent of the network parameters, since each individual DAU in the network is upper-bounded. By scaling the network output with a temperature parameter $T$ such that $\hat{{\mathbf{y}}}({\mathbf{x}})=T^{-1}{\mathbf{W}}_{0\rightarrow L}({\mathbf{x}})\,{\mathbf{x}}$ , we can explicitly decrease this upper bound and thereby increase the alignment pressure in the DAUs by avoiding the early saturation due to BCE. In particular, the lower the upper bound is, the stronger the induced DAU output maximisation should be, since the network needs to accumulate more signal to obtain large class logits (and thus a negligible gradient). This is indeed what we observe both qualitatively, cf. Fig. 4, and quantitatively, cf. Fig. 6 (right column). The overall loss for an input ${\mathbf{x}}_{i}$ and the target vector ${\mathbf{y}}_{i}$ is thus computed as

$\displaystyle\mathcal{L}({\mathbf{x}}_{i},{\mathbf{y}}_{i})=\text{BCE}(\sigma(T^{-1}{\mathbf{W}}_{0\rightarrow L}({\mathbf{x}}_{i})\,{\mathbf{x}}_{i}+{\mathbf{b}}_{0})\,,\,{\mathbf{y}}_{i})\quad\text{.}$		(12)

Here, $\sigma$ applies the sigmoid activation to each vector entry and ${{\mathbf{b}}}_{0}$ is a fixed bias term. As an alternative to the temperature scaling, the explicit representation of the network’s computation as a linear mapping allows us to directly regularise the properties that these linear mappings should fulfil. For example, we show in the supplement that by regularising the absolute values of the matrix ${\mathbf{W}}_{0\rightarrow L}$ , we can induce sparsity in the signal alignments, which also leads to sharper heatmaps.
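The saturation argument behind eq. (12) can be illustrated numerically. The sketch below relies on the standard identity that the derivative of $\text{BCE}(\sigma(z),y)$ with respect to the logit $z$ is $\sigma(z)-y$ ; the output bound `MAX_RAW_OUTPUT`, the bias `B0`, and the temperature values are hypothetical numbers chosen only for illustration, not values from the experiments.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce_grad_wrt_logit(z, y):
    """d/dz BCE(sigmoid(z), y) = sigmoid(z) - y."""
    return sigmoid(z) - y

MAX_RAW_OUTPUT = 50.0  # hypothetical parameter-independent bound on W(x) x
B0 = 0.0               # hypothetical fixed bias term b_0

# Gradient magnitude that remains for the correct class (y = 1) even when the
# unscaled network output already sits at its upper bound.
residual_gradient = {}
for T in (1.0, 10.0, 100.0):
    z_max = MAX_RAW_OUTPUT / T + B0
    residual_gradient[T] = abs(bce_grad_wrt_logit(z_max, 1.0))
```

In this toy setting a larger $T$ leaves a larger gradient at the maximal attainable logit, so the loss keeps rewarding better-aligned weights instead of saturating early.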

4 Experimental setup

4.1 Datasets

We evaluate and compare the accuracies of the CoDA Nets to other work on the CIFAR-10 [26] and the TinyImagenet [27] datasets. We use the same datasets for the quantitative evaluations of the model-inherent contribution maps. Additionally, we qualitatively show high-resolution examples from a CoDA Net trained on the first 100 classes of the Imagenet [28] dataset. Lastly, we evaluate hybrid models (see sec. 4.4) on CIFAR-10 and the full Imagenet dataset, both in terms of interpretability and classification accuracy.

4.2 Models

Our results (secs. 5.1–5.3) are based on models of various sizes denoted by (S/L/XL)-CoDA on CIFAR-10 (S), Imagenet-100 (L), and TinyImagenet (XL); these models have 7-8M (S), 48M (L), and 62M (XL) parameters respectively (for S, the models with SQ and L2 non-linearity have 7.8M parameters and the models with WB have 7.1M (without embedding) and 7.2M (with embedding) parameters); see the supplement for details on the model architectures and an evaluation of the impact of model size on accuracy. For the hybrid networks (see secs. 4.4 and 5.4), we use a ResNet-56 (ResNet-50) as a base model on CIFAR-10 (Imagenet) and train CoDA Nets on feature maps extracted at different depths of the base models; see the supplement for details.

4.3 Input encoding

In sec. 3.1, we discussed that the norm-weighted cosine similarity between the dynamic weights and the layer inputs is optimised and the output of a DAU is at most the norm of its input. When using pixels as the input to the CoDA Nets, this favours pixels with large RGB values, since these have a larger norm and can thus produce larger outputs in the maximisation task. In our experiments, we explore two approaches to mitigate this bias: in the first, we add the negative image as three additional color channels and thus encode each pixel in the input as [ $r$ , $g$ , $b$ , $1-r$ , $1-g$ , $1-b$ ], with $r,g,b\in[0,1]$ .
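The first encoding can be sketched in a few lines of NumPy. The function name and the channels-last layout are assumptions made for illustration only.

```python
import numpy as np

def encode_with_negative(image):
    """Map an (H, W, 3) image with values in [0, 1] to (H, W, 6).

    Each pixel becomes [r, g, b, 1-r, 1-g, 1-b]. The channels-last
    layout is an assumption of this sketch.
    """
    image = np.asarray(image, dtype=np.float64)
    return np.concatenate([image, 1.0 - image], axis=-1)

dark = np.zeros((2, 2, 3))    # all-black patch: raw pixel norm 0
bright = np.ones((2, 2, 3))   # all-white patch: raw pixel norm sqrt(3)

enc_dark = encode_with_negative(dark)
enc_bright = encode_with_negative(bright)

# After encoding, black and white pixels have the same norm.
print(np.linalg.norm(enc_dark, axis=-1)[0, 0],
      np.linalg.norm(enc_bright, axis=-1)[0, 0])
```

In this encoding, the six channels of every pixel sum to 3, and a black pixel is no longer a zero vector that cannot contribute to a DAU output.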

Secondly, we show that it is also possible to train CoDA Nets on end-to-end optimised patch-embeddings and obtain similar performance in terms of interpretability and classification accuracy. Instead of computing the per-pixel contributions to assess the importance of spatial locations (cf. eq. (11)), in our experiments we thus decompose the output with respect to the contributions from the corresponding (learnt) embeddings via

$${\mathbf{s}}_{j}^{L}({\mathbf{x}}_{i})=\left[{\mathbf{W}}_{0\rightarrow L}(E({\mathbf{x}}_{i}))\right]_{j}^{T}\odot E({\mathbf{x}}_{i})\quad\text{,}\tag{13}$$

with $E(\cdot)$ denoting the applied embedding function and $L$ the number of CoDA layers in the network.
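As a minimal sketch of eq. (13), the following NumPy snippet computes a spatial contribution map for one class from a given linear mapping. The random arrays are placeholders for $E({\mathbf{x}}_{i})$ and ${\mathbf{W}}_{0\rightarrow L}(E({\mathbf{x}}_{i}))$; all names and shapes are assumptions, and temperature and bias are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_channels, height, width = 4, 6, 5, 5

# Placeholders: in a CoDA Net, W would be computed dynamically from the input.
embedding = rng.random((n_channels, height, width))               # E(x_i)
W = rng.standard_normal((n_classes, n_channels, height, width))   # W_{0->L}(E(x_i))

def contribution_map(W, embedding, class_idx):
    """Element-wise product [W]_j * E(x), summed over channels per location."""
    return (W[class_idx] * embedding).sum(axis=0)

# Linear output of the mapping for every class (before temperature and bias).
linear_out = W.reshape(n_classes, -1) @ embedding.reshape(-1)

s_2 = contribution_map(W, embedding, class_idx=2)
print(s_2.shape, np.isclose(s_2.sum(), linear_out[2]))
```

Because the decomposition is linear, the contributions of all locations sum to the linear output for the respective class.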

4.4 Interpolating between networks

Training CoDA Nets on learnt patch-embeddings (see sec. 4.3) naturally raises the question of how complex the embedding function should be and how large its receptive field. In particular, are pixel-wise importance values more useful than importance values for embeddings of patches of size $3\times 3$ ? How about $7\times 7$ or $64\times 64$ ? Of course, there is no single answer to this question and the ‘optimal’ complexity of the embedding model depends on the dataset, the task, and, ultimately, on the preferences of the end-user of such a model: for example, if a more complex embedding allows for more performant classifiers, one might wish to trade off model interpretability against model accuracy. In order to better understand such trade-offs, we propose to ‘interpolate’ between a conventional CNN and the CoDA Nets and investigate how this affects both model interpretability and model performance. Specifically, starting from a pre-trained CNN, we successively replace an increasing number of the later layers of the base model by CoDA layers. As such, the model output can be decomposed into contributions coming from spatially arranged embeddings computed by the truncated CNN model, which can give insights into how the embeddings are used to produce the classification results.

4.5 Additional details

In our experiments, we observed that rescaling the weight vectors of the DAUs explicitly according to eq. (2) resulted in long training times and high memory usage. To mitigate this, we opted to share the matrix ${\mathbf{B}}$ between all DAUs in a given layer when using the L2 or SQ non-linearity. This increases efficiency by having the DAUs share a common $r$ -dimensional subspace and still fixes the maximal rank of each DAU to the chosen value of $r$ . In contrast, networks with the WB non-linearity (see eDAUs in eq. (5)) are specifically designed to lower the computational costs of the DAUs and are easier to train. Therefore, for CoDA Nets built with eDAUs, we do not share the matrices ${\mathbf{B}}$ between the eDAUs. As the inputs are thus not restricted to a common low-dimensional subspace, we expect this to increase the modelling capacity of the CoDA Nets.

5 Experiments

In sec. 5.1 we assess the classification performance of the CoDA Nets. Further, in sec. 5.2 we evaluate the model-inherent contribution maps derived from ${\mathbf{W}}_{0\rightarrow L}$ (cf. eq. (13)) of a CoDA Net and compare them both qualitatively (cf. Fig. 5) as well as quantitatively (cf. Fig. 6) to other attribution methods. Additionally, in sec. 5.3, we discuss the impact of the different rescaling methods (cf. eqs. (2) and (5)) on model interpretability and evaluation speed. Lastly, in sec. 5.4 we investigate the hybrid models discussed in sec. 4.4. In particular, we analyse how the depth of the embedding function $E({\mathbf{x}})$ affects the model interpretability at different depths of the resulting hybrid network architecture.

5.1 Model performance

Model	C10	Model	T-IM
SENNs [18]	78.5%	ResNet-34 [29]	52.0%
DE-CapsNet [30]	93.0%	VGG 16 [31]	52.2%
VGG-19 [32]	93.4%	VGG 16 + aug [31]	56.4%
ResNet-110 [33]	93.6%	IRRCNN [34]	52.2%
DenseNet [35]	94.8%	ResNet-110 [36]	56.6%
WRN-28-2 [37]	94.9%	WRN-40-20 [38]	63.8%
S-CoDA-SQ	93.2%	XL-CoDA-SQ	54.4%
S-CoDA-L2	93.0%	XL-CoDA-SQ + aug	58.4%
S-eCoDA-WB	94.0%
S-eCoDA-WB + $E({\mathbf{x}})$	94.1%

TABLE I: CIFAR-10 (C10) and TinyImagenet (T-IM) classification accuracies. Results taken from specified references. The prefix of the CoDAs indicates model size, the suffix the non-linearity used (cf. eqs. (2) and (5)). Further, $E({\mathbf{x}})$ denotes that a learnt embedding was used as input to the model (see sec. 4.3).

Classification performance. In Table I we compare the performances of our CoDA Nets to several other published results. Note that the referenced numbers are meant to be used as a gauge for assessing the CoDA Net performance and do not exhaustively represent the state of the art. In particular, we would like to highlight that the CoDA Net performance is on par with that of models from the VGG [39] and ResNet [33] model families on both datasets. Additionally, we list the reported results of the SENNs [18] and the DE-CapsNet [30] architectures for CIFAR-10. Similar to our CoDA Nets, the SENNs were designed to improve network interpretability and are also based on the idea of explicitly modelling the output as a dynamic linear transformation of the input. On the other hand, the CoDA Nets share similarities with capsule networks, which we discuss in the supplement; to the best of our knowledge, the DE-CapsNet currently achieves the state of the art in the field of capsule networks on CIFAR-10. Overall, we observed that the CoDA Nets deliver competitive performances that are fairly robust to the non-linearity (see eqs. (2) and (5)) and the temperature ($T$); for an ablation study on the latter, see the supplement. Finally, while all models achieve good classification results, we note that the WB-based CoDA Nets perform slightly better than CoDA Nets with SQ or L2 non-linearity despite having a comparable number of parameters. As discussed in sec. 4.5, we attribute this to the fact that for those models we do not share the matrix ${\mathbf{B}}$ within the CoDA layers, which increases their modelling capacity.

5.2 Interpretability of CoDA Nets

In the following, we evaluate the model-inherent contribution maps and compare them to other commonly used methods for importance attribution. The evaluations are based on the XL-CoDA-SQ (T=6400) for TinyImagenet and the S-eCoDA-WB (T= $1e6$ ) for CIFAR-10, see Table I for the respective accuracies. The results are very similar for all three non-linearities (cf. sec. 5.3; more results are included in the supplement) and we therefore just show one per dataset as an example. Further, we evaluate the effect of training the S-CoDA-SQ architecture with different temperatures $T$ ; as discussed in sec. 3.4, we expect the interpretability to increase along with $T$ , since for larger $T$ a stronger alignment is required in order for the models to obtain large class logits. Lastly, in sec. 5.3 we compare how the different non-linearities (L2, SQ, WB) affect the model interpretability. Before turning to the results, however, in the following we will first present the attribution methods used for comparison and discuss the evaluation metrics employed for quantifying their interpretability.

Attribution methods. We compare the model-inherent contribution maps (cf. eq. (11)) to other common approaches for importance attribution. In particular, we evaluate against several perturbation based methods such as RISE [14], LIME [15], and several occlusion attributions [40] (Occ-K, with K the size of the occlusion patch). Additionally, we evaluate against common gradient-based methods. These include the gradient of the class logits with respect to the input image [41] (Grad), ‘Input $\times$ Gradient’ (IxG, cf. [3]), GradCam [7] (GCam), Integrated Gradients [9] (IntG), and DeepLIFT [8]. As a baseline, we also evaluated these methods on a pre-trained ResNet-56 [33] on CIFAR-10, for which we show the results in the supplement.

Evaluation metrics. Our quantitative evaluation of the attribution maps is based on the following two methods: we (1) evaluate a localisation metric by adapting the pointing game [42] to the CIFAR-10 and TinyImagenet datasets, and (2) analyse the model behaviour under the pixel removal strategy employed in [10]. For (1), we evaluate the attribution methods on $n\times n$ grids (with $n=3$) of images sampled from the corresponding datasets; in every grid of images, each class may occur at most once. For a visualisation with $n=2$, see Fig. 7.

For each occurring class, we can measure how much positive importance an attribution method assigns to the respective class image. Let $\mathcal{I}_{c}$ be the image for class $c$ , then the score $s_{c}$ for this class is calculated as

$$s_{c}=\frac{1}{Z}\sum_{p_{c}\in\mathcal{I}_{c}}p_{c}\quad\text{with}\quad Z=\sum_{k}\sum_{p_{c}\in\mathcal{I}_{k}}p_{c}\quad,\tag{14}$$

with $p_{c}$ the positive attribution for class $c$ assigned to the spatial location $p$. This metric has the same clear oracle score $s_{c}=1$ for all attribution methods (all positive attributions located in the correct grid image) and a clear score for completely random attributions $s_{c}=1/n^{2}$ (the positive attributions are uniformly distributed over the different grid images). Since this metric depends on the classification accuracy of the models, we sample the first $500$ (CIFAR-10) or $250$ (TinyImagenet) images according to their class score for the ground-truth class (we can only expect an attribution to specifically highlight a class image if this image can be correctly classified on its own; if all grid images have similarly low attributions, the localisation score will be random); note that since all attributions are evaluated for the same model on the same set of images, this does not favour any particular attribution method.
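The score of eq. (14) can be sketched as follows, assuming the attribution for class $c$ is given as a single 2D map over the full image grid; the function name and argument layout are illustrative.

```python
import numpy as np

def localisation_score(attribution, grid_n, target_cell):
    """Fraction of positive attribution that falls into the target grid image.

    attribution: 2D map for class c over the full grid, shape (grid_n*h, grid_n*w).
    target_cell: (row, col) position of the class-c image within the grid.
    """
    positive = np.clip(attribution, 0.0, None)   # only positive attributions count
    total = positive.sum()                       # normaliser Z in eq. (14)
    if total == 0:
        return float("nan")                      # score undefined without positive mass
    h = attribution.shape[0] // grid_n
    w = attribution.shape[1] // grid_n
    r, c = target_cell
    return float(positive[r * h:(r + 1) * h, c * w:(c + 1) * w].sum() / total)

uniform = np.ones((6, 6))                        # uninformative attribution, n = 3
oracle = -np.ones((6, 6))
oracle[2:4, 2:4] = 1.0                           # positive mass only in the centre image

print(localisation_score(uniform, 3, (1, 1)))    # 1 / n^2
print(localisation_score(oracle, 3, (1, 1)))     # oracle score
```

The two toy inputs reproduce the reference values stated above: $1/n^{2}$ for uniformly spread positive attributions and $1$ when all positive attributions lie in the correct grid image.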
For (2), we show how the model’s class score behaves under the removal of an increasing amount of least important pixels, where the importance is obtained via the respective attribution method. Since the first pixels to be removed are typically assigned negative or relatively little importance, we expect the model to initially increase its confidence (removing pixels with negative impact) or maintain a similar level of confidence (removing pixels with low impact) if the evaluated attribution method produces an accurate ranking of the pixel importance values. Conversely, if we were to remove the most important pixels first, we would expect the model confidence to quickly decrease. However, as noted by [10], removing the most important pixels first introduces artifacts in the most important regions of the image and is therefore potentially more unstable than removing the least important pixels first. Nevertheless, the model-inherent contribution maps perform well in this setting, too, as we show in the supplement. Lastly, in the supplement we qualitatively show that they pass the ‘sanity check’ of [3].
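The expected behaviour under pixel removal can be illustrated on a toy linear model, for which the exact contributions are known. Setting a removed input to zero, as well as all names and values below, are assumptions of this sketch rather than the protocol of [10].

```python
import numpy as np

def removal_curve(score_fn, x, importance, most_important_first=False):
    """Model score after successively zeroing inputs in order of importance."""
    order = np.argsort(importance)               # ascending: least important first
    if most_important_first:
        order = order[::-1]
    x = x.copy()
    scores = [score_fn(x)]
    for idx in order:
        x[idx] = 0.0                             # 'removal' = zeroing (assumption)
        scores.append(score_fn(x))
    return np.array(scores)

# Toy linear 'model': the contribution of input i is exactly w[i] * x[i].
w = np.array([2.0, -1.0, 0.5, -0.5, 3.0])
x = np.ones(5)

def score_fn(v):
    return float(w @ v)

least_first = removal_curve(score_fn, x, importance=w * x)
most_first = removal_curve(score_fn, x, importance=w * x, most_important_first=True)
print(least_first)
print(most_first)
```

With an accurate ranking, removing the least important inputs first raises the score while negative contributions are removed, whereas removing the most important inputs first makes it drop immediately.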

Quantitative results. In Fig. 6, we compare the contribution maps of the CoDA Nets to other attributions under the evaluation metrics discussed above. It can be seen that the CoDA Nets (1) perform well under the localisation metric given by eq. (14) and outperform all the other attribution methods evaluated on the same model, both for TinyImagenet (top row, left) and CIFAR-10 (top row, center); note that we excluded RISE and LIME on CIFAR-10, since the default parameters do not seem to transfer well to this low-resolution dataset. Moreover, (2) the CoDA Nets perform well in the pixel-removal setting: the least salient locations according to the model-inherent contributions indeed seem to be among the least relevant for the given class score on both datasets, see Fig. 6 (bottom row, left and center); note that the Occ-K explanations directly estimate the impact of occluding pixels and are thus expected to perform well under this metric. Further, in Fig. 6 (right column), we show the effect of temperature scaling on the interpretability of CoDA Nets with SQ rescaling trained on CIFAR-10. The results indicate that the alignment maximisation is indeed crucial for interpretability and constitutes an important difference of the CoDA Nets to other dynamic linear networks such as piece-wise linear networks (ReLU-based networks). In particular, by structurally requiring a strong alignment for confident classifications, the interpretability of the CoDA Nets forms part of the optimisation objective. Increasing the temperature increases the alignment and thereby the interpretability of the CoDA Nets. While we observe a downward trend in classification accuracy when increasing $T$ , the best model at $T=10$ only slightly improved the accuracy compared to $T=1000$ ( $93.2\%\rightarrow 93.6\%$ ); for more details, see supplement.

In summary, the results show that by combining dynamic linearity with a structural bias towards an alignment with discriminative patterns, we obtain models which inherently provide an interpretable linear decomposition of their predictions. Further, given that we better understand the relationship between the intermediate computations and the optimisation of the final output in the CoDA Nets, we can emphasise model interpretability in a principled way by increasing the ‘alignment pressure’ via temperature scaling.

Qualitative results. In Fig. 5, we visualise spatial contribution maps of an L-CoDA-SQ model (trained on Imagenet-100) for some of its most confident predictions. Note that these contribution maps are linear decompositions of the output and the sum over these maps yields the respective class logit. In Fig. 8, we additionally present a visual comparison to the best-performing post-hoc attribution methods; note that RISE cannot be displayed well under the same color coding and we thus use its default visualisation. We observe that the different methods are not inconsistent with each other and roughly highlight similar regions. However, the inherent contribution maps are of much higher detail and compared to the perturbation-based methods do not require multiple model evaluations. Much more importantly, however, all the other methods are attempts at approximating the model behaviour post-hoc, while the CoDA Net contribution maps in Fig. 5 are derived from the model-inherent linear mapping that is used to compute the model output.

5.3 Interpretability and efficiency of L2, SQ, and WB

In the following, we present results regarding the impact of the different normalisation methods (L2, SQ, WB) on model interpretability and model efficiency.

Model interpretability. In Fig. 9, we show the results of the interpretability metrics for models with different rescaling functions (L2/SQ/WB, see eqs. (2) and (5)) as well as for a model trained with a learnt patch-embedding $E({\mathbf{x}})$. As an embedding function, we simply apply a $3\times 3$ convolution with 32 filters, followed by a batch normalisation layer [43]. For comparison to post-hoc methods evaluated on a CoDA Net, we kindly refer the reader to the center column of Fig. 6. As can be seen, it is possible to obtain highly interpretable models under all four settings: (1) the linear contributions allow for localising the class images well (localisation metric, left) and (2) the models are insensitive to input features that are not contributing to the output as per the linear transformation matrix ${\mathbf{W}}_{0\rightarrow L}$ (for the SQ, L2, and WB models, the features are pixels under the static encoding function described in sec. 4.3; for the $E$-WB model, the input features to the CoDA Net are learnt patch embeddings). Note that for the model with a learnt patch-embedding, denoted by $E$-WB, we show two results for the perturbation metric. First, we ‘zero out’ the embeddings at each location ordered by their assigned importance (blue crosses). As the embeddings are the input features to the CoDA Net, the model confidence shows the expected behaviour of being insensitive to unimportant inputs. In contrast, the assigned importance values do not translate to the center pixels of the embeddings: when zeroing out the center pixels according to the contributions of the patch-embeddings, the model confidence drops more quickly (see red crosses). This distinction is important to keep in mind when evaluating CoDA Nets on input embeddings, since it is easy to wrongly interpret such contribution maps. If the input to the CoDA Net is an embedding of an image patch, it depends on the embedding function how the contributions are to be distributed to the image pixels.
Lastly, note that different from the center column in Fig. 6, the metrics are evaluated for four different models and are thus not comparisons between different explanation methods, but rather between different models under the same explanation. As such, the differences in the localisation metric, for example, do not show that the linear decompositions are generally better suited to explain WB-based models as compared to SQ- or L2-based models; the differences might instead reflect the fact that the models learnt more robust and class-specific representations, which yield both better results in the localisation task as well as higher classification accuracy.

Model efficiency. While all three non-linearities can yield interpretable CoDA Nets, the computational cost of the different approaches for bounding the DAU outputs differs. For example, by avoiding the explicit calculation of the $d$-dimensional weight vector in eq. (5), the eDAUs are able to save both memory as well as floating point operations: the computed vectors are of size $r\ll d$ and the dot-product in the low-dimensional space requires $O(r)$ operations instead of $O(d)$. Being the fundamental building block of the CoDA Nets, such gains in efficiency can have considerable impact, since the corresponding computations are performed in every layer for every unit and at each spatial position of the input to the respective layer. Accordingly, in practice (in our experiments, we rely on the highly optimised implementations for convolutions from the PyTorch [44] library) we observed that the weight bounding approach in the eDAUs (eq. (5)) can yield significant speed-ups and memory savings, especially for high-dimensional inputs. For example, in Fig. 10 we plot the memory consumption and forward-pass speeds for the three different models without learnt embedding function (see Table I) for varying batch sizes on the CIFAR-10 dataset: while SQ and L2 perform similarly, the WB-based model scales better to larger inputs.
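The source of this saving can be sketched under the assumption that the dynamic weight vector factorises as ${\mathbf{w}}={\mathbf{A}}{\mathbf{v}}$ with an $r$-dimensional vector ${\mathbf{v}}$. This factorisation, the symbol names, and the omission of the normalisation are assumptions of this sketch, since eq. (5) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16                       # input dimension and rank, with r << d

A = rng.standard_normal((d, r))      # hypothetical factor mapping r -> d dimensions
v = rng.standard_normal(r)           # hypothetical r-dimensional dynamic vector
x = rng.standard_normal(d)           # input to the unit

# Explicit route: materialise the d-dimensional weight vector, then take a
# dot product over d entries.
explicit = (A @ v) @ x

# Low-dimensional route: project the input once, then take a dot product
# over only r entries; the d-dimensional weight vector is never formed.
low_dim = v @ (A.T @ x)

print(np.isclose(explicit, low_dim))
```

Both routes give the same value because $({\mathbf{A}}{\mathbf{v}})^{T}{\mathbf{x}}={\mathbf{v}}^{T}({\mathbf{A}}^{T}{\mathbf{x}})$; the second never stores a $d$-dimensional weight vector per unit and spatial position.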

Additionally, we measured memory consumption and training time for two models with the same architecture on Imagenet (L-CoDA, see beginning of sec. 5) for the SQ and the WB rescaling methods. For this, we updated the models $\approx 8000$ times with a batch size of 16 and recorded the overall time as well as the GPU memory consumption. In these experiments, the WB-based model required less than a third of the memory (9.7GB vs. 30.0GB) and completed the updates more than $1.5\times$ faster (8.7 minutes vs. 14.1 minutes). All experiments regarding evaluation speed were performed on an NVIDIA Quadro RTX 8000 GPU with 48GB of memory.

5.4 Hybrid CoDA Networks

In this section, we assess the interpretability of hybrid CoDA Nets, which combine conventional CNN layers and CoDA layers in one network model. For our experiments, we use varying numbers of pre-trained CNN layers as feature extractors on top of which a CoDA Net is trained as a classifier. Such a hybrid structure can prove useful in cases where CoDA Nets do not (yet) yield the same accuracy as conventional architectures; we kindly refer the reader to the supplement for details on the network architectures used in these experiments.

In particular, we use the first $K$ layers of a pre-trained ResNet model [33] as feature extractors. Since ResNets are piece-wise linear models, the hybrids are still dynamic linear and we can assign importance values to input features according to their effective linear contribution; importantly, the input features can be extracted at any depth of the network as the output of a CoDA layer or a ResNet block, or as the actual input pixels. In order to assess whether such hybrids are more interpretable than the base model, we compute spatial contribution maps with respect to different activation maps within the network and evaluate them under the localisation metric (see sec. 5.2, ‘Evaluation metrics’). The contribution maps according to the dynamic linear mapping can be obtained via ‘Input $\times$ Gradient’, where for the gradient calculation we treat the dynamic matrices in the CoDA layers as fixed.
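To make the notion of an effective linear contribution concrete for a piece-wise linear model, the following NumPy sketch uses a tiny bias-free two-layer ReLU network. The network, its sizes, and all names are illustrative assumptions; with bias terms, the contributions would no longer sum exactly to the output.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 5))     # first layer of a toy bias-free ReLU network
W2 = rng.standard_normal((3, 8))     # second (output) layer
x = rng.standard_normal(5)

def relu_net(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

def effective_linear_map(x):
    """Input-dependent matrix W(x) with relu_net(x) == W(x) @ x."""
    active = (W1 @ x > 0).astype(float)      # which ReLUs are switched on for x
    return W2 @ (active[:, None] * W1)       # equals the Jacobian at x

W_eff = effective_linear_map(x)
contributions = W_eff * x                    # 'Input x Gradient', one row per output

print(np.allclose(contributions.sum(axis=1), relu_net(x)))
```

The decomposition is exact for this toy network, yet nothing in its training forces `W_eff` to align with discriminative input patterns, which is the property the DAUs add.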

CIFAR-10. For the following experiments, we use a pre-trained ResNet-56 obtained from [45]. This model consists of a convolutional layer + batch normalisation [43] (C+B), followed by three times nine residual blocks (RBs) as well as a fully connected and a pooling (FC+P) layer; for more details we kindly refer the reader to the original work [33] and the implementation [45] on which we base these experiments. We can summarise this model by [C+B, 9RB, 9RB, 9RB, FC+P]; we will further denote individual segments $\mathcal{S}_{i}$ of the model by their index in this summary counting from the back, e.g., $\mathcal{S}_{5}=$[C+B] and $\mathcal{S}_{1}=$[FC+P]. In order to evaluate the interpretability of this model at different depths $t$, we split it at different points into two virtual parts: an embedding function $E_{t}({\mathbf{x}})$ and a classification head (CH$_{t}$). For a given split, we then regard the output of $E_{t}({\mathbf{x}})$ as the input to the classification head and linearly decompose the latter according to the respective linear transformation performed by the model; e.g., with an explanation depth of $2$ we refer to the split in which $\mathcal{S}_{2}$ is the first element in CH$_{2}$ and we evaluate linear contribution maps obtained for the classification head CH$_{2}=$[9RB, FC+P] on the preprocessed input $E_{2}({\mathbf{x}})=$[C+B, 9RB, 9RB]$({\mathbf{x}})$. By performing this evaluation for various splits, we can assess the ‘interpretable depth’ of a model. In particular, we evaluate how well the contribution maps at different depths allow for localising the correct class images in the localisation task.
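The indexing convention for the splits can be written down as a short helper. The list of strings simply mirrors the summary [C+B, 9RB, 9RB, 9RB, FC+P]; the function name is an illustrative assumption.

```python
SEGMENTS = ["C+B", "9RB", "9RB", "9RB", "FC+P"]   # S_5, S_4, S_3, S_2, S_1

def split_at_depth(segments, t):
    """Split a model summary into (E_t, CH_t) for explanation depth t.

    Segments are indexed from the back, so S_t is the first element of CH_t.
    """
    if not 1 <= t <= len(segments):
        raise ValueError("t must be between 1 and the number of segments")
    cut = len(segments) - t
    return segments[:cut], segments[cut:]

for t in range(1, len(SEGMENTS) + 1):
    embedding, head = split_at_depth(SEGMENTS, t)
    print(f"t={t}: E_t={embedding}, CH_t={head}")
```

For $t=2$ this yields the embedding [C+B, 9RB, 9RB] and the classification head [9RB, FC+P], matching the example in the text.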

In order to investigate the effect of CoDA layers on the interpretable depth, we train and evaluate four different hybrid models. For this, we replace an increasing number of segments $\mathcal{S}_{i}$ by CoDA layers, starting from $\mathcal{S}_{1}$; in Fig. 11 (left), we denote the base model by R5-C0 (5 ResNet segments, 0 CoDA segments) and the hybrids according to the number of replaced segments, e.g., for R3-C2 we replaced the last two segments by CoDA layers (in detail, each segment of 9RB is replaced by a set of 3 CoDA layers; the final network segment [FC+P] is replaced by a single CoDA layer followed by a global pooling operation). For each of these models, we can decompose the model outputs in terms of contributions from spatial positions for the embedding functions $E_{t}$ defined by different splits $t$. Note that the base model (ResNet-56) is piece-wise linear and we can thus still compute linear contributions at any depth of this hybrid network. In Fig. 11 (left) we show the results of the localisation metric for all networks at various depths; the classification accuracies can be found in Table II. As can be seen, the linear contributions are good explanations of the class logits as long as the classification head entirely consists of CoDA layers, and the localisation performance drops as soon as we include a segment with ResNet blocks in the classification head CH. Again, this highlights that dynamic linearity alone is not enough to obtain useful linear decompositions of the model outputs, but that the alignment property is crucial for the interpretability of the CoDA layers.

Imagenet. In the following, we show that the gains from interpolating between networks also extend to a more complex dataset. Similar to the interpolation experiments on CIFAR-10 above, the results are based on an interpolation between a pretrained ResNet model (ResNet-50) and a CoDA-based classification head. However, given the high-dimensional representations produced by the later ResNet layers (up to 2048 channels), the parameters of the classification head increase drastically if the high dimensionality is maintained throughout the CoDA layers. Therefore, in the Imagenet experiments, we first compute a low-dimensional projection $\tilde{{\mathbf{x}}}={\mathbf{P}}{\mathbf{x}}$ of the inputs to the convolutional kernels to which we apply the eDAUs (see eq. (5)); similarly to the dynamic weights of DAUs with L2 normalisation, we normalise the rows of the matrix ${\mathbf{P}}$ to unit norm to maintain a parameter-independent bound of the network.

TABLE II: Classification accuracies for hybrid networks. R$X$-C$Y$ denotes how many segments were replaced by CoDA layers; on Imagenet, maximally up to five ResNet blocks from the end of the network were replaced and all networks thus still rely on a ResNet-based stem. On CIFAR-10 the accuracy can be maintained whilst improving interpretability (see Fig. 11). On Imagenet, on the other hand, we observe a trade-off in accuracy when increasing the ‘interpretable depth’ of the models.

CIFAR-10	R5-C0	R4-C1	R3-C2	R2-C3	R1-C4
CIFAR-10	93.4%	93.6%	93.4%	93.6%	93.8%
Imagenet	R5-C0	R3-C2	R2-C3	R1-C4	R0-C5
Imagenet	76.1%	74.7%	73.3%	71.7%	71.4%

For the interpolation experiments, we successively replace the last 5 residual blocks of the ResNet-50 base model (the ‘segments’ here correspond to FC+P or individual residual blocks) by a single CoDA layer each and assess the interpretability via the localisation metric as well as model accuracy, see Fig. 11 (right) and Table II respectively. Similar to the CIFAR-10 experiments, we observe an increase in ‘interpretable depth’ (Fig. 11, right). However, while on CIFAR-10 the accuracy of the base model could be maintained, on Imagenet we observe a trade-off in accuracy. While better results can certainly be achieved by further optimising the network architectures and/or fine-tuning the learnt embeddings, our results show that it is possible to increase the interpretability of performant classification models by using a classification head composed of CoDA layers.

6 Discussion and conclusion

We present a new family of neural networks, the CoDA Nets, and show that they are performant classifiers with a high degree of interpretability (code is available at github.com/moboehle/CoDA-Nets). In particular, we first introduced the Dynamic Alignment Units (DAUs), which model their output as a dynamic linear transformation of their input and have a structural bias towards alignment maximisation. This bias is induced by ensuring that a DAU can only produce large outputs if its weights are well-aligned with the input, since the dynamically applied weights are explicitly normalised. In order to lower the computational costs of the DAUs, we further introduce the eDAUs, for which we normalise the weights by an upper bound of their norms which is cheaper to compute. Using the DAUs to model filters in a convolutional network, we obtain the Convolutional Dynamic Alignment Networks (CoDA Nets). The successive linear mappings by means of the DAUs within the network make it possible to linearly decompose the output into contributions from individual input dimensions; in contrast to piece-wise linear networks, which are also dynamic linear, the alignment property of the DAUs ensures that the linear decomposition aligns with discriminative patterns in the input. In order to assess the quality of these contribution maps, see eq. (11), we compare against other attribution methods. We find that the CoDA Net contribution maps consistently perform well under commonly used quantitative metrics and are robust to the applied normalisation scheme. Beyond their interpretability, the CoDA Nets constitute performant classifiers: their accuracies on the CIFAR-10 and TinyImagenet datasets are on par with those of the commonly employed VGG and ResNet models. Lastly, we show that CoDA layers can be combined with conventional networks, which yields hybrid models with an increased ‘interpretable depth’ compared to the base model.
We believe that such hybrid models hold great potential, since they take advantage of the high modelling capacity and efficiency of modern neural networks whilst allowing for a user-defined ’minimal interpretability’. For example, such networks could allow for localising regions of importance for the model decision at a desired granularity by restricting the receptive field of the feature extractors.

References

[1] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in International Conference on Machine Learning (ICML), 2010.
[2] G. F. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, “On the Number of Linear Regions of Deep Neural Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.
[3] J. Adebayo, J. Gilmer, M. Muelly, I. J. Goodfellow, M. Hardt, and B. Kim, “Sanity Checks for Saliency Maps,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
[4] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” in International Conference on Learning Representations (ICLR), Workshop, 2014.
[5] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for Simplicity: The All Convolutional Net,” in International Conference on Learning Representations (ICLR), Workshop, 2015.
[6] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[7] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization,” in International Conference on Computer Vision (ICCV), 2017.
[8] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning Important Features Through Propagating Activation Differences,” in International Conference on Machine Learning (ICML), 2017.
[9] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,” in International Conference on Machine Learning (ICML), 2017.
[10] S. Srinivas and F. Fleuret, “Full-Gradient Representation for Neural Network Visualization,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[11] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation,” PLoS ONE, 2015.
[12] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, “Explainable AI: A Review of Machine Learning Interpretability Methods,” Entropy, vol. 23, no. 1, p. 18, 2021.
[13] S. M. Lundberg and S. Lee, “A Unified Approach to Interpreting Model Predictions,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[14] V. Petsiuk, A. Das, and K. Saenko, “RISE: Randomized Input Sampling for Explanation of Black-box Models,” in British Machine Vision Conference (BMVC), 2018.
[15] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” in International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2016.
[16] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. Su, “This Looks Like That: Deep Learning for Interpretable Image Recognition,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[17] W. Brendel and M. Bethge, “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet,” in International Conference on Learning Representations (ICLR), 2019.
[18] D. Alvarez-Melis and T. S. Jaakkola, “Towards Robust Interpretability with Self-Explaining Neural Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
[19] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2016.
[20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning (ICML), 2015.
[21] C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psychometrika, 1936.
[22] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[23] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic Routing Between Capsules,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
[24] Y. LeCun, “MNIST handwritten digit database,” http://yann.lecun.com/exdb/mnist, 1998.
[25] I. Shafkat, “Intuitively Understanding Convolutions for Deep Learning,” https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1#ad33, 2018.
[26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
[27] J. et al., “Tiny ImageNet Visual Recognition Challenge,” https://tiny-imagenet.herokuapp.com/, accessed: 2020-11-10.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[29] L. Sun, “ResNet on Tiny ImageNet,” http://cs231n.stanford.edu/reports/2017/pdfs/12.pdf, 2016, accessed: 2020-11-16.
[30] B. Jia and Q. Huang, “DE-CapsNet: A Diverse Enhanced Capsule Network with Disperse Dynamic Routing,” Applied Sciences, 2020.
[31] learningai.io, “VGGNet and Tiny ImageNet,” https://learningai.io/projects/2017/06/29/tiny-imagenet.html, accessed: 2020-11-08.
[32] T. Li, J. Li, Z. Liu, and C. Zhang, “Few sample knowledge distillation for efficient network compression,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 14639–14647.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[34] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Improved inception-residual convolutional neural network for object recognition,” Neural Computing and Applications, 2020.
[35] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2017.
[36] C. Termritthikun, Y. Jamtsho, and P. Muneesawang, “An improved residual network model for image recognition using a combination of snapshot ensembles and the cutout technique,” Multimedia Tools and Applications, 2020.
[37] S. Zagoruyko and N. Komodakis, “Wide Residual Networks,” in British Machine Vision Conference (BMVC), 2016.
[38] D. Hendrycks, K. Lee, and M. Mazeika, “Using Pre-Training Can Improve Model Robustness and Uncertainty,” in Proceedings of Machine Learning Research (PMLR), 2019.
[39] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR), 2015.
[40] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in European Conference on Computer Vision (ECCV), 2014.
[41] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, “How to explain individual classification decisions,” The Journal of Machine Learning Research, 2010.
[42] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-Down Neural Attention by Excitation Backprop,” Int. J. Comput. Vis., 2018.
[43] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2019.
[45] Y. Idelbayev, “Proper ResNet Implementation for CIFAR10/CIFAR100 in PyTorch,” https://github.com/akamaster/pytorch_resnet_cifar10, accessed: 2020-06-05.
