
Learning local discrete features in explainable-by-design convolutional neural networks

Pantelis I. Kaplanoglou¹*   Konstantinos Diamantaras¹
¹ Department of Information and Electronic Engineering, International Hellenic University, Sindos, Greece
{pikaplanoglou, kdiamant}@ihu.gr
* Primary contribution

arXiv:2411.00139v1 [cs.LG] 31 Oct 2024

Abstract
Our proposed framework attempts to break the trade-off between performance and explainability by introducing an explainable-by-design convolutional neural network (CNN) based on the lateral inhibition mechanism. The ExplaiNet model consists of a predictor, a high-accuracy CNN with residual or dense skip connections, and an explainer probabilistic graph that expresses the spatial interactions of the network's neurons. The value at each graph node is a local discrete feature (LDF) vector, a patch descriptor that represents the indices of antagonistic neurons ordered by the strength of their activations, learned with gradient descent. Treating LDFs as sequences, we can increase the conciseness of explanations by repurposing EXTREME, an EM-based sequence motif discovery method typically used in molecular biology. Having a discrete feature motif matrix for each intermediate image representation, instead of a continuous activation tensor, allows us to leverage the inherent explainability of Bayesian networks. By collecting observations and directly calculating probabilities, we can explain causal relationships between motifs of adjacent levels and attribute the model's output to global motifs. Moreover, experiments on various tiny-image benchmark datasets confirm that our predictor ensures the same level of performance as the baseline architecture for a given count of parameters and/or layers. Our novel method shows promise to exceed this performance while providing an additional stream of explanations. On the solved MNIST classification task, it reaches performance comparable to the state of the art for single models, using a standard training setup and 0.75 million parameters.

1 Introduction
Deep learning has yielded high-accuracy models, which was the initial prerequisite for the applicability of Artificial Intelligence (AI). With the advent of the transformer [1] architecture, the technical robustness of state-of-the-art models allows for real-world applications in all domains, including Computer Vision [2]. Nevertheless, there are still challenges to be addressed for AI to become trustworthy [3] and socially robust. This is the focus of the evolving field of Explainable Machine Learning (ExML), which falls under the scientific branch of Explainable AI (XAI). Amongst others, we can identify seven "showstopper" issues for the seamless integration of AI into society: reliability, trustworthiness, bias, privacy, physical security, human manipulation, and ethics. Each of them requires explainability, which is a superset of interpretability; both provide human-understandable insight into models, but the main difference is the completeness of explanations [4][5], defined as the capability of providing interpretations for any processing node of the model, from input to output or vice versa. To understand what an explainable model should do, we can draw an equivalence to regular software, where the source code is inherently explainable. State-of-the-art
models should offer "debugging" capabilities for defects caused by outlier or adversarial samples, overfitting, and confabulations [6], which in humans are attributed to overlearning. This capability will increase reliability, allow us to investigate dataset/model bias, and in turn ensure physical security.
The explainable-by-design approach deploys inherently explainable models, for example a simple linear SVM classifier with its comprehensible decision surface. Yet, the performance of non-black-box models follows a trend: increasing explainability decreases performance, which is known as the Performance-Explainability Trade-off (PET). Considering the carbon footprint of training models [7], a third dimension, time, creates a new trade-off called PET+ [8]. It is already known that more training time leads to higher performance; considering the extra effort of providing interpretations or explanations, more explainability also requires more time. This creates a new requirement: to assess model efficiency together with accuracy and explainability.
This work’s main contributions can be summarized in the following points:

• We propose a novel lateral inhibition (LIN) layer that can be incorporated into the design
of any CNN model. We prove that it regulates gradient descent learning of local discrete
feature (LDF) vectors, that contain indices of neurons ordered by activation strength.
• We introduce ExplaiNet, an explainable-by-design classifier. The predictor is a CNN that provides increased accuracy through non-explainable continuous activations, which are synchronously converted into discrete image representations used by an explainer probabilistic graph.
• We have experimented with various ExplaiNets of two different architectures on tiny image datasets, using various combinations of features per layer, and observed equal or greater prediction accuracy compared to the baseline CNN.
• We increase the conciseness of explanations by borrowing the concept of sequence motifs from the field of Molecular Biology. These are learned by repurposing any existing unsupervised Expectation-Maximization algorithm for motif discovery.
• We evaluate the fidelity of discrete feature motifs (FMotifs) in all intermediate levels of the
network, to provide causal explanations for the emergence of FMotifs and their contribution
to the predicted class.

2 Preliminaries
2.1 Terms and notation

We present all terminology and notation used in the rest of the paper. The 2D convolution operation in CNNs is a linear function $a^{(k)}(\cdot)$ that moves a window over the input tensor $A^{(k)}$ of layer $k = 1, \dots, K$, through positions $x, y$. The window corresponds to a tensor slice $X^{(k)}_{x,y}$ that is the receptive field shared by all neurons operating at $x, y$, which form a hypercolumn, the respective term in Neuroscience. The hypercolumn has a continuous activation vector $a^{(k)}_{x,y}$ and the weights of the convolution operation are stored in the kernel $W \in \mathbb{R}^{n \times n \times c_{in} \times c_{out}}$. Considering activation values as features, the term for $a^{(k)}_{x,y}$ in Computer Vision is patch descriptor. A CNN starts with the stem, followed by blocks of feature extraction; each block has a stack of neural modules that are composed of learnable and non-learnable layers. The network's representation output is fed to the classifier's fully connected neurons, which have the softmax activation function, denoted in this work as $s(\cdot)$.

Figure 1: Overview of the ExplaiNet model. A black-box feed-forward (orange arrows) neural network predictor offers high prediction accuracy. Streams of discrete features (green arrows) provide values to the nodes of the probabilistic explainer graph, which uses them to explain predictions and intermediate features. The nodes are mapped to spatial positions of the input at each level.
Residual convolutional module: Without normalization, a residual network module [9] is a non-linear transformation $\Phi^{(l)}_{Res}(X^{(k-2)}; \Omega) = \mathrm{relu}\big(a^{(k)}(\mathrm{relu}(a^{(k-1)}(X^{(k-2)}))) + X^{(k-2)}\big)$, where $l$ is the level of the intermediate image representation, $\mathrm{relu}(\cdot)$ a rectifier function and $\Omega$ a set of weight and bias parameters, which are implicitly included in the convolution functions $a^{(k)}(\cdot)$.
Dense convolutional module: A densely connected block [10] has $\tau$ stacked modules $\Phi^{(l)}_{Dens}(X^{(k-1)}; \Omega) = a^{(k)}(\mathrm{relu}(X^{(k-1)}))$. Their outputs are concatenated in $B^{(k)}_{Dens}(X^{(l-\tau)}) = \|\Phi^{(k)}_{Dens}(A^{(k-1)}; \Omega), \Phi^{(k-1)}_{Dens}(A^{(k-2)}; \Omega), \dots, \Phi^{(k-\tau+1)}_{Dens}(X^{(k-\tau)}; \Omega)\|$. Between subsequent blocks there are transition modules $\Phi^{(k+1)}_{DensTran}(B^{(k)}_{Dens}; \Omega)$ that additionally perform spatial downsampling.

2.2 Gene sequence motif discovery

In the field of Molecular Biology, patterns in genome sequences that are important for explaining a biological function are called sequence motifs [11]. These recur in genomic data, and their presence is used to investigate the causality of some observable biological outcome, such as gene expression. Motifs are stochastic sequences, where each gene symbol has a probability of occurrence at a specific position of the motif. This is represented in $M_{PPM} \in \mathbb{R}^{n_{motif} \times \delta}$, a Position-specific Probability Matrix (PPM) for a motif sequence of length $n_{motif}$ where each position can take one of $\delta$ possible discrete values; for the DNA nucleotides {A, T, C, G} we have $\delta = 4$. The PPM is used to generate a motif logo, which visually depicts the probability of each symbol at each position.
The common approach to motif discovery is unsupervised learning, with algorithms that are based on Expectation-Maximization (EM) [12]. The popular MEME algorithm [13] has quadratic time complexity, and several improvements have been proposed, namely DREME [14], EXTREME [15] and STREME [16]. The problem of EM's sensitivity to bootstrap conditions is alleviated by carefully choosing them from candidate sequences, which is known in this context as seeding. The more recent EXTREME algorithm employs online EM [17], which has linear time complexity.
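To make the PPM concrete, here is a minimal sketch, assuming a toy set of aligned DNA sequences and a small pseudocount (both illustrative, not data from the paper), that builds the position-specific probability matrix described above.

```python
import numpy as np

# Toy aligned DNA sequences; purely illustrative, not from the paper.
sequences = ["TACGAT", "TATGAT", "TACGTT", "GACGAT"]
alphabet = "ATCG"  # delta = 4 possible symbols per position

counts = np.zeros((len(sequences[0]), len(alphabet)))
for seq in sequences:
    for pos, symbol in enumerate(seq):
        counts[pos, alphabet.index(symbol)] += 1

# Position-specific Probability Matrix (PPM), with a small pseudocount
# to avoid zero probabilities; shape: n_motif x delta.
pseudocount = 0.1
ppm = (counts + pseudocount) / (counts + pseudocount).sum(axis=1, keepdims=True)
print(ppm.round(2))  # each row sums to 1: P(symbol | position)
```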

2.3 Metrics

Fidelity to Output: A criterion of how well a surrogate explainer pertains to the predictor's classification output is fidelity to output (FTO), as suggested in [18]. In an image dataset $D = \{X(i)\}$, the predicted class for sample $i$ is $\hat{y}_i$ and $\overset{\therefore}{y}_i^{(l)}$ is a class index that has been explained based on features of level $l$, where $[\![\cdot]\!]$ is the Iverson bracket. The FTO metric is:
$$fto^{(l)}(D) = \frac{\sum_i^n [\![\hat{y}_i = \overset{\therefore}{y}_i^{(l)}]\!]}{|D|} \quad (1)$$

Fidelity of Cause to Effect: To express how well an observed feature at a level, $m_i^{(l)}(C_i^{(l-1)})$, that is considered an effect of the model's function, can be attributed to a set of cause features $C_i^{(l-1)} = \{m_j^{(l-1)}\}$ from the previous level, we introduce the fidelity of cause to effect (FCE) metric:
$$fce^{(l)}(D) = \frac{1}{n_{fx}} \sum_i^{n_{fx}} [\![\, m_i^{(l)}(C_i^{(l-1)}) = \hat{m}_i^{(l)}(C_i^{(l-1)}) \,]\!] \quad (2)$$
where $n_{fx}$ is the total count of unique effects observed in the dataset, $m_i^{(l)}(C_i^{(l-1)})$ the fact of an effect occurrence, and $\hat{m}_i^{(l)}(C_i^{(l-1)})$ the explained effect that is inferred by a probabilistic explanation model from the same observed causes.
Relative Model Efficiency: To evaluate parameter efficiency in a group of models $G$ that have been trained on the same task, we introduce the relative model efficiency (RME) index. A model's metric $\mu_i$ is normalized with the minimum and maximum values in the group: $\nu_i = \frac{\mu_i - \mu_{max}}{\mu_{max} - \mu_{min}}$. The size of the model $s$ is in millions of parameters (MP), and its magnitude is reduced in the denominator to avoid overpenalizing large but accurate models. For a model $i \in G$ the RME is:
$$rme(i, G) = \frac{\nu_i(2\nu_i - 1)}{\sqrt{s}} \quad (3)$$
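The three metrics can be computed directly from collected observations; the sketch below is our own illustration, with hypothetical array names, following Eqs. (1)-(3) as written above.

```python
import numpy as np

def fto(y_pred, y_explained):
    """Fidelity to output, Eq. (1): fraction of samples whose explained
    class at level l matches the predictor's class."""
    return np.mean(np.asarray(y_pred) == np.asarray(y_explained))

def fce(effects, inferred_effects):
    """Fidelity of cause to effect, Eq. (2): fraction of unique observed
    effects that the probabilistic explainer infers from the same causes."""
    return np.mean(np.asarray(effects) == np.asarray(inferred_effects))

def rme(mu_i, mu_min, mu_max, size_mp):
    """Relative model efficiency, Eq. (3); size_mp is the model size in MP."""
    nu = (mu_i - mu_max) / (mu_max - mu_min)  # normalization as written in the text
    return nu * (2.0 * nu - 1.0) / np.sqrt(size_mp)
```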

3 Proposed Framework
3.1 Lateral inhibition layer

An intuitive way to explain the individual activations in the hypercolumns of a module is to order them by their strength. The maximum activation value for a given input should correspond to the neuron that best matches a feature pattern according to the weights in the kernel. This behaviour is not ensured by the way the model parameters are updated, which is based on their backpropagated contributions to the output loss. Many neurons can have equal responses to the same input feature, creating redundancy. A mechanism that ensures that a winner neuron amplifies its activation and inhibits others is known in Neuroscience as lateral inhibition: in a group of co-adapting neurons, the response of one neuron antagonizes the responses of the others to the same stimuli. Our work incorporates this mechanism into the gradient descent learning process by adding a new Lateral Inhibition Layer (LIL) after the linear transformation performed by a convolutional layer. The lateral inhibition function leverages the properties of the softmax function [19] to implement the antagonism as:
$$z^{(l)}_{x,y} = f_{LI}(a^{(l)}_{x,y}) = a^{(l)}_{x,y}\,\big(1 + s(a^{(l)}_{x,y})\big) \quad (4)$$

We use the LIL inside residual and dense modules, after the last convolution operation and before any non-linear or normalization layer. Their respective transformation functions become:
$$\Phi^{(l)}_{Res}(X^{(k-2)}; \Omega) = \mathrm{relu}\Big(a^{(k)}\big(\mathrm{relu}(a^{(k-1)}(X^{(k-2)}))\big)\,\big(1 + s[a^{(k)}(\mathrm{relu}(a^{(k-1)}(X^{(k-2)})))]\big) + X^{(k-2)}\Big)$$
$$\Phi^{(l)}_{Dens}(X^{(k-1)}; \Omega) = a^{(k)}\big(\mathrm{relu}(X^{(k-1)})\big)\,\big(1 + s[a^{(k)}(\mathrm{relu}(X^{(k-1)}))]\big)$$

Figure 2: Lateral Inhibition Layer (LIL) placement inside a residual module.
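A minimal Keras sketch of the lateral inhibition function (4) as a reusable layer; this is our reading of the placement shown in Figure 2 (after the last convolution, before normalization and the non-linearity), not the authors' released implementation.

```python
import tensorflow as tf

class LateralInhibition(tf.keras.layers.Layer):
    """Applies z = a * (1 + softmax(a)) over the channel axis of a
    convolutional activation tensor, so each hypercolumn's winner neuron
    has its activation (and hence its gradients) amplified."""

    def call(self, activations):
        # activations: (batch, height, width, channels); the softmax runs
        # across channels, i.e. across the antagonistic neurons of a hypercolumn.
        return activations * (1.0 + tf.nn.softmax(activations, axis=-1))

# Usage sketch inside a residual branch: conv -> LIL -> (normalization, relu).
x = tf.keras.layers.Input(shape=(32, 32, 64))
y = tf.keras.layers.Conv2D(64, 3, padding="same")(x)
y = LateralInhibition()(y)
```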


Lemma 1 - Lateral inhibition via amplification of gradients. For any pair of neurons in a
hypercolumn, the lateral inhibition function (4) will amplify the gradients of the winner neuron,
forcing larger updates in its weights in comparison to the others.
We present the proof of Lemma 1 in Section A of the Appendix.

Conjecture 1. Increased weight regularization is needed for positively monotonic behaviour of the
gradient amplification factors β and γ of the lateral inhibition function, and/or to restrict the input of
the lateral inhibition function in the range ai ∈ [−6, 6].

3.2 Local discrete feature vectors

The output of a LIL is a continuous activation tensor $Z^{(l)} \in \mathbb{R}^{h \times w \times c_{out}}$ of spatial dimensions $h \times w$, where a patch descriptor slice $z^{(l)}_{x,y} \in \mathbb{R}^{c_{out}}$ has a corresponding local discrete feature (LDF) vector $d^{(l)}_{x,y} = \mathrm{argsort}[s(a^{(l)}_{x,y})] \in \mathbb{N}^{c_{LDF}}$ that is determined by the softmax output, with a length $c_{LDF} < c_{out}$ potentially less than the count of neurons in the hypercolumn. The learning process ensures that the scores of the softmax function correspond to the prevalence of a neuron over the others, i.e., they order the neurons' selectivity to an input feature. Thus, keeping the indices of the top-ranked neurons can be used to explain the input space. Even though the number of ordered permutations could explode, experiments revealed that only a tiny fraction of these occur in a trained neural network predictor; this was expected, since learning reduces uncertainty and removes randomness.
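For a single hypercolumn, the LDF vector is simply the ranking of the softmax scores of its activations; a small illustration with an assumed activation vector:

```python
import numpy as np

activations = np.array([0.2, 3.1, -0.5, 1.7, 0.9, 2.4])  # one hypercolumn, c_out = 6
scores = np.exp(activations) / np.exp(activations).sum()  # softmax over antagonistic neurons
c_ldf = 4                                                  # keep only the top-ranked neurons
ldf = np.argsort(-scores)[:c_ldf]                          # indices ordered by activation strength
print(ldf)  # [1 5 3 4]: neuron 1 is the winner, followed by neurons 5, 3, 4
```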

3.3 Explanation process

Figure 3: Steps of the ExplaiNet framework process for the generation of explanations.
Supervised classification training: The process of generating explanations for an ExplaiNet classifier involves two training phases, each one followed by a collection phase. The process starts with the supervised learning phase, which trains, or fine-tunes, a predictor on the available training samples as is typically done. In this work we use standard minibatch Stochastic Gradient Descent (SGD) with momentum, although other optimizers can be used. During the first collection phase, we recall all training set samples through the trained model, so that it generates LDF vectors for each image patch $x, y$ of each level $l$. This phase determines a set of unique LDF vectors $V^{(l)} = \{d^{(l)}(i)\}$, which constitutes a vocabulary of visual words [20].
Feature motif discovery: The second training phase employs unsupervised learning with EM. In the preprocessing step, we convert the vectors into the quaternary numeral system so we can utilize any existing motif discovery algorithm. We have used YAMDA [21], an accelerated implementation of EXTREME [15]. We run the algorithm on each set $V_4^{(l)} = \{d_4^{(l)}(i)\}$ of quaternary LDF sequences with a predetermined value for the basic hyperparameter $K_{motifs}$, the maximum count of motifs to discover. The process may stop early at a lower count if there are no more significant patterns in the data. Before the start of the next iteration, the algorithm removes the training samples where the motif is present, by converting the discrete $d_4^{(l)}(i)$ into one-hot encoding matrices $D_4(i)$ and matching these against the motifs' PPMs. The matching function was changed from the original implementation of YAMDA into the normalized 2D signal cross-correlation $\rho(\cdot)$ of the rectified odds ratio between a motif's PPM $M^{(l)}_{PPM}(j)$ and the background probabilities $B$ of quaternary digits' occurrence at a position. The 2D discrete signal energy function is $E_{LDF}(i) = \sum_n^{c_4}\sum_m^{4} |d_4(i)[n,m]|^2$, where the length of the quaternary vector is $c_4 = c_{LDF} \cdot \log_4(c_{out}) \in \mathbb{N}$, for an LDF vector of length $c_{LDF}$ that is a multiple of 4 and $c_{out}$ possible discrete neuron indices.
$$f_{match}(D_4(i), M^{(l)}_{PPM}(j)) = \frac{1}{\sqrt{E_{LDF}(i)\,E_{PPM}(j)}}\;\rho\big(D_4(i), \mathrm{relu}(\log M^{(l)}_{PPM}(j) - \log B)\big) \quad (5)$$
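The preprocessing step that turns an LDF vector into a DNA-like quaternary sequence can be sketched as follows; the digit-to-nucleotide mapping is our own illustration of the idea, not necessarily the exact encoding used with YAMDA.

```python
import math

def ldf_to_quaternary(ldf, c_out, alphabet="ACGT"):
    """Encode each neuron index of an LDF vector with log4(c_out) base-4
    digits, giving a DNA-like sequence of length c_LDF * log4(c_out)."""
    digits_per_index = int(math.log(c_out, 4))
    symbols = []
    for index in ldf:
        for d in reversed(range(digits_per_index)):
            symbols.append(alphabet[(index // 4 ** d) % 4])
    return "".join(symbols)

# With c_out = 16 each index uses 2 quaternary digits: 1 -> "AC", 5 -> "CC", 3 -> "AT", 4 -> "CA".
print(ldf_to_quaternary([1, 5, 3, 4], c_out=16))  # "ACCCATCA"
```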

The feature motif (FMotif) discovery phase drastically reduces the LDF vocabulary into a new vocabulary $M^{(l)}$ of size $N_{motifs} \le K_{motifs}$, with $|M^{(l)}| \ll |V^{(l)}|$. This serves conciseness [18], which is one of the required characteristics of comprehensible explanations. Low values of $K_{motifs}$ can be used for extremely concise explanations, using just a handful of FMotifs. We load the vocabularies $M^{(l)}$ into the corresponding levels of our explainer, which can then infer a scalar FMotif index from a non-explainable hypercolumn activation vector. Thus, the explainer provides a discrete image representation matrix $\Psi^{(l)} \in \mathbb{N}^{h \times w}$ for each intermediate level of the network.
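The matching of a one-hot LDF matrix against a motif PPM with Eq. (5), used both during discovery and in the collection phase below, can be sketched as an energy-normalized correlation; since the encoded LDFs are aligned sequences, the 2D cross-correlation reduces to a dot product at zero lag, and $E_{PPM}$ is assumed here to be the energy of the rectified log-odds matrix.

```python
import numpy as np

def match_score(one_hot_ldf, ppm, background, eps=1e-9):
    """Sketch of Eq. (5): energy-normalized correlation between a one-hot
    quaternary LDF matrix (c_4 x 4) and the rectified log-odds of a motif PPM."""
    log_odds = np.maximum(np.log(ppm + eps) - np.log(background + eps), 0.0)  # relu(log M - log B)
    e_ldf = np.sum(np.abs(one_hot_ldf) ** 2)   # 2D discrete signal energy E_LDF
    e_ppm = np.sum(np.abs(log_odds) ** 2)      # assumed definition of E_PPM
    # Aligned sequences: the cross-correlation at zero lag is a plain dot product.
    return np.sum(one_hot_ldf * log_odds) / np.sqrt(e_ldf * e_ppm + eps)
```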
FMotif effects and causes collection: In the second collection phase we recall through ExplaiNet samples that were not part of the training set, to make observations. The LDFs are matched with FMotifs using the matching function (5). For each node $o^{(l)}_{x,y}$ of the explainer graph, we record the FMotif index $m^{(l)}(o^{(l)}_{x,y})$ as the effect observed on an image patch, which is presumably caused by the set of effects in the backward-adjacent nodes. For CNN networks, the causes $C^{(l-1)}_{x,y} = \{m^{(l-1)}(o^{(l-1)}_{x-i,y-j})\} : i, j \in [-\frac{n}{2}, +\frac{n}{2}]$ correspond to the $n \times n$ receptive field of the convolution operation. This is our ground-truth observation, which we keep in pairs $\{m^{(l)}_{x,y}, C^{(l-1)}_{x,y}\}$ for evaluating the quality of explanations. There are edges to "null" nodes in the graph that stand for zero padding; these are not included in the set of causes, since batch normalization layers perform zero-mean centering. Also, pairs $\{\Psi^{(l)}(i), \hat{y}_i\}$ are kept for evaluating explanations of the model predictions, with $\hat{y}_i$ the predictor's class index for sample $i$. The collection process creates Pareto histograms for cause sets and image representations, where bins correspond to FMotifs ordered by their frequencies. Their ordered indices are kept in vectors $c^{(l)}_{x,y}$ and $\psi^{(l)}(i)$.
Bayesian network explanations: The directed acyclic explainer graph expresses the behaviour of the convolutional moving window, which draws edges between patch nodes. These are considered causal links, since the activation value of a trained neuron is influenced by the values in its receptive field in a deterministic manner [22]. Hence we consider our explainer a Bayesian network, a model with inherent explainability. We define as X the event of an FMotif's presence in one or more patches of the image representation $\Psi^{(l)}$, or equivalently, of it having a non-zero count in the histogram. During the collection phase we count the occurrences of FMotif $m_i$ and calculate its marginal probability $P(m_i) = \frac{N(X = m_i)}{q}$, where $q$ is the total count of occurrences. For an outcome event Y, "... therefore the prediction is explained as $\overset{\therefore}{y}_c$", the prior probability is $P(\overset{\therefore}{y}_c) = \frac{N(Y = \overset{\therefore}{y}_c)}{q}$. The conditional probability for an FMotif presence X to be the cause of outcome Y is $P(m_i \mid \overset{\therefore}{y}_c) = \frac{N(X = m_i \wedge Y = \overset{\therefore}{y}_c)}{q}$. It is easy to see that the presence of one FMotif index in the histogram depends on the convolution operation at a specific location; since the operation moves across the image with the same kernel, the presence of a second FMotif index in the histogram is conditionally independent. Thus, we can use a Naive Bayes classifier to make a surrogate prediction based on FMotif indices, which is our stochastic explanation. This provides the three basic aspects of an explanation: i) why class c was predicted, ii) why class c was not predicted instead, and iii) what to expect for an unknown sample, based on the explained behaviour of the model on some samples. The explained class index at an intermediate level l is:
$$\overset{\therefore}{y}_c = \operatorname*{argmax}_{c}\Big[P(\overset{\therefore}{y}_c)\prod_{i}^{N^{(l)}_{motifs}} P(m_i \mid \overset{\therefore}{y}_c)\Big] \quad (6)$$
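A minimal sketch of the surrogate prediction in Eq. (6), assuming the priors and per-class FMotif conditionals have already been collected as described; the array names and toy numbers are hypothetical.

```python
import numpy as np

def explain_class(present_motifs, prior, conditional, eps=1e-12):
    """Naive Bayes surrogate prediction, Eq. (6).
    prior:          (n_classes,) prior probabilities P(y_c)
    conditional:    (n_motifs, n_classes) probabilities P(m_i | y_c)
    present_motifs: indices of FMotifs present in the image representation."""
    # Work in log space to avoid underflow when many motifs are present.
    log_posterior = np.log(prior + eps)
    for i in present_motifs:
        log_posterior += np.log(conditional[i] + eps)
    return int(np.argmax(log_posterior))

# Hypothetical toy example with 3 classes and 4 FMotifs.
prior = np.array([0.5, 0.3, 0.2])
conditional = np.array([[0.6, 0.1, 0.1],
                        [0.1, 0.7, 0.2],
                        [0.2, 0.1, 0.6],
                        [0.1, 0.1, 0.1]])
print(explain_class(present_motifs=[0, 2], prior=prior, conditional=conditional))  # -> class 0
```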

4 Experiments and results


4.1 Prediction

Datasets and augmentation: We train our models from scratch on the MNIST dataset [23], Fashion MNIST (FMNIST), which contains tiny grayscale images of 10 fashion items [24], Kuzushiji MNIST (KMNIST), which has Japanese cursive syllabograms with 10 classes, one for each row of the Hiragana syllabary [25], Oracle MNIST (OMNIST), which depicts ancient Chinese ideograms of 10 concepts, and the tiny color-image benchmark dataset CIFAR10 [26]. Our approach for the training data feed, which ensures deterministic training and fair comparison, is described in Section B of the Appendix.
Supervised training experiments: We use the LIL in ResNet and DenseNet to derive R-ExplaiNet and D-ExplaiNet. For each ResNet pair we train 10 random folds, and for DenseNet pairs 5 due to their increased complexity, for at least 8 different feature-layer combinations per dataset. Each model was trained on a single GPU; details on hyperparameters, our software implementation that will become available on GitHub, and infrastructure can be found in Appendix C. For a fair comparison with the baseline, data and training hyperparameters are kept the same for all models of a task, except the weight decay regularization coefficient λ. Our formal proof suggests that ExplaiNets need higher values for gradient stability; this was confirmed by our first experiments on MNIST, where we tried at least 9 models with combinations of constant features and layers, and a reverse pyramid of features.
The MNIST classification task could be considered solved; nevertheless, it still serves the purpose of providing a handful of misclassified samples that can be investigated via explanations. The difference in performance lies in the second decimal digit of the error rate percentage, that is, 1-10 validation samples. The minimum error amongst all MNIST models was 0.200 for R-ExplaiNet22-64 with 0.75MP parameters using λ = 0.001, and its average error stabilizes at 0.258±0.008 for λ = 0.002. Additionally, we experimented with a second type of LIL that clips its input values to be inside [−6, 6], which resulted in the overall lowest average error of 0.256±0.013. In FMNIST, KMNIST, OMNIST and
Table 1: Average error rate over 10 folds within 95% confidence interval by feature setup on MNIST, using λ = 0.001 for all models. R-ExplaiNet-C restricts input values of LILs to [−6, 6].

K  | R8 ResNet | R8 R-ExNet-C | R16 ResNet | R16 R-ExNet-C | R32 ResNet | R32 R-ExNet-C | R64 ResNet | R64 R-ExNet-C
18 | 0.562±0.03 | 0.546±0.04 | 0.387±0.02 | 0.364±0.03 | 0.323±0.02 | 0.311±0.01 | 0.287±0.01 | 0.287±0.02
22 | 0.462±0.02 | 0.513±0.03 | 0.344±0.02 | 0.346±0.02 | 0.275±0.02 | 0.283±0.03 | 0.263±0.01 | 0.256±0.01
26 | 0.488±0.03 | 0.469±0.03 | 0.328±0.03 | 0.354±0.03 | 0.302±0.02 | 0.294±0.02 | 0.273±0.02 | 0.292±0.02

Table 2: Average accuracy for top performing residual networks in FMNIST, KMNIST, OMNIST, CIFAR10, with parameter size 0.89MP.

Model | Fashion MNIST | Kuzushiji MNIST | Oracle MNIST | CIFAR10 | MNIST
R-ExplaiNet26-64 | 93.03±0.119 | 98.66±0.049 | 96.68±0.109 | 93.80±0.119 | 99.70±0.022
ResNet26-64 | 92.83±0.131 | 98.53±0.061 | 96.56±0.123 | 93.41±0.093 | 99.73±0.015

Table 3: Accuracy of CIFAR10 classifiers that were converted into ExplaiNets.

Model | Res20 (R1) | R-ExN20 (R1) | Dens40 (R12) | D-ExN40 (R12) | Dens100BC (R12) | D-ExN100BC (R12)
Acc.  | 91.59±0.117 | 91.88±0.133 | 93.41±0.152 | 93.8±0.08 | 94.38±0.182 | 94.6±0.163

CIFAR10, where overfitting is exacerbated, R-ExplaiNets consistently surpass the baseline accuracy in every setup. Also, experiments reveal that the accuracy of DenseNets increases when the LIL is used in their modules. A complete set of experiments is included in Appendix D.
Efficiency assessment: We use our RME index on the average accuracy metric to report the most efficient predictor for each task, along with its accuracy and size. We include in the group only models above a certain accuracy threshold. ExplaiNets can achieve a sufficient level of performance for a task with few parameters. More details on each individual model's efficiency are included in Appendix D.

Table 4: Most efficient models above an accepted performance limit for each task.

 | Fashion MNIST | Kuzushiji MNIST | Oracle MNIST | CIFAR10 | MNIST
Acc. Limit | 92.0 | 97.0 | 95.0 | 92.0 | 99.5
Model | R-ExN22-48 | R-ExN18-16 | R-ExN18-32 | R-ExN18-48 | R-ExN22-16
RMEF | 1.30 | 2.19 | 1.74 | 1.27 | 2.58
Accuracy | 92.86±0.13 | 97.63±0.08 | 96.13±0.13 | 92.61±0.10 | 99.67±0.03
Size (MP) | 0.420 | 0.038 | 0.150 | 0.337 | 0.048

Figure 4: Discrete feature mosaics. Left: 1st level mapped to a 7 × 7 area of the input image; right: 2nd level mapped to 11 × 11. Middle: Observed causes $C^{(1)}_{x,y}$ in the receptive field of effects $m^{(2)}_{68}$ and $m^{(2)}_{13}$.

4.2 Interpretations and Explanations

Interpretations: We run the explanation process for R-ExplaiNet18-16, a residual ExplaiNet of 18 layers and 16 features per layer trained on MNIST, with $K_{motifs} = 96$, to discover FMotifs for 8 levels of explanations on the unique LDFs in the training set. The discrete representation matrices are visualized as mosaics, depicted in Figure 4. Matching scores of FMotifs can be used to generate heatmaps that illustrate how they are selective to different classes, as presented in Figure 5. Full hyperparameters of the discovery and extra analysis of the interpretations are available in Appendix E.

Figure 5: Matching scores superimposed on the input image for FMotif 19 of level 8 in R-ExplaiNet18-16, along with its logo. At this explanation level we have global feature motifs in a 4 × 4 matrix.
Explanations: We recall all samples of the MNIST validation set through R-ExplaiNet18-16, to collect Pareto histograms of FMotif representations and image slices, calculating probabilities. For each level of explanation, the explained class is inferred based on the presence of FMotifs in representations, and the FTO is calculated. The results show that local FMotif features are not faithful to the output of the model, while global ones are. The FCE is calculated for all FMotifs in the vocabulary and for the top 20% of them. Details can be found in Appendix F. The CNN predictor's accuracy is 99.58%, which can be compared to the surrogate predictions from all levels.

Table 5: Explainability metrics for R-ExplaiNet18-16 on the MNIST validation set. FCE rows relate each level's effects to causes from the previous level, so they have no entry for Level 1.

Metric | Level1 | Level2 | Level3 | Level4 | Level5 | Level6 | Level7 | Level8
FMotif Vocab. Count | 96 | 85 | 96 | 85 | 96 | 73 | 58 | 60
Surrogate Pred. Acc. | 9.92 | 10.02 | 17.42 | 18.00 | 53.09 | 81.04 | 98.21 | 99.50
Explain Pred. FTO | 9.80 | 9.92 | 17.31 | 17.86 | 52.99 | 80.94 | 98.23 | 99.62
All effects FCE | – | 92.5±1.6 | 70.0±3.2 | 92.4±0.9 | 88.9±1.4 | 98.7±0.3 | 88.5±2.0 | 69.3±3.8
Most explained FCE | – | 99.6±0.1 | 88.7±1.3 | 97.5±0.5 | 95.7±0.4 | 99.7±0.1 | 96.9±0.9 | 88.1±3.5
Best FMotif FCE | – | 99.96 | 94.93 | 99.76 | 97.67 | 100.00 | 99.70 | 99.46

5 Discussion
5.1 Key findings

Experiments clearly indicate that our explainable-by-design CNN predictor retains or surpasses the performance of the baseline, while being able to provide explanations based on discrete features. Training the best performing R-ExplaiNet on MNIST with increased weight decay revealed that doubling the amount of regularization increases the mean accuracy. The reproducible error rate of the best model is 0.2%, which is comparable to the current state of the art for single models. This shows promise for reaching the highest performance on tasks, if more complex models and elaborate training schemes are used.
The feature motif discovery process underutilizes the learning capacity of EXTREME, because our genome-encoded LDFs are aligned sequences, while the algorithm can find motifs in the middle of sequences. Keeping the conversion to the quaternary numeral system ensures a logarithmic space complexity $O(m \cdot \log_4(n))$ for $n$ features and LDF length $m$; thus the method can scale up to an extremely high count of features for short LDFs, or the LDF length can be equal to the count of features.
Our explainer graph, which infers an FMotif as the integer scalar image patch descriptor, can generate online explanations. We have performed a detailed investigation of samples that are unfaithful to the predictor's output, and of those that are misclassified by both the CNN and the Bayesian explainer. We have noticed interesting cases where the neural predictor fails but the explainer is correct. This creates the potential of using explanations from the last levels together with the neural predictor in an ensemble, or of using the explainer as a detector of adversarial samples.
Evaluation of fidelity to output for the first levels indicates that, even though the intermediate non-explainable representations arithmetically lead to the output prediction, the presence of a specific feature is not a causal reason for it. Such causal features start to emerge as the network grows in depth and the receptive field of neurons expands. When explaining intermediate feature interactions in unknown samples, 20% of FMotif effects are sufficiently explained with high FCE, thus their presence at a level is consistently faithful to the observed FMotif causes of the previous level.

5.2 Limitations and future work

Further research on the regularization aspects of the LI function is required, to investigate the magnitude of weights and activations during training. The proper amount of weight decay and the restriction of input values for positively monotonic gradient amplification need to be further investigated. Numeric stability seems to play a role here, and training models with fp64 numbers will provide insight.
Variety and amount were preferred over scale given our limited computational resources; thus we have not trained our models on medium-resolution images or a high number of classes, nor used models with millions of parameters. The next version of the software implementation will parallelize the collection phases, enabling multiple comparative explanation experiments. The EXTREME algorithm could be adjusted to support custom numeral system bases, so that the conversion to the quaternary system becomes an optional decision, taken as a mitigation of high complexity.

6 Related Work

Our work falls under the general scientific field of XAI [4], which has a large taxonomy of sub-fields where ExML and/or statistical learning is used to provide interpretability and explainability. For neural networks we can identify the main branches of: Surrogate (proxy) Models, Automatic Rule Extraction, Attribution Methods, Example-Based Methods and Explainable-by-Design Neural Networks. Neural network surrogate explanations are based on model-agnostic methods that explain a black-box model [27] or use a decision tree as a surrogate [28][29]. Rules can be extracted with various approaches, including those that follow programming logic [30][31]. Attribution methods, which mostly provide interpretability, form a diverse branch. Beyond gradient-based attribution methods [32][33][34][35] and attribution based on Shapley values [36][37], there are axiomatic attributions [38][39], perturbation-based methods [27][40], rule-based attributions [41], information-bottleneck attributions [42] and attention saliency maps [43]. The requirements of a proper explanation [44] are better satisfied by Example-Based Methods, where explanations can be based on counterfactuals [45][46] or influential instances [47]. Many proposed methods that provide explainability through the use of prototypes [48] can be found in the literature [49][50][51][52][53][54][55][56].
More related to our work are explainable-by-design neural networks, like additive models [57], dynamic alignment networks [58] and neural decision trees [59][60][61][62][63]. ExplaiNet is best situated under the branch of Hybrid Explainable-by-Design Neural Networks, like the work in [50], from which we embrace the concept of a predictor and an explainer working in pair. Nevertheless, the LDFs and FMotifs provided by our framework can be used by other branches of ExML.
Considering the use of softmax as an activation function, we utilize its characteristics for a different purpose than the Attention layer [1][2], which applies it over the inner product of query and key, whose scores are used to match a value. When used in our LIL, softmax steers the gradient descent learning rule toward incorporating lateral inhibition. This can be considered closely related to the long-standing concept of Soft Competitive Learning [64]; however, unlike the closely related work of [65], our model does not use a learning rule that belongs to Hebbian learning.

7 Conclusions
In this work, we present a novel framework for explainability and interpretability that uses discrete features to serve both needs. An explainable-by-design CNN learns local discrete feature vectors that indicate a ranked matching of neuron-specific weights to the input receptive field; these weights belong to the convolution kernel that is learned with supervised gradient descent. We can replace discrete vectors with a single scalar discrete value of a feature motif, for image patches at all intermediate image representations in the network. The proposed Lateral Inhibition Layer is generally applicable in neural networks, and we have experimentally shown that our novel ExplaiNet predictor retains the accuracy of its baseline architecture, breaking the performance-explainability trade-off. Our explainer graph provides Bayesian explainability for understanding the causes at one explanation level that lead to the occurrence of a feature at the next level, and it can also be used for surrogate class prediction from all levels. Our framework comes with a complete proof-of-concept software implementation that can be used in future research efforts towards explainability.

Acknowledgments and Disclosure of Funding

8 Author Contributions
The primary author conceived the idea for the method and the overall framework, based on earlier
work done on CNN-based feature quantization, for the task of content-based image retrieval. The
second author supervised the project, setting the experimental requirements, advising, and providing
valuable feedback. The code for the software implementation was written by the primary author and
the manuscript was written by the first author and edited by the second.
This work utilized compute resources at the Department of Information and Electronic Engineering, International Hellenic University (IHU), at Sindos. The authors would like to thank Professor Antonis Sidiropoulos for his valuable work in procuring and operating this equipment and for his support during the experiments phase.
Pantelis I. Kaplanoglou received funding from the MANOLO EU grant at the time of this work's completion.

References
[1] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI Open, vol. 3, pp. 111–132, Jan. 2022.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666651022000146
[2] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in Vision:
A Survey,” ACM Computing Surveys, vol. 54, no. 10s, pp. 1–41, Jan. 2022. [Online]. Available:
https://dl.acm.org/doi/10.1145/3505244
[3] B. Li, P. Qi, B. Liu, S. Di, J. Liu, J. Pei, J. Yi, and B. Zhou, “Trustworthy AI: From Principles to
Practices,” ACM Computing Surveys, vol. 55, no. 9, pp. 177:1–177:46, Jan. 2023. [Online]. Available:
https://dl.acm.org/doi/10.1145/3555803
[4] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, “Explaining Explanations: An
Overview of Interpretability of Machine Learning,” Feb. 2019, arXiv:1806.00069 [cs, stat]. [Online].
Available: http://arxiv.org/abs/1806.00069
[5] C.-K. Yeh, B. Kim, S. Arik, C.-L. Li, T. Pfister, and P. Ravikumar, “On Completeness-aware
Concept-Based Explanations in Deep Neural Networks,” in Advances in Neural Information
Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 20 554–20 565. [Online]. Available:
https://proceedings.neurips.cc/paper/2020/hash/ecb287ff763c169694f682af52c1f309-Abstract.html
[6] A. L. Smith, F. Greaves, and T. Panch, “Hallucination or Confabulation? Neuroanatomy as metaphor
in Large Language Models,” PLOS Digital Health, vol. 2, no. 11, p. e0000388, Nov. 2023. [Online].
Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10619792/
[7] A. S. Luccioni, S. Viguier, and A.-L. Ligozat, “Estimating the Carbon Footprint of BLOOM, a 176B
Parameter Language Model,” Journal of Machine Learning Research, vol. 24, no. 253, pp. 1–15, 2023.
[Online]. Available: http://jmlr.org/papers/v24/23-0069.html
[8] B. Crook, M. Schlüter, and T. Speith, “Revisiting the Performance-Explainability Trade-Off in
Explainable Artificial Intelligence (XAI),” Jul. 2023, arXiv:2307.14239 [cs]. [Online]. Available:
http://arxiv.org/abs/2307.14239

[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016,
pp. 770–778. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_
Residual_Learning_CVPR_2016_paper.html
[10] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,”
2017, pp. 4700–4708. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Huang_
Densely_Connected_Convolutional_CVPR_2017_paper.html
[11] P. D’haeseleer, “What are DNA sequence motifs?” Nature Biotechnology, vol. 24, no. 4,
pp. 423–425, Apr. 2006, publisher: Nature Publishing Group. [Online]. Available: https:
//www.nature.com/articles/nbt0406-423
[12] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation
maximization,” Machine Learning, vol. 21, no. 1, pp. 51–80, Oct. 1995. [Online]. Available:
https://doi.org/10.1007/BF00993379
[13] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li, “MEME: discovering and analyzing DNA and protein
sequence motifs,” Nucleic Acids Research, vol. 34, no. suppl_2, pp. W369–W373, Jul. 2006. [Online].
Available: https://doi.org/10.1093/nar/gkl198
[14] T. L. Bailey, “DREME: motif discovery in transcription factor ChIP-seq data,” Bioinformatics, vol. 27,
no. 12, pp. 1653–1659, Jun. 2011. [Online]. Available: https://doi.org/10.1093/bioinformatics/btr261
[15] D. Quang and X. Xie, “EXTREME: an online EM algorithm for motif discovery,” Bioinformatics, vol. 30,
no. 12, pp. 1667–1673, Jun. 2014. [Online]. Available: https://doi.org/10.1093/bioinformatics/btu093
[16] T. L. Bailey, “STREME: accurate and versatile sequence motif discovery,” Bioinformatics, vol. 37, no. 18,
pp. 2834–2840, Sep. 2021. [Online]. Available: https://doi.org/10.1093/bioinformatics/btab203
[17] O. Cappé and E. Moulines, “On-Line Expectation–Maximization Algorithm for latent Data Models,”
Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 71, no. 3, pp. 593–613, Jun.
2009. [Online]. Available: https://doi.org/10.1111/j.1467-9868.2009.00698.x
[18] J. Parekh, P. Mozharovskyi, and F. d’Alché Buc, “A framework to learn with interpretation,” Advances in
Neural Information Processing Systems, vol. 34, pp. 24 273–24 285, 2021.
[19] B. Gao and L. Pavel, “On the Properties of the Softmax Function with Application in Game
Theory and Reinforcement Learning,” Aug. 2018, arXiv:1704.00805 [cs, math]. [Online]. Available:
http://arxiv.org/abs/1704.00805
[20] Sivic and Zisserman, “Video Google: a text retrieval approach to object matching in videos,” in
Proceedings Ninth IEEE International Conference on Computer Vision. Nice, France: IEEE, 2003, pp.
1470–1477 vol.2. [Online]. Available: http://ieeexplore.ieee.org/document/1238663/
[21] D. Quang, Y. Guan, and S. C. J. Parker, “YAMDA: thousandfold speedup of EM-based motif discovery
using deep learning libraries and GPU,” Bioinformatics, vol. 34, no. 20, pp. 3578–3580, Oct. 2018.
[Online]. Available: https://doi.org/10.1093/bioinformatics/bty396
[22] C. Lacave and F. J. Díez, “A review of explanation methods for Bayesian networks,” The
Knowledge Engineering Review, vol. 17, no. 2, pp. 107–127, Jun. 2002. [Online]. Available:
https://www.cambridge.org/core/product/identifier/S026988890200019X/type/journal_article
[23] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998, conference Name: Proceedings of the
IEEE. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/726791
[24] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a Novel Image Dataset for Benchmarking
Machine Learning Algorithms,” Sep. 2017, arXiv:1708.07747 [cs, stat]. [Online]. Available:
http://arxiv.org/abs/1708.07747
[25] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha, “Deep learning for
classical japanese literature,” arXiv preprint arXiv:1812.01718, 2018.
[26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Report, 2009.
[27] M. T. Ribeiro, S. Singh, and C. Guestrin, “Model-Agnostic Interpretability of Machine Learning,” Jun.
2016, arXiv:1606.05386 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1606.05386
[28] T. Pedapati, A. Balakrishnan, K. Shanmugam, and A. Dhurandhar, “Learning Global Transparent Models
consistent with Local Contrastive Explanations,” in Advances in Neural Information Processing Systems,
vol. 33. Curran Associates, Inc., 2020, pp. 3592–3602. [Online]. Available: https://proceedings.neurips.
cc/paper_files/paper/2020/hash/24aef8cb3281a2422a59b51659f1ad2e-Abstract.html
[29] G. Liu, X. Sun, O. Schulte, and P. Poupart, “Learning Tree Interpretation from Object
Representation for Deep Reinforcement Learning,” in Advances in Neural Information Processing
Systems, vol. 34. Curran Associates, Inc., 2021, pp. 19 622–19 636. [Online]. Available:
https://proceedings.neurips.cc/paper/2021/hash/a35fe7f7fe8217b4369a0af4244d1fca-Abstract.html

[30] F. Yang, K. He, L. Yang, H. Du, J. Yang, B. Yang, and L. Sun, “Learning Interpretable
Decision Rule Sets: A Submodular Optimization Approach,” in Advances in Neural Information
Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 27 890–27 902. [Online]. Available:
https://proceedings.neurips.cc/paper/2021/hash/eaa32c96f620053cf442ad32258076b9-Abstract.html
[31] D. Trivedi, J. Zhang, S.-H. Sun, and J. J. Lim, “Learning to Synthesize Programs as Interpretable
and Generalizable Policies,” in Advances in Neural Information Processing Systems, vol. 34. Curran
Associates, Inc., 2021, pp. 25 146–25 163. [Online]. Available: https://proceedings.neurips.cc/paper_files/
paper/2021/hash/d37124c4c79f357cb02c655671a432fa-Abstract.html
[32] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, “Grad-CAM: Why did you
say that?” Jan. 2017, arXiv:1611.07450 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1611.07450
[33] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning Important Features Through Propagating
Activation Differences,” in Proceedings of the 34th International Conference on Machine
Learning. PMLR, Jul. 2017, pp. 3145–3153, iSSN: 2640-3498. [Online]. Available: https:
//proceedings.mlr.press/v70/shrikumar17a.html
[34] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “SmoothGrad: removing noise by adding
noise,” Jun. 2017, arXiv:1706.03825 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1706.03825
[35] J. Li, C. Zhang, J. T. Zhou, H. Fu, S. Xia, and Q. Hu, “Deep-LIFT: Deep Label-Specific
Feature Learning for Image Annotation,” IEEE Transactions on Cybernetics, vol. 52, no. 8, pp.
7732–7741, Aug. 2022, conference Name: IEEE Transactions on Cybernetics. [Online]. Available:
https://ieeexplore.ieee.org/abstract/document/9352498
[36] S. M. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” in Advances in
Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017. [Online]. Available:
https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
[37] H. Chen, I. C. Covert, S. M. Lundberg, and S.-I. Lee, “Algorithms to estimate Shapley value feature
attributions,” Nature Machine Intelligence, vol. 5, no. 6, pp. 590–601, Jun. 2023, publisher: Nature
Publishing Group. [Online]. Available: https://www.nature.com/articles/s42256-023-00657-x
[38] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International
conference on machine learning. PMLR, 2017, pp. 3319–3328.
[39] R. Hesse, S. Schaub-Meyer, and S. Roth, “Fast Axiomatic Attribution for Neural Networks,”
in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc.,
2021, pp. 19 513–19 524. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/
a284df1155ec3e67286080500df36a9a-Abstract.html
[40] M. Ivanovs, R. Kadikis, and K. Ozols, “Perturbation-based methods for explaining deep neural
networks: A survey,” Pattern Recognition Letters, vol. 150, pp. 228–234, Oct. 2021. [Online]. Available:
https://linkinghub.elsevier.com/retrieve/pii/S0167865521002440
[41] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan, “Interpretable classifiers using rules
and Bayesian analysis: Building a better stroke prediction model,” The Annals of Applied
Statistics, vol. 9, no. 3, pp. 1350–1371, Sep. 2015, publisher: Institute of Mathematical Statistics.
[Online]. Available: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-9/issue-3/
Interpretable-classifiers-using-rules-and-Bayesian-analysis--Building-a/10.1214/15-AOAS848.full
[42] Y. Zhang, Y. Li, S. T. Kim, A. Khakzar, A. Farshad, and N. Navab, “Fine-Grained Neural Network
Explanation by Identifying Input Features with Predictive Information.”
[43] V. Shitole, F. Li, M. Kahng, P. Tadepalli, and A. Fern, “One Explanation is Not Enough:
Structured Attention Graphs for Image Classification,” in Advances in Neural Information
Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 11 352–11 363. [Online]. Available:
https://proceedings.neurips.cc/paper/2021/hash/5e751896e527c862bf67251a474b3819-Abstract.html
[44] J. Faye, “Explanation Explained,” Synthese, vol. 120, no. 1, pp. 61–75, 1999, publisher: Springer. [Online].
Available: https://www.jstor.org/stable/20118187
[45] T. Laugel, A. Jeyasothy, M.-J. Lesot, C. Marsala, and M. Detyniecki, “Achieving Diversity
in Counterfactual Explanations: a Review and Discussion,” in Proceedings of the 2023 ACM
Conference on Fairness, Accountability, and Transparency, ser. FAccT ’23. New York, NY,
USA: Association for Computing Machinery, Jun. 2023, pp. 1859–1869. [Online]. Available:
https://dl.acm.org/doi/10.1145/3593013.3594122
[46] F. Hamman, E. Noorani, S. Mishra, D. Magazzeni, and S. Dutta, “Robust Counterfactual Explanations for
Neural Networks With Probabilistic Guarantees,” in Proceedings of the 40th International Conference
on Machine Learning. PMLR, Jul. 2023, pp. 12 351–12 367, iSSN: 2640-3498. [Online]. Available:
https://proceedings.mlr.press/v202/hamman23a.html

[47] P. W. Koh and P. Liang, “Understanding Black-box Predictions via Influence Functions,” in Proceedings
of the 34th International Conference on Machine Learning. PMLR, Jul. 2017, pp. 1885–1894, iSSN:
2640-3498. [Online]. Available: https://proceedings.mlr.press/v70/koh17a.html
[48] J. A. Hampton, “Concepts as Prototypes,” in Psychology of Learning and Motivation. Academic Press,
Jan. 2006, vol. 46, pp. 79–113. [Online]. Available: https://www.sciencedirect.com/science/article/pii/
S0079742106460035
[49] J. Bien and R. Tibshirani, “Prototype selection for interpretable classification,” The Annals of
Applied Statistics, vol. 5, no. 4, pp. 2403–2424, Dec. 2011, publisher: Institute of Mathematical
Statistics. [Online]. Available: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-5/
issue-4/Prototype-selection-for-interpretable-classification/10.1214/11-AOAS495.full
[50] D. Alvarez Melis and T. Jaakkola, “Towards Robust Interpretability with Self-Explaining Neural Networks,”
in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018. [Online].
Available: https://proceedings.neurips.cc/paper/2018/hash/3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.
html
[51] O. Li, H. Liu, C. Chen, and C. Rudin, “Deep Learning for Case-Based Reasoning Through
Prototypes: A Neural Network That Explains Its Predictions,” Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 32, no. 1, Apr. 2018, number: 1. [Online]. Available:
https://ojs.aaai.org/index.php/AAAI/article/view/11771
[52] N. Papernot and P. McDaniel, “Deep k-Nearest Neighbors: Towards Confident, Interpretable
and Robust Deep Learning,” Mar. 2018, arXiv:1803.04765 [cs, stat]. [Online]. Available:
http://arxiv.org/abs/1803.04765
[53] P. Hase, C. Chen, O. Li, and C. Rudin, “Interpretable Image Recognition with Hierarchical Prototypes,”
Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, vol. 7, pp. 32–40, Oct.
2019. [Online]. Available: https://ojs.aaai.org/index.php/HCOMP/article/view/5265
[54] J. Crabbe, Z. Qian, F. Imrie, and M. van der Schaar, “Explaining Latent Representations with a Corpus of
Examples,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc.,
2021, pp. 12 154–12 166. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2021/hash/
65658fde58ab3c2b6e5132a39fae7cb9-Abstract.html
[55] E. Kim, S. Kim, M. Seo, and S. Yoon, “XProtoNet: Diagnosis in Chest Radiogra-
phy With Global and Local Explanations,” 2021, pp. 15 719–15 728. [Online]. Avail-
able: https://openaccess.thecvf.com/content/CVPR2021/html/Kim_XProtoNet_Diagnosis_in_Chest_
Radiography_With_Global_and_Local_Explanations_CVPR_2021_paper.html
[56] M. Nauta, R. van Bree, and C. Seifert, “Neural Prototype Trees for Inter-
pretable Fine-Grained Image Recognition,” 2021, pp. 14 933–14 943. [Online]. Avail-
able: https://openaccess.thecvf.com/content/CVPR2021/html/Nauta_Neural_Prototype_Trees_for_
Interpretable_Fine-Grained_Image_Recognition_CVPR_2021_paper.html
[57] R. Agarwal, L. Melnick, N. Frosst, X. Zhang, B. Lengerich, R. Caruana, and G. Hinton, “Neural Additive
Models: Interpretable Machine Learning with Neural Nets,” arXiv:2004.13912 [cs, stat], Oct. 2021, arXiv:
2004.13912. [Online]. Available: http://arxiv.org/abs/2004.13912
[58] M. Bohle, M. Fritz, and B. Schiele, “Convolutional Dynamic Alignment Net-
works for Interpretable Classifications,” 2021, pp. 10 029–10 038. [Online]. Avail-
able: https://openaccess.thecvf.com/content/CVPR2021/html/Bohle_Convolutional_Dynamic_Alignment_
Networks_for_Interpretable_Classifications_CVPR_2021_paper.html
[59] C. Olaru and L. Wehenkel, “A complete fuzzy decision tree technique,” Fuzzy Sets and Systems, vol. 138,
no. 2, pp. 221–254, Sep. 2003. [Online]. Available: https://www.sciencedirect.com/science/article/pii/
S0165011403000897
[60] O. İrsoy, O. T. Yıldız, and E. Alpaydın, “Soft decision trees,” in Proceedings of the 21st International
Conference on Pattern Recognition (ICPR2012), Nov. 2012, pp. 1819–1822, iSSN: 1051-4651.
[61] N. Frosst and G. Hinton, “Distilling a Neural Network Into a Soft Decision Tree,” Nov. 2017,
arXiv:1711.09784 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1711.09784
[62] V. Ojha and G. Nicosia, “Backpropagation Neural Tree,” Neural Networks, vol. 149, pp. 66–83, May 2022.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608022000363
[63] Q. Zhao and Y. Wang, “A Preliminary Study of Interpreting CNNs Using Soft Decision Trees,” in Data
Science, ser. Communications in Computer and Information Science, Y. Wang, G. Zhu, Q. Han, H. Wang,
X. Song, and Z. Lu, Eds. Singapore: Springer Nature, 2022, pp. 152–162.
[64] S. Nowlan, “Maximum Likelihood Competitive Learning,” in Advances in Neural Information Processing
Systems, vol. 2. Morgan-Kaufmann, 1989. [Online]. Available: https://proceedings.neurips.cc/paper/
1989/hash/d1c38a09acc34845c6be3a127a5aacaf-Abstract.html

[65] G. Lagani, F. Falchi, C. Gennaro, and G. Amato, “Training Convolutional Neural Networks with
Competitive Hebbian Learning Approaches,” in Machine Learning, Optimization, and Data Science,
G. Nicosia, V. Ojha, E. La Malfa, G. La Malfa, G. Jansen, P. M. Pardalos, G. Giuffrida, and R. Umeton,
Eds. Cham: Springer International Publishing, 2022, vol. 13163, pp. 25–40, series Title: Lecture Notes
in Computer Science. [Online]. Available: https://link.springer.com/10.1007/978-3-030-95467-3_2

A Appendix / Proof of lateral inhibition via amplification of gradients
Lemma 1 - Lateral inhibition via amplification of gradients. For any pair of neurons in a hypercolumn, the lateral inhibition function
$$z^{(l)}_{x,y} = f_{LI}(a^{(l)}_{x,y}) = a^{(l)}_{x,y}\,\big(1 + s(a^{(l)}_{x,y})\big) \quad (7)$$
will amplify the gradients of the winner neuron, forcing larger updates in its weights in comparison to the others.

Proof: Let x be the input vector, that is, the flattened receptive field matrix, a the hypercolumn's activation vector, i, j the indices of two neurons and W the shared weight kernel. The 2D convolution is the function $a = g(x; W)$ and its partial derivatives w.r.t. the weights are $\nabla g(a_i) = \frac{\partial}{\partial W} a_i$. The lateral inhibition function centered at position x, y for a neuron i is $f(i, a) = a_i + a_i\, s(a_i, a)$, where $s(a_i, a)$ is the softmax function with neuron i in the numerator and the other neurons in the denominator.
$$\nabla f(i, a) = \frac{\partial}{\partial W} a_i + \frac{\partial}{\partial W}\big[a_i\, s(a_i, a)\big] \quad (8)$$
$$= \frac{\partial}{\partial W} a_i + s(a_i, a)\frac{\partial}{\partial W} a_i + a_i \frac{\partial}{\partial W} s(a_i, a) \quad (9)$$
$$= \frac{\partial}{\partial w} a_i + s(a_i, a)\frac{\partial}{\partial w} a_i + a_i \frac{\partial s(a_i, a)}{\partial a_i}\frac{\partial a_i}{\partial w} \quad (10)$$
$$= \frac{\partial}{\partial w} a_i \left(1 + s(a_i, a) + a_i \frac{\partial s(a_i, a)}{\partial a_i}\right) \quad (11)$$
The softmax function for two input neurons i, j of the same hypercolumn is $s(a_i) = s(a_i, a)$, where neuron i is in the numerator. The element i, j of the Jacobian matrix of the softmax function is:
$$J(i, j, a) = \frac{\partial s(a_i, a)}{\partial a_i} = \frac{\partial s(a_i)}{\partial a_i} = s(a_i)\big([\![i = j]\!] - s(a_j)\big) \quad (12)$$
where $[\![\cdot]\!]$ is the Iverson bracket. We combine (11) and (12) into
$$\nabla f(i, j, a) = \frac{\partial}{\partial w} a_i \Big(1 + s(a_i) + a_i\, s(a_i)\big([\![i = j]\!] - s(a_j)\big)\Big) \quad (13)$$
$$= \frac{\partial}{\partial w} a_i \Big(1 + s(a_i)\big(1 + a_i([\![i = j]\!] - s(a_j))\big)\Big) \quad (14)$$
We plug the derivative of the 2D convolution into (14); for the case $i \neq j$ it becomes
$$\nabla f(i, j, a) = \nabla g(a_i)\big(1 + s(a_i) - a_i\, s(a_i)\, s(a_j)\big) = \beta\, \nabla g(a_i) \quad (15)$$
while for the case $i = j$, a diagonal element of the Jacobian matrix where $s(a_i) = s(a_j)$, it is
$$\nabla f(i, i, a) = \nabla g(a_i)\big(1 + s(a_i)(1 + a_i - a_i\, s(a_i))\big) \quad (16)$$
$$= \nabla g(a_i)\big(1 + s(a_i) + a_i\, s(a_i) - a_i\, s(a_i)^2\big) = \gamma\, \nabla g(a_i) \quad (17)$$

Considering that $s(a_i) = 1 - s(a_j)$, the limits below are zero:
$$\lim_{s(a_i)\to 1} s(a_i)s(a_j) = \lim_{s(a_j)\to 1} s(a_i)s(a_j) = \lim_{s(a_j)\to 1} s(a_i)\big(1 - s(a_j)\big) = 0 \quad (18)$$
For
$$\beta = 1 + s(a_i) - a_i\, s(a_i)\, s(a_j) \quad (19)$$
and
$$\gamma = 1 + s(a_i) + a_i\, s(a_i) - a_i\, s(a_i)^2 \quad (20)$$
it is evident that
$$\lim_{s(a_i)\to 1}\beta = 2, \quad \lim_{s(a_j)\to 1}\beta = 1, \quad \lim_{s(a_i)\to 1}\gamma = 2, \quad \lim_{s(a_i)\to 0}\gamma = 1 \quad (21)$$
which proves the amplification of the gradient for the candidate winner neuron i in all elements of its Jacobian matrix.

Corollary 1. For $s(a_i) = s(a_j) = 0.5$ we have the same adjustment of gradients, either amplification or attenuation.
We calculate:
$$\beta = 1 + 0.5 - 0.25\, a_i = 1.5 - 0.25\, a_i \tag{22}$$
$$\gamma = 1 + 0.5 + 0.5\, a_i - 0.25\, a_i = 1.5 + 0.25\, a_i \tag{23}$$
The adjustment of gradients, that is either amplification or attenuation, is the same for both neurons.
Studying $a_i \in [-6, 6]$: when $a_i = -6$, the gradients of the non-diagonal elements of the Jacobi matrix are amplified $\times 3$, while the gradients on the diagonal are inhibited to zero. The inverse happens for $a_i = 6$, where the non-diagonal gradients are zeroed, while the gradients on the diagonal are amplified $\times 3$.
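As a quick sanity check of (19)-(23), the following minimal Python sketch evaluates β and γ in the symmetric two-neuron case of Corollary 1; the helper names are illustrative and not part of the released code.

```python
def beta(a_i, s_i, s_j):
    # Eq. (19): adjustment factor for the off-diagonal (i != j) gradient elements
    return 1.0 + s_i - a_i * s_i * s_j

def gamma(a_i, s_i):
    # Eq. (20): adjustment factor for the diagonal (i == j) gradient elements
    return 1.0 + s_i + a_i * s_i - a_i * s_i ** 2

# Corollary 1: symmetric two-neuron case with s(a_i) = s(a_j) = 0.5
for a_i in (-6.0, 0.0, 6.0):
    print(a_i, beta(a_i, 0.5, 0.5), gamma(a_i, 0.5))
# a_i = -6 -> beta = 3.0 (x3 off-diagonal amplification), gamma = 0.0 (diagonal zeroed)
# a_i = +6 -> beta = 0.0, gamma = 3.0, i.e. the inverse behaviour
```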

Conjecture 1. Increased weight regularization is needed for positively monotonic behaviour of the gradient amplification factors $\beta$ and $\gamma$ of the lateral inhibition function, and/or the input of the lateral inhibition function must be restricted to the range $a_i \in [-6, 6]$.
When $a_i < -6$ or $a_i > 6$ we notice an inversion of the gradient sign by either $\beta$ or $\gamma$. The unbounded nature of the linear relations (22), (23) can lead to the exploding gradients problem, which we suspect is mitigated by weight decay regularization combined with the use of batch normalization layers.
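For illustration, a minimal TensorFlow sketch of the lateral inhibition function of (7), with the optional input clipping suggested by Conjecture 1; the function name, tensor shapes and defaults are assumptions and this is not the released LIL layer implementation.

```python
import tensorflow as tf

def lateral_inhibition(a, clip=False, axis=-1):
    """Sketch of Eq. (7): z = a * (1 + softmax(a)), with the softmax taken over
    the antagonistic neurons of each hypercolumn (the channel axis here)."""
    if clip:
        a = tf.clip_by_value(a, -6.0, 6.0)   # optional input clipping per Conjecture 1
    s = tf.nn.softmax(a, axis=axis)          # s(a_i, a) per hypercolumn
    return a * (1.0 + s)

# Example on a dummy activation tensor of shape (batch, height, width, channels).
a = tf.random.normal((1, 4, 4, 16))
z = lateral_inhibition(a, clip=True)
```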

B Appendix / Training setup


B.1 Training data feed

The training data feed is a pipeline of methods applied to the original samples of the dataset. They are pushed through a scheme that implements the iterator software design pattern; the model pops the next mini-batch of samples from the data feed iterator. In all our experiments the pipeline is:
$$foreach(S \in TS) \rightarrow std(S) \rightarrow aug(S \rightarrow MB) \rightarrow mix(MB) \rightarrow \infty(MB \rightarrow \{X\})$$
where $TS$ is the training set, $std$ per-channel image standardization, $aug$ reproducible random augmentation, $mix$ reproducible random shuffling of mini-batches of samples, and $\infty(MB)$ an iterator over the mini-batches.
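As an illustration of the pipeline above, a minimal sketch assuming a tf.data-based implementation; the released library implements its own iterator scheme, so the names, buffer size and placement of the augmentation function here are placeholders.

```python
import tensorflow as tf

def make_training_feed(images, labels, augment_fn, batch_size, seed):
    """Sketch of std(S) -> aug(S -> MB) -> mix(MB) -> infinite mini-batch iterator."""
    mean = tf.reduce_mean(images, axis=(0, 1, 2))   # per-channel training-set mean
    std = tf.math.reduce_std(images, axis=(0, 1, 2))
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.map(lambda x, y: ((x - mean) / std, y))         # std(S): standardization
    ds = ds.map(lambda x, y: (augment_fn(x, seed), y))      # aug(S): random augmentation
    ds = ds.batch(batch_size)                               # form mini-batches MB
    ds = ds.shuffle(100, seed=seed,
                    reshuffle_each_iteration=True)          # mix(MB): shuffle mini-batches
    return iter(ds.repeat())                                # infinite iterator over MB
```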

B.2 Data preprocessing and augmentation

We use a simple data augmentation and standardization scheme which follows standard approaches.
We standardize with the per-channel mean $\mu_{TS}(c)$ and standard deviation $\sigma_{TS}(c)$ of the samples of the training set, where $c = 1$ for grayscale images and $c \in \{r, b, g\}$ for color images.
For MNIST, an image sample $X$ is zero-padded with 3 pixels and a random $28 \times 28$ crop is taken. For FMNIST, the image is randomly left/right mirrored (horizontal flip), then it is padded with 3 pixels and a random crop is taken. For KMNIST and OMNIST, we prefer the same scheme as MNIST, since we consider left/right mirroring a non-label-preserving transformation for images of writing symbols, e.g. numbers, letters, syllabograms, ideograms. For experiments on MNIST, FMNIST, KMNIST and CIFAR10 the size of the mini-batch is 128 samples, while for OMNIST it is 64 samples. The same random seed $rnd_i$ per fold index $i = 1, \ldots, 10$ is used across all experiments.
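A minimal sketch of these per-sample transformations, assuming single-channel HWC tensors and TensorFlow image ops; the function names are illustrative.

```python
import tensorflow as tf

def augment_mnist(x, seed):
    """MNIST/KMNIST/OMNIST scheme: zero-pad by 3 pixels, then a random 28x28 crop."""
    x = tf.pad(x, [[3, 3], [3, 3], [0, 0]])                    # zero padding of 3 pixels per side
    return tf.image.random_crop(x, size=[28, 28, 1], seed=seed)

def augment_fmnist(x, seed):
    """FMNIST scheme: random horizontal flip, then pad and random crop."""
    x = tf.image.random_flip_left_right(x, seed=seed)
    x = tf.pad(x, [[3, 3], [3, 3], [0, 0]])
    return tf.image.random_crop(x, size=[28, 28, 1], seed=seed)
```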

Random number generators: In a Python implementation of a learning process there are several
random number generators that need to be seeded to ensure determinism. We seed the value rndi to:

• Generator of Python’s random package.


• Python’s hashing random generator.
• numpy package random number generator.
• TensorFlow random number generator.
• Keras random number generator.

All these provisions ensure deterministic reproduction of training by using the same sequence of
input samples, and fair comparison between models by using the same initial random weights.
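A minimal sketch of seeding these generators with the per-fold seed $rnd_i$; the helper name is illustrative and the exact call sites in the released library may differ.

```python
import os
import random
import numpy as np
import tensorflow as tf

def seed_everything(rnd_i):
    os.environ["PYTHONHASHSEED"] = str(rnd_i)  # hashing RNG (effective only if set before the process starts)
    random.seed(rnd_i)                         # Python's random package
    np.random.seed(rnd_i)                      # numpy RNG
    tf.random.set_seed(rnd_i)                  # TensorFlow RNG
    tf.keras.utils.set_random_seed(rnd_i)      # Keras helper (reseeds Python, numpy and TensorFlow)
```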

C Appendix / CNN architectures, software implementation and training infrastructure
Explainable-by-design CNN architectures: We have chosen two popular CNN architectures for our experiments; both implement skip connections to help the gradient flow towards the first layer, counteracting the vanishing gradient problem for networks of increased depth. Another reason for the selection is the trade-off between having fewer parameters in a DenseNet but more computations in comparison to a ResNet, i.e. a trade-off between the space and time complexity of models. Placing LIL layers inside a ResNet of layer depth $K$, the number of explanation levels for the R-ExplaiNet is $L = (K - 2)/2$.
When a DenseNet is used as baseline, there are $B$ blocks, each one containing $C$ modules. For the D-ExplaiNet, $C$ LIL outputs are concatenated per block, while in the additional $B - 1$ transition modules there is an LIL after the convolution and before the spatial downsampling operation. The count of available explanation levels for $K$ layers is $L = K - 2$. In the D-ExplaiNet the explainer graph nodes have edges to nodes that belong to the previous levels $l - 1, l - 2, \ldots, l - C$, due to the multiple (dense) skip connections. Calculating FCE is not trivial, since FMotifs from different levels are inside the set of causes.

C.1 Model architectural hyperparameters

We use the following notation for a convolution operation window and its respective kernel dimensions:
$$[\,width \times height\ /stride\ |\ c_{in} \rightarrow c_{out}\,]$$
for a window of the specified width and height moving with the given stride, $c_{in}$ the feature depth (image channels) of the input, and $c_{out}$ the count of neurons in a hypercolumn or, equally, the output feature depth.
All CNNs for tiny color images have a stem that is a single convolutional layer $[\,3 \times 3\ /1\ |\ 3 \rightarrow c^{(stem)}_{out}\,]$, while the other layers are $[\,3 \times 3\ /s\ |\ c^{(k)}_{in} \rightarrow c^{(k)}_{out}\,]$. A stride of $s = 2$ in convolutional layers performs spatial downsampling in ResNets, while in DenseNets this is done with a max pooling operation of stride 2.
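To make the notation concrete, a minimal Keras sketch mapping $[3 \times 3\ /s\ |\ c_{in} \rightarrow c_{out}]$ to a layer; the padding and bias settings are assumptions, not necessarily those of our implementation.

```python
import tensorflow as tf

def conv_from_notation(c_out, stride=1):
    """[3 x 3 /stride | c_in -> c_out]; c_in is inferred from the input tensor."""
    return tf.keras.layers.Conv2D(filters=c_out, kernel_size=3, strides=stride,
                                  padding="same", use_bias=False)

stem = conv_from_notation(c_out=16, stride=1)   # [3 x 3 /1 | 3 -> 16] stem for tiny color images
down = conv_from_notation(c_out=32, stride=2)   # stride 2 performs spatial downsampling in ResNets
```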

C.2 Supervised training hyperparameters

MNIST: We use SGD with an initial learning rate lr@0 = 0.02 and 0.9 momentum, and change the learning rate to lr@15 = 0.01, lr@30 = 0.005, lr@40 = 0.002, lr@50 = 0.001 until the terminal epoch 60.

FMNIST, KMNIST, OMNIST: A slightly different schedule compared to MNIST is used for the rest of the tiny grayscale image datasets, keeping the same initial learning rate and momentum. SGD starts with a learning rate lr@0 = 0.02 and changes to lr@15 = 0.01, lr@35 = 0.005, lr@45 = 0.002, lr@55 = 0.001, lr@65 = 0.0005 until the terminal epoch 75.
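As an illustration, a minimal Keras sketch of the MNIST schedule above expressed as a piecewise-constant LearningRateScheduler callback; the helper names are assumptions, not the released training loop.

```python
import tensorflow as tf

def mnist_lr_schedule(epoch, lr):
    """Piecewise-constant schedule: lr@0=0.02, lr@15=0.01, lr@30=0.005, lr@40=0.002, lr@50=0.001."""
    boundaries = {0: 0.02, 15: 0.01, 30: 0.005, 40: 0.002, 50: 0.001}
    return boundaries.get(epoch, lr)   # keep the current lr between boundary epochs

optimizer = tf.keras.optimizers.SGD(learning_rate=0.02, momentum=0.9)
lr_callback = tf.keras.callbacks.LearningRateScheduler(mnist_lr_schedule)
# model.fit(..., epochs=60, callbacks=[lr_callback])
```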

Table 6: Architectures and feature setups. Residual networks have an R prefix and DenseNet D.
K=Layers, B=Bottleneck, C=Compression 50%, k=Feature expansion size.
Id Layers c_out^(stem) Block1 Block2 Block3 Block4 Block5
R1 K=18 16 16 16 32 64
R8 K=18 8 8 8 8 8
R16 K=18 16 16 16 16 16
R24 K=18 24 24 24 24 24
R32 K=18 32 32 32 32 32
R48 K=18 48 48 48 48 48
R64 K=18 64 64 64 64 64
R72 K=18 72 72 72 72 72
R80 K=18 80 80 80 80 80
R1 K=20 16 16 16 32 64
R8 K=22 8 8 8 8 8 8
R16 K=22 16 16 16 16 16 16
R32 K=22 32 32 32 32 32 32
R48 K=22 48 48 48 48 48 48
R64 K=22 64 64 64 64 64 64
R8 K=26 8 8 8 8 8
R16 K=26 16 16 16 16 16
R32 K=26 32 32 32 32 32
R64 K=26 64 64 64 64 64
D40 K=40 24 k=12 k=12 k=12
D100BC K=100 24 k=12 k=12 k=12

CIFAR10: We are using the setup in [9] for all training hyperparameters. SGD with 0.9 momentum starts with lr@0 = 0.1, which is divided by 10 at epochs 82 and 123, with 391 steps/epoch, until the terminal epoch 164; this is slightly above the 64000 steps described in the original paper. The same training setup is also used for DenseNets.

C.3 Software implementation

This work is enabled by a proof-of-concept software implementation that is based on TensorFlow/Keras and the popular Python packages numpy, matplotlib and pandas. It includes the first version of the explainability framework, which works in a single process; parts of it are not yet enabled for GPU acceleration. All datasets, models and processes that are not supported by the frameworks have been implemented from scratch. The source code includes a reusable library for Machine Learning that will be released in the future as a standalone package and registered in the PyPI repository. The software is available for use by researchers on the primary author's GitHub repository **redacted**, along with its documentation on how to set it up and use it.

Enabling reproducibility of the training process and final state: Several aspects of experiment reproducibility were taken into account when training models, in order to have the same intermediate states at the end of each epoch and result in the same performance metrics.
The non-deterministic behaviour of 2D convolution algorithms on the GPU was disabled and cuDNN was forced to use the FFT algorithm for convolution. For TensorFlow, op determinism was enabled, while TensorFloat-32 and mixed precision were disabled. These provisions will be released as part of the upcoming Python package.
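A hedged sketch of these provisions, assuming the TensorFlow 2.9+ API; the selection of the cuDNN FFT convolution algorithm is not shown here.

```python
import tensorflow as tf

tf.config.experimental.enable_op_determinism()                  # TensorFlow op determinism
tf.config.experimental.enable_tensor_float_32_execution(False)  # disable TensorFloat-32
tf.keras.mixed_precision.set_global_policy("float32")           # disable mixed precision
```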
Nevertheless, training on different infrastructures resulted in different outcomes for the baseline ResNet models. This is an important finding to be reported: reproducibility of training states and of the final values of model parameters is not assured when different combinations of hardware, versions of the CUDA+cuDNN middleware and versions of the computational framework software are used. The reasons behind this should be further investigated.

C.4 Machine Learning infrastructure

C.4.1 Requirements
The available compute infrastructure for Machine Learning allowed us to run multiple single-model training processes for the needs of our work. Our experiments are done with small models, so a nominal requirement to run them would be a single CUDA-compatible GPU with 10 GB of memory. The training time per epoch was estimated for each model and was recorded in the experiment's workspace, which is a file structure that our software library creates.

C.4.2 Used resources


The final experiments were performed on 14 NVIDIA A16 GPUs, each with 16 GB of memory, installed in two systems running the Ubuntu 22.04 operating system. For the preliminary experiments, 2 NVIDIA RTX3060 GPUs with 12 GB were available on a Windows Server 2022 DataCenter system.

• The middleware used for the NVIDIA A16 was CUDA 12.2 + cuDNN 8.9.6 and the software library was TensorFlow 2.15.1. The complete version details for the Ubuntu system infrastructure are recorded in each experiment log file.
• The middleware used for RTX3060 was CUDA 11.7 + cuDNN DLL 6.14.11.640 and the
software library was TensorFlow 2.9.3. The complete version details for the Windows
system infrastructure are recorded in each experiment log file.

D Appendix / Complete set of model evaluation metrics
D.1 Experiments on MNIST

Initial comparison with the same weight decay: Our first series of experiments was done on MNIST. We investigated performance for a fixed depth of 18 layers by increasing the feature count: starting from 8, we increase by 8 until 32, then by 16 until 64, then by 8 until 80. Additionally, we used the inverse pyramid features of the ResNet for CIFAR10, following the setup in [9]. The rest of the experiments combined higher numbers of layers with different feature setups. Both ResNets and R-ExplaiNets were trained by keeping the same training hyperparameters, with weight decay λ = 0.001.

Table 7: Classification accuracy and relative model efficiency for group trained on MNIST-1.

Architecture Id Layers Features MP Folds Average Min Max RME


R-ExplaiNet R01 18 16, 16, 16, 32, 64 0.18 10 0.367±0.027 0.280 0.430 0.87
R-ExplaiNet R08 18 f=8 0.01 10 0.560±0.036 0.460 0.650 0.00
R-ExplaiNet R16 18 f=16 0.04 10 0.374±0.034 0.260 0.450 1.76
R-ExplaiNet R24 18 f=24 0.09 10 0.324±0.022 0.270 0.380 2.01
R-ExplaiNet R32 18 f=32 0.15 10 0.301±0.021 0.250 0.340 1.87
R-ExplaiNet R48 18 f=48 0.34 10 0.296±0.016 0.250 0.330 1.31
R-ExplaiNet R64 18 f=64 0.60 10 0.289±0.019 0.220 0.330 1.04
R-ExplaiNet R72 18 f=72 0.75 10 0.285±0.008 0.270 0.310 0.96
R-ExplaiNet R80 18 f=80 0.93 10 0.276±0.018 0.230 0.330 0.93
R-ExplaiNet R08 22 f=8 0.01 10 0.519±0.040 0.390 0.600 0.14
R-ExplaiNet R16 22 f=16 0.05 10 0.328±0.030 0.230 0.400 2.58
R-ExplaiNet R32 22 f=32 0.19 10 0.290±0.012 0.260 0.320 1.85
R-ExplaiNet R64 22 f=64 0.74 10 0.268±0.023 0.200 0.320 1.11
R-ExplaiNet R08 26 f=8 0.01 10 0.499±0.033 0.410 0.550 0.27
R-ExplaiNet R16 26 f=16 0.06 10 0.336±0.030 0.270 0.410 2.18
R-ExplaiNet R32 26 f=32 0.23 10 0.293±0.025 0.230 0.380 1.64
R-ExplaiNet R64 26 f=64 0.89 10 0.293±0.020 0.240 0.340 0.82
ResNet R01 18 16, 16, 16, 32, 64 0.18 10 0.343±0.018 0.290 0.380 1.13
ResNet R08 18 f=8 0.01 10 0.562±0.027 0.480 0.630 0.00
ResNet R16 18 f=16 0.04 10 0.387±0.019 0.330 0.440 1.50
ResNet R24 18 f=24 0.09 10 0.343±0.023 0.270 0.390 1.66
ResNet R32 18 f=32 0.15 10 0.323±0.018 0.270 0.370 1.53
ResNet R48 18 f=48 0.34 10 0.301±0.016 0.260 0.350 1.25
ResNet R64 18 f=64 0.60 10 0.287±0.013 0.250 0.320 1.06
ResNet R72 18 f=72 0.75 10 0.287±0.015 0.260 0.330 0.95
ResNet R80 18 f=80 0.93 10 0.282±0.022 0.210 0.320 0.89
ResNet R08 22 f=8 0.01 10 0.462±0.022 0.420 0.540 0.79
ResNet R16 22 f=16 0.05 10 0.344±0.020 0.290 0.390 2.19
ResNet R32 22 f=32 0.19 10 0.275±0.016 0.240 0.330 2.09
ResNet R64 22 f=64 0.74 10 0.263±0.015 0.240 0.310 1.16
ResNet R08 26 f=8 0.01 10 0.488±0.030 0.410 0.570 0.38
ResNet R16 26 f=16 0.06 10 0.328±0.027 0.260 0.400 2.36
ResNet R32 26 f=32 0.23 10 0.302±0.022 0.250 0.360 1.52
ResNet R64 26 f=64 0.89 10 0.273±0.015 0.210 0.300 0.98

Increased weight decay for training ExplaiNets: We trained ResNets with a weight decay of λ = 0.001 and R-ExplaiNets with double the amount of regularization, using λ = 0.002. All other training conditions were kept the same.

Table 8: Classification accuracy and relative model efficiency for group trained on MNIST-2

Architecture Id Layers Features MP Folds Average Min Max RME


R-ExplaiNet R01 18 16, 16, 16, 32, 64 0.18 10 0.334±0.013 0.300 0.360 1.22
R-ExplaiNet R08 18 f=8 0.01 10 0.570±0.039 0.480 0.670 0.00
R-ExplaiNet R16 18 f=16 0.04 10 0.382±0.015 0.350 0.420 1.60
R-ExplaiNet R24 18 f=24 0.09 10 0.331±0.021 0.280 0.380 1.84
R-ExplaiNet R32 18 f=32 0.15 10 0.311±0.019 0.270 0.370 1.67
R-ExplaiNet R48 18 f=48 0.34 10 0.290±0.017 0.230 0.320 1.34
R-ExplaiNet R64 18 f=64 0.60 10 0.300±0.016 0.270 0.360 0.92
R-ExplaiNet R72 18 f=72 0.75 10 0.282±0.009 0.260 0.300 0.95
R-ExplaiNet R80 18 f=80 0.93 10 0.269±0.021 0.210 0.330 0.95
R-ExplaiNet R08 22 f=8 0.01 10 0.521±0.047 0.340 0.600 0.16
R-ExplaiNet R16 22 f=16 0.05 10 0.352±0.015 0.320 0.390 1.99
R-ExplaiNet R32 22 f=32 0.19 10 0.291±0.021 0.220 0.340 1.77
R-ExplaiNet R64 22 f=64 0.74 10 0.258±0.008 0.240 0.280 1.16
R-ExplaiNet R08 26 f=8 0.01 10 0.487±0.021 0.400 0.520 0.44
R-ExplaiNet R16 26 f=16 0.06 10 0.347±0.025 0.280 0.410 1.92
R-ExplaiNet R32 26 f=32 0.23 10 0.303±0.015 0.250 0.340 1.46
R-ExplaiNet R64 26 f=64 0.89 10 0.297±0.022 0.240 0.340 0.77
ResNet R01 18 16, 16, 16, 32, 64 0.18 10 0.343±0.018 0.290 0.380 1.12
ResNet R08 18 f=8 0.01 10 0.562±0.027 0.480 0.630 0.00
ResNet R16 18 f=16 0.04 10 0.387±0.019 0.330 0.440 1.50
ResNet R24 18 f=24 0.09 10 0.343±0.023 0.270 0.390 1.64
ResNet R32 18 f=32 0.15 10 0.323±0.018 0.270 0.370 1.49
ResNet R48 18 f=48 0.34 10 0.301±0.016 0.260 0.350 1.22
ResNet R64 18 f=64 0.60 10 0.287±0.013 0.250 0.320 1.03
ResNet R72 18 f=72 0.75 10 0.287±0.015 0.260 0.330 0.91
ResNet R80 18 f=80 0.93 10 0.282±0.022 0.210 0.320 0.86
ResNet R16 22 f=16 0.05 10 0.344±0.020 0.290 0.390 2.16
ResNet R32 22 f=32 0.19 10 0.275±0.016 0.240 0.330 2.02
ResNet R64 22 f=64 0.74 10 0.263±0.015 0.240 0.310 1.12
ResNet R16 26 f=16 0.06 10 0.328±0.027 0.260 0.400 2.31
ResNet R32 26 f=32 0.23 10 0.302±0.022 0.250 0.360 1.47
ResNet R64 26 f=64 0.89 10 0.273±0.015 0.210 0.300 0.94

Using input value clipping with the same weight decay for training ExplaiNets: We made a modification to the LIL so that its input is clipped to the range [−6, 6], following Conjecture 1, which is stated under the formal proof of gradient amplification (see Appendix A). We trained both ResNets and R-ExplaiNets-C with all training hyperparameters kept the same. The following table compares the models with clipping, noted with the suffix C, with R-ExplaiNet models without clipping trained with λ = 0.002, and with ResNets trained with λ = 0.001.

Table 9: Classification accuracy and relative model efficiency for group trained on MNIST-3

Architecture Id Layers Features MP Folds Average Min Max RME


R-ExplaiNet-C R08 18 f=8 0.01 10 0.546±0.042 0.460 0.670 0.04
R-ExplaiNet-C R16 18 f=16 0.04 10 0.364±0.030 0.270 0.430 1.93
R-ExplaiNet-C R32 18 f=32 0.15 10 0.311±0.014 0.270 0.350 1.64
R-ExplaiNet-C R64 18 f=64 0.60 10 0.287±0.018 0.240 0.330 1.01
R-ExplaiNet-C R08 22 f=8 0.01 10 0.513±0.031 0.430 0.610 0.22
R-ExplaiNet-C R16 22 f=16 0.05 10 0.346±0.022 0.300 0.420 2.09
R-ExplaiNet-C R32 22 f=32 0.19 10 0.283±0.025 0.220 0.350 1.87
R-ExplaiNet-C R64 22 f=64 0.74 10 0.256±0.013 0.220 0.290 1.16
R-ExplaiNet-C R08 26 f=8 0.01 10 0.469±0.031 0.360 0.530 0.66
R-ExplaiNet-C R16 26 f=16 0.06 10 0.354±0.029 0.300 0.450 1.76
R-ExplaiNet-C R32 26 f=32 0.23 10 0.294±0.023 0.240 0.350 1.55
R-ExplaiNet-C R64 26 f=64 0.89 10 0.292±0.018 0.240 0.340 0.79
R-ExplaiNet R08 18 f=8 0.01 10 0.570±0.039 0.480 0.670 0.00
R-ExplaiNet R16 18 f=16 0.04 10 0.382±0.015 0.350 0.420 1.57
R-ExplaiNet R32 18 f=32 0.15 10 0.311±0.019 0.270 0.370 1.64
R-ExplaiNet R64 18 f=64 0.60 10 0.300±0.016 0.270 0.360 0.91
R-ExplaiNet R08 22 f=8 0.01 10 0.521±0.047 0.340 0.600 0.16
R-ExplaiNet R16 22 f=16 0.05 10 0.352±0.015 0.320 0.390 1.96
R-ExplaiNet R32 22 f=32 0.19 10 0.291±0.021 0.220 0.340 1.75
R-ExplaiNet R64 22 f=64 0.74 10 0.258±0.008 0.240 0.280 1.14
R-ExplaiNet R08 26 f=8 0.01 10 0.487±0.021 0.400 0.520 0.44
R-ExplaiNet R16 26 f=16 0.06 10 0.347±0.025 0.280 0.410 1.89
R-ExplaiNet R32 26 f=32 0.23 10 0.303±0.015 0.250 0.340 1.44
R-ExplaiNet R64 26 f=64 0.89 10 0.297±0.022 0.240 0.340 0.76
ResNet R08 18 f=8 0.01 10 0.562±0.027 0.480 0.630 0.00
ResNet R16 18 f=16 0.04 10 0.387±0.019 0.330 0.440 1.48
ResNet R32 18 f=32 0.15 10 0.323±0.018 0.270 0.370 1.47
ResNet R64 18 f=64 0.60 10 0.287±0.013 0.250 0.320 1.01
ResNet R08 22 f=8 0.01 10 0.462±0.022 0.420 0.540 0.83
ResNet R16 22 f=16 0.05 10 0.344±0.020 0.290 0.390 2.13
ResNet R32 22 f=32 0.19 10 0.275±0.016 0.240 0.330 1.99
ResNet R64 22 f=64 0.74 10 0.263±0.015 0.240 0.310 1.10
ResNet R08 26 f=8 0.01 10 0.488±0.030 0.410 0.570 0.43
ResNet R16 26 f=16 0.06 10 0.328±0.027 0.260 0.400 2.28
ResNet R32 26 f=32 0.23 10 0.302±0.022 0.250 0.360 1.45
ResNet R64 26 f=64 0.89 10 0.273±0.015 0.210 0.300 0.93

Using input value clipping with increased weight decay for training ExplaiNets: The last set of experiments used clipping to the range [−6, 6] and λ = 0.002 for the R-ExplaiNets-C, with the same training conditions as all other experiments on MNIST. From the results it is apparent that more regularization led to instability and worse performance.

Table 10: Classification accuracy and relative model efficiency for group trained on MNIST-4

Architecture Id Layers Features MP Folds Average Min Max RME


R-ExplaiNet-C R08 18 f=8 0.01 10 1.140±0.213 0.750 1.700 1.36
R-ExplaiNet-C R16 18 f=16 0.04 10 0.479±0.032 0.400 0.540 3.53
R-ExplaiNet-C R32 18 f=32 0.15 10 0.319±0.014 0.290 0.350 2.36
R-ExplaiNet-C R64 18 f=64 0.60 10 0.267±0.014 0.230 0.300 1.29
R-ExplaiNet-C R08 22 f=8 0.01 10 1.752±0.851 0.720 5.200 0.00
R-ExplaiNet-C R16 22 f=16 0.05 10 0.452±0.023 0.370 0.510 3.32
R-ExplaiNet-C R32 22 f=32 0.19 10 0.296±0.012 0.280 0.340 2.19
R-ExplaiNet-C R64 22 f=64 0.74 10 0.271±0.021 0.220 0.320 1.14
R-ExplaiNet-C R08 26 f=8 0.01 10 1.270±0.229 0.850 1.910 0.67
R-ExplaiNet-C R16 26 f=16 0.06 10 0.430±0.034 0.330 0.510 3.16
R-ExplaiNet-C R32 26 f=32 0.23 10 0.333±0.016 0.280 0.370 1.88
R-ExplaiNet-C R64 26 f=64 0.89 10 0.291±0.015 0.260 0.330 1.01
ResNet R08 18 f=8 0.01 10 0.562±0.027 0.480 0.630 5.94
ResNet R16 18 f=16 0.04 10 0.387±0.019 0.330 0.440 4.16
ResNet R32 18 f=32 0.15 10 0.323±0.018 0.270 0.370 2.34
ResNet R64 18 f=64 0.60 10 0.287±0.013 0.250 0.320 1.25
ResNet R08 22 f=8 0.01 10 0.462±0.022 0.420 0.540 6.42
ResNet R16 22 f=16 0.05 10 0.344±0.020 0.290 0.390 4.01
ResNet R32 22 f=32 0.19 10 0.275±0.016 0.240 0.330 2.26
ResNet R64 22 f=64 0.74 10 0.263±0.015 0.240 0.310 1.16
ResNet R08 26 f=8 0.01 10 0.488±0.030 0.410 0.570 5.59
ResNet R16 26 f=16 0.06 10 0.328±0.027 0.260 0.400 3.76
ResNet R32 26 f=32 0.23 10 0.302±0.022 0.250 0.360 1.98
ResNet R64 26 f=64 0.89 10 0.273±0.015 0.210 0.300 1.04

D.2 Experiments on Fashion MNIST

In the table below, Id identifies the architectural hyperparameter setup for the given depth of layers, which corresponds to the features per block, where f= stands for the same count of features in all layers. Model size is reported in millions of parameters (MP).

Table 11: Classification accuracy and relative model efficiency for group trained on FMNIST

Architecture Id Layers Features MP Folds Average Min Max RME


R-ExplaiNet R01 18 16, 16, 16, 32, 64 0.18 10 92.02±0.14 91.67 92.28 0.67
R-ExplaiNet R08 18 f=8 0.01 10 90.71±0.13 90.39 91.09 0.01
R-ExplaiNet R16 18 f=16 0.04 10 91.53±0.14 91.15 91.81 0.58
R-ExplaiNet R32 18 f=32 0.15 10 92.41±0.10 92.20 92.63 1.28
R-ExplaiNet R64 18 f=64 0.60 10 92.60±0.10 92.28 92.85 0.82
R-ExplaiNet R01 20 16, 16, 32, 64 0.27 10 92.59±0.19 92.01 92.95 1.19
R-ExplaiNet R48 22 f=48 0.42 10 92.86±0.13 92.54 93.12 1.30
R-ExplaiNet R48 26 f=48 0.50 10 92.91±0.15 92.40 93.21 1.26
R-ExplaiNet R64 26 f=64 0.89 10 93.03±0.12 92.70 93.45 1.06
ResNet R01 18 16, 16, 16, 32, 64 0.18 10 91.89±0.14 91.52 92.29 0.54
ResNet R08 18 f=8 0.01 10 90.62±0.11 90.29 90.85 0.00
ResNet R16 18 f=16 0.04 10 91.28±0.13 90.86 91.56 0.29
ResNet R32 18 f=32 0.15 10 92.09±0.09 91.88 92.40 0.82
ResNet R64 18 f=64 0.60 10 92.54±0.10 92.28 92.76 0.76
ResNet R01 20 16, 16, 32, 64 0.27 10 92.42±0.12 92.14 92.64 0.97
ResNet R48 22 f=48 0.42 10 92.69±0.11 92.34 92.89 1.08
ResNet R48 26 f=48 0.50 10 92.84±0.14 92.54 93.19 1.17
ResNet R64 26 f=64 0.89 10 92.83±0.13 92.40 93.15 0.86

D.3 Experiments on Kuzushiji MNIST

Table 12: Classification accuracy and relative model efficiency for group trained on KMNIST

Architecture Id Layers Features MP Folds Average Min Max RME


R-ExplaiNet R08 18 f=8 0.01 10 95.33±0.25 94.36 95.86 0.00
R-ExplaiNet R16 18 f=16 0.04 10 97.63±0.08 97.38 97.80 2.19
R-ExplaiNet R32 18 f=32 0.15 10 98.39±0.07 98.22 98.62 2.10
R-ExplaiNet R64 18 f=64 0.60 10 98.56±0.04 98.43 98.62 1.20
R-ExplaiNet R01 20 16, 16, 32, 64 0.27 10 98.62±0.06 98.42 98.75 1.87
R-ExplaiNet R48 22 f=48 0.42 10 98.59±0.05 98.43 98.71 1.46
R-ExplaiNet R48 26 f=48 0.50 10 98.53±0.06 98.41 98.75 1.29
R-ExplaiNet R64 26 f=64 0.89 10 98.66±0.05 98.53 98.78 1.06
ResNet R08 18 f=8 0.01 10 95.30±0.14 94.88 95.63 0.00
ResNet R16 18 f=16 0.04 10 97.50±0.09 97.29 97.73 1.91
ResNet R32 18 f=32 0.15 10 98.16±0.10 98.01 98.46 1.76
ResNet R64 18 f=64 0.60 10 98.39±0.06 98.22 98.56 1.06
ResNet R01 20 16, 16, 32, 64 0.27 10 98.43±0.08 98.23 98.67 1.63
ResNet R48 22 f=48 0.42 10 98.44±0.05 98.34 98.58 1.32
ResNet R48 26 f=48 0.50 10 98.42±0.06 98.24 98.52 1.18
ResNet R64 26 f=64 0.89 10 98.53±0.06 98.34 98.66 0.97

D.4 Experiments on Oracle MNIST

Table 13: Classification accuracy and relative model efficiency for group trained on OMNIST

Architecture Id Layers Features MP Folds Average Min Max RME


R-ExplaiNet R08 18 f=8 0.01 10 93.35±0.17 92.83 93.67 0.03
R-ExplaiNet R16 18 f=16 0.04 10 95.35±0.09 95.17 95.60 1.74
R-ExplaiNet R32 18 f=32 0.15 10 96.13±0.13 95.90 96.47 1.74
R-ExplaiNet R64 18 f=64 0.60 10 96.49±0.17 96.10 96.87 1.13
R-ExplaiNet R01 20 16, 16, 32, 64 0.27 10 96.33±0.14 95.90 96.67 1.50
R-ExplaiNet R48 22 f=48 0.42 10 96.53±0.13 96.13 96.83 1.40
R-ExplaiNet R48 26 f=48 0.50 20 96.65±0.07 96.27 96.93 1.38
R-ExplaiNet R64 26 f=64 0.89 10 96.68±0.11 96.43 96.93 1.06
ResNet R08 18 f=8 0.01 10 93.13±0.30 92.17 93.73 0.00
ResNet R16 18 f=16 0.04 10 95.21±0.18 94.83 95.87 1.50
ResNet R32 18 f=32 0.15 10 96.05±0.18 95.67 96.53 1.63
ResNet R64 18 f=64 0.60 10 96.39±0.15 95.93 96.77 1.06
ResNet R01 20 16, 16, 32, 64 0.27 10 96.34±0.12 96.03 96.63 1.51
ResNet R48 22 f=48 0.42 10 96.39±0.16 96.07 96.83 1.26
ResNet R48 26 f=48 0.50 10 96.44±0.17 95.97 96.83 1.19
ResNet R64 26 f=64 0.89 10 96.56±0.12 96.33 96.93 0.97

D.5 Experiments on CIFAR10

For CIFAR10 we have two different architectures with skip connections, where Id identifies the architectural hyperparameter setup for the given depth of layers. For ResNets we report the features per block, where f= stands for the same count of features in all layers. For DenseNets we report the feature expansion count k=, where B stands for the use of bottleneck 1 × 1 convolutions and C for 50% compression of features in the transition layers. Model size is reported in millions of parameters (MP).

Table 14: Classification accuracy and relative model efficiency for group trained on CIFAR10

Architecture Id Layers Features MP Folds Average Min Max RME


D-ExplaiNet BC 100 k=12 0.79 5 94.60±0.16 94.35 94.80 1.12
DenseNet BC 100 k=12 0.79 5 94.38±0.18 94.06 94.58 1.09
D-ExplaiNet 40 k=12 1.08 5 93.80±0.08 93.68 93.90 0.86
DenseNet 40 k=12 1.08 5 93.41±0.15 93.15 93.58 0.81
R-ExplaiNet R01 18 16, 16, 16, 32, 64 0.18 10 89.74±0.11 89.48 90.04 1.04
R-ExplaiNet R08 18 f=8 0.01 10 78.49±0.09 78.20 78.70 0.00
R-ExplaiNet R16 18 f=16 0.04 10 87.08±0.11 86.82 87.34 1.26
R-ExplaiNet R32 18 f=32 0.15 10 91.19±0.11 90.80 91.42 1.49
R-ExplaiNet R48 18 f=48 0.34 10 92.61±0.10 92.37 92.88 1.27
R-ExplaiNet R01 20 16, 16, 32, 64 0.27 10 91.88±0.13 91.52 92.28 1.26
R-ExplaiNet R64 26 f=64 0.89 10 93.80±0.12 93.54 94.15 0.94
ResNet R01 18 16, 16, 16, 32, 64 0.18 10 89.11±0.11 88.93 89.46 0.91
ResNet R08 18 f=8 0.01 10 78.16±0.31 77.24 79.02 0.00
ResNet R16 18 f=16 0.04 10 86.72±0.24 86.33 87.48 1.14
ResNet R32 18 f=32 0.15 10 90.91±0.13 90.62 91.32 1.42
ResNet R48 18 f=48 0.34 10 92.16±0.09 91.97 92.35 1.18
ResNet R01 20 16, 16, 32, 64 0.27 10 91.59±0.12 91.37 91.90 1.19
ResNet R64 26 f=64 0.90 10 93.41±0.09 93.21 93.59 0.88

E Appendix / Feature motif discovery and mosaics of handwritten digits
E.1 Unsupervised feature motif discovery

We have used YAMDA [21], an implementation of EXTREME based on the Torch framework that uses GPU acceleration to discover motifs in genomic sequences. The implementation is available on the YAMDA GitHub repository, and we pulled commit #00e9c9d. Some quality improvements were needed on the implementation, provided under the MIT license, so that it can function on a sequence set that has short aligned sequences, which is the case for our LDF vectors encoded in the quaternary system.

Sequence data hyperparameters: The sequence dataset is converted into the .fasta format using the DNA letter alphabet, where $\delta = 4$ is the basic data hyperparameter for the algorithm. The batch size is set to 4096, since the maximum size of all LDF vocabularies $V^{(l)} = \{d^{(l)}(i)\}$ is $|V^{(5)}| = 33414$. In our experiments the size of the LDF vectors was $c_{LDF} = 4$ for $c_{out} = 16$ features, so the length of the quaternary sequence vector is $c_4 = 8$. We use it as our motif width hyperparameter $n_{motif} = 8$.
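To make the encoding concrete, a minimal sketch that turns one LDF vector into a DNA-letter sequence for a .fasta record, under the hyperparameters above ($c_{LDF} = 4$, $c_{out} = 16$, hence two base-4 digits per index); the letter ordering and helper names are illustrative assumptions, not necessarily those of our implementation.

```python
LETTERS = "ACGT"   # quaternary digits 0..3 mapped onto the DNA alphabet

def ldf_to_sequence(ldf):
    """Encode an LDF vector of neuron indices (0..15) as a quaternary string in DNA letters:
    2 base-4 digits per index, so 4 indices -> a sequence of length 8."""
    seq = []
    for idx in ldf:
        seq.append(LETTERS[(idx // 4) % 4])   # most significant base-4 digit
        seq.append(LETTERS[idx % 4])          # least significant base-4 digit
    return "".join(seq)

print(ldf_to_sequence([7, 0, 15, 3]))   # -> "CTAATTAT"
```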

Motif discovery hyperparameters: Several hyperparameters of YAMDA are kept at their default values, and it is switched to remove a sequence sample from the next epoch if it contains a discovered motif. We set $n_{half} = 2$, the k-mer half-length hyperparameter that is used in the search that seeds the algorithm with the proper initial sequences. Setting $n_{motif} = 8$ makes the algorithm merely search for motifs in aligned sequences, while it inherently supports motif discovery at any position inside the sequences. We set the maximum count of motifs to discover to $K_{motif} = 96$ and the minimum count of occurrences (sites) for a motif in the data to $N_{sites} = 200$.
The algorithm's execution is very fast in our experimental setup, which is promising for scaling up to a higher count of CNN features and/or a larger size of LDF vectors.

E.2 FMotif mosaics of handwritten digit 7.

Figure 6: Image representation $\Psi^{(1)} \in \mathbb{N}^{28 \times 28}$ for the first sample of the MNIST validation set.

Figure 7: Image representation $\Psi^{(2)} \in \mathbb{N}^{28 \times 28}$ for the first sample of the MNIST validation set.

E.3 FMotif matching scores for handwritten digits

Figure 8: FMotif 5, which is part of the causes in level 1, seems to be present in illumination transition edges. It is visible at row 27 of Figure 6 and is suspected to be a side-effect of zero padding.

Figure 9: The edge FMotif 11, which is part of the causes in level 1.

Figure 10: The black background FMotif 43, which is part of the causes in level 1.

Figure 11: FMotif 68 is observed as an effect in level 2 and dominates many patches, as seen in Figure 7. It can be explained as being caused by the black background.

Figure 12: A horizontal edge FMotif 13, observed as an effect in level 2.

Figure 13: FMotif 15, the most explained effect of level 2 with FCE = 99.65, caused by a 3 × 3 receptive field of FMotifs from level 1. At level 2, each receptive field is mapped to an 11 × 11 area of the image input space.

F Appendix / Detailed explainability metrics
F.1 Motifs of best causality

Table 15: Fidelity of cause to effect (FCE) for the best 10 motifs per level of MNIST explanations,
using explainer R-ExplaiNet18-16 on the validation set.

Level 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
2 m216 m226 m214 m279 m237 m281 m263 m228 m285 m241
Top-10 FCE 99.96 99.89 99.83 99.83 99.81 99.76 99.69 99.63 99.62 99.58

3 m375 m314 m352 m373 m335 m351 m328 m376 m350 m321
Top-10 FCE 94.93 92.34 92.21 91.85 91.62 91.02 90.33 90.10 89.91 88.22

4 m413 m430 m446 m452 m428 m439 m435 m482 m460 m478
Top-10 FCE 99.76 99.48 98.76 98.29 98.15 97.84 97.76 97.67 97.51 96.79

5 m573 m565 m58 m51 m524 m575 m564 m533 m587 m561
Top-10 FCE 97.67 97.62 96.38 96.15 96.08 96.01 96.01 95.97 95.95 95.73

6 m628 m640 m672 m618 m639 m662 m656 m62 m651 m633
Top-10 FCE 100.00 99.85 99.75 99.74 99.73 99.69 99.68 99.66 99.66 99.65

7 m71 m737 m757 m731 m754 m715 m73 m730 m748 m729
Top-10 FCE 99.70 99.66 97.49 97.34 97.32 96.98 96.50 96.10 96.00 95.81

8 m838 m832 m810 m820 m852 m837 m88 m86 m811 m841
Top-10 FCE 99.46 98.99 95.55 86.35 85.93 85.80 85.12 85.05 84.77 83.64

NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the
paper’s contributions and scope?
Answer: [Yes]
Justification: The claims are proven by experiments, with a publicly available proof-of-concept prototype software implementation, while evidence to support the claims is presented in the sections of the paper.
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims
made in the paper.
• The abstract and/or introduction should clearly state the claims made, including the
contributions made in the paper and important assumptions and limitations. A No or
NA answer to this question will not be perceived well by the reviewers.
• The claims made should match theoretical and experimental results, and reflect how
much the results can be expected to generalize to other settings.
• It is fine to include aspirational goals as motivation as long as it is clear that these goals
are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations of current work are reported along with future work that is
required to overcome them.
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that
the paper has limitations, but those are not discussed in the paper.
• The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to
violations of these assumptions (e.g., independence assumptions, noiseless settings,
model well-specification, asymptotic approximations only holding locally). The authors
should reflect on how these assumptions might be violated in practice and what the
implications would be.
• The authors should reflect on the scope of the claims made, e.g., if the approach was
only tested on a few datasets or with a few runs. In general, empirical results often
depend on implicit assumptions, which should be articulated.
• The authors should reflect on the factors that influence the performance of the approach.
For example, a facial recognition algorithm may perform poorly when image resolution
is low or images are taken in low lighting. Or a speech-to-text system might not be
used reliably to provide closed captions for online lectures because it fails to handle
technical jargon.
• The authors should discuss the computational efficiency of the proposed algorithms
and how they scale with dataset size.
• If applicable, the authors should discuss possible limitations of their approach to
address problems of privacy and fairness.
• While the authors might fear that complete honesty about limitations might be used by
reviewers as grounds for rejection, a worse outcome might be that reviewers discover
limitations that aren’t acknowledged in the paper. The authors should use their best
judgment and recognize that individual actions in favor of transparency play an impor-
tant role in developing norms that preserve the integrity of the community. Reviewers
will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and
a complete (and correct) proof?

Answer: [Yes]
Justification: We present a theoretical proof of how the lateral inhibition layer amplifies gradients during gradient descent, which can lead to the attenuation of neuron activations.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and cross-
referenced.
• All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if
they appear in the supplemental material, the authors are encouraged to provide a short
proof sketch to provide intuition.
• Inversely, any informal proof provided in the core of the paper should be complemented
by formal proofs provided in appendix or supplemental material.
• Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main ex-
perimental results of the paper to the extent that it affects the main claims and/or conclusions
of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: This work is highly focused on reproducibility. Training epochs are deterministically reproduced in specific hardware+middleware+software setups that are reported in full detail, and the complete set of hyperparameter configurations for each experiment is provided in the supplementary material, paired with the source code that uses them.
Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived
well by the reviewers: Making the paper reproducible is important, regardless of
whether the code and data are provided or not.
• If the contribution is a dataset and/or model, the authors should describe the steps taken
to make their results reproducible or verifiable.
• Depending on the contribution, reproducibility can be accomplished in various ways.
For example, if the contribution is a novel architecture, describing the architecture fully
might suffice, or if the contribution is a specific model and empirical evaluation, it may
be necessary to either make it possible for others to replicate the model with the same
dataset, or provide access to the model. In general. releasing code and data is often
one good way to accomplish this, but reproducibility can also be provided via detailed
instructions for how to replicate the results, access to a hosted model (e.g., in the case
of a large language model), releasing of a model checkpoint, or other means that are
appropriate to the research performed.
• While NeurIPS does not require releasing code, the conference does require all submis-
sions to provide some reasonable avenue for reproducibility, which may depend on the
nature of the contribution. For example
(a) If the contribution is primarily a new algorithm, the paper should make it clear how
to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe
the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should
either be a way to access this model for reproducing the results or a way to reproduce
the model (e.g., with an open-source dataset or instructions for how to construct
the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case
authors are welcome to describe the particular way they provide for reproducibility.

In the case of closed-source models, it may be that access to the model is limited in
some way (e.g., to registered users), but it should be possible for other researchers
to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instruc-
tions to faithfully reproduce the main experimental results, as described in supplemental
material?
Answer: [Yes]
Justification: Source code will be openly accessible via GitHub, all datasets used in this work are open, and documentation will be uploaded along with the code.
Guidelines:
• The answer NA means that paper does not include experiments requiring code.
• Please see the NeurIPS code and data submission guidelines (https://nips.cc/
public/guides/CodeSubmissionPolicy) for more details.
• While we encourage the release of code and data, we understand that this might not be
possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not
including code, unless this is central to the contribution (e.g., for a new open-source
benchmark).
• The instructions should contain the exact command and environment needed to run to
reproduce the results. See the NeurIPS code and data submission guidelines (https:
//nips.cc/public/guides/CodeSubmissionPolicy) for more details.
• The authors should provide instructions on data access and preparation, including how
to access the raw data, preprocessed data, intermediate data, and generated data, etc.
• The authors should provide scripts to reproduce all experimental results for the new
proposed method and baselines. If only a subset of experiments are reproducible, they
should state which ones are omitted from the script and why.
• At submission time, to preserve anonymity, the authors should release anonymized
versions (if applicable).
• Providing as much information as possible in supplemental material (appended to the
paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyper-
parameters, how they were chosen, type of optimizer, etc.) necessary to understand the
results?
Answer: [Yes]
Justification: The most important hyperparameters are described in the paper's main sections, while others can be found in the Appendices. The full set for every experiment can be found in the supplementary material, where each experiment has a uniquely identifying code that names a hyperparameter configuration file in an easy-to-read JSON format.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The experimental setting should be presented in the core of the paper to a level of detail
that is necessary to appreciate the results and make sense of them.
• The full details can be provided either with the code, in appendix, or as supplemental
material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate
information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We have trained the different architectural setups (layers, features) over 10 random initial conditions and report metrics with confidence intervals.

Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confi-
dence intervals, or statistical significance tests, at least for the experiments that support
the main claims of the paper.
• The factors of variability that the error bars are capturing should be clearly stated (for
example, train/test split, initialization, random drawing of some parameter, or overall
run with given experimental conditions).
• The method for calculating the error bars should be explained (closed form formula,
call to a library function, bootstrap, etc.)
• The assumptions made should be given (e.g., Normally distributed errors).
• It should be clear whether the error bar is the standard deviation or the standard error
of the mean.
• It is OK to report 1-sigma error bars, but one should state it. The authors should
preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis
of Normality of errors is not verified.
• For asymmetric distributions, the authors should be careful not to show in tables or
figures symmetric error bars that would yield results that are out of range (e.g. negative
error rates).
• If error bars are reported in tables or plots, The authors should explain in the text how
they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the com-
puter resources (type of compute workers, memory, time of execution) needed to reproduce
the experiments?
Answer: [Yes]
Justification: The compute resource requirements are mentioned. All experiments run on a single worker; the seconds/epoch for each experiment are recorded in training logs that will be included in the supplementary material. Indicative measurements of elapsed training time are presented for some experiments in the Appendices.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The paper should indicate the type of compute workers CPU or GPU, internal cluster,
or cloud provider, including relevant memory and storage.
• The paper should provide the amount of compute required for each of the individual
experimental runs as well as estimate the total compute.
• The paper should disclose whether the full research project required more compute
than the experiments reported in the paper (e.g., preliminary or failed experiments that
didn’t make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the
NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: Using public datasets ensures compliance with worldwide privacy laws. Since our models are not generative, no provisions are needed for adding indications or watermarks to their output.
Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
• If the authors answer No, they should explain the special circumstances that require a
deviation from the Code of Ethics.
• The authors should make sure to preserve anonymity (e.g., if there is a special consid-
eration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative
societal impacts of the work performed?
Answer: [Yes]
Justification: The specific field of research that our work belongs to aims to bring positive societal impacts, mainly increasing trustworthiness, ensuring compliance with the law, and mitigating model bias, all of which are mentioned in the paper.
Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• If the authors answer NA or No, they should explain why their work has no societal
impact or why the paper does not address societal impact.
• Examples of negative societal impacts include potential malicious or unintended uses
(e.g., disinformation, generating fake profiles, surveillance), fairness considerations
(e.g., deployment of technologies that could make decisions that unfairly impact specific
groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied
to particular applications, let alone deployments. However, if there is a direct path to
any negative applications, the authors should point it out. For example, it is legitimate
to point out that an improvement in the quality of generative models could be used to
generate deepfakes for disinformation. On the other hand, it is not needed to point out
that a generic algorithm for optimizing neural networks could enable people to train
models that generate Deepfakes faster.
• The authors should consider possible harms that could arise when the technology is
being used as intended and functioning correctly, harms that could arise when the
technology is being used as intended but gives incorrect results, and harms following
from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation
strategies (e.g., gated release of models, providing defenses in addition to attacks,
mechanisms for monitoring misuse, mechanisms to monitor how a system learns from
feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible
release of data or models that have a high risk for misuse (e.g., pretrained language models,
image generators, or scraped datasets)?
Answer: [NA]
Justification:
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with
necessary safeguards to allow for controlled use of the model, for example by requiring
that users adhere to usage guidelines or restrictions to access the model or implementing
safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors
should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do
not require this, but we encourage authors to take this into account and make a best
faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in
the paper, properly credited and are the license and terms of use explicitly mentioned and
properly respected?
Answer: [Yes]

Justification: We credit assets via citations, preserve license comments/files in the source code, and provide attribution with footnote links to dataset homepages and a short description of each work.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a
URL.
• The name of the license (e.g., CC-BY 4.0) should be included for each asset.
• For scraped data from a particular source (e.g., website), the copyright and terms of
service of that source should be provided.
• If assets are released, the license, copyright information, and terms of use in the
package should be provided. For popular datasets, paperswithcode.com/datasets
has curated licenses for some datasets. Their licensing guide can help determine the
license of a dataset.
• For existing datasets that are re-packaged, both the original license and the license of
the derived asset (if it has changed) should be provided.
• If this information is not available online, the authors are encouraged to reach out to
the asset’s creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation
provided alongside the assets?
Answer: [Yes]
Justification: New code/model assets are introduced in the paper; detailed documentation will be provided as part of a GitHub repository and/or project homepage.
Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their
submissions via structured templates. This includes details about training, license,
limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose
asset is used.
• At submission time, remember to anonymize your assets (if applicable). You can either
create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper
include the full text of instructions given to participants and screenshots, if applicable, as
well as details about compensation (if any)?
Answer: [NA]
Justification:
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with
human subjects.
• Including this information in the supplemental material is fine, but if the main contribu-
tion of the paper involves human subjects, then as much detail as possible should be
included in the main paper.
• According to the NeurIPS Code of Ethics, workers involved in data collection, curation,
or other labor should be paid at least the minimum wage in the country of the data
collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human
Subjects

Question: Does the paper describe potential risks incurred by study participants, whether
such risks were disclosed to the subjects, and whether Institutional Review Board (IRB)
approvals (or an equivalent approval/review based on the requirements of your country or
institution) were obtained?
Answer: [NA]
Justification:
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with
human subjects.
• Depending on the country in which research is conducted, IRB approval (or equivalent)
may be required for any human subjects research. If you obtained IRB approval, you
should clearly state this in the paper.
• We recognize that the procedures for this may vary significantly between institutions
and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the
guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if
applicable), such as the institution conducting the review.
