
Knowledge-Based Systems 257 (2022) 109901


Towards counterfactual and contrastive explainability and transparency of DCNN image classifiers

Syed Ali Tariq (a,*), Tehseen Zia (a), Mubeen Ghafoor (a,b)

(a) Department of Computer Science, COMSATS University Islamabad, Pakistan
(b) School of Computer Science (SoCS), University of Lincoln, Lincoln, UK

Article info

Article history:
Received 10 June 2021
Received in revised form 5 September 2022
Accepted 13 September 2022
Available online 17 September 2022

Keywords:
Explainable AI
Interpretable DL
Counterfactual explanation
Contrastive explanation
Image classification
DCNN

Abstract

Explainability of deep convolutional neural networks (DCNNs) is an important research topic that tries to uncover the reasons behind a DCNN model's decisions and improve their understanding and reliability in high-risk environments. In this regard, we propose a novel method for generating interpretable counterfactual and contrastive explanations for DCNN models. The proposed method is model-intrusive: it probes the internal workings of a DCNN instead of altering the input image to generate explanations. Given an input image, we provide contrastive explanations by identifying the most important filters in the DCNN representing features and concepts that separate the model's decision between classifying the image to the original inferred class or some other specified alter class. On the other hand, we provide counterfactual explanations by specifying the minimal changes necessary in such filters so that a contrastive output is obtained. Using these identified filters and concepts, our method can provide contrastive and counterfactual reasons behind a model's decisions and makes the model more transparent. One of the interesting applications of this method is misclassification analysis, where we take the concepts identified for a particular input image and compare them with class-specific concepts to establish the validity of the model's decisions. The proposed method is compared with the state of the art and evaluated on the Caltech-UCSD Birds (CUB) 2011 dataset to show the usefulness of the explanations provided.

© 2022 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, deep convolutional neural networks (DCNNs) have achieved state-of-the-art performance in many computer vision applications such as medical imaging and diagnostics [1,2], biometrics identification [3], object detection [4], scene segmentation [5], image inpainting [6], etc. DCNNs are a type of deep learning (DL) method that automatically learns generalized representations from the input data useful for image classification, segmentation, or recognition tasks. Most of the current research on DCNNs is focused on architectural improvements [7–12]. Although this aspect of research is necessary, one essential aspect that is mostly overlooked, missing, or not focused upon while developing new CNNs is their explainability. Since DCNNs are trained in an end-to-end manner, their inner workings are not well understood, which makes them a black-box [13].

The explainability or transparency of DCNN models is extremely important for applications where incorrect or unjustifiable predictions can have a significant impact on the outcome [14–16]. This is especially true for security-critical applications that involve endangerment of human life or property, such as medical imaging and diagnostics [17,18], autonomous vehicles [19], military applications [20], etc. If a DCNN model's decision cannot be interpreted, or the model cannot explain how or why it made its decision, then using such a model in high-risk environments will carry some risks. For example, DCNNs are known to be affected by dataset bias [21], due to which they may rely on unrelated or out-of-context patterns to classify images. Zhang et al. [21] demonstrated an example where a CNN incorrectly relied on eye features to identify the 'lipstick' attribute of a face image. Similarly, if a DCNN model is trained to detect lung diseases from chest X-rays annotated with doctors' pen marks, it may learn to rely on them to make predictions. Another issue with DCNNs is that they are prone to adversarial attacks, where subtle changes to the input may lead to the CNN producing incorrect results [22]. Adversarial attacks pose threats to many security-critical applications, such as self-driving cars, where minor obstructions on traffic signs can cause incorrect decisions [23], or surveillance systems, where malicious users can cause harm [24]. These are some of the many aspects that make DCNNs un-trustworthy and require explainable AI techniques to identify their weaknesses and train robust, trustworthy, and transparent models [16,25,26].

* Corresponding author.
E-mail address: [email protected] (S.A. Tariq).


In the literature, several types of explainability or interpretability methods exist. At the top level, most explainable techniques can be broadly divided into two main categories: intrinsic/inherent interpretability or post-hoc interpretability [27]. Inherently interpretable techniques attempt to build explainability into the structure of the model itself so that the model is self-explanatory; decision trees are one example. Post-hoc interpretable techniques, on the other hand, usually consist of a separate explanation of a pre-trained DCNN black-box model. These techniques may use a second model to explain the black-box, either by probing the internal structure of the black-box or by altering the input. An example of post-hoc explainability is a visual explanation technique such as GradCAM [28]. Counterfactual and contrastive explanations are two popular types of post-hoc explanation methods. Contrastive explanations generally try to find the critical features in the input that lead the model to make its decision for the inferred class [29]. In counterfactual explanations, the goal is to alter the input features (pixels) such that the model changes its decision to some other counter class [30]. Such explanations are natural to humans since they mimic human thought processes. For example, a contrastive explanation can have the form "if some event X had not occurred, then event Y would not have occurred". Whereas using counterfactuals, we can ask questions such as "if X is changed to X', would Y' happen instead of Y?". Such types of explanations are human-friendly and easy to understand. If a DCNN model can provide such explanations, it can be considered a reliable or trustworthy model, and we can predict its behavior.

Recently, several counterfactual and contrastive explanation methods have been proposed [30,31,29,32,33]. These methods generally perturb the input pixels to alter the model prediction. One of the drawbacks of such an approach is that it generally does not identify semantically meaningful features or high-level concepts useful in explaining the model decisions. Another issue with pixel-based perturbation methods is that the search space for finding the optimal combination of pixels that can cause the network to change its prediction is large, which makes such methods computationally expensive [34]. A better approach would be to identify the major concepts that a model learns and relies on to make predictions. In a recent study, Akula et al. [35] tried to address this issue by using super-pixels to identify critical concepts that, when added to or removed from the input image, alter the model's decision. Although this method provides a useful way to generate explanations, it still operates on pixel data, which does not make the model fully transparent. In another work, concept activation vectors (CAV) [36] were introduced, which essentially measure the sensitivity of the model towards a particular high-level concept. The concepts can be learned from either training or user-provided data. However, this method only identifies whether a particular concept is important or not; it does not investigate whether the concept is actively used in the decision-making process for a particular input.

The problem with these works is that they are primarily non-intrusive. They only look at model behavior by altering the input and do not thoroughly investigate the internal workings or reasoning behind the model predictions. Such interpretability methods may not be aligned with how the model is actually making decisions. In this study, we propose a post-hoc explainability method that predictively identifies counterfactual and contrastive filters in a DCNN model. Although in this work we restrict the identification of filters to the top convolution layer only, the proposed method can be used to identify counterfactual and contrastive filters from any layer of a DCNN and can also be used to identify such filters in other networks. It has been shown that filters in the top convolution layer of a DCNN tend to learn more abstract, high-level features, concepts, and even whole objects as well [37–39]. We show that when enabled, disabled, or modified in a certain way, these filters can make the DCNN predict the input image either to the original inferred class or to some chosen alter class. These identified filters represent the critical concepts and features that the model learns and are chosen to maximize the model's prediction towards the specified class. Essentially, the proposed method identifies the most important features or concepts that separate the model's decisions between the two classes. In this regard, the proposed solution is a predictive counterfactual explanation (CFE) model trained on top of a pre-trained DCNN model to generate explanations. The proposed method has the following three main objectives:

1. Identify the minimum set of filters necessary to predict the input image to the inferred class. We call these filters minimum correct (MC) for the input image with respect to its inferred class.

2. Identify the minimum set of filters that, if they were additionally activated, would have altered the model's decision to some counter class c′. We call this set of filters minimum incorrect (MI) for the input image with respect to its counter class c′.

3. Using the identified MC and MI filters, provide contrastive and counterfactual explanations by highlighting the presence or absence of the features and concepts represented by these filters and demonstrate how they affect the model's decisions.

The primary motivation behind this work is to provide clear and easy-to-understand explanations that do not require a domain expert to use the system. The need for transparent DCNN models in high-risk environments is also a crucial factor. We believe that explanations provided by the proposed method could be useful for applications such as machine teaching or model debugging. We can utilize the MC or MI filters (by visualizing their receptive fields) to teach human users what features are essential for the inferred class or some alter class, respectively. Such explanations can help humans differentiate between similar-looking classes in challenging tasks such as fine-grained classification of birds or medical imaging diagnostic tasks such as brain tumor detection. The other application that can benefit from the proposed method is model debugging. Expert users can use the MC or MI filters identified by the proposed method to detect weak or faulty filters that may cause dataset bias or misclassifications. Such filters can be disabled permanently or retrained in a controlled manner to repair the model. This approach can also be used to detect and prevent adversarial attacks and establish a 'trust index' for the DCNN model's predictions.

Fig. 1. Explanation provided by the proposed approach. Our method identifies the most important minimum correct (MC) and minimum incorrect (MI) filters using
which the pre-trained model either predicts the input image to its original inferred class or some chosen alter class, respectively. The top-3 MC filters for the example
image classified as ‘‘Red-winged blackbird’’ show that the red spot on this bird’s wing is the most discriminating feature for it. Whereas, the top-3 MI filters of this
example for the ‘‘Bronzed cowbird’’ class show that if the filters corresponding to features such as bird’s red eyes and the blue-tinged feather were present in the
input, the model would have been more likely to predict this image as ‘‘Bronzed cowbird’’.

An example explanation provided by the proposed method is shown in Fig. 1. In this figure, our approach provides a contrastive explanation by identifying the most critical MC filters, using which the model maintains the prediction to the original inferred class, i.e., "Red-winged blackbird". Activation magnitudes of the 13 MC filters predicted by the CFE model are shown as a graph for the given input in Fig. 1. Using just these filters, the VGG-16 model still classifies the input to its original inferred class with 99.8% confidence. Visualization of the top-3 filters on different images of the inferred class shows the most important features associated with the respective filters for this class. The counterfactual explanation in Fig. 1 identifies the MI filters for some chosen alter class (usually a top-2 or top-3 class), which in this case is the "Bronzed cowbird". The CFE model predicts the minimum additive values to alter the activation magnitudes of these filters such that the input image is classified to the alter class, i.e., "Bronzed cowbird". The graph for the MI filters in Fig. 1 shows the modified activation magnitudes highlighted in red. The top-3 MI filters show the most important features for this alter class. Filter 15 activates the most on the red-colored eye of the Bronzed cowbird, while filter 158 activates the most on the blueish tinge on the bird's wings. These two filters identify the critical features that, if they were present in the input image, would have made the model more likely to classify the bird as "Bronzed cowbird" instead of the "Red-winged blackbird" class. We can say that these features separate the model's decision between the two classes.

In providing such types of explanations, the proposed method probes the internal working of a DCNN model. This makes the provided explanations more meaningful and trustworthy, and it provides insights into how a trained model makes its decisions, thus making the model more transparent. To the best of our knowledge, this is the first study that has investigated the importance of concepts learned by individual filters in terms of providing both counterfactual and contrastive explanations and how they contribute to the network's overall prediction. The rest of the paper is structured as follows. Section 2 discusses the related work. Section 3 presents the proposed approach. Section 4 presents the results and discussion, and Section 5 concludes the paper.

2. Related work

Since DCNNs are trained end-to-end, they learn complex hidden representations in their intermediate and end layers to map the input data to the respective target outputs. To understand what the network has learned, many post-hoc visual explanation methods in the literature try to investigate the relevance or importance between the input features (i.e., pixels of the image) and the model's predictions. These methods can be generalized into different categories such as back-propagation based methods [40,41], activation based methods [42,28], or perturbation based methods [43,26,44]. Back-propagation-based methods try to determine the importance of different input features/pixels by back-propagating the error in the output back towards the input image. An example of such a method is layer-wise relevance propagation (LRP) [41]. Activation-based methods provide visual explanations by generating heatmaps that identify the regions in the input image that were useful for classifying images to some target class. One of the most popular works of this type of explainability is the class activation map (CAM) [42] and its variant Grad-CAM [28]. Perturbation-based explanation methods involve modifying the input of the model by either preservation or deletion of different portions/pixels of the input image and observing how it affects the model's output [45].

The main drawback of such visual explainability methods is that they are primarily helpful in explaining the prediction with respect to the input. They are not suitable for understanding why or how the model has made that decision. Such explanations can often be misleading [16]. For example, if a model makes an incorrect decision, then the visual explainability method would still try to visualize the input region that is found important in making the incorrect decision. These methods do not explore the internal workings of the model, which remains a black-box.

Some of the shortcomings of visual explainability methods can be addressed by intrinsic or inherently interpretable methods. These models provide more useful interpretations than visual explanations, making them suitable for high-stakes applications. In recent years, researchers have developed inherently interpretable DL models for different computer vision applications that have produced promising results [46–50]. Liu et al. [46] proposed an interpretable object classification model that learns classification criteria based on hierarchical visual attributes that define the category or class of an image. The authors devised a two-stream architecture, where the first stream is a conventional CNN structure that learns image attributes from the training images. At the same time, the second stream takes as input the hierarchical category labels that define the input image and learns a linear combination of different attributes that define each category. The outputs from both streams are classified jointly to learn the hierarchical attribute structure present in images. One main drawback of this approach is that hierarchical category labels have to be provided for training, which may not be readily available. Chen et al. [50] developed a very useful interpretable model for image recognition applications based on network dissection [39]. The proposed model inspects different parts of the input image, which the authors call prototypes, and finds the nearest similar prototypical parts learned from the training dataset. The prototypical parts for different classes are learned by clustering together the semantically similar image patches encountered during training.

The authors have demonstrated that their interpretable model can achieve similar accuracy compared to standard CNN models while providing useful explanations. In another work, Zhang et al. [47] modified the top convolutional layers of standard CNN models to make them interpretable. The authors did so by introducing a loss for each filter that pushes it to learn some specific object part belonging to some specific image category. When the model is fully trained, the representations learned by the interpretable filters are disentangled. This property of interpretable filters is advantageous in providing relevant explanations and measuring the contribution of different object parts to the model's prediction [49].

Another promising area of research for interpretable models is network de-coupling [51–53]. Such approaches aim to understand and manipulate the inner workings of the CNN model to make it interpretable. Li et al. [51] proposed a method to de-couple the network architecture by dynamically selecting suitable filters in each layer that form a hierarchical calculation path through the network for each input image. The different filters selected in each layer form a de-coupled sub-architecture that corresponds to different semantic concepts. In another study, Saralajew et al. [54] proposed a classification-by-components network that extracts components (i.e., visual features that are representative of one or more classes) from input images and reasons over them, describing which components are either positively or negatively related to a particular class and which components are not important at all. Recently, Ghorbani et al. [25] proposed a method to quantify the contribution of each neuron or filter in a deep CNN towards the model's accuracy. The authors found that there is usually a small number of critical neurons that, if removed from the network, vastly decrease the overall performance of the DCNN.

Although interpretable DL models such as the ones discussed here help provide trustworthy explanations, they still have some drawbacks. These methods usually devise complex mechanisms to ensure interpretability, and developing them is significantly more challenging. Furthermore, these methods need deeper analysis to understand or decipher the explanations, which may require domain expertise to implement and use, making them inaccessible for non-expert users. For example, the techniques used to implement the ProtoPNet model [50] are not yet available in the standard DL libraries used for training CNNs [16]. This makes designing such networks costly, time-consuming, and problematic. Additionally, since inherently interpretable models must satisfy additional constraints to ensure interpretability, they usually demonstrate lower accuracy as compared to standard black-box models [47,46].

On the other hand, counterfactual explanation methods attempt to provide more human-friendly explanations and are easy to understand. Goyal et al. [30] generated counterfactual explanations by identifying regions in the input image that can be changed such that the network's decision is altered from the original class to some specified counter class. Similarly, Wang et al. [34] proposed a counterfactual explanation method that generates an attributive heatmap that is indicative of the predicted class but not of some counter class. In another work, Akula et al. [35] proposed counterfactual explanations based on semantic concepts that can be added to or removed from the input image to alter the model decisions. Although these methods help provide understandable and user-friendly explanations, they do not explore the internal workings of the network and do not make it transparent. Additionally, Goyal et al. [30] perform an exhaustive search for the pixels and features to modify in order to alter the model prediction. Such a method can become too complex and slow to generate explanations.

In this study, we tried to address some of the existing issues with DCNN interpretability methods and propose an explainability method that generates contrastive and counterfactual explanations by probing the internal network structure to make it more transparent. The proposed method identifies the minimum number of most essential filters from the top convolution layer of a pre-trained DCNN for a particular input image that, when enabled, disabled, or modified, alter the model's decision to some specified class. Our work is similar to the classification-by-components [54] approach in the sense that, instead of components, we identify the crucial filters corresponding to high-level concepts that the pre-trained DCNN model learns and relies on for the classification of images to a particular class. The proposed work is more straightforward and provides contrastive, easy-to-understand explanations that are natural to humans as compared to network de-coupling approaches [51,52]. The proposed work is also better than existing counterfactual explanation approaches [35,30,31] that operate on pixel data to generate explanations. In contrast, we probe the internal filters to generate more meaningful explanations that make the network more transparent.

3. Proposed methodology

Given a pre-trained DCNN model, M, and the dataset, D, consisting of images belonging to C distinct classes, our objective is to provide two types of explanations for each image xi ∈ D. Firstly, our model predicts the minimum set of filters in the top convolution layer of M necessary for M to maintain its prediction of image xi to the original inferred class (source class) ci ∈ C. We call these filters minimum correct (MC) for image xi to be classified to class ci, denoted as FMCi ∈ [0, 1]^{1×n}, where n is the number of filters in the top convolution layer of M. In the VGG-16 DCNN model, the number of filters in the top convolution layer is 512, i.e., n = 512. Values of '1' and '0' indicate whether the corresponding filter is predicted to be active or disabled, respectively.

Secondly, our model predicts the minimum set of filters that, if they were altered by a larger magnitude, would have resulted in the DCNN classifying the input to some target class ci′ ∈ C. We call these filters minimum incorrect (MI) for image xi with respect to the target class ci′, denoted as FMIi ∈ [R+]^{1×n}. Non-zero indexes in FMIi correspond to the MI filters, and the values at these indexes indicate the magnitude by which the original filter activations are altered to modify the DCNN's decision. The overall diagram depicting the proposed approach is shown in Fig. 2, which is described in the following sub-sections.
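Throughout this section, the pre-trained model M is queried for two quantities: the globally average-pooled feature vector gi over the top convolution layer and the inferred class ci. The sketch below is an illustrative PyTorch rendering of that probe, assuming M is split into a convolutional backbone and a classification head; the function name is ours, not the authors' code.

```python
import torch

@torch.no_grad()
def probe_model(backbone, classifier, image):
    """Return (g, inferred_class) for a single image.

    backbone:   frozen convolutional layers of M up to the top conv layer
    classifier: frozen fully-connected/softmax head h of M
    g:          1 x n vector of globally average-pooled top-conv feature maps
    """
    fmap = backbone(image.unsqueeze(0))     # (1, n, H, W) top-layer feature maps
    g = fmap.mean(dim=(2, 3))               # global average pooling -> (1, n)
    inferred = classifier(g).argmax(dim=1)  # source (inferred) class c_i
    return g, inferred
```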
4
S.A. Tariq, T. Zia and M. Ghafoor Knowledge-Based Systems 257 (2022) 109901

Fig. 2. Overall block diagram of the proposed counterfactual and contrastive explanation model. Given an input image, the contrastive and counterfactual filter
generation networks predict the MC and MI filter maps. MC filters are multiplied with the pre-classification output of the pre-trained network to disable all but
the important features using which the model is able to maintain prediction to the original inferred class. Similarly, MI filter map is used to alter the activation
magnitudes of the pre-trained model such that the model predicts the image to some alter class c ′ .

3.1. CFE model for MC filters

To achieve the first objective, we train the CFE model to predict MC filters from the pre-trained model M that explain each image with respect to its inferred class. The MC CFE model is designed with a partially similar architecture as M by sharing the feature extraction layers up to the top convolution layer of M, as shown in Fig. 2. Given an input image xi, model M generates two outputs: (1) the feature maps gi ∈ [R+]^{1×n} produced at the last convolution layer of M after the global average pooling layer, and (2) the source or inferred class ci. The MC CFE model takes as input the feature maps gi and generates a binary filter map FMCi corresponding to class ci. With these definitions, the MC CFE model can be represented as:

F_{MC_i} = \mathrm{CFE}_{MC}(g_i) = \mathrm{ReLU}_t(A(d_n(g_i))),   (1)

where gi represents the feature maps after the global average pooling layer of M(xi), dn is a dense layer with n units, A represents the sigmoid activation function, and ReLUt is a thresholded-ReLU layer with threshold t set to t = 0.5 that outputs the approximately¹ binarized MC filter map FMCi. The thresholded-ReLU function sets all values below the threshold to zero and keeps other values unchanged. The shared feature extraction layers between the CFE model and the pre-trained model M are kept frozen for the training of the CFE model. Additionally, all layers of M after the top convolution layer are frozen. Only the dense layer dn weights are updated during training of the CFE model. To generate explanations, the CFE_MC model predicts the MC filter matrix for the corresponding input xi. FMCi is multiplied with gi (Hadamard product) to disable all but the MC filters from the top layer of M. The DCNN model M makes the alter prediction with the disabled filters as ĉi:

\hat{c}_i = h(g_i \circ F_{MC_i}),   (2)

where h represents the classification (fully-connected and softmax) layers of M.

¹ FMCi is not exactly binary, but it is made close to binary as training progresses. We fully binarize it at inference time to ensure that no unintended scaling takes place on the activation magnitudes. If we hard binarize it during training using a fixed threshold, then the function does not remain differentiable and the model cannot learn weights.
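A minimal sketch of Eqs. (1)–(2) in the same assumed PyTorch setting is given below. The class name MCExplainerHead and the helper mc_prediction are ours, not the authors' released implementation, and hard binarization at inference time (see the footnote above) is omitted for brevity.

```python
import torch
import torch.nn as nn

class MCExplainerHead(nn.Module):
    """Sketch of the MC CFE head (Eq. (1)): a single dense layer over the GAP
    vector g, a sigmoid, and a thresholded ReLU that yields an approximately
    binary filter map F_MC. Only this dense layer is trainable."""
    def __init__(self, n_filters: int = 512, threshold: float = 0.5):
        super().__init__()
        self.dense = nn.Linear(n_filters, n_filters)   # d_n
        self.threshold = threshold

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.dense(g))               # A(d_n(g)) in [0, 1]
        # Thresholded ReLU: zero out values below t, keep the rest unchanged.
        return torch.where(a >= self.threshold, a, torch.zeros_like(a))

def mc_prediction(g, f_mc, classifier):
    """Eq. (2): Hadamard-mask the GAP features with F_MC and reuse the frozen
    classification layers h of the pre-trained model M."""
    return classifier(g * f_mc)
```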
The MC CFE model is trained to predict the optimal filter maps FMCi to reduce the loss between the predicted classes ĉi and the source class ci. The optimal MC filters FMCi are learned by minimizing the following three losses simultaneously: (1) the cross-entropy (CE) loss, LCE, for classifying each input image xi by the modified model M to the specified class c, (2) the sparsity loss, Ll1, which ensures FMCi is sparse so that minimal filters remain active, and (3) the negative logits loss, −Llogits, which ensures that the predicted sparse filters have a higher contribution towards the chosen class c:

L_{MC} = L_{CE}(\hat{c}_i, c_i) + \lambda L_{l1}(F_{MC_i}) - L_{logits},   (3)

where λ is the weight assigned to the sparsity loss. LCE(ĉi, ci) is computed using the output of the modified model M and the desired class ci for each training example i:

L_{CE}(\hat{c}_i, c_i) = -\frac{1}{m} \sum_{i=1}^{m} \left[ c_i \log \hat{c}_i + (1 - c_i) \log(1 - \hat{c}_i) \right].   (4)

The Ll1(FMCi) loss minimizes the sum of the activated filters, which pushes the CFE model to predict minimally sufficient filters:

L_{l1}(F_{MC_i}) = \sum_{i=1}^{m} \sum_{k=1}^{n} \| F_{MC_i}(k) \|,   (5)

where n is the number of filters in the top convolution layer of M.

The negative logits loss −Llogits is necessary for the proposed methodology to ensure that the sparse filters predicted by the MC CFE model contribute maximally towards the source class ci. This loss is applied on the logits (weighted sum) computed after disabling the filters using FMCi and before applying the final activation function to get M's output:

L_{logits}(F_{MC_i}) = -\sum_{k=1}^{n} \| (F_{MC_i}(k) \ast g_i(k)) \ast W_{k, c_i} \|,   (6)

where gi represents the feature maps after the global average pooling layer of model M, and W_{k,c_i} represents the pre-trained weights of model M connecting the GAP layer with the output layer's class ci. The negative sign of the loss ensures that the model chooses filters that have a higher weight for the desired class, so that their activation contributes more towards it and results in a larger logits score. The results section shows that if this loss is not included, the CFE model is more likely to predict less important filters that may still classify the images to the desired class but with lower confidence.

When all three of these losses are minimized, the CFE model learns to predict minimally sufficient or MC sparse filters, using which the inputs are classified to the source class ci. For example, if the pre-trained model M classified a given image xi to class ci, then to identify the MC filters, we train the CFE model with respect to the source class ci. This CFE model is then used to predict the most important FMCi filters necessary for classifying input xi to class ci. The procedure for MC CFE model training is summarized in Algorithm 1.

Algorithm 1: Procedure for training MC CFE model
Input: Image I, DCNN model M, target class c, dataset D
1. Train MC CFE model for target class c over the training dataset D:
   for each image x ∈ D do
       g, c = M(x)
       F_MC = ReLU_t(A(d_n(g)))
       ĉ = h(g ◦ F_MC)          ▷ alter prediction with just the MC filters enabled
       Minimize the MC loss of Eq. (3): L_MC = L_CE(ĉ, c) + λ L_l1(F_MC) − L_logits
   end for
2. Generate contrastive explanation using the MC CFE model for input image I:
   g, c = M(I)
   F_MC = ReLU_t(A(d_n(g)))
Output: F_MC   ▷ MC filters necessary to maintain prediction of I to inferred class c
progresses. We fully binarize it at inference time to ensure that no un-intended
scaling is taking place on the activation magnitudes. If we hard binarize it during
training using a fixed threshold, then the function does not remain differentiable To achieve our second objective of predicting MI filters, we
and model cannot learn weights. follow a similar methodology used for MC filters but with key

The MI filters are those filters that, if they were altered to have a higher magnitude, would have resulted in the model classifying the input to some other target class ci′ ∈ C, instead of the initially inferred class ci (or source class). For this purpose, we train the MI CFE model to predict MI filters from M that explain each image with respect to some target class ci′. The CFE model for MI filters is designed similarly to the CFE model for MC filters, but with minor changes. The MI CFE model shares the feature extraction layers of M up to the top convolution layer, as shown in Fig. 2. Given an input image xi, model M generates two outputs: (1) the feature maps gi ∈ [R+]^{1×n} produced at the last convolution layer of M after the global average pooling layer, and (2) the source or inferred class ci. The MI CFE model takes as input the feature maps gi and learns to generate a non-binary MI filter map FMIi corresponding to the target class ci′. This map is combined with the output of the GAP layer of M to modify the activation magnitudes, with the objective of classifying each input xi to the target class ci′. With these definitions, the MI CFE model can be represented as:

F_{MI_i} = \mathrm{CFE}_{MI}(g_i) = \mathrm{ReLU}(d_n(g_i)),   (7)

where gi represents the feature maps of M(xi) after the global average pooling layer, dn is a dense layer with n units, and ReLU is the ReLU activation function that produces the non-binary MI filter map FMIi.

The key difference between this equation and Eq. (1) is the absence of the sigmoid activation and the usage of standard ReLU instead of thresholded-ReLU. Similar to the MC CFE model, the feature extraction layers shared between the MI CFE model and the pre-trained model M are kept frozen during training of the CFE model. Only the dense layer dn weights are updated during training of the CFE model. To generate explanations, the CFE_MI model predicts the MI filter matrix for the corresponding input xi. FMIi is added to gi to alter the filters from the top layer of M after global average pooling. The DCNN model M makes the alter prediction with the altered filters as ĉi:

\hat{c}_i = h(g_i + F_{MI_i}),   (8)

where h represents the classification (fully-connected and softmax) layers of M.

The MI CFE model is trained to predict the optimal MI filter map FMIi to reduce the loss between the predicted classes ĉi and the target class ci′. The optimal MI filters, FMIi, are learned by minimizing the following two losses: (1) the cross-entropy (CE) loss, LCE, for classifying each input image xi by the modified model M to the target class ci′, and (2) the sparsity loss, Ll1, which ensures FMIi is sparse so that minimal filters are modified with minimal additive values such that the model M classifies each input to class ci′:

L_{MI} = L_{CE}(\hat{c}_i, c_i') + \lambda L_{l1}(F_{MI_i}, c_i').   (9)

This equation is similar to Eq. (3), with the difference that the logits loss is not used to find the MI filters. The logits loss is not necessary in this case, as our objective can be achieved with just the cross-entropy and sparsity losses. The first term of the loss function is the cross-entropy loss, using which the error between the modified model M's output and the target class ci′ is minimized; it is computed using Eq. (4). The second term minimizes the sum of the MI filter matrix FMIi, which pushes the CFE model to choose the least number of filters whose activation magnitude is increased minimally to predict each input xi to the target class ci′. This loss can be computed using Eq. (5) by replacing FMCi with FMIi. The procedure for MI CFE model training is summarized in Algorithm 2.

Algorithm 2: Procedure for training MI CFE model
Input: Image I, DCNN model M, target class c′, dataset D
1. Train MI CFE model for target class c′ over the training dataset D:
   for each image x ∈ D do
       g, c = M(x)
       F_MI = ReLU(d_n(g))
       ĉ = h(g + F_MI)          ▷ alter prediction with the updated MI filters
       Minimize the MI loss of Eq. (9): L_MI = L_CE(ĉ, c′) + λ L_l1(F_MI, c′)
   end for
2. Generate counterfactual explanation using the MI CFE model for input image I:
   g, c = M(I)
   F_MI = ReLU(d_n(g))
Output: F_MI   ▷ MI addition to filters necessary to alter prediction of I to target class c′
With both these CFE models for MC and MI filters, we can explain each decision of the pre-trained model M in terms of finding the minimum required critical filters that maintain the model's decision to the inferred class, or finding the minimum set of filters that, if altered with a higher magnitude, would have classified the input to ci′ instead of ci. We show the importance of these filters by highlighting the features that activate them the most. The results of the proposed methodology are presented in the following section.

4. Results

This section presents the results and discussion of the proposed counterfactual explanation (CFE) method. We evaluate the explanations generated by the proposed CFE method qualitatively and quantitatively. In the qualitative analysis, we provide visualizations of the proposed method and show how to interpret these explanations. We compare these visualizations with existing counterfactual and contrastive explanation methods. Additionally, we conduct a user-study to evaluate the usefulness of the explanations provided based on the Explanation Satisfaction (ES) qualitative metric [55].

The results section is structured as follows. In Section 4.1, we discuss the experimental setup, describing the dataset, the pre-trained model used for testing, and the training details of the proposed CFE model for the explanation of the pre-trained model. In Section 4.2, we provide visualization of how the CFE method works and how its output can be interpreted, and in Section 4.3, we qualitatively compare the CFE method with [28] and [34]. In Section 4.4, we present the quantitative evaluation of the proposed method, where we measure the impact of disabling the MC filters for different classes on the overall model accuracy and class recall. We also include an analysis of using different weights for the sparsity loss and measure the effectiveness of the logits loss. Finally, in Section 4.5, we compare our method with state-of-the-art explanation methods.

4.1. Experimental setup

For the evaluation of the proposed CFE method, we used the Caltech-UCSD Birds (CUB) 2011 [56] dataset. We train a VGG-16 [57] model on this dataset and train our CFE model to provide explanations for the trained model's decisions. The VGG-16 model was initially prepared by removing the dense classification layers and adding a GAP layer after the top convolution layer, followed by a dropout and the output softmax layer, as discussed in Section 3. The VGG-16 model was trained in two steps. First, we performed transfer learning to train the newly added output softmax layer with the stochastic gradient descent (SGD) optimizer using ImageNet [58] pre-trained weights. Transfer learning was performed for 50 epochs with a 0.001 learning rate, 0.9 momentum, 32 batch size, and 50% dropout, without data augmentation. In the second step, we fine-tuned all model layers for 150 epochs at a 0.0001 learning rate with standard data augmentation and kept all other parameters the same. The VGG-16 model achieved final training and testing accuracies of 99.0% and 69.5%, respectively.

4.1.1. Counterfactual explanation model training details

The CFE model provides contrastive and counterfactual explanations of the decisions of the pre-trained VGG-16 model by predicting the minimum correct (MC) and minimum incorrect (MI) filters for each decision of the model with respect to the original inferred class (source class) and some target alter class. The CFE models for MC and MI filter prediction are comprised of similar architectures as the pre-trained model, but with some differences, as discussed in Section 3. For the MC CFE model, the feature extraction layers are frozen. The output layer consists of the same number of units as the number of filters in the top convolution layer, with a sigmoid activation function followed by thresholded-ReLU. The weights of the feature extraction layers are shared with the pre-trained model being explained. The CFE model is trained for a given alter class by minimizing the three losses, namely, cross-entropy loss, sparsity loss, and logits loss, as discussed in Section 3. The MI CFE model follows a similar architecture, with the difference that the output dense layer after the top convolution layer is followed by a ReLU activation function instead of sigmoid activation, and there is no thresholded-ReLU layer. The MI CFE model is trained for a target class by minimizing the two losses, i.e., cross-entropy loss and sparsity loss, as discussed in Section 3. The MC and MI CFE models are trained using the SGD optimizer with a 0.001 learning rate, 0.9 momentum, and 32 batch size for 200 epochs. The weight for the sparsity loss is chosen as λ = 2 for the MC CFE models and λ = 1 for the MI CFE model.

4.2. Qualitative analysis

In this section, we qualitatively discuss the results of the proposed CFE approach by providing the visualization of the MC and MI filters, comparing explanations with existing methods, performing misclassification analysis, and conducting a user evaluation to assess the usefulness of different explanation methods.

4.2.1. Explanation visualization and interpretation

Fig. 3 shows the CFE model results for a sample image from the CUB dataset that was correctly predicted as "Red-winged blackbird" with 99.9% probability by the VGG-16 model. Our model highlights the MC and MI filters necessary for classifying the input image either to the source class or to the target class, respectively. The contrastive explanation in Fig. 3 identifies the most important MC filters, plotted as a graph of their activation magnitudes (y-axis) against filter number (x-axis). Using these filters, the model maintains the prediction of the input image to its inferred class, i.e., "Red-winged blackbird", with 99.8% probability.

We visualize the concepts represented by the top-3 filters (based on activation magnitude) by drawing the filter's receptive field (RF) [38] on the input image. The RF is the image-resolution feature map of the selected filter from the top convolution layer that allows us to understand where the filter pays most attention. The RFs for filters 295 and 399 in Fig. 3 show that they focus around the "red spot" of the bird, while filter 5 focuses on the belly of the bird. The counterfactual explanation in Fig. 3 identifies the MI filters with respect to the "Bronzed cowbird" class (top-3 predicted class), shown as a graph of filter activation magnitudes against filter number. The MI filters are highlighted in red. The CFE model predicts the minimum additive values to modify these MI filters such that the input image is classified to the target class, i.e., "Bronzed cowbird". The RF visualization of the top-3 MI filters shows the most important features for the target class. Filter 15 activates the most on the red-colored eye of the "Bronzed cowbird". In contrast, filter 158 activates the most on the blueish tinge on the bird's wings. These two filters identify the critical features that, if they were present in the input image, would have made the model more likely to classify the bird as "Bronzed cowbird" instead of "Red-winged blackbird". Filter 297, on the other hand, activates the most on the bird's black neck. However, it can be observed that this feature is common to both birds. Therefore, it is not enough to discriminate between the two classes, and it is not activated with a higher magnitude.

To show that, had the features represented by filters 15 and 158 in Fig. 3 been present in the input image, the model would have been more likely to classify the image as "Bronzed cowbird", we modify the input image to artificially introduce these features, as shown in Fig. 4. In Fig. 4(a), artificially changing the color of the eye to match the other bird's eye color alone was not enough to change the model prediction. However, if we introduce a patch of bluish tinge to the bird's wing, as shown in Fig. 4(b), the model changes its prediction to "Bronzed cowbird" with a probability of 53%. In Fig. 4(c), we modified both the eye color and the wing by adding the bluish tinge, which resulted in the model classifying the image as "Bronzed cowbird" instead of "Red-winged blackbird" with 82% confidence, thus highlighting the importance of the MI filters predicted by the CFE model for the target class. We can say that these features separate the model's decision between the two classes.
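The receptive-field visualizations referred to in this subsection follow [38]. A much cruder stand-in, which is not the procedure of [38] but is often enough for a quick inspection of where a top-layer filter looks, is to upsample that filter's activation map to image resolution and overlay it on the input (or on images of the inferred or alter class). The helper below is our own hedged sketch in the same assumed PyTorch setting.

```python
import torch
import torch.nn.functional as F

def filter_attention_map(backbone, image, filter_idx, image_size=224):
    """Rough approximation of an RF visualization: take the chosen top-layer
    filter's activation map, normalize it, and upsample it to image resolution
    so it can be overlaid (e.g., with matplotlib) on the input image."""
    with torch.no_grad():
        fmap = backbone(image.unsqueeze(0))            # (1, 512, H, W)
    act = fmap[0, filter_idx]                          # (H, W) single filter
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)
    heat = F.interpolate(act[None, None], size=(image_size, image_size),
                         mode="bilinear", align_corners=False)[0, 0]
    return heat
```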

Fig. 3. Contrastive and counterfactual explanations for a sample image from CUB dataset. Contrastive explanation highlights the top-3 MC filters representing features
important for the inferred class. Counterfactual explanation highlights top-3 MI filters that represent features required for the alter class.

Fig. 4. Artificially introducing the most important features relevant to the alter class identified in Fig. 3 (Bronzed cowbird). (a) Adding eye color. (b) Adding wing
pattern. (c) Adding both eye color and wing pattern.

To further illustrate the effectiveness of the proposed method, a visual comparison of explanations provided by GradCAM [28], SCOUT [34], and the proposed CFE method is shown in Fig. 5. Fig. 5 shows explanations for a test image of the "Bronzed cowbird" class that the DCNN correctly classified. Fig. 5a shows the GradCAM explanation for the inferred class, highlighting the head, neck, and belly regions as the important features for that class. However, in Fig. 5b, GradCAM fails to provide useful information regarding the alter class since it only highlights the head part of the bird, which is a similar feature between the two classes. In Fig. 5c, SCOUT fails to show a meaningful explanation for why the bird is a "Bronzed cowbird" and not a "Red-winged blackbird" since it highlights indistinguishable features. Another drawback of methods like SCOUT and GradCAM is that they provide explanations for different inferred and target classes based on the same input image. If we want to analyze why a model classified an image to class A and not to class B, we must switch the target class for the same input image to see where the model is looking for both classes on the same input. This limits the capability of methods like GradCAM when the input image has only one object and produces less meaningful explanations. Our proposed CFE model, on the other hand, does not have this limitation. The proposed CFE method identifies filters relevant to the inferred class, as well as filters that, if they were active, would have made the model likely to classify the input to another target class. We can visualize what features the filters represent by looking at their receptive fields on any random image from the inferred or target class. In this regard, Figs. 5d and 5e of the proposed CFE method show clear explanations. The contrastive explanation in Fig. 5d shows that the bird's head and beak along with the eye are particularly discriminating features for it, highlighted by filters 147 and 158, respectively. The counterfactual explanation in Fig. 5e shows that the red spot on the wings is the main feature absent from the "Bronzed cowbird" that would have been important for the image to be classified to the alter class. Together, both the contrastive and counterfactual explanations effectively highlight the DCNN model's reasoning for the classification example. Additional visualizations and comparisons are provided as supplementary material.

4.2.2. Misclassification analysis

One of the useful applications of the proposed explanation method is misclassification analysis. Since our model predicts the MC and MI filters for an input sample with respect to different classes, it is possible to analyze why the model has classified an input to the inferred class and why not to some other, maybe top-2 or top-3, class. In Fig. 6, we show a misclassification case where an image of class "Red-winged blackbird" was incorrectly classified as "Myrtle warbler" with a probability of 92%. In Fig. 6(a), we show the top-3 MC filters for the inferred class with the RF drawn on the input image for visualization. As it turns out, the most highly activated filter (filter 504) focuses on the background region around the tree's branches. In Fig. 6(b), we show that filter 504 is associated with the incorrectly inferred class by drawing its RF on the top-3 images from the inferred class that activate it the most. It can be seen that this filter mostly activates on the background branches, making it an unreliable filter, and this decision can be treated as an untrustworthy or misclassification case.

On the other hand, the CFE model predicted 13 MI filters from the original filter activations with respect to the actual class, using which the model correctly predicted the image as "Red-winged blackbird" with a probability of 91%. RF visualizations of the top-3 of these 13 MI filters are shown in Fig. 6(c), with the RF drawn on images from the true class that highly activate them. Filters 44 and 131 in Fig. 6(c) suggest that the input image should have a red-colored spot on the wing of the bird to classify it correctly to the true class. However, the wing of this particular bird has an orange-colored spot instead of a red one. If we manually change the color of this spot to red or replace it with a similar-sized red spot from another bird of this class, then the model can correctly classify it to the actual class, as shown in Fig. 7(b). Introducing the red spot on the bird's wing changed the model's prediction to the true class with 96% probability.

Fig. 5. Explanation comparison of GradCAM [28], SCOUT [34], and proposed CFE method for a query image that was classified as ‘‘Bronzed cowbird’’, while the alter
class is set as ‘‘Red-winged blackbird’’. (a) GradCAM explanation for the inferred class (Bronzed cowbird). (b) GradCAM explanation for the alter class. (c) SCOUT
explanation for why the image is classified to inferred class and not to alter class. (d) Contrastive explanation of the proposed CFE method. (e) Counterfactual
explanation of the proposed CFE method.

In summary, we can say that the original image in Fig. 7(a) was incorrectly classified as "Myrtle warbler" because the model paid more attention to the tree branches in the background, which were common in images of the "Myrtle warbler" bird. Moreover, the bird does not have a proper red color spot, which is important for being classified as a "Red-winged blackbird". These two issues contributed mainly to the incorrect classification. If the image had a proper red-colored spot, which is the defining characteristic of the true class of this image, or if the background did not have such branches, then the model would have been more likely to classify the image correctly. Such types of explanations provided by our model help to uncover the decision process behind pre-trained black-box models, making them more transparent and improving trust in their use.

In contrast, the explanations provided by GradCAM [28] and SCOUT [34] for this misclassification case are shown in Fig. 8. Figs. 8a and 8b show the GradCAM explanations for the incorrect inferred class (Myrtle warbler) and the true class (Red-winged blackbird), respectively. It can be observed that these explanations highlight similar regions as evidence for the target class in each case, which fails to provide meaningful reasons for the model's decision. In Fig. 8c, the SCOUT explanation highlights the red/orange spot as the region that separates the model's decision in classifying the input as "Myrtle warbler" and not "Red-winged blackbird". This region, however, is not strong evidence for the "Myrtle warbler" class since it is important for the true class, "Red-winged blackbird". This explanation fails to provide reasons for why the model incorrectly classified the image.

4.3. User evaluation

To show the effectiveness of the explanations provided by the proposed CFE method, we conducted a user-study to qualitatively evaluate different counterfactual and contrastive explanation methods, including [28,34], in terms of the Explanation Satisfaction (ES) [55] qualitative metric. This metric was previously used by [35] to perform a user-study, and we have followed their protocol closely in this work as well. The ES metric measures the user's satisfaction at achieving an understanding of the DCNN model based on the different explanations provided, in terms of metrics such as usefulness, understandability, and confidence [55].

Fig. 6. Identifying the erroneous filters that result in misclassification. (a) Input image of class ‘‘Red-winged blackbird’’ with RFs of top-3 MC filters involved in
incorrect classification of the image as ‘‘Myrtle warbler’’. (b) Top-3 images from the inferred class that activate filter 504 the most. (c) Top-3 MI filters with RF
visualization on images from the true class that activate them the most.

Fig. 7. Modifying the misclassified input image according to features identified by MI filters. (a) Image of class ‘‘Red-winged blackbird’’ misclassified as ‘‘Myrtle
warbler’’ with 92% probability. (b) Modified image that is correctly classified as ‘‘Red-winged blackbird’’ with 96% probability.

The user-study is conducted by creating two groups of human subjects: experts and non-experts. The non-expert group consists of 30 subjects with a limited understanding of the computer vision field, whereas the expert group consists of 10 subjects who routinely train and evaluate DCNN models. Subjects in each group go through a familiarization phase where they are first shown a sample query image and the DCNN model's classification decision for that image. The users are then shown various images from the predicted class and some alter class selected from the top-2 or top-3 classes, to help them understand the differences between the two classes. The users are then shown explanations generated by [28,34] and the proposed method for why the model classified the image to the inferred class and not to the alter class, along with a brief description of how to interpret the generated explanations.

Fig. 8. Misclassification case explanations of GradCAM [28] and SCOUT [34]. (a) GradCAM explanation for the incorrect inferred class (Myrtle warbler). (b) GradCAM
explanation for the true class (Red-winged blackbird). (c) SCOUT explanation for why the image is classified to inferred class and not to true class.

Table 1
Qualitative evaluation based on the Explanation Satisfaction (±std) metric for GradCAM, SCOUT, and the proposed CFE methods.

Explanation         Understandability   Usefulness    Confidence

Non-expert users
Grad-CAM [28]       3.7 (±1.1)          3.6 (±1.1)    3.6 (±1.1)
SCOUT [34]          3.4 (±1.0)          3.1 (±1.1)    3.3 (±1.0)
CFE (proposed)      3.9 (±0.9)          3.8 (±0.9)    4.0 (±0.9)

Expert users
Grad-CAM [28]       3.7 (±1.2)          3.5 (±1.1)    3.6 (±1.2)
SCOUT [34]          3.0 (±1.4)          2.8 (±1.5)    2.8 (±1.4)
CFE (proposed)      4.5 (±0.6)          4.2 (±0.8)    4.4 (±0.6)
The findings in Table 1 highlight that the users from both the expert and non-expert groups found the explanations provided by the proposed method to be beneficial and understandable. Particularly for the expert group, there is a significant difference in the level of satisfaction achieved compared to the non-expert group. A likely reason is that the proposed method probes the internal working of the DCNN to provide an explanation in the form of filters and their visualizations, which is a concept easily understandable by expert users. For non-expert users, the explanations provided by the GradCAM method may appear to be sufficient since they can be visually pleasing, and for this reason there is a smaller gap between the understandability scores of GradCAM and the proposed method for non-expert users.

Furthermore, as evident from the high scores given to our approach by expert users, our approach caters to the needs of DCNN model developers who already have some understanding of how DCNNs work. For these users, our approach identifies which filters are critical for the model's decision-making process and shows what concepts and features these filters represent. These users can make informed decisions regarding the model's weaknesses and trustworthiness and judge its overall performance in real-world scenarios. This leads to the success of the proposed explanation method in terms of understandability, usefulness, and confidence.

4.4. Quantitative analysis

In this section, we quantitatively discuss the results of the proposed CFE methodology in terms of finding the most commonly activated MC filters predicted by the CFE model for explaining the pre-trained VGG-16 model with respect to different classes. We show the importance of these MC filters by demonstrating the effect of disabling them on the class recall metric compared to the effect on the overall model accuracy. Furthermore, we also discuss the effect of different training parameters on the predicted MC filters and the CFE model accuracy for explaining the pre-trained VGG-16 model.

4.4.1. Activated filter statistics

First, we discuss the filter activation statistics of the MC filters predicted by the CFE model for explaining the pre-trained VGG-16 model with respect to different classes. For a given class, we use the CFE model to predict MC filters for all test images of that class. We then accumulate the predicted MC filters to find the number of times each filter is predicted and compute the normalized activation magnitude for those filters. Fig. 9 shows the activated filter analysis of MC filters for the ''Red-winged blackbird'' class. There are 5794 test images in the CUB dataset, of which 30 belong to this class. Fig. 9(a) shows that filters 44, 147, and 364 are predicted as part of the MC filters for nearly all of the test images. Fig. 9(b) shows the normalized activation magnitudes of these filters. It can be seen that filters 147 and 364 have, on average, the highest activation magnitude among these filters. In other words, these filters are globally the most important filters for this class and represent crucial features/concepts relevant to this class. We show the importance of these globally significant filters for the ''Red-winged blackbird'' class by disabling them in the pre-trained VGG-16 model and reporting the decrease in the model's ability to accurately classify images of this class compared to the overall accuracy of the model. Interestingly, disabling all 31 MC filters predicted by the CFE model for the ''Red-winged blackbird'' class reduced the class recall from 93.3% to 30%, whereas the overall model accuracy decreased by less than 2%. Similarly, we carry out this analysis for a few other classes and summarize the findings in Table 2. In these cases, disabling around 31–44 globally critical MC filters results in a significant decrease in the class recall, whereas the overall model accuracy is reduced by just 2%–3%. On the contrary, randomly disabling 40 filters has a negligible effect on class recall. This analysis shows that the MC filters predicted by the CFE model for a particular class represent features exclusive to that class, and disabling them affects the overall model accuracy minimally while significantly reducing the class recall score.
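To make the accumulation step above concrete, the following sketch shows how the per-filter counts in Fig. 9(a) and the normalized magnitudes in Fig. 9(b) can be computed; the `predict_mc_filters` helper is hypothetical and stands in for the CFE model's prediction step, so this is an illustration rather than the released code.

```python
import numpy as np

NUM_FILTERS = 512  # e.g., the last convolutional layer of VGG-16

def mc_filter_statistics(cfe_model, class_images):
    """Accumulate MC-filter prediction counts and normalized activation
    magnitudes over all test images of one class (illustrative sketch)."""
    counts = np.zeros(NUM_FILTERS)
    magnitudes = np.zeros(NUM_FILTERS)
    for image in class_images:
        # Hypothetical helper: indices of the predicted MC filters and their
        # activation magnitudes for this image.
        filter_ids, activations = cfe_model.predict_mc_filters(image)
        counts[filter_ids] += 1
        magnitudes[filter_ids] += activations
    # Average magnitude per selection, then normalize to [0, 1].
    selected = counts > 0
    magnitudes[selected] /= counts[selected]
    if magnitudes.max() > 0:
        magnitudes /= magnitudes.max()
    return counts, magnitudes
```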
Fig. 9. Statistical analysis of the MC filters predicted by the CFE model for explaining the VGG-16 model with respect to the ''Red-winged blackbird'' class. (a) Filter activation count of MC filters over all test images of class ''Red-winged blackbird''; the x-axis represents filter numbers and the y-axis the filter activation count. (b) Normalized filter activation magnitude for the MC filters.
Table 2
Effect of disabling global MC filters on the class recall metric for different classes.

Class                          No. of global MC    Class recall (%)                          Model acc.
                               filters disabled    Orig.   Rand. disabled   MC disabled      (Orig. = 69.5%)
Red-winged blackbird           31                  93.3    93.3             30.0             67.6
Bronzed cowbird                43                  73.3    73.3             3.3              67.4
American redstart              33                  80.0    76.7             23.3             66.0
Nelson sharp-tailed sparrow    44                  53.3    56.7             0.0              67.7
Myrtle warbler                 37                  53.3    40.0             6.7              67.7
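The ablation reported in Table 2 amounts to zeroing the outputs of the selected filters and re-evaluating the model. A minimal PyTorch sketch of this idea is given below; the hook-based masking and the choice of `vgg16.features[28]` as the last convolutional layer are our illustrative assumptions, not the authors' exact implementation.

```python
import torch

def disable_filters(conv_layer, filter_ids):
    """Zero out the output channels of the given filters via a forward hook."""
    def hook(module, inputs, output):
        output[:, filter_ids, :, :] = 0.0
        return output
    return conv_layer.register_forward_hook(hook)

@torch.no_grad()
def class_recall(model, loader, class_id):
    """Recall for one class: correctly classified / total images of that class."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        mask = labels == class_id
        correct += (preds[mask] == class_id).sum().item()
        total += mask.sum().item()
    return correct / max(total, 1)

# Example usage (names assumed): measure the recall drop caused by the
# globally important MC filters, then remove the hook.
# handle = disable_filters(vgg16.features[28], mc_filter_ids)
# recall_disabled = class_recall(vgg16, test_loader, class_id)
# handle.remove()
```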
4.4.2. Trade-off between CFE model accuracy and predicted filter sparsity

In this section, we discuss the effect on CFE model training and testing accuracy of varying the losses and training parameters, and we also report the average number of MC filters predicted by the CFE model with these changes. Table 3 presents the effect of different weights assigned to the sparsity loss in Eq. (3) for explaining the VGG-16 model with respect to the ''Red-winged blackbird'' class using MC filters. It can be seen that as the sparsity loss weight is increased from λ = 1 to λ = 4, the average number of filters predicted for classifying all train or test images in the CUB dataset to the ''Red-winged blackbird'' class decreases from 20.5 and 21.9 to 10.1 and 11.4, respectively. Although fewer predicted filters are desirable, this comes at the cost of a higher cross-entropy loss that results in low confidence predictions.

Table 3
Sparsity loss analysis for the MC CFE model (VGG-16 CFE) for the ''Red-winged blackbird'' class.

            λ    Accuracy    CE loss    L1 loss    Filters
Training    1    99.8%       0.246      0.594      20.5
            2    99.4%       0.439      0.421      14.5
            4    98.5%       0.839      0.293      10.1
Testing     1    99.3%       0.341      0.618      21.9
            2    99.1%       0.568      0.449      16.0
            4    98.1%       0.973      0.317      11.4

Table 4 presents a similar analysis, where we show the effect of training the MC CFE model with and without the logits loss. Without the logits loss, the CFE model suffered higher training and testing losses but slightly better accuracies. This means that the sparse filters predicted by the CFE model had a lower impact or were less critical towards the specified class, leading to low confidence predictions. With the logits loss, on the other hand, the CFE model predicted on average 0.8 and 1.1 more filters for the training and testing sets, respectively, resulting in a lower CE loss but slightly decreased accuracy. However, the predicted filters are more likely to be relevant to the specified alter class.

Table 4
MC CFE model training analysis with and without logits loss using a fixed λ = 2 for the ''Red-winged blackbird'' class.

                      Accuracy    CE loss    L1 loss    Logits loss    Filters
Training
With logits loss      99.4%       0.439      0.421      −0.313         14.5
W/o logits loss       99.5%       0.529      0.395      −0.275         13.7
Testing
With logits loss      99.1%       0.568      0.449      −0.299         16.0
W/o logits loss       99.2%       0.678      0.418      −0.262         14.9
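The exact training objective is given by Eq. (3); the sketch below only illustrates how the three quantities reported in Tables 3 and 4 (cross-entropy, the λ-weighted L1 sparsity term over the filter gates, and an optional logits term) could be combined, and it is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cfe_loss(class_logits, target_class, filter_gates, specified_class_logit,
             lam=2.0, use_logits_loss=True):
    """Illustrative combination of the loss terms reported in Tables 3 and 4.

    class_logits          : model output when only the gated filters are active
    filter_gates          : per-filter gate values in [0, 1]; their L1 norm
                            encourages predicting few (sparse) MC filters
    specified_class_logit : logit of the specified class; the (negative) logits
                            term keeps the selected filters strongly relevant
    """
    ce = F.cross_entropy(class_logits, target_class)
    l1 = filter_gates.abs().mean()                 # sparsity term, weighted by lambda
    loss = ce + lam * l1
    if use_logits_loss:
        loss = loss - specified_class_logit.mean() # reported as a negative value in Table 4
    return loss
```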
Table 5
Quantitative comparison with the state-of-the-art (VGG-16).

                      Beginner              Advanced
                      Recall    Prec.       Recall    Prec.
GradCAM [28]          0.03      0.80        0.07      0.41
Wang et al. [34]      0.02      0.72        0.08      0.37
Proposed MC CFE       0.05      0.78        0.10      0.41

4.5. Comparison with the state-of-the-art

Quantitative evaluation and comparison of state-of-the-art counterfactual explanation models is challenging as ground truths are unavailable. Recent works have mostly provided qualitative or human-based evaluations [30,35]. To address the lack of quantitative comparisons, Wang et al. [34] proposed a way to synthetically generate ground truths based on the part and attribute annotations present in the CUB dataset [56]. To synthesize the ground truths, the authors identified a list of parts that distinguish each class pair based on the distribution of attributes present in the images. The authors used the recall and precision metrics to quantify the performance of their counterfactual explanations. For a given pair of predicted and counterfactual classes for an input image, recall is computed as the ratio of the ground truth parts lying within the explanation region to the total number of ground truth parts [34]. Similarly, precision is computed as the ratio of the ground truth parts to the total number of annotated parts lying in the explanation region.
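Under these definitions, both metrics reduce to simple counts over annotated part locations. The sketch below assumes each part is given as an (x, y) keypoint and the explanation is a binary mask; this is our reading of the metric, not the evaluation code of [34].

```python
import numpy as np

def explanation_recall_precision(explanation_mask, gt_parts, all_parts):
    """explanation_mask : H x W boolean array marking the explanation region
    gt_parts  : (x, y) keypoints of the ground-truth (class-discriminative) parts
    all_parts : (x, y) keypoints of all annotated parts visible in the image"""
    def inside(parts):
        return sum(1 for x, y in parts if explanation_mask[int(y), int(x)])

    gt_inside = inside(gt_parts)     # ground-truth parts covered by the explanation
    all_inside = inside(all_parts)   # all annotated parts covered by the explanation
    recall = gt_inside / max(len(gt_parts), 1)
    precision = gt_inside / max(all_inside, 1)
    return recall, precision
```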
Based on these metrics, we present a comparison of the proposed MC CFE model with Wang et al.'s [34] counterfactual explanations for images belonging to five chosen classes, shown in Table 5, for the same pre-trained VGG-16 model. We followed a similar evaluation strategy as described by Wang et al. [34] to explain model predictions with respect to counterfactual classes chosen to simulate beginner and advanced users. For beginner users, the counterfactual class is chosen randomly, whereas for advanced users, the counterfactual class is chosen as the top-2 predicted class. We have also included a comparison with the default GradCAM [28] based visual explanation method. Beginner user explanations are easier to generate as there is a large difference between the predicted and random counterfactual classes. Due to this, there are more distinct ground truth parts, leading to lower recall but higher precision values. For advanced users, on the other hand, the explanations are harder since the predicted and counterfactual class pairs are closely related to each other, leading to fewer distinguishable parts. This leads to higher recall and lower precision scores. From Table 5, it can be seen that the recall and precision scores of the proposed CFE method are consistently better than those of Wang et al.'s [34] method. A higher recall score indicates that the proposed explanations can identify more of the distinct ground truth parts that separate the predicted and counterfactual classes compared to other methods. The GradCAM visual explanation method produces higher precision scores because it generally produces a larger explanation area that encompasses more parts in the image, which leads to a higher ratio of ground truth parts in the explanation region. One drawback of the comparison method developed by Wang et al. [34] is that the synthesized ground truths for different class pairs are not always accurate. For example, the distinct ground truth parts for the class pair ''Red-winged blackbird'' and ''Bronzed cowbird'' (shown in Fig. 3) as generated by [34] consist of just the differently colored eyes of the birds. However, these classes have additional distinguishing features of wing color and pattern, which are not captured by [34]'s method. The same problem holds for a few other classes as well. There is a need to develop clear and robust evaluation metrics for counterfactual explanations, which the research community should look into in the future.
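For completeness, the two simulated user settings described above differ only in how the counterfactual (alter) class is selected for each test image; a minimal sketch of this selection, assuming class probabilities from the classifier are available, is:

```python
import random
import torch

def choose_counterfactual_class(probs, predicted_class, mode="beginner"):
    """probs: 1-D tensor of class probabilities for one image."""
    if mode == "beginner":
        # Any class other than the prediction, chosen at random.
        candidates = [c for c in range(probs.numel()) if c != predicted_class]
        return random.choice(candidates)
    # "advanced": the second most probable (top-2) class.
    return torch.topk(probs, k=2).indices[1].item()
```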
4.5.1. Discussion on other related works

This section discusses the quality of the explanations provided by the proposed CFE method and other existing works with a similar objective. Goyal et al. [30] proposed a counterfactual explanation method for images that repetitively alters a region in the input image until the model predicts class c′ instead of c. This method can identify the important features relevant to different classes. However, it provides visual explanations only and does not explore the internal workings of the DCNN model. Our CFE model, on the other hand, identifies the filters and corresponding high-level concepts associated with different classes to give contrastive and counterfactual explanations. These filters are modified to alter model decisions and establish an understanding of the internal working of the model, thus improving transparency and reliability. Dhurandhar et al. [29] proposed a contrastive explanation method that identifies minimally sufficient features and those features whose absence is essential to maintain the original decision. The objective of this work is similar to ours, but it operates at the pixel level and does not identify high-level, semantically meaningful features, unlike our method. Akula et al. [35] also proposed a counterfactual explanation methodology that identifies meaningful concepts using super-pixels that are added to or removed from images to provide explanations. Although this method provides valuable explanations, they are based only on globally identified concepts rather than features identified locally from the image, and such global concepts may not be actively used in the decision-making process for a particular image. The proposed CFE model, on the other hand, is a predictive model that locally explains each image based on counterfactual and contrastive explanations. We identify the most important features or concepts (filters) that are actively used in the decision-making process for an image and that separate the model's decisions between classifying the image either to the inferred class or to the target class. In another recent study, Ghorbani et al. [25] proposed a method that quantifies the contribution of each filter in a DCNN towards the model's performance. This method identifies critical filters that considerably lower the model's performance when removed from the model. Although this method probes the internal working of a DCNN, it does not provide contrastive or counterfactual explanations. The proposed CFE model provides such explanations and identifies the critical filters that are important towards the different classes. We have shown that the removal of these filters considerably lowers the model's class recall without affecting the overall model performance.

5. Conclusion

This paper introduced a novel method for explaining deep CNN models based on counterfactual and contrastive explanations. The proposed method probes the internal workings of DCNNs to predict the most important filters that affect a pre-trained model's predictions in two ways. Firstly, we identified the minimum correct (MC) filters for a given input image: if they were the only ones active, the model would still classify the input to the original inferred class. Secondly, we identified the minimum incorrect (MI) filters: if they were altered in a certain way, the model would have classified the image to some alter class instead of the original inferred class. We showed that these filters represent the critical features or concepts that the DCNN model learns to detect and rely on for making decisions. We discussed the effect of enabling or disabling these filters/features and also discussed misclassification detection as an application of the proposed methodology. With these explanations, we showed the reasoning behind the model's decisions and improved model understanding and reliability.

In the future, we intend to improve the evaluation metrics for better performance comparison of counterfactual explanation models. Adversarial attack detection, model debugging, and machine teaching are possible applications of the proposed methodology that can be explored in the future.

CRediT authorship contribution statement

Syed Ali Tariq: Investigation, Methodology, Software, Writing – original draft, Visualization, Validation. Tehseen Zia: Conceptualization, Writing – review & editing, Supervision. Mubeen Ghafoor: Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.knosys.2022.109901.

References

[1] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, J. Liu, CE-Net: Context encoder network for 2D medical image segmentation, IEEE Trans. Med. Imaging 38 (10) (2019) 2281–2292.
[2] P.M. Shakeel, M.A. Burhanuddin, M.I. Desa, Lung cancer detection from CT image using improved profuse clustering and deep learning instantaneously trained neural networks, Measurement 145 (2019) 702–712.
[3] H. Liu, X. Zhu, Z. Lei, S.Z. Li, Adaptiveface: Adaptive margin and sampling for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11947–11956.
[4] X. Wang, A. Shrivastava, A. Gupta, A-fast-RCNN: Hard positive generation via adversary for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2606–2615.
[5] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
[6] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T.S. Huang, Free-form image inpainting with gated convolution, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4471–4480.
[7] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[8] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, J. Feng, Dual path networks, in: Advances in Neural Information Processing Systems, 2017, pp. 4467–4475.
[9] M. Tan, Q.V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, 2019, arXiv preprint arXiv:1905.11946.
[10] H. Touvron, A. Vedaldi, M. Douze, H. Jégou, Fixing the train-test resolution discrepancy, in: Advances in Neural Information Processing Systems, 2019, pp. 8252–8262.
[11] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, Z. Zhang, H. Lin, Y. Sun, T. He, J. Mueller, R. Manmatha, et al., Resnest: Split-attention networks, 2020, arXiv preprint arXiv:2004.08955.
[12] R. Mohan, A. Valada, Efficientps: Efficient panoptic segmentation, 2020, arXiv preprint arXiv:2004.02307.
[13] A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Inf. Fusion 58 (2020) 82–115.
[14] W. Samek, T. Wiegand, K.-R. Müller, Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models, 2017, arXiv preprint arXiv:1708.08296.
[15] R. Goebel, A. Chander, K. Holzinger, F. Lecue, Z. Akata, S. Stumpf, P. Kieseberg, A. Holzinger, Explainable AI: The new 42? in: International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Springer, 2018, pp. 295–303.
[16] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nat. Mach. Intell. 1 (5) (2019) 206–215.
[17] E. Tjoa, C. Guan, A survey on explainable artificial intelligence (XAI): Toward medical XAI, IEEE Trans. Neural Netw. Learn. Syst. (2020).
[18] A. Holzinger, C. Biemann, C.S. Pattichis, D.B. Kell, What do we need to build explainable AI systems for the medical domain? 2017, arXiv preprint arXiv:1712.09923.
[19] É. Zablocki, H. Ben-Younes, P. Pérez, M. Cord, Explainability of vision-based autonomous driving systems: Review and challenges, 2021, arXiv preprint arXiv:2101.05307.
[20] P. Svenmarck, L. Luotsinen, M. Nilsson, J. Schubert, Possibilities and challenges for artificial intelligence in military applications, in: Proceedings of the NATO Big Data and Artificial Intelligence for Military Decision Making Specialists' Meeting, Neuilly-sur-Seine, France, 2018, pp. 1–16.
[21] Q. Zhang, W. Wang, S.-C. Zhu, Examining CNN representations with respect to dataset bias, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (1), 2018.
[22] N. Akhtar, A. Mian, Threat of adversarial attacks on deep learning in computer vision: A survey, IEEE Access 6 (2018) 14410–14430.
[23] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, D. Song, Robust physical-world attacks on deep learning visual classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1625–1634.
[24] S. Thys, W. Van Ranst, T. Goedemé, Fooling automated surveillance cameras: Adversarial patches to attack person detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[25] A. Ghorbani, J. Zou, Neuron shapley: Discovering the responsible neurons, 2020, arXiv preprint arXiv:2002.09815.
[26] R.C. Fong, A. Vedaldi, Interpretable explanations of black boxes by meaningful perturbation, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3429–3437.
[27] M. Du, N. Liu, X. Hu, Techniques for interpretable machine learning, Commun. ACM 63 (1) (2019) 68–77.
[28] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[29] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, P. Das, Explanations based on the missing: Towards contrastive explanations with pertinent negatives, in: Advances in Neural Information Processing Systems, 2018, pp. 592–603.
[30] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, S. Lee, Counterfactual visual explanations, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, vol. 97, PMLR, Long Beach, California, USA, 2019, pp. 2376–2384.
[31] L.A. Hendricks, R. Hu, T. Darrell, Z. Akata, Grounding visual explanations, in: European Conference on Computer Vision, Springer, 2018, pp. 269–286.
[32] S. Liu, B. Kailkhura, D. Loveland, Y. Han, Generative counterfactual introspection for explainable deep learning, 2019, arXiv preprint arXiv:1907.03077.
[33] R. Luss, P.-Y. Chen, A. Dhurandhar, P. Sattigeri, Y. Zhang, K. Shanmugam, C.-C. Tu, Generating contrastive explanations with monotonic attribute functions, 2019, arXiv preprint arXiv:1905.12698.
[34] P. Wang, N. Vasconcelos, Scout: Self-aware discriminant counterfactual explanations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8981–8990.
[35] A.R. Akula, S. Wang, S.-C. Zhu, Cocox: Generating conceptual and counterfactual explanations via fault-lines, in: AAAI, 2020, pp. 2594–2601.
[36] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), in: J. Dy, A. Krause (Eds.), Proceedings of Machine Learning Research, vol. 80, PMLR, Stockholmsmässan, Stockholm, Sweden, 2018, pp. 2668–2677.
[37] D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, A. Torralba, Understanding the role of individual units in a deep neural network, Proc. Natl. Acad. Sci. (2020).
[38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Object detectors emerge in deep scene CNNs, 2014, arXiv preprint arXiv:1412.6856.
[39] D. Bau, B. Zhou, A. Khosla, A. Oliva, A. Torralba, Network dissection: Quantifying interpretability of deep visual representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549.
[40] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013, arXiv preprint arXiv:1312.6034.
[41] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS One 10 (7) (2015) e0130140.
[42] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[43] M.T. Ribeiro, S. Singh, C. Guestrin, ''Why should I trust you?'' Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[44] V. Petsiuk, A. Das, K. Saenko, Rise: Randomized input sampling for explanation of black-box models, 2018, arXiv preprint arXiv:1806.07421.
[45] J. Wagner, J.M. Kohler, T. Gindele, L. Hetzel, J.T. Wiedemer, S. Behnke, Interpretable and fine-grained visual explanations for convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9097–9107.
[46] H. Liu, R. Wang, S. Shan, X. Chen, What is Tabby? Interpretable model decisions by learning attribute-based classification criteria, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[47] Q. Zhang, Y. Nian Wu, S.-C. Zhu, Interpretable convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8827–8836.
[48] Q. Zhang, Y. Yang, H. Ma, Y.N. Wu, Interpreting CNNs via decision trees, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6261–6270.
[49] R. Chen, H. Chen, J. Ren, G. Huang, Q. Zhang, Explaining neural networks semantically and quantitatively, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9187–9196.
[50] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, J.K. Su, This looks like that: Deep learning for interpretable image recognition, in: Advances in Neural Information Processing Systems, 2019, pp. 8930–8941.
[51] Y. Li, R. Ji, S. Lin, B. Zhang, C. Yan, Y. Wu, F. Huang, L. Shao, Dynamic neural network decoupling, 2019, arXiv preprint arXiv:1906.01166.
[52] J. Hu, R. Ji, Q. Ye, T. Tong, S. Zhang, K. Li, F. Huang, L. Shao, Architecture disentanglement for deep neural networks, 2020, arXiv preprint arXiv:2003.13268.
[53] H. Liang, Z. Ouyang, Y. Zeng, H. Su, Z. He, S.-T. Xia, J. Zhu, B. Zhang, Training interpretable convolutional neural networks by differentiating class-specific filters, 2020, arXiv preprint arXiv:2007.08194.
[54] S. Saralajew, L. Holdijk, M. Rees, E. Asan, T. Villmann, Classification-by-components: Probabilistic modeling of reasoning over a set of components, in: Advances in Neural Information Processing Systems, 2019, pp. 2792–2803.
[55] R.R. Hoffman, S.T. Mueller, G. Klein, J. Litman, Metrics for explainable AI: Challenges and prospects, 2018, arXiv preprint arXiv:1812.04608.
[56] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200–2011 Dataset, Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011.
[57] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[58] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252, http://dx.doi.org/10.1007/s11263-015-0816-y.
