

PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification

Meike Nauta
University of Twente, the Netherlands; University of Duisburg-Essen, Germany

Jörg Schlötterer
University of Duisburg-Essen, Germany

Maurice van Keulen
University of Twente, the Netherlands

Christin Seifert
University of Duisburg-Essen, Germany

Abstract

Interpretable methods based on prototypical patches recognize various components in an image in order to explain their reasoning to humans. However, existing prototype-based methods can learn prototypes that are not in line with human visual perception, i.e., the same prototype can refer to different concepts in the real world, making interpretation not intuitive. Driven by the principle of explainability-by-design, we introduce PIP-Net (Patch-based Intuitive Prototypes Network): an interpretable image classification model that learns prototypical parts in a self-supervised fashion which correlate better with human vision. PIP-Net can be interpreted as a sparse scoring sheet where the presence of a prototypical part in an image adds evidence for a class. The model can also abstain from a decision for out-of-distribution data by saying "I haven't seen this before". We only use image-level labels and do not rely on any part annotations. PIP-Net is globally interpretable since the set of learned prototypes shows the entire reasoning of the model. A smaller local explanation locates the relevant prototypes in one image. We show that our prototypes correlate with ground-truth object parts, indicating that PIP-Net closes the "semantic gap" between latent space and pixel space. Hence, our PIP-Net with interpretable prototypes enables users to interpret the decision making process in an intuitive, faithful and semantically meaningful way. Code is available at https://github.com/M-Nauta/PIPNet.

1. Introduction

Deep neural networks are dominant in computer vision, but there is a high demand for understanding the reasoning of such complex models [23, 30]. Consequently, interpretability and explainability have grown in importance. In contrast to the common post-hoc explainability that reverse-engineers a black box, we argue that we should take interpretability as a design starting point for in-model explainability. The recognition-by-components theory [1] describes how humans recognize objects by segmenting them into multiple components. We mimic this intuitive line of reasoning in an intrinsically interpretable image classifier. Specifically, our PIP-Net (Patch-based Intuitive Prototypes Network) automatically identifies semantically meaningful components, while only having access to image-level class labels and not relying on additional part annotations. The components are "prototypical parts" (prototypes) visualized as image patches, since exemplary natural images are more informative to humans than generated synthetic images [2]. PIP-Net is globally interpretable and designed to be highly intuitive as it uses simple scoring-sheet reasoning: the more relevant prototypical parts for a specific class are present in an image, the more evidence for that class is found, and the higher its score. When no relevant prototypes are present in the image, with e.g. out-of-distribution data, PIP-Net will abstain from a decision. PIP-Net is therefore able to say "I haven't seen this before" (see Fig. 2). Additionally, following the principle of isolation of functional properties for aligning human and machine vision [5], the reasoning of PIP-Net is separated into multiple steps. This simplifies human identification of reasons for (mis)classification.
Figure 1. Toy dataset with two classes (left). Existing models can learn representations of prototypes that do not align with human visually perceived similarity (center). Our objective is to learn prototypes that represent concepts that also look similar to humans (right).

Recent interpretable part-prototype models are ProtoPNet [3], ProtoTree [24], ProtoPShare [29] and ProtoPool [28]. These part-prototype models are only designed for fine-grained image recognition tasks (birds and car types) and lack "semantic correspondence" [17] between learned prototypes and human concepts. This "semantic gap" in prototype-based methods between similarity in latent space and input space was also found by others [9, 14]. We hypothesize that the main cause of the semantic gap is the fact that existing part-prototype models only regularize interpretability on class-level, since their underlying assumption is that (parts of) images from the same class have the same prototypes. This assumption may however not hold, leading to similarity in latent space which does not correspond to visually perceived similarity. Consider the example in Fig. 1, where we have re-labeled images from a clipart dataset [43] to create a binary classification task: the two kids are happy when the sun or dog is present, and sad when there is neither a sun nor a dog. Hence, the classes are 'sun OR dog' and 'NOT (sun OR dog)'. Intuitively, an easy-to-interpret model should learn two prototypes: one for the sun and one for the dog. However, existing interpretable part-prototype models, such as ProtoPNet [3] and ProtoTree [24], optimize images of the same class to have the same prototypes. They could, therefore, learn a single prototype that represents both the sun and the dog, especially when the model is optimized to have few prototypes (see Fig. 1, center). The model's perception of patch similarity may thus not be in line with human visual perception, leading to the perceived "semantic gap".

To address the gap between latent and pixel space, we present PIP-Net: an interpretable model that is designed to be intuitive and optimized to correlate with human vision. A sparse linear layer connects learned interpretable prototypical parts to classes. A user only needs to inspect the prototypes and their relation to the classes in order to interpret the model. We restrict the weights of the linear layer to be non-negative, such that the presence of a class-relevant prototype increases the evidence for a class. The linear layer can be interpreted as a scoring sheet: the score for a class is the sum of all present prototypes multiplied by their weights. A local explanation (Fig. 2 and Fig. 3) explains a specific prediction and shows which prototypes were found at which locations in the image. The global explanation provides an overall view of the model's decision layer, consisting of the sparse weights between classes and their relevant prototypes. Because of this interpretable and predictive linear layer, we ensure a direct relation between the prototypes and the classification, and thereby prevent unfaithful explanations which can arise with local or post-hoc XAI methods [16].

Our Contributions:
1. We present the Patch-based Intuitive Prototypes Network (PIP-Net): an intrinsically interpretable image classifier, driven by three explainability requirements: the model should be intuitive, compact and able to handle out-of-distribution data.
2. PIP-Net has a surprisingly simple architecture and is trained with novel regularization for learning prototype similarity that better correlates with human visual perception, thereby closing a perceived semantic gap.
3. PIP-Net acts as a scoring sheet and therefore can detect that an image does not belong to any class or that it belongs to multiple classes.
4. Instead of specifying the number of prototypes beforehand as in ProtoPNet [3], ProtoPool [28] and TesNet [35], PIP-Net only needs an upper bound on the number of prototypes and selects as few prototypes as possible for good classification accuracy with compact explanations, reaching sparsity ratios > 99%.

Figure 2. Our classifier is a scoring sheet based on the presence of prototypical parts in an image. Reasoning is intuitive since a single-object
classifier can handle multi-object images and out-of-distribution data. Our model is therefore able to abstain from a decision and instead say
“I haven’t seen this before”. Figure shows actual predictions and prototype locations of PIP-Net trained on PETS (37 cat and dog species).

2. Related Work

Interpretable Models. Chen et al. [3] introduced the Prototypical Part Network (ProtoPNet), an intrinsically interpretable model with a predetermined number of prototypical parts per class. To classify an image, the similarity between the latent encoding of a prototype and an image patch is calculated by measuring the distance in latent space. The resulting similarity scores are weighted by values learned by a fully-connected layer. The explanation of ProtoPNet shows the reasoning process for a single image, by visualizing all prototypes together with their weighted similarity score. The explanation can therefore be understood as a scoring sheet, although with a fixed number of class-specific prototypes leading to large explanations which have been shown to contain redundant prototypes [22]. In contrast, our prototypes can be shared between classes and explanation size is minimized. TesNet [35] builds upon ProtoPNet by learning prototypes on a Grassmann manifold in order to disentangle the latent space, but also uses a fixed number of 10 prototypes per class. They are applied to the CUB-200-2011 dataset [33] with 200 bird species, and Stanford Cars [15] with 196 car types, meaning that an explanation consists of 2000 prototypes, which can be overwhelming for a user. ProtoPShare [29] is a pruning mechanism for ProtoPNet to reduce the explanation size, and ProtoPFormer [39] adapts ProtoPNet for Transformers. ProtoTree [24] reduces the number of prototypes further and learns prototypical parts in a decision tree structure in order to reduce the local explanation size. ProtoPool [28] is an improvement of ProtoPNet by sharing prototypes between classes without pruning. Their number of prototypes is fixed and has to be defined beforehand. All models are however only designed for fine-grained image recognition tasks (CUB-200-2011 and Stanford Cars) since their loss functions optimize latent prototypes to be near (parts of) images from the same class. This however does not explicitly optimize towards human perceptual similarity. Additionally, where other prototype-based models learn latent vectors for the prototypes, PIP-Net has a node per prototype indicating to what extent the prototype is present.

Prototypical parts are also related to concepts. Some concept-based XAI methods, e.g. TCAV [13], are supervised by relying on training data for specific concepts. In contrast, PIP-Net discovers prototypical parts in a self-supervised way. Other concept-based methods, e.g. [41], are post-hoc XAI methods, and are therefore not guaranteed to faithfully explain the model's reasoning [6]. More related to our work is the post-hoc sparse explainer of Wong et al. [38] which fits a sparse linear classification layer to a trained CNN and explains the resulting sparse nodes with LIME [27] and activation maximization [42]. They show how the sparse linear layer helps users to understand the model better and that sparse interpretability contributes to easy debugging of the network. Our PIP-Net also contains a sparse linear decision layer to adopt these advantages, but in our intrinsically interpretable model the CNN features are optimized together with the linear layer rather than being frozen, and each node can be visualized as a semantically meaningful prototype.

Self-supervised Representation Learning. When there is a high variety between discriminative features for a class, extra regularization on the prototypes is needed to prevent a semantic gap (Fig. 1). Since we do not require manual part annotations but only rely on image labels (for at least part of the data), we use self-supervised learning of prototypes. Danon et al. [4] learn patch similarity with a triplet loss based on spatial proximity. The intuition is that two neighboring patches should have a similar encoding (i.e., prototype in our case), whereas a distant patch should have a different encoding. However, such triplet losses can lead to false negatives (e.g. a car has multiple wheels on different locations in the image), and usually require complex hard-negative mining [10]. Instead, [37] obviate the need for negatives by optimizing only two properties for contrastive representation learning: alignment enforces two similar images to be mapped to nearby latent feature vectors, and uniformity induces a uniform distribution of the feature vectors on a unit hypersphere. Although applied to full images only, we show that the underlying concept can be applied to image patches as well. Since we want to model the presence or absence of prototypes, our image encodings are ideally binary rather than continuous. Most existing self-supervised feature learning methods (see [12] for an overview) are therefore not directly applicable. Most relevant to our work is the recently proposed method CARL (Consistent Assignment for Representation Learning) from [31].
Rather than directly learning continuous image embeddings, CARL learns a predetermined number of latent anchors. A softmax is applied to get the distribution of an image over all anchors. CARL's alignment loss enforces different augmented views of an image to be assigned to the same anchors. This is similar to our approach, though we aim to learn a prototype per patch.

Figure 3. Example of a local explanation of PIP-Net with only 3 prototypes for the correct class. PIP-Net learns part-prototypes visualized as patches from the training data, and localizes similar image patches in an unseen input image.

3. Model Architecture and Reasoning

Consider a classification problem with K classes with training set T containing N labeled¹ images {(x^(1), y^(1)), ..., (x^(N), y^(N))} ∈ X × Y. Our main objective is to learn interpretable prototypes, which can then be used as input features for any interpretable model. The core of our model architecture is a convolutional neural network (CNN) backbone that learns an interpretable, 1-dimensional image encoding p indicating the presence or absence of prototypical parts in an image, based on the principle that a CNN's latent map preserves spatial information. A sparse linear layer then connects those prototypical parts (prototypes) to classes (see Fig. 4 and Fig. 3).

¹ We pretrain prototypes in a self-supervised fashion, thus, additional unlabeled data can be included during the pretraining process.

Figure 4. PIP-Net consists of a CNN backbone (e.g. ConvNeXt) to learn prototypical representations z. The feature representations are pooled to a vector of prototype presence scores p. Contrastive learning implements the objective that two representations of patches for an image pair should be assigned the same prototype in the latent feature space (loss L_A). The tanh-loss L_T prevents trivial solutions and regularizes the model to make use of all available prototypes. As such, PIP-Net disentangles the latent space into neurons that relate to specific object parts. Learned part-prototypes and classes are connected via a sparse linear layer. L_C is the standard negative log-likelihood loss. Model outputs during test time are not normalized and allow the outputs to be interpreted as simple scoring sheets.

An input image is first forwarded through CNN f. The resulting convolutional output z = f(x; ω_f) consists of D two-dimensional (H × W) feature maps, where ω_f denotes the trainable parameters of f. We apply a softmax over D such that Σ_d^D z_{h,w,d} = 1 to force a patch z_{h,w,:} to belong to exactly one prototype. Each value z_{h,w,d} can be interpreted as the probability that the patch at location h, w ∈ H × W corresponds to prototype d. Ideally, z_{h,w,:} is a one-hot encoded representation signaling perfect allocation to one prototype. Since our goal is to identify the absence or presence of a prototypical part in an image, we apply a max-pooling operation per feature map z_{:,:,d}, as shown by the colors in Fig. 4. The resulting tensor p ∈ [0, 1]^D represents the presence score of all D prototypes² in the image, such that the d-th value in p indicates to what extent prototype d is present in the image. For example, image encoding p could be [0.9, 0.0, 0.0, 0.1, 0.8, 1.0], indicating that the first, fifth and sixth prototype are (substantially) present in this image. The image encoding p is used as input to a linear classification layer with weights ω_c ∈ R_{≥0}^{D×K} which connects prototypes to classes and acts as a scoring system. Learned weight ω_{c_{d,k}} indicates the relevance of prototype d to class k. The output score per class is the sum of the prototype presence scores multiplied by the incoming class weights of this linear layer. By adding up scores for relevant present prototypical parts, we allow the model to find evidence for multiple classes or for none, as shown in Fig. 2. To improve interpretability, we restrict the linear layer to have non-negative weights and optimize for sparsity by learning many zero weights (see Sec. 5).

² D is only an upper bound for the number of prototypes. Regularization will reduce the number of relevant prototypes. In case D of the chosen CNN is not sufficient, a 1 × 1-convolutional layer could be added to f to increase the number of prototypes D, although we empirically found that an additional layer was not necessary for our datasets.
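To make the described forward pass concrete, the PyTorch sketch below expresses it in code. This is not the authors' released implementation (that is available in the linked repository); the class name PIPNetSketch, the attribute class_weights, and the use of a ReLU to keep the class weights non-negative are illustrative assumptions, as the paper only specifies that the weights are restricted to be non-negative.

```python
import torch
import torch.nn as nn

class PIPNetSketch(nn.Module):
    """Minimal sketch of the PIP-Net forward pass described in Sec. 3.

    `backbone` is assumed to map an image batch to a (B, D, H, W) feature map,
    e.g. a torchvision ResNet/ConvNeXt with its pooling and classification head removed.
    """

    def __init__(self, backbone: nn.Module, num_prototypes: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        # One weight per (prototype, class) pair; kept non-negative in forward().
        self.class_weights = nn.Parameter(torch.rand(num_prototypes, num_classes))

    def forward(self, x: torch.Tensor):
        z = self.backbone(x)                 # (B, D, H, W) patch features
        z = torch.softmax(z, dim=1)          # each patch becomes a distribution over the D prototypes
        p = torch.amax(z.flatten(2), dim=2)  # max-pool each prototype map -> presence scores in [0, 1], shape (B, D)
        w = torch.relu(self.class_weights)   # one way to enforce non-negative class weights
        scores = p @ w                       # scoring sheet: class score = sum of presence * weight, shape (B, K)
        return z, p, scores
```

The prototype dimension D then simply equals the number of output channels of the chosen backbone, and an image in which no class-relevant prototype is present yields near-zero scores for every class.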
4. Self-Supervised Pre-Training of Prototypes

We use self-supervised learning with specially designed loss functions to generate semantically meaningful prototypes. We assume image-level labels and do not rely on expensive manual part-annotations. In the first step we pretrain the prototypes, while freezing (and not using) the linear layer to the classes. In this step, we optimize the prototypes to already learn semantic similarity, independent of the classification task. This prevents perceptually different prototypes from being similar in latent space (see Fig. 1).

We learn image encodings p which indicate the presence of prototypes in input image x. In line with other self-supervised learning methods [12], we create a positive pair x′, x′′ by applying different data augmentations to input image x. By selecting data augmentations such that humans would still consider the two views similar, we incorporate human perception into the training process.

Similar to the contrastive learning approach of [37], we optimize for alignment and uniformity of representations. However, rather than optimizing for alignment on image-level, we optimize for patch alignment by optimizing the model to assign the same prototype to two views of an augmented image patch. Specifically, for pretraining the prototypes we use a linear combination of only two loss terms: λ_A L_A + λ_T L_T. The alignment loss L_A optimizes two views of the same image patch to belong to the same, and ideally a single, prototype. We compute the similarity between the latent patches of two views of an image patch (z′_{h,w,:} and z′′_{h,w,:}) as their dot product:

\mathcal{L}_A = - \frac{1}{HW} \sum_{(h,w) \in H \times W} \log(\bm{z}'_{h,w,:} \cdot \bm{z}''_{h,w,:}).   (1)

Since each patch encoding is normalized with softmax such that Σ_d^D z_{h,w,d} = 1, two identical one-hot encoded tensors result in L_A = 0. This loss, similar to the consistency loss of CARL (Consistent Assignment for Representation Learning [31]), therefore implicitly optimizes for near-binary encodings where an image patch corresponds to exactly one prototype. One can imagine that binary presence scores are easier to interpret than soft scores where a prototype is present for e.g. 50%.
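A possible implementation of Eq. (1) is sketched below. It assumes the two augmented views are spatially aligned so that location (h, w) shows the same image patch in both views; the small eps term is an added assumption for numerical stability and is not part of Eq. (1).

```python
import torch

def alignment_loss(z1: torch.Tensor, z2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of the alignment loss of Eq. (1).

    z1 and z2 are softmaxed patch encodings of two augmented views of the same
    image, both of shape (B, D, H, W); corresponding spatial locations are
    assumed to show (roughly) the same image patch.
    """
    similarity = (z1 * z2).sum(dim=1)           # dot product over the prototype dimension, shape (B, H, W)
    return -torch.log(similarity + eps).mean()  # average over the batch and all H*W patch locations
```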
A naive solution for the model to get L_A = 0 is to let one prototype node be activated on all image patches in each image in the dataset. To prevent such a trivial solution and learn diverse image representations that make use of the whole space of D prototypes, we introduce our tanh-loss L_T that regulates that every prototype should be at least once present in a mini-batch:

\mathcal{L}_T(\bm{p}) = -\frac{1}{D} \sum_d^D \log(\tanh(\sum_b^B \bm{p}_b) + \epsilon),   (2)

where tanh and log are element-wise, B is the number of samples in a mini-batch and ϵ is a small number for numerical stability. The intuition behind L_T is that the tanh checks whether a prototype is present in the mini-batch without taking into account how often a prototype is present, since some prototypes (e.g. sky) will naturally occur more frequently than others.
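Eq. (2) translates almost directly into code. The sketch below assumes p collects the prototype presence scores of a mini-batch with shape (B, D); the function name is an illustrative choice.

```python
import torch

def tanh_loss(p: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of the tanh-loss of Eq. (2).

    p holds the prototype presence scores of a mini-batch, shape (B, D).
    The loss is low when every prototype is present in at least one sample,
    regardless of how often it is present.
    """
    present = torch.tanh(p.sum(dim=0))       # (D,): close to 1 if the prototype occurs somewhere in the batch
    return -torch.log(present + eps).mean()  # average over the D prototypes
```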
5. Training PIP-Net

After pretraining the prototypes, we unfreeze the last linear layer and train the model as a whole. To optimize for classification performance, we add a classification loss term L_C, which is simply a standard negative log-likelihood loss between prediction ŷ and the one-hot encoded ground-truth label y. L_C mainly influences the weights of the linear layer, but also finetunes the prototypes to be relevant for the downstream classification task. In addition to optimizing for classification performance, we have three requirements for our interpretable classifier: (i) it should be explainable with scoring-sheet reasoning (cf. Sec. 5.1), (ii) the explanation should be compact (cf. Sec. 5.2), and (iii) the model should be able to handle out-of-distribution data by being able to output "I haven't seen this before", i.e., be able to abstain (cf. Sec. 5.3). These three objectives are captured in a custom activation function (cf. Eq. (3) in Sec. 5.4). The overall objective for the second training phase of PIP-Net is: λ_C L_C + λ_A L_A + λ_T L_T.
5.1. Scoring Sheet Reasoning

We implement the linear classification layer as an interpretable scoring sheet that looks for (only positive) class evidence in an input sample. Summing up the relevance of present prototypical parts allows the model to find evidence for multiple classes or for none (see Fig. 2).

Whereas usually class confidence scores are used to train neural networks, scoring-sheet inference results in unnormalized output scores. To train with the regular negative log-likelihood loss during the second training phase (after prototype pretraining), we apply a softmax activation function σ to the output of the linear layer o during training to convert unnormalized logits to class confidence scores.³

³ We only apply softmax during training, and use interpretable scoring during inference.

Naively applying softmax would however conflict with our goals of compactness and decision abstaining in scoring-sheet reasoning, because softmax is not scale-invariant, i.e., σ(z) ≠ σ(c · z) for scalar c. More concretely, if the output scores are initially large (for example, when there are many relevant prototypes present with large weights to classes), then softmax outputs a highly skewed distribution. In contrast, when the class scores are low (and hence when the weights or prototype presence scores are very small), the softmax output is close to a uniform distribution, e.g., σ([0.12, 0.65, 0.21]) = [0.26, 0.45, 0.29]. Having either very high or very low scores makes the model susceptible to weight initialization and hinders effective and stable training. Sections 5.2 and 5.3 discuss this challenge in more detail before presenting the final solution in Section 5.4.

5.2. Compact Explanations

The overconfidence of softmax would also compete with our compactness goal. Consider the following example activations in a 3-class scenario: σ([1.2, 6.5, 2.1]) = [0.005, 0.983, 0.012]. The confidence score of the second class is already close to one, such that the model has no incentive to further reduce the output scores of the other classes. Prototypes which are actually irrelevant for a class might therefore keep a positive weight, which results in explanations that are larger than necessary. Sparse weights between prototypes and classes would improve interpretability because the number of relevant prototypes per class, and consequently the explanation size, are reduced. Existing sparsity and pruning methods are mainly developed for reducing memory and computation costs [8] and often the sparsity ratio has to be predetermined by the user [8, 20], making them not directly relevant to our interpretability goal (further discussed in Suppl.). Instead, we introduce a novel function that optimizes classification performance and compactness simultaneously, as presented in Sec. 5.4.

5.3. Handling OoD Data

The standard solution for the scale-invariance issue is simply to apply normalization before softmax, as is often done in representation learning (e.g. [11, 34, 37]). However, established normalization layers such as batch normalization and instance normalization impede interpretability since the prototype absence scores with a value of zero become non-zero. With such normalization, we would lose the desirable property of scoring systems being able to output "I haven't seen this before" by giving near-zero scores for all classes. Abstaining from a decision could add to the trustworthiness of explanations [19]. Lp normalization (e.g. L2) is an alternative where zero remains zero. However, a near-zero score could still be significantly increased, limiting the OoD detection possibilities. More importantly, Lp normalization of p would make the scores in p dependent on each other. The presence score of one prototypical part would then influence the encoding of other prototypes in p. Such dependence could result in unintuitive behaviour. As found by others, normal CNNs can be easily fooled by adding occluding objects (e.g. a prediction of a monkey changes to a human when a guitar partly occludes the monkey) [36]. Hence, to prevent unexpected and unintuitive behavior, we want the prototype presence scores to behave independently of each other. We therefore introduce in Sec. 5.4 another way of normalizing the logits, where a score of zero stays zero.

5.4. Overall Classification Objective

To regularize for sparsity during training, we calculate the output scores o, that are used as input to softmax, as follows:

\bm{o} = \log((\bm{p}\omega_c)^2 + 1),   (3)

where p are the prototype presence scores and ω_c the weights of the linear layer. Since we restrict the weights to be non-negative such that ω_c ∈ R_{≥0}^{D×K}, and p ∈ [0, 1]^D, o will be zero when the input is zero, such that the OoD-property is kept. Squaring pω_c helps the model to quickly adapt. Additionally, the natural logarithm reduces large weights to prevent overconfidence, as the 'loss gain' by decreasing weights is higher than by increasing weights, such that the model is incentivized to reduce the weights of irrelevant prototypes. This normalization step therefore implicitly optimizes for sparsity and smaller explanations. During inference (test time), the output scores are simply calculated as pω_c in order to support easy interpretation.
could result in unintuitive behaviour. As found by others,
normal CNNs can be easily fooled by adding occluding ob- 6. Experiments and Results
jects (e.g. a prediction of a monkey changes to a human when
a guitar partly occludes the monkey) [36]. Hence, to prevent We evaluate our model on the standard benchmarks in
unexpected and unintuitive behavior, we want the prototype prototype literature: CUB-200-2011 [33] (200 bird species),
presence scores to behave independently of each other. We and Stanford Cars [15] (196 car models). Additionally, we
therefore introduce in Sec. 5.4 another way of normalizing evaluate on Oxford-IIIT Pet [25] (37 cat and dog species) to
the logits, where a score of zero stays zero. include a dataset with fewer classes.

2749
6.1. Implementation Details

In our architecture, any convolutional backbone can be used and we apply ResNet50 and ConvNeXt-tiny [18], indicated with R and C respectively. We use the pretrained versions but change the strides of the last layers from 2 to 1, such that the width W and height H of the output feature maps are increased (from 7 × 7 to 28 × 28 for ResNet and from 7 × 7 to 26 × 26 for ConvNeXt, see Suppl.). This small change results in a more fine-grained patch grid z which can be better optimized for patch similarity. Backbone f is finetuned with Adam with a learning rate of 0.0001 (CUB-R and PETS) or 0.0005 (CARS, CUB-C) in a cosine annealing schedule. The linear layer is trained with a learning rate of 0.05. Weights for the losses are set to λ_C = λ_T = 2, λ_A = 5. We pretrain the prototypes for 10 epochs, followed by training PIP-Net as a whole for 60 more epochs. Images are resized to 224 × 224 and augmented with TrivialAugment [21]. See Suppl. and code for details.
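For the ResNet50 backbone, the stride modification could look as follows. This is an assumed sketch using torchvision, not necessarily how the authors' repository implements it; the ConvNeXt case is analogous.

```python
import torch.nn as nn
from torchvision.models import resnet50

def resnet50_backbone_fine_grid() -> nn.Module:
    """Assumed sketch of the stride change described above: set the stride-2
    convolutions of the last two ResNet50 stages to stride 1 so that a
    224x224 input yields a 28x28 instead of a 7x7 feature map."""
    net = resnet50(weights="DEFAULT")  # pretrained weights (torchvision >= 0.13)
    for stage in (net.layer3, net.layer4):
        for module in stage.modules():
            if isinstance(module, nn.Conv2d) and module.stride == (2, 2):
                module.stride = (1, 1)
    # Keep the convolutional stages only; drop average pooling and the fc head.
    return nn.Sequential(*list(net.children())[:-2])
```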
6.2. Performance and Explanation Size

Dataset | Method           | Top-1 Acc ↑ | Global Size ↓ | Local Size ↓ | Sparsity % ↑
CUB     | PIP-Net C        | 84.3±0.2    | 495±6         | 10 (4)       | 99.3
CUB     | PIP-Net R        | 82.0±0.3    | 731±19        | 12 (5)       | 99.7
CUB     | ProtoPNet [3]    | 79.2        | 2000          | 2000         | -
CUB     | ProtoTree [24]   | 82.2±0.7    | 202           | 8.3          | -
CUB     | ProtoPShare [29] | 74.7        | 400           | 400          | -
CUB     | ProtoPool [28]   | 85.5±0.1    | 202           | 202          | -
CARS    | PIP-Net C        | 88.2±0.5    | 515±4         | 9 (4)        | 99.4
CARS    | PIP-Net R        | 86.5±0.3    | 669±13        | 11 (4)       | 99.8
CARS    | ProtoPNet [3]    | 86.1        | 1960          | 1960         | -
CARS    | ProtoTree [24]   | 86.6±0.2    | 195           | 8.5          | -
CARS    | ProtoPShare [29] | 86.4        | 480           | 480          | -
CARS    | ProtoPool [28]   | 88.9±0.1    | 195           | 195          | -
PETS    | PIP-Net C        | 92.0±0.3    | 172±2         | 4 (2)        | 99.4
PETS    | PIP-Net R        | 88.5±0.2    | 346±12        | 8 (5)        | 99.5

Table 1. Mean accuracy and standard deviation (3 random seeds). Global size indicates the total number of prototypes in the model. Local size indicates the number of non-zero prototypes used for a single prediction: for all classes in total, and between brackets for the predicted class only. Sparsity ratio indicates the percentage of zero weights in PIP-Net's linear classification layer.

Table 1 presents the accuracy of recent prototypical-parts-based models and the compactness of the explanations. We measure the size of the global explanation as the number of prototypes in the model with at least one non-zero weight. The local explanation can either count all present prototypes that are relevant for any class, or only for the predicted class (Fig. 3). For a local explanation, we count all relevant prototypes with a similarity > 0.1. Table 1 shows that PIP-Net has a low number of prototypes, especially in a local explanation, in combination with a competitive accuracy. Hence, a user only has to check a handful of prototypes to understand why PIP-Net predicted a specific class. Figure 2 shows the actual output of PIP-Net trained on PETS. The local explanation for one class consists of just 3 prototypes, and PIP-Net can indeed, as designed, abstain from classifying for out-of-distribution data. The supplementary material shows further examples of OoD and multi-object predictions.

PIP-Net is specifically designed for open set recognition [40], implying that it can detect OoD input while classifying in-distribution (ID) input. We quantify the OoD-detection of PIP-Net with the common FPR95 metric by determining class-specific thresholds for output score o such that 95% of the ID samples are classified as in-distribution.

FPR95 (↓)  | OOD: PETS | OOD: CUB | OOD: CARS
ID: PETS   | -         | 0.129    | 0.009
ID: CUB    | 0.081     | -        | 0.011
ID: CARS   | 0.056     | 0.078    | -

Table 2. OOD detection results: the false positive rate of OOD detection when the true positive rate of ID samples is at 95%. When 95% of the PETS images are classified as in-distribution, PIP-Net classifies only 0.9% of the CARS images and 12.9% of the CUB images as in-distribution for PETS.

Table 2 shows that PIP-Net can detect most of the OoD samples, thereby contributing to intuitive reasoning. Our findings confirm the insight of [32] that sparsification is beneficial for OOD detection.
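A simplified sketch of the FPR95 computation is given below. Note that the paper uses class-specific thresholds, whereas this illustration applies a single threshold to the maximum class output score; function and variable names are assumptions.

```python
import numpy as np

def fpr_at_95_tpr(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """Sketch of the FPR95 metric reported in Table 2 (assumed implementation).

    id_scores / ood_scores: maximum class output score per image for the
    in-distribution and out-of-distribution test sets. The threshold is chosen
    such that 95% of the in-distribution images are accepted.
    """
    threshold = np.percentile(id_scores, 5)         # accept the top 95% of ID scores
    return float((ood_scores >= threshold).mean())  # fraction of OOD images wrongly accepted as ID
```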
6.3. Semantic Quality of Prototypes

As the 'sun OR dog' issue from Fig. 1 illustrated, accuracy and the number of prototypes are not sufficient for indicating interpretability. Figure 5 visualizes the top-10 patches of learned prototypes. To quantify the semantic correspondence between prototypes and image patches, we evaluate the purity of prototypes by using ground-truth center locations of object parts available in the CUB dataset. Our assumption is that an interpretable prototype should correspond to a single object part, e.g. an eye or a wing. We evaluate to what extent the top-10 image patches for a prototype are encoding the same part by calculating whether the center of the ground-truth object part is contained in a 32 × 32 image patch. For each part-prototype model, we select the 10 images with the highest similarity score for a particular prototype, in order to avoid model-specific similarity/distance thresholds.
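A rough sketch of this purity measure for a single prototype is given below. The data structures (a list of top-10 patch locations and a dictionary of CUB part-center annotations) are assumptions for illustration; the exact bookkeeping in the authors' evaluation code may differ.

```python
def prototype_purity(top_patches, part_centers, patch_size: int = 32) -> float:
    """Sketch (assumed bookkeeping) of the CUB part-purity of one prototype.

    top_patches: (image_id, x, y) top-left corners of the prototype's top-10 image patches.
    part_centers: dict mapping image_id -> list of (part_id, cx, cy) ground-truth part centers.
    Returns the fraction of the top-10 patches that contain the most frequent part center.
    """
    counts = {}
    for image_id, x, y in top_patches:
        for part_id, cx, cy in part_centers.get(image_id, []):
            if x <= cx < x + patch_size and y <= cy < y + patch_size:
                counts[part_id] = counts.get(part_id, 0) + 1
    if not counts:
        return 0.0
    return max(counts.values()) / len(top_patches)
```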
Figure 5. Example prototypes, one per row, visualized with their top-10 image patches: (a) ProtoPNet - CUB, (b) ProtoPool - CUB, (c) PIP-Net - CUB after self-supervised pretraining only, (d) PIP-Net - CUB fully trained, (e) PIP-Net - CUB (more visualizations), (f) PIP-Net - CARS, (g) PIP-Net - PETS, (h) PIP-Net - PIN. Note that both ProtoPNet and ProtoPool learn interpretable and less-interpretable prototypes (e.g., first two rows). ProtoPNet might learn duplicate prototypes (fifth and sixth row). Panels (c) and (d) show the same prototypes p_i for pretrained and fully trained PIP-Net. All prototypes are visualized in the supplementary material.

Model                | Purity (train) ↑ | Purity (test) ↑
ProtoPNet R [3]      | 0.44 ± 0.21      | 0.46 ± 0.22
ProtoTree R [24]     | 0.13 ± 0.14      | 0.14 ± 0.16
ProtoPShare R [29]   | 0.43 ± 0.21      | 0.43 ± 0.22
ProtoPool R [28]     | 0.35 ± 0.20      | 0.36 ± 0.21
PIP-Net R (ours)     | 0.63 ± 0.25      | 0.65 ± 0.25
PIP-Net R (self-sup) | 0.29 ± 0.31      | 0.29 ± 0.32
PIP-Net C (ours)     | 0.92 ± 0.16      | 0.93 ± 0.15
PIP-Net C (self-sup) | 0.61 ± 0.38      | 0.60 ± 0.38

Table 3. Purity of prototypes w.r.t. object part annotations on CUB, averaged over all relevant prototypes in the model (± std. dev.). Purity measures how often the (center of the) same object part is present in the top-10 image patches per prototype. PIP-Net (self-sup) indicates the purity after pretraining the prototypes (i.e., without classification loss L_C). R is a ResNet backbone, C a ConvNeXt backbone.

Table 3 presents the purity of learned CUB-prototypes. It shows a correlation between the size of the explanation and the purity of the prototypes for the existing models, which is in line with the 'sun OR dog' issue. Our PIP-Net is however compact and has pure, interpretable prototypes. PIP-Net with a ConvNeXt-tiny backbone achieves a substantially higher purity than other models. Interestingly, even the self-supervised prototypes of PIP-Net C have a higher purity score than the prototypes learned by other models with classification loss. Also PIP-Net with a ResNet backbone achieves a higher purity than ProtoPNet, ProtoPShare (a pruned version of ProtoPNet), ProtoPool and ProtoTree. We hypothesize that the "patchify stem" of ConvNeXt is beneficial for learning part prototypes, whereas ResNet might perform worse due to its larger number of prototypes (D = 2048 for ResNet vs. D = 768 for ConvNeXt), and because ResNets have weak spatial localization discriminativeness in the last layers [26].

The relatively high standard deviation in Tab. 3 indicates that some prototypes have a lower part purity. This could be due to the fact that some prototypes are semantically meaningful for humans but do not correspond to a single object part, such as a prototype encoding a specific color (e.g. 'anything bright blue') or a non-part-related concept (e.g. 'human skin' or 'tree leaves').

PIP-Net is also applicable to non-fine-grained image data. We train PIP-Net with a ConvNeXt backbone on PartImageNet [7] (PIN), a dataset with 158 classes from ImageNet with part segmentation annotations, allowing us to further evaluate prototype purity. PIP-Net achieves a top-1 accuracy of 85% with 262 prototypes, compared to 91% for a normal black-box ConvNeXt. We leave further hyperparameter tuning for improved classification performance for future work, and rather focus on the evaluation of prototype purity. We define prototype purity here as the fraction of image patches of a prototype that overlap with the same ground-truth object part. We measure the purity based on all image patches where a relevant prototype is detected (i.e., a prototype presence score > 0.5) and find that the purity averaged over all active non-zero-weighted PIN-prototypes is 92%. The high purity aligns with the visual evidence from Fig. 5 and confirms the interpretability of the learned prototypes.

7. Conclusion

We presented PIP-Net: an image classifier optimized to be aligned with human perception. By carefully crafting the loss terms and activation functions, PIP-Net learns high-quality prototypical parts, is globally interpretable and generates compact explanations. Additionally, it can abstain from decision making by outputting near-zero scores when no relevant prototypes are found. Our sparse linear decision layer makes PIP-Net an intuitive model, although such simplicity also has its limitations. PIP-Net learns whether a prototype is present or absent, but does not count the number of prototypes in an image. Hence, our model may not be suited for datasets where the number of occurrences of a prototype is the only discriminative feature.

Importantly, PIP-Net only relies on image labels and no other annotations to learn an interpretable part-prototype model. Since our prototypes are learned with a combination of supervised and self-supervised loss terms, it is possible to apply the self-supervised losses from Sec. 4 to unlabeled data. This is especially interesting for domains where manual image labeling is expensive. We leave the exploration of partly unlabeled data for future work. Lastly, we see further research opportunities to use PIP-Net for efficiently adapting the model's reasoning, e.g. for fixing shortcut learning. We think that interpretability-by-design should become the new standard for interpretable and explainable AI, especially for high-stakes decisions. This approach resulted in our PIP-Net, which provides compact explanations that align well with human perception, allowing users to interpret decisions in an intuitive, faithful and semantically meaningful way.
References

[1] Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2):115, 1987.
[2] Judy Borowski, Roland Simon Zimmermann, Judith Schepers, Robert Geirhos, Thomas S. A. Wallis, Matthias Bethge, and Wieland Brendel. Exemplary natural images explain CNN activations better than state-of-the-art feature visualization. In 9th International Conference on Learning Representations (ICLR), 2021.
[3] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K. Su. This looks like that: Deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems, volume 32, 2019.
[4] Dov Danon, Hadar Averbuch-Elor, Ohad Fried, and Daniel Cohen-Or. Unsupervised natural image patch learning. Computational Visual Media, 5(3):229-237, 2019.
[5] Christina M. Funke, Judy Borowski, Karolina Stosio, Wieland Brendel, Thomas S. A. Wallis, and Matthias Bethge. Five points to check when comparing visual perception in humans and machines. Journal of Vision, 21(3):16, 2021.
[6] Yash Goyal, Amir Feder, Uri Shalit, and Been Kim. Explaining classifiers with causal concept effect (CaCE). arXiv preprint arXiv:1907.07165, 2019.
[7] Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. PartImageNet: A large, high-quality dataset of parts. In Computer Vision - ECCV 2022, Part VIII, pages 128-145. Springer, 2022.
[8] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1-124, 2021.
[9] Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. This looks like that... does it? Shortcomings of latent space prototype interpretability in deep networks. arXiv preprint arXiv:2105.02968, 2021.
[10] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1), 2021.
[11] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1), 2021.
[12] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4037-4058, 2021.
[13] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning, volume 80 of PMLR, pages 2668-2677, 2018.
[14] Sunnie S. Y. Kim, Nicole Meister, Vikram V. Ramaswamy, Ruth Fong, and Olga Russakovsky. HIVE: Evaluating the human interpretability of visual explanations. In European Conference on Computer Vision (ECCV), 2022.
[15] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
[16] Matthew L. Leavitt and Ari Morcos. Towards falsifiable interpretability research. arXiv preprint arXiv:2010.12016, 2020.
[17] Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions, 2022.
[18] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11976-11986, 2022.
[19] Radek Mackowiak, Lynton Ardizzone, Ullrich Kothe, and Carsten Rother. Generative classifiers as a basis for trustworthy image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2971-2981, 2021.
[20] Decebal Constantin Mocanu, Elena Mocanu, Tiago Pinto, Selima Curci, Phuong H. Nguyen, Madeleine Gibescu, Damien Ernst, and Zita A. Vale. Sparse training theory for scalable and efficient agents. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 34-38, 2021.
[21] Samuel G. Müller and Frank Hutter. TrivialAugment: Tuning-free yet state-of-the-art data augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 774-782, 2021.
[22] Meike Nauta, Annemarie Jutte, Jesper Provoost, and Christin Seifert. This looks like that, because ... explaining prototypes for interpretable image recognition. In Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 441-456. Springer, 2021.
[23] Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice van Keulen, and Christin Seifert. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Computing Surveys, 2023.
[24] Meike Nauta, Ron van Bree, and Christin Seifert. Neural prototype trees for interpretable fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14933-14943, 2021.
[25] Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498-3505, 2012.
[26] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems, volume 34, pages 12116-12128, 2021.
[27] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1135-1144, 2016.
[28] Dawid Rymarczyk, Łukasz Struski, Michał Górszczak, Koryna Lewandowska, Jacek Tabor, and Bartosz Zieliński. Interpretable image classification with differentiable prototypes assignment. In Computer Vision - ECCV 2022, pages 351-368. Springer, 2022.
[29] Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, and Bartosz Zieliński. ProtoPShare: Prototypical parts sharing for similarity discovery in interpretable image classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD), pages 1420-1430, 2021.
[30] Wojciech Samek, Grégoire Montavon, Sebastian Lapuschkin, Christopher J. Anders, and Klaus-Robert Müller. Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3):247-278, 2021.
[31] Thalles Silva and Adín Ramírez Rivera. Representation learning via consistent assignment of views to clusters. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing (SAC), pages 987-994, 2022.
[32] Yiyou Sun and Yixuan Li. DICE: Leveraging sparsification for out-of-distribution detection. In Computer Vision - ECCV 2022, Part XXIV, pages 691-708. Springer, 2022.
[33] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[34] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia (MM), pages 1041-1049, 2017.
[35] Jiaqi Wang, Huafeng Liu, Xinyue Wang, and Liping Jing. Interpretable image recognition by constructing transparent embedding space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 895-904, 2021.
[36] Jianyu Wang, Zhishuai Zhang, Cihang Xie, Yuyin Zhou, Vittal Premachandran, Jun Zhu, Lingxi Xie, and Alan Yuille. Visual concepts and compositional voting. Annals of Mathematical Sciences and Applications, 3(1):151-188, 2018.
[37] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of PMLR, pages 9929-9939, 2020.
[38] Eric Wong, Shibani Santurkar, and Aleksander Madry. Leveraging sparse linear layers for debuggable deep networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of PMLR, pages 11205-11216, 2021.
[39] Mengqi Xue, Qihan Huang, Haofei Zhang, Lechao Cheng, Jie Song, Minghui Wu, and Mingli Song. ProtoPFormer: Concentrating on prototypical parts in vision transformers for interpretable image recognition. arXiv preprint arXiv:2208.10431, 2022.
[40] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
[41] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 20554-20565, 2020.
[42] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
[43] C. L. Zitnick and Devi Parikh. Bringing semantics into focus using visual abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
