
Rethinking Semantic Segmentation: A Prototype View

Tianfei Zhou¹, Wenguan Wang²,¹*, Ender Konukoglu¹, Luc Van Gool¹


¹ Computer Vision Lab, ETH Zurich    ² ReLER, AAII, University of Technology Sydney
https://github.com/tfzhou/ProtoSeg

Abstract

Prevalent semantic segmentation solutions, despite their different network designs (FCN based or attention based) and mask decoding strategies (parametric softmax based or pixel-query based), can be placed in one category, by considering the softmax weights or query vectors as learnable class prototypes. In light of this prototype view, this study uncovers several limitations of such a parametric segmentation regime, and proposes a nonparametric alternative based on non-learnable prototypes. Instead of prior methods learning a single weight/query vector for each class in a fully parametric manner, our model represents each class as a set of non-learnable prototypes, relying solely on the mean features of several training pixels within that class. The dense prediction is thus achieved by nonparametric nearest prototype retrieving. This allows our model to directly shape the pixel embedding space, by optimizing the arrangement between embedded pixels and anchored prototypes. It is able to handle an arbitrary number of classes with a constant amount of learnable parameters. We empirically show that, with FCN based and attention based segmentation models (i.e., HRNet, Swin, SegFormer) and backbones (i.e., ResNet, HRNet, Swin, MiT), our nonparametric framework yields compelling results over several datasets (i.e., ADE20K, Cityscapes, COCO-Stuff), and performs well in the large-vocabulary situation. We expect this work will provoke a rethink of the current de facto semantic segmentation model design.

Figure 1. Different semantic segmentation paradigms: (a-b) parametric vs. (c) nonparametric. Modern segmentation solutions, no matter using (a) parametric softmax or (b) query vectors for mask decoding, can be viewed as learnable prototype based methods that learn class-wise prototypes in a fully parametric manner. We instead propose a nonparametric scheme (c) that directly selects sub-cluster centers of embedded pixels as prototypes, and achieves per-pixel prediction via nonparametric nearest prototype retrieving.

1. Introduction

With the renaissance of connectionism, rapid progress has been made in semantic segmentation. Till now, most state-of-the-art segmentation models [15, 34, 49, 135] were built upon Fully Convolutional Networks (FCNs) [79]. Despite their diversified model designs and impressive results, existing FCN based methods commonly apply parametric softmax over pixel-wise features for dense prediction (Fig. 1(a)). Very recently, the vast success of Transformer [105] stimulates the emergence of attention based segmentation solutions. Many of these 'non-FCN' models, like [118, 139], directly follow the standard mask decoding regime, i.e., estimate softmax distributions over dense visual embeddings (extracted from patch token sequences). Interestingly, the others [20, 100] follow the good practice of Transformer in other fields [11, 82, 113] and adopt a pixel-query strategy (Fig. 1(b)): utilize a set of learnable vectors to query the dense embeddings for mask prediction. They speculate that the learned query vectors can capture class-wise properties, however, lacking in-depth analysis.

Noticing there exist two different mask decoding strategies, the following questions naturally arise: ❶ What are the relation and difference between them? and ❷ If the learnable query vectors indeed implicitly capture some intrinsic properties of data, is there any better way to achieve this?

Tackling these two issues can provide insights into modern segmentation model design, and motivates us to rethink the task from a prototype view. The idea of prototype based classification [31] is classical and intuitive (it can date back to the nearest neighbors algorithm [23] and finds evidence in cognitive science [60, 91]): data samples are classified based on their proximity to representative prototypes of classes.

* Corresponding author: Wenguan Wang.
With this perspective, in §2, we first answer question ❶ by pointing out that most modern segmentation methods, from softmax based to pixel-query based, from FCN based to attention based, fall into one grand category: parametric models based on learnable prototypes. Consider a segmentation task with C semantic classes. Most existing efforts seek to directly learn C class-wise prototypes – softmax weights or query vectors – for parametric, pixel-wise classification. Hence question ❷ becomes more fundamental: ❸ What are the limitations of this learnable prototype based parametric paradigm? and ❹ How to address these limitations?

Driven by question ❸, we find there are three critical limitations: First, usually only one single prototype is learned per class, insufficient to describe rich intra-class variance. The prototypes are simply learned in a fully parametric manner, without considering their representative ability. Second, to map a H×W×D image feature tensor into a H×W×C semantic mask, at least D×C parameters are needed for prototype learning. This hurts generalizability [115], especially in the large-vocabulary case; for instance, if there are 800 classes and D = 512, we need 0.4M learnable prototype parameters alone. Third, with the cross-entropy loss, only the relative relations between intra-class and inter-class distances are optimized [89, 111, 134]; the actual distances between pixels and prototypes, i.e., intra-class compactness, are ignored.

As a response to question ❹, in §3, we develop a nonparametric segmentation framework, based on non-learnable prototypes. Specifically, building upon the ideas of prototype learning [116, 133] and metric learning [40, 64], it is fully aware of the limitations of its parametric counterpart. Independent of specific backbone architectures (FCN based or attention based), our method is general and brings insights into segmentation model design and training. For model design, our method explicitly sets sub-class centers, in the pixel embedding space, as the prototypes. Each pixel is predicted to be in the same class as the nearest prototype, without relying on extra learnable parameters. For training, as the prototypes are representative of the dataset, we can directly pose known inductive biases (e.g., intra-class compactness, inter-class separation) as extra optimization criteria and efficiently shape the whole embedding space, instead of optimizing the prediction accuracy only. Our model has three appealing advantages: First, each class is abstracted by a set of prototypes, well capturing class-wise characteristics and intra-class variance. With the clear meaning of the prototypes, the interpretability is also enhanced – the prediction of each pixel can be intuitively understood as the reference of its closest class center in the embedding space [3, 7]. Second, due to the nonparametric nature, the generalizability is improved. Large-vocabulary semantic segmentation can also be handled efficiently, as the amount of learnable prototype parameters is no longer constrained to the number of classes (i.e., 0 vs D×C). Third, via prototype-anchored metric learning, the pixel embedding space is shaped as well-structured, benefiting segmentation prediction eventually.

By answering questions ❶-❹, we formalize prior methods within a learnable prototype based, parametric framework, and link this field to prototype learning and metric learning. We provide a literature review and related discussions in §4. In §5.2, we show our method achieves impressive results over famous datasets (i.e., ADE20K [140], Cityscapes [22], COCO-Stuff [10]) with top-leading FCN based and attention based segmentation models (i.e., HRNet [108], Swin [78], SegFormer [118]) and backbones (i.e., ResNet [45], HRNet [108], Swin [78], MiT [118]). Compared with the parametric counterparts, our method does not cause any extra computational overhead during testing while reducing the amount of learnable parameters. In §5.3, we demonstrate our method consistently performs well when increasing the number of semantic classes from 150 to 847. Accompanied by a set of ablative studies in §5.4, our extensive experiments verify the power of our idea and the efficacy of our algorithm. Finally, we draw conclusions in §6. This work is expected to open a new venue for future research in this field.

2. Existing Semantic Segmentation Models as Parametric Prototype Learning

Next we first formalize the two existing mask decoding strategies mentioned in §1, and then answer question ❶ from a unified view of parametric prototype learning.

Parametric Softmax Projection. Almost all FCN-like and many attention-based segmentation models adopt this strategy. Their models comprise two learnable parts: i) an encoder φ for dense visual feature extraction, and ii) a classifier ρ (i.e., projection head) that projects pixel features into the semantic label space. For each pixel example i, its embedding i ∈ R^D, extracted from φ, is fed into ρ for C-way classification:

    p(c|i) = exp(w_c^⊤ i) / Σ_{c′=1}^{C} exp(w_{c′}^⊤ i),    (1)

where p(c|i) ∈ [0, 1] is the probability of i being assigned to class c. ρ is a pixel-wise linear layer, parameterized by W = [w_1, ···, w_C] ∈ R^{C×D}; w_c ∈ R^D is a learnable projection vector for the c-th class; the bias term is omitted for brevity.

Parametric Pixel-Query. A few attention-based segmentation networks [118, 139] work in a more 'Transformer-like' manner: given the pixel embedding i ∈ R^D, a set of C query vectors, i.e., E = [e_1, ···, e_C] ∈ R^{C×D}, are learned to generate a probability distribution over the C classes:

    p(c|i) = exp(e_c ∗ i) / Σ_{c′=1}^{C} exp(e_{c′} ∗ i),    (2)

where '∗' is the inner product between ℓ2-normalized inputs.
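For concreteness, here is a minimal PyTorch sketch of the two decoding strategies in Eqs. 1-2; the shapes, names, and dummy inputs are our own illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

B, D, C = 4, 512, 150                       # pixels in a toy batch, feature dim, classes
i = torch.randn(B, D)                       # pixel embeddings from the encoder phi

# (a) Parametric softmax projection (Eq. 1): a pixel-wise linear layer rho,
#     whose weight rows w_c act as one learnable prototype per class.
rho = torch.nn.Linear(D, C, bias=False)     # W = [w_1, ..., w_C] in R^{C x D}
p_softmax = F.softmax(rho(i), dim=-1)       # p(c|i), shape (B, C)

# (b) Parametric pixel-query (Eq. 2): C learnable query vectors, scored by the
#     inner product between l2-normalized inputs (i.e., cosine similarity).
E = torch.nn.Parameter(torch.randn(C, D))   # query vectors e_1, ..., e_C
logits = F.normalize(i, dim=-1) @ F.normalize(E, dim=-1).t()
p_query = F.softmax(logits, dim=-1)         # p(c|i), shape (B, C)
```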
Figure 2. Architecture illustration of our non-learnable prototype based nonparametric segmentation model during the training phase. (Diagram omitted; it shows the encoder φ supervised by the three training losses L_CE (Eq. 7), L_PPC (Eq. 11), and L_PPD (Eq. 12).)

Prototype-based Classification. Prototype-based classification [31, 33] has been studied for a long time, dating back to the nearest neighbors algorithm [23] in machine learning and prototype theory [60, 91] in cognitive science. Its prevalence stems from its intuitive idea: represent classes by prototypes, and refer to prototypes for classification. Let {p_m}_{m=1}^{M} be a set of prototypes that are representative of their corresponding classes {c_{p_m} ∈ {1, ···, C}}_m. For a data sample i, prediction is made by comparing i with {p_m}_m, and taking the class of the winning prototype as the response:

    ĉ_i = c_{p_{m∗}}, with m∗ = argmin_m {⟨i, p_m⟩}_{m=1}^{M},    (3)

where i and {p_m}_m are embeddings of the data sample and prototypes in a feature space, and ⟨·,·⟩ stands for the distance measure, which is typically set as the ℓ2 distance (i.e., ||i − p_m||) [123], yet other proximities can be applied.

Further, Eqs. 1-2 can be formulated in a unified form:

    p(c|i) = exp(−⟨i, g_c⟩) / Σ_{c′=1}^{C} exp(−⟨i, g_{c′}⟩),    (4)

where g_c ∈ R^D can be either w_c in Eq. 1 or e_c in Eq. 2.

With Eqs. 3-4, we are ready to answer questions ❶❷. Both types of methods are based on learnable prototypes; they are parametric models in the sense that they learn one prototype g_c, i.e., linear weight w_c or query vector e_c, for each class c (i.e., M = C). Thus one can consider that softmax projection based methods 'secretly' learn the query vectors. As for the difference, in addition to different distance measures (i.e., inner product vs cosine similarity), pixel-query based methods [118, 139] can feed the queries into cross-attention decoder layers for cross-class context exchanging, whereas softmax projection based counterparts only leverage the learned class weights within the softmax layer.

With this unified view of parametric prototype learning, a few intrinsic yet long ignored issues in this field unfold:

First, prototype selection [36] is a vital aspect in the design of a prototype based learner – prototypes should be typical for their classes. Nevertheless, existing semantic segmentation algorithms often describe each class by only one prototype, bearing no intra-class variation. Moreover, the prototypes are directly learned in a fully parametric manner, without accounting for their representative ability.

Second, the amount of learnable prototype parameters, i.e., {g_c ∈ R^D}_{c=1}^{C}, grows with the number of classes. This may hinder scalability, especially when a large number of classes are present. For example, if there are 800 classes and the pixel feature dimensionality is 512, at least 0.4M parameters are needed for prototype learning alone, making large-vocabulary segmentation a hard task. Moreover, if we want to represent each class by ten prototypes, instead of only one, we need to learn 4M prototype parameters.

Third, Eq. 3 intuitively shows that prototype based learners make metric comparisons of data [8]. However, existing algorithms often supervise dense segmentation representations by directly optimizing the accuracy of pixel-wise prediction (e.g., cross-entropy loss), ignoring known inductive biases [83, 84], e.g., intra-class compactness, about the feature distribution. This hinders the discrimination potential of the learned segmentation features, as suggested by much literature in representation learning [76, 95, 114].

After tackling question ❸, in the next section we detail our non-learnable prototype based nonparametric segmentation method, which serves as a solid response to question ❹.
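Read this way, both decoders are instances of one generic classifier; the short sketch of Eq. 4 below (our own illustrative code, with a pluggable distance) makes the equivalence explicit.

```python
import torch
import torch.nn.functional as F

def parametric_prototype_probs(i, G, dist="cosine"):
    """Eq. 4: p(c|i) = softmax over negative distances to learnable prototypes.
    i: (B, D) pixel embeddings; G: (C, D) prototypes g_c (w_c or e_c)."""
    if dist == "cosine":                          # pixel-query flavor (Eq. 2)
        d = 1.0 - F.normalize(i, dim=-1) @ F.normalize(G, dim=-1).t()
    else:                                         # inner-product flavor (Eq. 1)
        d = -(i @ G.t())                          # <i, g_c> = -w_c^T i
    return F.softmax(-d, dim=-1)                  # (B, C)
```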
3. Non-Learnable Prototype based Nonparametric Semantic Segmentation

We build a nonparametric segmentation framework that conducts dense prediction by a set of non-learnable class prototypes, and directly supervises the pixel embedding space via a prototype-anchored metric learning scheme (Fig. 2).

Non-Learnable Prototype based Pixel Classification. As normal, an encoder network (FCN based or attention based), i.e., φ, is first adopted to map the input image I ∈ R^{h×w×3} to a 3D feature tensor I ∈ R^{H×W×D}. For pixel-wise C-way classification, rather than prior semantic segmentation models that automatically learn C class weights {w_c ∈ R^D}_{c=1}^{C} (cf. Eq. 1) or C query vectors {e_c ∈ R^D}_{c=1}^{C} (cf. Eq. 2), we refer to a group of CK non-learnable prototypes, i.e., {p_{c,k} ∈ R^D}_{c,k=1}^{C,K}, which are based solely on class data sub-centers. More specifically, each class c ∈ {1, ···, C} is represented by a total of K prototypes {p_{c,k}}_{k=1}^{K}, and prototype p_{c,k} is determined as the center of the k-th sub-cluster of training pixel samples belonging to class c, in the embedding space of φ. In this way, the prototypes can comprehensively capture characteristic properties of the corresponding classes, without introducing extra learnable parameters outside φ. Analogous to Eq. 3, the category prediction of each pixel i ∈ I is achieved by a winner-take-all classification:

    ĉ_i = c∗, with (c∗, k∗) = argmin_{(c,k)} {⟨i, p_{c,k}⟩}_{c,k=1}^{C,K},    (5)

where i ∈ R^D stands for the ℓ2-normalized embedding of pixel i, i.e., i ∈ I, and the distance measure ⟨·,·⟩ is defined as the negative cosine similarity, i.e., ⟨i, p⟩ = −i^⊤ p.

With this exemplar-based reasoning mode, we first define the probability distribution of pixel i over the C classes:

    p(c|i) = exp(−s_{i,c}) / Σ_{c′=1}^{C} exp(−s_{i,c′}), with s_{i,c} = min{⟨i, p_{c,k}⟩}_{k=1}^{K},    (6)

where the pixel-class distance s_{i,c} ∈ [−1, 1] is computed as the distance to the closest prototype of class c. Given the groundtruth class of each pixel i, i.e., c_i ∈ {1, ···, C}, the cross-entropy loss can therefore be used for training:

    L_CE = −log p(c_i|i) = −log [ exp(−s_{i,c_i}) / (exp(−s_{i,c_i}) + Σ_{c′≠c_i} exp(−s_{i,c′})) ].    (7)

In our case, Eq. 7 can be viewed as pushing pixel i closer to the nearest prototype of its corresponding class, i.e., c_i, and further from other close prototypes of irrelevant classes, i.e., c′ ≠ c_i. However, only adopting such a training objective is not enough, for two reasons. First, Eq. 7 only considers pixel-class distances, e.g., s_{i,c}, without addressing within-class pixel-prototype relations, e.g., ⟨i, p_{c_i,k}⟩. For example, for discriminative representation learning, pixel i is expected to be pushed further close to a certain prototype (i.e., a particularly suitable pattern) of class c_i, and distant from other prototypes (i.e., other irrelevant but within-class patterns) of class c_i. Eq. 7 cannot capture this nature. Second, as the pixel-class distances are normalized across all classes (cf. Eq. 6), Eq. 7 only optimizes the relative relations between intra-class (i.e., s_{i,c_i}) and inter-class (i.e., {s_{i,c′}}_{c′≠c_i}) distances, instead of directly regularizing the cosine distances between pixels and classes. For example, when the intra-class distance s_{i,c_i} of pixel i is relatively smaller than the other inter-class distances {s_{i,c′}}_{c′≠c_i}, the penalty from Eq. 7 will be small, but the intra-class distance s_{i,c_i} might still be large [89, 134]. Next we first elaborate on our within-class online clustering strategy and then detail our two extra training objectives, which rely on prototype assignments (i.e., clustering results) and address the above two issues respectively.

Within-Class Online Clustering. We approach online clustering for prototype selection and assignment: pixel samples within the same class are assigned to the prototypes belonging to that class, and the prototypes are then updated according to the assignments. Clustering imposes a natural bottleneck [55] that forces the model to discover intra-class discriminative patterns yet discard instance-specific details. Thus the prototypes, selected as the sub-cluster centers, are typical of the corresponding classes. Conducting clustering online makes our method scalable to large amounts of data, unlike offline clustering, which requires multiple passes over the entire dataset for feature computation [13].

Formally, given pixels I^c = {i_n}_{n=1}^{N} in a training batch that belong to class c (i.e., c_{i_n} = c), our goal is to map the pixels I^c to the K prototypes {p_{c,k}}_{k=1}^{K} of class c. We denote this pixel-to-prototype mapping as L^c = [l_{i_n}]_{n=1}^{N} ∈ {0,1}^{K×N}, where l_{i_n} = [l_{i_n,k}]_{k=1}^{K} ∈ {0,1}^{K} is the one-hot assignment vector of pixel i_n over the K prototypes. The optimization of L^c is achieved by maximizing the similarity between the pixel embeddings, i.e., X^c = [i_n]_{n=1}^{N} ∈ R^{D×N}, and the prototypes, i.e., P^c = [p_{c,k}]_{k=1}^{K} ∈ R^{D×K}:

    max_{L^c} Tr(L^{c⊤} P^{c⊤} X^c),
    s.t. L^c ∈ {0,1}^{K×N}, L^{c⊤} 1_K = 1_N, L^c 1_N = (N/K) 1_K,    (8)

where 1_K denotes the K-dimensional vector of all ones. The unique assignment constraint, i.e., L^{c⊤} 1_K = 1_N, ensures that each pixel is assigned to one and only one prototype. The equipartition constraint, i.e., L^c 1_N = (N/K) 1_K, enforces that on average each prototype is selected at least N/K times in the batch [13]. This prevents the trivial solution in which all pixel samples are assigned to a single prototype, and eventually benefits the representative ability of the prototypes. To solve Eq. 8, one can relax L^c to be an element of the transportation polytope [2, 24]:

    max_{L^c} Tr(L^{c⊤} P^{c⊤} X^c) + κ h(L^c),
    s.t. L^c ∈ R_+^{K×N}, L^{c⊤} 1_K = 1_N, L^c 1_N = (N/K) 1_K,    (9)

where h(L^c) = Σ_{n,k} −l_{i_n,k} log l_{i_n,k} is an entropy, and κ > 0 is a parameter that controls the smoothness of the distribution. With the soft assignment relaxation and the extra regularization term h(L^c), the solver of Eq. 9 can be given as [24]:

    L^c = diag(u) exp(P^{c⊤} X^c / κ) diag(v),    (10)

where u ∈ R^K and v ∈ R^N are renormalization vectors, computed by a few steps of Sinkhorn-Knopp iteration [24]. Our online clustering is highly efficient on GPU, as it only involves a couple of matrix multiplications; in practice, clustering 10K pixels into 10 prototypes takes only 2.5 ms.
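Two short PyTorch sketches ground this formulation; both are our own illustrations with an assumed (C, K, D) prototype layout, not the authors' released code. The first implements the nearest-prototype prediction and cross-entropy of Eqs. 5-7:

```python
import torch
import torch.nn.functional as F

def class_scores(i, P):
    """i: (B, D) l2-normalized pixel embeddings; P: (C, K, D) l2-normalized
    prototypes. Returns -s_{i,c} of Eq. 6: per class, the cosine similarity
    to its closest prototype (min distance == max similarity)."""
    sim = torch.einsum("bd,ckd->bck", i, P)       # pixel-prototype cosine similarities
    return sim.max(dim=2).values                  # (B, C)

def predict_and_loss(i, P, labels):
    logits = class_scores(i, P)
    pred = logits.argmax(dim=1)                   # winner-take-all prediction (Eq. 5)
    loss_ce = F.cross_entropy(logits, labels)     # Eq. 7 on the distance-based logits
    return pred, loss_ce
```

The second is a compact Sinkhorn-Knopp solver for Eqs. 9-10 in the style popularized by [13]; the iteration count and the in-place normalization scheme are our assumptions:

```python
import torch

@torch.no_grad()
def sinkhorn(scores, kappa=0.05, n_iters=3):
    """scores: (K, N) similarity matrix P^{cT} X^c for one class.
    Returns the soft assignment L^c of Eq. 10."""
    L = torch.exp(scores / kappa)                 # exp(P^T X / kappa), cf. Eq. 10
    L /= L.sum()
    K, N = L.shape
    for _ in range(n_iters):
        L /= L.sum(dim=1, keepdim=True)           # rows: equipartition over prototypes
        L /= K
        L /= L.sum(dim=0, keepdim=True)           # columns: one unit of mass per pixel
        L /= N
    return L * N                                  # each column sums to 1 (a soft one-hot)
```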
Pixel-Prototype Contrastive Learning. With the assignment probability matrix L^c = [l_{i_n}]_{n=1}^{N} ∈ [0,1]^{K×N}, we online group the training pixels I^c = {i_n}_{n=1}^{N} into the K prototypes {p_{c,k}}_{k=1}^{K} within class c. After all the samples in the current batch are processed, each pixel i is assigned to the k_i-th prototype of class c_i, where k_i = argmax_k {l_{i,k}}_{k=1}^{K} and l_{i,k} ∈ l_i. It is natural to derive a training objective for prototype assignment prediction, i.e., maximize the prototype assignment posterior probability. This can be viewed as a pixel-prototype contrastive learning strategy, and addresses the first limitation of Eq. 7:

    L_PPC = −log [ exp(i^⊤ p_{c_i,k_i} / τ) / (exp(i^⊤ p_{c_i,k_i} / τ) + Σ_{p^− ∈ P^−} exp(i^⊤ p^− / τ)) ],    (11)

where P^− = {p_{c,k}}_{c,k=1}^{C,K} \ {p_{c_i,k_i}}, and the temperature τ controls the concentration level of the representations. Intuitively, Eq. 11 enforces each pixel embedding i to be similar to its assigned ('positive') prototype p_{c_i,k_i}, and dissimilar to the other CK−1 irrelevant ('negative') prototypes P^−. Compared with prior pixel-wise metric learning based segmentation models [111], which consume numerous negative pixel samples, our method only needs CK prototypes for pixel-prototype contrast computation, neither causing large memory cost nor requiring heavy pixel pair-wise comparison.

Pixel-Prototype Distance Optimization. Building upon the relative comparison over pixel-class/-prototype distances, Eq. 7 and Eq. 11 inspire inter-class/-cluster discriminativeness, but less consider reducing the intra-cluster variation, i.e., making pixel features of the same prototype compact. Thus a compactness-aware loss is used for further regularizing the representations, by directly minimizing the distance between each embedded pixel and its assigned prototype:

    L_PPD = (1 − i^⊤ p_{c_i,k_i})².    (12)

Note that both i and p_{c_i,k_i} are ℓ2-normalized. This training objective minimizes intra-cluster variations while maintaining separation between features with different prototype assignments, making our model more robust against outliers.

Network Learning and Prototype Update. Our model is a nonparametric approach that learns semantic segmentation by directly optimizing the pixel embedding space φ. It is called nonparametric because it constructs prototype hypotheses directly from the training pixel samples themselves. Thus the parameters of the feature extractor φ are learned through stochastic gradient descent, by minimizing the combinatorial loss over all the training pixel samples:

    L_SEG = L_CE + λ_1 L_PPC + λ_2 L_PPD.    (13)

Meanwhile, the non-learnable prototypes {p_{c,k}}_{c,k=1}^{C,K} are not learned by stochastic gradient descent, but are computed as the centers of the corresponding embedded pixel samples. To do so, we let the prototypes evolve continuously by accounting for the online clustering results. Particularly, after each training iteration, each prototype is updated as:

    p_{c,k} ← µ p_{c,k} + (1 − µ) ī_{c,k},    (14)

where µ ∈ [0, 1] is a momentum coefficient, and ī_{c,k} indicates the ℓ2-normalized mean vector of the embedded training pixels that are assigned to prototype p_{c,k} by online clustering.
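A sketch of the two extra losses and the momentum update follows (again our illustration; λ1, λ2, τ, µ follow §5.1, and the re-normalization after the EMA step of Eq. 14 is our assumption for keeping prototypes on the unit sphere). Note that Eq. 11 reduces to a (C·K)-way cross-entropy once the negatives are recognized as all remaining prototypes.

```python
import torch
import torch.nn.functional as F

lambda1, lambda2, tau, mu = 0.01, 0.01, 0.1, 0.999   # Section 5.1 settings

def ppc_loss(i, P, cls_idx, proto_idx):
    """Eq. 11. i: (B, D) normalized embeddings; P: (C, K, D) normalized
    prototypes; cls_idx, proto_idx: (B,) class c_i and assigned prototype k_i."""
    C, K, D = P.shape
    logits = (i @ P.reshape(C * K, D).t()) / tau     # similarities to all CK prototypes
    target = cls_idx * K + proto_idx                 # flat index of the positive prototype
    return F.cross_entropy(logits, target)           # == -log of Eq. 11's softmax

def ppd_loss(i, pos_protos):
    """Eq. 12: (1 - i^T p)^2 for each pixel and its assigned prototype."""
    return ((1.0 - (i * pos_protos).sum(dim=-1)) ** 2).mean()

def seg_loss(l_ce, l_ppc, l_ppd):
    """Eq. 13: the combinatorial objective minimized by SGD over phi."""
    return l_ce + lambda1 * l_ppc + lambda2 * l_ppd

@torch.no_grad()
def update_prototype(P, c, k, mean_emb):
    """Eq. 14: EMA toward the normalized mean of the pixels assigned to p_{c,k};
    re-normalizing afterwards (our assumption) keeps p_{c,k} on the unit sphere."""
    P[c, k] = F.normalize(mu * P[c, k] + (1.0 - mu) * mean_emb, dim=-1)
```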
With the clear meaning of the prototypes, our segmentation procedure can be intuitively understood as retrieving the most similar prototypes (sub-class centers). Fig. 3 provides prototype retrieval results for person and car with K = 3 prototypes each. The prototypes are associated with different colors (i.e., red, green, and blue). For each pixel, its distance to the closest prototype is visualized using the corresponding prototype color. As can be seen, the prototypes correspond well to meaningful patterns within classes, validating their representativeness.

Figure 3. Visualization of pixel-prototype similarity for person (top) and car (bottom) classes. Please refer to §3 for details.
4. Related Work

In this section, we review representative work in semantic segmentation, prototype learning and metric learning.

Semantic Segmentation. Recent years have witnessed remarkable progress in semantic segmentation, due to the fast evolution of backbone architectures – from CNN-based (e.g., VGG [97], ResNet [45]) to Transformer-like [105] (e.g., ViT [30], Swin [78]) – and segmentation models – from FCNs [79] to attention networks (e.g., SegFormer [118]). Specifically, FCN [79] is a milestone; it learns dense prediction efficiently. Since it was proposed, numerous efforts have been devoted to improving FCN, by, for example, enlarging the receptive field [15, 16, 25, 124, 128, 135]; strengthening context cues [4, 43, 47, 48, 56, 57, 70, 75, 77, 81, 90, 126, 128, 129, 132, 138, 141]; leveraging boundary information [6, 14, 27, 66, 127, 131, 137]; incorporating neural attention [34, 41, 42, 49, 50, 63, 68, 101, 110, 112, 136]; or automating network engineering [18, 69, 72, 85]. Lately, Transformer based solutions [20, 100, 118, 139] attained growing attention; enjoying the flexibility in long-range dependency modeling, fully attentive solutions yield impressive results. Different from current approaches that are typically built upon learnable prototypes, in the pre-deep era many segmentation systems were nonparametric [32, 73, 74, 80, 102, 103]. By absorbing their case-based reasoning ideas, we build a nonparametric segmentation network, which explicitly derives prototypes from sample clusters and hence directly optimizes the embedding space with distance metric constraints. In [62, 111], while a cluster-/pixel-level metric loss is adopted to regularize the representation, the pixel class is still inferred via parametric softmax. [53] purely relies on class embeddings, which, however, are fully trainable. Thus [53, 62, 111] are all parametric methods. As far as we know, [52] is the only non-learnable prototype, deep learning based semantic segmentation model. But [52] treats image regions as prototypes, incurring huge memory and computational demand. Besides, [52] only considers the relative difference between inter- and intra-class sample-prototype distances, like the parametric counterparts. Our method is more principled, with fewer heuristic designs. Unlike [52], we represent prototypes as sub-cluster centers and obtain online assignments, allowing our method to scale gracefully to any dataset size. We encourage a sparse distance distribution with compactness-awareness, reinforcing the embedding discrimination. With a broader view, a few embedding based instance segmentation approaches [26, 86] can be viewed as nonparametric, i.e., they treat instance centroids as prototypes.

Prototype Learning. Cognitive psychological studies evidence that people use past cases as models when learning to solve problems [1, 87, 125]. Among various machine learning algorithms, ranging from classical statistics based methods to Support Vector Machines to Multilayer Perceptrons [9, 31, 33, 96], prototype based classification gains particular interest, due to its exemplar-driven nature and intuitive interpretation: observations are directly compared with representative examples. Based on the nearest neighbors rule – the earliest prototype learning method [23] – many famous nonparametric classifiers were proposed [36], such as Learning Vector Quantization (LVQ) [61], generalized LVQ [94], and Neighborhood Component Analysis [37, 93]. There has been a recent surge of interest to integrate deep learning into prototype learning, showing good potential in few-shot [98], zero-shot [54], and unsupervised learning [116, 120], as well as supervised classification [38, 83, 115, 123] and interpretable networks [65]. Remarkably, as many few-shot segmentation models can be viewed as prototype-based networks [29, 106, 109], our work sheds light on the possibility of closer collaboration between the two segmentation fields.

Metric Learning. The selection of a proper distance measure impacts the success of prototype based learners [8]; metric learning and prototype learning are naturally related. As the literature on metric learning is vast [58], only the most relevant works are discussed. The goal of metric learning is to learn a distance metric/embedding such that similar samples are pulled together and dissimilar samples are pushed away. Learning deep representations with metric loss functions (e.g., contrastive loss [40], triplet loss [95], n-pair loss [99]) has shown significant benefit in applications such as image retrieval [107] and face recognition [95]. Recently, metric learning showed good potential in unsupervised representation learning. Specifically, many instance-based approaches use the contrastive loss [39, 88] to explicitly compare pairs of image representations, so as to push away features from different images while pulling together those from transformations of the same image [17, 19, 44, 46, 88]. Since computing all the pairwise comparisons on a large dataset is challenging, some clustering-based methods turn to discriminating between groups of images with similar features instead of individual images [2, 5, 12, 13, 64, 104, 119, 121, 122]. Our prototype-anchored metric learning strategy shares a similar spirit of posing metric constraints over prototype (cluster) assignments, but it is designed to reshape the pixel segmentation embedding space with explicit supervision.

5. Experiment

5.1. Experimental Setup

Datasets. Our experiments are conducted on three datasets:
• ADE20K [140] is a large-scale scene parsing benchmark that covers 150 stuff/object categories. The dataset is divided into 20k/2k/3k images for train/val/test.
• Cityscapes [22] has 5k finely annotated urban scene images, with 2,975/500/1,524 for train/val/test. The segmentation performance is evaluated over 19 challenging categories, such as rider, bicycle, and traffic light.
• COCO-Stuff [10] has 10k images gathered from COCO [71], with 9k and 1k for train and test, respectively. There are 172 semantic categories in total, including 80 objects, 91 stuffs and 1 unlabeled.

Training. Our method is implemented on MMSegmentation [21], following the default training settings. In particular, all backbones are initialized using corresponding weights pre-trained on ImageNet-1K [92], while the remaining layers are randomly initialized. We use standard data augmentation techniques, including random scale jittering with a factor in [0.5, 2], random horizontal flipping, random cropping, as well as random color jittering. We train models using SGD/AdamW for FCN-/attention-based models, respectively. The learning rate is scheduled following the polynomial annealing policy. In addition, for Cityscapes, we use a batch size of 8 and a training crop size of 768×768. For ADE20K and COCO-Stuff, we use a crop size of 512×512 and train the models with batch size 16. The models are trained for 160k, 160k, and 80k iterations on Cityscapes, ADE20K and COCO-Stuff, respectively. Exceptionally, for the ablation study, we train models for 40k iterations. The hyper-parameters are empirically set to: K = 10, µ = 0.999, τ = 0.1, κ = 0.05, λ_1 = 0.01, λ_2 = 0.01.
learn a distance metric/embedding such that similar samples Testing. For ADE20K and COCO-Stuff, we rescale the
are pulled together and dissimilar samples are pushed away. short scale of the image to training crop size, with the as-
It has shown a significant benefit by learning deep represen- pect ratio kept unchanged. For Cityscapes, we adopt sliding
tation using metric loss functions (e.g., contrastive loss [40], window inference with the window size 768×768. For sim-
triplet loss [95], n-pair loss [99]) for applications (e.g., im- plicity, we do not apply any test-time data augmentation.
age retrieval [107], face recognition [95]). Recently, metric Our model is implemented in PyTorch and trained on eight
learning showed good potential in unsupervised representa- Tesla V100 GPUs with a 32GB memory per-card. Testing
tion learning. Specifically, many instance-based approaches is conducted on the same machine.
use the contrastive loss [39, 88] to explicitly compare pairs Baselines. We mainly compare with four widely recognized
of image representations, so as to push away features from segmentation models, i.e., two FCN based (i.e., FCN [79],
different images while pulling together those from trans- HRNet [108]) and two attention based (i.e., Swin [78] and
formations of the same image [17, 19, 44, 46, 88]. Since SegFormer [118]). For fair comparison, all the models are
computing all the pairwise comparisons on a large dataset based on our reproduction, following the hyper-parameter
is challenging, some clustering-based methods turn to dis- and augmentation recipes used in MMSegmentation [21].
criminate between groups of images with similar features Evaluation Metric. Following conventions [15, 79], mean
instead of individual images [2, 5, 12, 13, 64, 104, 119, intersection-over-union (mIoU) is adopted for evaluation.
121, 122]. Our prototype-anchored metric learning strat-
egy shares a similar spirit of posing metric constraints over 5.2. Comparison to State-of-the-Arts
prototype (cluster) assignments, but it is to reshape the pixel ADE20K [140] val. Table 1 reports comparisons with rep-
segmentation embedding space with explicit supervision. resentative models on ADE20K val. Our nonparametric

2587
Figure 4. Qualitative results of Segformer [118] and our approach (from left to right: ADE20K [140], Cityscapes [22], COCO-Stuff [10]).
Table 1. Quantitative results (§5.2) on ADE20K [140] val.

Method                          Backbone            # Param (M)   mIoU (%)
DeepLabV3+ [ECCV18] [16]        ResNet-101 [45]     62.7          44.1
OCR [ECCV20] [129]              HRNetV2-W48 [108]   70.3          45.6
MaskFormer [NeurIPS21] [20]     ResNet-101 [45]     60.0          46.0
UperNet [ECCV20] [117]          Swin-Base [78]      121.0         48.4
OCR [ECCV20] [129]              HRFormer-B [130]    70.3          48.7
SETR [CVPR21] [139]             ViT-Large [30]      318.3         50.2
Segmenter [ICCV21] [100]        ViT-Large [30]      334.0         51.8
MaskFormer† [NeurIPS21] [20]    Swin-Base [78]      102.0         52.7
FCN [CVPR15] [79]               ResNet-101 [45]     68.6          39.9
Ours                            ResNet-101 [45]     68.5          41.1 ↑ 1.2
HRNet [PAMI20] [108]            HRNetV2-W48 [108]   65.9          42.0
Ours                            HRNetV2-W48 [108]   65.8          43.0 ↑ 1.0
Swin [CVPR21] [78]              Swin-Base [78]      90.6          48.0
Ours                            Swin-Base [78]      90.5          48.6 ↑ 0.6
SegFormer [NeurIPS21] [118]     MiT-B4 [118]        64.1          50.9
Ours                            MiT-B4 [118]        64.0          51.7 ↑ 0.8
†: backbone is pre-trained on ImageNet-22K.

Table 3. Quantitative results (§5.2) on COCO-Stuff [10] test.

Method                          Backbone            # Param (M)   mIoU (%)
SVCNet [CVPR19] [28]            ResNet-101 [45]     -             39.6
DANet [CVPR19] [34]             ResNet-101 [45]     69.1          39.7
SpyGR [CVPR20] [67]             ResNet-101 [45]     -             39.9
MaskFormer [NeurIPS21] [20]     ResNet-101 [45]     60.0          39.8
ACNet [ICCV19] [35]             ResNet-101 [45]     -             40.1
OCR [ECCV20] [129]              HRNetV2-W48 [108]   70.3          40.5
FCN [CVPR15] [79]               ResNet-101 [45]     68.6          32.5
Ours                            ResNet-101 [45]     68.5          34.0 ↑ 1.5
HRNet [PAMI21] [108]            HRNetV2-W48 [108]   65.9          38.7
Ours                            HRNetV2-W48 [108]   65.8          39.9 ↑ 1.2
Swin [CVPR21] [78]              Swin-Base [78]      90.6          41.5
Ours                            Swin-Base [78]      90.5          42.4 ↑ 0.9
SegFormer [NeurIPS21] [118]     MiT-B4 [118]        64.1          42.5
Ours                            MiT-B4 [118]        64.0          43.3 ↑ 0.8
B4 [118] as the network backbone, our approach earns an
# Param mIoU mIoU score of 43.3%, establishing a new state-of-the-art.
Method Backbone
(M) (%)
PSPNet [CVPR17] [135] ResNet-101 [45] 65.9 78.4
Qualitative Results. Fig. 4 provides qualitative compari-
PSANet [ECCV18] [136] ResNet-101 [45] - 78.6 son of Ours against Segformer [118] on representative ex-
AAF [ECCV18] [59] ResNet-101 [45] - 79.1 amples in the three datasets. We observe that our approach
Segmenter [ICCV21] [100] ViT-Large [30] 322.0 79.1
ContrastiveSeg [ICCV21] [111] ResNet-101 [45] 58.0 79.2 is able to handle diverse challenging scenarios and produce
MaskFormer [NeurIPS21] [20] ResNet-101 [45] 60.0 80.3 more accurate results (as highlighted in red dashed boxes).
DeepLabV3+ [ECCV18] [16] ResNet-101 [45] 62.7 80.9
OCR [ECCV20] [129] HRNetV2-W48 [108] 70.3 81.1 5.3. Scalability to Large-Vocabulary Semantic Seg-
FCN [CVPR15] [79] 68.6 78.1
Ours
ResNet-101 [45]
68.5 79.1 ↑ 1.0 mentation
HRNet [PAMI20] [108] 65.9 80.4
HRNetV2-W48 [108] Today, rigorous evaluation of semantic segmentation
Ours 65.8 81.1 ↑ 0.7
Swin [CVPR21] [78]
Swin-Base [78]
90.6 79.8 models is mostly performed in a few category regime (e.g.,
Ours 90.5 80.6 ↑ 0.8
SegFormer [NeurIPS21] [118] 64.1 80.7
19/150/172 classes for Cityscapes/ADE20K/COCO-Stuff),
MiT-B4 [118] while the generalization to more natural large-vocabulary
Ours 64.0 81.3 ↑ 0.6
5.3. Scalability to Large-Vocabulary Semantic Segmentation

Today, rigorous evaluation of semantic segmentation models is mostly performed in a few-category regime (e.g., 19/150/172 classes for Cityscapes/ADE20K/COCO-Stuff), while generalization to the more natural large-vocabulary setting is ignored. In this section, we demonstrate the remarkable superiority of our method in the large-vocabulary setting. We start with the default setting in ADE20K [140], which includes 150 semantic concepts. Then, we gradually increase the number of concepts based on their visibility frequency, and train/test models on the selected number of classes. In this experiment, we use MiT-B2 [118] as the backbone and train models for 40k iterations.

The results are summarized in Table 4, from which we find that: i) For the parametric scheme, the amount of prototype parameters increases with vocabulary size. For the extreme case of 10 prototypes and 847 classes, the number of prototype parameters is 6.5 M, accounting for ∼20% of the total parameters (i.e., 33.96 M). In sharp contrast, our scheme requires no learnable prototype parameters at all.
Table 4. Scalability study (§5.3) of our nonparametric model against the parametric baseline (i.e., SegFormer [118]) on ADE20K [140]. For each model variant, we report its segmentation mIoU and the parameter count of the entire model as well as of the prototypes (in brackets).

# Classes   parametric, 1 proto      Ours, 1 proto             parametric, 10 protos    Ours, 10 protos
            mIoU / # Param (M)       mIoU / # Param (M)        mIoU / # Param (M)       mIoU / # Param (M)
150         45.1 / 27.48 (0.12)      45.5 ↑ 0.4 / 27.37 (0)    45.7 / 28.56 (1.2)       46.4 ↑ 0.7 / 27.37 (0)
300         36.5 / 27.62 (0.23)      37.2 ↑ 0.7 / 27.37 (0)    37.0 / 29.66 (2.3)       37.8 ↑ 0.8 / 27.37 (0)
500         25.7 / 27.80 (0.39)      26.8 ↑ 1.1 / 27.37 (0)    26.6 / 31.26 (3.9)       27.9 ↑ 1.3 / 27.37 (0)
700         19.8 / 27.98 (0.54)      21.2 ↑ 1.4 / 27.37 (0)    20.8 / 32.86 (5.4)       22.1 ↑ 1.3 / 27.37 (0)
847         16.5 / 28.11 (0.65)      18.1 ↑ 1.6 / 27.37 (0)    17.7 / 33.96 (6.5)       19.4 ↑ 1.7 / 27.37 (0)
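The bracketed prototype-parameter counts in Table 4 follow directly from D×C×K; a one-line check reproduces them, assuming an embedding dimension D = 768 (our inference from the reported numbers — the paper does not state it here).

```python
# Reproducing Table 4's bracketed prototype-parameter counts as D x C x K.
D = 768                                   # assumed embedding dimension
for C in (150, 300, 500, 700, 847):       # vocabulary sizes
    for K in (1, 10):                     # prototypes per class
        print(C, K, round(D * C * K / 1e6, 2), "M")   # e.g., 847, 10 -> 6.5 M
```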
Table 5. A set of ablative studies (§5.4) on ADE20K [140] val. All model variants use MiT-B2 [118] as the backbone.

(a) Training Objective L:
L_CE (Eq. 7)   L_PPC (Eq. 11)   L_PPD (Eq. 12)   mIoU (%)
✓              —                —                45.0
✓              ✓                —                45.9
✓              —                ✓                45.4
✓              ✓                ✓                46.4

(b) Prototype Number K:
K = 1: 45.5    K = 5: 46.0    K = 10: 46.4    K = 20: 46.5    K = 50: 46.4

(c) Momentum Coefficient µ:
µ = 0: 44.9    µ = 0.9: 45.9    µ = 0.99: 46.0    µ = 0.999: 46.4    µ = 0.9999: 46.3

(d) Distance Measure:
Standard: 45.7    Huberized: 45.2    Cosine: 46.4

ii) Our method achieves consistent performance elevations against the parametric counterpart under all settings. These results well demonstrate the utility of our nonparametric scheme for unrestricted open-vocabulary semantic segmentation.

5.4. Diagnostic Experiment

To investigate the effect of our core designs, we conduct ablative studies on ADE20K [140] val. We use MiT-B2 [118] as the backbone and train models for 40k iterations.

Training Objective. We first investigate our overall training objective (cf. Eq. 13). As shown in Table 5a, the model with L_CE alone achieves an mIoU score of 45.0%. Adding L_PPC or L_PPD individually brings gains (i.e., 0.9%/0.4%), revealing the value of explicitly learning pixel-prototype relations. Combining all the losses together leads to the best performance, yielding an mIoU score of 46.4%.
Prototype Number Per Class K. Table 5b reports the performance of our approach with regard to the number of prototypes per class. For K = 1, we directly represent each class as the mean embedding of its pixel samples. The pixel assignment is based simply on ground-truth labels, without using online clustering (Eqs. 8-9). This baseline obtains a score of 45.5%. Further, when using more prototypes (i.e., K = 5), we see a clear performance boost (i.e., 45.5% → 46.0%). The score further improves when allowing 10 prototypes; however, increasing K beyond 10 gives marginal returns in performance. As a result, we set K = 10 for a better trade-off between accuracy and computation cost. This study confirms our motivation to use multiple prototypes for capturing intra-class variations.

Coefficient µ. Table 5c quantifies the effect of the momentum coefficient (µ in Eq. 14), which controls the speed of prototype updating. The model performs reasonably well using a relatively large coefficient (i.e., µ ∈ [0.999, 0.9999]), showing that slow updating is beneficial. When µ is 0.9 or 0.99, the performance decreases, and it drops considerably in the extreme case of µ = 0.

Distance Measure. By default, we use the cosine distance (referred to as 'Cosine') to measure pixel-prototype similarity, as denoted in Eq. 6, Eq. 11 and Eq. 12. However, other choices are also applicable. Here we study two alternatives. The first is the standard Euclidean distance ('Standard'), i.e., ⟨x, y⟩ = ||x − y||_2. In contrast to 'Cosine', here x and y are un-normalized real-valued vectors. To handle the non-differentiability in 'Standard', we further study an approximated Huber-like function [51] ('Huberized'), i.e., ⟨x, y⟩ = δ(√(||x − y||²/δ² + 1) − 1). The hyper-parameter δ is empirically set to 0.1. As we find from Table 5d, 'Cosine' performs much better than the un-normalized Euclidean measurements. The Huberized norm does not show any advantage over 'Standard'.
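The three measures of Table 5d, as we read the formulas above, could be written as follows (an illustrative sketch; 'standard' is the plain ℓ2 distance, whose gradient is undefined at zero and which the Huberized form smooths).

```python
import torch
import torch.nn.functional as F

def distance(x, y, kind="cosine", delta=0.1):
    """The three pixel-prototype distances compared in Table 5d (our sketch)."""
    if kind == "cosine":                          # default in Eqs. 6, 11, 12
        return 1.0 - (F.normalize(x, dim=-1) * F.normalize(y, dim=-1)).sum(-1)
    d2 = ((x - y) ** 2).sum(-1)
    if kind == "standard":                        # l2 norm; non-differentiable at 0
        return torch.sqrt(d2)
    if kind == "huberized":                       # delta * (sqrt(d2/delta^2 + 1) - 1)
        return delta * (torch.sqrt(d2 / delta ** 2 + 1.0) - 1.0)
    raise ValueError(kind)
```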

6. Conclusion and Discussion

The vast majority of recent efforts in this field seek to learn parametric class representations for pixel-wise recognition. In contrast, this paper explores an exemplar-based regime. This leads to a nonparametric segmentation framework, where several typical points in the embedding space are selected as the class prototypical representation, and distance to the prototypes determines how a pixel sample is classified. It enjoys several advantages: i) explicit prototypical representation for class-level statistics modeling; ii) better generalization with nonparametric pixel-category prediction; and iii) direct optimization of the feature embedding space. Our framework is elegant, general, and yields outstanding performance. It also comes with some intriguing questions. For example, to pursue better interpretability, one can optimize the prototypes to directly resemble pixel- or region-level observations [52, 65]. Overall, we feel the results in this paper warrant further exploration in this direction.

Acknowledgements This work was supported by the CCF-Baidu Open Fund and ARC DECRA DE220101390.
References

[1] Agnar Aamodt and Enric Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1):39–59, 1994. 6
[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020. 4, 6
[3] Andreas Backhaus and Udo Seiffert. Classification in high-dimensional spectral data: Accuracy vs. interpretability vs. model size. Neurocomputing, 131:15–22, 2014. 2
[4] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI, 39(12):2481–2495, 2017. 5
[5] Miguel A Bautista, Artsiom Sanakoyeu, Ekaterina Tikhoncheva, and Björn Ommer. Cliquecnn: Deep unsupervised exemplar learning. In NeurIPS, 2016. 6
[6] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Semantic segmentation with boundary neural fields. In CVPR, 2016. 5
[7] Michael Biehl, Barbara Hammer, Petra Schneider, and Thomas Villmann. Metric learning for prototype-based classification. In Innovations in Neural Information Paradigms and Applications, pages 183–199. 2009. 2
[8] Michael Biehl, Barbara Hammer, and Thomas Villmann. Distance measures for prototype based classification. In International Workshop on Brain-Inspired Computing, 2013. 3, 6
[9] Christopher M Bishop. Pattern recognition. Machine learning, 128(9), 2006. 6
[10] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018. 2, 6, 7
[11] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 1
[12] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018. 6
[13] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020. 4, 6
[14] Liang-Chieh Chen, Jonathan T Barron, George Papandreou, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, 2016. 5
[15] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 40(4):834–848, 2017. 1, 5, 6
[16] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 5, 7
[17] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 6
[18] Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, and Zhangyang Wang. Fasterseg: Searching for faster real-time semantic segmentation. In ICLR, 2019. 5
[19] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 6
[20] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021. 1, 5, 7
[21] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020. 6
[22] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 6, 7
[23] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE TIT, 13(1):21–27, 1967. 1, 3, 6
[24] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013. 4
[25] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In CVPR, 2017. 5
[26] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation for autonomous driving. In CVPR Workshop, 2017. 6
[27] Henghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang Wang. Boundary-aware feature propagation for scene segmentation. In CVPR, 2019. 5
[28] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Semantic correlation promoted shape-variant context for segmentation. In CVPR, 2019. 7
[29] Nanqing Dong and Eric P Xing. Few-shot semantic segmentation with prototype learning. In BMVC, 2018. 6
[30] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020. 5, 7
[31] Richard O Duda, Peter E Hart, et al. Pattern classification and scene analysis, volume 3. Wiley New York, 1973. 1, 3, 6
[32] David Eigen and Rob Fergus. Nonparametric image parsing using adaptive neighbor sets. In CVPR, 2012. 5
[33] Jerome H Friedman. The elements of statistical learning: Data mining, inference, and prediction. Springer, 2009. 3, 6
[34] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019. 1, 5, 7
[35] Jun Fu, Jing Liu, Yuhang Wang, Yong Li, Yongjun Bao, Jinhui Tang, and Hanqing Lu. Adaptive context network for scene parsing. In ICCV, 2019. 7
[36] Salvador Garcia, Joaquin Derrac, Jose Cano, and Francisco Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE TPAMI, 34(3):417–435, 2012. 3, 6
[37] Jacob Goldberger, Geoffrey E Hinton, Sam Roweis, and Russ R Salakhutdinov. Neighbourhood components analysis. In NeurIPS, 2004. 6
[38] Samantha Guerriero, Barbara Caputo, and Thomas Mensink. Deepncm: Deep nearest class mean classifiers. In ICLR workshop, 2018. 6
[39] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010. 6
[40] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006. 2, 6
[41] Adam W Harley, Konstantinos G Derpanis, and Iasonas Kokkinos. Segmentation-aware convolutional networks using local attention masks. In ICCV, 2017. 5
[42] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In ICCV, 2019. 5
[43] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In CVPR, 2019. 5
[44] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 6
[45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 2, 5, 7
[46] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019. 6
[47] Chi-Wei Hsiao, Cheng Sun, Hwann-Tzong Chen, and Min Sun. Specialize and fuse: Pyramidal output representation for semantic segmentation. In ICCV, 2021. 5
[48] Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, and Junjie Yan. Class-wise dynamic graph convolution for semantic segmentation. In ECCV, 2020. 5
[49] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. 1, 5
[50] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019. 5
[51] Peter J Huber. Robust regression: asymptotics, conjectures and monte carlo. The Annals of Statistics, pages 799–821, 1973. 8
[52] Jyh-Jing Hwang, Stella X Yu, Jianbo Shi, Maxwell D Collins, Tien-Ju Yang, Xiao Zhang, and Liang-Chieh Chen. Segsort: Segmentation by discriminative sorting of segments. In ICCV, 2019. 5, 8
[53] Shipra Jain, Danda Pani Paudel, Martin Danelljan, and Luc Van Gool. Scaling semantic segmentation beyond 1k classes on a single gpu. In ICCV, 2021. 5
[54] Saumya Jetley, Bernardino Romera-Paredes, Sadeep Jayasumana, and Philip Torr. Prototypical priors: From improving classification to zero-shot learning. arXiv preprint arXiv:1512.01192, 2015. 6
[55] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In ICCV, 2019. 4
[56] Zhenchao Jin, Tao Gong, Dongdong Yu, Qi Chu, Jian Wang, Changhu Wang, and Jie Shao. Mining contextual information beyond image for semantic segmentation. In ICCV, 2021. 5
[57] Zhenchao Jin, Bin Liu, Qi Chu, and Nenghai Yu. Isnet: Integrate image-level and semantic-level context for semantic segmentation. In ICCV, 2021. 5
[58] Mahmut Kaya and Hasan Şakir Bilge. Deep metric learning: A survey. Symmetry, 11(9):1066, 2019. 6
[59] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X Yu. Adaptive affinity fields for semantic segmentation. In ECCV, 2018. 7
[60] Barbara J Knowlton and Larry R Squire. The learning of categories: Parallel brain systems for item memory and category knowledge. Science, 262(5140):1747–1749, 1993. 2, 3
[61] Teuvo Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998. 6
[62] Shu Kong and Charless C Fowlkes. Recurrent pixel embedding for instance grouping. In CVPR, 2018. 5
[63] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180, 2018. 5
[64] Junnan Li, Pan Zhou, Caiming Xiong, and Steven CH Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2020. 2, 6
[65] Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In AAAI, 2018. 6, 8
[66] Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai Tong. Improving semantic segmentation via decoupled body and edge supervision. In ECCV, 2020. 5
[67] Xia Li, Yibo Yang, Qijie Zhao, Tiancheng Shen, Zhouchen Lin, and Hong Liu. Spatial pyramid based graph reasoning for semantic segmentation. In CVPR, 2020. 7
[68] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In ICCV, 2019. 5
[69] Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, and Jian Sun. Learning dynamic routing for semantic segmentation. In CVPR, 2020. 5
[70] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017. 5
[71] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 6
[72] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019. 5
[73] Ce Liu, Jenny Yuen, and Antonio Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE TPAMI, 33(5):978–994, 2010. 5
[74] Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. IEEE TPAMI, 33(12):2368–2382, 2011. 5
[75] Mingyuan Liu, Dan Schonfeld, and Wei Tang. Exploit visual dependency relations for semantic segmentation. In CVPR, 2021. 5
[76] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016. 3
[77] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning Markov random field for semantic segmentation. IEEE TPAMI, 40(8):1814–1828, 2017. 5
[78] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 2, 5, 6, 7
[79] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 1, 5, 6, 7
[80] Tomasz Malisiewicz and Alexei A Efros. Recognition by association via learning per-exemplar distances. In CVPR, 2008. 5
[81] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018. 5
[82] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. TrackFormer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021. 1
[83] Pascal Mettes, Elise van der Pol, and Cees Snoek. Hyperspherical prototype networks. In NeurIPS, 2019. 3, 6
[84] Tom M Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, 1980. 3
[85] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In CVPR, 2019. 5
[86] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, 2019. 6
[87] Allen Newell, Herbert Alexander Simon, et al. Human problem solving, volume 104. 1972. 6
[88] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 6
[89] Tianyu Pang, Kun Xu, Yinpeng Dong, Chao Du, Ning Chen, and Jun Zhu. Rethinking softmax cross-entropy loss for adversarial robustness. In ICLR, 2020. 2, 4
[90] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 5
[91] Eleanor H Rosch. Natural categories. Cognitive Psychology, 4(3):328–350, 1973. 2, 3
[92] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 6
[93] Ruslan Salakhutdinov and Geoff Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial Intelligence and Statistics, pages 412–419, 2007. 6
[94] Atsushi Sato and Keiji Yamada. Generalized learning vector quantization. In NeurIPS, 1995. 6
[95] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015. 3, 6
[96] John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge University Press, 2004. 6
[97] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 5
[98] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017. 6
[99] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016. 6
[100] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021. 1, 5, 7
[101] Guolei Sun, Wenguan Wang, Jifeng Dai, and Luc Van Gool. Mining cross-image semantics for weakly supervised semantic segmentation. In ECCV, 2020. 5
[102] Joseph Tighe and Svetlana Lazebnik. SuperParsing: scalable nonparametric image parsing with superpixels. In ECCV, 2010. 5
[103] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE TPAMI, 30(11):1958–1970, 2008. 5
[104] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. SCAN: Learning to classify images without labels. In ECCV, 2020. 6
[105] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1, 5
[106] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016. 6
[107] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014. 6
[108] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE TPAMI, 2020. 2, 6, 7
[109] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In ICCV, 2019. 6
[110] Wenguan Wang, Tianfei Zhou, Siyuan Qi, Jianbing Shen, and Song-Chun Zhu. Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE TPAMI, 2021. 5
[111] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In ICCV, 2021. 2, 5, 7
[112] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018. 5
[113] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In CVPR, 2021. 1
[114] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016. 3
[115] Zhirong Wu, Alexei A Efros, and Stella X Yu. Improving generalization via scalable neighborhood component analysis. In ECCV, 2018. 2, 6
[116] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. 2, 6
[117] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018. 7
[118] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021. 1, 2, 3, 5, 6, 7, 8
[119] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016. 6
[120] Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, and Zeynep Akata. Attribute prototype network for zero-shot learning. In NeurIPS, 2020. 6
[121] Yaling Tao, Kentaro Takagi, and Kouta Nakata. Clustering-friendly representation learning via instance discrimination and feature decorrelation. In ICLR, 2021. 6
[122] Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan. ClusterFit: Improving generalization of visual representations. In CVPR, 2020. 6
[123] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Robust classification with convolutional prototype learning. In CVPR, 2018. 3, 6
[124] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes. In CVPR, 2018. 5
[125] Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12):1551–1558, 2021. 6
[126] Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for scene segmentation. In CVPR, 2020. 5
[127] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, 2018. 5
[128] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. 5
[129] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020. 5, 7
[130] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. HRFormer: High-resolution transformer for dense prediction. In NeurIPS, 2021. 7
[131] Yuhui Yuan, Jingyi Xie, Xilin Chen, and Jingdong Wang. SegFix: Model-agnostic boundary refinement for segmentation. In ECCV, 2020. 5
[132] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018. 5
[133] Kai Zhang, James T Kwok, and Bahram Parvin. Prototype vector machine for large scale semi-supervised learning. In ICML, 2009. 2
[134] Xiao Zhang, Rui Zhao, Yu Qiao, and Hongsheng Li. RBF-Softmax: Learning deep representative prototypes with radial basis function softmax. In ECCV, 2020. 2, 4
[135] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017. 1, 5, 7
[136] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. PSANet: Point-wise spatial attention network for scene parsing. In ECCV, 2018. 5, 7
[137] Mingmin Zhen, Jinglu Wang, Lei Zhou, Shiwei Li, Tianwei Shen, Jiaxiang Shang, Tian Fang, and Long Quan. Joint semantic segmentation and boundary detection using iterative pyramid contexts. In CVPR, 2020. 5
[138] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015. 5
[139] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021. 1, 2, 3, 5, 7
[140] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017. 2, 6, 7, 8
[141] Tianfei Zhou, Liulei Li, Xueyi Li, Chun-Mei Feng, Jianwu Li, and Ling Shao. Group-wise learning for weakly supervised semantic segmentation. IEEE TIP, 31:799–811, 2021. 5