Prototype Based Deep Learning Paper 2 Zhou
…dence in cognitive science [60, 91]): data samples are classified based on their proximity to representative prototypes of classes. With this perspective, in §2, we first answer question ① by pointing out that most modern segmentation methods, from softmax based to pixel-query based, from FCN based to attention based, fall into one grand category: parametric models based on learnable prototypes. Consider a segmentation task with C semantic classes. Most existing efforts seek to directly learn C class-wise prototypes – softmax weights or query vectors – for parametric, pixel-wise classification. Hence question ② becomes more fundamental: ③ What are the limitations of this learnable prototype based parametric paradigm? and ④ How can these limitations be addressed?

Driven by question ③, we find three critical limitations: First, usually only a single prototype is learned per class, which is insufficient to describe rich intra-class variance. The prototypes are simply learned in a fully parametric manner, without considering their representative ability. Second, to map an H×W×D image feature tensor into an H×W×C semantic mask, at least D×C parameters are needed for prototype learning. This hurts generalizability [115], especially in the large-vocabulary case; for instance, if there are 800 classes and D = 512, we need 0.4M learnable prototype parameters alone (800 × 512 = 409,600 ≈ 0.4M). Third, with the cross-entropy loss, only the relative relations between intra-class and inter-class distances are optimized [89, 111, 134]; the actual distances between pixels and prototypes, i.e., intra-class compactness, are ignored.

As a response to question ④, in §3, we develop a nonparametric segmentation framework based on non-learnable prototypes. Specifically, building upon the ideas of prototype learning [116, 133] and metric learning [40, 64], it is fully aware of the limitations of its parametric counterpart. Independent of specific backbone architectures (FCN based or attention based), our method is general and brings insights into segmentation model design and training. For model design, our method explicitly sets sub-class centers, in the pixel embedding space, as the prototypes. Each pixel is predicted to be in the same class as its nearest prototype, without relying on extra learnable parameters. For training, as the prototypes are representative of the dataset, we can directly pose known inductive biases (e.g., intra-class compactness, inter-class separation) as extra optimization criteria and efficiently shape the whole embedding space, instead of optimizing the prediction accuracy only. Our model has three appealing advantages: First, each class is abstracted by a set of prototypes, well capturing class-wise characteristics and intra-class variance. With the clear meaning of the prototypes, interpretability is also enhanced – the prediction of each pixel can be intuitively understood as a reference to its closest class center in the embedding space [3, 7]. Second, due to the nonparametric nature, generalizability is improved. Large-vocabulary semantic segmentation can also be handled efficiently, as the amount of learnable prototype parameters is no longer constrained by the number of classes (i.e., 0 vs D×C). Third, via prototype-anchored metric learning, the pixel embedding space is shaped to be well-structured, eventually benefiting segmentation prediction.

By answering questions ①-④, we formalize prior methods within a learnable prototype based, parametric framework, and link this field to prototype learning and metric learning. We provide a literature review and related discussions in §4. In §5.2, we show our method achieves impressive results on famous datasets (i.e., ADE20K [140], Cityscapes [22], COCO-Stuff [10]) with top-leading FCN based and attention based segmentation models (i.e., HRNet [108], Swin [78], SegFormer [118]) and backbones (i.e., ResNet [45], HRNet [108], Swin [78], MiT [118]). Compared with the parametric counterparts, our method does not cause any extra computational overhead during testing, while reducing the amount of learnable parameters. In §5.3, we demonstrate that our method consistently performs well when increasing the number of semantic classes from 150 to 847. Accompanied by a set of ablative studies in §5.4, our extensive experiments verify the power of our idea and the efficacy of our algorithm. Finally, we draw conclusions in §6. This work is expected to open a new venue for future research in this field.

2. Existing Semantic Segmentation Models as Parametric Prototype Learning

Next we first formalize the two existing mask decoding strategies mentioned in §1, and then answer question ① from a unified view of parametric prototype learning.

Parametric Softmax Projection. Almost all FCN-like and many attention-based segmentation models adopt this strategy. Their models comprise two learnable parts: i) an encoder φ for dense visual feature extraction, and ii) a classifier ρ (i.e., projection head) that projects pixel features into the semantic label space. For each pixel example i, its embedding i ∈ R^D, extracted from φ, is fed into ρ for C-way classification:

$$p(c|i) = \frac{\exp(w_c^{\top} i)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} i)}, \qquad (1)$$

where p(c|i) ∈ [0, 1] is the probability of i being assigned to class c. ρ is a pixel-wise linear layer, parameterized by W = [w_1, ···, w_C] ∈ R^{C×D}; w_c ∈ R^D is a learnable projection vector for the c-th class; the bias term is omitted for brevity.

Parametric Pixel-Query. A few attention-based segmentation networks [118, 139] work in a more 'Transformer-like' manner: given the pixel embedding i ∈ R^D, a set of C query vectors, i.e., E = [e_1, ···, e_C] ∈ R^{C×D}, is learned to generate a probability distribution over the C classes:

$$p(c|i) = \frac{\exp(e_c * i)}{\sum_{c'=1}^{C} \exp(e_{c'} * i)}, \qquad (2)$$

where '∗' is the inner product between ℓ2-normalized inputs.
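To make the shared structure of Eqs. 1-2 concrete, the following minimal PyTorch sketch (ours, not code from any of the cited models; all names are illustrative) implements both decoding strategies side by side:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: N pixels, D-dim embeddings, C classes.
N, D, C = 4096, 512, 150
feats = torch.randn(N, D)                  # pixel embeddings i from the encoder phi

# Eq. 1: softmax projection -- a learnable linear layer W in R^{C x D}.
proj = torch.nn.Linear(D, C, bias=False)   # rows of proj.weight are the w_c
p_proj = proj(feats).softmax(dim=1)        # p(c|i) via w_c^T i

# Eq. 2: pixel-query -- C learnable query vectors, compared on the
# l2-normalized sphere (the '*' inner product of Eq. 2).
E = torch.nn.Parameter(torch.randn(C, D))  # query vectors e_c
sim = F.normalize(feats, dim=1) @ F.normalize(E, dim=1).t()
p_query = sim.softmax(dim=1)               # p(c|i) via e_c * i

# In both cases, one learnable D-dim vector per class plays the prototype role.
```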
Figure 2. Architecture illustration of our non-learnable prototype based nonparametric segmentation model during the training phase.
Prototype-based Classification. Prototype-based classification [31, 33] has been studied for a long time, dating back to the nearest neighbors algorithm [23] in machine learning and prototype theory [60, 91] in cognitive science. Its prevalence stems from its intuitive idea: represent classes by prototypes, and refer to prototypes for classification. Let {p_m}_{m=1}^M be a set of prototypes that are representative of their corresponding classes {c_{p_m} ∈ {1, ···, C}}_m. For a data sample i, prediction is made by comparing i with {p_m}_m and taking the class of the winning prototype as the response:

$$\hat{c}_i = c_{p_{m^*}}, \quad \text{with} \quad m^* = \arg\min_{m} \{\langle i, p_m \rangle\}_{m=1}^{M}, \qquad (3)$$

where i and {p_m}_m are embeddings of the data sample and prototypes in a feature space, and ⟨·, ·⟩ stands for the distance measure, which is typically set as the ℓ2 distance (i.e., ||i − p_m||) [123], yet other proximities can be applied.

Further, Eqs. 1-2 can be formulated in a unified form:

$$p(c|i) = \frac{\exp(-\langle i, g_c \rangle)}{\sum_{c'=1}^{C} \exp(-\langle i, g_{c'} \rangle)}, \qquad (4)$$

where g_c ∈ R^D can be either w_c in Eq. 1 or e_c in Eq. 2.
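A small sketch (ours; names illustrative) makes the unified view of Eqs. 3-4 explicit: any such classifier reduces to nearest-prototype lookup, with Eqs. 1-2 recovered when there is exactly one prototype per class:

```python
import torch

def prototype_classify(feats, protos, proto_cls):
    """Eq. 3: winner-take-all over M prototypes.

    feats: (N, D) sample embeddings; protos: (M, D) prototype embeddings;
    proto_cls: (M,) class label c_{p_m} of each prototype.
    """
    dist = -(feats @ protos.t())      # <i, p_m> as negative inner product
    m_star = dist.argmin(dim=1)       # index of the winning prototype
    return proto_cls[m_star]          # predicted class c_hat_i

def unified_softmax(feats, g):
    """Eq. 4 with M = C and g_c = w_c (Eq. 1) or e_c (Eq. 2); with the
    distance <i, g_c> taken as the negative inner product, -<i, g_c> = g_c^T i."""
    return (feats @ g.t()).softmax(dim=1)
```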
With Eqs. 3-4, we are ready to answer questions ①②. Both types of methods are based on learnable prototypes; they are parametric models in the sense that they learn one prototype g_c, i.e., a linear weight w_c or query vector e_c, for each class c (i.e., M = C). Thus one can consider that softmax projection based methods 'secretly' learn the query vectors. As for the difference, in addition to different distance measures (i.e., inner product vs cosine similarity), pixel-query based methods [118, 139] can feed the queries into cross-attention decoder layers for cross-class context exchange, whereas softmax projection based counterparts only leverage the learned class weights within the softmax layer.

With this unified view of parametric prototype learning, a few intrinsic yet long-ignored issues in this field unfold:

First, prototype selection [36] is a vital aspect in the design of a prototype based learner – prototypes should be typical of their classes. Nevertheless, existing semantic segmentation algorithms often describe each class by only one prototype, bearing no intra-class variation. Moreover, the prototypes are directly learned in a fully parametric manner, without accounting for their representative ability.

Second, the amount of learnable prototype parameters, i.e., {g_c ∈ R^D}_{c=1}^C, grows with the number of classes. This may hinder scalability, especially when a large number of classes are present. For example, if there are 800 classes and the pixel feature dimensionality is 512, at least 0.4M parameters are needed for prototype learning alone, making large-vocabulary segmentation a hard task. Moreover, if we want to represent each class by ten prototypes instead of only one, we need to learn 4M prototype parameters.

Third, Eq. 3 intuitively shows that prototype based learners make metric comparisons of data [8]. However, existing algorithms often supervise dense segmentation representations by directly optimizing the accuracy of pixel-wise prediction (e.g., cross-entropy loss), ignoring known inductive biases [83, 84], e.g., intra-class compactness, about the feature distribution. This hinders the discrimination potential of the learned segmentation features, as suggested by much of the literature on representation learning [76, 95, 114].

After tackling question ③, in the next section we detail our non-learnable prototype based nonparametric segmentation method, which serves as a solid response to question ④.

3. Non-Learnable Prototype based Nonparametric Semantic Segmentation

We build a nonparametric segmentation framework that conducts dense prediction with a set of non-learnable class prototypes, and directly supervises the pixel embedding space via a prototype-anchored metric learning scheme (Fig. 2).

Non-Learnable Prototype based Pixel Classification. As usual, an encoder network (FCN based or attention based), i.e., φ, is first adopted to map the input image I ∈ R^{h×w×3} to a 3D feature tensor I ∈ R^{H×W×D}. For pixel-wise C-way classification, rather than prior semantic segmentation models that automatically learn C class weights {w_c ∈ R^D}_{c=1}^C (cf. Eq. 1) or C query vectors {e_c ∈ R^D}_{c=1}^C (cf. Eq. 2), we refer to a group of CK non-learnable prototypes, i.e., {p_{c,k} ∈ R^D}_{c,k=1}^{C,K}, which are based solely on class data sub-centers. More specifically, each class c ∈ {1, ···, C} is represented by a total of K prototypes {p_{c,k}}_{k=1}^K, and prototype p_{c,k} is determined as the center of the k-th sub-cluster of training pixel samples belonging to class c, in the embedding space of φ.
In this way, the prototypes can comprehensively capture characteristic properties of the corresponding classes, without introducing extra learnable parameters outside φ. Analogous to Eq. 3, the category prediction of each pixel i ∈ I is achieved by a winner-take-all classification:

$$\hat{c}_i = c^*, \quad \text{with} \quad (c^*, k^*) = \arg\min_{(c,k)} \{\langle i, p_{c,k} \rangle\}_{c,k=1}^{C,K}, \qquad (5)$$

where i ∈ R^D stands for the ℓ2-normalized embedding of pixel i, i.e., i ∈ I, and the distance measure ⟨·, ·⟩ is defined as the negative cosine similarity, i.e., ⟨i, p⟩ = −i^⊤p.

With this exemplar-based reasoning mode, we first define the probability distribution of pixel i over the C classes:

$$p(c|i) = \frac{\exp(-s_{i,c})}{\sum_{c'=1}^{C} \exp(-s_{i,c'})}, \quad \text{with} \quad s_{i,c} = \min\{\langle i, p_{c,k} \rangle\}_{k=1}^{K}, \qquad (6)$$

where the pixel-class distance s_{i,c} ∈ [−1, 1] is computed as the distance to the closest prototype of class c. Given the groundtruth class of each pixel i, i.e., c_i ∈ {1, ···, C}, the cross-entropy loss can therefore be used for training:

$$\mathcal{L}_{\text{CE}} = -\log p(c_i|i) = -\log \frac{\exp(-s_{i,c_i})}{\exp(-s_{i,c_i}) + \sum_{c' \neq c_i} \exp(-s_{i,c'})}. \qquad (7)$$

In our case, Eq. 7 can be viewed as pushing pixel i closer to the nearest prototype of its corresponding class c_i, and further from close prototypes of irrelevant classes c' ≠ c_i. However, adopting only this training objective is not enough, for two reasons. First, Eq. 7 only considers pixel-class distances, e.g., s_{i,c}, without addressing within-class pixel-prototype relations, e.g., ⟨i, p_{c_i,k}⟩. For example, for discriminative representation learning, pixel i is expected to be pushed further close to a certain prototype (i.e., a particularly suitable pattern) of class c_i, and distant from other prototypes (i.e., other irrelevant but within-class patterns) of class c_i. Eq. 7 cannot capture this. Second, as the pixel-class distances are normalized across all classes (cf. Eq. 6), Eq. 7 only optimizes the relative relations between the intra-class (i.e., s_{i,c_i}) and inter-class (i.e., {s_{i,c'}}_{c'≠c_i}) distances, instead of directly regularizing the cosine distances between pixels and classes. For example, when the intra-class distance s_{i,c_i} of pixel i is relatively smaller than the other inter-class distances {s_{i,c'}}_{c'≠c_i}, the penalty from Eq. 7 will be small, but the intra-class distance s_{i,c_i} might still be large [89, 134]. Next we first elaborate on our within-class online clustering strategy, and then detail our two extra training objectives, which rely on the prototype assignments (i.e., clustering results) and address the above two issues, respectively.
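Before turning to the clustering step, the following sketch (ours; tensor names are assumptions, not the released implementation) spells out Eqs. 5-7, with distances computed as negative cosine similarities to the closest of the K prototypes per class:

```python
import torch
import torch.nn.functional as F

def pixel_class_distances(feats, prototypes):
    """s_{i,c} of Eq. 6. feats: (N, D); prototypes: (C, K, D)."""
    feats = F.normalize(feats, dim=1)
    protos = F.normalize(prototypes, dim=2)
    sim = torch.einsum('nd,ckd->nck', feats, protos)  # cosine sim i^T p_{c,k}
    return -sim.max(dim=2).values   # s_{i,c} = min_k <i, p_{c,k}> = -max_k sim

def predict(feats, prototypes):
    return pixel_class_distances(feats, prototypes).argmin(dim=1)  # Eq. 5

def loss_ce(feats, prototypes, labels):
    # Eq. 7: cross-entropy over the negated pixel-class distances.
    return F.cross_entropy(-pixel_class_distances(feats, prototypes), labels)
```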
Within-Class Online Clustering. We approach online clustering for prototype selection and assignment: pixel samples within the same class are assigned to the prototypes belonging to that class, and the prototypes are then updated according to the assignments. Clustering imposes a natural bottleneck [55] that forces the model to discover intra-class discriminative patterns yet discard instance-specific details. Thus the prototypes, selected as the sub-cluster centers, are typical of the corresponding classes. Conducting clustering online makes our method scalable to large amounts of data, unlike offline clustering, which requires multiple passes over the entire dataset for feature computation [13].

Formally, given the pixels I^c = {i_n}_{n=1}^N in a training batch that belong to class c (i.e., c_{i_n} = c), our goal is to map the pixels I^c to the K prototypes {p_{c,k}}_{k=1}^K of class c. We denote this pixel-to-prototype mapping as L^c = [l_{i_n}]_{n=1}^N ∈ {0, 1}^{K×N}, where l_{i_n} = [l_{i_n,k}]_{k=1}^K ∈ {0, 1}^K is the one-hot assignment vector of pixel i_n over the K prototypes. The optimization of L^c is achieved by maximizing the similarity between the pixel embeddings, i.e., X^c = [i_n]_{n=1}^N ∈ R^{D×N}, and the prototypes, i.e., P^c = [p_{c,k}]_{k=1}^K ∈ R^{D×K}:

$$\max_{L^c} \ \mathrm{Tr}(L^{c\top} P^{c\top} X^c), \quad \text{s.t.} \ L^c \in \{0,1\}^{K\times N},\ L^{c\top}\mathbf{1}_K = \mathbf{1}_N,\ L^c\mathbf{1}_N = \tfrac{N}{K}\mathbf{1}_K, \qquad (8)$$

where 1_K denotes the K-dimensional all-ones vector. The unique assignment constraint, i.e., L^{c⊤}1_K = 1_N, ensures that each pixel is assigned to one and only one prototype. The equipartition constraint, i.e., L^c 1_N = (N/K) 1_K, enforces that on average each prototype is selected at least N/K times in the batch [13]. This prevents the trivial solution in which all pixel samples are assigned to a single prototype, and eventually benefits the representative ability of the prototypes. To solve Eq. 8, one can relax L^c to be an element of the transportation polytope [2, 24]:

$$\max_{L^c} \ \mathrm{Tr}(L^{c\top} P^{c\top} X^c) + \kappa\, h(L^c), \quad \text{s.t.} \ L^c \in \mathbb{R}_{+}^{K\times N},\ L^{c\top}\mathbf{1}_K = \mathbf{1}_N,\ L^c\mathbf{1}_N = \tfrac{N}{K}\mathbf{1}_K, \qquad (9)$$

where $h(L^c) = \sum_{n,k} -l_{i_n,k} \log l_{i_n,k}$ is an entropy, and κ > 0 is a parameter controlling the smoothness of the distribution. With the soft assignment relaxation and the extra regularization term h(L^c), the solver of Eq. 9 can be given as [24]:

$$L^c = \mathrm{diag}(u)\, \exp\!\Big(\frac{P^{c\top} X^c}{\kappa}\Big)\, \mathrm{diag}(v), \qquad (10)$$

where u ∈ R^K and v ∈ R^N are renormalization vectors, computed by a few steps of Sinkhorn-Knopp iteration [24]. Our online clustering is highly efficient on GPU, as it only involves a couple of matrix multiplications; in practice, clustering 10K pixels into 10 prototypes takes only 2.5 ms.
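The Sinkhorn-Knopp solver of Eq. 10 can be sketched in a few lines; the version below follows the common alternating row/column rescaling scheme of [13, 24] (a sketch under our naming, not the authors' released code):

```python
import torch

@torch.no_grad()
def sinkhorn(scores, kappa=0.05, n_iters=3):
    """Approximately solve Eq. 9. scores: (K, N) = P_c^T X_c."""
    L = torch.exp(scores / kappa)    # kernel of Eq. 10, before diag(u), diag(v)
    K, N = L.shape
    L = L / L.sum()
    for _ in range(n_iters):
        L = L / L.sum(dim=1, keepdim=True)  # row scaling plays the role of diag(u):
        L = L / K                           # equipartition, L 1_N = (N/K) 1_K
        L = L / L.sum(dim=0, keepdim=True)  # column scaling plays the role of diag(v):
        L = L / N                           # unique assignment, L^T 1_K = 1_N
    return L * N                     # rescale so each column sums to one

# Hard assignment of pixel n: k_n = sinkhorn(P_c.t() @ X_c)[:, n].argmax()
```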
Pixel-Prototype Contrastive Learning. With the assignment probability matrix L^c = [l_{i_n}]_{n=1}^N ∈ [0, 1]^{K×N}, we online group the training pixels I^c = {i_n}_{n=1}^N into the K prototypes {p_{c,k}}_{k=1}^K within class c. After all the samples in the current batch are processed, each pixel i is assigned to the k_i-th prototype of class c_i, where k_i = arg max_k {l_{i,k}}_{k=1}^K and l_{i,k} ∈ l_i. It is natural to derive a training objective for prototype assignment prediction, i.e., to maximize the prototype assignment posterior probability. This can be viewed as a pixel-prototype contrastive learning strategy, and it addresses the first limitation of Eq. 7:

$$\mathcal{L}_{\text{PPC}} = -\log \frac{\exp(i^{\top} p_{c_i,k_i}/\tau)}{\exp(i^{\top} p_{c_i,k_i}/\tau) + \sum_{p^{-} \in P^{-}} \exp(i^{\top} p^{-}/\tau)}, \qquad (11)$$
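where τ > 0 is a temperature. The surrounding text defining the negative prototype set P⁻ is lost to a gap in this copy; the sketch below (ours) assumes, for illustration, that P⁻ contains all CK prototypes except the assigned p_{c_i,k_i}, in which case Eq. 11 becomes a (CK)-way InfoNCE classification:

```python
import torch
import torch.nn.functional as F

def loss_ppc(feats, prototypes, cls_idx, proto_idx, tau=0.1):
    """Eq. 11 under the stated assumption on P^-.

    feats: (N, D) l2-normalized pixel embeddings;
    prototypes: (C, K, D) l2-normalized; cls_idx, proto_idx: (N,) hold c_i, k_i.
    """
    C, K, D = prototypes.shape
    logits = feats @ prototypes.reshape(C * K, D).t() / tau  # (N, C*K)
    positive = cls_idx * K + proto_idx                       # flat index of p_{c_i,k_i}
    return F.cross_entropy(logits, positive)
```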
…represent prototypes as sub-cluster centers and obtain online assignments, allowing our method to scale gracefully to any dataset size. We encourage a sparse distance distribution with compactness-awareness, reinforcing the embedding discrimination. From a broader view, a few embedding based instance segmentation approaches [26, 86] can be viewed as nonparametric, i.e., they treat instance centroids as prototypes.

Prototype Learning. Cognitive psychological studies evidence that people use past cases as models when learning to solve problems [1, 87, 125]. Among various machine learning algorithms, ranging from classical statistics based methods to Support Vector Machines to Multilayer Perceptrons [9, 31, 33, 96], prototype based classification gains particular interest due to its exemplar-driven nature and intuitive interpretation: observations are directly compared with representative examples. Based on the nearest neighbors rule – the earliest prototype learning method [23] – many famous nonparametric classifiers have been proposed [36], such as Learning Vector Quantization (LVQ) [61], generalized LVQ [94], and Neighborhood Component Analysis [37, 93]. There has been a recent surge of interest in integrating deep learning into prototype learning, showing good potential in few-shot [98], zero-shot [54], and unsupervised learning [116, 120], as well as supervised classification [38, 83, 115, 123] and interpretable networks [65]. Remarkably, as many few-shot segmentation models can be viewed as prototype-based networks [29, 106, 109], our work sheds light on the possibility of closer collaboration between the two segmentation fields.

Metric Learning. The selection of a proper distance measure impacts the success of prototype based learners [8]; metric learning and prototype learning are naturally related. As the literature on metric learning is vast [58], only the most relevant works are discussed. The goal of metric learning is to learn a distance metric/embedding such that similar samples are pulled together and dissimilar samples are pushed away. Learning deep representations with metric loss functions (e.g., contrastive loss [40], triplet loss [95], n-pair loss [99]) has shown significant benefit for applications such as image retrieval [107] and face recognition [95]. Recently, metric learning has shown good potential in unsupervised representation learning. Specifically, many instance-based approaches use the contrastive loss [39, 88] to explicitly compare pairs of image representations, so as to push away features from different images while pulling together those from transformations of the same image [17, 19, 44, 46, 88]. Since computing all the pairwise comparisons on a large dataset is challenging, some clustering-based methods instead discriminate between groups of images with similar features rather than individual images [2, 5, 12, 13, 64, 104, 119, 121, 122]. Our prototype-anchored metric learning strategy shares a similar spirit of posing metric constraints over prototype (cluster) assignments, but its aim is to reshape the pixel segmentation embedding space with explicit supervision.

5. Experiment

5.1. Experimental Setup

Datasets. Our experiments are conducted on three datasets:
• ADE20K [140] is a large-scale scene parsing benchmark that covers 150 stuff/object categories. The dataset is divided into 20k/2k/3k images for train/val/test.
• Cityscapes [22] has 5k finely annotated urban scene images, with 2,975/500/1,524 for train/val/test. The segmentation performance is evaluated over 19 challenging categories, such as rider, bicycle, and traffic light.
• COCO-Stuff [10] has 10k images gathered from COCO [71], with 9k and 1k for train and test, respectively. There are 172 semantic categories in total, including 80 objects, 91 stuffs, and 1 unlabeled.

Training. Our method is implemented on MMSegmentation [21], following the default training settings. In particular, all backbones are initialized using the corresponding weights pre-trained on ImageNet-1K [92], while the remaining layers are randomly initialized. We use standard data augmentation techniques, including random scale jittering with a factor in [0.5, 2], random horizontal flipping, random cropping, and random color jittering. We train models using SGD/AdamW for FCN-/attention-based models, respectively. The learning rate is scheduled following the polynomial annealing policy. For Cityscapes, we use a batch size of 8 and a training crop size of 768×768. For ADE20K and COCO-Stuff, we use a crop size of 512×512 and train the models with batch size 16. The models are trained for 160k, 160k, and 80k iterations on Cityscapes, ADE20K, and COCO-Stuff, respectively. Exceptionally, for the ablation study, we train models for 40k iterations. The hyper-parameters are empirically set to: K = 10, m = 0.999, τ = 0.1, κ = 0.05, λ1 = 0.01, λ2 = 0.01. A compact summary of this recipe is given below.
learn a distance metric/embedding such that similar samples Testing. For ADE20K and COCO-Stuff, we rescale the
are pulled together and dissimilar samples are pushed away. short scale of the image to training crop size, with the as-
It has shown a significant benefit by learning deep represen- pect ratio kept unchanged. For Cityscapes, we adopt sliding
tation using metric loss functions (e.g., contrastive loss [40], window inference with the window size 768×768. For sim-
triplet loss [95], n-pair loss [99]) for applications (e.g., im- plicity, we do not apply any test-time data augmentation.
age retrieval [107], face recognition [95]). Recently, metric Our model is implemented in PyTorch and trained on eight
learning showed good potential in unsupervised representa- Tesla V100 GPUs with a 32GB memory per-card. Testing
tion learning. Specifically, many instance-based approaches is conducted on the same machine.
use the contrastive loss [39, 88] to explicitly compare pairs Baselines. We mainly compare with four widely recognized
of image representations, so as to push away features from segmentation models, i.e., two FCN based (i.e., FCN [79],
different images while pulling together those from trans- HRNet [108]) and two attention based (i.e., Swin [78] and
formations of the same image [17, 19, 44, 46, 88]. Since SegFormer [118]). For fair comparison, all the models are
computing all the pairwise comparisons on a large dataset based on our reproduction, following the hyper-parameter
is challenging, some clustering-based methods turn to dis- and augmentation recipes used in MMSegmentation [21].
criminate between groups of images with similar features Evaluation Metric. Following conventions [15, 79], mean
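For completeness, a standard confusion-matrix implementation of mIoU (a generic sketch, not tied to the evaluation code of [21]) is:

```python
import torch

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean intersection-over-union; pred and gt are integer label tensors."""
    mask = gt != ignore_index
    hist = torch.bincount(num_classes * gt[mask] + pred[mask],
                          minlength=num_classes ** 2)
    hist = hist.reshape(num_classes, num_classes).float()
    inter = hist.diag()                              # true positives per class
    union = hist.sum(0) + hist.sum(1) - hist.diag()  # TP + FP + FN per class
    iou = inter / union.clamp(min=1)
    return iou[union > 0].mean()                     # average over present classes
```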
5.2. Comparison to State-of-the-Arts

ADE20K [140] val. Table 1 reports comparisons with representative models on ADE20K val.
Our nonparametric scheme obtains consistent improvements over the baselines, with fewer learnable parameters. In particular, it yields 1.2% and 1.0% mIoU improvements over the FCN-based counterparts, i.e., FCN [79] and HRNet [108]. Similar performance gains (0.6% and 0.8%) are obtained over recent attention-based models, i.e., Swin [78] and SegFormer [118], manifesting the high versatility of our approach.

| Method | Backbone | # Param (M) | mIoU (%) |
|---|---|---|---|
| DeepLabV3+ [ECCV18] [16] | ResNet-101 [45] | 62.7 | 44.1 |
| OCR [ECCV20] [129] | HRNetV2-W48 [108] | 70.3 | 45.6 |
| MaskFormer [NeurIPS21] [20] | ResNet-101 [45] | 60.0 | 46.0 |
| UperNet [ECCV18] [117] | Swin-Base [78] | 121.0 | 48.4 |
| OCR [ECCV20] [129] | HRFormer-B [130] | 70.3 | 48.7 |
| SETR [CVPR21] [139] | ViT-Large [30] | 318.3 | 50.2 |
| Segmenter [ICCV21] [100] | ViT-Large [30] | 334.0 | 51.8 |
| MaskFormer† [NeurIPS21] [20] | Swin-Base [78] | 102.0 | 52.7 |
| FCN [CVPR15] [79] | ResNet-101 [45] | 68.6 | 39.9 |
| Ours | ResNet-101 [45] | 68.5 | 41.1 (↑1.2) |
| HRNet [PAMI20] [108] | HRNetV2-W48 [108] | 65.9 | 42.0 |
| Ours | HRNetV2-W48 [108] | 65.8 | 43.0 (↑1.0) |
| Swin [CVPR21] [78] | Swin-Base [78] | 90.6 | 48.0 |
| Ours | Swin-Base [78] | 90.5 | 48.6 (↑0.6) |
| SegFormer [NeurIPS21] [118] | MiT-B4 [118] | 64.1 | 50.9 |
| Ours | MiT-B4 [118] | 64.0 | 51.7 (↑0.8) |

†: backbone is pre-trained on ImageNet-22K.
Table 1. Quantitative results (§5.2) on ADE20K [140] val.

Cityscapes [22] val. Table 2 shows again our compelling performance on Cityscapes val. Specifically, our approach surpasses all the competitors, i.e., 1.0% over FCN, 0.7% over HRNet, 0.8% over Swin, and 0.6% over SegFormer.

| Method | Backbone | # Param (M) | mIoU (%) |
|---|---|---|---|
| PSPNet [CVPR17] [135] | ResNet-101 [45] | 65.9 | 78.4 |
| PSANet [ECCV18] [136] | ResNet-101 [45] | - | 78.6 |
| AAF [ECCV18] [59] | ResNet-101 [45] | - | 79.1 |
| Segmenter [ICCV21] [100] | ViT-Large [30] | 322.0 | 79.1 |
| ContrastiveSeg [ICCV21] [111] | ResNet-101 [45] | 58.0 | 79.2 |
| MaskFormer [NeurIPS21] [20] | ResNet-101 [45] | 60.0 | 80.3 |
| DeepLabV3+ [ECCV18] [16] | ResNet-101 [45] | 62.7 | 80.9 |
| OCR [ECCV20] [129] | HRNetV2-W48 [108] | 70.3 | 81.1 |
| FCN [CVPR15] [79] | ResNet-101 [45] | 68.6 | 78.1 |
| Ours | ResNet-101 [45] | 68.5 | 79.1 (↑1.0) |
| HRNet [PAMI20] [108] | HRNetV2-W48 [108] | 65.9 | 80.4 |
| Ours | HRNetV2-W48 [108] | 65.8 | 81.1 (↑0.7) |
| Swin [CVPR21] [78] | Swin-Base [78] | 90.6 | 79.8 |
| Ours | Swin-Base [78] | 90.5 | 80.6 (↑0.8) |
| SegFormer [NeurIPS21] [118] | MiT-B4 [118] | 64.1 | 80.7 |
| Ours | MiT-B4 [118] | 64.0 | 81.3 (↑0.6) |

Table 2. Quantitative results (§5.2) on Cityscapes [22] val.

COCO-Stuff [10] test. As listed in Table 3, our approach also demonstrates promising performance on COCO-Stuff test. It outperforms all the baselines. Notably, with MiT-B4 [118] as the network backbone, our approach earns an mIoU score of 43.3%, establishing a new state-of-the-art.

| Method | Backbone | # Param (M) | mIoU (%) |
|---|---|---|---|
| SVCNet [CVPR19] [28] | ResNet-101 [45] | - | 39.6 |
| DANet [CVPR19] [34] | ResNet-101 [45] | 69.1 | 39.7 |
| SpyGR [CVPR20] [67] | ResNet-101 [45] | - | 39.9 |
| MaskFormer [NeurIPS21] [20] | ResNet-101 [45] | 60.0 | 39.8 |
| ACNet [ICCV19] [35] | ResNet-101 [45] | - | 40.1 |
| OCR [ECCV20] [129] | HRNetV2-W48 [108] | 70.3 | 40.5 |
| FCN [CVPR15] [79] | ResNet-101 [45] | 68.6 | 32.5 |
| Ours | ResNet-101 [45] | 68.5 | 34.0 (↑1.5) |
| HRNet [PAMI20] [108] | HRNetV2-W48 [108] | 65.9 | 38.7 |
| Ours | HRNetV2-W48 [108] | 65.8 | 39.9 (↑1.2) |
| Swin [CVPR21] [78] | Swin-Base [78] | 90.6 | 41.5 |
| Ours | Swin-Base [78] | 90.5 | 42.4 (↑0.9) |
| SegFormer [NeurIPS21] [118] | MiT-B4 [118] | 64.1 | 42.5 |
| Ours | MiT-B4 [118] | 64.0 | 43.3 (↑0.8) |

Table 3. Quantitative results (§5.2) on COCO-Stuff [10] test.

Qualitative Results. Fig. 4 provides a qualitative comparison of Ours against SegFormer [118] on representative examples from the three datasets. We observe that our approach is able to handle diverse challenging scenarios and produce more accurate results (as highlighted in red dashed boxes).

Figure 4. Qualitative results of SegFormer [118] and our approach (from left to right: ADE20K [140], Cityscapes [22], COCO-Stuff [10]).

5.3. Scalability to Large-Vocabulary Semantic Segmentation

Today, rigorous evaluation of semantic segmentation models is mostly performed in a few-category regime (e.g., 19/150/172 classes for Cityscapes/ADE20K/COCO-Stuff), while the generalization to the more natural large-vocabulary setting is ignored. In this section, we demonstrate the remarkable superiority of our method in the large-vocabulary setting. We start with the default setting of ADE20K [140], which includes 150 semantic concepts. Then, we gradually increase the number of concepts based on their visibility frequency, and train/test models on the selected number of classes. In this experiment, we use MiT-B2 [118] as the backbone and train models for 40k iterations.

The results are summarized in Table 4, from which we find that: i) For the parametric scheme, the amount of prototype parameters increases with the vocabulary size. For the extreme case of 10 prototypes and 847 classes, the number of prototype parameters is 6.5M, accounting for ∼20% of the total parameters (i.e., 33.96M). In sharp contrast, our scheme requires no learnable prototype parameters at all. ii) Our …
| Method | # Proto | 150 classes | 300 classes | 500 classes | 700 classes | 847 classes |
|---|---|---|---|---|---|---|
| parametric | 1 | 45.1 / 27.48 (0.12) | 36.5 / 27.62 (0.23) | 25.7 / 27.80 (0.39) | 19.8 / 27.98 (0.54) | 16.5 / 28.11 (0.65) |
| nonparametric (Ours) | 1 | 45.5 (↑0.4) / 27.37 (0) | 37.2 (↑0.7) / 27.37 (0) | 26.8 (↑1.1) / 27.37 (0) | 21.2 (↑1.4) / 27.37 (0) | 18.1 (↑1.6) / 27.37 (0) |
| parametric | 10 | 45.7 / 28.56 (1.2) | 37.0 / 29.66 (2.3) | 26.6 / 31.26 (3.9) | 20.8 / 32.86 (5.4) | 17.7 / 33.96 (6.5) |
| nonparametric (Ours) | 10 | 46.4 (↑0.7) / 27.37 (0) | 37.8 (↑0.8) / 27.37 (0) | 27.9 (↑1.3) / 27.37 (0) | 22.1 (↑1.3) / 27.37 (0) | 19.4 (↑1.7) / 27.37 (0) |

Table 4. Scalability study (§5.3) of our nonparametric model against the parametric baseline (i.e., SegFormer [118]) on ADE20K [140]. Each cell gives mIoU (%) / # Param (M) of the entire model, with the prototype parameter count in brackets.

(a) Training Objective L:

| L_CE (Eq. 7) | L_PPC (Eq. 11) | L_PPD (Eq. 12) | mIoU (%) |
|---|---|---|---|
| ✓ | | | 45.0 |
| ✓ | ✓ | | 45.9 |
| ✓ | | ✓ | 45.4 |
| ✓ | ✓ | ✓ | 46.4 |

(b) Prototype Number K:

| K | mIoU (%) |
|---|---|
| 1 | 45.5 |
| 5 | 46.0 |
| 10 | 46.4 |
| 20 | 46.5 |
| 50 | 46.4 |

(c) Momentum Coefficient µ:

| µ | mIoU (%) |
|---|---|
| 0 | 44.9 |
| 0.9 | 45.9 |
| 0.99 | 46.0 |
| 0.999 | 46.4 |
| 0.9999 | 46.3 |

(d) Distance Measure:

| Distance Measure | mIoU (%) |
|---|---|
| Standard | 45.7 |
| Huberized | 45.2 |
| Cosine | 46.4 |

Table 5. A set of ablative studies (§5.4) on ADE20K [140] val. All model variants use MiT-B2 [118] as the backbone.
References

[1] Agnar Aamodt and Enric Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1):39–59, 1994.
[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
[3] Andreas Backhaus and Udo Seiffert. Classification in high-dimensional spectral data: Accuracy vs. interpretability vs. model size. Neurocomputing, 131:15–22, 2014.
[4] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI, 39(12):2481–2495, 2017.
[5] Miguel A Bautista, Artsiom Sanakoyeu, Ekaterina Tikhoncheva, and Björn Ommer. Cliquecnn: Deep unsupervised exemplar learning. In NeurIPS, 2016.
[6] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Semantic segmentation with boundary neural fields. In CVPR, 2016.
[7] Michael Biehl, Barbara Hammer, Petra Schneider, and Thomas Villmann. Metric learning for prototype-based classification. In Innovations in Neural Information Paradigms and Applications, pages 183–199. 2009.
[8] Michael Biehl, Barbara Hammer, and Thomas Villmann. Distance measures for prototype based classification. In International Workshop on Brain-Inspired Computing, 2013.
[9] Christopher M Bishop. Pattern recognition. Machine learning, 128(9), 2006.
[10] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018.
[11] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[12] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[13] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[14] Liang-Chieh Chen, Jonathan T Barron, George Papandreou, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, 2016.
[15] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 40(4):834–848, 2017.
[16] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[17] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[18] Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, and Zhangyang Wang. Fasterseg: Searching for faster real-time semantic segmentation. In ICLR, 2019.
[19] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[20] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.
[21] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[22] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[23] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE TIT, 13(1):21–27, 1967.
[24] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
[25] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In CVPR, 2017.
[26] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation for autonomous driving. In CVPR Workshop, 2017.
[27] Henghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, and Gang Wang. Boundary-aware feature propagation for scene segmentation. In CVPR, 2019.
[28] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Semantic correlation promoted shape-variant context for segmentation. In CVPR, 2019.
[29] Nanqing Dong and Eric P Xing. Few-shot semantic segmentation with prototype learning. In BMVC, 2018.
[30] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[31] Richard O Duda, Peter E Hart, et al. Pattern classification and scene analysis, volume 3. Wiley New York, 1973.
[32] David Eigen and Rob Fergus. Nonparametric image parsing using adaptive neighbor sets. In CVPR, 2012.
[33] Jerome H Friedman. The elements of statistical learning: Data mining, inference, and prediction. Springer, 2009.
[34] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
[35] Jun Fu, Jing Liu, Yuhang Wang, Yong Li, Yongjun Bao, Jinhui Tang, and Hanqing Lu. Adaptive context network for scene parsing. In ICCV, 2019.
[36] Salvador Garcia, Joaquin Derrac, Jose Cano, and Francisco Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE TPAMI, 34(3):417–435, 2012.
[37] Jacob Goldberger, Geoffrey E Hinton, Sam Roweis, and Russ R Salakhutdinov. Neighbourhood components analysis. In NeurIPS, 2004.
[38] Samantha Guerriero, Barbara Caputo, and Thomas Mensink. Deepncm: Deep nearest class mean classifiers. In ICLR workshop, 2018.
[39] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[40] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[41] Adam W Harley, Konstantinos G Derpanis, and Iasonas Kokkinos. Segmentation-aware convolutional networks using local attention masks. In ICCV, 2017.
[42] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In ICCV, 2019.
[43] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In CVPR, 2019.
[44] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[46] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
[47] Chi-Wei Hsiao, Cheng Sun, Hwann-Tzong Chen, and Min Sun. Specialize and fuse: Pyramidal output representation for semantic segmentation. In ICCV, 2021.
[48] Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, and Junjie Yan. Class-wise dynamic graph convolution for semantic segmentation. In ECCV, 2020.
[49] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[50] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[51] Peter J Huber. Robust regression: asymptotics, conjectures and monte carlo. The Annals of Statistics, pages 799–821, 1973.
[52] Jyh-Jing Hwang, Stella X Yu, Jianbo Shi, Maxwell D Collins, Tien-Ju Yang, Xiao Zhang, and Liang-Chieh Chen. Segsort: Segmentation by discriminative sorting of segments. In ICCV, 2019.
[53] Shipra Jain, Danda Pani Paudel, Martin Danelljan, and Luc Van Gool. Scaling semantic segmentation beyond 1k classes on a single gpu. In ICCV, 2021.
[54] Saumya Jetley, Bernardino Romera-Paredes, Sadeep Jayasumana, and Philip Torr. Prototypical priors: From improving classification to zero-shot learning. arXiv preprint arXiv:1512.01192, 2015.
[55] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In ICCV, 2019.
[56] Zhenchao Jin, Tao Gong, Dongdong Yu, Qi Chu, Jian Wang, Changhu Wang, and Jie Shao. Mining contextual information beyond image for semantic segmentation. In ICCV, 2021.
[57] Zhenchao Jin, Bin Liu, Qi Chu, and Nenghai Yu. Isnet: Integrate image-level and semantic-level context for semantic segmentation. In ICCV, 2021.
[58] Mahmut Kaya and Hasan Şakir Bilge. Deep metric learning: A survey. Symmetry, 11(9):1066, 2019.
[59] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X Yu. Adaptive affinity fields for semantic segmentation. In ECCV, 2018.
[60] Barbara J Knowlton and Larry R Squire. The learning of categories: Parallel brain systems for item memory and category knowledge. Science, 262(5140):1747–1749, 1993.
[61] Teuvo Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998.
[62] Shu Kong and Charless C Fowlkes. Recurrent pixel embedding for instance grouping. In CVPR, 2018.
[63] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180, 2018.
[64] Junnan Li, Pan Zhou, Caiming Xiong, and Steven CH Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2020.
[65] Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In AAAI, 2018.
[66] Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai Tong. Improving semantic segmentation via decoupled body and edge supervision. In ECCV, 2020.
[67] Xia Li, Yibo Yang, Qijie Zhao, Tiancheng Shen, Zhouchen Lin, and Hong Liu. Spatial pyramid based graph reasoning for semantic segmentation. In CVPR, 2020.
[68] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In ICCV, 2019.
[69] Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, and Jian Sun. Learning dynamic routing for semantic segmentation. In CVPR, 2020.
[70] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
[71] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[72] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
[73] Ce Liu, Jenny Yuen, and Antonio Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE TPAMI, 33(5):978–994, 2010.
[74] Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. IEEE TPAMI, 33(12):2368–2382, 2011.
[75] Mingyuan Liu, Dan Schonfeld, and Wei Tang. Exploit visual dependency relations for semantic segmentation. In CVPR, 2021.
[76] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[77] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning markov random field for semantic segmentation. IEEE TPAMI, 40(8):1814–1828, 2017.
[78] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[79] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[80] Tomasz Malisiewicz and Alexei A Efros. Recognition by association via learning per-exemplar distances. In CVPR, 2008.
[81] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018.
[82] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
[83] Pascal Mettes, Elise van der Pol, and Cees Snoek. Hyperspherical prototype networks. In NeurIPS, 2019.
[84] Tom M Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, 1980.
[85] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In CVPR, 2019.
[86] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, 2019.
[87] Allen Newell, Herbert Alexander Simon, et al. Human problem solving, volume 104. 1972.
[88] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[89] Tianyu Pang, Kun Xu, Yinpeng Dong, Chao Du, Ning Chen, and Jun Zhu. Rethinking softmax cross-entropy loss for adversarial robustness. In ICLR, 2020.
[90] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[91] Eleanor H Rosch. Natural categories. Cognitive Psychology, 4(3):328–350, 1973.
[92] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[93] Ruslan Salakhutdinov and Geoff Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial Intelligence and Statistics, pages 412–419, 2007.
[94] Atsushi Sato and Keiji Yamada. Generalized learning vector quantization. In NeurIPS, 1995.
[95] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[96] John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[97] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[98] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[99] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016.
[100] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
[101] Guolei Sun, Wenguan Wang, Jifeng Dai, and Luc Van Gool. Mining cross-image semantics for weakly supervised semantic segmentation. In ECCV, 2020.
[102] Joseph Tighe and Svetlana Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, 2010.
[103] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE TPAMI, 30(11):1958–1970, 2008.
[104] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels. In ECCV, 2020.
[105] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[106] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016.
[107] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[108] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE TPAMI, 2020.
[109] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic segmentation with prototype alignment. In ICCV, 2019.
[110] Wenguan Wang, Tianfei Zhou, Siyuan Qi, Jianbing Shen, and Song-Chun Zhu. Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE TPAMI, 2021.
[111] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In ICCV, 2021.
[112] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[113] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In CVPR, 2021.
[114] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
[115] Zhirong Wu, Alexei A Efros, and Stella X Yu. Improving generalization via scalable neighborhood component analysis. In ECCV, 2018.
[116] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
[117] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[118] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[119] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
[120] Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, and Zeynep Akata. Attribute prototype network for zero-shot learning. In NeurIPS, 2020.
[121] Yaling Tao, Kentaro Takagi, and Kouta Nakata. Clustering-friendly representation learning via instance discrimination and feature decorrelation. In ICLR, 2021.
[122] Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan. Clusterfit: Improving generalization of visual representations. In CVPR, 2020.
[123] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Robust classification with convolutional prototype learning. In CVPR, 2018.
[124] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018.
[125] Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12):1551–1558, 2021.
[126] Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, and Nong Sang. Context prior for scene segmentation. In CVPR, 2020.
[127] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, 2018.
[128] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[129] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
[130] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer: High-resolution transformer for dense prediction. In NeurIPS, 2021.
[131] Yuhui Yuan, Jingyi Xie, Xilin Chen, and Jingdong Wang. Segfix: Model-agnostic boundary refinement for segmentation. In ECCV, 2020.
[132] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
[133] Kai Zhang, James T Kwok, and Bahram Parvin. Prototype vector machine for large scale semi-supervised learning. In ICML, 2009.
[134] Xiao Zhang, Rui Zhao, Yu Qiao, and Hongsheng Li. Rbf-softmax: Learning deep representative prototypes with radial basis function softmax. In ECCV, 2020.
[135] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[136] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, 2018.
[137] Mingmin Zhen, Jinglu Wang, Lei Zhou, Shiwei Li, Tianwei Shen, Jiaxiang Shang, Tian Fang, and Long Quan. Joint semantic segmentation and boundary detection using iterative pyramid contexts. In CVPR, 2020.
[138] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
[139] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[140] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.
[141] Tianfei Zhou, Liulei Li, Xueyi Li, Chun-Mei Feng, Jianwu Li, and Ling Shao. Group-wise learning for weakly supervised semantic segmentation. IEEE TIP, 31:799–811, 2021.