Part-Based Semantic Transform For Few-Shot Semantic Segmentation

This document presents a novel method called Part-Based Semantic Transform (PST) for few-shot semantic segmentation, addressing the issue of semantic misalignment between support and query images. PST utilizes prototype mixture models (PMMs) for semantic decomposition and a min-cost flow module for semantic matching, significantly improving segmentation performance on datasets like MS-COCO. The method demonstrates enhanced tolerance to variations in object appearance and pose, achieving state-of-the-art results in few-shot segmentation tasks.


IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 33, NO. 12, DECEMBER 2022, p. 7141

Part-Based Semantic Transform for Few-Shot Semantic Segmentation

Boyu Yang, Fang Wan, Chang Liu, Bohao Li, Xiangyang Ji, Member, IEEE, and Qixiang Ye, Senior Member, IEEE

Abstract: Few-shot semantic segmentation remains an open problem because there is no effective method for handling the semantic misalignment between objects. In this article, we propose the part-based semantic transform (PST), which aims to align object semantics in support images with those in query images by semantic decomposition-and-match. The semantic decomposition process is implemented with prototype mixture models (PMMs), which use an expectation–maximization (EM) algorithm to decompose object semantics into multiple prototypes corresponding to object parts. The semantic match between prototypes is performed with a min-cost flow module, which encourages correct correspondence while suppressing mismatches between object parts. With semantic decomposition-and-match, PST strengthens the network's tolerance to variations in object appearance and/or pose and facilitates channelwise and spatial semantic activation of objects in query images. Extensive experiments on the Pascal VOC and MS-COCO datasets show that PST significantly improves upon state-of-the-art methods. In particular, on MS-COCO, it improves the performance of five-shot semantic segmentation by up to 7.79% with a moderate cost in inference speed and model size. Code for PST is released at https://github.com/Yang-Bob/PST.

Index Terms: Few-shot segmentation, prototype mixture models (PMMs), semantic match, semantic transform.

Manuscript received 21 July 2020; revised 3 December 2020 and 7 May 2021; accepted 24 May 2021. Date of publication 8 June 2021; date of current version 1 December 2022. (Corresponding author: Fang Wan.)
Boyu Yang, Chang Liu, Bohao Li, and Qixiang Ye are with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Fang Wan is with the School of Computer Science and Technology, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China (e-mail: [email protected]).
Xiangyang Ji is with the Department of Automation, Tsinghua University, Beijing 10084, China (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3084252.
Digital Object Identifier 10.1109/TNNLS.2021.3084252
2162-237X © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

SUBSTANTIAL progress has been made in semantic segmentation [1]–[9]. This is attributed to large-scale datasets with precise mask annotations and convolutional neural networks (CNNs) capable of absorbing the annotation information. However, annotating large-scale datasets with semantic masks is expensive, laborious, or even impractical, which violates the principle of cognitive learning where models should be constructed based on few supervisions [10].

Given a few support images with corresponding segmentation masks, few-shot segmentation performs pixelwise classification of query images based on a feature representation which is learned on abundant training images [11]. However, the significant variations in object pose and appearance between the support and query images, and even the unseen categories, bring a great challenge to the learning of the few-shot segmentation problem. In the metric learning framework, Zhang et al. [12], Dong and Xing [13], and Shaban et al. [14] pioneered few-shot semantic segmentation by introducing the "prototype" vector, which indicates the class-aware semantic information across feature channels. The prototype vector is generated by global average pooling over the support image features guided by the ground-truth mask(s) and is compared with the features of the query image for semantic segmentation.

Despite extensive achievements, we realize that ignoring the object spatial layout leaves existing prototype models an open problem. A single prototype commonly deteriorates the distribution of features, which further results in semantic ambiguity among different objects [15]. Although recent studies partly relieve this issue by iterative mask refinement [16], prototype alignment [17], feature boosting [11], and cross-reference [18], the semantic misalignment problem between support and query images remains.

In this article, we propose a part-based semantic transform (PST), which aims to alleviate the semantic misalignment problem by semantic decomposition-and-match (Fig. 1). The semantic decomposition process is implemented by our proposed prototype mixture models (PMMs) [19], which use expectation maximization (EM) to extract prototypes corresponding to object parts given limited support images. Few-shot semantic segmentation is built on a two-branch metric learning framework consisting of a support branch and a query branch. During network training, objects are decomposed into PMMs, where positive samples are generated from deep pixels within the object mask. By estimating mixed prototypes, PMMs are primarily concerned with representing the diverse object appearances (Fig. 1). By modeling background regions, the discriminative capacity of features is also enhanced. The semantic matching process defines a min-cost flow module, which encourages the correct correspondence between prototypes (which represent the semantics of object parts) while suppressing mismatches between background/foreground regions with a min-cost matching algorithm.

During inference, prototypes are utilized to activate the semantics of query image features in a duplex manner. Those

Authorized licensed use limited to: NORTHWESTERN POLYTECHNICAL UNIVERSITY. Downloaded on February 18,2025 at 08:49:47 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. Illustration of PST. PST decomposes the convolutional features within object regions into prototypes, which correspond to object parts. T-SNE
visualization is applied to show the distribution of prototypes and features. It then leverages a semantic match module to encourage the correct correspondence
between object parts while depressing mismatches of image regions. In this way, PST defines a decomposition-and-match procedure which reduces semantic
misalignment between support and query images and improves the discriminability of features. (Best viewed in color.)

prototypes are regarded as representations for specific objects, which, concatenated with query features, are used to activate feature channels. Besides, they are also multiplied with the features from queries as linear classifiers to calculate an elementwise probability map. As a consequence, when segmenting the query image, the semantic information of the channelwise features and the spatial probability map is well explored.

The PMMs were originally proposed in our ECCV 2020 paper [19] and are promoted to the PST method, which defines a decomposition-and-match framework. Compared to PMMs, PST further introduces a semantic match module while removing the complex module ensemble strategy. To sum up, we have made the following four contributions in this article.

1) We propose PST, solving the problem of semantic ambiguity in few-shot segmentation with semantic decomposition-and-match.
2) PMMs are proposed in this article, which implement semantic decomposition using a plug-and-play EM algorithm.
3) We propose a prototype match module, which enforces object part correspondence while suppressing image region mismatches using the min-cost flow module.
4) With PST, we improve the state of the art of few-shot semantic segmentation by significant margins. We further apply PST to few-shot object detection, validating its general applicability to few-shot learning problems.

II. RELATED WORK

A. Semantic Segmentation

Semantic segmentation is a fundamental computer vision task which classifies pixels within an input image into predefined categories. In the past few years, semantic segmentation has been extensively investigated and state-of-the-art algorithms have been developed based on fully convolutional networks including UNet [1], PSPNet [2], EMANet [3], FCNs [20], and DeepLab [4]–[6]. By considering both instances and backgrounds, semantic segmentation has evolved to instance segmentation [9], [21] and panoptic segmentation [22].

Related research on semantic segmentation and instance segmentation has provided fundamental techniques, e.g., multiscale feature aggregation [2] and atrous spatial pyramid pooling (ASPP) [5], which benefit few-shot semantic segmentation in this study. The SegSort method [8], which partitions objects into parts using a divide-and-conquer strategy, provides an insight for our study.

B. Few-Shot Learning

Few-shot learning aims to find features that generalize well to novel classes. General few-shot learning methods have been exploited in broad areas such as few-shot object detection [23]–[26], fine-grained recognition [27], and incremental learning [28]. Few-shot learning methods can be roughly categorized as optimization based [29]–[32], metric learning [15], [33]–[35], or data augmentation [36]–[38].

As a classical method, metric learning configures a two-stream model to compare the query image with the support image(s) and determine the category of the query image. Optimization-based approaches (e.g., metalearning) specify optimization or loss functions in order to pursue faster adaptation of network parameters to new categories with few samples. Data augmentation methods try to generate additional examples or features for novel categories [36], [37]. The effect of prototypes for few-shot learning has been demonstrated [39], which provides insights for applying prototypes to capture representative and discriminative features.

More sophisticated few-shot learning explored the robustness of models to representation and label outliers [40] and few-shot learning on embedded neural network accelerators which do not support backpropagation [41]. Geometric


constraints [42] were also introduced to fine-tune the network with a few training examples. The constraints force novel category features to be close to the category weights while maintaining the novel category weights far from those of the base categories. The recent DeepEMD method [43] formulated pixels on the deep feature map as object parts. The "object parts" of support and query images were optimally matched with a linear programming algorithm to handle pose and appearance variations.

C. Few-Shot Segmentation

Many approaches to few-shot segmentation were inspired by prototype learning, e.g., squeezing the semantics of support images into a prototype vector and then using the prototype vector as a classifier to segment query images [17]. Our study follows a metric learning framework, which consists of a support branch and a query branch.

In [14], a two-branch network was proposed for few-shot segmentation using the support branch to process support images and the query branch to process query images. The support branch is devoted to learning a general representation and semantic vectors from the support set under a few-shot setting. The learned feature representation and semantic vectors were then used to segment images in the query branch. In [12] and [13], prototypical networks were proposed for few-shot semantic segmentation in a metric learning framework. A masked average pooling strategy came with the prototypical networks to guide prototype vector extraction. A cosine distance was then applied to measure the similarity between the features of support images and query images.

PANet [17] and Cross-Reference [18] improved the prototypical networks by introducing feature alignment or cross-reference between support and query branches, with the aim of exploiting the co-occurrence between objects from support and query images. CANet [16] leveraged channelwise dense comparison to improve feature diversity and discriminativeness. With CANet came an iterative optimization strategy for segmentation refinement. The FWB approach [11] used foreground-background feature differences of support images to enforce the discriminativeness of prototype vectors.

In few-shot segmentation, global average pooling was commonly used to calculate the prototype vector. However, squeezing the spatial extent into a single prototype tends to mix semantics from different object parts. This unintended semantic mixing seriously deteriorates the diversity of prototype vectors and the feature representation capacity. Recent studies used iterative mask refinement [16] and model ensemble [11] to solve this problem. However, issues remain because a systematic way to model diverse semantics from object parts is lacking.

Compared with few-shot learning, which treats each image as an instance, few-shot segmentation is required to simultaneously handle the semantic representation of features and the spatial completeness of objects. Our approach thus updates the prototype method [39] to a semantic transform approach. By introducing PMMs, we consider the spatial layout of semantics and enforce the capability for handling variation of object appearances and poses. By optimally matching "object parts" from support and query images using a min-cost flow model, our approach can handle object pose and appearance variations in a simple yet effective fashion.

III. METHODOLOGY

A. Overview

Given a set of training images and a couple of support images with ground-truth masks for predefined object categories, few-shot segmentation aims to classify pixels in query images. The training procedure of few-shot segmentation aims to construct a segmentation model, e.g., a network model, upon the set of training images containing object categories different from those in query images. To simulate the inference procedure, training images with ground-truth masks are split into small subsets where one image serves as the query image and the other(s) as the support image(s). The test procedure leverages the trained segmentation model and few-shot support images to guide the segmentation of query images.

Our few-shot segmentation approach is based on the metric learning framework, which is made up of two network branches, i.e., the support branch [Fig. 2 (top)] and the query branch [Fig. 2 (bottom)]. Two weight-sharing CNNs (VGG or ResNet) are used as the backbone networks to extract features for the support and query images. Let $S \in \mathbb{R}^{W \times H \times C}$ denote the support image features, where $W \times H$ indicates the spatial resolution of the features and $C$ the number of feature channels. $Q \in \mathbb{R}^{W \times H \times C}$ denotes the query image features. During the training procedure, given $S$ and $Q$, the PMMs are applied to estimate prototype vectors representing the semantics of objects or backgrounds under the guidance of the support/query mask, which is termed semantic decomposition. The decomposed prototypes of support and query images, representing local semantics, are then matched by a min-cost flow model, which is termed semantic match. The semantic match process is rooted in a min-cost module, which encourages correct matches between object parts while suppressing mismatches for accurate semantic correspondence. After the semantic decomposition-and-match procedures, PST leverages channelwise and spatial semantic transform to activate objects in query images with a few convolutional layers, which is termed semantic activation.

Without loss of generality, we describe the proposed approach for one-shot learning; it can be extended to five-shot learning by feeding the integrated five support images together to PMMs for prototype estimation.

B. Semantic Decomposition

The semantic decomposition procedure is based on PMMs. During training, images with ground-truth masks are divided into support/query image pairs. The support features $S \in \mathbb{R}^{W \times H \times C}$ for each support image are regarded as a sample set in the $C$-dimensional space with $W \times H$ samples. $S$ is partitioned into the foreground sample set $S^{+}$ and the background sample set $S^{-}$ under the guidance of the support image mask, where $S^{+}$ contains features inside the support image mask and $S^{-}$ contains features outside the support image mask.
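The mask-guided partition of the support features, together with the single-prototype baseline that the related work calls masked (global) average pooling, can be sketched as follows. This is a minimal illustration in plain Python, assuming the features are given as nested lists of shape H × W × C and the mask as an H × W grid of 0/1 values; the function names `split_by_mask` and `mean_vector` are hypothetical and do not come from the released PST code.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def split_by_mask(
    features: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> Tuple[List[Vector], List[Vector]]:
    """Partition H x W x C features into foreground (S+) and background (S-) samples."""
    foreground: List[Vector] = []
    background: List[Vector] = []
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            # A pixel belongs to S+ when the mask is non-zero, otherwise to S-.
            (foreground if inside else background).append(list(vector))
    return foreground, background


def mean_vector(samples: Sequence[Vector]) -> Vector:
    """Average a sample set into one prototype (masked average pooling baseline)."""
    if not samples:
        raise ValueError("cannot average an empty sample set")
    count = len(samples)
    return [sum(channel) / count for channel in zip(*samples)]


# Toy 2 x 2 feature map with C = 2 channels (illustrative values only).
S = [[[1.0, 0.0], [3.0, 0.0]],
     [[0.0, 2.0], [0.0, 4.0]]]
M = [[1, 1],
     [0, 0]]

S_plus, S_minus = split_by_mask(S, M)
single_fg_prototype = mean_vector(S_plus)
single_bg_prototype = mean_vector(S_minus)
```

Averaging all of `S_plus` into one vector is the single-prototype baseline; PMMs instead fit several prototypes to the same sample set.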


Fig. 2. PST is made up of two branches, i.e., a support branch (top) and a query branch (bottom). The support image features $S$ with the corresponding ground-truth mask are leveraged to decompose the foreground prototypes $\mu_s^{+}$ and background prototypes $\mu_s^{-}$. The query image features $Q$ are activated by those prototypes for semantic segmentation in a duplex way, i.e., P-Conv and P-Match. In the training stage, the support prototypes $\mu_s^{+}$, $\mu_s^{-}$ are matched with the query prototypes $\mu_q^{+}$, $\mu_q^{-}$, which are decomposed from the query image features $Q$ in the same way as in the support branch. "ASPP" denotes atrous spatial pyramid pooling [5].

$S^{-}$ is used to learn background prototypes (PMMs$^{-}$) and $S^{+}$ is used to learn foreground prototypes (PMMs$^{+}$), which correspond to object parts (Fig. 6). The semantic decomposition procedure is based on the EM algorithm, which partitions the samples in the feature space into subsets. Without loss of generality, the models and learning procedure are defined for PMMs, which can be applied to either PMMs$^{-}$ or PMMs$^{+}$.

1) Modeling: To perform semantic decomposition, PMMs combine probabilities of base models to represent the semantic distribution of a class of objects using a probability mixture model as

$$p(s_i \mid \theta) = \sum_{k=1}^{K} w_k \, p_k(s_i \mid \theta) \tag{1}$$

where $\theta$ denotes the model parameters trained while estimating PMMs. $w_k$ is the mixture coefficient under the constraints $\sum_{k=1}^{K} w_k = 1$ and $0 \le w_k \le 1$. $s_i \in S$ is the $i$th sample (deep feature vector). $K$ denotes the number of base models and the $k$th of them is written as $p_k(s_i \mid \theta)$. Each base model is a probability model based on a kernel distance function, as

$$p_k(s_i \mid \theta) = \beta(\theta) \, e^{\mathrm{Kernel}(s_i, \mu_k)} \tag{2}$$

where $\mu_k \in \theta$ is the mean vector (prototype) of the $k$th model and $\beta(\theta)$ denotes the normalization constant. For the commonly used Gaussian mixture models (GMMs) with fixed covariance, the kernel function is defined as a radial basis function (RBF), $\mathrm{Kernel}(s_i, \mu_k) = -\|s_i - \mu_k\|_2^2$. For the von Mises–Fisher (VMF) model [44], the kernel function is defined as a cosine distance function, $\mathrm{Kernel}(s_i, \mu_k) = \mu_k^{T} s_i / (\|\mu_k\|_2 \|s_i\|_2)$. As validated in experiments, the cosine distance function can be better combined with our method. According to such vector distance, our PMMs are defined as

$$p_k(s_i \mid \theta) = \beta_c(\kappa) \, e^{\kappa \mu_k^{T} s_i} \tag{3}$$

where $\theta = \{\kappa, \mu\}$, $\beta_c(\kappa) = \kappa^{c/2-1} / \left((2\pi)^{c/2} I_{c/2-1}(\kappa)\right)$, and $I_\nu(\cdot)$ denotes the Bessel function. $\kappa$ denotes the constant parameter indicating distribution concentration, which is empirically set as $\kappa = 20$ in experiments.

2) Decomposition: The semantic decomposition procedure aims to estimate PMMs by the EM algorithm, which comprises two iterative steps, i.e., E-steps and M-steps. In each E-step, given the model parameters and the extracted sample features, the expectation of the sample $s_i$ is calculated as

$$E_{ik} = \frac{p_k(s_i \mid \theta)}{\sum_{k=1}^{K} p_k(s_i \mid \theta)} = \frac{e^{\kappa \mu_k^{T} s_i}}{\sum_{k=1}^{K} e^{\kappa \mu_k^{T} s_i}}. \tag{4}$$

Then, we use the above expectation to update the mean of the PMMs in each M-step, which can be defined as

$$\mu_k = \frac{\sum_{i=1}^{N} E_{ik} \, s_i}{\sum_{i=1}^{N} E_{ik}} \tag{5}$$

where $N$ denotes the number of samples. Note that we ignore the mixture coefficient $w_k$, which is set equal for all $k$, to equalize the importance of each prototype vector. As a consequence, each prototype vector becomes the center of a sample distribution, which, owing to the receptive field effect, stands for the areas around similar object parts.

C. Semantic Match

The semantic decomposition procedure partitions objects into parts, of which the semantics are represented with prototype vectors.
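The decomposition recapped above (the E-step of Eq. (4) and the M-step of Eq. (5)) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes that samples and prototypes are L2-normalized so that the inner product acts as the cosine kernel, that the prototypes are re-normalized after every M-step, and that the caller supplies the initial prototypes. The defaults `kappa=20.0` and `n_iter=10` follow the values reported in the paper; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_normalize(vector: Sequence[float]) -> Vector:
    """Scale a vector to unit length (assumption: cosine kernel on unit vectors)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def em_prototypes(
    samples: Sequence[Sequence[float]],
    init_prototypes: Sequence[Sequence[float]],
    kappa: float = 20.0,
    n_iter: int = 10,
) -> Tuple[List[Vector], List[List[float]]]:
    """Estimate K prototypes with equal mixture weights via EM (Eqs. 4 and 5)."""
    data = [l2_normalize(s) for s in samples]
    prototypes = [l2_normalize(m) for m in init_prototypes]
    dim = len(data[0])
    expectations: List[List[float]] = []
    for _ in range(n_iter):
        # E-step (Eq. 4): softmax of kappa * mu_k^T s_i over the K prototypes.
        expectations = []
        for s in data:
            logits = [kappa * dot(m, s) for m in prototypes]
            top = max(logits)  # subtract the maximum for numerical stability
            exps = [math.exp(z - top) for z in logits]
            total = sum(exps)
            expectations.append([e / total for e in exps])
        # M-step (Eq. 5): expectation-weighted mean of the samples.
        updated = []
        for k in range(len(prototypes)):
            weight = sum(row[k] for row in expectations)
            mean = [
                sum(row[k] * s[d] for row, s in zip(expectations, data)) / weight
                for d in range(dim)
            ]
            updated.append(l2_normalize(mean))  # assumption: keep unit length
        prototypes = updated
    return prototypes, expectations


# Toy foreground samples forming two "parts" in a 2-D feature space.
toy_samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_init = [[1.0, 0.2], [0.2, 1.0]]
mu, E = em_prototypes(toy_samples, toy_init)
```

Running the same routine on the background sample set would give the background prototypes, since the model is defined identically for PMMs$^{+}$ and PMMs$^{-}$.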

For the support image, the prototype vectors are denoted as $\mu_s^{+} = \{\mu_{sk}^{+}, k = 1, \ldots, K\}$ and $\mu_s^{-} = \{\mu_{sk}^{-}, k = 1, \ldots, K\}$, where the former represents foreground objects and the latter backgrounds, and $K$ denotes the number of prototypes, corresponding to the number of base models. In the training procedure, the prototype vectors for the query image are denoted as $\mu_q^{+} = \{\mu_{qk'}^{+}, k' = 1, \ldots, K\}$ and $\mu_q^{-} = \{\mu_{qk'}^{-}, k' = 1, \ldots, K\}$. The semantic match procedure, which aims to find the correspondence of the object part semantics, is defined for the foreground prototype vectors. Background prototypes are not used because the backgrounds of the support and query images are not consistent.

Fig. 3. Semantic matching using a min-cost flow module.

1) Min-Cost Flow Model: Matching of support and query prototypes is implemented with a min-cost flow model. This model is guaranteed to find optimal solutions for matching problems. When objects are decomposed into parts, the model can enforce the correct correspondence between object parts at lower computational cost. Given the foreground prototype vectors $\mu_s^{+} = \{\mu_{sk}^{+}, k = 1, \ldots, K\}$ and $\mu_q^{+} = \{\mu_{qk'}^{+}, k' = 1, \ldots, K\}$, we construct a flow model (Fig. 3), where $\mu_s^{+}$ and $\mu_q^{+}$ are defined as intermediate nodes. They are connected with the source and target nodes, respectively. The flow cost between each node from $\{\mu_{sk}^{+}\}$ to $\{\mu_{qk'}^{+}\}$ is defined as $1.0 - \mu_{sk}^{+} \cdot \mu_{qk'}^{+}$. The objective of the min-cost flow model is defined as

$$\min \sum_{k,k'} p_{kk'} \cdot \left(1.0 - \mu_{sk}^{+} \cdot \mu_{qk'}^{+}\right) \tag{6}$$

where $p_{kk'}$ denotes the flow between nodes $k$ and $k'$, which satisfies $0 \le p_{kk'} \le 1.0$. In addition, (6) is required to satisfy the following two constraints: 1) the output flow from the source node (src) is equal to the input flow of the target node (tgt): $\sum_{k} p_{\mathrm{src},k} = K$, $\sum_{k'} p_{k',\mathrm{tgt}} = K$; and 2) the input flow is equal to the output flow at nodes $k$ and $k'$: $p_{\mathrm{src},k} - \sum_{k'} p_{kk'} = 0$, $\sum_{k} p_{kk'} - p_{k',\mathrm{tgt}} = 0$. Given the prototypes and the min-cost flow model, the Dijkstra algorithm [45] is employed to solve for $p_{kk'}$. After solving, a larger $p_{kk'}$ implies higher matching confidence between the two nodes ($\mu_{sk}^{+}$ and $\mu_{qk'}^{+}$).

2) Matching Loss: As shown in Fig. 2, the foreground prototypes $\mu_s^{+}$ and $\mu_q^{+}$ are respectively convolved with the support and query features, $S$ and $Q$, and respectively produce $K$ probability maps, which are calculated by P-Conv and denoted as $P(\mu_{sk}^{+} \mid Q, \theta)$ and $P(\mu_{qk'}^{+} \mid Q, \theta)$. A cross-entropy loss, termed matching loss, is defined to drive the semantic correspondence between the support and the query images as

$$L_m(\theta) = \sum_{k,k'} p_{kk'} \cdot P(\mu_{sk}^{+} \mid Q, \theta) \log P(\mu_{qk'}^{+} \mid Q, \theta) \tag{7}$$

where $p_{kk'}$, defined in (6), measures the matching confidence of prototypes $\mu_{sk}^{+}$ and $\mu_{qk'}^{+}$. Optimizing the matching loss drives the learning of network parameters which facilitate semantic match between support and query images.

Fig. 4. Illustration of P-Match and P-Conv operations for semantic activation of query feature ($Q$).

D. Semantic Activation

As shown in Fig. 2, the estimated prototype vectors $\mu_s^{+} = \{\mu_{sk}^{+}, k = 1, \ldots, K\}$ and $\mu_s^{-} = \{\mu_{sk}^{-}, k = 1, \ldots, K\}$ are leveraged to activate query features for semantic segmentation.

1) P-Conv: On the one hand, prototypes incorporating discriminative information across feature channels can be used as classifiers. Such classifiers convolve with the query features $Q$ to produce probability maps $P(\mu_{sk} \mid Q, \theta) = \{P(\mu_{sk}^{+} \mid Q, \theta), P(\mu_{sk}^{-} \mid Q, \theta)\}$, as

$$P(\mu_{sk} \mid Q, \theta) = \text{P-Conv}(\mu_k^{+}, \mu_k^{-}, Q), \quad k = 1, \ldots, K. \tag{8}$$

As illustrated in Fig. 4, each prototype vector is multiplied with $Q$ in an elementwise manner. The probability maps $P(\mu_{sk} \mid Q, \theta)$ are then produced from the output maps using a Softmax operation across channels. After P-Conv, the probability maps $P(\mu_{sk}^{+} \mid Q, \theta)$, $k = 1, \ldots, K$, and $P(\mu_{sk}^{-} \mid Q, \theta)$, $k = 1, \ldots, K$, are summed into foreground and background probability maps, respectively, as

$$P(\mu_s^{+} \mid Q, \theta) = \sum_{k} P(\mu_{sk}^{+} \mid Q, \theta), \qquad P(\mu_s^{-} \mid Q, \theta) = \sum_{k} P(\mu_{sk}^{-} \mid Q, \theta). \tag{9}$$
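One way to read the P-Conv step of Eqs. (8) and (9) is sketched below. It rests on an interpretation that the text does not spell out: each prototype is applied to every query position as a 1 × 1 convolution (an inner product over channels), the Softmax is taken over the 2K resulting scores at each position, and the K foreground and K background probabilities are then summed. The function name `p_conv` and the nested-list layout (H × W × C) are hypothetical choices for illustration.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def p_conv(
    query: Sequence[Sequence[Sequence[float]]],
    fg_prototypes: Sequence[Sequence[float]],
    bg_prototypes: Sequence[Sequence[float]],
) -> Tuple[Grid, Grid]:
    """Return foreground and background probability maps (each H x W)."""
    prototypes = list(fg_prototypes) + list(bg_prototypes)
    num_fg = len(fg_prototypes)
    fg_map: Grid = []
    bg_map: Grid = []
    for row in query:
        fg_row: List[float] = []
        bg_row: List[float] = []
        for q in row:
            # Score of every prototype at this position (assumed 1 x 1 convolution).
            scores = [sum(p_c * q_c for p_c, q_c in zip(p, q)) for p in prototypes]
            top = max(scores)  # subtract the maximum for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Eq. (9): sum the per-prototype maps into foreground and background.
            fg_row.append(sum(probs[:num_fg]))
            bg_row.append(sum(probs[num_fg:]))
        fg_map.append(fg_row)
        bg_map.append(bg_row)
    return fg_map, bg_map


# Toy 1 x 2 query feature map with C = 2 and one prototype per class.
toy_query = [[[5.0, 0.0], [0.0, 5.0]]]
fg, bg = p_conv(toy_query, fg_prototypes=[[1.0, 0.0]], bg_prototypes=[[0.0, 1.0]])
```

Under this reading the two summed maps add up to one at every position, so they behave like a soft foreground/background mask.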


which are used to further activate the query image features step, the support/query features are classified into two sample
spatially as sets of foreground S + /Q + and background S − /Q − based on
the given ground-truth mask. PMMs are employed to perform
Q = P(μ+ −
s |Q, θ ) ⊕ P(μs |Q, θ ) ⊕ Q

(10)
semantic decomposition based on S + /Q + and S − /Q − and
where ⊕ indicates the operation of concatenation and Q the decomposed prototype vectors μ+ +
s and μq are matched
denotes the query features activate by the P-Match. using the min-cost flow model to produce semantic match
2) P-Match: On the other hand, considering that the positive loss Lm (θ ). Meanwhile, the prototype vectors μ+ −
s and μs
prototypes squeeze representative information μ+
s about object are used to activate features of the query image and predict
parts, we use them to activate Q, as segmentation result of that. During the backpropagation step,
the parameters θ are innovated to decrease the segmentation
Q = P-Match(μsk , P(μsk |Q, θ ), Q), k = 1, . . . , K (11)
loss Ls (θ ) which is implemented by the cross entropy at the
where μsk = {μ+ − +
sk , μsk }, P(μsk |Q, θ ) = {P(μsk |Q, θ ),
query branch. The overall training loss is defined as
−
P(μsk |Q, θ )}, and P-Match indicates the semantic activation
L(θ ) = Lm (θ ) + Ls (θ ). (12)
procedure which comprises prototype upsampling, probability
multiplication, feature addition, and convolution-based acti- During inference, the learned feature representation and
vation (Fig. 4). P-Match activates the object-related channels the prototype vectors of the support image(s) are used to
meanwhile it suppresses backgrounds-related ones, imple- activate the features of the query image using P-Match
menting a channelwise comparison, where the semantic infor- and P-Conv operations. The activated features are fed to a
mation of the completing object is combined with the features few convolutional layers to predict the segmentation mask.
in order to achieve more precise semantic segmentation. After Since the ground-truth mask of the query image is unavail-
P-Conv and P-Match, we strengthen the activated query fea- able, the semantic match procedure is not performed during
tures Q by ASPP, and then a convolutional layer is used as inference.
a classifier to predict the segmentation result (Fig. 2).
IV. E XPERIMENTS
Algorithm 1 Semantic Decomposition-Match-Activation A. Experimental Settings
Input:
1) Implementation Details: The baseline is conducted
Support and query images, masks Ms and Mq for support
by the CANet [16] method without iterative optimization
and query images;
where we construct the backbone network using VGG16 or
Output:
ResNet50 to extract features. In the training stage, Data aug-
Network parameters θ , prototypes μ+ −
s and μs for each
mentation includes random cropping, random resizing, random
support image;
horizontal flipping, and normalization [16]. We implement
for (each support/query image pair) do
our method using Pytorch and all experimental results are
Calculate support and query features S and Q; Divide S/Q
conducted on Nvidia 2080Ti GPUs. The network is opti-
into S + /Q + and S − /Q − based on Ms /Mq ;
mized by the SGD optimizer for 200 000 iterations with a
Semantic Decomposition:
cross-entropy loss. The initial learning rate is set to 0.0035 and
Estimate μ+ + + +
s /μq upon S /Q by the EM algorithm in
the momentum is set to 0.9. For each training step, eight pairs
Eqs. 4 and 5;
of support-query images are sent to the network, where the
Estimate μ− + −
s /μq upon S /Q
−
by the EM algorithm
categories are first formed from the training split, upon which
defined with Eqs. 4 and 5;
the support–query pairs are further sampled. The EM algo-
Semantic Match:
rithm iterates ten times to perform semantic decomposition
Construct a flow model and calculate the min-cost
and estimate PMMs.
match between μ+ +
s and μq , Eq. 6.
Calculate semantic matching loss L_m(θ), Eq. 7.
Semantic Activation:
Activate the query feature Q using P-Match and P-Conv defined upon prototypes μ+ and μ−, Fig. 4;
Predict a segmentation mask and calculate the segmentation loss;
Update θ so as to close the gap between semantic match loss L_m(θ) and segmentation loss L_s(θ).
end for

E. Few-Shot Segmentation
We further design an end-to-end segmentation model (Fig. 2), of which the training procedure (decomposition-match-activation) is depicted in Algorithm 1. During the feed-forward

2) Datasets: The proposed approach is evaluated on Pascal-5^i and COCO-20^i. Pascal-5^i is a dataset specified for few-shot semantic segmentation [14], which extends the Pascal VOC 2012 dataset with extra annotations from SDS [46]. The 20 object categories are divided into four splits, with one for testing and three for training. In the testing (inference) stage, we randomly sample 1000 support-query pairs from the test split [16]. Following [11], COCO-20^i is created from the MS-COCO 2017 dataset. Four splits are created from the 80 classes. Each split contains 20 categories, and the val dataset is used for evaluation. The cross-validation categories are described in Tables I and II.

TABLE I
Cross-Validation Classes for Pascal-5^i

TABLE II
Cross-Validation Classes for COCO-20^i

Fig. 5. Comparison of semantic activation maps by PST and the baseline method (CANet [16]). PST leverages PMMs to produce multiple probability maps, which facilitates activating and segmenting the complete object extent (first two rows) or multiple objects (last row). CANet, which uses only a single prototype to segment the object, tends to miss object parts. (Best viewed in color.)

3) Evaluation Metric: We evaluate the model performance by the mean intersection-over-union (mIoU). The IoU of each category is calculated by IoU = TP/(TP + FN + FP), where TP, FN, and FP represent the numbers of true positive, false negative, and false positive pixels in the masks of predicted segmentation, respectively.
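As a minimal sketch of the metric above, the following Python snippet accumulates TP, FN, and FP pixel counts per category and averages the per-category IoU. The function names, the nested-list mask format, and the choice to sum counts over all test pairs of a category before dividing are assumptions made for illustration; this section does not state how the counts are aggregated.

```python
from collections import defaultdict


def confusion_counts(pred, gt):
    """Count (TP, FN, FP) pixels for two binary masks given as nested lists."""
    tp = fn = fp = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif g:  # foreground pixel missed by the prediction
                fn += 1
            elif p:  # background pixel predicted as foreground
                fp += 1
    return tp, fn, fp


def mean_iou(samples):
    """samples: iterable of (category, pred_mask, gt_mask) triples.

    Assumption: counts are summed per category over all samples,
    then IoU = TP / (TP + FN + FP) is averaged over categories.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for category, pred, gt in samples:
        for i, count in enumerate(confusion_counts(pred, gt)):
            totals[category][i] += count
    if not totals:
        raise ValueError("no samples given")
    ious = {}
    for category, (tp, fn, fp) in totals.items():
        denom = tp + fn + fp
        ious[category] = tp / denom if denom else 0.0
    return sum(ious.values()) / len(ious), ious


# Toy example with two hypothetical categories.
samples = [
    ("cat", [[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # TP=1, FP=1 -> IoU 0.5
    ("dog", [[1, 1], [0, 1]], [[1, 1], [0, 1]]),  # perfect mask -> IoU 1.0
]
miou, per_class = mean_iou(samples)
print(miou, per_class)
```

Summing counts before dividing keeps small objects from dominating a category's score; averaging per-image IoU instead would be an equally simple variant.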

B. Model Analysis
In Fig. 5, probability maps produced by decomposed positive prototypes are visualized. PST decomposes the semantics into multiple probability maps, which facilitate activating the complete object extent (first two rows of Fig. 5). The advantage in representation capacity brought by semantic decomposition is that PST performs better, especially when segmenting multiple objects within the same image (last row of Fig. 5). By comparison, using only a single prototype to activate the object, as CANet does, tends to miss object parts or whole objects. The probability maps produced by PST validate our primary assumption, i.e., prototypes correlated with the semantics of object parts alleviate semantic misalignment.

In Fig. 6, we visualize the distributions of foreground pixels from support and query images. It can be seen that the samples of corresponding object parts have smaller distances while those of different object parts have larger distances. This validates that the semantic match module helps enforce the semantic correspondence between query and support images. Such correspondence can be learned by the network to strengthen its tolerance to object appearance/pose variation and improve the segmentation performance.

In Fig. 7, the compared segmentation results validate that PST (D&P-Match) can segment more target pixels and improve the recall rate. By further introducing background prototypes, PST (D&P-Conv) reduces the false positive pixels, which validates that the discriminative capability of the model can be improved by utilizing the background mixture models. With decomposition-and-match, PST (D&M) further improves the results by handling the variation of object appearance and pose.

Fig. 6. Illustration of semantic match by visualizing the distributions of foreground pixels from support and the query images. (Best viewed in color.)

Fig. 7. Semantic segmentation results. “Baseline” refers to the CANet method [16] without iterative optimization. (Best viewed in color.)

C. Ablation Study
1) Semantic Decomposition: In Table III, with semantic decomposition and P-Match modules, the segmentation performance is improved by 2.70% (54.63% versus 51.93%), which illustrates that multiple prototypes are more effective than a single prototype. Using both P-Match and P-Conv for semantic activation, the performance is further improved by 0.64% (55.27% versus 54.63%), which shows that the probability map calculated by the combination of foreground and background prototypes can further activate the features of interest spatially. In total, the semantic decomposition module improves the performance by 3.34% (55.27% versus 51.93%), which is a conspicuous margin for this challenging task.
2) Semantic Match: The semantic match module further improves the performance by 0.85% (56.12% versus 55.27%), which verifies the effectiveness of the semantic decomposition-and-match strategy. By using PMMs, the semantics of an object are decomposed into multiple prototypes, which, with min-cost flow match, handle the misalignment between objects in support and query images in an explainable way.
3) Number of Prototypes: In Table IV, experiments are conducted to evaluate the performance for different numbers of prototypes with P-Match activation only. K = 2 outperforms K = 1 significantly, which validates the plausibility of introducing mixture prototypes. Better performance occurs for K = 2, 3, and 4, with K = 3 reporting the best mean performance. As the number of prototypes increases beyond K = 3, the performance slightly decreases. The reason is that the semantic decomposition is performed on a limited number of feature samples within a single support/query image. Increasing the prototype number K leaves fewer samples to represent each prototype and therefore increases the overfitting risk.
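The trade-off in the prototype number K can be made concrete with a small sketch of mixture-prototype estimation. The snippet below runs a few EM-style iterations with a cosine-similarity (VMF-style) kernel and reports how many samples effectively support each prototype. It is an illustrative sketch only: the function names, the concentration value `kappa`, the evenly spaced initialization, and the toy 2-D features are all assumptions, not the configuration used in the experiments.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mixture_prototypes(features, k, kappa=20.0, iters=10):
    """EM-style estimate of k unit-norm prototypes with a cosine kernel.

    kappa (kernel concentration), the evenly spaced initialization, and
    the iteration count are illustrative assumptions.
    """
    feats = [normalize(f) for f in features]
    dim = len(feats[0])
    step = max(len(feats) // k, 1)
    protos = [feats[(j * step) % len(feats)] for j in range(k)]
    resp = []
    for _ in range(iters):
        # E-step: soft-assign each sample via exp(kappa * cosine similarity).
        resp = []
        for f in feats:
            logits = [kappa * sum(a * b for a, b in zip(f, p)) for p in protos]
            top = max(logits)
            weights = [math.exp(logit - top) for logit in logits]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: responsibility-weighted mean of the samples, re-normalized.
        protos = [
            normalize(
                [sum(r[j] * f[d] for f, r in zip(feats, resp)) for d in range(dim)]
            )
            for j in range(k)
        ]
    # Effective number of samples supporting each prototype.
    support = [sum(r[j] for r in resp) for j in range(k)]
    return protos, support


# Four toy 2-D "foreground features" forming two object parts.
features = [(1.0, 0.1), (1.0, -0.1), (0.1, 1.0), (-0.1, 1.0)]
for k in (1, 2, 4):
    protos, support = mixture_prototypes(features, k)
    print(k, [round(s, 2) for s in support])
```

Because the responsibilities of each sample sum to one, the per-prototype support always adds up to the number of samples, so a larger K necessarily spreads the same few features more thinly.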

4) Kernel Functions: In Table V, two kernel functions (Gaussian kernel versus VMF kernel) are compared for sample distance calculation when estimating PMMs. The VMF kernel reports better results, which means that the cosine similarity can be better combined with the PST method.
5) Match Strategy: In Table VI, two matching strategies, nearest neighbor match and min-cost flow, are compared. When using the nearest neighbor match, for each prototype j in the support set, we choose the nearest prototype k from the query set by $\arg\min_k \left(1.0 - \mu^{+}_{s_j} \cdot \mu^{+}_{q_k}\right)$. It can be seen that the min-cost flow strategy outperforms the nearest neighbor match, validating the effectiveness of the min-cost flow model.
6) Inference Speed: The increase of prototypes in PST does not significantly increase the model size or computational cost, as they are 1 × 1 × C dimensional vectors. The size of the PST model (19.5 M) is slightly larger than that of CANet [16] (19 M) but much smaller than that of OSLSM [14] (272.6 M). In the one-shot setting, with three prototypes, the inference speed of PST is 26 FPS on one 2080Ti GPU, which is slightly lower than that of CANet (29 FPS). The semantic match module is only applied in the training phase. Thereby, it does not increase the model size or decrease the inference speed.

TABLE III
Ablation Study. “Mean” Denotes Mean mIoU on Pascal-5^i. PST (D&P-Match) Refers to PST With P-Match Activation, Without Semantic Match. PST (P-Conv) Refers to PST With Decomposition and P-Conv Activation, Without Semantic Match. PST (D&M) Refers to PST With Semantic Decomposition and Semantic Match

TABLE IV
Performance (mIoU%) on Prototype Number K

TABLE V
Performance Comparison of Kernel Functions

TABLE VI
Comparison of Match Strategies. “Nearest” and “Min-Cost” Respectively Refer to the Nearest Neighbor and Min-Cost Flow Match

D. Performance
1) PASCAL-5^i: In Tables VII and VIII, the proposed PST approach outperforms all compared methods for the single-scale test setting, which demonstrates great advantages. It is comparable to, if not better than, the compared approaches for the multiscale test settings. For the one-shot setting with a VGG-16 backbone, PST achieves a 1.22% (53.12% versus 51.90%) performance improvement over the state-of-the-art. With a ResNet50 backbone, PST improves by 2.16% (56.12% versus 53.96%) compared with the CANet method [16], which is a significant margin. For the five-shot setting with a ResNet50 backbone, PST achieves a 0.49% (57.29% versus 56.80%) performance improvement over the state-of-the-art. With the VGG-16 backbone, PST is comparable with the state-of-the-art. Note that an additional k-shot fusion strategy is used by the PANet and FWB methods, while PST does not utilize any postprocessing strategy for five support images (five-shot setting).
2) MS COCO: In Table IX, PST is validated on the MS COCO dataset under the evaluation metric on COCO-20^i [11]. The baseline is implemented using CANet without iterative optimization. PST again achieves the state-of-the-art in both one-shot and five-shot settings. For the one-shot setting, it improves the baseline by 6.56% (32.67% versus 26.11%) and outperforms the PANet and FWB methods by 11.77% and 11.48%, respectively. For the five-shot setting, it improves the baseline by 9.63% (37.49% versus 27.86%) and outperforms the PANet and FWB methods by 7.79% and 13.87%, respectively, which are significant margins for this challenging problem. The MS COCO dataset, which has more training categories and images than PASCAL VOC, is advantageous for learning richer feature representations related to various object parts. Thereby, PST’s improvement on MS COCO is more significant.

TABLE VII
Performance of 1-Way 1-Shot Semantic Segmentation on Pascal-5^i. The Single-Scale Test Performance Is Obtained From github.com/icoz69/CaNet/issues/4 for a Fair Comparison. “*” Indicates Multiscale Test Performance

TABLE VIII
Performance of One-Way Five-Shot Semantic Segmentation on Pascal-5^i. “*” Indicates Multiscale Test Performance

TABLE IX
Performance of One-Shot and Five-Shot Semantic Segmentation on MS COCO. FWB Uses the ResNet101 Backbone While Other Approaches Use the ResNet50 Backbone

E. Few-Shot Object Detection
PST is generalized to few-shot object detection. Specifically, we choose the approach named feature reweighting (FSRW) [50] as the baseline. FSRW uses a prototype squeezed from the support image to reweight the features of the query image and uses few-shot ground-truth objects to train the object detector. Following the approach for few-shot segmentation, we replace the single prototype with five prototypes produced by PMMs and add the semantic match module (semantic matching loss) to establish correspondence between the prototypes. The main difference between few-shot detection and few-shot segmentation with PST is that the PMMs are generated under the guidance of an object bounding box instead of object masks. In the inference stage, the mixture prototypes produced by PST are used to replace the single prototype in FSRW for feature reweighting.

In Table X, we report the few-shot detection results on Pascal VOC following the experimental settings in FSRW [50]. For the one-shot setting, PST improves the mean average precision by 1.88% upon the novel classes and 1.02% upon the base classes. For the two-shot setting, it respectively improves by 4.63% and 3.15%. For the three-shot setting, it respectively improves by 3.37% and 1.37%. For the five-shot setting, it improves by 2.45% upon the novel classes. Such significant and consistent improvements demonstrate the effectiveness of PST for the few-shot object detection problem.

TABLE X
Performance of Few-Shot Object Detection on Pascal VOC Dataset
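The two strategies compared in the Match Strategy ablation (Table VI) can be contrasted with a small sketch. Nearest neighbor match picks, for each support prototype, the query prototype with the lowest cost 1 − cos, so several support prototypes may select the same query prototype. As a stand-in for min-cost flow, the sketch uses a brute-force one-to-one assignment that minimizes the total cost; this simplification, the function names, and the toy 2-D prototypes are assumptions for illustration and do not reproduce the flow formulation used by PST.

```python
import math
from itertools import permutations


def cosine_cost(u, v):
    """Matching cost 1 - cos(u, v); equals 1 - u.v for unit-norm prototypes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def nearest_neighbor_match(support, query):
    """For each support prototype j, pick argmin_k cost(j, k) independently."""
    return [
        min(range(len(query)), key=lambda k: cosine_cost(s, query[k]))
        for s in support
    ]


def one_to_one_match(support, query):
    """Minimum-total-cost one-to-one assignment, found by brute force.

    Stand-in for min-cost flow; assumes len(query) >= len(support) and is
    only practical for a handful of prototypes.
    """
    cost = [[cosine_cost(s, q) for q in query] for s in support]
    best = min(
        permutations(range(len(query)), len(support)),
        key=lambda perm: sum(cost[j][k] for j, k in enumerate(perm)),
    )
    return list(best)


def unit(degrees):
    """Unit vector in 2-D at the given angle (toy prototype)."""
    rad = math.radians(degrees)
    return (math.cos(rad), math.sin(rad))


support = [unit(0), unit(25)]
query = [unit(15), unit(80)]
print(nearest_neighbor_match(support, query))  # both support prototypes pick query 0
print(one_to_one_match(support, query))        # each query prototype is used once
```

In this toy case the nearest neighbor rule leaves one query prototype unmatched, whereas the global assignment accepts a slightly higher cost for one pair in order to keep a correspondence for every part.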

V. CONCLUSION
We have proposed PST to solve the semantic misalignment problem for few-shot segmentation. During training, PST incorporates rich channelwise and spatial semantics from limited support images while establishing correspondence between the semantics with a loss function. During inference, PST activates query image features with multiple prototypes to perform precise semantic segmentation. PST improved the performance of few-shot segmentation, in striking contrast with state-of-the-art approaches, especially on the large-scale MS COCO dataset. As a general method to capture the diverse semantics of object parts from few support samples, PST was extended to the few-shot object detection problem, demonstrating its general effectiveness for few-shot learning problems. In future work, the proposed approach can be extended to practical scenarios where object-part masks are given. In these scenarios, the semantic parts can be more precisely activated and the advantages of the PST approach would be further explored.

REFERENCES
[1] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. 18th Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., vol. 9351, 2015, pp. 234–241.
[2] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6230–6239.
[3] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu, “Expectation-maximization attention networks for semantic segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9166–9175.
[4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in Proc. ICLR, 2015, pp. 6230–6239.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[6] L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” CoRR, vol. abs/1706.05587, Jun. 2017.
[7] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 833–851.
[8] J.-J. Hwang et al., “SegSort: Segmentation by discriminative sorting of segments,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7334–7344.
[9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, Feb. 2020.
[10] P. Tokmakov, Y.-X. Wang, and M. Hebert, “Learning compositional representations for few-shot recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6372–6381.
[11] K. Nguyen and S. Todorovic, “Feature weighting and boosting for few-shot segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 622–631.
[12] X. Zhang, Y. Wei, Y. Yang, and T. Huang, “SG-One: Similarity guidance network for one-shot semantic segmentation,” CoRR, vol. abs/1810.09091, Jun. 2018.
[13] N. Dong and E. P. Xing, “Few-shot semantic segmentation with prototype learning,” in Proc. BMVC, 2018, p. 79.
[14] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots, “One-shot learning for semantic segmentation,” in Proc. Brit. Mach. Vis. Conf., 2017.
[15] F. Hao, F. He, J. Cheng, L. Wang, J. Cao, and D. Tao, “Collect and select: Semantic alignment metric learning for few-shot learning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8460–8469.
[16] C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen, “CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5217–5226.
[17] K. Wang, J. Liew, Y. Zou, D. Zhou, and J. Feng, “PANet: Few-shot image semantic segmentation with prototype alignment,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 622–631.
[18] W. Liu, C. Zhang, G. Lin, and F. Liu, “CRNet: Cross-reference networks for few-shot segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4165–4173.
[19] B. Yang, C. Liu, B. Li, J. Jiao, and Q. Ye, “Prototype mixture models for few-shot semantic segmentation,” in Proc. ECCV, 2020, pp. 763–778.
[20] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, Apr. 2017.
[21] X. Liang, L. Lin, Y. Wei, X. Shen, J. Yang, and S. Yan, “Proposal-free network for instance-level object segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2978–2991, Dec. 2018.
[22] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6399–6408.
[23] H. Chen, Y. Wang, G. Wang, and Y. Qiao, “LSTD: A low-shot transfer detector for object detection,” in Proc. AAAI, S. A. McIlraith and K. Q. Weinberger, Eds., 2018, pp. 2836–2843.
[24] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell, “Few-shot object detection via feature reweighting,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8419–8428.
[25] Y.-X. Wang, D. Ramanan, and M. Hebert, “Meta-learning to detect rare objects,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9924–9933.
[26] X. Yan, Z. Chen, A. Xu, X. Wang, X. Liang, and L. Lin, “Meta R-CNN: Towards general solver for instance-level low-shot learning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9576–9585.
[27] X.-S. Wei, P. Wang, L. Liu, C. Shen, and J. Wu, “Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 6116–6125, Dec. 2019.
[28] X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, and Y. Gong, “Few-shot class-incremental learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12180–12189.
[29] Y. Wang and M. Hebert, “Learning to learn: Model regression networks for easy small sample learning,” in Proc. ECCV, 2016, pp. 616–634.
[30] A. Srinivasan, A. Bharadwaj, M. Sathyan, and S. Natarajan, “Optimization of image embeddings for few shot learning,” in Proc. 10th Int. Conf. Pattern Recognit. Appl. Methods, 2021, pp. 236–242.
[31] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
[32] M. A. Jamal and G.-J. Qi, “Task agnostic meta-learning for few-shot learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 111719–111727.
[33] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proc. NeurIPS, 2016, pp. 3630–3638.
[34] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1199–1208.
[35] B. Li, B. Yang, C. Liu, F. Liu, R. Ji, and Q. Ye, “Beyond max-margin: Class margin equilibrium for few-shot object detection,” in Proc. IEEE CVPR, Mar. 2021.
[36] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3037–3046.
[37] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan, “Low-shot learning from imaginary data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7278–7286.
[38] W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang, “A closer look at few-shot classification,” in Proc. ICLR, 2019.
[39] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” in Proc. NeurIPS, 2017, pp. 4077–4087.
[40] J. Lu, S. Jin, J. Liang, and C. Zhang, “Robust few-shot learning for user-provided data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 4, pp. 1433–1447, Apr. 2021.
[41] N. Passalis, A. Iosifidis, M. Gabbouj, and A. Tefas, “Hypersphere-based weight imprinting for few-shot learning on embedded devices,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 2, pp. 925–930, Feb. 2021.
[42] H.-G. Jung and S.-W. Lee, “Few-shot learning with geometric constraints,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4660–4672, Nov. 2020.


[43] C. Zhang, Y. Cai, G. Lin, and C. Shen, “DeepEMD: Few-shot image classification with differentiable earth mover’s distance and structured classifiers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1203–12213.
[44] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, “Clustering on the unit hypersphere using von Mises-Fisher distributions,” J. Mach. Learn. Res., vol. 6, pp. 1345–1382, Sep. 2005.
[45] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Math., vol. 1, no. 1, pp. 269–271, Dec. 1959.
[46] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 991–998.
[47] K. Rakelly, E. Shelhamer, T. Darrell, A. A. Efros, and S. Levine, “Conditional networks for few-shot semantic segmentation,” in Proc. ICLR Workshop, 2018.
[48] W. Liu, C. Zhang, G. Lin, and F. Liu, “CRNet: Cross-reference networks for few-shot segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4164–4172.
[49] C. Zhang, G. Lin, F. Liu, J. Guo, Q. Wu, and R. Yao, “Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9586–9594.
[50] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell, “Few-shot object detection via feature reweighting,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8420–8429.

Boyu Yang received the B.S. degree from Wuhan University, Wuhan, China, in 2018. He is currently pursuing the Ph.D. degree with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China.
His research interests include computer vision and machine learning, specifically for semantic segmentation and few-shot learning.

Fang Wan received the B.S. degree from Wuhan University, Wuhan, China, in 2013, and the Ph.D. degree from the University of Chinese Academy of Sciences (UCAS), Beijing, China, in 2019.
Since 2019, he has been a Post-Doctoral Researcher with the School of Computer Science and Technology, UCAS. He has published 15 articles in refereed conferences and journals, including the IEEE Computer Vision and Pattern Recognition (CVPR), the International Conference on Computer Vision (ICCV), and the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). His research interests include weakly supervised learning, active learning, and visual object detection.

Chang Liu received the B.S. degree from Jilin University, Jilin, China, in 2012. He is currently pursuing the Ph.D. degree with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China.
He has published more than 10 articles in refereed conferences, including the European Conference on Computer Vision (ECCV), the IEEE Computer Vision and Pattern Recognition (CVPR), and the International Conference on Computer Vision (ICCV). His research interests include neural architecture design and self-supervised learning.

Bohao Li received the B.S. degree from Wuhan University, Wuhan, China, in 2020. He is currently pursuing the master’s degree with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China.
His research interests include computer vision and machine learning, specifically for few-shot learning.

Xiangyang Ji (Member, IEEE) received the B.S. degree in materials science and the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2008.
He joined Tsinghua University, Beijing, in 2008, where he is a Professor with the Department of Automation, School of Information Science and Technology. He has authored over 100 refereed conference and journal articles. His research interests include image/video compressing and intelligent imaging.

Qixiang Ye (Senior Member, IEEE) received the B.S. and M.S. degrees from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2006.
He has been a Professor with the University of Chinese Academy of Sciences, Beijing, since 2009, and was a Visiting Assistant Professor with the Institute of Advanced Computer Studies (UMIACS), University of Maryland at College Park, College Park, MD, USA, until 2013. He has published more than 100 articles in refereed conferences and journals, including the IEEE Computer Vision and Pattern Recognition (CVPR), the International Conference on Computer Vision (ICCV), the European Conference on Computer Vision (ECCV), the Conference on Neural Information Processing Systems (NeurIPS), the IEEE Transactions on Neural Networks and Learning Systems (TNNLS), and the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). His research interests include visual object detection and machine learning.
