MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation
I. INTRODUCTION
Semantic segmentation is a key task in computer vision, where each pixel of an image is labeled as part of a specific category. This is important in many areas like OCR, autonomous driving, medical imaging, and scene understanding [1]–[3]. To perform this task well, models need to learn detailed object boundaries. In recent years, deep Convolutional Neural Networks (CNNs) have made big improvements in this area [4]. However, these high-performing models usually need large datasets with lots of labeled examples [5], [6], which take a lot of time and effort to create. In real-world scenarios, like medical imaging or other fields where labeled data is limited, this becomes a big problem [7], [8]. To solve this, Few-shot Semantic Segmentation (FSS) has become a useful approach.

Fig. 1. Comparison among existing methods and our proposed method for FSS. (a) Prototype-based methods; (b) Pixel-wise approach; (c) Cross-domain multi-scale decoder with transformer-guided prototyping.

FSS tries to segment new object classes in images using only a few labeled examples, called support images, that show the target class [9]. This method helps reduce the need for large datasets, making it more practical for real-world use [10]. Addressing the challenges of FSS requires handling differences in texture or appearance between the target object in the query image and similar objects depicted in the support examples. Effectively using the relationship between the query image and the support examples is essential in tackling FSS.

FSS can be broadly categorized into two groups: prototype-based approaches and pixel-wise methods. As shown in Figure 1-(a), prototype-based approaches involve abstracting semantic features of the target class from support images through a shared backbone network [11]. This process results in feature vectors called class-wise prototypes, which are obtained using techniques such as class-wise average pooling or clustering. These prototypes are then combined with query features through operations like element-wise summation or channel-wise concatenation. The combined features are refined by a
decoder module to classify each pixel as either the target class or background [12]. In contrast, as shown in Figure 1-(b), pixel-wise methods take a different approach by focusing directly on pixel-level information rather than compressing it into prototypes. These methods aim to predict the target class for each pixel in the query image by comparing it directly with corresponding pixels in the support images. To achieve this, they establish pixel-to-pixel correlations between the support and query features, which allows the model to find precise matches even when the object's appearance varies [13]. This process is often enhanced by attention mechanisms, like those found in Transformer models, which help the model focus on important relationships between pixels. By avoiding the need for prototypes, pixel-wise methods aim to preserve more detailed information, allowing for finer-grained segmentation [14], [15].
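To make the contrast concrete, the following is a minimal PyTorch sketch of the two families described above. The tensor sizes and the specific similarity measures are illustrative assumptions, not the implementation of any particular method.

```python
import torch
import torch.nn.functional as F

# Assumed shapes (illustrative): one episode, 256 channels, 60x60 feature maps.
feat_s = torch.randn(1, 256, 60, 60)                # support features from a shared backbone
feat_q = torch.randn(1, 256, 60, 60)                # query features from the same backbone
mask_s = (torch.rand(1, 1, 60, 60) > 0.5).float()   # binary support mask (already downsampled)

# (a) Prototype-based: abstract the support object into one class-wise vector
# via masked average pooling, then compare it with every query location.
prototype = (feat_s * mask_s).sum(dim=(2, 3)) / mask_s.sum(dim=(2, 3)).clamp(min=1e-6)  # [1, 256]
proto_sim = F.cosine_similarity(feat_q, prototype[:, :, None, None].expand_as(feat_q), dim=1)  # [1, 60, 60]

# (b) Pixel-wise: keep every support pixel and build a dense correlation
# between all query locations and all (foreground) support locations.
q = feat_q.flatten(2).transpose(1, 2)               # [1, HWq, 256]
s = (feat_s * mask_s).flatten(2).transpose(1, 2)    # [1, HWs, 256]
correlation = torch.einsum('bqc,bsc->bqs', F.normalize(q, dim=-1), F.normalize(s, dim=-1))
pixel_sim = correlation.max(dim=-1).values.view(1, 60, 60)  # best-matching support pixel per query pixel
```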
While both groups have demonstrated efficacy, they also have certain limitations. Prototype-based methods may inadvertently discard complex local semantic features specific to the target class in support images. This can lead to coarse segmentation of the target class in query images, especially for objects with complex appearances. On the other hand, while pixel-wise methods have notably improved performance compared to prototype-based approaches, they grapple with computational complexity due to dot-product attention calculations across all pixels of the support and query features. Moreover, a large amount of pixel-wise support information can lead to confusion in attention mechanisms [13]. A further limitation shared by both approaches is that the decoder makes little use of the encoder's intermediate features: many methods in both categories employ straightforward decoders that fail to incorporate them. However, in few-shot scenarios where data samples are limited, leveraging the global features captured by the encoder in the decoder phase can prove to be highly beneficial.
Inspired by recent developments, we aim to develop a straightforward and effective framework that addresses these limitations of FSS methods. A notable approach gaining traction is the Query-based¹ Transformer architecture, which has demonstrated versatility across various computer vision tasks, including few-shot learning scenarios [16], [17]. This architecture utilizes learnable Query embeddings derived from support prototypes, enabling nuanced analysis of their relationships within the query feature map.

¹ To differentiate it from the conventional term "query" frequently employed in FSS, we capitalize "Query" when referring to the query sequence within the Transformer architecture.
Inspired by previous works, as shown in Figure 1-(c), we have designed a novel Transformer-based module, the Spatial Transformer Decoder (STD), to enhance the relational understanding between the support images and the query image. This module operates concurrently with the multi-scale decoder. Within the STD module, we adopt a simple strategy: the prototype of the support images serves as the Query, while the features extracted from the query image serve as both the Key and Value embeddings fed into the Transformer decoder. This formulation allows the Query to focus effectively on the semantic features of the target class within the query image. Furthermore, to reduce the impact of the information loss caused by abstracting the support images into a single feature vector, the 'support prototype,' we integrate global features from the intermediate stages of the encoder, which is fed with the support images, into our decoder. Incorporating these features allows us to leverage features from different stages of the encoder, thereby enriching the decoder's contextual understanding. Additionally, we introduce the Contextual Mask Generation Module (CMGM) to further augment the model's relational understanding; it operates alongside the STD and enhances the model's capacity to capture relevant contextual information.
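As a rough illustration of this Query/Key/Value arrangement, the sketch below wires a support prototype into a standard cross-attention layer as the Query, with flattened query-image features as the Key and Value. The module name, dimensions, and layer layout are assumptions for illustration only, not the authors' STD implementation.

```python
import torch
import torch.nn as nn

class SpatialTransformerDecoderSketch(nn.Module):
    """Illustrative only: the support prototype acts as the Transformer Query,
    while flattened query-image features provide the Key and Value."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, support_prototype, query_feat):
        # support_prototype: [B, C]; query_feat: [B, C, H, W]
        B, C, H, W = query_feat.shape
        kv = query_feat.flatten(2).transpose(1, 2)           # [B, HW, C] -> Key and Value
        q = support_prototype.unsqueeze(1)                   # [B, 1, C]  -> Query
        attended, _ = self.attn(query=q, key=kv, value=kv)   # prototype attends to query pixels
        q = self.norm1(q + attended)
        q = self.norm2(q + self.ffn(q))
        return q.squeeze(1)                                  # refined, class-aware embedding

std = SpatialTransformerDecoderSketch()
out = std(torch.randn(2, 256), torch.randn(2, 256, 60, 60))  # -> [2, 256]
```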
II. RELATED WORKS

A. Semantic Segmentation

Semantic segmentation, a crucial task in computer vision, involves labeling each pixel in an image with a corresponding class [18], [19]. CNNs significantly advanced semantic segmentation by replacing fully connected layers with convolutional layers, enabling the processing of images of various sizes [20]. Since then, subsequent advancements have focused on enhancing the receptive field and aggregating long-range context in feature maps. Techniques such as dilated convolutions [21], spatial pyramid pooling [22], and non-local blocks [23] have been employed to capture contextual information at multiple scales. More recently, Transformer-based backbones, including SegFormer [24], Segmenter [25], and SETR [26], have been introduced to better capture long-range context in semantic segmentation tasks. Further enhancing this line of work, hierarchical architectures like the Swin Transformer [27] have achieved state-of-the-art performance by using shifted windows in their general-purpose backbones. In parallel, self-supervised pretraining strategies, such as the masked image modeling used in BEiT [28], have also shown strong results, fine-tuning directly on the semantic segmentation task and pushing the boundaries of model performance.

Semantic segmentation tasks typically involve per-pixel classification, as demonstrated by approaches like MaskFormer [29] and Mask2Former [30], which predict binary masks corresponding to individual class labels. Older architectures, such as UNet [31], PSPNet [32], and DeepLab [33], [34], have also significantly contributed to the field by incorporating features like global and local context aggregation and dilated convolutions to increase the receptive field without reducing resolution. Building upon these foundational approaches, more recent studies, including CRGNet [35] and SAM [36], have focused on further improving model performance, exploring new techniques to enhance accuracy in segmentation tasks. Despite the progress made in per-pixel classification, addressing the challenge of segmenting unseen classes remains an open area for future research.

B. Few-Shot Semantic Segmentation

FSS is a challenging task in computer vision, wherein the objective is to segment images with only a limited number of annotated examples, known as support images. Approaches to FSS can be categorized into various groups based on their primary aims
and methodologies employed [37]. One significant challenge in FSS is addressing the imbalance in detail between support and query images. Methods like PGNet [38] and PANet [39] aim to eliminate inconsistent regions between support and query images by associating each query pixel with relevant parts of the support image, or by regularizing the network so that it succeeds regardless of the roles of support and query. Methods like ASGNet [37], on the other hand, focus on finding an adaptive number of prototypes and their spatial extents determined by image content, utilizing a boundary-aware superpixel algorithm.

Another critical aspect of FSS is bridging the inter-class gap between base and novel datasets. Approaches like RePRI [40] and CWT [41] address this gap by fine-tuning on the support images or by episodically training self-attention blocks to adapt classifier weights during both training and testing. Additionally, architectures designed for supervised learning often have trouble recognizing objects at different scales in few-shot scenarios. To address this issue, new methods have been developed to allow information exchange between different resolutions [42].

Moreover, ensuring the reliability of correlations between support and query images is essential in FSS. Methods like HSNet [43] and CyCTR [44] utilize attention mechanisms to filter out erroneous support features and focus on beneficial information. VAT [45], meanwhile, employs a cost aggregation network to aggregate information between query and support features, leveraging a high-dimensional Swin Transformer to impart local context to all pixels.
Overall, the field of FSS is advancing rapidly, with innovative methods aimed at enhancing model performance and overcoming the challenges of adapting segmentation models to novel classes with limited annotated data. These efforts are driven by the ongoing need to improve the effectiveness and versatility of segmentation models in real-world applications.

III. PROPOSED METHOD

A. Problem Definition

In FSS, the task involves segmenting images belonging to novel classes with limited annotated data. We operate with two datasets, D_train and D_test, associated with class sets C_train and C_test, respectively. Notably, these class sets are disjoint (C_train ∩ C_test = ∅), ensuring that there is no overlap between the classes in the training and test datasets. Each training episode consists of a support set S and a query set Q, where S includes a set of k support images along with their corresponding binary segmentation masks, while Q contains a single query image. The model is trained to predict the segmentation mask for the query image based on the support set.

Both D_train and D_test consist of a series of randomly sampled episodes (an episode is a set comprising support images and a query image; each epoch may contain many episodes, e.g., 1000, each with its own support and query images). During training, the model learns to predict the segmentation mask for the query image based on the support set. Similarly, during testing, the model's performance is evaluated on the D_test dataset, where it predicts the segmentation masks for query images from the test set using the knowledge learned during training.

Overall, the goal of FSS is to develop a model that can accurately segment images from novel classes with only a few annotated samples, demonstrating robust generalization capabilities across different datasets and unseen classes.
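The episodic protocol described above can be pictured with a small sketch of an episode sampler. The dataset interface (a mapping from class to annotated images) is a hypothetical assumption; only the sampling logic reflects the description above.

```python
import random

# Illustrative episode sampler; `dataset` maps each class to a list of
# (image, binary_mask) pairs and is assumed to contain at least k+1 samples per class.
def sample_episode(dataset, classes, k_shot=1):
    """Return a k-shot support set and a single query image for one episode class."""
    cls = random.choice(classes)                 # pick the episode's target class
    picks = random.sample(dataset[cls], k_shot + 1)
    support = picks[:k_shot]                     # S: k (image, mask) pairs
    query_image, query_mask = picks[k_shot]      # Q: one image; its mask is ground truth only
    return support, query_image, query_mask

# Training episodes draw from C_train, evaluation episodes from the disjoint C_test, e.g.:
# support, q_img, q_gt = sample_episode(train_data, c_train, k_shot=5)
```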
B. Overview

Given a support set S = {I_s^i, M_s^i} and a query image I_q, the objective is to generate the binary segmentation mask for I_q, identifying the same class as the support examples. To address this task, we introduce a straightforward yet robust framework, outlined in Figure 2. For simplicity, we illustrate a 1-shot setting within the framework, but it can easily be generalized to the 5-shot setting as well. The proposed method comprises several key components: a shared pretrained backbone, the support prototype, the CMGM, a multi-scale decoder, and the STD. These elements collectively contribute to the model's ability to accurately segment objects of interest in the query image based on the contextual information provided by the support set. In the following, we take a closer look at each component, explaining its role and how it interacts within our framework.

1) Backbone: In our proposed framework, we adopt a modified ResNet architecture, initially pre-trained on the ImageNet dataset, to serve as the backbone for feature extraction from raw input images, ensuring that the output of each block does not shrink below a specified size. For instance, like [46], we define that the output sizes from conv2_x to conv5_x are maintained at 60 × 60 pixels. Specifically, we utilize a ResNet with shared weights between the support and query images. This ResNet maintains the spatial resolution of the feature maps at 60 × 60 pixels from the conv2_x stage onward, preserving finer details crucial for accurate segmentation. We extract high-level features (conv5_x) as well as mid-level features (conv3_x and conv4_x) from both support and query images using the backbone.

The mid-level features of the support image are denoted as X_s^conv3 and X_s^conv4, while the high-level features are denoted as X_s^conv5. Similarly, for the query image, the mid-level features are represented as X_q^conv3 and X_q^conv4, and the high-level features as X_q^conv5. To integrate mid-level features across different stages, we concatenate the mid-level feature maps from the conv3_x and conv4_x stages and apply a 1 × 1 convolution layer to yield a merged mid-level feature map, denoted as X_s^merged. This merging process ensures that the resultant feature map retains essential information from both mid-level stages, enhancing the model's ability to capture diverse contextual information (Equation 1, Equation 2).

X_s^merged = C_1×1(Cat(X_s^conv3, X_s^conv4))    (1)

X_q^merged = C_1×1(Cat(X_q^conv3, X_q^conv4))    (2)

where Cat denotes concatenation along the channel dimension and C_1×1 denotes the 1 × 1 convolution operation. These equations illustrate the process of merging mid-level features from different stages of the backbone network, resulting in a combined mid-level feature map that retains crucial information from both stages.
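Equations (1) and (2) amount to a channel-wise concatenation followed by a 1 × 1 convolution, sketched below in PyTorch. The channel sizes (512 and 1024 for conv3_x and conv4_x, 256 for the merged map) are assumptions based on a standard ResNet, not values taken from the paper.

```python
import torch
import torch.nn as nn

class MidLevelMerge(nn.Module):
    """Minimal sketch of Equations (1)-(2): Cat(.,.) along channels, then C_{1x1}."""
    def __init__(self, c3=512, c4=1024, out_channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(c3 + c4, out_channels, kernel_size=1)  # the 1x1 convolution

    def forward(self, x_conv3, x_conv4):
        merged = torch.cat([x_conv3, x_conv4], dim=1)  # concatenate conv3_x and conv4_x maps
        return self.reduce(merged)                     # X^merged

merge = MidLevelMerge()
x_s_merged = merge(torch.randn(1, 512, 60, 60), torch.randn(1, 1024, 60, 60))  # -> [1, 256, 60, 60]
```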
The decision to employ this modified ResNet architecture is grounded in its ability to balance computational efficiency with feature representation. By maintaining the feature map size at 60 × 60 pixels, the backbone effectively captures detailed spatial information while avoiding excessive computational overhead. This approach strikes a pragmatic balance between model complexity and segmentation performance, making it well-suited to our few-shot segmentation task, where computational efficiency is paramount.
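One common way to obtain fixed-resolution stage outputs like this is to replace the later strides of a ResNet with dilation, as in DeepLab-style backbones; the sketch below shows this with torchvision. Whether this matches the authors' exact backbone modification is an assumption of the sketch.

```python
import torch
from torchvision.models import resnet50

# Replace the strides of layer3/layer4 with dilation so their outputs keep the
# spatial size of layer2 (60x60 for a 473x473 input).
backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])

x = torch.randn(1, 3, 473, 473)
x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
c2 = backbone.layer1(x)    # conv2_x
c3 = backbone.layer2(c2)   # conv3_x -> 60x60
c4 = backbone.layer3(c3)   # conv4_x (dilated, stays 60x60)
c5 = backbone.layer4(c4)   # conv5_x (dilated, stays 60x60)
print(c3.shape, c4.shape, c5.shape)
```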
2) Support Prototype: In our proposed framework, the Support Prototype serves as a condensed representation of the mid-level features extracted from the support example (X_s^merged). The Support Prototype is obtained by applying a Masked Average Pooling (MAP) operation, which selectively aggregates information based on the support mask. Mathematically, the Support Prototype P_s is defined in Equation 3.

[...] When there are five support examples, five cosine similarities are computed and subsequently averaged, yielding a novel cosine similarity measure representative of the collective support set.
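Equation 3 itself is not reproduced in this excerpt, but masked average pooling is a standard operation; a minimal sketch consistent with the description above follows (tensor shapes are assumptions).

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(x_s_merged, support_mask):
    """Sketch of the MAP operation: average the merged support features over the
    locations marked as foreground by the support mask."""
    # x_s_merged: [B, C, H, W]; support_mask: [B, 1, h, w] with values in {0, 1}
    mask = F.interpolate(support_mask, size=x_s_merged.shape[-2:], mode='nearest')
    prototype = (x_s_merged * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
    return prototype  # P_s: [B, C]

p_s = masked_average_pooling(torch.randn(1, 256, 60, 60), torch.ones(1, 1, 473, 473))
```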
TABLE I
Performance on PASCAL-5i in terms of mIoU and FB-IoU. Numbers in bold represent the best performance, while underlined values denote the second-best performance.
We utilized the Adam optimizer with a fixed learning rate of 10^-3. All input images were resized to 473 × 473 pixels, and the training batch size was set to 32 for the 1-shot setting and 16 for the 5-shot setting. Our training pipeline did not incorporate any data augmentation strategies. After prediction, the binary segmentation masks were resized to match the original dimensions of the input images for evaluation. To ensure robustness and mitigate the effects of randomness, we averaged the results of three trials conducted with different random seeds. All experiments were performed on an NVIDIA RTX 4090 GPU.
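For reference, the reported settings can be summarized in a small configuration sketch; `model` is a stand-in placeholder, and nothing beyond the hyperparameters stated above should be read into it.

```python
import torch

model = torch.nn.Conv2d(3, 2, kernel_size=1)  # placeholder for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # fixed learning rate of 10^-3

config = {
    "input_size": (473, 473),
    "batch_size": {"1-shot": 32, "5-shot": 16},
    "data_augmentation": None,  # no augmentation was used
    "trials": 3,                # results averaged over three random seeds
}
```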
C. Evaluation Metrics

We employ the following evaluation metrics to assess the performance of our proposed method.

Mean Intersection over Union (mIoU). mIoU is a widely used metric for evaluating segmentation performance. It calculates the average intersection over union (IoU) across all classes in the target dataset (Equation 5).

mIoU = (1/C) Σ_{i=1}^{C} IoU_i    (5)
TABLE II
Performance on COCO-20i in terms of mIoU and FB-IoU. Numbers in bold represent the best performance, while underlined values denote the second-best performance.
Here, C represents the number of classes in the target fold, and IoU_i denotes the intersection over union for class i.

Foreground-Background IoU (FB-IoU). FB-IoU measures the intersection over union for the foreground and background classes only. While FB-IoU provides insight into the model's ability to distinguish between foreground and background regions, we primarily focus on mIoU as our main evaluation metric due to its more comprehensive assessment of segmentation performance.
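A minimal sketch of how these two metrics can be computed for a single prediction is given below; in practice the intersections and unions are accumulated over the whole test set before dividing, which this sketch omits.

```python
import numpy as np

def iou(pred, gt, cls):
    """IoU of one class between integer label maps `pred` and `gt` (numpy arrays)."""
    inter = np.logical_and(pred == cls, gt == cls).sum()
    union = np.logical_or(pred == cls, gt == cls).sum()
    return inter / union if union > 0 else 0.0

def mean_iou(pred, gt, classes):
    """Equation (5): average the per-class IoU over the C classes of the fold."""
    return sum(iou(pred, gt, c) for c in classes) / len(classes)

def fb_iou(pred, gt):
    """FB-IoU: average IoU of the background (0) and foreground (1) classes."""
    return 0.5 * (iou(pred, gt, 0) + iou(pred, gt, 1))
```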
D. Comparison with SOTA

In this subsection, we compare our proposed method with several SOTA methods on both the PASCAL-5i and COCO-20i datasets. We present the results in Table I and Table II, respectively, where we report the mIoU scores under both 1-shot and 5-shot settings, along with the final FB-IoU value. The results of the other methods are taken from their respective original papers.

Results on the PASCAL-5i dataset. As shown in Table I, our proposed method, using ResNet50 and ResNet101 backbones, consistently surpasses SOTA methods in both 1-shot and 5-shot scenarios across all four folds of the PASCAL-5i dataset. Notably, our method achieves among the highest performance across all folds.

Results on the COCO-20i dataset. Similarly, Table II presents the results on the COCO-20i dataset, where our proposed method demonstrates superior performance with both ResNet50 and ResNet101 backbones in both 1-shot and 5-shot settings. Our method consistently outperforms all competing approaches across all four folds of the COCO-20i dataset, achieving either first or second rank: we obtain the highest mIoU scores in several folds and secure the second rank in the others. Notably, our method also exhibits superior performance in terms of mean mIoU and FB-IoU, further emphasizing its effectiveness and robustness.

It is important to highlight that our proposed method maintains a remarkably low number of learnable parameters, with only 1.5 million. This stands in stark contrast to some SOTA methods, whose parameter counts are significantly higher, exceeding 40 million in certain cases. This demonstrates the efficiency and effectiveness of our approach in achieving superior segmentation performance while maintaining a compact model architecture.

E. Cross-dataset task

In this study, we investigate the cross-domain generalization capability of our proposed few-shot segmentation method through rigorous domain-shift testing. Specifically, we trained our model on the COCO-20i dataset and conducted testing on the PASCAL-5i dataset to evaluate its adaptability across different datasets and domain settings.
TABLE III
Few-shot segmentation performance on the cross-dataset task "COCO-20i → PASCAL-5i" in terms of mIoU, with different backbones (ResNet-50 and ResNet-101). Numbers in bold represent the best performance, while underlined values denote the second-best performance.
Backbone   Method          Publication   1-shot: fold0 fold1 fold2 fold3 mean   5-shot: fold0 fold1 fold2 fold3 mean
ResNet50   PFENet [47]     TPAMI20       43.2  65.1  66.6  69.7  61.1           45.1  66.8  68.5  73.1  63.4
ResNet50   RePRI [40]      CVPR21        52.2  64.3  64.8  71.6  63.2           56.5  68.2  70.0  76.2  67.7
ResNet50   HSNet [43]      ICCV21        45.4  61.2  63.4  75.9  61.6           56.9  65.9  71.3  80.8  68.7
ResNet50   HSNet-HM [61]   ECCV22        43.4  68.2  69.4  79.9  65.2           50.7  71.4  73.4  83.1  69.7
ResNet50   VAT-HM [61]     ECCV22        68.3  64.9  67.5  79.8  65.1           55.6  68.1  72.4  82.8  69.7
ResNet50   RTD [62]        ECCV22        57.4  62.2  68.0  74.8  65.6           65.7  69.7  70.8  75.0  70.1
ResNet50   PMNet [9]       WACV24        68.8  70.0  65.1  62.3  66.6           73.9  74.5  73.3  72.1  73.4
ResNet50   MSDNet (ours)   -             70.7  73.2  71.1  73.2  72.1           72.5  75.0  73.8  75.5  74.2
ResNet101  HSNet [43]      ICCV21        47.0  65.2  67.1  77.1  64.1           57.2  69.5  72.0  82.4  70.3
ResNet101  HSNet-HM [61]   ECCV22        46.7  68.6  71.1  79.7  66.5           53.7  70.7  75.2  83.9  70.9
ResNet101  RTD [62]        ECCV22        59.4  64.3  70.8  72.0  66.6           67.2  72.7  72.0  78.9  72.7
ResNet101  PMNet [9]       WACV24        71.0  72.3  66.6  63.8  68.4           75.2  76.3  77.0  72.6  75.3
ResNet101  MSDNet (ours)   -             71.6  75.6  73.0  75.2  73.9           71.5  79.6  76.4  77.9  76.4
The COCO-20i dataset used in our experiments was modified to exclude classes, and their associated images, that overlap with those present in PASCAL-5i. This adaptation ensured that the training process focused on distinct visual concepts, thereby enhancing the model's exposure to novel classes during testing.
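The filtering step can be sketched as follows; the class names and annotation format are hypothetical placeholders, not the actual fold definitions used in the paper.

```python
# Keep only COCO images whose annotated classes do not overlap PASCAL-5i,
# so that the cross-dataset test classes remain unseen during training.
pascal_classes = {"person", "bus", "dog"}          # hypothetical PASCAL-5i classes
coco_annotations = [
    {"image": "coco_0001.jpg", "classes": {"dog", "kite"}},
    {"image": "coco_0002.jpg", "classes": {"zebra"}},
]

filtered = [a for a in coco_annotations if not (a["classes"] & pascal_classes)]
print([a["image"] for a in filtered])  # -> ['coco_0002.jpg']
```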
For our experiments, we adopted a cross-dataset evaluation protocol in which models trained on each fold of COCO-20i were repurposed for testing on the entire PASCAL-5i dataset. Notably, during training the model was exposed only to specific classes within COCO-20i, ensuring no overlap with the classes present in PASCAL-5i. This setup effectively simulates a scenario where the model encounters novel classes during testing that were not part of its training curriculum. For instance, in the fold-0 setting, the model was exclusively [...]

The results in Table IV show the impact of each component of the proposed method. The first row of Table IV represents the performance of the baseline model, consisting solely of the backbone architecture and the support prototype mechanism. Subsequent rows introduce additional components incrementally, including the CMGM, STD, and multi-scale decoder.

TABLE IV
The impact of each component on segmentation performance on the COCO-20i dataset (1-shot).

Baseline  CMGM  STD  Multi-Scale Decoder   fold0  fold1  fold2  fold3  mean   FB-IoU
   ✓       -     -          -              30.1   34.2   33.4   33.8   32.9   59.7
   ✓       ✓     -          -              31.5   35.9   34.8   34.2   34.1   60.8
   ✓       ✓     ✓          -              43.0   45.2   43.1   41.4   43.2   67.6
   ✓       ✓     ✓          ✓              43.7   49.1   46.9   46.2   46.5   70.4
Fig. 5. Qualitative comparison of component effects on the COCO-20i dataset in the 1-shot scenario.
TABLE V
The impact of the number of residual blocks in each stage of the Multi-Scale Decoder on segmentation performance on the COCO-20i dataset.

Fig. 6. The overview of the Multi-Scale Decoder with different numbers of residual blocks in each stage (1-4).
Figure 6 provides an overview of the Multi-Scale Decoder with different numbers of residual blocks in each stage. The experiment evaluated segmentation performance on the COCO-20i dataset using the ResNet50 backbone in a 1-shot scenario. As depicted in Table V, we examined configurations ranging from one to four residual blocks per stage. Interestingly, the results revealed that the best segmentation performance was achieved with three residual blocks in each stage. This finding suggests that an appropriate balance in the depth of the decoder architecture plays a crucial role in segmentation accuracy: too few blocks may limit the model's capacity to capture intricate features, while an excessive number of blocks can lead to overfitting or computational inefficiency. Our results therefore underscore the importance of carefully tuning the architecture parameters to achieve optimal performance in few-shot segmentation tasks.
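As an illustration of the hyperparameter varied in Table V, the sketch below builds one decoder stage from a configurable number of residual blocks; the block design itself is an assumption, not the authors' Multi-Scale Decoder.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain 3x3 residual block used to sketch one decoder stage."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

def make_decoder_stage(channels=256, num_blocks=3):
    """num_blocks is the quantity varied in Table V (three worked best)."""
    return nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

stage = make_decoder_stage(num_blocks=3)
y = stage(torch.randn(1, 256, 60, 60))  # spatial size preserved: [1, 256, 60, 60]
```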
V. CONCLUSION

In conclusion, our proposed few-shot segmentation framework, leveraging a combination of components including a shared pretrained backbone, a support prototype mechanism, the CMGM, the STD, and a multi-scale decoder, has demonstrated remarkable efficacy, achieving SOTA performance on both the PASCAL-5i and COCO-20i datasets. Through extensive experimentation and ablation studies, we have highlighted the critical contributions of each component, particularly emphasizing the significant impact of the multi-scale decoder in enhancing segmentation accuracy while maintaining computational efficiency. Looking ahead, further investigation into the dynamic adaptation of prototype representations and the exploration of additional attention mechanisms could offer avenues for improving the adaptability and robustness of our method across diverse datasets and scenarios. Additionally, exploring semi-supervised learning paradigms could enhance the generalization capability of our framework, enabling effective segmentation in scenarios with limited labeled data. These avenues for future work hold promise for advancing the effectiveness and applicability of few-shot segmentation methods in real-world scenarios.
REFERENCES

[1] A. Fateh, R. T. Birgani, M. Fateh, and V. Abolghasemi, "Advancing multilingual handwritten numeral recognition with attention-driven transfer learning," IEEE Access, vol. 12, pp. 41381–41395, 2024.
[2] Y. Zhang, Z. Shen, and R. Jiao, "Segment anything model for medical image segmentation: Current applications and future directions," Computers in Biology and Medicine, p. 108238, 2024.
[3] A. Saber, P. Parhami, A. Siahkarzadeh, and A. Fateh, "Efficient and accurate pneumonia detection using a novel multi-scale transformer approach," arXiv preprint arXiv:2408.04290, 2024.
[4] S. Sun, W. Wang, A. Howard, Q. Yu, P. Torr, and L.-C. Chen, "Remax: Relaxing for better training on efficient panoptic segmentation," Advances in Neural Information Processing Systems, vol. 36, 2024.
[5] I. B. Barcelos, F. d. C. Belém, L. d. M. João, Z. K. d. P. Jr, A. X. Falcão, and S. J. F. Guimarães, "A comprehensive review and new taxonomy on superpixel segmentation," ACM Computing Surveys, 2024.
[6] X. Gu, Y. Cui, J. Huang, A. Rashwan, X. Yang, X. Zhou, G. Ghiasi, W. Kuo, H. Chen, L.-C. Chen et al., "Dataseg: Taming a universal multi-dataset multi-task segmentation model," Advances in Neural Information Processing Systems, vol. 36, 2024.
[7] Z. Marinov, P. F. Jäger, J. Egger, J. Kleesiek, and R. Stiefelhagen, "Deep interactive segmentation of medical images: A systematic review and taxonomy," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[8] F. Askari, A. Fateh, and M. R. Mohammadi, "Enhancing few-shot image classification through learnable multi-scale embedding and attention mechanisms," arXiv preprint arXiv:2409.07989, 2024.
[9] H. Chen, Y. Dong, Z. Lu, Y. Yu, and J. Han, "Pixel matching network for cross-domain few-shot segmentation," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 978–987.
[10] C. Lang, G. Cheng, B. Tu, and J. Han, "Few-shot segmentation via divide-and-conquer proxies," International Journal of Computer Vision, vol. 132, no. 1, pp. 261–283, 2024.
[11] S.-A. Liu, Y. Zhang, Z. Qiu, H. Xie, Y. Zhang, and T. Yao, "Learning orthogonal prototypes for generalized few-shot semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11319–11328.
[12] H. Ding, H. Zhang, and X. Jiang, "Self-regularized prototypical network for few-shot semantic segmentation," Pattern Recognition, vol. 133, p. 109018, 2023.
[13] Q. Xu, W. Zhao, G. Lin, and C. Long, "Self-calibrated cross attention network for few-shot segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 655–665.
[14] D. Kang, P. Koniusz, M. Cho, and N. Murray, "Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19627–19638.
[15] X. Shi, D. Wei, Y. Zhang, D. Lu, M. Ning, J. Chen, K. Ma, and Y. Zheng, "Dense cross-query-and-support attention weighted mask aggregation for few-shot segmentation," in European Conference on Computer Vision. Springer, 2022, pp. 151–168.
[16] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, "Roformer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024.
[17] S. Tian, L. Li, W. Li, H. Ran, X. Ning, and P. Tiwari, "A survey on few-shot class-incremental learning," Neural Networks, vol. 169, pp. 307–324, 2024.
[18] G. Rizzoli, D. Shenaj, and P. Zanuttigh, "Source-free domain adaptation for rgb-d semantic segmentation with vision transformers," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 615–624.
[19] T. Zhou and W. Wang, "Cross-image pixel contrasting for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[20] A. Akter, N. Nosheen, S. Ahmed, M. Hossain, M. A. Yousuf, M. A. A. Almoyad, K. F. Hasan, and M. A. Moni, "Robust clinical applicable cnn and u-net based algorithm for mri classification and segmentation for brain tumor," Expert Systems with Applications, vol. 238, p. 122347, 2024.
[21] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[23] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[24] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "Segformer: Simple and efficient design for semantic segmentation with transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[25] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7262–7272.
[26] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890.
[27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[28] H. Bao, L. Dong, S. Piao, and F. Wei, "Beit: Bert pre-training of image transformers," arXiv preprint arXiv:2106.08254, 2021.
[29] B. Cheng, A. Schwing, and A. Kirillov, "Per-pixel classification is not all you need for semantic segmentation," Advances in Neural Information Processing Systems, vol. 34, pp. 17864–17875, 2021.
[30] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299.
[31] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[32] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[33] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[34] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[35] Y. Xu and P. Ghamisi, "Consistency-regularized region-growing network for semantic segmentation of urban scenes with point-level annotations," IEEE Transactions on Image Processing, vol. 31, pp. 5038–5051, 2022.
[36] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[37] G. Li, V. Jampani, L. Sevilla-Lara, D. Sun, J. Kim, and J. Kim, "Adaptive prototype learning and allocation for few-shot segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8334–8343.
[38] C. Zhang, G. Lin, F. Liu, J. Guo, Q. Wu, and R. Yao, "Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9587–9595.
[39] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng, "Panet: Few-shot image semantic segmentation with prototype alignment," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9197–9206.
[40] M. Boudiaf, H. Kervadec, Z. I. Masud, P. Piantanida, I. Ben Ayed, and J. Dolz, "Few-shot segmentation without meta-learning: A good transductive inference is all you need?" in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13979–13988.
[41] Z. Lu, S. He, X. Zhu, L. Zhang, Y.-Z. Song, and T. Xiang, "Simpler is better: Few-shot semantic segmentation with classifier weight transformer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8741–8750.