MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation
I. INTRODUCTION
Semantic segmentation is a key task in computer vision, where each pixel of an image is labeled as part of a specific category. This is important in many areas like OCR, autonomous driving, medical imaging, and scene understanding [1]–[3]. To perform this task well, models need to learn detailed object boundaries. In recent years, deep Convolutional Neural Networks (CNNs) have made big improvements in this area [4]. However, these high-performing models usually need large datasets with lots of labeled examples [5], [6], which take a lot of time and effort to create. In real-world scenarios, like medical imaging or other fields where labeled data is limited, this becomes a big problem [7], [8]. To solve this, Few-shot Semantic Segmentation (FSS) has become a useful approach.

Fig. 1. Comparison among existing methods and our proposed method for FSS. (a) Prototype-based methods; (b) Pixel-wise approach; (c) Cross-domain multi-scale decoder with transformer-guided prototyping.

FSS tries to segment new object classes in images using only a few labeled examples, called support images, that show the target class [9]. This method helps reduce the need for large datasets, making it more practical for real-world use [10]. Addressing the challenges of FSS requires handling differences in texture or appearance between the target object in the query image and similar objects depicted in the support examples. Effectively using the relationship between the query image and the support examples is essential in tackling FSS.

FSS can be broadly categorized into two groups: prototype-based approaches and pixel-wise methods. As shown in Figure 1-(a), prototype-based approaches involve abstracting semantic features of the target class from support images through a shared backbone network [11]. This process results in feature vectors called class-wise prototypes, which are obtained using techniques such as class-wise average pooling or clustering. These prototypes are then combined with query features through operations like element-wise summation or channel-wise concatenation. The combined features are refined by a
decoder module to classify each pixel as either the target class or background [12]. In contrast, as shown in Figure 1-(b), pixel-wise methods take a different approach by focusing directly on pixel-level information rather than compressing it into prototypes. These methods aim to predict the target class for each pixel in the query image by comparing it directly with corresponding pixels in the support images. To achieve this, they establish pixel-to-pixel correlations between the support and query features, which allows the model to find precise matches even when the object's appearance varies [13]. This process is often enhanced by attention mechanisms, like those found in Transformer models, which help the model focus on important relationships between pixels. By avoiding the need for prototypes, pixel-wise methods aim to preserve more detailed information, allowing for finer-grained segmentation [14], [15].
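To make the contrast concrete, the following is a minimal PyTorch sketch of the two families described above. The tensor sizes and the specific similarity measures are illustrative assumptions, not the implementation of any particular method.

```python
import torch
import torch.nn.functional as F

# Assumed shapes (illustrative): one episode, 256 channels, 60x60 feature maps.
feat_s = torch.randn(1, 256, 60, 60)                # support features from a shared backbone
feat_q = torch.randn(1, 256, 60, 60)                # query features from the same backbone
mask_s = (torch.rand(1, 1, 60, 60) > 0.5).float()   # binary support mask (already downsampled)

# (a) Prototype-based: abstract the support object into one class-wise vector
# via masked average pooling, then compare it with every query location.
prototype = (feat_s * mask_s).sum(dim=(2, 3)) / mask_s.sum(dim=(2, 3)).clamp(min=1e-6)  # [1, 256]
proto_sim = F.cosine_similarity(feat_q, prototype[:, :, None, None].expand_as(feat_q), dim=1)  # [1, 60, 60]

# (b) Pixel-wise: keep every support pixel and build a dense correlation
# between all query locations and all (foreground) support locations.
q = feat_q.flatten(2).transpose(1, 2)               # [1, HWq, 256]
s = (feat_s * mask_s).flatten(2).transpose(1, 2)    # [1, HWs, 256]
correlation = torch.einsum('bqc,bsc->bqs', F.normalize(q, dim=-1), F.normalize(s, dim=-1))
pixel_sim = correlation.max(dim=-1).values.view(1, 60, 60)  # best-matching support pixel per query pixel
```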
While both groups have demonstrated efficacy, they also have certain limitations. Prototype-based methods may inadvertently discard complex local semantic features specific to the target class in support images. This can lead to coarse segmentation of the target class in query images, especially for objects with complex appearances. On the other hand, while pixel-wise methods have notably improved performance compared to prototype-based approaches, they grapple with computational complexity due to dot-product attention calculations across all pixels of the support and query features. Moreover, a large amount of pixel-wise support information can lead to confusion in attention mechanisms [13]. A further limitation shared by both approaches is that the decoder makes little use of the encoder's intermediate features: many methods in both categories employ straightforward decoders that fail to incorporate them. However, in few-shot scenarios where data samples are limited, leveraging the global features captured by the encoder in the decoder phase can prove to be highly beneficial.
Inspired by recent developments, we aim to develop a straightforward and effective framework that addresses these limitations of FSS methods. A notable approach gaining traction is the Query-based¹ Transformer architecture, which has demonstrated versatility across various computer vision tasks, including few-shot learning scenarios [16], [17]. This architecture utilizes learnable Query embeddings derived from support prototypes, enabling nuanced analysis of their relationships within the query feature map.

¹ To differentiate it from the conventional term "query" frequently employed in FSS, we capitalize "Query" when referring to the query sequence within the Transformer architecture.
Inspired by previous works, as shown in Figure 1-(c), we have designed a novel Transformer-based module, the Spatial Transformer Decoder (STD), to enhance the relational understanding between the support images and the query image. This module operates concurrently with the multi-scale decoder. Within the STD module, we adopt a simple strategy: the prototype of the support images serves as the Query, while the features extracted from the query image serve as both the Key and Value embeddings fed into the Transformer decoder. This formulation allows the Query to focus effectively on the semantic features of the target class within the query image. Furthermore, to reduce the impact of the information loss caused by abstracting the support images into a single feature vector, the 'support prototype,' we integrate global features from the intermediate stages of the encoder, which is fed with the support images, into our decoder. Incorporating these features allows us to leverage features from different stages of the encoder, thereby enriching the decoder's contextual understanding. Additionally, we introduce the Contextual Mask Generation Module (CMGM) to further augment the model's relational understanding; it operates alongside the STD and enhances the model's capacity to capture relevant contextual information.
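As a rough illustration of this Query/Key/Value arrangement, the sketch below wires a support prototype into a standard cross-attention layer as the Query, with flattened query-image features as the Key and Value. The module name, dimensions, and layer layout are assumptions for illustration only, not the authors' STD implementation.

```python
import torch
import torch.nn as nn

class SpatialTransformerDecoderSketch(nn.Module):
    """Illustrative only: the support prototype acts as the Transformer Query,
    while flattened query-image features provide the Key and Value."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, support_prototype, query_feat):
        # support_prototype: [B, C]; query_feat: [B, C, H, W]
        B, C, H, W = query_feat.shape
        kv = query_feat.flatten(2).transpose(1, 2)           # [B, HW, C] -> Key and Value
        q = support_prototype.unsqueeze(1)                   # [B, 1, C]  -> Query
        attended, _ = self.attn(query=q, key=kv, value=kv)   # prototype attends to query pixels
        q = self.norm1(q + attended)
        q = self.norm2(q + self.ffn(q))
        return q.squeeze(1)                                  # refined, class-aware embedding

std = SpatialTransformerDecoderSketch()
out = std(torch.randn(2, 256), torch.randn(2, 256, 60, 60))  # -> [2, 256]
```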
II. RELATED WORKS

A. Semantic Segmentation

Semantic segmentation, a crucial task in computer vision, involves labeling each pixel in an image with a corresponding class [18], [19]. CNNs significantly advanced semantic segmentation by replacing fully connected layers with convolutional layers, enabling the processing of images of various sizes [20]. Since then, subsequent advancements have focused on enhancing the receptive field and aggregating long-range context in feature maps. Techniques such as dilated convolutions [21], spatial pyramid pooling [22], and non-local blocks [23] have been employed to capture contextual information at multiple scales. More recently, Transformer-based backbones, including SegFormer [24], Segmenter [25], and SETR [26], have been introduced to better capture long-range context in semantic segmentation tasks. Further enhancing this line of work, hierarchical architectures like the Swin Transformer [27] have achieved state-of-the-art performance by using shifted windows in their general-purpose backbones. In parallel, self-supervised pretraining strategies, such as the masked image modeling used in BEiT [28], have also shown strong results, fine-tuning directly on the semantic segmentation task and pushing the boundaries of model performance.

Semantic segmentation tasks typically involve per-pixel classification, as demonstrated by approaches like MaskFormer [29] and Mask2Former [30], which predict binary masks corresponding to individual class labels. Older architectures, such as UNet [31], PSPNet [32], and DeepLab [33], [34], have also significantly contributed to the field by incorporating features like global and local context aggregation and dilated convolutions to increase the receptive field without reducing resolution. Building upon these foundational approaches, more recent studies, including CRGNet [35] and SAM [36], have focused on further improving model performance, exploring new techniques to enhance accuracy in segmentation tasks. Despite the progress made in per-pixel classification, addressing the challenge of segmenting unseen classes remains an open area for future research.

B. Few-Shot Semantic Segmentation

FSS is a challenging task in computer vision, wherein the objective is to segment images with only a limited number of annotated examples, known as support images. Approaches to FSS can be categorized into various groups based on their primary aims
and methodologies employed [37]. One significant challenge in FSS is addressing the imbalance in detail between support and query images. Methods like PGNet [38] and PANet [39] aim to eliminate inconsistent regions between support and query images by associating each query pixel with relevant parts of the support image, or by regularizing the network so that it succeeds regardless of the roles of support and query. Methods like ASGNet [37], on the other hand, focus on finding an adaptive number of prototypes and their spatial extents determined by image content, utilizing a boundary-aware superpixel algorithm.

Another critical aspect of FSS is bridging the inter-class gap between base and novel datasets. Approaches like RePRI [40] and CWT [41] address this gap by fine-tuning on the support images or by episodically training self-attention blocks to adapt classifier weights during both training and testing. Additionally, architectures designed for supervised learning often have trouble recognizing objects at different scales in few-shot scenarios. To address this issue, new methods have been developed to allow information exchange between different resolutions [42].

Moreover, ensuring the reliability of correlations between support and query images is essential in FSS. Methods like HSNet [43] and CyCTR [44] utilize attention mechanisms to filter out erroneous support features and focus on beneficial information. VAT [45], meanwhile, employs a cost aggregation network to aggregate information between query and support features, leveraging a high-dimensional Swin Transformer to impart local context to all pixels.
Overall, the field of FSS is advancing rapidly, with innovative methods aimed at enhancing model performance and overcoming the challenges of adapting segmentation models to novel classes with limited annotated data. These efforts are driven by the ongoing need to improve the effectiveness and versatility of segmentation models in real-world applications.

III. PROPOSED METHOD

A. Problem Definition

In FSS, the task involves segmenting images belonging to novel classes with limited annotated data. We operate with two datasets, D_train and D_test, associated with class sets C_train and C_test, respectively. Notably, these class sets are disjoint (C_train ∩ C_test = ∅), ensuring that there is no overlap between the classes in the training and test datasets. Each training episode consists of a support set S and a query set Q, where S includes a set of k support images along with their corresponding binary segmentation masks, while Q contains a single query image. The model is trained to predict the segmentation mask for the query image based on the support set.

Both D_train and D_test consist of a series of randomly sampled episodes (an episode is a set comprising support images and a query image; each epoch may contain many episodes, e.g., 1000, each with its own support and query images). During training, the model learns to predict the segmentation mask for the query image based on the support set. Similarly, during testing, the model's performance is evaluated on the D_test dataset, where it predicts the segmentation masks for query images from the test set using the knowledge learned during training.

Overall, the goal of FSS is to develop a model that can accurately segment images from novel classes with only a few annotated samples, demonstrating robust generalization capabilities across different datasets and unseen classes.
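The episodic protocol described above can be pictured with a small sketch of an episode sampler. The dataset interface (a mapping from class to annotated images) is a hypothetical assumption; only the sampling logic reflects the description above.

```python
import random

# Illustrative episode sampler; `dataset` maps each class to a list of
# (image, binary_mask) pairs and is assumed to contain at least k+1 samples per class.
def sample_episode(dataset, classes, k_shot=1):
    """Return a k-shot support set and a single query image for one episode class."""
    cls = random.choice(classes)                 # pick the episode's target class
    picks = random.sample(dataset[cls], k_shot + 1)
    support = picks[:k_shot]                     # S: k (image, mask) pairs
    query_image, query_mask = picks[k_shot]      # Q: one image; its mask is ground truth only
    return support, query_image, query_mask

# Training episodes draw from C_train, evaluation episodes from the disjoint C_test, e.g.:
# support, q_img, q_gt = sample_episode(train_data, c_train, k_shot=5)
```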
B. Overview

Given a support set S = {I_s^i, M_s^i} and a query image I_q, the objective is to generate the binary segmentation mask for I_q, identifying the same class as the support examples. To address this task, we introduce a straightforward yet robust framework, outlined in Figure 2. For simplicity, we illustrate a 1-shot setting within the framework, but it can easily be generalized to the 5-shot setting as well. The proposed method comprises several key components: a shared pretrained backbone, the support prototype, the CMGM, a multi-scale decoder, and the STD. These elements collectively contribute to the model's ability to accurately segment objects of interest in the query image based on the contextual information provided by the support set. In the following, we take a closer look at each component, explaining its role and how it interacts within our framework.

1) Backbone: In our proposed framework, we adopt a modified ResNet architecture, initially pre-trained on the ImageNet dataset, to serve as the backbone for feature extraction from raw input images, ensuring that the output of each block does not shrink below a specified size. For instance, like [46], we define that the output sizes from conv2_x to conv5_x are maintained at 60 × 60 pixels. Specifically, we utilize a ResNet with shared weights between the support and query images. This ResNet maintains the spatial resolution of the feature maps at 60 × 60 pixels from the conv2_x stage onward, preserving finer details crucial for accurate segmentation. We extract high-level features (conv5_x) as well as mid-level features (conv3_x and conv4_x) from both support and query images using the backbone.

The mid-level features of the support image are denoted as X_s^conv3 and X_s^conv4, while the high-level features are denoted as X_s^conv5. Similarly, for the query image, the mid-level features are represented as X_q^conv3 and X_q^conv4, and the high-level features as X_q^conv5. To integrate mid-level features across different stages, we concatenate the mid-level feature maps from the conv3_x and conv4_x stages and apply a 1 × 1 convolution layer to yield a merged mid-level feature map, denoted as X_s^merged. This merging process ensures that the resultant feature map retains essential information from both mid-level stages, enhancing the model's ability to capture diverse contextual information (Equation 1, Equation 2).

X_s^merged = C_1×1(Cat(X_s^conv3, X_s^conv4))    (1)

X_q^merged = C_1×1(Cat(X_q^conv3, X_q^conv4))    (2)

where Cat denotes concatenation along the channel dimension and C_1×1 denotes the 1 × 1 convolution operation. These equations illustrate the process of merging mid-level features from different stages of the backbone network, resulting in a combined mid-level feature map that retains crucial information from both stages.
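Equations (1) and (2) amount to a channel-wise concatenation followed by a 1 × 1 convolution, sketched below in PyTorch. The channel sizes (512 and 1024 for conv3_x and conv4_x, 256 for the merged map) are assumptions based on a standard ResNet, not values taken from the paper.

```python
import torch
import torch.nn as nn

class MidLevelMerge(nn.Module):
    """Minimal sketch of Equations (1)-(2): Cat(.,.) along channels, then C_{1x1}."""
    def __init__(self, c3=512, c4=1024, out_channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(c3 + c4, out_channels, kernel_size=1)  # the 1x1 convolution

    def forward(self, x_conv3, x_conv4):
        merged = torch.cat([x_conv3, x_conv4], dim=1)  # concatenate conv3_x and conv4_x maps
        return self.reduce(merged)                     # X^merged

merge = MidLevelMerge()
x_s_merged = merge(torch.randn(1, 512, 60, 60), torch.randn(1, 1024, 60, 60))  # -> [1, 256, 60, 60]
```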
The decision to employ this modified ResNet architecture is grounded in its ability to balance computational efficiency with feature representation. By maintaining the feature map size at 60 × 60 pixels, the backbone effectively captures detailed spatial information while avoiding excessive computational overhead. This approach strikes a pragmatic balance between model complexity and segmentation performance, making it well-suited to our few-shot segmentation task, where computational efficiency is paramount.
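One common way to obtain fixed-resolution stage outputs like this is to replace the later strides of a ResNet with dilation, as in DeepLab-style backbones; the sketch below shows this with torchvision. Whether this matches the authors' exact backbone modification is an assumption of the sketch.

```python
import torch
from torchvision.models import resnet50

# Replace the strides of layer3/layer4 with dilation so their outputs keep the
# spatial size of layer2 (60x60 for a 473x473 input).
backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])

x = torch.randn(1, 3, 473, 473)
x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
c2 = backbone.layer1(x)    # conv2_x
c3 = backbone.layer2(c2)   # conv3_x -> 60x60
c4 = backbone.layer3(c3)   # conv4_x (dilated, stays 60x60)
c5 = backbone.layer4(c4)   # conv5_x (dilated, stays 60x60)
print(c3.shape, c4.shape, c5.shape)
```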
2) Support Prototype: In our proposed framework, the Support Prototype serves as a condensed representation of the mid-level features extracted from the support example (X_s^merged). The Support Prototype is obtained by applying a Masked Average Pooling (MAP) operation, which selectively aggregates information based on the support mask. Mathematically, the Support Prototype P_s is defined in Equation 3.

[...] When there are five support examples, five cosine similarities are computed and subsequently averaged, yielding a novel cosine similarity measure representative of the collective support set.
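Equation 3 itself is not reproduced in this excerpt, but masked average pooling is a standard operation; a minimal sketch consistent with the description above follows (tensor shapes are assumptions).

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(x_s_merged, support_mask):
    """Sketch of the MAP operation: average the merged support features over the
    locations marked as foreground by the support mask."""
    # x_s_merged: [B, C, H, W]; support_mask: [B, 1, h, w] with values in {0, 1}
    mask = F.interpolate(support_mask, size=x_s_merged.shape[-2:], mode='nearest')
    prototype = (x_s_merged * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
    return prototype  # P_s: [B, C]

p_s = masked_average_pooling(torch.randn(1, 256, 60, 60), torch.ones(1, 1, 473, 473))
```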
TABLE I
Performance on PASCAL-5i in terms of mIoU and FB-IoU. Numbers in bold represent the best performance, while underlined values denote the second-best performance.
We utilized the Adam optimizer with a fixed learning rate of 10^-3. All input images were resized to 473 × 473 pixels, and the training batch size was set to 32 for the 1-shot setting and 16 for the 5-shot setting. Our training pipeline did not incorporate any data augmentation strategies. After prediction, the binary segmentation masks were resized to match the original dimensions of the input images for evaluation. To ensure robustness and mitigate the effects of randomness, we averaged the results of three trials conducted with different random seeds. All experiments were performed on an NVIDIA RTX 4090 GPU.
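For reference, the reported settings can be summarized in a small configuration sketch; `model` is a stand-in placeholder, and nothing beyond the hyperparameters stated above should be read into it.

```python
import torch

model = torch.nn.Conv2d(3, 2, kernel_size=1)  # placeholder for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # fixed learning rate of 10^-3

config = {
    "input_size": (473, 473),
    "batch_size": {"1-shot": 32, "5-shot": 16},
    "data_augmentation": None,  # no augmentation was used
    "trials": 3,                # results averaged over three random seeds
}
```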
C. Evaluation Metrics

We employ the following evaluation metrics to assess the performance of our proposed method.

Mean Intersection over Union (mIoU). mIoU is a widely used metric for evaluating segmentation performance. It calculates the average intersection over union (IoU) across all classes in the target dataset (Equation 5).

mIoU = (1/C) Σ_{i=1}^{C} IoU_i    (5)
TABLE II
Performance on COCO-20i in terms of mIoU and FB-IoU. Numbers in bold represent the best performance, while underlined values denote the second-best performance.
Here, C represents the number of classes in the target fold, and IoU_i denotes the intersection over union for class i.

Foreground-Background IoU (FB-IoU). FB-IoU measures the intersection over union for the foreground and background classes only. While FB-IoU provides insight into the model's ability to distinguish between foreground and background regions, we primarily focus on mIoU as our main evaluation metric due to its more comprehensive assessment of segmentation performance.
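A minimal sketch of how these two metrics can be computed for a single prediction is given below; in practice the intersections and unions are accumulated over the whole test set before dividing, which this sketch omits.

```python
import numpy as np

def iou(pred, gt, cls):
    """IoU of one class between integer label maps `pred` and `gt` (numpy arrays)."""
    inter = np.logical_and(pred == cls, gt == cls).sum()
    union = np.logical_or(pred == cls, gt == cls).sum()
    return inter / union if union > 0 else 0.0

def mean_iou(pred, gt, classes):
    """Equation (5): average the per-class IoU over the C classes of the fold."""
    return sum(iou(pred, gt, c) for c in classes) / len(classes)

def fb_iou(pred, gt):
    """FB-IoU: average IoU of the background (0) and foreground (1) classes."""
    return 0.5 * (iou(pred, gt, 0) + iou(pred, gt, 1))
```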
D. Comparison with SOTA

In this subsection, we compare our proposed method with several SOTA methods on both the PASCAL-5i and COCO-20i datasets. We present the results in Table I and Table II, respectively, where we report the mIoU scores under both 1-shot and 5-shot settings, along with the final FB-IoU value. The results of the other methods are taken from their respective original papers.

Results on the PASCAL-5i dataset. As shown in Table I, our proposed method, using ResNet50 and ResNet101 backbones, consistently surpasses SOTA methods in both 1-shot and 5-shot scenarios across all four folds of the PASCAL-5i dataset. Notably, our method achieves among the highest performance across all folds.

Results on the COCO-20i dataset. Similarly, Table II presents the results on the COCO-20i dataset, where our proposed method demonstrates superior performance with both ResNet50 and ResNet101 backbones in both 1-shot and 5-shot settings. Our method consistently outperforms all competing approaches across all four folds of the COCO-20i dataset, achieving either first or second rank: we obtain the highest mIoU scores in several folds and secure the second rank in the others. Notably, our method also exhibits superior performance in terms of mean mIoU and FB-IoU, further emphasizing its effectiveness and robustness.

It is important to highlight that our proposed method maintains a remarkably low number of learnable parameters, with only 1.5 million. This stands in stark contrast to some SOTA methods, whose parameter counts are significantly higher, exceeding 40 million in certain cases. This demonstrates the efficiency and effectiveness of our approach in achieving superior segmentation performance while maintaining a compact model architecture.

E. Cross-dataset task

In this study, we investigate the cross-domain generalization capability of our proposed few-shot segmentation method through rigorous domain-shift testing. Specifically, we trained our model on the COCO-20i dataset and conducted testing on the PASCAL-5i dataset to evaluate its adaptability across different datasets and domain settings.
TABLE III
Few-shot segmentation performance on the cross-dataset task "COCO-20i → PASCAL-5i" in terms of mIoU, with different backbones (ResNet-50 and ResNet-101). Numbers in bold represent the best performance, while underlined values denote the second-best performance.
Backbone   Method          Publication   1-shot: fold0 fold1 fold2 fold3 mean   5-shot: fold0 fold1 fold2 fold3 mean
ResNet50   PFENet [47]     TPAMI20       43.2  65.1  66.6  69.7  61.1           45.1  66.8  68.5  73.1  63.4
ResNet50   RePRI [40]      CVPR21        52.2  64.3  64.8  71.6  63.2           56.5  68.2  70.0  76.2  67.7
ResNet50   HSNet [43]      ICCV21        45.4  61.2  63.4  75.9  61.6           56.9  65.9  71.3  80.8  68.7
ResNet50   HSNet-HM [61]   ECCV22        43.4  68.2  69.4  79.9  65.2           50.7  71.4  73.4  83.1  69.7
ResNet50   VAT-HM [61]     ECCV22        68.3  64.9  67.5  79.8  65.1           55.6  68.1  72.4  82.8  69.7
ResNet50   RTD [62]        ECCV22        57.4  62.2  68.0  74.8  65.6           65.7  69.7  70.8  75.0  70.1
ResNet50   PMNet [9]       WACV24        68.8  70.0  65.1  62.3  66.6           73.9  74.5  73.3  72.1  73.4
ResNet50   MSDNet (ours)   -             70.7  73.2  71.1  73.2  72.1           72.5  75.0  73.8  75.5  74.2
ResNet101  HSNet [43]      ICCV21        47.0  65.2  67.1  77.1  64.1           57.2  69.5  72.0  82.4  70.3
ResNet101  HSNet-HM [61]   ECCV22        46.7  68.6  71.1  79.7  66.5           53.7  70.7  75.2  83.9  70.9
ResNet101  RTD [62]        ECCV22        59.4  64.3  70.8  72.0  66.6           67.2  72.7  72.0  78.9  72.7
ResNet101  PMNet [9]       WACV24        71.0  72.3  66.6  63.8  68.4           75.2  76.3  77.0  72.6  75.3
ResNet101  MSDNet (ours)   -             71.6  75.6  73.0  75.2  73.9           71.5  79.6  76.4  77.9  76.4
The COCO-20i dataset used in our experiments was modified to exclude classes, and their associated images, that overlap with those present in PASCAL-5i. This adaptation ensured that the training process focused on distinct visual concepts, thereby enhancing the model's exposure to novel classes during testing.
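The filtering step can be sketched as follows; the class names and annotation format are hypothetical placeholders, not the actual fold definitions used in the paper.

```python
# Keep only COCO images whose annotated classes do not overlap PASCAL-5i,
# so that the cross-dataset test classes remain unseen during training.
pascal_classes = {"person", "bus", "dog"}          # hypothetical PASCAL-5i classes
coco_annotations = [
    {"image": "coco_0001.jpg", "classes": {"dog", "kite"}},
    {"image": "coco_0002.jpg", "classes": {"zebra"}},
]

filtered = [a for a in coco_annotations if not (a["classes"] & pascal_classes)]
print([a["image"] for a in filtered])  # -> ['coco_0002.jpg']
```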
For our experiments, we adopted a cross-dataset evaluation protocol in which models trained on each fold of COCO-20i were repurposed for testing on the entire PASCAL-5i dataset. Notably, during training the model was exposed only to specific classes within COCO-20i, ensuring no overlap with the classes present in PASCAL-5i. This setup effectively simulates a scenario where the model encounters novel classes during testing that were not part of its training curriculum. For instance, in the fold-0 setting, the model was exclusively [...]

The results in Table IV show the impact of each component of the proposed method. The first row of Table IV represents the performance of the baseline model, consisting solely of the backbone architecture and the support prototype mechanism. Subsequent rows introduce additional components incrementally, including the CMGM, STD, and multi-scale decoder.

TABLE IV
The impact of each component on segmentation performance on the COCO-20i dataset (1-shot).

Baseline  CMGM  STD  Multi-Scale Decoder   fold0  fold1  fold2  fold3  mean   FB-IoU
   ✓       -     -          -              30.1   34.2   33.4   33.8   32.9   59.7
   ✓       ✓     -          -              31.5   35.9   34.8   34.2   34.1   60.8
   ✓       ✓     ✓          -              43.0   45.2   43.1   41.4   43.2   67.6
   ✓       ✓     ✓          ✓              43.7   49.1   46.9   46.2   46.5   70.4
Fig. 5. Qualitative comparison of component effects on the COCO-20i dataset in the 1-shot scenario.
TABLE V
The impact of the number of residual blocks in each stage of the Multi-Scale Decoder on segmentation performance on the COCO-20i dataset.

Fig. 6. The overview of the Multi-Scale Decoder with different numbers of residual blocks in each stage (1-4).
Figure 6 provides an overview of the Multi-Scale Decoder with different numbers of residual blocks in each stage. The experiment evaluated segmentation performance on the COCO-20i dataset using the ResNet50 backbone in a 1-shot scenario. As depicted in Table V, we examined configurations ranging from one to four residual blocks per stage. Interestingly, the results revealed that the best segmentation performance was achieved with three residual blocks in each stage. This finding suggests that an appropriate balance in the depth of the decoder architecture plays a crucial role in segmentation accuracy: too few blocks may limit the model's capacity to capture intricate features, while an excessive number of blocks can lead to overfitting or computational inefficiency. Our results therefore underscore the importance of carefully tuning the architecture parameters to achieve optimal performance in few-shot segmentation tasks.
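As an illustration of the hyperparameter varied in Table V, the sketch below builds one decoder stage from a configurable number of residual blocks; the block design itself is an assumption, not the authors' Multi-Scale Decoder.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain 3x3 residual block used to sketch one decoder stage."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

def make_decoder_stage(channels=256, num_blocks=3):
    """num_blocks is the quantity varied in Table V (three worked best)."""
    return nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

stage = make_decoder_stage(num_blocks=3)
y = stage(torch.randn(1, 256, 60, 60))  # spatial size preserved: [1, 256, 60, 60]
```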
V. CONCLUSION

In conclusion, our proposed few-shot segmentation framework, leveraging a combination of components including a shared pretrained backbone, a support prototype mechanism, the CMGM, the STD, and a multi-scale decoder, has demonstrated remarkable efficacy, achieving SOTA performance on both the PASCAL-5i and COCO-20i datasets. Through extensive experimentation and ablation studies, we have highlighted the critical contributions of each component, particularly emphasizing the significant impact of the multi-scale decoder in enhancing segmentation accuracy while maintaining computational efficiency. Looking ahead, further investigation into the dynamic adaptation of prototype representations and the exploration of additional attention mechanisms could offer avenues for improving the adaptability and robustness of our method across diverse datasets and scenarios. Additionally, exploring semi-supervised learning paradigms could enhance the generalization capability of our framework, enabling effective segmentation in scenarios with limited labeled data. These avenues for future work hold promise for advancing the effectiveness and applicability of few-shot segmentation methods in real-world scenarios.
REFERENCES

[1] A. Fateh, R. T. Birgani, M. Fateh, and V. Abolghasemi, "Advancing multilingual handwritten numeral recognition with attention-driven transfer learning," IEEE Access, vol. 12, pp. 41381–41395, 2024.
[2] Y. Zhang, Z. Shen, and R. Jiao, "Segment anything model for medical image segmentation: Current applications and future directions," Computers in Biology and Medicine, p. 108238, 2024.
[3] A. Saber, P. Parhami, A. Siahkarzadeh, and A. Fateh, "Efficient and accurate pneumonia detection using a novel multi-scale transformer approach," arXiv preprint arXiv:2408.04290, 2024.
[4] S. Sun, W. Wang, A. Howard, Q. Yu, P. Torr, and L.-C. Chen, "Remax: Relaxing for better training on efficient panoptic segmentation," Advances in Neural Information Processing Systems, vol. 36, 2024.
[5] I. B. Barcelos, F. d. C. Belém, L. d. M. João, Z. K. d. P. Jr, A. X. Falcão, and S. J. F. Guimarães, "A comprehensive review and new taxonomy on superpixel segmentation," ACM Computing Surveys, 2024.
[6] X. Gu, Y. Cui, J. Huang, A. Rashwan, X. Yang, X. Zhou, G. Ghiasi, W. Kuo, H. Chen, L.-C. Chen et al., "Dataseg: Taming a universal multi-dataset multi-task segmentation model," Advances in Neural Information Processing Systems, vol. 36, 2024.
[7] Z. Marinov, P. F. Jäger, J. Egger, J. Kleesiek, and R. Stiefelhagen, "Deep interactive segmentation of medical images: A systematic review and taxonomy," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[8] F. Askari, A. Fateh, and M. R. Mohammadi, "Enhancing few-shot image classification through learnable multi-scale embedding and attention mechanisms," arXiv preprint arXiv:2409.07989, 2024.
[9] H. Chen, Y. Dong, Z. Lu, Y. Yu, and J. Han, "Pixel matching network for cross-domain few-shot segmentation," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 978–987.
[10] C. Lang, G. Cheng, B. Tu, and J. Han, "Few-shot segmentation via divide-and-conquer proxies," International Journal of Computer Vision, vol. 132, no. 1, pp. 261–283, 2024.
[11] S.-A. Liu, Y. Zhang, Z. Qiu, H. Xie, Y. Zhang, and T. Yao, "Learning orthogonal prototypes for generalized few-shot semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11319–11328.
[12] H. Ding, H. Zhang, and X. Jiang, "Self-regularized prototypical network for few-shot semantic segmentation," Pattern Recognition, vol. 133, p. 109018, 2023.
[13] Q. Xu, W. Zhao, G. Lin, and C. Long, "Self-calibrated cross attention network for few-shot segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 655–665.
[14] D. Kang, P. Koniusz, M. Cho, and N. Murray, "Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19627–19638.
[15] X. Shi, D. Wei, Y. Zhang, D. Lu, M. Ning, J. Chen, K. Ma, and Y. Zheng, "Dense cross-query-and-support attention weighted mask aggregation for few-shot segmentation," in European Conference on Computer Vision. Springer, 2022, pp. 151–168.
[16] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, "Roformer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024.
[17] S. Tian, L. Li, W. Li, H. Ran, X. Ning, and P. Tiwari, "A survey on few-shot class-incremental learning," Neural Networks, vol. 169, pp. 307–324, 2024.
[18] G. Rizzoli, D. Shenaj, and P. Zanuttigh, "Source-free domain adaptation for rgb-d semantic segmentation with vision transformers," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 615–624.
[19] T. Zhou and W. Wang, "Cross-image pixel contrasting for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[20] A. Akter, N. Nosheen, S. Ahmed, M. Hossain, M. A. Yousuf, M. A. A. Almoyad, K. F. Hasan, and M. A. Moni, "Robust clinical applicable cnn and u-net based algorithm for mri classification and segmentation for brain tumor," Expert Systems with Applications, vol. 238, p. 122347, 2024.
[21] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[23] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[24] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "Segformer: Simple and efficient design for semantic segmentation with transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[25] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7262–7272.
[26] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890.
[27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[28] H. Bao, L. Dong, S. Piao, and F. Wei, "Beit: Bert pre-training of image transformers," arXiv preprint arXiv:2106.08254, 2021.
[29] B. Cheng, A. Schwing, and A. Kirillov, "Per-pixel classification is not all you need for semantic segmentation," Advances in Neural Information Processing Systems, vol. 34, pp. 17864–17875, 2021.
[30] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299.
[31] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[32] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[33] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[34] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[35] Y. Xu and P. Ghamisi, "Consistency-regularized region-growing network for semantic segmentation of urban scenes with point-level annotations," IEEE Transactions on Image Processing, vol. 31, pp. 5038–5051, 2022.
[36] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[37] G. Li, V. Jampani, L. Sevilla-Lara, D. Sun, J. Kim, and J. Kim, "Adaptive prototype learning and allocation for few-shot segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8334–8343.
[38] C. Zhang, G. Lin, F. Liu, J. Guo, Q. Wu, and R. Yao, "Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9587–9595.
[39] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng, "Panet: Few-shot image semantic segmentation with prototype alignment," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9197–9206.
[40] M. Boudiaf, H. Kervadec, Z. I. Masud, P. Piantanida, I. Ben Ayed, and J. Dolz, "Few-shot segmentation without meta-learning: A good transductive inference is all you need?" in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13979–13988.
[41] Z. Lu, S. He, X. Zhu, L. Zhang, Y.-Z. Song, and T. Xiang, "Simpler is better: Few-shot semantic segmentation with classifier weight transformer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8741–8750.