A Simple Baseline with Single-encoder for Referring Image Segmentation

Yu, Seonghoon; Jung, Ilchae; Han, Byeongju; Kim, Taeoh; Kim, Yunho; Wee, Dongyoon; Son, Jeany

arXiv:2408.15521 (cs)

[Submitted on 28 Aug 2024 (v1), last revised 17 Jun 2025 (this version, v3)]

Abstract: Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, the dual-encoders commonly adopted in RIS, e.g., Swin Transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, which leaves a gap between pre-training and the pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that connect the two encoders, but this approach incurs high computational costs. In this paper, we present a novel RIS method with a single encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interaction between the two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performance on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
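
The idea of shared self-attention over both modalities can be illustrated with a toy sketch. This is not the authors' model or the BEiT-3 implementation: it is a single attention head in plain Python, with identity query/key/value projections and made-up two-dimensional token vectors, all of which are assumptions chosen only for illustration. It shows that once image-patch tokens and word tokens are concatenated into one sequence, every patch attends to every word (and vice versa) inside the same attention operation, with no separate fusion module.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(image_tokens, text_tokens):
    """Single-head self-attention over the concatenated token sequence.

    Identity Q/K/V projections are an illustrative simplification
    (an assumption, not part of the paper).
    """
    tokens = image_tokens + text_tokens  # one shared sequence for both modalities
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    # Attention weights: each token scores every token, regardless of modality.
    weights = [
        softmax([scale * sum(q * k for q, k in zip(query, key)) for key in tokens])
        for query in tokens
    ]
    # Each output token is a weighted mix of all image and text tokens.
    mixed = [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    n_img = len(image_tokens)
    return mixed[:n_img], mixed[n_img:], weights


# Hypothetical toy inputs: 3 image-patch tokens and 2 word tokens.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]
img_out, txt_out, attn = joint_self_attention(image_tokens, text_tokens)
```

In this sketch, `attn[i][j]` for a patch index `i` and a word index `j` is always positive, which is the toy analogue of the dense pixel-word interaction described in the abstract.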

Comments:	arXiv pre-print
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2408.15521 [cs.CV]
	(or arXiv:2408.15521v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.15521

Submission history

From: Seonghoon Yu
[v1] Wed, 28 Aug 2024 04:14:01 UTC (16,093 KB)
[v2] Thu, 19 Sep 2024 06:21:03 UTC (16,093 KB)
[v3] Tue, 17 Jun 2025 06:34:15 UTC (7,962 KB)
