Image Segmentation Using Text and Image Prompts (CVPR 2022)
Timo Lüddecke§ and Alexander S. Ecker
§ [email protected]

Abstract

Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive, as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone, which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties. Code is available at https://eckerlab.org/code/clipseg

Figure 1. Our key idea is to use CLIP to build a flexible zero/one-shot segmentation system that addresses multiple tasks at once.

1. Introduction

The ability to generalize to unseen data is a fundamental problem relevant for a broad range of applications in artificial intelligence. For instance, it is crucial that a household robot understands the prompt of its user, which might involve an unseen object type or an uncommon expression for an object. While humans excel at this task, this form of inference is challenging for computer vision systems.

Image segmentation requires a model to output a prediction for each pixel. Compared to whole-image classification, segmentation requires not only predicting what can be seen but also where it can be found. Classical semantic segmentation models are limited to segmenting the categories they have been trained on. Different approaches have emerged that extend this fairly constrained setting (see Tab. 1):

• In generalized zero-shot segmentation, seen as well as unseen categories need to be segmented by putting unseen categories in relation to seen ones, e.g. through word embeddings [1] or WordNet [2].

• In one-shot segmentation, the desired class is provided in the form of an image (and often an associated mask) in addition to the query image to be segmented.

• In referring expression segmentation, a model is trained on complex text queries but sees all classes during training (i.e. no generalization to unseen classes).

In this work, we introduce the CLIPSeg model (Fig. 1), which is capable of segmenting based on an arbitrary text query or an example image. CLIPSeg can address all three tasks named above. This multi-modal input format goes beyond existing multi-task benchmarks such as Visual Decathlon [3], where input is always provided in the form of images. To realize this system, we employ the pre-trained CLIP model as a backbone and train a thin conditional segmentation layer (decoder) on top. We use the joint text-visual embedding space of CLIP for conditioning our model, which enables us to process prompts in text form as well as images. Our idea is to teach the decoder to relate activations inside CLIP with an output segmentation, while permitting as little dataset bias as possible and maintaining the excellent and broad predictive capabilities of CLIP.

We employ a generic binary prediction setting, where a foreground that matches the prompt has to be differentiated from background. This binary setting can be adapted
to multi-label predictions, which is needed by Pascal zero-shot segmentation. Although the focus of our work is on building a versatile model, we find that CLIPSeg achieves competitive performance across three low-shot segmentation tasks. Moreover, it is able to generalize to classes and expressions for which it has never seen a segmentation.

| | unseen classes | free-form prompt | no fixed targets | negative samples |
|---|---|---|---|---|
| Our setting | ✓ | ✓ | ✓ | ✓ |
| Classic | - | - | - | ✓ |
| Referring Expression | - | ✓ | ✓ | - |
| Zero-shot | ✓ | - | ✓ | ✓ |
| One-shot | ✓ | - | ✓ | - |

Table 1. Comparison of different segmentation tasks. Negative means samples that do not contain the target (or one of the targets in multi-label segmentation). All approaches except classic segmentation adapt to new targets dynamically at inference time.

Contributions. Our main technical contribution is the CLIPSeg model, which extends the well-known CLIP transformer for zero-shot and one-shot segmentation tasks by proposing a lightweight transformer-based decoder. A key novelty of this model is that the segmentation target can be specified by different modalities: through text or an image. This allows us to train a unified model for several benchmarks. For text-based queries, unlike networks trained on PhraseCut, our model is able to generalize to new queries involving unseen words. For image-based queries, we explore various forms of visual prompt engineering, analogously to text prompt engineering in language modeling. Furthermore, we evaluate how our model generalizes to novel forms of prompts involving affordances.

2. Related Work

Foundation Models and Segmentation. Instead of learning from scratch, modern vision systems are commonly pre-trained on a large-scale dataset (either supervised [4] or self-supervised [5, 6]) and use weight transfer. The term foundation model has been coined for very large pre-trained models that are applicable to multiple downstream tasks [7]. One of these models is CLIP [8], which has demonstrated excellent performance on several image classification tasks. In contrast to previous models, which rely on ResNet [9] backbones, the best-performing CLIP model uses a novel visual transformer [10] architecture. Analogously to image classification, there have been efforts to make use of transformers for segmentation: TransUNet [11] and SETR [12] employ a hybrid architecture that combines a visual transformer for encoding with a CNN-based decoder. SegFormer [13] combines a transformer encoder with an MLP-based decoder. The Segmenter model [14] pursues a purely transformer-based approach; to generate a segmentation, either a projection of the patch embeddings or a mask transformer is proposed. Our CLIPSeg model extends CLIP with a transformer-based decoder, i.e. we do not rely on convolutional layers.

Referring Expression Segmentation. In referring expression segmentation, a target is specified by a natural language phrase and the goal is to segment all pixels that match this phrase. Early approaches used recurrent networks in combination with CNNs to address this problem [15, 16, 17, 18]. The CMSA module, which is central to the approach of Ye et al. [19], models long-term dependencies between text and image using attention. The more recent HULANet method [20] consists of a Mask-RCNN backbone and specific modules processing categories, attributes and relations, which are merged to generate a segmentation mask. MDETR [21] is an adaptation of the detection method DETR [22] to natural language phrase input: it consists of a CNN that extracts features and a transformer that predicts bounding boxes for a set of query prompts. Note that referring expression segmentation requires neither generalization to unseen object categories nor understanding of visual support images. Several benchmarks [20, 23, 24] were proposed to track progress in referring expression segmentation. We opt for the PhraseCut dataset [20], which is substantially larger in terms of images and classes than other datasets. It contains structured text queries involving objects, attributes and relationships, and a query can match multiple object instances.

Zero-Shot Segmentation. In zero-shot segmentation, the goal is to segment objects of categories that have not been seen during training. Normally, multiple classes need to be segmented in an image at the same time, and in the generalized setting both seen and unseen categories may occur. A key problem in zero-shot segmentation addressed by several methods is the bias which favors seen classes. Bucher et al. [25] train a DeepLabV3-based network to synthesize artificial, pixel-wise features for unseen classes based on word2vec label embeddings; these features are used to learn a classifier. Follow-up work explicitly models the relation between seen and unseen classes [26]. Others add semantic class information into dense prediction models [27]. More recent approaches use a joint space for image features and class prototypes [28], employ a probabilistic formulation to account for uncertainty [29] or model the detection of unseen objects explicitly [30].

One-Shot Semantic Segmentation. In one-shot semantic segmentation, the model is provided at test time with a single example of a certain class, usually as an image with a corresponding mask. One-shot semantic segmentation is a comparably new task, with the pioneering work
being published in 2017 by Shaban et al. [31], which introduced the Pascal-5i dataset based on Pascal images and labels. Their simple model extracts VGG16 features [32] from a masked support image to generate regression parameters that are applied per-location to the output of an FCN [33] to yield a segmentation. Later works introduce more complex mechanisms to handle one-shot segmentation: the pyramid graph network (PGNet) [34] generates a set of differently-shaped feature maps obtained through adaptive pooling, processes them with individual graph attention units and passes them through an atrous spatial pyramid pooling (ASPP) block [35]. The CANet network [36] first encodes the images using a shared encoder; predictions are then iteratively refined through a sequence of convolutions and ASPP blocks. Several approaches focus on the modeling of prototypes [37, 38, 39]. PFENet [40] uses a prior computed on high-level CNN features to provide an auxiliary segmentation that helps further processing. A weakly-supervised variant introduced by Rakelly et al. [41] requires only sparse annotations in the form of a set of points. In one-shot instance segmentation [42], individual object instances are segmented instead of a binary match/non-match prediction.

CLIP Extensions. Despite CLIP [8] being fairly new, multiple derivative works across different sub-fields have emerged. CLIP was combined with a GAN to modify images based on a text prompt [43] and used in robotics to generalize to unseen objects in manipulation tasks [44]. Other work focused on understanding CLIP in more detail. In the original CLIP paper [8], it was found that the design of prompts matters for downstream tasks, i.e. instead of using an object name alone as a prompt, adding the prefix "a photo of" increases performance. Zhou et al. [45] propose context optimization (CoOp), which automatically learns tokens that perform well for given downstream tasks. Other approaches rely on CLIP for open-set object detection [46, 47].

3. CLIPSeg Method

We use the visual transformer-based (ViT-B/16) CLIP [8] model as a backbone (Fig. 2) and extend it with a small, parameter-efficient transformer decoder. The decoder is trained on custom datasets to carry out segmentation, while the CLIP encoder remains frozen. A key challenge is to avoid imposing strong biases on predictions during segmentation training while maintaining the versatility of CLIP. We do not use the larger ViT-L/14@336px CLIP variant, as its weights had not been publicly released as of writing this work.

Decoder Architecture. Considering these demands, we propose CLIPSeg: a simple, purely transformer-based decoder with U-Net-inspired skip connections to the CLIP encoder that allow the decoder to be compact (in terms of parameters). While the query image (R^(W×H×3)) is passed through the CLIP visual transformer, activations at certain layers S are read out and projected to the token embedding size D of our decoder. These extracted activations (including the CLS token) are then added to the internal activations of our decoder before each transformer block. The decoder has as many transformer blocks as extracted CLIP activations (in our case 3). It generates the binary segmentation by applying a linear projection on the tokens of its last transformer layer, R^((1 + W/P · H/P) × D) → R^(W×H), where P is the token patch size of CLIP. In order to inform the decoder about the segmentation target, we modulate the decoder's input activation by a conditional vector using FiLM [48]. This conditional vector can be obtained in two ways: (1) using the CLIP text-transformer embedding of a text query or (2) using the CLIP visual transformer on a feature-engineered prompt image. CLIP itself is not trained but only used as a frozen feature extractor. Due to the compact decoder, CLIPSeg has only 1,122,305 trainable parameters for D = 64.
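To make the data flow and the conditioning explicit, the following PyTorch sketch mimics such a decoder. It is a simplification under our own assumptions (module names, the particular FiLM variant with a per-channel scale and shift, and the final token-to-patch projection), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CLIPSegStyleDecoder(nn.Module):
    """Sketch of a CLIPSeg-style decoder (a simplification, not the released code).

    clip_activations: list of N tensors of shape (B, 1 + HW/P^2, d_clip), read out
        from the frozen CLIP visual transformer at the layers given by S.
    cond: conditional vector (B, d_cond) from CLIP's text or visual encoder.
    Returns binary segmentation logits of shape (B, 1, H, W).
    """

    def __init__(self, d_clip=768, d_cond=512, d=64, n_heads=4, patch_size=16, n_layers=3):
        super().__init__()
        self.patch_size = patch_size
        # one projection per extracted CLIP activation (one per layer in S)
        self.reduces = nn.ModuleList([nn.Linear(d_clip, d) for _ in range(n_layers)])
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        # FiLM conditioning: per-channel scale and shift from the conditional vector
        self.film_mul = nn.Linear(d_cond, d)
        self.film_add = nn.Linear(d_cond, d)
        # map each patch token to a P x P block of logits
        self.head = nn.Linear(d, patch_size * patch_size)

    def forward(self, clip_activations, cond, image_hw):
        h, w = image_hw
        x = self.reduces[0](clip_activations[0])
        # modulate the decoder's input activation with the conditional vector
        x = self.film_mul(cond).unsqueeze(1) * x + self.film_add(cond).unsqueeze(1)
        for i, block in enumerate(self.blocks):
            if i > 0:  # add the remaining projected CLIP activations (skip connections)
                x = x + self.reduces[i](clip_activations[i])
            x = block(x)
        x = x[:, 1:]                       # drop the CLS token
        logits = self.head(x)              # (B, HW/P^2, P*P)
        b, p = logits.shape[0], self.patch_size
        logits = logits.view(b, h // p, w // p, p, p)
        logits = logits.permute(0, 1, 3, 2, 4).reshape(b, 1, h, w)
        return logits
```

Because the conditional vector lives in CLIP's joint embedding space, the same decoder weights can serve both text and image prompts.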
The original CLIP is constrained to a fixed image size due to the learned positional embedding. We enable different image sizes (including larger ones) by interpolating the positional embeddings. To validate the viability of this approach, we compare prediction quality for different image sizes and find that for ViT-B/16 performance only decreases for images larger than 350 pixels (see supplementary for details). In our experiments we use CLIP ViT-B/16 with a patch size P of 16 and a projection dimension of D = 64 if not indicated otherwise. We extract CLIP activations at layers S = [3, 7, 9]; consequently, our decoder has only three layers.
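A common way to implement this resizing, shown below as a minimal sketch, is to reshape the learned patch-position embeddings into their 2D grid and resample them bilinearly; we assume a square token grid with the CLS position first, and the exact resampling used in the released code may differ.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid, old_grid=14):
    """Interpolate learned ViT positional embeddings to a new token-grid size.

    pos_embed: (1, 1 + old_grid**2, D) with the CLS position first.
    new_grid:  patches per side for the new image size,
               e.g. 352 // 16 = 22 for a 352x352 input to ViT-B/16.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bilinear", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)
```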
Image-Text Interpolation. Our model receives information about the segmentation target ("what to segment?") through a conditional vector. This can be provided either by text or by an image (through visual prompt engineering). Since CLIP uses a shared embedding space for images and text captions, we can interpolate between both in the embedding space and condition on the interpolated vector. Formally, let s_i be the embedding of the support image and t_i the text embedding of a sample i; we obtain a conditional vector x_i by linear interpolation, x_i = a s_i + (1 − a) t_i, where a is sampled uniformly from [0, 1]. We use this randomized interpolation as a data augmentation strategy during training.
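A minimal sketch of this augmentation, assuming the CLIP embeddings of the support image and the phrase have already been computed and share the same dimensionality:

```python
import torch

def sample_conditional_vector(s_i, t_i):
    """Interpolate between image and text embeddings, x_i = a*s_i + (1-a)*t_i.

    s_i: CLIP visual embedding of the (engineered) support image, shape (B, D)
    t_i: CLIP text embedding of the phrase, shape (B, D)
    a is drawn uniformly from [0, 1] per sample during training.
    """
    a = torch.rand(s_i.shape[0], 1, device=s_i.device)
    return a * s_i + (1 - a) * t_i
```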
Figure 2. Architecture of CLIPSeg: We extend a frozen CLIP model (red and blue) with a transformer that segments the query image based on either a support image or a support prompt. N CLIP activations are extracted after blocks defined by S. The segmentation transformer and the projections (both green) are trained on PhraseCut or PhraseCut+.

Figure 3. Different forms of combining an image with the associated object mask to build a visual prompt have a strong effect on CLIP predictions (bar charts). We use the difference in the probability of the target object (orange) in the original image (left column) and the masking methods for our systematic analysis.
3.1. PhraseCut + Visual prompts (PC+)

We use the PhraseCut dataset [20], which encompasses over 340,000 phrases with corresponding image segmentations. Originally, this dataset does not contain visual support but only phrases, and for every phrase a corresponding object exists. We extend this dataset in two ways: with visual support samples and with negative samples. To add visual support images for a prompt p, we randomly draw from the set of all samples S_p which share the prompt p. In case the prompt is unique (|S_p| = 1), we rely only on the text prompt. Additionally, we introduce negative samples to the dataset, i.e. samples in which no object matches the prompt. To this end, the sample's phrase is replaced by a different phrase with a probability q_neg. Phrases are augmented randomly using a set of fixed prefixes (as suggested by the CLIP authors). On the images we apply random cropping under consideration of object locations, making sure the object remains at least partially visible. In the remainder of this paper, we call this extended dataset PhraseCut+ (abbreviated PC+). In contrast to the original PhraseCut dataset, which uses only text to specify the target, PC+ supports training using image-text interpolation. This way, we can train a joint model that operates on text and visual input.
4. Visual Prompt Engineering

In conventional, CNN-based one-shot semantic segmentation, masked pooling [31] has emerged as a standard technique to compute a prototype vector for conditioning: the provided support mask is downsampled, multiplied with a late feature map of the CNN and then pooled along the spatial dimensions. This way, only features that pertain to the support object are considered in the prototype vector. This method cannot be applied directly to transformer-based architectures, as semantic information is also accumulated in the CLS token throughout the hierarchy and not only in the feature maps. Circumventing the CLS token and deriving the conditional vector directly from masked pooling of the feature maps is not possible either, since it would break the compatibility between text embeddings and visual embeddings of CLIP. To learn more about how target information can be incorporated into CLIP, we compare several variants in a simple experiment without segmentation and its confounding effects. We consider the cosine distance (alignment) between visual and text-based embeddings and use the original CLIP weights without any additional training.

Specifically, we use CLIP to compute the text embeddings t_i which correspond to object names in the image. We then compare those to (1) the visual embedding s_o of the original image without modifications and (2) the visual embedding s_h highlighting the target object using a modified RGB image or attention mask (both techniques are described in detail below). By softmax-normalizing the vector of alignments [s_h t_0, s_h t_1, ...] for different highlighting techniques and images, we obtain the distributions shown in Fig. 3. For quantitative scores, we consider only the target object name embedding t_0, which we expect to have a stronger alignment with the highlighted image embedding s_h than with the original image embedding s_o (Fig. 3). This means that if a highlighting technique improves the alignment, the increase in object probability ΔP(object) = s_h t_0 − s_o t_0 should be large. We base this analysis on the LVIS dataset [49], since its images contain multiple objects and a rich set of categories. We sample 1,600 images and mask one target object out of all objects present in each image.
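Reading the alignments as softmax-normalized probabilities over the object names present in the image, the score can be sketched as follows; we take pre-computed CLIP embeddings as inputs and omit CLIP's learned logit scale, so absolute numbers will not match Tab. 2 exactly.

```python
import torch
import torch.nn.functional as F

def delta_p_object(s_o, s_h, text_embs, target_idx=0):
    """Increase in target-object probability due to visual prompt engineering.

    s_o, s_h:  CLIP visual embeddings of the original and the highlighted image, (D,)
    text_embs: CLIP text embeddings of all object names in the image, (K, D)
    Returns P(target | highlighted) - P(target | original), in percentage points.
    """
    s_o, s_h = F.normalize(s_o, dim=0), F.normalize(s_h, dim=0)
    text_embs = F.normalize(text_embs, dim=1)
    p_o = torch.softmax(text_embs @ s_o, dim=0)
    p_h = torch.softmax(text_embs @ s_h, dim=0)
    return 100 * (p_h[target_idx] - p_o[target_idx]).item()
```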
CLIP-Based Masking. The straightforward equivalent to masked pooling in a visual transformer is to apply the mask to the tokens. Normally, a visual transformer consists of a fixed set of tokens which can interact at every layer through multi-head attention: a CLS token used for read-out and image-region-related tokens which were originally obtained from image patches. The mask can now be incorporated by constraining the interaction at one (e.g. the last layer, 11) or more transformer layers to within-mask patch tokens and the CLS token only. Our evaluation (Tab. 2, left) suggests that this form of introducing the mask does not work well: by constraining the interactions of the CLS token (Tab. 2, left, top two rows), only a small improvement is achieved (in the last layer or in all layers), while constraining all interactions decreases performance dramatically. From
this we conclude that more complex strategies are necessary to combine image and mask internally.

Visual Prompt Engineering. Instead of applying the mask inside the model, we can also combine mask and image into a new image, which can then be processed by the visual transformer. Analogously to prompt engineering in NLP (e.g. in GPT-3 [50]), we call this procedure visual prompt engineering. Since this form of prompt design is novel and the best-performing strategies in this context are unknown, we conduct an extensive evaluation of different variants of designing visual prompts (Tab. 2). We find that the exact form in which mask and image are combined matters a lot. Generally, we identify three image operations that improve the alignment between the object text prompts and the images: decreasing the background brightness, blurring the background (using a Gaussian filter) and cropping to the object. The combination of all three performs best (Tab. 2, last row). We will use this variant in the remainder.

| CLIP modification & extras | ΔP(object) |
|---|---|
| CLIP masking CLS in layer 11 | 1.34 |
| CLIP masking CLS in all layers | 1.71 |
| CLIP masking all in all layers | -14.44 |
| dye object red in grayscale image | 1.21 |
| add red object outline | 2.29 |

| background modification | ΔP(object) |
|---|---|
| BG intensity 50% | 3.08 |
| BG intensity 10% | 13.85 |
| BG intensity 0% | 23.40 |
| BG blur | 13.15 |
| BG blur + intensity 10% | 21.73 |

| cropping & combinations | ΔP(object) |
|---|---|
| crop large context | 6.27 |
| crop | 13.60 |
| crop & BG blur | 15.34 |
| crop & BG intensity 10% | 21.73 |
| crop & BG intensity 10% + BG blur | 23.50 |

Table 2. Visual prompt engineering: average improvement of object probability for different forms of combining image and mask over 1,600 samples. Cropping means cutting the image according to the regions specified by the mask; "BG" means background.
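The best-performing combination (last row of Tab. 2) can be sketched as follows. The blur radius, the exact darkening factor and the crop margin are assumptions on our part; the values actually used are given in the paper's supplementary.

```python
import numpy as np
from PIL import Image, ImageFilter

def build_visual_prompt(image, mask, bg_intensity=0.1, blur_radius=10, margin=0.1):
    """Combine a PIL image and a binary object mask (numpy array, 1 = object) into a
    visual prompt: darken and blur the background, then crop to the object."""
    img = np.asarray(image).astype(np.float32) / 255.0                       # (H, W, 3)
    blurred = np.asarray(image.filter(ImageFilter.GaussianBlur(blur_radius))) / 255.0
    m = mask.astype(np.float32)[..., None]
    out = m * img + (1 - m) * bg_intensity * blurred                         # darken + blur BG
    ys, xs = np.where(mask > 0.5)                                            # object bounding box
    h, w = mask.shape
    dy, dx = int(margin * h), int(margin * w)
    y0, y1 = max(ys.min() - dy, 0), min(ys.max() + dy + 1, h)
    x0, x1 = max(xs.min() - dx, 0), min(xs.max() + dx + 1, w)
    return Image.fromarray((out[y0:y1, x0:x1] * 255).astype(np.uint8))
```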
5. Experiments

We first evaluate our model on three established segmentation benchmarks before demonstrating the main contribution of our work: flexible few-shot segmentation that can be based on either text or image prompts.

Metrics. Compared to approaches in zero-shot and one-shot segmentation (e.g. [25, 26]), the vocabulary we use is open, i.e. the set of classes or expressions is not fixed. Therefore, throughout the experiments, our models are trained to generate binary predictions that indicate where objects matching the query are located. If necessary, this binary setting can be transformed into a multi-label setting (as we do in Section 5.2). In segmentation, intersection over union (IoU, also Jaccard score) is a common metric to compare predictions with ground truth. Due to the diversity of the tasks, we employ different forms of IoU: foreground IoU (IoU_FG), which computes IoU on foreground pixels only; mean IoU, which computes the average over foreground IoUs of different classes; and binary IoU (IoU_BIN), which averages over foreground IoU and background IoU. In binary segmentation, IoU requires a threshold t to be specified. While most of the time the natural choice of 0.5 is used, the optimal values can strongly deviate from 0.5 if the probability that an object matches the query differs between training and inference (the a-priori probability of a query matching one or more objects in the scene depends highly on context and dataset). Therefore, we report the performance of one-shot segmentation using thresholds t optimized per task and model. Additionally, we adopt the average precision metric (AP) in all our experiments. Average precision measures the area under the recall-precision curve and thereby captures how well the system can discriminate matches from non-matches, independent of the choice of threshold.
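For reference, the per-sample binary metrics can be computed as in the following straightforward sketch (not the official evaluation code):

```python
import numpy as np

def binary_metrics(pred, target, t=0.5):
    """Foreground IoU and binary IoU for a single sample.

    pred:   predicted foreground probabilities, shape (H, W)
    target: ground-truth binary mask, shape (H, W)
    t:      decision threshold applied to the probabilities
    """
    p, g = pred > t, target > 0.5
    iou_fg = (p & g).sum() / max((p | g).sum(), 1)
    iou_bg = (~p & ~g).sum() / max((~p | ~g).sum(), 1)
    return iou_fg, 0.5 * (iou_fg + iou_bg)   # IoU_FG, IoU_BIN
```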
Models and Baselines. In our experiments, we differentiate between two variants of CLIPSeg: one trained on the original PhraseCut dataset (PC) and one trained on the extended version of PhraseCut, which uses 20% negative samples, contains visual samples (PC+) and uses image-text interpolation (Sec. 3). We call the latter, more robust version the universal model. To put the performance of our models into perspective, we provide two baselines:

• CLIP-Deconv encompasses CLIP but uses a very basic decoder, consisting only of the essential parts: FiLM conditioning [48], a linear projection and a deconvolution. This helps us to estimate to which degree CLIP alone is responsible for the results.

• ViTSeg shares the architecture of CLIPSeg but uses an ImageNet-trained visual transformer as a backbone [51]. For encoding text, we use the same text transformer as CLIP. This way we learn to which degree the specific CLIP weights are crucial for good performance.

We rely on PyTorch [52] for training and use an image size of 352 × 352 pixels throughout our experiments (for details see the appendix).

5.1. Referring Expression Segmentation

We evaluate referring expression segmentation performance (Tab. 3) on the original PhraseCut dataset and compare to scores reported by Wu et al. [20] as well as the concurrently developed transformer-based MDETR method [21]. For this experiment we trained a version of CLIPSeg on the original PhraseCut dataset (CLIPSeg [PC]) using only text labels, in addition to the universal variant which also includes visual samples (CLIPSeg [PC+]).

| Method | t | mIoU | IoU_FG | AP |
|---|---|---|---|---|
| CLIPSeg (PC+) | 0.3 | 43.4 | 54.7 | 76.7 |
| CLIPSeg (PC, D = 128) | 0.3 | 48.2 | 56.5 | 78.2 |
| CLIPSeg (PC) | 0.3 | 46.1 | 56.2 | 78.2 |
| CLIP-Deconv | 0.3 | 37.7 | 49.5 | 71.2 |
| ViTSeg (PC+) | 0.1 | 28.4 | 35.4 | 58.3 |
| ViTSeg (PC) | 0.3 | 38.9 | 51.2 | 74.4 |
| MDETR [21] | | 53.7 | - | - |
| HulaNet [20] | | 41.3 | 50.8 | - |
| Mask-RCNN top [20] | | 39.4 | 47.4 | - |
| RMI [20] | | 21.1 | 42.5 | - |

Table 3. Referring expression segmentation performance on PhraseCut (t refers to the binary threshold).

Our approaches outperform the two-stage HULANet approach by Wu et al. [20]. In particular, a high-capacity decoder (D = 128) seems to be beneficial for PhraseCut. However, the performance is worse than that of MDETR [21], which operates at full image resolution and received two rounds of fine-tuning on PhraseCut. Notably, the ViTSeg baseline performs generally worse than CLIPSeg, which shows that CLIP pre-training is helpful.

5.2. Generalized Zero-Shot Segmentation

| Method | pre-train. | unseen-10 mIoU_S | unseen-10 mIoU_U | unseen-4 mIoU_S | unseen-4 mIoU_U |
|---|---|---|---|---|---|
| CLIPSeg (PC+) | CLIP | 35.7 | 43.1 | 20.8 | 47.3 |
| CLIP-Deconv (PC+) | CLIP | 25.1 | 36.7 | 25.9 | 41.9 |
| ViTSeg (PC+) | IN | 4.2 | 19.0 | 6.0 | 24.8 |
| SPNet [27] | IN | 59.0 | 18.1 | 67.3 | 21.8 |
| ZS3Net [25] | IN-seen | 33.9 | 18.1 | 66.4 | 23.2 |
| CSRL [53] | IN-seen | 59.2 | 21.0 | 69.8 | 31.7 |
| CaGNet [54] | IN | - | - | 69.5 | 40.2 |
| OSR [30] | IN-seen | 72.1 | 33.9 | 75.0 | 44.1 |
| JoEm [28] | IN-seen | 63.4 | 22.5 | 67.0 | 33.4 |

Table 4. Zero-shot segmentation performance on Pascal-VOC with 10 and 4 unseen classes. mIoU_S and mIoU_U indicate performance on seen and unseen classes, respectively. Our model is trained on PhraseCut with the Pascal classes being removed but uses a pre-trained CLIP backbone. IN-seen indicates ImageNet pre-training with unseen classes being removed.

In generalized zero-shot segmentation, test images contain categories that have never been seen before in addition to known categories. We evaluate the model's zero-shot segmentation performance using the established Pascal-VOC benchmark (Tab. 4). It contains five splits involving 2 to 10 unseen classes (we report only 4 and 10 unseen classes). The latter is the most challenging setting as the set of unseen classes is large. Since our model was trained on foreground/background segmentation, we cannot directly use it in a multi-label setting. Therefore, we employ a simple adaptation: our model predicts a binary map independently for each of the 20 Pascal classes. Across all 20 predictions we determine the class with the highest probability for each pixel.
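This adaptation amounts to stacking the per-class binary predictions and taking a pixel-wise argmax; the background handling via a threshold in the sketch below is our own assumption.

```python
import torch

def binary_to_multilabel(per_class_probs, bg_threshold=0.5):
    """Combine independent binary maps into a single label map.

    per_class_probs: (C, H, W) foreground probabilities, one map per Pascal class,
                     obtained by querying the model once per class name.
    Returns an (H, W) label map with 0 = background and 1..C = classes.
    """
    best_prob, best_cls = per_class_probs.max(dim=0)
    labels = best_cls + 1
    labels[best_prob < bg_threshold] = 0    # background where no class is confident
    return labels
```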
We train on PhraseCut+ but remove the unseen Pascal classes from the dataset. This is carried out by assigning the Pascal classes to WordNet synsets [2] and generating a set of invalid words by traversing hyponyms (e.g. different dog breeds for dog). Prompts that contain such a word are removed from the dataset.

The idea of conducting this experiment is to provide a reference for the zero-shot performance of our universal model. It should not be considered as competing in this benchmark, as we use a different training setup (CLIP pre-training, binary segmentation on PhraseCut). The results (Tab. 4) indicate a major gap between seen and unseen classes in models trained on Pascal-VOC, while our models tend to be more balanced. This is due to other models being trained exclusively on the 10 or 16 seen Pascal classes, in contrast to CLIPSeg, which can differentiate many more classes (or phrases). In fact, our model performs better on unseen classes than on seen ones. This difference is likely because the seen classes are generally harder to segment: for the unseen-4 setting, the unseen classes are "airplane", "cow", "motorbike" and "sofa". All of them are large and comparatively distinct objects.

5.3. One-Shot Semantic Segmentation

In one-shot semantic segmentation, a single example image along with a mask is presented to the network. Regions that pertain to the class highlighted in the example image must be found in a query image. Compared to the previous tasks, we cannot rely on a text label but must understand the provided support image. Above (Sec. 4) we identified the best method for visual prompt design, which we use here: cropping out the target object while blurring and darkening the background. To remove classes that overlap with the respective subset of Pascal during training, we use the same method as in the previous section (Sec. 5.2). Unlike in zero-shot segmentation, ImageNet pre-trained backbones are common in one-shot segmentation [37, 40]. PFENet in particular leverages pre-training by using high-level feature similarity as a prior. Similarly, HSNet [55] processes correlated activations of query and support image using 4D convolutions at multiple levels.

On Pascal-5i we find our universal model CLIPSeg (PC+) to achieve competitive performance (Tab. 5) among state-of-the-art methods, with only the very recent HSNet performing better. The results on COCO-20i (Tab. 6) show that CLIPSeg also works well when trained on datasets other than PhraseCut(+). Again, HSNet performs better. To put this in perspective, it should be considered that
HSNet (and PFENet) are explicitly designed for one-shot segmentation, rely on pre-trained CNN activations and cannot handle text by default: Tian et al. [40] extended PFENet to zero-shot segmentation (but used the one-shot protocol) by replacing the visual sample with word vectors [1, 56] of the text labels. In that case, CLIPSeg outperforms their scores by a large margin (Tab. 7).

| Method | t | vis. backb. | mIoU | IoU_BIN | AP |
|---|---|---|---|---|---|
| CLIPSeg (PC+) | 0.3 | ViT (CLIP) | 59.5 | 75.0 | 82.3 |
| CLIPSeg (PC) | 0.3 | ViT (CLIP) | 52.3 | 69.5 | 72.4 |
| CLIP-Deconv (PC+) | 0.2 | ViT (CLIP) | 48.0 | 65.8 | 68.0 |
| ViTSeg (PC+) | 0.2 | ViT (IN) | 39.0 | 59.0 | 62.4 |
| PPNet [39] | | RN50 | 52.8 | 69.2 | - |
| RePRI [57] | | RN50 | 59.7 | - | - |
| PFENet [40] | | RN50 | 60.2 | 73.3 | - |
| HSNet [55] | | RN50 | 64.0 | 76.7 | - |
| PPNet [39] | | RN101 | 55.2 | 70.9 | - |
| RePRI [57] | | RN101 | 59.4 | - | - |
| PFENet [40] | | RN101 | 59.6 | 72.9 | - |
| HSNet [55] | | RN101 | 66.2 | 77.6 | - |

Table 5. One-shot performance on Pascal-5i (CLIPSeg and ViTSeg trained on PhraseCut+).

| Method | t | vis. backb. | mIoU | IoU_BIN | AP |
|---|---|---|---|---|---|
| CLIPSeg (COCO) | 0.1 | ViT (CLIP) | 33.2 | 58.4 | 40.5 |
| CLIPSeg (COCO+N) | 0.1 | ViT (CLIP) | 33.3 | 59.1 | 41.7 |
| CLIP-Deconv (COCO+N) | 0.1 | ViT (CLIP) | 29.8 | 56.8 | 40.8 |
| ViTSeg (COCO) | 0.1 | ViT (IN) | 14.4 | 46.1 | 15.7 |
| PPNet [39] | | RN50 | 29.0 | - | - |
| RePRI [57] | | RN50 | 34.0 | - | - |
| PFENet [40] | | RN50 | 35.8 | - | - |
| HSNet [55] | | RN50 | 39.2 | 68.2 | - |
| HSNet [55] | | RN101 | 41.2 | 69.1 | - |

Table 6. One-shot performance on COCO-20i (CLIPSeg trained on PhraseCut), +N indicates 10% negative samples.

| Pascal-5i | t | vis. backb. | mIoU | IoU_BIN | AP |
|---|---|---|---|---|---|
| CLIPSeg (PC+) | 0.3 | ViT (CLIP) | 72.4 | 83.1 | 93.5 |
| CLIPSeg (PC) | 0.3 | ViT (CLIP) | 70.3 | 81.6 | 84.8 |
| CLIP-Deconv (PC+) | 0.3 | ViT (CLIP) | 63.2 | 77.3 | 85.3 |
| ViTSeg (PC+) | 0.2 | ViT (IN) | 39.0 | 59.0 | 62.4 |
| LSeg [58] | | ViT (CLIP) | 52.3 | 67.0 | - |
| PFENet [40] | | VGG16 | 54.2 | - | - |

Table 7. Zero-shot performance on Pascal-5i. The scores were obtained by following the evaluation protocol of one-shot segmentation but using text input.

5.4. One Model For All: Generalized Prompts

We have shown that CLIPSeg performs well on a variety of academic segmentation benchmarks. Next, we evaluate its performance "in the wild" in unseen situations.

Qualitative Results. In Fig. 4 we show qualitative results divided into two groups: (1, left) affordance-like [59, 60] ("generalized") prompts that are different from the descriptive prompts of PhraseCut and (2, right) prompts that were taken from the PhraseCut test set. For the latter we add challenging extra prompts involving an existing object but the wrong color (indicated in orange). Generalized prompts, which deviate from the PhraseCut training set by referring to actions ("something to ...") or rare object classes ("cutlery"), work surprisingly well given that the model was not trained on such cases. It has learned an intuition of stuff that can be stored away in cupboards, where sitting is possible and what "living creature" means. Rarely, false positives are generated (the bug in the salad is not a cow). Details in the prompt are reflected by the segmentation (blue boxes), and information about the color influences the predicted object probabilities strongly (orange box).

Systematic Analysis. To quantitatively assess the performance for generalized queries, we construct subsets of the LVIS test dataset containing only images of classes that correspond to affordances or attributes. Then we ask our model to segment with these affordances or attributes as prompts. For instance, we compute the foreground intersection over union between armchair, sofa and loveseat objects when "sit on" is used as prompt. A complete list of which affordances or attributes are mapped onto which objects can be found in the appendix. We find (Tab. 8) that the CLIPSeg version trained on PC+ performs better than the CLIP-Deconv baseline and the version trained on LVIS, which contains only object labels instead of complex phrases. This result suggests that both dataset variability and model complexity are necessary for generalization. ViTSeg performs worse, which is expected as it misses the strong CLIP backbone, known for its generalization capabilities.

| Method | Affordances mIoU | Affordances AP | Attributes mIoU | Attributes AP | Meronymy mIoU | Meronymy AP |
|---|---|---|---|---|---|---|
| CLIPSeg (PC+) | 36.9 | 50.5 | 26.6 | 43.0 | 25.7 | 29.0 |
| CLIPSeg (LVIS) | 37.7 | 44.6 | 18.4 | 16.6 | 18.9 | 13.8 |
| CLIP-Deconv | 32.2 | 43.7 | 23.1 | 35.6 | 21.1 | 27.1 |
| ViTSeg (PC+) | 19.2 | 23.5 | 26.8 | 28.0 | 18.4 | 15.9 |

Table 8. Performance for generalized prompts. While the PC+ model has seen prompts during training (colliding prompts with the test set were removed), the LVIS version was trained on object classes only and is able to generalize due to the CLIP backbone. We use the best threshold t for each model.

5.5. Ablation Study

In order to identify crucial factors for the performance of CLIPSeg, we conduct an ablation study on PhraseCut (Tab. 9). We evaluate text-based and visual prompt-based performance (obtained using our modifications on PhraseCut
Figure 4. Qualitative predictions of CLIPSeg (PC+) for various prompts; darkness indicates prediction strength. The generalized prompts (left) deviate from the PhraseCut prompts as they involve action-related properties or new object names.
References

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119, 2013.
[2] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[3] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–1607. PMLR, 2020. URL https://proceedings.mlr.press/v119/chen20j.html.
[6] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
[7] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[8] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[12] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.
[13] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
[14] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. arXiv preprint arXiv:2105.05633, 2021.
[15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision, pages 108–124. Springer, 2016.
[16] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1271–1280, 2017.
[17] Hengcan Shi, Hongliang Li, Fanman Meng, and Q. Wu. Key-word-aware network for referring expression image segmentation. In ECCV, 2018.
[18] Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2018.
[19] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10494–10503, 2019.
[20] Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. PhraseCut: Language-based image segmentation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10216–10225, 2020.
[21] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. MDETR: Modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763, 2021.
[22] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision – ECCV 2020, pages 213–229. Springer, 2020.
[23] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
[24] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.
[25] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick Pérez. Zero-shot semantic segmentation. Advances in Neural Information Processing Systems, 32:468–479, 2019.
[26] Peike Li, Yunchao Wei, and Yi Yang. Consistent structural relation learning for zero-shot segmentation. In NeurIPS, 2020.
[27] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero- and few-label semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[28] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9536–9545, 2021.
[29] Ping Hu, Stan Sclaroff, and Kate Saenko. Uncertainty-aware learning for zero-shot semantic segmentation. In NeurIPS, 2020.
[30] Hui Zhang and Henghui Ding. Prototypical matching and open set rejection for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6974–6983, 2021.
[31] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. In BMVC, 2017.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[34] Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 9587–9595, 2019.
[35] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40, 2018.
[36] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[37] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 9197–9206, 2019.
[38] Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang Ye. Prototype mixture models for few-shot semantic segmentation. In European Conference on Computer Vision (ECCV), 2020.
[39] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In European Conference on Computer Vision, pages 142–158. Springer, 2020.
[40] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. doi: 10.1109/TPAMI.2020.3013717. URL https://doi.org/10.1109/TPAMI.2020.3013717.
[41] Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alexei A. Efros, and Sergey Levine. Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373, 2018.
[42] Claudio Michaelis, Ivan Ustyuzhaninov, Matthias Bethge, and Alexander S. Ecker. One-shot instance segmentation. arXiv preprint arXiv:1811.11507, 2018. URL http://arxiv.org/abs/1811.11507.
[43] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2094, 2021.
[44] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. arXiv preprint arXiv:2109.12098, 2021.
[45] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
[46] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero-shot detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[47] Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei Shu. Zero-shot open set detection by extending CLIP. arXiv preprint arXiv:2109.02748, 2021.
[48] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 2018. doi: 10.23915/distill.00011.
[58] Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RriDjddCLN.
[59] James Jerome Gibson. The Senses Considered as Perceptual Systems. Houghton Mifflin, 1966.