Image Segmentation Using Text and Image Prompts (CVPR 2022)
Timo Lüddecke§ and Alexander S. Ecker
§ [email protected]

Abstract

Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive, as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone, which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties. Code is available at https://eckerlab.org/code/clipseg

Figure 1. Our key idea is to use CLIP to build a flexible zero/one-shot segmentation system that addresses multiple tasks at once.

1. Introduction

The ability to generalize to unseen data is a fundamental problem relevant for a broad range of applications in artificial intelligence. For instance, it is crucial that a household robot understands the prompt of its user, which might involve an unseen object type or an uncommon expression for an object. While humans excel at this task, this form of inference is challenging for computer vision systems.

Image segmentation requires a model to output a prediction for each pixel. Compared to whole-image classification, segmentation requires not only predicting what can be seen but also where it can be found. Classical semantic segmentation models are limited to segmenting the categories they have been trained on. Different approaches have emerged that extend this fairly constrained setting (see Tab. 1):

• In generalized zero-shot segmentation, seen as well as unseen categories need to be segmented by putting unseen categories in relation to seen ones, e.g. through word embeddings [1] or WordNet [2].

• In one-shot segmentation, the desired class is provided in the form of an image (and often an associated mask) in addition to the query image to be segmented.

• In referring expression segmentation, a model is trained on complex text queries but sees all classes during training (i.e. no generalization to unseen classes).

In this work, we introduce the CLIPSeg model (Fig. 1), which is capable of segmenting based on an arbitrary text query or an example image. CLIPSeg can address all three tasks named above. This multi-modal input format goes beyond existing multi-task benchmarks such as Visual Decathlon [3], where input is always provided in the form of images. To realize this system, we employ the pre-trained CLIP model as a backbone and train a thin conditional segmentation layer (decoder) on top. We use the joint text-visual embedding space of CLIP for conditioning our model, which enables us to process prompts in text form as well as images. Our idea is to teach the decoder to relate activations inside CLIP with an output segmentation, while permitting as little dataset bias as possible and maintaining the excellent and broad predictive capabilities of CLIP.

We employ a generic binary prediction setting, where a foreground that matches the prompt has to be differentiated from background. This binary setting can be adapted
to multi-label predictions, which is needed by Pascal zero-shot segmentation. Although the focus of our work is on building a versatile model, we find that CLIPSeg achieves competitive performance across three low-shot segmentation tasks. Moreover, it is able to generalize to classes and expressions for which it has never seen a segmentation.

| | unseen classes | free-form prompt | no fixed targets | negative samples |
|---|---|---|---|---|
| Our setting | ✓ | ✓ | ✓ | ✓ |
| Classic | - | - | - | ✓ |
| Referring Expression | - | ✓ | ✓ | - |
| Zero-shot | ✓ | - | ✓ | ✓ |
| One-shot | ✓ | - | ✓ | - |

Table 1. Comparison of different segmentation tasks. Negative means samples that do not contain the target (or one of the targets in multi-label segmentation). All approaches except classic segmentation adapt to new targets dynamically at inference time.

Contributions. Our main technical contribution is the CLIPSeg model, which extends the well-known CLIP transformer for zero-shot and one-shot segmentation tasks by proposing a lightweight transformer-based decoder. A key novelty of this model is that the segmentation target can be specified by different modalities: through text or an image. This allows us to train a unified model for several benchmarks. For text-based queries, unlike networks trained on PhraseCut, our model is able to generalize to new queries involving unseen words. For image-based queries, we explore various forms of visual prompt engineering, analogously to text prompt engineering in language modeling. Furthermore, we evaluate how our model generalizes to novel forms of prompts involving affordances.

2. Related Work

Foundation Models and Segmentation. Instead of learning from scratch, modern vision systems are commonly pre-trained on a large-scale dataset (either supervised [4] or self-supervised [5, 6]) and use weight transfer. The term foundation model has been coined for very large pre-trained models that are applicable to multiple downstream tasks [7]. One of these models is CLIP [8], which has demonstrated excellent performance on several image classification tasks. In contrast to previous models, which rely on ResNet [9] backbones, the best-performing CLIP model uses a novel visual transformer [10] architecture. Analogously to image classification, there have been efforts to make use of transformers for segmentation: TransUNet [11] and SETR [12] employ a hybrid architecture that combines a visual transformer for encoding with a CNN-based decoder. SegFormer [13] combines a transformer encoder with an MLP-based decoder. The Segmenter model [14] pursues a purely transformer-based approach; to generate a segmentation, either a projection of the patch embeddings or a mask transformer is proposed. Our CLIPSeg model extends CLIP with a transformer-based decoder, i.e. we do not rely on convolutional layers.

Referring Expression Segmentation. In referring expression segmentation, a target is specified by a natural language phrase and the goal is to segment all pixels that match this phrase. Early approaches used recurrent networks in combination with CNNs to address this problem [15, 16, 17, 18]. The CMSA module, which is central to the approach of Ye et al. [19], models long-term dependencies between text and image using attention. The more recent HULANet method [20] consists of a Mask-RCNN backbone and specific modules processing categories, attributes and relations, which are merged to generate a segmentation mask. MDETR [21] is an adaptation of the detection method DETR [22] to natural language phrase input: it consists of a CNN that extracts features and a transformer that predicts bounding boxes for a set of query prompts. Note that referring expression segmentation requires neither generalization to unseen object categories nor understanding of visual support images. Several benchmarks [20, 23, 24] were proposed to track progress in referring expression segmentation. We opt for the PhraseCut dataset [20], which is substantially larger in terms of images and classes than other datasets. It contains structured text queries involving objects, attributes and relationships, and a query can match multiple object instances.

Zero-Shot Segmentation. In zero-shot segmentation, the goal is to segment objects of categories that have not been seen during training. Normally, multiple classes need to be segmented in an image at the same time, and in the generalized setting both seen and unseen categories may occur. A key problem in zero-shot segmentation addressed by several methods is the bias which favors seen classes. Bucher et al. [25] train a DeepLabV3-based network to synthesize artificial, pixel-wise features for unseen classes based on word2vec label embeddings; these features are used to learn a classifier. Follow-up work explicitly models the relation between seen and unseen classes [26]. Others add semantic class information into dense prediction models [27]. More recent approaches use a joint space for image features and class prototypes [28], employ a probabilistic formulation to account for uncertainty [29] or model the detection of unseen objects explicitly [30].

One-Shot Semantic Segmentation. In one-shot semantic segmentation, the model is provided at test time with a single example of a certain class, usually as an image with a corresponding mask. One-shot semantic segmentation is a comparably new task, with the pioneering work
being published in 2017 by Shaban et al. [31], which introduced the Pascal-5i dataset based on Pascal images and labels. Their simple model extracts VGG16 features [32] from a masked support image to generate regression parameters that are applied per-location to the output of an FCN [33] to yield a segmentation. Later works introduce more complex mechanisms to handle one-shot segmentation: the pyramid graph network (PGNet) [34] generates a set of differently-shaped feature maps obtained through adaptive pooling, processes them with individual graph attention units and passes them through an atrous spatial pyramid pooling (ASPP) block [35]. The CANet network [36] first encodes the images using a shared encoder; predictions are then iteratively refined through a sequence of convolutions and ASPP blocks. Several approaches focus on the modeling of prototypes [37, 38, 39]. PFENet [40] uses a prior computed on high-level CNN features to provide an auxiliary segmentation that helps further processing. A weakly-supervised variant introduced by Rakelly et al. [41] requires only sparse annotations in the form of a set of points. In one-shot instance segmentation [42], individual object instances are segmented instead of a binary match/non-match prediction.

CLIP Extensions. Despite CLIP [8] being fairly new, multiple derivative works across different sub-fields have emerged. CLIP was combined with a GAN to modify images based on a text prompt [43] and used in robotics to generalize to unseen objects in manipulation tasks [44]. Other work focused on understanding CLIP in more detail. In the original CLIP paper [8], it was found that the design of prompts matters for downstream tasks, i.e. instead of using an object name alone as a prompt, adding the prefix "a photo of" increases performance. Zhou et al. [45] propose context optimization (CoOp), which automatically learns tokens that perform well for given downstream tasks. Other approaches rely on CLIP for open-set object detection [46, 47].

3. CLIPSeg Method

We use the visual transformer-based (ViT-B/16) CLIP [8] model as a backbone (Fig. 2) and extend it with a small, parameter-efficient transformer decoder. The decoder is trained on custom datasets to carry out segmentation, while the CLIP encoder remains frozen. A key challenge is to avoid imposing strong biases on predictions during segmentation training while maintaining the versatility of CLIP. We do not use the larger ViT-L/14@336px CLIP variant, as its weights had not been publicly released as of writing this work.

Decoder Architecture. Considering these demands, we propose CLIPSeg: a simple, purely transformer-based decoder with U-Net-inspired skip connections to the CLIP encoder that allow the decoder to be compact (in terms of parameters). While the query image (R^(W×H×3)) is passed through the CLIP visual transformer, activations at certain layers S are read out and projected to the token embedding size D of our decoder. These extracted activations (including the CLS token) are then added to the internal activations of our decoder before each transformer block. The decoder has as many transformer blocks as extracted CLIP activations (in our case 3). It generates the binary segmentation by applying a linear projection on the tokens of its last transformer layer, R^((1 + W/P · H/P) × D) → R^(W×H), where P is the token patch size of CLIP. In order to inform the decoder about the segmentation target, we modulate the decoder's input activation by a conditional vector using FiLM [48]. This conditional vector can be obtained in two ways: (1) using the CLIP text-transformer embedding of a text query or (2) using the CLIP visual transformer on a feature-engineered prompt image. CLIP itself is not trained but only used as a frozen feature extractor. Due to the compact decoder, CLIPSeg has only 1,122,305 trainable parameters for D = 64.
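To make the data flow and the conditioning explicit, the following PyTorch sketch mimics such a decoder. It is a simplification under our own assumptions (module names, the particular FiLM variant with a per-channel scale and shift, and the final token-to-patch projection), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CLIPSegStyleDecoder(nn.Module):
    """Sketch of a CLIPSeg-style decoder (a simplification, not the released code).

    clip_activations: list of N tensors of shape (B, 1 + HW/P^2, d_clip), read out
        from the frozen CLIP visual transformer at the layers given by S.
    cond: conditional vector (B, d_cond) from CLIP's text or visual encoder.
    Returns binary segmentation logits of shape (B, 1, H, W).
    """

    def __init__(self, d_clip=768, d_cond=512, d=64, n_heads=4, patch_size=16, n_layers=3):
        super().__init__()
        self.patch_size = patch_size
        # one projection per extracted CLIP activation (one per layer in S)
        self.reduces = nn.ModuleList([nn.Linear(d_clip, d) for _ in range(n_layers)])
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        # FiLM conditioning: per-channel scale and shift from the conditional vector
        self.film_mul = nn.Linear(d_cond, d)
        self.film_add = nn.Linear(d_cond, d)
        # map each patch token to a P x P block of logits
        self.head = nn.Linear(d, patch_size * patch_size)

    def forward(self, clip_activations, cond, image_hw):
        h, w = image_hw
        x = self.reduces[0](clip_activations[0])
        # modulate the decoder's input activation with the conditional vector
        x = self.film_mul(cond).unsqueeze(1) * x + self.film_add(cond).unsqueeze(1)
        for i, block in enumerate(self.blocks):
            if i > 0:  # add the remaining projected CLIP activations (skip connections)
                x = x + self.reduces[i](clip_activations[i])
            x = block(x)
        x = x[:, 1:]                       # drop the CLS token
        logits = self.head(x)              # (B, HW/P^2, P*P)
        b, p = logits.shape[0], self.patch_size
        logits = logits.view(b, h // p, w // p, p, p)
        logits = logits.permute(0, 1, 3, 2, 4).reshape(b, 1, h, w)
        return logits
```

Because the conditional vector lives in CLIP's joint embedding space, the same decoder weights can serve both text and image prompts.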
The original CLIP is constrained to a fixed image size due to the learned positional embedding. We enable different image sizes (including larger ones) by interpolating the positional embeddings. To validate the viability of this approach, we compare prediction quality for different image sizes and find that for ViT-B/16 performance only decreases for images larger than 350 pixels (see supplementary for details). In our experiments we use CLIP ViT-B/16 with a patch size P of 16 and a projection dimension of D = 64 if not indicated otherwise. We extract CLIP activations at layers S = [3, 7, 9]; consequently, our decoder has only three layers.
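A common way to implement this resizing, shown below as a minimal sketch, is to reshape the learned patch-position embeddings into their 2D grid and resample them bilinearly; we assume a square token grid with the CLS position first, and the exact resampling used in the released code may differ.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid, old_grid=14):
    """Interpolate learned ViT positional embeddings to a new token-grid size.

    pos_embed: (1, 1 + old_grid**2, D) with the CLS position first.
    new_grid:  patches per side for the new image size,
               e.g. 352 // 16 = 22 for a 352x352 input to ViT-B/16.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bilinear", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)
```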
Image-Text Interpolation. Our model receives information about the segmentation target ("what to segment?") through a conditional vector. This can be provided either by text or by an image (through visual prompt engineering). Since CLIP uses a shared embedding space for images and text captions, we can interpolate between both in the embedding space and condition on the interpolated vector. Formally, let s_i be the embedding of the support image and t_i the text embedding of a sample i; we obtain a conditional vector x_i by linear interpolation, x_i = a s_i + (1 − a) t_i, where a is sampled uniformly from [0, 1]. We use this randomized interpolation as a data augmentation strategy during training.
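A minimal sketch of this augmentation, assuming the CLIP embeddings of the support image and the phrase have already been computed and share the same dimensionality:

```python
import torch

def sample_conditional_vector(s_i, t_i):
    """Interpolate between image and text embeddings, x_i = a*s_i + (1-a)*t_i.

    s_i: CLIP visual embedding of the (engineered) support image, shape (B, D)
    t_i: CLIP text embedding of the phrase, shape (B, D)
    a is drawn uniformly from [0, 1] per sample during training.
    """
    a = torch.rand(s_i.shape[0], 1, device=s_i.device)
    return a * s_i + (1 - a) * t_i
```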
Figure 2. Architecture of CLIPSeg: We extend a frozen CLIP model (red and blue) with a transformer that segments the query image based on either a support image or a support prompt. N CLIP activations are extracted after blocks defined by S. The segmentation transformer and the projections (both green) are trained on PhraseCut or PhraseCut+.

Figure 3. Different forms of combining an image with the associated object mask to build a visual prompt have a strong effect on CLIP predictions (bar charts). We use the difference in the probability of the target object (orange) in the original image (left column) and the masking methods for our systematic analysis.
3.1. PhraseCut + Visual prompts (PC+)

We use the PhraseCut dataset [20], which encompasses over 340,000 phrases with corresponding image segmentations. Originally, this dataset does not contain visual support but only phrases, and for every phrase a corresponding object exists. We extend this dataset in two ways: with visual support samples and with negative samples. To add visual support images for a prompt p, we randomly draw from the set of all samples S_p which share the prompt p. In case the prompt is unique (|S_p| = 1), we rely only on the text prompt. Additionally, we introduce negative samples to the dataset, i.e. samples in which no object matches the prompt. To this end, the sample's phrase is replaced by a different phrase with a probability q_neg. Phrases are augmented randomly using a set of fixed prefixes (as suggested by the CLIP authors). On the images we apply random cropping under consideration of object locations, making sure the object remains at least partially visible. In the remainder of this paper, we call this extended dataset PhraseCut+ (abbreviated PC+). In contrast to the original PhraseCut dataset, which uses only text to specify the target, PC+ supports training using image-text interpolation. This way, we can train a joint model that operates on text and visual input.
4. Visual Prompt Engineering

In conventional, CNN-based one-shot semantic segmentation, masked pooling [31] has emerged as a standard technique to compute a prototype vector for conditioning: the provided support mask is downsampled, multiplied with a late feature map of the CNN and then pooled along the spatial dimensions. This way, only features that pertain to the support object are considered in the prototype vector. This method cannot be applied directly to transformer-based architectures, as semantic information is also accumulated in the CLS token throughout the hierarchy and not only in the feature maps. Circumventing the CLS token and deriving the conditional vector directly from masked pooling of the feature maps is not possible either, since it would break the compatibility between text embeddings and visual embeddings of CLIP. To learn more about how target information can be incorporated into CLIP, we compare several variants in a simple experiment without segmentation and its confounding effects. We consider the cosine distance (alignment) between visual and text-based embeddings and use the original CLIP weights without any additional training.

Specifically, we use CLIP to compute the text embeddings t_i which correspond to object names in the image. We then compare those to (1) the visual embedding s_o of the original image without modifications and (2) the visual embedding s_h highlighting the target object using a modified RGB image or attention mask (both techniques are described in detail below). By softmax-normalizing the vector of alignments [s_h t_0, s_h t_1, ...] for different highlighting techniques and images, we obtain the distributions shown in Fig. 3. For quantitative scores, we consider only the target object name embedding t_0, which we expect to have a stronger alignment with the highlighted image embedding s_h than with the original image embedding s_o (Fig. 3). This means that if a highlighting technique improves the alignment, the increase in object probability ΔP(object) = s_h t_0 − s_o t_0 should be large. We base this analysis on the LVIS dataset [49], since its images contain multiple objects and a rich set of categories. We sample 1,600 images and mask one target object out of all objects present in each image.
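Reading the alignments as softmax-normalized probabilities over the object names present in the image, the score can be sketched as follows; we take pre-computed CLIP embeddings as inputs and omit CLIP's learned logit scale, so absolute numbers will not match Tab. 2 exactly.

```python
import torch
import torch.nn.functional as F

def delta_p_object(s_o, s_h, text_embs, target_idx=0):
    """Increase in target-object probability due to visual prompt engineering.

    s_o, s_h:  CLIP visual embeddings of the original and the highlighted image, (D,)
    text_embs: CLIP text embeddings of all object names in the image, (K, D)
    Returns P(target | highlighted) - P(target | original), in percentage points.
    """
    s_o, s_h = F.normalize(s_o, dim=0), F.normalize(s_h, dim=0)
    text_embs = F.normalize(text_embs, dim=1)
    p_o = torch.softmax(text_embs @ s_o, dim=0)
    p_h = torch.softmax(text_embs @ s_h, dim=0)
    return 100 * (p_h[target_idx] - p_o[target_idx]).item()
```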
CLIP-Based Masking. The straightforward equivalent to masked pooling in a visual transformer is to apply the mask to the tokens. Normally, a visual transformer consists of a fixed set of tokens which can interact at every layer through multi-head attention: a CLS token used for read-out and image-region-related tokens which were originally obtained from image patches. The mask can now be incorporated by constraining the interaction at one (e.g. the last layer, 11) or more transformer layers to within-mask patch tokens and the CLS token only. Our evaluation (Tab. 2, left) suggests that this form of introducing the mask does not work well: by constraining the interactions of the CLS token (Tab. 2, left, top two rows), only a small improvement is achieved (in the last layer or in all layers), while constraining all interactions decreases performance dramatically. From
this we conclude that more complex strategies are necessary to combine image and mask internally.

Visual Prompt Engineering. Instead of applying the mask inside the model, we can also combine mask and image into a new image, which can then be processed by the visual transformer. Analogously to prompt engineering in NLP (e.g. in GPT-3 [50]), we call this procedure visual prompt engineering. Since this form of prompt design is novel and the best-performing strategies in this context are unknown, we conduct an extensive evaluation of different variants of designing visual prompts (Tab. 2). We find that the exact form in which mask and image are combined matters a lot. Generally, we identify three image operations that improve the alignment between the object text prompts and the images: decreasing the background brightness, blurring the background (using a Gaussian filter) and cropping to the object. The combination of all three performs best (Tab. 2, last row). We will use this variant in the remainder.

| CLIP modification & extras | ΔP(object) |
|---|---|
| CLIP masking CLS in layer 11 | 1.34 |
| CLIP masking CLS in all layers | 1.71 |
| CLIP masking all in all layers | -14.44 |
| dye object red in grayscale image | 1.21 |
| add red object outline | 2.29 |

| background modification | ΔP(object) |
|---|---|
| BG intensity 50% | 3.08 |
| BG intensity 10% | 13.85 |
| BG intensity 0% | 23.40 |
| BG blur | 13.15 |
| BG blur + intensity 10% | 21.73 |

| cropping & combinations | ΔP(object) |
|---|---|
| crop large context | 6.27 |
| crop | 13.60 |
| crop & BG blur | 15.34 |
| crop & BG intensity 10% | 21.73 |
| crop & BG intensity 10% + BG blur | 23.50 |

Table 2. Visual prompt engineering: average improvement of object probability for different forms of combining image and mask over 1,600 samples. Cropping means cutting the image according to the regions specified by the mask; "BG" means background.
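The best-performing combination (last row of Tab. 2) can be sketched as follows. The blur radius, the exact darkening factor and the crop margin are assumptions on our part; the values actually used are given in the paper's supplementary.

```python
import numpy as np
from PIL import Image, ImageFilter

def build_visual_prompt(image, mask, bg_intensity=0.1, blur_radius=10, margin=0.1):
    """Combine a PIL image and a binary object mask (numpy array, 1 = object) into a
    visual prompt: darken and blur the background, then crop to the object."""
    img = np.asarray(image).astype(np.float32) / 255.0                       # (H, W, 3)
    blurred = np.asarray(image.filter(ImageFilter.GaussianBlur(blur_radius))) / 255.0
    m = mask.astype(np.float32)[..., None]
    out = m * img + (1 - m) * bg_intensity * blurred                         # darken + blur BG
    ys, xs = np.where(mask > 0.5)                                            # object bounding box
    h, w = mask.shape
    dy, dx = int(margin * h), int(margin * w)
    y0, y1 = max(ys.min() - dy, 0), min(ys.max() + dy + 1, h)
    x0, x1 = max(xs.min() - dx, 0), min(xs.max() + dx + 1, w)
    return Image.fromarray((out[y0:y1, x0:x1] * 255).astype(np.uint8))
```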
5. Experiments

We first evaluate our model on three established segmentation benchmarks before demonstrating the main contribution of our work: flexible few-shot segmentation that can be based on either text or image prompts.

Metrics. Compared to approaches in zero-shot and one-shot segmentation (e.g. [25, 26]), the vocabulary we use is open, i.e. the set of classes or expressions is not fixed. Therefore, throughout the experiments, our models are trained to generate binary predictions that indicate where objects matching the query are located. If necessary, this binary setting can be transformed into a multi-label setting (as we do in Section 5.2). In segmentation, intersection over union (IoU, also Jaccard score) is a common metric to compare predictions with ground truth. Due to the diversity of the tasks, we employ different forms of IoU: foreground IoU (IoU_FG), which computes IoU on foreground pixels only; mean IoU, which computes the average over foreground IoUs of different classes; and binary IoU (IoU_BIN), which averages over foreground IoU and background IoU. In binary segmentation, IoU requires a threshold t to be specified. While most of the time the natural choice of 0.5 is used, the optimal values can strongly deviate from 0.5 if the probability that an object matches the query differs between training and inference (the a-priori probability of a query matching one or more objects in the scene depends highly on context and dataset). Therefore, we report the performance of one-shot segmentation using thresholds t optimized per task and model. Additionally, we adopt the average precision metric (AP) in all our experiments. Average precision measures the area under the recall-precision curve and thereby captures how well the system can discriminate matches from non-matches, independent of the choice of threshold.
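For reference, the per-sample binary metrics can be computed as in the following straightforward sketch (not the official evaluation code):

```python
import numpy as np

def binary_metrics(pred, target, t=0.5):
    """Foreground IoU and binary IoU for a single sample.

    pred:   predicted foreground probabilities, shape (H, W)
    target: ground-truth binary mask, shape (H, W)
    t:      decision threshold applied to the probabilities
    """
    p, g = pred > t, target > 0.5
    iou_fg = (p & g).sum() / max((p | g).sum(), 1)
    iou_bg = (~p & ~g).sum() / max((~p | ~g).sum(), 1)
    return iou_fg, 0.5 * (iou_fg + iou_bg)   # IoU_FG, IoU_BIN
```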
Models and Baselines. In our experiments, we differentiate between two variants of CLIPSeg: one trained on the original PhraseCut dataset (PC) and one trained on the extended version of PhraseCut, which uses 20% negative samples, contains visual samples (PC+) and uses image-text interpolation (Sec. 3). We call the latter, more robust version the universal model. To put the performance of our models into perspective, we provide two baselines:

• CLIP-Deconv encompasses CLIP but uses a very basic decoder, consisting only of the essential parts: FiLM conditioning [48], a linear projection and a deconvolution. This helps us to estimate to which degree CLIP alone is responsible for the results.

• ViTSeg shares the architecture of CLIPSeg but uses an ImageNet-trained visual transformer as a backbone [51]. For encoding text, we use the same text transformer as CLIP. This way we learn to which degree the specific CLIP weights are crucial for good performance.

We rely on PyTorch [52] for training and use an image size of 352 × 352 pixels throughout our experiments (for details see the appendix).

5.1. Referring Expression Segmentation

We evaluate referring expression segmentation performance (Tab. 3) on the original PhraseCut dataset and compare to scores reported by Wu et al. [20] as well as the concurrently developed transformer-based MDETR method [21]. For this experiment we trained a version of CLIPSeg on the original PhraseCut dataset (CLIPSeg [PC]) using only text labels, in addition to the universal variant which also includes visual samples (CLIPSeg [PC+]).

| Method | t | mIoU | IoU_FG | AP |
|---|---|---|---|---|
| CLIPSeg (PC+) | 0.3 | 43.4 | 54.7 | 76.7 |
| CLIPSeg (PC, D = 128) | 0.3 | 48.2 | 56.5 | 78.2 |
| CLIPSeg (PC) | 0.3 | 46.1 | 56.2 | 78.2 |
| CLIP-Deconv | 0.3 | 37.7 | 49.5 | 71.2 |
| ViTSeg (PC+) | 0.1 | 28.4 | 35.4 | 58.3 |
| ViTSeg (PC) | 0.3 | 38.9 | 51.2 | 74.4 |
| MDETR [21] | | 53.7 | - | - |
| HulaNet [20] | | 41.3 | 50.8 | - |
| Mask-RCNN top [20] | | 39.4 | 47.4 | - |
| RMI [20] | | 21.1 | 42.5 | - |

Table 3. Referring expression segmentation performance on PhraseCut (t refers to the binary threshold).

Our approaches outperform the two-stage HULANet approach by Wu et al. [20]. In particular, a high-capacity decoder (D = 128) seems to be beneficial for PhraseCut. However, the performance is worse than that of MDETR [21], which operates at full image resolution and received two rounds of fine-tuning on PhraseCut. Notably, the ViTSeg baseline performs generally worse than CLIPSeg, which shows that CLIP pre-training is helpful.

5.2. Generalized Zero-Shot Segmentation

| Method | pre-train. | unseen-10 mIoU_S | unseen-10 mIoU_U | unseen-4 mIoU_S | unseen-4 mIoU_U |
|---|---|---|---|---|---|
| CLIPSeg (PC+) | CLIP | 35.7 | 43.1 | 20.8 | 47.3 |
| CLIP-Deconv (PC+) | CLIP | 25.1 | 36.7 | 25.9 | 41.9 |
| ViTSeg (PC+) | IN | 4.2 | 19.0 | 6.0 | 24.8 |
| SPNet [27] | IN | 59.0 | 18.1 | 67.3 | 21.8 |
| ZS3Net [25] | IN-seen | 33.9 | 18.1 | 66.4 | 23.2 |
| CSRL [53] | IN-seen | 59.2 | 21.0 | 69.8 | 31.7 |
| CaGNet [54] | IN | - | - | 69.5 | 40.2 |
| OSR [30] | IN-seen | 72.1 | 33.9 | 75.0 | 44.1 |
| JoEm [28] | IN-seen | 63.4 | 22.5 | 67.0 | 33.4 |

Table 4. Zero-shot segmentation performance on Pascal-VOC with 10 and 4 unseen classes. mIoU_S and mIoU_U indicate performance on seen and unseen classes, respectively. Our model is trained on PhraseCut with the Pascal classes being removed but uses a pre-trained CLIP backbone. IN-seen indicates ImageNet pre-training with unseen classes being removed.

In generalized zero-shot segmentation, test images contain categories that have never been seen before in addition to known categories. We evaluate the model's zero-shot segmentation performance using the established Pascal-VOC benchmark (Tab. 4). It contains five splits involving 2 to 10 unseen classes (we report only 4 and 10 unseen classes). The latter is the most challenging setting as the set of unseen classes is large. Since our model was trained on foreground/background segmentation, we cannot directly use it in a multi-label setting. Therefore, we employ a simple adaptation: our model predicts a binary map independently for each of the 20 Pascal classes. Across all 20 predictions we determine the class with the highest probability for each pixel.
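This adaptation amounts to stacking the per-class binary predictions and taking a pixel-wise argmax; the background handling via a threshold in the sketch below is our own assumption.

```python
import torch

def binary_to_multilabel(per_class_probs, bg_threshold=0.5):
    """Combine independent binary maps into a single label map.

    per_class_probs: (C, H, W) foreground probabilities, one map per Pascal class,
                     obtained by querying the model once per class name.
    Returns an (H, W) label map with 0 = background and 1..C = classes.
    """
    best_prob, best_cls = per_class_probs.max(dim=0)
    labels = best_cls + 1
    labels[best_prob < bg_threshold] = 0    # background where no class is confident
    return labels
```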
We train on PhraseCut+ but remove the unseen Pascal classes from the dataset. This is carried out by assigning the Pascal classes to WordNet synsets [2] and generating a set of invalid words by traversing hyponyms (e.g. different dog breeds for dog). Prompts that contain such a word are removed from the dataset.

The idea of conducting this experiment is to provide a reference for the zero-shot performance of our universal model. It should not be considered as competing in this benchmark, as we use a different training setup (CLIP pre-training, binary segmentation on PhraseCut). The results (Tab. 4) indicate a major gap between seen and unseen classes in models trained on Pascal-VOC, while our models tend to be more balanced. This is due to other models being trained exclusively on the 10 or 16 seen Pascal classes, in contrast to CLIPSeg, which can differentiate many more classes (or phrases). In fact, our model performs better on unseen classes than on seen ones. This difference is likely because the seen classes are generally harder to segment: for the unseen-4 setting, the unseen classes are "airplane", "cow", "motorbike" and "sofa". All of them are large and comparatively distinct objects.

5.3. One-Shot Semantic Segmentation

In one-shot semantic segmentation, a single example image along with a mask is presented to the network. Regions that pertain to the class highlighted in the example image must be found in a query image. Compared to the previous tasks, we cannot rely on a text label but must understand the provided support image. Above (Sec. 4) we identified the best method for visual prompt design, which we use here: cropping out the target object while blurring and darkening the background. To remove classes that overlap with the respective subset of Pascal during training, we use the same method as in the previous section (Sec. 5.2). Unlike in zero-shot segmentation, ImageNet pre-trained backbones are common in one-shot segmentation [37, 40]. PFENet in particular leverages pre-training by using high-level feature similarity as a prior. Similarly, HSNet [55] processes correlated activations of query and support image using 4D convolutions at multiple levels.

On Pascal-5i we find our universal model CLIPSeg (PC+) to achieve competitive performance (Tab. 5) among state-of-the-art methods, with only the very recent HSNet performing better. The results on COCO-20i (Tab. 6) show that CLIPSeg also works well when trained on datasets other than PhraseCut(+). Again, HSNet performs better. To put this in perspective, it should be considered that
HSNet (and PFENet) are explicitly designed for one-shot segmentation, rely on pre-trained CNN activations and cannot handle text by default: Tian et al. [40] extended PFENet to zero-shot segmentation (but used the one-shot protocol) by replacing the visual sample with word vectors [1, 56] of the text labels. In that case, CLIPSeg outperforms their scores by a large margin (Tab. 7).

| Method | t | vis. backb. | mIoU | IoU_BIN | AP |
|---|---|---|---|---|---|
| CLIPSeg (PC+) | 0.3 | ViT (CLIP) | 59.5 | 75.0 | 82.3 |
| CLIPSeg (PC) | 0.3 | ViT (CLIP) | 52.3 | 69.5 | 72.4 |
| CLIP-Deconv (PC+) | 0.2 | ViT (CLIP) | 48.0 | 65.8 | 68.0 |
| ViTSeg (PC+) | 0.2 | ViT (IN) | 39.0 | 59.0 | 62.4 |
| PPNet [39] | | RN50 | 52.8 | 69.2 | - |
| RePRI [57] | | RN50 | 59.7 | - | - |
| PFENet [40] | | RN50 | 60.2 | 73.3 | - |
| HSNet [55] | | RN50 | 64.0 | 76.7 | - |
| PPNet [39] | | RN101 | 55.2 | 70.9 | - |
| RePRI [57] | | RN101 | 59.4 | - | - |
| PFENet [40] | | RN101 | 59.6 | 72.9 | - |
| HSNet [55] | | RN101 | 66.2 | 77.6 | - |

Table 5. One-shot performance on Pascal-5i (CLIPSeg and ViTSeg trained on PhraseCut+).

| Method | t | vis. backb. | mIoU | IoU_BIN | AP |
|---|---|---|---|---|---|
| CLIPSeg (COCO) | 0.1 | ViT (CLIP) | 33.2 | 58.4 | 40.5 |
| CLIPSeg (COCO+N) | 0.1 | ViT (CLIP) | 33.3 | 59.1 | 41.7 |
| CLIP-Deconv (COCO+N) | 0.1 | ViT (CLIP) | 29.8 | 56.8 | 40.8 |
| ViTSeg (COCO) | 0.1 | ViT (IN) | 14.4 | 46.1 | 15.7 |
| PPNet [39] | | RN50 | 29.0 | - | - |
| RePRI [57] | | RN50 | 34.0 | - | - |
| PFENet [40] | | RN50 | 35.8 | - | - |
| HSNet [55] | | RN50 | 39.2 | 68.2 | - |
| HSNet [55] | | RN101 | 41.2 | 69.1 | - |

Table 6. One-shot performance on COCO-20i (CLIPSeg trained on PhraseCut), +N indicates 10% negative samples.

| Pascal-5i | t | vis. backb. | mIoU | IoU_BIN | AP |
|---|---|---|---|---|---|
| CLIPSeg (PC+) | 0.3 | ViT (CLIP) | 72.4 | 83.1 | 93.5 |
| CLIPSeg (PC) | 0.3 | ViT (CLIP) | 70.3 | 81.6 | 84.8 |
| CLIP-Deconv (PC+) | 0.3 | ViT (CLIP) | 63.2 | 77.3 | 85.3 |
| ViTSeg (PC+) | 0.2 | ViT (IN) | 39.0 | 59.0 | 62.4 |
| LSeg [58] | | ViT (CLIP) | 52.3 | 67.0 | - |
| PFENet [40] | | VGG16 | 54.2 | - | - |

Table 7. Zero-shot performance on Pascal-5i. The scores were obtained by following the evaluation protocol of one-shot segmentation but using text input.

5.4. One Model For All: Generalized Prompts

We have shown that CLIPSeg performs well on a variety of academic segmentation benchmarks. Next, we evaluate its performance "in the wild" in unseen situations.

Qualitative Results. In Fig. 4 we show qualitative results divided into two groups: (1, left) affordance-like [59, 60] ("generalized") prompts that are different from the descriptive prompts of PhraseCut and (2, right) prompts that were taken from the PhraseCut test set. For the latter we add challenging extra prompts involving an existing object but the wrong color (indicated in orange). Generalized prompts, which deviate from the PhraseCut training set by referring to actions ("something to ...") or rare object classes ("cutlery"), work surprisingly well given that the model was not trained on such cases. It has learned an intuition of stuff that can be stored away in cupboards, where sitting is possible and what "living creature" means. Rarely, false positives are generated (the bug in the salad is not a cow). Details in the prompt are reflected by the segmentation (blue boxes), and information about the color influences the predicted object probabilities strongly (orange box).

Systematic Analysis. To quantitatively assess the performance for generalized queries, we construct subsets of the LVIS test dataset containing only images of classes that correspond to affordances or attributes. Then we ask our model to segment with these affordances or attributes as prompts. For instance, we compute the foreground intersection over union between armchair, sofa and loveseat objects when "sit on" is used as prompt. A complete list of which affordances or attributes are mapped onto which objects can be found in the appendix. We find (Tab. 8) that the CLIPSeg version trained on PC+ performs better than the CLIP-Deconv baseline and the version trained on LVIS, which contains only object labels instead of complex phrases. This result suggests that both dataset variability and model complexity are necessary for generalization. ViTSeg performs worse, which is expected as it misses the strong CLIP backbone, known for its generalization capabilities.

| Method | Affordances mIoU | Affordances AP | Attributes mIoU | Attributes AP | Meronymy mIoU | Meronymy AP |
|---|---|---|---|---|---|---|
| CLIPSeg (PC+) | 36.9 | 50.5 | 26.6 | 43.0 | 25.7 | 29.0 |
| CLIPSeg (LVIS) | 37.7 | 44.6 | 18.4 | 16.6 | 18.9 | 13.8 |
| CLIP-Deconv | 32.2 | 43.7 | 23.1 | 35.6 | 21.1 | 27.1 |
| ViTSeg (PC+) | 19.2 | 23.5 | 26.8 | 28.0 | 18.4 | 15.9 |

Table 8. Performance for generalized prompts. While the PC+ model has seen prompts during training (colliding prompts with the test set were removed), the LVIS version was trained on object classes only and is able to generalize due to the CLIP backbone. We use the best threshold t for each model.

5.5. Ablation Study

In order to identify crucial factors for the performance of CLIPSeg, we conduct an ablation study on PhraseCut (Tab. 9). We evaluate text-based and visual prompt-based performance (obtained using our modifications on PhraseCut
Figure 4. Qualitative predictions of CLIPSeg (PC+) for various prompts; darkness indicates prediction strength. The generalized prompts (left) deviate from the PhraseCut prompts as they involve action-related properties or new object names.
References

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119, 2013.
[2] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[3] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–1607. PMLR, 2020. URL https://proceedings.mlr.press/v119/chen20j.html.
[6] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
[7] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[8] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[12] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.
[13] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
[14] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. arXiv preprint arXiv:2105.05633, 2021.
[15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision, pages 108–124. Springer, 2016.
[16] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1271–1280, 2017.
[17] Hengcan Shi, Hongliang Li, Fanman Meng, and Q. Wu. Key-word-aware network for referring expression image segmentation. In ECCV, 2018.
[18] Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2018.
[19] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10494–10503, 2019.
[20] Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. PhraseCut: Language-based image segmentation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10216–10225, 2020.
[21] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. MDETR: Modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763, 2021.
[22] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision – ECCV 2020, pages 213–229. Springer, 2020.
[23] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
[24] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.
[25] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick Pérez. Zero-shot semantic segmentation. Advances in Neural Information Processing Systems, 32:468–479, 2019.
[26] Peike Li, Yunchao Wei, and Yi Yang. Consistent structural relation learning for zero-shot segmentation. In NeurIPS, 2020.
[27] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero- and few-label semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[28] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9536–9545, 2021.
[29] Ping Hu, Stan Sclaroff, and Kate Saenko. Uncertainty-aware learning for zero-shot semantic segmentation. In NeurIPS, 2020.
[30] Hui Zhang and Henghui Ding. Prototypical matching and open set rejection for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6974–6983, 2021.
[31] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. In BMVC, 2017.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[34] Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 9587–9595, 2019.
[35] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40, 2018.
[36] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[37] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 9197–9206, 2019.
[38] Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang Ye. Prototype mixture models for few-shot semantic segmentation. In European Conference on Computer Vision (ECCV), 2020.
[39] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In European Conference on Computer Vision, pages 142–158. Springer, 2020.
[40] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. doi: 10.1109/TPAMI.2020.3013717. URL https://doi.org/10.1109/TPAMI.2020.3013717.
[41] Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alexei A. Efros, and Sergey Levine. Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373, 2018.
[42] Claudio Michaelis, Ivan Ustyuzhaninov, Matthias Bethge, and Alexander S. Ecker. One-shot instance segmentation. arXiv preprint arXiv:1811.11507, 2018. URL http://arxiv.org/abs/1811.11507.
[43] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2094, 2021.
[44] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. arXiv preprint arXiv:2109.12098, 2021.
[45] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
[46] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero-shot detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[47] Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei Shu. Zero-shot open set detection by extending CLIP. arXiv preprint arXiv:2109.02748, 2021.
[48] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 2018. doi: 10.23915/distill.00011.
[58] Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RriDjddCLN.
[59] James Jerome Gibson. The Senses Considered as Perceptual Systems. Houghton Mifflin, 1966.