
SegGPT: Segmenting Everything In Context

Xinlong Wang1∗  Xiaosong Zhang1∗  Yue Cao1∗  Wen Wang2  Chunhua Shen2  Tiejun Huang1,3
1 Beijing Academy of Artificial Intelligence  2 Zhejiang University  3 Peking University
∗ Equal contribution. Correspondence to [email protected].
Code & Demo: https://github.com/baaivision/Painter
arXiv:2304.03284v1 [cs.CV] 6 Apr 2023


Figure 1: SegGPT is capable of segmenting everything in context with only one single model, which uses in-context examples
to indicate different tasks. For each sample, the orange box □ on the left displays the example/prompt image and its
corresponding mask, while the blue box □ on the right shows the input image and the resulting mask output. The mask is visualized as the bright region overlaid on the image. The caption for each sample (in the yellow box) is only for explanation.
Notably, SegGPT can perform arbitrary object segmentation (segment different components of the scene, such as the big red
sphere, all the spheres, contour of all spheres, top surfaces, and shadows), multiple part segmentation (specialized parts of the
iconic Statue of Liberty), rainbow segmentation, video object segmentation without videos in training, and close-set semantic
segmentation with learnable prompt tuning. More examples are shown in Figure 5.

Abstract

We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. Our results show strong capabilities in segmenting in-domain and out-of-domain targets, either qualitatively or quantitatively.

1. Introduction

Segmentation is one of the most fundamental problems in computer vision, which aims to localize and re-organize meaningful concepts at the pixel level, e.g., foreground, category, object instance, etc. During recent years, we have witnessed great progress in developing more accurate and faster algorithms for various segmentation tasks, such as foreground segmentation [41], interactive segmentation [51, 34], semantic segmentation [32, 28, 54, 39], instance segmentation [18, 11, 2, 48], and panoptic segmentation [23, 5, 8].

However, these specialist segmentation models are limited to specific tasks, classes, granularities, data types, etc. A new model has to be trained when adapting to a different setting, e.g., to segment a novel concept, or to segment objects in videos instead of images. This requires expensive annotation efforts and is not sustainable for a large number of segmentation tasks.

In this work, we aim to train a single model that is capable of solving diverse and unlimited segmentation tasks. The main challenges are twofold: (1) to incorporate those very different data types in training, e.g., part, semantic, instance, panoptic, person, medical image, aerial image, etc.; (2) to design a generalizable training scheme that differs from conventional multi-task learning, which is flexible on task definition and is capable of handling out-of-domain tasks.

To address these challenges, we present SegGPT, a generalist model for segmenting everything in context. We view segmentation as a general format for visual perception and unify different segmentation tasks into a generalist in-context learning framework [46]. This framework accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with random color mapping for each data sample. The objective is to color the corresponding areas, such as classes, object instances, parts, etc., only according to the context. By using a random coloring scheme, the model is forced to reference contextual information to complete the assigned task, instead of relying on specific colors. This allows for a more flexible and generalizable approach to training. The remaining parts of training are kept the same as [46], using a vanilla ViT [42] and a simple smooth-ℓ1 [17] loss.

After training, SegGPT is able to perform diverse segmentation tasks in images or videos given a few examples via in-context inference, such as object instance, stuff, part, contour, text, etc. To effectively ensemble multiple examples in context, we propose a simple yet effective context ensemble strategy, the feature ensemble, which helps the model benefit from the multi-example prompting setting. Additionally, SegGPT can conveniently serve as a specialist model without updating the model parameters, by tuning a specific prompt for a specialized use case, such as in-domain ADE20K semantic segmentation.

Our main contributions are as follows. (1) For the first time, we demonstrate a single generalist model capable of performing a diverse set of segmentation tasks automatically. (2) We evaluate the pre-trained SegGPT on a broad range of tasks directly, i.e., without fine-tuning, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. (3) Our results show strong capabilities in segmenting in-domain and out-of-domain targets, either qualitatively or quantitatively.

However, this work does not aim to claim new state-of-the-art results or to outperform existing specialist methods across all benchmarks, as we believe that this may not be the responsibility of a general-purpose model.

2. Related Work

2.1. Visual Segmentation

Segmentation is a fundamental problem in computer vision that involves localizing and organizing meaningful concepts at the pixel level. The type of segmentation task varies depending on the definition of the concepts, such as foreground, category, or object instance. For example, semantic segmentation [55] involves pixel-level semantic classification of an image, while instance segmentation [30] aims to identify different object instances and their categories. Video object segmentation [52, 37, 12] is the task of segmenting a particular object throughout the entire video sequence given only the object mask of the first frame.

Previous segmentation methods [32, 28, 54, 39, 18, 11, 2, 48, 23, 5, 8] have been designed specifically for certain tasks and cannot be generalized to switching tasks or changing categories. This paper introduces a general interface that is compatible with all segmentation tasks: with an appropriate training scheme, a single generalist model can achieve good performance on both in-domain and out-of-domain segmentation tasks, either qualitatively or quantitatively.

2.2. Vision Generalist

In recent years, there have been efforts to unify different tasks in the vision domain using Transformer-based models, resulting in several vision generalists [6, 7, 56, 33, 24]. DETR [5] is one of the first to adopt the Transformer [42] as a task-specific head for object detection. The Pix2Seq series [6, 7] defines the output spaces of vision tasks as discrete token sequences and performs object detection, instance segmentation, keypoint estimation, and image captioning in an auto-regressive manner. Unified-IO [33] and OFA [45] perform joint modeling across vision, vision & language, and NLP tasks in a sequence-to-sequence manner, where both the inputs and outputs are defined as sequences of discrete tokens. UViM [24] unifies pixel-labeling tasks such as panoptic segmentation, depth estimation, and colorization, but trains a separate model for each.

Although these works all appear to unify different tasks into similar spaces, they actually accomplish each task through some form of hard indicator, such as a special token, making it difficult to generalize to new tasks. In contrast, this work uses an in-context framework that maintains flexibility on task definition and utilizes a random coloring scheme to prevent the model from collapsing into a multi-task learning solution, instead forcing it to accomplish the assigned task by referring to contextual information. Another difference is the scope of the tasks: this work primarily focuses on a crucial category in visual perception, namely image segmentation.

2.3. In-Context Visual Learning

GPT-3 [3] introduces the concept of in-context learning to deep learning, which allows a series of NLP tasks to be formulated as text completion problems given prompts and examples. In computer vision, [1] first proposes an in-context training framework using inpainting with discrete tokens on figures and infographics from vision articles, demonstrating the framework's capabilities in foreground segmentation, single object detection, and colorization. Painter [46] adopts masked image modeling on continuous pixels to perform in-context training with supervised datasets on seven diverse and challenging vision tasks, achieving highly competitive results on these tasks.

Our work builds upon the Painter framework, but with a specific focus on the segmentation task due to its central role in visual perception. Thus this work unifies diverse segmentation data including semantic segmentation, instance segmentation, part segmentation, and even data for special scenarios like aerial images. Additionally, we design a random coloring scheme that forces the model to reference contextual information to complete the assigned task rather than collapse into a multi-task solution. As segmentation tasks and datasets have less variability than depth/pose estimation, it is easier to share internal structures for effective training on in-domain tasks while maintaining the generalization capability to out-of-domain segmentation tasks.

Figure 2: Illustration of the overall training framework of SegGPT. We incorporate diverse segmentation data, including part, semantic, instance, panoptic, person, medical image, and aerial image segmentation, and transform them into the same format of images. We generate in-context samples that share similar contexts on-the-fly, e.g., the overlapped colors shown in each column, which indicate the same category or the same instance. We adopt the general Painter [46] framework with in-context coloring as the training objective and a random coloring scheme for more flexible and generalizable training.

3. Approach

SegGPT is a special version of the Painter [46] framework which enables segmenting everything with a generalist Painter, hence the name of our model, SegGPT. This training framework redefines the output space of vision tasks as "images" and unifies different tasks into the same image inpainting problem, i.e., randomly masking the task output images and reconstructing the missing pixels. To maintain simplicity and generality, we make no modifications to the architecture and loss function, i.e., a vanilla ViT [13] and a simple smooth-ℓ1 [17] loss, but design a new random coloring scheme for in-context training to achieve better generalization capability.
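For concreteness, the sketch below illustrates what such a masked in-context inpainting objective with a smooth-ℓ1 loss could look like in PyTorch. It is a minimal sketch under stated assumptions: the stitching layout, masking strategy, and the `model` interface are illustrative, not the released SegGPT implementation.

```python
import torch
import torch.nn.functional as F

def in_context_inpainting_loss(model, prompt_img, prompt_tgt, query_img, query_tgt,
                               patch_size=16, mask_ratio=0.75):
    """Illustrative objective: stitch a (prompt, query) pair into one canvas,
    randomly mask patches of the target canvas, and regress the missing pixels
    with a smooth-L1 loss. All images are (B, 3, H, W) tensors."""
    # Stitch inputs and targets vertically so one "image" holds prompt + query.
    inputs  = torch.cat([prompt_img, query_img], dim=2)   # (B, 3, 2H, W)
    targets = torch.cat([prompt_tgt, query_tgt], dim=2)   # (B, 3, 2H, W)

    B, _, H2, W = targets.shape
    n_patches = (H2 // patch_size) * (W // patch_size)

    # Randomly select target patches to mask (these are what the model must paint).
    patch_mask = torch.rand(B, n_patches, device=targets.device) < mask_ratio

    # Assumed model interface: the model sees the input canvas and the partially
    # masked target canvas, and returns a reconstruction of the full target canvas.
    pred = model(inputs, targets, patch_mask)              # (B, 3, 2H, W)

    # Expand the patch-level mask to pixel resolution and average the loss
    # over masked pixels only.
    pix_mask = patch_mask.view(B, 1, H2 // patch_size, W // patch_size).float()
    pix_mask = F.interpolate(pix_mask, scale_factor=patch_size, mode="nearest")
    loss = F.smooth_l1_loss(pred, targets, reduction="none") * pix_mask
    return loss.sum() / (pix_mask.sum() * pred.size(1)).clamp(min=1.0)
```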
3.1. In-Context Coloring

In the traditional framework of Painter, the color space for each task is pre-defined, resulting in the solution collapsing into multi-task learning. For example, for semantic segmentation, a set of colors is pre-defined, and each semantic category is assigned a fixed color. Similarly, in instance segmentation, the color of an instance object is assigned according to its location categories, i.e., the number of colors equals the number of spatial locations, resulting in the model relying only on the color itself to determine the task, rather than using the relationships between segments.

To address this limitation, we propose a random coloring scheme for in-context coloring. We begin by randomly sampling another image that shares a similar context with the input image, such as the same semantic category or object instance. Next, we randomly sample a set of colors from the target image and map each color to a random one. This results in a re-coloring of the corresponding pixels. As a result, we get two pairs of images, which are defined as an in-context pair. In addition, we introduce the mix-context training method, which trains the model using mixed examples. This involves stitching together multiple images with the same color mapping. The resulting image is then randomly cropped and resized to form a mixed-context training sample. By doing so, the model learns to focus on the contextual information of the image rather than just relying on specific color information to determine the task.

Such unification allows us to utilize all segmentation datasets in a consistent way, only varying the data sampling strategy depending on the specific task. We define different contexts according to different data types. For semantic segmentation, we randomly sample the categories. For instance segmentation, object instances are sampled in random numbers. The different views of the same image, e.g., transformed by a set of augmentations, are treated as the images in context. In the implementation, the sampling is all about colors, e.g., the same color refers to either the same category or the same instance.
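As a concrete illustration of the random coloring idea, the sketch below remaps the segment colors of an in-context pair to freshly sampled random colors, applying the same mapping to both targets. The function is a simplified stand-in written for this document, not the released data pipeline.

```python
import torch

def random_color_mapping(tgt_a, tgt_b, background=(0, 0, 0)):
    """Remap segment colors of an in-context pair with one shared random mapping.

    tgt_a, tgt_b: uint8 tensors of shape (3, H, W) holding the prompt target and
    the query target, where each segment (category or instance) is a flat color.
    Returns recolored copies, so the task is defined by the shared context rather
    than by a fixed color-to-class correspondence."""
    def colors_of(t):
        return {tuple(c.tolist()) for c in t.flatten(1).T.unique(dim=0)}

    # Collect the distinct segment colors present in either target (skip background).
    palette = (colors_of(tgt_a) | colors_of(tgt_b)) - {background}

    # Draw one random replacement color per original color.
    mapping = {c: tuple(torch.randint(0, 256, (3,)).tolist()) for c in palette}

    def recolor(t):
        out = t.clone()
        for src, dst in mapping.items():
            m = (t == torch.tensor(src, dtype=t.dtype).view(3, 1, 1)).all(dim=0)
            out[:, m] = torch.tensor(dst, dtype=t.dtype).view(3, 1)
        return out

    return recolor(tgt_a), recolor(tgt_b)
```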
Figure 3: Illustration of our proposed context ensemble strategies for multi-example inference: the spatial ensemble (top) and the feature ensemble (bottom). The spatial ensemble strategy involves stitching multiple example images together and resizing them to the input resolution. The feature ensemble strategy averages features of the query image after each attention layer so that the query image aggregates all the reference examples.

3.2. Context Ensemble

Once the training is finished, its full power can be unleashed during inference. SegGPT enables arbitrary segmentation in context, e.g., with an example of a single image and its target image. The target image can be of a single color (excluding the background) or of multiple colors, e.g., segmenting several categories or objects of interest in one shot. Specifically, given an input image to be tested, we stitch it with the example image and feed it to SegGPT to get the corresponding in-context prediction.

To serve a more accurate and concrete context, multiple examples can be used. For instance, several examples of the same semantic category, or the previous frames in a video, can be employed. To efficiently leverage multiple examples with a SegGPT model, we propose two context ensemble approaches. One is called Spatial Ensemble: multiple examples are concatenated in an n × n grid and then sub-sampled to the same size as a single example. This approach is in line with the intuition of in-context coloring, and the semantic information of multiple examples can be extracted in context with almost no additional cost. The other approach is Feature Ensemble: multiple examples are combined in the batch dimension and computed independently, except that features of the query image are averaged after each attention layer. In this way, the query image gathers information about multiple examples during inference.
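The sketch below shows one way the feature ensemble step could be realized: query-token features are averaged across the example dimension after each attention block and broadcast back to every example. The block structure, token layout, and method names are assumptions made for illustration, not the SegGPT architecture itself.

```python
import torch
import torch.nn as nn

class FeatureEnsembleBlock(nn.Module):
    """Toy transformer block demonstrating the feature-ensemble step.

    Tokens are laid out as (num_examples, L, C): the first part of each sequence
    holds example tokens, the last `query_len` tokens hold identical copies of
    the query tokens for every example."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, query_len):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]

        # Feature ensemble: average the query-token features over the example
        # dimension and broadcast the average back to every example.
        q = x[:, -query_len:, :].mean(dim=0, keepdim=True)            # (1, Lq, C)
        x = torch.cat([x[:, :-query_len, :], q.expand(x.size(0), -1, -1)], dim=1)

        x = x + self.mlp(self.norm2(x))
        return x

# Usage sketch: stack one (example, query) pair per row, then run the blocks.
# tokens = torch.randn(num_examples, example_len + query_len, dim)
# for blk in blocks:
#     tokens = blk(tokens, query_len)
```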
Figure 4: Illustration of in-context tuning on different task specifications. For in-context tuning, we freeze the whole pre-trained model and only optimize the learnable image tensor which serves as the input context. We can perform in-context prompt tuning for specific datasets (ADE20K semantic segmentation), specific scenes (your apartment), and even specific characters (Bert's face).

3.3. In-Context Tuning

SegGPT is capable of adapting to a unique use case without updating the model parameters. We freeze the whole model and initialize a learnable image tensor as the input context. Only this learnable image tensor is updated during training. The rest of the training remains the same, e.g., the same loss function. After the tuning, we take the learned image tensor out and use it as a plug-and-play key for a specific application. For example, given a dataset with a fixed set of object categories, e.g., ADE20K, we could train a customized prompt for this dataset without harming the generality of the model. Or, we could optimize a prompt image for a specific scene, e.g., your apartment, or a specific character, e.g., Bert's face. This opens up opportunities for a broad range of applications.
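A minimal sketch of such an in-context tuning loop is given below, reusing the illustrative inpainting loss sketched in Section 3. The zero initialization, learning rate, and the assumption that the dataloader yields (image, target) pairs are illustrative choices, not the released training recipe.

```python
import torch

def tune_context_prompt(model, dataloader, steps=1000, lr=1e-2, image_size=(3, 448, 448)):
    """Illustrative in-context tuning: freeze the model and optimize only a
    learnable prompt pair (image and target tensors) with the same
    masked-inpainting loss used for training."""
    for p in model.parameters():
        p.requires_grad_(False)

    # Learnable "image" serving as the in-context example (prompt image + prompt target).
    prompt_img = torch.zeros(1, *image_size, requires_grad=True)
    prompt_tgt = torch.zeros(1, *image_size, requires_grad=True)
    optimizer = torch.optim.AdamW([prompt_img, prompt_tgt], lr=lr)

    data_iter = iter(dataloader)
    for _ in range(steps):
        try:
            query_img, query_tgt = next(data_iter)
        except StopIteration:
            data_iter = iter(dataloader)
            query_img, query_tgt = next(data_iter)

        # Same loss as in training; only the prompt tensors receive gradients.
        loss = in_context_inpainting_loss(
            model,
            prompt_img.expand(query_img.size(0), -1, -1, -1),
            prompt_tgt.expand(query_img.size(0), -1, -1, -1),
            query_img, query_tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The tuned tensors can be stitched with any test image as a plug-and-play prompt.
    return prompt_img.detach(), prompt_tgt.detach()
```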
Figure 5: More visualizations. For each sample, the orange box □ on the left displays the example/prompt image and its corresponding mask, while the blue box □ on the right shows the input image and the resulting mask output. The mask is visualized via the bright region attached to the image. SegGPT can perform arbitrary object/part segmentation (cubes, yellow cubes, Ernie, one of the Twelve Apostles, Earth, multiple arbitrary parts), video object segmentation without videos in training, and close-set instance segmentation on COCO with learnable prompt tuning.

4. Experiment

4.1. Training Data

Our approach uses a diverse set of segmentation datasets, including part, semantic, instance, panoptic, person, retinal-vessel, and aerial-image segmentation. Unlike previous methods that relied on handcrafted label merging to combine different types of segmentation datasets, our method offers a unified perspective that eliminates the need for additional effort or adjustment on the datasets. In particular, our approach does not require any modifications to either the architecture or the training pipeline when adding an extra dataset. The datasets are described below; a sketch of how such a mixture might be sampled follows the list.

ADE20K [55] provides segmentation labels for 150 semantic categories, with a total of 25K images, including 20K training images, 2K validation images, and 3K testing images.

COCO [30] is a widely used visual perception dataset that supports instance segmentation, semantic segmentation, and panoptic segmentation. It contains 118K training images and 5K validation images, with 80 "things" and 53 "stuff" categories.

PASCAL VOC [14] is a classic object recognition dataset. We use the augmented segmentation version, which provides annotations of 20 categories on 10582 training images.

Cityscapes [10] focuses on scene understanding of street views. We use the 2954 training images with semantic segmentation annotations of 19 categories.

LIP [26] focuses on the semantic understanding of the person. We use the 30385 training images with segmentation labels of 19 human part categories.

PACO [38] is a newly released dataset that provides annotations for the parts and attributes of common objects. We process and use the 41807 training images with part annotations.

CHASE DB1 [16], DRIVE [40], HRF [4] and STARE [20] provide annotations for retinal vessel segmentation. We augment the high-resolution raw images with random cropping.

iSAID [49] and loveDA [44] focus on semantic understanding of aerial images, with 23262 and 2520 training images for 15 and 6 semantic categories, respectively.
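As a small illustration of the dataset mixture, the snippet below samples a data source per training example using the per-dataset weights reported in Appendix A. The source names and the grouping of sources are assumptions for this sketch; the actual loaders are whatever pipeline is in use.

```python
import random

# Sampling weights reported in Appendix A (one entry per data source).
DATASET_WEIGHTS = {
    "coco_instance": 0.22, "ade20k_semantic": 0.15, "coco_panoptic_semantic": 0.15,
    "cityscapes_semantic": 0.07, "coco_stuff_semantic": 0.07, "lip_person_semantic": 0.07,
    "voc_semantic": 0.07, "paco_semantic": 0.07, "aerial_semantic": 0.06,
    "retinal_vessel": 0.06,
}

def sample_training_source(rng=random):
    """Pick one data source per training sample according to the mixture weights."""
    names, weights = zip(*DATASET_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]
```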
4.2. One-Shot Training Details

Our approach for segmentation tasks utilizes a general interface; we emphasize that we train only one generalist model with a mixture of datasets and evaluate this model on diverse benchmarks. Following [46], we use a Vision Transformer (ViT-L) encoder [13], which has 307M parameters. We use a pre-trained checkpoint from [46] as the initialization. We employ an AdamW optimizer [22] and a cosine learning rate scheduler, with a base learning rate of 1e−4. Weight decay is set to 0.05. The batch size is 2048. We train for 9K iterations, with a warm-up period of 1.8K iterations. We use a set of data augmentations including random resize cropping, color jittering, and random horizontal flipping. The size of a single input image is 448 × 448.
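For reference, a generic optimizer/schedule setup matching the hyper-parameters above could look like the following; it is a sketch of a standard AdamW-plus-cosine recipe, not the exact released training code.

```python
import math
import torch

def build_optimizer_and_schedule(model, base_lr=1e-4, weight_decay=0.05,
                                 total_iters=9000, warmup_iters=1800):
    """AdamW with linear warm-up followed by cosine decay, using the
    hyper-parameters listed in Section 4.2."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_iters:                                  # linear warm-up
            return step / max(1, warmup_iters)
        progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * progress))        # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```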
4.3. Qualitative Results

To demonstrate the capability of our SegGPT from an intuitive perspective, we visualize the task outputs of selected images with the specialized task prompts, shown in Figure 1 and Figure 5. These two figures include a wide range of segmentation tasks, such as arbitrary part/object segmentation with varied granularities, text segmentation, video object segmentation without videos in training, and close-set instance/semantic segmentation with learnable prompt tuning. Figure 6 presents more visualizations of video object segmentation on the YouTube-VOS 2018 dataset. From these visualizations, SegGPT demonstrates the ability to make highly accurate predictions across a wide range of tasks, while maintaining great flexibility in the task definition.

Figure 6: Qualitative results of video object segmentation on YouTube-VOS 2018.

4.4. Comparison with Specialist Methods

Few-shot semantic segmentation. We evaluate the performance of SegGPT on two settings of few-shot semantic segmentation: in-domain on COCO-20i/PASCAL-5i, and out-of-domain on FSS-1000. Table 1 shows the results of example-based semantic segmentation on COCO-20i/PASCAL-5i. For a fair comparison, we also evaluate specialist models on in-domain categories, marked by *. Our results indicate that SegGPT can achieve comparable or significantly better performance than recently published state-of-the-art specialist models on these two benchmarks. Note that the prior art FPTrans trains separate models for different shots. Furthermore, SegGPT surpasses the generalist Painter [46] by a considerable margin.

                                COCO-20i              PASCAL-5i
method           venue       one-shot  few-shot    one-shot  few-shot
specialist model
HSNet [35]       ICCV'21       41.2      49.5        66.2      70.4
HSNet*           ICCV'21       41.7      50.7        68.7      73.8
VAT [19]         ECCV'22       41.3      47.9        67.9      72.0
VAT*             ECCV'22       42.9      49.4        72.4      76.3
FPTrans [53]     NeurIPS'22    47.0      58.9        68.8      78.0
FPTrans*         NeurIPS'22    56.5      65.5        77.7      83.2
generalist model
Painter          CVPR'23       32.8      32.6        64.5      64.6
SegGPT           this work     56.1      67.9        83.2      89.8

Table 1: Quantitative results on COCO-20i and PASCAL-5i of example-based semantic segmentation. * indicates that the categories in training cover the categories in testing.

Table 2 presents the results of few-shot semantic segmentation on FSS-1000 with out-of-domain categories. Compared to specialist models trained on FSS-1000, SegGPT exhibits highly competitive performance. Notably, our model is not trained on the FSS-1000 dataset at all, yet it still achieves remarkable results, demonstrating its effectiveness.

                                   mIoU
method           venue       one-shot  few-shot
trained on FSS-1000
DAN [43]         ECCV'20       85.2      88.1
HSNet [35]       ICCV'21       86.5      88.5
SSP [15]         ECCV'22       87.3      88.6
VAT [19]         ECCV'22       90.3      90.8
DACM [50]        ECCV'22       90.8      91.7
not trained on FSS-1000
Painter          CVPR'23       61.7      62.3
SegGPT           this work     85.6      89.3

Table 2: Quantitative results of few-shot semantic segmentation on FSS-1000. SegGPT achieves remarkable results although not trained on FSS-1000.

Video object segmentation. Video object segmentation (VOS) is the task of segmenting a particular object in video frames. In this work, we focus on the semi-supervised VOS setting and evaluate our proposed method, SegGPT, on the validation splits of three datasets: YouTube-VOS 2018 [52], DAVIS 2017 [37], and the recently released challenging benchmark MOSE [12]. We use two metrics commonly used in VOS for evaluation: the J score and the F score, and we evaluate our results with official evaluation servers or tools.

SegGPT performs video object segmentation by converting the first frame and its object mask into in-context coloring examples. When testing the current frame, we use its previous K frames (if available) to construct multiple examples. Object masks for these frames have been predicted and stored in a FIFO queue. After the multiple examples are constructed, Feature Ensemble (described in Section 3.2) is applied, and the prediction result is stored for the next frame. We evaluate our model on several benchmarks, and the results are presented in Table 3. Despite not being specifically trained for the task, our approach achieves results competitive with specialist models trained on these datasets. For instance, on YouTube-VOS 2018 [52], our method outperforms the task-specific approaches AGAME [21] and AGSS [29] by clear margins. On the challenging MOSE benchmark, which focuses on complex scenes, SegGPT even performs comparably with the state-of-the-art method RDE.
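The frame-by-frame propagation loop described above can be sketched as follows. The `model.predict` interface, the handling of the first-frame annotation, and the default of eight context frames are assumptions for illustration rather than the released inference code.

```python
from collections import deque

def segment_video(model, frames, first_frame_mask, k=8):
    """Semi-supervised VOS sketch: keep the last k (frame, predicted mask) pairs
    in a FIFO queue and use them as in-context examples, combined with the
    feature-ensemble strategy, to predict the mask of the current frame."""
    examples = deque(maxlen=k)                      # FIFO queue of in-context examples
    examples.append((frames[0], first_frame_mask))  # the given first-frame annotation

    predictions = [first_frame_mask]
    for frame in frames[1:]:
        # Assumed interface: list of (image, mask) examples + query frame -> query mask.
        mask = model.predict(list(examples), frame, ensemble="feature")
        predictions.append(mask)
        examples.append((frame, mask))              # store the prediction for later frames
    return predictions
```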
                            YouTube-VOS 2018 [52]           DAVIS 2017 [37]     MOSE [12]
method          venue       G     Js    Fs    Ju    Fu      J&F   J     F       J&F   J     F
with video data
AGAME [21]      CVPR'19     66.0  66.9  -     61.2  -       70.0  67.2  72.7    -     -     -
AGSS [29]       ICCV'19     71.3  71.3  65.5  75.2  73.1    67.4  64.9  69.9    -     -     -
STM [36]        ICCV'19     79.4  79.7  84.2  72.8  80.9    81.8  79.2  84.3    -     -     -
AFB-URR [27]    NeurIPS'20  79.6  78.8  83.1  74.1  82.6    74.6  73.0  76.1    -     -     -
RDE [25]        CVPR'22     83.3  81.9  86.3  78.0  86.9    86.1  82.1  90.0    48.8  44.6  52.9
SWEM [31]       CVPR'22     82.8  82.4  86.9  77.1  85.0    84.3  81.2  87.4    50.9  46.8  54.9
XMem [9]        ECCV'22     86.1  85.1  89.8  80.3  89.2    87.7  84.0  91.4    57.6  53.3  62.0
without video data
Painter         CVPR'23     24.1  27.6  35.8  14.3  18.7    34.6  28.5  40.8    14.5  10.4  18.5
SegGPT          this work   74.7  75.1  80.2  67.4  75.9    75.6  72.5  78.6    45.1  42.2  48.0

Table 3: Quantitative results of video object segmentation on YouTube-VOS 2018, DAVIS 2017, and MOSE. Notably, Painter and SegGPT do not use any video data in training. G is the average score over "seen" and "unseen" classes in YouTube-VOS 2018.

4.5. Ablation Study

Here we ablate the two context ensemble strategies, namely the spatial ensemble and the feature ensemble. Results are shown in Table 4a. Our findings reveal that the spatial ensemble approach performs well on the FSS-1000 dataset but experiences a performance drop on DAVIS 2017. We attribute this to the fact that the spatial ensemble applies sub-sampling to the examples. Notably, the FSS-1000 dataset has a lower image resolution (224×224) compared to the high-resolution DAVIS dataset (640×480), and therefore sub-sampling does not result in significant information loss for FSS-1000. In contrast, we observe that the feature ensemble reduces this information loss from sub-sampling and achieves significantly better performance on DAVIS 2017.

We also ablate the number of frames on DAVIS 2017, as shown in Table 4b. As the number of frames increases, the performance initially improves before reaching a point of diminishing returns. In particular, we observe that the optimal performance is achieved when using 8 frames.

(a)
                            DAVIS 2017             FSS-1000
examples   ensemble      J&F    J     F         mIoU   FB-IoU
1          -             70.0   66.4  73.7      85.5   90.8
4          Spatial       61.9   58.0  65.8      89.3   93.5
4          Feature       74.7   71.6  77.7      87.8   92.4
8          Feature       75.6   72.5  78.6      89.8   93.8

(b) DAVIS 2017
frames     1      4      8      12     16
J&F        70.0   74.7   75.6   74.8   74.6
J          66.4   71.6   72.5   71.6   71.4
F          73.7   77.7   78.6   77.9   77.8

Table 4: Ablation study on the ensemble strategy (a) and the number of frames (b) in in-context inference. The spatial ensemble approach performs well on the FSS-1000 dataset but experiences a performance drop on DAVIS 2017. The feature ensemble achieves better results due to no sub-sampling.

4.6. In-Context Tuning

In-context tuning enables customizing a unique application with a set of data samples, for example, tuning a prompt for a specific dataset, scene, or even a person. Specifically, we define the task prompt as learnable tensors, freeze the whole model, and then use the same training loss to optimize the task prompts. Here, we conduct in-context tuning on the challenging ADE20K semantic segmentation and COCO panoptic segmentation tasks, and evaluate SegGPT with learnable prompts on the corresponding benchmarks.

Results on ADE20K semantic segmentation are shown in Table 5. Our model SegGPT achieves competitive performance with specialist models like RefineNet. However, compared to the generalist Painter, our approach shows a 10.3 point drop in mIoU. This observation can be explained by the introduction of the random coloring scheme, which makes it more challenging for the model to use color as a simple indicator of in-domain tasks. Instead, the model needs to rely on context examples to determine the task, making optimization much more difficult. Similarly, Table 6 shows the results of our SegGPT model on COCO panoptic segmentation. Here, we again observe a 9.0 point drop in PQ compared to the generalist Painter. Outperforming all specialist methods on specific benchmarks is not the purpose of this work, and we believe there is much room for improvement in the future.

method            venue       mIoU
specialist model
FCN [32]          CVPR'15     29.4
RefineNet [28]    CVPR'17     40.7
DPT [39]          ICCV'21     49.2
Mask2Former [8]   CVPR'22     57.7
generalist model
Painter           CVPR'23     49.9
SegGPT            this work   39.6

Table 5: Results on ADE20K semantic segmentation.

method             venue        PQ
specialist model
PanopticFPN [23]   CVPR'19      40.3
SOLOv2 [47]        NeurIPS'20   42.1
Mask2Former [8]    CVPR'22      57.8
UViM [24]          NeurIPS'22   45.8
generalist model
Painter            CVPR'23      43.4
SegGPT             this work    34.4

Table 6: Results on COCO panoptic segmentation.

5. Discussion and Conclusion

In this work, we present a generalist segmentation model, showing how to design an appropriate training strategy to fully leverage the flexibility of in-context visual learning. Our model exhibits strong capabilities in handling both in-domain and out-of-domain segmentation tasks, including object instance, stuff, part, contour, text segmentation, etc.

This work is not without drawbacks. While our work introduces a new random coloring regime for better generalization capability of in-context training, it also makes the training task inherently more difficult, which may be the reason for the inferior performance on in-domain tasks with ample training data, such as semantic segmentation on ADE20K and panoptic segmentation on COCO.

Looking forward, we believe that our approach has the potential to serve as a powerful tool for enabling more diverse applications in image/video segmentation, by leveraging the flexibility in task definition with in-context inference. Scaling up the model size is one avenue that we plan to pursue to further improve performance. With larger models, more complex patterns in the data can be captured, which may lead to better segmentation results. However, this comes with the challenge of finding more data. One potential solution is to explore self-supervised learning techniques. We hope that our work will inspire the community to continue exploring the potential of in-context learning in computer vision. We remain optimistic that the best GPT-3 moment in the vision field is yet to come.

Acknowledgement

This project is supported by the National Key R&D Program of China (2022ZD0116302). We would like to thank Yemin Shi and Teng Dai for their help on the demo, Hanxiao Qu, Yan Tian, and Xigang Cao for the help on GPU resources, as well as other colleagues at Beijing Academy of Artificial Intelligence for support throughout this project.
References

[1] Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei A. Efros. Visual prompting via image inpainting. Adv. Neural Inform. Process. Syst., pages 1–24, 2022. 3
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. In Int. Conf. Comput. Vis., 2019. 2
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. 3
[4] Attila Budai, Rüdiger Bock, Andreas Maier, Joachim Hornegger, and Georg Michelson. Robust vessel segmentation in fundus images. International Journal of Biomedical Imaging, 2013, 2013. 5
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Eur. Conf. Comput. Vis., pages 213–229, 2020. 2
[6] Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. Int. Conf. Learn. Representations, pages 1–17, 2021. 2
[7] Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, and Geoffrey Hinton. A unified sequence interface for vision tasks. Adv. Neural Inform. Process. Syst., 2022. 2
[8] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. arXiv preprint arXiv:2112.01527, 2021. 2, 8
[9] Ho Kei Cheng and Alexander G Schwing. XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 640–658. Springer, 2022. 8
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016. 5
[11] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv:1708.02551, 2017. 2
[12] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. MOSE: A new dataset for video object segmentation in complex scenes. arXiv preprint arXiv:2302.01872, 2023. 2, 7, 8
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. Learn. Representations, 2021. 3, 5
[14] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88:303–308, 2009. 5
[15] Qi Fan, Wenjie Pei, Yu-Wing Tai, and Chi-Keung Tang. Self-support few-shot semantic segmentation. In Eur. Conf. Comput. Vis., pages 701–719. Springer, 2022. 7
[16] Muhammad Moazam Fraz, Paolo Remagnino, Andreas Hoppe, Bunyarit Uyyanonvara, Alicja R Rudnicka, Christopher G Owen, and Sarah A Barman. An ensemble classification-based approach applied to retinal blood vessel segmentation. IEEE Transactions on Biomedical Engineering, 59(9):2538–2548, 2012. 5
[17] Ross Girshick. Fast R-CNN. In IEEE Conf. Comput. Vis. Pattern Recog., 2015. 2, 3
[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Int. Conf. Comput. Vis., 2017. 2
[19] Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, and Seungryong Kim. Cost aggregation with 4D convolutional swin transformer for few-shot segmentation. In European Conference on Computer Vision, pages 108–126. Springer, 2022. 6, 7
[20] AD Hoover, Valentina Kouznetsova, and Michael Goldbaum. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical Imaging, 19(3):203–210, 2000. 5
[21] Joakim Johnander, Martin Danelljan, Emil Brissman, Fahad Shahbaz Khan, and Michael Felsberg. A generative appearance model for end-to-end video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8953–8962, 2019. 7, 8
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6
[23] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6399–6408, 2019. 2, 8
[24] Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, and Neil Houlsby. UViM: A unified modeling approach for vision with learned guiding codes. Adv. Neural Inform. Process. Syst., 2022. 2, 8
[25] Mingxing Li, Li Hu, Zhiwei Xiong, Bang Zhang, Pan Pan, and Dong Liu. Recurrent dynamic embedding for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1332–1341, 2022. 8
[26] Xiaodan Liang, Ke Gong, Xiaohui Shen, and Liang Lin. Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Trans. Pattern Analysis and Machine Intelligence, 41(4):871–885, 2018. 5
[27] Yongqing Liang, Xin Li, Navid Jafari, and Jim Chen. Video object segmentation with adaptive feature bank and uncertain-region refinement. Advances in Neural Information Processing Systems, 33:3430–3441, 2020. 8
[28] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1925–1934, 2017. 2, 8
[29] Huaijia Lin, Xiaojuan Qi, and Jiaya Jia. AGSS-VOS: Attention guided single-shot video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3949–3957, 2019. 7, 8
[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and Lawrence Zitnick. Microsoft COCO: Common objects in context. In Eur. Conf. Comput. Vis., pages 740–755, 2014. 2, 5
[31] Zhihui Lin, Tianyu Yang, Maomao Li, Ziyu Wang, Chun Yuan, Wenhao Jiang, and Wei Liu. SWEM: Towards real-time video object segmentation with sequential weighted expectation-maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1362–1372, 2022. 8
[32] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3431–3440, 2015. 2, 8
[33] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, pages 1–19, 2022. 2
[34] Sabarinath Mahadevan, Paul Voigtlaender, and Bastian Leibe. Iteratively trained interactive segmentation. In British Machine Vision Conference (BMVC), 2018. 2
[35] Juhong Min, Dahyun Kang, and Minsu Cho. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 6, 7
[36] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9226–9235, 2019. 8
[37] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017. 2, 7, 8
[38] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. PACO: Parts and attributes of common objects. arXiv preprint arXiv:2301.01795, 2023. 5
[39] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12179–12188, 2021. 2, 8
[40] Joes Staal, Michael D Abràmoff, Meindert Niemeijer, Max A Viergever, and Bram Van Ginneken. Ridge-based vessel segmentation in color images of the retina. IEEE Transactions on Medical Imaging, 23(4):501–509, 2004. 5
[41] Chris Stauffer and W Eric L Grimson. Adaptive background mixture models for real-time tracking. In Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), volume 2, pages 246–252. IEEE, 1999. 2
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Adv. Neural Inform. Process. Syst., 30, 2017. 2
[43] Haochen Wang, Xudong Zhang, Yutao Hu, Yandan Yang, Xianbin Cao, and Xiantong Zhen. Few-shot semantic segmentation with democratic attention networks. In Eur. Conf. Comput. Vis., pages 730–746. Springer, 2020. 7
[44] Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv preprint arXiv:2110.08733, 2021. 5
[45] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. Int. Conf. Mach. Learn., 2022. 2
[46] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In IEEE Conf. Comput. Vis. Pattern Recog., 2023. 2, 3, 5, 7
[47] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic and fast instance segmentation. Adv. Neural Inform. Process. Syst., 33:17721–17732, 2020. 8
[48] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. SOLO: A simple framework for instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2021. 2
[49] Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. iSAID: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 28–37, 2019. 5
[50] Zhitong Xiong, Haopeng Li, and Xiao Xiang Zhu. Doubly deformable aggregation of covariance matrices for few-shot segmentation. In Eur. Conf. Comput. Vis., pages 133–150, 2022. 7
[51] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 373–381, 2016. 2
[52] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018. 2, 7, 8
[53] Jian-Wei Zhang, Yifan Sun, Yi Yang, and Wei Chen. Feature-proxy transformer for few-shot segmentation. In Advances in Neural Information Processing Systems, 2022. 6
[54] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. Comput. Vis. Pattern Recog., 2017. 2
[55] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. Int. J. Computer Vision, 2018. 2, 5
[56] Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, and Jifeng Dai. Uni-Perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. arXiv preprint arXiv:2112.01522, pages 16783–16794, 2022. 2
Appendix

A. Additional Implementation Details

Training. We use various segmentation datasets during training. The sampling weight for each dataset is 0.22 (COCO instance), 0.15 (ADE20K semantic), 0.15 (COCO panoptic semantic), 0.07 (Cityscapes semantic), 0.07 (COCO stuff semantic), 0.07 (LIP person semantic), 0.07 (PASCAL VOC semantic), 0.07 (PACO semantic), 0.06 (iSAID and loveDA aerial semantic), and 0.06 (CHASE DB1, DRIVE, HRF and STARE retinal vessel). For semantic segmentation data, we use a probability of 0.5 of taking a transformation of the input image as the in-context example, and then conduct random color selection. For instance segmentation data, the probability is 1.0, i.e., we always use two transformed views of the same image as the in-context pair. Almost all segmentation sub-tasks can be grouped into two types, i.e., segmenting a category or an instance (not limited to objects). To avoid ambiguity between category and instance, we initialize two learnable embeddings which are associated with the category-level and instance-level coloring tasks, respectively.

Evaluation. For quantitative evaluation on the existing benchmarks, the examples are either from the support samples, the training set, the first frame in a video, or a learned prompt. Take ADE20K semantic segmentation as an example. Given a tuned prompt, we directly stitch the prompt with each test image to obtain the predictions. Without the tuned prompt, for each category, we randomly sample several images from the training set which contain that category. These examples are used together via context ensemble to obtain the predictions for this category across all test images.

B. Additional Results

ADE20K semantic segmentation. In Table S1, we provide the example-based semantic segmentation results on ADE20K. Different from the in-context tuning, we only randomly select several samples from the training set as examples, and use Feature Ensemble to ensemble the examples. Specifically, for each category, we randomly sample without replacement from all images containing that category. Since the selection of the examples can affect performance, we sample with different random seeds {1000, 2000, 3000, 4000} and report the best results. We can see that more examples significantly boost the performance, e.g., +13.1% mIoU from 1 to 16 examples, although there is still a gap with the tuned prompt. These experiments inspire us to explore in the future what makes good examples and how many examples we need to approach the results of in-context tuning.

examples   mIoU   mAcc
1          18.8   27.4
2          25.0   34.4
4          28.3   37.7
8          30.1   38.9
16         31.9   40.4
32         33.0   42.0
tuned      39.6   50.7

Table S1: Example-based results on ADE20K semantic segmentation. More examples boost the performance.

Context ensemble. Here we qualitatively demonstrate the effectiveness of our context ensemble approach in Figure S1. Given a video clip and its first annotated frame, it is difficult to distinguish the instances in a crowd when using only the first frame as an example. With the context ensemble of several previous frames and their pseudo-labels, SegGPT segments each object successfully.

Figure S1: Context ensemble helps segment objects across frames. (a) Incorrect predictions for objects in a crowd when only the first frame is used as the example. (b) Correct predictions using Feature Ensemble with previous frames.

Visualizations. We provide more visualizations in Figure S2, including semantic segmentation on ADE20K, instance segmentation on COCO, and arbitrary segmentation in the wild.
(a) Semantic segmentation on ADE20K

(b) Instance segmentation on COCO

(c) Arbitrary segmentation in the wild

Figure S2: More examples of SegGPT applications. Each test image and the corresponding predicted segmentation are
combined for better visualization. For (c), the orange box □ on the left displays the example/prompt image and its corresponding
mask, while the blue box □ on the right shows the input image and the resulting mask output.