0% found this document useful (0 votes)

69 views30 pages

Segment Anything: Meta AI Research, FAIR

1. The document introduces the Segment Anything (SA) project which aims to build a foundation model for image segmentation. It consists of a segmentation task, a Segment Anything Model (SAM), and a dataset of over 1 billion masks called SA-1B. 2. SAM is designed to be "promptable" meaning it can transfer to new image distributions and tasks using text prompts without additional training. Its zero-shot performance is often competitive with or better than prior fully supervised models. 3. The authors collected SA-1B, the largest segmentation dataset to date, using SAM in a data collection loop. SA-1B contains over 1 billion masks on 11 million licensed images and was collected in a

Uploaded by

Console Boy

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

69 views30 pages

Segment Anything: Meta AI Research, FAIR

Uploaded by

Console Boy

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

Segment Anything

Alexander Kirillov1,2,4 Eric Mintun2 Nikhila Ravi1,2 Hanzi Mao2 Chloe Rolland3 Laura Gustafson3
Tete Xiao3 Spencer Whitehead Alexander C. Berg Wan-Yen Lo Piotr Dollár4 Ross Girshick4
1 2 3 4
project lead joint first author equal contribution directional lead
Meta AI Research, FAIR

valid mask valid mask annotate

arXiv:2304.02643v1 [cs.CV] 5 Apr 2023

lightweight mask decoder model data

train
model
image Segment Anything 1B (SA-1B):
encoder
• 1+ billion masks
prompt
cat with • 11 million images
encoder
black ears • privacy respecting
• licensed images
segmentation prompt image prompt image

(a) Task: promptable segmentation (b) Model: Segment Anything Model (SAM) (c) Data: data engine (top) & dataset (bottom)
Figure 1: We aim to build a foundation model for segmentation by introducing three interconnected components: a prompt-
able segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range
of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.

Abstract matching in some cases) fine-tuned models [10, 21]. Empir-

We introduce the Segment Anything (SA) project: a new ical trends show this behavior improving with model scale,
task, model, and dataset for image segmentation. Using our dataset size, and total training compute [56, 10, 21, 51].
efficient model in a data collection loop, we built the largest Foundation models have also been explored in computer
segmentation dataset to date (by far), with over 1 billion vision, albeit to a lesser extent. Perhaps the most promi-
masks on 11M licensed and privacy respecting images. The nent illustration aligns paired text and images from the web.
model is designed and trained to be promptable, so it can For example, CLIP [82] and ALIGN [55] use contrastive
transfer zero-shot to new image distributions and tasks. We learning to train text and image encoders that align the two
evaluate its capabilities on numerous tasks and find that modalities. Once trained, engineered text prompts enable
its zero-shot performance is impressive – often competitive zero-shot generalization to novel visual concepts and data
with or even superior to prior fully supervised results. We distributions. Such encoders also compose effectively with
are releasing the Segment Anything Model (SAM) and cor- other modules to enable downstream tasks, such as image
responding dataset (SA-1B) of 1B masks and 11M images at generation (e.g., DALL·E [83]). While much progress has
https://segment-anything.com to foster research into foun- been made on vision and language encoders, computer vi-
dation models for computer vision. sion includes a wide range of problems beyond this scope,
and for many of these, abundant training data does not exist.
In this work, our goal is to build a foundation model for
1. Introduction image segmentation. That is, we seek to develop a prompt-
able model and pre-train it on a broad dataset using a task
Large language models pre-trained on web-scale datasets
that enables powerful generalization. With this model, we
are revolutionizing NLP with strong zero-shot and few-shot
aim to solve a range of downstream segmentation problems
generalization [10]. These “foundation models” [8] can
on new data distributions using prompt engineering.
generalize to tasks and data distributions beyond those seen
The success of this plan hinges on three components:
during training. This capability is often implemented with
task, model, and data. To develop them, we address the
prompt engineering in which hand-crafted text is used to
following questions about image segmentation:
prompt the language model to generate a valid textual re-
sponse for the task at hand. When scaled and trained with 1. What task will enable zero-shot generalization?
abundant text corpora from the web, these models’ zero and 2. What is the corresponding model architecture?
few-shot performance compares surprisingly well to (even 3. What data can power this task and model?

1
These questions are entangled and require a comprehen- Data engine (§4). To achieve strong generalization to new
sive solution. We start by defining a promptable segmenta- data distributions, we found it necessary to train SAM on
tion task that is general enough to provide a powerful pre- a large and diverse set of masks, beyond any segmenta-
training objective and to enable a wide range of downstream tion dataset that already exists. While a typical approach
applications. This task requires a model that supports flex- for foundation models is to obtain data online [82], masks
ible prompting and can output segmentation masks in real- are not naturally abundant and thus we need an alternative
time when prompted to allow for interactive use. To train strategy. Our solution is to build a “data engine”, i.e., we
our model, we need a diverse, large-scale source of data. co-develop our model with model-in-the-loop dataset an-
Unfortunately, there is no web-scale data source for seg- notation (see Fig. 1c). Our data engine has three stages:
mentation; to address this, we build a “data engine”, i.e., assisted-manual, semi-automatic, and fully automatic. In
we iterate between using our efficient model to assist in data the first stage, SAM assists annotators in annotating masks,
collection and using the newly collected data to improve the similar to a classic interactive segmentation setup. In the
model. We introduce each interconnected component next, second stage, SAM can automatically generate masks for
followed by the dataset we created and the experiments that a subset of objects by prompting it with likely object lo-
demonstrate the effectiveness of our approach. cations and annotators focus on annotating the remaining
Task (§2). In NLP and more recently computer vision, objects, helping increase mask diversity. In the final stage,
foundation models are a promising development that can we prompt SAM with a regular grid of foreground points,
perform zero-shot and few-shot learning for new datasets yielding on average ∼100 high-quality masks per image.
and tasks often by using “prompting” techniques. Inspired Dataset (§5). Our final dataset, SA-1B, includes more than
by this line of work, we propose the promptable segmen- 1B masks from 11M licensed and privacy-preserving im-
tation task, where the goal is to return a valid segmenta- ages (see Fig. 2). SA-1B, collected fully automatically us-
tion mask given any segmentation prompt (see Fig. 1a). A ing the final stage of our data engine, has 400× more masks
prompt simply specifies what to segment in an image, e.g., than any existing segmentation dataset [66, 44, 117, 60],
a prompt can include spatial or text information identifying and as we verify extensively, the masks are of high quality
an object. The requirement of a valid output mask means and diversity. Beyond its use in training SAM to be robust
that even when a prompt is ambiguous and could refer to and general, we hope SA-1B becomes a valuable resource
multiple objects (for example, a point on a shirt may infor research aiming to build new foundation models.
dicate either the shirt or the person wearing it), the output
Responsible AI (§6). We study and report on potential fair-
should be a reasonable mask for at least one of those ob-
ness concerns and biases when using SA-1B and SAM. Im-
jects. We use the promptable segmentation task as both a
ages in SA-1B span a geographically and economically di-
pre-training objective and to solve general downstream seg-
verse set of countries and we found that SAM performs sim-
mentation tasks via prompt engineering.
ilarly across different groups of people. Together, we hope
Model (§3). The promptable segmentation task and the goal this will make our work more equitable for real-world use
of real-world use impose constraints on the model architec- cases. We provide model and dataset cards in the appendix.
ture. In particular, the model must support flexible prompts,
needs to compute masks in amortized real-time to allow in- Experiments (§7). We extensively evaluate SAM. First, us-
teractive use, and must be ambiguity-aware. Surprisingly, ing a diverse new suite of 23 segmentation datasets, we find
we find that a simple design satisfies all three constraints: that SAM produces high-quality masks from a single fore-
a powerful image encoder computes an image embedding, ground point, often only slightly below that of the manu-
a prompt encoder embeds prompts, and then the two infor- ally annotated ground truth. Second, we find consistently
mation sources are combined in a lightweight mask decoder strong quantitative and qualitative results on a variety of
that predicts segmentation masks. We refer to this model as downstream tasks under a zero-shot transfer protocol using
the Segment Anything Model, or SAM (see Fig. 1b). By prompt engineering, including edge detection, object pro-
separating SAM into an image encoder and a fast prompt posal generation, instance segmentation, and a preliminary
encoder / mask decoder, the same image embedding can exploration of text-to-mask prediction. These results sug-
be reused (and its cost amortized) with different prompts. gest that SAM can be used out-of-the-box with prompt en-
Given an image embedding, the prompt encoder and mask gineering to solve a variety of tasks involving object and
decoder predict a mask from a prompt in ∼50ms in a web image distributions beyond SAM’s training data. Neverthe-
browser. We focus on point, box, and mask prompts, and less, room for improvement remains, as we discuss in §8.
also present initial results with free-form text prompts. To Release. We are releasing the SA-1B dataset for research
make SAM ambiguity-aware, we design it to predict mul- purposes and making SAM available under a permissive
tiple masks for a single prompt allowing SAM to naturally open license (Apache 2.0) at https://segment-anything.com.
handle ambiguity, such as the shirt vs. person example. We also showcase SAM’s capabilities with an online demo.

2
<50 masks
50-100 masks
100-200 masks
200-300 masks
300-400 masks
400-500 masks
>500 masks

Figure 2: Example images with overlaid masks from our newly introduced dataset, SA-1B. SA-1B contains 11M diverse,
high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks. These masks were
annotated fully automatically by SAM, and as we verify by human ratings and numerous experiments, are of high quality and
diversity. We group images by number of masks per image for visualization (there are ∼100 masks per image on average).

3
2. Segment Anything Task
We take inspiration from NLP, where the next token pre-
diction task is used for foundation model pre-training and
to solve diverse downstream tasks via prompt engineer-
ing [10]. To build a foundation model for segmentation,
we aim to define a task with analogous capabilities.
Task. We start by translating the idea of a prompt from NLP
to segmentation, where a prompt can be a set of foreground
/ background points, a rough box or mask, free-form text,
or, in general, any information indicating what to segment
in an image. The promptable segmentation task, then, is to
return a valid segmentation mask given any prompt. The re-
quirement of a “valid” mask simply means that even when
a prompt is ambiguous and could refer to multiple objects
(e.g., recall the shirt vs. person example, and see Fig. 3),
the output should be a reasonable mask for at least one of
those objects. This requirement is similar to expecting a lan-
guage model to output a coherent response to an ambiguous
prompt. We choose this task because it leads to a natural
pre-training algorithm and a general method for zero-shot
transfer to downstream segmentation tasks via prompting.
Figure 3: Each column shows 3 valid masks generated by
Pre-training. The promptable segmentation task suggests a SAM from a single ambiguous point prompt (green circle).
natural pre-training algorithm that simulates a sequence of
prompts (e.g., points, boxes, masks) for each training sam-
ple and compares the model’s mask predictions against the a broadly capable model that can adapt to many (though
ground truth. We adapt this method from interactive seg- not all) existing and new segmentation tasks via prompt
mentation [109, 70], although unlike interactive segmenta- engineering. This capability is a form of task generaliza-
tion whose aim is to eventually predict a valid mask after tion [26]. Note that this is different than previous work on
enough user input, our aim is to always predict a valid mask multi-task segmentation systems. In a multi-task system, a
for any prompt even when the prompt is ambiguous. This single model performs a fixed set of tasks, e.g., joint seman-
ensures that a pre-trained model is effective in use cases that tic, instance, and panoptic segmentation [114, 19, 54], but
involve ambiguity, including automatic annotation as re- the training and test tasks are the same. An important dis-
quired by our data engine §4. We note that performing well tinction in our work is that a model trained for promptable
at this task is challenging and requires specialized modeling segmentation can perform a new, different task at inference
and training loss choices, which we discuss in §3. time by acting as a component in a larger system, e.g., to
perform instance segmentation, a promptable segmentation
Zero-shot transfer. Intuitively, our pre-training task en- model is combined with an existing object detector.
dows the model with the ability to respond appropriately to
Discussion. Prompting and composition are powerful tools
any prompt at inference time, and thus downstream tasks
that enable a single model to be used in extensible ways, po-
can be solved by engineering appropriate prompts. For ex-
tentially to accomplish tasks unknown at the time of model
ample, if one has a bounding box detector for cats, cat in-
design. This approach is analogous to how other founda-
stance segmentation can be solved by providing the detec-
tion models are used, e.g., how CLIP [82] is the text-image
tor’s box output as a prompt to our model. In general, a wide
alignment component of the DALL·E [83] image generation
array of practical segmentation tasks can be cast as prompt-
system. We anticipate that composable system design, pow-
ing. In addition to automatic dataset labeling, we explore
ered by techniques such as prompt engineering, will enable
five diverse example tasks in our experiments in §7.
a wider variety of applications than systems trained specif-
Related tasks. Segmentation is a broad field: there’s in- ically for a fixed set of tasks. It’s also interesting to com-
teractive segmentation [57, 109], edge detection [3], su- pare promptable and interactive segmentation through the
per pixelization [85], object proposal generation [2], fore- lens of composition: while interactive segmentation mod-
ground segmentation [94], semantic segmentation [90], in- els are designed with human users in mind, a model trained
stance segmentation [66], panoptic segmentation [59], etc. for promptable segmentation can also be composed into a
The goal of our promptable segmentation task is to produce larger algorithmic system as we will demonstrate.

4
, score

mask decoder
image
, score
encoder
conv prompt encoder

image image , score

embedding mask points box text
valid masks

Figure 4: Segment Anything Model (SAM) overview. A heavyweight image encoder outputs an image embedding that can
then be efficiently queried by a variety of input prompts to produce object masks at amortized real-time speed. For ambiguous
prompts corresponding to more than one object, SAM can output multiple valid masks and associated confidence scores.

3. Segment Anything Model loss [15, 45, 64] over masks. To rank masks, the model pre-
dicts a confidence score (i.e., estimated IoU) for each mask.
We next describe the Segment Anything Model (SAM)
Efficiency. The overall model design is largely motivated
for promptable segmentation. SAM has three components,
by efficiency. Given a precomputed image embedding, the
illustrated in Fig. 4: an image encoder, a flexible prompt
prompt encoder and mask decoder run in a web browser, on
encoder, and a fast mask decoder. We build on Transformer
CPU, in ∼50ms. This runtime performance enables seam-
vision models [14, 33, 20, 62] with specific tradeoffs for
less, real-time interactive prompting of our model.
(amortized) real-time performance. We describe these com-
ponents at a high-level here, with details in §A. Losses and training. We supervise mask prediction with
the linear combination of focal loss [65] and dice loss [73]
Image encoder. Motivated by scalability and powerful pre-
used in [14]. We train for the promptable segmentation task
training methods, we use an MAE [47] pre-trained Vision
using a mixture of geometric prompts (for text prompts see
Transformer (ViT) [33] minimally adapted to process high
§7.5). Following [92, 37], we simulate an interactive setup
resolution inputs [62]. The image encoder runs once per
by randomly sampling prompts in 11 rounds per mask, al-
image and can be applied prior to prompting the model.
lowing SAM to integrate seamlessly into our data engine.
Prompt encoder. We consider two sets of prompts: sparse
(points, boxes, text) and dense (masks). We represent 4. Segment Anything Data Engine
points and boxes by positional encodings [95] summed with
As segmentation masks are not abundant on the inter-
learned embeddings for each prompt type and free-form text
net, we built a data engine to enable the collection of our
with an off-the-shelf text encoder from CLIP [82]. Dense
1.1B mask dataset, SA-1B. The data engine has three
prompts (i.e., masks) are embedded using convolutions and
stages: (1) a model-assisted manual annotation stage, (2) a
summed element-wise with the image embedding.
semi-automatic stage with a mix of automatically predicted
Mask decoder. The mask decoder efficiently maps the im- masks and model-assisted annotation, and (3) a fully auto-
age embedding, prompt embeddings, and an output token matic stage in which our model generates masks without
to a mask. This design, inspired by [14, 20], employs a annotator input. We go into details of each next.
modification of a Transformer decoder block [103] followed
Assisted-manual stage. In the first stage, resembling clas-
by a dynamic mask prediction head. Our modified decoder
sic interactive segmentation, a team of professional annota-
block uses prompt self-attention and cross-attention in two
tors labeled masks by clicking foreground / background ob-
directions (prompt-to-image embedding and vice-versa) to
ject points using a browser-based interactive segmentation
update all embeddings. After running two blocks, we up-
tool powered by SAM. Masks could be refined using pixel-
sample the image embedding and an MLP maps the output
precise “brush” and “eraser” tools. Our model-assisted an-
token to a dynamic linear classifier, which then computes
notation runs in real-time directly inside a browser (using
the mask foreground probability at each image location.
precomputed image embeddings) enabling a truly interac-
Resolving ambiguity. With one output, the model will av- tive experience. We did not impose semantic constraints for
erage multiple valid masks if given an ambiguous prompt. labeling objects, and annotators freely labeled both “stuff”
To address this, we modify the model to predict multiple and “things” [1]. We suggested annotators label objects
output masks for a single prompt (see Fig. 3). We found they could name or describe, but did not collect these names
3 mask outputs is sufficient to address most common cases or descriptions. Annotators were asked to label objects in
(nested masks are often at most three deep: whole, part, and order of prominence and were encouraged to proceed to the
subpart). During training, we backprop only the minimum next image once a mask took over 30 seconds to annotate.

5
At the start of this stage, SAM was trained using com-
mon public segmentation datasets. After sufficient data an-
notation, SAM was retrained using only newly annotated
masks. As more masks were collected, the image encoder
was scaled from ViT-B to ViT-H and other architectural de-
Figure 5: Image-size normalized mask center distributions.
tails evolved; in total we retrained our model 6 times. Av-
erage annotation time per mask decreased from 34 to 14
seconds as the model improved. We note that 14 seconds 5. Segment Anything Dataset
is 6.5× faster than mask annotation for COCO [66] and
Our dataset, SA-1B, consists of 11M diverse, high-
only 2× slower than bounding-box labeling with extreme
resolution, licensed, and privacy protecting images and
points [76, 71]. As SAM improved, the average number of
1.1B high-quality segmentation masks collected with our
masks per image increased from 20 to 44 masks. Overall,
data engine. We compare SA-1B with existing datasets
we collected 4.3M masks from 120k images in this stage.
and analyze mask quality and properties. We are releasing
Semi-automatic stage. In this stage, we aimed to increase SA-1B to aid future development of foundation models for
the diversity of masks in order to improve our model’s computer vision. We note that SA-1B will be released un-
ability to segment anything. To focus annotators on less der a favorable license agreement for certain research uses
prominent objects, we first automatically detected confident and with protections for researchers.
masks. Then we presented annotators with images prefilled
with these masks and asked them to annotate any additional Images. We licensed a new set of 11M images from a
unannotated objects. To detect confident masks, we trained provider that works directly with photographers. These im-
a bounding box detector [84] on all first stage masks using a ages are high resolution (3300×4950 pixels on average),
generic “object” category. During this stage we collected an and the resulting data size can present accessibility and stor-
additional 5.9M masks in 180k images (for a total of 10.2M age challenges. Therefore, we are releasing downsampled
masks). As in the first stage, we periodically retrained our images with their shortest side set to 1500 pixels. Even af-
model on newly collected data (5 times). Average annota- ter downsampling, our images are significantly higher reso-
tion time per mask went back up to 34 seconds (excluding lution than many existing vision datasets (e.g., COCO [66]
the automatic masks) as these objects were more challeng- images are ∼480×640 pixels). Note that most models today
ing to label. The average number of masks per image went operate on much lower resolution inputs. Faces and vehicle
from 44 to 72 masks (including the automatic masks). license plates have been blurred in the released images.

Fully automatic stage. In the final stage, annotation was Masks. Our data engine produced 1.1B masks, 99.1% of
fully automatic. This was feasible due to two major en- which were generated fully automatically. Therefore, the
hancements to our model. First, at the start of this stage, we quality of the automatic masks is centrally important. We
had collected enough masks to greatly improve the model, compare them directly to professional annotations and look
including the diverse masks from the previous stage. Sec- at how various mask properties compare to prominent seg-
ond, by this stage we had developed the ambiguity-aware mentation datasets. Our main conclusion, as borne out in
model, which allowed us to predict valid masks even in am- the analysis below and the experiments in §7, is that our
biguous cases. Specifically, we prompted the model with a automatic masks are high quality and effective for training
32×32 regular grid of points and for each point predicted models. Motivated by these findings, SA-1B only includes
a set of masks that may correspond to valid objects. With automatically generated masks.
the ambiguity-aware model, if a point lies on a part or sub- Mask quality. To estimate mask quality, we randomly sam-
part, our model will return the subpart, part, and whole ob- pled 500 images (∼50k masks) and asked our professional
ject. The IoU prediction module of our model is used to se- annotators to improve the quality of all masks in these im-
lect confident masks; moreover, we identified and selected ages. Annotators did so using our model and pixel-precise
only stable masks (we consider a mask stable if threshold- “brush” and “eraser” editing tools. This procedure resulted
ing the probability map at 0.5 − δ and 0.5 + δ results in in pairs of automatically predicted and professionally cor-
similar masks). Finally, after selecting the confident and rected masks. We computed IoU between each pair and
stable masks, we applied non-maximal suppression (NMS) found that 94% of pairs have greater than 90% IoU (and
to filter duplicates. To further improve the quality of smaller 97% of pairs have greater than 75% IoU). For comparison,
masks, we also processed multiple overlapping zoomed-in prior work estimates inter-annotator consistency at 85-91%
image crops. For further details of this stage, see §B. We IoU [44, 60]. Our experiments in §7 confirm by human rat-
applied fully automatic mask generation to all 11M images ings that mask quality is high relative to a variety of datasets
in our dataset, producing a total of 1.1B high-quality masks. and that training our model on automatic masks is nearly as
We describe and analyze the resulting dataset, SA-1B, next. good as using all masks produced by the data engine.

6
SA-1B LVIS v1 COCO ADE20K Open Images
11M images 0.120M images 0.123M images 0.028M images 1M images
1129M (1.1B) masks 1.5M masks 0.9M masks 0.7M masks 2.7M masks

Percent of masks

Percent of masks
Percent of images

80 15
100
10
40
10−2 5
0 0
<10 11-50 51-100 101-200 >200 0.00 0.25 0.50 0.75 0.0 0.2 0.4 0.6 0.8
Number of masks per image Relative segmentation mask size Concavity
Figure 6: Dataset mask properties. The legend references the number of images and masks in each dataset. Note, that SA-1B
has 11× more images and 400× more masks than the largest existing segmentation dataset Open Images [60].

Number of images per country

800k Asia & Oceania
Per country 600k
Africa
Europe
image count North America
≥ 100k 400k
Latin America & Caribbean
< 100k 200k
< 10k
< 1k 0

RUS
THA
USA
ITA
GBR
DEU
ESP
IDN
UKR
FRA
JPN
MYS
TUR
IND
CHN
POL
NLD
VNM
BRA
CAN
GRC
AUS
PRT
CZE
BLR
ROU
KOR
ARE
AUT
SWE
TWN
HKG
CHE
ISR
SGP
HUN
BEL
HRV
BGR
PHL
KAZ
MEX
NOR
MMR
ZAF
SRB
DNK
MAR
FIN
LVA
50 most common countries (ISO codes)

Figure 7: Estimated geographic distribution of SA-1B images. Most of the world’s countries have more than 1000 images in
SA-1B, and the three countries with the most images are from different parts of the world.

Mask properties. In Fig. 5 we plot the spatial distribution SA-1B % images

of object centers in SA-1B compared to the largest existing # countries #imgs #masks SA-1B COCO O.I.
segmentation datasets. Common photographer biases are Africa 54 300k 28M 2.8% 3.0% 1.7%
present in all datasets. We observe that SA-1B has greater Asia & Oceania 70 3.9M 423M 36.2% 11.4% 14.3%
coverage of image corners compared to LVIS v1 [44] and Europe 47 5.4M 540M 49.8% 34.2% 36.2%
ADE20K [117], the two most similarly distributed datasets, Latin America & Carib. 42 380k 36M 3.5% 3.1% 5.0%
while COCO [66] and Open Images V5 [60] have a more North America 4 830k 80M 7.7% 48.3% 42.8%
prominent center bias. In Fig. 6 (legend) we compare these high income countries 81 5.8M 598M 54.0% 89.1% 87.5%
datasets by size. SA-1B has 11× more images and 400× middle income countries 108 4.9M 499M 45.0% 10.5% 12.0%
more masks than the second largest, Open Images. On av- low income countries 28 100k 9.4M 0.9% 0.4% 0.5%
erage, it has 36× more masks per image than Open Images. Table 1: Comparison of geographic and income representa-
The closest dataset in this respect, ADE20K, still has 3.5× tion. SA-1B has higher representation in Europe and Asia &
fewer masks per image. Fig. 6 (left) plots the masks-per- Oceania as well as middle income countries. Images from
image distribution. Next, we look at image-relative mask Africa, Latin America & Caribbean, as well as low income
size (square root of the mask area divided by image area) countries, are underrepresented in all datasets.
in Fig. 6 (middle). As expected, since our dataset has more
masks per image, it also tends to include a greater percent-
age of small and medium relative-size masks. Finally, to
Geographic and income representation. We infer the
analyze shape complexity, we look at mask concavity (1
country images were photographed in using standard meth-
minus mask area divided by area of mask’s convex hull) in
ods (see §C). In Fig. 7 we visualize the per-country image
Fig. 6 (right). Since shape complexity is correlated with
counts in SA-1B (left) and the 50 countries with the most
mask size, we control for the datasets’ mask size distribu-
images (right). We note that the top-three countries are
tions by first performing stratified sampling from binned
from different parts of the world. Next, in Table 1 we com-
mask sizes. We observe that the concavity distribution of
pare the geographic and income representation of SA-1B,
our masks is broadly similar to that of other datasets.
COCO [66], and Open Images [60]. SA-1B has a substan-
6. Segment Anything RAI Analysis tially higher percentage of images in Europe and Asia &
Oceania as well as in middle income countries. All datasets
We next perform a Responsible AI (RAI) analysis of our underrepresent Africa as well as low income countries. We
work by investigating potential fairness concerns and bi- note that in SA-1B, all regions, including Africa, have at
ases when using SA-1B and SAM. We focus on the geo- least 28 million masks, 10× more than the total number of
graphic and income distribution of SA-1B and fairness of masks of any previous dataset. Finally, we observe that the
SAM across protected attributes of people. We also provide average number of masks per image (not shown) is fairly
dataset, data annotation, and model cards in §F. consistent across region and income (94-108 per image).

7
mIoU at mIoU at ing training (our usage of “zero-shot transfer” follows its
1 point 3 points 1 point 3 points usage in CLIP [82]). The datasets may include novel image
perceived gender presentation perceived skin tone distributions, such as underwater or ego-centric images (e.g.
feminine 54.4 ±1.7 90.4 ±0.6 1 52.9 ±2.2 91.0 ±0.9 Fig. 8) that, to our knowledge, do not appear in SA-1B.
masculine 55.7 ±1.7 90.1 ±0.6 2 51.5 ±1.4 91.1 ±0.5 Our experiments begin by testing the core goal of
perceived age group 3 52.2 ±1.9 91.4 ±0.7 promptable segmentation: producing a valid mask from any
older 62.9 ±6.7 92.6 ±1.3 4 51.5 ±2.7 91.7 ±1.0
prompt. We emphasize the challenging scenario of a single
middle 54.5 ±1.3 90.2 ±0.5 5 52.4 ±4.2 92.5 ±1.4
foreground point prompt, since it is more likely to be am-
young 54.2 ±2.2 91.2 ±0.7 6 56.7 ±6.3 91.2 ±2.4
biguous than other more specific prompts. Next, we present
Table 2: SAM’s performance segmenting people across per- a sequence of experiments that traverse low, mid, and high-
ceived gender presentation, age group, and skin tone. 95% level image understanding and roughly parallel the histori-
confidence intervals are shown. Within each grouping, all cal development of the field. Specifically, we prompt SAM
confidence intervals overlap except older vs. middle. to (1) perform edge detection, (2) segment everything, i.e.
object proposal generation, (3) segment detected objects,
Fairness in segmenting people. We investigate potential i.e. instance segmentation, and (4), as a proof-of-concept, to
fairness concerns across perceived gender presentation, per- segment objects from free-form text. These four tasks dif-
ceived age group, and perceived skin tone by measuring fer significantly from the promptable segmentation task that
the performance discrepancy of SAM between groups. We SAM was trained on and are implemented via prompt engi-
use the More Inclusive Annotations for People (MIAP) [87] neering. Our experiments conclude with an ablation study.
dataset for gender presentation and age and a proprietary Implementation. Unless otherwise specified: (1) SAM
dataset for skin tone (see §C). Our evaluation uses simu- uses an MAE [47] pre-trained ViT-H [33] image encoder
lated interactive segmentation with random sampling of 1 and (2) SAM was trained on SA-1B, noting that this dataset
and 3 points (see §D). Table 2 (top left) shows results for includes only automatically generated masks from the final
perceived gender presentation. We note that females have stage of our data engine. For all other model and training
been shown to be underrepresented in detection and seg- details, such as hyperparameters, refer to §A.
mentation datasets [115], but observe that SAM performs
similarly across groups. We repeat the analysis for per- 7.1. Zero-Shot Single Point Valid Mask Evaluation
ceived age in Table 2 (bottom left), noting that those who Task. We evaluate segmenting an object from a single fore-
are perceived to be younger and older have been shown to ground point. This task is ill-posed as one point can refer
be underrepresented in large-scale datasets [110]. SAM per- to multiple objects. Ground truth masks in most datasets
forms best on those who are perceived older (although the do not enumerate all possible masks, which can make au-
confidence interval is large). Finally, we repeat the anal- tomatic metrics unreliable. Therefore, we supplement the
ysis for perceived skin tone in Table 2 (right), noting that standard mIoU metric (i.e., the mean of all IoUs between
those with lighter apparent skin tones have been shown to predicted and ground truth masks) with a human study in
be overrepresented and those with darker skin tones under- which annotators rate mask quality from 1 (nonsense) to 10
represented in large-scale datasets [110]. As MIAP does (pixel-perfect). See §D.1, §E, and §G for additional details.
not contain perceived skin tone annotations, we use a pro- By default, we sample points from the “center” of ground
prietary dataset that contains annotations for the perceived truth masks (at a maximal value of the mask’s interior dis-
Fitzpatrick skin type [36], which ranges from 1 (lightest tance transform), following the standard evaluation proto-
skin tone) to 6 (darkest skin tone). While the means vary col in interactive segmentation [92]. Since SAM is capable
somewhat, we do not find a significant difference across of predicting multiple masks, we evaluate only the model’s
groups. We believe our findings stem from the nature of most confident mask by default. The baselines are all
the task, and acknowledge biases may arise when SAM is single-mask methods. We compare mainly to RITM [92],
used as a component in larger systems. Finally, in §C we a strong interactive segmenter that performs best on our
extend the analysis to segmenting clothing where we find benchmark compared to other strong baselines [67, 18].
an indication of bias across perceived gender presentation.
Datasets. We use a newly compiled suite of 23 datasets
7. Zero-Shot Transfer Experiments with diverse image distributions. Fig. 8 lists the datasets
and shows a sample from each one (see appendix Table 7 for
In this section, we present zero-shot transfer experiments more details). We use all 23 datasets for mIoU evaluation.
with SAM, the Segment Anything Model. We consider five For the human study, we use the subset listed in Fig. 9b
tasks, four of which differ significantly from the promptable (due to the resource requirements of such studies). This
segmentation task used to train SAM. These experiments subset includes both datasets for which SAM outperforms
evaluate SAM on datasets and tasks that were not seen dur- and underperforms RITM according to automatic metrics.

8
ADE20K [117] BBBC038v1 [12] Cityscapes [25] DOORS [80] DRAM [24] EgoHOS [113] GTEA [34, 63] Hypersim [86]

IBD [17] iShape [111] LVIS [44] NDD20 [100] NDISPark [22, 23] OVIS [81] PPDLS [74] Plittersdorf [46]

STREETS [91] TimberSeg [38] TrashCan [52] VISOR [28, 27] WoodScape [112] PIDRay [104] ZeroWaste-f [6]

Figure 8: Samples from the 23 diverse segmentation datasets used to evaluate SAM’s zero-shot transfer capabilities.

Avg. mask rating

PPDLS [74] +46.9 9
BBBC038v1 [12] +44.7 Ground Truth
DOORS [80] +41.1 SAM
TimberSeg [38] +28.9 7
NDD20 [100] +21.1 SAM - single output
LVIS [44] +18.5 RITM
STREETS [91] +17.3 5
ZeroWaste-f [6] +9.1
iShape [111] +8.8 LVIS VISOR DRAM IBD NDD20 OVIS iShape
ADE20K [117] +7.8 Datasets
OVIS [81] +7.0
Hypersim [86] +6.1 (b) Mask quality ratings by human annotators
NDISPark [22, 23] +2.7
VISOR [28, 27] +1.8
+1.5 SAM (oracle)
mIoU (23 datasets)

Plittersdorf [46]

mIoU (23 datasets)

SAM (oracle)
EgoHOS [113] +0.8
IBD [17] -0.3 75 75
WoodScape [112] -0.6
Cityscapes [25] -2.0 SAM
PIDRay [104] -5.8 RITM
DRAM [24] -6.5 50 50
TrashCan [52] -15.0 SimpleClick
GTEA [34, 63] -21.4 FocalClick
-20 0 +20 +40 1 2 3 5 9 1 2 3 5 9
IoU delta at 1 center point Number of points Number of points
(a) SAM vs. RITM [92] on 23 datasets (c) Center points (default) (d) Random points

Figure 9: Point to mask evaluation on 23 datasets. (a) Mean IoU of SAM and the strongest single point segmenter, RITM [92].
Due to ambiguity, a single mask may not match ground truth; circles show “oracle” results of the most relevant of SAM’s 3
predictions. (b) Per-dataset comparison of mask quality ratings by annotators from 1 (worst) to 10 (best). All methods use
the ground truth mask center as the prompt. (c, d) mIoU with varying number of points. SAM significantly outperforms prior
interactive segmenters with 1 point and is on par with more points. Low absolute mIoU at 1 point is the result of ambiguity.

Results. First, we look at automatic evaluation on the full fall between 7 and 9, which corresponds to the qualitative
suite of 23 datasets using mIoU. We compare per-dataset rating guideline: “A high score (7-9): The object is identi-
results in Fig. 9a against RITM. SAM yields higher re- fiable and errors are small and rare (e.g., missing a small,
sults on 16 of the 23 datasets, by as much as ∼47 IoU. We heavily obscured disconnected component, ...).” These re-
also present an “oracle” result, in which the most relevant sults indicate that SAM has learned to segment valid masks
of SAM’s 3 masks is selected by comparing them to the from a single point. Note that for datasets like DRAM and
ground truth, rather than selecting the most confident mask. IBD, where SAM is worse on automatic metrics, it receives
This reveals the impact of ambiguity on automatic evalu- consistently higher ratings in the human study.
ation. In particular, with the oracle to perform ambiguity Fig. 9c shows additional baselines, SimpleClick [67] and
resolution, SAM outperforms RITM on all datasets. FocalClick [18], which obtain lower single point perfor-
Results of the human study are presented in Fig. 9b. Er- mance than RITM and SAM. As the number of points in-
ror bars are 95% confidence intervals for mean mask rat- creases from 1 to 9, we observe that the gap between meth-
ings (all differences are significant; see §E for details). We ods decreases. This is expected as the task becomes easier;
observe that the annotators consistently rate the quality of also, SAM is not optimized for the very high IoU regime.
SAM’s masks substantially higher than the strongest base- Finally, in Fig. 9d we replace the default center point sam-
line, RITM. An ablated, “ambiguity-unaware” version of pling with random point sampling. We observe that the gap
SAM with a single output mask has consistently lower rat- between SAM and the baselines grows and SAM is able to
ings, though still higher than RITM. SAM’s mean ratings achieve comparable results under either sampling method.

9
image ground truth SAM mask AR@1000
method all small med. large freq. com. rare
ViTDet-H [62] 63.0 51.7 80.8 87.0 63.1 63.3 58.3
zero-shot transfer methods:
SAM – single out. 54.9 42.8 76.7 74.4 54.7 59.8 62.0
SAM 59.3 45.5 81.6 86.9 59.1 63.9 65.8
Table 4: Object proposal generation on LVIS v1. SAM is
applied zero-shot, i.e. it was not trained for object proposal
generation nor did it access LVIS images or annotations.

Figure 10: Zero-shot edge prediction on BSDS500. SAM

was not trained to predict edge maps nor did it have access intermediate step in pioneering systems (e.g., [102, 41, 84]).
to BSDS images or annotations during training. To generate object proposals, we run a slightly modified
version of our automatic mask generation pipeline and out-
method year ODS OIS AP R50 put the masks as proposals (see §D.3 for details).
HED [108] 2015 .788 .808 .840 .923 We compute the standard average recall (AR) metric on
EDETR [79] 2022 .840 .858 .896 .930 LVIS v1 [44]. We focus on LVIS because its large number
zero-shot transfer methods: of categories presents a challenging test. We compare to
Sobel filter 1968 .539 - - - a strong baseline implemented as a ViTDet [62] detector
Canny [13] 1986 .600 .640 .580 -
(with cascade Mask R-CNN [48, 11] ViT-H). We note that
Felz-Hutt [35] 2004 .610 .640 .560 -
SAM 2023 .768 .786 .794 .928 this “baseline” corresponds to the “Detector Masquerading
as Proposal generator” (DMP) method [16] that was shown
Table 3: Zero-shot transfer to edge detection on BSDS500. to game AR, making it a truly demanding comparison.
Results. In Table 4 we see unsurprisingly that using the
7.2. Zero-Shot Edge Detection detections from ViTDet-H as object proposals (i.e., the
DMP method [16] that games AR) performs the best over-
Approach. We evaluate SAM on the classic low-level task all. However, SAM does remarkably well on several met-
of edge detection using BSDS500 [72, 3]. We use a sim- rics. Notably, it outperforms ViTDet-H on medium and
plified version of our automatic mask generation pipeline. large objects, as well as rare and common objects. In fact,
Specifically, we prompt SAM with a 16×16 regular grid of SAM only underperforms ViTDet-H on small objects and
foreground points resulting in 768 predicted masks (3 per frequent objects, where ViTDet-H can easily learn LVIS-
point). Redundant masks are removed by NMS. Then, edge specific annotation biases since it was trained on LVIS, un-
maps are computed using Sobel filtering of unthresholded like SAM. We also compare against an ablated ambiguity-
mask probability maps and standard lightweight postpro- unaware version of SAM (“single out.”), which performs
cessing, including edge NMS (see §D.2 for details). significantly worse than SAM on all AR metrics.
Results. We visualize representative edge maps in Fig. 10
7.4. Zero-Shot Instance Segmentation
(see Fig. 15 for more). Qualitatively, we observe that even
though SAM was not trained for edge detection, it produces Approach. Moving to higher-level vision, we use SAM
reasonable edge maps. Compared to the ground truth, SAM as the segmentation module of an instance segmenter. The
predicts more edges, including sensible ones that are not an- implementation is simple: we run a object detector (the
notated in BSDS500. This bias is reflected quantitatively in ViTDet used before) and prompt SAM with its output
Table 3: recall at 50% precision (R50) is high, at the cost of boxes. This illustrates composing SAM in a larger system.
precision. SAM naturally lags behind state-of-the-art meth-
ods that learn the biases of BSDS500, i.e., which edges to Results. We compare the masks predicted by SAM and
suppress. Nevertheless, SAM performs well compared to ViTDet on COCO and LVIS in Table 5. Looking at the
pioneering deep learning methods such as HED [108] (also mask AP metric we observe gaps on both datasets, where
trained on BSDS500) and significantly better than prior, SAM is reasonably close, though certainly behind ViTDet.
though admittedly outdated, zero-shot transfer methods. By visualizing outputs, we observed that SAM masks are
often qualitatively better than those of ViTDet, with crisper
7.3. Zero-Shot Object Proposals boundaries (see §D.4 and Fig. 16). To investigate this ob-
servation, we conducted an additional human study asking
Approach. Next, we evaluate SAM on the mid-level task annotators to rate the ViTDet masks and SAM masks on the
of object proposal generation [2, 102]. This task has played 1 to 10 quality scale used before. In Fig. 11 we observe that
an important role in object detection research, serving as an SAM consistently outperforms ViTDet in the human study.

10
COCO [66] LVIS v1 [44] 3 “a wheel” 3 “beaver tooth grille”
method AP APS APM APL AP APS APM APL
ViTDet-H [62] 51.0 32.0 54.3 68.9 46.6 35.0 58.0 66.3
zero-shot transfer methods (segmentation module only):
SAM 46.5 30.8 51.0 61.7 44.7 32.5 57.6 65.5
Table 5: Instance segmentation results. SAM is prompted 7 “a wiper” 3 “a wiper” + point
with ViTDet boxes to do zero-shot segmentation. The fully-
supervised ViTDet outperforms SAM, but the gap shrinks
on the higher-quality LVIS masks. Interestingly, SAM out-
performs ViTDet according to human ratings (see Fig. 11).
7 “wipers” 3 “wipers” + point
Percent of ratings

40 8.6 ± 0.06, LVIS GT

8.1 ± 0.07, SAM
20 7.9 ± 0.08, ViTDet-H
7.6 ± 0.12, COCO GT
Figure 12: Zero-shot text-to-mask. SAM can work with
0
1 2 3 4 5 6 7 8 9 10 simple and nuanced text prompts. When SAM fails to make
Mask quality rating
a correct prediction, an additional point prompt can help.
Figure 11: Mask quality rating distribution from our human
study for ViTDet and SAM, both applied to LVIS ground Results. We show qualitative results in Fig. 12. SAM
truth boxes. We also report LVIS and COCO ground truth can segment objects based on simple text prompts like “a
quality. The legend shows rating means and 95% confi- wheel” as well as phrases like “beaver tooth grille”. When
dence intervals. Despite its lower AP (Table 5), SAM has SAM fails to pick the right object from a text prompt only,
higher ratings than ViTDet, suggesting that ViTDet exploits an additional point often fixes the prediction, similar to [31].
biases in the COCO and LVIS training data. 7.6. Ablations
We perform several ablations on our 23 dataset suite with
We hypothesize that on COCO, where the mask AP gap
the single center point prompt protocol. Recall that a sin-
is larger and the ground truth quality is relatively low (as
gle point may be ambiguous and that ambiguity may not
borne out by the human study), ViTDet learns the specific
be represented in the ground truth, which contains only a
biases of COCO masks. SAM, being a zero-shot method,
single mask per point. Since SAM is operating in a zero-
is unable to exploit these (generally undesirable) biases.
shot transfer setting there can be systematic biases between
The LVIS dataset has higher quality ground truth, but there
SAM’s top-ranked mask vs. the masks resulting from data
are still specific idiosyncrasies (e.g., masks do not contain
annotation guidelines. We therefore additionally report the
holes, they are simple polygons by construction) and biases
best mask with respect to the ground truth (“oracle”).
for modal vs. amodal masks. Again, SAM is not trained to
Fig. 13 (left) plots SAM’s performance when trained on
learn these biases, while ViTDet can exploit them.
cumulative data from the data engine stages. We observe
7.5. Zero-Shot Text-to-Mask that each stage increases mIoU. When training with all three
stages, the automatic masks vastly outnumber the manual
Approach. Finally, we consider an even higher-level task: and semi-automatic masks. To address this, we found that
segmenting objects from free-form text. This experiment oversampling the manual and semi-automatic masks during
is a proof-of-concept of SAM’s ability to process text training by 10× gave best results. This setup complicates
prompts. While we used the exact same SAM in all prior training. We therefore tested a fourth setup that uses only
experiments, for this one SAM’s training procedure is mod- the automatically generated masks. With this data, SAM
ified to make it text-aware, but in a way that does not require performs only marginally lower than using all data (∼0.5
new text annotations. Specifically, for each manually col- mIoU). Therefore, by default we use only the automatically
lected mask with area larger than 1002 we extract the CLIP generated masks to simplify the training setup.
image embedding. Then, during training, we prompt SAM In Fig. 13 (middle) we look at the impact of data volume.
with the extracted CLIP image embeddings as its first in- The full SA-1B contains 11M images, which we uniformly
teraction. The key observation here is that because CLIP’s subsample to 1M and 0.1M for this ablation. At 0.1M im-
image embeddings are trained to align with its text embed- ages, we observe a large mIoU decline under all settings.
dings, we can train with image embeddings, but use text However, with 1M images, about 10% of the full dataset,
embeddings for inference. That is, at inference time we run we observe results comparable to using the full dataset.
text through CLIP’s text encoder and then give the resulting This data regime, which still includes approximately 100M
text embedding as a prompt to SAM (see §D.5 for details). masks, may be a practical setting for many use cases.

11
5 points
1 point (oracle)

mIoU (23 datasets)

70 1 point (oracle)

mIoU (23 datasets)

70 1 point 80
3 points
65
60 75
1 point (oracle) 60
50 1 point
70 2 points
manual + semi + automatic automatic 91M 308M 636M
automatic only 0.1M 1M 11M ViT-B ViT-L ViT-H
Training data stages Training images Number of parameters

Figure 13: Ablation studies of our data engine stages, image encoder scaling, and training data scaling. (Left) Each data
engine stage leads to improvements on our 23 dataset suite, and training with only the automatic data (our default) yields
similar results to using data from all three stages. (Middle) SAM trained with ∼10% of SA-1B and full SA-1B is comparable.
We train with all 11M images by default, but using 1M images is a reasonable practical setting. (Right) Scaling SAM’s image
encoder shows meaningful, yet saturating gains. Nevertheless, smaller image encoders may be preferred in certain settings.

Finally, Fig. 13 (right) shows results with ViT-B, ViT-L, Limitations. While SAM performs well in general, it is
and ViT-H image encoders. ViT-H improves substantially not perfect. It can miss fine structures, hallucinates small
over ViT-B, but has only marginal gains over ViT-L. Further disconnected components at times, and does not produce
image encoder scaling does not appear fruitful at this time. boundaries as crisply as more computationally intensive
methods that “zoom-in”, e.g. [18]. In general, we expect
8. Discussion dedicated interactive segmentation methods to outperform
SAM when many points are provided, e.g. [67]. Unlike
Foundation models. Pre-trained models have been adapted
these methods, SAM is designed for generality and breadth
to downstream tasks since the early days of machine learn-
of use rather than high IoU interactive segmentation. More-
ing [99]. This paradigm has become increasingly impor-
over, SAM can process prompts in real-time, but neverthe-
tant in recent years with a growing emphasis on scale, and
less SAM’s overall performance is not real-time when using
such models have recently been (re-)branded as “founda-
a heavy image encoder. Our foray into the text-to-mask task
tion models”: i.e. models that are “trained on broad data
is exploratory and not entirely robust, although we believe
at scale and are adaptable to a wide range of downstream
it can be improved with more effort. While SAM can per-
tasks” [8]. Our work correlates well with this definition,
form many tasks, it is unclear how to design simple prompts
though we note that a foundation model for image segmen-
that implement semantic and panoptic segmentation. Fi-
tation is an inherently limited scope, since it represents an
nally, there are domain-specific tools, such as [7], that we
important, yet fractional, subset of computer vision. We
expect to outperform SAM in their respective domains.
also contrast one aspect of our approach with [8], which
emphasizes the role of self-supervised learning in founda- Conclusion. The Segment Anything project is an attempt to
tion models. While our model is initialized with a self- lift image segmentation into the era of foundation models.
supervised technique (MAE [47]), the vast majority of its Our principal contributions are a new task (promptable seg-
capabilities come from large-scale supervised training. In mentation), model (SAM), and dataset (SA-1B) that make
cases where data engines can scale available annotations, this leap possible. Whether SAM achieves the status of a
like ours, supervised training provides an effective solution. foundation model remains to be seen by how it is used in
Compositionality. Pre-trained models can power new ca- the community, but regardless we expect the perspective of
pabilities even beyond ones imagined at the moment of this work, the release of over 1B masks, and our promptable
training. One prominent example is how CLIP [82] is used segmentation model will help pave the path ahead.
as a component in larger systems, such as DALL·E [83]. Acknowledgments. We would like to thank Aaron Ad-
Our goal is to make this kind of composition straightfor- cock and Jitendra Malik for helpful discussion. We thank
ward with SAM. We aim to achieve this by requiring SAM Vaibhav Aggarwal and Yanghao Li for help with scal-
to predict a valid mask for a wide range of segmentation ing the model. We thank Cheng-Yang Fu, Jiabo Hu, and
prompts. The effect is to create a reliable interface between Robert Kuo for help with data annotation platform. We
SAM and other components. For example, MCC [106] can thank Allen Goodman and Bram Wasti for help in optimiz-
easily use SAM to segment an object of interest and achieve ing web-version of our model. Finally, we thank Morteza
strong generalization to unseen objects for 3D reconstruc- Behrooz, Ashley Gabriel, Ahuva Goldstand, Sumanth Gur-
tion from a single RGB-D image. In another example, SAM ram, Somya Jain, Devansh Kukreja, Joshua Lane, Lilian
can be prompted with gaze points detected by a wearable Luong, Mallika Malhotra, William Ngan, Omkar Parkhi,
device, enabling new applications. Thanks to SAM’s abil- Nikhil Raina, Dirk Rowe, Neil Sejoor, Vanessa Stark, Bala
ity to generalize to new domains like ego-centric images, Varadarajan, and Zachary Winstrom for their help in mak-
such systems work without need for additional training. ing the demo, dataset viewer, and other assets and tooling.

12
References [19] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kir-
illov, and Rohit Girdhar. Masked-attention mask transformer for
[1] Edward H Adelson. On seeing stuff: the perception of materials by universal image segmentation. CVPR, 2022. 4
humans and machines. Human vision and electronic imaging VI, [20] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-
2001. 5 pixel classification is not all you need for semantic segmentation.
[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an NeurIPS, 2021. 5, 16, 17
object? CVPR, 2010. 4, 10 [21] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten
[3] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won
Malik. Contour detection and hierarchical image segmentation. Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling
TPAMI, 2010. 4, 10, 21, 28 language modeling with pathways. arXiv:2204.02311, 2022. 1
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer [22] Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and
normalization. arXiv:1607.06450, 2016. 16 Giuseppe Amato. Domain adaptation for traffic density estimation.
[5] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of International Joint Conference on Computer Vision, Imaging and
image transformers. arXiv:2106.08254, 2021. 17 Computer Graphics Theory and Applications, 2021. 9, 20
[6] Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, [23] Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and
Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Giuseppe Amato. Night and day instance segmented park (NDIS-
Bargal, and Kate Saenko. ZeroWaste dataset: Towards deformable Park) dataset: a collection of images taken by day and by night for
object segmentation in cluttered scenes. CVPR, 2022. 9, 20 vehicle detection, segmentation and counting in parking areas. Zen-
[7] Stuart Berg, Dominik Kutra, Thorben Kroeger, Christoph N. odo, 2022. 9, 20
Straehle, Bernhard X. Kausler, Carsten Haubold, Martin Schiegg, [24] Nadav Cohen, Yael Newman, and Ariel Shamir. Semantic segmen-
Janez Ales, Thorsten Beier, Markus Rudy, Kemal Eren, Jaime I. tation in art paintings. Computer Graphics Forum, 2022. 9, 19, 20,
Cervantes, Buote Xu, Fynn Beuttenmueller, Adrian Wolny, Chong 23, 24
Zhang, Ullrich Koethe, Fred A. Hamprecht, and Anna Kreshuk. [25] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld,
ilastik: interactive machine learning for (bio)image analysis. Na- Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth,
ture Methods, 2019. 12 and Bernt Schiele. The Cityscapes dataset for semantic urban scene
[8] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, understanding. CVPR, 2016. 9, 19, 20
Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette [26] Bruno da Silva, George Konidaris, and Andrew Barto. Learning
Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportu- parameterized skills. ICML, 2012. 4
nities and risks of foundation models. arXiv:2108.07258, 2021. 1, [27] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino
12 Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan
[9] Gustav Bredell, Christine Tanner, and Ender Konukoglu. Iterative Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling
interaction training for segmentation editing networks. MICCAI, egocentric vision: Collection, pipeline and challenges for EPIC-
2018. 17 KITCHENS-100. IJCV, 2022. 9, 20, 23, 24
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, [28] Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar,
Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen.
Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel EPIC-KITCHENS VISOR benchmark: Video segmentations and
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, object relations. NeurIPS, 2022. 9, 19, 20, 23, 24
Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris [29] Terrance De Vries, Ishan Misra, Changhan Wang, and Laurens
Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Ben- Van der Maaten. Does object recognition work for everyone? CVPR
jamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, workshops, 2019. 18
Alec Radford, Ilya Sutskever, and Dario Amodei. Language models [30] Mark Dı́az, Ian Kivlichan, Rachel Rosen, Dylan Baker, Razvan
are few-shot learners. NeurIPS, 2020. 1, 4 Amironesei, Vinodkumar Prabhakaran, and Emily Denton. Crowd-
[11] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into WorkSheets: Accounting for individual and collective identities un-
high quality object detection. CVPR, 2018. 10 derlying crowdsourced dataset annotation. ACM Conference on
[12] Juan C. Caicedo, Allen Goodman, Kyle W. Karhohs, Beth A. Ci- Fairness, Accountability, and Transparency, 2022. 25
mini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim [31] Henghui Ding, Scott Cohen, Brian Price, and Xudong Jiang.
Becker, Minh Doan, Claire McQuin, Mohammad Rohban, Shan- PhraseClick: toward achieving flexible interactive segmentation by
tanu Singh, and Anne E. Carpenter. Nucleus segmentation across phrase and click. ECCV, 2020. 11
imaging experiments: the 2018 data science bowl. Nature Methods, [32] Piotr Dollár and C Lawrence Zitnick. Fast edge detection using
2019. 9, 19, 20 structured forests. TPAMI, 2014. 21
[13] John Canny. A computational approach to edge detection. TPAMI, [33] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk
1986. 10, 21 Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa De-
[14] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas hghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end Uszkoreit, and Neil Houlsby. An image is worth 16x16 words:
object detection with Transformers. ECCV, 2020. 5, 16, 17 Transformers for image recognition at scale. ICLR, 2021. 5, 8,
[15] Guillaume Charpiat, Matthias Hofmann, and Bernhard Schölkopf. 16
Automatic image colorization via multimodal predictions. ECCV, [34] Alireza Fathi, Xiaofeng Ren, and James M. Rehg. Learning to rec-
2008. 5, 17 ognize objects in egocentric activities. CVPR, 2011. 9, 19, 20
[16] Neelima Chavali, Harsh Agrawal, Aroma Mahendru, and Dhruv [35] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-
Batra. Object-proposal evaluation protocol is’ gameable’. CVPR, based image segmentation. IJCV, 2004. 10
2016. 10, 21 [36] Thomas B. Fitzpatrick. The validity and practicality of sun-reactive
[17] Jiazhou Chen, Yanghui Xu, Shufang Lu, Ronghua Liang, and Lian- skin types i through vi. Archives of Dermatology, 1988. 8
gliang Nan. 3D instance segmentation of MVS buildings. IEEE [37] Marco Forte, Brian Price, Scott Cohen, Ning Xu, and François
Transactions on Geoscience and Remote Sensing, 2022. 9, 19, 20, Pitié. Getting to 99% accuracy in interactive segmentation.
23, 24 arXiv:2003.07932, 2020. 5, 17
[18] Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and [38] Jean-Michel Fortin, Olivier Gamache, Vincent Grondin, François
Hengshuang Zhao. FocalClick: towards practical interactive image Pomerleau, and Philippe Giguère. Instance segmentation for au-
segmentation. CVPR, 2022. 8, 9, 12, 19 tonomous log grasping in forestry operations. IROS, 2022. 9, 20

13
[39] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jen- [55] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
nifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig.
Crawford. Datasheets for datasets. Communications of the ACM, Scaling up visual and vision-language representation learning with
2021. 25 noisy text supervision. ICML, 2021. 1
[40] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, [56] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown,
Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey
strong data augmentation method for instance segmentation. CVPR, Wu, and Dario Amodei. Scaling laws for neural language models.
2021. 16, 18, 22 arXiv:2001.08361, 2020. 1
[41] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. [57] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes:
Rich feature hierarchies for accurate object detection and semantic Active contour models. IJCV, 1988. 4
segmentation. CVPR, 2014. 10 [58] Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and
[42] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Weicheng Kuo. Learning open-world object proposals without
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and learning to classify. IEEE Robotics and Automation Letters, 2022.
Kaiming He. Accurate, large minibatch SGD: Training ImageNet 21
in 1 hour. arXiv:1706.02677, 2017. 17 [59] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother,
[43] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary and Piotr Dollár. Panoptic segmentation. CVPR, 2019. 4
Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, [60] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan
Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Na- Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo
garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari.
Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhong- The open images dataset v4: Unified image classification, object
cong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car- detection, and visual relationship detection at scale. IJCV, 2020. 2,
tillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, 6, 7, 18, 19
Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Chris- [61] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and
tian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Thomas Dandres. Quantifying the carbon emissions of machine
Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Ko- learning. arXiv:1910.09700, 2019. 28
lar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, [62] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Explor-
Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Mod- ing plain vision transformer backbones for object detection. ECCV,
hugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will 2022. 5, 10, 11, 16, 21, 23, 24
Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran [63] Yin Li, Zhefan Ye, and James M. Rehg. Delving into egocentric
Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, actions. CVPR, 2015. 9, 20
Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, [64] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image
Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria segmentation with latent diversity. CVPR, 2018. 5, 17, 19
Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, [65] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr
Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Dollár. Focal loss for dense object detection. ICCV, 2017. 5, 17
Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, [66] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro
Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Mi-
Yan, and Jitendra Malik. Ego4D: Around the World in 3,000 Hours crosoft COCO: Common objects in context. ECCV, 2014. 2, 4, 6,
of Egocentric Video. CVPR, 2022. 20 7, 11, 18, 19, 20
[44] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for [67] Qin Liu, Zhenlin Xu, Gedas Bertasius, and Marc Niethammer. Sim-
large vocabulary instance segmentation. CVPR, 2019. 2, 6, 7, 9, 10, pleClick: Interactive image segmentation with simple vision trans-
11, 19, 20, 21, 24 formers. arXiv:2210.11006, 2022. 8, 9, 12, 19
[45] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple [68] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regu-
choice learning: Learning to produce multiple structured outputs. larization. ICLR, 2019. 17
NeurIPS, 2012. 5, 17 [69] Cathy H Lucas, Daniel OB Jones, Catherine J Hollyhead, Robert H
[46] Timm Haucke, Hjalmar S. Kühl, and Volker Steinhage. Condon, Carlos M Duarte, William M Graham, Kelly L Robinson,
SOCRATES: Introducing depth in visual wildlife monitoring using Kylie A Pitt, Mark Schildhauer, and Jim Regetz. Gelatinous zoo-
stereo vision. Sensors, 2022. 9, 20 plankton biomass in the global oceans: geographic variation and
[47] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, environmental drivers. Global Ecology and Biogeography, 2014.
and Ross Girshick. Masked autoencoders are scalable vision learn- 20
ers. CVPR, 2022. 5, 8, 12, 16, 17 [70] Sabarinath Mahadevan, Paul Voigtlaender, and Bastian Leibe. Iter-
[48] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. atively trained interactive segmentation. BMVC, 2018. 4, 17
Mask R-CNN. ICCV, 2017. 10 [71] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc
[49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Van Gool. Deep extreme cut: From extreme points to object seg-
residual learning for image recognition. CVPR, 2016. 16 mentation. CVPR, 2018. 6
[50] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units [72] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik.
(gelus). arXiv:1606.08415, 2016. 16 A database of human segmented natural images and its applica-
[51] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena tion to evaluating segmentation algorithms and measuring ecologi-
Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, cal statistics. ICCV, 2001. 10, 21, 28
Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training [73] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net:
compute-optimal large language models. arXiv:2203.15556, 2022. Fully convolutional neural networks for volumetric medical image
1 segmentation. 3DV, 2016. 5, 17
[52] Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A [74] Massimo Minervini, Andreas Fischbach, Hanno Scharr, and
semantically-segmented dataset towards visual detection of marine Sotirios A. Tsaftaris. Finely-grained annotated datasets for image-
debris. arXiv:2007.08097, 2020. 9, 19, 20 based plant phenotyping. Pattern Recognition Letters, 2016. 9, 20
[53] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Wein- [75] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes,
berger. Deep networks with stochastic depth. ECCV, 2016. 17 Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Debo-
[54] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, rah Raji, and Timnit Gebru. Model cards for model reporting. Pro-
and Humphrey Shi. Oneformer: One transformer to rule universal ceedings of the conference on fairness, accountability, and trans-
image segmentation. arXiv:2211.06220, 2022. 4 parency, 2019. 25, 28

14
[76] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio [97] Yansong Tang, Zian Wang, Jiwen Lu, Jianjiang Feng, and Jie Zhou.
Ferrari. Extreme clicking for efficient object annotation. ICCV, Multi-stream deep neural networks for RGB-D egocentric action
2017. 6 recognition. IEEE Transactions on Circuits and Systems for Video
[77] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis- Technology, 2019. 20
Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and [98] The World Bank. The world by income and regions,
Jeff Dean. Carbon emissions and large neural network training. 2022. https://datatopics.worldbank.org/world-development-
arXiv:2104.10350, 2021. 28 indicators/the-world-by-income-and-region.html. 18
[78] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Rus- [99] Sebastian Thrun. Is learning the n-th thing any easier than learning
sell Power. Semi-supervised sequence tagging with bidirectional the first? NeurIPS, 1995. 12
language models. Proceedings of the 55th Annual Meeting of the [100] Cameron Trotter, Georgia Atkinson, Matt Sharpe, Kirsten Richard-
Association for Computational Linguistics, 2017. 18 son, A. Stephen McGough, Nick Wright, Ben Burville, and Per
[79] Mengyang Pu, Yaping Huang, Yuming Liu, Qingji Guan, and Berggren. NDD20: A large-scale few-shot dolphin dataset for
Haibin Ling. EDTER: Edge detection with transformer. CVPR, coarse and fine-grained categorisation. arXiv:2005.13359, 2020.
2022. 10 9, 19, 20, 23, 24
[80] Mattia Pugliatti and Francesco Topputo. DOORS: Dataset fOr [101] United States Environmental Protection Agency. Greenhouse Gas
bOuldeRs Segmentation. Zenodo, 2022. 9, 20 Equivalencies Calculator. https://www.epa.gov/energy/greenhouse-
[81] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang gas-equivalencies-calculator, 2022. 28
Bai, Serge Belongie, Alan Yuille, Philip Torr, and Song Bai. Oc- [102] Koen EA van de Sande, Jasper RR Uijlings, Theo Gevers, and
cluded video instance segmentation: A benchmark. ICCV, 2022. 9, Arnold WM Smeulders. Segmentation as selective search for ob-
20, 23, 24 ject recognition. ICCV, 2011. 10
[82] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, [103] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin.
Pamela Mishkin, Jack Clark, et al. Learning transferable visual Attention is all you need. NeurIPS, 2017. 5, 16
models from natural language supervision. ICML, 2021. 1, 2, 4, 5, [104] Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, and Yan-
8, 12, 16, 22 jun Wu. Towards real-world prohibited item detection: A large-
[83] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea scale x-ray benchmark. CVPR, 2021. 9, 19, 20
Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text- [105] Weiyao Wang, Matt Feiszli, Heng Wang, Jitendra Malik, and
to-image generation. ICML, 2021. 1, 4, 12 Du Tran. Open-world instance segmentation: Exploiting pseudo
[84] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster ground truth from learned pairwise affinity. CVPR, 2022. 21
R-CNN: Towards real-time object detection with region proposal [106] Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feicht-
networks. NeurIPS, 2015. 6, 10 enhofer, and Georgia Gkioxari. Multiview compressive coding for
3D reconstruction. CVPR, 2023. 12
[85] Xiaofeng Ren and Jitendra Malik. Learning a classification model
for segmentation. ICCV, 2003. 4 [107] Jianxiong Xiao, James Hays, Krista Ehinger, Aude Oliva, and An-
tonio Torralba. SUN database: Large-scale scene recognition from
[86] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar,
abbey to zoo. CVPR, 2010. 20
Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M.
Susskind. Hypersim: A photorealistic synthetic dataset for holistic [108] Saining Xie and Zhuowen Tu. Holistically-nested edge detection.
indoor scene understanding. ICCV, 2021. 9, 19, 20 ICCV, 2015. 10
[109] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S
[87] Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari,
Huang. Deep interactive object selection. CVPR, 2016. 4, 19
and Caroline Pantofaru. A step toward more inclusive people anno-
tations for fairness. Proceedings of the 2021 AAAI/ACM Conference [110] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Rus-
on AI, Ethics, and Society, 2021. 8, 19 sakovsky. Towards fairer datasets: Filtering and balancing the dis-
tribution of the people subtree in the imagenet hierarchy. Proceed-
[88] Sefik Ilkin Serengil and Alper Ozpinar. LightFace: A hybrid deep
ings of the 2020 conference on fairness, accountability, and trans-
face recognition framework. ASYU, 2020. 26
parency, 2020. 8
[89] Sefik Ilkin Serengil and Alper Ozpinar. HyperExtended LightFace:
[111] Lei Yang, Yan Zi Wei, Yisheng HE, Wei Sun, Zhenhang Huang,
A facial attribute analysis framework. ICEET, 2021. 26
Haibin Huang, and Haoqiang Fan. iShape: A first step towards
[90] Jamie Shotton, John Winn, Carsten Rother, and Antonio Crimin- irregular shape instance segmentation. arXiv:2109.15068, 2021. 9,
isi. TextonBoost: Joint appearance, shape and context modeling for 20, 23, 24
mulit-class object recognition and segmentation. ECCV, 2006. 4 [112] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu,
[91] Corey Snyder and Minh Do. STREETS: A novel camera network Padraig Varley, Derek O’Dea, Michal Uricár, Stefan Milz, Mar-
dataset for traffic flow. NeurIPS, 2019. 9, 20 tin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-
[92] Konstantin Sofiiuk, Ilya A Petrov, and Anton Konushin. Reviving camera fisheye dataset for autonomous driving. ICCV, 2019. 9,
iterative training with mask guidance for interactive segmentation. 20
ICIP, 2022. 5, 8, 9, 17, 19, 23, 24, 28 [113] Lingzhi Zhang, Shenghao Zhou, Simon Stent, and Jianbo Shi. Fine-
[93] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya grained egocentric hand-object segmentation: Dataset, model, and
Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to applications. ECCV, 2022. 9, 19, 20
prevent neural networks from overfitting. The Journal of Machine [114] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy.
Learning Research, 2014. 16 K-Net: Towards unified image segmentation. NeurIPS, 2021. 4
[94] Chris Stauffer and W Eric L Grimson. Adaptive background mix- [115] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-
ture models for real-time tracking. CVPR, 1999. 4 Wei Chang. Men also like shopping: Reducing gender bias ampli-
[95] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara fication using corpus-level constraints. arXiv:1707.09457, 2017. 8
Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ra- [116] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and An-
mamoorthi, Jonathan Barron, and Ren Ng. Fourier features let net- tonio Torralba. Places: A 10 million image database for scene
works learn high frequency functions in low dimensional domains. recognition. TPAMI, 2017. 20
NeurIPS, 2020. 5, 16 [117] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler,
[96] Yansong Tang, Yi Tian, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Adela Barriuso, and Antonio Torralba. Semantic understanding of
Action recognition in RGB-D egocentric videos. ICIP, 2017. 20 scenes through the ADE20K dataset. IJCV, 2019. 2, 7, 9, 20

15
Appendix image x2 dot product
embedding image to token attn. 2x per mask
(256x64x64) conv. masks
Table of contents: mlp trans.
output

• §A: Segment Anything Model and Task Details token to image attn.
token
per mask
mlp
output tokens token IoU
• §B: Automatic Mask Generation Details +
prompt tokens self attn.
to image
attn.
output
token IoU
mlp
(Ntokensx256) scores
• §C: RAI Additional Details mask decoder

• §D: Experiment Implementation Details Figure 14: Details of the lightweight mask decoder. A
• §E: Human Study Experimental Design two-layer decoder updates both the image embedding and
• §F: Dataset, Annotation, and Model Cards prompt tokens via cross-attention. Then the image embed-
ding is upscaled, from which the updated output tokens are
• §G: Annotation Guidelines used to dynamically predict masks. (Not illustrated for fig-
ure clarity: At every attention layer, positional encodings
A. Segment Anything Model and Task Details are added to the image embedding, and the entire original
prompt token (including position encoding) is re-added to
Image encoder. In general, the image encoder can be any
the token queries and keys.)
network that outputs a C×H×W image embedding. Mo-
tivated by scalability and access to strong pre-training, we
use an MAE [47] pre-trained Vision Transformer (ViT) [33] and image embedding are then added element-wise. If there
with minimal adaptations to process high resolution inputs, is no mask prompt, a learned embedding representing “no
specifically a ViT-H/16 with 14×14 windowed attention mask” is added to each image embedding location.
and four equally-spaced global attention blocks, follow- Lightweight mask decoder. This module efficiently maps
ing [62]. The image encoder’s output is a 16× downscaled the image embedding and a set of prompt embeddings to an
embedding of the input image. Since our runtime goal is to output mask. To combine these inputs, we take inspiration
process each prompt in real-time, we can afford a high num- from Transformer segmentation models [14, 20] and modify
ber of image encoder FLOPs because they are computed a standard Transformer decoder [103]. Before applying our
only once per image, not per prompt. decoder, we first insert into the set of prompt embeddings
Following standard practices (e.g., [40]), we use an in- a learned output token embedding that will be used at the
put resolution of 1024×1024 obtained by rescaling the im- decoder’s output, analogous to the [class] token in [33].
age and padding the shorter side. The image embedding For simplicity, we refer to these embeddings (not including
is therefore 64×64. To reduce the channel dimension, fol- the image embedding) collectively as “tokens”.
lowing [62], we use a 1×1 convolution to get to 256 chan- Our decoder design is shown in Fig. 14. Each decoder
nels, followed by a 3×3 convolution also with 256 channels. layer performs 4 steps: (1) self-attention on the tokens, (2)
Each convolution is followed by a layer normalization [4]. cross-attention from tokens (as queries) to the image em-
Prompt encoder. Sparse prompts are mapped to 256- bedding, (3) a point-wise MLP updates each token, and (4)
dimensional vectorial embeddings as follows. A point is cross-attention from the image embedding (as queries) to
represented as the sum of a positional encoding [95] of the tokens. This last step updates the image embedding with
point’s location and one of two learned embeddings that in- prompt information. During cross-attention, the image em-
dicate if the point is either in the foreground or background. bedding is treated as a set of 642 256-dimensional vectors.
A box is represented by an embedding pair: (1) the posi- Each self/cross-attention and MLP has a residual connec-
tional encoding of its top-left corner summed with a learned tion [49], layer normalization, and a dropout [93] of 0.1 at
embedding representing “top-left corner” and (2) the same training. The next decoder layer takes the updated tokens
structure but using a learned embedding indicating “bottom- and the updated image embedding from the previous layer.
right corner”. Finally, to represent free-form text we use the We use a two-layer decoder.
text encoder from CLIP [82] (any text encoder is possible in To ensure the decoder has access to critical geometric in-
general). We focus on geometric prompts for the remainder formation the positional encodings are added to the image
of this section and discuss text prompts in depth in §D.5. embedding whenever they participate in an attention layer.
Dense prompts (i.e., masks) have a spatial correspon- Additionally, the entire original prompt tokens (including
dence with the image. We input masks at a 4× lower res- their positional encodings) are re-added to the updated to-
olution than the input image, then downscale an additional kens whenever they participate in an attention layer. This
4× using two 2×2, stride-2 convolutions with output chan- allows for a strong dependence on both the prompt token’s
nels 4 and 16, respectively. A final 1×1 convolution maps geometric location and type.
the channel dimension to 256. Each layer is separated by After running the decoder, we upsample the updated im-
GELU activations [50] and layer normalization. The mask age embedding by 4× with two transposed convolutional

16
layers (now it’s downscaled 4× relative to the input image). Training algorithm. Following recent approaches [92, 37],
Then, the tokens attend once more to the image embedding we simulate an interactive segmentation setup during train-
and we pass the updated output token embedding to a small ing. First, with equal probability either a foreground point
3-layer MLP that outputs a vector matching the channel di- or bounding box is selected randomly for the target mask.
mension of the upscaled image embedding. Finally, we pre- Points are sampled uniformly from the ground truth mask.
dict a mask with a spatially point-wise product between the Boxes are taken as the ground truth mask’s bounding box,
upscaled image embedding and the MLP’s output. with random noise added in each coordinate with standard
The transformer uses an embedding dimension of 256. deviation equal to 10% of the box sidelength, to a maxi-
The transformer MLP blocks have a large internal dimen- mum of 20 pixels. This noise profile is a reasonable com-
sion of 2048, but the MLP is applied only to the prompt to- promise between applications like instance segmentation,
kens for which there are relatively few (rarely greater than which produce a tight box around the target object, and in-
20). However, in cross-attention layers where we have a teractive segmentation, where a user may draw a loose box.
64×64 image embedding, we reduce the channel dimension After making a prediction from this first prompt, subse-
of the queries, keys, and values by 2× to 128 for computa- quent points are selected uniformly from the error region
tional efficiency. All attention layers use 8 heads. between the previous mask prediction and the ground truth
The transposed convolutions used to upscale the output mask. Each new point is foreground or background if the er-
image embedding are 2×2, stride 2 with output channel di- ror region is a false negative or false positive, respectively.
mensions of 64 and 32 and have GELU activations. They We also supply the mask prediction from the previous it-
are separated by layer normalization. eration as an additional prompt to our model. To provide
Making the model ambiguity-aware. As described, a sin- the next iteration with maximal information, we supply the
gle input prompt may be ambiguous in the sense that it cor- unthresholded mask logits instead of the binarized mask.
responds to multiple valid masks, and the model will learn When multiple masks are returned, the mask passed to the
to average over these masks. We eliminate this problem next iteration and used to sample the next point is the one
with a simple modification: instead of predicting a single with the highest predicted IoU.
mask, we use a small number of output tokens and predict We find diminishing returns after 8 iteratively sampled
multiple masks simultaneously. By default we predict three points (we have tested up to 16). Additionally, to encour-
masks, since we observe that three layers (whole, part, and age the model to benefit from the supplied mask, we also
subpart) are often enough to describe nested masks. During use two more iterations where no additional points are sam-
training, we compute the loss (described shortly) between pled. One of these iterations is randomly inserted among the
the ground truth and each of the predicted masks, but only 8 iteratively sampled points, and the other is always at the
backpropagate from the lowest loss. This is a common tech- end. This gives 11 total iterations: one sampled initial in-
nique used for models with multiple outputs [15, 45, 64]. put prompt, 8 iteratively sampled points, and two iterations
For use in applications, we’d like to rank predicted masks, where no new external information is supplied to the model
so we add a small head (operating on an additional output so it can learn to refine its own mask predictions. We note
token) that estimates the IoU between each predicted mask that using a relatively large number of iterations is possible
and the object it covers. because our lightweight mask decoder requires less than 1%
Ambiguity is much rarer with multiple prompts and the of the image encoder’s compute and, therefore, each itera-
three output masks will usually become similar. To mini- tion adds only a small overhead. This is unlike previous
mize computation of degenerate losses at training and en- interactive methods that perform only one or a few interac-
sure the single unambiguous mask receives a regular gradi- tive steps per optimizer update [70, 9, 37, 92].
ent signal, we only predict a single mask when more than Training recipe. We use the AdamW [68] optimizer (β1 =
one prompt is given. This is accomplished by adding a 0.9, β2 = 0.999) and a linear learning rate warmup [42] for
fourth output token for an additional mask prediction. This 250 iterations and a step-wise learning rate decay schedule.
fourth mask is never returned for a single prompt and is the The initial learning rate (lr), after warmup, is 8e−4 . We
only mask returned for multiple prompts. train for 90k iterations (∼2 SA-1B epochs) and decrease the
Losses. We supervise mask prediction with a linear combi- lr by a factor of 10 at 60k iterations and again at 86666 it-
nation of focal loss [65] and dice loss [73] in a 20:1 ratio of erations. The batch size is 256 images. To regularize SAM,
focal loss to dice loss, following [20, 14]. Unlike [20, 14], we set weight decay (wd) to 0.1 and apply drop path [53]
we observe that auxiliary deep supervision after each de- (dp) with a rate of 0.4. We use a layer-wise learning rate
coder layer is unhelpful. The IoU prediction head is trained decay [5] (ld) of 0.8. No data augmentation is applied. We
with mean-square-error loss between the IoU prediction and initialize SAM from an MAE [47] pre-trained ViT-H. We
the predicted mask’s IoU with the ground truth mask. It is distribute training across 256 GPUs, due to the large image
added to the mask loss with a constant scaling factor of 1.0. encoder and 1024×1024 input size. To limit GPU mem-

17
ory usage, we train with up to 64 randomly sampled masks than 100 pixels (including removing entire masks if the
per GPU. Additionally, we find that lightly filtering SA-1B largest component is below this threshold). Second, another
masks to discard any that cover more than 90% of the image estimated 4% of masks include small, spurious holes. To
qualitatively improves results. address these, we filled holes with area less than 100 pixels.
For ablations and others variations on training (e.g., text- Holes were identified as components of inverted masks.
to-mask §D.5), we deviate from the default recipe above as Automatic mask generation model. We trained a special
follows. When training with data from the first and sec- version of SAM for fully automatic mask generation that
ond data engine stages only, we augment the input with sacrifices some inference speed for improved mask gener-
large-scale jitter [40] with a scale range of [0.1, 2.0]. In- ation properties. We note the differences between our de-
tuitively, data augmentation may be helpful when training fault SAM and the one used for data generation here: it
data is more limited. To train ViT-B and ViT-L, we use was trained on manual and semi-automatic data only, it was
180k iterations with batch size 128 distributed across 128 trained for longer (177656 iterations instead of 90k) with
GPUs. We set lr = 8e−4 /4e−4 , ld = 0.6/0.8, wd = 0.1, and large-scale jitter data augmentation [40], simulated interac-
dp = 0.6/0.4 for ViT-B/L, respectively. tive training used only point and mask prompts (no boxes)
and sampled only 4 points per mask during training (reduc-
B. Automatic Mask Generation Details ing from our default of 9 to 4 sped up training iterations
Here we discuss details of the data engine’s fully auto- and had no impact on 1-point performance, though it would
matic stage that was used to generate the released SA-1B. harm mIoU if evaluating with more points), and finally the
mask decoder used 3 layers instead of 2.
Cropping. Masks were generated from a regular grid of
32×32 points on the full image and 20 additional zoomed- SA-1B examples. We show SA-1B samples in Fig. 2. For
in image crops arising from 2×2 and 4×4 partially over- more examples, please see our dataset explorer.
lapping windows using 16×16 and 8×8 regular point grids,
respectively. The original high-resolution images were used C. RAI Additional Details
for cropping (this was the only time we used them). We re-
moved masks that touch the inner boundaries of the crops. Inferring geographic information for SA-1B. While the
We applied standard greedy box-based NMS (boxes were images in SA-1B are not geo-tagged, each image has a cap-
used for efficiency) in two phases: first within each crop and tion describing its contents and where it was taken. We infer
second across crops. When applying NMS within a crop, approximate image geo-locations from these captions using
we used the model’s predicted IoU to rank masks. When an Elmo-based named entity recognition model [78]. Each
applying NMS across crops, we ranked masks from most extracted location entity is mapped to every matching coun-
zoomed-in (i.e., from a 4×4 crop) to least zoomed-in (i.e., try, province, and city. Captions are mapped to a single
the original image), based on their source crop. In both country by first considering the matching countries, then
cases, we used an NMS threshold of 0.7. provinces, and finally cities. We note that there are ambigu-
Filtering. We used three filters to increase mask qual- ities and potential for biases with this method (e.g., “Geor-
ity. First, to keep only confident masks we filtered by the gia” may refer to the country or the US state). As such, we
model’s predicted IoU score at a threshold of 88.0. Second, use the extracted locations to analyze the dataset as a whole,
to keep only stable masks we compared two binary masks but do not release the inferred locations. The captions will
resulting from the same underlying soft mask by threshold- not be released publicly as required by the image provider.
ing it at different values. We kept the prediction (i.e., the Inferring geographic information for COCO and Open
binary mask resulting from thresholding logits at 0) only if Images. The COCO [66] and Open Images [60] datasets
the IoU between its pair of -1 and +1 thresholded masks was do not provide geo-locations. Following [29], we retrieve
equal to or greater than 95.0. Third, we noticed that occa- geographic metadata using the Flickr API. We retrieved
sionally an automatic mask would cover the entire image. locations for 24% of the COCO training set (19,562 im-
These masks were generally uninteresting, and we filtered ages) and for Open Images we retrieved 18% of the train-
them by removing masks that covered 95% or more of an ing set (493,517 images, after only considering images with
image. All filtering thresholds were selected to achieve both masks). We note that the geographic information is approx-
a large number of masks and high mask quality as judged by imate, and the sample of images with this information may
professional annotators using the method described in §5. not fully match the full dataset distribution.
Postprocessing. We observed two error types that are eas- Inferring income information. We use each image’s in-
ily mitigated with postprocessing. First, an estimated 4% ferred country to look up its income level using the levels
of masks include small, spurious components. To address defined by The World Bank [98]. We collapse the upper-
these, we removed connected components with area less middle and lower-middle levels into a single middle level.

18
mIoU at mIoU at Point sampling. Our default point sampling follows stan-
1 point 3 points 1 point 3 points dard practice in interactive segmentation [109, 64, 92]. The
perceived gender presentation perceived age group
first point is chosen deterministically as the point farthest
feminine 76.3 ±1.1 90.7 ±0.5 older 81.9 ±3.8 92.8 ±1.6
masculine 81.0 ±1.2 92.3 ±0.4 middle 78.2 ±0.8 91.3 ±0.3
from the object boundary. Each subsequent point is the
young 77.3 ±2.7 91.5 ±0.9 farthest from the boundary of the error region between
ground truth and the previous prediction. Some experiments
Table 6: SAM’s performance segmenting clothing across
(where specified) use a more challenging sampling strategy
perceived gender presentation and age group. The intervals
in which the first point is a random point, rather than a deter-
for perceived gender are disjoint, with mIoU for masculine
ministically selected “center” point. Each subsequent point
being higher. Confidence intervals for age group overlap.
is selected as described above. This setting better reflects
use cases in which the first point is not reliably near the
Fairness in segmenting people. To investigate SAM’s fair- center of the mask, such as prompting from eye gaze.
ness at segmenting people we use the More Inclusive Anno-
tations for People (MIAP) [87] test set annotations for Open Evaluation. We measure IoU between a prediction after
Images [60], which allows us to compare SAM’s perfor- N point prompts and a ground truth mask, where N =
mance across perceived gender presentation and perceived {1, 2, 3, 5, 9} and points are sampled iteratively with either
age group. MIAP provides box annotations, while we need of the strategies described above. The per-dataset mIoU is
ground truth masks for this analysis. To get ground truth the per-mask IoU averaged across all objects in the dataset.
masks, we select each person-category mask from Open Finally, we report the top-line metric by averaging the per-
Images if its corresponding bounding box is within a 1% dataset mIoUs across all 23 datasets. Our evaluation differs
margin (based on relative box side lengths) of an annotated from the standard interactive segmentation evaluation pro-
bounding box in MIAP, resulting in 3.9k masks. tocol which measures the average number of points needed
to achieve X% IoU, with up to 20 points. We focus on pre-
Fairness in segmenting clothing. We extend our analysis dictions after just one, or possibly a few points, since many
from §6 to clothing segmentation. We look at SAM’s per- of our use cases involve a single or very few prompts. Given
formance on clothing relative to the attributes of those wear- our application focus, which requires real-time prompt pro-
ing the clothes. We use all 6.5k ground truth masks from cessing, we expect the best interactive segmentation models
Open Images that have a category under the clothing super- to outperform SAM when using a large number of points.
class and reside within a person box from MIAP. In Table 6
we compare performance across perceived gender presenta- Baselines. We use three recent strong interactive base-
tion and age group. We find that SAM is better at segment- lines: RITM [92], FocalClick [18], and SimpleClick [67].
ing clothing on those who present predominantly mascu- For each, we use the largest models trained on the broad-
line, with disjoint 95% confidence intervals. The gap closes est datasets publicly released by the authors. For RITM,
when moving from 1 to 3 point evaluation. Differences for we use HRNet32 IT-M trained on the combination of
perceived age group are not significant. Our results indicate COCO [66] and LVIS [44] introduced by the authors.
there is a bias when segmenting clothing across perceived For FocalClick, we use SegFormerB3-S2 trained on a
gender presentation with a one point prompt, and we en- “combined dataset” that includes 8 different segmentation
courage users of SAM to be mindful of this limitation. datasets [18]. For SimpleClick, we use ViT-H448 trained
on a combination of COCO and LVIS. We follow the sug-
D. Experiment Implementation Details gested default strategies for data pre-processing (i.e., data
augmentations or image resizing) and do not change or
D.1. Zero-Shot Single Point Valid Mask Evaluation adapt any parameters for our evaluation. In our experi-
ments, we observe that RITM outperforms other baselines
Datasets. We built a new segmentation benchmark to eval-
on our 23 dataset suite with 1 point evaluation. Therefore,
uate the zero-shot transfer capabilities of our model using a
we use RITM as the default baseline. When evaluating with
suite of 23 diverse segmentation datasets from prior work.
more points we report results for all baselines.
A description of each dataset is given in Table 7. For exam-
ples, see main text Fig. 8. This suite covers a range of do- Single point ambiguity and oracle evaluation. In addition
mains including egocentric [34, 28, 113], microscopy [12], to IoU after N points prompts, we report SAM’s “oracle”
X-ray [104], underwater [52, 100], aerial [17], simula- performance at 1 point by evaluating the predicted mask that
tion [86], driving [25], and painting [24] images. For ef- best matches ground truth from amongst SAM’s three pre-
ficient evaluation we subsampled datasets with more than dictions (rather than using the one that SAM itself ranks
15k masks. Specifically, we randomly picked images so first, as we do by default). This protocol addresses possible
that the total number of masks in the sampled images was single point prompt ambiguity by relaxing the requirement
∼10k. We blurred faces of people in all the datasets. to guess the one right mask among several valid objects.

19
abbreviation image mask # images # masks
dataset description source split
& link type type sampled sampled
Plant Phenotyping Datasets
PPDLS Plants Leaf segmentation for images of tobacco and ara plants. Instance N/A 182 2347
Leaf Segmentation [74]

BBBC038v1 from Broad

Biological images of cells in a variety of settings testing
Bioimage Benchmark BBBC038v1 Microscopy Instance Train 227 10506
robustness in nuclei segmentation.
Collection [12]

Dataset fOr bOuldeRs Segmentation masks of single boulders positioned on the

DOORS Boulders Instance DS1 10000 10000
Segmentation [80] surface of a spherical mesh.

Segmentation masks of individual logs in piles of timber in

TimberSeg 1.0 [38] TimberSeg Logs various environments and conditions. Images are taken from Instance N/A 220 2487
an operator’s point-of-view.

Northumberland Dolphin Segmentation masks of two different dolphin species in

NDD20 Underwater Instance N/A 4402 6100
Dataset 2020 [100] images taken above and under water.

Large Vocabulary Instance Additional annotations for the COCO [66] dataset to enable
LVIS Scenes Instance Validation (v0.5) 945 9642
Segmentation [44] the study of long-tailed object detection and segmentation.
Traffic
STREETS [91] STREETS Segmentation masks of cars in traffic camera footage. Instance N/A 819 9854
camera
Segmentation masks in cluttered scenes of deformed
ZeroWaste-f [6] ZeroWaste-f Recycling Instance Train 2947 6155
recycling waste.

Irregular Segmentation masks of irregular shapes like antennas, logs,

iShape [111] iShape Instance Validation 754 9742
shapes fences, and hangers.

Object and part segmentation masks for images from

ADE20K [117] ADE20K Scenes Instance Validation 302 10128
SUN [107] and Places [116] datasets.

Occluded Video Instance Instance segmentation masks in videos, focusing on objects

OVIS Occlusions Instance Train 2044 10011
Segmentation [81] that are occluded.

Evermotion archinteriors
Photorealistic synthetic dataset of indoor scenes with instance
Hypersim [86] Hypersim Simulation Instance volumes 1-55 excluding 338 9445
masks.
20,25,40,49

Images of parking lots from video footage taken at day and

Night and Day Instance
NDISPark Parking lots night during different weather conditions and camera angles Instance Train 111 2577
Segmented Park [22, 23]
for vehicle segmentation.

Segmentation masks for hands and active objects in

EPIC-KITCHENS
VISOR Egocentric ego-centric video from the cooking dataset Instance Validation 1864 10141
VISOR [28, 27]
EPIC-KITCHENS [27].

Stereo Segmentation masks of wildlife in images taken with the

Plittersdorf dataset [46] Plittersdorf Instance Train, validation, test 187 546
images SOCRATES stereo camera trap.

Train (including only

Egocentric Hand-Object Fine-grained egocentric hand-object segmentation dataset.
EgoHOS Egocentric Instance Ego4D [43] and 2940 9961
Segmentation [113] Dataset contains mask annotations for existing datasets.
THU-READ [97, 96])

High-resolution drone UAV images annotated with roof

InstanceBuilding 2D [17] IBD Drones Instance Train (2D annotations) 467 11953
instance segmentation masks.

Fisheye Fisheye driving dataset with segmentation masks. Images are

WoodScape [112] WoodScape Instance Set 1 107 10266
driving taken from four surround-view cameras.

Cityscapes [25] Cityscapes Driving Stereo video of street scenes with segmentation masks. Panoptic Validation 293 9973

Segmentation masks of prohibited items in X-ray images of

PIDray [104] PIDRay X-ray Instance Test (hard) 3733 8892
baggage.

Diverse Realism in Art Domain adaptation dataset for semantic segmentation of art
DRAM Paintings Semantic Test 718 1179
Movements [24] paintings.

Segmentation masks of trash in images taken by underwater

TrashCan [52] TrashCan Underwater Instance Train (instance task) 5936 9540
ROVs. Images are sourced from the J-EDI [69] dataset.

Videos are composed of four different subjects performing

Georgia Tech Egocentric Train (segmenting hands
GTEA Egocentric seven types of daily activities with segmentation masks of Instance 652 1208
Activity Datasets [34, 63] task)
hands.

Table 7: Segmentation datasets used to evaluate zero-shot segmentation with point prompts. The 23 datasets cover a broad
range of domains; see column “image type”. To make our evaluation efficient, we subsample datasets that have more than
15k masks. Specifically, we randomly sampled images so that the total number of masks in the images is ∼10k.

20
image ground truth SAM image ground truth SAM

Figure 15: Additional visualizations of zero-shot edge predictions on BSDS500. Recall that SAM was not trained to predict
edge maps and did not have access to BSDS images and annotations during training.

D.2. Zero-Shot Edge Detection gories, we use AR@1000 but measured against a ground
truth set containing just the corresponding LVIS categories.
Dataset and metrics. We perform zero-shot edge detection
experiments on BSDS500 [72, 3]. The ground truth for each Baseline. We use cascade ViTDet-H as a baseline, the
image comes from the manual annotations of five different strongest model from [62] by AP on LVIS. As noted in the
subjects. We report results on the 200 image test subset main text, an object detector trained in-domain can “game”
using the four standard metrics for edge detection [3, 32]: AR [16] and is expected to be a stronger baseline than other
optimal dataset scale (ODS), optimal image scale (OIS), av- models that focus on open-world proposals or segmenta-
erage precision (AP), and recall at 50% precision (R50). tion [58, 105]. To produce 1000 proposals, we disable score
Method. For zero-shot transfer, we use a simplified ver- thresholding in the three cascade stages and as raise the
sion of our automatic mask generation pipeline. We prompt maximum number of predictions per stage to 1000.
SAM with a 16×16 regular grid of foreground points,
which yields 768 predicted masks (three per point). We do Method. We use a modified version of SAM’s automatic
not filter by predicted IoU or stability. Redundant masks mask generation pipeline for zero-shot transfer. First, to
are removed by NMS. Then we apply a Sobel filter to the make inference time comparable to that of ViTDet we do
remaining masks’ unthresholded probability maps and set not process image crops. Second, we remove filtering by
values to zero if they do not intersect with the outer bound- predicted IoU and stability. This leaves two tunable param-
ary pixels of a mask. Finally, we take a pixel-wise max over eters to get ∼1000 masks per image: the input point grid and
all the predictions, linearly normalize the result to [0,1], and the NMS threshold duplicate mask suppression. We choose
apply edge NMS [13] to thin the edges. a 64×64 point grid and an NMS threshold of 0.9, which
Visualizations. In Fig. 15, we show additional examples produces ∼900 masks per image on average. At evaluation,
of zero-shot edge predictions from SAM. These qualitative if greater than 1000 masks have been proposed in an im-
examples further illustrate how SAM tends to output sensi- age, they are ranked by the average of their confidence and
ble edge maps, despite not being trained for edge detection. stability scores, then truncated to the top 1000 proposals.
We see that the edges can align well with the human anno-
We hypothesize that SAM’s ability to output multiple
tations. Although, as previously mentioned, since SAM is
masks is especially valuable for this task, since recall should
not trained for edge detection it does not learn the biases of
benefit from proposals generated at multiple scales from
the BSDS500 dataset and often outputs more edges than are
a single input point. To test this, we compare to an ab-
present in the ground truth annotations.
lated version SAM that only outputs a single mask instead
D.3. Zero-Shot Object Proposals of three (SAM - single-output). Since this model produces
fewer masks, we further increase the number of points sam-
Dataset and metrics. We report the standard average recall pled and NMS threshold to 128×128 and 0.95, respectively,
(AR) metric for masks at 1000 proposals on the LVIS v1 obtaining ∼950 masks per image on average. Additionally,
validation set [44]. Since LVIS has high-quality masks for single-output SAM does not produce the IoU score used
1203 object classes, it provides a challenging test for ob- to rank masks for NMS in the automatic mask generation
ject proposal generation. We focus on AR@1000 due to the pipeline, so instead masks are ranked randomly. Testing
open-world nature of our model, which will likely produce suggests this has similar performance to more sophisticated
many valid masks outside even the 1203 classes in LVIS. To methods of ranking masks, such as using the max logit value
measure performance on frequent, common, and rare cate- of the mask as a proxy for model confidence.

21
ground truth ViTDet SAM ground truth ViTDet SAM

Figure 16: Zero-shot instance segmentation on LVIS v1. SAM produces higher quality masks than ViTDet. As a zero-shot
model, SAM does not have the opportunity to learn specific training data biases; see top-right as an example where SAM
makes a modal prediction, whereas the ground truth in LVIS is amodal given that mask annotations in LVIS have no holes.

D.4. Zero-Shot Instance Segmentation

Method. For zero-shot instance segmentation, we prompt

SAM with the boxes output by a fully-supervised ViTDet-H
on COCO and LVIS v1 validation splits. We apply an ad-
ditional mask refinement iteration by feeding the most con-
fident predicted mask, together with the box prompt, back
to the mask decoder to produce the final prediction. We
show zero-shot instance segmentations predicted on LVIS
in Fig. 16. Compared to ViTDet, SAM tends to produce
higher quality masks with cleaner boundaries. We confirm
this observation with human studies in §7.4. Note that as a
zero-shot model, SAM is not able to learn annotation biases Figure 17: Visualization of thresholding the similarities of
in a dataset. For instance, we see that SAM makes a valid mask embeddings from SAM’s latent space. A query is in-
modal prediction for the plate, whereas LVIS masks cannot dicated by the magenta box; top row shows matches at a low
contain holes by design so the plate is annotated amodally. threshold, bottom row at a high threshold. The most similar
mask embeddings in the same image can often be seman-
D.5. Zero-Shot Text-to-Mask tically similar to the query mask embedding, even though
SAM is not trained with explicit semantic supervision.
Model and training. We use the largest publicly available
CLIP model [82] (ViT-L/14@336px) to compute text
and image embeddings, which we `2 normalize prior to use. Inference. During inference we use the CLIP text encoder
To train SAM, we use masks from the first two stages of our without any modifications to create a prompt for SAM. We
data engine. Moreover, we discard all masks with an area rely on the fact that text and image embeddings are aligned
smaller than 1002 pixels. We train this model with large- by CLIP, which allows us to train without any explicit text
scale jitter [40] for 120k iterations with batch size 128. All supervision while using text-based prompts for inference.
other training parameters follow our default settings.
D.6. Probing the Latent Space of SAM
Generating training prompts. To extract an input prompt
we first expand the bounding box around each mask by a Finally, we perform an initial investigation to qualita-
random factor from 1× to 2×, square-crop the expanded tively probe the latent space learned by SAM. In particu-
box to maintain its aspect ratio, and resize it to 336×336 lar, we are interested in whether SAM is able to capture any
pixels. Before feeding the crop to the CLIP image encoder, semantics in its representation even though is not trained
with 50% probability we zero-out pixels outside the mask. with explicit semantic supervision. To do so, we compute
To ensure the embedding focuses on the object, we use mask embeddings by extracting an image embedding from
masked attention in the last layer to restrict attention from SAM from an image crop around a mask and its horizon-
the output token to the image positions inside the mask. Fi- tally flipped version, multiplying the image embedding by
nally, our prompt is the output token embedding. For train- the binary mask, and averaging over spatial locations. In
ing we supply the CLIP-based prompt first, followed by ad- Fig. 17, we show 3 examples of a query mask and similar
ditional iterative point prompts to refine the prediction. masks (in the latent space) in the same image. We observe

22
that the nearest neighbors for each query show some, albeit pany that collected manually annotated masks for the data
imperfect, shape and semantic similarity. Although these engine. An annotator was provided access to an image, the
results are preliminary, they indicate that the representations predicted mask of a single model, and the input to the model
from SAM may be useful for a variety of purposes, such as (either a single point or single box) and asked to judge the
further data labeling, understanding the contents of datasets, mask on three criterion: Does the mask correspond to a
or as features for downstream tasks. valid object? Does the mask have a clean boundary? and
Does the mask correspond to the input? They then submit-
E. Human Study Experimental Design ted a rating from 1-10 indicating the overall mask quality.
A score of 1 indicates a mask that corresponds to no ob-
Here we describe details of the human study used to eval-
ject at all; a low score (2-4) indicates that the mask has huge
uate mask quality in §7.1 and §7.4. The purpose of the
errors, such including huge regions of other objects or hav-
human study is to address two limitations of using IoU to
ing large areas of nonsensical boundaries; a middle score
ground truth as a measure of predicted mask quality. The
(5-6) indicates masks that are mostly sensible but still have
first limitation is that, for ambiguous inputs such as a single
significant semantic or boundary errors; a high score (7-
point, the model may be strongly penalized for returning a
9) indicates masks with only minor boundary errors; and a
valid mask of a different object than the ground truth. The
score of 10 is for masks with no visible errors. Annotators
second limitation is that ground truth masks may include
were provided with five different views, each designed to
various biases, such as systematic errors in the edge qual-
help identify different error types.
ity or decisions to modally or amodally segment occluding
objects. A model trained in-domain can learn these biases For single point experiments, 1000 masks per dataset
and obtain a higher IoU without necessarily producing bet- were selected randomly from the same subsets used for
ter masks. Human review can obtain a measure of mask benchmarking zero-shot interactive segmentation (see §D.1
quality independent of an underlying ground truth mask in for details on these subsets). The model input was the cen-
order to alleviate these issues. termost point, calculated as the largest value of the distance
transform from the edge of the mask. For instance seg-
Models. For single-point evaluation, we use RITM [92], mentation experiments, 1000 masks were selected from the
single-output SAM, and SAM to test two hypotheses. First, LVIS v1 validation set, and the model input was the LVIS
we hypothesize that SAM produces visually higher quality ground truth box. In all experiments, masks with a size
masks than baseline interactive segmentation models when smaller than 242 pixels were excluded from sampling, to
given a single point, even when metrics such as IoU with prevent showing raters a mask that was too small to judge
ground truth do not reveal this. Second, we hypothesize accurately. For both memory and display reasons, large im-
that SAM’s ability to disambiguate masks improves mask ages were rescaled to have a max side-length of 2000 before
quality for single point inputs, since single output SAM may predicting a mask. In all experiments, the same inputs were
return masks that average over ambiguous masks. fed to each model to produce a predicted mask.
For instance segmentation experiments, we evaluate cas-
For comparison, the ground truth masks from each
cade ViTDet-H [62] and SAM in order to test the hypothesis
dataset were also submitted for rating. For single-point
that SAM produces visually higher quality masks, even if it
experiments, this gave 4000 total rating jobs per dataset
obtains a lower AP due to the inability to learn specific an-
(1000 masks each for RITM, SAM single-output, SAM,
notation biases of the validation dataset.
and ground truth); for instance segmentation experiments,
Datasets. For single-point experiments, we select 7 datasets it gave 3000 total jobs (ViTDet, SAM, and ground truth).
from our set of 23 datasets, since the full suite is too large For each dataset, these jobs were inserted with random
for human review. We choose LVIS v0.5 [17], VISOR [28, ordering into a queue from which 30 annotators drew jobs.
27], DRAM [24], IBD [17], NDD20 [100], OVIS [81], and In initial testing of the review study, we provided each job to
iShape [111], which provide a diverse collection of images, five different annotators and found reasonable consistency
including scene-level, ego-centric, drawn, overhead, under- in scores: the average standard deviation in score over the
water, and synthetic imagery. Additionally, this set includes five annotators was 0.83. Additionally, the annotation com-
datasets both where SAM outperforms RITM with IoU met- pany deployed quality assurance testers who spot checked
rics and vice-versa. For instance segmentation experiments, a fraction of results for extreme departures from the guide-
we use the LVIS v1 validation set, allowing for direct com- lines. Thus for our experiments each job (i.e., rating one
parison to ViTDet, which was trained on LVIS. mask in one image) was completed by only a single anno-
Methodology. We presented masks generated by the mod- tator. Average time spent per annotator per job was 90 sec-
els to professional annotators and asked them to rate each onds, longer than our initial target of 30 seconds, but still
mask using provided guidelines (see §G for the complete sufficiently fast to collect a large number of ratings on each
guidelines). Annotators were sourced from the same com- of the 7 selected datasets.

23
Percent of ratings

Percent of ratings
6.5 ± 0.15, RITM 8.1 ± 0.10, SAM 6.3 ± 0.16, RITM 8.3 ± 0.09, SAM
40
40 7.7 ± 0.12, SAM - single output 8.5 ± 0.09, GT 7.5 ± 0.13, SAM - single output 8.5 ± 0.13, GT

20 20

0 0
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Mask quality rating Mask quality rating
(a) LVIS v0.5 [17] (b) VISOR [28, 27]
Percent of ratings

Percent of ratings
40 5.9 ± 0.14, RITM 7.7 ± 0.13, SAM 40 7.1 ± 0.12, RITM 8.3 ± 0.08, SAM
6.8 ± 0.15, SAM - single output 8.0 ± 0.15, GT 7.9 ± 0.11, SAM - single output 8.4 ± 0.09, GT

20 20

0 0
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Mask quality rating Mask quality rating
(c) DRAM [24] (d) IBD [17]
Percent of ratings

Percent of ratings
6.4 ± 0.17, RITM 8.6 ± 0.10, SAM 40 6.1 ± 0.15, RITM 7.2 ± 0.13, SAM
40 8.2 ± 0.11, SAM - single output 8.9 ± 0.06, GT 7.7 ± 0.12, SAM - single output 8.8 ± 0.09, GT

20 20

0 0
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Mask quality rating Mask quality rating
(e) NDD20 [100] (f) OVIS [81]
Percent of ratings

4.9 ± 0.16, RITM 7.1 ± 0.15, SAM

40
6.2 ± 0.17, SAM - single output 9.3 ± 0.06, GT

0
1 2 3 4 5 6 7 8 9 10
Mask quality rating
(g) iShape [111]

Figure 18: Mask quality rating distributions by dataset from our human evaluation study.

SAM > baseline SAM > SAM single out. tests for two hypotheses: (1) that SAM gets higher scores
dataset p-value CI99 (∆µ) p-value CI99 (∆µ)
than the baseline model (RITM or ViTDet) and (2) that
point input (RITM [92] baseline):
SAM gets higher scores than single-output SAM. P-values
LVIS v0.5 [44] 4e-69 (1.40, 1.84) 2e-11 (0.29, 0.64)
VISOR [28, 27] 7e-98 (1.81, 2.24) 7e-26 (0.58, 0.94) are calculated via a paired t-test on the means of the model
DRAM [24] 1e-76 (1.54, 2.00) 2e-24 (0.62, 1.03) scores, which we supplement with a paired bootstrap test on
IBD [17] 2e-57 (1.03, 1.39) 1e-15 (0.32, 0.62) 10k samples to find the 99% confidence interval for the dif-
NDD20 [100] 2e-86 (1.88, 2.37) 5e-08 (0.19, 0.55) ference of means. Table 8 shows p-values and confidence
OVIS [81] 2e-64 (1.38, 1.84) 3e-10 (0.27, 0.63) intervals for these tests. All statistical tests are strongly sig-
iShape [111] 2e-88 (1.97, 2.47) 7e-23 (0.65, 1.10)
nificant, and all confidence intervals exclude zero.
box input (ViTDet-H [62] baseline):
LVIS v1 [44] 2e-05 (0.11, 0.42) N/A N/A For instance segmentation, Fig. 11 of the main text
Table 8: Statistical tests showing significance that SAM has shows the histogram for ratings. To compare to COCO
higher mask quality ratings than baseline and single-output ground truth, we additionally include 794 ratings of COCO
SAM. P-values are calculated by paired t-test, while confi- ground truth masks that were collected during our testing of
dence intervals for the difference in mean scores are calcu- the human review process. These masks were presented to
lated by paired bootstrap on 10k samples. All p-values are raters using an identical setup as the LVIS results. For fair
significant, and all confidence intervals exclude zero. comparison, results for LVIS in Fig. 11 were subsampled
to the same 794 inputs for each model and ground truth.
For Table 8, the full 1000 ratings are used to run statistical
Results. Fig. 18 shows histograms over ratings for each tests, which show that SAM’s mask quality improvement
dataset in the single-point experiments. We run statistical over ViTDet is statistically significant.

24
F. Dataset, Annotation, and Model Cards 7. Are relationships between individual instances made explicit (e.g., users’
movie ratings, social network links)? If so, please describe how these rela-
In §F.1 we provide a Dataset Card for SA-1B, follow- tionships are made explicit. No, there are no known relationships between
instances in the dataset.
ing [39], in a list of questions and answers. Next, we pro-
8. Are there any errors, sources of noise, or redundancies in the dataset? If
vide a Data Annotation Card in §F.2 for the first two stages so, please provide a description. Errors: The masks are generated by a
of our data engine described in §4, following CrowdWork- segmentation model, so there may be errors or inconsistencies in the masks.
Redundancies: While no two images are the same, there are instances of
Sheets [30], again as a list of questions and answers. We images of the same subject taken close together in time.
provide a Model Card following [75] in Table 9.
9. Is the dataset self-contained, or does it link to or otherwise rely on external
resources (e.g., websites, tweets, other datasets)? If it links to or relies on
F.1. Dataset Card for SA-1B external resources, a) are there guarantees that they will exist, and remain
constant, over time; b) are there official archival versions of the complete
Motivation dataset (i.e., including the external resources as they existed at the time
1. For what purpose was the dataset created? Was there a specific task in the dataset was created); c) are there any restrictions (e.g., licenses, fees)
mind? Was there a specific gap that needed to be filled? Please provide a associated with any of the external resources that might apply to a dataset
description. The contributions of our dataset to the vision community are consumer? Please provide descriptions of all external resources and any
fourfold: (1) We release a dataset of 11M images and 1.1B masks, by far the restrictions associated with them, as well as links or other access points, as
largest segmentation dataset to date. (2) The dataset we release is privacy appropriate. The dataset is self-contained.
protecting: we have blurred faces and license plates in all images. (3) The 10. Does the dataset contain data that might be considered confidential (e.g.,
dataset is licensed under a broad set of terms of use which can be found data that is protected by legal privilege or by doctor-patient confidentiality,
at https://ai.facebook.com/datasets/segment-anything. (4) The data is more data that includes the content of individuals’ non-public communications)?
geographically diverse than its predecessors, and we hope it will bring the If so, please provide a description. No.
community one step closer to creating fairer and more equitable models.
11. Does the dataset contain data that, if viewed directly, might be offensive,
2. Who created the dataset (e.g., which team, research group) and on behalf insulting, threatening, or might otherwise cause anxiety? If so, please de-
of which entity (e.g., company, institution, organization)? The dataset was scribe why. We have two safety measures to prevent objectionable content:
created by the FAIR team of Meta AI. The underlying images were collected (1) Photos are licensed from a photo provider and had to meet the terms of
and licensed from a third party photo company. service of the photo provider. We requested that all objectionable content
3. Who funded the creation of the dataset? If there is an associated grant, be filtered from the images we licensed. (2) If a user observes objectionable
please provide the name of the grantor and the grant name and number. image(s) in the dataset, we invite them to report the image(s) at segment-
Meta AI funded the creation of the dataset. [email protected] for removal. Despite the measures taken, we observe
that a small portion of images contains scenes of protests or other gatherings
4. Any other comments? No. that focus on a diverse spectrum of religious beliefs or political opinions that
may be offensive. We were not able to produce a filtering strategy that re-
Composition moves all such images and rely on users to report this type of content.
1. What do the instances that comprise the dataset represent (e.g., documents, 12. Does the dataset identify any subpopulations (e.g., by age, gender)? If so,
photos, people, countries)? Are there multiple types of instances (e.g., please describe how these subpopulations are identified and provide a de-
movies, users, and ratings; people and interactions between them; nodes scription of their respective distributions within the dataset. The dataset
and edges)? Please provide a description. All of the instances in the dataset does not identify any subpopulations of the people in the photos.
are photos. The photos vary in subject matter; common themes of the photo
include: locations, objects, scenes. All of the photos are distinct, however 13. Is it possible to identify individuals (i.e., one or more natural persons), ei-
there are some sets of photos that were taken of the same subject matter. ther directly or indirectly (i.e., in combination with other data) from the
dataset? If so, please describe how. No. Images were subjected to a face
2. How many instances are there in total (of each type, if appropriate)? There blurring model to remove any personally identifiable information. If a user
are 11 million images. observes any anonymization issue, we invite them to report the issue and
3. Does the dataset contain all possible instances or is it a sample (not nec- the image id(s) at [email protected].
essarily random) of instances from a larger set? If the dataset is a sample, 14. Does the dataset contain data that might be considered sensitive in any way
then what is the larger set? Is the sample representative of the larger set (e.g., data that reveals race or ethnic origins, sexual orientations, religious
(e.g., geographic coverage)? If so, please describe how this representa- beliefs, political opinions or union memberships, or locations; financial or
tiveness was validated/verified. If it is not representative of the larger set, health data; biometric or genetic data; forms of government identification,
please describe why not (e.g., to cover a more diverse range of instances, such as social security numbers; criminal history)? If so, please provide
because instances were withheld or unavailable). The dataset is composed a description. The dataset contains scenes of protests, or other gatherings
of images licensed from a photo provider. The dataset contains all instances that may suggest religious beliefs, political opinions or union memberships.
licensed. The images are photos, i.e. not artwork, although there are a few However, the faces of all people in the dataset have been anonymized via
exceptions. The dataset includes all generated masks for each image in the facial blurring, so it is not possible to identify any person in the dataset.
dataset. We withheld ∼2k randomly selected images for testing purposes.
15. Any other comments? No.
4. What data does each instance consist of? “Raw” data (e.g., unprocessed
text or images) or features? In either case, please provide a description. Collection Process
Each instance in the dataset is an image. The images were processed to blur
faces and license plates to protect the identities of those in the image. 1. How was the data associated with each instance acquired? Was the data
directly observable (e.g., raw text, movie ratings), reported by subjects (e.g.,
5. Is there a label or target associated with each instance? If so, please provide survey responses), or indirectly inferred/derived from other data (e.g., part-
a description. Each image is annotated with masks. There are no categories of-speech tags, model-based guesses for age or language)? If the data was
or text associated with the masks. The average image has ∼100 masks, and reported by subjects or indirectly inferred/derived from other data, was the
there are ∼1.1B masks in total. data validated/verified? If so, please describe how. The released masks
associated with each image were automatically inferred by our segmentation
6. Is any information missing from individual instances? If so, please provide
model, SAM. The masks that were collected using model-assisted manual
a description, explaining why this information is missing (e.g., because it
annotation will not be released. Quality was validated as described in §5.
was unavailable). This does not include intentionally removed information,
but might include, e.g., redacted text. Yes. Each image is accompanied by 2. What mechanisms or procedures were used to collect the data (e.g., hard-
a short caption that describes the content and place of the photo in a free ware apparatuses or sensors, manual human curation, software programs,
form text. Per our agreement with the photo provider we are not allowed to software APIs)? How were these mechanisms or procedures validated? The
release these captions. However, we use them in our paper to analyze the images in the dataset are licensed from an image provider. They are all pho-
geographical distribution of the dataset. tos taken by photographers with different cameras.

25
3. If the dataset is a sample from a larger set, what was the sampling strategy RetinaFace [88, 89] model (https://github.com/serengil/retinaface) to detect
(e.g., deterministic, probabilistic with specific sampling probabilities)? We faces. The model used to blur license plates has not been made public.
withheld ∼2k randomly selected images for testing purposes. The rest of
the licensed images are included in the dataset. Uses
4. Who was involved in the data collection process (e.g., students, crowdwork- 1. Has the dataset been used for any tasks already? If so, please provide a
ers, contractors) and how were they compensated (e.g., how much were description. The dataset was used to train our segmentation model, SAM.
crowdworkers paid)? The released masks were automatically inferred by 2. Is there a repository that links to any or all papers or systems that use the
SAM. For details on our model-assisted manual annotation process see our dataset? If so, please provide a link or other access point. No. However, all
Data Annotation Card in §F.2. Note these masks will not be released. users of the dataset must cite it, so its use is trackable via citation explorers.
5. Over what timeframe was the data collected? Does this timeframe match 3. What (other) tasks could the dataset be used for? We intend the dataset
the creation timeframe of the data associated with the instances (e.g., recent to be a large-scale segmentation dataset. However, we invite the research
crawl of old news articles)? If not, please describe the timeframe in which community to gather additional annotations for the dataset.
the data associated with the instances was created. The licensed photos
vary in their date taken over a wide range of years up to 2022. 4. Is there anything about the composition of the dataset or the way it was
collected and preprocessed/cleaned/labeled that might impact future uses?
6. Were any ethical review processes conducted (e.g., by an institutional re- For example, is there anything that a dataset consumer might need to know
view board)? If so, please provide a description of these review processes, to avoid uses that could result in unfair treatment of individuals or groups
including the outcomes, as well as a link or other access point to any sup- (e.g., stereotyping, quality of service issues) or other risks or harms (e.g.,
porting documentation. If the dataset does not relate to people, you may skip legal risks, financial harms)? If so, please provide a description. Is there
the remaining questions in this section. We underwent an internal privacy anything a dataset consumer could do to mitigate these risks or harms? We
review to evaluate and determine how to mitigate any potential risks with have an analysis of the approximate geographic and income level coverage
respect to the privacy of people in the photos. Blurring faces and license of our dataset in §6. While we believe our dataset to be more representative
plates protects the privacy of the people in the photos. than most of the publicly existing datasets at this time, we acknowledge
7. Did you collect the data from the individuals in question directly, or obtain that we do not have parity across all groups, and we encourage users to be
it via third parties or other sources (e.g., websites)? We licensed the data mindful of potential biases their models have learned using this dataset.
from a third party photo provider. 5. Are there tasks for which the dataset should not be used? If so, please pro-
8. Were the individuals in question notified about the data collection? If so, vide a description. Full terms of use for the dataset including prohibited use
please describe (or show with screenshots or other information) how no- cases can be found at https://ai.facebook.com/datasets/segment-anything.
tice was provided, and provide a link or other access point to, or other- 6. Any other comments? No.
wise reproduce, the exact language of the notification itself. The images
are licensed from a third party who provided appropriate representations Distribution
regarding the collection of any notices and consents as required from indi- 1. Will the dataset be distributed to third parties outside of the entity (e.g.,
viduals. In addition, all identifiable information (e.g. faces, license plates) company, institution, organization) on behalf of which the dataset was cre-
was blurred. Under the terms of the dataset license it is prohibited to attempt ated? If so, please provide a description. The dataset will be available for
to identify or associate an image with a particular individual. the research community.
9. Did the individuals in question consent to the collection and use of their 2. How will the dataset will be distributed (e.g., tarball on website, API,
data? If so, please describe (or show with screenshots or other informa- GitHub)? Does the dataset have a digital object identifier (DOI)? The
tion) how consent was requested and provided, and provide a link or other dataset is available at https://ai.facebook.com/datasets/segment-anything.
access point to, or otherwise reproduce, the exact language to which the
individuals consented. The images are licensed from a third party who pro- 3. When will the dataset be distributed? The dataset will be released in 2023.
vided appropriate representations regarding the collection of any notices and 4. Will the dataset be distributed under a copyright or other intellectual
consents as required from individuals. In addition, all identifiable informa- property (IP) license, and/or under applicable terms of use (ToU)? If
tion (e.g. faces, license plates) was blurred from all images. For avoidance so, please describe this license and/or ToU, and provide a link or other
of doubt, under the terms of the dataset license it is prohibited to attempt to access point to, or otherwise reproduce, any relevant licensing terms
identify or associate an image with a particular individual. or ToU, as well as any fees associated with these restrictions. Yes.
10. If consent was obtained, were the consenting individuals provided with a The license agreement and terms of use for the dataset can be found at
mechanism to revoke their consent in the future or for certain uses? If https://ai.facebook.com/datasets/segment-anything. Users must agree to the
so, please provide a description, as well as a link or other access point terms of use before downloading or using the dataset.
to the mechanism (if appropriate). We invite users to report at segment- 5. Have any third parties imposed IP-based or other restrictions on the data
[email protected] for image(s) removal. associated with the instances? If so, please describe these restrictions, and
11. Has an analysis of the potential impact of the dataset and its use on data provide a link or other access point to, or otherwise reproduce, any relevant
subjects (e.g., a data protection impact analysis) been conducted? If so, licensing terms, as well as any fees associated with these restrictions. Full
please provide a description of this analysis, including the outcomes, as terms of use and restrictions on use of the SA-1B dataset can be found at
well as a link or other access point to any supporting documentation. To https://ai.facebook.com/datasets/segment-anything.
eliminate any potential impact on people whose photos are included in the 6. Do any export controls or other regulatory restrictions apply to the dataset
dataset, identifiable information (faces, license plates) has been blurred. or to individual instances? If so, please describe these restrictions, and pro-
12. Any other comments? No. vide a link or other access point to, or otherwise reproduce, any supporting
documentation. The license and restrictions on use of the SA-1B dataset
Preprocessing / Cleaning / Labeling can be found at https://ai.facebook.com/datasets/segment-anything.
1. Was any preprocessing / cleaning / labeling of the data done (e.g., dis- 7. Any other comments? No.
cretization or bucketing, tokenization, part-of-speech tagging, SIFT fea-
ture extraction, removal of instances, processing of missing values)? If so, Maintenance
please provide a description. If not, you may skip the remaining questions
1. Who will be supporting/hosting/maintaining the dataset? The dataset will
in this section. We resized the high-resolution licensed images such that
be hosted at https://ai.facebook.com/datasets/segment-anything and main-
the shorter side is 1500 pixels and only processed the images to remove any
tained by Meta AI.
identifiable and personal information from the photos (faces, license plates).
2. How can the owner/curator/manager of the dataset be contacted (e.g., email
2. Was the “raw” data saved in addition to the preprocessed/cleaned/labeled
address)? Please email [email protected].
data (e.g., to support unanticipated future uses)? If so, please provide a link
or other access point to the “raw” data. No, as we removed the data for 3. Is there an erratum? If so, please provide a link or other access point. No.
safety reasons and to respect privacy, we do not release the unaltered photos.
4. Will the dataset be updated (e.g., to correct labeling errors, add new in-
3. Is the software that was used to preprocess/clean/label the data avail- stances, delete instances)? If so, please describe how often, by whom, and
able? If so, please provide a link or other access point. We used the how updates will be communicated to dataset consumers (e.g., mailing list,

26
GitHub)? To aid reproducibility of research using SA-1B, the only updates 3. Were sociodemographic characteristics used to select annotators for your
will be to remove reported images. task? If so, please detail the process. No.
5. If the dataset relates to people, are there applicable limits on the retention of 4. If you have any aggregated socio-demographic statistics about your anno-
the data associated with the instances (e.g., were the individuals in question tator pool, please describe. Do you have reason to believe that sociode-
told that their data would be retained for a fixed period of time and then mographic characteristics of annotators may have impacted how they an-
deleted)? If so, please describe these limits and explain how they will be notated the data? Why or why not? We worked with 130 annotators. The
enforced. There are no limits on data retention. We took measures to remove annotators were all based in Kenya. We do not believe sociodemographic
personally identifiable information from any images of people. Users may characteristics of annotators meaningfully impacted the annotated data.
report content for potential removal here: [email protected]. 5. Consider the intended context of use of the dataset and the individuals
6. Will older versions of the dataset continue to be sup- and communities that may be impacted by a model trained on this dataset.
ported/hosted/maintained? If so, please describe how. If not, please Are these communities represented in your annotator pool? The Segment
describe how its obsolescence will be communicated to dataset consumers. Anything 1B (SA-1B) dataset is to be used for research purposes only.
No, as the only updates will be to remove potentially harmful content, we The SA-1B dataset is one of the most geographically diverse segmentation
will not keep older versions with the content. dataset, as discussed in §6. In addition, we analyze the responsible AI axes
of a model trained on the dataset in §6.
7. If others want to extend/augment/build on/contribute to the dataset, is there
a mechanism for them to do so? If so, please provide a description. Will Platform and Infrastructure Choices
these contributions be validated/verified? If so, please describe how. If not, 1. What annotation platform did you utilize? At a high level, what considera-
why not? Is there a process for communicating/distributing these contributions informed your decision to choose this platform? Did the chosen plat-
tions to dataset consumers? If so, please provide a description. We encour- form sufficiently meet the requirements you outlined for annotator pools?
age users to gather further annotations for SA-1B. Any users who generate Are any aspects not covered? We used a proprietary annotation platform.
annotations will be liable for hosting and distributing their annotations.
2. What, if any, communication channels did your chosen platform offer to
8. Any other comments? No. facilitate communication with annotators? How did this channel of com-
munication influence the annotation process and/or resulting annotations?
F.2. Data Annotation Card We manually reviewed annotations and shared feedback with the annotators
on a weekly basis. We communicated common mistakes or inconsisten-
Task Formulation cies and the corresponding corrections. In addition, the annotators were
given feedback for improvements daily by the annotation QA team. Out-
1. At a high level, what are the subjective aspects of your task? Segmenting side the weekly feedback sessions, annotators had access to a spreadsheet
objects present in an image is inherently a subjective task. For instance, and chat group to facilitate communication with the research team. This
one annotator may segment two boots as one mask, whereas another may process greatly improved the average speed and quality of the annotations.
segment each boot separately. Depending on annotators’s skills, the quality
of the mask and the number of masks per image are different between an- 3. How much were annotators compensated? Did you consider any partic-
notators. Despite these subjective aspects of the task, we believed efficient ular pay standards, when determining their compensation? If so, please
annotation was possible as the data was annotated in a per-mask fashion describe. Annotators were compensated with an hourly wage set by the
with the main focus on the diversity of the data rather than completeness. vendor. The vendor is a Certified B Corporation.

2. What assumptions do you make about annotators? Our annotators worked Dataset Analysis and Evaluation
full time on our annotation task with very small attrition rate. This made 1. How do you define the quality of annotations in your context, and how did
it possible to train the annotators providing feedback and answering their you assess the quality in the dataset you constructed? Annotators were first
questions on a regular basis. Specifically: (1) By giving a clear understand- placed into training. They followed a 1-day training session led by the ven-
ing of the goals of this work and providing clear guidelines, including vi- dor and then were asked to annotate a large number of examples from a
suals and video recordings of the tasks, annotators had enough context to training queue. Annotators graduated from training to production after the
understand and perform the tasks reasonably. (2) Sharing objectives and vendor QA team, in collaboration with the research team, manually spot-
key results and meeting weekly with annotators increased the likelihood checked the annotator’s masks to ensure quality. On average, annotators
that annotators improved annotation quality and quantity over time. spent one week in training before graduating. Production quality assess-
3. How did you choose the specific wording of your task instructions? What ment followed a similar process: the vendor QA team and the research team
steps, if any, were taken to verify the clarity of task instructions and wording manually reviewed the annotations weekly, sharing feedback weekly.
for annotators? As our task was annotating images, the annotation guide- 2. Have you conducted any analysis on disagreement patterns? If so, what
lines included visual examples. Our research team completed 30 annotation analyses did you use and what were the major findings? Did you analyze
tasks to identify any obvious challenges using the annotation tool, collec- potential sources of disagreement? We pointed out common mistakes dur-
tively decide how to handle complex cases, and refine the guidelines. The ing weekly meetings with the annotators.
research team met with the annotators weekly for feedback sessions. Videos
3. How do the individual annotator responses relate to the final labels released
of the research team performing the task were shared live with the annota-
in the dataset? The annotations were only used to train early versions of the
tors, followed by Q&A sessions. Annotators were able to give feedback on
SAM model and we do not currently plan to release them.
unclear aspects, both during the feedback session and asynchronously.
4. What, if any, risks did your task pose for annotators and were they informed Dataset Release and Maintenance
of the risks prior to engagement with the task? No identified risks. Images 1. Do you have reason to believe the annotations in this dataset may change
were filtered for objectionable content prior to the annotation phase. over time? Do you plan to update your dataset? No, except to remove
objectionable images.
5. What are the precise instructions that were provided to annotators? We
provide only high-level instructions: Given an image, we aim at segment- 2. Are there any conditions or definitions that, if changed, could impact the
ing every possible object. Annotators generate a mask for every potential utility of your dataset? We do not believe so.
object they can identify. An object can be segmented using our interactive 3. Will you attempt to track, impose limitations on, or otherwise influence how
segmentation tool either by using corrective foreground/background clicks your dataset is used? If so, how? The SA-1B dataset will be released under
to add/remove parts of the mask or by drawing a bounding box around the a license agreement allowing use for certain research purposes and protec-
object. Masks can be refined using pixel-precise tools. tions for researchers. Researchers must agree to the terms of the license
Selecting Annotations agreement to access the dataset.

1. Are there certain perspectives that should be privileged? If so, how did you 4. Were annotators informed about how the data is externalized? If changes to
seek these perspectives out? We chose to work with annotators that have the dataset are made, will they be informed? No, we do not plan to release
worked on other vision annotation tasks before. the manual annotations at the moment.
5. Is there a process by which annotators can later choose to withdraw their
2. Are there certain perspectives that would be harmful to include? If so, how
data from the dataset? If so, please detail. No.
did you screen these perspectives out? No.

27
Model Overview
Name SAM or Segment Anything Model
Version 1.0
Date 2023
Organization The FAIR team of Meta AI
Mode type Promptable segmentation model
Architecture See §3
Repository https://github.com/facebookresearch/segment-anything
Citation https://research.facebook.com/publications/segment-anything
License Apache 2.0

Intended Use
Primary intended uses SAM is intended to be used for any prompt-based segmentation task. We explored its use in segmenting objects
from a point (§7.1), edge detection (§7.2), segmenting all objects (§7.3), and segmenting detected objects (§7.4).
We explored how SAM can integrate with other vision models to segment objects from text (§7.5).
Primary intended users SAM was primarily developed for research. The license for SAM can be found at
https://github.com/facebookresearch/segment-anything.
Out-of-scope use cases See terms of use for SAM found at https://github.com/facebookresearch/segment-anything. See Use Cases under
Ethical Considerations.
Caveats and recommendations SAM has impressive zero-shot performance across a wide range of tasks. We note, however, that in the zero-shot
setting there may be multiple valid ground truth masks for a given input. We recommend users take this into
consideration when using SAM for zero-shot segmentation. SAM can miss fine structures and can hallucinate
small disconnected components. See §8 for a discussion of limitations.

Relevant Factors
Groups SAM was designed to segment any object. This includes stuff and things.
Instrumentation and environment We benchmarked SAM on a diverse set of datasets and found that SAM can handle a variety of visual data including
simulations, paintings, underwater images, microscopy images, driving data, stereo images, fish-eye images. See
§D.1 and Table 7 for information on the benchmarks used.

Metrics
Model performance measures We evaluated SAM on a variety of metrics based on the downstream task in our experiments.

• mIoU: We used the mean intersection-over-union after a given number of prompts to evaluate the segmen-
tation quality of a mask when prompted with points.
• Human evaluation: We performed a human study (detailed in §E) to evaluate the real world performance
of SAM. We compared the masks generated by SAM to a baseline state-of-the-art interactive segmentation
model, RITM [92], using a perceptual quality scale from 1 to 10.
• AP: We used average precision to evaluate instance segmentation for a given box and edge detection.
• AR@1000: We used average recall to evaluate object proposal generation.
• ODS, OIS, AP, R50: We used the standard edge detection evaluation metrics from BSDS500 [72, 3].

Evaluation Data
Data sources See §D.1.

Training Data
Data source See Data Card in §F.1.

Ethical Considerations
Data We trained SAM on licensed images. The images were filtered for objectionable content by the provider, but we
acknowledge the possibility of false negatives. We performed a geographic analysis of the SA-1B dataset in §6.
While SA-1B is more geographically diverse than many of its predecessors, we acknowledge that some geographic
regions and economic groups are underrepresented.
Cost and impact of compute SAM was trained on 256 A100 GPUS for 68 hours. We acknowledge the environmental impact and cost of training
large scale models. The environmental impact of training the released SAM model is approximately 6963 kWh
resulting in an estimated 2.8 metric tons of carbon dioxide given the specific data center used, using the calculation
described in [77] and the ML CO2 Impact calculator [61]. This is equivalent to ∼7k miles driven by the average
gasoline-powered passenger vehicle in the US [101]. We released the SAM models to both reduce the need for
retraining and lower the barrier to entry for large scale vision research.
Risks and harms We evaluated SAM for fairness in §6. Downstream use cases of SAM will create their own potential for biases
and fairness concerns. As such we recommend users run their own fairness evaluation when using SAM for their
specific use case.
Use cases We implore users to use their best judgement for downstream use of the model.

Table 9: Model Card for SAM, following the procedure detailed in [75].

28
We have several models that, when provided with a click or a box as input, output a mask. We would
like to compare the quality of these models by rating the quality of their masks on many examples.
The interface will be different than for regular mask annotation.
• Each job reviews one mask in one image.
• On the right, there will be five image thumbnails in two rows. Each thumbnail can be moused-
over to show the image at a larger size. Clicking on the thumbnail will make it full screen, and
clicking again will return to the original screen.
– The images show the same mask in five different views. On the top row: (left) the image
without the mask, (middle) the mask overlaid on the image, and (right) the mask alone. On
the bottom row: (left) a zoomed in view of the object without a mask, and (right) a zoomed
in view of the mask overlaid on the image. These views are provided to make it easy to see
different types of mask errors.
– The mask will be in red when overlaid on the image.
– When shown by itself, the mask is yellow, and the background is purple.
– Each image will include either a blue dot or a blue and white box. This is the input to the
model, as if you had clicked at this location or drawn this box.
• On the left, there are buttons labeled 1-10. This is used to rate the quality of the shown mask.

Objective and Setup Example interface page. There will be five images on the Mouse over an image to show the full image. Click on an image to make it full screen. The arrows will cy-
right and a question box on the left. cle between images. Click again to return to previous view.

The first image on the top row shows the image without a The second image on the top row shows the mask for the The third image on the top row shows the mask only. The The first image on the bottom row shows a zoomed in view
mask. A blue point will be on the object of interest, or a object in red. mask is in yellow and the background is purple. of the object without a mask.
blue and white box will surround it.

Does the mask correspond to an actual object?

• Valid objects can include:
What we would like you to do for each job:
– Entire single objects (such as a person, shirt, or tree)
• Please aim to spend up to 30 seconds per job. – Logical parts of objects (a chair leg, a car door, a tabletop)
• Mouse-over or click each of the three images of the mask on the right to get a sense of the – Collections of objects (a stack of books, a crowd of people)
quality of the mask. The thumbnail is too small to judge a mask, do not judge a mask by the – ‘Stuff’ (the ground, the sky).
thumbnail alone. Each image can provide a different signal on possible mask errors:
• Example errors a mask may have. The severity of these errors may be minor or major:
– The unzoomed image can give context for the mask: does this mask correspond to an actual
– Include a piece of another object (the mask of a person including the arm of a nearby
object?
person)
– The mask-only image can show if the mask has small holes or separated, incorrect pixels.
– Miss part of an object (the mask covers only one part of a building obscured by a tree in
– The zoomed image can show if the mask boundaries make sense.
the foreground),
• Judge the quality of the mask on three criterion. Examples will follow. – Combine two unrelated things (a single mask covers both a mug and a pen on a desk)
– Include an arbitrary part of a collection for a point input (a point is on one apple, but
– Does the mask correspond to an actual object?
the mask covers three apples in a pile of many apples). If a box surrounds an arbitrary
– Does the mask have a good boundary?
collection, it is not an error to provide a mask for these objects.
– Does the mask correspond to the provided point or box?
• If you are unsure, a good rule-of-thumb is: can you name the object in question? However,
• Rate the quality of the mask on a scale of 1-10 using the drop-down box on the left.
some things that are hard to name may still be good objects (an unusual component of a
• Next are details and examples for judging mask quality according to the three criterion. These machine, something at the edge of the image for which it is hard to determine what it is).
are just examples and other cases may come up, please use your best judgment when deter-
mining if something is a good mask.

The second image on the bottom row shows a zoomed in On the left are buttons to rate the mask quality, with selec- Task Judging Mask Quality (1 of 3)
view of the object with a mask. The mask is in red. tions 1-10.

Does the mask correspond to the provided point or box?

• For points:
– The point needs to be on the mask.
Does the mask have a good boundary? – The size or position of the object with respect to the point does not matter (a point on
someone’s gloved hand can correspond to the glove or to the entire person, both are valid
• Errors in the boundary can include:
masks).
– Incorrect holes in the mask
• For boxes:
– Incorrect pixels included separated from the main part of the mask
– Poor edge quality, where the mask does not exactly match the edge of the object. – The object needs to be the best object that is the size of the box (if a box is around some-
– Failure to consistently handle obscuring foreground objects (a mask that covers obscuring one’s entire head but the mask is of their hair, this is an error: their hair is in the box but is
objects is fine, and a mask that doesn’t cover obscuring objects is fine, but one that does not the correct object).
some of both has an error) – If the box clearly corresponds to a given object but is slightly smaller than it, it is okay if
– Pixelation of a small mask is not an error, as long as the mask still matches the edges of the mask goes slightly outside a box (if a box around a person misses their extended hand,
the object. the mask can still include their hand even if the mask goes outside the box).

Judging Mask Quality (2 of 3) Judging Mask Quality (3 of 3) Example error of ‘Include a piece of another object’: The Example error of ‘Missing a part of an object’: the mask is
elephant mask contains a piece of another nearby elephant. missing a disconnected part of the object: the back half of
the zebra, and the right portion of the plate.

Example error of ‘Include an arbitrary part of a collection’: Example error for ‘Incorrect holes in the mask’: This mask Example error for ‘Incorrect pixels included separated from Example error for ‘Poor edge quality’: The mask has poor
In top top image, the point is on one orange rind, but the has holes in the upper left and on the left sides (black ar- the main part of the mask’: The ‘mask only’ view reveals a edge quality, both along the edge of the umbrella, as well as
mask covers two orange rinds. This is a mask error: the rows). These holes are much easier to see on the ‘mask few stray incorrect pixels on the clock face. along the thin pole.
mask covers an arbitrary number of objects in the collection, only’ image.
and should either cover one orange rind or all of them. In
the bottom image, the box is around both vegetables. Since
this is the best match to the box, this is not a mask error.

Figure 19: Here we provide the complete guidelines given to annotations for the human review of mask quality. Some images
been edited slightly and faces have been blurred to enable release. Best viewed with zoom (part 1 of 2).

G. Annotation Guidelines
We provide the complete guidelines given to annotations
for the human review of mask quality in Fig. 19 and Fig. 20.

29
Example for ‘Combine two unrelated things’: The point in- Example error for ‘Failure to consistently handle obscuring Example of ‘Pixelation of a small mask’: this mask has an Example error for consistency with the provided point: The
dicates the lizard, but the mask covers both the lizard and a foreground objects’: The pole on the right (blue arrow) is imperfect boundary, since it extends beyond the object at mask does not agree with the blue point, so this is a mask
bird. This is a mask error. excluded from the mask, while the pole on the left is in- the black arrow. However, the ‘blocky’ pattern of the mask error.
cluded in the object (black arrow). The mask should either is not an error, since, when zoomed in this much, the image
include or exclude both of these. is also blocky the same way.

Overall mask quality is subjective, each of the above errors may hurt mask quality only a little or a
lot, depending on how large the error is. Please use your best judgment when choosing mask scores,
and try to stay consistent from mask-to-mask. Here are some general guidelines for what different
scores should correspond to:
• A score of 1: It is not possible to tell what object this mask corresponds to. This includes the
case that there is no mask visible at all.
• A low score (2-4): The object is mostly identifiable, but the mask quality is extremely poor
(e.g. large regions of the mask cover other objects; large regions of the object missing; ex-
tremely splotchy mask boundaries that cut through the middle of the object).
• A mid score (5-6): The object is identifiable and the boundary is mostly correct, but there
are major errors (missing a significant disconnected part of the object; containing a significant
part of another object; very poor boundary quality in one area of the object but not the entire
object).
• A high score (7-9): The object is identifiable and errors are small and rare (missing a small,
heavily obscured disconnected component, having small regions where the mask boundary
does not quite match the object boundary).
• A score of 10: The mask is pixel-perfect; it has no identifiable errors at all.

Example for consistency with the provided point: For this Example for consistency with a box: The box surrounds the Example for consistency with a box: The box’s shape fits Mask Scoring
input point, but the logo (left) and the container (right) are bowl of oranges, but the mask is only of a single orange. the zebra. Even though the mask extends slightly outside
valid objects, since the blue point lies on both of them. Nei- This is a mask error. the box to include the zebra’s left leg, this is not an error.
ther mask has a mask error.

Example of a mask with a score of 1: It is not clear what Example of a mask with a low score (2-4): The main ob- Example of a mask with a low score (2-4): The main ob- Example of a mask with a low-to-medium score (4-5): The
object this mask corresponds to. ject is identifiable, but the mask includes a large, incorrect ject is identifiable, but a large, random part of the object is object is identifiable and the edges are all correct, but the
portion of another object. missing. mask incorrectly includes the hand of the person on the left.

Example of a mask with a medium score (5-6): The mask Example of a mask with a medium score (5-6): the object Example of a mask with a medium-to-high score (6-8): the Example of a mask with a medium-to-high score (6-8): The
clearly corresponds to the plate, but the boundary with the is easy to identify, and most of the edges make sense. How- mask has two small-ish regions of poor boundary, at the top wreath is a valid object that is the size of the box (the entire
waffle is quite poor. ever, there is a significant disconnected part (their arm inside of the mask and on the bottom right. wreath + clock would also be a valid object). However, there
the frame) that is mostly missing, as well as splotchy pixels are incorrect stray mask pixels on the clock.
in this region.

Example of a mask with a high score (7-9): The boundary of Example of a mask with a very high score (∼9): There are Example of a mask with a very high score (9-10): the mask Example of a mask with a very high score (9-10): There are
the horse is almost entirely correct, except for the right side only minor errors around the edge of the mask. The blocky has only very minor errors in the edge on the bottom right. only minor errors around the edge of the mask.
of its back leg. The mask consistently includes all of the ‘pixelation’ is not an error, since the image is also blocky at
equipment that horse is wearing, and has logical boundaries. this scale.

Figure 20: Here we provide the complete guidelines given to annotations for the human review of mask quality. Some images
been edited slightly and faces have been blurred to enable release. Best viewed with zoom (part 2 of 2).