Diffuse, Attend, and Segment Unsupervised Zero-Shot Segmentation Using

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version;

the final published version of the proceedings is available on IEEE Xplore.
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using

Stable Diffusion
Junjiao Tian1 Lavisha Aggarwal2 Andrea Colaco2

[email protected] [email protected] [email protected]
Zsolt Kira1 Mar Gonzalez-Franco2

[email protected] [email protected]
Abstract
Stable
Diffusion
Producing quality segmentation masks for images is a
Model
fundamental problem in computer vision. Recent research Attention Aggregation
has explored large-scale supervised training to enable zero-
shot transfer segmentation on virtually any image style and
Iterative Attention
unsupervised training to enable segmentation without dense Merging × #
annotations. However, constructing a model capable of seg-
menting anything in a zero-shot manner without any anno-
tations is still challenging. In this paper, we propose to uti- Non-Maximum
$×$ Suppression
lize the self-attention layers in stable diffusion models to Sampling Grid
achieve this goal because the pre-trained stable diffusion
model has learned inherent concepts of objects within its
attention layers. Specifically, we introduce a simple yet ef-
fective iterative merging process based on measuring KL Figure 1. Overview of DiffSeg. DiffSeg is an unsupervised and
divergence among attention maps to merge them into valid zero-shot segmentation algorithm using a pre-trained stable diffu-
segmentation masks. The proposed method does not re- sion model. Starting from M ⇥ M anchor points, DiffSeg itera-
tively merges self-attention maps from the diffusion model for N
quire any training or language dependency to extract qual-
iterations to segment any image without any prior knowledge and
ity segmentation for any images. On COCO-Stuff-27, our
external information.
method surpasses the prior unsupervised zero-shot trans-
fer SOTA method by an absolute 26% in pixel accuracy
and 17% in mean IoU. The project page is at https:
transfer segmentation2 for any images with unknown cat-
//sites.google.com/view/diffseg/home. 1
egories is much more challenging. A recent popular work
SAM [20], trains a neural network on 1.1B segmentation
annotations and achieves impressive zero-shot transfer to
1. Introduction any images. This is an important step towards making seg-
mentation a more flexible foundation task for many other
Semantic segmentation divides an image into regions of en- tasks, not just limited to a given dataset with limited cate-
tities sharing the same semantics. It is an important founda- gories. However, the cost of collecting per-pixel labels is
tional application for many downstream tasks such as im- high. Therefore, it is of high research and production inter-
age editing [26], medical imaging [27], and autonomous est to investigate unsupervised and zero-shot transfer seg-
driving [14] etc. While supervised semantic segmentation, mentation methods under the least restrictive settings, i.e.,
where a target dataset is available and the categories are
known, has been widely studied [10, 38, 40], zero-shot 2 Zero-shot transfer segmentation means that the underlying model is
not trained for segmentation, similar to CLIP which is not trained for clas-
1 Georgia Institute of Technology1 Google2 sification but can be used for zero-shot classification [32].
3554
no form of annotations and no prior knowledge of the target. methods [11, 16, 24], DiffSeg does not require specifying
Few works have taken up the challenge due to the com- the number of clusters beforehand and is also determinis-
bined difficulty of the unsupervised and zero-shot require- tic. Given an image, without any prior knowledge, Diff-
ments. For example, most works in unsupervised segmenta- Seg can produce a quality segmentation without resorting
tion require access to the target data for unsupervised adap- to any additional resources (e.g. as SAM requires). We
tion [11, 15, 16, 19, 24, 37, 42]. Therefore, these methods benchmark DiffSeg on the popular unsupervised segmenta-
cannot segment images that are not seen during the adap- tion dataset COCO-Stuff-27 and a specialized self-driving
tation. Recently, ReCo [36] proposed a retrieve-and-co- dataset Cityscapes. DiffSeg outperforms prior works on
segment strategy. It removes the requirement for supervi- both datasets despite requiring less auxiliary information
sion because it uses an unsupervised vision backbone, e.g., during inference. In summary,
DINO [9], and is zero-shot since it does not need training • We propose an unsupervised and zero-shot segmenta-
on the target distribution of images. Still, to satisfy these tion method, DiffSeg, using a pre-trained stable diffusion
requirements, ReCo needs to 1) identify the concept before- model. DiffSeg can segment images in the wild without
hand and 2) maintain a large pool of unlabelled images for any prior knowledge or additional resources.
retrieval. This is still far from the capability of SAM [20], • DiffSeg sets state-of-the-art performance on two segmen-
which does not require knowledge of the target image and tation benchmarks. On COCO-Stuff-27, our method sur-
any auxiliary information. passes a prior unsupervised zero-shot SOTA method by
To move towards the goal, we propose to utilize the an absolute 26% in pixel accuracy and 17% in mean IoU.
power of a stable diffusion (SD) model [33] to construct
a general segmentation model. Recently, stable diffusion 2. Related Works
models have been used to generate prompt-conditioned Diffusion Models. Our work builds on pre-trained sta-
high-resolution images [33]. It is reasonable to hypothesize ble diffusion models [18, 33, 34]. Many existing works
that there exists information on object groupings in a dif- have studied the discriminative visual features in them and
fusion model. For example, DiffuMask [39] discovers that used them for zero-shot classification [22], supervised seg-
the cross-attention layer contains explicit object-level pixel mentation [4], label-efficient segmentation [6], semantic-
groupings when cross-referenced between the produced at- correspondence matching [17, 43] and open-vocabulary
tention maps and the input prompt. However, DiffuMask segmentation [41]. They rely on the high-dimensional vi-
can only produce segmentation masks for a generated image sual features learned in stable diffusion models to perform
with a corresponding text prompt and is limited to domi- those tasks requiring additional training to take full advan-
nant foreground objects explicitly referred to by the prompt. tage of those features. Instead, we show that object group-
Motivated by this, we investigate the unconditioned self- ing is also an emergent property of the self-attention layer in
attention layer in a diffusion model. We observed that the the Transformer module manifested in 4-dimensional spa-
attention tensors from the self-attention layers contain spe- tial attention tensors with particular spatial patterns.
cific spatial relationships: Intra-Attention Similarity and Unsupervised Segmentation. Our work is closely re-
Inter-Attention Similarity (Sec. 3.1). Relying on these lated to unsupervised segmentation [11, 15, 16, 19, 24, 37,
two properties, it is possible to uncover segmentation masks 42]. This line of works aims to generate dense segmen-
for any objects in an image. tation masks without using any annotations. However, they
To this end, we propose DiffSeg (Sec. 3.2), a simple yet generally require unsupervised training on the target dataset
effective post-processing method, for producing segmenta- to achieve good performance. We characterize the most
tion masks by using the attention tensors generated by the recent works into two main categories: clustering based
self-attention layers in a diffusion model solely. The al- on invariance [11, 19] and, more recently, clustering us-
gorithm consists of three main components: attention ag- ing pre-trained models [16, 24, 30, 39, 42]. Our work falls
gregation, iterative attention merging, and non-maximum in the second category. To produce segmentation masks,
suppression, as shown in Fig. 1. Specifically, DiffSeg ag- these new works utilize the discriminative features learned
gregates the 4D attention tensors in a spatially consistent in those pre-trained models, i.e., features from the same se-
manner to preserve visual information across multiple res- mantic class are generally more similar. DiffuMasks [39]
olutions and uses an iterative merging process by first sam- uses the cross-modal grounding between the text prompt
pling a grid of anchor points. The sampled anchors pro- and the image in the cross-attention layer of a diffusion
vide starting points for merging attention masks, where model to segment the most prominent object, referred to
anchors in the same object are absorbed eventually. The by the text prompt in the synthetic image. However, Diffu-
merging process is controlled by measuring the similarity Masks can only be applied to a generated image. In con-
between two attention maps using KL divergence. Un- trast, our method applies to real images without prompts.
like, popular clustering-based unsupervised segmentation Zero-Shot Transfer Segmentation Another related
3555
field is zero-shot segmentation [7, 15, 20, 29, 36, 39, 45]. cross-attention layer, where salient objects are given more
Works in this line of work possess the capability of “seg- attention when referred by its corresponding text input,
menting anything” without any training. Most recently, we hypothesize that the unconditional self-attention also
SAM [20], trained on 1.1B segmentation masks, demon- contains inherent object grouping information and can be
strated impressive zero-shot transfer to any images. Some used to produce segmentation masks without text inputs.
recent works utilize auxiliary information such as additional Specifically, for each spatial location (I, J) in the atten-
images [15, 36] or text inputs [29, 45] to facilitate zero-shot tion tensor, the corresponding 2D attention map Ak [I, J, :, :
transfer segmentation. In contrast, our method uses a diffu- ] 2 Rhk ⇥wk 4 captures the semantic correlation between all
sion model to generate segmentation without synthesizing locations and the location (I, J). Each location (I, J) cor-
and querying multiple images and, more importantly, with- responds to a region in the original image pixel space, the
out knowing the objects in the image. size of which depends on the receptive field of the tensor.
To illustrate this, we visualize the self-attention tensors
3. Method A7 and A8 in Fig. 2. They have a dimension of A7 2 R8
4
4
DiffSeg utilizes a pre-trained stable diffusion model [33] and A8 2 R16 5 respectively. Two important observations
and specifically its self-attention layers to produce high- will motivate our method in Sec. 3.2.
quality segmentation masks. We will briefly review the ar- • Intra-Attention Similarity: Within a 2D attention map
chitecture of the stable diffusion model in Sec. 3.1 and in- Ak [I, J, :, :], locations tend to have strong responses if
troduce DiffSeg in Sec. 3.2. they correspond to the same object group as (I, J) in the
original image space, e.g., Ak [I, J, I + 1, J + 1] is likely
3.1. Stable Diffusion Model Review to be a large value.
The Stable diffusion model [33] is a popular variant of the • Inter-Attention Similarity: Between two 2D attention
diffusion model family [18, 34], a generative model. Dif- maps, e.g., Ak [I, J, :, :] and Ak [I + 1, J + 1, :, :], they
fusion models have a forward and a reverse pass. In the tend to share similar activations if (I, J) and (I + 1, J +
forward pass, at every time step, a small amount of Gaus- 1) belong to the same object group in the original image
sian noise is iteratively added until an image becomes an space.
isotropic Gaussian noised image. In the reverse pass, the The second observation is a direct consequence of the
diffusion model is trained to iteratively remove the Gaus- first one. In general, we found that attention maps tend to
sian noise to recover the original clean image. The stable focus on object groups, i.e., groups of individual objects
diffusion model [33] introduces an encoder-decoder and U- sharing similar visual features (see segmentation examples
Net design with attention layers [18]. It first compress an in Fig. 6). In this example (Fig. 2), the two people are
image x 2 RH⇥W ⇥3 into a latent space with smaller spa- grouped as a single object group. The 8 ⇥ 8 resolution map
tial dimension z 2 Rh⇥w⇥c using an encoder z = E(x), from a location inside the group highlights much of the en-
which can be de-compressed through a decoder x̃ = D(z). tire group. In contrast, the 16 ⇥ 16 resolution map from a
All diffusion processes happen in the latent space through location inside the object highlights a smaller portion.
a U-Net architecture. The U-Net architecture will be the Theoretically, the resolution of the attention map dictates
focus of this paper’s investigation. the size of its receptive field w.r.t the original image. A
The U-Net architecture [33] consists of a stack of smaller resolution leads to a larger respective field, corre-
modular blocks (a schematic is provided in Appendix 9.1). sponding to a larger portion of the original image. Prac-
Among those blocks, there are 16 specific blocks composed tically, lower resolution maps, e.g., 8 ⇥ 8, provide a bet-
of two major components: a ResNet layer and a Trans- ter grouping of large objects as a whole, and larger resolu-
former layer. The Transformer layer uses two attention tion maps, e.g., 16 ⇥ 16, give a more fine-grained group-
mechanisms: self-attention to learn the global attention ing of components in larger objects and potentially are bet-
across the image and cross-attention to learn attention ter for identifying small objects. The current stable diffu-
between the image and optional text input. The component sion model has attention maps in 4 resolutions (hk , wk ) 2
of interest for our investigation is the self-attention layer in {8⇥8, 16⇥16, 32⇥32, 64⇥64}. Building on these observa-
the Transformer layer. Specifically, there are a total of 16 tions, in the next section, we will propose a simple heuristic
self-attention layers distributed in the 16 composite blocks, to aggregate weights from different resolutions and intro-
giving 16 self-attention tensors: duce an iterative method to merge all attention maps into a
A 2 {Ak 2 Rhk ⇥wk ⇥hk ⇥wk |k = 1, ..., 16}. (1)
Every attention layer has 8 multi-head outputs. We average the attention
Each attention tensor Ak is 4-dimensional3 . Inspired by tensors along the multi-head axis to reduce to 4 dimensions because they
are very
P similar across this dimension. Please see Appendix 9.2 for details.
DiffuMask [39], which demonstrates object grouping in the 4 Ak [I, J, :, :] = 1 is a valid probability distribution.
3 There 5 R164 denotes R16⇥16⇥16⇥16 .
is a fifth dimension due to the multi-head attention mechanism.
3556
Input 8×8×8×8 8×8 16×16×16×16 16×16
Figure 2. Visualization of Segmentation Masks and Self-Attentions Tensors. Left: Overlay of segmentation and the original image. Right:
Attention maps from a stable diffusion model have two properties: Intra-Attention Similarity and Inter-Attention Similarity. Maps of
different resolutions have varying receptive fields w.r.t the original image.
valid segmentation. linear interpolation) the last 2 dimensions of all attention

A Note on Extracting Attention Maps. In our experi- maps to 64 ⇥ 64, their highest resolution. Formally, for
ments, we use the stable diffusion pre-trained models from Ak 2 Rhk ⇥wk ,⇥hk ⇥wk ,
Huggingface. These prompt-conditioned diffusion models
Ãk = Bilinear-upsample(Ak ) 2 Rhk ⇥wk ⇥64⇥64 . (2)
typically run for 50 diffusion steps to generate new im-
ages. However, in our case, we want to efficiently extract On the other hand, the first 2 dimensions indicate the
attention maps for an existing clean image without condi- locations to which attention maps are referenced. There-
tional prompts. We achieve this by 1) using only the un- fore, we need to aggregate attention maps accordingly. For
conditioned latent and 2) only running the diffusion process example, as shown in Fig. 3a, the attention map in the
4
once. The unconditional latent is calculated using an uncon- (0, 0) location in Ak 2 R8 is first upsampled and then
ditioned text embedding. Since we only do one pass through repeatedly aggregated pixel-wise with the 4 attention maps
the diffusion model, we also set the time-step variable t to
4
(0, 0), (0, 1), (1, 0), (1, 1) in Az 2 R16 . Formally, the final
a large value, e.g., t = 300, in our experiments such that aggregated attention tensor Af 2 R64 is,
4
the real images are viewed as primarily denoised generated X

images from the perspective of the diffusion model. Af [I, J, :, :] = Ãk [I/ k , J/ k , :, :] ⇤ Rk , (3)
k2{1,...,16}
3.2. DiffSeg X
where k = 64/wk , Rk = 1.
Since the self-attention layers capture inherent object k
grouping information in spatial attention (probability) where / denotes floor division here. Furthermore, to ensure
maps, we propose DiffSeg, a simple post-processing that the aggregated attention map is also a valid distribu-
P
method, to aggregate and merge attention tensors into tion, i.e., Af [I, J, :, :] = 1, we normalize Af [I, J, :, :]
a valid segmentation mask. The pipeline consists of after aggregation. Rk is the aggregation importance ratio
three components: attention aggregation, iterative atten- for each attention map. We assign every attention map of
tion merging, and non-maximum suppression, as shown in different resolutions a weight proportional to its resolution
Fig. 1. We will introduce each element in detail next. Rk / wk . We call this the proportional aggregation scheme
Attention Aggregation. Given an input image passing (Propto.). The weights R are important hyper-parameters,
through the encoder and U-Net, a stable diffusion model and we conduct a detailed study of them in Appendix 9.6.
generates 16 attention tensors. Specifically, there are 5 Iterative Attention Merging. In the previous step, the
tensors for each of the dimensions: 64 ⇥ 64 ⇥ 64 ⇥ 64,
4
algorithm computes an attention tensor Af 2 R64 . In this
32 ⇥ 32 ⇥ 32 ⇥ 32, 16 ⇥ 16 ⇥ 16 ⇥ 16, 8 ⇥ 8 ⇥ 8 ⇥ 8; section, the goal is to merge the 64 ⇥ 64 attention maps in
and 1 tensor for the dimension 8 ⇥ 8 ⇥ 8 ⇥ 8. The goal the tensor Af to a stack of object proposals where each pro-
is to aggregate attention tensors of different resolutions into posal likely contains the activation of a single object or stuff
the highest resolution tensor. To achieve this, we need to category. The naive solution is to run a K-means algorithm
carefully treat the 4 dimensions differently. on Af to find clusters of objects following existing works
As discussed in the previous section, the 2D map in unsupervised segmentation [11, 16, 24]. However, the
Ak [I, J, :, :] corresponds to the correlation between all spa- K-means algorithm requires the specification of the number
tial locations and the location (I, J). Therefore, the last 2 of clusters. This is not an intuitive hyper-parameter to tune
dimensions in the attention tensors are spatially consistent because we aim to segment any image in the wild. More-
despite different resolutions. Therefore, we upsample (bi- over, the K-means algorithm is stochastic depending on the
3557
8×8 16×16
Upsample &
Duplicate argmax
16×16×16×16 Object proposals ℒ̅! : #! × 512 × 512 Prediction ) : 512 × 512
8×8×8×8
(b) Example of Non-Maximum Suppression (NMS). NMS
(a) Example of Attention Aggregation. An attention map from a lower resolution is first looks up the maximum activation across the Lp proposals
upsampled and then duplicated to match the receptive field of the higher-resolution maps. for each pixel.
Figure 3
initialization. Each run can have a drastically different re- maps with a distance smaller than the threshold, effectively
sult for the same image. To highlight these limitations, we taking the union of attention maps likely belonging to the
compare them with K-Means baselines in our experiments. same object that the anchor point belongs to. All merged
Instead, we propose to generate a sampling grid from attention maps are stored in a new proposal list Lp .
which the algorithm can iteratively merge attention maps Note that the first iteration does not reduce the number
to create the proposals. Specifically, as shown in Fig. 1, of proposals compared to the number of anchors. From
in the sampling grid generation step, a set of M ⇥ M the second iteration onward, the algorithm starts merg-
(1  M  64) evenly spaced anchor points are gener- ing maps and reducing the number of proposals simultane-
ated M = {(im , jm )|m = 1, ..., M 2 }. We then sample ously by computing the distance between an element from
the corresponding attention maps from the tensor Af . This the proposal list Lp and all elements from the same list, and
operation yields a list of M 2 2D attention maps as anchors, merging elements with distance smaller than ⌧ without re-
placement. The entire iterative attention merging algorithm
is provided in Alg. 1 (Appendix 9.3).
La = {Af [im , jm , :, :] 2 R64⇥64 |(im , jm ) 2 M}. (4) Non-Maximum Suppression. The iterative attention
merging step yields a list Lp 2 RNp ⇥64⇥64 of Np object
Since we aim to merge attention maps to find the maps proposals in the form of attention maps (probability maps).
most likely containing an object, we rely on the two obser- Each proposal in the list potentially contains the activation
vations in Sec. 3.1. Specifically, when iteratively merging of a single object. To convert the list into a valid segmenta-
“similar” maps, Intra-Attention Similarity reinforces tion mask, we use non-maximum suppression (NMS). This
the activation of an object and Inter-Attention Simi- can be easily done since each element is a probability distri-
larity grows the activation to include as many pixels bution map. We can take the index of the largest probability
of the same object within the merged map. To mea- at each spatial location across all maps and assign that lo-
sure similarity, we use KL divergence to calculate the “dis- cation’s membership to the corresponding map’s index. An
tance” between two attention maps, since each attention example is shown in Fig. 3b. Note that, before NMS, we
map Af [i, j, :, :] (abbrev. Af [i, j]) is a valid probability dis- upsample all elements in Lp to the original resolution. For-
tribution. Formally, mally, the final segmentation mask S 2 R512⇥512 is
2 ⇤ D(Af [i, j], Af [y, z]) (5) L̄p = Bilinear-upsample(Lp ) 2 RNp ⇥512⇥512 (6)
= (KL(Af [i, j]kAf [y, z]) + KL(Af [yz, z]kAf [i, j])) . S[i, j] = arg max L̄p [:, i, j] 8i, j 2 {0, ..., 511}.
We use the forward and reverse KL to address the asym-
4. Experiments
metry in KL divergence. Intuitively, a small D(·) indicates
a high similarity between two attention maps and that their Datasets. Following existing works in unsupervised seg-
union is likely to represent better the object they both be- mentation [8, 11, 16, 19, 36, 45], we use two popular seg-
long to. mentation benchmarks, COCO-Stuff-27 [11, 12, 19] and
Then, we start the N iterations of the merging process. Cityscapes [12]. COCO-stuff-27 [11, 19] is a curated ver-
In the first iteration, we compute the pair-wise distance sion of the original COCO-stuff dataset [12]. Specifically,
between each element in the anchor list and all attention COCO-stuff-27 merges the 80 things and 91 stuff cate-
maps using D(·). We introduce a threshold hyper-parameter gories in COCO-stuff into 27 mid-level categories. We eval-
⌧ . For each element in the list, we average all attention uate our method on the validation split curated by prior
3558
Requirements Metrics Requirements Metrics
Model LD AX UA ACC. mIoU Model LD AX UA ACC. mIoU
IIC [19] 7 7 3 21.8 6.7 IIC [19] 7 7 3 47.9 6.4
MDC [8] 7 7 3 32.3 9.8 MDC [8] 7 7 3 40.7 7.1
PiCLE [11] 7 7 3 48.1 13.8 PiCLE [11] 7 7 3 65.5 12.3
PiCLE+H [11] 7 7 3 50.0 14.4 STEGO [16] 7 7 3 73.2 21.0
STEGO [16] 7 7 3 56.9 28.2
MaskCLIP [45] 3 7 7 35.9 10.0
ACSeg [24] 3 7 3 - 28.1
ReCo [36] 3 3 7 74.6 19.3
MaskCLIP [45] 3 7 7 32.2 19.6
Ours: DiffSeg (320) 7 7 7 67.3 15.2
ReCo [36] 3 3 7 46.1 26.3
Ours: DiffSeg (512) 7 7 7 76.0 21.2
K-Means-C (512) 7 7 7 58.9 33.7
K-Means-S (512) 7 7 7 62.6 34.7 Table 2. Evaluation on Cityscapes. Language Dependency (LD),
DBSCAN (512) 7 7 7 57.7 27.2 Auxiliary Images (AX), Unsupervised Adaptation (UA).
Ours: DinoSeg (224) 7 7 7 68.2 39.1
Ours: DiffSeg (320) 7 7 7 72.5 43.0
Ours: DiffSeg (512) 7 7 7 72.5 43.6 4.1. Main Results
Table 1. Evaluation on COCO-Stuff-27. Language Dependency On the COCO benchmark, we include two K-means
(LD), Auxiliary Images (AX), Unsupervised Adaptation (UA). We baselines, K-Means-C and K-Means-S. K-Means-C uses a
additionally present baseline results using K-Means. K-Means-C constant number of clusters, 6, calculated by averaging over
(constant) uses a constant number of clusters, 6. K-means-S (spe- the number of objects in all evaluated images. K-Means-S
cific) uses a specific number of clusters for each image based on uses a specific number for each image based on the number
the ground truth. The K-Means results use 512 ⇥ 512 resolution.
of objects in the ground truth of that image. We also
include another classic unsupervised clustering algorithm
DBSCAN [13] that automatically discovers clusters just
like DiffSeg. In Tab. 1, we observe that both K-Means vari-
works [11, 19]. Cityscapes [12] is a self-driving dataset ants outperform prior works, demonstrating the advantage
with 27 categories. We evaluate the official validation of using self-attention tensors. Furthermore, the K-Means-
split. For both datasets, we report results for the diffusion- S variant outperforms the K-Means-C variant. This shows
model-native 512times512 input resolution and a lower that the number of clusters is an important hyper-parameter
320times320 resolution following prior works. and should be tuned for each image. Nevertheless, Diff-
Seg, despite relying on the same attention tensors Af ,
Metrics. Following prior works [8, 11, 16, 19, 36, 45],
significantly outperforms the K-Means baselines. This
we use pixel accuracy (ACC) and mean intersection over
proves that our algorithm avoids the disadvantages of
union (mIoU) to measure segmentation performance. Be-
K-Means and provides better segmentation. Even though
cause our method does not provide a semantic label, we
DBSCAN does not require the number of clusters as
use the Hungarian matching algorithm [21] to assign each
input, DiffSeg significantly outperforms DBSCAN in our
predicted mask to a ground truth mask. When there are
experiments. We hypothesize that this is because DiffSeg
more predicted masks than ground truth masks, the un-
encodes spatial prior in its clustering process, i.e., spatially
matched predicted masks will be considered false negatives.
distributed anchors. Please refer to Appendix 9.4 for
In addition to the metrics, we highlight the requirements
more details on implementing the K-Means and DBSCAN
other baselines need in inference. Specifically, we empha-
baselines. Moreover, DiffSeg significantly outperforms
size unsupervised adaptation (UA), language dependency
prior works by an absolute 26% in accuracy and 17% in
(LD), and auxiliary image (AX). UA means that the specific
mIoU compared to the prior SOTA zero-shot method, ReCo
method requires unsupervised training on the target dataset.
on COCO-Stuff-27 for both resolutions (320 and 512).
This is common in the unsupervised segmentation litera-
ture. Methods without the UA requirement are considered On the more specialized self-driving segmentation task
zero-shot. LD means that the method requires text input, (Cityscapes), our method is on par with prior works us-
such as a descriptive sentence for the image, to facilitate ing the smaller 320-resolution input. It outperforms prior
segmentation. AX means that the method requires an addi- works using the larger 512-resolution input in accuracy and
tional pool of reference images or synthetic images. mIoU. The resolution of the inputs affects the performance
of Cityscapes more severely than that of COCO because
Models. DiffSeg builds on pre-trained stable diffusion Cityscapes has more small classes such as light poles and
models. We use stable diffusion V1.4 [3]. traffic signs. See discussion on limitations in Appendix 9.7.
3559
(a) Proportional (b) Only 64 ⇥ 64 maps (c) Only 32 ⇥ 32 maps (d) Only 16(8) ⇥ 16(8) maps (e) Original image
Figure 4. Effects of using Different Aggregation Weights (R). DiffSeg uses a proportional aggregation strategy to balance consistency
and detailedness. Higher-resolution maps produce more detailed but fractured segmentation while lower-resolution maps produce more
consistent but coarse segmentation.
Compared to prior works, DiffSeg achieves this level of per- fects of using attention maps of different resolutions for seg-
formance in a pure zero-shot manner without any language mentation in Fig. 4 while keeping other hyper-parameters
dependency or auxiliary images. Therefore, DiffSeg can constant (t = 100, M 2 = 256, N = 3, ⌧ = 1.0). We ob-
segment any image. serve that high-resolution maps, e.g., 64 ⇥ 64 in Fig. 4b,
DiffSeg is a generic clustering algorithm, theoretically yield the most detailed, however, fractured segmentation.
applicable to any Transformer based model. We fur- Lower-resolution maps, e.g., 32 ⇥ 32 in Fig. 4c, give more
ther show the performance of a variant (DinoSeg) using coherent segmentation but often over-segment details, es-
DINO [9] as the backbone in Tab. 1. We observe worse per- pecially along the edges. Finally, too low resolutions fail
formance due to a smaller attention map size and smaller to generate any segmentation in Fig. 4d because the entire
pre-training dataset. Please refer to Appendix 9.5 for more image is merged into one object given the current hyper-
details and discussion. parameter settings. Notably, our proportional aggregation
strategy (Fig. 4a) balances consistency and detailedness.
4.2. Hyper-Parameter Study
5. Adding Semantics
Name COCO Cityscapes While DiffSeg generates high-quality segmentation masks,
Aggregation weights (R) Propto. Propto. like SAM [20], it does not label each mask. We propose a
Time step (t) 300 300 simple extension in Appendix 9.8 to produce labeled seg-
Num. of anchors (M 2 ) 256 256
Num. of merging iterations (N ) 3 3 mentation masks.
KL threshold (⌧ ) 1.1 0.9
6. Visualization
Table 3. Hyper-parameters for Diffseg. The values are used for
producing the results in Tab. 1 and Tab. 2. To demonstrate the generalization capability of DiffSeg, we
provide examples of segmentation on images of different
styles. In Fig. 5 and Fig. 6, we show segmentation on sev-
There are several hyper-parameters in DiffSeg as listed
eral sketches and paintings. The images are taken from
in Tab. 3. The listed numbers are the exact parameters used
the DomainNet dataset [31]. In Fig. 7, we show the seg-
in Sec. 4.1. This section provides a sensitivity study for
mentation results of real images captured by a smartphone.
these hyper-parameters to illustrate their expected behavior.
We also provide more segmentation examples Cityscapes
Specifically, we show the effects of aggregation weights in
in Appendix Fig. 17 and SUN-RGBD [44] in Appendix
the main paper and the impact of other hyper-parameters in
Fig. 15. In Fig. 8, we show segmentation for satellite images
Appendix 9.6.
and CT scans. Naturally, DiffSeg can segment synthetic im-
Aggregation weights (R). The first step of DiffSeg is
ages of diverse styles, generated by a stable diffusion model.
attention aggregation, where attention maps of 4 resolu-
We show examples in Fig. 9 and more in Appendix Fig. 18.
tions are aggregated together. We adopt a proportional ag-
gregation scheme. Specifically, the aggregation weight for a
7. Conclusion
map of a certain resolution is proportional to its resolution,
i.e., high-resolution maps are assigned higher importance. Unsupervised and zero-shot segmentation is a very chal-
This is motivated by the observation that high-resolution lenging setting and only a few papers have attempted to
maps have a smaller receptive field w.r.t the original image, solve it. Most of the existing work either requires unsu-
thus giving more details. To illustrate this, we show the ef- pervised adaptation (not zero-shot) or external resources.
3560
Figure 5. Examples of Segmentation on DomainNet Sketch. Over- Figure 7. Examples of Segmentation on real-world images cap-
lay (left), input (middle), and segmentation (right) tured by a smartphone. Overlay (left), input (middle), and seg-
mentation (right).
Figure 8. Segmentation on satellite and brain tumor CT scan.
Figure 6. Examples of Segmentation on DomainNet Paint-

ing.Overlay (left), input (middle), and segmentation (right)
In this paper, we proposed DiffSeg to segment an im-

age without any prior knowledge or external resources, us-
ing a pre-trained stable diffusion model, without any addi-
tional training. Specifically, the algorithm relies on Intra- Figure 9. Examples of Segmentation on generated images. Over-
Attention Similarity and Inter-Attention Similarity to itera- lay (left), input (middle), and segmentation (right).
tively merge attention maps into a valid segmentation mask.
DiffSeg achieves state-of-the-art performance on popular
benchmarks and demonstrates superior generalization to The work was partially done during the first author’s intern-
images of diverse styles. ship at Google. This work was also partially supported by
ONR grant N00014-18-1-2829 and Google Gift Grant.
8. Acknowledgements
3561
References Intelligent Transportation Systems, 22(3):1341–1360, 2020.
1
[1] Sklearn dbscan algorithm. https://scikit- [15] Qianli Feng, Raghudeep Gadde, Wentong Liao, Eduard Ra-
learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html. mon, and Aleix Martinez. Network-free, unsupervised se-
11 mantic segmentation with synthetic images. In Proceedings
[2] Sklearn k-means algorithm. https://scikit- of the IEEE/CVF Conference on Computer Vision and Pat-
learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. tern Recognition, pages 23602–23610, 2023. 2, 3
11 [16] Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah
[3] Stable diffusion v1-4 model card hugginface. https:// Snavely, and William T Freeman. Unsupervised semantic
huggingface.co/CompVis/stable-diffusion- segmentation by distilling feature correspondences. arXiv
v1-4. 6 preprint arXiv:2203.08414, 2022. 2, 4, 5, 6
[4] Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior [17] Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack,
Wolf. Segdiff: Image segmentation with diffusion proba- Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi.
bilistic models. arXiv preprint arXiv:2112.00390, 2021. 2 Unsupervised semantic correspondence using stable diffu-
[5] David Arthur and Sergei Vassilvitskii. K-means++ the ad-
sion. arXiv preprint arXiv:2305.15581, 2023. 2
vantages of careful seeding. In Proceedings of the eigh- [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif-
teenth annual ACM-SIAM symposium on Discrete algo- fusion probabilistic models. Advances in neural information
rithms, pages 1027–1035, 2007. 11 processing systems, 33:6840–6851, 2020. 2, 3
[6] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, [19] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant in-
Valentin Khrulkov, and Artem Babenko. Label-efficient se- formation clustering for unsupervised image classification
mantic segmentation with diffusion models. arXiv preprint and segmentation. In Proceedings of the IEEE/CVF inter-
arXiv:2112.03126, 2021. 2 national conference on computer vision, pages 9865–9874,
[7] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick
2019. 2, 5, 6
Pérez. Zero-shot semantic segmentation. Advances in Neural [20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao,
Information Processing Systems, 32, 2019. 3 Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White-
[8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and
head, Alexander C Berg, Wan-Yen Lo, et al. Segment any-
Matthijs Douze. Deep clustering for unsupervised learning
thing. arXiv preprint arXiv:2304.02643, 2023. 1, 2, 3, 7,
of visual features. In Proceedings of the European confer-
13
ence on computer vision (ECCV), pages 132–149, 2018. 5, [21] Harold W Kuhn. The hungarian method for the assignment
6 problem. Naval research logistics quarterly, 2(1-2):83–97,
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou,
1955. 6
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- [22] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis
ing properties in self-supervised vision transformers. In Pro- Brown, and Deepak Pathak. Your diffusion model is secretly
ceedings of the IEEE/CVF international conference on com- a zero-shot classifier. arXiv preprint arXiv:2303.16203,
puter vision, pages 9650–9660, 2021. 2, 7, 12 2023. 2
[10] Liang-Chieh Chen, George Papandreou, Florian Schroff, [23] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.
and Hartwig Adam. Rethinking atrous convolution for se- Blip: Bootstrapping language-image pre-training for uni-
mantic image segmentation. arxiv 2017. arXiv preprint fied vision-language understanding and generation. In In-
arXiv:1706.05587, 2, 2019. 1 ternational Conference on Machine Learning, pages 12888–
[11] Jang Hyun Cho, Utkarsh Mall, Kavita Bala, and Bharath
12900. PMLR, 2022. 13
Hariharan. Picie: Unsupervised semantic segmentation us- [24] Kehan Li, Zhennan Wang, Zesen Cheng, Runyi Yu, Yian
ing invariance and equivariance in clustering. In Proceedings Zhao, Guoli Song, Chang Liu, Li Yuan, and Jie Chen. Ac-
of the IEEE/CVF Conference on Computer Vision and Pat- seg: Adaptive conceptualization for unsupervised semantic
tern Recognition, pages 16794–16804, 2021. 2, 4, 5, 6 segmentation. In Proceedings of the IEEE/CVF Conference
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo
on Computer Vision and Pattern Recognition, pages 7162–
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe
7172, 2023. 2, 4, 6
Franke, Stefan Roth, and Bernt Schiele. The cityscapes [25] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan
dataset for semantic urban scene understanding. In Proceed- Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana
ings of the IEEE conference on computer vision and pattern Marculescu. Open-vocabulary semantic segmentation with
recognition, pages 3213–3223, 2016. 5, 6 mask-adapted clip. In Proceedings of the IEEE/CVF Con-
[13] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu,
ference on Computer Vision and Pattern Recognition, pages
et al. A density-based algorithm for discovering clusters in
7061–7070, 2023. 13
large spatial databases with noise. In kdd, volume 96, pages [26] Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim,
226–231, 1996. 6, 11 Antonio Torralba, and Sanja Fidler. Editgan: High-precision
[14] Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz
semantic image editing. Advances in Neural Information
Hertlein, Claudius Glaeser, Fabian Timm, Werner Wies-
Processing Systems, 34:16331–16345, 2021. 1
beck, and Klaus Dietmayer. Deep multi-modal object de- [27] Xiangbin Liu, Liping Song, Shuai Liu, and Yudong Zhang. A
tection and semantic segmentation for autonomous driving: review of deep-learning-based medical image segmentation
Datasets, methods, and challenges. IEEE Transactions on methods. Sustainability, 13(3):1224, 2021. 1
3562
[28] Edward Loper and Steven Bird. Nltk: The natural language efficient design for semantic segmentation with transform-
toolkit. arXiv preprint cs/0205028, 2002. 13 ers. Advances in Neural Information Processing Systems,
[29] Timo Lüddecke and Alexander Ecker. Image segmenta- 34:12077–12090, 2021. 1
tion using text and image prompts. In Proceedings of [41] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiao-
the IEEE/CVF Conference on Computer Vision and Pattern long Wang, and Shalini De Mello. Open-vocabulary panop-
Recognition, pages 7086–7096, 2022. 3 tic segmentation with text-to-image diffusion models. In
[30] Tam Nguyen, Maximilian Dax, Chaithanya Kumar Mum- Proceedings of the IEEE/CVF Conference on Computer Vi-
madi, Nhung Ngo, Thi Hoai Phuong Nguyen, Zhongyu Lou, sion and Pattern Recognition, pages 2955–2966, 2023. 2,
and Thomas Brox. Deepusps: Deep robust unsupervised 13
saliency prediction via self-supervision. Advances in Neu- [42] Andrii Zadaianchuk, Matthaeus Kleindessner, Yi Zhu,
ral Information Processing Systems, 32, 2019. 2 Francesco Locatello, and Thomas Brox. Unsupervised se-
[31] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate mantic segmentation with self-supervised object-centric rep-
Saenko, and Bo Wang. Moment matching for multi-source resentations. arXiv preprint arXiv:2207.05027, 2022. 2
domain adaptation. In Proceedings of the IEEE International [43] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Pola-
Conference on Computer Vision, pages 1406–1415, 2019. 7 nia Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Yang. A tale of two features: Stable diffusion complements
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, dino for zero-shot semantic correspondence. arXiv preprint
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning arXiv:2305.15347, 2023. 2
transferable visual models from natural language supervi- [44] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Tor-
sion. In International conference on machine learning, pages ralba, and Aude Oliva. Learning deep features for scene
8748–8763. PMLR, 2021. 1 recognition using places database. Advances in neural in-
[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, formation processing systems, 27, 2014. 7
Patrick Esser, and Björn Ommer. High-resolution image [45] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free
synthesis with latent diffusion models. In Proceedings of dense labels from clip. In European Conference on Com-
the IEEE/CVF conference on computer vision and pattern puter Vision, pages 696–712. Springer, 2022. 3, 5, 6
recognition, pages 10684–10695, 2022. 2, 3
[34] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
et al. Photorealistic text-to-image diffusion models with deep
language understanding. Advances in Neural Information
Processing Systems, 35:36479–36494, 2022. 2, 3
[35] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
man, et al. Laion-5b: An open large-scale dataset for training
next generation image-text models. Advances in Neural In-
formation Processing Systems, 35:25278–25294, 2022. 12
[36] Gyungin Shin, Weidi Xie, and Samuel Albanie. Reco: Re-
trieve and co-segment for zero-shot transfer. Advances in
Neural Information Processing Systems, 35:33754–33767,
2022. 2, 3, 5, 6, 12
[37] Gyungin Shin, Weidi Xie, and Samuel Albanie. Named-
mask: Distilling segmenters from complementary founda-
tion models. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 4960–
4969, 2023. 2
[38] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and
Liang-Chieh Chen. Max-deeplab: End-to-end panoptic
segmentation with mask transformers. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, pages 5463–5474, 2021. 1
[39] Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou,
and Chunhua Shen. Diffumask: Synthesizing images with
pixel-level annotations for semantic segmentation using dif-
fusion models. arXiv preprint arXiv:2303.11681, 2023. 2, 3,
13
[40] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and
3563

Diffuse, Attend, and Segment Unsupervised Zero-Shot Segmentation Using

Uploaded by

Copyright:

Available Formats

Diffuse, Attend, and Segment Unsupervised Zero-Shot Segmentation Using

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Diffuse, Attend, and Segment Unsupervised Zero-Shot Segmentation Using

Uploaded by

Copyright:

Available Formats

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.

Except for this watermark, it is identical to the accepted version;

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using

Junjiao Tian1 Lavisha Aggarwal2 Andrea Colaco2

Zsolt Kira1 Mar Gonzalez-Franco2

valid segmentation. linear interpolation) the last 2 dimensions of all attention

the real images are viewed as primarily denoised generated X

Figure 8. Segmentation on satellite and brain tumor CT scan.

Figure 6. Examples of Segmentation on DomainNet Paint-

In this paper, we proposed DiffSeg to segment an im-

You might also like