SegDiff - Image Segmentation With Diffusion Probabilistic Models
• We obtained state-of-the-art results on multiple benchmarks. The margin is especially large for small datasets.

2. Related work

Image segmentation is the problem of assigning each pixel a label that identifies whether it belongs to a specific class or not. This problem is widely investigated using different architectures. These include fully convolutional networks [31], encoder-decoder architectures with skip-connections, such as U-Net [38], transformer-based architectures, such as the Segformer [50], and even architectures that combine hypernetworks, such as [36].

Diffusion Probabilistic Models (DPM) [43] are a class of generative models based on a Markov chain, which can transform a simple distribution (e.g., Gaussian) into data sampled from a complex distribution. Diffusion models are capable of generating high-quality images that can compete with and even outperform the latest GAN methods [10, 18, 35, 43]. A variational framework for the likelihood estimation of diffusion models was introduced by Huang et al. [21]. Subsequently, Kingma et al. [23] proposed a Variational Diffusion Model that produces state-of-the-art results in likelihood estimation for image density. Diffusion models were also applied to language modeling [2, 20], where a novel diffusion model for categorical data was used.

Conditional Diffusion Probabilistic Models In our work, we use diffusion models to solve the image segmentation problem as conditional generation, given the image. Conditional generation with diffusion models includes methods for class-conditioned generation, which is obtained by adding a class embedding to the timestamp embedding [35]. In [8], a method for guiding the generative process in DDPM is presented. This method allows the generation of images based on a given reference image without any additional learning.

In the domain of super resolution, the lower-resolution image is upsampled and then concatenated, channelwise, to the generated image at each iteration [19, 41]. A similar approach passes the low-resolution images through a convolutional block [27] prior to the concatenation. Concurrently with our work, diffusion models were applied to image-to-image translation tasks [40]. These tasks include uncropping, inpainting, and colorization. The results obtained outperform strong GAN baselines.

Conditional diffusion models have also been used for voice generation. The mel-spectrogram is processed with a convolutional network and used as an additional input to the DPM denoising network [6, 24, 30]. Furthermore, in [37] a text-to-speech diffusion model is introduced, which uses text as a condition to the diffusion model.

In our work, we take a different approach to conditioning, adding (not concatenating) the input image, after it passes through a convolutional encoder, to the current estimate of the segmentation image. In other words, we learn the DPM of a residual model.

3. Background

We briefly introduce the formulation of diffusion models presented in [18]. Diffusion models are generative models parametrized by a Markov chain and composed of forward and backward processes. The forward process q is described by the formulation

q(x1:T | x0) = ∏_{t=1}^{T} q(xt | xt−1),   (1)

where T is the number of steps in the diffusion model, x1, ..., xT are latent variables, and x0 is a sample from the data. At each iteration of the forward process, Gaussian noise is added according to

q(xt | xt−1) = N(xt; √(1 − βt) xt−1, βt In×n),   (2)

where βt is a constant that defines the schedule of added noise, and In×n is the identity matrix of size n. As described in [18],

αt = 1 − βt,   ᾱt = ∏_{s=0}^{t} αs.   (3)

The forward process supports sampling at an arbitrary timestamp t, with the formula

q(xt | x0) = N(xt; √(ᾱt) x0, (1 − ᾱt) In×n),   (4)

which can be reparametrized to

xt = √(ᾱt) x0 + √(1 − ᾱt) ε,   ε ∼ N(0, In×n).   (5)

The reverse process is parametrized by θ and defined by

pθ(x0:T−1 | xT) = ∏_{t=1}^{T} pθ(xt−1 | xt).   (6)

Starting from pθ(xT) = N(xT; 0, In×n), the reverse process transforms the latent variable distribution pθ(xT) into the data distribution pθ(x0). The reverse process steps are performed by taking small Gaussian steps described by

pθ(xt−1 | xt) = N(xt−1; µθ(xt, t), Σθ(xt, t)).   (7)

Calculating q(xt−1 | xt, x0) using Bayes' theorem, one obtains

q(xt−1 | xt, x0) = N(xt−1; µ̃(xt, x0), β̃t In×n),   (8)
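To make the forward process concrete, here is a minimal sketch of sampling xt directly from x0 via the reparametrization of Eq. 5. It assumes PyTorch, a (B, C, H, W) tensor layout, and an illustrative linear βt schedule; none of these specifics are prescribed by the text above.

```python
import torch

def forward_diffusion_sample(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) via Eq. 5: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps."""
    eps = torch.randn_like(x0)                    # eps ~ N(0, I)
    a_bar_t = alpha_bar[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    xt = torch.sqrt(a_bar_t) * x0 + torch.sqrt(1.0 - a_bar_t) * eps
    return xt, eps

# Illustrative linear schedule (Eq. 3): alpha_t = 1 - beta_t, alpha_bar_t = prod_s alpha_s.
T = 100
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.rand(4, 1, 128, 128) * 2 - 1           # e.g., a batch of segmentation maps in [-1, 1]
t = torch.randint(0, T, (4,))                     # a random step index per sample
xt, eps = forward_diffusion_sample(x0, t, alpha_bar)
```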
Figure 1. Our proposed diffusion method for image segmentation encodes the input signal, xt , with F . The extracted features are summed
with the feature map of the conditioned image I generated by network G. Networks E and D are a U-net encoder and decoder [35, 38],
respectively, that refine the estimated segmentation map, obtaining xt−1 .
where

µ̃t(xt, x0) = (√(ᾱt−1) βt / (1 − ᾱt)) x0 + (√(αt) (1 − ᾱt−1) / (1 − ᾱt)) xt,   (9)

β̃t = ((1 − ᾱt−1) / (1 − ᾱt)) βt.   (10)

The neural network µθ predicts the noise ε, which is

Algorithm 1 Inference Algorithm
Input: total diffusion steps T, image I
xT ∼ N(0, In×n)
for t = T, T − 1, ..., 1 do
    z ∼ N(0, In×n)
    βt = (10^{−4}(T − t) + 2·10^{−2}(t − 1)) / (T − 1)
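A minimal sketch of an inference loop in the spirit of Algorithm 1, using the linear βt schedule listed above and the standard DDPM update implied by Eq. 7; the model(xt, image, t) interface for the image-conditioned network εθ and the choice σt² = βt are assumptions of the sketch, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def segdiff_inference(model, image, T=100, shape=(1, 1, 128, 128)):
    """Sketch of Algorithm 1: start from x_T ~ N(0, I) and iterate t = T, ..., 1."""
    # Linear schedule from Algorithm 1: beta_t = (1e-4 (T - t) + 2e-2 (t - 1)) / (T - 1), t = 1..T.
    t_all = torch.arange(1, T + 1, dtype=torch.float32)
    betas = (1e-4 * (T - t_all) + 2e-2 * (t_all - 1)) / (T - 1)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    xt = torch.randn(shape)                                 # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn(shape) if t > 1 else torch.zeros(shape)
        beta_t, alpha_t, alpha_bar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps = model(xt, image, t)                           # assumed eps_theta(x_t, I, t) interface
        mean = (xt - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
        xt = mean + torch.sqrt(beta_t) * z                  # DDPM step with sigma_t^2 = beta_t (assumption)
    return xt
```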
combines information derived from both the current estimate xt and the input image I.

In diffusion models, εθ is typically a U-Net [38]. In our work, εθ can be expressed in the following form:

εθ(xt, I, t) = D(E(F(xt) + G(I), t), t).   (17)

In this architecture, the U-Net's decoder D is conventional, and its encoder is broken down into three networks: E, F, and G. The last encodes the input image, while F encodes the segmentation map of the current step xt. The two processed inputs have the same spatial dimensionality and number of channels. Based on the success of residual connections [17], we sum these signals, F(xt) + G(I). This sum then passes to the rest of the U-Net encoder E.

The current step index t is passed to two different networks, D and E. In each of these, it is embedded using a shared learned look-up table.

The output of εθ from Eq. 17, which is conditioned on I, is plugged into Eq. 16, replacing the unconditioned εθ network. The resulting inference-time procedure is illustrated in Fig. 1 and detailed in Alg. 1.
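As a minimal illustration of Eq. 17, the sketch below wires together the two input branches F and G, a shared step-index embedding, and stand-ins for the U-Net encoder E and decoder D. The module definitions (plain convolutions rather than the RRDB-based G and the attention-augmented U-Net of Sec. 4.3) and the class name ConditionedDenoiser are ours, chosen only to show the F(xt) + G(I) summation.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """eps_theta(x_t, I, t) = D(E(F(x_t) + G(I), t), t), cf. Eq. 17 (simplified stand-in modules)."""

    def __init__(self, channels=128):
        super().__init__()
        self.F = nn.Conv2d(1, channels, 3, padding=1)           # encodes the current estimate x_t
        self.G = nn.Sequential(                                  # encodes the conditioning image I
            nn.Conv2d(3, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.time_embed = nn.Embedding(1000, channels)           # shared step-index look-up table
        self.E = nn.Conv2d(channels, channels, 3, padding=1)     # placeholder for the U-Net encoder
        self.D = nn.Conv2d(channels, 1, 3, padding=1)            # placeholder for the U-Net decoder

    def forward(self, xt, image, t):
        h = self.F(xt) + self.G(image)                           # sum, not concatenation
        emb = self.time_embed(t).unsqueeze(-1).unsqueeze(-1)     # (B, C, 1, 1), added as a simple stand-in
        h = self.E(h + emb)
        return self.D(h + emb)

model = ConditionedDenoiser()
eps = model(torch.randn(2, 1, 128, 128), torch.randn(2, 3, 128, 128), torch.tensor([10, 10]))
```

The key design point illustrated here is that the two encodings share spatial size and channel count, so they can be fused by addition before entering the remaining encoder.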
4.1. Employing multiple generations

Since calculating xt−1 during inference includes the addition of σθ(xt, t)z, where z is drawn from a standard distribution, there is significant variability between different runs of the inference method on the same inputs; see Fig. 2(b). In order to exploit this phenomenon, we run the inference algorithm multiple times and then average the results. This way, we stabilize the segmentation results and improve performance, as demonstrated in Fig. 2(c). We use thirty generated instances in all experiments, except for the experiments in the ablation study, which quantifies the gain of this averaging procedure.
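A sketch of this averaging procedure, assuming the segdiff_inference helper from the earlier sketch; the 0.5 binarization threshold is illustrative.

```python
import torch

@torch.no_grad()
def averaged_segmentation(model, image, n_runs=30, threshold=0.5):
    """Run the stochastic inference several times and average the generated maps."""
    runs = [segdiff_inference(model, image) for _ in range(n_runs)]
    mean_map = torch.stack(runs, dim=0).mean(dim=0)    # per-pixel average over independent runs
    return (mean_map > threshold).float(), mean_map    # hard mask and soft (calibration-friendly) map
```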
4.2. Training

The training procedure is depicted in Alg. 2. The total number of diffusion steps T is set by the user. For each iteration, a random sample (Ii, Mi) is obtained (an image and the associated ground-truth binary segmentation map). The iteration number 1 ≤ t ≤ T is sampled from a uniform distribution, and ε from a standard distribution. We then sample xt according to Eq. 5, compute F(xt) + G(Ii), and apply networks E and D to obtain εθ(xt, Ii, t). The loss being minimized is a modified version of Eq. 15, namely

E_{x0, ε, t} [‖ε − εθ(√(ᾱt) x0 + √(1 − ᾱt) ε, Ii, t)‖²].   (18)

At training time, the ground-truth segmentation of the input image Ii is known, and the loss is computed by setting x0 = Mi.
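The following sketch shows one optimization step for the loss of Eq. 18, reusing the forward_diffusion_sample helper from the Background sketch; the batching details and the model interface are assumptions, and the 0-based step indexing is a convenience of the sketch.

```python
import torch
import torch.nn.functional as nnf

def training_step(model, optimizer, image, mask, alpha_bar, T=100):
    """One step in the spirit of Alg. 2: sample t and eps, form x_t from x_0 = M_i, regress eps (Eq. 18)."""
    t = torch.randint(0, T, (image.shape[0],))                  # uniform step index per sample (0-based)
    xt, eps = forward_diffusion_sample(mask, t, alpha_bar)      # x_0 = M_i, the ground-truth map
    eps_pred = model(xt, image, t)                              # eps_theta(x_t, I_i, t)
    loss = nnf.mse_loss(eps_pred, eps)                          # || eps - eps_theta(...) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```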
4.3. Architecture

The input image encoder G is built from Residual in Residual Dense Blocks [47] (RRDBs), which combine multi-level residual connections without batch normalization layers. G has an input 2D-convolutional layer, an RRDB with a residual connection around it, followed by another 2D-convolutional layer, a leaky ReLU activation, and a final 2D-convolutional output layer. F is a 2D-convolutional layer with a single-channel input and an output of C channels.

The encoder-decoder part of εθ, i.e., D and E, is based on U-Net, similarly to [35]. Each level is composed of residual blocks, and at the 16×16 and 8×8 resolutions each residual block is followed by an attention layer. The bottleneck contains two residual blocks with an attention layer in between. Each attention layer contains multiple attention heads.

The residual block is composed of two convolutional blocks, where each convolutional block contains GroupNorm, a SiLU activation, and a 2D-convolutional layer. The residual block receives the time embedding through a linear layer, a SiLU activation, and another linear layer. The result is then added to the output of the first convolutional block. Additionally, the residual block has a residual connection that passes all its content.

On the encoder side (network E), there is a downsample block after the residual blocks of the same depth, which is a 2D-convolutional layer with a stride of two. On the decoder side (network D), there is an upsample block after the residual blocks of the same depth, which is composed of nearest-neighbor interpolation that doubles the spatial size, followed by a 2D-convolutional layer. Each layer in the encoder has a skip connection to the decoder side.

5. Experiments

We present segmentation results for three datasets, as well as an ablation study.

Datasets The Cityscapes dataset [9] is an instance segmentation dataset containing 5,000 annotated images divided into 2,975 images for training, 500 for validation, and 1,525 for testing.

The experimental setting used is sometimes referred to as interactive segmentation and is motivated by the need to accelerate object annotation [1]. Under this setting, there are eight object categories, and the goal is to recover the objects' per-pixel masks, given a cropped patch that contains the bounding box around each object.

Our per-object training and validation sets are created by taking crops from images in the original Cityscapes sets using the locations of the ground-truth classes (we do not have access to the ground-truth labels of the original Cityscapes test set).
Figure 2. Obtaining multiple segmentation results for Cityscapes, Vaihingen, and MoNuSeg. (a) input image, (b) a subset of the obtained
results for multiple runs on the same input, visualized by the jet color scale between 0 in blue and 1 in red, (c) average result, and (d)
ground truth.
We compared our method on the Cityscapes dataset with PSP-DeepLab [5], Polygon-RNN++ [1], Curve-GCN [29], Deep active contours [14], Segformer-B5 [50], and Stdc1 [11]. For most baselines, we report the results obtained from previous publications. For Segformer and Stdc, we train from scratch.

We did not perform a comparison with PolyTransform [28], since it uses a different protocol. Specifically, this method, which improves upon Mask R-CNN [16], utilizes the entire image (and not just the segmentation patch) as part of its inputs, and does not work on standard patches in a way that would enable a direct comparison.
The Vaihingen dataset [39] contains 168 aerial images of Vaihingen, in Germany, divided into 100 images for training and 68 for testing. The task is to segment the central building in each image. For this dataset, the leading baselines are DSAC [34], DarNet [7], TDAC [15], Deep active contours [14], FCN-UNET [38], FCN-ResNet-34, FCN-HarDNet-85 [4], Segformer-B5 [50], and Stdc1 [11].

The MoNuSeg dataset [25, 26] contains a training set with 30 microscopic images from seven organs, with annotations of 21,623 individual nuclei. The test set contains 14 similar images. We resized the images to a resolution of 512 × 512, following [45]. The relevant baseline methods are FCN [3], UNET [38], UNET++ [53], Res-Unet [48], Axial attention (A.A) U-Net [46], and Medical transformer [45].

Evaluation The Cityscapes dataset is evaluated using the common metric of mean Intersection-over-Union (mIoU) per class:

mIoU(y, ŷ) = (1/N) ∑_{i=1}^{N} TP(yi, ŷi) / (TP(yi, ŷi) + FN(yi, ŷi) + FP(yi, ŷi)),   (19)

where N is the number of classes in the dataset, TP is the true positive count between the ground truth y and the output mask ŷ, FN is a false negative, and FP is a false positive.

The Vaihingen dataset is evaluated using several metrics: mIoU, F1-score, Weighted Coverage (WCov), and Boundary F-score (BoundF), as described in [7]. Briefly, a prediction is correct if it is within a certain distance threshold from the ground truth. The benchmarks use five thresholds, from 1px to 5px, for evaluating performance.

Following previous work, evaluation on the MoNuSeg dataset is performed using mIoU and the F1-score.
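A small sketch of the IoU computation behind Eq. 19 for binary masks, computing a per-sample score and averaging over a batch; the epsilon guard for empty masks is an implementation choice of the sketch, not part of the paper's definition.

```python
import torch

def mean_iou(pred, target, eps=1e-7):
    """Mean IoU over a batch of binary masks: TP / (TP + FN + FP), averaged (cf. Eq. 19)."""
    pred, target = pred.bool(), target.bool()
    tp = (pred & target).flatten(1).sum(dim=1).float()
    fp = (pred & ~target).flatten(1).sum(dim=1).float()
    fn = (~pred & target).flatten(1).sum(dim=1).float()
    return ((tp + eps) / (tp + fn + fp + eps)).mean()
```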
Training details The number of diffusion steps in previous works was 1,000 [18] and even 4,000 [35]. The literature suggests that more is better [42]. In our main experiments, we employ 100 diffusion steps to reduce inference time. An additional set of experiments investigated the influence of the number of diffusion steps on the performance and runtime of the method.

The AdamW [32] optimizer is used in all our experiments. Based on the intuition that more RRDB blocks lead to better results, we used as many blocks as we could fit on the GPU without overly reducing the batch size. The U-Net used for datasets with a resolution of 256 × 256 has one additional layer with respect to the datasets with half that resolution, in order to account for the spatial dimensions.

On the Cityscapes dataset, the input resolution of our model is 128 × 128. The test metrics are computed at the original resolution; therefore, we resized the predictions to the original image size. Training took place with a batch size of 30 images. The network had 15 RRDB blocks and a depth of six. The number of channels was set to [C, C, 2C, 2C, 4C, 4C] with C = 128. We followed the same augmentation scheme as in [14], including random scaling in the range of [0.75, 1.25], up to 22 degrees of rotation in each direction, and a horizontal flip with a probability of 0.5.

For the Vaihingen dataset, both the input image size and the test image resolution were 256 × 256. The experiments were performed with a batch size of eight images, six RRDB blocks, and a depth of seven. The number of channels was set to [C, C, C, 2C, 2C, 4C, 4C] with C = 128. The same augmentations are used as in [7]: random scaling by a factor sampled uniformly in the range [0.75, 1.5], a rotation sampled uniformly between zero and 360 degrees, independent horizontal and vertical flips, each applied with a probability of 0.5, and a random color jitter, with maximum values of 0.6 brightness, 0.5 contrast, 0.4 saturation, and 0.025 hue.

For MoNuSeg, the input image resolution was 256 × 256, but the test resolution was 512 × 512. To address this, we applied a sliding window of 256 × 256 with a stride of 256, i.e., we tested each quadrant of the image separately. The experiments were carried out with a batch size of eight images and 12 RRDB blocks. The network depth was seven, and the number of channels at each depth was [C, C, C, 2C, 2C, 4C, 4C], with C = 128. We used the same augmentation scheme as in [45], with random cropping of 256 × 256 to adjust for GPU memory.
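As a small sketch of the MoNuSeg test-time procedure, the helper below splits a 512 × 512 image into non-overlapping 256 × 256 windows, segments each separately, and stitches the predictions back together; the predict_patch callable stands in for the full (averaged) inference pipeline and is an assumption of the sketch.

```python
import torch

def sliding_window_predict(predict_patch, image, window=256):
    """Segment each window x window quadrant separately and reassemble the full-size mask."""
    _, _, h, w = image.shape
    out = torch.zeros(image.shape[0], 1, h, w)
    for top in range(0, h, window):
        for left in range(0, w, window):
            patch = image[:, :, top:top + window, left:left + window]
            out[:, :, top:top + window, left:left + window] = predict_patch(patch)
    return out
```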
It is worth noting that all baseline methods except Segformer and Stdc rely on pre-trained weights obtained on the ImageNet, PASCAL, or COCO datasets. Our networks are initialized with random weights.

Results Following previous work, Cityscapes is evaluated in one of two settings. Tight: in this setting, the samples (image and associated segmentation map) are extracted by a tight crop around the object mask. Expansion: samples are extracted by a crop around the object mask that is 15% larger than the tight crop. The inputs of the model are crops 10%-20% larger than the tight one. This setting is slightly more challenging, since there is less information on the location of the target object.

The results for the Cityscapes dataset are reported in Tab. 1. As can be seen, our method outperforms all baseline methods, across all categories and in both settings. The gap is apparent even for the most recent baseline methods and, as can be seen in Fig. 3, the gap in performance is especially sizable for datasets with fewer training images.

The results for the Vaihingen dataset are presented in Tab. 2. As can be seen, our method outperforms the results reported in previous work for all four scores.

The results for the MoNuSeg dataset are presented in Tab. 3. In both segmentation metrics, our method outperforms all previous works, including very recent variants of U-Net and transformers that were developed specifically for this segmentation task.
Method              Bicycle  Bus    Person  Train  Truck  M.cycle  Car    Rider  Mean

Expansion
Polygon-RNN++ [1]   63.06    81.38  72.41   64.28  78.90  62.01    79.08  69.95  71.38
PSP-DeepLab [5]     67.18    83.81  72.62   68.76  80.48  65.94    80.45  70.00  73.66
Polygon-GCN [29]    66.55    85.01  72.94   60.99  79.78  63.87    81.09  71.00  72.66
Spline-GCN [29]     67.36    85.43  73.72   64.40  80.22  64.86    81.88  71.73  73.70
SegDiff (ours)      69.80    85.97  76.09   75.95  80.68  67.06    83.40  72.57  76.44

Tight
Deep contour [14]   68.08    83.02  75.04   74.53  79.55  66.53    81.92  72.03  75.09
Segformer-B5 [50]   68.02    78.78  73.53   68.46  74.54  64.06    83.20  69.12  72.46
Stdc1 [11]          67.86    80.67  74.20   69.73  77.02  64.52    83.53  69.58  73.39
Stdc2 [11]          68.67    81.29  74.41   71.36  75.71  63.69    83.51  69.90  73.57
SegDiff (ours)      69.62    84.64  75.18   74.89  80.34  67.75    83.63  73.49  76.19

Table 1. Cityscapes segmentation results for two protocols: the top part refers to segmentation results with 15% expansion around the bounding box; the bottom part refers to segmentation results with a tight bounding box.
MoNuSeg segmentation results (Tab. 3):

Method            F1     mIoU
FCN [3]           28.84  28.71
U-Net [38]        79.43  65.99
U-Net++ [53]      79.49  66.04
Res-UNet [48]     79.49  66.07
A.A U-Net [46]    76.83  62.49
MedT [45]         79.55  66.17
Ours              81.59  69.00

Ablation results on the Vaihingen dataset:

Variant one       90.52  84.15  90.37  62.66
Variant three     93.77  88.67  91.69  80.15
Variant four      94.77  90.27  93.82  82.64
Variant five      93.16  87.76  91.08  79.89
Variant six       91.97  85.57  89.83  71.04
Full method       94.95  90.64  94.00  84.37
Figure 4. mIoU (mean and variance) across the test images as a function of the number of diffusion steps. (a) Results for the Cityscapes
classes, with 128 × 128 image resolution. (b) Results for the Vaihingen and MoNuSeg datasets, with 256 × 256 image resolution.
Figure 5. mIoU per number of generated inferences. (a) Results for the Cityscapes classes, with 128 × 128 image resolution. (b) Results
for the Vaihingen and MoNuSeg datasets, with 256 × 256 image resolution.
We also examined the effect of the number of generated instances on performance. The results can be seen in Fig. 5. In general, increasing the number of generated instances tends to increase the mIoU score. However, the number of runs required to reach optimal performance varies between classes. For example, for the "Bus" and "Train" classes of Cityscapes, the best score is achieved when using 10 and 3 generated instances, respectively. MoNuSeg requires considerably more runs (25) for maximal performance. On the other hand, when the number of generated instances is increased, inference time also increases linearly, resulting in a slower method compared to architectures such as Segformer and Stdc.

Another aspect of achieving improvement by employing multiple generations is calibration. The calibration score is measured as the difference between the prediction probability and the true probability of the event. For example, a perfectly calibrated model is defined by P(Ŷ = Y | P̂ = p) = p, which means that the prediction probability equals the true probability of the event. We estimate the calibration score by splitting the [0, 1] range into ten uniform bins, then averaging the squared difference between each bin's mean prediction probability and its percentage of positive samples.
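A sketch of this binned calibration score: predictions are split into ten uniform probability bins, and the squared gap between each bin's mean predicted probability and its fraction of positive pixels is averaged over the occupied bins; variable names and the handling of empty bins are ours.

```python
import torch

def calibration_score(probs, labels, n_bins=10):
    """Average squared gap between per-bin mean confidence and per-bin positive rate."""
    probs, labels = probs.flatten(), labels.flatten().float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo) & (probs <= hi)
        if in_bin.any():
            gaps.append((probs[in_bin].mean() - labels[in_bin].mean()) ** 2)
    return torch.stack(gaps).mean()
```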
Figure 6. Mean calibration score (lower is better) per number of generated inferences. The error bars depict the standard error. (a) Results
for the Cityscapes classes, with an image resolution of 128 × 128. (b) Results for the Vaihingen and MoNuSeg datasets, with the 256 × 256
image resolution.
Figure 7. Results of the ablation study. (a) the input image, (b-e) results for variants one–four of our method, respectively, (f) the result of
our method, and (g) ground truth. Panels (b-f) employ the jet color scale between 0 in blue and 1 in red.
The results of examining the calibration scores are presented in Fig. 6. For most datasets, increasing the number of generated instances improves the calibration score, especially when the increase is from a single instance. In addition, for the larger classes in Cityscapes, Rider and Bicycle, and for the MoNuSeg and Vaihingen datasets, the improvement continues to increase even more compared to the other datasets. The "Train" class in Cityscapes is an exception; here, the single-instance calibration score is better than in other experiments with a larger number of generated instances. This phenomenon may be a result of the highly varied size and the small number of test images.

Ablation Study We evaluate various alternatives to our method. The first variant concatenates [F(xt), G(I)] along the channel dimension. The second variant employs FC-HarDNet-70 V2 [4] instead of RRDBs.
Figure 8. mIoU per number of RRDB blocks. (a) Results on Vaihingen, (b) Results on Cityscapes “Bus”.
Figure 9. Generation time in seconds and mIoU per number of diffusion steps for Vaihingen and Cityscapes "Bus". (a) mIoU per diffusion step, (b) Time per diffusion step.
The third variant, following [19, 41], concatenates I channelwise to xt, without using an encoder. The last alternative is to propagate F(xt) through the U-Net module and add it to G(I) after the first, third, and fifth downsample blocks (variants four–six), instead of computing F(xt) + G(I). In this variant, G(I) is downsampled to match the required number of channels by propagating it through a 2D-convolutional layer with a stride of two.
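For concreteness, the fragment below contrasts how the input to the U-Net encoder E is formed under the full method (summation, Eq. 17), variant one (channelwise concatenation of the two encodings), and variant three (concatenating I directly to xt without an image encoder); the function and argument names are illustrative.

```python
import torch

def fuse_inputs(variant, xt, image, F_enc, G_enc):
    """Return the tensor fed to the U-Net encoder E under different conditioning variants."""
    if variant == "full":        # full method: F(x_t) + G(I), as in Eq. 17
        return F_enc(xt) + G_enc(image)
    if variant == "one":         # variant one: channelwise concatenation of the two encodings
        return torch.cat([F_enc(xt), G_enc(image)], dim=1)
    if variant == "three":       # variant three: concatenate I to x_t directly, no image encoder
        return torch.cat([xt, image], dim=1)
    raise ValueError(f"unknown variant: {variant}")
```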
These variant experiments were tested by averaging nine generated instances on the Vaihingen dataset and on Cityscapes "Bus" (the performance reported for our method is therefore slightly different from that reported in Tab. 2).

The summation we introduce as a conditioning approach outperforms concatenation (variant one) on Vaihingen by a large margin, while on Cityscapes "Bus", the difference is small. The RRDB blocks are preferable to the FC-HarDNet architecture in both datasets (variant two). Removing the encoder affects the metrics significantly (variant three), slightly more so on Vaihingen. The change in the signal's integration position of variant four leads to a negligible difference on Vaihingen and even outperforms our full method on Cityscapes "Bus". Variants five and six lead to a decrease in performance as the distance from the first layer increases. Fig. 7 depicts sample results for variants one–four on the Vaihingen dataset.
Parameter sensitivity For testing the stability of our proposed method, we experimented with the two hyperparameters that can affect performance the most: the number of diffusion steps and the number of RRDB blocks. To study the effect of these parameters, we varied the number of diffusion steps in the range of [25, 50, 75, 100, 150, 200], and the number of RRDB blocks in the range of [1, 3, 5, 10] for Vaihingen and [5, 10, 15, 20, 25] for Cityscapes "Bus". We started from a baseline configuration (100 diffusion steps, 3 RRDB blocks for Vaihingen, and 10 RRDB blocks for Cityscapes "Bus") and experimented with different values around it.

The effect of the number of RRDB blocks In this part we set the number of diffusion steps to 100. As can be seen in Fig. 8, with our configuration, the optimal number of RRDB blocks is 3 for Vaihingen and 10 for Cityscapes "Bus". However, evidently, the number of blocks has a limited impact in the case of both Cityscapes and Vaihingen. The gap between the best and worst performance points is less than 1 mIoU for Vaihingen and less than 2 mIoU for Cityscapes "Bus". Therefore, we conclude that this hyperparameter has a small effect on performance.

Varying the number of diffusion steps T In this part, we set the number of RRDB blocks to 3 for Vaihingen and 10 for Cityscapes "Bus". We explore the possible accuracy/runtime tradeoff with regard to the number T of diffusion steps. Results are shown in Fig. 9.

When the number of diffusion steps is increased - as can be seen in Fig. 9(a) - the fluctuation of the graph is less than 1 mIoU for Vaihingen and less than 2 mIoU for Cityscapes "Bus".

Surprisingly, when the number of diffusion steps is reduced, even to just 25, which is a very low number compared to the literature [18, 35], the segmentation results remain stable in both datasets, with a degradation of only up to 2 mIoU for Vaihingen and 1 mIoU for Cityscapes "Bus". This reduction can speed up inference by a factor of four and provides a reasonable accuracy-to-runtime tradeoff.

The generation time of one sample, in seconds, is presented in Fig. 9(b). As can be observed, both graphs are linear, with different slopes. The main reason for this is the difference in image size (256 × 256 for Vaihingen and 128 × 128 for Cityscapes "Bus"). Another, minor reason is the difference in the number of RRDB blocks in this experiment.

6. Conclusions

A wealth of methods has been applied to image segmentation, including active contours and their deep variants, encoder-decoder architectures, and U-Nets, which - together with more recent, transformer-based methods - represent a leading approach. In this work, we propose utilizing the state-of-the-art image generation technique of diffusion models. Our diffusion model employs a U-Net architecture, which is used to incrementally improve the obtained generation, similarly to other recent diffusion models.

In order to condition on the input image, we generate another encoding path, which is similar to U-Net's encoder-decoder use in conventional image segmentation methods. The two encoder pathways are merged by summing the activations early in the U-Net's encoder.

Using our approach, we obtain state-of-the-art segmentation results on a diverse set of benchmarks, including street-view images, aerial images, and microscopy.

7. Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974).

References

[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859–868, 2018.
[2] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[4] Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. Hardnet: A low memory traffic network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3552–3561, 2019.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[6] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
[7] Dominic Cheng, Renjie Liao, Sanja Fidler, and Raquel Urtasun. Darnet: Deep active ray network for building segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7431–7439, 2019.
[8] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
[11] Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei, Zhenhua Chai, Junfeng Luo, and Xiaolin Wei. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9716–9725, 2021.
[12] Volker Fischer, Mummadi Chaithanya Kumar, Jan Hendrik Metzen, and Thomas Brox. Adversarial examples for semantic image segmentation. arXiv preprint arXiv:1703.01101, 2017.
[13] Jun Fu, Jing Liu, Jie Jiang, Yong Li, Yongjun Bao, and Hanqing Lu. Scene segmentation with dual relation-aware attention network. IEEE Transactions on Neural Networks and Learning Systems, 32(6):2547–2560, 2020.
[14] Shir Gur, Tal Shaharabany, and Lior Wolf. End to end trainable active contours via differentiable rendering. arXiv preprint arXiv:1912.00367, 2019.
[15] Ali Hatamizadeh, Debleena Sengupta, and Demetri Terzopoulos. End-to-end trainable deep active contour models for automated image segmentation: Delineating buildings in aerial imagery. In European Conference on Computer Vision, pages 730–746. Springer, 2020.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[19] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.
[20] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. arXiv e-prints, pages arXiv–2102, 2021.
[21] Chin-Wei Huang, Jae Hyun Lim, and Aaron C Courville. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34, 2021.
[22] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
[23] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630, 2021.
[24] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
[25] Neeraj Kumar, Ruchika Verma, Deepak Anand, Yanning Zhou, Omer Fahri Onder, Efstratios Tsougenis, Hao Chen, Pheng-Ann Heng, Jiahui Li, Zhiqiang Hu, et al. A multi-organ nucleus segmentation challenge. IEEE Transactions on Medical Imaging, 39(5):1380–1391, 2019.
[26] Neeraj Kumar, Ruchika Verma, Sanuj Sharma, Surabhi Bhargava, Abhishek Vahadane, and Amit Sethi. A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Transactions on Medical Imaging, 36(7):1550–1560, 2017.
[27] Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 2022.
[28] Justin Liang, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, and Raquel Urtasun. Polytransform: Deep polygon transformer for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9131–9140, 2020.
[29] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5257–5266, 2019.
[30] Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Peng Liu, and Zhou Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. arXiv preprint arXiv:2105.02446, 2021.
[31] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[33] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408, 2016.
[34] Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Renjie Liao, and Raquel Urtasun. Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8877–8885, 2018.
[35] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[36] Yuval Nirkin, Lior Wolf, and Tal Hassner. Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4061–4070, 2021.
[37] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR, 2021.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[39] Franz Rottensteiner, Gunho Sohn, Markus Gerke, and Jan D Wegner. ISPRS semantic labeling contest. ISPRS: Leopoldshöhe, Germany, 2014.
[40] Chitwan Saharia, William Chan, Huiwen Chang, Chris A Lee, Jonathan Ho, Tim Salimans, David J Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. arXiv preprint arXiv:2111.05826, 2021.
[41] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
[42] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600, 2021.
[43] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[44] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262–7272, 2021.
[45] Jeya Maria Jose Valanarasu, Poojan Oza, Ilker Hacihaliloglu, and Vishal M Patel. Medical transformer: Gated axial-attention for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 36–46. Springer, 2021.
[46] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In European Conference on Computer Vision, pages 108–126. Springer, 2020.
[47] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
[48] Xiao Xiao, Shen Lian, Zhiming Luo, and Shaozi Li. Weighted res-unet for high-quality retina vessel segmentation. In 2018 9th International Conference on Information Technology in Medicine and Education (ITME), pages 327–331. IEEE, 2018.
[49] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1369–1378, 2017.
[50] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34, 2021.
[51] Yuan Xue, Tao Xu, Han Zhang, L Rodney Long, and Xiaolei Huang. Segan: adversarial network with multi-scale l1 loss for medical image segmentation. Neuroinformatics, 16(3):383–392, 2018.
[52] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[53] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 3–11. Springer, 2018.