Neural Scene Decoration From A Single Photograph
Deakin University
1 Introduction
Furnishing and rendering indoor scenes is a common task for interior design. This
is typically performed by professional designers who carefully craft a conceptual
design and furniture placement, followed by extensive modeling via CAD/CAM
software to finally create a realistic image using a powerful rendering engine.
Such a task often requires extensive background knowledge and experience in
the field of interior design, as well as high-end professional software. This makes
it difficult for lay users to design their own scenes from scratch.
On the other hand, various image synthesis methods have been developed
and become popular in the field. Different types of deep neural networks - typically
in the form of auto-encoders [20] and generative adversarial networks [5] - have
shown impressive capability in synthesizing realistic images. In this paper, we leverage
such generative models for scene decoration. Our contributions are summarized as follows:
– A new task on scene synthesis and modeling, which we name neural scene
decoration: synthesizing a realistic, furnished and decorated image from an
empty background image of a scene and an object layout.
– A neural network architecture that enables neural scene decoration in a
simple and effective manner.
– Extensive experiments that demonstrate the performance of our proposed
method and its potential for future research. Quantitative evaluation re-
sults show that our method outperforms previous image translation works.
Qualitative results also confirm the ability of our method to generate
realistic-looking scenes.
2 Related Work
Prior to the resurgence of deep learning, editing a single photograph has often
been done by building a physical model of the scene in the photo for object
insertion and compositing.
3 Proposed Method
Our goal is to develop a neural scene decoration (NSD) system that produces a
decorated scene image Ŷ ∈ R3×W ×H , given a background image X ∈ R3×W ×H ,
and an object layout L for a list of objects to be added in the scene (see
Figure 2). Note that both the generated image Ŷ and the background image X
are captured from the same scene. Ideally, the NSD system should be able to
make Ŷ realistically decorated with the objects specified in L, while keeping the
background of Ŷ consistent with the provided background image X.
The format of the object layout L is crucial in determining how easy and
effective the NSD system is. In SPADE [34], synthesized objects are labeled in
a pixel-wise fashion. This manner, however, requires detailed labeling, which is
not effective in describing complicated objects and also takes considerable effort.
In our work, we propose to represent L using two simple yet effective formats: box
labels and point labels. Specifically, let O = {o1, ..., oN} be a set of objects added
to X; these objects belong to K different classes, e.g., chair, desk, lamp, etc.
Each object oi is associated with a class vector Ii ∈ {0, 1}K×1 and a layout map
fi ∈ R1×W×H. The class vector Ii is defined such that Ii(k) = 1 if k is the class
of oi, and Ii(k) = 0 otherwise.
Box label. Like BachGAN [24], a box label indicates the presence of an object by
its bounding box. For each object oi , the layout map fi is constructed by simply
filling the entire area of the bounding box of oi with 1s, and elsewhere with 0s.
Mathematically, we define:
f_i(1, x, y) = \begin{cases} 1, & \text{if } (x, y) \in \mathrm{bounding\_box}(o_i) \\ 0, & \text{otherwise} \end{cases}   (2)
Box label format has the advantage of indicating the boundary where objects
should be inserted and allowing finer control over the rough shape of objects.
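As a concrete illustration, the following NumPy sketch builds the layout map fi of Eq. (2) for one object; the function name and the (x0, y0, x1, y1) pixel-coordinate box convention are assumptions for illustration, not the authors' exact data format.

```python
import numpy as np

def box_label_map(bbox, width, height):
    """Binary layout map f_i for one object: 1 inside its bounding box, 0 elsewhere (Eq. (2)).

    bbox is assumed to be (x0, y0, x1, y1) in pixel coordinates.
    The returned map has shape (1, H, W).
    """
    x0, y0, x1, y1 = bbox
    f_i = np.zeros((1, height, width), dtype=np.float32)
    f_i[0, y0:y1, x0:x1] = 1.0
    return f_i

# Example: a 40 x 30 box placed on a 256 x 256 canvas.
f = box_label_map((100, 120, 140, 150), width=256, height=256)
```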
Fig. 1. Overview of our generator (left) and discriminator (right). Convolution layers
labeled with Dn2 halve the spatial dimensions of input feature maps using stride 2.
image, and finally performs a convolution (Conv) with batch normalization (BN)
and GLU. Following [28], we use skip-layer excitation (SLE) modules that modulate the
output of the last two generator blocks with that of the first two blocks. The
resulting feature map is passed through a final convolution layer to produce the
synthesized image Ŷ .
To enforce the integration of object layout in the generation process, we insert
L into every generator block in a bottom-up manner. Specifically, down-scaled
versions of L are created by consecutive average pooling layers and then fused
with corresponding feature maps at different resolutions. Likewise, to keep the
background of the synthesized image Ŷ consistent with the input image X,
we downscale X to the same set of resolutions and insert these scaled images into
every generator block. We refer readers to the supplementary material for further
details of our architecture.
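A minimal sketch of this multi-scale injection, assuming square feature maps and using adaptive average pooling in place of the consecutive pooling layers; the resolutions and channel counts below are illustrative, and the block internals (SPADE fusion, concatenation) are abstracted away.

```python
import torch
import torch.nn.functional as F

def build_pyramid(t, resolutions):
    """Downscale a tensor t (B x C x H x W) to each target resolution with average pooling."""
    return {r: F.adaptive_avg_pool2d(t, output_size=r) for r in resolutions}

# Hypothetical generator block resolutions.
resolutions = [8, 16, 32, 64, 128]
L = torch.randn(1, 12, 256, 256)   # object layout maps, one channel per class (assumed)
X = torch.randn(1, 3, 256, 256)    # background image

L_pyramid = build_pyramid(L, resolutions)
X_pyramid = build_pyramid(X, resolutions)

# Each generator block at resolution r would then fuse its feature map with
# L_pyramid[r] (e.g., via SPADE) and X_pyramid[r] (e.g., via concatenation).
```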
3.3 Training
We train our NSD system using the conventional GAN training procedure, i.e.,
jointly optimizing both the generator G and discriminator D. Specifically, we
make use of the hinge adversarial loss function to train D as:
\mathcal{L}_{D} = \mathbb{E}_{Y}\left[\max(0, 1 + D_{adv}(Y))\right] + \mathbb{E}_{\hat{Y},L}\left[\max(0, 1 - D_{adv}(\hat{Y}) - \lambda_{obj} D_{obj}(\hat{Y}, L))\right]   (4)
where Y is the decorated scene image paired with X (from training data) and
Ŷ = G(X, L). Y is also the source image where objects in L are defined.
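For reference, a hedged PyTorch-style sketch of the discriminator objective in Eq. (4); d_adv_* and d_obj_fake stand for the outputs of the two discriminator heads and are assumed to be one scalar score per image, with the sign convention taken directly from the equation above.

```python
import torch

def discriminator_hinge_loss(d_adv_real, d_adv_fake, d_obj_fake, lambda_obj=1.0):
    """Hinge loss for the discriminator, mirroring Eq. (4).

    d_adv_real : D_adv(Y) on real decorated images.
    d_adv_fake : D_adv(Y_hat) on generated images.
    d_obj_fake : D_obj(Y_hat, L) on generated images and their layouts.
    """
    loss_real = torch.clamp(1.0 + d_adv_real, min=0.0).mean()
    loss_fake = torch.clamp(1.0 - d_adv_fake - lambda_obj * d_obj_fake, min=0.0).mean()
    return loss_real + loss_fake
```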
The generator G is updated to push the discriminator’s output towards the
real-image direction.
4 Experiments
4.1 Dataset
We chose to conduct experiments on the Structured3D dataset [52], as it is, to
the best of our knowledge, the only publicly available dataset with pre-rendered
image pairs of empty and decorated scenes. The Structured3D dataset consists
of 78,463 pairs of decorated and empty indoor images, rendered from a total of
3,500 distinct 3D scenes. Following the recommendation of the dataset authors,
we use 3,000 scenes for training and the remaining scenes for validation.
4.2 Baselines
Table 2. Comparison of the use of Dadv only, and the combination of Dadv and Dobj
as in our design.
We evaluate image quality with the Fréchet Inception Distance (FID) [8] and the
Kernel Inception Distance (KID) [2]. Both FID and KID measure the dissimilarity
between inception representations [39] of a synthesized output and its real version;
FID uses the Wasserstein distance and KID a polynomial kernel as the dissimilarity
metric [33].
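As a usage sketch, assuming the torch-fidelity package of [33] and two image directories whose paths are illustrative; the keyword arguments follow its documented calculate_metrics API, but should be checked against the installed version.

```python
import torch_fidelity

# Compare a folder of synthesized images against a folder of real decorated images.
metrics = torch_fidelity.calculate_metrics(
    input1='outputs/generated',    # hypothetical path to generated images
    input2='data/real_decorated',  # hypothetical path to ground-truth images
    cuda=True,
    fid=True,
    kid=True,
)
print(metrics)  # dictionary containing the FID and KID values
```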
For quantitative evaluation purpose, we used pairs of background and deco-
rated images from ground-truth. We also extracted bounding boxes and object
masks of decorated objects from the ground-truth to construct box labels and
point labels. We report the performance of our method and other baselines in
Table 1. As shown in the results, our method outperforms all the baselines on
both the bedroom and living room test sets, with both the box label and point label
formats, in terms of the FID metric. The same holds for the KID metric, with the
only exception that He et al.’s method [6] is ranked best with box label format
for bedroom scenes.
Our method also has a computational advantage. Specifically, the BachGAN
baseline took roughly the same amount of training time as our method but
required four GPUs. In contrast, our method produces even better results
using fewer computational resources.
Table 1 also shows that the point label scheme slightly outperforms the box label
scheme. However, as discussed in the next section, each scheme is better suited to
specific types of objects. From a usage perspective, the box label format gives the
user stronger control over how a decorated object appears, while the point label
format offers more flexibility and autonomy to the NSD system.
Fig. 2. Generation results of our method and other baselines, using box label format
(the first two rows) and point label format (the last two rows). For point label format,
the center and radius of each circle represent the location ci and size si of an object
(see Eq. (3)). Best viewed with zoom.
In this experiment, we examine the role of the additional discriminator Dobj. Recall
that Dobj is branched off from Dadv and combines L at various scales (see
Figure 1). To validate the role of Dobj , we amended the architecture of the
discriminator D by directly concatenating the object layout L with the decorated
image Y to make the input for D, like the designs in [34] and [24]. Experimental
results are in Table 2, which clearly confirms the superiority of our design for
the discriminator (i.e., using both Dadv and Dobj ) over the use of Dadv only.
Fig. 3. Generation results (from the same input) using box label format (top row) and
point label format (bottom row). While box label format suits small and relatively
fixed-size objects, point label format is more flexible to describe large objects whose
dimensions can be adjusted automatically.
Table 3. Performance of our method and the baselines using default object sizes. Lower
FID/KID values indicate better performance.
Box label vs. point label. Figure 3 visualizes some generation results using box
label and point label format on the same input background. In this experiment,
on each scene, box labels and point labels were derived from the same set of
objects. We observed that some object classes are better suited to a particular
label format. For example, small objects and those whose aspect ratio can be
varied (e.g., pictures can appear in either portrait or landscape shape) should be
described using box label format. On the other hand, objects that often occupy
large areas in a scene, such as beds and sofas, tend to have less distortion and
clearer details when represented with point label format.
Table 4. Performance using different methods for setting object sizes. (*) indicates
the default setting used in our experiments.
In the quantitative assessment, the sizes of the decorated objects in the point label
format (i.e., si in Eq. (3)) were retrieved from the ground truth. In reality, however,
this information is provided by the user. In this experiment, we investigate a
simpler input for the point label format where the object sizes are set to default
values rather than given by either the ground truth or the user. In particular, for each
decorated object oi, we set the size si to the median size of all objects in the training
data that have the same class as oi. The ground-truth value of si for each object is
given by $s_i = m\sqrt{A_i}$, where m is a fixed constant and Ai is the area (i.e., the
number of pixels) of the object mask of oi. We set m = 2.5 in our experiments.
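The per-class default sizes can be computed as in the following sketch, assuming per-object class labels and mask areas are available from the training annotations; the function and variable names are illustrative.

```python
import numpy as np
from collections import defaultdict

def default_sizes(object_classes, mask_areas, m=2.5):
    """Default size per class: the median of s_i = m * sqrt(A_i) over all training
    objects of that class."""
    sizes_per_class = defaultdict(list)
    for cls, area in zip(object_classes, mask_areas):
        sizes_per_class[cls].append(m * np.sqrt(area))
    return {cls: float(np.median(sizes)) for cls, sizes in sizes_per_class.items()}

# Example: two chairs and one bed with mask areas given in pixels.
defaults = default_sizes(['chair', 'chair', 'bed'], [900, 1600, 40000])
```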
We applied this setting to all the baselines and report their performance
in Table 3. We observed that, compared with using ground-truth values for the
object sizes (see Table 1), setting the object sizes to default values degrades
the performance of all the competitors. However, our method still consistently
outperforms all the baselines on all the test sets, using both FID and KID metrics.
Fig. 5. Generation results under different settings for objects’ locations and sizes.
Manipulated objects are marked with “X”. Top two rows: we change the location of a
painting by moving its bounding box towards the left. Bottom two rows: we adjust the
size of a ceiling light and a TV by changing the radius at their centers.
We illustrate several results of this setting in Figure 4. In Table 4, we further
compare different variants of the point label format, including using the mean and
the median to compute si, as well as alternative values for m. The results confirm
that our choice of m gives the best performance.
Fig. 7. User study results: preference with respect to (a) different methods, (b) box
labels vs. point labels, and (c) different room types (BR = bedroom, LR = living room).
5 Conclusion
We introduced a new task called neural scene decoration. The task aims to render
an empty indoor space with furniture and decorations specified in a layout map.
To realize this task, we propose an architecture conditioned on a background
image and an object layout map where decorated objects are described via either
bounding boxes or rough locations and sizes. We demonstrated the capability of
our method for scene design over previous works on the Structured3D dataset.
Neural scene decoration is thus a step toward building the next generation
of user-friendly interior design and rendering applications. Future work may
include better support for sequential object generation [41], interactive scene
decoration, and the integration of more advanced network architectures.
Acknowledgment. This paper was partially supported by an internal grant
from HKUST (R9429) and the HKUST-WeBank Joint Lab.
References
1. Bau, D., Strobelt, H., Peebles, W.S., Wulff, J., Zhou, B., Zhu, J., Torralba, A.:
Semantic photo manipulation with a generative image prior. ACM Transactions on
Graphics 38(4), 1–11 (2019) 3
2. Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying MMD GANs.
In: Proceedings of the International Conference on Learning Representations (2018)
9
3. Fisher, M., Ritchie, D., Savva, M., Funkhouser, T.A., Hanrahan, P.: Example-based
synthesis of 3d object arrangements. ACM Transactions on Graphics 31(6), 1–11
(2012) 3
4. Germer, T., Schwarz, M.: Procedural arrangement of furniture for real-time walk-
throughs. Computer Graphics Forum 28(8), 2068–2078 (2009) 4
5. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proceedings of the
Advances in Neural Information Processing Systems (2014) 1, 3
6. He, S., Liao, W., Yang, M., Yang, Y., Song, Y.Z., Rosenhahn, B., Xiang, T.:
Context-aware layout to image generation with enhanced object appearance. In:
CVPR (2021) 3, 8, 9, 22
7. Henderson, P., Subr, K., Ferrari, V.: Automatic generation of constrained furniture
layouts. arXiv preprint arXiv:1711.10939 (2017) 4
8. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs
trained by a two time-scale update rule converge to a local nash equilibrium. In:
Proceedings of the Advances in Neural Information Processing Systems (2017) 8
9. Hu, R., Huang, Z., Tang, Y., van Kaick, O., Zhang, H., Huang, H.: Graph2Plan:
Learning floorplan generation from layout graphs. ACM Transactions on Graphics
39(4), 118–128 (2020) 4
10. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi-
tional adversarial networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (2017) 3, 19, 22
11. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for
improved quality, stability, and variation. In: Proceedings of the International
Conference on Learning Representations (2018) 3
12. Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., Aila, T.: Training
generative adversarial networks with limited data. In: Proceedings of the Advances
in Neural Information Processing Systems (2020) 3, 29
13. Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T.:
Alias-free generative adversarial networks. arXiv preprint arXiv:2106.12423 (2021)
3
14. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative
adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (2019) 3
15. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for genera-
tive adversarial networks. IEEE Transactions on Pattern Analysis and Machine
Intelligence 43(12), 4217–4228 (2021) 3
16. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and
improving the image quality of StyleGAN. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (2020) 3, 29
17. Karsch, K.: Inverse Rendering Techniques for Physically Grounded Image Editing.
Ph.D. thesis, University of Illinois at Urbana-Champaign (2015) 3
18. Karsch, K., Hedau, V., Forsyth, D., Hoiem, D.: Rendering synthetic objects into
legacy photographs. ACM Transactions on Graphics 30(6), 1–14 (2011) 3
19. Karsch, K., Sunkavalli, K., Hadap, S., Carr, N., Jin, H., Fonte, R., Sittig, M., Forsyth,
D.: Automatic scene inference for 3D object compositing. ACM Transactions on
Graphics 33(3), 1–15 (2014) 3
20. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of
the International Conference on Learning Representations (2014) 1, 3
21. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S.,
Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M., Fei-Fei, L.: Visual genome:
Connecting language and vision using crowdsourced dense image annotations. In:
International Journal of Computer Vision (2017) 3
22. Li, J., Yang, J., Hertzmann, A., Zhang, J., Xu, T.: LayoutGAN: Generating
graphic layouts with wireframe discriminators. In: Proceedings of the International
Conference on Learning Representations (2019) 3
23. Li, M., Patil, A.G., Xu, K., Chaudhuri, S., Khan, O., Shamir, A., Tu, C., Chen, B.,
Cohen-Or, D., Zhang, H.R.: GRAINS: generative recursive autoencoders for indoor
scenes. ACM Transactions on Graphics 38(2), 1–16 (2019) 4
24. Li, Y., Cheng, Y., Gan, Z., Yu, L., Wang, L., Liu, J.: BachGAN: High-resolution
image synthesis from salient object layout. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (2020) 3, 5, 8, 9, 10, 11, 22
25. Li, Z., Wu, J., Koh, I., Tang, Y., Sun, L.: Image synthesis from layout with
locality-aware mask adaption. In: ICCV (2021) 3, 8
26. Liang, Y., Fan, L., Ren, P., Xie, X., Hua, X.S.: Decorin: An automatic method
for plane-based decorating. IEEE Transactions on Visualization and Computer
Graphics (2021) 4
27. Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P.,
Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: Common objects in context.
In: ECCV (2014) 3
28. Liu, B., Zhu, Y., Song, K., Elgammal, A.: Towards faster and stabilized GAN train-
ing for high-fidelity few-shot image synthesis. In: Proceedings of the International
Conference on Learning Representations (2021) 6, 26
29. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784 (2014) 3
30. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from RGBD images. In: Proceedings of the European Conference on
Computer Vision (2012) 27
31. Nauata, N., Chang, K.H., Cheng, C.Y., Mori, G., Furukawa, Y.: House-GAN:
Relational generative adversarial networks for graph-constrained house layout
generation. In: Proceedings of the European Conference on Computer Vision (2020)
3, 4
32. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever,
I., Chen, M.: GLIDE: towards photorealistic image generation and editing with
text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021) 8, 20, 21
33. Obukhov, A., Seitzer, M., Wu, P.W., Zhydenko, S., Kyl, J., Lin, E.Y.J.: High-fidelity
performance metrics for generative models in pytorch (2020) 9
34. Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-
adaptive normalization. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (2019) 3, 4, 5, 8, 9, 10, 11, 22, 26
35. Ritchie, D., Wang, K., Lin, Y.a.: Fast and flexible indoor scene synthesis via
deep convolutional generative models. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (2019) 4
36. Schönfeld, E., Sushko, V., Zhang, D., Gall, J., Schiele, B., Khoreva, A.: You
only need adversarial supervision for semantic image synthesis. In: International
Conference on Learning Representations (2021) 19
37. Sun, W., Wu, T.: Image synthesis from reconfigurable layout and style. In: ICCV
(2019) 3
38. Sun, W., Wu, T.: Learning layout and style reconfigurable gans for controllable
image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI) (2021) 8
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (2015) 9
40. Tang, H., Xu, D., Sebe, N., Wang, Y., Corso, J.J., Yan, Y.: Multi-channel attention
selection GAN with cascaded semantic guidance for cross-view image translation. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2019) 3
41. Turkoglu, M.O., Thong, W., Spreeuwers, L., Kicanaoglu, B.: A layer-based sequential
framework for scene generation with gans. In: AAAI Conference on Artificial
Intelligence (2019) 14
42. Wang, K., Lin, Y.A., Weissmann, B., Savva, M., Chang, A.X., Ritchie, D.: Planit:
Planning and instantiating indoor scenes with relation graph and spatial prior
networks. ACM Transactions on Graphics 38(4), 1–15 (2019) 4
43. Wang, K., Savva, M., Chang, A.X., Ritchie, D.: Deep convolutional priors for indoor
scene synthesis. ACM Transactions on Graphics 37(4), 1–14 (2018) 4
44. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution
image synthesis and semantic manipulation with conditional GANs. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 3, 22
45. Yang, C., Shen, Y., Zhou, B.: Semantic hierarchy emerges in deep generative
representations for scene synthesis. International Journal of Computer Vision
129(5), 1451–1466 (2020) 3
46. Yu, L.F., Yeung, S.K., Tang, C.K., Terzopoulos, D., Chan, T.F., Osher, S.J.: Make
it home: automatic optimization of furniture arrangement. ACM Transactions on
Graphics 30(4), 1–11 (2011) 4
47. Yu, L.F., Yeung, S.K., Terzopoulos, D.: The clutterpalette: An interactive tool
for detailing indoor scenes. IEEE Transactions on Visualization and Computer
Graphics (2015) 4
48. Zhang, E., Cohen, M.F., Curless, B.: Emptying, refurnishing, and relighting indoor
spaces. ACM Transactions on Graphics 35(6), 1–14 (2016) 4
49. Zhang, S.K., Li, Y.X., He, Y., Yang, Y.L., Zhang, S.H.: Mageadd: Real-time
interaction simulation for scene synthesis. In: ACM International Conference on
Multimedia (2021) 4
50. Zhang, Z., Yang, Z., Ma, C., Luo, L., Huth, A., Vouga, E., Huang, Q.: Deep gener-
ative modeling for scene synthesis via hybrid representations. ACM Transactions
on Graphics 39(2), 1–21 (2020) 4
51. Zhao, S., Liu, Z., Lin, J., Zhu, J.Y., Han, S.: Differentiable augmentation for data-
efficient GAN training. In: Proceedings of the Conference on Neural Information
Processing Systems (2020) 28
52. Zheng, J., Zhang, J., Li, J., Tang, R., Gao, S., Zhou, Z.: Structured3D: A large
photo-realistic dataset for structured 3D modeling. In: Proceedings of the European
Conference on Computer Vision (2020) 7
53. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using
cycle-consistent adversarial networks. In: Proceedings of the IEEE International
Conference on Computer Vision (2017) 3
54. Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman,
E.: Toward multimodal image-to-image translation. In: Proceedings of the Advances
in Neural Information Processing Systems (2017) 3
Supplementary Material
A Qualitative Results
Our method achieves diversity in the following ways. First, the input layout
controls the output diversity. One can change the input layout to change how the
scene is decorated. Second, given the same background and layout, diversity in
the appearance of scene objects can still be obtained. Technically, this is achieved
by changing the initial latent code of the generator and finetuning the generator.
A current limitation is that our model ignores the noise vector, which limits
diversity. This problem is also reported in pix2pix [10]. Revising our network
architecture for greater diversity is left as future work, e.g., adopting the noise
injection of OASIS [36].
Same background with same layouts. Here we show results of our method using
the same background and layout. The diversity is now controlled by the initial
latent code of the generator. The results are presented in Figure 9. As can be
seen, our model can provide plausible diversity in the object appearance.
We analyze the effect of the background on the generation of the furniture. As a
simple example, we modify the background image by enlarging the left white backdrop.
In Figure 10, we see that objects like paintings conform to this structural change in
the background, while other objects like beds only change in appearance. Quantifying
the impact of background images is left for future work.
Fig. 8. Diversity evaluation. Generation results under the same background image X
with different object layouts.
Fig. 9. Diversity evaluation. Generation results from the same background image X
with different model weights.
We provide more qualitative results of our method and all baselines (SPADE [34],
BachGAN [24], He et al. [6]) in Figure 13 and Figure 14. In general, we visually
found that bedroom images are often generated with higher quality than living
room images. This is because bedroom scenes have less variation in their structure
and typically contain fewer decorated objects, leading to lower complexity in scene
generation compared to living room scenes. Additionally, we observed that the box
label format shows more advantages in generating small and relatively fixed-size
objects. The point label format, on the other hand, allows
flexibility in determining the object size and thus works well with large and
shape-variable objects.
B Ablation Study
In addition to the ablation study provided in the main paper, here we further
explain our discriminator in detail. In the main paper, we take the generated image
Y as input to the discriminator. This is known as the unconditional discriminator
as it does not depend on the input X. In fact, image translation methods
like pix2pix [10], pix2pixHD [44] and SPADE [34] showed that a conditional
version of the discriminator can yield better image fidelity. In particular, the
conditional discriminator takes a channel-wise concatenation of the original and
generated images (the background X and the generated image Y in our case) as
input. Here we provide an experiment to compare the use of
unconditional and conditional discriminator in our case. Comparison results are
reported in Table 5. As shown in the results, the unconditional discriminator
has better results in most cases. The major difference between our method and
image translation methods lies in our data: the domain gap between the background
and the decorated scene is less significant than in the data tested by image
translation methods, i.e., sketches or semantic maps vs. real images.
Therefore, we adopted the unconditional discriminator in our work.
C User Study
Our user study has 26 participants; each participant was asked 48 questions.
For each question, we presented two decorated images, one image was generated
with our method and the other one was generated by a baseline. Both images
were generated from the same input scene. We asked each participant to choose
the image that they considered to be more natural and realistic. Each question
belongs to one of 12 test settings, which is a combination of the following factors:
3 baselines to compare, 2 label formats (box label / point label), and 2 test cases
(bedroom / living room). We randomly picked 4 samples for each setting, i.e.,
each participant was presented with a total of 48 image pairs in random order.
The order in which the two images in a pair were shown in each question was also
randomized.
In general, our model is often preferred on images generated with point label
format, especially in the bedroom test case with fewer objects and clutter. When
using the box label format, our method still produces results of quality on par
with the baselines.
D Network Architecture
D.1 Generator
Table 6 describes the input and output dimensions used in the sequence of
generator blocks in our generator. For each generator block with vi input and
vo output channels, the object layout L first modulates the feature map using
a SPADE residual block similar to [34], which consists of two consecutive SPADE
layers with ReLU activations, as well as a skip connection across the block.
Unlike [34], we do not add a convolutional layer after each SPADE layer in the
residual blocks. The number of channels remains vi before and after the
SPADE block, and the number of hidden channels in the SPADE layers is set to
vi/2. Following the SPADE block, we upsample the feature map by a factor of 2,
pass it through a convolutional layer with 2vo output channels and a batch norm layer,
and finally through a gated linear unit (GLU), following the convolutional block
implementation in [28]. All aforementioned convolutional layers have a kernel
size of 3 and padding size of 1.
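A minimal PyTorch sketch of such a generator block, under the assumptions stated here: the SPADE layer internals (shared conv, gamma/beta heads) follow the common SPADE formulation rather than the authors' exact implementation, and nearest-neighbor interpolation stands in for the upsampling operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADELayer(nn.Module):
    """Simplified SPADE layer: normalize the feature map, then modulate it with a
    scale and bias predicted from the (resized) layout map. Hidden width is v_i // 2."""
    def __init__(self, v_i, layout_channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(v_i, affine=False)
        hidden = v_i // 2
        self.shared = nn.Conv2d(layout_channels, hidden, kernel_size=3, padding=1)
        self.gamma = nn.Conv2d(hidden, v_i, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, v_i, kernel_size=3, padding=1)

    def forward(self, x, layout):
        layout = F.interpolate(layout, size=x.shape[-2:], mode='nearest')
        h = F.relu(self.shared(layout))
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

class GeneratorBlock(nn.Module):
    """One generator block: SPADE residual block (no conv after the SPADE layers),
    then upsample x2, conv to 2*v_o channels, BN, and a GLU that halves channels to v_o."""
    def __init__(self, v_i, v_o, layout_channels):
        super().__init__()
        self.spade1 = SPADELayer(v_i, layout_channels)
        self.spade2 = SPADELayer(v_i, layout_channels)
        self.conv = nn.Conv2d(v_i, 2 * v_o, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(2 * v_o)

    def forward(self, x, layout):
        h = F.relu(self.spade2(F.relu(self.spade1(x, layout)), layout))
        x = x + h                                   # skip connection across the SPADE block
        x = F.interpolate(x, scale_factor=2, mode='nearest')
        x = F.glu(self.bn(self.conv(x)), dim=1)     # GLU halves channels back to v_o
        return x

# Example: a block taking a 256-channel 8x8 map to 128 channels at 16x16.
block = GeneratorBlock(v_i=256, v_o=128, layout_channels=12)
out = block(torch.randn(1, 256, 8, 8), torch.randn(1, 12, 256, 256))
```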
The last two generator blocks use the SLE module in [28] to modulate the
feature maps with earlier, smaller-resolution feature maps. We pass the output
of the source generator block through an adaptive pooling layer to reduce its
spatial size to 4 × 4, then use a convolutional layer with a kernel size of 4 to
collapse the spatial dimensions, reducing the feature map to a 1D vector. This
is passed through a LeakyReLU (0.1) activation, 1 × 1 convolutional layer and
sigmoid function to obtain a 1D vector of size vo , where vo is the number of
output channels of the destination generator block. This vector is multiplied
channel-wise with the feature map inside the destination generator block, right
after the upsample operation.
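The SLE module described above can be sketched as follows; the intermediate channel width of the first convolution is an assumption, since the text only specifies the output vector size vo.

```python
import torch
import torch.nn as nn

class SLE(nn.Module):
    """Skip-layer excitation: squeezes a low-resolution source feature map into a
    per-channel gate and multiplies it into a high-resolution destination feature map."""
    def __init__(self, src_channels, dst_channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),                               # reduce spatial size to 4x4
            nn.Conv2d(src_channels, dst_channels, kernel_size=4),  # collapse spatial dims to 1x1
            nn.LeakyReLU(0.1),
            nn.Conv2d(dst_channels, dst_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, dst_features, src_features):
        return dst_features * self.gate(src_features)  # channel-wise modulation

# Example: gate a 128x128 feature map (64 channels) with an 8x8 feature map (512 channels).
sle = SLE(src_channels=512, dst_channels=64)
out = sle(torch.randn(1, 64, 128, 128), torch.randn(1, 512, 8, 8))
```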
D.2 Discriminator
E Implementation Details
E.1 Dataset
As presented in the main paper, the semantic labels for images in the Struc-
tured3D dataset are retrieved from the NYU-Depth V2 dataset [30]. Five classes:
window, door, wall, ceiling, and floor are considered as “background” and
appear in both empty and decorated scenes. The remaining classes repre-
sent “foreground” and are used in decorated scenes only. In addition, since the
distribution of the foreground classes is highly unbalanced, and some classes do
not really exist in the Structured3D dataset, only a subset of these foreground
classes were used in our experiments. We show the list of the foreground classes
used in our work in Table 8.
We carried out experiments on two subsets of the Structured3D dataset -
bedrooms and living rooms, as those sets contain enough samples for training
and testing. Note that each scene in the Structured3D dataset is associated with
a room type label, which allows us to identify bedroom and living room scenes.
To provide enough cues about the scene type, we filtered out images that contain
fewer than four objects. For each source image, we resized the image from its original
size of 1280 × 720 to 456 × 256, then cropped two 256 × 256 images from it. Images
were cropped such that at least 60% of the foreground object pixels were still present
in the cropped regions. We report the total number of training and test samples for
each set in Table 9.
Fig. 15. (a) Sample image with the corresponding object layout map, where each dot
shows the location and semantic label (via its color) of an object; the legend covers the
foreground classes cabinet, picture, bed, curtain, chair, television, sofa, nightstand,
table, lamp, desk, and pillow. (b) The same sample after translation and horizontal
flipping.
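A hedged sketch of the crop-acceptance test, using a binary foreground mask to check the 60% retention criterion; the fixed left/right window placement and function names are assumptions for illustration, not the authors' exact pipeline.

```python
import numpy as np

def crop_if_foreground_kept(image, fg_mask, x0, crop=256, keep_ratio=0.6):
    """Crop a crop x crop window starting at column x0 from a resized 456 x 256 image,
    accepting it only if at least keep_ratio of the foreground pixels survive."""
    window = fg_mask[:, x0:x0 + crop]
    if fg_mask.sum() == 0 or window.sum() / fg_mask.sum() < keep_ratio:
        return None  # reject this crop position
    return image[:, x0:x0 + crop]

# Example with illustrative values: foreground occupies columns 150-349.
image = np.zeros((256, 456, 3), dtype=np.uint8)
fg_mask = np.zeros((256, 456), dtype=np.float32)
fg_mask[100:200, 150:350] = 1.0
left = crop_if_foreground_kept(image, fg_mask, x0=0)           # rejected: ~53% of foreground kept
right = crop_if_foreground_kept(image, fg_mask, x0=456 - 256)  # accepted: 75% of foreground kept
```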