Image Generation Models From Scene Graphs and Layouts: A Comparative Analysis

Muhammad Umair Hassan, Saleh Alaliyat, Ibrahim A. Hameed
Norwegian University of Science and Technology

Journal of King Saud University - Computer and Information Sciences 35 (2023) 101543
DOI: 10.1016/j.jksuci.2023.03.021
Article history: Received 30 November 2022; Revised 21 March 2023; Accepted 27 March 2023; Available online 6 April 2023.

Keywords: Image generation; Image translation; Generative adversarial networks; Graph convolutional networks; Comparative analysis

Abstract

An image is the abstraction of a thousand words. The meaning and essence of complex topics, ideas, and concepts can often be conveyed more easily and effectively by a single image than by a lengthy verbal description. It is therefore essential to teach computers not only how to recognize and classify images but also how to generate them. Controlled image generation depicting complex scenes with multiple objects remains a challenging task in computer vision despite significant advances in generative modeling. Among the core challenges, scene graph-based and scene layout-based image generation is a significant problem that requires generative models to reason about object relationships and compositionality. Owing to their ease of use and lower time and labor costs, image generation models driven by scene graphs and layouts are proliferating. With a growing number of scene graph- and layout-to-image generation models, a common experimental evaluation methodology is needed to assess controlled image generation. To this end, we present a standard methodology to evaluate the performance of scene graph-based and scene layout-based image generation models. We perform a comparative analysis of these models to evaluate their complexity, and we analyze their different components on the Visual Genome and COCO-Stuff datasets. The experimental results show that scene layout-based image generation outperforms its scene graph-based counterpart in most quantitative and qualitative evaluations.

© 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
image captioning methods are used (Gao et al., 2018). In 3D scene understanding, scene graph-based methods can provide a numerically accurate analysis of object relationships in a 3D scene. A 3D scene graph helps in understanding complex indoor environments and other tasks (Armeni et al., 2019).

Image generation based on scene graphs can better deal with complex scenes containing multiple objects and desired layouts. Object representations, attributes, and pairwise relationships mainly reflect the semantically structured information in scene graphs. Thus, scene graphs provide favorable reasoning and information for vision-text tasks such as image generation. With such contemporary tasks come complex challenges. Scene graph-to-image generation (SG2I) is a crucial problem in the computer vision and computer graphics community. The images generated by existing SG2I-based algorithms are blurred, and the appearance of objects in the generated images is hardly understandable, which makes the SG2I task challenging.

The general process of image generation from scene graphs is based on two steps: first, a layout is generated from the scene graph, and then bounding boxes are constructed to convert the layout into images. Nevertheless, it is sometimes hard to design vocabulary-based scene graphs. To address this issue, a direct scene layout-to-image generation (SL2I) method was developed by Zhao et al. (2019b). The proposed method is based on the core concept of the SG2I generation process. As mentioned earlier, traditional SG2I methods first generate the layouts from scene graphs, but in the case of SL2I, the user only needs to define the bounding boxes with object categories, which are then used to generate the expected images.

Image generation works based on scene graphs and layouts are growing rapidly. Since SG2I and SL2I methods are becoming powerful enough, they may one day replace the work of scene-generation artists. There is a clear motivation to design a standardized methodology and analyze state-of-the-art methods to provide comprehensible insights on recent developments in SG2I and SL2I generation models. The standard input type for all such methods is the scene graph; SL2I methods also incorporate a scene graph-based strategy to synthesize images. The generation of images from scene graphs and layouts is more controllable, but it remains a one-to-many problem. In order to evaluate the existing proposed works, there is a great demand for a standard methodology to evaluate SG2I and SL2I methods. We, therefore, propose to compare the SG2I and SL2I methods in a unified pipeline to generate images. In this study, we aim to conduct a comparative analysis of image generation algorithms that are based on scene graphs and layouts. These methods, if improved, can assist the work of artists and graphic designers and can help crime scene investigators learn more about the evidence visually. In the algorithms addressed in this analysis, it is only required to define some objects and how they interact with each other; one can then generate an image based on the provided characteristics and relationships. Moreover, the automatic image generation process is advancing so vigorously that one day it might replace image and video search engines with customized image and video generation algorithms based on individual user preferences.

We investigated four SG2I and SL2I based algorithms, which are built upon the pioneer method (Johnson et al., 2018). We used identical parameters to test the compared methods. The different components of image generation methods are discussed and compared in this work.

Different surveys of text-based image synthesis using GANs have been conducted recently (He et al., 2021; Zhou et al., 2021; Shamsolmoali et al., 2021). However, to the best of our knowledge, no study has been conducted to provide a comprehensive comparative analysis of scene graph-based and scene layout-based image generation methods. There is a lack of comprehensive comparison of image generation evaluation metrics, remedies for diverse image synthesis, and information about stable training. This paper reports an experimental comparison of state-of-the-art scene graph and scene layout-based image generation methods and provides comprehensive knowledge about training SG2I and SL2I algorithms. The main contributions of this work are as follows:

- A standard methodology is proposed to conduct the comparative analysis of SG2I and SL2I methods. To this end, we apply identical configurations to train the SG2I and SL2I models from scratch on the Visual Genome Krishna et al. (2017) and COCO-Stuff Caesar et al. (2018) datasets and test the methods on different hyperparameters.
- A theoretical comparison of the different components of SG2I and SL2I based methods is presented to help better understand the complexity of image generation methods from scene graphs and layouts for implementation purposes.

The organization of the paper is as follows. Section 3 presents the background knowledge of the basic methods used in image generation. We provide a comprehensive overview of current state-of-the-art methods for image generation from texts, scene graphs, and scene layouts in Section 4. Section 5 provides the methodological description of SG2I and SL2I based methods. Section 7 is about the implementation details of the comparison methods used in this study. The results of the compared methods are discussed in Section 8, while concluding remarks are presented in Section 10.

3. Background

This section overviews relevant concepts often employed in constructing image generation pipelines. This unified pipeline consists of three main components: scene graphs, graph convolutional networks, and generative adversarial networks, as shown in Fig. 2.

3.1. Scene graphs

A scene graph is a graph data structure that encapsulates information related to objects and their relationships in a scene. It was initially proposed for the image retrieval task to search images containing particular semantic content (Johnson et al., 2015). As illustrated in Fig. 1, a complete scene graph includes objects and relationships and represents the semantics of the scene. Scene graphs are powerful enough to encode the 2D (Johnson et al., 2015) and 3D (Armeni et al., 2019) representations of images into their semantic elements without any constraints on object types, attributes, and relationships.

According to Johnson et al. (2018), a scene graph data structure G contains a set of object categories C and relationship categories R. The scene graph can be defined as a tuple G = (O, E), where O = {o_1, ..., o_n} is a set of objects, which may be, for example (see Fig. 1), persons (man, boy), places (patio), things (frisbee), and other parts (arms, legs), with each o_i ∈ C. E is a set of directed edges E ⊆ O × R × O, which are the relationships between objects, i.e., geometry (boy on the patio) and actions (man throwing frisbee), in the form of (o_i, r, o_j), where o_i, o_j ∈ O and r ∈ R.
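To make the tuple definition above concrete, the following minimal Python sketch encodes the example graph of Fig. 1 as plain data structures; the tiny vocabularies and variable names are illustrative only and are not the Visual Genome or COCO-Stuff category sets.

```python
# A minimal, illustrative encoding of the scene graph in Fig. 1 as plain Python
# data. Categories and names are toy examples taken from the text, not a real
# dataset vocabulary.
from typing import List, Tuple

# Object categories C and relationship categories R (toy subsets).
OBJECT_CATEGORIES: List[str] = ["man", "boy", "patio", "frisbee"]
RELATIONSHIP_CATEGORIES: List[str] = ["throwing", "on"]

# O: objects in the scene, each an instance of a category in C.
objects: List[str] = ["man", "boy", "patio", "frisbee"]

# E ⊆ O × R × O: directed (subject, predicate, object) triples.
edges: List[Tuple[str, str, str]] = [
    ("man", "throwing", "frisbee"),   # action relationship
    ("boy", "on", "patio"),           # geometric relationship
]

scene_graph = {"objects": objects, "relationships": edges}
print(scene_graph)
```

Such a structure maps directly onto the (subject, predicate, object) triples that the graph convolutional network described in Section 3.2 consumes.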
Fig. 1. The scene graph on the left contains objects (in blue) and their relationships (in purple). The objects of the scene graph refer to regions of the image on the right.
3.2. Graph convolutional networks

Graphs are commonly used to represent the relationships between data points in a vector space (x ∈ R^n), where x(i) is the value of the signal at node i. The graph neural network (GNN) was initially proposed to process data represented in a graph domain (Scarselli et al., 2008), and it can handle various types of graphs, such as cyclic, acyclic, directed, and undirected. A special type of GNN, called the graph convolutional network (GCN), uses convolutional aggregation. The convolutional layers in a GCN function similarly to the traditional 2D convolutional layers in a convolutional neural network (CNN).

In a GCN, a graph G is defined as a pair (V, E), where V denotes the set of nodes and E represents the set of edges. A spatial grid of feature vectors is used as input, and a new spatial grid of vectors is produced through convolutional aggregation, where the weights are shared across all neighborhoods. For all objects o_i ∈ O and edges (o_i, r, o_j) ∈ E, the input vectors v_i and v_r for each node and edge, respectively, are given in R^{D_in}, and the output vectors v'_i and v'_r for all nodes and edges are in R^{D_out}.

In the typical pipeline for scene graph to image generation, scene graphs are processed end-to-end using GCNs, which comprise multiple graph convolutional layers. At each node and edge, the layer maps input vectors of dimension D_in to new vectors of dimension D_out. The graph convolution applied in the GCN takes the input vectors v_i, v_r ∈ R^{D_in} for objects o_i ∈ O and edges (o_i, r, o_j) ∈ E. The output vectors in Eq. (1) are computed for all nodes and edges using three functions g_s, g_p, and g_o, which take the triple of vectors (v_i, v_r, v_j) as input. The three functions g_s, g_p, and g_o are implemented using a single network by concatenating the three input vectors, which are then fed to a multilayer perceptron (MLP). This results in new vectors for the subject o_i, predicate r, and object o_j. Therefore, the formulation can be stated as follows:

v'_i, v'_r ∈ R^{D_out}    (1)

v'_r = g_p(v_i, v_r, v_j)    (2)

where v'_i is the output vector for object o_i and v'_r is the output vector for the edge, as presented in Eq. (2). A candidate vector is produced for every edge starting at o_i and collected in a set V_i^s, and a candidate vector is produced for every edge terminating at o_i and collected in V_i^o; the candidates are then pooled into the new object vector. All methods reported in this study use a similar GCN formulation for scene graph to image generation.
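Concretely, a single triple-wise graph convolution layer of this kind can be sketched in PyTorch as follows. This is an illustrative sketch in the spirit of the sg2im formulation, not the exact implementation of any compared method; the hidden size, the mean pooling of candidate vectors, and the class name are assumptions.

```python
# Sketch of a triple-wise graph convolution: g_s, g_p, g_o realised as one MLP
# over the concatenated (v_i, v_r, v_j) triple, with mean pooling of the
# per-object candidate vectors (V_i^s and V_i^o).
import torch
import torch.nn as nn

class TripleGraphConv(nn.Module):
    def __init__(self, d_in: int, d_out: int, d_hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 3 * d_out),
        )
        self.d_out = d_out

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (num_objects, d_in); pred_vecs: (num_edges, d_in)
        # edges: (num_edges, 2) long tensor of (subject_idx, object_idx)
        s_idx, o_idx = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s_idx], pred_vecs, obj_vecs[o_idx]], dim=1)
        new_s, new_p, new_o = self.mlp(triples).split(self.d_out, dim=1)

        # Pool candidate vectors for each object over all edges in which it
        # appears as subject or object; a simple mean is used here.
        num_obj = obj_vecs.size(0)
        pooled = obj_vecs.new_zeros(num_obj, self.d_out)
        counts = obj_vecs.new_zeros(num_obj, 1)
        pooled.index_add_(0, s_idx, new_s)
        pooled.index_add_(0, o_idx, new_o)
        counts.index_add_(0, s_idx, torch.ones_like(counts[s_idx]))
        counts.index_add_(0, o_idx, torch.ones_like(counts[o_idx]))
        new_obj_vecs = pooled / counts.clamp(min=1)
        return new_obj_vecs, new_p
```

Objects that participate in no edge keep a zero vector in this sketch; a practical implementation would handle them explicitly.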
3.3. Generative adversarial networks

A generative adversarial network (GAN) is a deep neural network proposed by Goodfellow et al. (2014) to solve the generative modeling problem. The GAN consists of two adversarial models, i.e., a generator and a discriminator. The generator network G receives a collection of training examples as input and learns the probability distribution to produce data, whereas the discriminator network D distinguishes between real and fake data. Both networks are trained against each other in a min-max game, in which D divides the input into two classes (real or fake) and G tries to fool the discriminator. The most typical loss function for training a GAN is defined as follows:

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]    (3)

where a min-max strategy is played between the generator G and the discriminator D; V is the value function, p_data(x) is the actual data distribution from which the data x are drawn, p_z(z) is the distribution of the input noise variable, and D(x) is the probability that x is real. D is trained to maximize the probability of assigning correct labels to training examples and generated samples, whereas G is trained to minimize log(1 − D(G(z))).
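A minimal PyTorch sketch of one alternating optimization step for the objective in Eq. (3) is shown below; `generator`, `discriminator`, and the optimizers are placeholders rather than the networks used by any of the compared methods.

```python
# One alternating GAN training step for the min-max objective in Eq. (3),
# written with the standard binary cross-entropy formulation.
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real_images, z):
    # --- Discriminator: maximise log D(x) + log(1 - D(G(z))) ---
    d_opt.zero_grad()
    fake_images = generator(z).detach()
    real_logits = discriminator(real_images)
    fake_logits = discriminator(fake_images)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_loss.backward()
    d_opt.step()

    # --- Generator: fool the discriminator ---
    # (the common non-saturating form maximises log D(G(z)) instead of
    #  minimising log(1 - D(G(z))) directly)
    g_opt.zero_grad()
    fake_logits = discriminator(generator(z))
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```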
4. Literature review

As of today, few attempts have been made regarding image generation from scene graphs and layouts (Johnson et al., 2018; Li et al., 2019; Zhao et al., 2019b; Zhao et al., 2020). That makes image generation from scene graphs and layouts a bespoke area to work on. This section provides an overview of related work concerning image generation from text, scene graphs, and layouts. The goal of this section is to review image synthesis from different modalities.

4.1. GANs for image synthesis

Since the inception of GANs, image synthesis has become crucial for many real-world applications that generate synthetic data to represent different entities. Usually, a generator synthesizes an image, and a discriminator differentiates between a real and a fake image. Promising results have been achieved using GANs in different fields, such as image generation (Johnson et al., 2018; Isola et al., 2017), video prediction (Vondrick et al., 2016), texture synthesis (Zhao et al., 2021), natural language processing (Li et al., 2018), and image style transfer (Karras et al., 2019). GANs can be conditioned on category labels by providing the labels as an additional input to the generator G and discriminator D, resulting in conditional image synthesis (Gauthier, 2014; Mirza and Osindero, 2014). However, the discriminator can also be forced to predict the labels (Odena et al., 2017). GANs are mainly used in image synthesis, where they can produce better synthetic images than previous state-of-the-art approaches (Salimans et al., 2016; Mao et al., 2019).

Although the state-of-the-art performance of GANs is visible on the front end of image generation, the training process of GANs is often unstable. To overcome this, Wasserstein GAN (Arjovsky et al., 2017), an alternative to the traditional GAN (Goodfellow et al., 2014), was proposed to improve learning stability and mitigate mode collapse during training; it uses the Wasserstein distance metric. Another approach, Unrolled GAN (Metz et al., 2016), addresses the same problem by unrolling the optimization process. Image generation requires high-resolution and high-fidelity images. For these reasons, Progressive GAN (Karras et al., 2017) and BigGAN (Brock et al., 2018) are also a good fit: Progressive GAN generates high-resolution images by progressively adding new layers, starting from a low resolution until fine high-resolution detail is achieved, and BigGAN is trained at a very large scale on ImageNet to achieve fine high-resolution details in the generated images.

4.2. Image synthesis from scene graphs

Image generation based on scene graphs is a challenging task that requires a substantial effort to produce recognizable objects in complex scenes. Most of the methods proposed for image synthesis from scene graphs rely on the use of GCNs. One of the first attempts in this area considered the creation of an image retrieval framework based on a scene graph formulation (Johnson et al., 2015). Later, Johnson et al. (2018) proposed sg2im, the pioneer method for image generation using scene graphs. The sg2im method aims to solve the challenges of image generation faced by natural language/textual description methods regarding semantic entity information.

Image generation from scene graphs is also referred to as conditional image generation, in which the expected image is conditioned on some additional information. The seminal work of Johnson et al. (2018) focused on scene graphs that contain information about multiple objects in the foreground. They used a GCN which passes the scene graph information along its graph edges. A scene layout is constructed by predicting bounding boxes and segmentation masks, and finally, a cascaded refinement network (CRN) (Chen and Koltun, 2017) is used to convert the predicted layout into the expected image. Previous approaches typically involve encoding the scene graph into a vector representation and then decoding the vector to generate an image, which has several drawbacks, such as the loss of spatial information and the inability to handle complex relationships between objects. To address these issues, the authors propose a model that generates images directly from the scene graph.

To incorporate spatial information into the model, the authors (Johnson et al., 2018) introduce a new type of layer called a spatial feature transform (SFT) layer. This layer uses the spatial positions of objects in the scene graph to transform the feature maps generated by the generator. The SFT layer enables the model to generate images that accurately reflect the spatial relationships between objects in the scene.

Basically, GCNs are of two types: (i) spectral GCNs and (ii) spatial GCNs. The former were proposed by Henaff et al. (2015) to construct a deep architecture with low learning complexity by incorporating a graph estimation procedure for the classification problem. The latter are built upon classic CNNs and propagation models Zhang et al. (2019). Classic CNNs are extended to spatial GCNs by mapping the graph data into structure-aware convolution operations in both Euclidean and non-Euclidean spaces. Li et al. (2019) proposed a method that uses an external object crop that acts as an anchor to control the generation task. Their proposed method differs from Johnson et al. (2018) in three ways: first, external object crops are used; second, they use a Crop Refining Network to convert layout masks into images; third, a Crop Selector is introduced to choose the most suitable crops from the objects database automatically.

An interactive approach to generate images from scene graphs using recurrent neural networks, by preserving image content and modifying cumulative images, was proposed by Mittal et al. (2019). The method works in three stages, with increasing levels of complexity. At each stage, more nodes and edges are added to the scene graph to give more information to the GCN to generate the layout. They used a Scene Layout Network (SLN) on top of the architecture proposed in Johnson et al. (2018) that generates the layouts for predicting the bounding boxes. Their proposed method utilizes the GCN and an adversarial image translation method to generate images in an unsupervised manner. The generated images still need improvements: the images are blurry, and the objects are not generated according to the input scene graphs. Another work, Tripathi et al. (2019), also improved upon Johnson et al. (2018) by introducing a scene context network. That work added a context-aware loss for better image matching and introduced two new metrics for measuring the compliance of generated images with scene graphs: (i) relation score and (ii) mean opinion relation score.

GCN-based methods present some limitations, i.e., the GCN sometimes gets confused over relations among attributes, and finding the correct relation is also laborious. For example, (Man, right, Woman) and (Woman, left, Man) are both always true, but they will typically result in different illustrations in most cases. Herzig et al. (2020) proposed a canonical-representation-based method for scene graph to image generation that respects the relations of attributes by keeping the graph's information in a canonicalization process.

4.3. Image synthesis from layouts

Scene layouts are the intermediate states when generating images from scene graphs. Zhao et al. (2019b) proposed an explicit framework for directly generating images from layouts without the need to define scene graphs manually. The bounding boxes and object categories are specified at the beginning; then, a diverse set of images is generated based on the defined coarse layouts. They also made an extension (Zhao et al., 2020) to their earlier work (Zhao et al., 2019b) by explicitly defining the loss functions and extending the object feature map module with object-wise attention.

Sylvain et al. (2020) proposed an object-centric method to generate images from layouts. However, their proposed method incorporates scene graph-based retrieval to increase the fidelity of layouts. Therefore, their proposed method is a hybrid of scene graph-based and layout-based image generation mechanisms. They proposed an Object-Centric GAN (OC-GAN), which integrates a scene graph similarity module to learn the spatial representations of the objects in a scene layout. One limitation they identify is that, from a distance, the images generated by most SL2I generation methods appear to adhere to the input layouts and look realistic; however, a closer inspection reveals a lack of context awareness and location sensitivity. To overcome these limitations, the most recent work to date was proposed by He et al. (2021), who introduced a context-aware feature transformation module. In their proposed method, the generated features are updated for each object while computing the
Gram matrix for feature maps to capture the inter-feature correlations, which respects the location sensitivity of objects.

5. Methods comparison

This section presents the methodological descriptions of the SG2I and SL2I-based methods used in this analysis. We selected four methods that are built upon the pioneering SG2I method of Johnson et al. (2018) and that have identical input and training data.

Creating complex scene images from real-world objects requires a high-level understanding of computer vision and computer graphics. The goal of image generation from scene graphs is to take a scene graph as input and generate a realistic image corresponding to the objects and relationships described in the graph.

We propose following the typical pipeline to train all SG2I and SL2I generation methods for synthesizing images. For a scene-graph-based image generation method, the framework should be able to move from the graph domain to the image domain. The typical workflow employed for scene-graph-based image generation is illustrated in Fig. 2, where a GCN-based graph is constructed from the input scene graph. The coarse 2D structure of the scene layout is predicted by an object layout network based on the embedding vectors. Finally, an adversarial network generates the output image.

As the pioneer of scene graph to image generation works, Johnson et al. (2018) proposed to generate images from scene graphs using GCNs and a cascaded refinement network (CRN) Chen and Koltun (2017). The GCN processes the input graphs to generate a scene layout. The generated scene layout is based on the prediction of bounding boxes and segmentation masks of objects. In the final stage, the CRN generates the output image based on the predicted scene layout. During the generation of images, Johnson et al. (2018) faced three primary challenges: (i) the development of a method that can process graph-structured inputs; (ii) the generated images must comply with the objects and the relationships between objects specified in the graph; and (iii) the images generated using the GCN and CRN must be realistic.

5.1. sg2im

This section describes the sg2im method in detail. The image generation network f takes the input scene graph G and noise z and generates the output image I = f(G, z). The processing of G takes place along a GCN, which generates an embedding vector for each object. The embedding vectors of the GCN respect the relationships between objects in the scene graph through the prediction of a bounding box and segmentation mask for each object. After this step, a layout is generated, which acts as an intermediate element between the scene graph and the output image. The generation of the output image I is based on the CRN, and realistic images based on the scene graph are generated by adversarial training of the network f against a pair of discriminator networks D_img and D_obj. D_img encourages the image I to appear realistic, while D_obj encourages realistic and recognizable objects. Each node and edge of the input scene graph is converted from a categorical label to a dense vector through a learned embedding layer.

Cascaded Refinement Network. To synthesize an image with respect to a given layout, it is necessary to respect the object positions available in that layout. The cascaded refinement network (CRN) works on this pattern and consists of a series of convolutional refinement modules. In the CRN, the module inputs are concatenated channel-wise and passed to a pair of 3×3 convolutional layers. Each module receives as input the scene layout, down-sampled to the module's input resolution, together with the output of the previous module.
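As a concrete illustration of the intermediate layout step in such pipelines, the sketch below pastes mask-weighted object embeddings into their predicted boxes to form a coarse layout tensor that a refinement network can consume; the shapes, the additive pasting scheme, and the function name are simplifying assumptions, not the authors' code.

```python
# Compose per-object embeddings, predicted boxes and soft masks into a coarse
# scene-layout tensor (D, size, size). Illustrative sketch only.
import torch
import torch.nn.functional as F

def compose_layout(obj_embeddings, boxes, masks, size=64):
    """obj_embeddings: (N, D); boxes: (N, 4) as (x0, y0, x1, y1) in [0, 1];
    masks: (N, M, M) soft masks in [0, 1]."""
    num_obj, d = obj_embeddings.shape
    layout = obj_embeddings.new_zeros(d, size, size)
    for i in range(num_obj):
        x0, y0, x1, y1 = (boxes[i] * size).long().tolist()
        x0, y0 = max(0, min(x0, size - 1)), max(0, min(y0, size - 1))
        x1, y1 = max(x0 + 1, min(x1, size)), max(y0 + 1, min(y1, size))
        h, w = y1 - y0, x1 - x0
        # Resize the object's mask to its box and weight its embedding by it.
        mask = F.interpolate(masks[i][None, None], size=(h, w),
                             mode="bilinear", align_corners=False)[0, 0]
        layout[:, y0:y1, x0:x1] += obj_embeddings[i][:, None, None] * mask
    return layout  # fed to the image synthesis / refinement network
```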
5.2. PasteGAN

To achieve more robust and fine-grained control of the image generation process, a crop-selection-based strategy named PasteGAN was developed by Li et al. (2019). They made a threefold contribution: (i) the objects of the scene graph are treated as crops, using an external object crop bank to guide the image generation process; (ii) to better generate an image, a Crop Refining Network and an Object-Image Fuser were designed with the goal of making object crops appear in a fine-grained way in the final image; and (iii) to automatically select the most compatible crop, a Crop Selector module is also devised in PasteGAN. Basically, PasteGAN encodes the input scene graph and object crops to generate the corresponding images.

PasteGAN mainly uses an external memory tank to find the objects given in the input scene graphs in order to generate images. The training process of PasteGAN consists of two stages: the first stage aims to reconstruct the ground-truth images using the original crops m_i^ori, whereas the second stage focuses on generating images with selected crops m_i^sel from the external memory tank.

The scene graph is processed with a GCN to obtain a latent vector z containing contextual information for each entity. PasteGAN processes scene graphs using the GCN, and then a crop selector chooses good crops for the objects that are relevant for generating realistic images. The crop selector should be able to recognize not only the accurate objects, but it should also match the similarity of scenes. PasteGAN uses the pretrained sg2im-based GCN to process scene graphs.

The crop refining network adopted by PasteGAN is based on two steps: (i) a crop encoder, which aims at extracting the main visual features of objects, and (ii) an object refiner, consisting of two 2D graph convolutional layers. It fuses the visual appearance of
Fig. 2. Typical pipeline employed for scene graph and layout to image generation. A scene graph is given as input and a graph convolutional network processes it to convert into
an image domain. This step produces scene layout predictions. Finally, a conditional GAN further synthesizes the image as final output.
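Relating to the cascaded refinement network described in Section 5.1 and the conditional synthesis stage sketched in Fig. 2, one refinement module can be illustrated as follows; channel counts, normalization, and interpolation choices are assumptions rather than the exact CRN of Chen and Koltun (2017).

```python
# One cascaded-refinement module: the layout is downsampled to the module's
# working resolution, concatenated channel-wise with the (upsampled) output of
# the previous module, and passed through a pair of 3x3 convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    def __init__(self, layout_channels: int, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(layout_channels + in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )

    def forward(self, layout, prev_features, resolution: int):
        # layout, prev_features: batched tensors of shape (B, C, H, W).
        layout_ds = F.interpolate(layout, size=(resolution, resolution),
                                  mode="bilinear", align_corners=False)
        prev_us = F.interpolate(prev_features, size=(resolution, resolution),
                                mode="bilinear", align_corners=False)
        return self.conv(torch.cat([layout_ds, prev_us], dim=1))
```

Stacking such modules at increasing resolutions yields the coarse-to-fine synthesis behavior described above.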
5. Object Adversarial Loss, which is used to ensure that the generated objects appear realistic.
6. Auxiliary Classifier Loss, which ensures that the object discriminator can classify all the generated objects in an image.

The PasteGAN generator is based on the minimized weighted sum of six loss functions that were available in the provided source code; however, the reported study mentions using eight loss functions in the proposed method.

1. Image Reconstruction Loss, which computes the L1 difference between the ground-truth image and the reconstructed image.
2. Crop Matching Loss, which calculates the L1 difference between object crop feature maps and object feature maps re-extracted from the generated images.
3. Adversarial Loss, which is similar to the object adversarial loss in sg2im (Johnson et al., 2018), i.e., it aims to ensure that the objects appear realistic.
4. Auxiliary Classifier Loss, which is used to ensure that the object discriminator can classify all the generated objects in an image.
5. Perceptual Loss, which calculates the L1 difference between ground-truth images and reconstructed images in the global feature space.
6. Box Regression Loss, which is used to calculate the L1 difference between the ground-truth and predicted boxes.

Only one loss function is used in this method, based on the single scene graph and its ground-truth layout.

1. L1 Loss, which minimizes the error between the ground-truth mask and the predicted mask.

Different from other SG2I approaches, this method introduces two loss functions. A KL (Kullback-Leibler) loss function is proposed to compute the KL-divergence between a distribution and the normal distribution, and an object latent code reconstruction loss strengthens the connection between a specific object appearance and its latent code so that the mapping is invertible. However, the other loss functions, such as the image reconstruction loss, object adversarial loss, auxiliary classification loss, and adversarial image losses, are the same as introduced in the sg2im method Johnson et al. (2018).

7. Experimental details

The methods we used for comparison in this study were selected based on three keywords, i.e., (i) scene graph, (ii) layout generation, and (iii) image generation. This section provides in-depth details of the experimental setup employed to perform the analysis.

7.1. Datasets

The experiments for all four methods are mainly performed on the Visual Genome (VG) and COCO-Stuff datasets. We used the same settings and datasets for the compared methods. Both datasets contain images of varying sizes. For a fair evaluation of all methods, we resize all images to 64×64. Table 1 shows the attributes of the datasets used for the methods comparison.

Table 1. Attributes of the datasets used for the methods comparison.

Attribute | Visual Genome | COCO-Stuff
Training set | 62,565 | 24,972
Validation set | 5,506 | 1,024
Test set | 5,088 | 2,048
Total number of objects | 178 | 171
No. of objects in an image | 3-30 | 3-8
Min. number of relationships between objects | 1 | 6

Visual Genome. This dataset is composed of 108,077 scene-graph-annotated images with seven main components: objects, attributes, relationships, scene graphs, region descriptions, region graphs, and question-answer (QA) pairs. Each image contains an average of 35 objects, and the relationships between two objects can be actions, e.g., jumping over, wear, behind, drive on, etc., together with 26 attributes, e.g., color (red) or state (sitting/standing). The scene graphs in this dataset are localized representations of an image and are combined to construct an entire image. The region descriptions are natural-language descriptions in sentence format that describe a region of the scene. The objects, attributes, and relationships are combined through a directed graph to construct region graphs in the VG dataset. Furthermore, two types of QA pairs are associated with each image: (i) freeform QA and (ii) region-based QA. The dataset is preprocessed at the beginning and then divided into training (80%), validation (10%), and test (10%) sets.

COCO-Stuff. This dataset consists of 164 K complex scene images Caesar et al. (2018). It contains 172 classes comprising 80 thing classes, 91 stuff classes, and 1 unlabeled class. An expert annotator curated the 91 stuff classes. The unlabeled class is used in two scenarios: first, when the label is not listed in any of the 171 predefined classes, and second, when the annotator is unable to infer the pixel label. For the image generation task from scene graphs and layouts, this dataset contains 40 K training images and 5 K validation images. Dense pixel-level annotations in COCO-Stuff are augmented from the COCO (Lin et al., 2014) dataset.

After a thorough study of the preceding SG2I and SL2I methods, we come to the evaluation of these methods. The evaluation protocols and implementation details of all methods used for the comparative analysis are defined in the evaluation metrics and implementation details sections, respectively.

7.2. Evaluation metrics

The quality of all images generated by the four methods needs to be measured quantitatively, and for this reason, different image quality evaluation metrics have been used by different methods. Usually, the Inception Score is the primary image quality measure; it uses an ImageNet-trained classifier to encourage recognizable objects within the generated images. This analysis reports four main evaluation metrics for the comparison of all methods, namely: Inception Score (IS), Fréchet Inception Distance (FID), Diversity Score (DS), and Classification Accuracy (CA).

Inception Score: The Inception Score (higher is better, ↑) is used to evaluate the quality of generated images, specifically synthetic images (Salimans et al., 2016). It involves using a pre-trained deep neural network for the classification of generated images. It has two main objectives: (i) image quality, do the generated images look like a specific object?, and (ii) image diversity, do the generated objects lie in a wide range? A pre-trained VGGNet Simonyan and Zisserman (2014) is used to implement and compute the IS for all the methods in this analysis.

Fréchet Inception Distance: FID was proposed by Heusel et al. (2017) and is a metric that embeds a set of generated images into a feature space given by a specific layer of the
Inception network. A lower FID value (↓) indicates higher quality of the generated images relative to the real ones.

Diversity Score: This metric is computed in a deep feature space to measure the perceptual similarity between two images. The DS differs from the IS in the sense that it measures the difference between generated images and real images from the same input. The higher the metric value, the better the DS (↑).

Classification Accuracy: This is a measure to quantify the capacity to create identifiable objects, a crucial criterion for evaluating SG2I and SL2I works. We initially train a ResNet-101 object classification model He et al. (2016). This is accomplished by using the actual objects cropped and downsized from the ground-truth images in the training set of each dataset. Afterwards, we calculate and report the object classification accuracy for the generated images. A higher value of CA (↑) represents a better score.

7.3. Implementation details

We used identical parameters for all four methods in this analysis, using Python 3.5, PyTorch 0.4, and Linux (Ubuntu) 20.04. With all libraries updated, we used a virtual environment for a fair comparison. The training learning rate was set to 10^-4, with batch sizes of 32, 16, 8, and 16 for the four methods, respectively. Table 2 highlights the hyperparameter details of all the methods used in this work. In the SG2I and SL2I methods, all scene graphs are augmented with a special image object, and a specific "in image" relationship connects each true object to it, through which all scene graphs are connected.

To generate images of size 64×64 on the VG and COCO-Stuff datasets, we used an RTX 3090 GPU, and it took days to finish training with a million iterations on each dataset. The scene graphs are available in a human-readable format in a JSON file. After installing the GraphViz library, it is also possible to visualize the input scene graph as a graph. The programs used the PyTorch library.

8. Experimental results

Table 3 shows the performance of the compared methods on four metrics, i.e., IS, FID, DS, and CA. Each dataset is split into 3 groups, and we report the mean and standard deviation of IS and DS for all methods. The samples are generated with the full models at image size 64×64 by defining different synthetic scene graphs. The models' abilities are evaluated through generating complex scenes. The best IS and DS are achieved by Layout2im Zhao et al. (2019b), and the best FID is achieved by PasteGAN Li et al. (2019).

The methods reported in this study use different loss functions, which we have reported in the earlier methodological descriptions. However, for the sake of a fair comparison, we evaluated the generator and discriminator loss functions of all methods, which are the two main components of a GAN. Fig. 4 illustrates the comparison of the two loss functions over 90 epochs, where we report the loss every ten epochs. Fig. 4a shows the generator loss, where we can see that the loss for sg2im (Johnson et al., 2018) and PasteGAN (Li et al., 2019) is relatively lower than for WSGC (Herzig et al., 2020) and Layout2im (Zhao et al., 2019b). In the case of the discriminator loss (Fig. 4b), all four methods performed comparatively well. However, we can observe that WSGC Herzig et al. (2020) performed best in terms of generator loss on the COCO dataset, whereas Layout2im Zhao et al. (2019b) outperformed the others in terms of discriminator loss on all datasets. The object classification accuracy is also reported for all four compared methods. It can be observed that the objects generated by the SL2I method Zhao et al. (2019b) can be accurately classified relative to real images. It is also observed during the experimentation that the object classification's upper bound does not necessarily confirm the difficulty of distinguishing the generated images.

Fig. 5 shows the statistics of the bounding box predictions for all methods on the two datasets. R@t is used to measure the accuracy of the predicted bounding boxes, and from the experiments, it can be seen that using the SL2I mechanism improves the prediction performance. Fig. 5a presents R@3, while Fig. 5b presents R@5.

Fig. 6 shows the images generated by the compared methods. From the set of images, it can be seen that Layout2im Zhao et al. (2019b) performed relatively better than the scene graph-based methods in the qualitative evaluation. The objects are more recognizable and consistent with their input labels. The images generated by SG2I-based methods also respect the compositionality of objects. However, blurriness and object overlap are still major problems of object synthesis in these methods.

9. Discussions

9.1. Limitations

GANs are a powerful class of deep learning models widely used in image generation methods, and this work also leverages a GAN-based SG2I and SL2I generation methodology. There are several limitations associated with using both GANs and scene graphs to synthesize images.

For example, GANs can sometimes suffer from mode collapse, where the generator network produces limited types of output that fail to represent the full diversity of the target distribution. This happens when the generator network produces similar or identical outputs for different input values, resulting in a loss of variety in the generated images (Wang et al., 2020). GANs can be challenging to train, and their training can be unstable. The generator and discriminator networks can get stuck in a suboptimal state, resulting in poor-quality output images. GANs require careful tuning of their hyperparameters, such as learning rates, batch sizes, and regularization terms; the choice of these hyperparameters can significantly impact the quality of the generated images (Meshry, 2022). GANs require large datasets for training, and the quality of the generated images can depend on the quality and quantity of the training data. GAN-based image synthesis methods often cannot provide fine-grained control over the generated images.
Table 2. Hyperparameter details.
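The shared training configuration summarized in Table 2 and Section 7.3 can be written out as a plain configuration object; the field names are illustrative, the values are taken from Section 7.3, and the ordering of the batch sizes simply follows the order in which they are reported there rather than a stated per-method mapping.

```python
# Illustrative configuration sketch; values from Section 7.3, field names assumed.
TRAINING_CONFIG = {
    "learning_rate": 1e-4,
    "batch_sizes": [32, 16, 8, 16],      # one per compared method, as reported
    "image_size": (64, 64),
    "iterations": 1_000_000,
    "datasets": ["Visual Genome", "COCO-Stuff"],
    "environment": "Python 3.5, PyTorch 0.4, Ubuntu 20.04, RTX 3090",
}
```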
Table 3. Performance evaluation of IS, FID, DS, and CA of all methods on the two datasets.
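The FID values reported in Table 3 reduce to a distance between Gaussian fits of Inception features extracted from real and generated images (Heusel et al., 2017). A minimal sketch, assuming the two feature matrices have already been extracted:

```python
# Fréchet Inception Distance between two sets of precomputed feature vectors.
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    cov_sqrt, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(cov_sqrt):   # numerical noise can add tiny imaginary parts
        cov_sqrt = cov_sqrt.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * cov_sqrt))
```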
Fig. 4. Performance evaluation of the compared methods based on their training: (a) generator loss and (b) discriminator loss.

Fig. 5. Intersection-over-union comparison of all four methods on the two datasets: (a) R@3 and (b) R@5.
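The R@t scores plotted in Fig. 5 build on the intersection over union between predicted and ground-truth boxes. The sketch below shows the generic IoU computation and an IoU-thresholded recall; the exact thresholding behind R@3 and R@5 in the compared papers is not reproduced here, and the function names are illustrative.

```python
# Intersection over union for axis-aligned boxes and a recall-style score.
def iou(box_a, box_b):
    # boxes as (x0, y0, x1, y1)
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def recall_at_iou(pred_boxes, gt_boxes, threshold):
    # Fraction of predicted boxes whose IoU with the matched ground truth
    # exceeds the threshold.
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / max(len(gt_boxes), 1)
```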
For example, it may be challenging to generate images with specific attributes or to generate images that match a given textual description. GANs are often limited in their ability to generate complex scenes with multiple objects or large-scale contexts and can struggle to maintain spatial coherence and realistic object interactions. Overall, while GAN-based image synthesis methods have made significant progress, they still have limitations that must be addressed to enable their wider adoption and enhance their ability to generate high-quality and diverse images.

Scene graph-based methods can suffer from limited diversity, where the generated images tend to follow a small set of prototypical visual layouts, which can make them less visually interesting or less representative of the full range of possible scenes. These methods are often computationally intensive and require large amounts of memory to process and manipulate complex graph structures (Zhu et al., 2022). Similar to GAN-based methods, scene graph-based methods require large amounts of training data to achieve good results. SG2I and SL2I methods struggle to scale to more complex and diverse scenes. They rely on explicit modeling of object relationships and interactions, which can become computationally expensive as the number of objects and relationships increases. Although scene graphs can include textual information about objects and their relationships, incorporating more detailed textual descriptions or natural language instructions into SG2I and SL2I methods can be challenging.
Fig. 6. A comparison of images generated by the SG2I and SL2I based methods. The images of size 64×64 are generated using the same settings for all methods. Orange marks the generated results on the VG dataset, while green marks the images generated on the COCO-Stuff dataset.
Like GAN-based methods, scene graph-based methods can lack fine-grained control over the generated images, such as generating images with specific attributes or modifying certain aspects of the generated scenes. In summary, while scene graph-based methods offer a promising direction for image synthesis, challenges remain to be addressed, such as improving scalability and diversity, incorporating textual information better, and providing more fine-grained control over the generated images.

9.2. Future directions

In this section, we discuss some potential future research directions for SG2I and SL2I generation methods. Scene graph-based image generation methods have recently gained popularity in the computer vision community for their ability to generate realistic and diverse images by leveraging the rich semantic information encoded in scene graphs.

Improving the quality and diversity of generated images is crucial for SG2I and SL2I generation methods. Although SG2I methods have achieved impressive results, there is still room for improvement in terms of the quality and diversity of generated images. Future research can explore novel techniques to enhance the realism and diversity of synthesized images, such as incorporating attention mechanisms (Kitada and Iyatomi, 2022), adversarial training, and semantic consistency regularization.

Exploring multi-modal and multi-task image generation is another potential research direction. SG2I generation can be extended to generate images with multiple modalities or to perform multiple tasks simultaneously. For example, a model could generate an image and its corresponding textual description, or generate images with different styles or viewpoints. Researchers can investigate these multi-modal and multi-task scenarios to create more versatile and flexible image generation models.

Incorporating real-world constraints is essential for generating high-quality images with SG2I and SL2I methods. In real-world scenarios, synthesized images must adhere to certain constraints, such as object occlusions, lighting conditions, and camera angles. There are opportunities to explore how to incorporate these constraints into the image synthesis process to generate more realistic images that better reflect the complexities of real-world scenes.

Time consumption is a major issue in synthesizing images from scene graphs. It arises from the fine-tuning of large numbers of parameters and complex hierarchical structures. To accelerate SG2I and SL2I generation methods, deep neural networks can be constructed based on extreme learning machine (ELM) theory (Zhang et al., 2020). The approach is fast and easy to implement, involving two stages: (i) randomly generating the hidden-layer parameters from a predefined interval and (ii) calculating the generalized inverse of the output weight matrix. The acceleration of SG2I and SL2I-based methods could be significantly increased by incorporating ELM theory to fine-tune parameters.
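A minimal NumPy sketch of the two-stage ELM fit described above is given below; it is illustrative only, and how such a module would be wired into an SG2I or SL2I generator is left open.

```python
# Two-stage extreme learning machine: (i) random hidden-layer parameters,
# (ii) output weights via the Moore-Penrose pseudo-inverse.
import numpy as np

def elm_fit(x, t, hidden_units=256, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: random input weights and biases drawn from a fixed interval.
    w = rng.uniform(-1.0, 1.0, size=(x.shape[1], hidden_units))
    b = rng.uniform(-1.0, 1.0, size=hidden_units)
    h = np.tanh(x @ w + b)                 # hidden-layer activations
    # Stage 2: closed-form output weights via the pseudo-inverse of H.
    beta = np.linalg.pinv(h) @ t
    return w, b, beta

def elm_predict(x, w, b, beta):
    return np.tanh(x @ w + b) @ beta
```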
One of the critical challenges for SG2I and SL2I methods is adapting to new domains and modalities. SG2I and SL2I-based models have mainly been applied to 2D images, but they can also
Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., Yang, M.-H., 2019. Mode seeking generative adversarial networks for diverse image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1429–1437.
Meshry, M., 2022. Neural rendering techniques for photo-realistic image generation and novel view synthesis, Ph.D. thesis.
Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J., 2016. Unrolled generative adversarial networks, arXiv preprint arXiv:1611.02163.
Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784.
Mittal, G., Agrawal, S., Agarwal, A., Mehta, S., Marwah, T., 2019. Interactive image generation using scene graphs, arXiv preprint arXiv:1905.03743.
Odena, A., Olah, C., Shlens, J., 2017. Conditional image synthesis with auxiliary classifier gans. In: International Conference on Machine Learning, PMLR, pp. 2642–2651.
Reed, S.E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H., 2016. Learning what and where to draw. Adv. Neural Informat. Process. Syst. 29, 217–225.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H., 2016. Generative adversarial text to image synthesis. In: International Conference on Machine Learning, PMLR, pp. 1060–1069.
Reed, S., Oord, A., Kalchbrenner, N., Colmenarejo, S.G., Wang, Z., Chen, Y., Belov, D., Freitas, N., 2017. Parallel multiscale autoregressive density estimation. In: International Conference on Machine Learning, PMLR, pp. 2912–2921.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., 2016. Improved techniques for training gans. Adv. Neural Informat. Process. Syst. 29, 2234–2242.
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G., 2008. The graph neural network model. IEEE Trans. Neural Networks 20 (1), 61–80.
Schuster, S., Krishna, R., Chang, A., Fei-Fei, L., Manning, C.D., 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In: Proceedings of the Fourth Workshop on Vision and Language, pp. 70–80.
Shamsolmoali, P., Zareapoor, M., Granger, E., Zhou, H., Wang, R., Celebi, M.E., Yang, J., 2021. Image synthesis with adversarial networks: A comprehensive survey and case studies. Informat. Fusion.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
Sun, W., Wu, T., 2019. Image synthesis from reconfigurable layout and style. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10531–10540.
Sylvain, T., Zhang, P., Bengio, Y., Hjelm, R.D., Sharma, S., 2020. Object-centric image generation from layouts, arXiv preprint arXiv:2003.07449 1 (2), 4.
Tripathi, S., Bhiwandiwalla, A., Bastidas, A., Tang, H., 2019. Using scene graph context to improve image generation, arXiv preprint arXiv:1901.03762.
Vondrick, C., Pirsiavash, H., Torralba, A., 2016. Generating videos with scene dynamics. Adv. Neural Informat. Process. Syst. 29, 613–621.
Wang, Y., Gonzalez-Garcia, A., Berga, D., Herranz, L., Khan, F.S., Weijer, J.V.D., 2020. Minegan: effective knowledge transfer from gans to target domains with few images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9332–9341.
Yang, X., Tang, K., Zhang, H., Cai, J., 2019. Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10685–10694.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N., 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915.
Zhang, S., Tong, H., Xu, J., Maciejewski, R., 2019. Graph convolutional networks: a comprehensive review. Comput. Social Networks 6 (1), 1–23.
Zhang, C., Chao, W.-L., Xuan, D., 2019a. An empirical study on leveraging scene graphs for visual question answering, arXiv preprint arXiv:1907.12133.
Zhang, J., Li, Y., Xiao, W., Zhang, Z., 2020. Non-iterative and fast deep learning: Multilayer extreme learning machines. J. Franklin Inst. 357 (13), 8925–8955.
Zhao, B., Meng, L., Yin, W., Sigal, L., 2019b. Image generation from layout. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8584–8593.
Zhao, B., Yin, W., Meng, L., Sigal, L., 2020. Layout2image: Image generation from layout. Int. J. Comput. Vision 128.
Zhao, X., Wang, L., Guo, J., Yang, B., Zheng, J., Li, F., 2021. Solid texture synthesis using generative adversarial networks, arXiv preprint arXiv:2102.03973.
Zhao, S., Li, L., Peng, H., 2022. Aligned visual semantic scene graph for image captioning. Displays 74, 102210.
Zhou, R., Jiang, C., Xu, Q., 2021. A survey on generative adversarial network-based text-to-image synthesis. Neurocomputing 451, 316–336.
Zhu, G., Zhang, L., Jiang, Y., Dang, Y., Hou, H., Shen, P., Feng, M., Zhao, X., Miao, Q., Shah, S.A.A. et al., 2021. Scene graph generation: A comprehensive survey, arXiv preprint arXiv:2201.00443.

Muhammad Umair Hassan is a PhD candidate at the Norwegian University of Science and Technology, Ålesund, Norway. He obtained his master's from the University of Jinan, P.R. China, and a bachelor's from the University of the Punjab, Pakistan. His research interests include Computer Vision, Computer Graphics and Deep Learning. He has a sound track record of publications in multidisciplinary areas, including journal and conference publications.

Saleh Alaliyat is currently an Associate Professor at the Department of ICT and Natural Sciences, Norwegian University of Science and Technology, Ålesund, Norway. His research interests include Artificial Intelligence, Swarm Intelligence, and Computer Vision.

Ibrahim A. Hameed has a PhD in AI from Korea University, South Korea, and a PhD in field robotics from Aarhus University, Denmark. He is a Professor and Deputy Head of Research and Innovation within NTNU, an IEEE senior member, elected chair of the IEEE Computational Intelligence Society (CIS) Norway section, and Founder and Head of the Social Robots Lab in Ålesund. His current research interests include Artificial Intelligence, Machine Learning, Optimization, and Robotics.