Review
A Review of Synthetic Image Data and Its Use in
Computer Vision
Keith Man * and Javaan Chahl
Abstract: Development of computer vision algorithms using convolutional neural networks and
deep learning has necessitated ever greater amounts of annotated and labelled data to produce high
performance models. Large, public data sets have been instrumental in pushing forward computer
vision by providing the data necessary for training. However, many computer vision applications
cannot rely on general image data provided in the available public datasets to train models, instead
requiring labelled image data that is not readily available in the public domain on a large scale. At
the same time, acquiring such data from the real world can be difficult, costly to obtain, and manual
labour intensive to label in large quantities. Because of this, synthetic image data has been pushed to
the forefront as a potentially faster and cheaper alternative to collecting and annotating real data. This
review provides a general overview of the types of synthetic image data, as categorised by synthesised
output, common methods of synthesising different types of image data, existing applications and
logical extensions, performance of synthetic image data in different applications and the associated
difficulties in assessing data performance, and areas for further research.
Keywords: computer vision; image synthesis; synthetic image data; synthetic data generation
images from synthetic virtual environments [3]. The visual look of synthetic image data
also varies between fields of application, from composite imagery [4–6] to photo-realistic
computer generated images [7–12].
The key advantage of synthesising image data, and the primary reason that makes the
generation of data faster and cheaper, is that a properly set up image synthesis pipeline is
capable of automating the data generation and labelling process at a comparatively low cost
to manual labour. That said, it is important to note that not every method of image synthesis
provides automatic data labelling and some methods still require significant amounts of
human labour before the synthesis process can be automated. The use of the terms “faster”
and “cheaper” to describe synthetic image data is mainly in comparison to the collection of
real data for a given application. In many computer vision applications, obtaining large
real datasets can be difficult and the manual labour required to label and annotate the raw
data to the required level of quality comes at a significant cost. Large scale data collection
can face limitations due to privacy concerns, legal restrictions, and practical limitations, all
of which reduce the amount of data that can be collected for training purposes. Manual
annotation of data is also a time consuming task that becomes highly impractical for large
datasets or datasets that require large numbers of labels per data instance, such as for crowd
counting [13,14]. Even if manually labelling large data sets is possible, human error will
inevitably cause some decrease in data quality as the data set size increases, resulting in a
negative effect on training [15]. While it is possible to train an algorithm to automatically
label data in place of a human, training such an algorithm requires large quantities
of labelled data to begin with, making it somewhat of a circular problem.
Synthesising image data provides significantly more control over the resulting dataset,
allowing for higher labelling accuracy, and object and environment information that would
otherwise be difficult to collect in addition to image data in the real world.
Naturally, it is important to consider the limitations of synthetic image data as well.
Data bias is a known issue with synthesised image data, often caused by inherent bias in
the input parameters for data generation. While synthetic data may not necessarily have
any direct relation to real objects, environments, or people, real data is often still needed as
a reference from which synthetic data is created, and biases in the real data will influence
the synthesised output. The domain gap between synthetic image data and real data is also
an issue commonly noted in research, with models trained only on synthetic data typically
displaying a significant drop in performance when tested on real data. Domain gap is
not a problem unique to synthetic data as it is also present when training and testing on
different sets of real data. However, the domain gap between synthetic and real data is
more noticeable when testing the performance of models trained only on synthetic data.
Finally, a less discussed limitation is the computational requirements to generate synthetic
image data. In order to generate large synthetic data sets within reasonable time frames,
significant computational power is required, computational power that may not be readily
available to some of those wanting to utilise computer vision in their field of work. While
many papers have discussed the advantages and limitations of synthetic data in computer
vision, much less has been written about the computational resources and time required
to generate the data sets that were used. Nevertheless, this remains an important point of
discussion as these two factors can greatly impact the practicality of using synthetic data in
many applications.
Past reviews on synthetic data for computer vision have evaluated the use of synthetic
data focusing on the image generation process [16] or the use of synthetic image data
for specific applications such as navigating urban traffic environments [17], pedestrian
detection [18], and text-to-image synthesis [19]. The goal of this review paper is to categorise
existing types of synthetic image data by output, review methods used to synthesise such
data, discuss the effectiveness of synthetic data in various computer vision tasks and logical
extensions to its current use, and identify research gaps that may lead to
future research. Section 2 covers the difficulties and associated costs of obtaining real data
for computer vision applications and the reasons why synthetic image data has grown
Similarly, the RarePlanes dataset [22] provides synthetic composite satellite imagery
of aircraft at different airport locations. However, instead of projecting 3D objects onto
a background, 2D images are directly overlaid onto the backgrounds (Figure 2). Satellite
imagery is one of many fields of computer vision where it is difficult to obtain large data
sets due to the nature of the image data required. The authors of the paper note that there
are no expansive permissively licensed synthetic data sets for such data. The RarePlanes
dataset consists of a mix of real and synthetic satellite imagery that has had aerial images
of planes overlaid on top. While Figure 2 notes the use of real 2D backgrounds, in practice
this can be extended to synthetic 2D backgrounds as well, as it does not affect the overall
process of overlaying 2D images onto a background. The synthetic data was created using
the AI.Reverie platform, which used Unreal Engine to create realistic synthetic data based
on real-world airports.
Large crowd data sets, both images and videos, are resource intensive to annotate, often
containing large numbers of people, in excess of 1000 in some cases. People in crowds are also
often not fully in view, potentially only having part of their head visible with the rest of
their body obscured by the surroundings. Manual annotation can result in cases where data
is not fully labelled due to the difficulty in doing so, thereby introducing data set bias. There
are two common methods of synthesising crowd data. The first is to use 3D human models
and either project them onto a 2D background or place them into a 3D virtual environment.
In practice, rendering scenes with over 1000 models would be highly computationally
demanding, but if video data is needed, this is still the easiest method of generating crowd
data. The second method is to use 2D overlays to project images of humans onto a 2D
background. A paper on large crowd analysis using synthetic data [23] projected synthetic
humans onto real scenes. The synthesis enabled illumination, movement and density of
people to be controlled while providing ground truth information.
Data sets such as foggy scenes [2] use real data as a basis and digitally manipulate the
image data in order to produce synthetic variations. Such data is created for applications
where it is difficult to obtain data due to specific environmental requirements, but real
environments and objects still hold enough value that it is not worth the effort of recreating
the entire scene virtually to create the necessary data. In practice, this method of image
synthesis can be considered an extension of overlaying 2D images onto a background,
but instead of overlaying an image, a filter is used to project the required environmental
conditions. Compared to 2D image overlays, filters are also simpler to extend to
video data if so required.
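As an illustration of this kind of filter-based manipulation, the sketch below applies homogeneous fog to an image using the standard optical model of fog, I = J·t + A(1 − t) with transmittance t = exp(−β·depth), which is the general approach behind foggy scene synthesis. The function name, parameter values, and the assumption that a per-pixel depth map is available are illustrative rather than taken from [2].

```python
import numpy as np

def add_synthetic_fog(image, depth, attenuation=0.05, airlight=0.9):
    """Apply homogeneous fog to an RGB image using the optical model
    I = J * t + A * (1 - t), with transmittance t = exp(-beta * depth).

    image: float32 array in [0, 1], shape (H, W, 3)
    depth: float32 array of per-pixel distance in metres, shape (H, W)
    attenuation: scattering coefficient beta (higher = denser fog)
    airlight: atmospheric light A, assumed grey
    """
    transmittance = np.exp(-attenuation * depth)[..., np.newaxis]  # (H, W, 1)
    foggy = image * transmittance + airlight * (1.0 - transmittance)
    return np.clip(foggy, 0.0, 1.0).astype(np.float32)

# Illustrative call: fog a random image with a synthetic depth ramp.
rgb = np.random.rand(256, 512, 3).astype(np.float32)
depth = np.tile(np.linspace(5.0, 100.0, 512, dtype=np.float32), (256, 1))
foggy_rgb = add_synthetic_fog(rgb, depth, attenuation=0.03)
```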
While all synthetic composites are image composites by definition, there are also some
synthetic composites that do not use any synthetic objects or images in their creation. Image
compositing works the same way as 2D image overlays, but takes labelled 2D objects
from one set of data and places them into scenes from other sets of data. This
method of data synthesis tends to create data sets with a lower domain gap than virtual
synthetic data sets, possibly due to domain randomisation increasing data diversity and
improving generalisation [4].
The fish identification data set [24] is an example which uses instances of real fish
cropped out from data collected using the Deep Vision system [25] and places them onto
backgrounds from Deep Vision footage where no other fish or objects are present, in
random orientations, positions, and sizes. The resultant composite image comprises
only real data, but is still considered synthetic data as the exact scene was not captured in
the real world. The reason for the generation of such data is primarily the difficulty in
annotating existing Deep Vision data. Generating synthetic data with known fish species
allows for much cheaper labelled data and extracting fish from scenes where the species
can be readily identified by a human is also a significantly less time consuming task than
manually labelling the original Deep Vision data set.
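The sketch below illustrates this style of 2D compositing in general terms: a cropped, labelled object instance is pasted onto an empty background frame at a random position, scale, and rotation, and the bounding-box label falls out of the paste operation essentially for free. The use of Pillow and the placeholder file paths are assumptions for illustration, not the actual Deep Vision pipeline.

```python
import random
from PIL import Image

def composite_instance(background, instance, min_scale=0.3, max_scale=1.0):
    """Paste a cropped object (RGBA, transparent background) onto a scene at a
    random position, scale, and rotation, returning the composite image and
    the axis-aligned bounding box of the pasted object as a label."""
    bg = background.convert("RGB")
    scale = random.uniform(min_scale, max_scale)
    obj = instance.convert("RGBA").rotate(random.uniform(0, 360), expand=True)
    obj = obj.resize((max(1, int(obj.width * scale)), max(1, int(obj.height * scale))))

    # Random top-left corner that keeps the object inside the frame.
    x = random.randint(0, max(0, bg.width - obj.width))
    y = random.randint(0, max(0, bg.height - obj.height))
    bg.paste(obj, (x, y), mask=obj)          # alpha channel acts as the paste mask
    bbox = (x, y, x + obj.width, y + obj.height)
    return bg, bbox

# Hypothetical usage: paths are placeholders, not the Deep Vision data layout.
# scene = Image.open("backgrounds/empty_frame_0001.png")
# fish = Image.open("crops/blue_whiting_0042.png")   # RGBA crop of a labelled fish
# image, label = composite_instance(scene, fish)
```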
Image synthesis could be considered an extreme version of image compositing where
instead of extracting labelled objects and placing them into other scenes, image synthesis
takes labelled object features and combines them with other labelled object features to
produce a new object. Visually, the new object may look nothing like the objects from which
the features were extracted, but from the perspective of a neural network, the synthesised
object still contains all the necessary features to identify what the object is [6].
The KITTI-360 dataset [26] was created with the goal of augmenting the KITTI
dataset [27] with more objects, increasing data efficiency for training. The paper noted
that while 3D rendered virtual worlds were becoming more popular for producing urban
environment data, the creation of such an environment requires significant human input
before data can begin automatic generation. Instead, the paper proposed a process to
integrate synthetic objects into real environments in a photo-realistic manner. By creating
360 degree environment maps, KITTI-360 was able to place high quality vehicle models
into existing KITTI scenes with realistic lighting conditions. The models themselves are
created by projecting 2D texture images onto 3D meshes (Figure 3), which are then projected
onto backgrounds to give a realistic view of the object as the perspective on the object
changes over the course of the video.
a background is required, it only needs to look correct from the viewing angle; in some
situations a realistic background might not even be required. Observing the face from
different angles or positions makes it clearly visible that the head is not a complete object. It
is not a complete virtual environment, and for the purposes of such applications, a complete
virtual environment is not required.
Virtual environments are a step above virtual scenes and comprise a complete 3D
virtual construction of a specific environment. The environment could be the inside of a
house or a pedestrian crossing. Either way, its goal is to enable the capture of image data
from multiple perspectives without risking the degradation of data quality due to problems
such as object artefacts. When viewed from outside, the virtual environment may still look
incomplete, but within the environment, it is self consistent.
Virtual worlds are effectively virtual environments on a larger scale. Scenes outside
a virtual environment that may have been a flat 2D background are fully constructed, with
events occurring beyond the view of the virtual camera. This is most commonly found
in virtual data captured from games that have pre-built large scale environments, such as
from the game Grand Theft Auto V. Creating virtual worlds to collect such data is labour
intensive, which is why collecting data from games with pre-built worlds is a common
alternative. Virtual KITTI [29] is an example of a virtual world where the environment
from parts of the KITTI dataset [27] was recreated digitally to produce a virtual copy of the
KITTI dataset.
In the field of object detection, some research has moved towards highly photorealistic
object renders to reduce the domain gap to the target domain. Other research has found
that photorealism might not be the only method of reducing domain gap, instead by using
domain randomization, where the objects of interest are placed into random non-realistic
environments, it is possible to force a model to learn object features [30]. Compared to
photorealistic objects, this type of synthetic data may not fit the target domain as well, but
its generalisation means that it stands to have better average performance across multiple
domains. Virtual synthetic data offers a way to create both photorealistic and photo-
unrealistic environments that can be manipulated as required to produce the necessary
image data.
Facial recognition has made significant progress over the past few years thanks to
developments in deep learning networks and large scale data sets. However, it has started
to become increasingly difficult to obtain larger data sets from the internet by trawling for
faces due to labelling noise and privacy concerns. As a result, synthetic data has become the
alternative to obtaining large data sets. The performance of networks trained on synthetic
data for facial recognition has historically not been good, the domain gap has often been
very large, resulting in poor real-world performance. However, synthesised faces
still offer the great benefit of avoiding issues with privacy and developments over the years
have shown increased performance in face generation technology [31].
Moving past the generation of virtual synthetic data for standalone objects and faces,
there are some applications that necessitate the construction of a larger virtual scene. In
the scenario where a task such as pedestrian detection is required, but there is no existing
real data to conduct network training or even domain adaptation, synthetic data is the
only available method of sourcing any data to train a pedestrian detection model [3]. The
problem is that synthetic data suffers from domain gaps with real data, and without any real
data, traditional methods of reducing the domain gap, such as mixing data or fine tuning
after pre-training on synthetic data, are not possible. In cases like this, the best option is to
provide as much ground truth as possible from the virtual scene that has been constructed.
Vehicle re-identification is another field that can utilise virtual synthetic data in the
scope of a virtual scene. While vehicle detection and identification are closely related to
tasks such as urban driving, vehicle re-identification, unlike urban driving, is primarily
concerned with stationary vehicles and so synthetic vehicles can be placed into small
virtual scenes for data collection. Similarities between vehicle types when viewed from
different angles as well as the lack of differences between some vehicles types can cause
many difficulties with real data. To address this issue, highly diverse data sets are required
to learn specific features. However, even if such data is available, manually annotating
such data is prohibitively expensive. Synthetic data provides an alternative source of large
automatically labelled data that can also be generated from many different perspectives,
allowing for much more diverse data sets than what might be normally available from the
real world [32].
In cases where synthetic scenes are not sufficient to produce the data required for the
application, synthetic worlds offer a much larger environment from which to capture data
at a computational cost. Most virtual worlds are not fully utilised all the time. Instead,
virtual worlds allow for the capture of data in different environments, which can be useful
in applications such as autonomous vehicles. While photo-realistic environments are not
possible without the use of dedicated designers and significant rendering time, it is possible
to generate more basic environments using city layout generation algorithms combined
with pre-textured buildings, allowing for the creation of grid-like city environments. The
effect of photorealism on performance is substantial, but the biggest advantage of virtual
synthesised environments lies in the automatic labeling of objects as well as complete
control over environment variables [33].
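As a toy illustration of the kind of layout generation mentioned above, the sketch below produces a grid-like city plan as a list of building footprints that a renderer could populate with pre-textured assets; all parameter names and values are illustrative and not drawn from [33].

```python
import random

def generate_grid_city(blocks_x, blocks_y, block_size=40.0, road_width=8.0,
                       building_types=("office", "apartment", "shop")):
    """Generate a simple grid city layout as a list of building placements.
    Each placement records a footprint origin, size, and a pre-textured
    building type that a renderer could instantiate."""
    placements = []
    for bx in range(blocks_x):
        for by in range(blocks_y):
            origin_x = bx * (block_size + road_width)
            origin_y = by * (block_size + road_width)
            placements.append({
                "origin": (origin_x, origin_y),
                "size": (block_size, block_size),
                "height": random.uniform(10.0, 60.0),
                "type": random.choice(building_types),
            })
    return placements

city = generate_grid_city(blocks_x=5, blocks_y=5)
print(len(city), "building footprints generated")
```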
While virtual synthetic data does not directly contain any real data, this does not mean
that it cannot reference or replicate real data. The Virtual KITTI data set [29] is a fully
synthetic recreation of a subset of the KITTI data set. The goal of creating a virtual copy of
the KITTI data set was to provide evidence that models trained on real data would perform
similarly in virtual environments and that pre-training on synthetic data should provide
improvements in performance after fine tuning.
beneficial than generating a large virtual world that might not be fully utilised. For the
case of 3D worlds and environments, recent developments have seen a shift towards using
game engines to build virtual worlds due to the tools and utilities that come with such
software. In some cases, data can even be pulled from 3D environments that exist in games,
forgoing the manual creation of a virtual environment entirely.
Figure 5. Manual creation of image and video data via 3D modelling software such as Blender.
First proposed in 2014 [40], the original GAN model also utilised two sub-networks,
pitting a generator network against a discriminator network (Figure 7). The goal
of a GAN is for the generator to produce synthetic data that the discriminator cannot tell
apart from real data while the discriminator learns to distinguish generated data, hence
adversarial. The generator learns by synthesising images from noise maps, learning better
mappings to produce more realistic samples as the discriminator provides feedback. Conversely,
the discriminator learns from feedback on its correct and incorrect classifications of
real and synthesised images. GANs can be further classified into conditional GANs [41],
where the GAN is fed additional information such as class labels, and unconditional GANs,
where the discriminator attempts to distinguish real and fake images independent of what
the image actually is. In an ideal scenario, a generator will reach a state where its data is
indistinguishable from real data. However, in practice, this rarely occurs and the generator
and discriminator will typically maintain some form of balance, with convergence on the
generator or discriminator indicating some form of failure. Compared to VAEs, which can
be considered to use semi-supervised learning to generate images, as the target is the same
as the input, GANs use unsupervised learning to generate their data.
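A minimal sketch of this adversarial training loop is shown below, with tiny fully connected networks standing in for real generator and discriminator architectures; the dimensions, optimiser settings, and loss formulation are illustrative of the general scheme rather than any specific published model.

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):
    """One adversarial update: D learns to separate real from generated
    samples, then G learns to produce samples that D scores as real."""
    batch = real_images.size(0)
    noise = torch.randn(batch, latent_dim)

    # Discriminator step: real -> 1, generated -> 0.
    fake = G(noise).detach()
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool D into scoring generated samples as real.
    g_loss = bce(D(G(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Illustrative call with random "real" data standing in for a dataset batch.
losses = train_step(torch.rand(32, image_dim) * 2 - 1)
```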
As of 2018, there were over 450 named variants of GAN networks [42], too many to
cover fully in a general review of synthetic data generation and usage. However, the more
well known GAN models used to generate image data, BigGAN, CycleGAN, DALL-E
2 (and similar models), DCGAN, and StyleGAN, will be covered, with applications and
differences in methodologies noted.
BigGAN and BigGAN-deep are models based on SAGAN [43], a GAN design
that added attention maps that allowed the generator and discriminator models to focus
on features in different parts of an image. BigGAN utilises attention maps in addition
to greatly scaling both parameters and batch size to increase the quality of generated
images [44]. BigGAN is a conditional GAN as its generator and discriminator are both
conditioned with class labels at time of training. When BigGAN was proposed, state of
the art GAN models were showing weak Inception Scores [45] compared to real data, with
the goal of BigGAN being to reduce the gap in image quality and resolution between
GAN generated images and images from the ImageNet dataset. BigGAN demonstrated
substantial improvements to Inception Score when trained and tested against ImageNet
and JFT-300M datasets, showing good Inception Scores even at higher image resolutions of
256 × 256 and 512 × 512 pixels, which GANs traditionally have been weaker at. However,
the authors noted important trade-offs to achieve these results. GANs are already inherently
unstable, and scaling GAN to BigGAN resulted in instabilities unique to GAN models of that
scale, leading to eventual training collapse and necessitating early stopping of training.
While a combination of techniques was shown to reduce those instabilities,
complete training stability came at a significant cost to performance.
Even without complete training stability, BigGAN still requires significant computational
resources in comparison to smaller GAN models due to the overall increase in scale.
Regardless of these trade-offs, BigGAN still remains the primary large scale GAN model
and has since seen several derivatives with the goal of improving different aspects of
the original model. LOGAN [46] introduced latent optimisation to BigGAN, improving
performance on 128 × 128 pixel images compared to base BigGAN, although higher
resolution performance was untested due to limited computational resources. Variants
looking to reduce computation requirements, such as not-so-BigGAN [47], have also been
developed as a significantly smaller compute budget alternative while attempting to reach
competitive levels of performance.
CycleGAN is a unique variant of GAN model where the goal is not to generate
new and unique image data, but to change the domain of an existing image to a new
domain [48]. Image-to-image translation is a process in which an image is transformed
to a different domain. Training of a computer vision model to achieve this task typically
requires paired training data, where the same scene is depicted in both the input domain
and target domain. As with many other computer vision tasks, obtaining paired image
data for image-to-image translation is often a very complex, costly, and laborious task.
CycleGAN was developed as a means to create image pairs for a set of unpaired data,
allowing the creation of image pairs for domains such as painting styles, environmental
conditions, and even conversions between animals. The name CycleGAN comes from
the property of cycle consistency in the GAN model, which is to say that if an image is
converted from one domain to another, converting the image back to the original domain
should reproduce the initial image. This cycle consistency was achieved by utilising two
generator/discriminator pairs for a total of four players in the GAN model. CycleGAN is
trained by passing the domain converted output from the first generator through a second
generator with the goal of reproducing the original image. By introducing cycle consistency
loss, CycleGAN was able to reduce the space of mapping functions, preventing inputs from
being converted to random permutations in the target domain. Interestingly, as the goal of
CycleGAN is to convert an image’s domain and not to generate a completely new image
from a noise pattern, it performs reasonably well on higher resolution images, such as
photos from the Cityscapes dataset [49]. In comparison to other image-to-image translation
models, the authors noted that there was a gap in performance compared to models trained
on paired datasets, such as pix2pix [50], a gap that may require the use of semi-supervision
in some form to close. However, CycleGAN’s methodology is still important as a model
that pushed the boundaries of what could be done with unsupervised learning. CycleGAN
has seen some derivative models, with most adapting it for use in specific applications, such
as CycleGAN-VC [51–53] for voice conversion, res-cycle GAN [54] for generating medical
scan pairs, and Cycle-Dehaze [55], a variant for dehazing images. Importantly,
while the CycleGAN model can be used to convert images between a range of different
domains, a specific model can only convert images between two domains at any given time.
Conversion between additional domains would require another CycleGAN model to be
trained for the additional domains. Alternatively, there are GAN models which are capable
of multi-domain conversion, such as StarGAN [56], although performance naturally varies.
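The cycle-consistency idea can be sketched as follows, assuming two generators G_AB and G_BA already exist (placeholders below); in CycleGAN-style training this term is added to the adversarial losses of the two generator/discriminator pairs, and the weighting shown is illustrative.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_AB, G_BA, real_A, real_B, weight=10.0):
    """Cycle term from CycleGAN-style training: translating A->B->A (and
    B->A->B) should reconstruct the original image."""
    reconstructed_A = G_BA(G_AB(real_A))   # A -> B -> A
    reconstructed_B = G_AB(G_BA(real_B))   # B -> A -> B
    return weight * (l1(reconstructed_A, real_A) + l1(reconstructed_B, real_B))

# Placeholder generators standing in for full image-to-image networks.
G_AB = nn.Identity()
G_BA = nn.Identity()
loss = cycle_consistency_loss(G_AB, G_BA,
                              torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```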
DALL-E 2 [57,58] and similar models, Craiyon (formerly DALL-E mini) [59], Stable
Diffusion [60], Parti [61], and Imagen [62], are all text-to-image generators. Text-to-image
generators primarily consist of a network that can learn feature representations from text,
typically a language model that can encode text into a latent space representation, and a
generator model capable of mapping the text representation into an image. While GANs
can be used on the generator side of text-to-image generators, other models have also been
used to produce similar results. Text-to-image generators initially used VAEs to produce
image outputs [63], but this later shifted to using GANs [64] and diffusion models for the
same reasons GANs superseded VAEs for image generation. Most notably, Craiyon and
Parti are the only models listed here that use GANs, with DALL-E 2, Stable Diffusion,
and Imagen all using diffusion models in their image generation. Craiyon uses the CLIP
network [65] for its text and image encoder and VQGAN [66] for its image generation.
The primary reason for Craiyon’s use of VQGAN over other GAN models is its ability
to synthesise high resolution images. GAN models have typically struggled to produce
high quality images at higher resolutions due to the quadratic increase in cost in pixel
space as the output resolution increases. VQGAN treats output images as a composition of
smaller images, rich in perceptual information, allowing the generation of higher quality
images at higher resolutions. The authors of the Craiyon model have noted that DALL-E,
which is closed source, still provides higher quality images in comparison, but that Craiyon
demonstrates the ability to generate reasonably good images at a much smaller resource
scale. Parti uses ViT-VQGAN [67] in a sequence-to-sequence model, a common model
in machine translation, where text tokens are transformed to image tokens. Similar to
Craiyon, this allows Parti to synthesise higher quality, higher resolution images. Moving to
DALL-E 2, Stable Diffusion, and Imagen, diffusion models function differently to GANs.
Training is conducted by adding noise to input images in multiple stages, with the target
being the reconstruction of the original image without noise. As more and more of the
input image is replaced with noise, the diffusion model eventually learns to create images
from nothing but noise. A language encoder is then used to generate noise maps from
which images are synthesised: DALL-E 2 uses OpenAI’s GPT-3 model [68], Stable Diffusion
uses the BERT-tokeniser [69], and Imagen has experimented with CLIP, T5 [70], and BERT.
Diffusion models hold several advantages over GANs, the key ones being scalability and
parallelisability, while avoiding the issues that come with adversarial training. These
advantages may eventually lead to diffusion or hybrid models superseding GANs as the
primary image synthesis model.
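A minimal sketch of this noising-and-denoising objective, in the style of a simplified DDPM training step, is shown below; the linear noise schedule and the tiny placeholder denoiser are illustrative, and a real text-to-image model would condition a U-Net on both the timestep and a text embedding.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)      # cumulative signal fraction

def diffusion_loss(model, x0):
    """Sample a random timestep, mix the clean image with Gaussian noise
    according to the schedule, and train the model to predict that noise."""
    batch = x0.size(0)
    t = torch.randint(0, T, (batch,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(batch, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # noised input
    predicted_noise = model(x_t, t)
    return nn.functional.mse_loss(predicted_noise, noise)

# Placeholder "denoiser" that ignores the timestep; a real model would be far larger.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, x, t):
        return self.net(x)

loss = diffusion_loss(TinyDenoiser(), torch.rand(4, 3, 32, 32))
```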
Deep convolutional GANs (DCGANs) are a subclass of CNN-based GANs developed for
unsupervised learning [71]. GANs as a model are known to be unstable during training, with
mode collapse being a common failure point. In order to address this instability, various
approaches have been taken. One such approach is DCGAN, which uses CNNs to stabilise
the GAN model and learn unsupervised representations of image data. Unlike other GAN
models, which typically have fully connected pooling layers, DCGAN replaces them with
convolutional and transpose convolution layers that are not fully connected. DCGAN also
uses batch normalisation layers, which are not present in other GAN models. A notable advantage of
DCGANs is that their architecture makes them stable to train in most scenarios and less susceptible
to mode collapse.
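The sketch below shows a DCGAN-style generator reflecting the design choices described above, building an image purely from transposed convolutions with batch normalisation between layers; the channel counts and output resolution are illustrative.

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Maps a latent vector to a 64x64 RGB image using transposed
    convolutions with batch normalisation, in the style of DCGAN."""
    def __init__(self, latent_dim=100, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, base_channels * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base_channels * 8), nn.ReLU(True),               # 4x4
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 4), nn.ReLU(True),               # 8x8
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 2), nn.ReLU(True),               # 16x16
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels), nn.ReLU(True),                   # 32x32
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                                       # 64x64 RGB
        )
    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

images = DCGANGenerator()(torch.randn(8, 100))   # -> (8, 3, 64, 64)
```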
StyleGAN and its subsequent versions are a family of GAN models developed by researchers at
Nvidia [72–74]. Initially developed as a combination of Progressive GAN [75] with style
transfer, StyleGAN’s primary field is in the synthesis of human faces in resolutions up
to 1024 × 1024 pixels. Starting with the base model, Progressive GAN learns by training
the generator progressively from small resolutions, 4 × 4 pixels, up to 1024 × 1024 pixels,
adding new convolutional layers for each increase in resolution. This method avoids
the model instability that GANs typically run into when generating images at higher
resolutions. While Progressive GAN itself is capable of generating a wide range of image
classes, StyleGAN adapts it with style transfer for the specific goal of human face synthesis.
Using GANs to synthesise human faces is a rather common application, but controlling the
output of specific facial features has been a difficult task. Using Progressive GAN’s features,
StyleGAN is able to build up features layer by layer, with each progressive layer controlling
finer features without affecting coarser features built up in previous layers. To do this, the
StyleGAN generator does not take points from a latent space as input, but rather, takes
input from noise layers and a standalone mapping network. The noise layers introduce
stochastic variation in order to produce unique variations on features while the standalone
mapping network controls which features are integrated in the different generator layers.
StyleGAN divides layers into three feature grades: coarse features, middle features, and
fine features. Coarse features, such as face pose and shape, are generally defined by the
lower resolution layers. Middle features, consisting of finer facial details and hair styles,
are defined by the middle layers. Fine details, such as eye, hair, and skin colour are defined
by the remaining layers up to 1024 × 1024 pixels. This division of features by layer allows
specific features to be carried over between inputs when synthesising new faces. StyleGAN
sees less use outside the generation of human faces, but within that field, it currently leads
the generation of photorealistic human faces.
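A highly simplified conceptual sketch of the mapping network and per-layer style and noise injection is given below; it omits progressive growing and the full synthesis network, and the layer sizes and the fixed noise strength are illustrative rather than the published architecture.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps latent z to an intermediate style code w (simplified: 3 layers)."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
            nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
            nn.Linear(latent_dim, latent_dim),
        )
    def forward(self, z):
        return self.net(z)

class StyledBlock(nn.Module):
    """One synthesis block: the style code w modulates normalised features
    (AdaIN-style scale and shift) and per-pixel noise adds stochastic detail."""
    def __init__(self, channels, latent_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_scale = nn.Linear(latent_dim, channels)
        self.to_shift = nn.Linear(latent_dim, channels)
        self.norm = nn.InstanceNorm2d(channels)

    def forward(self, x, w):
        x = self.conv(x)
        x = x + torch.randn_like(x) * 0.1                  # noise injection
        x = self.norm(x)
        scale = self.to_scale(w).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(w).unsqueeze(-1).unsqueeze(-1)
        return x * (1 + scale) + shift                      # style modulation

w = MappingNetwork()(torch.randn(2, 512))
features = StyledBlock(64)(torch.rand(2, 64, 16, 16), w)
```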
With the various advantages and disadvantages presented by VAEs, GANs, and diffusion mod-
els, there have naturally been attempts to combine them to produce better image syn-
thesis models. As with GAN models, there are a significant number of hybrid models.
This review will briefly touch on two classes of hybrid models, VAE-GAN hybrids and
Diffusion-GAN hybrids.
VAE-GAN hybrids [76–80] function by combining a VAE’s encoder with the GAN’s
generator network, creating an unsupervised generative model capable of encoding, gener-
ating, and discriminating image samples. The goal of the combined network is to improve
image sample quality, representation learning, sample diversity, and training stability
compared to individual VAE and GAN models.
Diffusion-GAN hybrids [81] attempt to use noise diffusion, commonly used in diffu-
sion models, to stabilise GAN models during training. By introducing a Gaussian mixture
distribution to inject instance noise, the maximum noise to data ratio is adjusted over
successive training steps, resulting in consistent performance gains over baseline GAN models
in synthesising photorealistic images.
Overall, GANs have multiple notable use cases in synthetic data generation. The
most common use case is to generate data for training other networks. This use case
can be seen in a wide range of applications including face generation [74], medical
applications [82], and sports camera calibration [83]. In the case of facial recognition,
the performance of networks trained on synthetic data has historically
not been very good, with large domain gaps resulting in poor real-world performance [31].
This issue can largely be traced back to poor intra-class variations causing lack of diversity
and the domain gap between real and synthetic data. GANs can be used to blend faces
together and vary domains to increase data diversity, improving training performance.
Current medical research often suffers from biases in data due to the limited sample pop-
ulation that can be accessed. GANs offer the ability to synthesise some types of medical
data such as skin lesions and cell images, forgoing the privacy concerns that usually come
with using medical data [84]. The other use of GANs is as a sub network to a larger model,
where models like CycleGAN can be used to adapt input data to the target domain during
training, allowing both real and synthetic data to be adapted to the target domain without
the input dataset needing to be modified directly [14,85].
Of course, GANs are not without their disadvantages either. Compared to other
methods of image synthesis, GANs are weaker at generating good quality
images at higher resolutions. According to Nvidia [86], a GAN typically requires training
data on the scale of 50,000 to 100,000 images in order to produce a high quality model,
which can be difficult to obtain in some fields of research, or for smaller research groups.
GANs are also not at the point where full 3D environment generation is possible, with 3D
object generation being the current extent of research in this area [87,88]. 2D video or
animation generation is still in its early stages [89–92], meaning that GANs are not yet at
the stage where data for computer vision tasks such as object tracking can be generated
effectively. GANs are also a notably weaker when it comes to feature labelling within an
J. Imaging 2022, 8, 310 14 of 33
image, while GANs are capable of producing specific class datasets, annotating specific
features is a more complex task, while some specially trained GAN models are capable of
object feature labelling, current text-to-image models may be able to generate an images
of a large crowd, but it would not be able to provide annotations of the number of people
present in the crowd and their locations within the image.
There has been research into countering dataset bias by using 3DMM-generated faces to
augment existing data sets [96], but this in itself requires a robust 3DMM
model. Similar to StyleGAN, 3DMMs find their primary use in 3D face generation, but
the model design itself does allow them to build parametric representations of many different
objects, provided that enough information is given to learn 3D object representations.
Parametric models are a more general method of data generation that refers to the use
of parameterised variables to generate data from a theoretically finite number of possible
instances. Compared to 3DMMs, parametric models see use in a wider range of fields.
Parametric models can be challenging to create and are often limited by the data the models
are built on. The model needs to ensure that the range of parameters provided is large
enough to provide a diverse set of data without any parameters conflicting with each other
to create unwanted results. In the case of face synthesis, with a large enough library of
facial features and expressions, parametric models can offer the creation of a large range of
human faces. There are also examples of parametric models being used in conjunction with
3DMMs to create synthetic faces [15]. Semi-parametric image synthesis has been
tested on datasets such as Cityscapes [97], and parametric models have also seen use in
more obscure applications such as fish recognition [24], where labelled photos of known fish were
overlaid onto underwater backgrounds in varying positions, rotations, and sizes. Parametric
model data generators allow for precise characteristics to be controlled, providing the
ability to conduct quantitative studies.
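In its simplest form, a parametric generator can be sketched as sampling from a set of named parameter ranges, with a constraint check to reject conflicting combinations before rendering; all parameter names, ranges, and the constraint below are illustrative.

```python
import random

PARAMETERS = {
    "yaw_degrees": (-45.0, 45.0),
    "expression": ["neutral", "smile", "surprise"],
    "lighting": ["indoor", "outdoor", "studio"],
    "age_years": (18, 80),
}

def sample_instance():
    """Sample one parameter combination describing a synthetic face to render.
    Continuous ranges are sampled uniformly; categorical ones at random."""
    spec = {}
    for name, domain in PARAMETERS.items():
        if isinstance(domain, tuple):
            spec[name] = random.uniform(*domain)
        else:
            spec[name] = random.choice(domain)
    # Toy constraint: reject an unwanted combination and resample.
    if spec["expression"] == "surprise" and spec["lighting"] == "studio":
        return sample_instance()
    return spec

dataset_specs = [sample_instance() for _ in range(1000)]
```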
Using games and game engines for data synthesis has several advantages
over other data synthesis methods and even real data in some cases. A major advantage is
that there is no need for training, unless environment generation or asset behaviour
needs to be automated in a way that does not use a parametric model. In the case of
game engines, data can be collected as soon as the environment and assets are set up.
Games forgo this entirely and data can be collected as simply as opening up the game
and recording game play. These two methods of data synthesis also allow the tracking of
ground truth data, which may not be available in other data synthesis methods and is difficult
to obtain in the real world. Object dimensions, classifications, movement, distances, and
other details can all be collected from assets present in games and game engines. Not all
that information may be necessary depending on the synthetic data requirements, but the
benefit is that the information is available if it is needed. Video data can also be captured
relatively easily in games and game engines in comparison to other data synthesis methods,
where synthesising a video clip can be a complex and resource heavy task that may not
even generate data to the required level of quality. There are also not many other options to
generate 3D environments when it comes to image data synthesis. The primary limitation
of building 3D environments in game engines is that a lot of the work still has to be
done by a human, while some environments can be procedurally generated, that restricts
the complexity of possible environments. Knowledge of game engine functionality and
scripting is generally required to fully utilise a game engine. Games avoid this limitation
at the cost of having less control over the environments as they are pre-built. Setting up
a game to capture data may require knowledge of the game engine the game is built on
as well as how to modify internal game files to set up specific scenarios which may not
normally occur within the game. There is also the upfront cost of the game to consider as
well as the end user agreements signed upon installing the game, which may restrict the
usage and modification of game data, and therefore, restrict the applications in which data
can be used, particularly if it is a commercial application.
Table 1. Performance of different data synthesis methods for face synthesis; the lower the FID score,
the better.
Table 2 compares the scale at which data can be synthesised compared to data col-
lected from the real world. For the task of crowd counting, many real datasets are limited
in resolution by camera quality and in size by the time required to
capture data. In many real datasets, there is also very little control over how many peo-
ple are present in a given frame. Publicly available real datasets with multiple scenes
are also rarer, with datasets that contain multiple scenes having no consistent scenes be-
tween images. In comparison, synthetic crowd datasets are able to generate images at
a much higher resolution and with greater control over scenes. The biggest advantage
is the sheer number of images and people that can be captured from a synthesised en-
vironment. Capturing millions of people in a real dataset for crowd counting would be an
astronomical task, if possible at all, let alone the manual annotation that would be required to label
such a dataset.
Tables 3 and 4 show the difficulty in comparing synthesised data in some fields, even
where the application is the same. In the field of synthesising maps from aerial photos,
the only common metric in evaluation was Turkers, humans hired to evaluate the images
within a time frame and predict if the image was real or synthesised. Subjectiveness of
the evaluation metric aside, the papers that listed the metrics did not list any additional
information outside of the output resolution and performance. There was no information
on training cost, training dataset, or anything else that could be used to compare and
evaluate performance. Similarly, when comparing the performance of synthesising label
maps from photos in the Cityscapes dataset, some papers do not provide any information
outside of a given evaluation metric, making it very difficult to come to a reasonable
conclusion regarding comparative performance. Overall, a great part of the struggle with
synthetic data is the difficulty of evaluation, either due to limited overlapping metrics for
comparison or lack of data to compare in general. It is possible to collect the source code of
various models and run them on the same system to produce an even field for evaluation,
but in practice, the amounts of data and computational resources involved make such a task a
massive undertaking that is not necessarily worth the resources used to do it.
Table 3. Performance of different data synthesis methods for aerial maps, photo to map; the higher
the percentage of Turkers that classify the output as real, the better.

Method               Resolution   Training Cost   Generation Cost   % Turkers Labeled Real
CoGAN [118]          512 × 512    N/A             N/A               0.6% ± 0.5%
BiGAN/ALI [77,119]   512 × 512    N/A             N/A               2.1% ± 1%
CycleGAN [48]        512 × 512    N/A             N/A               26.8% ± 2.8%
Table 4. Performance of different data synthesis methods for image segmentation against the
Cityscapes photos to labels dataset; the higher the per-pixel accuracy, the better.

Method               Resolution    Training Cost   Generation Cost   Per-Pixel Accuracy
BiGAN/ALI [77,119]   N/A           N/A             N/A               0.41
CoGAN [118]          N/A           N/A             N/A               0.45
CycleGAN [48]        N/A           N/A             N/A               0.58
Pix2Pix [50]         256 × 256     N/A             N/A               0.85
Pix2PixHD [107]      2048 × 1024   N/A             N/A               0.83
methods provide during the data synthesis process. That said, GANs do see use in synthe-
sising medical scans [122,123,126]. GANs are less favoured for object-labelled data
compared to more parametric synthesis methods due to their lacking ability to label objects
within the synthesised image as part of the data synthesis process. GANs and diffusion
models have reached the point of being able to generate realistic scenes and objects based
on word prompts, but even with the information of what the model is generating,
most GAN models are not designed to label objects within images. GANs can be trained to
detect objects [127], but such a model is still a separate network from the
generator model. In comparison, parametric models or virtual worlds typically have all
object information present in some way that can be used to annotate images post synthesis.
With regards to existing datasets for object recognition and classification, despite their
size, some research [128] has indicated that there are biases even in datasets like ImageNet
with regards to how objects are displayed and presented which affects models trained
on them. Using synthetic data to augment existing datasets can potentially improve data
diversity and reduce any existing dataset bias [129]. A logical extension is for GAN and diffusion model
datasets to provide good quality internal image labelling, either by changing existing
model architectures or by using a separate model in conjunction to label images [130].
Providing such information can greatly increase the range of applications that both GAN
and diffusion models can be utilised in. Another logical extension of image synthesis for
object recognition is object tracking, moving from images to videos [131,132]. Synthesising
video data is more complex than synthesising images for some methodologies due to the
need to maintain continuity from frame to frame while other methods already have such
inbuilt capabilities. As a result, most video synthesis is done by capturing data from virtual
worlds as opposed to being generated by GAN or diffusion models. However, there are still
opportunities for those models to explore, as the current time cost of generating data
from 3D worlds is still high due to rendering requirements, particularly for photorealistic
environments, whereas the cost for a GAN model to produce data is comparatively low
once training is complete.
quality models, 3DMM and parametric models typically have to use 3D face scans as a basis,
which runs into the same privacy issue that made synthetic data appealing in the first place.
3DMM and parametric models also tend to generate faces that look stiffer and less realistic
than what GAN models can produce. There is potential in using GANs to produce realistic
images of a human face from different perspectives to feed into training 3DMM models,
creating more realistic 3D human faces [135–137]. Beyond that, there is the potential to
expand realistic face generation to human body generation [138–140], both 2D images and
3D models. Naturally, that also brings up concerns about the misuse of data, which will
likely continue to become a bigger issue as image synthesis technology improves.
conditions need to occur in the first place for it to be captured, and the equipment necessary to
collect the data must also be present at the required location at the time.
Synthetic data allows data with specific environmental conditions to be generated,
either by digital manipulation [53] or by modifying environmental parameters in a virtual
world [152]. Video-to-video translation, an extension to image-to-image translation, is
already possible for vehicle video data with regards to ambient lighting [153] and Cycle-
GAN is capable of translating images between environmental domains. Extension into
environmental effects allows for the creation of datasets with much larger environmental
diversity than might have been possible in the real world as well as data that would be
riskier to collect in the real world, such as heavy storms or fog.
There is potential not only in augmenting existing datasets or adapting existing
datasets into new datasets, but also in integration with other fields of computer vision, such as
drone vision. This area of data synthesis has primarily found use in navigation applications,
but such data could also see use in creating paired datasets to train computer vision models
to remove environmental conditions, providing an estimation of scenes without the
obscuring environmental effects.
domain. Data diversity is also something that is difficult to measure and is less tangible
compared to photorealism. Tsirikoglou et al. [16] consider that real data will inevitably
skew synthesised data due to model training, with the only way to avoid it being to not use
real data for training at all. However, large scale data synthesis does offer the opportunity
to create training samples which reduce the inherent data bias. As a whole, evaluating data
diversity has become easier [19] with the introduction of various data scoring methods such
as IS [45] and FID [106]. However, most such methods are more focused on the evaluation
of the synthesis model itself and less on the synthesised images. Future evaluation methods
on synthesised data may have to consider data quality of individual images in order to
build up a dataset diversity score as opposed to evaluating the dataset as a whole. Data
diversity is an important factor when considering the quality of datasets in general, whether
real or synthetic, and improvements to evaluating data diversity are always welcomed,
more so if the method is applicable in multiple fields.
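For reference, FID [106] compares the Inception feature statistics of a real and a generated image set; writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the feature means and covariances of the real and generated data, it is computed as
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),
\]
with a lower score indicating closer distributions, which is the sense in which Table 1 reports lower FID as better.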
7.3. Photorealism
In data synthesis, photorealism is the goal of making synthesised data look as close as possible to
what would be captured in the real world, the core idea being that training on photorealistic
data will reduce the domain gap to the target domain. This applies not only to people, but
also to objects, environments, and environmental effects. Photorealism has been pointed
to as a potential method of improving synthetic data quality [16], but the exact benefits
are still being debated with different papers showing varying results of the impact of
photorealism on the performance of synthetic data [8,30]. Some fields are also known to
benefit from photorealism more than others, with face detection and analysis being one
of them. The impact on other applications is less clear with some research indicating
that domain randomisation of data, with less realistic environments, could also improve
data quality [30]. Naturally, there is also a trade-off with photorealism, with increases
in photorealism typically coming at the cost of higher computational requirements. This
is a subject that has not seen much discussion, partially due to the lack of information
regarding computational resource requirements, but whether increases in photorealism
and computational resource requirements are worth the improvement in performance is
something that will have to be considered in future work.
information can be used to weigh the benefits of generating synthetic data over collecting
real data. In some applications, the cost of generating high quality synthetic data might
be greater than the cost of manually collecting and annotating real data, although this is
hard to judge due to the lack of existing data on the matter. Moving forward, not a
significant amount of information needs to be included to be useful: a brief mention
of the hardware used and the time taken to generate a given number of data instances is
enough for basic comparisons and evaluations to be made.
8. Conclusions
As computer vision develops further and requirements on data increase, synthetic
data stands as a potential alternative to real data in many applications where collection
of real data is expensive. The current performance of computer vision models trained
on synthetic data might not match that of models trained on real data in many applications, but its performance in
some fields shows that synthetic data has the ability to at least augment existing data
sets if not replace them. There is still a notable domain gap between synthetic data and
real data in most applications and the effect of photorealism and data diversity on the
performance of synthetic data are not fully known, but clear directions exist in which
progress can be made towards evaluation. As research into data synthesis grows, the
need for complete information on synthesis models becomes more and more important,
with regards to training data, training costs, and evaluation metrics, information that is
currently not provided in many papers. Creating a framework for the evaluation of not just
synthetic data, but data in general, will likely become an important step in the progress of
data synthesis.
Author Contributions: Conceptualization, K.M. and J.C.; methodology, K.M.; investigation, K.M.;
writing—original draft preparation, K.M.; writing—review and editing, K.M. and J.C.; visualization,
K.M.; supervision, K.M. and J.C.; project administration, J.C. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Acknowledgments: The authors would like to acknowledge the Australian Commonwealth Govern-
ment’s support provided through the Australian Government Research Training Program Scholarship.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Atapour-Abarghouei, A.; Breckon, T.P. Real-time monocular depth estimation using synthetic data with domain adaptation via
image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT,
USA, 18–23 June 2018; pp. 2800–2810.
2. Sakaridis, C.; Dai, D.; Van Gool, L. Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vis. 2018, 126, 973–992.
[CrossRef]
3. Hattori, H.; Lee, N.; Boddeti, V.N.; Beainy, F.; Kitani, K.M.; Kanade, T. Synthesizing a scene-specific pedestrian detector and pose
estimator for static video surveillance. Int. J. Comput. Vis. 2018, 126, 1027–1044. [CrossRef]
4. Tripathi, S.; Chandra, S.; Agrawal, A.; Tyagi, A.; Rehg, J.M.; Chari, V. Learning to generate synthetic data via compositing. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019;
pp. 461–470.
5. Ekbatani, H.K.; Pujol, O.; Segui, S. Synthetic Data Generation for Deep Learning in Counting Pedestrians. In Proceedings of the
ICPRAM, Porto, Portugal, 24–26 February 2017; pp. 318–323.
6. Rogez, G.; Schmid, C. Image-based synthesis for deep 3D human pose estimation. Int. J. Comput. Vis. 2018, 126, 993–1008.
[CrossRef]
7. Behl, H.S.; Baydin, A.G.; Gal, R.; Torr, P.H.; Vineet, V. Autosimulate:(quickly) learning synthetic data generation. In Proceedings
of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 255–271.
8. Martinez-Gonzalez, P.; Oprea, S.; Garcia-Garcia, A.; Jover-Alvarez, A.; Orts-Escolano, S.; Garcia-Rodriguez, J. Unrealrox: An
extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation. Virtual Real. 2020,
24, 271–288. [CrossRef]
9. Müller, M.; Casser, V.; Lahoud, J.; Smith, N.; Ghanem, B. Sim4cv: A photo-realistic simulator for computer vision applications.
Int. J. Comput. Vis. 2018, 126, 902–919. [CrossRef]
10. Poucin, F.; Kraus, A.; Simon, M. Boosting Instance Segmentation With Synthetic Data: A Study To Overcome the Limits of Real
World Data Sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 11–17 October
2021; pp. 945–953.
11. Jaipuria, N.; Zhang, X.; Bhasin, R.; Arafa, M.; Chakravarty, P.; Shrivastava, S.; Manglani, S.; Murali, V.N. Deflating dataset bias
using synthetic data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 772–773.
12. Jiang, C.; Qi, S.; Zhu, Y.; Huang, S.; Lin, J.; Yu, L.F.; Terzopoulos, D.; Zhu, S.C. Configurable 3d scene synthesis and 2d image
rendering with per-pixel ground truth using stochastic grammars. Int. J. Comput. Vis. 2018, 126, 920–941. [CrossRef]
13. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Pixel-wise crowd understanding via synthetic data. Int. J. Comput. Vis. 2021, 129, 225–245.
[CrossRef]
14. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 8198–8207.
15. Kortylewski, A.; Schneider, A.; Gerig, T.; Egger, B.; Morel-Forster, A.; Vetter, T. Training deep face recognition systems with
synthetic data. arXiv 2018, arXiv:1802.05891.
16. Tsirikoglou, A.; Eilertsen, G.; Unger, J. A survey of image synthesis methods for visual machine learning. In Proceedings of the
Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2020; Volume 39; pp. 426–451.
17. Seib, V.; Lange, B.; Wirtz, S. Mixing Real and Synthetic Data to Enhance Neural Network Training–A Review of Current
Approaches. arXiv 2020, arXiv:2007.08781.
18. Gaidon, A.; Lopez, A.; Perronnin, F. The reasonable effectiveness of synthetic visual data. Int. J. Comput. Vis. 2018, 126, 899–901.
[CrossRef]
19. Frolov, S.; Hinz, T.; Raue, F.; Hees, J.; Dengel, A. Adversarial text-to-image synthesis: A review. Neural Netw. 2021, 144, 187–209.
[CrossRef] [PubMed]
20. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of
the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
21. Varol, G.; Romero, J.; Martin, X.; Mahmood, N.; Black, M.J.; Laptev, I.; Schmid, C. Learning from Synthetic Humans. In
Proceedings of the CVPR, Honolulu, HI, USA, 21–26 June 2017.
22. Shermeyer, J.; Hossler, T.; Van Etten, A.; Hogan, D.; Lewis, R.; Kim, D. Rareplanes: Synthetic data takes flight. In Proceedings of
the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2021; pp. 207–217.
23. Khadka, A.R.; Oghaz, M.; Matta, W.; Cosentino, M.; Remagnino, P.; Argyriou, V. Learning how to analyse crowd behaviour using
synthetic data. In Proceedings of the 32nd International Conference on Computer Animation and Social Agents, Paris, France,
1–3 July 2019; pp. 11–14.
24. Allken, V.; Handegard, N.O.; Rosen, S.; Schreyeck, T.; Mahiout, T.; Malde, K. Fish species identification using a convolutional
neural network trained on synthetic data. ICES J. Mar. Sci. 2019, 76, 342–349. [CrossRef]
25. Rosen, S.; Holst, J.C. DeepVision in-trawl imaging: Sampling the water column in four dimensions. Fish. Res. 2013, 148, 64–73.
[CrossRef]
26. Alhaija, H.A.; Mustikovela, S.K.; Mescheder, L.; Geiger, A.; Rother, C. Augmented reality meets computer vision: Efficient data
generation for urban driving scenes. Int. J. Comput. Vis. 2018, 126, 961–972. [CrossRef]
27. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
[CrossRef]
28. Marcu, A.; Costea, D.; Licaret, V.; Pîrvu, M.; Slusanschi, E.; Leordeanu, M. SafeUAV: Learning to estimate depth and safe landing
areas for UAVs from synthetic data. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich,
Germany, 8–14 September 2018.
29. Gaidon, A.; Wang, Q.; Cabon, Y.; Vig, E. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4340–4349.
30. Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training
deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 969–977.
31. Qiu, H.; Yu, B.; Gong, D.; Li, Z.; Liu, W.; Tao, D. SynFace: Face Recognition with Synthetic Data. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Korea, 11–17 October 2021; pp. 10880–10890.
32. Tang, Z.; Naphade, M.; Birchfield, S.; Tremblay, J.; Hodge, W.; Kumar, R.; Wang, S.; Yang, X. Pamtri: Pose-aware multi-task
learning for vehicle re-identification using highly randomized synthetic data. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 211–220.
33. Shen, B.; Li, B.; Scheirer, W.J. Automatic Virtual 3D City Generation for Synthetic Data Collection. In Proceedings of the 2021 IEEE
Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January 2021; pp. 161–170.
34. Richardson, E.; Sela, M.; Kimmel, R. 3D face reconstruction by learning from synthetic data. In Proceedings of the 2016 Fourth
International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 460–469.
35. Loing, V.; Marlet, R.; Aubry, M. Virtual training for a real application: Accurate object-robot relative localization without
calibration. Int. J. Comput. Vis. 2018, 126, 1045–1060. [CrossRef]
36. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
37. Im Im, D.; Ahn, S.; Memisevic, R.; Bengio, Y. Denoising criterion for variational auto-encoding framework. In Proceedings of the
AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
38. Meng, Q.; Catchpoole, D.; Skillicorn, D.; Kennedy, P.J. Relational autoencoder for feature extraction. In Proceedings of the 2017
International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 364–371.
39. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. arXiv 2018,
arXiv:1802.01436.
40. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial
Nets. In Proceedings of the Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.,
Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
41. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
42. Hindupur, A. The GAN Zoo. 2018. Available online: https://github.com/hindupuravinash/the-gan-zoo (accessed on 26
September 2022).
43. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the
International Conference on Machine Learning PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363.
44. Brock, A.; Donahue, J.; Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv 2018,
arXiv:1809.11096.
45. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. Adv. Neural
Inf. Process. Syst. 2016, 29, 2226–2234.
46. Wu, Y.; Donahue, J.; Balduzzi, D.; Simonyan, K.; Lillicrap, T. Logan: Latent optimisation for generative adversarial networks.
arXiv 2019, arXiv:1912.00953.
47. Han, S.; Srivastava, A.; Hurwitz, C.L.; Sattigeri, P.; Cox, D.D. not-so-biggan: Generating high-fidelity images on a small compute
budget. arXiv 2020, arXiv:2009.04433.
48. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
49. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset
for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
50. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 1125–1134.
51. Kaneko, T.; Kameoka, H. Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings
of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2100–2104.
52. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. Cyclegan-vc2: Improved cyclegan-based non-parallel voice conversion. In
Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Brighton, UK, 12–17 May 2019; pp. 6820–6824.
53. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. Cyclegan-vc3: Examining and improving cyclegan-vcs for mel-spectrogram
conversion. arXiv 2020, arXiv:2010.11672.
54. Harms, J.; Lei, Y.; Wang, T.; Zhang, R.; Zhou, J.; Tang, X.; Curran, W.J.; Liu, T.; Yang, X. Paired cycle-GAN-based image correction
for quantitative cone-beam computed tomography. Med. Phys. 2019, 46, 3998–4009. [CrossRef]
55. Engin, D.; Genç, A.; Kemal Ekenel, H. Cycle-dehaze: Enhanced cyclegan for single image dehazing. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 825–833.
56. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain
image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–22 June 2018; pp. 8789–8797.
57. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In
Proceedings of the International Conference on Machine Learning PMLR, Virtual, 18–24 July 2021; pp. 8821–8831.
58. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv
2022, arXiv:2204.06125.
59. Dayma, B.; Patil, S.; Cuenca, P.; Saifullah, K.; Abraham, T.; Lê Khac, P.; Melas, L.; Ghosh, R. DALL·E Mini. 2021. Available online:
https://github.com/borisdayma/dalle-mini (accessed on 27 September 2022).
60. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20
June 2022; pp. 10684–10695.
61. Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K.; et al. Scaling autoregressive
models for content-rich text-to-image generation. arXiv 2022, arXiv:2206.10789.
62. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv 2022, arXiv:2205.11487.
63. Mansimov, E.; Parisotto, E.; Ba, J.L.; Salakhutdinov, R. Generating images from captions with attention. arXiv 2015,
arXiv:1511.02793.
64. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of
the International Conference on Machine Learning PMLR, New York, NY, USA, 20–22 June 2016; pp. 1060–1069.
65. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning
transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine
Learning PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
66. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 12873–12883.
67. Yu, J.; Li, X.; Koh, J.Y.; Zhang, H.; Pang, R.; Qin, J.; Ku, A.; Xu, Y.; Baldridge, J.; Wu, Y. Vector-quantized image modeling with
improved vqgan. arXiv 2021, arXiv:2110.04627.
68. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
69. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
70. Raffel, C.; Luong, M.T.; Liu, P.J.; Weiss, R.J.; Eck, D. Online and linear-time attention by enforcing monotonic alignments. In
Proceedings of the International Conference on Machine Learning PMLR, Sydney, Australia, 6–11 August 2017; pp. 2837–2846.
71. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks.
arXiv 2015, arXiv:1511.06434.
72. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
73. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020;
pp. 8110–8119.
74. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-Free Generative Adversarial Networks. In
Proceedings of the NeurIPS, Virtual, 6–14 December 2021.
75. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017,
arXiv:1710.10196.
76. Srivastava, A.; Valkov, L.; Russell, C.; Gutmann, M.U.; Sutton, C. Veegan: Reducing mode collapse in gans using implicit
variational learning. Adv. Neural Inf. Process. Syst. 2017, 30, 3310–3320.
77. Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial feature learning. arXiv 2016, arXiv:1605.09782.
78. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. In
Proceedings of the International Conference on Machine Learning PMLR, New York, NY, USA, 20–22 June 2016; pp. 1558–1566.
79. Bao, J.; Chen, D.; Wen, F.; Li, H.; Hua, G. CVAE-GAN: Fine-grained image generation through asymmetric training. In
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2745–2754.
80. Shang, W.; Sohn, K.; Tian, Y. Channel-recurrent autoencoding for image modeling. In Proceedings of the 2018 IEEE Winter
Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2018; pp. 1195–1204.
81. Wang, Z.; Zheng, H.; He, P.; Chen, W.; Zhou, M. Diffusion-GAN: Training GANs with Diffusion. arXiv 2022, arXiv:2206.02262.
82. Ben-Cohen, A.; Klang, E.; Raskin, S.P.; Soffer, S.; Ben-Haim, S.; Konen, E.; Amitai, M.M.; Greenspan, H. Cross-modality synthesis
from CT to PET using FCN and GAN networks for improved automated lesion detection. Eng. Appl. Artif. Intell. 2019, 78, 186–194.
[CrossRef]
83. Chen, J.; Little, J.J. Sports camera calibration via synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019.
84. Chen, R.J.; Lu, M.Y.; Chen, T.Y.; Williamson, D.F.; Mahmood, F. Synthetic data in machine learning for medicine and healthcare.
Nat. Biomed. Eng. 2021, 5, 1–5. [CrossRef]
85. Sankaranarayanan, S.; Balaji, Y.; Jain, A.; Lim, S.N.; Chellappa, R. Learning from synthetic data: Addressing domain shift for
semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT,
USA, 18–22 June 2018; pp. 3752–3761.
86. Salian, I. Nvidia Research Achieves AI Training Breakthrough. 2021. Available online: https://blogs.nvidia.com/blog/2020/12/
07/neurips-research-limited-data-gan/ (accessed on 27 September 2022).
87. Wu, J.; Zhang, C.; Xue, T.; Freeman, B.; Tenenbaum, J. Learning a probabilistic latent space of object shapes via 3d generative-
adversarial modeling. Adv. Neural Inf. Process. Syst. 2016, 29.
88. Marriott, R.T.; Romdhani, S.; Chen, L. A 3d gan for improved large-pose facial recognition. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13445–13455.
89. Pumarola, A.; Agudo, A.; Martinez, A.M.; Sanfeliu, A.; Moreno-Noguer, F. Ganimation: Anatomically-aware facial animation
from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September
2018; pp. 818–833.
90. Mirzaei, M.S.; Meshgi, K.; Frigo, E.; Nishida, T. Animgan: A spatiotemporally-conditioned generative adversarial network for
character animation. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United
Arab Emirates, 25–28 October 2020; pp. 2286–2290.
91. Wu, X.; Zhang, Q.; Wu, Y.; Wang, H.; Li, S.; Sun, L.; Li, X. F3A-GAN: Facial Flow for Face Animation With Generative Adversarial
Networks. IEEE Trans. Image Process. 2021, 30, 8658–8670. [CrossRef] [PubMed]
92. Lin, C.Z.; Lindell, D.B.; Chan, E.R.; Wetzstein, G. 3D GAN Inversion for Controllable Portrait Image Animation. arXiv 2022,
arXiv:2203.13441.
93. Blanz, V.; Vetter, T. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer
Graphics and Interactive Techniques, New York, NY, USA, 8–13 August 1999; pp. 187–194.
94. Egger, B.; Smith, W.A.; Tewari, A.; Wuhrer, S.; Zollhoefer, M.; Beeler, T.; Bernard, F.; Bolkart, T.; Kortylewski, A.; Romdhani, S.;
et al. 3d morphable face models—past, present, and future. ACM Trans. Graph. 2020, 39, 1–38. [CrossRef]
95. Wood, E.; Baltrusaitis, T.; Hewitt, C.; Dziadzio, S.; Cashman, T.J.; Shotton, J. Fake It Till You Make It: Face analysis in the wild
using synthetic data alone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 11–17
October 2021; pp. 3681–3691.
96. Kortylewski, A.; Egger, B.; Schneider, A.; Gerig, T.; Morel-Forster, A.; Vetter, T. Analyzing and reducing the damage of dataset
bias to face recognition with synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, Long Beach, CA, USA, 16–17 June 2019.
97. Qi, X.; Chen, Q.; Jia, J.; Koltun, V. Semi-parametric image synthesis. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8808–8816.
98. Qiu, W.; Yuille, A. Unrealcv: Connecting computer vision to unreal engine. In Proceedings of the European Conference on
Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 909–916.
99. Qiu, W.; Zhong, F.; Zhang, Y.; Qiao, S.; Xiao, Z.; Kim, T.S.; Wang, Y. Unrealcv: Virtual worlds for computer vision. In Proceedings
of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1221–1224.
100. Hatay, E.; Ma, J.; Sun, H.; Fang, J.; Gao, Z.; Yu, H. Learning To Detect Phone-Related Pedestrian Distracted Behaviors With
Synthetic Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June
2021; pp. 2981–2989.
101. Acharya, P.; Lohn, D.; Ross, V.; Ha, M.; Rich, A.; Sayyad, E.; Hollerer, T. Using Synthetic Data Generation To Probe Multi-View
Stereo Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 11–17 October
2021; pp. 1583–1591.
102. Saleh, F.S.; Aliakbarian, M.S.; Salzmann, M.; Petersson, L.; Alvarez, J.M. Effective use of synthetic data for urban scene semantic
segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018;
pp. 84–100.
103. Riegler, G.; Urschler, M.; Ruther, M.; Bischof, H.; Stern, D. Anatomical landmark detection in medical applications driven by
synthetic data. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13
December 2015; pp. 12–16.
104. Fabbri, M.; Brasó, G.; Maugeri, G.; Cetintas, O.; Gasparini, R.; Osep, A.; Calderara, S.; Leal-Taixe, L.; Cucchiara, R. MOTSynth:
How Can Synthetic Data Help Pedestrian Detection and Tracking? In Proceedings of the IEEE/CVF International Conference on
Computer Vision, Seoul, Korea, 11–17 October 2021; pp. 10849–10859.
105. Yi, X.; Adams, S.; Babyn, P.; Elnajmi, A. Automatic catheter and tube detection in pediatric x-ray images using a scale-recurrent
network and synthetic data. J. Digit. Imag. 2020, 33, 181–190. [CrossRef]
106. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a
local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640.
107. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation
with Conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT,
USA, 18–22 June 2018.
108. Gokay, D.; Simsar, E.; Atici, E.; Ahmetoglu, A.; Yuksel, A.E.; Yanardag, P. Graph2Pix: A Graph-Based Image to Image Translation
Framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 11–17 October 2021;
pp. 2001–2010.
109. Bergman, A.W.; Kellnhofer, P.; Wang, Y.; Chan, E.R.; Lindell, D.B.; Wetzstein, G. Generative Neural Articulated Radiance Fields.
arXiv 2022, arXiv:2206.14314.
110. Chan, E.R.; Lin, C.Z.; Chan, M.A.; Nagano, K.; Pan, B.; Mello, S.D.; Gallo, O.; Guibas, L.; Tremblay, J.; Khamis, S.; et al. Efficient
Geometry-aware 3D Generative Adversarial Networks. In Proceedings of the CVPR, New Orleans, LA, USA, 18–22 June 2022.
111. Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without people models or
tracking. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28
June 2008; pp. 1–7.
112. Chen, K.; Loy, C.C.; Gong, S.; Xiang, T. Feature mining for localised crowd counting. In Proceedings of the BMVC, Surrey, UK,
3–7 September 2012; Volume 1, p. 3.
113. Ferryman, J.; Shahrokni, A. Pets2009: Dataset and challenge. In Proceedings of the 2009 Twelfth IEEE International Workshop on
Performance Evaluation of Tracking and Surveillance, Snowbird, UT, USA, 13–17 June 2009; pp. 1–6.
114. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016;
pp. 589–597.
115. Schröder, G.; Senst, T.; Bochinski, E.; Sikora, T. Optical flow dataset and benchmark for visual crowd analysis. In Proceedings of
the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand,
27–30 November 2018; pp. 1–6.
116. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. Crowdhuman: A benchmark for detecting human in a crowd. arXiv
2018, arXiv:1805.00123.
117. Zhang, Q.; Lin, W.; Chan, A.B. Cross-view cross-scene multi-view crowd counting. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 557–567.
118. Liu, M.Y.; Tuzel, O. Coupled generative adversarial networks. Adv. Neural Inf. Process. Syst. 2016, 29, 469–477.
119. Dumoulin, V.; Belghazi, I.; Poole, B.; Mastropietro, O.; Lamb, A.; Arjovsky, M.; Courville, A. Adversarially learned inference.
arXiv 2016, arXiv:1606.00704.
120. Dahmen, T.; Trampert, P.; Boughorbel, F.; Sprenger, J.; Klusch, M.; Fischer, K.; Kübel, C.; Slusallek, P. Digital reality: A model-based
approach to supervised learning from synthetic data. AI Perspect. 2019, 1, 1–12. [CrossRef]
121. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
122. Frid-Adar, M.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. Synthetic data augmentation using GAN for improved
liver lesion classification. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018),
Washington, DC, USA, 4–7 April 2018; pp. 289–293.
123. Waheed, A.; Goyal, M.; Gupta, D.; Khanna, A.; Al-Turjman, F.; Pinheiro, P.R. Covidgan: Data augmentation using auxiliary
classifier gan for improved COVID-19 detection. IEEE Access 2020, 8, 91916–91923. [CrossRef] [PubMed]
124. Mariani, G.; Scheidegger, F.; Istrate, R.; Bekas, C.; Malossi, C. Bagan: Data augmentation with balancing gan. arXiv 2018,
arXiv:1803.09655.
125. Huang, S.W.; Lin, C.T.; Chen, S.P.; Wu, Y.Y.; Hsu, P.H.; Lai, S.H. Auggan: Cross domain adaptation with gan-based data
augmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018;
pp. 718–731.
126. Zhou, Y.; Giffard-Roisin, S.; De Craene, M.; Camarasu-Pop, S.; D’Hooge, J.; Alessandrini, M.; Friboulet, D.; Sermesant, M.;
Bernard, O. A framework for the generation of realistic synthetic cardiac ultrasound and magnetic resonance imaging sequences
from the same virtual patients. IEEE Trans. Med. Imaging 2017, 37, 741–754. [CrossRef] [PubMed]
127. Prakash, C.D.; Karam, L.J. It GAN DO better: GAN-based detection of objects on images with varying quality. IEEE Trans. Image
Process. 2021, 30, 9220–9230. [CrossRef]
128. Barbu, A.; Mayo, D.; Alverio, J.; Luo, W.; Wang, C.; Gutfreund, D.; Tenenbaum, J.; Katz, B. Objectnet: A large-scale bias-controlled
dataset for pushing the limits of object recognition models. Adv. Neural Inf. Process. Syst. 2019, 32, 9448–9458.
129. Mikołajczyk, A.; Grochowski, M. Data augmentation for improving deep learning in image classification problem. In Proceedings
of the 2018 International Interdisciplinary PhD Workshop (IIPhDW), Swinoujscie, Poland, 9–12 May 2018; pp. 117–122.
130. Liu, L.; Muelly, M.; Deng, J.; Pfister, T.; Li, L.J. Generative modeling for small-data object detection. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6073–6081.
131. Balaji, Y.; Min, M.R.; Bai, B.; Chellappa, R.; Graf, H.P. Conditional GAN with Discriminative Filter Generation for Text-to-Video
Synthesis. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; Volume 1, p. 2.
132. Liu, M.Y.; Huang, X.; Yu, J.; Wang, T.C.; Mallya, A. Generative adversarial networks for image and video synthesis: Algorithms
and applications. arXiv 2020, arXiv:2008.02793.
133. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in
unconstrained environments. In Proceedings of the Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and
Recognition, Marseille, France, 17–20 October 2008.
134. Huang, R.; Zhang, S.; Li, T.; He, R. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving
frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October
2017; pp. 2439–2448.
135. Shen, Y.; Luo, P.; Yan, J.; Wang, X.; Tang, X. Faceid-gan: Learning a symmetry three-player gan for identity-preserving face
synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22
June 2018; pp. 821–830.
136. Yin, X.; Yu, X.; Sohn, K.; Liu, X.; Chandraker, M. Towards large-pose face frontalization in the wild. In Proceedings of the IEEE
International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3990–3999.
137. Deng, J.; Cheng, S.; Xue, N.; Zhou, Y.; Zafeiriou, S. Uv-gan: Adversarial facial uv map completion for pose-invariant face
recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22
June 2018; pp. 7093–7102.
138. Liu, W.; Piao, Z.; Min, J.; Luo, W.; Ma, L.; Gao, S. Liquid warping gan: A unified framework for human motion imitation,
appearance transfer and novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
Seoul, Korea, 27–28 October 2019; pp. 5904–5913.
139. Liu, W.; Piao, Z.; Tu, Z.; Luo, W.; Ma, L.; Gao, S. Liquid warping gan with attention: A unified framework for human image
synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5114–5132. [CrossRef]
140. Men, Y.; Mao, Y.; Jiang, Y.; Ma, W.Y.; Lian, Z. Controllable person image synthesis with attribute-decomposed gan. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 4–9 June 2020; pp. 5084–5093.
141. Khanam, F.T.Z.; Al-Naji, A.; Chahl, J. Remote monitoring of vital signs in diverse non-clinical and clinical scenarios using
computer vision systems: A review. Appl. Sci. 2019, 9, 4474. [CrossRef]
142. Khanam, F.T.Z.; Chahl, L.A.; Chahl, J.S.; Al-Naji, A.; Perera, A.G.; Wang, D.; Lee, Y.; Ogunwa, T.T.; Teague, S.; Nguyen, T.X.B.;
et al. Noncontact sensing of contagion. J. Imaging 2021, 7, 28. [CrossRef]
143. Condrea, F.; Ivan, V.A.; Leordeanu, M. In Search of Life: Learning from Synthetic Data to Detect Vital Signs in Videos. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Virtual, 4–9 June 2020;
pp. 298–299.
144. Dao, P.D.; Liou, Y.A.; Chou, C.W. Detection of flood inundation regions with Landsat/MODIS synthetic data. In Proceedings of
the International Symposium on Remote Sensing, Berlin, Germany, 11–15 May 2015.
145. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Nie, Q.; Cheng, H.; Liu, C.; Liu, X.; et al. Visdrone-det2018: The vision meets
drone object detection in image challenge results. In Proceedings of the European Conference on Computer Vision (ECCV)
Workshops, Munich, Germany, 8–14 September 2018.
146. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision
meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer
Vision Workshops, Seoul, Korea, 27–28 October 2019.
147. Fan, H.; Du, D.; Wen, L.; Zhu, P.; Hu, Q.; Ling, H.; Shah, M.; Pan, J.; Schumann, A.; Dong, B.; et al. Visdrone-mot2020: The vision
meets drone multiple object tracking challenge results. In Proceedings of the European Conference on Computer Vision, Glasgow,
UK, 23–28 August 2020; pp. 713–727.
148. Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van Gool, L.; Han, J.; et al. VisDrone-DET2021: The
vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, Seoul, Korea, 11–17 October 2021; pp. 2847–2854.
149. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A
multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, Seattle, WA, USA, 23–29 June 2020; pp. 11621–11631.
150. Weyand, T.; Araujo, A.; Cao, B.; Sim, J. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and
retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 23–29
June 2020; pp. 2575–2584.
151. Kelly, G.; McCabe, H. A survey of procedural techniques for city generation. ITB J. 2006, 14, 342–351.
152. Kishore, A.; Choe, T.E.; Kwon, J.; Park, M.; Hao, P.; Mittel, A. Synthetic data generation using imitation training. In Proceedings
of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 11–17 October 2021; pp. 3078–3086.
153. Chen, Y.; Pan, Y.; Yao, T.; Tian, X.; Mei, T. Mocycle-gan: Unpaired video-to-video translation. In Proceedings of the 27th ACM
International Conference on Multimedia, Nice, France, 22–29 October 2019; pp. 647–655.
154. Handa, A.; Patraucean, V.; Badrinarayanan, V.; Stent, S.; Cipolla, R. Understanding real world indoor scenes with synthetic
data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016;
pp. 4077–4085.
155. Mayer, N.; Ilg, E.; Fischer, P.; Hazirbas, C.; Cremers, D.; Dosovitskiy, A.; Brox, T. What makes good synthetic training data for
learning disparity and optical flow estimation? Int. J. Comput. Vis. 2018, 126, 942–960. [CrossRef]
156. Xue, Y.; Guo, Y.C.; Zhang, H.; Xu, T.; Zhang, S.H.; Huang, X. Deep image synthesis from intuitive user input: A review and
perspectives. Comput. Vis. Media 2022, 8, 3–31. [CrossRef]
157. Luo, S. A Survey on Multimodal Deep Learning for Image Synthesis: Applications, methods, datasets, evaluation metrics, and
results comparison. In Proceedings of the 2021 the 5th International Conference on Innovation in Artificial Intelligence, Xiamen,
China, 5–9 March 2021; pp. 108–120.