A Survey of Synthetic Data Augmentation Methods in Computer Vision
Abstract—The standard approach to tackling computer vision problems is to train deep convolutional neural network (CNN) models
using large-scale image datasets which are representative of the target task. However, in many scenarios, it is often challenging to
obtain sufficient image data for the target task. Data augmentation is a way to mitigate this challenge. A common practice is to explicitly
transform existing images in desired ways so as to create the required volume and variability of training data necessary to achieve
good generalization performance. In situations where data for the target domain is not accessible, a viable workaround is to synthesize
training data from scratch—i.e., synthetic data augmentation. This paper presents an extensive review of synthetic data augmentation
techniques. It covers data synthesis approaches based on realistic 3D graphics modeling, neural style transfer (NST), differentiable
neural rendering, and generative artificial intelligence (AI) techniques such as generative adversarial networks (GANs) and variational
autoencoders (VAEs). For each of these classes of methods, we focus on the important data generation and augmentation techniques,
general scope of application and specific use-cases, as well as existing limitations and possible workarounds. Additionally, we provide
a summary of common synthetic datasets for training computer vision models, highlighting the main features, application domains and
supported tasks. Finally, we discuss the effectiveness of synthetic data augmentation methods. Since this is the first paper to explore
synthetic data augmentation methods in great detail, we hope to equip readers with the necessary background information and
in-depth knowledge of existing methods and their attendant issues.
Index Terms—Data augmentation, generative AI, neural rendering, data synthesis, synthetic data, neural style transfer.
the aforementioned methods. Moreover, many computer vision tasks are often use-case sensitive, requiring task-specific data formats and annotation schemes. This makes it difficult for broadly-annotated, publicly-available large-scale datasets to meet the specific requirements of these tasks. In these cases, the only viable approach is to generate training data from scratch. Modern image synthesis methods can simulate different kinds of task-specific, real-world variability in the synthesized data. They are particularly useful in applications such as autonomous driving and navigation [1], [2], pose estimation [3], [4], affordance learning [5], [6], object grasping [7], [8] and manipulation [9], [10], where obtaining camera-based images is time-consuming and expensive. Moreover, in some applications, bitmap pixel images may simply be unsuitable. Data synthesis methods can readily support non-standard image modalities such as point clouds and voxels. Approaches based on 3D modeling also provide more scalable resolutions as well as flexible content and labeling schemes adapted to the specific use-case.

1.3 Motivation for this survey
Data augmentation approaches based on data synthesis are becoming increasingly important in the wake of severe data scarcity in many machine learning domains. In addition, the requirements of emerging machine vision applications such as autonomous driving, robotics and virtual reality are increasingly difficult to meet using traditional data transformation-based augmentation. For this reason, data synthesis has become an important means of providing quality training data for machine learning applications. Unfortunately, while many surveys on data augmentation approaches exist, very few works deal with synthetic data augmentation methods. This work is motivated by the lack of adequate discussion of this important class of techniques in the scientific literature. Consequently, we aim to provide an in-depth treatment of synthetic data augmentation methods to enrich the current literature on data augmentation. We discuss the various issues of data synthesis in detail, including concise information about the main principles, use-cases and limitations of the various approaches.

1.4 Outline of work
In this work, we first provide a broad overview of data augmentation in Section 2 and provide a concise taxonomy of synthetic data augmentation approaches in Section 3. Further, in Sections 4 through 7, we explore in detail the various techniques for synthesizing data for machine vision tasks. Here we discuss the important principles, approaches, use-cases and limitations of each of the main classes of methods. The approaches surveyed in this work are generative modeling, computer graphics modeling, neural rendering, and neural style transfer (NST). We present a detailed discussion of each of these approaches in the following sections. We also compare the advantages and disadvantages of these classes of data synthesis methods. We summarize the main features of common synthetic datasets in Section 8. In Section 9 we discuss the effectiveness of synthetic data augmentation in machine vision domains. We present a summary of the main issues in Section 10 and outline promising directions for future research in Section 11. Finally, we conclude in Section 12. A detailed outline of this survey is presented in Figure 1.

2 OVERVIEW OF DATA AUGMENTATION METHODS
Geometric data augmentation methods such as affine transformations [11], projective transformations [12] and nonlinear deformation [13] are aimed at creating various transformations of the original images to encode invariance to spatial variations resulting from, for example, changes in object size, orientation or viewing angle. Common geometric transformations include rotation, shearing, scaling or resizing, nonlinear deformation, cropping and flipping. On the other hand, photometric techniques, for example color jittering [14], lighting perturbation [15], [16] and image denoising [17], manipulate qualitative properties such as image contrast, brightness, color, hue, saturation and noise levels, and thereby render the resulting deep learning models invariant to changes in these properties. In general, to ensure good generalization performance in different scenarios, it is often necessary to apply many of these procedures simultaneously.

Recently, more advanced data augmentation methods have become common. One of the most important classes of techniques [18], [19], [20], [21] is based on transforming different image regions discretely instead of uniformly manipulating the entire input space. This type of augmentation has been shown to be effective in simulating complex visual effects such as non-uniform noise, non-uniform illumination, partial occlusion and out-of-plane rotations. The second main direction of data augmentation exploits feature-space transformation as a means of introducing variability into the training data. These regularization approaches manipulate learned feature representations within deep CNN layers to transform the visual appearance of the underlying images. Examples of feature-level transformation approaches include feature mixing [18], [22], feature interpolation [23], feature dropping [24] and selective augmentation of useful features [25]. These methods do not often lead to semantically meaningful alterations. Nonetheless, they have proven very useful in enhancing the performance of deep learning models. The third direction is associated with the automation of the augmentation process. To achieve this, different transformation operations based on traditional image processing techniques are typically applied to manually generate various primitive augmentations. Optimization algorithms are then used to automatically find the best model hyperparameters, as well as the augmentation types and their magnitudes, for the given task.

The approaches described above are realizable only when training data exists, and the goal of augmentation is to transform the available data to obtain desirable features. This work focuses on approaches that seek to generate novel training data even in cases where data for the target task is inaccessible.
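As a concrete reference point for the transformation-based augmentation reviewed in this section, the following minimal sketch composes a few common geometric and photometric operations using torchvision; the parameter values and the input file name are illustrative assumptions rather than settings taken from any of the cited works.

```python
# Minimal sketch of a traditional, transformation-based augmentation pipeline combining
# geometric (crop, flip, rotation) and photometric (color jitter, blur) operations.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # geometric: crop + resize
    transforms.RandomHorizontalFlip(p=0.5),                # geometric: flip
    transforms.RandomRotation(degrees=15),                 # geometric: rotation
    transforms.ColorJitter(brightness=0.3, contrast=0.3,   # photometric: color jittering
                           saturation=0.3, hue=0.05),
    transforms.GaussianBlur(kernel_size=3),                # photometric: mild blurring
    transforms.ToTensor(),
])

img = Image.open("example.jpg").convert("RGB")      # hypothetical input image
augmented_views = [augment(img) for _ in range(8)]  # eight randomized variants of one image
```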
Figure 1. A detailed outline of this survey: 1. Introduction (1.1 Background; 1.2 Overview of data augmentation methods; 1.3 The need for synthetic data augmentation; 1.4 Outline of survey); 2. Overview of DA methods; 3. Taxonomy of approaches; 4. Generative modeling (4.1 Common generative network models for DA; 4.2 Approaches to DA with generative modeling: 4.2.1 Image synthesis, 4.2.2 Image-to-image translation, 4.2.3 Image manipulation, 4.2.4 Image quality enhancement; 4.3 Limitations and workarounds of GM); ...; 12. Conclusion.
Several survey works (e.g., [26], [27], [28], [29], [30]) have explored data augmentation in great detail. Shorten et al. [27], in particular, present a broad discussion of important data augmentation methods. However, like most previous surveys, their coverage of data synthesis methods is rather limited.

To address this gap, in this survey we focus mainly on data augmentation techniques that generate synthetic data for training machine learning models in computer vision domains. The main approaches covered here are methods based on generative AI, procedural data generation using 3D CAD tools and game engines, differentiable neural rendering, and neural style transfer. We consider that such a narrow scope will enable us to provide a much more detailed treatment of the topic and its important issues while at the same time maintaining a relatively concise volume.

3 TAXONOMY OF SYNTHETIC DATA AUGMENTATION METHODS
In practice, four main classes of synthetic data generation techniques are commonly used:
• generative modeling
• computer graphics modeling
• neural rendering
• neural style transfer
Generative modeling methods rely on learning the inherent statistical distribution of the input data in order to (automatically) generate new data. The second class of
the generated 3D data can be used for training. The fourth class of methods is known as neural style transfer. These approaches combine features of different semantic levels extracted from different images to create a new set of images. A general classification of synthetic data augmentation approaches is depicted in Figure 2.
4 GENERATIVE MODELING
Figure 3. Functional block diagram of a basic GAN (a) and a conditional GAN (b).

Generative AI techniques present the most promising prospect for generating synthetic datasets for complex computer vision tasks. Generative modeling methods are a class of deep learning techniques that utilize special deep neural network architectures to learn a holistic representation of the underlying categories in order to generate useful synthetic data for training deep learning models. Generally, they work by learning possible statistical distributions of the target data using noise or examples of the target data as input. This knowledge about the distribution of the training data can thus enable them to generate complex representations. Examples of generative models include Boltzmann machines (BMs) [31] and restricted Boltzmann machines (RBMs) [32], generative adversarial networks (GANs) [33], variational autoencoders (VAEs) [34], autoregressive models [35] and deep belief networks (DBNs) [36]. Currently, GANs and VAEs and their various variants, such as [37], [38], [39], are the most widely used generative models for data augmentation.

In the conditional GAN (cGAN) [40], the generation process is conditioned on a control input to the generator and discriminator (Figure 3(b)). This provides additional information that helps the network to reproduce desired characteristics in the target class. Starting from the simple, fully connected architectures in [33], [40], many new architectural innovations have been introduced to improve the GAN's ability to model data in image domains. Notable among these are the Deep Convolutional GAN (DCGAN), which employs transposed-convolutional layers in the generator and convolutional layers in the discriminator instead of fully connected layers throughout; the Laplacian Pyramid GAN (LAPGAN) [41], which uses multiple generator-discriminator pairs in a multi-scale pyramidal structure; and the Information Maximizing Generative Adversarial Network (InfoGAN).
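To make the adversarial setup of Figure 3(a) concrete, the following is a minimal, self-contained PyTorch sketch of one generator/discriminator update. The architectures, hyperparameters and the random tensors standing in for real images are illustrative assumptions only, not the configuration of any cited model.

```python
# One adversarial training step: the discriminator learns to separate real from generated
# images, while the generator learns to produce images the discriminator labels as real.
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, img_dim)              # stand-in for a batch of real training images
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: push real images toward "real", generated images toward "fake".
fake = G(torch.randn(32, latent_dim)).detach()
loss_d = bce(D(real), ones) + bce(D(fake), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator into labelling generated images as "real".
fake = G(torch.randn(32, latent_dim))
loss_g = bce(D(fake), ones)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

A conditional GAN in the sense of Figure 3(b) would additionally feed a label (for example, an embedded class vector) to both networks, typically by concatenating it to the noise vector and to the discriminator input.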
Several works have also proposed more complex generative modeling frameworks that leverage the advantages of both types of models.

4.2 Approaches to image data augmentation with generative AI models
In the context of data augmentation, generative models can be applied in several different ways. These include generating new images, transferring specific image characteristics from source to target images, and enhancing the perceptual quality or diversity of the training data. Data augmentation approaches based on these different principles have been used in diverse computer vision tasks, including medical image classification [58], [59], object detection [60], pose estimation [61] and visual tracking [62]. Common approaches to solving data augmentation problems with generative modeling techniques are presented in subsections 4.2.1 to 4.2.4. The key aspects and application scenarios of these methods are summarized in Table 1.
4.2.1 Image synthesis
In application settings where it is difficult or impossible to obtain sufficient labelled data, the main goal of generative modeling is to generate synthetic data [63], [64], [65], [66], [67] to be used in place of, or in combination with, real data. Models used in this situation are aimed at synthesizing specific categories of image data to aid training. The primary objective of the generative model, then, is to generate samples that cover the distribution of the underlying categories. This can be achieved with conventional CNN-based GAN architectures without utilizing conditional information, as in the case of conditional GANs or conditional VAEs. For example, Kaplan et al. [68] demonstrated the ability of GANs and VAEs to generate photorealistic retinal images without using conditional information. In [64], Bowles et al. employ a PGGAN model to synthesize images for Computed Tomography (CT) and Magnetic Resonance (MR) image segmentation tasks. The authors showed that using GANs to generate synthetic images improves segmentation performance on the two different tasks, irrespective of the original data size and the proportion of synthetic samples added. To improve the perceptual quality of generated images, several GANs may be used, each tuned to create a specific category (e.g., in [63]). In [66], Souly et al. proposed a generative adversarial network to generate a large amount of labeled images in a semi-supervised manner, using unlabeled GAN-synthesized image data to help in semantic segmentation tasks.
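As an illustration of how such synthetic samples might be used in place of, or in combination with, real data, the sketch below draws images for one class from a trained generator and concatenates them with an existing labelled dataset. The generator and both datasets are random stand-ins; this is a usage pattern in the spirit of the works above, not the procedure of any particular study.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

latent_dim = 64

# Stand-in for a generator that has already been trained on the target class
# (any module mapping latent vectors to images would do here).
trained_generator = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh(),
                                  nn.Unflatten(1, (3, 32, 32)))

def synthesize(generator, n_samples, label):
    """Draw n_samples synthetic images of one class from a trained generator."""
    with torch.no_grad():
        images = generator(torch.randn(n_samples, latent_dim))
    labels = torch.full((n_samples,), label, dtype=torch.long)
    return TensorDataset(images, labels)

# Mix synthetic samples with an existing (here: randomly generated stand-in) real dataset.
real_dataset = TensorDataset(torch.randn(200, 3, 32, 32), torch.randint(0, 2, (200,)))
augmented = ConcatDataset([real_dataset, synthesize(trained_generator, 100, label=1)])
loader = DataLoader(augmented, batch_size=64, shuffle=True)
```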
4.2.2 Image-to-image translation
Image-to-image translation [40] is a technique used to transform an image by transferring its content into the visual style of another image. In its basic form, the approach involves learning a mapping from a source to a target domain. The approaches rely on principles of conditional generation using models such as cGANs and cVAEs. Generative modeling methods based on image-to-image translation can be used to convert images from one color space to another. In particular, approaches for converting among infrared, grayscale and RGB color images are common [73], [74]. The techniques can also enable different visual effects and specific features such as contrast, texture, illumination and other complex photometric transformations which would otherwise be challenging for traditional augmentation approaches. The approach, as a data augmentation method, has a wide scope of applications in computer vision. In medical imaging applications, for example, approaches based on image-to-image translation can be used to transfer images from one modality to another (e.g., from CT to MRI or X-ray image format). Figure 7 shows different translations using the StyleGAN model [45].

Another important application of image-to-image translation is to synthesize view-consistent scenes, where 3D information and the overall spatial structure of the scene are preserved in the process of translation [49], [75]. This is extremely useful in semantic scene understanding tasks. Image-to-image translation approaches have also been widely used to generate images of novel views from particular views, such as frontal facial views from angular views (e.g., [76], [77]), or to produce different human poses from a single pose (e.g., [78]). The effectiveness of image-to-image translation as a viable data augmentation strategy has also been demonstrated in challenging computer vision tasks such as visual tracking [79], [80], [81], person re-identification [82], [83], [84], object detection under severe occlusion [85] and strong lighting conditions [86]. Figure 8 depicts multi-view images generated by different GAN models.
While traditional image-to-image translation approaches [40] are generally based on cGAN and cVAE architectures that utilize paired images, newer techniques are based on the concept of cycle consistency. For example, CycleGAN [45], DualGAN [87] and DiscoGAN [44] can translate images from one style to another without paired images. These methods learn a mapping function between source and target image domains by means of unsupervised learning; that is, images in the target domain do not have corresponding examples in the source. The approach is useful in many practical applications since it is often challenging to obtain paired images in real-world scenarios for training machine learning models. For instance, for a specific environment, images for autonomous driving tasks may only be available for a limited set of weather conditions, but it may be required to improve robustness by training on a wide range of possible conditions. In such a situation, with unpaired image-to-image translation methods, the available content image can be transferred to all desired visual appearances without requiring additional content images.

The integration of large language models (LLMs) [88] and vision-language models (VLMs) [89] with GANs, VAEs and other generative models allows the process of image-to-image translation to be automated using textual prompts. LLMs and VLMs can also enable textual descriptions of visual scenes to be automatically generated as supplementary input for training computer vision models.
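The cycle-consistency idea underlying CycleGAN-style unpaired translation can be summarized in a few lines: two generators map between domains A and B, and each image should be recoverable after a round trip. The sketch below shows only this reconstruction term with untrained placeholder networks; in practice it is added to the adversarial losses of the two domains.

```python
# Round-trip (cycle-consistency) loss for unpaired image-to-image translation.
import torch
import torch.nn as nn

G_AB = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # domain A -> domain B (placeholder)
G_BA = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # domain B -> domain A (placeholder)
l1 = nn.L1Loss()

x_a = torch.randn(4, 3, 64, 64)   # unpaired batch from domain A (e.g., sunny scenes)
x_b = torch.randn(4, 3, 64, 64)   # unpaired batch from domain B (e.g., rainy scenes)

# Reconstructions after A -> B -> A and B -> A -> B round trips should match the inputs.
cycle_loss = l1(G_BA(G_AB(x_a)), x_a) + l1(G_AB(G_BA(x_b)), x_b)
cycle_loss.backward()   # combined with the adversarial losses when training both generators
```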
4.2.3 Image manipulation
Another common generative modeling approach to data augmentation is to qualitatively transform training data in desired ways by performing specific photometric (e.g., [91], [92]) and geometric ([71], [93], [94], [95]) image manipulations.
Figure 8. Photorealistic, multi-view images generated by 3D-aware image synthesis approaches: (a) GRAF [47], (b) GIRAFFE [48], (c) pi-GAN [49]
and (d) MVCGAN [50].
Table 1
Common approaches to data augmentation using generative modeling

Approach: Image synthesis. Main function of generative model: generate new samples per given categories. Classic models: Deep Convolutional GAN (DCGAN) (e.g., in [69]). Application scenario: synthesize new samples where no training data exist.
Photometric image operations such as binarization [91], colorization (i.e., conversion from grey-scale to color images) [96] and dehazing [92] are common tasks that can be accomplished with generative modeling.

4.2.4 Image quality enhancement
In some computer vision tasks, the images available for training deep learning models are often of low quality. One way to improve performance is to enhance the quality of the training data. For example, generative modeling is commonly used to clean noisy images (e.g., in [72]). Also, low-resolution images can be qualitatively improved by using super-resolution GANs such as Pix2Pix [40], SRGAN [43], ESRGAN [97] or their derivatives. In a recent work [98], Wang et al. used a modified Pix2Pix model as a super-resolution GAN to increase the resolution of low-resolution microscopic images for training deep neural networks. They first generated additional data using CycleGAN before employing the super-resolution GAN to improve the quality of the training dataset. In many studies, generative models have been used to generate large, clean images from noisy data (e.g., [99], [100]), low-resolution images (e.g., [101]), corrupted labels (e.g., [102]) or images taken in
adverse weather conditions such as rainy weather [90]. Figure 9 shows the effectiveness of image enhancement techniques such as de-raining in improving the perceptual quality of the input data. Generative modeling approaches that apply geometric transforms to training samples have also been reported in [103], [104], [105]. Some recent approaches are aimed at enhancing the perceptual quality of CAD-generated models. For example, RenderGAN [105] and DA-GAN [106] seek to improve performance by refining simple, synthetically generated 3D models so as to endow them with photorealistic appearance and desirable visual properties.

Figure 9. Examples of using generative modeling approaches to augment data by enhancing perceptual quality, in this case by de-raining. Left: images taken on a rainy day, where the object detector is unable to localize objects; right: de-rained images showing correctly predicted bounding boxes. The images by Zhang et al. [90] show improvement in object detection performance after de-raining the input samples.

4.3 Common limitations of generative modeling techniques and possible workarounds
One of the main problems with generative modeling methods is that they require very large training datasets for good performance [107]. GANs are also susceptible to overfitting, a situation where the discriminator memorizes all the training inputs and no longer offers useful feedback for the generator to improve its performance. To address these problems, a number of works [108], [109], [110] have considered augmenting the data on which the generative model is trained. These approaches have been shown to be effective in alleviating small-data and overfitting problems. However, employing augmentation strategies can lead to a situation where the generator reproduces samples from the distribution of the augmented data, which may not be truly representative of the target task. Consistency regularization [111], [112] is a recent approach that has been proposed to prevent augmented data from being strictly reproduced by the generator. More advanced methods to improve GAN generalization include techniques based on perturbed convolutions [113] and Extreme Value Theory [114]. For instance, to enable GANs to handle rare samples, Bhatia et al. [115] proposed a probabilistic approach based on Extreme Value Theory [114] that allows realistic as well as out-of-distribution or extreme training samples to be generated from a given distribution. In this context, extreme samples are training examples that deviate significantly from those present in the dataset. The approach also provides a way to set the degree of deviation and the likelihood of the occurrence, or proportion, of these deviations in the generated data. Liu et al. [116] demonstrated that a GAN may in some situations fail to generate the task-required data because it may be optimizing for a different objective altogether. Specifically, in [116] a GAN designed for object detection tasks was shown to optimize for the realism of the generated images instead.
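One simple way to act on the data a GAN is trained on, in the spirit of the augmentation-based strategies cited above, is to apply the same randomized, differentiable perturbations to both real and generated batches before they reach the discriminator. The following is a simplified sketch of that idea, not the exact procedure of [108], [109], [110].

```python
# Apply identical, differentiable random perturbations to whatever the discriminator sees.
import torch

def diff_augment(x):
    """Random flip, brightness scaling and mild noise on a batch of images (N, C, H, W)."""
    if torch.rand(1).item() < 0.5:                           # random horizontal flip
        x = torch.flip(x, dims=[3])
    x = x * (0.8 + 0.4 * torch.rand(x.size(0), 1, 1, 1))     # random brightness scaling
    x = x + 0.02 * torch.randn_like(x)                       # small additive noise
    return x

augmented = diff_augment(torch.randn(8, 3, 64, 64))          # quick self-contained check

# Inside a GAN training step, the discriminator would then only ever see augmented images:
# loss_d = bce(D(diff_augment(real)), ones) + bce(D(diff_augment(fake.detach())), zeros)
# loss_g = bce(D(diff_augment(fake)), ones)
```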
of the target data), quality (i.e., overall photorealism), and
other characteristics. The most important of these metrics
4.3 Common limitations of generative modeling tech- include Inception Score (IS) [119] and Fréchet Inception
Distance (FID) [120], as well as their new variants such
niques and possible workarounds
as Spatial FID [121], Unbiased FID [122], Memorization-
One of the main problems with generative modeling meth- informed FID [123] and Class-aware FID [124]. These met-
ods is that they require very large training data for good per- rics allow to evaluate not only the general quality but also
formance [107]. GANs are also susceptible to overfitting – a important aspects such as bias and fairness of generative
situation where the discriminator memorizes all the training models. Manual evaluation by visual inspection is another
inputs and no longer offer useful feedback for the generator common way to determine the quality of data [125], [126]
to improve performance. To address these problems of GAN synthesized using generative modeling techniques. The ap-
performance, a number of works [108], [109], [110] have proach relies on the developer’s domain knowledge to make
considered augmenting the data on which the generative good judgment about the appropriateness of the training
model is trained. These approaches have been shown to be data. In some cases, this may offer the best guarantee for
effective in alleviating small data and overfitting problems. success. However, the approach is very subjective and prone
However, employing augmentation strategies can lead to to biases of the human assessor. Moreover, because of the
a situation where the generator reproduces samples from limited capacity of human experts, the method cannot be
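For reference, the Fréchet Inception Distance reduces to a closed-form expression over feature statistics: FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^(1/2)). The sketch below computes this directly from two feature matrices; random vectors stand in for the Inception activations (typically 2048-dimensional) that would be extracted from real and generated images in practice.

```python
# Worked example of the FID formula applied to feature statistics.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real                    # discard tiny imaginary parts from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

feats_real = np.random.randn(500, 256)        # stand-in for features of real images
feats_fake = np.random.randn(500, 256)        # stand-in for features of generated images
print(frechet_distance(feats_real, feats_fake))
```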
A common class of problems with generative modeling techniques relates to training challenges. In particular, generative models based on GANs suffer from unstable training. One of the main causes of this issue is the so-called mode collapse problem [127]. This phenomenon occurs when the generator fails to learn the variety in the input data and is thus able to generate only a particular type of data that consistently beats the discriminator but is inferior
in terms of diversity. Common solutions to this problem include weight normalization [128] and other regularization techniques [129], as well as architectural innovations [130]. Another serious problem with training generative models is the non-convergence problem [131]. Researchers have attempted to address this difficulty by employing techniques such as adaptive learning rates [132], [133], restart learning [134] and evolutionary optimization of model parameters [135]. These approaches alleviate the problem to an extent but do not completely eliminate it.

[Figure: the traditional graphics rendering pipeline. A 3D model, together with textures and an illumination model, passes through geometry processing, rasterization and fragment processing to produce the output 2D image.]

tasks such as pose understanding, gesture and action recognition are immensely aided by 3D supervision. Techniques based on CAD modeling can also simulate nonstandard visual data such as point clouds (e.g., [139]), voxels (e.g., [140]), thermal images (e.g., [141]), or a combination of two or more of these modalities (e.g., [142]). State-of-the-art computer graphics tools are able to produce fairly realistic visual data for training machine learning models. Three-dimensional game engines are particularly promising in this regard, as they can simulate complex natural processes and generate near-realistic environments under different conditions using real physics models. This capability provides an opportunity to train machine learning models on complex real-world (natural) scenes. Examples of simple 3D objects from the Amazon Berkeley Objects (ABO) dataset modeled using CAD tools are shown in Figure 11. Figure 12 shows realistic indoor scenes from the Hypersim [143] dataset, also generated with CAD tools.
Figure 12. Examples of photorealistic synthetic indoor scenes from the Hypersim [143] dataset.
tion into pixel form; fragment shading, for processing color and texture information. Many advanced integrated development environments (IDEs) have been developed to facilitate 3D modeling. They provide advanced, intuitive graphical user interfaces (GUIs) and easy-to-use toolsets for rendering and editing 3D models. Common 3D modeling tools include simulation and 3D animation software such as Cinema 4D, Blender, Maya and 3DMax. These tools provide a means to obtain task-specific data in situations where the available data does not meet the requirements of the target task. For instance, Hattori et al. [145] employ 3DMax to synthesize data for human detection tasks in video surveillance applications where task-specific data may not be readily available. The approach allows the generated data to be customized according to the specific requirements of a scene (e.g., scene geometry and object behavior) and surveillance system (i.e., camera parameters). The tools have also provided a means to create large-scale datasets for generic applications. Examples of large-scale synthetic datasets obtained from 3D CAD models include the ShapeNet [146], ModelNet [147] and SOMASet [148] datasets. Some of the most important datasets created using 3D modeling tools are described in Section 8.
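As an example of how such tools can be scripted, the following sketch randomizes object pose and lighting and renders a batch of images from inside Blender's Python environment (where the bpy module is available). The object and light names, value ranges and output paths are placeholders for whatever the actual .blend scene contains, not settings from the cited works.

```python
# Scripted rendering with randomized pose and illumination inside Blender (run via bpy).
import bpy
import math
import random

scene = bpy.context.scene
target = bpy.data.objects["Target"]        # hypothetical object to be rendered
light = bpy.data.lights["Light"]           # hypothetical light data-block

for i in range(20):
    # Randomize pose and illumination to create variability in the rendered views.
    target.rotation_euler = (0.0, 0.0, math.radians(random.uniform(0, 360)))
    target.location = (random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5), 0.0)
    light.energy = random.uniform(200, 1500)

    scene.render.filepath = f"/tmp/render_{i:03d}.png"
    bpy.ops.render.render(write_still=True)
```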
5.1.2 Synthetic data from 3D physics (game) engines
While CAD tools are primarily used for creating 3D assets, game engines provide tools to manipulate the generated 3D objects and scenes in nuanced ways within virtual environments. They typically come with built-in rendering engines like Corona Renderer, V-Ray and Mental Ray. Advanced game engines such as Unity3D, Unreal Engine and CryEngine can simulate real-world phenomena such as realistic weather conditions; fluid and particle behavior; effects including diffuse lighting, shadows and reflections; and object appearance variations resulting from the prevailing phenomena. By randomizing the parameters associated with these phenomena, sufficient data diversity can be achieved. Besides visual perception, simulated environments based on 3D game engines can serve a broad range of applications. They are particularly suitable for training models in domains like planning, autonomous navigation, simultaneous localization and mapping (SLAM), and control tasks. Figures 13 and 15 show sample scenes from Carla [1] and AirSim [149], [150], respectively. Both tools are built on Unreal Engine. Figure 14 shows the different sensing modalities that can be obtained from Carla.

Because of the advanced manipulation capabilities of modern game engines, recent synthetic data generation approaches [151] favor the use of 3D game engines, which are capable of generating complete virtual worlds for not only training neural network models, but also enabling interactive training of elements of these worlds using deep reinforcement learning frameworks. For instance, ML-Agents, introduced in the Unity 3D game engine, provides a framework for training intelligent agents in both 2D and 3D worlds using a variety of machine learning techniques, including imitation learning, evolutionary algorithms and reinforcement learning.

Figure 13. Varying appearance of a sample synthetic scene from the CARLA simulator [1] in different weather conditions.
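Weather randomization of the kind shown in Figure 13 can be scripted through CARLA's Python API. The sketch below assumes a CARLA server running on localhost:2000 and uses illustrative parameter ranges; sensor capture is left as a comment.

```python
# Randomize weather conditions in a running CARLA simulation.
import random
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(5.0)
world = client.get_world()

for _ in range(10):
    weather = carla.WeatherParameters(
        cloudiness=random.uniform(0, 100),
        precipitation=random.uniform(0, 100),
        precipitation_deposits=random.uniform(0, 100),   # puddles on the road
        fog_density=random.uniform(0, 50),
        sun_altitude_angle=random.uniform(-10, 90),      # low values give dusk/night scenes
    )
    world.set_weather(weather)
    # ...capture camera/lidar data here for each sampled condition...
```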
Varol et al. [3] synthesized realistic datasets using Unreal Engine. Their synthetic dataset, Synthetic Humans for Real Tasks (SURREAL), has been provided as an open-source dataset for training deep learning models on different computer vision tasks. The authors in [3] showed that for depth estimation and semantic segmentation tasks, deep learning models trained using synthetic data generated by 3D game engines can generalize well to real datasets. Jaipuria et al. [152] used Unreal Engine to enhance the appearance of artificial data for lane detection and monocular depth estimation in autonomous vehicle navigation scenarios. As well as generating scenes with photorealistic, real-world objects, they also simulated diverse variability in the generated data: viewpoints, cloudiness, shadow effects, ground marker defects and other irregularities. This diversity has been shown to improve performance under a wider range of real-world conditions. Bongini et al. [141] rendered synthetic thermal objects using Unity3D's thermal shader and superimposed them in a scene captured using real thermal image sensors.
Figure 14. CARLA provides three sensing modalities: traditional vision (a), depth (b), and semantic segmentation (c).
They additionally employ a GAN model to refine the visual appearance of the rendered images so that they look more like natural thermal images.

Additional plugins have been developed to facilitate the generation of image data and corresponding labels from game engines [153] or from virtual environments [153], [154] developed on the basis of 3D engines. For instance, Borkman et al. [153] introduced a Unity engine extension known as Unity Perception that can be used to generate artificial data and the corresponding annotations for different computer vision tasks, including pose estimation, semantic segmentation, object detection and image classification. The extension has been designed to synthesize data for both 2D and 3D tasks. Hart et al. [155] implemented a custom OpenCV tool in the Robot Operating System (ROS) environment to extract frames from simulated scenes in the Gazebo platform and generate their corresponding labels. Similarly, Jang et al. [154] introduced CarFree, an open-source tool to automate the process of generating synthetic data from Carla. The utility is able to generate both 2D and 3D bounding boxes for object detection tasks. It is also capable of producing pixel-level annotations suitable for scene segmentation applications. Carla [1] provides a Python-based API for researchers and developers to interact with and control scene elements. Mueller and Jutzi [156] utilized the Gazebo simulator [157] to synthesize training images for a pose regression task. Kerim et al. [158] introduced the Silver framework, a Unity game engine extension that provides a highly flexible approach to generating complex virtual environments. It utilizes the built-in High Definition Render Pipeline (HDRP) to enable control of camera parameters, randomization of scene elements, as well as control of weather and time effects.

5.1.3 Obtaining data directly from video games
Some recent works (for instance, [159], [160], [161]) have focused on directly extracting synthetic video frames from scenes of commercial video games as image data for use in training computer vision models. To achieve this, appropriate algorithms are used to extract and label random frames of video sequences by sampling RGB images at a given frequency during game play. Since modern video games are already photorealistic, the qualitative characteristics of image data obtained this way are adequate for many computer vision tasks. Shafaei et al. [159] showed that, for semantic segmentation tasks, models trained on synthetic image data obtained directly through game play can achieve generalization accuracies comparable to those of models trained on real images. Further refinement by means of domain adaptation techniques, to bridge the inherent semantic gap between real and synthetic images, results in better performance than models trained on real image datasets.
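The frame-sampling step described above can be as simple as the following OpenCV sketch, which keeps roughly one frame per second from a recorded game-play video. The file name and sampling rate are assumptions, and labels would still need to be produced separately.

```python
# Sample frames from a recorded game-play video at a fixed frequency.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("gameplay.mp4")          # hypothetical recorded game-play video
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
step = max(int(round(fps)), 1)                  # keep roughly one frame per second

index, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        cv2.imwrite(f"frames/frame_{saved:05d}.png", frame)
        saved += 1
    index += 1
cap.release()
```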
Since it is generally more challenging to obtain relevant data for video object detection and tracking tasks than for other computer vision applications, generating data from video games has emerged as a promising workaround to alleviate the challenge (e.g., in [161], [162], [163], [164]). The main advantage of this approach is the possibility of utilizing off-the-shelf video games without strict requirements for the resolution of the captured images. Its main disadvantage is that video games contain general environments that are not tailored to specific computer vision applications. Also, the visual characteristics of images from game scenes may not be optimized for computer vision tasks. Moreover, there is generally a lack of flexibility in the data generation process, as the user can exercise very little control over the scene appearance and cannot change scene behavior as needed; since scenes in the video sequences are fixed, users are not able to introduce factors of variability (e.g., arbitrary backgrounds, objects and appearance effects) into the scene. In contrast, approaches such as [158], [163], [165], [166] that synthesize scenes from scratch using 3D physics engines bypass these limitations, but require an enormously long time and are labor-intensive.

5.1.4 Obtaining synthetic 3D data through scanning of real objects
Another technique [138] to alleviate the laborious work required in synthesizing dense 3D data from scratch using graphics modeling approaches is to leverage special tools such as Microsoft Kinect to capture relevant details of the target objects. This is accomplished by scanning different views of the relevant objects at different resolutions and constructing mesh models from these scans. With this approach, the desired low-level geometric representation can be obtained without explicitly modeling the target objects. In the simplest case, the 3D representation can be obtained with depth cameras that capture multiple views of the object. Suitable algorithms such as singular value decomposition (SVD) [168], random sample consensus (RANSAC) [169] and particle filtering [168] are then used to combine these multiple images into a composite 3D model. The approaches proposed in [169], [170], [171] utilize RGBD cameras as 3D scanners for extracting relevant appearance information from objects. Figure 16 shows sample frames from the OpenRooms dataset [167], a dataset created from 3D-scanned indoor scenes (ScanNet [172]).
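The SVD-based registration step mentioned above can be illustrated with the classic Kabsch/Procrustes solution: given corresponding 3D points from two scans, it recovers the rigid rotation and translation that best align them. The sketch assumes correspondences are already available (in practice they come from feature matching, RANSAC or ICP) and verifies itself on a synthetic example.

```python
# SVD-based rigid alignment of two corresponding point sets (Kabsch algorithm).
import numpy as np

def rigid_align(src, dst):
    """Return R (3x3) and t (3,) minimizing the error of R @ p + t -> q over point pairs."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))          # guard against reflections
    R = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

# Synthetic check: a known rigid motion applied to a point cloud is recovered.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
moved = points @ R_true.T + np.array([0.5, -0.2, 0.3])
R_est, t_est = rigid_align(points, moved)
print(np.allclose(R_est, R_true), np.allclose(t_est, [0.5, -0.2, 0.3]))
```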
Figure 15. Sample scenes from Microsoft AirSim [149], a synthetic virtual world for training unmanned aerial vehicles (UAVs) and autonomous
ground vehicles. The views show typical urban environments for autonomous driving.
A common practice is to construct basic articulated 3D models from the 3D scans, which are then further manipulated within 3D modeling environments to create variability. The approach allows only the relevant objects to be incorporated into already-modeled 3D scenes. The use of real 3D scans of objects also helps to achieve physically plausible representations of the relevant objects. Furthermore, visual effects such as lighting, reflections and shadows cannot easily be manipulated by conventional 2D image transformation methods. Therefore, these techniques are vital in situations where it is necessary to train deep learning models to be invariant with respect to these visual phenomena. For instance, Chogovadze et al. [173] specifically employ Blender-based light probes to generate different illumination patterns to train deep learning models that are robust to illumination variations. Vyas et al. [169] employed an RGBD sensor to obtain 3D point clouds and then used a RANSAC-based 3D registration algorithm to construct the geometric representation from the point cloud data. The authors obtained an accuracy of 91.2% on pose estimation tasks when training with the synthetic dataset, albeit with some domain adaptation applied. It must, however, be noted that this method can only be used in situations where access to the target domain data is possible.

5.1.5 Combining synthetic with real data
Because of the complex interaction of many physical variables which are difficult to capture using computer graphics methods, some researchers suggest using synthetic data simultaneously with real data. A few works suggest using synthetic data only as a means of defining useful visual attributes to guide the augmentation process. In [174], for instance, Sevastopoulos et al. propose an approach where synthetic data from a Unity-based simulated environment is used in the first stage of the data acquisition process to identify useful visual attributes that can be exploited to maximize performance in a given task before collecting real data. The idea is to leverage synthetic data to provide an initial direction for further exploration so as to lower the cost of excessive trial-and-error experimentation on real data. In Section 6, we present quantitative results on the effectiveness of data augmentation approaches that combine real and synthetic data.
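A minimal sketch of training on real and synthetic data simultaneously is shown below: the two datasets are concatenated and a weighted sampler keeps the real-to-synthetic ratio fixed per batch. The datasets here are random stand-ins and the 50/50 ratio is an arbitrary design choice, not a recommendation from the cited works.

```python
# Mix real and synthetic datasets while controlling the sampling ratio.
import torch
from torch.utils.data import TensorDataset, ConcatDataset, WeightedRandomSampler, DataLoader

real_ds = TensorDataset(torch.randn(500, 3, 32, 32), torch.randint(0, 10, (500,)))
synthetic_ds = TensorDataset(torch.randn(2500, 3, 32, 32), torch.randint(0, 10, (2500,)))
mixed = ConcatDataset([real_ds, synthetic_ds])

# Sample so that roughly half of each batch is real despite the 1:5 size imbalance.
weights = [0.5 / len(real_ds)] * len(real_ds) + [0.5 / len(synthetic_ds)] * len(synthetic_ds)
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=64, sampler=sampler)
```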
5.1.6 Integrating real and virtual worlds
As discussed earlier, the basic idea of the 3D object scanning approach is to obtain information about target objects as stand-alone data by extracting 3D information from real objects and then utilizing graphics processing pipelines to perform 3D transformations on the skeletal models obtained by scanning. However, in some cases (e.g., [175], [176]), 3D geometry models of objects obtained by scanning real-world objects are incorporated into more complex, task-relevant 3D scenes created using graphics tools. Integrating scanned objects into such virtual worlds provides an effective means to manipulate and randomize different factors so as to simulate complex real-world behavior useful for model generalization and robustness. Synthetic 3D models obtained in this manner can be used to train models on complex 3D visual recognition tasks such as object manipulation and grasping. They can also be rendered as 2D pixel images to augment image data. In some other cases (e.g., [177]), synthetic objects are immersed into real worlds in augmented-reality fashion. These real worlds are obtained from camera images or videos. In [177], this approach was shown to outperform techniques based on purely real or synthetic data. The most important data augmentation approaches based on computer graphics modeling are summarized in Table 2.

5.2 Challenges of computer graphics-based synthesis and workarounds
Despite the advanced rendering capabilities of modern 3D modeling tools, the use of graphics modeling tools to synthesize training data has a number of limitations:
• The synthesis of realistic data with a high level of detail and natural visual properties, including realistic lighting, color and textures, is a complex and time-consuming process. This may limit the scope of application of 3D graphics modeling techniques in data augmentation to only simple or moderately complex settings.
Figure 16. Scanned indoor scenes from ScanNet are used to generate new synthetic 3D data based on the approach in [167]. The reconstructed
synthetic scenes (a) can then be rendered with different illumination levels and patterns (b), different materials (c), or different views (d).
Table 2
Summary of data augmentation approaches based on computer graphics modeling techniques
• Currently, in many situations, there is no objective means of assessing the quality of artificially generated data. While some methods have been devised to evaluate synthetic data, these are only applicable in situations where reference images are available for comparison. However, in many of the settings that necessitate the use of graphics models, samples of the target images are generally not available at the model development and testing stages.
• It is often difficult to know beforehand the desirable factors and visual features that are good for performance. Moreover, synthetic data that appear photorealistic to the human observer may not actually be suitable for a deep learning model. For instance, while high-fidelity synthetic samples have failed to provide satisfactory performance in some situations, some researchers (e.g., in [160], [178], [179]) have obtained good performance using low-fidelity synthetic images.
• The generation process is usually accomplished by a careful modeling process, and not from any natural processes or sensor data obtained from real-world variables. Even with methods that generate synthetic data by scanning real objects, the 3D alignment techniques used for registration also introduce additional imperfections into the scene representation. All these problems exacerbate the domain gap between real and synthetic data.

As a result of the above limitations, it is often challenging to produce semantically meaningful synthetic environments comparable to natural settings, especially when dealing with complex scenes. Since simulated data obtained by 3D modeling tools are often not perfect, the process of augmentation does not only consist in modeling visual scenes, but also in correcting various imperfections and refining the appearance of the artificially generated data to mitigate the domain gap between synthetic and real data. Indeed, a large number of so-called sim-to-real (sim2real) techniques [180], [181], [182], [183] have been proposed for fine-tuning and transferring synthetically generated graphics-based data to real-world domains. The use of generative modeling techniques (e.g., [105], [184], [185]) as a means to endow simple 3D models with photorealistic appearance and desirable visual characteristics has recently gained attention. These approaches provide a more practical and cheaper means of generating extra training data by leveraging unlabeled image data and GANs to introduce hard-to-model, real-world visual effects into simple computer graphics images. In [105], Sixt et al. proposed to learn augmentation parameters to enhance the photorealism of synthetic 3D image data using a large set of unlabeled, real-world image samples. A dual-agent generative adversarial network (DA-GAN) has been proposed to enhance the photorealism of synthetic, rudimentary facial data generated by 3D models [106]. Atapour and Breckon in [186] employ a CycleGAN model to refine the appearance of synthetic data, which resulted in improved performance. Instead of employing a GAN to perform visual style transfer in order to refine synthetically
Figure 17. Simplified pipeline of the neural rendering process [197]. Reverse rendering allows intermediate 3D scenes and associated scene
parameters to be generated from 2D images.
Point cloud – advantages: low memory requirements; disadvantages: low accuracy of scene topology information.
Voxel – advantages: more accurate with less processing, simplicity; disadvantages: high memory footprint.
Mesh – advantages: provides more grounding (i.e., physics-aware scene representation); disadvantages: high computational cost, difficulty in describing complex shapes.
Multimodal – advantages: high resolution, more robust to visual artifacts; disadvantages: more complex, high computational demand.
Figure 18. A summary of the main advantages and disadvantages of geometric prior representation approaches used in differentiable neural rendering
pipelines. Here, the designation NN stands for neural network.
various visual attributes of scenes. In general, most neural rendering methods use mesh representations to describe geometric scene properties. In addition to these methods, special techniques have also been developed to handle other image modalities using voxel [203], [204], mesh [198] and point cloud [205] data. Baek et al. [206] used the 3D mesh renderer developed in [198] to synthesize 3D hand shapes and poses from RGB images.

While approaches based on explicitly modeled geometric priors can leverage readily available 3D CAD models, they are typically limited by imperfections in the underlying models. Additional processing is usually needed to meet the requirements of specific tasks. Thus, rendering methods based on explicit representation pipelines can be adversely impacted by the scene capture process and modeling time. Besides, these approaches often require the use of proprietary software tools at some stages of the development process, making accurate scene representation difficult and costly to obtain, and thereby limiting their scope of application. To achieve competitive performance, highly detailed geometric primitives and accurate scene parameters are required, along with sophisticated rendering methods. The process of modeling these elements is therefore extremely tedious. Because of these challenges, some works propose to use deeply learned elements to compose scene primitives rather than explicitly modeling all elements from scratch.

6.2.2 Implicit (neural) representation
Unlike real 3D primitives, whose construction is manual and laborious, their pure neural counterparts are generated automatically and can be constructed with less human effort, albeit with long training schedules. However, their ability to model 3D scene structure is directly dependent on the representational power and capacity of the underlying neural network.

To improve the effectiveness of approaches designed using scene prior representations, some works [204], [207], [208] utilize deeply learned priors as auxiliary elements to refine the accuracy of explicitly modeled priors. Thies et al. [204], for instance, propose to alleviate the burden of rigorous modeling of textures by incorporating so-called neural textures, a set of learned 2D convolutional feature maps obtained from intermediate layers of deep neural networks in the process of learning the scene capture process. The learned textures are then superimposed on the geometric priors (i.e., the 3D mesh) used for rendering. The approach allows coarse and imperfect 3D models without detailed texture information to leverage artificially generated textures to produce high-quality images. In [208], a point-based neural augmentation method is proposed to enrich point cloud representations by leveraging learnable neural representations. Similarly, Liu et al. in [204] propose a hybrid geometry and appearance representation approach based on so-called Neural Sparse Voxel Fields (NSVF). The
method combines explicit voxel representations with learned voxel-bounded implicit fields to encode scene geometry and appearance. While the above studies [204], [207], [208] employ learnable elements to refine scene primitives explicitly constructed from scratch, some recent works [200], [201], [202], [209], [210] have proposed to learn scene models, including geometry and appearance, entirely using learnable elements. These representations are commonly learned using 2D image supervision (as depicted in the reverse rendering process of Figure 17). Most recent implicit neural representations commonly use Neural Radiance Fields, or NeRFs [201], [209], [211], an approach that employs neural networks to learn a 3D function from a limited set of 2D images in order to synthesize high-quality images of unobserved viewpoints and different scene conditions based on ray tracing techniques.
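At the core of such NeRF-style representations is a differentiable volume-rendering step that composites densities and colors predicted along each camera ray into a pixel value. The following sketch shows only this step, with a tiny untrained network standing in for a learned radiance field; positional encoding, view dependence and hierarchical sampling are omitted.

```python
# Volume rendering along camera rays for a NeRF-style implicit scene representation.
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))   # (x,y,z) -> (r,g,b,sigma)

def render_rays(origins, directions, near=0.0, far=1.0, n_samples=64):
    t = torch.linspace(near, far, n_samples)                       # sample depths along each ray
    points = origins[:, None, :] + directions[:, None, :] * t[None, :, None]   # (R, S, 3)
    out = field(points)
    rgb, sigma = torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])

    delta = torch.full((n_samples,), (far - near) / n_samples)     # spacing between samples
    alpha = 1.0 - torch.exp(-sigma * delta)                        # opacity of each segment
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                        # contribution of each sample
    return (weights[..., None] * rgb).sum(dim=1)                   # composited pixel colors (R, 3)

rays_o = torch.zeros(8, 3)                                         # stand-in camera ray origins
rays_d = torch.nn.functional.normalize(torch.randn(8, 3), dim=1)   # stand-in ray directions
pixels = render_rays(rays_o, rays_d)     # differentiable w.r.t. the field's parameters
```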
Currently, the photorealism of scenes generated using approaches based solely on learned representations has not matched that of scenes generated using explicit and hybrid representations. Duggal et al. [212] suggest that the problem is a result of the lack of robust geometric priors in neural representations. They therefore propose an encoder sub-model to initialize shape priors in latent space. In order to guarantee that the synthesized shapes retain the high-level characteristics of their real-world counterparts, high-dimensional shape priors are realized with the aid of a discriminator sub-model. This serves as a regularizer of the shape optimization process. More recently, implicit neural rendering techniques have been used within generative modeling frameworks based on VAEs [213], [214] and GANs [48], [49], [215] to enable 3D-aware visual style transfer. These have improved the results of data synthesis achievable with either neural rendering or generative modeling techniques alone. For neurally rendered scenes, the use of generative modeling helps to easily adapt visual contexts as needed. On the other hand, while generative modeling methods alone have been used to successfully model 3D scenes, they often lack a true 3D interpretation sufficient for complex 3D visual reasoning tasks [216].

6.3 Approaches and use-cases of neural rendering in data augmentation
For the purpose of data augmentation, there are three broad scenarios in which neural rendering is commonly applied:
• 2D to 2D synthesis – situations where one synthesizes additional pixel (2D) data using 2D supervision
• Scene reconstruction (2D to 3D) – 3D visual understanding tasks where a 3D dataset is inaccessible and it is required to synthesize 3D objects or scenes using available 2D image data
• 3D to 2D synthesis – tasks where 2D images are generated from 3D assets

6.3.1 2D to 2D synthesis
A straightforward application of the data synthesis architecture presented in Figure 17 is to generate new 2D image data with desired visual attributes from a given 2D input image. In this case, the task of the neural rendering process is to apply appropriate transformations in the intermediate 3D representation before the final rendering stage in order to generate the desired 2D output. The basic idea is to disentangle pertinent factors of image variation such as pose, texture, color and shape. These can then be manipulated in an intermediate 3D space before being mapped into a 2D space during the rendering process. Such an approach facilitates a more semantically meaningful manipulation of various scene elements and visual attributes. In data augmentation, this may be necessary when synthesizing different poses of objects in a scene [222], when generating novel views from a single image sample (e.g., in [223]), or when introducing lighting effects into the global scene appearance in order to encode invariance to these variations. These transformations are usually difficult to realize accurately with 2D operations that lack the 3D processing stage.

6.3.2 2D to 3D synthesis (scene reconstruction)
Another important application of differentiable rendering related to data augmentation is scene reconstruction, i.e., the conversion from a 2D image to a 3D scene. Scene reconstruction techniques typically rely on inverse graphics principles. In this case, a 2D image is used to recover the underlying 3D scene. Important scene parameters such as scene geometry, lighting and camera parameters, as well as object properties such as position, shape, texture and materials, are also estimated in the reconstruction process. This is useful in applications that require 3D visual understanding capabilities. Examples of such applications include high-level cognitive machine vision tasks such as semantic scene understanding, dexterous control and autonomous navigation in unstructured environments. Tancik et al. [202] recently proposed Block-NeRF, a technique to enable the synthesis of large-scale environments (see Figure 21). They addressed current limitations of NeRF-based models by dividing the scene representation into distinct blocks that can be rendered independently in parallel and combined to form a holistic, contiguous virtual environment for training machine learning models on navigation tasks. These appearance modifications can also be applied in a blockwise manner, where smaller regions corresponding to individual NeRFs are updated separately.

Scene reconstruction methods have also been widely used to improve the quality of medical image capture in applications like MRI (e.g., [224], [225]) and CT (e.g., [226], [227], [228], [229], [230]). For instance, the works in [226], [227], [228] propose using implicit representations to increase the resolution of otherwise sparse CT images. Gupta et al. [229] employ NIR in an image reconstruction model, known as NeuralCT, to compensate for motion artifacts in CT images. Their approach does not require an explicit motion model to handle discrepancies resulting from patient motion during the capture process.

6.3.3 3D to 2D synthesis
Pixel image synthesis from a 3D scene is the reverse of scene reconstruction. It is aimed at learning the 3D scene and mapping it to a 2D space with the help of neural rendering techniques. The idea is to utilize 3D models designed with traditional graphics tools or game engines to generate 2D pixel images.
Figure 19. A: General architecture of GANcraft [217], a NeRF framework for generating realistic 3D scenes from semantically-labeled Minecraft block worlds without ground truth images. It employs SPADE, a conditional GAN framework proposed by Park et al. in [218], to synthesize pseudo-ground-truth images for training the proposed NeRF model. B: Sample Minecraft block inputs (insets) and corresponding photorealistic scenes generated by the model.
a b c d e
elements such as 3D geometry (e.g., vertices of volumetric
objects), color, lighting, materials, camera properties and
motion to faithfully synthesize unobserved pixels images.
Variability in the synthesized images is achieved by ma-
nipulating individual factors that contribute to the scene
structure (geometry) and appearance (photometry). Thus,
the process provides a way to control specific objects in the
scene (e.g., modify object scale, pose or appearance) as well
as general scene properties. This can help to incorporate
more physically-plausible semantic features into synthesize
than with traditional image synthesis approaches that lack
3D grounding.
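The factor-by-factor control described above can be pictured with the following sketch, in which scene parameters are randomly sampled and handed to a rendering backend. The parameter names and the renderer callable are hypothetical placeholders rather than the interface of any specific engine or neural renderer.

import random

def sample_scene_parameters():
    """Randomly perturb factors that control scene structure and appearance.
    The parameter names are illustrative only; real pipelines expose
    engine-specific controls."""
    return {
        "camera_azimuth_deg": random.uniform(0, 360),
        "camera_elevation_deg": random.uniform(5, 45),
        "light_intensity": random.uniform(0.3, 1.5),
        "object_scale": random.uniform(0.8, 1.2),
        "object_pose_deg": random.uniform(-15, 15),
        "material_roughness": random.uniform(0.1, 0.9),
    }

def generate_augmented_views(scene, renderer, n_views=100):
    """Produce n_views 2D images (whose labels are inherited from the 3D
    scene) by re-rendering the same 3D assets under different sampled
    factors. `scene` and `renderer` are placeholders for a 3D asset and a
    rendering backend (game engine, ray tracer or neural renderer)."""
    images = []
    for _ in range(n_views):
        params = sample_scene_parameters()
        images.append(renderer(scene, **params))  # hypothetical call
    return images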
Figure 21. State-of-the-art neural rendering models such as Block-NeRF [202] allow large 3D environments to be generated from sparse 2D views. Multiple neural radiance field models are combined to encode a large scene; shown here is an approximately one square kilometer environment rendered by Block-NeRF (a). The authors also provide a way to alter the appearance of the rendered scene to reflect different environmental conditions such as weather and illumination changes (b).
Generative modeling techniques [47], [215], [217] have also been suggested as a viable means of providing the necessary supervision in the absence of ground-truth images. Hao et al. [217], for instance, propose a neural rendering model that generates realistic scenes from simple 3D LEGO-like block worlds such as Minecraft. Since there are no ground-truth images for these types of inputs that could be used to supervise training, the authors use images produced by a pretrained GAN as pseudo-ground truth instead of real images. Their approach, like [202], also allows flexible user control over both scene appearance and semantic content using appearance codes. This capability is useful for applications like long-term visual tracking [235], where scene dynamics severely affect performance and the ability to control scene appearance is extremely important for simulating all possible viewing conditions. The basic structure of GANcraft and sample outputs are shown in Figure 19. In Figure 20 we present a visual comparison of scenes generated by different GANs (MUNIT [219], SPADE [218] and wc-vid2vid [220]), a NeRF model (NSVF-W [221]) and a combination of GAN and NeRF (GANcraft [217]).
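A heavily simplified sketch of this pseudo-ground-truth training idea follows. The module names, shapes and the single L1 reconstruction term are placeholders for illustration; they do not reproduce GANcraft's actual objective, which combines several adversarial and perceptual terms.

import torch
import torch.nn.functional as F

def training_step(neural_renderer, pretrained_gan, semantic_map, camera_pose, opt):
    """One optimisation step in which a pretrained image-synthesis GAN
    supplies the target image for a view that has no real photograph.
    All modules and tensor shapes here are illustrative placeholders."""
    with torch.no_grad():
        # Pseudo-ground truth: an image synthesised from the semantic layout.
        pseudo_gt = pretrained_gan(semantic_map)

    rendered = neural_renderer(semantic_map, camera_pose)  # differentiable render
    loss = F.l1_loss(rendered, pseudo_gt)                  # reconstruction term only
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()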
An important aspect of data synthesis based on neural rendering approaches is the ability to work with different data representations – for example, voxels, raw point clouds, pixels, or implicitly defined data forms based on learned functional descriptions of the physical properties of objects. However, this flexibility also presents difficulties: representations of the same data can differ greatly, and the different forms of representation are not necessarily compatible with one another. Recent techniques [236], [237], [238], [239] make it possible to fuse or convert between these diverse data representations, but they often lead to the loss of vital information about the scene elements. As a result of these difficulties, training neural rendering models to generate data in these modalities is still a challenging issue. Moreover, relatively few datasets exist for training differential neural rendering models on many of the possible representation modalities. Given the rapid growth of interest in neural rendering methods, however, this gap is likely to narrow in the short term.

Implicit neural representations can also support non-standard imaging modalities such as synthetic aperture sonar (SAS) images [240], [241], computed tomography (CT) [230], [242], [243] and intensity diffraction tomography (IDT) [244]. In addition, these approaches are capable of modeling non-visual information like audio signals [245]. Approaches have also been proposed to leverage multiple visual modalities together with other physical signals to provide complementary information about the physical properties of objects [245], [246], [247]. This can support multi-sensory intelligence in applications such as robotics; virtual, augmented, extended and mixed reality; and human-computer interaction. However, adding these elements means that data requirements grow even faster, making it challenging to accomplish realistic results based only on learned models. Many works (e.g., [248], [249], [250]) propose integrating task-specific priors to improve the ability of neural rendering methods that model complex scenes and interactions to generalize better to unseen contexts. Incorporating strong priors often requires special mathematical formulations within neural network models for encoding spatial features and the natural behavior of objects in the 3D world, leading to increased computational complexity and cost. For this reason, state-of-the-art neural rendering frameworks such as [248], [250], [251] are inherently complex, limiting their ability to represent large-scale scenes. Block-NeRF [202] proposes a modular framework that allows separate NeRF modules to represent individual regions of a scene (shown in Figure 21). This allows large-scale 3D scenes to be represented and efficiently manipulated using sparse 2D images.
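The modular, blockwise representation can be pictured with a rough compositing sketch: each block model renders the current view independently, and the results are blended with camera-to-block distance weights. Block-NeRF's actual visibility- and appearance-aware compositing is considerably more sophisticated; this is only an illustration with dummy data.

import numpy as np

def composite_block_renders(renders, block_centers, camera_position, eps=1e-6):
    """Blend images rendered by separate per-block models into one frame.
    renders:         list of HxWx3 arrays, one per block model
    block_centers:   list of 3D block origins
    camera_position: 3D camera location
    Weights fall off with the camera's distance to each block centre."""
    dists = np.array([np.linalg.norm(camera_position - c) for c in block_centers])
    weights = 1.0 / (dists + eps)
    weights /= weights.sum()
    out = np.zeros_like(renders[0], dtype=np.float64)
    for w, img in zip(weights, renders):
        out += w * img
    return out

# Example: three 64x64 block renders blended for one camera position.
renders = [np.random.rand(64, 64, 3) for _ in range(3)]
centers = [np.array([0., 0., 0.]), np.array([50., 0., 0.]), np.array([0., 50., 0.])]
frame = composite_block_renders(renders, centers, np.array([10., 5., 1.]))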
7 NEURAL STYLE TRANSFER
Neural style transfer (NST) is a method for synthesizing novel images that is similar in spirit to GAN-based style transfer. However, in contrast with generative modeling approaches, neural style transfer exploits conventional feed-forward convolutional neural networks for the synthesis.

7.1 Principles of neural style transfer
Neural style transfer involves first learning representations for the content and structure of the original images and for the style of reference samples. These representations are then combined to generate new images in the style of the reference images while maintaining the content and structure of the original image. The method leverages the hierarchical representation mechanism of deep convolutional neural networks (DCNNs) to flexibly generate novel images with various appearance artifacts and styles. An illustration of the basic principle is shown in Figure 22. Since shallower layers in CNNs encode low-level visual features such as object texture, lines and edges [252], while deeper layers learn high-level semantic attributes, different augmentation schemes can be realized by manipulating the two semantic levels separately and combining them in different ways. In NST, typically, a DCNN model without the fully connected layers is used to extract image features at different levels. Low-level features encoded by shallower layers are extracted and combined with high-level features extracted from a second image. Because the second image contributes the high-level features, its semantic content is transferred to the artificially generated image, while the first image contributes the visual style (see Figures 22 and 23).

The original technique for neural style transfer was first proposed by Gatys et al. in [254] as a way to artificially create different artistic styles in images. Specifically, they altered landscape images taken with a digital camera to look like artworks while still maintaining their semantic meaning. To accomplish this, Gatys et al. [254] remove the fully connected layers of a VGG model [255] and use the convolutional layers to extract visual features at different semantic levels. They then introduce an optimization function that consists of two different loss components – the style and content loss functions – which are controlled separately by different sets of weight parameters. Based on these principles, the authors showed that image content and style information can be transferred independently to different visual contexts by separating the filter responses of shallower and deeper layers of the CNN model. To create a new image with the content of a source image Ic and the style of a target image Is, they propose to combine the later processing stages of Ic with the earlier stages of Is (Figure 22). Neural style transfer has also been extended to video [256], [257], [258] and 3D computer vision [259] applications. Ruder et al. [259] propose a video style transfer technique that applies a reference style from a single image to an entire set of frames in a video sequence. In [260], a neural style transfer approach is combined with a GAN model to generate diverse 3D images from 2D views. The generated image data is then used to improve performance on 3D recognition tasks.
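The content/style decomposition of [254] can be summarized in a short optimization sketch: content is matched on a deep feature map, style on Gram matrices of shallower feature maps, and the generated image itself is the optimization variable. The layer indices and loss weights below are illustrative choices rather than those of the original paper, and the snippet assumes torchvision 0.13 or newer for the pretrained weights API.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16

features = vgg16(weights="DEFAULT").features.eval()   # convolutional layers only
for p in features.parameters():
    p.requires_grad_(False)

STYLE_LAYERS, CONTENT_LAYER = {3, 8, 15, 22}, 15       # illustrative layer indices

def extract(x):
    """Collect shallow (style) and deep (content) feature maps in one pass."""
    style_feats, content_feat = [], None
    for i, layer in enumerate(features):
        x = layer(x)
        if i in STYLE_LAYERS:
            style_feats.append(x)
        if i == CONTENT_LAYER:
            content_feat = x
    return style_feats, content_feat

def gram(f):
    """Channel-correlation (Gram) matrix used as the style statistic."""
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def nst_loss(generated, content_img, style_img, alpha=1.0, beta=1e3):
    gs, gc = extract(generated)
    _, cc = extract(content_img)
    ss, _ = extract(style_img)
    content_loss = F.mse_loss(gc, cc)
    style_loss = sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(gs, ss))
    return alpha * content_loss + beta * style_loss

# The generated image itself is optimized, as in the original formulation.
content, style = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
generated = content.clone().requires_grad_(True)
opt = torch.optim.Adam([generated], lr=0.02)
loss = nst_loss(generated, content, style)
loss.backward()
opt.step()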
Figure 22. Neural style transfer relies on the hierarchical representation of features in convolutional neural networks to render the high-level semantic content of images in different visual styles. In the original work [253], Gatys et al. propose to extract style and content information from different processing stages of a convolutional neural network. The extracted style and content information are then manipulated separately and combined to form new images.

7.2 Neural style transfer as a data augmentation technique
Following the original work [254], many subsequent studies (e.g., [260], [261], [262], [263], [264], [265], [266], [267], [268]) have employed the neural style transfer technique as a form of data augmentation to synthesize novel images that extend the training data. In many of these works, including the original work that introduced the approach, style transfer is applied to synthesize non-photorealistic images. It is reasonable to assume that adding images that are stylized in a non-photorealistic way to the training dataset can still help to reduce overfitting and improve generalization performance. Indeed, a number of works have shown this to be the case for many tasks. Jackson et al. [269] demonstrated that augmenting training data with non-photorealistic images (e.g., images synthesized in artistic styles) can significantly improve performance in several different computer vision tasks. Using neural style transfer, Jackson et al. [269] achieved an improvement in the range of 11.8 to 41.4% over a baseline model (InceptionV3 without augmentation) and an improvement of between 5.1 and 16.2% over color jitter alone on cross-domain image classification tasks. In addition, their method yields a performance increase of at least 1.4% over a combination of seven different classical augmentation types. Instead of transferring a single style per image, the authors of [269] aim to create more random styles suitable for multi-domain classification tasks by randomizing image features (texture, color and contrast). They use a so-called style transfer network to obtain random style attributes by sampling from a multivariate distribution of low-level style embeddings.
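The style-randomization strategy attributed to [269] above can be sketched as sampling style codes from a multivariate normal prior and feeding them to a feed-forward stylization network. The embedding size and the tiny placeholder network below are assumptions made purely for illustration; the actual network in [269] is different.

import torch
import torch.nn as nn

EMBED_DIM = 100   # illustrative size of the low-level style embedding

# Prior over style embeddings, e.g. estimated from a bank of reference styles.
style_prior = torch.distributions.MultivariateNormal(
    torch.zeros(EMBED_DIM), torch.eye(EMBED_DIM))

class StyleTransferNet(nn.Module):
    """Placeholder for a feed-forward stylization network conditioned on a
    style embedding; a real implementation would modulate feature statistics
    in several layers rather than use a single convolution."""
    def __init__(self):
        super().__init__()
        self.body = nn.Conv2d(3 + EMBED_DIM, 3, kernel_size=3, padding=1)

    def forward(self, image, style_code):
        b, _, h, w = image.shape
        code = style_code.view(b, EMBED_DIM, 1, 1).expand(b, EMBED_DIM, h, w)
        return torch.sigmoid(self.body(torch.cat([image, code], dim=1)))

def randomly_stylize(batch, net):
    codes = style_prior.sample((batch.shape[0],))   # one random style per image
    return net(batch, codes)

stylized = randomly_stylize(torch.rand(8, 3, 64, 64), StyleTransferNet())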
Table 3
The main ways to perform data augmentation using the neural style transfer technique

Figure 26. Qualitative comparison of neural style transfer (NST) and GAN models (results courtesy of [284]). The different models aim to perform artistic stylization of the original images while maintaining all content. The styles were obtained from images extracted from cartoon videos. The NST model in column (b) has been trained on a style image that has similar content to the input image (shown inset). The remaining columns (c, d, e and f) show results of styling with an aggregate of 4,573 cartoon images. Note that the quality of the stylization is defined by how well the semantic content is preserved, including the clarity of edges. The specific models used are: NST by Gatys et al. [253] (columns b and c), CycleGAN [45] (column d), CycleGAN with identity loss (column e), and CartoonGAN [284] (column f).
…robust representations akin to style randomization in rendered images.

In addition to these approaches, which aim to improve image synthesis by introducing new architectural methods of encoding styles within CNN layers, a number of works [290], [291] focus primarily on developing new loss functions that allow more diverse and fine-grained features to be transferred than the earlier style and content loss functions [254]. Li et al. [285] introduced a special loss function, known as a diversity loss, to encourage diversity. Luan et al. [275] propose a loss function based on smoothness estimation.

7.3.2 Photorealistic data synthesis with NST
Recently, several researchers [275], [292] have introduced methods to enhance the photorealism of images synthesized using NST methods. Photorealistic styles are stylization schemes that render the resulting image as visually close as possible to real images captured by real image sensors. Photorealism is aimed at simulating high-quality data in terms of image details and textures, taking into consideration a range of real-world factors, which may include blurring effects resulting from camera motion, random noise, distortions and varying lighting conditions. Figure 24 compares the results of different artistic style transfer approaches. In Figure 25, the outputs of various models that aim to achieve photorealistic stylization are compared. We also compare the artistic style transfer ability of common GAN models with NST methods in Figure 26.

There is also a family of NST approaches that aim to transfer specific visual attributes such as image color [286], [293], [294], illumination settings [295], material properties [296] or texture [297], [298]. Rodriguez-Pardo and Garces [296] propose an NST-based data augmentation scheme based on transferring properties of images under varying conditions of illumination and geometric image distortion. Gatys et al. [299] propose to decompose style into different perceptual factors such as geometry, color and illumination, which can then be combined in different permutations to synthesize images with specific, desirable attributes. In order to avoid geometric distortions and preserve the statistical properties of the convolutional features extracted from style images, Yoo et al. [292] introduced a wavelet-corrected transfer mechanism that replaces standard pooling operations with wavelet pooling units.

Neural style transfer techniques have also been successfully employed to transfer specific semantic visual contexts to synthetic data. For instance, Li et al. [277] transferred images taken in clear weather to snowy conditions. In [253], images taken in bright daylight are converted to night images using NST stylization techniques. More advanced neural stylization approaches [278], [279] have been designed to learn the semantic appearance as well as the physically-plausible behavior of objects of interest. Kim et al. [278], for instance, proposed a physics-based neural style transfer method to simulate particle and fluid behavior in synthetic images. Their approach allows the realistic visual appearance and semantic content of different kinds of particulate matter to be learned and transferred from 2D to 3D settings.
Figure 27. StyleBank [280] is an example of a multi-style transfer technique based on an encoder-decoder architecture. It encodes different styles in groups of convolutional filters in intermediate CNN layers that can be selectively applied to specific image content.

7.3.3 Patch-level synthesis
Researchers have also proposed photorealistic synthesis methods [296] based on patch-wise stylization that allow specific image parts or objects within images to be selectively altered. Such methods allow more granular control of the output images. They may enable, for instance, the manipulation of specific objects in a complex scene. This family of style transfer techniques can be useful in applications such as semantic segmentation (e.g., [282]). Some works have suggested using patch-level styling to achieve photorealism while at the same time reducing the computational demands of conventional (whole-image) methods. Cygert and Czyżewski [267] argue that styling the whole input image can be detrimental to generalization performance. They propose to address this limitation by utilizing a stylization method that transforms only small patches of the input images. To achieve this, they perform a hyperparameter search to find the best patch size for stylization. In [283], Chen et al. proposed a patch-based stylization technique as a means of reducing complexity and improving memory efficiency in the stylization of high-resolution images.
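A minimal sketch of this patch-level strategy is shown below: only a few randomly placed patches are passed through a stylization function and pasted back into the image. The patch size, patch count and the stand-in stylization function are illustrative placeholders rather than the settings used in [267] or [283].

import random
import numpy as np

def stylize_patches(image, stylize_fn, patch_size=64, num_patches=4):
    """Apply a stylization function only to a few randomly chosen patches.
    image:      HxWx3 float array
    stylize_fn: any function mapping a patch to a stylized patch of the same
                shape (e.g. a pretrained NST network)."""
    out = image.copy()
    h, w = image.shape[:2]
    for _ in range(num_patches):
        y = random.randint(0, h - patch_size)
        x = random.randint(0, w - patch_size)
        patch = out[y:y + patch_size, x:x + patch_size]
        out[y:y + patch_size, x:x + patch_size] = stylize_fn(patch)
    return out

# Example with a trivial stand-in "style": a channel-wise color shift.
img = np.random.rand(256, 256, 3)
augmented = stylize_patches(img, lambda p: np.clip(p * [1.2, 0.9, 1.0], 0, 1))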
7.4 Limitations of NST methods and possible workarounds
The main advantage of neural style transfer is that the approach leverages the natural representation of visual features in CNN layers to synthesize new image data. Consequently, the amount of data and the time required for training models are relatively small. Also, model architectures for neural style transfer are generally simpler than those of other synthesis methods such as generative modeling and neural rendering. Moreover, since NST training is usually more sample-efficient than other synthesis methods, the memory requirement is greatly reduced.

One major weakness of datasets created by neural style transfer approaches is that the data is generated by feature-level manipulations and not by intuitive transformation of real data. As a result, it is extremely difficult to account for semantic information about the entities in a scene when applying stylization. This can also lead to noise and unwanted artifacts which may seriously affect the quality of the generated data. Another challenge with non-intuitive, feature-level perturbations is the difficulty of maintaining consistency between local and global features [275] when applying styles. These issues require additional (costly) processing of semantic information to ensure that realistic data is produced. For instance, recent works have proposed to optionally perform semantic segmentation on the target scene so as to enable context-specific style transfer (e.g., [275], [300]). This additional processing step may lead to increased model complexity and computational cost. In addition, it can potentially introduce errors, e.g., as a result of incorrect segmentation masks, which may harm the accuracy of the stylization.

Also, the fact that the generated dataset is not derived from intuitive, physically grounded manipulation of input data at the sample level may limit the degree to which desired visual attributes can be simulated. Unlike neural rendering and generative modeling techniques, which can naturally learn desired visual contexts such as color, contrast and illumination levels, neural style transfer methods require more complex formulations to handle these attributes successfully.

An even more serious limitation of neural style transfer methods is the difficulty of simulating spatial transformations. For this reason, the approach is mainly used for photometric data augmentation tasks. Furthermore, geometry-consistent photometric stylizations (e.g., brightness levels that depend on the viewpoint, shadows generated and placed at the right positions in a scene, and fine-grained style-content consistency under different poses) are difficult to achieve. This can lead to distorted style patterns. The difficulty lies in the inherent principle of the approach: different hierarchical levels of feature maps separately capture style and geometry information and are combined to generate new samples. This decoupling of style from geometry makes it challenging to apply learned styles in a view-dependent manner. Some works (e.g., [273], [301]) have proposed to incorporate explicit spatial deformation models within NST architectures to handle geometric transformations. In [302], Jing et al. utilize a graph neural network model to learn fine-grained style-content correspondences that minimize local style distortions under geometric transformations.

8 SYNTHETIC DATASETS
Presently, a wide range of large-scale synthetic datasets are publicly available for training and evaluating machine vision models. We summarize the details of some of the most important synthetic datasets in Table 4. These datasets support a wide range of visual recognition tasks, cover many of the synthesis methods explored in this work, and include the common representation methods.
Table 4
A summary of publicly-available synthetic datasets

Dataset | Generation method | Application domain | Supported tasks | Size
MPI-Sintel [161] | 3D animation (video) | Spatio-temporal scene understanding | Optical flow; depth estimation | 1,628
Synthia [164] | 3D game engine | Autonomous driving | Scene segmentation; ego-motion | 200,000
LCrowd [166] | 3D game engine | Crowd analysis | Crowd counting; person detection (in crowds) | 20M
Virtual Kitti [162] | 3D game engine | Urban scene understanding | Depth estimation; optical flow; object detection and tracking; scene and instance segmentation | 21,260
GTA-V [160] | Video game play | Autonomous driving | Semantic segmentation; object detection | 25,000
SceneNet-RGBD [308] | RGB-D objects rendered from CAD models | Indoor scene understanding | Pose estimation; depth estimation; segmentation | 5M
9 EFFECTIVENESS OF SYNTHETIC DATA AUGMENTATION METHODS
Many works have demonstrated the effectiveness of synthetic data augmentation techniques. In some cases (e.g., [3], [4], [184], [309]), data generated synthetically leads to better generalization performance than real data. Wang et al. [309], for example, reported several instances where models trained on synthetic data achieve better results on face recognition tasks. Similarly, Rogez and Schmid [4] consistently obtained higher performance with synthetic data than with real data on pose estimation tasks. Application scenarios where synthetic images have performed particularly well compared with real data have been settings that do not require a high level of photorealism, e.g., depth perception [3] and pose estimation [4], [184]. This is due to the fact that synthetic images are often "cleaner", i.e., they do not contain spurious details and artefacts that may be irrelevant to the target task. While some studies have obtained impressive results when training exclusively on synthetic data, many other works show that synthetic data, when used alone, does not always yield the desired performance. For instance, the results obtained by Richter et al. in [160] demonstrate that while synthetic data can drastically reduce the amount of real training data needed to achieve optimum performance, by themselves synthetic images do not guarantee good performance. In their study [160], they experimented first with real data and obtained a mean IoU of 65.0%. When the training set was augmented with synthetic data, they were able to improve the mean IoU score by 3.9% (from 65.0% to 68.9%). Indeed, using the augmented data, they achieved performance comparable to real data (65.2% versus 65.0%) while using only about one-third of the real data. However, the results were poor and unsatisfactory when only synthetic data was used (43.6%). Rajpura and Bojinov [310] compared the performance of deep learning-based object detectors trained on synthetic (3D graphics models), real (RGB images), and hybrid (synthetic and real) data. The results showed lower performance on synthetic data (24 mAP) compared to real data (28 mAP). However, combining the real and synthetic images improves the performance by up to 12% (36 mAP). Similarly, in [177] Alhaija et al. observed that training with an augmented-reality environment that integrates real and synthetically generated objects into a single environment achieves significantly higher performance than with either alone. In [311], Zhang et al. observed that expanding the training set by increasing the proportion of synthetic data does not lead to a linear increase in model performance. In fact, for some tasks, the performance flattens out at about 25% composition of synthetic data for the specific cases investigated.
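Experiments of the kind reported in [160] and [311] – varying how much of the training set is synthetic – can be emulated with a small mixing utility. The sketch below assumes real_ds and synth_ds are ordinary PyTorch-style datasets and is not taken from any of the cited studies.

import torch
from torch.utils.data import ConcatDataset, Subset

def mix_datasets(real_ds, synth_ds, synthetic_fraction=0.25, seed=0):
    """Build a training set in which roughly `synthetic_fraction` of the
    samples are synthetic (0 <= synthetic_fraction < 1). Keeps all real
    samples and subsamples the synthetic pool to reach the requested mix."""
    g = torch.Generator().manual_seed(seed)
    n_real = len(real_ds)
    # n_synth / (n_real + n_synth) == synthetic_fraction  =>
    n_synth = int(round(synthetic_fraction * n_real / (1.0 - synthetic_fraction)))
    n_synth = min(n_synth, len(synth_ds))
    idx = torch.randperm(len(synth_ds), generator=g)[:n_synth].tolist()
    return ConcatDataset([real_ds, Subset(synth_ds, idx)])

# Sweeping the fraction (e.g. 0.0, 0.25, 0.5, 0.75) and retraining the model
# at each setting reproduces the kind of saturation curve discussed above.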
10 SUMMARY AND DISCUSSION
In the literature, many data synthesis approaches have been proposed. Despite the large variety of techniques, four main classes of approaches can be distinguished: generative modeling; data synthesis by means of computer graphics tools; neural rendering approaches that utilize deep learning models to simulate the 3D modeling process; and neural style transfer, which relies on combining different hierarchical levels of convolutional features to synthesize new image data. The first group of approaches – generative modeling – is mainly based on generative adversarial networks and variational autoencoders. Generative modeling methods allow realistic image data to be synthesized using only random noise as input. Models can also generate outputs conditioned on specific input characteristics that define the desired appearance of the target data. The second class of methods, computer graphics approaches, explicitly constructs 3D models based on manually modeled, primitive geometric elements. Neural rendering methods use deep neural networks to learn the representation of 3D objects and then optionally render these as 2D images. Neural style transfer combines the different semantic information contained in different layers of neural networks to synthesize various visual styles. The main strengths and weaknesses of each of these classes of approaches are summarized in Figure 28.

The different data synthesis approaches have unique characteristics that define their scope of application. The neural style transfer method, for example, is highly flexible, since the stylization process can be controlled by setting appropriate weight parameters corresponding to different style intensities. This allows low-level transformations to be easily accomplished for a given task. However, despite its simplicity and relative efficiency, the method is generally limited to 2D synthesis, and it is challenging to generate large-scale photorealistic data with it. Its most important prospect is in enhancing robustness to noise, overcoming overfitting by preventing models from learning specific visual patterns, and encouraging texture invariance. In contrast to neural style transfer, rapid advances in generative modeling techniques have made it possible to easily create large volumes of artificial images with desired properties for specific tasks. However, these methods typically model objects and scenes in 2D, making it difficult to transfer the encoded knowledge to 3D scene understanding tasks. Also, because the 3D structure of the real world is ignored, achieving physically-grounded object manipulation is a challenging task. Synthesis methods based on 3D graphics modeling overcome these limitations by providing a means to generate realistic 3D scenes. However, the modeling process is typically laborious, limiting the amount of detail and the range of dynamic attributes that can be supported. Neural rendering can be used to introduce more nuanced details without manually creating the desired visual appearance. In particular, differential neural rendering approaches based on implicit representations allow deep learning models to encode 3D-grounded representations, as well as visual attributes such as color and varying lighting conditions, in the artificially generated images. Only in the last few years have there been cost-effective methods for generating large-scale virtual scenes using implicit neural representations. NeRFs have recently been the subject of considerable research. This intensive research has led to the development of new deep learning methods and models that represent 3D data in sophisticated and efficient ways. As a consequence of the development of these methods, a wide range of possibilities are emerging regarding training data for complex tasks like autonomous driving, where the cost of camera-captured data is exorbitant. As well as being able to generate more plausible pixel images, modern differential neural rendering models such as Block-NeRF [202] and Mega-NeRF [312] can synthesize realistic, large-scale 3D videos of entire scenes using only a few 2D images as input. These models essentially map pixel space into the context of a continuous 3D scene. This capability is highly promising for complex visual perception tasks that require physics-aware interpretation of input data. Conceivably, in the near future, more powerful and efficient NeRF models capable of generating large-scale, fully dynamic and realistic 4D scenes will replace the existing methods for synthesizing training data for applications such as robot navigation, autonomous driving and many 3D perception tasks.

11 FUTURE RESEARCH DIRECTIONS
The four main classes of synthetic data augmentation methods surveyed in this work are computer graphics modeling, neural rendering, generative modeling and neural style transfer. Overall, judging by recent trends, we expect significant progress in generative modeling and neural rendering techniques and minimal progress in neural style transfer methods. We also expect these methods to be applied to more challenging machine intelligence tasks such as affordance learning, human-machine interaction and extended reality. Progress in these areas will undoubtedly have a profound impact on the development of machine vision and artificial intelligence as a whole. We outline the most promising future research directions in the following paragraphs.

a. Modeling multiple sensory modalities in an integrated and adaptive way
Recent interest in generative modeling and neural rendering approaches has led to the development of new hybrid deep learning methods that combine the power of state-of-the-art neural rendering techniques like NeRFs and generative models such as GANs and VAEs to represent both 2D and 3D data in more effective and controllable ways. These approaches primarily exploit conditional GAN and conditional VAE techniques to provide a form of control over the visual properties of the synthesized data, allowing models to learn multiple salient features that enhance the realism of the representations. This additional dimension of flexibility and control can be leveraged by future synthesis models to generate more dynamic data whose appearance and properties change adaptively with the visual context. For applications such as high-level perception and long-term scene understanding, the emergence of techniques that allow view-specific attributes to be modified online in response to real-world conditions
would be extremely useful. More generally, the development of new data synthesis methods based on implicit neural rendering and generative modeling techniques will offer opportunities to radically extend the capabilities of state-of-the-art machine vision systems. Because implicit neural representation techniques are spatio-temporally continuous and differentiable, in addition to 3D visual information they can be used as universal function approximators to model diverse sensory signals as well as complex physical processes of the real world. Future research is expected to produce new and improved ways of enabling these diverse sensory modalities and physical properties of objects to be compactly integrated into a kind of universal framework. Incorporating information-rich representations in this manner will undoubtedly help to improve the "common sense" reasoning capabilities of intelligent and robotic systems.
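To make this idea slightly more concrete, one can imagine a single coordinate network that is queried at space-time locations and conditioned on a context code, with separate heads for visual and non-visual signals. The following is a purely speculative toy sketch, not a description of any existing system; all names, dimensions and output heads are hypothetical.

import torch
import torch.nn as nn

class UnifiedField(nn.Module):
    """A toy implicit field: (x, y, z, t) plus a conditioning code are mapped
    to several co-registered outputs. Real systems would use positional
    encodings, separate decoders and physically meaningful output spaces."""
    def __init__(self, code_dim=16, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(4 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.rgb = nn.Linear(hidden, 3)        # appearance
        self.density = nn.Linear(hidden, 1)    # geometry / occupancy
        self.audio = nn.Linear(hidden, 1)      # an auxiliary non-visual signal

    def forward(self, coords_t, code):
        h = self.trunk(torch.cat([coords_t, code], dim=-1))
        return torch.sigmoid(self.rgb(h)), torch.relu(self.density(h)), self.audio(h)

field = UnifiedField()
xyzt = torch.rand(1024, 4)                     # space-time query points
code = torch.randn(1024, 16)                   # e.g. weather/illumination context
rgb, density, audio = field(xyzt, code)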
b. Towards more effective and efficient representation and training
A major shortcoming of current data synthesis approaches, particularly generative modeling and neural rendering techniques, is the need to use several examples from the target domain to guide the synthesis process, especially when training for 3D scene understanding tasks. Many recent works have focused on achieving more computationally-efficient representations that enable scene data to scale up to city- or metropolitan-level environments. Application domains such as outdoor scene understanding and autonomous driving particularly require very large continuous scenes, which are currently challenging to synthesize owing to the enormous computational resource requirements. Today's large-scale synthetic environments are typically realized by stacking several smaller, image-level scenes together, where each constituent "mini scene" is encoded by a dedicated NeRF model. Obviously, this is not the most natural and effective way to represent scenes. A large amount of future research effort will increasingly focus on achieving more sample-efficient training. One of the most promising research goals is the development of unconditional 3D-aware data synthesis techniques that can generate high-quality, realistic 3D synthetic scenes without the need for reference images. New synthesis methods will also make it possible to utilize sparser representations to synthesize realistic data with detailed visual information. The integration of neural radiance field (NeRF) models into state-of-the-art generative modeling architectures will increasingly provide better ways to achieve more compact and unified differentiable representations of complex scenes. New techniques resulting from further breakthroughs in this line of work can, in the not-so-distant future, be used to generate seemingly continuous and infinitely large scenes by exploiting more effective and more efficient representation techniques.

c. Towards synthesis and representation of context-relevant scene properties
Machine vision models mainly rely on the visual appearance of input data to make predictions. The realism of visual features is, thus, the primary concern of developers and researchers when designing methods for synthesizing training data. However, in more complex tasks such as affordance learning, robot perception and dexterous manipulation, in addition to synthesizing appearance and geometry it is often helpful to model non-visual properties such as friction, mass and other semantic information about objects in the scene. We expect that approaches that rely on jointly modeling visual information and high-level non-visual attributes that reflect properties of the real world – physics-aware data synthesis – will become an important research topic in the foreseeable future. Research in this direction will facilitate new ways to represent visual and semantic context information in a more unified and coherent fashion. This will make it possible to synthesize fully interactive 3D environments without explicitly modeling object properties and behavior.
d. Simulating less intuitive augmentation schemes
Current data synthesis approaches assume visual similarity between the original data or target domain and the generated image samples. Consequently, the generative modeling process primarily aims to generate clean data that is as close as possible to the target data. It is, however, known that techniques such as random image perturbations can sometimes provide the most useful augmentations for improving generalization. Thus, by focusing on closely aligned semantic content, data synthesis approaches based on generative modeling usually ignore useful augmentation strategies that rely on non-realistic data (e.g., methods such as blurring and noise injection). At present, there is no workaround that allows implausible but effective data to be generated with generative modeling.

12 CONCLUSION
Synthetic data augmentation is a way to overcome data scarcity in practical machine learning applications by creating artificial samples from scratch. This survey explores the most important approaches to generating synthetic data for training computer vision models. We present a detailed coverage of the methods, their unique properties and application scenarios, as well as the important limitations of data synthesis methods for extending training data. We also summarize the main features, generation methods, supported tasks and application domains of common publicly available, large-scale synthetic datasets. Lastly, we investigate the effectiveness of data synthesis approaches to data augmentation. The survey shows that synthetic data augmentation methods provide an effective means of obtaining good generalization performance in situations where it is difficult to access real data for training. Moreover, for tasks such as optical flow, depth estimation and visual odometry, where photorealism plays no role in inference, training with synthetic data sometimes yields better performance than training with real data.

ACKNOWLEDGMENTS
The authors would like to thank...

REFERENCES
[1] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "Carla: An open urban driving simulator," in Conference on Robot Learning. PMLR, 2017, pp. 1–16.
[2] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, "Ai2-thor: An interactive 3d environment for visual ai," arXiv preprint arXiv:1712.05474, 2017.
[3] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid, "Learning from synthetic humans," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 109–117.
[4] G. Rogez and C. Schmid, "Mocap-guided data augmentation for 3d pose estimation in the wild," Advances in Neural Information Processing Systems, vol. 29, 2016.
[5] K. Mo, Y. Qin, F. Xiang, H. Su, and L. Guibas, "O2o-afford: Annotation-free large-scale object-object affordance learning," in Conference on Robot Learning. PMLR, 2022, pp. 1666–1677.
[6] F.-J. Chu, R. Xu, and P. A. Vela, "Learning affordance segmentation for real-world robotic manipulation via synthetic images," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1140–1147, 2019.
[7] Y. Lin, C. Tang, F.-J. Chu, and P. A. Vela, "Using synthetic data and deep networks to recognize primitive shapes for object grasping," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 10494–10501.
[8] A. Ummadisingu, K. Takahashi, and N. Fukaya, "Cluttered food grasping with adaptive fingers and synthetic-data trained object detection," arXiv preprint arXiv:2203.05187, 2022.
[9] T. Kollar, M. Laskey, K. Stone, B. Thananjeyan, and M. Tjersland, "Simnet: Enabling robust unknown object manipulation from pure synthetic data via stereo," in Conference on Robot Learning. PMLR, 2022, pp. 938–948.
[10] Z. Luo, W. Xue, J. Chae, and G. Fu, "Skp: Semantic 3d keypoint detection for category-level robotic manipulation," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5437–5444, 2022.
[11] A. H. Ornek and M. Ceylan, "Comparison of traditional transformations for data augmentation in deep learning of medical thermography," in 2019 42nd International Conference on Telecommunications and Signal Processing (TSP). IEEE, 2019, pp. 191–194.
[12] K. Wang, B. Fang, J. Qian, S. Yang, X. Zhou, and J. Zhou, "Perspective transformation data augmentation for object detection," IEEE Access, vol. 8, pp. 4935–4943, 2019.
[13] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.
[14] E. K. Kim, H. Lee, J. Y. Kim, and S. Kim, "Data augmentation method by applying color perturbation of inverse psnr and geometric transformations for object recognition based on deep learning," Applied Sciences, vol. 10, no. 11, p. 3755, 2020.
[15] D. Sakkos, H. P. Shum, and E. S. Ho, "Illumination-based data augmentation for robust background subtraction," in 2019 13th International Conference on Software, Knowledge, Information Management and Applications (SKIMA). IEEE, 2019, pp. 1–8.
[16] O. Mazhar and J. Kober, "Random shadows and highlights: A new data augmentation method for extreme lighting conditions," arXiv preprint arXiv:2101.05361, 2021.
[17] A. Kotwal, R. Bhalodia, and S. P. Awate, "Joint desmoking and denoising of laparoscopy images," in 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI). IEEE, 2016, pp. 1050–1054.
[18] H. Li, X. Zhang, Q. Tian, and H. Xiong, "Attribute mix: semantic data augmentation for fine grained recognition," in 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). IEEE, 2020, pp. 243–246.
[19] S. Feng, S. Yang, Z. Niu, J. Xie, M. Wei, and P. Li, "Grid cut and mix: flexible and efficient data augmentation," in Twelfth International Conference on Graphics and Image Processing (ICGIP 2020), vol. 11720. International Society for Optics and Photonics, 2021, p. 1172028.
[20] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "Cutmix: Regularization strategy to train strong classifiers with localizable features," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
[21] J. Yoo, N. Ahn, and K.-A. Sohn, "Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8375–8384.
[22] J. Lemley, S. Bazrafkan, and P. Corcoran, "Smart augmentation learning an optimal data augmentation strategy," IEEE Access, vol. 5, pp. 5858–5869, 2017.
[23] X. Li, Y. Dai, Y. Ge, J. Liu, Y. Shan, and L.-Y. Duan, "Uncertainty modeling for out-of-distribution generalization," arXiv preprint arXiv:2202.03958, 2022.
[24] X. Bouthillier, K. Konda, P. Vincent, and R. Memisevic, "Dropout as data augmentation," arXiv preprint arXiv:1506.08700, 2015.
[25] B.-B. Jia and M.-L. Zhang, "Multi-dimensional classification via selective feature augmentation," Machine Intelligence Research, vol. 19, no. 1, pp. 38–51, 2022.
[26] K. Maharana, S. Mondal, and B. Nemade, "A review: Data pre-processing and data augmentation techniques," Global Transitions Proceedings, 2022.
[27] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.
[28] S. Yang, W. Xiao, M. Zhang, S. Guo, J. Zhao, and F. Shen, "Image data augmentation for deep learning: A survey," arXiv preprint arXiv:2204.08610, 2022.
[29] C. Khosla and B. S. Saini, "Enhancing performance of deep learning models with different data augmentation techniques: A survey," in 2020 International Conference on Intelligent Engineering and Management (ICIEM). IEEE, 2020, pp. 79–85.
[30] N. E. Khalifa, M. Loey, and S. Mirjalili, "A comprehensive survey of recent trends in deep learning for digital images augmentation," Artificial Intelligence Review, pp. 1–27, 2021.
[31] G. E. Hinton and T. J. Sejnowski, "Optimal perceptual inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 448. Citeseer, 1983, pp. 448–453.
[32] P. R. Jeyaraj and E. R. S. Nadar, "Deep boltzmann machine algorithm for accurate medical image analysis for classification of cancerous region," Cognitive Computation and Systems, vol. 1, no. 3, pp. 85–90, 2019.
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[34] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.
[35] H. Akaike, "Autoregressive models for regression," Annals of the Institute of Statistical Mathematics, vol. 21, pp. 243–247, 1969.
[36] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, "Generating facial expressions with deep belief nets," Affective Computing, Emotion Modelling, Synthesis and Recognition, pp. 421–440, 2008.
[37] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, "Veegan: Reducing mode collapse in gans using implicit variational learning," Advances in Neural Information Processing Systems, vol. 30, 2017.
[38] L. Mescheder, S. Nowozin, and A. Geiger, "Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks," in International Conference on Machine Learning. PMLR, 2017, pp. 2391–2400.
[39] J. Peng, D. Liu, S. Xu, and H. Li, "Generating diverse structure for image inpainting with hierarchical vq-vae," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10775–10784.
[40] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[41] E. L. Denton, S. Chintala, R. Fergus et al., "Deep generative image models using a laplacian pyramid of adversarial networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[42] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "Infogan: Interpretable representation learning by information maximizing generative adversarial nets," Advances in Neural Information Processing Systems, vol. 29, 2016.
[43] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[44] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," in International Conference on Machine Learning. PMLR, 2017, pp. 1857–1865.
[45] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[46] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," in International Conference on Machine Learning. PMLR, 2019, pp. 7354–7363.
[47] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger, "Graf: Generative radiance fields for 3d-aware image synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 20154–20166, 2020.
[48] Y. Xue, Y. Li, K. K. Singh, and Y. J. Lee, "Giraffe hd: A high-resolution 3d-aware generative model," arXiv preprint arXiv:2203.14954, 2022.
[49] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein, "pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5799–5809.
[50] X. Zhang, Z. Zheng, D. Gao, B. Zhang, P. Pan, and Y. Yang, "Multi-view consistent generative adversarial networks for 3d-aware image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18450–18459.
[51] H. Ohno, "Auto-encoder-based generative models for data augmentation on regression problems," Soft Computing, vol. 24, no. 11, pp. 7999–8009, 2020.
[52] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi, "Bagan: Data augmentation with balancing gan," arXiv preprint arXiv:1803.09655, 2018.
[53] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," arXiv preprint arXiv:1605.09782, 2016.
[54] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, "Adversarially learned inference," arXiv preprint arXiv:1606.00704, 2016.
[55] M. Brundage, S. Avin, J. Clark, H. Toner, P. Eckersley, B. Garfinkel, A. Dafoe, P. Scharre, T. Zeitzoff, B. Filar et al., "The malicious use of artificial intelligence: Forecasting, prevention, and mitigation," arXiv preprint arXiv:1802.07228, 2018.
[56] A. Brock, J. Donahue, and K. Simonyan, "Large scale gan training for high fidelity natural image synthesis," arXiv preprint arXiv:1809.11096, 2018.
[57] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," in International Conference on Machine Learning. PMLR, 2016, pp. 1558–1566.
[58] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "Synthetic data augmentation using gan for improved liver lesion classification," in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018, pp. 289–293.
[59] D. Ribli, A. Horváth, Z. Unger, P. Pollner, and I. Csabai, "Detecting and classifying lesions in mammograms with deep learning," Scientific Reports, vol. 8, no. 1, pp. 1–7, 2018.
[60] X. Wang, A. Shrivastava, and A. Gupta, "A-fast-rcnn: Hard positive generation via adversary for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2606–2615.
[61] X. Peng, Z. Tang, F. Yang, R. S. Feris, and D. Metaxas, "Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2226–2234.
[62] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M.-H. Yang, "Vital: Visual tracking via adversarial learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8990–8999.
[63] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "Gan-based synthetic medical image augmentation for increased cnn performance in liver lesion classification," Neurocomputing, vol. 321, pp. 321–331, 2018.
[64] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V. Hernández, J. Wardlaw, and D. Rueckert, "Gan augmentation: Augmenting training data using generative adversarial networks," arXiv preprint arXiv:1810.10863, 2018.
[65] A. Madani, M. Moradi, A. Karargyris, and T. Syeda-Mahmood, "Chest x-ray generation and data augmentation for cardiovascular abnormality classification," in Medical Imaging 2018: Image Processing, vol. 10574. International Society for Optics and Photonics, 2018, p. 105741M.
[66] N. Souly, C. Spampinato, and M. Shah, "Semi supervised semantic segmentation using generative adversarial network," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5688–5696.
[67] S. Kaur, H. Aggarwal, and R. Rani, "Mr image synthesis using generative adversarial networks for parkinson's disease classification," in Proceedings of International Conference on Artificial Intelligence and Applications. Springer, 2021, pp. 317–327.
[68] S. Kaplan, L. Lensu, L. Laaksonen, and H. Uusitalo, "Evaluation of unconditioned deep generative synthesis of retinal images," in International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, 2020, pp. 262–273.
[69] W. Fang, F. Zhang, V. S. Sheng, and Y. Ding, "A method for improving cnn-based image recognition using dcgan," Computers, Materials and Continua, vol. 57, no. 1, pp. 167–178, 2018.
[70] C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey, "St-gan: Spatial transformer generative adversarial networks for image compositing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9455–9464.
[71] S. C. Medin, B. Egger, A. Cherian, Y. Wang, J. B. Tenenbaum, X. Liu, and T. K. Marks, "Most-gan: 3d morphable stylegan for disentangled face image manipulation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1962–1971.
[72] Z. Chen, Z. Zeng, H. Shen, X. Zheng, P. Dai, and P. Ouyang, "Dn-gan: Denoising generative adversarial networks for speckle noise reduction in optical coherence tomography images," Biomedical Signal Processing and Control, vol. 55, p. 101632, 2020.
[73] D.-P. Fan, Z. Huang, P. Zheng, H. Liu, X. Qin, and L. Van Gool, "Facial-sketch synthesis: a new challenge," Machine Intelligence Research, vol. 19, no. 4, pp. 257–287, 2022.
[74] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla, "Infrared image colorization based on a triplet dcgan architecture," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 18–23.
[75] X. Zhao, F. Ma, D. Güera, Z. Ren, A. G. Schwing, and A. Colburn, "Generative multiplane images: Making a 2d gan 3d-aware," in European Conference on Computer Vision. Springer, 2022, pp. 18–35.
[76] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2439–2448.
[77] Y.-J. Ju, G.-H. Lee, J.-H. Hong, and S.-W. Lee, "Complete face recovery gan: Unsupervised joint face rotation and de-occlusion from a single-view image," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3711–3721.
[78] X. Chen, X. Luo, J. Weng, W. Luo, H. Li, and Q. Tian, "Multi-view gait image generation for cross-view gait recognition," IEEE Transactions on Image Processing, vol. 30, pp. 3041–3055, 2021.
[79] S. Kim, J. Lee, and B. C. Ko, "Ssl-mot: self-supervised learning based multi-object tracking," Applied Intelligence, pp. 1–11, 2022.
[80] X. Wang, C. Li, B. Luo, and J. Tang, "Sint++: Robust visual tracking via adversarial positive instance generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4864–4873.
[81] Q. Wu, Z. Chen, L. Cheng, Y. Yan, B. Li, and H. Wang, "Hallucinated adversarial learning for robust visual tracking," arXiv preprint arXiv:1906.07008, 2019.
[82] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu, "Pose transferrable person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4099–4108.
[83] Z. Zheng, L. Zheng, and Y. Yang, "Unlabeled samples generated by gan improve the person re-identification baseline in vitro," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3754–3762.
[84] M. Zanfir, A.-I. Popa, A. Zanfir, and C. Sminchisescu, "Human appearance transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5391–5399.
[85] K. Saleh, S. Szénási, and Z. Vámossy, "Occlusion handling in generic object detection: A review," in 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI). IEEE, 2021, pp. 000477–000484.
[86] L. Minciullo, F. Manhardt, K. Yoshikawa, S. Meier, F. Tombari, and N. Kobori, "Db-gan: Boosting object recognition under strong lighting conditions," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2939–2949.
[87] Z. Yi, H. Zhang, P. Tan, and M. Gong, "Dualgan: Unsupervised dual learning for image-to-image translation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.
[88] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., "A survey of large language models," arXiv preprint arXiv:2303.18223, 2023.
[89] J. Zhang, J. Huang, S. Jin, and S. Lu, "Vision-language models for vision tasks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[90] H. Zhang, V. Sindagi, and V. M. Patel, "Image de-raining using a conditional generative adversarial network," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 3943–3956, 2019.
[91] S. K. Jemni, M. A. Souibgui, Y. Kessentini, and A. Fornés, "Enhance to read better: A multi-task adversarial network for handwritten document image enhancement," Pattern Recognition, vol. 123, p. 108370, 2022.
[92] J. Liang, M. Li, Y. Jia, and R. Sun, "Single image dehazing in 3d space with more stable gans," in Proceedings of 2021 Chinese Intelligent Systems Conference. Springer, 2022, pp. 581–590.
[93] X. Li, G. Teng, P. An, H. Yao, and Y. Chen, "Advertisement logo compositing via adversarial geometric consistency pursuit," in 2019 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2019, pp. 1–4.
[94] J. Kossaifi, L. Tran, Y. Panagakis, and M. Pantic, "Gagan: Geometry-aware generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 878–887.
[95] F. Zhan, C. Xue, and S. Lu, "Ga-dan: Geometry-aware domain adaptation network for scene text detection and recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9105–9115.
[96] S. Treneska, E. Zdravevski, I. M. Pires, P. Lameski, and S. Gievska, "Gan-based image colorization for self-supervised visual feature learning," Sensors, vol. 22, no. 4, p. 1599, 2022.
[97] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, "Esrgan: Enhanced super-resolution generative adversarial networks," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0.
[98] W. Wang, H. Wang, S. Yang, X. Zhang, X. Wang, J. Wang, J. Lei, Z. Zhang, and Z. Dong, "Resolution enhancement in microscopic imaging based on generative adversarial network with unpaired data," Optics Communications, vol. 503, p. 127454, 2022.
[99] S. N. Rai and C. Jawahar, "Removing atmospheric turbulence via deep adversarial learning," IEEE Transactions on Image Processing, vol. 31, pp. 2633–2646, 2022.
[100] S. Tripathi, Z. C. Lipton, and T. Q. Nguyen, "Correction by projection: Denoising images with generative adversarial networks," arXiv preprint arXiv:1803.04477, 2018.
[101] Q. Lyu, C. You, H. Shan, Y. Zhang, and G. Wang, "Super-resolution mri and ct through gan-circle," in Developments in X-ray Tomography XII, vol. 11113. International Society for Optics and Photonics, 2019, p. 111130X.
[102] F. Chiaroni, M.-C. Rahal, N. Hueber, and F. Dufaux, "Hallucinating a cleanly labeled augmented dataset from a noisy labeled dataset using gan," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3616–3620.
[103] W. Lira, J. Merz, D. Ritchie, D. Cohen-Or, and H. Zhang, "Ganhopper: Multi-hop gan for unsupervised image-to-image translation," in European Conference on Computer Vision. Springer, 2020, pp. 363–379.
[104] E. Ntavelis, M. Shahbazi, I. Kastanis, R. Timofte, M. Danelljan, and L. Van Gool, "Arbitrary-scale image synthesis," arXiv preprint arXiv:2204.02273, 2022.
[105] L. Sixt, B. Wild, and T. Landgraf, "Rendergan: Generating realistic labeled data," Frontiers in Robotics and AI, vol. 5, p. 66, 2018.
[106] J. Zhao, L. Xiong, P. Karlekar Jayashree, J. Li, F. Zhao, Z. Wang, P. Sugiri Pranata, P. Shengmei Shen, S. Yan, and J. Feng, "Dual-agent gans for photorealistic and identity preserving profile face synthesis," Advances in Neural Information Processing Systems, vol. 30, 2017.
[107] A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré, "Learning to compose domain-specific transformations for data augmentation," Advances in Neural Information Processing Systems, vol. 30, 2017.
[108] S. Zhao, Z. Liu, J. Lin, J.-Y. Zhu, and S. Han, "Differentiable augmentation for data-efficient gan training," Advances in Neural Information Processing Systems, vol. 33, pp. 7559–7570, 2020.
[109] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, "Training generative adversarial networks with limited data," Advances in Neural Information Processing Systems, vol. 33, pp. 12104–12114, 2020.
[110] N.-T. Tran, V.-H. Tran, N.-B. Nguyen, T.-K. Nguyen, and N.-M. Cheung, "Towards good practices for data augmentation in gan training," arXiv preprint arXiv:2006.05338, vol. 2, no. 3, p. 3, 2020.
[111] H. Zhang, Z. Zhang, A. Odena, and H. Lee, "Consistency regularization for generative adversarial networks," arXiv preprint arXiv:1910.12027, 2019.
[112] Z. Zhao, S. Singh, H. Lee, Z. Zhang, A. Odena, and H. Zhang, "Improved consistency regularization for gans," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11033–11041.
[134] K. Li and D.-K. Kang, "Enhanced generative adversarial networks with restart learning rate in discriminator," Applied Sci-
[113] S. Park, Y.-J. Yeo, and Y.-G. Shin, “Generative adversarial network ences, vol. 12, no. 3, p. 1191, 2022.
using perturbed-convolutions,” arXiv preprint arXiv:2101.10841, [135] C. G. Korde, M. Vasantha et al., “Training of generative adversar-
vol. 1, no. 3, p. 8, 2021. ial networks with hybrid evolutionary optimization technique,”
[114] B. Dodin and M. Sirvanci, “Stochastic networks and the extreme in 2019 IEEE 16th India Council International Conference (INDI-
value distribution,” Computers & operations research, vol. 17, no. 4, CON). IEEE, 2019, pp. 1–4.
pp. 397–409, 1990. [136] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for cnn:
[115] S. Bhatia, A. Jain, and B. Hooi, “Exgan: Adversarial generation Viewpoint estimation in images using cnns trained with rendered
of extreme samples,” in Proceedings of the AAAI Conference on 3d model views,” in Proceedings of the IEEE international conference
Artificial Intelligence, vol. 35, no. 8, 2021, pp. 6750–6758. on computer vision, 2015, pp. 2686–2694.
[116] L. Liu, M. Muelly, J. Deng, T. Pfister, and L.-J. Li, “Generative [137] X. Peng, B. Sun, K. Ali, and K. Saenko, “Learning deep object
modeling for small-data object detection,” in Proceedings of the detectors from 3d models,” in Proceedings of the IEEE international
IEEE/CVF International Conference on Computer Vision, 2019, pp. conference on computer vision, 2015, pp. 1278–1286.
6073–6081. [138] S. Liu and S. Ostadabbas, “A semi-supervised data augmentation
[117] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and approach using 3d graphical engines,” in Proceedings of the Euro-
A. Smola, “A kernel two-sample test,” The Journal of Machine pean Conference on Computer Vision (ECCV) Workshops, 2018, pp.
Learning Research, vol. 13, no. 1, pp. 723–773, 2012. 0–0.
[118] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural [139] R. Sulzer, L. Landrieu, A. Boulch, R. Marlet, and B. Vallet,
similarity for image quality assessment,” in The Thrity-Seventh “Deep surface reconstruction from point clouds with visibility
Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2. information,” arXiv preprint arXiv:2202.01810, 2022.
Ieee, 2003, pp. 1398–1402. [140] J. Malik, S. Shimada, A. Elhayek, S. A. Ali, V. Golyanik,
[119] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, C. Theobalt, and D. Stricker, “Handvoxnet++: 3d hand shape
and X. Chen, “Improved techniques for training gans,” Advances and pose estimation using voxel-based neural networks,” IEEE
in neural information processing systems, vol. 29, 2016. Transactions on Pattern Analysis and Machine Intelligence, 2021.
[120] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and [141] F. Bongini, L. Berlincioni, M. Bertini, and A. Del Bimbo, “Partially
S. Hochreiter, “Gans trained by a two time-scale update rule con- fake it till you make it: mixing real and fake thermal images
verge to a local nash equilibrium,” Advances in neural information for improved object detection,” in Proceedings of the 29th ACM
processing systems, vol. 30, 2017. International Conference on Multimedia, 2021, pp. 5482–5490.
[121] C. Nash, J. Menick, S. Dieleman, and P. W. Battaglia, “Gen- [142] V. Hegde and R. Zadeh, “Fusionnet: 3d object classification using
erating images with sparse representations,” arXiv preprint multiple data representations,” arXiv preprint arXiv:1607.05695,
arXiv:2103.03841, 2021. 2016.
[143] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista,
[122] M. J. Chong and D. Forsyth, “Effectively unbiased fid and incep-
N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photore-
tion score and where to find them,” in Proceedings of the IEEE/CVF
alistic synthetic dataset for holistic indoor scene understanding,”
conference on computer vision and pattern recognition, 2020, pp.
in Proceedings of the IEEE/CVF International Conference on Computer
6070–6079.
Vision, 2021, pp. 10 912–10 922.
[123] C.-Y. Bai, H.-T. Lin, C. Raffel, and W. C.-w. Kan, “On training
[144] J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu,
sample memorization: Lessons from benchmarking generative
X. Zhang, T. F. Y. Vicente, T. Dideriksen, H. Arora et al., “Abo:
modeling with a large-scale competition,” in Proceedings of the
Dataset and benchmarks for real-world 3d object understand-
27th ACM SIGKDD conference on knowledge discovery & data min-
ing,” in Proceedings of the IEEE/CVF conference on computer vision
ing, 2021, pp. 2534–2542.
and pattern recognition, 2022, pp. 21 126–21 136.
[124] S. Liu, Y. Wei, J. Lu, and J. Zhou, “An improved evaluation
[145] H. Hattori, V. Naresh Boddeti, K. M. Kitani, and T. Kanade,
framework for generative adversarial networks,” arXiv preprint “Learning scene-specific pedestrian detectors without real data,”
arXiv:1803.07474, 2018. in Proceedings of the IEEE conference on computer vision and pattern
[125] S. Zhou, M. Gordon, R. Krishna, A. Narcomey, L. F. Fei-Fei, and recognition, 2015, pp. 3819–3827.
M. Bernstein, “Hype: A benchmark for human eye perceptual [146] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang,
evaluation of generative models,” Advances in neural information Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet:
processing systems, vol. 32, 2019. An information-rich 3d model repository,” arXiv preprint
[126] P. Salehi, A. Chalechale, and M. Taghizadeh, “Generative ad- arXiv:1512.03012, 2015.
versarial networks (gans): An overview of theoretical model, [147] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao,
evaluation metrics, and recent developments,” arXiv preprint “3d shapenets: A deep representation for volumetric shapes,” in
arXiv:2005.13178, 2020. Proceedings of the IEEE conference on computer vision and pattern
[127] H. Thanh-Tung and T. Tran, “Catastrophic forgetting and mode recognition, 2015, pp. 1912–1920.
collapse in gans,” in 2020 international joint conference on neural [148] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and
networks (ijcnn). IEEE, 2020, pp. 1–10. T. Theoharis, “Looking beyond appearances: Synthetic training
[128] L. Xu, X. Zeng, Z. Huang, W. Li, and H. Zhang, “Low-dose data for deep cnns in re-identification,” Computer Vision and Image
chest x-ray image super-resolution using generative adversarial Understanding, vol. 167, pp. 50–62, 2018.
nets with spectral normalization,” Biomedical Signal Processing and [149] K. Ashish and S. Shital, “Microsoft extends airsim to include
Control, vol. 55, p. 101600, 2020. autonomous car research.”
[129] M. Lee and J. Seok, “Regularization methods for generative ad- [150] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity
versarial networks: An overview of recent studies,” arXiv preprint visual and physical simulation for autonomous vehicles,” in Field
arXiv:2005.09165, 2020. and Service Robotics: Results of the 11th International Conference.
[130] Q. Hoang, T. D. Nguyen, T. Le, and D. Phung, “Mgan: Training Springer, 2018, pp. 621–635.
generative adversarial nets with multiple generators,” in Interna- [151] H. Abu Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and
tional conference on learning representations, 2018. C. Rother, “Augmented reality meets computer vision: Efficient
[131] M. M. Saad, R. O’ Reilly, and M. H. Rehmani, “A survey on train- data generation for urban driving scenes,” International Journal of
ing challenges in generative adversarial networks for biomedical Computer Vision, vol. 126, no. 9, pp. 961–972, 2018.
image analysis,” Artificial Intelligence Review, vol. 57, no. 2, p. 19, [152] N. Jaipuria, X. Zhang, R. Bhasin, M. Arafa, P. Chakravarty,
2024. S. Shrivastava, S. Manglani, and V. N. Murali, “Deflating dataset
[132] Z. Zhou, Q. Zhang, G. Lu, H. Wang, W. Zhang, and Y. Yu, bias using synthetic data augmentation,” in Proceedings of the
“Adashift: Decorrelation and convergence of adaptive learning IEEE/CVF Conference on Computer Vision and Pattern Recognition
rate methods,” arXiv preprint arXiv:1810.00143, 2018. Workshops, 2020, pp. 772–773.
[133] Y. Gan, T. Xiang, H. Liu, M. Ye, and M. Zhou, “Generative [153] S. Borkman, A. Crespi, S. Dhakad, S. Ganguly, J. Hogins, Y.-
adversarial networks with adaptive learning strategy for noise- C. Jhang, M. Kamalzadeh, B. Li, S. Leal, P. Parisi et al., “Unity
to-image synthesis,” Neural Computing and Applications, vol. 35, perception: Generate synthetic data for computer vision,” arXiv
no. 8, pp. 6197–6206, 2023. preprint arXiv:2107.04259, 2021.
30
[154] J. Jang, H. Lee, and J.-C. Kim, “Carfree: Hassle-free object detec- [174] C. Sevastopoulos, S. Konstantopoulos, K. Balaji, M. Zaki Zadeh,
tion dataset generation using carla autonomous driving simula- and F. Makedon, “A simulated environment for robot vision
tor,” Applied Sciences, vol. 12, no. 1, p. 281, 2021. experiments,” Technologies, vol. 10, no. 1, p. 7, 2022.
[155] K. M. Hart, A. B. Goodman, and R. P. O’Shea, “Automatic [175] S. Moro and T. Komuro, “Generation of virtual reality environ-
generation of machine learning synthetic data using ros,” in ment based on 3d scanned indoor physical space,” in International
International Conference on Human-Computer Interaction. Springer, Symposium on Visual Computing. Springer, 2021, pp. 492–503.
2021, pp. 310–325. [176] M. Sra, S. Garrido-Jurado, and P. Maes, “Oasis: Procedurally
[156] M. S. Mueller and B. Jutzi, “Uas navigation with squeezeposenet generated social virtual spaces from 3d scanned real spaces,”
accuracy boosting for pose regression by data augmentation,” IEEE transactions on visualization and computer graphics, vol. 24,
Drones, vol. 2, no. 1, p. 7, 2018. no. 12, pp. 3174–3187, 2017.
[157] N. Koenig and A. Howard, “Design and use paradigms [177] H. A. Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and
for gazebo, an open-source multi-robot simulator,” in 2004 C. Rother, “Augmented reality meets deep learning for car in-
IEEE/RSJ International Conference on Intelligent Robots and Systems stance segmentation in urban scenes,” in British machine vision
(IROS)(IEEE Cat. No. 04CH37566), vol. 3. IEEE, 2004, pp. 2149– conference, vol. 1, 2017, p. 2.
2154. [178] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
[158] A. Kerim, L. Soriano Marcolino, and R. Jiang, “Silver: Novel “Domain randomization for transferring deep neural networks
rendering engine for data hungry computer vision models,” in from simulation to the real world,” in 2017 IEEE/RSJ international
2nd International Workshop on Data Quality Assessment for Machine conference on intelligent robots and systems (IROS). IEEE, 2017, pp.
Learning, 2021. 23–30.
[159] A. Shafaei, J. J. Little, and M. Schmidt, “Play and learn: Using [179] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas,
video games to train computer vision models,” arXiv preprint V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet:
arXiv:1608.01745, 2016. Learning optical flow with convolutional networks,” in Proceed-
[160] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: ings of the IEEE international conference on computer vision, 2015,
Ground truth from computer games,” in European conference on pp. 2758–2766.
computer vision. Springer, 2016, pp. 102–118. [180] R. Gao, Z. Si, Y.-Y. Chang, S. Clarke, J. Bohg, L. Fei-Fei, W. Yuan,
[161] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic and J. Wu, “Objectfolder 2.0: A multisensory object dataset for
open source movie for optical flow evaluation,” in European sim2real transfer,” arXiv preprint arXiv:2204.02389, 2022.
conference on computer vision. Springer, 2012, pp. 611–625. [181] A. Barisic, F. Petric, and S. Bogdan, “Sim2air-synthetic aerial
[162] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as dataset for uav monitoring,” IEEE Robotics and Automation Letters,
proxy for multi-object tracking analysis,” in Proceedings of the vol. 7, no. 2, pp. 3757–3764, 2022.
IEEE conference on computer vision and pattern recognition, 2016, [182] K. Dimitropoulos, I. Hatzilygeroudis, and K. Chatzilygeroudis,
pp. 4340–4349. “A brief survey of sim2real methods for robot learning,” in
[163] C. Roberto de Souza, A. Gaidon, Y. Cabon, and A. Manuel Lopez, International Conference on Robotics in Alpe-Adria Danube Region.
“Procedural generation of videos to train deep action recognition Springer, 2022, pp. 133–140.
networks,” in Proceedings of the IEEE Conference on Computer Vision [183] T. Ikeda, S. Tanishige, A. Amma, M. Sudano, H. Audren, and
and Pattern Recognition, 2017, pp. 4757–4767. K. Nishiwaki, “Sim2real instance-level style transfer for 6d pose
[164] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, estimation,” arXiv preprint arXiv:2203.02069, 2022.
“The synthia dataset: A large collection of synthetic images for [184] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and
semantic segmentation of urban scenes,” in Proceedings of the R. Webb, “Learning from simulated and unsupervised images
IEEE conference on computer vision and pattern recognition, 2016, through adversarial training,” in Proceedings of the IEEE conference
pp. 3234–3243. on computer vision and pattern recognition, 2017, pp. 2107–2116.
[165] M. Wrenninge and J. Unger, “Synscapes: A photorealistic [185] D.-Y. She and K. Xu, “Contrastive self-supervised representation
synthetic dataset for street scene parsing,” arXiv preprint learning using synthetic data,” International Journal of Automation
arXiv:1810.08705, 2018. and Computing, vol. 18, no. 4, pp. 556–567, 2021.
[166] E. Cheung, T. K. Wong, A. Bera, X. Wang, and D. Manocha, [186] A. Atapour-Abarghouei and T. P. Breckon, “Real-time monocular
“Lcrowdv: Generating labeled videos for simulation-based depth estimation using synthetic data with domain adaptation
crowd behavior learning,” in European Conference on Computer via image style transfer,” in Proceedings of the IEEE conference on
Vision. Springer, 2016, pp. 709–727. computer vision and pattern recognition, 2018, pp. 2800–2810.
[167] Z. Li, T.-W. Yu, S. Sang, S. Wang, M. Song, Y. Liu, Y.-Y. Yeh, [187] S. Huang and D. Ramanan, “Expecting the unexpected: Training
R. Zhu, N. Gundavarapu, J. Shi et al., “Openrooms: An open detectors for unusual pedestrians with adversarial imposters,” in
framework for photorealistic indoor scene datasets,” in Proceed- Proceedings of the IEEE Conference on Computer Vision and Pattern
ings of the IEEE/CVF conference on computer vision and pattern Recognition, 2017, pp. 2243–2252.
recognition, 2021, pp. 7190–7199. [188] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and
[168] R. Sandhu, S. Dambreville, and A. Tannenbaum, “Point set Q. Tian, “Mars: A video benchmark for large-scale person re-
registration via particle filtering and stochastic dynamics,” IEEE identification,” in Computer Vision–ECCV 2016: 14th European
transactions on pattern analysis and machine intelligence, vol. 32, Conference, Amsterdam, The Netherlands, October 11-14, 2016, Pro-
no. 8, pp. 1459–1473, 2009. ceedings, Part VI 14. Springer, 2016, pp. 868–884.
[169] K. Vyas, L. Jiang, S. Liu, and S. Ostadabbas, “An efficient 3d [189] Z. Chen, W. Ouyang, T. Liu, and D. Tao, “A shape transformation-
synthetic model generation pipeline for human pose data aug- based dataset augmentation framework for pedestrian detec-
mentation,” in Proceedings of the IEEE/CVF Conference on Computer tion,” International Journal of Computer Vision, vol. 129, no. 4, pp.
Vision and Pattern Recognition, 2021, pp. 1542–1552. 1121–1138, 2021.
[170] F. Bogo, M. J. Black, M. Loper, and J. Romero, “Detailed full- [190] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool,
body reconstructions of moving people from monocular rgb-d “Pose guided person image generation,” Advances in neural infor-
sequences,” in Proceedings of the IEEE international conference on mation processing systems, vol. 30, 2017.
computer vision, 2015, pp. 2300–2308. [191] Y. Pang, J. Cao, J. Wang, and J. Han, “Jcs-net: Joint classification
[171] N. Hesse, S. Pujades, M. J. Black, M. Arens, U. G. Hofmann, and and super-resolution network for small-scale pedestrian detec-
A. S. Schroeder, “Learning and tracking the 3d body shape of tion in surveillance images,” IEEE Transactions on Information
freely moving infants from rgb-d sequences,” IEEE transactions on Forensics and Security, vol. 14, no. 12, pp. 3322–3331, 2019.
pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2540– [192] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio,
2551, 2019. A. Blake, M. Cook, and R. Moore, “Real-time human pose recog-
[172] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and nition in parts from single depth images,” Communications of the
M. Nießner, “Scannet: Richly-annotated 3d reconstructions of ACM, vol. 56, no. 1, pp. 116–124, 2013.
indoor scenes,” in Proceedings of the IEEE conference on computer [193] P. Tokmakov, K. Alahari, and C. Schmid, “Learning motion pat-
vision and pattern recognition, 2017, pp. 5828–5839. terns in videos,” in Proceedings of the IEEE conference on computer
[173] G. Chogovadze, R. Pautrat, and M. Pollefeys, “Controllable vision and pattern recognition, 2017, pp. 3386–3394.
data augmentation through deep relighting,” arXiv preprint [194] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Doso-
arXiv:2110.13996, 2021. vitskiy, and T. Brox, “A large dataset to train convolutional
31
networks for disparity, optical flow, and scene flow estimation,” [215] M. Niemeyer and A. Geiger, “Giraffe: Representing scenes as
in Proceedings of the IEEE conference on computer vision and pattern compositional generative neural feature fields,” in Proceedings of
recognition, 2016, pp. 4040–4048. the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
[195] G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia, “Segstereo: Exploiting tion, 2021, pp. 11 453–11 464.
semantic information for disparity estimation,” in Proceedings of [216] Y. Liu, Y.-S. Wei, H. Yan, G.-B. Li, and L. Lin, “Causal reason-
the European conference on computer vision (ECCV), 2018, pp. 636– ing meets visual representation learning: A prospective study,”
651. Machine Intelligence Research, vol. 19, no. 6, pp. 485–511, 2022.
[196] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, [217] Z. Hao, A. Mallya, S. Belongie, and M.-Y. Liu, “Gancraft: Unsu-
“Flownet 2.0: Evolution of optical flow estimation with deep pervised 3d neural rendering of minecraft worlds,” in Proceedings
networks,” in Proceedings of the IEEE conference on computer vision of the IEEE/CVF International Conference on Computer Vision, 2021,
and pattern recognition, 2017, pp. 2462–2470. pp. 14 072–14 082.
[197] “From traditional rendering to differentiable rendering: Theories, [218] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image
methods and applications. scientia sinica informationis, vol.51, synthesis with spatially-adaptive normalization,” in Proceedings
no.7, pp.1043-1067, 2021.” 2017. of the IEEE/CVF conference on computer vision and pattern recogni-
[198] H. Kato, Y. Ushiku, and T. Harada, “Neural 3d mesh renderer,” tion, 2019, pp. 2337–2346.
in Proceedings of the IEEE conference on computer vision and pattern [219] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal
recognition, 2018, pp. 3907–3916. unsupervised image-to-image translation,” in Proceedings of the
[199] M. de La Gorce, N. Paragios, and D. J. Fleet, “Model-based hand European conference on computer vision (ECCV), 2018, pp. 172–189.
tracking with texture, shading and self-occlusions,” in 2008 IEEE [220] A. Mallya, T.-C. Wang, K. Sapra, and M.-Y. Liu, “World-consistent
Conference on Computer Vision and Pattern Recognition. IEEE, 2008, video-to-video synthesis,” in Computer Vision–ECCV 2020: 16th
pp. 1–8. European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
[200] J. Liu, C.-H. Wu, Y. Wang, Q. Xu, Y. Zhou, H. Huang, C. Wang, Part VIII 16. Springer, 2020, pp. 359–378.
S. Cai, Y. Ding, H. Fan et al., “Learning raw image denoising with [221] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Doso-
bayer pattern unification and bayer preserving augmentation,” vitskiy, and D. Duckworth, “Nerf in the wild: Neural radiance
in Proceedings of the IEEE/CVF Conference on Computer Vision and fields for unconstrained photo collections,” in Proceedings of the
Pattern Recognition Workshops, 2019, pp. 0–0. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[201] B. Mildenhall, P. Srinivasan, M. Tancik, J. Barron, and R. Ng, 2021, pp. 7210–7219.
“Representing scenes as neural radiance fields for view synthe- [222] Z. Zhang, S. Xie, M. Chen, and H. Zhu, “Handaugment: A
sis,” in Proc. of European Conference on Computer Vision, Virtual, simple data augmentation method for depth-based 3d hand pose
2020. estimation,” arXiv preprint arXiv:2001.00702, 2020.
[202] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall,
[223] G. Ning, G. Chen, C. Tan, S. Luo, L. Bo, and H. Huang, “Data
P. P. Srinivasan, J. T. Barron, and H. Kretzschmar, “Block-
augmentation for object detection via differentiable neural ren-
nerf: Scalable large scene neural view synthesis,” arXiv preprint
dering,” arXiv preprint arXiv:2103.02852, 2021.
arXiv:2202.05263, 2022.
[224] Q. Wu, Y. Li, Y. Sun, Y. Zhou, H. Wei, J. Yu, and Y. Zhang,
[203] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and
“An arbitrary scale super-resolution approach for 3-dimensional
M. Zollhofer, “Deepvoxels: Learning persistent 3d feature em-
magnetic resonance image using implicit neural representation,”
beddings,” in Proceedings of the IEEE/CVF Conference on Computer
arXiv preprint arXiv:2110.14476, 2021.
Vision and Pattern Recognition, 2019, pp. 2437–2446.
[225] Q. Wu, Y. Li, L. Xu, R. Feng, H. Wei, Q. Yang, B. Yu, X. Liu, J. Yu,
[204] L. Liu, J. Gu, K. Zaw Lin, T.-S. Chua, and C. Theobalt, “Neural
and Y. Zhang, “Irem: High-resolution magnetic resonance image
sparse voxel fields,” Advances in Neural Information Processing
reconstruction via implicit neural representation,” in International
Systems, vol. 33, pp. 15 651–15 663, 2020.
Conference on Medical Image Computing and Computer-Assisted In-
[205] E. Insafutdinov and A. Dosovitskiy, “Unsupervised learning of
tervention. Springer, 2021, pp. 65–74.
shape and pose with differentiable point clouds,” Advances in
neural information processing systems, vol. 31, 2018. [226] L. Shen, J. Pauly, and L. Xing, “Nerp: Implicit neural representa-
[206] S. Baek, K. I. Kim, and T.-K. Kim, “Pushing the envelope for rgb- tion learning with prior embedding for sparsely sampled image
based dense 3d hand pose estimation via neural rendering,” in reconstruction,” arXiv preprint arXiv:2108.10991, 2021.
Proceedings of the IEEE/CVF Conference on Computer Vision and [227] M. Tancik, B. Mildenhall, T. Wang, D. Schmidt, P. P. Srinivasan,
Pattern Recognition, 2019, pp. 1067–1076. J. T. Barron, and R. Ng, “Learned initializations for optimizing
[207] J. Thies, M. Zollhöfer, and M. Nießner, “Deferred neural render- coordinate-based neural representations,” in Proceedings of the
ing: Image synthesis using neural textures,” ACM Transactions on IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019. 2021, pp. 2846–2855.
[208] K.-A. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. Lem- [228] D. B. Lindell, J. N. Martel, and G. Wetzstein, “Autoint: Automatic
pitsky, “Neural point-based graphics,” in European Conference on integration for fast neural volume rendering,” in Proceedings of the
Computer Vision. Springer, 2020, pp. 696–712. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[209] M. Adamkiewicz, T. Chen, A. Caccavale, R. Gardner, P. Culbert- 2021, pp. 14 556–14 565.
son, J. Bohg, and M. Schwager, “Vision-only robot navigation in [229] K. Gupta, B. Colvert, and F. Contijoch, “Neural computed tomog-
a neural radiance world,” IEEE Robotics and Automation Letters, raphy,” arXiv preprint arXiv:2201.06574, 2022.
vol. 7, no. 2, pp. 4606–4613, 2022. [230] Y. Sun, J. Liu, M. Xie, B. Wohlberg, and U. S. Kamilov, “Coil:
[210] Z. Kuang, K. Olszewski, M. Chai, Z. Huang, P. Achlioptas, and Coordinate-based internal learning for imaging inverse prob-
S. Tulyakov, “Neroic: Neural rendering of objects from online lems,” arXiv preprint arXiv:2102.05181, 2021.
image collections,” arXiv preprint arXiv:2201.02533, 2022. [231] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, “Depth-supervised
[211] A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural nerf: Fewer views and faster training for free,” in Proceedings of the
radiance fields from one or few images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 882–12 891.
2021, pp. 4578–4587. [232] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Bar-
[212] S. Duggal, Z. Wang, W.-C. Ma, S. Manivasagam, J. Liang, S. Wang, ron, R. Martin-Brualla, N. Snavely, and T. Funkhouser, “Ibrnet:
and R. Urtasun, “Mending neural implicit modeling for 3d vehi- Learning multi-view image-based rendering,” in Proceedings of the
cle reconstruction in the wild,” in Proceedings of the IEEE/CVF IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Winter Conference on Applications of Computer Vision, 2022, pp. 2021, pp. 4690–4699.
1900–1909. [233] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, “Synsin: End-
[213] A. R. Kosiorek, H. Strathmann, D. Zoran, P. Moreno, R. Schneider, to-end view synthesis from a single image,” in Proceedings of
S. Mokrá, and D. J. Rezende, “Nerf-vae: A geometry aware 3d the IEEE/CVF conference on computer vision and pattern recognition,
scene generative model,” in International Conference on Machine 2020, pp. 7467–7477.
Learning. PMLR, 2021, pp. 5742–5752. [234] A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi,
[214] S. Yao, R. Zhong, Y. Yan, G. Zhai, and X. Yang, “Dfa-nerf: Person- K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner
alized talking head generation via disentangled face attributes et al., “State of the art on neural rendering,” in Computer Graphics
neural rendering,” arXiv preprint arXiv:2201.00791, 2022. Forum, vol. 39, no. 2. Wiley Online Library, 2020, pp. 701–727.
32
[235] C. Liu, X.-F. Chen, C.-J. Bo, and D. Wang, “Long-term visual [256] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and
tracking: review and experimental comparison,” Machine Intel- W. Liu, “Real-time neural style transfer for videos,” in Proceedings
ligence Research, vol. 19, no. 6, pp. 512–530, 2022. of the IEEE Conference on Computer Vision and Pattern Recognition,
[236] J. Xu, R. Zhang, J. Dou, Y. Zhu, J. Sun, and S. Pu, “Rpvnet: A deep 2017, pp. 783–791.
and efficient range-point-voxel fusion network for lidar point [257] M. Ruder, A. Dosovitskiy, and T. Brox, “Artistic style transfer for
cloud segmentation,” in Proceedings of the IEEE/CVF International videos and spherical images,” International Journal of Computer
Conference on Computer Vision, 2021, pp. 16 024–16 033. Vision, vol. 126, no. 11, pp. 1199–1219, 2018.
[237] J. Choe, B. Joung, F. Rameau, J. Park, and I. S. Kweon, “Deep [258] ——, “Artistic style transfer for videos,” in German conference on
point cloud reconstruction,” arXiv preprint arXiv:2111.11704, 2021. pattern recognition. Springer, 2016, pp. 26–36.
[238] P. Erler, P. Guerrero, S. Ohrhallinger, N. J. Mitra, and M. Wimmer, [259] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stereoscopic neural
“Points2surf learning implicit surfaces from point clouds,” in style transfer,” in Proceedings of the IEEE Conference on Computer
European Conference on Computer Vision. Springer, 2020, pp. 108– Vision and Pattern Recognition, 2018, pp. 6654–6663.
124. [260] C. Do, “3d image augmentation using neural style transfer and
[239] T. Hashimoto and M. Saito, “Normal estimation for accurate generative adversarial networks,” in Applications of Digital Image
3d mesh reconstruction with point cloud model incorporating Processing XLIII, vol. 11510. SPIE, 2020, pp. 707–718.
spatial structure.” in CVPR workshops, vol. 1, 2019. [261] X. Zheng, T. Chalasani, K. Ghosal, S. Lutz, and A. Smolic,
[240] A. Reed, T. Blanford, D. C. Brown, and S. Jayasuriya, “Implicit “Stada: Style transfer as data augmentation,” arXiv preprint
neural representations for deconvolving sas images,” in OCEANS arXiv:1909.01056, 2019.
2021: San Diego–Porto. IEEE, 2021, pp. 1–7. [262] I. Darma, N. Suciati, and D. Siahaan, “Neural style transfer and
[241] ——, “Sinr: Deconvolving circular sas images using implicit geometric transformations for data augmentation on balinese
neural representations,” IEEE Journal of Selected Topics in Signal carving recognition using mobilenet,” International Journal of In-
Processing, 2022. telligent Engineering and Systems, vol. 13, no. 6, pp. 349–363, 2020.
[242] F. Vasconcelos, B. He, N. Singh, and Y. W. Teh, “Uncertainr: Un- [263] B. Georgievski, “Image augmentation with neural style transfer,”
certainty quantification of end-to-end implicit neural representa- in International Conference on ICT Innovations. Springer, 2019, pp.
tions for computed tomography,” arXiv preprint arXiv:2202.10847, 212–224.
2022. [264] P. A. Cicalese, A. Mobiny, P. Yuan, J. Becker, C. Mohan, and
[243] L. Shen, J. Pauly, and L. Xing, “Nerp: implicit neural repre- H. V. Nguyen, “Stypath: Style-transfer data augmentation for
sentation learning with prior embedding for sparsely sampled robust histology image classification,” in International Conference
image reconstruction,” IEEE Transactions on Neural Networks and on Medical Image Computing and Computer-Assisted Intervention.
Learning Systems, 2022. Springer, 2020, pp. 351–361.
[244] R. Liu, Y. Sun, J. Zhu, L. Tian, and U. S. Kamilov, “Recovery of [265] Y. Xu and A. Goel, “Cross-domain image classification
continuous 3d refractive index maps from discrete intensity-only through neural-style transfer data augmentation,” arXiv preprint
measurements using neural fields,” Nature Machine Intelligence, arXiv:1910.05611, 2019.
vol. 4, no. 9, pp. 781–791, 2022. [266] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann,
[245] C. Gan, Y. Gu, S. Zhou, J. Schwartz, S. Alter, J. Traer, D. Gutfre- and W. Brendel, “Imagenet-trained cnns are biased towards tex-
und, J. B. Tenenbaum, J. H. McDermott, and A. Torralba, “Finding ture; increasing shape bias improves accuracy and robustness,”
fallen objects via asynchronous audio-visual integration,” in Pro- arXiv preprint arXiv:1811.12231, 2018.
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern [267] S. Cygert and A. Czyżewski, “Toward robust pedestrian detec-
Recognition, 2022, pp. 10 523–10 533. tion with data augmentation,” IEEE Access, vol. 8, pp. 136 674–
[246] R. Gao, Y.-Y. Chang, S. Mall, L. Fei-Fei, and J. Wu, “Objectfolder: 136 683, 2020.
A dataset of objects with implicit visual, auditory, and tactile [268] A. Mikołajczyk and M. Grochowski, “Style transfer-based image
representations,” arXiv preprint arXiv:2109.07991, 2021. synthesis as an efficient regularization technique in deep learn-
[247] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, ing,” in 2019 24th International Conference on Methods and Models
“Implicit neural representations with periodic activation func- in Automation and Robotics (MMAR). IEEE, 2019, pp. 42–47.
tions,” Advances in neural information processing systems, vol. 33, [269] P. T. Jackson, A. A. Abarghouei, S. Bonner, T. P. Breckon, and
pp. 7462–7473, 2020. B. Obara, “Style augmentation: data augmentation via style ran-
[248] T. Chen, P. Wang, Z. Fan, and Z. Wang, “Aug-nerf: Train- domization.” in CVPR Workshops, vol. 6, 2019, pp. 10–11.
ing stronger neural radiance fields with triple-level physically- [270] Y. Yi, “Microsoft extends airsim to include autonomous car
grounded augmentations,” in Proceedings of the IEEE/CVF Confer- research,” 2020.
ence on Computer Vision and Pattern Recognition, 2022, pp. 15 191– [271] X. Huang and S. Belongie, “Arbitrary style transfer in real-time
15 202. with adaptive instance normalization,” in Proceedings of the IEEE
[249] J. Zhang, Y. Zhang, H. Fu, X. Zhou, B. Cai, J. Huang, R. Jia, international conference on computer vision, 2017, pp. 1501–1510.
B. Zhao, and X. Tang, “Ray priors through reprojection: Im- [272] S. S. Kim, N. Kolkin, J. Salavon, and G. Shakhnarovich, “De-
proving neural radiance fields for novel view extrapolation,” formable style transfer,” in Computer Vision–ECCV 2020: 16th
in Proceedings of the IEEE/CVF Conference on Computer Vision and European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Pattern Recognition, 2022, pp. 18 376–18 386. Part XXVI 16. Springer, 2020, pp. 246–261.
[250] S. Kulkarni, P. Yin, and S. Scherer, “360fusionnerf: Panoramic [273] X.-C. Liu, Y.-L. Yang, and P. Hall, “Learning to warp for style
neural radiance fields with joint guidance,” in 2023 IEEE/RSJ transfer,” in Proceedings of the IEEE/CVF Conference on Computer
International Conference on Intelligent Robots and Systems (IROS). Vision and Pattern Recognition, 2021, pp. 3702–3711.
IEEE, 2023, pp. 7202–7209. [274] S. Li, X. Xu, L. Nie, and T.-S. Chua, “Laplacian-steered neural
[251] Y. Jiang, S. Jiang, G. Sun, Z. Su, K. Guo, M. Wu, J. Yu, and L. Xu, style transfer,” in Proceedings of the 25th ACM international confer-
“Neuralhofusion: Neural volumetric rendering under human- ence on Multimedia, 2017, pp. 1716–1724.
object interactions,” in Proceedings of the IEEE/CVF Conference on [275] F. Luan, S. Paris, E. Shechtman, and K. Bala, “Deep photo style
Computer Vision and Pattern Recognition, 2022, pp. 6155–6165. transfer,” in Proceedings of the IEEE conference on computer vision
[252] A. Mumuni and F. Mumuni, “Cnn architectures for geometric and pattern recognition, 2017, pp. 4990–4998.
transformation-invariant feature representation in computer vi- [276] R. R. Yang, “Multi-stage optimization for photorealistic neural
sion: a review,” SN Computer Science, vol. 2, no. 5, pp. 1–23, 2021. style transfer,” in Proceedings of the IEEE/CVF Conference on Com-
[253] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer puter Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
using convolutional neural networks,” in Proceedings of the IEEE [277] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed-form
conference on computer vision and pattern recognition, 2016, pp. solution to photorealistic image stylization,” in Proceedings of the
2414–2423. European Conference on Computer Vision (ECCV), 2018, pp. 453–468.
[254] ——, “A neural algorithm of artistic style,” arXiv preprint [278] B. Kim, V. C. Azevedo, M. Gross, and B. Solenthaler, “Transport-
arXiv:1508.06576, 2015. based neural style transfer for smoke simulations,” arXiv preprint
[255] K. Simonyan and A. Zisserman, “Very deep convolutional arXiv:1905.07442, 2019.
networks for large-scale image recognition,” arXiv preprint [279] ——, “Lagrangian neural style transfer for fluids,” ACM Transac-
arXiv:1409.1556, 2014. tions on Graphics (TOG), vol. 39, no. 4, pp. 52–1, 2020.
33
[280] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stylebank: European Conference on Computer Vision. Springer, 2022, pp. 111–
An explicit representation for neural image style transfer,” in 128.
Proceedings of the IEEE conference on computer vision and pattern [303] J. Tremblay, M. Meshry, A. Evans, J. Kautz, A. Keller, S. Khamis,
recognition, 2017, pp. 1897–1906. C. Loop, N. Morrical, K. Nagano, T. Takikawa et al., “Rtmv: A ray-
[281] Z. Wang, L. Zhao, H. Chen, L. Qiu, Q. Mo, S. Lin, W. Xing, and traced multi-view synthetic dataset for novel view synthesis,”
D. Lu, “Diversified arbitrary style transfer via deep feature per- arXiv preprint arXiv:2205.07058, 2022.
turbation,” in Proceedings of the IEEE/CVF Conference on Computer [304] A. Ahmadyan, L. Zhang, A. Ablavatski, J. Wei, and M. Grund-
Vision and Pattern Recognition, 2020, pp. 7789–7798. mann, “Objectron: A large scale dataset of object-centric videos
[282] C. Castillo, S. De, X. Han, B. Singh, A. K. Yadav, and T. Goldstein, in the wild with pose annotations,” in Proceedings of the IEEE/CVF
“Son of zorn’s lemma: Targeted style transfer using instance- conference on computer vision and pattern recognition, 2021, pp.
aware semantic segmentation,” in 2017 IEEE International Con- 7822–7831.
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, [305] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and
2017, pp. 1348–1352. M. Pollefeys, “Semantic3d. net: A new large-scale point cloud
[283] Z. Chen, W. Wang, E. Xie, T. Lu, and P. Luo, “Towards ultra- classification benchmark,” arXiv preprint arXiv:1704.03847, 2017.
resolution neural style transfer via thumbnail instance normaliza- [306] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B.
tion,” in Proceedings of the AAAI Conference on Artificial Intelligence, Tenenbaum, and W. T. Freeman, “Pix3d: Dataset and methods
vol. 36, no. 1, 2022, pp. 393–400. for single-image 3d shape modeling,” in Proceedings of the IEEE
[284] Y. Chen, Y.-K. Lai, and Y.-J. Liu, “Cartoongan: Generative ad- conference on computer vision and pattern recognition, 2018, pp.
versarial networks for photo cartoonization,” in Proceedings of the 2974–2983.
IEEE conference on computer vision and pattern recognition, 2018, pp. [307] Z. J. Chong, B. Qin, T. Bandyopadhyay, M. H. Ang, E. Frazzoli,
9465–9474. and D. Rus, “Synthetic 2d lidar for precise vehicle localization in
[285] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, 3d urban environment,” in 2013 IEEE International Conference on
“Diversified texture synthesis with feed-forward networks,” in Robotics and Automation. IEEE, 2013, pp. 1554–1559.
Proceedings of the IEEE Conference on Computer Vision and Pattern [308] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison,
Recognition, 2017, pp. 3920–3928. “Scenenet rgb-d: Can 5m synthetic images beat generic imagenet
[286] Z. Xu, T. Wang, F. Fang, Y. Sheng, and G. Zhang, “Stylization- pre-training on indoor segmentation?” in Proceedings of the IEEE
based architecture for fast deep exemplar colorization,” in Pro- International Conference on Computer Vision, 2017, pp. 2678–2687.
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern [309] X. Wang, K. Wang, and S. Lian, “A survey on face data augmen-
Recognition, 2020, pp. 9363–9372. tation for the training of deep neural networks,” Neural computing
[287] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture and applications, vol. 32, no. 19, pp. 15 503–15 531, 2020.
networks: Maximizing quality and diversity in feed-forward styl- [310] P. S. Rajpura, H. Bojinov, and R. S. Hegde, “Object detection
ization and texture synthesis,” in Proceedings of the IEEE conference using deep cnns trained on synthetic images,” arXiv preprint
on computer vision and pattern recognition, 2017, pp. 6924–6932. arXiv:1706.06782, 2017.
[288] S. Gu, C. Chen, J. Liao, and L. Yuan, “Arbitrary style transfer [311] Z. Zhang, L. Yang, and Y. Zheng, “Multimodal medical vol-
with deep feature reshuffle,” in Proceedings of the IEEE Conference umes translation and segmentation with generative adversarial
on Computer Vision and Pattern Recognition, 2018, pp. 8222–8231. network,” Handbook of Medical Image Computing and Computer
[289] V. Dumoulin, J. Shlens, and M. Kudlur, “A learned representation Assisted Intervention, pp. 183–204, 2020.
for artistic style,” arXiv preprint arXiv:1610.07629, 2016. [312] H. Turki, D. Ramanan, and M. Satyanarayanan, “Mega-nerf: Scal-
[290] E. Risser, P. Wilmot, and C. Barnes, “Stable and controllable neu- able construction of large-scale nerfs for virtual fly-throughs,”
ral texture synthesis and style transfer using histogram losses,” in Proceedings of the IEEE/CVF Conference on Computer Vision and
arXiv preprint arXiv:1701.08893, 2017. Pattern Recognition, 2022, pp. 12 922–12 931.
[291] Y. Li, N. Wang, J. Liu, and X. Hou, “Demystifying neural style
transfer,” arXiv preprint arXiv:1701.01036, 2017.
[292] J. Yoo, Y. Uh, S. Chun, B. Kang, and J.-W. Ha, “Photorealistic style
transfer via wavelet transforms,” in Proceedings of the IEEE/CVF
International Conference on Computer Vision, 2019, pp. 9036–9045.
[293] T. Yong-Jian and Z. Fan-Ju, “Neural style transfer algorithm
based on laplacian operator and color preservation,” Journal of
Computer Applications, p. 0, 2022.
[294] S. Meyer, V. Cornillère, A. Djelouah, C. Schroers, and M. Gross,
“Deep video color propagation,” arXiv preprint arXiv:1808.03232,
2018.
[295] J. Fišer, O. Jamriška, M. Lukáč, E. Shechtman, P. Asente, J. Lu, and
D. Sỳkora, “Stylit: illumination-guided example-based stylization
of 3d renderings,” ACM Transactions on Graphics (TOG), vol. 35,
no. 4, pp. 1–11, 2016.
[296] C. Rodriguez-Pardo and E. Garces, “Neural photometry-guided
visual attribute transfer,” arXiv preprint arXiv:2112.02520, 2021.
[297] L. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using
convolutional neural networks,” Advances in neural information
processing systems, vol. 28, 2015.
[298] E. Heitz, K. Vanhoey, T. Chambon, and L. Belcour, “A sliced
wasserstein loss for neural texture synthesis,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2021, pp. 9412–9420.
[299] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shecht-
man, “Controlling perceptual factors in neural style transfer,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017, pp. 3985–3993.
[300] S. d’Angelo, F. Precioso, and F. Gandon, “Revisiting artistic style
transfer for data augmentation in a real-case scenario,” in 2022
IEEE International Conference on Image Processing (ICIP). IEEE,
2022, pp. 4178–4182.
[301] X.-C. Liu, X.-Y. Li, M.-M. Cheng, and P. Hall, “Geometric style
transfer,” arXiv preprint arXiv:2007.05471, 2020.
[302] Y. Jing, Y. Mao, Y. Yang, Y. Zhan, M. Song, X. Wang, and D. Tao,
“Learning graph neural networks for image style transfer,” in