International Journal of Multimedia Information Retrieval (2021) 10:1–24

https://doi.org/10.1007/s13735-020-00196-w

TRENDS AND SURVEYS

Generative adversarial networks: a survey on applications and challenges

M. R. Pavan Kumar1 · Prabhu Jayagopal2

1 School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
2 School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India

Corresponding author: Prabhu Jayagopal, [email protected]; M. R. Pavan Kumar, [email protected]

Received: 4 August 2020 / Revised: 21 September 2020 / Accepted: 2 October 2020 / Published online: 24 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract
Deep neural networks have attained great success in handling high dimensional data, especially images. However, generating naturalistic images containing a vast range of subjects for different tasks like image classification, segmentation, object detection, reconstruction, etc., continues to be a difficult task. Generative modelling has the potential to learn any kind of data distribution in an unsupervised manner. Variational autoencoders (VAEs), autoregressive models, and generative adversarial networks (GANs) are the popular generative modelling approaches for learning data distributions. Among these, GANs have gained much attention from the research community in recent years for generating quality images and for data augmentation. In this context, we collected research articles that employed GANs for solving various tasks from popular databases and summarized them based on their application. The main objective of this article is to present the nuts and bolts of GANs, state-of-the-art related work and its applications, evaluation metrics, challenges involved in training GANs, and benchmark datasets that would benefit new and enthusiastic researchers who are interested in working on GANs.

Keywords Generative model · Convolutional neural network · Segmentation · Object detection · Generative adversarial network

1 Introduction

Generating quality images is a challenging task in the fields of computer vision and artificial intelligence, with numerous applications and wide research scope. Supervised machine learning and deep learning models require large and labelled datasets to generalize the decision-making process. However, the availability of large and labelled databases is questionable in many domains like medical diagnosis, fault detection, intrusion detection, etc. Hence, the research community depends heavily on unsupervised learning. In unsupervised learning, the model strives to learn the structure of the data and extract its useful features. Generative modelling is a subfield of unsupervised learning that works towards the goal of learning the structure of the data and generating data similar to it. Generative models can be trained on high dimensional probability distributions. They can also be used in reinforcement learning, semi-supervised learning, etc. In general, generative models work on one of three principles: inference approximation, maximum likelihood, and Markov chains. Latent Dirichlet allocation [7], restricted Boltzmann machines [36], deep belief networks (DBNs) [37], etc., are other generative models extensively used in the literature to generate naturalistic data. These models operate on the principle of maximum likelihood. However, they do not fit data distributions completely.

Goodfellow et al. [23] introduced GANs, an unsupervised generative model that works on the principle of maximum likelihood and uses adversarial training. Right from the inception of generative adversarial networks (GANs), they have been among the most discussed and most researched topics, not only in computer science but also in other domains. GANs have gained much popularity for generating high-quality realistic data. Thus, GANs have also attracted researchers as a data augmentation tool for imbalanced data applications. Generative models, especially GANs, can be deployed in many machine learning tasks where multiple correct answers can be inferred from a single input. GANs are also used to attribute more information to the context than it originally has, and to create models that help researchers generate artificial naturalistic images. Since their introduction, they have been used in diverse domains and applications. GANs have been leveraged in medical imaging for disease diagnosis, semantic segmentation, image captioning, image attacking to change a classifier's decision, image deblurring, image dehazing, image synthesis, face frontalization, generating high-resolution images from low-resolution images (also known as super-resolution, SR), text-to-image or scene generation, steganography, object detection, speech recognition, fault diagnosis, industrial risk analysis, and natural language processing applications like text generation, text summarization, style transfer, etc. However, one should note that a GAN is not just an image generation tool: it retrieves useful information from the training data so that object detection, segmentation, and classification tasks can be performed in various domains. The training data can be any multimedia content, for example, image, text, audio, video, and animation.

Fig. 1 The architecture of GAN

The GAN consists of a generator network (G) and a discriminator network (D), as shown in Fig. 1. The generator consumes a noise vector (z) as input and generates a data distribution similar to the real data distribution. The discriminator discriminates between real and artificial data as a binary classifier. During training, the generator loss and the discriminator loss are computed to obtain the overall loss (V). The GAN loss is computed using the formula shown in Eq. 1, and the motive of training is to minimize the generator loss and maximize the discriminator loss. Here $x \sim p_{data}(x)$ and $z \sim p_z(z)$ denote that x is an instance from the real distribution and z is an instance from the prior distribution.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
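Operationally, Eq. 1 is optimized by alternating two gradient steps: update D to separate real from generated samples, then update G to fool D. The sketch below is purely illustrative (the MLP sizes, optimizer settings, and 784-dimensional data are our own assumptions, not details from the survey):

```python
import torch
import torch.nn as nn

# Minimal GAN sketch: G maps noise z to fake samples, D scores real vs. fake.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                       # real: (batch, 784) tensor
    batch = real.size(0)
    fake = G(torch.randn(batch, 64))

    # Discriminator step: ascend log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: the non-saturating form maximizes log D(G(z))
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

Note that the generator step above uses the common non-saturating variant (maximize log D(G(z))) rather than directly minimizing log(1 − D(G(z))), a standard remedy for vanishing generator gradients early in training.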
The motive of this paper is to give a brief introduction to GANs, the nuts and bolts of GANs, various derived GANs, and applications of GANs pertaining to different tasks in multiple domains. To this end, we collected 140 research papers to give a detailed summary of generative adversarial networks, especially in terms of applications. Figure 2 shows the graph of the number of publications considered year wise. First, the state-of-the-art works relating to different tasks are discussed in Sect. 2. The collected papers were segregated based on their objective and classified into multiple applications, as presented in Sect. 3. Section 4 discusses various evaluation metrics used in the selected papers to evaluate the GAN models. We discuss multiple challenges involved in training a GAN in Sect. 5. Finally, the paper concludes in Sect. 6.

Fig. 2 Year wise number of applications

2 The beginning

With the advent of GANs, which work in a generative and adversarial manner, researchers can synthesize novel and quality images. However, the synthesis of images did not start with GANs. There are a few research works that synthesize images using convolutional neural networks (CNNs). Abdalmageed et al. [1] discussed a CNN-based pipelined architecture for the face recognition task by detecting and correcting multiple poses using deep face representation methods. Masi et al. [69] synthesized more facial images using a deep convolutional model to make the dataset better. The existing facial images are manipulated in three variations: pose, shape, and expression. The pose and shape are simulated across three dimensions with a closed mouth expression. But these are subject to limitations in terms of the quality and diversity of images, thereby indirectly deteriorating the generalization performance.

Once the principle of GAN was formulated and achieved success in modelling probability distributions, plenty of other GAN models were derived for diverse applications and produced promising results. Mirza and Osindero [71] introduced the class label as an additional input to both generator and discriminator to model a conditioned variant of GAN (CGAN). Since the class label is given as another input, the CGAN is capable of generating images specific to the class label. The architecture of CGAN is shown in Fig. 3, and the loss function is shown in Eq. 2.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|c)))] \tag{2}$$

Fig. 3 The architecture of conditional GAN

Radford et al. [80] devised a GAN model named DCGAN for learning unsupervised representations. Contrary to the basic GAN, DCGAN has convolutional layers to upscale the input vector z of the generator module. It also employed regular convolutional layers to classify generated and real images.
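Eq. 2 conditions both networks on the class label c. One common way to realize this, shown here as an illustrative sketch rather than the exact architecture of [71], is to embed the label and concatenate it with the generator's noise vector and with the discriminator's input:

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Generator conditioned on a class label, as in Eq. 2 (illustrative)."""
    def __init__(self, z_dim=64, n_classes=10, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 128), nn.ReLU(),
            nn.Linear(128, out_dim), nn.Tanh())

    def forward(self, z, labels):
        # Concatenate noise with the label embedding: G(z | c)
        return self.net(torch.cat([z, self.embed(labels)], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self, in_dim=784, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(in_dim + n_classes, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x, labels):
        # Score D(x | c): the discriminator sees the same label as the generator
        return self.net(torch.cat([x, self.embed(labels)], dim=1))
```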
Wu et al. [107] modelled the generation of 3D objects by mapping a latent space to the object space using a 3D GAN. They extended the 3D GAN with a variational autoencoder (3D-VAEGAN) that maps a 2D vector space to a 3D vector space using the VAE module, which is then mapped to the 3D object space by the GAN model. Liu and Tuzel [66] computed the joint probability distribution of two domains using two GANs. Each GAN learns the distribution of one domain during training and also shares the high-level weights, which allows the coupled GAN to compute the joint distribution. By deploying multiple generators, Kwak and Zhang [53] built a composite generative system (also abbreviated CGAN in their work). First, every generator separately produces a complex part of the image; these components are then combined by a blending process to generate a new image. Im et al. [41] demonstrated a novel image generation method based on a recursive adversarial model (GRAN). GRAN incrementally generates high fidelity visual samples. A novel crossover evaluation scheme between the generator and discriminator networks is also introduced. Zhu et al. [138] introduced an immersive generative model that allows users to control the visual content naturally and realistically. Perarnau et al. [77] performed image editing operations like expression changing, hair colour changing, gender changing, etc., using an invertible conditional GAN (IcGAN). An encoder is placed to compress the input vector into a latent and a conditional vector. Yoo et al. [118] discussed a framework that transfers one domain to another semantically at the pixel level. The framework contains an encoder-decoder based generator and two discriminators. The two discriminators are trained to learn the semantic relations between the domains. Brock et al. [9] introduced a neural photo editor that effects the semantic changes requested by the user with ease. The intuitive idea behind the neural photo editor is that it back-propagates the requested changes to compute the change in latent parameters. Given two unlabelled and unrelated domains P and Q, and a function f, a generative function G can be learned such that a sample from P can be mapped to Q, i.e. G: P → Q with f(x) ∼ f(G(x)) [93].

Arjovsky et al. [5] introduced a novel training model (WGAN) based on the Wasserstein distance to avoid the mode collapse problem that occurs in the training of traditional GANs. Antipov et al. [3] introduced Age-cGAN, which generates high-quality synthetic images in which an aged face is synthesized while preserving the person's identity. Kim et al. [50] learned the relations between two different domains using DiscoGAN, which consists of two GANs coupled together. Given an image in one domain, DiscoGAN can generate the corresponding image in the other domain. Li et al. [60] developed a GAN for detecting smaller image entities by reducing the representation margin of smaller objects to larger ones. In [43], the authors employed a conditional GAN model to translate an image to an analogous image, for example, converting a day photo to a night photo, converting an aerial image to a map design, etc. Karras et al. [46] presented progressively growing GANs that generate quality, high-resolution images. The idea behind the progressively growing GAN is that it extends the training process of a normal GAN by adding new layers. A super-resolution GAN (SRGAN) [55] takes a low-resolution image as input, increases the spatial resolution of the image by an upscaling factor, and produces a high-resolution image as output. The applications range from satellite imaging, medical imaging, and media content to face recognition in surveillance systems.
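An SRGAN-style generator enlarges spatial resolution by a fixed upscaling factor. A minimal sketch of the sub-pixel (PixelShuffle) upsampling idea follows; the layer sizes here are our assumptions, not the published SRGAN configuration:

```python
import torch.nn as nn

class UpscaleGenerator(nn.Module):
    """Toy 4x super-resolution generator using sub-pixel (PixelShuffle) layers."""
    def __init__(self, channels=3, features=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, features, 9, padding=4), nn.PReLU(),
            # Each conv + PixelShuffle block doubles height and width.
            nn.Conv2d(features, features * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU(),
            nn.Conv2d(features, features * 4, 3, padding=1), nn.PixelShuffle(2), nn.PReLU(),
            nn.Conv2d(features, channels, 9, padding=4), nn.Tanh())

    def forward(self, lr_image):          # (N, 3, H, W) -> (N, 3, 4H, 4W)
        return self.net(lr_image)
```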
StackGAN [126] automates synthesizing realistic images from human-written descriptions. It works in two stages. The Stage-1 GAN generates low-resolution images with the initial shape and basic colours of objects. The Stage-2 GAN takes the low-resolution images and text descriptions as inputs, corrects the errors, completes the details, and generates photo-realistic images with four times better resolution. AnoGAN [87] is an unsupervised generative model that can detect diseases from medical image data at early stages. Zhu et al. [139] explored a generative model called CycleGAN that translates images from one domain to another, for example, taking an image and creating an image that looks like a painting of the first picture, converting a black-and-white picture to a colour image, etc.
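The heart of CycleGAN-style translation is a cycle-consistency term forcing the two mappings G: X→Y and F: Y→X to invert each other. A hedged sketch of the combined generator objective (least-squares adversarial terms; the λ = 10 weight is a common default, not a value quoted in this survey):

```python
import torch.nn.functional as F_loss

def cycle_gan_generator_loss(G, F, D_X, D_Y, real_x, real_y, lam=10.0):
    """Adversarial terms push translations toward each target domain; the L1
    cycle terms require x -> G(x) -> F(G(x)) to return to the original image."""
    fake_y, fake_x = G(real_x), F(real_y)
    adv = ((D_Y(fake_y) - 1) ** 2).mean() + ((D_X(fake_x) - 1) ** 2).mean()
    cycle = F_loss.l1_loss(F(fake_y), real_x) + F_loss.l1_loss(G(fake_x), real_y)
    return adv + lam * cycle
```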
Yang et al. [113] integrated the conventional acoustic loss function with the discriminator loss function to model a multi-tasking framework for text-to-speech synthesis. Choi et al. [15] developed StarGAN, which can translate an image among multiple domains with superior quality. Grover et al. [24] developed Flow-GAN, which uses exact likelihood estimation for training and achieved significant improvements in log-likelihood scores. Wang et al. [102] introduced a Residual-in-Residual Dense Block to the SRGAN [55] to model the enhanced SRGAN (ESRGAN). The model achieved high-quality images with more natural and realistic textures. Kupyn et al. [52] developed a conditional generative model called DeblurGAN that deblurs a blurred image and detects the objects blurred due to motion. They used a content loss function to optimize the conditional GAN. A conditional generative adversarial framework was designed for synthesizing 2048 × 1024 high-resolution photo-naturalistic images using semantic label maps [101]. Xu et al. [110] proposed AttnGAN, which consists of attention models to generate quality images from text descriptions. The authors incorporated an attention module into the generator network in which each attention model creates sub-regions in the image based on the features extracted from the text. Zhang et al. [128] modelled a self-attention GAN (SAGAN) that generates details using attention-driven and long-term dependency modelling. The authors also applied spectral normalization to enhance the training dynamics and achieved significant results. Therefore, researchers have derived plenty of GAN variants like CGAN, WGAN, ProgressiveGAN, image-to-image translation GAN, CycleGAN, SRGAN, text-to-image GAN, face inpainting GAN, text-to-speech GAN, etc., for various applications. The evolution of a few popular GANs is illustrated in Fig. 4 with the help of a timeline diagram. In the next section, we discuss the applications specific to these variants that were modelled recently in detail.

3 Applications

In this section, we discuss diverse applications of GANs like medical diagnosis, text generation, hyperspectral image classification, etc., in detail.

3.1 Clinical diagnosis

MRI (magnetic resonance imaging), CT (computed tomography) scans, PET (positron emission tomography), ultrasound imaging (USI), electrocardiogram (ECG), and X-rays are the widely used imaging techniques for clinical diagnosis and for identifying the severity of a disease in the medical domain. MRI images the water molecules in the body with the help of a very strong magnetic field, taking pictures of the soft tissue of the organs and the bones. CT scanners use a pencil-thick beam to take cross-sectional images of the patient's body; the beam rotates around the patient. The CT scan slices the patient's body like a loaf of bread and uses radiation to take the images. A PET scanner captures images of minuscule changes in the body's metabolism caused by the growth of abnormal cells. The PET scan can be used in combination with a CT scan, which allows physicians to identify the exact location, size, and shape of the diseased tissue or tumour.

In [21], a cycleGAN-based unified framework is discussed to standardize the intensity distribution of MRI images with different parameters coming from multiple groups. The framework consists of two kinds of paths: one forward path using one GAN and multiple backward paths using multiple GANs. The authors also employed two skip connections to keep the features safe and to avoid any loss of resolution. The effectiveness of the proposed method is investigated on T2-FLAIR image datasets. Qi et al. [78] developed a model using cascaded conditional GANs (C-cGANs) for automatic bi-ventricle segmentation of magnetic resonance images of the heart. The authors divided the segmentation task into two subtasks and used a specific C-cGAN for each. Both C-cGANs use an encoder module, an MSAF module, and a decoder module. The first C-cGAN identifies the region of interest using a binary segmentation task; the second C-cGAN implements the bi-ventricle segmentation task.
Fig. 4 Timeline for a few notable GAN models

Analyzing ECG signals enables diagnosing cardiovascular diseases (CVDs) or heart-related diseases in advance and helps to prevent them. Detecting abnormalities in ECG signals is a class imbalance problem due to the imbalanced distribution among multiple classes. Wang et al. [99] proposed a framework in which a classification model is incorporated within a GAN model. The generator and discriminator framework is inspired by the ACGAN model [72] to support data augmentation. The classification model is implemented using a residual block and a long short-term memory (LSTM) model. The proposed framework is tested on the MIT-BIH standard database for single-beat detection and on a competition database for successive-beat detection. In [117], the authors addressed two issues while dealing with clinical data for cardiac disease diagnosis using ECG signals: first, extracting the global features, and second, increasing the stability of training to obtain high-quality diverse samples. The authors developed a sequential GAN (RPSeqGAN) in which the generator consists of bidirectional gated recurrent units (GRUs), and the discriminator is implemented as a ResNet [32]-based ResNet-Pooling block (RPblock) that extracts the global features. The authors also employed policy gradient and Monte Carlo search algorithms to gain stability in training. The proposed algorithm achieved quality images and maximum stability on the MIT-BIH arrhythmia dataset. In [88], the authors used a GAN model as a data augmentation tool to generate synthetic data to tackle the imbalanced classification of multi-class ECG data. The MIT-BIH arrhythmia data has 15 ECG classes that can be divided into five categories (N, S, V, F, Q). The authors proposed two deep learning models for the classification task on the augmented data. First, a CNN-based end-to-end approach is used to classify a heartbeat as one among the 15 classes. Second, another CNN-based hierarchical approach has two stages: in the first stage, the model identifies one of the five categories; in the second stage, one of the classes under the category identified in the first stage is assigned.
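The augmentation recipe used in these ECG studies reduces to: synthesize beats for the under-represented classes and top up the training set until the classes are balanced. A schematic sketch, with the generator interface assumed (this is not code from [88] or [99]):

```python
import torch

def balance_with_gan(x, y, generator, target_per_class):
    """Oversample rare classes with GAN samples (schematic).
    x: (N, L) signals, y: (N,) integer labels.
    generator(label, n) is an assumed callable returning n synthetic signals."""
    xs, ys = [x], [y]
    for label in y.unique():
        deficit = target_per_class - int((y == label).sum())
        if deficit > 0:
            synth = generator(int(label), deficit)       # (deficit, L)
            xs.append(synth)
            ys.append(torch.full((deficit,), int(label), dtype=y.dtype))
    return torch.cat(xs), torch.cat(ys)
```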

Two issues are addressed in [114] while extracting vein features from low-contrast infrared images of fingers. First, the existing CNN-based models increase the processing time when handling low-quality finger vein images, and there is a limitation on the size of finger vein images. Second, there is a lack of feature representation of ground-truth low-quality finger vein patterns. The authors developed a finger vein GAN (FV-GAN) framework that consists of two generators: an image generator that generates vein images from vein patterns using a U-Net architecture, and a pattern generator that maps vein images to vein patterns using an encoder-decoder network. The discriminator finds the latent space between the correct and wrong vein patterns. The model is evaluated on two publicly available datasets, the Tsinghua University Finger Vein and Finger Dorsal Texture Database 2 (THU-FVFDT2) and the ShanDong University finger vein database (SDU), and achieved significant results. Another useful resource for clinical diagnosis is ultrasound imaging. Ultrasound imaging techniques are used profoundly in maternal-foetal medicine, in diagnosing abnormalities in body parts (for example breasts), liver cancer, the identification of thyroid problems, etc. However, capturing ultrasound images requires large infrastructured devices that cannot be easily used in applications like rural medicine, telemedicine, and community medicine. Thus, portable ultrasound imaging devices are often used in the above scenarios to improve global health care, but the low quality of ultrasound images from these portable devices limits the reliability of the diagnosis. A two-stage GAN structure is devised in [136] to increase the image quality of hand-held or portable ultrasound devices. A U-Net model is placed as a front-end tool for the generator at stage one; it extracts structural features at low frequencies in the reconstructed images. In stage two, a GAN network is deployed to find the latent space between low-quality and high-quality images. The generator takes a pair of inputs: a low-quality image and the output image of the U-Net model. The discriminator also takes a pair of inputs: reconstructed generator images and high-quality images. The proposed two-stage model improved the image quality of hand-held and portable ultrasound devices.

MRI and PET images are fused in [116] to generate images that have both the tissue structure from MRI and the functional, metabolic information from PET. The motive behind fusing multiple source images is to get rid of redundant information and to obtain complementary information in one single image to yield a better clinical diagnosis. The authors proposed an algorithm based on the Wasserstein GAN (MWGAN) to surmount the challenges involved in fusing images from multiple sources. The GAN model consists of one generator and two discriminator networks with a novel loss function. The model can also be extended for the fusion of MRI and CT images. The model is investigated on the publicly available dataset on the Harvard Medical School official page.

Pulmonary nodules in the lungs are examined for the diagnosis of lung cancer at early stages. However, most medical domain applications suffer from the data scarcity problem, which makes deep learning models trained on limited data prone to wrong clinical diagnoses. A GAN-based unsupervised approach is proposed on the principles of anomaly detection for the diagnosis of lung cancer. An encoder module is incorporated along with the GAN model for training on benign pulmonary nodules. The GAN (MDGAN) consists of a generator network and multiple discriminator networks. MDGAN computes a feature loss along with an image reconstruction loss to assign high scores to malignant nodules and small scores to benign nodules. The performance of the model is evaluated on the LIDC-IDRI dataset and proved effective compared to supervised benchmarks [51]. Ghassemi et al. [22] discussed a GAN-based model as a novel data augmentation method for multi-class classification of MR images. First, the GAN is fed with different MR image datasets to generate MR-like images as the output of the generator. Later, the augmented dataset is given to the discriminator, which is already trained during the data augmentation phase, for multi-class classification. The proposed model achieved significant accuracy rates on the MRI dataset compared to the state of the art. Decreasing the dose of radiation during chest imaging adds noise to the generated image, thereby altering the clinical diagnosis. Kim et al. [48] devised a conditional GAN (CGAN)-based denoising method that removes the noise in reduced-radiation chest images and enhances the image quality for clinical diagnosis. The generator and discriminator of the conditioned GAN model are built of convolutional layers. Figure 5 shows the architecture of the CGAN and the restored and uncorrupted images. He et al. [34] modelled a label smoothing GAN (LSGAN) for the classification of optical coherence tomography (OCT) images, which can help in detecting and avoiding blindness at early stages. The model consists of a generator, a discriminator, and a classifier. The generator creates synthetic unlabelled OCT images. The discriminator distinguishes between training OCT images and synthetic OCT images while optimizing the performance of the generator to generate high-quality images. A label smoothing strategy is embedded in the classifier that helps in labelling the unlabelled OCT images. The LSGAN model is evaluated on the publicly available UCSD dataset and a locally developed HUCM dataset and achieved promising results.

From the above discussion, it is observed that image reconstruction, image synthesis (for example, conditional image synthesis and cross-modality image synthesis), segmentation, classification, abnormality detection, denoising, data augmentation, etc., are the novel tasks that have been solved using GANs.
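Label smoothing, used by the LSGAN classifier above and widely used as a general GAN stabilization trick, replaces hard 0/1 targets with softened ones. A minimal sketch of one-sided smoothing in the discriminator update (the 0.9 target is a conventional choice, not a value reported in [34]):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def discriminator_loss_with_smoothing(d_real, d_fake, smooth=0.9):
    """d_real/d_fake: discriminator outputs in (0, 1) for real and fake batches.
    Real targets are softened to `smooth`; fake targets stay at 0."""
    real_loss = bce(d_real, torch.full_like(d_real, smooth))
    fake_loss = bce(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss
```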

Fig. 5 Images cropped directly from [48]: a architecture of CGAN; b Gaussian noise corrupted image, CGAN restored image, and final uncorrupted image

3.2 Intrusion detection

Despite the success of machine learning and deep learning in classification, adversarial examples attempt to make deep learning models misclassify images by inducing small noise patterns. Yuan et al. [124] developed a randomized nonlinear image transformation method to partly alter and destroy the advanced patterns of attacking noise in adversarial images. They employed a generative cleaning network to retrieve the content of the original image lost during the image transformation phase. The discriminator network is used to defend the classification process and is trained not to detect any leftover noise patterns in the images. They evaluated the proposed model using the CIFAR-10 and SVHN datasets. Zhang et al. [130] proposed an extended Monte Carlo tree search (MCTS) algorithm using a GAN model that produces adversarial examples of cross-site scripting (XSS) attack traffic data. They added adversarial examples to the original dataset during the training phase and assigned a probability value for an adversarial example bypassing the detector. The model is examined using an intrusion detection dataset (CICIDS2017) that contains up-to-date attacks on real-world data. Huang et al. [40] modelled an imbalanced GAN (IGAN) framework to enhance the process of intrusion detection in ad hoc networks. The architecture consists of a feed-forward network to extract the features, an IGAN with a filter to synthesize samples of the abnormal class, and a deep neural network to perform the classification task.
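The IGAN framework just described chains a feature extractor, a generator for the under-represented attack class, and a classifier. A schematic outline of that flow, in which every module name and interface is a hypothetical placeholder rather than the published design:

```python
import torch

def igan_style_pipeline(flows, labels, extractor, igan, classifier, rare_label):
    """Schematic: extract features, synthesize extra samples for the rare
    attack class, then run the classifier on the enlarged feature set."""
    feats = extractor(flows)                              # (N, F) feature vectors
    deficit = int((labels != rare_label).sum() - (labels == rare_label).sum())
    if deficit > 0:
        synth = igan.sample(rare_label, deficit)          # assumed generator API
        feats = torch.cat([feats, synth])
        labels = torch.cat([labels,
                            torch.full((deficit,), rare_label, dtype=labels.dtype)])
    return classifier(feats), labels
```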

3.3 Fault diagnosis

Fault detection is an important task in the field of control engineering to capture the malfunctioning of a machine and avoid machine failure and human loss. Shao et al. [89] devised a model for machine condition monitoring and fault diagnosis using sensor data. The model design is based on the ACGAN [72] architecture and consists of 1D convolutional layers. Initially, the model is trained on the limited training data, during which it learns hierarchical representations and generates realistic raw sensor signal data. Later, the augmented dataset is used for the classification of machine faults. They also used a novel quantitative method for the evaluation of the generated sensor signal data and used time-domain and frequency-domain characteristics to assess the diversity of the generated samples. Yan et al. [111] addressed the automated fault detection and diagnosis (AFDD) problem using an unsupervised framework. Since the number of training instances for normal machine states is higher than for faulty machine states, a conditioned version of WGAN is deployed to synthesize more training instances of faulty-state samples. The authors deployed multi-layer perceptron models as the generator and discriminator networks for AFDD. A support vector machine (SVM) is trained as a binary classifier on the augmented dataset: in the detection phase it identifies the faulty state, and in the diagnosis phase it classifies the type of fault. Wang et al. [103] showed another GAN-based framework (CVAE-GAN) using the conditional variational autoencoder (CVAE) for imbalanced fault diagnosis in a planetary gearbox. The CVAE consists of three modules, an encoder, a decoder, and a sampling network, and is considered the generator network. It learns the spectrum distribution features of vibration signals to generate fault samples of different modes. The discriminator network differentiates true fault samples from generated fault samples and also classifies the variant of the fault. Zhang et al. [129] presented a framework that works in two stages for imbalanced fault diagnosis of rotating machines. A GAN model that contains multiple generation modules is placed to generate samples for different fault conditions. A convolutional model that ends with fully connected dense layers serves as the discriminator network, and a deep convolutional model is used for the classification task on the augmented data. Investigations on the CWRU and Bogie datasets proved the effectiveness of the proposed model. In [74], the authors discussed semi-supervised and imbalanced fault bearing identification in automation systems for industrial applications. A deconvolutional network is employed as the generator and a convolutional network as the discriminator.

3.4 Semantic segmentation

Semantic segmentation is one of the tasks from the computer vision domain. Image segmentation divides the image into different sub-parts and classifies each sub-part into a class. In contrast, semantic segmentation classifies each pixel of the image into a specific class. Kim et al. [49] proposed a modified generative adversarial model to synthesize images of jellyfish to help avoid jellyfish swarms in fisheries. They employed an autoencoder in parallel to the GAN model. The generator model is used for the synthesis of images. The discriminator takes two inputs: synthesized images from the generator and real images from the autoencoder. The autoencoder is also used to generate images from the vectors synthesized by the generator. They also estimated the density of jellyfish swarms using fully convolutional and regression networks. Wang et al. [100] discussed a model named multi-context GAN (MCGAN) that completes faces in images with random missing regions. The model considers the semantic and high-frequency features using parallel dilated learning units (DLUs). A stack of DLUs is then used to incorporate fine details using a larger receptive field. The performance of the DLUs and the entire model is investigated on the CelebA dataset and yielded satisfactory results. In [73], the authors proposed an attentively conditioned GAN (AC-GAN) for semantic segmentation. The generator model is used as a segmentor to generate maps from images. The discriminator model is used to differentiate the segmentor's output from real labels. Also, an attention network is deployed along with the segmentor to provide an attention probability for each feature map. They investigated the proposed model on the PASCAL VOC 2012 and CamVid datasets.
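In adversarial segmentation setups like the AC-GAN above, the generator is a segmentor emitting per-pixel class maps and the discriminator judges whether a label map is predicted or ground truth. A generic sketch of the paired losses (not the exact objective of [73]; the 0.1 weight is an assumption):

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
bce = nn.BCEWithLogitsLoss()

def segmentor_loss(segmentor, discriminator, image, gt_mask):
    """Supervised per-pixel loss plus an adversarial term rewarding
    segmentations the discriminator cannot tell from ground-truth maps."""
    logits = segmentor(image)                        # (N, C, H, W)
    pred = torch.softmax(logits, dim=1)
    adv = bce(discriminator(pred), torch.ones(image.size(0), 1))
    return ce(logits, gt_mask) + 0.1 * adv

def discriminator_loss(discriminator, pred_mask, gt_onehot):
    real = bce(discriminator(gt_onehot), torch.ones(gt_onehot.size(0), 1))
    fake = bce(discriminator(pred_mask.detach()), torch.zeros(pred_mask.size(0), 1))
    return real + fake
```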
The projective imaging nature of X-rays makes them a challenging modality for clinical diagnosis. The image capturing technique used for X-rays bypasses the 3D spatial information between anatomies, which leads to difficulties in semantic segmentation and in turn deteriorates clinical diagnosis performance. Also, large collections of data annotations are not available in the medical domain. In this context, [131] modelled a task-driven GAN (TD-GAN) to perform a multi-organ segmentation task. First, synthetic digitally reconstructed radiographs (DRRs) are generated from 3D CT images and trained using a digital image-to-image (DI2I) module. Then, the task-driven GAN is deployed to perform image synthesis and segmentation of multiple organs in an unsupervised manner. Mammography is used extensively for detecting abnormalities in women's breasts to diagnose breast cancer at an early stage. Radiologists leverage low-energy X-ray signals to find the variance in the appearance, location, size, shape, and texture of breasts. Singh et al. [91] modelled a framework (cGAN) that employs a single shot detector [67] to locate the region of interest (ROI) in breast mammograms and surround it with a bounding box. Later, the ROIs are given as conditioned input to the generator, which learns the inherent features, like edges, grey level, gradients, shape, etc., of unhealthy and healthy tissue. It also produces a binary mask (segmentation) based on these features. The discriminator network takes the ground truth and predicted masks as input and indicates the real one. They also used a multi-class CNN network for the classification of irregularities in breast shapes (round, irregular, lobular, and oval). They investigated the model using the INbreast and DDSM public datasets and the Hospital Sant Joan de Reus private dataset and achieved significant results. Figure 6a shows the workflow of the cGAN for breast tumour segmentation and classification, Fig. 6b shows the generator and discriminator networks of the cGAN, and Fig. 6c shows the CNN architecture for classifying the type of tumour.

Fig. 6 Images cropped directly from [91]: a workflow of cGAN; b cGAN architecture for breast mass segmentation of tumour; c CNN architecture for shape classification of tumour

Bisneto et al. [6] employed a conditional GAN model to perform semantic optic disc segmentation for the automatic diagnosis of neurodegenerative diseases. The CNN U-net [83] is used as the generator and PatchGAN [54] as the discriminator network. [85] performed segmentation and quantification of tumours simultaneously from CT images for the diagnosis of kidney tumours. A residual network that acts as a multi-scale feature extractor retrieves the tumour features. A multi-tasking integrated network is used as the generator network, performing semantic segmentation, object detection, and direct quantification. A convolutional model is deployed as the discriminator network to encourage the optimization process. Han et al. [27] presented a GAN-based semi-supervised model for the segmentation of lesions in breast ultrasound (BUS) images. It first makes use of annotated images to synthesize more BUS images and thereby enhances the segmentation performance. Delannoy et al. [17] discussed a GAN-based framework (SegSRGAN) that performs super-resolution to increase the image quality and a segmentation task to segment the region of interest in brain MR images. Lei et al. [57] formulated a new GAN model to differentiate melanoma from normal skin lesions. The novel GAN contains a UNet-SCDC-based generator that has skip connections as well as dilated convolutions and produces segmentation masks. It also includes two CNN-based discriminator networks that enhance the quality of the generated masks: the first CNN takes the concatenation of the real input and the generated segmentation mask as input, while the second CNN takes the generated segmentation mask alone.

3.5 Image to text (I2T) and text to image (T2I) synthesis

In [127], the authors proposed two stacked models, StackGAN version 1 and StackGAN version 2, to synthesize images from text. StackGAN version 1 has two GANs, one in each stage. The stage-1 GAN generates low-resolution images from text descriptions. The stage-2 GAN generates high-resolution images from the low-resolution images by considering the missing details of the text and conditioning on the stage-1 output. StackGAN version 2 consists of a series of generators and a series of discriminators in a tree-like structure and is implemented in both a conditional and an unconditional manner to generate high-resolution naturalistic images. Cai et al. [10] described a dual attention GAN (DualAttn-GAN) model to generate naturalistic and realistic images from text descriptions. As the name suggests, they incorporated two attention models: a textual attention model and a visual attention model. The textual attention model is employed to identify the semantics between inputs and outputs; the visual attention model is used to increase the representation power of visual features. They evaluated the model using the CUB and Oxford-102 datasets. Figure 7 shows images generated by DualAttn-GAN compared to other models on the CUB dataset.

Fig. 7 Image generated by DualAttn-GAN [10] (2nd from right) compared to other models for the given text

Contrary to the recognition of general characters as machine-encoded text, extracting text from natural images includes challenges arising from variations in text shape, colour, size, and patterns. It is also difficult to extract text from complex backgrounds with a non-uniform degree of visibility, noise, pollution, occlusion, reflections, lighting, and blur. Lei et al. [58] presented a model named defect-restore GAN to extract sequential text from abnormal images of moving vehicles. The GAN model contains two encoders in the generator, a discriminator, and a recurrent neural network (RNN) as an output block. The proposed model is investigated on their proprietary wagon dataset, which has 5000 images, and achieved significant results. Yanagi et al. [112] modelled a Query-is-GAN using AttnGAN [110] to retrieve scenes from text descriptions. First, three query images are generated using the text description as input to the AttnGAN. Later, the generated query images and a hierarchical structure are used to retrieve the most desired scenes. Ak et al. [2] discussed e-AttnGAN, an extension of AttnGAN. The attention module of e-AttnGAN involves contextual features of words and sentences in the image generation process. They employed spectral normalization to maintain a stable training process. e-AttnGAN has proved its effectiveness over the state-of-the-art in image generation.
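Across these text-to-image models, the shared mechanism is conditioning image synthesis on a text embedding combined with noise. A bare-bones sketch of that conditioning (dimensions and architecture are our assumptions; real systems such as StackGAN and AttnGAN add staged generation and attention on top):

```python
import torch
import torch.nn as nn

class TextToImageGenerator(nn.Module):
    """Toy conditional generator: sentence embedding + noise -> 64x64 image."""
    def __init__(self, text_dim=256, z_dim=100):
        super().__init__()
        self.fc = nn.Linear(text_dim + z_dim, 128 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())    # 64x64

    def forward(self, text_emb, z):
        h = self.fc(torch.cat([text_emb, z], dim=1)).view(-1, 128, 8, 8)
        return self.up(h)
```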
3.6 Natural language processing

Generating text sequences is part of natural language processing tasks; dialogue systems, machine translation, and writing poetry are also part of the text generation task. Since the inception of GANs, they have been coupled with reinforcement learning to generate text sequences. The output of the discriminator is fed as input to the generator to mimic the reinforcement reward feedback signal. However, this input is a scalar value and cannot maintain the high-level semantic information of the text. Also, sampling is performed to complete the text sequences and get a reward signal through the discriminator; the text sequences may contain repeated subjects and missing verbs due to the high randomness of the sampling process [122].
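The reinforcement-learning coupling described above treats the discriminator's score on a sampled sequence as a reward. A schematic REINFORCE-style update follows; generator.sample and the sequence-level reward are assumed interfaces, and SeqGAN-style systems refine this with Monte Carlo rollouts for per-token rewards:

```python
import torch

def policy_gradient_step(generator, discriminator, opt, batch=32, max_len=20):
    """Sample sequences, score them with D, and reinforce high-reward tokens.
    generator.sample is assumed to return token ids and their log-probs."""
    tokens, log_probs = generator.sample(batch, max_len)   # (B, T), (B, T)
    with torch.no_grad():
        reward = discriminator(tokens)                     # (B, 1) per-sequence score
    baseline = reward.mean()                               # simple variance reduction
    loss = -((reward - baseline) * log_probs.sum(dim=1, keepdim=True)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```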
In [115], the authors addressed these two issues in a feature-guiding GAN (FGGAN) for generating text sequences. The reward signal is replaced by a feature-guiding vector generated from the features extracted by the discriminator using a feature module. The authors also created semantic rules to control the next word generated at each time step, preventing words that have low correlation with the generated prefix words. Li et al. [61] modelled a dialogue response system using an adversarial reinforcement training model. The model is embedded in a reinforcement framework, and the generator is trained based on the output of the discriminator. The model generated dialogue sentences that are competitive with human-generated sentences. Given the context of the text, [63] generated labelled sentences based on category information using a category sentence GAN (CS-GAN). To this end, they incorporated an RNN to generate sequences, reinforcement learning for predicting the next character based on the current state, and a GAN for adversarial training and classification. Wang et al. [98] discussed automatic sentimental text generation using the SentiGAN framework. SentiGAN consists of multiple generators generating diverse sentimental texts using a novel penalty-based objective function. The discriminator model classifies the high-quality diverse texts into their sentiments. They extended the SentiGAN model to C-SentiGAN to tackle the problem of conditional text generation. The model is evaluated on Movie Reviews, Beer Reviews, customer reviews, and emotional conversation data and achieved significant results in terms of the novelty, fluency, intelligence, and diversity of the generated texts.

Rizzo et al. [81] explored the performance of SeqGAN with contextual information encoded in global word embeddings as input. A self-attentive neural network is employed as a discriminator to optimize the SeqGAN performance in embedding knowledge into the generated text. Motivated by the sequence GAN [122], [12] discussed a conditional text GAN (CTGAN) that is capable of generating high-quality, diverse text content of variable length. An automated method is also proposed to replace keywords that specify the context with synonymous words from the trained text data. CTGAN is conditioned on the emotion label as an auxiliary input to have control over the topic. They used an LSTM model as the generator network and a CNN model as the discriminator network. The CTGAN model is evaluated on Yelp restaurant reviews, Amazon reviews, and film review data and generated high-quality text of variable length. Automatically generating text and summarizing it in a human-readable and semantically similar way to the original text is defined as text summarization. Text summarization is categorized in two ways: extractive summarization and abstractive summarization. Extractive summarization extracts the important words, phrases, and sentences and summarizes them. Abstractive summarization generates the text and then summarizes it to reflect the original. Zhuang et al. [141] proposed an abstractive summarization method using a GAN model that consists of one generator and two discriminators. The generator is responsible for encoding long input text sentences into a short text representation. The first discriminator trains the generator to generate text in human-readable form, and the second discriminator trains the generator to keep the prominent features of the original text so that the generated text is semantically similar to the original. The authors implemented a policy gradient to train the model. The process of rephrasing a sentence from the original style to another style without altering the semantics of the text is defined as style transfer. If it is from a source style to a target style, it is called unidirectional style transfer. Alternatively, multi-directional style transfer is also possible, but at the cost of multiple trainings: if there are k attributes, then k × (k − 1) training models are required. It has applications in NLP, for example, sentiment transformation and formality modification, and in computer vision. Yu et al. [123] discussed a unified GAN (UGAN) model that transfers styles among multiple attributes using one training model. The original text and target attributes are given as input to the generator, which generates text based on the given attributes. The discriminator takes this text and real text as input and generates a rank and a classification output. The proposed model significantly reduced the training time for multi-directional style transfer.
3.7 Image deblurring and dehazing

Adverse weather conditions like fog, rain, haze, and pollution deteriorate the quality of images. Increasing the contrast, colour, and texture of images to improve their quality is referred to as image dehazing. In general, image dehazing techniques are categorized into image enhancement and model-based dehazing approaches. Pang et al. [75] introduced a model-based method named haze removal GAN (HRGAN) that uses mathematical inversion techniques to reconstruct haze-free images. The generator network consists of three modules: a transmission map module, an atmospheric module, and a processing module that generates a reconstructed haze-free image. The CNN-based discriminator classifies between real haze-free images and reconstructed haze-free images. HRGAN is trained with a combined loss that consists of a pixel-wise loss, a perceptual loss, and an adversarial loss. The HRGAN model achieved significant results in terms of haze removal and image quality compared to other benchmarks on the NYU2, synthetic, Middlebury, and SOTS datasets.
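HRGAN's training signal combines pixel-wise, perceptual, and adversarial terms. A hedged sketch of such a composite generator loss (the weights and VGG-feature choice are illustrative assumptions, not HRGAN's published configuration):

```python
import torch
import torch.nn as nn
import torchvision.models as models

vgg_features = models.vgg16(weights=None).features[:16].eval()  # perceptual backbone
for p in vgg_features.parameters():
    p.requires_grad = False
l1 = nn.L1Loss()

def dehazing_generator_loss(restored, clear, d_score,
                            w_pix=1.0, w_perc=0.1, w_adv=0.01):
    """Pixel loss keeps colours, perceptual loss keeps structure,
    adversarial loss pushes outputs toward the clear-image distribution."""
    pixel = l1(restored, clear)
    perceptual = l1(vgg_features(restored), vgg_features(clear))
    adversarial = -torch.log(d_score + 1e-8).mean()   # d_score in (0, 1)
    return w_pix * pixel + w_perc * perceptual + w_adv * adversarial
```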
Zhao et al. [132] developed a pyramid GAN (PGAN) in which the authors placed three GAN models in a pyramid shape. The first GAN block captures the non-local features of images at multiple levels. The second GAN block captures the local features of images at multiple levels; at this stage, the PGAN combines and balances the local and non-local features of the images. The final GAN block identifies the sharp edges of the reconstructed image. The performance of the model is evaluated on the GOPRO and MS COCO datasets. Rain is the prime factor that affects the quality of images captured by surveillance systems, in terms of blurring, raindrop obstacles, and deformation. Xiang et al. [109] discussed a feature-supervised GAN (FS-GAN) that removes rain streaks from a single image and enhances the image quality. It introduces feature-supervised guidance at the last layer of the generator network and achieved fair results. Jin et al. [44] discussed an asynchronous interactive GAN (AI-GAN) that deals with feature-wise extrication and finds the mutuality between feature-wise coupled components. Later, this interdependency is leveraged to achieve the deraining effect successively. The AI-GAN is capable of decomposing all the diverse features involved in a single image using a two-branch structure. Zhao et al. [134] discussed a double discriminator GAN (DD-GAN) that has two generators and leverages two discriminators against each generator. The main motive behind employing two discriminators is to maintain a stable training process with limited training. They also used a weight clipping algorithm to increase the convergence speed and to tackle the unstable training and mode collapse problems of GANs. The model achieved promising results on the RESIDE and O-Haze datasets.

Li et al. [59] discussed an improved SAGAN model to generate high-quality dairy goat images. The authors used a self-attention-based normalized feature map method to compute the correlation between features. They also replaced the one-hot label for class labels with multi-class labels to improve the quality of images. They investigated the model on a collection of goat images and the CelebA dataset and obtained significant improvements in results. [76] proposed an outdoor image dehazing technique that consists of two GANs, a cycleGAN and a cGAN, with different properties. First, the cycleGAN is trained on outdoor images to generate haze-free coloured images. On the other hand, the cGAN is trained to keep the texture details, like light, contrast, etc., of the hazed images. Finally, a convolutional neural network fuses the two to generate haze-free images. Figure 8 shows the dehazed image of a hazed image using CycleGAN.

Fig. 8 Dehazed image (right) of hazed image (left) using CycleGAN [76]

3.8 Face image synthesis

Facial image synthesis and super-resolution, also known as face hallucination, are the two most discussed topics in the fields of image processing and computer vision research. Face hallucination is the process of upscaling low-resolution images to high-resolution images; preserving the identity of the person is a challenge while performing it. Hsu et al. [38] discussed a Siamese GAN model (SiGAN) to reconstruct faces in the process of face hallucination. SiGAN consists of two generators and a discriminator. The two generators receive a low-resolution paired image as input and reconstruct a high-resolution paired image, which is given as input to the discriminator. They employed the SiGAN loss, a combination of the GAN loss and a contrastive loss, together with a reconstruction loss, to train the SiGAN model. Experimental results on the CASIA, LFW, and CelebA datasets proved the effectiveness of the SiGAN model. [65] reconstructed a high-resolution facial image from a low-resolution facial image using a component semantic prior GAN (CSPGAN). They introduced a gradient loss along with a perceptual loss in computing the content loss of the generator. The discriminator network in the GAN is capable of multi-task semantic category prediction. The effectiveness of the proposed model is investigated on the labelled faces in the wild (LFW) and facial HR images online (FHRO) datasets. Figure 9 shows the ground truth textures in the first row; the second row shows the realistic textures captured by CSPGAN with the multi-tasking-capable discriminator.

Fig. 9 Realistic textures captured by CSPGAN (2nd row) with multi-tasking discriminator [65]
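SiGAN's identity preservation hinges on a contrastive term computed over paired reconstructions, added to the GAN and reconstruction losses. A minimal sketch of such a contrastive term (the margin and distance are assumed choices, not values from [38]):

```python
import torch
import torch.nn.functional as F

def contrastive_identity_loss(emb_a, emb_b, same_identity, margin=1.0):
    """emb_a, emb_b: embeddings of two reconstructed faces; same_identity: 1/0.
    Pulls same-identity pairs together and pushes different ones apart."""
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = same_identity * dist.pow(2)
    neg = (1 - same_identity) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()
```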

Given a photo, synthesizing a pencil sketch is referred to as photo-sketch synthesis, which has applications in the fields of digital entertainment and suspect identification in law enforcement. This task suffers from loss of content, colour inconsistency, distorted faces, lack of clarity, and missing texture. [135] discussed a GAN model (EGGAN) guided by a feature encoder. The feature encoder is trained particularly to search for an effective face photo-sketch domain latent space. This model can perform photo synthesis and sketch synthesis simultaneously. The model is validated on two publicly available benchmark datasets: CUFS and CUFSF. Han et al. [28] discussed another face frontalization GAN model, named face merged GAN (FM-GAN), that has two generators and one discriminator. The first generator extracts the local face features from the upper and lower parts of the profile face using an encoder network; then, a decoder network merges these features to synthesize a frontal face view. The decoder of the second generator takes the encoded features of the profile face and the merged frontal face view as inputs, extracts the global features, and generates a high dimensional frontal face view. Later, the discriminator is trained on real and synthesized data; the model produced promising results on the Karolinska directed emotional faces (KDEF) dataset. Iranmanesh et al. [42] devised a coupled GAN (CpGAN) for the face recognition task across diverse spectrums. It consists of two GAN-based sub-networks, a visible GAN and a non-visible GAN, paired by a contrastive loss function, and performs nonlinear transformations. Both generators are formed by encoder-decoder networks, and the discriminators are CNNs. The proposed model is evaluated on six different databases, Casia HFB, Casia NIR-VIS, NightVision (NVESD), Notre Dame X1 (UNDX1), Polarimetric Thermal, and Wright State (WSRI), and achieved significant results.
In [31], He et al. introduced a super-resolution GAN model that synthesizes high-resolution facial images scaled four times from low-resolution facial images at different resolutions. The bicubic interpolation method is used to resize the low-resolution blurred images. These images, along with the ground truth images from the CelebA dataset, are given as input to the stacked GAN, which has three generators and three discriminators. They incorporated residual learning for the upsampling of images. Experimental results proved that the proposed super-resolution model outperformed other methods in terms of SR performance and generated realistic images. Sun et al. [92] considered the problem of short-term facial age synthesis along with long-term facial age synthesis over various age spans. They employed a GAN network guided by an age label distribution (IdGAN), especially for short-term facial age synthesis. The label distribution consists of various age groups. The proposed model is experimented on the Adience, CACD, FG-NET, MORPH, and UTKFace facial age databases and yielded remarkable results. The task of identifying frontal face images from profile face images is referred to as face frontalization, and it has applications in face recognition systems. Recently, GANs have proved their effectiveness in synthesizing frontal face images from profile face images with small face poses. To address larger poses, [82] developed a face frontalization method, feature improving GAN (FIGAN), that achieved improved results with large face poses. The authors employed a feature mapping block (FMB) that identifies the variance between frontal face poses and profile face poses. The discriminator network is modelled with a feature discriminator that improves the latent features generated by the FMB block in the generator. The model is investigated on the celebrities in frontal-profile (CFP), labelled faces in the wild (LFW), and Multi-PIE databases.
four benchmark datasets: University of Pavia, Pavia centre,
3.9 Geoscience and remote sensing DCMall, and Jiamusi. Zhu et al. [137] devised a multi-
branch conditional GAN (MCGAN) model to increase data
Spectral sensors have been used for capturing hyperspectral for objection in remote sensing images. The MCGAN archi-
images of the object from long distances. It captures both spa- tecture consists of one generator, three discriminators, and
tial information and spectral information of the target object. a classifier that build using deep CNNs. The data augmen-
Classification of such information is much useful in the appli- tation process is carried out on NWPU VHR - 10 dataset
cations of land change monitoring, resource management, with an alternative of DOTA dataset for severely low num-
remote sensing of ground water resources, remote sensing bered instance group in NWPU VHR - 10 dataset. Later,
of agriculture and vegetation, distant observing of forestry, NWPU VHR - 10 and DOTA datasets were merged to train
urban development, scene interpretation in law enforcement, the MCGAN. Experimental results proved the effectiveness
etc. Deep learning model requires a large number of samples of the model on the quality of objects detected from the gen-
for a successful classification process. However, the remote erated images.
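The Wasserstein critic loss with a gradient penalty mentioned for [90] is commonly implemented as in the sketch below (a generic WGAN-GP-style penalty in PyTorch, not code from [90]; the penalty weight and the network interface are assumptions).

```python
import torch

def gradient_penalty(critic, real, fake):
    """Penalize critic gradients whose norm deviates from 1, computed
    on random interpolates between real and generated image batches
    of shape (N, C, H, W)."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    norms = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((norms - 1.0) ** 2).mean()

# Critic objective (assumed penalty weight of 10):
# loss_D = critic(fake).mean() - critic(real).mean()
#          + 10.0 * gradient_penalty(critic, real, fake)
```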
Feng et al. [20] discussed a spatial-spectral GAN model that performs multi-class classification of hyperspectral images. This model addresses two issues of the classification process: first, the inability of the discriminator to handle multi-class classification, and second, the joint consideration of spatial and spectral information in the classification of hyperspectral images. Wang et al. [96] proposed a variational GAN using a semi-supervised method to classify hyperspectral images with limited labels. The semi-supervised context is incorporated using an encoder-decoder network, and a collaborative optimization framework is used to find the latent space shared between the classification and sample generation tasks. The effectiveness of the variational GAN is validated on four benchmark datasets: University of Pavia, Pavia Centre, DCMall, and Jiamusi. Zhu et al. [137] devised a multi-branch conditional GAN (MCGAN) model to augment data for object detection in remote sensing images. The MCGAN architecture consists of one generator, three discriminators, and a classifier, all built using deep CNNs. The data augmentation process is carried out on the NWPU VHR-10 dataset, with the DOTA dataset as an alternative source for the severely under-represented instance groups in NWPU VHR-10. Later, the NWPU VHR-10 and DOTA datasets were merged to train the MCGAN. Experimental results proved the effectiveness of the model in terms of the quality of objects detected from the generated images.

3.10 Video generation

Given a context, the process of forecasting the next sequence of frames is known as video prediction. It has a wide range of applications like autonomous driving, object tracking, robotic planning, etc. An underlying uncertainty associated with the dynamics of the real world challenges the predictions. Wen et al. [106] generated a sequence of video frames y_{i+1}, y_{i+2}, ..., y_k given two key input frames x_i and x_{k+1}. They used two generators G_1, G_2 and two discriminators D_1, D_2. The generators are placed in a sequential manner, where the output of the first generator is fed into the second generator. G_1 learns motions from real videos during training, and G_2 adds more details to the output of G_1. D_1 and D_2 optimize the performance of the generators through adversarial training. Investigations proved that the generated video frames are clear and smooth.
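The chained arrangement described for [106] can be sketched as follows; the two sub-generator modules are placeholders (any encoder-decoder video network would fit), and only the G_1-into-G_2 wiring is taken from the description above.

```python
import torch.nn as nn

class ChainedGenerators(nn.Module):
    """G1 predicts a coarse in-between sequence from the two key
    frames; G2 refines G1's output, as described for [106]. Each
    stage's output is judged by its own discriminator in training."""
    def __init__(self, g1: nn.Module, g2: nn.Module):
        super().__init__()
        self.g1, self.g2 = g1, g2

    def forward(self, x_i, x_k1):
        coarse = self.g1(x_i, x_k1)   # motion-level prediction
        refined = self.g2(coarse)     # finer texture and detail
        return coarse, refined
```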
In [39], Hu et al. introduced a novel stochastic video prediction GAN (VPGAN) that is trained with a cycle-consistent loss to predict the next sequence of actions in a video. An image segmentation model is also incorporated, using two generators to extract features. The proposed model is investigated on four datasets: Moving MNIST, KTH, BAIR, and UCF101, and achieved significant improvement in the quality of predicted future frames. [13] devised a bottom-up GAN (BoGAN) model to generate video frames from text descriptions. The model has an attention module that computes a region-level loss to fill the sub-regions of each video frame conditioned on words. The discriminator employs a frame-level loss to keep the semantic matching between successive frames. Finally, another discriminator maintains the global-level semantics across the sequence of frames in the final video. The model produced promising results compared to benchmarks.

3.11 Animation creation

Anime character and animation creation is a challenging task in the domain of multimedia applications. Image-to-image translation and image super-resolution are the two major tasks involved in anime character synthesis. [45] modelled CartoonGAN, which transforms real-world images into cartoon-style images, a challenging task in computer graphics. The generator network of CartoonGAN consists of convolution, deconvolution, and residual blocks, while the discriminator comprises convolutional layers. The model takes a set of real-world images and another set of cartoon images for training, and employs two loss functions during training to find the latent space between the two sets of images: a semantic content loss that manages the variations between real images and cartoon images, and an edge-enhancing adversarial loss that maintains sharp edges. Experimental results proved that the generated cartoon images are of high quality and surmount the state-of-the-art style transfer methods.
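A semantic content loss of this kind is typically an L1 distance between high-level features of a fixed pretrained network; the sketch below assumes VGG19 features up to relu4_4 purely for illustration (the exact backbone and layer are assumptions, not details from [45]).

```python
import torch.nn as nn
from torchvision import models

class ContentLoss(nn.Module):
    """L1 distance between deep features of the input photo and the
    generated cartoon, keeping content while the style changes."""
    def __init__(self, depth=27):  # through relu4_4; illustrative choice
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
        self.features = vgg.features[:depth].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)   # fixed feature extractor

    def forward(self, photo, cartoon):
        return nn.functional.l1_loss(self.features(photo),
                                     self.features(cartoon))
```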
[26] imposed structural conditions at each scale of image generation during the progressive training of a progressive structure-conditional GAN (PSGAN). PSGAN generates anime images at 1024 x 1024 resolution with full-body structure. A landmark assisted CycleGAN [108] is modelled to generate high-quality cartoon faces from real faces. Unpaired real faces and cartoon faces are used to train the model. The model employs a regressor to detect the landmarks in the generated cartoon faces, and a novel landmark consistency loss is used during training to capture the important features of real faces. Landmark consistency, along with the local discriminators, alleviates the structural variance between the real faces and cartoon faces. Figure 10 shows the cartoon faces generated by the landmark assisted CycleGAN.

(Fig. 10: Cartoon faces generated by [108] in the last column for the given real faces in the first column, compared to other methods in the middle columns.)

[119] proposed an image reconstruction method (PI-REC) that takes a flat colour domain and binary sparse edges as input to produce high-quality reconstructed images. The authors incorporated a GAN model that consists of three generators and three discriminators in parallel; each GAN works in a phase and refines the reconstructed image details progressively. The sparse and interpretable inputs ensure control over the style and content of the images being generated. The method also produced significant results on the image-to-image translation task, provided the domains are similar. [120] proposed a GAN model in which the discriminator generates two kinds of pseudo-labels using a self-supervised approach. The discrete pseudo-labels are mapped to latent variables during training and eventually mapped to animation features to generate diverse animation clips, while the continuous pseudo-labels are used to create diverse frames within one animation clip. They also discussed a novel metric to investigate the quality of animations.

3.12 Other application domains

When training data and testing data do not agree with each other, they pose a challenge for speech recognition in noisy environments. Qian et al. [79] discussed a GAN model for data augmentation to improve the task of speech recognition, especially under noisy conditions. A basic GAN is employed for the data generation process based on FBANK feature maps, generating feature maps frame by frame. Since the generated data do not have labels, an unsupervised learning framework is later deployed for the speech recognition task. The authors conditioned one GAN on the acoustic state and the other GAN on clean speech for better data generation. The combination of hard labels and soft labels achieved promising performance using the conditional GAN on the Aurora-4 and AMI-SDM datasets. Industries heavily depend on failure data to mitigate the occurrence of hazardous events and loss of human life. Thus, a risk warning system is an essential tool for identifying and avoiding such rare events. However, these rare events suffer from the problem of data scarcity for risk analysis. In [33], He et al. constructed a semi-supervised real-time risk management system by integrating fuzzy HAZOP risk analysis with a distributed control system (DCS). They employed a GAN model that augments labelled process data, which enhances the assessment of the type of risk classification. The framework is evaluated using a case study on the processing of polyolefin using a multizone circulating reactor (MZCR). Domain adaptation is an important area of research in the field of computer vision: given a labelled distribution and an unlabelled distribution, shifting the target data domain from labelled to unlabelled is defined as domain adaptation. In this context, [14] modelled an unsupervised framework that contains a feature extractor, an attention module embedded GAN (GAACN), and a classifier. The attention module is placed between the generator and discriminator to shift the transferable regions among different domains. They also used a label classifier module to keep the class consistency in the discriminator network. The feature extractor is forced to learn the joint feature distribution by the GAN module, and the feature extractor and classifier modules are used in the testing phase to label the unlabelled target data. The GAN model and the attention module are built of convolutional layers. The experimental results on (i) the Digits datasets (MNIST, USPS, SVHN), (ii) the ImageCLEF-DA datasets (Caltech-256, ILSVRC 2012, Pascal VOC 2012), (iii) the Office-31 datasets (Amazon, Webcam, DSLR), (iv) the Office-Home domains (Artistic, Clip Art, Product, Real-World), and (v) the VisDA 2017 domains (synthetic, real) showed significant improvements compared to other conventional models.

Wang et al. [97] discussed a new deep learning-based model named adaptive balancing GAN (AdaBalGAN) to identify the defective types in imbalanced wafer map data. They used a conditioned GAN model to generate wafer maps of a specific type, and a generative controller to adjust the sample distribution of the wafer maps according to the various defective patterns. The proposed model is evaluated on real-world fabricated WM-811K wafer maps. [29] proposed a conditional generation method that generates time-series data belonging to multiple classes. The authors employed canonical correlation analysis to exemplify the characteristics shared between the input and generated data, and deployed LSTM models in both the generator and the discriminator. [56] designed a controllable GAN (ControlGAN) to reduce the over-fitting problem caused by the auxiliary classifier in the discriminator of ACGAN [72]. ResNet [32]-based generator and discriminator networks are used in ControlGAN, and the authors used a ResNet-based independent classifier to evaluate the generated samples. The proposed model is evaluated using the CIFAR-10, CelebA, and LSUN datasets. Mandal et al. [68] developed a deep CNN-based semi-supervised GAN (SSGAN) for the food recognition task, which is challenging due to the huge inter-class variation in food images. Experimental results proved the effectiveness of the proposed semi-supervised model on the ETH Food-101 dataset and the Indian Food dataset. In [64], Lin et al. designed a defect enhancement GAN (DEGAN) based on the deep convolutional GAN (DCGAN) [80] and the energy-based GAN (EBGAN) [133] to generate microcrack defective samples. It incorporates a defect enhancement algorithm both in the forward path and after the discriminator. The generator model consists of convolutional layers, and the discriminator is implemented using an encoder-decoder network. [62] presented a similarity constraint GAN (SCGAN) that identifies entangled features and represents them in a disentangled representation in an unsupervised manner. The proposed model is investigated on the MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CelebA datasets and gained significant improvements in results. Reference [95] designed an evolutionary algorithm-based GAN (EGAN) framework in which GAN training is stabilized. The evolutionary algorithm optimizes the generator's objective using multiple training objectives and consists of three phases: evaluation, variation, and selection. The proposed framework is evaluated using three datasets: CIFAR-10, LSUN bedroom, and CelebA, and obtained significant improvement in results.
Kasem et al. [47] introduced a robust super-resolution GAN (RSR-GAN) that addresses two issues while improving the quality of subjects in images. First, it regains texture details at extreme upscaling factors. Second, it alleviates the noise generated due to geometric transformations. The RSR-GAN has a transformer module in the discriminator that enhances the discrimination capacity, and the generator loss has an additional DCT loss term that finds the right mapping between generated and real images. The authors used the Berkeley segmentation dataset for training and the BSD100, MANGA109, SET5, SET14, and URBAN100 datasets for testing, and obtained significant results. [30] proposed a style-consistent GAN (GlyphGAN) that generates novel font types; the generated fonts are style consistent, legible, and diverse over all characters. [94] designed a compressive privacy GAN (CPGAN) to defend against attacks while sharing data using machine learning as a service (MLaaS) on cloud platforms. [121] devised a long short-term memory-based conditional GAN (LSTM-GAN) to identify taxi hotspots in both the spatial and temporal dimensions. [16] generated realistic user behaviour data related to products that have not been released yet using a conditional GAN framework. Figure 11 shows the graph of the application-wise number of publications considered.

(Fig. 11: Application-wise number of publications.)

4 Evaluation metrics

There has been extensive usage of GANs in diverse applications in the late years of this decade. Generative modelling aims at mimicking the trained data with generated data. Hence, it is obvious to measure the distance between the real data and the artificial data. In general, a distance function that computes the distance between the real distribution and the generated distribution is used as the loss function. However, no standardized metrics have been devised to evaluate how good GANs are at mimicking trained data. Thus, in this section, we present a few metrics that are used extensively in the literature to evaluate GAN models.

4.1 1-Nearest neighbour classifier (1-NN)

It is a version of the classifier two-sample test (C2ST) and is, strictly speaking, a test rather than an evaluation metric. It checks the similarity between the real data distribution P_w(.) and the generated data distribution P_g(.). It computes the leave-one-out (LOO) cross-validation accuracy of classification, where all the data points except one are used to fit the classifier and the left-out point is used for prediction.
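A minimal sketch of this test using scikit-learn is shown below (the function name and the use of Euclidean distance on raw feature vectors are assumptions); accuracy near 0.5 suggests the generated samples are hard to distinguish from real ones, while accuracy near 1.0 indicates the two distributions are easily separated.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def one_nn_accuracy(real, fake):
    """Leave-one-out 1-NN accuracy on pooled real (label 0) and
    generated (label 1) feature vectors of shape (N, D)."""
    X = np.vstack([real, fake])
    y = np.hstack([np.zeros(len(real)), np.ones(len(fake))])
    clf = KNeighborsClassifier(n_neighbors=1)
    return cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
```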
4.2 Inception scores (IS)

It is a metric derived by [86] to evaluate the quality and diversity of images synthesized by generative models. First, the conditional probability of an instance belonging to a class is computed with a pre-trained Inception network. Later, these conditional probabilities are used to compute the inception score. If the conditional label distribution p(y|x) has low entropy, then the generated images are of good quality; to reflect a variety of images, the marginal label distribution p(y) should have high entropy. The inception score ranges between 1 and the total number of classes. The limitation of this metric is that it does not consider the statistics (mean, variance, and standard deviation) of the original data distribution for comparison with the generated sample distribution.

IS(P_g) = \exp\big(\mathbb{E}_{x \sim P_g}\,[\,\mathrm{KL}(p(y \mid x) \,\|\, p(y))\,]\big)   (3)
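Equation 3 reduces to a few lines of NumPy once the class probabilities p(y|x) have been collected from a pre-trained Inception network (that collection step is assumed done elsewhere; in practice the score is usually averaged over several splits):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS (Eq. 3) from an (N, C) array of class probabilities p(y|x)
    predicted on N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```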
4.3 Mode score (MS)

This metric overcomes the limitation faced by the IS metric and considers the statistics of the prior distribution to evaluate the quality and diversity of images [11]. The mode score can be computed using Eq. 4, where p(y*) represents the distribution of ground-truth labels computed from the original data distribution.

MS(P_g) = \exp\big(\mathbb{E}_{x \sim P_g}\,[\,\mathrm{KL}(p(y \mid x) \,\|\, p(y^*))\,] - \mathrm{KL}(p(y) \,\|\, p(y^*))\big)   (4)
SSIM(x, y) = l M (x, y) M · c j (x, y) j s j (x, y) j
4.4 Frechet inception distance (FID) j=1
(8)
It is a distance metric between the feature vectors of real
data and generated data. It measures the quality of generated
The exponents α, β, and γ are included to alter the relative
images and finds the occurrence of intra-class mode collapse.
importance of various components.
However, this metric considers the mean (m) and covariance
(C) of two Gaussians under study. The Frechet distance [35]
d(m,C) between the real data distribution Pw (m w , Cw ) and 4.7 Wasserstein critic
the synthetic data distribution Ps (m s , Cs ) is defined as fol-
lows: It estimates the Wasserstein distance between the real data
distribution Pr and the generated data distribution Pg . This
metric estimates lower values for generated instances and
d 2 ((m s , Cs ), (m w , Cw )) = m s − m w 22 higher values for real instances. In case of discrete distri-
 
+ Tr Cs + Cw − 2 (Cs Cw )1/2 bution transformations, it is also known as Earth Mover’s
distance (EMD). The Wasserstein critic between Pg and Pr
(5)
is estimated as shown in Eq. 9.
 
where T r represents the trace computation. W Pr , Pg ∝ max Ex∼Pr [ f (x)] − Ex∼Pg [ f (x)] (9)
f

4.5 Maximum mean discrepancy (MMD) where f : R D → R denotes the Lipschitz continuous func-
tion [8].
The MMD metric is used to compute the dissimilarity
between the real data distribution Pr and the generated
data distribution Pg . If we employ a fixed Gaussian kernel
 2 5 Challenges
k(x, x  ) = e||x−x || , then the kernel MMD [25] is computed
as shown in Eq. 6. A lower MMD indicates Pg is more similar Despite the success and extensive usage of GANs, often they
to Pr . face some common challenges during the training. The three
most important challenges faced by GANs are as follows:
    
K M M D Pr , Pg = E x,x  ∼Pr k x, x
−2E x∼Pr ,y∼Pg [k(x, y)] – If the generator is not as good as the discriminator,
4.6 Multi-scale structural similarity for image quality

Wang et al. [105] used structural information s(x, y), luminance information l(x, y), and contrast information c(x, y) to derive the structural similarity index (SSIM) as shown in Eq. 7. It assesses the similarity index between two images.

\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}   (7)

\mu_x, \mu_y and \sigma_x, \sigma_y denote the means and standard deviations of image signals x and y, respectively, \sigma_{xy} is their covariance, and the C_i are constants. [104] extended this metric to multiple scales to assess the quality of images by integrating different image resolutions. The multi-scale SSIM is computed as follows:

\mathrm{MS\text{-}SSIM}(x, y) = [l_M(x, y)]^{\alpha_M} \cdot \prod_{j=1}^{M} [c_j(x, y)]^{\beta_j}\,[s_j(x, y)]^{\gamma_j}   (8)

The exponents \alpha, \beta, and \gamma are included to alter the relative importance of the various components.
quality lem of generators by introducing Wasserstein loss 4.7
to compute the distance between real and artificial data.
Wang, et al. [105] used structural information s(x, y), lumi- However, it is not guaranteed that replacing the min–max
nance information l(x, y), and contrast information c(x, y) loss with Wasserstein loss can eradicate the vanishing
to derive the structural similarity index (SSIM) as shown in gradient problem as it also depends on other factors like

123
International Journal of Multimedia Information Retrieval (2021) 10:1–24 19

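With a trained, approximately 1-Lipschitz critic in hand (for example one trained with the gradient penalty sketched in Sect. 3.9), Eq. 9 can be approximated as the gap between mean critic scores; the loader interface in the sketch below is an assumption:

```python
import torch

@torch.no_grad()
def wasserstein_estimate(critic, real_loader, fake_loader):
    """Approximate Eq. 9 as the difference between mean critic scores
    on real and generated batches."""
    real = torch.cat([critic(x).flatten() for x in real_loader])
    fake = torch.cat([critic(x).flatten() for x in fake_loader])
    return (real.mean() - fake.mean()).item()
```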
5 Challenges

Despite the success and extensive usage of GANs, they often face some common challenges during training. The three most important challenges faced by GANs are as follows:

- If the generator is not as good as the discriminator, then the discriminator always differentiates between real and artificial data. Hence, the gradients of the generator will vanish, leading to failure of the generator. [5] succeeded in eliminating the vanishing gradient problem of generators by introducing the Wasserstein loss (Sect. 4.7) to compute the distance between real and artificial data. However, it is not guaranteed that replacing the min-max loss with the Wasserstein loss can eradicate the vanishing gradient problem, as it also depends on other factors like the available data, hyperparameter settings, model structure, etc.
- The generator tries to over-optimize the discriminator in each epoch of GAN training. If the discriminator is caught in a local-minimum trap and always rejects every instance of its input, then the generator keeps generating the same set of instances. This is popularly known as the mode collapse problem of GANs. The Wasserstein loss [5] does not let the discriminator get stuck at a local optimum, and hence the generator produces a new set of outputs. [70] modelled the generator's objective in coherence with an optimal discriminator to alleviate the mode collapse problem, and [25] devised an empirical model that automatically detects the problem of mode anomaly.
- Often GANs fail to converge due to irregularities in the structure of the model, hyperparameter tuning, and training strategies. [4] added noise to the inputs of the discriminator for stable training and GAN convergence, and [84] introduced a novel regularization method to eliminate the convergence problem.
- The above-said challenges are discussed from the perspective of the algorithmic and training issues of GANs. GANs are highly successful in generating high-quality naturalistic images. However, the performance of GANs is questioned in creating fake videos, also called deepfakes: creating fake videos using deep learning techniques to swap the identity of a person with another person. Although deepfakes look realistic, it is difficult to create a deepfake that mimics natural eye blinking, since people rarely take pictures with their eyes closed. Also, creating deepfakes requires images of persons with similar skin tone, face orientation, etc.; otherwise, the output deepfake would not be optimal. Deepfakes created using pairwise deepfake auto-encoder (DFAE) models are higher in quality than deepfakes created using GAN-based methods on the deepfake detection challenge (DFDC) dataset [18].

Table 1 A summary of the tasks, applications, and the datasets used

Task/Reference | Applications | Datasets
Image dehazing [134] | Traffic monitoring; security monitoring | RESIDE; O-Haze
Face photo-sketch synthesis [135] | Digital entertainment; law enforcement | CUFS; CUFSF
Face frontalization [82] | Face recognition systems | MultiPIE; LFW; CFP
Cross-modal face recognition [42] | Nighttime face recognition | WSRI; UND X1; NVESD; Casia NIR-VIS; Casia HFB; Polarimetric Thermal
Person re-identification [125] | Surveillance systems | VIPeR; CUHK03; Market-1501
Text generation [115] | Dialogue systems; machine translation; writing poetry; BOT applications | COCO; Chinese poetry
Abstractive text summarization [141] | Newsletters; social media marketing | CNN-Daily Mail
Text style transfer [123] | Sentiment transformation; formality direction | YELP; AMAZON; CAPTION
Hyperspectral image classification [140] | Land change monitoring; resource management; remote sensing of agriculture | Salinas; Indian Pines; Kennedy Space Center
Video prediction [39] | Autonomous driving; robotic planning; object tracking | Moving MNIST; BAIR; KTH; UCF101
Semantic segmentation [73] | Autonomous driving; medical diagnostics; robotic systems | PASCAL VOC 2012; CamVid; CelebA
Fault diagnosis [89] | Industrial machine monitoring; risk analysis; clinical diagnosis | Induction motor vibration signals
ECG analysis [99] | Cardiovascular diseases | MIT-BIH database
MRI, PET, and CT analysis [21] | Cancer detection; functional abnormalities in body parts | T2-Flair; LIDC-IDRI
Mammogram segmentation [131] | Detection of breast cancer | INbreast; DDSM; Hospital Sant Joan de Reus
Data augmentation [79] | Speech recognition; audio denoising | Aurora-4; AMI-SDM
DeepFakes [18] | Educational applications; enlightening mankind | DFDC; CelebDF; FaceForensics++DF

6 Conclusion

This paper presents the ins and outs of GANs, derived GANs, their application areas, evaluation metrics, and the challenges involved in training GANs. A total of 88 publications are summarized based on their objective, using terminology that is easy for a naive researcher to understand. At this point, it may be noted that the main objective of some publications related to clinical diagnosis is segmentation, so those publications are discussed in detail in the respective Sect. 3.4. Image super-resolution is obviously an application of GANs worth discussing, but it is covered as part of Sect. 3.8 to limit the size of the paper. We also note that supervised, semi-supervised, and unsupervised algorithms are all discussed; however, semi-supervised and unsupervised algorithms are mostly used in cases of insufficient data. The applications of GANs are spread over diverse domains and are not limited to the ones discussed in this paper. Often, GANs face a challenge with an increase in the number of distributions in the real data. With sophisticated techniques for training GANs, especially when dealing with a large number of distributions, the applications of GANs can become even more widespread. The paper also provides a summary of the various applications, tasks achieved, and datasets in Table 1 for quick referencing.

References

1. AbdAlmageed W, Wu Y, Rawls S, Harel S, Hassner T, Masi I, Choi J, Lekust J, Kim J, Natarajan P, et al (2016) Face recognition using deep multi-pose representations. In: 2016 IEEE winter conference on applications of computer vision (WACV), IEEE, pp 1–9
2. Ak KE, Lim JH, Tham JY, Kassim AA (2020) Semantically consistent text to fashion image synthesis with an enhanced attentional generative adversarial network. Pattern Recognit Lett 135:22–29
3. Antipov G, Baccouche M, Dugelay JL (2017) Face aging with conditional generative adversarial networks. In: 2017 IEEE international conference on image processing (ICIP), IEEE, pp 2089–2093
4. Arjovsky M, Bottou L (2017) Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862
5. Arjovsky M, Chintala S, Bottou L (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875
6. Bisneto TRV, de Carvalho Filho AO, Magalhães DMV (2020) Generative adversarial network and texture features applied to automatic glaucoma detection. Appl Soft Comput 90:106165
7. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
8. Borji A (2019) Pros and cons of gan evaluation measures. Comput Vis Image Underst 179:41–65
9. Brock A, Lim T, Ritchie JM, Weston N (2016) Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093
10. Cai Y, Wang X, Yu Z, Li F, Xu P, Li Y, Li L (2019) Dualattn-gan: text to image synthesis with dual attentional generative adversarial network. IEEE Access 7:183706–183716
11. Che T, Li Y, Jacob AP, Bengio Y, Li W (2016) Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136
12. Chen J, Wu Y, Jia C, Zheng H, Huang G (2019) Customizable text generation via conditional text generative adversarial network. Neurocomputing. https://doi.org/10.1016/j.neucom.2018.12.092
13. Chen Q, Wu Q, Chen J, Wu Q, van den Hengel A, Tan M (2020) Scripted video generation with a bottom-up generative adversarial network. IEEE Trans Image Process 29:7454–7467
14. Chen W, Hu H (2020) Generative attention adversarial classification network for unsupervised domain adaptation. Pattern Recognit 107:107440
15. Choi Y, Choi M, Kim M, Ha JW, Kim S, Choo J (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8789–8797
16. Chonwiharnphan P, Thienprapasith P, Chuangsuwanich E (2020) Generating realistic users using generative adversarial network with recommendation-based embedding. IEEE Access 8:41384–41393
17. Delannoy Q, Pham CH, Cazorla C, Tor-Díez C, Dollé G, Meunier H, Bednarek N, Fablet R, Passat N, Rousseau F (2020) Segsrgan: super-resolution and segmentation using generative adversarial networks-application to neonatal brain mri. Comput Biol Med 120:103755
18. Dolhansky B, Bitton J, Pflaum B, Lu J, Howes R, Wang M, Ferrer CC (2020) The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397
19. Dong J, Yin R, Sun X, Li Q, Yang Y, Qin X (2018) Inpainting of remote sensing sst images with deep convolutional generative adversarial network. IEEE Geosci Remote Sens Lett 16(2):173–177
20. Feng J, Yu H, Wang L, Cao X, Zhang X, Jiao L (2019) Classification of hyperspectral images based on multiclass spatial-spectral generative adversarial networks. IEEE Trans Geosci Remote Sens 57(8):5329–5343
21. Gao Y, Liu Y, Wang Y, Shi Z, Yu J (2019) A universal intensity standardization method based on a many-to-one weak-paired cycle generative adversarial network for magnetic resonance images. IEEE Trans Med Imaging 38(9):2059–2069
22. Ghassemi N, Shoeibi A, Rouhani M (2020) Deep neural network with generative adversarial networks pre-training for brain tumor classification based on mr images. Biomed Signal Process Control 57:101678
23. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
24. Grover A, Dhar M, Ermon S (2017) Flow-gan: combining maximum likelihood and adversarial learning in generative models. arXiv preprint arXiv:1705.08868
25. Guo C, Huang D, Zhang J, Xu J, Bai G, Dong N (2020) Early prediction for mode anomaly in generative adversarial network training: an empirical study. Inf Sci 534:117–138
26. Hamada K, Tachibana K, Li T, Honda H, Uchida Y (2018) Full-body high-resolution anime generation with progressive structure-conditional generative adversarial networks. In: Proceedings of the European conference on computer vision (ECCV)
27. Han L, Huang Y, Dou H, Wang S, Ahamad S, Luo H, Liu Q, Fan J, Zhang J (2020) Semi-supervised segmentation of lesion from breast ultrasound images with attentional generative adversarial network. Comput Methods Programs Biomed 189:105275
28. Han Z, Huang H, Huang T, Cao J (2019) Face merged generative adversarial network with tripartite adversaries. Neurocomputing 368:188–196
29. Harada S, Hayashi H, Uchida S (2019) Biosignal generation and latent variable analysis with recurrent generative adversarial networks. IEEE Access 7:144292–144302
30. Hayashi H, Abe K, Uchida S (2019) Glyphgan: style-consistent font generation based on generative adversarial networks. Knowl-Based Syst 186:104927
31. He J, Zheng J, Shen Y, Guo Y, Zhou H (2020a) Facial image synthesis and super-resolution with stacked generative adversarial network. Neurocomputing 402:359–365
32. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
33. He R, Li X, Chen G, Chen G, Liu Y (2020b) Generative adversarial network-based semi-supervised learning for real-time risk warning of process industries. Expert Syst Appl 150:113244
34. He X, Fang L, Rabbani H, Chen X, Liu Z (2020c) Retinal optical coherence tomography image classification with label smoothing generative adversarial network. Neurocomputing 405:37–47
35. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in neural information processing systems, pp 6626–6637
36. Hinton GE (2012) A practical guide to training restricted Boltzmann machines. In: Montavon G, Orr GB, Muller KR (eds) Neural networks: tricks of the trade. Springer, Berlin, pp 599–619
37. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
38. Hsu CC, Lin CW, Su WT, Cheung G (2019) Sigan: siamese generative adversarial network for identity-preserving face hallucination. IEEE Trans Image Process 28(12):6225–6236
39. Hu Z, Turki T, Wang JT (2020) Generative adversarial networks for stochastic video prediction with action control. IEEE Access 8:63336–63348
40. Huang S, Lei K (2020) Igan-ids: an imbalanced generative adversarial network towards intrusion detection system in ad-hoc networks. Ad Hoc Netw 105:102177
41. Im DJ, Kim CD, Jiang H, Memisevic R (2016) Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110
42. Iranmanesh SM, Riggan B, Hu S, Nasrabadi NM (2020) Coupled generative adversarial network for heterogeneous face recognition. Image Vis Comput 94:103861
43. Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1125–1134
44. Jin X, Chen Z, Li W (2020) Ai-gan: asynchronous interactive generative adversarial network for single image rain removal. Pattern Recogn 100:107143
45. Jin Y, Zhang J, Li M, Tian Y, Zhu H, Fang Z (2017) Towards the automatic anime characters creation with generative adversarial networks. arXiv preprint arXiv:1708.05509
46. Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196
47. Kasem HM, Hung KW, Jiang J (2019) Spatial transformer generative adversarial network for robust image super-resolution. IEEE Access 7:182993–183009
48. Kim HJ, Lee D (2020) Image denoising with conditional generative adversarial networks (cgan) in low dose chest images. Nucl Instrum Methods Phys Res, Sect A 954:161914
49. Kim K, Myung H (2018) Autoencoder-combined generative adversarial networks for synthetic image data generation and detection of jellyfish swarm. IEEE Access 6:54207–54214
50. Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover cross-domain relations with generative adversarial networks. In: Proceedings of the 34th international conference on machine learning, volume 70, JMLR.org, pp 1857–1865
51. Kuang Y, Lan T, Peng X, Selasi GE, Liu Q, Zhang J (2020) Unsupervised multi-discriminator generative adversarial network for lung nodule malignancy classification. IEEE Access 8:77725–77734
52. Kupyn O, Budzan V, Mykhailych M, Mishkin D, Matas J (2018) Deblurgan: blind motion deblurring using conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8183–8192
53. Kwak H, Zhang BT (2016) Generating images part by part with composite generative adversarial networks. arXiv preprint arXiv:1607.05387
54. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
55. Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, et al (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4681–4690
56. Lee M, Seok J (2019) Controllable generative adversarial network. IEEE Access 7:28158–28169
57. Lei B, Xia Z, Jiang F, Jiang X, Ge Z, Xu Y, Qin J, Chen S, Wang T, Wang S (2020) Skin lesion segmentation via generative adversarial networks with dual discriminators. Med Image Anal 64:101716
58. Lei M, Zhou Y, Zhou L, Zheng J, Li M, Zou L (2019) Noise-robust wagon text extraction based on defect-restore generative adversarial network. IEEE Access 7:168236–168246
59. Li H, Tang J (2020) Dairy goat image generation based on improved-self-attention generative adversarial networks. IEEE Access 8:62448–62457
60. Li J, Liang X, Wei Y, Xu T, Feng J, Yan S (2017a) Perceptual generative adversarial networks for small object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1222–1230
61. Li J, Monroe W, Shi T, Jean S, Ritter A, Jurafsky D (2017b) Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547
62. Li X, Chen L, Wang L, Wu P, Tong W (2018a) Scgan: disentangled representation learning by adding similarity constraint on generative adversarial nets. IEEE Access 7:147928–147938
63. Li Y, Pan Q, Wang S, Yang T, Cambria E (2018b) A generative model for category text generation. Inf Sci 450:301–315
64. Lin S, He Z, Sun L (2019) Defect enhancement generative adversarial network for enlarging data set of microcrack defect. IEEE Access 7:148413–148423
65. Liu L, Wang S, Wan L (2019) Component semantic prior guided generative adversarial network for face super-resolution. IEEE Access 7:77027–77036
66. Liu MY, Tuzel O (2016) Coupled generative adversarial networks. In: Advances in neural information processing systems, pp 469–477
67. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: single shot multibox detector. In: European conference on computer vision, Springer, pp 21–37
68. Mandal B, Puhan NB, Verma A (2018) Deep convolutional generative adversarial network-based food recognition using partially labeled data. IEEE Sens Lett 3(2):1–4
69. Masi I, Tran AT, Hassner T, Leksut JT, Medioni G (2016) Do we really need to collect millions of faces for effective face recognition? In: European conference on computer vision, Springer, pp 579–596
70. Metz L, Poole B, Pfau D, Sohl-Dickstein J (2016) Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163
71. Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784
72. Odena A, Olah C, Shlens J (2016) Conditional image synthesis with auxiliary classifier gans. arXiv e-prints arXiv:1610.09585
73. Oluwasanmi A, Aftab MU, Shokanbi A, Jackson J, Kumeda B, Qin Z (2020) Attentively conditioned generative adversarial network for semantic segmentation. IEEE Access 8:31733–31741
74. Pan T, Chen J, Xie J, Chang Y, Zhou Z (2020) Intelligent fault identification for industrial automation system via multi-scale convolutional generative adversarial network with partially labeled samples. ISA Trans 101:379–389
75. Pang Y, Xie J, Li X (2018) Visual haze removal by a unified generative adversarial network. IEEE Trans Circuits Syst Video Technol 29(11):3211–3221
76. Park J, Han DK, Ko H (2020) Fusion of heterogeneous adversarial networks for single image dehazing. IEEE Trans Image Process 29:4721–4732
77. Perarnau G, Van De Weijer J, Raducanu B, Álvarez JM (2016) Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355
78. Qi L, Zhang H, Tan W, Qi S, Xu L, Yao Y, Qian W (2019) Cascaded conditional generative adversarial networks with multi-scale attention fusion for automated bi-ventricle segmentation in cardiac mri. IEEE Access 7:172305–172320
79. Qian Y, Hu H, Tan T (2019) Data augmentation using generative adversarial networks for robust speech recognition. Speech Commun 114:1–9
80. Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
81. Rizzo G, Van THM (2020) Adversarial text generation with context adapted global knowledge and a self-attentive discriminator. Inf Process Manag 102217
82. Rong C, Zhang X, Lin Y (2020) Feature-improving generative adversarial network for face frontalization. IEEE Access 8:68842–68851
83. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional network for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 234–241
84. Roth K, Lucchi A, Nowozin S, Hofmann T (2017) Stabilizing training of generative adversarial networks through regularization. In: Advances in neural information processing systems, pp 2018–2028
85. Ruan Y, Li D, Marshall H, Miao T, Cossetto T, Chan I, Daher O, Accorsi F, Goela A, Li S (2020) Mb-fsgan: joint segmentation and quantification of kidney tumor on ct by the multi-branch feature sharing generative adversarial network. Med Image Anal 64:101721
86. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. In: Advances in neural information processing systems, pp 2234–2242
87. Schlegl T, Seeböck P, Waldstein SM, Schmidt-Erfurth U, Langs G (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International conference on information processing in medical imaging, Springer, pp 146–157
88. Shaker AM, Tantawi M, Shedeed HA, Tolba MF (2020) Generalization of convolutional neural networks for ecg classification using generative adversarial networks. IEEE Access 8:35592–35605
89. Shao S, Wang P, Yan R (2019) Generative adversarial networks for data augmentation in machine fault diagnosis. Comput Ind 106:85–93
90. Shi Y, Li Q, Zhu XX (2018) Building footprint generation using improved generative adversarial networks. IEEE Geosci Remote Sens Lett 16(4):603–607
91. Singh VK, Rashwan HA, Romani S, Akram F, Pandey N, Sarker MMK, Saleh A, Arenas M, Arquez M, Puig D et al (2020) Breast tumor segmentation and shape classification in mammograms using generative adversarial and convolutional neural network. Expert Syst Appl 139:112855
92. Sun Y, Tang J, Shu X, Sun Z, Tistarelli M (2020) Facial age synthesis with label distribution-guided generative adversarial network. IEEE Trans Inf Forensics Secur 15:2679–2691
93. Taigman Y, Polyak A, Wolf L (2016) Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200
94. Tseng BW, Wu PY (2020) Compressive privacy generative adversarial network. IEEE Trans Inf Forensics Secur 15:2499–2513
95. Wang C, Xu C, Yao X, Tao D (2019a) Evolutionary generative adversarial networks. IEEE Trans Evol Comput 23(6):921–934
96. Wang H, Tao C, Qi J, Li H, Tang Y (2019b) Semi-supervised variational generative adversarial networks for hyperspectral image classification. In: IGARSS 2019-2019 IEEE international geoscience and remote sensing symposium, IEEE, pp 9792–9794
97. Wang J, Yang Z, Zhang J, Zhang Q, Chien WTK (2019c) Adabalgan: an improved generative adversarial network with imbalanced learning for wafer defective pattern recognition. IEEE Trans Semicond Manuf 32(3):310–319
98. Wang K, Wan X (2019) Automatic generation of sentimental texts via mixture adversarial networks. Artif Intell 275:540–558
99. Wang P, Hou B, Shao S, Yan R (2019d) Ecg arrhythmias detection using auxiliary classifier generative adversarial network and residual network. IEEE Access 7:100910–100922
100. Wang Q, Fan H, Zhu L, Tang Y (2018a) Deeply supervised face completion with multi-context generative adversarial network. IEEE Signal Process Lett 26(3):400–404
101. Wang TC, Liu MY, Zhu JY, Tao A, Kautz J, Catanzaro B (2018b) High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8798–8807
102. Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Change Loy C (2018c) Esrgan: enhanced super-resolution generative adversarial networks. In: Proceedings of the European conference on computer vision (ECCV)
103. Wang YR, Sun GD, Jin Q (2020) Imbalanced sample fault diagnosis of rotating machinery using conditional variational auto-encoder generative adversarial network. Appl Soft Comput 92:106333
104. Wang Z, Simoncelli EP, Bovik AC (2003) Multiscale structural similarity for image quality assessment. In: The Thirty-Seventh Asilomar conference on signals, systems & computers, 2003, IEEE, vol 2, pp 1398–1402
105. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
106. Wen S, Liu W, Yang Y, Huang T, Zeng Z (2018) Generating realistic videos from keyframes with concatenated gans. IEEE Trans Circuits Syst Video Technol 29(8):2337–2348
107. Wu J, Zhang C, Xue T, Freeman B, Tenenbaum J (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In: Advances in neural information processing systems, pp 82–90
108. Wu R, Gu X, Tao X, Shen X, Tai YW, Jia J (2019) Landmark assisted cyclegan for cartoon face generation. arXiv preprint arXiv:1907.01424
109. Xiang P, Wang L, Wu F, Cheng J, Zhou M (2019) Single-image de-raining with feature-supervised generative adversarial network. IEEE Signal Process Lett 26(5):650–654
110. Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1316–1324
111. Yan K, Chong A, Mo Y (2020) Generative adversarial network for fault detection diagnosis of chillers. Build Environ 172:106698
112. Yanagi R, Togo R, Ogawa T, Haseyama M (2019) Query is gan: scene retrieval with attentional text-to-image generative adversarial network. IEEE Access 7:153183–153193
113. Yang S, Xie L, Chen X, Lou X, Zhu X, Huang D, Li H (2017) Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework. In: 2017 IEEE automatic speech recognition and understanding workshop (ASRU), IEEE, pp 685–691
114. Yang W, Hui C, Chen Z, Xue JH, Liao Q (2019a) Fv-gan: finger vein representation using generative adversarial networks. IEEE Trans Inf Forensics Secur 14(9):2512–2524
115. Yang Y, Dan X, Qiu X, Gao Z (2020) Fggan: feature-guiding generative adversarial networks for text generation. IEEE Access 8:105217–105225
116. Yang Z, Chen Y, Le Z, Fan F, Pan E (2019b) Multi-source medical image fusion based on wasserstein generative adversarial networks. IEEE Access 7:175947–175958
117. Ye F, Zhu F, Fu Y, Shen B (2019) Ecg generation with sequence generative adversarial nets optimized by policy gradient. IEEE Access 7:159369–159378
118. Yoo D, Kim N, Park S, Paek AS, Kweon IS (2016) Pixel-level domain transfer. In: European conference on computer vision, Springer, pp 517–532
119. You S, You N, Pan M (2019) Pi-rec: progressive image reconstruction network with edge and color domain. arXiv preprint arXiv:1903.10146
120. Yu C, Wang W, Yan J (2020a) Self-supervised animation synthesis through adversarial training. IEEE Access 8:128140–128151
121. Yu H, Li Z, Zhang G, Liu P, Wang J (2020b) Extracting and predicting taxi hotspots in spatiotemporal dimensions using conditional generative adversarial neural networks. IEEE Trans Veh Technol 69(4):3680–3692
122. Yu L, Zhang W, Wang J, Yu Y (2016) Seqgan: sequence generative adversarial nets with policy gradient. arXiv e-prints arXiv:1609.05473
123. Yu W, Chang T, Guo X, Wang X, Liu B, He Y (2020c) Ugan: unified generative adversarial networks for multidirectional text style transfer. IEEE Access 8:55170–55180
124. Yuan J, He Z (2020) Adversarial dual network learning with randomized image transform for restoring attacked images. IEEE Access 8:22617–22624
125. Zhang C, Wu L, Wang Y (2019a) Crossing generative adversarial networks for cross-view person re-identification. Neurocomputing 340:259–269
126. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 5907–5915
127. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 41(8):1947–1962
128. Zhang H, Goodfellow I, Metaxas D, Odena A (2019b) Self-attention generative adversarial networks. In: International conference on machine learning, PMLR, pp 7354–7363
129. Zhang W, Li X, Jia XD, Ma H, Luo Z, Li X (2020a) Machinery fault diagnosis with imbalanced data using deep generative adversarial networks. Measurement 152:107377
130. Zhang X, Zhou Y, Pei S, Zhuge J, Chen J (2020b) Adversarial examples detection for xss attacks based on generative adversarial networks. IEEE Access 8:10989–10996
131. Zhang Y, Miao S, Mansi T, Liao R (2020c) Unsupervised x-ray image segmentation with task driven generative adversarial networks. Med Image Anal 62:101664
132. Zhao B, Li W, Gong W (2019a) Deep pyramid generative adversarial network with local and nonlocal similarity features for natural motion image deblurring. IEEE Access 7:185893–185907
133. Zhao J, Mathieu M, LeCun Y (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126
134. Zhao J, Zhang J, Li Z, Hwang JN, Gao Y, Fang Z, Jiang X, Huang B (2019b) Dd-cyclegan: unpaired image dehazing via double-discriminator cycle-consistent generative adversarial network. Eng Appl Artif Intell 82:263–271
135. Zheng J, Song W, Wu Y, Xu R, Liu F (2019) Feature encoder guided generative adversarial network for face photo-sketch synthesis. IEEE Access 7:154971–154985
136. Zhou Z, Wang Y, Guo Y, Qi Y, Yu J (2019) Image quality improvement of hand-held ultrasound devices with a two-stage generative adversarial network. IEEE Trans Biomed Eng 67(1):298–311
137. Zhu D, Xia S, Zhao J, Zhou Y, Jian M, Niu Q, Yao R, Chen Y (2020) Diverse sample generation with multi-branch conditional generative adversarial network for remote sensing objects detection. Neurocomputing 381:40–51
138. Zhu JY, Krähenbühl P, Shechtman E, Efros AA (2016) Generative visual manipulation on the natural image manifold. In: European conference on computer vision, Springer, pp 597–613
139. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2223–2232
140. Zhu L, Chen Y, Ghamisi P, Benediktsson JA (2018) Generative adversarial networks for hyperspectral image classification. IEEE Trans Geosci Remote Sens 56(9):5046–5063
141. Zhuang H, Zhang W (2019) Generating semantically similar and human-readable summaries with generative adversarial networks. IEEE Access 7:169426–169433

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.