Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments
Image Processing Research Lab, Department of Computer Engineering & Information Technology
Razi University, Kermanshah, IRAN
† [email protected]
* [email protected] (Corresponding author)
‡ [email protected]
One of the most significant challenges in statistical signal processing and machine learning is how to obtain a generative model that can produce samples from large-scale data distributions, such as images and speech. The Generative Adversarial Network (GAN) is an effective method to address this problem. GANs provide an appropriate way to learn deep representations without widespread use of labeled training data. This approach has attracted the attention of many researchers in computer vision because it can generate a large amount of data without precise modeling of the probability density function (PDF). In GANs, the generative model is estimated via a competitive process in which the generator and discriminator networks are trained simultaneously. The generator learns to generate plausible data, and the discriminator learns to distinguish fake data created by the generator from real data samples. Given the rapid growth of GANs over the last few years and their application in various fields, it is necessary to investigate these networks accurately. In this paper, after introducing the main concepts and the theory of GANs, two recent deep generative models are compared, and the evaluation metrics used in the literature and the challenges of GANs are explained. Moreover, the most remarkable GAN architectures are categorized and discussed. Finally, the essential applications in computer vision are examined.
Keywords: Deep learning, Deep generative models, Generative Adversarial Networks, Semi-supervised learning,
Unsupervised learning
1 | Introduction
The past several decades have witnessed a rapid expansion of artificial intelligence and its application in various sciences, following an increase in the power of computational systems and the emergence of large datasets in different industries.
Machine learning[1], one of the broad and extensively used branches of artificial intelligence, is concerned with the design and exploration of procedures and algorithms through which computers and systems develop their learning capabilities. Machine learning algorithms need to extract features from raw data. In earlier methods, these features were provided manually and fed to the algorithm concerned, a time-consuming and, under certain circumstances, incomplete task. Representation learning, or feature learning[2], gives the system the ability to automatically discover the representations required for feature detection, classification, and other tasks. In other words, representation learning transforms input data into meaningful outputs. Deep learning[3] is a kind of representation learning intended to model highly abstract concepts in a dataset using a set of algorithms. This process is modeled by a deep graph consisting of several layers of linear and nonlinear transformations. Fig. 1 illustrates these definitions as a hierarchy.
Machine learning algorithms are broadly separated into two main categories: supervised learning and unsupervised learning. Supervised learning needs a dataset with various features where each data point is labeled; these algorithms are used to solve classification and regression problems. In contrast, unsupervised learning works with unlabeled data. In this type of learning, the network is not told what pattern to look for, and there is no clear error metric. Some common examples of unsupervised learning include generative models, density estimation, clustering, noise generation, and noise elimination.
In supervised learning, manual collection and management of labeled data is costly and time-consuming; besides, automated data collection is also difficult and complicated. In deep learning, one of the vital tricks to alleviate this problem is data augmentation. Applying this method improves the skill of the model, has a regularizing effect, and reduces generalization error. Data augmentation creates new, plausible samples from the training dataset, for instance by applying operators such as rotation, cropping, zooming, and other simple transformations to images (a sketch follows below). Nevertheless, only data with limited information can be obtained this way. The state-of-the-art form of data augmentation is the generation of high-quality samples through generative models. Hence, considering the ability of generative networks to generate images on a large scale, the severe shortage of labeled data is expected to be substantially mitigated.
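As an illustration, the following is a minimal sketch of such an augmentation pipeline using torchvision; the specific transforms and parameter values are illustrative choices, not prescriptions from the text.

```python
import torchvision.transforms as T

# A simple augmentation pipeline: each training image is randomly
# rotated, cropped/zoomed, and flipped to create new, plausible samples.
augment = T.Compose([
    T.RandomRotation(degrees=15),                 # small random rotation
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop + zoom
    T.RandomHorizontalFlip(p=0.5),                # mirror half the images
    T.ToTensor(),
])

# Applying the pipeline to a PIL image yields a different augmented
# tensor each call, effectively enlarging the training set:
# augmented = augment(pil_image)
```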
Generative models commonly work based on Markov chains, maximum likelihood estimation (MLE), and approximate inference. The Restricted Boltzmann Machine (RBM)[4] and models developed from it, such as the Deep Belief Network (DBN)[5] and the Deep Boltzmann Machine (DBM)[6], are based on MLE. These methods fit the generated data distribution to the empirical distribution of the training data. Such models have several severe constraints and may not generalize well.
Generative Adversarial Networks (GANs) were proposed as an approach to semi-supervised and unsupervised learning by Ian Goodfellow[7]. Yann LeCun, director of AI research at Facebook, described adversarial training as the most interesting idea of the past ten years in machine learning[8]. Fig. 2 clearly shows the rapid growth in the number of published articles in the field of GANs in recent years. GANs have shown impressive improvements over previous generative methods, such as variational autoencoders and restricted Boltzmann machines. Fig. 3 shows the progress of GANs in face generation over several consecutive years.
Fig. 2. The number of GAN-related articles published per year from 2014 to 2019: 1, 3, 24, 278, 1356, and 3006, respectively.
Several reviews of generative adversarial networks have already been conducted; collectively, they have dealt with the introduction of GANs and their applications in various fields, such as computer vision[14], signal processing[15], image synthesis and editing[16], and speech processing[17], with how to combine the GAN with an autoencoder[18], with the most notable GAN architectures[19], and with the relationship between GANs and parallel intelligence[20].
The main idea of GAN is inspired by a two-person zero-sum game, where the profit (or loss) of one participant is precisely equal to the loss (or profit) of the other, so the total gains of the participants minus the total losses sum to zero. The GAN architecture consists of two networks that are trained together: the generator and the discriminator. The generator tries to learn the statistical distribution of real data so as to generate fake data indistinguishable from real-world data and mislead the discriminator into accepting it as real input. In contrast, the discriminator is a classifier that determines whether a given input looks like real data from the dataset or like artificially synthesized data. As both participants continuously optimize themselves to improve their capabilities, attempting to learn from their own weaknesses and exploit the weaknesses of the other, the neural networks become stronger during the training process. The optimization process aims to establish a Nash equilibrium between the two participants. In economics and game theory, a Nash equilibrium is a stable state of a system involving interaction between various participants in which no participant can benefit by unilaterally changing strategy while the strategies of the others remain fixed. This is exactly what GAN training aims for: the generator and discriminator reach a state where neither can progress without a change by the other.
Nowadays, GANs are widely used in various applications, such as text-to-image synthesis, image-to-image translation, and many potential medical applications. Fig. 4 shows the percentage of the total number of articles published until 2019 in different disciplines.
Fig. 4. Taxonomy of the number of articles indexed in Scopus based on different disciplines from 2014 to 2019. The chart is from[9].
Given the importance of GAN and its application in various scientific fields, it is necessary to introduce it comprehensively, to investigate the research carried out in this field, and to describe its challenges; this paper addresses these issues. It is worth noting that a better understanding of GANs requires a grasp of the concepts of deep learning. The book[21] introduces the basics of deep learning theory and its mathematical details. Another book[22] explains the common themes and concepts of deep learning through coding in the Python programming language.
This paper is structured as follows. Section II clarifies the main concepts and theory of GAN; then, after comparing two recently introduced deep generative models, it describes the evaluation metrics and the challenges facing GANs. Section III lists the GAN architectures and addresses the most prominent and widely used ones. Section IV describes some significant applications of GAN in the field of computer vision. Finally, Section V presents conclusions and new directions.
Fig. 5. The basic GAN structure: the generator G maps a noise vector z to a fake sample G(z); the discriminator D receives real samples X_data or fake samples G(z) and outputs D(x), i.e., real or fake; J_D and J_G denote the discriminator and generator loss functions.
A discriminator acts as a binary classifier and differentiates fake G(z) samples from real X_data samples. The discriminator is trained to maximize the likelihood of assigning the correct labels to real and fake data. In other words, if the input consists of real X_data data, the discriminator classifies it as real and returns a numeric value close to 1. Otherwise, if the input is composed of data generated by the generator, the discriminator classifies it as fake and returns a numeric value close to 0.
The generator and the discriminator can be neural networks, convolutional neural networks, recurrent neural networks,
and autoencoders. Therefore, the discriminator requires the loss function JD and the generator requires the loss function JG
to update the networks (Fig. 5). The generator updates its parameters only through the backpropagation signals of the fake
output. By contrast, the discriminator receives more information and updates its weights using fake and real output.
GAN can be modeled as a two-player minimax game in which the generator and discriminator networks are trained simultaneously. The minimax GAN loss is an optimization strategy in two-player games whereby each player reduces their own loss while increasing the cost of the other player. In GAN, the generator and discriminator represent the two players, which in turn update their network weights. Minimax refers to minimizing the loss in the generator and maximizing the loss in the discriminator[25]. Put differently, the discriminator seeks to maximize the probability of assigning proper labels to the data, while the generator seeks to generate samples close to the real data distribution so as to minimize the cross-entropy.
One topic that remains challenging for beginners is GAN loss functions. The GAN optimization strategy, as a minimax problem, is presented in Equation 3; for better understanding, it is broken down into Equations 1 and 2, in which $\mathbb{E}$ denotes mathematical expectation, $p_{data}$ stands for the distribution of real data, and $p_z$ is the random noise distribution.

$$\text{if } X = X_{data} \Rightarrow D(X) \rightarrow 1 \Rightarrow \max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] \tag{1}$$

$$\text{if } X = G(z) \Rightarrow D(X) \rightarrow 0 \Rightarrow \max_{D} V(D,G) = \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] \tag{2}$$

$$\min_{G} \max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] \tag{3}$$
According to Equation 1, if X = X_data (X is the input to the discriminator), the discriminator is expected to output a numeric value close to 1; that is, X follows the real data distribution, and V(D,G) is maximized. According to Equation 2, if X = G(z), there are two different perspectives. The first addresses the problem from the discriminator's perspective: the discriminator is expected to detect that the generated sample is fake and to output a numeric value close to 0; V(D,G) should also be maximized under these circumstances. The second addresses the problem from the generator's perspective: here, the ideal case for the generator is to mislead the discriminator, i.e., to make it output a numeric value close to 1. In other words, the generator is trained to fool the discriminator by minimizing V(D,G) and capturing the real data distribution. Finally, from a mathematical point of view, Equation 3 expresses a two-player minimax game with value function V(D,G).
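To make the objective concrete, below is a minimal PyTorch sketch of one training iteration of this minimax game. It assumes G, D (with a sigmoid output), their optimizers, and the data batches already exist; the generator update uses the common non-saturating variant rather than minimizing $\log(1 - D(G(z)))$ directly.

```python
import torch
import torch.nn.functional as F

def gan_train_step(G, D, opt_G, opt_D, real, z):
    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    opt_D.zero_grad()
    d_real = D(real)                          # pushed toward 1 (Eq. 1)
    d_fake = D(G(z).detach())                 # pushed toward 0 (Eq. 2)
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # --- Generator step: fool D; the non-saturating form instead
    # maximizes log D(G(z)), which gives stronger early gradients ---
    opt_G.zero_grad()
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```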
Fig. 6 illustrates several steps of the simultaneous training of the generator and discriminator in a GAN. In Fig. 6(a), the GAN is trained by simultaneously updating the discriminative distribution (blue, dashed line) so that it distinguishes between samples from the real data distribution (black, dotted line) and the generated data distribution (green, solid line). In Fig. 6(b), the discriminator is trained to discriminate between real and fake data, and it easily does its task. In Fig. 6(c), discriminator training is stopped, and only the generator is trained to bring the fake data distribution closer to the real data distribution. These updates continue until the discriminator can no longer distinguish the two (Fig. 6(d)). It is worth noting that training GANs is not as simple and straightforward as the process presented in Fig. 6: the fake data distribution completely overlaps the real data distribution only under ideal conditions, while in practice there are various challenges.
Fig. 6. An example of a GAN training process: evolution of the generated data distribution (green) towards the real data distribution (black) and the decision boundary (blue). The figure is from[7].
GANs are a group of networks with a very complex and challenging training process because both generator and
discriminator networks are trained simultaneously in an adversarial manner. The whole basis of GANs is the equilibrium
between the two networks. In other words, the nature of the optimization problem changes every time the parameters of
one of the networks are updated, resulting in the establishment of a dynamic system. The technical challenge facing the
training of two competing neural networks is their delayed convergence[25].
2.2 | Evaluation Metrics
GAN is a practical deep learning approach for developing generative models. Generally, deep learning models are trained until the cost function converges, whereas GAN training relies on the balance between the generator and the discriminator. Thus, one of the difficulties with adversarial networks is how to make a fair comparison and evaluate the strengths and weaknesses of various models. To the best of the researchers' knowledge, no consensus has been reached so far on a relative or absolute measure of quality or a settled training cost function. Nonetheless, a set of widely used quantitative and qualitative methods can be categorized into three classes: manual evaluation, qualitative evaluation, and quantitative evaluation (see Fig. 7). In the following, a full explanation of each class is presented.
Fig. 7. A taxonomy of GAN evaluation metrics.
1. Manual evaluation is a technique used to assess the quality and diversity of the generated images. In this technique, the generated image and the target image are compared by the researchers themselves or by a person with related expertise. Visual inspection of samples by humans is one of the most common and intuitive GAN evaluation methods[27]. Like other deep learning models, a generative model is trained over many epochs. The exact time to stop training and save the final model for subsequent use cannot be detected, because there is no proper metric of model performance. Hence, the model can be saved regularly, for example once every other epoch, and the final model selected by manual inspection of the generated images. This evaluation method is a good starting point for beginners to get acquainted with a proposed architecture.
Although it is the most straightforward model evaluation method, manual evaluation involves numerous limitations. Evaluating image quality from a personal point of view is relative and arbitrary, and bias may enter the comparison. Furthermore, it is expensive and time-consuming.
2. Qualitative evaluation is a series of non-numeric metrics, often involving comparative or subjective evaluation. Five qualitative methods for evaluating GAN models have been proposed[27]: a) nearest neighbors, b) preference judgment, c) rapid scene categorization, d) mode drop and collapse, and e) network internals.
Fig. 8. Generated samples nearest to real images from CIFAR-10. The first column shows real images, followed by the nearest images generated by DCGAN[10], ALI[28], Unrolled GAN[29], and VEEGAN[30], respectively.
The “nearest neighbors” method is one of the most well-known approaches to evaluating the performance of a generator model: it selects some real images and, for each, one or more of the most similar generated images for comparison (Fig. 8). Distance metrics, such as the Euclidean distance between pixel values, are often used to select the generated sample most similar to each real image, as sketched below. The nearest neighbors approach can help evaluate the degree to which the generated images look real.
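A sketch of such a retrieval, using the Euclidean distance over raw pixel values; the array shapes and function name are illustrative assumptions.

```python
import numpy as np

def nearest_generated(real_imgs, fake_imgs):
    """For each real image, return the index of the closest generated
    image under pixel-wise Euclidean distance."""
    r = real_imgs.reshape(len(real_imgs), -1).astype(np.float64)
    f = fake_imgs.reshape(len(fake_imgs), -1).astype(np.float64)
    # Pairwise squared distances: |r|^2 + |f|^2 - 2 r.f
    d2 = (r**2).sum(1)[:, None] + (f**2).sum(1)[None, :] - 2.0 * r @ f.T
    return d2.argmin(axis=1)

# e.g., to build one row of a comparison figure: each real CIFAR-10
# image next to the most similar sample produced by the generator.
# idx = nearest_generated(real_batch, generated_batch)
```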
The “preference judgment” method is a qualitative evaluation technique that extends manual evaluation. In this type of experiment, individuals are asked to rank generated images in terms of fidelity.
The “rapid scene categorization” method is similar, except that the images are shown to human judges for only a split second, and they are asked to classify them as real or fake. The variance in judgment is reduced by averaging the scores of different judges. Although the method is complicated and time-consuming, its costs can be reduced by using a crowdsourcing platform such as Amazon Mechanical Turk (MTurk). Another disadvantage of this approach is the instability of human judgment, which changes as judges improve over time.
One of the significant shortcomings in the development of GANs is “mode drop and collapse.” Mode drop occurs when the trained generator omits some modes of the real data distribution from its outputs. Mode collapse, on the other hand, means that the diversity of the generated samples is limited across different latent codes. In [30]–[32], several methods have been introduced to evaluate mode drop and mode collapse.
Though inspecting and visualizing “network internals” is a broad topic, it can be applied to find out which features are captured in the latent layers. The quality of internal representations can be evaluated by studying how the network is trained and understanding what it learns in the latent layers.
3. Quantitative evaluation calculates numerical scores to compare the quality of the generated images. In the following, some of the most widely used metrics for quantitative evaluation are discussed.
The “Inception Score” (IS)[33] is an objective evaluation method that assesses two features: the quality and the diversity of the generated images. Its authors used the MTurk platform to evaluate a large number of generated images, showing that IS correlates well with subjective human evaluation; indeed, the metric was introduced as an attempt to eliminate subjective human evaluation. IS uses the pre-trained Inception v3 network[34], and its minimum value is 1.0. Higher IS values suggest higher quality of the generated samples. IS is considered a useful and widely used metric; however, when the generator reaches mode collapse, it may still report a good value. A sketch of its computation follows.
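For concreteness, here is a sketch of the IS computation from the class probabilities $p(y|x)$ that a pre-trained Inception v3 assigns to each generated image (extracting those probabilities is omitted); it implements $IS = \exp(\mathbb{E}_x[KL(p(y|x)\,\|\,p(y))])$.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, 1000) softmax outputs of Inception v3 for N generated
    images. Returns exp(E_x KL(p(y|x) || p(y))), which is >= 1.0."""
    p_y = probs.mean(axis=0, keepdims=True)   # marginal label distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```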
An evaluation metric called the “Mode Score” (MS), introduced in [35] and based on IS, can concurrently reflect the diversity and visual quality of the generated samples. It overcomes a problem with IS, namely its insensitivity to the prior distribution of ground-truth labels (i.e., disregarding the dataset)[27].
The “Fréchet Inception Distance” (FID)[36] is an improved IS-style metric that can detect inter-class mode dropping. In this method, the generated samples are embedded in the feature space provided by a particular layer of the Inception network. The mean and the covariance of the generated samples and of the real data are calculated, assuming that the samples follow a multidimensional Gaussian. The FID between the two Gaussians is then calculated to evaluate the quality of the generated samples, as sketched below. Nevertheless, IS and FID cannot handle the overfitting problem well; to overcome this, the “Kernel Inception Distance” (KID) is offered in[37].
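A sketch of the FID computation from Inception features of real and generated samples, under the multidimensional-Gaussian assumption described above; feature extraction is omitted.

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_fake):
    """feat_*: (N, D) activations from an Inception layer.
    FID = |mu_r - mu_g|^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))."""
    mu_r, mu_g = feat_real.mean(0), feat_fake.mean(0)
    c_r = np.cov(feat_real, rowvar=False)
    c_g = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_g).real   # drop tiny imaginary parts
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(c_r + c_g - 2.0 * covmean))
```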
“Multi-Scale Structural Similarity for Image Quality” (MS-SSIM)[38] extends “Single-Scale Structural Similarity for Image Quality” (SS-SSIM)[39]; both measure the similarity between two images. This metric evaluates image similarity in a way that predicts human perceptual similarity judgments. In [40], [41], this metric has been used to determine the diversity of the generated data. Also, in[42], FID and IS were used as auxiliary evaluation metrics alongside MS-SSIM to examine sample diversity.
In general, choosing an appropriate evaluation metric remains a complex issue. In[27], several measures have been introduced as meta-metrics to guide researchers in selecting quantitative evaluation metrics. An appropriate evaluation metric should distinguish generated samples from real samples; moreover, it should detect mode collapse, mode drop, and overfitting. More suitable techniques for evaluating the quality of GANs are expected to be introduced in the future.
2.3 | Challenges
Like any other technology, GANs face several challenges. These problems are generally linked to the training process, including mode collapse and training instability. Furthermore, evaluation techniques, image resolution, and ground truth data are other controversial domains.
One of the main causes of failure in GAN training is mode collapse. This refers to the state in which the generator starts generating similar images, so that the diversity of the generated samples is limited across different latent codes. One possible solution to increase data diversity is to generate batches of samples instead of single samples. Another approach is to use multiple generators to obtain various samples; in[43], combining samples generated by different models has been examined to resolve mode collapse. Optimizing the objective function can also mitigate this challenge, as in the WGAN[44] and unrolled GAN[29] models. Hence, how to increase the diversity of generated samples is a crucial issue to be addressed in future work.
“Training process instability” is another challenge in this area, resulting in different outputs for similar inputs. Although batch normalization is considered a remedy for GAN instability, it is not sufficient to make GAN training optimally stable. Numerous approaches have been suggested for more stable training[33], [36], [44]–[47]; still, further solutions are needed to train a more stable GAN and to converge on the Nash equilibrium. The next issue is GAN evaluation, which is more complex than for other generative models. In Section 2, we reviewed several currently extensively used evaluation metrics; providing appropriate, acceptable, and inclusive evaluation methods is one of the essential issues that needs further study.
Another limitation of adversarial networks is the resolution of the generated images. Currently, most GAN-based applications for image processing are limited to 256×256 images; when such a network is applied to high-resolution images, blurry outputs are usually created. Although some researchers use iterative coarse-to-fine methods to generate high-resolution images, these do not run fast. Chen and Koltun have introduced cascaded refinement networks to create 2-megapixel images, a new perspective on high-resolution image production[48]. Collectively, thanks to the excellent capabilities of adversarial networks, it may be possible to offer an approach that enhances resolution while remaining flexible in image size; this area is still under investigation.
Using ground truth data for training is another common challenge, also known as a crucial problem in deep learning. These data play a vital role in GAN-based image synthesis and editing because real synthesized/edited image pairs are not easily collected. The approaches proposed by CycleGAN[49] and Adversarial Inverse Graphics Networks (AIGNs)[50] use unpaired data for model training and can be regarded as a suitable solution to similar problems. Thus, this issue also requires further attention, investigation, and research in GAN-based applications.
Developments based on conditional models: CGAN[52], infoGAN[53], ACGAN[40], SGAN[54]
Developments based on objective function optimization: Unrolled GAN[29], f-GAN[59], Mode-Regularized GAN[35], Least-Square GAN[60], EBGAN[61], WGAN[44], WGAN-GP[45], WGAN-LP[46]
A discriminator is a standard convolutional network that takes an image as input and outputs a binary classification (real or fake). Standard deep convolutional networks use pooling layers to reduce the dimensionality of the input and the feature maps as the network deepens. This is not recommended for DCGAN; instead, strided convolutions are used for dimensionality reduction, as in the sketch below.
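A sketch of such a discriminator for 64×64 RGB images; the layer widths follow common DCGAN-style choices and are assumptions, not a specification from the text.

```python
import torch.nn as nn

# Strided convolutions halve the spatial resolution at every layer,
# replacing the pooling layers of a standard CNN, as DCGAN recommends.
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1),     # 64x64 -> 32x32
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),   # 32x32 -> 16x16
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, 4, stride=2, padding=1),  # 16x16 -> 8x8
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 1, 8),                         # 8x8 -> 1x1 score
    nn.Flatten(),
    nn.Sigmoid(),                                 # probability "real"
)
```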
[Figure: the DCGAN generator maps a 100-dimensional latent vector z through successive upsampling layers to the output image G(z).]
Thanks to its immense popularity and influence, this architecture is widely used. Information GAN (infoGAN)[53] is developed from the cGAN architecture and makes the generation process more controllable. For example, in the MNIST database of handwritten digits, controls such as style, thickness, and type are used to generate the image of a handwritten digit. The cGAN architecture uses the label c from the dataset, while infoGAN also extracts other latent features via the discriminator model and a probabilistic network Q. The image is fed to the discriminator as input, and realness or fakeness, together with Q(c|X), is produced as output, where Q(c|X) is the probability distribution of c conditioned on image X (Fig. 11(b)). For example, if the generated image of digit 3 is fed to the discriminator, Q may estimate (0.1, 0, 0, 0.8, …), meaning the image is “0” with probability 0.1 and “3” with probability 0.8.
The value of the mutual information should be maximized to improve the relationship between x and c. The generator of this architecture is similar to that of cGAN, except that the latent code c is not known and must be discovered through the training process. The loss function is described as follows:

$$\min_{G} \max_{D} V(D,G) - \lambda I(c; G(z,c)) \tag{5}$$
λ is a hyper-parameter that weights the mutual-information term I(c; G(z,c)). Mutual information makes the latent codes c more relevant to the generated data; a sketch of this term follows.
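In practice, I(c; G(z,c)) is intractable and is replaced by a variational lower bound computed through Q. A sketch for a categorical code c follows; using cross-entropy as the bound is standard, but the function name and signature are illustrative.

```python
import torch.nn.functional as F

def info_loss(q_logits, c_true, lam=1.0):
    """Variational lower bound on I(c; G(z, c)) for a categorical code.
    q_logits: Q's prediction of c from the generated image G(z, c);
    c_true:   the code actually fed to the generator.
    Minimizing this cross-entropy (times lambda) maximizes the bound."""
    return lam * F.cross_entropy(q_logits, c_true)

# Both G and Q are trained to minimize this term, so that the code c
# stays recoverable from, and hence meaningful for, the generated image.
```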
Fig. 11. The model of (a) cGAN; (b) infoGAN.
InfoGAN can discover visual concepts such as hairstyles, the presence or absence of glasses, and facial expressions. Although infoGAN is an unsupervised approach, experiments suggest that it learns interpretable representations comparable to those learned by supervised methods.
Auxiliary Classifier GAN (AC-GAN)[40] is also developed from cGAN. In this architecture, the class label c is not fed into the discriminator; instead, an auxiliary classifier predicts the probability of the class label c alongside the probability that the image is real. With this method, the training process becomes more stable, and the model can generate higher-quality images at larger sizes.
Semi-Supervised GAN (SGAN)[54] is developed from the GAN architecture and simultaneously trains a supervised discriminator, an unsupervised discriminator, and a generator. One of the main goals of this architecture is to improve the performance of adversarial networks for semi-supervised learning. The discriminator is updated to predict N+1 classes, where N is the number of classes in the dataset and one extra class is added for the generator's output, as in the sketch below.
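A sketch of this N+1-way discriminator update; N = 10 (as in MNIST) and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

N = 10  # number of real classes; class index N is reserved for "fake"

def sgan_d_loss(d_logits_real, labels, d_logits_fake):
    """d_logits_*: (batch, N + 1) outputs of the shared discriminator.
    Real images keep their dataset label 0..N-1; generated images are
    assigned the extra class N."""
    fake_labels = torch.full((d_logits_fake.size(0),), N,
                             dtype=torch.long,
                             device=d_logits_fake.device)
    return F.cross_entropy(d_logits_real, labels) \
         + F.cross_entropy(d_logits_fake, fake_labels)
```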
Fig. 12. The architecture of the Adversarial Autoencoder (AAE).
Hence, the adversarial autoencoder (AAE)[55], a combination of an adversarial network with an autoencoder, was presented. In this approach, an arbitrary prior distribution is imposed on the latent distribution produced by the encoder to ensure that no gaps exist in it, so that the decoder can reconstruct meaningful samples from every part of it. The AAE architecture is illustrated in Fig. 12: the latent code z serves as the fake input, and z', drawn from the specified prior p(z), serves as the real input, with both fed to the discriminator. Upon completion of training, the encoder has learned the expected distribution, and the decoder can generate samples reconstructed from the required distribution.
Some models add an encoder to the GAN[28], [56], [57]. The generator of a plain GAN can learn features of the latent space and capture semantic variations in the data distribution; however, it cannot learn to map data samples back to the latent space. To address this problem, the bidirectional GAN (BiGAN)[56] was introduced, which is not only capable of valid inference but also preserves the quality of generated samples. The BiGAN architecture is illustrated in Fig. 13(a): an encoder is added to the model in addition to the discriminator and generator, and it learns the inverse mapping of the data generated by the GAN. The discriminator input for generated data is a tuple containing the generated data G(z) and the corresponding latent code z; the input for real samples is a tuple containing the sample X and E(X), obtained by the encoder's inverse mapping of X. In this method, the encoder can serve as a feature extractor for the discriminator. Similarly to BiGAN, adversarially learned inference (ALI)[28] was offered; this architecture employs an encoder to obtain the distribution of the latent features. These two approaches learn the generator and the encoder simultaneously.
In addition to approaches combining autoencoders with adversarial networks, the Adversarial Generator-Encoder (AGE) network architecture[57] has been proposed, in which the generator and the encoder compete with each other without requiring a discriminator. Fig. 13(b) illustrates the AGE architecture, where R represents the reconstruction loss. In this model, the generator tries to minimize the divergence between the latent distribution z and the distribution of the generated data, while the encoder seeks to maximize the divergence between z and E(G(z)) and to minimize the divergence between z and the encoded real data E(X). Moreover, the reconstruction loss is used to prevent mode collapse. In Section 2.1 of this paper, we briefly compared the deep generative models VAE and GAN; in[58], the benefits of GAN and VAE are combined, in that the VAE decoder serves as the GAN generator and the GAN loss is combined with the VAE objective. This mitigates the VAE's tendency to generate blurry images while preserving its capability to learn the latent code distribution.
Fig. 13. The model of (a) BiGAN and ALI; (b) the Adversarial Generator-Encoder network (AGE).
Further studies in this field have proposed pix2pixHD[69] to raise the resolution of the generated samples; in this method, a new adversarial loss function is used to generate images with a resolution of 2048×1024. Pix2pix requires a paired dataset for training, which is one of its limitations: a dataset must be constructed from input images before translation and output images of the same scenes after translation, and such image pairs do not exist in many cases. The unpaired image-to-image translation method of the cycle-consistent GAN (CycleGAN)[49] can be employed to overcome this problem. It uses a cycle consistency loss, which seeks to preserve the original image after a translation and inverse-translation cycle (sketched below), and therefore does not need paired images for training (see Fig. 16).
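A sketch of the cycle-consistency term for two generators G: X→Y and F_inv: Y→X, following the L1 formulation of[49]; the variable names and the weight lam = 10 are illustrative.

```python
import torch.nn.functional as F

def cycle_loss(G, F_inv, x, y, lam=10.0):
    """Encourages F_inv(G(x)) ~ x and G(F_inv(y)) ~ y, so that a
    translation followed by its inverse recovers the original image,
    without requiring paired training images."""
    forward_cycle = F.l1_loss(F_inv(G(x)), x)    # X -> Y -> X
    backward_cycle = F.l1_loss(G(F_inv(y)), y)   # Y -> X -> Y
    return lam * (forward_cycle + backward_cycle)
```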
Fig. 16. Image translations generated by the CycleGAN model[49] (Photos ↔ Monet, Zebras ↔ Horses, Summer ↔ Winter). The model automatically translates an image from one domain into the other and vice versa.
Two-Pathway GAN (TP-GAN)[70] can use a profile image to generate high-resolution frontal face images (see Fig. 17). This technique considers local and global information, as human beings do, and the generated face image preserves the characteristics of an individual's identity well. It can also process images with different poses and lighting. As the name suggests, it has a two-pathway architecture: a global generator is trained to generate global features, and a local generator generates details around facial landmarks (marked points).
Furthermore, Self-Attention GAN (SAGAN)[71] combines a self-attention block with the GAN for image synthesis to solve the long-range dependency problem; thus, the discriminator can check that features distant from each other in the image are mutually consistent. In this approach, the emphasis is on improving the quality of the synthesized image.
Based on SAGAN, the BigGAN method[72] was proposed to increase the diversity and fidelity of generated samples by increasing the batch size and using a truncation trick. In the traditional approach, the latent vector z is fed to the generator only as input; in BigGAN, z is instead embedded in multiple layers of the generator so that it can influence features at different resolutions and levels. As shown in Fig. 18, the generated samples are realistic.
The Disentangled Representation Learning GAN (DR-GAN) method[73] was introduced for synthesizing face images in new poses. The generator uses an encoder-decoder architecture that learns a disentangled representation for face images between the encoder output and the decoder input. The discriminator contains two parts, i.e., identity classification and pose classification. Experimental results show that DR-GAN outperforms existing face recognition techniques under pose variation.
The face frontalization GAN (FF-GAN) architecture[74] incorporates a 3D morphable model (3DMM)[75] into the GAN structure. The 3DMM provides geometry and appearance priors for face images, and its representations are compact. FF-GAN converges quickly and produces high-quality, high-resolution frontal face images.
[Figure: image inpainting comparison; (a) input image, (b) Context Encoder, (c) Yang et al., (d) ground truth.]
Likewise, in[78], the researchers used DCGAN for image restoration, which can successfully generate the lost parts of an image; nonetheless, blur remains at the hole border. In[79], a GAN-based approach to image restoration that is consistent with both the global and the local context has been proposed. The input is an image with an additional binary mask indicating the missing hole, and the output is a restored image with the same resolution. The generator employs an encoder-decoder architecture with dilated convolutional layers instead of standard convolutional layers to support a larger spatial receptive field[80]. There are two discriminators: a global discriminator that takes the whole image as input, and a local discriminator that covers a small region around the hole. The two discriminator networks ensure that the resulting image is consistent at both the “global” and “local” scales, yielding natural restorations of high-resolution images with arbitrary holes; a sketch of this two-discriminator arrangement follows.
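A sketch of how the two discriminators' judgments might be fused into a single real/fake score, based on the description above; the fusion head and feature dimensions are assumptions, and crop extraction and the branch networks are omitted.

```python
import torch
import torch.nn as nn

class GlobalLocalDiscriminator(nn.Module):
    """Wraps a global branch (whole restored image) and a local branch
    (patch around the filled hole); their features are fused into a
    single real/fake score, in the spirit of globally and locally
    consistent image completion."""
    def __init__(self, d_global, d_local, feat_dim):
        super().__init__()
        self.d_global, self.d_local = d_global, d_local
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 1), nn.Sigmoid())

    def forward(self, full_image, hole_patch):
        g = self.d_global(full_image)   # (batch, feat_dim) global features
        l = self.d_local(hole_patch)    # (batch, feat_dim) local features
        return self.head(torch.cat([g, l], dim=1))
```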
References
[1] K. P. Murphy, Machine learning: a probabilistic perspective. MIT press, 2012.
[2] Y. Bengio, A. Courville, and P. Vincent, ‘Representation learning: A review and new perspectives’, IEEE Trans. Pattern Anal.
Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[3] Y. LeCun, Y. Bengio, and G. Hinton, ‘Deep learning’, Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory. Cambridge: MIT Press, 1986.
[5] G. E. Hinton, S. Osindero, and Y.-W. Teh, ‘A fast learning algorithm for deep belief nets’, Neural Comput., vol. 18, no. 7, pp.
1527–1554, 2006.
[6] R. Salakhutdinov and G. Hinton, ‘Deep boltzmann machines’, in Proceedings of the Twelth International Conference on Artificial
intelligence and statistics, pp. 448–455, 2009.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘Generative
adversarial nets’, in Advances in neural information processing systems, pp. 2672–2680, 2014.
[8] Y. LeCun, ‘RI Seminar: Yann LeCun : The Next Frontier in AI: Unsupervised Learning’, 2016. [Online]. Available:
https://www.youtube.com/watch?v=IbjF5VjniVE. [Accessed: 15-Apr-2020].
[9] ‘Scopus database’, 2019. [Online]. Available: www.scopus.com
[10] A. Radford, L. Metz, and S. Chintala, ‘Unsupervised representation learning with deep convolutional generative adversarial
networks’, in International Conference on Learning Representations, 2015.
[11] M.-Y. Liu and O. Tuzel, ‘Coupled generative adversarial networks’, in Advances in neural information processing systems, pp.
469–477, 2016.
[12] T. Karras, T. Aila, S. Laine, and J. Lehtinen, ‘Progressive growing of gans for improved quality, stability, and variation’, in
International Conference on Learning Representations, 2018.
[13] T. Karras, S. Laine, and T. Aila, ‘A style-based generator architecture for generative adversarial networks’, in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
[14] Y.-J. Cao, L.-L. Jia, Y.-X. Chen, N. Lin, C. Yang, B. Zhang, Z. Liu, X.-X. Li, and H.-H. Dai, ‘Recent advances of generative
adversarial networks in computer vision’, IEEE Access, vol. 7, pp. 14985–15006, 2019.
[15] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, ‘Generative adversarial networks: An
overview’, IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, 2018.
[16] X. Wu, K. Xu, and P. Hall, ‘A survey of image synthesis and editing with generative adversarial networks’, Tsinghua Sci. Technol.,
vol. 22, no. 6, pp. 660–674, 2017.
[17] T. Kaneko, ‘Generative adversarial networks: Foundations and applications’, Acoust. Sci. Technol., vol. 39, no. 3, pp. 189–197,
2018.
[18] Y. Hong, U. Hwang, J. Yoo, and S. Yoon, ‘How generative adversarial networks and their variants work: An overview’, ACM
Comput. Surv., vol. 52, no. 1, pp. 1–43, 2019.
[19] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, and Y. Zheng, ‘Recent progress on generative adversarial networks (GANs): A survey’,
IEEE Access, vol. 7, pp. 36322–36333, 2019.
[20] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, ‘Generative adversarial networks: introduction and outlook’,
IEEE/CAA J. Autom. Sin., vol. 4, no. 4, pp. 588–598, 2017.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[22] F. Chollet, Deep Learning with Python. Manning Publications, 2017.
[23] K. Ganguly, Learning Generative Adversarial Networks: Next-generation deep learning simplified. Packt Publishing, 2017.
[24] I. Goodfellow, ‘NIPS 2016 tutorial: Generative adversarial networks’, arXiv Prepr. arXiv1701.00160, 2016.
[25] J. Brownlee, Generative Adversarial Networks with Python, Deep Learning Generative Models for Image Synthesis and Image
Translation. 2019.
[26] D. P. Kingma and M. Welling, ‘Auto-encoding variational bayes’, Int. Conf. Learn. Represent., 2014.
[27] A. Borji, ‘Pros and cons of gan evaluation measures’, Comput. Vis. Image Underst., vol. 179, pp. 41–65, 2019.
[28] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, ‘Adversarially learned inference’,
Int. Conf. Learn. Represent., 2017.
[29] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, ‘Unrolled generative adversarial networks’, in proceedings international
conference on learning representations, pp. 1–25, 2017.
[30] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, ‘Veegan: Reducing mode collapse in gans using implicit
variational learning’, in Advances in Neural Information Processing Systems, pp. 3308–3318, 2017.
[31] Z. Lin, A. Khetan, G. Fanti, and S. Oh, ‘Pacgan: The power of two samples in generative adversarial networks’, in Advances in
neural information processing systems, pp. 1498–1507, 2018.
[32] S. Santurkar, L. Schmidt, and A. Madry, ‘A Classification–Based Study of Covariate Shift in GAN Distributions’, Int. Conf. Mach.
Learn., pp. 4487–4496, 2018.
[33] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, ‘Improved techniques for training gans’, in
Advances in neural information processing systems, pp. 2234–2242, 2016.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ‘Rethinking the inception architecture for computer vision’, in
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
[35] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, ‘Mode regularized generative adversarial networks’, Int. Conf. Learn. Represent., 2017.
[36] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, ‘Gans trained by a two time-scale update rule converge to
a local nash equilibrium’, in Advances in neural information processing systems, pp. 6626–6637, 2017.
[37] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, ‘Demystifying MMD GANs’, Int. Conf. Learn. Represent., 2018.
[38] Z. Wang, E. P. Simoncelli, and A. C. Bovik, ‘Multiscale structural similarity for image quality assessment’, in The Thrity-Seventh
Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 1398–1402, 2003.
[39] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ‘Image quality assessment: from error visibility to structural similarity’,
IEEE Trans. image Process., vol. 13, no. 4, pp. 600–612, 2004.
[40] A. Odena, C. Olah, and J. Shlens, ‘Conditional image synthesis with auxiliary classifier gans’, in Proceedings of the 34th
International Conference on Machine Learning-Volume 70, pp. 2642–2651, 2017.
[41] W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow, ‘Many paths to equilibrium: Gans do not
need to decrease a divergence at every step’, Int. Conf. Learn. Represent., 2018.
[42] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly, ‘A Large-Scale Study on Regularization and Normalization in GANs’,
Int. Conf. Mach. Learn., pp. 3581–3590, 2019.
[43] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. S. Torr, and P. K. Dokania, ‘Multi-agent diverse generative adversarial networks’,
in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8513–8521, 2018.
[44] M. Arjovsky, S. Chintala, and L. Bottou, ‘Wasserstein Generative Adversarial Networks’, in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, pp. 214–223, 2017.
[45] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, ‘Improved training of wasserstein gans’, in Advances in
neural information processing systems, pp. 5767–5777, 2017.
[46] H. Petzka, A. Fischer, and D. Lukovnicov, ‘On the regularization of wasserstein gans’, Int. Conf. Learn. Represent., pp. 1–24,
2018.
[47] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, ‘Spectral normalization for generative adversarial networks’, Proc. Int. Conf.
Learn. Represent., 2018.
[48] Q. Chen and V. Koltun, ‘Photographic image synthesis with cascaded refinement networks’, in Proceedings of the IEEE
international conference on computer vision, pp. 1511–1520, 2017.
[49] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, ‘Unpaired image-to-image translation using cycle-consistent adversarial networks’,
in Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.
[50] H.-Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki, ‘Adversarial inverse graphics networks: Learning 2d-to-3d lifting and
image-to-image translation from unpaired supervision’, in The IEEE International Conference on Computer Vision (ICCV), pp.
4364–4372, 2017.
[51] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, ‘Backpropagation applied to
handwritten zip code recognition’, Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
[52] M. Mirza and S. Osindero, ‘Conditional generative adversarial nets’, arXiv Prepr. arXiv1411.1784, 2014.
[53] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, ‘Infogan: Interpretable representation learning by
information maximizing generative adversarial nets’, in Advances in neural information processing systems, pp. 2172–2180, 2016.
[54] A. Odena, ‘Semi-supervised learning with generative adversarial networks’, Proc. Int. Conf. Learn. Represent., 2016.
[55] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, ‘Adversarial autoencoders’, Int. Conf. Learn. Represent., 2016.
[56] J. Donahue, P. Krähenbühl, and T. Darrell, ‘Adversarial Feature Learning’, Int. Conf. Learn. Represent., 2017.
[57] D. Ulyanov, A. Vedaldi, and V. Lempitsky, ‘It takes (only) two: Adversarial generator-encoder networks’, in Thirty-Second AAAI
Conference on Artificial Intelligence, pp. 1250–1257, 2018.
[58] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, ‘Autoencoding beyond pixels using a learned similarity metric’,
Proc. 33rd Int. Conf. Mach. Learn., pp. 1558–1566, 2016.
[59] S. Nowozin, B. Cseke, and R. Tomioka, ‘f-gan: Training generative neural samplers using variational divergence minimization’,
in Advances in neural information processing systems, pp. 271–279, 2016.
[60] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, ‘On the effectiveness of least squares generative adversarial
networks’, IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 12, pp. 2947–2960, 2018.
[61] J. Zhao, M. Mathieu, and Y. LeCun, ‘Energy-based generative adversarial network’, Int. Conf. Learn. Represent., 2017.
[62] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang, ‘Photo-realistic
single image super-resolution using a generative adversarial network’, in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 105–114, 2017.
[63] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, ‘Esrgan: Enhanced super-resolution generative
adversarial networks’, in Proceedings of the European Conference on Computer Vision Workshops (ECCVW), 2018.
[64] A. Jolicoeur-Martineau, ‘The relativistic discriminator: a key element missing from standard GAN’, arXiv Prepr.
arXiv1807.00734, 2018.
[65] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘Image-to-image translation with conditional adversarial networks’, in Proceedings
of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.
[66] O. Ronneberger, P. Fischer, and T. Brox, ‘U-net: Convolutional networks for biomedical image segmentation’, in International
Conference on Medical image computing and computer-assisted intervention, pp. 234–241, 2015.
[67] C. Li and M. Wand, ‘Precomputed real-time texture synthesis with markovian generative adversarial networks’, in European
conference on computer vision, pp. 702–716, 2016.
[68] P. Salehi and A. Chalechale, ‘Pix2Pix-based Stain-to-Stain Translation: A Solution for Robust Stain Normalization in
Histopathology Images Analysis’, arXiv Prepr. arXiv2002.00647, 2020.
[69] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, ‘High-resolution image synthesis and semantic
manipulation with conditional gans’, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
8798–8807, 2018.
[70] R. Huang, S. Zhang, T. Li, and R. He, ‘Beyond face rotation: Global and local perception gan for photorealistic and identity
preserving frontal view synthesis’, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2439–2448,
2017.
[71] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, ‘Self-attention generative adversarial networks’, in The 36th International
Conference on Machine Learning (ICML), pp. 7354–7363, 2019.
[72] A. Brock, J. Donahue, and K. Simonyan, ‘Large scale gan training for high fidelity natural image synthesis’, Int. Conf. Learn.
Represent., 2019.
[73] L. Tran, X. Yin, and X. Liu, ‘Disentangled representation learning gan for pose-invariant face recognition’, in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 1415–1424, 2017.
[74] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, ‘Towards large-pose face frontalization in the wild’, in Proceedings of the
IEEE International Conference on Computer Vision, pp. 3990–3999, 2017.
[75] V. Blanz and T. Vetter, ‘A morphable model for the synthesis of 3D faces’, in Proceedings of the 26th annual conference on
Computer graphics and interactive techniques, pp. 187–194, 1999.
[76] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, ‘Context encoders: Feature learning by inpainting’, in
Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 2536–2544, 2016.
[77] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, ‘High-resolution image inpainting using multi-scale neural patch
synthesis’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6721–6729, 2017.
[78] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, ‘Semantic image inpainting with deep
generative models’, in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 5485–5493,
2017.
[79] S. Iizuka, E. Simo-Serra, and H. Ishikawa, ‘Globally and locally consistent image completion’, ACM Trans. Graph., vol. 36, no.
4, pp. 1–14, 2017.
[80] F. Yu and V. Koltun, ‘Multi-scale context aggregation by dilated convolutions’, in International Conference on Learning
Representations (ICLR), pp. 1–13, 2016.