A Survey On GANs For Computer Vision
Review article
Article info
Article history: Received 29 September 2021; Received in revised form 24 February 2023; Accepted 14 March 2023; Available online 20 March 2023
Keywords: Generative Adversarial Network; Artificial intelligence; Machine learning; Deep learning

Abstract
In the last few years, there have been several revolutions in the field of deep learning, mainly headlined by the large impact of Generative Adversarial Networks (GANs). GANs not only provide a unique architecture for defining their models, but also generate incredible results that have had a direct impact on society. Due to the significant improvements and new areas of research that GANs have brought, the community is constantly producing new research, making it almost impossible to keep up to date. Our survey aims to provide a general overview of GANs, showing the latest architectures, optimizations of the loss functions, validation metrics and application areas of the most widely recognized variants. The efficiency of the different variants of the model architecture is evaluated, as well as their best application areas; as a vital part of the process, the different metrics for evaluating the performance of GANs and the most frequently used loss functions are also analyzed. The final objective of this survey is to summarize the evolution and performance of the best-performing GANs in order to guide future researchers in the field.
© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Contents
1. Introduction
2. Related work
3. Structure of this survey
4. Generative Adversarial Networks (GANs)
4.1. Definition and structure
4.2. Common problems
4.2.1. Mode collapse
4.2.2. Gradient vanishing
4.2.3. Instability
4.2.4. Stopping problem
4.3. Evaluation metrics
4.3.1. Inception Score (IS) and its variants
4.3.2. Multi-scale structural similarity for image quality (MS-SSIM)
4.3.3. Classifier Two-sample Test (C2ST)
4.3.4. Perceptual path length
4.3.5. Maximum Mean Discrepancy (MMD)
4.3.6. Human rank (HR)
5. GAN variants
5.1. Architecture optimization
5.1.1. Deep Convolutional GAN (DCGAN)
5.1.2. Conditional GAN (CGAN)
5.1.3. Auxiliary Classifier GAN (ACGAN)
5.1.4. Interpretable Representation Learning by Information Maximizing GANs (InfoGAN)
5.1.5. Image-to-Image Translation with Conditional Adversarial Nets (Pix2Pix)
5.1.6. Cycle-Consistent GAN (CycleGAN)
∗ Corresponding author.
E-mail addresses: [email protected] (G. Iglesias), [email protected] (E. Talavera), [email protected] (A. Díaz-Álvarez).
https://doi.org/10.1016/j.cosrev.2023.100553
Several other surveys of GANs published during the last years [27,30–33] have been studied to investigate the recent trends. For example, [16] focuses on the instability issues that GANs suffer and shows different ways to minimize them. The results suggest that some novel architectures try to control the GAN's training, while this control can also be achieved by focusing on tuning hyper-parameters. It also emphasizes that much of the theoretical work does not hold in practice, which causes some GANs to converge when they should not and not to converge when they should.

Few surveys have been conducted to explore the several approaches to optimizing the loss function of GANs. This line of research tries to enhance the similarity between the original and synthesized data distributions by defining an appropriate loss function. Surveys such as [34] focus on analyzing the state-of-the-art GANs and further analyzing the performance of a huge variety of networks. In addition, they propose a set of recommendations on which loss function works best for each use case.

Other works focus on the applications of GANs instead of their composition or loss function. For example, [35] focuses on how different GAN architectures have been used during the last years for different problems, while [28] shows the different architectures for computer vision and their applications.

Due to the constant evolution of GANs during the last few years, these reviews become outdated almost instantaneously. As a result of some relevant and recent research like [11,36,37] …

4.1. Definition and structure

GANs are an architecture composed of various neural networks whose objective is to replicate a data distribution in an unsupervised way. To achieve this, they are composed of two neural networks that play a two-player zero-sum game. In this game, the network called the Generator (G) is in charge of creating new data samples that replicate, but do not copy, the original data distribution, while the Discriminator (D) tries to distinguish real from generated data.

From a formal point of view, D estimates p(y|x), that is, the probability of a label y given the sample x, while G generates a sample given a latent vector z, which can be denoted as G(z).

This process consists of both networks competing. While G tries to generate more realistic results, D improves its accuracy in detecting which samples are real and which are not. In this process, both competitors are synchronized: if G creates better outputs, it will be more difficult for D to differentiate them; on the other hand, if D is more precise, it will be more difficult for G to fool D. This process is a minimax game in which D tries to maximize the accuracy and G tries to minimize it. The minimax loss function can be denoted as:

min_G max_D L(D, G) = E_{x∼p_r} log[D(x)] + E_{z∼p_z} log[1 − D(G(z))]    (1)

where x ∼ p_r denotes the distribution of the real data and z ∼ p_z denotes the probability distribution of the latent space of G.
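To make the minimax game of Eq. (1) concrete, the following is a minimal PyTorch-style sketch of one training iteration. It assumes G, a D with sigmoid output, their optimizers and a batch of real data are defined elsewhere; it illustrates the original objective and is not the code of any cited paper.

```python
import torch

# Minimal sketch of one GAN training step for Eq. (1).
# Assumes: G maps latent vectors to samples, D outputs probabilities
# (sigmoid output), and opt_G / opt_D are their optimizers.
def gan_train_step(G, D, opt_G, opt_D, real_batch, latent_dim):
    batch_size = real_batch.size(0)

    # Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))]
    z = torch.randn(batch_size, latent_dim)
    fake_batch = G(z).detach()          # do not backpropagate into G here
    loss_D = -(torch.log(D(real_batch) + 1e-8).mean()
               + torch.log(1.0 - D(fake_batch) + 1e-8).mean())
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: minimize E[log(1 - D(G(z)))].
    # (In practice the non-saturating variant, maximizing E[log D(G(z))],
    # is often preferred to avoid vanishing gradients.)
    z = torch.randn(batch_size, latent_dim)
    loss_G = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```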
… differentiate between the real distribution D(x) and the synthesized one.

4.2.4. Stopping problem

Traditional neural networks optimize a loss function that, in theory, decreases monotonically. Due to the minimax game that GANs have to optimize, this does not happen to them [44,53,54]. In a GAN training, the loss function does not follow any pattern, so it is not possible to know the state of the networks from their loss values. As a consequence, while a training is running, it is not possible to know when the models have been fully optimized.

4.3. Evaluation metrics

Due to the particularities of GANs, there is no unique metric to measure the quality of the synthesized data [5]. One of the reasons why there is no consensus among researchers is the particularity of each GAN application. As mentioned in previous sections, GANs can be used to replicate any data distribution, but how to measure the differences between the original and synthesized distributions depends on the particular problem [55].

As there is no unique universal metric to measure the performance of these kinds of models, different metrics have been developed during the last years. Each metric has its particular strengths and, in practice, different metrics are used and compared to measure different aspects and to obtain a wider view of the GAN performance [49].

Since there is no evaluation metric that fulfills all possible GAN applications, we will review the most widely used metrics:

4.3.1. Inception Score (IS) and its variants

IS [45] measures the quality and diversity of the generated samples of a GAN. To do so, it uses a pretrained neural network classifier called Inception v3 [56]. The model is pretrained on a dataset of real world images called Imagenet [57] and can differentiate between 1,000 classes of images.

The IS is calculated by predicting the class probabilities of the generated samples. A sample being strongly classified as one specific class means that it has high quality; in other words, it is assumed that low entropy and high quality data are correlated. The IS value varies between 1 and the number of classes of the classifier.

One of the main problems of the IS is that it cannot handle mode collapse. In this case, all samples generated by the GAN will be practically the same, but the IS would still be very high if the images are strongly classified as one class. If this happens, the IS can be high while the real situation is very bad. Another particularity of this metric is that it is designed to measure the quality of images, since it uses an image classifier.

Based on IS, there are some modifications to the metric. For example, Mode Score (MS) [58] is an evaluation metric that takes into account the prior distribution of the labels over the data, i.e., it is designed to reflect the quality and diversity of the synthesized data simultaneously. Another modification of IS is the modified-Inception Score (m-IS) [59]. It measures the diversity within the same class category output, trying to mitigate the mode collapse problem.

Other metrics, like the Fréchet inception distance (FID) [43], calculate the mean and covariance of the synthesized images and then calculate the distance between the real and generated image distributions. The distance is measured using the Fréchet distance, also known as the Wasserstein-2 distance. The FID is calculated as follows:

FID = |µ − µ_w|² + tr(Σ + Σ_w − 2(ΣΣ_w)^{1/2})    (2)

where w denotes the synthesized data of the G.

The FID is the most commonly used metric to measure the quality of generated images [3,6,60,61]. The use of a common metric for different architectures allows comparing different results; in further sections we will go through different results comparing them using FID. One of the strengths of this metric is that it takes into consideration contamination such as Gaussian noise, Gaussian blur, black rectangles and swirls, among others.

4.3.2. Multi-scale structural similarity for image quality (MS-SSIM)

MS-SSIM is based on the comparison between two image structures, luminance and contrast at different scales [62]. MS-SSIM provides a metric that compares the similarity between the real and the synthesized datasets. One of the strengths of MS-SSIM is that it correlates closer pixels with strong dependence. In comparison with other metrics such as the Mean Squared Error (MSE), which calculates the absolute error of an image, MS-SSIM provides a metric based on the geometry and structure of the image.

The MS-SSIM scale is based on the Structural Similarity Index Measure (SSIM), and this metric is calculated as follows:

SSIM(x, y) = [l_M(x, y)]^{α_M} · ∏_{j=1}^{M} [c_j(x, y)]^{β_j} [S_j(x, y)]^{γ_j}    (3)

where x and y are two image windows of common size, l is the luminance of an image, c the contrast and S the structure. The value of SSIM is a decimal between 0 and 1, where a value of 1 represents two identical sets of data. Therefore, it is assumed that the higher the value of SSIM, the higher the quality of the synthesized images.

MS-SSIM is calculated using the average pairwise SSIM over N batches. This metric is commonly used together with IS or its variations [63] to provide a wider view of the generated data quality.

4.3.3. Classifier Two-sample Test (C2ST)

To measure the quality of the generated distribution, a binary classifier can be used [64]. The classifier divides the samples into synthesized and real ones, judging whether different samples belong to the same data distribution. It should be noted that this method is not constrained to image evaluation: since a classifier can be used to classify any given data distribution, it can be adapted to any type of input data.

The 1-Nearest Neighbor classifier (1-NN) [65] is a type of binary classifier used to evaluate GAN performance. 1-NN is a variant of C2ST that does not require hyper-parameter tuning. C2ST using 1-NN is known as C2ST-1-NN. Neural networks can also be used as a C2ST; as mentioned in previous sections, D is indeed a classifier of real and generated data. As proposed in [65], a C2ST can be applied to GANs by using the same composition as the discriminator, that is, as the paper states, ''training a fresh discriminator on a fresh set of data''. This variant is known as C2ST-Neural Network (C2ST-NN).

Using C2ST, we can measure the distance between the synthesized and real data distributions. This provides a useful, human-interpretable metric of GAN performance. C2ST has been applied to different GAN architectures such as DCGAN or CGAN, using C2ST-NN and C2ST-1-NN [65].

4.3.4. Perceptual path length

Using the well-known neural network classifier VGG16 [66], the perceptual path length was designed [6] to measure the entanglement of images. The embeddings of consecutive images are calculated using VGG16 by interpolating random latent space inputs; then it is calculated how much the synthesized images change. A drastic change means that, for a minimal change in the latent space, multiple features are changing, which means that those features are entangled under the same representation. This metric measures how well the GAN is learning the different features of the input images, measuring the entanglement of the generated images.
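Of the metrics above, FID (Eq. (2), Section 4.3.1) is the one most frequently reported in the remainder of this survey. The following is a minimal NumPy/SciPy sketch of Eq. (2); it assumes the Inception-v3 activations for the real and synthesized images have already been extracted, and the function name is ours, not from any of the cited works.

```python
import numpy as np
from scipy import linalg

# Illustrative computation of FID (Eq. (2)). feats_real and feats_fake
# are (n_samples, n_features) arrays of Inception-v3 activations for the
# real and synthesized images; lower values mean closer distributions.
def frechet_inception_distance(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)   # matrix square root of Σ·Σ_w
    if np.iscomplexobj(covmean):            # drop numerical imaginary parts
        covmean = covmean.real
    return (np.sum((mu_r - mu_f) ** 2)
            + np.trace(cov_r + cov_f - 2.0 * covmean))
```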
4.3.5. Maximum Mean Discrepancy (MMD)

MMD is used to measure the distance between two distributions [67]. A lower MMD score means that the compared distributions are closer, which means that the synthesized data is similar to the original. Given distributions P and Q and a kernel k, MMD can be denoted, as defined in [68], as:

M_k(P, Q) = ∥µ_P − µ_Q∥²_H = E_P[k(x, x′)] − 2E_{P,Q}[k(x, y)] + E_Q[k(y, y′)]    (4)

It should be noted that this method can be used with any type of data.

4.3.6. Human rank (HR)

Human classification can be useful in some cases. Either to complement other evaluation metrics, or because there is no other metric that fulfills the particular problem, human evaluation of the generated data can be performed. Due to the particularities of this method, it can only be used when the synthesized data is comprehensible for a human.

For example, in [7,8] human classifications were applied via Amazon Mechanical Turk (AMT) to evaluate the realism of the outputs of the GAN. In this case, participants had to differentiate between the generated and real images. The more images that fool human perception, the better. This method can provide an approximation of how GAN creations would be perceived by humans.

5. GAN variants

Since the first GAN was developed [1], many different variations of it have been published [3,6,8,26]. To have a broad vision of recent GAN research, we will review the recent progress in this field.

This section divides the GAN models according to their main features. That said, we will divide the different GAN variations into architecture-modification-based and loss-function-modification-based.

5.1. Architecture optimization

5.1.1. Deep Convolutional GAN (DCGAN)

… Convolutional layers are not only used for image processing; there are recent projects [70] that use matrices of data to take advantage of convolutional layers.

In addition to the convolutional layers, other changes were suggested to stabilize the GAN's training. Replacing the pooling layers by strided convolutions has shown better performance [71,72]; therefore, it is proposed to use strided convolutions in both G and D. The use of batch normalization layers in both G and D is also proposed, as this has been shown to reduce the noise and improve the diversity of the generated samples [73,74]. To activate the convolutional layers, it is proposed to use the Rectified Linear Unit (ReLU) activation for the hidden layers of G, the hyperbolic tangent (tanh) for the output layer of G, and the Leaky Rectified Linear Unit (LeakyReLU) for D.

In addition to the mentioned changes in the architecture of the GAN, the DCGAN paper also presents a technique to visualize the filters learned by the models. This helps the comprehension of GAN learning methods, confirming previous works related to biology [75]. This architecture supposes a change in how GANs are designed and trained, and the innovations proposed in the paper are applied in most of the following GAN models.

5.1.2. Conditional GAN (CGAN)

Proposed in 2014 [76], the CGAN architecture adds a class label c along with the latent space. The new label is used to split the processed data into different classes; thus, the synthesized data is generated according to the class of the input label. There are some problems that require the generated data to be classified into different classes [77–79].

Despite being a simple technique, it has proven to prevent mode collapse. However, the training of a CGAN requires a labeled dataset, complicating its application to some problems. The CGAN architecture has influenced GAN models since its proposition, and many variations have been developed [8,80,81].
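The conditioning mechanism of the CGAN can be sketched in a few lines: the class label is embedded and concatenated with the latent vector before entering G (and, analogously, with the input of D). The sizes and layer choices below are arbitrary illustrations, not those of the original paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of CGAN-style conditioning: the latent vector z is
# concatenated with an embedding of the class label c, so the synthesized
# sample follows the requested class. All sizes here are arbitrary.
class ConditionalGenerator(nn.Module):
    def __init__(self, latent_dim=100, n_classes=10, out_dim=784):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_classes, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, out_dim),
            nn.Tanh(),            # outputs in [-1, 1]
        )

    def forward(self, z, c):
        return self.net(torch.cat([z, self.label_emb(c)], dim=1))

# Usage: generate one sample of each of ten classes.
# G = ConditionalGenerator()
# fake = G(torch.randn(10, 100), torch.arange(10))
```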
5.1.4. Interpretable Representation Learning by Information Maximizing GANs (InfoGAN)

… Q(c|x) is defined. That said, the loss function of the InfoGAN is defined as follows:

min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) − λ L_I(G, Q)    (7)

where λ is a hyperparameter that is in charge of the latent code control. As proposed in the original paper [80], a λ equal to 1 is used when the latent code is discrete, while for continuous latent codes a smaller λ should be used. The reason for that is to control the differential entropy.

5.1.5. Image-to-Image Translation with Conditional Adversarial Nets (Pix2Pix)

The main objective of the Pix2Pix [8] architecture is to do image-to-image translation; that is, given an image from a domain A, transform this image into another domain B. For example, given a map of a street, transform the map into an aerial photo of the street on the map.

The Pix2Pix architecture is based on an autoencoder, but with skip connections. This architecture is known as U-Net, and it is based on the idea of retrieving information from early stages of the network. The same approach of skipping connections has been used before [83–86], showing great results and improving network performance. In addition to the new architecture, a new loss function is proposed, denoted as:

L_GAN(G, D) = E_y[log D(y)] + E_{x,z}[log(1 − D(G(x, z)))]    (8)

As a follow-up of Pix2Pix, Pix2PixHD was proposed [87], improving the quality of the generated images. Many later works have used Pix2Pix [88–91], converting it into one of the most popular architectures of the last decade. The immediate application of these algorithms to images has had a great impact on society, radically increasing their popularity thanks to the applications developed.

5.1.6. Cycle-Consistent GAN (CycleGAN)

Cyclic consistency is the idea that, given data x from a domain A, if the data is translated to a domain B and translated again to the domain A, the data x should be recovered. In other words, if a sample is translated to a domain and recovered from that domain, it should not change. This process, where a data sample is transformed and recovered, is known as cycle consistency, and it has been widely used during the last decades [92,93].

This idea is the main basis of CycleGAN [7]. The main strength of the application of cycles is that paired data is not a requirement. The GAN architecture adds a new mapping denoted as F, whose function is to do the inverse mapping to retrieve the original data; in other words, the function of F is F(G(x)) = x. To train the architecture, a new cycle consistency loss is proposed to train the so-called forward and backward cycle consistency. The cycle consistency loss is denoted as follows:

L_cycle(G, F) = E_{x∼p_data(x)}[∥F(G(x)) − x∥_1] + E_{y∼p_data(y)}[∥G(F(y)) − y∥_1]    (9)

Although CycleGAN was first proposed for image-to-image translation, it can be used for any data translation.

5.1.7. Unsupervised Dual Learning for Image-to-Image Translation (DualGAN)

The architecture of DualGAN [94] is very similar to CycleGAN. As with CycleGAN, DualGAN does not require paired data to train its models. To learn the translation from one data domain to another, DualGAN has two pairs of identical G and D, each pair responsible for its respective translation direction.

To stabilize the training and prevent mode collapse, the loss format of WGAN [26] is used. This marks the architecture of the network and the construction of the objective function. In order to train each pair of G and D, a reconstruction error term is defined. The reconstruction error objective is the same as in CycleGAN, calculating the distance between the original data sample and its corresponding recovered sample. The reconstruction error is defined as:

l_g(u, v) = λ_U ∥u − G_B(G_A(u, z), z′)∥ + λ_V ∥v − G_A(G_B(v, z′), z)∥ − D_B(G_B(v, z′)) − D_A(G_A(u, z))    (10)

where U and V are the two domains, λ_U and λ_V are two constant parameters, and z and z′ are random noises. λ_U and λ_V are normally set to a value within [100.0, 1,000.0]; when the domain U contains real images (e.g. a human face photo) and V does not (e.g. a sketch of a human face), it is more optimal to use a smaller value of λ_U than λ_V.

DualGAN has been widely used and modified [95–97]. For example, in [98] a DualGAN architecture was used to transform the emotion of an input speech. In this application, given the Fundamental Frequency (F0) of a certain emotion, the trained network is capable of changing the emotion of the sound. To do so, F0 is encoded using wavelet kernel learning [99], following the same methodology as [100].

5.1.8. Learning to Discover Cross-Domain Relations with GANs (DiscoGAN)

DiscoGAN [101] is an architecture that follows the same structure as DualGAN and CycleGAN. The particularity of DiscoGAN is the usage of an autoencoder for the G. For D, it uses a classifier based on the encoder of the G.

Autoencoders have been used for other reconstruction problems [102–104], so applying this architecture to domain-to-domain translation problems can benefit from their particularities. Autoencoders are based on the idea of reducing the dimensionality of the input data and then reconstructing the same information. By doing the dimensional reduction, the network is capable of maintaining the essential features of the input data. In the case of domain-to-domain translation, by using autoencoders, the architecture is capable of maintaining the main features of a sample and translating this core information to other specific domains.

The results presented in the original work show how GANs can learn high-level relationships between two completely different domains. In the experiments carried out in the research, it was demonstrated how the networks discovered relationships such as orientation, e.g., pairing images of chairs and cars with the same orientation.

5.1.9. GANILLA

The GANILLA [105] architecture modifies the structure of the G of the GAN for image style transfer. The main objective of the variant is to maintain both the content and the style of an image; previous methods usually sacrifice one of these aspects in favor of the other. The main idea of GANILLA is to do the style transfer of an image balancing style and content.

The architecture of GANILLA uses low-level features to maintain the content of the image at the same time as the style transfer is done. The G model is based on two stages, one for downsampling the input image and the other for upsampling the information of the first stage. This architecture ensures that the style transfer maintains the input features of the image and, in addition, some layers concatenate features of previous layers such as edges, shapes or morphological features. With these two methods, the architecture controls both content and style.
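Since CycleGAN, DualGAN, DiscoGAN and GANILLA all build on the cycle-consistency term of Eq. (9), a minimal sketch of that term may be useful before looking at GANILLA in more detail. The mapping and batch names are ours, and the adversarial part of the full objective is omitted.

```python
import torch

# Illustrative cycle-consistency term of Eq. (9). g_ab maps domain A to B
# and g_ba maps B back to A; the adversarial losses are omitted.
def cycle_consistency_loss(g_ab, g_ba, real_a, real_b):
    forward_cycle = torch.mean(torch.abs(g_ba(g_ab(real_a)) - real_a))
    backward_cycle = torch.mean(torch.abs(g_ab(g_ba(real_b)) - real_b))
    return forward_cycle + backward_cycle
```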
Fig. 2. Structure of the proposed architecture of the GANILLA. Figure based on Ref. [105].
The downsampling stage is based on ResNet-18 [83], but with skip connections. These skip connections then feed the upsampling module. The architecture of GANILLA can be observed in Fig. 2.

For training the models, the cyclic consistency method of CycleGAN [7] is used. This way, two pairs of G and D are used to map both domains.

The results of GANILLA show good performance, specifically on a children's book illustration dataset. Due to the particularities of the images of children's books, being highly contrasted images with abstract objects, previous architectures had difficulties doing the style transfer. However, with the usage of the low-level features of GANILLA, an improvement of the overall performance is achieved.

5.1.10. Progressive Growing of GANs (ProGAN)

Training a complex model can lead to strong instability. To tackle the instability of GAN models, ProGAN [3] proposes a training methodology based on a growing architecture. The idea of a progressive neural network was previously proposed in [106].

The main idea behind progressive networks is the concatenation of different training phases. In each phase, a model is trained and, as the trainings progress, the number of layers of the model increases. This way, the created model scales up gradually, stabilizing the training. The strength of this architecture is that, due to the simplicity of the first model, the networks are capable of properly learning the simplest form of the problem and then using the learned characteristics to scale up the complexity of the problem little by little. It is important to emphasize that, with each new phase, the weights of the networks remain trainable, letting them adapt to the new phases. A scheme of the progressive training of ProGAN can be seen in Fig. 3.

Due to the explained training methodology, ProGAN is capable of stabilizing the training of GANs, which is one of the most important GAN problems. In addition, ProGAN's training methodology speeds up the training phase and produces images of state-of-the-art quality, e.g. achieving an inception score of 8.8 on the unsupervised CIFAR-10 [107] dataset.

The ProGAN described in the original paper used the Gradient Penalty WGAN (WGAN-GP) [4] loss format, although the ProGAN architecture can be applied with any loss function. The ProGAN training methodology has been implemented in much recent research [108,109].

5.1.11. Dynamically Grown GAN (DGGAN)

DGGAN [110] proposes a new training methodology based on ProGAN. The networks of DGGAN do not simply grow periodically; rather, they grow dynamically, adapting their architecture and parameters during the training.

DGGAN questions some aspects of GANs such as the symmetry between G and D or the layer choice. The new methodology can automatically search for the optimal parameters, whereas the ProGAN growing strategy was defined beforehand.

DGGAN starts with a base D and G, and the training alternates between training steps and the growing of the network. To grow the network, a set of child architectures is created. Each child has the same architecture as the parent, but each child proposes a different growing change to the network. During the training, the child architectures are trained, initializing the weights of the inherited parent layers with their respective parent weights.

In the proposed dynamic growing algorithm, each step chooses among different growing possibilities: grow G with a certain convolution layer, grow D with a certain convolution layer, or grow both G and D to a higher resolution. A scheme of the training methodology can be seen in Fig. 4.

If all children were preserved in each step, it would produce an exponential growth that would lead to large inefficiency. To avoid that, before the children generation, a prune is made. Known as greedy pruning, the prune is done by keeping the top K children of each generation. Then each child becomes a parent and generates a new batch of children. The process repeats until the network grows to the desired size.

In the original research, the child search was made by combining different kernel sizes and numbers of filters; each parameter is known as an action, and the number of total actions is denoted as T. It can be easily noted that different hyperparameters can
be searched by using this algorithm. To avoid a large increment of the number of children, the algorithm proposes a probability p for a child to test a new parameter. Higher K, T and p mean a wider search, contributing to a better exploration of the candidates but a slower training.

It should be noted that the search algorithm reduces the efficiency of the architecture by having to do multiple trainings simultaneously. It also limits the ability to grow, due to the quick growth of the number of networks.

5.1.12. A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN)

StyleGAN [6] is based on the idea that, by improving the processing of the latent space, the quality of the generated data will improve. Due to the particularities of the latent space, there are many interpolations on the variables [111,112] that produce entanglement in the learned characteristics of the G. The architecture of StyleGAN is based on previous style transfer research [113].

With the architecture of StyleGAN, G is capable of learning different styles of the input data, disentangling high-level characteristics. This produces an improvement in the quality of the generated data and helps in the interpretation of the latent space, previously poorly understood. Controlling the latent space leads to better interpolation properties, enabling interpolation operations at different scales, e.g., interpolation of poses, hair or freckles in human face images.

In the StyleGAN architecture, the input of G is mapped to an intermediate latent space called W, which is then used in each convolution layer via Adaptive Instance Normalization (AdaIN). In addition to the latent space, Gaussian noise is added to the output of each convolution layer.

The StyleGAN architecture uses the training methodology of ProGAN, supporting the previously mentioned idea that each research work should not be considered as an isolated result; the paradigm of investigation is supported by the continuous mixing of new techniques.

That said, StyleGAN improves the quality of the images generated by ProGAN, achieving a FID score of 5.06 on the CelebA-HQ dataset and 4.40 on the FFHQ dataset.

5.1.13. Alias-free GAN

During the last years, multiple architectures have been improving the quality of the synthesized images. The previously mentioned StyleGAN achieved some of the best results in image generation, producing images of human faces with a quality never seen before. Despite its good results, some problems remain open.

One of the most visible problems of the images generated by StyleGAN was what is known as texture sticking. It happens when a certain image feature depends on absolute coordinates instead of depending on the localization of other features; e.g., the texture of the beard of a human face seems stuck when interpolating different images. The texture sticking problem is especially noticeable when interpolating images, e.g. changing the posture of a human face image.

Alias-Free GAN [60] focuses on solving the texture sticking problem of StyleGAN. The main idea is to suppress the aliasing in the generated images; this way, the finer details will be attached to the underlying surface of the image. To achieve this, each layer of G is designed to be equivariant by applying rotations and translations to the continuous input.

To achieve an equivariant G, many changes have been made. A 10-pixel margin is used for the internal representations, due to the assumption of infinite spatial extension for the feature maps. The Leaky ReLU layers are wrapped between an upsampling and a downsampling, which is implemented with a custom CUDA kernel for optimization. The cutoff frequency of StyleGAN is lowered to ensure that the alias frequencies are in the stopband. In addition, the learned input constant of StyleGAN is substituted by Fourier
features [114,115]. Finally, the rotation equivariant version of the network is obtained by reducing the kernel size of the 3 × 3 convolutions to 1 × 1 and changing the sinc-based downsampling to a radially symmetric jinc-based one.

5.1.14. Self Attention GAN (SAGAN)

The SAGAN [116] architecture covers the problem of local spatial information in images, i.e., images that have different components correlated at different positions of the image can be difficult to cover because the receptive field of the network is not big enough. In SAGAN, the generation of different features is made considering cues from the whole image. In addition, the SAGAN D is capable of evaluating the consistency of features along the image.

SAGAN uses self attention layers [117]; these layers are capable of capturing structural and geometric features of multiclass datasets. The feature maps of each convolution are split by 1 × 1 convolutions into query, key and value, which are then multiplied to construct the output of the layer. This way the network can learn long-range dependencies. The structure of the self-attention layer can be seen in Fig. 5.

5.1.15. BigGAN

The BigGAN architecture [118] focuses on generating high resolution images from diverse datasets. Previous models were able to synthesize new samples of low dimensionality, but they had problems scaling their results to bigger samples. The results achieved by BigGAN in terms of FID and IS outperform previous models.

The researchers of BigGAN claim that GANs perform better when they use higher dimensional data. The architecture of BigGAN is based on the SAGAN [116] architecture. The authors show that, by enlarging the number of channels of the images used by a factor of 50%, the IS improves by a factor of 21%.

One innovation proposed in this article is the so-called ''Truncation Trick''. Previous GAN models used a normal or uniform distribution to generate the latent space of the G network. The authors claim that, by using a truncated normal distribution, the results in terms of FID and IS were better. This truncation trick reduces the variety of values of the latent space by truncating them towards zero. The main drawback is that the variability of the generated samples is reduced: there exists a trade-off between the variety and the fidelity of the generated samples when using this truncation. The more truncation applied to the latent space, the less variety in the produced images.

Another aspect that is scaled up in this work is the batch size of the GAN training, increasing it by a factor of 8. The authors show that, by using larger batches, the gradients of each iteration are better, reaching a better performance in fewer steps. This is caused by the composition of each batch being more diverse, and thus able to cover more modes of the data.

5.1.16. Your Local GAN (YLGAN)

YLGAN [61] proposes a new attention layer that substitutes the SAGAN dense attention layer [116]. This new layer preserves two-dimensional image locality and contributes to the flow of information through the different layers. To preserve the two-dimensional locality and quantify how information flows through the model, the framework of Information Flow Diagrams (IFD) [119] is used.

The modification of the self attention layer of SAGAN introduces sparse attention layers. This new method reduces the quadratic complexity of the attention layer by splitting the attention into multiple subsets of data. The main problem of the sparse attention layer is that, despite its computational optimization, it restricts the information flow of the network. To tackle this, information flow graphs are introduced; these graphs are used to support Full Information through the layers of the network.

The results show how applying the new layer improves the quality of the images compared to the SAGAN generated images. The architecture of SAGAN, with the dense attention layer modified and the rest of the parameters preserved, is called YLG-SAGAN. YLG-SAGAN not only improves the FID of SAGAN, reducing its score from 14.53 to 8.95, but it also reduces the training time by around 40%.

5.1.17. A GAN Through Quantum States (QuGAN)

During the last decade, quantum computing has become a hot topic in computer science. Since it was proposed in 1980 [121], it has mostly been restricted to a few laboratories around the world. Thanks to the progress made recently [122], it has become possible to test the first algorithms, prototypes and ideas [123].

Thanks to the particularities of quantum computing, previously defined problems can be solved, or optimized, reducing their computation time. Using quantum superposition, multiple solutions can be evaluated simultaneously; then, by using quantum interference and entanglement, the correct answer can be determined.

QuGAN [124] proposes a GAN architecture powered by quantum computing. By using quantum computing, GANs are hugely optimized, reducing their parameter set by 98.5% compared to traditional GANs. QuGAN architectures use qubits to create the quantum layers of G and D, known as QuG and QuD. The data that the networks use is transformed into quantum states.

5.1.18. Entangling Quantum GAN (EQGAN)

EQGAN [125] proposes a variation of the previously proposed quantum GANs. Benefiting from the entangling properties of quantum circuits, EQGAN guarantees the convergence to a Nash equilibrium (NE). The main particularity of EQGAN is that it performs quantum operations on both synthesized and real data. This approach
Fig. 6. Structure of the proposed enhanced D of the SSD-GAN. Figure based on Ref. [120].
produces fewer errors than swapping the data between quantum and classical.

To apply EQGAN to real problems, a Quantum Random Access Memory (QRAM) is used. By using QRAMs, the EQGAN is capable of improving the performance of the D.

5.1.19. Classification Enhancement GAN (CEGAN)

Data imbalance is a common problem when using real world datasets. Datasets often contain a majority of samples of a certain data class. In the case of GANs using unbalanced datasets, the imbalance problem results in poor quality of the synthesized data of the class with fewer samples.

CEGAN [40] tries to solve the data imbalance problem in GANs. The objective is to enhance the quality of the synthesized data and to improve the accuracy of the predictions.

The CEGAN architecture consists of 3 different networks: G, D and a new network known as the classifier (C). The training of CEGAN is divided into two steps. In the first step, the architecture is trained normally, using D to differentiate between fake and real samples, while C is used to classify the class label of the input sample. Then, in the second step, an augmented training dataset is formed by generating new samples from G, and this new dataset is used to train C.

The methodology presented in CEGAN substitutes previous techniques to deal with data imbalance. Unlike other methods such as undersampling [126] or oversampling [127], CEGAN does not modify the original dataset. This way, some problems of the traditional methods are avoided, e.g. shortening the original dataset by undersampling, or introducing redundant information by oversampling with geometric transformations.

5.1.20. SSD-GAN

SSD-GAN [120] tackles the problem of high frequency components in GAN samples. The described problem causes high spectrum discrepancies between the real and the synthesized samples. SSD-GAN proposes to alleviate this discrepancy to enhance the quality of the synthesized data.

The idea behind the architecture is to reduce the gap of spectrum discrepancy by combining the spectral realness and the spatial realness of each sample; to do so, a new D is defined. The newly proposed D is known as Dss, and it combines D and a classifier C. D is in charge of measuring the spatial realness of an image, which is the same approach as the D of the traditional GAN [1]. The newly proposed C is in charge of the so-called spectral classification, that is, measuring the difference between the spectra of synthesized and real data. The C objective function is called the spectral classification loss and it is defined as:

L_spectral = E_{x∼p_data(x)}[log C(φ(x))] + E_{x∼p_g(x)}[log(1 − C(φ(x)))]    (11)

One of the strengths of SSD-GAN is its simplicity, easing implementation and allowing its inclusion in various network architectures without excessive cost. Fig. 6 shows how both spatial and spectral information are processed by the new D proposed for SSD-GAN.

The SSD-GAN results show the potential of the proposed architecture. The quality of the images improves on the results of previous architectures, e.g. reducing the FID score of StyleGAN [6] from 4.40 to 4.06 by including the spectral classification.

5.1.21. Mobile Image Enhancement GAN (MIEGAN)

MIEGAN [128] presents a novel architecture that aims to improve the quality of images taken with a mobile phone. To do so, two new networks are proposed: the so-called multi-mode cascade generative network and the adaptive multi-scale discriminative network. The generative network is composed of an autoencoder architecture. The encoder of this new generator is divided into two streams; the inclusion of the second encoder is in charge of improving the low luminance areas, where mobile phones particularly lack clarity.

The discriminator network has a dual goal. First, the global discriminator ensures the overall image quality. Second, a local discriminator maintains the local quality of small areas of the image. To combine both objectives, an adaptive weight allocation module is also proposed, responsible for balancing the importance of each discriminator.

A brief scheme reviewing all the presented architecture variant GANs can be seen in Fig. 7. We divide the different architecture-based GANs into different groups based on the proposed changes. The illustration gives a global view of how the different researches of the last years are interconnected.

5.2. Loss function optimization

Orthogonal to the architecture modification GANs, there is much research [18,26,129] that focuses on the objective function of GANs. For example, the instability problem of GANs is actually caused by the Jensen–Shannon divergence, where D often wins over G. Along with architecture optimization GANs, loss optimization research has been developed, and both approaches coexist and interact with each other.

In this section, we will review the most important and recent progress in variations of the loss function of GANs.
… which data was real and which was synthesized; in WGAN, D changes its name to critic. The critic's function is to measure the …

λ E_{x̂∼P_x̂} [(∥∇_{x̂} D(x̂)∥_2 − 1)²]    (13)
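Eq. (13) is recognizable as the gradient penalty term popularized by WGAN-GP [4]: the norm of the critic's gradient is pushed towards 1 on random interpolations x̂ between real and synthesized samples. A minimal PyTorch-style sketch follows, assuming the critic D and the two batches are defined elsewhere; it is an illustration, not the reference implementation.

```python
import torch

# Illustrative gradient penalty of Eq. (13): x_hat is a random
# interpolation of real and generated samples, and the critic's gradient
# norm at x_hat is penalized for deviating from 1.
def gradient_penalty(D, real, fake, lambda_gp=10.0):
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat,
                                 create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```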
Fig. 8. Comparison between sigmoid cross entropy loss function (a) and least squares loss function (b).
Source: Figure from Ref. [18].
Fig. 9. Structure of the proposed residual blocks of the MISS GAN. Figure based on Ref. [139].
With respect to previous normalizations [134], spectral normalization is easier to implement. The previous methods imposed a much stronger constraint on the network matrix. With spectral normalization, it is possible to relax this constraint, allowing the network to satisfy the local 1-Lipschitz constraint. The spectral normalization is defined as follows:

W̄_SN := W / σ(W)    (20)

where W is the weight matrix of D and σ(W) is the spectral norm (the L2 matrix norm) of W.

As mentioned before, the proposed D network is very simple and, additionally, its computational cost is small. It also requires the tuning of only one hyperparameter, the Lipschitz constant.

The images generated using SN-GAN are more diverse, achieving a better comparative IS with respect to other weight normalizations.

5.2.8. Cyclic-Synthesized GAN (CSGAN)

CSGAN [135] proposes a new loss function for image-to-image translation problems. While previous works developed architectures for concrete translation domains, CSGAN proposes a common framework for different domain translations.

The Cyclic-Synthesized Loss (CS) is proposed as the objective function of CSGAN. The objective of the new loss is to evaluate the differences between a synthesized image and its corresponding cycled image. The proposed loss function is denoted as follows:

L(G_AB, G_BA, D_A, D_B) = L_LSGAN_A + L_LSGAN_B + λ_A L_cyc_A + λ_B L_cyc_B + µ_A L_CS_A + µ_B L_CS_B    (21)

where L_CS_A and L_CS_B are the Cyclic-Synthesized losses of both domains.

With respect to previous architectures, CSGAN produces images of better quality, notably reducing the artifacts of the synthesized images. The results show a better performance of CSGAN on the Chinese University of Hong Kong (CUHK) dataset [136] and a comparable performance on the FACADES dataset [137]. The comparison of the performance is made against GAN [1], Pix2Pix [8], DualGAN [94], CycleGAN [7] and Photo-Sketch Synthesis using Multi-Adversarial Networks (PS2MAN) [138].

5.2.9. Multi-IlluStrator Style GAN (MISS GAN)

The proposed architecture of MISS GAN [139] presents only one trained model to generate illustrations for different image styles. Previous methods used a different G for each style, limiting the practical application of the architectures, while MISS GAN uses a unique model.

The proposed new G is based on the GANILLA [105] architecture, but it proposes some changes to the architecture of the decoder of the GANILLA G. The new decoder contains three residual blocks, which are in charge of processing the low-level features from previous layers. The composition of each residual block can be seen in Fig. 9.

To train the MISS GAN models, five different objective functions are proposed.

The first loss function is called the adversarial objective (L_adv) and it is in charge of, taking the input image and the target domain, ensuring that the generated image style corresponds with the target domain. To do so, L_adv takes two discriminator predictions, one for the input image and another for the synthesized image.

The second loss function is denoted as the style reconstruction objective (L_sty), and it enforces the G to use the mapping network style code while receiving a generated latent code; to calculate L_sty, the output of the G encoder over the generated image is used.

The third proposed objective function is called the style diversification objective (L_ds), and it compares a pair of synthesized images, each image corresponding to a different style code, each one generated from a different latent code. The objective of this loss function is to force G to produce diverse images, preventing two images with different latent codes from being the same.

The fourth objective function is the cycle consistency loss (L_cyc) used in CycleGAN [7].

Finally, the fifth objective function is called the content features loss (L_content_feat), and it computes the distance in the feature space by using a VGG16 [66] network.

To combine the different objective functions, a total objective is defined as follows:

max_D min_{G,F,E} L_adv + λ_sty L_sty − λ_ds L_ds + λ_cyc L_cyc + λ_feat L_content_feat    (22)

where E is the style encoder and F is the mapping network; all the λ parameters correspond to a hyperparameter for each objective function.

5.2.10. Sphere GAN

SphereGAN [140] proposes a new architecture based on integral probability metrics (IPM). The main characteristic of SphereGAN is that it bounds the IPM objective function on a hypersphere.

Compared with other architectures such as WGAN-GP [4], the SphereGAN loss function does not require any constraint term, reducing the necessity of hyperparameter tuning. The loss function of SphereGAN is defined as follows:

min_G max_D Σ_r E_x[d_s^r(N, D(x))] − Σ_r E_z[d_s^r(N, D(G(z)))]    (23)

where d_s^r denotes the r-th moment distance between a sample and the north pole N of the hypersphere.

In the original paper, the mathematical properties of SphereGAN are proved, showing that minimizing the objective function of SphereGAN is equivalent to reducing the IPM. In addition, it is proved that SphereGAN, compared to WGAN, can use r-
Fig. 10. Survey proposed division of loss function variants for GANs.
Wasserstein distances, unlike WGAN, which could only use the 1-Wasserstein distance. This provides SphereGAN with a wider function space.

The SphereGAN results show its good performance, achieving an IS of 8.39 and a FID score of 17.1 on the CIFAR-10 [107] dataset, compared to WGAN-GP, which achieved an IS of 7.86 on the same dataset.

5.2.11. Super Resolution GAN (SRGAN)

In order to apply GANs to image upscaling, the SRGAN [141] was proposed. The objective of the proposed GAN is to take an input natural image and upscale its resolution by a factor of 4.

To achieve the super resolution, the new variant proposes a couple of adversarial and content losses. Both functions are combined using the so-called perceptual loss function, which is in charge of finding a solution respecting the relevant characteristics of the data. The perceptual loss is defined as follows:

l^SR = l_X^SR + 10^{−3} l_Gen^SR    (24)

where l_Gen^SR is the adversarial loss and l_X^SR is the content loss.

The content loss relies on a pre-trained VGG-19 model [66]. This model, with respect to the usage of a loss function such as the MSE, is more invariant to changes in pixel space. This metric provides the network with information about the quality of the content of the synthesized image. The loss function is calculated as:

l_VGG/i,j^SR = (1 / (W_{i,j} H_{i,j})) Σ_{x=1}^{W_{i,j}} Σ_{y=1}^{H_{i,j}} (φ_{i,j}(I^HR)_{x,y} − φ_{i,j}(G_{θ_G}(I^LR))_{x,y})²    (25)

where I^LR refers to the low resolution image and I^HR refers to the high resolution image.

In addition to the content loss, the adversarial loss is defined, this being the generative component of the GAN. This function is responsible for pushing the generated images to be realistic and indistinguishable from the real ones. The loss function is defined as:

l_Gen^SR = Σ_{n=1}^{N} −log D_{θ_D}(G_{θ_G}(I^LR))    (26)

The application of SRGAN improves the results of previous algorithms for image super resolution.

Since the introduction of SRGAN, it has been used in many different applications [142–144]. In addition, there are works such as [145] that present some improvements to the SRGAN structure; the new architecture is known as Super Resolution Channel Attention GAN (srcaGAN). The architecture presented in this paper adds a channel attention module to the models; this module recovers the attention layer used in SAGAN [116]. The results presented for this new architecture outperform SRGAN.

5.2.12. Weighted SRGAN (WSRGAN)

One of the characteristics of SRGAN [141] was the combination of the content loss and the adversarial loss during the training. What WSRGAN proposes is changing the importance of each loss and studying the effect of this action.

The main objective of WSRGAN is to improve the performance of the architecture by analyzing its performance under different combinations of its objective functions. The new weighted loss function is defined as follows:

l_X^SR = w · l_MSE^SR + (1 − w) · 10^{−3} · l_VGG^SR    (27)

where w is the parameter that controls the impact of each loss function on the final result.

After training the network with different weight configurations, the paper concludes that the MSE loss is the most important loss function, being supported by the VGG loss. Additionally, when the weight parameter is defined dynamically, even better results are obtained than when it is static.

A brief scheme reviewing the different presented loss function variant GANs can be seen in Fig. 10. We divide the different GANs into groups based on the proposed changes in the loss function.

5.3. GAN timeline

A timeline with the reviewed architectures is presented in Fig. 11. The GANs that have been studied in Sections 5.1 and 5.2 are shown temporally. This timeline provides an overview of the historical development of GANs.

As can be seen, the timeline compiles the most important works of the last decade.
5.2.12. Weighted SRGAN (WSRGAN)
One of the characteristics of the SRGAN [141] was the combination of the content loss and the adversarial loss during training. The WSRGAN proposes changing the importance of each loss and studying the effect of this action.
The main objective of the WSRGAN is to improve the performance of the architecture by analyzing its behavior under different combinations of its objective functions. The new weighted loss function is defined as follows:

$l^{SR}_X = w\, l^{SR}_{MSE} + (1 - w)\, 10^{-3}\, l^{SR}_{VGG}$    (27)

where $w$ is the parameter that controls the impact of each loss function on the final result.
After training the network with different weight configurations, the paper concludes that the MSE loss is the most important loss function, being supported by the VGG loss. Additionally, when the weight parameter is defined dynamically, the results are even better than when it is static.
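A minimal sketch of Eq. (27) under the same conventions, reusing a frozen feature extractor `phi` such as the VGG-19 slice above; the static value of `w` is an arbitrary placeholder, since the paper's best-performing configuration defines the weight dynamically.

```python
import torch

def weighted_content_loss(sr, hr, phi, w=0.8):
    """Sketch of the WSRGAN weighted content loss (Eq. (27)): a weighted
    combination of the pixel-wise MSE and the scaled VGG feature loss."""
    l_mse = torch.mean((sr - hr) ** 2)
    l_vgg = torch.mean((phi(sr) - phi(hr)) ** 2)
    return w * l_mse + (1.0 - w) * 1e-3 * l_vgg
```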
A brief scheme reviewing the presented loss function variant GANs can be seen in Fig. 10. We divide the different GANs into groups based on the proposed changes in the loss function.

5.3. GAN timeline

A timeline with the reviewed architectures is presented in Fig. 11. The GANs studied in Sections 5.1 and 5.2 are shown in temporal order. This timeline provides an overview of the historical development of GANs.
As can be seen, the timeline compiles the most important works of the last decade. It is important to note that some researches have influenced posterior ones. In some cases a research adopts the innovations of previous works as a base and then proposes new changes, e.g. the DCGAN, which has influenced several posterior works. In other cases the relationships between works can be seen as a single line of research, with the articles linked to each other by taking previous results and improving them, e.g. in the case of ProGAN, StyleGAN and Alias-Free GAN.
6. GAN applications

As mentioned before, GANs are one of the most popular applications of machine learning of the last years. GAN models can achieve results in fields where previous models could not; in other cases, GANs improve the previous results significantly.
In this section, we review the most important fields where GAN architectures are applied, paying special attention to the GAN models related to computer vision tasks, and we compare the results of the different architectures.
Most of the latest researches focus on how to apply GANs to generate new synthesized data, replicating a data distribution. But, as we will review in this section, GANs can also be applied to other fields, e.g. video game creation [11].
6.1. Image synthesis

One of the most important fields in which GANs are applied is computer vision. In particular, realistic image generation is the most widely used application of GANs [3,6,26].
Most of the proposed GAN variants are tested by generating real world images. Arguably, image synthesis is the first application that comes to mind when thinking about GANs. Its popularity is due to the good results that GANs can achieve: compared with previous methods, GANs provide sharper results [146]. Both in the academic world and among the general public, GANs have raised a lot of interest.
One of the main reasons for the success of GANs is that their results are easy to understand. As the main output generated by GANs are images, they can be easily understood by anyone. Even if a person does not have any technical understanding of artificial intelligence, it is possible to judge the results.
Within computer vision, image generation is the most used method to test GANs. There are plenty of real world image datasets that can be used to train GANs. The availability of datasets for training neural networks is usually the main drawback of artificial intelligence projects; whether because of its availability or because of its content [147], having a good dataset is essential for machine learning. When real world images are used to train GAN models, the availability of good datasets is not a problem: there is a large variety of datasets [57,107] that have been widely tested and are well known in the academic community.
Since the first GAN publication [1], GAN architectures have been used for synthesizing real world images. In the originally proposed GAN, the models were used to generate images replicating the MNIST [148], CIFAR-10 [107] and Toronto Face Database (TFD) [149] datasets. The images generated using the original structure were very blurry and did not have good quality; even so, the presented results served as the introduction of the GAN architecture.
One of the first improvements to the original architecture was the DCGAN [69], which proposed structural changes and hyper-parameter tuning with respect to the first proposed model. The results of the DCGAN showed improvements in the performance and generation of the networks: the generated images were clearer and more recognizable. Despite that, the architecture still suffered from instability and mode collapse.
The WGAN architecture [26] could drastically reduce the mode collapse and instability of the previous models. Thus, later models adapted the loss function of the WGAN along with their respective structural changes in the network.
Later, the ProGAN [3] introduced a new training methodology that achieved an improved performance of the networks. With the new methodology came a huge improvement in the quality of the generated images. The results showed not only more stable trainings, but also sharper and more diverse images with finer details. Due to the particularities of the applied methodology, it can be applied to other architectures, so later works have used the ProGAN training methodology as their base.
Following the line of research of ProGAN, the StyleGAN [6] was presented. The results produced by the StyleGAN improved upon those of the ProGAN. At this point some generated datasets, e.g. human face images, were indistinguishable from real images to human perception. Along with the high quality of the images, the StyleGAN proposed style mixing, capable of generating new images by combining previous ones. This allows modifying image features at a high, medium and low level, allowing the network to disentangle different features of an image and providing more control over the generated images.
One of the main problems of the StyleGAN was the phenomenon known as texture sticking. This caused the generated images to have a certain texture fixed at an absolute position: when interpolating different images, it was noticeable that some parts of the images, e.g. the hair of a human face, maintained the same texture in spite of changing position. The Alias-Free GAN proposed an architecture that suppressed the texture sticking problem. By eliminating the sticking problem, the interpolation of synthesized images is smoothed, generating a continuum of images that is realistic not only individually but also as a set. The improvements of the Alias-Free GAN, together with the style mixing of StyleGAN, allow creating animations of, for example, a human face changing its position, gender or features such as the smile.
Table 2 summarizes the performance of the GAN models presented in this section. The compared datasets are MNIST [148], TFD [149], CIFAR-10 [107], CelebA-HQ [3] and Flickr-Faces-HQ (FFHQ) [6]. The metrics used for comparing the different variants are the accuracy of the models (the higher the better ↑), IS (the higher the better ↑) and FID (the lower the better ↓).
Table 2
Performance summary of image generation GANs.

                 CIFAR-10                CelebA-HQ   FFHQ
Model            Accuracy ↑   IS ↑       FID ↓       FID ↓
DCGAN            82.8%        6.58       –           –
ProGAN           –            8.80       7.79        8.04
StyleGAN         –            –          5.06        4.40
StyleGAN2        –            –          –           2.70
Alias-Free GAN   –            –          –           3.07
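Since FID is the headline metric in Table 2 and in several comparisons below, we include a compact NumPy/SciPy sketch of how it is computed from feature statistics; the matrices `real_feats` and `fake_feats` (one row per image, e.g. 2048-dimensional Inception activations) are assumed to be precomputed.

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """Sketch of the Frechet Inception Distance between two feature sets:
    ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))."""
    mu_r, mu_g = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    c_r = np.cov(real_feats, rowvar=False)
    c_g = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(c_r + c_g - 2.0 * covmean))
```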
6.2. Image-to-image translation

Taking an image from one domain and converting it to another domain is known as image-to-image translation. It was first proposed with the Pix2Pix architecture [76]. Pix2Pix is based on the CGAN, following the idea of generating images conditioned on their composition via a label input. With Pix2Pix the networks are capable of learning how the same image is translated between one domain and another. The main drawback of Pix2Pix was the requirement of having a paired dataset of images in both domains.
Following the steps of Pix2Pix, CycleGAN [7], DualGAN [94] and DiscoGAN [101] were developed. These new architectures were based on the cyclic consistency idea. Cyclic consistency was previously used in machine learning [92,93]; it is based on the idea that translating an image from one domain to another and then doing the reverse operation should recover the original image. Following this concept, the new networks were capable of translating images without a paired dataset. By not needing a paired dataset, the number of possible applications of GANs to image-to-image translation increased considerably.
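The cyclic consistency idea can be summarized in a few lines. The sketch below assumes two generator networks, `G_ab` (domain A to B) and `G_ba` (domain B to A), and uses an L1 reconstruction penalty, as CycleGAN does.

```python
import torch

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b):
    """Sketch of the cyclic consistency loss: translating A -> B -> A
    (and B -> A -> B) should recover the original images."""
    rec_a = G_ba(G_ab(real_a))  # A -> B -> A
    rec_b = G_ab(G_ba(real_b))  # B -> A -> B
    return torch.mean(torch.abs(rec_a - real_a)) + \
           torch.mean(torch.abs(rec_b - real_b))
```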
Later on, the CSGAN was proposed [135], improving the results of previous architectures. The newly proposed loss function achieved better results in image generation compared with CycleGAN [7], DualGAN [94], DiscoGAN [101] and PS2MAN [138]. The results of this new architecture follow the natural progression of GANs in image-to-image translation and promise an exciting future in what GANs can do.
Image-to-image translation is especially popular in society because of the applications that have been developed in the last years. With the architecture of the presented GANs, the general public is capable, for example, of taking a personal image of themselves and transforming it into one of an old person with their face. These types of applications have become popular in social networks, increasing their visibility even more.
This interaction between society and GAN development is mutually beneficial: society uses the technological advances of the last years, while the academic community gains impact and repercussion. From an academic perspective this interaction should be considered positive, and it should be noted that most of the impact of machine learning during the last years has been caused by the publicity given by the mass media and social networks. Although most people are not interested in the technique behind GAN applications, they act as a catalyst to make more people interested in artificial intelligence and, ultimately, to bring more people to academic research in the field.
Table 3 summarizes the performance of the presented GAN models in image-to-image translation tasks. The data is obtained from [135], where the SSIM (the higher the better ↑), MSE (the lower the better ↓), Peak Signal to Noise Ratio (PSNR) (the higher the better ↑) and Learned Perceptual Image Patch Similarity (LPIPS) [150,151] (the lower the better ↓) are computed for different GAN variants. The comparison is made on the CUHK [136] and FACADES [137] datasets. The LPIPS is a metric that measures the distance between the real and the generated distributions via perceptual similarity.
Table 3
Performance summary of image-to-image translation GANs.

           CUHK                                     FACADES
Model      SSIM ↑   MSE ↓     PSNR ↑    LPIPS ↓    SSIM ↑   MSE ↓      PSNR ↑    LPIPS ↓
GAN        0.5398   94.8815   28.3628   0.157      0.1378   103.8049   27.9706   0.525
Pix2Pix    0.6056   89.9954   28.5989   0.154      0.2106   101.9864   28.0568   0.216
DualGAN    0.6359   85.5418   28.8351   0.132      0.0324   105.0175   27.9187   0.259
CycleGAN   0.6537   89.6019   28.6351   0.099      0.0678   104.3104   27.9489   0.248
PS2MAN     0.6409   86.7004   28.7779   0.098      0.1764   102.4183   28.032    0.221
CSGAN      0.6616   84.7971   28.8693   0.094      0.2183   103.7751   27.9715   0.22
6.3. Video generation

GANs have proven to generate state-of-the-art results in image processing. Along with image generation comes the possibility of generating a set of images that forms a video. Video generation is a more complex task than image generation: the issues associated with image generation are included in video generation, but the computational cost of training models that can process video is high. In addition, the synthesized videos must be coherent.
One of the particular problems of video is the motion blur generated by the networks [152]. When a video is generated, the tracking of some objects can be difficult, generating fuzziness in some portions of the image. Some works have tried to tackle this problem [153–155], but it remains open.
One of the most popular applications of video generation with GANs is the so-called deep fake. A deep fake consists in taking a video of a person and changing the face of that person to be someone else's. Many works have been developed in this field in the last years [156].
Deep fakes are one of the most controversial applications of GANs: the possibility of changing a face in a video allows generating fake videos that can be used to supplant a person. This problem is magnified in the case of women [157] due to their position in society. Even though there are some applications where deep fakes can be beneficial [158], their application still raises doubts in society. This is why many recent researches have focused on how to detect deep fake videos [159–162].
Another application of GANs to video generation is video-to-video translation, which is indeed the general case of the deep fake. Many architectures of this type have been proposed during the last years [163,164].
It should be noted that, in the case of video processing, the standard is to use previous information, such as another video, to generate the synthesized data. Unlike image generation, video generation is more interesting if the new information is conditioned by an external agent. In image processing, the only input was the latent space, but the final images were conditioned by the training dataset. When videos are generated, the degrees of freedom are extended, enabling the generated data to be less controlled. Controlling the video output is necessary to maintain the coherence of the final output, but it also eases the GAN's job, which is significantly more difficult with respect to image processing.

6.4. Image generation from text

Since the introduction of the CGAN, the capabilities of GANs have expanded. The possibility of constraining the synthesized information that GANs produce gives the networks a wider range of application. By controlling the output of the generations of the networks, their applications can be much more specific and interesting. One field where GANs have shown to outperform previous techniques is image generation from text [165].
Stacked GANs (StackGAN) [166] was one of the first proposed architectures for image generation from text. The architecture splits the generation problem in two stages; the objective is to divide the main problem into sub-problems that are easier for the network to handle. The so-called Stage-I GAN is in charge of producing a coarse sketch of the desired image; this way, this part of the network focuses on translating the text into an image that fulfills the description. Then, the Stage-II GAN takes the generated image from the Stage-I GAN, increases its resolution and defines the finer details. The StackGAN is able to produce images that match the input description while achieving sharp, high quality samples. Later on, the StackGAN++ (StackGAN-v2) [167] was proposed; this new architecture resolved some problems of the original StackGAN, stabilizing its training and improving the overall quality of the synthesized images.
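The core mechanism shared by these text-to-image variants is conditioning the generator on a sentence embedding. The following minimal sketch shows that mechanism; the layer sizes and the 32x32 output are illustrative assumptions, not any paper's exact configuration.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Minimal sketch of text-conditioned generation: a sentence embedding
    is concatenated with the latent noise vector before upsampling."""

    def __init__(self, noise_dim=100, text_dim=256, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 128 * 8 * 8),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1),  # -> 32x32
            nn.Tanh(),
        )

    def forward(self, z, text_embedding):
        # Condition the latent code on the text before generating the image.
        return self.net(torch.cat([z, text_embedding], dim=1))
```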
One problem of the StackGAN is that it is highly dependent on the sketch generated by the Stage-I GAN. To solve this, the Dynamic Memory GAN (DM-GAN) proposed a new technique based on memory networks [168,169] that divides the generation problem in two steps. In the first one an initial image is generated, and in the second step a memory network is used to refine the details and produce a high quality image. To connect the memory and the GAN, a response gate is proposed; by dynamically controlling the flow of information, the gate is capable of fusing the information appropriately. The results of the DM-GAN show a higher quality with respect to all previous architectures.
The Dual Attentional GAN (DualAttn-GAN) proposed a new architecture based on two modules. The Visual Attention Module (VAM) is in charge of the internal representations of the image information, capturing the global structures and their relationships. The Textual Attention Module (TAM) defines the relations between the text and the image, defining the links between both. Finally, an Attention Embedding Module (AEM) fuses the visual and textual information, concatenating them along with the input features of the image. The results of the DualAttn-GAN show an improved performance with respect to previously used architectures.
Following the general architecture of StackGAN, the Deep Fusion GAN (DF-GAN) was proposed [170]. The DF-GAN architecture has only one stage of image generation; this backbone synthesizes new images conditioned by an input text using only one pair of G and D. Despite being a simpler structure, DF-GAN achieves better performance and efficiency compared with previous variants. The new techniques that DF-GAN proposes are a new fusion module, known as the deep text-image fusion block, and a new discriminator capable of pushing the generator to synthesize higher quality images without extra networks. The results of the DF-GAN show an improvement in the quality of the images, without committing to more complex models and improving the efficiency of the previous architectures.
The one-stream information approach followed in DF-GAN was reused in the Lightweight Dynamic Conditional GAN (LD-CGAN) [171]. The proposed architecture of the LD-CGAN consists of one G and two independent discriminators. The generator is composed of a Conditional Embedding (CE) that disentangles the features of the input text by using unsupervised learning. Then, a Conditional Manipulating Block (CM-B) continuously provides the image features with the compensation information. Finally, using the so-called Pyramid Attention Refine Block (PAR-B), the generated image is enriched, maintaining multiscale context and spatial multiscale features. The results of the architecture not only show a higher image quality with respect to previous methods, but also improve the performance, decreasing the number of parameters by 86.8% and the computation time by 94.9%.
Table 4 summarizes the performance of the GAN models presented in this section. In addition to the mentioned networks, the Generative Adversarial Text to Image Synthesis (GAN-INT-CLS) [172] and the Generative Adversarial What-Where Network (GAWWN) [173] are included; both of these networks act as a reference of previous architectures. The compared metrics are HR (the lower the better ↓), IS (the higher the better ↑) and FID (the lower the better ↓). The compared datasets are Common Objects in Context (COCO) [174], Caltech-UCSD Birds (CUB) [175] and Oxford-102 [176].
Table 4
Performance summary of image generation from text GANs.

              COCO                     CUB                     Oxford-102
Model         HR ↓   IS ↑    FID ↓    HR ↓   IS ↑   FID ↓     HR ↓   IS ↑   FID ↓
GAN-INT-CLS   1.89   7.88    –        2.81   2.88   –         1.87   2.66   –
GAWWN         –      –       –        1.99   3.62   –         –      –      –
StackGAN      1.11   8.45    –        1.37   3.70   –         1.13   3.20   –
StackGAN-v2   1.55   8.30    81.59    1.19   4.04   15.30     1.30   3.26   48.68
DM-GAN        –      30.49   32.64    –      4.75   16.09     –      –      –
DualAttn-GAN  –      –       –        –      4.59   14.06     –      4.06   40.31
DF-GAN        –      –       21.42    –      5.10   14.81     –      –      –
LD-CGAN       –      –       –        –      4.18   –         –      3.45   –
6.5. Language generation

GAN models have been used during the last years in Natural Language Processing (NLP) tasks. The previously mentioned text-to-image field is one of the applications of GANs where natural language is involved, but there are some applications of GANs completely focused on how to produce new text using the models.
Previous methods to process natural language used the so-called Long Short-Term Memory (LSTM) [177]. The LSTM is capable of maintaining local relationships in space and time; this feature provides the networks with the ability to process whole sentences, paragraphs and texts while maintaining global coherence. In addition to the LSTM, the previous methods used Recurrent Neural Networks (RNN) to generate new texts [178].
The Text GAN (textGAN) [179] uses an LSTM along with a Convolutional Neural Network (CNN) to synthesize new text. The proposed method applies the GAN training methodology via so-called adversarial training. The textGAN uses an LSTM as the G of the network and a CNN as the D. One of the main problems of the textGAN was the highly entangled features of the network, making the interpolation of different writing styles very difficult.
The textGAN approach to language generation suffers from the so-called exposure bias. This bias is caused by the objective function of the network, which focuses on maximizing the log likelihood of the prediction. The exposure bias is visible in the inference stage: when the G generates a sequence of words, it iteratively predicts each word based on the previous ones. The problem comes when the prediction is based on words never seen before in the training stage. Some works were made to tackle this problem [180], but the Sequence GAN (SeqGAN) [181] is the architecture that most improves the produced results.
The G of SeqGAN is trained using a stochastic policy of Reinforcement Learning (RL). The RL reward is calculated by judging a complete sentence made with the G of the model. Then, to compute the intermediate steps, a Monte Carlo search is made [182]. The results of the SeqGAN show a huge improvement in tasks such as language generation, poem composition and music generation. In addition, the performance of the models shows certain creativity in the synthesized data.
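The SeqGAN reward computation can be sketched as follows; `generator.rollout` (a Monte Carlo completion of a partial sentence) and the use of the discriminator's output probability as the reward are our own simplified interface, suggested by the description above rather than taken from the paper's code.

```python
import torch

def rollout_reward(generator, discriminator, partial_seq, seq_len, n_rollouts=16):
    """Sketch of the SeqGAN intermediate reward: complete the partial
    sequence n_rollouts times by Monte Carlo sampling from G, score each
    completion with D, and average the scores."""
    rewards = []
    for _ in range(n_rollouts):
        full_seq = generator.rollout(partial_seq, seq_len)  # sampled completion
        rewards.append(discriminator(full_seq))             # P(real) as reward
    return torch.stack(rewards).mean(dim=0)
```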
Despite the good results of GANs in NLP tasks during the last years, architectures have been developed that outperform GANs in language generation. The most successful architecture in this field is the Generative Pre-trained Transformer 3 (GPT-3) [183], which belongs to the GPT-n series. The GPT-3 is a generator model based on the transformer [117] architecture. The extraordinary results presented by the GPT-3 are often very difficult to distinguish from human writing. Due to the good results of transformers in NLP, the emergence of the GPT-3 has caused the GAN approach to this field to lose interest.
6.6. Data augmentation

Another field where GANs have shown to be really useful is data augmentation. Due to the particularities of GANs, they can be used to obtain more samples from an origin data distribution, replicating that distribution. This way, by using GANs, the number of samples of a dataset can be multiplied.
Traditionally, data augmentation was achieved by transforming the initial data, e.g. cropping, rotating, shearing or flipping images. One of the main drawbacks of these methods is that they transform the original data by slightly changing its structure. With the usage of GANs for data augmentation, the new samples are synthesized from the original distribution: instead of changing the samples of the dataset, the generated samples of the GAN are synthesized from scratch, imitating the original data distribution. It should be noted that GAN-based data augmentation does not necessarily replace other methods of data augmentation; it proposes an alternative that, in many cases, can be used together with other data augmentation algorithms.
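In its simplest form, GAN-based augmentation reduces to sampling a trained generator, as in the hypothetical sketch below; label handling (e.g. via a conditional GAN) is deliberately omitted.

```python
import torch

def synthesize_samples(generator, n_samples, noise_dim=100, device="cpu"):
    """Sketch of GAN-based data augmentation: draw latent vectors and let a
    trained generator produce new samples to append to the real dataset."""
    generator.eval()
    with torch.no_grad():
        z = torch.randn(n_samples, noise_dim, device=device)
        return generator(z).cpu()
```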
For example, the Data Augmentation Optimized for GAN (DAG) [184] proposes an enhanced data augmentation method for GANs, combining it with data transformations such as rotation, flipping or cropping. The DAG is shown to improve the performance of data augmentation in GAN models, improving the FID of the CGAN, the Self-supervised GAN (SSGAN) and the CycleGAN. The proposed architecture uses one D for each transformation of the data, but a unique G.
Data augmentation with GANs has been used in cases where obtaining a dataset is difficult. For example, in medical applications there is usually not much information available; in these cases GANs can make the difference. This is why during the last years GANs have been used in medical data augmentation [185–188].
6.7. Other domains

As mentioned before, due to the particularities of GANs, they can be applied to many different fields. One of the main strengths of machine learning is that it adapts to different situations without substantial changes in its structure. In particular, a GAN can be adapted to any type of data distribution as long as there is an available dataset.

6.7.1. GameGAN
One of the most interesting applications of GANs is the one presented with the GameGAN [11]. The main purpose of GameGAN is to generate a video game entirely using machine learning. To do so, the complete Model-View-Controller (MVC) software design pattern is replicated using artificial intelligence. The proposed architecture is composed of three different modules.
The dynamics engine is in charge of the logic of the whole system, maintaining the global coherence and updating the internal state of the game. The dynamics engine, for example, controls which actions of the game are possible (e.g. eating a fruit in pac-man) and which ones are not (e.g. running through a wall in pac-man). The dynamics engine is composed of an LSTM that updates the state of the game in each frame; the LSTM provides the network with a way to use the previous states of the game to calculate the new information of the subsequent frames. This way, the network can access the complete history of the game, maintaining the consistency of the system.
To save the state of the game, a memory module is used. This module focuses on maintaining the long-term consistency of the game scene. When the game is being played, there are different elements of the scene that are not always visible; with the memory module these elements are consistent over time. This memory remembers the generated static elements of the game. The memory module is implemented by using a Neural Turing Machine (NTM) [189].
The third module that composes the system is the rendering engine, which is in charge of generating a visualization of the current state of the game. This module focuses on representing the different elements of the game realistically, producing disentangled scenes. The rendering engine is composed of transposed convolution layers that are initially trained using an autoencoder architecture to warm up the system, and then they train along with the rest of the modules.
The adversarial training of GameGAN has three types of discriminators. The single image discriminator evaluates the quality of each generated frame, judging how realistic it is. The action-conditioned discriminator determines if two consecutive frames are consistent with respect to the input of the player. Finally, the temporal discriminator maintains the long-term consistency of the scene, preventing elements from appearing or disappearing randomly.
One of the bases of GameGAN is the disentangling of the dynamic and static elements of the game. The static elements of a game could be, for example, walls, while the dynamic elements are elements such as non-playable characters. By disentangling both types of elements, the game behavior is more interpretable for the model.
Finally, GameGAN introduces a warm-up phase where certain real frames are introduced into the network during the first epochs of training. Then the frequency of real frames is reduced little by little until it disappears. This way the first epochs of training, which are usually the most complex for the network, are controlled, and progressively the GAN gains more control over the output. This helps the network to understand the problem.
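The warm-up can be pictured as a decaying probability of feeding ground-truth frames; the paper only specifies a progressive reduction, so the linear schedule below is an illustrative assumption.

```python
def real_frame_probability(epoch, warmup_epochs=10):
    """Sketch of a GameGAN-style warm-up schedule: the chance of injecting
    a real frame decays from 1 to 0 over the first training epochs."""
    return max(0.0, 1.0 - epoch / warmup_epochs)
```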
6.7.2. Medical imaging GANs
One of the most popular applications of the GAN architecture is to enlarge datasets. The objective of synthesizing new data is to produce larger datasets that improve the performance of machine learning models, which are very sensitive to the number of samples used in their training.
There are many fields where data augmentation can be applied, but in medical imaging augmenting data has certain benefits due to the particularities of the problem. First, medical datasets are usually small because of the cost of obtaining the images; most of the time it is necessary to use measurement and recording machines such as radiography, magnetic resonance or ultrasound. In addition to the cost of obtaining these images, there also exist ethical and legal problems related to the nature of the data. Most of the time, obtaining images that expose the health status of different people is impossible, which leads to an even greater lack of available data.
It should be noted that one of the benefits of generating data with GANs is that the new samples do not belong to any real person.
Because of all these factors, there have been many GAN works related to the medical imaging field [190–195]. In addition, the work of Chen et al. [196] analyzes the evolution of the field of medical data augmentation and suggests that the research in this field remained strong in the year 2021, despite the fact that from 2019 onwards the number of published works has stayed the same.
6.7.3. GANs in agriculture
Similar to the medical imaging field, obtaining images to train the computer vision models of agricultural image analysis is not an easy task. These models benefit from having large-scale balanced datasets, but the cost of obtaining high quality labeled data makes data augmentation a crucial task for these datasets. Many different GAN models have been applied to agricultural data, such as [197–199]. These works aim to generate new images of plants with different diseases, augmenting the number of samples by using GANs.
In these cases the use of GANs improves the results of the machine learning models by enlarging the amount of available data. Agricultural images have different particularities that make their analysis a difficult task. For example, the biological variability between two samples of the same species makes it crucial to have many different samples to learn all the modes of the data. In particular, the same leaf of a fruit can drastically differ from one individual to another.
Another important factor is that the labeling of the data can be very costly, especially for specific applications such as the disease detection of a certain plant, e.g. tomato leaf [197]. In addition, the environment where the images are taken, most of the time in crops, can lead to much variance in the images, such as lighting changes or object occlusion.

6.7.4. Drug discovery using GANs
The process of discovering and designing new drugs has recently been boosted by the field of Deep Learning [200,201]. In particular, GANs are a useful technique to synthesize new useful samples of data. In the drug environment, the GAN architecture can process the drug compound using graphs or the Simplified Molecular Input Line Entry Specification (SMILES), to then generate synthetic samples of drugs.
Due to the flexibility that ANNs have in terms of operating with different data types, it is possible to use the same architectures in different fields. In this case the overall GAN design can be adapted to molecular data, being able to transfer the same principles of image generation to new data types.
The research followed by Kadurin et al. [202,203] generates new drug compounds for anticancer therapy, using biological and chemical datasets. In particular, in [203] an Adversarial Autoencoder that uses molecular fingerprints as inputs of the network is used. By using this architecture the researchers are able to define the desired properties of the synthesized drugs. Some of the new synthetic drugs discovered by the Deep Learning architecture corresponded to previously known anticancer drugs. This led the researchers to suggest that the remaining unknown drugs generated by the GAN could be used to further study their properties.
The work presented in [204] proposes the generation of new drugs combining GANs with reinforcement learning techniques. In particular, the proposed G takes as input a random latent space and processes it with an RNN to produce a drug sequence using the SMILES representation. The D, on its side, uses a 1-dimensional CNN to distinguish the real data from the synthesized one. The results of the paper suggest that the new drugs discovered were unique and diverse. This may alleviate the first phases of drug development, which are very expensive in terms of time.
The Federated Generative Adversarial Network for Graph-based Molecule Drug Discovery (FL-DISCO) architecture [205] aims to combine the generation potential of GANs with the processing of molecules as graphs with Graph Neural Networks, while maintaining the privacy of the data using Federated Learning [206]. By using a graph representation of the molecules instead of SMILES, as previous works did, the represented samples have more realistic structures, maintaining the structural relationships of the connected atoms of the molecules. The Federated Learning framework is based on using different clients to train a specific neural network model; each client has its respective portion of the data, which it uses to train the network. This way each client knows a portion of the data and uses it to update the central model, but privacy is maintained due to the fact that the clients are not able to communicate with each other. The results of this research show progress in terms of novelty and diversity of the synthesized drugs with respect to previous works.

7. Discussion

Since their introduction in 2014, GANs have been the most important generative architecture in computer vision. The results provided by the developed GANs were notoriously better than those of previous architectures, such as Variational Autoencoders. This led to a constant improvement of the model, solving problems like stabilization or mode collapse.
With the introduction of Diffusion models [207–209], the results of GANs have been surpassed by these new models, which solve some of their most important problems. Some aspects in which diffusion models outperform GANs are their better stability, the fact that they do not suffer from mode collapse, and the more diverse results they provide. This is mainly caused by the fact that they are likelihood-based [210]. Despite the better results of diffusion models, they still have shortcomings in some aspects, such as the cost of synthesizing new samples, which makes them difficult to apply in real-time problems.
In [211] a diffusion model was developed to perform image-to-image translation. The results shown in this research indicate that their solution outperforms GANs without special attention to hyper-parameter tuning or any kind of sophisticated technique or loss function. Moreover, this research shows the great stability of the diffusion model architecture.
Despite the fact that Diffusion models are a novel architecture with not many works published, they have great potential to surpass GAN results in the near future. At present, there are not enough results or applications of diffusion models to data generation, but the potential of this new architecture could lead to a significant improvement in the results of data synthesis. We consider that these models could replace GANs because of their stability and because they do not need fine-tuning of their hyperparameters.
Other new architectures, such as transformers, have been used to enhance the results of GANs. The transformer architecture is a time-series-based architecture that adopts self-attention layers [117], making it possible to design larger models. Transformers have been used as the base neural model of the G and D of the GAN architecture, improving the performance of the model.
The TransGAN [212] presents a GAN architecture free of convolutions that makes it possible to generate high resolution images by using transformers in both the G and D of the GAN. The results of the article show improved results with respect to the IS and FID on the CIFAR-10 dataset [107].
Another work that showcases the interaction between GANs and transformers is the one presented in [213]. This work uses the generative model to predict pedestrian paths, using the memory that the transformer architecture has. In this sense, the GAN makes it possible to train the network to predict future paths of pedestrians, while the transformer provides the memory to process a historical sequence of the latest movements.

8. Conclusion

This report summarizes the recent progress of GANs, going from the basic principles on which GANs are sustained to the most innovative architectures of the last years. In addition, the different problems that GANs can suffer are categorized, and the most common evaluation metrics are explained and discussed.
With respect to the recent progress in the field, a taxonomy for the GAN variants is proposed. The researches are divided in two groups: one with the GANs that focus on architecture optimization and the other with the GANs that focus on objective function optimization. Despite being two separate groups of variants, it should be noted that the different researches benefit from the progress of the rest. This ecosystem, where there are various approaches for GAN development, is connected with the main problems reviewed in this survey, since normally each research focuses on trying to solve a certain problem of previous researches.
Finally, the different applications of GANs during the last years are summarized. The different applications of GANs are influenced by the development of the field and its impact on society and industry. We conclude with a comparison between the performance of the different architectures to provide a quantitative view of the evolution of GANs.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

[1] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, 2014.
[2] J. Cheng, Y. Yang, X. Tang, N. Xiong, Y. Zhang, F. Lei, Generative adversarial networks: A literature review, KSII Trans. Internet Inf. Syst. 14 (12) (2020).
[3] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, 2018.
[4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of Wasserstein GANs, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS ’17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 5769–5779.
[5] J. Xu, X. Ren, J. Lin, X. Sun, Diversity-promoting GAN: A cross-entropy based generative adversarial network for diversified text generation, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3940–3949.
[6] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, 2019.
[7] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: 2017 IEEE International Conference on Computer Vision, ICCV, 2017, pp. 2242–2251.
[8] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, 2018.
[9] M. Zhu, P. Pan, W. Chen, Y. Yang, DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2019.
[10] Y. Li, M. Min, D. Shen, D. Carlson, L. Carin, Video generation from text, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018, p. 1.
[11] S.W. Kim, Y. Zhou, J. Philion, A. Torralba, S. Fidler, Learning to simulate dynamic environments with GameGAN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1231–1240.
[12] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for Boltzmann machines, Cogn. Sci. 9 (1) (1985) 147–169.
[13] D. Bank, N. Koenigstein, R. Giryes, Autoencoders, 2021.
[14] A. van den Oord, N. Kalchbrenner, Pixel RNN, in: ICML, 2016.
[15] Y. Sun, L. Xu, L. Guo, Y. Li, Y. Wang, A comparison study of VAE and GAN for software fault prediction, in: S. Wen, A. Zomaya, L.T. Yang (Eds.), Algorithms and Architectures for Parallel Processing, Springer International Publishing, Cham, 2020, pp. 82–96.
[16] M. Wiatrak, S.V. Albrecht, Stabilizing generative adversarial network training: A survey, 2019, arXiv.
[17] H. Thanh-Tung, T. Tran, S. Venkatesh, Improving generalization and stability of generative adversarial networks, 2019.
[18] X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, S. Paul Smolley, Least squares generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, ICCV, 2017.
[19] Bhagyashree, V. Kushwaha, G.C. Nandi, Study of prevention of mode collapse in generative adversarial network (GAN), in: 2020 IEEE 4th Conference on Information Communication Technology, CICT, 2020, pp. 1–6.
[20] D. Bang, H. Shim, MGGAN: Solving mode collapse using manifold guided training, 2018.
[21] S. Adiga, M.A. Attia, W.-T. Chang, R. Tandon, On the tradeoff between mode collapse and sample quality in generative adversarial networks, in: 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2018, pp. 1184–1188.
[22] D. Bau, J.-Y. Zhu, J. Wulff, W. Peebles, H. Strobelt, B. Zhou, A. Torralba, Seeing what a GAN cannot generate, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, 2019.
[23] R. Durall, A. Chatzimichailidis, P. Labus, J. Keuper, Combating mode collapse in GAN training: An empirical analysis using hessian eigenvalues, 2020.
[24] H. Thanh-Tung, T. Tran, Catastrophic forgetting and mode collapse in GANs, in: 2020 International Joint Conference on Neural Networks, IJCNN, 2020, pp. 1–10.
[25] A. Aggarwal, M. Mittal, G. Battineni, Generative adversarial network: An overview of theory and applications, Int. J. Inf. Manage. Data Insights 1 (1) (2021) 100004.
[26] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, 2017.
[27] B. Ghosh, I.K. Dutta, M. Totaro, M. Bayoumi, A survey on the progression and performance of generative adversarial networks, in: 2020 11th International Conference on Computing, Communication and Networking Technologies, ICCCNT, 2020, pp. 1–8.
[28] Z. Wang, Q. She, T.E. Ward, Generative adversarial networks in computer vision: A survey and taxonomy, 2020.
[29] H. Alqahtani, M. Kavakli-Thorne, D.G. Kumar Ahuja, Applications of generative adversarial networks (GANs): An updated review, Arch. Comput. Methods Eng. 28 (2019).
[30] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, Y. Zheng, Recent progress on generative adversarial networks (GANs): A survey, IEEE Access 7 (2019) 36322–36333.
[31] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, F.-Y. Wang, Generative adversarial networks: introduction and outlook, IEEE/CAA J. Autom. Sin. 4 (4) (2017) 588–598.
[32] V. Sampath, I. Maurtua, J.J.A. Martín, A. Gutierrez, A survey on generative adversarial networks for imbalance problems in computer vision tasks, J. Big Data 8 (1) (2021) 1–59.
[33] X. Wu, K. Xu, P. Hall, A survey of image synthesis and editing with generative adversarial networks, Tsinghua Sci. Technol. 22 (6) (2017) 660–674.
[34] Z. Pan, W. Yu, B. Wang, H. Xie, V.S. Sheng, J. Lei, S. Kwong, Loss functions of generative adversarial networks (GANs): opportunities and challenges, IEEE Trans. Emerg. Top. Comput. Intell. 4 (4) (2020) 500–522.
[35] J. Gui, Z. Sun, Y. Wen, D. Tao, J. Ye, A review on generative adversarial networks: Algorithms, theory, and applications, 2020.
[36] H. Zhang, Z. Le, Z. Shao, H. Xu, J. Ma, MFF-GAN: An unsupervised generative adversarial network with adaptive and gradient joint constraints for multi-focus image fusion, Inf. Fusion 66 (2021) 40–53.
[37] R. Liu, Y. Ge, C.L. Choi, X. Wang, H. Li, DivCo: Diverse conditional image synthesis via contrastive generative adversarial network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 16377–16386.
[38] D.M. De Silva, G. Poravi, A review on generative adversarial networks, in: 2021 6th International Conference for Convergence in Technology (I2CT), 2021, pp. 1–4.
[39] L. Metz, B. Poole, D. Pfau, J. Sohl-Dickstein, Unrolled generative adversarial networks, 2017.
[40] S. Suh, H. Lee, P. Lukowicz, Y.O. Lee, CEGAN: Classification enhancement generative adversarial networks for unraveling data imbalance problems, Neural Netw. 133 (2021) 69–86.
[41] J. Nash, Non-cooperative games, Ann. of Math. (1951) 286–295.
[42] F. Farnia, A. Ozdaglar, GANs may have no Nash equilibria, 2020.
[43] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Adv. Neural Inf. Process. Syst. 30 (2017).
[44] Á. González-Prieto, A. Mozo, E. Talavera, S. Gómez-Canaval, Dynamics of Fourier modes in torus generative adversarial networks, Mathematics 9 (4) (2021).
[45] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, 2016.
[46] Z. Zhang, C. Luo, J. Yu, Towards the gradient vanishing, divergence mismatching and mode collapse of generative adversarial nets, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2377–2380.
[47] H.D. Meulemeester, J. Schreurs, M. Fanuel, B.D. Moor, J.A.K. Suykens, The Bures metric for generative adversarial networks, 2021.
[48] W. Li, L. Fan, Z. Wang, C. Ma, X. Cui, Tackling mode collapse in multi-generator GANs with orthogonal vectors, Pattern Recognit. 110 (2021) 107646.
[49] I. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, 2017.
[50] S. Pei, R.Y. Da Xu, G. Meng, dp-GAN: Alleviating mode collapse in GAN via diversity penalty module, 2021, arXiv preprint arXiv:2108.02353.
[51] J. Su, GAN-QP: A novel GAN framework without gradient vanishing and Lipschitz constraint, 2018.
[52] Y. Zuo, G. Avraham, T. Drummond, Improved training of generative adversarial networks using decision forests, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV, 2021, pp. 3492–3501.
[53] S. Liu, O. Bousquet, K. Chaudhuri, Approximation and convergence properties of generative adversarial learning, 2017.
[54] S.A. Barnett, Convergence problems with generative adversarial networks (GANs), 2018.
[55] A. Borji, Pros and cons of GAN evaluation measures, Comput. Vis. Image Underst. 179 (2019) 41–65.
[56] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, 2015.
[57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[58] S. Nowozin, B. Cseke, R. Tomioka, f-GAN: Training generative neural samplers using variational divergence minimization, 2016.
[59] S. Gurumurthy, R.K. Sarvadevabhatla, V.B. Radhakrishnan, DeLiGAN: Generative adversarial networks for diverse and limited data, 2017.
[60] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, T. Aila, Alias-free generative adversarial networks, 2021, arXiv preprint arXiv:2106.12423.
[61] G. Daras, A. Odena, H. Zhang, A.G. Dimakis, Your local GAN: Designing two dimensional local attention mechanisms for generative models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14531–14539.
[62] Z. Wang, E. Simoncelli, A. Bovik, Multiscale structural similarity for image quality assessment, in: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, 2003, pp. 1398–1402.
[63] K. Kurach, M. Lucic, X. Zhai, M. Michalski, S. Gelly, The GAN landscape: Losses, architectures, regularization, and normalization, 2019.
[64] E.L. Lehmann, J.P. Romano, Testing Statistical Hypotheses, Springer Science & Business Media, 2006.
[65] D. Lopez-Paz, M. Oquab, Revisiting classifier two-sample tests, 2018.
[66] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[67] W. Bounliphone, E. Belilovsky, M.B. Blaschko, I. Antonoglou, A. Gretton, A test of relative similarity for model selection in generative models, 2016.
[68] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, B. Póczos, MMD GAN: Towards deeper understanding of moment matching network, 2017.
[69] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, 2016.
[70] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, K. Tunyasuvunakool, O. Ronneberger, R. Bates, A. Žídek, A. Bridgland, et al., High accuracy protein structure prediction using deep learning, in: Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), Vol. 22, 2020, p. 24.
[71] J.T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: The all convolutional net, 2015.
[72] R. Ayachi, M. Afif, Y. Said, M. Atri, Strided convolution instead of max pooling for memory efficiency of convolutional neural networks, in: M.S. Bouhlel, S. Rovetta (Eds.), Proceedings of the 8th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT’18), Vol. 1, Springer International Publishing, Cham, 2020, pp. 234–243.
[73] Y. Li, N. Xiao, W. Ouyang, Improved boundary equilibrium generative adversarial networks, IEEE Access 6 (2018) 11342–11348.
[74] S. Wu, G. Li, L. Deng, L. Liu, D. Wu, Y. Xie, L. Shi, L1 norm batch normalization for efficient training of deep neural networks, IEEE Trans. Neural Netw. Learn. Syst. 30 (7) (2019) 2043–2051.
[75] D.H. Hubel, T.N. Wiesel, Receptive fields of single neurones in the cat’s striate cortex, J. Physiol. 148 (3) (1959) 574–591.
[76] M. Mirza, S. Osindero, Conditional generative adversarial nets, 2014.
[77] M. Loey, G. Manogaran, N.E.M. Khalifa, A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images, Neural Comput. Appl. (2020) 1–13.
[78] Y. Ma, X. Chen, W. Zhu, X. Cheng, D. Xiang, F. Shi, Speckle noise reduction in optical coherence tomography images based on edge-sensitive cGAN, Biomed. Opt. Express 9 (11) (2018) 5129–5146.
[79] Y. Li, R. Fu, X. Meng, W. Jin, F. Shao, A SAR-to-optical image translation method based on conditional generation adversarial network (cGAN), IEEE Access 8 (2020) 60338–60343.
[80] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 2180–2188.
[81] A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, in: International Conference on Machine Learning, PMLR, 2017, pp. 2642–2651.
[82] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27 (3) (1948) 379–423.
[83] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[84] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[85] Y. Zhou, T.L. Berg, Learning temporal transformations from time-lapse videos, in: European Conference on Computer Vision, Springer, 2016, pp. 262–277.
[86] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, Springer, 2016, pp. 694–711.
[87] M. Liu, J. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional GANs, in: ICCV, 2017.
[88] Y. Qu, Y. Chen, J. Huang, Y. Xie, Enhanced pix2pix dehazing network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8160–8168.
[89] M. Mori, T. Fujioka, L. Katsuta, Y. Kikuchi, G. Oda, T. Nakagawa, Y. Kitazume, K. Kubota, U. Tateishi, Feasibility of new fat suppression for breast MRI using pix2pix, Jpn. J. Radiol. 38 (11) (2020) 1075–1081.
[90] W. Pan, C. Torres-Verdín, M.J. Pyrcz, Stochastic pix2pix: a new machine learning method for geophysical and well conditioning of rule-based channel reservoir models, Natural Resour. Res. 30 (2) (2021) 1319–1345.
[91] M. Drob, RF PIX2PIX unsupervised wi-fi to video translation, 2021, arXiv preprint arXiv:2102.09345.
[92] N. Sundaram, T. Brox, K. Keutzer, Dense point trajectories by GPU-accelerated large displacement optical flow, in: European Conference on Computer Vision, Springer, 2010, pp. 438–451.
[93] Z. Kalal, K. Mikolajczyk, J. Matas, Forward-backward error: Automatic detection of tracking failures, in: 2010 20th International Conference on Pattern Recognition, IEEE, 2010, pp. 2756–2759.
[94] Z. Yi, H. Zhang, P. Tan, M. Gong, DualGAN: Unsupervised dual learning for image-to-image translation, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.
[95] J. Ye, Y. Ji, X. Wang, X. Gao, M. Song, Data-free knowledge amalgamation via group-stack dual-GAN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12516–12525.
[96] D. Prokopenko, J.V. Stadelmann, H. Schulz, S. Renisch, D.V. Dylov, Synthetic CT generation from MRI using improved DualGAN, 2019, arXiv preprint arXiv:1909.08942.
[97] W. Liang, D. Ding, G. Wei, An improved DualGAN for near-infrared image colorization, Infrared Phys. Technol. 116 (2021) 103764.
[98] C.L.M. Veillon, N. Obin, A. Roebel, Towards end-to-end F0 voice conversion based on dual-GAN with convolutional wavelet kernels, 2021, arXiv preprint arXiv:2104.07283.
[99] F. Yger, A. Rakotomamonjy, Wavelet kernel learning, Pattern Recognit. 44 (10–11) (2011) 2614–2629.
[100] Z. Luo, J. Chen, T. Takiguchi, Y. Ariki, Emotional voice conversion using dual supervised adversarial networks with continuous wavelet transform f0 features, IEEE/ACM Trans. Audio Speech Lang. Process. 27 (10) (2019) 1535–1548.
[101] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1857–1865.
[102] C.R.A. Chaitanya, A.S. Kaplanyan, C. Schied, M. Salvi, A. Lefohn, D. Nowrouzezahrai, T. Aila, Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder, ACM Trans. Graph. 36 (4) (2017) 1–12.
[103] I.A. Luchnikov, A. Ryzhov, P.-J. Stas, S.N. Filippov, H. Ouerdane, Variational autoencoder reconstruction of complex many-body physics, Entropy 21 (11) (2019) 1091.
[104] J. Mehta, A. Majumdar, RODEO: robust de-aliasing autoencoder for real-time medical image reconstruction, Pattern Recognit. 63 (2017) 499–510.
[105] S. Hicsonmez, N. Samet, E. Akbas, P. Duygulu, GANILLA: Generative adversarial networks for image to illustration translation, Image Vis. Comput. 95 (2020) 103886.
[106] A.A. Rusu, N.C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks, 2016, arXiv preprint arXiv:1606.04671.
[107] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, 2009.
[108] H. Yang, J. Liu, L. Zhang, Y. Li, H. Zhang, ProEGAN-MS: A progressive growing generative adversarial networks for electrocardiogram generation, IEEE Access 9 (2021) 52089–52100.
[109] V. Bhagat, S. Bhaumik, Data augmentation using generative adversarial
[111] T. Sainburg, M. Thielk, B. Theilman, B. Migliori, T. Gentner, Generative adversarial interpolative autoencoding: adversarial training on latent space interpolations encourage convex latent distributions, 2018, arXiv preprint arXiv:1807.06650.
[112] S. Laine, Feature-based metrics for exploring the latent space of generative models, ICLR Workshop Poster, 2018.
[113] X. Huang, S. Belongie, Arbitrary style transfer in real-time with adaptive instance normalization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1501–1510.
[114] M. Tancik, P.P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J.T. Barron, R. Ng, Fourier features let networks learn high frequency functions in low dimensional domains, 2020, arXiv preprint arXiv:2006.10739.
[115] R. Xu, X. Wang, K. Chen, B. Zhou, C.C. Loy, Positional encoding as spatial inductive bias in GANs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13569–13578.
[116] H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, Self-attention generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 7354–7363.
[117] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[118] A. Brock, J. Donahue, K. Simonyan, Large scale GAN training for high fidelity natural image synthesis, 2018, arXiv preprint arXiv:1809.11096.
[119] A.G. Dimakis, P.B. Godfrey, Y. Wu, M.J. Wainwright, K. Ramchandran, Network coding for distributed storage systems, IEEE Trans. Inform. Theory 56 (9) (2010) 4539–4551.
[120] Y. Chen, G. Li, C. Jin, S. Liu, T. Li, SSD-GAN: Measuring the realness in the spatial and spectral domains, 2020, arXiv preprint arXiv:2012.05535.
[121] P. Benioff, The computer as a physical system: A microscopic quantum mechanical Hamiltonian model of computers as represented by turing machines, J. Stat. Phys. 22 (5) (1980) 563–591.
[122] E.R. MacQuarrie, C. Simon, S. Simmons, E. Maine, The emerging commercial landscape of quantum computing, Nat. Rev. Phys. 2 (11) (2020) 596–598.
[123] Y. Cao, J. Romero, J.P. Olson, M. Degroote, P.D. Johnson, M. Kieferová, I.D. Kivlichan, T. Menke, B. Peropadre, N.P. Sawaya, et al., Quantum chemistry in the age of quantum computing, Chem. Rev. 119 (19) (2019) 10856–10915.
[124] S.A. Stein, B. Baheri, R.M. Tischio, Y. Mao, Q. Guan, A. Li, B. Fang, S. Xu, QuGAN: A generative adversarial network through quantum states, 2020, arXiv preprint arXiv:2010.09036.
[125] M.Y. Niu, A. Zlokapa, M. Broughton, S. Boixo, M. Mohseni, V. Smelyanskyi, H. Neven, Entangling quantum generative adversarial networks, 2021, arXiv preprint arXiv:2105.00080.
[126] W.W. Ng, J. Hu, D.S. Yeung, S. Yin, F. Roli, Diversified sensitivity-based undersampling for imbalance classification problems, IEEE Trans. Cybern. 45 (11) (2014) 2402–2412.
[127] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory, Knowl. Inf. Syst. 33 (2) (2012) 245–265.
[128] Z. Pan, F. Yuan, J. Lei, W. Li, N. Ling, S. Kwong, MIEGAN: Mobile image enhancement via a multi-module cascade neural network, IEEE Trans. Multimed. 24 (2021) 519–533.
[129] G. Qi, Loss-sensitive generative adversarial networks on Lipschitz densities, 2017, arXiv preprint arXiv:1701.06264.
[130] L. Weng, From GAN to WGAN, 2019, arXiv preprint arXiv:1904.08994.
[131] J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, M. Tan, Multi-marginal Wasserstein GAN, Adv. Neural Inf. Process. Syst. 32 (2019) 1776–1786.
[132] Y. Xiangli, Y. Deng, B. Dai, C.C. Loy, D. Lin, Real or not real, that is the question, 2020, arXiv preprint arXiv:2002.05512.
[133] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, 2018, arXiv preprint arXiv:1802.05957.
[134] T. Salimans, D.P. Kingma, Weight normalization: A simple reparameterization to accelerate training of deep neural networks, Adv. Neural Inf. Process. Syst. 29 (2016) 901–909.
[135] K.B. Kancharagunta, S.R. Dubey, CSGAN: Cyclic-synthesized generative adversarial networks for image-to-image transformation, 2019, arXiv preprint arXiv:1901.03554.
[136] X. Wang, X. Tang, Face photo-sketch synthesis and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 31 (11) (2008) 1955–1967.
[137] R. Tyleček, R. Šára, Spatial pattern templates for recognition of objects
networks for pneumonia classification in chest xrays, in: 2019 Fifth with regular structure, in: German Conference on Pattern Recognition,
International Conference on Image Information Processing, ICIIP, IEEE, Springer, 2013, pp. 364–374.
2019, pp. 574–579. [138] L. Wang, V. Sindagi, V. Patel, High-quality facial photo-sketch synthesis
[110] L. Liu, Y. Zhang, J. Deng, S. Soatto, Dynamically grown generative ad- using multi-adversarial networks, in: 2018 13th IEEE International Con-
versarial networks, in: Proceedings of the AAAI Conference on Artificial ference on Automatic Face & Gesture Recognition (FG 2018), IEEE, 2018,
Intelligence, Vol. 35, 2021, pp. 8680–8687. pp. 83–90.
[139] N. Barzilay, T.B. Shalev, R. Giryes, MISS GAN: A multi-IlluStrator style generative adversarial network for image to illustration translation, Pattern Recognit. Lett. (2021).
[140] S.W. Park, J. Kwon, Sphere generative adversarial network based on geometric moment matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4292–4301.
[141] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[142] H. Zhang, T. Zhu, X. Chen, L. Zhu, D. Jin, P. Fei, Super-resolution generative adversarial network (SRGAN) enabled on-chip contact microscopy, J. Phys. D: Appl. Phys. 54 (39) (2021) 394005.
[143] O. Dehzangi, S.H. Gheshlaghi, A. Amireskandari, N.M. Nasrabadi, A. Rezai, OCT image segmentation using neural architecture search and SRGAN, in: 2020 25th International Conference on Pattern Recognition, ICPR, IEEE, 2021, pp. 6425–6430.
[144] S. Zhao, Y. Fang, L. Qiu, Deep learning-based channel estimation with SRGAN in OFDM systems, in: 2021 IEEE Wireless Communications and Networking Conference, WCNC, IEEE, 2021, pp. 1–6.
[145] B. Liu, J. Chen, A super resolution algorithm based on attention mechanism and SRGAN network, IEEE Access (2021).
[146] A. Genevay, G. Peyré, M. Cuturi, GAN and VAE from an optimal transport point of view, 2017, arXiv preprint arXiv:1706.01807.
[147] E. Denton, A. Hanna, R. Amironesei, A. Smart, H. Nicole, M.K. Scheuerman, Bringing the people back in: Contesting benchmark machine learning datasets, 2020, arXiv preprint arXiv:2007.07399.
[148] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[149] J. Susskind, A. Anderson, G.E. Hinton, The Toronto Face Dataset, Tech. Rep. UTML TR 2010-001, U. Toronto, 2010.
[150] R. Zhang, P. Isola, A.A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
[151] J. Lin, Y. Xia, T. Qin, Z. Chen, T.-Y. Liu, Conditional image-to-image translation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5524–5532.
[152] Q. Guo, W. Feng, R. Gao, Y. Liu, S. Wang, Exploring the effects of blur and deblurring to visual object tracking, IEEE Trans. Image Process. 30 (2021) 1812–1824.
[153] K. Zhang, W. Luo, Y. Zhong, L. Ma, B. Stenger, W. Liu, H. Li, Deblurring by realistic blurring, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2737–2746.
[154] M.A. Younus, T.M. Hasan, Effective and fast deepfake detection method based on Haar wavelet transform, in: 2020 International Conference on Computer Science and Software Engineering, CSASE, IEEE, 2020, pp. 186–190.
[155] X. Ren, Z. Qian, Q. Chen, Video deblurring by fitting to test data, 2020, arXiv preprint arXiv:2012.05228.
[156] M. Westerlund, The emergence of deepfake technology: A review, Technol. Innov. Manage. Rev. 9 (11) (2019).
[157] V.C. Martínez, G.P. Castillo, Historia del "fake" audiovisual: "deepfake" y la mujer en un imaginario falsificado y perverso [History of the audiovisual "fake": "deepfake" and women in a falsified and perverse imaginary], Hist. Comun. Soc. 24 (2) (2019) 55.
[158] A.O. Kwok, S.G. Koh, Deepfake: A social construction of technology perspective, Curr. Issues Tour. 24 (13) (2021) 1798–1802.
[159] P. Korshunov, S. Marcel, Vulnerability assessment and detection of deepfake videos, in: 2019 International Conference on Biometrics, ICB, IEEE, 2019, pp. 1–6.
[160] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. Canton Ferrer, The deepfake detection challenge dataset, 2020, arXiv e-prints arXiv–2006.
[161] N. Carlini, H. Farid, Evading deepfake-image detectors with white- and black-box attacks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 658–659.
[162] H. Zhao, W. Zhou, D. Chen, T. Wei, W. Zhang, N. Yu, Multi-attentional deepfake detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2185–2194.
[163] Y. Chen, Y. Pan, T. Yao, X. Tian, T. Mei, Mocycle-gan: Unpaired video-to-video translation, in: Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 647–655.
[164] A. Bansal, S. Ma, D. Ramanan, Y. Sheikh, Recycle-gan: Unsupervised video retargeting, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 119–135.
[165] L. Kurup, M. Narvekar, R. Sarvaiya, A. Shah, Evolution of neural text generation: Comparative analysis, in: Advances in Computer, Communication and Computational Sciences, Springer, 2021, pp. 795–804.
[166] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
[167] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, Stackgan++: Realistic image synthesis with stacked generative adversarial networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (8) (2018) 1947–1962.
[168] C. Gulcehre, S. Chandar, K. Cho, Y. Bengio, Dynamic neural turing machine with soft and hard addressing schemes, 2016, arXiv preprint arXiv:1607.00036.
[169] J. Weston, S. Chopra, A. Bordes, Memory networks, 2014, arXiv preprint arXiv:1410.3916.
[170] M. Tao, H. Tang, S. Wu, N. Sebe, X.-Y. Jing, F. Wu, B. Bao, Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis, 2020, arXiv preprint arXiv:2008.05865.
[171] L. Gao, D. Chen, Z. Zhao, J. Shao, H.T. Shen, Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis, Pattern Recognit. 110 (2021) 107384.
[172] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, in: International Conference on Machine Learning, PMLR, 2016, pp. 1060–1069.
[173] S.E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, H. Lee, Learning what and where to draw, Adv. Neural Inf. Process. Syst. 29 (2016) 217–225.
[174] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[175] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 dataset, 2011.
[176] M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, IEEE, 2008, pp. 722–729.
[177] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[178] A.M. Dai, Q.V. Le, Semi-supervised sequence learning, Adv. Neural Inf. Process. Syst. 28 (2015) 3079–3087.
[179] Y. Zhang, Z. Gan, L. Carin, Generating text via adversarial training, in: NIPS Workshop on Adversarial Training, Vol. 21, 2016, pp. 21–32.
[180] S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, 2015, arXiv preprint arXiv:1506.03099.
[181] L. Yu, W. Zhang, J. Wang, Y. Yu, Seqgan: Sequence generative adversarial nets with policy gradient, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.
[182] C.B. Browne, E. Powley, D. Whitehouse, S.M. Lucas, P.I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, S. Colton, A survey of Monte Carlo tree search methods, IEEE Trans. Comput. Intell. AI Games 4 (1) (2012) 1–43.
[183] L. Floridi, M. Chiriatti, GPT-3: Its nature, scope, limits, and consequences, Minds Mach. 30 (4) (2020) 681–694.
[184] N.-T. Tran, V.-H. Tran, N.-B. Nguyen, T.-K. Nguyen, N.-M. Cheung, On data augmentation for GAN training, IEEE Trans. Image Process. 30 (2021) 1882–1897.
[185] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, Synthetic data augmentation using GAN for improved liver lesion classification, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, pp. 289–293.
[186] D. Kiyasseh, G.A. Tadesse, L. Thwaites, T. Zhu, D. Clifton, et al., Plethaugment: Gan-based ppg augmentation for medical diagnosis in low-resource settings, IEEE J. Biomed. Health Inf. 24 (11) (2020) 3226–3235.
[187] C. Qi, J. Chen, G. Xu, Z. Xu, T. Lukasiewicz, Y. Liu, SAG-GAN: Semi-supervised attention-guided GANs for data augmentation on medical images, 2020, arXiv preprint arXiv:2011.07534.
[188] M. Hammami, D. Friboulet, R. Kechichian, Cycle GAN-based data augmentation for multi-organ detection in CT images via yolo, in: 2020 IEEE International Conference on Image Processing, ICIP, IEEE, 2020, pp. 390–393.
[189] A. Graves, G. Wayne, I. Danihelka, Neural turing machines, 2014, arXiv preprint arXiv:1410.5401.
[190] P. Guo, P. Wang, J. Zhou, V.M. Patel, S. Jiang, Lesion mask-based simultaneous synthesis of anatomic and molecular mr images using a gan, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 104–113.
[191] T.C. Mok, A. Chung, Learning data augmentation for brain tumor segmentation with coarse-to-fine generative adversarial networks, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 70–80.
[192] H. Uzunova, J. Ehrhardt, H. Handels, Generation of annotated brain tumor MRIs with tumor-induced tissue deformations for training and assessment of neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 501–511.
[193] A. Segato, V. Corbetta, M. Di Marzo, L. Pozzi, E. De Momi, Data augmentation of 3D brain environment using deep convolutional refined auto-encoding alpha GAN, IEEE Trans. Med. Robot. Bionics 3 (1) (2020) 269–272.
[194] T. Kossen, P. Subramaniam, V.I. Madai, A. Hennemuth, K. Hildebrand, A. Hilbert, J. Sobesky, M. Livne, I. Galinovic, A.A. Khalil, et al., Synthesizing anonymized and labeled TOF-MRA patches for brain vessel segmentation using generative adversarial networks, Comput. Biol. Med. 131 (2021) 104254.
[195] T. Xia, A. Chartsias, C. Wang, S.A. Tsaftaris, A.D.N. Initiative, et al., Learning to synthesise the ageing brain without longitudinal data, Med. Image Anal. 73 (2021) 102169.
[196] Y. Chen, X.-H. Yang, Z. Wei, A.A. Heidari, N. Zheng, Z. Li, H. Chen, H. Hu, Q. Zhou, Q. Guan, Generative adversarial networks in medical image augmentation: a review, Comput. Biol. Med. (2022) 105382.
[197] M. Li, G. Zhou, A. Chen, J. Yi, C. Lu, M. He, Y. Hu, FWDGAN-based data augmentation for tomato leaf disease identification, Comput. Electron. Agric. 194 (2022) 106779.
[198] M. Xu, S. Yoon, A. Fuentes, J. Yang, D.S. Park, Style-consistent image translation: A novel data augmentation paradigm to improve plant disease recognition, Front. Plant Sci. 12 (2021) 773142.
[199] H. Jin, Y. Li, J. Qi, J. Feng, D. Tian, W. Mu, GrapeGAN: Unsupervised image enhancement for improved grape leaf disease recognition, Comput. Electron. Agric. 198 (2022) 107055.
[200] Y. Jing, Y. Bian, Z. Hu, L. Wang, X.-Q.S. Xie, Deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era, AAPS J. 20 (3) (2018) 1–10.
[201] D. Dana, S.V. Gadhiya, L.G. St. Surin, D. Li, F. Naaz, Q. Ali, L. Paka, M.A. Yamin, M. Narayan, I.D. Goldberg, et al., Deep learning in drug discovery and medicine; scratching the surface, Molecules 23 (9) (2018) 2384.
[202] A. Kadurin, A. Aliper, A. Kazennov, P. Mamoshina, Q. Vanhaelen, K. Khrabrov, A. Zhavoronkov, The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology, Oncotarget 8 (7) (2017) 10883.
[203] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, A. Zhavoronkov, druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico, Mol. Pharmaceut. 14 (9) (2017) 3098–3104.
[204] G.R. Padalkar, S.D. Patil, M.M. Hegadi, N.K. Jaybhaye, Drug discovery using generative adversarial network with reinforcement learning, in: 2021 International Conference on Computer Communication and Informatics, ICCCI, IEEE, 2021, pp. 1–3.
[205] D. Manu, Y. Sheng, J. Yang, J. Deng, T. Geng, A. Li, C. Ding, W. Jiang, L. Yang, FL-DISCO: Federated generative adversarial network for graph-based molecule drug discovery: Special session paper, in: 2021 IEEE/ACM International Conference on Computer Aided Design, ICCAD, IEEE, 2021, pp. 1–7.
[206] J. Konečný, H.B. McMahan, F.X. Yu, P. Richtárik, A.T. Suresh, D. Bacon, Federated learning: Strategies for improving communication efficiency, 2016, arXiv preprint arXiv:1610.05492.
[207] P. Dhariwal, A. Nichol, Diffusion models beat GANs on image synthesis, Adv. Neural Inf. Process. Syst. 34 (2021) 8780–8794.
[208] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Adv. Neural Inf. Process. Syst. 33 (2020) 6840–6851.
[209] Y. Song, S. Ermon, Generative modeling by estimating gradients of the data distribution, Adv. Neural Inf. Process. Syst. 32 (2019).
[210] F.-A. Croitoru, V. Hondru, R.T. Ionescu, M. Shah, Diffusion models in vision: A survey, 2022, arXiv preprint arXiv:2209.04747.
[211] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, M. Norouzi, Palette: Image-to-image diffusion models, in: ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
[212] Y. Jiang, S. Chang, Z. Wang, Transgan: Two transformers can make one strong gan, 2021, arXiv preprint arXiv:2102.07074.
[213] Z. Lv, X. Huang, W. Cao, An improved GAN with transformers for pedestrian trajectory prediction models, Int. J. Intell. Syst. 37 (8) (2022) 4417–4436.