Image Generation Through Feature Extraction and Learning Using A Deep Learning Approach
I would like to take the opportunity here to thank the people that made my mas-
ter thesis possible. First and foremost, I would like to thank both of my thesis
supervisors, Prof. Dr. Welf Löwe and Dag Björnberg. I have had the pleasure of
participating in many of Prof. Dr. Löwe’s machine learning courses over my master
studies, which got me inspired and interested in the field of AI in the first place.
Over all my years of higher education studies, I always struggled to find the direction I really wanted to pursue; these courses and the many conversations with Prof. Dr. Welf Löwe completely changed that. I have also been very fortunate to work together with Dag for the past year at Softwerk AB and to learn a great deal. Thank you for guiding me over the past months with much expertise, critical feedback and many suggestions, which made my thesis what it is today.
I would also like to thank my thesis reader, Dr. Rafael Messias Martin, for the suggestions and discussions, including great ideas for future work following up on this thesis, and Dr. Tobias Ohlsson for his guidance throughout the course.
On top of this, I would like to thank my partner for reading and reviewing my thesis extensively, and for pushing me further throughout my studies and goals. I would like to thank my parents for making my studies and life in Sweden possible in the first place; I would not be here without their support.
Finally, I would like to thank Björn Lundsten for making this thesis with Softwerk AB possible and for the many opportunities I have received while working at Softwerk AB during my master's degree studies. I would also like to thank my colleagues at Softwerk AB for making it such a pleasant work environment and team to be part of. Thank you all! And thank you for reading my thesis; I hope you enjoy it and learn something along the way.
Tibo Bruneel
May 31, 2023
Contents
1 Introduction
1.1 Background
1.2 Motivations
1.3 Problem Statement
1.3.1 Problem and solution proposal
1.3.2 Research questions
1.4 This Thesis Report
1.5 Contributions
1.6 Target groups
1.7 Ethical Considerations
1.8 Report Structure
2 Background
2.1 Deep Learning
2.1.1 Neural Networks
2.1.2 Training
2.1.3 Backpropagation
2.1.4 Layers
2.2 Generative Artificial Intelligence
2.3 Variational Autoencoders
2.3.1 Introduction to Autoencoders
2.3.2 Variational Autoencoder Architecture
2.3.3 Loss function of a VAE architecture
2.3.4 Learning and optimisation of the ELBO
2.3.5 Reparameterisation trick
2.3.6 Variants
2.4 Generative Adversarial Networks
2.4.1 Model
2.4.2 Loss Function
2.5 Flow-based Generative Models
2.5.1 Model
2.5.2 Loss Function
2.6 Latent Sampling
2.7 Triplet Loss and Generation
3 Method
3.1 Scientific approach
3.2 Method description
3.2.1 Literature Review
3.2.2 Controlled Experiment
3.3 Reliability
3.4 Validity
3.4.1 Construct Validity
3.4.2 Internal Validity
3.4.3 External Validity
4 Literature Review
4.1 Protocol
4.1.1 Literature review research goals/questions
4.1.2 Search procedure of primary studies
4.1.3 Inclusion and exclusion criteria
4.1.4 Data Extraction
4.2 Variational Autoencoders / VAE Findings
4.2.1 Variants
4.2.2 Hybrid variants
4.2.3 Disentanglement
4.3 Generative Adversarial Networks / GAN Findings
4.3.1 Variants
4.3.2 Hybrid variants
4.3.3 Applications
4.4 Flow-based Generative Models Findings
4.4.1 Architectures and variants
4.4.2 Applications
4.5 Comparison Findings
6 Controlled Experiment
6.1 VAE Models
6.1.1 Results - 1K data set trainings
6.1.2 Results - 20K data set trainings
6.1.3 Results - 100K data set trainings
6.2 Triplet Sampling
6.3 Recognition Models
6.4 Experiment Validity
7 Discussions
References
A Appendix 1: Extension on GAN and Flow-based Models
A.1 Generative Adversarial Networks
A.1.1 Losses
A.1.2 Variants
A.2 Flow-based models
A.2.1 Variants
1.1 Background
Image generation is the process of artificially generating non-existent images, based on a model that has learned how to generate them through its objective and the given data. The model attempts to learn a feature distribution that resembles the real-world data distribution as closely as possible. A feature can be seen as a certain aspect of an object in an image that is extracted, in the case of generative artificial intelligence (AI). Some general examples of features are shape, colour, or curvature; features can range from general to highly detailed. Learning a distribution, furthermore, concerns how such a feature is spread over a full set of data. These distributions are of high importance for learning features and for understanding how the data behaves and evolves. By extracting features and learning their distribution over the full data sample, it then becomes interesting to generate new, non-existing data with these models.
Generative models attempt to learn how to generate artificial images based on
the training data, by minimising the difference between the real world data distri-
bution and the generated/reconstructed images of the model. Different generative
models optimise in different ways, but the end goal of all the various state-of-the-
art models is to generate as accurately as possible. As almost all problems are not closed-form, and data is usually only a sample of the full possible data set and real world, it is hard to learn the underlying distribution exactly. This is why generative models attempt to learn it as well as possible; in practice, it will always remain an approximation of the real world.
Image generation has many useful applications and has proven to be quite im-
portant in different use cases, with some of the most common ones being data ex-
tension, data augmentation, anomaly detection, and text-to-image generation. In all
the above mentioned use cases, images are used as model input, output, or both.
In data extension, artificial images can be used to extend the real images in a data
set, which is especially interesting in cases where the data set is rather small and
insufficient for using it in other algorithms. In data augmentation, it can be useful to
extend the existing data set to generalise any model better and to prevent any over-
fitting on real data. Although extension and augmentation are very similar in their use and goals, in data extension the focus is purely on extending the data set with more data, while in data augmentation the focus is on augmenting the real data with generated versions, introducing variants with augmentations and commonly some noise. Anomaly detection is the task of finding abnormal observations in data, based on knowledge of what is considered normal behaviour. Sometimes the exact anomaly is known, which makes detection and training easier. Detecting these outlier observations and patterns in the data is interesting and useful for many reasons and use cases. Outlier detection is commonly solved manually, so an automated solution can have a major impact on businesses in terms of finances and time, but also on people and the environment. Image generation can prove very useful for detecting unexpected behaviour, as the generative model performs differently on anomalies, since it has learned how the features are distributed under normal behaviour [1, 2]. Finally, text-to-image generation has become quite a hot topic over the past years within the field of AI, with generators such as DALL-E [3], DALL-E-2 [4] and Stable Diffusion [5], where the goal of these models is to generate an artificial image based on a given textual caption, i.e. a prompt. These have lately become widely applicable in combination with large language models such as GPT-4 [6]. The automatic generation of new artificial data that resembles real data has been found useful for AI applied in many different sectors, mainly production, analytics, security, and medical environments.
Within the deep learning space for image generation in computer vision, there
are a few leading approaches. These include Variational Autoencoders (VAE) [7], which are based on the concept of Autoencoders [8], Generative Adversarial Networks (GAN) [9], and Flow-based generative models [10].
1.2 Motivations
There has been a lot of active research in feature extraction/learning, and image
generation in the past decades. A lot of approaches have been very successful in
their own use cases, especially in deep learning with different variational autoen-
coders [7, 11, 12, 13], various generative adversarial networks [9, 14, 15, 16], and
flow-based generative models [10, 17, 18]. These architectures and algorithms have
been achieving remarkable results in generating artificial images. This generation
process and its results can contribute to significantly improving other fields within
AI and other industries. Within anomaly detection the automatic unsupervised de-
tection of anomalies can reduce a lot of costs and time in production environments,
assist in medical and security environments, and in various analytics use cases like
network analysis [1]. Additionally, it can also assist in improving and regularising
existing models through data augmentation, as well as by extending data sets that are small and where data collection is expensive in terms of finances and time. It can also contribute heavily to approaches like text-to-image generation or other similar content-to-image methodologies.
The motivation for this thesis project is to construct a pipeline using a generative
model, with a sampling algorithm, to potentially improve a recognition model. On
top of this, the aim is to also construct a variant methodology in the form of a neural
network architecture, similar to the existing approaches. The aim of the variant and
the difference with the base approach is on learning the data more effectively and
accurately in an unsupervised way, through more targeted learning on individual
features. A forestry-related project within Softwerk AB on generating artificial tree log ends is a good application to test the generative model approach and the variant on, in order to further assess their usability and potential within the AI pipeline in the project. Softwerk AB is a software development company based in the south
Swedish city of Växjö. Softwerk AB offers a wide range of technical expertise across a large variety of projects, such as software development, machine learning, system development and others. The proposed architecture and methods can contribute to the field through insight into how effective and applicable they are. If successful, the proposed variant approach provides better results and learning than the standard generative approach, and can therefore improve the application through this architecture. Furthermore, it could contribute within the computer vision field, and impact the
industry and society by making it applicable to different use cases. On top of that,
if either the variant or standard generative model approach works well in the image
synthesis within the forestry application, artificially generated images can be used
in order to improve other models related to the same application. It can therefore
also impact the industry application and related research within the field of wood
processing and forestry.
The proposed variant is a network that is based on the autoencoder and varia-
tional autoencoder architectures by maintaining the encoder and decoder networks
in the model. The vanilla variational autoencoder architecture, with encoder, latent space with latent distributions, and decoder, is visualised in Figure 1.1 above. The difference between the architecture proposed within this thesis project and the standard vanilla VAE is that the proposed architecture learns a one-dimensional latent space for each of the D individual features separately, instead of learning one latent space over all the extracted features at the same time. In this architecture, an extra set of hidden layers is added between the latent space training and the encoder network for the individual features. This is the key difference between the vanilla VAE architecture and the proposed VAE. The goal of applying this architecture is to experiment whether such an architecture, with an extra sub-network for every feature and separated latent spaces, can learn individual data distributions better than standard vanilla VAE architectures, and furthermore, with these sub-network latent dimensions, generate better, more creative and feature-rich images when sampling the latent space. In Figure 1.2, the individual latent dimensions with their own sub-networks are visualised, which clearly shows the difference between this architecture and the vanilla architecture in Figure 1.1. The main final goal of the architecture is to learn the individual feature-specific Gaussian distributions better from the images fed into the model, which in turn is interconnected with the quality of image generation and reconstruction, as better distributions lead to better outputs when sampling the latent spaces. By applying both the standard VAE architecture and the proposed variant to an existing problem and data set, the quality of the methodologies can be thoroughly tested and measured, while also being compared.
The application for this thesis project is a forestry application within Softwerk AB, which concerns extracting information from images of cross-sections of tree logs. Information extraction is done through different models, such as log recognition. The generation of artificial images of tree log ends with high quality and high resemblance to real log ends can significantly reduce the need for real data, which can save major costs and time, as data collection is often a tough, expensive and long process. It is especially interesting if the real data can be fully replaced by generated data in different models, rather than being used merely for data augmentation or data extension. The focus of this thesis project and its problem statement is, next to a literature review on generative models, to apply the above VAE architectures to the generation of artificial images of tree log ends and to optimise these architectures. The goal is then to sample artificial log ends further down the pipeline, using a proposed triplet sampling algorithm on the learned latent spaces, and to use these images to train a tree log end recognition model. In this way, the fit of VAE architectures to this problem can be measured, and with success, other information extraction models can be improved through training on artificial data. Beyond the generative ability, the experiment can significantly benefit other problems in future work and research.
1.3.2 Research questions
With the above problem statement, this research aims to answer the following
three research questions in Table 1.1 as accurately as possible:
RQ1: What generative variant can be constructed with the goal of learning individual data distributions better than the standard approach?
RQ2: How does the applied generative variant and base model compare to other generative deep learning approaches in image synthesis?
RQ3: How can the proposed variant and base model contribute to triplet generation, for improving a recognition model on artificial data?
Table 1.1: Thesis research questions
1.5 Contributions
This thesis project contributes to the field of log end image generation and recognition, and especially to the field of generative models with deep learning. It presents a variant VAE approach for improving the latent space and a triplet sampling algorithm on the VAE latent space, all of this applied to the generation of artificial images of tree log ends. To the best of my knowledge and findings, these proposed approaches are novel, and their results therefore provide an idea of their capabilities within the field.
Figure 1.3: List of different types of risks associated with deepfakes. [19]
Both of the ethical considerations mentioned above concern the outputs of generative models, but the intentions are different. With output discrimination, there might not be a bad intention behind it; with misinformation, however, there is a bad intention behind the usage of such deep generative models. Both of these risks are concerning, and we recognise that there is a high need for AI safety and for methodologies recognising and mitigating these problems as much as possible, especially in the future as these generative models become stronger and better. There will be harmful sides to any technology, and this is something that should always be discussed and addressed in research. We recognise that the good sides do outweigh the bad, as the power of these models could simplify many problems facing humanity in the future, and with this, steps can be taken towards a better world where AI plays a large role in automation.
2 Background
In this section the goal is to provide an understanding of the state of the art, in
order to advance to the scientific approaches and practical implementations of this
thesis project. Section 2.1 introduces the basics of deep learning, while Section 2.2 gives an introduction to the field of generative artificial intelligence. Afterwards, the most well-known and widely used deep generative models are presented: Variational Autoencoders in Section 2.3, Generative Adversarial Networks in Section 2.4, and Flow-based Generative Models in Section 2.5. Next, Section 2.6 gives an introduction to latent sampling. Finally, Section 2.7 delves into the concept of Triplet Loss and Generation.
of hidden layers is to learn a certain representation and structure in the data
by tuning the parameter values of the neurons and layers. The parameters
are learned during the training of the neural network. The reason these layers are called "hidden" lies in their goal: they estimate and learn a structure without knowing the function or distribution of the data itself. The outputs of these layers are often difficult to interpret and hard to correlate with a real meaning or output data type. Therefore, neural networks also heavily suffer from explainability issues.
• Output layer
The output layer is the final layer of the neural network architecture and is
used for the final processing of the output, coming from the final hidden layer.
The output layer processes and returns an output that is interpretable for a hu-
man to use. This can be a numeric value, a set of probabilities, an image or
even another output, depending on what is set and on the problem at hand.
y = f(w^{T} h + b) \qquad (1)
Where y is the output, f is the activation function, w is the weight vector, h is
the input, and b is the bias parameter.
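To make Equation 1 concrete, the following is a minimal sketch (not code from the thesis) of how a single dense unit can be evaluated with NumPy; the values and the names `weights`, `bias` and `relu` are purely illustrative.

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x), applied element-wise
    return np.maximum(0.0, x)

# Illustrative values for a unit with three inputs
h = np.array([0.5, -1.2, 3.0])        # input vector from the previous layer
weights = np.array([0.8, 0.1, -0.4])  # learnable weight vector w
bias = 0.2                            # learnable bias b

# y = f(w^T h + b), here with f = ReLU
y = relu(np.dot(weights, h) + bias)
print(y)
```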
• Weight
The weight parameter of a neuron is a learnable value representing the strength of a connection between two neurons, and therefore how much one neuron's output should weigh in the next. The weights a certain neuron receives determine its own value and affect the output. This parameter is used for computing a weighted sum of inputs.
• Bias
The bias parameter is a single learnable value that each neuron has; it is used for adding a bias to the weighted sum of inputs.
• Activation
The activation is a function that defines the final output of a neuron by in-
putting the weighted sum of inputs plus bias through a predefined mathemat-
ical function. The activation is used to add non-linearity to the neurons, in
order to learn more complex structures and patterns in the data, which would
not be possible while being restricted to linearity. Common activation func-
tions are ReLU (Rectified Linear Unit), Sigmoid, and Tanh.
Weights and biases are usually randomly initialised, often through a Uniform
or Gaussian initialiser, or are zero-initialised. The initialisation depends on the
network and layers as well. The activation function is commonly the same for the whole layer and typically predefined by the researcher; otherwise, a default or no activation function is used.
2.1.2 Training
With the knowledge of layers, parameters and activations, the next step is the training of a neural network architecture. For the training of neural networks, a technique called stochastic gradient descent is used, as in many other machine learning methods.
Cost Function
In order to evaluate the quality of the model, and to compute the difference between the approximations made by the neural network and the actual output it should approximate as closely as possible, a cost function is necessary. The choice of cost function highly impacts the learning and is therefore an important choice to make when training networks. There are many types of cost functions, where the goal is either to minimise the cost, as with error cost functions like the MSE (Mean Squared Error) and the MAE (Mean Absolute Error), or to maximise it, as with an accuracy or correctness function. The choice is very dependent on the type of network and the problem statement or use case. A combination of different cost functions in one function is also common, where there is some adversarial or multi-objective learning goal. A combination of cost functions can also be used as regularisation of the training, to prevent overfitting, a possible occurrence during training where the network learns to approximate the training data so well that it no longer generalises to unseen data. This is why validation and test data sets are also used. The cost function is used in the gradient descent algorithm, which descends the gradient in order to achieve a better cost.
Gradient-Based Optimisation
Gradient descent is an optimisation technique that, based on the cost function, descends the gradient to minimise or maximise the cost. The technique is used to iteratively adjust the parameters of the network, in other words the weights and biases, in the direction of the steepest descent using a certain learning rate. The learning rate is a step size used to descend the gradient. It is a parameter that allows taking smaller steps down the gradient in order to avoid overshooting, but it also allows taking large enough steps to not learn too slowly or end up stuck in local minima/maxima. The gradient is computed using the partial derivatives.
The formula for gradient descent (Equation 2) is expressed as:
θ∗ = θ − α · ∇J(θ) (2)
Where θ∗ represents the next set of parameters, θ represents the current set of
parameters, α represents the learning rate, and ∇J(θ) represents the gradient of the
cost function J over the current parameter set. Therefore, by iteratively computing
the gradient and cost functions, the parameters θ of the neural network are adjusted
during training. To mitigate problems occurring during the gradient descent, many
optimisation techniques [20] have been proposed, each with their own specifics.
The most common ones are Adam [21], AdaGrad [22], and RMSProp. For training neural networks, one would therefore typically opt for one of these optimisation techniques. This is nonetheless, together with the learning rate, quite an important decision to make, as the adjustment of either the learning rate or the optimisation technique highly affects the learning. One can, however, rely on the default settings used in other networks. In Figure 2.5, the gradients of four different networks and their loss functions are visualised. This illustrates that gradient descent is not always as straightforward as a convex gradient and that gradient-based optimisation can easily end up in a local minimum.
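As a small illustration of the update rule in Equation 2 (a sketch, not the thesis implementation), the following applies plain gradient descent to a simple convex cost; the function names and values are hypothetical.

```python
import numpy as np

def cost(theta):
    # A simple convex cost, J(theta) = ||theta - 3||^2, used only for illustration
    return np.sum((theta - 3.0) ** 2)

def gradient(theta):
    # Analytical gradient of the cost above: dJ/dtheta = 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)      # initial parameters
learning_rate = 0.1      # step size alpha

for step in range(100):
    # theta* = theta - alpha * grad J(theta)
    theta = theta - learning_rate * gradient(theta)

print(theta, cost(theta))  # theta approaches [3, 3], the cost approaches 0
```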
Figure 2.5: Gradients of the loss functions of different neural network architectures.
Figure obtained from: https://www.cs.umd.edu/ tomg/projects/landscapes/
Training Batches
Gradient descent is usually applied using a batch version of the algorithm, meaning that subsets, and thus smaller samples of the full training data, are used for descending the gradient instead of the full data set. In every epoch, gradient descent is applied for TrainingDataSize / BatchSize steps. A batch version of gradient descent is mainly applied for computational efficiency, with the trade-off of a less accurate gradient that is less representative of the full training data. An example of this is the use of large image inputs in a neural network: without any batch gradient descent, the approach would run out of memory very quickly, while with a certain batch size it becomes computationally feasible. It is therefore of high importance to adjust the batch size hyperparameter accordingly when training neural networks. A very low batch size gives gradients that generalise less well to the full training data, while too high a batch size runs into computational problems.
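The number of steps per epoch mentioned above can be sketched as follows (illustrative only; the data shapes are made up, not taken from the thesis):

```python
import numpy as np

training_data = np.random.rand(1000, 64, 64, 3)  # e.g. 1000 images of 64x64x3
batch_size = 32

# Steps per epoch: TrainingDataSize / BatchSize (rounded up for a partial last batch)
steps_per_epoch = int(np.ceil(len(training_data) / batch_size))

for step in range(steps_per_epoch):
    batch = training_data[step * batch_size:(step + 1) * batch_size]
    # one gradient descent update would be computed on `batch` here

print(steps_per_epoch)  # 32 steps for 1000 samples with batch size 32
```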
2.1.3 Backpropagation
Backpropagation is an essential part of deep learning and having a notion of it
is important for understanding what happens during the training of architectures.
Backpropagation is an algorithm used for training a network together with an opti-
misation algorithm such as gradient descent. The algorithm is used for propagating
backwards through the architecture, i.e. from output to input, to compute the gradi-
ent (derivatives) based on the cost function for a single training example, using the
current weights of the network. The algorithm outputs the gradient for every weight
in the network and defines how a single training example would want to adjust the
parameter in order to minimise the cost function. Doing this for a whole batch, the average of these gradients can then be used by the gradient descent algorithm to descend towards the lowest error, and thus the optimum, with a certain learning rate. The algorithm thereby allows backward propagation through the network and, through that, the computation of the relative changes to the learnable parameters that decrease the error, via the gradients.
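As a minimal, self-contained sketch of what this gradient computation produces (not the thesis code), the following derives the gradient of an MSE cost for a single linear layer by hand and applies gradient descent updates; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # a batch of 8 training examples with 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # targets for this toy regression problem

w = np.zeros(3)                      # learnable weights of a single linear "layer"
learning_rate = 0.1

for _ in range(200):
    y_hat = X @ w                            # forward pass
    error = y_hat - y                        # difference to the target
    # Backward pass: gradient of MSE = mean((y_hat - y)^2) with respect to w
    grad_w = 2.0 * X.T @ error / len(X)
    w = w - learning_rate * grad_w           # gradient descent update

print(w)  # converges towards [1.0, -2.0, 0.5]
```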
2.1.4 Layers
There are many different types of layers available for neural networks. Below
only the layer types relevant for the implementation of this thesis project are de-
scribed.
Dense layers
A dense layer, also called a densely connected layer, is a type of layer where each unit/neuron
in the layer is connected to every unit/neuron in the previous layer of the network. It
is one of the more common deep learning layers, and can be used in many use cases.
Flatten layers
A flatten layer is a type of layer where a multidimensional input is flattened into a single-dimensional output, in other words one larger vector. This layer is often used between convolutional layers and dense layers to handle the differences in output/input shape.
Convolutional layers
A convolutional layer is a type of layer that is quite different from the dense layer. It is applied in the Convolutional Neural Network (CNN), which is used for computer vision applications such as image or time series problems, as it is well suited for learning and detecting features due to its nature. A convolutional layer and network is based on the idea of convolution itself and has a more grid-like topology. Convolution is the mathematical operation where a new function is computed based on two given functions; the computation itself is defined as the integral of the product of the first function and the reflected and shifted second function. In image processing and deep learning, the data is seen as the first function and a grid-like kernel of size n x m as the second function. In classical image processing, such as blurring or edge detection, the kernel shifted over the data contains hard-coded values in its cells, while in deep learning this kernel is learned during training. There are a few main parameters on convolutional layers (a small layer sketch follows after this list):
• Kernel
The kernel parameter is equal to the size of the grid-like kernel to be learned. The dimensions of this kernel are dependent on the type of layer and the input data; for images this would commonly be two-dimensional. In the case of multi-dimensional kernels, the sizes are often equal and thus square. Common kernel sizes are 3x3 and 5x5. Odd-sized kernels are also preferred over even-sized kernels, because they have an exact centre cell.
• Filters
The filters parameter is equal to the number of output channels of the convolutional layer; it thus represents how many filters or kernels one wants to learn in a single layer.
• Strides
The strides parameter represents the number of steps the kernel takes when shifting over the data. By default this is equal to one, and the kernel filter then moves one cell at a time; if a higher stride is set, the kernel skips grid cells. This behaviour can be preferable in certain use cases, such as very high-dimensional data. Although it can speed up computation and learning, it comes with a trade-off in accuracy and detail.
• Padding
When a kernel moves over an image, for example, the least information is captured at the borders, as only the edges of the kernel pass over them, while towards the centre the kernel moves over every cell completely. This means information can be lost at the borders of images. A padding parameter allows proper detection at the borders to be maintained and the output size to be controlled, as the sides of the images can be padded. Usually those padding borders are filled with zeros.
• Activation Function
The activation parameter can also be chosen to further process the output filters. ReLU is a common activation function for convolutional layers, with Sigmoid and Tanh being popular choices for the final convolutional layers in CNNs.
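To illustrate how the layer types and parameters above fit together, here is a small, hypothetical Keras model (a sketch, not the architecture used in this thesis) combining convolutional, flatten and dense layers:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A small illustrative CNN; the layer sizes are arbitrary examples
model = models.Sequential([
    # 16 filters, 3x3 kernel, stride 1, zero padding to keep the spatial size, ReLU activation
    layers.Conv2D(16, kernel_size=(3, 3), strides=1, padding="same",
                  activation="relu", input_shape=(64, 64, 3)),
    # 32 filters with stride 2, which halves the spatial resolution
    layers.Conv2D(32, kernel_size=(3, 3), strides=2, padding="same", activation="relu"),
    # Flatten the multidimensional feature maps into one vector
    layers.Flatten(),
    # Densely connected layers for the final processing of the output
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.summary()
```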
focus of this thesis project is on image synthesis based on image training data, where the aim is to learn an underlying distribution from images and to be able to translate and sample from this learned distribution. The main deep generative models within this field are variational autoencoders, generative adversarial networks, and deep flow-based generative models.
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \qquad (3)
The hidden layer in a neural network which represents the low-dimensional rep-
resentation of size p, between the encoder and decoder, is commonly called the
bottleneck. In both the encoder and decoder, multiple hidden layers are sequenced and connected with this bottleneck. In the encoder, the layers sequentially output smaller and smaller representations. In the decoder, the reverse happens: the output representations of the layers sequentially grow larger, until the output of size n is reached again. During training, the loss function is used to tune the weights and biases of these layers, in an attempt to learn a model that minimises the loss.
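As a sketch of the encoder-bottleneck-decoder structure described above (hypothetical sizes, not the thesis model), a plain autoencoder trained with the MSE reconstruction loss of Equation 3 could look as follows in Keras:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

n = 784   # input size (e.g. a flattened 28x28 image)
p = 32    # size of the low-dimensional bottleneck

autoencoder = models.Sequential([
    # Encoder: layer sizes shrink towards the bottleneck of size p
    layers.Dense(256, activation="relu", input_shape=(n,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(p, activation="relu", name="bottleneck"),
    # Decoder: layer sizes grow back towards the original size n
    layers.Dense(64, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(n, activation="sigmoid"),
])

# The reconstruction is compared to the input with the MSE loss
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=32)  # the input is also the target
```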
The autoencoder has been applied to different problems in various fields, espe-
cially in the field of optimisation, recognition and anomaly detection. In optimisa-
tion, the network can be useful for denoising data, e.g. denoising of medical images
[24] or blind denoising [25], or optimisations such as data enhancement, e.g. a coupled deep autoencoder for enhancing image resolution [26]. Recognition and anomaly detection have also proven to be useful applications of deep autoencoders [2] [27] [28].
There is more to it than just a rather simple architecture: learning the true underlying distribution of a data set is a problem that cannot easily be solved, especially with a VAE architecture due to its nature. The goal of a VAE model is to learn a proxy distribution which is an approximation of the posterior distribution, by maximising a lower bound on the log likelihood. The learning and specifics of the VAE will be further discussed in this section, without going into all the details, especially regarding the mathematics and fine-grained specifics, as the focus here is on gaining an understanding of the architecture itself. For more details, the following references are suggested. Kingma and Welling [29] present an extensive and detailed introduction to variational autoencoders. Cinelli
[30] presents a book on variational methods for machine learning with applications
on deep networks, which covers extensive details on learning a variational method
such as the VAE model.
Maximum log-likelihood
The most common metric used for probabilistic models is the log-likelihood; it is an important way of evaluating the quality of a probabilistic model and how well the model represents the observed data. The parameters θ of a probabilistic model can thus be optimised by maximising the log-likelihood, which is equivalent to minimising the Kullback-Leibler divergence between the observed data and the model distributions. Maximum likelihood estimation (MLE) is usually optimised in deep networks through the technique of batch gradient descent. MLE (Equation 4) is therefore the estimation of the parameter set θ̂ through the maximisation of the log-likelihood function L_n(θ; y) of the model distributions and the observed data, where θ ∈ Θ and Θ represents the parameter space.
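For concreteness, the resulting estimator can be written in the following standard form, a sketch using the notation above:

\hat{\theta} = \underset{\theta \in \Theta}{\arg\max} \; L_n(\theta;\, y)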
Kullback-Leibler Divergence
The KL-Divergence, short for Kullback-Leibler Divergence, is a commonly used
measure for determining how different a certain probability distribution P is from
another probability distribution Q. For discrete distributions the measure (Equation
6) is given as:
D_{KL}(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \qquad (6)
For continuous distributions, the measure is transformed to instead use integrals
(Equation 7):
D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx \qquad (7)
Simply put, the KL divergence is the expectation of the log difference between P and Q, using the probabilities of P. It is also important to note that the Kullback-Leibler divergence measures a kind of distance, but is not considered a distance metric. This is because it does not meet the symmetry requirement of distance metrics, as switching P and Q results in a different value.
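A short numerical sketch of Equation 6 (with made-up distributions, purely for illustration) also shows the asymmetry mentioned above:

```python
import numpy as np

def kl_divergence(p, q):
    # Discrete KL divergence: sum over x of P(x) * log(P(x) / Q(x))
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

P = [0.6, 0.3, 0.1]
Q = [0.4, 0.4, 0.2]

print(kl_divergence(P, Q))  # D_KL(P || Q)
print(kl_divergence(Q, P))  # D_KL(Q || P) gives a different value: KL is not symmetric
```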
Figure 2.8: The process of VAE learning. In the encoder an approximation qϕ (z|x)
of the intractable posterior distribution is optimised through the learning of distri-
bution parameters in the hidden layers of the deep neural network. This results
in a prior distribution pθ (z) in the latent space or z-space. The stochastic decoder
pθ (x|z) further reconstructs by sampling from the z-space. Figure obtained from
[29].
using large data or images. Usually a convergence rule is set to stop the learning when it has reached predefined criteria. The ELBO over a data set is the sum or average, depending on the implementation, of the individual ELBO functions over the data points in the data set. The individual ELBO over a data point x_i in a data set X is usually intractable; however, good unbiased gradients can be obtained. In the case of discrete distributions, this is easier to compute. In continuous latent spaces, this is not as straightforward; the parameters can then be computed through the reparameterisation trick. The reparameterisation trick is a methodology that allows the network to backpropagate while still sampling. Lastly, it is important to note on the learning of the latent dimensions that usually a multivariate Gaussian model with diagonal covariance is applied, although other distribution models can also be used and have been, depending on the application.
In the case of a Gaussian model, the network aims to learn two parameters per latent
dimension: the mean µ vector and the log variance log(σ 2 ) vector. The learning of
the VAE is visualised in Figure 2.8.
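A minimal sketch of the reparameterisation trick for such a diagonal Gaussian latent space (assumed shapes and values, not the thesis code): the randomness is moved into an auxiliary noise variable so that gradients can flow through the learned mean and log variance.

```python
import numpy as np

def sample_latent(mu, log_var, rng=np.random.default_rng()):
    # z = mu + sigma * epsilon, with epsilon ~ N(0, I)
    # sigma is recovered from the learned log variance: sigma = exp(0.5 * log_var)
    epsilon = rng.standard_normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * epsilon

# Illustrative encoder outputs for a 4-dimensional latent space
mu = np.array([0.0, 1.0, -0.5, 2.0])
log_var = np.array([0.0, -1.0, 0.5, -2.0])

z = sample_latent(mu, log_var)
print(z)  # a sample from N(mu, diag(exp(log_var)))
```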
2.3.6 Variants
Over the years of research, many variants have been proposed and introduced
in the field of variational autoencoders. Some of the more commonly used and
impactful variants are discussed below, as found in the search for related studies and the literature review. Overall, many variants also have their own variants, introduced to improve their architecture, loss functions, learning, or other identified weaknesses and issues.
instead of continuous, allows for more structured and interpretable representations, and furthermore easier sampling for synthesis. The Vector Quantised-Variational Autoencoder also mitigates the issue of mode collapse, which can appear with the base architecture. Because the model learns discrete spaces, the VQ-VAE architecture works particularly well on data with more structure and a discrete nature [11] [12].
Conditional VAE
The conditional VAE [13] architecture is another variant of the VAE that is impor-
tant to mention, as this extension has opened up new applications within the field of the VAE. The base variational autoencoder does not have any conditional
properties, therefore making it impossible to perform conditional synthesis of any
kind. The addition of conditional properties allows for generation based on prede-
fined inputs to the sampling model, such as a category or data attribute. It has also been found to be great for applications such as image inpainting, where it can, conditioned on an incomplete image, generate the missing part and complete the image.
β-VAE
β-VAE [31] is a variant of the standard variational autoencoder with a minor change in the learning algorithm that has a significant impact on the latent space. The researchers of this approach introduced the variant with a focus on disentangling and organising the latent dimensions. This is done through a modification of the ELBO with the introduction of a new parameter β. This parameter multiplies the KL divergence term, putting more weight on that term of the ELBO. This results in more structured and disentangled learning of the latent space, with a trade-off in the reconstruction quality of the network.
NVAE
NVAE [32], short for Nouveau VAE, is a deep hierarchical variant of the base architecture. Through the hierarchical structure of the convolutional blocks in the model, it learns the data to a deeper level, and is often able to learn more complex features and attributes in the data that a normal VAE would overlook. Because of this, the reconstruction quality and learning are better and more stable.
Figure 2.10: Generative Adversarial Network Architecture with the Generator and
Discriminator networks.
\min_{G} \max_{D} V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] \qquad (10)
Binary Cross-Entropy
The binary cross-entropy loss, or log loss, is one of the most common loss functions used for discriminators, as it is well suited to binary classification. The binary cross-entropy (Equation 11) measures the dissimilarity between two distributions, the predicted and the true distribution. It does this by comparing the real binary class with the predicted probability of the output being that class. It therefore penalises predictions that are far from the true probability.
L(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right] \qquad (11)

L(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] \qquad (12)
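As a small numerical sketch of how these loss terms are typically computed in practice from binary cross-entropy (illustrative values and a framework-free style, not the thesis implementation):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # Binary cross-entropy, averaged over the batch
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Illustrative discriminator outputs D(x) for real images and D(G(z)) for generated images
d_real = np.array([0.9, 0.8, 0.95])   # should be classified as real (label 1)
d_fake = np.array([0.1, 0.3, 0.2])    # should be classified as fake (label 0)

# Discriminator loss: push D(x) towards 1 and D(G(z)) towards 0
discriminator_loss = bce(np.ones_like(d_real), d_real) + bce(np.zeros_like(d_fake), d_fake)

# Generator loss (non-saturating form): push D(G(z)) towards 1
generator_loss = bce(np.ones_like(d_fake), d_fake)

print(discriminator_loss, generator_loss)
```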
In the Appendix Section A.1, alternate GAN loss functions and variants are de-
scribed for the interested reader.
Figure 2.11: Concept of normalising flows, transforming a simple distribution
into a complex distribution using invertible mappings. Figure obtained from:
https://lilianweng.github.io/posts/2018-10-13-flow-models/
Different types of flow architectures have been proposed over the years, with
each having their own model variants. The most prominent types of flows are coupling flows, autoregressive flows, residual flows and continuous flows. Coupling flows have proven to be the most commonly applied, with variants such as NICE [10], RealNVP [17], and GLOW [18]. Coupling flows have usually been an attractive approach due to their computational efficiency, scalability and flexibility in learning. Autoregressive flows have also been a powerful type of flow, with variants such as Masked Autoregressive Flow [36], PixelRNN [37] and Inverse Autoregressive Flow [38].
In the Appendix Section A.2, some Flow-based model variants are described for the
interested reader. For further introduction on flow-based models and these types of
flows, one can read the introduction presented by Kobyzev et al. [39].
eventually return a newly generated image. As a VAE latent space is usually multi-
dimensional, and contains more than 3 latent dimensions, it becomes harder for us
humans to interpret such a latent space, as it cannot be easily visualised. Especially for highly dimensional latent spaces, sampling techniques are required for more efficient generation. In this section, random Gaussian sampling and interpolation sampling are described; both are main sampling strategies in use. However, both methods can return worse and blurry results, or just overall bad generations, especially when the latent space was trained on smaller data. To solve problems appearing with the main sampling strategies, optimisations have been made in studies, such as the research of White T. [40], through bias-corrected vectors with data replication and synthetic vectors with data augmentation.
Interpolation Sampling
Interpolation sampling is another common sampling strategy, also known as linear
interpolation. In interpolation sampling, two points, usually of training, validation
or testing data, are chosen. Between these two points, encodings are sampled lin-
early in x steps. By linearly, it is meant that the shortest-distance line exactly between the two points is traversed in x steps, and these steps are taken as the sampled encodings. These steps between the two prechosen points can then be decoded to provide different samples, maintaining similar features to the prechosen points. How strongly these features appear depends highly on how close a point was sampled to one of the reference points. In Figure 2.12 an example of linear interpolation is given; it shows how, towards each of the two directions, the images become more and more like the corresponding original reference image.
Figure 2.12: An example of linear interpolation retrieved from the generations fur-
ther done in the implementation and controlled experiments on tree log ends.
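A sketch of linear interpolation between two latent encodings (hypothetical decoder and latent vectors, not the thesis implementation) can be written as:

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps):
    # Sample encodings linearly on the straight line between two latent points
    return [z_start + t * (z_end - z_start) for t in np.linspace(0.0, 1.0, steps)]

# Two illustrative latent encodings, e.g. obtained by encoding two training images
z_a = np.array([0.5, -1.0, 2.0, 0.0])
z_b = np.array([-0.5, 1.5, 0.0, 1.0])

for z in interpolate_latents(z_a, z_b, steps=5):
    # Each interpolated encoding would be passed through the VAE decoder here,
    # e.g. image = decoder.predict(z[np.newaxis, :])
    print(z)
```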
distant. The loss function was introduced in FaceNet [41], an embedding for face recognition and clustering. The triplet consists of 3 encodings:
• Anchor: the reference point in space, to which the positive and negative are
compared.
• Positive: the point in space matching with the anchor, measured in distance
from the anchor, aimed to be closer to the anchor than the negative.
• Negative: the point in space that does not match the anchor, measured in distance from the anchor, aimed to be further away from the anchor than the positive.
The goal is therefore to cluster similarities with these 3 encodings. While doing this
over many triplets, the network learns to better identify and classify similarities and
dissimilarities correctly. The loss has often been used for recognition models, to
support the training and improve the model quality and robustness. By making these triplets harder to learn, the network becomes better. One can understand the
loss function as the following goal (Equation 14):
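A common way to express this goal, sketched here in the standard FaceNet-style formulation, with d a distance between encodings and α a margin hyperparameter, is:

L(A, P, N) = \max\left( d(A, P) - d(A, N) + \alpha,\; 0 \right)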
3 Method
In this section the method of the thesis is described extensively, in advance of the actual implementation. Section 3.1 introduces the applied scientific approach, and Section 3.2 describes the selected scientific methods. Finally, Section 3.3 assesses the reliability and Section 3.4 covers the validity.
maintaining the specific architecture itself. Here multiple quantitative results are re-
turned, giving a good overview of the quality and learning of the models. After the
models have been trained, the generation quality and sampling quality are also ex-
perimented with, through the training of a recognition model, returning even more
quantitative metrics. In this way, not only can the generative models be compared with each other, but the generated images can also be compared with real data in the recognition model. This controlled experiment provides a more specific analysis of the current state, and with this a good basis for future work on this research. More specifics can be found in Section 6 (Controlled Experiment).
3.3 Reliability
The controlled experiments have some reliability weaknesses. Below, these weaknesses are discussed with arguments and counter-arguments.
The first reliability concern is the data collection of images. Log end images are retrieved from production pipelines. For this, a time frame of a week is set and all images from within the time frame are retrieved from storage. Of this larger set of images from the selected week, a smaller, randomly chosen set of size n is used for training the models. It could be argued that this data collection approach might introduce a bias towards certain log ends due to the random parameters and does not allow for reproducibility. A counter-argument is that scaling up the data set size reduces any possible bias towards certain types of logs, as more data is used. The randomisation also removes any researcher bias, as there is no control over the images selected for training the generative models. It could also be argued that the selected time frame might introduce a bias, but overall the differences between time frames can be considered minimal, and the trained generative model should be able to handle this. For future work, it could definitely be interesting to also use more data, and from within different time frames.
The second reliability concern is the usage of the data set itself: although the exact model and training parameters are given in Appendix D, the data set is private and everything is done within an industrial environment and setting. So the exact results and implementation are not reproducible, due to the privacy of the application. Obtaining a similar data set from a different source might nonetheless return very similar results.
A third reliability concern is the neural network itself. A neural network contains random parameters, and although one can set random seeds to allow for exact replication, this was not done within this thesis project. The reason is that setting an exact random seed is good for reproducibility, but it introduces a large bias for training the neural network and highly affects the training results. To avoid any such bias, the randomisation is kept. This does not allow for exact reproducibility and is a weakness in this regard. However, to improve the validity and reliability of the approaches, the exact model architectures are discussed in Subsection 5.1 (Design and Implementation) and can be found in Appendix D. The decision to reduce exact reproducibility of the training and the controlled experiment was made to avoid any researcher or seed bias. Nonetheless, with the exact parameters, the results of the deep learning process should not differ much under different randomisations and initialisations, as the models have proven to be quite stable, so very similar results will be achieved when replicating the implementation and controlled experiment.
3.4 Validity
Validity and the possible initial threats in this thesis are extensively discussed below, covering construct validity, internal validity and external validity. At the end of Section 6 (Controlled Experiment), any other validity concerns found post-experimentation are discussed.
features, some primary studies within the field are covered in literature review Section 4.2 to identify and address the threat to internal validity. On top of that, the proposed variant architecture is designed in an attempt to learn features better and more specifically. It is therefore also more focused on disentanglement and interpretability compared to the base architecture. Nonetheless, the explainability and interpretability of neural networks and deep generative models is still considered to be a large validity threat.
4 Literature Review
In this section a literature review is performed in order to get an understand-
ing of the generative models and the state-of-the-art research built on top of the
base models. The literature review is partly based on the guidelines of "Procedures
on performing systematic reviews" [42], which was written for systematic reviews
within the software engineering field, but in turn was built on top of guidelines and
procedures from the medical field. Section 4.1 introduces the review protocol with the search procedure, the inclusion and exclusion criteria, and the data extraction. After introducing the protocol, the literature review findings are discussed, with the VAE findings in Section 4.2, the GAN findings in Section 4.3, and the flow-based model findings in Section 4.4. Finally, Section 4.5 presents general comparison findings.
4.1 Protocol
4.1.1 Literature review research goals/questions
The literature review is done to answer the following thesis research question:
"How does the applied generative variant and base model compare to other genera-
tive deep learning approaches in image synthesis?", where the applied methods are
the variant and base VAE. The research question can be answered by investigating
the research space for the different methodologies, in order to further compare them
with valid reasoning. This includes applications, variants, strengths, weaknesses,
and image synthesis. The aim of the search for primary studies is to answer the
following goals/questions per generative model:
Variational Autoencoders
The goal is to identify VAE variant architectures, the strengths of the variant models and their applications. On top of the VAE variants, the goal is to identify hybrid variant architectures, their addition to a VAE model and their applications. And as VAEs have a learned latent space, which is also the focus of the thesis project itself, the goal is to identify research on the disentanglement of the latent VAE space and the optimisation of the usage of latent dimensions.
Flow-based models
As flow-based models are not at the same research stage as VAE and GAN architectures, this primary study search is more limited. The goal here is to identify flow-based variant architectures and applications, the strengths of the models and their applications, and to identify other applications and studies of existing GAN (variant) architectures.
4.1.2 Search procedure of primary studies
Studies are collected through a search procedure in:
• Google Scholar 1
• Arxiv 2
• ACM Digital Library 3
• IEEE Xplore 4
• Proceedings of the IEEE International Conference on Computer Vision 5
• Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition 6
• The Conference and Workshop on Neural Information Processing Systems is
a machine learning and computational neuroscience conference 7
• International Journal of Computer Vision 8 journal
The search procedure is executed based on search terms with key terms, Boolean AND/OR operators, and the * (required) and ? (optional) operators. The following search term is used:
(Applications of OR Assessment of OR A survey on)? AND (Variational
Autoencoders OR VAE OR Generative Adversarial Network OR GAN OR
Flow-based Generative Model)* AND (Image Generation OR Network OR
Variants OR Generation)?
4.1.4 Data Extraction
In order to accurately extract the information out of the primary studies, a
data extraction strategy is designed. The following information is extracted and
recorded:
• Number of data sets/problems for the application of the generative model and
which data sets were used for training/testing.
In order to maximise validity and transparency, the list of primary studies with data extractions is presented in Appendix E.
4.2.1 Variants
Variants are considered to be VAE architectures that are built on top of the base
architecture or another variant itself. These usually consist of changes in the base architecture or in the loss function to make the model learn in different and specific ways. From the literature review of primary studies, one can observe that the variants were mostly focused on extending the VAE with new properties.
Variational autoencoders do not have the ability of conditional generation, thus
the synthesis of images based on conditional inputs that define and control the out-
put. To resolve this issue, Harvey et al. [43] present a conditional VAE variant that
allows for conditional synthesis. Variational autoencoder models have also proven to struggle with higher-resolution synthesis, especially compared to GAN-generated outputs; more blurriness is often found in the model generation. VQ-VAE-2 [12], VQ-VAE [11], RVQ-VAE [44], VD-VAE [45], and a VAE with self-attention and mutual information [46] are variants found that attempt to further improve the generation quality. Improvements for training are also proposed with the Exemplar VAE [47], ByPE-VAE [48], eVAE [49], and BooVAE [50] variants, which each propose their own improvements for robustness, complexity and learning. Loss functions are also
commonly adjusted for better training in certain use cases, such as the novel fre-
quency loss introduced by Jiang et al. [51] or the β-VAE [31]. Many of the VAE
variants found in the search for primary studies mainly focus on the application of
image synthesis such as VQ-VAE-2 [12] and VD-VAE [45] architectures, with Dohi
[52] applying a VAE for image synthesis on the problem of jet simulation. Image
inpainting is also a common application for variational autoencoders like Tu and
Chen [53] applying the model for face inpainting. Some variants are also focused
on a specific problem, such as the DP2 -VAE variant by Jiang et al. [54] that tackles
the problem of privacy concerns in training data. Wei et al. [55] further present a
comprehensive and comparative evaluation of different state of the art variants of
the standard VAE architecture.
4.2.3 Disentanglement
Disentanglement is considered to be the process of disentangling and organising the latent variables and dimensions of a VAE latent space. The ability to disentangle latent variables provides a lot of power and control over the synthesis and over sampling from the latent space. Such control is similar to what is found in conditional generative models, such as the previously mentioned conditional VAE [43]. The primary studies grouped under this section are studies investigating disentanglement or the control of latent variables.
For disentanglement of latent variables, Burgess et al. [31] introduce β-VAE, a VAE architecture that learns to disentangle latent variables better than other implementations, achieved with a rather small change in the loss function that has a large impact on learning. One issue with the β-VAE model is that it trades a better KL loss for worse reconstruction and synthesis. Approaches such as ControlVAE [63], GCVAE [64], PBT-VAE [65], and the approach by Ebrahimabadi [66] all focus on improving reconstruction errors while still maintaining high disentanglement quality. Other approaches, such as the Topographic VAE [67] and Spatial-VAE [68], are disentanglement approaches applied to more specific settings, such as topographic and spatial features in data. Xu et al. introduce Multi-VAE [69], a VAE variant that uses a multi-clustering framework by learning disentangled representations, allowing for control over them. Finally, Pastrana [70] presents an experimental study that investigates latent space disentanglement with VAE models and compares different architectures.
4.3.1 Variants
Variants are considered to be GAN architectures that are built on top of the base architecture or of another GAN variant. The changes the variants provide are often changes to the loss functions and to the architecture of the generator network, the discriminator network, or both.
One example is the well-known StyleGAN architecture proposed by Karras et al. [15], which applies an alternative generator architecture based on style transfer. StyleGAN leads to better disentanglement of high-level features, stochastic variation in generation, and specific control over synthesis. The StyleGAN network also has variants of its own, such as a robust StyleGAN [71], StylEx [72], ReStyle [73], and a StyleGAN approach with an improvement focus towards disentanglement [74]. Lu et al. [75] present a contextual generative adversarial network to learn the joint distribution of sketches and images. Radford et al. [76] present the deep convolutional generative adversarial network, DCGAN. The architecture adopts convolutional layers in order to learn representations well; the deep convolutional GAN learns a hierarchy of representations from object parts to scenes in both networks of the GAN. A variant of the DCGAN [77] focuses on image synthesis over multiple class conditions. Yang et al. [78] propose the Layered Recursive GAN, LR-GAN, a variant approach for learning scene structure and context. The approach is a generative adversarial network that learns the foreground and background of images separately and recursively, and further down the pipeline combines the layers again to generate a natural-looking image. Zhang et al. [79] propose a Self-Attention GAN, SAGAN, variant. The variant focuses on generating image details using cues originating from all feature locations, compared to spatially local points in lower-resolution
feature maps in traditional GANs. On top of that, it provides improved training dynamics of the generator through spectral normalisation. Liu et al. [80] propose a Residual Block Based GAN variant for image synthesis, a GAN architecture where the generator and discriminator are modified and a residual block is added to the architecture. The implemented methodology aims to achieve more stable training and to learn image features better. Huang et al. [81] propose Stacked Generative Adversarial Networks, SGAN, a generative model extending the standard GAN. The model consists of a top-down stack of different GANs, with the goal of learning lower-level representations based on the higher-level representations. The models are encouraged to align with each other through the discriminator and an introduced conditional loss function. Because of the stacked GAN architecture, the generative model is able to generate significantly better quality images and to learn features better. Durugkar et al. [82] present Generative Multi-Adversarial Networks, a GAN variant that, instead of implementing one generator and one discriminator, implements a generator with multiple discriminators. Liu et al. [83] introduce BlendGAN, a GAN variant that handles the issues of layer-swapping mechanisms not being able to fit arbitrary styles in a single model and of the necessity of style-consistent data for all the different styles. The BlendGAN architecture approaches this by implementing an encoder to extract the style representations and a controllable weight blending module, all in a unified model that reduces the amount of data needed for each individual style. As with VAEs, disentanglement of features is also researched within GANs, such as the Branched Generative Adversarial Network, BSD-GAN [84], for scale-disentangled representation learning and image synthesis. Chan et al. [85] present the Periodic Implicit Generative Adversarial Network, pi-GAN, a variant for 3D image synthesis. The architecture involves periodic activation functions and volumetric rendering to learn high-quality image synthesis with a consistent 3D multi-view understanding.
A default GAN architecture does not allow for much control over properties; some variants tackle this problem by allowing for controlled synthesis. Examples are the ADGAN [86] for controllable person image synthesis, the conditional GAN architecture [14], and CWGAN [87], a variant of the Wasserstein GAN [88] that, on top of controlled synthesis, aims to mitigate the problem of vanishing gradients.
Beyond improved synthesis, variance, and disentanglement of features, many architectures also attempt to mitigate problems appearing in the base architecture, such as the InfoMax-GAN [89] for catastrophic forgetting of the discriminator network and mode collapse of the generator network. The LSGAN [90] attempts to mitigate the problem of vanishing gradients in the sigmoid cross-entropy loss function of the discriminator, and the Lifelong GAN [91] targets lifelong learning and mitigating catastrophic forgetting in the network. Dong et al. [92] present an approach to handle key challenges with state-of-the-art generative models caused by arbitrary person pose manipulation. To take on these challenges, the authors present a Soft-Gated Warping Generative Adversarial Network, Warping GAN. The variant model consists of two stages: target pose segmentation map synthesis, and a modified GAN approach with a soft-gated warping block for feature-level mapping.
Variants also attempt to tune the network to issues occurring in the training data itself, such as the F2GAN [93], which attempts to synthesise high-quality diverse images for classes with limited data, or the TrGAN, a variant that attempts to solve the issue of conditional GANs requiring a lot of labelled data. Choi et al. [94] present StarGAN, a novel and scalable GAN framework for multi-domain high-quality image-to-image translation. The unified architecture of StarGAN provides the opportunity for simultaneous training on multiple data sets.
An application that quite a few variants aim at is the up-scaling and super-resolution synthesis of images; some of these variants are the SR-GAN [95] and the WDSRGAN [96]. Another approach is image improvement through inpainting: Wang et al. [97] introduce the Discriminative Region Proposal Adversarial Networks, DRPAN, model, a GAN variant that works by finding the worst generated/fake region in an image using DRPnet and applying learned image inpainting on that region to obtain a more realistic generated part.
4.3.3 Applications
Under applications, primary studies that are neither variants of the GAN architecture nor hybrid models, but rather applications of existing GANs to a certain problem or data set, are described. This also includes analyses and overviews of GANs.
When investigating the research space of deep generative models, two fields of application appear frequently. First, the medical field, where generative models, specifically GANs here, can be applied to generate rarer or harder-to-collect data, obtain higher resolutions, or find features that a human would not detect. An example of this is the application of generative adversarial networks to image synthesis for Magnetic Resonance (MR) images [102], where the synthesis even passed a visual Turing test. Another common field is anomaly detection, with the SDGAN [103], a GAN variant for surface defect image generation, or the application of a Cycle-GAN [16] variant for abnormal-to-normal generation [104]. A common application for reporting quantitative results, especially with GANs, is face synthesis. An example of this is the study on DCGANs [76] applied to face image synthesis with added and controllable attributes [105]. Zeno et
al. [106] implement a comparative analysis of a UNIT GAN, a hybrid model based on a GAN combined with a VAE, and a StarGAN architecture. The authors apply these architectures to photo-realistic new face image synthesis while preserving identity in the generation. There are many other applications as well; some examples are the following. Minaee and Abdolrashidi [107] apply generative adversarial networks to the synthesis of high-quality fingerprint images. Many existing models and approaches are not powerful enough to learn representations as complex as fingerprints; the GAN approach in this research has proven to work well for such complex synthesis tasks. Mustikovela et al. [108] present SSOD, an end-to-end analysis-by-synthesis framework using controllable GANs for the task of self-supervised object detection; the authors apply their framework to car detection. Zhao et al. [109] apply image augmentations to vanilla GAN architectures to investigate and study the impact on results and training for image synthesis, tested in a variety of settings. On top of that, Liu et al. [110] present an extensive overview of adversarial generative models for image and video synthesis, with related works, regularisations/stabilisations, improvements, and limitations. Shamsolmoali et al. [111] provide a comprehensive survey and case studies on adversarial networks for image synthesis; the authors summarise the methodologies and applications, with extensive details on architectures, loss functions, metrics, and training. Tahmid et al. [112] present a comparative analysis of different GAN variants and assess the quality of their image synthesis.
et al. [116] present Attentive Contractive Flow, a plug-and-play flow model for localised self-attention. The self-attention contractive flow is implemented to extend expressiveness while maintaining invertibility of the flows. The authors demonstrate representation power, faster training convergence, and more realistic synthesis when using a generative model with attentive contractive flows. Another approach to improving the expressiveness of flows is mAR [117], or Multi-Scale Autoregressive Priors, which introduces channel-wise dependencies in the latent space. Using these mAR flow layers, the model achieves better density estimation and improved image generation losses. To mitigate three limitations of prior work, namely uniform noise for dequantisation, inexpressive affine flows, and purely convolutional conditioning in coupling layers, Ho et al. [118] introduce Flow++, a variant flow-based generative model. The architecture achieves new state-of-the-art results due to these architectural changes, compared to the prior work containing these limitations. Chen et al. [119] propose GSMFlow, a flow-based architecture for tackling generation shifts, which occur when the model synthesis shifts away from the real distributions of unseen data. The researchers identify and attempt to mitigate three problems possibly causing generation shifts: semantic inconsistency, variance collapse, and structure disorder. The flow model achieves state-of-the-art results on image synthesis on the tested data sets.
4.4.2 Applications
Hajij et al. [120] implement the RealNVP [17] flow-based architecture for medical image synthesis and show the effectiveness of the method on this data. Much research focuses on the quality of models based on their generation performance, but Pope et al. [121] study the adversarial robustness of flow-based models, specifically the GLOW [18] and RealNVP [17] models. The authors show that both models are extremely sensitive to adversarial attacks, and that a more hybrid training procedure significantly improves robustness. Pires and Figueiredo [122] present a way of introducing discrete structure into the flow-based model framework by combining a flow-based model with a mixture model; the authors demonstrate its application in density estimation, clustering, and semi-supervised learning. Tailanian et al. [123] apply a U-shaped flow-based model to anomaly detection, in combination with a multi-scale image transformer network. The presented framework returns state-of-the-art results on the commonly used anomaly detection data set compared to all other models. Dong et al. [124] present the application of flow-based models to image super-resolution enhancement of medical MRSI data. The flow-based enhancer shows that flow-based models can serve this application well, as the authors report better resolution enhancement compared to generative adversarial networks and base flow-based models. ClothFlow [125] and FVTN [126] are both flow-based model frameworks applied to the problem of clothed person image synthesis for virtual try-on and pose-guided image generation. Both flow-based approaches returned state-of-the-art results on the synthesis, and especially FVTN, which came after ClothFlow, achieved high-quality try-on synthesis.
4.5 Comparison Findings
After covering the findings on the three deep generative models separately, a comparison can be made based on the qualitative observations from the primary studies.
Research-wise, the GAN space seems to be the largest and the most adopted for image synthesis, followed by the VAE and then the flow-based models. Especially over the last couple of years, all three generative approaches have seen a lot of research, with many variants and adaptations improving the existing state-of-the-art architectures. Overall, the GAN and flow-based architectures with their many variants seem able to generate higher-quality and higher-resolution outputs, while the VAE ends up with blurrier reconstructions. VAE variants attempt to solve the blurriness, but so far they do not outperform the other models. The GAN, however, tends to produce noisier outputs, while the VAE and flow-based models produce cleaner and smoother images. The VAE also offers a very accessible learned latent space that can be used for sampling other images with different combinations of features. This latent space furthermore allows for research and variants focusing on disentanglement and latent variable control. The different models also optimise in different ways: the VAE learns a lower bound of the maximum likelihood, while the flow-based models learn the exact maximum likelihood and are therefore, in theory, able to learn significantly more exactly than the VAE; the GAN does not optimise for the maximum likelihood at all, but instead optimises the minimax loss.
All the architectures do have problems during training: the GAN suffers from mode collapse, vanishing gradients, catastrophic forgetting, and unstable training; the VAE suffers from posterior collapse, entangled representations, and over-regularisation leading to blurry reconstructions; and the flow-based model suffers from mode collapse and generation shifts. None of the three deep generative models supports conditioned synthesis with the base architecture; in all three, variants such as the conditional VAE [43], the cGAN [14], and the CAGlow [114] architectures provide a conditional solution. Lifelong learning is also a problem within image synthesis, which certain deep generative variants attempt to tackle with adjusted architectures.
In theory, flow-based models can, due to their nature, outperform the other models, as they are able to explicitly learn the probability density function of the real data, something that the VAE and GAN do not learn. In practice, however, flow-based models are significantly heavier to train than the VAE and GAN. Per epoch, the VAE seems to be the lightest to train, as the GAN trains the discriminator and generator adversarially. The performance, however, highly depends on the implemented architectures and the complexity of the network.
As the individual deep generative models suffer from different problems in their architectures, hybrid variants are proposed; these combined deep generative models attempt to solve weaknesses in the individual architectures, such as the VAE-Info-cGAN [61]. These hybrid models seem able to achieve promising results and show that it might be possible to solve general problems by combining models.
Many details of this comparison have already been covered in the individual model sections. It is important to mention that this comparison is based on a sample of the existing research and is therefore not complete in any way, but it does highlight some of the larger outlines of the differences between the models. The comparison is, again, qualitative; no quantitative comparisons are done here.
5 Design and Implementation
In this section the implementation for the thesis project is described. Section 5.1 introduces the experimentation pipeline, which consists of the VAE architectures (5.2), the triplet sampling (5.3), and the recognition model (5.4). Section 5.5 describes the data collection processes, and Section 5.6 the technical specifications.
5.1 Pipeline
The pipeline in this implementation is a sequence of three different stages and components, as visualised in Figure 5.13. The steps follow an ordered sequence due to the dependencies between the individual components. The first stage is the VAE model; this stage includes training and experimenting with both the vanilla VAE architecture and the proposed variant VAE architecture. The resulting model is then used for sampling new images in the triplet sampling stage, where a proposed algorithm uses the learned distributions to sample triplets based on anchor images. This triplet data is then used in the recognition model stage, where the triplet samples are used to train a log end image recognition model.
Figure 5.13: Pipeline of the implementation, with the different major steps.
On top of the three major components, there are two data collection components, one for each of the deep learning models. For the VAE model stage, real log end images are collected for training, with different data set sizes used. For the recognition model, real triplets are collected, which are further used for training a baseline model and for testing the recognition model trained on sampled triplets. The exact order of stages was also followed in the development of this implementation. In the following sections, the stages are described in more depth.
of convolutional layers, in order to learn the image features. The number of latent dimensions D was one of the dependent variables and was adjusted based on the performance of the models and the scaling of the data set.
Encoder
The encoder of the vanilla VAE architecture consists of 5 convolutional layers, a flattening layer, 3 dense layers, and a custom sampling layer. The 2D convolutional layers ("Conv2D" in TensorFlow) have 32, 64, 128, 256, and 512 filters, respectively. All Conv2D layers use a 3x3 kernel, a 2x2 stride, "same" padding, and a ReLU activation function. Using this sequence of Conv2D layers, the first and second dimensions, originally the height and width of the image, are scaled down exponentially, while the third dimension, originally the RGB channels, is scaled up exponentially. The flattening layer is then used to flatten the output of the last convolutional layer. The flattened output further goes through a Dense layer with D units (the number of latent dimensions), without a defined activation function. This output is then used to train both the mean and log variance layers, which use the same parameters as the first Dense layer. Afterwards, both the mean and log variance outputs are used in the custom sampling layer implementing the reparameterisation trick. The architecture and sequence of layers in the encoder are described in Table 5.2.1, with the example of a vanilla VAE with 1024 latent dimensions.
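To make the layer sequence concrete, the following is a minimal sketch of the described encoder in TensorFlow/Keras. The function name and input resolution are assumptions, and the custom sampling layer that produces the latent point z is sketched separately under the Sampling subsection below.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(latent_dims=1024, input_shape=(256, 256, 3)):
    # Five Conv2D layers: 32 to 512 filters, 3x3 kernels, stride 2, "same" padding, ReLU
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    # Dense layer of D units without an activation function
    x = layers.Dense(latent_dims)(x)
    # Mean and log-variance layers, each with D units
    z_mean = layers.Dense(latent_dims, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dims, name="z_log_var")(x)
    # The custom sampling layer (see the Sampling subsection) is applied to these
    # two outputs to obtain the sampled latent point z.
    return tf.keras.Model(inputs, [z_mean, z_log_var], name="encoder")
```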
Decoder
The decoder of the vanilla VAE architecture consists of a dense layer, a reshape layer, and 5 convolutional transpose layers. The Dense layer has the same number of units as the flattening layer output in the encoder architecture and uses the ReLU activation function. The output from the Dense layer is then reshaped using a Reshape layer so it can be used by the 2D convolution transpose ("Conv2DTranspose") layers. The 5 transpose layers have 512, 256, 128, 64, and 3 filters, respectively. All of the Conv2DTranspose layers use a 3x3 kernel, a 2x2 stride, "same" padding, and a ReLU activation function, except the last Conv2DTranspose layer, which uses a 1x1 stride. The output of the decoder is again an image of the same size as the input, with the 3 RGB channels. The architecture and sequence of layers in the decoder are described in Table 5.2.1, with the example of a vanilla VAE with 1024 latent dimensions.
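A corresponding sketch of the decoder is given below. The reshape target is an assumption that mirrors the encoder's final feature map; it has to be chosen so that the stride-2 transpose layers restore the original image resolution.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder(latent_dims=1024, feature_shape=(8, 8, 512)):
    # feature_shape mirrors the encoder's last convolutional output and is an
    # assumption here; it must be picked so the upsampling restores the input size.
    latent_inputs = tf.keras.Input(shape=(latent_dims,))
    x = layers.Dense(feature_shape[0] * feature_shape[1] * feature_shape[2],
                     activation="relu")(latent_inputs)
    x = layers.Reshape(feature_shape)(x)
    # Stride-2 Conv2DTranspose layers with 512, 256, 128, 64 filters ...
    for filters in (512, 256, 128, 64):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same",
                                   activation="relu")(x)
    # ... and a final 3-filter layer with a 1x1 stride producing the RGB output
    outputs = layers.Conv2DTranspose(3, 3, strides=1, padding="same",
                                     activation="relu")(x)
    return tf.keras.Model(latent_inputs, outputs, name="decoder")
```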
Sampling
As described above for the encoder, the network learns both the mean and the log variance in dense layers with D latent dimension units. The distribution type for the VAE, as most commonly used, is the Gaussian distribution. In order to sample from it, a custom sampling function is used, which implements the VAE reparameterisation trick as described in Section 2.6. The inputs of the sampling function are the mean and log variance layers of D latent dimensions, which represent the learned distributions. In the algorithm, a new parameter epsilon is drawn from a standard Gaussian distribution with the same shape as the means. The standard deviation can then easily be computed from the log variances. Based on the mean, standard deviation, and epsilon, a point in the latent space is sampled. This construction allows for backpropagation while still being able to learn the distributions.
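A minimal sketch of such a sampling layer is shown below; the class name is illustrative, and the draw follows the reparameterisation described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Reparameterisation trick: z = mean + exp(0.5 * log_var) * epsilon."""

    def call(self, inputs):
        z_mean, z_log_var = inputs
        # epsilon is drawn from a standard Gaussian with the same shape as the mean
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        # The standard deviation follows from the log variance
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
```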
Connected Architecture
The final step of this model is to connect the encoder and decoder. The losses can then be computed based on the outputs of both the encoder and the decoder.
Algorithm 2 Connected VAE
1: function VAE(data, latentDims)
2:   ▷ Input the data and the n latent dimensions through the encoder
3:   ▷ This returns the latent mean, log variance, and points
4:   zMean, zLogVar, z ← encoder(data, latentDims)
5:   ▷ Reconstruct the latent point through the decoder
6:   reconstruction ← decoder(z)
7:   return reconstruction
8: end function
learning of the mean and log variance in the encoder network. As described above, the vanilla VAE learns the means and log variances of all D dimensions in one layer each, with the previous layer being a dense layer connecting to these two layers. The variant approach is constructed so that the network learns the parameters of every dimension separately: for each dimension, a small sub-network of layers is added before the parameter layers are learned. This is visualised in Figure 5.14.
In the implementation within this thesis project, a sub-network of two sub-layers is implemented for each latent dimension. These layers are both densely connected layers with 8 and 4 units, respectively, using a ReLU activation function. After the two sub-layers, the mean layer and the log variance layer, each containing 1 unit, are added. This is repeated for every latent dimension, and all these layer parameters are trained jointly. The aim of this approach is to improve the learning of the features, resulting in more creative sampling and better reconstructions. One can understand an improvement in creativity as an improvement in the learning of features, therefore resulting in more varied features upon sampling. Figure 5.15 visualises the implemented sub-network, with the two dense layers and the latent dimension parameter layers, for one single latent dimension. Besides the changes regarding latent dimensions in the encoder, the variant architecture is exactly the same as the vanilla architecture implemented and described above.
Figure 5.14: Variant VAE Architecture with D sub-networks each learning the mean
and log variance dimension parameters.
Figure 5.15: Visualisation of one sub-network for one latent dimension. This sub-
network is repeated and trained for every latent dimension. Here the sub-network
contains two sub-layers before the mean and log variance layers are trained.
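The sketch below illustrates how such a per-dimension sub-network (Figure 5.15) could be attached to the shared encoder features in Keras. The layer sizes (8 and 4 units, then 1-unit mean and log-variance layers) follow the description above; the concatenation of the per-dimension outputs back into D-dimensional mean and log-variance vectors is an assumption about how the outputs are recombined. The loop over D dimensions also illustrates why the variant has many more layers, and therefore slower predictions, than the vanilla encoder head.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_variant_latent_head(shared_features, latent_dims):
    """Attach one small sub-network per latent dimension to the shared features."""
    means, log_vars = [], []
    for d in range(latent_dims):
        # Two dense sub-layers per dimension: 8 and 4 units, ReLU activation
        h = layers.Dense(8, activation="relu", name=f"dim{d}_dense8")(shared_features)
        h = layers.Dense(4, activation="relu", name=f"dim{d}_dense4")(h)
        # One-unit mean and log-variance layers for this dimension
        means.append(layers.Dense(1, name=f"dim{d}_mean")(h))
        log_vars.append(layers.Dense(1, name=f"dim{d}_log_var")(h))
    # Recombine the per-dimension outputs into D-dimensional vectors (assumption)
    z_mean = layers.Concatenate(name="z_mean")(means)
    z_log_var = layers.Concatenate(name="z_log_var")(log_vars)
    return z_mean, z_log_var
```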
The KL loss regularisation term is given in Equation 19, where zLogVar and zMean are the sets of latent parameters learned by the network:
KL = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d}\Big[-0.5\,\big(1 + zLogVar - zMean^{2} - e^{\,zLogVar}\big)\Big] \qquad (19)
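As a minimal sketch, Equation 19 translates to the following code. The reduction order (sum over the d latent dimensions, mean over the n batch samples) follows the equation; combining it with the reconstruction term into a total loss is shown as an assumption for completeness.

```python
import tensorflow as tf

def kl_loss(z_mean, z_log_var):
    # Per-dimension term of Equation 19, summed over dimensions, averaged over the batch
    per_dim = -0.5 * (1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
    return tf.reduce_mean(tf.reduce_sum(per_dim, axis=1))

def total_loss(reconstruction_loss, z_mean, z_log_var):
    # Total loss: reconstruction term plus the KL regularisation term (assumed weighting)
    return reconstruction_loss + kl_loss(z_mean, z_log_var)
```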
5.2.4 Network sizes
The network sizes are listed in Table 5.4. The networks scale up considerably in size and number of parameters with the defined D latent dimensions. The variant networks also have more trainable parameters than the vanilla architecture, with the difference growing larger with the number of latent dimensions.
• Data: The data used for obtaining the VAE distributions. For this, the training
data of the VAE can be used.
• Anchors: A set of anchor images, where for every image an anchor can be encoded and a positive and a negative can be sampled. The test data of the VAE can be used.
• VAE: The trained variational autoencoder model, consisting of the encoder and decoder networks.
• Latent dimensions: The size of the latent space used for the trained VAE
model.
• Radius for positive: The distance in log variance from the encoded anchor at which the positive is sampled. This parameter has to be lower than the radius for negative.
• Radius for negative: The distance in log variance from the encoded anchor at which the negative is sampled. This parameter has to be higher than the radius for positive.
• Probability of sampling towards the mean for the positive sample: The probability of sampling towards the mean for the positive encoding. The parameter is a value between 0 and 1, inclusive, and is used for tuning the amount of outliers and noise in the images. The higher the parameter, the less noise can be found in the image, as the algorithm samples more towards the mean of the latent distribution.
• Probability of sampling towards the mean for the negative sample: The same parameter as above, but for the negative sample. A lower value than for the positive sample is preferred, as the negative should have more different outlier features and noise.
Given these parameters, the algorithm first computes the distributions from the VAE on the data. The latent distributions can also be given as a parameter, as they are known after training the VAE model itself; in case they were not saved, they can be computed again. After the distributions are computed, the algorithm encodes the original anchor image into the encoded anchor, which can then be used for sampling the positive and negative samples.
The positive and negative are sampled per individual latent dimension. In a single dimension one can only sample in two directions: towards or away from the mean of the distribution. As doing this completely at random returns noisier results, a probability of sampling towards the mean has been added, allowing for more good features and fewer outliers. The probability of sampling in the direction of the mean can differ between the positive and the negative, where the negative is allowed to be noisier than the positive. First the direction is decided through a uniform random variable, then the chosen log variance distance from the point is applied. Doing this for every latent dimension, a new encoded sample is constructed based on the encoded anchor sample. This process is done for both the positive and the negative, with their respective log variance distances and probability-towards-mean parameters. The algorithm is visualised in Figure 5.16 with the different parameters and encoding points. After the encoded triplet has been constructed, the encoded anchor, sampled positive, and sampled negative can be decoded with the VAE decoder.
Algorithm 3 Triplet Sampling
1: function TRIPLETSAMPLING(data, anchors, vae, latentDim, radiusPos, radiusNeg, posSampleTowardsMeanProb, negSampleTowardsMeanProb)
2:   ▷ Input the data and the latent dimensions through the encoder to obtain the distributions
3:   zMean, zLogVar, z ← vae.encoder(data, latentDim)
4:   triplets ← empty list
5:   ▷ Loop over the given anchor images
6:   for anchor in anchors do
7:     ▷ Input the anchor through the encoder to obtain the encoded point
8:     imgZMean, imgZLogVar, imgZ ← vae.encoder.predict(anchor)
9:     posSample, negSample ← zeros(latentDim), zeros(latentDim)
10:    ▷ Loop over the latent dimensions
11:    for dim in latentDim do
12:      globalMean ← zMean[dim]
13:      mean, logVar ← imgZMean[dim], imgZLogVar[dim]
14:      ▷ Sample the point based on the probability of sampling towards or away from the mean, for both the positive and the negative sample
15:      if random(0, 1) <= posSampleTowardsMeanProb then
16:        if mean > globalMean then
17:          posSample[dim] ← mean − (radiusPos ∗ abs(logVar))
18:        else
19:          posSample[dim] ← mean + (radiusPos ∗ abs(logVar))
20:        end if
21:      else
22:        if mean > globalMean then
23:          posSample[dim] ← mean + (radiusPos ∗ abs(logVar))
24:        else
25:          posSample[dim] ← mean − (radiusPos ∗ abs(logVar))
26:        end if
27:      end if
28:      if random(0, 1) <= negSampleTowardsMeanProb then
29:        if mean > globalMean then
30:          negSample[dim] ← mean − (radiusNeg ∗ abs(logVar))
31:        else
32:          negSample[dim] ← mean + (radiusNeg ∗ abs(logVar))
33:        end if
34:      else
35:        if mean > globalMean then
36:          negSample[dim] ← mean + (radiusNeg ∗ abs(logVar))
37:        else
38:          negSample[dim] ← mean − (radiusNeg ∗ abs(logVar))
39:        end if
40:      end if
41:    end for
42:    ▷ Decode the encoded points to obtain the triplet
43:    triplets.append(vae.decoder.predict([imgZMean, posSample, negSample]))
44:  end for
45:  return triplets
46: end function
The proposed algorithm has both deterministic and stochastic variables, but with high values for the stochastic probabilities, the algorithm will generate very similar logs on each run with the same anchor images. As a future improvement, it would be good for the triplet generation to introduce a random parameter for the distance as well, to add the capability of generating different triplets from the same anchor. For now, it is advised to use different anchors when using a high probability of sampling towards the mean.
5.4.1 Architecture
The architecture used for the recognition model is based on the pretrained EfficientNetB0 neural network. This way, the weights are not randomly initialised, and the recognition network trained on log triplets does not have to learn fully from scratch, as the EfficientNetB0 network has been pretrained on the large ImageNet data set. The EfficientNet architecture [127] was originally proposed to be easily applicable to different convolutional classification problems and networks. The EfficientNetB0 network is combined with a MaxPooling, Dense, and Lambda output layer to form the full recognition model. The model can be considered rather small compared to other convolutional classifiers. There are also much heavier EfficientNet and other pretrained networks available, but the goal of this project was not to optimise this architecture, but to test the application of sampled triplets on a recognition model. The weights of the EfficientNet architecture are also not set to be trainable. Table 5.5 shows a simplified version of the architecture.
Layer (Type)        | Output Shape  | Parameters
Input Layer         | (256, 256, 3) | 0
EfficientNetB0      | (8, 8, 1280)  | 4,049,571
MaxPooling2D Layer  | (1280)        | 0
Dense Layer         | (1024)        | 1,311,744
Output Lambda Layer | (1024)        | 0
Total parameters: 5,361,315
Table 5.5: Layer architecture and specifics for the recognition network.
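A sketch of this architecture in Keras is given below. The global max pooling and the L2-normalising Lambda layer are assumptions consistent with the output shapes and parameter counts in Table 5.5; the function name is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_recognition_model(embedding_dim=1024, input_shape=(256, 256, 3)):
    # EfficientNetB0 backbone, pretrained on ImageNet, with frozen weights
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_shape=input_shape)
    backbone.trainable = False

    inputs = tf.keras.Input(shape=input_shape)
    x = backbone(inputs)                    # (8, 8, 1280) feature map
    x = layers.GlobalMaxPooling2D()(x)      # (1280,)
    x = layers.Dense(embedding_dim)(x)      # (1024,) embedding
    # L2-normalised output embedding (an assumption for the Lambda layer)
    outputs = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return tf.keras.Model(inputs, outputs, name="recognition_model")
```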
these semi-hard negative encodings are within the margin of the anchor-positive distance: they are further away from the anchor than the positive, but still produce a positive loss. Using the semi-hard triplet loss, the network only updates its parameters based on negatives that are semi-hard, not on easy or hard negatives; the algorithm therefore only selects the semi-hard negatives. The reason for this is that focusing the training on these forces the network to separate the encoding space more and to learn features better.
Accuracy
The accuracy represents how well the recognition model can correctly identify logs between anchors and positives. The accuracy is not used as the cost function, but it is used to measure the quality of the recognition model, as the accuracy of identifying the same log ends at different stages is vital.
5.4.3 Training
During the training of the recognition model, augmentations are used to improve the learning and robustness of the model. These augmentations include mirroring, affine transformations, perspective transformations, and slight colour filters. The augmentations also prevent the model from overfitting heavily, especially since small data sets are used. In every epoch, the images are augmented randomly and with differently randomised augmentation parameters. The network also uses an Adam optimiser with an initial learning rate of 1e-04.
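A minimal sketch of such an augmentation pipeline is given below, using Keras preprocessing layers as an assumption (the thesis does not name the augmentation library); perspective transforms are omitted because Keras has no built-in layer for them.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Random, per-epoch augmentations: mirroring, small affine transforms, slight colour changes
augmentations = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),        # mirroring
    layers.RandomRotation(0.05),            # small affine rotation
    layers.RandomTranslation(0.05, 0.05),   # small affine shift
    layers.RandomZoom(0.1),                 # small affine scaling
    layers.RandomContrast(0.1),             # slight colour/contrast filter
])

# Adam optimiser with the initial learning rate stated in the text
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```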
pipeline, types of trees can be found in different conditions, shapes, and other visual
features.
• Memory: 80GB
• VCPU: 8
• VGPU: 1
6 Controlled Experiment
In this section the controlled experiment is described. Section 6.1 describes the different VAE model trainings and their results. Section 6.2 presents the triplet sampling results and the selected hyperparameters. Section 6.3 presents the recognition model training and results, for both real data and VAE-sampled data. Section 6.4 addresses additional aspects of experiment validity, post-experimentation.
Final losses
Based on the results reported in Table 6.6, it can be concluded that the variant VAE with 512D achieves the best training losses on the 1K data set, with the lowest reconstruction and KL loss. In the validation losses, reported in Table 6.7, the vanilla VAE achieves the lowest loss due to a higher reconstruction error of the variant VAE; the variant VAE does still achieve a lower KL loss on validation. The difference between the training and validation losses is quite large here, which appears due to the high variation in tree log ends combined with the use of a very small data set. This causes the training to generalise less to the validation and test sets, therefore returning higher reconstruction errors.
Losses during training
The reconstruction training losses, as visualised in Figure 6.17, all follow the same trend over the epochs during training, where both 512D models are very close to each other and keep dropping slowly. The validation losses for the 512D models follow the same trend, with the vanilla VAE staying below the variant VAE in loss. The validation loss for the 256D vanilla VAE drops significantly more slowly and does not reach a plateau within the 500 epochs. This is also why the variant VAE was never trained with 256D, and 512D was used immediately.
Figure 6.17: Reconstruction Loss during the training of the models on the 1K data
set.
The KL training loss for the 256D vanilla VAE drops a lot at the beginning of training, as visualised in Figure 6.18, but keeps increasing over time, while both 512D models stay quite stagnant and decrease very slowly after a large drop in the first 100 epochs. The validation losses for the vanilla VAE models follow a very similar trend and reach the same plateau. The variant model does too, but at a lower level, and therefore achieves a better KL loss.
Figure 6.18: KL Loss during the training of the models on the 1K data set.
Figure 6.20: 256D Vanilla VAE reconstructions on random validation samples.
Figure 6.21: 512D Vanilla VAE reconstructions on random training samples.
Figure 6.24: 512D Variant VAE reconstructions on random validation samples.
Final losses
After 100 epochs on both models, the variant VAE achieves lower training losses across the board compared to the vanilla VAE. The KL loss is also lower on validation data with the variant model, whereas its reconstruction and total ELBO losses are higher than those achieved by the vanilla VAE. The KL loss of the variant is also noticeably below the KL loss of the vanilla model.
10K images for validation and testing each. The vanilla VAE model was trained in two sessions of 200 epochs each, 400 epochs in total. The variant VAE model, on the other hand, was trained in four sessions of 100 epochs each, also 400 epochs in total. The reason for this is that when training on data of this size, the virtual machines need to train for a long time, and the variant VAE needs about 2-3x more time to train due to all the sub-networks and the current implementation. Training in sessions was therefore necessary to mitigate memory issues with the available computing resources. This does, however, slightly affect the training, as restarting a session initialises the batches differently, causing some drops in the losses with the new batches. The vanilla VAE was trained before the variant VAE, and it was therefore not initially planned to do the training in four sessions. Although this slightly biases the results, the impact is minimal in the end, as the data set is very large and both models were trained for 400 epochs in total.
Final losses
On the large data sets, the models still achieved similar results for the KL losses as on the other data sets, where the variant achieves a lower KL divergence. The reconstruction losses in training and validation, however, are both lower with the vanilla model, something that did not appear when training on the smaller data sets. For this reason, the total ELBO loss is also higher for the variant VAE.
if the models reach a plateau and the losses would not be affected much by more training.
Figure 6.25: Reconstruction Loss during the training of the models on the 100K
data set.
Figure 6.26: KL Loss during the training of the models on the 100K data set.
despite most features being present, a lot of detail is lost on the log rings, especially in the validation reconstructions, where the blurriness is heavier.
Figure 6.29: 1024D Variant VAE reconstructions on random training samples.
The vanilla VAE model shows a significantly faster prediction time, about 28.71x faster than the variant VAE. Although the variant VAE has only a few million more trainable parameters, its architecture has many more layers, and the current implementation in this thesis project could be further optimised and parallelised. These reasons lead to much slower prediction times for the variant. Although the prediction time per image is still rather low for the variant VAE, it might currently be too slow to deploy this architecture in production, whereas the vanilla architecture could be more interesting to deploy.
Figure 6.31: An example of a triplet with the chosen parameters.
In Figure 6.32, the chosen radii are used for sampling, but the probability of sampling towards the mean is purely random, i.e. a 0.5 probability in the direction of the mean. This visually shows that these parameters highly impact the sampling and make a significant contribution to the proposed sampling algorithm itself.
No metrics were used for tuning the parameters of the sampling algorithm. In future work, however, the AI models that use the sampled images could be used for optimising these parameters, where networks are trained on different parameter sets and the optimal parameter set is selected. More generated triplets can be found in Appendix C.
and real images is definitely present in terms of image quality. The same triplet margin of 0.6 was used for all models. All these trainings allowed for comparison and evaluation of whether such models are in any way applicable to real data and can therefore be used for data extension/augmentation/replacement.
The baseline resulted in quite a good accuracy for being trained on only 1470 triplets; especially its validation and testing accuracies demonstrate that it can be used for unseen log ends as well and is robust to a certain degree. The trainings on the 1470 sampled triplets resulted in average training accuracies, showing that there is some learning of recognition. These models, however, seem to perform poorly in validation accuracy, making them less applicable to unseen data, and they further achieve a very low accuracy when tested on real data. The same results are found with the 10K sampled-image data sets. There seems to be no improvement in learning with more data, and the models trained on generated data do not seem to be applicable to real testing data.
Table 6.13: Models Trained on 1470 Real data set - Accuracy Losses
Table 6.14: Models Trained on 1470 Sampled data set - Accuracy Losses
Table 6.15: Models Trained on 10k Sampled data set - Accuracy Losses
Figure 6.33: Image reconstructions on random samples from validation data set.
Figure 6.34: Image reconstructions on random samples from validation data set.
Figure 6.35: Image reconstructions on random samples from validation data set.
Another validity concern appears with the hyperparameter optimisation of the triplet algorithm parameters. The hyperparameters were chosen in a subjective manner, rather than on a quantitative basis. It would require a lot of time to also optimise the parameters based on the recognition model results, so such an approach was not feasible within the thesis project. Due to the subjective manner in which the parameters were decided, a slight research bias could appear, slightly affecting the results.
7 Discussions
In this section the research questions are evaluated and discussed, based on the
results achieved in this thesis project in the literature review and controlled experi-
ments.
RQ1: What generative variant can be constructed with the goal of learning
individual data distributions better than the standard approach?
In this thesis, we proposed a VAE variant approach, as shown in Figure 7.36, that adapts the base architecture with a sub-network for each latent dimension in the latent space. The goal of this variant was to learn the latent space better, learn better features, and thereby generate more feature-rich images. Qualitatively, the differences between the vanilla VAE architecture and the proposed variant are hard to observe; an immediate impact on the generated images themselves is not really visible. On a quantitative basis, however, the variant achieved lower KL divergence losses for all variant models, on both training and validation, while having little impact on the reconstruction quality. We can conclude that the constructed variant achieved slightly better learning of individual data distributions compared to the standard VAE approach on this specific problem and data set. Because the constructed variant was tested on only this problem and data, the results are not sufficient to generalise to other problems. The resource requirements of the current technical implementation are very high, about double those of the standard VAE approach, and the prediction time is also significantly larger. A trade-off analysis of whether this is worth it should be done before further use of the constructed variant approach, especially in environments where time is highly constrained.
RQ2: How does the applied generative variant and base model compare to
other generative deep learning approaches in image synthesis?
A literature review was conducted to find related work on the different deep generative models. During this process, many strengths and weaknesses of the different models were identified, which variants and hybrid models attempt to address. This search for related work allowed for a detailed analysis of a sample of the existing research, and with this the opportunity for a comparison between the deep generative models. The comparison findings are covered in Section 4.5 (Literature Review). In the literature review itself, the base model and the variant were treated as one architecture under the search for VAEs, as at the time of the review the results of the models were not yet available. Eventually, the weaknesses found in the base VAE model are also reflected in the generative variant and its application within this thesis project, such as the blurriness and high reconstruction losses on the complex data. This opens up the opportunity to apply solutions similar to those of other variant and hybrid models, to mitigate and solve the currently occurring problems regarding reconstruction quality in the future.
RQ3: How can the proposed variant and base model contribute for triplet gen-
eration, for improving a recognition model on artificial data?
The latent spaces learned by the variant and base models were used for triplet generation, where we proposed an algorithm to sample triplet images from the latent space. This is done by obtaining an anchor, computing its encoding, and sampling the positive and negative points in space from that encoded point, based on a log variance distance from the anchor and a probability of sampling towards the mean of each latent distribution. In this way, both the trained variant and the base model were able to contribute to triplet generation, with in the end quite feature-rich tree log ends. The obtained sampled triplets were then used for training a recognition model. Due to the high reconstruction losses and the blurriness appearing in the image synthesis, the recognition model trained on sampled triplets could not be evaluated properly on real images, as the differences between the real and generated data were too large. Therefore, the generated data, at this point in the research, is not capable of improving another model and consequently cannot mitigate costs such as data collection. We reason that this is because of the quality differences.
8 Conclusions and Future Work
In this section the final conclusions of the research and work are presented, followed by suggestions for future work.
Conclusions
The motivations for this thesis project were to construct a full deep learning pipeline, apply a deep generative model, propose a variant generative approach, and introduce a sampling algorithm, all in the hope of improving a recognition model with artificial data and evaluating the usability and capabilities on an industrial forestry application. In the end, the construction of these individual components was successful. However, the ideal situation, in which a VAE model can generate large amounts of high-quality tree log end images and consequently improve a recognition model with artificial data, was not reached. Nonetheless, there are still many interesting results, opening many opportunities for future work.
The VAE image generation did show quite good feature learning, where the proposed variant VAE in this application returned better losses and feature extraction through an adapted architecture using a sub-network for each latent dimension. Although the variant requires more resources, longer training, and more total parameters, this trade-off can definitely be worth it to achieve better feature learning. The reconstruction and generation quality itself did not reach the ideal point, resulting in a large amount of blurriness in the final decoded images. We reason that this is most likely due to the implemented architecture, meaning that the used data set requires more complexity in the architecture, especially in the decoder, in order to obtain better decoded reconstructions from the encoded representations. We also recognise that tree log ends are not the easiest image data to learn, due to the very large variance in the data, with logs often containing very different features. Therefore, more data or the use of image augmentations might also be required for better learning. The appearing blurriness is a commonly occurring problem with VAE models, which was also noticeable here.
The presented triplet sampling algorithm could not be evaluated properly on a quantitative basis due to the problems occurring in the VAE reconstructions, which also propagate through to the recognition model. The algorithm nonetheless proved able to generate many triplets that were sampled in a creative way and showed very different features depending on the sampling parameters.
The recognition models trained well on the sampled triplets themselves and returned quite good losses for a first iteration of this research. Unfortunately, the recognition models trained on VAE-sampled triplets proved not to be applicable to real triplet images as testing data, where they achieved very low recognition accuracies.
This thesis work contributes to the field of generative models and to the forestry industry, not only by showing that the variant VAE network, the triplet sampling algorithm, and the individual recognition models performed quite well on their own, but also by showing that the current state of this research lacks the generation complexity and quality needed to use the synthesis for improving recognition models; future work might be able to solve this problem. On top of that, this thesis presents a literature review on the different deep generative models and the differences in achievements between them. Although the primary studies found in this literature search are only a sample of all the existing research, the review still contributes through the exploration of findings in this sample of studies, as well as through the weaknesses, strengths, variants, and applications identified in research on deep generative models.
Future Work
There are several directions for future work based on the results of this thesis project. The first step for future work on the VAE that we suggest is a significantly more complex decoder network and more experimentation with it, while also further investigating variants for improved quality. The image synthesis results ended up quite blurry, while the features were still learned quite well and were mostly reconstructed; a heavier architecture than the one used in this project could improve the reconstructions and decrease losses. On top of that, the encoder weights could be initialised to the already trained weights from the models trained on the large data set. A second suggestion for future work is experimentation with different losses, as a different loss might lead to improved reconstructions as well. The variant architecture implementation in this thesis was not the most optimised and was built at quite a high level using the TensorFlow and Keras libraries; a lower-level implementation might significantly increase performance and utilise resources better. Additionally, it would be interesting to experiment with the variant on more data sets to test its generalisability and performance on other problems. A different approach to log end image generation could also be an interesting direction, with something that works better for a VAE, such as a combination with a GAN for optimised synthesis.
For the triplet sampling approach, it would be interesting to add more stochastic variables to the algorithm, in order to sample more varied triplets, allowing anchors to be used several times while still producing very different triplets with creative feature generation. It would also be very interesting to analyse the latent space further, with the help of dimensionality reduction techniques such as t-SNE, in order to generate a more structured mapping of the latent space and identify clusters. In this way, the sampling algorithm could also be optimised further towards the latent space, for example by finding interesting points in the latent space and sampling triplets based on linear interpolations.
Finally, it could be interesting to experiment more with the recognition model stage as well. Suggested options here are experimenting with different pretrained models and with hyperparameters such as the triplet margin. This was out of the scope of this thesis due to time restrictions, but would still be very interesting to investigate before investing in VAE improvement approaches. On top of that, it could be very interesting to sample new triplets during training that are considered semi-hard, as only these are used with the semi-hard triplet loss. This has the potential of faster and better learning, as more triplets would be used for training.
This thesis project has many directions for future work. The above are not the only ones, but at least the ones we think are interesting and suggest as follow-ups to the research done in this thesis.
References
[1] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A
survey,” ACM Comput. Surv., vol. 41, no. 3, jul 2009. [Online]. Available:
https://doi.org/10.1145/1541880.1541882
S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc.,
2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/
2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf
[16] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image trans-
lation using cycle-consistent adversarial networks,” in 2017 IEEE Interna-
tional Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
[21] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in 3rd International Conference on Learning Representations, ICLR 2015,
San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[Online]. Available: http://arxiv.org/abs/1412.6980
[23] Y. LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” ATT
Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
[26] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao, “Coupled deep autoencoder for
single image super-resolution,” IEEE Transactions on Cybernetics, vol. 47,
no. 1, pp. 27–37, 2017.
[27] J. Chow, Z. Su, J. Wu, P. Tan, X. Mao, and Y. Wang, “Anomaly detection of
defects on concrete structures with the convolutional autoencoder,” Advanced
Engineering Informatics, vol. 45, p. 101105, 2020. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S1474034620300744
[33] E. Tabak and E. Vanden-Eijnden, “Density estimation by dual ascent of
the log-likelihood,” Communications in Mathematical Sciences - COMMUN
MATH SCI, vol. 8, 03 2010.
[43] W. Harvey, S. Naderiparizi, and F. Wood, “Conditional image generation
by conditioning variational auto-encoders,” in The Tenth International
Conference on Learning Representations, ICLR 2022, Virtual Event,
April 25-29, 2022. OpenReview.net, 2022. [Online]. Available: https:
//openreview.net/forum?id=7MV6uLzOChW
[44] C. Lai, D. Zou, and G. Lerman, “Robust vector quantized-variational
autoencoder,” CoRR, vol. abs/2202.01987, 2022. [Online]. Available:
https://arxiv.org/abs/2202.01987
[45] R. Child, “Very deep vaes generalize autoregressive models and can
outperform them on images,” in 9th International Conference on Learning
Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
OpenReview.net, 2021. [Online]. Available: https://openreview.net/forum?
id=RLRXCV6DbEJ
[46] L. Lin, X. Liu, and W. Liang, “Improving variational auto-encoder with self-
attention and mutual information for image generation,” in Proceedings of
the 3rd International Conference on Video and Image Processing, ser. ICVIP
2019. New York, NY, USA: Association for Computing Machinery, 2020,
p. 162–167. [Online]. Available: https://doi.org/10.1145/3376067.3376090
[47] S. Norouzi, D. J. Fleet, and M. Norouzi, “Exemplar vae: Linking generative
models, nearest neighbor retrieval, and data augmentation,” in Advances
in Neural Information Processing Systems, H. Larochelle, M. Ranzato,
R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc.,
2020, pp. 8753–8764. [Online]. Available: https://proceedings.neurips.cc/
paper/2020/file/63c17d596f401acb520efe4a2a7a01ee-Paper.pdf
[48] Q. Ai, L. He, S. Liu, and Z. Xu, “Bype-vae: Bayesian pseudocoresets
exemplar vae,” in Advances in Neural Information Processing Systems,
M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W.
Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 5910–
5920. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/
2e9f978b222a956ba6bdf427efbd9ab3-Paper.pdf
[49] Z. Wu, L. Cao, and L. Qi, “evae: Evolutionary variational autoencoder,”
CoRR, vol. abs/2301.00011, 2023. [Online]. Available: https://doi.org/10.
48550/arXiv.2301.00011
[50] E. Egorov, A. Kuzina, and E. Burnaev, “Boovae: Boosting approach for
continual learning of vae,” in Advances in Neural Information Processing
Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W.
Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 17 889–
17 901. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/
952285b9b7e7a1be5aa7849f32ffff05-Paper.pdf
[51] L. Jiang, B. Dai, W. Wu, and C. C. Loy, “Focal frequency loss for image
reconstruction and synthesis,” in 2021 IEEE/CVF International Conference
on Computer Vision (ICCV), 2021, pp. 13 899–13 909.
[52] K. Dohi, “Variational autoencoders for jet simulation,” 2020. [Online].
Available: https://arxiv.org/abs/2009.04842
[53] C.-T. Tu and Y.-F. Chen, “Facial image inpainting with variational autoen-
coder,” in 2019 2nd International Conference of Intelligent Robotic and Con-
trol Engineering (IRCE), 2019, pp. 119–122.
[54] D. Jiang, G. Zhang, M. Karami, X. Chen, Y. Shao, and Y. Yu, “Dp2-vae:
Differentially private pre-trained variational autoencoders,” CoRR, vol.
abs/2208.03409, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.
2208.03409
[56] H. Huang, Z. Li, R. He, Z. Sun, and T. Tan, “Introvae: Introspective variational
autoencoders for photographic image synthesis,” in Advances in Neural
Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran
Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/
paper_files/paper/2018/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
[57] T. Daniel and A. Tamar, “Soft-introvae: Analyzing and improving the intro-
spective variational autoencoder,” in Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR), June 2021, pp.
4391–4400.
[60] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image
generation using residual quantization,” in 2022 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 513–11 522.
[63] H. Shao, S. Yao, D. Sun, A. Zhang, S. Liu, D. Liu, J. Wang, and
T. Abdelzaher, “ControlVAE: Controllable variational autoencoder,” in
Proceedings of the 37th International Conference on Machine Learning, ser.
Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds.,
vol. 119. PMLR, 13–18 Jul 2020, pp. 8655–8664. [Online]. Available:
https://proceedings.mlr.press/v119/shao20b.html
[69] J. Xu, Y. Ren, H. Tang, X. Pu, X. Zhu, M. Zeng, and L. He, “Multi-vae:
Learning disentangled view-common and view-peculiar visual representa-
tions for multi-view clustering,” in 2021 IEEE/CVF International Conference
on Computer Vision (ICCV), 2021, pp. 9214–9223.
[72] O. Lang, Y. Gandelsman, M. Yarom, Y. Wald, G. Elidan, A. Hassidim, W. T.
Freeman, P. Isola, A. Globerson, M. Irani, and I. Mosseri, “Explaining in
style: Training a gan to explain a classifier in stylespace,” in 2021 IEEE/CVF
International Conference on Computer Vision (ICCV), 2021, pp. 673–682.
[74] G. Kwon and J. C. Ye, “Diagonal attention and style-based gan for
content-style disentanglement in image generation and translation,” in 2021
IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp.
13 960–13 969.
[75] Y. Lu, S. Wu, Y.-W. Tai, and C.-K. Tang, “Image generation from sketch
constraint using contextual gan,” in Proceedings of the European Conference
on Computer Vision (ECCV), September 2018.
[77] E. Bolluyt and C. Comaniciu, “Collapse resistant deep convolutional gan for
multi-object image generation,” in 2019 18th IEEE International Conference
On Machine Learning And Applications (ICMLA), 2019, pp. 1404–1408.
[80] K.-H. Liu, C.-C. Lin, and T.-J. Liu, “Image generation by residual block
based generative adversarial networks,” in 2022 IEEE International Confer-
ence on Consumer Electronics (ICCE), 2022, pp. 1–4.
[82] I. P. Durugkar, I. Gemp, and S. Mahadevan, “Generative multi-
adversarial networks,” in 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conference Track Proceedings. OpenReview.net, 2017. [Online]. Available:
https://openreview.net/forum?id=Byk-VI9eg
[86] Y. Men, Y. Mao, Y. Jiang, W.-Y. Ma, and Z. Lian, “Controllable person image
synthesis with attribute-decomposed gan,” in 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5083–5092.
[87] Q. Jin, X. Luo, Y. Shi, and K. Kita, “Image generation method based on
improved condition gan,” in 2019 6th International Conference on Systems
and Informatics (ICSAI), 2019, pp. 1290–1294.
[89] K. S. Lee, N.-T. Tran, and N.-M. Cheung, “Infomax-gan: Improved adversar-
ial image generation via information maximization and contrastive learning,”
in Proceedings of the IEEE/CVF Winter Conference on Applications of Com-
puter Vision (WACV), January 2021, pp. 3942–3952.
[90] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least
squares generative adversarial networks,” in Proceedings of the IEEE Inter-
national Conference on Computer Vision (ICCV), Oct 2017.
[91] M. Zhai, L. Chen, F. Tung, J. He, M. Nawhal, and G. Mori, “Lifelong gan:
Continual learning for conditional image generation,” in Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV), October
2019.
[92] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin, “Soft-gated
warping-gan for pose-guided person image synthesis,” in Advances in Neural
Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran
Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/
paper/2018/file/1700002963a49da13542e0726b7bb758-Paper.pdf
[93] Y. Hong, L. Niu, J. Zhang, W. Zhao, C. Fu, and L. Zhang, “F2gan:
Fusing-and-filling gan for few-shot image generation,” in Proceedings of
the 28th ACM International Conference on Multimedia, ser. MM ’20.
New York, NY, USA: Association for Computing Machinery, 2020, p.
2535–2543. [Online]. Available: https://doi.org/10.1145/3394171.3413561
[94] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Uni-
fied generative adversarial networks for multi-domain image-to-image trans-
lation,” in Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), June 2018.
[95] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta,
A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single
image super-resolution using a generative adversarial network,” in 2017
IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017,
Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017, pp.
105–114. [Online]. Available: https://doi.org/10.1109/CVPR.2017.19
[96] M. Wang, X. Zhang, K. Shi, X. Zhang, D. Lei, and X. Yang,
“Image super-resolution reconstruction algorithm based on improved gan,”
in Proceedings of the 3rd International Conference on Data Science
and Information Technology, ser. DSIT 2020. New York, NY, USA:
Association for Computing Machinery, 2020, p. 59–64. [Online]. Available:
https://doi.org/10.1145/3414274.3414487
[97] C. Wang, H. Zheng, Z. Yu, Z. Zheng, Z. Gu, and B. Zheng, “Discriminative
region proposal adversarial networks for high-quality image-to-image
translation,” in Computer Vision - ECCV 2018 - 15th European Conference,
Munich, Germany, September 8-14, 2018, Proceedings, Part I, ser. Lecture
Notes in Computer Science, V. Ferrari, M. Hebert, C. Sminchisescu,
and Y. Weiss, Eds., vol. 11205. Springer, 2018, pp. 796–812. [Online].
Available: https://doi.org/10.1007/978-3-030-01246-5_47
[98] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, “Cvae-gan: Fine-grained image
generation through asymmetric training,” in Proceedings of the IEEE Inter-
national Conference on Computer Vision (ICCV), Oct 2017.
[99] M. Gorijala and A. Dukkipati, “Image generation and editing with variational
info generative adversarial networks,” CoRR, vol. abs/1701.04568, 2017.
[Online]. Available: http://arxiv.org/abs/1701.04568
[100] Y. Zhao, B. Deng, J. Huang, H. Lu, and X.-S. Hua, “Stylized adversarial
autoencoder for image generation,” in Proceedings of the 25th ACM
International Conference on Multimedia, ser. MM ’17. New York, NY,
USA: Association for Computing Machinery, 2017, p. 244–251. [Online].
Available: https://doi.org/10.1145/3123266.3123450
[101] B. Zhang, S. Gu, B. Zhang, J. Bao, D. Chen, F. Wen, Y. Wang, and B. Guo,
“Styleswin: Transformer-based gan for high-resolution image generation,” in
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2022, pp. 11 304–11 314.
[103] S. Niu, B. Li, X. Wang, and H. Lin, “Defect image sample generation with
gan for improving defect recognition,” IEEE Transactions on Automation
Science and Engineering, vol. 17, no. 3, pp. 1611–1622, 2020.
[110] M.-Y. Liu, X. Huang, J. Yu, T.-C. Wang, and A. Mallya, “Generative adver-
sarial networks for image and video synthesis: Algorithms and applications,”
Proceedings of the IEEE, vol. 109, no. 5, pp. 839–862, 2021.
[112] M. Tahmid, S. Alam, and M. K. Akram, “Comparative analysis of generative
adversarial networks and their variants,” in 2020 23rd International Confer-
ence on Computer and Information Technology (ICCIT), 2020, pp. 1–6.
[113] H. Shibata, S. Hanaoka, Y. Cao, M. Yoshikawa, T. Takenaga, Y. Nomura,
N. Hayashi, and O. Abe, “Local differential privacy image generation using
flow-based deep generative models,” CoRR, vol. abs/2212.10688, 2022.
[Online]. Available: https://doi.org/10.48550/arXiv.2212.10688
[114] R. Liu, Y. Liu, X. Gong, X. Wang, and H. Li, “Conditional adversarial gener-
ative flow for controllable image synthesis,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[115] X. Ma, X. Kong, S. Zhang, and E. Hovy, “Macow: Masked convolutional
generative flow,” in Advances in Neural Information Processing Systems,
H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,
and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019.
[Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/
file/20c86a628232a67e7bd46f76fba7ce12-Paper.pdf
[116] A. Mukherjee, B. N. Patro, S. Sidheekh, M. Singh, and V. P. Namboodiri,
“Attentive contractive flow: Improved contractive flows with lipschitz-
constrained self-attention,” CoRR, vol. abs/2109.12135, 2021. [Online].
Available: https://arxiv.org/abs/2109.12135
[117] A. Bhattacharyya, S. Mahajan, M. Fritz, B. Schiele, and S. Roth, “Normaliz-
ing flows with multi-scale autoregressive priors,” in 2020 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8412–
8421.
[118] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel, “Flow++:
Improving flow-based generative models with variational dequantization and
architecture design,” in Proceedings of the 36th International Conference
on Machine Learning, ser. Proceedings of Machine Learning Research,
K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun
2019, pp. 2722–2730. [Online]. Available: https://proceedings.mlr.press/
v97/ho19a.html
[119] Z. Chen, Y. Luo, S. Wang, J. Li, and Z. Huang, “Gsmflow: Generation
shifts mitigating flow for generalized zero-shot learning,” CoRR, vol.
abs/2207.01798, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.
2207.01798
[120] M. Hajij, G. Zamzmi, R. Paul, and L. Thukar, “Normalizing flow for syn-
thetic medical images generation,” in 2022 IEEE Healthcare Innovations and
Point of Care Technologies (HI-POCT), 2022, pp. 46–49.
[121] P. Pope, Y. Balaji, and S. Feizi, “Adversarial robustness of flow-based
generative models,” in Proceedings of the Twenty Third International
Conference on Artificial Intelligence and Statistics, ser. Proceedings of
Machine Learning Research, S. Chiappa and R. Calandra, Eds., vol.
108. PMLR, 26–28 Aug 2020, pp. 3795–3805. [Online]. Available:
https://proceedings.mlr.press/v108/pope20a.html
[122] G. G. P. F. Pires and M. A. T. Figueiredo, “Variational mixture of
normalizing flows,” in 28th European Symposium on Artificial Neural
Networks, Computational Intelligence and Machine Learning, ESANN 2020,
Bruges, Belgium, October 2-4, 2020, 2020, pp. 205–210. [Online]. Available:
https://www.esann.org/sites/default/files/proceedings/2020/ES2020-188.pdf
[123] M. Tailanián, Á. Pardo, and P. Musé, “U-flow: A u-shaped normalizing
flow for anomaly detection with unsupervised threshold,” CoRR, vol.
abs/2211.12353, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.
2211.12353
[124] S. Dong, G. Hangel, E. Z. Chen, S. Sun, W. Bogner, G. Widhalm, C. You,
J. A. Onofrey, R. de Graaf, and J. S. Duncan, “Flow-based visual quality
enhancer for super-resolution magnetic resonance spectroscopic imaging,”
in Deep Generative Models, A. Mukhopadhyay, I. Oksuz, S. Engelhardt,
D. Zhu, and Y. Yuan, Eds. Cham: Springer Nature Switzerland, 2022,
pp. 3–13.
[125] X. Han, X. Hu, W. Huang, and M. R. Scott, “Clothflow: A flow-based model
for clothed person generation,” in Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), October 2019.
[126] T. Wang, X. Gu, and J. Zhu, “A flow-based generative network for photo-
realistic virtual try-on,” IEEE Access, vol. 10, pp. 40 899–40 909, 2022.
[127] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
convolutional neural networks,” in Proceedings of the 36th International
Conference on Machine Learning ICML 2019, 9-15 June 2019, Long
Beach, California, USA, ser. Proceedings of Machine Learning Research,
K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 6105–
6114. [Online]. Available: http://proceedings.mlr.press/v97/tan19a.html
[128] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley,
“Least squares generative adversarial networks,” in IEEE International
Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29,
2017. IEEE Computer Society, 2017, pp. 2813–2821. [Online]. Available:
https://doi.org/10.1109/ICCV.2017.304
[129] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila,
“Analyzing and improving the image quality of stylegan,” in 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR
2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation
/ IEEE, 2020, pp. 8107–8116. [Online]. Available: https://openaccess.
thecvf.com/content_CVPR_2020/html/Karras_Analyzing_and_Improving_
the_Image_Quality_of_StyleGAN_CVPR_2020_paper.html
[130] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and
T. Aila, “Alias-free generative adversarial networks,” in Advances in Neural
Information Processing Systems 34: Annual Conference on Neural Informa-
tion Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual,
M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan,
Eds., 2021, pp. 852–863. [Online]. Available: https://proceedings.neurips.
cc/paper/2021/hash/076ccd93ad68be51f23707988e934906-Abstract.html
[131] J. Prost, A. Houdard, N. Papadakis, and A. Almansa, “Diverse
super-resolution with pretrained deep hierarchical vaes,” CoRR, vol.
abs/2205.10347, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.
2205.10347
[132] J. Wang, W. Zhou, G.-J. Qi, Z. Fu, Q. Tian, and H. Li, “Transforma-
tion gan for unsupervised image synthesis and representation learning,” in
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2020, pp. 469–478.
A Appendix 1: Extension on GAN and Flow-based Models
A.1 Generative Adversarial Networks
In this appendix subsection, alternative GAN loss functions and variant approaches
are presented.
A.1.1 Losses
Non-Saturating GAN Loss
In the GAN introduction paper [9], an adjusted generator loss is suggested to mitigate
the gradient saturation problem. Gradient saturation appears when the gradient values
become very small, resulting in very small update steps and thus slow learning. The
adjusted non-saturating GAN loss (Equation 20) provides much stronger gradients early
in training, leading to better training and faster convergence in practice. Instead of
the generator minimising the log of the inverted discriminator output on generated
images, the non-saturating loss maximises the log of the discriminator output on
generated images.
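To make the difference concrete, a minimal PyTorch-style sketch of both generator losses is given below; d_fake_logits is assumed to hold the discriminator's raw outputs on generated images, and the sketch is illustrative rather than the implementation used in this thesis.

import torch
import torch.nn.functional as F

def generator_loss_saturating(d_fake_logits):
    # Original minimax generator loss: minimise log(1 - D(G(z))),
    # which saturates when the discriminator easily rejects the fakes.
    return torch.log(1.0 - torch.sigmoid(d_fake_logits) + 1e-8).mean()

def generator_loss_non_saturating(d_fake_logits):
    # Non-saturating loss: maximise log D(G(z)), implemented here as
    # minimising the binary cross-entropy against the "real" label.
    return F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))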
A.1.2 Variants
Conditional GAN
The conditional GAN [14], or also named cGAN, is a GAN variant that extends
the base architecture with the capability of conditional properties and thus an extra
input. The base GAN architecture does not have any conditional generation, while
this variant does. In the model, both the generator and the discriminator get ex-
tended with an extra conditional input y on top of the data input x. The conditional
A
GAN has proven to work very well for conditional image generation, where condi-
tional properties such as classes or descriptions can be given to the generator.
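A minimal sketch of how the extra conditional input can be fed to the generator is shown below; the concatenation of a noise vector with a one-hot class vector and all layer sizes are assumptions made for illustration only.

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    # Toy cGAN generator: the noise vector z and the condition y are
    # concatenated into a single input vector.
    def __init__(self, z_dim=100, y_dim=10, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + y_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, y):
        return self.net(torch.cat([z, y], dim=1))

g = ConditionalGenerator()
fake = g(torch.randn(8, 100), torch.eye(10)[torch.randint(0, 10, (8,))])

The discriminator can be conditioned in the same way, by concatenating y with its data input.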
StyleGAN
StyleGAN [15] is a GAN variant proposed by researchers at Nvidia, in which the GAN
has an improved generator architecture based on the concept of style transfer. The
generator network starts with rather small images and gradually increases their size
throughout the style blocks of the architecture. The StyleGAN approach leads
to high-resolution and realistic synthesis, better separation of features, and scale-
specific control of the synthesis. There have also been updated versions of the
variant: StyleGAN-2 [129] improves the original StyleGAN architecture, leading to
better synthesis results and faster training, and StyleGAN-3 [130] solves the texture
sticking problem occurring in the previous variants by introducing strict low-pass
filters.
CycleGAN
CycleGAN [16] is a GAN variant that addresses the problem of paired data collection
with an approach that can translate between two independent domains. It does this by
training two GANs, one for each domain (so two generators and two discriminators), and
by introducing a cycle-consistency loss that uses both generators. The CycleGAN has
proven to work well for applications in image synthesis, enhancement, and style
transfer.
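A minimal sketch of the cycle-consistency term is given below; g_ab and g_ba stand for the two generators (domain A to B and B to A), and the loss weight of 10 is only an assumed example value.

import torch.nn.functional as F

def cycle_consistency_loss(g_ab, g_ba, real_a, real_b, lam=10.0):
    # Translate A -> B -> A and B -> A -> B, then penalise the L1
    # difference between each reconstruction and its original.
    rec_a = g_ba(g_ab(real_a))
    rec_b = g_ab(g_ba(real_b))
    return lam * (F.l1_loss(rec_a, real_a) + F.l1_loss(rec_b, real_b))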
A.2 Flow-based Models
A.2.1 Variants
NICE
NICE, short for Non-linear Independent Component Estimation, was one of the first
flow-based models, introduced by Dinh et al. [10]. The NICE framework is a flow-
based model of the coupling-flow type. It uses a special kind of coupling layer, the
additive coupling layer, a simple case of the affine coupling layer without scaling.
The coupling layer used in NICE is a function that takes one subset of the input and
applies an additive transformation to it, while leaving the remaining subset of the
input as is, without any applied transformation. By transforming one subset and
leaving the other subset unchanged, the layer becomes bijective and can therefore be
inverted, while the transformation itself is parameterised by a neural network. By
sequencing these bijective coupling transformations, a more complex distribution can
be approximated.
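A minimal sketch of such an additive coupling layer is given below; the split into two equal halves and the small MLP used as the shift function are illustrative choices, not the exact NICE configuration.

import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    # NICE-style additive coupling: keep x1 unchanged and shift x2 by a
    # learned function m(x1); the inverse simply subtracts m(y1) again.
    def __init__(self, dim, hidden=128):
        super().__init__()
        half = dim // 2  # assumes an even input dimension
        self.m = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        return torch.cat([x1, x2 + self.m(x1)], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        return torch.cat([y1, y2 - self.m(y1)], dim=1)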
RealNVP
RealNVP, short for Real-valued Non-Volume Preserving, introduced by Dinh et al.
[17], is a flow-based model that builds on the NICE flow model [10]. RealNVP
introduces a scaling parameter in the coupling layer that NICE presents, and this
scaling parameter is learned during the training of the model. Adding the scaling to
the additive coupling layer of NICE improves the density estimation, and it is also
this parameter that makes RealNVP non-volume preserving.
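A minimal sketch of the affine coupling with a learned scale is given below; compared with the additive layer above, the second half is multiplied by exp(s(x1)) before the shift is added, and the log-determinant is no longer zero. The network sizes are again illustrative assumptions.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # RealNVP-style affine coupling: x2 is scaled and shifted by functions
    # of x1; the log-determinant is the sum of the log-scales.
    def __init__(self, dim, hidden=128):
        super().__init__()
        half = dim // 2
        self.s = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, half), nn.Tanh())
        self.t = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_scale = self.s(x1)
        y2 = x2 * torch.exp(log_scale) + self.t(x1)
        log_det = log_scale.sum(dim=1)
        return torch.cat([x1, y2], dim=1), log_det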
GLOW
GLOW, introduced by D. P. Kingma and P. Dhariwal [18], is another flow-based
generative model, built further on RealNVP [17]. The GLOW model implements
a different architecture consisting of three components. The first component
of the GLOW model is the actnorm layer, a scale-and-bias layer that acts similarly
to batch normalization. The second component is the invertible 1 × 1 convolution,
which is the main novel component introduced and also significantly improves the
learning. The third and final component is an affine coupling layer. Using this
architecture, the GLOW model improved on the state-of-the-art benchmarks
achieved by RealNVP on various data sets.
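As an illustration of the novel component, a minimal sketch of the invertible 1 × 1 convolution is given below; the random-rotation initialisation and the direct use of a dense weight matrix are simplifications made for the example.

import torch
import torch.nn as nn

class Invertible1x1Conv(nn.Module):
    # GLOW-style invertible 1x1 convolution: every pixel's channel vector
    # is multiplied by the same learned square matrix W.
    def __init__(self, channels):
        super().__init__()
        # Initialise W as a random rotation so it starts out invertible.
        w, _ = torch.linalg.qr(torch.randn(channels, channels))
        self.w = nn.Parameter(w)

    def forward(self, x):
        # x has shape (batch, channels, height, width).
        _, _, h, wdt = x.shape
        y = torch.einsum("ij,bjhw->bihw", self.w, x)
        log_det = h * wdt * torch.slogdet(self.w)[1]
        return y, log_det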
B Appendix 2: Extension on VAE reconstructions
In this appendix section, more training reconstructions are given per final trained
model and per data set. For every model, 15 reconstructions are given.
B.1 1K data set Reconstructions
B.1.1 256D Vanilla VAE
Figure 2.37: Training reconstructions from vanilla model with 256 latent dimen-
sions using the 1K data set.
B.1.2 512D Vanilla VAE
Figure 2.38: Training reconstructions from vanilla model with 512 latent dimen-
sions using the 1K data set.
B.1.3 512D Variant VAE
Figure 2.39: Training reconstructions from variant model with 512 latent dimen-
sions using the 1K data set.
B.2 100K data set Reconstructions
B.2.1 1024D Vanilla VAE
Figure 2.40: Training reconstructions from vanilla model with 1024 latent dimen-
sions using the 100K data set.
B.2.2 1024D Variant VAE
Figure 2.41: Training reconstructions from variant model with 1024 latent dimen-
sions using the 100K data set.
C Appendix 3: Extension on VAE sampled triplets
In this appendix section, more sampled triplets are given, generated from the vanilla
and variant VAE with 1024 latent dimensions using the 100K data set.
C.1 Vanilla VAE
Figure 3.42: Triplets sampled with selected optimal parameters from vanilla model
with 1024 latent dimensions using the 100K data set.
Figure 3.43: Triplets sampled with selected optimal parameters from vanilla model
with 1024 latent dimensions using the 100K data set.
Figure 3.44: Triplets sampled with selected optimal parameters from vanilla model
with 1024 latent dimensions using the 100K data set.
C.2 Variant VAE
Figure 3.45: Triplets sampled with selected optimal parameters from variant model
with 1024 latent dimensions using the 100K data set.
Figure 3.46: Triplets sampled with selected optimal parameters from variant model
with 1024 latent dimensions using the 100K data set.
Figure 3.47: Triplets sampled with selected optimal parameters from variant model
with 1024 latent dimensions using the 100K data set.
D Appendix 4: Source code for the models
The source code for the models and algorithms has been published on GitHub:
https://github.com/TiboBruneel/Thesis-VAE-Models.
E Appendix 5: Literature Review Data Extraction
VAE Data Extraction Studies
Ref | Source and date of extraction | Search term | Type | Data sets
[52] | Google Scholar, 6 Feb. 2023 | "Assessment of variational autoencoders image generation" | VAE | 1 (Jet data set)
[61] | Google Scholar, 6 Feb. 2023 | "Applications of VAE Image generation" | VAE, GAN | 1 (Probe Data)
[62] | IEEE Explore, 8 Feb. 2023 | "VAE Image generation" | VAE, GAN | 3 (Street View House Numbers, CIFAR-10, Chest X-Ray)
[66] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE, GAN | 1 (dSprite)
[65] | NeurIPS, 7 Feb. 2023 | "VAE" | VAE | 2 (DSprites, Shapes3D)
[69] | ICCV, 7 Feb. 2023 | "VAE" | VAE | 4 (Multi-MNIST, Multi-Fashion, Multi-COIL-10, Multi-COIL-20)
[68] | Google Scholar, 6 Feb. 2023 | "Applications of VAE Image generation" | VAE | 3 (MNIST, Galaxy Zoo, EM data sets)
[64] | arXiv, 6 Feb. 2023 | "VAE Variant" | VAE | 2 (DSprites, MNIST)
[31] | arXiv, 6 Feb. 2023 | "VAE Generation" | VAE | 3 (Spatial blobs, DSprites, 3DChairs)
[55] | Google Scholar, 6 Feb. 2023 | "Assessment of variational autoencoders image generation" | VAE | 1 (MNIST)
[53] | IEEE Explore, 8 Feb. 2023 | "VAE Image generation" | VAE | 1 (AR data set)
[46] | ACM Digital Library, 7 Feb. 2023 | "Assessment of variational autoencoders image generation" | VAE | 2 (CelebA, MNIST)
[49] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE | 1 (CelebA)
[59] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE | 3 (CelebA-HQ-128, CIFAR-10, CelebA-64)
[54] | arXiv, 7 Feb. 2023 | "VAE Variant" | VAE | 1 (MNIST)
[50] | NeurIPS, 7 Feb. 2023 | "VAE" | VAE | 4 (MNIST, Fashion-MNIST, NotMNIST, CelebA)
[67] | NeurIPS, 7 Feb. 2023 | "VAE" | VAE | 1 (MNIST)
[51] | ICCV, 7 Feb. 2023 | "VAE" | VAE | 7 (Describable Textures data set, CelebA, CelebA-HQ, CMP Facades, shoes, Cityscapes, ADE20K)
[60] | CVPR, 7 Feb. 2023 | "VAE" | VAE | 2 (LSUN, FF-HQ)
[43] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE | 3 (CIFAR-10, ImageNet-64, FFHQ-256)
[48] | NeurIPS, 7 Feb. 2023 | "VAE" | VAE | 4 (MNIST, Fashion-MNIST, CIFAR-10, CelebA)
[47] | NeurIPS, 7 Feb. 2023 | "VAE" | VAE | 4 (MNIST, Fashion-MNIST, Omniglot, CelebA)
[131] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE | 1 (FFHQ-256)
[56] | arXiv, 7 Feb. 2023 | "Variational autoencoder image generation" | VAE | 3 (Celeb-A, CelebA-HQ, LSUN BEDROOM)
[57] | Google Scholar, 7 Feb. 2023 | "Assessment of variational autoencoders image generation" | VAE | 3 (CIFAR-10, CelebA-HQ, FFHQ)
[58] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE | 4 (MNIST, CIFAR-10, CelebA-128, CelebA-256)
[12] | Google Scholar, 6 Feb. 2023 | "Applications of VAE Image generation" | VAE | 2 (ImageNet, FFHQ)
[44] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE | 2 (FaceMask, RoomCrop)
[63] | arXiv, 7 Feb. 2023 | "VAE Variant" | VAE | 2 (2D Shapes, CelebA)
[70] | arXiv, 7 Feb. 2023 | "VAE Image generation" | VAE | 1 (MNIST)
GAN Data Extraction Studies
Ref | Source and date of extraction | Search term | Type | Data sets
[98] | Google Scholar, 13 Feb. 2023 | "GAN Image Generation" | GAN, VAE | 3 (FaceScrub, 102 Category Flower, CUB-200)
[102] | Google Scholar, 13 Feb. 2023 | "GAN Image Generation" | GAN | 1 (BRATS 2016)
[75] | Google Scholar, 13 Feb. 2023 | "GAN Image Generation" | GAN | 3 (Large-scale CelebFaces Attributes (CelebA), Caltech-UCSD Birds-200-2011, Stanford's Cars)
[101] | Google Scholar, 13 Feb. 2023 | "GAN Image Generation" | GAN, Transformer | 3 (CelebA-HQ, LSUN Church, FFHQ)
[91] | Google Scholar, 15 Feb. 2023 | "GAN Image Generation" | GAN | Inclusion
[78] | Google Scholar, 15 Feb. 2023 | "GAN Image Generation" | GAN | 3 (MNIST, CIFAR-10, CUB-200)
[93] | Google Scholar, 15 Feb. 2023 | "GAN Image Generation" | GAN | 5 (Omniglot, EMNIST, VGGFace, Flowers, Animal Faces)
[89] | Google Scholar, 15 Feb. 2023 | "GAN Image Generation" | GAN | 5 (ImageNet, CelebA, CIFAR-10, STL-10, CIFAR-100)
[109] | Google Scholar, 15 Feb. 2023 | "GAN Image Generation" | GAN | 1 (CIFAR-10)
[87] | Google Scholar, 15 Feb. 2023 | "GAN Image Generation" | GAN | 2 (MNIST, Fashion-MNIST)
[103] | Google Scholar, 15 Feb. 2023 | "GAN Image Generation" | GAN | 1 (Commutator cylinder surface defect imagery)
[110] | Google Scholar, 15 Feb. 2023 | "Generative Adversarial Networks Applications" | GAN | Inclusion
[76] | Google Scholar, 15 Feb. 2023 | "Generative Adversarial Networks" | GAN | 3 (Large-scale Scene Understanding, Imagenet-1k, newly assembled Faces)
[15] | Google Scholar, 15 Feb. 2023 | "Generative adversarial networks image generation" | GAN | 1 (FFHQ)
[90] | Google Scholar, 15 Feb. 2023 | "Generative adversarial networks image generation" | GAN | 1 (CIFAR-10)
[79] | Google Scholar, 15 Feb. 2023 | "Generative adversarial networks image generation" | GAN | 1 (LSVRC2012 (ImageNet))
[94] | Google Scholar, 15 Feb. 2023 | Inclusion | GAN | 2 (CelebA, RaFD)
[71] | arXiv, 15 Feb. 2023 | "GAN Image Generation" | GAN | 1 (FFHQ)
[104] | arXiv, 15 Feb. 2023 | "GAN Image Generation" | GAN | 2 (MCTEC-AD, PCAM, OCT)
[99] | arXiv, 15 Feb. 2023 | "GAN Image Generation" | GAN | 3 (Labeled Faces in the Wild (LFW), celebA and a modified version of MNIST data sets)
[82] | arXiv, 15 Feb. 2023 | "GAN Image Generation" | GAN | 3 (MNIST, CIFAR-10, CelebA)
[81] | arXiv, 15 Feb. 2023 | "GAN Image Generation" | GAN | 3 (MNIST, SVHN, CIFAR-10)
[107] | arXiv, 15 Feb. 2023 | "GAN Image Generation" | GAN | 1 (PolyU Fingerprint Databases)
[111] | arXiv, 15 Feb. 2023 | "Survey on GAN Image generation" | GAN | Inclusion
[95] | arXiv, 15 Feb. 2023 | "Generative adversarial networks image generation" | GAN | 3 (Set5, Set14, BSD100)
[105] | ACM Digital Library, 15 Feb. 2023 | "GAN Image Generation" | GAN | 1 (CelebA)
[100] | ACM Digital Library, 15 Feb. 2023 | "GAN Image Generation" | GAN | 4 (MNIST, Labeled Faces in the Wild (LFW), IIIT-5K, Plate)
[96] | ACM Digital Library, 15 Feb. 2023 | "GAN Image Generation" | GAN | 3 (Set5, Set14, BSD100)
[106] | ACM Digital Library, 15 Feb. 2023 | "Generative adversarial networks image generation" | GAN | 1 (CelebA)
[112] | IEEE Explore, 15 Feb. 2023 | "GAN Variants" | GAN | 1 (CIFAR-10)
[84] | IEEE Explore, 15 Feb. 2023 | "GAN Image Generation" | GAN | 3 (church_outdoor from LSUN, CelebA-HQ, Car)
[77] | IEEE Explore, 15 Feb. 2023 | "GAN Image Generation" | GAN | 1 (Cityscapes)
[80] | IEEE Explore, 15 Feb. 2023 | "Generative adversarial networks image generation" | GAN | 3 (Anime face, LSUN bedroom, CelebA)
[72] | ICCV, 16 Feb. 2023 | "GAN" | GAN | 5 (AFHQ, FFHQ, Plant-Village, Retinal Fundus, CUB-2011)
[73] | ICCV, 16 Feb. 2023 | "GAN" | GAN | 6 (FFHQ, CelebA-HQ, Stanford Cars, LSUN Horse, LSUN Church, AFHQ Wild)
[108] | ICCV, 16 Feb. 2023 | "GAN" | GAN | 3 (Compcars, KITTI, Cityscapes)
[74] | ICCV, 16 Feb. 2023 | "GAN" | GAN | 2 (CelebA-HQ, FFHQ)
[85] | CVPR, 16 Feb. 2023 | "GAN" | GAN | 3 (CelebA, Cats, CARLA)
[132] | CVPR, 16 Feb. 2023 | "GAN" | GAN | 5 (CIFAR-10, ImageNet, CELEBA-HQ, LSUN-BEDROOM, CIFAR-10)
[86] | CVPR, 16 Feb. 2023 | "GAN" | GAN | 1 (In-shop Clothes Retrieval Benchmark DeepFashion)
[83] | NeurIPS, 16 Feb. 2023 | "GAN" | GAN | 2 (FFHQ, AAHQ)
[92] | NeurIPS, 16 Feb. 2023 | "GAN" | GAN | 2 (DeepFashion, Market-1501)
[97] | International Journal of Computer Vision, 16 Feb. 2023 | "GAN" | GAN | 4 (ImageNet, CityScapes, Facades, Aerial data)
Flow-based Model Data Extraction Studies
Ref | Source and date of extraction | Search term | Type | Data sets
[133] | Google Scholar, 1 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 1 (The Alzheimer's Disease Neuroimaging Initiative (ADNI))
[125] | Google Scholar, 1 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 2 (DeepFashion, VITON)
[121] | Google Scholar, 1 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 3 (CIFAR-10, LSUN Bedroom, CelebA)
[118] | Google Scholar, 1 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 4 (CIFAR10, 32x32 ImageNet, 64x64 ImageNet, 5-bit CelebA)
[114] | Google Scholar, 1 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 2 (MNIST, CelebA)
[18] | Google Scholar, 1 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 4 (CIFAR, LSUN, ImageNet, CelebA)
[115] | Google Scholar, 1 Mar. 2023 | "Flow-based Generative Model Image Variant" | Flow-Based Model | 4 (CIFAR-10, ImageNet-64, CelebA-HQ, LSUN)
[113] | arXiv, 3 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 1 (RSNA Pneumonia Detection Challenge data set)
[123] | arXiv, 3 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 1 (MvTec-AD)
[124] | arXiv, 3 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 1 (H-MRSI)
[119] | arXiv, 3 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 4 (Caltech-UCSD Birds-200-2011, Oxford Flowers, Attribute Pascal and Yahoo, Animals with Attributes 2)
[116] | arXiv, 3 Mar. 2023 | "Flow-based Generative Model Variant" | Flow-Based Model | 4 (MNIST, CIFAR-10, ImageNet-32, ImageNet-64)
[122] | arXiv, 3 Mar. 2023 | "Flow-based Generative Model Variant" | Flow-Based Model | 3 (PinWheel, Two-Circle, MNIST)
[120] | IEEE Explore, 3 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 2 (Chest x-ray, Skin cancer)
[117] | IEEE Explore, 3 Mar. 2023 | "Flow-based Generative Model Image Generation" | Flow-Based Model | 3 (MNIST, CIFAR-10, ImageNet)
[126] | IEEE Explore, 3 Mar. 2023 | "Flow-based Generative Model Network" | Flow-Based Model | 1 (VITON)