Bachelor Thesis
Pablo Sánchez Martín
Supervised by Pablo Martínez Olmos
Leganés, June 2018
ABSTRACT

Generative models have been one of the major research fields in unsupervised deep learning in recent years. They are achieving promising results in learning the distribution of multidimensional variables as well as in finding meaningful hidden representations in data.

The aim of this thesis is to gain a sound understanding of generative models through a profound study of one of the most promising and widely used families of generative models: variational autoencoders. In particular, the performance of the standard variational autoencoder (VAE) and the Gaussian Mixture variational autoencoder (GMVAE) is assessed. First, the mathematical and probabilistic basis of both models is presented. Then, the models are implemented in Python using the TensorFlow framework. The source code is freely available and documented in a personal GitHub repository created for this thesis. Later, the performance of the implemented models is appraised in terms of generative capabilities and interpretability of the hidden representation of the inputs. Two real datasets are used in the experiments, MNIST and "Frey faces". Results show that the implemented models work correctly and that the GMVAE outperforms the standard VAE, as expected.
Keywords: deep learning, generative model, variational autoencoder, Monte Carlo
simulation, latent variable, KL divergence, ELBO.
ACKNOWLEDGEMENTS
To start with, I would like to thank my family for being completely supportive throughout the time I have been conducting this thesis and the degree. I would especially like to thank my parents for helping me on a daily basis, in both good and bad times, for teaching me important principles such as perseverance and, basically, for everything they do.

I would also like to thank my university colleagues, with whom I have shared unforgettable memories over the last years.

Last but not least, I would like to thank my supervisor, Pablo M. Olmos. I am extremely grateful to him for the time he has invested in teaching me everything this thesis required, for resolving any issue that arose and for being such a good mentor.
Thank you all for your trust and support.
CONTENTS

1. INTRODUCTION
1.1. Statement of Purpose
1.2. Requirements
1.2.1. Datasets
1.2.2. Programming Framework
1.2.3. GPU Power
1.3. Regulatory Framework
1.3.1. Legislation
1.4. Work Plan
1.5. Organization
2. BACKGROUND
2.1. Overview of Generative Models
2.1.1. Explicit Density Models
2.1.2. Implicit Density Models
2.2. Probabilistic PCA
2.3. Neural Networks
2.3.1. Convolutional Neural Networks
3. VARIATIONAL AUTOENCODER
3.1. Overview
3.2. Objective
3.2.1. Dealing with the Integral over z
3.2.2. Inference Network: qϕ(z|x)
3.2.3. KL Divergence
3.2.4. Core Equation
3.2.5. ELBO Optimization
3.2.6. Reparameterization Trick
3.3. Sampling
4. GAUSSIAN MIXTURE VARIATIONAL AUTOENCODER
4.1. Introduction
4.2. Overview
4.3. Defining z directly as a GMM
4.4. Optimization Function
4.4.1. Reconstruction Term
4.4.2. Conditional prior term
4.4.3. W-prior term
4.4.4. Y-prior term
4.4.5. ELBO Equation
4.5. Reparameterization Trick
4.6. Sampling
5. METHODOLOGY
5.1. Objective
5.2. Prepare Data
5.3. Model Architecture
5.4. Definition of hyperparameters
5.5. Experimental Setup
5.6. GMVAE numerical stability
5.6.1. Approximation of pβ(yj = 1|w, z)
5.6.2. Output restrictions
5.7. Motivate clustering behaviour
5.8. Metrics
5.9. Visualization of results
6. RESULTS
6.1. Initial considerations
6.2. Experiment 1: VAE
6.3. Experiment 2: Understanding the latent variables of GMVAE
6.4. Experiment 3: Comparison between GMVAE and VAE
6.5. Experiment 4: Performance evaluation with corrupted input
6.6. Experiment 5: Convolutional architecture
7. SOCIO-ECONOMIC ENVIRONMENT
7.1. Budget
7.2. Practical Applications
8. CONCLUSIONS
8.1. Conclusions
8.2. Future work
A. VAE LOSS RESULTS
B. GENERATED SAMPLES
C. KL DIVERGENCE OF GMM
D. GMVAE OBJECTIVE FUNCTION DERIVATIONS
D.1. Conditional prior term
D.2. W-prior term
D.3. Y-prior term
BIBLIOGRAPHY
LIST OF FIGURES

4.1 Graphical models for the GMVAE showing the generative model (left) and the inference model (right).
6.12 Example of results using VAE convolutional architecture on MNIST. Models (a) and (b) with dim z = 2 and models (c) and (d) with dim z = 10.
6.13 Example of results using GMVAE's convolutional architecture on the FREY dataset.
1. INTRODUCTION
This chapter presents the project in brief. It contains the motivation and objectives of
this thesis, the regulatory framework related to the project, the requirements, the activities
undertaken and the overall structure of the document.
1.1. Statement of Purpose

Neural networks have existed since the definition of the Perceptron in 1958 [1]. However, they did not become popular until the appearance of backpropagation in the late 1980s [2]. Recent advances in distributed computing and specialized hardware, together with the exponential growth in the amount of data created, have made it possible to train extremely complex neural networks with large numbers of layers and parameters. The discipline that studies this type of network is known as Deep Learning, and it is achieving state-of-the-art results in many difficult problems such as speech recognition [3], object detection [4], text generation [5] and, recently, data generation.
A generative model is a type of unsupervised learning model that aims to learn any kind of data distribution, and it has the potential to unveil the underlying structure of the data. Generative models have recently emerged as a revolution in deep learning, with applications in a wide variety of fields such as medicine, art, gaming or self-driving cars. Deep learning, and especially generative models, is evolving so fast that it is difficult to keep up with the latest developments.
For these reasons, the purpose of this work is to serve as a sound introduction to deep learning generative models. To achieve this goal, the variational autoencoder model (VAE), one of the most widely used generative models, is studied in detail and compared to one of its variations: the Gaussian Mixture VAE (GMVAE). The implementation of the GMVAE is based on [6], with some modifications proposed in this work. Thus, the main objective of this bachelor thesis can be expanded as follows:

1. Study in detail the mathematical and probabilistic basis of variational autoencoders.
As will be seen further on, the GMVAE describes a two-level hierarchical latent space that aims to capture the multimodal nature of the input data. For this reason, the hypothesis of this thesis is that the GMVAE model improves on the generative and clustering capabilities of the standard VAE. Experiments will be conducted to test this hypothesis.
1.2. Requirements
The first part of this thesis is purely theoretical. In the second part, the following tools
are required for the implementation of the models: datasets, programming framework and
GPU power.
1.2.1. Datasets
The datasets used in this thesis are the MNIST and "Frey faces" (or FREY) datasets, both publicly available. They consist of images whose dimension Dx is the number of pixels:
MNIST Dataset
The MNIST database [7] is a labeled collection of gray-scale handwritten digits distributed through Yann LeCun's website. Each image consists of 28x28 size-normalized pixels. The MNIST database is often used to assess the performance of algorithms in research. Functions are available to prepare the dataset, so there is no need to write preprocessing or formatting code.
FREY Dataset
The "Fray faces" or FREY dataset [8] consists of 1965 images with 20x28 pixels of Bren-
dan’s face taken from sequential frames of a video. Although this is an unlabeled dataset,
Brendan appears varying the expressions of his face thus showing different emotions, such
as happiness, anger or sorrow, that can be associated to different classes. Several samples
are shown in figure 1.2.
1.2.2. Programming Framework

Python was the programming language chosen to implement the models and run the experiments. Python makes it extremely easy to inspect variables and outputs, to debug code and to visualize results. Moreover, great tools and software packages for deep learning are available in this language, such as Keras, Theano, TensorFlow or Caffe. In particular, TensorFlow has been used in this thesis.

TensorFlow is an open-source software library originally developed by Google Brain for internal use. It was released under the Apache 2.0 open-source license in November 2015. It provides fine-grained control over the implementation and training of neural networks. Moreover, it comes with a powerful suite of visualization tools known as TensorBoard that allows visualizing TensorFlow graphs, plotting the values of tensors during training, visualizing embeddings (using the t-SNE and PCA algorithms) and much more.

For all these reasons, TensorFlow was chosen to develop this project.
1.2.3. GPU Power

To speed up the training process, the NVidia GeForce GTX 730 graphics card of a personal computer was used to train the models. To utilize the graphics card, the GPU implementation of TensorFlow was used. This led to a great reduction in training time compared to the CPU TensorFlow operation mode. However, as the complexity of the models increased, the simulations needed more computational power to be completed in a reasonable amount of time. Thus, they were launched on the servers of the Signal Processing Group.
1.3. Regulatory Framework

This section discusses the legislation applicable to the realization of this bachelor thesis, namely the development of generative models and the use of personal data. The protection of intellectual property is also presented.
1.3.1. Legislation
Data is an essential tool for training any kind of machine or deep learning algorithm, and its quality can condition the resulting model. In some cases the required data is freely available and completely anonymous, for example the MNIST dataset. However, in many other cases the data needed is obtained from users, for example data regarding location, gender, age or personal interests, and therefore it is necessary to take into account aspects such as privacy, security and ethical responsibility.
On 25 May 2018, the General Data Protection Regulation (GDPR) [9] came into force in the European Union; it aims to protect individuals' personal information across the European Union. This new regulation applies to businesses that use personal data in any way. In short, the GDPR expands the accountability requirements for using personal data, substantially increases the penalties for organizations that do not comply with the regulation and enables individuals to better control their personal data. The following are the key changes of the GDPR, but notice that the full document contains 11 chapters and 99 articles:
• The individual must clearly consent to the use of their personal data by means of "any freely given, specific, informed and unambiguous indication", and this consent must be demonstrable.
• More specific definition of personal data: any direct or indirect information relating to an identified or identifiable natural person.
• Data subjects' rights: breach notification, right to be forgotten, access to the data collected, data portability, privacy by design and data protection officers.
1.4. Work Plan
All the activities carried out during this bachelor thesis can be grouped into six blocks, as shown in Table 1.1.
1.5. Organization
In addition to the introductory chapter, this work contains the following chapters:
• Chapter 4, Gaussian Mixture VAE, explains in detail the GMVAE model, describing
its architecture and the loss function used for training.
• Chapter 8, Conclusions, discusses the results obtained and states the conclusions of
the thesis.
• Appendix A, VAE loss results, contains a table with all the loss values of the con-
figurations tested.
2. BACKGROUND
In this section, the theoretical concepts required to understand the VAE are explained. First, a review of the current state of generative models is presented. Then, probabilistic PCA is explained, as this algorithm can be understood as a simplified form of the VAE; thus, it serves as an introduction to the model under study. Later, a brief introduction to neural networks is given. Finally, a summary of convolutional neural networks, which are very efficient when working with images, is also presented.
2.1. Overview of Generative Models

Generative models are classified as unsupervised learning. In very general terms, a generative model is one that, given a number of samples (training set) drawn from a distribution p(x), is capable of learning an estimate pg(x) of that distribution.

There are several estimation methods on which generative models are based. However, to simplify their classification [11], only generative models based on maximum likelihood will be considered.
The idea behind the maximum likelihood framework lies in modeling the distribution of the data, known as the likelihood, through some parameters θ: pθ(x). Then the parameters that maximize the likelihood are selected.

$$\theta^* = \arg\max_\theta \prod_{i=1}^N p_\theta(x^{(i)}) \tag{2.1}$$

In practice, it is typical to maximize the log-likelihood $\log p_\theta(x^{(i)})$, as it is less likely to suffer numerical instability when implemented on a computer.

$$\theta^* = \arg\max_\theta \sum_{i=1}^N \log p_\theta(x^{(i)}) \tag{2.2}$$
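As a toy illustration of equation 2.2 (a sketch added here, not taken from the thesis), the following numpy snippet evaluates the summed log-likelihood of a unit-variance Gaussian over a grid of candidate parameters θ and recovers the sample mean as the maximizer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=1000)   # samples from the "true" distribution

thetas = np.linspace(0.0, 6.0, 601)             # candidate values of theta
log_lik = np.array([np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - t) ** 2)
                    for t in thetas])           # summed log-likelihood, eq. 2.2
print(thetas[np.argmax(log_lik)], x.mean())     # both values are close to 3.0
```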
If only generative models based on maximum likelihood are considered, they can be classified into two groups according to how they calculate the likelihood (or an approximation of it): explicit density models and implicit density models. The whole taxonomy tree is shown in figure 2.1.
Fig. 2.1. Taxonomy of generative models based on maximum likelihood. Extracted from [11].
2.1.1. Explicit Density Models

Explicit density models can be subdivided into two types: those capable of directly maximizing an explicit density pθ(x) and those capable of maximizing an approximation of it. This subdivision represents two different ways of dealing with the problem of tractability when maximizing the likelihood. Some of the most famous tractable density approaches are Pixel Recurrent Neural Networks (PixelRNN) [12] and Nonlinear Independent Components Analysis (NICA) [13].
The models that maximize an approximation of the density function pθ(x) are likewise subdivided into two categories depending on whether the approximation is deterministic or stochastic: variational or Markov chain. Later in this work the most widely used approach to variational learning, the variational autoencoder (VAE), will be analyzed in detail (see Chapter 3 and Chapter 4).

Furthermore, Section 2.2 describes one of the simplest explicit density models, probabilistic PCA, and serves as an introduction to this type of model.
2.1.2. Implicit Density Models

These are models that do not estimate the probability density but instead define a stochastic procedure which directly generates data [14]. This procedure involves comparing real data with generated data. Some of the most famous approaches are the Generative Stochastic Network [15] and the Generative Adversarial Network [16].
2.2. Probabilistic PCA
Principal Component Analysis (PCA) [17] is a technique used for dimensionality reduction, lossy data compression and data visualization that has wide-ranging applications in fields such as signal processing, mechanical engineering or data analysis. Depending on the field, it is also known as the discrete Karhunen-Loève transform.

PCA is a technique that can be analyzed from two different points of view: linear algebra and probability. This section presents an overview of PCA from the probabilistic point of view, as it is closely related to the simplest case of the variational autoencoder and is a good starting point for latent variable models. The explanation is largely based on Bishop's book [18].
Although PCA was first conceived by Karl Pearson in 1901 [19], it was not until the late 90s that probabilistic PCA was formulated by Tipping and Bishop [20] and by Roweis [21]. Probabilistic PCA holds several advantages over conventional PCA. One of them is that it can be used as a simple generative model, and this is why the algorithm is meaningful in this work.

As its name states, PCA finds the principal components of the data, or in other words, the features, the directions of greatest variance. So given a dataset $X = \{x^{(i)}\}_{i=1}^N$ with dimensionality $D_x$, the main idea behind PCA is to obtain a subspace (called the principal-component subspace) of dimension $D_z$, with $D_z \ll D_x$, so that each $x^{(i)}$ is represented by a latent variable $z^{(i)}$ in the best possible way. Then, it is possible to recover $x^{(i)}$ from $z^{(i)}$.
The model parameters are obtained by maximizing the log-likelihood of the data:

$$\log p(X|W, \mu, \sigma^2) = \sum_{i=1}^N \log p(x^{(i)}|W, \mu, \sigma^2) = -\frac{N D_x}{2}\log(2\pi) - \frac{N}{2}\log|C| - \frac{1}{2}\sum_{i=1}^N (x^{(i)} - \mu)^T C^{-1} (x^{(i)} - \mu)$$

where $C = W W^T + \sigma^2 I$.
The optimization can be performed in closed form [18] or using the EM algorithm [22]. In the closed-form approach, the derivative of $\log p(X|W, \mu, \sigma^2)$ with respect to each parameter is computed and set equal to 0 to obtain the optimal values, namely $\mu_{ML}$, $W_{ML}$ and $\sigma^2_{ML}$.
The optimal mean $\mu_{ML}$ is the sample mean of the data:

$$\mu_{ML} = \frac{1}{N}\sum_{i=1}^N x^{(i)} = \bar{x} \tag{2.4}$$
The expression for the optimal matrix W, shown in equation 2.5, is quite complex. However, all the matrices involved are known or can be obtained. The matrix $U_M$ (with shape $D_x \times D_z$) contains the eigenvectors of the data covariance matrix S which correspond to the largest eigenvalues $\lambda_i$. These eigenvalues form the diagonal matrix $L_M$, which has shape $D_z \times D_z$. The matrix R (with shape $D_z \times D_z$) is an arbitrary orthogonal matrix.

$$W_{ML} = U_M (L_M - \sigma^2 I)^{1/2} R \tag{2.5}$$
The value $\sigma^2_{ML}$ represents the average variance associated with the discarded dimensions.

$$\sigma^2_{ML} = \frac{1}{D_x - D_z}\sum_{i=D_z+1}^{D_x} \lambda_i \tag{2.6}$$
Now that all the parameters involved are defined and it is known how to compute them, it is possible to define how new samples are generated. The generation phase comprises two steps. First, the variable z is sampled from the a priori known distribution p(z). Then, new data samples are drawn from $p(x|z) = N(x|\mu(z), \sigma^2 I)$, where $\mu(z) = Wz + \mu$. This process is shown in figure 2.2.
This is a simple case of a generative model where the relation between the latent and
the predictive variables is linear:
10
x ∼ Wz + µ + ϵ
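A minimal numpy sketch (an illustration added here, not the thesis code) of probabilistic PCA as a generative model follows; it fits the closed-form solution above on toy data, taking R as the identity, and then draws a new sample:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 20))  # toy data, D_x = 20
D_z = 3

mu_ml = X.mean(axis=0)                             # equation 2.4: sample mean
S = np.cov(X, rowvar=False)                        # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1] # sort descending

sigma2_ml = eigvals[D_z:].mean()                   # equation 2.6: discarded variance
W_ml = eigvecs[:, :D_z] @ np.diag(np.sqrt(eigvals[:D_z] - sigma2_ml))  # eq. 2.5, R = I

# Generation: sample z from p(z) = N(0, I), then x from N(W z + mu, sigma^2 I)
z = rng.standard_normal(D_z)
x_new = W_ml @ z + mu_ml + np.sqrt(sigma2_ml) * rng.standard_normal(X.shape[1])
```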
2.3. Neural Networks

Artificial neural networks are a set of machine learning algorithms that simulate the functioning of biological neurons by means of a mathematical model.

Basically, an artificial neuron consists of a set of inputs $x_i$ that are multiplied by weight parameters $w_i$ and added together. Then the bias parameter b is added. Finally, the result is passed through a non-linear function, known as the activation function, to obtain the output of the neuron. This mathematical model is shown in figure 2.3 and expressed in equation 2.7.

$$o = \phi\left(\sum_i x_i w_i + b\right) \tag{2.7}$$
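As a brief illustration (not from the thesis), the following numpy function implements equation 2.7 for a single neuron, using a sigmoid as the activation φ; the input and weight values are arbitrary:

```python
import numpy as np

def neuron(x, w, b, phi=lambda a: 1.0 / (1.0 + np.exp(-a))):
    # o = phi( sum_i x_i * w_i + b ), with phi a sigmoid by default
    return phi(np.dot(x, w) + b)

print(neuron(np.array([1.0, 2.0, 3.0]), np.array([0.5, -0.2, 0.1]), b=0.3))
```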
(Figure: the activation functions (a) Sigmoid, (b) Tanh and (c) ReLU.)
Among them, the Rectified Linear Unit or ReLU (Equation 2.8) is probably the most widely used activation function today. It has proven to help prevent vanishing gradients, among other benefits. However, the selection of the activation function depends on the application.
Now that all the basic elements of a neural network have been described, the question that arises is how to build a neural network. The process is quite simple: one or several neurons form a layer, and several stacked layers form a basic neural network. This is probably the most generic architecture of a neural network, and it is also the most inefficient. There are different architectures that exploit certain characteristics of the data to reduce the required number of parameters and be more efficient. One of them is the convolutional network.
2.3.1. Convolutional Neural Networks

Convolutional neural networks (CNNs) are based on kernels (or filters) that act as detectors for certain features or shapes. For instance, figure 2.5 displays a filter that detects edges.
A convolutional layer consists of a dot product between the input image and one or several kernels. The output of each of these convolutions is known as a feature map. An intuitive visualization of this process can be found in [27]. A set of convolutional layers intertwined with pooling layers (spatial downsampling along the output volume) and activation layers forms a CNN. Moreover, it is common to add several fully connected layers at the end of the network to flatten the output. This type of architecture is shown in figure 2.6.
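A small numpy sketch (an illustration, not the thesis implementation) of how one feature map is obtained by sliding a kernel over an image; the 3x3 kernel below is an edge detector similar to the one mentioned above:

```python
import numpy as np

def feature_map(image, kernel):
    """Valid cross-correlation of a single-channel image with one kernel."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

edge_kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])  # edge detector
image = np.random.rand(28, 28)                                      # e.g. one MNIST digit
print(feature_map(image, edge_kernel).shape)                        # (26, 26)
```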
3. VARIATIONAL AUTOENCODER
In this section the variational autoencoder (VAE) is described. The explanation starts with a conceptual and general view of how the model works, followed by a formal description of its components and objective function.
3.1. Overview
The VAE is a type of generative model that was defined in 2013 [28]. It is a model ca-
pable of generating realistic data and obtaining meaningful latent representations of the
input. To achieve this, VAE is based on Bayesian inference. VAE models an approxima-
tion of the underlying probability distribution of input data p(x) through a latent variable
z. As mentioned before in this document, VAE can also be understood as a non-linear
Probabilistic PCA where the non-linearity is introduced using neural networks.
Learning p(x) can be a remarkably complex task. Data can be high dimensional, with complicated relations among its dimensions. For example, when the data are gray-scale face images, the number of dimensions is width x height, namely the number of pixels. For high resolution images it can reach the order of millions. Furthermore, each picture contains borders and complex objects such as eyes, mouth or nose. In addition, each object holds attributes such as orientation, size or shape, and they are located in particular places to form faces. Therefore, it is clear that the data has a complex structure which the model should learn in order to generate realistic samples.
So what is a variational autoencoder? In short, it is an autoencoder, with a random variable z defined in the latent space, that tries to maximize a lower bound of the data log-likelihood.
The VAE is composed of two networks, as shown in Figure 3.1 and Figure 3.2. The encoder or inference network (parameterized by ϕ) maps the input data into a latent representation that contains important attributes of the data. This latent representation is then mapped, through the decoder or generative network (parameterized by θ), into the reconstructed data. The main characteristic of the VAE is that the latent representation of the input is encouraged to be similar to a known distribution. The usual choice is to define the prior distribution p(z) = N(z|0, I). In this way, if the latent variable is distributed similarly to N(0, I) and it holds the correct information about the input data, it is possible to obtain samples of z and then generate realistic data.
Fig. 3.2. Graphical models for the VAE showing the generative model (left) and the inference
model (right).
Before describing the VAE in detail, here are some important points about it:

• Similar inputs will have similar representations in the latent space. So, the latent space will help us interpret the input data and their relations.
• The dimension of the latent variable z should be smaller than the dimension of the input so that it captures only the most important features.
• The VAE builds an approximation of the true distribution of the input: pg(x) ≈ p(x).
• A lower bound of log p(x), called the ELBO, is the function that will be optimized.
• All density functions involved in the training process are fixed in advance. In this document, they are assumed to be Gaussian because they will model pixels of images. This way, it is possible to compute their derivatives and use optimization techniques.
3.2. Objective
Variational autoencoders approximately maximize the density function p(x) of the training set according to the following formula:

$$p(x) = \int p_\theta(x|z)\, p(z)\, dz \tag{3.1}$$

In an ideal case, the model would perfectly learn the distribution of the data; therefore, there would be no losses. However, there is one main drawback when solving equation 3.1: it is not possible to have infinite samples of z to compute the integral. Consequently, the equation is intractable and there will be losses.

This section explains the steps, tricks and assumptions needed to make equation 3.1 tractable and "optimizable". The explanation leads to the objective function of the VAE, the Evidence Lower Bound (ELBO).
It is important to mention that only a finite set $X = \{x^{(i)}\}_{i=1}^N$ of samples from the true distribution $p^*(x)$ is available. Then, the marginal log-likelihood is approximated as shown in equation 3.3, which is an unbiased estimator:

$$\mathbb{E}_{p^*(x)}\left[\log p_\theta(x)\right] \approx \frac{1}{N}\sum_{i=1}^N \log p_\theta(x^{(i)}) \tag{3.3}$$

To reduce the memory needed to compute it on a computer, only a mini-batch of size M ($M \ll N$) is considered at a time; the estimator remains unbiased but has greater variance.

$$\mathbb{E}_{p^*(x)}\left[\log p_\theta(x)\right] \approx \frac{1}{M}\sum_{i=1}^M \log p_\theta(x^{(i)})$$
3.2.1. Dealing with the Integral over z

As previously mentioned, it is necessary to find a formula that can be optimized via gradient descent or a similar algorithm [30] in order to maximize p(x). The key idea is to approximate the integral with n samples. Let us rewrite equation 3.1 in this way:
$$p(x) = \mathbb{E}_{z\sim p(z)}\left[p_\theta(x|z)\right]$$

Then, if it is possible to obtain n samples of z, the equation can be written as:

$$p(x) \approx \frac{1}{n}\sum_{i=1}^n p_\theta(x|z^{(i)}) \tag{3.4}$$
For convenience, $\log p_\theta(x|z)$ will be used instead. This will not affect the optimization, as the logarithm is a monotonic function¹. It is assumed that $p_\theta(x|z)$ is Gaussian, $p_\theta(x|z) = N(x|\mu_\theta(z), \Sigma)$. Therefore, its log-likelihood is proportional to the Euclidean distance between $\mu_\theta(z)$ and x, which is simpler to manipulate analytically.

$$\log p_\theta(x|z) = k - \frac{1}{2}(x - \mu_\theta(z))^T \Sigma^{-1} (x - \mu_\theta(z)), \qquad \Sigma = \sigma^2 I$$

where $k = -\log\left(|\Sigma|^{1/2}(2\pi)^{D_x/2}\right)$.

¹ A function which does not change the location of the maximum.

The value of σ² can be chosen; it is desirable to keep it small, as will be seen later.
According to equation 3.4, the approximation of p(x) improves as the number of samples increases. However, the computational power needed also increases. The good news is that, in practice, many values of z do not generate valid data. In other words, the conditional probability pθ(x|z) is close to 0 for the vast majority of z. This is why a family of conditional distributions qϕ(z|x), parameterized by ϕ, is defined to represent the distribution of z for each input datum. Thus, the new approximation is:

$$\mathbb{E}_{z\sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$$

In this way, only values of z that are likely to have produced x are used to compute p(x) [31]. We are one step closer to the final equation.
3.2.3. KL Divergence
The Kullback-Leibler divergence is a measure of how different two probability distributions p and q are. One interesting property of the KL divergence, which will be very valuable in this study, is that it is always non-negative.

For discrete variables it is defined as follows:

$$KL\left(P(x)\,||\,Q(x)\right) = \sum_{x\in X} P(x)\log\frac{P(x)}{Q(x)} \tag{3.5}$$
For continuous variables it is defined as follows:

$$KL\left(p(x)\,||\,q(x)\right) = \int p(x)\log\frac{p(x)}{q(x)}\, dx = \mathbb{E}_{x\sim p}\left[\log\frac{p(x)}{q(x)}\right] \tag{3.6}$$

The model under study only deals with continuous variables, thus only equation 3.6 is of interest.
3.2.4. Core Equation

The VAE optimization function is derived from the idea that qϕ(z|x) and p(z|x) should be as similar as possible, that is to say, KL(qϕ(z|x)||p(z|x)) should be close to 0.

$$KL\left(q_\phi(z|x)\,||\,p(z|x)\right) = \mathbb{E}_{z\sim q_\phi}\left[\log q_\phi(z|x) - \log p(z|x)\right]$$

$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}$$

Then, applying Bayes' rule inside KL(qϕ(z|x)||p(z|x)) and extracting p(x) from the expectation because it does not depend on z:

$$KL\left(q_\phi(z|x)\,||\,p(z|x)\right) = \mathbb{E}_{z\sim q_\phi}\left[\log q_\phi(z|x) - \log p_\theta(x|z) - \log p(z)\right] + \log p(x)$$
$$= -\mathbb{E}_{z\sim q_\phi}\left[\log p_\theta(x|z)\right] + \mathbb{E}_{z\sim q_\phi}\left[\log q_\phi(z|x) - \log p(z)\right] + \log p(x)$$

Rearranging terms:

$$\log p(x) - KL\left(q_\phi(z|x)\,||\,p(z|x)\right) = \mathbb{E}_{z\sim q_\phi}\left[\log p_\theta(x|z)\right] - KL\left(q_\phi(z|x)\,||\,p(z)\right) \tag{3.7}$$
Intuitively, this equation tells us that log p(x) minus an error term, called the encoding error, is equal to the mean of log pθ(x|z), evaluated at z sampled from qϕ(z|x), minus another error term. Also notice that the right-hand side looks like an autoencoder: qϕ(z|x) encodes and pθ(x|z) decodes.

The problem with equation 3.7 is that p(z|x) is unknown; thus, it is not possible to calculate KL(qϕ(z|x)||p(z|x)). This is why this term is removed from the equation, leading to the loss function of the VAE, called the ELBO:
$$\log p(x) \geq ELBO(\theta, \phi)$$

$$ELBO(\theta, \phi) = \mathbb{E}_{z\sim q_\phi}\left[\log p_\theta(x|z)\right] - KL\left(q_\phi(z|x)\,||\,p(z)\right) \tag{3.8}$$

The ELBO is a lower bound of log p(x). If and only if KL(qϕ(z|x)||p(z|x)) = 0, the bound is tight and log p(x) = ELBO(θ, ϕ).
It is interesting to describe both terms to understand what they represent:

• The $\mathbb{E}_{z\sim q_\phi}\left[\log p_\theta(x|z)\right]$ term represents how good the reconstruction of the input is.
• The $KL\left(q_\phi(z|x)\,||\,p(z)\right)$ term is a regularizer. In order to maximize the ELBO this term must be small, so it forces qϕ(z|x) to be similar to p(z), a simple and known distribution, and prevents overfitting.

The next sections develop the loss function in detail and check whether it can be optimized through SGD during training.
3.2.5. ELBO Optimization

In principle, all the distributions that appear in the ELBO are unknown, so they will be assumed to be Gaussian. This assumption is really convenient because it facilitates the computations, since the ELBO will have a closed form. In addition, it is a meaningful choice from the probabilistic perspective, since the Gaussian distribution has minimal prior structure as it maximizes the entropy [32]. So it is a suitable choice when there is no knowledge about the true nature of a distribution.

Then qϕ(z|x) and the prior p(z) are defined as:

$$q_\phi(z|x) = N\left(z\,|\,\mu_\phi(x), \Sigma_\phi(x)\right), \qquad p(z) = N(0, I)$$

Both µϕ and Σϕ are neural networks with parameters ϕ that are learnt during training. Moreover, Σϕ is defined as a diagonal matrix so that KL(qϕ(z|x)||p(z)) has a closed form:

$$KL\left(N(\mu_\phi(x), \Sigma_\phi(x))\,||\,N(0, I)\right) = \frac{1}{2}\left[\mathrm{tr}\left(\Sigma_\phi(x)\right) - D_z - \log\det\left(\Sigma_\phi(x)\right) + \mu_\phi(x)^T \mu_\phi(x)\right]$$
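The following numpy snippet (an illustration added here, not part of the thesis) checks this closed form against a Monte Carlo estimate of the KL divergence for an arbitrary diagonal Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z = 2
mu = np.array([0.5, -1.0])                       # stand-in for mu_phi(x)
var = np.array([0.7, 1.3])                       # diagonal of Sigma_phi(x)

# Closed form: 0.5 * [ tr(Sigma) - D_z - log det(Sigma) + mu^T mu ]
kl_closed = 0.5 * (var.sum() - dim_z - np.log(var).sum() + (mu ** 2).sum())

# Monte Carlo estimate of E_q[ log q(z|x) - log p(z) ]
z = mu + np.sqrt(var) * rng.standard_normal((100_000, dim_z))
log_q = -0.5 * (((z - mu) ** 2 / var).sum(axis=1) + np.log(2 * np.pi * var).sum())
log_p = -0.5 * ((z ** 2).sum(axis=1) + dim_z * np.log(2 * np.pi))
print(kl_closed, (log_q - log_p).mean())         # the two values should be close
```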
What about the other term? It involves solving the following integral:

$$\mathbb{E}_{z\sim q_\phi}\left[\log p_\theta(x|z)\right] = \int \log p_\theta(x|z)\, q_\phi(z|x)\, dz$$

This expectation is approximated by Monte Carlo sampling:

$$\mathbb{E}_{z\sim q_\phi}\left[\log p_\theta(x|z)\right] \approx \frac{1}{M}\sum_{i=1}^M \log p_\theta(x|z^{(i)})$$

Putting both terms together:

$$ELBO(\theta, \phi) \approx \frac{1}{M}\sum_{i=1}^M \log p_\theta(x|z^{(i)}) - KL\left(N(\mu_\phi(x), \Sigma_\phi(x))\,||\,N(0, I)\right)$$

$$ELBO(\theta, \phi) = \frac{1}{M}\sum_{i=1}^M \left[k - \frac{1}{2}\left(x - \mu_\theta(z^{(i)})\right)^T \Sigma^{-1} \left(x - \mu_\theta(z^{(i)})\right)\right] - KL\left(q_\phi(z|x)\,||\,p(z)\right) \tag{3.9}$$

where the samples $z^{(i)} \sim q_\phi(z|x)$ and k is the constant defined above.
From equation 3.9, it is important to notice that σ² acts as a weighting factor between the two terms. However, take into account that the existence of this parameter depends on the choice of pθ(x|z). The choice of σ determines how accurately we expect the model to reconstruct x: if σ² is small, the first term outweighs the second one, forcing x ≈ µθ(z).
Now this formula looks better and more tractable. However, can it be optimized via backpropagation? The answer is no, because it is necessary to obtain samples of z in an intermediate layer to calculate $\log p_\theta(x|z^{(i)})$:

$$x \rightarrow \mu_\phi, \Sigma_\phi \rightarrow z \sim N(\mu_\phi, \Sigma_\phi) \rightarrow \mu_\theta \rightarrow x$$

Because of this, the forward pass of the network works fine and, if the output is averaged over many samples of x and z, it produces the correct expected value. Nevertheless, it is not possible to back-propagate the error through stochastic layers: sampling is a non-continuous operation and has no gradient. In order to backpropagate, the only source of stochasticity must be the inputs. This is what the "reparameterization trick" achieves.
3.2.6. Reparameterization Trick
The sampling step is moved to the input by writing z as a deterministic function of µϕ, Σϕ and an auxiliary noise input ϵ ∼ N(0, I):

$$x, \epsilon \rightarrow \mu_\phi, \Sigma_\phi \rightarrow z = \mu_\phi + \sqrt{\Sigma_\phi}\,\epsilon \rightarrow \mu_\theta \rightarrow x \tag{3.10}$$

If x and ϵ are fixed, equation 3.10 is deterministic and continuous in θ and ϕ. This means it is possible to use SGD to optimize it via backpropagation. Finally, the ELBO is tractable and can be optimized.
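The following TensorFlow 2-style sketch (the thesis used TensorFlow 1.x; the layer sizes, the single Monte Carlo sample and all variable names here are assumptions, not the thesis implementation) puts equations 3.8-3.10 together into a differentiable loss:

```python
import tensorflow as tf

dim_x, dim_z, sigma2 = 784, 2, 0.001             # assumed sizes (e.g. MNIST, dim z = 2)

enc_hidden = tf.keras.layers.Dense(128, activation="relu")
enc_mu = tf.keras.layers.Dense(dim_z)            # mu_phi(x)
enc_logvar = tf.keras.layers.Dense(dim_z)        # log of the diagonal of Sigma_phi(x)
dec_hidden = tf.keras.layers.Dense(128, activation="relu")
dec_mu = tf.keras.layers.Dense(dim_x, activation="sigmoid")   # mu_theta(z)

def elbo(x):
    h = enc_hidden(x)
    mu, logvar = enc_mu(h), enc_logvar(h)
    eps = tf.random.normal(tf.shape(mu))         # the only source of randomness
    z = mu + tf.exp(0.5 * logvar) * eps          # reparameterization trick (eq. 3.10)
    x_hat = dec_mu(dec_hidden(z))
    # Gaussian reconstruction term (up to an additive constant)
    recon = -tf.reduce_sum((x - x_hat) ** 2, axis=1) / (2.0 * sigma2)
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    kl = 0.5 * tf.reduce_sum(tf.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return tf.reduce_mean(recon - kl)

x = tf.random.uniform((32, dim_x))               # dummy mini-batch
with tf.GradientTape() as tape:
    loss = -elbo(x)                              # maximize ELBO = minimize -ELBO
grads = tape.gradient(loss, enc_mu.trainable_variables)   # gradients flow through z
```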
3.3. Sampling
Now that every part of the model has been explained, the sampling of new data is quite
straightforward: z is sampled from its prior distribution p(z) = N(0, I) and injected into
the decoder:
z ∼ N(0, I) → µθ (z) → x
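For illustration, a hedged numpy sketch of this sampling flow, where `decoder` is a random stand-in for the trained µθ network (not the thesis code):

```python
import numpy as np

decoder = lambda z: np.tanh(z @ np.random.randn(2, 784))   # stand-in for mu_theta
z = np.random.standard_normal(2)                           # z ~ N(0, I), dim z = 2
x_new = decoder(z)                                          # generated (flattened) image
```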
4. GAUSSIAN MIXTURE VARIATIONAL AUTOENCODER
4.1. Introduction
4.2. Overview
The description of the GMVAE model in this work is largely based on the model presented by Nat Dilokthanakul et al. [6]. Nevertheless, the definition of the conditional prior term of the objective function (explained later) derived in this work differs from the one proposed in the aforementioned paper. As with the VAE, the GMVAE consists of two clearly differentiated parts: the inference and the generative networks. Nevertheless, as can be seen in figure 4.1, there are more variables at stake.
Fig. 4.1. Graphical models for the GMVAE showing the generative model (left) and the inference
model (right).
The generative model shown in figure 4.1 provides a good intuition of what each variable represents. The idea is that the latent variable y represents the class of the data to be generated and the latent variable w captures variable attributes of the data. These two variables form the first latent hierarchical level. Finally, in the second latent hierarchical level is the variable z. This variable has the form of a Gaussian mixture distribution and is a representation of the multimodal nature of the data. Now that the intuitive meaning of each variable is clear, their formal definitions are given in the following sections.
4.3. Defining z directly as a GMM

The most direct solution to inject multi-modality into the latent space would be to use the standard VAE framework but define z as a Gaussian mixture density:
$$p(z) = \sum_{i=1}^K \pi_i\, N(\mu_i, \Sigma_i)$$
However, the problem with this approach is that using a GMM makes the Kullback-Leibler divergence of the ELBO equation 3.8 intractable (see Appendix C). While it is possible to approximate it using a Monte Carlo estimator, this increases the variance of the estimator.

$$KL\left(q_\phi(z|x)\,||\,p(z)\right) = \mathbb{E}_{z\sim q_\phi}\left[\log\frac{q_\phi(z|x)}{p(z)}\right] \approx \frac{1}{M}\sum_{i=1}^M \log\frac{q_\phi(z^{(i)}|x)}{p(z^{(i)})}$$
Another problem with this approach is that it produces numerical instability during training when $q_\phi(z^{(i)}|x) \rightarrow 0$, $p(z^{(i)}) \rightarrow 0$, or both (see section 5.6 for more information). Thus, this approach is discarded.
4.4. Optimization Function

As stated in the paper, the objective function (the ELBO) is defined in the following manner:

$$ELBO = \mathbb{E}_q\left[\log\frac{p_{\beta_z,\theta}(x, z, w, y)}{q_{\beta_y,\phi_z,\phi_w}(z, w, y|x)}\right]$$
The definitions of $p_{\beta_z,\theta}(x, z, w, y)$ and $q_{\beta_y,\phi_z,\phi_w}(z, w, y|x)$ can be arbitrarily chosen. For convenience, they are defined as shown in equations 4.1 and 4.2 respectively. From these definitions, it is important to remember that whenever a distribution carries parameters in its subscript, it is implemented using a neural network.

$$p_{\beta_z,\theta}(x, z, w, y) = \prod_i p(w^{(i)})\, p(y^{(i)})\, p_{\beta_z}(z^{(i)}|w^{(i)}, y^{(i)})\, p_\theta(x^{(i)}|z^{(i)}) \tag{4.1}$$

$$q_{\beta_y,\phi_z,\phi_w}(z, w, y|x) = \prod_i p_{\beta_y}(y^{(i)}|w^{(i)}, z^{(i)})\, q_{\phi_z}(z^{(i)}|x^{(i)})\, q_{\phi_w}(w^{(i)}|x^{(i)}) \tag{4.2}$$

The priors are $p(w) = N(0, I)$ and $p(y) = \mathrm{Mult}(\pi)$, with K the number of clusters selected a priori. Finally, the conditional distribution of x|z is defined as $p_\theta(x|z) = N(\mu_\theta(z), \Sigma)$.
As in the original paper, to simplify the notation only one data point at a time will be considered. The conditional distribution $q_{\beta_y,\phi_z,\phi_w}(z, w, y|x)$ is decomposed into a product of three terms. The conditional distributions of z|x and w|x are both defined as Gaussians, with $q_{\phi_w}(w|x) = N(\mu_{\phi_w}(x), \Sigma_{\phi_w}(x))$ and $q_{\phi_z}(z|x) = N(\mu_{\phi_z}(x), \Sigma_{\phi_z}(x))$ respectively. The distribution $p_{\beta_y}(y_j = 1|w, z)$ is parameterized by βy because during the experiments it will be modeled using a neural network to ensure numerical stability (see section 5.6). However, the most accurate definition of this distribution is:

$$p_{\beta_z}(y_j = 1|w, z) = \frac{p(y_j = 1)\, p_{\beta_z}(z|y_j = 1, w)}{\sum_{k=1}^K p(y_k = 1)\, p_{\beta_z}(z|y_k = 1, w)}$$
Given this definition of the distributions and writing $\mathbb{E}_q \equiv \mathbb{E}_{q_{\beta_y,\phi_z,\phi_w}(z,w,y|x)}$, the ELBO function can be rewritten as:

$$ELBO = \mathbb{E}_q\left[\log\frac{p(w)\,p(y)\,p_{\beta_z}(z|w, y)\,p_\theta(x|z)}{q_{\phi_z}(z|x)\,q_{\phi_w}(w|x)\,p_{\beta_z}(y|w, z)}\right] =$$
$$\mathbb{E}_q\left[\log p_\theta(x|z)\right] + \mathbb{E}_q\left[\log\frac{p_{\beta_z}(z|w, y)}{q_{\phi_z}(z|x)}\right] + \mathbb{E}_q\left[\log\frac{p(w)}{q_{\phi_w}(w|x)}\right] + \mathbb{E}_q\left[\log\frac{p(y)}{p_{\beta_z}(y|w, z)}\right] =$$
$$\mathbb{E}_q\left[\log p_\theta(x|z)\right] - \mathbb{E}_q\left[\log\frac{q_{\phi_z}(z|x)}{p_{\beta_z}(z|w, y)}\right] - \mathbb{E}_q\left[\log\frac{q_{\phi_w}(w|x)}{p(w)}\right] - \mathbb{E}_q\left[\log\frac{p_{\beta_z}(y|w, z)}{p(y)}\right]$$
Each of the terms composing the ELBO has a specific function. In order, they are named the reconstruction term, the conditional prior term, the w-prior term and the y-prior term. In the following sections each of these terms is expanded and simplified as much as possible. The complete steps are shown in Appendix D.
4.4.1. Reconstruction Term
The first component is the reconstruction term. It measures the difference between the input data and their reconstruction. This term has exactly the same form as the first term of the VAE's ELBO.

$$\mathbb{E}_q\left[\log p_\theta(x|z)\right] \approx -\frac{1}{2\sigma^2}\left(x - \mu_\theta(z)\right)^2 + \mathrm{const}$$
4.4.2. Conditional prior term

The conditional prior term $\mathbb{E}_q\left[\log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w, y)}\right]$ measures the similarity between $q_{\phi_z}(z|x)$ and $p_\beta(z|w, y)$. As it is a positive value and appears with a negative sign in the ELBO equation, the conditional prior term should be small in order to maximize the objective function. For it to be small, both probability functions should be similar to each other. Thus, this term is restricting the expressive capacity of the model: it is a regularizer.

In the model presented by Nat Dilokthanakul et al. [6], this term is simplified to an expectation of a KL divergence, which is then approximated using Monte Carlo:

$$\mathbb{E}_{q_{\phi_w}(w|x)\, p_\beta(y|w,z)}\left[KL\left(q_{\phi_z}(z|x)\,||\,p_\beta(z|w, y)\right)\right] \approx \frac{1}{M}\sum_{m=1}^M\sum_{k=1}^K p_\beta(y_k = 1|w^{(m)}, z^{(m)})\, KL\left(q_{\phi_z}(z|x)\,||\,p_\beta(z|w^{(m)}, y_k = 1)\right)$$

To do the approximation, the latent variable w is sampled from $q_{\phi_w}(w|x)$, but it is not clear how z is sampled to obtain the final result. For this reason, the simplified expression for this term has been derived in this thesis, and it turns out to differ from the one proposed in the aforementioned paper.
$$\mathbb{E}_q\left[\log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w, y)}\right] = \mathbb{E}_{q_{\phi_w}(w|x)\,q_{\phi_z}(z|x)}\left[\sum_{k=1}^K \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w, y_k = 1)}\right]$$

where $\tilde{\pi}_k \equiv p_\beta(y_k = 1|w, z)$. As can be observed, it is now clear that this term can be approximated with the Monte Carlo estimator by sampling w from $q_{\phi_w}(w|x)$ and z from $q_{\phi_z}(z|x)$:

$$\mathbb{E}_{q_{\phi_w}(w|x)\,q_{\phi_z}(z|x)}\left[\sum_{k=1}^K \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w, y_k = 1)}\right] \approx \frac{1}{N}\frac{1}{M}\sum_{n=1}^N\sum_{m=1}^M\sum_{k=1}^K \tilde{\pi}_k \log\frac{q_{\phi_z}(z^{(m)}|x)}{p_\beta(z^{(m)}|w^{(n)}, y_k = 1)}$$

This expression can be simplified by considering only one sample of each variable:

$$\mathbb{E}_{q_{\phi_w}(w|x)\,q_{\phi_z}(z|x)}\left[\sum_{k=1}^K \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w, y_k = 1)}\right] \approx \sum_{k=1}^K \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w, y_k = 1)}$$

Moreover, as all the variables involved are distributed as Gaussians, this expression can be further simplified. For the complete details see Appendix D.
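As an illustration of the single-sample estimate above (a sketch added here, not the thesis code), the following numpy functions evaluate it for diagonal Gaussians; the arguments stand in for network outputs (mu_q and var_q from qϕz(z|x), one mean and variance pair per cluster from pβ(z|w, y_k = 1), and pi_tilde from pβ(y|w, z)):

```python
import numpy as np

def log_gauss(z, mu, var):
    # log N(z | mu, diag(var)) for a diagonal Gaussian
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

def conditional_prior_term(z, mu_q, var_q, cluster_mus, cluster_vars, pi_tilde):
    # sum_k pi_tilde_k * [ log q_phi(z|x) - log p_beta(z | w, y_k = 1) ]
    log_q = log_gauss(z, mu_q, var_q)
    return sum(p * (log_q - log_gauss(z, m, v))
               for p, m, v in zip(pi_tilde, cluster_mus, cluster_vars))
```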
4.4.3. W-prior term

The w-prior term $\mathbb{E}_q\left[\log\frac{q_{\phi_w}(w|x)}{p(w)}\right]$ measures the similarity between $q_{\phi_w}(w|x)$ and p(w). As it is a positive value and appears with a negative sign in the ELBO equation, the w-prior term is also a regularizer. It can be written as a KL divergence between Gaussian distributions and can therefore be computed analytically.

$$\mathbb{E}_q\left[\log\frac{q_{\phi_w}(w|x)}{p(w)}\right] = KL\left(q_{\phi_w}(w|x)\,||\,p(w)\right)$$

Recalling that p(w) = N(0, I):

$$KL\left(q_{\phi_w}(w|x)\,||\,p(w)\right) = \frac{1}{2}\left[\mathrm{tr}\left(\Sigma_{\phi_w}(x)\right) - D_w - \log\det\left(\Sigma_{\phi_w}(x)\right) + \mu_{\phi_w}(x)^T \mu_{\phi_w}(x)\right]$$
4.4.4. Y-prior term

The y-prior term $\mathbb{E}_q\left[\log\frac{p_\beta(y|w, z)}{p(y)}\right]$ measures the similarity between $p_\beta(y|w, z)$ and p(y). As it is a positive value and appears with a negative sign in the ELBO equation, the y-prior term is also a regularizer. It can be written as an expectation over $q_{\phi_z}(z|x)$ and $q_{\phi_w}(w|x)$ of a KL divergence between categorical distributions.

$$\mathbb{E}_q\left[\log\frac{p_\beta(y|w, z)}{p(y)}\right] = \mathbb{E}_{q_{\phi_z}(z|x)\,q_{\phi_w}(w|x)}\left[KL\left(p_\beta(y|w, z)\,||\,p(y)\right)\right]$$

$$KL\left(p_\beta(y|w, z)\,||\,p(y)\right) = \sum_{i=1}^K p_\beta(y_i = 1|w, z) \log\frac{p_\beta(y_i = 1|w, z)}{p(y_i = 1)}$$
4.4.5. ELBO Equation

Joining the previous expressions, the lower bound function can be written as:

$$ELBO(\phi_w, \phi_z, \beta, \theta) = \mathbb{E}_{q_{\phi_z}(z|x)}\left[\log p_\theta(x|z)\right] - \mathbb{E}_{q_{\phi_w}(w|x)\,q_{\phi_z}(z|x)}\left[\sum_{k=1}^K \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w, y_k = 1)}\right] - KL\left(q_{\phi_w}(w|x)\,||\,p(w)\right) - \mathbb{E}_{q_{\phi_z}(z|x)\,q_{\phi_w}(w|x)}\left[KL\left(p_\beta(y|w, z)\,||\,p(y)\right)\right] \tag{4.3}$$

This is the expression that will be optimized during training, after applying the reparameterization trick.
4.5. Reparameterization Trick

As in the VAE, it is necessary to move the intermediate sampling layers to the input in order to be able to apply backpropagation. Therefore, the sampling from $q_{\phi_z}(z|x)$ and $q_{\phi_w}(w|x)$ is realized through two new input noise variables, $\epsilon_z \sim N(0, I)$ and $\epsilon_w \sim N(0, I)$, as shown below:

$$z = \mu_{\phi_z}(x) + \epsilon_z \sqrt{\Sigma_{\phi_z}(x)} \;\sim\; q_{\phi_z}(z|x) = N\left(z\,|\,\mu_{\phi_z}(x), \Sigma_{\phi_z}(x)\right)$$
$$w = \mu_{\phi_w}(x) + \epsilon_w \sqrt{\Sigma_{\phi_w}(x)} \;\sim\; q_{\phi_w}(w|x) = N\left(w\,|\,\mu_{\phi_w}(x), \Sigma_{\phi_w}(x)\right)$$

In this way, backpropagation can be applied to optimize the network parameters: given fixed x, ϵz and ϵw, the ELBO is deterministic and continuous, meaning it is possible to use SGD to optimize it via backpropagation.
4.6. Sampling
The sampling of new data is straightforward. First, w and y are sampled from their corresponding prior distributions p(w) = N(0, I) and p(y) = Mult(π). Then z is sampled from pβz(z|w, y) and injected into the decoder, as sketched below.
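A hedged numpy sketch of this ancestral sampling; `prior_means`, `prior_vars` (the βz networks evaluated at w, one mean and variance per cluster) and `decode` are hypothetical callables wrapping the trained model, not names from the thesis:

```python
import numpy as np

def sample_gmvae(prior_means, prior_vars, decode, dim_w, pi, rng=np.random.default_rng()):
    w = rng.standard_normal(dim_w)                   # w ~ p(w) = N(0, I)
    y = rng.choice(len(pi), p=pi)                    # y ~ p(y) = Mult(pi), pi sums to 1
    mu_k, var_k = prior_means(w)[y], prior_vars(w)[y]
    z = mu_k + np.sqrt(var_k) * rng.standard_normal(np.shape(mu_k))  # z ~ p_beta(z|w, y)
    return decode(z)                                 # mu_theta(z), the generated sample
```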
5. METHODOLOGY
5.1. Objective
The purpose of the experiments conducted in this work is to understand the effect of differ-
ent configurations of the models’ hyperparameters and to assess if the GMVAE described
in chapter 4 achieves superior performance, in terms of generative capability and latent
space interpretability, compared to the standard VAE.
5.2. Prepare Data

Both datasets used in this thesis are already clean and contain no missing data. They are first split into train and test sets. Then the train set is divided into two subsets, as shown in Figure 5.1, to obtain a validation set. Thus, three subsets of images are used to carry out the experiments, each of them with a specific purpose (a minimal split sketch is shown after the list below):
• Training Data: data used to learn the model parameters during training.
• Validation Data: data used to evaluate the loss at each epoch during training. It is not used in the learning process but to stop training early and prevent overfitting (training stops when the validation error increases).
• Test Data: data used to generate results and assess the performance of the model.
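A minimal sketch of such a split (the proportions, random seed and placeholder data are assumptions, not the thesis values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 784)                          # placeholder for the image dataset
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)    # test split
X_train, X_valid = train_test_split(X_train, test_size=0.2, random_state=0)  # validation
```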
5.3. Model Architecture
Three different types of neural networks have been used to build the models. With these three types of neural networks, two types of architectures have been used to implement the VAE and the GMVAE. They have been named the Dense Architecture and the Conv Architecture. Table 5.1 and Table 5.2 show how each distribution has been implemented for each architecture.
5.4. Definition of hyperparameters

In general terms, when training a machine or deep learning algorithm there are a number of parameters that can be selected before training. Some of them handle the files generated, others deal with saving and restoring a model, and others determine certain aspects of the algorithm. In order to test different configurations of the VAE and GMVAE algorithms, some parameters must be varied. These are defined in Table 5.3.
Hyperparameter: Definition
Epochs: determines how many times each training image is used during training.
dim z: the dimension of the latent variable z.
dim w: the dimension of the latent variable w. Used only in the GMVAE.
K clusters: the number of possible values that the categorical variable y can take. Used only in the GMVAE.
Hidden dimension: the number of neurons in a single dense layer.
Number of layers: the number of layers used in a dense network.
5.5. Experimental Setup

To accomplish the goals defined previously, the following experiments were conducted:

• Experiment 2. Train the GMVAE varying dim w (see Table 5.5) to understand the information learnt by the latent variables w and z. Evaluate the results in terms of interpretability.
• Experiment 4. Evaluate the log-likelihood of the VAE and GMVAE models trained in previous experiments when uniform and dropout noise is added to the input. The uniform noise (soft) adds ±0.1 to each normalized pixel and the dropout noise (strong) sets half of the pixels to 0.
All models have been trained using images as input. As is well known, the most efficient way to deal with images is to use convolutional layers. However, simple fully connected dense networks have been used for the first four experiments in order to minimize training time.

The selection of the set of values for each hyperparameter was based on the hyperparameters selected in similar works and on interpretability.
Experiment 1
Generative model VAE
Model name model1
Dataset MNIST
Architecture Dense Architecture
σ2 [0.01, 0.001]
dim z [2, 10, 20, 50]
Hidden layers [2, 3, 4]
Hidden dimension [64, 128]
Experiment 2
Generative model GMVAE
Model name model4_bias
Dataset MNIST
Architecture Dense Architecture
σ2 0.001
dim z 10
dim w [2, 10, 20]
K 10
Hidden layers 3
Hidden dimension 128
Experiment 5
Generative model [GMVAE, VAE]
Model name [model5, model2]
Dataset [MNIST, FREY]
Architecture Conv Architecture
5.6. GMVAE numerical stability

In the first attempt to implement the GMVAE, numerical stability problems were found. At some point during training (usually at the beginning), NaN values appeared in the loss function and, as a consequence, training stopped.

After a thorough analysis, it was found that the numerical instability originated in the evaluation of the distribution pβ(yj = 1|w, z) defined in section 4.4.
5.6.1. Approximation of pβ(yj = 1|w, z)

Softmax approximation

The first idea was to approximate pβ(yj = 1|w, z) with a softmax function. Nevertheless, it also had numerical stability issues.

$$v_j = z - \mu_\beta(w)_{y_j}, \qquad p_\beta(y_j = 1|w, z) = \mathrm{softmax}(v_j) = \frac{e^{v_j}}{\sum_{k=1}^K e^{v_k}}$$
As explained in the next section, the problem was mitigated by limiting the range of values of several neural network outputs. However, the limitations were so strong that the model performed poorly. This alternative was discarded.
Constant PDF

The next alternative was to simplify the model by defining pβ(yj = 1|w, z) as:

$$p_\beta(y_j = 1|w, z) = \frac{1}{K} \quad \forall j$$

This way, it is not required to evaluate the probabilities pβ(z|yk = 1, w), so the instability problem obviously disappears. However, this is a strong assumption that reduces the expressivity of the model. Thus, further alternatives were sought.
Neural network approximation

The final approach was to approximate the PDF using a neural network composed of fully connected layers. The input of this neural network is the concatenation of w and z and the output is the probability of each yj, as sketched below.
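A minimal sketch of such a network (the hidden size, the TensorFlow 2-style layers and the dimensions are assumptions, not the thesis implementation):

```python
import tensorflow as tf

dim_w, dim_z, K = 2, 10, 10                       # assumed dimensions
hidden = tf.keras.layers.Dense(128, activation="relu")
logits = tf.keras.layers.Dense(K)

def p_y_given_wz(w, z):
    h = hidden(tf.concat([w, z], axis=1))         # input: concatenation of w and z
    return tf.nn.softmax(logits(h), axis=1)       # one probability per cluster y_j
```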
5.6.2. Output restrictions

As described above, the instability problem arises when the variable z must be evaluated at points far from the mean of the distribution, or when its variance tends to 0 or ∞. However, this situation makes no sense, as every image should have a latent representation related to one cluster, and extreme values of the parameters of the Gaussian distributions are not interpretable. This is the reason for imposing constraints on the PDFs pβz(z|w, y) and qϕz(z|x); in other words, for restricting the output range of the following neural networks:
• $\mu_{\phi_z}(x)$, $\mu_{\beta_z}(w)_{y_j}$
• $\sigma^2_{\phi_z}(x)$, $\sigma^2_{\beta_z}(w)_{y_j}$
To restrict the possible values of the outputs of the neural networks modeling the means and the variances, the function ϕ(x) = A * tf.tanh(x), with A ∈ [1, 100], has been used as the activation function of the last layer of these networks. Moreover, as the variance can be neither negative nor close to zero, a dedicated sigma function of the network output x was defined. With this definition of the variance and the constraints on the output ranges, the problem is solved.
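A hedged sketch of these restrictions follows; the bounded-mean activation mirrors the A * tf.tanh(x) function above, while the variance mapping shown is only a plausible stand-in, since the exact sigma function used in the thesis is not reproduced here:

```python
import tensorflow as tf

A = 10.0        # assumed bound, A in [1, 100]
EPS = 1e-2      # assumed constant keeping the variance away from zero

def bounded_mean(x):
    # phi(x) = A * tanh(x), used on the last layer of the mean networks
    return A * tf.tanh(x)

def bounded_variance(x):
    # stand-in mapping: strictly positive and bounded, as the restriction described above
    return EPS + (A - EPS) * tf.sigmoid(x)
```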
5.7. Motivate clustering behaviour

Another problem found is anti-clustering behaviour. As stated in [6], the z-prior term can degenerate the clustering nature of the latent variable z. It is a regularization term that may go to 0 if all the clusters merge into a single one. This over-regularization issue is an open problem under research.

To stimulate clustering behaviour, the bias of the last layer of the neural networks µϕz(x) and µβz(w)yj is not initialized with zeros (as usual) but with a truncated normal. This simply provides the model with a priori knowledge to facilitate the learning process.
5.8. Metrics
The selection of metrics to evaluate the performance of generative models is still an open discussion, in spite of the fact that there has been significant progress. In general terms, a good metric should be chosen with the evaluation goal in mind, since a good result in one metric does not necessarily translate into good results in other metrics. For example, a high log-likelihood does not guarantee realistic generated data [36]. Therefore, it is important to examine the goal when selecting the evaluation technique.
For this thesis, the goal is to assess generative models in terms of image synthesis and
latent variable interpretability. As such, table 5.7 shows the metrics selected.
The ELBO function differs slightly between VAE (see Equation 3.8) and GMVAE
(see Equation 4.3). It is the function maximized during training.
The average log-likelihood E[log p(x|z)] measures the reconstruction capabilities of the model (it is not related to the generative capacity). It is the same for all models and has the following form:

$$\mathbb{E}\left[\log p(x|z)\right] \approx \frac{1}{N}\sum_{i=1}^N \frac{1}{K}\sum_{k=1}^K \log p\left(x^{(i)}|z^{(i)}_k\right)$$

$$\log p\left(x^{(i)}|z^{(i)}_k\right) = -\frac{D_x}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(x^{(i)} - \mu(z^{(i)}_k)\right)^2$$

Notice that it is calculated by sampling several z for each datum in the dataset to reach a better approximation.
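For illustration, a numpy sketch of this metric (not the thesis code); `sample_z(x)` and `decode(z)` are hypothetical callables standing in for the encoder-plus-reparameterization step and the decoder mean:

```python
import numpy as np

def avg_log_likelihood(X, sample_z, decode, sigma2, K=5):
    # X: (N, D_x) test images; sample_z(x): one z drawn from q(z|x); decode(z): mu(z)
    D_x = X.shape[1]
    total = 0.0
    for x in X:
        for _ in range(K):                       # K samples of z per datum
            z = sample_z(x)
            mu = decode(z)
            total += (-0.5 * D_x * np.log(2 * np.pi * sigma2)
                      - np.sum((x - mu) ** 2) / (2 * sigma2))
    return total / (X.shape[0] * K)
```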
The latent space interpretability will be appraised visually in terms of how well the multimodal nature of the data is represented in the latent space.

The quality of the generated images will be assessed through human visual inspection. However, to simplify this task, only the model configurations with the smallest KL(q(z|x)||p(z)) will be chosen in advance. Recall that a small KL means the distribution of the data in the latent space is close to the distribution from which generated data will be sampled.
5.9. Visualization of results

A very important aspect of deep learning is the visualization of results. In this study the results are the generated/reconstructed images and the distribution of the latent variables. The visualization of the images is straightforward. However, the latent variables are sometimes high dimensional, so it is impossible to understand them in their raw dimension. Therefore, they are represented in 2D or 3D plots through a dimensionality reduction algorithm such as PCA or t-SNE.
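A short scikit-learn sketch of this projection step (an illustration, not the thesis code; the latent matrix Z below is a random placeholder for the encoded test images):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

Z = np.random.randn(500, 10)                    # placeholder for encoded test images
Z_pca = PCA(n_components=2).fit_transform(Z)    # linear projection to 2D
Z_tsne = TSNE(n_components=2).fit_transform(Z)  # non-linear embedding to 2D
```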
6. RESULTS
This chapter analyzes the results obtained from the implementation of the previously
described models. The source code is available at GitHub through the following URL:
https://github.com/psanch21/VAE-GMVAE.
6.1. Initial considerations

This chapter contains several tables that show the value of the loss function (the negative ELBO) at the end of the training process. These tables contain at least two columns, called ValidLoss and TrainingLoss, that refer to the negative ELBO for the validation and train datasets respectively. The tables may contain other columns that represent the terms composing the ELBO. Moreover, they contain a column called ModelName that holds information about the model configuration. For VAE models (model1 or model2) this column has the format {modelType}_{datasetName}_{epochs}_{sigmaRecons}_{dimZ}_{dimHidden}_{numLayers}, and for GMVAE models (model4_bias or model5) it has the format {modelType}_{datasetName}_{epochs}_{sigmaRecons}_{dimZ}_{dimW}_{dimHidden}_{numLayers}_{Kclusters}.
Another important consideration is that the range of the reconstruction loss and there-
fore the total loss depends on the chosen value for σ2 . So it is important to keep this in
mind when comparing different configurations.
6.2. Experiment 1: VAE

Results for the different configurations of the VAE model show that it has enormous difficulty achieving good clustering and good generation results at the same time. Tables 6.1 and 6.2 show the final loss of the configurations that resulted in the best clustering and generation capabilities respectively. The entire table with the results for all the configurations tested is included in Appendix A. Furthermore, Appendix B includes some examples of poorly generated samples.
Table 6.1. Loss results of the best clustering configurations. Ordered by
ValidLoss.
Analyzing the tables, it is important to remark that the value of the KL term and the dimension of the latent variable determine how the model behaves. Configurations that produce good clustering present a larger latent dimension (dim z = 10) and a large KL (around 35-55). On the other hand, configurations that obtain good generation are those with a smaller latent dimension (dim z = 2) and a smaller KL (around 13-40).
Figures 6.1 and 6.2 show examples of good clustering and good generation results, respectively. In the first case, the multimodal nature of the data is clearly captured, but the generated samples are of poor quality; only a few of them can be considered realistic. The second case generates many realistic samples (although somewhat fuzzy); however, only the images representing the digit 1 are visibly separated in the latent space, while the others are mixed together.
(a) Generated images (b) Latent space (c) Reconstruction
In any case, the results obtained with any configuration of the VAE surpass the performance of the PCA algorithm in terms of image reconstruction and latent space interpretability. As shown in Figure 6.3, the representation of the images in the latent space improves as the dimensionality increases, as with the VAE. The reconstruction also improves with the dimensionality of the latent space, but its quality is clearly lower than with the VAE.
Fig. 6.3. Examples of PCA latent space representation and reconstruction varying dim z.
Results show that the latent variable w holds information about the variations within the data, while the latent variable z learns to separate the data into clusters according to their class. Therefore, by fixing z and varying w it is possible to generate slightly different samples from the same class.
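A minimal sketch of this generation procedure is shown below, assuming hypothetical networks prior_z_params (mapping w and a cluster index k to the mean and variance of p(z|w, y_k = 1)) and decode_mean (mapping z to the pixel means); it is not the exact generation code of this thesis.

import numpy as np

def generate_from_cluster(k, prior_z_params, decode_mean, dim_w, n_samples=10):
    """Generate samples from a single GMVAE cluster k by varying the latent variable w."""
    images = []
    for _ in range(n_samples):
        w = np.random.randn(dim_w)                 # w ~ N(0, I)
        mu_z, var_z = prior_z_params(w, k)         # parameters of p(z | w, y_k = 1)
        z = mu_z + np.sqrt(var_z) * np.random.randn(*mu_z.shape)
        images.append(decode_mean(z))              # x = mu(z), hypothetical decoder mean
    return np.stack(images)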
Figure 6.4 shows three scatter plots of the variable w with different dimensions. The representations in plot (a) are unmistakably distributed as a Gaussian. Plots (b) and (c), in contrast, show some kind of structure; however, it is attributed to the dimensionality reduction of the multidimensional Gaussian and is therefore meaningless.
(a) dim w = 2 (b) dim w = 10 (c) dim w = 20
Figure 6.5 shows three scatter plots of the variable z. It can be observed that a good
clustering behaviour is obtained in all cases.
Comparing the loss values obtained in this experiment (shown in Tables 6.3 and 6.4) with the values obtained in experiment 1, there are no significant improvements: neither ValidLoss nor the regularizer terms are enhanced. Therefore, it may seem that the GMVAE does not perform better.
However, if the results are analyzed qualitatively (via visual appraisal), significant improvements can be noted regarding the interpretability of the latent space. Now the variable z unambiguously captures the multimodal nature of the data and, simultaneously, the quality of the samples is quite good. So, using the GMVAE, it is possible to accomplish both objectives, which was not possible with the standard VAE. Moreover, with the GMVAE it is possible to easily generate samples from a single class of the data, since each cluster is associated with a specific label. This is also a major improvement with respect to the standard VAE.
Table 6.3. Best clustering configurations of the GMVAE. Ordered by CondPrior.
(i) Cluster: ? (j) Cluster: ? (k) w scatter plot (l) z scatter plot
Figure 6.7 shows an example of a hyperparameter configuration that led to good clustering results but poor generation in the VAE (see Figure 6.1). It can be observed that, using the GMVAE, the quality of the generated samples is much better while a good clustering is kept in the latent space.
(i) Cluster: 8 (j) Cluster: 9 (k) w scatter plot (l) z scatter plot
Table 6.5 shows the log-likelihood without noise and with two different types of noise (uniform and dropout) for two configurations of the VAE (first two rows) and two configurations of the GMVAE (last two rows) that achieved good performance.
Before describing the results, it is important to notice that the order of magnitude of the log-likelihood depends on the value of σ², since it appears in the denominator of the Gaussian log-likelihood.
Results show that the VAE deteriorates much more than the GMVAE when noise is introduced. The deterioration of the log-likelihood for the two VAE configurations is of 906.31 and 8136.39 units, whereas for the two GMVAE configurations it is of 740.79 and 4201.7 units. This is considerably smaller, which means the GMVAE is more robust to corrupted inputs.
Table 6.5. Loglikelihood results.
Figures 6.8, 6.9 and 6.10 show examples of images inputted to the models (left col-
umn) and their reconstructions (right column) when different types of noise are present.
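A possible way of producing such corrupted inputs, consistent with the two noise types mentioned above, is sketched below; the noise levels are illustrative and not necessarily those used in the experiments.

import numpy as np

def add_uniform_noise(x, level=0.3, rng=np.random):
    """Add uniform noise to pixel values in [0, 1] and clip back to the valid range."""
    return np.clip(x + rng.uniform(-level, level, size=x.shape), 0.0, 1.0)

def add_dropout_noise(x, drop_prob=0.3, rng=np.random):
    """Randomly set a fraction of the pixels to zero (dropout-style corruption)."""
    mask = rng.binomial(1, 1.0 - drop_prob, size=x.shape)
    return x * mask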
Table 6.6 and Table 6.7 show that the values of the loss function have not significantly improved: they are smaller, but they remain in the same order of magnitude as in experiments 1 and 3 (see Tables 6.1, 6.2, 6.3 and 6.4).
Table 6.6. GMVAE training result.
However, it can be observed in Figure 6.11 and Figure 6.12 that excellent results are obtained for both the GMVAE and the VAE. For the GMVAE, the quality of the generated images improves dramatically and the multimodality of the data is perfectly represented in the latent space. This implies the generation is more accurate, in the sense that each cluster generates samples from a single class of the data only. For the VAE, the results also improve compared to experiment 1; nevertheless, the GMVAE clearly outperforms it. Another major improvement is that the number of parameters of the model is smaller, so it is less prone to overfitting.
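As an illustration of the kind of architecture involved, the following tf.keras sketch builds a small convolutional encoder that outputs the mean and log-variance of q(z|x); it is a minimal example, not the exact architecture used in this thesis.

import tensorflow as tf

def conv_encoder(dim_z):
    """Minimal convolutional encoder producing the mean and log-variance of q(z|x)."""
    inputs = tf.keras.Input(shape=(28, 28, 1))
    h = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    h = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(h)
    h = tf.keras.layers.Flatten()(h)
    mu = tf.keras.layers.Dense(dim_z)(h)
    logvar = tf.keras.layers.Dense(dim_z)(h)
    return tf.keras.Model(inputs, [mu, logvar])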
(i) Cluster: 8 (j) Cluster: 9 (k) w scatter plot (l) z scatter plot
Fig. 6.12. Example of results using the VAE convolutional architecture on MNIST. Model (a)-(b) with dim z = 2 and model (c)-(d) with dim z = 10. Panels: (a), (c) generated images; (b), (d) latent space.
The GMVAE's convolutional architecture has also been tested on the FREY dataset. Figure 6.13 shows that good results are obtained. It can be observed that each cluster represents a well-defined expression and face orientation, and therefore each cluster can be assigned to a specific label. Five different classes have been defined arbitrarily: angry, sad, serious, tongue and smile. It is thus possible to obtain a labeled dataset.
It is important to mention that several clusters are assigned to the same label. There are two explanations for this: first, the training data contain unbalanced numbers of samples of the defined labels, and second, some labels encapsulate a wider spectrum of variations. For example, there are many more faces smiling than sticking out the tongue, so three clusters are assigned to the label "smile" and only one cluster is assigned to the label "tongue".
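A hedged sketch of how such a labeled dataset could be obtained, once a label has been assigned to each cluster, is given below; cluster_posterior is a hypothetical function returning the posterior cluster probabilities p(y|x) from the trained GMVAE.

import numpy as np

def label_dataset(X, cluster_posterior, cluster_to_label):
    """Assign to each image the label of its most probable cluster."""
    labels = []
    for x in X:
        probs = cluster_posterior(x)          # p(y_k = 1 | x), shape (K,)
        k = int(np.argmax(probs))
        labels.append(cluster_to_label[k])    # e.g. {0: "smile", 1: "tongue", ...}
    return np.array(labels)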
Fig. 6.13. Example of results using the GMVAE's convolutional architecture on the FREY dataset. Panels: (a)-(b) label angry; (c)-(d) label sad; (e)-(f) label serious; (g) label tongue; (h)-(j) label smile; (k) w scatter plot; (l) z scatter plot.
7. SOCIO-ECONOMIC ENVIRONMENT
This chapter presents the budget required to conduct a research project related to this thesis, as well as possible practical applications of the VAE and GMVAE models along with the socio-economic impact they may generate.
7.1. Budget
A research project related to the study conducted in this thesis would require a budget covering mainly human resources, computing power, conference costs and publication expenses. This section details the budget required to conduct a two-year research project as part of the Signal Processing Group (known in Spanish as GTS) of the UC3M. The necessary funding amounts to €50988.88. For the detailed budget see Table 7.1.
Regarding human resources costs, two people will be involved in the project: the principal investigator, who is already a UC3M professor, and a research student to be hired for the project. According to the official remuneration tables of the UC3M, the monthly gross salary of a research student is €1582.87, thus €37988.88 in total.
In relation to the necessary equipment, the computing power costs are covered by the GTS, which has six servers with several Graphics Processing Units (GPUs), sufficient resources to conduct the experiments. The laptop for the student will cost €2000, and the cost of the personal computer for the principal investigator will be covered by the university. Software licenses are free of charge, as the programming language and libraries needed are open source.
The part of the budget for conferences is estimated to be about €3000 per year, i.e. €6000 for the whole project. It will cover transportation and accommodation costs and entry fees. Finally, according to the European Union, all results obtained in publicly funded projects must be published; the expenses for publishing the results are estimated to be around €5000 for the whole project.
Concept                  Description                                                    Cost (€)
Student's Expenses       Salary of the research student                                 37988.88
Student's PC             Laptop used to develop the implementation of the models        2000
                         and to connect to the server.
Conferences' Expenses    Costs for traveling, accommodation and entry fees.             6000
Publications' Expenses   Costs for publishing the results of the project.               5000
TOTAL                                                                                   50988.88
The results obtained in this thesis can be extrapolated to more complex problems. The family of variational autoencoders is suitable for any situation where data is unlabeled and the objective is to generate new realistic samples, to understand the underlying structure of the data or to reconstruct data. Moreover, they can also be used to perform unsupervised clustering and obtain labeled samples that can be used for other tasks.
Table 7.2 shows some specific applications of generative models that contribute to accelerating processes and saving time (market segmentation), improving the quality of a product (3D modeling) or providing the tools required to develop other research projects (medical data generation). Generative models can therefore help save money and time, boost sales and encourage further research.
Concept                   Description
Image Denoising           Generative models can assist in reconstructing a corrupted image,
                          for example photographs taken under low light.
3D Modeling               Generative models can assist in the design of real-world shapes for
                          sandbox video games such as Minecraft. Generative models provide huge
                          variety in the generated objects, which can have a positive effect on
                          the player.
Market segmentation       Generative models can be used to find a meaningful latent
                          representation of users' data, so that it is possible to identify
                          different types of customer purchasing behaviour and create a labeled
                          dataset that can be used for further analysis such as classification
                          or regression.
Medical data generation   Medical data is usually scarce, hard to obtain and chaotic. Besides,
                          it is extremely sensitive personal data for which privacy is really
                          important. Generative models can generate new anonymous samples that
                          can be used to perform further analysis.
Complex simulations       In fields such as astrophysics it is extremely expensive to conduct
                          experiments to obtain data that can be analyzed. Given a dataset,
                          generative models can generate more data and consequently save huge
                          amounts of money and time.
8. CONCLUSIONS
This chapter presents the conclusions of this thesis, reviews the objectives and hypotheses stated at the beginning of the document and proposes future lines of research.
8.1. Conclusions
This thesis began as a research study whose goal was to fully understand one of the most widely used families of generative models, the variational autoencoders, and to compare the performance of the standard configuration with one of its variations. Analyzing the previous chapters, it can be safely said that the objectives set at the beginning of this document have been more than fulfilled.
In a first step, the standard VAE was implemented smoothly, without any significant incident. Then, the GMVAE was implemented based on the objective function described in this thesis, and the results obtained are quite successful. The visual quality of the generated samples and of the latent representation is extraordinary and has clearly improved compared to the standard VAE. It has been shown that the GMVAE outperforms the standard VAE.
Carrying out the experiments described in this work has helped to gain further understanding of how the models work. The trade-off between good generative capabilities and a good latent space representation in the VAE has been observed: its simple definition of the latent space makes it difficult to achieve both at the same time. It has then been shown that the GMVAE successfully removes this constraint by defining a more expressive latent space. The importance of the dimensionality of the latent variables has also been learnt, as it can determine the features of the resulting model. It has been observed that defining a high-dimensional latent variable in the VAE results in a model that prioritizes reconstruction and a good latent space representation over generation; this is likewise observed in the loss function in the form of high values of the KL divergence. It has also been verified that the GMVAE is more resilient to corrupted inputs, as expected. Finally, the importance of selecting the correct architecture has been appraised: using CNNs, the performance improves substantially and the number of parameters of the model decreases, which is desirable.
The main problem that arose was the numerical instability found during training of the GMVAE. It was a time-consuming complication, but it was also useful for learning how to debug neural networks and how to search for alternative implementations, evaluate them and select the most suitable one.
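Typical tricks against this kind of instability, not necessarily the exact ones adopted in this implementation, are to parameterize variances through a softplus plus a small constant and to evaluate mixture log-probabilities with the log-sum-exp trick, as in the sketch below.

import numpy as np

def stable_variance(raw, eps=1e-6):
    """Softplus parameterization keeps the variance strictly positive and away from zero."""
    return np.logaddexp(0.0, raw) + eps   # log(1 + exp(raw)), computed stably

def log_mixture_prob(log_weights, log_probs):
    """log sum_k exp(log_weights_k + log_probs_k), computed with the log-sum-exp trick."""
    a = log_weights + log_probs
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))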
From a personal point of view, this thesis has been an introduction to the research world for me. I have acquired solid knowledge about the mathematical and probabilistic foundations of state-of-the-art generative models, and I have learnt how to implement them using one of the most popular frameworks for deep and machine learning. Furthermore, the source code is documented and available in my personal GitHub, so hopefully it will be useful for other deep learning enthusiasts. I am sure all the work and effort devoted to this thesis will be very helpful for future projects.
To improve the generative models presented in this document, future work could investigate more flexible assumptions on the nature of the distributions, for example not assuming that Σϕ is a diagonal matrix. Finding more structured latent spaces could be an interesting line of research too. It could also be very useful to build a model for heterogeneous data, in which different continuous and discrete distributions can be combined, and to add the temporal dimension.
A. VAE LOSS RESULTS
This appendix includes the table with the training results for the model configurations tested in experiment 1.
B. GENERATED SAMPLES
Fig. B.1. Low quality generated samples with different VAE configurations.
C. KL DIVERGENCE OF GMM
\[
KL(q\,\|\,p) = \int p(x)\log\frac{p(x)}{q(x)}\,dx
= \int p(x)\log p(x)\,dx - \int p(x)\log q(x)\,dx = I_1 + I_2
\]
Consider the case of two univariate variables: let q(x) be a Gaussian mixture model with c components with different means and variances, each of them with probability \(\Pi_i\) (with \(\sum_{i=1}^{c}\Pi_i = 1\)), and let p(x) be a Gaussian variable with a certain mean and variance:
\[
q(x) = \sum_{i=1}^{c}\Pi_i\,\mathcal{N}(\mu_i,\sigma_i), \qquad p(x) = \mathcal{N}(\mu_1,\sigma_1)
\]
\[
I_1 = \int p(x)\log p(x)\,dx = -\frac{1}{2}\big(1+\log(2\pi\sigma_1^2)\big)
\]
However, the integral \(I_2\) has no closed-form solution, since it contains a summation inside a logarithm:
\[
I_2 = -\int p(x)\log\left(\sum_{i=1}^{c}\Pi_i\,\mathcal{N}(\mu_i,\sigma_i)\right)dx
= -\int p(x)\log\left(\sum_{i=1}^{c}\Pi_i\,\frac{1}{\sqrt{2\pi\sigma_i^2}}\,e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}\right)dx
\]
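Although no closed form exists, this term can be approximated numerically by sampling from p(x). A small sketch under the same univariate setting is shown below; here the \(\sigma_i\) are interpreted as standard deviations.

import numpy as np
from scipy.stats import norm

def mc_I2(mu1, sigma1, weights, mus, sigmas, n_samples=100000, rng=np.random):
    """Monte Carlo estimate of I2 = -E_{p(x)}[log q(x)] for a univariate Gaussian mixture q."""
    x = rng.normal(mu1, sigma1, size=n_samples)     # x ~ p(x) = N(mu1, sigma1)
    q = np.zeros(n_samples)
    for w, m, s in zip(weights, mus, sigmas):
        q += w * norm.pdf(x, m, s)                  # mixture density q(x)
    return -np.mean(np.log(q))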
D. GMVAE OBJECTIVE FUNCTION DERIVATIONS
\[
\mathbb{E}_q\left[\log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y)}\right]
= \int_z \int_w \int_y q_{\beta_y,\phi_z,\phi_w}(z,w,y|x)\,\log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y)}\,dy\,dw\,dz
\]
\[
= \int_z \int_w \int_y q_{\phi_z}(z|x)\,q_{\phi_w}(w|x)\,p_\beta(y|w,z)\,\log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y)}\,dy\,dw\,dz
\]
\[
= \int_w q_{\phi_w}(w|x) \int_z q_{\phi_z}(z|x)\left(\sum_{k=1}^{K} p_\beta(y_k=1|w,z)\,\log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y_k=1)}\right) dz\,dw
\]
Writing \(\tilde{\pi}_k = p_\beta(y_k=1|w,z)\), this becomes
\[
\mathbb{E}_q\left[\log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y)}\right]
= \mathbb{E}_{q_{\phi_w}(w|x)\,q_{\phi_z}(z|x)}\left[\sum_{k=1}^{K} \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y_k=1)}\right]
\]
The expectation can be approximated by Monte Carlo sampling with N samples of w and M samples of z:
\[
\mathbb{E}_{q_{\phi_w}(w|x)\,q_{\phi_z}(z|x)}\left[\sum_{k=1}^{K} \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y_k=1)}\right]
\approx \frac{1}{N}\frac{1}{M}\sum_{n=1}^{N}\sum_{m=1}^{M}\sum_{k=1}^{K} \tilde{\pi}_k \log\frac{q_{\phi_z}(z_m|x)}{p_\beta(z_m|w_n,y_k=1)}
\]
Using a single sample of z and w, this reduces to
\[
\sum_{k=1}^{K} \tilde{\pi}_k \log\frac{q_{\phi_z}(z_m|x)}{p_\beta(z_m|w_n,y_k=1)}
= \log q_{\phi_z}(z_m|x)\sum_{k=1}^{K}\tilde{\pi}_k - \sum_{k=1}^{K}\tilde{\pi}_k \log p_\beta(z_m|w_n,y_k=1)
\]
It is known that \(\sum_{k=1}^{K}\tilde{\pi}_k = 1\), then:
\[
\mathbb{E}_{q_{\phi_w}(w|x)\,q_{\phi_z}(z|x)}\left[\sum_{k=1}^{K} \tilde{\pi}_k \log\frac{q_{\phi_z}(z|x)}{p_\beta(z|w,y_k=1)}\right]
= \log q_{\phi_z}(z_m|x) - \sum_{k=1}^{K}\tilde{\pi}_k \log p_\beta(z_m|w_n,y_k=1)
\tag{D.1}
\]
Each of the terms in equation D.1 can be further developed:
\[
q_{\phi_z}(z_m|x) = \mathcal{N}\big(\mu(x),\sigma^2(x)\big)
\;\rightarrow\;
\log q_{\phi_z} = \text{cte} - \frac{1}{2}\log\det\sigma^2(x) - \frac{1}{2\sigma^2(x)}\big(z_m-\mu(x)\big)^2
\]
\[
p_\beta(z_m|w_n,y_k=1) = \mathcal{N}\big(\mu(w)_{y_k},\sigma^2(w)_{y_k}\big)
\;\rightarrow\;
\log p_\beta = \text{cte} - \frac{1}{2}\log\det\sigma^2(w)_{y_k} - \frac{1}{2\sigma^2(w)_{y_k}}\big(z_m-\mu(w)_{y_k}\big)^2
\]
\[
\sum_{k=1}^{K}\tilde{\pi}_k\log p_\beta(z_m|w_n,y_k=1)
= \sum_{k=1}^{K}\left(\tilde{\pi}_k\,\text{cte} - \frac{\tilde{\pi}_k}{2}\log\det\sigma^2(w)_{y_k} - \frac{\tilde{\pi}_k}{2\sigma^2(w)_{y_k}}\big(z_m-\mu(w)_{y_k}\big)^2\right)
\]
\[
= \text{cte} - \frac{1}{2}\sum_{k=1}^{K}\tilde{\pi}_k\log\det\sigma^2(w)_{y_k} - \frac{1}{2}\sum_{k=1}^{K}\frac{\tilde{\pi}_k}{\sigma^2(w)_{y_k}}\big(z_m-\mu(w)_{y_k}\big)^2
\]
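For concreteness, the two terms of equation (D.1) can be evaluated for a single sample \(z_m\) as in the following NumPy sketch, assuming diagonal Gaussians; the variable names are illustrative and not taken from the thesis source code.

import numpy as np

def diag_gauss_logpdf(z, mu, var):
    """Log-density of a diagonal Gaussian N(mu, diag(var)) evaluated at z."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

def z_prior_term(z_m, mu_x, var_x, pi_tilde, mu_w, var_w):
    """Single-sample estimate of E_q[log q(z|x)/p(z|w,y)] as in equation (D.1).

    mu_w, var_w: arrays of shape (K, dim_z) with the cluster means/variances mu(w)_k, sigma^2(w)_k.
    pi_tilde: array of shape (K,) with the posterior cluster probabilities.
    """
    log_q = diag_gauss_logpdf(z_m, mu_x, var_x)
    log_p = np.array([diag_gauss_logpdf(z_m, mu_w[k], var_w[k]) for k in range(len(pi_tilde))])
    return log_q - np.sum(pi_tilde * log_p)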
\[
\mathbb{E}_q\left[\log\frac{q_{\phi_w}(w|x)}{p(w)}\right]
= \int_z \int_w \int_y q_{\phi_z}(z|x)\,q_{\phi_w}(w|x)\,p_\beta(y|w,z)\,\log\frac{q_{\phi_w}(w|x)}{p(w)}\,dy\,dw\,dz
\]
\[
= \int_w q_{\phi_w}(w|x)\log\frac{q_{\phi_w}(w|x)}{p(w)}\left(\int_z q_{\phi_z}(z|x)\left(\sum_{k=1}^{K} p_\beta(y_k=1|w,z)\right)dz\right)dw
\]
As \(\sum_{k=1}^{K} p_\beta(y_k=1|w,z) = 1\) and \(\int_z q_{\phi_z}(z|x)\,dz = 1\), the equation obtained is:
\[
\mathbb{E}_q\left[\log\frac{q_{\phi_w}(w|x)}{p(w)}\right] = KL\big(q_{\phi_w}(w|x)\,\|\,p(w)\big)
\]
\[
\mathbb{E}_q\left[\log\frac{p_\beta(y|w,z)}{p(y)}\right]
= \int_z \int_w \int_y q_{\phi_z}(z|x)\,q_{\phi_w}(w|x)\,p_\beta(y|w,z)\,\log\frac{p_\beta(y|w,z)}{p(y)}\,dy\,dw\,dz
\]
\[
\mathbb{E}_q\left[\log\frac{p_\beta(y|w,z)}{p(y)}\right]
= \mathbb{E}_{q_{\phi_z}(z|x)\,q_{\phi_w}(w|x)}\left[KL\big(p_\beta(y|w,z)\,\|\,p(y)\big)\right]
\]
\[
KL\big(p_\beta(y|w,z)\,\|\,p(y)\big) = \sum_{i=1}^{K} p(y_i=1)\log\frac{p(y_i=1)}{p_\beta(y_i=1|w,z)}
\]
Since the prior over the clusters is uniform, \(p(y_i=1) = 1/K\):
\[
KL\big(p_\beta(y|w,z)\,\|\,p(y)\big) = \frac{1}{K}\sum_{i=1}^{K}\left(\log\frac{1}{K} - \log p_\beta(y_i=1|w,z)\right)
\]
Finally,
\[
KL\big(p_\beta(y|w,z)\,\|\,p(y)\big) = -\log K - \frac{1}{K}\sum_{i=1}^{K}\log p_\beta(y_i=1|w,z)
\]