Lecture 2.3.2 Variational Autoencoders (VAEs)
Building, step by step, the reasoning that leads to VAEs.
Introduction
In the last few years, deep learning based generative models have gained more and more interest, thanks to (and driving) some amazing improvements in the field. Relying on huge amounts of data, well-designed network architectures and smart training techniques, deep generative models have shown an incredible ability to produce highly realistic pieces of content of various kinds, such as images, texts and sounds. Among these deep generative models, two major families stand out and deserve special attention: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).
Face images generated with a Variational Autoencoder (source: Wojciech
Mormul on Github).
In a nutshell, a VAE is an autoencoder whose encoding distribution is regularised during training so that its latent space has good properties allowing us to generate new data. Moreover, the term "variational" comes from the close relation between this regularisation and the variational inference method in statistics.
If these last two sentences summarise pretty well the notion of VAEs, they can also raise a lot of questions. What is an autoencoder? What is the latent space and why regularise it? How do we generate new data from VAEs? What is the link between VAEs and variational inference? In order to describe VAEs as well as possible, we will try to answer all these questions (and many others!) and to provide the reader with as many insights as we can (ranging from basic intuitions to more advanced mathematical details). Thus, the purpose of this post is not only to discuss the fundamental notions Variational Autoencoders rely on, but also to build, step by step and starting from the very beginning, the reasoning that leads to these notions.
Without further ado, let’s (re)discover VAEs together!
Outline
In the first section, we will review some important notions about dimensionality reduction and autoencoders that will be useful for the understanding of VAEs. Then, in the second section, we will show why autoencoders cannot be used to generate new data and will introduce Variational Autoencoders, which are regularised versions of autoencoders making the generative process possible. Finally, in the last section, we will give a more mathematical presentation of VAEs, based on variational inference.
Dimensionality reduction
First, let's call encoder the process that produces the "new features" representation from the "old features" representation (by selection or by extraction) and decoder the reverse process. Dimensionality reduction can then be interpreted as data compression where the encoder compresses the data (from the initial space to the encoded space, also called latent space) whereas the decoder decompresses it. Of course, depending on the initial data distribution, the latent space dimension and the encoder definition, this compression can be lossy, meaning that a part of the information is lost during the encoding process and cannot be recovered when decoding.
The overall dimensionality reduction problem then consists in finding, among a given family of candidates, the encoder/decoder pair (e*, d*) that minimises ε(x, d(e(x))), where ε defines the reconstruction error measure between the input data x and the encoded-decoded data d(e(x)). Notice finally that in the following we will denote N the number of data, n_d the dimension of the initial (decoded) space and n_e the dimension of the reduced (encoded) space.
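To make these notions concrete, here is a minimal NumPy sketch (ours, not from the original lecture) in which the encoder and decoder are simple linear maps obtained from a truncated SVD and the reconstruction error ε is measured as a mean squared error; the variables N, n_d and n_e mirror the notation above and the toy data are an assumption.

import numpy as np

# Toy data: N points in an n_d-dimensional initial space, built to lie close
# to an n_e-dimensional subspace.
N, n_d, n_e = 1000, 20, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, n_e)) @ rng.normal(size=(n_e, n_d)) + 0.05 * rng.normal(size=(N, n_d))

# A PCA-like linear encoder/decoder: project onto the top n_e principal directions.
X_mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - X_mean, full_matrices=False)
W = Vt[:n_e].T                      # (n_d, n_e) encoding matrix

def encode(x):                      # e: initial space -> latent space
    return (x - X_mean) @ W

def decode(z):                      # d: latent space -> initial space
    return z @ W.T + X_mean

# Reconstruction error epsilon(x, d(e(x))), here measured as a mean squared error.
print("reconstruction MSE:", np.mean((X - decode(encode(X))) ** 2))

Because the toy data were built to lie close to a 3-dimensional subspace, the reconstruction error stays small; with a latent dimension n_e smaller than the intrinsic dimension of the data, the compression would become noticeably lossy.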
Principal Component Analysis (PCA) is looking for the best linear subspace using
linear algebra.
Autoencoders
Let’s now discuss autoencoders and see how we can use neural
networks for dimensionality reduction. The general idea of
autoencoders is pretty simple and consists in setting an encoder
and a decoder as neural networks and learning the best encoding-decoding scheme through an iterative optimisation
process. So, at each iteration we feed the autoencoder architecture
(the encoder followed by the decoder) with some data, we compare the
encoded-decoded output with the initial data and backpropagate the
error through the architecture to update the weights of the networks.
Let's first suppose that both our encoder and decoder architectures have only one layer without non-linearity (linear autoencoder). Such an encoder and decoder are then simple linear transformations that can be expressed as matrices. In such a situation, we can see a clear link with PCA in the sense that, just like PCA does, we are looking for the best linear subspace to project data onto with as little information loss as possible when doing so. The encoding and decoding matrices obtained with PCA naturally define one of the solutions we would be satisfied to reach by gradient descent, but we should outline that this is not the only one. Indeed, several bases can be chosen to describe the same optimal subspace and, so, several encoder/decoder pairs can give the optimal reconstruction error. Moreover, for linear autoencoders and contrarily to PCA, the new features we end up with do not have to be independent (no orthogonality constraint in the neural networks).
Link between linear autoencoder and PCA.
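As an illustration (a sketch of ours, not code from the lecture), the following Keras snippet trains a linear autoencoder with a single Dense layer on each side; under the assumptions above it should converge to the same optimal subspace as PCA, although the learned basis need not be orthonormal. The dimensions n_d and n_e follow the notation introduced earlier and the random training data are an assumption.

import numpy as np
import tensorflow as tf

n_d, n_e = 20, 3  # initial and latent dimensions (same notation as above)

# Linear encoder and decoder: one Dense layer each, with no activation.
encoder = tf.keras.Sequential([tf.keras.Input(shape=(n_d,)), tf.keras.layers.Dense(n_e)])
decoder = tf.keras.Sequential([tf.keras.Input(shape=(n_e,)), tf.keras.layers.Dense(n_d)])
autoencoder = tf.keras.Sequential([encoder, decoder])

# Train by gradient descent to minimise the reconstruction error.
autoencoder.compile(optimizer="adam", loss="mse")
X = np.random.default_rng(0).normal(size=(1000, n_d)).astype("float32")
autoencoder.fit(X, X, epochs=10, batch_size=64, verbose=0)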
Now, let's assume that both the encoder and the decoder are deep and non-linear. In such a case, the more complex the architecture is, the more the autoencoder can achieve a high dimensionality reduction while keeping the reconstruction loss low. Intuitively, if our encoder and our decoder have enough degrees of freedom, we can reduce any initial dimensionality to 1. Indeed, an encoder with "infinite power" could theoretically take our N initial data points and encode them as 1, 2, 3, … up to N (or, more generally, as N integers on the real axis) and the associated decoder could make the reverse transformation, with no loss during the process.
When reducing dimensionality, we want to keep the main structure that exists among the data.
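To illustrate this (our own extension of the linear sketch above, with assumed layer sizes), a deep non-linear autoencoder simply replaces each single Dense layer with a stack of layers with non-linear activations:

import tensorflow as tf

n_d, n_e = 20, 3  # same notation as before

# Deep non-linear encoder and decoder: more degrees of freedom than the linear case.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_d,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_e),
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_e,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_d),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")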
Variational Autoencoders
Up to now, we have discussed the dimensionality reduction problem and introduced autoencoders, which are encoder-decoder architectures that can be trained by gradient descent. Let's now make the link with the content generation problem, see the limitations of autoencoders in their current form for this problem and introduce Variational Autoencoders.
We can generate new data by decoding points that are randomly sampled from
the latent space. The quality and relevance of generated data depend on the
regularity of the latent space.
When thinking about it for a minute, this lack of structure among the encoded data in the latent space is pretty normal. Indeed, nothing in the task the autoencoder is trained for enforces such an organisation: the autoencoder is solely trained to encode and decode with as little loss as possible, no matter how the latent space is organised. Thus, if we are not careful about the definition of the architecture, it is natural that, during the training, the network takes advantage of any overfitting possibilities to achieve its task as well as it can… unless we explicitly regularise it!
With this regularisation term, we prevent the model from encoding data far apart in the latent space and encourage the returned distributions to "overlap" as much as possible, satisfying this way the expected continuity and completeness conditions. Naturally, as for any regularisation term, this comes at the price of a higher reconstruction error on the training data. The tradeoff between the reconstruction error and the KL divergence can however be adjusted, and we will see in the next section how the expression of this balance naturally emerges from our formal derivation.
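As a concrete illustration (our own sketch, not part of the lecture), the regularisation term used in practice is the KL divergence between the returned Gaussian N(mu, exp(log_var)) and the standard normal prior N(0, I), which has a closed form; the names mu and log_var below are assumed conventions for the encoder outputs.

import tensorflow as tf

def kl_regularisation(mu, log_var):
    # Closed-form KL divergence KL( N(mu, exp(log_var)) || N(0, I) ),
    # summed over latent dimensions and averaged over the batch.
    kl_per_dim = -0.5 * (1.0 + log_var - tf.square(mu) - tf.exp(log_var))
    return tf.reduce_mean(tf.reduce_sum(kl_per_dim, axis=1))

Minimising this quantity pushes every encoded distribution towards the standard normal prior, which is what produces the overlap described above.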
At this point, we can already notice that the regularisation of the latent space that we lacked in simple autoencoders naturally appears here in the definition of the data generation process: encoded representations z in the latent space are indeed assumed to follow the prior distribution p(z). We can also recall the well-known Bayes theorem that makes the link between the prior p(z), the likelihood p(x|z), and the posterior p(z|x): p(z|x) = p(x|z) p(z) / p(x), where p(x) = ∫ p(x|u) p(u) du.
Let's consider, for now, that f is well defined and fixed. In theory, as we know p(z) and p(x|z), we can use the Bayes theorem to compute p(z|x): this is a classical Bayesian inference problem. However, as we discussed in our previous article, this kind of computation is often intractable (because of the integral in the denominator) and requires the use of approximation techniques such as variational inference.
Note. Here we can mention that p(z) and p(x|z) are both Gaussian distributions. So, if we had E(x|z) = f(z) = z, it would imply that p(z|x) should also follow a Gaussian distribution and, in theory, we could "only" try to express the mean and the covariance matrix of p(z|x) with respect to the means and the covariance matrices of p(z) and p(x|z). However, in practice this condition is not met and we need to use an approximation technique like variational inference, which makes the approach pretty general and more robust to some changes in the hypotheses of the model.
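The intermediate variational inference equations appeared as displayed formulas in the original post and are not reproduced in these notes; as a reference point (a standard result restated here, not a quote from the lecture), the approximation of the posterior is obtained by maximising the following objective over the chosen family of Gaussians q_x(z):

q*_x(z) = argmax over q_x of ( E_{z~q_x}( log p(x|z) ) − KL( q_x(z) || p(z) ) )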
In this objective, we can observe the tradeoff that exists, when approximating the posterior p(z|x), between maximising the likelihood of the "observations" (maximisation of the expected log-likelihood, for the first term) and staying close to the prior distribution (minimisation of the KL divergence between q_x(z) and p(z), for the second term). This tradeoff is natural for Bayesian inference problems and expresses the balance that needs to be found between the confidence we have in the data and the confidence we have in the prior.
So, let's consider that, as we discussed earlier, we can get for any function f in F (each defining a different probabilistic decoder p(x|z)) the best approximation of p(z|x), denoted q*_x(z). Despite its probabilistic nature, we are looking for an encoding-decoding scheme that is as efficient as possible and, so, we want to choose the function f that maximises the expected log-likelihood of x given z when z is sampled from q*_x(z).
In other words, for a given input x, we want to maximise the probability of having x̂ = x when we sample z from the distribution q*_x(z) and then sample x̂ from the distribution p(x|z).
Thus, we are looking for the optimal f* such that f* = argmax over f in F of E_{z~q*_x}( log p(x|z) ). Since p(x|z) is a Gaussian with mean f(z) and fixed covariance, maximising this expected log-likelihood amounts to minimising the expected squared distance between x and f(z).
Contrary to the encoder part, which models p(z|x) and for which we considered a Gaussian with both mean and covariance that are functions of x (g and h), our model assumes for p(x|z) a Gaussian with fixed covariance. The function f of the variable z defining the mean of that Gaussian is modelled by a neural network and can be represented as follows.
Decoder part of the VAE.
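Since the corresponding figure is not reproduced in these notes, here is a minimal Keras sketch (ours, with assumed layer sizes) of such a decoder network: it maps a latent vector z to f(z), the mean of the Gaussian p(x|z), while the fixed covariance is simply a hyperparameter of the loss rather than a network output.

import tensorflow as tf

latent_dim, data_dim = 2, 784  # assumed sizes (e.g. flattened 28x28 images)

# Decoder network modelling f(z), the mean of p(x|z); the covariance cI is fixed.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(data_dim, activation="sigmoid"),  # f(z)
])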
Takeaways
The main takeaways of this article are:
● dimensionality reduction is the process of reducing the number of features that describe some data (either by selecting only a subset of the initial features or by combining them into a reduced number of new features) and, so, can be seen as an encoding process
● autoencoders are neural network architectures composed of both an encoder and a decoder that create a bottleneck for the data to go through, and that are trained to lose a minimal quantity of information during the encoding-decoding process (training by gradient descent iterations with the goal to reduce the reconstruction error)
● due to overfitting, the latent space of an autoencoder can be extremely irregular (close points in the latent space can give very different decoded data, some points of the latent space can give meaningless content once decoded, …) and, so, we can't really define a generative process that simply consists in sampling a point from the latent space and making it go through the decoder to get new data
● variational autoencoders (VAEs) are autoencoders that tackle
the problem of the latent space irregularity by making the
encoder return a distribution over the latent space instead of a
single point and by adding in the loss function a regularisation
term over that returned distribution in order to ensure a better
organisation of the latent space
● assuming a simple underlying probabilistic model to describe
our data, the pretty intuitive loss function of VAEs, composed
of a reconstruction term and a regularisation term, can be
carefully derived, using in particular the statistical technique
of variational inference (hence the name “variational”
autoencoders)
To conclude, we can outline that, over the last few years, GANs have benefited from many more scientific contributions than VAEs. Among other reasons, the higher interest shown by the community for GANs can be partly explained by the higher degree of complexity of the theoretical basis of VAEs (probabilistic model and variational inference) compared to the simplicity of the adversarial training concept that rules GANs. With this post we hope that we managed to share valuable intuitions as well as strong theoretical foundations to make VAEs more accessible to newcomers. However, now that we have discussed both of them in depth, one question remains… are you more GANs or VAEs?
The first term represents the reconstruction likelihood and the other term ensures that our learned distribution q is similar to the true prior distribution p. Thus our total loss consists of two terms, the reconstruction error and the KL-divergence loss:
loss = reconstruction error + KL( q(z|x) || p(z) )
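A minimal sketch of this loss in TensorFlow (our illustration, not the lecture's code; the reconstruction term is written as a pixel-wise binary cross-entropy, a common choice for image data, and mu / log_var are assumed names for the encoder outputs):

import tensorflow as tf

def vae_loss(x, x_hat, mu, log_var):
    # Assumes x and x_hat are flattened images with values in [0, 1], shape (batch, 784).
    eps = 1e-7
    x_hat = tf.clip_by_value(x_hat, eps, 1.0 - eps)
    # Reconstruction term: pixel-wise binary cross-entropy, summed over pixels.
    reconstruction = tf.reduce_mean(
        tf.reduce_sum(-(x * tf.math.log(x_hat) + (1.0 - x) * tf.math.log(1.0 - x_hat)), axis=1))
    # Regularisation term: KL( N(mu, exp(log_var)) || N(0, I) ).
    kl = tf.reduce_mean(
        tf.reduce_sum(-0.5 * (1.0 + log_var - tf.square(mu) - tf.exp(log_var)), axis=1))
    return reconstruction + kl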
Implementation:
In this implementation, we will be using the Fashion-MNIST dataset. This dataset is already available in the keras.datasets API, so we don't need to download or upload it manually.
● python3
# Load the Fashion-MNIST dataset from keras.datasets and scale pixel values to [0, 1].
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
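The full implementation is not reproduced in these notes. As a hedged sketch of how it could continue (architecture sizes and variable names are our own assumptions), the encoder, the reparameterisation step and the decoder can be wired together and trained with the vae_loss function sketched above, using the x_train array loaded here:

import tensorflow as tf

latent_dim = 2

# Encoder: maps a flattened image to the mean and log-variance of q(z|x).
enc_in = tf.keras.Input(shape=(784,))
h = tf.keras.layers.Dense(256, activation="relu")(enc_in)
z_mean = tf.keras.layers.Dense(latent_dim)(h)
z_log_var = tf.keras.layers.Dense(latent_dim)(h)
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var])

# Decoder: maps a latent vector back to a flattened image (the mean f(z)).
dec_in = tf.keras.Input(shape=(latent_dim,))
h = tf.keras.layers.Dense(256, activation="relu")(dec_in)
dec_out = tf.keras.layers.Dense(784, activation="sigmoid")(h)
decoder = tf.keras.Model(dec_in, dec_out)

optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(x):
    with tf.GradientTape() as tape:
        mu, log_var = encoder(x)
        # Reparameterisation trick: z = mu + sigma * epsilon, with epsilon ~ N(0, I).
        epsilon = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * epsilon
        x_hat = decoder(z)
        loss = vae_loss(x, x_hat, mu, log_var)  # loss function sketched above
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

# Training loop on flattened Fashion-MNIST images (x_train loaded above).
dataset = tf.data.Dataset.from_tensor_slices(x_train.reshape(-1, 784)).shuffle(1024).batch(128)
for epoch in range(5):
    for batch in dataset:
        loss = train_step(batch)
    print(f"epoch {epoch + 1}, loss {float(loss):.2f}")

After training, new images can be generated by sampling z from the standard normal prior and passing it through decoder(z), which is exactly the generative process discussed earlier.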