Unit 5
An autoencoder is a type of artificial neural network used to learn data encodings in an unsupervised
manner. The aim of an autoencoder is to learn a lower-dimensional representation (encoding) of
higher-dimensional data, typically for dimensionality reduction, by training the network to capture
the most important parts of the input image.
Properties of Autoencoders
An autoencoder learns two functions:
1. An encoding function that transforms the input data, and a decoding function that
recreates the input data from the encoded representation. The autoencoder learns an
efficient representation (encoding) for a set of data, typically for dimensionality
reduction.
2. Data-specific: An autoencoder can only meaningfully compress data similar to what it was trained on, unlike a standard compression algorithm.
3. Lossy: The output of an autoencoder will not be exactly the same as the input; it will be a close but degraded representation.
4. Unsupervised: No labels are needed, since the autoencoder is trained on the input data itself.
1. Encoder: A module that compresses the input data into an encoded representation that is typically
several orders of magnitude smaller than the input data.
2. Bottleneck: A module that contains the compressed knowledge representations and is therefore
the most important part of the network.
3. Decoder: A module that helps the network "decompress" the knowledge representation and
reconstructs the data back from its encoded form. The output is then compared with the ground truth.
The architecture as a whole consists of these three components: encoder, bottleneck, and decoder.
Encoder
The encoder is a set of convolutional blocks followed by pooling modules that compress the input to
the model into a compact section called the bottleneck. The bottleneck is followed by the decoder, which
consists of a series of upsampling modules that bring the compressed features back into the form of an
image. In the case of simple autoencoders, the output is expected to be the same as the input data with
reduced noise. For variational autoencoders, however, the output is a completely new image, formed from the
information the model has been provided as input.
Bottleneck
The most important part of the neural network, and ironically the smallest one, is the bottleneck. The
bottleneck exists to restrict the flow of information to the decoder from the encoder, thus allowing
only the most vital information to pass through. Since the bottleneck is designed in such a way that
the maximum information possessed by an image is captured in it, we can say that the bottleneck
helps us form a knowledge representation of the input.
Thus, the encoder-decoder structure helps us extract the most from an image in the form of data and
establish useful correlations between various inputs within the network. A bottleneck, as a compressed
representation of the input, further prevents the neural network from memorising the input and
overfitting on the data. However, very small bottlenecks restrict the amount of information storable,
which increases the chances of important information slipping out through the pooling layers of the
encoder.
Decoder
Finally, the decoder is a set of upsampling and convolutional blocks that reconstructs the bottleneck's
output.
Since the input to the decoder is a compressed knowledge representation, the decoder serves as a
“decompressor” and builds back the image from its latent attributes.
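To make the encoder-bottleneck-decoder structure concrete, here is a minimal sketch of a convolutional autoencoder in PyTorch. The 1x28x28 input size, the layer widths, and the bottleneck dimension of 32 are illustrative assumptions, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Minimal convolutional autoencoder: encoder -> bottleneck -> decoder."""
    def __init__(self, bottleneck_dim=32):
        super().__init__()
        # Encoder: convolutions that progressively compress a 1x28x28 image
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # -> 16x14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # -> 32x7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, bottleneck_dim),                  # bottleneck
        )
        # Decoder: upsampling (transposed convolutions) back to 1x28x28
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 32 * 7 * 7),
            nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # -> 16x14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),   # -> 1x28x28
            nn.Sigmoid(),  # outputs in [0, 1], matching normalised image inputs
        )

    def forward(self, x):
        z = self.encoder(x)      # compressed knowledge representation
        return self.decoder(z)   # reconstruction of the input

# quick shape check with random data
model = ConvAutoencoder()
x = torch.randn(8, 1, 28, 28)
print(model(x).shape)  # torch.Size([8, 1, 28, 28])
```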
Before training an autoencoder, there are a few hyperparameters to set:
1. Code size: The code size or the size of the bottleneck is the most important hyperparameter
used to tune the autoencoder. The bottleneck size decides how much the data has to be
compressed. This can also act as a regularisation term.
2. Number of layers: The depth of the encoder and the decoder is another hyperparameter; deeper
networks can model more complex encodings but are harder to train.
3. Number of nodes per layer: The number of nodes per layer defines the weights we use per
layer. Typically, the number of nodes decreases with each subsequent layer in the
autoencoder as the input to each of these layers becomes smaller across the layers.
4. Reconstruction loss: The loss function we use to train the autoencoder is highly dependent
on the type of input and output we want the autoencoder to adapt to. If we are working with
image data, the most popular loss functions for reconstruction are MSE loss and L1 loss. In
case the inputs and outputs are within the range [0, 1], as in MNIST, we can also make use
of binary cross-entropy as the reconstruction loss. A sketch of how the reconstruction loss is
used during training is given below.
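The sketch below shows how the reconstruction loss drives training; the tiny fully connected autoencoder, the layer sizes, and the random stand-in batch are assumptions for illustration. With inputs in [0, 1], nn.MSELoss could be swapped for nn.L1Loss or nn.BCELoss.

```python
import torch
import torch.nn as nn

# Tiny fully connected autoencoder on flattened 28x28 images (illustrative sizes).
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32),                 # bottleneck (code size)
    nn.ReLU(),
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid()   # outputs in [0, 1]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()  # swap for nn.L1Loss() or nn.BCELoss() with [0, 1] data

batch = torch.rand(64, 784)  # stand-in for a real data loader
for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(batch), batch)  # the target is the input itself
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: reconstruction loss = {loss.item():.4f}")
```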
The first applications date back to the 1980s. Initially used for dimensionality reduction and feature
learning, the autoencoder concept has evolved over the years and is now widely used for learning
generative models of data.
1. Undercomplete autoencoders
2. Sparse autoencoders
3. Contractive autoencoders
4. Denoising autoencoders
5. Variational autoencoders
1. Undercomplete autoencoders
An undercomplete autoencoder takes in an image and tries to predict the same image as output, thus
reconstructing the image from the compressed bottleneck region. Undercomplete autoencoders are
truly unsupervised as they do not take any form of label; the target is the same as the input. The
primary use of such autoencoders is the generation of the latent space, or bottleneck, which
forms a compressed substitute of the input data and can be easily decompressed back with the help
of the network when needed. This form of compression of the data can be modeled as a form
of dimensionality reduction.
When we think of dimensionality reduction, we tend to think of methods like PCA (Principal
Component Analysis) that form a lower-dimensional hyperplane to represent data that lives in a higher-
dimensional space without losing too much information. However, PCA can only build linear relationships.
As a result, it is put at a disadvantage compared with methods like undercomplete autoencoders, which
can learn non-linear relationships and, therefore, perform better in dimensionality reduction. This form
of nonlinear dimensionality reduction, where the autoencoder learns a non-linear manifold, is also termed
manifold learning. Effectively, if we remove all non-linear activations from an undercomplete
autoencoder and use only linear layers, we reduce the undercomplete autoencoder to something
that works on an equal footing with PCA. The loss function used to train an undercomplete
autoencoder is called the reconstruction loss, as it is a check of how well the image has been
reconstructed from the input data.
Although the reconstruction loss can be anything depending on the input and output, we will use an
L1 loss (also called the norm loss) to depict the term:

L(x, x̂) = |x − x̂|

where x̂ represents the predicted output and x represents the ground truth.
As the loss function has no explicit regularisation term, the only method to ensure that the model is
not memorising the input data is by regulating the size of the bottleneck and the number of hidden
layers within this part of the network—the architecture.
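To illustrate the relationship with PCA, the sketch below compares PCA with a purely linear undercomplete autoencoder on synthetic data; the data, dimensions, and training settings are assumptions for illustration. Adding non-linear activations is what lets the autoencoder go beyond PCA.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# Synthetic 20-dimensional data lying near a 5-dimensional subspace (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 20)) + 0.01 * rng.normal(size=(1000, 20))
X = X.astype(np.float32)

# PCA baseline: linear projection to 5 components and back.
pca = PCA(n_components=5).fit(X)
pca_err = np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2)

# Linear undercomplete autoencoder: no activations, so it can at best match PCA.
ae = nn.Sequential(nn.Linear(20, 5), nn.Linear(5, 20))
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
Xt = torch.from_numpy(X)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(ae(Xt), Xt)
    loss.backward()
    opt.step()

print(f"PCA reconstruction MSE:                {pca_err:.5f}")
print(f"Linear autoencoder reconstruction MSE: {loss.item():.5f}")
# Inserting non-linear activations (e.g. nn.ReLU between the layers) lets the
# autoencoder model curved manifolds, which PCA cannot.
```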
2. Sparse autoencoders
Sparse autoencoders are similar to the undercomplete autoencoders in that they use the same image
as input and ground truth. However—
While undercomplete autoencoders are regulated and fine-tuned by adjusting the size of the
bottleneck, the sparse autoencoder is regulated by constraining the number of active nodes at each hidden layer.
Since it is not possible to design a neural network that has a flexible number of nodes at its hidden
layers, sparse autoencoders work by penalizing the activation of some neurons in hidden layers.
In other words, the loss function has a term that calculates the number of neurons that have been
activated and provides a penalty that is directly proportional to that.
This penalty, called the sparsity function, prevents the neural network from activating more neurons
and serves as a regularizer.
While typical regularizers work by creating a penalty on the size of the weights at the nodes, the sparsity
regularizer works by creating a penalty on the number of nodes activated. This form of regularization
allows the network to have nodes in hidden layers dedicated to finding specific features in images during
training, and treats the regularization problem as separate from the latent space problem.
We can thus set the latent space dimensionality at the bottleneck without worrying about regularization.
There are two primary ways in which the sparsity regularizer term can be incorporated into the loss
function.
L1 Loss: Here, we add the magnitude of the sparsity regularizer as we do for general regularizers:

L(x, x̂) + λ Σ_i |a_i^(h)|

where h represents the hidden layer, i represents the image in the minibatch, and a represents the
activation.
KL-Divergence: In this case, we consider the activations over a collection of samples at once rather
than summing them as in the L1 Loss method. We constrain the average activation of each neuron
over this collection.
Considering the ideal distribution as a Bernoulli distribution, we include KL divergence within the
loss to reduce the difference between the current distribution of the activations and the ideal
(Bernoulli) distribution:

L(x, x̂) + Σ_j KL(ρ || ρ̂_j),   where ρ̂_j = (1/m) Σ_i [a_i^(h)(x)]

Here ρ is the desired average activation, the subscript j denotes the specific neuron in layer h, and the
average is taken over a collection of m samples, each denoted as x.
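A sketch of both sparsity penalties added to a reconstruction loss is shown below; the layer sizes, the target sparsity ρ, and the penalty coefficients are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a sparsity penalty added to the reconstruction loss.
# Encoder/decoder sizes and the coefficients are illustrative assumptions.
encoder = nn.Sequential(nn.Linear(784, 128), nn.Sigmoid())  # hidden activations in (0, 1)
decoder = nn.Linear(128, 784)

def sparse_loss(x, rho=0.05, l1_coeff=1e-4, kl_coeff=1e-3):
    a = encoder(x)                       # hidden-layer activations a^(h)
    x_hat = decoder(a)
    recon = F.mse_loss(x_hat, x)

    # L1 penalty: lambda * sum_i |a_i|, averaged over the batch
    l1_penalty = a.abs().sum(dim=1).mean()

    # KL penalty: compare the average activation rho_hat_j of each neuron
    # (over the batch of m samples) with the target sparsity rho.
    rho_hat = a.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    kl_penalty = kl.sum()

    return recon + l1_coeff * l1_penalty + kl_coeff * kl_penalty

x = torch.rand(64, 784)
print(sparse_loss(x))
```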
3. Contractive autoencoders
A contractive autoencoder learns a representation that is robust to small changes in the input by adding
a penalty on how sensitive the hidden layer is to the input. Mathematically, the loss combines the
reconstruction loss with the Frobenius norm of the Jacobian of the hidden layer with respect to the input:

L(x, x̂) + λ ||∂h(x)/∂x||_F^2
An important thing to note in the loss function (formed from the norm of the derivatives and the
reconstruction loss) is that the two terms contradict each other. While the reconstruction loss wants
the model to tell differences between two inputs and observe variations in the data, the Frobenius
norm of the derivatives says that the model should be able to ignore variations in the input data.
Putting these two contradictory conditions into one loss function enables us to train a network where
the hidden layers now capture only the most essential information. This information is necessary to
separate images and ignore information that is non-discriminatory in nature, and therefore, not
important.
Here h is the hidden layer for which the gradient is calculated with respect to the input x, i.e. ∂h(x)/∂x.
The gradient is summed over all training samples, and a Frobenius norm of the same is taken.
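Below is a sketch of the contractive penalty for a single sigmoid encoder layer, for which the Frobenius norm of the Jacobian has a simple closed form; the layer sizes and the coefficient λ are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# For h = sigmoid(W x + b), the Jacobian is dh/dx = diag(h * (1 - h)) W, so
# ||J||_F^2 = sum_j (h_j * (1 - h_j))^2 * sum_i W_ji^2  (layer sizes are illustrative).
W = nn.Parameter(torch.randn(128, 784) * 0.01)
b = nn.Parameter(torch.zeros(128))
decoder = nn.Linear(128, 784)

def contractive_loss(x, lam=1e-4):
    h = torch.sigmoid(x @ W.t() + b)        # hidden representation
    x_hat = decoder(h)
    recon = F.mse_loss(x_hat, x)
    dh = (h * (1 - h)) ** 2                 # squared derivative of the sigmoid
    w_sq = (W ** 2).sum(dim=1)              # sum_i W_ji^2 for each hidden unit j
    frob = (dh * w_sq).sum(dim=1).mean()    # Frobenius norm term, averaged over the batch
    return recon + lam * frob

x = torch.rand(32, 784)
print(contractive_loss(x))
```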
Applications of autoencoders
Now that you understand various types of autoencoders, let’s summarize some of their most common
use cases.
1. Dimensionality reduction
Undercomplete autoencoders are the ones typically used for dimensionality reduction.
They can be used as a pre-processing step for dimensionality reduction as they can perform fast and
accurate dimensionality reduction without losing much information. Furthermore, while
dimensionality reduction procedures like PCA can only perform linear dimensionality reduction,
undercomplete autoencoders can perform large-scale non-linear dimensionality reduction.
2. Image denoising
Autoencoders like the denoising autoencoder can be used for performing efficient and highly accurate
image denoising.
Unlike traditional methods of denoising, autoencoders do not search for noise; they extract the image
from the noisy data fed to them by learning a representation of it. The representation is
then decompressed to form a noise-free image. Denoising autoencoders can thus denoise complex
images that cannot be denoised via traditional methods.
3. Generation of image and time series data
Variational autoencoders can be used to generate both image and time series data.
The parameterized distribution at the bottleneck of the autoencoder can be randomly sampled to
generate discrete values for latent attributes, which can then be forwarded to the decoder, leading to
the generation of image data. VAEs can also be used to model time series data like music.
4. Anomaly detection
For example, consider an autoencoder that has been trained on a specific dataset P. For any image
sampled from the training dataset, the autoencoder is bound to give a low reconstruction loss and is
supposed to reconstruct the image as it is. For any image which is not present in the training dataset,
however, the autoencoder cannot perform the reconstruction, as the latent attributes are not adapted
for a specific image that has never been seen by the network. As a result, the outlier image gives off
a very high reconstruction loss and can easily be identified as an anomaly with the help of a proper
threshold.
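A minimal sketch of this thresholding idea is shown below; the untrained stand-in autoencoder, the random data, and the mean-plus-three-standard-deviations threshold are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Flag inputs whose reconstruction error exceeds a threshold chosen from the
# training data. `model` would be a trained autoencoder; an untrained stand-in
# is used here just to show the mechanics.
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784), nn.Sigmoid())

def reconstruction_error(x):
    with torch.no_grad():
        x_hat = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)       # per-sample MSE

train_data = torch.rand(1000, 784)              # stand-in for dataset P
errors = reconstruction_error(train_data)
threshold = errors.mean() + 3 * errors.std()    # e.g. mean + 3 standard deviations

new_samples = torch.rand(10, 784)
is_anomaly = reconstruction_error(new_samples) > threshold
print(is_anomaly)
```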
Autoencoders can be used for image denoising, image compression, and, in some cases, even
generation of image data.
While autoencoders might seem easy at first glance (as they have a very simple theoretical
background), making them learn a representation of the input that is meaningful is quite
difficult.
Autoencoders like the undercomplete autoencoder and the sparse autoencoder do not have
large-scale applications in computer vision compared to VAEs and DAEs, which are still widely
used (VAEs were proposed in 2013 by Kingma et al.).
4. Denoising autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from an image.
As opposed to autoencoders we’ve already covered, this is the first of its kind that does not have the
input image as its ground truth.
In denoising autoencoders, we feed a noisy version of the image, where noise has been added via
digital alterations. The noisy image is fed to the encoder-decoder architecture, and the output is
compared with the ground truth image.
The denoising autoencoder gets rid of noise by learning a representation of the input from which the noise
can be filtered out easily. While removing noise directly from the image seems difficult, the
autoencoder performs this by mapping the input data into a lower-dimensional manifold (as in
undercomplete autoencoders), where filtering of noise becomes much easier. Essentially, denoising
autoencoders work with the help of non-linear dimensionality reduction. The loss function generally
used in these types of networks is the L2 or L1 loss.
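A sketch of this training setup is shown below: the model receives the noisy image, but the loss is computed against the clean image. The architecture, noise level, and random stand-in data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Denoising setup: the encoder-decoder sees a noisy image but is trained
# against the clean image (architecture and noise level are illustrative).
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(64, 784)                                   # stand-in for clean images
noisy = (clean + 0.2 * torch.randn_like(clean)).clamp(0, 1)   # digitally added noise

for step in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy), clean)  # ground truth is the clean image
    loss.backward()
    optimizer.step()
print(loss.item())
```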
5. Variational autoencoders
Standard autoencoders learn to represent the input just in a compressed form called
the latent space or the bottleneck. Therefore, the latent space formed after training the model is not
necessarily continuous and, in effect, might not be easy to interpolate. A standard autoencoder
would learn a fixed set of latent attribute values for a given input.
While these attributes explain the image and can be used in reconstructing the image from the
compressed latent space, they do not allow the latent attributes to be expressed in a probabilistic
fashion. Variational autoencoders deal with this specific topic and express their latent attributes as a
probability distribution, leading to the formation of a continuous latent space that can be easily
sampled and interpolated. When fed the same input, a variational autoencoder would instead construct
each latent attribute as a probability distribution with a mean and a variance.
The latent attributes are then sampled from the latent distribution thus formed and fed to the decoder,
reconstructing the input. The motivation behind expressing the latent attributes as a probability
distribution can be very easily understood via statistical expressions. We aim at identifying the
characteristics of the latent vector z that reconstructs the output given a particular input. Effectively,
we want to study the characteristics of the latent vector given a certain output x, i.e. p(z|x).
While estimating this distribution directly is mathematically intractable, a much simpler and easier
option is to build a parameterized model that can estimate the distribution for us. It does so by
minimizing the KL divergence between the original distribution and our parameterized one.
Expressing the parameterized distribution as q, we can infer the possible latent attributes used in the
image reconstruction. Assuming the prior over z to be a multivariate Gaussian, we can build a
parameterized distribution containing two parameters, the mean and the variance. The
corresponding distribution is then sampled and fed to the decoder, which then proceeds to reconstruct
the input from the sample points. While this seems easy in theory, it becomes impossible to implement
because backpropagation cannot be defined for a random sampling process performed before feeding
the data to the decoder. To get past this hurdle, we use the reparameterization trick, a cleverly defined
way to bypass the sampling process in the neural network. In the reparameterization trick, we
randomly sample a value ε from a unit Gaussian, scale it by the latent distribution's standard deviation σ,
and shift it by the mean μ of the same. The sampling is now done outside the backpropagation pipeline,
and the sampled value ε acts just like another input to the model that is fed at the bottleneck.
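A minimal sketch of the reparameterization step is shown below; the 784-dimensional input and 16-dimensional latent space are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Reparameterization trick at the bottleneck of a VAE: the encoder outputs a mean
# and a log-variance per latent dimension; epsilon is sampled outside the
# backpropagated graph. Sizes are illustrative.
encoder = nn.Linear(784, 2 * 16)   # 16 latent dims: first half mu, second half log sigma^2

def reparameterize(x):
    stats = encoder(x)
    mu, log_var = stats.chunk(2, dim=1)
    eps = torch.randn_like(mu)               # epsilon ~ N(0, I), treated as an extra input
    z = mu + eps * torch.exp(0.5 * log_var)  # scale by sigma, shift by mu
    return z, mu, log_var

x = torch.rand(8, 784)
z, mu, log_var = reparameterize(x)
print(z.shape)   # torch.Size([8, 16])
```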
Variational Autoencoder
I am assuming that the reader is already familiar with the working of a vanilla autoencoder. We
know that we can use an autoencoder to encode an input image to a much smaller dimensional
representation which can store latent information about the input data distribution. But in a
vanilla autoencoder, the encoded vector can only be mapped to the corresponding input using a
decoder. It certainly can't be used to generate similar images with some variability. To achieve
this, the model needs to learn the probability distribution of the training data. The VAE is one of the
most popular approaches to learning complicated data distributions, such as images, using neural
networks in an unsupervised fashion. It is a probabilistic graphical model rooted in Bayesian
inference i.e., the model aims to learn the underlying probability distribution of the training data
so that it could easily sample new data from that learned distribution. The idea is to learn a low-
dimensional latent representation of the training data called latent variables (variables which are
not directly observed but are rather inferred through a mathematical model) which we assume
to have generated our actual training data. These latent variables can store useful information
about the type of output the model needs to generate. The probability distribution of latent
variables z is denoted by P(z). A Gaussian distribution is selected as a prior to learn the
distribution P(z) so as to easily sample new data points during inference time.
Now the primary objective is to model the data with some parameters that maximize the
likelihood of the training data X. In short, we are assuming that a low-dimensional latent vector has
generated our data x (x ∈ X), and we can map this latent vector to data x using a deterministic
function f(z;θ) parameterized by theta, which we need to evaluate (see fig. 1[1]). Under this
generative process, our aim is to maximize the probability of each data point in X, which is given as

P(X) = ∫ P(X|z; θ) P(z) dz        (1)
The intuition behind this maximum likelihood estimation is that if the model can generate
training samples from these latent variables then it can also generate similar samples with some
variations. In other words, if we sample a large number of latent variables from P(z) and generate
x from these variables then the generated x should match the data distribution Pdata(x). Now
we have two questions which we need to answer. How to capture the distribution of latent
variables and how to integrate Equation 1 over all the dimensions of z?
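In practice, the integral in Equation 1 is intractable, so VAEs maximize a tractable lower bound on log P(X) (the ELBO), consisting of a reconstruction term and a KL term between q(z|x) and the Gaussian prior P(z). Below is a minimal sketch of that objective; the 16-dimensional latent space and the small fully connected encoder/decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# VAE objective sketch: reconstruction term plus the closed-form KL divergence
# between q(z|x) = N(mu, sigma^2) and the prior P(z) = N(0, I).
encoder = nn.Linear(784, 2 * 16)                          # outputs mu and log sigma^2
decoder = nn.Sequential(nn.Linear(16, 784), nn.Sigmoid())

def vae_loss(x):
    mu, log_var = encoder(x).chunk(2, dim=1)
    eps = torch.randn_like(mu)
    z = mu + eps * torch.exp(0.5 * log_var)                          # reparameterization
    x_hat = decoder(z)
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")        # -log P(x|z) for [0,1] data
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())   # KL(q(z|x) || N(0, I))
    return recon + kl

x = torch.rand(8, 784)
print(vae_loss(x))
```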
Generative Adversarial Networks
Adversarial training has been called the coolest thing since sliced bread, and seeing the popularity of
Generative Adversarial Networks and the quality of the results they produce, most of us
would agree. Adversarial training has completely changed the way we teach neural
networks to do a specific task. Generative Adversarial Networks don't work with any explicit
density estimation like Variational Autoencoders. Instead, they are based on a game-theoretic approach
with the objective of finding a Nash equilibrium between two networks, the Generator and the
Discriminator. The idea is to sample from a simple distribution like a Gaussian and then learn to
transform this noise to the data distribution using universal function approximators such as neural
networks. This is achieved by adversarial training of these two networks. A generator model G
learns to capture the data distribution and a discriminator model D estimates the probability that
a sample came from the data distribution rather than model distribution. Basically the task of
the Generator is to generate natural looking images and the task of the Discriminator is to decide
whether the image is fake or real. This can be thought of as a mini-max two player game where
the performance of both networks improves over time. In this game, the generator tries to
fool the discriminator by generating images that look as real as possible, and the discriminator tries not
to get fooled by the generator by improving its discriminative capability. The image below shows
the basic architecture of a GAN.
Fig.3. Building block of Generative Adversarial Network
We define a prior on the input noise variables P(z), and the generator then maps this to the data
distribution using a complex differentiable function with parameters θg. In addition to this, we
have another network called the Discriminator, which takes an input x and, using another
differentiable function with parameters θd, outputs a single scalar value denoting the probability
that x comes from the true data distribution Pdata(x). The objective function of the GAN is
defined as

min_G max_D V(D, G) = E_(x∼Pdata(x))[log D(x)] + E_(z∼P(z))[log(1 − D(G(z)))]
In the above equation, if the input to the Discriminator comes from the true data distribution, then
D(x) should output 1 to maximize the above objective function w.r.t. D, whereas if the image has
been generated by the Generator, the Generator wants D(G(z)) to output 1 so as to minimize the objective
function w.r.t. G. The latter basically implies that G should generate images realistic enough to
fool D. We maximize the above function w.r.t. the parameters of the Discriminator using gradient
ascent and minimize the same w.r.t. the parameters of the Generator using gradient descent. But there
is a problem in optimizing the generator objective. At the start of the game, when the generator
hasn't learned anything, the gradient is usually very small, and when it is doing very well, the
gradients are very high (see Fig. 4). But we want the opposite behaviour. We therefore maximize
E[log D(G(z))] rather than minimizing E[log(1 − D(G(z)))].
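Below is a minimal sketch of one such training step in PyTorch; the tiny fully connected generator and discriminator and the random stand-in data are assumptions for illustration. The discriminator step ascends the objective, while the generator step maximizes E[log D(G(z))] by minimizing binary cross-entropy against the "real" label.

```python
import torch
import torch.nn as nn

# One GAN training step with the non-saturating generator loss.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, 784) * 2 - 1        # stand-in for real images in [-1, 1]
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: push D(real) -> 1 and D(G(z)) -> 0 (ascent on the objective).
z = torch.randn(32, 64)
opt_D.zero_grad()
d_loss = bce(D(real), ones) + bce(D(G(z).detach()), zeros)
d_loss.backward()
opt_D.step()

# Generator step: maximize log D(G(z)) by minimizing BCE against the "real" label.
z = torch.randn(32, 64)
opt_G.zero_grad()
g_loss = bce(D(G(z)), ones)               # equivalent to maximizing E[log D(G(z))]
g_loss.backward()
opt_G.step()
print(d_loss.item(), g_loss.item())
```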
Fig.4. Cost for the Generator as a function of Discriminator response on the generated
image
One of the cool things about GANs is that they can be trained even with relatively little training data.
Indeed, the results of GANs are promising, but the training procedure is not trivial, especially
setting up the hyperparameters of the network. Moreover, GANs are difficult to optimize as they
don't converge easily. Of course, there are some tips and tricks to hack GANs, but they may not
always help. Also, we don't have any criteria for the
quantitative evaluation of the results except to check whether the generated images are
perceptually realistic or not.
The generator works from random noise, while the discriminator differentiates between
fake and real samples. As more samples are classified, the generator uses the
feedback from the discriminator to improve its fake samples until real and fake
samples can no longer be easily distinguished. To read more about GANs, you can refer to the
article by Taru Jain from OpenGenus, Beginner's Guide to Generative Adversarial Networks,
which includes a demo.
There are multiple types of GANs that serve different applications, but in this article we are
only going to discuss some of the important ones:
Vanilla GAN
Conditional GAN (CGAN)
Deep Convolutional GAN (DCGAN)
CycleGAN
Generative Adversarial Text to Image Synthesis
StyleGAN
1. Vanilla GAN - The Vanilla GAN is the simplest type of GAN, made up of a generator and a
discriminator, where image generation and classification are done by the generator and
discriminator internally using multilayer perceptrons. The generator captures the data
distribution, while the discriminator tries to find the probability of the input belonging to
a certain class. Finally, the feedback is sent to both the generator and discriminator after
calculating the loss function, and the effort to minimize this loss drives the training.
2. Conditional GAN (CGAN) - In this GAN, the generator and discriminator are both provided
with additional information, which could be a class label or data from another modality. As the name
suggests, the additional information helps the discriminator in finding the conditional probability instead
of the joint probability.
The loss function of the conditional GAN is as below:

min_G max_D V(D, G) = E_(x∼Pdata(x))[log D(x|y)] + E_(z∼P(z))[log(1 − D(G(z|y)))]
3. Deep Convolutional GAN (DCGAN) - This is the first GAN where the generator used a deep
convolutional network, hence generating high-resolution, high-quality images.
ReLU activation is used in all generator layers except the last one, where Tanh
activation is used, while all discriminator layers use the Leaky ReLU activation
function. The Adam optimizer is used with a learning rate of 0.0002.
The generator of this GAN produces images of 64 × 64 resolution. A sketch of such a generator
is given below.
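The sketch below follows the description above: transposed convolutions with ReLU in all generator layers except the last, which uses Tanh, producing 64 × 64 images. The channel counts are illustrative assumptions, while the activations, batch normalization, and Adam settings (learning rate 0.0002, β1 = 0.5) follow the DCGAN paper.

```python
import torch
import torch.nn as nn

# DCGAN-style generator: upsample a latent vector to a 3x64x64 image.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(),  # 1x1  -> 4x4
    nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),  # 4x4  -> 8x8
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 8x8  -> 16x16
    nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(),  # 16x16 -> 32x32
    nn.ConvTranspose2d(64, 3, 4, 2, 1),    nn.Tanh(),                       # 32x32 -> 64x64
)
optimizer = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))

z = torch.randn(16, 100, 1, 1)    # latent noise vectors
print(generator(z).shape)         # torch.Size([16, 3, 64, 64])
```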
4. CycleGAN - This GAN is made for image-to-image translation, meaning one image
is mapped onto another image. For example, if summer and winter images are made to
undergo the process of image-to-image translation, we find a mapping function that could
convert summer images into winter images and vice versa by adding or removing
features according to the mapping function, such that the predicted output and actual
output have minimal loss.
5. Generative Adversarial Text to Image Synthesis - Here the GAN is capable of finding
an image from the dataset that is closest to a text description and generating similar images.
The GAN architecture is given below:
As you can see, the generator network tries to generate images based on the description, and the
discrimination is done by the discriminator based on the features mentioned in the text description.
6. StyleGAN - While other GANs focused on improving the discriminator, in this case we improve
the generator. This GAN generates images by taking a reference (style) picture.
As you can see in the figure below, the StyleGAN architecture consists of a mapping network that
maps the input to an intermediate latent space; the intermediate representation is then processed using
AdaIN after each layer, and there are approximately 18 convolutional layers.
StyleGAN uses AdaIN, or Adaptive Instance Normalization, which is defined as

AdaIN(x_i, y) = y_(s,i) · (x_i − μ(x_i)) / σ(x_i) + y_(b,i)
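A minimal sketch of the AdaIN operation is shown below; the tensor shapes are illustrative assumptions.

```python
import torch

# Adaptive Instance Normalization (AdaIN): each feature map x_i is normalised by
# its own mean and standard deviation, then re-scaled and shifted by a
# style-derived scale y_s and bias y_b.
def adain(x, y_scale, y_bias, eps=1e-5):
    # x: (batch, channels, H, W); y_scale, y_bias: (batch, channels)
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True)
    normalised = (x - mu) / (sigma + eps)
    return y_scale[:, :, None, None] * normalised + y_bias[:, :, None, None]

x = torch.randn(2, 8, 16, 16)
y_s, y_b = torch.rand(2, 8), torch.rand(2, 8)
print(adain(x, y_s, y_b).shape)   # torch.Size([2, 8, 16, 16])
```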
Applications:
1. Firstly, GANs can be used as a data augmentation technique, where the generator
takes the training dataset and produces multiple new images by
applying some changes.
2. GANs are also used to improve the resolution of any input image.
3. Just like filters used in Snapchat, a filter could be applied to see what a place might
look like in summer, winter, spring or autumn; many more conditions could be
applied, and that's where deep ConvNets play their role.
4. GANs are used to convert semantics into images and better understand the
visualizations done by the machine.
Generative models are forms of Artificial Intelligence (AI) and Machine Learning (ML) that
use deep neural networks to learn the distribution of complex training data sets. This
knowledge facilitates the generation of new data by modeling the probability of the next item
in a sequence. Applications include natural language processing, speech processing, and
computer vision.
To create more authentic output from your generative model, you can use Generative
Adversarial Networks (GANs), in which a generator creates synthetic training data that trains a
second, competing neural network. The generated instances become negative training
examples for the discriminator. By learning to distinguish the generator's fake data from actual
data, generating more plausible and original new data becomes possible.
Different algorithms are applicable depending on the application of a deep generative model.
These include the following.
Variational Autoencoders
Variational autoencoders can learn to reconstruct and generate new samples from a provided
dataset. By utilizing a latent space, variational autoencoders can represent data continuously
and smoothly. This enables the generation of variations of the input data with smooth
transitions.
Autoregressive Models
An autoregressive model is a statistical model used to understand and predict future values in
a time series based on past values.
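As a minimal illustration, the sketch below fits an AR(2) model to a synthetic series by least squares; the data and coefficients are made up for the example.

```python
import numpy as np

# Autoregressive model AR(2): each value is predicted as a linear combination
# of the two previous values, with coefficients fitted by least squares.
rng = np.random.default_rng(0)
series = np.zeros(500)
for t in range(2, 500):
    series[t] = 0.6 * series[t - 1] - 0.2 * series[t - 2] + rng.normal(scale=0.1)

# Lagged design matrix: predict series[t] from series[t-1] and series[t-2].
X = np.column_stack([series[1:-1], series[:-2]])
y = series[2:]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated AR coefficients:", coeffs)          # close to [0.6, -0.2]

next_value = coeffs @ np.array([series[-1], series[-2]])
print("one-step-ahead prediction:", next_value)
```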
Energy-Based Models
An energy-based model is a generative model with roots in statistical physics. After learning
the data distribution of a training data set, the generative model can produce other datasets
matching the data distribution.
Score-Based Models
Score-based generative models estimate the scores from the training data, allowing the model
to navigate the data space according to the learned distribution and generate similar new data.
Below are some use cases for deep generative models being applied in the real world today:
Autonomous vehicle systems use inputs from visual and Lidar sensors fed to a neural network
that predicts future behavior to make proactive course corrections thousands of times a second.
Fraud detection compares historical behavior to current transactions to detect anomalies and
act accordingly.
Virtual assistants learn a person's taste in music, their schedule, purchasing history and any
other information they have access to in order to make recommendations. For example, they can
provide travel times to home or places of work.
Entertainment systems can recommend movies based on past viewing of similar content.
A smartwatch can warn of potential medical conditions, over-exertion, and lack of sleep to
oversee the owner’s well-being.
Images taken with a digital camera or scanned images can be enhanced by increasing
sharpness, balancing colors, and suggesting crops.
Captions can be auto-generated for movies or meeting videos to enhance playback.
Handwriting style can be learned, and new text can be generated in the same style.
Captioned videos can have captions generated in multiple languages.
Photo libraries can be tagged with descriptions to make finding similar ones or duplicates
easier.