Mod 3 Advanced AI
Variational Autoencoders
Content
Variational Autoencoders : (7 hours)
3.1 Introduction:
Basic components of Variational Autoencoders (VAEs), Architecture
and training of VAEs, the loss function, Latent space representation
and inference, Applications of VAEs in image generation.
3.2 Types of Autoencoders:
Undercomplete autoencoders, Sparse autoencoders, Contractive
autoencoders, Denoising autoencoders, Variational Autoencoders (for
generative modelling)
Autoencoder
• In the most basic form of an autoencoder, the encoder and decoder are typically
composed of fully connected layers (MLP or CNN).
• The objective of training an autoencoder is to minimize the difference between
the input data and its reconstructed output, which is typically measured using a
loss function such as mean squared error or binary cross-entropy.
• Autoencoders can be used for a variety of tasks, such as denoising, image
super-resolution, anomaly detection, and clustering.
• They can also be stacked to create deeper architectures, such as deep autoencoders or
convolutional autoencoders, that are capable of capturing more complex features and patterns
in the input data.
• One of the main advantages of autoencoders is their ability to perform unsupervised
learning, meaning they can learn to extract meaningful features from raw data without requiring
labeled data.
• This makes them useful for tasks where labeled data is scarce or expensive to obtain.
Additionally, autoencoders can be trained using a variety of optimization algorithms, such as
stochastic gradient descent and its variants, which can scale to large datasets and high-
dimensional input spaces.
• Autoencoders also have some limitations. They are susceptible to overfitting, where the
model learns to simply memorize the training data rather than learning to generalize to new
data.
• This can be mitigated by adding regularization techniques such as dropout or early stopping.
Additionally, autoencoders can be limited by the size of the compressed representation, as the
model needs to strike a balance between preserving the most relevant information in the input
and minimizing the reconstruction error.
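• As a concrete illustration of the points above, here is a minimal sketch of a fully connected autoencoder trained with a mean-squared-error reconstruction loss in PyTorch; the 784-dimensional input (e.g., flattened 28×28 images), the layer sizes, and the optimizer settings are illustrative assumptions, not prescribed here.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A basic fully connected autoencoder: the encoder compresses the input
    to a small bottleneck, the decoder reconstructs the input from it."""
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim),           # compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),  # outputs in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                    # reconstruction loss (binary cross-entropy also works)

x = torch.rand(64, 784)                   # dummy batch standing in for real data
for _ in range(5):                        # a few unsupervised training steps
    x_hat = model(x)
    loss = loss_fn(x_hat, x)              # compare the reconstruction with the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```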
Variational Autoencoders (VAEs)
• The variational autoencoder was proposed in 2013 by Diederik P. Kingma and Max
Welling in their paper “Auto-Encoding Variational Bayes”.
• A variational autoencoder (VAE) provides a probabilistic manner for describing an
observation in latent space.
• Thus, rather than building an encoder that outputs a single value to describe each
latent state attribute, we’ll formulate our encoder to describe a probability distribution
for each latent attribute.
• A variational autoencoder differs from a standard autoencoder in that it provides a
statistical manner for describing the samples of the dataset in latent space.
• Therefore, in the variational autoencoder, the encoder outputs a probability
distribution in the bottleneck layer instead of a single output value.
https://www.geeksforgeeks.org/variational-autoencoders/
Variational Autoencoders (VAEs)
• Variational Autoencoders (VAEs) are generative models explicitly designed to capture the
underlying probability distribution of a given dataset and generate novel samples.
• They utilize an architecture that comprises an encoder-decoder structure.
• The encoder transforms input data into a latent form, and the decoder aims to reconstruct
the original data based on this latent representation.
• The VAE is programmed to minimize the dissimilarity between the original and
reconstructed data, enabling it to comprehend the underlying data distribution and generate
new samples that conform to the same distribution.
• One notable advantage of VAEs is their ability to generate new data samples resembling the
training data.
• Because the VAE’s latent space is continuous, the decoder can generate new data points that
seamlessly interpolate among the training data points.
• VAEs find applications in various domains like density estimation and text generation.
https://www.analyticsvidhya.com/blog/2023/07/an-overview-of-variational-autoencoders/
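• Because the latent space is continuous, new samples can be drawn by decoding points from the prior, and two inputs can be blended by interpolating between their latent codes. The sketch below illustrates the idea with an untrained stand-in decoder; the decoder architecture and the 2-dimensional latent space are assumptions for the example, and in practice the decoder would come from a trained VAE.

```python
import torch
import torch.nn as nn

latent_dim = 2
# Stand-in decoder; in a real VAE this network is trained jointly with the encoder.
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())

# 1) Generate novel samples: draw z from the standard normal prior and decode.
z = torch.randn(16, latent_dim)
samples = decoder(z)                         # 16 new data points, e.g. flattened images

# 2) Interpolate: blend two latent codes and decode each intermediate point.
z_a, z_b = torch.randn(latent_dim), torch.randn(latent_dim)
alphas = torch.linspace(0.0, 1.0, steps=8).unsqueeze(1)
z_path = (1 - alphas) * z_a + alphas * z_b   # straight line in latent space
interpolations = decoder(z_path)             # smooth transition between the two samples
```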
Variational Autoencoder (VAE)
• Variational Autoencoders (VAEs) are a type of autoencoder introduced to
overcome some limitations of traditional autoencoders (AEs).
• VAEs extend the traditional AE architecture by introducing a probabilistic framework
for generating the compressed representation of the input data.
The Architecture of Variational Autoencoder
• The encoder-decoder architecture lies at the heart of Variational Autoencoders (VAEs),
distinguishing them from traditional autoencoders.
• The encoder network takes raw input data and transforms it into a probability
distribution within the latent space.
• The latent code generated by the encoder is a probabilistic encoding, allowing the VAE
to express not just a single point in the latent space but a distribution of potential
representations.
• The decoder network, in turn, takes a sampled point from the latent distribution and
reconstructs it back into data space.
• During training, the model refines both the encoder and decoder parameters to
minimize the reconstruction loss – the disparity between the input data and the
decoded output.
• The goal is not just to achieve accurate reconstruction but also to regularize the latent
space, ensuring that it conforms to a specified distribution.
• In VAEs, the encoder still maps the input data to a lower-dimensional latent space, but
instead of a single point in the latent space, the encoder generates a probability distribution
over the latent space.
• The decoder then samples from this distribution to generate a new data point. This
probabilistic approach to encoding the input allows VAEs to learn a more structured and
continuous latent space representation, which is useful for generative modeling and data
synthesis.
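• A minimal PyTorch sketch of this encoder–decoder structure is given below; the layer sizes and the 2-dimensional latent space are illustrative assumptions. The encoder outputs the mean and log-variance of a Gaussian over the latent space, and the decoder maps a sampled latent point back to data space.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        # The encoder outputs the parameters of a distribution, not a single point.
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)        # a sampled point from the latent distribution
        return self.decoder(z), mu, logvar         # reconstruction plus the distribution parameters

x_hat, mu, logvar = VAE()(torch.rand(16, 784))     # toy forward pass on dummy data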
To go from a traditional autoencoder to a VAE, we need to make two key modifications.
1. First, we need to replace the encoder’s output with a probability distribution. Instead of the
encoder outputting a point in the latent space, it outputs the parameters of a probability
distribution, such as mean and variance. This distribution is typically a multivariate
Gaussian distribution but can be some other distribution as well (e.g., Bernoulli).
2. Second, we introduce a new term in the loss function called the Kullback-Leibler (KL)
divergence. This term measures the difference between the learned probability distribution
over the latent space and a predefined prior distribution (usually a standard normal
distribution).
• The KL divergence term ensures that the learned distribution over the latent space is close to
the prior distribution, which helps regularize the model and ensures that the latent space has a
meaningful structure.
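• For a single input x, with approximate posterior q(z|x) produced by the encoder and prior p(z), the maximized objective can be written as follows; for a diagonal Gaussian q and a standard normal prior, the KL term has the closed form in the second line.

\[
L(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)
\]
\[
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2}\sum_{j=1}^{d}\big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big)
\]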
• The optimized term L in the above equation is called the ELBO (Evidence Lower BOund).
• The loss function for a VAE is typically composed of two parts: the reconstruction loss
(similar to the traditional autoencoder loss) and the KL divergence loss.
• The reconstruction loss measures the difference between the original input and the output
generated by the decoder.
• The KL divergence loss measures the difference between the learned probability distribution
and the predefined prior distribution.
• https://medium.com/@rushikesh.shende/autoencoders-variational-autoencoders-vae-and-%CE%B2-vae-ceba9998773d
• The variational autoencoder uses KL divergence in its loss function; the goal of this term is to
minimize the difference between the assumed (approximate) distribution and the original (true)
distribution of the dataset.
• Suppose there exists some hidden variable z which generates an observation x.
• We can only see x but we would like to infer the characteristics of z.
In other words, we would like to compute p(z|x).
• In the VAE loss, the first term represents the reconstruction likelihood and the second term
ensures that our learned distribution q(z|x) is similar to the true prior distribution p(z).
•https://www.geeksforgeeks.org/variational-autoencoders/
•https://www.jeremyjordan.me/variational-autoencoders/
• To revisit our graphical model, we can use q to infer the possible hidden variables (i.e., the latent
state) which were used to generate an observation. We can further construct this model into a
neural network architecture where the encoder model learns a mapping from x to z and the
decoder model learns a mapping from z back to x.
• Our loss function for this network will consist of two terms, one which penalizes
reconstruction error and a second term which encourages our learned distribution q(z|x) to be
similar to the true prior distribution p(z), which we'll assume follows a unit Gaussian
distribution, for each dimension j of the latent space.
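• Putting the two terms together in code, the VAE loss can be sketched as follows; using binary cross-entropy as the reconstruction term and summing over the batch are assumptions for the illustration.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar):
    """Reconstruction loss + KL divergence to a standard normal prior."""
    # How well the decoder output matches the original input (inputs assumed in [0, 1]).
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dims and batch.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```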
Variational Autoencoders applications
• Image Generation: VAEs have been used to generate realistic images in
applications such as art and content creation, data augmentation for
training deep learning models, and image synthesis in computer vision tasks.
• Anomaly Detection: VAEs can be applied to detect anomalies in various types
of data, including network traffic, sensor readings, financial transactions, and
medical diagnostics.
• Text Generation: VAEs have been used to generate natural language text, such
as product reviews, song lyrics, or news articles. They can also be employed in
text summarization, language translation, and sentiment analysis.
• Drug Discovery: VAEs have shown promise in generating new drug candidates
with desired properties, optimizing molecular structures, and predicting
molecular properties.
Refer:
https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
https://www.geeksforgeeks.org/variational-autoencoders/
https://gaussian37.github.io/deep-learning-chollet-8-4/ (Practical)
Undercomplete autoencoders
• The undercomplete autoencoders are the simplest architecture for autoencoders.
• The architecture depends on putting constraints on the number of nodes that can be
added to the hidden layers and the central bottleneck.
• The idea behind this is that the approach restricts the flow of information
through the network.
• The architecture relies on the fact that, if the flow of information is limited and the
network has to learn the encoding as well as possible, it will keep only the most
important dependencies and discard the rest. Thus we are able to create an
encoding that supports the best reconstruction.
• The loss function used is the usual reconstruction error, such as MSE or binary
cross-entropy.
• As we are restricting the flow of information using the bottleneck, there is no chance
that the model simply memorizes the input and cheats.
In an undercomplete autoencoder, the hidden layers have a lower number of nodes
than the input layer.
• An undercomplete autoencoder is a type of autoencoder whose goal is to capture the
important features present in the data.
• It has a smaller hidden layer when compared to the input layer.
• This autoencoder does not need any regularization, as it maximizes the probability of
the data rather than copying the input to the output.
• One way to get useful features from the autoencoder is to constrain H to have a
smaller dimension than x.
• An autoencoder whose code dimension is less than the input dimension is called
undercomplete.
• Learning an undercomplete representation forces the autoencoder to capture the most
salient features of the training data.
The learning process is described simply as minimizing a loss function
L(x, g(f(x)))
Where
L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean
squared error.
Sparse autoencoders
• When we were discussing undercomplete autoencoders, we said that we restrict the
number of nodes in the hidden layer to restrict the data flow.
• But this approach often creates issues, because the limitation on the nodes of the
hidden layers and the shallower networks prevent the neural network from uncovering
complex relationships among the data items.
• So, we need to use deeper networks with more hidden layer nodes. Again, if we use
more hidden layer nodes, the network may just memorize the input and overfit, which
would defeat our purpose.
• So, to solve this we use regularizers. The regularizers prevent the network from
overfitting to the input data and prevent the memorization problem.
• During regularization, we normally regularize weights but in this case, we regularize
activations that are actually passed from one hidden layer to another.
• In simpler words, the idea is we won’t let all the nodes in the hidden layers learn.
• Now, if we go to the basics of neural networks, an activation function controls how
much information a particular node passes.
• The activation function works like a gate. If the activation for a particular node is 0,
then the node is not contributing its information. The idea of sparse autoencoders is
something like that.
• Now, one thing to note is that the activations are dependent on the input data and will
change with the change in input. So, we let our model decide the activations and
penalize their activation values. We usually do this in two ways:
L1 Regularization: L1 regularizers restrict the activations as discussed above. They force
the network to use only those nodes of the hidden layers that handle a high amount of
information and block the rest.
It is given by: L(x, x̂) + λ Σ_i |a_i^(h)|
The reconstruction loss is given by L, and the second part is the regularizer that
penalizes the activations. As we can see, the regularizer part is a summation of the
absolute activations a_i^(h) of all nodes i in the hidden layer h.
• So, when we try to minimize the loss function we decrease the activations.
• Again, we use a tuning parameter lambda. Lambda helps to ensure how much
attention we want to pay for the regularization aspect.
KL Divergence: Kullback-Leibler divergence is a way to measure the difference and
similarity between two mathematical probability distributions. For a sparsity target ρ and
an observed average activation ρ̂_j of hidden node j, the penalty is given by:
Σ_j KL(ρ ‖ ρ̂_j) = Σ_j [ ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j)) ]
• So, basically, it tells us how similar p and q are. This method uses a sparsity
parameter ρ (Rho).
• Rho is the desired (target) average activation of a neuron over a set of samples. The idea is
to use a very low Rho value so that each neuron keeps a low average activation; to achieve
that, the node will have activations close to 0 for the samples in the collection where it is
not essential.
• Having fewer nodes activate while still keeping the performance guarantees that the
autoencoder is actually learning latent representations instead of redundant
information in our input data.
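• A sketch of the L1 variant (with the KL-based penalty shown as an alternative) is given below; λ, the sparsity target ρ, and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 64), nn.Sigmoid())   # sigmoid keeps activations in (0, 1)
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

lam, rho = 1e-4, 0.05            # penalty weight lambda and sparsity target rho (assumed values)
x = torch.rand(64, 784)          # dummy batch standing in for real data

h = encoder(x)                   # hidden activations we want to keep sparse
x_hat = decoder(h)
recon_loss = nn.functional.mse_loss(x_hat, x)

# L1 penalty: sum of absolute hidden activations, averaged over the batch.
l1_penalty = h.abs().sum(dim=1).mean()

# KL penalty (alternative): push each unit's average activation rho_hat towards rho.
rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)
kl_penalty = (rho * torch.log(rho / rho_hat)
              + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

loss = recon_loss + lam * l1_penalty   # swap in kl_penalty to use the KL formulation
optimizer.zero_grad()
loss.backward()
optimizer.step()
```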
Contractive autoencoders
• A contractive autoencoder is considered an unsupervised deep learning technique.
• It helps a neural network to encode unlabeled training data.
• The idea behind this is to make the autoencoder robust to small changes in the training
dataset.
• We use autoencoders to learn a representation, or encoding, for a set of unlabeled
data.
• It is usually the first step towards dimensionality reduction or generating new data
models.
• A contractive autoencoder aims to learn representations that are invariant to unimportant
transformations of the given data.
Working of Contractive Autoencoders
• A contractive autoencoder is less sensitive to slight variations in the training dataset.
• We can achieve this by adding a penalty term or regularizer to whatever cost or
objective function the algorithm is trying to minimize.
• The result reduces the learned representation's sensitivity towards the training input.
• This regularizer is the Frobenius norm of the Jacobian matrix of the encoder activations
with respect to the input.
• If this value is zero, we don't observe any change in the learned hidden
representations as we change input values. But if the value is huge, then the learned
model is unstable as the input values change.
• We generally employ Contractive autoencoders as one of several other autoencoder
nodes. It is in active mode only when other encoding schemes fail to label a data
point.
• A contractive autoencoder is an unsupervised deep learning technique that helps
a neural network encode unlabeled training data.
• Contractive autoencoder (CAE) objective is to have a robust learned representation
which is less sensitive to small variation in the data.
• Robustness of the representation for the data is done by applying a penalty term to
the loss function.
• The penalty term is Frobenius norm of the Jacobian matrix.
• Frobenius norm of the Jacobian matrix for the hidden layer is calculated with respect
to input.
• The squared Frobenius norm of the Jacobian matrix is the sum of the squares of all its elements.
• CAE surpasses results obtained by regularizing an autoencoder with weight decay or
by denoising. A CAE is a better choice than a denoising autoencoder for learning useful
feature extraction.
• The penalty term generates mappings that strongly contract the data, hence
the name contractive autoencoder.
https://www.i2tutorials.com/explain-about-the-contractive-autoencoders/
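• A sketch of this penalty for a single-layer sigmoid encoder is shown below; it uses the closed-form Jacobian of a sigmoid layer, and the layer sizes and λ are illustrative assumptions.

```python
import torch
import torch.nn as nn

W = nn.Linear(784, 64)           # single-layer sigmoid encoder (sizes are assumptions)
decoder = nn.Linear(64, 784)
optimizer = torch.optim.Adam(list(W.parameters()) + list(decoder.parameters()), lr=1e-3)
lam = 1e-4                       # weight of the contractive penalty

x = torch.rand(32, 784)          # dummy batch
h = torch.sigmoid(W(x))          # hidden representation
x_hat = torch.sigmoid(decoder(h))
recon_loss = nn.functional.mse_loss(x_hat, x)

# Squared Frobenius norm of the Jacobian dh/dx. For a sigmoid layer,
# dh_j/dx_i = h_j * (1 - h_j) * W_ji, so the penalty has a closed form:
dh_sq = (h * (1 - h)) ** 2               # shape (batch, hidden)
w_sq = (W.weight ** 2).sum(dim=1)        # sum over input dimensions, shape (hidden,)
contractive_penalty = (dh_sq * w_sq).sum(dim=1).mean()

loss = recon_loss + lam * contractive_penalty
optimizer.zero_grad()
loss.backward()
optimizer.step()
```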
Denoising autoencoders
• Autoencoders are Neural Networks which are commonly used for feature selection
and extraction.
• However, when there are more nodes in the hidden layer than there are inputs, the
network risks learning the so-called “identity function”, also called the “null
function”, meaning that the output simply equals the input, rendering the autoencoder
useless.
• Denoising Autoencoders solve this problem by corrupting the data on purpose by
randomly turning some of the input values to zero.
• In general, the percentage of input nodes which are being set to zero is about 50%.
• Other sources suggest a lower count, such as 30%. It depends on the amount of data
and input nodes you have.
• When calculating the loss function, it is important to compare the output values with
the original input, not with the corrupted input. That way, the risk of learning the
identity function instead of extracting features is eliminated.
https://towardsdatascience.com/denoising-autoencoders-explained-dbb82467fc2
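• A corruption step of this kind can be sketched as follows; the 30% corruption level and the use of a random binary mask are illustrative choices consistent with the text above.

```python
import torch

def corrupt(x, drop_prob=0.3):
    """Randomly set a fraction of input values to zero (masking noise)."""
    mask = (torch.rand_like(x) > drop_prob).float()   # 1 keeps a value, 0 drops it
    return x * mask

x = torch.rand(8, 784)          # clean batch
x_noisy = corrupt(x)            # corrupted batch fed to the denoising autoencoder
# The loss is later computed between the reconstruction and the clean x, not x_noisy.
```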
OR
What is Denoising Autoencoders?
• Denoising autoencoders are a specific type of neural network that enables
unsupervised learning of data representations or encodings.
• Their primary objective is to reconstruct the original version of the input signal
corrupted by noise.
• This capability proves valuable in problems such as image recognition or fraud
detection, where the goal is to recover the original signal from its noisy form.
An autoencoder consists of two main components:
• Encoder: This component maps the input data into a low-dimensional
representation or encoding.
• Decoder: This component returns the encoding to the original data space.
• During the training phase, the autoencoder is presented with a set of clean input
examples along with their corresponding noisy versions.
• The objective is to learn, using an encoder-decoder architecture, a mapping that
efficiently transforms noisy input into clean output.
Architecture of DAE
• The denoising autoencoder (DAE) architecture is similar to a standard autoencoder. It
consists of two main components:
Encoder
• The encoder is a neural network with one or more hidden layers.
• Its purpose is to receive noisy input data and generate an encoding, which is a
low-dimensional representation of the data.
• The encoder can be understood as a compression function, because the encoding has
fewer dimensions than the input data.
Decoder
• The decoder acts as an expansion function, responsible for reconstructing the
original data from the compressed encoding.
• It takes as input the encoding generated by the encoder and reconstructs the original
data.
• Like encoders, decoders are implemented as neural networks featuring one or more
hidden layers.
• During the training phase, the denoising autoencoder (DAE) is presented with a
collection of clean input examples along with their respective noisy counterparts.
• The objective is to learn a function that maps a noisy input to a relatively clean
output using an encoder-decoder architecture.
• To achieve this, a reconstruction loss function is typically employed to evaluate the
disparity between the clean input and the reconstructed output.
• A DAE is trained by minimizing this loss through the use of backpropagation,
which involves updating the weights of both encoder and decoder components.
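• A minimal DAE training sketch is shown below; the architecture, optimizer, and masking-style corruption are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

x_clean = torch.rand(64, 784)                       # clean inputs (dummy data)
for _ in range(5):
    # Corrupt the input, but keep the clean version as the reconstruction target.
    mask = (torch.rand_like(x_clean) > 0.3).float()
    x_noisy = x_clean * mask
    x_hat = decoder(encoder(x_noisy))
    loss = loss_fn(x_hat, x_clean)                  # compare with the CLEAN input
    optimizer.zero_grad()
    loss.backward()                                 # backprop updates encoder and decoder
    optimizer.step()
```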
Applications of Denoising Autoencoders (DAEs) span a variety of domains, including
computer vision, speech processing, and natural language processing.
References for 3.2
• https://towardsdatascience.com/introduction-to-autoencoders-7a47cf4ef14b
• https://cedar.buffalo.edu/~srihari/CSE676/14.1%20Autoencoders.pdf
• https://www.i2tutorials.com/explain-about-under-complete-autoencoder/
• https://www.simplilearn.com/tutorials/deep-learning-tutorial/what-are-autoencoders-in-deep-learning
Thank you