DL Unit 4: Deep Learning Techniques
UNIT IV- Generative Networks
Dr.E.Poongothai
Assistant Professor
Department of Computational Intelligence
Introduction
• Machine translation is a major problem domain in which both the input and the output are variable-length sequences.
• To handle inputs and outputs of this kind, we can design an architecture with two major components.
• The first component is an encoder: it takes a variable-length sequence as input and transforms it into a state with a fixed shape.
• The second component is a decoder: it maps the encoded state of a fixed shape to a variable-length sequence.
• This is called an encoder-decoder architecture.
Encoder Decoder Model
• Encoder Decoder (ED) is a widely used structure in deep learning.
• Sequence-to-sequence (seq2seq) problems such as machine translation have inputs and outputs of varying lengths that are unaligned.
• The standard approach to handling this sort of data is to design an encoder-decoder architecture, consisting of:
• an encoder that takes a variable-length sequence as input
• a decoder that acts as a conditional language model, taking in the encoded input and the leftwards context of the target sequence and predicting the subsequent token in the target sequence.
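A minimal sketch of this encoder-decoder setup with Keras LSTMs is shown below; the vocabulary sizes, the latent dimension, and the LSTM choice are illustrative assumptions, not part of the slides.

```python
# Minimal encoder-decoder (seq2seq) sketch with Keras LSTMs.
# Vocabulary sizes and the latent dimension below are illustrative.
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, latent_dim = 8000, 8000, 256

# Encoder: variable-length source sequence -> fixed-shape state (h, c)
enc_inputs = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: a conditional language model over the target sequence,
# initialised with the encoder's final state
dec_inputs = layers.Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_out = layers.LSTM(latent_dim, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

seq2seq = Model([enc_inputs, dec_inputs], logits)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```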
..contd..
Encoder:
• Encoding means converting data into a required format.
• For language translation, the sequence of words is converted into a fixed-length vector known as the hidden state.
• The encoder is built by stacking recurrent neural network (RNN) layers.
• This type of layer is used because its structure allows the model to capture context and temporal dependencies in the sequences.
• The output of the encoder, the hidden state, is the state of the last RNN timestep.
..contd..
Hidden State:
• The output of the encoder is a vector that encapsulates the whole meaning of the input sequence.
• The length of the vector depends on the number of cells in the RNN.
Why Autoencoders are preferred over PCA?
• An autoencoder can learn non-linear
transformations with a non-linear activation
function and multiple layers.
• It doesn’t have to use only dense layers. It can
use convolutional layers, which are better
suited to video, image and time-series data.
• It is more efficient to learn several layers
with an autoencoder rather than learn one
huge transformation with PCA.
• An autoencoder provides a representation of
each layer as the output.
• It can make use of pre-trained layers from
another model to apply transfer learning to
enhance the encoder/decoder.
The schematic structure of an
autoencoder is as follows:
Architecture of AutoEncoders
An Autoencoder consists of three layers:
• Encoder
• Code
• Decoder
..contd..
Encoder:
• It compresses the input into a latent space representation. It encodes the
input image as a compressed representation in a reduced dimension. The
compressed representation is a distorted version of the original image.
Code:
• It represents the compressed input which is fed to the decoder.
Decoder:
• This layer decodes the encoded image back to the original dimension. The
decoded image is a lossy reconstruction of the original image and it is
reconstructed from the latent space representation.
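A minimal sketch of this encoder-code-decoder structure using dense Keras layers follows; the 784-dimensional input (a flattened 28x28 image) and the 32-dimensional code are illustrative choices.

```python
# Minimal dense autoencoder: encoder -> code -> decoder.
# Input and code sizes are illustrative (e.g. flattened 28x28 images).
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(784,))
encoded = layers.Dense(128, activation="relu")(inputs)      # encoder
code = layers.Dense(32, activation="relu")(encoded)         # code (latent representation)
decoded = layers.Dense(128, activation="relu")(code)        # decoder
outputs = layers.Dense(784, activation="sigmoid")(decoded)  # lossy reconstruction

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# Trained to reproduce its own input:
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)
```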
Encoding part
• The encoder part of the network is used for encoding and
sometimes even for data compression purposes although it
is not very effective as compared to other general
compression techniques like JPEG.
• Encoding is achieved by the encoder part of the network, which
has a decreasing number of hidden units in each layer.
• Thus this part is forced to pick up only the most significant and
representative features of the data.
Decoding function
• The second half of the network performs the decoding
function. This part has an increasing number of hidden
units in each layer and thus tries to reconstruct the original
input from the encoded data.
• Training of an Auto-encoder for data compression: For a
data compression procedure, the most important aspect of the
compression is the reliability of the reconstruction of the
compressed data.
• Step 1: Encoding the input data
• The Auto-encoder first tries to
encode the data using the
initialized weights and biases.
Auto Encoder - Architecture
The different ways to constrain the network are:
• Keep small Hidden Layers: If the size of each hidden layer is kept
as small as possible, then the network will be forced to pick up only
the representative features of the data thus encoding the data.
• Regularization: In this method, a loss term is added to the cost
function which encourages the network to train in ways other than
copying the input.
• Denoising: Another way of constraining the network is to add noise
to the input and teach the network how to remove the noise from
the data.
• Tuning the Activation Functions: This method involves changing
the activation functions of various nodes so that a majority of
the nodes are dormant thus effectively reducing the size of the
hidden layers.
Hyperparameters for Autoencoders
These 4 hyperparameters are set before training an autoencoder.
• Code size: It represents the number of nodes in the middle layer. Smaller
size results in more compression.
• Number of layers: Autoencoder can consist of as many layers as needed.
• Number of nodes per layer: The number of nodes per layer decreases with
each subsequent layer of the encoder, and increases back in the decoder.
The decoder is symmetric to the encoder in terms of the layer structure.
• Loss function: Either mean squared error or binary cross-entropy is used.
If the input values are in the range [0, 1] then cross-entropy is used,
otherwise, mean squared error is used.
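As a rough illustration of where these four hyperparameters appear in practice, the helper below builds an autoencoder from them; the function name and all concrete values are assumptions made for this sketch.

```python
# Sketch: building an autoencoder from the four hyperparameters above.
# The helper's name and all concrete values are illustrative.
from tensorflow.keras import layers, Model

def build_autoencoder(input_dim, code_size, hidden_widths, loss):
    """hidden_widths lists the encoder layer sizes; the decoder mirrors them."""
    x = inputs = layers.Input(shape=(input_dim,))
    for width in hidden_widths:                         # number of layers / nodes per layer
        x = layers.Dense(width, activation="relu")(x)
    x = layers.Dense(code_size, activation="relu")(x)   # code size (middle layer)
    for width in reversed(hidden_widths):               # symmetric decoder
        x = layers.Dense(width, activation="relu")(x)
    outputs = layers.Dense(input_dim, activation="sigmoid")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss)          # "binary_crossentropy" for [0, 1] inputs, else "mse"
    return model

ae = build_autoencoder(input_dim=784, code_size=32,
                       hidden_widths=[256, 128], loss="binary_crossentropy")
```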
Types of Autoencoders
Different types of Autoencoders are
• Undercomplete Autoencoder
• Regularized Autoencoder
• Stochastic Autoencoder
• Denoising Autoencoder
• Contractive Autoencoder
• Variational autoencoders
• Sparse Autoencoder
Undercomplete Autoencoders
• Goal of the Autoencoder is to capture the most important features present in the
data.
• Undercomplete autoencoders have a smaller dimension for hidden layer
compared to the input layer.
• This helps to obtain important features from the data.
• Objective is to minimize the loss function by penalizing the g(f(x)) for being
different from the input x.
Undercomplete Autoencoder
..contd..
• When the decoder is linear and L is the mean squared error, an
undercomplete autoencoder learns to span the same subspace as PCA.
• Autoencoders with nonlinear encoder functions f and nonlinear decoder
functions g can thus learn a more powerful nonlinear generalization of
PCA.
• If the encoder and decoder are given too much capacity, the autoencoder can learn
to perform the copying task without extracting useful information about
the distribution of the data.
• Advantage: can learn the salient features of data.
• Disadvantage: fails to learn anything useful if the encoder and decoder are
given too much capacity.
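A quick illustration of the PCA connection: a purely linear autoencoder trained with mean squared error learns to span the principal subspace of the data (though not necessarily the individual principal components). The synthetic data and the 20-to-5 bottleneck below are assumptions for the sketch.

```python
# Sketch: a linear autoencoder with MSE loss spans the same subspace as PCA.
# The synthetic data and the 20 -> 5 bottleneck are illustrative.
import numpy as np
from tensorflow.keras import layers, Model

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20)) @ rng.normal(size=(20, 20))  # correlated synthetic data

inputs = layers.Input(shape=(20,))
code = layers.Dense(5, activation=None, use_bias=False)(inputs)   # linear encoder
recon = layers.Dense(20, activation=None, use_bias=False)(code)   # linear decoder

linear_ae = Model(inputs, recon)
linear_ae.compile(optimizer="adam", loss="mse")                   # L = mean squared error
linear_ae.fit(x, x, epochs=100, batch_size=64, verbose=0)
```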
Overcomplete Autoencoder: Use Case
Choice of Activation Function and Loss Function
i) Sparse Autoencoders
• Sparse autoencoders are used to learn features for another task such as
classification.
• An autoencoder that has been regularized to be sparse must respond to
unique statistical features of the dataset it has been trained on, rather than
simply acting as an identity function.
• In this way, training to perform the copying task with a sparsity penalty
can yield a model that has learned useful features as a by product.
• Another way to constrain the reconstruction of an autoencoder is to impose a
constraint on its loss.
• For example, add a regularization term to the loss function so that the
autoencoder will learn a sparse representation of the data.
..contd..
• Sparse autoencoders can have more hidden nodes than input nodes, yet they can
still discover important features from the data.
• A sparsity constraint is introduced on the hidden layer to prevent the output
layer from simply copying the input data.
• Sparse autoencoders have a sparsity penalty, Ω(h), a value close to zero but
not zero, that is applied on the hidden layer in addition to the reconstruction
error to prevent overfitting.
• Sparse autoencoders keep the highest activation values in the hidden layer and
zero out the rest of the hidden nodes.
• This prevents the autoencoder from using all of the hidden nodes at a time, forcing
only a reduced number of hidden nodes to be used.
• Because different hidden nodes are activated and deactivated for each row in the
dataset, each hidden node learns to extract a distinct feature from the data.
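A minimal sparse-autoencoder sketch: an L1 activity regularizer on the (overcomplete) hidden layer plays the role of the sparsity penalty Ω(h). The layer sizes and penalty weight are illustrative assumptions.

```python
# Sparse autoencoder sketch: the hidden layer is larger than the input,
# but an L1 activity penalty keeps most hidden activations near zero.
from tensorflow.keras import layers, regularizers, Model

inputs = layers.Input(shape=(784,))
hidden = layers.Dense(1024, activation="relu",
                      activity_regularizer=regularizers.l1(1e-5))(inputs)  # sparsity penalty on h
outputs = layers.Dense(784, activation="sigmoid")(hidden)

sparse_ae = Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="binary_crossentropy")
```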
ii) Denoising Autoencoders
• Rather than adding a penalty to the loss function, this autoencoder learns
something useful by changing the reconstruction error term of the loss
function.
• Denoising refers to intentionally adding noise to the raw input before
providing it to the network and making the autoencoder learn to remove it.
• Denoising can be achieved using stochastic mapping.
• By this means, the encoder will extract the most important features and
learn a robust representation of the data.
• Denoising autoencoders create a corrupted copy of the input by
introducing some noise.
..contd..
• Corruption of the input can be done randomly by setting some of the input values
to zero; the remaining nodes simply copy the input into the noised version.
• Denoising autoencoders must remove the corruption to generate an output
that is similar to the original input. The output is compared with the original input,
not with the noised input, and training continues until the loss converges.
• The autoencoder therefore minimizes the loss function between the output
and the original (uncorrupted) input.
• Denoising autoencoders help to learn the latent representation present in
the data. A denoising autoencoder is a stochastic autoencoder, since a stochastic
corruption process is used to set some of the inputs to zero.
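A minimal denoising sketch under the masking-noise assumption just described: a fraction of input values is randomly set to zero and the network is trained to reconstruct the clean input. The corruption level and layer sizes are illustrative.

```python
# Denoising autoencoder sketch: corrupt the input with masking noise and
# train the network to reconstruct the original (uncorrupted) input.
import numpy as np
from tensorflow.keras import layers, Model

def corrupt(x, drop_prob=0.3, seed=0):
    """Stochastic corruption: randomly set a fraction of the inputs to zero."""
    rng = np.random.default_rng(seed)
    return x * (rng.random(x.shape) > drop_prob)

inputs = layers.Input(shape=(784,))
h = layers.Dense(128, activation="relu")(inputs)
outputs = layers.Dense(784, activation="sigmoid")(h)

denoising_ae = Model(inputs, outputs)
denoising_ae.compile(optimizer="adam", loss="binary_crossentropy")
# The loss compares the output with the clean input, not the noised input:
# denoising_ae.fit(corrupt(x_train), x_train, epochs=20, batch_size=256)
```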
Overcomplete Autoencoder
• One way to regularize an overcomplete autoencoder is to force the model to learn a
function that does not change much when x changes slightly.
• An autoencoder trained this way is called a Contractive Autoencoder (CAE).
Stochastic Autoencoders
Variational Autoencoder
• A variational autoencoder (VAE) provides a probabilistic manner for
describing an observation in latent space.
• Thus, rather than building an encoder that outputs a single value to describe
each latent state attribute, we’ll formulate our encoder to describe a
probability distribution for each latent attribute.
• It has many applications, such as data compression, synthetic data creation,
etc.
• A variational autoencoder differs from a standard autoencoder in that it
provides a statistical manner for describing the samples of the dataset in
latent space.
• Therefore, in the variational autoencoder, the encoder outputs a probability
distribution in the bottleneck layer instead of a single output value.
Architecture of Variational Autoencoder
• The encoder-decoder architecture lies at the heart of Variational Autoencoders (VAEs), distinguishing them from traditional
autoencoders. The encoder network takes raw input data and transforms it into a probability distribution within the latent space.
• The latent code generated by the encoder is a probabilistic encoding, allowing the VAE to express not just a single point in the
latent space but a distribution of potential representations.
• The decoder network, in turn, takes a sampled point from the latent distribution and reconstructs it back into data space. During
training, the model refines both the encoder and decoder parameters to minimize the reconstruction loss – the disparity between
the input data and the decoded output. The goal is not just to achieve accurate reconstruction but also to regularize the latent
space, ensuring that it conforms to a specified distribution.
• The process involves a delicate balance between two essential components: the reconstruction loss and the regularization term,
often represented by the Kullback-Leibler divergence. The reconstruction loss compels the model to accurately reconstruct the
input, while the regularization term encourages the latent space to adhere to the chosen distribution, preventing overfitting and
promoting generalization.
• By iteratively adjusting these parameters during training, the VAE learns to encode input data into a meaningful latent space
representation. This optimized latent code encapsulates the underlying features and structures of the data, facilitating precise
reconstruction. The probabilistic nature of the latent space also enables the generation of novel samples by drawing random
points from the learned distribution.
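A compact VAE sketch following the description above: the encoder outputs a mean and log-variance per latent dimension, a latent point is sampled with the reparameterisation trick, and a KL term regularises the latent space alongside the reconstruction loss. The sizes and the KL weighting are illustrative assumptions.

```python
# Minimal VAE sketch: probabilistic encoder, sampled latent code, decoder.
# The KL regulariser is added inside the sampling layer; the reconstruction
# term comes from the compiled loss. Sizes and weighting are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim, latent_dim = 784, 2

class Sampling(layers.Layer):
    """Reparameterisation trick z = mu + sigma * eps, plus the KL penalty."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(1e-3 * kl)   # balance between KL term and reconstruction is tunable
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

enc_in = layers.Input(shape=(input_dim,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(latent_dim)(h)       # mean of each latent attribute
z_log_var = layers.Dense(latent_dim)(h)    # log-variance of each latent attribute
z = Sampling()([z_mean, z_log_var])

h_dec = layers.Dense(256, activation="relu")(z)
recon = layers.Dense(input_dim, activation="sigmoid")(h_dec)

vae = Model(enc_in, recon)
vae.compile(optimizer="adam", loss="binary_crossentropy")   # reconstruction term
# vae.fit(x_train, x_train, epochs=30, batch_size=128)
```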
Applications of Autoencoders
• Dimensionality Reduction
• Image Compression
• Image Denoising
• Feature Extraction
• Image generation
• Sequence to sequence prediction
• Recommendation system
Image Denoising
• When an image gets corrupted or there is a bit of noise in it, we call it a noisy image.
We apply a denoising autoencoder to remove most (if not all) of the noise from the image.
Feature Extraction
• The encoding part of an autoencoder learns the important hidden features present in the
input data in the process of reducing the reconstruction error.
• During encoding, a new set of combinations of the original features is generated.
Image Generation
• There is a type of autoencoder, the Variational Autoencoder (VAE), which is a
generative model used to generate images.
• Given input images such as faces or scenery, the system will generate similar images.
• The use is to:
• generate new characters of animation
• generate fake human images
Sequence to Sequence Prediction
• Encoder-decoder models that can capture temporal structure, such as LSTM-based
autoencoders, can be used to address machine translation problems.
• This can be used to:
• predict the next frame of a video
• generate fake videos
Recommender Systems via Matrix Completion
An idea: if the predicted value of a user’s rating for a movie is high, then we should
ideally recommend this movie to the user.
Thus, if we can “reconstruct” the missing entries in the rating matrix R, we can use this
method to recommend movies to users. Using an autoencoder can help us do this.
An Autoencoder-based Approach
Using the rating vectors of all users, we can learn an autoencoder.
Note: during backpropagation, only the weights in W that are connected to the observed ratings are updated.
Once learned, the model can predict (reconstruct) the missing ratings.
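A rough sketch of this idea, under the assumption that missing ratings are stored as zeros: a masked MSE loss means only observed entries contribute to the gradients, mirroring the note above about updating only weights connected to observed ratings. The item count and layer size are illustrative.

```python
# Autoencoder for rating-matrix completion: reconstruct a user's rating
# vector, penalising errors only on the observed (non-zero) ratings.
import tensorflow as tf
from tensorflow.keras import layers, Model

n_items = 1000   # each user is represented by a length-1000 rating vector

def masked_mse(y_true, y_pred):
    mask = tf.cast(tf.not_equal(y_true, 0.0), tf.float32)      # 1 where a rating is observed
    sq_err = tf.square((y_true - y_pred) * mask)
    return tf.reduce_sum(sq_err) / tf.maximum(tf.reduce_sum(mask), 1.0)

inputs = layers.Input(shape=(n_items,))
h = layers.Dense(128, activation="relu")(inputs)    # user-preference encoding
recon = layers.Dense(n_items)(h)                    # predicted ratings for every item

rec_ae = Model(inputs, recon)
rec_ae.compile(optimizer="adam", loss=masked_mse)
# rec_ae.fit(R, R, epochs=30, batch_size=64)   # R: users x items matrix, 0 = missing
```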
Assignment 2
Application case study -Handwritten digits recognition using deep learning, LSTM with Keras
– sentiment Analysis
Assignment 3
Application case study – Image dimensionality reduction using encoders LSTM with Keras –
sentiment Analysis
Optimizers in Deep Learning
• In machine learning, optimizers and loss functions are two components that help
improve the performance of the model.
• By calculating the difference between the expected and actual outputs of a model, a
loss function evaluates the effectiveness of a model.
• Among the loss functions are log loss, hinge loss, and mean square loss.
• By modifying the model’s parameters to reduce the loss function value, the
optimizer contributes to its improvement.
• RMSProp, ADAM, and SGD are a few examples of optimizers.
• The optimizer’s job is to determine which combination of the neural network’s
weights and biases will give it the best chance to generate accurate predictions.
• There are various optimization techniques to change model weights
and learning rates, like
• Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient
descent with momentum, Mini-Batch Gradient Descent, AdaGrad,
RMSProp, AdaDelta, and Adam.
• These optimization techniques play a critical role in the training of
neural networks, as they help improve the model by adjusting its
parameters to minimize the loss function value. Choosing the best
optimizer depends on the application.
1. The epoch is the number of times the algorithm iterates over the entire
training dataset.
2. Batch size refers to the number of samples used for updating the
model parameters.
3. A sample is a single record of data in a dataset.
4. Learning Rate is a parameter determining the scale of model weight
updates
5. Weights and Bias are learnable parameters in a model that regulate
the signal between two neurons.
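To show where these terms appear in practice, the sketch below compiles a small model with a chosen optimizer and learning rate and names the epoch and batch-size arguments; the model, numbers, and optimizer choice are illustrative.

```python
# Where epoch, batch size, learning rate, and the optimizer appear in a
# typical Keras training setup (all concrete values are illustrative).
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

opt = optimizers.Adam(learning_rate=1e-3)   # optimizer + learning rate (could be SGD, RMSprop, ...)
model.compile(optimizer=opt, loss="binary_crossentropy")   # loss function: log loss here

# epochs: full passes over the training set; batch_size: samples per weight update
# model.fit(x_train, y_train, epochs=10, batch_size=32)
```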
Gradient Descent
AdaGrad
• ADAGRAD, short for adaptive gradient, signifies that the learning rates are
adjusted or adapted over time based on previous gradients. A limitation of
the previously discussed optimizers is the use of a fixed learning rate for all
parameters throughout each cycle. This can hinder the training of features
that exhibit small average gradients, causing them to train at a slower
pace. While one potential solution is to set different learning rates for each
feature, this can become complex. AdaGrad addresses this issue by
implementing the idea that the more a feature has been updated in the
past, the less it will be updated in the future. This provides an opportunity
for other features, such as sparse features, to catch up. AdaGrad, as an
optimizer, dynamically adjusts the learning rate for each parameter at every
time step ‘t’.
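A NumPy sketch of the AdaGrad update just described: the accumulated squared gradient grows fastest for frequently-updated parameters, which shrinks their effective learning rate over time. The step size and epsilon are illustrative.

```python
# AdaGrad update sketch: each parameter gets its own effective learning
# rate, scaled down by the history of its squared gradients.
import numpy as np

def adagrad_step(w, grad, G, eta=0.01, eps=1e-8):
    G = G + grad ** 2                          # accumulate squared gradients per parameter
    w = w - eta * grad / (np.sqrt(G) + eps)    # frequently-updated parameters move less
    return w, G

w, G = np.zeros(3), np.zeros(3)
w, G = adagrad_step(w, np.array([0.5, -0.1, 0.0]), G)
```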
RMSProp
Generative Adversarial Networks (GANs)
• The goal of generative modeling is to autonomously identify patterns in input data, enabling the
model to produce new examples that feasibly resemble the original dataset.
• Generative Adversarial Networks (GANs) are a powerful class of neural networks that are used for
unsupervised learning. GANs are made up of two neural networks, a discriminator and a
generator. They use adversarial training to produce artificial data that closely resembles actual data.
• The generator attempts to fool the discriminator, which is tasked with accurately distinguishing
between generated and genuine data, by transforming random noise samples into realistic data.
• Realistic, high-quality samples are produced as a result of this competitive interaction, which drives
both networks toward advancement.
• GANs are proving to be highly versatile artificial intelligence tools, as evidenced by their extensive
use in image synthesis, style transfer, and text-to-image synthesis.
• They have also revolutionized generative modeling.
Generative Adversarial Networks (GANs)
can be broken down into three parts:
• Generative: To learn a generative model, which describes how data is
generated in terms of a probabilistic model.
• Adversarial: The word adversarial refers to setting one thing up
against another. This means that, in the context of GANs, the
generative result is compared with the actual images in the data set. A
mechanism known as a discriminator is used to apply a model that
attempts to distinguish between real and fake images.
• Networks: Use deep neural networks as artificial intelligence (AI)
algorithms for training purposes.
Types of GANs
1. Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are simple multi-layer
perceptrons. In a vanilla GAN, the algorithm is really simple: it tries to optimize the mathematical equation using stochastic
gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in which some conditional parameters are
put into place.
   • In CGAN, an additional parameter ‘y’ is added to the Generator for generating the corresponding data.
   • Labels are also put into the input to the Discriminator in order for the Discriminator to help distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the most successful implementations of
GAN. It is composed of ConvNets in place of multi-layer perceptrons.
   • The ConvNets are implemented without max pooling, which is replaced by convolutional stride.
   • Also, the layers are not fully connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image representation consisting of a set of
band-pass images, spaced an octave apart, plus a low-frequency residual.
   • This approach uses multiple Generator and Discriminator networks at different levels of the Laplacian pyramid.
   • This approach is mainly used because it produces very high-quality images. The image is down-sampled at first at each layer of the pyramid
   and then up-scaled at each layer in a backward pass, where the image acquires some noise from the Conditional GAN at these layers
   until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of designing a GAN in which a deep neural
network is used along with an adversarial network in order to produce higher-resolution images. This type of GAN is
particularly useful in optimally up-scaling native low-resolution images to enhance their details, minimizing errors while doing so.
Architecture of GANs
• A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.
Generator Model
• A key element responsible for creating fresh, accurate data in a Generative Adversarial
Network (GAN) is the generator model.
• The generator takes random noise as input and converts it into complex data samples, such
as text or images. It is commonly depicted as a deep neural network.
• The training data’s underlying distribution is captured by layers of learnable parameters in
its design through training.
• The generator adjusts its output to produce samples that closely mimic real data as it is
being trained by using backpropagation to fine-tune its parameters.
• The generator’s ability to generate high-quality, varied samples that can fool the
discriminator is what makes it successful.
Discriminator Model
• An artificial neural network called a discriminator model is used in Generative Adversarial Networks
(GANs) to differentiate between generated and actual input.
• By evaluating input samples and assigning probabilities of authenticity, the discriminator functions as
a binary classifier.
• Over time, the discriminator learns to differentiate between genuine data from the dataset and
artificial samples created by the generator. This allows it to progressively hone its parameters and
increase its level of proficiency.
• Convolutional layers or pertinent structures for other modalities are usually used in its architecture
when dealing with picture data.
• Maximizing the discriminator’s capacity to accurately identify generated samples as fraudulent and
real samples as authentic is the aim of the adversarial training procedure.
• The discriminator grows increasingly discriminating as a result of the generator and discriminator’s
interaction, which helps the GAN produce extremely realistic-looking synthetic data overall.
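A compact sketch of the generator/discriminator interplay described above, using simple MLPs and one manual adversarial training step; all sizes, learning rates, and the MLP choice are illustrative assumptions rather than a prescribed architecture.

```python
# Minimal GAN sketch: the generator maps noise to fake samples, the
# discriminator scores real vs. fake, and one adversarial step updates both.
import tensorflow as tf
from tensorflow.keras import layers, models

noise_dim, data_dim = 64, 784

generator = models.Sequential([
    layers.Input(shape=(noise_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(data_dim, activation="sigmoid"),      # fake sample
])
discriminator = models.Sequential([
    layers.Input(shape=(data_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),             # probability the input is real
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_batch):
    noise = tf.random.normal([tf.shape(real_batch)[0], noise_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(noise, training=True)
        real_score = discriminator(real_batch, training=True)
        fake_score = discriminator(fake, training=True)
        d_loss = bce(tf.ones_like(real_score), real_score) + \
                 bce(tf.zeros_like(fake_score), fake_score)   # real -> 1, fake -> 0
        g_loss = bce(tf.ones_like(fake_score), fake_score)    # generator tries to fool D
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```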
Applications of AutoEncoders
Information Retrieval:
• Task of finding entries in a database that resemble a query entry.
• Search can become extremely efficient in low dimensional spaces.
• If the dimensionality reduction algorithm is trained to produce a code that is low
dimensional and binary, then all database entries can be stored in a hash table that maps
binary code vectors to their respective entries.
• Using this hash table, information retrieval can be done by returning all database entries
that have the same binary code as the query.
• Also used for searching similar entries by flipping individual bits from the encoding of
the query.
• This approach to information retrieval via dimensionality reduction and binarization is
called semantic hashing and has been applied to both textual input and images.
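A small illustration of the hashing step, under the assumption that an encoder has already produced low-dimensional codes for the database entries; here the codes are random stand-ins, and the thresholding and bucketing choices are the sketch's own.

```python
# Semantic-hashing sketch: binarise low-dimensional codes and bucket the
# database entries by their binary code for constant-time lookup.
import numpy as np
from collections import defaultdict

def binarise(codes, threshold=0.5):
    return (codes > threshold).astype(np.uint8)

def build_hash_table(binary_codes):
    table = defaultdict(list)
    for idx, code in enumerate(binary_codes):
        table[code.tobytes()].append(idx)      # entries sharing a code land in one bucket
    return table

codes = np.random.rand(1000, 16)               # stand-in for encoder outputs
table = build_hash_table(binarise(codes))
query_code = binarise(codes[:1])[0]
matches = table[query_code.tobytes()]          # database entries with the same binary code
```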
..contd..
Image Generation:
• There is a type of autoencoder, the Variational Autoencoder (VAE), which is a
generative model used to generate images.
• The idea is that, given input images such as faces or scenery, the system will
generate similar images. The use is to:
• generate new characters of animation
• generate fake human images
..contd..
Image Coloring:
• Autoencoders are used for converting any black and white picture into a
colored image. Depending on what is in the picture, it is possible to tell
what the color should be.
..contd..
Feature Variation:
• It extracts only the required features of an image and generates the output
by removing any noise or unnecessary interruption.
..contd..
Dimensionality Reduction:
• Lower-dimensional representations can
improve performance on many tasks,
such as classification, information
retrieval.
• Models of smaller spaces consume less
memory and runtime.
• Performs better than PCA.
• The reconstructed image is similar to the input but is produced
from a representation with reduced dimensions, so a similar image
can be provided from far fewer values.
..contd..
Denoising Image:
• Input seen by the autoencoder is not the raw input but a stochastically
corrupted version. A denoising autoencoder is thus trained to reconstruct
the original input from the noisy version.
..contd..
Watermark Removal:
• Autoencoders are also used for removing watermarks from images, or for removing
unwanted objects from video or film footage.
..contd..
Sequence to Sequence Prediction:
• Encoder-decoder models that can capture temporal structure, such as
LSTM-based autoencoders, can be used to address machine translation
problems.
• This can be used to:
• predict the next frame of a video
• generate fake videos
..contd..
Recommendation System:
• Deep Autoencoders can be used to understand user preferences to
recommend movies, books or other items.
• Consider the case of YouTube, the idea is:
• the input data is the clustering of similar users based on interests
• interests of users are denoted by videos watched, watch time for each, interactions
(comments) with the video
• above data is captured by clustering content
• Encoder part will capture the interests of the user
• Decoder part will try to project the interests on two parts:
• existing unseen content
• new content from content creators
Applications of Encoder-Decoder LSTMs
• Example ML output for an input image: ‘Road surrounded by palm trees leading to a beach’ (photo by Milo Miloezger on Unsplash)
..contd..
• Sentiment Analysis – it understands the meaning and emotions of the
input sentence and outputs a sentiment score, usually between -1
and 1, where 0 is neutral. It is used in call centers to analyze clients’
emotions and their reactions to certain keywords or company discounts.
..contd..
• Translation – this model reads an input sentence, understands the
message and the concepts, then translates it into a second language, e.g.
Google Translate is built upon an encoder-decoder structure.
Thank You