
21AIC301J - Deep Learning Techniques
UNIT IV - Generative Networks

Dr.E.Poongothai
Assistant Professor
Department of Computational Intelligence
Introduction
• Machine translation is a major problem domain when input and
output are both variable-length sequences.
• To handle these kinds of inputs and outputs, we can design an architecture with two major components.
• The first component is an encoder: it takes a variable-length sequence
as the input and transforms it into a state with a fixed shape.
• The second component is a decoder: it maps the encoded state of a
fixed shape to a variable-length sequence.
• This is called an encoder-decoder architecture
Encoder Decoder Model
• Encoder Decoder (ED) is a widely used structure in deep learning.
• Sequence to Sequence (seq2seq) problems like machine translation have inputs and outputs of varying lengths that are unaligned.
• Standard approach to handle this sort of data is to design an
encoder-decoder architecture, consisting of:
• an encoder that takes a variable-length sequence as input
• a decoder that acts as a conditional language model, taking in the encoded
input and the leftwards context of the target sequence and predicting the
subsequent token in the target sequence.
..contd..
Encoder:
• Encoding means to convert data into a required
format.
• For language translation, the sequence of words is converted into a fixed-size vector, known as the hidden state.
• Encoder is built by stacking recurrent neural
network (RNN).
• This type of layer is used because its structure
allows the model to understand context and
temporal dependencies of the sequences.
• Output of the encoder, the hidden state, is the
state of the last RNN timestep.
..contd..
Hidden State:
• The output of the encoder is a vector that encapsulates the meaning of the whole input sequence.
• The length of the vector depends on the number of cells (units) in the RNN.
Encoder

• A stack of several recurrent units, where each accepts a single element of the input sequence, collects information for that element and propagates it.
• In a question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as x_i, where i is the order of that word.
• The hidden states h_i are computed using the standard RNN recurrence:
  h_i = f(W_hh * h_(i-1) + W_hx * x_i)
..contd..
Decoder:
• Converts a coded message into intelligible language. In language
translation, the two-dimensional vector is converted into the output
sequence. It is also built with RNN layers and a dense layer to predict the
output.
Decoder

• A stack of several recurrent units, where each predicts an output y_t at time step t.
• Each recurrent unit accepts the hidden state from the previous unit and produces an output as well as its own hidden state.
• In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i, where i is the order of that word.
• Any hidden state h_i is computed using the formula:
  h_i = f(W_hh * h_(i-1))
Decoder

• Simply put, the previous hidden state is used to compute the next one.
• The output y_t at time step t is computed using the formula:
  y_t = softmax(W_S * h_t)
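• A minimal Keras sketch of such an encoder-decoder (seq2seq) model; the vocabulary sizes, the hidden-state size of 256 and the use of LSTM layers are illustrative assumptions, not taken from the slides:

import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

num_encoder_tokens = 1000   # assumed source vocabulary size
num_decoder_tokens = 1200   # assumed target vocabulary size
latent_dim = 256            # size of the hidden state (the fixed "code")

# Encoder: reads a variable-length source sequence and keeps only its final states
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]          # fixed-size summary of the input

# Decoder: a conditional language model initialised with the encoder states
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs = LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()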
S2: Auto Encoder –
Introduction, Auto Encoders
Introduction
• A typical use of a neural network is supervised learning, where the training data contains an output label for each input.
• An autoencoder instead uses the input itself as the target: the network is first trained on the given input.
• The network tries to reconstruct the given input from the features it picked up and gives an approximation of the input as the output.
• The training step involves computing the error and backpropagating the error.
• The typical architecture of an auto-encoder resembles a bottleneck.
Introduction

• An autoencoder is a special type of feed-forward neural network which does the following:
• Encodes its input x_i into a hidden representation h:
  h = g(W x_i + b)
• Decodes the input again from this hidden representation:
  x̂_i = f(W* h + c)
• The model is trained to minimize a loss function which ensures that x̂_i is close to x_i.
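• A small NumPy sketch of these two mappings (the sigmoid activations, random weights and dimensions are illustrative assumptions; in practice W, b, W* and c are learned):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, k = 8, 3                      # input dimension and code dimension (assumed)
x_i = np.random.rand(d)          # one input example

W = np.random.randn(k, d)        # encoder weights
b = np.zeros(k)                  # encoder bias
W_star = np.random.randn(d, k)   # decoder weights (often tied: W* = W.T)
c = np.zeros(d)                  # decoder bias

h = sigmoid(W @ x_i + b)             # h = g(W x_i + b)
x_hat = sigmoid(W_star @ h + c)      # x̂_i = f(W* h + c)

loss = np.mean((x_i - x_hat) ** 2)   # squared-error reconstruction loss
print(h, x_hat, loss)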
Introduction

• Autoencoders should not copy their input perfectly. They are restricted by design to copy only approximately; by doing so, they learn useful properties of the data.
• Autoencoders are traditionally used for dimensionality reduction and for feature engineering.
• They can learn stochastic mappings, going beyond deterministic functions to mappings p_encoder(h|x) and p_decoder(x|h).
• An autoencoder with a linear decoder and MSE loss learns the same subspace as PCA.
• Autoencoders with nonlinear f and g can learn more powerful nonlinear generalizations of PCA.
Properties of Auto Encoders

• Data-specific: Autoencoders can only meaningfully compress data similar to what they have been trained on. They learn features specific to the given training data, unlike a standard compression algorithm. E.g., an autoencoder trained on handwritten digits cannot usefully compress landscape photos.
• Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation.
• Unsupervised: Autoencoders are considered an unsupervised learning technique since they don't need explicit labels to train on. More precisely, they are self-supervised because they generate their own labels from the training data.
• Can learn stochastic mappings.
• Can learn stochastic mappings
Introduction to AutoEncoders
• An autoencoder (also known as a replicator neural network) is an unsupervised machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs.
• Autoencoders are used to reduce the size of the inputs into a smaller
representation.
• Original data is reconstructed from the compressed data.

Why Autoencoders are preferred over PCA?
• An autoencoder can learn non-linear
transformations with a non-linear activation
function and multiple layers.
• It doesn’t have to learn dense layers. It can
use convolutional layers to learn which is
better for video, image and series data.
• It is more efficient to learn several layers
with an autoencoder rather than learn one
huge transformation with PCA.
• An autoencoder provides a representation of
each layer as the output.
• It can make use of pre-trained layers from
another model to apply transfer learning to
enhance the encoder/decoder.
The schematic structure of an
autoencoder is as follows:
Architecture of AutoEncoders
An autoencoder consists of three layers:
• Encoder
• Code
• Decoder

..contd..
Encoder:
• It compresses the input into a latent space representation. It encodes the
input image as a compressed representation in a reduced dimension. The
compressed image is the distorted version of the original image.
Code:
• It represents the compressed input which is fed to the decoder.
Decoder:
• This layer decodes the encoded image back to the original dimension. The
decoded image is a lossy reconstruction of the original image and it is
reconstructed from the latent space representation.

Encoding part
• The encoder part of the network is used for encoding and
sometimes even for data compression purposes although it
is not very effective as compared to other general
compression techniques like JPEG.
• Encoding is achieved by the encoder part of the network, which has a decreasing number of hidden units in each layer.
• Thus this part is forced to pick up only the most significant and
representative features of the data.
Decoding function
• The second half of the network performs the decoding function. This part has an increasing number of hidden units in each layer and thus tries to reconstruct the original input from the encoded data.
• Training of an Auto-encoder for data compression: For a
data compression procedure, the most important aspect of the
compression is the reliability of the reconstruction of the
compressed data.
• Step 1: Encoding the input data
  The auto-encoder first tries to encode the data using the initialized weights and biases.
• Step 2: Decoding the input data
  It tries to reconstruct the original input from the encoded data, to test the reliability of the encoding.
• Step 3: Backpropagating the error
  After the reconstruction, the loss function is computed to determine the reliability of the encoding, and the error generated is backpropagated.
• The above training process is reiterated several times until an acceptable level of reconstruction is reached.
..contd..
• The layer between the encoder and
decoder “code” is also known as
Bottleneck.
• This is a well-designed approach to
decide which aspects of observed
data are relevant information and
what aspects can be discarded. It
does this by balancing two criteria :
• Compactness of representation,
measured as the compressibility.
• It retains some behaviorally relevant
variables from the input.

Auto Encoder - Architecture
The different ways to constrain the network
are:-
• Keep small Hidden Layers: If the size of each hidden layer is kept
as small as possible, then the network will be forced to pick up only
the representative features of the data thus encoding the data.
• Regularization: In this method, a loss term is added to the cost
function which encourages the network to train in ways other than
copying the input.
• Denoising: Another way of constraining the network is to add noise to the input and teach the network how to remove the noise from the data.
• Tuning the Activation Functions: This method involves changing
the activation functions of various nodes so that a majority of
the nodes are dormant thus effectively reducing the size of the
hidden layers.
Hyperparameters for Autoencoders
These 4 hyperparameters are set before training an autoencoder.
• Code size: It represents the number of nodes in the middle layer. Smaller
size results in more compression.
• Number of layers: Autoencoder can consist of as many layers as needed.
• Number of nodes per layer: The number of nodes per layer decreases with
each subsequent layer of the encoder, and increases back in the decoder.
The decoder is symmetric to the encoder in terms of the layer structure.
• Loss function: Either mean squared error or binary cross-entropy is used.
If the input values are in the range [0, 1] then cross-entropy is used,
otherwise, mean squared error is used.
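A hedged Keras sketch showing where each of these hyperparameters appears (the input dimension of 784, the code size of 32 and the layer sizes are assumptions for illustration):

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

input_dim = 784      # e.g. flattened 28x28 images (assumed)
code_size = 32       # number of nodes in the middle (bottleneck) layer

inputs = Input(shape=(input_dim,))
encoded = Dense(128, activation="relu")(inputs)      # encoder layers shrink...
encoded = Dense(code_size, activation="relu")(encoded)
decoded = Dense(128, activation="relu")(encoded)     # ...decoder mirrors them
decoded = Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = Model(inputs, decoded)
# inputs scaled to [0, 1], so binary cross-entropy is a reasonable loss choice
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")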

Types of Autoencoders
Different types of Autoencoders are
• Undercomplete Autoencoder
• Regularized Autoencoder
• Stochastic Autoencoder
• Denoising Autoencoder
• Contractive Autoencoder
• Variational autoencoders
• Sparse Autoencoder

Undercomplete Autoencoders
• Goal of the Autoencoder is to capture the most important features present in the
data.
• Undercomplete autoencoders have a smaller dimension for hidden layer
compared to the input layer.
• This helps to obtain important features from the data.
• Objective is to minimize the loss function by penalizing the g(f(x)) for being
different from the input x.

• Undercomplete autoencoders do not need any regularization, as they maximize the probability of the data rather than copying the input to the output.

Undercomplete Autoencoder
• Let us consider the case where dim(h) < dim(x_i).
• If we are still able to reconstruct x̂_i perfectly from h, then what does that say about h?
• h is a loss-free encoding of x_i; it captures all the important characteristics of x_i.
• An autoencoder where dim(h) < dim(x_i) is called an undercomplete autoencoder.
..contd..
• When the decoder is linear and L is the mean squared error, an
undercomplete autoencoder learns to span the same subspace as PCA.
• Autoencoders with nonlinear encoder functions f and nonlinear decoder
functions g can thus learn a more powerful nonlinear generalization of
PCA.
• If the encoder and decoder are given too much capacity, the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data.
• Advantage: can learn the salient features of data.
• Disadvantage: fails to learn anything useful if the encoder and decoder are
given too much capacity.
Overcomplete Autoencoder: Use Case
• Consider that you are interested in predicting a disease of a person (heart attack or diabetes) using BMI as one of the parameters.
• We know that BMI = f(height, weight).
• In this use case, the entangled features are disentangled.
Choice of Activation Function and Loss Function
• Choice of the activation functions g (encoder) and f (decoder output).
• Choice of the loss function.
• Typically, a linear output with squared-error loss is used for real-valued inputs, and a sigmoid (logistic) output with cross-entropy loss for binary inputs.
Cases when Autoencoder Learning Fails- Regularized Auto
Encoders

• Where autoencoders fail to learn anything useful:


1. Capacity of encoder/decoder f/g is too high
• Capacity controlled by depth
2. Hidden code h has dimension equal to input x
3. Over complete case: where hidden code h has dimension greater
than input x
• Even a linear encoder/decoder can learn to copy input to output without learning anything
useful about data distribution
Regularized Autoencoders
• Rather than limiting the capacity (like Undercomplete Autoencoders) by
keeping the encoder and decoder shallow and the code size small
• Regularized autoencoders use a loss function that encourages the model to
have other properties besides the ability to copy its input to its output.
• There are two types of regularized autoencoder
1) The sparse autoencoder
2) Denoising autoencoder.

i) Sparse Autoencoders
• Sparse autoencoders are used to learn features for another task such as
classification.
• An autoencoder that has been regularized to be sparse must respond to
unique statistical features of the dataset it has been trained on, rather than
simply acting as an identity function.
• In this way, training to perform the copying task with a sparsity penalty
can yield a model that has learned useful features as a by product.
• Another way to constrain the reconstruction of an autoencoder is to impose a constraint in its loss.
• For example, add a regularization term in the loss function, so that
autoencoder will learn sparse representation of data.
..contd..
• Sparse autoencoders may have more hidden nodes than input nodes, yet they can still discover important features from the data.
• A sparsity constraint is introduced on the hidden layer. This is to prevent the output layer from simply copying the input data.
• Sparse autoencoders have a sparsity penalty, Ω(h), a value close to zero but not zero, that is applied on the hidden layer in addition to the reconstruction error, to prevent overfitting.
• Sparse autoencoders take the highest activation values in the hidden layer and zero out the rest of the hidden nodes.
• This prevents the autoencoder from using all of the hidden nodes at a time, forcing it to use only a reduced number of hidden nodes.
• As hidden nodes are activated and deactivated for each row in the dataset, each hidden node extracts a feature from the data.
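• One common way to impose such a sparsity penalty in Keras is an L1 activity regularizer on the code layer; the sketch below assumes illustrative sizes and a regularization strength of 1e-5:

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

input_dim, code_size = 784, 64                       # assumed sizes

inputs = Input(shape=(input_dim,))
# Omega(h): an L1 penalty on the activations pushes most code units towards zero
code = Dense(code_size, activation="relu",
             activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = Dense(input_dim, activation="sigmoid")(code)

sparse_ae = Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="binary_crossentropy")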
ii) Denoising Autoencoders
• Rather than adding a penalty to the loss function, this autoencoder learns
something useful by changing the reconstruction error term of the loss
function.
• Denoising refers to intentionally adding noise to the raw input before providing it to the network and making the autoencoder learn to remove it.
• Denoising can be achieved using stochastic mapping.
• By this means, the encoder will extract the most important features and
learn a robust representation of the data.
• Denoising autoencoders create a corrupted copy of the input by
introducing some noise.

..contd..
• Corruption of the input can be done randomly by setting some of the input values to zero; the remaining nodes simply copy the input to the noised version.
• Denoising autoencoders must remove the corruption to generate an output that is similar to the clean input. The output is compared with the original input, not with the noised input, and training continues until the loss converges.
• This autoencoder therefore minimizes the loss function between the output and the original (uncorrupted) input.
• Denoising autoencoders help to learn the latent representation present in the data. A denoising autoencoder is a stochastic autoencoder, as we use a stochastic corruption process to set some of the inputs to zero.
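• A minimal sketch of this corrupt-then-reconstruct training setup in Keras (the 30% zero-masking rate and the random x_train placeholder are assumptions):

import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

x_train = np.random.rand(1000, 784)            # placeholder for real training data

# Corrupt the input by randomly setting about 30% of the values to zero
mask = np.random.rand(*x_train.shape) > 0.3
x_noisy = x_train * mask

inputs = Input(shape=(784,))
code = Dense(64, activation="relu")(inputs)
outputs = Dense(784, activation="sigmoid")(code)
denoising_ae = Model(inputs, outputs)
denoising_ae.compile(optimizer="adam", loss="mse")

# The model sees the corrupted input but is trained against the clean input
denoising_ae.fit(x_noisy, x_train, epochs=5, batch_size=64)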

Overcomplete Autoencoder
• Let us consider the case where dim(h) >= dim(x_i).
• In such a case the autoencoder could learn a trivial encoding by simply copying x_i into h and then copying h into x̂_i.
• Such an identity encoding is useless in practice, as it does not really tell us anything about the important characteristics of the data.
• An autoencoder where dim(h) >= dim(x_i) is called an overcomplete autoencoder.
Regularized Auto Encoder

• Use regularization to design right autoencoder


• Regularization improves the ability of autoencoders to capture
important information and representations.
• Ideally, choose code size (dimension of h) small and capacity of
encoder f and decoder g based on complexity of distribution modeled
• Regularized autoencoders provide the ability to do so
• Rather than limiting model capacity by keeping encoder/decoder shallow and code size
small
• They use a loss function that encourages the model to have properties other than copy its
input to output
Regularized Autoencoder Properties

•Regularized AEs have properties beyond copying input to output:


• Sparsity of representation
• Smallness of the derivative of the representation
• Robustness to noise
• Robustness to missing inputs

• A regularized autoencoder can be nonlinear and overcomplete, but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.
Ways to regularize the model

• Make the learned code sparse (Sparse Autoencoders)


• Make the model robust against noisy/incomplete inputs (Denoising
Autoencoders)
• Make the model robust against small changes in the input (Contractive
Autoencoders)
Sparse Autoencoders

• A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error:
  L(x, g(f(x))) + Ω(h)
  where g(h) is the decoder output and h = f(x) is the encoder output.
• The penalized activation means that only a small number of neurons activate for a given input.
• Sparse autoencoders are used to learn features for another task, such as classification.
• An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function.
• Which individual nodes of a trained model activate is data-dependent: different inputs will result in activations of different nodes through the network.
Sparse Autoencoders
• The penalty term Ω(h) is a regularizer added to a feedforward network whose
  • primary task is to copy the input to the output (an unsupervised learning objective), and which may
  • also perform some supervised task that depends on the sparse features (a supervised learning objective).
Denoising Autoencoders

• A small code layer forces the autoencoder to learn an intelligent representation of the data.
• To learn useful features, random noise is added to the inputs and the autoencoder is trained to recover the original noise-free data.
• The autoencoder therefore has to subtract the noise and produce the underlying meaningful data. This is called a denoising autoencoder.
Contractive Autoencoder
• A contractive autoencoder makes encoding less sensitive to small
variations in its training dataset.
• This is accomplished by adding a regularizer, or penalty term, to the cost (or objective) function.
• The end result is to reduce the learned representation's sensitivity towards the training input.
• This regularizer corresponds to the Frobenius norm (sometimes called the Euclidean norm) of the Jacobian matrix of the encoder activations with respect to the input.
Contractive Autoencoder
• Use a penalty as in sparse autoencoders:
  L(x, g(f(x))) + Ω(h, x)
• But with a different form of Ω, typically Ω(h, x) = λ ||∂f(x)/∂x||_F², the squared Frobenius norm of the Jacobian of the encoder.
• This forces the model to learn a function that does not change much when x changes slightly.
• This is called a Contractive Autoencoder (CAE).
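• A hedged TensorFlow sketch of such a penalty, computing the squared Frobenius norm of the encoder Jacobian with batch_jacobian (the layer sizes and the weight lam are assumptions):

import tensorflow as tf

# Toy encoder f and decoder g (sizes are illustrative assumptions)
encoder = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="sigmoid")])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(784, activation="sigmoid")])

def contractive_loss(x, lam=1e-4):
    # L(x, g(f(x))) + lam * ||dh/dx||_F^2   (lam is an assumed penalty weight)
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        h = encoder(x)                                # h = f(x)
    jac = tape.batch_jacobian(h, x)                   # dh/dx, shape (batch, 32, 784)
    frobenius = tf.reduce_sum(tf.square(jac), axis=[1, 2])
    x_hat = decoder(h)                                # g(f(x))
    recon = tf.reduce_sum(tf.square(x - x_hat), axis=1)
    return tf.reduce_mean(recon + lam * frobenius)

x = tf.random.uniform((8, 784))                       # a dummy batch
print(contractive_loss(x))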
Stochastic Autoencoders
Variational Autoencoder
• A variational autoencoder (VAE) provides a probabilistic manner for
describing an observation in latent space.
• Thus, rather than building an encoder that outputs a single value to describe
each latent state attribute, we’ll formulate our encoder to describe a
probability distribution for each latent attribute.
• It has many applications, such as data compression, synthetic data creation,
etc.
• Variational autoencoder is different from an autoencoder in a way that it
provides a statistical manner for describing the samples of the dataset in
latent space.
• Therefore, in the variational autoencoder, the encoder outputs a probability
distribution in the bottleneck layer instead of a single output value.
Variational Autoencoders (VAE)
Architecture of Variational Autoencoder
• The encoder-decoder architecture lies at the heart of Variational Autoencoders (VAEs), distinguishing them from traditional
autoencoders. The encoder network takes raw input data and transforms it into a probability distribution within the latent space.
• The latent code generated by the encoder is a probabilistic encoding, allowing the VAE to express not just a single point in the
latent space but a distribution of potential representations.
• The decoder network, in turn, takes a sampled point from the latent distribution and reconstructs it back into data space. During
training, the model refines both the encoder and decoder parameters to minimize the reconstruction loss – the disparity between
the input data and the decoded output. The goal is not just to achieve accurate reconstruction but also to regularize the latent
space, ensuring that it conforms to a specified distribution.
• The process involves a delicate balance between two essential components: the reconstruction loss and the regularization term,
often represented by the Kullback-Leibler divergence. The reconstruction loss compels the model to accurately reconstruct the
input, while the regularization term encourages the latent space to adhere to the chosen distribution, preventing overfitting and
promoting generalization.
• By iteratively adjusting these parameters during training, the VAE learns to encode input data into a meaningful latent space
representation. This optimized latent code encapsulates the underlying features and structures of the data, facilitating precise
reconstruction. The probabilistic nature of the latent space also enables the generation of novel samples by drawing random
points from the learned distribution.
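• A short sketch of the sampling ("reparameterization") step and the KL term that regularizes the latent distribution (the latent dimension, layer sizes and dummy data are assumptions):

import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2                                        # assumed latent-space size

# Encoder outputs the parameters of a Gaussian for each input
inputs = layers.Input(shape=(784,))
h = layers.Dense(256, activation="relu")(inputs)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
encoder = Model(inputs, [z_mean, z_log_var])

decoder = tf.keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(784, activation="sigmoid"),
])

x = tf.random.uniform((16, 784))                      # dummy batch of inputs
mean, log_var = encoder(x)

# Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I)
eps = tf.random.normal(shape=tf.shape(mean))
z = mean + tf.exp(0.5 * log_var) * eps
x_hat = decoder(z)

# VAE loss = reconstruction error + KL divergence of N(mu, sigma) from N(0, I)
recon = tf.reduce_sum(tf.square(x - x_hat), axis=1)
kl = -0.5 * tf.reduce_sum(1 + log_var - tf.square(mean) - tf.exp(log_var), axis=1)
loss = tf.reduce_mean(recon + kl)
print(loss)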
Applications of Autoencoders

• Dimensionality Reduction
• Image Compression
• Image Denoising
• Feature Extraction
• Image generation
• Sequence to sequence prediction
• Recommendation system
Image Denoising
• When an image gets corrupted or there is a bit of noise in it, we call it a noisy image. We apply a denoising autoencoder to remove most (if not all) of the noise from the image.
Feature Extraction
• The encoding part of an autoencoder helps to learn the important hidden features present in the input data, in the process of reducing the reconstruction error.
• During encoding, a new set of combinations of the original features is generated.
Image Generation
• There is a type of autoencoder, named the Variational Autoencoder (VAE); this type of autoencoder is a generative model used to generate images.
• Given input images like images of face or scenery, the system will generate similar images.
• The use is to:
• generate new characters of animation
• generate fake human images
Sequence to Sequence Prediction
• The Encoder-Decoder Model that can capture temporal structure, such as LSTMs-based
autoencoders, can be used to address Machine Translation problems.
• This can be used to:
• predict the next frame of a video
• generate fake videos
Recommender Systems via Matrix Completion
An idea: If the predicted value of a user’s rating for a movie is high, then we should
ideally recommend this movie to the user

Thus, if we can "reconstruct" the missing entries in R, we can use this method to recommend movies to users. Using an autoencoder can help us do this.
An Autoencoder based Approach
Using the rating vectors of all users, we can learn an autoencoder.

Note: During backprop, only update weights in W that are connected to the observed ratings
Once learned, the model can predict (reconstruct) the missing ratings
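A hedged sketch of this idea: reconstruct each user's rating vector but compute the loss only on observed entries, which approximates the note above about updating only the weights connected to observed ratings (the rating matrix, mask and sizes below are invented placeholders):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

num_users, num_movies = 100, 500                      # assumed sizes
R = np.random.randint(1, 6, size=(num_users, num_movies)).astype("float32")
mask = (np.random.rand(num_users, num_movies) < 0.1).astype("float32")  # 1 = observed
R_obs = tf.constant(R * mask)                         # unobserved entries stored as 0
mask = tf.constant(mask)

inputs = layers.Input(shape=(num_movies,))
code = layers.Dense(32, activation="relu")(inputs)
outputs = layers.Dense(num_movies)(code)
ae = Model(inputs, outputs)
opt = tf.keras.optimizers.Adam()

for step in range(200):                               # simple full-batch training loop
    with tf.GradientTape() as tape:
        R_hat = ae(R_obs, training=True)
        # squared error only over observed ratings, as the slide suggests
        loss = tf.reduce_sum(mask * tf.square(R_obs - R_hat)) / tf.reduce_sum(mask)
    grads = tape.gradient(loss, ae.trainable_variables)
    opt.apply_gradients(zip(grads, ae.trainable_variables))

R_pred = ae(R_obs)       # reconstructed matrix: missing entries are now predicted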
Assignment 2

Application case study -Handwritten digits recognition using deep learning, LSTM with Keras
– sentiment Analysis
Assignment 3

Application case study – Image dimensionality reduction using encoders LSTM with Keras –
sentiment Analysis
Optimizers in Deep Learning
• In machine learning, optimizers and loss functions are two components that help
improve the performance of the model.
• By calculating the difference between the expected and actual outputs of a model, a
loss function evaluates the effectiveness of a model.
• Among the loss functions are log loss, hinge loss, and mean square loss.
• By modifying the model’s parameters to reduce the loss function value, the
optimizer contributes to its improvement.
• RMSProp, ADAM, and SGD are a few examples of optimizers.
• The optimizer’s job is to determine which combination of the neural network’s
weights and biases will give it the best chance to generate accurate predictions.
• There are various optimization techniques to change model weights
and learning rates, like
• Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient
descent with momentum, Mini-Batch Gradient Descent, AdaGrad,
RMSProp, AdaDelta, and Adam.
• These optimization techniques play a critical role in the training of neural networks, as they help improve the model by adjusting its parameters to minimize the loss function value. Choosing the best optimizer depends on the application.
1. The epoch is the number of times the algorithm iterates over the entire
training dataset.
2. Batch size refers to the number of samples used for updating the model parameters.
3. A sample is a single record of data in a dataset.
4. Learning Rate is a parameter determining the scale of model weight
updates
5. Weights and Bias are learnable parameters in a model that regulate
the signal between two neurons.
Gradient Descent

• A derivative or gradient indicates the direction of increase of the function. Thus, a negative derivative or gradient indicates the direction of decrease of the function. This fact is used to minimize the value of the function.
• In gradient descent, we initialize the variables with random values.
1. We calculate the derivative/gradient for each variable.
2. We take steps in the direction of the negative derivative/gradient using a learning rate. The learning rate controls the descent: too large a learning rate may result in oscillations, while a small learning rate results in slow convergence, so the optimal value of the learning rate is critical.
3. This is done iteratively until we reach a convergence criterion.
Advantages
•Easy to implement and compute
Disadvantages
•Chances of getting stuck in local minima.
•If dataset is too large it becomes computationally expensive and
requires large memory
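A toy sketch of these steps on the one-dimensional function f(w) = (w - 3)^2 (the function, learning rate and tolerance are assumptions for illustration):

import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)          # derivative of f(w) = (w - 3)^2

w = np.random.randn()               # step 0: random initialization
lr = 0.1                            # learning rate controls the descent
for step in range(100):
    g = grad(w)                     # step 1: compute the gradient
    w = w - lr * g                  # step 2: move against the gradient
    if abs(g) < 1e-6:               # step 3: stop at a convergence criterion
        break
print(w)                            # approaches the minimum at w = 3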
• Stochastic Gradient Descent (SGD):
• It’s a variation of the Gradient Descent algorithm. In Gradient Descent, we analyze the entire dataset
in each step, which may not be efficient when dealing with very large datasets. To address this issue,
we have Stochastic Gradient Descent (SGD). In Stochastic Gradient Descent, we process just one
example at a time to perform a single step. So, if the dataset contains 10000 rows, SGD will update
the model parameters 10000 times in a single cycle through the dataset, as opposed to just once in the
case of Gradient Descent.
• Here’s the process:
1. Select an example from the dataset.
2. Calculate its gradient.
3. Utilize the calculated gradient from step 2 to update the model weights.
4. Repeat steps 1 to 3 for all examples in the training dataset.
5. Completing a full pass through all the examples constitutes one epoch.
6. Repeat this entire process for several epochs as specified during training.
Advantages
•Requires less memory
•May get new minima
Disadvantages
•SGD algorithm is noisier and takes more iterations as compared
to gradient descent.
Mini Batch Stochastic Gradient Descent:

• We utilize mini-batch stochastic gradient descent, which uses a predetermined number of training examples, smaller than the full dataset. This approach combines the advantages of the previously mentioned variants. In one epoch, following the creation of fixed-size mini-batches, we execute the following steps:
1. Select a mini-batch.
2. Compute the mean gradient of the mini-batch.
3. Apply the mean gradient obtained in step 2 to update the model's weights.
4. Repeat steps 1 to 3 for all the mini-batches that have been created.
• Advantages
• Requires medium amount of memory
• Less time required to converge when compared to SGD
• Disadvantage
• May get stuck at local minima
SGD with Momentum:

• In stochastic gradient descent, we don't calculate the precise derivative of our loss function. Instead, we estimate it using a small batch. This results in "noisy" derivatives, which means we don't always move in the optimal direction. To address this issue, momentum was introduced to mitigate the noise in SGD. It speeds up convergence in the relevant direction and diminishes fluctuations in irrelevant directions.
• The concept behind momentum involves denoising the derivatives by using an exponentially weighted average, assigning more weight to recent updates than to previous ones.
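A tiny sketch of the update rule described above (beta = 0.9 and the learning rate are assumed values; some libraries use the alternative form v = beta*v + g):

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    # v is an exponentially weighted average of past gradients ("velocity")
    v = beta * v + (1 - beta) * g       # more weight on recent gradients
    w = w - lr * v                      # move along the smoothed direction
    return w, v

# example usage with pretend gradients from successive mini-batches
w, v = 0.0, 0.0
for g in [0.4, 0.3, 0.5]:
    w, v = momentum_step(w, g, v)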
AdaGrad

• AdaGrad, short for adaptive gradient, signifies that the learning rates are adjusted or adapted over time based on previous gradients. A limitation of the previously discussed optimizers is the use of a fixed learning rate for all parameters throughout each cycle. This can hinder the training of features that exhibit small average gradients (such as sparse features), causing them to train at a slower pace. While one potential solution is to set a different learning rate for each feature, this can become complex. AdaGrad addresses this issue by implementing the idea that the more a parameter has been updated in the past, the less it will be updated in the future. This gives other features, such as sparse features, an opportunity to catch up. AdaGrad therefore dynamically adjusts the learning rate for each parameter at every time step t.
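A sketch of the per-parameter update this describes (the learning rate and epsilon are assumed defaults):

import numpy as np

def adagrad_step(w, g, s, lr=0.01, eps=1e-8):
    # s accumulates the squared gradients of each parameter over time
    s = s + g ** 2
    # parameters with a large accumulated history get a smaller effective step
    w = w - lr * g / (np.sqrt(s) + eps)
    return w, s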
RMSProp

• The challenge with AdaGrad lies in its notably slow convergence.


• This is primarily due to the fact that the sum of squared gradients only
accumulates and never diminishes.
• To address this limitation, RMSProp, short for Root Mean Square Propagation,
introduces a decay factor. More precisely, it transforms the sum of squared
gradients into a decayed sum of squared gradients.
• The decay rate indicates that only recent gradient squared values are relevant,
while those from the distant past are effectively disregarded.
• Instead of accumulating all previously squared gradients, RMSProp restricts the
window of accumulated past gradients to a fixed size ‘w’. It achieves this by using
an exponentially moving average instead of the sum of all gradients.
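A sketch of the decayed accumulation RMSProp uses (rho = 0.9 is an assumed decay rate):

import numpy as np

def rmsprop_step(w, g, s, lr=0.001, rho=0.9, eps=1e-8):
    # decayed (exponentially weighted) sum of squared gradients instead of a full sum
    s = rho * s + (1 - rho) * g ** 2
    w = w - lr * g / (np.sqrt(s) + eps)
    return w, s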
Adam

• Adam, which stands for Adaptive Moment Estimation, combines the strengths of both Momentum and RMSProp. Adam has been the preferred choice for many deep learning applications in recent years.
• Advantages:
  • The method is fast and converges rapidly.
• Disadvantages:
  • It takes more memory, since it keeps additional per-parameter state (first and second moments), and is hence computationally more costly.
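A sketch combining the two ideas, with the bias correction Adam applies at early steps (the hyperparameter values are assumed, default-style choices):

import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # momentum-style first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps (t >= 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v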
• Comparison with the SGD optimizer
• Let us see how each of the subsequent optimizers tackled different issues of SGD, which finally led to Adam, now a widely used optimizer.
• Mini-batch SGD is less noisy compared to SGD; however, this comes at an increase in computation cost/memory. It also suffers from the same problems of local minima and a fixed learning rate.

Generative Adversarial Network (GAN)

• The goal of generative modeling is to autonomously identify patterns in input data, enabling the
model to produce new examples that feasibly resemble the original dataset.
• Generative Adversarial Networks (GANs) are a powerful class of neural networks used for unsupervised learning. GANs are made up of two neural networks, a discriminator and a generator. They use adversarial training to produce artificial data that closely resembles actual data.
• The generator turns random noise samples into synthetic data and attempts to fool the discriminator, which is tasked with accurately distinguishing between produced and genuine data.
• Realistic, high-quality samples are produced as a result of this competitive interaction, which drives
both networks toward advancement.
• GANs are proving to be highly versatile artificial intelligence tools, as evidenced by their extensive
use in image synthesis, style transfer, and text-to-image synthesis.
• They have also revolutionized generative modeling.
Generative Adversarial Networks (GANs)
can be broken down into three parts:
• Generative: To learn a generative model, which describes how data is
generated in terms of a probabilistic model.
• Adversarial: The word adversarial refers to setting one thing up
against another. This means that, in the context of GANs, the
generative result is compared with the actual images in the data set. A
mechanism known as a discriminator is used to apply a model that
attempts to distinguish between real and fake images.
• Networks: Use deep neural networks as artificial intelligence (AI)
algorithms for training purposes.
Types of GANs
1. Vanilla GAN: This is the simplest type of GAN. Here, the generator and the discriminator are simple, basic multi-layer perceptrons. In a vanilla GAN, the algorithm is really simple: it tries to optimize the mathematical equation using stochastic gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in which some conditional parameters are put into place.
   • In CGAN, an additional parameter 'y' is added to the generator for generating the corresponding data.
   • Labels are also put into the input to the discriminator in order to help it distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the most successful implementations of GAN. It is composed of ConvNets in place of multi-layer perceptrons.
   • The ConvNets are implemented without max pooling, which is in fact replaced by convolutional stride.
   • Also, the layers are not fully connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image representation consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency residual.
   • This approach uses multiple Generator and Discriminator networks at different levels of the Laplacian pyramid.
   • This approach is mainly used because it produces very high-quality images. The image is first down-sampled at each layer of the pyramid and then up-scaled again at each layer in a backward pass, where the image acquires some noise from the Conditional GAN at these layers, until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of designing a GAN in which a deep neural network is used along with an adversarial network in order to produce higher-resolution images. This type of GAN is particularly useful for optimally up-scaling native low-resolution images to enhance their details while minimizing errors.
Architecture of GANs

• A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.
Generator Model
• A key element responsible for creating fresh, accurate data in a Generative Adversarial
Network (GAN) is the generator model.
• The generator takes random noise as input and converts it into complex data samples, such as text or images. It is commonly implemented as a deep neural network.
• The training data’s underlying distribution is captured by layers of learnable parameters in
its design through training.
• The generator adjusts its output to produce samples that closely mimic real data as it is
being trained by using backpropagation to fine-tune its parameters.
• The generator’s ability to generate high-quality, varied samples that can fool the
discriminator is what makes it successful.
Discriminator Model

• An artificial neural network called a discriminator model is used in Generative Adversarial Networks
(GANs) to differentiate between generated and actual input.
• By evaluating input samples and allocating probability of authenticity, the discriminator functions as
a binary classifier.
• Over time, the discriminator learns to differentiate between genuine data from the dataset and
artificial samples created by the generator. This allows it to progressively hone its parameters and
increase its level of proficiency.
• Convolutional layers or pertinent structures for other modalities are usually used in its architecture
when dealing with picture data.
• Maximizing the discriminator’s capacity to accurately identify generated samples as fraudulent and
real samples as authentic is the aim of the adversarial training procedure.
• The discriminator grows increasingly discriminating as a result of the generator and discriminator’s
interaction, which helps the GAN produce extremely realistic-looking synthetic data overall.
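• A compact sketch of one adversarial training step for such a generator/discriminator pair (the toy dimensions, the synthetic "real" dataset and the binary cross-entropy losses are assumptions):

import tensorflow as tf
from tensorflow.keras import layers, Sequential

noise_dim, data_dim = 16, 2                  # assumed sizes for a toy example

generator = Sequential([layers.Dense(32, activation="relu", input_shape=(noise_dim,)),
                        layers.Dense(data_dim)])
discriminator = Sequential([layers.Dense(32, activation="relu", input_shape=(data_dim,)),
                            layers.Dense(1, activation="sigmoid")])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt, d_opt = tf.keras.optimizers.Adam(1e-3), tf.keras.optimizers.Adam(1e-3)

def train_step(real_batch):
    noise = tf.random.normal((tf.shape(real_batch)[0], noise_dim))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(noise, training=True)
        real_score = discriminator(real_batch, training=True)
        fake_score = discriminator(fake, training=True)
        # Discriminator: label real data 1, generated data 0
        d_loss = (bce(tf.ones_like(real_score), real_score)
                  + bce(tf.zeros_like(fake_score), fake_score))
        # Generator: try to make the discriminator output 1 for generated data
        g_loss = bce(tf.ones_like(fake_score), fake_score)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss

real = tf.random.normal((64, data_dim)) + 3.0        # stand-in for the real dataset
for _ in range(100):
    train_step(real)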
Applications of AutoEncoders
Information Retrieval:
• Task of finding entries in a database that resemble a query entry.
• Search can become extremely efficient in low dimensional spaces.
• If dimensionality reduction algorithm is trained to produce a code that is low
dimensional and binary, then all database entries can be stored in a hash table with
binary code vectors and respective entries.
• Using this hash table, information retrieval can be done by returning all database entries
that have the same binary code as the query.
• Also used for searching similar entries by flipping individual bits from the encoding of
the query.
• This approach to information retrieval via dimensionality reduction and binarization is
called semantic hashing and has been applied to both textual input and images.
..contd..
Image Generation:
• There is a type of autoencoder, named the Variational Autoencoder (VAE). This type of autoencoder is a generative model used to generate images.
• The idea is that given input images like face or scenery, the system will
generate similar images. The use is to:
• generate new characters of animation
• generate fake human images
..contd..
Image Coloring:
• Autoencoders are used for converting any black and white picture into a
colored image. Depending on what is in the picture, it is possible to tell
what the color should be.

..contd..
Feature Variation:
• It extracts only the required features of an image and generates the output
by removing any noise or unnecessary interruption.

..contd..
Dimensionality Reduction:
• Lower-dimensional representations can
improve performance on many tasks,
such as classification, information
retrieval.
• Models of smaller spaces consume less
memory and runtime.
• Performs better than PCA.
• The reconstructed image is the same as the input but obtained from a representation of reduced dimension. It helps in providing a similar image from a much smaller set of stored values.

..contd..
Denoising Image:
• Input seen by the autoencoder is not the raw input but a stochastically
corrupted version. A denoising autoencoder is thus trained to reconstruct
the original input from the noisy version.

..contd..
Watermark Removal:
• It is also used for removing watermarks from images or to remove any
object while filming a video or a movie.

..contd..
Sequence to Sequence Prediction:
• Encoder-Decoder Model that can capture temporal structure, such as
LSTMs-based autoencoders, can be used to address Machine Translation
problems.
• This can be used to:
• predict the next frame of a video
• generate fake videos

..contd..
Recommendation System:
• Deep Autoencoders can be used to understand user preferences to
recommend movies, books or other items.
• Consider the case of YouTube, the idea is:
• the input data is the clustering of similar users based on interests
• interests of users are denoted by videos watched, watch time for each, interactions
(comments) with the video
• above data is captured by clustering content
• Encoder part will capture the interests of the user
• Decoder part will try to project the interests on two parts:
• existing unseen content
• new content from content creators

Applications of Encoder-Decoder LSTMs

• Machine Translation, e.g. English to French translation of phrases.


• Image Captioning, e.g. generating a text description for images.
• Conversational Modeling, e.g. generating answers to textual
questions.
• Movement Classification, e.g. generating a sequence of commands
from a sequence of gestures.
..contd..
ED model is used in the following cases:
• Image Captioning – ED model allows a process to generate a sentence
describing an image. It receives the image as the input and outputs a
sequence of words. This also works with videos.

ML output: ‘Road surrounded by palm trees leading to a beach’, Photo by Milo Miloezger on Unsplash
..contd..
• Sentiment Analysis – The model understands the meaning and emotions of the input sentence and outputs a sentiment score, usually rated between -1 and 1, where 0 is neutral. It is used in call centers to analyze the client's emotions and their reactions to certain keywords or company discounts.
..contd..
• Translation – This model reads an input sentence, understands the
message and the concepts, then translates it into a second language. Eg:
Google Translate is built upon an encoder decoder structure.
Thank You
