DLT Unit-4

Gg

Uploaded by

summa0346
Copyright
© © All Rights Reserved

UNIT-IV ADDITIONAL DEEP LEARNING ARCHITECTURES

LONG SHORT TERM MEMORY


Long Short-Term Memory (LSTM) is a kind of recurrent neural network (RNN). In an RNN, the
output from the previous step is fed as input to the current step. LSTM was designed by
Hochreiter & Schmidhuber to tackle the long-term dependency problem of RNNs: a plain RNN
gives accurate predictions from recent information but struggles to use information seen many
steps earlier, and its performance degrades as the gap length increases. An LSTM, by design,
can retain information for a long period of time. It is used for processing, predicting, and
classifying on the basis of time-series data.
Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN)
that is specifically designed to handle sequential data, such as time series, speech, and text.
LSTM networks are capable of learning long-term dependencies in sequential data, which
makes them well suited for tasks such as language translation, speech recognition, and time
series forecasting.
A traditional RNN has a single hidden state that is passed through time, which can
make it difficult for the network to learn long-term dependencies. LSTMs address this
problem by introducing a memory cell, which is a container that can hold information for
an extended period of time. The memory cell is controlled by three gates: the input gate, the
forget gate, and the output gate. These gates decide what information to add to, remove
from, and output from the memory cell.
The input gate controls what information is added to the memory cell. The forget
gate controls what information is removed from the memory cell. And the output gate
controls what information is output from the memory cell. This allows LSTM networks to
selectively retain or discard information as it flows through the network, which allows them
to learn long-term dependencies.
LSTMs can be stacked to create deep LSTM networks, which can learn even more
complex patterns in sequential data. LSTMs can also be used in combination with other
neural network architectures, such as Convolutional Neural Networks (CNNs) for image
and video analysis.
Structure Of LSTM:
LSTM has a chain structure that contains four neural networks and different
memory blocks called cells.
Information is retained by the cells and the memory manipulations are done by
the gates. There are three gates –

1. Forget Gate: The information that is no longer useful in the cell state is removed
with the forget gate. Two inputs, x_t (the input at the current time step) and h_t-1 (the previous
hidden state), are fed to the gate and multiplied with weight matrices, followed by the addition of
a bias. The result is passed through a sigmoid activation function, which gives an output between
0 and 1 for each element of the cell state. An output close to 0 means the piece of information is
forgotten, and an output close to 1 means it is retained for future use. The equation for the
forget gate is:
f_t = σ(W_f · [h_t-1, x_t] + b_f)
where:
• W_f represents the weight matrix associated with the forget gate.
• [h_t-1, x_t] denotes the concatenation of the previous hidden state and the current input.
• b_f is the bias of the forget gate.
• σ is the sigmoid activation function.

2. Input gate: The addition of useful information to the cell state is done by the
input gate. First, a sigmoid function applied to the inputs h_t-1 and x_t regulates the
information and filters the values to be remembered, similar to the forget gate. Then, the tanh
function creates a vector of candidate values Ĉ_t from h_t-1 and x_t, with each entry between
-1 and +1. Finally, the candidate vector and the regulated values are multiplied element-wise to
obtain the useful information, which is added to the cell state after the forget gate has scaled
the old state. The equations for the input gate and the cell-state update are:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)
C_t = f_t ⊙ C_t-1 + i_t ⊙ Ĉ_t
where:
• ⊙ denotes element-wise multiplication.
• tanh is the hyperbolic tangent activation function.

3. Output gate: The task of extracting useful information from the current cell state
to be presented as output is done by the output gate. First, a vector is generated by applying
the tanh function to the cell state. Then, a sigmoid function applied to the inputs h_t-1 and x_t
regulates the information and filters the values to be output. Finally, the vector and the
regulated values are multiplied element-wise to give the hidden state h_t, which is sent as the
output and as an input to the next cell. The equations for the output gate are:
o_t = σ(W_o · [h_t-1, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
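The three gates can be put together into a single time step. The following is a minimal sketch in plain Python, not a reference implementation: the parameter values, the helper names (`affine`, `const_matrix`, `lstm_step`) and the toy sizes (2 hidden units, 1 input feature) are assumptions made for illustration. It uses the gate equations given above together with the standard hidden-state update h_t = o_t ⊙ tanh(C_t).

```python
import math

def sigmoid(v):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x_t  # list concatenation: [h_t-1, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]        # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]        # input gate
    c_hat = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate values
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]        # output gate
    c_t = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_hat)]
    h_t = [oj * math.tanh(cj) for oj, cj in zip(o, c_t)]
    return h_t, c_t

def const_matrix(value, rows=2, cols=3):
    """Hypothetical helper: a rows x cols matrix filled with one value."""
    return [[value] * cols for _ in range(rows)]

# Made-up toy parameters: 2 hidden units, 1 input feature (so 3 columns).
params = {
    "W_f": const_matrix(0.1), "b_f": [1.0, 1.0],
    "W_i": const_matrix(0.2), "b_i": [0.0, 0.0],
    "W_c": const_matrix(0.3), "b_c": [0.0, 0.0],
    "W_o": const_matrix(0.4), "b_o": [0.0, 0.0],
}

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):  # a short input sequence
    h, c = lstm_step(params, h, c, x)
print(h, c)
```

Because the same `params` are reused at every step, the loop also shows why the computation is sequential: step t cannot start until h and c from step t-1 are available.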

Advantages of LSTM

1. LSTM networks can capture long-term dependencies. They have a memory cell
that is capable of long-term information storage.
2. Traditional RNNs suffer from vanishing and exploding gradients when models are
trained over long sequences. The gating mechanism, which selectively retains or forgets
information, lets LSTM networks mitigate the vanishing-gradient problem in particular.
3. LSTM enables the model to capture and remember the important context, even when
there is a significant time gap between relevant events in the sequence. LSTMs are therefore
used where understanding context is important, e.g. machine translation.

Disadvantages of LSTM

1. Compared to simpler architectures like feed-forward neural networks, LSTM networks
are computationally more expensive. This can limit their scalability for large-scale
datasets or constrained environments.
2. Training LSTM networks can be more time-consuming compared to simpler models
due to their computational complexity. Training LSTMs therefore often requires more data and
longer training times to achieve high performance.
3. Since a sequence is processed word by word in a sequential manner, it is hard to parallelize
the work of processing the sentences.

Long Short-Term Memory (LSTM) is a powerful type of Recurrent Neural Network (RNN) that
has been used in a wide range of applications. Some of the well-known applications of LSTM
include:

1. Language Modeling: LSTMs have been used for natural language processing tasks such
as language modeling, machine translation, and text summarization. They can be trained
to generate coherent and grammatically correct sentences by learning the dependencies
between words in a sentence.
2. Speech Recognition: LSTMs have been used for speech recognition tasks such as
transcribing speech to text and recognizing spoken commands. They can be trained to
recognize patterns in speech and match them to the corresponding text.
3. Time Series Forecasting: LSTMs have been used for time series forecasting tasks such
as predicting stock prices, weather, and energy consumption. They can learn patterns in
time series data and use them to make predictions about future events.
4. Anomaly Detection: LSTMs have been used for anomaly detection tasks such as
detecting fraud and network intrusion. They can be trained to identify patterns in data
that deviate from the norm and flag them as potential anomalies.
5. Recommender Systems: LSTMs have been used for recommendation tasks such as
recommending movies, music, and books. They can learn patterns in user behavior and
use them to make personalized recommendations.
6. Video Analysis: LSTMs have been used for video analysis tasks such as object
detection, activity recognition, and action classification. They can be used in
combination with other neural network architectures, such as Convolutional Neural
Networks (CNNs), to analyze video data and extract useful information.
GATED RECURRENT UNITS

Gated Recurrent Units were introduced by Kyunghyun Cho et al. in 2014 as a solution
to the vanishing gradient problem. GRUs use gating mechanisms to control the flow of
information. These gates determine what information should be passed to the output and what
should continue to be retained in the network's internal state, allowing the model to better
capture dependencies for sequences of varied lengths.

GRU Architecture

The GRU has two gates:

• Update Gate: The update gate helps the model determine how much of the
past information (from previous time steps) needs to be passed along to the
future. It is crucial for the model to capture long-term dependencies and
decide what to retain in the memory.

• Reset Gate: The reset gate decides how much of the past information to forget. It
controls how much of the previous hidden state is used when computing the new
candidate state, which lets the model decide how important the past is to the current
state.

These gates are vectors that contain values between 0 and 1. These values are
calculated using the sigmoid activation function. A value close to 0 means that the gate is
closed, and no information is passed through, while a value close to 1 means the gate is open,
and all information is passed through.

GRU Equations

The operations within a GRU can be described by the following set of equations:
• Update Gate: z_t = σ(W_z · [h_t-1, x_t])
• Reset Gate: r_t = σ(W_r · [h_t-1, x_t])
• Candidate Hidden State: h̃_t = tanh(W · [r_t ⊙ h_t-1, x_t])
• Final Hidden State: h_t = (1 - z_t) ⊙ h_t-1 + z_t ⊙ h̃_t

Here, σ represents the sigmoid function, tanh is the hyperbolic tangent function,
⊙ represents element-wise multiplication, W_z, W_r, and W are parameter matrices,
h_t-1 is the previous hidden state, x_t is the current input, and h_t is the current hidden state.
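The GRU equations translate almost line by line into code. The sketch below is a minimal, bias-free version that follows the equations in this section; the function names and the toy weight values are assumptions chosen for illustration, and plain Python lists stand in for vectors and matrices.

```python
import math

def sigmoid(v):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply a list-of-rows matrix W by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(W_z, W_r, W, h_prev, x_t):
    """One GRU time step; returns the new hidden state h_t."""
    concat = h_prev + x_t                                  # [h_t-1, x_t]
    z = [sigmoid(a) for a in matvec(W_z, concat)]          # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]          # reset gate
    gated = [rj * hj for rj, hj in zip(r, h_prev)] + x_t   # [r_t ⊙ h_t-1, x_t]
    h_tilde = [math.tanh(a) for a in matvec(W, gated)]     # candidate state
    return [(1.0 - zj) * hj + zj * gj
            for zj, hj, gj in zip(z, h_prev, h_tilde)]

# Made-up toy weights: 2 hidden units, 1 input feature (so 3 columns).
W_z = [[0.2, -0.1, 0.4], [0.1, 0.3, -0.2]]
W_r = [[-0.3, 0.2, 0.1], [0.4, -0.4, 0.2]]
W_h = [[0.5, 0.1, -0.6], [-0.2, 0.3, 0.7]]

h = [0.0, 0.0]
for x in ([1.0], [0.0], [-0.5]):  # a short input sequence
    h = gru_step(W_z, W_r, W_h, h, x)
print(h)
```

With all weights set to zero, both gates evaluate to 0.5 and the candidate state is 0, so the step simply halves the previous hidden state; this is a convenient sanity check for the update rule.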

Advantages of GRUs

GRUs provide several advantages:

• Mitigating the Vanishing Gradient Problem: GRUs can maintain long-term dependencies
within the input data, which traditional RNNs often fail to capture.

• Efficiency: GRUs are computationally more efficient than Long Short-Term Memory networks
(LSTMs), another popular RNN variant, because they have fewer parameters.

• Flexibility: GRUs are capable of handling sequences of varying lengths and are suitable for
applications where the sequence length might not be fixed or known in advance.
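The efficiency claim can be made concrete with a rough parameter count. The sketch below assumes the common textbook layout in which every gate or candidate block has one weight matrix over [h_t-1, x_t] and one bias vector; an LSTM has four such blocks and a GRU has three. The layer sizes are made-up values, and real libraries may count slightly differently (for example, by adding extra bias vectors).

```python
def gated_rnn_params(num_blocks, input_size, hidden_size):
    """Parameters for num_blocks gate/candidate blocks of a recurrent layer."""
    # Each block: weights over the concatenation [h_t-1, x_t], plus a bias.
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 128, 256  # hypothetical layer sizes
lstm_params = gated_rnn_params(4, input_size, hidden_size)  # f, i, candidate, o
gru_params = gated_rnn_params(3, input_size, hidden_size)   # z, r, candidate

print("LSTM:", lstm_params)
print("GRU: ", gru_params)
print("GRU / LSTM ratio:", gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, whatever the sizes are.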

Applications of GRUs

GRUs are used in tasks where sequence data is prevalent. Some applications include:

• Language Modeling: GRUs can predict the probability of a sequence of words or the next
word in a sentence, which is useful for tasks like text generation or auto-completion.
• Machine Translation: They can be used to translate text from one language to another by
capturing the context of the input sequence.
• Speech Recognition: GRUs can process audio data over time to transcribe spoken language
into text.
• Time Series Analysis: They are effective for predicting future values in a time series, such as
stock prices or weather forecasts.

AUTOENCODERS
Autoencoders are a specific type of feedforward neural network in which the target output is
the same as the input. They compress the input into a lower-dimensional code and then
reconstruct the output from this representation. The code is a compact “summary” or
“compression” of the input, also called the latent-space representation.
An autoencoder consists of 3 components: encoder, code and decoder. The encoder
compresses the input and produces the code; the decoder then reconstructs the input using only
this code.

To build an autoencoder we need 3 things: an encoding method, decoding method, and a loss
function to compare the output with the target. We will explore these in the next section.

Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple
of important properties:

• Data-specific: Autoencoders are only able to meaningfully compress data similar to what
they have been trained on. Since they learn features specific to the given training data,
they are different from a standard data compression algorithm like gzip. So we can’t
expect an autoencoder trained on handwritten digits to compress landscape photos.

• Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a
close but degraded representation. If you want lossless compression, they are not the way
to go.

• Unsupervised: To train an autoencoder we don’t need to do anything fancy, just throw the
raw input data at it. Autoencoders are considered an unsupervised learning technique
since they don’t need explicit labels to train on. But to be more precise, they are self-
supervised because they generate their own labels from the training data.

Architecture

Let’s explore the details of the encoder, code and decoder. Both the encoder and decoder are
fully-connected feedforward neural networks, that is, standard ANNs. The code is a single
layer of an ANN with the dimensionality of our choice. The number of nodes in the code layer
(code size) is a hyperparameter that we set before training the autoencoder.
In more detail, the input first passes through the encoder, which is a fully-connected ANN,
to produce the code. The decoder, which has a similar ANN structure, then produces the output
using only the code. The goal is to get an output identical to the input. Note that the decoder
architecture is typically the mirror image of the encoder. This is not a requirement; the only
requirement is that the dimensionality of the input and output be the same. Anything in the
middle can be played with.
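The encoder, code and decoder structure can be sketched as a plain forward pass. The example below is illustrative only: the layer sizes (8, 4, 2, 4, 8), the random untrained weights, and the helper names are assumptions, and sigmoid is used for every layer purely for simplicity.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(W, b, v):
    """Fully-connected layer with a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + bi)
            for row, bi in zip(W, b)]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer of shape n_out x n_in."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
# Hypothetical mirrored architecture: input, hidden, code, hidden, output.
sizes = [8, 4, 2, 4, 8]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

def forward(x):
    """Return the activations of every layer, starting with the input."""
    activations = [x]
    for W, b in layers:
        activations.append(dense(W, b, activations[-1]))
    return activations

x = [rng.random() for _ in range(sizes[0])]
activations = forward(x)
code = activations[2]    # output of the encoder (the bottleneck)
x_hat = activations[-1]  # output of the decoder (the reconstruction)
print(len(x), len(code), len(x_hat))
```

The only hard constraint visible here is that `x_hat` has the same length as `x`; the hidden and code sizes in between are free choices.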

There are 4 hyperparameters that we need to set before training an autoencoder:

• Code size: number of nodes in the middle layer. Smaller size results in more compression.
• Number of layers: the autoencoder can be as deep as we like. For example, it might have
2 layers in both the encoder and decoder, without considering the input and output.
• Number of nodes per layer: the autoencoder architecture we’re working on is called
a stacked autoencoder since the layers are stacked one after another. Usually stacked
autoencoders look like a “sandwich”. The number of nodes per layer decreases with each
subsequent layer of the encoder, and increases back in the decoder. Also the decoder is
symmetric to the encoder in terms of layer structure. As noted above this is not necessary
and we have total control over these parameters.
• Loss function: we either use mean squared error (mse) or binary crossentropy. If the
input values are in the range [0, 1] then we typically use crossentropy, otherwise we use
the mean squared error.
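The two reconstruction losses mentioned in the last item can be written in a few lines. This is a minimal sketch with made-up input and reconstruction vectors; the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math

def mse(x, x_hat):
    """Mean squared error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def binary_crossentropy(x, x_hat, eps=1e-12):
    """Binary cross-entropy; expects inputs and outputs in [0, 1]."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(a * math.log(b) + (1.0 - a) * math.log(1.0 - b))
    return total / len(x)

x = [0.0, 1.0, 1.0, 0.0]           # hypothetical input in [0, 1]
good = [0.1, 0.9, 0.8, 0.2]        # close reconstruction
bad = [0.9, 0.2, 0.3, 0.7]         # poor reconstruction

print(mse(x, good), mse(x, bad))
print(binary_crossentropy(x, good), binary_crossentropy(x, bad))
```

Both losses are lower for the closer reconstruction; cross-entropy penalizes confident mistakes more sharply than mean squared error does.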

SPARSE AUTOENCODER (SAE)
An SAE can have more hidden units than input units. Still, only a small number of
hidden units are allowed to be active at once, i.e., for any given observation the network
learns an encoding and decoding that relies on activating only a small number of neurons.
The intuition behind this method is as follows: suppose 'A' claims to be an expert in
mathematics, computer science, psychology, and classical dance; then 'A' has probably learned
only some quite shallow knowledge of these subjects. However, if 'A' claims to be devoted only
to mathematics, then 'A' is more likely to be a master of it and to give us useful insights. (It
is the same for the autoencoders we're training: fewer nodes activating while performance is
maintained suggests that the autoencoder is learning latent representations instead of
redundant information in our input data.)

Sparse Autoencoder

In this way, we limit the network's capacity to memorize the input data without limiting its
capacity to extract features from the input. There are two main ways by which we can impose
sparsity constraints; both involve measuring the hidden layer activations for each training
batch and adding a term to the loss function that penalizes excessive activations:
i) L1 regularization: L1 regularization adds the absolute value of the coefficient as a
penalty term. It tends to shrink coefficients exactly to zero, whereas L2 regularization adds
the squared magnitude as the penalty term, thus moving coefficients towards zero but never
reaching it.
Now consider the L1 regularization term and its derivative for a single coefficient w:
L1(w) = |w|,   dL1/dw = sign(w)   (+1 for w > 0, -1 for w < 0)

Here, the gradient is either +1 or -1 except when w = 0, which means that L1 regularization
always moves w towards zero with the same step size, regardless of the value of w. When w = 0,
the gradient is taken to be zero and no further update is made. Because L1 regularization
drives many values exactly to zero, the activations of a sparse autoencoder are sparser, which
often helps it learn better representations than the same autoencoder trained without L1
regularization.
ii) KL divergence: The KL divergence measures how well a probability distribution Q
approximates a probability distribution P; it is calculated as the cross-entropy of P and Q
minus the entropy of P. Intuitively, you can think of it as a statistical measure of how one
distribution differs from another.
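Both sparsity penalties can be sketched in a few lines. In the example below, `l1_penalty` sums the absolute hidden activations, and `kl_sparsity_penalty` uses a commonly used formulation that compares a target activation level `rho` with each hidden unit's mean activation over a batch, treating both as Bernoulli distributions. The target value, the penalty weight `lam`, and the activation values are assumptions for illustration.

```python
import math

def entropy(p):
    """Entropy H(P) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) of two discrete distributions."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) computed as cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

def l1_penalty(activations, lam):
    """L1 sparsity penalty: lam times the sum of absolute activations."""
    return lam * sum(abs(a) for a in activations)

def kl_sparsity_penalty(rho, mean_activations):
    """Sum of KL(Bernoulli(rho) || Bernoulli(rho_hat_j)) over hidden units."""
    return sum(kl_divergence([rho, 1.0 - rho], [r, 1.0 - r])
               for r in mean_activations)

# Hypothetical mean activations of 4 hidden units over a batch.
sparse_units = [0.05, 0.04, 0.06, 0.05]
busy_units = [0.60, 0.70, 0.50, 0.80]

print(l1_penalty(sparse_units, lam=0.01), l1_penalty(busy_units, lam=0.01))
print(kl_sparsity_penalty(0.05, sparse_units),
      kl_sparsity_penalty(0.05, busy_units))
```

Either penalty is added to the reconstruction loss, so a network whose hidden units are active most of the time pays a larger total loss than one with mostly inactive units.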

DENOISING AUTOENCODER (DAE)

The denoising autoencoder is a stochastic version of the autoencoder. To force the hidden
layer to discover more robust features and prevent it from simply learning the identity
function, we train the autoencoder to reconstruct the input from an intentionally, slightly
corrupted version of the input.
Intuitively, a denoising autoencoder does two things: it encodes the input while preserving
the information about the input, and it undoes the effect of a corruption process stochastically
applied to the input of the autoencoder.
The 'noise' process typically follows one of two approaches. The first randomly sets some of
the inputs to zero to imitate missing values, either manually or by adding a dropout layer
between the inputs and the first hidden layer. Alternatively, Gaussian noise can be added to
continuous-valued inputs.
Denoise Autoencoder
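The two corruption processes can be sketched as follows. The function names, the drop probability, and the noise level are assumptions for illustration; the important point is that the corrupted vector is fed to the network while the clean vector remains the reconstruction target.

```python
import random

def mask_noise(x, drop_prob, rng):
    """Randomly set inputs to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]

def gaussian_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in x]

rng = random.Random(42)
clean = [0.2, 0.9, 0.4, 0.7, 0.1, 0.5]  # hypothetical input vector

masked = mask_noise(clean, drop_prob=0.3, rng=rng)
noisy = gaussian_noise(clean, sigma=0.1, rng=rng)

# A denoising autoencoder is trained on (corrupted input, clean target) pairs.
training_pairs = [(masked, clean), (noisy, clean)]
for corrupted, target in training_pairs:
    print(corrupted, "->", target)
```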

CONTRACTIVE AUTOENCODER (CAE)

The objective of a contractive autoencoder is to learn a robust representation that is less
sensitive to small variations in the data. Adding an explicit regulariser (a penalty term) to
the loss function forces the model to learn a function that is robust to slight variations of
the input values.
There is a connection between the denoising autoencoder (DAE) and the contractive
autoencoder (CAE): in the limit of small Gaussian input noise, DAE makes the reconstruction
function resist small but finite-sized perturbations of the input. In contrast, CAE makes the
extracted features resist infinitesimal perturbations of the input.
Intuitively, a CAE applied to images should learn tangent vectors that show how the image
changes as objects in the image gradually change pose. This property would not be emphasized
as much by a standard loss function.
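In the usual formulation of the contractive autoencoder, the penalty term is the squared Frobenius norm of the Jacobian of the encoder output with respect to the input. The sketch below assumes a single sigmoid encoder layer h = σ(Wx + b), for which this norm has a simple closed form; the weights and input are made-up values.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def encode(W, b, x):
    """Single-layer sigmoid encoder: h = sigmoid(W x + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the Jacobian dh/dx for a sigmoid encoder.

    For h_j = sigmoid(w_j . x + b_j), the Jacobian entry is
    dh_j/dx_i = h_j * (1 - h_j) * W[j][i], so the squared norm is
    sum_j (h_j * (1 - h_j))^2 * sum_i W[j][i]^2.
    """
    h = encode(W, b, x)
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row)
               for hj, row in zip(h, W))

# Hypothetical encoder with 3 inputs and 2 hidden units.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
b = [0.0, 0.1]
x = [1.0, 0.5, -1.0]

lam = 0.1  # assumed weight of the penalty in the total loss
print(lam * contractive_penalty(W, b, x))
```

The total loss would be the reconstruction error plus `lam` times this penalty, so the encoder is rewarded for features that change little when the input is nudged.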

DEEP AUTOENCODER

A deep autoencoder comprises two symmetrical deep-belief networks: typically four or five
shallow layers make up the encoding half of the net, and a second set of four or five layers
makes up the decoding half.
The layers are restricted Boltzmann machines (RBMs), which are the building blocks of
deep-belief networks. A binary transformation is used after each RBM.
Deep autoencoders can also be used for other types of datasets with real-valued data, on which
you would use Gaussian rectified transformations for the RBMs instead.
For example, the input for the MNIST dataset can have shape (SAMPLE_NUM, 784) while the
compressed representation has shape (SAMPLE_NUM, 64). In this way, we make sure
that we're learning compressed representations that approximate the input data and not just
copying the original input.

Deep Autoencoder

VARIATIONAL AUTOENCODER (VAE)

Like a standard autoencoder, a variational autoencoder is an architecture composed of
both an encoder and a decoder, trained to minimize the reconstruction error between the
encoded-decoded data and the initial data. However, instead of mapping an input to a fixed
vector, it maps the input to a distribution. Rather than having the encoder output a single
value to describe each latent state attribute, the variational autoencoder replaces the
bottleneck vector with two different vectors, one representing the mean of the distribution and
the other representing the standard deviation of the distribution.
The loss function of a variational autoencoder consists of two terms: the first is the
reconstruction loss, and the second is a regularizer, the Kullback-Leibler divergence
between the encoder's distribution q_θ(z|x) and the prior p(z). This divergence measures how
much information is lost when using q to represent p.

Loss Function for Variational Autoencoder

The model is trained as follows. First, the input is encoded as a distribution over the latent
space. Then, a point is sampled from that distribution, the sampled point is decoded, and the
reconstruction error is computed. Finally, the reconstruction error is backpropagated through
the network.
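The sampling step and the two loss terms can be sketched as below. This assumes a diagonal Gaussian encoder and a standard normal prior p(z) = N(0, I), for which the KL term has a closed form; sampling uses the reparameterization z = μ + σ·ε with ε drawn from N(0, 1). The squared-error reconstruction loss and all numeric values are assumptions for illustration.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction loss (squared error) plus the KL regularizer."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, sigma)

rng = random.Random(0)
mu, sigma = [0.3, -0.1], [0.8, 1.2]   # hypothetical encoder outputs
z = sample_latent(mu, sigma, rng)     # point sampled from the latent space

x = [0.2, 0.9, 0.4]                   # hypothetical input
x_hat = [0.25, 0.8, 0.5]              # hypothetical decoder output for z
print(z, vae_loss(x, x_hat, mu, sigma))
```

Writing the sample as a deterministic function of μ, σ and independent noise is what allows the reconstruction error to be backpropagated through the sampling step to the encoder.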
Additionally, we can add two components to improve quality:
> a pre-trained classifier used as a feature extractor, so that features of the input data can
be aligned with those of the reproduced images, or
> a discriminator network for additional adversarial loss signals.

Variational Autoencoder
Comparative Study of the Structures:

Below is a small table defining the advantages and disadvantages of the structures
discussed above:
ADVERSARIAL GENERATIVE NETWORKS

A Generative Adversarial Network (GAN) is a deep learning architecture that
consists of two neural networks competing against each other in a zero-sum game
framework. The goal of GANs is to generate new, synthetic data that resembles some
known data distribution.
What is a Generative Adversarial Network?
Generative Adversarial Networks (GANs) are a powerful class of neural networks that are
used for unsupervised learning. They were introduced by Ian J. Goodfellow and colleagues in
2014. GANs are basically made up of a system of two neural network models that compete with
each other and are able to analyze, capture and copy the variations within a dataset.
Why were GANs developed in the first place?
It has been observed that most mainstream neural networks can easily be fooled into
misclassifying inputs by adding only a small amount of noise to the original data.
Surprisingly, after the noise is added the model can be more confident in the wrong prediction
than it was in the correct one. One reason for this vulnerability is that most machine
learning models learn from a limited amount of data, which is a major drawback because it
makes them prone to overfitting. In addition, the mapping between the input and the output is
almost linear. Although the boundaries of separation between the various classes may appear
to be linear, in reality they are composed of many linear pieces, and even a small change to a
point in the feature space might lead to the misclassification of data.

How do GANs work?


Generative Adversarial Networks (GANs) can be broken down into three parts:
• Generative: To learn a generative model, which describes how data is generated in
terms of a probabilistic model.
• Adversarial: The training of a model is done in an adversarial setting.
• Networks: Use deep neural networks as artificial intelligence (AI) algorithms for
training purposes.
In GANs, there is a Generator and a Discriminator. The Generator generates fake
samples of data (be it an image, audio, etc.) and tries to fool the Discriminator. The
Discriminator, on the other hand, tries to distinguish between the real and fake samples.
The Generator and the Discriminator are both neural networks, and they run in
competition with each other in the training phase. The steps are repeated several times, and
the Generator and Discriminator get better and better at their respective jobs after
each repetition. The work can be visualized by the diagram given below:
Generative Adversarial Network Architecture and its Components
Here, the generative model captures the distribution of data and is trained in such a
manner that it tries to maximize the probability of the Discriminator making a mistake. The
Discriminator, on the other hand, estimates the probability that the sample it received came
from the training data and not from the Generator. GANs are formulated as a minimax game,
in which the Discriminator tries to maximize its reward V(D, G) and the Generator tries to
minimize the Discriminator's reward, or in other words, maximize the Discriminator's loss. It
can be mathematically described by the formula below:

Loss function for a GAN Model


where,
• G = Generator
• D = Discriminator
• Pdata(x) = distribution of real data
• P(z) = distribution of the noise fed to the Generator
• x = sample from Pdata(x)
• z = sample from P(z)
• D(x) = output of the Discriminator network for sample x
• G(z) = output of the Generator network for noise z

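For reference, the standard minimax value function for a GAN, written with the symbols defined above, is usually given in the following form. This is the commonly cited formulation rather than a formula taken from this text.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim P_{data}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim P(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards the Discriminator for assigning high probability to real samples, and the second rewards it for assigning low probability to generated samples; the Generator can only influence the second term.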
Generator Model

The Generator is trained while the Discriminator is held fixed. After the Discriminator has
been trained on the fake data produced by the Generator, we use its predictions on newly
generated samples as the training signal for the Generator, so that the Generator improves on
its previous state and gets better at fooling the Discriminator.
Discriminator Model

The Discriminator is trained while the Generator is held fixed. In this phase, the Generator
is only forward propagated and no back-propagation is done through it. The Discriminator is
trained on real data for n epochs to see if it can correctly predict them as real. In the same
phase, the Discriminator is also trained on the fake data produced by the Generator to see if
it can correctly predict them as fake.
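The two training phases optimize two different losses computed from the Discriminator's outputs. The sketch below shows only those losses, not the networks or the weight updates; the probability values are made up, and the generator loss uses the minimax form log(1 - D(G(z))) for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake):
    """Negative of V(D, G): the Discriminator minimizes this to maximize V.

    d_real: Discriminator outputs D(x) on real samples.
    d_fake: Discriminator outputs D(G(z)) on generated samples.
    """
    return -(mean([math.log(p) for p in d_real])
             + mean([math.log(1.0 - p) for p in d_fake]))

def generator_loss(d_fake):
    """Minimax generator loss: mean of log(1 - D(G(z)))."""
    return mean([math.log(1.0 - p) for p in d_fake])

# Discriminator phase (Generator fixed): a sharper Discriminator has lower loss.
confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
sharp = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
print(confused, sharp)

# Generator phase (Discriminator fixed): fooling the Discriminator lowers the loss.
caught = generator_loss(d_fake=[0.1, 0.05])
fooling = generator_loss(d_fake=[0.8, 0.9])
print(caught, fooling)
```

In a full training loop these two phases alternate: a step that lowers `discriminator_loss` with the Generator frozen, followed by a step that lowers `generator_loss` with the Discriminator frozen.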
Different Types of GAN Models

1. Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the
Discriminator are simple multi-layer perceptrons. The algorithm is simple: it tries to
optimize the minimax objective using stochastic gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in
which some conditional parameters are put into place. In CGAN, an additional
parameter ‘y’ is added to the Generator for generating the corresponding data. Labels
are also put into the input to the Discriminator in order for the Discriminator to help
distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the
most successful implementations of GAN. It is composed of ConvNets in place of
multi-layer perceptrons. The ConvNets are implemented without max pooling, which is
in fact replaced by convolutional stride. Also, the layers are not fully connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible
image representation consisting of a set of band-pass images, spaced an octave apart,
plus a low-frequency residual. This approach uses multiple Generator and
Discriminator networks, one pair for each level of the Laplacian pyramid. It is
mainly used because it produces very high-quality images. The image is first down-sampled
at each layer of the pyramid and then up-scaled again at each layer in a
backward pass, where the image acquires some noise from the Conditional GAN at these
layers until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of
designing a GAN in which a deep neural network is used along with an adversarial
network in order to produce higher-resolution images. This type of GAN is particularly
useful for up-scaling native low-resolution images to enhance their details while
minimizing errors.
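The idea of a linear, invertible pyramid can be illustrated on a one-dimensional signal. The sketch below is a simplification and an assumption throughout: it down-samples by averaging neighbouring pairs and up-samples by repetition, whereas an image Laplacian pyramid would normally use Gaussian blurring on two-dimensional data. Each band stores what is lost by one down-sampling step, which is the detail a LAPGAN generator at that level would be asked to produce.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating each value."""
    return [v for v in signal for _ in range(2)]

def build_pyramid(signal, levels):
    """Return (band-pass residuals, low-frequency residual)."""
    bands = []
    current = signal
    for _ in range(levels):
        coarse = downsample(current)
        # Detail lost by down-sampling at this level.
        bands.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    return bands, current

def reconstruct(bands, residual):
    """Invert build_pyramid: up-sample and add the details back, coarse to fine."""
    current = residual
    for band in reversed(bands):
        current = [u + d for u, d in zip(upsample(current), band)]
    return current

# The length must be divisible by 2 ** levels.
signal = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
bands, residual = build_pyramid(signal, levels=2)
print(bands, residual)
print(reconstruct(bands, residual))
```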
Applications of GANs

1. Generate new data from available data: this means generating new samples, based on the
available samples, that are not identical to any real one.
2. Generate realistic pictures of people who have never existed.
3. GANs are not limited to images; they can also generate text, articles, songs, poems, etc.
4. Generate music or a cloned voice: if you provide a voice sample, GANs can
generate a similar clone of it. In one research paper, researchers from NIT in
Tokyo proposed a system that is able to generate melodies from lyrics with the help of learned
relationships between notes and subjects.
5. Text-to-image generation (Object GAN and Object-Driven GAN).
6. Creation of anime characters in game development and animation production.
7. Image-to-image translation: we can translate one image to another without changing the
background of the source image. For example, GANs can replace a dog with a cat.
8. Low resolution to high resolution: given a low-resolution image or video, a GAN can
produce a high-resolution version of it.
9. Prediction of the next frame in a video: by training on short sequences of video frames,
GANs are able to generate or predict the next few frames.
10. Interactive image generation: GANs can generate images and video footage in an artistic
style if they are trained on a suitable real dataset.
11. Speech: researchers from the College of London published a system called
GAN-TTS that learns to generate raw audio through training on 567 corpora of speech
data.