DLT Unit-4
1. Forget Gate: The forget gate removes information that is no longer useful from the cell state. Two inputs, x_t (the input at the current time step) and h_t-1 (the previous hidden state), are fed to the gate, multiplied by a weight matrix, and a bias is added. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If the output for a particular element of the cell state is close to 0, that piece of information is forgotten; if it is close to 1, the information is retained for future use. The equation for the forget gate is:
f_t = σ(W_f · [h_t-1, x_t] + b_f)
where:
W_f represents the weight matrix associated with the forget gate.
[h_t-1, x_t] denotes the concatenation of the previous hidden state and the current input.
b_f is the bias of the forget gate.
σ is the sigmoid activation function.
2. Input Gate: The input gate adds useful information to the cell state. First, the information is regulated with a sigmoid function, which filters the values to be remembered (similarly to the forget gate) using the inputs h_t-1 and x_t. Then, a vector of candidate values is created with the tanh function, which gives outputs between -1 and +1 and contains all the possible values from h_t-1 and x_t. Finally, the candidate vector and the regulated (sigmoid) values are multiplied element-wise to obtain the useful information to be added to the cell state. The equations for the input gate are:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)
3. Output Gate: The output gate extracts useful information from the current cell state to be presented as the output. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated with a sigmoid function, which filters the values to be remembered using the inputs h_t-1 and x_t. Finally, the tanh vector and the regulated values are multiplied element-wise and sent as the output and as the hidden-state input to the next cell. The cell state is updated using the forget and input gates, and the output gate then produces the new hidden state:
C_t = f_t * C_t-1 + i_t * Ĉ_t
o_t = σ(W_o · [h_t-1, x_t] + b_o)
h_t = o_t * tanh(C_t)
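Putting the three gates together, the sketch below runs one LSTM time step in NumPy. It is a minimal illustration of the equations above; the function name lstm_step, the toy dimensions, and the random weights are assumptions made for the example, not part of any library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step, following the gate equations above."""
    z = np.concatenate([h_prev, x_t])      # [h_t-1, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate: what to drop from the cell state
    i_t = sigmoid(W_i @ z + b_i)           # input gate: what new information to add
    c_hat = np.tanh(W_c @ z + b_c)         # candidate values in (-1, +1)
    c_t = f_t * c_prev + i_t * c_hat       # updated cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate: what to expose as output
    h_t = o_t * np.tanh(c_t)               # new hidden state / cell output
    return h_t, c_t

# Toy sizes and random weights, purely for illustration (assumptions).
hidden, n_in = 4, 3
rng = np.random.default_rng(0)
W = [rng.standard_normal((hidden, hidden + n_in)) * 0.1 for _ in range(4)]
b = [np.zeros(hidden) for _ in range(4)]
h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(hidden), np.zeros(hidden), *W, *b)
print(h_t.shape, c_t.shape)   # (4,) (4,)
```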
Advantages of LSTM
1. Long-term dependencies can be captured by LSTM networks. They have a memory cell
that is capable of long-term information storage.
2. In traditional RNNs, there is a problem of vanishing and exploding gradients when
models are trained over long sequences. By using a gating mechanism that selectively
recalls or forgets information, LSTM networks deal with this problem.
3. LSTM enables the model to capture and remember important context, even when there is a significant time gap between relevant events in the sequence. So LSTMs are used where understanding context is important, e.g., machine translation.
Disadvantages of LSTM
1. LSTMs have more parameters than simpler recurrent units because of their multiple gates, which makes them computationally expensive and slower to train, especially on long sequences.
2. Because of this larger number of parameters, they also need more memory and more training data to avoid overfitting.
GATED RECURRENT UNITS (GRU)
Gated Recurrent Units were introduced by Kyunghyun Cho et al. in 2014 as a solution
to the vanishing gradient problem. GRUs use gating mechanisms to control the flow of
information. These gates determine what information should be passed to the output and what
should continue to be retained in the network's internal state, allowing the model to better
capture dependencies for sequences of varied lengths.
GRU Architecture
Update Gate: The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future. It is crucial for the model to capture long-term dependencies and decide what to retain in memory.
Reset Gate: The reset gate decides how much of the past information to forget. It allows the model to decide how important each input is to the current state and is useful for making predictions.
These gates are vectors that contain values between 0 and 1. These values are
calculated using the sigmoid activation function. A value close to 0 means that the gate is
closed, and no information is passed through, while a value close to 1 means the gate is open,
and all information is passed through.
GRU Equations
The operations within a GRU can be described by the following set of equations:
Update Gate: z_t = σ(W_z · [h_t-1, x_t])
Reset Gate: r_t = σ(W_r · [h_t-1, x_t])
Candidate State: h̃_t = tanh(W_h · [r_t * h_t-1, x_t])
Final Hidden State: h_t = (1 - z_t) * h_t-1 + z_t * h̃_t
Here, σ represents the sigmoid function, tanh is the hyperbolic tangent function, [h_t-1, x_t] is the concatenation of the previous hidden state and the current input, and * denotes element-wise multiplication.
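As a minimal illustration of these equations, the following NumPy sketch runs a GRU over a short toy sequence. The function name gru_step, the dimensions, and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step, following the equations above."""
    zx = np.concatenate([h_prev, x_t])                             # [h_t-1, x_t]
    z_t = sigmoid(W_z @ zx)                                        # update gate
    r_t = sigmoid(W_r @ zx)                                        # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))    # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand                     # final hidden state

# Toy sizes and random weights, purely for illustration (assumptions).
hidden, n_in = 4, 3
rng = np.random.default_rng(1)
W_z, W_r, W_h = (rng.standard_normal((hidden, hidden + n_in)) * 0.1 for _ in range(3))
h = np.zeros(hidden)
for x_t in rng.standard_normal((5, n_in)):    # a 5-step toy sequence
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h)
```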
Advantages of GRUs
Solving Vanishing Gradient Problem: GRUs can maintain long-term dependencies within
the input data, which traditional RNNs often fail to capture.
Efficiency:
GRUs are computationally more efficient than Long Short-Term Memory networks
(LSTMs), another popular RNN variant, because they have fewer parameters.
Flexibility: GRUs are capable of handling sequences of varying lengths and are suitable for
applications where the sequence length might not be fixed or known in advance.
Applications of GRUs
GRUs are used in tasks where sequence data is prevalent. Some applications include:
Language Modeling:
GRUs can predict the probability of a sequence of words or the next word in a sentence,
which is useful for tasks like text generation or auto-completion.
Machine Translation: They can be used to translate text from one language to another by
capturing the context of the input sequence.
Speech Recognition: GRUs can process audio data over time to transcribe spoken language
into text.
Time Series Analysis: They are effective for predicting future values in a time series, such as
stock prices or weather forecasts.
AUTOENCODERS
Autoencoders are a specific type of feedforward neural network where the input is the
same as the output. They compress the input into a lower-dimensional code and then
reconstruct the output from this representation. The code is a compact “summary” or
“compression” of the input, also called the latent-space representation.
An autoencoder consists of 3 components: encoder, code and decoder. The encoder
compresses the input and produces the code, the decoder then reconstructs the input only using
this code.
To build an autoencoder we need 3 things: an encoding method, decoding method, and a loss
function to compare the output with the target. We will explore these in the next section.
Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple
of important properties:
Data-specific: Autoencoders are only able to meaningfully compress data similar to what
they have been trained on. Since they learn features specific for the given training data,
they are different than a standard data compression algorithm like gzip. So we can’t
expect an autoencoder trained on handwritten digits to compress landscape photos.
Lossy: The output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation. If you want lossless compression, autoencoders are not the way to go.
Unsupervised: To train an autoencoder we don’t need to do anything fancy, just throw the
raw input data at it. Autoencoders are considered an unsupervised learning technique
since they don’t need explicit labels to train on. But to be more precise they are self-
supervised because they generate their own labels from the training data.
2. Architecture
Let’s explore the details of the encoder, code and decoder. Both the encoder and decoder are
fully-connected feedforward neural networks, essentially the ANNs we covered in Part 1.
The code is a single layer of the ANN with a dimensionality of our choice. The number of nodes in the code layer (the code size) is a hyperparameter that we set before training the autoencoder.
This is a more detailed view of an autoencoder. First the input passes through the encoder, which is a fully-connected ANN, to produce the code. The decoder, which has a similar ANN structure, then produces the output using only the code. The goal is to get an output identical to the input. Note that the decoder architecture is the mirror image of the encoder. This is not a requirement, but it is typically the case. The only requirement is that the dimensionality of the input and output must be the same. Anything in the middle can be played with.
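As a concrete (if minimal) sketch of the encoder-code-decoder structure, here is a fully-connected autoencoder written in PyTorch. The 784-dimensional input (e.g. flattened 28x28 images), the 32-unit code size, the layer widths, and the random stand-in data are illustrative assumptions.

```python
import torch
from torch import nn

# Encoder -> code -> decoder; input size 784 and code size 32 are assumptions.
code_size = 32
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                        nn.Linear(128, code_size), nn.ReLU())
decoder = nn.Sequential(nn.Linear(code_size, 128), nn.ReLU(),
                        nn.Linear(128, 784), nn.Sigmoid())   # mirror of the encoder
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(256, 784)          # stand-in data; normally this would be real inputs
for epoch in range(5):
    x_hat = autoencoder(x)        # compress to the code, then reconstruct
    loss = loss_fn(x_hat, x)      # the target is the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Training against the input itself is exactly what makes the procedure self-supervised, as noted above.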
SPARSE AUTOENCODER (SAE)
An SAE can have hidden layers with more units than the input layer. Still, only a small number of hidden units are allowed to be active at once, i.e., for any given observation the network learns an encoding and decoding that relies on activating only a few of its neurons.
The intuition behind this method is the following: suppose 'A' claims to be an expert in mathematics, computer science, psychology, and classical dance; then 'A' might have only quite shallow knowledge of these subjects. However, if 'A' claims to be devoted only to mathematics, then 'A' is likely to be a master of it and can give us useful insights. (It is the same for the autoencoders we are training: fewer nodes activating while keeping the same performance guarantees that the autoencoder is learning latent representations instead of redundant information in the input data.)
Sparse Autoencoder
So, we limit the network's capacity to memorize the input data without limiting its capacity to extract features from the input. There are two main ways to impose a sparsity constraint; both involve measuring the hidden-layer activations for each training batch and adding a term to the loss function that penalizes excessive activations:
i) L1 regularization: L1 regularization adds the absolute value of the magnitude of the coefficients as a penalty term. It tends to shrink coefficients all the way to zero, whereas L2 regularization adds the squared magnitude as the penalty term, moving coefficients towards zero but never reaching it. If L(x, x̂) is the reconstruction loss and a_i are the activations of the hidden (code) layer, the L1-regularized loss is:
L(x, x̂) + λ Σ_i |a_i|
ii) KL-divergence: alternatively, sparsity can be imposed by penalizing the KL divergence between the average activation of each hidden unit over a batch and a small target sparsity value.
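Below is a minimal PyTorch sketch of the L1 approach: the reconstruction loss is augmented with a penalty on the absolute values of the code-layer activations. The penalty weight 1e-5 and the layer sizes are illustrative assumptions.

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())    # code layer with 64 units (assumption)
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
l1_weight = 1e-5                  # sparsity penalty weight lambda (assumption)

x = torch.rand(256, 784)          # stand-in data
for epoch in range(5):
    code = encoder(x)
    x_hat = decoder(code)
    # Reconstruction loss plus an L1 penalty on the code activations:
    # the penalty pushes most activations towards zero, enforcing sparsity.
    loss = nn.functional.mse_loss(x_hat, x) + l1_weight * code.abs().sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```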
DENOISING AUTOENCODER (DAE)
A denoising autoencoder is trained to reconstruct the original, clean input from a corrupted (e.g., noise-added) version of it, which forces the network to learn robust features rather than simply copying its input.
CONTRACTIVE AUTOENCODER (CAE)
A contractive autoencoder adds a penalty on how sensitive the encoder's activations are to small changes in the input (the Frobenius norm of the Jacobian of the hidden layer with respect to the input), so that similar inputs are mapped to similar codes.
DEEP AUTOENCODER
A deep autoencoder uses several fully connected hidden layers in both the encoder and the decoder, rather than a single hidden layer, allowing it to learn more complex, hierarchical compressions of the input.
Deep Autoencoder
VARIATIONAL AUTOENCODER(VAE)
A variational autoencoder encodes each input not as a single point but as a distribution over the latent space. The model is trained as follows: first, the input is encoded as a distribution over the latent space; then, a point is sampled from that distribution; next, the sampled point is decoded and the reconstruction is computed; and finally, the reconstruction error is backpropagated through the network.
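The NumPy sketch below walks through one such forward pass, purely to make the encode-sample-decode-loss sequence concrete; the linear "encoder" and "decoder" weights and the toy dimensions are illustrative assumptions rather than a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 8, 2                          # toy dimensions (assumption)
x = rng.random(x_dim)                        # a single stand-in input

# 1. "Encode" the input as a distribution over the latent space (mean, log-variance).
W_mu = rng.standard_normal((z_dim, x_dim))
W_lv = rng.standard_normal((z_dim, x_dim))
mu, log_var = W_mu @ x, W_lv @ x

# 2. Sample a point from that distribution (reparameterisation: z = mu + sigma * eps).
eps = rng.standard_normal(z_dim)
z = mu + np.exp(0.5 * log_var) * eps

# 3. Decode the sampled point and compute the reconstruction.
W_dec = rng.standard_normal((x_dim, z_dim))
x_hat = W_dec @ z

# 4. The loss that would be backpropagated: reconstruction error plus the
#    KL divergence between the encoded distribution and a unit Gaussian prior.
reconstruction_error = np.mean((x - x_hat) ** 2)
kl_divergence = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(reconstruction_error + kl_divergence)
```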
Additionally, we can add two components to improve quality:
> a pre-trained classifier used as a feature extractor on the input data, so that the reproduced images align with it, or
> a discriminator network that provides additional adversarial loss signals.
Variational Autoencoder
Comparative Study of the Structures:
Below is a small table defining the advantages and disadvantages of the structures
discussed above:
GENERATIVE ADVERSARIAL NETWORKS (GAN)
Generator Model
The Generator is trained while the Discriminator is idle. After the Discriminator has been trained on the fake data produced by the Generator, we take its predictions and use them as a training signal for the Generator, so that the Generator improves on its previous state and gets better at fooling the Discriminator.
Discriminator Model
The Discriminator is trained while the Generator is idle. In this phase, the Generator is only forward propagated (to produce fake samples) and no back-propagation is applied to it. The Discriminator is trained on real data for n epochs to check whether it can correctly predict them as real. In the same phase, it is also trained on the fake data generated by the Generator to check whether it can correctly predict them as fake.
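Below is a minimal PyTorch sketch of this alternating, two-phase training scheme for a vanilla (MLP-based) GAN. The layer sizes, latent dimension, learning rates, and random stand-in data are illustrative assumptions.

```python
import torch
from torch import nn

latent_dim, data_dim, batch = 64, 784, 32    # toy sizes (assumptions)
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim), nn.Sigmoid())      # Generator
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())             # Discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.rand(256, data_dim)        # stand-in for a real dataset
for step in range(100):
    # Phase 1: train the Discriminator while the Generator is idle.
    real = real_data[torch.randint(0, len(real_data), (batch,))]
    fake = G(torch.randn(batch, latent_dim)).detach()   # forward pass only; no gradients to G
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Phase 2: train the Generator while the Discriminator is idle,
    # using the Discriminator's predictions as the training signal.
    fake = G(torch.randn(batch, latent_dim))
    g_loss = bce(D(fake), torch.ones(batch, 1))          # Generator tries to be classified as "real"
    opt_g.zero_grad()
    g_loss.backward()                                    # only opt_g.step() updates G; D stays fixed
    opt_g.step()
```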
Different Types of GAN Models
1. Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the
Discriminator are simple multi-layer perceptrons. In a vanilla GAN, the algorithm is really simple: it tries to optimize the GAN's minimax objective using stochastic gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in
which some conditional parameters are put into place. In CGAN, an additional
parameter ‘y’ is added to the Generator for generating the corresponding data. Labels
are also fed into the input of the Discriminator to help it distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the
most successful implementations of GAN. It is composed of ConvNets in place of
multi-layer perceptrons. The ConvNets are implemented without max pooling, which is
in fact replaced by convolutional stride. Also, the layers are not fully connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible
image representation consisting of a set of band-pass images, spaced an octave apart,
plus a low-frequency residual. This approach uses multiple Generator and Discriminator networks at different levels of the Laplacian pyramid. It is mainly used because it produces very high-quality images. The image is first down-sampled at each level of the pyramid and then up-scaled again at each level in a backward pass, where it acquires detail from the Conditional GAN at that level, until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of
designing a GAN in which a deep neural network is used along with an adversarial
network in order to produce higher-resolution images. This type of GAN is particularly
useful in optimally up-scaling native low-resolution images to enhance their details
while minimizing errors in the process.
Applications of GANs
1. Generate new data from available data – generating new samples from the available samples that are not identical to the real ones.
2. Generate realistic pictures of people that have never existed.
3. GANs are not limited to images; they can generate text, articles, songs, poems, etc.
4. Generate music by using a cloned voice – if you provide a voice sample, GANs can generate a similar-sounding clone of it. In one research paper, researchers from NIT in Tokyo proposed a system that is able to generate melodies from lyrics with the help of learned relationships between notes and subjects.
5. Text-to-image generation (Object GAN and Object-Driven GAN).
6. Creation of anime characters in game development and animation production.
7. Image-to-image translation – we can translate one image into another without changing the background of the source image. For example, GANs can replace a dog with a cat.
8. Low resolution to high resolution – if you pass a low-resolution image or video, a GAN can produce a high-resolution version of it.
9. Prediction of the next frame in a video – by training a neural network on small frames of video, GANs are capable of generating or predicting the next frame of the video.
10. Interactive image generation – GANs are capable of generating images and video footage in an artistic form if they are trained on the right real dataset.
11. Speech – researchers from the College of London recently published a system called GAN-TTS that learns to generate raw audio through training on 567 corpora of speech data.