UNIT V RECURRENT NEURAL NETWORKS


Recurrent Neural Networks: Introduction – Recursive Neural Networks – Bidirectional RNNs – Deep
Recurrent Networks – Applications: Image Generation, Image Compression, Natural Language Processing.
Complete Autoencoder, Regularized Autoencoder, Stochastic Encoders and Decoders, Contractive
Encoders.
RECURRENT NEURAL NETWORKS: INTRODUCTION
What is Recurrent Neural Network (RNN)?
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed
as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each
other, but in cases where it is required to predict the next word of a sentence, the previous words are needed,
and hence there is a need to remember them. Thus the RNN came into existence, which solved this issue with
the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which
remembers some information about a sequence. The state is also referred to as the memory state, since it
remembers the previous input to the network. The RNN uses the same parameters for each input, as it performs
the same task on all inputs or hidden layers to produce the output. This reduces the number of parameters,
unlike other neural networks.

Architecture Of Recurrent Neural Network


RNNs have the same input and output architecture as any other deep neural architecture. However,
differences arise in the way information flows from input to output. Unlike deep neural networks, where each
dense layer has its own weight matrix, in an RNN the weights are shared across the whole network. The
network calculates a hidden state h_i for every input x_i using the following formulas:

h_t = σ(U·x_t + W·h_{t-1} + b)
y_t = O(V·h_t + c)

Hence Y = f(X, h, W, U, V, b, c).

Here S is the state matrix, whose element s_i is the state of the network at timestep i. The parameters
W, U, V, b and c are shared across timesteps.


How RNN works


The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step.
Each unit has an internal state which is called the hidden state of the unit. This hidden state signifies the past
knowledge that the network currently holds at a given time step. This hidden state is updated at every time
step to signify the change in the knowledge of the network about the past. The hidden state is updated using
the following recurrence relation:

The formula for calculating the current state is:

h_t = f(h_{t-1}, x_t)

where:
h_t -> current state
h_{t-1} -> previous state
x_t -> input state

With tanh as the activation function, this becomes:

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)

where:
W_hh -> weight at the recurrent neuron
W_xh -> weight at the input neuron

The formula for calculating the output is:

y_t = W_hy · h_t

where:
y_t -> output
W_hy -> weight at the output layer

These parameters are updated using backpropagation. However, since an RNN works on sequential data,
we use an updated form of backpropagation known as backpropagation through time.
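A minimal NumPy sketch of this recurrence (the names W_xh, W_hh, W_hy and the toy dimensions are illustrative, not taken from the notes above):

import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, h0):
    # Run a simple RNN over a list of input vectors, reusing the same weights at every step.
    h = h0
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # current state from previous state and current input
        outputs.append(W_hy @ h)             # output from the current hidden state
    return outputs, h

# Toy dimensions: 4-dimensional inputs, 3-dimensional hidden state, 2-dimensional outputs.
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
xs = [rng.normal(size=4) for _ in range(5)]
ys, h_last = rnn_forward(xs, W_xh, W_hh, W_hy, h0=np.zeros(3))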
Backpropagation Through Time (BPTT)
In an RNN the network is unrolled in an ordered fashion: each variable is computed one at a time in a
specified order, first h1, then h2, then h3 and so on. Hence we apply backpropagation through all these
hidden time states sequentially.

L(θ)(loss function) depends on h3


h3 in turn depends on h2 and W
h2 in turn depends on h1 and W
h1 in turn depends on h0 and W
where h0 is a constant starting state.
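As a sketch of how the chain rule unrolls over this dependency chain (using the notation above), the gradient of the loss with respect to the shared weight W is the sum of the contributions from every time step at which W was used:

∂L(θ)/∂W = (∂L/∂h3)(∂h3/∂W) + (∂L/∂h3)(∂h3/∂h2)(∂h2/∂W) + (∂L/∂h3)(∂h3/∂h2)(∂h2/∂h1)(∂h1/∂W)

This summed, unrolled application of the chain rule is exactly what backpropagation through time computes.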
Training through RNN
1. A single time step of the input is provided to the network.
2. The current state is calculated using the current input and the previous state.
3. The current h_t becomes h_{t-1} for the next time step.
4. One can go as many time steps as the problem requires and join the information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual (target) output and the error is generated.
7. The error is then back-propagated through the network to update the weights; hence the network (RNN) is
trained using backpropagation through time.


Advantages of Recurrent Neural Network


An RNN remembers information through time, which is what makes it useful in time-series prediction;
remembering previous inputs over long spans is the idea behind Long Short-Term Memory.
Recurrent neural networks are even used with convolutional layers to extend the effective pixel
neighbourhood.
Disadvantages of Recurrent Neural Network
Gradient vanishing and exploding problems.
Training an RNN is a very difficult task.
It cannot process very long sequences if using tanh or relu as an activation function.
Applications of Recurrent Neural Network
Language Modelling and Generating Text
Speech Recognition
Machine Translation
Image Recognition, Face detection
Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
One to One
One to Many
Many to One
Many to Many
One to One
This type of RNN behaves the same as any simple neural network; it is also known as a Vanilla Neural
Network. In this network, there is only one input and one output.

One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most used examples
of this network is Image captioning where given an image we predict a sentence having Multiple words.


Many to One
In this type of network, many inputs are fed to the network at several states of the network, generating only
one output. This type of network is used in problems like sentiment analysis, where we give multiple
words as input and predict only the sentiment of the sentence as output.

Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem.
One Example of this Problem will be language translation. In language translation, we provide multiple
words from one language as input and predict multiple words from the second language as output.


Variation Of Recurrent Neural Network (RNN)


To overcome problems like the vanishing gradient and exploding gradient, several advanced versions of
RNNs have been developed. Some of these are:
Bidirectional Neural Network (BiNN)
Long Short-Term Memory (LSTM)
Bidirectional Neural Network (BiNN)
A BiNN is a variation of a recurrent neural network in which the input information flows in both directions
and the outputs of both directions are combined to produce the final output. A BiNN is useful in situations
where the context of the input is important, such as NLP tasks and time-series analysis problems.
Long Short-Term Memory (LSTM)
Long Short-Term Memory works on the read-write-forget principle: given the input information, the
network reads and writes the most useful information from the data and forgets the information that is not
important for predicting the output. To do this, three new gates are introduced into the RNN. In this way,
only the selected information is passed through the network.
Difference between RNN and Simple Neural Network
An RNN is considered the better choice over a plain deep neural network when the data is sequential. The
significant differences between RNNs and deep neural networks are listed below:

1. Weights: in a Recurrent Neural Network the same weights are shared across all time steps, whereas in a
Deep Neural Network the weights are different for each layer of the network.
2. Input handling: RNNs are used when the data is sequential and the number of inputs is not predefined,
whereas a simple deep neural network has no special mechanism for sequential data and its number of
inputs is fixed.
3. Parameters: the number of parameters in an RNN is higher than in a simple DNN.
4. Gradients: exploding and vanishing gradients are the major drawback of RNNs; these problems also occur
in DNNs, but there they are not the major problem.

RECURSIVE NEURAL NETWORKS:


Deep Learning is a subfield of machine learning and artificial intelligence (AI) that attempts to imitate how
the human brain processes data and gains certain knowledge. Neural Networks form the backbone of Deep
Learning. These are loosely modeled after the human brain and designed to accurately recognize underlying
patterns in a data set. If you want to predict the unpredictable, Deep Learning is the solution.
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed and
structured information. With RvNN, you can get a structured prediction by recursively applying the same set
of weights on structured inputs. The word recursive indicates that the neural network is applied to its output.


Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The tree
structure means combining child nodes and producing parent nodes. Each child-parent bond has a weight
matrix, and similar children have the same weights. The number of children for every node in the tree is
fixed to enable it to perform recursive operations and use the same weights. RvNNs are used when there's a
need to parse an entire sentence.
To calculate the parent node's representation, we add the products of the weight matrices (W_i) and the
children's representations (C_i) and apply the transformation f:
\[ h = f\left( \sum_{i=1}^{c} W_i C_i \right) \], where c is the number of children.
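A small NumPy sketch of this recursive composition over a binary tree (the tree encoding and names are illustrative assumptions, not from the text above):

import numpy as np

def compose(node, W, f=np.tanh):
    # Recursively build h = f(sum_i W_i @ C_i); leaves are vectors, internal nodes are (left, right) tuples.
    if isinstance(node, tuple):
        children = [compose(child, W, f) for child in node]        # child representations C_i
        return f(sum(W_i @ C_i for W_i, C_i in zip(W, children)))  # same weights at every level
    return node                                                    # a leaf is already a vector

d = 4
rng = np.random.default_rng(1)
W = [rng.normal(size=(d, d)) for _ in range(2)]   # one weight matrix per child position
leaves = [rng.normal(size=d) for _ in range(4)]
tree = ((leaves[0], leaves[1]), (leaves[2], leaves[3]))
root = compose(tree, W)                           # representation of the whole tree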
Recursive Neural Network Implementation
A Recursive Neural Network is used for sentiment analysis in natural language sentences. It is one of the
most important tasks of Natural language Processing (NLP), which identifies the writing tone and
sentiments of the writer in a particular sentence. If a writer expresses any sentiment, basic labels about the
writing tone are recognized. We want to identify the smaller components like nouns or verb phrases and
order them in a syntactic hierarchy. For example, it identifies whether the sentence showcases a constructive
form of writing or negative word choices.

A variable called 'score' is calculated at each traversal of nodes, telling us which pair of phrases and words
we must combine to form the perfect syntactic tree for a given sentence.

Let us consider the representation of the phrase -- "a lot of fun" in the following sentence.

Programming is a lot of fun.

An RNN representation of this phrase would not be suitable because it considers only sequential relations.
Each state varies with the preceding words' representation. So, a subsequence that doesn't occur at the
beginning of the sentence can't be represented. With RNN, when processing the word 'fun,' the hidden state
will represent the whole sentence.

However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the
representation of the exact phrase. It lies in the hidden state of the node R_{a\ lot\ of\ fun}. Thus, Syntactic
parsing is completely implemented with the help of Recursive Neural Networks.

Benefits of RvNNs for Natural Language Processing


The two significant advantages of Recursive Neural Networks for Natural Language Processing are their
structure and reduction in network depth.
As already explained, the tree structure of Recursive Neural Networks can manage hierarchical data like in
parsing problems.
Another benefit of RvNNs is that the trees can have logarithmic height. When there are O(n) input words, a
Recursive Neural Network can represent a binary tree with height O(log n). This lessens the distance
between the first and last input elements. Hence, long-term dependencies become shorter and easier to capture.
Disadvantages of RvNNs for Natural Language Processing


The main disadvantage of recursive neural networks is the tree structure itself. Using the tree structure
introduces a particular inductive bias into the model: the assumption that the data follow a tree hierarchy.
When that assumption does not hold, the network may not be able to learn the existing patterns.
Another disadvantage of the Recursive Neural Network is that sentence parsing can be slow and ambiguous.
Interestingly, there can be many parse trees for a single sentence.
Also, it is more time-consuming and labor-intensive to label the training data for recursive neural networks
than to construct recurrent neural networks. Manually parsing a sentence into short components is more
time-consuming and tedious than assigning a label to a sentence.
BIDIRECTIONAL RNNS:
A bidirectional recurrent neural network (BRNN) is a neural network architecture designed to process
sequential data. BRNNs process input sequences in both the forward and backward directions, so the
network can use information from both past and future context in its predictions. This is the main
distinction between BRNNs and conventional recurrent neural networks.
A BRNN has two distinct recurrent hidden layers, one of which processes the input sequence forward and
the other of which processes it backward. After that, the results from these hidden layers are collected and
input into a prediction-making final layer. Any recurrent neural network cell, such as Long Short-Term
Memory (LSTM) or Gated Recurrent Unit, can be used to create the recurrent hidden layers.
The BRNN functions similarly to conventional recurrent neural networks in the forward direction,
updating the hidden state depending on the current input and the prior hidden state at each time step. The
backward hidden layer, on the other hand, analyses the input sequence in the opposite manner, updating
the hidden state based on the current input and the hidden state of the next time step.
Compared to conventional unidirectional recurrent neural networks, the accuracy of the BRNN is
improved since it can process information in both directions and account for both past and future contexts.
Because the two hidden layers can complement one another and give the final prediction layer more data,
using two distinct hidden layers also offers a type of model regularisation.
BRNNs are typically trained with backpropagation through time, computing gradients for both the forward
and the backward pass in order to update the model parameters. At inference time, the input sequence is
processed by the BRNN in a single forward pass, and predictions are made based on the combined outputs
of the two hidden layers.

Bi-directional Recurrent Neural Network


Working of Bidirectional Recurrent Neural Network


1. Inputting a sequence: A sequence of data points, each represented as a vector with the same
dimensionality, are fed into a BRNN. The sequence might have different lengths.
2. Dual Processing: Both the forward and backward directions are used to process the data. On the basis
of the input at that step and the hidden state at step t-1, the hidden state at time step t is determined in
the forward direction. The input at step t and the hidden state at step t+1 are used to calculate the
hidden state at step t in a reverse way.
3. Computing the hidden state: A non-linear activation function on the weighted sum of the input and
previous hidden state is used to calculate the hidden state at each step. This creates a memory
mechanism that enables the network to remember data from earlier steps in the process.
4. Determining the output: A non-linear activation function is used to determine the output at each step
from the weighted sum of the hidden state and a number of output weights. This output has two
options: it can be the final output or input for another layer in the network.
5. Training: The network is trained through a supervised learning approach where the goal is to
minimize the discrepancy between the predicted output and the actual output. The network adjusts its
weights in the input-to-hidden and hidden-to-output connections during training through
backpropagation.
To calculate the output from an RNN unit, we use the following formulas:

H_t(forward) = A(X_t * W_XH(forward) + H_{t-1}(forward) * W_HH(forward) + b_H(forward))
H_t(backward) = A(X_t * W_XH(backward) + H_{t+1}(backward) * W_HH(backward) + b_H(backward))

where,
A = activation function,
W = weight matrix,
b = bias.

The hidden state at time t is given by a combination of H_t(forward) and H_t(backward). The output at any
given hidden state is:

Y_t = H_t * W_HY + b_Y
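A minimal NumPy sketch of these formulas, combining the two directions by concatenation (one common choice; the parameter packing is an illustrative assumption):

import numpy as np

def brnn_forward(xs, params_fwd, params_bwd, W_hy, b_y, A=np.tanh):
    # params_* = (W_xh, W_hh, b_h); W_hy has shape (output_dim, 2 * hidden_dim).
    def run(seq, params):
        W_xh, W_hh, b_h = params
        h, states = np.zeros(W_hh.shape[0]), []
        for x in seq:
            h = A(W_xh @ x + W_hh @ h + b_h)
            states.append(h)
        return states

    H_f = run(xs, params_fwd)              # processes x_1 .. x_T
    H_b = run(xs[::-1], params_bwd)[::-1]  # processes x_T .. x_1, then re-aligned to time order
    return [W_hy @ np.concatenate([hf, hb]) + b_y for hf, hb in zip(H_f, H_b)]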
The training of a BRNN is similar to the backpropagation-through-time (BPTT) algorithm. The BPTT
algorithm works as follows:
 Roll out the network and calculate errors at each iteration.
 Update weights and roll up the network.
However, because the forward and backward passes in a BRNN occur simultaneously, updating the weights
for the two processes could happen at the same time, which would produce inaccurate outcomes. The forward
and backward passes are therefore accommodated individually when training a BRNN.
Applications of Bidirectional Recurrent Neural Network
Bi-RNNs have been applied to various natural language processing (NLP) tasks, including:
1. Sentiment Analysis: By taking into account both the prior and subsequent context, BRNNs can be
utilized to categorize the sentiment of a particular sentence.
2. Named Entity Recognition: By considering the context both before and after a named entity,
BRNNs can be utilized to identify entities in a sentence.
3. Part-of-Speech Tagging: The classification of words in a phrase into their corresponding parts of
speech, such as nouns, verbs, adjectives, etc., can be done using BRNNs.
4. Machine Translation: BRNNs can be used in encoder-decoder models for machine translation, where
the decoder creates the target sentence and the encoder analyses the source sentence in both directions
to capture its context.
5. Speech Recognition: When the input voice signal is processed in both directions to capture the
contextual information, BRNNs can be used in automatic speech recognition systems.
Advantages of Bidirectional RNN
 Context from both past and future: With the ability to process sequential input both forward and
backward, BRNNs provide a thorough grasp of the full context of a sequence. Because of this, BRNNs
are effective at tasks like sentiment analysis and speech recognition.
 Enhanced accuracy: BRNNs frequently yield more precise answers since they take both historical
and upcoming data into account.


 Efficient handling of variable-length sequences: Compared to conventional RNNs, which require
padding to a constant length, BRNNs are better equipped to handle variable-length sequences.
 Resilience to noise and irrelevant information: BRNNs may be resistant to noise and irrelevant data
that are present in the data. This is so because both the forward and backward paths offer useful
information that supports the predictions made by the network.
 Ability to handle sequential dependencies: BRNNs can capture long-term links between sequence
pieces, making them extremely adept at handling complicated sequential dependencies.
Disadvantages of Bidirectional RNN
 Computational complexity: Given that they analyze data both forward and backward, BRNNs can
be computationally expensive due to the increased amount of calculations needed.
 Long training time: BRNNs can also take a while to train because there are many parameters to
optimize, especially when using huge datasets.
 Difficulty in parallelization: Due to the requirement for sequential processing in both the forward
and backward directions, BRNNs can be challenging to parallelize.
 Overfitting: BRNNs are prone to overfitting since they have many parameters, which can result in
overly complicated models, especially when trained on small datasets.
 Interpretability: Due to the processing of data in both forward and backward directions, BRNNs can
be tricky to interpret since it can be difficult to comprehend what the model is doing and how it is
producing predictions.
Implementation of Bi-directional Recurrent Neural Network on NLP dataset
There are multiple processes involved in training a bidirectional RNN on an NLP dataset, including data
preprocessing, model development, and model training. Here is an illustration of a Python implementation
using Keras and TensorFlow. We’ll utilize the IMDb movie review sentiment classification dataset from
Keras in this example. The data must first be loaded and preprocessed.

import warnings
warnings.filterwarnings('ignore')

from keras.datasets import imdb
from keras_preprocessing.sequence import pad_sequences

# Load the dataset and split it into training and testing sets,
# keeping only the most frequent words in the vocabulary
features = 2000   # vocabulary size
max_len = 50      # sequence length after padding/truncation

(X_train, y_train), \
    (X_test, y_test) = imdb.load_data(num_words=features)

# Pad (or truncate) every review to a fixed length
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

Model Architecture

Using the high-level Keras API, we will implement a Bidirectional Recurrent Neural Network model. The
model has 64 hidden units and an embedding size of 128. While compiling the model we provide three
essential parameters:
 optimizer – This is the method that helps to optimize the cost function by using gradient descent.
 loss – The loss function by which we monitor whether the model is improving with training or not.
 metrics – The metrics used to evaluate the model on the training and the validation data.


# Import the necessary modules from Keras
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, SimpleRNN, Dense

# Set the embedding size and the number of hidden units in the recurrent layer
embedding = 128
hidden = 64

# Create a Sequential model object
model = Sequential()
model.add(Embedding(features, embedding, input_length=max_len))
model.add(Bidirectional(SimpleRNN(hidden)))
model.add(Dense(1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
Model Training
Now that the model is compiled and the data pipeline is ready, we can move on to training the BRNN.

# Set the batch size and the number of epochs
batch_size = 32
epochs = 5

model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, y_test))
Output:

Training progress of the BRNN epoch-by-epoch

Evaluate the Model


Now as we have our model ready let’s evaluate its performance on the validation data using
different evaluation metrics. For this purpose, we will first predict the class for the validation data using
this model and then compare the output with the true labels.

loss, accuracy = model.evaluate(X_test, y_test)


print('Test accuracy:', accuracy)
Output :

Validation Accuracy of the model on the holdout dataset


Here we are using a simple bidirectional RNN, but we can also use an LSTM with the bidirectional wrapper
for better accuracy, as shown below.
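A minimal sketch of that LSTM variant, reusing the variables (features, embedding, hidden, max_len) defined above; the improvement in accuracy is expected but not guaranteed:

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

lstm_model = Sequential()
lstm_model.add(Embedding(features, embedding, input_length=max_len))
lstm_model.add(Bidirectional(LSTM(hidden)))   # LSTM cell instead of SimpleRNN
lstm_model.add(Dense(1, activation='sigmoid'))
lstm_model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])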

DEEP RECURRENT NETWORKS:


APPLICATIONS
IMAGE GENERATION:
Deep Recurrent Attentive Writer (DRAW) is a neural network architecture for image generation. DRAW
networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a
sequential variational auto-encoding framework that allows for the iterative construction of complex images.
The system substantially improves on the state of the art for generative models on MNIST, and, when
trained on the Street View House Numbers dataset, it generates images that cannot be distinguished from
real data with the naked eye.
The core of the DRAW architecture is a pair of recurrent neural networks: an encoder network that
compresses the real images presented during training, and a decoder that reconstitutes images after receiving
codes. The combined system is trained end-to-end with stochastic gradient descent, where the loss function
is a variational upper bound on the log-likelihood of the data.
DRAW Architecture
The DRAW network is similar to other variational auto-encoders: it contains an encoder network that
determines a distribution over latent codes capturing salient information about the input data, and a decoder
network that receives samples from the code distribution and uses them to condition its own distribution
over images.

3 Key Differences Between DRAW and Auto-Encoders


1. Both the encoder and the decoder are recurrent networks in DRAW.
2. The decoder's outputs are added successively to the distribution used to generate the data, instead of the
distribution being generated in a single step.
3. A dynamically updated attention mechanism is used to restrict both the input region observed by the
encoder and the output region modified by the decoder. In simple terms, the network decides at each
time-step "where to read" and "where to write" as well as "what to write".


Left: Conventional Variational Auto-Encoder.


During generation, a sample z is drawn from a prior P(z) and passed through the feedforward decoder
network to compute the probability of the input P(x|z) given the sample.

During inference the input x is passed to the encoder network, producing an approximate
posterior Q(z|x) over latent variables. During training, z is sampled from Q(z|x) and then used to compute
the total description length KL(Q(z|x) || P(z)) − log P(x|z), which is minimized with stochastic gradient
descent.

Right: DRAW Network.


At each time-step a sample z_t from the prior P(z_t) is passed to the recurrent decoder network, which then
modifies part of the canvas matrix. The final canvas matrix cT is used to compute P(x|z_1:T).

During inference the input is read at every time-step and the result is passed to the encoder RNN. The
RNNs at the previous time-step specify where to read. The output of the encoder RNN is used to compute
the approximate posterior over the latent variables at that time-step.

Loss Function
The final canvas matrix c_T is used to parametrize a model D(x | c_T) of the input data. If the input is binary,
the natural choice for D is a Bernoulli distribution with means given by σ(c_T). The reconstruction loss L^x is
defined as the negative log probability of x under D:

L^x = −log D(x | c_T)

The latent loss L^z for a sequence of latent distributions Q(Z_t | h_enc_t) is defined as the summed
Kullback-Leibler divergence of the latent prior P(Z_t) from Q(Z_t | h_enc_t):

L^z = Σ_{t=1..T} KL( Q(Z_t | h_enc_t) || P(Z_t) )

Note that this loss depends upon the latent samples z_t drawn from Q(Z_t | h_enc_t), which depend in turn on
the input x. If the latent distribution is a diagonal Gaussian with means μ_t and standard deviations σ_t, a
simple choice for P(Z_t) is a standard Gaussian with mean zero and standard deviation one, in which case the
latent loss becomes:

L^z = (1/2) Σ_{t=1..T} ( μ_t² + σ_t² − log σ_t² ) − T/2

The total loss L for the network is the expectation of the sum of the reconstruction and latent losses,

L = ⟨ L^x + L^z ⟩,

which we optimize using a single sample of z for each stochastic gradient descent step.

L^z can be interpreted as the number of nats required to transmit the latent sample sequence z_1:T to the
decoder from the prior, and (if x is discrete) L^x is the number of nats required for the decoder to reconstruct
x given z_1:T. The total loss is therefore equivalent to the expected compression of the data by the decoder
and prior.
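A small NumPy sketch of the latent loss term for diagonal Gaussian posteriors (the function name and array layout are illustrative assumptions):

import numpy as np

def draw_latent_loss(mu, sigma):
    # Summed KL( Q(Z_t | h_enc_t) || N(0, I) ) over T time steps; mu, sigma have shape (T, latent_dim).
    # With d-dimensional latents the constant term generalizes from T/2 to T*d/2.
    T, d = mu.shape
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2)) - T * d / 2.0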

Improving Images
As Eric Jang mentions on his post, it’s easier to ask our neural network to merely “improve the image”
rather than “finish the image in one shot”. Human artists work by iterating on their canvas, and infer from
their drawing what to fix and what to paint next.

Improving an image or progressive refinement is simply breaking up our joint distribution P(C) over and
over again, resulting in a chain of latent variables C1,C2,…CT−1 to a new observed variable
distribution P(CT).

The trick is to sample from the iterative refinement distribution P(Ct|Ct−1) several times rather than
sampling from P(C) directly.

In the DRAW model, P(Ct|Ct−1) is the same distribution for all t, so we can compactly represent this as the
following recurrence relation (if not, then we have a Markov Chain instead of a recurrent network)


The DRAW model applied


Imagine you are trying to encode an image of the number 8. Every handwritten number is drawn differently:
while some portions may be thicker, others may be longer. Without attention, the encoder would be forced to
try and capture all these small variations at the same time.

But what if the encoder could choose a small crop of the image on every frame and examine each
portion of the number one at a time? That would make the work easier, right?

The same logic applies for generating the number. The attention unit will determine where to draw the next
portion of the number 8 -or any other-, while the latent vector passed will determine if the decoder generates
a thicker area or a thinner area.

Basically, if we think of the latent code in a VAE (variational auto-encoder) as a vector that represents the
entire image, the latent codes in DRAW can be thought of as vectors that represent a pen stroke. Eventually,
a sequence of these vectors creates a recreation of the original image.

Ok, But how does it really work?


In a recurrent VAE model, the encoder takes in the entire input image at every single timestep. In DRAW
we place an attention gate between the two, so the encoder only receives the portion of our image that the
network deems important at that timestep. That first attention gate is called the "read" attention.

The "read" attention consists of two parts:

1. Choosing the important portion of the image.
2. Cropping the image and forgetting about the other parts.

Choosing the important portion of an image


In order to determine which part of the image to focus on, we need some sort of observation to make a
decision based on. In DRAW, we use the previous timestep’s decoder hidden state. Using a simple fully-
connected layer, we can map the hidden state to three parameters that represent our square crop: center x,
center y, and the scale.


Cropping the image


Now, instead of encoding the entire image, we crop it so only a small part of the image is encoded. This
code is then passed through the system, and decoded back into a small patch.

We now arrive at the second part of our attention gate, the "write" attention, which has the same setup as
the "read" section, except that the "write" attention gate uses the current decoder hidden state instead of the
previous timestep's decoder state.


Wait…is that really done in practice?


While describing the attention mechanism as a crop makes sense intuitively, in practice a different method
is used. The model structure described above is still accurate, but an array of Gaussian filters is used instead
of a crop: in DRAW, we take an array of Gaussian filters, each with their centers spaced apart evenly.

IMAGE COMPRESSION:
Introduction:
The development of and demand for multimedia products have risen in recent years, putting pressure on
network bandwidth and storage devices. As a result, image compression theory is becoming more
significant for reducing data redundancy and saving device space and transmission bandwidth. In
computer science and information theory, data compression, also known as source coding, is the process of
encoding information using fewer bits or other information-bearing units than an unencoded representation.
Compression is advantageous because it saves money by reducing the use of expensive resources such as
hard disk space and transmission bandwidth.
Image Compression:
Image compression is a type of data compression in which the original image is encoded with a small
number of bits. Compression focuses on reducing image size without sacrificing the uniqueness and
information included in the original. The purpose of image compression is to eliminate image redundancy
while also increasing storage capacity for well-organized communication.

There are two major types of image compression techniques:


1. Lossless Compression:
This method is commonly used for archive purposes. Lossless compression is suggested for images with
geometric forms that are relatively basic. It's used for medical, technical, and clip art graphics, among other
things.
2. Lossy Compression:
Lossy compression algorithms are very useful for compressing natural pictures such as photographs, where a
small loss in reliability is sufficient to achieve a significant decrease in bit rate. This is the most common
method for compressing multimedia data, and some data may be lost as a result.
RNN Based Encoder and Decoders:
Two convolutional kernels are employed in the recurrent units used to produce the encoder and decoder: one
on the input vector that enters into the unit from the previous layer, and the other on the state vector that
gives the recurring character of the unit. The "hidden convolution" and "hidden kernel" refer to the
convolution on the state vector and its kernel, respectively.
The input-vector convolutional kernel's spatial extent and output depth are shown in the figure. All
convolutional kernels allow full depth mixing. For example, the unit D-RNN#3 operates on the input
vector with 256 convolutional kernels, each with a 3×3 spatial extent and full input-depth extent (128 in this
case, because D-RNN#2's depth is reduced by a factor of four as it passes through the "Depth-to-Space"
unit).


Except in units D-RNN#3 and D-RNN#4, where the hidden kernels are 3×3, the spatial extents of the hidden
kernels are all 1×1. The larger hidden kernels consistently produced better compression curves than the 1×1
hidden kernels.

Types of Recurrent Units:


1. LSTM:
The long short-term memory (LSTM) architecture is a deep learning architecture that employs a recurrent
neural network (RNN). LSTM features feedback connections, unlike standard feedforward neural networks.
It is capable of handling not just single data points (such as images), but also whole data streams (such as
speech or video). Tasks like unsegmented, linked handwriting identification, speech recognition, and
anomaly detection in network traffic or IDSs (intrusion detection systems) can all benefit from LSTM.

Let x_t, c_t, and h_t represent the input, cell, and hidden states at iteration t, respectively. The new cell state
c_t and the new hidden state h_t are computed from the current input x_t, the prior cell state c_{t-1}, and the
previous hidden state h_{t-1}.

2. Associative LSTM:
To enable key-value storage of data, an Associative LSTM combines an LSTM with principles from
Holographic Reduced Representations (HRRs). To achieve key-value binding between two vectors, HRRs
employ a "binding" operator (the key and its associated content). Associative arrays are natively
implemented as a byproduct. Stacks, Queues, or Lists can also be easily implemented

Associative LSTM extends the LSTM using holographic representations, and its new states are computed
accordingly. Associative LSTMs were effective only when employed in the decoder.

3. Gated Recurrent Units:


Kyunghyun Cho et al. established gated recurrent units (GRUs) as a gating technique in recurrent neural
networks in 2014. The GRU is similar to a long short-term memory (LSTM) with a forget gate, but it lacks
an output gate, hence it has fewer parameters. GRU's performance on polyphonic music modelling, speech
signal modelling, and natural language processing tasks was found to be comparable to that of LSTM in
some cases. On some smaller and less frequent datasets, GRUs have been found to perform better.


The GRU, with input x_t and hidden state/output h_t, is defined by the following updates (a standard
formulation, where σ denotes the sigmoid function and ⊙ element-wise multiplication):

z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

Reconstruction Framework:
Three distinct ways for constructing the final image reconstruction from the decoder outputs are explored, in
addition to employing different types of recurrent units.

One-shot Reconstruction:
As was done in Toderici et al. [2016], after each iteration of the decoder (γ = 0 in (1)) we predict the whole
picture. Each cycle has access to more of the bits produced by the encoder, allowing for a better
reconstruction. This method is known as "one-shot reconstruction." Although we try to rebuild the original
picture at each iteration, we only pass the previous iteration's residual to the next iteration. This reduces the
number of weights, and experiments show that passing both the original picture and the residual does not
improve the reconstructions.
Additive Reconstruction:
In additive reconstruction, which is more widely used in traditional image coding, each iteration only tries to
reconstruct the residual from the previous iterations. The final image reconstruction is then the sum of the
outputs of all iterations (γ = 1).
Residual Scaling:
The residual starts large in both additive and "one shot" reconstruction, and we anticipate it to diminish with
each repetition. However, operating the encoder and decoder effectively across a large range of values may
be problematic. In addition, the pace at which the residual diminishes is determined by the content. The
drop-off will be significantly more apparent in certain areas (for example, uniform regions) than in others
(e.g., highly textured patches).
The additive reconstruction architecture is enhanced to incorporate a content-dependent, iteration-dependent
gain factor to address these variances.

The following is a diagram of the extension that is used:


Entropy Encoding:
Because the network is not explicitly designed to maximize entropy in its codes, and the model does not
necessarily exploit visual redundancy over a large spatial extent, the entropy of the codes generated during
inference is not maximal. As is usual in standard image compression codecs, adding an entropy coding layer
can boost the compression ratio further.

The lossless entropy coding techniques considered here are fully convolutional, process binary codes in
progressive order, and process a given encoding iteration in raster-scan order. All of our image encoder
designs produce binary codes of the form c(y, x, d) with dimensions H × W × D, where H and W are integer
fractions of the picture height and width, and D is m times the number of iterations. We consider a
conventional lossless encoding system that combines a conditional probabilistic model of the current binary
code c(y, x, d) with an arithmetic coder to do the actual compression. More formally, given a context
T(y, x, d) which depends only on previous bits in stream order, we estimate P(c(y, x, d) | T(y, x, d)) so that the
expected ideal encoded length of c(y, x, d) is the cross entropy between P(c | T) and P̂(c | T). We do not
consider the small penalty incurred by using a practical arithmetic coder that requires a quantized version of
P̂(c | T).

Single Iteration Entropy Coder:


We employ the PixelRNN architecture for single-layer binary code compression and a related design
(BinaryRNN) for multi-layer binary code compression. In this architecture, the estimate of the conditional
code probabilities for line y depends directly on certain neighbouring codes, and also indirectly on the
previously decoded binary codes via a line of states S of size 1 × W × k that captures both short- and
long-term dependencies. The state line summarizes all of the previous lines. We use k = 64 in practice.
The probabilities are estimated and the state is updated line by line using a 1×3 LSTM convolution. The
end-to-end probability estimation has three stages.

First, a 7×7 convolution is used to enlarge the LSTM state's receptive field, the receptive field being the set
of codes c(i, j, ·) that can potentially affect the probability estimate of codes c(y, x, ·).

To avoid dependence on future codes, this first convolution is a masked convolution. In the second stage, the
line LSTM takes the output z0 of this initial convolution as input and processes one scan line at a time. Since
LSTM hidden states are produced by processing the preceding scan lines, the line LSTM captures both
short- and long-term dependencies. The input-to-state LSTM transform is likewise a masked convolution for
the same reason. Finally, two 1×1 convolutions are added to the network to increase its capacity to memorize
additional binary code patterns. Since we are attempting to predict binary codes, the Bernoulli-distribution
parameter can be readily computed using a sigmoid activation in the final convolution.


Above Image: Binary recurrent network (BinaryRNN) architecture for a single iteration. The gray area
denotes the context that is available at decode time.

Progressive Entropy Encoding:


To cope with multiple iterations, a simple approach would be to replicate the single-iteration entropy coder,
giving each iteration its own line LSTM. However, such a structure would fail to exploit the redundancy
that exists between iterations. Instead, we can add information from the previous layers to the data that is
provided to the line LSTM of iteration #k.

Description of neural network used to compute additional line LSTM inputs for progressive entropy coder.
This allows propagation of information from the previous iterations to the current.

Evaluation Metrics
For evaluation purposes we use Multi-Scale Structural Similarity (MS-SSIM) a well-established metric for
comparing lossy image compression algorithms, and the more recent Peak Signal to Noise Ratio - Human
Visual System (PSNR-HVS). While PSNR-HVS already has colour information, we apply MS-SSIM to
each of the RGB channels separately and average the results. The MS-SSIM score ranges from 0 to 1,
whereas the PSNR-HVS is recorded in decibels. Higher scores indicate a closer match between the test and
reference images in both cases. After each iteration, both metrics are computed for all models across the
reconstructed images. To rank models, we use an aggregate metric: the area under the rate-distortion curve
(AUC).
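A hedged sketch of the per-channel MS-SSIM computation described above, using TensorFlow's built-in multiscale SSIM (assumes batches of RGB images large enough for the default five scales):

import tensorflow as tf

def ms_ssim_rgb(reference, reconstruction, max_val=255.0):
    # reference, reconstruction: float tensors of shape (batch, H, W, 3).
    scores = []
    for c in range(3):  # apply MS-SSIM to each RGB channel separately, then average
        ref_c = reference[..., c:c + 1]
        rec_c = reconstruction[..., c:c + 1]
        scores.append(tf.image.ssim_multiscale(ref_c, rec_c, max_val=max_val))
    return tf.reduce_mean(tf.stack(scores), axis=0)  # per-image score in [0, 1]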

NATURAL LANGUAGE PROCESSING:

RNNs are ideal for solving problems where the sequence is more important than the individual items
themselves.

An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into
a loop. That loop is typically an iteration over the addition or concatenation of two inputs, a matrix
multiplication and a non-linear function.

Among the text usages, the following tasks are among those RNNs perform well at:


• Sequence labelling

• Natural Language Processing (NLP) text classification

• Natural Language Processing (NLP) text generation

Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that
aren’t image or tabular based.

There have been several high-profile and controversial reports in the media about advances in text
generation, in particular OpenAI's GPT-2 model. In many cases the generated text is indistinguishable from
text written by humans.

Learning how RNNs function, and how to construct them and their variants, has been among the most
difficult topics I have had to learn. I would like to thank the Fastai team and Jeremy Howard for their
courses explaining the concepts in a more understandable order, which I have followed in this article's
explanation.

RNNs effectively have an internal memory that allows the previous inputs to affect the subsequent
predictions. It’s much easier to predict the next word in a sentence with more accuracy, if you know what
the previous words were.

Often with tasks well suited to RNNs, the sequence of the items is as or more important than the previous
item in the sequence.

As I'm typing the draft for this on my smartphone, the next word suggested by my phone's keyboard will
be predicted by an RNN. For example, the SwiftKey keyboard software uses RNNs to predict what you are
typing.

Natural Language Processing:

Natural Language Processing (NLP) is a sub-field of computer science and artificial intelligence, dealing
with processing and generating natural language data. Although there is still research outside of machine
learning, most NLP is now based on language models produced by machine learning.

NLP is a good use case for RNNs and is used in the article to explain how RNNs can be constructed.

Language models

The aim of a language model is to minimise how confused the model is after seeing a given sequence of
text.

It is only necessary to train one language model per domain, as the language model encoder can be used for
different purposes such as text generation and multiple different classifiers within that domain.

As the longest part of training is usually creating the language model encoder, reusing the encoder can save
significant training time.

Comparing an RNN to a fully connected neural network:

Take a sequence of three words of text and a network that predicts the fourth word.

The network has three hidden layers, each of which is an affine function (for example a matrix
multiplication) followed by a non-linear function; the last hidden layer is followed by an output produced by
the final activation function.


The input vectors representing each word in the sequence are lookups in a word embedding matrix, based on
a one hot encoded vector representing the word in the vocabulary. Note that all inputted words use the same
word embedding. In this context a word is actually a token that could represent a word or a punctuation
mark.

The output will be a one hot encoded vector representing the predicted fourth word in the sequence.

The first hidden layer takes a vector representing the first word in the sequence as an input and the output
activations serve as one of the inputs into the second hidden layer.

The second hidden layer takes the input from the activations of the first hidden layer and also an input of the
second word represented as a vector. These two inputs could be either added or concatenated together.

The third hidden layer follows the same structure as the second hidden layer, taking the activation from the
second hidden layer combined with the vector representing the third word in the sequence. Again, these
inputs are added or concatenated together.

The output from the last hidden layer goes through an activation function that produces an output
representing a word from the vocabulary, as a one hot encoded vector.

The second and third hidden layers could both use the same weight matrix, opening up the opportunity to
refactor this into a loop and make it recurrent.

A fully connected network for text generation/prediction. Source: Fastai deep learning course V3 by Jeremy
Howard.
Vocabulary:
The vocabulary is a vector of numbers, called tokens, where each token represents one of the unique words
or punctuation symbols in our corpus.
Words that don't occur at least twice in the texts making up the corpus usually aren't included, otherwise
the vocabulary would be too large. I wonder if this could be used as a factor for detecting generated text,
by looking for the presence of words not common in the given domain.
Word embedding:
A word embedding is a matrix of weights, with a row for each word/token in the vocabulary.
A matrix dot product with a one hot encoded vector outputs the row of the matrix representing that word's
activations. It is essentially a row lookup in the matrix, and it is computationally more efficient to do the
lookup directly; this is called an embedding lookup.


Using the vector from the word embedding helps prevent the resulting activations from being very sparse.
If the input were the one hot encoded vector itself, which is all zeros apart from one element, the majority of
the activations would also be zero, and this would be difficult to train.
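A tiny NumPy sketch showing that the matrix product with a one hot vector is just a row lookup (the sizes are illustrative):

import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(2)
E = rng.normal(size=(vocab_size, embedding_dim))   # word embedding: one row per token

token_id = 7
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

dense_from_matmul = one_hot @ E   # dot product with the one hot vector...
dense_from_lookup = E[token_id]   # ...is the same as a direct row lookup, which is far cheaper
assert np.allclose(dense_from_matmul, dense_from_lookup)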
Refactored with a loop, an RNN:
For the network to be recurrent, a loop needs to be factored into the network's model. It makes sense to use
the same embedding weight matrix for every word input. This means we can replace the second and third
layers with iterations within a loop.
Each iteration of the loop takes as input a vector representing the next word in the sequence together with
the output activations from the previous iteration. These inputs are added or concatenated together.
The output from the last iteration is a representation of the next word in the sentence, which is put through
the last layer's activation function to convert it into a one hot encoded vector representing a word in the
vocabulary.

A basic RNN. Source: Fastai deep learning course V3 by Jeremy Howard.


This allows the network to predict a word at the end of a sequence of any arbitrary length.
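A short sketch of that refactored loop (the function and parameter names are illustrative; the combination step here is an addition, but concatenation would work equally well):

import numpy as np

def rnn_predict_next(token_ids, E, W_ih, W_hh, W_ho, f=np.tanh):
    # Loop over a token sequence of arbitrary length, reusing the same weights at every iteration,
    # and return scores over the vocabulary for the next token.
    h = np.zeros(W_hh.shape[0])
    for t in token_ids:
        x = E[t]                    # embedding lookup for the current token
        h = f(W_ih @ x + W_hh @ h)  # combine the input with the previous iteration's activations
    return W_ho @ h                 # one score per word in the vocabulary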
Retaining the output throughout the loop, an improved RNN:
Once at the end of the sequence of words, the predicted output of the next word could be stored, appended to
an array, to be used as additional information in the next iteration. Each iteration then has access to the
previous predictions.
For a given number of inputs there are the same number of outputs created.


An improved RNN retaining its output. Source: Fastai deep learning course V3 by Jeremy Howard.
In theory the sequence of predicted text could be infinite in length, with a predicted word following the last
predicted word in the loop.
Retaining the history, a further improved RNN:
With each new batch, the history of the previous batch's sequence (the state) is often lost. If the sentences
are related, this may lose important insights.
To aid the prediction when we start each batch, it is helpful to keep the history of the last batch rather than
reset it. This retains the state and hence the context, resulting in a better approximation of the meaning of
the words.
Note that with some datasets, such as one-billion-words, each sentence isn't related to the previous one; in
this case keeping the state may not help, as there is no context between sentences.
Backpropagation through time:
In this context, backpropagation through time (BPTT) refers to the sequence length used during training. If
we were training on sequences of 50 words, the BPTT would be 50.
Usually the document is split into 64 equal sections, in which case the BPTT is the document length in
words divided by 64. If the document length is 3,200 words, dividing by 64 gives a BPTT of 50.
It's beneficial to slightly randomise the BPTT value for each sequence to help improve the model.
Layered RNNs:
To get more layers of computation to be able to solve or approximate more complex tasks, the output of the
RNN could be fed into another RNN, or any number of layers of RNNs. The next section explains how this
can be done.
Extending RNNs to avoid the vanishing gradient:
As the number of RNN layers increases, the loss landscape becomes harder to navigate and the network can
become impossible to train; this is the vanishing gradient problem. To solve this problem a Gated Recurrent
Unit (GRU) or a Long Short-Term Memory (LSTM) network can be used.
LSTMs and GRUs take the current input and previous hidden state, then compute the next hidden state.


As part of this computation, the sigmoid function squashes the values of these vectors between 0 and 1, and
by multiplying them elementwise with another vector you define how much of that other vector you want to
“let through”
Long Short-Term Memory (LSTM):
An RNN has short-term memory. When used in combination with Long Short-Term Memory (LSTM) gates,
the network can have long-term memory.
In place of the recurring section of an RNN, an LSTM is a small neural network consisting of four neural
network layers: the recurring layer from the RNN plus three networks acting as gates.
An LSTM also has a cell state alongside the hidden state; this cell state is the long-term memory. Rather than
just returning the hidden state at each iteration, a tuple comprising the cell state and the hidden state is
returned.
Long Short-Term Memory (LSTM) has three gates (a minimal sketch of a single LSTM step follows the list):
1. An Input gate, this controls the information input at each time step.
2. An Output gate, this controls how much information is outputted to the next cell or upward layer
3. A Forget gate, this controls how much data to lose at each time step.
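A minimal NumPy sketch of one LSTM step with these three gates (a standard textbook formulation; the per-gate parameter dictionaries are an illustrative assumption):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate: what to write
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate: what to erase
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate: what to expose
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell content
    c = f * c_prev + i * g                               # cell state: the long-term memory
    h = o * np.tanh(c)                                   # hidden state: the short-term memory
    return h, c                                          # returned as a (hidden, cell) tuple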
Gated recurrent unit (GRU):
A gated recurrent unit is sometimes referred to as a gated recurrent network.
At the output of each iteration there is a small neural network with three neural network layers: the
recurring layer from the RNN, a reset gate and an update gate. The update gate acts as a combined forget
and input gate. The coupling of these two gates performs a similar function to the three gates (forget, input
and output) in an LSTM.
Compared to an LSTM, a GRU has a merged cell state and hidden state, whereas in an LSTM these are
separate.
Reset gate:
The reset gate takes the input activations from the last layer; these are multiplied by a reset factor between 0
and 1. The reset factor is calculated by a neural network with no hidden layer (like a logistic regression): a
matrix multiplication between a weight matrix and the addition/concatenation of the previous hidden state
and the new input, all put through the sigmoid function e^x / (1 + e^x).
This gate can learn to do different things in different situations, for example to forget more information if
there's a full-stop token.
Update gate:
The update gate controls how much of the new input to take and how much of the previous hidden state to
keep. This is a linear interpolation: (1 − Z) multiplied by the previous hidden state plus Z multiplied by the
new candidate hidden state. It controls to what degree we keep information from the previous states and to
what degree we use information from the new state.
The update gate is often represented as a switch in diagrams, although the gate can take any position to
create a linear interpolation between the two hidden states.
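A minimal NumPy sketch of one GRU step matching the description above (the per-gate parameter dictionaries are an illustrative assumption):

import numpy as np

def sigmoid(z):   # the same squashing function described above
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W, U, b):
    z = sigmoid(W['z'] @ x + U['z'] @ h_prev + b['z'])             # update gate
    r = sigmoid(W['r'] @ x + U['r'] @ h_prev + b['r'])             # reset gate
    h_cand = np.tanh(W['h'] @ x + U['h'] @ (r * h_prev) + b['h'])  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand                         # (1 - Z)*previous + Z*new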


A RNN with a GRU. Source: Fastai deep learning course V3 by Jeremy Howard.

Which is better, a GRU or an LSTM:

This depends entirely on the task in question, it is often worth trying both to see which can perform better.

Text classification:

In text classification the prediction of the network is to classify which group or groups the text belongs to. A
common use is classifying if the sentiment of a piece of text is positive or negative.

If an RNN is trained to predict text from a corpus within a given domain, as in the RNN explanation earlier in this article, it is close to ideal for re-purposing for text classification within that domain. The generation 'head' of the network is removed, leaving the 'backbone' of the network. The weights within the backbone can then be frozen, and a new classification head can be attached to the backbone and trained to predict the required classifications.

Gradually unfreezing the weights within the layers can be a very effective way to speed up training: start by unfreezing the weights of the last two layers, then the last three layers, and finally unfreeze all of the layers' weights.
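As a rough illustration, the sketch below (assuming tf.keras; the saved model file 'lm_backbone.h5' and the two-class sentiment head are hypothetical, not from the original text) shows removing the generation head, freezing the backbone, attaching a classification head and later unfreezing the last layers:

from tensorflow.keras import layers, models

# Load the pre-trained language-model body (generation head already removed).
backbone = models.load_model('lm_backbone.h5')   # hypothetical saved backbone
backbone.trainable = False                       # freeze the pre-trained weights

# Attach a new classification head, e.g. positive/negative sentiment.
# Assumes the backbone outputs a fixed-length feature vector per document.
outputs = layers.Dense(2, activation='softmax')(backbone.output)
classifier = models.Model(backbone.input, outputs)
classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])

# Gradual unfreezing: after some training, unfreeze the last layers first
# (recompile before continuing training), then progressively more layers.
for layer in backbone.layers[-2:]:
    layer.trainable = True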

COMPLETE AUTOENCODER:

What are AutoEncoders?

An autoencoder is an artificial neural network model that seeks to learn a compressed representation of the input and to reconstruct the input from that representation.

There are various types of autoencoders suited to different scenarios; however, the most common use of autoencoders is feature extraction.

Combining feature extraction models with different types of models has a wide variety of applications.

Using feature-extraction autoencoders for sequence prediction problems is quite challenging, not least because the length of the input can vary, while machine learning algorithms and neural networks are designed to work with fixed-length inputs.

Another problem with sequence prediction is that the temporal ordering of the observations can make it challenging to extract features. Special predictive models were therefore developed to overcome such challenges. These are called sequence-to-sequence (seq2seq) models, and the most widely used ones we have already heard of are the LSTM models.
LSTM:
Recurrent neural networks such as the LSTM, or Long Short-Term Memory network, are specially designed to support sequential data.
They are capable of learning the complex dynamics within the temporal ordering of input sequences, as well as using an internal memory to remember or use information across long input sequences.
Now, combining autoencoders with LSTMs allows us to capture the patterns of sequential data with the LSTM and then extract features with the autoencoder to recreate the input sequence.
In other words, for a given dataset of sequences, an encoder-decoder LSTM is configured to read the input
sequence, encode it and recreate it. The performance of the model is evaluated based on the model’s ability
to recreate the input sequence.
Once the model achieves a desired level of performance in recreating the sequence, the decoder part of the model can be removed, leaving just the encoder model. This encoder model can then be used to encode input sequences into fixed-length vectors.

The workflow of the composite encoder will be something like this.
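For illustration, a minimal tf.keras sketch of such an encoder-decoder LSTM that reads, encodes and recreates an input sequence (the number of time steps, features and units are assumptions, not values from the text):

from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

timesteps, n_features, latent_dim = 9, 1, 100    # illustrative sizes

inputs = Input(shape=(timesteps, n_features))
encoded = LSTM(latent_dim)(inputs)                      # encoder: sequence -> fixed-length vector
decoded = RepeatVector(timesteps)(encoded)              # repeat the encoding for every output step
decoded = LSTM(latent_dim, return_sequences=True)(decoded)
decoded = TimeDistributed(Dense(n_features))(decoded)   # reconstruct one value per time step

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# After training, keep only the encoder to produce fixed-length encodings:
encoder = Model(inputs, encoded)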

REGULARIZED AUTOENCODER:
Introduction:
As we know, regularization and autoencoders are two different terminologies. First, we will briefly discuss
each topic, i.e., autoencoders and regularization, separately, and then we will see different ways to do
regularization of autoencoders.
Autoencoders:
Autoencoders are a variant of feed-forward neural networks with an extra term in the objective for the error of reconstructing the original input. After training, the encoder can be used like a normal feed-forward neural network to produce activations (features). This is an unsupervised form of feature extraction because the network learns its weights from the original input alone, without the labels that supervised training requires. Deep networks can use either RBMs or autoencoders as building blocks for larger networks (a single network rarely uses both).
Use of autoencoders:
Autoencoders are used to learn compressed representations of datasets; commonly, we use them to reduce the dimensionality of a dataset. The output of the autoencoder is a reconstruction of the input data produced from this compressed, most efficient form.
Similarities of autoencoders to multilayer perceptron
Autoencoders resemble multilayer perceptron neural networks because, like multilayer perceptrons, autoencoders have an input layer, some hidden layers, and an output layer. The key difference between a multilayer perceptron network and an autoencoder is that the output layer of an autoencoder has the same number of neurons as the input layer.
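A minimal tf.keras sketch of this layout (the 784/64 layer sizes are illustrative, e.g. a flattened 28x28 image), where the output layer has the same number of neurons as the input layer:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_inputs = 784                                            # e.g. a flattened 28x28 image
inputs = Input(shape=(n_inputs,))
hidden = Dense(64, activation='relu')(inputs)             # compressed representation
outputs = Dense(n_inputs, activation='sigmoid')(hidden)   # same size as the input layer

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')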

Regularization
Regularization helps with the effects of out-of-control parameters by using different methods to minimize
parameter size over time.
In mathematical notation, we see regularization represented by the coefficient lambda, controlling the trade-
off between finding a good fit and keeping the value of certain feature weights low as the exponents on
features increase.
L1 and L2 regularization penalties help fight overfitting by making certain weights smaller. Smaller-valued weights lead to simpler hypotheses, which are more generalizable. Unregularized weights with several higher-order polynomials in the feature set tend to overfit the training set.
As the input training set size grows, the effect of regularization decreases, and the parameters tend to
increase in magnitude. This is appropriate because an excess of features relative to training set examples
leads to overfitting in the first place. Bigger data is the ultimate regularizer.
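As a generic illustration of the idea (not a formula taken from the text), an L2-regularized loss simply adds a lambda-weighted penalty on the weights to the data-fitting term:

import numpy as np

def regularized_mse(y_true, y_pred, weights, lam=0.01):
    data_loss = np.mean((y_true - y_pred) ** 2)   # how well the model fits the data
    l2_penalty = np.sum(weights ** 2)             # grows when weights become large
    return data_loss + lam * l2_penalty           # lambda trades off fit vs. small weights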
Regularized autoencoders
There are other ways to constrain the reconstruction of an autoencoder than to impose a hidden layer of
smaller dimensions than the input. The regularized autoencoders use a loss function that helps the model to
have other properties besides copying input to the output. We can generally find two types of regularized
autoencoder: the denoising autoencoder and the sparse autoencoder.
Denoising autoencoder
One way we can modify the autoencoder to learn useful features is by changing the inputs: we add random noise to the input and train the network to recover the original by removing the noise. This prevents the autoencoder from simply copying the data from input to output, because the input now contains random noise; instead we ask it to subtract the noise and produce the meaningful underlying data. This is called a denoising autoencoder.

In the above diagram, the first row contains the original images. In the second row, random noise has been added to the original images; this noise is called Gaussian noise. The autoencoder does not receive the original images as input, but it is trained in such a way that it removes the noise and regenerates the original images.

The only difference between implementing the denoising autoencoder and a normal autoencoder is a change in the input data; the rest of the implementation is the same for both autoencoders. Below is the difference in how the two are trained.

Training a simple autoencoder:

autoencoder.fit(x_train, x_train)

Training a denoising autoencoder:

autoencoder.fit(x_train_noisy, x_train)
Simple as that, everything else is exactly the same. The input to the autoencoder is the noisy image, and the
expected target is the original noise-free one.
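For completeness, one common way (an assumption, not shown in the original text) to build x_train_noisy is to add Gaussian noise to x_train, here assumed to hold pixel values scaled to [0, 1], and clip the result back into range:

import numpy as np

noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)   # keep pixel values inside [0, 1]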
Sparse autoencoders
Another way of regularizing the autoencoder is by using a sparsity constraint. In this form of regularization, only a fraction of the nodes are allowed to be active during forward and backward propagation; these nodes have non-zero values and are called active nodes.
To do so, we add a penalty term to the loss function that keeps only a fraction of the nodes active. This forces the autoencoder to represent each input as a combination of a small number of nodes and demands that it discover interesting structure in the data. The method is effective even if the code size is large, because only a small subset of the nodes will be active at any time.
For example, we can add a regularization term to the loss function; doing this makes the autoencoder learn a sparse representation of the data.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers

input_size = 256
hidden_size = 32
output_size = 256

l1 = Input(shape=(input_size,))
# Encoder: the L1 activity regularizer penalizes large activations,
# pushing most hidden units towards zero (sparsity)
h1 = Dense(hidden_size, activity_regularizer=regularizers.l1(10e-6), activation='relu')(l1)
# Decoder
l2 = Dense(output_size, activation='sigmoid')(h1)

autoencoder = Model(inputs=l1, outputs=l2)
autoencoder.compile(loss='mse', optimizer='adam')
In the above code, we have added L1 regularization to the hidden layer of the encoder, which adds the
penalty to the loss function.
STOCHASTIC ENCODERS AND DECODERS:
Variational Autoencoders (VAEs):
Variational Autoencoders are a type of generative model used for tasks like image generation, data
compression, and feature learning. They consist of two main components: an encoder and a decoder. The
goal of a VAE is to learn a probabilistic model of the data, which allows it to generate new data samples that
are similar to the ones it was trained on.
Stochastic Encoder:
The encoder in a VAE is responsible for mapping an input data point (e.g., an image) into a probability
distribution in a lower-dimensional latent space. A deterministic encoder would produce a single point in
this latent space for each input.
In contrast, a stochastic encoder generates a probability distribution over the latent space. This distribution is
typically represented as a Gaussian distribution parameterized by two values: a mean (μ) and a variance (σ²),
which are outputs of the encoder neural network.
The mean (μ) represents the expected position of the encoded data point in the latent space, and the variance
(σ²) represents the uncertainty or spread of the encoded data point in the latent space.
By sampling from this Gaussian distribution, you obtain different points in the latent space for the same
input data. This introduces a source of randomness and allows for the generation of diverse latent
representations for similar input data. This diversity is essential for the generative aspect of VAEs.
The process of sampling from this distribution during encoding is known as the "reparameterization trick." It
allows for backpropagation during training and makes it possible to optimize the model using techniques
like stochastic gradient descent.
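A minimal sketch of the reparameterization trick (variable names are illustrative): rather than sampling z directly, we sample noise epsilon and shift/scale it with the encoder's outputs, so gradients can flow through the mean and variance:

import numpy as np

def sample_latent(mu, log_var):
    # mu and log_var are the encoder's outputs for one input.
    epsilon = np.random.normal(size=mu.shape)   # randomness independent of the network
    sigma = np.exp(0.5 * log_var)               # standard deviation from the log-variance
    return mu + sigma * epsilon                 # z ~ N(mu, sigma^2), differentiable in mu and sigma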
Stochastic Decoder:
The decoder in a VAE takes a point in the latent space and maps it back to the data space, attempting to
reconstruct the original input. In the case of image generation, it might generate a probability distribution
over pixel values for each location in the image.

The stochastic decoder acknowledges the uncertainty introduced by the stochastic encoder. It also produces
a probability distribution over the data space, which can be thought of as the likelihood of generating a
particular data point given a point in the latent space.
By sampling from this distribution, you can produce different reconstructions of the same input data. This is
crucial for the generative aspect of VAEs, as it allows the model to generate diverse outputs that capture the
inherent uncertainty in the data.
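As a small sketch (assuming a Bernoulli likelihood over pixel values, a common but not universal choice for images), sampling from the decoder's output distribution yields different reconstructions for the same latent point:

import numpy as np

def sample_reconstruction(pixel_probs):
    # pixel_probs: decoder output, one Bernoulli probability per pixel.
    return (np.random.uniform(size=pixel_probs.shape) < pixel_probs).astype(np.float32)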
CONTRACTIVE ENCODERS:
The contractive autoencoder was proposed by researchers at the Université de Montréal in 2011, in the paper "Contractive auto-encoders: Explicit invariance during feature extraction". The idea behind it is to make the autoencoder robust to small changes in the training dataset.
To provide this robustness, which basic autoencoders lack, the authors proposed adding another penalty term to the loss function of the autoencoder. We will discuss this loss function in detail.
The Loss function:
The contractive autoencoder adds an extra term to the loss function of the autoencoder, given as:

L(x, x̂) = ||x - x̂||² + λ ||J_h(x)||²_F,   where ||J_h(x)||²_F = Σ_ij (∂h_j(x) / ∂x_i)²
i.e., the penalty term is the squared Frobenius norm of the Jacobian of the encoder; the Frobenius norm is just a generalization of the Euclidean norm to matrices.
To evaluate this penalty term, we first need to calculate the Jacobian matrix of the hidden layer; calculating the Jacobian of the hidden layer with respect to the input is similar to a gradient calculation. Let us first write the hidden layer:

Z_j = Σ_i W_ji x_i + b_j,   h_j = φ(Z_j)
where φ is the non-linearity. The j-th hidden unit is obtained from the dot product of the input feature vector with the corresponding weight row; to differentiate it with respect to the i-th input, we apply the chain rule:

∂h_j/∂x_i = φ'(Z_j) W_ji,   which for a sigmoid non-linearity becomes h_j (1 - h_j) W_ji
The above calculation is similar to how we compute a gradient, but with one major difference: we treat h(X) as a vector-valued function, with each hidden unit as a separate output. Intuitively, if we have 64 hidden units, then we have 64 function outputs, and so we obtain a gradient vector for each of those 64 hidden units.
Letting diag(·) denote a diagonal matrix, the Jacobian from the above derivative can be written as:

J_h(x) = diag(h ⊙ (1 - h)) W,   where ⊙ denotes element-wise multiplication
Now, substituting this diagonal form into the Frobenius norm and simplifying gives:

||J_h(x)||²_F = Σ_j (h_j (1 - h_j))² Σ_i W_ji²
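A hedged NumPy sketch of this penalty (assuming a single sigmoid hidden layer h = sigmoid(W x + b); the weighting lam is an illustrative hyperparameter), computed exactly as in the simplified expression above:

import numpy as np

def contractive_penalty(x, W, b, lam=1e-4):
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # sigmoid hidden layer h_j
    dh = h * (1.0 - h)                       # phi'(Z_j) for a sigmoid
    # ||J_h(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
    frob_sq = np.sum((dh ** 2) * np.sum(W ** 2, axis=1))
    return lam * frob_sq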
Relationship with Sparse Autoencoder


In a sparse autoencoder, our goal is to have the majority of the components of the representation close to 0; for this to happen, they must lie in the left saturated part of the sigmoid function, where the corresponding sigmoid value is close to 0 and the first derivative is very small, which in turn leads to very small entries in the Jacobian matrix. This results in a highly contractive mapping in the sparse autoencoder, even though contraction is not the explicit goal of the sparse autoencoder.
Relationship with Denoising Autoencoder
The idea behind the denoising autoencoder is to increase the robustness of the encoder to small changes in the training data, which is quite similar to the motivation of the contractive autoencoder. However, there are some differences:
CAEs encourage robustness of representation f(x), whereas DAEs encourage robustness of reconstruction,
which only partially increases the robustness of representation.
A DAE increases robustness stochastically, by training the model to reconstruct inputs from corrupted versions, whereas a CAE increases robustness analytically, by penalizing the Jacobian (the matrix of first derivatives) of the encoder.
