Unit V Recurrent Neural Networks
The output at each time step is computed as:
\[ Y_t = W_{hy} h_t \]
where:
Yt -> output
Why -> weight at output layer
These parameters are updated using backpropagation. However, since an RNN works on sequential data, we use a modified form of backpropagation known as Backpropagation Through Time (BPTT).
Backpropagation Through Time (BPTT)
In an RNN the network is unrolled in an ordered fashion: each hidden state is computed one at a time in a specified order, first h1, then h2, then h3, and so on. Hence we apply backpropagation through all of these hidden time states sequentially.
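As a sketch (plain NumPy, with made-up dimensions and random weights), the forward unrolling looks like this; BPTT then walks the same chain of hidden states backwards:

import numpy as np

# Toy dimensions and randomly initialised parameters (illustrative only)
input_dim, hidden_dim, output_dim, T = 4, 8, 3, 5
Wxh = np.random.randn(hidden_dim, input_dim) * 0.1   # input  -> hidden
Whh = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden -> hidden
Why = np.random.randn(output_dim, hidden_dim) * 0.1  # hidden -> output

xs = [np.random.randn(input_dim) for _ in range(T)]  # input sequence x1 .. xT
h = np.zeros(hidden_dim)                             # h0
hs, ys = [], []
for x in xs:                                         # forward unrolling: h1, h2, ..., hT
    h = np.tanh(Wxh @ x + Whh @ h)                   # hidden state update
    hs.append(h)
    ys.append(Why @ h)                               # Yt = Why * ht

# BPTT would now walk this list backwards, accumulating gradients for
# Wxh, Whh and Why at every time step before a single weight update.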
One to Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence consisting of multiple words.
Many to One
In this type of network, many inputs are fed to the network at several states and only one output is generated. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this is language translation, where we provide multiple words from one language as input and predict multiple words from the second language as output.
Recurrent Neural Network vs. Deep Neural Network:
1. Recurrent Neural Networks are used when the data is sequential and the number of inputs is not predefined; a simple Deep Neural Network has no special mechanism for sequential data, and its number of inputs is fixed.
2. Exploding and vanishing gradients are the major drawback of RNNs; these problems also occur in DNNs, but they are not the major problem with DNNs.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The tree
structure means combining child nodes and producing parent nodes. Each child-parent bond has a weight
matrix, and similar children have the same weights. The number of children for every node in the tree is
fixed to enable it to perform recursive operations and use the same weights. RvNNs are used when there's a
need to parse an entire sentence.
To calculate the parent node's representation, we add the products of the weight matrices (W_i) and the
children's representations (C_i) and apply the transformation f:
\[ h = f\left( \sum_{i=1}^{c} W_i C_i \right) \], where c is the number of children.
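A minimal NumPy sketch of this composition rule, assuming two children per node and arbitrary dimensions (the weight matrices and child vectors are random placeholders):

import numpy as np

dim = 16                                                   # size of every node representation
W = [np.random.randn(dim, dim) * 0.1 for _ in range(2)]    # one weight matrix per child slot

def compose(children):
    """Parent representation h = f(sum_i W_i C_i), with f = tanh."""
    return np.tanh(sum(Wi @ Ci for Wi, Ci in zip(W, children)))

# Bottom-up pass over a tiny tree: ((c1, c2), c3)
c1, c2, c3 = (np.random.randn(dim) for _ in range(3))
parent_left = compose([c1, c2])
root = compose([parent_left, c3])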
Recursive Neural Network Implementation
A Recursive Neural Network can be used for sentiment analysis of natural language sentences. This is one of the most important tasks of Natural Language Processing (NLP): identifying the writing tone and sentiment of the writer in a particular sentence. If a writer expresses any sentiment, basic labels describing the writing tone are recognized. We want to identify the smaller components like noun or verb phrases and order them in a syntactic hierarchy. For example, the network identifies whether the sentence showcases a constructive form of writing or negative word choices.
A variable called 'score' is calculated at each traversal of nodes, telling us which pair of phrases and words
we must combine to form the perfect syntactic tree for a given sentence.
Let us consider the representation of the phrase -- "a lot of fun" in the following sentence.
An RNN representation of this phrase would not be suitable because it considers only sequential relations.
Each state varies with the preceding words' representation. So, a subsequence that doesn't occur at the
beginning of the sentence can't be represented. With RNN, when processing the word 'fun,' the hidden state
will represent the whole sentence.
However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the
representation of the exact phrase: it lies in the hidden state of the node R_{a\ lot\ of\ fun}. Thus, syntactic parsing can be implemented naturally with the help of Recursive Neural Networks.
The main disadvantage of recursive neural networks can be the tree structure. Using the tree structure
indicates introducing a unique inductive bias to our model. The bias corresponds to the assumption that the
data follow a tree hierarchy structure. But that is not always true. Thus, the network may not be able to learn the
existing patterns.
Another disadvantage of the Recursive Neural Network is that sentence parsing can be slow and ambiguous.
Interestingly, there can be many parse trees for a single sentence.
Also, it is more time-consuming and labor-intensive to label the training data for recursive neural networks
than to construct recurrent neural networks. Manually parsing a sentence into short components is more
time-consuming and tedious than assigning a label to a sentence.
BIDIRECTIONAL RNNS:
An architecture of a neural network called a bidirectional recurrent neural network (BRNN) is made to
process sequential data. In order for the network to use information from both the past and future context
in its predictions, BRNNs process input sequences in both the forward and backward directions. This is
the main distinction between BRNNs and conventional recurrent neural networks.
A BRNN has two distinct recurrent hidden layers, one of which processes the input sequence forward and
the other of which processes it backward. After that, the results from these hidden layers are collected and
input into a prediction-making final layer. Any recurrent neural network cell, such as Long Short-Term
Memory (LSTM) or Gated Recurrent Unit, can be used to create the recurrent hidden layers.
The BRNN functions similarly to conventional recurrent neural networks in the forward direction,
updating the hidden state depending on the current input and the prior hidden state at each time step. The
backward hidden layer, on the other hand, processes the input sequence in the opposite direction, updating the hidden state based on the current input and the hidden state of the next time step.
Compared to conventional unidirectional recurrent neural networks, the accuracy of the BRNN is
improved since it can process information in both directions and account for both past and future contexts.
Because the two hidden layers can complement one another and give the final prediction layer more data,
using two distinct hidden layers also offers a type of model regularisation.
In order to update the model parameters, gradients are computed for both the forward and backward passes of the backpropagation through time technique that is typically used to train BRNNs. At inference time, the input sequence is processed by the BRNN in a single forward pass, and predictions are made based on the combined outputs of the two hidden layers.
The forward and backward hidden states are computed as:
\[ H_t^{\text{forward}} = A\left( X_t W_{XH}^{\text{forward}} + H_{t-1}^{\text{forward}} W_{HH}^{\text{forward}} + b_H^{\text{forward}} \right) \]
\[ H_t^{\text{backward}} = A\left( X_t W_{XH}^{\text{backward}} + H_{t+1}^{\text{backward}} W_{HH}^{\text{backward}} + b_H^{\text{backward}} \right) \]
where,
A = activation function,
W = weight matrix
b = bias
The hidden state at time t is given by a combination of Ht (Forward) and Ht (Backward). The output at any given hidden state is:
\[ Y_t = H_t W_{HY} + b_Y \]
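A minimal NumPy sketch of these equations, with illustrative sizes and random weights, taking the activation A to be tanh and concatenating the forward and backward states before the output layer:

import numpy as np

T, input_dim, hidden_dim, output_dim = 6, 4, 8, 3
X = [np.random.randn(input_dim) for _ in range(T)]

def run_rnn(inputs, Wxh, Whh, b):
    """Simple recurrent pass over `inputs` in the order given."""
    h, states = np.zeros(hidden_dim), []
    for x in inputs:
        h = np.tanh(Wxh @ x + Whh @ h + b)
        states.append(h)
    return states

init = lambda *shape: np.random.randn(*shape) * 0.1
Wf, Uf, bf = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim), np.zeros(hidden_dim)
Wb, Ub, bb = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim), np.zeros(hidden_dim)
Why, by = init(output_dim, 2 * hidden_dim), np.zeros(output_dim)

forward_states = run_rnn(X, Wf, Uf, bf)                # process x1 .. xT
backward_states = run_rnn(X[::-1], Wb, Ub, bb)[::-1]   # process xT .. x1, then realign

# Combine both directions at each time step and compute the output
outputs = [Why @ np.concatenate([hf, hb]) + by
           for hf, hb in zip(forward_states, backward_states)]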
The training of a BRNN is similar to the backpropagation through time (BPTT) algorithm. The BPTT algorithm works as follows:
1. Roll out the network and calculate errors at each time step.
2. Update the weights and roll up the network.
However, because the forward and backward passes in a BRNN occur simultaneously, updating the weights for the two processes could happen at the same time, which produces inaccurate results. To avoid this, the forward and backward passes are trained separately.
Applications of Bidirectional Recurrent Neural Network
Bi-RNNs have been applied to various natural language processing (NLP) tasks, including:
1. Sentiment Analysis: By taking into account both the prior and subsequent context, BRNNs can be
utilized to categorize the sentiment of a particular sentence.
2. Named Entity Recognition: By considering the context both before and after a mention, BRNNs can be utilized to identify named entities in a sentence.
3. Part-of-Speech Tagging: The classification of words in a phrase into their corresponding parts of
speech, such as nouns, verbs, adjectives, etc., can be done using BRNNs.
4. Machine Translation: BRNNs can be used in encoder-decoder models for machine translation, where
the decoder creates the target sentence and the encoder analyses the source sentence in both directions
to capture its context.
5. Speech Recognition: When the input voice signal is processed in both directions to capture the
contextual information, BRNNs can be used in automatic speech recognition systems.
Advantages of Bidirectional RNN
Context from both past and future: With the ability to process sequential input both forward and
backward, BRNNs provide a thorough grasp of the full context of a sequence. Because of this, BRNNs
are effective at tasks like sentiment analysis and speech recognition.
Enhanced accuracy: BRNNs frequently yield more precise answers since they take both historical
and upcoming data into account.
import warnings
warnings.filterwarnings('ignore')
from keras.datasets import imdb
from keras_preprocessing.sequence import pad_sequences
Model Architecture
We will implement a Bidirectional Recurrent Neural Network model using the high-level Keras API. This model will have 64 hidden units and an embedding layer of size 128. While compiling the model we provide three essential parameters:
optimizer – the method that optimizes the cost function using gradient descent.
loss – the loss function by which we monitor whether the model is improving with training.
metrics – used to evaluate the model on the training and validation data.
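A minimal sketch of the data preparation and model definition consistent with this description; the vocabulary size, sequence length, batch size and number of epochs below are assumed values rather than the original settings:

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, SimpleRNN, Dense

vocab_size = 10000    # assumed number of words kept from the IMDB vocabulary
max_len = 200         # assumed padded review length
embedding_dim = 128   # embedding size mentioned above
hidden_units = 64     # hidden units mentioned above
batch_size = 64       # assumed
epochs = 5            # assumed

# Load and pad the IMDB reviews
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

# Bidirectional RNN: embedding -> forward/backward SimpleRNN -> sigmoid output
model = Sequential([
    Embedding(vocab_size, embedding_dim),
    Bidirectional(SimpleRNN(hidden_units)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])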
model.fit(X_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(X_test, y_test))
Here we use a simple BRNN, but we can also use an LSTM with the bidirectional wrapper for better accuracy and results.
During inference the input x is passed to the encoder network, producing an approximate posterior Q(z|x) over latent variables. During training, z is sampled from Q(z|x) and then used to compute the total description length KL(Q(z|x) || P(z)) - log P(x|z), which is minimized with stochastic gradient descent.
During inference the input is read at every time-step and the result is passed to the encoder RNN. The
RNNs at the previous time-step specify where to read. The output of the encoder RNN is used to compute
the approximate posterior over the latent variables at that time-step.
Loss Function
The final canvas matrix cT is used to parametrize a model D(x | cT) of the input data. If the input is binary, the natural choice for D is a Bernoulli distribution with means given by σ(cT). The reconstruction loss L^x is defined as the negative log probability of x under D:
\[ L^x = -\log D(x \mid c_T) \]
The latent loss L^z is defined as the summed Kullback-Leibler divergence of some latent prior P(Z_t) from the approximate posterior Q(Z_t | h_t^{enc}):
\[ L^z = \sum_{t=1}^{T} \mathrm{KL}\left( Q(Z_t \mid h_t^{enc}) \,\|\, P(Z_t) \right) \]
L^z can be interpreted as the number of nats required to transmit the latent sample sequence z_1:T to the decoder from the prior, and (if x is discrete) L^x is the number of nats required for the decoder to reconstruct x given z_1:T. The total loss is therefore equivalent to the expected compression of the data by the decoder and prior.
Note that the latent loss depends upon the latent samples z_t drawn from Q(Z_t | h_t^{enc}), which depend in turn on the input x. If the latent distribution is a diagonal Gaussian with mean μ_t and standard deviation σ_t, a simple choice for P(Z_t) is a standard Gaussian with mean zero and standard deviation one, in which case the latent loss becomes:
\[ L^z = \frac{1}{2} \sum_{t=1}^{T} \left( \mu_t^2 + \sigma_t^2 - \log \sigma_t^2 \right) - \frac{T}{2} \]
The total loss L for the network is the expectation of the sum of the reconstruction and latent losses:
\[ L = \left\langle L^x + L^z \right\rangle_{z \sim Q} \]
which we optimize using a single sample of z for each stochastic gradient descent step.
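A small NumPy sketch of the diagonal-Gaussian latent loss as written above (mu and sigma stand in for the encoder outputs at each of the T time-steps):

import numpy as np

def draw_latent_loss(mu, sigma):
    """Latent loss L^z against a standard Gaussian prior, following the
    formula above. mu and sigma have shape (T, z_dim)."""
    T = mu.shape[0]
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2)) - T / 2.0

# Placeholder posterior parameters for T = 10 steps and a 5-dimensional latent
mu = np.random.randn(10, 5)
sigma = np.exp(np.random.randn(10, 5) * 0.1)
print(draw_latent_loss(mu, sigma))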
Improving Images
As Eric Jang mentions on his post, it’s easier to ask our neural network to merely “improve the image”
rather than “finish the image in one shot”. Human artists work by iterating on their canvas, and infer from
their drawing what to fix and what to paint next.
Improving an image or progressive refinement is simply breaking up our joint distribution P(C) over and
over again, resulting in a chain of latent variables C1,C2,…CT−1 to a new observed variable
distribution P(CT).
The trick is to sample from the iterative refinement distribution P(Ct|Ct−1)several times rather than straight-
up sampling from P(C).
In the DRAW model, P(Ct|Ct−1) is the same distribution for all t, so we can compactly represent this as a recurrence relation in which each step adds the newly written patch to the previous canvas, \( c_t = c_{t-1} + \mathrm{write}(h_t^{dec}) \) (if not, then we have a Markov Chain instead of a recurrent network).
But what if the encoder could choose a small crop of the image on every frame and examine each portion of the number one at a time? That would make the task easier, right?
The same logic applies to generating the number. The attention unit will determine where to draw the next portion of the number 8 (or any other digit), while the latent vector passed will determine whether the decoder generates a thicker or a thinner area.
Basically, if we think of the latent code in a VAE (variational autoencoder) as a vector that represents the entire image, the latent codes in DRAW can be thought of as vectors that represent a pen stroke. Eventually, a sequence of these vectors creates a recreation of the original image.
Figure: choosing the important portion; cropping the image and forgetting about the other parts.
We now arrive at the second part of our attention gate, the "write" attention, which has the same setup as the "read" section, except that the "write" attention gate uses the current decoder hidden state instead of the previous time-step's decoder state.
In DRAW, we take an array of Gaussian filters, each with its center spaced evenly apart from the others.
IMAGE COMPRESSION:
Introduction:
The development of and demand for multimedia products have risen in recent years, resulting in network bandwidth and storage device limitations. As a result, image compression theory is becoming more significant for reducing data redundancy and saving device space and transmission bandwidth. In computer science and information theory, data compression, also known as source coding, is the process of encoding information using fewer bits or other information-bearing units than an unencoded representation. Compression is advantageous because it saves money by reducing the use of expensive resources such as hard disk space and transmission bandwidth.
Image Compression:
Image compression is a type of data compression in which the original image is encoded with a small
number of bits. Compression focuses on reducing image size without sacrificing the uniqueness and
information included in the original. The purpose of image compression is to eliminate image redundancy
while also increasing storage capacity for well-organized communication.
Except in units D-RNN#3 and D-RNN#4, where the hidden kernels are 3×3, the spatial extents of the hidden kernels are all 1×1. Compared to the 1×1 hidden kernels, the larger hidden kernels consistently produced better compression curves.
Let x_t, c_t, and h_t denote the input, cell, and hidden states at iteration t, respectively. The new cell state c_t and the new hidden state h_t are computed using the current input x_t, the previous cell state c_{t-1}, and the previous hidden state h_{t-1}.
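For reference, the standard LSTM update has the following form (the hidden kernels mentioned above make the units here convolutional, but the gating structure is the same); this is the usual textbook formulation rather than a transcription of the original equations:
\[
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]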
2. Associative LSTM:
To enable key-value storage of data, an Associative LSTM combines an LSTM with principles from Holographic Reduced Representations (HRRs). HRRs employ a "binding" operator to achieve key-value binding between two vectors (the key and its associated content). Associative arrays are natively implemented as a byproduct; stacks, queues, or lists can also be implemented easily.
Associative LSTM extends the LSTM using holographic representations. Its new states are computed as:
The GRU formulation, which has an input x_t and a hidden state/output h_t, is as follows:
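For reference, a standard formulation of the GRU update (the units used for compression are convolutional, but the gating is the same; one common convention is shown):
\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\]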
Reconstruction Framework:
Three distinct ways for constructing the final image reconstruction from the decoder outputs are explored, in
addition to employing different types of recurrent units.
One-shot Reconstruction:
One-shot Reconstruction: As was done in Toderici et al. [2016], after each iteration of the decoder (γ = 0 in (1)) we predict the whole picture. Each iteration has access to more of the encoder's generated bits, allowing for a better reconstruction. This method is known as "one-shot reconstruction." Although we try to rebuild the original picture at each iteration, we only transfer the previous iteration's residual to the next iteration. This reduces the number of weights, and experiments show that sending both the original picture and the residual does not improve the reconstructions.
Additive Reconstruction:
In additive reconstruction, which is more widely used in traditional image coding, each iteration only tries to
reconstruct the residual from the previous iterations. The final image reconstruction is then the sum of the
outputs of all iterations (γ = 1).
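A hypothetical sketch contrasting the two reconstruction modes; encode and decode below stand in for one iteration of the recurrent encoder and decoder and are not the paper's actual functions:

def one_shot_reconstruction(x, encode, decode, iterations):
    """Each iteration predicts the whole image; only the residual is passed on."""
    residual, x_hat = x, None
    for _ in range(iterations):
        bits = encode(residual)
        x_hat = decode(bits)          # full-image prediction (gamma = 0)
        residual = x - x_hat
    return x_hat

def additive_reconstruction(x, encode, decode, iterations):
    """Each iteration predicts only the residual; the final image is the sum."""
    residual, x_hat = x, 0.0
    for _ in range(iterations):
        bits = encode(residual)
        x_hat = x_hat + decode(bits)  # accumulate residual predictions (gamma = 1)
        residual = x - x_hat
    return x_hat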
Residual Scaling:
The residual starts large in both additive and "one shot" reconstruction, and we anticipate it to diminish with
each repetition. However, operating the encoder and decoder effectively across a large range of values may
be problematic. In addition, the pace at which the residual diminishes is determined by the content. The
drop-off will be significantly more apparent in certain areas (for example, uniform regions) than in others
(e.g., highly textured patches).
The additive reconstruction architecture is enhanced to incorporate a content-dependent, iteration-dependent
gain factor to address these variances.
Entropy Encoding:
Because the network is not explicitly designed to maximise entropy in its codes, and the model does not necessarily exploit visual redundancy over a large spatial extent, the entropy of the codes generated during inference is not maximal. As is common in standard image compression codecs, adding an entropy coding layer can further boost the compression ratio.
The lossless entropy coding techniques considered here are fully convolutional, process binary codes in progressive order and, within a given encoding iteration, in raster-scan order. All of our image encoder designs produce binary codes of the form c(y, x, d) with dimensions H × W × D, where H and W are integer fractions of the picture height and width, and D is m times the number of iterations. A conventional lossless encoding system is considered, which combines a conditional probabilistic model of the current binary code c(y, x, d) with an arithmetic coder to do the actual compression. More formally, given a context T(y, x, d) which depends only on previous bits in stream order, we estimate P(c(y, x, d) | T(y, x, d)) so that the expected ideal encoded length of c(y, x, d) is the cross entropy between P(c | T) and P̂(c | T). We do not consider the small penalty incurred by using a practical arithmetic coder that requires a quantized version of P̂(c | T).
First, a 7×7 convolution is used to enlarge the LSTM state's receptive field, the receptive field being the set of codes c(i, j, ·) that can potentially affect the probability estimate of the code c(y, x, ·). To prevent dependence on codes that come later in the stream, this first convolution is a masked convolution. The line LSTM in the second stage takes the output z0 of the initial convolution as input and processes one scan line at a time. Since LSTM hidden states are produced by processing previous scan lines, the line LSTM captures both short- and long-term dependencies. The input-to-state LSTM transform is likewise a masked convolution, for the same reason. Finally, two 1×1 convolutions are added to the network to boost its capacity to memorize additional binary code patterns. Because we are attempting to predict binary codes, the Bernoulli-distribution parameter can be computed directly with a sigmoid activation in the final convolution.
Above Image: Binary recurrent network (BinaryRNN) architecture for a single iteration. The gray area
denotes the context that is available at decode time.
Description of the neural network used to compute additional line LSTM inputs for the progressive entropy coder. This allows propagation of information from previous iterations to the current one.
Evaluation Metrics
For evaluation purposes we use Multi-Scale Structural Similarity (MS-SSIM), a well-established metric for comparing lossy image compression algorithms, and the more recent Peak Signal to Noise Ratio - Human Visual System (PSNR-HVS). While PSNR-HVS already incorporates colour information, we apply MS-SSIM to each of the RGB channels separately and average the results. The MS-SSIM score ranges from 0 to 1, whereas PSNR-HVS is measured in decibels. In both cases, higher scores indicate a closer match between the test and reference images. Both metrics are computed for all models over the reconstructed images after each iteration. To rank models, we use an aggregate metric computed as the area under the rate-distortion curve (AUC).
RNNs are ideal for solving problems where the sequence is more important than the individual items
themselves.
An RNN is essentially a fully connected neural network that contains a refactoring of some of its layers into a loop. That loop is typically an iteration over the addition or concatenation of two inputs, a matrix multiplication and a non-linear function.
Among text applications, the following are tasks that RNNs perform well at:
• Sequence labelling
Other tasks that RNNs are effective at solving are time series predictions or other sequence predictions that
aren’t image or tabular based.
There have been several high-profile and controversial reports in the media about advances in text generation, in particular OpenAI's GPT-2 algorithm. In many cases the generated text is indistinguishable from text written by humans.
I found that learning how RNNs function and how to construct them and their variants has been among the most difficult topics I have had to learn. I would like to thank the Fastai team and Jeremy Howard for their courses explaining the concepts in a more understandable order, which I've followed in this article's explanation.
RNNs effectively have an internal memory that allows the previous inputs to affect the subsequent
predictions. It’s much easier to predict the next word in a sentence with more accuracy, if you know what
the previous words were.
Often with tasks well suited to RNNs, the sequence of the items is as or more important than the previous
item in the sequence.
As I'm typing the draft for this on my smartphone, the next word suggested by my phone's keyboard will be predicted by an RNN. For example, the SwiftKey keyboard software uses RNNs to predict what you are typing.
Natural Language Processing (NLP) is a sub-field of computer science and artificial intelligence dealing with processing and generating natural language data. Although there is still research outside of machine learning, most NLP is now based on language models produced by machine learning.
NLP is a good use case for RNNs and is used in the article to explain how RNNs can be constructed.
Language models
The aim of a language model is to minimise how "confused" the model is after having seen a given sequence of text (this confusion is measured by the model's perplexity).
It is only necessary to train one language model per domain, as the language model encoder can be used for
different purposes such as text generation and multiple different classifiers within that domain.
As the longest part of training is usually creating the language model encoder, reusing the encoder can save
significant training time.
Suppose we take a sequence of three words of text and a network that predicts the fourth word.
The network has three hidden layers, each of which is an affine function (for example a matrix dot product multiplication) followed by a non-linear function; the last hidden layer is then followed by an output produced by the last-layer activation function.
The input vectors representing each word in the sequence are lookups in a word embedding matrix, based on
a one hot encoded vector representing the word in the vocabulary. Note that all inputted words use the same
word embedding. In this context a word is actually a token that could represent a word or a punctuation
mark.
The output will be a one hot encoded vector representing the predicted fourth word in the sequence.
The first hidden layer takes a vector representing the first word in the sequence as an input and the output
activations serve as one of the inputs into the second hidden layer.
The second hidden layer takes the input from the activations of the first hidden layer and also an input of the
second word represented as a vector. These two inputs could be either added or concatenated together.
The third hidden layer follows the same structure as the second hidden layer, taking the activation from the
second hidden layer combined with the vector representing the third word in the sequence. Again, these
inputs are added or concatenated together.
The output from the last hidden layer goes through an activation function that produces an output
representing a word from the vocabulary, as a one hot encoded vector.
The second and third hidden layers could both use the same weight matrix, opening the opportunity of refactoring this into a loop to become recurrent.
A fully connected network for text generation/prediction. Source: Fastai deep learning course V3 by Jeremy
Howard.
Vocabulary:
The vocabulary is a vector of numbers, called tokens, where each token represents one of the unique words or punctuation symbols in our corpus.
Usually, words that don't occur at least twice in the texts making up the corpus aren't included, otherwise the vocabulary would be too large. I wonder if this could be used as a factor for detecting generated text, by looking for the presence of words not common in the given domain.
Word embedding:
A word embedding is a matrix of weights, with a row for each word/token in the vocabulary.
A matrix dot product with a one hot encoded vector outputs the row of the matrix representing the activations for that word. It is essentially a row lookup in the matrix, and it is computationally more efficient to do the lookup directly; this is called an embedding lookup.
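A tiny NumPy sketch of this equivalence, with an arbitrary vocabulary size and embedding width:

import numpy as np

vocab_size, embedding_dim = 10, 4
embedding = np.random.randn(vocab_size, embedding_dim)   # one row per token

token_id = 7
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ embedding       # matrix product with the one hot vector
via_lookup = embedding[token_id]       # direct row lookup: same result, cheaper
assert np.allclose(via_matmul, via_lookup)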
Using the vector from the word embedding helps prevent the resulting activations from being very sparse. If the input were the one hot encoded vector, which is all zeros apart from one element, the majority of the activations would also be zero, which would be difficult to train.
Refactored with a loop, an RNN:
For the network to be recurrent, a loop needs to be factored into the network's model. It makes sense to use the same embedding weight matrix for every word input. This means we can replace the second and third layers with iterations of a loop.
Each iteration of the loop takes an input of a vector representing the next word in the sequence with the
output activations from the last iteration. These inputs are added or concatenated together.
The output from the last iteration, a representation of the next word in the sentence, is put through the last-layer activation function, which converts it to a one hot encoded vector representing a word in the vocabulary.
An improved RNN retaining its output. Source: Fastai deep learning course V3 by Jeremy Howard.
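A minimal NumPy sketch of this refactored loop, with arbitrary sizes and random weights (the softmax stands in for the last-layer activation):

import numpy as np

vocab_size, embedding_dim, hidden_dim = 50, 16, 32
embedding = np.random.randn(vocab_size, embedding_dim) * 0.1  # shared for every input word
W_in = np.random.randn(hidden_dim, embedding_dim) * 0.1
W_h = np.random.randn(hidden_dim, hidden_dim) * 0.1
W_out = np.random.randn(vocab_size, hidden_dim) * 0.1

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def predict_next(token_ids):
    """Loop over the sequence, combining each word with the previous activations."""
    h = np.zeros(hidden_dim)
    for t in token_ids:
        x = embedding[t]                      # same embedding lookup for every word
        h = np.tanh(W_in @ x + W_h @ h)       # combine the new input with the running state
    return softmax(W_out @ h)                 # distribution over the next word

probs = predict_next([3, 17, 42])             # three-word context, predict the fourth word
print(probs.argmax())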
In theory the sequence of predicted text could be infinite in length, with a predicted word following the last
predicted word in the loop.
Retaining the history, a further improved RNN:
With each new batch the history of the previous batch’s sequence, the state, is often lost. Assuming the
sentences are related, this may lose important insights.
To aid the prediction when we start each batch, it is helpful to keep the history of the last batch rather than reset it. Retaining the state, and hence the context, results in a better approximation of the meaning of the words.
Note with some datasets such as one-billion-words each sentence isn’t related to the previous one, in this
case this may not help as there is no context between sentences.
Backpropagation through time:
In this context, backpropagation through time (BPTT) also refers to the sequence length (the number of unrolled time steps) used during training. If we were training on sequences of 50 words, the BPTT would be 50.
Usually the document is split into 64 equal sections. In this case the BPTT is the document length in words
divided by 64. If the document length in words is 3200 then that divided by 64 gives a BPTT of 50.
It’s beneficial to slightly randomise the BPTT value for each sequence to help improve the model.
Layered RNNs:
To get more layers of computation to be able to solve or approximate more complex tasks, the output of the
RNN could be fed into another RNN, or any number of layers of RNNs. The next section explains how this
can be done.
Extending RNNs to avoid the vanishing gradient:
As the number of RNN layers increases, the loss landscape can become impossible to train on; this is the vanishing gradient problem. To solve this problem, a Gated Recurrent Unit (GRU) or a Long Short-Term Memory (LSTM) network can be used.
LSTMs and GRUs take the current input and previous hidden state, then compute the next hidden state.
As part of this computation, the sigmoid function squashes the values of these vectors between 0 and 1, and
by multiplying them elementwise with another vector you define how much of that other vector you want to
“let through”
Long Short-Term Memory (LSTM):
An RNN has short term memory. When used in combination with Long Short-Term Memory (LSTM) gates, the network can have long term memory.
Instead of the recurring section of an RNN, an LSTM is a small neural network consisting of four neural network layers: the recurring layer from the RNN plus three networks acting as gates.
An LSTM also has a cell state alongside the hidden state. This cell state is the long term memory. Rather than just returning the hidden state at each iteration, a tuple is returned, comprised of the cell state and the hidden state.
Long Short Term Memory (LSTM) has three gates:
1. An Input gate, this controls the information input at each time step.
2. An Output gate, this controls how much information is outputted to the next cell or upward layer
3. A Forget gate, this controls how much data to lose at each time step.
Gated recurrent unit (GRU):
A gated recurrent unit is sometimes referred to as a gated recurrent network.
At the output of each iteration there is a small neural network with three neural network layers implemented, consisting of the recurring layer from the RNN, a reset gate and an update gate. The update gate acts as a combined forget and input gate. The coupling of these two gates performs a similar function to the three gates (forget, input and output) in an LSTM.
Compared to an LSTM, a GRU has a merged cell state and hidden state, whereas in an LSTM these are
separate.
Reset gate:
The reset gate takes the input activations from the last layer; these are multiplied by a reset factor between 0 and 1. The reset factor is calculated by a neural network with no hidden layer (like a logistic regression): it performs a dot product matrix multiplication between a weight matrix and the addition/concatenation of the previous hidden state and the new input, and the result is then put through the sigmoid function e^x / (1 + e^x). This can learn to do different things in different situations, for example to forget more information if there's a full stop token.
Update gate:
The update gate controls how much of the new input to take and how much of the hidden state to keep. This is a linear interpolation: (1 - Z) multiplied by the previous hidden state plus Z multiplied by the new candidate hidden state. This controls to what degree we keep information from the previous states and to what degree we use information from the new state.
The update gate is often represented as a switch in diagrams, although the gate can be in any position to
create a linear interpolation between the two hidden states.
A RNN with a GRU. Source: Fastai deep learning course V3 by Jeremy Howard.
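A compact NumPy sketch of a GRU cell built from the reset and update gates described above (sizes and weights are placeholders, and the concatenation variant is used to combine the previous hidden state with the new input):

import numpy as np

input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
Wr = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))  # reset gate
Wz = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))  # update gate
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))  # candidate state

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell(x, h_prev):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx)                                   # reset factor in (0, 1)
    z = sigmoid(Wz @ hx)                                   # update factor in (0, 1)
    h_candidate = np.tanh(Wh @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_candidate              # linear interpolation

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):                  # run over a short sequence
    h = gru_cell(x, h)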
Which of the two performs better depends entirely on the task in question; it is often worth trying both.
Text classification:
In text classification the prediction of the network is to classify which group or groups the text belongs to. A
common use is classifying if the sentiment of a piece of text is positive or negative.
If an RNN is trained to predict text from a corpus within a given domain as in the RNN explanation earlier in
this article, it is close to ideal to be re-purposed for text classification within that domain. The generation
‘head’ of the network is removed leaving the ‘backbone’ of the network. The weights within the backbone
can then be frozen. A new classification head can then be attached to the backbone and trained to predict the
required classifications.
Gradually unfreezing the weights within the layers can be a very effective method to speed up training: starting with the weights of the last two layers, then the weights of the last three layers, and finally all of the layers.
An autoencoder is an artificial neural network model that seeks to learn a compressed representation of the input.
There are various types of autoencoders suited to different scenarios; however, the most commonly used autoencoder is for feature extraction.
Combining feature extraction models with different types of models has a wide variety of applications.
Feature-extraction autoencoder models for sequence prediction problems are quite challenging, not only because the length of the input can vary, but also because machine learning algorithms and neural networks are designed to work with fixed-length inputs.
Another problem with sequence prediction is that the temporal ordering of the observations can make it challenging to extract features. Therefore, special predictive models were developed to overcome such challenges. These are called sequence-to-sequence, or seq2seq, models, and the most widely used ones we have already heard of are LSTM models.
LSTM:
Recurrent neural networks such as the LSTM, or Long Short-Term Memory network, are specially designed to support sequential data.
They are capable of learning the complex dynamics within the temporal ordering of input sequences as well
as using an internal memory to remember or use information across long input sequences.
Now, combining autoencoders with LSTMs allows us to capture the patterns of sequential data with the LSTM and then extract features with the autoencoder to recreate the input sequence.
In other words, for a given dataset of sequences, an encoder-decoder LSTM is configured to read the input
sequence, encode it and recreate it. The performance of the model is evaluated based on the model’s ability
to recreate the input sequence.
Once the model achieves a desired level of performance in recreating the sequence, the decoder part of the model can be removed, leaving just the encoder model. This encoder can then be used to encode input sequences, as sketched below.
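A minimal Keras sketch of this encoder-decoder LSTM (the toy sequence, layer sizes and epoch count are assumptions, not values from the original):

import numpy as np
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

timesteps, n_features = 9, 1                               # toy sequence shape
seq = np.linspace(0.1, 0.9, timesteps).reshape(1, timesteps, n_features)

inputs = Input(shape=(timesteps, n_features))
encoded = LSTM(64)(inputs)                                 # encoder: sequence -> fixed-size code
decoded = RepeatVector(timesteps)(encoded)                 # repeat the code for every output step
decoded = LSTM(64, return_sequences=True)(decoded)         # decoder
decoded = TimeDistributed(Dense(n_features))(decoded)      # reconstruct each time step

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(seq, seq, epochs=300, verbose=0)           # learn to recreate the input sequence

# Drop the decoder and keep only the encoder as a standalone feature extractor
encoder = Model(inputs, encoded)
features = encoder.predict(seq)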
REGULARIZED AUTOENCODER:
Introduction:
As we know, regularization and autoencoders are two different terminologies. First, we will briefly discuss
each topic, i.e., autoencoders and regularization, separately, and then we will see different ways to do
regularization of autoencoders.
Autoencoders:
Autoencoders are a variant of feed-forward neural networks that have an extra bias for calculating the error
of reconstructing the original input. After training, autoencoders are then used as a normal feed-forward
neural network for activations. This is an unsupervised form of feature extraction because the neural
network uses only the original input for learning weights rather than backpropagation, which has labels.
Deep networks can use either RBMs or autoencoders as building blocks for larger networks (a single
network rarely uses both).
Use of autoencoders:
Autoencoders are used to learn compressed representations of datasets. Commonly, we use it in reducing the
dimensions of the dataset. The output of the autoencoder is a reformation of the input data in the most
efficient form.
Similarities of autoencoders to multilayer perceptron
Autoencoders are identical to multilayer perceptron neural networks because, like multilayer perceptrons,
autoencoders have an input layer, some hidden layers, and an output layer. The key difference between a
multilayer perceptron network and an autoencoder is that the output layer of an autoencoder has the same
number of neurons as that of the input layer.
Regularization
Regularization helps with the effects of out-of-control parameters by using different methods to minimize
parameter size over time.
In mathematical notation, we see regularization represented by the coefficient lambda, controlling the trade-
off between finding a good fit and keeping the value of certain feature weights low as the exponents on
features increase.
Regularization coefficients L1 and L2 help fight overfitting by making certain weights smaller. Smaller-
valued weights lead to simpler hypotheses, which are the most generalizable. Unregularized weights with
several higher-order polynomials in the feature sets tend to overfit the training set.
As the input training set size grows, the effect of regularization decreases, and the parameters tend to
increase in magnitude. This is appropriate because an excess of features relative to training set examples
leads to overfitting in the first place. Bigger data is the ultimate regularizer.
Regularized autoencoders
There are other ways to constrain the reconstruction of an autoencoder than to impose a hidden layer of
smaller dimensions than the input. The regularized autoencoders use a loss function that helps the model to
have other properties besides copying input to the output. We can generally find two types of regularized
autoencoder: the denoising autoencoder and the sparse autoencoder.
Denoising autoencoder
We can modify the autoencoder to learn useful features by changing the inputs: we add random noise to the input and ask the network to recover the original form by removing the noise. This prevents the autoencoder from simply copying the data from input to output, because the input contains random noise; instead, we ask it to subtract the noise and produce the meaningful underlying data. This is called a denoising autoencoder.
In the above diagram, the first row contains the original images. In the second row, random noise has been added to the original images; this noise is called Gaussian noise. The autoencoder does not receive the original images as input, but it is trained in such a way that it removes the noise and regenerates the original images.
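A small sketch of the input change described above, assuming x_train holds the clean training data (random values are used here as a stand-in):

import numpy as np

x_train = np.random.rand(1000, 256)                  # stand-in for real, clean data
noise_factor = 0.3                                   # assumed noise level
x_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape), 0.0, 1.0)

# With an autoencoder built as in the code below, training would pair the
# noisy inputs with the clean targets:
# autoencoder.fit(x_noisy, x_train, epochs=20, batch_size=256)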
The only difference between implementing the denoising autoencoder and the normal autoencoder is this change in the input data; the rest of the implementation is the same for both autoencoders. Below is the code for building and training the autoencoder.
# Assumed imports for the snippet below
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

input_size = 256
hidden_size = 32
output_size = 256

l1 = Input(shape=(input_size,))
# Encoder: L1 activity regularization adds a sparsity penalty to the loss
h1 = Dense(hidden_size, activity_regularizer=regularizers.l1(10e-6), activation='relu')(l1)
# Decoder
l2 = Dense(output_size, activation='sigmoid')(h1)

autoencoder = Model(inputs=l1, outputs=l2)
autoencoder.compile(loss='mse', optimizer='adam')
In the above code, we have added L1 regularization to the hidden layer of the encoder, which adds the
penalty to the loss function.
STOCHASTIC ENCODERS AND DECODERS:
Variational Autoencoders (VAEs):
Variational Autoencoders are a type of generative model used for tasks like image generation, data
compression, and feature learning. They consist of two main components: an encoder and a decoder. The
goal of a VAE is to learn a probabilistic model of the data, which allows it to generate new data samples that
are similar to the ones it was trained on.
Stochastic Encoder:
The encoder in a VAE is responsible for mapping an input data point (e.g., an image) into a probability
distribution in a lower-dimensional latent space. A deterministic encoder would produce a single point in
this latent space for each input.
In contrast, a stochastic encoder generates a probability distribution over the latent space. This distribution is
typically represented as a Gaussian distribution parameterized by two values: a mean (μ) and a variance (σ²),
which are outputs of the encoder neural network.
The mean (μ) represents the expected position of the encoded data point in the latent space, and the variance
(σ²) represents the uncertainty or spread of the encoded data point in the latent space.
By sampling from this Gaussian distribution, you obtain different points in the latent space for the same
input data. This introduces a source of randomness and allows for the generation of diverse latent
representations for similar input data. This diversity is essential for the generative aspect of VAEs.
Sampling from this distribution during encoding is done with the "reparameterization trick," which expresses the sample as a deterministic function of μ, σ and an independent noise variable. This allows backpropagation during training and makes it possible to optimize the model using techniques like stochastic gradient descent.
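A minimal sketch of the reparameterization trick (mu and the log-variance would come from the encoder; here they are placeholders):

import numpy as np

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, so the randomness comes
    from eps and gradients can flow through mu and sigma."""
    eps = np.random.normal(size=mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

# Placeholder encoder outputs for a 2-dimensional latent space
z = sample_latent(np.array([0.5, -1.0]), np.array([-0.2, 0.1]))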
Stochastic Decoder:
The decoder in a VAE takes a point in the latent space and maps it back to the data space, attempting to
reconstruct the original input. In the case of image generation, it might generate a probability distribution
over pixel values for each location in the image.
The stochastic decoder acknowledges the uncertainty introduced by the stochastic encoder. It also produces
a probability distribution over the data space, which can be thought of as the likelihood of generating a
particular data point given a point in the latent space.
By sampling from this distribution, you can produce different reconstructions of the same input data. This is
crucial for the generative aspect of VAEs, as it allows the model to generate diverse outputs that capture the
inherent uncertainty in the data.
CONTRACTIVE ENCODERS:
The Contractive Autoencoder was proposed in 2011 by researchers at the Université de Montréal in the paper "Contractive auto-encoders: Explicit invariance during feature extraction". The idea behind it is to make the autoencoder robust to small changes in the training dataset.
To deal with this challenge, which basic autoencoders face, the authors proposed adding another penalty term to the loss function of the autoencoder. We will discuss this loss function in detail.
The Loss function:
The contractive autoencoder adds an extra term to the loss function of the autoencoder; it is given as:
\[ \mathcal{L} = \lVert x - \hat{x} \rVert^2 + \lambda \left\lVert J_f(x) \right\rVert_F^2 \]
i.e. the added penalty term is the squared Frobenius norm of the Jacobian of the encoder; the Frobenius norm is just a generalization of the Euclidean norm to matrices.
In the above penalty term, we first need to calculate the Jacobian matrix of the hidden layer. Calculating the Jacobian of the hidden layer with respect to the input is similar to a gradient calculation. Let us first write out the hidden layer:
\[ h_j = \phi\left( \sum_i x_i W_{ji} + b_j \right) \]
where \phi is the non-linearity. To get the jth hidden unit, we take the dot product of the feature vector and the corresponding weights; to differentiate it with respect to the input, we apply the chain rule.
The above method is similar to how we calculate a gradient for gradient descent, but there is one major difference: we treat h(X) as a vector-valued function, with each component as a separate output. Intuitively, if we have 64 hidden units, then we have 64 function outputs, and so we will have a gradient vector for each of those 64 hidden units.
Let diag(x) be the diagonal matrix formed from the vector x; the matrix form of the above derivative is as follows:
Now, we substitute the diag(x) expression into the above equation and simplify:
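Assuming a sigmoid non-linearity, the simplification referred to above reduces to a closed form; the sketch below computes the resulting penalty (shapes and data are placeholders):

import numpy as np

def contractive_penalty(h, W):
    """Squared Frobenius norm of the encoder Jacobian for a sigmoid hidden layer.

    h: hidden activations, shape (batch, n_hidden)
    W: encoder weight matrix, shape (n_input, n_hidden)
    For sigmoid units dh_j/dx_i = h_j * (1 - h_j) * W_ij, so the norm factorizes.
    """
    dh_sq = (h * (1.0 - h)) ** 2          # (batch, n_hidden)
    w_sq = np.sum(W ** 2, axis=0)         # (n_hidden,)
    return np.sum(dh_sq * w_sq, axis=1)   # one penalty value per example

# Tiny usage example with random data
x = np.random.rand(8, 20)
W = np.random.randn(20, 10) * 0.1
b = np.zeros(10)
h = 1.0 / (1.0 + np.exp(-(x @ W + b)))    # sigmoid encoder activations
loss_penalty = contractive_penalty(h, W).mean()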