DL Mod4

The document provides an overview of computational graphs and recurrent neural networks (RNNs), explaining their structure, training processes, advantages, and disadvantages. It details the encoder-decoder architecture for sequence learning, the challenges of training RNNs, and introduces variations like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). Additionally, it outlines various applications of RNNs, including language modeling, speech recognition, and time series forecasting.


MODULE 4

COMPUTATIONAL GRAPHS
• A computational graph is a directed graph where the nodes
correspond to operations or variables.
• Variables can feed their value into operations, and operations can
feed their output into other operations.
• Computational graphs are a type of graph that can be used to
represent mathematical expressions. This is similar to descriptive
language in the case of deep learning models, providing a functional
description of the required computation.
In general, the computational graph is a directed graph that is used for
expressing and evaluating mathematical expressions.
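
As a small illustration (my own example, not from the slides), the expression e = (a + b) * (b + 1) can be represented as a graph whose variable nodes feed the operation nodes + and *, and evaluated by traversing that graph:

# A minimal computational-graph sketch: nodes are either variables or
# operations, and evaluating the graph computes each node after its inputs.
class Node:
    def __init__(self, op=None, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def evaluate(self):
        if self.op is None:                  # variable node: just return its value
            return self.value
        vals = [n.evaluate() for n in self.inputs]
        if self.op == "add":
            return vals[0] + vals[1]
        if self.op == "mul":
            return vals[0] * vals[1]
        raise ValueError(f"unknown op: {self.op}")

# e = (a + b) * (b + 1) with a = 2, b = 3  ->  (2 + 3) * (3 + 1) = 20
a, b, one = Node(value=2.0), Node(value=3.0), Node(value=1.0)
c = Node(op="add", inputs=(a, b))
d = Node(op="add", inputs=(b, one))
e = Node(op="mul", inputs=(c, d))
print(e.evaluate())   # 20.0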
RECURRENT NEURAL NETWORK
• Consider the computational graph that computes the training loss of a recurrent
network mapping an input sequence of x values to a corresponding
sequence of output o values.
• A loss L measures how far each o is from the corresponding training target
y.
• When using softmax outputs, we assume o is the unnormalized log
probabilities.
• The loss L internally computes yˆ = softmax(o) and compares this to the
target y.
• The RNN has input-to-hidden connections parametrized by a weight matrix
U, hidden-to-hidden recurrent connections parametrized by a weight
matrix W, and hidden-to-output connections parametrized by a weight
matrix V.
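
For reference, one standard way to write this forward pass with the U, W, V parametrization above (the bias vectors b and c are an added assumption here, not shown in the slides) is:

\[
a^{(t)} = b + W h^{(t-1)} + U x^{(t)}, \qquad
h^{(t)} = \tanh\!\big(a^{(t)}\big), \qquad
o^{(t)} = c + V h^{(t)}, \qquad
\hat{y}^{(t)} = \operatorname{softmax}\!\big(o^{(t)}\big),
\]

and the total loss is \(L = \sum_t L^{(t)}\), where \(L^{(t)}\) compares \(\hat{y}^{(t)}\) with the target \(y^{(t)}\).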
RECURRENT NEURAL NETWORK
• Recurrent Neural Network (RNN) is a type of neural network where the output from the
previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each other,
but in cases when it is required to predict the next word of a sentence, the previous
words are required and hence there is a need to remember the previous words.
• Thus RNN came into existence, which solved this issue with the help of a Hidden Layer.
• The main and most important feature of RNN is its Hidden state, which remembers
some information about a sequence.
• The state is also referred to as Memory State since it remembers the previous input to
the network.
• It uses the same parameters for each input, since it performs the same task on all the inputs
and hidden layers to produce the output. This parameter sharing reduces the number of
parameters compared with other neural networks.
• The Recurrent Neural Network consists of multiple fixed
activation function units, one for each time step.
• Each unit has an internal state which is called the hidden
state of the unit.
• This hidden state signifies the past knowledge that the
network currently holds at a given time step.
• This hidden state is updated at every time step to signify
the change in the knowledge of the network about the past.
• The hidden state is updated using the recurrence relation h_t = f(h_(t-1), x_t).
• The formula for calculating the current state (with a tanh activation) is
h_t = tanh(Wxh x_t + Whh h_(t-1)), and the output is y_t = Why h_t.
• Training through RNN
1.A single-time step of the input is provided to the network.
2.Then calculate its current state using a set of current input and the
previous state.
3.The current ht becomes ht-1 for the next time step.
4.One can go as many time steps according to the problem and join the
information from all the previous states.
5.Once all the time steps are completed the final current state is used
to calculate the output.
6.The output is then compared to the actual output i.e the target
output and the error is generated.
7.The error is then back-propagated to the network to update the
weights and hence the network (RNN) is trained
using Backpropagation through time.
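
To make steps 1-5 concrete, here is a minimal NumPy sketch of the forward pass through time (a sketch only: the names Wxh, Whh, Why and the tanh activation follow the formulas above, the sizes are illustrative, and in practice the backpropagation-through-time step would be handled by an autodiff framework):

import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, h0):
    """Run a vanilla RNN over a sequence of input vectors xs.

    Mirrors steps 1-5: the current state h_t becomes h_(t-1) for the next
    step, and the final state is used to compute the output."""
    h = h0
    hs = []
    for x in xs:                                   # one time step of input at a time
        h = np.tanh(Wxh @ x + Whh @ h)             # h_t = tanh(Wxh x_t + Whh h_(t-1))
        hs.append(h)
    y = Why @ h                                    # output from the final hidden state
    return hs, y

# Illustrative sizes: 4-dimensional inputs, 2-dimensional hidden state, 4 outputs.
rng = np.random.default_rng(0)
Wxh, Whh, Why = rng.normal(size=(2, 4)), rng.normal(size=(2, 2)), rng.normal(size=(4, 2))
xs = [rng.normal(size=4) for _ in range(5)]
hs, y = rnn_forward(xs, Wxh, Whh, Why, h0=np.zeros(2))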
• Advantages of Recurrent Neural Network
1.An RNN remembers each and every piece of information
through time. This makes it useful in time series prediction,
since the output can depend on previous inputs as well as the
current one. Variants designed to retain information over long
spans are called Long Short-Term Memory networks.
2.Recurrent neural networks are even used with
convolutional layers to extend the effective pixel
neighborhood.
• Disadvantages of Recurrent Neural Network
1.Gradient vanishing and exploding problems.
2.Training an RNN is a very difficult task.
3.It cannot process very long sequences if using tanh or relu
as an activation function.
• Applications of Recurrent Neural Network
1.Language Modelling and Generating Text
2.Speech Recognition
3.Machine Translation
4.Image Recognition, Face detection
5.Time series Forecasting
• Types Of RNN
• There are four types of RNNs based on the number of
inputs and outputs in the network.
1.One to One
2.One to Many
3.Many to One
4.Many to Many
• One to One
• This type of RNN behaves the same as a simple
neural network; it is also known as a Vanilla Neural
Network. In this network, there is only one
input and one output.
• One To Many
• In this type of RNN, there is one input and many outputs associated with it.
One of the most used examples of this network is image captioning, where
given an image we predict a sentence of multiple words.

• Many to One
• In this type of network, many inputs are fed to the network at several states
of the network, generating only one output. This type of network is used in
problems like sentiment analysis, where we give multiple words as input
and predict only the sentiment of the sentence as output.
• Many to Many
• In this type of neural network, there are multiple inputs and multiple
outputs corresponding to a problem. One example of this is
language translation. In language translation, we provide multiple words
from one language as input and predict multiple words in the second
language as output.
• Variation Of Recurrent Neural Network (RNN)
• To overcome problems like vanishing and exploding gradients, several
advanced versions of RNNs have been developed. Some of these are:
1.Bidirectional Neural Network (BiNN)
2.Long Short-Term Memory (LSTM)
• Bidirectional Neural Network (BiNN)
• A BiNN is a variation of a recurrent neural network in which
the input information flows in both directions (forward and
backward) and the outputs of both directions are combined to
produce the final output. A BiNN is useful in situations where the
context of the input is important, such as NLP tasks and time-series
analysis problems.
• Long Short-Term Memory (LSTM)
• Long Short-Term Memory works on a read-write-forget
principle: given the input, the network reads and writes
the most useful information from the data and forgets the
information that is not important for predicting the output. To do
this, three gates (input, forget, and output) are introduced into the
RNN cell. In this way, only the selected information is passed
through the network.
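
For reference, a standard formulation of the three gates and the cell update (not spelled out in the slides; \(\sigma\) is the logistic sigmoid, \(\odot\) is element-wise multiplication, and the weight and bias names follow the usual conventions) is:

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]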
Difference between RNN and DNN
Encoder-Decoder Sequence to Sequence
Architecture
• In deep learning, many complex problems can be
solved by constructing better neural network
architectures.
• The RNN (Recurrent Neural Network) and its variants
are very useful in sequence-to-sequence learning.
• The RNN variant LSTM (Long Short-Term Memory) is
the most used cell in seq2seq learning tasks.
• Encoder-Decoder Model
• There are three main blocks in the encoder-decoder
model,
• Encoder
• Hidden Vector
• Decoder
• The Encoder will convert the input sequence into a
single-dimensional vector (hidden vector). The decoder
will convert the hidden vector into the output sequence.
• Encoder
• Multiple RNN cells can be stacked together to form the
encoder. The RNN reads each input sequentially.
• For every timestep (each input) t, the hidden state (hidden
vector) h is updated according to the input at that
timestep X[i].
• After all the inputs are read by the encoder model, the final
hidden state of the model represents the context/summary
of the whole input sequence.
• Example: Consider the input sequence “I am a Student” to
be encoded. There will be a total of 4 timesteps (4 tokens) for
the encoder model. At each time step, the hidden state h
will be updated using the previous hidden state and the
current input.

• At the first timestep t1, the previous hidden state h0
is taken to be zero or randomly initialized.
• So the first RNN cell will update the current hidden
state using the first input and h0.
• Each cell outputs two things: an updated hidden state
and an output for that stage.
• The outputs at each stage are discarded and only the
hidden states are propagated to the next step.
• The hidden state at each step is computed using the formula
h_t = f(W^(hh) h_(t-1) + W^(hx) x_t), where f is the activation
function (for example tanh).

• At the second timestep t2, the hidden state h1 and the
second input X[2] are given as inputs, and the
hidden state h2 is computed from both of them.
This continues for all four steps of the example.
• The encoder is a stack of several recurrent units (LSTM or GRU cells
for better performance) where each accepts a single
element of the input sequence, collects information for
that element, and propagates it forward.
• Encoder Vector
• This is the final hidden state produced from the
encoder part of the model. It is calculated using the
formula above.
• This vector aims to encapsulate the information for all
input elements in order to help the decoder make
accurate predictions.
• It acts as the initial hidden state of the decoder part of
the model.
• Decoder
• The Decoder generates the output sequence by predicting
the next output Yt given the hidden state ht.
• The input for the decoder is the final hidden vector
obtained at the end of encoder model.
• Each step has three inputs: the hidden vector from the
previous step ht-1, the previous output yt-1, and the
original (encoder) hidden vector h.
• At the first step, the encoder vector, the START symbol, and an
empty hidden state ht-1 are given as input; the outputs
obtained are y1 and the updated hidden state h1 (the
information corresponding to the emitted output is removed
from the hidden vector).
• The second step takes the updated hidden state h1,
the previous output y1, and the original hidden vector
h as current inputs, and produces the hidden vector h2 and
output y2.
• The output produced at each timestep of the decoder is
part of the actual output sequence. The model keeps predicting
outputs until the END symbol occurs.
• Any hidden state h_t in the decoder is computed using the formula h_t = f(W^(hh) h_(t-1)).
• Output Layer
• We use Softmax activation function at the output layer.
• It is used to produce the probability distribution from a
vector of values with the target class of high probability.
• The output y_t at time step t is computed using the
formula y_t = softmax(W^(S) h_t).

• We calculate the outputs using the hidden state at the
current time step together with the respective weight
W(S).
• Softmax is used to create a probability vector that will
help us determine the final output
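
A minimal NumPy sketch of this encoder-decoder flow, assuming a plain RNN cell (real systems use LSTM/GRU cells and learned embeddings; for simplicity the context vector here only initializes the decoder hidden state rather than being fed at every step, greedy argmax decoding is used, and all weight names are illustrative):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def encode(xs, Whh, Wxh, h0):
    """Encoder: read the inputs one by one; only the hidden state is kept."""
    h = h0
    for x in xs:
        h = np.tanh(Whh @ h + Wxh @ x)    # h_t = f(W_hh h_(t-1) + W_xh x_t)
    return h                              # final hidden state = context vector

def decode(h, y_start, Whh, Wyh, Ws, end_idx, max_len=20):
    """Decoder: start from the context vector and the START symbol,
    emit one token per step until END (or max_len) is produced."""
    y = y_start
    outputs = []
    for _ in range(max_len):
        h = np.tanh(Whh @ h + Wyh @ y)    # update hidden state from h_(t-1) and y_(t-1)
        probs = softmax(Ws @ h)           # y_t = softmax(W_S h_t)
        idx = int(np.argmax(probs))
        outputs.append(idx)
        if idx == end_idx:
            break
        y = np.eye(len(probs))[idx]       # feed the prediction back as the next input
    return outputs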
• Applications
• This architecture has many applications, such as
• Google’s Machine Translation
• Question answering chatbots
• Speech recognition
• Time Series Application etc.,
Deep Recurrent Network
Ways of Making RNN Deep
RECURSIVE NEURAL NETWORKS
Challenges of Training Recurrent Network
• Two Issues of Standard RNNs
• 1. Vanishing Gradient Problem
• Recurrent neural networks enable you to model time-
dependent and sequential data problems, like stock
market prediction and text
generation. However, RNNs are tough to train because of
gradient problems.
• RNNs suffer from the problem of vanishing gradients.
The gradients carry the information used to update the RNN,
and when the gradient becomes too small, the
parameter updates become insignificant. This
makes learning from long data sequences difficult.
• 2. Exploding Gradient Problem
• While training a neural network, if the slope tends to
grow exponentially rather than decaying, this is often
called an Exploding Gradient. This problem arises when
large error gradients accumulate, leading to very large
updates to the neural network model weights during
the training process.
• Long training time, poor performance, and bad
accuracy are the key issues in gradient problems.
• A gradient in the context of a neural network refers to
the gradient of the loss function with respect to the
weights of the network.
• This gradient is calculated using backpropagation. The
goal here is to find the optimal weight for each
connection that would minimise the overall loss of the
network.
• Recurrent neural networks are very hard to train because of the fact that
the time-layered network is a very deep network, especially if the input
sequence is long.
• In other words, the depth of the temporal layering is input-dependent.
• As in all deep networks, the loss function has highly varying sensitivities
(i.e., loss gradients) with respect to the parameters in different temporal layers.
• Furthermore, even though the loss gradients vary widely across layers, the
same parameter matrices are shared by all temporal layers.
• This combination of varying sensitivity and shared parameters across
layers can lead to some unusually unstable effects.
• The primary challenge associated with a recurrent neural network is
that of the vanishing and exploding gradient problems.
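
A small NumPy illustration (my own, not from the slides) of why this happens: backpropagation through time multiplies the gradient by the recurrent weight matrix once per time step, so its norm shrinks or grows roughly geometrically with sequence length:

import numpy as np

def gradient_norm_through_time(Whh, steps):
    # Ignoring the activation derivative, the backpropagated gradient is
    # multiplied by Whh (transposed) once per time step.
    grad = np.eye(Whh.shape[0])
    for _ in range(steps):
        grad = grad @ Whh.T
    return np.linalg.norm(grad, 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
W = (A + A.T) / 2                            # symmetric, so powers behave predictably
for scale, label in [(0.3, "small recurrent weights (vanishing)"),
                     (1.5, "large recurrent weights (exploding)")]:
    Whh = scale * W / np.linalg.norm(W, 2)   # set the spectral norm to `scale`
    print(label, [f"{gradient_norm_through_time(Whh, t):.2e}" for t in (1, 10, 50)])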
Language Modeling Example of RNN
• Consider the sentence:
• The cat chased the mouse.
• In this case, we have a lexicon of four words, which are
{“the,”“cat,”“chased,”“mouse”}.
• Ideally, we would like the probability of the next word to be predicted
correctly from the probabilities of the previous words.
• Each one-hot encoded input vector xt has length four, in which only
one bit is 1 and the remaining bits are 0s.
• The main flexibility here is in the dimensionality p of the hidden
representation, which we set to 2 in this case.
• As a result, the matrix Wxh will be a 2 × 4 matrix, so that it maps a
one-hot encoded input vector into a hidden vector ht of size 2.
• Each column of Wxh corresponds to one of the four words, and one
of these columns is selected by the expression Wxh xt. This
expression is added to Whh ht-1 and then transformed with the tanh
function to produce ht.
• The final output yt is defined by Why ht. Note that the matrices Whh
and Why are of sizes 2 × 2 and 4 × 2, respectively.
• In this case, the outputs are continuous values (not probabilities) in
which larger values indicate greater likelihood of presence.
• These continuous values are eventually converted to probabilities
with the softmax function
• The word “cat” is predicted in the first time-stamp with a value of 1.3,
although this value seems to be (incorrectly) outstripped by “mouse”
for which the corresponding value is 1.7.
• However, the word “chased” seems to be predicted correctly at the
next time-stamp.
• As in all learning algorithms, one cannot hope to predict every value
exactly, and such errors are more likely to be made in the early
iterations of the backpropagation algorithm.
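
A NumPy sketch of this example with the stated dimensions (the weights here are random placeholders, not the trained values behind the 1.3 and 1.7 scores mentioned above):

import numpy as np

lexicon = ["the", "cat", "chased", "mouse"]
word_to_idx = {w: i for i, w in enumerate(lexicon)}

def one_hot(word, size=4):
    v = np.zeros(size)
    v[word_to_idx[word]] = 1.0
    return v

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Dimensions from the example: hidden size p = 2, vocabulary size 4.
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(2, 4))   # maps a one-hot word into the 2-d hidden space
Whh = rng.normal(size=(2, 2))   # hidden-to-hidden recurrence
Why = rng.normal(size=(4, 2))   # hidden-to-output scores, one per word

h = np.zeros(2)
for word in ["the", "cat", "chased", "the"]:
    h = np.tanh(Wxh @ one_hot(word) + Whh @ h)    # ht = tanh(Wxh xt + Whh ht-1)
    y = Why @ h                                   # continuous scores for each next word
    probs = softmax(y)                            # converted to probabilities
    print(word, "->", lexicon[int(np.argmax(probs))], np.round(probs, 3))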
Generating Language Sample
• The likelihoods of the tokens at the first time-stamp can be generated using the <START>
token as input.
• Since the <START> token is also available in the training data, the model will typically
select a word that often starts text segments.
• Subsequently, the idea is to sample one of the tokens generated at each time-stamp
(based on the predicted likelihood), and then use it as an input to the next time-stamp.
• To improve the accuracy of the sequentially predicted token, one might use beam search.
• By recursively applying this operation, one can generate an arbitrary sequence of text
that reflects the particular training data at hand.
• If the <END> token is predicted, it indicates the end of that particular segment of text.
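
A sketch of this sampling loop, under the assumption of a step_fn(token, state) callable that returns the predicted next-token distribution and the updated hidden state (that interface, and the <START>/<END> indices, are assumptions for illustration; beam search would replace the single sample below):

import numpy as np

def sample_sequence(step_fn, start_idx, end_idx, vocab_size, max_len=30, seed=0):
    """Generate a token sequence by repeatedly sampling the next token
    from the model's predicted distribution and feeding it back as input."""
    rng = np.random.default_rng(seed)
    h = None
    token = start_idx
    out = []
    for _ in range(max_len):
        probs, h = step_fn(token, h)                    # model predicts P(next token)
        token = int(rng.choice(vocab_size, p=probs))    # sample instead of argmax
        if token == end_idx:                            # <END> terminates the segment
            break
        out.append(token)
    return out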
GRU [Gated Recurrent Unit]
• GRU (Gated Recurrent Unit) aims to solve
the vanishing gradient problem which comes with a
standard recurrent neural network.
• GRU can also be considered as a variation on the
LSTM because both are designed similarly and, in some
cases, produce equally excellent results.
• To solve the vanishing gradient problem of a standard
RNN, the GRU uses so-called update and reset gates.
• Basically, these are two vectors which decide what
information should be passed to the output.
• The special thing about them is that they can be
trained to keep information from long ago without
washing it out through time, and to remove information
that is irrelevant to the prediction.
• #1. Update gate
• We start by calculating the update gate z_t for time
step t using the formula z_t = σ(W(z) x_t + U(z) h_(t-1)).
• When x_t is plugged into the network unit, it is
multiplied by its own weight W(z). The same goes
for h_(t-1) which holds the information for the
previous t-1 units and is multiplied by its own
weight U(z). Both results are added together and a
sigmoid activation function is applied to squash the
result between 0 and 1.
• The update gate helps the model to determine how
much of the past information (from previous time steps)
needs to be passed along to the future.
• That is really powerful because the model can decide to
copy all the information from the past, eliminating
the risk of the vanishing gradient problem.
• #2. Reset gate
• Essentially, this gate is used by the model to decide
how much of the past information to forget. To
calculate it, we use r_t = σ(W(r) x_t + U(r) h_(t-1)).
• As before, we plug in h_(t-1) and x_t,
multiply them by their corresponding
weights, sum the results, and apply the sigmoid
function.
• #3. Current memory content
• Let’s see how exactly the gates affect the final
output. First, we start with the usage of the reset gate.
We introduce a new memory content which uses
the reset gate to store the relevant information from
the past. It is calculated as h’_t = tanh(W x_t + r_t ⊙ (U h_(t-1))).
• We do an element-wise multiplication of U h_(t-1)
and r_t, and then sum the result with the weighted
input W x_t. Finally, tanh is applied to produce h’_t.
• #4. Final memory at current time step
• As the last step, the network needs to calculate h_t, the
vector which holds the information for the current unit
and passes it down the network. In order to do that,
the update gate is needed. It determines what to
collect from the current memory content h’_t and
what from the previous steps h_(t-1). That is done
as follows: h_t = z_t ⊙ h_(t-1) + (1 − z_t) ⊙ h’_t.
• Following through, z_t is used to calculate 1 − z_t,
which is combined with h’_t by element-wise
multiplication. z_t is also multiplied element-wise
with h_(t-1). Finally, h_t is the sum of those
two terms.
• GRUs are able to store and filter the information using
their update and reset gates. That eliminates the
vanishing gradient problem since the model is not
washing out the new input every single time but keeps
the relevant information and passes it down to the
next time steps of the network. If carefully trained,
they can perform extremely well even in complex
scenarios.
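
A minimal NumPy sketch of one GRU step implementing the four formulas above (weight shapes are illustrative and bias terms are omitted, as in the formulas):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU time step following the update-gate / reset-gate formulas."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))   # current memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde        # final memory at this step
    return h_t

# Illustrative sizes: 3-dimensional input, 2-dimensional hidden state.
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.normal(size=(2, 3)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(2, 2)) for _ in range(3))
h = np.zeros(2)
for x in rng.normal(size=(5, 3)):                     # run over a short sequence
    h = gru_step(x, h, Wz, Uz, Wr, Ur, W, U)
print(h)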
BERT (Bidirectional Encoder
Representations from Transformers)
• BERT makes use of Transformer, an attention
mechanism that learns contextual relations between
words (or sub-words) in a text.
• Transformer includes two separate mechanisms — an
encoder that reads the text input and a decoder that
produces a prediction for the task.
• Since BERT’s goal is to generate a language model, only
the encoder mechanism is necessary.
• The Transformer encoder reads the entire sequence of
words at once.
• This characteristic allows the model to learn the
context of a word based on all of its surroundings (left
and right of the word).
• BERT uses two training strategies:
• Masked LM (MLM)
• Before feeding word sequences into BERT, 15% of the
words in each sequence are replaced with a [MASK]
token. The model then attempts to predict the original
value of the masked words, based on the context
provided by the other, non-masked, words in the
sequence.
• In technical terms, the prediction of the output words
requires:
1.Adding a classification layer on top of the encoder
output.
2.Multiplying the output vectors by the embedding
matrix, transforming them into the vocabulary
dimension.
3.Calculating the probability of each word in the
vocabulary with softmax.
• The BERT loss function takes into consideration only
the prediction of the masked values and ignores the
prediction of the non-masked words.
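
A sketch of the masked-LM prediction described in steps 1-3, assuming precomputed encoder outputs and an embedding matrix are available (the extra classification layer from step 1 is folded away here for brevity; names and shapes are illustrative, not the actual BERT code):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def masked_lm_loss(encoder_out, embedding_matrix, masked_positions, true_ids):
    """encoder_out: (seq_len, hidden), embedding_matrix: (vocab, hidden).
    Predict the original token at each masked position; only masked
    positions contribute to the loss, as described above."""
    losses = []
    for pos, true_id in zip(masked_positions, true_ids):
        logits = embedding_matrix @ encoder_out[pos]    # project into the vocabulary dimension
        probs = softmax(logits)                         # probability of each word in the vocabulary
        losses.append(-np.log(probs[true_id] + 1e-12))  # cross-entropy on the masked word only
    return float(np.mean(losses))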
• Next Sentence Prediction (NSP)
• In the BERT training process, the model receives pairs
of sentences as input and learns to predict if the
second sentence in the pair is the subsequent sentence
in the original document.
• During training, 50% of the inputs are a pair in which
the second sentence is the subsequent sentence in the
original document, while in the other 50% a random
sentence from the corpus is chosen as the second
sentence.
• To help the model distinguish between the two sentences in
training, the input is processed in the following way before
entering the model:

1.A [CLS] token is inserted at the beginning of the first
sentence and a [SEP] token is inserted at the end of each
sentence.
2.A sentence embedding indicating Sentence A or Sentence
B is added to each token.
3.A positional embedding is added to each token to indicate
its position in the sequence.

• To predict if the second sentence is indeed connected
to the first, the following steps are performed:
1.The entire input sequence goes through the
Transformer model.
2.The output of the [CLS] token is transformed into a
2×1 shaped vector, using a simple classification layer
(learned matrices of weights and biases).
3.Calculating the probability of IsNextSequence with
softmax.
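
A sketch of the next-sentence-prediction head from steps 2-3, assuming the [CLS] output vector is already available (the weight names are illustrative):

import numpy as np

def nsp_probability(cls_output, W_nsp, b_nsp):
    """cls_output: (hidden,) vector for the [CLS] token.
    W_nsp: (2, hidden) classification weights, b_nsp: (2,) bias.
    Returns P(IsNextSequence) via a 2-way softmax."""
    logits = W_nsp @ cls_output + b_nsp
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return probs[1]   # index 1 taken as "is the next sentence" (a convention assumed here)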
• When training the BERT model, Masked LM and Next
Sentence Prediction are trained together, with the goal
of minimizing the combined loss function of the two
strategies.
• How to use BERT (Fine-tuning)
• Using BERT for a specific task is relatively straightforward:
• BERT can be used for a wide variety of language tasks, while
only adding a small layer to the core model:
1.Classification tasks such as sentiment analysis are done similarly
to Next Sentence classification, by adding a classification layer
on top of the Transformer output for the [CLS] token.
2.In Question Answering tasks (e.g. SQuAD v1.1), the software
receives a question regarding a text sequence and is required to
mark the answer in the sequence. Using BERT, a Q&A model
can be trained by learning two extra vectors that mark the
beginning and the end of the answer (a sketch follows this list).
3.In Named Entity Recognition (NER), the software receives a text
sequence and is required to mark the various types of entities
(Person, Organization, Date, etc) that appear in the text.
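
For item 2 above, a sketch of how the two extra start/end vectors could mark the answer span (S, E, and the scoring rule here are a simplified assumption, not the exact SQuAD fine-tuning code):

import numpy as np

def answer_span(token_outputs, S, E):
    """token_outputs: (seq_len, hidden) encoder outputs for the passage tokens.
    S, E: learned start and end vectors of size (hidden,).
    The predicted answer runs from the highest-scoring start token to the
    highest-scoring end token at or after it."""
    start_scores = token_outputs @ S
    end_scores = token_outputs @ E
    start = int(np.argmax(start_scores))
    end = start + int(np.argmax(end_scores[start:]))   # end must not precede start
    return start, end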
