
UNIT-3

Chapter-1
Sequence Modeling: Recurrent and Recursive nets
Unfolding Computational Graphs
• In deep learning, unfolding computational graphs is a
process that creates a repetitive structure in a
computational graph by sharing parameters across a deep
network structure.
• This is done by recursively or recurrently computing a set
of computations.
• Computational graphs are a way to represent
mathematical operations that machines use to learn from
data.
• They are similar to flowcharts, where each node represents
an operation and the lines between nodes show how the
results flow from one step to the next.
• A computational graph is a way to formalize the
structure of a set of computations, such as
those involved in mapping inputs and
parameters to outputs and loss.
• The idea of Unfolding Computational Graphs is
sharing of parameters across a deep network
structure.
Unfolding the equation by repeatedly applying the definition in this
way has yielded an expression that does not involve recurrence.
Such an expression can now be represented by a traditional
directed acyclic computational graph. The figure below shows the
unfolded computational graph of this equation.

Each node represents the state at some time step t


• Function f maps state at time t to the state at t+1
• The same parameters ( the same value of θ used to
parameterize f ) are used for all time steps
• That is, unfolding turns a recursive or recurrent
computation into a computational graph that has
a repetitive structure, typically corresponding to a
chain of events.
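As a rough illustration of unfolding, the sketch below (hypothetical names, plain NumPy) unrolls a simple recurrence h(t) = f(h(t-1), x(t); θ) over a few time steps, applying the same parameters θ at every step so that the unrolled graph becomes an ordinary chain of operations:

import numpy as np

def f(h_prev, x_t, W, U, b):
    # One step of the recurrence; W, U, b play the role of the shared theta.
    return np.tanh(W @ h_prev + U @ x_t + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))   # state-to-state weights, shared across time
U = rng.normal(size=(3, 2))   # input-to-state weights, shared across time
b = np.zeros(3)

xs = [rng.normal(size=2) for _ in range(4)]   # a length-4 input sequence
h = np.zeros(3)                               # initial state h(0)

# Unfolding: the same f (same W, U, b) is applied at every time step,
# producing the chain h(1), h(2), h(3), h(4) of the unrolled graph.
states = []
for x_t in xs:
    h = f(h, x_t, W, U, b)
    states.append(h)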
Recurrent Neural Networks
• Unfolding computational graphs is a way to
visualize recurrent neural networks (RNNs)
across time steps, which helps reveal their
sequential nature.
• This is useful when training RNNs.
Motivation for RNNs
• A feedforward neural network (FNN)
– Moves only forward
– Has no loops
– Considers only the current input
– Has no memory
• Its disadvantage is that it cannot handle
sequential data
Introduction to RNN
• A recurrent neural network or RNN is a deep
neural network trained on sequential or time
series data to create a machine learning (ML)
model that can make sequential predictions or
conclusions based on sequential inputs.
• That is, an RNN adds the following to the FNN, while the
basic structure stays the same:
– A loop
– Memory
– The output of the previous time step fed as input to
the next time step
• In the traditional neural network, the inputs
and the outputs are independent of each other,
whereas the output in an RNN is dependent on
the prior elements within the sequence.
• Recurrent networks also share parameters
across each layer of the network. In
feedforward networks, there are different
weights across each node.
• Whereas an RNN shares the same weights within
each layer of the network, and during
gradient descent the weights and biases are
adjusted individually to reduce the loss.
How RNNs work
•In RNN, the information cycles through the loop, so
the output is determined by the current input and
previously received inputs.
•The input layer X processes the initial input and
passes it to the middle layer A.
•The middle layer consists of multiple hidden layers,
each with its activation functions, weights, and
biases.
•These parameters are shared across the
hidden layer, so instead of creating multiple
hidden layers, the network creates one and loops over it.

Instead of using traditional backpropagation, recurrent neural networks use
the backpropagation through time (BPTT) algorithm to determine the gradient.
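As a hedged sketch of what this looks like in practice (toy shapes and names are assumed; PyTorch is used here), the forward pass loops a single recurrent cell over the time steps, and one call to loss.backward() then propagates gradients through every step, which is exactly backpropagation through time:

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)   # one cell, reused ("looped") over time
readout = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(10, 5, 8)          # (time steps, batch, features) -- assumed toy shapes
targets = torch.randint(0, 4, (10, 5))

h = torch.zeros(5, 16)             # initial hidden state
loss = 0.0
for t in range(x.size(0)):         # the same weights are applied at every time step
    h = cell(x[t], h)
    loss = loss + loss_fn(readout(h), targets[t])

loss.backward()                    # gradients flow back through all time steps (BPTT)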
Types of Recurrent Neural Networks

• There are four types of RNN, based on the different lengths
of inputs and outputs.
• One-to-one is a simple neural network. It is commonly
used for machine learning problems that have a single
input and output.
• One-to-many has a single input and multiple outputs.
This is used for generating image captions.
• Many-to-one takes a sequence of multiple inputs and
predicts a single output. It is popular in sentiment
classification, where the input is text and the output is
a category.
• Many-to-many takes multiple inputs and outputs. The
most common application is machine translation.
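As a rough way to picture these four patterns, the sketch below (assumed toy shapes; PyTorch) shows how the output of one recurrent layer can be read out differently depending on the task:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 6, 8)            # one sequence of 6 steps with 8 features each
outputs, h_n = rnn(x)               # outputs: (1, 6, 16), h_n: (1, 1, 16)

# Many-to-one (e.g. sentiment classification): keep only the final hidden state.
many_to_one = nn.Linear(16, 3)(h_n[-1])        # one prediction per sequence

# Many-to-many (e.g. sequence tagging, translation-style outputs): keep every step.
many_to_many = nn.Linear(16, 10)(outputs)      # one prediction per time step

# One-to-many (e.g. image captioning) would feed a single input vector once and
# then keep generating outputs, feeding each generated token back in as the next input.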
Real-world examples of recurrent neural
networks (RNNs):
• Speech recognition: RNNs are used in speech recognition
because they can remember previous inputs to predict the
next sequence.
• Machine translation: RNNs can convert a sentence in one
language to another language.
• Sentiment analysis: RNNs can classify a sentence or
review as positive, negative, or neutral.
• Spam detection: RNNs can analyze an email's content to
determine if it's spam.
• Apple's Siri and Google's voice search: RNNs are used in
these voice assistants.
Example
• One of the best ways to visualize a Recurrent Neural
Network is as a cyclic computational graph
• In this representation the Recurrent Neural Network
has three major states:
• Input state, which captures the input data for the
model.
• Output state, which captures the results of the
model.
• Recurrent state, which is in fact a chain of hidden
states, and captures all the computations between
the input and output states.
Compressed view of RNN

Recurrent Neural Network represented as a computational graph.
Unfolded view of RNN

Recurrent Neural Network represented as an unfolded computational graph.
• As each internal state relies on the previous
one, you have information that is propagated
onto each layer of neurons in the network
since the beginning of the sequence.
• Like an old memory that is passed on to future
generations.
• In the case of a Recurrent Neural Network,
memories are information about the
computations applied to the sequence
Teacher forcing and networks
with output recurrence
What is Teacher Forcing?

• Teacher forcing is a training technique used in recurrent
neural networks (RNNs) and other sequence-to-sequence
models, particularly for tasks such as language modeling,
translation, and text generation.
• During the training process, instead of feeding the
model’s own previous output as input to the next time
step, the ground truth (actual) output from the training
data is provided as input.
• This approach is intended to help the model learn the
correct sequence of output tokens more efficiently and
stabilize the training process
What can Teacher Forcing do?
• Teacher forcing can be employed in various
applications involving sequence-to-sequence
models, such as:
– Machine translation: Translating text from one
language to another while preserving the meaning
and context.
– Text summarization: Generating concise and
coherent summaries of longer texts.
– Image captioning: Creating textual descriptions of
images based on their content.
– Speech recognition: Converting spoken language
into written text.
• Teacher forcing is a method for quickly and
efficiently training recurrent neural network models
that use the ground truth from a prior time step as
input.
• It is a network training method critical to the
development of deep learning language models
used in machine translation, text summarization,
and image captioning, among many other
applications.
• Teacher forcing works by using the actual or
expected output from the training dataset at the
current time step y(t) as input in the next time step
x(t+1), rather than the output generated by the model itself.
How does teacher forcing work?

• During training, the model is provided with an input
sequence and is expected to generate an output sequence
step by step.
• The model generates an output token for each step
based on the input and previously generated tokens.
• The model is given the actual target (correct)
sequence as input at each step instead of its
previously generated tokens in teacher forcing.
• This means that the model gets to “see” the correct
output sequence during training and is guided by it.
• The model’s parameters are updated based on
the loss between its generated output and the
true target output at each step.
• This helps the model learn to create sequences
closer to the desired target.
Step-by-step guide on how to implement teacher forcing

• Step 1: Data Preparation


• Before diving into the inner workings, you
need a dataset containing pairs of input
sequences and their corresponding target
sequences. These pairs serve as the
foundation for training your sequence
generation model.
• Step 2: Training the Model
• The core idea behind teacher forcing is to use the ground truth or
actual target sequence as the input during training, at least in the
early stages of the process. This enables the model to receive precise
guidance from the outset.
• Input: At each time step, the model receives a token from the true
target sequence. This serves as the initial input.
• Predictions: The model then predicts based on this initial token and its
learned parameters.
• Feedback: The true target sequence provides immediate feedback.
The model can compare its prediction to the correct token at this step.
• Loss Calculation: A loss function measures the disparity between
predicted and actual tokens. This is typically done using techniques
like cross-entropy loss.
• Parameter Update: To minimize the loss, the model’s parameters are
adjusted through backpropagation and optimization methods (e.g.,
gradient descent).
• Step 3: Iteration
• The above steps are repeated for each time step in the
sequence. The model’s training process involves generating
one token at a time while relying on the true target sequence
to guide its predictions. This iterative process continues until
the entire target sequence has been generated.
• Step 4: Gradual Transition
• One vital aspect is that teacher forcing doesn’t have to be
used throughout training. Using it exclusively can lead to
exposure bias, where the model struggles to generate
sequences independently during inference.
• A technique called “scheduled sampling” is often employed to
address this. Scheduled sampling gradually transitions from
using the true target as input to using the model’s predictions.
This helps the model adapt to generating sequences
independently, reducing the reliance on teacher forcing.
• Step 5: Inference
• Once the model has been trained using
teacher forcing and has learned to generate
sequences effectively, it can be used for
inference. The model cannot access the true
target during inference and must create
sequences based on its predictions.
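A minimal training-loop sketch of these steps (the decoder, vocabulary size, and tensors below are hypothetical; PyTorch), including the scheduled-sampling-style switch between feeding the ground-truth token and the model's own prediction:

import random
import torch
import torch.nn as nn

vocab_size, hidden_size = 100, 32
embed = nn.Embedding(vocab_size, hidden_size)
decoder_cell = nn.GRUCell(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def train_step(context, target_tokens, teacher_forcing_ratio=0.5):
    # context: (batch, hidden) vector from an encoder; target_tokens: (batch, T)
    h = context
    inp = target_tokens[:, 0]            # assume the first token is a <sos> marker
    loss = 0.0
    for t in range(1, target_tokens.size(1)):
        h = decoder_cell(embed(inp), h)
        logits = out_proj(h)
        loss = loss + loss_fn(logits, target_tokens[:, t])   # per-step loss (Step 2)
        if random.random() < teacher_forcing_ratio:
            inp = target_tokens[:, t]    # teacher forcing: feed the ground truth
        else:
            inp = logits.argmax(dim=1)   # otherwise feed the model's own prediction
    loss.backward()                      # parameter update via backpropagation
    return loss

Setting teacher_forcing_ratio to 1.0 gives pure teacher forcing; lowering it over the course of training implements the gradual transition described in Step 4.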
Example

• Consider math exam questions that consist of multiple parts,
where the answer for part (a) is needed for the
calculation in part (b), the answer for part (b) is needed
for part (c), and so on.
• If we get part (a) wrong, then all the subsequent
parts will most likely be wrong as well, even though
the formulas and the calculations are correct.
• Teacher Forcing remedies this as follows: After we
obtain an answer for part (a), a teacher will compare
our answer with the correct one, record the score for
part (a), and tell us the correct answer so that we can
use it for part (b).
• The situation for Recurrent Neural Networks that
output sequences is very similar.
• Let us assume we want to train an image
captioning model, and the ground truth caption
for a given image is "Two people reading a
book".
• Our model makes a mistake in predicting the 2nd
word and we have “Two” and “birds” for the 1st
and 2nd prediction respectively.
• Without Teacher Forcing, we would feed “birds”
back to our RNN to predict the 3rd word. Let’s
say the 3rd prediction is “flying”. Even though it
makes sense for our model to predict “flying”
given the input is “birds”, it is different from the
ground truth.
• On the other hand, if we use Teacher Forcing, we
would feed “people” to our RNN for the 3rd
prediction, after computing and recording the
loss for the 2nd prediction.
Computing the gradient in a
Recurrent Neural Network
Introduction
• Much as Convolutional Neural Networks are used to
deal with images, the Recurrent Neural Networks
are used to process sequential data.
• The key idea in recurrent neural networks is
parameter sharing across different parts of the
model, i.e., an RNN shares the same weights across
several time steps.
• Just as a feedforward neural network can model any
function, a recurrent neural network can model any
function involving recurrence.
• A recurrent neural network is a neural network that is
specialized for processing a sequence of values x(1),
. . . , x(τ), with the time step index t ranging from 1 to τ.
• For tasks that involve sequential inputs, such as
speech and language, it is often better to use RNNs.
• In an NLP problem, if you want to predict the next word
in a sentence, it is important to know the words that
came before it.
• RNNs are called recurrent because they perform the
same task for every element of a sequence, with the
output depending on the previous computations.
• Another way to think about RNNs is that they have a
“memory” which captures information about what has
been calculated so far.
• The left side of the diagram shows the compact
notation of an RNN, and the right side shows the RNN
being unrolled (or unfolded) into a full network.
• By unrolling we mean that we write out the
network for the complete sequence.
• For example, if the sequence we care about is a
sentence of 3 words, the network would be
unrolled into a 3-layer neural network, one layer
for each word.
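To make the "unrolled into a 3-layer network" picture concrete, here is a hedged sketch (NumPy, with a deliberately tiny linear recurrence so the gradients stay short) showing that the gradient of the loss with respect to the shared weight accumulates one contribution per unrolled time step:

import numpy as np

# Toy scalar recurrence h_t = w * h_{t-1} + x_t, unrolled over a 3-word sentence.
w, h0 = 0.8, 0.0
xs = np.array([1.0, 2.0, 3.0])
target = 5.0

# Forward pass: unroll the recurrence (three "layers", same w each time).
hs = [h0]
for x in xs:
    hs.append(w * hs[-1] + x)

# Backward pass (BPTT): dL/dw sums one contribution per time step,
# because the same w was reused at every step of the unrolled graph.
dL_dh = hs[-1] - target            # from L = 0.5 * (h_3 - target)**2
dL_dw = 0.0
for t in range(len(xs), 0, -1):
    dL_dw += dL_dh * hs[t - 1]     # local contribution of step t
    dL_dh *= w                     # propagate the gradient one step further back
print(dL_dw)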
Recurrent Networks as directed
graphical models
Directed Graphical Model

• The definition of a graphical model is: "a
probabilistic model for which a graph
expresses the conditional dependence
structure between random variables."
• As we can draw a dependency graph to
represent a neural network, it falls into this
category of "graphical models".
Overview of Directed Graphical Models
• Directed graphical models, also known as
Bayesian networks, represent probabilistic
relationships between variables.
• Each node in the graph represents a random
variable, and the edges denote conditional
dependencies.
• These models are particularly useful for
modeling sequential data, making them a
natural fit for RNNs.
RNNs as Directed Graphical Models
• An RNN can be unfolded over time, creating a
sequence of interconnected nodes. Each node
represents the hidden state of the network at
a particular time step. The connections
between these nodes define the temporal
dependencies:
– Self-loop: A node connects back to itself,
representing the influence of the previous hidden
state on the current one.
– Input connections: Nodes receive connections
from input variables, allowing the network to
process external information.
– Output connections : Nodes may connect to
output variables, producing the network's
predictions.
•Each circle represents a hidden state at a specific time step.
•The arrows indicate the flow of information through the
network.
• Directed Graphs:
– In RNNs, the structure is directed, meaning that
the flow of information has a specific direction—
from one time step to the next.
– Each node in the graph represents the hidden
state of the network at a given time, and directed
edges represent the connections (weights)
between these states.
• Temporal Dependencies:
– RNNs are designed to handle sequential data,
capturing temporal dependencies by using the
previous state to influence the current state.
– This is represented in the graphical model by
edges that loop back to earlier nodes (states),
indicating that the past affects the present.
• State Representation: At each time step, the
hidden state is updated based on the input at
that time step and the previous hidden state.
• This can be represented mathematically as:
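The formula itself is not reproduced in these notes; in the standard RNN parameterization (assuming weight matrices W, U, V and biases b, c, which the slides do not name explicitly) the update is

h(t) = tanh(W h(t-1) + U x(t) + b),    y(t) = softmax(V h(t) + c)

where h(t) is the hidden state, x(t) the input, and y(t) the output at time step t.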
• Graphical Interpretation:
– In a directed graphical model, nodes correspond
to random variables, and edges represent
probabilistic dependencies.
– For RNNs, the hidden states h_t can be treated as
latent variables conditioned on previous states
and inputs, forming a Markov chain.
• Training and Inference:
– The training of RNNs typically involves
backpropagation through time (BPTT), a method
adapted from standard backpropagation that
accounts for the temporal nature of the data.
– This involves unfolding the RNN through time into
a feedforward structure, where each time step is
treated as a layer in a deep neural network.
• Extensions:
– Variants like Long Short-Term Memory (LSTM) and
Gated Recurrent Units (GRUs) introduce gating
mechanisms to better manage information flow
and mitigate issues like vanishing gradients, which
can also be understood in the context of graphical
models by modifying how dependencies are
represented.
Example
• Simple RNN for Sequence Prediction
• Consider a task where we want to predict the
next word in a sequence based on the
previous words. We can represent this using a
simple RNN.
• Step 1: Sequence Representation
• Imagine we have a sequence of words: "I love
programming". We can represent this
sequence as x(1) = "I", x(2) = "love", x(3) = "programming",
with a hidden state h(t) computed at each time step.
• Therefore RNN can be represented as a
directed graphical model.
• The nodes (hidden states) capture the
sequence's temporal dependencies, and the
directed edges illustrate how information
flows through the network over time.
• This structure allows the RNN to learn
patterns in sequential data effectively.
Modelling sequences conditioned
on context with RNNs
• Modeling sequences conditioned on context
using recurrent neural networks (RNNs) can
significantly enhance the model's ability to
make predictions by incorporating additional
relevant information.
Definition
• Context can be any additional information that
influences the sequence being modeled.
Examples include:
– Historical data (e.g., previous states in time series)
– Features related to the sequence (e.g., speaker
identity in dialogue systems)
– External factors (e.g., weather data affecting
conversation topics)
• Modeling sequences conditioned on context
with RNNs provides a powerful way to
enhance predictive performance by
incorporating relevant information.
• This approach is applicable across various
domains, including natural language
processing, speech recognition, and time
series forecasting, making RNNs versatile tools
for handling sequential data.
Bidirectional RNNs
• Bi-directional recurrent neural networks (Bi-RNNs) are
artificial neural networks that process input data in both
the forward and backward directions.
• They are often used in natural language processing tasks,
such as language translation, text classification, and
named entity recognition.
• In addition, they can capture contextual dependencies in
the input data by considering past and future contexts.
• Bi-RNNs consist of two separate RNNs that process the
input data in opposite directions, and the outputs of
these RNNs are combined to produce the final output.
• A bi-directional recurrent neural network (Bi-RNN) is a
type of recurrent neural network (RNN) that processes
input data in both forward and backward directions.
• The goal of a Bi-RNN is to capture the contextual
dependencies in the input data by processing it in
both directions, which can be useful in various natural
language processing (NLP) tasks.
• In a Bi-RNN, the input data is passed through two
separate RNNs: one processes the data in the forward
direction, while the other processes it in the reverse
direction.
• The outputs of these two RNNs are then combined in
some way to produce the final output.
• One common way to combine the outputs of
the forward and reverse RNNs is to
concatenate them.
• Still, other methods, such as element-wise
addition or multiplication, can also be used.
• The choice of combination method can
depend on the specific task and the desired
properties of the final output.
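A hedged sketch of this combination step (assumed toy shapes; PyTorch), showing concatenation versus element-wise addition of the forward and backward hidden states:

import torch
import torch.nn as nn

fwd_rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
bwd_rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(2, 5, 8)                       # (batch, time, features)
fwd_out, _ = fwd_rnn(x)                        # left-to-right pass
bwd_out, _ = bwd_rnn(torch.flip(x, dims=[1]))  # right-to-left pass on the reversed sequence
bwd_out = torch.flip(bwd_out, dims=[1])        # re-align with the original time order

combined_cat = torch.cat([fwd_out, bwd_out], dim=-1)   # concatenation: (2, 5, 32)
combined_sum = fwd_out + bwd_out                        # element-wise addition: (2, 5, 16)

# In practice, nn.GRU(..., bidirectional=True) packages the two passes and the
# concatenation into a single module.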
Need for Bi-directional RNNs
• A uni-directional recurrent neural network (RNN)
processes input sequences in a single direction, either
from left to right or right to left.
• This means the network can only use information from
earlier time steps when making predictions at later
time steps.
• This can be limiting, as the network may not capture
important contextual information relevant to the
output prediction.
• For example, in natural language processing tasks, a
uni-directional RNN may not accurately predict the
next word in a sentence if the previous words provide
important context for the current word.
Example
• Consider an example where we could use the
recurrent network to predict the masked word
in a sentence.
• Apple is my favorite _____.
• Apple is my favorite _____, and I work there.
• Apple is my favorite _____, and I am going to
buy one.
• In the first sentence, the answer could be fruit,
company, or phone. But it cannot be a fruit in the
second and third sentences.
• A Recurrent Neural Network that can only process
the inputs from left to right may not accurately
predict the right answer for sentences discussed
above.
• To perform well on natural language tasks, the
model must be able to process the sequence in
both directions.
• This means that the network has two separate
RNNs:
– One that processes the input sequence from left
to right
– Another one that processes the input sequence
from right to left.
• These two RNNs are typically called forward
and backward RNNs, respectively.
• During the forward pass of the RNN, the forward
RNN processes the input sequence in the usual
way by taking the input at each time step and
using it to update the hidden state. The updated
hidden state is then used to predict the output.
• Backpropagation through time (BPTT) is a
widely used algorithm for training recurrent
neural networks (RNNs). It is a variant of the
backpropagation algorithm specifically designed
to handle the temporal nature of RNNs, where
the output at each time step depends on the
inputs and outputs at previous time steps.
• In the case of a bidirectional RNN, BPTT
involves two separate Backpropagation
passes:
– one for the forward RNN and one for the
backward RNN.
– During the forward pass, the forward RNN
processes the input sequence in the usual way and
makes predictions for the output sequence.
• These predictions are then compared to the
target output sequence, and the error is
backpropagated through the network to
update the weights of the forward RNN.
• The backward RNN processes the input sequence in
reverse order during the backward pass and predicts
the output sequence. These predictions are then
compared to the target output sequence in reverse
order, and the error is backpropagated through the
network to update the weights of the backward RNN.
• Once both passes are complete, the weights of the
forward and backward RNNs are updated based on
the errors computed during the forward and
backward passes, respectively. This process is
repeated for multiple iterations until the model
converges and the predictions of the bidirectional
RNN are accurate.
• This allows the bidirectional RNN to consider
information from past and future time steps
when making predictions, which can
significantly improve the model's accuracy.
Encoder-Decoder Sequence-to-Sequence
Architectures (Seq2Seq)
• An RNN typically has fixed-size input and output vectors,
i.e., the lengths of both the input and output vectors are
predefined.
• Consider a case where the English phrase "How have you
been?" is translated into French.
• In French, you’d say "Comment avez-vous été?".
• Here, neither the input nor output sequences are fixed in
size.
• In this context, if you want to build a language translator
using an RNN, you do not want to define the sequence
lengths beforehand.
• Sequence-to-sequence (seq2seq) models can
help solve the above-mentioned problem.
• When given an input, the encoder-decoder
seq2seq model first generates an encoded
representation of the input, which is then
passed to the decoder to generate the desired
output.
• In this case, the input and output vectors
need not be fixed in size.
Architecture

• The idea behind the design of this model is to enable it
to process input where we do not constrain the length.
• One RNN will be used as an encoder, and another as a
decoder.
• The output vector generated by the encoder and the
input vector given to the decoder will possess a fixed
size. However, they need not be equal.
• The output generated by the encoder can either be
given as a whole chunk or can be connected to the
hidden units of the decoder unit at every time step.
• Encoder Block
• The main purpose of the encoder block is to process the
input sequence and capture information in a fixed-size
context vector.
• Architecture:
• The input sequence is put into the encoder.
• The encoder processes each element of the input sequence
using neural networks (or transformer architecture).
• Throughout this process, the encoder keeps an internal
state, and the ultimate hidden state functions as the
context vector that encapsulates a compressed
representation of the entire input sequence.
• This context vector captures the semantic meaning and
important information of the input sequence.
The final hidden state of the encoder is then passed as the
context vector to the decoder.
• Decoder Block
• The decoder block is similar to the encoder block.
The decoder processes the context vector
from the encoder to generate the output sequence
incrementally.
• Architecture:
• In the training phase, the decoder receives
both the context vector and the desired target
output sequence (ground truth).
• During inference, the decoder relies on its
own previously generated outputs as inputs
for subsequent steps.
• The decoder uses the context vector to
comprehend the input sequence and create the
corresponding output sequence.
• It engages in autoregressive generation,
producing individual elements sequentially.
• At each time step, the decoder uses the current
hidden state, the context vector, and the
previous output token to generate a probability
distribution over the possible next tokens.
• The token with the highest probability is then
chosen as the output, and the process continues
until the end of the output sequence is reached.
• RNN-based Seq2Seq Model
• The encoder-decoder architecture utilizes RNNs to generate
the desired outputs. Let’s look at the simplest seq2seq model.
• Recurrent Neural Networks can easily map sequences to
sequences when the alignment between the inputs and
the outputs is known in advance.
• Although the vanilla version of the RNN is rarely used, its
more advanced variants, i.e., the LSTM or GRU, are used instead.
• This is because the vanilla RNN suffers from the problem of
vanishing gradients.
• An LSTM develops the context of a word by taking two inputs
at each point in time:
• one from the current input and the other from its previous output,
hence the name recurrent (the output goes back in as input).
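A minimal sketch of such an RNN-based encoder-decoder (hypothetical vocabulary sizes and dimensions; PyTorch, with GRUs standing in for the LSTM/GRU units mentioned above):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=120, hidden=32):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encoder: compress the variable-length source into a fixed-size context vector.
        _, context = self.encoder(self.src_embed(src_tokens))
        # Decoder: start from that context and produce one output per target position
        # (during training the target tokens are fed in, i.e. teacher forcing).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), context)
        return self.out(dec_out)               # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq()
src = torch.randint(0, 100, (2, 7))            # source sequences of length 7
tgt = torch.randint(0, 120, (2, 9))            # target sequences of a different length
logits = model(src, tgt)

Note that the source and target lengths differ, which is exactly the flexibility the seq2seq design is meant to provide.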
Deep Recurrent networks
What is a Recurrent Neural Network in Deep Learning?

• A Recurrent Neural Network (RNN) is a type of
neural network designed for sequential data
processing, featuring connections that loop back
within the network.
• It can retain memory of previous inputs, making it
adept at tasks like natural language processing, time
series analysis, and speech recognition.
• RNNs excel in capturing temporal dependencies and
contextual information, crucial for understanding
sequential data.
Concepts of Deep RNN

• Through the use of several layers of recurrent units,
deep RNNs improve upon the capabilities of
conventional RNNs.
• A deep RNN hierarchy extracts progressively more
abstract representations of sequential input data at each
layer.
• Deep RNNs are better able to understand intricate
patterns and connections within sequential data
because of this hierarchical feature extraction, which
also improves the network's generative and predictive
powers.
Multiple Layers of Deep RNNs

• A deep RNN architecture is created by stacking several
layers of recurrent units on top of one another.
• Deep RNNs may learn hierarchical representations of
sequential data because of the hierarchical structure of
these layers, where each layer captures a distinct degree
of abstraction.
• Deep RNNs may successfully describe long-term
relationships and complex patterns inside sequential
data streams by utilizing the expressive potential of
deep learning architectures.
• Building deep RNNs requires determining the training
objectives, the recurrent unit types, and the architecture.
• Convolutional, attention-based, and fully connected
components are common in deep RNN designs.
• The selection of recurrent units, such as Gated Recurrent
Units (GRU) or Long Short-Term Memory (LSTM), is
contingent upon the particular needs of the work at
hand and the intended equilibrium between computing
efficiency and memory capacity.
• Typically, training Deep RNNs entails utilizing gradient-
based optimization algorithms like Adam or Stochastic
Gradient Descent (SGD) to optimize an appropriate loss
function.
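A brief sketch of what stacking several recurrent layers looks like in code (assumed toy dimensions; PyTorch's num_layers argument performs the stacking), trained with Adam as mentioned above:

import torch
import torch.nn as nn

# Three LSTM layers stacked on top of one another: each layer's output sequence
# becomes the next layer's input, giving progressively more abstract features.
deep_rnn = nn.LSTM(input_size=8, hidden_size=32, num_layers=3,
                   batch_first=True, dropout=0.1)
head = nn.Linear(32, 5)
optimizer = torch.optim.Adam(list(deep_rnn.parameters()) + list(head.parameters()),
                             lr=1e-3)

x = torch.randn(4, 20, 8)                 # (batch, time, features)
y = torch.randint(0, 5, (4,))             # one label per sequence

outputs, (h_n, c_n) = deep_rnn(x)         # h_n: (num_layers, batch, hidden)
loss = nn.CrossEntropyLoss()(head(h_n[-1]), y)   # use the top layer's final state
loss.backward()
optimizer.step()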
Architecture
Recursive neural networks
• Recursive Neural Networks are a type of neural
network architecture that is specially designed
to process hierarchical structures and capture
dependencies within recursively structured
data.
• Unlike traditional feedforward networks and recurrent
neural networks (RNNs), Recursive Neural Networks
(RvNNs) can efficiently handle tree-structured inputs,
which makes them suitable for tasks involving
nested and hierarchical relationships.
What is RvNN?
• Within the family of neural networks, the RvNN is a special
deep learning model which has the ability to operate on structured
input data, such as parse trees in natural language processing or
molecular structures in chemistry.
• The network processes the input recursively, combining
information from child nodes to form representations for
parent nodes.
• RvNNs are mainly used in NLP tasks such as sentiment analysis,
processing data that is in the format of a parse tree.
• RvNN processes parse trees by assigning vectors to each word
or subphrase based on the information from its children
which allows the network to capture hierarchical
relationships and dependencies within the sentence.
Working Principles of RvNN
• Recursive Structure Handling: RvNN is designed to handle
recursive structures which means it can naturally process
hierarchical relationships in data by combining information from
child nodes to form representations for parent nodes.
• Parameter Sharing: RvNN often uses shared parameters across
different levels of the hierarchy which enables the model to
generalize well and learn from various parts of the input structure.
• Tree Traversal: RvNN traverses the tree structure in a bottom-up
or top-down manner by simultaneously updating node
representations based on the information gathered from their
children.
• Composition Function: The composition function in RvNN
combines information from child nodes to create a representation
for the parent node. This function is crucial in capturing the
hierarchical relationships within the data.
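A hedged sketch of such a composition function (hypothetical dimensions; PyTorch), applied bottom-up over a tiny binary parse tree represented as nested tuples:

import torch
import torch.nn as nn

dim = 16
embed = nn.Embedding(1000, dim)            # word vectors for the leaf nodes
compose = nn.Linear(2 * dim, dim)          # composition function, shared at every node

def encode(node):
    # Leaves are word ids; internal nodes are (left_subtree, right_subtree) pairs.
    if isinstance(node, int):
        return embed(torch.tensor([node]))[0]
    left, right = node
    children = torch.cat([encode(left), encode(right)], dim=-1)
    return torch.tanh(compose(children))   # parent = f([left; right]), shared parameters

# A made-up parse tree with made-up word ids for "two people reading a book":
tree = (0, (1, (2, (3, 4))))
sentence_vector = encode(tree)             # one vector representing the whole tree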
Difference between Recursive Neural Network and CNN
