
Module 5(Chapter 10)

Uploaded by

prajwaloconner
Copyright © All Rights Reserved

Module 5

Recurrent and Recursive Neural Networks


Recurrent neural networks, or RNNs, are a family of neural networks for processing sequential
data. A recurrent neural network is specialized for processing a sequence of values
x(1), ..., x(τ).
To arrive at RNNs, parameter sharing is important. Sharing parameters across different parts of
the model makes it possible to extend and apply the model to examples of different forms (here,
different lengths) and to generalize across them. Such sharing is particularly important when a
specific piece of information can occur at multiple positions within the sequence.
For example, consider the two sentences “I went to Nepal in 2009” and “In 2009, I went to
Nepal.” If we ask a machine learning model to read each sentence and extract the year in which
the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of
information, whether it appears in the sixth word or the second word of the sentence.
Recurrent networks share parameters through recurrence:
1. Each member of the output is a function of the previous members of the output.
2. Each member of the output is produced using the same update rule applied to the
previous outputs.
In practice, recurrent networks usually operate on minibatches of sequences, with a different
sequence length τ for each member of the minibatch.
The time step index need not correspond to the passage of time in the real world; it may refer
only to a position in the sequence.

10.1 Unfolding Computational Graphs


A computational graph is a way to represent the steps of a calculation, like turning inputs into
outputs or measuring error. In recursive or recurrent models, these calculations repeat in a
chain-like structure over time. When we "unfold" this process, we spread out the repeated steps
into a larger graph while keeping the same rules or parameters for each step, which creates a
deep, connected network.
For example, consider the classical form of a dynamical system:

    s(t) = f(s(t−1); θ)

where s(t) is called the state of the system. The above equation is recurrent because the definition
of s at time t refers back to the same definition at time t − 1. For a finite number of time steps
τ, the graph can be unfolded by applying the definition τ − 1 times. For example, if we unfold
the above equation for τ = 3 time steps, we obtain:

    s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

The dynamical system described can be illustrated as an unfolded computational graph. Each
node represents the state at some time t and the function f maps the state at t to the state at t +
1.
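The unfolding above can be sketched in a few lines of Python. The transition function f used here (a scalar linear map) and the values of θ and s(1) are hypothetical choices made only for illustration; the point is that applying f in a loop τ − 1 times gives the same result as writing out the nested expression.

```python
def f(s, theta):
    """Hypothetical transition function: s -> theta * s + 1."""
    return theta * s + 1.0


def unfold(s1, theta, tau):
    """Reach s(tau) from s(1) by applying f exactly tau - 1 times."""
    s = s1
    for _ in range(tau - 1):
        s = f(s, theta)          # the same f and the same theta at every step
    return s


theta, s1 = 0.5, 4.0
s3_unfolded = unfold(s1, theta, tau=3)
s3_nested = f(f(s1, theta), theta)   # s(3) = f(f(s(1); theta); theta)
print(s3_unfolded, s3_nested)
```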

As another example, let us consider a dynamical system driven by an external signal x(t):

    s(t) = f(s(t−1), x(t); θ)

Here, the state s(t) incorporates information about the entire past sequence. This idea underpins
Recurrent Neural Networks (RNNs), which process sequences by applying recurrence. In
RNNs, the state is typically represented by hidden units, denoted as h(t), and the equation is
rewritten as:

    h(t) = f(h(t−1), x(t); θ)

This means the RNN uses both the previous hidden state and the current input to update itself
over time. The parameters θ are shared across all steps.
RNNs are great for tasks involving sequences, like predicting the next word in a sentence,
because they store important information from the past in h(t). This hidden state acts as a
summary of everything the RNN has seen so far. However, since h(t) has a fixed size, it can’t
remember every detail from the input sequence—it only keeps what’s necessary for the task.
For example, in predicting the next word, the RNN focuses on the parts of the past that help
guess the next word.
For harder tasks, like reconstructing an entire input sequence (as in autoencoders), the RNN
needs to store more detailed information in h(t). This makes RNNs flexible tools for
understanding sequences.

A recurrent network with no outputs. This recurrent network just processes information from
the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit
diagram. The black square indicates a delay of a single time step. (Right) The same network
seen as an unfolded computational graph, where each node is now associated with one
particular time instance.
The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same input size, because
it is specified in terms of transition from one state to another state, rather than specified in terms
of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every time
step.
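Both advantages can be seen in a small NumPy sketch. The specific transition function (an affine map followed by tanh), the weight names W, U, b and the sizes are assumptions made for illustration; any f(h(t−1), x(t); θ) would do. The same function and parameters fold a sequence of length 2 and a sequence of length 10 into a state of the same size.

```python
import numpy as np


def rnn_state(xs, h0, W, U, b):
    """Fold a sequence xs of any length tau into a fixed-size state h(tau)."""
    h = h0
    for x in xs:                          # same W, U, b reused at every step
        h = np.tanh(b + W @ h + U @ x)    # h(t) = f(h(t-1), x(t); theta)
    return h


rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
b = np.zeros(n_hidden)
h0 = np.zeros(n_hidden)

short_seq = rng.normal(size=(2, n_in))     # tau = 2
long_seq = rng.normal(size=(10, n_in))     # tau = 10
h_short = rnn_state(short_seq, h0, W, U, b)
h_long = rnn_state(long_seq, h0, W, U, b)
print(h_short.shape, h_long.shape)
```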

10.2 Recurrent Neural Networks


With the concepts of graph unrolling and parameter sharing, we can create many different
types of recurrent neural networks (RNNs). These ideas allow us to represent sequences
effectively and reuse the same parameters across time steps, making RNNs flexible for various
tasks involving sequential data.

The RNN uses three sets of parameters:


• U: weights for input-to-hidden connections.
• W: weights for hidden-to-hidden (recurrent) connections.
• V: weights for hidden-to-output connections.
These connections are applied repeatedly across time. The forward propagation equation
defines how the input, hidden states, and outputs interact at each step.
• Left Diagram: The RNN is drawn with loops to indicate recurrent connections.
• Right Diagram: The same RNN is shown as an unfolded computational graph,
where each node corresponds to computations at a specific time step.
Some examples of important design patterns for recurrent neural networks include the
following:
1. Recurrent networks that produce an output at each time step and have recurrent
connections between hidden units.
2. Recurrent networks that produce an output at each time step and have recurrent
connections only from the output at one time step to the hidden units at the next time
step.
3. Recurrent networks with recurrent connections between hidden units, that read an entire
sequence and then produce a single output.

A recurrent neural network (RNN) with hidden-to-hidden recurrence is universal in the sense that
any function computable by a Turing machine can be computed by such a network of finite size.
The output can be read from the RNN after a number of time steps that is asymptotically linear
in the number of steps used by the Turing machine and in the length of the input. Because the
functions computable by a Turing machine are discrete, this result concerns exact implementation
rather than approximation: the RNN takes a binary sequence as input, and its outputs must be
discretized to provide a binary output.

A single RNN of finite size can compute all such functions, because it can simulate a universal
Turing machine; the machine's unbounded stack is represented by activations and weights that
are rational numbers of unbounded precision. The RNN operates by updating its hidden state
over time, starting from an initial state. For the forward propagation equations we assume
hyperbolic tangent hidden units and a discrete output, as if the RNN were predicting words or
characters. A natural way to represent a discrete variable is to treat the output o as the
unnormalized log probabilities of each possible value.

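We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of
normalized probabilities over the output. Forward propagation begins with a specification of
the initial state h(0). Then, for each time step from t = 1 to t = τ, we apply the following update
equations:

    a(t) = b + W h(t−1) + U x(t)
    h(t) = tanh(a(t))
    o(t) = c + V h(t)
    ŷ(t) = softmax(o(t))

where b and c are bias vectors. The total loss for a sequence of x values paired with a sequence
of y values is the sum of the losses over all time steps. If L(t) is the negative log-likelihood of
y(t) given x(1), ..., x(t), then:

    L({x(1), ..., x(τ)}, {y(1), ..., y(τ)}) = Σ_t L(t) = − Σ_t log p_model(y(t) | {x(1), ..., x(t)})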
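The forward pass and the summed negative log-likelihood can be sketched in NumPy as follows. This is a minimal illustration, assuming integer class labels as targets, randomly initialized weights and arbitrary layer sizes; none of these choices come from the notes.

```python
import numpy as np


def softmax(o):
    e = np.exp(o - np.max(o))        # subtract the max for numerical stability
    return e / e.sum()


def rnn_forward(xs, ys, h0, U, W, V, b, c):
    """Apply the update equations for t = 1..tau and sum the per-step losses.

    xs: (tau, n_in) inputs; ys: (tau,) integer class targets.
    Returns the hidden states, the predicted distributions and the total loss.
    """
    h, hs, y_hats, loss = h0, [], [], 0.0
    for x, y in zip(xs, ys):
        a = b + W @ h + U @ x        # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)               # h(t) = tanh(a(t))
        o = c + V @ h                # o(t) = c + V h(t)
        y_hat = softmax(o)           # y_hat(t) = softmax(o(t))
        loss += -np.log(y_hat[y])    # L(t) = -log p_model(y(t) | x(1..t))
        hs.append(h)
        y_hats.append(y_hat)
    return np.array(hs), np.array(y_hats), loss


rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 5, 4, 6
U = rng.normal(scale=0.3, size=(n_hidden, n_in))
W = rng.normal(scale=0.3, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.3, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
hs, y_hats, loss = rnn_forward(xs, ys, np.zeros(n_hidden), U, W, V, b, c)
print(hs.shape, y_hats.shape, loss)
```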

An RNN whose only recurrence runs from the output to the hidden layer has a simpler structure.
At each time step, the input x(t) and the previous output o(t−1) determine the hidden activations
h(t), which produce the output o(t). These outputs are the only information passed to future
time steps, as there are no direct connections from the hidden state to the future.
While this design makes the RNN less powerful (it cannot retain as much detailed information
about the past as more complex RNNs), it has an advantage in training. Because the training
targets can stand in for the previous outputs, each time step can be trained independently, which
allows for better parallelization and makes training faster and more efficient. However, the
limited ability to carry rich past information may make it less suitable for tasks requiring
detailed sequence memory.

10.2.1 Teacher Forcing and Networks with Output Recurrence


A network with only output-to-hidden recurrent connections is strictly less powerful because it
lacks hidden-to-hidden connections. For example, it cannot simulate a universal Turing machine.
The output units must capture all the information about the past that the network will use to
predict the future, yet they are explicitly trained to match the training targets, so they are
unlikely to preserve the necessary historical context for predictions.

A time-unfolded RNN with a single output at the end summarizes the entire sequence into a
fixed-size representation, which can be used for further processing. The output's gradient can
either be back-propagated from a target at the end or from downstream modules.
The advantage of this simpler design is that it allows for parallelized training, as each time step
can be trained independently, without needing information from previous steps. During
training, a technique called teacher forcing can be used, where the model receives the actual
output from the training set as input for the next time step, improving the training process by
directly providing the correct context for each prediction.
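For a sequence with two time steps, the conditional maximum likelihood criterion is:

    log p(y(1), y(2) | x(1), x(2)) = log p(y(2) | y(1), x(1), x(2)) + log p(y(1) | x(1), x(2))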

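Teacher forcing was originally motivated as a way to avoid back-propagation through time (BPTT)
in models that lack hidden-to-hidden connections. It can still be applied to models that have
hidden-to-hidden connections, as long as they also have connections from the output at one time
step to values computed at the next time step. However, as soon as the hidden units become a
function of earlier time steps, BPTT is necessary, so such models may be trained with both
teacher forcing and BPTT. A disadvantage of strict teacher forcing arises when the model is
later used in an open-loop mode, where the network outputs (or samples from the output
distribution) are fed back as inputs. In this case, the inputs seen during training might differ
significantly from those encountered at test time, which could hurt the model's performance.
The open-loop mode refers to this situation where feedback is generated by the model itself
rather than being provided by the training data.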
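The difference between the two regimes can be sketched for a network with only output-to-hidden recurrence. This is a simplified illustration under several assumptions: the value fed back is a one-hot encoding of a discrete symbol, nothing is fed back at the first step, and the weights are random rather than trained. The function name `run_output_recurrent` is hypothetical.

```python
import numpy as np


def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()


def one_hot(k, n):
    v = np.zeros(n)
    v[k] = 1.0
    return v


def run_output_recurrent(xs, U, W, V, b, c, targets=None):
    """Output-to-hidden RNN. With targets: teacher forcing. Without: the model's
    own predictions are fed back (the open-loop mode described above)."""
    n_out = V.shape[0]
    prev = np.zeros(n_out)                     # nothing is fed back at t = 1
    preds = []
    for t, x in enumerate(xs):
        h = np.tanh(b + W @ prev + U @ x)      # hidden sees x(t) and the fed-back output only
        y_hat = softmax(c + V @ h)
        preds.append(int(np.argmax(y_hat)))
        if targets is not None:
            prev = one_hot(targets[t], n_out)  # teacher forcing: ground-truth y(t)
        else:
            prev = one_hot(preds[-1], n_out)   # generated output fed back
    return preds


rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 5, 4, 6
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
W = rng.normal(scale=0.5, size=(n_hidden, n_out))   # output-to-hidden weights
V = rng.normal(scale=0.5, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)

forced = run_output_recurrent(xs, U, W, V, b, c, targets=ys)   # training-time regime
free = run_output_recurrent(xs, U, W, V, b, c)                 # test-time regime
print(forced, free)
```

Under teacher forcing, the prediction at step t depends only on x(t) and the known target y(t−1), which is why the time steps can be processed independently.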

10.2.2 Computing the Gradient in a Recurrent Neural Network


Computing the gradient through a recurrent neural network (RNN) is straightforward: apply the
generalized back-propagation algorithm to the unrolled computational graph. The gradients
obtained by back-propagation can then be used with any general-purpose gradient-based
technique to train the RNN. The process computes the gradient for each node recursively.
We start the recursion with the nodes immediately preceding the final loss:

    ∂L/∂L(t) = 1

Assuming the outputs o(t) are used as the argument to the softmax function to obtain the vector
ŷ(t) of probabilities, and that the loss is the negative log-likelihood of the true target y(t), the
gradient ∇_{o(t)} L on the outputs at time step t, for all i, t, is as follows:

    (∇_{o(t)} L)_i = ∂L/∂o_i(t) = ∂L/∂L(t) · ∂L(t)/∂o_i(t) = ŷ_i(t) − 1_{i = y(t)}

We work our way backwards, starting from the end of the sequence. At the final time step τ,
h(τ) only has o(τ) as a descendant, so its gradient is simple:

    ∇_{h(τ)} L = Vᵀ ∇_{o(τ)} L

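We can then iterate backwards in time to back-propagate gradients through time, from t = τ − 1
down to t = 1, noting that h(t) (for t < τ) has as descendants both o(t) and h(t+1). Its gradient
is thus given by:

    ∇_{h(t)} L = (∂h(t+1)/∂h(t))ᵀ ∇_{h(t+1)} L + (∂o(t)/∂h(t))ᵀ ∇_{o(t)} L
               = Wᵀ diag(1 − (h(t+1))²) ∇_{h(t+1)} L + Vᵀ ∇_{o(t)} L

where diag(1 − (h(t+1))²) is the diagonal matrix containing the elements 1 − (h_i(t+1))², the
Jacobian of the hyperbolic tangent for hidden unit i at time t + 1.
The gradients on the parameters, which are shared across time steps, are then given by:

    ∇_c L = Σ_t ∇_{o(t)} L
    ∇_b L = Σ_t diag(1 − (h(t))²) ∇_{h(t)} L
    ∇_V L = Σ_t (∇_{o(t)} L) h(t)ᵀ
    ∇_W L = Σ_t diag(1 − (h(t))²) (∇_{h(t)} L) h(t−1)ᵀ
    ∇_U L = Σ_t diag(1 − (h(t))²) (∇_{h(t)} L) x(t)ᵀ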

We do not need to compute the gradient with respect to x(t) for training because it does not
have any parameters as ancestors in the computational graph defining the loss.
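The backward recursion can be sketched in NumPy for the tanh/softmax RNN of this section. This is an illustrative implementation under stated assumptions (integer class targets, random weights, small sizes), not a reference one; the function names `forward` and `bptt` are hypothetical. Comparing one entry of the result against a finite-difference estimate is a common way to check such code.

```python
import numpy as np


def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()


def forward(params, xs, ys, h0):
    U, W, V, b, c = params
    hs, y_hats, loss, h = [h0], [], 0.0, h0
    for x, y in zip(xs, ys):
        h = np.tanh(b + W @ h + U @ x)
        y_hat = softmax(c + V @ h)
        loss -= np.log(y_hat[y])
        hs.append(h)
        y_hats.append(y_hat)
    return hs, y_hats, loss               # hs[t] holds h(t); hs[0] is h(0)


def bptt(params, xs, ys, h0):
    U, W, V, b, c = params
    hs, y_hats, loss = forward(params, xs, ys, h0)
    gU, gW, gV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    gb, gc = np.zeros_like(b), np.zeros_like(c)
    grad_a_next = np.zeros_like(h0)       # diag(1 - h(t+1)^2) grad_h(t+1); zero beyond tau
    for t in range(len(xs), 0, -1):       # t = tau, ..., 1
        grad_o = y_hats[t - 1].copy()
        grad_o[ys[t - 1]] -= 1.0          # y_hat(t) - one_hot(y(t))
        grad_h = V.T @ grad_o + W.T @ grad_a_next
        grad_a = (1.0 - hs[t] ** 2) * grad_h   # back through tanh
        gc += grad_o
        gV += np.outer(grad_o, hs[t])
        gb += grad_a
        gW += np.outer(grad_a, hs[t - 1])
        gU += np.outer(grad_a, xs[t - 1])
        grad_a_next = grad_a
    return (gU, gW, gV, gb, gc), loss


rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 4, 5, 6
U = rng.normal(scale=0.4, size=(n_hidden, n_in))
W = rng.normal(scale=0.4, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.4, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
h0 = np.zeros(n_hidden)
grads, loss = bptt((U, W, V, b, c), xs, ys, h0)
```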

10.2.3 Recurrent Networks as Directed Graphical Models


The losses L(t) were cross-entropies between the training targets y(t) and outputs o(t). Similar
to feedforward networks, nearly any loss function can be used with a recurrent network,
depending on the task. Typically, the output of the RNN is interpreted as a probability
distribution, and the cross-entropy loss associated with that distribution is used. For instance,
mean squared error can be seen as the cross-entropy loss for a unit Gaussian output distribution,
similar to a feedforward network.
Log-Likelihood Maximization:
RNNs are trained to maximize the conditional log-likelihood of each output given the past
inputs:

    log p(y(t) | x(1), ..., x(t))

If the model incorporates connections from past outputs, the formula becomes:

    log p(y(t) | x(1), ..., x(t), y(1), ..., y(t−1))

Let us consider the case where the RNN models only a sequence of scalar random variables
Y = {y(1), ..., y(τ)}, with no additional inputs x. The input at time step t is simply the output at
time step t − 1. The RNN then defines a directed graphical model over the y variables. We
parametrize the joint distribution of these observations using the chain rule for conditional
probabilities:

    P(Y) = P(y(1), ..., y(τ)) = Π_{t=1..τ} P(y(t) | y(t−1), y(t−2), ..., y(1))

where the right-hand side of the bar is empty for t = 1. Hence the negative log-likelihood of a
set of values {y(1), ..., y(τ)} according to such a model is

    L = Σ_t L(t),  where  L(t) = −log P(y(t) | y(t−1), y(t−2), ..., y(1))

This is efficient with RNNs, which use parameter sharing across all time steps.
Parameter Sharing:
RNNs don’t need a separate set of rules or weights for each step in the sequence. They reuse
the same set of parameters at every step. Example: Whether you predict a 3-word sentence or
a 10-word sentence, the same RNN structure is applied at every step. This saves memory and
makes RNNs powerful for long sequences.
A fully connected graphical model for a sequence allows every past observation to influence
future ones. However, directly parameterizing the model this way is inefficient due to the
increasing number of inputs and parameters as the sequence length grows.
RNNs obtain the same full connectivity but efficient parametrization, as illustrated in figure
below:

By using a state variable in the RNN model, we can make the network more efficient. Each
step in the sequence uses the same structure and shares the same parameters, reducing
complexity.
RNNs: Efficient But Challenging
1 Fewer Parameters, But Harder to Optimize:
RNNs share the same parameters across all time steps, which reduces the number of
parameters. However, this makes training harder because the model needs to learn how these
shared parameters work over time.
2 Stationarity Assumption:
RNNs assume that the relationship between time steps stays the same. This helps keep the
model simple, but might make it hard to capture complex patterns that change over time.
Generating Sequences in RNNs:
1. End-of-Sequence Symbol:
A special symbol (like </end>) is added to mark the end of the sequence. The model stops when
it generates this symbol.
2. Bernoulli Output for Continuation:
A sigmoid unit decides whether to continue or stop generating the sequence at each time step.
It uses a probability between 0 (stop) and 1 (continue).
3. Predicting Sequence Length (τ):
The model can predict the sequence length (τ), and then generate exactly that many steps. It
uses extra inputs to know how many steps are left.
4. Sequence Generation Formula:
The probability of generating a sequence is:

    P(x(1), ..., x(τ)) = P(τ) P(x(1), ..., x(τ) | τ)

This breaks the sequence generation into:
P(τ): Probability of the sequence length.
P(x(1), ..., x(τ) | τ): Probability of generating the sequence given the length.
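The first option, stopping on an end-of-sequence symbol, can be sketched as a sampling loop. This is a minimal illustration with untrained random weights; the symbol id used for the end marker, the `max_len` safety cap and the function name `sample_sequence` are assumptions made for the example.

```python
import numpy as np


def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()


def sample_sequence(U, W, V, b, c, eos, max_len, rng):
    """Sample symbols until the end-of-sequence id `eos` is drawn or max_len is reached."""
    n_hidden, n_vocab = W.shape[0], V.shape[0]
    h = np.zeros(n_hidden)
    prev = np.zeros(n_vocab)              # no previous symbol at t = 1
    seq = []
    for _ in range(max_len):
        h = np.tanh(b + W @ h + U @ prev)
        p = softmax(c + V @ h)            # distribution over the next symbol
        y = int(rng.choice(n_vocab, p=p))
        if y == eos:
            break                         # the model chose to stop
        seq.append(y)
        prev = np.zeros(n_vocab)
        prev[y] = 1.0                     # feed the sampled symbol back in
    return seq


rng = np.random.default_rng(0)
n_vocab, n_hidden, eos = 5, 8, 0
U = rng.normal(scale=0.5, size=(n_hidden, n_vocab))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.5, size=(n_vocab, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_vocab)
seq = sample_sequence(U, W, V, b, c, eos=eos, max_len=20, rng=rng)
print(seq)
```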

10.2.4 Modeling Sequences Conditioned on Context with RNNs


In a basic RNN setup, the sequence of outputs y(t) is modeled without explicit inputs,
representing a joint distribution over the sequence. When inputs x(t) are introduced, the RNN
not only models the outputs y but also a conditional distribution P(y∣x), where x is the input
sequence.
RNNs typically take a sequence of input vectors x(1),x(2),…,x(τ), where each x(t) represents
the input at time step t.
An alternative approach is to use a single fixed-size input vector x, which can be incorporated
into the RNN as an additional input. This way, the RNN can still generate a sequence of outputs
y(t), but the model only needs one input vector rather than a sequence.
Some common ways of providing an extra input to an RNN are:
• as an extra input at each time step, or
• as the initial state h(0), or
• both

This diagram shows how a Recurrent Neural Network (RNN) maps a fixed-length input vector
x to a distribution over sequences Y. The product xᵀR is added as an additional input to the
hidden units at every time step, so it acts as a bias that carries the global context. At each time
step t, the RNN updates h(t) using h(t−1) and xᵀR, and y(t) is generated based on h(t).
• x -> input vector
• R -> weight matrix
• xᵀR -> input-dependent bias term
• h(t) -> hidden states
• L(t) -> loss function
The RNN takes x as input and produces a sequence of outputs y(t), where each output word
y(t) depends on the previous word y(t−1) during training, utilizing the technique of teacher
forcing. In this setup, each element in the output sequence serves both as the input for the
current time step and the target for the previous time step, allowing the model to learn to
generate coherent sequences, such as captions, based on the initial fixed-length input vector.
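A rough NumPy sketch of this setup is given below. It assumes that the context enters as x @ R added to the hidden pre-activation at every step, that the previous target is fed back as a one-hot vector through a matrix called U here, and that targets are integer class ids; these names and shapes are illustrative and not taken from the notes.

```python
import numpy as np


def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()


def conditioned_rnn_loss(x, ys, W, U, V, R, b, c):
    """Teacher-forced loss of an RNN conditioned on a single vector x."""
    n_hidden, n_out = W.shape[0], V.shape[0]
    context_bias = x @ R                   # x^T R, computed once and reused at every step
    h, prev, loss = np.zeros(n_hidden), np.zeros(n_out), 0.0
    for y in ys:
        h = np.tanh(b + context_bias + W @ h + U @ prev)
        y_hat = softmax(c + V @ h)
        loss -= np.log(y_hat[y])           # y(t) is the target at this step ...
        prev = np.zeros(n_out)
        prev[y] = 1.0                      # ... and the input at the next step
    return loss


rng = np.random.default_rng(0)
n_x, n_hidden, n_out, tau = 6, 5, 4, 7
W = rng.normal(scale=0.4, size=(n_hidden, n_hidden))
U = rng.normal(scale=0.4, size=(n_hidden, n_out))
V = rng.normal(scale=0.4, size=(n_out, n_hidden))
R = rng.normal(scale=0.4, size=(n_x, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
x = rng.normal(size=n_x)                   # e.g. a feature vector describing an image
ys = rng.integers(0, n_out, size=tau)

loss_with_context = conditioned_rnn_loss(x, ys, W, U, V, R, b, c)
loss_zero_context = conditioned_rnn_loss(np.zeros(n_x), ys, W, U, V, R, b, c)
print(loss_with_context, loss_zero_context)
```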
Conditional Independence Assumption
Rather than receiving only a single vector x as input, the RNN may receive a sequence of
vectors x(t) as input. The RNN described by the forward propagation equations corresponds to
a conditional distribution

    P(y(1), ..., y(τ) | x(1), ..., x(τ))

that makes a conditional independence assumption that this distribution factorizes as

    Π_t P(y(t) | x(1), ..., x(t))

To remove the conditional independence assumption, we can add connections from the output
at time t to the hidden unit at time t+ 1, as shown in figure.

10.3 Bidirectional RNNs


A bidirectional RNN combines two RNNs: one that processes the sequence from the beginning to
the end (forward) and another that processes it from the end to the beginning (backward). This
allows the model to have access to both past and future context when generating outputs. For
example, if the model is processing a sentence, it can consider both the words before and after
the current word, improving its understanding. The output of the model at each time step is
based on information from both directions, making it more sensitive to the input around that
specific time.
This concept can also be extended to 2D inputs, like images. In this case, four RNNs can be
used: one for each direction—up, down, left, and right. By processing the image in these four
directions, the RNN can capture both local details and long-range dependencies, allowing it to
understand the image more effectively. Although using RNNs for images can be more
computationally expensive than convolutional networks, they offer the advantage of capturing
long-range interactions between features within the same image, something traditional
convolutional networks might not do as easily.

Computation of a typical bidirectional recurrent neural network, meant to learn to map input
sequences x to target sequences y, with loss L(t) at each step t. The h recurrence propagates
information forward in time (towards the right) while the g recurrence propagates information
backward in time (towards the left). Thus at each point t, the output units o(t) can benefit from
a relevant summary of the past in its h(t) input and from a relevant summary of the future in its
g(t) input.
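The computation in the caption can be sketched in NumPy as follows. This is a simplified illustration: hidden biases are omitted, the weights are random, and the names Wh, Uh, Wg, Ug, Vh, Vg are hypothetical. The h recurrence runs left to right, the g recurrence runs right to left, and each o(t) combines both.

```python
import numpy as np


def bidirectional_rnn(xs, Wh, Uh, Wg, Ug, Vh, Vg, c):
    tau, n_hidden = len(xs), Wh.shape[0]
    hs = np.zeros((tau, n_hidden))
    gs = np.zeros((tau, n_hidden))
    h = np.zeros(n_hidden)
    for t in range(tau):                   # forward in time: summary of the past
        h = np.tanh(Wh @ h + Uh @ xs[t])
        hs[t] = h
    g = np.zeros(n_hidden)
    for t in reversed(range(tau)):         # backward in time: summary of the future
        g = np.tanh(Wg @ g + Ug @ xs[t])
        gs[t] = g
    return hs @ Vh.T + gs @ Vg.T + c       # o(t) depends on both h(t) and g(t)


rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 4, 2, 5
Wh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
Uh = rng.normal(scale=0.5, size=(n_hidden, n_in))
Wg = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
Ug = rng.normal(scale=0.5, size=(n_hidden, n_in))
Vh = rng.normal(scale=0.5, size=(n_out, n_hidden))
Vg = rng.normal(scale=0.5, size=(n_out, n_hidden))
c = np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))
outputs = bidirectional_rnn(xs, Wh, Uh, Wg, Ug, Vh, Vg, c)
print(outputs.shape)
```

Changing only the last input changes the first output, which a forward-only RNN could not do.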

10.4 Deep Recurrent Networks


The computation in most RNNs can be decomposed into three blocks of parameters and
associated transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.
In the basic RNN architecture, each of these three blocks is associated with a single weight matrix.
In other words, when the network is unfolded, each of these corresponds to a shallow
transformation.
By a shallow transformation, we mean a transformation that would be represented by a single
layer within a deep MLP. Typically this is a transformation represented by a learned affine
transformation followed by a fixed nonlinearity.
Adding depth in these transformations increases representational capacity. It supports
hierarchical learning from raw input to higher-level representations.
Introducing depth in RNN operations has been shown to be advantageous based on
experimental evidence (Graves et al., 2013; Pascanu et al., 2014a). The addition of depth helps
in performing more complex mappings, as demonstrated by decomposing the state of an RNN
into multiple layers, which improves the representation of raw input at higher levels of the
hidden state. Pascanu et al. further propose using separate deep MLPs for each block (input-
to-hidden, hidden-to-hidden, and hidden-to-output), which increases the network's
representational capacity.
However, adding depth can make optimization more challenging, as it increases the length of
the shortest path between variables across time steps, making it harder to train. This issue can
be mitigated by introducing skip connections in the hidden-to-hidden path, which helps
maintain efficient optimization by allowing more direct gradient flow.
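One simple way to add depth, decomposing the state into several stacked layers, can be sketched as below. This is an assumption-laden illustration: each layer is a plain tanh recurrence, layer k reads the state of layer k − 1 at the same time step, and the layer sizes and the function name `stacked_rnn` are arbitrary.

```python
import numpy as np


def stacked_rnn(xs, layers):
    """layers: list of (W, U, b) tuples, lowest layer first.

    Layer k receives the current hidden state of layer k - 1 as its input.
    Returns the top-layer hidden state at every time step.
    """
    hs = [np.zeros(W.shape[0]) for W, _, _ in layers]
    top_states = []
    for x in xs:
        inp = x
        for k, (W, U, b) in enumerate(layers):
            hs[k] = np.tanh(b + W @ hs[k] + U @ inp)   # recurrence within layer k
            inp = hs[k]                                 # feeds the layer above
        top_states.append(hs[-1])
    return np.array(top_states)


rng = np.random.default_rng(0)
n_in, n1, n2, tau = 3, 6, 4, 5
layers = [
    (rng.normal(scale=0.4, size=(n1, n1)), rng.normal(scale=0.4, size=(n1, n_in)), np.zeros(n1)),
    (rng.normal(scale=0.4, size=(n2, n2)), rng.normal(scale=0.4, size=(n2, n1)), np.zeros(n2)),
]
xs = rng.normal(size=(tau, n_in))
top = stacked_rnn(xs, layers)
print(top.shape)
```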

(a) The hidden recurrent state can be broken down into groups organized hierarchically.
(b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden
and hidden-to-output parts. This may lengthen the shortest path linking different time
steps. (c) The path-lengthening effect can be mitigated by introducing skip connections.

10.5 Recursive Neural Networks

Recursive neural networks (RecNNs) are a generalization of recurrent neural networks. They
process data in a tree-like structure instead of the chain-like structure of RNNs, which makes
them useful for representing hierarchical or structured data.
RecNNs are designed to handle structured data, such as trees or graphs, as input. They are
particularly useful in natural language processing (NLP), for example on parse trees, and in
computer vision, for example for recognizing relationships between parts of an image.
How RecNNs work
RecNNs are structured as computational trees, where each node performs computations by
taking inputs and producing outputs. The input to each node comes from its child nodes, with
data flowing from the leaf nodes (bottom of the tree) to the root node (top). As each node
processes its inputs, it generates an output, which is passed up to its parent node, building a
hierarchical representation. The input sequence (e.g., x(1), x(2), x(3), ..., x(t)) is processed
through the tree's layers, with shared weight matrices (U, V, W) used to combine and transform
inputs; the same matrices are reused at every node. Ultimately, the tree structure generates a
final output (o) corresponding to the sequence target (y).
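A small sketch of this bottom-up computation on a balanced binary tree is shown below. The composition function (tanh of a sum of two linear maps), the names W_left, W_right and b, and the rule of carrying an unpaired node up unchanged are assumptions made for illustration. With 8 leaves the tree has depth 3, against 8 steps for a chain.

```python
import numpy as np


def compose(left, right, W_left, W_right, b):
    """Combine two child vectors into a parent vector with shared weights."""
    return np.tanh(W_left @ left + W_right @ right + b)


def reduce_tree(leaves, W_left, W_right, b):
    """Combine leaves pairwise, level by level. Returns (root vector, depth)."""
    level, depth = list(leaves), 0
    while len(level) > 1:
        nxt = [compose(level[i], level[i + 1], W_left, W_right, b)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])          # an unpaired node is carried up unchanged
        level, depth = nxt, depth + 1
    return level[0], depth


rng = np.random.default_rng(0)
d = 4
W_left = rng.normal(scale=0.5, size=(d, d))
W_right = rng.normal(scale=0.5, size=(d, d))
b = np.zeros(d)
leaves = rng.normal(size=(8, d))           # 8 input vectors x(1), ..., x(8)
root, depth = reduce_tree(leaves, W_left, W_right, b)
print(root.shape, depth)
```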

Advantages
Depth Reduction: For a sequence of length τ, a recursive network can reduce the depth (the
number of composed nonlinear operations) from τ to O(log τ), which may help with long-term
dependencies.
Tree Structure: The tree can be fixed independently of the data, such as a balanced binary tree,
or it can be data-dependent, such as the parse tree of a sentence.
Learning Tree Structure: Ideally, the network itself would discover and infer the tree structure
that is appropriate for each input.

Applications
• RecNNs can be used to build language models, machine translation systems, and speech
recognition systems.
• They can also be used for image classification, object detection, and video analysis.

There are many variants of the recursive network concept. For instance, Frasconi et al. (1997,
1998) associate data with a tree structure, linking inputs and targets to individual nodes. The
computation at each node does not have to follow the traditional artificial neuron approach
(affine transformation followed by a monotone nonlinearity). Socher et al. (2013a) suggest
using tensor operations and bilinear forms, which have been effective in modeling relationships
between concepts, particularly when these concepts are represented as continuous vectors or
embeddings (Weston et al., 2010; Bordes et al., 2012).

10.6 The Long Short-Term Memory and Other Gated RNNs

The most effective sequence models used in practical applications are called gated RNNs.
These include the long short-term memory and networks based on the gated recurrent unit.

Gated RNNs, like LSTMs and GRUs, are designed to help neural networks remember
important information over time and forget irrelevant details. Unlike traditional RNNs, which
can struggle to maintain useful information in long sequences, gated RNNs have special gates
that control what information should be remembered and what should be forgotten.

These gates are learned during training, so the network can decide for itself when to reset or
update its memory. This allows the network to accumulate useful information over time, while
also being able to forget old, unnecessary data when it's no longer needed. Essentially, gated
RNNs make it easier for neural networks to manage long-term memory and perform better on
tasks involving sequences, such as language or time-series prediction.

10.6.1 LSTM

The clever idea of introducing self-loops to produce paths where the gradient can flow for long
durations is a core contribution of the initial long short-term memory (LSTM) model. A crucial
addition has been to make the weight on this self-loop conditioned on the context, rather than
fixed.

The LSTM cell consists of several components that control how information is processed and
stored. The input feature is first computed by a regular artificial neuron and can be added to
the cell's state if allowed by the input gate, which uses a sigmoid function. The state unit has a
self-loop that is influenced by the forget gate, determining which information should be
discarded. The output gate, also using a sigmoid function, controls whether the cell’s output is
active. All gating units have sigmoid nonlinearities, while the input unit can use any squashing
nonlinearity. The state unit can also feed into the gating units. A black square represents a delay
of one time step in the diagram.
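The most important component is the state unit s_i(t), which has a linear self-loop similar to the
leaky units described in the previous section. However, here, the self-loop weight (or the
associated time constant) is controlled by a forget gate unit f_i(t) (for time step t and cell i), which
sets this weight to a value between 0 and 1 via a sigmoid unit:

    f_i(t) = σ(b^f_i + Σ_j U^f_{i,j} x_j(t) + Σ_j W^f_{i,j} h_j(t−1))

where x(t) is the current input vector, h(t) is the current hidden layer vector containing the
outputs of all the LSTM cells, and b^f, U^f and W^f are respectively the biases, input weights and
recurrent weights for the forget gates.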

The LSTM cell internal state is thus updated as follows, but with a conditional self-loop weight
f_i(t):

    s_i(t) = f_i(t) s_i(t−1) + g_i(t) σ(b_i + Σ_j U_{i,j} x_j(t) + Σ_j W_{i,j} h_j(t−1))

where b, U and W respectively denote the biases, input weights and recurrent weights into the
LSTM cell. The external input gate unit g_i(t) is computed similarly to the forget gate (with a
sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

    g_i(t) = σ(b^g_i + Σ_j U^g_{i,j} x_j(t) + Σ_j W^g_{i,j} h_j(t−1))

The output h_i(t) of the LSTM cell can also be shut off, via the output gate q_i(t), which also
uses a sigmoid unit for gating:

    h_i(t) = tanh(s_i(t)) q_i(t)
    q_i(t) = σ(b^o_i + Σ_j U^o_{i,j} x_j(t) + Σ_j W^o_{i,j} h_j(t−1))
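A single LSTM update can be sketched in vectorized NumPy form. This is a minimal illustration: the parameter dictionary layout, the key names and the helper `init_params` are hypothetical, the weights are random, and the input unit uses a sigmoid although any squashing nonlinearity could be substituted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_in, n_hidden, rng, gates=("", "f", "g", "o")):
    """Hypothetical helper: one (b, U, W) triple for the cell input and for each gate."""
    p = {}
    for k in gates:
        p["b" + k] = np.zeros(n_hidden)
        p["U" + k] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p["W" + k] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    return p


def lstm_step(x, h_prev, s_prev, p):
    """One LSTM update: returns the new output h(t) and internal state s(t)."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)      # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)      # external input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)      # output gate
    cell_in = sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)   # input unit
    s = f * s_prev + g * cell_in       # self-loop weighted by the forget gate
    h = np.tanh(s) * q                 # output can be shut off by the output gate
    return h, s


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
p = init_params(n_in, n_hidden, rng)
x = rng.normal(size=n_in)
h_prev = rng.normal(size=n_hidden)
s_prev = rng.normal(size=n_hidden)
h, s = lstm_step(x, h_prev, s_prev, p)
print(h, s)
```

When the forget gate is close to 1 and the input gate close to 0, the state is copied forward almost unchanged, which is the long-duration path for information and gradients that the self-loop provides.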

10.6.2 Other Gated RNNs

The key components of the LSTM architecture—input, forget, and output gates—are essential
for controlling the flow of information and managing long-term dependencies. However,
simpler alternatives like Gated Recurrent Units (GRUs) combine the forget and input gates into
a single gating unit, making the architecture less complex while still allowing dynamic control
over forgetting and state updates. GRUs provide a more efficient solution for some tasks while
maintaining the ability to handle time scale and forgetting behaviors effectively.

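The update equations are the following:

    h_i(t) = u_i(t−1) h_i(t−1) + (1 − u_i(t−1)) σ(b_i + Σ_j U_{i,j} x_j(t−1) + Σ_j W_{i,j} r_j(t−1) h_j(t−1))

where u stands for “update” gate and r for “reset” gate. Their value is defined as usual:

    u_i(t) = σ(b^u_i + Σ_j U^u_{i,j} x_j(t) + Σ_j W^u_{i,j} h_j(t))
    r_i(t) = σ(b^r_i + Σ_j U^r_{i,j} x_j(t) + Σ_j W^r_{i,j} h_j(t))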

The reset and update gates in gated RNNs allow the model to control which parts of the state
vector are used and how they evolve. The update gate acts like a leaky integrator, deciding
whether to retain or ignore certain parts of the state, while the reset gate controls which parts
of the state contribute to the next target state. These gates introduce nonlinearity in the
relationship between past and future states, enabling more flexible learning.
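A single GRU update can be sketched in NumPy as follows. Note the assumptions: the gates are computed from the current input and the previous state, which is the common convention, the candidate state uses a sigmoid, and the parameter names and the helper `init_params` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_in, n_hidden, rng, gates=("", "u", "r")):
    """Hypothetical helper: one (b, U, W) triple for the candidate and for each gate."""
    p = {}
    for k in gates:
        p["b" + k] = np.zeros(n_hidden)
        p["U" + k] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p["W" + k] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    return p


def gru_step(x, h_prev, p):
    """One GRU update: returns the new state."""
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)   # update gate
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)   # reset gate
    # The reset gate selects which parts of the old state feed the candidate.
    candidate = sigmoid(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))
    # The update gate interpolates between keeping the old state and the candidate.
    return u * h_prev + (1.0 - u) * candidate


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
p = init_params(n_in, n_hidden, rng)
x = rng.normal(size=n_in)
h_prev = rng.normal(size=n_hidden)
h = gru_step(x, h_prev, p)
print(h)
```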

While various architectural variations, such as sharing reset gates across multiple units or
combining global and local gates, have been explored, no single variant has been found to
outperform the standard LSTM or GRU across all tasks. Research has shown that the forget
gate is crucial, and adding a bias of 1 to the LSTM's forget gate, as suggested by Gers et al.,
can improve its performance significantly.
