Module 5 (Chapter 10)
Consider the classical form of a dynamical system:

s^(t) = f(s^(t−1); θ)

where s^(t) is called the state of the system. The equation is recurrent because the definition of s at time t refers back to the same definition at time t − 1. For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times. For example, if we unfold the equation above for τ = 3 time steps, we obtain:

s^(3) = f(s^(2); θ) = f(f(s^(1); θ); θ)
The dynamical system described can be illustrated as an unfolded computational graph. Each
node represents the state at some time t and the function f maps the state at t to the state at t +
1.
As another example, let us consider a dynamical system driven by an external signal x^(t):

s^(t) = f(s^(t−1), x^(t); θ)
Here, the state s(t) incorporates information about the entire past sequence. This idea underpins
Recurrent Neural Networks (RNNs), which process sequences by applying recurrence. In
RNNs, the state is typically represented by hidden units, denoted h^(t), and the equation is rewritten as:

h^(t) = f(h^(t−1), x^(t); θ)
This means the RNN uses both the previous hidden state and the current input to update itself
over time. The parameters θ are shared across all steps.
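To make this recurrence concrete, here is a minimal NumPy sketch of the shared transition function (names such as rnn_step are illustrative, not from the text):

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    """One application of the shared transition function f:
    h(t) = tanh(b + W h(t-1) + U x(t))."""
    return np.tanh(b + W @ h_prev + U @ x_t)

# The same parameters theta = (W, U, b) are reused at every time step.
rng = np.random.default_rng(0)
hidden, n_in = 4, 3
W = rng.normal(size=(hidden, hidden))
U = rng.normal(size=(hidden, n_in))
b = np.zeros(hidden)

h = np.zeros(hidden)                      # initial state h(0)
for x_t in rng.normal(size=(5, n_in)):    # a sequence of length tau = 5
    h = rnn_step(h, x_t, W, U, b)         # same f, same theta, every step
```

However long the sequence, only the one set of parameters (W, U, b) is ever learned.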
RNNs are great for tasks involving sequences, like predicting the next word in a sentence,
because they store important information from the past in h(t). This hidden state acts as a
summary of everything the RNN has seen so far. However, since h(t) has a fixed size, it can’t
remember every detail from the input sequence—it only keeps what’s necessary for the task.
For example, in predicting the next word, the RNN focuses on the parts of the past that help
guess the next word.
For harder tasks, like reconstructing an entire input sequence (as in autoencoders), the RNN
needs to store more detailed information in h(t). This makes RNNs flexible tools for
understanding sequences.
A recurrent network with no outputs. This recurrent network just processes information from the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit diagram. The black square indicates a delay of a single time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance.
The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same input size, because
it is specified in terms of transition from one state to another state, rather than specified in terms
of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every time
step.
A recurrent neural network (RNN) is a highly versatile model capable of computing any
function that a Turing machine can, provided it has a finite size. It processes inputs, such as
binary sequences, and produces outputs after a number of steps proportional to the input length.
The RNN's outputs are discrete and need to be post-processed, often using a softmax function,
to produce normalized probabilities over possible outcomes.
This makes RNNs suitable for tasks like predicting words or characters. A single RNN with a finite number of units can compute any computable function, as it can simulate a Turing machine by representing its activations and weights with rational numbers of unbounded precision. The
RNN operates by updating its hidden state over time, starting from an initial state, and uses
activation functions like the hyperbolic tangent for its computations. This framework allows
RNNs to handle a wide range of problems effectively by encoding and processing sequences.
We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h^(0). Then, for each time step from t = 1 to t = τ, we apply the following update equations:

a^(t) = b + W h^(t−1) + U x^(t)
h^(t) = tanh(a^(t))
o^(t) = c + V h^(t)
ŷ^(t) = softmax(o^(t))

where b and c are the bias vectors, and U, V and W are the weight matrices for input-to-hidden, hidden-to-output and hidden-to-hidden connections, respectively.
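A minimal NumPy sketch of this forward pass (the numerically stabilized softmax helper is an implementation detail, not from the text):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())   # subtract max for numerical stability
    return e / e.sum()

def forward(xs, h0, U, V, W, b, c):
    """Apply the update equations for t = 1..tau, collecting yhat(t)."""
    h, yhats = h0, []
    for x in xs:
        a = b + W @ h + U @ x        # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)               # h(t) = tanh(a(t))
        o = c + V @ h                # o(t) = c + V h(t)
        yhats.append(softmax(o))     # yhat(t) = softmax(o(t))
    return yhats
```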
This type of RNN has a simpler structure, with feedback connections only from the output to
the hidden layer. At each time step, the input x(t) influences the hidden activations h(t), which
produce outputs o(t). These outputs are the only information passed to future time steps, as
there are no direct connections from the hidden state to the future.
While this design makes the RNN less powerful—it cannot retain as much detailed information
about the past as more complex RNNs—it has an advantage in training. Since each time step
can be trained independently, it allows for better parallelization, making the training process
faster and more efficient. However, the limited ability to carry rich past information may make
it less suitable for tasks requiring detailed sequence memory.
A time-unfolded RNN with a single output at the end summarizes the entire sequence into a
fixed-size representation, which can be used for further processing. The output's gradient can
either be back-propagated from a target at the end or from downstream modules.
Because this simpler design decouples the time steps, training can be parallelized, with the gradient for each step t computed in isolation. During training, a technique called teacher forcing is used: the model receives the actual output y^(t) from the training set as input for the next time step, directly providing the correct context for each prediction instead of the model's own (possibly erroneous) output.
The conditional maximum likelihood criterion, illustrated for a sequence of two time steps, is:

log p(y^(1), y^(2) | x^(1), x^(2)) = log p(y^(2) | y^(1), x^(1), x^(2)) + log p(y^(1) | x^(1), x^(2))

so at training time the correct value y^(1) is supplied as a conditioning input when predicting y^(2), rather than the model's own output.
Teacher forcing allows models that lack hidden-to-hidden connections, using output-to-hidden feedback instead, to use the correct output values from one time step as input for the next time step, eliminating the need for backpropagation through time. However, a disadvantage arises when the model is later used in an open-loop mode,
where the outputs (or samples from the output distribution) are fed back as inputs. In this case,
the training inputs might differ significantly from those encountered during testing, which
could affect the model's performance. The open-loop mode refers to this situation where
feedback is generated by the model itself rather than being provided by the training data.
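As a sketch, here is what teacher-forced training of the forward pass above looks like for a token sequence (the one-hot input encoding and helper names are assumptions for illustration):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def teacher_forced_nll(tokens, vocab, params, h0):
    """Teacher forcing: at step t the *ground-truth* token tokens[t-1]
    is fed as input, and tokens[t] is the prediction target."""
    U, V, W, b, c = params
    h, loss = h0, 0.0
    for t in range(1, len(tokens)):
        x = np.eye(vocab)[tokens[t - 1]]   # true previous token, one-hot
        h = np.tanh(b + W @ h + U @ x)
        p = softmax(c + V @ h)
        loss -= np.log(p[tokens[t]])       # conditional maximum likelihood
    return loss
```

At test time, in open-loop mode, the sampled token would replace tokens[t − 1] as the next input, which is exactly where the train/test mismatch described above can arise.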
The gradient ∇_{o^(t)}L on the outputs at time step t, for all i and t, is as follows (assuming a softmax output with negative log-likelihood loss):

(∇_{o^(t)}L)_i = ∂L/∂o_i^(t) = ŷ_i^(t) − 1_{i, y^(t)}
We work our way backwards, starting from the end of the sequence. At the final time step τ, h^(τ) has only o^(τ) as a descendant, so its gradient is simple:

∇_{h^(τ)}L = V^⊤ ∇_{o^(τ)}L
We can then iterate backwards in time to back-propagate gradients through time, from t = τ − 1 down to t = 1, noting that h^(t) (for t < τ) has as descendants both o^(t) and h^(t+1). Its gradient is thus given by:

∇_{h^(t)}L = W^⊤ diag(1 − (h^(t+1))²) ∇_{h^(t+1)}L + V^⊤ ∇_{o^(t)}L

where diag(1 − (h^(t+1))²) is the Jacobian of the tanh nonlinearity at time step t + 1.
Using this notation, the gradients on the remaining parameters are given by:

∇_c L = Σ_t ∇_{o^(t)}L
∇_b L = Σ_t diag(1 − (h^(t))²) ∇_{h^(t)}L
∇_V L = Σ_t (∇_{o^(t)}L) h^(t)⊤
∇_W L = Σ_t diag(1 − (h^(t))²) (∇_{h^(t)}L) h^(t−1)⊤
∇_U L = Σ_t diag(1 − (h^(t))²) (∇_{h^(t)}L) x^(t)⊤
We do not need to compute the gradient with respect to x(t) for training because it does not
have any parameters as ancestors in the computational graph defining the loss.
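The backward recursion above can be written out directly; the following NumPy sketch assumes the cached activations hs (with hs[0] = h(0)) and yhats come from a forward pass like the one sketched earlier, and that targets are integer class labels:

```python
import numpy as np

def bptt(xs, targets, hs, yhats, U, V, W):
    """Back-propagation through time for the tanh RNN above.
    hs[t] is h(t) for t = 0..tau; xs, targets, yhats are indexed t-1."""
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    db, dc = np.zeros(W.shape[0]), np.zeros(V.shape[0])
    dh_next = np.zeros(W.shape[0])        # gradient arriving from h(t+1)
    for t in range(len(xs), 0, -1):
        do = yhats[t - 1].copy()
        do[targets[t - 1]] -= 1.0         # grad_o L = yhat - one_hot(y)
        dh = V.T @ do + dh_next           # descendants: o(t) and h(t+1)
        da = (1.0 - hs[t] ** 2) * dh      # through tanh: diag(1 - h(t)^2)
        dc += do
        db += da
        dV += np.outer(do, hs[t])
        dW += np.outer(da, hs[t - 1])
        dU += np.outer(da, xs[t - 1])
        dh_next = W.T @ da                # contribution to grad h(t-1)
    return dU, dV, dW, db, dc
```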
Let us consider the case where the RNN models only a sequence of scalar random variables Y = {y^(1), . . . , y^(τ)}, with no additional inputs x. The input at time step t is simply the output at time step t − 1. The RNN then defines a directed graphical model over the y variables. We parametrize the joint distribution of these observations using the chain rule for conditional probabilities:

P(Y) = P(y^(1), . . . , y^(τ)) = Π_{t=1}^{τ} P(y^(t) | y^(t−1), y^(t−2), . . . , y^(1))

where the right-hand side of the bar is empty for t = 1, of course. Hence the negative log-likelihood of a set of values {y^(1), . . . , y^(τ)} according to such a model is:

L = Σ_t L^(t), where L^(t) = −log P(y^(t) | y^(t−1), y^(t−2), . . . , y^(1))
This is efficient with RNNs, which use parameter sharing across all time steps.
Parameter Sharing:
RNNs don’t need a separate set of rules or weights for each step in the sequence. They reuse
the same set of parameters at every step. Example: Whether you predict a 3-word sentence or
a 10-word sentence, the same RNN structure is applied at every step. This saves memory and
makes RNNs powerful for long sequences.
A fully connected graphical model for a sequence allows every past observation to influence
future ones. However, directly parameterizing the model this way is inefficient due to the
increasing number of inputs and parameters as the sequence length grows.
RNNs achieve the same full connectivity but with an efficient parametrization, as illustrated in the figure below:
By using a state variable in the RNN model, we can make the network more efficient. Each
step in the sequence uses the same structure and shares the same parameters, reducing
complexity.
RNNs: Efficient But Challenging
1. Fewer Parameters, But Harder to Optimize:
RNNs share the same parameters across all time steps, which reduces the number of
parameters. However, this makes training harder because the model needs to learn how these
shared parameters work over time.
2. Stationarity Assumption:
RNNs assume that the relationship between time steps stays the same. This helps keep the
model simple, but might make it hard to capture complex patterns that change over time.
Generating Sequences in RNNs:
1. End-of-Sequence Symbol:
A special symbol (like </end>) is added to mark the end of the sequence. The model stops when
it generates this symbol.
2. Bernoulli Output for Continuation:
A sigmoid unit decides whether to continue or stop generating the sequence at each time step.
It uses a probability between 0 (stop) and 1 (continue).
3. Predicting Sequence Length (τ):
The model can predict the sequence length (τ), and then generate exactly that many steps. It
uses extra inputs to know how many steps are left.
4. Sequence Generation Formula:
The probability of generating a sequence can be decomposed as:

P(x^(1), . . . , x^(τ)) = P(τ) P(x^(1), . . . , x^(τ) | τ)

This breaks sequence generation into:
P(τ): the probability of the sequence length.
P(x^(1), . . . , x^(τ) | τ): the probability of the sequence content given its length.
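A minimal sketch of the first strategy, sampling until an end-of-sequence token is produced (the token ids, EOS value, and step_fn interface are illustrative assumptions):

```python
import numpy as np

EOS = 0  # hypothetical token id reserved for the </end> symbol

def sample_until_eos(step_fn, h0, x0, max_len=100,
                     rng=np.random.default_rng()):
    """Strategy 1: stop when the model emits the end-of-sequence symbol.
    step_fn(h, x) -> (h_next, token_probs) is an assumed interface; a
    Bernoulli 'continue' unit (strategy 2) would replace the EOS check
    with a draw from a sigmoid halting probability."""
    h, x, seq = h0, x0, []
    for _ in range(max_len):
        h, probs = step_fn(h, x)
        token = rng.choice(len(probs), p=probs)
        if token == EOS:                  # the model chose to stop
            break
        seq.append(token)
        x = token                         # feed the sample back in
    return seq
```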
This diagram shows how a Recurrent Neural Network (RNN) maps a fixed-length input vector x to a distribution over sequences Y. The product x^⊤R acts as an extra bias input to the hidden units, supplying global context at every time step: at each step t, the RNN updates h^(t) using h^(t−1), x^⊤R and the previous output; y^(t) is then generated based on h^(t). In the diagram:
• x: input vector
• R: weight matrix parametrizing the interaction between x and the hidden units
• x^⊤R: the resulting bias parameters
• h^(t): hidden state
• L^(t): loss at time step t
The RNN takes x as input and produces a sequence of outputs y(t), where each output word
y(t) depends on the previous word y(t−1) during training, utilizing the technique of teacher
forcing. In this setup, each element in the output sequence serves both as the input for the
current time step and the target for the previous time step, allowing the model to learn to
generate coherent sequences, such as captions, based on the initial fixed-length input vector.
Conditional Independence Assumption
Rather than receiving only a single vector x as input, the RNN may receive a sequence of vectors x^(t) as input. The RNN described above then corresponds to a conditional distribution P(y^(1), . . . , y^(τ) | x^(1), . . . , x^(τ)) that makes a conditional independence assumption, namely that it factorizes as:

Π_t P(y^(t) | x^(1), . . . , x^(t))
To remove the conditional independence assumption, we can add connections from the output at time t to the hidden units at time t + 1, as shown in the figure.
Computation of a typical bidirectional recurrent neural network, meant to learn to map input
sequences x to target sequences y, with loss L(t) at each step t. The h recurrence propagates
information forward in time (towards the right) while the g recurrence propagates information
backward in time (towards the left). Thus at each point t, the output units o(t) can benefit from
a relevant summary of the past in its h(t) input and from a relevant summary of the future in its
g(t) input.
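A minimal NumPy sketch of this bidirectional computation (parameter names are assumptions): one recurrence runs left-to-right producing h(t), the other runs right-to-left producing g(t), and o(t) can then read both.

```python
import numpy as np

def bidirectional_states(xs, Wh, Uh, bh, Wg, Ug, bg):
    """Return (h, g): h[t] summarizes the past, g[t] the future."""
    n, dim = len(xs), bh.shape[0]
    h, g = [None] * n, [None] * n
    state = np.zeros(dim)                 # forward recurrence for h(t)
    for t in range(n):
        state = np.tanh(bh + Wh @ state + Uh @ xs[t])
        h[t] = state
    state = np.zeros(dim)                 # backward recurrence for g(t)
    for t in reversed(range(n)):
        state = np.tanh(bg + Wg @ state + Ug @ xs[t])
        g[t] = state
    # o(t) would be computed from the concatenation [h(t); g(t)].
    return h, g
```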
(a) The hidden recurrent state can be broken down into groups organized hierarchically. (b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output parts. This may lengthen the shortest path linking different time steps. (c) The path-lengthening effect can be mitigated by introducing skip connections.
10.5 Recursive Neural Networks
Recursive neural networks (RecNNs) are a generalization of recurrent neural networks. They process data in a tree-like structure instead of the chain-like structure of RNNs, which makes them useful for representing hierarchical or structured data.
RecNNs are designed to handle structured data, such as trees or graphs, as input. They are particularly useful in areas like natural language processing (NLP) and computer vision, for example operating on parse trees in NLP or recognizing relationships between parts of an image in computer vision.
How RecNNs work
RecNNs are structured as computational trees, where each node performs computations by taking inputs and producing outputs. The input to each node comes from its child nodes, with data flowing from the leaf nodes (bottom of the tree) to the root node (top). As each node processes its inputs, it generates an output, which is passed up to its parent node, building a hierarchical representation. The input sequence (e.g., x^(1), x^(2), x^(3), . . . , x^(t)) is processed through the tree's layers, with fixed weight matrices (U, V, W) used to combine and transform inputs at each node. Ultimately, the tree structure generates a final output o corresponding to the sequence target y.
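A minimal sketch of this bottom-up computation over a binary tree (the tuple encoding of trees and all names are illustrative):

```python
import numpy as np

def encode(node, W_left, W_right, b, U):
    """Recursively build a node's representation from its children.
    A node is either ("x", vector) for a leaf input, or
    ("internal", left_subtree, right_subtree)."""
    if node[0] == "x":
        return np.tanh(U @ node[1])       # embed a raw leaf input
    _, left, right = node
    h_l = encode(left, W_left, W_right, b, U)
    h_r = encode(right, W_left, W_right, b, U)
    # The same shared parameters are applied at every internal node.
    return np.tanh(b + W_left @ h_l + W_right @ h_r)
```

For a balanced tree over τ leaves, the recursion depth is O(log τ), which is the depth-reduction advantage listed below.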
Advantages
Depth Reduction: Recursive networks can reduce the depth of the computational graph from τ to O(log τ), helping manage long-term dependencies.
Tree Structure: The tree structure can be fixed (e.g., a balanced binary tree) or data-dependent (e.g., a parse tree produced by a parser).
Learning Tree Structure: In the ideal scenario, the network learns and adapts its tree structure based on the input data.
Applications
• RecNNs can be used to build language models, machine translation systems, and speech
recognition systems.
• They can also be used for image classification, object detection, and video analysis.
There are many variants of the recursive network concept. For instance, Frasconi et al. (1997,
1998) associate data with a tree structure, linking inputs and targets to individual nodes. The
computation at each node does not have to follow the traditional artificial neuron approach
(affine transformation followed by a monotone nonlinearity). Socher et al. (2013a) suggest
using tensor operations and bilinear forms, which have been effective in modeling relationships
between concepts, particularly when these concepts are represented as continuous vectors or
embeddings (Weston et al., 2010; Bordes et al., 2012).
The most effective sequence models used in practical applications are called gated RNNs.
These include the long short-term memory and networks based on the gated recurrent unit.
Gated RNNs, like LSTMs and GRUs, are designed to help neural networks remember
important information over time and forget irrelevant details. Unlike traditional RNNs, which
can struggle to maintain useful information in long sequences, gated RNNs have special gates
that control what information should be remembered and what should be forgotten.
These gates are learned during training, so the network can decide for itself when to reset or
update its memory. This allows the network to accumulate useful information over time, while
also being able to forget old, unnecessary data when it's no longer needed. Essentially, gated
RNNs make it easier for neural networks to manage long-term memory and perform better on
tasks involving sequences, such as language or time-series prediction.
10.6.1 LSTM
The clever idea of introducing self-loops to produce paths where the gradient can flow for long
durations is a core contribution of the initial long short-term memory (LSTM) model. A crucial
addition has been to make the weight on this self-loop conditioned on the context, rather than
fixed.
The LSTM cell consists of several components that control how information is processed and
stored. The input feature is first computed by a regular artificial neuron and can be added to
the cell's state if allowed by the input gate, which uses a sigmoid function. The state unit has a
self-loop that is influenced by the forget gate, determining which information should be
discarded. The output gate, also using a sigmoid function, controls whether the cell’s output is
active. All gating units have sigmoid nonlinearities, while the input unit can use any squashing
nonlinearity. The state unit can also feed into the gating units. A black square represents a delay
of one time step in the diagram.
The most important component is the state unit s_i^(t), which has a linear self-loop similar to the leaky units described in the previous section. Here, however, the self-loop weight (or the associated time constant) is controlled by a forget gate unit f_i^(t) (for time step t and cell i), which sets this weight to a value between 0 and 1 via a sigmoid unit:

f_i^(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j^(t) + Σ_j W_{i,j}^f h_j^(t−1) )

where x^(t) is the current input vector, h^(t) is the current hidden layer vector containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates.
The LSTM cell internal state is thus updated as follows, with a conditional self-loop weight f_i^(t):

s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ( b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} h_j^(t−1) )

where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i^(t) is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

g_i^(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j^(t) + Σ_j W_{i,j}^g h_j^(t−1) )
The output h_i^(t) of the LSTM cell can also be shut off, via the output gate q_i^(t), which also uses a sigmoid unit for gating:

h_i^(t) = tanh( s_i^(t) ) q_i^(t)
q_i^(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j^(t) + Σ_j W_{i,j}^o h_j^(t−1) )

where b^o, U^o and W^o are its biases, input weights and recurrent weights.
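These per-unit equations vectorize directly; here is a minimal NumPy sketch of a single LSTM step (the parameter-dictionary layout is an implementation choice, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step; p maps names to the (bias, input, recurrent)
    parameters of the forget (f), input (g) and output (o) gates and
    of the cell input."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)  # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)  # input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)  # output gate
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)
    h = np.tanh(s) * q                    # output can be shut off by q
    return h, s
```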
10.6.2 Other Gated RNNs
The key components of the LSTM architecture—input, forget, and output gates—are essential
for controlling the flow of information and managing long-term dependencies. However,
simpler alternatives like Gated Recurrent Units (GRUs) combine the forget and input gates into
a single gating unit, making the architecture less complex while still allowing dynamic control
over forgetting and state updates. GRUs provide a more efficient solution for some tasks while
maintaining the ability to handle time scale and forgetting behaviors effectively.
A single gating unit simultaneously controls the forgetting factor and the decision to update the state unit:

h_i^(t) = u_i^(t−1) h_i^(t−1) + (1 − u_i^(t−1)) σ( b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} r_j^(t−1) h_j^(t−1) )

where u stands for the “update” gate and r for the “reset” gate. Their values are defined as usual:

u_i^(t) = σ( b_i^u + Σ_j U_{i,j}^u x_j^(t) + Σ_j W_{i,j}^u h_j^(t) )
r_i^(t) = σ( b_i^r + Σ_j U_{i,j}^r x_j^(t) + Σ_j W_{i,j}^r h_j^(t) )
The reset and update gates in gated RNNs allow the model to control which parts of the state
vector are used and how they evolve. The update gate acts like a leaky integrator, deciding
whether to retain or ignore certain parts of the state, while the reset gate controls which parts
of the state contribute to the next target state. These gates introduce nonlinearity in the
relationship between past and future states, enabling more flexible learning.
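A matching NumPy sketch of one GRU step (again, the parameter-dictionary layout is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU time step following the update/reset-gate equations."""
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)  # update gate
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)  # reset gate
    # Reset gate selects which parts of the old state feed the candidate.
    cand = sigmoid(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))
    # Update gate linearly interpolates, leaky-integrator style.
    return u * h_prev + (1.0 - u) * cand
```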
While various architectural variations, such as sharing reset gates across multiple units or
combining global and local gates, have been explored, no single variant has been found to
outperform the standard LSTM or GRU consistently across all tasks. Research has shown that the forget gate is crucial, and that adding a bias of 1 to the LSTM's forget gate, as advocated by Gers et al. (2000), can improve its performance significantly.