
Deep Learning Module 5: Recurrent and Recursive Neural Networks

Module 5: Recurrent and Recursive Neural Networks

Module 5: Unfolding Computational Graphs, Recurrent Neural Network, Bidirectional RNNs, Deep
Recurrent Networks, Recursive Neural Networks, The Long Short Term Memory and Other Gated RNNs,
Applications.

Text Book: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.

(Chapters 10.1-10.3, 10.5, 10.6, 10.10, 12)

➢ Sequence Modeling: Recurrent and Recursive Nets:

Recurrent neural networks or RNNs (Rumelhart et al., 1986a) are a family of neural networks for processing
sequential data. Much as a convolutional network is a neural network that is specialized for processing a grid
of values X such as an image, a recurrent neural network is a neural network that is specialized for processing
a sequence of values x(1), …, x(τ). Just as convolutional networks can readily scale to images with large width
and height, and some convolutional networks can process images of variable size, recurrent networks can
scale to much longer sequences than would be practical for networks without sequence-based specialization.
Most recurrent networks can also process sequences of variable length.

➢ The Recurrent Neural Network:


• To transition from multi-layer networks to recurrent networks, we leverage a concept from the 1980s in
machine learning and statistical modeling: parameter sharing across different parts of a model.
• This approach allows the model to handle inputs of varying forms (in this case, different sequence lengths)
and to generalize effectively.
• Without shared parameters for each time step, the model would struggle to generalize to sequence lengths
it hasn't encountered during training or to utilize statistical patterns across different sequence lengths and
time positions.
• Parameter sharing is crucial, especially when specific information can appear at various points within a
sequence.
• For instance, consider the sentences, “I went to Nepal in 2009” and “In 2009, I went to Nepal.” If a
machine learning model is tasked with identifying the year the narrator visited Nepal, it should recognize
"2009" as the relevant information, regardless of whether it appears as the sixth word or the second word.


• If we used a feedforward network designed for fixed-length sentences, it would require separate
parameters for each input feature, forcing it to learn the rules of the language independently for every
position in the sentence.
• In contrast, a recurrent neural network (RNN) shares the same weights across multiple time steps,
enabling it to generalize patterns regardless of their position within the sequence.
• A similar concept is the use of convolution over a one-dimensional temporal sequence, which forms the
foundation of time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990).
• Convolution enables a network to share parameters across time but remains relatively shallow. Each
output in a convolutional network is determined by a small neighbourhood of inputs, with parameter
sharing achieved through the repeated application of the same convolution kernel at every time step.
• Recurrent networks, on the other hand, share parameters differently. In these networks, each output
depends on the preceding outputs, with the same update rule applied consistently across time steps.
• This recurrent mechanism results in parameter sharing across a much deeper computational graph,
enabling the network to capture long-term dependencies in the data.

• For simplicity, we describe RNNs as processing a sequence of vectors 𝑥 (𝑡) , where the time step index t
ranges from 1 to τ.
• In practice, however, recurrent networks typically operate on minibatches of sequences, where each
sequence in the batch may have a different length τ. To keep the notation simple, we omit the minibatch
indices.
• Additionally, the time step index t does not always represent the actual passage of time; it often simply
indicates the position within the sequence.
• RNNs can also be applied in two dimensions for spatial data, such as images. Even when working with
time-based data, the network can include connections that move backward in time, as long as the entire
sequence is available before being processed by the network.
• The next section extends the idea of a computational graph to include cycles. These cycles represent the
influence of the present value of a variable on its own value at a future time step.
• Such computational graphs allow us to define recurrent neural networks. We then describe many different
ways to construct, train, and use recurrent neural networks.


➢ Unfolding Computational Graphs:


• A computational graph provides a structured representation of computations, capturing the relationships
between inputs, parameters, outputs, and loss. Here, we focus on the concept of unfolding recursive or
recurrent computations into a computational graph with a repetitive structure, often representing a
sequence of events.
• Unfolding this graph creates a chain-like structure, facilitating parameter sharing across the depth of the
network.
• A Computational Graph is a way to formalize the structure of a set of computations such as mapping
inputs and parameters to outputs and loss.
• We can unfold a recursive or recurrent computation into a computational graph that has a repetitive
structure, corresponding to a chain of events. Unfolding this graph results in sharing of parameters across
a deep network structure.
• Example of unfolding a recurrent equation:
▪ Consider the classical form of a dynamical system:
▪ s(t) = f(s(t−1); θ), where s(t) is called the state of the system.
• The equation is recurrent because the definition of s at time t refers back to the same definition at time t−1.
• For a finite number of time steps τ, the graph can be unfolded by applying the definition τ−1 times.
• E.g., for τ = 3 time steps we get s(3) = f(s(2); θ) = f(f(s(1); θ); θ).
• Unfolding the equation by repeatedly applying the definition in this way yields an expression that does not involve recurrence.
• s(1) is the initial (ground) state, and each subsequent state is computed by applying f to the previous one.
• Such an expression can be represented by a traditional acyclic computational graph.
• Unfolded dynamical system:
• The classical dynamical system described by s(t) = f(s(t−1); θ) and s(3) = f(f(s(1); θ); θ) is illustrated in figure 1 as an unfolded computational graph.

Fig 1. Illustration of an unfolded computational graph.

• Each node represents the state at some time t. The function f maps the state at time t to the state at time t+1. The same parameters (the same value of θ used to parameterize f) are used for all time steps.
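• As a concrete illustration, the sketch below unrolls this recurrence numerically for τ = 3; the particular transition function f and the parameter θ are illustrative choices, not taken from the text.

```python
import numpy as np

# Illustrative parameter theta: a 2x2 matrix defining a linear map.
theta = np.array([[0.9, -0.2],
                  [0.1,  0.8]])

def f(s, theta):
    """One application of the recurrence s(t) = f(s(t-1); theta)."""
    return np.tanh(theta @ s)

s1 = np.array([1.0, 0.0])   # ground state s(1)
s2 = f(s1, theta)           # s(2) = f(s(1); theta)
s3 = f(s2, theta)           # s(3) = f(s(2); theta) = f(f(s(1); theta); theta)

print(s2, s3)               # the same theta is reused at every time step
```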


• As another example, let us consider a dynamical system driven by an external signal x(t),

s(t) = f(s(t−1), x(t); θ),

• where we see that the state now contains information about the whole past sequence.
• Recurrent neural networks can be built in many different ways. Much as almost any function can be
considered a feedforward neural network, essentially any function involving recurrence can be
considered a recurrent neural network.
• To indicate that the state is the hidden units of the network, now let us rewrite the above equation using
the variable h to represent the state:

h(t) = f(h(t−1), x(t); θ),

• A recurrent network with no outputs is illustrated in figure 2. This network simply processes information from the input x by incorporating it into the state h, which is passed forward through time.
• In the figure, the left side is the circuit diagram, where the black square indicates a delay of a single time step; the right side is the same network seen as an unfolded computational graph, where each node is now associated with one particular time instance.

Fig.2 Hidden state h representation.

• As illustrated in figure 2, typical RNNs will add extra architectural features such as output layers that
read information out of the state h to make predictions.

• When a recurrent network is trained for tasks that involve predicting the future from the past, it
typically learns to use h(t) as a condensed summary of the task-relevant aspects of the input sequence
up to time t. This summary is inherently lossy because it compresses a sequence of arbitrary length,
(x(t), x(t−1), x(t−2), . . . ,x(2) ,x(1)), into a fixed-length vector, h(t).The degree of detail retained in this
summary depends on the training objective; some parts of the sequence may be preserved with greater
precision than others.
• For instance, in statistical language modeling, where the goal is to predict the next word based on
previous words, the RNN does not need to store all information from the input sequence. Instead, it
retains just enough information to predict the continuation of the sentence effectively.


• In other words, when an RNN is required to predict the future from the past, the network typically learns to use h(t) as a lossy summary of the task-relevant aspects of the past sequence of inputs up to time t.
• The summary is in general lossy, since it maps a sequence of arbitrary length (x(t), x(t−1), …, x(2), x(1)) to a fixed-length vector h(t).
• Depending on the training criterion, the summary keeps some aspects of the past sequence more precisely than other aspects.
• Example: an RNN used in statistical language modeling, typically to predict the next word from past words, may not need to store all the information up to time t, but only enough information to predict the rest of the sentence.
• The most demanding situation is when we ask h(t) to be rich enough to allow one to approximately recover the input sequence, as in autoencoders.
• Unfolding: from circuit diagram to computational graph:
• The equation h(t) = f(h(t−1), x(t); θ) can be drawn in two different ways: as a circuit diagram or as an unfolded computational graph. Unfolding is the operation that maps a circuit to a computational graph with repeated pieces. The unfolded graph has a size that depends on the sequence length.
• We can represent the unfolded recurrence after t steps with a function g(t):
  h(t) = g(t)(x(t), x(t−1), x(t−2), …, x(2), x(1)) = f(h(t−1), x(t); θ)
• The function g(t) takes the whole past sequence (x(t), x(t−1), x(t−2), …, x(2), x(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g(t) into repeated application of a function f.
• The unfolding process thus introduces two major advantages:
i. Regardless of the sequence length, the learned model always has the same input size, because
it is specified in terms of transition from one state to another state, rather than specified in terms
of a variable-length history of states.
ii. It is possible to use the same transition function f with the same parameters at every time step.
• These two factors enable the learning of a single model f that works across all time steps and sequence
lengths, rather than requiring separate models g(t) for each time step. By sharing parameters, the model
can generalize to sequence lengths not seen during training and can be trained with significantly fewer
examples compared to models without parameter sharing.
• Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The
unfolded graph provides an explicit description of which computations to perform. The unfolded graph
also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and
backward in time (computing gradients) by explicitly showing the path along which this information
flows.


➢ Recurrent Neural Networks:

RNNs are specialized neural networks designed for sequence data. They process sequences step by step,
sharing parameters across time to handle inputs of varying lengths. The key concepts captured here are:

• Unfolding Computational Graphs: RNNs use a repetitive structure that simplifies computations. Instead
of using a separate model for each time step, a single function with shared parameters is applied
repeatedly.
• Parameter Sharing: This allows RNNs to generalize across different sequence lengths and positions,
making them more efficient and requiring fewer training examples.
• Teacher Forcing in Recurrent Neural Networks (RNNs): Teacher forcing is a training technique used in
RNNs where the model is provided with the actual target output (ground truth) from the training data at
the previous timestep, rather than using the model's predicted output. This helps guide the model during
training and ensures faster convergence.
• How It Works:
• During training, at each time step t, the RNN receives:
• The current input x(t).
• The true target output y(t−1) from the previous time step as context.
• This contrasts with the inference phase, where the model uses its own predicted output from the previous time step, ŷ(t−1), as input.
• Advantages
• Faster Training: Providing the ground truth reduces error propagation during training, leading to quicker
convergence.
• Stable Gradients: Teacher forcing stabilizes the training process, especially in long sequences, by
preventing compounded prediction errors.
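• A minimal sketch of the contrast between teacher forcing (training) and free-running generation (inference) is shown below; the toy vocabulary, sizes and untrained weights are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                                  # toy vocabulary size and hidden size
Wxh = rng.normal(0, 0.1, (H, V))             # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))             # hidden-to-hidden weights
Why = rng.normal(0, 0.1, (V, H))             # hidden-to-output weights

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def step(token_id, h):
    """One RNN step: consume a token id, return the new state and the predicted next token."""
    h = np.tanh(Wxh @ one_hot(token_id, V) + Whh @ h)
    return h, int(np.argmax(Why @ h))

targets = [1, 3, 2, 4]                       # ground-truth token ids y(1)..y(4)

# Teacher forcing (training): feed the TRUE previous target at each step.
h = np.zeros(H)
for t in range(1, len(targets)):
    h, pred = step(targets[t - 1], h)        # input is y(t-1), the ground truth

# Free running (inference): feed the model's OWN previous prediction instead.
h, prev = np.zeros(H), targets[0]
for t in range(1, len(targets)):
    h, prev = step(prev, h)                  # input is the model's prediction from step t-1
```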
• Design Patterns in Recurrent Neural Networks (RNNs)
• RNNs can be designed in various ways depending on the specific task and the nature of the input and
output sequences. Here are the three main design patterns:
1. Produce Outputs at Every Time Step with Connections Between Hidden Units
• Description:
• At each time step, the RNN processes the input and produces an output.
• There are recurrent connections between the hidden units across time steps, allowing the
network to retain information from previous inputs.


• Applications:
• Time series prediction (e.g., stock price forecasting).
• Language models that predict the next word in a sequence.
• Illustration: x1→h1→o1, x2→h2→o2, …, xT→hT→oT
2. Use Outputs at One Time Step as Inputs for the Next
• Description:
• The output from one time step is used as an input for the next time step, in addition to the
original input at that step.
• These connections enable the model to utilize the output information from prior steps directly.
• Applications:
• Text generation, where each word predicted is fed back into the network to generate the next
word.
• Reinforcement learning environments requiring sequential decision-making.
• Illustration: x1→h1→o1→h2, x2→h2→o2→h3, …

3. Process an Entire Sequence to Produce a Single, Summarized Output

• Description:

• The RNN processes the entire input sequence and generates a single output at the end,
summarizing the sequence into a fixed-size representation.

• This design is common for tasks where the sequence needs to be analyzed as a whole.

• Applications:

• Sentiment analysis (classifying whether a sentence expresses positive or negative sentiment).


• Sequence classification tasks (e.g., identifying spam emails, video activity recognition).

• Illustration: x1→h1, x2→h2, …, xT→hT→o

• Forward Propagation
• RNNs update their hidden state at each time step using a combination of the current input and the previous hidden state: h(t) = tanh(b + W h(t−1) + U x(t)), where U, W and b are the input weights, recurrent weights and bias.
• Outputs are computed from the hidden state: o(t) = c + V h(t)


• A softmax layer can transform the outputs into probabilities: ŷ(t) = softmax(o(t)).


• Loss Function
• RNNs can map input sequences to output sequences of the same length.
• The loss is computed by summing individual losses at each time step, such as the negative log-
likelihood of predicted values.
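• A minimal sketch of forward propagation and the summed negative log-likelihood loss for one sequence is given below, following h(t) = tanh(b + W h(t−1) + U x(t)) and o(t) = c + V h(t); the sizes and random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K, T = 3, 4, 2, 5            # input dim, hidden dim, number of classes, time steps

U = rng.normal(0, 0.1, (H, D))     # input-to-hidden weights
W = rng.normal(0, 0.1, (H, H))     # hidden-to-hidden (recurrent) weights
V = rng.normal(0, 0.1, (K, H))     # hidden-to-output weights
b, c = np.zeros(H), np.zeros(K)

xs = rng.normal(size=(T, D))       # input sequence x(1)..x(T)
ys = rng.integers(0, K, size=T)    # target class at each time step

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

h, loss = np.zeros(H), 0.0
for t in range(T):
    h = np.tanh(b + W @ h + U @ xs[t])   # h(t) = tanh(b + W h(t-1) + U x(t))
    o = c + V @ h                        # o(t) = c + V h(t)
    p = softmax(o)                       # predicted distribution over classes
    loss += -np.log(p[ys[t]])            # negative log-likelihood, summed over time

print(loss)
```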
• Backward Propagation Through Time (BPTT):
• Gradients are calculated by unrolling the RNN through all timesteps and applying
backpropagation.
• Challenges:
• Sequential nature makes parallel computation difficult.
• Memory and computational costs grow with sequence length.
• Advantages and Challenges:
• RNNs can handle sequences of arbitrary lengths and retain useful information from the past.
• However, training can be expensive and prone to issues like vanishing or exploding gradients.

➢ Bidirectional RNNs:
• All the recurrent networks we’ve discussed so far have a “causal” structure, meaning the state at time
t captures information only from the past inputs x(1), …, x(t−1) and the current input x(t).
• In some cases, these models also incorporate information from past y values into the current state,
provided those y values are accessible.
• In some tasks, the interpretation of the current input depends on the entire input sequence, not just past
or present information.
• For instance, in speech recognition, determining the correct phoneme for the current sound might rely
on the next few phonemes due to co-articulation or even on future words because of linguistic
dependencies.
• If multiple interpretations of a word are acoustically plausible, context from both the future and the
past may be needed to resolve ambiguity.
• This concept applies to other sequence-to-sequence tasks as well, such as handwriting recognition.


Fig 3. Computation of a typical bidirectional recurrent neural network

• Bidirectional Recurrent Neural Networks (Bidirectional RNNs) were introduced by Schuster and
Paliwal in 1997 to address the need for utilizing both past and future context when processing
sequential data. These networks process input sequences in both forward and backward directions,
enabling the model to capture information from the entire sequence.
• Bidirectional RNNs have proven highly effective in tasks requiring this capability, such as:
• Handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009).
• Speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013).
• Bioinformatics (Baldi et al., 1999).
• Their ability to integrate context from both directions makes them powerful for sequence-to-sequence
tasks.
• Bidirectional RNNs combine two RNNs: one processes the sequence forward in time (from the start
to the end), and the other processes it backward in time (from the end to the start). In this architecture:
• h(t): Represents the state of the forward-moving RNN.
• g(t): Represents the state of the backward-moving RNN.
• This setup allows the output units o(t) to create a representation that leverages information from both
the past and the future.
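• A minimal sketch of this computation is given below: one parameter set runs forward in time to produce h(t), a second runs backward to produce g(t), and the output at each step combines both. Shapes, omitted biases and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K, T = 3, 4, 2, 6
xs = rng.normal(size=(T, D))                 # the full input sequence must be available

# Separate parameters for the forward and backward recurrences.
Uf, Wf = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Ub, Wb = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Vf, Vb = rng.normal(0, 0.1, (K, H)), rng.normal(0, 0.1, (K, H))

h = np.zeros((T, H))                         # forward states h(t)
g = np.zeros((T, H))                         # backward states g(t)

for t in range(T):                           # sweep forward in time
    prev = h[t - 1] if t > 0 else np.zeros(H)
    h[t] = np.tanh(Uf @ xs[t] + Wf @ prev)

for t in reversed(range(T)):                 # sweep backward in time
    nxt = g[t + 1] if t + 1 < T else np.zeros(H)
    g[t] = np.tanh(Ub @ xs[t] + Wb @ nxt)

# o(t) depends on the whole sequence, through both h(t) and g(t).
o = np.stack([Vf @ h[t] + Vb @ g[t] for t in range(T)])
print(o.shape)                               # (T, K)
```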


• Unlike feedforward networks, convolutional networks, or standard RNNs with fixed-size look-ahead
buffers, bidirectional RNNs do not require specifying a fixed-size window around t. Instead, they
naturally focus on the input values near t, making them highly effective for tasks that benefit from
bidirectional context.
• The concept of bidirectional RNNs can be extended to 2-dimensional inputs, such as images, by introducing four RNNs that process the data in different directions: up, down, left, and right. At each position (i, j) in a 2D grid, an output o(i,j) can be computed that captures mostly local information while also incorporating long-range dependencies, provided the RNN effectively learns to carry such information.
• Compared to convolutional networks:
• RNNs applied to images are generally more computationally expensive.
• They enable long-range lateral interactions between features within the same feature map, which
convolutional networks cannot achieve as easily.
• The forward propagation equations for these RNNs show they include:
• A convolution operation to compute bottom-up input for each layer.
• Recurrent propagation across the feature map, incorporating lateral interactions.
• This approach allows RNNs to leverage both local and global context in image data, providing a more
comprehensive understanding of spatial relationships (Visin et al., 2015; Kalchbrenner et al., 2015).

➢ Deep Recurrent Networks:

• A deep RNN is simply an RNN with multiple hidden layers stacked on top of each other. This stacking
allows the network to learn more complex patterns and representations from the data. Each layer in a deep
RNN can capture different levels of abstraction, making it more powerful than a single-layer RNN.
• Key Idea is that they process one step of a sequence at a time (like reading one word or one frame of a
video). Information flows not only through the sequence (via time) but also vertically through the layers
to capture deeper relationships.
• Example: Imagine analysing a video: A simple RNN might understand how objects in a single frame relate
to the previous one. A deep RNN could understand more abstract things, like the movement of a person
or the mood of the scene.


• Architecture of a Deep RNN:

Here’s a visual representation of a deep RNN architecture:

Fig 4. Architecture of a deep recurrent neural network.

In the above figure 4, the deep RNN has three hidden layers. The arrows indicate the flow of information
through the network over time.

Let us now try to understand how a recurrent neural network can be made deep in many ways.

a) The hidden recurrent state can be broken down into groups organized hierarchically.
b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and
hidden-to-output parts. This may lengthen the shortest path linking different time steps.
c) The path-lengthening effect can be mitigated by introducing skip connections.

Fig 5. A recurrent neural network can be made deep in many ways.


(a) Hidden State Organized Hierarchically:

• As seen in figure 5(a), the hidden recurrent state is broken into groups or layers that are organized hierarchically.
• The state h is updated at each time step, and it interacts with another intermediate layer z before
producing the final output y.
• Hierarchical organization means the recurrent hidden states (h) have additional layers of computation
(z) between them, creating a deeper network.
• The hierarchical nature of hidden states aligns with the idea that the hidden recurrent state can be
broken into groups.
• This hierarchical arrangement introduces multiple processing steps (hidden layers), allowing the RNN
to learn more abstract representations over time.

(b) Deeper Computation at Each Stage:

• Figure 5(b) shows deeper computation introduced at key parts of the network:


o Input-to-hidden: The input x undergoes additional transformations before updating the hidden
state h.
o Hidden-to-hidden: The hidden state h involves multiple layers or additional computation
before transitioning to the next time step.
o Hidden-to-output: The output layer y is generated after deeper computation from the hidden
state.
• The additional processing lengthens the shortest path connecting information across time steps,
making the network deeper.
• The deeper computation mentioned in the paragraph is clearly represented here by the additional
processing layers between x, h, and y.
• This "deepening" allows for more complex transformations and modeling of the data, but it also
lengthens the path for gradients during backpropagation through time.

(c) Skip Connections to Mitigate Path Lengthening:

• In figure 5(c), skip connections are introduced:


• The input x has direct connections to later parts of the network (hidden states h and outputs).
• The hidden state h also has connections that bypass some layers to directly influence the output y.


• These skip connections shorten the gradient path by allowing information to flow directly between
layers, mitigating the path-lengthening effect caused by deeper computation.
• The skip connections mentioned in the paragraph are explicitly shown here as shortcuts in the
network.
• These connections address the problem of long paths, improving gradient flow and enabling the RNN
to handle deeper architectures effectively.
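• A minimal sketch of a stacked (deep) RNN with a skip connection from the input to the top layer is shown below; the layer count, sizes and skip wiring are illustrative assumptions rather than a prescribed architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, L, T = 3, 4, 3, 5                      # input dim, hidden dim, layers, time steps
xs = rng.normal(size=(T, D))

# One (U, W) pair per layer: layer 0 reads x, higher layers read the layer below.
U = [rng.normal(0, 0.1, (H, D if l == 0 else H)) for l in range(L)]
W = [rng.normal(0, 0.1, (H, H)) for l in range(L)]
S = rng.normal(0, 0.1, (H, D))               # skip connection: input straight to the top layer

h = np.zeros((L, H))                         # hidden state of every layer, carried through time
for t in range(T):
    inp = xs[t]
    for l in range(L):
        skip = S @ xs[t] if l == L - 1 else 0.0   # shortens the path from input to top layer
        h[l] = np.tanh(U[l] @ inp + W[l] @ h[l] + skip)
        inp = h[l]                           # the output of layer l feeds layer l+1

print(h[-1])                                 # top-layer state after processing the sequence
```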

➢ Recursive Neural Networks:

Recursive neural networks represent yet another generalization of recurrent networks, with a different kind
of computational graph, which is structured as a deep tree rather than the chain-like structure of RNNs. The
typical computational graph for a recursive network is illustrated in figure 6.

Fig 6. A typical computational graph of a recursive neural network.

• Recursive networks were first proposed by Pollack (1990), with their potential application for reasoning
highlighted by Bottou (2011). These networks have been effectively utilized for processing structured data
as inputs to neural networks (Frasconi et al., 1997, 1998), as well as in natural language processing and
computer vision.


• A notable advantage of recursive networks over recurrent networks is their ability to significantly reduce
the depth (measured by the number of compositions of nonlinear operations) from τ to O (log τ) for a
sequence of length τ, which can help address long-term dependencies. However, determining the optimal
tree structure remains an open question. One approach is to use a fixed tree structure, such as a balanced
binary tree, that does not depend on the data.
• In certain domains, external methods can provide suitable tree structures. For instance, in natural language
processing, the tree structure can be aligned with the parse tree of a sentence generated by a natural
language parser. Ideally, the learner itself would infer and discover the most appropriate tree structure for
a given input, as suggested by Bottou (2011).
• There are numerous possible variants of the recursive network concept. For instance, Frasconi et al. (1997,
1998) associate the data with a tree structure where inputs and targets are linked to individual nodes of the
tree.
• The computations performed at each node are not limited to the conventional artificial neuron operations
(an affine transformation followed by a monotonic nonlinearity). For example, Socher et al. (2013a)
propose the use of tensor operations and bilinear forms, which have previously been shown to effectively
model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are
represented as continuous vector embeddings.
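• A minimal sketch of a recursive network that composes vector representations bottom-up over a fixed balanced binary tree with one shared weight matrix is given below; the tree, leaf embeddings and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # embedding dimension
W = rng.normal(0, 0.1, (d, 2 * d))           # shared composition weights
b = np.zeros(d)

def compose(left, right):
    """Shared function applied at every internal node of the tree."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Leaf embeddings for the words of a 4-word sentence (illustrative values).
leaves = [rng.normal(size=d) for _ in range(4)]

# Balanced binary tree ((w1 w2) (w3 w4)): depth O(log tau) rather than tau.
left = compose(leaves[0], leaves[1])
right = compose(leaves[2], leaves[3])
root = compose(left, right)                  # fixed-size representation of the whole sentence
print(root)
```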

➢ The Long Short-Term Memory and Other Gated RNNs:

The introduction of self-loops to create paths that allow gradients to flow for extended durations is a key
innovation of the original Long Short-Term Memory (LSTM) model by Hochreiter and Schmidhuber
(1997). A significant improvement was later made by Gers et al. (2000), where the weight of the self-loop is
conditioned on the context rather than being fixed. By using a gated mechanism (controlled by another
hidden unit), the time scale of integration can be dynamically adjusted. This means that even with fixed
parameters, an LSTM can adapt the integration time scale based on the input sequence, as the time constants
are generated by the model itself.

The LSTM has achieved remarkable success across a variety of applications, including:

• Unconstrained handwriting recognition (Graves et al., 2009),


• Speech recognition (Graves et al., 2013; Graves and Jaitly, 2014),


• Handwriting generation (Graves, 2013),


• Machine translation (Sutskever et al., 2014),
• Image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015), and
• Parsing (Vinyals et al., 2014a).
• This dynamic gating mechanism enables LSTMs to adapt effectively to different input patterns,
making them highly versatile and powerful in sequence modeling tasks.
• LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture widely used in
Deep Learning. It excels at capturing long-term dependencies, making it ideal for sequence prediction
tasks.
• Unlike traditional neural networks, LSTM incorporates feedback connections, allowing it to process
entire sequences of data, not just individual data points. This makes it highly effective in understanding
and predicting patterns in sequential data like time series, text, and speech.
• LSTM has become a powerful tool in artificial intelligence and deep learning, enabling breakthroughs
in various fields by uncovering valuable insights from sequential data.

Fig 7. Architecture of an LSTM cell.


• Figure 7 depicts the block diagram of the LSTM recurrent network “cell.” Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks.
• An input feature is computed with a regular artificial neuron unit. Its value can be accumulated into
the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is
controlled by the forget gate.
• The output of the cell can be shut off by the output gate. All the gating units have a sigmoid
nonlinearity, while the input unit can have any squashing nonlinearity.
• The state unit can also be used as an extra input to the gating units. The black square indicates a delay
of a single time step.


• Just like a simple RNN, an LSTM also has a hidden state, where h(t−1) represents the hidden state of the previous time step and h(t) the hidden state of the current time step. In addition, an LSTM has a cell state, represented by C(t−1) and C(t) for the previous and current time steps, respectively.

• Example of LSTM working: Let’s take an example to understand how an LSTM works. Here we have two sentences separated by a full stop. The first sentence is “Bob is a nice person,” and the second sentence is “Dan, on the other hand, is evil.” It is clear that in the first sentence we are talking about Bob, and as soon as we encounter the full stop (.), we start talking about Dan.
• As we move from the first sentence to the second, the network should realize that we are no longer talking about Bob; the subject is now Dan. The forget gate of the network allows it to discard this earlier context. Let’s understand the roles played by the gates in the LSTM architecture.
• Forget Gate: In a cell of the LSTM neural network, the first step is to decide whether we should keep the information from the previous time step or forget it. This is controlled by a forget gate unit f_i(t) (for time step t and cell i) that sets this weight to a value between 0 and 1 via a sigmoid unit:

  f_i(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j(t) + Σ_j W_{i,j}^f h_j(t−1) )

• where x(t) is the current input vector and h(t) is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates.
• The LSTM cell internal state is then updated as follows, with a conditional self-loop weight f_i(t):

  s_i(t) = f_i(t) s_i(t−1) + g_i(t) σ( b_i + Σ_j U_{i,j} x_j(t) + Σ_j W_{i,j} h_j(t−1) )

• where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i(t) is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

  g_i(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j(t) + Σ_j W_{i,j}^g h_j(t−1) )


• The output h_i(t) of the LSTM cell can also be shut off, via the output gate q_i(t), which also uses a sigmoid unit for gating:

  h_i(t) = tanh( s_i(t) ) q_i(t)
  q_i(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j(t) + Σ_j W_{i,j}^o h_j(t−1) )

• where b^o, U^o, W^o are the biases, input weights and recurrent weights of the output gate, respectively. Among the variants, one can choose to use the cell state s_i(t) as an extra input (with its own weight) into the three gates of the i-th unit, as shown in figure 7.
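• A minimal vectorized sketch of one LSTM cell following the gate equations above is given below; tanh is used as the squashing nonlinearity for the cell input and output, and all parameter values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 4                                  # input size and number of cells

def params():                                # helper: one (U, W, b) parameter set
    return rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)), np.zeros(H)

(Uf, Wf, bf), (Ug, Wg, bg), (Uo, Wo, bo), (U, W, b) = params(), params(), params(), params()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev):
    f = sigmoid(bf + Uf @ x + Wf @ h_prev)   # forget gate f(t)
    g = sigmoid(bg + Ug @ x + Wg @ h_prev)   # external input gate g(t)
    q = sigmoid(bo + Uo @ x + Wo @ h_prev)   # output gate q(t)
    s = f * s_prev + g * np.tanh(b + U @ x + W @ h_prev)   # cell state with gated self-loop
    h = np.tanh(s) * q                       # output, which the output gate can shut off
    return h, s

h, s = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):            # run the cell over a 5-step input sequence
    h, s = lstm_step(x, h, s)
print(h, s)
```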
➢ Other Gated RNNs:
• Which pieces of the LSTM architecture are actually necessary? What other successful architectures could
be designed that allow the network to dynamically control the time scale and forgetting behaviour of
different units?
• Some answers to these questions are given with the recent work on gated RNNs, whose units are also
known as gated recurrent units.
• The main difference from the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equation is the following:

  h_i(t) = u_i(t−1) h_i(t−1) + (1 − u_i(t−1)) σ( b_i + Σ_j U_{i,j} x_j(t) + Σ_j W_{i,j} r_j(t−1) h_j(t−1) )

• where u stands for the “update” gate and r for the “reset” gate. Their values are defined as usual:

  u_i(t) = σ( b_i^u + Σ_j U_{i,j}^u x_j(t) + Σ_j W_{i,j}^u h_j(t) )
  r_i(t) = σ( b_i^r + Σ_j U_{i,j}^r x_j(t) + Σ_j W_{i,j}^r h_j(t) )
• The reset and update gates can selectively "ignore" specific parts of the state vector. The update gates
function as conditional leaky integrators, allowing any dimension to be linearly gated.
• At one extreme of the sigmoid function, they can copy the current value directly, while at the other
extreme, they can completely discard it and replace it with the new target state value, which the leaky
integrator aims to reach.


• Meanwhile, the reset gates determine which parts of the state contribute to the computation of the next
target state. This introduces an additional nonlinear effect that influences the relationship between the
previous state and the future state.

• Many more variants around this theme can be designed. For example, the reset gate (or forget gate) output
could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole
group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control
and local control.
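• A minimal sketch of one gated recurrent unit step in its widely used form (tanh candidate state, gates computed from the current input and previous state) is given below; it differs slightly in time indexing from the equations above, and all parameter values are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 4

def params():
    return rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)), np.zeros(H)

(Uu, Wu, bu), (Ur, Wr, br), (U, W, b) = params(), params(), params()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev):
    u = sigmoid(bu + Uu @ x + Wu @ h_prev)            # update gate: conditional leaky integrator
    r = sigmoid(br + Ur @ x + Wr @ h_prev)            # reset gate: masks parts of the old state
    h_tilde = np.tanh(b + U @ x + W @ (r * h_prev))   # candidate state uses the reset state
    return u * h_prev + (1.0 - u) * h_tilde           # single gate u controls forget and update

h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(x, h)
print(h)
```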

➢ Optimization for Long-Term Dependencies:

Clipping gradients is a popular technique in optimization, especially for training recurrent neural networks
(RNNs) and their variants (like LSTMs and GRUs), to address the issue of exploding gradients that arise when
dealing with long-term dependencies.

1. Problem of Exploding Gradients

• When training deep networks or RNNs over long sequences, the gradients of the loss function with
respect to the model parameters can sometimes grow exponentially during backpropagation through
time (BPTT).
• This occurs when the product of gradients of weight matrices over multiple time steps becomes
excessively large, leading to unstable updates in gradient-based optimization.
• Exploding gradients result in:
o Overshooting the optimal parameters,
o Poor convergence, or
o NaNs (numerical instability).

2. What is Gradient Clipping?

Gradient clipping is a method to stabilize training by preventing the gradients from growing too large. The
idea is simple:

• If the norm of the gradient exceeds a predefined threshold, the gradient is rescaled so that its norm
matches the threshold.
• This ensures that gradient updates remain controlled and do not explode.
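• A minimal sketch of clipping by the global gradient norm is given below; the threshold value and the toy gradients are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays if their joint L2 norm exceeds the threshold."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]   # direction preserved, magnitude capped
    return grads, total_norm

# Toy gradients, one array per parameter tensor (values chosen to trigger clipping).
grads = [np.array([30.0, -40.0]), np.array([[5.0]])]
clipped, norm = clip_by_global_norm(grads, threshold=5.0)
print(norm, clipped)
```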


➢ In tasks with long-term dependencies, the gradients must propagate through many time steps. While
vanishing gradients hinder learning of long-term dependencies, exploding gradients can destabilize the
network. Gradient clipping directly addresses this instability by:

• Keeping gradient updates within a controlled range,


• Allowing the model to still learn meaningful information across long time sequences, and
• Preventing catastrophic updates that cause the training to diverge.

➢ Large-Scale Deep Learning:


• Large-Scale Deep Learning refers to the practice of training deep neural networks with a massive amount
of data and/or large model architectures. It involves addressing challenges related to computational
resources, optimization, and scalability, while ensuring that models can learn from extensive datasets in a
reasonable time frame.
• Here’s a deeper look into the main aspects of large-scale deep learning:

1. Key Challenges in Large-Scale Deep Learning:

• a. Data Size and Quality


o Big Data: Large-scale deep learning often involves datasets that are too large to fit into memory or
process on a single machine. For instance, training on millions or billions of labelled data points.
o Data Preprocessing: Handling, cleaning, augmenting, and managing large datasets require
specialized tools and distributed computing frameworks.
• b. Model Size and Complexity
o Modern deep learning models, particularly in areas like computer vision, speech recognition, and
natural language processing (NLP), can have millions or billions of parameters.
o These models can be highly memory-intensive and computationally expensive to train.
• c. Computational Resources
o Parallelism: Training large-scale models requires distributing computations across multiple
machines or GPUs, a process known as distributed training.
o GPUs and TPUs: GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) are
specialized hardware designed to accelerate deep learning operations, especially in large-scale
training tasks.


• d. Optimization and Convergence


o Training large-scale models on large datasets can lead to challenges with optimization, such as
local minima, exploding/vanishing gradients, and slow convergence.

2. Techniques for Scaling Deep Learning

• Distributed Training
o Data Parallelism: The dataset is divided into smaller chunks, and each chunk is processed by a different machine or GPU. The gradients from all machines are then aggregated to update the model weights (a toy sketch follows this section).
o Model Parallelism: The model is split across different devices, with each device handling a subset
of the model's parameters.
• Gradient Clipping
o As previously discussed, gradient clipping is used to prevent exploding gradients when training
large models, particularly on long sequences (e.g., RNNs and LSTMs).
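• A minimal single-process sketch of data parallelism is given below: each simulated “worker” computes gradients on its own shard of a minibatch, and the averaged gradient updates one shared copy of the weights. The linear model, squared-error loss and shard count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                              # shared model parameters (a linear model)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)

def local_gradient(w, X_shard, y_shard):
    """Squared-error gradient computed on one worker's shard of the data."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

# Split the minibatch across 4 simulated workers, then average their gradients.
shards = np.array_split(np.arange(len(y)), 4)
grads = [local_gradient(w, X[idx], y[idx]) for idx in shards]
w -= 0.1 * np.mean(grads, axis=0)            # one synchronized update of the shared weights
print(w)
```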

3. Scaling Up with Hardware

• GPUs and TPUs


o GPUs are widely used for deep learning because they are designed to handle the matrix operations
common in neural networks more efficiently than CPUs.
o TPUs (Tensor Processing Units) are specialized hardware accelerators built by Google specifically
for deep learning tasks. They are optimized for training and inference of large models in
TensorFlow.
o Large-scale training often involves using clusters of GPUs or TPUs, allowing simultaneous
computations over multiple devices and speeding up training time.
• Cloud Computing and Distributed Systems
o Cloud providers like Amazon Web Services (AWS), Google Cloud, and Microsoft Azure offer
scalable infrastructures for training large models. This allows deep learning practitioners to rent
vast amounts of computational power on demand.
o Cloud platforms often provide auto-scaling features, which automatically scale resources up or
down depending on the computational requirements of the task.


➢ Speech recognition: Speech recognition is the process of converting spoken language into written
text. It involves understanding, interpreting, and transcribing spoken words into a format that can be
processed by computers. It is a crucial component of various applications like voice assistants (e.g., Siri,
Google Assistant), transcription services, and voice-controlled devices.
➢ Key Features of Speech Recognition:
• Accuracy and Speed: Speech recognition systems can process speech in real time or near real time, providing quick responses to user inputs.
• Multi-Language Support: Support for multiple languages and dialects, allowing users from different
linguistic backgrounds to interact with technology in their native language.
• Background Noise Handling: Robustness to background noise is crucial for voice-activated systems used in public or outdoor settings.
➢ Challenges in Speech Recognition
• Variability in Speech
▪ Accents and Dialects: Different accents, dialects, and regional variations in pronunciation
can make speech recognition more challenging.
▪ Background Noise: Speech recognition systems can struggle when background noise is
present, such as in crowded environments or when multiple speakers are involved.
▪ Speech Rate: Variations in the speed at which people speak can impact recognition
accuracy. Fast speech, slurred speech, or pauses between words can be problematic.
• Ambiguity in Speech
▪ Homophones: Words that sound the same but have different meanings (e.g., "see" and
"sea") can confuse the model.
▪ Contextual Meaning: Words can have different meanings depending on the context (e.g.,
"bank" as a financial institution or the side of a river).
▪ Out-of-Vocabulary Words: If a word has not been seen in the training data, the system may
struggle to recognize it correctly.
• Real-Time Processing
▪ For applications like voice assistants and real-time transcription, the model must recognize
speech instantaneously with minimal delay.
▪ Efficient latency management is crucial for keeping the user experience smooth and
responsive.


➢ Natural Language Processing (NLP): Natural Language Processing (NLP) is a branch of artificial
intelligence (AI) focused on enabling computers to understand, interpret, and generate human language.
NLP combines computational linguistics, machine learning, and deep learning to process and analyze large
amounts of natural language data, such as text or speech.
• Key Tasks in Natural Language Processing:
• NLP encompasses various tasks that are fundamental to understanding and processing human language.
Some of these tasks include:
• Tokenization: The process of breaking down text into smaller units, such as words, sentences, or subword units. For example, the sentence "Natural language processing is fun!" can be tokenized into the words "Natural," "language," "processing," "is," "fun," and "!" (a minimal code sketch follows this list).
• Part-of-Speech (POS) Tagging: This task involves identifying the grammatical components of each token
in a sentence. It labels words as nouns, verbs, adjectives, etc., to understand their roles within the sentence.
• Named Entity Recognition (NER): NER involves identifying and classifying entities (such as names of
people, organizations, locations, dates, etc.) in the text.
• Sentiment Analysis: This technique determines the sentiment or emotional tone of a piece of text. It helps
classify text as positive, negative, or neutral, and can be used to analyze social media, product reviews,
etc.
• Machine Translation: The process of automatically translating text from one language to another. Google Translate and DeepL are popular examples of machine translation systems.
• Text Summarization: This involves condensing a piece of text into a shorter summary while preserving its essential meaning. There are two types of summarization:
o Extractive Summarization: Selects important sentences from the original text.
o Abstractive Summarization: Generates new sentences to convey the core ideas of the original text.
• Speech Recognition: Converting spoken language into written text. This task involves recognizing words
in spoken language and transcribing them into text format.
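• As a tiny illustration of the tokenization step mentioned above, the sketch below splits text into word and punctuation tokens; it is a minimal regular-expression splitter, not a production tokenizer.

```python
import re

def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Natural language processing is fun!"))
# ['Natural', 'language', 'processing', 'is', 'fun', '!']
```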
• Challenges in NLP:
• Ambiguity: Natural language is inherently ambiguous. Words can have multiple meanings depending on
the context (e.g., "bank" can refer to a financial institution or the side of a river). Resolving this ambiguity
is a key challenge for NLP systems.
• Sarcasm and Irony: NLP models often struggle with understanding sarcasm, irony, and other figurative
language because they require nuanced understanding of context and tone.


• Multilingual NLP: NLP systems trained in one language may not perform well in others. Building
multilingual systems that can handle many languages is a complex challenge, but there has been progress
with multilingual models like mBERT (Multilingual BERT).
• Low-Resource Languages: Many languages still lack sufficient training data or resources, making it harder
to build high-performing models for these languages.
• Bias and Fairness: NLP models trained on large corpora may inherit biases present in the data, leading to
discriminatory or unfair outcomes in applications like hiring, law enforcement, or healthcare.
