
UNIT-3

Chapter-1

Sequence Modeling: Recurrent and Recursive nets

Unfolding Computational Graphs

In deep learning, unfolding computational graphs is a process that creates a repetitive structure in a
computational graph by sharing parameters across a deep network structure. This is done by applying
the same set of computations recursively or recurrently.

Computational graphs are a way to represent mathematical operations that machines use to learn from
data. They are similar to flowcharts, where each node represents an operation and the lines between
nodes show how the results flow from one step to the next.

A computational graph is defined as a directed graph where the nodes correspond to mathematical
operations. Computational graphs are a way of expressing and evaluating a mathematical expression.

For example, here is a simple mathematical equation −

p=x+y

We can draw a computational graph of the above equation as follows.

The above computational graph has an addition node (node with "+" sign) with two input
variables x and y and one output p.

Let us take another example, slightly more complex. We have the following equation.

g=(x+y)∗z

The above equation is represented by the following computational graph.
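
To make this concrete, the following short Python sketch evaluates g = (x + y) * z node by node, exactly as the graph does, and then backpropagates gradients through the two nodes with the chain rule. The specific input values are arbitrary and only for illustration.

# Forward and backward pass over the computational graph for g = (x + y) * z.
x, y, z = 1.0, 2.0, -3.0

# Forward pass: each node performs one operation.
p = x + y            # addition node
g = p * z            # multiplication node

# Backward pass: start from dg/dg = 1 and apply the chain rule at each node.
dg_dg = 1.0
dg_dp = z * dg_dg    # multiplication node: dg/dp = z
dg_dz = p * dg_dg    # multiplication node: dg/dz = p
dg_dx = 1.0 * dg_dp  # addition node passes the gradient through: dg/dx = dg/dp
dg_dy = 1.0 * dg_dp  # dg/dy = dg/dp

print(g, dg_dx, dg_dy, dg_dz)   # -9.0 -3.0 -3.0 3.0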


Unfolding this graph results in sharing of parameters across a deep network structure.

• A computational graph is a way to formalize the structure of a set of computations, such as
those involved in mapping inputs and parameters to outputs and loss.

The idea of Unfolding Computational Graphs is sharing of parameters across a deep network structure.
Consider a recurrence of the form s(t) = f(s(t−1); θ):

• Each node represents the state at some time t.
• The function f maps the state at time t to the state at time t+1.
• The same parameters (the same value of θ used to parameterize f) are used for all time steps.

Unfolding the equation by repeatedly applying the definition in this way yields an expression that
does not involve recurrence. Such an expression can now be represented by a traditional directed
acyclic computational graph: the unfolded computational graph of the recurrence.
That is, unfolding turns a recursive or recurrent computation into a computational graph with a
repetitive structure, typically corresponding to a chain of events.
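
A minimal Python sketch of this unfolding is given below. The particular function f (a tanh of an affine map) and its toy parameters are assumptions chosen only to illustrate that the same θ is reused at every time step of the unfolded graph.

import numpy as np

# Unfolding the recurrence s_t = f(s_{t-1}; theta) for a fixed number of steps.
theta_W = np.array([[0.5, -0.3], [0.2, 0.8]])   # illustrative parameters theta
theta_b = np.array([0.1, 0.0])

def f(s_prev):
    # One application of the transition function, parameterized by theta.
    return np.tanh(theta_W @ s_prev + theta_b)

s = np.zeros(2)              # initial state s_0
unfolded_states = []
for t in range(5):           # unfold the recurrence for 5 time steps
    s = f(s)                 # the SAME parameters theta are used at every step
    unfolded_states.append(s)

# unfolded_states corresponds to the chain s_1, ..., s_5 of the unfolded
# (directed acyclic) computational graph.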

Recurrent Neural Networks


Computing the gradient in a Recurrent Neural Network

Computing the gradient in a Recurrent Neural Network (RNN) is essential for training the model. Because
of the temporal nature of RNNs, the process uses a technique called Backpropagation Through Time
(BPTT): the goal is to compute the gradients of the loss with respect to the network's parameters
(weights and biases) so that they can be updated using gradient-based optimization methods such as
stochastic gradient descent (SGD).

Gradient Computation Steps


To summarize, the steps for computing the gradients in an RNN are as follows:

1. Forward Pass: Compute the hidden states and outputs at each time step.
2. Loss Computation: Calculate the loss based on the output at each time step.
3. Backpropagation:
o Compute the gradients with respect to the output layer.
o Propagate the gradients backward through the hidden states, using the chain rule
to account for the dependence of h_t on h_{t-1}.
4. Gradient Update: Use the computed gradients to update the parameters (weights and
biases) via an optimization algorithm (e.g., SGD).

Backpropagation Through Time allows us to train RNNs by efficiently computing gradients over the
entire sequence, although care must be taken to handle issues like vanishing and exploding gradients.
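
Below is a minimal NumPy sketch of these steps for a vanilla RNN with a squared-error loss at every time step. The sizes, random data, and parameter names are illustrative assumptions, not part of the original notes.

import numpy as np

np.random.seed(0)
T, n_in, n_h = 4, 3, 5                        # sequence length, input size, hidden size
xs = [np.random.randn(n_in) for _ in range(T)]
ys = [np.random.randn(1) for _ in range(T)]   # toy targets at each time step

Wxh = np.random.randn(n_h, n_in) * 0.1
Whh = np.random.randn(n_h, n_h) * 0.1
Why = np.random.randn(1, n_h) * 0.1

# 1. Forward pass: hidden states and outputs at each time step.
hs = {-1: np.zeros(n_h)}
outs, loss = {}, 0.0
for t in range(T):
    hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
    outs[t] = Why @ hs[t]
    loss += 0.5 * np.sum((outs[t] - ys[t]) ** 2)   # 2. loss at each time step

# 3. Backpropagation Through Time.
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dh_next = np.zeros(n_h)
for t in reversed(range(T)):
    dout = outs[t] - ys[t]                    # gradient with respect to the output layer
    dWhy += np.outer(dout, hs[t])
    dh = Why.T @ dout + dh_next               # gradient flowing into h_t
    draw = (1 - hs[t] ** 2) * dh              # back through the tanh nonlinearity
    dWxh += np.outer(draw, xs[t])
    dWhh += np.outer(draw, hs[t - 1])
    dh_next = Whh.T @ draw                    # chain rule: dependence of h_t on h_{t-1}

# 4. Gradient update with plain SGD.
lr = 0.01
Wxh -= lr * dWxh
Whh -= lr * dWhh
Why -= lr * dWhy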
Bidirectional RNNs

A bi-directional recurrent neural network (Bi-RNN) is a type of recurrent neural network (RNN)
that processes input data in both forward and backward directions. The goal of a Bi-RNN is to
capture the contextual dependencies in the input data by processing it in both directions, which
can be useful in various natural language processing (NLP) tasks.

In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in the
forward direction, while the other processes it in the reverse direction. The outputs of these two
RNNs are then combined in some way to produce the final output.

One common way to combine the outputs of the forward and reverse RNNs is to concatenate
them, although other methods, such as element-wise addition or multiplication, can also be used. The
choice of combination method can depend on the specific task and the desired properties of the
final output.

Bi-directional RNNs

 A bidirectional recurrent neural network (RNN) is a type of recurrent neural network that
processes input sequences in both forward and backward directions.
 This allows the RNN to capture information from the input sequence that may be relevant to the
output prediction but that would be lost in a traditional RNN, which only processes the input
sequence in one direction.
 This allows the network to consider information from the past and future when making
predictions rather than just relying on the input data at the current time step.
 This can be useful for tasks such as language processing, where understanding the context of a
word or phrase can be important for making accurate predictions.
 In general, bidirectional RNNs can help improve a model's performance on various sequence-
based tasks.

This means that the network has two separate RNNs:

1. One that processes the input sequence from left to right


2. Another one that processes the input sequence from right to left.

These two RNNs are typically called forward and backward RNNs, respectively.
During the forward pass of the RNN, the forward RNN processes the input sequence in the usual
way by taking the input at each time step and using it to update the hidden state. The updated
hidden state is then used to predict the output.

Backpropagation through time (BPTT) is a widely used algorithm for training recurrent neural
networks (RNNs). It is a variant of the backpropagation algorithm specifically designed to
handle the temporal nature of RNNs, where the output at each time step depends on the inputs
and outputs at previous time steps.

In the case of a bidirectional RNN, BPTT involves two separate Backpropagation passes: one for
the forward RNN and one for the backward RNN. During the forward pass, the forward RNN
processes the input sequence in the usual way and makes predictions for the output sequence.
These predictions are then compared to the target output sequence, and the error is
backpropagated through the network to update the weights of the forward RNN.

The backward RNN processes the input sequence in reverse order during the backward pass and
predicts the output sequence. These predictions are then compared to the target output sequence
in reverse order, and the error is backpropagated through the network to update the weights of
the backward RNN.

Once both passes are complete, the weights of the forward and backward RNNs are updated
based on the errors computed during the forward and backward passes, respectively. This
process is repeated for multiple iterations until the model converges and the predictions of the
bidirectional RNN are accurate.

This allows the bidirectional RNN to consider information from past and future time steps when
making predictions, which can significantly improve the model's accuracy.

Working of Bidirectional Recurrent Neural Network


1. Inputting a sequence: A sequence of data points, each represented as a vector with the
same dimensionality, are fed into a BRNN. The sequence might have different lengths.
2. Dual Processing: The data is processed in both the forward and backward directions. In the
forward direction, the hidden state at time step t is computed from the input at that step and the
hidden state at step t-1. In the backward direction, the hidden state at step t is computed from the
input at step t and the hidden state at step t+1 (see the sketch after this list).
3. Computing the hidden state: A non-linear activation function on the weighted sum of
the input and previous hidden state is used to calculate the hidden state at each step. This
creates a memory mechanism that enables the network to remember data from earlier
steps in the process.
4. Determining the output: A non-linear activation function is used to determine the
output at each step from the weighted sum of the hidden state and a number of output
weights. This output has two options: it can be the final output or input for another layer
in the network.
5. Training: The network is trained through a supervised learning approach where the goal
is to minimize the discrepancy between the predicted output and the actual output. The
network adjusts its weights in the input-to-hidden and hidden-to-output connections
during training through backpropagation.
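
The following NumPy sketch illustrates steps 1-3 above: a left-to-right pass, a right-to-left pass over the reversed sequence, and concatenation of the two hidden states at each time step. The toy dimensions and random weights are assumptions for illustration only.

import numpy as np

np.random.seed(1)
T, n_in, n_h = 6, 4, 3
xs = [np.random.randn(n_in) for _ in range(T)]   # the input sequence

def rnn_pass(inputs, Wx, Wh):
    # Run a simple tanh RNN over `inputs` and return all hidden states.
    h = np.zeros(n_h)
    states = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

Wx_f, Wh_f = np.random.randn(n_h, n_in) * 0.5, np.random.randn(n_h, n_h) * 0.1
Wx_b, Wh_b = np.random.randn(n_h, n_in) * 0.5, np.random.randn(n_h, n_h) * 0.1

h_forward = rnn_pass(xs, Wx_f, Wh_f)                 # forward RNN (left to right)
h_backward = rnn_pass(xs[::-1], Wx_b, Wh_b)[::-1]    # backward RNN, re-aligned to time order

# Combine the two directions at every time step, here by concatenation
# (element-wise addition or multiplication are alternatives mentioned above).
h_combined = [np.concatenate([hf, hb]) for hf, hb in zip(h_forward, h_backward)]
# h_combined[t] now summarizes both past (x_1..x_t) and future (x_t..x_T) context.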

Encoder-Decoder sequence-to-sequence architectures

The Encoder-Decoder architecture is a fundamental model for sequence-to-sequence (Seq2Seq)


tasks, which are common in many natural language processing (NLP) and machine learning
applications such as machine translation, text summarization, and speech recognition. This
architecture is designed to handle variable-length input and output sequences.

The Seq2Seq model, typically using Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM) networks, or Gated Recurrent Units (GRUs), is composed of two primary
components:

1. Encoder
2. Decoder

Let’s take machine translation from English to French as an example. Given an input sequence in
English: “They”, “are”, “watching”, “.”, this encoder–decoder architecture first encodes the
variable-length input into a state, then decodes the state to generate the translated sequence,
token by token, as output: “Ils”, “regardent”, “.”.
1. Encoder

The encoder is responsible for processing the input sequence and encoding it into a fixed-length
context vector (or a sequence of context vectors in the case of more advanced models). It reads
the input sequence step-by-step, updating its hidden state at each time step to reflect the input
seen so far.

 Input Sequence: The encoder takes a sequence of tokens (e.g., words in a sentence,
characters in a string) as input, typically represented as embeddings (e.g., word
embeddings, character embeddings, or one-hot vectors).
 Hidden States: The encoder uses an RNN, LSTM, or GRU to maintain a hidden state
h_t at each time step t. This state is updated as the sequence is processed.
 Output of the Encoder: In traditional Seq2Seq, the encoder outputs a final hidden state
(often a context vector or latent vector), which is passed to the decoder. This context
vector is expected to encapsulate all the relevant information from the entire input
sequence. However, in more advanced versions, like the Attention-based models, this
hidden state is used differently to allow the decoder to focus on different parts of the
input sequence at each decoding step.

Key Steps in the Encoder:

 Process input sequence token by token.


 Update hidden state after each token.
 Generate a context vector (last hidden state).

2. Decoder

The decoder is responsible for generating the output sequence based on the encoded context
vector (from the encoder). It does this by using the context vector as its initial state and
generating tokens step-by-step, conditioned on the previous tokens it has generated.

 Initial Hidden State: The decoder’s initial hidden state is typically set to the final hidden
state of the encoder, representing the full input sequence.
 Generating Output Sequence: At each decoding step, the decoder produces an output
token (e.g., a word or character) based on its current hidden state and the context vector.
It also takes the previous output token as input for generating the next token, allowing for
autoregressive sequence generation.
 Autoregressive Generation: The decoder generates one token at a time. For each token
generated, the hidden state is updated, and the token is passed to the next step in the
sequence as input for the subsequent time step. In many cases, the decoder uses teacher
forcing during training, where the actual previous token (from the ground truth) is fed
into the decoder rather than the model's own prediction.

Key Steps in the Decoder:

 Use the encoder's final hidden state (or context vector) as the initial state.
 Autoregressively generate each token in the output sequence, based on the context and
previous tokens.

3. Encoder-Decoder Flow

 Step 1: The input sequence x = (x_1, x_2, ..., x_T) is processed by the encoder. The encoder
updates its hidden state h_t at each time step.
 Step 2: The final hidden state h_T (or context vector) is passed to the decoder.
 Step 3: The decoder generates the output sequence y = (y_1, y_2, ..., y_T), using its internal
hidden state s_t and the previous tokens y_1, y_2, ..., y_{t-1} to generate each token autoregressively.

4. Loss Function

The output sequence is compared to the ground truth sequence, and a loss function (typically
cross-entropy loss) is used to compute the difference between the predicted and actual tokens at
each time step. The objective is to minimize this loss during training.
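
As a minimal illustration of this flow, here is a Keras sketch of an LSTM encoder-decoder trained with teacher forcing and a per-time-step cross-entropy loss. The vocabulary sizes, dimensions, and variable names are illustrative assumptions rather than a prescribed implementation.

from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab, emb_dim, hid_dim = 5000, 6000, 128, 256   # toy sizes

# Encoder: read the source sequence and keep only its final states (the context).
enc_inputs = keras.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(hid_dim, return_state=True)(enc_emb)

# Decoder: initialized with the encoder's final state, it predicts the next
# target token from the previous target token (teacher forcing at training time).
dec_inputs = keras.Input(shape=(None,), name="target_tokens_shifted")
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(hid_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = keras.Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Training would call model.fit([src_ids, tgt_ids_shifted], tgt_ids, ...),
# applying cross-entropy between predicted and actual tokens at each time step.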

5. Challenges in Encoder-Decoder Architectures


 Fixed-size Bottleneck: In early sequence-to-sequence models, the encoder produces a
fixed-size context vector (the last hidden state). This can be limiting, especially for long
sequences, as the entire information must be compressed into a single vector, potentially
causing a loss of information.
 Vanishing/Exploding Gradients: When backpropagating through time (BPTT),
especially for long sequences, the gradients can either vanish (leading to poor learning) or
explode (causing instability). This can be mitigated by using advanced RNN architectures
like LSTMs or GRUs.

6. Attention Mechanism

A key advancement in Seq2Seq models is the Attention Mechanism, which addresses the
bottleneck problem by allowing the decoder to attend to different parts of the input sequence at
each decoding step, rather than relying on a single context vector.

 How Attention Works:


o At each step of the decoding process, the attention mechanism computes a set of
attention weights that determine which parts of the input sequence are most
relevant to the current decoding step.
o These weights are used to form a weighted sum of the encoder’s hidden states,
which becomes the decoder's context at each time step.
 Benefits of Attention:
o Allows the model to focus on different parts of the input sequence at different
times, improving translation accuracy and handling long-range dependencies
better.
o Leads to more interpretable models, as it provides insight into which input tokens
are being "attended" to when generating each output token.

Variants of Attention:

 Bahdanau Attention: Introduced by Bahdanau et al., this is an additive attention


mechanism where the attention score is computed using a feed-forward neural network.
 Luong Attention: Introduced by Luong et al., this is a multiplicative attention
mechanism, where the attention score is computed using dot product operations.
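
As a concrete illustration of the weighted-sum idea, here is a minimal NumPy sketch of Luong-style (dot-product) attention for a single decoding step. The dimensions and random placeholder states are assumptions for illustration only.

import numpy as np

np.random.seed(2)
T_enc, d = 7, 16
encoder_states = np.random.randn(T_enc, d)   # h_1 ... h_T from the encoder
decoder_state = np.random.randn(d)           # current decoder hidden state s_t

# 1. Attention scores: one scalar per encoder position (dot product).
scores = encoder_states @ decoder_state                  # shape (T_enc,)

# 2. Attention weights: a softmax turns the scores into a distribution.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# 3. Context vector: the weighted sum of the encoder's hidden states,
#    which becomes the decoder's context at this time step.
context = weights @ encoder_states                       # shape (d,)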

7. Transformer Architecture

The Transformer architecture, introduced by Vaswani et al., is an evolution of the encoder-


decoder architecture. Unlike traditional RNN-based Seq2Seq models, transformers use self-
attention and positional encodings to handle sequential data.

 Self-Attention: Each token in the input sequence attends to every other token, allowing
the model to capture long-range dependencies without relying on recurrent connections.
 Positional Encoding: Since transformers don’t inherently capture the order of sequences,
positional encodings are added to provide the model with information about the positions
of tokens in the sequence.

The Transformer architecture, which is the foundation of models like BERT, GPT, and T5, has
largely replaced RNN-based encoder-decoder models in many applications due to its parallelism
and ability to handle long-range dependencies more efficiently.

8. Applications of Encoder-Decoder Models

Encoder-decoder architectures are widely used in sequence-based tasks, including:

 Machine Translation: Translating sentences from one language to another (e.g., English
to French).
 Speech Recognition: Converting spoken language into text.
 Text Summarization: Generating concise summaries of long texts.
 Image Captioning: Generating descriptive captions for images (using a CNN encoder
for image features and an RNN decoder).
 Dialogue Systems: Generating responses in a conversational agent.

The Encoder-Decoder architecture is a powerful framework for sequence-to-sequence tasks,


where the encoder processes the input sequence and encodes it into a fixed-length context vector,
which the decoder uses to generate the output sequence. Attention mechanisms enhance this
architecture by allowing the model to focus on relevant parts of the input sequence, and the
Transformer architecture has further advanced this approach, replacing RNNs with self-attention
mechanisms for more efficient parallel processing and better handling of long-range
dependencies.

Deep Recurrent networks

A deep recurrent neural network (Deep RNN) is a type of neural network used to process sequential
data, such as time series, words, or sentences. Deep RNNs are well-suited for this task because their
feedback connections allow information to persist over time, which lets them capture temporal
dependencies and patterns.

Deep RNN is a type of computer program that can learn to recognize patterns in data that occur in a
sequence, like words in a sentence or musical notes in a song. It works by processing information in
layers, building up a more complete understanding of the data with each layer. This helps it capture
complex relationships between the different pieces of information and make better predictions about
what might come next.

Deep RNNs are used in many real-life applications, such as speech recognition systems like Siri or Alexa,
language translation software, and even self-driving cars. They’re particularly useful in situations where
there’s a lot of sequential data to process, like when you’re trying to teach a computer to understand
human language.

Deep RNNs, with their ability to handle sequential data and capture complex relationships between
input and output sequences, have become a powerful tool in various real-life applications, ranging from
speech recognition and natural language processing to music generation and autonomous driving.

Deep RNN (Recurrent Neural Network) refers to a neural network architecture that has multiple layers
of recurrent units. Recurrent Neural Networks are a type of neural network that is designed to handle
sequential data, such as time series or natural language, by maintaining an internal memory of previous
inputs.

A Deep RNN takes the output from one layer of recurrent units and feeds it into the next layer, allowing
the network to capture more complex relationships between the input and output sequences. The
number of layers in a deep RNN can vary depending on the complexity of the problem being solved, and
the number of hidden units in each layer can also be adjusted.
Deep RNNs have been successfully applied in various applications such as natural language processing,
speech recognition, image captioning, and music generation. The use of deep RNNs has been shown to
significantly improve performance compared to single-layer RNNs or shallow neural networks.

Steps to develop a deep RNN application

Developing an end-to-end deep RNN application involves several steps, including data preparation,
model architecture design, training the model, and deploying it. Here is an example of an end-to-end
deep RNN application for sentiment analysis:

1. Data preparation: The first step is to gather and preprocess the data. In this case, we’ll need a
dataset of text reviews labelled with positive or negative sentiment. The text data needs to be
cleaned, tokenized, and converted to the numerical format. This can be done using libraries like
NLTK or spaCy in Python.

2. Model architecture design: The next step is to design the deep RNN architecture. We’ll need to
decide on the number of layers, number of hidden units, and type of recurrent unit (e.g. LSTM
or GRU). We’ll also need to decide how to handle the input and output sequences, such as using
padding or truncation.

3. Training the model: Once the architecture is designed, we’ll need to train the model using the
preprocessed data. We’ll split the data into training and validation sets and train the model
using an optimization algorithm like stochastic gradient descent. We’ll also need to set
hyperparameters like learning rate and batch size.

4. Evaluating the model: After training, we’ll evaluate the model’s performance on a separate test
set. We’ll use metrics like accuracy, precision, recall, and F1 score to assess the model’s
performance.

5. Deploying the model: Finally, we’ll deploy the trained model to a production environment,
where it can be used to classify sentiment in real-time. This could involve integrating the model
into a web application or API.
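
As a rough illustration of steps 2 and 3 above, here is a minimal Keras sketch of a stacked (deep) LSTM sentiment classifier. The vocabulary size, layer widths, and hyperparameters are illustrative assumptions rather than recommended settings.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20000        # assumed size of the tokenizer's vocabulary

model = keras.Sequential([
    layers.Embedding(vocab_size, 128),       # token ids -> dense vectors
    layers.LSTM(64, return_sequences=True),  # lower recurrent layer
    layers.LSTM(32),                         # upper recurrent layer (stacked = deep RNN)
    layers.Dense(1, activation="sigmoid"),   # positive / negative sentiment
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Step 3 (training): x_train holds padded/truncated token-id sequences, y_train holds 0/1 labels.
# model.fit(x_train, y_train, validation_split=0.2, batch_size=64, epochs=5)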

Features of Deep RNNs

 Stacked Architecture:

o Deep RNNs consist of multiple hidden layers stacked on top of each other.

o Each layer can learn different levels of abstraction, improving the model's ability to
capture complex patterns in the data.

 Temporal Dynamics:

o The hidden state is updated at each time step based on the current input and the
previous hidden state.

o This allows the network to maintain context over long sequences, which is crucial for
tasks involving sequential data.

Real life examples

Deep RNNs have been successfully applied in various real-life applications. Here are a few examples:

1. Speech Recognition: Deep RNNs have been used to build speech recognition systems, such as
Google’s Speech API, Amazon’s Alexa, and Apple’s Siri. These systems use deep RNNs to convert
speech signals into text.

2. Natural Language Processing (NLP): Deep RNNs are used in various NLP applications, such as
language translation, sentiment analysis, and text classification. For example, Google Translate
uses a deep RNN to translate text from one language to another.

3. Music Generation: Deep RNNs have been used to generate music, such as Magenta’s MusicVAE,
which uses a deep RNN to generate melodies and harmonies.

4. Image Captioning: Deep RNNs are used in image captioning systems, such as Google’s Show and
Tell, which uses a deep RNN to generate captions for images.

5. Autonomous Driving: Deep RNNs have been used in autonomous driving systems to predict the
behaviour of other vehicles on the road, such as the work done by Waymo.

Deep RNNs represent a significant advancement in the field of neural networks, enabling more
sophisticated modeling of sequential data and improving performance across various applications.

Recursive Neural Network


Recursive Neural Network (RvNN): a neural network that can operate on structured inputs of variable
size, such as the parse tree of a sentence. It applies the same weights recursively over the structure to
make structured predictions.

Recursive Neural Networks (RvNNs) are a type of neural network architecture designed to process
structured data, particularly hierarchical or tree-structured inputs. Unlike traditional feedforward or
recurrent neural networks, which operate on fixed-size sequences, RvNNs can handle variable-sized
inputs and are particularly effective for tasks involving nested structures, such as natural language
processing (NLP) and computer vision.

Recursive Neural Networks (RvNNs) are deep neural networks used for natural language processing.
We get a Recursive Neural Network when the same weights are applied recursively on a structured
input to obtain a structured prediction.
● Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn
detailed and structured information. With RvNN, you can get a structured prediction by recursively
applying the same set of weights on structured inputs. The word recursive indicates that the neural
network is applied to its output.
● Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data.
The tree structure means combining child nodes and producing parent nodes. Each child-parent bond
has a weight matrix, and similar children have the same weights. The number of children for every
node in the tree is fixed to enable it to perform recursive operations and use the same weights.
RvNNs are used when there's a need to parse an entire sentence.
Architecture

The architecture of a Recursive Neural Network can be described as follows:

1. Tree Structure:
o RvNNs are built on a tree structure, where each node represents a substructure of the
input data. The leaves of the tree are the input features, while the internal nodes
represent the composition of those features.
2. Recursive Composition:
o The network processes the input data recursively. Starting from the leaves (the base
elements), the network combines them into higher-level representations as it moves up
the tree.
o Each internal node computes a representation based on its child nodes using a neural
network function, typically a feedforward neural network.
3. Activation Functions:
o Common activation functions used in RvNNs include ReLU (Rectified Linear Unit) and
sigmoid functions, which help to introduce non-linearity into the model.
4. Weight Sharing:
o The same set of weights is used for each composition operation, allowing the network
to generalize across different parts of the input structure.
Explanation

How Recursive Neural Networks Work:

1. Input Representation:
o Inputs are represented as a tree structure. For example, in NLP, sentences can be
represented as parse trees, where each node corresponds to a word or phrase.
2. Recursive Processing:
o The RvNN processes the tree in a bottom-up manner:
 For each leaf node (e.g., a word in a sentence), it computes an initial
representation (e.g., an embedding).
 Moving up the tree, each internal node computes its representation by
combining the representations of its child nodes.
3. Combining Representations:
o The combination of representations can be expressed mathematically as h_i = f(h_left, h_right).
o Here, h_i is the representation of the parent node, h_left and h_right are the
representations of the child nodes, and f is a neural network function that combines
them (e.g., a simple feedforward layer).
4. Output Layer:
o Once the root of the tree is reached, the final representation can be used for various
tasks, such as classification, regression, or generating outputs.
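
Here is a minimal NumPy sketch of this bottom-up composition with shared weights. The embedding size, random weights, and the tiny two-node tree are illustrative assumptions.

import numpy as np

np.random.seed(3)
d = 8
W = np.random.randn(d, 2 * d) * 0.1   # composition weights, shared by every internal node
b = np.zeros(d)

def compose(h_left, h_right):
    # Parent representation h_i = f(h_left, h_right), using the shared weights.
    return np.tanh(W @ np.concatenate([h_left, h_right]) + b)

# Leaf representations (in practice these would come from a word-embedding table).
emb = {w: np.random.randn(d) for w in ["was", "not", "good"]}

# Bottom-up pass over the subtree (was (not good)):
h_not_good = compose(emb["not"], emb["good"])   # internal node for "not good"
h_phrase = compose(emb["was"], h_not_good)      # internal node for "was not good"
# The root representation (here h_phrase) can feed a classifier, e.g. for sentiment.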

Applications

Recursive Neural Networks are particularly useful in several domains:

1. Natural Language Processing (NLP):


o RvNNs can be employed for sentiment analysis, where the hierarchical structure of
sentences and phrases is crucial for understanding context and meaning.
2. Image Processing:
o In computer vision, RvNNs can be used to analyze images with hierarchical structures,
such as scenes with objects that have relationships.
3. Semantic Parsing:
o RvNNs can be used to parse sentences into logical forms, capturing the hierarchical
relationships between different components of the sentence.
4. Graph Data:
o They can also be applied to graph-structured data, where the relationships between
nodes can be modeled recursively.

Example

Here’s a simplified example of how an RvNN might be structured for sentiment analysis of a sentence:

1. Input: "The movie was not good."


2. Parse Tree:

                 S
               /   \
             NP     VP
            /  \   /  \
          Det   N V    AdjP
           |    | |   /    \
          The movie was Neg  Adj
                         |    |
                        not  good

3. Processing:

 Each word is represented as a vector (embedding).


 The RvNN processes the tree recursively, combining the embeddings of "not" and "good" to
form a representation for "not good", and then combining that with the representation of "was"
to form the representation of the verb phrase "was not good".
 Finally, the representation of the entire sentence is used to predict its sentiment.

The Challenge of Long-Term Dependencies

Long-term dependencies refer to the difficulty that certain machine learning models,
especially traditional recurrent neural networks (RNNs), face in learning and retaining
information over extended sequences of data. In many real-world tasks, the relationship
between inputs and outputs can span long time intervals, making it challenging for models to
capture and utilize relevant information that may be separated by many time steps.
Why Long-Term Dependencies are Challenging

1. Vanishing and Exploding Gradients:


o Vanishing Gradients: During backpropagation, gradients can become very
small, effectively preventing the model from updating weights associated with
earlier layers or time steps. This makes it difficult for the model to learn from
long-range dependencies.
o Exploding Gradients: Conversely, gradients can also become excessively large,
leading to unstable updates and making the training process difficult. Both issues
hinder the model's ability to learn effectively from long sequences.
o Training a neural network can become unstable given the choice of error function,
learning rate, or even the scale of the target variable. Large updates to weights
during training can cause a numerical overflow or underflow often referred to as
“Exploding Gradients.”
o The problem of exploding gradients is more common with recurrent neural
networks, such as LSTMs given the accumulation of gradients unrolled over
hundreds of input time steps.
2. Limited Memory Capacity:
o Traditional RNNs have a limited ability to remember information from earlier
time steps due to their architecture. As sequences grow longer, the influence of
earlier inputs diminishes, which can lead to poor performance on tasks that
require understanding context over long distances.
3. Sequential Processing:
o RNNs process sequences one time step at a time, which can create challenges in
capturing dependencies that span across many time steps. The sequential nature of
RNNs can lead to inefficiencies in learning long-range dependencies.

 Neural network optimization faces a difficulty when computational graphs become deep,
e.g.,
■ Feedforward networks with many layers
■ RNNs that repeatedly apply the same operation at each time step of a long
temporal sequence
● Gradients propagated over many stages tend to either vanish (most of the time) or
explode (damaging optimization).
● The difficulty with long-term dependencies arises from the exponentially smaller weights
given to long-term interactions (involving the multiplication of many Jacobians), as illustrated below.
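
The following NumPy sketch shows this effect numerically: backpropagating through T time steps multiplies the gradient by one Jacobian per step, so its norm shrinks toward zero or blows up depending on whether the Jacobian's spectral radius is below or above one. The toy matrices are assumptions for illustration.

import numpy as np

def gradient_norm_after(T, scale):
    W = scale * np.eye(4)          # toy Jacobian with spectral radius equal to `scale`
    g = np.ones(4)                 # gradient arriving at the final time step
    for _ in range(T):
        g = W.T @ g                # chain rule: one Jacobian multiplication per step
    return np.linalg.norm(g)

print(gradient_norm_after(50, 0.9))   # proportional to 0.9**50 -> vanishing gradient
print(gradient_norm_after(50, 1.1))   # proportional to 1.1**50 -> exploding gradient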
Leaky Units & Other strategies for multiple timescales

When dealing with multiple time scales in computational models, including neural networks, there are
several strategies that can be employed. One such strategy involves using leaky units or leaky
integrators. Leaky units introduce a decay factor that allows information to leak or decay over time,
enabling the model to capture longer-term dependencies.

Leaky units can be applied to recurrent neural networks (RNNs) or other types of models where
temporal dynamics are important. By incorporating leaky units, the model can retain information from
previous time steps while also allowing for gradual forgetting or decay of irrelevant information. This
helps capture both short-term and long-term dependencies in the data.

Another strategy is the use of gated units, such as long short-term memory (LSTM) or gated recurrent
units (GRUs). Gated units have mechanisms that selectively update and control the flow of information
through time. They have built-in mechanisms to retain or forget information based on input and internal
gating signals, allowing them to capture different time scales effectively.

Additionally, the concept of time constants can be applied to determine the rates at which different
variables or units update in a model. By assigning different time constants to various components, the
model can simulate dynamics at different time scales more accurately.

Furthermore, hierarchical or multiscale architectures can be employed to handle multiple time scales.
These architectures consist of multiple layers or modules, each responsible for processing information at
a specific time scale. The output of one layer/module is fed as input to the next layer/module, allowing
information to flow across different scales.

Lastly, techniques like time-delay embedding or Fourier analysis can be used for time series analysis to
extract relevant features or patterns at different frequencies or time scales.

These strategies, including leaky units, gated units, time constants, hierarchical architectures, and time
series analysis techniques, provide ways to handle multiple time scales in computational models and
facilitate the modeling of complex temporal dynamics.

In machine learning, especially in the context of neural networks dealing with sequential data or
time series, capturing information across multiple time scales is critical. Different phenomena
can occur over varying durations, so models must be able to effectively learn from both short-
term and long-term patterns. Here, we discuss leaky units and various strategies to manage
multiple time scales.

1. Leaky Units

Definition: Leaky units are variations of traditional activation functions designed to mitigate
issues like the "dying ReLU" problem, where neurons become inactive and stop learning. By
allowing a small, non-zero gradient when the unit is not active, leaky units can maintain some
level of learning.

Leaky ReLU: The most common leaky unit is the Leaky ReLU (Rectified Linear Unit), defined
as:

f(x) = x if x > 0
f(x) = αx if x ≤ 0

where α is a small constant (e.g., 0.01). This allows a small gradient to flow when the
input is negative, helping to maintain some learning capability even for non-active neurons.

Benefits:

 Gradient Flow: Leaky ReLU helps maintain gradient flow during backpropagation,
reducing the risk of neurons becoming inactive.
 Memory Retention: By allowing small outputs from inactive neurons, leaky units can
help the model retain some memory of previous states, which can be beneficial for
capturing dependencies over multiple time scales.
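
Below is a minimal NumPy sketch of the two "leaky" ideas discussed in this section: the Leaky ReLU defined above and the leaky integrator (a running average with a decay factor) described at the start of the section. The constants are illustrative assumptions.

import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = x if x > 0, alpha * x otherwise: keeps a small gradient alive.
    return np.where(x > 0, x, alpha * x)

def leaky_integrator(inputs, alpha=0.9):
    # Running state mu_t = alpha * mu_{t-1} + (1 - alpha) * x_t.
    # alpha close to 1 remembers the distant past (slow timescale);
    # alpha close to 0 tracks only recent inputs (fast timescale).
    mu, trace = 0.0, []
    for x in inputs:
        mu = alpha * mu + (1 - alpha) * x
        trace.append(mu)
    return trace

print(leaky_relu(np.array([-2.0, 0.5])))      # [-0.02  0.5 ]
print(leaky_integrator([1, 0, 0, 0, 0])[:3])  # slowly decaying memory of the initial 1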

2. Gated Mechanisms

Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM): Both GRUs and
LSTMs use gating mechanisms to control the flow of information, making them adept at
handling multiple time scales.

 Gates:
o Input Gate: Controls how much of the new input to incorporate.
o Forget Gate: Decides what information to discard from the cell state.
o Output Gate: Determines what information to output based on the cell state.

These mechanisms enable the model to maintain different timescales of information, effectively
allowing it to learn both short-term and long-term dependencies.

3. Attention Mechanisms

Attention: Attention mechanisms allow models to focus on specific parts of the input sequence
when making predictions. This is particularly useful in scenarios where relevant information may
be distributed across various time steps.

 Self-Attention: Computes attention scores for all pairs of positions in the input sequence,
allowing the model to weigh the importance of each position relative to others.
 Multi-Head Attention: Enables the model to attend to different parts of the sequence in
parallel, capturing various aspects of the data at different timescales.

Benefits:
 Dynamic Focus: Attention mechanisms enable the model to dynamically adjust its focus
based on the input, allowing it to capture dependencies over varying timescales.
 Parallel Processing: Unlike RNNs, which process inputs sequentially, attention
mechanisms allow for parallelization, improving computational efficiency.

4. Temporal Convolutional Networks (TCNs)

TCNs: Temporal Convolutional Networks leverage convolutional layers to process sequences.


They utilize causal convolutions to ensure that predictions for time step (t) depend only on
current and past inputs, preserving the temporal order of the data.

Benefits:

 Multiple Kernels: TCNs can use multiple convolutional kernels of different sizes,
allowing them to capture patterns across various timescales.
 Residual Connections: These connections help maintain gradient flow and allow for
deeper architectures, enhancing the model's ability to learn complex temporal
dependencies.

5. Hierarchical Models

Hierarchical RNNs: Hierarchical RNNs consist of multiple RNN layers, where each layer
captures information at different timescales. The lower layers can focus on short-term
dependencies, while higher layers learn from the aggregated information of the lower layers.

Benefits:

 Layered Learning: Different layers can specialize in capturing different timescales,


allowing the model to learn a richer representation of the data.
 Improved Contextual Understanding: Hierarchical structures can provide a better
understanding of context by processing information at various levels of granularity.

6. Memory-Augmented Networks

Memory Networks: Memory-augmented networks incorporate an external memory component


that allows the model to read from and write to memory. This enables the network to store
information over long periods, making it easier to capture long-term dependencies.

Benefits:

 Long-Term Memory: External memory can hold information that may be relevant for
future time steps, effectively allowing the model to manage long-term dependencies.
 Flexible Access: The model can selectively access memory based on the current input,
improving its ability to utilize relevant historical information.
LSTM
LSTM excels in sequence prediction tasks, capturing long-term dependencies. Ideal for time series,
machine translation, and speech recognition due to order dependence. The article provides an in-depth
introduction to LSTM, covering the LSTM model, architecture, working principles, and the critical role
they play in various applications.
Long Short-Term Memory is an improved version of recurrent neural network designed by Hochreiter &
Schmidhuber.
A traditional RNN has a single hidden state that is passed through time, which can make it difficult for
the network to learn long-term dependencies. LSTMs address this problem by introducing a
memory cell, a container that can hold information for an extended period.
LSTM architectures are capable of learning long-term dependencies in sequential data, which makes
them well-suited for tasks such as language translation, speech recognition, and time series forecasting.
LSTMs can also be used in combination with other neural network architectures, such as Convolutional
Neural Networks (CNNs) for image and video analysis.
LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates: the input gate, the
forget gate, and the output gate. These gates decide what information to add to, remove from, and
output from the memory cell.
 The input gate controls what information is added to the memory cell.
 The forget gate controls what information is removed from the memory cell.
 The output gate controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the network,
which allows them to learn long-term dependencies.
The LSTM maintains a hidden state, which acts as the short-term memory of the network. The hidden
state is updated based on the input, the previous hidden state, and the memory cell’s current state.

Input gate

The addition of useful information to the cell state is done by the input gate. First, the
information is regulated using the sigmoid function, which filters the values to be remembered
(similar to the forget gate) using the inputs h_{t-1} and x_t. Then, a vector of candidate values is
created using the tanh function, which gives outputs between -1 and +1 based on h_{t-1} and x_t.
Finally, the regulated values and the candidate vector are multiplied to obtain the useful information.
The equations for the input gate are:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

Ĉ_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

We multiply the previous cell state C_{t-1} by f_t (the output of the forget gate,
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)), discarding the information we had previously chosen to ignore.
Next, we add i_t ⊙ Ĉ_t, the candidate values scaled by how much we chose to update each state value:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t

where

 ⊙ denotes element-wise multiplication

 tanh is the hyperbolic tangent activation function


Output gate

The task of extracting useful information from the current cell state to be presented as output is
done by the output gate. First, a vector is generated by applying the tanh function to the cell state.
Then, the information is regulated using the sigmoid function, which filters the values to be
remembered using the inputs h_{t-1} and x_t. Finally, the values of the vector and the regulated
values are multiplied to produce the output, which is also sent as input to the next cell. The
equations for the output gate are:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t ⊙ tanh(C_t)
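
Putting the gate equations together, here is a minimal NumPy sketch of a single LSTM cell step. The sizes and random weights are illustrative assumptions, and the forget-gate parameters W_f, b_f (only referenced via f_t above) are written out explicitly here.

import numpy as np

np.random.seed(5)
n_h, n_in = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t].
Wf, Wi, Wc, Wo = (np.random.randn(n_h, n_h + n_in) * 0.1 for _ in range(4))
bf, bi, bc, bo = (np.zeros(n_h) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)            # forget gate
    i_t = sigmoid(Wi @ z + bi)            # input gate
    C_hat = np.tanh(Wc @ z + bc)          # candidate values
    C_t = f_t * C_prev + i_t * C_hat      # new cell state (long-term memory)
    o_t = sigmoid(Wo @ z + bo)            # output gate
    h_t = o_t * np.tanh(C_t)              # new hidden state (short-term memory)
    return h_t, C_t

h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in [np.random.randn(n_in) for _ in range(3)]:   # a toy 3-step sequence
    h, C = lstm_step(x_t, h, C)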
Applications of LSTM

Some of the famous applications of LSTM includes:

 Language Modeling: LSTMs have been used for natural language processing tasks such as
language modeling, machine translation, and text summarization. They can be trained to
generate coherent and grammatically correct sentences by learning the dependencies between
words in a sentence.

 Speech Recognition: LSTMs have been used for speech recognition tasks such as transcribing
speech to text and recognizing spoken commands. They can be trained to recognize patterns in
speech and match them to the corresponding text.

 Time Series Forecasting: LSTMs have been used for time series forecasting tasks such as
predicting stock prices, weather, and energy consumption. They can learn patterns in time series
data and use them to make predictions about future events.

 Anomaly Detection: LSTMs have been used for anomaly detection tasks such as detecting fraud
and network intrusion. They can be trained to identify patterns in data that deviate from the
norm and flag them as potential anomalies.

 Recommender Systems: LSTMs have been used for recommendation tasks such as
recommending movies, music, and books. They can learn patterns in user behavior and use
them to make personalized recommendations.
 Video Analysis: LSTMs have been used for video analysis tasks such as object detection, activity
recognition, and action classification. They can be used in combination with other neural
network architectures, such as Convolutional Neural Networks (CNNs), to analyze video data
and extract useful information.
