DL 4


Unfolding Computational Graphs

Unfolding computational graphs is a technique used to represent recurrent neural networks (RNNs) as acyclic graphs. This is done by unfolding the RNN's recurrence over a finite time horizon. The resulting graph can then be trained using standard backpropagation techniques.

For example, in a recurrent neural network used for natural language processing, the
computational graph can be unfolded over time to represent the processing of a sentence. Each
word in the sentence corresponds to a time step, and the same set of operations is applied to
each word. The computational graph is replicated for each time step, with the outputs of one
step serving as inputs for the next step.

Unfolding the computational graph allows us to visualize the repeated computations and the
flow of data across different time steps. It helps in understanding how information is processed
and propagated through the model over time. It also enables efficient computation by reusing
the same set of operations and their corresponding parameters for each time step, rather than
recreating them for each iteration.

Advantages of Unfolding Process


The unfolding process introduces two major advantages:
1. Regardless of sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to the next rather than in terms of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing a separate model g(t) for every possible time step.
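
This parameter sharing is easy to see in code. Below is a minimal sketch in Python/NumPy (the dimensions, weight names, and the tanh transition are illustrative assumptions): a single transition function f is applied at every time step, and unrolling the loop over T steps yields an acyclic chain of T identical blocks.

```python
import numpy as np

# Illustrative dimensions and randomly initialized parameters.
state_size, input_size, T = 4, 3, 5
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(state_size, state_size))
W_x = rng.normal(scale=0.1, size=(state_size, input_size))
b = np.zeros(state_size)

def f(h_prev, x_t):
    """State-transition function: one set of parameters, reused at every step."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

xs = rng.normal(size=(T, input_size))   # an arbitrary input sequence
h = np.zeros(state_size)                # initial state h(0)
states = []
for t in range(T):                      # each iteration is one node of the unfolded graph
    h = f(h, xs[t])                     # same f, same W_h, W_x, b at every time step
    states.append(h)
```

Because every iteration calls the same f with the same parameters, the unrolled loop is exactly the finite acyclic graph that backpropagation through time differentiates.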

Recurrent Neural Networks


Recurrent Neural Networks (RNNs) are a type of neural network where the results of one step
are fed into the next step's computations. Unlike feedforward neural networks, which process
inputs in a single pass without any memory, RNNs have a built-in memory mechanism that
allows them to capture dependencies and patterns in sequential data.

Traditional neural networks treat their inputs and outputs as independent of one another, but tasks such as predicting the next word in a sentence require remembering the previous words.

RNNs were developed to resolve this problem by means of a hidden state. The hidden state, which retains information about the sequence seen so far, is the primary and most significant characteristic of an RNN.
The key feature of RNNs is their ability to maintain and update an internal state, also known as
a hidden state or memory, which is passed from one step to the next. This hidden state serves
as a form of memory that captures information from previous steps and influences the
processing of future steps. It allows RNNs to consider the context and temporal information in
sequential data.

For example, suppose the RNN is trained on the sentence "The quick brown fox jumps over the lazy dog." The RNN learns to predict the next word by using its internal state to store information about the preceding words. When the RNN is presented with the word "quick," its internal state already carries information about the word "The," and this information is used to predict the next word in the sentence, which is "brown."
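
In code, the per-word update this example describes might look like the following minimal sketch (the weight names and the tanh/linear form are assumptions for illustration, not a complete trained model):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN step: the new hidden state mixes the current input
    with the previous hidden state (the memory), and an output is read off it."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # update the memory
    y_t = W_hy @ h_t + b_y                           # e.g. next-word scores
    return h_t, y_t
```

Feeding each returned h_t back in as h_prev for the next word is what lets information about "The" survive long enough to influence the prediction made after "quick."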

RNNs are used for tasks such as translating between languages and generating text.

Working of each Recurrent Unit


A recurrent unit (RU) is a basic building block of a recurrent neural network (RNN). It is a
function that takes as input the current state of the RNN and the current input, and outputs the
next state of the RNN.

There are many different types of RUs, but the most common type is the gated recurrent unit
(GRU). A GRU has two gates: an update gate and a reset gate. The update gate controls how
much of the current state is updated, and the reset gate controls how much of the current state
is forgotten.

The GRU works as follows (a minimal code sketch follows this list):

- The update gate is computed from the current input and the previous state, using a sigmoid activation.
- The reset gate is likewise computed from the current input and the previous state, using a sigmoid activation.
- A candidate state is computed from the current input together with the previous state scaled element-wise by the reset gate, so the reset gate controls how much of the old state is used (or forgotten) when forming the candidate.
- The next state is an interpolation between the previous state and the candidate state, with the mix controlled by the update gate, so the update gate controls how much of the state is updated.
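
Written out in Python/NumPy, one GRU step might look as follows (weight names are hypothetical, biases are omitted for brevity, and the z/(1 - z) convention varies between references):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step with an update gate z and a reset gate r."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    h_t = (1.0 - z) * h_prev + z * h_cand            # interpolate old state and candidate
    return h_t
```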

Long Short-Term Memory networks and Gated Recurrent Unit networks, two key variants of the RNN, were created to address the issues of vanishing and exploding gradients.

Advantages:
- They can learn long-term dependencies.
- They are relatively easy to train.
Disadvantages:
- They can be computationally expensive to train.
- They can be difficult to interpret.
Bidirectional RNN
A Bidirectional Recurrent Neural Network (Bidirectional RNN) is a type of Recurrent Neural
Network (RNN) architecture that processes sequential data in both forward and backward
directions. It combines the information from past and future states to make predictions or extract
features from the input sequence.

In a standard RNN, the hidden state at each time step is computed based on the previous
hidden state and the current input. This allows the RNN to capture dependencies and patterns
in the past context of the sequence. However, the standard RNN may not have access to future
information, which can be valuable for tasks that require understanding the entire sequence.

The Bidirectional RNN addresses this limitation by introducing two separate RNNs: one that
processes the sequence in the forward direction (from the beginning to the end) and another
that processes the sequence in the backward direction (from the end to the beginning). Each
RNN has its own set of parameters.

The forward RNN computes a forward hidden state sequence h_f(t) at each time step t, starting
from the first element of the input sequence. On the other hand, the backward RNN computes a
backward hidden state sequence h_b(t) at each time step t, starting from the last element of the
input sequence.

Once the forward and backward hidden states are obtained, the Bidirectional RNN can combine
the information from both directions. This can be done in different ways depending on the
specific task or objective. One common approach is to concatenate the forward and backward
hidden states at each time step:

h(t) = [h_f(t); h_b(t)]

The concatenated hidden states can then be used for further processing, such as making
predictions, extracting features, or passing them to subsequent layers in a deep neural network.
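
A minimal sketch of this combination (each direction's parameters are passed as a hypothetical (W_x, W_h, b) tuple, and a simple tanh RNN stands in for whatever recurrent cell is actually used):

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Run a simple tanh RNN over a sequence and return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)                      # shape: (T, hidden)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states: h(t) = [h_f(t); h_b(t)]."""
    h_f = run_rnn(xs, *fwd_params)               # process start -> end
    h_b = run_rnn(xs[::-1], *bwd_params)[::-1]   # process end -> start, re-align in time
    return np.concatenate([h_f, h_b], axis=1)    # shape: (T, 2 * hidden)
```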

The main advantage of Bidirectional RNNs is that they capture information from both past and
future contexts, enabling them to better model dependencies in the input sequence. This can be
particularly beneficial for tasks such as speech recognition, named entity recognition, sentiment
analysis, and machine translation, where understanding the context of the entire sequence is
important.

It's worth noting that the use of Bidirectional RNNs can increase computational complexity and
memory requirements compared to standard RNNs. Additionally, in tasks where future
information is not available, such as online prediction, Bidirectional RNNs may not be suitable.
Encoder-Decoder Sequence-to-Sequence Architectures
Encoder-Decoder sequence-to-sequence architectures are a type of neural network architecture
designed to handle tasks involving sequences, such as machine translation, text summarization,
and speech recognition. This architecture consists of two main components: an encoder and a
decoder.

The encoder processes the input sequence and encodes it into a fixed-length vector
representation called the context vector or latent representation. The decoder then takes this
context vector as input and generates the output sequence step by step.

The encoder-decoder model is composed of three primary building blocks (a minimal code sketch follows the list):


1. Encoder
- The input sequence, such as a sentence in machine translation, is fed into the
encoder one element at a time (e.g., word embeddings or characters).
- At each time step, the encoder computes a hidden state based on the current
input and the previous hidden state.
- The encoder's hidden state captures information from the previous inputs and
contextual information from the input sequence.
- The final hidden state of the encoder summarizes the entire input sequence into
a fixed-length context vector.

2. Hidden Vector / Encoder Vector


- The Encoder vector is a representation of the input sequence captured by the
encoder.
- It contains information about the input sequence's semantic and contextual
meaning, as encoded by the encoder.
- The Encoder vector serves as a condensed representation of the input sequence
and provides relevant information to the decoder.
3. Decoder
- The decoder takes the Encoder vector as an initial input and generates the
output sequence step by step.
- At each time step, the decoder computes a hidden state based on the current
input and the previous hidden state.
- The decoder's hidden state, similar to the encoder, captures information from the
previous inputs and the context vector.
- The decoder then predicts the next element of the output sequence based on the
current hidden state.
- This process is repeated iteratively until the desired output sequence is
generated or a predefined maximum length is reached.
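
Putting the three blocks together, a minimal greedy-decoding sketch might look like this (all names, the embedding table, and the start/end token ids are illustrative assumptions; a trained system would learn these weights and often adds attention on top of the single context vector):

```python
import numpy as np

def encode(xs, W_x, W_h, b):
    """Encoder: fold the whole input sequence into one context vector."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:                               # one input element per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                                     # final hidden state = context vector

def decode(context, embed, W_x, W_h, b, W_out, start_id, end_id, max_len=20):
    """Decoder: start from the context vector and emit tokens step by step."""
    h, token, outputs = context, start_id, []
    for _ in range(max_len):                     # stop at a maximum length ...
        x_t = embed[token]                       # feed back the previous prediction
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        token = int(np.argmax(W_out @ h))        # greedy choice of the next token
        if token == end_id:                      # ... or when the end symbol appears
            break
        outputs.append(token)
    return outputs
```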

Applications
- Google’s Machine Translation
- Question answering chatbots
- Speech recognition

Deep Recurrent Network


Deep Recurrent Networks (DRNs) are a class of neural network architectures that combine the
depth of deep neural networks with the sequential modeling capabilities of recurrent neural
networks (RNNs). DRNs are designed to capture complex dependencies in sequential data by
stacking multiple recurrent layers on top of each other.

The key idea behind DRNs is to create deeper representations of sequential data by allowing
information to flow through multiple layers of recurrent units. Each layer in the network receives
input from the previous layer and passes its output to the next layer, enabling the network to
capture hierarchical patterns and long-term dependencies in the data.

Training deep recurrent networks involves backpropagation through time (BPTT), an extension
of the backpropagation algorithm for recurrent networks. BPTT allows gradients to flow through
time steps and layers, enabling the network to learn from the sequential dependencies and
adjust its parameters accordingly.
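
One time step of such a stack can be sketched as follows (the parameter layout, one hypothetical (W_x, W_h, b) tuple and one previous hidden state per layer, is an illustrative assumption):

```python
import numpy as np

def deep_rnn_step(x_t, hidden_states, layer_params):
    """One time step of a stacked (deep) RNN: the output of layer l
    becomes the input of layer l + 1."""
    new_states, layer_input = [], x_t
    for h_prev, (W_x, W_h, b) in zip(hidden_states, layer_params):
        h = np.tanh(W_x @ layer_input + W_h @ h_prev + b)
        new_states.append(h)
        layer_input = h                          # feed this layer's output upward
    return new_states                            # one updated hidden state per layer
```
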
Recursive Neural Networks
Recursive Neural Networks (RecNNs, also written RvNNs) are a type of neural network architecture that operates on structured or hierarchical data, such as parse trees, dependency trees, or other recursive structures. Unlike traditional feedforward or recurrent neural networks, which process fixed-size or sequential inputs, RecNNs build representations for structured data by recursively applying the same set of operations at every node.

Hierarchical data is data that has a tree-like structure, such as the parse of a sentence or paragraph. RvNNs can be used to learn the relationships between the different parts of the data and to make predictions about it.

RvNNs are made up of a set of recursive units. Each recursive unit takes as input the current
node in the tree, and the outputs of its children. The recursive unit then produces an output for
the current node, which is used by its parent. This process continues until the root node of the
tree is reached.
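
For a binary tree, this bottom-up process can be sketched as follows (the node layout and weight names are hypothetical; the point is that the same shared composition is applied at every internal node):

```python
import numpy as np

def compose(node, W_left, W_right, b):
    """Recursively build a vector for each tree node from its children."""
    if node["children"] is None:                 # leaf: e.g. a word embedding
        return node["vector"]
    left = compose(node["children"][0], W_left, W_right, b)
    right = compose(node["children"][1], W_left, W_right, b)
    return np.tanh(W_left @ left + W_right @ right + b)  # shared weights at every level
```

The vector returned at the root summarizes the whole tree and can be passed to a classifier, for example to predict the sentiment of the sentence the tree parses.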

The outputs of the recursive units can be used to make predictions about the data. For example,
in the case of sentences, the outputs of the recursive units can be used to predict the next word
in the sentence.

A common application of recursive neural networks is sentiment analysis of natural language sentences, one of the most important tasks in Natural Language Processing (NLP): identifying the tone and sentiment of the writer in a particular sentence.

Advantages
- Tree structure and reduced network depth are two main advantages.
- RvNNs can manage hierarchical data, which is important for many real-world tasks.
- Because a balanced tree has logarithmic height, RvNNs can learn long-range dependencies in the data with relatively few composition steps.
The Challenge of Long-Term Dependencies
The challenge of long-term dependencies refers to the difficulty that arises when capturing and modeling relationships between distant elements in sequential data. In data such as natural language sentences or time series, long-term dependencies occur when the current element or event depends on elements or events that lie far in the past. However, traditional machine learning models, including simple feedforward neural networks, struggle to capture and learn such dependencies effectively.

The primary reason behind this challenge is the vanishing or exploding gradient problem. During
the training of neural networks, gradients are used to update the network's parameters based on
the error or loss. However, when backpropagating through many time steps or layers, the
gradients can exponentially diminish (vanishing gradients) or grow uncontrollably (exploding
gradients). This issue makes it challenging for the network to propagate information over long
time horizons, hindering its ability to capture long-term dependencies effectively.

Vanishing gradients occur when the gradients become extremely small, causing the weights to
be updated minimally or not at all. Consequently, the network fails to capture relevant
information from distant elements, and the impact of those elements on the current prediction
diminishes rapidly.

On the other hand, exploding gradients occur when the gradients become very large, leading to
unstable updates and difficulties in converging to an optimal solution. This issue can cause
training instability and prevent the network from effectively learning long-term dependencies.
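
The effect is easy to see numerically with a scalar toy recurrence (purely illustrative, not a real network): backpropagating through T steps of h_t = w * h_{t-1} multiplies the gradient by w at every step, so it scales like w ** T.

```python
for w in (0.9, 1.1):                 # recurrent weight just below / just above 1
    grad = 1.0
    for _ in range(100):             # backpropagate through 100 time steps
        grad *= w
    print(f"w = {w}: gradient after 100 steps = {grad:.3e}")
# w = 0.9 -> roughly 2.7e-05 (vanishing); w = 1.1 -> roughly 1.4e+04 (exploding)
```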

The challenge of long-term dependencies is particularly problematic in tasks where understanding the entire context is crucial, such as language modeling, machine translation, or speech recognition. In these tasks, capturing dependencies across long distances is essential for generating accurate predictions or understanding the meaning of the input.

Echo State Network


Echo state networks (ESNs) are a type of recurrent neural network (RNN) specifically designed to learn long-term dependencies. ESNs are particularly well suited to processing time-dependent or sequential data, such as time series analysis, signal processing, and dynamic system modeling.

The key idea behind Echo State Networks is the separation of the network into two components:
a fixed random reservoir and a trainable readout layer. The random reservoir is sparsely
connected and remains fixed throughout the training process. The readout layer, which is
trained using supervised learning, learns to map the reservoir's dynamics to the desired output.

ESNs consist of a reservoir and a readout layer. The reservoir is a recurrent neural network with
a large number of neurons and randomly initialized weights.
The reservoir is used to store information about the input sequence. The readout layer is used
to make predictions about the output sequence.

The main advantage of Echo State Networks lies in their simplicity and efficiency. The fixed and
randomly initialized reservoir eliminates the need to train the recurrent connections, reducing
the computational complexity and training time. Moreover, the separation of the fixed reservoir
and the trainable readout layer allows for efficient training even with limited labeled data.

Why should you use Echo State Networks?


● Echo State Networks do not suffer from the vanishing/exploding gradient problem.
● While traditional neural networks are computationally expensive, ESNs tend to be fast
due to the lack of a backpropagation phase on the reservoir.
● Echo State Networks are effective at handling chaotic time series.
● Before echo state networks were introduced, recurrent neural networks were hardly ever used in practice. This was because of the complexity of adjusting their recurrent connections in the absence of automatic differentiation, together with their susceptibility to vanishing and exploding gradients.

Working
The echo state network makes use of a very sparsely connected hidden layer (usually with about 1% connectivity). The connectivity and weights of the hidden neurons are fixed and assigned at random. The weights of the output neurons can be learned, enabling the network to produce or reproduce specific temporal patterns. The most interesting aspect of this network is that, despite its non-linear behavior, the only weights modified during training are those of the synapses connecting the hidden neurons to the output neurons.
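
A minimal sketch of this recipe (the sizes, the 1% sparsity, the spectral-radius scaling, and the ridge-regression readout in the final comment are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in = 200, 1                               # hypothetical sizes

# Fixed, sparse, random reservoir -- never trained.
W_res = rng.normal(size=(n_res, n_res))
W_res[rng.random((n_res, n_res)) > 0.01] = 0.0     # keep roughly 1% of the connections
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # spectral radius below 1
W_in = rng.normal(size=(n_res, n_in))

def run_reservoir(inputs):
    """Collect reservoir states for an input sequence of shape (T, n_in)."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_res @ x + W_in @ u)
        states.append(x)
    return np.stack(states)

# Only the linear readout is trained, e.g. by ridge regression on the states:
# W_out = targets.T @ states @ np.linalg.inv(states.T @ states + 1e-6 * np.eye(n_res))
```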

Leaky Units and Other Strategies for Multiple Time Scales


The goal is to deal with long-term dependencies. Strategies that are useful for building both fine and coarse time scales include (a leaky-unit sketch follows this list):
● Leaky units: Leaky units keep a running average of their own past state, using a self-connection with a weight close to one, so that information decays slowly and the unit operates on a coarse time scale.
● Skip connections: Skip connections directly link distant time steps in an RNN. They can help the network learn long-term dependencies by providing a path for information to flow from the distant past to the present.
● Gated units: Gated units have a gating mechanism that allows them to selectively forget information. This is helpful for learning long-term dependencies because it lets the network discard irrelevant information while retaining important information.
● LSTMs and GRUs: LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks are specialized RNN architectures designed to learn long-term dependencies. They do this by using gated units to control the flow of information through the network.
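
As referenced above, a leaky unit can be sketched as a running average of its own past state and a fresh candidate (hypothetical names; an alpha close to 1 gives a slow, coarse time scale, an alpha close to 0 a fast, fine one):

```python
import numpy as np

def leaky_unit_step(x_t, h_prev, W_x, W_h, b, alpha=0.95):
    """Leaky integration: keep a fraction alpha of the old state so that
    information decays slowly across many time steps."""
    candidate = np.tanh(W_x @ x_t + W_h @ h_prev + b)
    return alpha * h_prev + (1.0 - alpha) * candidate
```
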
The Long Short-Term Memory (LSTM) and GRU
LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) architecture that
is specifically designed to address the vanishing gradient problem and capture long-term
dependencies in sequential data. LSTMs are widely used in tasks such as natural language
processing, speech recognition, machine translation, and time series analysis.

A general LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The
cell remembers values over arbitrary time intervals, and three gates regulate the flow of
information into and out of the cell.

The architecture of an LSTM network includes several key components that enable it to effectively capture and propagate information over long sequences (a minimal step sketch follows the list):
● Cell State: The cell state serves as the memory of the LSTM network. It carries
information across different time steps and allows the network to maintain information
over long distances.
● Input Gate: The input gate controls the flow of information into the cell state. It
determines which parts of the input and the previous hidden state are relevant to update
the cell state at the current time step. The input gate uses a sigmoid activation function
to generate values between 0 and 1, indicating the importance of each element.
● Forget Gate: The forget gate determines which information in the cell state should be
discarded or forgotten. It takes the previous hidden state and the current input as input
and produces a forget gate value between 0 and 1 for each element in the cell state. The
forget gate uses a sigmoid activation function to determine how much information should
be forgotten.
● Output Gate: The output gate controls the flow of information from the cell state to the
output and the next hidden state. It selects which parts of the cell state should be
outputted at the current time step. Similar to the input and forget gates, the output gate
uses a sigmoid activation function to determine the relevance of each element.
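
The interplay of the gates described above can be summarized in a short sketch (the parameter names are hypothetical, and each weight matrix is assumed to act on the concatenation of the previous hidden state and the current input):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: three gates regulate what enters, stays in,
    and leaves the cell state."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    g = np.tanh(p["W_g"] @ z + p["b_g"])        # candidate cell update
    c_t = f * c_prev + i * g                    # update the cell state (memory)
    h_t = o * np.tanh(c_t)                      # expose part of it as the new hidden state
    return h_t, c_t
```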

Gated Recurrent Unit (GRU) is a variant of recurrent neural network (RNN) architecture that,
like LSTM, addresses the vanishing gradient problem and captures long-term dependencies in
sequential data. GRU simplifies the LSTM architecture by combining the forget and input gates
into a single update gate and merging the cell state and hidden state into a single state vector.

The architecture of a GRU network includes the following components:


● Update Gate (zt): The update gate controls the information flow from the previous
hidden state to the current hidden state. It determines how much of the previous hidden
state should be retained and how much should be updated with the current input. The
update gate uses a sigmoid activation function, and a value of 1 means to keep all the
previous hidden state, while a value of 0 means to discard it completely.
● Reset Gate (rt): The reset gate determines how much of the previous hidden state
should be forgotten and how much new information should be incorporated from the
current input. The reset gate uses a sigmoid activation function to decide which parts of
the previous hidden state should be ignored.

LSTM vs GRU

- An LSTM unit has three gates (input, forget, and output) and maintains a separate cell state alongside the hidden state.
- A GRU has two gates (update and reset) and merges the cell state and hidden state into a single state vector, which makes it a simpler architecture with fewer parameters.
- Both were designed to mitigate the vanishing gradient problem and capture long-term dependencies in sequential data.

Optimization for Long-Term Dependencies


There are two main ways to optimize for long-term dependencies (a gradient-clipping sketch follows the list):
● Gradient clipping: Gradient clipping is a technique that limits the size of the gradients
that are used to update the network's parameters. This can help to prevent the gradients
from exploding or vanishing, which can make it difficult for the network to learn long-term
dependencies.
● Regularization: Regularization is a technique that adds a penalty to the loss function to
prevent the network from overfitting the training data. This can help to improve the
generalization performance of the network, which can make it easier for the network to
learn long-term dependencies.
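
Gradient clipping, mentioned in the list above, can be sketched as a rescaling of the global gradient norm (the threshold of 5.0 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined norm
    does not exceed max_norm, preventing a single exploding update."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```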

Explicit Memory
In traditional RNNs, the hidden state serves as an implicit form of memory that carries
information from previous time steps. However, the hidden state has limited capacity to retain
information over long sequences due to the vanishing gradient problem. As a result, the network
may struggle to capture long-term dependencies effectively.
Explicit memory mechanisms address this limitation by introducing an explicit memory
component that can store and access information over long time spans. This memory can be
accessed at any time step, allowing the network to explicitly retrieve important information from
the past.
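
One common way such a memory is accessed, for example in memory networks and neural Turing machines, is content-based addressing. The following is a minimal sketch; the dot-product similarity and softmax weighting are assumptions used for illustration rather than details given in the text:

```python
import numpy as np

def read_memory(query, memory):
    """Content-based read: compare the query with every stored slot and
    return a weighted average of the slots (memory has shape (slots, dim))."""
    scores = memory @ query                  # similarity of the query to each slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention over the slots
    return weights @ memory                  # the retrieved vector
```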

The explicit memory mechanisms provide several benefits to RNNs:


● Improved Long-Term Dependencies: By explicitly storing and accessing information
from the past, RNNs with explicit memory can better capture long-term dependencies in
sequential data.
● Enhanced Contextual Information: The ability to retrieve specific past information
allows the network to better contextualize the current input and make more informed
predictions or decisions.
● Increased Capacity: The explicit memory component increases the network's capacity
to store and process information, enabling it to handle longer sequences and more
complex dependencies.

Explicit memory mechanisms have been successfully applied in various tasks, including
machine translation, language modeling, question answering, and image captioning. They
provide a means to overcome the limitations of standard RNNs and capture dependencies over
longer sequences, ultimately improving the performance of the network on tasks involving
long-term information retention.
