Lecture Notes: Recurrent Neural Networks (RNNs)
Unlike traditional feedforward neural networks, RNNs have connections that form cycles, allowing
information to persist. This enables RNNs to capture temporal dependencies and context in
sequential data.
Applications of RNNs :
Natural Language Processing (NLP): Text generation, sentiment analysis, machine translation.
Key Characteristics :
Sequential Data : RNNs process data in sequences (e.g., time steps in a time series, words in a
sentence).
Hidden State : The core feature of RNNs is the hidden state \( h_t \), which carries information
from one time step to the next, allowing the network to "remember" information from previous
time steps.
Mathematical Formulation :
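As a reference, the standard vanilla RNN update can be written as follows (a tanh activation and the conventional weight names \( W_{xh} \), \( W_{hh} \), \( W_{hy} \) are assumed here):

\[ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \]
\[ y_t = W_{hy} h_t + b_y \]

where \( x_t \) is the input and \( y_t \) the output at time step \( t \).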
Graphical Representation :
At each time step t, the network applies the same weights, forming a cycle in the network structure. This is what gives RNNs their "recurrent" property.
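To make the shared weights concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass; the function and variable names (rnn_forward, W_xh, W_hh, b_h) are illustrative, not from any particular library.

import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence of input vectors xs.

    The SAME W_xh, W_hh and b_h are reused at every time step;
    only the hidden state h changes as the sequence is processed.
    """
    h = h0
    hidden_states = []
    for x_t in xs:                                  # one iteration per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # h_t depends on x_t and h_{t-1}
        hidden_states.append(h)
    return hidden_states

# Tiny usage example with random data
rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 5
xs = [rng.normal(size=input_dim) for _ in range(T)]
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
hs = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, b_h)
print(len(hs), hs[-1].shape)                        # 5 (4,)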
While RNNs are powerful for sequential data, they come with several challenges:
1. Vanishing Gradients :
During backpropagation through time (BPTT), gradients can shrink exponentially as they are propagated back through many layers or time steps, making it difficult for the network to learn long-range dependencies.
2. Exploding Gradients :
Conversely, gradients can also grow exponentially, leading to instability in training and making optimization difficult.
3. Limited Memory :
Basic RNNs struggle to capture long-term dependencies, as information from earlier time steps gets "forgotten" quickly as the network processes new inputs.
Long Short-Term Memory (LSTM) Networks :
A special kind of RNN designed to address the vanishing gradient problem and improve the network's ability to learn long-range dependencies.
LSTM Components :
Forget gate : Decides which information to discard from the cell state.
Input gate : Decides which new information to add to the cell state.
Output gate : Controls what part of the cell state is output as the hidden state.
The LSTM cell uses these gates to regulate the flow of information, mitigating issues such as
vanishing gradients and enabling better memory retention.
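To make the gate interactions concrete, here is a rough NumPy sketch of a single LSTM step; the weight names and the choice of acting on the concatenated [h_prev, x_t] are illustrative assumptions, not tied to a specific implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM time step; each W_* acts on the concatenated [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)          # forget gate: what to discard from c_prev
    i = sigmoid(W_i @ z + b_i)          # input gate: what new information to write
    o = sigmoid(W_o @ z + b_o)          # output gate: what part of the cell to expose
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell contents
    c = f * c_prev + i * c_tilde        # updated cell state (long-term memory)
    h = o * np.tanh(c)                  # new hidden state (what gets output)
    return h, c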
Gated Recurrent Units (GRUs) :
A simpler variant of LSTMs with fewer gates (no separate cell state).
GRU Components :
Update gate : Decides how much of the previous hidden state should be carried forward.
Reset gate : Decides how much of the previous hidden state to forget.
GRUs tend to perform similarly to LSTMs but with less computational overhead.
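A comparable sketch of one GRU step (same illustrative conventions as the LSTM sketch above) highlights the two gates and the absence of a separate cell state.

import numpy as np

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step; each W_* acts on the concatenated [h_prev, x_t]."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ z_in + b_z)       # update gate: how much of h_prev to carry forward
    r = sigmoid(W_r @ z_in + b_r)       # reset gate: how much of h_prev to forget
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return z * h_prev + (1 - z) * h_tilde   # no separate cell state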
Backpropagation Through Time (BPTT) :
1. Forward pass : Run the RNN over the input sequence, computing the hidden state and predicted output at each time step.
2. Compute loss : At each time step, compute the loss based on the predicted output and the actual target.
3. Backward pass : Calculate the gradients of the loss with respect to the weights by unrolling the RNN over time and applying the chain rule. This involves propagating gradients backward through every time step and accumulating them for the shared weights, as in the sketch below.
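These steps can be expressed compactly in an automatic-differentiation framework. The PyTorch sketch below is illustrative (the toy dimensions, random data, MSE loss, and Adam optimizer are all assumptions); backward() performs BPTT through the unrolled loop.

import torch
import torch.nn as nn

torch.manual_seed(0)
T, batch, in_dim, hid_dim, out_dim = 10, 4, 8, 16, 8

cell = nn.RNNCell(in_dim, hid_dim)            # the same weights are reused at every step
readout = nn.Linear(hid_dim, out_dim)
params = list(cell.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

xs = torch.randn(T, batch, in_dim)            # dummy input sequence
targets = torch.randn(T, batch, out_dim)      # dummy per-step targets

# 1. Forward pass: unroll the RNN over time, keeping the computation graph
h = torch.zeros(batch, hid_dim)
loss = torch.zeros(())
for t in range(T):
    h = cell(xs[t], h)                              # update the hidden state
    loss = loss + loss_fn(readout(h), targets[t])   # 2. accumulate per-step loss

# 3. Backward pass (BPTT): gradients flow back through every time step
optimizer.zero_grad()
loss.backward()
optimizer.step()                              # update the shared weights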
6. Variants of RNNs
Bidirectional RNNs :
These networks process the sequence in both forward and backward directions, allowing them to capture context from both past and future time steps.
Deep RNNs :
Stacking multiple layers of RNNs can increase the representational power of the network.
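In frameworks such as PyTorch, both variants are simple options on the recurrent layer; the sizes below are illustrative.

import torch
import torch.nn as nn

# Two stacked layers (deep RNN) processing the sequence in both directions
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2,
               bidirectional=True, batch_first=True)

x = torch.randn(4, 10, 8)        # (batch, time, features)
out, (h_n, c_n) = lstm(x)
print(out.shape)                 # torch.Size([4, 10, 32]): forward + backward states concatenated
print(h_n.shape)                 # torch.Size([4, 4, 16]): (num_layers * num_directions, batch, hidden)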
7. Training RNNs
Optimization :
RNNs are typically trained using gradient-based optimization algorithms (e.g., SGD, Adam).
Special attention is needed to handle issues like vanishing/exploding gradients, often through
initialization schemes or using LSTM/GRU cells.
Regularization :
Gradient clipping : Used to handle exploding gradients by capping the gradients during training.
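Continuing the PyTorch training sketch from the BPTT section above, clipping is a one-line addition between backward() and the optimizer step (the threshold of 1.0 is an illustrative choice).

loss.backward()
# Rescale all gradients so their global norm does not exceed 1.0
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()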
8. Advanced Topics
Attention Mechanism :
Attention allows the network to focus on important parts of the input sequence, enabling it to handle long-range dependencies better than vanilla RNNs.
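A minimal sketch of dot-product attention over a sequence of RNN hidden states; the names and shapes are illustrative, and real attention layers typically add learned projections.

import torch
import torch.nn.functional as F

def attend(query, hidden_states):
    """Weight each hidden state by its relevance to the query.

    query         : (batch, hidden)        e.g. the decoder's current state
    hidden_states : (batch, time, hidden)  the encoder RNN's outputs
    """
    scores = torch.bmm(hidden_states, query.unsqueeze(-1)).squeeze(-1)   # (batch, time)
    weights = F.softmax(scores, dim=-1)                                  # attention weights
    context = torch.bmm(weights.unsqueeze(1), hidden_states).squeeze(1)  # (batch, hidden)
    return context, weights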
Transformers :
A modern architecture that replaces recurrence with attention mechanisms, achieving better performance and parallelization than RNNs, especially on long sequences.
Example : Next-Word Prediction
1. Problem : Given a sequence of words, predict the next word in the sequence.
2. RNN Process :
The RNN processes the sequence one word at a time, updating its hidden state at each step.
3. Output : At each time step, the RNN generates a probability distribution over the vocabulary for
the next word. The word with the highest probability is chosen as the output.
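A sketch of this prediction process in PyTorch; the vocabulary size, dimensions, toy token IDs, and greedy (argmax) selection are illustrative assumptions.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

tokens = torch.tensor([[5, 42, 7]])      # a toy 3-word input sequence (batch of 1)
out, h = rnn(embed(tokens))              # hidden state is updated word by word
logits = to_vocab(out[:, -1])            # scores over the vocabulary at the last step
probs = torch.softmax(logits, dim=-1)    # probability distribution over the next word
next_word = probs.argmax(dim=-1)         # greedy choice: highest-probability word
print(next_word.item())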
Traditional Neural Networks vs. Recurrent Neural Networks
Traditional Neural Networks (TNNs) and Recurrent Neural Networks (RNNs) are both types of artificial neural networks, but they differ in how they process information and the types of tasks they are suited for. Below are the key differences:
1. Architecture:
TNNs:
These are feedforward networks, meaning that the information moves in one direction: from the input layer, through hidden layers, to the output layer.
There are no cycles or loops in the network. Each layer's output only depends on the current input and weights, and there is no memory of previous inputs.
RNNs:
RNNs have recurrent connections, meaning that the output from the previous time step is fed back into the network, allowing the network to maintain a form of "memory."
The hidden state of the network can capture temporal dependencies, meaning the model can take into account past inputs when producing outputs.
RNNs are designed to process sequential data and are often used in tasks like language modeling,
speech recognition, and time series analysis.
2. Memory:
TNNs:
Traditional neural networks do not have memory. Each input is processed independently, and the
network does not retain any information about past inputs once it moves on to the next one.
RNNs:
RNNs have an inherent memory mechanism. The hidden state of the network at a given time step
is influenced by both the current input and the previous hidden state, allowing the model to
remember information from previous time steps.
3. Use Cases:
TNNs:
Best suited for problems where the relationship between inputs and outputs does not depend on
sequential or temporal context.
Examples: Image classification, object recognition, simple regression tasks, and pattern
recognition where inputs are independent.
RNNs:
Ideal for sequential data or problems where time-dependent patterns need to be learned. They excel at tasks where the output depends on previous inputs.
Examples: Natural Language Processing (NLP), machine translation, speech recognition, and time
series forecasting.
4. Input Handling:
TNNs:
Typically process fixed-size inputs where each sample is independent of the others.
RNNs:
Designed to handle variable-length sequences where the input is time- or order-dependent.
The output at each time step depends not only on the current input but also on previous inputs or states, making them suitable for sequential tasks.
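As one concrete illustration (PyTorch-specific, with illustrative sizes), variable-length sequences are often padded to a common length and then packed so the RNN skips the padding.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of lengths 5 and 3, zero-padded to the same length
lengths = torch.tensor([5, 3])
batch = torch.zeros(2, 5, 8)                 # (batch, max_time, features)
batch[0] = torch.randn(5, 8)
batch[1, :3] = torch.randn(3, 8)

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, h_n = rnn(packed)                # padded positions are skipped
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                             # torch.Size([2, 5, 16])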
5. Training Difficulty:
TNNs:
Training is generally easier because there are no dependencies across time steps, and the
backpropagation algorithm works straightforwardly.
RNNs:
RNNs are more difficult to train because they involve dependencies across time steps. The
backpropagation through time (BPTT) algorithm is used, which can suffer from issues like the
vanishing gradient problem and exploding gradients, making learning more challenging.
Variants like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) were developed to address these challenges by providing better memory management.
6. Output:
TNNs:
Typically produce a single output based on the entire input. For example, in image classification,
the output is a label assigned to the image.
RNNs:
Can produce a sequence of outputs, one per time step (as in sequence-to-sequence models), or a single output after processing an entire sequence (e.g., in sequence classification tasks).
7. Parameter Sharing:
TNNs:
Each layer in a traditional neural network has its own set of weights for every connection. These
weights do not share information across different inputs or time steps.
RNNs:
RNNs share weights across time steps. This weight sharing is what allows RNNs to generalize over
sequences of different lengths. The same weights are applied to each time step in the sequence,
which is one of the key reasons they are suited for sequential data.
Summary of Differences:
Architecture: TNNs are feedforward with no cycles; RNNs have recurrent connections that feed the previous state back into the network.
Memory: TNNs treat each input independently; RNNs carry a hidden state across time steps.
Use cases: TNNs suit independent, fixed-size inputs (e.g., image classification); RNNs suit sequential data (e.g., NLP, speech, time series).
Input handling: TNNs expect fixed-size inputs; RNNs handle variable-length sequences.
Training: TNNs use standard backpropagation; RNNs use BPTT and must contend with vanishing/exploding gradients.
Output: TNNs typically produce one output per input; RNNs can output at every time step or once per sequence.
Parameter sharing: TNNs use separate weights per connection; RNNs share the same weights across all time steps.