1 Recurrent Neural Networks
RNN in Google
Recurrent Neuron and RNN
Unfolding
The recurrent update and the output are:
\( h_t = \tanh(U X_t + W h_{t-1} + B) \)
\( Y_t = V h_t + C \)
In these equations:
- \( h \) is the hidden state,
- \( U, W, V \) are weight matrices,
- \( X \) is the input, and
- \( B, C \) are bias terms.

How RNN Works
1. **Stepwise Input Processing**: The input is processed one step at a time, with each step updating the current hidden state using the input and the previous state.
2. **Error Calculation**: The error is calculated by comparing the output to the target, and weights are updated using backpropagation through time.
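As an illustration of the stepwise update above, here is a minimal NumPy sketch; the dimensions, the tanh activation, and the linear output layer are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def rnn_forward(X, U, W, V, b, c):
    """Run a vanilla RNN over a sequence X of shape (T, input_dim).

    h_t = tanh(U @ x_t + W @ h_{t-1} + b)
    y_t = V @ h_t + c
    """
    T = X.shape[0]
    hidden_dim = W.shape[0]
    h = np.zeros(hidden_dim)          # initial hidden state
    hs, ys = [], []
    for t in range(T):                # stepwise input processing
        h = np.tanh(U @ X[t] + W @ h + b)
        y = V @ h + c
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Toy usage: 5 time steps, 3 input features, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3)); W = rng.normal(size=(4, 4))
V = rng.normal(size=(2, 4)); b = np.zeros(4); c = np.zeros(2)
hs, ys = rnn_forward(rng.normal(size=(5, 3)), U, W, V, b, c)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```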
Advantages:
1. **Memory of Past Information**: RNNs remember past inputs, making them suitable for time-series and sequential data tasks.
2. **Combining with Convolutional Layers**: RNNs can be combined with convolutional layers to improve pixel neighborhood analysis.

Applications:
1. **Language Modeling**
2. **Speech Recognition**
3. **Machine Translation**
4. **Image Recognition**
5. **Face Detection**
6. **Time Series Forecasting**
2 Bidirectional RNNs
A point-wise breakdown of Bidirectional Recurrent Neural Networks (BRNNs):
2. **Challenges of RNNs**:
- Conventional RNNs face the vanishing gradient problem during
backpropagation, which hinders learning.
- To overcome this, variants like Long Short-Term Memory (LSTM) and Gated
Recurrent Units (GRU) have been introduced.
4. **BRNN Functionality**:
- The forward hidden layer updates the hidden state based on the current input
and the prior hidden state.
- The backward hidden layer processes the sequence in reverse, updating the
hidden state based on future inputs.
- BRNNs improve accuracy by capturing context in both directions.
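The dual forward/backward hidden layers can be sketched roughly as below; the shapes, the tanh activation, and the concatenation of the two directions are assumptions for illustration only.

```python
import numpy as np

def birnn_hidden_states(X, Wf, Uf, bf, Wb, Ub, bb):
    """Compute forward and backward hidden states for a BRNN and
    concatenate them per time step (illustrative shapes/activations)."""
    T, H = X.shape[0], Wf.shape[0]
    hf = np.zeros((T, H)); hb = np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                     # forward direction: past -> future
        h = np.tanh(Uf @ X[t] + Wf @ h + bf)
        hf[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):           # backward direction: future -> past
        h = np.tanh(Ub @ X[t] + Wb @ h + bb)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)   # (T, 2H) combined context

rng = np.random.default_rng(1)
H_out = birnn_hidden_states(rng.normal(size=(6, 3)),
                            rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4),
                            rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4))
print(H_out.shape)  # (6, 8)
```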
5. **Training of BRNNs**:
- Gradients are computed in both forward and backward passes during training
using backpropagation through time (BPTT).
- Inference involves a single forward pass, where predictions are made based
on the combined hidden layers' outputs.
6. **Working of BRNNs**:
- Input sequence: A sequence of vectors is fed into the network, which can
have variable lengths.
- Dual processing: Hidden states are computed based on both past (forward)
and future (backward) steps.
- Hidden state calculation: A non-linear activation function is applied to
compute hidden states.
- Output: A non-linear activation function computes the output at each step.
- Training: Weights are adjusted through backpropagation to minimize
prediction error.
7. **Backpropagation in BRNNs**:
- The hidden state at time `t` combines forward and backward hidden states.
- BPTT updates weights individually for forward and backward passes.
8. **Applications of BRNNs**:
- Sentiment analysis: Helps in categorizing the sentiment of sentences by
considering both past and future context.
- Named entity recognition: Identifies entities in a sentence by analyzing both
preceding and following contexts.
- Part-of-speech tagging: Classifies words in a sentence based on their parts of
speech.
- Machine translation: Used in encoder-decoder models to translate sentences
by capturing context from both directions.
- Speech recognition: Processes speech signals in both directions to improve
recognition accuracy.
9. **Advantages of BRNNs**:
- Capture context from both past and future inputs.
- Higher accuracy in predictions.
- Efficient in handling variable-length sequences.
- More resilient to noise and irrelevant information in data.
3 Autoencoders
1. **Introduction to Autoencoders**
- Autoencoders are a subset of neural networks used for unsupervised learning.
- They are adaptable and powerful architectures, ideal for identifying complex
patterns and representations.
- Autoencoders are widely used in various fields like image processing and
anomaly detection due to their ability to learn effective data representations.
2. **Definition of Autoencoders**
- Specialized algorithms that learn efficient representations of input data
without the need for labels.
- Designed for unsupervised learning.
- They compress and represent data without requiring specific labels.
- The structure involves an encoder (reduces dimensionality) and a decoder
(rebuilds the original input).
3. **Architecture of Autoencoders**
- The architecture consists of three main parts: encoder, bottleneck layer, and
decoder.
- **Encoder**: Captures essential features and reduces dimensionality to form
a compressed representation (latent space).
- **Decoder**: Reconstructs the original input from the compressed
representation.
- **Loss function**: Measures the difference between the input and the
reconstructed output (e.g., Mean Squared Error or Binary Cross-Entropy).
4. **Training Autoencoders**
- During training, autoencoders minimize the reconstruction loss to capture key
input features in the bottleneck layer.
- After training, only the encoder is used for encoding new data.
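A minimal PyTorch sketch of this encoder/bottleneck/decoder setup with an MSE reconstruction loss; the layer sizes (784 → 128 → 32), the batch of random data, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal autoencoder: the encoder compresses to a small bottleneck,
# the decoder reconstructs the input; MSE measures reconstruction error.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim),          # latent space
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                 # stand-in batch of flattened images
for _ in range(5):                      # minimize reconstruction loss
    opt.zero_grad()
    loss = loss_fn(model(x), x)
    loss.backward()
    opt.step()

codes = model.encoder(x)                # after training, use the encoder alone
print(codes.shape)                      # torch.Size([64, 32])
```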
6. **Types of Autoencoders**
- **Denoising Autoencoder**: Trains on corrupted input to recover the original
data.
- Advantages: Reduces noise and can generate additional training samples.
- Disadvantages: Challenging noise selection; possible loss of essential
information.
4 Encoders – Decoder
### Key Points on Autoencoders:
- **Autoencoders in General**:
- Autoencoders are a subset of neural networks used for **unsupervised
learning**.
- They are adaptable, useful in various fields like **image processing** and
**anomaly detection**, and help in **learning effective data representations**.
- **Definition of Autoencoders**:
- Autoencoders are designed to learn **efficient representations of input data**
without needing labels.
- They use a two-part structure: an **encoder** and a **decoder**.
- **Encoder** reduces input to a lower-dimensional **latent space**.
- **Decoder** reconstructs the original input from the reduced representation.
- **Architecture of Autoencoders**:
1. **Encoder**: Reduces dimensionality and captures essential patterns.
2. **Bottleneck Layer**: A compressed representation of the input data.
3. **Decoder**: Reconstructs the input from the encoded data.
4. **Loss Function**: Measures reconstruction error (e.g., Mean Squared Error).
- **Training Autoencoders**:
- The goal during training is to **minimize reconstruction loss**, ensuring
important features are captured in the bottleneck layer.
- After training, only the encoder can be used to encode new data.
- **Constraining Autoencoders**:
1. **Small Hidden Layers**: Encourages learning of only essential features.
2. **Regularization**: Adds a loss term to prevent copying input directly.
3. **Denoising**: Adds noise to input, and the network learns to remove it.
- **Types of Autoencoders**:
1. **Denoising Autoencoder**: Recovers original input from corrupted data.
- Advantages: Extracts important features, good for **data augmentation**.
- Disadvantages: Selecting noise is challenging, potential information loss.
5 Sequence-to-Sequence
Architectures
Key points on the sequence-to-sequence architecture:
- **Decoder Block**:
- The decoder uses the context vector to generate the output sequence
incrementally.
- During training, it receives both the context vector and the target output
sequence.
- During inference, it uses previously generated outputs as inputs for the next
steps.
- The decoder autoregressively generates output tokens and continues until the
sequence is complete.
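A rough sketch of this autoregressive decoding loop, using a small GRU-based decoder conditioned on a context vector; the vocabulary size, the special SOS/EOS token ids, and the greedy (argmax) decoding rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical minimal decoder: a GRU conditioned on the encoder's context
# vector, generating one token at a time until an end-of-sequence id.
vocab_size, emb_dim, hidden_dim = 100, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

SOS, EOS = 1, 2
context = torch.zeros(1, 1, hidden_dim)      # context vector from the encoder

def greedy_decode(context, max_len=20):
    h = context                               # init decoder state with context
    token = torch.tensor([[SOS]])
    out = []
    for _ in range(max_len):                  # autoregressive generation
        e = embed(token)                      # (1, 1, emb_dim)
        o, h = gru(e, h)
        token = to_vocab(o[:, -1]).argmax(dim=-1, keepdim=True)
        if token.item() == EOS:
            break
        out.append(token.item())              # feed prediction back in next step
    return out

print(greedy_decode(context))
```

During training, teacher forcing would feed the target tokens instead of the model's own predictions; the loop above shows inference only.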
6 Deep Recurrent
Networks
Key points on deep recurrent networks:
- Multi-layer deep RNNs capture varied data representations and pass multiple
hidden states across layers.
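A minimal example of such stacking with PyTorch's built-in `nn.RNN`; the three layers and the tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A deep (multi-layer) RNN: each layer's hidden-state sequence becomes the
# input to the layer above it.
deep_rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=3, batch_first=True)
x = torch.randn(4, 15, 10)          # batch of 4 sequences, 15 steps each
out, h_n = deep_rnn(x)
print(out.shape)                    # torch.Size([4, 15, 20]) -- top layer, per step
print(h_n.shape)                    # torch.Size([3, 4, 20])  -- final state per layer
```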
7 Bidirectional Encoder Representations from Transformers (BERT)
1. **Introduction to BERT:**
2. **Training Data:**
- Pre-trained using Wikipedia (2,500 million words) and Book Corpus (800 million
words).
3. **Challenges in NLP:**
- One major challenge is the lack of training data for NLP tasks.
- Many task-specific datasets only contain a few thousand or a few hundred thousand human-labeled examples.
4. **BERT's Approach:**
- To address the lack of data, BERT trains on large, unlabeled text corpora
(unsupervised or semi-supervised learning).
- The model can be fine-tuned for specific tasks using supervised learning.
5. **Bidirectionality:**
- Traditional language models process text in one direction (either left-to-right or right-to-left).
6. **BERT Architecture:**
- BERT BASE has 12 layers in the encoder stack, 12 attention heads, 110 million parameters, and a hidden size of 768.
- BERT LARGE has 24 layers, 16 attention heads, 340 million parameters, and a hidden size of 1024.
7. **Transformer Architecture:**
8. **Input Representation:**
- BERT's input is represented in a single token sequence, which can include a single
sentence or a pair of sentences (e.g., question and answer).
9. **Tokenization in BERT:**
- For sentence pairs, [SEP] tokens are used to separate them, and a learned embedding
distinguishes whether a token belongs to sentence A or B.
- BERT excels at understanding context, which is crucial for texts with ambiguous
words (homonyms).
- For example:
1. "You were right." vs. "Make a right turn at the light." The meaning of "right"
changes based on context.
2. "My favorite flower is a rose." vs. "He quickly rose from his seat." Here, "rose" is
interpreted differently in each sentence.
- For 80% of the time, the masked word is replaced with the [MASK] token.
- For 10% of the time, the masked word is replaced with a random word.
- For the remaining 10% of the time, the word is left unchanged.
- In BERT’s training, pairs of sentences are used to predict whether the second
sentence follows the first.
- 50% of the time, sentence B is the actual next sentence (labeled "IsNext").
- The other 50% of the time, sentence B is a random sentence (labeled "NotNext").
- **Input:** "[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk
[SEP]"
- **Output:** IsNext.
- **Input:** "[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flightless
birds [SEP]"
- **Output:** NotNext.
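The 80%/10%/10% masking rule sketched in plain Python; the toy vocabulary, the 15% selection rate, and the helper name `mask_tokens` are assumptions for illustration, not BERT's actual preprocessing code.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT-style masked-LM corruption to a token list:
    of the selected positions, 80% -> [MASK], 10% -> random word,
    10% -> left unchanged (toy sketch)."""
    random.seed(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue
        labels[i] = tok                      # model must predict the original word here
        r = random.random()
        if r < 0.8:
            out[i] = "[MASK]"
        elif r < 0.9:
            out[i] = random.choice(vocab)    # random replacement
        # else: keep the original token unchanged
    return out, labels

vocab = ["the", "man", "went", "to", "store", "bought", "milk", "penguin"]
tokens = "[CLS] the man went to the store [SEP]".split()
print(mask_tokens(tokens, vocab))
```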
- RvNNs are a type of deep neural network used in natural language processing (NLP).
- A Recursive Neural Network is created when the same set of weights is applied
recursively to a structured input to make structured predictions.
- **Neural Networks** are the core of deep learning and are loosely modeled after the
human brain to recognize patterns in data.
- Recursive Neural Networks have a deep, tree-like structure, making them capable of
handling hierarchical data.
- In RvNNs, the tree structure forms by combining child nodes to produce parent nodes.
Each child-parent connection has a weight matrix, and similar children share the same
weights.
- The number of children for every node is fixed to ensure recursive operations can be
applied with consistent weights.
- RvNNs are useful when there's a need to parse entire sentences, particularly in NLP
tasks.
- The key difference between Recurrent Neural Networks (RNNs) and Recursive Neural Networks (RvNNs) is that RNNs process sequential data step by step, whereas RvNNs operate on hierarchical, tree-structured data.
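A toy sketch of the recursive, shared-weight composition described above; the binary tree, the vector dimension, and the tanh composition function are illustrative assumptions.

```python
import numpy as np

# A minimal RvNN sketch: the same weight matrix W combines every pair of
# child vectors into a parent vector, applied recursively over a parse tree.
rng = np.random.default_rng(6)
DIM = 4
W = rng.normal(size=(DIM, 2 * DIM))
b = np.zeros(DIM)

def compose(node):
    """node is either a leaf vector or a (left, right) tuple of sub-trees."""
    if isinstance(node, np.ndarray):
        return node
    left, right = (compose(child) for child in node)
    return np.tanh(W @ np.concatenate([left, right]) + b)   # shared weights

# ((w1 w2) (w3 w4)) -- a toy parse of a four-word sentence
leaves = [rng.normal(size=DIM) for _ in range(4)]
root = compose(((leaves[0], leaves[1]), (leaves[2], leaves[3])))
print(root.shape)  # (4,)
```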
8 Long Short-Term
Memory
### Explanation of Long-Term Dependencies and LSTM (Summarized Points)
1. **Long-term dependencies**:
- Situations where the output of an RNN depends on input from many time steps ago.
- Example: In the sentence "The cat, which was very hungry, ate the mouse", the
subject "cat" is related to the verb "ate" even though they're separated by a clause.
- RNNs struggle with such dependencies because the gradient, used to update network weights, becomes either too small or too large as it propagates through the network.
- Gated units like LSTM and GRU help manage long-term dependencies by
remembering or forgetting information based on current inputs.
- These units selectively control the flow of information and help ignore irrelevant data.
4. **Attention mechanisms**:
5. **Challenges of LSTM**:
- Long-term dependencies can lead to the vanishing gradient problem, where the
gradient becomes too small, preventing learning from distant inputs.
- Introduced by Hochreiter & Schmidhuber (1997), LSTMs use memory cells to retain
information over longer periods.
- The architecture includes gates (input, output, forget) to regulate the flow of
information.
7. **Forget gate**:
- It uses two inputs (current input `x_t` and previous hidden state `h_t-1`) and
processes them through the sigmoid function.
8. **Input gate**:
- This gate adds new information to the cell state using sigmoid and tanh functions.
- It selects which parts of the input should be remembered for future use.
9. **Output gate**:
- This gate extracts useful information from the current cell state to be passed as
output.
2. Speech recognition
- Bidirectional LSTMs (BiLSTMs) process sequential data in both forward and backward directions, capturing longer-range dependencies.
- BiLSTMs are made up of two LSTM networks (one for each direction), and their outputs are combined.
- LSTM networks use memory cells and gates to control the flow of information,
enabling them to learn long-term dependencies.
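A minimal NumPy sketch of one LSTM step, with the forget, input, and output gates acting on the memory cell; the stacked parameter layout and the sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters of the
    forget, input, candidate and output transforms (toy layout)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b                 # (4H,) pre-activations
    f = sigmoid(z[0:H])                          # forget gate: what to drop
    i = sigmoid(z[H:2*H])                        # input gate: what to write
    g = np.tanh(z[2*H:3*H])                      # candidate cell contents
    o = sigmoid(z[3*H:4*H])                      # output gate: what to expose
    c_t = f * c_prev + i * g                     # updated memory cell
    h_t = o * np.tanh(c_t)                       # new hidden state
    return h_t, c_t

rng = np.random.default_rng(2)
H, D = 4, 3
h, c = np.zeros(H), np.zeros(H)
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
for t in range(5):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```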
9 Long-Term
Dependencies
A point-wise breakdown of long-term dependencies, presented by Dr. Saurabh Agrawal at VIT Vellore:
- Long-term dependencies occur when the output of a recurrent neural network (RNN)
depends on inputs that happened many time steps earlier.
- Example: In the sentence, "The cat, which was very hungry, ate the mouse,"
understanding the meaning requires remembering that the cat is the subject of the verb
"ate," even though a long clause separates them.
- RNNs are powerful for processing sequential data (e.g., text, speech, video) but
struggle to capture long-term dependencies.
- The gradient, which helps update the network's weights, becomes too small (vanishes) or too large (explodes) as it propagates through the network, leading to unstable training and an inability to learn from distant inputs.
- This issue occurs due to the repeated multiplication of the same matrix at each time step (see the numerical sketch after this list).
- Gated units, like **LSTM (Long Short-Term Memory)** and **GRU (Gated Recurrent
Unit)**, are used to address long-term dependencies.
- These units can control the flow of information, allowing the network to remember or
forget previous inputs depending on the current context.
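A small numerical illustration of why repeated multiplication by the same recurrent matrix makes gradients vanish or explode; the matrix size, the scaling factors, and the 50-step horizon are arbitrary choices for the demo.

```python
import numpy as np

# Repeatedly multiplying by the same recurrent matrix shrinks or blows up
# the backpropagated signal, depending on how large the matrix is.
rng = np.random.default_rng(3)
for scale in (0.5, 1.5):
    W = scale * rng.normal(size=(8, 8)) / np.sqrt(8)   # roughly controls spectral size
    grad = np.ones(8)
    for t in range(50):                                # 50 time steps back
        grad = W.T @ grad
    print(f"scale={scale}: gradient norm after 50 steps = {np.linalg.norm(grad):.2e}")
```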
- Attention mechanisms help by focusing on the most important parts of the input or
output sequence.
- This allows for capturing relationships between distant elements, enhancing sequence
representation.
### 5. **Challenges**
- Sometimes, only recent information is needed for a task (e.g., predicting "sky" after
"the clouds are in the sky").
- In more complex cases (e.g., "I grew up in France. I speak fluent French"), the relevant
information may be separated by a large gap, making it harder for RNNs to predict
outcomes.
- While RNNs theoretically can handle long-term dependencies, they often fail due to
the **vanishing gradient problem**.
- As gradients grow smaller while being propagated back toward earlier time steps (the lower layers of the unrolled network), those parts of the network stop improving and learning.
- LSTMs, introduced by Hochreiter & Schmidhuber in 1997, are specialized RNNs that
learn long-term dependencies.
- LSTMs have a memory cell, input gate, output gate, and forget gate:
- The **memory cell** retains the previous state, while gates control how much
memory to expose.
- The gates manage the flow of information and are responsible for remembering or
forgetting parts of the previous inputs and outputs.
- The **forget gate** removes information from the memory that is not relevant to the
current unit of the LSTM.
- It receives inputs (previous output \( h_{t-1} \) and current input \( x_t \)), multiplies
them by weight matrices, and applies a **sigmoid function** to determine how much
previous state is passed to the next.
- The **input gate** ensures that only relevant information is added to the cell state, avoiding redundancy.
- The **output gate** creates a vector by applying the tanh function to the cell state.
In summary, long-term dependencies in sequential data are crucial for certain tasks but
pose challenges to RNNs due to gradient-related problems. Solutions like LSTMs and
attention mechanisms address these issues, enabling more effective learning from
distant elements in the sequence.
10 Long Short-Term
Memory
A breakdown of the key points about Long Short-Term Memory (LSTM):
2. It is ideal for time series, machine translation, and speech recognition due to order
dependence.
4. Traditional RNNs have a single hidden state passed through time, which can struggle
with learning long-term dependencies.
5. LSTMs address this by introducing a memory cell that holds information for an
extended period.
6. This makes LSTMs suitable for tasks like language translation, speech recognition, and
time series forecasting.
1. LSTM architecture involves memory cells controlled by three gates: input gate, forget gate, and output gate.
- **Input gate:** Controls what new information is added to the memory cell.
- **Forget gate:** Controls what information is removed from the memory cell.
- **Output gate:** Controls what information is output from the memory cell.
2. LSTMs selectively retain or discard information as it flows through the network, helping
them learn long-term dependencies.
3. LSTMs maintain a hidden state, acting as short-term memory, updated based on the
input, previous hidden state, and memory cell’s current state.
3. Bi-LSTMs consist of two LSTM networks: one processing the input sequence forward,
and the other processing it backward.
4. The outputs of both LSTMs are combined to produce the final output.
5. LSTM layers can be stacked to create deeper architectures for learning more complex
patterns in sequential data.
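A short PyTorch example of a stacked bidirectional LSTM along these lines; the layer count, the sizes, and the random input are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A stacked bidirectional LSTM: two layers, both directions; the per-step
# output concatenates the forward and backward hidden states.
bilstm = nn.LSTM(input_size=16, hidden_size=32,
                 num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 16)              # batch of 8 sequences, 20 steps each
out, (h_n, c_n) = bilstm(x)
print(out.shape)                        # torch.Size([8, 20, 64])  = 2 * hidden_size
print(h_n.shape)                        # torch.Size([4, 8, 32])   = layers * directions
```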
1. LSTM has a chain structure with four neural networks and memory blocks (cells).
2. Information is retained by the cells, with memory manipulation done by the three
gates: Forget gate, Input gate, and Output gate.
1. The forget gate removes information from the cell state when it is no longer useful.
3. A vector is created using the tanh function, giving output from -1 to +1, containing values from `h_{t-1}` and `x_t`.
4. The values from the vector and the regulated values are multiplied to extract useful information.
1. The output gate extracts useful information from the current cell state and presents it as output.
3. Information is regulated using a sigmoid function to filter values using inputs `h_{t-1}` and `x_t`.
4. The values of the vector and the regulated values are multiplied and sent as output and as input for the next cell.
1. **Language Modeling:** LSTMs are used in tasks like language modeling, machine
translation, and text summarization by learning dependencies between words.
3. **Time Series Forecasting:** LSTMs predict stock prices, weather, and energy
consumption by learning patterns in time series data.
6. **Video Analysis:** LSTMs are used in video tasks like object detection, activity
recognition, and action classification, often in combination with CNNs.
In short, the LSTM's gated architecture is what gives it its main strength, learning long-term dependencies in sequential data, and this strength underlies all of the applications above.
- GRU stands for Gated Recurrent Unit, a type of recurrent neural network (RNN)
architecture similar to LSTM (Long Short-Term Memory).
- GRU has a simpler architecture compared to LSTM, with fewer parameters, making it
easier to train and more computationally efficient.
- The key difference between GRU and LSTM is how they handle the memory cell state.
- In LSTM, the memory cell state is separate from the hidden state and updated using
three gates: input, output, and forget gates.
- In GRU, the memory cell state is replaced with a "candidate activation vector,"
updated by two gates: the reset gate and the update gate.
2. **GRU Gates**
- **Reset gate**: Determines how much of the previous hidden state should be
forgotten.
- **Update gate**: Determines how much of the candidate activation vector should be
incorporated into the new hidden state.
- GRU is often chosen over LSTM for cases where computational resources are limited,
or a simpler architecture is preferred.
- GRU processes sequential data one element at a time, updating its hidden state
based on the current input and the previous hidden state.
- The candidate vector is used to update the hidden state for the next time step.
- The reset gate controls how much of the previous hidden state to forget, while the
update gate controls how much of the candidate activation vector is incorporated into
the new hidden state.
- **Reset gate** (`r_t`) and **update gate** (`z_t`) are computed using the current input (`x_t`) and previous hidden state (`h_t-1`):
  \( r_t = \sigma(W_r [h_{t-1}, x_t]) \), \( z_t = \sigma(W_z [h_{t-1}, x_t]) \)
- Here, `W_r` and `W_z` are weight matrices learned during training.
- The **candidate activation vector** (`h_t~`) is computed (with its own weight matrix \( W_h \)) from the current input and a reset version of the previous hidden state:
  \( \tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t]) \)
- The new hidden state (`h_t`) is computed by combining the candidate activation vector and the previous hidden state, weighted by the update gate:
  \( h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \)
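Putting the gate and state equations together, a minimal NumPy sketch of one GRU step; the concatenated-parameter layout and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step on the concatenated [h_{t-1}, x_t] vector,
    mirroring the equations above (toy parameter layout)."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx)                                       # reset gate
    z = sigmoid(W_z @ hx)                                       # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate vector
    return (1 - z) * h_prev + z * h_tilde                       # new hidden state

rng = np.random.default_rng(4)
H, D = 4, 3
W_r = rng.normal(size=(H, H + D))
W_z = rng.normal(size=(H, H + D))
W_h = rng.normal(size=(H, H + D))
h = np.zeros(H)
for t in range(6):                                              # one element at a time
    h = gru_step(rng.normal(size=D), h, W_r, W_z, W_h)
print(h)
```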
5. **GRU Architecture**
- **Input Layer**: Receives sequential data, such as words or a time series, and feeds it
into the GRU.
- **Hidden Layer**: Where recurrent computations occur. The hidden state is updated
based on the current input and the previous hidden state.
- **Reset Gate**: Determines how much of the previous hidden state is forgotten. It takes the previous hidden state and the current input to compute a vector of values between 0 and 1.
- **Update Gate**: Determines how much of the candidate activation vector is incorporated into the new hidden state.
- **Output Layer**: Receives the final hidden state as input and produces the network's output, which could be a single number, a sequence, or a probability distribution, depending on the task.
- Models that deal with sequence data, such as time series or natural language, often
face difficulties in capturing long-term dependencies.
- **LSTM and GRU**: Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs) are specifically designed to remember information over long periods,
addressing issues like vanishing gradients.
3. **Attention Mechanisms**
- **Multi-Head Attention**: Using multiple attention heads helps the model focus on
various parts of the sequence simultaneously, further enhancing the model's capacity to
learn long-term dependencies.
4. **Positional Encoding**
5. **Hierarchical Models**
6. **Dilated Convolutions**
7. **Memory-Augmented Networks**
- Models like Neural Turing Machines or Differentiable Neural Computers use external
memory, enabling them to store and retrieve information over long periods.
8. **Regularization Techniques**
- Regularization methods prevent overfitting, ensuring that the model remains sensitive
to long-term dependencies.
9. **Feature Engineering**
- Designing features that explicitly capture long-term trends (e.g., moving averages or
seasonal indicators) can improve model performance on long-term dependencies,
especially in time series data.
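For instance, a trailing moving average and a seasonal lag can be computed as simple extra features; the window and season lengths below are arbitrary illustrative choices.

```python
import numpy as np

def long_term_features(series, window=30, season=365):
    """Trailing moving average plus a seasonal lag feature (toy example)."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    moving_avg = np.convolve(series, kernel, mode="valid")    # smooths short-term noise
    seasonal_lag = np.concatenate([np.full(season, np.nan),   # value one season ago
                                   series[:-season]])
    return moving_avg, seasonal_lag

y = np.sin(np.linspace(0, 20, 800)) + 0.1 * np.random.default_rng(5).normal(size=800)
ma, lag = long_term_features(y)
print(ma.shape, lag.shape)
```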
10. **Data Augmentation**
- Techniques that artificially expand the training dataset expose the model to a wider variety of long-term dependencies.
11. **Normalization**
- Normalizing activations helps stabilize the learning process and improves the model's ability to capture long-term dependencies.
12. **Curriculum Learning**
- Training techniques like **curriculum learning**, where the model is first trained on simpler sequences before being exposed to more complex ones, can enhance long-term dependency learning.
Each of these strategies contributes to enhancing the ability of models to capture and
optimize long-term dependencies, especially in sequence-based tasks.