
11. Concepts of Deep Learning

1 Recurrent Neural Networks

Why RNN?

1. CNNs are great at recognizing objects, animals, and people.
2. However, understanding what's happening in a picture requires more context than a single image.
3. Example: To determine whether a ball in the air is rising or falling, context from a sequence (like a video) is needed.
4. This requires the neural network to "remember" previously encountered information and incorporate it into future calculations.
5. The problem of remembering extends beyond videos to other domains like natural language understanding (NLP), where algorithms need to recall previous information for context.

Issues in Feedforward Neural Networks:

1. Traditional feedforward neural networks do not retain past information, making them unsuitable for tasks requiring memory, like sequential data analysis.

Reason to Use RNN:

1. RNNs feed the output from the previous step as input to the current step.
2. In traditional neural networks, all inputs and outputs are independent of each other.
3. For predicting the next word in a sentence, previous words are needed, which RNNs handle using a hidden layer.
4. The hidden state of an RNN remembers information about sequences.
5. The hidden state is also called a memory state, as it retains previous inputs.
6. RNNs use the same parameters for each input, reducing complexity compared to other neural networks.

Differences between RNN and Feedforward Neural Network:

1. Feedforward neural networks don't have looping nodes and pass information unidirectionally.
2. They are suitable for image classification, where inputs and outputs are independent.
3. However, they are less useful for sequential data, as they don't retain previous inputs like RNNs.

Recurrent Neuron and RNN Unfolding:

1. The fundamental unit in an RNN is a recurrent unit, which has a hidden state allowing the network to remember previous inputs.
2. Advanced versions like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) improve the handling of long-term dependencies.

Types of RNNs:

1. **One to One**: A simple neural network with one input and one output (also known as a Vanilla Neural Network).
2. **One to Many**: One input with multiple outputs (e.g., image captioning).
3. **Many to One**: Multiple inputs but one output (e.g., sentiment analysis).
4. **Many to Many**: Multiple inputs and multiple outputs (e.g., language translation).

RNN Architecture:

1. RNNs have the same input-output structure as other deep neural networks but differ in how information flows.
2. In RNNs, weight matrices are shared across the network, reducing complexity.
3. A hidden state is calculated for each input using the formulas:
   - \( h_t = \sigma(U x_t + W h_{t-1} + B) \)
   - \( Y_t = O(V h_t + C) \)
   - The state matrix holds the hidden state at each timestep.

How RNN Works:

1. An RNN consists of multiple fixed activation units, each with a hidden state.
2. The hidden state signifies the network's knowledge of the past and is updated at every time step.
3. The hidden state is updated using a recurrence relation.
4. The parameters are updated through backpropagation, with a specialized version called Backpropagation Through Time (BPTT) used for sequential data.

Why Recurrent Neural Networks (RNN)?

1. **CNN Limitation**: Convolutional Neural Networks (CNNs) excel at recognizing objects like animals and people, but struggle to understand dynamic contexts such as determining if a ball in the air is rising or falling.
2. **Context Requirement**: Understanding context (like a ball's motion) often requires sequential data, such as a video, rather than a single image.
3. **Memory of Past Data**: To determine motion in a video, a neural network needs to "remember" previous frames, introducing the need for memory-based computation.
4. **Beyond Videos**: Many Natural Language Processing (NLP) tasks, such as sentence prediction or topic recall, also require memory, which RNNs handle by remembering prior information.


Issues in Feedforward Neural Networks

1. **Feedforward Limitation**: In Feedforward Neural Networks, all inputs and outputs are independent, making them unsuitable for sequential tasks (e.g., text or time-series data).
2. **No Memory**: Feedforward networks cannot remember prior inputs, so they can't handle tasks requiring memory of previous steps, like sentence prediction or video analysis.

Reason to Use RNN

1. **Sequential Dependency**: RNNs process sequences, allowing the network to learn and remember prior data, making them well-suited for sequential tasks like language modeling or video analysis.

RNN in Google

1. **Google's Usage**: Google employs RNNs in various applications, such as natural language processing (NLP) and other sequential data-related tasks.

Recurrent Neural Network (RNN) Definition

1. **Input and Output Dependency**: RNNs use the output of the previous step as input to the current step, making them capable of handling sequential data.
2. **Hidden State**: The hidden state in RNNs stores information about previous inputs, allowing the network to retain memory across sequences.
3. **Parameter Sharing**: RNNs use the same set of parameters (weights) for each input, reducing the complexity of training compared to other neural networks.

How RNN Differs from Feedforward Neural Networks

1. **Feedforward Networks**: These networks pass information unidirectionally from input to output, with no looping nodes and no memory, making them less useful for sequential data tasks.
2. **Sequential Data**: RNNs are designed to handle sequential data by maintaining a hidden state that carries information over time steps, unlike Feedforward networks.

Recurrent Neuron and RNN Unfolding

1. **Recurrent Unit**: The basic unit of an RNN is the recurrent unit, capable of maintaining a hidden state, allowing the network to capture dependencies over time.
2. **Advanced RNNs**: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are variations of RNNs that handle long-term dependencies more effectively.


Types of RNNs

1. **One-to-One**: Similar to a simple neural network with one input and one output.
2. **One-to-Many**: One input with multiple outputs, used in tasks like image captioning.
3. **Many-to-One**: Multiple inputs producing a single output, used in sentiment analysis.
4. **Many-to-Many**: Multiple inputs and outputs, often used in language translation.

RNN Architecture and Formula

1. **State Calculation Formula** (a minimal code sketch of these equations follows the Training Through RNN points below):
   - Hidden state formula: \( h_t = \sigma(U x_t + W h_{t-1} + B) \)
   - Output formula: \( Y_t = O(V h_t + C) \)
   - The final output can therefore be expressed as \( Y = f(X, h, W, U, V, B, C) \)

   In these equations:
   - \( h_t \) is the hidden state at time step \( t \),
   - \( U, W, V \) are weight matrices,
   - \( x_t \) (collectively \( X \)) is the input, and
   - \( B, C \) are bias terms.

How RNN Works

1. **Hidden State**: Each unit in an RNN maintains a hidden state that represents the memory of past inputs.
2. **Recurrence Relation**: The hidden state is updated at each time step using the recurrence relation \( h_t = f(h_{t-1}, x_t) \), where \( h_t \) is the current state, \( h_{t-1} \) is the previous state, and \( x_t \) is the input at time step \( t \).
3. **Backpropagation Through Time (BPTT)**: RNNs are trained using BPTT, where the error is propagated back through all previous time steps to update weights.

Issues in Standard RNNs

1. **Vanishing Gradient**: During backpropagation, the gradient can become very small, making learning slow or ineffective for long sequences.
2. **Exploding Gradient**: Gradients can also become very large, leading to instability in the network and excessively large updates to weights.

Training Through RNN

1. **Stepwise Input Processing**: The input is processed one step at a time, with each step updating the current hidden state using the input and the previous state.
2. **Error Calculation**: The error is calculated by comparing the output to the target, and weights are updated using backpropagation through time.
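As referenced above, the state-update and output equations can be made concrete with a short sketch. The following is a minimal NumPy illustration (not the text's own code) of one forward pass through time, assuming a tanh hidden activation and toy dimensions chosen only for the example.

```python
import numpy as np

# Dimensions chosen only for illustration.
input_size, hidden_size, output_size, seq_len = 4, 8, 3, 5
rng = np.random.default_rng(0)

# Shared parameters, reused at every time step (parameter sharing).
U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
B = np.zeros(hidden_size)                                   # hidden bias
C = np.zeros(output_size)                                   # output bias

x = rng.normal(size=(seq_len, input_size))  # one toy input sequence
h = np.zeros(hidden_size)                   # initial hidden (memory) state

for t in range(seq_len):
    h = np.tanh(U @ x[t] + W @ h + B)  # h_t = sigma(U x_t + W h_{t-1} + B)
    y = V @ h + C                      # Y_t = V h_t + C
    print(t, y)
```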


Advantages and Disadvantages of RNN

Advantages:

1. **Memory of Past Information**: RNNs remember past inputs, making them suitable for time-series and sequential data tasks.
2. **Combining with Convolutional Layers**: RNNs can be combined with convolutional layers to improve pixel neighborhood analysis.

Disadvantages:

1. **Gradient Problems**: RNNs suffer from vanishing and exploding gradients.
2. **Difficult Training**: Training RNNs is challenging due to these gradient problems and their sequential nature.
3. **Activation Function Limitation**: Activation functions like tanh or ReLU struggle with very long sequences.

Applications of RNN

1. **Language Modeling**
2. **Speech Recognition**
3. **Machine Translation**
4. **Image Recognition**
5. **Face Detection**
6. **Time Series Forecasting**

Variations of RNN

1. **Bidirectional Neural Networks (BiNN)**: In BiNN, information flows in both directions, useful for tasks where context from both past and future data is needed.
2. **Long Short-Term Memory (LSTM)**: LSTMs introduce gates (forget, input, and output) to manage what information is retained or discarded, helping solve the vanishing gradient problem.

2 Bidirectional RNNs
Key points on Bidirectional Recurrent Neural Networks (BRNNs):

1. **Recurrent Neural Networks (RNNs) Overview**:


- RNNs process sequential input like speech, text, and time series data.
- RNNs handle data as a sequence of vectors, unlike feedforward neural
networks, which use fixed-length vectors.
- The hidden state at each step depends on both the current input and the
hidden state from the previous step.
- RNNs store memory from earlier steps, making them suitable for tasks
requiring context and sequence relationships.

2. **Challenges of RNNs**:
- Conventional RNNs face the vanishing gradient problem during
backpropagation, which hinders learning.
- To overcome this, variants like Long Short-Term Memory (LSTM) and Gated
Recurrent Units (GRU) have been introduced.

3. **Bidirectional Recurrent Neural Networks (BRNNs)**:


- BRNNs process sequential data in both forward and backward directions to
utilize information from both past and future contexts.
- The architecture has two hidden layers: one for forward processing and
another for backward processing.
- Outputs from both hidden layers are combined and passed to a final
prediction layer.
- BRNNs can use any RNN cells, including LSTMs or GRUs, for their hidden
layers.

4. **BRNN Functionality**:
- The forward hidden layer updates the hidden state based on the current input
and the prior hidden state.
- The backward hidden layer processes the sequence in reverse, updating the
hidden state based on future inputs.
- BRNNs improve accuracy by capturing context in both directions.


- Two hidden layers provide additional regularization, which helps improve


model performance.

5. **Training of BRNNs**:
- Gradients are computed in both forward and backward passes during training
using backpropagation through time (BPTT).
- Inference involves a single forward pass, where predictions are made based
on the combined hidden layers' outputs.

6. **Working of BRNNs**:
- Input sequence: A sequence of vectors is fed into the network, which can
have variable lengths.
- Dual processing: Hidden states are computed based on both past (forward)
and future (backward) steps.
- Hidden state calculation: A non-linear activation function is applied to
compute hidden states.
- Output: A non-linear activation function computes the output at each step.
- Training: Weights are adjusted through backpropagation to minimize
prediction error.

7. **Backpropagation in BRNNs**:
   - The hidden state at time `t` combines the forward and backward hidden states.
   - BPTT updates weights separately for the forward and backward passes (a minimal code sketch of the bidirectional forward pass appears at the end of this section).

8. **Applications of BRNNs**:
- Sentiment analysis: Helps in categorizing the sentiment of sentences by
considering both past and future context.
- Named entity recognition: Identifies entities in a sentence by analyzing both
preceding and following contexts.
- Part-of-speech tagging: Classifies words in a sentence based on their parts of
speech.
- Machine translation: Used in encoder-decoder models to translate sentences
by capturing context from both directions.
- Speech recognition: Processes speech signals in both directions to improve
recognition accuracy.


9. **Advantages of BRNNs**:
- Capture context from both past and future inputs.
- Higher accuracy in predictions.
- Efficient in handling variable-length sequences.
- More resilient to noise and irrelevant information in data.

10. **Disadvantages of BRNNs**:


- High computational complexity due to processing in both directions.
- Long training times due to the large number of parameters.
- Difficult to parallelize due to the sequential nature of forward and backward
processing.
- Prone to overfitting, especially with small datasets.
- Harder to interpret compared to simpler models.

11. **BRNN Example**:


- A sentence like "Dhaval loves Apple" is processed by unidirectional and
bidirectional networks for prediction purposes.

12. **BRNN Training Formula**:


- The hidden state is computed by combining forward and backward hidden
states.
- Errors are calculated, and weights are updated for both forward and backward
passes during training.

13. **Training Process of BRNNs**:


- BPTT involves rolling out the network, calculating errors, updating weights,
and rolling it back up.
- Forward and backward passes must be handled separately to avoid
inaccuracies.

14. **BRNN Example Applications**:


- Examples include named entity recognition, sentiment analysis, and machine
translation.
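As a follow-up to the forward/backward processing described in points 3–7 above, here is a minimal PyTorch sketch of a bidirectional RNN layer. The layer sizes, data, and the 5-class output head are placeholder assumptions for illustration; the library concatenates the forward and backward hidden states at each time step.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only.
batch, seq_len, input_size, hidden_size = 2, 6, 10, 16

birnn = nn.RNN(input_size=input_size, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)   # dummy input sequences
out, h_n = birnn(x)

# out combines both directions: forward and backward states are concatenated.
print(out.shape)   # (batch, seq_len, 2 * hidden_size)
print(h_n.shape)   # (2, batch, hidden_size) -> one final state per direction

# A prediction layer can consume the combined representation at each step.
classifier = nn.Linear(2 * hidden_size, 5)    # e.g., 5 output classes per token
logits = classifier(out)
print(logits.shape)  # (batch, seq_len, 5)
```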


3 Autoencoders
**Key Points on Autoencoders**

1. **Introduction to Autoencoders**
- Autoencoders are a subset of neural networks used for unsupervised learning.
- They are adaptable and powerful architectures, ideal for identifying complex
patterns and representations.
- Autoencoders are widely used in various fields like image processing and
anomaly detection due to their ability to learn effective data representations.

2. **Definition of Autoencoders**
- Specialized algorithms that learn efficient representations of input data
without the need for labels.
- Designed for unsupervised learning.
- They compress and represent data without requiring specific labels.
- The structure involves an encoder (reduces dimensionality) and a decoder
(rebuilds the original input).

3. **Architecture of Autoencoders**
- The architecture consists of three main parts: encoder, bottleneck layer, and
decoder.
- **Encoder**: Captures essential features and reduces dimensionality to form
a compressed representation (latent space).
- **Decoder**: Reconstructs the original input from the compressed
representation.
- **Loss function**: Measures the difference between the input and the
reconstructed output (e.g., Mean Squared Error or Binary Cross-Entropy).

4. **Training Autoencoders**
   - During training, autoencoders minimize the reconstruction loss so that the bottleneck layer captures the key features of the input.
   - After training, only the encoder is typically used for encoding new data (a minimal code sketch appears at the end of this section).

5. **Methods to Constrain Autoencoders**


- **Small Hidden Layers**: Encourages capturing only the representative


features by minimizing the size of hidden layers.
- **Regularization**: Adds a loss term to the cost function, preventing the
network from merely copying the input.
- **Denoising**: Adds noise to the input and trains the network to remove it.
- **Tuning Activation Functions**: Adjusting the activation functions to reduce
the number of active nodes and hidden layer size.

6. **Types of Autoencoders**
- **Denoising Autoencoder**: Trains on corrupted input to recover the original
data.
- Advantages: Reduces noise and can generate additional training samples.
- Disadvantages: Challenging noise selection; possible loss of essential
information.

- **Sparse Autoencoder**: Only a few hidden units are active at once,


promoting sparsity.
- Advantages: Filters out noise and irrelevant features; learns meaningful
patterns.
- Disadvantages: Hyperparameter tuning is crucial; computationally complex.

- **Variational Autoencoder (VAE)**: Uses stochastic methods and assumes


latent variable distribution to generate new data points.
- Advantages: Generates new data similar to training data; useful in anomaly
detection.
- Disadvantages: Uses approximations that introduce errors; limited diversity
in generated samples.

- **Convolutional Autoencoder (CAE)**: Uses convolutional layers for


compressing image data.
- Advantages: Efficiently compresses image data and reconstructs missing
parts; handles image variations.
- Disadvantages: Prone to overfitting; may result in lower quality images due
to compression.
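As referenced earlier, here is a minimal PyTorch sketch of the encoder–bottleneck–decoder structure with an MSE reconstruction loss. The layer sizes, dummy data, and optimizer are illustrative assumptions, not part of the original text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into the bottleneck (latent space).
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        # Decoder: reconstructs the input from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # compressed representation
        return self.decoder(z), z

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()               # reconstruction loss

x = torch.rand(64, 784)              # dummy batch standing in for real data
recon, z = model(x)
loss = loss_fn(recon, x)             # compare the reconstruction with the input
loss.backward()
opt.step()
print(loss.item(), z.shape)          # after training, the encoder alone can embed new data
```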


4 Encoder–Decoder
### Key Points on Autoencoders:

- **Autoencoders in General**:
- Autoencoders are a subset of neural networks used for **unsupervised
learning**.
- They are adaptable, useful in various fields like **image processing** and
**anomaly detection**, and help in **learning effective data representations**.

- **Definition of Autoencoders**:
- Autoencoders are designed to learn **efficient representations of input data**
without needing labels.
- They use a two-part structure: an **encoder** and a **decoder**.
- **Encoder** reduces input to a lower-dimensional **latent space**.
- **Decoder** reconstructs the original input from the reduced representation.

- **Architecture of Autoencoders**:
1. **Encoder**: Reduces dimensionality and captures essential patterns.
2. **Bottleneck Layer**: A compressed representation of the input data.
3. **Decoder**: Reconstructs the input from the encoded data.
4. **Loss Function**: Measures reconstruction error (e.g., Mean Squared Error).

- **Training Autoencoders**:
- The goal during training is to **minimize reconstruction loss**, ensuring
important features are captured in the bottleneck layer.
- After training, only the encoder can be used to encode new data.

- **Constraining Autoencoders**:
1. **Small Hidden Layers**: Encourages learning of only essential features.
2. **Regularization**: Adds a loss term to prevent copying input directly.
3. **Denoising**: Adds noise to input, and the network learns to remove it.


4. **Tuning Activation Functions**: Reduces active nodes, forcing efficient


representation.

- **Types of Autoencoders**:
1. **Denoising Autoencoder**: Recovers original input from corrupted data.
- Advantages: Extracts important features, good for **data augmentation**.
- Disadvantages: Selecting noise is challenging, potential information loss.

2. **Sparse Autoencoder**: Most hidden units remain inactive, emphasizing


essential features.
- Advantages: Filters out irrelevant data and focuses on meaningful patterns.
- Disadvantages: Hyperparameters are crucial, and complexity increases.

3. **Variational Autoencoder**: Uses probability to learn latent space


representations.
- Advantages: Good for **generating new data** and anomaly detection.
- Disadvantages: Approximation can lead to errors, limited sample diversity.

4. **Convolutional Autoencoder**: Uses CNNs for image data compression and


reconstruction.
- Advantages: Efficient for **image storage** and handling small variations.
- Disadvantages: Prone to **overfitting**, potential loss in image quality.

- **Encoders and Decoders**:


- **Encoder**: Compresses data through dimensionality reduction.
- **Bottleneck**: Contains the most compressed form of the input.
- **Decoder**: Reconstructs original data from the encoded representation.
- The difference between the reconstructed output and the original input is
known as the **reconstruction error**.
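Since the denoising variant and the reconstruction error come up in both this section and the previous one, here is a minimal, self-contained sketch of one denoising training step; the tiny linear encoder/decoder and the 0.2 noise level are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# A tiny encoder/decoder pair, kept deliberately small for illustration.
encoder = nn.Linear(784, 32)
decoder = nn.Linear(32, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

clean = torch.rand(16, 784)                    # dummy batch standing in for real data
noisy = clean + 0.2 * torch.randn_like(clean)  # corrupt the input (denoising setup)

recon = decoder(torch.relu(encoder(noisy)))    # encode the noisy input, then decode
recon_error = ((recon - clean) ** 2).mean()    # reconstruction error vs. the CLEAN input
recon_error.backward()
opt.step()
print(recon_error.item())
```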

This breakdown provides a complete overview of autoencoders, their


architecture, training, and types.


5 Sequence-to-Sequence Architectures

### Introduction to Seq2Seq Model:


- Seq2Seq (Sequence-to-Sequence) is a machine learning architecture used for
sequential data tasks.
- It processes an input sequence and generates an output sequence.
- The model consists of two key components: an **encoder** and a **decoder**.
- Seq2Seq models have greatly improved machine translation systems.

### What is a Seq2Seq Model?


- It is a machine learning model that accepts sequential data as input and
outputs sequential data.
- Before Seq2Seq, machine translation relied on **statistical methods** and
**phrase-based approaches** like **phrase-based statistical machine translation
(SMT)**.
- SMT struggled with long-distance dependencies and capturing global context.

### Improvements with Seq2Seq Models:


- Seq2Seq models use neural networks, specifically **Recurrent Neural Networks
(RNNs)**, to solve issues like long-distance dependencies.
- The model was introduced in the paper "**Sequence to Sequence Learning with Neural Networks**" (Sutskever et al., 2014), from Google.
- The encoder-decoder architecture is fundamental to many natural language
processing (NLP) tasks.
- The encoder processes the input sequence into a **fixed-size hidden
representation** (context vector).
- The decoder uses the hidden representation to generate the output sequence.
- Seq2Seq models handle sequences of varying lengths and are trained using
input-output pairs.

### Introduction of Transformers:


- Advances in neural networks led to **transformer** models, which are more


advanced Seq2Seq architectures.
- The paper "Attention is All You Need" introduced transformers, revolutionizing
deep learning for language tasks.
- Transformers use **attention layers** and separate encoder and decoder
stacks, enhancing performance in language tasks.
- Seq2Seq models often include attention mechanisms to improve performance,
allowing the decoder to focus on relevant parts of the input sequence.

### Encoder and Decoder in Seq2Seq Models:


- **Encoder Block**:
- Processes the input sequence and captures information in a **context
vector**.
- The encoder uses neural networks or transformer architectures to process
each element of the input.
- The final hidden state of the encoder serves as the context vector, capturing
the input sequence’s important information.

- **Decoder Block**:
- The decoder uses the context vector to generate the output sequence
incrementally.
- During training, it receives both the context vector and the target output
sequence.
- During inference, it uses previously generated outputs as inputs for the next
steps.
- The decoder autoregressively generates output tokens and continues until the
sequence is complete.
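A minimal sketch of the encoder–decoder flow just described, assuming GRU cells, greedy decoding, toy vocabulary sizes, and made-up start/end token ids; it illustrates the idea only and is not the original paper's implementation.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 50, 60, 32, 64   # toy sizes for illustration
SOS, EOS = 1, 2                                   # assumed special token ids

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):                 # src: (batch, src_len)
        _, h = self.rnn(self.emb(src))
        return h                            # final hidden state = context vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, token, h):            # one autoregressive step
        out, h = self.rnn(self.emb(token), h)
        return self.out(out), h

enc, dec = Encoder(), Decoder()
src = torch.randint(3, SRC_VOCAB, (1, 7))   # a dummy source sentence
h = enc(src)                                # context vector from the encoder

token = torch.tensor([[SOS]])               # decoding starts from a start token
for _ in range(10):                         # generate until EOS or a length limit
    logits, h = dec(token, h)
    token = logits.argmax(-1)               # greedy choice; fed into the next step
    if token.item() == EOS:
        break
    print(token.item())
```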

### RNN-based Seq2Seq Models:


- RNNs can map sequences to sequences when the alignment between input and
output is predefined.
- However, RNNs suffer from **vanishing gradient** problems, which is why
advanced versions like **LSTMs (Long Short-Term Memory)** or **GRUs (Gated
Recurrent Units)** are used.
- LSTMs take two inputs at each time step: the current input and the hidden state/output from the previous step, which is what makes the network "recurrent."


### Advantages of Seq2Seq Models:


1. **Flexibility**: Can handle various tasks like machine translation, text
summarization, image captioning, and handle variable-length sequences.
2. **Sequential Data Handling**: Suitable for tasks involving natural language,
speech, and time-series data.
3. **Context Handling**: Encoder-decoder architecture helps capture and utilize
input context for output generation.
4. **Attention Mechanism**: Improves performance by focusing on specific parts
of the input when generating output.

### Disadvantages of Seq2Seq Models:


1. **Computational Expense**: They require significant computational resources
to train and optimize.
2. **Limited Interpretability**: Their internal workings are hard to interpret,
complicating understanding of their decisions.
3. **Overfitting**: Without proper regularization, they may overfit the training
data, performing poorly on new data.
4. **Handling Rare Words**: They struggle with words not present in the training
data.
5. **Long Input Sequences**: Handling very long input sequences can be
problematic, as the context vector may not capture all the information.

### Applications of Seq2Seq Models:


1. **Text Summarization**: Effective in summarizing news and documents by
understanding the input text.
2. **Speech Recognition**: Excel in processing audio for Automatic Speech
Recognition (ASR), capturing spoken language patterns.
3. **Image Captioning**: Can integrate image features with textual generation to
describe images in a human-readable format.


6 Deep Recurrent Networks

1. **Speech Recognition Overview:**


- Speech recognition is the process where computers identify the text in
speech.
- Speech is sequential in nature.
- To model speech recognition in deep learning, the appropriate model must be
selected.

2. **Initial Speech Recognition Model Analysis:**


- A speech recognition RNN model was tested but showed unsatisfactory
results.
- Deep feedforward neural networks provided better accuracy compared to
typical RNNs.
- Researchers proposed adding depth to RNNs, similar to deep feedforward
networks, to improve accuracy.

3. **Deep Recurrent Networks (Deep RNNs):**


- **Depth in RNNs:** RNNs are naturally deep in time, but introducing depth in
space, like feedforward networks, led to the concept of Deep RNNs.
- Deep RNNs have multiple hidden units that perform looping operations.
- This design allows for more complex data representation across time steps.

4. **2-Layer Deep RNNs:** Introduced as part of exploring depth in RNNs.

5. **Multi-Layer Deep RNNs:**


- These models increase the distance a variable travels from one time step to
the next (t to t+1).
- They can use simple RNNs, GRUs, or LSTMs as hidden units.


- Multi-layer deep RNNs capture varied data representations and pass multiple
hidden states across layers.

6. **Mathematical Notation:** A section discussing mathematical representation


of multi-layer deep RNNs (specific details not provided here).

7. **Training Deep RNNs:**


- Training of Deep RNNs follows a similar process to Backpropagation Through
Time (BPTT) with additional hidden units.
- These networks capture complex relationships in data to improve prediction.

8. **Applications of Deep RNNs:**


- Deep RNNs are widely used in speech recognition (Siri, Alexa), language
translation, self-driving cars, music generation, and natural language processing.

9. **Steps to Develop Deep RNN for Sentiment Analysis:**


1. **Data Preparation:** Gather, clean, tokenize, and convert text reviews to
numerical format using libraries like NLTK or spaCy.
2. **Model Architecture Design:** Decide on the number of layers, hidden units, and the recurrent unit (LSTM or GRU), and handle input/output sequences via padding or truncation (a minimal model sketch appears at the end of this section).
3. **Model Training:** Split data into training and validation sets, and train
using algorithms like stochastic gradient descent. Set hyperparameters such as
learning rate and batch size.
4. **Model Evaluation:** Test the model on a separate dataset and evaluate its
performance using accuracy, precision, recall, and F1 score.
5. **Model Deployment:** Deploy the trained model to production for real-time
sentiment classification, potentially via a web app or API.

10. **Overall Development:** Developing a deep RNN application requires


technical skills in programming, machine learning, data preprocessing, and
domain understanding.
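To tie the steps above together, here is a minimal sketch of a 2-layer (stacked) LSTM sentiment classifier in PyTorch; the vocabulary size, dimensions, and the binary output head are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn

class DeepRNNSentiment(nn.Module):
    """A 2-layer ('deep in space') LSTM over token ids, ending in a binary score."""
    def __init__(self, vocab_size=10_000, emb_dim=64, hidden=128, num_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # num_layers > 1 stacks recurrent layers on top of each other.
        self.rnn = nn.LSTM(emb_dim, hidden, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):                 # (batch, seq_len) of padded token ids
        out, (h_n, c_n) = self.rnn(self.emb(token_ids))
        return torch.sigmoid(self.head(h_n[-1]))  # last layer's final hidden state

model = DeepRNNSentiment()
dummy_reviews = torch.randint(0, 10_000, (4, 20))  # 4 padded reviews of length 20
print(model(dummy_reviews).shape)                  # (4, 1) sentiment scores
```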


7 Bidirectional Encoder Representations from Transformers (BERT)
1. **Introduction to BERT:**

- BERT (Bidirectional Encoder Representations from Transformers) is a Natural


Language Processing (NLP) model developed by Google AI.

- Achieved state-of-the-art accuracy on 11 NLP and Natural Language Understanding


(NLU) tasks, including SQuAD v1.1, GLUE, and SWAG.

2. **Training Data:**

- Pre-trained using Wikipedia (2,500 million words) and Book Corpus (800 million
words).

- Can be fine-tuned with specific question and answer datasets.

3. **Challenges in NLP:**

- One major challenge is the lack of training data for NLP tasks.

- Many task-specific datasets only contain a few thousand or hundred thousand human-
labeled examples.

4. **BERT's Approach:**

- To address the lack of data, BERT trains on large, unlabeled text corpora
(unsupervised or semi-supervised learning).

- The model can be fine-tuned for specific tasks using supervised learning.

5. **Bidirectionality:**

- Traditional language models process text in one direction (either left-to-right or right-
to-left).

- BERT is unique because it reads in both directions simultaneously, a feature enabled


by the Transformer architecture.

6. **BERT Architecture:**

- BERT consists of two versions: BERT BASE and BERT LARGE.

- BERT BASE has 12 layers in the encoder stack, 12 attention heads, 110 million parameters, and a hidden size of 768.


- BERT LARGE has 24 layers, 16 attention heads, 340 million parameters, and a hidden size of 1024.

7. **Transformer Architecture:**

- The transformer architecture consists of an encoder-decoder network with self-


attention on the encoder side and attention on the decoder side.

- BERT uses only the encoder part of the transformer.

8. **Input Representation:**

- BERT's input is represented in a single token sequence, which can include a single
sentence or a pair of sentences (e.g., question and answer).

- Uses WordPiece embeddings with a 30,000 token vocabulary.

9. **Tokenization in BERT:**

- Each input sequence starts with a [CLS] token (classification token).

- For sentence pairs, [SEP] tokens are used to separate them, and a learned embedding
distinguishes whether a token belongs to sentence A or B.

10. **Understanding Context:**

- BERT excels at understanding context, which is crucial for texts with ambiguous
words (homonyms).

- For example:

1. "You were right." vs. "Make a right turn at the light." The meaning of "right"
changes based on context.

2. "My favorite flower is a rose." vs. "He quickly rose from his seat." Here, "rose" is
interpreted differently in each sentence.

11. **Pre-training Tasks:**

- BERT is pre-trained on two main tasks:

1. **MLM (Masked Language Modeling):** Predicts missing words in sentences by


randomly masking 15% of the tokens.

2. **NSP (Next Sentence Prediction):** Determines if a given sentence B follows


sentence A in a text.

12. **Masked Language Modeling (MLM):**

- For 80% of the time, the masked word is replaced with the [MASK] token.

- For 10% of the time, the masked word is replaced with a random word.


- For the remaining 10%, the word remains unchanged.
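A minimal plain-Python sketch of this 80/10/10 masking rule over a token list; the toy vocabulary and tokenization are stand-ins for illustration, not BERT's actual WordPiece machinery.

```python
import random

VOCAB = ["the", "man", "went", "to", "store", "bought", "milk", "penguin"]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original word at masked positions."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:          # only ~15% of positions are selected
            continue
        labels[i] = tok                        # the model must predict this token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"               # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(VOCAB)      # 10%: replace with a random word
        # remaining 10%: leave the token unchanged
    return masked, labels

print(mask_tokens("the man went to the store and bought a gallon of milk".split()))
```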

13. **Next Sentence Prediction (NSP):**

- In BERT’s training, pairs of sentences are used to predict whether the second
sentence follows the first.

- 50% of the time, sentence B is the actual next sentence (labeled "IsNext").

- The other 50% of the time, sentence B is a random sentence (labeled "NotNext").

14. **Examples of NSP:**

- **Input:** "[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk
[SEP]"

- **Output:** IsNext.

- **Input:** "[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flightless
birds [SEP]"

- **Output:** NotNext.


8 Recursive Neural Networks



1. **Recursive Neural Networks (RvNNs) Overview**

- RvNNs are a type of deep neural network used in natural language processing (NLP).

- A Recursive Neural Network is created when the same set of weights is applied
recursively to a structured input to make structured predictions.

2. **What Is a Recursive Neural Network?**

- **Deep Learning** is a subfield of machine learning and artificial intelligence (AI),


inspired by the functioning of the human brain to process data and learn patterns.

- **Neural Networks** are the core of deep learning and are loosely modeled after the
human brain to recognize patterns in data.

- Recursive Neural Networks have a deep, tree-like structure, making them capable of
handling hierarchical data.

- In RvNNs, the tree structure forms by combining child nodes to produce parent nodes.
Each child-parent connection has a weight matrix, and similar children share the same
weights.

- The number of children for every node is fixed to ensure recursive operations can be
applied with consistent weights.

- RvNNs are useful when there's a need to parse entire sentences, particularly in NLP
tasks.

3. **Recursive vs. Recurrent Neural Networks**

- Recurrent Neural Networks (RNNs) process data sequentially over time, whereas Recursive Neural Networks (RvNNs) apply the same shared weights over a tree structure; this makes RNNs suited to sequential data and RvNNs suited to hierarchical data such as parse trees.
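Building on the child-to-parent combination described in point 2, here is a minimal NumPy sketch that merges two child vectors into a parent vector with a single shared weight matrix, applied recursively over a toy binary parse tree; all shapes, values, and the tree itself are illustrative assumptions.

```python
import numpy as np

dim = 4                                         # toy embedding size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.3, size=(dim, 2 * dim))  # one weight matrix shared by all merges
b = np.zeros(dim)

def compose(left, right):
    """Parent representation from two fixed-arity children (shared weights)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Leaves: word vectors for "the", "cat", "sat" (random stand-ins).
the, cat, sat = (rng.normal(size=dim) for _ in range(3))

# Toy parse: ((the cat) sat) -- the same compose() is reused at every node.
noun_phrase = compose(the, cat)
sentence = compose(noun_phrase, sat)
print(sentence)
```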



9 Long Short-Term Memory
### Explanation of Long-Term Dependencies and LSTM (Summarized Points)

1. **Long-term dependencies**:

- Situations where the output of an RNN depends on input from many time steps ago.

- Example: In the sentence "The cat, which was very hungry, ate the mouse", the
subject "cat" is related to the verb "ate" even though they're separated by a clause.

2. **Why long-term dependencies are hard to learn**:

- RNNs struggle with capturing long-term dependencies due to the vanishing or


exploding gradient problem.

- This occurs because the gradient, used to update network weights, becomes either
too small or too large as it propagates through the network.

3. **Handling long-term dependencies with gated units**:

- Gated units like LSTM and GRU help manage long-term dependencies by
remembering or forgetting information based on current inputs.

- These units selectively control the flow of information and help ignore irrelevant data.

4. **Attention mechanisms**:

- Attention mechanisms help handle long-term dependencies by focusing on important


parts of the input/output sequence.

- Self-attention computes the similarity between elements in a sequence and uses


weights to create a context vector, capturing relationships between distant elements.

5. **Challenges of LSTM**:

- Long-term dependencies can lead to the vanishing gradient problem, where the
gradient becomes too small, preventing learning from distant inputs.

6. **LSTM as a solution to long-term dependencies**:

- LSTMs are a special type of RNN designed to handle long-term dependencies.

- Introduced by Hochreiter & Schmidhuber (1997), LSTMs use memory cells to retain
information over longer periods.


- The architecture includes gates (input, output, forget) to regulate the flow of
information.

7. **Forget gate**:

- This gate removes information that is no longer needed by applying a filter.

- It uses two inputs (current input `x_t` and previous hidden state `h_t-1`) and
processes them through the sigmoid function.

8. **Input gate**:

- This gate adds new information to the cell state using sigmoid and tanh functions.

- It selects which parts of the input should be remembered for future use.

9. **Output gate**:

- This gate extracts useful information from the current cell state to be passed as
output.

10. **LSTM applications**:

1. Language modeling (e.g., machine translation, text summarization)

2. Speech recognition

3. Time series forecasting (e.g., stock prices, weather)

4. Anomaly detection (e.g., fraud detection, network intrusion)

5. Recommender systems (e.g., personalized recommendations)

6. Video analysis (e.g., object detection, activity recognition)

11. **Bidirectional LSTM (BiLSTM)**:

- Processes sequential data in both forward and backward directions, capturing longer-
range dependencies.

- BiLSTMs are made up of two LSTM networks (one for each direction), and their
outputs are combined.

12. **Overall LSTM architecture**:

- LSTM networks use memory cells and gates to control the flow of information,
enabling them to learn long-term dependencies.
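As a small illustration of the memory cell and hidden state described above, here is a minimal PyTorch usage sketch of an LSTM layer; the shapes are toy values, and the `(h_n, c_n)` pair corresponds to the short-term hidden state and the long-term cell state.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(3, 12, 8)          # 3 sequences, 12 time steps, 8 features each
out, (h_n, c_n) = lstm(x)

print(out.shape)   # (3, 12, 16): hidden state at every time step
print(h_n.shape)   # (1, 3, 16): final hidden (short-term) state
print(c_n.shape)   # (1, 3, 16): final cell (long-term memory) state
```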


10 Long-Term Dependencies
Key points on long-term dependencies (based on material presented by Dr. Saurabh Agrawal at VIT Vellore):

### 1. **What are Long-Term Dependencies?**

- Long-term dependencies occur when the output of a recurrent neural network (RNN)
depends on inputs that happened many time steps earlier.

- Example: In the sentence, "The cat, which was very hungry, ate the mouse,"
understanding the meaning requires remembering that the cat is the subject of the verb
"ate," even though a long clause separates them.

- These dependencies can affect the performance of RNNs when generating or


analyzing such sequences.

### 2. **Why Are Long-Term Dependencies Hard to Learn?**

- RNNs are powerful for processing sequential data (e.g., text, speech, video) but
struggle to capture long-term dependencies.

- The difficulty arises from the **vanishing or exploding gradient problem**.

- The gradient, which helps update the network’s weights, becomes too small
(vanishes) or too large (explodes) as it propagates through the network, leading to:

- Difficulty learning from distant inputs (vanishing gradients).

- Instability and erratic outputs (exploding gradients).

- This issue occurs due to the repeated multiplication of the same matrix at each time
step.
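The effect of repeatedly multiplying by the same matrix can be seen directly in a small numerical sketch (values chosen only for illustration): with the largest singular value below 1 the repeated product shrinks toward zero, and above 1 it blows up.

```python
import numpy as np

def repeated_product_norm(scale, steps=50):
    """Norm of a vector pushed through the same weight matrix `steps` times."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 16))
    W *= scale / np.linalg.norm(W, 2)   # set the largest singular value to `scale`
    v = rng.normal(size=16)
    for _ in range(steps):
        v = W @ v
    return np.linalg.norm(v)

print(repeated_product_norm(0.9))   # shrinks toward 0  -> vanishing gradients
print(repeated_product_norm(1.1))   # grows rapidly     -> exploding gradients
```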

### 3. **Handling Long-Term Dependencies with Gated Units**

- Gated units, like **LSTM (Long Short-Term Memory)** and **GRU (Gated Recurrent
Unit)**, are used to address long-term dependencies.

- These units can control the flow of information, allowing the network to remember or
forget previous inputs depending on the current context.

- This selective information access improves handling of long-term dependencies by


retaining relevant information and ignoring irrelevant details.


### 4. **Using Attention Mechanisms to Handle Long-Term Dependencies**

- Attention mechanisms help by focusing on the most important parts of the input or
output sequence.

- **Self-attention** computes the similarity between each element in the sequence,


assigning weights and creating a context vector that summarizes the information.

- This allows for capturing relationships between distant elements, enhancing sequence
representation.
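A minimal NumPy sketch of the similarity-and-weighting idea behind self-attention (scaled dot-product form with random toy vectors; the learned query/key/value projections are omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 5, 8
X = rng.normal(size=(seq_len, dim))            # one toy sequence of 5 element vectors

scores = X @ X.T / np.sqrt(dim)                # pairwise similarity between elements
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row

context = weights @ X                          # each position: weighted sum of all positions
print(weights.round(2))                        # attention weights, rows sum to 1
print(context.shape)                           # (5, 8) context vectors
```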

### 5. **Challenges**

- The challenge of long-term dependencies arises primarily due to the vanishing


gradient problem, which makes it difficult for standard RNNs to learn effectively from
distant data points in a sequence.

### 6. **LSTM: A Solution to Long-Term Dependencies**

- Sometimes, only recent information is needed for a task (e.g., predicting "sky" after
"the clouds are in the sky").

- In more complex cases (e.g., "I grew up in France. I speak fluent French"), the relevant
information may be separated by a large gap, making it harder for RNNs to predict
outcomes.

- While RNNs theoretically can handle long-term dependencies, they often fail due to
the **vanishing gradient problem**.

- As gradients grow smaller while being propagated back through earlier time steps (the lower layers of the unrolled network), those parts of the network stop improving and learning.

### 7. **Long Short-Term Memory (LSTM) Networks**

- LSTMs, introduced by Hochreiter & Schmidhuber in 1997, are specialized RNNs that
learn long-term dependencies.

- LSTMs have a memory cell, input gate, output gate, and forget gate:

- The **memory cell** retains the previous state, while gates control how much
memory to expose.

- The gates manage the flow of information and are responsible for remembering or
forgetting parts of the previous inputs and outputs.

### 8. **Forget Gate**

- The **forget gate** removes information from the memory that is not relevant to the
current unit of the LSTM.

- It receives inputs (previous output \( h_{t-1} \) and current input \( x_t \)), multiplies
them by weight matrices, and applies a **sigmoid function** to determine how much
previous state is passed to the next.


### 9. **Input Gate**

- The **input gate** adds new information to the memory cell.

- A combination of \( h_{t-1} \) and \( x_t \) is passed through sigmoid and tanh


functions to create a vector of possible values that can be added to the memory.

- This ensures that only relevant information is added to the cell state, avoiding
redundancy.

### 10. **Output Gate**

- The **output gate** creates a vector by applying the tanh function to the cell state.

- It uses \( h_{t-1} \) and \( x_t \) to create a filter through a sigmoid function,


regulating which values from the created vector will be output.

- The output is generated by multiplying the filtered values.

In summary, long-term dependencies in sequential data are crucial for certain tasks but
pose challenges to RNNs due to gradient-related problems. Solutions like LSTMs and
attention mechanisms address these issues, enabling more effective learning from
distant elements in the sequence.


11 Long Short-Term Memory

### General Overview of LSTM:

1. LSTM excels in sequence prediction tasks and captures long-term dependencies.

2. It is ideal for time series, machine translation, and speech recognition due to order
dependence.

3. LSTM is an improved version of a recurrent neural network (RNN) designed by


Hochreiter & Schmidhuber.

4. Traditional RNNs have a single hidden state passed through time, which can struggle
with learning long-term dependencies.

5. LSTMs address this by introducing a memory cell that holds information for an
extended period.

6. This makes LSTMs suitable for tasks like language translation, speech recognition, and
time series forecasting.

### LSTM Architecture:

1. LSTM architecture involves memory cells controlled by three gates: input gate, forget
gate, and output gate.

- **Input gate:** Controls what information is added to the memory cell.

- **Forget gate:** Controls what information is removed from the memory cell.

- **Output gate:** Controls what information is output from the memory cell.

2. LSTMs selectively retain or discard information as it flows through the network, helping
them learn long-term dependencies.

3. LSTMs maintain a hidden state, acting as short-term memory, updated based on the
input, previous hidden state, and memory cell’s current state.

### Bidirectional LSTM Model:

1. Bidirectional LSTM (Bi-LSTM) is a recurrent neural network that processes sequential


data in both forward and backward directions.

2. Bi-LSTM allows learning of longer-range dependencies in sequential data than


traditional LSTMs.

3. Bi-LSTMs consist of two LSTM networks: one processing the input sequence forward,
and the other processing it backward.


4. The outputs of both LSTMs are combined to produce the final output.

5. LSTM layers can be stacked to create deeper architectures for learning more complex
patterns in sequential data.

### LSTM Working (Overview):

1. LSTM has a chain structure with four neural networks and memory blocks (cells).

2. Information is retained by the cells, with memory manipulation done by the three
gates: Forget gate, Input gate, and Output gate.

### LSTM Working: Forget Gate:

1. The forget gate removes information from the cell state when it is no longer useful.

2. Two inputs are given:

- **xt** (input at a particular time)

- **ht-1** (previous cell output)

3. These inputs are multiplied by weight matrices, with bias added.

4. The result passes through a sigmoid activation function, giving a value between 0 and 1 for each element of the cell state:

- **Values near 0:** The corresponding information is forgotten.

- **Values near 1:** The corresponding information is retained.

### LSTM Working: Input Gate:

1. The input gate adds useful information to the cell state.

2. Information is regulated using a sigmoid function to filter the values to be


remembered, similar to the forget gate.

3. A vector is created using the tanh function, giving output from -1 to +1, containing
values from **ht-1** and **xt**.

4. The values from the vector and regulated values are multiplied to extract useful
information.

### LSTM Working: Output Gate:

1. The output gate extracts useful information from the current cell state and presents it
as output.

2. A vector is generated by applying the tanh function on the cell.

3. Information is regulated using a sigmoid function to filter values using inputs **ht-1**
and **xt**.

4. The values of the vector and regulated values are multiplied and sent as output and
input for the next cell.
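A minimal NumPy sketch of one LSTM cell step, wiring together the forget, input, and output gates described above; the weight shapes and values are arbitrary illustrative choices rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 4, 6
rng = np.random.default_rng(0)

# One weight matrix per gate, each acting on the concatenation [h_{t-1}, x_t].
Wf, Wi, Wc, Wo = (rng.normal(scale=0.2, size=(hidden_size, hidden_size + input_size))
                  for _ in range(4))
bf = bi = bc = bo = np.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)           # forget gate: what to drop from the cell state
    i = sigmoid(Wi @ z + bi)           # input gate: what new information to write
    c_tilde = np.tanh(Wc @ z + bc)     # candidate values to add
    c = f * c_prev + i * c_tilde       # updated cell state (long-term memory)
    o = sigmoid(Wo @ z + bo)           # output gate: what to expose
    h = o * np.tanh(c)                 # new hidden state (short-term memory)
    return h, c

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(3, input_size)):   # three toy time steps
    h, c = lstm_step(x_t, h, c)
print(h, c)
```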


### LSTM Applications:

1. **Language Modeling:** LSTMs are used in tasks like language modeling, machine
translation, and text summarization by learning dependencies between words.

2. **Speech Recognition:** LSTMs are applied in speech-to-text transcription and


recognizing spoken commands by identifying speech patterns.

3. **Time Series Forecasting:** LSTMs predict stock prices, weather, and energy
consumption by learning patterns in time series data.

4. **Anomaly Detection:** LSTMs detect anomalies like fraud or network intrusion by


identifying patterns that deviate from the norm.

5. **Recommender Systems:** LSTMs recommend movies, music, and books by learning


patterns in user behavior.

6. **Video Analysis:** LSTMs are used in video tasks like object detection, activity
recognition, and action classification, often in combination with CNNs.

In summary, LSTMs combine memory cells and gating to learn long-term dependencies in sequential data, which underlies the applications listed above.


12 Other Gated RNNs: Gated Recurrent Unit (GRU)

1. **What is Gated Recurrent Unit (GRU)?**

- GRU stands for Gated Recurrent Unit, a type of recurrent neural network (RNN)
architecture similar to LSTM (Long Short-Term Memory).

- Like LSTM, GRU is designed to model sequential data by allowing selective


information to be remembered or forgotten over time.

- GRU has a simpler architecture compared to LSTM, with fewer parameters, making it
easier to train and more computationally efficient.

- The key difference between GRU and LSTM is how they handle the memory cell state.

- In LSTM, the memory cell state is separate from the hidden state and updated using
three gates: input, output, and forget gates.

- In GRU, the memory cell state is replaced with a "candidate activation vector,"
updated by two gates: the reset gate and the update gate.

2. **GRU Gates**

- **Reset gate**: Determines how much of the previous hidden state should be
forgotten.

- **Update gate**: Determines how much of the candidate activation vector should be
incorporated into the new hidden state.

- GRU is often chosen over LSTM for cases where computational resources are limited,
or a simpler architecture is preferred.

3. **How GRU Works**

- GRU processes sequential data one element at a time, updating its hidden state
based on the current input and the previous hidden state.

- At each time step, a "candidate activation vector" is computed by combining


information from the input and the previous hidden state.

- The candidate vector is used to update the hidden state for the next time step.

- The reset gate controls how much of the previous hidden state to forget, while the
update gate controls how much of the candidate activation vector is incorporated into
the new hidden state.

4. **Mathematics Behind GRU**

- **Reset gate** (`r_t`) and **update gate** (`z_t`) are computed using the current
input (`x_t`) and previous hidden state (`h_t-1`):


- `r_t = sigmoid(W_r * [h_t-1, x_t])`

- `z_t = sigmoid(W_z * [h_t-1, x_t])`

- Here, `W_r` and `W_z` are weight matrices learned during training.

- The **candidate activation vector** (`h_t~`) is computed using the current input and
a reset version of the previous hidden state:

- `h_t~ = tanh(W_h * [r_t * h_t-1, x_t])`

- The new hidden state (`h_t`) is computed by combining the candidate activation
vector and the previous hidden state, weighted by the update gate:

- `h_t = (1 - z_t) * h_t-1 + z_t * h_t~`
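A minimal NumPy sketch that implements exactly the three equations above for a single time step; the sizes and random weights are illustrative only (biases are omitted, matching the equations as written).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 3, 5
rng = np.random.default_rng(0)
W_r, W_z, W_h = (rng.normal(scale=0.3, size=(hidden_size, hidden_size + input_size))
                 for _ in range(3))

def gru_step(x_t, h_prev):
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)                                  # reset gate
    z_t = sigmoid(W_z @ concat)                                  # update gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate activation
    return (1 - z_t) * h_prev + z_t * h_cand                     # new hidden state

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(4, input_size)):   # four toy time steps
    h = gru_step(x_t, h)
print(h)
```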

5. **GRU Architecture**

- **Input Layer**: Receives sequential data, such as words or a time series, and feeds it
into the GRU.

- **Hidden Layer**: Where recurrent computations occur. The hidden state is updated
based on the current input and the previous hidden state.

- **Reset Gate**: Determines how much of the previous hidden state is forgotten. It
takes the previous hidden state and the current input to compute a vector between 0 and
1.

- **Update Gate**: Determines how much of the candidate activation vector is


incorporated into the new hidden state. It also takes the previous hidden state and
current input to produce a vector between 0 and 1.

- **Candidate Activation Vector**: A modified version of the previous hidden state,


computed using a tanh activation function that squashes its output between -1 and 1.

- **Output Layer**: Receives the final hidden state as input and produces the network’s
output, which could be a single number, a sequence, or a probability distribution,
depending on the task.


13 Optimization for Long-Term Dependencies
Key strategies for capturing long-term dependencies in sequence models:

1. **Challenge of Optimizing for Long-Term Dependencies**

- Models that deal with sequence data, such as time series or natural language, often
face difficulties in capturing long-term dependencies.

2. **Recurrent Neural Networks (RNNs)**

- **LSTM and GRU**: Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs) are specifically designed to remember information over long periods,
addressing issues like vanishing gradients.

3. **Attention Mechanisms**

- **Self-Attention**: Self-attention mechanisms, like those in Transformer models, allow


the model to assess the importance of different parts of an input sequence, aiding in the
capture of long-range dependencies.

- **Multi-Head Attention**: Using multiple attention heads helps the model focus on
various parts of the sequence simultaneously, further enhancing the model's capacity to
learn long-term dependencies.

4. **Positional Encoding**

- In Transformer models, positional encoding is used to maintain information about the


order and position of elements in a sequence, helping with understanding long-range
context.

5. **Hierarchical Models**

- Hierarchical structures process data at multiple levels of granularity, allowing the


model to first capture local dependencies and then move to global dependencies.

6. **Dilated Convolutions**

- Using dilated convolutions in Convolutional Neural Networks (CNNs) helps capture a


broader context without significantly increasing the number of model parameters.

7. **Memory-Augmented Networks**

- Models like Neural Turing Machines or Differentiable Neural Computers use external
memory, enabling them to store and retrieve information over long periods.


8. **Regularization Techniques**

- Regularization methods prevent overfitting, ensuring that the model remains sensitive
to long-term dependencies.

9. **Feature Engineering**

- Designing features that explicitly capture long-term trends (e.g., moving averages or
seasonal indicators) can improve model performance on long-term dependencies,
especially in time series data.

10. **Data Augmentation**

- Techniques that artificially expand the training dataset expose the model to a wider
variety of long-term dependencies.

11. **Gradient Clipping**

- Implementing gradient clipping helps manage exploding gradients, ensuring the


model can learn from longer sequences without losing valuable information.
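A minimal PyTorch sketch of gradient clipping inside a training step; the model, the dummy data, the placeholder loss, and the max-norm value of 1.0 are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 100, 8)             # a batch of long toy sequences
out, _ = model(x)
loss = out.pow(2).mean()               # placeholder loss for illustration

opt.zero_grad()
loss.backward()
# Rescale gradients so their global norm does not exceed 1.0 (tames explosions).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```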

12. **Batch Normalization and Layer Normalization**

- Normalizing activations helps stabilize the learning process and improves the model’s
ability to capture long-term dependencies.

13. **Training Techniques**

- Using training techniques like **curriculum learning**, where the model is first
trained on simpler sequences before being exposed to more complex ones, can enhance
long-term dependency learning.

14. **Advanced Architectures**

- Exploring advanced architectures like Transformers with memory networks or other


novel designs that inherently address long-range dependencies can further improve
model performance.

Each of these strategies contributes to enhancing the ability of models to capture and
optimize long-term dependencies, especially in sequence-based tasks.

