AI Foundation Application-RNN

Recurrent Neural Networks

Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education

April 17, 2025

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 1 / 60


Introduction: The Need for Memory

• Traditional feed-forward neural networks (FFNNs) process inputs independently.


• This is insufficient for sequential data (e.g., text, time series).
• RNNs introduce the concept of memory to handle sequences.
• They maintain a hidden state that captures information from previous inputs.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 2 / 60


RNN Architecture: Unrolling in Time
• RNNs process sequential data by maintaining a hidden state that carries information across time steps.
• The “unrolling” visualization shows how the RNN processes a sequence step by step.
• At each time step t:
• The RNN receives an input xt .
• It updates its hidden state ht based on the current input and the previous hidden state ht−1 .
• It produces an output yt (optional, depending on the task).
• The same weight matrices (Wxh , Whh , Why ) are used at each time step for model efficiency.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 3 / 60


RNN Equations
• Hidden state update:

ht = σ(Wxh xt + Whh ht−1 + bh )

• Output calculation:

yt = softmax(Why ht + by )

where:
• xt : Input at time t
• ht : Hidden state at time t
• yt : Output at time t
• Wxh , Whh , Why : Weight matrices
• bh , by : Bias vectors
• σ: Activation function (e.g., tanh, ReLU)
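
These two equations translate almost directly into code. Below is a minimal NumPy sketch of a single forward step, assuming tanh for σ and a softmax output; the sizes nx = 4, nh = 8, ny = 3 and the random weights are purely illustrative.

import numpy as np

def softmax(z):
    z = z - z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    """One RNN time step: update the hidden state, then compute the output."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)   # hidden state update
    y_t = softmax(Why @ h_t + by)                  # output distribution
    return h_t, y_t

# Illustrative sizes: nx = 4 inputs, nh = 8 hidden units, ny = 3 classes
nx, nh, ny = 4, 8, 3
rng = np.random.default_rng(0)
Wxh, Whh, Why = rng.normal(size=(nh, nx)), rng.normal(size=(nh, nh)), rng.normal(size=(ny, nh))
bh, by = np.zeros(nh), np.zeros(ny)

h = np.zeros(nh)                                   # initial hidden state h0
x = rng.normal(size=nx)                            # one input vector x1
h, y = rnn_step(x, h, Wxh, Whh, Why, bh, by)
print(h.shape, y.shape, y.sum())                   # (8,) (3,) ~1.0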

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 4 / 60


Detailed Look at the Hidden State Update

• The core of an RNN is the hidden state update.


• The hidden state at time t, ht , is calculated using the following equation:

ht = σ(Wxh xt + Whh ht−1 + bh )


• Breakdown:
• xt : Input at time t.
• ht−1 : Hidden state from the previous time step. This is the “memory” of the network.
• Wxh : Weight matrix connecting the input to the hidden state.
• Whh : Weight matrix connecting the previous hidden state to the current hidden state.
• bh : Bias vector for the hidden state.
• σ: Activation function (e.g., tanh, ReLU). This introduces non-linearity.
• The weights Wxh and Whh , and the bias bh are shared across all time steps.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 5 / 60


Input, Hidden State, and Output Dimensions

• Let’s consider the dimensions of the vectors and matrices:


• xt ∈ Rnx : Input vector of size nx .
• ht ∈ Rnh : Hidden state vector of size nh .
• yt ∈ Rny : Output vector of size ny .
• The weight matrices then have the following dimensions:
• Wxh ∈ Rnh ×nx : Maps the input to the hidden state space.
• Whh ∈ Rnh ×nh : Maps the previous hidden state to the current hidden state space.
• Why ∈ Rny ×nh : Maps the hidden state to the output space.
• The bias vectors have the following dimensions:
• bh ∈ Rnh : Bias for the hidden state.
• by ∈ Rny : Bias for the output.
• Understanding these dimensions is crucial for implementing and working with RNNs.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 6 / 60


Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 7 / 60
Many-to-Many RNNs: Sequence Input, Sequence Output

• Many-to-many RNNs are used when both the input and output are sequences of arbitrary length.
• Examples:
• Machine Translation (e.g., English sentence to French sentence)
• Video captioning (sequence of frames to a description)
• Part-of-speech tagging (sequence of words to sequence of tags)
• The network processes each input in the sequence and produces an output at each time step.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 8 / 60


Unrolling the Many-to-Many RNN

• To understand the computations, we “unroll” the RNN over time.


• This reveals the sequential processing of inputs and the generation of outputs.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 9 / 60


Computational Steps and Equations

• At each time step t:


1. The RNN receives input xt .
2. The hidden state is updated:

ht = σ(Wxh xt + Whh ht−1 + bh )

3. An output yt is produced:
yt = softmax(Why ht + by )
• Key Points:
• The same weight matrices (Wxh , Whh , Why ) are used at every time step. This is crucial for
handling variable-length sequences and greatly reduces the number of parameters.
• The hidden state ht carries information from previous time steps, enabling the network to
learn dependencies in the sequence.
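
As a rough illustration of this unrolled computation, the NumPy sketch below loops over the sequence, reuses the same weight matrices at every step, and emits one output per input; tanh and softmax and all sizes are illustrative assumptions, as before.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    """Unrolled many-to-many forward pass: one output per input time step."""
    h = np.zeros(Whh.shape[0])
    hs, ys = [], []
    for x_t in xs:                                  # same weights at every step
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)       # hidden state carries the memory
        ys.append(softmax(Why @ h + by))
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# Illustrative run on a length-5 sequence of 4-dimensional inputs
rng = np.random.default_rng(1)
nx, nh, ny, T = 4, 8, 3, 5
params = (rng.normal(size=(nh, nx)), rng.normal(size=(nh, nh)),
          rng.normal(size=(ny, nh)), np.zeros(nh), np.zeros(ny))
xs = rng.normal(size=(T, nx))
hs, ys = rnn_forward(xs, *params)
print(hs.shape, ys.shape)                           # (5, 8) (5, 3)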

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 10 / 60


Loss Calculation in Many-to-Many RNNs

• In a Many-to-Many RNN, we often compute a loss at each time step.


• This allows us to train the network to generate the correct output sequence.
• Let’s define:
• yt : The predicted output at time step t.
• ŷt : The target (or true) output at time step t.
• Lt : The loss function at time step t.
• The loss at each time step is calculated by comparing the predicted output yt with the
corresponding target output ŷt using a suitable loss function L.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 11 / 60


Loss Calculation in Many-to-Many RNNs

• Common loss functions:


• Cross-Entropy Loss: Typically used for classification tasks where the output is a probability
distribution over classes.
Lt = − Σ_{i=1}^{C} ŷt,i log(yt,i)

where C is the number of classes, ŷt,i is the true probability for class i at time t, and yt,i is
the predicted probability for class i at time t.
• Mean Squared Error (MSE): Used for regression tasks where the output is a continuous
value.
Lt = (1/n) Σ_{i=1}^{n} (ŷt,i − yt,i)²

where n is the dimension of the output vector.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 12 / 60


Loss Calculation in Many-to-Many RNNs

Total Loss
The total loss over the entire sequence is usually the sum of the individual time-step losses:
L = Σ_{t=1}^{T} Lt

or the average:
L = (1/T) Σ_{t=1}^{T} Lt
where T is the length of the sequence. This total loss is what is minimized during training using
backpropagation through time (BPTT).
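
A small illustrative sketch of the per-step cross-entropy and of the summed and averaged total loss; the toy probabilities and one-hot targets below are invented for the example, not taken from the slides.

import numpy as np

def cross_entropy(y_pred, y_true):
    """Per-step cross-entropy; y_true is a one-hot target, y_pred a probability vector."""
    return -np.sum(y_true * np.log(y_pred + 1e-12))   # small epsilon avoids log(0)

# Illustrative predicted and target sequences (T = 3 steps, C = 3 classes)
y_preds = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
y_trues = np.eye(3)                                   # targets: class 0, 1, 2

step_losses = [cross_entropy(p, t) for p, t in zip(y_preds, y_trues)]
total_loss = np.sum(step_losses)                      # L = sum over t of Lt
mean_loss = np.mean(step_losses)                      # or the average over T
print(step_losses, total_loss, mean_loss)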

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 13 / 60


Weight Sharing: Efficiency and Generalization

• Parameter Efficiency: Sharing weights significantly reduces the number of parameters the model
needs to learn. This is especially important when dealing with long sequences. Imagine if each
time step had its own set of weights; the number of parameters would explode.
• Generalization: Weight sharing allows the model to generalize across different positions in the
sequence. The model learns features that are useful regardless of where they appear in the input.
For example, in language modeling, the model learns grammatical rules that apply throughout a
sentence, not just at specific word positions.
• This is a key characteristic that distinguishes RNNs from other neural network architectures.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 14 / 60


Many-to-One RNNs: Sequence Input, Single Output

• Many-to-one RNNs process a sequence of inputs and produce a single output.


• They are useful for tasks where the entire input sequence needs to be summarized or classified.
• Examples:
• Sentiment analysis (a sentence to a sentiment label: positive, negative, neutral)
• Document classification (a document to a category)
• DNA sequence classification (a DNA sequence to a type of disease)
• The network reads the entire input sequence, updates its hidden state at each step, and uses the
final hidden state to generate the output.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 15 / 60


Unrolling the Many-to-One RNN
• Unrolling the RNN reveals the sequential processing of the input and the final output generation.
• The hidden state carries information from all previous inputs.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 16 / 60


Computational Steps and Equations
• The RNN processes the input sequence step by step:
1. For each time step t, the hidden state is updated:

ht = σ(Wxh xt + Whh ht−1 + bh )

2. After processing the entire sequence, the final hidden state hT (where T is the length of the
sequence) is used to compute the output:

y = g (Why hT + by )

where g is an appropriate output activation function (e.g., softmax for multi-class classification,
sigmoid for binary classification).
• Key Aspects:
• The same weight matrices (Wxh , Whh , Why ) are used at each time step.
• The final hidden state hT summarizes the information from the entire input sequence.
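
A minimal NumPy sketch of this many-to-one pattern, assuming a softmax classifier applied to hT; the sequence length, layer sizes, and random weights are illustrative only.

import numpy as np

def rnn_classify(xs, Wxh, Whh, Why, bh, by):
    """Many-to-one: run through the sequence, classify from the final hidden state hT."""
    h = np.zeros(Whh.shape[0])
    for x_t in xs:
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)   # only the hidden state is carried forward
    logits = Why @ h + by                       # single output computed from hT
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(2)
nx, nh, ny = 4, 8, 2                            # e.g., two sentiment classes
xs = rng.normal(size=(6, nx))                   # a length-6 input sequence
probs = rnn_classify(xs,
                     rng.normal(size=(nh, nx)), rng.normal(size=(nh, nh)),
                     rng.normal(size=(ny, nh)), np.zeros(nh), np.zeros(ny))
print(probs)                                    # class probabilities, sums to 1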

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 17 / 60


The Importance of the Final Hidden State

• In many-to-one RNNs, the final hidden state hT is crucial.


• It acts as a compressed representation of the entire input sequence.
• This representation is then used to make a prediction or classification.
• For example, in sentiment analysis, hT captures the overall sentiment expressed in the sentence.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 18 / 60


One-to-Many RNNs: Single Input, Sequence Output

• One-to-many RNNs take a single input and generate a sequence as output.


• They are suitable for tasks like:
• Image captioning (an image to a descriptive sentence)
• Music generation (a starting note or style to a melody)
• Generating text from a single keyword
• The input is typically used to initialize the hidden state, and then the RNN generates the output
sequence step by step.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 19 / 60


Unrolling the One-to-Many RNN

• The unrolled graph visualizes the sequential output generation.


• The initial input influences the entire generated sequence.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 20 / 60


Computational Steps and Equations
• The process unfolds as follows:
1. The initial hidden state h0 is often a function of the input x:
h0 = f (Wxh x + bh )
where f can be a linear transformation or a more complex function. In simpler cases, h0 can be
initialized to zero.
2. For each time step t (starting from t = 1):
2.1 The hidden state is updated:
ht = σ(Whh ht−1 + bh )
Note: no xt here since the input is only used to initialize h0 .
2.2 An output yt is generated:
yt = g (Why ht + by )
where g is an appropriate output activation function (e.g., softmax, linear).
• Key Points:
• Weight matrices (Whh , Why ) are shared across time steps.
• The initial input x influences the entire output sequence through h0 .
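
The sketch below illustrates this one-to-many pattern in NumPy under the simplification on this slide (the input enters only through h0, tanh activations, a linear output per step); all names and sizes are illustrative.

import numpy as np

def one_to_many(x, T, Wxh, Whh, Why, bh, by):
    """Generate T outputs from a single input; x only enters through h0."""
    h = np.tanh(Wxh @ x + bh)                 # h0 initialised from the input
    ys = []
    for _ in range(T):
        h = np.tanh(Whh @ h + bh)             # no x_t term after the first step
        ys.append(Why @ h + by)               # e.g., a linear output at each step
    return np.stack(ys)

rng = np.random.default_rng(3)
nx, nh, ny = 16, 8, 5                         # e.g., an image feature vector of size 16
x = rng.normal(size=nx)
seq = one_to_many(x, T=4,
                  Wxh=rng.normal(size=(nh, nx)), Whh=rng.normal(size=(nh, nh)),
                  Why=rng.normal(size=(ny, nh)), bh=np.zeros(nh), by=np.zeros(ny))
print(seq.shape)                              # (4, 5): four generated outputs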
Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 21 / 60
Example: Image Captioning

• In image captioning:
• The input x could be a feature vector extracted from an image using a Convolutional Neural
Network (CNN).
• The RNN then generates a sequence of words (the caption) based on this image feature
vector.
• The initial hidden state h0 is initialized based on the image features, setting the context for
the caption generation.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 22 / 60


Example: Image Captioning

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 23 / 60


Example: Video Captioning

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 24 / 60


Example: Video Captioning

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 25 / 60


Example: Video Frame-based Action Recognition

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 26 / 60


Vanishing/Exploding Gradients

• Vanishing gradients: Gradients become very small, hindering learning of long-term dependencies.
• Exploding gradients: Gradients become very large, leading to unstable training.
• Solutions:
• Gradient clipping
• Gated architectures (LSTMs, GRUs)
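
As an illustration of the first remedy, a common form of gradient clipping rescales the whole gradient by its global L2 norm; the threshold of 5.0 below is an arbitrary illustrative choice.

import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale gradients so their global L2 norm never exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# Illustrative: one "exploding" gradient is scaled down, directions are preserved
grads = [np.array([300.0, -400.0]), np.array([0.1, 0.2])]
clipped = clip_gradients(grads, max_norm=5.0)
print([np.linalg.norm(g) for g in clipped])   # global norm is now at most 5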

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 27 / 60


LSTM Core Idea: The Cell State
• LSTMs introduce the concept of a cell state (Ct ), which acts as a “highway” for information to
flow through the network relatively unchanged.
• This allows information to be preserved over many time steps.
• Gates regulate the flow of information into and out of the cell state.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 28 / 60


LSTM Gates: Controlling Information Flow

• LSTMs have three main gates:


• Forget Gate (ft ): Determines what information to discard from the cell state.
• Input Gate (it ): Determines what new information to store in the cell state.
• Output Gate (ot ): Determines what information from the cell state to output.
• Each gate is a sigmoid layer, outputting values between 0 and 1.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 29 / 60


LSTM Equations: A Step-by-Step Breakdown
• Let’s define the key components:
• xt : Input at time t.
• ht−1 : Hidden state at the previous time step.
• Ct−1 : Cell state at the previous time step.
• ht : Hidden state at time t.
• Ct : Cell state at time t.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 30 / 60


LSTM Equations: Forget Gate

• Forget Gate: Decides what information to throw away from the cell state.

ft = σ(Wf [ht−1 , xt ] + bf )

where:
• σ: Sigmoid function.
• Wf : Weight matrix for the forget gate.
• [ht−1 , xt ]: Concatenation of ht−1 and xt .
• bf : Bias for the forget gate.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 31 / 60


LSTM Equations: Input Gate

• Input Gate: Decides what new information to store in the cell state.

it = σ(Wi [ht−1 , xt ] + bi )

• Candidate Cell State: Creates a vector of new candidate values.

C̃t = tanh(WC [ht−1 , xt ] + bC )

• Where:
• Wi : Weight matrix for the input gate.
• WC : Weight matrix for the candidate cell state.
• bi , bC : Biases.
• tanh: Hyperbolic tangent function.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 32 / 60


LSTM Equations: Cell State Update

• Cell State Update: Combines the forget gate, previous cell state, input gate, and
candidate cell state.
Ct = ft ⊙ Ct−1 + it ⊙ C̃t
where ⊙ represents element-wise multiplication.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 33 / 60


LSTM Equations: Output Gate and Hidden State

• Output Gate: Decides what parts of the cell state to output.

ot = σ(Wo [ht−1 , xt ] + bo )

• Hidden State:
ht = ot ⊙ tanh(Ct )
where:
• Wo : Weight matrix for the output gate.
• bo : Bias for the output gate.
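
Putting the four equations together, one LSTM step can be sketched as follows in NumPy; the weight shapes follow the concatenated-input convention [ht−1 , xt ] used on these slides, and the sizes and random initialisation are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step following the slide equations; each W acts on [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])      # concatenation [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)               # forget gate
    i = sigmoid(Wi @ z + bi)               # input gate
    c_tilde = np.tanh(Wc @ z + bc)         # candidate cell state
    c = f * c_prev + i * c_tilde           # cell state update (element-wise)
    o = sigmoid(Wo @ z + bo)               # output gate
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(4)
nx, nh = 4, 8
Ws = [rng.normal(size=(nh, nh + nx)) * 0.1 for _ in range(4)]
bs = [np.zeros(nh) for _ in range(4)]
h, c = np.zeros(nh), np.zeros(nh)
h, c = lstm_step(rng.normal(size=nx), h, c, *Ws, *bs)
print(h.shape, c.shape)                    # (8,) (8,)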

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 34 / 60


Gated Recurrent Unit (GRU): Simplification of LSTMs
• GRUs are a simplified version of LSTMs, designed to address the vanishing gradient problem with
fewer parameters.
• They achieve comparable performance to LSTMs in many tasks but are computationally more
efficient.
• GRUs combine the cell state and hidden state into a single hidden state.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 35 / 60


GRU Gates: Reset and Update

• GRUs have two gates:


• Update Gate (zt ): Determines how much of the previous hidden state to keep and how
much of the new candidate hidden state to incorporate.
• Reset Gate (rt ): Determines how much of the previous hidden state to ignore.
• Both gates are sigmoid layers, outputting values between 0 and 1.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 36 / 60


GRU Equations: A Step-by-Step Breakdown
• Let’s define the key components:
• xt : Input at time t.
• ht−1 : Hidden state at the previous time step.
• ht : Hidden state at time t.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 37 / 60


GRU Equations: Update Gate
• Update Gate: Controls how much of the previous hidden state is retained.
zt = σ(Wz [ht−1 , xt ] + bz )
where:
• σ: Sigmoid function.
• Wz : Weight matrix for the update gate.
• [ht−1 , xt ]: Concatenation of ht−1 and xt .
• bz : Bias for the update gate.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 38 / 60


GRU Equations: Reset Gate

• Reset Gate: Controls how much of the previous hidden state is used to compute the candidate
hidden state.
rt = σ(Wr [ht−1 , xt ] + br )
where:
• Wr : Weight matrix for the reset gate.
• br : Bias for the reset gate.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 39 / 60


GRU Equations: Candidate Hidden State

• Candidate Hidden State: Computes a new hidden state based on the current input and the
(potentially reset) previous hidden state.

h̃t = tanh(W [rt ⊙ ht−1 , xt ] + b)

where:
• tanh: Hyperbolic tangent function.
• W : Weight matrix for the candidate hidden state.
• rt ⊙ ht−1 : Element-wise multiplication of the reset gate and the previous hidden state.
• b: Bias for the candidate hidden state.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 40 / 60


GRU Equations: Final Hidden State Update
• Final Hidden State: Combines the previous hidden state and the candidate hidden state based
on the update gate.
ht = (1 − zt ) ⊙ ht−1 + zt ⊙ h̃t
• This is a weighted average between the previous hidden state and the candidate hidden state.
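
Combining the update gate, reset gate, and candidate state gives a single GRU step; the NumPy sketch below follows the slide equations, with illustrative sizes and random weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, W, bz, br, b):
    """One GRU step following the slide equations; weights act on [h_{t-1}, x_t]."""
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ z_in + bz)                                   # update gate
    r = sigmoid(Wr @ z_in + br)                                   # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]) + b)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                       # interpolate old and new state

rng = np.random.default_rng(5)
nx, nh = 4, 8
Wz, Wr, W = (rng.normal(size=(nh, nh + nx)) * 0.1 for _ in range(3))
h = np.zeros(nh)
h = gru_step(rng.normal(size=nx), h, Wz, Wr, W, np.zeros(nh), np.zeros(nh), np.zeros(nh))
print(h.shape)                                                    # (8,)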

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 41 / 60


LSTM and GRU: Video Explanation

Illustrated Guide to LSTM’s and GRU’s: A step by step explanation

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 42 / 60


LSTM and GRU: Video Explanation

Long Short-Term Memory for NLP

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 43 / 60


Applications of LSTMs: Natural Language Processing (NLP)
• Machine Translation: LSTMs are used in sequence-to-sequence models to translate text from
one language to another. They excel at capturing long-range dependencies in sentences.
• Text Generation: LSTMs can generate text, such as poems, scripts, and articles, by learning
patterns in existing text.
• Sentiment Analysis: LSTMs can analyze text to determine the sentiment expressed (positive,
negative, neutral). They can capture contextual information that is crucial for accurate sentiment
detection.
• Speech Recognition: LSTMs are used to transcribe spoken language into text. They can handle
the temporal nature of speech signals effectively.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 44 / 60


Applications of LSTMs: Time Series Analysis

• Stock Market Prediction: LSTMs can be used to predict stock prices based on historical data.
They can capture temporal patterns and trends in financial data.
• Weather Forecasting: LSTMs can forecast weather conditions based on historical weather data.
• Healthcare: LSTMs can analyze medical time series data, such as ECG or EEG signals, for
disease detection and prediction.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 45 / 60


Advantages of LSTMs Compared to Other Methods

• Handling Long-Term Dependencies: LSTMs are specifically designed to address the vanishing
gradient problem, enabling them to capture long-range dependencies in sequential data, which
traditional RNNs struggle with.
• Superior Performance in Sequence Tasks: Compared to traditional machine learning methods
like Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs), LSTMs often achieve
better performance in tasks involving sequential data, especially when long-range dependencies are
important.
• Flexibility in Input and Output: LSTMs can handle variable-length input and output sequences,
making them suitable for a wide range of tasks.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 46 / 60


Limitations of LSTMs

• Computational Cost: LSTMs are computationally more expensive than simpler RNNs due to the
multiple gates and complex computations within each cell.
• Difficulty in Parallelization: The sequential nature of LSTMs makes it difficult to parallelize
computations, which can limit training speed on large datasets.
• Still Sensitive to Hyperparameters: LSTMs still require careful tuning of hyperparameters, such
as the number of layers, hidden units, and learning rate.
• Limited Long Context in Very Long Sequences: While they mitigate the vanishing gradient
problem, extremely long sequences can still pose challenges for LSTMs to capture dependencies
across the entire sequence.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 47 / 60


Future Research Directions

• More Efficient Architectures: Research is ongoing to develop more efficient RNN architectures,
such as GRUs or other novel gating mechanisms, that can achieve similar performance to LSTMs
with reduced computational cost.
• Attention Mechanisms: Integrating attention mechanisms with LSTMs allows the model to
focus on relevant parts of the input sequence, further improving performance on tasks with long
sequences.
• Transformer Networks: Transformer networks, which rely on attention mechanisms and do not
have the inherent sequential limitations of RNNs, have shown great success in NLP and other
sequence tasks and are a major area of research. However, RNNs and LSTMs are still relevant in
many contexts.
• Combining with other architectures: LSTMs can be combined with Convolutional Neural Networks
(CNNs) or other architectures for multimodal tasks or to capture different types of features.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 48 / 60


The Rise of Transformers: Overcoming RNN Limitations

• RNNs, while effective for sequential data, suffer from limitations:
• Vanishing/exploding gradients, hindering
long-range dependency learning.
• Sequential computation, limiting
parallelization.
• Transformers address these limitations
using attention mechanisms.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 49 / 60


Key Idea: Attention
• Attention allows the model to focus on different parts of the input sequence when
processing each element.
• Unlike RNNs, which process sequentially, attention allows for parallel computation.
• This leads to significant speed improvements, especially on long sequences.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 50 / 60


Scaled Dot-Product Attention
• Scaled Dot-Product Attention is the core attention mechanism used in Transformers.
• It takes three inputs: Queries (Q), Keys (K), and Values (V).

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 51 / 60


Scaled Dot-Product Attention: The Calculation

• The attention is calculated as follows:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V
where:
• Q: Queries matrix.
• K : Keys matrix.
• V : Values matrix.
• dk : Dimension of the keys. Scaling by √dk keeps the dot products from growing too large, which would otherwise push the softmax into regions where gradients become very small.
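
A direct NumPy transcription of this formula, with illustrative shapes and a row-wise softmax over the key dimension:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(6)
Q = rng.normal(size=(3, 8))     # 3 query positions, d_k = 8
K = rng.normal(size=(5, 8))     # 5 key positions
V = rng.normal(size=(5, 16))    # one value vector per key, d_v = 16
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)       # (3, 16) (3, 5); each weight row sums to 1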

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 52 / 60


Multi-Head Attention

• Multi-Head Attention runs the scaled dot-product attention multiple times in parallel.
• This allows the model to attend to
information from different representation
subspaces at different positions.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 53 / 60


Multi-Head Attention: The Process

• Each “head” computes:

headi = Attention(QWiQ , KWiK , VWiV )

• The outputs are concatenated and linearly transformed:

MultiHead(Q, K , V ) = Concat(head1 , ..., headh )W O

• Where WiQ , WiK , WiV , and W O are weight matrices.
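
A compact NumPy sketch of this process, with one projection triple per head and illustrative dimensions (d_model = 16, h = 4 heads, d_k = 4); the attention helper simply repeats the scaled dot-product formula from the previous slide.

import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (same formula as the previous sketch)."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """One projection triple (W_i^Q, W_i^K, W_i^V) per head; concatenate, then apply W^O."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(7)
d_model, h, d_k = 16, 4, 4                      # illustrative: d_k = d_model / h
Q = K = V = rng.normal(size=(5, d_model))       # self-attention over 5 tokens
WQ = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WO = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(Q, K, V, WQ, WK, WV, WO).shape)   # (5, 16)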

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 54 / 60


Transformer Architecture: Encoder
• The encoder consists of multiple identical layers.
• Each layer contains:
• Multi-Head Attention layer.
• Feed-Forward Network (FFN).
• Residual connections and layer normalization are applied around each sub-layer.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 55 / 60


Transformer Architecture: Decoder
• The decoder also consists of multiple identical layers.
• Each layer contains:
• Masked Multi-Head Attention layer (prevents attending to future tokens).
• Multi-Head Attention layer (over the encoder output).
• Feed-Forward Network (FFN).
• Residual connections and layer normalization are also applied.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 56 / 60


Positional Encoding
• Since Transformers process inputs in parallel, they lack inherent information about the position of
words in the sequence.
• Positional encodings are added to the input embeddings to provide this information.

• Positional encodings are calculated using sine and cosine functions:


PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))
where:
• pos: Position in the sequence.
• i: Dimension.
• dmodel : Dimension of the embeddings.
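
A vectorized NumPy sketch of these encodings; max_len and d_model are illustrative, and an even embedding dimension is assumed.

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)       # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                            # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)          # (50, 16); added element-wise to the input embeddings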

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 57 / 60


Applications of Transformers

• Natural Language Processing (NLP): Machine translation, text summarization, question answering, text generation.
• Computer Vision: Image classification, object detection, image generation.
• Speech Recognition: Speech-to-text.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 58 / 60


Advantages of Transformers

• Parallel Computation: Enables faster training, especially on long sequences.


• Effective Long-Range Dependency Learning: Attention mechanisms allow the model to
capture relationships between distant words in a sequence.
• State-of-the-art Performance: Transformers have achieved state-of-the-art results on many
NLP and other sequence-based tasks.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 59 / 60


Conclusion

• RNNs are powerful tools for processing sequential data.


• LSTMs and GRUs address the vanishing gradient problem.
• They have numerous applications in various fields.
• Transformers have revolutionized sequence modeling by introducing attention mechanisms.
• Their ability to handle long-range dependencies and parallelize computations has led to significant
advancements in various fields.

Thien Huynh-The - HCMUTE Recurrent Neural Networks April 17, 2025 60 / 60
