AI Foundation Application-RNN
Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education
• Output calculation:
yt = softmax(Why ht + by )
where:
• xt : Input at time t
• ht : Hidden state at time t
• yt : Output at time t
• Wxh , Whh , Why : Weight matrices
• bh , by : Bias vectors
• σ: Activation function (e.g., tanh, ReLU)
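As a concrete illustration, here is a minimal NumPy sketch of one forward step of this recurrence. The dimensions, random initialization, and helper names (rnn_step, softmax) are illustrative assumptions, not part of the slides.

import numpy as np

# Illustrative sizes only; not taken from the slides.
input_dim, hidden_dim, output_dim = 4, 8, 3

rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
Whh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
Why = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden-to-output weights
bh = np.zeros(hidden_dim)
by = np.zeros(output_dim)

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, h_prev):
    # h_t = sigma(Wxh x_t + Whh h_{t-1} + bh), with sigma = tanh here
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    # y_t = softmax(Why h_t + by)
    y_t = softmax(Why @ h_t + by)
    return h_t, y_t

h_t, y_t = rnn_step(rng.normal(size=input_dim), np.zeros(hidden_dim))
print(y_t.sum())               # the output is a probability vector (sums to 1)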
• Many-to-many RNNs are used when both the input and output are sequences of arbitrary length.
• Examples:
• Machine Translation (e.g., English sentence to French sentence)
• Video captioning (sequence of frames to a description)
• Part-of-speech tagging (sequence of words to sequence of tags)
• The network processes each input in the sequence and produces an output at each time step.
3. An output yt is produced:
yt = softmax(Why ht + by )
• Key Points:
• The same weight matrices (Wxh , Whh , Why ) are used at every time step. This is crucial for
handling variable-length sequences and greatly reduces the number of parameters.
• The hidden state ht carries information from previous time steps, enabling the network to
learn dependencies in the sequence.
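Continuing the NumPy sketch above (reusing rnn_step and the shared weight matrices defined there), the loop below unrolls the recurrence over a whole input sequence and emits one output per time step; the sequence length of 5 is arbitrary.

def rnn_many_to_many(xs, h0=None):
    # The same Wxh, Whh, Why, bh, by are reused at every step, which is
    # what lets one set of parameters handle sequences of any length.
    h_t = np.zeros(hidden_dim) if h0 is None else h0
    ys = []
    for x_t in xs:                       # xs: sequence of input vectors
        h_t, y_t = rnn_step(x_t, h_t)    # one output per time step
        ys.append(y_t)
    return np.stack(ys), h_t

ys, h_last = rnn_many_to_many(rng.normal(size=(5, input_dim)))  # toy sequence, T = 5
print(ys.shape)                          # (5, output_dim): an output at every step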
where C is the number of classes, ŷt,i is the true probability for class i at time t, and yt,i is
the predicted probability for class i at time t.
• Mean Squared Error (MSE): Used for regression tasks where the output is a continuous
value.
L_t = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{t,i} - y_{t,i})^2
Total Loss
The total loss over the entire sequence is usually the sum of the individual time-step losses:
L = \sum_{t=1}^{T} L_t
or the average:
L = \frac{1}{T} \sum_{t=1}^{T} L_t
where T is the length of the sequence. This total loss is what is minimized during training using
backpropagation through time (BPTT).
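A small self-contained sketch of how the per-step losses combine into the sequence loss. The random predictions and one-hot targets are placeholders, and the per-step cross-entropy L_t = -Σ_i ŷ_{t,i} log y_{t,i} follows the definitions above; either the sum or the average would be minimized with BPTT.

import numpy as np

# Toy sequence of length T = 3 with C = 4 classes; values are placeholders.
T, C = 3, 4
rng = np.random.default_rng(1)
y_pred = rng.random(size=(T, C))
y_pred /= y_pred.sum(axis=1, keepdims=True)      # predicted probabilities y_{t,i}
y_true = np.eye(C)[rng.integers(0, C, size=T)]   # true probabilities (one-hot)

step_losses = -(y_true * np.log(y_pred)).sum(axis=1)  # per-step cross-entropy L_t

total_loss = step_losses.sum()     # L = sum over t of L_t
average_loss = step_losses.mean()  # L = (1/T) * sum over t of L_t
print(total_loss, average_loss)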
• Parameter Efficiency: Sharing weights significantly reduces the number of parameters the model
needs to learn. This is especially important when dealing with long sequences. Imagine if each
time step had its own set of weights; the number of parameters would explode.
• Generalization: Weight sharing allows the model to generalize across different positions in the
sequence. The model learns features that are useful regardless of where they appear in the input.
For example, in language modeling, the model learns grammatical rules that apply throughout a
sentence, not just at specific word positions.
• This is a key characteristic that distinguishes RNNs from other neural network architectures.
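To make the parameter-efficiency point concrete, the arithmetic below compares a shared-weight RNN with a hypothetical variant that has its own weights at every time step; the layer sizes are illustrative assumptions.

# Back-of-the-envelope parameter count showing why weight sharing keeps the
# model size independent of sequence length.
n_in, n_hidden, n_out = 100, 256, 50
T = 1000                               # sequence length

shared = (n_hidden * n_in              # Wxh
          + n_hidden * n_hidden        # Whh
          + n_out * n_hidden           # Why
          + n_hidden + n_out)          # bh, by

per_step_weights = T * shared          # hypothetical: a separate weight set per time step
print(shared, per_step_weights)        # roughly 1e5 vs. 1e8 parameters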
2. After processing the entire sequence, the final hidden state hT (where T is the length of the
sequence) is used to compute the output:
y = g (Why hT + by )
where g is an appropriate output activation function (e.g., softmax for multi-class
classification, sigmoid for binary classification).
• Key Aspects:
• The same weight matrices (Wxh , Whh , Why ) are used at each time step.
• The final hidden state hT summarizes the information from the entire input sequence.
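For comparison with the many-to-many case, here is a many-to-one sketch in the same NumPy setup as above (reusing the earlier weights): the recurrence runs over the full sequence, but only the final hidden state h_T is passed through the output layer.

def rnn_many_to_one(xs):
    # Only the hidden state is carried across steps; a single output
    # is read from h_T at the end.
    h_t = np.zeros(hidden_dim)
    for x_t in xs:
        h_t = np.tanh(Wxh @ x_t + Whh @ h_t + bh)
    return softmax(Why @ h_t + by)       # g = softmax applied to h_T

y = rnn_many_to_one(rng.normal(size=(7, input_dim)))   # e.g. sequence classification
print(y.shape)                                          # (output_dim,)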
• In image captioning:
• The input x could be a feature vector extracted from an image using a Convolutional Neural
Network (CNN).
• The RNN then generates a sequence of words (the caption) based on this image feature
vector.
• The initial hidden state h0 is initialized based on the image features, setting the context for
the caption generation.
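The sketch below mirrors that setup under heavy simplification: the "CNN features" are just a random vector, and W_init, W_vocab, and the greedy decoding loop are hypothetical names introduced only for illustration, reusing the recurrence weights from the earlier sketch.

# Rough sketch: image features set h_0, then the RNN emits word IDs step by step.
feature_dim, vocab_size = 16, 10
W_init = rng.normal(scale=0.1, size=(hidden_dim, feature_dim))   # image features -> h_0
W_vocab = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))   # hidden state -> word scores

image_features = rng.normal(size=feature_dim)   # stand-in for CNN output
h_t = np.tanh(W_init @ image_features)          # h_0 initialized from the image

caption_ids = []
x_t = np.zeros(input_dim)                       # stand-in for a start-token embedding
for _ in range(5):                              # generate a 5-word toy caption
    h_t = np.tanh(Wxh @ x_t + Whh @ h_t + bh)
    word_id = int(softmax(W_vocab @ h_t).argmax())   # greedy word choice
    caption_ids.append(word_id)
    x_t = rng.normal(size=input_dim)            # stand-in for the chosen word's embedding
print(caption_ids)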
• Forget Gate: Decides what information to throw away from the cell state.
ft = σ(Wf [ht−1 , xt ] + bf )
where:
• σ: Sigmoid function.
• Wf : Weight matrix for the forget gate.
• [ht−1 , xt ]: Concatenation of ht−1 and xt .
• bf : Bias for the forget gate.
• Input Gate: Decides what new information to store in the cell state.
it = σ(Wi [ht−1 , xt ] + bi )
• Candidate Cell State: Creates a vector of new candidate values that could be added to the cell state.
C̃t = tanh(WC [ht−1 , xt ] + bC )
• Where:
• Wi : Weight matrix for the input gate.
• WC : Weight matrix for the candidate cell state.
• bi , bC : Biases.
• tanh: Hyperbolic tangent function.
• Cell State Update: Combines the forget gate, previous cell state, input gate, and
candidate cell state.
Ct = ft ⊙ Ct−1 + it ⊙ C̃t
where ⊙ represents element-wise multiplication.
• Output Gate: Decides what parts of the cell state to output.
ot = σ(Wo [ht−1 , xt ] + bo )
• Hidden State:
ht = ot ⊙ tanh(Ct )
where:
• Wo : Weight matrix for the output gate.
• bo : Bias for the output gate.
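A self-contained NumPy sketch of one LSTM cell step following the gate equations above; the sizes and random weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 4, 6
concat_dim = h_dim + x_dim                    # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate / candidate, as in the equations above.
Wf, Wi, WC, Wo = (rng.normal(scale=0.1, size=(h_dim, concat_dim)) for _ in range(4))
bf = bi = bC = bo = np.zeros(h_dim)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)                # forget gate
    i_t = sigmoid(Wi @ z + bi)                # input gate
    C_tilde = np.tanh(WC @ z + bC)            # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde        # cell state update (element-wise)
    o_t = sigmoid(Wo @ z + bo)                # output gate
    h_t = o_t * np.tanh(C_t)                  # hidden state
    return h_t, C_t

h_t, C_t = lstm_step(rng.normal(size=x_dim), np.zeros(h_dim), np.zeros(h_dim))
print(h_t.shape, C_t.shape)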
• Reset Gate: Controls how much of the previous hidden state is used to compute the candidate
hidden state.
rt = σ(Wr [ht−1 , xt ] + br )
where:
• Wr : Weight matrix for the reset gate.
• br : Bias for the reset gate.
• Candidate Hidden State: Computes a new hidden state based on the current input and the
(potentially reset) previous hidden state.
h̃t = tanh(W [rt ⊙ ht−1 , xt ] + b)
where:
• tanh: Hyperbolic tangent function.
• W : Weight matrix for the candidate hidden state.
• rt ⊙ ht−1 : Element-wise multiplication of the reset gate and the previous hidden state.
• b: Bias for the candidate hidden state.
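A short NumPy sketch of the two GRU pieces described above (reset gate and candidate hidden state); the sizes and weights are illustrative. The full GRU also uses an update gate to blend the candidate with the previous hidden state, which is not shown here.

import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 4, 6
Wr = rng.normal(scale=0.1, size=(h_dim, h_dim + x_dim))   # reset-gate weights
W = rng.normal(scale=0.1, size=(h_dim, h_dim + x_dim))    # candidate weights
br, b = np.zeros(h_dim), np.zeros(h_dim)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_t, h_prev = rng.normal(size=x_dim), rng.normal(size=h_dim)

r_t = sigmoid(Wr @ np.concatenate([h_prev, x_t]) + br)          # reset gate
h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]) + b)  # candidate hidden state
print(h_tilde.shape)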
• Stock Market Prediction: LSTMs can be used to predict stock prices based on historical data.
They can capture temporal patterns and trends in financial data.
• Weather Forecasting: LSTMs can forecast weather conditions based on historical weather data.
• Healthcare: LSTMs can analyze medical time series data, such as ECG or EEG signals, for
disease detection and prediction.
• Handling Long-Term Dependencies: LSTMs are specifically designed to address the vanishing
gradient problem, enabling them to capture long-range dependencies in sequential data, which
traditional RNNs struggle with.
• Superior Performance in Sequence Tasks: Compared to traditional machine learning methods
like Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs), LSTMs often achieve
better performance in tasks involving sequential data, especially when long-range dependencies are
important.
• Flexibility in Input and Output: LSTMs can handle variable-length input and output sequences,
making them suitable for a wide range of tasks.
• Computational Cost: LSTMs are computationally more expensive than simpler RNNs due to the
multiple gates and complex computations within each cell.
• Difficulty in Parallelization: The sequential nature of LSTMs makes it difficult to parallelize
computations, which can limit training speed on large datasets.
• Still Sensitive to Hyperparameters: LSTMs still require careful tuning of hyperparameters, such
as the number of layers, hidden units, and learning rate.
• Limited Context in Very Long Sequences: Although LSTMs mitigate the vanishing gradient
problem, they can still struggle to capture dependencies that span extremely long sequences.
• More Efficient Architectures: Research is ongoing to develop more efficient RNN architectures,
such as GRUs or other novel gating mechanisms, that can achieve similar performance to LSTMs
with reduced computational cost.
• Attention Mechanisms: Integrating attention mechanisms with LSTMs allows the model to
focus on relevant parts of the input sequence, further improving performance on tasks with long
sequences.
• Transformer Networks: Transformer networks, which rely on attention mechanisms and do not
have the inherent sequential limitations of RNNs, have shown great success in NLP and other
sequence tasks and are a major area of research. However, RNNs and LSTMs are still relevant in
many contexts.
• Combining with other architectures: LSTMs can be combined with Convolutional Neural Networks
(CNNs) or other architectures for multimodal tasks or to capture different types of features.
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V
where:
• Q: Queries matrix.
• K : Keys matrix.
• V : Values matrix.
• dk : Dimension of the keys. Scaling by √dk prevents gradients from becoming too small.
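A self-contained NumPy sketch of scaled dot-product attention for a single sequence of length 4; d_k = d_v = 8 and the random Q, K, V matrices are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n, d_k))    # queries
K = rng.normal(size=(n, d_k))    # keys
V = rng.normal(size=(n, d_v))    # values

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T / np.sqrt(d_k)     # (n, n) similarity scores, scaled by sqrt(d_k)
weights = softmax(scores, axis=-1)  # each row is a distribution over the keys
output = weights @ V                # (n, d_v) weighted sum of the values
print(output.shape)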