Module 4
2. Sequence-to-Vector (top-right):
● Takes a sequence but only uses final output
● Example: Sentiment analysis of movie reviews, where words are the input
sequence and the output is a single sentiment score
● Good for classification/scoring of sequential data (see the sketch after this list)
3. Vector-to-Sequence (bottom-left):
● Takes a single vector repeatedly as input and produces a sequence
● Example: Image captioning, where a CNN-processed image is input and
the output is a sequence of words describing it
● Useful when generating sequential content from a fixed input
4. Encoder-Decoder (bottom-right):
● Combines sequence-to-vector (encoder) with vector-to-sequence
(decoder)
● Example: Language translation, where input sentence is encoded to a
vector, then decoded to target language
● Better than direct sequence-to-sequence for translation because it can
consider entire input context before generating output
● More complex implementation than the diagram suggests (covered in
Chapter 16)
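A minimal Keras sketch of the middle two patterns (illustrative only, not taken from the chapter; RepeatVector is one common way to feed a fixed vector at every time step):

import tensorflow as tf

# Sequence-to-vector: the recurrent layer returns only its final output
# (return_sequences=False is the default), which a Dense layer turns into
# a single score, e.g. a sentiment score.
seq_to_vector = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=[None, 1]),  # -> [batch, 32]
    tf.keras.layers.Dense(1)
])

# Vector-to-sequence: the same input vector is repeated at each of the 10
# output time steps, and the recurrent layer returns a full sequence.
vector_to_seq = tf.keras.Sequential([
    tf.keras.layers.RepeatVector(10, input_shape=[32]),    # -> [batch, 10, 32]
    tf.keras.layers.SimpleRNN(32, return_sequences=True),  # -> [batch, 10, 32]
    tf.keras.layers.Dense(1)                               # one output per step
])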
Training RNNs
1. Basic Concept:
● BPTT involves unrolling the RNN through time
● Uses regular backpropagation principles on the unrolled network
● Consists of forward pass followed by backward pass
2. Forward Pass:
● Network processes the input sequence from start to finish
● Represented by dashed arrows in Figure 15-5
● Generates predictions Ŷ(0) through Ŷ(T) for each timestep
3. Loss Function:
● Computed from the outputs Ŷ(0) through Ŷ(T); depending on the task it may use all of them or only some (e.g., only the last few)
4. Backward Pass:
● Gradients of the loss flow backward through the unrolled network, and only through the outputs the loss actually uses
5. Parameter Updates:
● Since the same weights and biases are shared across all time steps, their gradients are summed over time and a single gradient descent step updates them (see the sketch after this list)
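A toy sketch of the idea (not from the chapter; the scalar weights, the 4-step sequence, and the squared-error loss are all assumptions): the forward pass unrolls one recurrent unit through time, the loss here uses only the final output, and tf.GradientTape backpropagates through the unrolled graph, summing the gradients of the shared parameters across time steps.

import tensorflow as tf

Wx = tf.Variable(0.5)   # input weight (scalar for simplicity)
Wh = tf.Variable(0.1)   # recurrent weight
b = tf.Variable(0.0)    # bias

x = tf.constant([0.0, 1.0, 2.0, 3.0])  # toy input sequence
y_true = tf.constant(4.0)              # toy target for the final output

with tf.GradientTape() as tape:
    h = tf.constant(0.0)               # initial state h(init) = 0
    for t in range(4):                 # forward pass through time
        h = tf.tanh(Wx * x[t] + Wh * h + b)
    loss = (y_true - h) ** 2           # loss uses only the final output here

# backward pass through the unrolled graph; gradients of the shared
# parameters are summed over all time steps
grads = tape.gradient(loss, [Wx, Wh, b])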
Preparing the data with tf.keras.utils.timeseries_dataset_from_array():
import tensorflow as tf

my_series = [0, 1, 2, 3, 4, 5]
my_dataset = tf.keras.utils.timeseries_dataset_from_array(
    my_series,
    targets=my_series[3:],   # target for each length-3 window: the value that follows it
    sequence_length=3,
    batch_size=2
)
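Iterating over the resulting dataset shows the window/target pairs it produces (an illustrative check, not part of the original notes):

for inputs, targets in my_dataset:
    print(inputs.numpy(), targets.numpy())
# batch 1: windows [0 1 2] and [1 2 3] with targets [3 4]
# batch 2: window [2 3 4] with target [5]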
Alternative method: the window() method of tf.data.Dataset, which produces nested window datasets that must be flattened with flat_map() (see the sketch below).
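A sketch of that approach (assumed from the tf.data API, not copied from the notes):

dataset = tf.data.Dataset.range(6)                          # 0, 1, 2, 3, 4, 5
windows = dataset.window(4, shift=1, drop_remainder=True)   # nested window datasets
windows = windows.flat_map(lambda window: window.batch(4))  # flatten each window
for window_tensor in windows:
    print(window_tensor.numpy())   # [0 1 2 3], [1 2 3 4], [2 3 4 5]

Preparing the training and validation datasets for the rail ridership series (rail_train and rail_valid are the pandas Series prepared earlier in the chapter):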
seq_length = 56
train_ds = tf.keras.utils.timeseries_dataset_from_array(
    rail_train.to_numpy(),
    targets=rail_train[seq_length:],   # forecast the value right after each window
    sequence_length=seq_length,
    batch_size=32,
    shuffle=True,
    seed=42
)
valid_ds = tf.keras.utils.timeseries_dataset_from_array(
    rail_valid.to_numpy(),
    targets=rail_valid[seq_length:],
    sequence_length=seq_length,
    batch_size=32
)
Forecasting Using a Linear Model
Performance Results:
● The model achieved a validation MAE of approximately 37,866
● This performance is:
○ Better than naive forecasting
○ Worse than the SARIMA model
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=[seq_length])   # one weight per input time step, plus a bias term
])
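One plausible way to compile and fit this model (the loss, optimizer, and number of epochs below are assumptions, not taken from the notes):

# Huber loss is robust to outliers; MAE is tracked to compare with the other models
model.compile(loss=tf.keras.losses.Huber(), optimizer="adam", metrics=["mae"])
history = model.fit(train_ds, validation_data=valid_ds, epochs=100)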
Forecasting Using a Simple RNN
1. Initial Simple RNN Implementation:
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(1, input_shape=[None, 1])   # a single recurrent neuron
])
2. Input Shape Requirements:
● RNN layers expect 3D inputs: [batch size, time steps, dimensionality]
● input_shape ignores the first dimension (batch size)
● Time steps can be None (any size)
● Dimensionality is 1 for univariate time series
3. How the Simple RNN Works:
● Initial state h(init) starts at 0
● Each step processes current input and previous state
● Uses hyperbolic tangent (tanh) activation by default
● Outputs only the final value unless return_sequences=True (see the sketch after this list)
4. Problems with Initial Model:
● Validation MAE > 100,000 (poor performance)
● Only 3 parameters total (2 weights + 1 bias)
● Limited by tanh activation range (-1 to +1)
● Too simple for the complexity of the data
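A toy sketch of this recurrence (illustrative only; the weight values are made up): it shows how the three parameters are reused at every time step and why tanh bounds the output to (-1, 1).

import numpy as np

wx, wh, b = 0.8, 0.3, 0.1          # the model's 3 trainable parameters (made-up values)
x = [0.2, 0.5, 0.9]                # one univariate input sequence (3 time steps)

h = 0.0                            # initial state h(init) = 0
for x_t in x:
    h = np.tanh(wx * x_t + wh * h + b)   # same parameters reused at each step

print(h)   # the layer's output: the final state, bounded in (-1, 1) by tanh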
Forecasting Using a Deep RNN
Code Snippet:
deep_model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),   # intermediate layers must return full sequences
    tf.keras.layers.SimpleRNN(32),                          # last recurrent layer returns only its final output
    tf.keras.layers.Dense(1)
])
Fighting the Unstable Gradients Problem
1. Common Deep Learning Techniques That Help:
● Good parameter initialization
● Faster optimizers
● Dropout
2. ReLU and Non-saturating Activation Functions:
● May not help as much with RNNs
● Can actually increase instability
● Risk of exploding outputs due to weight reuse across time steps
● Saturating activations such as tanh are therefore preferred (which is why tanh is the default)
3. Gradient Issues:
● Gradients can explode
● Solutions include:
○ Using smaller learning rates
○ Monitoring gradient size (via TensorBoard)
○ Using gradient clipping
4. Batch Normalization (BN) Limitations:
● Less effective with RNNs than with feedforward networks
● Cannot be used effectively between time steps
● When used in memory cells:
○ Same BN layer used at each time step
○ Same parameters regardless of input scale
○ Only slightly beneficial when applied to layer inputs
○ Not helpful when applied to hidden states
○ Can slow down training
5. Layer Normalization Benefits:
● Better suited for RNNs than batch normalization
● Normalizes across features dimension instead of batch dimension
● Advantages:
○ Can compute statistics on the fly at each time step
○ Works independently for each instance
○ Consistent behavior during training and testing
○ Doesn't need exponential moving averages
○ Learns scale and offset parameters for each input
6. Implementation:
● Used after linear combination of inputs and hidden states
● Requires defining a custom memory cell in Keras (see the sketch after this list)
● Cell's call() method needs to handle both:
○ Current time step inputs
○ Previous time step hidden states
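A sketch of such a custom cell, following the approach described above (the class name LNSimpleRNNCell and the 32 units are assumptions): it wraps a SimpleRNNCell with its activation removed, applies layer normalization to the linear combination of inputs and hidden state, then applies the activation.

class LNSimpleRNNCell(tf.keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        # SimpleRNNCell with activation=None only computes the linear
        # combination of the current inputs and the previous hidden state
        self.simple_rnn_cell = tf.keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = tf.keras.layers.LayerNormalization()
        self.activation = tf.keras.activations.get(activation)

    def call(self, inputs, states):
        # inputs: current time step; states: previous time step's hidden state
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

# The custom cell is used via the generic tf.keras.layers.RNN wrapper:
model = tf.keras.Sequential([
    tf.keras.layers.RNN(LNSimpleRNNCell(32), return_sequences=True,
                        input_shape=[None, 1]),
    tf.keras.layers.Dense(1)
])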
Tackling the Short-Term Memory Problem
Addressed using two types of cells with long-term memory:
1. LSTM (Long Short-Term Memory)
2. GRU (Gated Recurrent Unit)
LSTM
Code Snippet:
model = tf.keras.Sequential([
    # 32 units is an assumption; only the input shape [None, 5] (5 input features)
    # appears in the original snippet
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)   # 14 outputs per time step (e.g., forecasting 14 steps ahead)
])
GRU
Code Snippet:
model = tf.keras.Sequential([
    # 32 units assumed, mirroring the LSTM model above
    tf.keras.layers.GRU(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])