
Module 4

The document discusses recurrent neural networks (RNNs), detailing their structure, including recurrent neurons, memory cells, and various input-output sequence types. It covers training methods, data preparation for machine learning models, and forecasting techniques using linear models, simple RNNs, and deep RNNs, while addressing challenges like unstable gradients and short-term memory problems. Additionally, it introduces advanced architectures such as LSTM and GRU for improved performance.

Recurrent Neurons and Layers

1. The simplest RNN has just one neuron that:


● Receives inputs at each time step t
● Receives its own previous output from time step t-1
● At the first time step, with no previous output, it starts at 0
2. When expanded to a full RNN layer:
○ Every neuron receives the input vector x(t)
○ Every neuron receives the output vector from the previous time step, ŷ(t-1)
○ The inputs and outputs become vectors instead of scalars
3. Each recurrent neuron has two weight sets:
● wx: weights for current input x(t)
● wŷ: weights for previous outputs ŷ(t-1)
● For a full layer, these become matrices Wx and Wŷ
4. The output calculation for a single instance is:
ŷ(t) = ϕ(Wx⊺ x(t) + Wŷ⊺ ŷ(t-1) + b)
5. For a mini-batch, the output calculation becomes:
Ŷ(t) = ϕ(X(t) Wx + Ŷ(t-1) Wŷ + b) = ϕ([X(t) Ŷ(t-1)] W + b), where W is the vertical concatenation of Wx and Wŷ (a NumPy sketch of this computation follows this list)
Where:
● Ŷ(t) is an m × n_neurons matrix containing the layer's outputs at time step t for each instance in the mini-batch (m = batch size)
● X(t) is an m × n_inputs matrix containing the inputs for all instances
● Wx is an n_inputs × n_neurons matrix of connection weights for the current inputs
● Wŷ is an n_neurons × n_neurons matrix of connection weights for the previous outputs
● b is the bias vector of size n_neurons
6. Key characteristics:
● The output Ŷ(t) depends on both the current input X(t) and the previous output Ŷ(t-1)
● This creates a chain of dependencies going back to the first time step
● At t=0, previous outputs are initialized to zeros
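
As a rough illustration of the mini-batch formula in item 5, here is a minimal NumPy sketch of one forward step for a layer of recurrent neurons; the sizes and random values are made up for the example:

import numpy as np

m, n_inputs, n_neurons = 4, 3, 5              # batch size, input dims, neurons (arbitrary)
rng = np.random.default_rng(42)
X_t = rng.normal(size=(m, n_inputs))          # X(t): inputs at time step t
Y_prev = np.zeros((m, n_neurons))             # Ŷ(t-1): zeros at the first time step
Wx = rng.normal(size=(n_inputs, n_neurons))   # weights for the current inputs
Wy = rng.normal(size=(n_neurons, n_neurons))  # weights for the previous outputs
b = np.zeros(n_neurons)                       # bias vector

# Ŷ(t) = ϕ(X(t) Wx + Ŷ(t-1) Wŷ + b), with ϕ = tanh
Y_t = np.tanh(X_t @ Wx + Y_prev @ Wy + b)
print(Y_t.shape)                              # (4, 5), i.e. m × n_neurons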
Memory Cells
1. Memory in RNNs:
● A recurrent neuron's output at time t depends on all previous inputs
● This creates a form of memory in the network
● Any part of a neural network that maintains state across time steps
is called a memory cell
2. Basic Memory Cells:
○ A single recurrent neuron is a basic memory cell
○ A layer of recurrent neurons is also a basic memory cell
○ These basic cells can typically learn patterns about 10 steps long
○ The pattern length capability varies depending on the task
3. More Complex Cells:
● Later chapters cover more sophisticated cell types
● These can learn patterns roughly 10 times longer
● Pattern length still varies based on the task
4. Cell State Characteristics:
● Cell state at time t is denoted h(t) (h stands for "hidden")
● State is a function of:
○ Current inputs x(t)
○ Previous state h(t-1)
● Written as: h(t) = f(x(t), h(t-1))
5. Cell Output:
● Output at time t is denoted ŷ(t)
● Output is a function of:
○ Previous state
○ Current inputs
● In basic cells: output equals state
● In complex cells: output may differ from state
Input and Output Sequences
1. Sequence-to-Sequence (top-left):
● Takes a sequence and outputs a sequence
● Example: Power consumption forecasting where you input N days of data and
output predictions shifted by one day
● Best for tasks where input and output are naturally sequential and aligned

2. Sequence-to-Vector (top-right):
● Takes a sequence but only uses final output
● Example: Sentiment analysis of movie reviews, where words are the input
sequence and the output is a single sentiment score
● Good for classification/scoring of sequential data (a Keras sketch contrasting these first two modes follows this list)
3. Vector-to-Sequence (bottom-left):
● Takes a single vector repeatedly as input and produces a sequence
● Example: Image captioning, where a CNN-processed image is input and
the output is a sequence of words describing it
● Useful when generating sequential content from a fixed input
4. Encoder-Decoder (bottom-right):
● Combines sequence-to-vector (encoder) with vector-to-sequence
(decoder)
● Example: Language translation, where input sentence is encoded to a
vector, then decoded to target language
● Better than direct sequence-to-sequence for translation because it can
consider entire input context before generating output
● More complex implementation than the diagram suggests (covered in
Chapter 16)
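
In Keras, the difference between the first two patterns usually comes down to the return_sequences argument of a recurrent layer. A minimal sketch; the layer sizes and Dense heads are arbitrary choices for illustration:

import tensorflow as tf

# Sequence-to-vector: the recurrent layer returns only its final output
seq_to_vec = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=[None, 1]),
    tf.keras.layers.Dense(1)                 # e.g., a single score per sequence
])

# Sequence-to-sequence: return_sequences=True yields one output per time step
seq_to_seq = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
    tf.keras.layers.Dense(1)                 # applied at every time step
])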
Training RNNs
1. Basic Concept:
● BPTT involves unrolling the RNN through time
● Uses regular backpropagation principles on the unrolled network
● Consists of forward pass followed by backward pass

2. Forward Pass:
● Network processes the input sequence from start to finish
● Represented by dashed arrows in Figure 15-5
● Generates predictions Ŷ(0) through Ŷ(T) for each timestep
3. Loss Function:
● Evaluates the output sequence against the target sequence
● Format: ℒ(Y(0), Y(1), ..., Y(T); Ŷ(0), Ŷ(1), ..., Ŷ(T))
● Can selectively ignore certain outputs depending on the task
● Example: Sequence-to-vector RNNs only use the final output
4. Backward Pass:
● Gradients flow backward through the unrolled network
● They flow only through the outputs used in the loss calculation
● In the example, gradients flow only through Ŷ(2), Ŷ(3), and Ŷ(4)
5. Parameter Updates:
● The same parameters (W and b) are used at each time step
● Their gradients are therefore tweaked multiple times during backpropagation
● The final gradient descent step updates the parameters just as in regular backprop
Preparing Data for ML models
● The text describes preparing time series data for machine learning
models, with the goal of forecasting tomorrow's ridership based on
8 weeks (56 days) of past data.
● The concept of using sliding windows: Every 56-day window from
the past serves as training data, with the target being the value
immediately following each window.
● Keras provides two methods for creating time series datasets:

First method using timeseries_dataset_from_array():

import tensorflow as tf

my_series = [0, 1, 2, 3, 4, 5]
my_dataset = tf.keras.utils.timeseries_dataset_from_array(
    my_series,
    targets=my_series[3:],  # the targets are 3 steps into the future
    sequence_length=3,
    batch_size=2
)
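
Iterating over this dataset should yield batches of 3-step windows paired with their targets; a quick inspection sketch (the output shown is the expected result, not re-verified here):

for window_batch, target_batch in my_dataset:
    print(window_batch.numpy(), target_batch.numpy())
# Expected output, roughly:
# [[0 1 2]
#  [1 2 3]] [3 4]
# [[2 3 4]] [5]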
Alternative method using window():

dataset = tf.data.Dataset.range(6).window(4, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window_dataset: window_dataset.batch(4))

# Helper function for extracting windows
def to_windows(dataset, length):
    dataset = dataset.window(length, shift=1, drop_remainder=True)
    return dataset.flat_map(lambda window_ds: window_ds.batch(length))
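
One way to use this helper, consistent with the window length above, is to split each window into inputs and a target (a sketch; the slicing scheme is an illustrative choice):

# Each window of length 4 becomes 3 input steps plus 1 target step
dataset = to_windows(tf.data.Dataset.range(6), length=4)
dataset = dataset.map(lambda window: (window[:-1], window[-1]))
for inputs, target in dataset:
    print(inputs.numpy(), "->", target.numpy())
# e.g. [0 1 2] -> 3, [1 2 3] -> 4, [2 3 4] -> 5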


Final data preparation steps for the rail ridership example:

rail_train = df["rail"]["2016-01":"2018-12"] / 1e6
rail_valid = df["rail"]["2019-01":"2019-05"] / 1e6
rail_test = df["rail"]["2019-06":] / 1e6

seq_length = 56
train_ds = tf.keras.utils.timeseries_dataset_from_array(
    rail_train.to_numpy(),
    targets=rail_train[seq_length:],
    sequence_length=seq_length,
    batch_size=32,
    shuffle=True,
    seed=42
)
valid_ds = tf.keras.utils.timeseries_dataset_from_array(
    rail_valid.to_numpy(),
    targets=rail_valid[seq_length:],
    sequence_length=seq_length,
    batch_size=32
)
Forecasting Using Linear Model
Performance Results:
● The model achieved a validation MAE of approximately 37,866
● This performance is:
○ Better than naive forecasting
○ Worse than the SARIMA model

Key Model Characteristics:
● Uses Huber loss instead of MAE directly for better performance
● Implements early stopping to prevent overfitting
● Uses the SGD optimizer with momentum
● Monitors validation MAE for early stopping (a training sketch showing these settings follows the code snippet below)
Code Snippet:

tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=[seq_length])
])
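
The snippet above only defines the model. A minimal training sketch matching the characteristics listed above; the specific learning rate, momentum, patience, and epoch count are illustrative assumptions, not values taken from this document:

# Huber loss + SGD with momentum + early stopping on validation MAE (values are illustrative)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    monitor="val_mae", patience=50, restore_best_weights=True)
opt = tf.keras.optimizers.SGD(learning_rate=0.02, momentum=0.9)
model.compile(loss=tf.keras.losses.Huber(), optimizer=opt, metrics=["mae"])
history = model.fit(train_ds, validation_data=valid_ds, epochs=500,
                    callbacks=[early_stopping_cb])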
Forecasting Using Simple RNN
Initial Simple RNN Implementation:

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(1, input_shape=[None, 1])
])
2. Input Shape Requirements:
● RNN layers expect 3D inputs: [batch size, time steps, dimensionality]
● input_shape omits the first dimension (the batch size)
● Time steps can be None (any size)
● Dimensionality is 1 for univariate time series
3. How the Simple RNN Works:
● Initial state h(init) starts at 0
● Each step processes current input and previous state
● Uses hyperbolic tangent (tanh) activation by default
● Outputs only the final value unless return_sequences=True
4. Problems with Initial Model:
● Validation MAE > 100,000 (poor performance)
● Only 3 parameters total (2 weights + 1 bias)
● Limited by tanh activation range (-1 to +1)
● Too simple for the complexity of the data (a possible fix is sketched below)
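
A common remedy for the issues above is to use a larger recurrent layer and put a linear Dense output layer on top, so predictions are no longer squashed into tanh's (-1, 1) range. The layer size below is an assumption for illustration:

# Larger recurrent layer + linear output layer, so outputs are not limited to (-1, 1)
univar_model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=[None, 1]),
    tf.keras.layers.Dense(1)   # linear activation by default
])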
Forecasting Using Deep RNN
Code Snippet:

deep_model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.SimpleRNN(32),  # last recurrent layer returns only its final output
    tf.keras.layers.Dense(1)
])
Fighting the Unstable Gradients Problem
1. Common Deep Learning Techniques That Help:
● Good parameter initialization
● Faster optimizers
● Dropout
2. ReLU and Non-saturating Activation Functions:
● May not help as much with RNNs
● Can actually increase instability
● Risk of exploding outputs due to weight reuse across time steps
● Saturating functions like tanh are preferred (hence being the default)
3. Gradient Issues:
● Gradients can explode
● Solutions include:
○ Using smaller learning rates
○ Monitoring gradient size (via TensorBoard)
○ Using gradient clipping (see the sketch below)
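
Gradient clipping is available directly on Keras optimizers through the clipvalue and clipnorm arguments; a minimal sketch (the thresholds and learning rate are illustrative):

# Clip each gradient component to the range [-1, 1]
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
# Alternatively, clip by each gradient's norm instead:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
deep_model.compile(loss=tf.keras.losses.Huber(), optimizer=optimizer, metrics=["mae"])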
4. Batch Normalization (BN) Limitations:
● Less effective with RNNs than with feedforward networks
● Cannot be used effectively between time steps
● When used in memory cells:
○ Same BN layer used at each time step
○ Same parameters regardless of input scale
○ Only slightly beneficial when applied to layer inputs
○ Not helpful when applied to hidden states
○ Can slow down training
5. Layer Normalization Benefits:
● Better suited for RNNs than batch normalization
● Normalizes across features dimension instead of batch dimension
● Advantages:
○ Can compute statistics on the fly at each time step
○ Works independently for each instance
○ Consistent behavior during training and testing
○ Doesn't need exponential moving averages
○ Learns scale and offset parameters for each input
6. Implementation:
● Used after linear combination of inputs and hidden states
● Requires defining a custom memory cell in Keras (see the sketch after this list)
● Cell's call() method needs to handle both:
○ Current time step inputs
○ Previous time step hidden states
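
A sketch of such a custom cell along the lines described above: it wraps a SimpleRNNCell with its activation disabled, applies layer normalization to the linear combination of inputs and hidden states, and only then applies the activation. The class name, layer size, and surrounding model are illustrative assumptions:

class LNSimpleRNNCell(tf.keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        # Inner cell with no activation: layer norm is applied before the activation
        self.simple_rnn_cell = tf.keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = tf.keras.layers.LayerNormalization()
        self.activation = tf.keras.activations.get(activation)

    def call(self, inputs, states):
        # states holds the hidden state(s) from the previous time step
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

# The custom cell is wrapped in a generic RNN layer
custom_ln_model = tf.keras.Sequential([
    tf.keras.layers.RNN(LNSimpleRNNCell(32), return_sequences=True,
                        input_shape=[None, 1]),
    tf.keras.layers.Dense(1)
])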
Tackling the Short-Term Memory Problem

Two cell types address this:
1. LSTM
2. GRU
LSTM
Code Snippet:

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)  # 14 output values per time step
])
GRU
Code Snippet:

model = tf.keras.Sequential([
    tf.keras.layers.GRU(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])
