
Recurrent Neural Networks

Anahita Zarei, Ph.D.


Overview
• Recurrent Networks
• LSTM
• GRU
• 1D Convnets
• Reading: 6.2, 6.3, 6.4 from Deep Learning with Python
Motivation for RNNs
• We previously saw that Convolutional Neural Networks (CNNs) form the basis of many state-of-the-art computer vision systems. However, we do not understand the world around us with vision alone.
• Sound, for one, also plays an important role. As humans, we
communicate and express ideas through sequences of symbolic
reductions and abstract representations.
• Naturally, we would want machines to process sequential information in a similar manner, as this could help us solve many real-world problems that involve sequences.
Examples of Sequences
• Visit a foreign country and need to order in a restaurant.
• Want your car to perform a sequence of movements automatically so
that it is able to park by itself.
• Want to understand how different sequences of adenine, guanine,
thymine, and cytosine molecules in the human genome lead to
differences in biological processes occurring in the human body.
• What's common between all these examples?
• All are related to sequence modeling tasks. In such tasks, the training
examples (vectors of words, a set of car movements, or configuration
of A, G, T, and C molecules) are multiple time-dependent data points.
Examples of Sequences
• Don't judge a book by its ___.
• How do you know what the next word is?
• You consider the relative positions of words and (subconsciously) perform
some form of Bayesian inference, leveraging the sentences you have
previously seen and their apparent similarity to this example.
• i.e., you used your internal model of the English language to predict the
most probable word to follow.
• A language model assigns a probability to a particular configuration of words occurring together in a given sequence.
• Such models are the fundamental components of modern speech
recognition and machine translation systems.
• They rely on modeling the likelihood of sequences of words.
Why RNNs?
• A major characteristic of densely connected networks and convnets is that they have no memory.
• Each input shown to them is processed independently, with no state kept
in between inputs.
• For example, these networks would likely treat both “this movie is a bomb”
and “this movie is the bomb” as being negative reviews, since they don’t
consider inter-word relationships and sentence structure.
• Therefore, in order to process a sequence or a time series, you have to
show the entire sequence to the network at once.
• For instance, in the IMDB example: an entire movie review was
transformed into a single large vector and processed in one go. Such
networks are called feedforward networks.
Why RNNs?
• The previous architectures did not operate over a sequence of
vectors.
• This prohibits us from sharing any time-dependent information that
may affect the likelihood of our predictions.
• In the case of image classification, the fact that the network saw an image of a cat on the previous iteration does not help it classify the current image, because the class probabilities of the two instances are not temporally related. However, ignoring temporal relationships does cause problems in other settings, such as sentiment analysis.
RNN Architecture
• An RNN processes sequences by iterating through
the sequence elements and maintaining a state
containing information relative to what it has seen
so far.
• RNNs have an internal loop: they save relevant information in memory (also referred to as their state) and use this information to make predictions at subsequent time steps.
• The state of the RNN is reset between processing two different, independent sequences (such as two different IMDB reviews).
• So one sequence is still a single data point: a single
input to the network.
• What changes is that this data point is no longer
processed in a single step; rather, the network
internally loops over sequence elements.
RNN Architecture
• RNNs are characterized by their recurrence relation; in the simplest case, the output at each timestep is computed from the current input and the previous output (the state):

output_t = activation(W · input_t + U · state_t + b), where state_t = output_(t−1)

• In practice, you’ll always use a more elaborate model than this simple expression.
• SimpleRNN has a major issue: although it should theoretically be able to
retain at time t information about inputs seen many timesteps before, in
practice, such long-term dependencies are impossible to learn.
• This is due to the vanishing gradient problem, an effect that is similar to
what is observed with non-recurrent networks (feedforward networks) that
are many layers deep: as you keep adding layers to a network, the network
eventually becomes untrainable.
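To make the recurrence above concrete, here is a minimal NumPy sketch of a simple RNN forward pass, in the spirit of the pseudocode in Deep Learning with Python, section 6.2. The dimensions and random weights are placeholders, not values from the slides:

```python
import numpy as np

timesteps, input_features, output_features = 100, 32, 64

inputs = np.random.random((timesteps, input_features))   # one sequence: (time, features)
state_t = np.zeros((output_features,))                    # initial state: all zeros

# Placeholder weights; in a trained network these are learned.
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

outputs = []
for input_t in inputs:                                    # the internal loop over timesteps
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    outputs.append(output_t)
    state_t = output_t                                    # the output becomes the next state

final_output_sequence = np.stack(outputs, axis=0)         # shape (timesteps, output_features)
```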
LSTM (Long Short-Term Memory)
• LSTM adds a way to carry information
across many timesteps.
• Imagine a conveyor belt running parallel
to the sequence you’re processing.
• Information from the sequence can jump
onto the conveyor belt at any point, be
transported to a later timestep, and jump
off, intact, when you need it.
• This is essentially what LSTM does: it
saves information for later, thus
preventing older signals from gradually
vanishing during processing.
• The LSTM network thus provides a more robust, if more complex, solution to the exploding and vanishing gradient problems.
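To make the conveyor-belt picture concrete, here is a NumPy sketch of a single step of a standard LSTM cell. This is the textbook formulation rather than code from the slides; the stacked-gate weight layout and variable names are one common convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_t, c_t, W, U, b):
    """One LSTM step. c_t is the 'carry' (the conveyor belt); h_t is the output state.
    Example shapes: x_t (input_dim,), h_t and c_t (units,),
    W (4*units, input_dim), U (4*units, units), b (4*units,)."""
    z = np.dot(W, x_t) + np.dot(U, h_t) + b       # pre-activations for all four blocks, stacked
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate values to write to the carry
    c_next = f * c_t + i * g                      # old carry partially kept, new info added
    h_next = o * np.tanh(c_next)                  # output is a gated read of the carry
    return h_next, c_next
```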
GRU
• The GRU can be considered the younger sibling of the LSTM.
• In essence, both leverage similar mechanisms to model long-term dependencies, such as remembering whether the subject of a sentence is plural when generating the words that follow.
• The underlying difference between GRUs and LSTMs is in the
computational complexity they represent.
• Simply put, LSTMs are more complex architectures that, while
computationally expensive and time-consuming to train, perform very well
at breaking down the training data into meaningful and generalizable
representations.
• GRUs, on the other hand, while computationally less intensive, are limited
in their representational abilities compared to LSTM.
• However, not all tasks require heavyset 10-layer LSTMs!!
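For comparison, here is a sketch of a single step of a standard GRU cell (again the textbook formulation, not slide code; it reuses `np` and `sigmoid` from the LSTM sketch above). The GRU keeps a single state vector and has three weight blocks to the LSTM's four, which is where its lower computational cost comes from:

```python
def gru_step(x_t, h_t, W, U, b):
    """One GRU step. W, U, b each hold three blocks (update gate, reset gate, candidate)."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(np.dot(Wz, x_t) + np.dot(Uz, h_t) + bz)           # update gate: how much to overwrite
    r = sigmoid(np.dot(Wr, x_t) + np.dot(Ur, h_t) + br)           # reset gate: how much past to use
    h_cand = np.tanh(np.dot(Wh, x_t) + np.dot(Uh, r * h_t) + bh)  # candidate new state
    return (1.0 - z) * h_t + z * h_cand                           # interpolate old state and candidate
```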
Example – Stock Market
• The goal of this exercise is to predict the movement of stock prices.
• We will use the S&P 500 dataset, and select a random stock to
prepare for sequential modeling.
• The dataset comprises historical stock prices (opening, high, low, and
closing prices) for all current S&P 500 large capital companies traded
on the American stock market.
• We do acknowledge the stochasticity that lies embedded in market
trends: the reality is that there is a lot of randomness that often
escapes even the most predictive of models. Investor behavior is hard
to foresee, as investors tend to capitalize for various motives.
Importing the Data
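The original slide shows the import step as a code screenshot. A minimal sketch of what it might look like, assuming the Kaggle "S&P 500 stock data" CSV with its usual columns (`date`, `open`, `high`, `low`, `close`, `volume`, `Name`) — the file name and column names are assumptions, not taken from the slides:

```python
import pandas as pd

# Assumed file name and columns; adjust to your local copy of the dataset.
df = pd.read_csv('all_stocks_5yr.csv')
print(df.head())
print(df['Name'].nunique(), 'distinct tickers')
```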
Visualizing the Data
• We select a random stock (American Airlines Group, AAL) out of the 505 different stocks in our dataset.
• Note that data is sorted by date, since we
deal with a time series prediction problem
where the order of the sequence is very
important to our task.
• We then visually display our data by plotting
out the high and low prices (on a given day) in
sequential order of occurrence.
• We observe that, while slightly different from
one another, the high and low prices both
follow the same pattern.
• Hence, it would be redundant to use both
these variables for predictive modeling, as
they are highly correlated. We pick just the
high values.
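A sketch of the selection and plot described above, continuing the loading sketch (column names as assumed there):

```python
import matplotlib.pyplot as plt

aal = df[df['Name'] == 'AAL'].sort_values('date')   # one ticker, in chronological order

plt.plot(aal['high'].values, label='high')
plt.plot(aal['low'].values, label='low')
plt.xlabel('trading day')
plt.ylabel('price (USD)')
plt.legend()
plt.show()
```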
Convert to Numpy Array
• We will convert the high price
column on a given observation
day into a NumPy array.
• We do so by calling values on
that column, which returns its
NumPy representation.
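For example, continuing the sketch above:

```python
high_prices = aal['high'].values   # pandas Series -> NumPy array of daily high prices
print(type(high_prices), high_prices.shape)
```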
Train, Validation, and Test Splits
We use 70% of our data for training, 15% for validation, and 15% for
test.
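A sketch of a 70/15/15 chronological split (no shuffling, since order matters for a time series); `high_prices` comes from the sketch above:

```python
n = len(high_prices)
train_end = int(n * 0.70)
val_end = int(n * 0.85)

train = high_prices[:train_end]
val   = high_prices[train_end:val_end]
test  = high_prices[val_end:]
```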
Visualizing the Data Subsets
• We visualize the unnormalized
training, validation, and testing
segments of the AAL stock data.
• Note that the test data lies in a price range of roughly $40 to $55 over the time frame it covers, while the training data spans roughly $25 to $50+ over its respectively longer span of observations.
Normalizing the Data
• Recall that you need to normalize data for various machine learning
tasks.

• You do need to reshape your data from a 1D (scalar) array to a 2D array with a single column, since the scaling utilities expect 2D input.


Normalizing Data
Recall that we normalize data based on training parameters.
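A sketch of the normalization step. The slides do not name a scaler; scikit-learn's MinMaxScaler is assumed here as a common choice. The scaler is fit on the training data only and then reused for the validation and test segments:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()   # assumed scaler; the slides do not specify one

# Scalers expect 2D input, hence the reshape from (n,) to (n, 1).
trainNorm = scaler.fit_transform(train.reshape(-1, 1))   # fit on training data only
valNorm   = scaler.transform(val.reshape(-1, 1))         # reuse the training parameters
testNorm  = scaler.transform(test.reshape(-1, 1))
```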
Creating sequences
• In order to train the RNN, we need to organize our time series into segments of n consecutive values from a given sequence.
• The output for each training sequence will correspond to the stock
price some timesteps into the future.
• We have two variables look_back and foresight:
• look_back refers to the number of stock prices we keep in a given
observation.
• foresight refers to the number of steps between the last data point in the
observed sequence, and the data point we aim to predict.
Sequences
• What will be the length of trainNorm after creating the sequences?
• 648 − 7 − 6 = 635
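A sketch of a sequence-building helper consistent with the count above. The arithmetic 648 − 7 − 6 = 635 suggests look_back = 7 and foresight = 6, but those exact values and the indexing convention are inferences chosen so that the sequence count matches the slide, not code taken from it. The same helper is applied to the validation and test segments on the next slide.

```python
import numpy as np

def create_sequences(series, look_back=7, foresight=6):
    """Slice a 1D (or (n, 1)) series into (X, y) pairs:
    look_back consecutive inputs, one target foresight steps further on."""
    series = series.flatten()
    X, y = [], []
    for i in range(len(series) - look_back - foresight):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back + foresight])
    return np.array(X), np.array(y)

trainX, trainY = create_sequences(trainNorm)   # with 648 points: 648 - 7 - 6 = 635 sequences
```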
Creating Sequences for Validation and Test
• You typically need to experiment with different values of look_back
and foresight to assess how larger look_back and foresight values
each affect the predictive power of your model.
• In practice, you will experience diminishing returns on either side for both
values.
Reshaping the Data for Keras Layers
• We need to prepare a 3D tensor of shape (nb_samples, look_back, num_features).
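Continuing the sketch, the same helper is applied to the validation and test segments and each 2D array is expanded to the 3D shape Keras recurrent layers expect (here num_features = 1):

```python
valX,  valY  = create_sequences(valNorm)
testX, testY = create_sequences(testNorm)

# (nb_samples, look_back) -> (nb_samples, look_back, 1)
trainX = trainX[..., np.newaxis]
valX   = valX[..., np.newaxis]
testX  = testX[..., np.newaxis]
```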
Imports
Simple LSTM
Fitting Simple LSTM
Error Plot for Simple LSTM
Simple LSTM Performance on Test Set
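The slides from "Imports" through "Simple LSTM Performance on Test Set" show code and plots as screenshots. Below is a minimal sketch of what such a pipeline might look like in Keras; the layer size, optimizer, epoch count, and batch size are placeholders, not values from the slides:

```python
from tensorflow.keras import models, layers
import matplotlib.pyplot as plt

model = models.Sequential([
    layers.LSTM(32, input_shape=(trainX.shape[1], 1)),   # 32 units is a placeholder choice
    layers.Dense(1),                                      # regression output: one future price
])
model.compile(optimizer='adam', loss='mse')

history = model.fit(trainX, trainY,
                    epochs=50, batch_size=32,
                    validation_data=(valX, valY))

# Error plot, as on the "Error Plot for Simple LSTM" slide:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()

# Test-set predictions, mapped back to dollar prices:
preds = scaler.inverse_transform(model.predict(testX))
```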
Simple GRU
Fitting Simple GRU
Error Plot for Simple GRU
Simple GRU Performance on Test Set
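The GRU version differs only in the recurrent layer (again a sketch with placeholder settings, not the slide code):

```python
gru_model = models.Sequential([
    layers.GRU(32, input_shape=(trainX.shape[1], 1)),   # drop-in replacement for the LSTM layer
    layers.Dense(1),
])
gru_model.compile(optimizer='adam', loss='mse')
gru_history = gru_model.fit(trainX, trainY, epochs=50, batch_size=32,
                            validation_data=(valX, valY))
```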
Sequence processing with convnets
• We saw that convnets perform particularly well on computer vision problems.
• This is due to their ability to extract features from local input patches, which allows for representation modularity and data efficiency.
• The same properties that make convnets excel at computer vision also
make them highly relevant to sequence processing.
• Time can be treated as a spatial dimension, like the height or width of a 2D
image.
• Convnets can be competitive with RNNs on certain sequence-processing problems, usually at a considerably cheaper computational cost.
Understanding 1D convolution for Sequence
Data
• The convolution layers introduced previously were 2D
convolutions, extracting 2D patches from image tensors
and applying an identical transformation to every patch.
• In the same way, you can use 1D convolutions, extracting
local 1D patches (subsequences) from sequences.
• Such 1D convolution layers can recognize local patterns in
a sequence.
• Because the same input transformation is performed on
every patch, a pattern learned at a certain position in a
sentence can later be recognized at a different position,
making 1D convnets translation invariant (for temporal
translations).
• For instance, a 1D convnet processing sequences of characters using convolution windows of size 5 should be able to learn words of length 5 or less, and recognize these words in any context in an input sequence.
1D pooling for sequence data
• In Keras, you use a 1D convnet via the Conv1D layer, which has an interface
similar to Conv2D.
• It takes as input 3D tensors with shape (samples, time, features) and
returns similarly shaped 3D tensors.
• The convolution window is a 1D window on the temporal axis: axis 1 in the
input tensor.
• Keep in mind that here you can use larger convolution windows with 1D
convnets.
• With a 2D convolution layer, a 3 × 3 convolution window contains 3 × 3 = 9
feature vectors; but with a 1D convolution layer, a convolution window of
size 3 contains only 3 feature vectors. You can thus easily afford 1D
convolution windows of size 7 or 9.
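A sketch of a small 1D convnet for (samples, time, features) tensors, illustrating the Conv1D / MaxPooling1D / GlobalMaxPooling1D pattern from Deep Learning with Python, section 6.4. Layer sizes and the sequence length are placeholders; it assumes reasonably long input sequences, since each convolution/pooling stage shortens the temporal axis:

```python
from tensorflow.keras import models, layers

conv_model = models.Sequential([
    layers.Conv1D(32, 7, activation='relu', input_shape=(120, 1)),  # 1D window of size 7 over time
    layers.MaxPooling1D(3),                                         # downsample the temporal axis
    layers.Conv1D(32, 7, activation='relu'),
    layers.GlobalMaxPooling1D(),                                    # collapse time to a single vector
    layers.Dense(1),
])
conv_model.compile(optimizer='adam', loss='mse')
conv_model.summary()
```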
Advantages and Drawbacks of RNNs
References
• Hands-On Neural Networks with Keras by Niloy Purkait
• Deep Learning with Python by François Chollet
