This document discusses recurrent neural networks (RNNs) and their use for sequence modeling tasks. It provides an overview of long short-term memory (LSTM) and gated recurrent unit (GRU) RNNs, as well as 1D convolutional neural networks. The document motivates the use of RNNs by explaining that sequences are important in tasks like language processing, activity forecasting, and genome modeling. It then demonstrates how RNNs can model sequences by maintaining internal states, and describes how LSTMs and GRUs help address the vanishing gradient problem in simple RNNs. Finally, it provides a stock price prediction example to illustrate how to preprocess time series data and use RNNs for sequential regression tasks.
Recurrent Neural Networks
Anahita Zarei, Ph.D.
Overview
• Recurrent Networks
• LSTM
• GRU
• 1D Convnets
• Reading: 6.2, 6.3, 6.4 from Deep Learning with Python

Motivation for RNNs
• We previously saw that Convolutional Neural Networks (CNNs) form the basis of many state-of-the-art computer vision systems. However, we do not understand the world around us with vision alone.
• Sound, for one, also plays an important role. As humans, we communicate and express ideas through sequences of symbolic reductions and abstract representations.
• Naturally, we would want machines to understand this manner of processing sequential information, as it could help us resolve many of the sequential tasks we face in the real world.

Examples of Sequences
• You visit a foreign country and need to order in a restaurant.
• You want your car to perform a sequence of movements automatically so that it can park by itself.
• You want to understand how different sequences of adenine, guanine, thymine, and cytosine molecules in the human genome lead to differences in biological processes in the human body.
• What is common to all these examples?
• All are sequence modeling tasks. In such tasks, the training examples (vectors of words, a set of car movements, or a configuration of A, G, T, and C molecules) consist of multiple time-dependent data points.

Examples of Sequences
• Don't judge a book by its ___.
• How do you know what the next word is?
• You consider the relative positions of words and (subconsciously) perform some form of Bayesian inference, leveraging the sentences you have previously seen and their apparent similarity to this example.
• In other words, you used your internal model of the English language to predict the most probable word to follow.
• A language model describes the probability of a particular configuration of words occurring together in a given sequence.
• Such models are fundamental components of modern speech recognition and machine translation systems.
• They rely on modeling the likelihood of sequences of words.

Why RNNs?
• A major characteristic of densely connected networks and convnets is that they have no memory.
• Each input shown to them is processed independently, with no state kept between inputs.
• For example, these networks would likely treat both "this movie is a bomb" and "this movie is the bomb" as negative reviews, since they don't consider inter-word relationships and sentence structure.
• Therefore, in order to process a sequence or a time series, you have to show the entire sequence to the network at once.
• For instance, in the IMDB example, an entire movie review was transformed into a single large vector and processed in one go. Such networks are called feedforward networks.

Why RNNs?
• The previous architectures did not operate over a sequence of vectors.
• This prevents us from sharing any time-dependent information that may affect the likelihood of our predictions.
• In image classification, the fact that the network saw an image of a cat in the previous iteration does not help it classify the current image, because the class probabilities of the two instances are not temporally related. However, this independence becomes a problem in other settings, such as sentiment analysis.

RNN Architecture
• An RNN processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far.
• An RNN has an internal loop; it saves relevant information in memory (also referred to as its state) and uses this information to perform predictions at subsequent time steps.
• The state of the RNN is reset between processing two different, independent sequences (such as two different IMDB reviews).
• So one sequence is still a single data point: a single input to the network.
• What changes is that this data point is no longer processed in a single step; rather, the network internally loops over the sequence elements.

RNN Architecture
• RNNs are characterized by their recurrence relation. In the simplest case, the output at time t combines the current input with the state carried over from the previous timestep:
  output_t = activation(W · input_t + U · state_t + b), where state_t is the output from the previous timestep.
• In practice, you'll always use a more elaborate model than this simple expression.
• SimpleRNN has a major issue: although it should theoretically be able to retain at time t information about inputs seen many timesteps before, in practice such long-term dependencies are impossible to learn.
• This is due to the vanishing gradient problem, an effect similar to what is observed in non-recurrent (feedforward) networks that are many layers deep: as you keep adding layers to a network, the network eventually becomes untrainable.
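To make the state-carrying loop and the recurrence above concrete, here is a minimal NumPy sketch of a simple RNN forward pass, in the spirit of the pseudocode in the assigned reading. The sizes and the tanh activation are illustrative choices, not values from the slides.

```python
import numpy as np

timesteps = 100        # number of timesteps in the input sequence
input_features = 32    # dimensionality of the input at each timestep
output_features = 64   # dimensionality of the state / output

inputs = np.random.random((timesteps, input_features))  # one toy input sequence
state_t = np.zeros((output_features,))                  # initial state: all zeros

# Random matrices stand in for the learned parameters W, U, and b.
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

outputs = []
for input_t in inputs:                      # iterate over the sequence elements
    # Combine the current input with the state carried over from the previous step.
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    outputs.append(output_t)
    state_t = output_t                      # the output becomes the state for the next step

final_outputs = np.stack(outputs, axis=0)   # shape: (timesteps, output_features)
```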
LSTM (Long Short-Term Memory)
• LSTM adds a way to carry information across many timesteps.
• Imagine a conveyor belt running parallel to the sequence you're processing.
• Information from the sequence can jump onto the conveyor belt at any point, be transported to a later timestep, and jump off, intact, when you need it.
• This is essentially what LSTM does: it saves information for later, thus preventing older signals from gradually vanishing during processing.
• The LSTM network provides a more complex solution to the problems of exploding and vanishing gradients.

GRU (Gated Recurrent Unit)
• The GRU can be considered the younger sibling of the LSTM.
• In essence, both leverage similar concepts to model long-term dependencies, such as remembering whether the subject of a sentence is plural when generating the sequence that follows.
• The underlying difference between GRUs and LSTMs is in their computational complexity.
• Simply put, LSTMs are more complex architectures that, while computationally expensive and time-consuming to train, perform very well at breaking the training data down into meaningful and generalizable representations.
• GRUs, on the other hand, while computationally less intensive, are more limited in their representational abilities than LSTMs.
• However, not all tasks require heavyset 10-layer LSTMs!
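In Keras, SimpleRNN, LSTM, and GRU are drop-in alternatives for the recurrent layer, which makes the trade-off above easy to test. A minimal sketch for a text task such as the IMDB reviews mentioned earlier; the vocabulary size, embedding dimension, layer width, and optimizer are illustrative assumptions, not values from the slides.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# A small sentiment-style model; only the recurrent layer changes between variants.
model = Sequential()
model.add(Embedding(10000, 32))            # map word indices to 32-dimensional vectors
model.add(LSTM(32))                        # swap in GRU(32) for a cheaper model,
                                           # or SimpleRNN(32) for the basic recurrence
model.add(Dense(1, activation='sigmoid'))  # binary sentiment output
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```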
Example – Stock Market
• The goal of this exercise is to predict the movement of stock prices.
• We will use the S&P 500 dataset and select a random stock to prepare for sequential modeling.
• The dataset comprises historical stock prices (opening, high, low, and closing prices) for all current S&P 500 large-cap companies traded on the American stock market.
• We acknowledge the stochasticity embedded in market trends: in reality there is a lot of randomness that often escapes even the most predictive of models. Investor behavior is hard to foresee, as investors act on many different motives.

Importing the Data

Visualizing the Data
• We select a random stock (American Airlines Group, AAL) out of the 505 different stocks in our dataset.
• Note that the data is sorted by date, since we are dealing with a time series prediction problem where the order of the sequence is very important to our task.
• We then visually display our data by plotting the high and low prices (on a given day) in sequential order of occurrence.
• We observe that, while slightly different from one another, the high and low prices follow the same pattern.
• Hence, it would be redundant to use both of these variables for predictive modeling, as they are highly correlated. We pick just the high values.
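The "Importing the Data" and "Visualizing the Data" slides correspond to code that is not reproduced in the deck. Below is a hedged reconstruction; the file name all_stocks_5yr.csv, the column names (Name, date, high, low), and the ticker spelling 'AAL' are assumptions about the public S&P 500 dataset, not details taken from the slides.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the S&P 500 price history (assumed file and column names).
df = pd.read_csv('all_stocks_5yr.csv')

# Keep one stock (American Airlines Group) and preserve chronological order.
aal = df[df['Name'] == 'AAL'].sort_values('date')

# High and low prices track each other closely, which is why only one is kept later.
plt.plot(aal['high'].values, label='high')
plt.plot(aal['low'].values, label='low')
plt.legend()
plt.title('AAL daily high and low prices')
plt.show()
```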
Convert to Numpy Array
• We convert the high-price column into a NumPy array.
• We do so by calling values on that column, which returns its NumPy representation.

Train, Validation, and Test Splits
• We use 70% of our data for training, 15% for validation, and 15% for testing.

Visualizing the Data Subsets
• We visualize the unnormalized training, validation, and testing segments of the AAL stock data.
• Note that the test data lies in the range of roughly $40 to $55 over the time frame it covers, while the training data lies in the range of roughly $25 to $50+ over its respectively longer span of observation.

Normalizing the Data
• Recall that you need to normalize data for various machine learning tasks.
• You also need to reshape your data from a 1D array into a 2D array, since the scaler expects 2D input.

Normalizing Data
• Recall that we normalize the data using parameters computed from the training set.
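A hedged sketch of the conversion, the 70/15/15 split, and the normalization described above. The MinMaxScaler choice and all variable names other than trainNorm (which appears later in the slides) are assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# NumPy representation of the high-price column (continuing from the snippet above).
high = aal['high'].values

# 70% training, 15% validation, 15% test, keeping chronological order.
n = len(high)
train_end = int(n * 0.70)
val_end = int(n * 0.85)
train, val, test = high[:train_end], high[train_end:val_end], high[val_end:]

# The scaler expects a 2D array, hence the reshape from (n,) to (n, 1).
scaler = MinMaxScaler()
trainNorm = scaler.fit_transform(train.reshape(-1, 1))  # parameters fit on training data only
valNorm = scaler.transform(val.reshape(-1, 1))          # reuse the training parameters
testNorm = scaler.transform(test.reshape(-1, 1))
```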
Creating Sequences
• In order to train the RNN, we need to organize our time series into segments of n consecutive values in a given sequence.
• The output for each training sequence corresponds to the stock price some timesteps into the future.
• We have two variables, look_back and foresight:
• look_back refers to the number of stock prices we keep in a given observation.
• foresight refers to the number of steps between the last data point in the observed sequence and the data point we aim to predict.
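The sequence-building code itself is not reproduced in the deck. The helper below is a minimal sketch of the look_back / foresight scheme, with the indexing chosen so that the number of sequences equals len(series) - look_back - foresight; the function name and the specific values 7 and 6 are assumptions.

```python
import numpy as np

def create_sequences(series, look_back, foresight):
    """Slice a series into (window of look_back values, value foresight steps ahead) pairs."""
    X, y = [], []
    # Each window must leave room for `foresight` further steps beyond its end,
    # so this yields len(series) - look_back - foresight sequences.
    for i in range(len(series) - look_back - foresight):
        X.append(series[i:i + look_back])            # look_back consecutive prices
        y.append(series[i + look_back + foresight])  # the price to predict
    return np.array(X), np.array(y)

trainX, trainY = create_sequences(trainNorm, look_back=7, foresight=6)
```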
Sequences
• What will be the length of trainNorm after creating the sequences?
• 648 - 7 - 6 = 635, i.e., the number of training points minus look_back minus foresight.

Creating Sequences for Validation and Test
• You typically need to experiment with different values of look_back and foresight to assess how larger look_back and foresight values each affect the predictive power of your model.
• In practice, you will experience diminishing returns on either side for both values.

Reshaping the Data for Keras Layers
• We need to prepare a 3D tensor of shape (nb_samples, look_back, num_features).

Imports

Simple LSTM

Fitting Simple LSTM

Error Plot for Simple LSTM

Simple LSTM Performance on Test Set

Simple GRU

Fitting Simple GRU

Error Plot for Simple GRU

Simple GRU Performance on Test Set
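The "Imports", "Simple LSTM", and "Simple GRU" slides above correspond to code and result plots that are not reproduced in the deck. The sketch below shows what a simple Keras model for this setup could look like; the layer width, optimizer, epoch count, and batch size are assumptions, not values from the slides.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense
import matplotlib.pyplot as plt

look_back, foresight, num_features = 7, 6, 1

# Sequences for validation and test are built the same way as for training.
valX, valY = create_sequences(valNorm, look_back, foresight)
testX, testY = create_sequences(testNorm, look_back, foresight)

# Ensure the 3D shape (nb_samples, look_back, num_features) expected by Keras.
trainX = trainX.reshape((trainX.shape[0], look_back, num_features))
valX = valX.reshape((valX.shape[0], look_back, num_features))
testX = testX.reshape((testX.shape[0], look_back, num_features))

# Simple LSTM: one recurrent layer followed by a single regression output.
model = Sequential()
model.add(LSTM(32, input_shape=(look_back, num_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

history = model.fit(trainX, trainY,
                    epochs=20, batch_size=32,
                    validation_data=(valX, valY))

# Error plot: training vs. validation loss per epoch.
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()

# Test-set performance; the "simple GRU" variant swaps LSTM(32) for keras.layers.GRU(32).
print(model.evaluate(testX, testY))
```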
Sequence Processing with Convnets
• We saw that convnets perform particularly well on computer vision problems.
• This is due to their ability to extract features from local input patches, allowing for representation modularity and data efficiency.
• The same properties that make convnets excel at computer vision also make them highly relevant to sequence processing.
• Time can be treated as a spatial dimension, like the height or width of a 2D image.
• Convnets can be competitive with RNNs on certain sequence-processing problems, usually at a considerably cheaper computational cost.

Understanding 1D Convolution for Sequence Data
• The convolution layers introduced previously were 2D convolutions, extracting 2D patches from image tensors and applying an identical transformation to every patch.
• In the same way, you can use 1D convolutions, extracting local 1D patches (subsequences) from sequences.
• Such 1D convolution layers can recognize local patterns in a sequence.
• Because the same input transformation is performed on every patch, a pattern learned at a certain position in a sentence can later be recognized at a different position, making 1D convnets translation invariant (for temporal translations).
• For instance, a 1D convnet processing sequences of characters using convolution windows of size 5 should be able to learn words of length 5 or less, and recognize these words in any context in an input sequence.

1D Pooling for Sequence Data
• In Keras, you use a 1D convnet via the Conv1D layer, which has an interface similar to Conv2D.
• It takes as input 3D tensors with shape (samples, time, features) and returns similarly shaped 3D tensors.
• The convolution window is a 1D window on the temporal axis: axis 1 in the input tensor.
• Keep in mind that you can use larger convolution windows with 1D convnets.
• With a 2D convolution layer, a 3 × 3 convolution window contains 3 × 3 = 9 feature vectors; but with a 1D convolution layer, a convolution window of size 3 contains only 3 feature vectors. You can thus easily afford 1D convolution windows of size 7 or 9.
• A minimal Conv1D sketch is included after the references below.

Advantages and Drawbacks of RNNs

References
• Hands-On Neural Networks with Keras by Niloy Purkait
• Deep Learning with Python by François Chollet
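As referenced in the 1D-pooling slide above, here is a minimal Conv1D sketch wired for the same (samples, look_back, 1) stock tensors used earlier; the filter count, window size, and optimizer are illustrative assumptions, not values from the slides.

```python
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

look_back, num_features = 7, 1

# A small 1D convnet taking the same 3D input as the recurrent models above.
model = Sequential()
model.add(Conv1D(32, 5, activation='relu', input_shape=(look_back, num_features)))
model.add(GlobalMaxPooling1D())   # collapse the temporal axis to a single feature vector
model.add(Dense(1))               # single regression output, as before
model.compile(optimizer='adam', loss='mse')
model.summary()
```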