
Sequence Analysis Part II

Dr. Sanjay Chatterji


CS 831
Recurrent Neural Networks
RNNs were first introduced in the 1980s.
They have regained popularity recently due to several intellectual and hardware breakthroughs that have made them tractable to train.
RNNs are different from feed-forward networks because:
They leverage a special type of neural layer, known as a recurrent layer, that enables the network to maintain state between uses of the network.
Unlike feed-forward layers, recurrent layers also have recurrent connections, which propagate information between neurons of the same layer.
Neural architecture of a recurrent layer
Connections of Neurons of FFN
and RNN
Like feed-forward neurons, neurons in a recurrent layer have:
1. incoming connections emanating from all of the neurons of the previous layer
2. outgoing connections leading to all of the neurons of the subsequent layer.
• However, these aren’t the only connections that neurons of a recurrent layer have.
• A fully connected recurrent layer has information flow from every neuron to every other neuron in its layer (including itself).
• A recurrent layer with r neurons therefore has a total of r² recurrent connections.
FFN & RNN Information Flow
• Feedforward connections represent information flow from one neuron to another, where the data being transferred is the computed neuronal activation from the current time step.
• Recurrent connections, however, represent information flow where the data is the stored neuronal activation from the previous time step.
• Thus, the activations of the neurons in a recurrent network represent the accumulating state of the network instance.
How an RNN functions after it’s been appropriately trained
• Every time we want to process a new sequence, we create a fresh
instance of our model.
• We can reason about networks that contain recurrent layers by
dividing the lifetime of the network instance into discrete time steps.
• At each time step, we feed the model the next element of the input.
• The initial activations of neurons in the recurrent layer are parameters
of our model.
• We determine the optimal values for them just like we determine the
optimal values for the weights of each connection during the process
of training.
“unrolling” the RNN through time
• Given a fixed lifetime (say t time steps) of an RNN instance, we can express
the instance as a feed-forward network (albeit irregularly structured).
• This clever transformation is referred to as “unrolling” the RNN through time.
• We’d like to map a sequence of two inputs (each dimension 1) to a single
output (also of dimension 1) in the following figure.
• We perform the transformation by taking the neurons of the single recurrent layer (including the input and output layers) and replicating them t times, once for each time step.
• We redraw the feed-forward connections within each time replica just as they
were in the original network.
• Then we draw the recurrent connections as feed-forward connections from
each time replica to the next.
Example RNN that maps a sequence of two inputs (each of dimension 1) to a single output (also of dimension 1)
Training an RNN using gradients based on the unrolled version
• All of the backpropagation techniques that we utilized for feedforward networks
also apply to training RNNs.
• Issue: After every batch of training examples we use, we need to modify the
weights based on the error derivatives we calculate.
• In our unrolled network, we have sets of connections that all correspond to the
same connection in the original RNN.
• The error derivatives calculated for these unrolled connections, however, are not
guaranteed to be (and, in practice, probably won’t be) equal.
• We can circumvent this issue by averaging or summing the error derivatives over
all the connections that belong to the same set.
• This allows us to utilize an error derivative that considers all of the dynamics
acting on the weight of a connection as we attempt to force the network to
construct an accurate output.
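• As a toy illustration (the numbers below are hypothetical), combining the per-time-step derivatives for one shared weight might look like this:

# Toy illustration with made-up numbers: the unrolled network produces one
# error derivative per time-step copy of the shared weight w. We combine
# them by summing (or averaging) before applying the weight update.
per_step_grads = [0.12, -0.05, 0.03]        # hypothetical dE/dw at each time step
grad_sum = sum(per_step_grads)              # summing variant: 0.10
grad_avg = grad_sum / len(per_step_grads)   # averaging variant: ~0.0333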
The Challenges with Vanishing Gradients
• Our motivation for using a stateful network model hinges on this idea of
capturing long-term dependencies in the input sequence.
• It seems reasonable that an RNN with a large memory bank (i.e., a
significantly sized recurrent layer) would be able to summarize these
dependencies.
• In fact, from a theoretical perspective, Kilian and Siegelmann
demonstrated in 1996 that the RNN is a universal functional
representation.
• In other words, with enough neurons and the right parameter settings, an
RNN can be used to represent any functional mapping between input and
output sequences.
Theory is promising, but not in practice
• While it is nice to know that it is possible for an RNN to
represent any arbitrary function, it is more useful to know
whether it is practical to teach the RNN a realistic functional
mapping from scratch by applying gradient descent algorithms.
• If it turns out to be impractical, we’ll be in hot water, so it will
be useful for us to be rigorous in exploring this question.
• Let’s start our investigation by considering the simplest possible
RNN with a single input neuron, a single output neuron, and a
fully connected recurrent layer with one neuron.
A single neuron, fully connected recurrent layer
(both compressed and unrolled) for the sake of
investigating gradient-based learning algorithms
Compute how hidden neuron activation changes
in response to input logit from past k time steps
• Given nonlinearity f, we can express the activation h(t) of the hidden neuron of the recurrent layer at time step t as follows, where i(t) is the incoming logit from the input neuron at time step t:

h(t) = f(w_in · i(t) + w_rec · h(t−1))

• In analyzing the components of the backpropagation gradient expressions, we can start to quantify how much “memory” is retained from past inputs.
• We start by taking the partial derivative with respect to the input k time steps in the past and applying the chain rule:

∂h(t)/∂i(t−k) = f′(w_in · i(t) + w_rec · h(t−1)) · w_rec · ∂h(t−1)/∂i(t−k)
Considering the magnitude of the
derivative
• Because we care about the magnitude of this derivative, we can take the absolute value of both sides.
• We also know that for all common nonlinearities (the tanh, logistic, and ReLU nonlinearities), the maximum value of f′ is at most 1.
• This leads to the following recursive inequality:

|∂h(t)/∂i(t−k)| ≤ |w_rec| · |∂h(t−1)/∂i(t−k)|

• We can continue to expand this inequality recursively until we reach the base case, at step t − k:

|∂h(t−k)/∂i(t−k)| ≤ |w_in|
Vanishing Gradients
• This results in the final inequality:

|∂h(t)/∂i(t−k)| ≤ |w_in| · |w_rec|^k
• This relationship places a strong upper bound on how much a change in the
input at time t – k can impact the hidden state at time t.
• Because the weights of our model are initialized to small values, the value of
this derivative approaches zero as k increases.
• In other words, the gradient quickly diminishes when it’s computed with
respect to inputs several time steps into the past, severely limiting our
model’s ability to learn long-term dependencies.
• This issue is commonly referred to as vanishing gradients, and it severely
impacts the learning capabilities of vanilla recurrent neural networks.
• In order to address this limitation, we will explore an extraordinarily
influential twist on recurrent layers known as long short-term memory.
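• As a quick illustrative calculation (the weight values here are made up), the bound |w_in| · |w_rec|^k shrinks rapidly with k:

# Illustrative only: hypothetical weight magnitudes showing how the bound decays
w_in, w_rec = 1.0, 0.5
for k in [1, 5, 10, 20]:
    print(k, w_in * abs(w_rec) ** k)   # 0.5, 0.03125, ~9.8e-4, ~9.5e-7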
Long Short-Term Memory (LSTM)
Units
• In order to combat the problem of vanishing gradients, the Long Short-Term Memory (LSTM) architecture was introduced by Sepp Hochreiter and Jürgen Schmidhuber.
• The architecture is designed to reliably transmit important information many time steps into the future.
The architecture of an LSTM unit, illustrated
at a tensor (designated by arrows) and
operation (designated by the purple blocks)
level
Tensors and Operations
• Let’s step back from the individual neuron level and talk about the network as a collection of tensors and operations on tensors.
• One of the core components is the memory cell, a tensor
represented by the bolded loop in the center of the figure.
• The memory cell holds critical information that it has learned over
time.
• The network is designed to effectively maintain useful information
in the memory cell.
• At every time step, the LSTM unit modifies the memory cell with new information in three different phases.
Keep Gate
• Using the keep gate, the LSTM unit determines how much of the previous memory state to keep.
• The memory state tensor from the previous time step is rich with information, but some of that information may be stale (and therefore might need to be erased).
• We figure out which elements in the memory state tensor are still
relevant and which elements are irrelevant by computing a bit
tensor (zeros and ones) that we multiply with the previous state.
Keep Gate Cont.
• If a particular location in the bit tensor holds a 1, it means that location in
the memory cell is still relevant and ought to be kept.
• If that particular location holds a 0, it means that the location in the memory cell is no longer relevant and ought to be erased.
• We approximate this bit tensor by concatenating the input of this time step
and the LSTM unit’s output from the previous time step and applying a
sigmoid layer to the resulting tensor.
• A sigmoidal neuron, as you may recall, outputs a value that is either very
close to 0 or very close to 1 most of the time (the only exception is when
the input is close to zero).
• As a result, the output of the sigmoidal layer is a close approximation of a
bit tensor, and we can use this to complete the keep gate.
Architecture of the keep gate of an
LSTM unit
Write Gate
• The write gate is broken down into two major components.
• Component 1 figures out what information to write into the state. This is computed by a tanh layer that creates an intermediate tensor.
• Component 2 figures out which components of the computed tensor to include in the new state and which to toss before writing.
• We do this by approximating a bit vector of 0’s and 1’s using the
same strategy (a sigmoidal layer) as we used in the keep gate.
• We multiply the bit vector with our intermediate tensor and then
add the result to create the new state vector for the LSTM.
Architecture of the write gate of an
LSTM unit
Output Gate
• Finally, at every time step, the LSTM unit provides an output.
• While we could treat the state vector as the output directly, the LSTM unit is engineered to provide more flexibility by emitting a separate output tensor.
• For the output gate, we use a structure nearly identical to the write gate:
1. the tanh layer creates an intermediate tensor from the state
vector
2. the sigmoid layer produces a bit tensor mask using the current
input and previous output
3. the intermediate tensor is multiplied with the bit tensor to
produce the final output.
Architecture of the output gate of an
LSTM unit
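• As a minimal sketch of the three phases described above (keep, write, and output gates), here is one step of an LSTM cell written in NumPy. The weight names and shapes are assumptions for illustration, not the exact formulation used later in the TensorFlow code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev,
              W_keep, W_write, W_input, W_output,
              b_keep, b_write, b_input, b_output):
    # Illustrative sketch; parameter names and shapes are assumed.
    z = np.concatenate([h_prev, x_t])            # current input + previous output
    keep = sigmoid(W_keep @ z + b_keep)          # keep gate: which memory entries to retain
    write = sigmoid(W_write @ z + b_write)       # write gate mask: which candidates to admit
    candidate = np.tanh(W_input @ z + b_input)   # intermediate tensor of candidate values
    c_t = keep * c_prev + write * candidate      # update the memory cell
    out_mask = sigmoid(W_output @ z + b_output)  # output gate mask
    h_t = out_mask * np.tanh(c_t)                # tanh of the new state, masked
    return h_t, c_t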
So why is this better than using a raw RNN unit?
• The key observation is how information propagates through the network when we unroll the LSTM unit through time.
• At the very top, we can observe the propagation of the state
vector, whose interactions are primarily linear through time.
• The result is that the gradient that relates an input several time
steps in the past to the current output does not attenuate as
dramatically as in the vanilla RNN architecture.
• This means that the LSTM can learn long-term relationships much
more effectively than our original formulation of the RNN.
Unrolling an LSTM unit through time
RNN Vs LSTM
• How easy is it to generate arbitrary architectures with LSTM units?
• Do we need to sacrifice any flexibility to use LSTM units instead of
a vanilla RNN?
• Just as we can stack RNN layers to create more expressive models
with more capacity, we can similarly stack LSTM units, where the
input of the second unit is the output of the first unit, the input of
the third unit is the output of the second, and so on.
• This means that anywhere we use a vanilla RNN layer, we can easily
substitute an LSTM unit.
• Now we have overcome the issue of vanishing gradients and
understand the inner workings of LSTM units.
Composing LSTM units as one might stack recurrent
layers in a neural network
TensorFlow Primitives for RNN Models
• TensorFlow provides several primitives in order to build RNN models.
• RNNCell objects represent either an RNN layer or an LSTM unit:
cell_1 = tf.nn.rnn_cell.BasicRNNCell(num_units, input_size=None, activation=tanh)
cell_2 = tf.nn.rnn_cell.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None,
state_is_tuple=True, activation=tanh)
cell_3 = tf.nn.rnn_cell.LSTMCell(num_units, input_size=None, use_peepholes=False,
cell_clip=None, initializer=None, num_proj=None, proj_clip=None,
num_unit_shards=1, num_proj_shards=1, forget_bias=1.0, state_is_tuple=True,
activation=tanh)
cell_4 = tf.nn.rnn_cell.GRUCell(num_units, input_size=None, activation=tanh)
TensorFlow Primitives
• The BasicRNNCell abstraction represents a vanilla RNN layer.
• The BasicLSTMCell represents a simple LSTM unit
• The LSTMCell implements more configuration options (peephole
structures, clipping of state values, etc.).
• The TensorFlow library also includes a variation of the LSTM unit
known as the Gated Recurrent Unit.
• The critical initialization variable for all of these cells is the size of
the hidden state vector or num_units.
TensorFlow Wrappers
• There are several wrappers to add to our arsenal.
• If we want to stack recurrent units or layers, we can use:
cell_1 = tf.nn.rnn_cell.BasicLSTMCell(10)
cell_2 = tf.nn.rnn_cell.BasicLSTMCell(10)
full_cell = tf.nn.rnn_cell.MultiRNNCell([cell_1, cell_2])
• We can also use a wrapper to apply dropout to the inputs and
outputs of an LSTM with specified input and output keep
probabilities:
tf.nn.rnn_cell.DropoutWrapper(cell_1, input_keep_prob=1.0,
output_keep_prob=1.0, seed=None)
Completing the RNN
• Finally, we complete the RNN by wrapping everything into the
appropriate TensorFlow RNN primitive:
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=None,
initial_state=None, dtype=None, parallel_iterations=None,
swap_memory=False, time_major=False, scope=None)
• The cell is the RNNCell object.
• If time_major == False (which is the default setting), inputs must be
a tensor of the shape [batch_size, max_time, ...].
• If time_major==True, inputs have the shape [max_time, batch_size, ...].
• Refer TensorFlow documentation for other parameters.
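• A minimal usage sketch (shapes and names chosen for illustration), assuming batch-major inputs of shape [batch_size, max_time, input_dim]:

cell = tf.nn.rnn_cell.BasicLSTMCell(128)
inputs = tf.placeholder(tf.float32, [None, 500, 512])   # [batch_size, max_time, input_dim]
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
# outputs has shape [batch_size, max_time, 128]; state holds the final LSTM state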
Implementing a Sentiment Analysis
Model
• We attempt to analyze the sentiment of movie reviews taken from
the Large Movie Review Dataset.
• This dataset consists of 50,000 reviews from IMDB, each of which is labeled as having positive or negative sentiment.
• We use a simple LSTM model leveraging dropout.
• The LSTM model will consume the review one word at a time.
• Once it has consumed the entire review, we’ll use its output as the
basis of a binary classification to map the sentiment to be
“positive” or “negative.”
Loading the dataset
• To load the dataset, we’ll utilize the helper library tflearn.
• Once we install the package, we download the dataset, prune the vocabulary to include the 30,000 most common words, pad each input sequence up to a length of 500 words, and process the labels:
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb
train, test, _ = imdb.load_data(path='imdb.pkl', n_words=30000, valid_portion=0.1)
trainX, trainY = train
testX, testY = test
trainX = pad_sequences(trainX, maxlen=500, value=0.)
testX = pad_sequences(testX, maxlen=500, value=0.)
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
Data Preparation
• The inputs are now 500-dimensional vectors.
• Each vector corresponds to a movie review where the ith component
of the vector corresponds to the index of the ith word of the review
in our global dictionary of 30,000 words.
• To complete the data preparation, we create a Python class designed
to serve minibatches of a desired size from the underlying dataset.
• We use the IMDBDataset Python class to serve both the training and
validation sets we’ll use while training our sentiment analysis model.
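• The class itself isn’t shown on these slides; a minimal sketch of what such an IMDBDataset minibatch server might look like (details assumed) is:

import numpy as np

class IMDBDataset:
    # Hypothetical sketch of the minibatch server described above
    def __init__(self, X, Y):
        self.X = np.array(X)
        self.Y = np.array(Y)
        self.ptr = 0

    def minibatch(self, size):
        # serve the next `size` examples, reshuffling once the data is exhausted
        if self.ptr + size > len(self.X):
            perm = np.random.permutation(len(self.X))
            self.X, self.Y = self.X[perm], self.Y[perm]
            self.ptr = 0
        batch = (self.X[self.ptr:self.ptr + size], self.Y[self.ptr:self.ptr + size])
        self.ptr += size
        return batch

train_set = IMDBDataset(trainX, trainY)
val_set = IMDBDataset(testX, testY)   # used as the validation set in this example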
Constructing the Sentiment Analysis
model
• First, let’s map each word in the input review to a word vector.
• We’ll utilize an embedding layer, which is a simple lookup table
that stores an embedding vector that corresponds to each word.
• Unlike in previous examples, where we treated the learning of the
word embeddings as a separate problem, we’ll learn the word
embeddings jointly with the sentiment analysis problem with the
embedding matrix as a matrix of parameters in the full problem.
• We accomplish this by using the TensorFlow primitives for
managing embeddings.
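• A minimal sketch of such an embedding_layer helper (the name matches the inference code shown later; the initializer choice is an assumption), built on tf.nn.embedding_lookup:

def embedding_layer(input, weight_shape):
    # weight_shape = [vocabulary_size, embedding_dim]
    weight_init = tf.random_normal_initializer(stddev=(1.0 / weight_shape[0]) ** 0.5)
    E = tf.get_variable("E", weight_shape, initializer=weight_init)
    # look up the embedding vector for every word index in the batch
    return tf.nn.embedding_lookup(E, input)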
Building an LSTM
• We then take the result of the embedding layer and build an LSTM
with dropout.
• We do some extra work to pull out the last output emitted by the LSTM using the tf.slice and tf.squeeze operators, which find the exact slice that contains the last output of the LSTM and then eliminate the unnecessary dimension.
• The implementation of the LSTM can be achieved as follows:
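• The code itself didn’t make it onto the slide; a sketch consistent with the description above (using BasicLSTMCell, DropoutWrapper, dynamic_rnn, tf.slice, and tf.squeeze — exact details assumed) is:

def lstm(input, hidden_dim, keep_prob, phase_train):
    # phase_train is assumed to be a boolean placeholder; disable dropout at
    # evaluation time by switching the keep probability to 1.0
    keep = tf.cond(phase_train,
                   lambda: tf.constant(keep_prob),
                   lambda: tf.constant(1.0))
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_dim)
    dropout_cell = tf.nn.rnn_cell.DropoutWrapper(cell,
                                                 input_keep_prob=keep,
                                                 output_keep_prob=keep)
    # lstm_outputs: [batch_size, max_time, hidden_dim]
    lstm_outputs, state = tf.nn.dynamic_rnn(dropout_cell, input, dtype=tf.float32)
    # pull out the last output emitted by the LSTM
    last_index = tf.shape(lstm_outputs)[1] - 1
    last = tf.slice(lstm_outputs, tf.stack([0, last_index, 0]), [-1, 1, -1])
    return tf.squeeze(last, [1])   # [batch_size, hidden_dim]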
Inference Graph
• We top it all off using a batch-normalized hidden layer.
• Stringing all of these components together, we can build the
inference graph.
def inference(input, phase_train):
    embedding = embedding_layer(input, [30000, 512])
    lstm_output = lstm(embedding, 512, 0.5, phase_train)
    output = layer(lstm_output, [512, 2], [2], phase_train)
    return output
• We omit the other boilerplate involved in setting up summary
statistics, because it’s identical to the other models we have built.
• We then run and visualize the performance of our model using
TensorBoard.
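• For completeness, a hypothetical sketch of the layer() helper referenced above (a fully connected layer followed by batch normalization; the real helper from earlier chapters may differ):

def layer(input, weight_shape, bias_shape, phase_train):
    W = tf.get_variable("W", weight_shape,
                        initializer=tf.random_normal_initializer(
                            stddev=(1.0 / weight_shape[0]) ** 0.5))
    b = tf.get_variable("b", bias_shape, initializer=tf.constant_initializer(0.0))
    logits = tf.matmul(input, W) + b
    # simplified batch normalization using the current batch statistics;
    # a full version would track moving averages and use them when
    # phase_train is False
    mean, var = tf.nn.moments(logits, axes=[0])
    return tf.nn.batch_normalization(logits, mean, var,
                                     offset=None, scale=None,
                                     variance_epsilon=1e-5)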
Performance
• At the beginning of training, the model struggles slightly with
stability, and toward the end of the training, the model clearly
starts to overfit as training cost and validation cost significantly
diverge.
• At its optimal performance, however, the model performs rather
effectively and generalizes to approximately 86% accuracy on the
test set.
• Congratulations! You’ve built your first recurrent neural network.
Solving seq2seq Tasks with
Recurrent Neural Networks
• We started this chapter with an example of seq2seq task: mapping
a sequence of words in a sentence to a sequence of POS tags.
• Tackling this problem was tractable because we didn’t need to take
into account long-term dependencies to generate the appropriate
tags.
• But there are several seq2seq problems, such as translating
between languages or creating a summary for a video, where long-
term dependencies are crucial to the success of the model.
• This is where the RNN comes in.
RNN approach to seq2seq
• The RNN approach to seq2seq looks a lot like the autoencoder.
• The seq2seq model is composed of two networks: encoder and decoder.
• The encoder network is a recurrent network (usually one that uses LSTM
units) that consumes the entire input sequence.
• The goal of the encoder network is to generate a condensed understanding of
the input and summarize it into a singular thought represented by the final
state of the encoder network.
• Decoder network’s starting state is initialized with the final state of the
encoder network, to produce the target output sequence token by token.
• At each step, the decoder network consumes its own output from the
previous time step as the current time step’s input.
encoder/decoder recurrent network for seq2seq
• We tokenize the input sentence and use an embedding, one word at a
time as an input to the encoder network.
• At the end of the sentence, we use a special <EOS> token.
• Then we take the hidden state of the encoder network and use that as
the initialization of the decoder network.
• The first input to the decoder network is the EOS token, and the output
is interpreted as the first word of the predicted translation.
• From that point onward, we use the output of the decoder network as
the input to itself at the next time step.
• We continue until the decoder network emits the EOS token as its output.
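• As pseudocode, the decoding loop described above (greedy, token by token; encoder_step and decoder_step are hypothetical helpers, not TensorFlow APIs) looks roughly like:

def translate(input_tokens, encoder_step, decoder_step, EOS, max_len=100):
    # Hypothetical helpers: encoder_step/decoder_step each consume one token
    # and the previous hidden state, returning the new state (and, for the
    # decoder, an output token).
    state = None
    for token in input_tokens + [EOS]:    # feed the source sentence, then <EOS>
        state = encoder_step(token, state)

    output, prev = [], EOS                # decoder starts from the <EOS> token
    for _ in range(max_len):
        word, state = decoder_step(prev, state)
        if word == EOS:                   # stop once the decoder emits <EOS>
            break
        output.append(word)
        prev = word                       # previous output becomes the next input
    return output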
Skip-thought vector
• The seq2seq RNN architecture can also be reappropriated for
the purpose of learning good embeddings of sequences.
• The skip-thought vector borrowed architectural characteristics from both the autoencoder framework and the Skip-Gram model.
• The skip-thought vector was generated by dividing up a passage
into a set of triplets consisting of consecutive sentences.
• The authors utilized a single encoder network and two decoder
networks.
The skip-thought seq2seq architecture to generate
embedding representations of entire sentences
Skip-thought seq2seq architecture
• The encoder network consumed the sentence for which we
wanted to generate a condensed representation (which was
stored in the final hidden state of the encoder network).
• Then came the decoding step.
• The first decoder network would take that representation as the
initialization of its own hidden state and attempt to reconstruct
the sentence that appeared prior to the input sentence.
• The second decoder network would instead attempt the
sentence that appeared immediately after the input sentence.
• The full system was trained end to end on these triplets.
Example story generation
• Here’s an example of story generation, excerpted from the original paper:
she grabbed my hand .
"come on . "
she fluttered her back in the air .
"i think we're at your place . I ca n't come get you . "
he locked himself back up
" no . she will . "
kyrian shook his head

• Now that we’ve developed an understanding of how to leverage recurrent neural networks to tackle seq2seq problems, we’re almost ready to try to build our own.
• Before we get there, however, we’ve got one more major challenge to tackle:
attentions in seq2seq RNNs.
Things helpful when trying to complete a translation

• First it’s helpful to read the full sentence to understand the concept
you would like to convey.
• Then you write out the translation one word at a time, each word
following logically from the word you wrote previously.
• But one important aspect of translation is that as you compose the
new sentence, you often refer back to the original text, focusing on
specific parts that are relevant to your current translation.
• At each step, you are paying attention to the most relevant parts of
the original “input” so you can make the best decision about the
next word to put on the page.
Augmenting Recurrent Networks with Attention
• Let’s think back to our approach to seq2seq.
• By consuming the full input and summarizing it into a “thought” inside its hidden state, the encoder network achieves the first part of the translation process.
• By using the previous output as its current input, the decoder network
achieves the second part of the translation process.
• This phenomenon of attention has yet to be captured by our seq2seq model.
• One way to give the decoder network some vision into the original sentence is
by giving the decoder access to all of the outputs from the encoder network.
• These outputs are interesting to us because they represent how the encoder
network’s internal state evolves after seeing each new token.
An attempt at engineering attentional abilities in a
seq2seq architecture.
A critical Flaw of the approach
• The problem here is that at every time step, the decoder considers
all of the outputs of the encoder network in the exact same way.
• However, this is clearly not the case for a human during the
translation process.
• We focus on different aspects of the original text when working on
different parts of the translation.
• The key realization here is that it’s not enough to merely give the
decoder access to all the outputs.
• Instead, we must engineer a mechanism by which the decoder
network can dynamically pay attention to a specific subset of the
encoder’s outputs.
Fixing the Flaw of the Approach
• We can fix this problem by changing the inputs to the
concatenation operation.
• Instead of directly using the raw outputs from the encoder network,
we perform a weighting operation on the encoder’s outputs.
• We leverage the decoder network’s state at time t – 1 as the basis
for the weighting operation.
• A modification to our original proposal that enables a dynamic attentional mechanism based on the hidden state of the decoder network in the previous time step is shown in the figure.
Weighting operation
• First we create a scalar (a single number) relevance score for each
of the encoder’s outputs by computing the dot product between
each encoder output and the decoder’s state at time t - 1.
• We then normalize these scores using a softmax operation.
• Finally, we use these normalized scores to scale the encoder’s
outputs before plugging them into the concatenation operation.
• The relative scores computed for each encoder output signify how
important that encoder output is to the decision for the decoder at
time step t.
• We can visualize which parts of the input are most relevant to the
translation at each time step using the output of the softmax!
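• A minimal NumPy sketch of this weighting operation (names and shapes assumed for illustration):

import numpy as np

def attention_context(encoder_outputs, decoder_state):
    # encoder_outputs: [src_len, hidden_dim]; decoder_state: [hidden_dim] at time t - 1
    scores = encoder_outputs @ decoder_state   # dot-product relevance score per encoder output
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax normalization
    # scale each encoder output by its weight before the concatenation step
    return weights[:, None] * encoder_outputs, weights

• The returned weights are the normalized scores mentioned above and can be visualized to see which parts of the input the decoder attends to at each time step.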
Application of Attentions
• Attentions are incredibly applicable in problems that extend beyond
language translation.
• Attentions can be important in speech-to-text problems, where the
algorithm learns to dynamically pay attention to corresponding parts of the
audio while transcribing the audio into text.
• Similarly, attentions can be used to improve image captioning algorithms by
helping the captioning algorithm focus on specific parts of the input image
while writing out the caption.
• Anytime there are particular parts of the input that are highly correlated to
correctly producing corresponding segments of the output, attentions can
dramatically improve performance.
Thank You
