06-DL-Deep Learning For Text Data (LSTM Seq2Seq Models)
• Observe that in the RNN case we are now more interested in the next state, h_t, not exactly in the output, y_t (a minimal step is sketched below).
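As a minimal sketch in numpy (the weight names W_xh, W_hh, W_hy are illustrative, not from any specific library), one vanilla RNN step looks like this; note how h_t is what gets carried to the next time step, while y_t is just a readout:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN step: the new state h_t is carried forward in time,
    while the output y_t is only a readout of that state."""
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)   # next hidden state (what gets reused)
    y_t = h_t @ W_hy + b_y                            # per-step output (optional)
    return h_t, y_t
```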
The Problem of Long-Term Dependencies
• Sometimes, we only need to look at recent information to perform the present task. For example,
consider a language model trying to predict the next word based on the previous ones. If we are
trying to predict the last word in “the clouds are in the sky,” we don’t need any further context –
it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the
relevant information and the place that it’s needed is small, RNNs can learn to use the past
information.
The Problem of Long-Term Dependencies
• But there are also cases where we need more context. Consider trying to predict the last word in
the text “I grew up in France… I speak fluent French.” Recent information suggests that the next
word is probably the name of a language, but if we want to narrow down which language, we
need the context of France, from further back. It’s entirely possible for the gap between the
relevant information and the point where it is needed to become very large.
• Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
Long Short Term Memory networks (LSTM)
• LSTM provides a different recurrent formula f_W that is more powerful than the vanilla RNN's, because its more complex f_W adds "residual information" to the next state instead of just transforming each state. You can think of LSTMs as the "residual" version of RNNs (a minimal step is sketched below).
• In other words, LSTMs suffer much less from vanishing gradients than normal RNNs. Remember that the plus gates distribute the gradients.
• By suffering less from vanishing gradients, LSTMs can remember much further into the past. So from now on, just use an LSTM whenever you think about an RNN.
• Put differently, LSTMs are better at remembering long-term dependencies.
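To make the "residual information" idea concrete, here is a minimal numpy sketch of one LSTM step (the gate layout and weight names are illustrative, not a specific library's API). The key line is the additive cell-state update, the "plus gate" that lets gradients flow back through many time steps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the 4 gate pre-activations.
    The cell-state update is an elementwise *addition*, which is what
    lets gradients flow further back in time."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    H = h_prev.shape[0]
    f = sigmoid(z[0*H:1*H])        # forget gate
    i = sigmoid(z[1*H:2*H])        # input gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate ("new information")
    c_t = f * c_prev + i * g       # additive "residual" cell-state update
    h_t = o * np.tanh(c_t)         # hidden state passed to the next step
    return h_t, c_t
```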
Long Short Term Memory networks (LSTM)
• The vanishing gradient problem can be mitigated with LSTMs, but another problem that can happen with every recurrent neural network is the exploding gradient problem.
• To fix the exploding gradient problem, people normally apply gradient clipping, which caps the gradients at a maximum value or norm (see the sketch after this list).
• This highway for the gradients is called the cell state. So, one difference compared to the RNN, which has only the hidden state flowing through time, is that in the LSTM we have both the hidden states and the cell state.
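A minimal sketch of gradient clipping by global norm, in plain numpy (the threshold of 5.0 is just an illustrative choice; most frameworks offer an equivalent built-in option):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their combined L2 norm
    does not exceed max_norm (5.0 here is purely illustrative)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads
```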
Long Short Term Memory networks (LSTM)
• In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.
Long Short Term Memory networks (LSTM)
• The key to LSTMs is the cell state, the horizontal line running through the top of
the diagram.
• The cell state is kind of like a conveyor belt. It runs straight down the entire chain,
with only some minor linear interactions. It’s very easy for information to just
flow along it unchanged.
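One way to see why this "conveyor belt" helps the gradients: along the direct cell-state path, the derivative of the new cell state with respect to the old one is just an elementwise product with the forget gate, with no repeated matrix multiplication or tanh squashing. A sketch of that step, using the standard cell-state update:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\qquad\Longrightarrow\qquad
\left.\frac{\partial c_t}{\partial c_{t-1}}\right|_{\text{direct path}} = \operatorname{diag}(f_t)
```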
Long Short Term Memory networks (LSTM)
• LSTM Gate
• Zooming in on an LSTM gate. This view also makes it clearer how backpropagation flows through it.
Variants on Long Short Term Memory
• In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The
differences are minor, but it’s worth mentioning some of them.
• One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole
connections.” This means that we let the gate layers look at the cell state.
• The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
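As a sketch, one common form of the peephole gates (following the write-up this section is based on) lets the forget and input gates see the previous cell state and the output gate see the new one:

```latex
f_t = \sigma\big(W_f\,[c_{t-1},\,h_{t-1},\,x_t] + b_f\big),\quad
i_t = \sigma\big(W_i\,[c_{t-1},\,h_{t-1},\,x_t] + b_i\big),\quad
o_t = \sigma\big(W_o\,[c_{t},\,h_{t-1},\,x_t] + b_o\big)
```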
Variants on Long Short Term Memory
• Another variation is to use coupled forget and input gates. Instead of separately
deciding what to forget and what we should add new information to, we make
those decisions together. We only forget when we’re going to input something in
its place. We only input new values to the state when we forget something older.
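Written as an equation, the coupled variant described above ties the two decisions through a single forget gate:

```latex
c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t
```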
GRU (Gated Recurrent Unit)
• The GRU cell can be considered a variant of the LSTM cell (it also wants to fight vanishing gradients), but it is more computationally efficient. In this cell the forget and input gates are merged into a single update gate (a minimal step is sketched below).
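A minimal numpy sketch of one GRU step under one common convention (the weight names are illustrative); note that there is no separate cell state, and a single update gate plays the role of the coupled forget/input gates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: the update gate z blends the old state with the
    candidate state, so forgetting and inputting are a single decision."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(hx @ W_z + b_z)                                       # update gate
    r = sigmoid(hx @ W_r + b_r)                                       # reset gate
    h_tilde = np.tanh(np.concatenate([r * h_prev, x_t]) @ W_h + b_h)  # candidate state
    h_t = (1 - z) * h_prev + z * h_tilde                              # blend old and new
    return h_t
```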
Bidirectional RNN
• Bidirectional recurrent neural networks (RNN) are really just putting two independent RNNs together. The input sequence is fed in normal time order for one network, and in reverse time order for the other. The outputs of the two networks are usually concatenated at each time step, though there are other options, e.g. summation (see the sketch after this list).
• This structure allows the networks to have both backward and forward information about the sequence at every time step. The concept seems easy enough, but when it comes to actually implementing a neural network which utilizes the bidirectional structure, confusion arises…
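As a sketch, assuming TensorFlow/Keras as the framework (the course may use a different one), the Bidirectional wrapper implements exactly this: one copy of the layer reads the sequence forward, the other backward, and merge_mode selects concatenation, summation, etc.:

```python
import tensorflow as tf

# Wrap an LSTM so one copy reads the sequence forward and another reads it
# backward; merge_mode='concat' joins the two outputs at each time step
# (merge_mode='sum' is the summation option mentioned above).
bi_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True),
    merge_mode='concat',
)

x = tf.random.normal([8, 20, 32])   # (batch, time steps, features) -- illustrative shapes
y = bi_lstm(x)
print(y.shape)                      # (8, 20, 128): forward 64 + backward 64 concatenated
```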
RNN and LSTM application
Processing text data