RNN 2
Recurrent Neural Networks (RNN)
• RNNs have a “memory” which retains information about what has been calculated so far.
• An RNN uses the same parameters for every input, since it performs the same task on all inputs and hidden states to produce the output.
• This parameter sharing reduces the number of parameters, unlike other neural networks.
Training through RNN
• A single time step of the input is provided to the network.
• Its current state is then calculated from the current input and the previous state.
• The current state ht becomes ht-1 for the next time step.
• One can go through as many time steps as the problem requires, combining the information from all the previous states.
• Once all the time steps are completed, the final current state is used to calculate the output.
• The output is then compared to the actual output, i.e. the target output, and the error is computed.
• The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained (a minimal sketch of the forward pass follows this list).
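A minimal NumPy sketch of this forward pass, using the weight names Wx, Ws, Wy that appear later in these slides (all sizes and the random data below are illustrative assumptions, not values from the slides):

```python
import numpy as np

# Hypothetical sizes: input_dim = 4, hidden_dim = 8, output_dim = 3, 5 time steps.
rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(8, 4))      # input  -> hidden weights
Ws = rng.normal(scale=0.1, size=(8, 8))      # hidden -> hidden weights
Wy = rng.normal(scale=0.1, size=(3, 8))      # hidden -> output weights

xs = [rng.normal(size=4) for _ in range(5)]  # one input per time step
s = np.zeros(8)                              # initial state

for x in xs:
    s = np.tanh(Wx @ x + Ws @ s)             # current state from current input + previous state

y = Wy @ s                                   # output computed from the final state
d = rng.normal(size=3)                       # target (desired) output
error = np.sum((d - y) ** 2)                 # squared error, to be back-propagated
```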
RNN
• Although the basic Recurrent Neural Network is fairly
effective, it can suffer from a significant problem.
• For deep networks, the back-propagation process can lead to the following issues:
– Vanishing Gradients: This occurs when the gradients become very
small and tend towards zero.
– Exploding Gradients: This occurs when the gradients become too large
due to back-propagation.
RNN
• Recurrent Neural Networks are those networks that deal with
sequential data.
• They predict outputs using not only the current inputs but also those that occurred before them.
• In other words, the current output depends on the current input as well as a memory element (which takes into account the past inputs).
• For training such networks, we use good old backpropagation but with a slight twist: we don’t train the system independently at a specific time “t”.
• We train it at a specific time “t” together with all that has happened before time “t”, such as t-1, t-2, t-3.
RNN
Training RNN
• S1, S2, S3 are the hidden states or memory units at time t1,
t2, t3 respectively, and Ws is the weight matrix associated
with it.
• X1, X2, X3 are the inputs at time t1, t2, t3 respectively,
and Wx is the weight matrix associated with it.
• Y1, Y2, Y3 are the outputs at time t1, t2, t3 respectively,
and Wy is the weight matrix associated with it.
For any time t, we have the following two equations:
St = tanh(Wx·Xt + Ws·St-1)
Yt = Wy·St
so at t = 3,
Y3 = Wy·S3 = Wy·tanh(Wx·X3 + Ws·S2)
E3 = (d3 − Y3)²
*We are using the squared error here, where d3 is the desired output at time t = 3.
To perform back-propagation, we have to adjust the weights associated with the inputs, the memory units and the outputs (the corresponding gradient expressions are reconstructed below).
Adjusting Wy
Adjusting Ws
Adjusting Wx
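The gradient expressions behind these three headings were figures on the original slides; a reconstruction of the standard BPTT chain-rule terms for the error at t = 3, using the symbols defined above, is:

```latex
\frac{\partial E_3}{\partial W_y}
  = \frac{\partial E_3}{\partial Y_3}\,\frac{\partial Y_3}{\partial W_y},
\qquad
\frac{\partial E_3}{\partial W_s}
  = \sum_{k=1}^{3}\frac{\partial E_3}{\partial Y_3}\,
    \frac{\partial Y_3}{\partial S_3}\,
    \frac{\partial S_3}{\partial S_k}\,
    \frac{\partial S_k}{\partial W_s},
\qquad
\frac{\partial E_3}{\partial W_x}
  = \sum_{k=1}^{3}\frac{\partial E_3}{\partial Y_3}\,
    \frac{\partial Y_3}{\partial S_3}\,
    \frac{\partial S_3}{\partial S_k}\,
    \frac{\partial S_k}{\partial W_x}
```

The sums over k appear because Ws and Wx are shared across all time steps, so every earlier state contributes to the gradient.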
Limitations:
• This method of Back Propagation through time (BPTT) can be
used up to a limited number of time steps like 8 or 10.
• If we back propagate further, the gradient becomes too small.
• This problem is called the “Vanishing gradient” problem.
• The problem is that the contribution of information decays
geometrically over time.
• So, if the number of time steps is greater than (say) 10, that information will effectively be discarded (a toy illustration follows this list).
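A toy NumPy illustration of this geometric decay (the recurrent weight 0.9 and the pre-activation values are assumptions chosen only to show the effect):

```python
import numpy as np

# Each step back in time multiplies the gradient by a factor tanh'(s) * w.
# With |factor| < 1 the product shrinks geometrically: the vanishing gradient.
w = 0.9
pre_activations = np.linspace(-1.0, 1.0, 20)           # hypothetical values
factors = (1.0 - np.tanh(pre_activations) ** 2) * w    # per-step d s_t / d s_{t-1}

for k in (2, 5, 10, 20):
    print(f"{k:2d} steps back: contribution ~ {np.prod(factors[:k]):.2e}")
```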
LSTM
• Long Short Term Memory networks – usually just called
“LSTMs” – are a special kind of RNN, capable of learning long-
term dependencies.
• The concept was introduced by Hochreiter & Schmidhuber (1997) and was refined and popularized by many people in subsequent work.
• They work tremendously well on a large variety of problems,
and are now widely used.
• LSTMs are explicitly designed to avoid the long-term
dependency problem. Remembering information for long
periods of time is practically their default behavior, not
something they struggle to learn!
RNN
• All recurrent neural networks have the form of a chain of
repeating modules of neural network.
• In standard RNNs, this repeating module will have a very
simple structure, such as a single tanh layer.
RNN
LSTM
• LSTMs also have this chain-like structure, but the repeating module has a different structure: instead of a single neural network layer, there are four, interacting in a very special way.
LSTM
• An LSTM has a similar control flow to a recurrent neural network.
• It processes data sequentially, passing on information as it propagates forward. The difference lies in the operations within the LSTM’s cells.
The Core Idea Behind LSTMs
• The core concepts of an LSTM are the cell state and its various gates.
• The cell state acts as a transport highway that transfers relevant information all the way down the sequence chain.
• You can think of it as the “memory” of the network.
• The cell state, in theory, can carry relevant information
throughout the processing of the sequence.
• So even information from the earlier time steps can make its way to later time steps, reducing the effects of short-term memory.
The Core Idea Behind LSTMs
• As the cell state goes on its journey, information gets added to or removed from it via gates.
• The gates are different neural networks that decide which
information is allowed on the cell state.
• The gates can learn what information is relevant to keep or
forget during training.
LSTM
• Three different gates regulate information flow in an LSTM cell:
• a forget gate, an input gate, and an output gate.
• Concept of cell state
Forget gate
Input gate
Cell State
Output Gate
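The four headings above were diagrams on the slides; a reconstruction of the commonly used LSTM update equations, one per heading (σ is the logistic sigmoid, ∗ the element-wise product, and [h_{t−1}, x_t] the previous hidden state concatenated with the current input):

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)        &&\text{(forget gate)}\\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)        &&\text{(input gate)}\\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) &&\text{(candidate cell state)}\\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t                        &&\text{(new cell state)}\\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)        &&\text{(output gate)}\\
h_t &= o_t * \tanh(C_t)                                         &&\text{(new hidden state)}
\end{aligned}
```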
GRU (Gated Recurrent Unit)
• Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem which comes with a standard recurrent neural network.
• GRU can also be considered as a variation on the LSTM
because both are designed similarly and, in some cases,
produce equally excellent results.
GRU
• GRUs are an improved version of the standard recurrent neural network.
• To solve the vanishing gradient problem of a standard RNN, a GRU uses an update gate and a reset gate.
• These are two vectors which decide what information should be passed to the output.
• The special thing about them is that they can be trained to keep information from long ago, without washing it out through time, or to remove information which is irrelevant to the prediction.
GRU
• GRUs get rid of the cell state and use the hidden state to transfer information.
• A GRU also has only two gates, a reset gate and an update gate.
LSTM vs GRU
GRU
Update Gate:
– The update gate acts similarly to the forget and input gates of an LSTM.
– It decides what information to throw away and what new
information to add.
Reset Gate:
– The reset gate is another gate that is used to decide how much past information to forget.
GRU
• GRUs have fewer tensor operations; therefore, they are a little faster to train than LSTMs.
• There isn’t a clear winner as to which one is better.
• Researchers and engineers usually try both to determine which one works better for their use case.
GRU
Update gate
Reset gate
Current memory content
Final memory at current time step
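The equations behind these four headings were figures; one common convention for the GRU update (the one in which z_t weights how much of the previous hidden state is kept) can be reconstructed as:

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right)                 &&\text{(update gate)}\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right)                 &&\text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W x_t + r_t \odot U h_{t-1}\right)    &&\text{(current memory content)}\\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t          &&\text{(final memory at time } t\text{)}
\end{aligned}
```

Note that some formulations swap the roles of z_t and (1 − z_t); the gating idea is the same.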
Bi-Directional LSTM
Bidirectional LSTMs
• Bidirectional LSTMs are an extension of typical LSTMs that can enhance model performance on sequence classification problems.
• Where all time steps of the input sequence are available, Bi-LSTMs train two LSTMs instead of one LSTM on the input sequence.
• The first is trained on the input sequence as-is and the second on a reversed copy of the input sequence.
• This provides additional context to the network and can result in faster and even fuller learning on the problem.
Bidirectional LSTMs
• The idea behind Bidirectional Recurrent Neural Networks (RNNs) is very straightforward.
• It involves replicating the first recurrent layer in the network, then providing the input sequence as-is to the first layer and a reversed copy of the input sequence to the replicated layer.
• This overcomes the limitations of a traditional RNN.
• A bidirectional recurrent neural network (BRNN) can be trained using all available input information in the past and future of a particular time step.
• The state neurons of a regular RNN are split into a part responsible for the forward states (positive time direction) and a part for the backward states (negative time direction); a minimal sketch follows this list.
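A minimal PyTorch sketch of a bidirectional LSTM for sequence classification (the class name and all layer sizes are illustrative assumptions; `bidirectional=True` makes the layer process the sequence as-is and a reversed copy, concatenating the two hidden states):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, input_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        # Two LSTMs under the hood: one forward in time, one backward.
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # hidden_dim * 2 because forward and backward states are concatenated.
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):                  # x: (batch, seq_len, input_dim)
        out, _ = self.lstm(x)              # out: (batch, seq_len, 2 * hidden_dim)
        return self.fc(out[:, -1, :])      # classify from the last time step

model = BiLSTMClassifier()
logits = model(torch.randn(4, 10, 16))     # a batch of 4 sequences of length 10
```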
Bidirectional LSTMs
Attention in Deep Learning
Attention
• In psychology, attention is the cognitive process of selectively
concentrating on one or a few things while ignoring others.
• Since the context vector in an attention model has access to the entire input sequence, we don’t need to worry about forgetting.
Attention Mechanism
• The alignment between the source and target is learned and
controlled by the context vector.
• Soft Attention: the alignment weights are learned and placed “softly” over
all patches in the source image; essentially the same type of attention as
in Bahdanau et al., 2015.
– Pro: the model is smooth and differentiable.
– Con: expensive when the source input is large.
• Hard Attention: only selects one patch of the image to attend to at a time.
– Pro: less calculation at inference time.
– Con: the model is non-differentiable and requires more complicated techniques
such as variance reduction or reinforcement learning to train. (Luong, et al.,
2015)
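A small NumPy sketch of soft attention: alignment scores over all source positions are normalized with a softmax, and the context vector is the weighted sum of the encoder states (the dot-product scoring and the sizes here are simplifying assumptions, not Bahdanau et al.'s exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # 6 source positions, dimension 8
decoder_state = rng.normal(size=8)         # current target-side state

scores = encoder_states @ decoder_state    # one alignment score per source position

# Soft attention: weights are placed "softly" over all positions via softmax.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

context = weights @ encoder_states         # context vector, shape (8,)
```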
Global vs Local Attention
• Luong et al. (2015) proposed “global” and “local” attention.
• The global attention is similar to the soft attention, while the
local one is an interesting blend between hard and soft, an
improvement over the hard attention to make it
differentiable.
• In local attention, the model first predicts a single aligned
position for the current target word and a window centered
around the source position is then used to compute a context
vector.
Transformer
Transformer
• The Transformer in NLP is a novel architecture that aims to
solve sequence-to-sequence tasks while handling long-range
dependencies with ease.
• It relies entirely on self-attention to compute representations
of its input and output WITHOUT using sequence-aligned
RNNs or convolution.
• The Transformer was proposed in the paper Attention Is All
You Need.
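A minimal NumPy sketch of the scaled dot-product self-attention the Transformer is built on (the random inputs and sizes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 16
X = rng.normal(size=(seq_len, d_model))            # token representations

# Learned projections (random here) produce queries, keys and values.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
attn = softmax(Q @ K.T / np.sqrt(d_k))             # (seq_len, seq_len) weights
output = attn @ V                                  # every token attends to all tokens
```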
Transformer