
Recurrent Neural Network

(RNN)
RNN
• RNNs have a “memory” which remembers information about what has been calculated so far.
• They use the same parameters for each input, since they perform the same task on all the inputs or hidden layers to produce the output.
• This reduces the number of parameters, unlike other neural networks.
Training through RNN
• A single time step of the input is provided to the network.
• The current state is then calculated from the current input and the previous state.
• The current state ht becomes ht-1 for the next time step.
• One can go through as many time steps as the problem requires and combine the information from all the previous states.
• Once all the time steps are completed, the final state is used to calculate the output.
• The output is then compared to the actual output, i.e. the target output, and the error is generated.
• The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained (see the sketch below).
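A minimal NumPy sketch of this loop, under assumed toy sizes and weight names (Wx, Ws, Wy follow the later slides); it only illustrates the forward pass and the error computation, not a full training implementation:

```python
import numpy as np

n_in, n_hidden, n_out = 4, 8, 3                       # illustrative sizes
rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))     # input  -> hidden
Ws = rng.normal(scale=0.1, size=(n_hidden, n_hidden)) # hidden -> hidden, shared across steps
Wy = rng.normal(scale=0.1, size=(n_out, n_hidden))    # hidden -> output

def rnn_forward(xs):
    """xs: list of input vectors, one per time step."""
    h = np.zeros(n_hidden)             # initial state h0
    for x in xs:                       # the same parameters are reused at every step
        h = np.tanh(Wx @ x + Ws @ h)   # current state from current input + previous state
    return Wy @ h                      # output computed from the final state

xs = [rng.normal(size=n_in) for _ in range(5)]   # a toy 5-step input sequence
y = rnn_forward(xs)
target = np.ones(n_out)                          # dummy target output
error = np.sum((y - target) ** 2)                # error to be back-propagated through time
```

Back-propagating this error through every time step (BPTT) would then give the gradients used to update Wx, Ws, and Wy.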
RNN
• Although the basic Recurrent Neural Network is fairly
effective, it can suffer from a significant problem.
• For deep networks, the back-propagation process can lead to the following issues:
– Vanishing Gradients: This occurs when the gradients become very
small and tend towards zero.
– Exploding Gradients: This occurs when the gradients become too large
due to back-propagation.
RNN
• Recurrent Neural Networks are those networks that deal with
sequential data.
• They predict outputs using not only the current inputs but also by taking into consideration those that occurred before them.
• In other words, the current output depends on the current input as well as on a memory element (which takes into account the past inputs).
• For training such networks, we use good old backpropagation but with a slight twist. We don’t independently train the system at a specific time “t”.
• Instead, we train it at a specific time “t” together with all that has happened before time “t”, i.e. t-1, t-2, t-3, and so on.
RNN
Training RNN
• S1, S2, S3 are the hidden states or memory units at time t1,
t2, t3 respectively, and Ws is the weight matrix associated
with it.
• X1, X2, X3 are the inputs at time t1, t2, t3 respectively,
and Wx is the weight matrix associated with it.
• Y1, Y2, Y3 are the outputs at time t1, t2, t3 respectively,
and Wy is the weight matrix associated with it.
For any time t, we have the following two equations:
St = g1(Wx·Xt + Ws·St-1)
Yt = g2(Wy·St)
where g1 and g2 are activation functions.


• Let us now perform back propagation at time t = 3.
Let the error function be:
Et = (dt − Yt)²
so at t = 3,
E3 = (d3 − Y3)²
*We are using the squared error here, where d3 is the desired output at time t = 3.
To perform back propagation, we have to adjust the weights
associated with inputs, the memory units and the outputs.
Adjusting Wy
Adjusting Ws
Adjusting Wx
Limitations:
• This method of Back Propagation through time (BPTT) can be
used up to a limited number of time steps like 8 or 10.
• If we back propagate further, the gradient becomes too small.
• This problem is called the “Vanishing gradient” problem.
• The problem is that the contribution of information decays
geometrically over time.
• So, if the number of time steps is greater than about 10 (let’s say), that information will effectively be discarded (see the numerical sketch below).
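A short, purely illustrative NumPy sketch of why this happens: the gradient reaching a state k steps back is obtained by k repeated multiplications by the recurrent weight matrix (times the activation derivative), so its magnitude decays roughly geometrically. The matrix scale and the 0.5 factor below are assumptions chosen only to make the decay visible.

```python
import numpy as np

rng = np.random.default_rng(1)
n_hidden = 8
Ws = rng.normal(scale=0.3, size=(n_hidden, n_hidden))  # recurrent weight matrix (toy values)

grad = np.ones(n_hidden)            # gradient arriving at the last time step
for k in range(1, 21):
    grad = Ws.T @ grad * 0.5        # 0.5 stands in for a typical tanh derivative
    if k in (5, 10, 20):
        print(f"{k:2d} steps back: |grad| = {np.linalg.norm(grad):.2e}")
```

With larger recurrent weights the same repeated product explodes instead of vanishing, which is the exploding-gradient case mentioned earlier.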
LSTM
• Long Short-Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.
• The concept was introduced by Hochreiter & Schmidhuber (1997) and was refined and popularized by many people in following work.
• They work tremendously well on a large variety of problems,
and are now widely used.
• LSTMs are explicitly designed to avoid the long-term
dependency problem. Remembering information for long
periods of time is practically their default behavior, not
something they struggle to learn!
RNN
• All recurrent neural networks have the form of a chain of
repeating modules of neural network.
• In standard RNNs, this repeating module will have a very
simple structure, such as a single tanh layer.
RNN
LSTM
• LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
LSTM
• An LSTM has a similar control flow to a recurrent neural network.
• It processes data passing on information as it propagates
forward. The differences are the operations within the LSTM’s
cells.
The Core Idea Behind LSTMs
• The core concepts of LSTMs are the cell state and its various gates.
• The cell state acts as a transport highway that transfers relevant information all the way down the sequence chain.
• You can think of it as the “memory” of the network.
• The cell state, in theory, can carry relevant information throughout the processing of the sequence.
• So even information from the earlier time steps can make its way to later time steps, reducing the effects of short-term memory.
The Core Idea Behind LSTMs
• As the cell state goes on its journey, information gets added to or removed from the cell state via gates.
• The gates are different neural networks that decide which information is allowed on the cell state.
• The gates can learn what information is relevant to keep or forget during training.
LSTM
• Three different gates regulate information flow in an LSTM cell: a forget gate, an input gate, and an output gate.
• There is also the concept of the cell state (a code sketch of all of these follows the gate slides below).
Forget gate
Input gate
Cell State
Output Gate
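A minimal NumPy sketch of one LSTM step covering the four components named on the slides above, using the standard gating equations; the weight names (Wf, Wi, Wc, Wo) and sizes are illustrative assumptions, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden = 4, 8
rng = np.random.default_rng(2)
Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)            # forget gate: what to drop from the old cell state
    i = sigmoid(Wi @ z)            # input gate: how much new information to write
    c_tilde = np.tanh(Wc @ z)      # candidate cell content
    c = f * c_prev + i * c_tilde   # cell state: the "transport highway"
    o = sigmoid(Wo @ z)            # output gate: what part of the cell to expose
    h = o * np.tanh(c)             # new hidden state
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):   # run a toy 5-step sequence
    h, c = lstm_step(x, h, c)
```

The forget gate scales the old cell state, the input gate decides how much of the candidate content to write, and the output gate decides how much of the (tanh-squashed) cell state becomes the new hidden state.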
GRU (Gated Recurrent Unit)
• Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem which comes with a standard recurrent neural network.
• The GRU can also be considered a variation of the LSTM, because both are designed similarly and, in some cases, produce equally good results.
GRU
• GRUs are an improved version of the standard recurrent neural network.
• To solve the vanishing gradient problem of a standard RNN, the GRU uses an update gate and a reset gate.
• These are two vectors which decide what information should be passed to the output.
• The special thing about them is that they can be trained to keep information from long ago, without washing it out through time, and to remove information which is irrelevant to the prediction.
GRU
• GRUs got rid of the cell state and use the hidden state to transfer information.
• They also have only two gates: a reset gate and an update gate.
LSTM vs GRU
GRU
Update Gate:
– The update gate acts similarly to the forget and input gates of an LSTM.
– It decides what information to throw away and what new information to add.

Reset Gate:
– The reset gate is another gate that is used to decide how much past information to forget.
GRU
• GRUs have fewer tensor operations; therefore, they are a little faster to train than LSTMs.
• There isn’t a clear winner as to which one is better.
• Researchers and engineers usually try both to determine which one works better for their use case.
GRU
GRU
Update gate
Reset gate
Current memory content
Final memory at current time step
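A minimal NumPy sketch of one GRU step corresponding to the four items above (update gate, reset gate, current memory content, final memory at the current time step); the weight names Wz, Wr, Wh and sizes are illustrative assumptions, and biases are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden = 4, 8
rng = np.random.default_rng(3)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for _ in range(3))

def gru_step(x, h_prev):
    z = sigmoid(Wz @ np.concatenate([h_prev, x]))            # update gate
    r = sigmoid(Wr @ np.concatenate([h_prev, x]))            # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))  # current memory content
    h = (1.0 - z) * h_prev + z * h_tilde                     # final memory at this step
    return h

h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):   # run a toy 5-step sequence
    h = gru_step(x, h)
```

Note that the convention for combining h_prev and h_tilde varies between write-ups; here a z close to 1 means “mostly take the new content”.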
Bi-Directional LSTM
Bidirectional LSTMs
• Bidirectional LSTMs are an extension of typical LSTMs that can enhance the performance of the model on sequence classification problems.
• Where all time steps of the input sequence are available, Bi-LSTMs train two LSTMs instead of one on the input sequence.
• The first is trained on the input sequence as-is and the other on a reversed copy of the input sequence.
• By this, additional context is added to the network and learning can be faster and more complete.
Bidirectional LSTMs
• The idea behind Bidirectional Recurrent Neural Networks (RNNs) is very straightforward.
• It involves replicating the first recurrent layer in the network, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the replicated layer.
• This overcomes the limitations of a traditional RNN.
• A bidirectional recurrent neural network (BRNN) can be trained using all available input information in the past and future of a particular time step.
• The state neurons of a regular RNN are split, with one part responsible for the forward states (positive time direction) and another part for the backward states (negative time direction) (see the sketch below).
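A minimal PyTorch sketch (sizes are illustrative assumptions): setting bidirectional=True makes nn.LSTM run one LSTM over the sequence as-is and another over a reversed copy, concatenating the forward and backward hidden states at every time step.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=10, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(2, 7, 10)     # (batch, time steps, features) - toy data
out, (h_n, c_n) = bilstm(x)
print(out.shape)              # torch.Size([2, 7, 32]): 16 forward + 16 backward units
```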
Bidirectional LSTMs
Attention in Deep Learning
Attention
• In psychology, attention is the cognitive process of selectively
concentrating on one or a few things while ignoring others.

– A neural network is considered to be an effort to mimic human brain actions in a simplified manner.
– The attention mechanism is also an attempt to implement the same action of selectively concentrating on a few relevant things, while ignoring others, in deep neural networks.
Attention in Deep Learning
• The attention mechanism emerged as an improvement over the encoder-decoder-based neural machine translation system in natural language processing (NLP).
• Later, this mechanism, or its variants, was used in other applications, including computer vision, speech processing, etc.
Seq to Seq Model
Encoder and Decoder
• The encoder and decoder are stacks of LSTM/RNN units.
• The model works in the following two steps (a minimal sketch follows below):
– The encoder LSTM is used to process the entire input sentence and encode it into a context vector, which is the last hidden state of the LSTM/RNN.
– The decoder LSTM or RNN units produce the words in a sentence one after another.
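A minimal PyTorch sketch of these two steps under assumed toy sizes, a hypothetical vocabulary, and greedy decoding; it only illustrates the data flow (context vector = last encoder hidden state, decoder emitting one word at a time), not a trained model.

```python
import torch
import torch.nn as nn

vocab, emb, hid = 1000, 32, 64                     # illustrative sizes
embed = nn.Embedding(vocab, emb)
encoder = nn.LSTM(emb, hid, batch_first=True)
decoder = nn.LSTM(emb, hid, batch_first=True)
out_proj = nn.Linear(hid, vocab)

src = torch.randint(0, vocab, (1, 7))              # one source sentence of 7 token ids
_, (h, c) = encoder(embed(src))                    # context vector = last hidden state

token = torch.zeros(1, 1, dtype=torch.long)        # assumed start-of-sentence id = 0
for _ in range(5):                                 # produce words one after another
    out, (h, c) = decoder(embed(token), (h, c))
    token = out_proj(out).argmax(dim=-1)           # greedy choice of the next word
```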
Encoder and Decoder
Drawbacks of Encoder- Decoder
• If the encoder makes a bad summary, the translation will also be bad. And indeed it has been observed that the encoder creates a bad summary when it tries to understand longer sentences. This is called the long-range dependency problem of RNNs/LSTMs.
• RNNs cannot remember longer sentences and sequences due to the vanishing/exploding gradient problem. They can only remember the parts which they have just seen.
Drawbacks of Encoder- Decoder
• Even Cho et al. (2014), who proposed the encoder-decoder network, demonstrated that the performance of the encoder-decoder network degrades rapidly as the length of the input sentence increases.
• Although an LSTM is supposed to capture long-range dependency better than the RNN, it tends to become forgetful in specific cases.
• Another problem is that there is no way to give more importance to some of the input words compared to others while translating the sentence.
Attention Mechanism
Attention Mechanism
• The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT).
• Rather than building a single context vector out of the encoder’s last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input.
• The weights of these shortcut connections are customizable for each output element.
• Since the context vector has access to the entire input sequence, we don’t need to worry about forgetting.
Attention Mechanism
• The alignment between the source and target is learned and controlled by the context vector.
• Essentially the context vector consumes three pieces of information:
– encoder hidden states;
– decoder hidden states;
– alignment between source and target.
Attention Mechanism
• The Bidirectional LSTM used here generates a sequence of annotations (h1, h2, …, hTx) for each input sentence.
• All the vectors h1, h2, …, etc. are the concatenation of forward and backward hidden states in the encoder.
Attention Mechanism
• We have a source sequence x of length n and try to output a target sequence y of length m.
• The encoder is a bidirectional RNN with a forward hidden state and a backward one.
• A simple concatenation of the two represents the encoder state.
• The motivation is to include both the preceding and following words in the annotation of one word.
Attention Mechanism
• The decoder network has a hidden state st for the output word at position t, t = 1, …, m, where the context vector ct is a sum of the hidden states of the input sequence, weighted by alignment scores:
ct = Σi αt,i · hi
Attention Mechanism
• The alignment model assigns a score αt,i to the pair of input at position i and output at position t, (yt, xi), based on how well they match.
• The set of {αt,i } are weights defining how much of each source
hidden state should be considered for each output.
• In Bahdanau’s paper, the alignment score α is parametrized by
a feed-forward network with a single hidden layer and this
network is jointly trained with other parts of the model.
Attention Mechanism
• The score function is therefore in the following form, given that tanh is used as the non-linear activation function:
score(st, hi) = va⊤ · tanh(Wa · [st; hi])
• where both va and Wa are weight matrices to be learned in the alignment model.
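A minimal NumPy sketch of this additive (Bahdanau-style) scoring and the resulting context vector for a single decoder step; all shapes are illustrative assumptions, and va, Wa are randomly initialized stand-ins for the learned parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_enc, n_dec, n_align, Tx = 8, 8, 10, 5
rng = np.random.default_rng(4)
H = rng.normal(size=(Tx, n_enc))                 # encoder annotations h1..hTx
s_prev = rng.normal(size=n_dec)                  # previous decoder hidden state
Wa = rng.normal(scale=0.1, size=(n_align, n_dec + n_enc))
va = rng.normal(scale=0.1, size=n_align)

scores = np.array([va @ np.tanh(Wa @ np.concatenate([s_prev, h])) for h in H])
alpha = softmax(scores)                          # alignment weights over the source
context = alpha @ H                              # weighted sum of encoder hidden states
```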
Self-Attention
• Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence.
• It has been shown to be very useful in machine reading, abstractive summarization, and image description generation.
• The long short-term memory network paper used self-attention to do machine reading.
Self-Attention
Soft vs Hard Attention
• In the Show, Attend and Tell paper, the attention mechanism is applied to images to generate captions.
• The image is first encoded by a CNN to extract features.
• Then an LSTM decoder consumes the convolutional features to produce descriptive words one by one, where the weights are learned through attention.
• The visualization of the attention weights clearly demonstrates which regions of the image the model is paying attention to in order to output a certain word.
Soft vs Hard Attention
The distinction between “soft” and “hard” attention is based on whether the attention has access to the entire image or only a patch:

• Soft Attention: the alignment weights are learned and placed “softly” over all patches in the source image; essentially the same type of attention as in Bahdanau et al., 2015.
– Pro: the model is smooth and differentiable.
– Con: expensive when the source input is large.

• Hard Attention: only selects one patch of the image to attend to at a time.
– Pro: less calculation at inference time.
– Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train. (Luong et al., 2015)
Global vs Local Attention
• Luong et al., 2015 proposed “global” and “local” attention.
• The global attention is similar to the soft attention, while the
local one is an interesting blend between hard and soft, an
improvement over the hard attention to make it
differentiable.
• In local attention, the model first predicts a single aligned
position for the current target word and a window centered
around the source position is then used to compute a context
vector.
Transformer
Transformer
• The Transformer in NLP is a novel architecture that aims to
solve sequence-to-sequence tasks while handling long-range
dependencies with ease.
• It relies entirely on self-attention to compute representations
of its input and output WITHOUT using sequence-aligned
RNNs or convolution.
• The Transformer was proposed in the paper Attention Is All
You Need.
Transformer

– “The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.”
 “Transduction” here means the conversion of input sequences into output sequences.
 The idea behind the Transformer is to handle the dependencies between input and output entirely with attention, doing away with recurrence completely.
Transformer
• The word embeddings of the input sequence are passed to the first encoder.
• These are then transformed and propagated to the next encoder.
• The output from the last encoder in the encoder stack is passed to all the decoders in the decoder stack.
Inputs to Encoder and Decoder
• All input and output tokens to the Encoder/Decoder are converted to vectors using learned embeddings.
• These input embeddings are then passed to Positional Encoding.
Positional Encoding
• The Transformer’s architecture does not contain any recurrence
or convolution and hence has no notion of word order.
• All the words of the input sequence are fed to the network with
no special order or position as they all flow simultaneously
through the Encoder and decoder stack.
• To understand the meaning of a sentence, it is essential to
understand the position and the order of words.
• Positional encoding is added to the model to help inject information about the relative or absolute position of the words in the sentence.
• Positional encoding has the same dimension as the input embedding so that the two can be summed (a sketch follows below).
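A sketch of the sinusoidal positional encoding used in the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the toy sequence length and random embedding values below are assumptions.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # positions 0..seq_len-1, as a column
    i = np.arange(0, d_model, 2)[None, :]          # the even dimension indices (2i)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                   # cosine on odd dimensions
    return pe

embeddings = np.random.randn(7, 512) * 0.1                 # toy word embeddings
encoder_input = embeddings + positional_encoding(7, 512)   # same dimension, so summed
```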
Self Attention
• Attention, in simplistic terms, is a way to get a better understanding of the meaning and the context of words in a sentence.
• A self-attention layer connects all positions with a constant number of sequentially executed operations and is hence faster than recurrent layers.
• An attention function in a Transformer is described as mapping a query and a set of key-value pairs to an output.
• Query, key, and value are all vectors.
• Attention weights are calculated using Scaled Dot-Product Attention for each word in the sentence.
• The final score is the weighted sum of the values.
Self attention Examples
Calculating Self-Attention
1. First, we need to create three vectors from each of the encoder’s input vectors:
– Query vector
– Key vector
– Value vector
These vectors are trained and updated during the training process.
2. Next, we will calculate self-attention for every word in the input sequence.
3. Consider this phrase – “Action gets results”. To calculate the self-attention for the first word “Action”, we will calculate scores for all the words in the phrase with respect to “Action”. This score determines the importance of other words when we are encoding a certain word in an input sequence.
Calculating Self-Attention
1. The score for the first word is calculated by taking the dot product of the Query vector (q1) with the key vectors (k1, k2, k3) of all the words:
Calculating Self-Attention
2. Then, these scores are divided by 8 which is the square root
of the dimension of the key vector:
Calculating Self-Attention
3. Next, these scores are normalized using the
softmax activation function
Calculating Self-Attention
4. These normalized scores are then multiplied by the value vectors (v1, v2, v3), and the resultant vectors are summed up to arrive at the final vector (z1). This is the output of the self-attention layer. It is then passed on to the feed-forward network as input:
Calculating Self-Attention
z1 is the self-attention vector for the first word of the input sequence “Action gets results”. We can get the vectors for the rest of the words in the input sequence in the same fashion (a complete numeric sketch of steps 1-4 follows below):
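A minimal NumPy sketch of steps 1-4 above for the three-word phrase; the embedding and key dimensions (512 and 64, so the divisor is 8 = √64) follow the original paper, while the random projection matrices are illustrative stand-ins for the trained Wq, Wk, Wv.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k = 512, 64
rng = np.random.default_rng(5)
X = rng.normal(size=(3, d_model))            # embeddings for "Action gets results"
Wq, Wk, Wv = (rng.normal(scale=0.05, size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # step 1 inputs: query/key/value vectors
scores = Q @ K.T                             # step 1: dot products q_i . k_j
scores = scores / np.sqrt(d_k)               # step 2: divide by 8 = sqrt(d_k)
weights = softmax(scores)                    # step 3: softmax over each row
Z = weights @ V                              # step 4: weighted sum of the value vectors
print(Z.shape)                               # (3, 64): z1, z2, z3 for the three words
```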
Multi-Head Attention
• Self-attention is computed not once but multiple times in the
Transformer’s architecture, in parallel and independently.
• It is therefore referred to as Multi-head Attention.
Multi-Head Attention
• Each attention head has a different linear transformation applied to the same input representation.
• The Transformer uses eight different attention heads, which are computed in parallel and independently.
• With eight different attention heads, we have eight different sets of the query, key, and value, and also eight sets of Encoder and Decoder; each of these sets is initialized randomly.
– “Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.”
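A minimal NumPy sketch of multi-head attention as described above: eight heads, each with its own randomly initialized query/key/value projections, computed independently and then concatenated and projected back; all sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])               # scaled dot-product scores
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V   # softmax-weighted sum of values

d_model, n_heads = 512, 8
d_k = d_model // n_heads                             # 64 per head
rng = np.random.default_rng(6)
X = rng.normal(size=(3, d_model))                    # three input token representations

heads = []
for _ in range(n_heads):                             # each head: its own projections
    Wq, Wk, Wv = (rng.normal(scale=0.05, size=(d_model, d_k)) for _ in range(3))
    heads.append(scaled_dot_product(X @ Wq, X @ Wk, X @ Wv))

Wo = rng.normal(scale=0.05, size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ Wo            # concatenate heads, project back
```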
Masked Multi-Head Attention
• The Decoder has masked multi-head attention where it masks
or blocks the decoder inputs from the future steps.
• During training, the multi-head attention of the Decoder hides
the future decoder inputs.
• For the machine translation task of translating the sentence “I enjoy nature” from English to Hindi using the Transformer, the Decoder will consider all the input words “I, enjoy, nature” to predict the first word.
Masked Multi-Head Attention
• The Decoder would block the inputs from future steps (a mask sketch follows below).
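A minimal NumPy sketch of this look-ahead masking: positions to the right of the diagonal are set to minus infinity before the softmax, so each decoder position attends only to itself and earlier positions; the 3×3 size and random scores are illustrative.

```python
import numpy as np

T = 3                                              # e.g. three target tokens
scores = np.random.randn(T, T)                     # raw attention scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = future steps
scores = np.where(mask, -np.inf, scores)           # blocked positions get -inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                        # row t has zero weight on positions > t
```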
Layer Normalization:
• Normalizes the inputs across each of
the features and is independent of
other examples.
• Layer normalization reduces the
training time in feed-forward neural
networks.
• In Layer normalization, we compute
mean and variance from all of the
summed inputs to the neurons in a
layer on a single training case.
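A minimal NumPy sketch of layer normalization: the mean and variance are computed across the features of each single example and then used to normalize that example; gamma and beta stand in for the learned scale and shift, and the sizes are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)          # per-example statistics over features
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d_model = 512
x = np.random.randn(3, d_model)                    # three token representations
gamma, beta = np.ones(d_model), np.zeros(d_model)  # learned scale and shift (here fixed)
y = layer_norm(x, gamma, beta)
```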
Fully Connected Layer
• The Encoder and Decoder in the Transformer both have a fully connected feed-forward network, which consists of two linear transformations with a ReLU activation in between (sketched below).
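A minimal NumPy sketch of this position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2; the inner size of 2048 follows the original paper, and the random weights are illustrative stand-ins for the learned parameters.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(7)
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # two linear transformations with a ReLU in between, applied to each position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

y = ffn(np.random.randn(3, d_model))   # three token representations in, same shape out
```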
Features of Transformers
The drawbacks of the seq2seq model are addressed by the Transformer:
• Parallelizing Computation:
– Transformer’s architecture removes the auto-regressive model
used in the Seq2Seq model and relies entirely on Self-Attention
to understand global dependencies between input and output.
– Self-Attention helps significantly with parallelizing the
computation
• Reduced number of operations:
– Transformers have a constant number of operations as the
attention weights are averaged in multi-head attention
Features of Transformers
The drawbacks of the seq2seq model are addressed by the Transformer:
• Long-range dependencies:
– The factor that impacts the learning of long-range dependencies is the length of the forward and backward paths the signals have to traverse in the network.
– The shorter the route between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.
– The Self-Attention layer connects all positions with a constant number of sequentially executed operations, which makes learning long-range dependencies easier.
Limitations of the Transformer
• The Transformer is undoubtedly a huge improvement over RNN-based seq2seq models.
• But it comes with its own share of limitations:
– Attention can only deal with fixed-length text strings. The text has to be split into a certain number of segments or chunks before being fed into the system as input.
– This chunking of text causes context fragmentation. For example, if a sentence is split from the middle, then a significant amount of context is lost.
