
Class 06

Deep Learning for Text Data


(LSTM Seq2Seq Models)
Dr Tran Anh Tuan
Department of Math & Computer Sciences
University of Science, HCMC

Contents
• The Problem of Long-Term Dependencies
• Long Short Term Memory networks(LSTM)
• LSTM Usecase
• Variants on Long Short Term Memory
• GRU (Gated Recurrent Unit)
• Bidirectional RNN
• RNN and LSTM application
• Processing text data
• Seq2Seq Model
• What is Encoder Decoder Architecture ?
• Encoder - Decoder

The Problem of Long-Term Dependencies
• The RNN is a function with inputs x_t (the input vector) and the previous state h_{t-1}; the new state is h_t. The recurrent function, f_W, is fixed after training and used at every time step.
• Recurrent Neural Networks are well suited to regression on sequential data because they take past values into account; the recurrence is written out below.
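As a reference (the standard vanilla-RNN formulation, not taken from the slide's figure; bias terms omitted), the recurrence and its simplest instantiation are:

```latex
h_t = f_W(h_{t-1}, x_t) = \tanh(W_{hh}\, h_{t-1} + W_{xh}\, x_t), \qquad
y_t = W_{hy}\, h_t
```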

Recurrent Neural Networks have loops


The Problem of Long-Term Dependencies
• Implementing a Vanilla RNN

• Note that in this RNN we are more interested in the next state, h_t, than in the exact output, y_t.
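A minimal NumPy sketch of one vanilla-RNN step (the weight names W_hh, W_xh, W_hy are illustrative, not from the slides):

```python
import numpy as np

class VanillaRNN:
    """Vanilla RNN cell: h_t = tanh(W_hh @ h_prev + W_xh @ x_t), y_t = W_hy @ h_t."""

    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_hh = rng.normal(0.0, 0.01, (hidden_size, hidden_size))
        self.W_xh = rng.normal(0.0, 0.01, (hidden_size, input_size))
        self.W_hy = rng.normal(0.0, 0.01, (output_size, hidden_size))

    def step(self, x_t, h_prev):
        # The new hidden state combines the previous state with the current input.
        h_t = np.tanh(self.W_hh @ h_prev + self.W_xh @ x_t)
        # The output is a linear readout of the hidden state.
        y_t = self.W_hy @ h_t
        return h_t, y_t

# The same weights (the recurrent function f_W) are reused at every time step.
rnn = VanillaRNN(input_size=8, hidden_size=16, output_size=4)
h = np.zeros(16)
for x_t in np.random.randn(5, 8):   # a toy sequence of 5 input vectors
    h, y = rnn.step(x_t, h)
```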
The Problem of Long-Term Dependencies
• Sometimes, we only need to look at recent information to perform the present task. For example,
consider a language model trying to predict the next word based on the previous ones. If we are
trying to predict the last word in “the clouds are in the sky,” we don’t need any further context –
it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the
relevant information and the place that it’s needed is small, RNNs can learn to use the past
information.
The Problem of Long-Term Dependencies
• But there are also cases where we need more context. Consider trying to predict the last word in
the text “I grew up in France… I speak fluent French.” Recent information suggests that the next
word is probably the name of a language, but if we want to narrow down which language, we
need the context of France, from further back. It’s entirely possible for the gap between the
relevant information and the point where it is needed to become very large.
• Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
Long Short Term Memory networks(LSTM)
• LSTM provides a different recurrent formula f_W. It is more powerful than the vanilla RNN because its more complex f_W adds "residual information" to the next state instead of just transforming each state. Think of LSTMs as the "residual" version of RNNs.
• In other words, LSTMs suffer much less from vanishing gradients than normal RNNs. Remember that the addition (plus) gates distribute the gradients.
• By suffering less from vanishing gradients, LSTMs can remember much further into the past, so in practice prefer an LSTM whenever you reach for an RNN.
• Put simply, LSTMs are better at remembering long-term dependencies.
Long Short Term Memory networks(LSTM)
• The vanishing gradient problem can be solved with LSTM, but another problem that can affect any recurrent neural network is the exploding gradient problem.
• To fix the exploding gradient problem, people normally apply gradient clipping, which allows only a maximum gradient value.
• This highway for the gradients is called the cell state. So, compared to the RNN, which has only the hidden state flowing through time, the LSTM has both the hidden state and the cell state.
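As a hedged illustration (not from the slides), gradient clipping can be enabled directly on a Keras optimizer:

```python
from tensorflow import keras

# clipnorm caps the norm of each gradient tensor at 1.0
# (clipvalue would cap individual components; global_clipnorm clips by the global norm).
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

model = keras.Sequential([
    keras.layers.Embedding(input_dim=10000, output_dim=64),
    keras.layers.LSTM(128),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=optimizer, loss="binary_crossentropy")
```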
Long Short Term Memory networks(LSTM)
• In the above diagram, each line carries an entire vector, from the output of one
node to the inputs of others. The pink circles represent pointwise operations, like
vector addition, while the yellow boxes are learned neural network layers. Lines
merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.
Long Short Term Memory networks(LSTM)
• The key to LSTMs is the cell state, the horizontal line running through the top of
the diagram.
• The cell state is kind of like a conveyor belt. It runs straight down the entire chain,
with only some minor linear interactions. It’s very easy for information to just
flow along it unchanged.
Long Short Term Memory networks(LSTM)
• LSTM Gate
• Zooming in on an LSTM gate also helps to understand how backpropagation flows through it.
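For reference, the standard LSTM gate equations (the usual textbook form, with ⊙ denoting element-wise multiplication; not copied from the slide's figure):

```latex
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{candidate cell state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state (the gradient highway)} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state}
\end{aligned}
```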
Variants on Long Short Term Memory
• In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The
differences are minor, but it’s worth mentioning some of them.
• One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole
connections.” This means that we let the gate layers look at the cell state.
• The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
Variants on Long Short Term Memory
• Another variation is to use coupled forget and input gates. Instead of separately
deciding what to forget and what we should add new information to, we make
those decisions together. We only forget when we’re going to input something in
its place. We only input new values to the state when we forget something older.
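In equation form, the coupled variant drops the separate input gate and reuses the forget gate:

```latex
C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t
```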
GRU (Gated Recurrent Unit)
• The GRU cell can be considered a variant of the LSTM cell (it also fights vanishing gradients), but it is more computationally efficient. In this cell the forget and input gates are merged into a single update gate.
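In Keras the two cells are drop-in replacements for each other; a small hedged sketch (layer sizes are illustrative):

```python
from tensorflow import keras

# An LSTM encoder of a length-20 sequence of 8-dimensional vectors...
lstm_model = keras.Sequential([keras.layers.LSTM(64, input_shape=(20, 8))])

# ...and its GRU counterpart: no separate cell state and a merged update gate,
# so fewer parameters and usually faster training.
gru_model = keras.Sequential([keras.layers.GRU(64, input_shape=(20, 8))])

print(lstm_model.count_params(), gru_model.count_params())
```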
Bidirectional RNN
• Bidirectional recurrent neural networks (RNNs) are really just putting two independent RNNs
together. The input sequence is fed in normal time order for one network, and in reverse time
order for another. The outputs of the two networks are usually concatenated at each time step,
though there are other options, e.g. summation.
• This structure allows the networks to have both backward and forward information about the
sequence at every time step. The concept seems easy enough. But when it comes to actually
implementing a neural network which utilizes bidirectional structure, confusion arises…
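Most frameworks hide that bookkeeping behind a wrapper; in Keras, for example, a hedged sketch looks like this (merge_mode selects concatenation or summation of the two directions):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=10000, output_dim=64),
    # One LSTM reads the sequence forwards, a second reads it backwards;
    # merge_mode="concat" (the default) concatenates their outputs,
    # merge_mode="sum" would add them instead.
    keras.layers.Bidirectional(keras.layers.LSTM(64), merge_mode="concat"),
    keras.layers.Dense(1, activation="sigmoid"),
])
```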
RNN and LSTM application
Processing text data
Word level processing (using embeddings):
• In this method we follow the same steps as the character-level method, but instead of building a dictionary of characters we build a dictionary of the words used in the text we want to process, or sometimes of the most frequent 10,000 words of the text's language.
• To make it easy to understand, we will:
1. Convert the text to lowercase.
2. Clean the data of digits and punctuation.
3. Append 'SOS' and 'EOS' tokens to the target data.
4. Make dictionaries to convert words to index numbers.
5. Use an embedding layer to convert each word to a fixed-length vector.
Now the data is ready to be used by the seq2seq network; a sketch of these steps follows.
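A hedged sketch of the five steps above (the cleaning regex, the 10,000-word cap and the sizes are illustrative, not taken from the slides):

```python
import re
from tensorflow import keras

def preprocess(sentence, is_target=False):
    # Steps 1-2: lowercase, then strip digits and punctuation.
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-zàâçéèêëîïôûùüÿœæ ]", " ", sentence)
    sentence = re.sub(r"\s+", " ", sentence).strip()
    # Step 3: wrap target (French) sentences with start/end-of-sequence tokens.
    return f"SOS {sentence} EOS" if is_target else sentence

english = [preprocess(s) for s in ["I am cold.", "She has 2 cats."]]
french = [preprocess(s, is_target=True) for s in ["J'ai froid.", "Elle a 2 chats."]]

# Step 4: word -> index dictionaries, keeping the most frequent 10,000 words.
tokenizer = keras.preprocessing.text.Tokenizer(num_words=10000, filters="")
tokenizer.fit_on_texts(english + french)
encoder_input = keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(english), padding="post")

# Step 5: an embedding layer maps each word index to a fixed-length vector.
embedding = keras.layers.Embedding(input_dim=10000, output_dim=128)
```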

Seq2Seq Model
• We will use an architecture called seq2seq (or Encoder-Decoder). It is appropriate in our case, where the input sequences (English sentences) do not have the same length as the output sequences (French sentences).

What is Encoder Decoder Architecture ?
1. Encoder:
• The encoder simply takes the input data and trains on it, then passes the last state of its recurrent layer as the initial state of the first recurrent layer of the decoder part.
2. Decoder:
• The decoder takes the last state of the encoder's last recurrent layer and uses it as the initial state of its first recurrent layer. The input of the decoder is the sequence that we want to get (in our case, the French sentences).

What is Encoder Decoder Architecture ?
• Some other images explaining the encoder-decoder:

Generative Model Chatbots


Encoder
• The encoder is made up of:
1. Input Layer: takes the English sentence and passes it to the embedding layer.
2. Embedding Layer: converts each English word to a fixed-size vector.
3. First LSTM Layer: at every time step it takes a vector that represents a word and passes its output to the next layer. We used the CuDNNLSTM layer instead of LSTM because it is much faster.
4. Second LSTM Layer: does the same thing as the previous layer, but instead of passing its output onward it passes its final states to the first LSTM layer of the decoder (see the sketch below).
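A hedged Keras sketch of this encoder (sizes are illustrative; the plain LSTM layer is used here, since current TensorFlow picks the cuDNN kernel automatically on GPU, replacing the older CuDNNLSTM layer):

```python
from tensorflow import keras

num_eng_words, embed_dim, latent_dim = 10000, 128, 256

# 1. Input layer: a sequence of English word indices.
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")
# 2. Embedding layer: each word index -> fixed-size vector.
enc_emb = keras.layers.Embedding(num_eng_words, embed_dim)(encoder_inputs)
# 3. First LSTM layer: emits an output vector at every time step.
enc_seq = keras.layers.LSTM(latent_dim, return_sequences=True)(enc_emb)
# 4. Second LSTM layer: only its final states (h, c) are kept,
#    to be used as the initial state of the decoder.
_, state_h, state_c = keras.layers.LSTM(latent_dim, return_state=True)(enc_seq)
encoder_states = [state_h, state_c]
```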

Decoder
• The decoder is made up of:
1. Input Layer: takes the French sentence and passes it to the embedding layer.
2. Embedding Layer: converts each French word to a fixed-size vector.
3. First LSTM Layer: at every time step it takes a vector that represents a word and passes its output to the next layer, but here in the decoder we initialize the state of this layer with the last state of the encoder's last LSTM layer.
4. Second LSTM Layer: processes the output from the previous layer and passes its output to a dense layer.
5. Dense Layer (Output Layer): takes the output from the previous layer and outputs a one-hot-style vector (a softmax over the vocabulary) representing the target French word (see the sketch below).
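Continuing the encoder sketch above (reusing encoder_inputs, encoder_states, embed_dim and latent_dim from it; again hedged and illustrative):

```python
num_fr_words = 10000

# 1. Input layer: the French sentence, starting with the SOS token.
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs")
# 2. Embedding layer.
dec_emb = keras.layers.Embedding(num_fr_words, embed_dim)(decoder_inputs)
# 3. First LSTM layer, initialised with the encoder's final states.
dec_seq = keras.layers.LSTM(latent_dim, return_sequences=True)(
    dec_emb, initial_state=encoder_states)
# 4. Second LSTM layer.
dec_seq = keras.layers.LSTM(latent_dim, return_sequences=True)(dec_seq)
# 5. Dense output layer: a softmax over the French vocabulary at each time step.
decoder_outputs = keras.layers.Dense(num_fr_words, activation="softmax")(dec_seq)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```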
Reference
• https://towardsdatascience.com/nlp-sequence-to-sequence-networks-part-1-processing-text-data-d141a5643b72
• https://towardsdatascience.com/nlp-sequence-to-sequence-networks-part-2-seq2seq-model-encoderdecoder-model-6c22e29fd7e1

THANK YOU
