LSTM - Introduction to Long Short-Term Memory
LSTM, an advanced form of Recurrent Neural Network (RNN), is crucial in Deep Learning
for processing time series and other sequential data.
Designed by Hochreiter and Schmidhuber,
LSTM effectively addresses the limitations of RNNs,
particularly the vanishing gradient problem,
making it far better at remembering long-term dependencies.
The network is built from gated cells that let it
selectively retain, update, and discard information,
which is pivotal for applications like video
processing and reading comprehension.
Need for LSTM
LSTM was introduced to tackle the
problems and challenges in Recurrent Neural
Networks.
An RNN is a type of neural network that carries its
previous output forward to help improve its future
predictions.
A vanilla RNN, however, has only a “short-term” memory:
an input at the beginning of the sequence stops
influencing the network’s output after a while,
perhaps after only 3 or 4 further inputs.
This is called the long-term dependency problem.
Example:
Let’s take this sentence.
The Sun rises in the ______.
An RNN could easily return the correct output
that the sun rises in the East as all the necessary
information is nearby.
Let’s take another example.
I was born in Japan, ……… and I speak
fluent ______.
In this sentence, the RNN would struggle to return
the correct output because it has to remember the
word Japan over a long span of the sequence. Since
an RNN has only a “short-term” memory, it does not
work well here. LSTM solves this problem by enabling
the network to remember long-term dependencies.
The other RNN problems are the vanishing
gradient and the exploding gradient.
Both arise during backpropagation through the
network. For example, suppose the gradient
contributed by each layer lies between 0
and 1. As these values are multiplied together layer
after layer, the product gets smaller and smaller,
ultimately becoming a value very close to 0. This is the
vanishing gradient problem. Conversely, when the
per-layer values are greater than 1, the exploding
gradient problem occurs: the product grows very large,
disrupting the training of the network. Again,
these problems are tackled in LSTMs.
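A rough numerical illustration of both effects (plain Python; the per-step factors 0.9 and 1.1 and the 50 steps are arbitrary toy values chosen only to show the trend, not numbers from the text above):

    # Toy illustration of repeated gradient multiplication across layers/time steps.
    steps = 50
    vanishing = 1.0
    exploding = 1.0
    for _ in range(steps):
        vanishing *= 0.9   # each factor < 1: the product shrinks toward 0
        exploding *= 1.1   # each factor > 1: the product blows up
    print(round(vanishing, 5))  # about 0.00515 after 50 steps
    print(round(exploding, 1))  # about 117.4 after 50 steps

Intuitively, the LSTM's additive cell-state update (described in the Cell State section below) is what helps the signal avoid this repeated shrinking or blow-up.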
Structure of LSTM
An LSTM cell consists of three gates:
a forget gate, an input gate, and an output gate.
The gates decide which information is
important and which information can be
forgotten.
The cell also maintains two states: the cell
state and the hidden state.
They are continuously updated and carry the
information from the previous to the current
time steps.
The cell state is the “long-term” memory,
while the hidden state is the “short-term”
memory. Now let’s look at each gate in detail.
Forget Gate:
Forget gate is responsible for deciding what
information should be removed from the cell
state.
It takes in the hidden state of the previous
time step and the current input and passes them
through a sigmoid activation function, which outputs a
value between 0 and 1, where 0 means forget
and 1 means keep.
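A minimal NumPy sketch of this gate (the weight matrix W_f, bias b_f, and the toy sizes are illustrative assumptions, not values given in the text):

    import numpy as np

    # Toy sizes and random parameters, for illustration only.
    hidden_size, input_size = 4, 3
    W_f = np.random.randn(hidden_size, hidden_size + input_size)  # forget-gate weights
    b_f = np.zeros(hidden_size)                                   # forget-gate bias

    h_prev = np.random.randn(hidden_size)   # previous hidden state
    x_t = np.random.randn(input_size)       # current input

    z = np.concatenate([h_prev, x_t])                 # combine the two inputs
    f_t = 1.0 / (1.0 + np.exp(-(W_f @ z + b_f)))      # sigmoid -> values in (0, 1)
    # f_t later multiplies the cell state element-wise: 0 means forget, 1 means keep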
Input Gate:
The Input Gate considers the current input
and the hidden state of the previous time step.
The input gate is used to update the cell
state value.
It has two parts.
The first part uses a sigmoid
activation function.
Its purpose is to decide what fraction of the
information is required.
The second part passes the same two values through
a tanh activation function,
which maps the data to the range -1 to 1.
To keep only the relevant information from the
tanh output, we multiply it by the
output of the sigmoid part.
This product is the output of the input gate, which
updates the cell state.
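A matching NumPy sketch of the input gate (again with assumed toy shapes and random weights W_i, W_c and biases b_i, b_c):

    import numpy as np

    # Toy sizes and random parameters, for illustration only.
    hidden_size, input_size = 4, 3
    W_i = np.random.randn(hidden_size, hidden_size + input_size)  # sigmoid-part weights
    W_c = np.random.randn(hidden_size, hidden_size + input_size)  # tanh-part weights
    b_i, b_c = np.zeros(hidden_size), np.zeros(hidden_size)

    h_prev = np.random.randn(hidden_size)
    x_t = np.random.randn(input_size)
    z = np.concatenate([h_prev, x_t])

    i_t = 1.0 / (1.0 + np.exp(-(W_i @ z + b_i)))  # what fraction to let in, in (0, 1)
    c_tilde = np.tanh(W_c @ z + b_c)              # candidate values, in (-1, 1)
    input_update = i_t * c_tilde                  # contribution added to the cell state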
Output Gate:
The output gate produces the hidden state for
the next time step.
The output gate has two parts.
The first part is a sigmoid function which,
as in the other two gates, decides what
fraction of the information to keep.
Next, the newly updated cell state is passed
through a tanh function and multiplied by
the output of the sigmoid function. The result
is the new hidden state.
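A NumPy sketch of the output gate (W_o, b_o and the stand-in cell state below are assumed toy values):

    import numpy as np

    # Toy sizes and random parameters, for illustration only.
    hidden_size, input_size = 4, 3
    W_o = np.random.randn(hidden_size, hidden_size + input_size)
    b_o = np.zeros(hidden_size)

    h_prev = np.random.randn(hidden_size)
    x_t = np.random.randn(input_size)
    c_t = np.random.randn(hidden_size)   # stand-in for the freshly updated cell state

    z = np.concatenate([h_prev, x_t])
    o_t = 1.0 / (1.0 + np.exp(-(W_o @ z + b_o)))  # how much of the cell state to expose
    h_t = o_t * np.tanh(c_t)                      # the new hidden state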
Cell State:
The forget gate and input gate update the
cell state.
The cell state from the previous time step is
multiplied element-wise by the output of the forget gate.
The result is then summed with
the output of the input gate.
This updated cell state is then used to calculate the
hidden state in the output gate.
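Putting the three gates and the cell-state update together, one full time step might look like the simplified, single-example sketch below (parameter shapes and the toy usage at the end are assumptions for illustration, not a production implementation):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
        """One LSTM time step for a single example (no batching)."""
        z = np.concatenate([h_prev, x_t])
        f_t = sigmoid(W_f @ z + b_f)          # forget gate
        i_t = sigmoid(W_i @ z + b_i)          # input gate (sigmoid part)
        c_tilde = np.tanh(W_c @ z + b_c)      # input gate (tanh part): candidate values
        c_t = f_t * c_prev + i_t * c_tilde    # cell state: forget old info, add new info
        o_t = sigmoid(W_o @ z + b_o)          # output gate
        h_t = o_t * np.tanh(c_t)              # new hidden state
        return h_t, c_t

    # Toy usage with random parameters (illustrative shapes only).
    H, D = 4, 3
    weights = [np.random.randn(H, H + D) for _ in range(4)]
    biases = [np.zeros(H) for _ in range(4)]
    h, c = np.zeros(H), np.zeros(H)
    h, c = lstm_step(np.random.randn(D), h, c, *weights, *biases)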
Example:
He went to ______
The model can only fill in the blank correctly
with the help of the next sentence:
And he got up early the next morning.
With this later sentence available, we can predict that
the blank is that he went to sleep. A BiLSTM model
can make this prediction because it also processes the
sequence backward, so information appearing later can
inform an earlier prediction. This is why BiLSTM often
performs better on sequential data.
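As a sketch (assuming PyTorch; the layer sizes and input shapes are arbitrary), a bidirectional LSTM simply runs one pass forward and one pass backward over the sequence and concatenates the two hidden states:

    import torch
    import torch.nn as nn

    # Arbitrary toy shapes: a batch of 2 sequences, length 5, 8 features per step.
    x = torch.randn(2, 5, 8)

    bilstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
    output, (h_n, c_n) = bilstm(x)

    print(output.shape)  # torch.Size([2, 5, 32]): forward and backward states concatenated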
Sequence to Sequence LSTMs or RNN Encoder-Decoders
Seq2Seq is essentially the many-to-many
architecture seen in RNNs. In a many-to-many
architecture, an input of arbitrary length is
given and an output of arbitrary length is
returned. This architecture is useful in
applications where the input and output
lengths vary. For example, one such
application is language translation, where a
sentence in one language rarely translates to a
sentence of the same length in another language.
In these situations, Seq2Seq LSTMs are used.
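A minimal encoder-decoder sketch in PyTorch (toy vocabulary sizes and dimensions assumed; teacher forcing, attention, padding, and other training details are omitted):

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Minimal LSTM encoder-decoder: the encoder's final states start the decoder."""
        def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb)
            self.encoder = nn.LSTM(emb, hidden, batch_first=True)
            self.decoder = nn.LSTM(emb, hidden, batch_first=True)
            self.out = nn.Linear(hidden, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            _, (h, c) = self.encoder(self.src_emb(src_ids))       # summarise the source
            dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
            return self.out(dec_out)                              # scores over target words

    # Toy usage: source length 7, target length 5 (the lengths need not match).
    model = Seq2Seq(src_vocab=100, tgt_vocab=120)
    src = torch.randint(0, 100, (2, 7))
    tgt = torch.randint(0, 120, (2, 5))
    print(model(src, tgt).shape)  # torch.Size([2, 5, 120])

The encoder compresses the source sequence into its final hidden and cell states, and the decoder unrolls from those states for as many steps as the target needs, which is what lets the input and output lengths differ.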