Understanding LSTM Networks
Understanding LSTM Networks
Understanding LSTM Networks
Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine
you want to classify what kind of event is happening at every point in a movie. It’s unclear how a
traditional neural network could use its reasoning about previous events in the film to inform later
ones.
Recurrent neural networks address this issue. They are networks with loops in them, allowing
information to persist.
In the above diagram, a chunk of neural network, A, looks at some input xt and outputs a value ht. A
loop allows information to be passed from one step of the network to the next.
These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit
more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural
network can be thought of as multiple copies of the same network, each passing a message to a
successor. Consider what happens if we unroll the loop:
This chain-like nature reveals that recurrent neural networks are intimately related to sequences and
lists. They’re the natural architecture of neural network to use for such data.
And they certainly are used! In the last few years, there have been incredible success applying RNNs
to a variety of problems: speech recognition, language modeling, translation, image captioning… The
list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej
Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But
they really are pretty amazing.
Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network
which works, for many tasks, much much better than the standard version. Almost all exciting results
based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will
explore.
Sometimes, we only need to look at recent information to perform the present task. For example,
consider a language model trying to predict the next word based on the previous ones. If we are trying
to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty
obvious the next word is going to be sky. In such cases, where the gap between the relevant
information and the place that it’s needed is small, RNNs can learn to use the past information.
But there are also cases where we need more context. Consider trying to predict the last word in the
text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is
probably the name of a language, but if we want to narrow down which language, we need the context
of France, from further back. It’s entirely possible for the gap between the relevant information and the
point where it is needed to become very large.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could
carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t
seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and
Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information
for long periods of time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In
standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of
having a single neural network layer, there are four, interacting in a very special way.
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step
later. For now, let’s just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of
others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes
are learned neural network layers. Lines merging denote concatenation, while a line forking denote its
content being copied and the copies going to different locations.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by
structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net
layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component
should be let through. A value of zero means “let nothing through,” while a value of one means “let
everything through!”
An LSTM has three of these gates, to protect and control the cell state.
and xt, and outputs a number between 0 and 1 for each number in the cell state Ct−1. A 1 represents
“completely keep this” while a 0 represents “completely get rid of this.”
Let’s go back to our example of a language model trying to predict the next word based on all the
previous ones. In such a problem, the cell state might include the gender of the present subject, so that
the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old
subject.
The next step is to decide what new information we’re going to store in the cell state. This has two
parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a
tanh layer creates a vector of new candidate values, C~t, that could be added to the state. In the next
step, we’ll combine these two to create an update to the state.
In the example of our language model, we’d want to add the gender of the new subject to the cell state,
to replace the old one we’re forgetting.
It’s now time to update the old cell state, Ct−1, into the new cell state Ct. The previous steps already
decided what to do, we just need to actually do it.
We multiply the old state by ft, forgetting the things we decided to forget earlier. Then we add it∗C~t.
This is the new candidate values, scaled by how much we decided to update each state value.
In the case of the language model, this is where we’d actually drop the information about the old
subject’s gender and add the new information, as we decided in the previous steps.
Finally, we need to decide what we’re going to output. This output will be based on our cell state, but
will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re
going to output. Then, we put the cell state through tanh
(to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that
we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information
relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject
is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows
next.
Variants on Long Short Term Memory
What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In
fact, it seems like almost every paper involving LSTMs uses a slightly different version. The
differences are minor, but it’s worth mentioning some of them.
One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole
connections.” This means that we let the gate layers look at the cell state.
The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not
others.
Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget
and what we should add new information to, we make those decisions together. We only forget when
we’re going to input something in its place. We only input new values to the state when we forget
something older.
A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by
Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the
cell state and hidden state, and makes some other changes. The resulting model is simpler than
standard LSTM models, and has been growing increasingly popular.
These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated
RNNs by Yao, et al. (2015). There’s also some completely different approach to tackling long-term
dependencies, like Clockwork RNNs by Koutnik, et al. (2014).
Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of
popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten
thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.
Conclusion
Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these
are achieved using LSTMs. They really work a lot better for most tasks!
Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them
step by step in this essay has made them a bit more approachable.
LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another
big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The
idea is to let every step of an RNN pick information to look at from some larger collection of
information. For example, if you are using an RNN to create a caption describing an image, it might
pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this –
it might be a fun starting point if you want to explore attention! There’s been a number of really
exciting results using attention, and it seems like a lot more are around the corner…
Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner,
et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et
al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last
few years have been an exciting time for recurrent neural networks, and the coming ones promise to
only be more so!
Acknowledgments
I’m grateful to a number of people for helping me better understand LSTMs, commenting on the
visualizations, and providing feedback on this post.
I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg
Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and
colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m
especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.
Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks.
Thanks to everyone who participated in those for their patience with me, and for their feedback.