RNNs
• RNNs
• LSTMs
• Conclusions
Are traditional RNNs still relevant?
• Until the last few years, Long Short-Term Memory (LSTM) RNNs represented the state of the art in sequence modeling
• In Natural Language Processing, LSTMs are trained to
predict the next word in a sequence or to classify text.
• LSTMs can also be used to model other sequences, such as patient trajectories, to make predictions informed by the nature & order of prior events/diagnoses, etc.
(Image caption: LSTM – smaller and less contemporary, but still a hefty player in AI)
Prefatory comments
• In diagrams, RNNs are usually depicted as operating on a series of
individually time-stamped input vectors
• Vectors may represent a sequence in time or space (e.g., word order in text)
o(t) = Vh(t) + c
• Note that after t time steps the repeated multiplication by W effectively raises its eigenvalues to the power of t, causing eigenvalues of magnitude <1 to evaporate with time and those >1 to explode, hindering learning in general and obscuring/eroding signals related to long-range dependencies
*Gradient clipping can help address gradient explosion, but vanishing gradients remain problematic in vanilla RNNs
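A minimal numpy sketch of the point above (illustrative only; the toy matrix W, its eigenvalues, and the clip_by_norm helper are assumptions, not from the slides): repeated multiplication by the recurrent weights raises the eigenvalues to the power of t, and norm-based clipping only tames the exploding side.

import numpy as np

W = np.diag([0.5, 1.5])          # toy recurrent weights with eigenvalues 0.5 and 1.5
h = np.array([1.0, 1.0])         # initial hidden state

for t in range(20):
    h = W @ h                    # after t steps the components scale as 0.5**t and 1.5**t
print(h)                         # ~[9.5e-07, 3.3e+03]: one signal evaporated, one exploded

def clip_by_norm(grad, max_norm=1.0):
    # Gradient clipping: rescale a gradient whose norm exceeds the threshold.
    # Helps with explosion; does nothing for vanishing gradients.
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad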
How can we improve memory in an RNN?
• Leaky units “allow more leakage of the past into the present”
• Compute hidden state at time t.
• Then combine a weighted “self-connection” to the running value from t-1 (to “access more of the past”) with a weighted value of the newly computed hidden representation → “revised” h(t)
• The “updated/revised” h(t) can be denoted ρ(t) = αρ(t-1) + (1-α)h(t), where α is an adjustable hyperparameter - a real number in the range (0,1); see the code sketch after this list
• Repeat the above at each time step.
• The closer the value of α is to 1, the more of the past seeps into the present
*Parameters remain shared, however, among weight matrices. See next slide.
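A minimal code sketch of the leaky-unit update described above (the function name, shapes, and α value are illustrative assumptions, not from the slides):

import numpy as np

def leaky_update(rho_prev, h_t, alpha=0.9):
    # rho(t) = alpha * rho(t-1) + (1 - alpha) * h(t), with alpha in (0, 1);
    # values of alpha near 1 let more of the past seep into the present.
    return alpha * rho_prev + (1.0 - alpha) * h_t

rho = np.zeros(3)                          # running "leaky" memory
for t in range(5):
    h_t = np.tanh(np.random.randn(3))      # stand-in for the RNN hidden state at step t
    rho = leaky_update(rho, h_t)           # the "revised" h(t) in the slide's notation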
Core LSTM building block: The Memory Cell
(Diagram: the memory cell state s(t) sits on a gated self-loop; the input x(t) and the prior hidden vector h(t-1) feed the cell and its three gates through the weight matrices U, W, Ug, Wg, Uf, Wf, Uo, Wo.)
h(t) – hidden layer vector at time t
s(t) – memory cell state at time t
s(t-1) – memory cell state at time t-1
g(t) – output of the input gate
f(t) – output of the forget gate
q(t) – output of the output gate
x(t) – input at time t
h(t-1) – hidden layer vector at time t-1
U and W represent learned weight matrices, with superscripts indicating distinct weight matrices for the input, forget, and output gates.
The input gate controls the extent to which the current input informs the cell state. The forget gate modulates the degree to which the prior state is reflected in the new state. The output gate governs how much influence the current state has on the output to the next layer. The gates each apply scalar outputs between 0 and 1.
h(t) = tanh(s(t))q(t)
b denotes a gate-specific bias term, indicated by superscript. Simple stochastic gradient descent is often used in backpropagation for LSTMs.
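A minimal numpy sketch of one memory-cell step in the slide's notation. Only h(t) = tanh(s(t))q(t) is given explicitly above, so the gate and candidate equations below follow the standard LSTM formulation (sigmoid gates; a tanh candidate is a common choice); the shapes and random initialization are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, s_prev, params):
    # One forward step; params holds (U, W, b) for the cell input and for each gate.
    U, W, b = params["cell"]
    Ug, Wg, bg = params["input_gate"]
    Uf, Wf, bf = params["forget_gate"]
    Uo, Wo, bo = params["output_gate"]

    g_t = sigmoid(bg + Ug @ x_t + Wg @ h_prev)   # input gate: how much new input enters the state
    f_t = sigmoid(bf + Uf @ x_t + Wf @ h_prev)   # forget gate: how much prior state is kept
    q_t = sigmoid(bo + Uo @ x_t + Wo @ h_prev)   # output gate: how much state reaches the output
    cand = np.tanh(b + U @ x_t + W @ h_prev)     # candidate update (tanh assumed here)

    s_t = f_t * s_prev + g_t * cand              # new cell state via the gated self-loop
    h_t = np.tanh(s_t) * q_t                     # h(t) = tanh(s(t)) q(t), as on the slide
    return h_t, s_t

# Toy usage with hypothetical sizes: 4-dim input x(t), 3-dim hidden/cell state.
rng = np.random.default_rng(0)
def make_params(n_h=3, n_x=4):
    return (rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h)), np.zeros(n_h))
params = {k: make_params() for k in ["cell", "input_gate", "forget_gate", "output_gate"]}
h, s = np.zeros(3), np.zeros(3)
h, s = lstm_cell_step(rng.normal(size=4), h, s, params)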