Bianchi
A QUICK OVERVIEW
1. Introduction
2. The Recurrent Neural Network
3. RNN applications
4. The RNN model
5. Gated RNN (LSTM and GRU)
6. Echo State Network
1. INTRODUCTION
THE RECURRENT NEURAL NETWORK
A recurrent neural network (RNN) is a universal approximator of dynamical systems.
It can be trained to reproduce any target dynamics, up to a given degree of precision.
[Figure: the evolution of the RNN internal state mirrors the evolution of the state of the observed dynamical system.]
An RNN generalizes naturally to inputs of any length.
An RNN makes use of sequential information by modelling temporal dependencies in the inputs.
Example: to predict the next word in a sentence, you need to know which words came before it.
The output of the network depends on the current input and on the value of the previous internal state.
The internal state maintains a (vanishing) memory of the history of all past inputs.
RNNs can in principle use information from arbitrarily long sequences, but in practice they can only look back a few time steps.
An RNN can be trained to predict a future value of the driving input.
As a side effect, we obtain a generative model, which allows us to generate new elements by sampling from the output probabilities.
[Figure: during training (teacher forcing) the RNN receives the true previous output as input; in generative mode its own output is fed back as the next input.]
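A minimal sketch of the generative mode, assuming a step function that returns output probabilities (the `step_fn` interface and the one-hot feedback are illustrative choices, not part of the slides):

```python
import numpy as np

# Generative mode: after training with teacher forcing, feed the sampled output
# back as the next input. `step_fn(x, h) -> (probs, h_new)` is an assumed interface.
def sample_sequence(step_fn, h0, x0, n_steps, rng=np.random.default_rng(0)):
    h, x, generated = h0, x0, []
    for _ in range(n_steps):
        probs, h = step_fn(x, h)               # one RNN step in generative mode
        idx = rng.choice(len(probs), p=probs)  # sample from the output probabilities
        generated.append(idx)
        x = np.eye(len(probs))[idx]            # feedback: the sample becomes the next input
    return generated
```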
DIFFERENCES WITH CNN
[Figure: RNNs handle temporal data (a signal $x$ evolving over time), whereas CNNs handle spatial data.]
2. RNN APPLICATIONS
APPLICATION I: NATURAL LANGUAGE PROCESSING
NATURAL LANGUAGE GENERATION: SHAKESPEARE
[Source: Andrej Karpathy]
TEXT GENERATION: WIKIPEDIA
TEXT GENERATION: SCIENTIFIC PAPER
[Source: Andrej Karpathy]
APPLICATION II: MACHINE TRANSLATION
APPLICATION VI: MUSIC INFORMATION RETRIEVAL
Music transcription example
ARCHITECTURE COMPONENTS
STATE UPDATE AND OUTPUT GENERATION
$h[t+1] = f(W_{hh}\, h[t] + W_{ih}\, x[t+1] + b_h),$
$y[t+1] = g(W_{ho}\, h[t+1] + b_o).$
𝑓() is the transfer function implemented by each neuron (usually the same non-linear
function for all neurons).
$g()$ is the readout of the RNN. It is usually the identity function (all the non-linearity is provided by the internal processing units, i.e. the neurons) or the softmax function.
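A minimal numpy sketch of this update, assuming tanh for $f$ and the identity for $g$ (sizes are illustrative):

```python
import numpy as np

# One step of the state update and readout above (assumed sizes and activations).
rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 8, 2
W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
W_ih = rng.normal(scale=0.1, size=(n_h, n_in))
W_ho = rng.normal(scale=0.1, size=(n_out, n_h))
b_h, b_o = np.zeros(n_h), np.zeros(n_out)

def rnn_step(x_next, h_prev):
    h_next = np.tanh(W_hh @ h_prev + W_ih @ x_next + b_h)  # state update, f = tanh
    y_next = W_ho @ h_next + b_o                            # readout, g = identity
    return h_next, y_next

h = np.zeros(n_h)
for x in rng.normal(size=(5, n_in)):   # run over a short random input sequence
    h, y = rnn_step(x, h)
```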
NEURON TRANSFER FUNCTION
RNN UNFOLDING
In the example, we need to backpropagate the gradient $\frac{\partial E_3}{\partial W}$ from the current time ($t_3$) back to the initial time ($t_0$), using the chain rule:
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_k}\,\frac{\partial h_k}{\partial W}.$$
We sum up the contributions of each time step to the gradient.
With a very long sequence (possibly infinite) the unfolded depth becomes intractable.
Solution: repeat the procedure only up to a given time in the past (truncated BPTT).
Why does it work? Because each state carries a little bit of information about every previous input.
Once the network is unfolded, the procedure is analogous to standard backpropagation in a feedforward network.
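A minimal sketch of truncated BPTT for the vanilla RNN above, assuming a squared-error loss on the last output only and a truncation horizon of `k_trunc` steps (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out, k_trunc = 3, 16, 2, 5   # k_trunc = how far back we unfold

W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
W_ih = rng.normal(scale=0.1, size=(n_h, n_in))
W_ho = rng.normal(scale=0.1, size=(n_out, n_h))

def truncated_bptt_grads(xs, target, h0):
    """Forward over the last k_trunc inputs, then backprop through those steps only."""
    xs_t = xs[-k_trunc:]                          # truncation: ignore older inputs
    hs = [h0]
    for x in xs_t:
        hs.append(np.tanh(W_hh @ hs[-1] + W_ih @ x))
    y = W_ho @ hs[-1]                             # linear readout at the last step
    dy = y - target                               # gradient of 0.5 * ||y - target||^2
    dW_ho = np.outer(dy, hs[-1])
    dW_hh, dW_ih = np.zeros_like(W_hh), np.zeros_like(W_ih)
    dh = W_ho.T @ dy                              # gradient flowing into the last state
    for step in range(len(hs) - 1, 0, -1):        # walk back through the unrolled steps
        dz = dh * (1.0 - hs[step] ** 2)           # through the tanh nonlinearity
        dW_hh += np.outer(dz, hs[step - 1])       # sum contributions of each time step
        dW_ih += np.outer(dz, xs_t[step - 1])
        dh = W_hh.T @ dz                          # pass the gradient to the previous state
    return dW_hh, dW_ih, dW_ho

grads = truncated_bptt_grads([rng.normal(size=n_in) for _ in range(20)],
                             np.zeros(n_out), np.zeros(n_h))
```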
A simple RNN is trained to keep 1 bit of information for $T$ time steps (Bengio, 1991).
$P(\mathrm{success} \mid T)$ decreases exponentially as $T$ increases.
[Image: Yoshua Bengio]
VANISHING GRADIENT: TOO MANY PRODUCTS!
In order to have (local) stability, the spectral radius of the matrix 𝑊ℎℎ must be lower than 1.
Consider the state update equation $h[t+1] = f(h[t], u[t+1])$: it is a recursive equation.
When the input sequence is given, it can be rewritten explicitly as
$$h[t+1] = f_t(h[t]) = f_t(f_{t-1}(\dots f_0(h[0]))). \qquad (1)$$
The resulting gradient, relative to the loss at time $t$, is
$$\frac{\partial L_t}{\partial W} = \sum_{\tau} \frac{\partial L_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_\tau}\,\frac{\partial h_\tau}{\partial W}. \qquad (2)$$
The Jacobian $\frac{\partial h_t}{\partial h_\tau}$ can be factorized as
$$\frac{\partial h_t}{\partial h_\tau} = \frac{\partial h_t}{\partial h_{t-1}}\,\frac{\partial h_{t-1}}{\partial h_{t-2}}\cdots\frac{\partial h_{\tau+1}}{\partial h_\tau} = f_t'\, f_{t-1}' \cdots f_{\tau+1}'. \qquad (3)$$
In order to reliably "store" information in the state of the network $h_t$, the RNN dynamics must remain close to a stable attractor.
According to local stability analysis, the latter condition is met when $|f_t'| < 1$.
However, the product $\frac{\partial h_t}{\partial h_\tau}$ expanded in (3) then converges to 0 exponentially fast as $t - \tau$ increases.
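A tiny numeric illustration of this effect (the network size, spectral radius, and absence of inputs are my own choices): with per-step Jacobian norms below 1, the accumulated product $\partial h_t / \partial h_0$ shrinks exponentially with the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
W_hh = rng.normal(size=(n, n))
W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))  # spectral radius 0.9 (< 1, stable)

h = rng.normal(size=n)
jac_prod = np.eye(n)
for t in range(1, 51):
    h = np.tanh(W_hh @ h)                      # autonomous state update (no input)
    step_jac = np.diag(1.0 - h**2) @ W_hh      # Jacobian of h[t] w.r.t. h[t-1] for tanh
    jac_prod = step_jac @ jac_prod             # accumulate the product d h[t] / d h[0]
    if t % 10 == 0:
        print(f"t={t:2d}  ||dh_t/dh_0|| ~ {np.linalg.norm(jac_prod, 2):.2e}")
```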
HOW TO LIMIT THE VANISHING GRADIENT ISSUE?
Use ReLU activations (in RNNs, however, they can cause the "dying neurons" problem).
Use LSTM or GRU architectures (discussed later).
Use a proper initialization of the weights in 𝑊.
WEIGHT INITIALIZATION
A suitable initialization of the weights allows the gradient to flow more quickly through the layers.
A smoother flow ensures faster convergence of the training procedure (the minimum is reached faster).
It also helps to reduce the vanishing gradient issue.
When using sigmoid or hyperbolic tangent neurons, use the scaled weight initialization shown on the slide (a common choice is sketched below).
This has to be repeated for each layer; the value $n$ is the number of neurons in each layer.
This ensures that all neurons in the network initially have approximately the same output distribution.
The biases, instead, should be initialized to 0.
Random initialization is important to break symmetry (prevents coupling of neurons).
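The slide's formula is not reproduced in the text; a common choice matching the description, and an assumption on my part, is Gaussian weights scaled by $1/\sqrt{n}$ (with $n$ the fan-in) and zero biases:

```python
import numpy as np

# A hedged sketch of the kind of initialization described above (the 1/sqrt(n)
# scaling is a common choice and an assumption here, not the slide's exact formula).
def init_layer(n_in, n_out, rng):
    W = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)  # small random weights, ~same output variance
    b = np.zeros(n_out)                                  # biases initialized to 0
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(100, 50, rng)   # repeated for each layer
W2, b2 = init_layer(50, 10, rng)
```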
LEARNING RATE
With a high learning rate, the system has too much kinetic energy and the parameter vector bounces around
chaotically, unable to settle down into deeper, but narrower parts of the loss function.
Possible solution: anneal the learning rate over time.
Decay it slowly and you’ll be wasting computation bouncing around chaotically with little improvement for a long
time.
Decay it too aggressively and the system will cool too quickly, unable to reach the best position it can.
Three common ways to decay the learning rate (sketched in code below):
1. Step decay: halve $\gamma$ every few epochs.
2. Exponential decay: $\gamma = \gamma_0 e^{-kt}$, where $\gamma_0$ and $k$ are hyperparameters and $t$ is the epoch number.
3. Multiplicative decay: $\gamma = \frac{\gamma_0}{1 + kt}$, where $\gamma_0$ and $k$ are hyperparameters and $t$ is the epoch number.
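A minimal sketch of the three schedules (the values of $\gamma_0$, $k$ and the step period are illustrative):

```python
import math

gamma0, k, step_period = 0.1, 0.05, 10

def step_decay(t):
    return gamma0 * (0.5 ** (t // step_period))   # halve every `step_period` epochs

def exponential_decay(t):
    return gamma0 * math.exp(-k * t)              # gamma = gamma0 * e^(-k t)

def multiplicative_decay(t):
    return gamma0 / (1.0 + k * t)                 # gamma = gamma0 / (1 + k t)

for t in (0, 10, 50):
    print(t, step_decay(t), exponential_decay(t), multiplicative_decay(t))
```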
SECOND ORDER METHODS
SGD WITH MOMENTUM
Adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones.
Adagrad: $g_{i,t} = \nabla_W J(w_i)$, $\quad w_{i,t+1} = w_{i,t} - \frac{\gamma}{\sqrt{G_{t,ii} + \epsilon}}\, g_{i,t}$, where $G_t$ is a diagonal matrix whose elements are the sums of the squares of the gradients w.r.t. the individual weights $w_i$ up to time step $t$ (so each weight may be updated differently) and $\epsilon$ is a smoothing term that avoids division by zero.
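A minimal sketch of the Adagrad update on a toy objective (the objective and the constants are assumptions used only to produce gradients):

```python
import numpy as np

gamma, eps = 0.1, 1e-8
w = np.array([1.0, -2.0, 0.5])
G = np.zeros_like(w)                          # running sum of squared gradients (diagonal of G_t)

def grad_J(w):
    return w                                  # gradient of the toy objective J(w) = 0.5 * ||w||^2

for t in range(100):
    g = grad_J(w)
    G += g ** 2                               # accumulate squared gradients per weight
    w -= gamma / np.sqrt(G + eps) * g         # per-parameter learning rate
```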
Other adaptive learning rate methods are:
1. Adadelta and RMSprop: reduce Adagrad's aggressive, monotonically decreasing learning rate by restricting the window of accumulated past gradients to a fixed size.
2. Adam: adaptive learning rate + momentum.
3. Nadam: adaptive learning rate + Nesterov momentum.
COMPARISON OF LEARNING PROCESS DYNAMICS (1/2)
COMPARISON OF LEARNING PROCESS DYNAMICS (2/2)
4. RNN EXTENSIONS
DEEP RNN (1/2)
Stacked RNN (learns different time-scales at each layer).
DEEP RNN (2/2)
Deep input-to-hidden + deep hidden-to-hidden transitions.
Deep input-to-hidden + deep hidden-to-hidden transitions + shortcut connections (useful for letting the gradient flow more easily across layers).
DEEP (BIDIRECTIONAL) RNN
Similar to Bidirectional RNNs, but with multiple hidden layers per time
step.
Higher learning capacity.
Needs a lot of training data (the deeper the architecture, the harder the training).
Graves, A., Jaitly, N., and Mohamed, A., "Hybrid speech recognition with deep bidirectional LSTM", 2013.
5. GATED RNNS
As the gap in time grows, it becomes harder for an RNN to handle the problem.
Let's see how Long Short-Term Memory (LSTM) networks can handle this difficulty.
LSTM OVERVIEW
The processing unit of the LSTM is more complex than a standard neuron and is called a cell.
An LSTM cell is composed of 4 layers, interacting with each other in a special way.
CELL STATE AND GATES
[Figure: the cell state and a gate.]
FORGET GATE
Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, with $0 \rightarrow$ completely forget, $1 \rightarrow$ completely keep the previous cell state.
Update gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, with $0 \rightarrow$ no update, $1 \rightarrow$ completely update.
Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, with $0 \rightarrow$ no output, $1 \rightarrow$ return the complete cell state.
Cell output: $h_t = o_t * \tanh(C_t)$.
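A minimal sketch of one LSTM forward step based on the gates above; the candidate cell-state update $C_t = f_t * C_{t-1} + i_t * \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ is the standard formulation and is assumed here, since it is not spelled out in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM forward step, following the gate equations above (assumed standard cell update).
def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # update (input) gate
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate cell state (assumed, standard LSTM)
    C_t = f_t * C_prev + i_t * C_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)              # cell output
    return h_t, C_t
```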
COMPLETE FORWARD PROPAGATION STEP
Gates are never exactly 0 or 1, so the content of the cell is inevitably corrupted after a long time.
Even if the LSTM provides a huge improvement over the vanilla RNN, it still struggles with very long time dependencies.
GRU FORWARD PROPAGATION STEP
ESN OVERVIEW
Introduced by Jaeger (2001).
Together with Liquid State Machines, they form the family of Reservoir Computing approaches.
Large, untrained recurrent layer (reservoir).
It has to be sparse and as heterogeneous as possible to generate a large variety of internal dynamics.
A linear, memoryless readout, trained with linear regression, picks the dynamics useful to solve the task at hand.
State of the art in real-valued time series prediction.
Pros:
Fast and easy to train (no backpropagation).
Relatively small; they are used in embedded systems.
Cons:
High sensitivity to hyperparameters (parameters of the model that are set by hand rather than learned).
Random initialization adds stochasticity to the results.
Dashed boxes are randomly initialized weights, left untrained.
Solid boxes are readout weights, usually trained with linear
regression.
Outputs of the network are forced to match target values.
𝐿2 regularization is used in the linear regression to prevent
overfitting.
ESN state and output update:
$h[t+1] = f(W_{rr}\, h[t] + W_{ir}\, x[t+1] + W_{or}\, y[t])$
$y[t+1] = g(W_{ro}\, h[t+1] + W_{io}\, x[t+1])$
𝑓() is usually implemented as a tanh.
g() is usually the identity function (more complicated readouts
are possible though).
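A minimal ESN sketch based on the equations above (the output-feedback term $W_{or}\, y[t]$ is omitted, and the sizes, scalings and toy data are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, spectral_radius, input_scaling, ridge = 1, 300, 0.9, 0.5, 1e-6

W_rr = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)  # sparse, untrained reservoir
W_rr *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_rr)))            # rescale the dynamics
W_ir = rng.uniform(-input_scaling, input_scaling, size=(n_res, n_in))

def run_reservoir(X):
    """Collect reservoir states for an input sequence X of shape (T, n_in)."""
    h, states = np.zeros(n_res), []
    for x in X:
        h = np.tanh(W_rr @ h + W_ir @ x)      # untrained recurrent layer
        states.append(h.copy())
    return np.array(states)

def train_readout(states, Y):
    """L2-regularized (ridge) linear regression for the readout W_ro."""
    return np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ Y)

# Example: one-step-ahead prediction of a sine wave (illustrative data only).
t = np.linspace(0, 20 * np.pi, 2000)
X, Y = np.sin(t)[:-1, None], np.sin(t)[1:, None]
H = run_reservoir(X)
W_ro = train_readout(H[200:], Y[200:])        # discard an initial transient
print("train MSE:", np.mean((H[200:] @ W_ro - Y[200:]) ** 2))
```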
PRINCIPAL ESN HYPERPARAMETERS
Reservoir size:
Should be large enough to provide a great variety of dynamics.
If too large, it can overfit (the readout just learns a simple one-to-one mapping from network state to output); the effect is damped by regularization.
Reservoir spectral radius:
It is the largest absolute value among the eigenvalues of the reservoir matrix.
It controls the dynamics of the system and should be tuned to provide stable yet sufficiently rich dynamics (edge of chaos).
Input scaling:
Controls the amount of nonlinearity introduced by the neurons.
High value → high amount of nonlinearity.
COOL PEOPLE IN DEEP LEARNING
People whose work (partially) inspired this presentation.
C. Olah, A. Karpathy
THE END
THANKS FOR YOUR ATTENTION