Bianchi
A QUICK OVERVIEW
1. Introduction
2. The Recurrent Neural Network
3. RNN applications
4. The RNN model
5. Gated RNN (LSTM and GRU)
6. Echo State Network
1. INTRODUCTION
THE RECURRENT NEURAL NETWORK
A recurrent neural network (RNN) is a universal approximator of dynamical systems.
It can be trained to reproduce any target dynamics, up to a given degree of precision.
[Figure: the evolution of the RNN internal state mirrors the evolution of the state of the observed dynamical system.]
An RNN generalizes naturally to inputs of any length.
An RNN makes use of sequential information by modelling temporal dependencies in the inputs.
Example: to predict the next word in a sentence, you need to know which words came before it.
The output of the network depends on the current input and on the value of the previous internal state.
The internal state maintains a (vanishing) memory of the history of all past inputs.
RNNs can in principle use information from arbitrarily long sequences, but in practice they can only look back a few time steps.
An RNN can be trained to predict a future value of the driving input.
As a side effect, we obtain a generative model, which allows us to generate new elements by sampling from the output probabilities.
[Figure: during training (teacher forcing) the RNN receives the true previous output as input; in generative mode its own output is fed back as the next input.]
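A minimal sketch of the generative mode, assuming a step function that returns output probabilities (the `step_fn` interface and the one-hot feedback are illustrative choices, not part of the slides):

```python
import numpy as np

# Generative mode: after training with teacher forcing, feed the sampled output
# back as the next input. `step_fn(x, h) -> (probs, h_new)` is an assumed interface.
def sample_sequence(step_fn, h0, x0, n_steps, rng=np.random.default_rng(0)):
    h, x, generated = h0, x0, []
    for _ in range(n_steps):
        probs, h = step_fn(x, h)               # one RNN step in generative mode
        idx = rng.choice(len(probs), p=probs)  # sample from the output probabilities
        generated.append(idx)
        x = np.eye(len(probs))[idx]            # feedback: the sample becomes the next input
    return generated
```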
DIFFERENCES WITH CNN
[Figure: RNNs handle temporal data (a signal $x$ evolving over time), whereas CNNs handle spatial data.]
2. RNN APPLICATIONS
APPLICATION I: NATURAL LANGUAGE PROCESSING
NATURAL LANGUAGE GENERATION: SHAKESPEARE
[Source: Andrej Karpathy]
TEXT GENERATION: WIKIPEDIA
TEXT GENERATION: SCIENTIFIC PAPER
[Source: Andrej Karpathy]
APPLICATION II: MACHINE TRANSLATION
APPLICATION VI: MUSIC INFORMATION RETRIEVAL
Music transcription example
ARCHITECTURE COMPONENTS
STATE UPDATE AND OUTPUT GENERATION
$h[t+1] = f(W_{hh}\, h[t] + W_{ih}\, x[t+1] + b_h),$
$y[t+1] = g(W_{ho}\, h[t+1] + b_o).$
𝑓() is the transfer function implemented by each neuron (usually the same non-linear
function for all neurons).
$g()$ is the readout of the RNN. It is usually the identity function (all the non-linearity is provided by the internal processing units, i.e. the neurons) or the softmax function.
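A minimal numpy sketch of this update, assuming tanh for $f$ and the identity for $g$ (sizes are illustrative):

```python
import numpy as np

# One step of the state update and readout above (assumed sizes and activations).
rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 8, 2
W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
W_ih = rng.normal(scale=0.1, size=(n_h, n_in))
W_ho = rng.normal(scale=0.1, size=(n_out, n_h))
b_h, b_o = np.zeros(n_h), np.zeros(n_out)

def rnn_step(x_next, h_prev):
    h_next = np.tanh(W_hh @ h_prev + W_ih @ x_next + b_h)  # state update, f = tanh
    y_next = W_ho @ h_next + b_o                            # readout, g = identity
    return h_next, y_next

h = np.zeros(n_h)
for x in rng.normal(size=(5, n_in)):   # run over a short random input sequence
    h, y = rnn_step(x, h)
```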
NEURON TRANSFER FUNCTION
RNN UNFOLDING
In the example, we need to backpropagate the gradient $\frac{\partial E_3}{\partial W}$ from the current time ($t_3$) back to the initial time ($t_0$), using the chain rule:
$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_k}\,\frac{\partial h_k}{\partial W}.$$
We sum up the contributions of each time step to the gradient.
With a very long sequence (possibly infinite) the unfolded depth becomes intractable.
Solution: repeat the procedure only up to a given time in the past (truncated BPTT).
Why does it work? Because each state carries a little bit of information about every previous input.
Once the network is unfolded, the procedure is analogous to standard backpropagation in a feedforward network.
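A minimal sketch of truncated BPTT for the vanilla RNN above, assuming a squared-error loss on the last output only and a truncation horizon of `k_trunc` steps (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out, k_trunc = 3, 16, 2, 5   # k_trunc = how far back we unfold

W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
W_ih = rng.normal(scale=0.1, size=(n_h, n_in))
W_ho = rng.normal(scale=0.1, size=(n_out, n_h))

def truncated_bptt_grads(xs, target, h0):
    """Forward over the last k_trunc inputs, then backprop through those steps only."""
    xs_t = xs[-k_trunc:]                          # truncation: ignore older inputs
    hs = [h0]
    for x in xs_t:
        hs.append(np.tanh(W_hh @ hs[-1] + W_ih @ x))
    y = W_ho @ hs[-1]                             # linear readout at the last step
    dy = y - target                               # gradient of 0.5 * ||y - target||^2
    dW_ho = np.outer(dy, hs[-1])
    dW_hh, dW_ih = np.zeros_like(W_hh), np.zeros_like(W_ih)
    dh = W_ho.T @ dy                              # gradient flowing into the last state
    for step in range(len(hs) - 1, 0, -1):        # walk back through the unrolled steps
        dz = dh * (1.0 - hs[step] ** 2)           # through the tanh nonlinearity
        dW_hh += np.outer(dz, hs[step - 1])       # sum contributions of each time step
        dW_ih += np.outer(dz, xs_t[step - 1])
        dh = W_hh.T @ dz                          # pass the gradient to the previous state
    return dW_hh, dW_ih, dW_ho

grads = truncated_bptt_grads([rng.normal(size=n_in) for _ in range(20)],
                             np.zeros(n_out), np.zeros(n_h))
```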
A simple RNN is trained to keep 1 bit of information for $T$ time steps (Bengio, 1991).
$P(\mathrm{success} \mid T)$ decreases exponentially as $T$ increases.
[Image: Yoshua Bengio]
VANISHING GRADIENT: TOO MANY PRODUCTS!
In order to have (local) stability, the spectral radius of the matrix 𝑊ℎℎ must be lower than 1.
Consider the state update equation $h[t+1] = f(h[t], u[t+1])$: it is a recursive equation.
When the input sequence is given, it can be rewritten explicitly as
$$h[t+1] = f_t(h[t]) = f_t(f_{t-1}(\dots f_0(h[0]))). \qquad (1)$$
The resulting gradient, relative to the loss at time $t$, is
$$\frac{\partial L_t}{\partial W} = \sum_{\tau} \frac{\partial L_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_\tau}\,\frac{\partial h_\tau}{\partial W}. \qquad (2)$$
The Jacobian $\frac{\partial h_t}{\partial h_\tau}$ can be factorized as
$$\frac{\partial h_t}{\partial h_\tau} = \frac{\partial h_t}{\partial h_{t-1}}\,\frac{\partial h_{t-1}}{\partial h_{t-2}}\cdots\frac{\partial h_{\tau+1}}{\partial h_\tau} = f_t'\, f_{t-1}' \cdots f_{\tau+1}'. \qquad (3)$$
In order to reliably "store" information in the state of the network $h_t$, the RNN dynamics must remain close to a stable attractor.
According to local stability analysis, the latter condition is met when $|f_t'| < 1$.
However, the product $\frac{\partial h_t}{\partial h_\tau}$ expanded in (3) then converges to 0 exponentially fast as $t - \tau$ increases.
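A tiny numeric illustration of this effect (the network size, spectral radius, and absence of inputs are my own choices): with per-step Jacobian norms below 1, the accumulated product $\partial h_t / \partial h_0$ shrinks exponentially with the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
W_hh = rng.normal(size=(n, n))
W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))  # spectral radius 0.9 (< 1, stable)

h = rng.normal(size=n)
jac_prod = np.eye(n)
for t in range(1, 51):
    h = np.tanh(W_hh @ h)                      # autonomous state update (no input)
    step_jac = np.diag(1.0 - h**2) @ W_hh      # Jacobian of h[t] w.r.t. h[t-1] for tanh
    jac_prod = step_jac @ jac_prod             # accumulate the product d h[t] / d h[0]
    if t % 10 == 0:
        print(f"t={t:2d}  ||dh_t/dh_0|| ~ {np.linalg.norm(jac_prod, 2):.2e}")
```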
HOW TO LIMIT THE VANISHING GRADIENT ISSUE?
Use ReLU activations (in RNNs, however, they can cause the "dying neurons" problem).
Use LSTM or GRU architectures (discussed later).
Use a proper initialization of the weights in 𝑊.
WEIGHT INITIALIZATION
A suitable initialization of the weights allows the gradient to flow more quickly through the layers.
A smoother flow ensures faster convergence of the training procedure (the minimum is reached faster).
It also helps to reduce the vanishing gradient issue.
When using sigmoid or hyperbolic tangent neurons, use the scaled weight initialization shown on the slide (a common choice is sketched below).
This has to be repeated for each layer; the value $n$ is the number of neurons in each layer.
This ensures that all neurons in the network initially have approximately the same output distribution.
The biases, instead, should be initialized to 0.
Random initialization is important to break symmetry (prevents coupling of neurons).
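The slide's formula is not reproduced in the text; a common choice matching the description, and an assumption on my part, is Gaussian weights scaled by $1/\sqrt{n}$ (with $n$ the fan-in) and zero biases:

```python
import numpy as np

# A hedged sketch of the kind of initialization described above (the 1/sqrt(n)
# scaling is a common choice and an assumption here, not the slide's exact formula).
def init_layer(n_in, n_out, rng):
    W = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)  # small random weights, ~same output variance
    b = np.zeros(n_out)                                  # biases initialized to 0
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(100, 50, rng)   # repeated for each layer
W2, b2 = init_layer(50, 10, rng)
```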
LEARNING RATE
With a high learning rate, the system has too much kinetic energy and the parameter vector bounces around
chaotically, unable to settle down into deeper, but narrower parts of the loss function.
Possible solution: anneal the learning rate over time.
Decay it slowly and you’ll be wasting computation bouncing around chaotically with little improvement for a long
time.
Decay it too aggressively and the system will cool too quickly, unable to reach the best position it can.
Three common ways to decay the learning rate (sketched in code below):
1. Step decay: halve $\gamma$ every few epochs.
2. Exponential decay: $\gamma = \gamma_0 e^{-kt}$, where $\gamma_0$ and $k$ are hyperparameters and $t$ is the epoch number.
3. Multiplicative decay: $\gamma = \frac{\gamma_0}{1 + kt}$, where $\gamma_0$ and $k$ are hyperparameters and $t$ is the epoch number.
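A minimal sketch of the three schedules (the values of $\gamma_0$, $k$ and the step period are illustrative):

```python
import math

gamma0, k, step_period = 0.1, 0.05, 10

def step_decay(t):
    return gamma0 * (0.5 ** (t // step_period))   # halve every `step_period` epochs

def exponential_decay(t):
    return gamma0 * math.exp(-k * t)              # gamma = gamma0 * e^(-k t)

def multiplicative_decay(t):
    return gamma0 / (1.0 + k * t)                 # gamma = gamma0 / (1 + k t)

for t in (0, 10, 50):
    print(t, step_decay(t), exponential_decay(t), multiplicative_decay(t))
```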
SECOND ORDER METHODS
SGD WITH MOMENTUM
Adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones.
Adagrad: $g_{i,t} = \nabla_W J(w_i)$, $\quad w_{i,t+1} = w_{i,t} - \frac{\gamma}{\sqrt{G_{t,ii} + \epsilon}}\, g_{i,t}$, where $G_t$ is a diagonal matrix whose elements are the sums of the squares of the gradients w.r.t. the individual weights $w_i$ up to time step $t$ (so each weight may be updated differently) and $\epsilon$ is a smoothing term that avoids division by zero.
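A minimal sketch of the Adagrad update on a toy objective (the objective and the constants are assumptions used only to produce gradients):

```python
import numpy as np

gamma, eps = 0.1, 1e-8
w = np.array([1.0, -2.0, 0.5])
G = np.zeros_like(w)                          # running sum of squared gradients (diagonal of G_t)

def grad_J(w):
    return w                                  # gradient of the toy objective J(w) = 0.5 * ||w||^2

for t in range(100):
    g = grad_J(w)
    G += g ** 2                               # accumulate squared gradients per weight
    w -= gamma / np.sqrt(G + eps) * g         # per-parameter learning rate
```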
Other adaptive learning rate methods are:
1. Adadelta and RMSprop: reduce Adagrad's aggressive, monotonically decreasing learning rate by restricting the window of accumulated past gradients to a fixed size.
2. Adam: adaptive learning rate + momentum.
3. Nadam: adaptive learning rate + Nesterov momentum.
COMPARISON OF LEARNING PROCESS DYNAMICS (1/2)
COMPARISON OF LEARNING PROCESS DYNAMICS (2/2)
4. RNN EXTENSIONS
DEEP RNN (1/2)
Stacked RNN (learns different time-scales at each layer).
DEEP RNN (2/2)
Deep input-to-hidden + deep hidden-to-hidden transitions.
Deep input-to-hidden + deep hidden-to-hidden transitions + shortcut connections (useful for letting the gradient flow more easily across layers).
DEEP (BIDIRECTIONAL) RNN
Similar to Bidirectional RNNs, but with multiple hidden layers per time
step.
Higher learning capacity.
Needs a lot of training data (the deeper the architecture, the harder the training).
Graves, A., Jaitly, N., and Mohamed, A., "Hybrid speech recognition with deep bidirectional LSTM", 2013.
5. GATED RNNS
As the gap in time grows, it becomes harder for an RNN to handle the problem.
Let's see how Long Short-Term Memory (LSTM) networks can handle this difficulty.
LSTM OVERVIEW
The processing unit of the LSTM is more complex than a standard neuron and is called a cell.
An LSTM cell is composed of 4 layers, interacting with each other in a special way.
CELL STATE AND GATES
[Figure: the cell state and a gate.]
FORGET GATE
Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, with $0 \rightarrow$ completely forget, $1 \rightarrow$ completely keep the previous cell state.
Update gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, with $0 \rightarrow$ no update, $1 \rightarrow$ completely update.
Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, with $0 \rightarrow$ no output, $1 \rightarrow$ return the complete cell state.
Cell output: $h_t = o_t * \tanh(C_t)$.
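A minimal sketch of one LSTM forward step based on the gates above; the candidate cell-state update $C_t = f_t * C_{t-1} + i_t * \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ is the standard formulation and is assumed here, since it is not spelled out in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM forward step, following the gate equations above (assumed standard cell update).
def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # update (input) gate
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate cell state (assumed, standard LSTM)
    C_t = f_t * C_prev + i_t * C_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)              # cell output
    return h_t, C_t
```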
COMPLETE FORWARD PROPAGATION STEP
Gates are never exactly 0 or 1, so the content of the cell is inevitably corrupted after a long time.
Even if the LSTM provides a huge improvement over the vanilla RNN, it still struggles with very long time dependencies.
GRU FORWARD PROPAGATION STEP
ESN OVERVIEW
Introduced by Jaeger (2001).
Together with Liquid State Machines, they form the family of Reservoir Computing approaches.
Large, untrained recurrent layer (reservoir).
It has to be sparse and as heterogeneous as possible to generate a large variety of internal dynamics.
A linear, memoryless readout, trained with linear regression, picks the dynamics useful to solve the task at hand.
State of the art in real-valued time series prediction.
Pros:
Fast and easy to train (no backpropagation).
Relatively small; they are used in embedded systems.
Cons:
High sensitivity to hyperparameters (parameters of the model that are set by hand rather than learned).
Random initialization adds stochasticity to the results.
Dashed boxes are randomly initialized weights, left untrained.
Solid boxes are readout weights, usually trained with linear
regression.
Outputs of the network are forced to match target values.
𝐿2 regularization is used in the linear regression to prevent
overfitting.
ESN state and output update:
$h[t+1] = f(W_{rr}\, h[t] + W_{ir}\, x[t+1] + W_{or}\, y[t])$
$y[t+1] = g(W_{ro}\, h[t+1] + W_{io}\, x[t+1])$
𝑓() is usually implemented as a tanh.
g() is usually the identity function (more complicated readouts
are possible though).
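A minimal ESN sketch based on the equations above (the output-feedback term $W_{or}\, y[t]$ is omitted, and the sizes, scalings and toy data are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, spectral_radius, input_scaling, ridge = 1, 300, 0.9, 0.5, 1e-6

W_rr = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)  # sparse, untrained reservoir
W_rr *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_rr)))            # rescale the dynamics
W_ir = rng.uniform(-input_scaling, input_scaling, size=(n_res, n_in))

def run_reservoir(X):
    """Collect reservoir states for an input sequence X of shape (T, n_in)."""
    h, states = np.zeros(n_res), []
    for x in X:
        h = np.tanh(W_rr @ h + W_ir @ x)      # untrained recurrent layer
        states.append(h.copy())
    return np.array(states)

def train_readout(states, Y):
    """L2-regularized (ridge) linear regression for the readout W_ro."""
    return np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ Y)

# Example: one-step-ahead prediction of a sine wave (illustrative data only).
t = np.linspace(0, 20 * np.pi, 2000)
X, Y = np.sin(t)[:-1, None], np.sin(t)[1:, None]
H = run_reservoir(X)
W_ro = train_readout(H[200:], Y[200:])        # discard an initial transient
print("train MSE:", np.mean((H[200:] @ W_ro - Y[200:]) ** 2))
```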
PRINCIPAL ESN HYPERPARAMETERS
Reservoir size:
Should be large enough to provide a great variety of dynamics.
If too large, it can overfit (the readout just learns a simple one-to-one mapping from network state to output); the effect is damped by regularization.
Reservoir spectral radius:
It is the largest absolute value among the eigenvalues of the reservoir matrix.
It controls the dynamics of the system and should be tuned to provide stable yet sufficiently rich dynamics (edge of chaos).
Input scaling:
Controls the amount of nonlinearity introduced by the neurons.
High value → high amount of nonlinearity.
COOL PEOPLE IN DEEP LEARNING
People whose work (partially) inspired this presentation.
C. Olah, A. Karpathy
THE END
THANKS FOR YOUR ATTENTION