
Lecture-05: Recurrent Neural Networks

(Deep Learning & AI)


Speaker: Pankaj Gupta
PhD Student (Advisor: Prof. Hinrich Schütze) CIS, University of Munich (LMU)
Research Scientist (NLP/Deep Learning), Machine Intelligence, Siemens AG | Nov 2018
Lecture Outline

 Motivation: Sequence Modeling


 Understanding Recurrent Neural Networks (RNNs)
 Challenges in vanilla RNNs: Exploding and Vanishing gradients. Why? Remedies?
 RNN variants:
o Long Short Term Memory (LSTM) networks, Gated recurrent units (GRUs)
o Bi-directional Sequence Learning
o Recursive Neural Networks (RecNNs): TreeRNNs and TreeLSTMs
o Deep, Multi-tasking and Generative RNNs (overview)
 Attention Mechanism: Attentive RNNs
 RNNs in Practice + Applications
 Introduction to Explainability/Interpretability of RNNs

Motivation: Need for Sequential Modeling

Why do we need Sequential Modeling?

Motivation: Need for Sequential Modeling

Examples of sequence data (Input Data → Output):

- Speech recognition: audio clip → "This is RNN"
- Machine translation: "Hallo, ich bin Pankaj." → "Hello, I am Pankaj." (and the corresponding Hindi translation)
- Language modeling: "Recurrent neural ___ based ___ model" → "network", "language"
- Named entity recognition: "Pankaj lives in Munich" → Pankaj = person, Munich = location
- Sentiment classification: "There is nothing to like in this movie." → sentiment label
- Video activity analysis: video clip → "Punching"
Motivation: Need for Sequential Modeling

Inputs, Outputs can be different lengths in different examples


Example:
Sentence1: Pankaj lives in Munich
Sentence2: Pankaj Gupta lives in Munich DE

Motivation: Need for Sequential Modeling

Inputs, Outputs can be different lengths in different examples


Example:
Sentence1: Pankaj lives in Munich
Sentence2: Pankaj Gupta lives in Munich DE   (additional words)

An FF-net / CNN needs a fixed input size, so the shorter sentence is padded with an additional word 'PAD', i.e., padding:

Sentence1 labels: Pankaj → person, lives → other, in → other, Munich → location, PAD → other
Sentence2 labels: Pankaj → person, Gupta → person, lives → other, in → other, Munich → location, Germany → location

*FF-net: Feed-forward network
Motivation: Need for Sequential Modeling

Inputs, Outputs can be different lengths in different examples


Example:
Sentence1: Pankaj lives in Munich
Sentence2: Pankaj Gupta lives in Munich DE

A sequential model (RNN) models variable-length sequences directly, without padding: it reads "Pankaj lives in Munich" and outputs person, other, other, location, and reads "Pankaj Gupta lives in Munich Germany" and outputs person, person, other, other, location, location.

Sequential model: RNN
Motivation: Need for Sequential Modeling

Share Features learned across different positions or time steps


Example:
Sentence1: "Market falls into bear territory" → Trading/Marketing
Sentence2: "Bear falls into market territory" → UNK

Both sentences have the same unigram statistics.
Motivation: Need for Sequential Modeling

Share Features learned across different positions or time steps


Example:
Sentence1: "Market falls into bear territory" → Trading/Marketing
Sentence2: "Bear falls into market territory" → UNK

An FF-net / CNN over bag-of-words features performs no sequential or temporal modeling, i.e., it is order-less: it treats the two sentences the same.
Motivation: Need for Sequential Modeling

Share Features learned across different positions or time steps


Example:
Sentence1: "Market falls into bear territory" → Trading/Marketing
Sentence2: "Bear falls into market territory" → UNK

A sequential model (RNN) captures language concepts, word ordering, and syntactic & semantic information, so it maps "market falls into bear territory" → Trading and "bear falls into market territory" → UNK, while the order-less FF-net / CNN cannot distinguish them. The direction of information flow matters!

Sequential model: RNN
Motivation: Need for Sequential Modeling

Machine Translation: different input and output sizes, incurring sequential patterns

The encoder encodes the input text "Pankaj lives in Munich"; the decoder generates the translation, e.g., "pankaj lebt in münchen" (German) or the corresponding Hindi translation.
Motivation: Need for Sequential Modeling

Convolutional vs Recurrent Neural Networks

RNN
- performs well when the input data is interdependent in a sequential pattern
- exploits the correlation between the previous input and the next input
- its output is biased by the previous outputs (hidden state)

CNN / FF-nets
- all outputs are independent of one another
- feed-forward nets do not remember historic input data at test time, unlike recurrent networks
Motivation: Need for Sequential Modeling

Memory-less Models

Autoregressive models: predict the next input in a sequence from a fixed number of previous inputs using "delay taps".
Feed-forward neural networks: generalize autoregressive models by using non-linear hidden layers.

Memory Networks

Possess a dynamic hidden state that can store long-term information, e.g., RNNs.
RNNs are very powerful, because they combine the following properties:
- Distributed hidden state: can efficiently store a lot of information about the past.
- Non-linear dynamics: can update their hidden state in complicated ways.
- Temporal and accumulative: can build semantics, e.g., word-by-word in sequence over time.
Notations

• h_t: hidden unit (hidden state)
• x_t: input
• o_t: output
• W_hh: shared recurrent weight parameter
• W_ho: weight parameter between the hidden layer and the output
• θ: parameters in general
• g_θ: non-linear function
• L_t: loss between the RNN outputs and the true output
• E_t: cross-entropy loss
Long-Term and Short-Term Dependencies

Short-term dependencies
 need recent information to perform the present task.
 For example, in a language model, predict the next word based on the previous ones:
 "the clouds are in the ___" → "sky"
 Easier to predict "sky" given the context, i.e., a short-term dependency.

Long-term dependencies
 Consider a longer word sequence: "I grew up in France… …………………… I speak fluent French."
 Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of "France", from further back.
Foundation of Recurrent Neural Networks

Goal
 model long term dependencies
 connect previous information to the present task
 model sequence of events with loops, allowing information to persist

Foundation of Recurrent Neural Networks

Goal
 model long term dependencies
 connect previous information to the present task
 model sequence of events with loops, allowing information to persist
Feed-forward NNets cannot take time dependencies into account.
Sequential data needs a feedback mechanism.

[Figure: a feed-forward net / CNN mapping x to o, contrasted with a recurrent neural network (RNN) whose feedback mechanism (internal state loop, weights W_hh) is unfolded in time into inputs x_0 … x_T and outputs o_0 … o_T.]
Foundation of Recurrent Neural Networks

[Figure: a recurrent neural network tagging the input sequence "Pankaj lives in Munich" over time: a one-hot input layer feeds a hidden layer through W_xh, the hidden states are connected across time steps through W_hh, and W_ho maps each hidden state to an output/softmax layer giving the label distribution (person / location / other) at every position, yielding person, other, other, location.]
(Vanilla) Recurrent Neural Network

Process a sequence of vectors x by applying a recurrence at every time step:

h_t = g_θ(h_{t−1}, x_t)

i.e., the new hidden state at time step t is some function g with parameters (W_hh, W_xh) of the old hidden state at time step t−1 and the input vector at time step t.

[Figure: the feedback mechanism / internal state loop of a vanilla RNN, unfolded in time.]

Remark: the same function g and the same set of parameters W are used at every time step.
(Vanilla) Recurrent Neural Network

Process a sequence of vectors x by applying a recurrence at every time step:

h_t = g_θ(h_{t−1}, x_t)
h_t = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = softmax(W_ho h_t)

Remark: an RNN can be seen as a selective summarization of the input sequence into a fixed-size state/hidden vector via a recursive update.
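To make the recurrence concrete, below is a minimal NumPy sketch of one vanilla-RNN time step (an illustration, not code from the slides; bias terms are omitted and inputs are assumed to be 1-D vectors):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_ho):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # o_t = softmax(W_ho h_t)
    z_t = W_ho @ h_t
    e = np.exp(z_t - z_t.max())
    o_t = e / e.sum()
    return h_t, o_t

The same W_xh, W_hh, W_ho are reused at every time step; only x_t and the previous hidden state change.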
Recurrent Neural Network: Probabilistic Interpretation

RNN as a generative model
 induces a set of procedures to model the conditional distribution of x_{t+1} given x_{≤t}, for all t = 1, …, T
 think of the output as the probability distribution of x_t given the previous elements of the sequence

[Figure: a generative RNN that reads <bos>, x_0, …, x_T and predicts x_0, x_1, …, <eos> at the following time steps.]

Training: compute the probability of the sequence and perform maximum-likelihood training.

Details: https://www.cs.cmu.edu/~epxing/Class/10708-17/project-reports/project10.pdf
RNN: Computational Graphs

[Figure: the RNN computational graph — an initial state A_0 and a sequence of inputs x_1, x_2, x_3 are fed through the recurrent function g_θ (shared parameters θ), each application producing the next state and an output o_1, o_2, o_3.]
RNN: Different Computational Graphs
[Figure: RNN computational graph variants]
- One to one: x_1 → o_1
- One to many: x_1 → o_1, o_2, o_3
- Many to one: x_1, x_2, x_3, x_4 → o_1
- Many to many: sequence input → sequence output (two variants shown)
Backpropagation through time (BPTT) in RNN

 Training recurrent networks via BPTT
 Compute gradients via backpropagation on the (multi-layer) unrolled model
 Think of the recurrent net as a layered, feed-forward net with shared weights, and then train that feed-forward net in the time domain
 Forward through the entire sequence → compute the loss; backward through the entire sequence → compute the gradient

Training algorithm in the time domain:
 The forward pass builds up a stack of the activities of all the units at each time step.
 The backward pass peels activities off the stack to compute the error derivatives at each time step.
 After the backward pass we add together the derivatives at all the different times for each weight.

Lecture from the course Neural Networks for Machine Learning
Backpropagation through time (BPTT) in RNN

 Training recurrent networks via BPTT

The output at time t = T depends on the inputs from t = T back to t = 1.

[Figure: a three-step unrolled RNN with inputs x_1, x_2, x_3, hidden states h_1, h_2, h_3, outputs o_1, o_2, o_3 and losses E_1, E_2, E_3; arrows indicate the direction of the forward pass.]
Backpropagation through time (BPTT) in RNN

 Training recurrent networks via BPTT

The output at time t = T depends on the inputs from t = T back to t = 1.

[Figure: the same unrolled RNN, now showing the backward pass (gradient flow): the error derivatives ∂E_1/∂h_1, ∂E_2/∂h_2, ∂E_3/∂h_3 enter at each output, and ∂h_2/∂h_1, ∂h_3/∂h_2 carry the gradient backwards in time.]
Backpropagation through time (BPTT) in RNN

 Training recurrent networks via BPTT

The output at time t = T depends on the inputs from t = T back to t = 1.

Let us take our loss/error function to be the cross entropy:

E_t(o_t′, o_t) = −o_t′ log o_t
E(o_t′, o_t) = Σ_t E_t(o_t′, o_t) = −Σ_t o_t′ log o_t

where o_t′ are the truth values.
Backpropagation through time (BPTT) in RNN

The output at time t = 3 depends on the inputs from t = 3 back to t = 1.

Writing the gradients in a sum-of-products form:

∂E/∂θ = Σ_{1≤t≤3} ∂E_t/∂θ

∂E_3/∂W_ho = (∂E_3/∂o_3)(∂o_3/∂W_ho) = (∂E_3/∂o_3)(∂o_3/∂z_3)(∂z_3/∂W_ho)

where z_3 = W_ho h_3, i.e., o_3 = softmax(z_3). This gives

∂E_3/∂W_ho = o_3′ (o_3 − 1) × (h_3)   (where × = outer product)

i.e., ∂E_3/∂W_ho depends only on o_3, o_3′ and h_3.
Backpropagation through time (BPTT) in RNN

How do we get ∂E_3/∂W_ho = o_3′(o_3 − 1) × (h_3)? (proof, offline)

E_3(o_3′, o_3) = −o_3′ log o_3, with o_3 = softmax(z_3) and z_3 = W_ho h_3

∂E_3/∂z_3 = −o_3′ ∂log(o_3)/∂z_3

o_3 = (1/Ω) e^{z_3} with Ω = Σ_i e^{z_i}, so log o_3 = z_3 − log Ω

∂log(o_3)/∂z_3 = 1 − (1/Ω) ∂Ω/∂z_3 = 1 − o_3   (since ∂Ω/∂z_k = e^{z_k})

∂o_3/∂z_3 = o_3 (1 − o_3)

∂E_3/∂z_3 = −o_3′ (1 − o_3) = o_3′ (o_3 − 1)

∂E_3/∂W_ho = (∂E_3/∂o_3)(∂o_3/∂z_3)(∂z_3/∂W_ho) = (∂E_3/∂z_3)(∂z_3/∂W_ho) = o_3′ (o_3 − 1) × (h_3)

1. http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
2. https://stats.stackexchange.com/questions/235528/backpropagation-with-softmax-cross-entropy
Backpropagation through time (BPTT) in RNN

The output at time t = 3 depends on the inputs from t = 3 back to t = 1.

Writing the gradients in a sum-of-products form:

∂E/∂θ = Σ_{1≤t≤3} ∂E_t/∂θ;   ∂E_3/∂W_hh = (∂E_3/∂h_3)(∂h_3/∂W_hh)

Since h_3 depends on h_2 and h_2 depends on h_1:

∂E_3/∂W_hh = Σ_{k=1}^{3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh),   e.g., ∂h_3/∂h_1 = (∂h_3/∂h_2)(∂h_2/∂h_1)

In general:

∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
Backpropagation through time (BPTT) in RNN

The term ∂h_t/∂h_k transports the error in time from step t back to step k:

∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T diag[g′(h_{i−1})]

Here W_hh^T is the (shared) recurrent weight matrix, diag[g′(h_{i−1})] is the derivative of the activation function, and ∂h_t/∂h_k is a product of Jacobian matrices.

Repeated matrix multiplication leads to vanishing and exploding gradients.
BPTT: Gradient Flow

∂E_3/∂W_hh = Σ_{k=1}^{3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh)

Expanded:

∂E_3/∂W_hh = (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh)
           + (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W_hh)
           + (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂W_hh)

[Figure: the three gradient paths through the unrolled network h_1 → h_2 → h_3 with shared W_hh.]
Backpropagation through time (BPTT) in RNN

Code snippet for forward propagation (shown before the BPTT code) — offline, not reproduced here.

https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
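Since the original snippet is not included, here is a minimal NumPy sketch of forward propagation through time that caches the hidden states and outputs needed later by BPTT (an illustration under the slide's notation, with bias terms omitted):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, h0, W_xh, W_hh, W_ho):
    # xs: list of input vectors x_1..x_T; returns cached hidden states and outputs
    h = {0: h0}
    o = {}
    for t in range(1, len(xs) + 1):
        h[t] = np.tanh(W_xh @ xs[t - 1] + W_hh @ h[t - 1])   # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        o[t] = softmax(W_ho @ h[t])                          # o_t = softmax(W_ho h_t)
    return h, o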
Backpropagation through time (BPTT) in RNN

Code snippet for backpropagation w.r.t. time — offline, not reproduced here.

∂E_t/∂W_hh = Σ_{k=1}^{t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)

Expanding the sum (tanh hidden units, softmax cross-entropy outputs):

∂E_t/∂W_hh = (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂W_hh)                                          [A, the immediate delta_t term]
           + (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂W_hh)                         [B]
           + (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂h_{t−2})(∂h_{t−2}/∂W_hh) + …

with
∂E_t/∂W_ho = −(o_t − o_t′)(h_t)
∂E_t/∂h_t = −(o_t − o_t′) W_ho   and   ∂h_t/∂W_hh = (1 − h_t²)(h_{t−1})

so that
A = −(o_t − o_t′) W_ho (1 − h_t²)(h_{t−1})
B = −(o_t − o_t′) W_ho (1 − h_t²)(W_hh)(1 − h_{t−1}²)(h_{t−2})

∂E/∂W_hh = A + B + ⋯ (till the end of the dependency)
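Again, since the original snippet is not included, here is a minimal NumPy sketch of the backward pass through time (an illustration, not the slides' code; it assumes the caches returned by the rnn_forward sketch above and integer class targets ys):

import numpy as np

def rnn_bptt(xs, ys, h, o, W_xh, W_hh, W_ho):
    # accumulate gradients of the summed cross-entropy loss over all time steps
    dW_xh, dW_hh, dW_ho = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_ho)
    dh_next = np.zeros_like(h[0])
    for t in range(len(xs), 0, -1):
        dz = o[t].copy()
        dz[ys[t - 1]] -= 1.0                  # dE_t/dz_t for a one-hot target
        dW_ho += np.outer(dz, h[t])
        dh = W_ho.T @ dz + dh_next            # error arriving at h_t (from o_t and from step t+1)
        draw = (1.0 - h[t] ** 2) * dh         # back through tanh
        dW_xh += np.outer(draw, xs[t - 1])
        dW_hh += np.outer(draw, h[t - 1])
        dh_next = W_hh.T @ draw               # transport the error one step back in time
    return dW_xh, dW_hh, dW_ho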
Break (10 minutes)

Challenges in Training an RNN: Vanishing Gradients

Short-term dependencies
 need recent information to perform the present task.
 For example, in a language model, predict the next word based on the previous ones:
 "the clouds are in the ___" → "sky"
 Easy to predict "sky" given the context, i.e., a short-term dependency → the (vanilla) RNN is good so far.

Long-term dependencies
 Consider a longer word sequence: "I grew up in France… …………………… I speak fluent French."
 Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of "France", from further back.
 As the gap increases, it becomes practically difficult for the RNN to learn from the past information.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Challenges in Training an RNN: Vanishing Gradients

Assume an RNN of 5 time steps (long-term dependencies) and look at the Jacobian products during BPTT:

∂E_5/∂θ = (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂θ)
        + (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂θ)
        + (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂θ) + …

[Figure: example values of three of these Jacobian-product blocks: A (the longest chain) has entries of order 1e−10 and ‖A‖ ≈ 1.0e−09; B has entries of order 1e−08; C (the shortest chain) has entries of order 1e−06 and ‖C‖ ≈ 2.2e−05.]

 ∂E_5/∂θ is dominated by the short-term dependencies (e.g., C), but
 the gradient vanishes for the long-term dependencies, i.e., ∂E_5/∂θ is updated much less through A than through C.
 The long-term components go to norm 0 exponentially fast → no correlation between temporally distant events.
Challenges in Training an RNN: Exploding Gradients

Assume an RNN of 5 time steps (long-term dependencies) and look at the same Jacobian products during BPTT:

∂E_5/∂θ = (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂θ) + … (as before)

[Figure: the same example Jacobian-product blocks A, B and C, now with entries of order 1e+04 up to 1e+10; their norms blow up.]

 ∂E_5/∂θ explodes, i.e., becomes NaN, due to the very large numbers.
 A large increase in the norm of the gradient during training is caused by the explosion of the long-term components.
Vanishing Gradient in Long-term Dependencies

Often the sequences are long, e.g., documents, speech, etc.

[Figure: an RNN unrolled over 50 time steps (inputs x_1 … x_50, hidden states h_1 … h_50, losses E_1 … E_50) with the gradient flowing back through every step.]

In practice, as the length of the sequence increases, the probability of training being successful decreases drastically. Why?
Vanishing Gradient in Long-term Dependencies

Why? Let us look at the recurrent part of the RNN equations:

h_t = g_W(h_{t−1}, x_t)
h_t = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = softmax(W_ho h_t)

Expanding the tanh recurrence:

h_t = W_hh f(h_{t−1}) + some other terms
    = (W_hh)^t h_0 + some other terms   (repeatedly expanding the recurrence back to h_0)
Vanishing Gradient in Long-term Dependencies

Writing the gradients in a sum-of-products form:

∂E/∂θ = Σ_{1≤t≤3} ∂E_t/∂θ;   ∂E_3/∂W_hh = (∂E_3/∂h_3)(∂h_3/∂W_hh)

Since h_3 depends on h_2 and h_2 depends on h_1:

∂E_3/∂W_hh = Σ_{k=1}^{3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh),   e.g., ∂h_3/∂h_1 = (∂h_3/∂h_2)(∂h_2/∂h_1)

In general (with h_t = W_hh f(h_{t−1}) + some terms):

∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)

∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T diag[g′(h_{i−1})]   (transports the error in time from step t back to step k)

This term is a product of Jacobian matrices. Repeated matrix multiplication leads to vanishing gradients!
Mechanics behind Vanishing and Exploding Gradients

∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T diag[g′(h_{i−1})]

Consider the identity activation function. If the recurrent matrix W_hh is diagonalizable:

W_hh = Q^{−1} · Λ · Q

where Q is the matrix composed of the eigenvectors of W_hh and Λ is the diagonal matrix with the eigenvalues placed on the diagonal.

Using the power iteration method, computing powers of W_hh:

W_hh^n = Q^{−1} · Λ^n · Q

Bengio et al., "On the difficulty of training recurrent neural networks." (2012)
Mechanics behind Vanishing and Exploding Gradients

Consider the identity activation function and compute powers of W_hh:

W_hh^n = Q^{−1} · Λ^n · Q   (eigenvalues on the diagonal of Λ)

Example:  Λ = diag(−0.618, 1.618)  →  Λ^10 ≈ diag(0.0081, 122.99)

The eigenvalue with magnitude < 1 shrinks towards 0 → vanishing gradients; the eigenvalue with magnitude > 1 blows up → exploding gradients.

Bengio et al., "On the difficulty of training recurrent neural networks." (2012)
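A quick NumPy check of this power-iteration effect (illustrative only):

import numpy as np

L = np.diag([-0.618, 1.618])          # the Λ from the example above
for n in (1, 5, 10, 20):
    print(n, np.diag(np.linalg.matrix_power(L, n)))
# the |λ| < 1 direction decays towards 0 (vanishing), the |λ| > 1 direction grows (exploding)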
Mechanics behind Vanishing and Exploding Gradients

Because the eigenvalues of W_hh are raised to higher and higher powers, we need tight conditions on the eigenvalues during training to prevent the gradients from vanishing or exploding.

Bengio et al., "On the difficulty of training recurrent neural networks." (2012)
Mechanics behind Vanishing and Exploding Gradients

Writing the gradients in a sum-of-products form (as before):

∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)

∂h_t/∂h_k = Π_{t≥i>k} W_hh^T diag[g′(h_{i−1})]

To find a sufficient condition for when gradients vanish, compute an upper bound for the ∂h_t/∂h_k term:

‖∂h_i/∂h_{i−1}‖ ≤ ‖W_hh^T‖ · ‖diag[g′(h_{i−1})]‖   →  find an upper bound for the norm of the Jacobian!
Mechanics behind Vanishing and Exploding Gradients

Let us find an upper bound for the term ‖W^T diag[g′(h_{i−1})]‖ using a property of the matrix norm (proof, offline):

‖M‖₂² = λ_max(M* M), i.e., ‖M‖₂ = γ_max(M),

where the spectral norm ‖M‖₂ of a complex matrix M is defined as max{‖Mx‖₂ : ‖x‖ = 1}. The norm of a matrix is equal to the largest singular value of the matrix and is related to the largest eigenvalue (spectral radius).

Proof. Put B = M* M, which is a Hermitian matrix. A linear transformation of a Euclidean vector space E is Hermitian iff there exists an orthonormal basis of E consisting of eigenvectors of B.
Let λ_1, λ_2, …, λ_n be the eigenvalues of B and e_1, e_2, …, e_n an orthonormal basis of E.
Let x = a_1 e_1 + … + a_n e_n (a linear combination of the eigenvectors). Then ‖x‖ = ⟨Σ_i a_i e_i, Σ_i a_i e_i⟩^{1/2} = (Σ_i a_i²)^{1/2}.
Mechanics behind Vanishing and Exploding Gradients

Using the characteristic equation to find the matrix's eigenvalues:

Bx = B Σ_{i=1}^{n} a_i e_i = Σ_{i=1}^{n} a_i B e_i = Σ_{i=1}^{n} λ_i a_i e_i

Therefore,

‖Mx‖² = ⟨Mx, Mx⟩ = ⟨x, M* M x⟩ = ⟨x, Bx⟩ = ⟨Σ_i a_i e_i, Σ_i λ_i a_i e_i⟩ = Σ_i a_i² λ_i ≤ max_{1≤j≤n} λ_j · ‖x‖²

Thus, if ‖M‖ = max{‖Mx‖ : ‖x‖ = 1}, then ‖M‖² ≤ max_{1≤j≤n} λ_j   …… equation (1)
Mechanics behind Vanishing and Exploding Gradients

Consider x_0 = e_{j0}, so that ‖x‖ = 1 and

‖M‖² ≥ ⟨x, Bx⟩ = ⟨e_{j0}, B e_{j0}⟩ = ⟨e_{j0}, λ_{j0} e_{j0}⟩ = λ_{j0}   …… equation (2)

where j0 indexes the largest eigenvalue.

Combining (1) and (2) gives ‖M‖² = max_{1≤j≤n} λ_j, where the λ_j are the eigenvalues of B = M* M.

Conclusion: ‖M‖₂² = λ_max(M* M), i.e., ‖M‖₂ = γ_max(M)   …… equation (3)

Remarks:
 The spectral norm of a matrix is equal to the largest singular value of the matrix and is related to the largest eigenvalue (spectral radius).
 If the matrix is square and symmetric, the largest singular value equals the spectral radius.
Mechanics behind Vanishing and Exploding Gradients

Let's use these properties:

∂h_t/∂h_k = Π_{t≥i>k} W_hh^T diag[g′(h_{i−1})]

‖∂h_i/∂h_{i−1}‖ ≤ ‖W_hh^T‖ · ‖diag[g′(h_{i−1})]‖

The gradient of the nonlinear activation function (sigmoid or tanh), g′(h_{i−1}), is bounded by a constant, i.e., ‖diag[g′(h_{i−1})]‖ ≤ γ_g — an upper bound for the norm of the gradient of the activation:

γ_g = 1/4 for sigmoid,  γ_g = 1 for tanh.
Mechanics behind Vanishing and Exploding Gradients

With γ_W the largest singular value of W_hh:

‖∂h_i/∂h_{i−1}‖ ≤ ‖W_hh^T‖ · ‖diag[g′(h_{i−1})]‖ ≤ γ_W γ_g

so γ_W γ_g is an upper bound for the norm of the Jacobian, and therefore

‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}
Mechanics behind Vanishing and Exploding Gradients

Sufficient condition for vanishing gradients:

‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

If γ_W γ_g < 1 and (t−k) → ∞, the long-term contributions go to 0 exponentially fast with t−k (power iteration method).

Therefore, a sufficient condition for the vanishing gradient to occur is

γ_W < 1/γ_g,   i.e., γ_W < 4 for sigmoid and γ_W < 1 for tanh.
Mechanics behind Vanishing and Exploding Gradients

Necessary condition for exploding gradients:

‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

If γ_W γ_g > 1 and (t−k) → ∞, the gradient can explode.

Therefore, a necessary condition for the exploding gradient to occur is

γ_W > 1/γ_g,   i.e., γ_W > 4 for sigmoid and γ_W > 1 for tanh.
Vanishing Gradient in Long-term Dependencies

What have we concluded from the upper bound on the derivative of the recurrent step?

∂h_t/∂h_k = Π_{t≥i>k} W_hh^T diag[g′(h_{i−1})],   ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g,   ‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

If we multiply the same factor γ_W γ_g < 1 again and again, the overall number becomes very small (almost equal to zero): repeated matrix multiplication leads to vanishing gradients.
Vanishing Gradient in Long-term Dependencies

∂E_3/∂W = (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W) + (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W) + (∂E_3/∂h_3)(∂h_3/∂W)
        =        (≪≪ 1)                              +        (≪ 1)                     +      (< 1)
          gradient due to long-term dependencies                                           gradient due to short-term dependencies

The gradients no longer depend on the distant past inputs, since the near-past inputs dominate the gradient.

Remark: the gradients due to short-term dependencies (the most recent steps) dominate the gradients due to long-term dependencies. This means the network will tend to focus on short-term dependencies, which is often not desired → the problem of vanishing gradients.
Exploding Gradient in Long-term Dependencies

What have we concluded from the upper bound on the derivative of the recurrent step?

∂h_t/∂h_k = Π_{t≥i>k} W_hh^T diag[g′(h_{i−1})],   ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g

If we multiply the same factor γ_W γ_g > 1 again and again, the overall number explodes, and hence the gradient explodes: repeated matrix multiplication leads to exploding gradients.
Exploding Gradient in Long-term Dependencies

∂E_3/∂W = (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W) + (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W) + (∂E_3/∂h_3)(∂h_3/∂W)
        = (≫≫ 1) + (≫≫ 1) + (≫≫ 1)
        = a very large number, i.e., NaN

→ the problem of exploding gradients.
Vanishing vs Exploding Gradients

‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}   (for tanh or linear activation)

- γ_W γ_g greater than 1 → the gradient explodes!
- γ_W γ_g less than 1 → the gradient vanishes!

Remark: this problem of exploding/vanishing gradients occurs because the same number is multiplied into the gradient repeatedly.
Dealing With Exploding Gradients

Dealing with Exploding Gradients: Gradient Clipping

Scaling down the gradients:

 rescale the norm of the gradients whenever it goes over a threshold
 the proposed clipping is simple and computationally efficient
 it introduces an additional hyper-parameter, namely the threshold

Pascanu et al., 2013. On the difficulty of training recurrent neural networks.
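A minimal sketch of norm-based gradient clipping (an illustration; the threshold value is a hyper-parameter to tune):

import numpy as np

def clip_gradients(grads, threshold=5.0):
    # rescale all gradients if their global L2 norm exceeds the threshold
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads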
Dealing With Vanishing Gradients

Dealing with Vanishing Gradient

• As discussed, the gradient vanishes due to the recurrent part of the RNN equations:

  h_t = W_hh h_{t−1} + some other terms

• What if the largest eigenvalue of the parameter matrix is exactly 1? Then the gradient does not vanish, but the memory just grows.

• We need to be able to decide when to put information into the memory.
Long Short Term Memory (LSTM): Gating Mechanism

Gates:
 a way to optionally let information through
 composed of a sigmoid neural-net layer and a pointwise multiplication operation
 remove or add information to the cell state

 There are 3 gates in an LSTM — input gate, forget gate and output gate — to protect and control the cell state.

[Figure: an LSTM cell holding the word "clouds" in its current cell state, with the input gate and output gate between the cell and the rest of the LSTM, and the forget gate acting on the cell state.]

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM): Gating Mechanism

Remember the word "clouds" over time…

[Figure: a chain of LSTM cells carrying "clouds" in the cell state across several time steps; the input, forget and output gates open (1) or close (0) to control when the word is written into, kept in, and read out of the memory.]

Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton
Long Short Term Memory (LSTM)

Motivation:
 create a self-loop path through which the gradient can flow
 the self-loop corresponds to an eigenvalue of the Jacobian slightly less than 1

new state = old state + update
∂(new state)/∂(old state) ≈ Identity

[Figure: a memory cell with an additive self-loop: the old state is combined with the update (+, ×) to form the new state.]

LONG SHORT-TERM MEMORY, Sepp Hochreiter and Jürgen Schmidhuber
Long Short Term Memory (LSTM): Step by Step

Key Ingredients
Cell state - transport the information through the units
Gates – optionally allow information passage

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM): Step by Step

Cell: Transports information through the units (key idea)


 the horizontal line running through the top
LSTM removes or adds information to the cell state using gates.

Long Short Term Memory (LSTM): Step by Step

Forget Gate:
 decides what information to throw away or keep from the previous cell state
 decision maker: a sigmoid layer (the "forget gate layer")
 the output of the sigmoid lies between 0 and 1: 0 being forget, 1 being keep

f_t = sigmoid(θ_xf x_t + θ_hf h_{t−1} + b_f)

 looks at h_{t−1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t−1}
Long Short Term Memory (LSTM): Step by Step

Input Gate: selectively updates the cell state based on the new input.
A multiplicative input gate unit protects the memory contents stored in the cell from perturbation by irrelevant inputs.

The next step is to decide what new information we're going to store in the cell state. This has two parts:
1. A sigmoid layer, called the "input gate layer", decides which values we'll update.
2. A tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state.
In the next step, we'll combine these two to create an update to the state.

i_t = sigmoid(θ_xi x_t + θ_hi h_{t−1} + b_i)
Long Short Term Memory (LSTM): Step by Step

Cell Update
 update the old cell state, C_{t−1}, into the new cell state C_t
 multiply the old state by f_t, forgetting the things we decided to forget earlier
 add i_t ∗ C̃_t, the new candidate values scaled by how much we decided to update each state value:

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
Long Short Term Memory (LSTM): Step by Step

Output Gate: the output is a filtered version of the cell state.
 decides which part of the cell state we want as our output, in the form of the new hidden state
 a multiplicative output gate protects other units from perturbation by currently irrelevant memory contents
 a sigmoid layer decides what parts of the cell state go to the output; apply tanh to the cell state and multiply it by the output of the sigmoid gate → only the decided parts are output

o_t = sigmoid(θ_xo x_t + θ_ho h_{t−1} + b_o)
h_t = o_t ∗ tanh(C_t)
Dealing with Vanishing Gradients in LSTM

As seen, the gradient vanishes due to the recurrent part of the RNN equations:

h_t = W_hh h_{t−1} + some other terms

How does the LSTM tackle the vanishing gradient? Answer: the forget gate.

 The forget gate parameters take care of the vanishing gradient problem.
 Along the cell-state path the activation function is effectively the identity, and therefore the problem of vanishing gradients is addressed.
 The derivative of the identity function is, conveniently, always one. So if f = 1, information from the previous cell state can pass through this step unchanged.
LSTM code snippet

Code snippet for the LSTM unit: forward-pass equations, parameter dimensions and shapes of the gates — offline, not reproduced here.
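In its place, a minimal NumPy sketch of one LSTM forward step following the gate equations above (an illustration; the parameter names are made up, biases included):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, p):
    # p holds one (W, U, b) set per gate: input weights, recurrent weights, bias
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])      # forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])      # input gate
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                             # cell update
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])      # output gate
    h_t = o_t * np.tanh(C_t)                                       # new hidden state
    return h_t, C_t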
Gated Recurrent Unit (GRU)

• The GRU, like the LSTM, attempts to solve the vanishing gradient problem in RNNs.

Gates: the update gate and the reset gate — two vectors that decide what information should be passed to the output.

• Units that capture short-term dependencies have active reset gates r.
• Units that capture long-term dependencies have active update gates z.
Gated Recurrent Unit (GRU)

Update Gate:
- determines how much of the past information (from previous time steps) needs to be passed along to the future
- lets the unit learn to copy information from the past, so that the gradient does not vanish

Here x_t is the input and h_{t−1} holds the information from the previous time step:

z_t = sigmoid(W^(z) x_t + U^(z) h_{t−1})
Gated Recurrent Unit (GRU)

Reset Gate:
- models how much of the information the unit forgets

Here x_t is the input and h_{t−1} holds the information from the previous time step:

r_t = sigmoid(W^(r) x_t + U^(r) h_{t−1})

Memory content:
h′_t = tanh(W x_t + r_t ⊙ U h_{t−1})

Final memory at the current time step:
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h′_t
Dealing with Vanishing Gradients in Gated Recurrent Unit (GRU)

We had a product of Jacobians (details offline):

∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j−1},   whose norm is bounded by powers of α

where α depends on the weight matrix and on the derivative of the activation function.

Now, in the GRU,

∂h_j/∂h_{j−1} = z_j + (1 − z_j) ∂h′_j/∂h_{j−1}

and ∂h_j/∂h_{j−1} = 1 for z_j = 1, i.e., with the update gate fully open the previous state (and hence the gradient) is copied through unchanged.
Code snippet of GRU unit

Code snippet of the GRU unit — offline, not reproduced here.
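In its place, a minimal NumPy sketch of one GRU step following the equations above (an illustration; biases omitted, parameter names made up):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + r_t * (p["U"] @ h_prev))   # memory content
    return z_t * h_prev + (1.0 - z_t) * h_tilde                 # final memory h_t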
Comparing LSTM and GRU

LSTM over GRU

One feature the LSTM has that the GRU lacks is controlled exposure of the memory content: in the LSTM unit, the amount of memory content that is seen, or used, by other units in the network is controlled by the output gate. The GRU, on the other hand, exposes its full content without any control.

 In practice, the GRU performs comparably to the LSTM.

[Figure: GRU unit and LSTM unit diagrams side by side.]

Chung et al., 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Break (10 minutes)

Bi-directional RNNs

Bidirectional Recurrent Neural Networks (BRNN)

- connect two hidden layers of opposite directions to the same output
- the output layer can get information from past (backward) and future (forward) states simultaneously
- learn representations from future time steps to better understand the context and eliminate ambiguity

Example sentences:
Sentence1: "He said, Teddy bears are on sale"
Sentence2: "He said, Teddy Roosevelt was a great President"

When we are looking at the word "Teddy" and only the previous two words "He said", we cannot tell whether the sentence refers to the President or to Teddy bears. Therefore, to resolve this ambiguity, we need to look ahead.

[Figure: a BRNN with a forward state and a backward state running over the input sequence, both feeding the sequence of outputs.]

https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
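A minimal sketch of the idea: run one vanilla RNN left-to-right and another right-to-left, then concatenate their states per time step (an illustration; parameter names made up, biases omitted):

import numpy as np

def birnn_forward(xs, h0_f, h0_b, fwd, bwd):
    T = len(xs)
    hf, hb = h0_f, h0_b
    forward_states, backward_states = [], [None] * T
    for t in range(T):                                        # forward (left-to-right) pass
        hf = np.tanh(fwd["W_xh"] @ xs[t] + fwd["W_hh"] @ hf)
        forward_states.append(hf)
    for t in reversed(range(T)):                              # backward (right-to-left) pass
        hb = np.tanh(bwd["W_xh"] @ xs[t] + bwd["W_hh"] @ hb)
        backward_states[t] = hb
    # each position sees its past (forward state) and its future (backward state)
    return [np.concatenate([f, b]) for f, b in zip(forward_states, backward_states)]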
Bi-directional RNNs

Bidirectional Recurrent Neural Networks (BRNN)

Gupta 2015. (Master Thesis). Deep Learning Methods for the Extraction of Relations in Natural Language Text
Gupta and Schütze. 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation
Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification

Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM

 applies the same set of weights recursively over a structured input, by traversing a given structure in topological order, e.g., a parse tree
 uses the principle of compositionality
 Recursive Neural Nets can jointly learn compositional vector representations and parse trees
 The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

[Figure: a Recursive Neural Network (RecNN) composing a sentence over a parse tree, contrasted with a chain-structured RNN.]

http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM

http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM

Applications
 represent the meaning of longer phrases
 Map phrases into a vector space
 Sentence parsing
 Scene parsing

Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM

Application: Relation Extraction within and across sentence boundaries, i.e., document-level relation extraction

[Figure: extracting a relation whose arguments span sentence boundaries with a recursive/tree-structured network over dependency structures.]

Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries.
Deep and Multi-tasking RNNs

Deep RNN architecture Multi-task RNN architecture

Marek Rei . 2017. Semi-supervised Multitask Learning for Sequence Labeling


RNN in Practice: Training Tips

Weight Initialization Methods

 Identity weight initialization with ReLU activation

Activation function ReLU: ReLU(x) = max{0, x}, and its gradient is 0 for x < 0 and 1 for x > 0.
Therefore, wherever a unit is active, the backpropagated error passes through with slope 1 — it is neither squashed nor amplified by the activation.
RNN in Practice: Training Tips

Weight Initialization Methods (in Vanilla RNNs)

 Random W_hh initialization of the RNN → no constraint on the eigenvalues → vanishing or exploding gradients in the initial epochs

 Careful initialization of W_hh with suitable eigenvalues:
o W_hh initialized to the identity matrix
o activation function: ReLU
→ allows the RNN to learn in the initial epochs and to generalize well in further iterations

What else?
- Batch Normalization: faster convergence
- Dropout: better generalization

Le, Jaitly and Hinton, "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units"
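A minimal sketch of this IRNN-style initialization (an illustration; the sizes and scales are made up):

import numpy as np

def init_irnn(input_size, hidden_size):
    W_hh = np.eye(hidden_size)                               # identity recurrent matrix: eigenvalues = 1
    W_xh = 0.001 * np.random.randn(hidden_size, input_size)  # small random input weights
    b_h = np.zeros(hidden_size)
    return W_xh, W_hh, b_h

def irnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # hidden update uses ReLU instead of tanh
    return np.maximum(0.0, W_xh @ x_t + W_hh @ h_prev + b_h)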
Attention Mechanism: Attentive RNNs

Translation often requires arbitrary input and output lengths.

 The encoder-decoder architecture can be applied to N-to-M sequences, but is a single hidden state really enough?

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
Attention Mechanism: Attentive RNNs

Attention to improve the performance of the Encoder-Decoder RNN on machine translation.


 allows to focus on local or global features
 is a vector, often the outputs of dense layer using softmax function
 generates a context vector into the gap between encoder and decoder

Context vector
 takes all cells’ outputs as input
 compute the probability distribution of source language
words for each word in decoder (e.g., ‘Je’)

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129

Attention Mechanism: Attentive RNNs

How does it Work?


Idea: compute a context vector for every output/target word t (during decoding).
For each target word t:
1. generate scores between each encoder state hs and the target state ht
2. apply softmax to normalize the scores → attention weights
(a probability distribution over source positions, conditioned on the target state)
3. compute the context vector for the target word t as the attention-weighted sum of the encoder states
4. compute the attention vector for the target word t (by combining the context vector with ht)

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
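A minimal sketch of these four steps, using dot-product (Luong-style) scores; the tensor names, shapes, and the final tanh(Wc[context; ht]) combination are illustrative assumptions consistent with the cited overview:

```python
# Sketch of the four attention steps with dot-product (Luong-style) scores.
import torch
import torch.nn.functional as F

def attention_step(h_t, enc_states, Wc):
    # h_t:        (batch, hidden)           current decoder (target) state
    # enc_states: (batch, src_len, hidden)  all encoder (source) states hs
    # 1. score every encoder state against the target state
    scores = torch.bmm(enc_states, h_t.unsqueeze(2)).squeeze(2)        # (batch, src_len)
    # 2. softmax -> attention weights (a distribution over source positions)
    weights = F.softmax(scores, dim=1)                                 # (batch, src_len)
    # 3. context vector = attention-weighted sum of encoder states
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (batch, hidden)
    # 4. attention vector = tanh(Wc [context; h_t])
    attn_vec = torch.tanh(Wc(torch.cat([context, h_t], dim=1)))        # (batch, hidden)
    return attn_vec, weights

batch, src_len, hidden = 2, 7, 16
Wc = torch.nn.Linear(2 * hidden, hidden)
attn_vec, weights = attention_step(torch.randn(batch, hidden),
                                   torch.randn(batch, src_len, hidden), Wc)
print(attn_vec.shape, weights.shape)   # torch.Size([2, 16]) torch.Size([2, 7])
```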
Explainability/Interpretability of RNNs

Visualization
 Visualize output predictions: LISA
 Visualize neuron activations: Sensitivity Analysis

Further Details:
- Gupta et al, 2018. “LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to
Pattern Transformation”. https://arxiv.org/abs/1808.01591
- Andrej Karpathy, Blog on “Unreasonable Effectiveness of Recurrent Neural Networks”
- Hendrik Strobelt et al., “Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks”
Explainability/Interpretability of RNNs

 Visualize output predictions: LISA

Checkout our POSTER about LISA paper (EMNLP2018 conference)

https://www.researchgate.net/publication/328956863_LISA_Explaining_RNN_Judgments_via_Layer-wIse_Semantic_Accumulation_and_Example_to_Pattern_Transformation_Analyzing_and_Interpreting_RNNs_for_NLP

Full paper:

Gupta et al, 2018. “LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation”. https://arxiv.org/abs/1808.01591

Explainability/Interpretability of RNNs

Visualize neuron activations via Heat maps, i.e. Sensitivity Analysis


The figure below plots the sensitivity scores. Each row corresponds to the saliency scores for the corresponding word representation, with each grid cell representing one dimension.

All three models assign high sensitivity to “hate” and dampen the influence of other tokens. The LSTM offers a clearer focus on “hate” than the standard recurrent model, but the bi-directional LSTM shows the clearest focus, attaching almost zero emphasis to words other than “hate”. This is presumably due to the gating structures in LSTMs and Bi-LSTMs that control information flow, making these architectures better at filtering out less relevant information.
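As a minimal sketch (with a toy classifier standing in for the cited models), such gradient-based sensitivity scores can be obtained by differentiating the predicted class score with respect to the input embeddings and plotting the magnitudes:

```python
# Gradient-based sensitivity (saliency) sketch in the spirit of Li et al.:
# |d(class score)/d(embedding)| per word and per dimension.
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(100, 16)
lstm = nn.LSTM(16, 32, batch_first=True)
clf = nn.Linear(32, 2)                         # e.g., two sentiment classes

tokens = torch.tensor([[5, 17, 42, 8]])        # toy sentence of 4 word ids
x = emb(tokens).detach().requires_grad_(True)  # (1, 4, 16), track gradients w.r.t. inputs

h, _ = lstm(x)
score = clf(h[:, -1, :])[0, 1]                 # score of the target class
score.backward()

saliency = x.grad.abs().squeeze(0)             # (4, 16): one row per word,
print(saliency)                                # one column per embedding dimension
```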

LSTM and RNN capture short-term dependency


Jiwei Li et al., “Visualizing and Understanding Neural Models in NLP”

Explainability/Interpretability of RNNs

Visualize neuron activations via Heat maps, i.e. Sensitivity Analysis

LSTM captures long-term dependencies; the (vanilla) RNN does not.

Jiwei Li et al., “Visualizing and Understanding Neural Models in NLP”

RNNs in Topic Trend Extraction (Dynamic Topic Evolution): RNN-RSM

[Figure: RNN-RSM architecture: a temporal stack of RSMs with hidden topic units h(1)…h(T) (bias bh) and observable softmax visibles V(1)…V(T) (bias bv), conditioned on an RNN layer u(0)…u(T) through weights Wvh, Wuh, Wuv, Wvu, Wuu.]
[Illustration: topic-words over time for the topic ‘Word Vector’, e.g., ‘Neural Network’, ‘Language Models’, ‘Word Representation’, ‘Linear Model’, ‘Rule Set’ in 1996/1997 vs. ‘Word Embedding’, ‘Word Embeddings’ in 2014.]
Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
RNNs in Topic Trend Extraction (Dynamic Topic Evolution): RNN-RSM

[Figure: the same RNN-RSM architecture, highlighting the latent topics (RSM hidden units h(t)) and the observable softmax visibles V(t), conditioned on the RNN states u(t).]
Cost in RNN-RSM: the negative log-likelihood of the observed documents over time steps.
Training via BPTT (backpropagation through time).

Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
RNNs in Topic Trend Extraction (Dynamic Topic Evolution): RNN-RSM

Topic Trend Extraction or Topic Evolution in NLP research over time

Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
Key Takeaways

 RNNs model sequential data


 Learning long-term dependencies is a major problem in RNNs (vanishing gradients)
Solution:
 careful weight initialization
 LSTM/GRUs
 Gradients can explode
Solution:  gradient norm clipping (see the sketch below)
 Regularization (Batch normalization and Dropout) and attention help
 Interesting direction to visualize and interpret RNN learning
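A minimal sketch of gradient norm clipping inside a standard training step (model, data, and the max-norm value are illustrative placeholders):

```python
# Gradient norm clipping in a PyTorch training step.
import torch
import torch.nn as nn

model = nn.LSTM(10, 20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 30, 10)          # toy batch: 4 sequences of length 30
out, _ = model(x)
loss = out.pow(2).mean()            # placeholder loss

optimizer.zero_grad()
loss.backward()
# Rescale the global gradient norm to at most 5.0 before the parameter update,
# preventing exploding gradients from destabilizing training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```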

References, Resources and Further Reading

 RNN lecture (Ian Goodfellow): https://www.youtube.com/watch?v=ZVN14xYm7JA


 Andrew Ng lecture on RNN: https://www.coursera.org/lecture/nlp-sequence-models/why-sequence-models-0h7gT
 Recurrent Highway Networks (RHN)
 LSTMs for Language Models (Lecture 07)
 Pascanu, Mikolov, and Bengio. “On the difficulty of training recurrent neural networks.” (2012)
 Talathi and Vartak, “Improving Performance of Recurrent Neural Network with ReLU Nonlinearity”
 Le, Jaitly, and Hinton, “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”
 Cooijmans, Tim, et al. "Recurrent batch normalization."(2016).
 Dropout: A Probabilistic Theory of Deep Learning, Ankit B. Patel, Tan Nguyen, Richard G. Baraniuk.
 Semeniuta, Severyn, and Barth, 2016. “Recurrent Dropout without Memory Loss”
 Andrej Karpathy, Blog on “Unreasonable Effectiveness of Recurrent Neural Networks”
 Ilya Sutskever, et al. 2014. “Sequence to Sequence Learning with Neural Networks”
 Bahdanau et al. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate”
 Hierarchical Attention Networks for Document Classification, 2016.
 Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016
 Good Resource: http://slazebni.cs.illinois.edu/spring17/lec20_rnn.pdf
References, Resources and Further Reading

 Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton
 Lecture by Richard Socher: https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
 Understanding LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
 Recursive NN: http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
 Attention: https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
 Gupta, 2015. Master Thesis on “Deep Learning Methods for the Extraction of Relations in Natural Language Text”
 Gupta et al., 2016. Table Filling Multi-Task Recurrent Neural Network for Joint Entity and Relation Extraction.
 Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification.
 Vu et al., 2016. Bi-directional recurrent neural network with ranking loss for spoken language understanding.
 Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
 Gupta et al., 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to
Pattern Transformation.
 Gupta et al., 2018. Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts.
 Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries
 Talk/slides: https://vimeo.com/277669869

Thanks !!!

Write to me if interested in ….

[email protected]
@Linkedin: https://www.linkedin.com/in/pankaj-gupta-6b95bb17/

About my research contributions:


https://scholar.google.com/citations?user=_YjIJF0AAAAJ&hl=en

