05 RNNs

Motivation: Need for Sequential Modeling
[Figure: two sentences (sentence1, sentence2) built from the same words, e.g. "… bear falls into market territory …", fed to a FF-net / CNN, which treats the two sentences the same.]
Motivation: Need for Sequential Modeling
[Figure: the sentence "bear falls into market territory" processed by a FF-net / CNN versus a sequential model (RNN).]
Motivation: Need for Sequential Modeling
Machine translation: different input and output sizes, involving sequential patterns.
[Figure: encoder-decoder architecture translating "pankaj lebt in münchen" (German: "pankaj lives in munich") into Hindi.]
RNNs
- perform well when the input data is interdependent in a sequential pattern
- capture the correlation between the previous input and the next input
- introduce a bias based on the previous output

CNNs / FF-Nets
- each output depends only on the current input, independently of the other outputs
- feed-forward nets do not remember historic input data at test time, unlike recurrent networks
Feed-forward neural networks generalize autoregressive models by using non-linear hidden layers.
Recurrent networks additionally have:
- Distributed hidden state: can efficiently store a lot of information about the past.
- Non-linear dynamics: can update their hidden state in complicated ways.
- Temporal and accumulative behaviour: can build semantics, e.g., word-by-word in sequence over time.
[Diagram: autoregressive model with weights W_{t-2}, W_{t-1} over input_{t-2}, input_{t-1}, input_t.]
It is easier to predict ‘sky’ given the nearby context (e.g., "the clouds are in the …"), i.e., a short-term dependency.
Goal
- model long-term dependencies
- connect previous information to the present task
- model sequences of events with loops, allowing information to persist
Feed-forward nets cannot take time dependencies into account.
Sequential data needs a feedback mechanism.
[Diagram: a Recurrent Neural Network (RNN) with a feedback mechanism (internal state loop), unfolded in time: inputs x_0, …, x_{t-1}, x_t, …, x_T; hidden-to-hidden weights W_hh shared at every step; outputs o_0, …, o_{t-1}, o_t, …, o_T. For contrast, a FF-net / CNN has no such loop.]
Foundation of Recurrent Neural Networks
[Diagram: vanilla RNN unfolded in time, with input-to-hidden weights W_xh, hidden-to-hidden weights W_hh, and hidden-to-output weights W_ho.]

$$h_t = g_\theta(h_{t-1}, x_t)$$

where $h_t$ is the new hidden state at time step $t$, $g_\theta$ is some function with parameters $W_{hh}, W_{xh}$, $h_{t-1}$ is the old hidden state at time step $t-1$, and $x_t$ is the input vector at time step $t$.

Remark: the same function $g$ and the same set of parameters $W$ are used at every time step.
Vanilla Recurrent Neural Network (RNN):

$$h_t = g_\theta(h_{t-1}, x_t)$$
$$h_t = \tanh(W_{hh}\,h_{t-1} + W_{xh}\,x_t)$$
$$o_t = \mathrm{softmax}(W_{ho}\,h_t)$$

Generative Recurrent Neural Network (RNN): feed the sequence starting from <bos> and think of the output $o_t$ at each step as a probability distribution over the next token.
Details: https://www.cs.cmu.edu/~epxing/Class/10708-17/project-reports/project10.pdf
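To make the three equations concrete, here is a minimal single-step sketch in NumPy (not from the original deck; the dimensions and random initialization are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_ho):
    """One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), o_t = softmax(W_ho h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # new hidden state
    z_t = W_ho @ h_t                            # output logits
    o_t = np.exp(z_t - z_t.max())               # softmax, shifted for numerical stability
    o_t /= o_t.sum()
    return h_t, o_t

# Illustrative sizes: input dim 10, hidden dim 16, vocabulary size 10
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(16, 10))
W_hh = rng.normal(scale=0.1, size=(16, 16))
W_ho = rng.normal(scale=0.1, size=(10, 16))
h, o = rnn_step(rng.normal(size=10), np.zeros(16), W_xh, W_hh, W_ho)
print(o.sum())  # ~1.0: a probability distribution over the vocabulary
```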
The RNN maps a sequence of inputs to a sequence of outputs through the parameters $\theta$.
The output at time t = T is dependent on the inputs from t = T back to t = 1.
The output at time t = 3 is dependent on the inputs from t = 3 back to t = 1.

$$\frac{\partial E_3}{\partial W_{ho}} = o_3'(o_3 - 1)\,h_3 \qquad \text{How?}$$

Proof (offline):

$$E_3(o_3', o_3) = -o_3' \log o_3, \qquad o_3 = \mathrm{softmax}(z_3), \quad z_3 = W_{ho}\,h_3$$

$$o_3 = \frac{1}{\Omega}\,e^{z_3}, \quad \Omega = \sum_i e^{z_i}, \qquad \log o_3 = z_3 - \log\Omega$$

$$\frac{\partial E_3}{\partial z_3} = -o_3'\,\frac{\partial \log o_3}{\partial z_3}$$

$$\frac{\partial \log o_3}{\partial z_3} = 1 - \frac{1}{\Omega}\frac{\partial \Omega}{\partial z_3}, \qquad \frac{\partial \Omega}{\partial z_3} = \sum_i e^{z_i}\,\delta_{i3} = e^{z_3}$$

$$\frac{\partial \log o_3}{\partial z_3} = 1 - o_3, \qquad \frac{\partial o_3}{\partial z_3} = o_3(1 - o_3)$$

$$\frac{\partial E_3}{\partial z_3} = -o_3'(1 - o_3) = o_3'(o_3 - 1)$$

$$\frac{\partial E_3}{\partial W_{ho}} = \frac{\partial E_3}{\partial o_3}\frac{\partial o_3}{\partial z_3}\frac{\partial z_3}{\partial W_{ho}} = \frac{\partial E_3}{\partial z_3}\frac{\partial z_3}{\partial W_{ho}} = o_3'(o_3 - 1)\,h_3$$
1. http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
2. https://stats.stackexchange.com/questions/235528/backpropagation-with-softmax-cross-entropy
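As a sanity check on this derivation (my own illustration, not part of the slides): the full softmax-plus-cross-entropy gradient with respect to the logits is the predicted probabilities minus the one-hot target, which at the target entry reduces to the slide's $o_3'(o_3 - 1)$. A finite-difference comparison confirms it:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, target):
    return -np.log(softmax(z)[target])

rng = np.random.default_rng(1)
z, target = rng.normal(size=5), 2
analytic = softmax(z)
analytic[target] -= 1.0                       # dE/dz = o - one_hot(target)

eps, numeric = 1e-6, np.zeros_like(z)
for i in range(len(z)):                       # central finite differences
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, target) - cross_entropy(zm, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))     # tiny (~1e-9): the gradients agree
```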
The output at time t = 3 is dependent on the inputs from t = 3 back to t = 1.

In general,

$$\frac{\partial E_t}{\partial W_{hh}} = \sum_{1 \le k \le t} \frac{\partial E_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial h_k}{\partial W_{hh}}$$

(direction of the backward pass: the gradient flows backwards in time via these partial derivatives)

$$\frac{\partial h_t}{\partial h_k} = \prod_{t \ge i > k} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{t \ge i > k} W_{hh}^T\,\mathrm{diag}[g'(h_{i-1})]$$

where $W_{hh}$ is the weight matrix and $g'$ is the derivative of the activation function. The Jacobian $\frac{\partial h_t}{\partial h_k}$ transports the error in time from step t back to step k.
Backpropagation Through Time (BPTT) in RNN
The output at time t = 3 is dependent on the inputs from t = 3 back to t = 1.

$$\frac{\partial E_3}{\partial W_{hh}} = \sum_{k=1}^{3} \frac{\partial E_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_k}\,\frac{\partial h_k}{\partial W_{hh}}$$

$$\frac{\partial E_3}{\partial W_{hh}} = \frac{\partial E_3}{\partial o_3}\frac{\partial o_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W_{hh}} + \frac{\partial E_3}{\partial o_3}\frac{\partial o_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial W_{hh}} + \frac{\partial E_3}{\partial o_3}\frac{\partial o_3}{\partial h_3}\frac{\partial h_3}{\partial W_{hh}}$$

[Diagram: unrolled RNN with hidden states $h_1, h_2, h_3$ connected by $W_{hh}$, outputs $o_1, o_2, o_3$ and errors $E_1, E_2, E_3$; the arrows $\partial h_3/\partial h_2$ and $\partial h_2/\partial h_1$ mark the gradient paths.]

A code snippet for forward propagation is shown below (before going on to the BPTT code; offline).
https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
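The forward-propagation snippet itself is not reproduced in this extraction, so the following is a minimal sketch of what such a forward pass might look like in NumPy (function and variable names, one-hot inputs, and the summed cross-entropy loss are my assumptions, in the spirit of the CS224d / wildml tutorials referenced here):

```python
import numpy as np

def forward(inputs, targets, h0, Wxh, Whh, Who):
    """Forward pass over a whole sequence; caches hidden states and outputs for BPTT."""
    hs, os, loss = {-1: h0}, {}, 0.0
    for t, x in enumerate(inputs):                        # x: one-hot input vector at step t
        hs[t] = np.tanh(Whh @ hs[t - 1] + Wxh @ x)        # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        z = Who @ hs[t]
        os[t] = np.exp(z - z.max())
        os[t] /= os[t].sum()                              # o_t = softmax(W_ho h_t)
        loss += -np.log(os[t][targets[t]])                # E = sum_t E_t (cross-entropy)
    return loss, hs, os
```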
$$\frac{\partial E_t}{\partial W_{hh}} = \cdots + \frac{\partial E_t}{\partial o_t}\frac{\partial o_t}{\partial h_t}\frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}}\frac{\partial h_{t-2}}{\partial W_{hh}} + \underbrace{\frac{\partial E_t}{\partial o_t}\frac{\partial o_t}{\partial h_t}\frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial W_{hh}}}_{B} + \underbrace{\frac{\partial E_t}{\partial o_t}\frac{\partial o_t}{\partial h_t}\frac{\partial h_t}{\partial W_{hh}}}_{A}$$

With a softmax output and tanh hidden units:

$$\frac{\partial E_t}{\partial W_{ho}} = -(o_t - o_t')\,h_t$$

$$\frac{\partial E_t}{\partial h_t} = -(o_t - o_t')\,W_{ho} \qquad \text{and} \qquad \frac{\partial h_t}{\partial W_{hh}} = (1 - h_t^2)\,h_{t-1}$$

$$A = \frac{\partial E_t}{\partial h_t}\,\frac{\partial h_t}{\partial W_{hh}} = -(o_t - o_t')\,W_{ho}\,(1 - h_t^2)\,h_{t-1}$$

$$B = \frac{\partial E_t}{\partial o_t}\frac{\partial o_t}{\partial h_t}\frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial W_{hh}} = -(o_t - o_t')\,W_{ho}\,(1 - h_t^2)\,W_{hh}\,(1 - h_{t-1}^2)\,h_{t-2}$$

$$\frac{\partial E}{\partial W_{hh}} = A + B + \cdots \ \text{(until the end of the dependency)}$$
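A companion sketch of the BPTT backward pass for the same model (again an illustrative assumption rather than the original slide code): walking backwards in time, each step multiplies the incoming error by $W_{hh}^T\,\mathrm{diag}(1 - h_t^2)$, which is exactly how the terms A, B, … accumulate into $\partial E/\partial W_{hh}$. It reuses the caches returned by the `forward` sketch above.

```python
import numpy as np

def bptt(inputs, targets, hs, os, Wxh, Whh, Who):
    """Backward pass (BPTT): accumulate gradients over all time steps and dependencies."""
    dWxh, dWhh, dWho = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Who)
    dh_next = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        dz = os[t].copy()
        dz[targets[t]] -= 1.0                   # dE_t/dz_t for softmax + cross-entropy
        dWho += np.outer(dz, hs[t])             # dE_t/dW_ho
        dh = Who.T @ dz + dh_next               # error reaching h_t (from this and later steps)
        dhraw = (1.0 - hs[t] ** 2) * dh         # through tanh: diag(1 - h_t^2)
        dWxh += np.outer(dhraw, inputs[t])
        dWhh += np.outer(dhraw, hs[t - 1])      # contributes the A, B, ... terms for W_hh
        dh_next = Whh.T @ dhraw                 # transport the error one step further back
    return dWxh, dWhh, dWho
```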
Break (10 minutes)
$$\frac{\partial E_5}{\partial \theta} = \underbrace{\frac{\partial E_5}{\partial h_5}\frac{\partial h_5}{\partial h_4}\frac{\partial h_4}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial \theta}}_{A} + \underbrace{\frac{\partial E_5}{\partial h_5}\frac{\partial h_5}{\partial h_4}\frac{\partial h_4}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial \theta}}_{B} + \underbrace{\frac{\partial E_5}{\partial h_5}\frac{\partial h_5}{\partial h_4}\frac{\partial h_4}{\partial h_3}\frac{\partial h_3}{\partial \theta}}_{C} + \cdots$$

The gradient vanishes in long-term dependencies, i.e., $\frac{\partial E_5}{\partial \theta}$ receives a much smaller update from the long chain A than from the shorter chain C.
Challenges in Training an RNN: Exploding Gradients
$$\frac{\partial E_5}{\partial \theta} = \frac{\partial E_5}{\partial h_5}\frac{\partial h_5}{\partial h_4}\frac{\partial h_4}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial \theta} + \frac{\partial E_5}{\partial h_5}\frac{\partial h_5}{\partial h_4}\frac{\partial h_4}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial \theta} + \frac{\partial E_5}{\partial h_5}\frac{\partial h_5}{\partial h_4}\frac{\partial h_4}{\partial h_3}\frac{\partial h_3}{\partial \theta} + \cdots$$

$\frac{\partial E_5}{\partial \theta}$ explodes, i.e., becomes NaN due to very large numbers: a large increase in the norm of the gradient during training, caused by the explosion of the long-term components.

In practice, as the length of the sequence increases, the probability of training being successful decreases drastically.
Why?

$$h_t = \tanh(W_{hh}\,h_{t-1} + W_{xh}\,x_t) \quad\Rightarrow\quad h_t = W_{hh}\,f(h_{t-1}) + \text{other terms}$$

Example, with the eigenvalues on the diagonal:

$$\Lambda = \begin{pmatrix} -0.618 & 0 \\ 0 & 1.618 \end{pmatrix}, \qquad \Lambda^{10} \approx \begin{pmatrix} 0.0081 & 0 \\ 0 & 122.99 \end{pmatrix}$$

The eigenvalue with magnitude less than 1 shrinks towards zero (vanishing gradients), while the eigenvalue greater than 1 blows up (exploding gradients).

Bengio et al., "On the difficulty of training recurrent neural networks" (2012)
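A quick numeric illustration of this eigenvalue example (my own check, not slide material): repeatedly multiplying by a matrix whose eigenvalues straddle 1 makes one direction vanish and the other explode.

```python
import numpy as np

L = np.diag([-0.618, 1.618])                        # eigenvalues on the diagonal
print(np.linalg.matrix_power(L, 10).diagonal())     # ~[ 0.0081, 122.99 ]
print(np.linalg.matrix_power(L, 50).diagonal())     # ~[ 3.5e-11, 2.8e+10 ]: vanishing vs. exploding
```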
Mechanics behind Vanishing and Exploding Gradients
$$\frac{\partial h_t}{\partial h_k} = \prod_{t \ge i > k} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{t \ge i > k} W_{hh}^T\,\mathrm{diag}[g'(h_{i-1})]$$

Consider an identity activation function and compute powers of $W_{hh}$:

$$W_{hh}^{\,n} = Q^{-1}\,\Lambda^n\,Q$$

There is a need for tight conditions on the eigenvalues during training to prevent the gradients from vanishing or exploding: eigenvalues with magnitude less than 1 lead to vanishing gradients, eigenvalues greater than 1 to exploding gradients (see the diagonal example above).

Bengio et al., "On the difficulty of training recurrent neural networks" (2012)
Mechanics behind Vanishing and Exploding Gradients
To find a sufficient condition for when gradients vanish, compute an upper bound for the $\frac{\partial h_t}{\partial h_k}$ term:

$$\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| \le \left\|W_{hh}^T\right\|\;\left\|\mathrm{diag}[g'(h_{i-1})]\right\|$$

i.e., find an upper bound for the norm of the Jacobian. Let us bound the term $\left\|W^T\,\mathrm{diag}[g'(h_{i-1})]\right\|$.

Property of the matrix norm (proof, offline):

$$\|M\|_2 = \sqrt{\lambda_{\max}(M^* M)} = \gamma_{\max}(M),$$

where the spectral norm $\|M\|_2$ of a complex matrix $M$ is defined as $\max\{\|Mx\|_2 : \|x\| = 1\}$. The spectral norm of a matrix equals its largest singular value, which is related to its largest eigenvalue (spectral radius).

Put $B = M^* M$, which is a Hermitian matrix. A linear transformation of a Euclidean vector space $E$ is Hermitian iff there exists an orthonormal basis of $E$ consisting of eigenvectors of $B$.
Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the eigenvalues of $B$ and $e_1, e_2, \ldots, e_n$ an orthonormal basis of $E$.
Let $x = a_1 e_1 + \cdots + a_n e_n$ (a linear combination of the eigenvectors). The norm of $x$ is

$$\|x\| = \Big\langle \sum_{i=1}^n a_i e_i,\ \sum_{i=1}^n a_i e_i \Big\rangle^{1/2} = \Big(\sum_{i=1}^n a_i^2\Big)^{1/2}$$
Mechanics behind Vanishing and Exploding Gradients
$$\|Mx\|^2 = \langle Mx, Mx\rangle = \langle x, M^* M x\rangle = \langle x, Bx\rangle = \Big\langle \sum_{i} a_i e_i,\ \sum_{i} \lambda_i a_i e_i \Big\rangle = \sum_{i} \lambda_i a_i^2 \le \max_{1\le j\le n} \lambda_j \;\|x\|^2$$

Thus, if $\|M\| = \max\{\|Mx\| : \|x\| = 1\}$, then $\|M\| \le \sqrt{\max_{1\le j\le n} \lambda_j}$ … equation (1)

Consider $x_0 = e_{j_0}$, so that $\|x_0\| = 1$ and $\|M x_0\|^2 = \langle x_0, B x_0\rangle = \langle e_{j_0}, \lambda_{j_0} e_{j_0}\rangle = \lambda_{j_0}$, hence $\|M\|^2 \ge \lambda_{j_0}$ for every $j_0$ … equation (2)

Combining (1) and (2) gives $\|M\| = \sqrt{\max_{1\le j\le n} \lambda_j}$, where the $\lambda_j$ are the eigenvalues of $B = M^* M$, i.e., $\|M\|$ is the largest singular value of $M$.

Remarks:
- The spectral norm of a matrix is equal to the largest singular value of the matrix and is related to the largest eigenvalue (spectral radius).
- If the matrix is square symmetric, the largest singular value equals the spectral radius.
$$\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| \le \left\|W_{hh}^T\right\|\;\left\|\mathrm{diag}[g'(h_{i-1})]\right\| \le \gamma_W\,\gamma_g$$

and therefore, for the whole chain from step t back to step k,

$$\left\|\frac{\partial h_t}{\partial h_k}\right\| \le (\gamma_W\,\gamma_g)^{\,t-k},$$

where $\gamma_W$ bounds $\|W_{hh}^T\|$ and $\gamma_g$ bounds $\|\mathrm{diag}[g'(h_{i-1})]\|$.
What have we concluded from the upper bound on the derivative of the recurrent step?

$$\frac{\partial h_t}{\partial h_k} = \prod_{t \ge i > k} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{t \ge i > k} W_{hh}^T\,\mathrm{diag}[g'(h_{i-1})]$$

$$\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| \le \left\|W_{hh}^T\right\|\;\left\|\mathrm{diag}[g'(h_{i-1})]\right\| \le \gamma_W\,\gamma_g \qquad\Rightarrow\qquad \left\|\frac{\partial h_t}{\partial h_k}\right\| \le (\gamma_W\,\gamma_g)^{\,t-k}$$

How? If we multiply the same term $\gamma_W\,\gamma_g < 1$ again and again, the overall number becomes very small (almost equal to zero).

Repeated matrix multiplication leads to vanishing and exploding gradients.
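A small NumPy illustration of this bound (an assumption-laden sketch, not from the deck): the spectral norm of the product of step Jacobians $W_{hh}^T\,\mathrm{diag}(1 - h_i^2)$ through a randomly initialized tanh RNN shrinks roughly geometrically with the number of steps when $\gamma_W\,\gamma_g < 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
Whh = rng.normal(scale=0.3 / np.sqrt(n), size=(n, n))     # small weights: gamma_W * gamma_g < 1
h, J = np.zeros(n), np.eye(n)
for i in range(30):
    h = np.tanh(Whh @ h + rng.normal(size=n))             # roll the RNN forward one step
    J = (Whh.T * (1 - h ** 2)) @ J                        # multiply by W_hh^T diag(1 - h_i^2)
    if (i + 1) % 10 == 0:
        print(i + 1, np.linalg.norm(J, 2))                # spectral norm decays towards 0
```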
Vanishing Gradient in Long-term Dependencies
$$\frac{\partial E_3}{\partial W} = \underbrace{\frac{\partial E_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W}}_{\lll\, 1} + \underbrace{\frac{\partial E_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial W}}_{\ll\, 1} + \underbrace{\frac{\partial E_3}{\partial h_3}\frac{\partial h_3}{\partial W}}_{<\, 1}$$

Total gradient = gradient due to long-term dependencies + gradient due to short-term dependencies.

[Diagram: unrolled RNN with inputs $x_1, x_2, x_3$, hidden states $h_1, h_2, h_3$, outputs $o_1, o_2, o_3$ and errors $E_1, E_2, E_3$.]

Remark: the gradients due to short-term dependencies (just the previous steps) dominate the gradients due to long-term dependencies. This means the network will tend to focus on short-term dependencies, which is often not desired.

This is the problem of vanishing gradients.
Exploding Gradient in Long-term Dependencies
What have we concluded from the upper bound on the derivative of the recurrent step?

$$\frac{\partial h_t}{\partial h_k} = \prod_{t \ge i > k} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{t \ge i > k} W_{hh}^T\,\mathrm{diag}[g'(h_{i-1})]$$

$$\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| \le \left\|W_{hh}^T\right\|\;\left\|\mathrm{diag}[g'(h_{i-1})]\right\| \le \gamma_W\,\gamma_g \qquad\Rightarrow\qquad \left\|\frac{\partial h_t}{\partial h_k}\right\| \le (\gamma_W\,\gamma_g)^{\,t-k}$$

How? If we multiply the same term $\gamma_W\,\gamma_g > 1$ again and again, the overall number explodes, and hence the gradient explodes.

Repeated matrix multiplication leads to vanishing and exploding gradients.
Exploding Gradient in Long-term Dependencies
$$\frac{\partial E_3}{\partial W} = \underbrace{\frac{\partial E_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W}}_{\ggg\, 1} + \underbrace{\frac{\partial E_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial W}}_{\ggg\, 1} + \underbrace{\frac{\partial E_3}{\partial h_3}\frac{\partial h_3}{\partial W}}_{\ggg\, 1}$$

= a very large number, i.e., NaN.

[Diagram: unrolled RNN with hidden states $h_1, h_2, h_3$, outputs $o_1, o_2, o_3$ and errors $E_1, E_2, E_3$.]

$$\left\|\frac{\partial h_t}{\partial h_k}\right\| \le (\gamma_W\,\gamma_g)^{\,t-k} \qquad \text{for a tanh or linear activation}$$

Remark: this problem of exploding/vanishing gradients occurs because the same number is multiplied into the gradient repeatedly.
Dealing With Exploding Gradients
• As discussed, the gradient vanishes due to the recurrent part of the RNN equations.
• What if the largest eigenvalue of the parameter matrix becomes 1? But in this case, the memory just grows.
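The heading points at dealing with exploding gradients; the standard remedy from the cited Bengio et al. paper is gradient-norm clipping. A minimal sketch (my illustration, not taken from the deck):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global norm exceeds a threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```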
Gates:
- a way to optionally let information through
- composed of a sigmoid neural-net layer and a pointwise multiplication operation
- remove or add information to the cell state

There are 3 gates in an LSTM: the input gate, the forget gate, and the output gate, which act on the current cell state.

[Diagram: LSTM cell walking through an example with the word „Clouds“.]

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM): Gating Mechanism
[Diagram: LSTM example with the word „clouds“.]
Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton
Long Short Term Memory (LSTM)
Motivation:
- create a self-loop path through which the gradient can flow
- the self-loop corresponds to an eigenvalue of the Jacobian slightly less than 1

[Diagram: self loop carrying the new state.]

Key ingredients:
- Cell state: transports the information through the units
- Gates: optionally allow information passage

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM): Step by Step
Forget gate:
- decides what information to throw away or keep from the previous cell state
- decision maker: a sigmoid layer (the "forget gate layer")
- the output of the sigmoid lies between 0 and 1, where 0 means forget and 1 means keep

Input gate: selectively updates the cell state based on the new input.
- a multiplicative input gate unit protects the memory contents stored in the cell from perturbation by irrelevant inputs

Cell update:
- update the old cell state $C_{t-1}$ into the new cell state $C_t$
- multiply the old state by $f_t$, forgetting the things we decided to forget earlier
- add $i_t * \tilde{C}_t$, the new candidate values scaled by how much we decided to update each state value
As seen, the gradient vanishes due to the recurrent part of the RNN equations. The forget gate parameters take care of the vanishing gradient problem: along the cell-state path the activation function becomes the identity, and therefore the problem of vanishing gradients is addressed. The derivative of the identity function is, conveniently, always one, so if $f = 1$, information from the previous cell state passes through this step unchanged.
Parameter Dimension
LSTM code snippet
Code snippet for an LSTM unit: the LSTM forward-pass equations and the shapes of the gates (offline).
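Since the snippet itself is not reproduced in this extraction, here is a minimal sketch of an LSTM forward step in NumPy (the stacked weight layout, names, and dimensions are my assumptions); it shows that the forget, input, and output gates and the candidate cell values all share the shape of the hidden state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,); each gate has shape (H,)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # all four pre-activations at once, shape (4H,)
    f = sigmoid(z[0:H])                   # forget gate
    i = sigmoid(z[H:2 * H])               # input gate
    o = sigmoid(z[2 * H:3 * H])           # output gate
    g = np.tanh(z[3 * H:4 * H])           # candidate cell values (C-tilde)
    c_t = f * c_prev + i * g              # new cell state: forget, then add scaled candidates
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t
```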
• The GRU, like the LSTM, attempts to solve the vanishing gradient problem in RNNs.

Gates: the update gate and the reset gate. These two vectors decide what information should be passed to the output.

Update gate:
- determines how much of the past information (from previous time steps) needs to be passed along to the future
- learns to copy information from the past so that the gradient does not vanish

Reset gate:
- determines how much of the past information the unit forgets
Memory content:
$$h'_t = \tanh(W x_t + r_t \odot U h_{t-1})$$

Final memory at the current time step:
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h'_t$$

Now,
$$\frac{\partial h_j}{\partial h_{j-1}} = z_j + (1 - z_j)\,\frac{\partial h'_j}{\partial h_{j-1}},$$
where the second factor depends on the weight matrix and the derivative of the activation function, and
$$\frac{\partial h_j}{\partial h_{j-1}} = 1 \quad \text{for } z_j = 1.$$
Chung et al, 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
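For comparison with the LSTM sketch above, a GRU step directly following the two equations on this slide might look as follows (a hedged sketch; the weight names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: update gate z_t, reset gate r_t, memory content h'_t, final memory h_t."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_cand = np.tanh(W @ x_t + r_t * (U @ h_prev))     # h'_t = tanh(W x_t + r_t * U h_{t-1})
    h_t = z_t * h_prev + (1 - z_t) * h_cand            # h_t = z_t * h_{t-1} + (1 - z_t) * h'_t
    return h_t
```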
Break (10 minutes)
Gupta 2015. (Master Thesis). Deep Learning Methods for the Extraction of Relations in Natural Language Text
Gupta and Schütze. 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation
Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification
http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf RNN
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM
http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM
Applications
- represent the meaning of longer phrases
- map phrases into a vector space
- sentence parsing
- scene parsing
Application: relation extraction within and across sentence boundaries, i.e., document-level relation extraction.
Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries.
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM
Relation extraction within and across sentence boundaries, i.e., document-level relation extraction.
Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries.
Deep and Multi-tasking RNNs
Context vector:
- takes all encoder cells' outputs as input
- computes a probability distribution over the source-language words for each word in the decoder (e.g., 'Je')

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
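As a hedged sketch (not from the deck) of how such a context vector can be computed, dot-product attention scores every encoder output against the current decoder state, turns the scores into a probability distribution over the source words, and takes the weighted sum:

```python
import numpy as np

def attention_context(decoder_state, encoder_outputs):
    """decoder_state: (H,), encoder_outputs: (T_src, H). Returns the context vector and attention weights."""
    scores = encoder_outputs @ decoder_state     # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # probability distribution over source words
    context = weights @ encoder_outputs          # weighted sum of encoder outputs, shape (H,)
    return context, weights
```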
Explainability/Interpretability of RNNs
Visualization
- Visualize output predictions: LISA
- Visualize neuron activations: sensitivity analysis

Further details:
- Gupta et al., 2018. "LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation". https://arxiv.org/abs/1808.01591
- Andrej Karpathy, blog on "The Unreasonable Effectiveness of Recurrent Neural Networks"
- Hendrik et al., "Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks"
Explainability/Interpretability of RNNs
https://www.researchgate.net/publication/328956863_LISA_Explaining_RNN_Judgments_via_Layer-wIse_Semantic_Accumulation_and_Example_to_Pattern_Transformation_Analyzing_and_Interpreting_RNNs_for_NLP

Full paper: Gupta et al., 2018. "LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation". https://arxiv.org/abs/1808.01591
All three models assign high sensitivity to "hate" and dampen the influence of other tokens. The LSTM offers a clearer focus on "hate" than the standard recurrent model, but the bi-directional LSTM shows the clearest focus, attaching almost zero emphasis to words other than "hate". This is presumably due to the gate structures in LSTMs and Bi-LSTMs that control information flow, making these architectures better at filtering out less relevant information.
RNNs in Topic Trend Extraction (Dynamic Topic Evolution): RNN-RSM
[Figure: topical trends over time: topics such as 'Rule Set', 'Linear Model', 'Neural Network Language Models', 'Word Representation' / 'Word Embedding' and their topic-words evolving across 1996, 1997, and 2014.]
Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
Key Takeaways
Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton
Lecture by Richard Socher: https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
Understanding LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recursive NN: http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Attention: https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
Gupta, 2015. Master Thesis on “Deep Learning Methods for the Extraction of Relations in Natural Language Text”
Gupta et al., 2016. Table Filling Multi-Task Recurrent Neural Network for Joint Entity and Relation Extraction.
Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification.
Vu et al., 2016. Bi-directional recurrent neural network with ranking loss for spoken language understanding.
Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
Gupta et al., 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to
Pattern Transformation.
Gupta et al., 2018. Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts.
Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries
Talk/slides: https://vimeo.com/277669869
[email protected]
@Linkedin: https://www.linkedin.com/in/pankaj-gupta-6b95bb17/