11.1. Deep Learning (RNN)
Sequence Models
Part – 1
Dr. Oybek Eraliev,
Department of Computer Engineering
Inha University in Tashkent.
Email: [email protected]
Key idea: RNNs have an "internal state" that is updated as a sequence is processed.
[Diagram: a single RNN cell maps input x to output y through this internal state; unrolled over time it becomes a chain of identical cells with inputs x1 ... xt, hidden states h1 ... ht, and outputs y1 ... yt.]
h_t = f_W(h_{t-1}, x_t)
h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = f(W_hy h_t)
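To make the recurrence concrete, here is a minimal NumPy sketch of one forward pass; the dimensions, random weights, and the choice of an identity output function f are illustrative assumptions, not code from the lecture.

```python
import numpy as np

# Minimal sketch of the vanilla RNN recurrence above (illustrative sizes).
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 4, 8, 3, 5

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

xs = rng.normal(size=(T, input_dim))   # input sequence x_1 ... x_T
h = np.zeros(hidden_dim)               # initial hidden state h_0

ys = []
for x_t in xs:
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t): the same weights are reused at every step
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    # y_t = f(W_hy h_t); here f is the identity for simplicity
    ys.append(W_hy @ h)

print(np.stack(ys).shape)  # (T, output_dim): one output per time step
```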
[Diagram: unrolling over the inputs x_1, x_2, x_3, ...: the same weight matrix W is copied and reused at every time step, and each step produces an output y_1, y_2, ..., y_t.]
One big problem: the more we unroll a recurrent neural network, the harder it is to train.
This problem is called the Vanishing/Exploding Gradient Problem.
In our example, the Vanishing/Exploding Gradient Problem has to do with the recurrent weight W_hh that we copy each time we unroll the network.
Suppose the recurrent weight is W_hh = 2. Each time we unroll the network, the contribution of the first input x_1 gets multiplied by W_hh again:
x_1 × 2
x_1 × 2 × 2
x_1 × 2 × 2 × 2
...
After unrolling n times, the contribution becomes x_1 × 2^n, i.e. x_1 × W_hh^(number of unrolls).
Now, if we had 50 sequential RNN cells, x_1's contribution is multiplied by W_hh^50. With W_hh = 2 that is 2^50 ≈ 1.1 × 10^15, an enormous number, so the gradient explodes.
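A quick numeric check of the idea above (the value 2 for W_hh and the 50-step count are just the example's assumptions):

```python
# Repeatedly multiplying by the recurrent weight W_hh = 2 mimics what
# happens to x_1's contribution as the network is unrolled 50 times.
w_hh = 2.0
x1 = 1.0

signal = x1
for _ in range(50):
    signal *= w_hh          # one more unroll: multiply by W_hh again

print(signal)               # 2**50 ~= 1.1e15 -- the value (and gradient) explodes
```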
Gradient descent updates each weight as W := W − α · ∂J(W)/∂W. However, when the gradient contains a huge number, we end up taking relatively large steps across the loss surface J(W).
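As a small sketch of why a huge gradient breaks the update rule W := W − α · ∂J(W)/∂W (the numbers below are made up purely for illustration):

```python
# One gradient-descent step: W := W - alpha * dJ/dW
alpha = 0.01               # learning rate
w = 0.5                    # current weight value

normal_grad = 2.0          # a typical gradient
exploded_grad = 1.1e15     # a gradient that blew up through many unrolls

print(w - alpha * normal_grad)     # 0.48   -> a small, sensible step
print(w - alpha * exploded_grad)   # ~ -1.1e13 -> a huge, destructive step
```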
One way to prevent the Exploding Gradient Problem would be to limit W_hh to values < 1.
Now, if we again had 50 sequential RNN cells but with W_hh < 1 (say W_hh = 0.5), x_1's contribution is multiplied by 0.5^50 ≈ 8.9 × 10^−16, a vanishingly small number.
Now, when optimizing a parameter with W := W − α · ∂J(W)/∂W, instead of taking steps that are too large, we end up taking steps that are too small to make progress on J(W). This is the Vanishing Gradient Problem.
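The mirror-image sketch for the vanishing case (again, W_hh = 0.5 and 50 steps are illustrative assumptions):

```python
# With W_hh limited to values < 1, repeated multiplication shrinks the
# signal (and the gradient) toward zero over 50 unrolls.
w_hh = 0.5
grad_contribution = 1.0

for _ in range(50):
    grad_contribution *= w_hh

alpha = 0.01
print(grad_contribution)           # 0.5**50 ~= 8.9e-16
print(alpha * grad_contribution)   # ~ 8.9e-18 -> a step far too small to learn from
```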
[Diagram: the output y_t = f(W_hy h_t) computed from the hidden state after processing the final input x_T.]
RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
There are several variations on the full gated unit, with gating done using the previous hidden state; other variants include the light gated recurrent unit and a simplified form called the minimal gated unit.
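For reference, here is a minimal NumPy sketch of a standard GRU-style gated unit; the gate names, dimensions, and initialization are assumptions for illustration, not code from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, params):
    """One step of a standard gated recurrent unit (GRU)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate (uses previous hidden state)
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde           # blend old state and candidate

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
params = [rng.normal(scale=0.1, size=(hidden_dim, d))
          for d in (input_dim, hidden_dim, input_dim, hidden_dim, input_dim, hidden_dim)]
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_cell(x_t, h, params)
print(h.shape)   # (8,)
```

The minimal gated unit mentioned above simplifies this further by merging the update and reset gates into a single forget gate, roughly halving the number of gate parameters.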