CS115 Math for Computer Science
November 2, 2024
The contents of this document are taken mainly from the following sources:
1. Robin M. Schmidt, "Recurrent Neural Networks (RNNs): A gentle introduction and overview." https://arxiv.org/pdf/1912.05911.pdf
2. Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, "On the difficulty of training RNNs." https://arxiv.org/pdf/1211.5063.pdf
3. Yoshua Bengio, Patrice Simard, Paolo Frasconi, "Learning long-term dependencies with gradient descent is difficult." https://www.researchgate.net/profile/Y-Bengio/publication/5583935_Learning_long-term_dependencies_with_gradient_descent_is_difficult/links/546b702d0cf2397f7831c03d/Learning-long-term-dependencies-with-gradient-descent-is-difficult.pdf
Benefits
Possibility of processing input of any length;
Model size not increasing with size of input;
Computation takes into account historical information;
Weights are shared across time (see the sketch after the Costs list below).
Costs
Computation being slow;
Difficulty of accessing information from a long time ago;
Cannot consider any future input for the current state;
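To make the "any length" and "shared weights" points concrete, here is a minimal NumPy sketch (the dimensions, initialization, and tanh cell are illustrative assumptions, not from the slides): the same small set of parameters processes sequences of any length, and the sequential loop is also why computation is slow.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared set of parameters, whatever the sequence length
# (sizes chosen arbitrarily for the example).
W_hx = rng.normal(size=(4, 3)) * 0.1   # input  -> hidden
W_hh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden
b_h = np.zeros(4)

def run_rnn(xs):
    """Run the same cell over a sequence of any length T; the loop is
    inherently sequential, which is also why computation is slow."""
    h = np.zeros(4)                    # hidden state carries the history
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)
    return h

print(run_rnn(rng.normal(size=(5, 3))).shape)    # T = 5   -> (4,)
print(run_rnn(rng.normal(size=(500, 3))).shape)  # T = 500 -> (4,)
```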
for t = 1 to B do
    Generate the graph state S_G^t based on f_G, v^(t−1), and G;
    τ.append(v^t);
end for
The different categories of sequence modeling
Many to One RNN
Many-to-One is used when a single output is required from multiple input units or a sequence of them: it takes a sequence of inputs and produces a single, fixed-size output.
Sentiment analysis is a common example of this type of recurrent neural network.
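As an illustration, here is a toy many-to-one sketch in NumPy (the sizes, random weights, and sigmoid read-out are assumptions for shape illustration only, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 8, 16                        # toy sizes

W_hx = rng.normal(size=(d_h, d_in)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1
w_y = rng.normal(size=d_h) * 0.1         # a single output head

def sentiment_score(token_vecs):
    """Many-to-one: read the whole sequence, emit one number."""
    h = np.zeros(d_h)
    for x in token_vecs:                 # consume every input step
        h = np.tanh(W_hx @ x + W_hh @ h)
    return 1.0 / (1.0 + np.exp(-(w_y @ h)))   # sigmoid read-out

review = rng.normal(size=(12, d_in))     # 12 "token embeddings"
print(sentiment_score(review))           # one score for the whole review
```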
Multilayer RNN
[Figure: a two-layer RNN unfolded through time, with hidden states h1^(t) and h2^(t) and outputs y^(t) at steps t−1, t, t+1.]
Net input: $z_h^{(t)} = W_{hx} x^{(t)} + W_{hh} h^{(t-1)} + b_h$

Activation: $h^{(t)} = \phi_h(z_h^{(t)})$

Net input: $z_y^{(t)} = W_{yh} h^{(t)} + b_y$

Output: $y^{(t)} = \phi_y(z_y^{(t)})$
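A direct transcription of these four equations into NumPy (choosing $\phi_h = \tanh$ and $\phi_y$ = identity as illustrative activations, since the slides leave them abstract):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """One forward step of the RNN, mirroring the four equations above."""
    z_h = W_hx @ x_t + W_hh @ h_prev + b_h   # net input (hidden layer)
    h_t = np.tanh(z_h)                       # activation: phi_h = tanh
    z_y = W_yh @ h_t + b_y                   # net input (output layer)
    y_t = z_y                                # output: phi_y = identity
    return h_t, y_t
```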
Gradient descent (GD) adjusts the weights of the model by computing the derivatives of the error function with respect to each entry of the weight matrices.
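Schematically, one such update looks like the following sketch (the learning rate and the gradient argument are placeholders):

```python
def gd_step(W, dL_dW, lr=0.01):
    """One gradient-descent update: move W a small step against
    the gradient of the loss (lr is an illustrative learning rate)."""
    return W - lr * dL_dW
```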
Hidden variable
$$h^{(t)} = \phi_h\left(x^{(t)} \cdot W_{xh} + h^{(t-1)} \cdot W_{hh} + b_h\right)$$
Output variable
$$y^{(t)} = \phi_y\left(h^{(t)} \cdot W_{yh} + b_y\right)$$
We can define a loss function $L(y, o)$ to describe the difference between all outputs $y^{(t)}$ and the target values $o^{(t)}$, as shown in the equation below.
Loss Function
$$L(y, o) = \sum_{t=1}^{T} \ell^{(t)}\left(y^{(t)}, o^{(t)}\right)$$
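A minimal sketch of this sum, assuming squared error for the per-step loss $\ell^{(t)}$ (the slides leave $\ell$ abstract):

```python
import numpy as np

def sequence_loss(ys, os):
    """L(y, o) = sum over t of l(t)(y(t), o(t)), with l taken to be
    squared error here (an assumption; any per-step loss works)."""
    return sum(float(np.sum((y - o) ** 2)) for y, o in zip(ys, os))
```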
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot \frac{\partial z_y^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot W_{yh} \cdot \frac{\partial h^{(t)}}{\partial W_{hh}}$$
[Figure: the RNN unfolded as a computation graph (nodes h^(t−1), h^(t), y^(t), o^(t); edges W_hh, W_yh), illustrating how ∂h^(t)/∂W_hh expands recursively into a sum of contributions from every earlier time step.]
Since each $h^{(t)}$ depends on the previous time step, we can expand the last factor of the equation above recursively.
Partial derivative with respect to Whh
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot W_{yh} \cdot \sum_{k=1}^{t} \frac{\partial h^{(t)}}{\partial h^{(k)}} \cdot \frac{\partial h^{(k)}}{\partial W_{hh}}$$

where

$$\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}} = \prod_{i=k+1}^{t} W_{hh}^{T}\, \mathrm{diag}\!\left(\phi_h'(z_h^{(i)})\right)$$
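This product of per-step Jacobians can be computed directly from the stored net inputs; a sketch assuming $\phi_h = \tanh$ (so $\phi_h'(z) = 1 - \tanh(z)^2$):

```python
import numpy as np

def hidden_to_hidden_jacobian(W_hh, zs, k, t):
    """dh(t)/dh(k) as a product of per-step Jacobians
    dh(i)/dh(i-1) = diag(tanh'(z_h(i))) @ W_hh.T,
    where zs[i] is the stored net input z_h(i)."""
    J = np.eye(W_hh.shape[0])
    for i in range(t, k, -1):   # chain rule: J_t @ J_{t-1} @ ... @ J_{k+1}
        J = J @ (np.diag(1.0 - np.tanh(zs[i]) ** 2) @ W_hh.T)
    return J
```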
For $\phi_h = \tanh$, each factor in the product is $\phi_h'(z_h^{(i)}) = 1 - \big(h^{(i)}\big)^2$. The Jacobian $\frac{\partial h^{(k+1)}}{\partial h^{(k)}}$ is then just $W^T$ with the $j$th row multiplied by $1 - \big(h_j^{(k+1)}\big)^2$, or:

$$\frac{\partial h^{(k+1)}}{\partial h^{(k)}} = \mathrm{diag}\!\left(1 - \big(h^{(k+1)}\big)^2\right) \cdot W^T = \begin{pmatrix} 1 - \big(h_1^{(k+1)}\big)^2 & 0 & \cdots & 0 \\ 0 & 1 - \big(h_2^{(k+1)}\big)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \end{pmatrix} \cdot \begin{pmatrix} W_{11} & W_{21} & W_{31} & \cdots \\ W_{12} & W_{22} & \cdots & \cdots \\ \vdots & \vdots & \ddots & \vdots \end{pmatrix}$$
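This single-step formula is easy to verify numerically; the following sketch (random weights and state, row-vector convention as above, all values made up for the example) compares it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W = rng.normal(size=(d, d)) * 0.5
h_k = rng.normal(size=d)

def step(h):
    """One step h(k+1) = tanh(h(k) @ W), row-vector convention."""
    return np.tanh(h @ W)

h_next = step(h_k)
J_analytic = np.diag(1.0 - h_next ** 2) @ W.T   # diag(1 - h(k+1)^2) @ W^T

# Finite-difference Jacobian, column by column, for comparison
eps = 1e-6
J_numeric = np.empty((d, d))
for b in range(d):
    e = np.zeros(d)
    e[b] = eps
    J_numeric[:, b] = (step(h_k + e) - step(h_k - e)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))   # agreement up to ~1e-10
```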
Taking the norm of a single factor and applying submultiplicativity:

$$\left\|\frac{\partial h^{(i)}}{\partial h^{(i-1)}}\right\| = \left\|\mathrm{diag}\!\left(\phi_h'(z_h^{(i)})\right) W_{hh}^{T}\right\| \le \left\|\mathrm{diag}\!\left(\phi_h'(z_h^{(i)})\right)\right\| \cdot \left\|W_{hh}^{T}\right\| \le \gamma \lambda$$

where $\lambda = \|W_{hh}^{T}\|$ and $\gamma$ bounds the activation derivative:

$$\|\phi'\| \le \gamma = \frac{1}{4} \;\; \text{if } \phi \text{ is sigmoid}, \qquad \|\phi'\| \le \gamma = 1 \;\; \text{if } \phi \text{ is tanh}$$
$$f(x) = \frac{1}{1+e^{-x}}$$

$$f'(x) = \frac{d}{dx}\,\frac{1}{1+e^{-x}} = \frac{-1}{(1+e^{-x})^2}\cdot\left(-e^{-x}\right) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1+e^{-x}-1}{(1+e^{-x})^2}$$

$$= \frac{1+e^{-x}}{(1+e^{-x})^2} - \frac{1}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} - \left(\frac{1}{1+e^{-x}}\right)^2 = f(x) - f(x)^2 = f(x)\,(1-f(x))$$
$$\sigma'(x) = \frac{d}{dx}\,\sigma(x) = \text{slope of } \sigma(x) \text{ at } x = \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right)$$
Derivative of σ(x) with respect to x
$x = 10 \;\Rightarrow\; \sigma(x) \approx 1$, so $\frac{d}{dx}\sigma(x) \approx 1\,(1 - 1) = 0$

$x = -10 \;\Rightarrow\; \sigma(x) \approx 0$, so $\frac{d}{dx}\sigma(x) \approx 0\,(1 - 0) = 0$

$x = 0 \;\Rightarrow\; \sigma(x) = \frac{1}{2}$, so $\frac{d}{dx}\sigma(x) = \frac{1}{2}\left(1 - \frac{1}{2}\right) = \frac{1}{4}$
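The same three evaluations in code, using the identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ derived above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # sigma'(x) = sigma(x)(1 - sigma(x))

for x in (10.0, -10.0, 0.0):
    print(x, sigmoid(x), sigmoid_prime(x))
# x = 10  -> sigma ~ 1.0, derivative ~ 0    (saturated)
# x = -10 -> sigma ~ 0.0, derivative ~ 0    (saturated)
# x = 0   -> sigma = 0.5, derivative = 0.25 (the maximum of sigma')
```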
$$\left\|\frac{\partial h^{(t)}}{\partial h^{(k)}}\right\| = \left\|\prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}}\right\| \le \prod_{i=k+1}^{t} \gamma\lambda \le (\gamma\lambda)^{(t-k)}$$
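Consequently, when $\gamma\lambda < 1$ this factor shrinks exponentially with the gap $t-k$ (vanishing gradients), and when $\gamma\lambda > 1$ it can grow exponentially (exploding gradients). A tiny numeric illustration of the bound (the values of $\lambda$ are made up for the example):

```python
gamma = 0.25                      # sigmoid: |sigma'(x)| <= 1/4
for lam in (2.0, 8.0):            # lam = ||W_hh^T||, illustrative values
    print(lam, [(gamma * lam) ** n for n in (1, 10, 50)])
# gamma*lam = 0.5 -> (0.5)^50 ~ 9e-16: the gradient contribution vanishes
# gamma*lam = 2.0 -> (2.0)^50 ~ 1e15:  the gradient contribution explodes
```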