CS115 - Math for Computer Science

Recurrent Neural Networks (RNNs)

and the Exploding & Vanishing Gradient Problems

Dang Huu Phat 22521065


Phan Nguyen Huu Phong 22521090
Chau The Vi 22521653

University of Information Technology

November 2, 2024



References

The contents of this document are taken mainly from the following sources:
Robin M. Schmidt, Recurrent Neural Networks (RNNs): A Gentle Introduction and Overview, https://arxiv.org/pdf/1912.05911.pdf
Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, On the Difficulty of Training Recurrent Neural Networks, https://arxiv.org/pdf/1211.5063.pdf
Yoshua Bengio, Patrice Simard, Paolo Frasconi, Learning Long-Term Dependencies with Gradient Descent is Difficult, https://www.researchgate.net/profile/Y-Bengio/publication/5583935_Learning_long-term_dependencies_with_gradient_descent_is_difficult/links/546b702d0cf2397f7831c03d/Learning-long-term-dependencies-with-gradient-descent-is-difficult.pdf
Table of Contents

1 Introduction & Notation

2 Training Recurrent Neural Networks

3 Exploding and Vanishing Gradient



1 Introduction & Notation


Sequential Data
Sequential data is more commonly known as sequence data or sequences.
What makes it unique is that the elements in a sequence appear in a certain order and are not independent of each other.
Example: Stock Price Prediction
In this example, the input data is the historical stock prices: i1 is the stock price on January 1st, i2 is the stock price on January 2nd, and so on. (i1, i2, ..., i30) is called a sequence. The RNN learns from this input and predicts the stock price in the future.

Figure: Example of Stock Price Prediction
Sequential data versus time series data

Time series data is a special type of sequential data where each example is associated with a dimension for time.
In time series data, samples are taken at successive timestamps, and therefore the time dimension determines the order among the data points. For example, stock prices and voice or speech recordings are time series data.
In contrast, not all sequential data shares this temporal structure. Text data or DNA sequences, for instance, exhibit ordered patterns but lack a distinct time dimension.


Sequential data versus time series data

An example from machine translation: the input is the sentence "I love CS115", and the order of the words significantly impacts the meaning of the sentence.
The input data, which consists of the sequence of words ['I', 'love', 'CS115'], is referred to as a sequence.
In natural language processing (NLP), it is not feasible to process entire sentences as single inputs. Similar to how videos are processed by breaking them down into individual frames, in NLP sentences are broken down into individual words for input processing.


Representing sequences

"The movie my friend has not seen is good" versus "The movie my friend has seen is not good": the same words in a different order carry a different meaning.
Each word is mapped to an input x<1>, x<2>, ..., x<6> with a corresponding output y<1>, y<2>, ..., y<6>, ordered along the time axis.

Figure: An example of time series data


What are Recurrent Neural Networks (RNNs)?
Recurrent Neural Networks (RNNs) are a type of neural network architecture used to detect patterns in sequences.
They can also be extended to images by breaking the images down into a series of patches and treating the patches as a sequence.
What differentiates Recurrent Neural Networks from Feedforward Neural Networks, also known as Multi-Layer Perceptrons (MLPs), is how information gets passed through the network.


What are Recurrent Neural Networks (RNNs)?
Multilayer perceptrons (MLPs), and CNNs for image data, assume that the training examples are independent of each other and thus do not incorporate ordering information.
We can say that such models do not have a memory of previously seen training examples.
RNNs, by contrast, are designed for modeling sequences: they are capable of remembering past information and processing new events accordingly, which is a clear advantage when working with sequence data.


Pros and Cons

Benefits
Possibility of processing input of any length;
Model size not increasing with size of input;
Computation takes into account historical information;
Weights are shared across time

Costs
Computation being slow;
Difficulty of accessing information from a long time ago;
Cannot consider any future input for the current state;



Algorithm 2: Evaluate the learned policy on an unlabeled test graph

Input: the unlabeled test graph G, the learned policy πθ, the query budget B
Result: the classification GNN f trained on the selected node sequence τ

τ = []
V⁰_label ← ∅
Randomly initialize the classification GNN as f_{G, V⁰_label}
for t = 1 to B do
    Generate the graph state S^t_G based on f_{G, V^{t−1}_label} and G
    Compute the probability distribution over candidate nodes as p^t_G = πθ(S^t_G)
    Select v_t = arg max_v p^t_G(v) from V_train \ V^{t−1}_label and query for its label
    V^t_label ← V^{t−1}_label ∪ {v_t}
    Train the classification GNN for one more epoch to get f_{G, V^t_label}
    τ.append(v_t)
end for
The different categories of sequence modeling
Many to One RNN
Many-to-One is used when a single output is required from multiple input units or a sequence of them. It takes a sequence of inputs and produces a single, fixed-size output.
Sentiment analysis is a common example of this type of Recurrent Neural Network.

Figure: Many to One model


The different categories of sequence modeling

Many to Many RNN

Many-to-Many is used to generate a sequence of outputs from a sequence of input units.
This type of RNN is further divided into the following two subcategories:
1 Equal Unit Size: the number of input and output units is the same. A common application can be found in Named-Entity Recognition.
2 Unequal Unit Size: the inputs and outputs have different numbers of units. An application can be found in Machine Translation.


Understanding the dataflow in RNNs
In a standard feedforward network, information flows from the input to
the hidden layer, and then from the hidden layer to the output layer.
On the other hand, in an RNN, the hidden layer receives its input from
both the input layer of the current time step and the hidden layer from
the previous time step.

Figure: The dataflow of a standard feedforward NN and an RNN



Understanding the dataflow in RNNs

[Diagram: a single-layer RNN and a multilayer (two-hidden-layer) RNN unfolded over time steps t−1, t, t+1, with inputs x<t>, hidden states h<t> (h1<t>, h2<t> in the multilayer case), and outputs y<t>.]

Figure: Examples of an RNN with one and two hidden layers


2 Training Recurrent Neural Networks


Weight matrices in a single-hidden layer RNN

[Diagram: the RNN unfolded over time steps t−1, t, t+1, showing the input-to-hidden weights Whx, the hidden-to-hidden (recurrent) weights Whh, and the hidden-to-output weights Wyh; the hidden-layer weights can be written jointly as Wh = [Whh; Whx].]

Net input of the hidden layer:
zh(t) = Whx x(t) + Whh h(t−1) + bh
Activation of the hidden layer:
h(t) = ϕh(zh(t))
Net input of the output layer:
zy(t) = Wyh h(t) + by
Output:
y(t) = ϕy(zy(t))
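To make these equations concrete, here is a minimal NumPy sketch of the forward pass they describe (an illustration, not from the slides), assuming tanh for ϕh and, for simplicity, the identity for ϕy; the variable names mirror Whx, Whh, Wyh, bh and by above.

```python
import numpy as np

def rnn_forward(X, Whx, Whh, Wyh, bh, by):
    """Forward pass of a single-hidden-layer RNN.
    X has shape (T, input_dim); returns hidden states hs and outputs ys.
    Assumes tanh for phi_h and the identity for phi_y."""
    h = np.zeros(Whh.shape[0])            # initial hidden state h^(0)
    hs, ys = [], []
    for x_t in X:
        zh = Whx @ x_t + Whh @ h + bh     # net input of the hidden layer
        h = np.tanh(zh)                   # h^(t) = phi_h(zh^(t))
        zy = Wyh @ h + by                 # net input of the output layer
        hs.append(h)
        ys.append(zy)                     # y^(t) = phi_y(zy^(t)), identity here
    return np.array(hs), np.array(ys)

# Toy usage with random weights
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 4, 3, 5, 2
X = rng.normal(size=(T, input_dim))
Whx = rng.normal(size=(hidden_dim, input_dim))
Whh = rng.normal(size=(hidden_dim, hidden_dim))
Wyh = rng.normal(size=(output_dim, hidden_dim))
bh, by = np.zeros(hidden_dim), np.zeros(output_dim)
hs, ys = rnn_forward(X, Whx, Whh, Wyh, bh, by)
```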


Gradient-based Learning Methods

Gradient descent (GD) adjusts the weights of the model by computing the derivatives of the error function with respect to each entry of the weight matrices.

Batch Gradient Descent

The gradient is computed over the whole dataset in each optimization iteration to perform a single update:

θ(t+1) = θ(t) − (λ/U) Σ_{k=1}^{U} ∂ℓk/∂θ

However, computing error derivatives through time is difficult. This is mostly due to the relationships among the parameters and the dynamics of the RNN, which are highly unstable and make plain GD ineffective.
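As a small illustration of this update rule (a sketch, not from the slides), assuming a hypothetical grad_loss(theta, x, y) helper that returns ∂ℓk/∂θ for a single example:

```python
import numpy as np

def batch_gd_step(theta, data, grad_loss, lr):
    """One batch GD update: theta <- theta - (lr / U) * sum_k dl_k/dtheta.
    `grad_loss` is an assumed helper returning the per-example gradient."""
    total_grad = np.zeros_like(theta)
    for x, y in data:                     # accumulate gradients over the whole dataset
        total_grad += grad_loss(theta, x, y)
    return theta - (lr / len(data)) * total_grad
```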
Back-propagation Through Time (BPTT)



Back-propagation Through Time (BPTT)

Hidden variable
h(t) = ϕh(x(t) · Wxh + h(t−1) · Whh + bh)

Output variable
y(t) = ϕy(h(t) · Wyh + by)

We can define a loss function L(y, o) to describe the difference between all outputs y(t) and the target values o(t), as shown in the equation below.

Loss Function
L(y, o) = Σ_{t=1}^{T} ℓ(t)(y(t), o(t))


Back-propagation Through Time (BPTT)
Since we have three weight matrices Wxh, Whh and Wyh, we need to compute the partial derivative of L with respect to each of them using the chain rule.

Partial derivative with respect to Wyh
∂L/∂Wyh = Σ_{t=1}^{T} ∂ℓ(t)/∂y(t) · ∂y(t)/∂zy(t) · ∂zy(t)/∂Wyh = Σ_{t=1}^{T} ∂ℓ(t)/∂y(t) · ∂y(t)/∂zy(t) · h(t)

Partial derivative with respect to Whh
∂L/∂Whh = Σ_{t=1}^{T} ∂ℓ(t)/∂y(t) · ∂y(t)/∂zy(t) · ∂zy(t)/∂h(t) · ∂h(t)/∂Whh
        = Σ_{t=1}^{T} ∂ℓ(t)/∂y(t) · ∂y(t)/∂zy(t) · Wyh · ∂h(t)/∂Whh
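As a concrete (hypothetical) instance of the Wyh term, assume the identity for ϕy and the squared-error loss ℓ(t) = ½‖y(t) − o(t)‖²; then ∂ℓ(t)/∂y(t) · ∂y(t)/∂zy(t) reduces to (y(t) − o(t)), and the sum can be accumulated as in this NumPy sketch:

```python
import numpy as np

def grad_Wyh(hs, ys, os):
    """dL/dWyh = sum_t outer(h_t, y_t - o_t), assuming the row-vector convention
    y_t = h_t @ Wyh + by (identity output) and loss l_t = 0.5 * ||y_t - o_t||^2."""
    dWyh = np.zeros((hs.shape[1], ys.shape[1]))
    for h_t, y_t, o_t in zip(hs, ys, os):
        dWyh += np.outer(h_t, y_t - o_t)   # accumulate the per-time-step terms
    return dWyh
```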


Back-propagation through time (BPTT)

[Diagram: the RNN computation graph unfolded over time steps t−1, t, t+1, with weights Wxh, Whh, Wyh, used to trace how the error flows backward from each output y(t) through the chain of hidden states.]


Back-propagation through time (BPTT)

[Diagram: step-by-step expansion of ∂h(t)/∂Whh on the unfolded graph h(t−2) → h(t−1) → h(t) → h(t+1), with output y(t) and weights Whh, Wyh.]

The hidden state h(t) depends on Whh both directly and indirectly, through h(t−1), which was itself computed with Whh. Applying the chain rule repeatedly therefore expands ∂h(t)/∂Whh into a sum of contributions from every earlier hidden state h(k), each propagated forward through the factor ∂h(t)/∂h(k).


Back-propagation through time (BPTT)

Since each h(t) depends on the previous time step, we can substitute this expansion into the last factor of the equation above.

Partial derivative with respect to Whh
∂L/∂Whh = Σ_{t=1}^{T} ∂ℓ(t)/∂y(t) · ∂y(t)/∂zy(t) · Wyh · Σ_{k=1}^{t} ∂h(t)/∂h(k) · ∂h(k)/∂Whh

Similarly, we have the partial derivative with respect to Wxh:

Partial derivative with respect to Wxh
∂L/∂Wxh = Σ_{t=1}^{T} ∂ℓ(t)/∂y(t) · ∂y(t)/∂zy(t) · Wyh · Σ_{k=1}^{t} ∂h(t)/∂h(k) · ∂h(k)/∂Wxh
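Putting the pieces together, here is a compact NumPy sketch of these BPTT gradients (an illustration, not the authors' code), using the forward definitions from the BPTT slides with tanh for ϕh, the identity for ϕy and squared-error loss; hs and ys are assumed to come from a matching forward pass, and os are the targets:

```python
import numpy as np

def bptt_grads(X, hs, ys, os, Whh, Wyh):
    """BPTT gradients for the simple RNN
        h_t = tanh(x_t @ Wxh + h_{t-1} @ Whh + bh),  y_t = h_t @ Wyh + by,
    with loss L = sum_t 0.5 * ||y_t - o_t||^2 and h_0 = 0."""
    T = X.shape[0]
    dWxh = np.zeros((X.shape[1], Whh.shape[0]))
    dWhh = np.zeros_like(Whh)
    dWyh = np.zeros_like(Wyh)
    dh_next = np.zeros(Whh.shape[0])        # gradient arriving from step t+1
    for t in reversed(range(T)):
        dy = ys[t] - os[t]                  # dL/dy_t for squared error, identity output
        dWyh += np.outer(hs[t], dy)
        dh = dy @ Wyh.T + dh_next           # total gradient reaching h_t
        dz = dh * (1.0 - hs[t] ** 2)        # through tanh: dh_t/dz_t = 1 - h_t^2
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[0])
        dWhh += np.outer(h_prev, dz)
        dWxh += np.outer(X[t], dz)
        dh_next = dz @ Whh.T                # propagate the gradient back to h_{t-1}
    return dWxh, dWhh, dWyh
```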


Back-propagation through time (BPTT)

In order to transport the error through time from time step t back to time step k, we can write

∂h(t)/∂h(k) = Π_{i=k+1}^{t} ∂h(i)/∂h(i−1)

Each factor of this product is the Jacobian matrix of one hidden state with respect to the previous one:

Π_{i=k+1}^{t} ∂h(i)/∂h(i−1) = Π_{i=k+1}^{t} Whhᵀ · diag(ϕ′h(zh(i)))
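A small numerical illustration (not from the slides) of why this product is dangerous: repeatedly applying Jacobians of the form Whhᵀ · diag(1 − h²) shrinks or blows up a backpropagated gradient vector, depending on the scale of Whh.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 20, 50

def grad_norm_after(scale):
    """Norm of a vector after `steps` products Whh^T @ diag(1 - h_i^2),
    with stand-in hidden states h_i drawn as tanh of Gaussian noise."""
    Whh = scale * rng.normal(size=(n, n)) / np.sqrt(n)
    g = np.ones(n)                          # some initial gradient vector
    for _ in range(steps):
        h = np.tanh(rng.normal(size=n))
        g = Whh.T @ (np.diag(1.0 - h**2) @ g)
    return np.linalg.norm(g)

print(grad_norm_after(0.5))   # shrinks toward 0: vanishing gradient
print(grad_norm_after(3.0))   # grows huge: exploding gradient
```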




Back-propagation through time (BPTT)

Assume the activation function for h is tanh:

∂tanh(x)/∂x = 1 − tanh(x)²

Thus:

∂hj(k+1)/∂hi(k) = wi,j (1 − [hj(k+1)]²)


Back-propagation through time (BPTT)

Collecting these entries over all i and j, this is just Wᵀ with the jth row multiplied by (1 − (hj(k+1))²), or:

∂h(k+1)/∂h(k) = diag(1 − (h(k+1))²) · Wᵀ

where diag(1 − (h(k+1))²) is the diagonal matrix whose jth diagonal entry is 1 − (hj(k+1))², and Wᵀ is the transposed hidden-to-hidden weight matrix.


Back-propagation through time (BPTT)

This is very problematic: Vanishing/Exploding gradient problem!



3 Exploding and Vanishing Gradient


Difficulty of training RNNs

Bounding the norm of each Jacobian factor gives

‖∂h(i)/∂h(i−1)‖ = ‖Whhᵀ · diag(ϕ′h(zh(i)))‖ ≤ ‖Whhᵀ‖ · ‖diag(ϕ′h(zh(i)))‖ ≤ λγ

where λ bounds the largest singular value of Whhᵀ and γ bounds the norm of the diagonal factor:

‖diag(ϕ′h(zh(i)))‖ ≤ 1/4 = γ if ϕh is the sigmoid (since σ′(x) ≤ 1/4);
‖diag(ϕ′h(zh(i)))‖ ≤ 1 = γ if ϕh is tanh (since tanh′(x) ≤ 1).


Prove the derivative of the sigmoid function

f(x) = 1 / (1 + e^(−x))

f′(x) = d/dx [ 1 / (1 + e^(−x)) ] = −(1 / (1 + e^(−x))²) · (−e^(−x))
      = e^(−x) / (1 + e^(−x))²
      = (1 + e^(−x) − 1) / (1 + e^(−x))²
      = 1 / (1 + e^(−x)) − 1 / (1 + e^(−x))²
      = f(x) − f(x)² = f(x)(1 − f(x))
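A quick numerical check of this identity, and of the fact (used below) that the sigmoid derivative never exceeds 1/4 (a sketch, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-10, 10, 2001)
analytic = sigmoid(xs) * (1 - sigmoid(xs))                      # f(x)(1 - f(x))
eps = 1e-5
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)   # central difference
print(np.max(np.abs(analytic - numeric)))    # close to 0: the identity holds
print(analytic.max())                        # 0.25, attained at x = 0
```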


Gradient of Sigmoid Function

Figure: Sigmoid function

σ′(x) = d/dx σ(x) = slope of σ(x) at x
      = (1 / (1 + e^(−x))) · (1 − 1 / (1 + e^(−x)))

Derivative of σ(x) with respect to x


Gradient of Sigmoid Function

x = 10 → σ(x) ≈ 1, so d/dx σ(x) ≈ 1 · (1 − 1) = 0
x = −10 → σ(x) ≈ 0, so d/dx σ(x) ≈ 0 · (1 − 0) = 0
x = 0 → σ(x) = 1/2, so d/dx σ(x) = (1/2) · (1 − 1/2) = 1/4


Difficulty of training RNNs

‖∂h(t)/∂h(k)‖ = ‖ Π_{i=k+1}^{t} ∂h(i)/∂h(i−1) ‖ ≤ Π_{i=k+1}^{t} γλ = (γλ)^(t−k)

1 If γλ > 1, the gradient can explode;
2 If γλ < 1, the gradient will vanish.
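A quick numerical sense of the bound (γλ)^(t−k), for example over 50 time steps:

```python
# How the bound (gamma * lambda)^(t - k) behaves for t - k = 50.
for gl in (0.9, 1.0, 1.1):
    print(gl, gl ** 50)
# 0.9 -> ~0.005  (vanishes)
# 1.0 -> 1.0
# 1.1 -> ~117    (explodes)
```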


Exploding Gradient

Gradients in training RNNs on long sequences may explode: as the weights become larger, the norm of the gradient during training increases greatly.

As stated before, the necessary condition for this to happen is λ > 1/γ.
