
COMPARISON OF GRADIENT DESCENT ALGORITHMS ON TRAINING NEURAL NETWORKS

Abstract
Gradient descent is one of the most popular optimization techniques, widely used in machine learning and deep learning to find the minimum of a loss function. The two main computational advantages of gradient descent methods are their ease of implementation and their low memory cost compared to second-order methods such as Newton's method. Nonetheless, in neural networks, gradient descent algorithms may fail to find the global optimum because of the non-convexity of the cost function and its many local minima. The main drawback of gradient descent methods is that their convergence depends strongly on the choice of learning rate, and they can get stuck at local minimum points.

The cost function of a neural network is non-convex for most activation functions, which can make it harder for the optimizer to find the optimum of the problem. A number of variants of gradient descent are commonly employed in machine learning and deep learning libraries such as scikit-learn, Keras, PyTorch, and TensorFlow. In this research, different gradient descent methods are implemented and compared in order to highlight the most efficient algorithms for training neural networks. The investigated algorithms are Gradient Descent with Momentum and Nesterov Momentum, Stochastic Gradient Descent (SGD), Root Mean Square Propagation (RMSProp), and Adaptive Moment Estimation (Adam) and its variants. The neural networks used to evaluate the gradient descent methods are recurrent neural networks, and the training data is the public oil production dataset from Equinor's Volve field.

1 Introduction
In optimization in general, and in machine learning in particular, we often have to find the minimum (or sometimes the maximum) of a function. In general, finding the global minimum of a loss function in machine learning is very complicated, sometimes even impossible. Instead, we often try to find local minimum points and, to some extent, regard them as solutions of the problem.
The local minima of a differentiable function are solutions of the equation obtained by setting its derivative to zero. However, in most cases, solving this equation analytically is impossible, whether because of the complexity of the derivative, the large number of dimensions of the data points, or the sheer number of data points.
The most common approach is to start from a point that we consider close to the solution of the problem, then use an iterative procedure to move toward the point we are looking for until the derivative is close to zero.
Gradient Descent (abbreviated as GD) and its variants are by far the most popular optimization algorithms for training machine learning and deep learning models. They have proved their performance and effectiveness over other approaches, such as second-order methods or population methods (PSO or GA), in training a wide range of neural networks, and they have been employed in various machine learning and deep learning libraries and frameworks such as Keras, scikit-learn, and TensorFlow.
Non-convex optimization problems are the most common scenario for neural networks, although Bengio et al. [1] have shown that training certain neural networks can be cast as a convex optimization problem. Finding global optima in these networks is genuinely challenging because of the large number of parameters in a high-dimensional space and the zero-gradient problem [2] during training. Hence the selection of an appropriate optimization algorithm is one of the most important tasks in training neural networks, especially deep neural networks.
In this work, we focus on training multi-layer neural networks and recurrent neural networks with different variants of gradient descent algorithms.
2 Gradient descent
Gradient descent overview
Gradient descent is an algorithm for optimizing an objective function J(θ) with respect to the parameters θ ∈ R^d by gradually updating θ in the direction opposite to the gradient of the objective function. The learning rate η determines the size of each step we take toward the minimum.
The advantages of GD are its ease of implementation and low memory cost [3], while its drawback is the convergence time. In machine learning and deep learning, we use this algorithm to update the parameters of models, i.e., the weights θ of neural networks. The general update rule of gradient descent is:

θ = θ − η∇J(θ)

where η is the learning rate, and the negative sign means that we move downhill in search of the minimum.
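As a minimal sketch of this update rule in Python/NumPy (illustrative, not the authors' implementation; grad_J, theta0, and the toy objective are assumptions introduced here):

import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, n_iters=500):
    """Vanilla gradient descent: theta <- theta - eta * grad J(theta)."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - eta * grad_J(theta)
    return theta

# Toy example: J(theta) = ||theta||^2 has gradient 2*theta and minimum at 0.
theta_min = gradient_descent(lambda th: 2.0 * th, np.array([3.0, -4.0]))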
There are three typical variants of the gradient descent method:

1. Batch gradient descent


In batch gradient descent, we calculate the gradient of the objective function J(θ) with respect to θ on the entire set of training data points. This approach has limitations on large datasets: after each iteration, we have to recompute the gradient on the whole dataset. Because this requires substantial memory and computation, the method is rarely used for training neural networks or for other large-scale machine learning problems.
Algorithm 1 (Batch gradient descent)

1. Initialize the weights θ.
2. Choose a learning rate η.
3. While iter < maximum iterations:
4. Calculate the gradient with respect to each weight θ on the entire dataset.
5. Update: θ ← θ − η∇J(θ).

2. Stochastic gradient descent


In contrast with batch gradient descent, stochastic gradient descent (SGD) updates the weights of the neural network using only one training sample per iteration: at each step, we calculate the gradient of the loss function on a single data point and update the weights with that gradient. This is done for each point of the entire dataset, and the process is then repeated. This very simple algorithm actually works very well for training neural networks, especially in deep learning.
One pass through all the data points is called an epoch. With regular GD, every epoch yields one update of θ, but with SGD each epoch corresponds to N updates of θ (where N is the number of data points). On the one hand, updating point by point slows down each epoch; on the other hand, SGD typically needs only a small number of epochs, so it is well suited to problems with large datasets.
An important note on SGD is that at the start of each epoch we should shuffle the data to ensure the randomness of the process; this also affects the performance of SGD. (A code sketch covering both SGD and mini-batch updates follows Algorithm 3 below.)
Algorithm 2 (Stochastic gradient descent)

1. Shuffle the entire dataset.
2. Initialize the weights θ.
3. Choose a learning rate η.
4. While iter < maximum iterations:
5. Calculate the gradient with respect to each weight θ on a single data point.
6. Update: θ ← θ − η∇J(θ).

3. Mini-batch gradient descent


Unlike SGD, mini-batch gradient descent uses a batch size k, usually greater than 1 (but still much smaller than the total number N of data points in the dataset). Like SGD, it starts each epoch by randomly shuffling the data and dividing the entire dataset into mini-batches of k data points each (the last mini-batch may be smaller if N is not divisible by k). This method is by far the most commonly used for training neural networks on large datasets; a short sketch of the update loop follows the algorithm.
Algorithm 3 (Mini-batch gradient descent)

1. Shuffle the entire dataset.
2. Initialize the weights θ.
3. Choose a learning rate η.
4. While iter < maximum iterations:
5. Calculate the gradient with respect to each weight θ on k data points.
6. Update: θ ← θ − η∇J(θ).
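A minimal sketch of the mini-batch loop (illustrative, not the authors' code; grad_J(theta, X_batch, y_batch) is an assumed function returning the mini-batch gradient). Setting k = 1 recovers SGD and k = N recovers batch gradient descent:

import numpy as np

def minibatch_sgd(grad_J, theta0, X, y, eta=0.01, k=32, n_epochs=100):
    """Mini-batch SGD: shuffle once per epoch, then step through batches of k."""
    theta = theta0.copy()
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        perm = rng.permutation(n)           # shuffle at the start of each epoch
        for start in range(0, n, k):
            idx = perm[start:start + k]     # the last batch may hold fewer points
            theta = theta - eta * grad_J(theta, X[idx], y[idx])
    return theta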

Gradient descent algorithms


1. Momentum
In gradient descent, we calculate the gradient of the loss function with respect to θ at time t and update θ to its new value. If we view this update as a velocity v_t in physics, the new position of θ is θ_t = θ_{t−1} − v_t; the negative sign means that we move against the gradient. In momentum [5], we estimate a velocity v_t that carries both the information of the slope (the gradient) and the previous momentum v_{t−1}. A simple way to combine these terms is a weighted sum of the two quantities:

v_t = γ·v_{t−1} + η·∇_θ J(θ_{t−1})

In general, γ can be chosen as 0.9; v_{t−1} is the previous velocity (the initial velocity can be set to zero), and ∇_θ J(θ_{t−1}) is the slope at the previous point. The update is then:

θ_t = θ_{t−1} − v_t
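A one-step sketch of this rule (illustrative, not the authors' code; theta, v, and grad are NumPy arrays or scalars):

def momentum_update(theta, v, grad, eta=0.01, gamma=0.9):
    """One momentum step: v_t = gamma*v_{t-1} + eta*grad_t; theta_t = theta_{t-1} - v_t."""
    v = gamma * v + eta * grad
    return theta - v, v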

2. Nesterov accelerated gradient


Momentum sometimes helps gradient descent overcome local minima; a limitation, however, is that when approaching the destination, momentum still takes quite a bit of time before stopping, precisely because of the accumulated momentum. Nesterov accelerated gradient (NAG) [6] is another method that overcomes this problem and allows the algorithm to converge faster. At each step, the Nesterov method computes the gradient at the look-ahead point θ_{t−1} − γ·v_{t−1}, which approximates the next position of the parameters. The NAG update rules are:

v_t = γ·v_{t−1} + η·∇_θ J(θ_{t−1} − γ·v_{t−1})

θ_t = θ_{t−1} − v_t
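A one-step sketch (illustrative; grad_J is an assumed function returning the gradient at a given point, so that the look-ahead evaluation is explicit):

def nag_update(theta, v, grad_J, eta=0.01, gamma=0.9):
    """One NAG step: the gradient is evaluated at the look-ahead point theta - gamma*v."""
    v = gamma * v + eta * grad_J(theta - gamma * v)
    return theta - v, v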
3. Adagrad
Momentum and Nesterov momentum update every parameter with the same learning rate. The adaptive subgradient method, or Adagrad [7], adapts the learning rate for each component of θ: it divides the learning rate by the square root of the sum of squared past gradients with respect to θ_i up to time step t, with ε a small constant to avoid division by zero. An obvious benefit of Adagrad is that it avoids adjusting the learning rate manually; usually we leave it at the default value of 0.01 and the algorithm tunes it automatically. Another advantage is that Adagrad alleviates the impact of parameters with consistently high gradients, thereby increasing the influence of parameters with infrequent updates. However, one of the weaknesses of Adagrad is that the accumulated sum of squared gradients grows over time until it makes the learning rate extremely small, at which point the whole training process freezes. The Adagrad update rules are:

g_t = ∇_θ J(θ_{t−1})

G_t = Σ_{i=1}^{t} g_i²

θ_t = θ_{t−1} − (η / √(G_t + ε)) · g_t
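A one-step sketch (illustrative, not the authors' code; G is the running sum of squared gradients, initialized to zeros):

import numpy as np

def adagrad_update(theta, G, grad, eta=0.01, eps=1e-8):
    """One Adagrad step: G accumulates squared gradients and scales the step per parameter."""
    G = G + grad ** 2
    return theta - eta * grad / np.sqrt(G + eps), G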

4. Adadelta
Adadelta [8] was invented to deal with Adagrad's continually decreasing learning rate. This method restricts the accumulation of past gradients to a certain window: the authors introduce a running average that depends on the previous average and the current gradient.
The authors also noticed that the units in the update equations for gradient descent, momentum, and Adagrad do not match. To handle this, they propose a second exponentially decaying average, of the squared parameter updates Δθ. Combining these two running averages removes the learning rate from the update rule altogether. An implementation is given by the following rules:

g_t = ∇_θ J(θ_{t−1})

E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g_t²

E[Δθ²]_t = γ·E[Δθ²]_{t−1} + (1 − γ)·Δθ_t²

RMS[g]_t = √(E[g²]_t + ε)

RMS[Δθ]_t = √(E[Δθ²]_t + ε)

θ_t = θ_{t−1} − (RMS[Δθ]_{t−1} / RMS[g]_t) · g_t
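A one-step sketch (illustrative, not the authors' code; Eg2 and Edx2 are the two running averages, initialized to zeros):

import numpy as np

def adadelta_update(theta, Eg2, Edx2, grad, gamma=0.9, eps=1e-6):
    """One Adadelta step: the ratio of the two running averages replaces the learning rate."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * grad   # RMS[dx]_{t-1} / RMS[g]_t
    Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2
    return theta + dx, Eg2, Edx2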

5. RMSProp
RMSProp [9] is another method that enhances Adagrad to address its diminishing learning rate at each time step. The Adadelta and RMSProp algorithms were created independently at around the same time, and the update rule is similar to the first update rule of Adadelta. The decay factor γ can be treated in the same way as momentum; the authors recommend the value γ = 0.9.

g_t = ∇_θ J(θ_{t−1})

E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g_t²

θ_t = θ_{t−1} − (η / √(E[g²]_t + ε)) · g_t = θ_{t−1} − (η / RMS[g]_t) · g_t
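A one-step sketch (illustrative, not the authors' code; Eg2 is the decaying average of squared gradients, initialized to zeros):

import numpy as np

def rmsprop_update(theta, Eg2, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp step: a decaying average of squared gradients rescales the step."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    return theta - eta * grad / np.sqrt(Eg2 + eps), Eg2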

6. Adam
The adaptive moment estimation method, or Adam [10], also adapts the learning rate to each parameter. It stores both an exponentially decaying average of squared gradients, like RMSProp and Adadelta, and an exponentially decaying average of gradients, like momentum. The update rules are:

g_t = ∇_θ J(θ_{t−1})

m_t = β₁·m_{t−1} + (1 − β₁)·g_t

v_t = β₂·v_{t−1} + (1 − β₂)·g_t²

m̂_t = m_t / (1 − β₁ᵗ)

v̂_t = v_t / (1 − β₂ᵗ)

θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)
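A one-step sketch (illustrative, not the authors' code; m and v are the moment estimates, initialized to zeros, and t counts steps starting from 1):

import numpy as np

def adam_update(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v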
3 Overview of neural networks

Figure 1: A fully connected multi-layer neural network with an input layer (four inputs), three hidden layers, and an output layer.

In a typical multi-layer neural network, besides the input and output layers, there may be many hidden layers in between. Figure 1 is an example of a fully connected multi-layer neural network with three hidden layers. The number of layers of a neural network is counted as the number of hidden layers plus one.
A circle node in a layer is called a unit. Units in the input, hidden, and output layers are called input units, hidden units, and output units, respectively. The input of a hidden layer is denoted by z, and the output of each unit is usually denoted by a.
A neural network with L layers has L matrices of coefficients, denoted W^(l) ∈ R^{d^(l−1) × d^(l)}, l = 1, 2, ..., L, where W^(l) represents the connection between layer (l − 1) and layer (l). The biases of layer (l) are denoted b^(l) ∈ R^{d^(l)}. Optimizing the loss function of a neural network means finding the values of these weights W^(l) and biases b^(l). We can define θ^(l) = [b^(l), W^(l)] as the matrix of coefficients of layer (l).
The output of each layer in vector form is

a^(l) = f(W^(l) a^(l−1) + b^(l)) = f(θ^(l) a*^(l−1))

where a^(l−1) is the output of layer (l − 1) (note that a^(0) is the input data), f is the activation function, and a*^(l−1) is a^(l−1) with a column vector of 1s concatenated in front (absorbing the bias into θ^(l)).
In regression problems, the loss function of a neural network is usually the mean squared error (MSE), the average squared mismatch between model responses and actual data:

J(W, b) = (1/N) Σ_{i=1}^{N} ‖y_i − ŷ_i‖²₂

where W, b are the sets of all coefficient matrices and biases of the layers, y_i is the predicted output of the neural network, and ŷ_i is the actual value from the dataset. Equivalently, it can be written as

J(θ) = (1/N) Σ_{i=1}^{N} ‖y_i − ŷ_i‖²₂

where θ is the set of coefficients defined above.


The output of the network is obtained layer by layer: after each weight-matrix multiplication, we apply an activation operation, which is usually a non-linear function (tanh or sigmoid) or a convex one (ReLU, for example). The output of the neural network is therefore a composition of functions of the weights, and in general these outputs are non-convex functions of the weights θ.
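A minimal sketch of the forward pass and the MSE loss defined above (illustrative, not the authors' code; here each weight matrix is stored with shape (d^(l), d^(l−1)), the transpose of the text's convention, so that W @ a is well defined):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, weights, biases):
    """Forward pass a^(l) = f(W^(l) a^(l-1) + b^(l)), applied layer by layer."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

def mse_loss(y_pred, y_true):
    """J = (1/N) * sum_i ||y_i - y_hat_i||^2 over N samples."""
    return np.mean(np.sum((y_pred - y_true) ** 2, axis=-1))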

4 Methodology
4.1 Input data
The production dataset used in this paper is from the Volve field on the Norwegian continental shelf. After the field was decommissioned in 2016, its data were published for educational purposes under the ExxonMobil and Bayerngas licensees. Daily oil, gas, and water production are recorded together with pressure, temperature, and choke size as operating constraints. These production data are the main input variables for the machine learning and deep learning models, which output multiphase production predictions for the future.
In this research, bottom-hole pressure (BHP), tubing head pressure (THP), bottom-hole temperature (BHT), wellhead temperature (WHT), differential pressure in casing (DP), and choke size (CS) in percent are the inputs for training the neural network algorithms.

4.2 Work-flow

Figure 2: Work flow for training machine learning and deep learning algorithms. In the training phase, raw training data passes through feature extraction (feature engineering) to produce X_train, which, together with the training output Y_train, feeds the machine learning and deep learning algorithms; in the validation phase, the model is evaluated on X_validation and Y_validation; in the testing phase, raw test data is transformed by the same feature extraction into X_test to produce the test output Y_test.

Almost every machine learning and deep learning algorithm follows the work-flow above. There are three phases in this process: training, validation, and testing. In the training phase, we need to design two blocks, the feature extractor and the main algorithm. The input data is all the information we know about the data: for an image, it is the value of each pixel; for audio, it is the signal; in this research, the raw input is all available information about the geology, geophysics, petrophysics, static models, and dynamic reservoir models of the Volve field. Raw data is usually not in vector form, does not share the same dimensions, and often contains noise; missing-value problems may be encountered as well. After the feature extraction phase, we obtain extracted features from the raw input, and these features are used to train the machine learning or deep learning algorithms.
In the main algorithm, once we have the extracted features, we use this information along with the (optional) training output to create models appropriate to our dataset and objective. An important point is that when building feature extractors and main algorithms, we must not use any information from the test dataset; we have to assume that the information in the test data has never been seen.
In the testing phase, with new raw input, we use the feature extractor created above to build the corresponding test feature vectors, then use these vectors with the main algorithm learned in the training phase to predict outputs and evaluate the generalization of our models.

Figure 3: Detailed work flow for training the neural networks. Inputs (BHP, THP, BHT, WHT, DP, CS) go through preprocessing (data cleaning, noise removal, data integration, data reduction), a training/validation/test split (70%/15%/15%) with a 5-fold time-series cross-validation strategy in which each fold trains on all earlier data and tests on the following window, feature scaling (normalization, standardization, min-max scaling), and the machine learning stage (initialize, optimize, validate, and test the model) to forecast the oil, gas, and water rates.
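The 5-fold time-series splitting shown in Figure 3 corresponds to scikit-learn's TimeSeriesSplit; a minimal sketch with toy data (illustrative, not the authors' code):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # stand-in for time-ordered production data
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    # each fold trains on all samples preceding its test window
    print(f"fold {fold}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")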


4.3 Experiment setup
To empirically evaluate gradient descent algorithms on training neural networks, we investigated multilayer fully connected neural networks and recurrent neural networks.
The neural network architectures have the same number of hidden layers and the same activation functions. We also use the same initialization method, Xavier initialization [11], and identical hyper-parameters.
We use the Kaggle platform to train the long short-term memory (LSTM) network, a special type of recurrent neural network, and the same Kaggle cloud architecture for training all the networks.
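Xavier (Glorot) uniform initialization [11] draws each weight from U(−limit, limit) with limit = √(6 / (fan_in + fan_out)); a minimal sketch (illustrative, not the authors' code):

import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Xavier/Glorot uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))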
5 Results and discussion
5.1 Results of the multilayer fully connected neural network
The architecture of the multilayer fully connected neural network is as follows:

Table 1: Multilayer neural network architecture

Architecture     Number of units   Activation function   Connection
Hidden layer 1   200               sigmoid               Fully connected
Hidden layer 2   100               sigmoid               Fully connected
Output layer     2                 linear                Fully connected

Table 2: Results of different optimizers on training the MLP

Optimizer                       MSE on blind test   Loss on training   Loss on validation
Gradient Descent                7.56                42.39              5.677
GD with Momentum                0.665               10.189             1.2267
GD with Momentum and Nesterov   0.614               10.33              1.46
Adagrad                         11.159              372.024            35.14
Adadelta                        0.3877              9.575              0.85
RMSProp                         2.684               9.412              1.382
Adam                            0.353               13.422             0.9176
Figure 4: Training loss versus iterations (0-500) for the fully connected neural network, comparing gradient descent, SGD + momentum, SGD + momentum + Nesterov, Adagrad, Adadelta, RMSProp, and Adam.


Figure 5: Validation loss versus iterations (0-500) for the fully connected neural network, with the same optimizers as Figure 4.

Table 2 and Figures 4-5 present the results for the neural network regression problem. They show that SGD with momentum (and Nesterov) and Adam outperform the other optimizers, while Adadelta and RMSProp also give favorable results. Convergence was reached at approximately 100 epochs. Gradient descent without momentum or Nesterov, and Adagrad, are the worst performers on this dataset.
It is also noticeable that when stochastic gradient descent is combined with the momentum and Nesterov methods, performance increases significantly and convergence time is shorter.

5.2 Results of LSTM

The architecture of the LSTM network is shown in Table 3. The data from the Volve field contain noise and sometimes missing values and outliers. To get more useful insight from the data, a cleaning process is applied to obtain cleaner, more organized data. Moreover, the production rate values are usually much larger than the input values of physical parameters such as bottom-hole pressure or temperature. Because of these problems, the dataset was preprocessed to reduce noise and outlier values, and a min-max scaler was applied to rescale the inputs and targets to the range 0 to 1.
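Min-max scaling of this kind is available in scikit-learn; a minimal sketch (the random arrays are stand-ins for the Volve features, not the actual data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-ins for the input features (BHP, THP, BHT, WHT, DP, CS)
X_train, X_test = np.random.rand(100, 6) * 300, np.random.rand(20, 6) * 300

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # fit min/max on training data only
X_test_scaled = scaler.transform(X_test)        # apply the same statistics to test data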

Table 3: LSTM network architecture

Architecture     Type of layer            Number of units
Hidden layer 1   Long short-term memory   1024
Hidden layer 2   Long short-term memory   512
Hidden layer 3   Long short-term memory   512
Hidden layer 4   Time-distributed         256
Output layer     Fully connected          3
Figure 6: Training loss versus iterations (0-500) for the LSTM network, with the same set of optimizers.


Figure 7: Validation loss versus iterations (0-500) for the LSTM network.

Figures 6 and 7 show the results of training the long short-term memory network with the different optimizers. Gradient descent by itself, without momentum or Nesterov, performed poorly. It can also be seen that Adadelta underperformed on this dataset, while RMSProp was unstable at the beginning of the training process but in the end gave the best convergence results. As presented in Figure 6, Adam and SGD with the momentum and Nesterov techniques are the two most reliable optimizers.
6 Conclusion
In this paper, a comparison of different optimizers has been performed on training neural networks with the oil and gas production dataset of the Volve field. The findings show that Adam exhibited the most powerful and robust performance; in terms of overall performance on these datasets, SGD with momentum and Nesterov, and Adam, were the best optimizers. The experiments also reveal that the performance of the optimizers depends on the type and size of the dataset.
In this study, only multilayer fully connected neural networks and recurrent neural networks have been examined. Further work on a wider range of datasets and problems is needed for a more complete comparison.
References
[1] Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. "Convex neural networks." In Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 123-130, 2005.

[2] Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. "Gradient descent finds global minima of deep neural networks." In International Conference on Machine Learning, pp. 1675-1685, 2019.

[3] Meza, J. C. "Steepest descent." Wiley Interdisciplinary Reviews: Computational Statistics 2, no. 6 (2010): 719-722.

[4] Ruder, S. "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747 (2016).

[5] Qian, N. "On the momentum term in gradient descent learning algorithms." Neural Networks 12, no. 1 (1999): 145-151.

[6] Nesterov, Y. "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)." Doklady AN USSR 269 (1983): 543-547.

[7] Duchi, J., Hazan, E., and Singer, Y. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12 (2011): 2121-2159.

[8] Zeiler, M. D. "ADADELTA: An adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).

[9] Tieleman, T., and Hinton, G. "Lecture 6.5 - RMSProp." COURSERA: Neural Networks for Machine Learning, technical report, 2012.

[10] Kingma, D. P., and Ba, J. L. "Adam: A method for stochastic optimization." In International Conference on Learning Representations, pp. 1-13, 2015.

[11] Glorot, X., and Bengio, Y. "Understanding the difficulty of training deep feedforward neural networks." In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
