Comparison of Gradient Descent Algorithms On Training Neural Networks
Abstract
Gradient descent is one of the most popular optimization techniques, widely used in machine learning and deep learning to find the minimum of a loss function. Its two main computational advantages are ease of implementation and low memory cost compared with second-order methods such as Newton's method. Nonetheless, in neural networks, gradient descent algorithms may fail to find the global optimum because the cost function is non-convex and has many local minima. The main drawback of gradient descent methods is that convergence depends strongly on the choice of learning rate, and the iterates can get stuck at local minimum points.
1 Introduction
In optimization in general and machine learning in particular, we often have to find the minimum (or sometimes the maximum) value of a function. In general, finding the global minimum of a loss function in machine learning is very complicated, and sometimes impossible. Instead, practitioners often look for local minimum points and, to some extent, accept them as the solution of the problem.
The local minimum points of a function are solutions of the equation obtained by setting its derivative to zero (if the function is differentiable). In most cases, however, solving this equation analytically is impossible, whether because of the complexity of the derivative, the high dimensionality of the data points, or the sheer number of data points.
The most common approach is to start from a point that we consider close to the solution and then apply an iterative update procedure, moving toward the point we are looking for until the derivative is close to zero.
Gradient Descent (abbreviated GD) and its variants are by far the most popular optimization algorithms used for training machine learning and deep learning models. They have proved their performance and effectiveness against other approaches, such as second-order methods or population-based methods (PSO or GA), in training a wide range of neural networks, and they are employed in various machine learning and deep learning libraries and frameworks such as Keras, scikit-learn, and TensorFlow.
Non-convex optimization problems are the most common scenario for neural networks, although Bengio et al., 2005 [1] have shown that training neural networks can be cast as a convex optimization problem. Finding global optimum points in these networks is challenging because of the large number of parameters in a high-dimensional space and the zero-gradient problem [2] during training. Hence, selecting an appropriate optimization algorithm is one of the most important tasks in training neural networks, especially deep neural networks.
In this work, we focus on training a multi-layer neural network and a recurrent neural network with different variants of gradient descent algorithms.
2 Gradient descent
Gradient descent overview
Gradient descent is an algorithm used to optimize an objective function J(θ) with respect to the parameters θ ∈ R^d by gradually updating θ in the direction opposite to the gradient of the objective function. The learning rate η determines the size of each step we take toward the minimum value.
The advantages of GD are ease of implementation and low memory cost [3], while its drawback is the convergence time. In machine learning and deep learning, we use this algorithm to update the parameters of a model, i.e., the weights θ of a neural network. The general gradient descent update is given by
\[
\theta = \theta - \eta \nabla J(\theta)
\]
where η is the learning rate, and the negative sign means that we move downhill in search of the minimum.
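For concreteness, the update above can be written in a few lines of code. The following is a minimal NumPy sketch, not the implementation used in our experiments; the names grad_J, theta0, and the stopping tolerance are illustrative.

```python
# Minimal sketch of the plain gradient descent update; grad_J, theta0 and the
# tolerance are illustrative names, not part of the experiment code.
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, n_iters=1000, tol=1e-6):
    """Repeat theta <- theta - eta * grad_J(theta) until the gradient is near zero."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:   # derivative close to zero: stop
            break
        theta = theta - eta * g       # step opposite to the gradient
    return theta

# Example: J(theta) = ||theta||^2 has gradient 2*theta and its minimum at the origin.
theta_star = gradient_descent(lambda th: 2.0 * th, theta0=[3.0, -4.0])
```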
There are three typical variants of the gradient descent method.
For the momentum-based variants, the parameter update takes the form
\[
\theta_t = \theta_{t-1} - v_t
\]
3. Adagrad
Adagrad [7] adapts the learning rate to each parameter by scaling it with the accumulated squares of past gradients:
\[
\begin{aligned}
g_t &= \nabla_\theta J(\theta_{t-1}) \\
G_t &= \sum_{i=1}^{t} g_i^2 \\
\theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t
\end{aligned}
\]
4. Adadelta
Adadelta [8] was introduced to deal with Adagrad's continually decreasing learning rate. The method restricts the accumulation of gradients to a fixed window by using a running average of squared gradients, which depends on the previous average and the current gradient.
The authors also noticed that the units in the update equations for gradient descent, momentum, and Adagrad do not match. To handle this, they propose a second exponentially decaying average, this time of the squared parameter updates ∆θ. Combining these two running averages, the learning rate is removed from the update rule entirely. The update rules are:
\[
\begin{aligned}
g_t &= \nabla_\theta J(\theta_{t-1}) \\
E[g^2]_t &= \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2 \\
E[\Delta\theta^2]_t &= \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma)\, \Delta\theta_t^2 \\
RMS[g]_t &= \sqrt{E[g^2]_t} \\
RMS[\Delta\theta]_t &= \sqrt{E[\Delta\theta^2]_t} \\
\theta_t &= \theta_{t-1} - \frac{RMS[\Delta\theta]_t}{RMS[g]_t}\, g_t
\end{aligned}
\]
5. RMSprop
RMSprop [9] is another method that enhances Adagrad to address its diminishing learning rate at each time step. Adadelta and RMSprop were developed independently at about the same time. The first update rule is the same as the first running average of Adadelta, and the decay rate γ plays a role similar to the momentum coefficient; the author recommends setting γ to 0.9.
\[
\begin{aligned}
g_t &= \nabla_\theta J(\theta_{t-1}) \\
E[g^2]_t &= \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2 \\
\theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t
         = \theta_{t-1} - \frac{\eta}{RMS[g]_t}\, g_t
\end{aligned}
\]
6. Adam
The adaptive moment estimation method, or Adam [10], also adapts the learning rate of each parameter. It keeps an exponentially decaying average of past squared gradients, like RMSprop and Adadelta, as well as an exponentially decaying average of past gradients, like momentum. The update rules are as follows (a short code sketch of these adaptive update rules is given after this list):
\[
\begin{aligned}
g_t &= \nabla_\theta J(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t) \\
\hat{v}_t &= v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
\]
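As a concrete reference for the adaptive rules above, the following is a minimal NumPy sketch of single update steps for Adagrad, RMSprop, and Adam. It is a simplified illustration of the equations, not the implementation used in our experiments; the default hyper-parameter values are common choices.

```python
# Minimal NumPy sketch of one update step for the adaptive methods above.
# Illustrative only; eps guards against division by zero.
import numpy as np

def adagrad_step(theta, g, G, eta=0.01, eps=1e-8):
    G = G + g**2                                   # accumulated squared gradients G_t
    return theta - eta / np.sqrt(G + eps) * g, G

def rmsprop_step(theta, g, Eg2, eta=0.001, gamma=0.9, eps=1e-8):
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2         # running average E[g^2]_t
    return theta - eta / np.sqrt(Eg2 + eps) * g, Eg2

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g**2             # decaying average of squared gradients
    m_hat = m / (1 - beta1**t)                     # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                     # bias-corrected second moment
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```

In a training loop, the state (G, or E[g²], or m, v and the step counter t) would be initialised to zeros and carried forward from one mini-batch to the next.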
3 Overview of neural network
[Figure 1: A fully connected multi-layer neural network with four input units and one output unit.]
In a typical multi-layer neural network, besides the input and output layers, there may be many hidden layers in the middle. Figure 1 shows an example of a fully connected multi-layer neural network with three hidden layers. The number of layers in a neural network is counted as the number of hidden layers plus one.
A circular node in a layer is called a unit. Units in the input layer, hidden layers, and output layer are called input units, hidden units, and output units, respectively. The input of a hidden layer is denoted by z, and the output of each unit is usually denoted by a.
A neural network with L layers has L matrices of coefficients. These matrices are often denoted by W^{(l)} ∈ R^{d^{(l-1)} × d^{(l)}}, l = 1, 2, ..., L, where W^{(l)} represents the connections between layer (l − 1) and layer (l). The biases of layer (l) are denoted by b^{(l)} ∈ R^{d^{(l)}}. Optimizing the loss function of a neural network means finding the values of these weights W^{(l)} and biases b^{(l)}. We define θ^{(l)} = [b^{(l)}, W^{(l)}] as the matrix of coefficients of layer (l).
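As a small illustration of this parameterisation, the sketch below allocates the W^{(l)} and b^{(l)} for given layer sizes; the Gaussian initialisation is an arbitrary choice for the example, not the scheme used in this study.

```python
# Sketch of allocating W(l) in R^{d(l-1) x d(l)} and b(l) in R^{d(l)} for an
# L-layer network; the small Gaussian initialisation is illustrative only.
import numpy as np

def init_params(layer_sizes, seed=0):
    """layer_sizes = [d(0), d(1), ..., d(L)] -> lists of weights and biases."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 0.1, size=(d_in, d_out))
               for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(d_out) for d_out in layer_sizes[1:]]
    return weights, biases

# Example: 6 inputs, three hidden layers of 16 units, 3 outputs.
weights, biases = init_params([6, 16, 16, 16, 3])
```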
The output from each layer, in vector form, is
\[
a^{(l)} = f(W^{(l)} a^{(l-1)} + b^{(l)}),
\]
where a^{(l−1)} is the output of layer (l − 1) (with a^{(0)} being the input data) and f is the activation function. With respect to θ, this can be written as
\[
a^{(l)} = f(\theta^{(l)} a^{*(l-1)}),
\]
where a^{*(l−1)} denotes a^{(l−1)} with a 1 concatenated in front, so that the bias is absorbed into θ^{(l)}.
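A minimal forward-pass sketch of these equations is shown below, assuming the weight and bias lists from the previous sketch (W^{(l)} of shape d^{(l-1)} × d^{(l)}); the input is treated as a row vector so that a @ W realises W^{(l)T} a^{(l-1)}, and the sigmoid activation is only an example (for regression the last layer is often left linear).

```python
# Sketch of the forward pass a(l) = f(W(l)^T a(l-1) + b(l)); a is kept as a
# 1-D array (row vector) so "a @ W" works with W of shape (d(l-1), d(l)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Propagate one input sample x (= a(0)) through all layers."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        z = a @ W + b      # pre-activation of the layer
        a = f(z)           # layer output a(l)
    return a

# Example, reusing the init_params sketch above:
# y_hat = forward(np.ones(6), *init_params([6, 16, 16, 16, 3]))
```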
For regression problems, the loss function of a neural network is usually the mean squared error (MSE), the average squared mismatch between the model responses and the actual data. The loss function can be defined as
\[
J(\mathbf{W}, \mathbf{b}) = \frac{1}{N} \sum_{i=1}^{N} \| y_i - \hat{y}_i \|_2^2
\]
where W, b are the sets of all coefficient matrices and biases of each layer, y_i is the predicted output of the neural network, and \hat{y}_i is the actual value taken from the dataset. Equivalently, in terms of θ,
\[
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \| y_i - \hat{y}_i \|_2^2
\]
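A one-function sketch of this loss, assuming predictions and targets are stacked as (N, output_dim) arrays:

```python
# MSE loss J = (1/N) * sum_i ||y_i - y_hat_i||_2^2 over N samples.
import numpy as np

def mse_loss(Y_pred, Y_true):
    diff = np.asarray(Y_pred) - np.asarray(Y_true)   # shape (N, output_dim)
    return np.mean(np.sum(diff**2, axis=1))          # mean squared L2 error
```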
4 Methodology
4.1 Input data
The production dataset used in this paper is from the Volve field on the Norwegian continental shelf. After the field was decommissioned in 2016, its data were published for educational purposes under the ExxonMobil and Bayerngas licensees. Daily oil, gas, and water production data are recorded, together with pressure, temperature, and choke size as operating constraints. These production data are the main input variables for the machine learning and deep learning models, which give multiphase production predictions as model output.
In this research, bottom hole pressure (BHP), tubing head pressure (THP), bottom hole temperature (BHT), well head temperature (WHT), differential pressure in casing (DP), and choke size (CS) in percentage are the inputs for training the neural network algorithms.
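For illustration, the feature matrix could be assembled roughly as follows; the file name and column names here are hypothetical placeholders, since the actual Volve export schema is not reproduced in this paper.

```python
# Sketch of assembling inputs X and targets Y from the daily production data.
# "volve_production.csv" and the column names are hypothetical placeholders.
import pandas as pd

feature_cols = ["BHP", "THP", "BHT", "WHT", "DP", "CS"]   # inputs listed above
target_cols = ["OIL_RATE", "GAS_RATE", "WATER_RATE"]      # multiphase outputs

df = pd.read_csv("volve_production.csv")
X = df[feature_cols].to_numpy()
Y = df[target_cols].to_numpy()
```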
4.2 Work-flow
[Work-flow diagram (validation phase): the model is evaluated on the validation set (X_validation, Y_validation).]
Almost every machine learning and deep learning algorithm follows the work-flow above. There are three phases in this process: training, validation, and testing. In the training phase, we need to design two blocks (green and pink in the diagram): the feature extractor and the main algorithm.
The raw input data is all the information we know about the data. For an image, it is the value of each pixel; for audio, it is the signal; in this research, the raw input is all the information about the geology, geophysics, petrophysics, static models, and dynamic reservoir models of the Volve field. The raw data is usually not in vector form, does not have consistent dimensions, often contains noise, and may have missing values. After the feature extraction phase, we obtain extracted features from the raw input, which are then used to train the machine learning or deep learning algorithms.
In the main algorithm, once we have the extracted features, we use them along with the (optional) training output to build a model appropriate for our dataset and objective. An important point is that, when building the feature extractor and the main algorithm, we must not use any information from the test data set: we have to assume that the information in the test data has never been seen.
In the testing phase, with new raw input, we use the feature extractor built above to create the corresponding test feature vectors. Using these test feature vectors and the main algorithm learned in the training phase, we predict the output and evaluate the generalization of our models.
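A short scikit-learn sketch of this discipline follows: the scaler (standing in for the feature extractor) is fitted on the training data only and then reused, unchanged, on the test data. The split ratio is illustrative.

```python
# Sketch of the "never look at the test set" rule: fit the scaler on the
# training split only, then apply the same transform to the test split.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# X, Y as in the feature-assembly sketch above; shuffle=False keeps time order.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)

scaler = MinMaxScaler().fit(X_train)   # feature extractor learned from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)    # test data is transformed, never fitted
```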
[Work-flow diagram: the inputs BHP, THP, BHT, WHT, DP, CS and time pass through data cleaning (normalization, standardisation, min-max scaling, noise removal, data integration, data reduction); the data are split into 5 folds for training, validation, and testing; the model is initialized, optimized, validated, and tested before forecasting oil, gas, and water rates.]
[Training loss versus epoch for the neural network regression problem, comparing Gradient descent, SGD + Momentum, SGD + Momentum + Nesterov, Adagrad, Adadelta, RMSProp, and Adam.]
Table 2 and Figures 3-4 present the results for the neural network regression problem: SGD and Adam outperform the other optimizers, while Adadelta and RMSProp also give favorable results. Convergence was reached at approximately 100 epochs. Gradient descent without momentum or Nesterov, and Adagrad, are the worst performers on this dataset.
It is also noticeable that when stochastic gradient descent is combined with the momentum and Nesterov methods, performance increases significantly and the convergence time is shorter.
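As a point of reference, the compared optimizers might be configured in Keras (one of the frameworks named in the introduction) roughly as follows; the learning rates and momentum values shown are common defaults, not necessarily those used in the experiments.

```python
# Sketch of the compared optimizers in Keras; hyper-parameter values are
# illustrative defaults, not the exact settings of the experiments.
from tensorflow.keras import optimizers

candidates = {
    "Gradient descent":          optimizers.SGD(learning_rate=0.01),
    "SGD + Momentum":            optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "SGD + Momentum + Nesterov": optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    "Adagrad":                   optimizers.Adagrad(learning_rate=0.01),
    "Adadelta":                  optimizers.Adadelta(),
    "RMSprop":                   optimizers.RMSprop(learning_rate=0.001),
    "Adam":                      optimizers.Adam(learning_rate=0.001),
}
```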
[Training loss versus epoch for the LSTM problem, comparing Gradient descent, SGD + Momentum, SGD + Momentum + Nesterov, Adagrad, Adadelta, RMSProp, and Adam.]
Figures 5 and 6 show the results of training the long short-term memory network with the different optimizers. Gradient descent without momentum and Nesterov performed poorly. Adadelta also underperformed on these datasets, while RMSProp was unstable at the beginning of training but in the end gave the best convergence results. As presented in Figure 5, Adam and SGD with the momentum and Nesterov techniques are the two most reliable optimizers.
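For context, a small Keras LSTM regressor of the kind compared here might look as follows; the hidden size, input window, and other details are illustrative assumptions, not the exact architecture used in this study.

```python
# Sketch of a small LSTM regressor for the multiphase-rate problem; layer
# sizes and the input window are illustrative assumptions.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_lstm(n_steps, n_features, n_outputs, optimizer):
    model = Sequential([
        LSTM(64, input_shape=(n_steps, n_features)),  # e.g. (time window, 6 features)
        Dense(n_outputs),                             # oil, gas, and water rates
    ])
    model.compile(optimizer=optimizer, loss="mse")    # MSE loss as in Section 3
    return model

# Example: build_lstm(n_steps=30, n_features=6, n_outputs=3, optimizer="adam")
```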
6 Conclusion
In this paper, a comparison between different optimizers has been performed on training neural networks with the oil and gas production dataset of the Volve field. The findings show that Adam optimization exhibited the most powerful and robust performance. In terms of overall performance on these datasets, SGD with momentum and Nesterov, and Adam, were the best optimizers. The experiments also reveal that the performance of the optimizers depends on the type and size of the dataset.
In this study, only multi-layer fully connected neural networks and recurrent neural networks have been considered. Further studies on a wider range of datasets and problems are needed for a more complete comparison.
References
[1] Bengio, Yoshua, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, and
Patrice Marcotte. "Convex neural networks." In Proceedings of the 18th
International Conference on Neural Information Processing Systems, pp.
123-130. 2005.
[2] Du, Simon, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. "Gra-
dient Descent Finds Global Minima of Deep Neural Networks." In Inter-
national Conference on Machine Learning, pp. 1675-1685. 2019.
[5] Qian, N. "On the momentum term in gradient descent learning algo-
rithms." Neural networks: the official journal of the International Neural
Network Society 12, no. 1 (1999): 145.
[6] Nesterov, Yurii. "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)." In Doklady AN USSR, vol. 269, pp. 543-547. 1983.
[7] Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient
methods for online learning and stochastic optimization." Journal of ma-
chine learning research 12, no. Jul (2011): 2121-2159.
[10] Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." In International Conference on Learning Representations, pp. 1-13. 2015.
[11] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256. 2010.