
International Journal of Innovative Science and Research Technology, Volume 5, Issue 10, October 2020, ISSN No: 2456-2165

Comparative Analysis of Optimizers in Deep Neural Networks
Chitra Desai
Professor
Department of Computer Science
National Defence Academy, Pune, India

Abstract:- The role of the optimizer in a deep neural network model impacts the accuracy of the model. Deep learning comes under the umbrella of parametric approaches; however, it tries to relax as many assumptions as possible. The process of obtaining parameters from the data is gradient descent. Gradient descent is the chosen optimizer in neural networks and in many other machine learning algorithms. The classical stochastic gradient descent (SGD) and SGD with momentum, which were used in deep neural networks, had several challenges that adaptive learning optimizers attempted to resolve. Adaptive learning algorithms such as RMSprop, Adagrad and Adam, wherein a learning rate is computed for each parameter, were further developments towards a better optimizer. Adam is often observed as the default choice of optimizer in deep neural networks. The Adam optimizer is a combination of RMSprop and momentum. Though Adam has gained popularity since its introduction, there are claims that report convergence problems with it, and it is also advocated that SGD with momentum gives better performance than Adam. This paper presents a comparative analysis of the SGD, SGD with momentum, RMSprop, Adagrad and Adam optimizers on the Seattle weather dataset. The dataset was processed assuming that Adam, being the default choice preferred by many, would prove to be the better optimizer; however, SGD with momentum proved to be the unsurpassed optimizer for this particular dataset.

Keywords:- Gradient Descent, SGD with Momentum, RMSprop, Adagrad, Adam.

I. INTRODUCTION

Deep learning algorithms involve optimization. Optimization refers to minimizing or maximizing an objective function, which is also called the cost function or loss function. Given a training dataset for a deep neural network, the aim is to find optimal parameters θ that significantly reduce the cost function J(θ). Gradient descent can be used in deep neural networks to find these optimal parameters [1].

Training deep learning models is iterative and requires an initial point to be specified to start with, and it is this initialization that strongly affects most algorithms [2]. The classical Stochastic Gradient Descent (SGD) [3] and SGD with momentum have a proven track record of suitability for learning deep neural networks. Enhancement of existing techniques is inevitable, and so came a set of adaptive learning methods.

Adaptive learning methods were developed over a period of time to claim their supremacy over classical SGD and SGD with momentum. However, several studies [4][5][6] show that SGD with momentum performs comparatively better than the adaptive learning methods, in particular the Adam optimizer, which tends to be a default choice.

The paper aims at analyzing the performance of a deep neural network by applying different optimizers to the chosen dataset. The dataset is divided into a training set and a test set. The deep neural network is trained on the training data and tested on the test data.

The paper does not cover the underlying data preprocessing and deep neural network in depth; the focus here is on minimizing the training and validation loss and observing the test loss while changing optimizers. The optimizers used for the comparative study in this paper are SGD, RMSprop, Adagrad, SGD with momentum and Adam.

II. DATA AND DATA PRE-PROCESSING

The dataset used for the study is the Seattle, US weather dataset [7]. It is a labelled dataset which consists of four feature variables, DATE, PRCP, TMAX and TMIN, and one target variable, RAIN, which is categorical with values {0,1}. The dataset contains 25552 records of daily rainfall patterns from 1st Jan 1948 to 12th Dec 2017. The data is preprocessed to provide input to the deep neural network by checking for duplicates, removing null values and splitting the DATE column into DAY, MON and YEAR. Table 1 shows the sample data after splitting the column. Standardization is applied to scale the data. The data is split into train and test sets with a ratio of 80:20.

TABLE 1. SAMPLE DATA AFTER SPLITTING DATE COLUMN
PRCP TMAX TMIN RAIN YEAR MON DAY
0.47 51 42 1 1948 1 1
0.59 45 36 1 1948 1 2
0.42 45 35 1 1948 1 3
0.31 45 34 1 1948 1 4
0.17 45 32 1 1948 1 5
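
The paper does not list its preprocessing code. A minimal sketch of the steps described in Section II, assuming a pandas DataFrame loaded from the Kaggle CSV in [7] (the file name and column names below are assumptions based on Table 1), could look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Seattle weather data (file name assumed; see [7] for the Kaggle source).
df = pd.read_csv("seattleWeather_1948-2017.csv")

# Basic cleaning: drop duplicate rows and rows with null values.
df = df.drop_duplicates().dropna()

# Split the DATE column into DAY, MON and YEAR, then drop DATE itself.
df["DATE"] = pd.to_datetime(df["DATE"])
df["YEAR"] = df["DATE"].dt.year
df["MON"] = df["DATE"].dt.month
df["DAY"] = df["DATE"].dt.day
df = df.drop(columns=["DATE"])

# Six input features and one binary target, as in Table 1.
X = df[["PRCP", "TMAX", "TMIN", "YEAR", "MON", "DAY"]].values
y = df["RAIN"].astype(int).values

# 80:20 train/test split followed by standardization of the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```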

The elaborated details of data pre-processing for the Seattle weather dataset and the deep neural network architecture presented in the next section can be referred to in [8].

III. DNN ARCHITECTURE DESIGN

The overall structure of the deep neural network, organized into layers to study the impact of different optimizers, is presented here. A deep sequential model is used; its summary is shown in Table 2. There are six input features for the Seattle weather dataset. The shape of the weights depends upon the shape of the input. The target variable is binary, with output either 0 or 1. The ReLU [9] activation function is used at the hidden layers and the sigmoid function is used at the output layer. Weights are initialized using a uniform initializer.

The model is compiled by setting the learning rate to 0.001, which is chosen by observing the learning curve obtained by plotting the objective function as a function of time. As the problem belongs to the class of binary classification, the loss is calculated using cross entropy. The batch size is set to 64 and the number of epochs to 10.

The data is scaled and split into training and test data. The model is initialized and then fit with the different optimizers to analyze the performance with respect to each optimizer under study.

TABLE 2 MODEL SUMMARY
Layer (type)       Output Shape   Param #
dense_1 (Dense)    (None, 6)      42
dense_2 (Dense)    (None, 4)      28
dense_3 (Dense)    (None, 1)      5
Total params: 75
Trainable params: 75
Non-trainable params: 0
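
The paper does not include the model code. A minimal Keras sketch consistent with Table 2 and the settings above (six inputs, Dense layers of width 6, 4 and 1, ReLU/sigmoid activations, uniform weight initialization, cross-entropy loss, learning rate 0.001) might look as follows; the function and variable names are illustrative and the choice of Adam in the compile call is only an example, not the paper's final recommendation:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def build_model():
    # Three Dense layers reproduce the parameter counts in Table 2:
    # 6*6+6 = 42, 6*4+4 = 28, 4*1+1 = 5, giving 75 parameters in total.
    return Sequential([
        Dense(6, activation="relu", kernel_initializer="random_uniform", input_shape=(6,)),
        Dense(4, activation="relu", kernel_initializer="random_uniform"),
        Dense(1, activation="sigmoid", kernel_initializer="random_uniform"),
    ])

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints a summary matching Table 2

# Training would then follow the settings reported in the paper:
# model.fit(X_train, y_train, validation_split=0.2, batch_size=64, epochs=10)
```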
IV. GRADIENT DESCENT

Given a function y = f(x), where x and y are real numbers, the derivative dy/dx of the function f(x) gives the slope of f(x) at a point x. The derivative is useful in minimizing the function as it tells how a small change in the input x makes a corresponding change in the output y. To reduce f(x), we can move x in small steps in the opposite direction of the derivative. This technique is known as gradient descent [10].
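
As a compact restatement of this idea (not an equation numbered in the paper), the one-dimensional gradient descent step with learning rate α can be written as:

```latex
% Basic gradient descent step on a scalar function f(x);
% \alpha is the learning rate controlling the step size.
x \leftarrow x - \alpha \, \frac{\mathrm{d}f(x)}{\mathrm{d}x}
```

The multi-parameter form used in the experiments appears as Eq. (1) in Section V.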
Gradient descent is the way to minimize an objective function by updating the model's parameters in the opposite direction of the gradient. When the derivative of the function f(x) is zero, it provides no information about the direction to move; such a point is known as a critical point [2]. So, a critical point is a point with slope zero. When the critical point is lower than the neighboring points, it is a local minimum. When the critical point is higher than the neighboring points, it is a local maximum. When the critical point has both higher and lower points in its neighborhood, it is called a saddle point.

Gradient descent is effective for training neural networks based on small local moves towards reaching the global solution. In gradient descent the weights are updated incrementally after each epoch. There are limits on the performance of any optimization algorithm designed for neural networks [11]. There are variants of gradient descent [12], and in this paper we discuss the SGD, SGD with momentum, RMSprop, Adagrad and Adam optimizers, analyzing their performance in terms of test accuracy.
V. OPTIMIZERS

For large training sets, Stochastic Gradient Descent (SGD) [13] is considered a good learning algorithm to train neural networks [10]. It updates the parameters using a single or very few training examples, where the new parameter update is given by Eq. (1); here xi and yi are from the training set. It helps to reduce the variance and leads to stable convergence. α is the learning rate.

θ = θ − α∇J(θ; xi, yi)     (1)

If the objective is shallow, SGD may tend to oscillate. This problem can be overcome by adding momentum to SGD. v is the current velocity, and γ ∈ (0,1] determines how many of the previous gradients are incorporated into the current update.

v = γv + α∇J(θ; xi, yi)     (2)

θ = θ − v     (3)

While implementing SGD with momentum, the value of momentum is set to 0.9 during the experiment.
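
As an illustration of Eqs. (2) and (3) outside of Keras, the following toy NumPy sketch applies the momentum update to a simple quadratic objective. The objective, learning rate and iteration count are illustrative assumptions; only the momentum value 0.9 is taken from the paper, and setting γ = 0 recovers plain SGD as in Eq. (1).

```python
import numpy as np

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
def grad_J(theta):
    return theta

theta = np.array([5.0, -3.0])   # initial parameters
v = np.zeros_like(theta)        # velocity, initially zero
alpha, gamma = 0.1, 0.9         # learning rate and momentum (0.9 as in the paper)

for step in range(100):
    g = grad_J(theta)
    v = gamma * v + alpha * g   # Eq. (2): accumulate velocity
    theta = theta - v           # Eq. (3): move against the accumulated gradient

print(theta)  # approaches the minimizer [0, 0]
```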
The Adagrad optimizer [14] adapts all model parameters by scaling them inversely proportionally to the square root of the sum of all historical squared values of the gradient. While training DNN models, if the squared gradients start accumulating from the beginning of training, this may lead to a premature and excessive decrease in the effective learning rate.
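
The paper states this rule only in words; in the standard formulation of [14] (the notation below is ours, not the paper's), the accumulation and per-parameter update are:

```latex
% Adagrad: accumulate squared gradients, then scale the step per parameter.
% g_t is the gradient at step t, \epsilon a small constant for numerical stability.
r_t = r_{t-1} + g_t \odot g_t, \qquad
\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{r_t} + \epsilon} \odot g_t
```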

RMSprop [15] is an improvement over Adagrad obtained by changing the gradient accumulation into an exponentially weighted moving average.

The Adam optimizer [16], short for 'adaptive moments', is considered a variant of RMSprop with momentum, with a few variations. It is computationally efficient and requires very little memory. Adam includes a bias correction, which RMSprop lacks.
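
The paper does not publish its experiment code. A plausible sketch of the comparison it describes (rebuilding the model of Table 2 for each optimizer, training for 10 epochs with batch size 64, then recording test loss, test accuracy and training time) is shown below. The learning rates and momentum follow Tables 3 and 4; everything else, including reusing the data variables from the preprocessing sketch above, is an assumption:

```python
import time
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def build_model():
    # Same architecture as the Table 2 sketch: Dense 6 -> 4 -> 1.
    return Sequential([
        Dense(6, activation="relu", input_shape=(6,)),
        Dense(4, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])

# Optimizer settings mirroring Tables 3 and 4.
optimizers = {
    "SGD (lr=0.01)": tf.keras.optimizers.SGD(learning_rate=0.01),
    "SGD (lr=0.001)": tf.keras.optimizers.SGD(learning_rate=0.001),
    "SGD with momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "Adagrad": tf.keras.optimizers.Adagrad(learning_rate=0.01),
    "RMSprop": tf.keras.optimizers.RMSprop(learning_rate=0.01),
    "Adam": tf.keras.optimizers.Adam(learning_rate=0.01),
}

# X_train, y_train, X_test, y_test come from the preprocessing sketch above.
for name, opt in optimizers.items():
    model = build_model()
    model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
    start = time.time()
    model.fit(X_train, y_train, validation_split=0.2,
              batch_size=64, epochs=10, verbose=0)
    train_time = time.time() - start
    loss, acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"{name}: test loss {loss:.4f}, test accuracy {acc:.4f}, time {train_time:.2f}s")
```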

VI. RESULTS

In this section the results obtained for each optimizer are presented.

A. SGD and SGD with Momentum
Table 3 shows the results for SGD and SGD with momentum.

TABLE 3. SGD AND SGD WITH MOMENTUM

Optimizer            Learning Rate   Momentum   Test Loss   Test Accuracy   Model Training Time (Sec)
SGD                  0.01            -          0.2160      0.9337          2.54
SGD                  0.001           -          0.2007      0.9364          2.34
SGD with Momentum    0.01            0.9        0.0115      0.9992          2.11

It is observed that there is no significant change in model performance when the learning rate is changed from 0.01 to 0.001 in SGD. The time taken to train the model is comparatively less with learning rate 0.001.

In the case of SGD with momentum, a significant increase in test accuracy is observed compared to SGD, and the time taken for training is also the lowest. Figures 1, 2 and 3 show the accuracy and loss with respect to the training and validation data over 10 epochs while the model is being trained.

FIGURE 1 MODEL ACCURACY AND MODEL LOSS FOR SGD WITH LEARNING RATE 0.01
FIGURE 2 MODEL ACCURACY AND MODEL LOSS FOR SGD WITH LEARNING RATE 0.001
FIGURE 3 MODEL ACCURACY AND MODEL LOSS FOR SGD WITH MOMENTUM

B. Adaptive Learning Algorithms
Table 4 shows the results for the adaptive learning algorithms Adagrad, RMSprop and Adam. From all three algorithms it is observed that the model performs best with the Adam optimizer. However, the time taken to train the model with Adagrad is the lowest.

TABLE 4. ADAGRAD, RMSPROP AND ADAM

Optimizer   Learning Rate   Test Loss   Test Accuracy   Model Training Time (Sec)
Adagrad     0.01            0.2667      0.8963          2.12
RMSprop     0.01            0.1407      0.9505          2.28
Adam        0.01            0.0837      0.9771          2.26

Figures 4, 5 and 6 show the accuracy and loss for the training and validation data for the Adagrad, RMSprop and Adam optimizers respectively. The blue line indicates the training data and the orange line the validation data in each of these figures.

FIGURE 4 MODEL ACCURACY AND MODEL LOSS FOR ADAGRAD
FIGURE 5 MODEL ACCURACY AND MODEL LOSS FOR RMSPROP
FIGURE 6 MODEL ACCURACY AND MODEL LOSS FOR ADAM

VII. CONCLUSION

The paper presented the impact of different optimizers on the chosen labelled dataset. The comparison was mainly aimed at checking whether, for this labelled dataset, the default choice of Adam, which is an adaptive learning algorithm, would give the best model performance. However, when SGD with momentum was used, it gave comparatively better results than the adaptive learning algorithms, and in particular the Adam optimizer. The model training time was also the lowest for SGD with momentum compared to the other optimizers.

REFERENCES

[1]. S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai, "Gradient descent finds global minima of deep neural networks," ICML, arXiv:1811.03804, 2018.
[2]. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016.
[3]. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400-407.
[4]. Huang, G., Liu, Z., Weinberger, K. Q., & van der Maaten, L. (2017). Densely Connected Convolutional Networks. In Proceedings of CVPR 2017.
[5]. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.
[6]. Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv preprint arXiv:1705.08292. Retrieved from http://arxiv.org/abs/1705.08292
[7]. https://www.kaggle.com/rtatman/did-it-rain-in-seattle-19482017
[8]. Chitra Desai, "Rainfall Prediction Using Deep Neural Network," unpublished.
[9]. D. Zou, Y. Cao, D. Zhou, and Q. Gu, "Stochastic gradient descent optimizes over-parameterized deep ReLU networks," arXiv preprint arXiv:1811.08888, 2018.
[10]. A. Cauchy. Méthodes générales pour la résolution des systèmes d'équations simultanées. C. R. Acad. Sci. Paris, 25:536-538, 1847.
[11]. Avrim L. Blum and Ronald L. Rivest, "Training a 3-Node Neural Network is NP-Complete", Neural Networks, Vol. 5, pp. 117-127, 1992.
[12]. Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
[13]. Bottou, L. (2012). Stochastic Gradient Descent Tricks. In: Montavon G., Orr G.B., Müller K.-R. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 7700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_25
[14]. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159.
[15]. Geoffrey Hinton, Neural Networks for Machine Learning, online course. https://www.coursera.org/learn/neural-networks/home/welcome
[16]. Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. arXiv:1412.6980v9, 2014.

