Comparative Analysis of Optimizers in Deep Neural Networks
Abstract:- The choice of optimizer in a deep neural network impacts the accuracy of the model. Deep learning comes under the umbrella of parametric approaches; however, it tries to relax as many assumptions as possible. The process of obtaining parameters from the data is gradient descent, and gradient descent is the optimizer of choice in neural networks and in many other machine learning algorithms. The classical stochastic gradient descent (SGD) and SGD with momentum used in deep neural networks had several challenges, which adaptive learning optimizers attempted to resolve. Adaptive learning algorithms such as RMSprop, Adagrad and Adam, wherein a learning rate is computed for each parameter, were further developments towards a better optimizer. The Adam optimizer, a combination of RMSprop and momentum, has recently been observed to be a default choice for deep neural networks. Though Adam has gained popularity since its introduction, there are claims that report convergence problems with the Adam optimizer, and it is also advocated that SGD with momentum gives better performance compared to Adam. This paper presents a comparative analysis of the SGD, SGD with momentum, RMSprop, Adagrad and Adam optimizers on the Seattle weather dataset. The dataset was processed assuming that the Adam optimizer, being the default choice preferred by many, would prove to be the better optimizer; however, SGD with momentum proved to be the unsurpassed optimizer for this particular dataset.

Keywords:- Gradient Descent, SGD with Momentum, RMSprop, Adagrad, Adam.

I. INTRODUCTION

Training deep learning models is iterative and requires an initial point to be specified to start with, and it is this initialization that strongly affects most algorithms [2]. The classical Stochastic Gradient Descent (SGD) [3] and SGD with momentum have a proven track record of their suitability for learning deep neural networks. Enhancement of existing techniques is inevitable, and so came the set of adaptive learning methods.

Adaptive learning methods were developed over a period of time to claim their supremacy over classical SGD and SGD with momentum. However, several studies [4][5][6] show that SGD with momentum proved comparatively better than the adaptive learning methods, in particular the Adam optimizer, which tends to be a default choice.

The paper aims at analyzing the performance of a deep neural network by applying different optimizers to the chosen dataset.

The dataset is divided into a training set and a test set. The deep neural network is trained on the training data and tested on the test data.

The paper does not cover the underlying data pre-processing and deep neural network in detail; the focus here is on minimizing the training and validation loss and observing the test loss while changing optimizers. The optimizers used for the comparative study in this paper are SGD, RMSprop, Adagrad, SGD with momentum and Adam.

II. DATA AND DATA PRE-PROCESSING
The elaborated details of the data pre-processing for the Seattle weather dataset and of the deep neural architecture presented in the next section can be referred to at [8].

III. DNN ARCHITECTURE DESIGN
The overall structure of the deep neural network, organized into layers to study the impact of different optimizers, is presented here. A deep sequential model is used; its summary is shown in Table 2. There are six input features for the Seattle weather dataset. The shape of the weights depends upon the shape of the input. The target variable is binary, with output either 0 or 1. The ReLU [9] activation function is used at the hidden layers and the sigmoid function is used at the output layer. Weights are initialized using a uniform initializer.
The model is compiled by setting the learning rate to 0.001, which is chosen by observing the learning curve obtained by plotting the objective function as a function of time. As the problem belongs to the class of binary classification, the loss is calculated using cross entropy. The batch size is set to 64 and the number of epochs to 10.
The data is scaled and split into training and test data. The model is initialized, and then the model is fit with the different optimizers to analyze the performance with respect to each optimizer under study.
TABLE 2: MODEL SUMMARY

Layer (type)       Output Shape    Param #
dense_1 (Dense)    (None, 6)       42
dense_2 (Dense)    (None, 4)       28
dense_3 (Dense)    (None, 1)       5

Total params: 75
Trainable params: 75
Non-trainable params: 0
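The paper does not list code, but the model summary above matches the Keras Sequential format; the following is a minimal sketch consistent with Table 2 and the compilation settings described in this section, assuming TensorFlow/Keras (the framework is not named in the paper) and using placeholder random data in place of the pre-processed Seattle weather features.

```python
# Minimal sketch of the DNN summarized in Table 2 (assumes TensorFlow/Keras).
# Placeholder random arrays stand in for the pre-processed Seattle weather data
# (6 input features, binary target); they are not the paper's actual data.
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

X_train = np.random.rand(800, 6)
y_train = np.random.randint(0, 2, size=(800, 1))
X_test = np.random.rand(200, 6)
y_test = np.random.randint(0, 2, size=(200, 1))

def build_model(optimizer):
    # Dense(6) -> Dense(4) -> Dense(1): 42 + 28 + 5 = 75 trainable parameters.
    model = Sequential([
        Dense(6, activation="relu", kernel_initializer="random_uniform", input_shape=(6,)),
        Dense(4, activation="relu", kernel_initializer="random_uniform"),
        Dense(1, activation="sigmoid", kernel_initializer="random_uniform"),
    ])
    # Binary classification, so cross-entropy loss.
    model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Learning rate 0.001, batch size 64, 10 epochs, as described above;
# the 20% validation split is an assumption, not stated in the paper.
model = build_model(tf.keras.optimizers.SGD(learning_rate=0.001))
model.fit(X_train, y_train, validation_split=0.2, batch_size=64, epochs=10)
test_loss, test_acc = model.evaluate(X_test, y_test)
```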
IV. GRADIENT DESCENT

Given a function y = f(x), where x and y are real numbers, the derivative dy/dx of the function f(x) gives the slope of f(x) at a point x. The derivative is useful in minimizing the function as it tells how a small change in the input x makes a corresponding change in the output y. To reduce f(x), we can move x in small steps in the opposite direction of the derivative. This technique is known as Gradient Descent [10].

Gradient descent minimizes an objective function by updating the model's parameters in the opposite direction of the gradient. When the derivative of the function f(x) is zero, it provides no information about the direction to move; such a point is known as a critical point [2]. So, a critical point is a point with slope zero. When the critical point is lower than the neighboring points, it is a local minimum. When the critical point is higher than the neighboring points, it is a local maximum. When the critical point has both higher and lower points in its neighborhood, it is called a saddle point.

Gradient descent is effective for training neural networks, making small local moves towards the global solution. In gradient descent the weights are updated incrementally after each epoch. There are limits on the performance of any optimization algorithm designed for neural networks [11]. There are variants of gradient descent [12], and in this paper we discuss the SGD, SGD with momentum, RMSprop, Adagrad and Adam optimizers, analyzing their performance in terms of test accuracy.

V. OPTIMIZERS

For a large training set, Stochastic Gradient Descent (SGD) [13] is considered a good learning algorithm to train neural networks [10]. It updates the parameters using a single or very few training examples, where the new parameter value is given by eq. (1); here xi and yi are drawn from the training set and α is the learning rate. SGD helps to reduce the variance and leads to stable convergence.

θ = θ − α∇J(θ; xi, yi)        (1)

If the objective is shallow, SGD may tend to oscillate. This problem can be overcome by adding momentum to SGD, where v is the current velocity and γ ∈ (0, 1] determines how many of the previous gradients are incorporated into the current update.

v = γv + α∇J(θ; xi, yi)        (2)

θ = θ − v        (3)

While implementing SGD with momentum, the value of momentum is set to 0.9 during the experiment.
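To make eqs. (1)-(3) concrete, here is a small NumPy sketch of the plain SGD step and the momentum step; grad_J, the toy objective and the step size are illustrative placeholders, while the momentum value 0.9 mirrors the setting used in the experiment.

```python
# Sketch of the SGD update (eq. 1) and SGD with momentum (eqs. 2-3).
# grad_J is a placeholder for the gradient of the loss J with respect to the
# parameters theta, evaluated on one training example (or mini-batch) (xi, yi).
import numpy as np

def sgd_step(theta, grad_J, xi, yi, alpha=0.01):
    return theta - alpha * grad_J(theta, xi, yi)              # eq. (1)

def sgd_momentum_step(theta, v, grad_J, xi, yi, alpha=0.01, gamma=0.9):
    v = gamma * v + alpha * grad_J(theta, xi, yi)             # eq. (2)
    theta = theta - v                                         # eq. (3)
    return theta, v

# Toy check on J(theta) = ||theta||^2 / 2, whose gradient is theta itself.
grad_J = lambda theta, xi, yi: theta
theta, v = np.ones(3), np.zeros(3)
for _ in range(500):
    theta, v = sgd_momentum_step(theta, v, grad_J, None, None)
print(theta)   # approaches the minimum at the origin
```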
Adagrad [14] adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient. While training DNN models, this accumulation of squared gradients from the beginning of training can cause a premature and excessive decrease in the effective learning rate.
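A NumPy sketch of the Adagrad accumulation described above (a standard textbook form of the update, shown only for illustration; the step size, epsilon and toy objective are placeholder values, not settings from the paper):

```python
# Adagrad sketch: accumulate squared gradients and scale each parameter's
# step inversely proportional to the square root of that running sum.
import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.01, eps=1e-8):
    accum = accum + grad ** 2                      # running sum of squared gradients
    theta = theta - alpha * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.ones(3)
accum = np.zeros(3)
for _ in range(100):
    grad = theta                                   # gradient of J(theta) = ||theta||^2 / 2
    theta, accum = adagrad_step(theta, grad, accum)
```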
The Adam optimizer [16], short for 'adaptive moments', is considered a variant of RMSprop and momentum with a few variations. It is computationally efficient and requires little memory.
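For illustration, the standard Adam update from [16] can be sketched in NumPy as follows; it combines a momentum-like first-moment estimate with an RMSprop-like second-moment estimate. The hyperparameter values shown are the commonly cited defaults, not settings reported in this paper.

```python
# Adam sketch: exponentially decaying averages of the gradient (first moment)
# and of the squared gradient (second moment), with bias correction.
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.ones(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = theta                                   # gradient of J(theta) = ||theta||^2 / 2
    theta, m, v = adam_step(theta, grad, m, v, t)
```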
A. SGD and SGD with Momentum
Table 3 shows the results for SGD and SGD with momentum.

Optimizer            Learning Rate   Momentum   Test Loss   Test Accuracy   Model Training Time (Sec)
SGD                  0.01            -          0.2160      0.9337          2.54
SGD                  0.001           -          0.2007      0.9364          2.34
SGD with Momentum    0.01            0.9        0.0115      0.9992          2.11
It is observed that there is no significant change in model performance with the change in learning rate from 0.01 to 0.001 in SGD. The time taken to train the model is comparatively less with learning rate 0.001.

In the case of SGD with momentum, a significant increase in test accuracy is observed compared to SGD, and the time taken for training is also the lowest. Figures 1, 2 and 3 show the accuracy and loss with respect to the training and validation data over 10 epochs while the model is being trained.
FIGURE 2: MODEL ACCURACY AND MODEL LOSS WITH LEARNING RATE 0.001

Figures 4, 5 and 6 show the accuracy and loss for the training and validation data for the Adagrad, RMSprop and Adam optimizers respectively. The blue line indicates the training data and the orange line indicates the validation data in each of the figures above.
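As a sketch of how the comparison reported in Table 3 and the figures could be reproduced, the loop below fits the same network with each optimizer under study and records test loss, test accuracy and wall-clock training time. It reuses build_model and the placeholder data arrays from the earlier sketch, and the learning rates are assumptions based on the values stated in this paper.

```python
# Fit the same network with each optimizer under study and record
# test loss, test accuracy and wall-clock training time (cf. Table 3).
import time
import tensorflow as tf

optimizers = {
    "SGD": tf.keras.optimizers.SGD(learning_rate=0.01),
    "SGD with momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "RMSprop": tf.keras.optimizers.RMSprop(learning_rate=0.001),
    "Adagrad": tf.keras.optimizers.Adagrad(learning_rate=0.001),
    "Adam": tf.keras.optimizers.Adam(learning_rate=0.001),
}

for name, opt in optimizers.items():
    model = build_model(opt)                       # fresh weights for each run
    start = time.time()
    model.fit(X_train, y_train, validation_split=0.2,
              batch_size=64, epochs=10, verbose=0)
    elapsed = time.time() - start
    loss, acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"{name}: test loss {loss:.4f}, test accuracy {acc:.4f}, time {elapsed:.2f}s")
```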
REFERENCES