Gradient Descent
Abstract—The gradient descent algorithm is a type of optimization algorithm that is widely used to solve the model parameters of machine learning algorithms. Through continuous iteration, it obtains the gradient of the objective function, gradually approaches the optimal solution of the objective function, and finally obtains the minimum loss function and the related parameters. The gradient descent algorithm is frequently used in the solution process of logistic regression, which is a common binary classification approach. This paper compares and analyzes, through experiments, the differences between batch gradient descent and its derivative algorithms, the stochastic gradient descent algorithm and the mini-batch gradient descent algorithm, in terms of iteration number and loss function, and provides some suggestions on how to pick the best algorithm for the logistic regression binary classification task in machine learning.

Keywords—Gradient Descent, Logistic Regression, Machine Learning

I. INTRODUCTION

Gradient descent is an iterative optimization algorithm that finds the smallest value of a function. Through continuous iteration, it obtains the gradient of the objective function, gradually approaches the optimal solution of the objective function, and finally obtains the smallest loss function and the related parameters. The conventional gradient descent algorithm must traverse all training samples in every iteration, which leads to a large amount of computation and a slow training effect. As a result, two new gradient descent algorithms were developed: the stochastic gradient descent algorithm and the mini-batch gradient descent algorithm [1].

In a general sense, a logistic regression analysis model is a linear regression analysis model that is commonly used in data mining, economic forecasting, medicine, and other fields. It is essentially a conditional probability-based discrimination model. At present, logistic regression is mainly used in the medical field, for example to explore the main factors leading to a certain disease or to predict the possibility of the occurrence of the disease according to the factors affecting it [2].
II. GRADIENT DESCENT ALGORITHM

The gradient descent method is the most basic method for solving unconstrained optimization problems: it takes the negative gradient direction as the search direction towards the minimum of the objective function [3]. During the training of algorithms such as neural networks and regression analysis, gradient descent is commonly used to minimize the loss function.

A. Gradient descent algorithm derivation

The general linear regression problem can be expressed as:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^{T} x \tag{1}$$
The corresponding loss function is defined as:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{4}$$

where m is the number of training samples, $x^{(i)}$ is the attribute (feature) vector of the i-th sample, and $y^{(i)}$ is the actual value of the i-th sample.
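As a minimal NumPy sketch, Equations (1) and (4) can be computed as follows; the function names and the assumption that the m samples are stacked as rows of a matrix X are ours, not the paper's:

```python
import numpy as np

def hypothesis(theta, X):
    # Equation (1): h_theta(x) = theta^T x, evaluated for every row (sample) of X
    return X @ theta

def loss(theta, X, y):
    # Equation (4): J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    residual = hypothesis(theta, X) - y
    return residual @ residual / (2 * m)
```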
Take the partial derivative of the loss function $J(\theta)$ with respect to $\theta_j$ and set the partial derivative equal to zero:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} = 0 \tag{5}$$
According to Equation (5), the calculation formula for solving the parameter vector is:

$$\theta_j^{*} = \theta_j + \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \tag{6}$$

The iterative algorithm of Equation (6) is also known as the batch gradient descent algorithm.
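A minimal sketch of one batch gradient descent update, under the same assumed NumPy layout as above; the explicit learning rate alpha is our addition for illustration, since Equation (6) folds the step size into the 1/m factor:

```python
def batch_gd_step(theta, X, y, alpha=0.01):
    # Equation (6): every training sample contributes to a single parameter update.
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m   # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) x^(i)
    return theta - alpha * gradient        # step against the gradient
```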
B. Improved gradient descent algorithm

The batch gradient descent algorithm needs to consider all samples in the training process. As a result, this algorithm will select the overall best path in each iteration, but it becomes computationally expensive and slow when the training set is large. The stochastic gradient descent algorithm instead updates the parameters with a single randomly selected sample per iteration. The advantages of batch gradient descent and stochastic gradient descent are thoroughly considered in the mini-batch gradient descent algorithm, in which a required number of samples is chosen from all the samples for training in each iteration. The iterative formula of mini-batch gradient descent is:

$$\theta_j^{*} = \theta_j + \frac{1}{64} \sum_{k=i}^{i+63} \left( y^{(k)} - h_\theta(x^{(k)}) \right) x_j^{(k)} \tag{8}$$
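A sketch of one mini-batch update with the batch size of 64 used in Equation (8); the consecutive-slice convention and the learning rate are assumptions for illustration:

```python
def minibatch_gd_step(theta, X, y, start, batch_size=64, alpha=0.01):
    # Equation (8): only 64 consecutive samples are used for this update,
    # instead of the full training set of m samples.
    Xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]
    gradient = Xb.T @ (Xb @ theta - yb) / len(yb)
    return theta - alpha * gradient
```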
III. LOGISTIC REGRESSION

Logistic regression is a probabilistic regression model, which is a kind of generalized linear regression model [8]-[9]. It mainly reflects a mapping relationship between independent variables and dependent variables with dichotomous properties; that is, the known independent variables can be used to predict the values of a group of discrete variables.

1) Sigmoid function

Logistic regression maps any input to the interval [0, 1] by introducing the Sigmoid function. The regression predicted value is mapped through the Sigmoid function to complete the transformation from a numerical value to a probability value, resulting in the classification prediction. The Sigmoid function is as follows:
$$g(z) = \frac{1}{1 + e^{-z}} \tag{9}$$

It can be seen from the Sigmoid function that the range of the independent variable z is $(-\infty, +\infty)$ and the range of the function value is $[0, 1]$. The relationship between the independent and dependent variables is:

$$g(z)\begin{cases} \to 1, & z \to +\infty \\ = 0.5, & z = 0 \\ \to 0, & z \to -\infty \end{cases} \tag{10}$$
Because of the Sigmoid function's unique relationship between the independent and dependent variables, when making classification predictions we can consider all samples with a function value greater than or equal to 0.5 to be positive, and all samples with a function value less than 0.5 are classified as negative.
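A small sketch of this decision rule; the names and the NumPy layout are our assumptions:

```python
import numpy as np

def sigmoid(z):
    # Equation (9): g(z) = 1 / (1 + e^(-z)), maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    # g(theta^T x) >= 0.5 is classified as positive (1), otherwise negative (0)
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```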
2) Logistic regression solving

The prediction function $h_\theta(x)$ is obtained by mapping the linear regression output through the Sigmoid function:

$$h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{11}$$
Therefore, the probability of obtaining the positive class is $h_\theta(x)$ and the probability of obtaining the negative class is $1 - h_\theta(x)$, which can be written compactly as:

$$p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1-y} \tag{12}$$

So the likelihood function over the m training samples is:

$$L(\theta) = \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1-y^{(i)}} \tag{13}$$
Taking the logarithm on both sides of Equation (13) and applying the gradient descent method, the iterative formula for the parameters is obtained:

$$\theta_j^{*} = \theta_j - \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{14}$$
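Putting Equations (11) and (14) together, a minimal training loop might look like the following sketch; the zero initialization, learning rate, and iteration count are assumed values for illustration, not settings prescribed by the paper:

```python
import numpy as np

def train_logistic_regression(X, y, alpha=0.001, iterations=8000):
    # X: (m, n) feature matrix (include a column of ones if theta_0 is wanted)
    # y: (m,) vector of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # Equation (11)
        gradient = X.T @ (h - y) / m             # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) x^(i)
        theta -= alpha * gradient                # Equation (14) update
    return theta
```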
IV. CASE ANALYSIS

The selected data set is one that predicts whether a student is admitted. The student's two test scores are the input independent variables, and the probability of admission is the output dependent variable. The data set consists of 100 samples; partial data are shown in Table I.

TABLE I. PARTIAL SAMPLE DATA

No.  Score 1            Score 2            Admitted
1    95.8615507093572   38.22527805795094  0
2    75.01365838958247  30.60326323428011  0
3    82.30705337399482  76.48196330235604  1
4    69.36458875970939  97.71869196188608  1
5    39.53833914367223  76.03681085115882  0

where 0 indicates that the student was not admitted and 1 indicates that the student was admitted.

1) Comparison of convergence speed of different gradient descent algorithms with the same number of iterations

Suppose the linear fitting function is:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \tag{15}$$

According to the parameter update formulas of the three different gradient descent algorithms, 8000 iterations were carried out, and the output results are shown in Table II.

TABLE II. PARAMETER VALUES AFTER 8000 ITERATIONS

Algorithm   θ0           θ1          θ2
BGD         -0.00048126  0.00777571  0.00305656
SGD         -0.00049396  0.00782276  0.00268035
Mini-BGD    -0.00048275  0.00770747  0.00294887

Fig. 2. Diagram of loss versus number of iterations under SGD.

Fig. 5. SGD, iterations = 15000, learning rate = 0.000002.

With stochastic gradient descent, the number of iterations is nearly 10,000 times lower.
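The comparison in Table II can be reproduced with a short script along the following lines; the file name admission.csv, the random seed, and the learning rate (taken from the Fig. 5 caption) are our assumptions rather than details given in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta, Xb, yb):
    # Logistic-regression gradient averaged over the chosen samples
    return Xb.T @ (sigmoid(Xb @ theta) - yb) / len(yb)

def run(X, y, mode, alpha=0.000002, iters=8000, batch=64):
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        if mode == "bgd":                          # all samples per update, Equation (6)
            idx = np.arange(len(y))
        elif mode == "sgd":                        # one random sample per update
            idx = rng.integers(0, len(y), size=1)
        else:                                      # mini-batch of 64 samples, Equation (8)
            idx = rng.integers(0, len(y), size=batch)
        theta -= alpha * grad(theta, X[idx], y[idx])
    return theta

# Assumed file layout: one "score1,score2,admitted" row per sample, as in Table I.
data = np.loadtxt("admission.csv", delimiter=",")
X = np.column_stack([np.ones(len(data)), data[:, :2]])   # prepend x0 = 1 for theta_0
y = data[:, 2]
for mode in ("bgd", "sgd", "mini"):
    print(mode, run(X, y, mode))
```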
V. CONCLUSIONS

Different gradient descent methods can be used in realistic applications depending on the data set. The stochastic gradient descent method has more noise points when the training data set is small, and the fluctuation near the optimal point is evident, making it easy to fall into a local optimal solution. When the sample size of the training data set is large, the batch gradient descent algorithm's complexity is high and the convergence rate of the parameter iteration is slow; when the stochastic gradient descent method is used to fit the parameters, however, it typically takes only a few iterations to achieve a better fitting effect. Furthermore, the mini-batch gradient descent algorithm is clearly superior to the other algorithms in terms of loss function and convergence speed.

When using the gradient descent method to optimize parameters, the selection of the learning rate and the number of iterations is particularly important. The loss function will converge slowly if the learning rate is too low; if the learning rate is too high, it is likely to overshoot the global optimum. Likewise, if the number of iterations is too large, the loss function will fall into a local optimal solution; if the number of iterations is too low, the loss function will fluctuate a lot, which is not good for the final result.

ACKNOWLEDGMENT

This research is supported by the foundation project: Xi'an Shiyou University's Graduate Innovation and Practical Ability Training Program.

REFERENCES

[1] Chen X W, Lin X. Big data deep learning: challenges and perspectives[J]. IEEE Access, 2014, 2: 514-525.
[2] Langer D L, Van der Kwast T H, Evans A J, et al. Prostate cancer detection with multiparametric MRI: Logistic regression analysis of quantitative T2, diffusion-weighted imaging, and dynamic contrast-enhanced MRI[J]. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 2009, 30(2): 327-334.
[3] Qu Q, Zhang Y, Eldar Y C, et al. Convolutional phase retrieval via gradient descent[J]. IEEE Transactions on Information Theory, 2019, 66(3): 1785-1821.
[4] Zhou F, Cong G. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization[J]. arXiv preprint arXiv:1708.01012, 2017.
[5] Manogaran G, Lopez D. Health data analytics using scalable logistic regression with stochastic gradient descent[J]. International Journal of Advanced Intelligence Paradigms, 2018, 10(1-2): 118-132.
[6] Huo Z, Huang H. Asynchronous mini-batch gradient descent with variance reduction for non-convex optimization[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2017, 31(1).
[7] Yazan E, Talu M F. Comparison of the stochastic gradient descent based optimization techniques[C]//2017 International Artificial Intelligence and Data Processing Symposium (IDAP). IEEE, 2017: 1-5.
[8] Sur P, Candès E J. A modern maximum-likelihood theory for high-dimensional logistic regression[J]. Proceedings of the National Academy of Sciences, 2019, 116(29): 14516-14525.
[9] Kuha J, Mills C. On group comparisons with logistic regression models[J]. Sociological Methods & Research, 2020, 49(2): 498-525.