
2021 International Conference on Computer Network, Electronic and Automation (ICCNEA)

Research on the Application of Gradient Descent Algorithm in Machine Learning

Xin Wang, Liting Yan, Qizhi Zhang
1. School of Electrical Engineering, Xi'an Shiyou University
2. Shaanxi Provincial Key Lab of Oil and Gas Well Measurement and Control Technology
Xi'an, China
e-mail: [email protected]

978-1-6654-4486-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICCNEA53019.2021.00014

Abstract—The gradient descent algorithm is a type of optimization algorithm that is widely used to solve for machine learning model parameters. Through continuous iteration, it obtains the gradient of the objective function, gradually approaches the optimal solution of the objective function, and finally obtains the minimum loss function and the related parameters. The gradient descent algorithm is frequently used in the solution process of logistic regression, which is a common binary classification approach. Through experiments, this paper compares and analyzes the differences between batch gradient descent and its derivative algorithms, the stochastic gradient descent algorithm and the mini-batch gradient descent algorithm, in terms of iteration number and loss function, and provides some suggestions on how to pick the best algorithm for the logistic regression binary classification task in machine learning.

Keywords—Gradient Descent, Logistic Regression, Machine Learning

I. INTRODUCTION

Gradient descent is an iterative optimization algorithm that finds the smallest value of a function. Through continuous iteration, it obtains the gradient of the objective function, gradually approaches the optimal solution of the objective function, and finally obtains the smallest loss function and the related parameters. The conventional gradient descent algorithm trains on all of the samples every time, which extends the training time and may affect the final training effect. As a result, two new gradient descent algorithms were developed: the stochastic gradient descent algorithm and the mini-batch gradient descent algorithm [1].

In a general sense, a logistic regression analysis model is a linear regression analysis model that is commonly used in data mining, economic forecasting, medicine, and other fields. It is essentially a conditional probability-based discrimination model. At present, logistic regression is mainly used in the medical field, for example to explore the main factors leading to a certain disease or to predict the probability of the disease occurring according to the factors affecting it [2].

II. GRADIENT DESCENT ALGORITHM

The gradient descent method is the most basic method for solving unconstrained optimization problems: it searches along the negative gradient direction for the minimum of the objective function [3]. During the training of algorithms such as neural networks and regression analysis, gradient descent is commonly used to minimize the loss function.

A. Gradient descent algorithm derivation

The general linear regression problem can be expressed as:

$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i$   (1)
Where: $h_\theta(x)$ is the predicted value of linear regression; $\theta_0$ is the bias; $\theta_1 \dots \theta_n$ are the parameters that need to be solved; $x_1 \dots x_n$ are the sample attributes.

In fact, the predicted value of linear regression does not correspond exactly to the real value; there is often an error between the predicted value and the real value. We call this error term $\varepsilon$:

$y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$   (2)

Where: $i$ is the sample number; $\theta^T x^{(i)}$ is the predicted value; $y^{(i)}$ is the true value. It is assumed that the error terms $\varepsilon$ are independently and identically distributed and follow a Gaussian distribution with a mean of zero and a variance of $\sigma^2$.

Equation (2) is substituted into the Gaussian distribution expression of the error term $\varepsilon$:

$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$   (3)

Taking the logarithm on both sides of Equation (3) finally gives the loss function:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$   (4)

Where: $m$ is the number of training samples; $x^{(i)}$ is the attributes of the $i$-th sample; $y^{(i)}$ is the true value of the $i$-th sample.

Take the partial derivative of the loss function $J(\theta)$ with respect to $\theta_j$ and set the partial derivative equal to zero:

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} = 0$   (5)

According to Equation (5), the iterative formula for solving the parameter vector $\theta$ is:

$\theta_j^{*} = \theta_j + \frac{\alpha}{m} \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$   (6)

The iterative algorithm of Equation (6) is also known as the batch gradient descent algorithm.
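As a rough illustrative sketch only (the paper itself gives no code), the batch update of Equation (6) could be written in Python as below; the names X, y, alpha, and iterations are assumptions chosen for the example:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.001, iterations=8000):
    """Batch gradient descent for linear regression, Equation (6).

    X: (m, n+1) sample matrix whose first column is all ones (x_0 = 1).
    y: (m,) vector of true values.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        predictions = X @ theta                                  # h_theta(x) for every sample
        theta = theta + (alpha / m) * (X.T @ (y - predictions))  # update all theta_j at once
    return theta
```

Every sample contributes to each update, which is why a single iteration becomes expensive when the sample matrix is large.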
B. Improved gradient descent algorithm

The batch gradient descent algorithm needs to consider all samples in the training process. As a result, this algorithm will select the overall best path in each iteration, making it easier to find the best solution. However, if the number of samples and attribute values is huge, the input matrix made up of the samples will be large, resulting in a slow iteration speed and failure to achieve the desired effect.

1) Stochastic gradient descent algorithm

The stochastic gradient descent algorithm randomly selects one sample from all samples for iterative training in each iteration [4][5]. This approach needs less computation in each iteration for large-scale sample data, and the convergence speed is clearly faster than that of the other algorithms, resulting in high performance. The iterative formula of the stochastic gradient descent algorithm is:

$\theta_j^{*} = \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$   (7)
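A minimal sketch of the stochastic update in Equation (7), under the same assumed names as the previous example; one randomly chosen sample drives each update:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.001, iterations=8000, seed=0):
    """Stochastic gradient descent: one random sample per update, Equation (7)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        i = rng.integers(m)                    # pick one sample index at random
        error = y[i] - X[i] @ theta            # y^(i) - h_theta(x^(i))
        theta = theta + alpha * error * X[i]   # update using that single sample
    return theta
```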
2) Mini-batch gradient descent algorithm

Although the convergence speed of the stochastic gradient descent algorithm is fast, an iteration based on a single randomly selected sample cannot guarantee the direction of convergence in each iteration [6][7]. The final outcome could be worse if the randomly chosen sample points are irregular points. The advantages and drawbacks of the batch gradient descent and stochastic gradient descent algorithms are both considered in mini-batch gradient descent, and the required number of samples is chosen from all the samples for training in each iteration. With a batch size of 64, the iterative formula of mini-batch gradient descent is:

$\theta_j^{*} = \theta_j + \frac{\alpha}{64} \sum_{k=i}^{i+63} \left(y^{(k)} - h_\theta(x^{(k)})\right) x_j^{(k)}$   (8)
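A sketch of the mini-batch update in Equation (8); the batch size of 64 follows the formula above, while the random-batch selection strategy and variable names are assumptions for illustration:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.001, iterations=8000, batch_size=64, seed=0):
    """Mini-batch gradient descent: a small random batch per update, Equation (8)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        idx = rng.choice(m, size=min(batch_size, m), replace=False)   # pick one batch
        errors = y[idx] - X[idx] @ theta                              # y^(k) - h_theta(x^(k))
        theta = theta + (alpha / len(idx)) * (X[idx].T @ errors)      # average over the batch
    return theta
```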

III. LOGICAL REGRESSION

Logical regression is a probabilistic regression model, which is a kind of generalized linear regression model [8][9]. It mainly reflects a mapping relationship between independent variables and dependent variables with dichotomous properties; that is, the known independent variables can be used to predict the values of a group of discrete variables.

1) Sigmoid function

Logical regression maps any input to the interval [0, 1] by introducing the Sigmoid function. The regression predicted value is mapped through the Sigmoid function to complete the transformation from a numerical value to a probability value, resulting in the classification prediction. The sigmoid function is as follows:

$g(z) = \frac{1}{1 + e^{-z}}$   (9)

It can be seen from the sigmoid function that the range of the independent variable $z$ is $(-\infty, +\infty)$ and the range of the function value is $[0, 1]$. The relationship between the independent and dependent variables is:

$g(z) \to 1$ as $z \to +\infty$; $g(z) = 0.5$ at $z = 0$; $g(z) \to 0$ as $z \to -\infty$   (10)

Because of the Sigmoid function's unique relationship between the independent and dependent variables, when making classification predictions we can consider all samples with a function value greater than or equal to 0.5 to be positive, and all samples with a function value less than 0.5 are classified as negative.
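As a small illustrative sketch (not code from the paper), the sigmoid mapping of Equation (9) and the 0.5 decision rule can be expressed as:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z)), Equation (9)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(probabilities, threshold=0.5):
    """Label a sample positive (1) when its probability is >= 0.5, otherwise negative (0)."""
    return (np.asarray(probabilities) >= threshold).astype(int)
```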
2) Logical regression solving

Mapping the prediction of linear regression through the sigmoid function, the prediction function $h_\theta(x)$ can be written:

$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$   (11)

Therefore, the probability of obtaining the positive class is $h_\theta(x)$, and the probability of obtaining the negative class is $1 - h_\theta(x)$:

$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1-y}$   (12)

So the likelihood function is:

$L(\theta) = \prod_{i=1}^{m} p(y_i \mid x_i; \theta) = \prod_{i=1}^{m} \left(h_\theta(x_i)\right)^{y_i} \left(1 - h_\theta(x_i)\right)^{1-y_i}$   (13)

The logarithm is taken on both sides of Equation (13), and the iterative formula of the parameter is obtained according to the gradient descent method:

$\theta_j^{*} = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - y_i\right) x_j^{(i)}$   (14)
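A minimal sketch of the batch update in Equation (14) for logistic regression; the names X, y, alpha, and iterations are again assumptions, and X is assumed to carry a leading column of ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, alpha=0.001, iterations=8000):
    """Batch gradient descent for logistic regression, Equation (14).

    X: (m, n+1) matrix with a leading column of ones; y: (m,) vector of 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)                          # h_theta(x) for every sample
        theta = theta - (alpha / m) * (X.T @ (h - y))   # gradient step on the log-likelihood loss
    return theta
```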

The logical regression model predicts not only the actual "category" but also an approximate probability estimate, also known as the "confidence." Furthermore, the classification is modelled directly, so there is no need to assume a data distribution in advance, which avoids problems caused by an incorrect distribution assumption.

IV. CASE ANALYSIS

The selected data set predicts whether a student is admitted. The student's two test scores are the input independent variables, and the probability of admission is the output dependent variable. The data set consists of 100 samples. Partial sample data are shown in Table I.

TABLE I. PARTIAL SAMPLE DATA

No.  Score 1             Score 2             Admitted
1    95.8615507093572    38.22527805795094   0
2    75.01365838958247   30.60326323428011   0
3    82.30705337399482   76.48196330235604   1
4    69.36458875970939   97.71869196188608   1
5    39.53833914367223   76.03681085115882   0

Where: 0 indicates that the student was not admitted; 1 means the student was admitted.

1) Comparison of convergence speed of different gradient descent algorithms with the same number of iterations

Suppose the linear fitting function is:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$   (15)

According to the parameter update formulas of the three different gradient descent algorithms, 8000 iterations were carried out, and the output results are shown in Table II.

TABLE II. GRADIENT DESCENT ALGORITHM PARAMETER VALUES

Algorithm   $\theta_0$    $\theta_1$    $\theta_2$
BGD         -0.00048126   0.00777571    0.00305656
SGD         -0.00049396   0.00782276    0.00268035
Mini-BGD    -0.00048275   0.00770747    0.00294887
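For illustration only, a driver like the following could run such a comparison; the file name exam_scores.txt, its comma-separated layout, and the learning rate value are assumptions, the 8000 iterations follow the text, and the three update functions are the sketches given earlier for Equations (6)-(8):

```python
import numpy as np

# Hypothetical data file: two exam scores and a 0/1 admission label per row.
data = np.loadtxt("exam_scores.txt", delimiter=",")
X = np.hstack([np.ones((data.shape[0], 1)), data[:, :2]])   # prepend x_0 = 1
y = data[:, 2]

# Run the three sketched update rules with the same settings and compare the parameters.
for name, algorithm in [("BGD", batch_gradient_descent),
                        ("SGD", stochastic_gradient_descent),
                        ("Mini-BGD", minibatch_gradient_descent)]:
    theta = algorithm(X, y, alpha=0.001, iterations=8000)
    print(name, theta)
```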

Fig. 1. Diagram of loss and number of iterations under BGD

Fig. 2. Diagram of loss and number of iterations under SGD

Fig. 3. Diagram of loss and number of iterations under Mini-BGD

It can be seen from the graphs that the loss function of the stochastic gradient descent algorithm fluctuates obviously near the optimal point, which indicates that it has more noise points. The convergence rate is slightly faster when the mini-batch gradient descent method is used.

2) The Influence of Iteration Number and Learning Rate on the Gradient Descent Algorithm

It can be seen from the gradient descent method's derivation process that the number of iterations and the learning rate have a significant impact on the final loss. If the learning rate is too low, there will be too many iterations. If the learning rate is too high, the local minimum can be missed, resulting in an inability to converge.

Fig. 4. SGD, Iterations=5000, learning rate=0.001

Fig. 5. SGD, Iterations=15000, learning rate=0.000002

The loss function fluctuates significantly when the number of iterations is small and the learning rate is high, as seen in Figure 4 and Figure 5. The loss function improved markedly after the number of iterations was increased and the learning rate was decreased. However, the stability remains low, and a very small learning rate is needed to achieve it.

Fig. 6. Diagram of loss and number of iterations under Mini-BGD

Fig. 7. Diagram of loss and number of iterations under SGD

In order to reduce the loss function further, the data samples are standardized. It can be seen from the results in Figure 6 and Figure 7 that the loss value can reach 0.22 after data standardization, which is a great improvement compared with the previous results. Stochastic gradient descent is faster but requires more iterations; although mini-batch gradient descent is slower than stochastic gradient descent, its number of iterations is nearly 10,000 times lower.
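As an illustrative sketch only (variable names are assumptions), z-score standardization of the feature columns before training might look like this:

```python
import numpy as np

def standardize(features):
    """Scale each feature column to zero mean and unit variance (z-score standardization)."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)

# Example: standardize the two exam-score columns before running gradient descent.
# X_std = standardize(X[:, 1:])   # leave the leading column of ones untouched
```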

V. CONCLUSIONS

Different gradient descent methods can be used in realistic applications depending on the data sets. The stochastic gradient descent method has more noise points when the training data set sample size is small, and the fluctuation near the optimal point is evident, making it easy to fall into a local optimal solution. When the sample size of the training data set is large, the batch gradient descent algorithm's complexity is high and the convergence rate of the parameter iteration is slow; when the stochastic gradient descent method is used to fit the parameters, however, it typically only takes a few iterations to achieve a better fitting effect. Furthermore, the mini-batch gradient descent algorithm is obviously superior to the other algorithms in terms of loss function and convergence speed.

When using the gradient descent method to optimize parameters, the selection of the learning rate and the number of iterations is particularly important. The loss function will converge slowly if the learning rate is too low; if the learning rate is too high, it is likely to overshoot the global optimum. Correspondingly, if the number of iterations is too large, the loss function will fall into a local optimal solution; if the number of iterations is too low, the loss function will fluctuate a lot, which is not good for the final result.

ACKNOWLEDGMENT

This research is supported by the foundation project: Xi'an Shiyou University's Graduate Innovation and Practical Ability Training Program.

REFERENCES

[1] Chen X W, Lin X. Big data deep learning: challenges and perspectives [J]. IEEE Access, 2014, 2: 514-525.
[2] Langer D L, Van der Kwast T H, Evans A J, et al. Prostate cancer detection with multi-parametric MRI: Logistic regression analysis of quantitative T2, diffusion weighted imaging, and dynamic contrast enhanced MRI [J]. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 2009, 30(2): 327-334.
[3] Qu Q, Zhang Y, Eldar Y C, et al. Convolutional phase retrieval via gradient descent [J]. IEEE Transactions on Information Theory, 2019, 66(3): 1785-1821.
[4] Zhou F, Cong G. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization [J]. arXiv preprint arXiv:1708.01012, 2017.
[5] Manogaran G, Lopez D. Health data analytics using scalable logistic regression with stochastic gradient descent [J]. International Journal of Advanced Intelligence Paradigms, 2018, 10(1-2): 118-132.
[6] Huo Z, Huang H. Asynchronous mini-batch gradient descent with variance reduction for non-convex optimization [C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2017, 31(1).
[7] Yazan E, Talu M F. Comparison of the stochastic gradient descent based optimization techniques [C]//2017 International Artificial Intelligence and Data Processing Symposium (IDAP). IEEE, 2017: 1-5.
[8] Sur P, Candès E J. A modern maximum-likelihood theory for high-dimensional logistic regression [J]. Proceedings of the National Academy of Sciences, 2019, 116(29): 14516-14525.
[9] Kuha J, Mills C. On group comparisons with logistic regression models [J]. Sociological Methods & Research, 2020, 49(2): 498-525.
