0% found this document useful (0 votes)
27 views14 pages

A Survey of Optimization Methods From A Machine Learning Perspective

Uploaded by

Swati Singhvi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views14 pages

A Survey of Optimization Methods From A Machine Learning Perspective

Uploaded by

Swati Singhvi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

3668 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO.

8, AUGUST 2020

A Survey of Optimization Methods From


a Machine Learning Perspective
Shiliang Sun , Zehui Cao, Han Zhu, and Jing Zhao

Abstract—Machine learning develops rapidly, which has made From the perspective of the gradient information in
many theoretical breakthroughs and is widely applied in various optimization, popular optimization methods can be divided
fields. Optimization, as an important part of machine learning, into three categories: 1) first-order optimization methods,
has attracted much attention of researchers. With the exponential
growth of data amount and the increase of model complex- which are represented by the widely used stochastic gradi-
ity, optimization methods in machine learning face more and ent methods; 2) high-order optimization methods, in which
more challenges. A lot of work on solving optimization prob- Newton’s method is a typical example; and 3) heuristic
lems or improving optimization methods in machine learning derivative-free optimization methods, in which the coordinate
has been proposed successively. The systematic retrospect and descent method is a representative.
summary of the optimization methods from the perspective of
machine learning are of great significance, which can offer guid- As the representative of the first-order optimization meth-
ance for both developments of optimization and machine learning ods, the stochastic gradient descent (SGD) method [1], [2],
research. In this article, we first describe the optimization prob- as well as its variants, has been widely used in recent years
lems in machine learning. Then, we introduce the principles and and is evolving at a high speed. However, many users pay
progresses of commonly used optimization methods. Finally, we little attention to the characteristics or application scope of
explore and give some challenges and open problems for the
optimization in machine learning. these methods. They often adopt them as black-box optimizers,
which may limit the functionality of the optimization methods.
Index Terms—Approximate Bayesian inference, deep neu- In this article, we comprehensively introduce the fundamental
ral network (DNN), machine learning, optimization method,
reinforcement learning (RL). optimization methods. Particularly, we systematically explain
their advantages and disadvantages, their application scope,
and the characteristics of their parameters. We hope that the
targeted introduction will help users choose the first-order
I. I NTRODUCTION optimization methods more conveniently and make parameter
ECENTLY, machine learning has grown at a remark- adjustment more reasonable in the learning process.
R able rate, attracting a great number of researchers and
practitioners. It has become one of the most popular research
Compared with the first-order optimization methods, high-
order methods [3]–[5] converge at a faster speed in which the
directions and plays a significant role in many fields, such curvature information makes the search direction more effec-
as machine translation, speech recognition, image recogni- tive. High-order optimizations attract widespread attention but
tion, recommendation systems, etc. Optimization is one of face more challenges. The difficulty in the high-order meth-
the core components of machine-learning. The essence of ods lies in the operation and storage of the inverse matrix
most machine learning algorithms is to build an optimization of the Hessian matrix. To solve this problem, many vari-
model and learn the parameters in the objective function from ants based on Newton’s method have been developed, most
the given data. In the era of immense data, the effectiveness of which try to approximate the Hessian matrix through some
and efficiency of the numerical optimization algorithms dra- techniques [6], [7]. In subsequent studies, the stochastic quasi-
matically influence the popularization and application of the Newton method and its variants are introduced to extend
machine-learning models. In order to promote the development high-order methods to large-scale data [8]–[10].
of machine learning, a series of effective optimization meth- The derivative-free optimization methods [11], [12] are
ods was put forward, which have improved the performance mainly used in the case that the derivative of the objective
and efficiency of the machine-learning methods. function may not exist or be difficult to calculate. There are
two main ideas in the derivative-free optimization methods.
Manuscript received June 16, 2019; revised October 24, 2019; accepted One is adopting a heuristic search based on empirical rules,
October 28, 2019. Date of publication November 18, 2019; date of current
version July 10, 2020. This work was supported in part by NSFC under and the other is fitting the objective function with samples.
Project 61673179, and in part by Shanghai Sailing Program under Grant The derivative-free optimization methods can also work in
17YF1404600. This article was recommended by Associate Editor Y. Xia. conjunction with the gradient-based methods.
(Corresponding author: Jing Zhao.)
The authors are with the School of Computer Science and Technology, Most machine-learning problems, once formulated, can be
East China Normal University, Shanghai 200062, China (e-mail: solved as the optimization problems. Optimization in the
[email protected]; [email protected]). fields of deep neural network (DNN), reinforcement learn-
This article has supplementary downloadable material available at
http://ieeexplore.ieee.org, provided by the author. ing (RL), meta learning, variational inference- and Markov
Digital Object Identifier 10.1109/TCYB.2019.2950779 chain Monte Carlo (MCMC) encounters different difficulties
2168-2267 
c 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: A SURVEY OF OPTIMIZATION METHODS FROM MACHINE LEARNING PERSPECTIVE 3669

and challenges. The optimization methods developed in the to handle the variational inference. Thus, the stochastic vari-
specific machine-learning fields are different, which can be ational inference was proposed, which introduced natural
inspiring to the development of general optimization methods. gradients and extended the variational inference to large-scale
DNNs have shown great success in pattern recognition data [58].
and machine learning. There are two popular NNs, that Optimization methods have a significative influence
is, convolutional neural networks (CNNs) [13] and recur- on various fields of machine learning. For example,
rent neural networks (RNNs), which play important roles in Pajarinen et al. [5] proposed the transformer network using
various fields of machine learning. CNNs are feedforward neu- Adam optimization [33], which is applied to machine trans-
ral networks with convolution calculation. CNNs have been lation tasks. Ledig et al. [59] proposed the super-resolution
successfully used in many fields, such as image process- generative adversarial network for image super resolution,
ing [14], [15]; video processing [16]; and natural language which is also optimized by Adam. Wu et al. [60] proposed
processing (NLP) [17], [18]. RNNs are a kind of sequential actor–critic using trust-region optimization to solve the deep
model and very active in NLP [19]–[22]. Besides, RNNs are RL on Atari games as well as the MuJoCo environments.
also popular in the fields of image processing [23], [24] and The stochastic optimization method can also be applied to
video processing [25]. In the field of constrained optimization, MCMC sampling to improve efficiency. In this kind of appli-
RNNs can achieve excellent results [26]–[29]. In these works, cation, stochastic gradient Hamiltonian Monte Carlo (HMC)
the parameters of weights in RNNs can be learned by analyt- is a representative method [61] where the stochastic gradient
ical methods, and these methods can find the optimal solution accelerates the step of gradient update when handling large-
according to the trajectory of the state solution. scale samples. The noise introduced by the stochastic gradient
The stochastic gradient-based algorithms are widely used can be characterized by introducing the Gaussian noise and
in DNNs [30]–[33]. However, various problems are emerg- friction terms. In addition, the deviation caused by HMC dis-
ing when employing stochastic gradient-based algorithms. For cretization can be eliminated by the friction term, and thus the
example, the learning rate will be oscillating in the later train- Metropolis–Hasting step can be omitted. The hyper parameter
ing stage of some adaptive methods [34], [35], which may lead settings in the HMC will affect the performance of the model.
to the problem of nonconvergence. Thus, further optimization There are some efficient ways to automatically adjust the
algorithms based on variance reduction were proposed to hyperparameters and improve the performance of the sampler.
improve the convergence rate [36], [37]. Moreover, combining The development of optimization brings a lot of contri-
the SGD and the characteristics of its variants is a possible butions to the progress of machine learning. However, there
direction to improve the optimization. Especially, switching are still many challenges and open problems for optimization
an adaptive algorithm to the SGD method can improve the problems in machine learning.
accuracy and convergence speed of the algorithm [38]. 1) How to improve the optimization performance with
RL is a branch of machine learning, for which an agent insufficient data in DNNs is a tricky problem. If there
interacts with the environment by trial-and-error mechanism are not enough samples in the training of DNNs, it
and learns an optimal policy by maximizing cumulative is prone to cause the problem of high variances and
rewards [39]. Deep RL combines the RL and deep learning overfitting [62]. In addition, nonconvex optimization has
techniques and enables the RL agent to have a good per- been one of the difficulties in DNNs, which makes the
ception of its environment. Recent research has shown that optimization tend to obtain a locally optimal solution
deep learning can be applied to learn a useful representation rather than the global optimal solution.
for the RL problems [40]–[44]. The stochastic optimization 2) For sequential models, the samples are often trun-
algorithms are commonly used in RL and deep RL models. cated by batches when the sequence is too long, which
Meta learning [45], [46] has recently become very popular will cause deviation. How to analyze the deviation of
in the field of machine learning. The goal of meta learning stochastic optimization in this case and correct it is vital.
is to design a model that can efficiently adapt to the new 3) The stochastic variational inference is graceful and prac-
environment with as few samples as possible. The application tical, and it is probably a good choice to develop
of meta learning in supervised learning can solve few-shot methods of applying high-order gradient information to
learning problems [47]. In general, the meta learning meth- stochastic variational inference.
ods can be summarized into the following three types [48]: 4) It may be a great idea to introduce the stochastic tech-
1) metric-based methods [49]–[52]; 2) model-based meth- nique to the conjugate gradient method to obtain an ele-
ods [53], [54]; and 3) optimization-based methods [47], [55], gant and powerful optimization algorithm. The detailed
[56]. We will describe the details of optimization-based meta techniques to make improvements in the stochastic
learning methods in the subsequent sections. conjugate gradient is an interesting and challenging
Variational inference is a useful approximation method problem.
which aims to approximate the posterior distributions in The purpose of this article is to summarize and analyze the
the Bayesian machine learning. It can be considered as an classical and modern optimization methods from a machine-
optimization problem. For example, mean-field variational learning perspective. The remainder of this article is organized
inference uses a coordinate ascent to solve this optimization as follows. Section II summarizes the machine-learning prob-
problem [57]. As the amount of data increase continuously, lems from the perspective of optimization. Section III dis-
it is not friendly to use the traditional optimization method cusses the classical optimization algorithms and their latest

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
3670 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO. 8, AUGUST 2020

developments in machine learning. Particularly, the recent B. Optimization Problems in Semisupervised Learning
popular optimization methods including the first- and second- SSL is the method between supervised and unsupervised
order optimization algorithms are emphatically introduced. learning, which incorporates labeled data and unlabeled data
Section IV presents the challenges and open problems in during the training process. It can deal with different tasks,
the optimization methods. In Section V, we conclude this including classification tasks [70], [71]; regression tasks [72];
article. We introduce the developments and applications of clustering tasks [73], [74]; and dimensionality reduction
optimization methods in some specific machine-learning fields tasks [75], [76]. There are different kinds of SSL meth-
in the supplementary material. ods, including self-training, generative models, semisupervised
support vector machines (S3VM) [77], graph-based methods,
multilearning methods, and others. We take S3VM as an
II. M ACHINE L EARNING F ORMULATED AS O PTIMIZATION example to introduce the optimization in SSL.
Almost all machine-learning algorithms can be formulated S3VM is a learning model that can deal with binary
as an optimization problem to find the extremum of an objec- classification problems and only part of the training set in
tive function. Building models and constructing reasonable this problem is labeled. Let Dl be the labeled data which
objective functions are the first step in the machine-learning can be represented as Dl = {{x1 , y1 }, {x2 , y2 }, . . . , {xl , yl }},
methods. With the determined objective function, appropriate and Du be the unlabeled data which can be represented as
numerical or analytical optimization methods are usually used Du = {xl+1 , xl+2 , . . . , xN } with N = l + u. In order to use
to solve the optimization problem. the information of unlabeled data, an additional constraint on
According to the modeling purpose and the problem to be the unlabeled data is added to the original objective of SVM
solved, the machine-learning algorithms can be divided into with slack variables ζ i . Specifically, defining  j as the mis-
supervised learning, semisupervised learning (SSL), unsuper- classification error of the unlabeled instance if its true label is
vised learning, and RL. Particularly, supervised learning is positive and zj as the misclassification error of the unlabeled
further divided into the classification problem (e.g., sentence instance
 if its true label is negative. The constraint means to
classification [17], [63]; image classification [64]–[66], etc.) make N j=l+1 min( i , ζ i ) as small as possible. Thus, an S3VM
and regression problem; unsupervised learning is divided into problem can be described as
clustering and dimension reduction [67]–[69], among others. ⎡ ⎤
l N
 
min ω + C⎣ ζi + min  i , zj ⎦
A. Optimization Problems in Supervised Learning i=1 j=l+1
 
For supervised learning, the goal is to find an optimal map- subject to yi w · xi + b + ζ i ≥ 1, ζ ≥ 0, i = 1, . . . , l
ping function f (x) to minimize the loss function of the training w · xj + b +  j ≥ 1,  ≥ 0, j = l + 1, . . . , N
samples  
− w · xj + b + zj ≥ 1, zj ≥ 0 (3)

1   i  i 
N
where C is a penalty coefficient. The optimization problem
min L y ,f x ,θ (1) in S3VM is a mixed-integer problem which is difficult to
θ N
i=1 deal with [78]. There are various methods summarized in [79]
where N is the number of training samples, θ is the param- to deal with this problem, such as the branch-and-bound
eter of the mapping function, xi is the feature vector of the techniques [80] and convex relaxation methods [81].
ith samples, yi is the corresponding label, and L is the loss
function. C. Optimization Problems in Unsupervised Learning
There are many kinds of loss functions in supervised learn- Clustering algorithms [67], [82]–[84] divide a group of
ing, such as the square of Euclidean distance, cross-entropy, samples into multiple clusters, ensuring that the differences
contrast loss, hinge loss, information gain- and so on. For between the samples in the same cluster are as small as pos-
the regression problems, the simplest way is using the square sible, and samples in different clusters are as different as
of Euclidean distance as the loss function, that is, minimiz- possible. The optimization problem for the k-means cluster-
ing square errors on training samples. But the generalization ing algorithm is formulated as minimizing the following loss
performance of this kind of empirical loss is not necessarily function:
good. Another typical form is structured risk minimization, 
K 
whose representative method is the support vector machine. On min x − μk 22 (4)
S
the objective function, regularization items are usually added k=1 x∈Sk
to alleviate overfitting, for example, in terms of L2 -norm
where K is the number of clusters, x is the feature vector of
samples, μk is the center of cluster k, and Sk is the sample set
1   i  i 
N
min L y , f x , θ + λθ 22 (2) of cluster k. The implication of this objective function is to
θ N make the sum of variances of all clusters as small as possible.
i=1
The dimensionality reduction algorithm ensures that the
where λ is the compromise parameter, which can be deter- original information from data is retained as much as pos-
mined through cross-validation. sible after projecting them into the low-dimensional space.

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: A SURVEY OF OPTIMIZATION METHODS FROM MACHINE LEARNING PERSPECTIVE 3671

The principal component analysis (PCA) [85]–[87] is a typical a long history and are constantly evolving. They are pro-
algorithm of dimensionality reduction methods. The objective gressing in many practical applications and have achieved
of PCA is formulated to minimize the reconstruction error as good performance. Here, we mainly introduce the first-order
 optimization methods. The high-order and derivative-free

N
2 
D
min x i − xi 2
where xi = zij ej , D  D (5) optimization methods are presented in the supplementary
i=1 j=1 material. Besides these fundamental methods, preconditioning
is a useful technique for optimization methods. Applying rea-
where N represents the number of samples, xi is a sonable preconditioning can reduce the number of iterations
D-dimensional vector, and xi is the reconstruction of xi . zi = and obtain better spectral characteristics. These technologies
{zi1 , . . . , ziD } is the projection of xi in D -dimensional coordi- have been widely used in practice. For the convenience of
nates. ej is the standard orthogonal basis under D -dimensional researchers, we summarize the existing common optimization
coordinates. toolkits in a table at the end of this section.
Another common optimization goal in the probabilistic
models is to find an optimal probability density function
A. First-Order Methods
of p(x), which maximizes the logarithmic-likelihood function
(MLE) of the training samples In the field of machine learning, the most commonly used
first-order optimization methods are mainly based on gradient

N
  descent. In this section, we introduce some of the representative
max ln p xi ; θ . (6)
algorithms along with the development of the gradient descent
i=1
methods. At the same time, the classical alternating direction
In the framework of the Bayesian methods, some prior distri- method of multipliers (ADMMs) and the Frank–Wolfe method
butions are often assumed on parameter θ , which also has the in numerical optimization are also introduced.
effect of alleviating overfitting. 1) Gradient Descent: The gradient descent method is the
earliest and most common optimization method. The idea of
D. Optimization Problems in Reinforcement Learning the gradient descent method is that variables update iteratively
RL [42], [88], [89], unlike supervised learning and unsu- in the (opposite) direction of the gradients of the objective
pervised learning, aims to find an optimal strategy function, function. The update is performed to gradually converge to
whose output varies with the environment. For a deterministic the optimal value of the objective function. The learning rate
strategy, the mapping function from state s to action a is the η determines the step size in each iteration, and thus influences
learning target. For an uncertain strategy, the probability of the number of iterations to reach the optimal value [90].
executing each action is the learning target. In each state, the The steepest descent algorithm is a widely known algo-
action is determined by a = π(s), where π(s) is the policy rithm. The idea is to select an appropriate search direction
function. in each iteration so that the value of the objective function
The optimization problem in RL can be formulated as maxi- minimizes the fastest. Gradient descent and steepest descent
mizing the cumulative return after executing a series of actions are not the same because the direction of the negative gradi-
which are determined by the policy function ent does not always descend fastest. Gradient descent is an
example of using the Euclidean norm in steepest descent [91].

 Next, we give the formal expression of the gradient descent
max Vπ (s) where Vπ (s) = E γ k rt+k |St = s (7)
π method. For a linear regression model, we assume that fθ (x)
k=0
is the function to be learned, L(θ ) is the loss function, and θ
where Vπ (s) is the value function of state s under policy π , r is the parameter to be optimized. The goal is to minimize the
is the reward, and γ ∈ [0, 1] is the discount factor. loss function with
1  i
N
 2
E. Optimization for Machine Learning L(θ ) = y − fθ x i (8)
2N
Overall, the main steps of machine learning are to build a i=1
model hypothesis, define the objective function, and solve the 
D
maximum or minimum of the objective function to determine fθ (x) = θj xj (9)
the parameters of the model. In these three vital steps, the first j=1
two steps are the modeling problems of machine learning, and where N is the number of training samples, D is the num-
the third step is to solve the desired model by optimization ber of input features, xi is an independent variable with
methods. xi = (x1i , . . . , xD
i ) for i = 1, . . . , N, and yi is the target output.

The gradient descent alternates the following two steps until


III. F UNDAMENTAL O PTIMIZATION M ETHODS it converges.
AND P ROGRESSES 1) Derive L(θ ) for θj to obtain the gradient corresponding
From the perspective of gradient information, the funda- to each θj
mental optimization methods can be divided into first-order
1  i
N
∂L(θ )  
optimization methods, high-order optimization methods, and =− y − fθ xi xji . (10)
derivative-free optimization methods. These methods have ∂θj N
i=1

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
3672 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO. 8, AUGUST 2020

2) Update each θj in the negative gradient direction to hundreds of thousands. Therefore, compared with batch meth-
minimize the risk function ods, SGD can effectively reduce the computational complexity
and accelerate convergence.
1  i
N
 
θj = θj + η · y − fθ xi xji . (11) However, one problem in SGD is that the gradient direction
N oscillates because of additional noise introduced by random
i=1
The gradient descent method is simple to implement. The selection, and the search process is blind in the solution space.
solution is global optimal when the objective function is con- Unlike batch gradient descent which always moves toward the
vex. It often converges at a slower speed if the variable is optimal value along the negative direction of the gradient, the
closer to the optimal solution, and more careful iterations need variance of gradients in SGD is large and the movement direc-
to be performed. tion in SGD is biased. So a compromise between the two
In the above linear regression example, note that all of the methods, the mini-batch gradient descent method (MSGD),
training data are used in each iteration step, so the gradient was proposed [1].
descent method is also called the batch gradient descent. If The MSGD uses b independent identically distributed sam-
the number of samples is N and the dimension of x is D, the ples (b is generally in 50 to 256 [90]) as the sample sets to
computational complexity for each iteration will be O(ND). In update the parameters in each iteration. It reduces the vari-
order to mitigate the cost of computation, some parallelization ance of the gradients and makes the convergence more stable,
methods were proposed [92], [93]. However, the cost is still which helps to improve the optimization speed. For brevity,
hard to accept when dealing with large-scale data. Thus, the we will call MSGD as SGD in the following sections.
SGD method emerges. As a common feature of stochastic optimization, SGD has
2) Stochastic Gradient Descent: Since the batch gradient a better chance of finding the global optimal solution for
descent has high computational complexity in each iteration complex problems. The deterministic gradient in batch gra-
for large-scale data and does not allow online update, SGD was dient descent may cause the objective function to fall into
proposed [1]. The idea of SGD is using one sample randomly to a local minimum for the multimodal problem. The fluctua-
update the gradient per iteration, instead of directly calculating tion in the SGD helps the objective function jump to another
the exact value of the gradient. The stochastic gradient is possible minimum. However, the fluctuation in SGD always
an unbiased estimate of the real gradient [1]. The cost of exists, which may more or less slow down the process of
the SGD algorithm is independent of sample numbers and convergence.
can achieve sublinear convergence speed [37]. SGD reduces There are still many details to be noted about the use of SGD
the update time to deal with large numbers of samples and in the concrete optimization process [90], such as the choice of
removes a certain amount of computational redundancy, which a proper learning rate. A too small learning rate will result in
significantly accelerates the calculation. In the strong convex a slower convergence rate, while a too large learning rate will
problem, SGD can achieve the optimal convergence speed [36], hinder convergence, making loss function fluctuate at the min-
[94]–[96]. Meanwhile, it overcomes the disadvantage of batch imum. One way to solve this problem is to set up a predefined
gradient descent that cannot be used for online learning. list of learning rates or a certain threshold and adjust the learn-
The loss function (8) can be written as the following ing rate during the learning process [1], [97]. However, these
equation: lists or thresholds need to be defined in advance according
to the characteristics of the dataset. It is also inappropriate
1  1 i 1 
N N
 2    to use the same learning rate for all parameters. If data are
L(θ ) = y − fθ x i = cost θ, xi , yi . (12)
N 2 N sparse and features occur at different frequencies, it is not
i=1 i=1
expected to update the corresponding variables with the same
If a random sample i is selected in SGD, the loss function will learning rate. A higher learning rate is often expected for less
be L∗ (θ ) frequently occurring features [30], [33].
   1   2 Besides the learning rate, how to avoid the objective func-
L∗ (θ ) = cost θ, xi , yi = yi − fθ xi . (13)
2 tion being trapped in infinite numbers of the local minimum
The gradient update in SGD uses the random sample i rather is a common challenge. Some work has proved that this dif-
than all samples in each iteration ficulty does not come from the local minimum values, but
   comes from the “saddle point” [98]. The slope of a sad-
θ  = θ + η yi − fθ xi xi . (14)
dle point is positive in one direction and negative in another
Since SGD uses only one sample per iteration, the com- direction, and gradient values in all directions are zero. It
putational complexity for each iteration is O(D), where D is is an important problem for SGD to escape from these
the number of features. The update rate for each iteration of points. Some research about escaping from saddle points was
SGD is much faster than that of batch gradient descent when developed [99], [100].
the number of samples N is large. SGD increases the overall 3) Nesterov Accelerated Gradient Descent: Although SGD
optimization efficiency at the expense of more iterations, but is popular and widely used, its learning process is sometimes
the increased iteration number is insignificant compared with prolonged. How to adjust the learning rate, how to speed up
the high computational complexity caused by large numbers of the convergence, and how to prevent from being trapped at
samples. It is possible to use only thousands of samples over- a local minimum during the search are worthwhile research
all to obtain the optimal solution even when the sample size is directions.

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: A SURVEY OF OPTIMIZATION METHODS FROM MACHINE LEARNING PERSPECTIVE 3673

Much work is presented to improve SGD. For example, the information compared with the traditional momentum method.
momentum idea was proposed to be applied in SGD [101]. Note that Nesterov momentum improves the convergence from
The concept of momentum is derived from the mechanics O(1/k) (after k steps) to O(1/k2 ), when not using stochastic
of physics, which simulates the inertia of objects. The idea optimization [104].
of applying momentum in SGD is to preserve the influence of Another issue worth considering is how to determine the
the previous update direction on the next iteration to a certain size of the learning rate. It is more likely to occur with
degree. The momentum method can speed up the convergence the oscillation if the search is closer to the optimal point.
when dealing with high curvature, small but consistent gra- Thus, the learning rate should be adjusted. The learning rate
dients, or noisy gradients [102]. The momentum algorithm decay factor d is commonly used in the SGD’s momentum
introduces the variable v as the speed, which represents the method, which makes the learning rate decrease with the
direction and the rate of the parameter’s movement in the iteration period [105]. The formula of the learning rate decay
parameter space. The speed is set as the average exponential is defined as
decay of the negative gradient. η0
In the gradient descent method, the speed update is v = ηt = (17)
1+d·t
η · (−(∂L(θ ))/∂(θ )) each time. Using the momentum algo-
rithm, the amount of the update v is not just the amount of where ηt is the learning rate at the t-th iteration, η0 is the
gradient descent calculated by η · (−(∂L(θ ))/∂(θ )). It also original learning rate, and d is a decimal in [0, 1]. As can
takes into account the friction factor, which is represented as be seen from the formula, the smaller d is, the slower the
the previous update vold multiplied by a momentum factor decay of the learning rate will be. The learning rate remains
ranging between [0, 1]. Generally, the mass of the object is unchanged when d = 0 and the learning rate decays fastest
set to 1. The formulation is expressed as when d = 1.
 4) Adaptive Learning Rate Method: The manually regu-
∂L(θ )
v=η· − + vold · mtm (15) lated learning rate greatly influences the effect of the SGD
∂(θ ) method. It is a tricky problem for setting an appropriate value
where mtm is the momentum factor. If the current gradient of the learning rate [30], [33], [106]. Some adaptive methods
is parallel to the previous speed vold , the previous speed can were proposed to adjust the learning rate automatically. These
speed up this search. The proper momentum plays a role in methods are free of parameter adjustment, fast to converge,
accelerating the convergence when the learning rate is small. If and often achieve not bad results. They are widely used in
the derivative decays to 0, it will continue to update v to reach DNNs to deal with optimization problems.
equilibrium and will be attenuated by friction. It is beneficial to The most straightforward improvement to SGD is
escape from the local minimum in the training process so that AdaGrad [30]. AdaGrad adjusts the learning rate dynamically
the search process can converge more quickly [101], [103]. based on the historical gradient in some previous iterations.
If the current gradient is opposite than the previous update The update formulas are as follows:
vold , the value vold will have a deceleration effect on this ⎧ ∂L(θt )
search. ⎨ gt = ∂θ

The momentum method with a proper momentum factor t
Vt = i=1 (gi ) + 
2 (18)
plays a positive role in reducing the oscillation of convergence ⎪
⎩ gt
θt+1 = θt − η Vt
when the learning rate is large. How to select the proper size
of the momentum factor is also a problem. If the momentum where gt is the gradient of parameter θ at iteration t, Vt is
factor is small, it is hard to obtain the effect of improving con- the accumulate historical gradient of parameter θ at iteration
vergence speed. If the momentum factor is large, the current t, and θt is the value of parameter θ at iteration t.
point may jump out of the optimal value point. Many exper- The difference between AdaGrad and the gradient descent
iments have empirically verified the most appropriate setting is that during the parameter update process, the learning rate is
for the momentum factor is 0.9 [90]. no longer fixed, but is computed using all historical gradients
The Nesterov accelerated gradient descent (NAG) accumulated up to this iteration. One main benefit of AdaGrad
makes further improvement over the traditional momen- is that it eliminates the need to tune the learning rate manu-
tum method [103], [104]. In Nesterov momentum, the ally. Most implementations use a default value of 0.01 for
momentum vold · mtm is added to θ , denoted as  θ . The η in (18).
gradient of  θ is used when updating. The detailed update Although AdaGrad adaptively adjusts the learning rate, it
formulas for parameters θ are as follows: still has two issues: 1) the algorithm still needs to set the

⎪
⎨ θ = θ + vold · mtm  
global learning rate η manually and 2) as the training time
 increases, the accumulated gradient will become larger and
v = vold · mtm + η · − ∂L( θ)
∂(θ) (16)

⎩  larger, making the learning rate tend to zero, resulting in an
θ = θ + v. ineffective parameter update.
The improvement of Nesterov momentum over momentum AdaGrad was further improved to AdaDelta [31] and
is reflected in updating the gradient of the future position RMSProp [32] for solving the problem that the learning rate
instead of the current position. From the update formula, we will eventually go to zero. The idea is to consider not accumu-
can find that Nestorov momentum includes more gradient lating all historical gradients but focusing only on the gradients

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
3674 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO. 8, AUGUST 2020

in a window over a period, and using the exponential moving However, the SAG method is applicable only to the case
average to calculate the second-order cumulative momentum where the loss function is smooth and the objective function
 is convex [36], [107], such as convex linear prediction prob-
Vt = βVt−1 + (1 − β)(gt )2 (19) lems. In this case, the SAG achieves a faster convergence rate
than the SGD. In addition, under some specific problems, it
where β is the exponential decay parameter. Both RMSProp
can even deliver better convergence than the standard batch
and AdaDelta have been developed independently around the
gradient descent.
same time, stemming from the need to resolve the radically
Stochastic Variance Reduction Gradient: Since the SAG
diminishing learning rates of AdaGrad.
method is applicable only to smooth and convex functions
Adaptive moment estimation (Adam) [33] is another
and needs to store the gradient of each sample, it is incon-
advanced SGD method, which introduces an adaptive learning
venient to be applied in the nonconvex neural networks. The
rate for each parameter. It combines the adaptive learn-
stochastic variance reduction gradient (SVRG) [37] method
ing rate and momentum methods. In addition to storing an
was proposed to improve the performance of optimization in
exponentially decaying average of past squared gradients Vt ,
the complex models.
like AdaDelta and RMSProp, Adam also keeps an exponen-
The algorithm of SVRG maintains the interval average gra-
tially decaying average of past gradients mt , similar to the
dient μ̃ by calculating the gradients of all samples in every w
momentum method
iterations instead of in each iteration
mt = β1 mt−1 + (1 − β1 )gt (20) 1   
N

μ̃ = gi θ̃ (24)
Vt = β2 Vt−1 + (1 − β2 )(gt )2 (21) N
i=1
where β1 and β2 are the exponential decay rates. The final where θ̃ is the interval update parameter. The interval param-
update formula for the parameter θ is eter μ̃ contains the average memory of all sample gradients in
√ the past time for each time interval w. SVRG picks uniform
1 − β2 mt
θt+1 = mt − η . (22) it ∈ {1, . . . , N} randomly and executes gradient updates to the
1 − β1 Vt + 
current parameters
The default values of β1 , β2 , and  are suggested to set to    
0.9, 0.999, and 10−8 , respectively. Adam works well in prac- θt = θt−1 − η · git (θt−1 ) − git θ̃ + μ̃ . (25)
tice and compares favorably to other adaptive learning rate The gradient is calculated up to two times in each update. After
algorithms. w iterations, perform θ̃ ← θw and start the next w iterations.
5) Variance Reduction Methods: Due to a large amount Through these updates, θt and the interval update parameter θ̃
of redundant information in the training samples, the SGD will converge to the optimal θ ∗ , and then μ̃ → 0, and
methods are very popular since they were proposed. However,    
the stochastic gradient method can only converge at a sub- git (θt−1 ) − git θ̃ + μ̃ → git (θt−1 ) − git θ ∗ → 0. (26)
linear rate and the variance of gradient is often very large.
SVRG proposes a vital concept called variance reduction.
How to reduce the variance and improve SGD to the linear
This concept is related to the convergence analysis of SGD,
convergence has always been an important problem.
in which it is necessary to assume that there is a constant
Stochastic Average Gradient: The stochastic average gra-
upper bound for the variance of the gradients. This constant
dient (SAG) method [36] is a variance reduction method
upper bound implies that the SGD cannot achieve linear con-
proposed to improve the convergence speed. The SAG algo-
vergence. However, in SVRG, the upper bound of variance
rithm maintains parameter d recording the sum of the N latest
can be continuously reduced due to the special update item
gradients {gi } in memory, where gi is calculated using one
git (θt−1 ) − git (θ̃) + μ̃, thus achieving linear convergence [37].
sample i, i ∈ {1, . . . , N}. The detailed implementation is to
The strategies of SAG and SVRG are related to vari-
select a sample it to update d randomly, and use d to update
ance reduction. Compared with SAG, SVRG does not need
the parameter θ in iteration t
⎧ to maintain all gradients in memory, which means that
⎨ d = d − ĝit + git (θt−1 ) memory resources are saved, and it can be applied to com-
ĝi = git (θt−1 ) (23) plex problems efficiently. Experiments have shown that the
⎩ t
θt = θt−1 − Nα d performance of SVRG is remarkable on a nonconvex neu-
ral network [37], [108], [109]. There are also many variants
where the updated item d is calculated by replacing the old
of such linear convergence stochastic optimization algorithms,
gradient ĝit in d with the new gradient git (θt−1 ) in iteration
such as the SAGA algorithm [110].
t,and α is a constant representing the learning rate. Thus, each
6) Alternating Direction Method of Multipliers:
update only needs to calculate the gradient of one sample, not
The augmented Lagrangian multiplier method is a common
the gradients of all samples. The computational overhead is no
method to solve optimization problems with linear constraints.
different from SGD, but the memory overhead is much larger.
Compared with the naive Lagrangian multiplier method, it
This is a typical way of using space for saving time. The SAG
makes problems easier to solve by adding a penalty term to
has been shown to be a linear convergence algorithm [36],
the objective. Consider the following example:
which is much faster than SGD, and has great advantages over
other stochastic gradient algorithms. min{θ1 (x) + θ2 (y)|Ax + By = b, x ∈ X , y ∈ Y}. (27)

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: A SURVEY OF OPTIMIZATION METHODS FROM MACHINE LEARNING PERSPECTIVE 3675

The augmented Lagrange function for problem (27) is Algorithm 1 Frank–Wolfe Method [117], [118]
Input: x0 , ε ≥ 0, t := 0
Lβ (x, y, λ) = θ1 (x) + θ2 (y) − λ (Ax + By − b) Output: x∗
β yt ← min ∇f (xt ) x
+ Ax + By − b2 . (28)
2 while |∇f (xt ) (yt − xt )| > ε do
When solved by the augmented Lagrangian multiplier method, λt = arg min0≤λ≤1 f (xt + λ(yt − xt ))
its t-th step iteration starts from the given λt , and the xt+1 ≈ xt + λt (yt − xt )
optimization turns out to be t := t + 1
   yt ← min ∇f (xt ) x
(xt+1 , yt+1 ) = arg min Lβ (x, y, λt )|x ∈ X , y ∈ Y end while
(29)
λt+1 = λt − β(Axt+1 + Byt+1 − b). x∗ ≈ xt
Separating the (x, y) subproblem in (29), the augmented
Lagrange multiplier method can be relaxed to the following
ADMMs [111], [112]. Its t-th step iteration starts with the f (x) ≈ f (x0 ) + ∇f (x0 ) (x − x0 ), and substitute it into (31).
given (yt , λt ), and the details of iterative optimization are as Then, we have

follows: min f (xt ) + ∇f (xt ) (x − xt )
⎧   (32)
s.t. x∈S

⎪ x = arg min θ (x) − (λ ) Ax + β
Con 2
|x ∈ X
⎨ t+1 
1 t 2 x

β 2 which is equivalent to
⎪ yt+1 = arg min θ2 (y) − (λt ) By + 2 Cony |y ∈ Y 

⎩ min ∇f (xt ) x
λt+1 = λt − β(Axt+1 + Byt+1 − b) (33)
s.t. x ∈ S.
(30)
Suppose there exist an optimal solution yt , and then there
where Conx = Ax + Byt − b and Cony = Axt+1 + By − b. must be
The penalty parameter β has a certain impact on the con- 
∇f (xt ) yt < ∇f (xt ) xt
vergence rate of the ADMM. The larger β is, the greater the (34)
∇f (xt ) (yt − xt ) < 0.
penalties for the constraint term. In general, a monotonically
increasing sequence of {βt } can be adopted instead of the So yt − xt is the decreasing direction of f (x) at xt . A fetch
fixed β [113]. Specifically, an auto-adjustment criterion that step of λt updates the search point in a feasible direction. The
automatically adjusts {βt } based on the current value of {xt } detailed operation is shown in Algorithm 1.
during the iteration was proposed and applied for solving some The algorithm satisfies the following convergence
convex optimization problems [114], [115]. theorem [117]:
The ADMM method uses the separable operators in the 1) xt is the Kuhn–Tucker point of (31) when ∇f (xt ) (yt −
convex optimization problem to divide a large problem into xt ) = 0;
multiple small problems that can be solved in a distributed 2) since yt is an optimal solution for problem (33), the vec-
manner. In theory, the framework of ADMM can solve most of tor dt satisfies dt = yt −xt and is the feasible descending
the large-scale optimization problems. However, there are still direction of f at point xt when ∇f (xt ) (yt − xt ) = 0.
some problems in practical applications. For example, if we The Frank–Wolfe algorithm is a first-order iterative method
use a stop criterion to determine whether convergence occurs, to solve the convex optimization problems with constrained
the original residuals and dual residuals are both related to β, conditions. It consists of determining the feasible descent
and β with a large value will lead to difficulty in meeting the direction and calculating the search step size. The algorithm
convergence conditions [116]. is characterized by fast convergence in early iterations and
7) Frank–Wolfe Method: In 1956, Frank and Wolfe slower in later phases. When the iterative point is close to the
proposed an algorithm to solve the linear constraint prob- optimal solution, the search direction and the gradient direc-
lems [117]. The basic idea is to approximate the objective tion of the objective function tend to be orthogonal. Such a
function with a linear function, then solve the linear program- direction is not the best downward direction so that the Frank–
ming to find the feasible descending direction and, finally, Wolfe algorithm can be improved and extended in terms of the
make a 1-D search along the direction in the feasible domain. selection of the descending directions [119]–[121].
This method is also called the approximate linearization 8) Summary: We summarize the above-mentioned first-
method. order optimization methods in terms of properties, advantages,
Here, we give a simple example of the Frank–Wolfe method. and disadvantages in Table I in the supplementary material.
Consider the optimization problem
⎧ B. Preconditioning in Optimization
⎨ min f (x)
s.t. Ax = b (31) Preconditioning is a very important technique in
⎩ optimization methods. Reasonable preconditioning can
x≥0
reduce the iteration number of optimization algorithms.
where A is an m × n full-row rank matrix, and the feasible For many important iterative methods, the convergence
region is S = {x|Ax = b, x ≥ 0}. Expand f (x) linearly at x0 , depends largely on the spectral properties of the coefficient

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
3676 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO. 8, AUGUST 2020

TABLE I
AVAILABLE T OOLKITS FOR O PTIMIZATION

matrix [122]. It can be simply considered that the pre- Algorithm 2 Preconditioned Conjugate Gradient Method [93]
treatment is to transform a difficult linear system Aθ = b Input: A, θ0 , M, b
into an equivalent system with the same solution but better Output: The solution θ ∗
spectral characteristics. For example, if M is a nonsingular f0 = f (θ0 )
approximation of the coefficient matrix A, the transformed g0 = ∇f (θ0 ) = Aθ0 − b
system y0 is the solution of My = g0
d0 = −g0
M −1 Aθ = M −1 b (35) t=t
while gt = 0 do
will have the same solution as the system Aθ = b. But (35) g y
may be easier to solve and the spectral properties of the ηt = d t Adt
t t
coefficient matrix M −1 A may be more favorable. θt+1 = θt + ηt dt
In most linear systems, for example, Aθ = b, the matrix gt+1 = gt + ηt Adt
A is often complex and makes it hard to solve the system. yt+1 =solution of My = gt
g yt+1
Therefore, some transformation is needed to simplify this βt+1 = t+1
gt dt
system. M is called the preconditioner. If the matrix after using dt+1 = −yt+1 + βt+1 dt
preconditioner is obviously structured, or sparse, it will be t =t+1
beneficial to the calculation [123]. end while
The conjugate gradient algorithm is the most commonly
used optimization method with preconditioning technology,
which speeds up the convergence. The algorithm is shown
integrated powerful toolkits. We summarize the existing com-
in Algorithm 2.
mon optimization toolkits and present them in Table I.

C. Public Toolkits for Optimization IV. C HALLENGES AND O PEN P ROBLEMS


The fundamental optimization methods are applied in With the rise of practical demand and the increase of the
machine-learning problems extensively. There are many complexity of the machine-learning models, the optimization

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: A SURVEY OF OPTIMIZATION METHODS FROM MACHINE LEARNING PERSPECTIVE 3677

methods in machine learning still face challenges. In this Meta learning methods can be used for systematically learn-
part, we discuss open problems and challenges for some ing parameter initialization, which ensures that training begins
optimization methods in machine learning, which may offer with a suitable initial model. However, it is necessary to ensure
suggestions or ideas for future research and promote the wider the correlation between multiple tasks for meta-training and
application of optimization methods in machine learning. tasks for meta-testing. Under the premise of models with
similar data sources for training, transfer learning and meta
learning can overcome the difficulties caused by insufficient
A. Challenges in Deep Neural Networks
training data in new data sources, but these methods usually
There are still many challenges while optimizing DNNs. introduce a large number of parameters or complex parameter
Here, we mainly discuss two challenges with respect to data adjustment mechanisms, which need to be further improved
and model. One is insufficient data in training, and the other for specific problems. Therefore, using insufficient data for
is a nonconvex objective in DNNs. training DNNs is still a challenge.
1) Insufficient Data in Training Deep Neural Networks: 2) Nonconvex Optimization in the Deep Neural
In general, deep learning is based on big datasets and com- Network: Convex optimization has good properties and a
plex models. It requires a large number of training samples comprehensive set of tools is open to solve the optimization
to achieve good training effects. But in some particular fields, problem. However, many machine-learning problems are for-
finding a sufficient amount of training data is difficult. If we mulated as the nonconvex optimization problems. For exam-
do not have enough data to estimate the parameters in the ple, almost all of the optimization problems in DNNs are
neural networks, it may lead to high variance and overfitting. nonconvex. Nonconvex optimization is one of the difficulties
There are some techniques in neural networks that can be in the optimization problem. Unlike convex optimization, there
used to reduce the variance. Adding L2 regularization to the may be innumerable optimum solutions in its feasible domain
objective is a natural method to reduce the model complexity. in nonconvex problems. The complexity of the algorithm for
Recently, a common method is dropout [62]. In the training searching the global optimal value is NP-hard [108].
process, each neuron is allowed to stop working with a prob- In recent years, nonconvex optimization has gradually
ability of p, which can prevent the synergy between certain attracted the attention of researchers. The methods for solving
neurons. M subnets can be sampled like bagging by multiple nonconvex optimization problems can be roughly divided into
inputs and returns [130]. Each expected result at the output two types. One is to transform the nonconvex optimization
layer is calculated as into a convex optimization problem, and then use the con-
vex optimization method. The other is to use some special
  M
optimization method for solving nonconvex functions directly.
o = EM f (x; θ, M) = p(Mi )f (x; θ, Mi ) (36)
i=1
There is some work on summarizing the optimization meth-
ods for solving nonconvex functions from the perspective of
where p(Mi ) is the probability of the ith subnet. Dropout can machine learning [133].
prevent overfitting and improve the generalization ability of 1) Relaxation Method: Relax the problem to make
the network, but its disadvantage is increasing the training it become a convex optimization problem. There
time as each training changes from the full network to a are many relaxation techniques, for example, the
subnetwork [131]. branch-and-bound method called αBB convex relax-
Not only overfitting but also some training details will affect ation [134], [135], which uses a convex relaxation at
the performance of the model due to the complexity of the each step to compute the lower bound in the region.
DNNs. The improper selection of the learning rate and the The convex relaxation method has been used in many
number of iterations in the SGD will make the model unable to fields. In the field of computer vision, a convex relax-
converge, which makes the accuracy of model fluctuate greatly. ation method was proposed to calculate minimal par-
Besides, taking an inappropriate black box of neural-network titions [136]. For unsupervised and SSL, the convex
construction may result in training not being able to continue, relaxation method was used for solving semidefinite
so designing an appropriate neural-network model is particu- programming [137].
larly important. These impacts are even greater when data are 2) Nonconvex Optimization Methods: These methods
insufficient. include projection gradient descent [138], [139];
The technology of transfer learning [132] can be applied to alternating minimization [140]–[142]; expectation
build networks in the scenario of insufficient data. Its idea is maximization algorithm [143], [144]; and stochastic
that the models trained from other data sources can be reused optimization and its variants [37].
in similar target fields after certain modifications and improve-
ments, which dramatically alleviates the problems caused by
insufficient datasets. Moreover, the advantages brought by B. Difficulties in Sequential Models With Large-Scale Data
transfer learning are not limited to reducing the need for suf- When dealing with the large-scale time series, the usual
ficient training data but also can avoid overfitting effectively solutions are using stochastic optimization, processing data
and achieve better performance in general. However, if tar- in mini-batches, or utilizing distributed computing to improve
get data are not as relevant to the original training data, the computational efficiency [145]. For a sequential model, seg-
transferred model does not bring good performance. menting the sequences can affect the dependencies between

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
3678 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO. 8, AUGUST 2020

the data on the adjacent time indices. If sequence length is derivative-free aspects, as well as the research progress in
not an integral multiple of the mini-batch size, the general recent years. Then, we described the applications of the
operation is to add some items sampled from the previous optimization methods in different machine-learning scenarios
data into the last subsequence. This operation will introduce and the approaches to improve their performance in the supple-
the wrong dependency in the training model. Therefore, the mentary material. Finally, we discussed some challenges and
analysis of the difference between the approximated solution open problems in the machine-learning optimization methods.
obtained and the exact solution is a direction worth exploring.
Particularly, in RNNs, the problem of gradient vanishing and R EFERENCES
gradient explosion is also prone to occur. So far, it is generally [1] H. Robbins and S. Monro, “A stochastic approximation method,” Ann.
solved by specific interaction modes of LSTM and GRU [146] Math. Stat., vol. 22, no. 3, pp. 400–407, 1951.
or gradient clipping. Better appropriate solutions for dealing [2] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford,
“Parallelizing stochastic gradient descent for least squares regression:
with problems in RNNs are still worth investigating. Mini-batching, averaging, and model misspecification,” J. Mach. Learn.
Res., vol. 18, no. 223, pp. 1–42, 2018.
C. High-Order Methods for Stochastic Variational Inference [3] D. F. Shanno, “Conditioning of quasi-Newton methods for function
minimization,” Math. Comput., vol. 24, no. 111, pp. 647–656, 1970.
The high-order optimization method utilizes curvature [4] J. Hu, B. Jiang, L. Lin, Z. Wen, and Y.-X. Yuan, “Structured quasi-
Newton methods for optimization with orthogonality constraints,”
information and thus converges fast. Although computing and SIAM J. Sci. Comput., vol. 41, pp. 2239–2269, 2019.
storing the Hessian matrices are difficult, with the develop- [5] J. Pajarinen, H. L. Thai, R. Akrour, J. Peters, and G. Neumann,
ment of research, the calculation of the Hessian matrix has “Compatible natural gradient policy search,” Mach. Learn., vol. 108,
nos. 8–9, pp. 1443–1466, 2019.
made great progress [8], [9], [147], and the second-order [6] J. E. Dennis, Jr., and J. J. Moré, “Quasi-Newton methods, motivation
optimization method has become more and more attractive. and theory,” SIAM Rev., vol. 19, no. 1, pp. 46–89, 1977.
Recently, stochastic methods have also been introduced into [7] J. Martens, “Deep learning via Hessian-free optimization,” in Proc. Int.
Conf. Mach. Learn., 2010, pp. 735–742.
the second-order method, which extends the second-order [8] F. Roosta-Khorasani and M. W. Mahoney, “Sub-sampled Newton meth-
method to large-scale data [8], [10]. ods II: Local convergence rates,” arXiv preprint arXiv:1601.04738,
We have introduced some work on stochastic variational pp. 1–56, Feb. 2016.
[9] P. Xu, J. Yang, F. Roosta-Khorasani, C. Ré, and M. W. Mahoney, “Sub-
inference. It introduces the stochastic method into variational sampled Newton methods with non-uniform sampling,” in Proc. Adv.
inference, which is an interesting and meaningful combination. Neural Inf. Process. Syst., 2016, pp. 3000–3008.
This makes variational inference be able to handle large-scale [10] R. Bollapragada, R. H. Byrd, and J. Nocedal, “Exact and inexact
subsampled Newton methods for optimization,” IMA J. Numer. Anal.,
data. A natural idea is whether we can incorporate second- vol. 1, no. 2, pp. 1–34, 2018.
order optimization methods (or higher-order) into stochastic [11] L. M. Rios and N. V. Sahinidis, “Derivative-free optimization: A review
variational inference, which is interesting and challenging. of algorithms and comparison of software implementations,” J. Glob.
Optim., vol. 56, no. 3, pp. 1247–1293, 2013.
[12] A. S. Berahas, R. H. Byrd, and J. Nocedal, “Derivative-free
D. Stochastic Optimization in Conjugate Gradient optimization of noisy functions via quasi-Newton methods,” SIAM J.
Optim., vol. 29, no. 2, pp. 965–993, 2019.
Stochastic methods exhibit powerful capabilities when [13] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learn-
dealing with large-scale data, especially for first-order ing applied to document recognition,” Proc. IEEE, vol. 86, no. 11,
optimization [148]. Then, the relevant experts and schol- pp. 2278–2324, Nov. 1998.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
ars also introduced this stochastic idea to the second-order with deep convolutional neural networks,” in Proc. Adv. Neural Inf.
optimization methods [149]–[151] and achieved good results. Process. Syst., 2012, pp. 1097–1105.
The conjugate gradient method is an elegant and attrac- [15] P. Sermanet and D. Eigen, “Overfeat: Integrated recognition, localiza-
tion and detection using convolutional networks,” in Proc. Int. Conf.
tive algorithm, which has the advantages of both the first- Learn. Represent., 2014, pp. 1–16.
order and second-order optimization methods. The standard [16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
form of a conjugate gradient is not suitable for a stochas- L. Fei-Fei, “Large-scale video classification with convolutional neu-
ral networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
tic approximation. Using the fast Hessian-gradient product, Columbus, OH, USA, 2014, pp. 1725–1732.
the stochastic method is also introduced to conjugate gradi- [17] Y. Kim, “Convolutional neural networks for sentence classification,” in
ent, in which some numerical results show the validity of the Proc. Conf. Empir. Methods Nat. Lang. Process., 2014, pp. 1746–1751.
[18] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks
algorithm [148]. Another version of the stochastic conjugate for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell.,
gradient method employs the variance reduction technique and vol. 35, no. 1, pp. 221–231, Jan. 2012.
converges quickly with just a few iterations and requires less [19] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural
networks for text classification,” in Proc. Assoc. Adv. Artif. Intell., 2015,
storage space during the running process [152]. The stochas- pp. 2267–2273.
tic version of conjugate gradient is a potential optimization [20] K. Cho et al., “Learning phrase representations using RNN encoder–
method and is still worth studying. decoder for statistical machine translation,” in Proc. Conf. Empir.
Methods Nat. Lang. Process., 2014, pp. 1724–1734.
[21] P. Liu, X. Qiu, and X. Huang, “Recurrent neural network for text classi-
V. C ONCLUSION fication with multi-task learning,” in Proc. Int. Joint Conf. Artif. Intell.,
2016, pp. 2873–2879.
This article introduced and summarized the frequently [22] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with
used optimization methods from the perspective of machine deep recurrent neural networks,” in Proc. Int. Conf. Acoust. Speech
learning, and studied their applications in various fields of Signal Process., Vancouver, BC, Canada, 2013, pp. 6645–6649.
[23] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra,
machine learning. First, we described the theoretical basis “DRAW: A recurrent neural network for image generation,” in Proc.
of optimization methods from the first-order, high-order, and Int. Conf. Mach. Learn., 2015, pp. 1462–1471.

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: A SURVEY OF OPTIMIZATION METHODS FROM MACHINE LEARNING PERSPECTIVE 3679

[24] A. V. D. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recur- [51] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra,
rent neural networks,” in Proc. Int. Conf. Mach. Learn., 2016, “Matching networks for one shot learning,” in Proc. Adv. Neural Inf.
pp. 1747–1756. Process. Syst., 2016, pp. 3630–3638.
[25] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, [52] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for
“Action recognition in video sequences using deep bi-directional few-shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017,
LSTM with CNN features,” IEEE Access, vol. 6, pp. 1155–1166, pp. 4077–4087.
2017. [53] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap,
[26] Y. Xia and J. Wang, “A bi-projection neural network for “Meta-learning with memory-augmented neural networks,” in Proc. Int.
solving constrained quadratic optimization problems,” IEEE Conf. Mach. Learn., New York, NY, USA, 2016, pp. 1842–1850.
Trans. Neural Netw. Learn. Syst., vol. 27, no. 2, pp. 214–224, [54] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” in Proc.
Feb. 2015. Int. Conf. Learn. Represent., 2015, pp. 1–15.
[27] S. Zhang, Y. Xia, and J. Wang, “A complex-valued projection neu- [55] M. Andrychowicz et al., “Learning to learn by gradient descent by
ral network for constrained optimization of real functions in complex gradient descent,” in Proc. Adv. Neural Inf. Process. Syst., 2016,
variables,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 12, pp. 3981–3989.
pp. 3227–3238, Dec. 2015. [56] S. Ravi and H. Larochelle, “Optimization as a model for few-shot
[28] Y. Xia and J. Wang, “Robust regression estimation based learning,” in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–11.
on low-dimensional recurrent neural networks,” IEEE Trans. [57] C. M. Bishop, Pattern Recognition and Machine Learning. New York,
Neural Netw. Learn. Syst., vol. 29, no. 12, pp. 5935–5946, NY, USA: Springer, 2006.
Dec. 2018. [58] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic
[29] Y. Xia, J. Wang, and W. Guo, “Two projection neural networks with variational inference,” J. Mach. Learn. Res., vol. 14, pp. 1303–1347,
reduced model complexity for nonlinear programming,” IEEE Trans. May 2013.
Neural Netw. Learn. Syst., to be published. [59] C. Ledig et al., “Photo-realistic single image super-resolution using
[30] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods a generative adversarial network,” in Proc. Comput. Vis. Pattern
for online learning and stochastic optimization,” J. Mach. Learn. Res., Recognit., 2017, pp. 4681–4690.
vol. 12, pp. 2121–2159, Jul. 2011. [60] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable
[31] M. D. Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv trust-region method for deep reinforcement learning using Kronecker-
preprint arXiv:1212.5701, pp. 1–6, Dec. 2012. factored approximation,” in Proc. Adv. Neural Inf. Process. Syst., 2017,
[32] T. Tieleman and G. Hinton, “Divide the gradient by a running aver- pp. 5279–5288.
age of its recent magnitude,” COURSERA Neural Netw. Mach. Learn., [61] T. Chen, E. Fox, and C. Guestrin, “Stochastic gradient Hamiltonian
vol. 4, no. 2, pp. 26–31, 2012. Monte Carlo,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 1683–1691.
[33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [62] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
in Proc. Int. Conf. Learn. Represent., 2014, pp. 1–15. R. Salakhutdinov, “Dropout: A simple way to prevent neural networks
[34] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958,
beyond,” in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–23. Jun. 2014.
[35] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation [63] W. Yin and H. Schütze, “Multichannel variable-size convolution for
learning with deep convolutional generative adversarial networks,” sentence classification,” in Proc. Conf. Comput. Lang. Learn., 2015,
arXiv preprint arXiv:1511.06434, pp. 1–16, Jan. 2015. pp. 204–214.
[36] N. L. Roux, M. Schmidt, and F. R. Bach, “A stochastic gradient method [64] J. Yang, K. Yu, Y. Gong, and T. S. Huang, “Linear spatial pyra-
with an exponential convergence rate for finite training sets,” in Proc. mid matching using sparse coding for image classification,” in Proc.
Adv. Neural Inf. Process. Syst., 2012, pp. 2663–2671. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, 2009,
[37] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent pp. 1794–1801.
using predictive variance reduction,” in Proc. Adv. Neural Inf. Process. [65] Y. Bazi and F. Melgani, “Gaussian process approach to remote sensing
Syst., 2013, pp. 315–323. image classification,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 1,
[38] N. S. Keskar and R. Socher, “Improving generalization performance pp. 186–197, Jan. 2010.
by switching from Adam to SGD,” arXiv preprint arXiv:1712.07628, [66] D. C. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep
pp. 1–10, Dec. 2017. neural networks for image classification,” in Proc. IEEE Conf. Comput.
[39] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Vis. Pattern Recognit., 2012, pp. 3642–3649.
Cambridge, MA, USA: MIT Press, 1998. [67] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clus-
[40] J. Mattner, S. Lange, and M. Riedmiller, “Learn to swing up and bal- tering algorithm,” J. Roy. Stat. Soc. C (Appl. Stat.), vol. 28, no. 1,
ance a real pole based on raw visual input data,” in Proc. Int. Conf. pp. 100–108, 1979.
Neural Inf. Process., 2012, pp. 126–133. [68] S. Guha, R. Rastogi, and K. Shim, “ROCK: A robust clustering algo-
[41] V. Mnih et al., “Playing Atari with deep reinforcement learning,” in rithm for categorical attributes,” Inf. Syst., vol. 25, no. 5, pp. 345–366,
Proc. Adv. Neural Inf. Process. Syst. Workshop, 2013, pp. 1–9. 2000.
[42] V. Mnih et al., “Human-level control through deep reinforcement [69] C. Ding, X. He, H. Zha, and H. D. Simon, “Adaptive dimension reduc-
learning,” Nature, vol. 518, pp. 529–533, Feb. 2015. tion for clustering high dimensional data,” in Proc. IEEE Int. Conf.
[43] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Data Min., 2002, pp. 147–154.
Learn., vol. 2, no. 1, pp. 1–127, 2009. [70] M. Guillaumin, J. Verbeek, and C. Schmid, “Multimodal semi-
[44] S. S. Mousavi, M. Schukat, and E. Howley, “Deep reinforcement learn- supervised learning for image classification,” in Proc. Comput. Vis.
ing: An overview,” in Proc. SAI Intell. Syst. Conf., 2016, pp. 426–440. Pattern Recognit., San Francisco, CA, USA, 2010, pp. 902–909.
[45] J. Schmidhuber, “Evolutionary principles in self-referential learning, [71] O. Chapelle and A. Zien, “Semi-supervised classification by low
or on learning how to learn: The meta-meta-. . . hook,” Ph.D. disserta- density separation,” in Proc. Int. Conf. Artif. Intell. Stat., 2005,
tion, Institut für Informatik, Technische Universität München, Munich, pp. 57–64.
Germany, 1987. [72] Z.-H. Zhou and M. Li, “Semi-supervised regression with co-training,”
[46] T. Schaul and J. Schmidhuber, “Metalearning,” Scholarpedia, vol. 5, in Proc. Int. Joint Conf. Artif. Intell., 2005, pp. 908–913.
no. 6, pp. 46–50, 2010. [73] A. Demiriz and K. P. Bennett, “Semi-supervised clustering using
[47] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for genetic algorithms,” Artif. Neural Netw. Eng., vol. 1, pp. 809–814,
fast adaptation of deep networks,” in Proc. Int. Conf. Mach. Learn., Sep. 1999.
2017, pp. 1126–1135. [74] B. Kulis, S. Basu, I. Dhillon, and R. Mooney, “Semi-supervised graph
[48] O. Vinyals. (2017). Model vs Optimization Meta Learning. [Online]. clustering: A kernel approach,” Mach. Learn., vol. 74, no. 1, pp. 1–22,
Available: http://metalearning-symposium.ml/files/vinyals.pdf 2009.
[49] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature [75] D. Zhang and Z.-H. Zhou, “Semi-supervised dimensionality reduction,”
verification using a“siamese” time delay neural network,” in Proc. Adv. in Proc. SIAM Int. Conf. Data Min., 2007, pp. 629–634.
Neural Inf. Process. Syst., Denver, CO, USA, 1994, pp. 737–744. [76] P. Chen, L. Jiao, F. Liu, J. Zhao, Z. Zhao, and S. Liu, “Semi-
[50] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks supervised double sparse graphs based discriminant analysis for
for one-shot image recognition,” in Proc. Int. Conf. Mach. Learn. dimensionality reduction,” Pattern Recognit., vol. 61, pp. 361–378,
WorkShop, 2015, pp. 1–30. Jan. 2017.

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
3680 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO. 8, AUGUST 2020

[77] K. P. Bennett and A. Demiriz, “Semi-supervised support vector [106] C. Darken and J. E. Moody, “Note on learning rate schedules for
machines,” in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 368–374. stochastic optimization,” in Proc. Adv. Neural Inf. Process. Syst., 1991,
[78] E. Cheung, Optimization Methods for Semi-Supervised Learning. pp. 832–838.
Waterloo, ON, Canada: Univ. Waterloo, 2018. [107] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with
[79] O. Chapelle, V. Sindhwani, and S. S. Keerthi, “Optimization techniques the stochastic average gradient,” Math. Program., vol. 162, nos. 1–2,
for semi-supervised support vector machines,” J. Mach. Learn. Res., pp. 83–112, 2017.
vol. 9, pp. 203–233, Jun. 2008. [108] Z. Allen-Zhu and E. Hazan, “Variance reduction for faster non-convex
[80] O. Chapelle, V. Sindhwani, and S. S. Keerthi, “Branch and bound for optimization,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 699–707.
semi-supervised support vector machines,” in Proc. Adv. Neural Inf. [109] S. J. Reddi, A. Hefny, S. Sra, B. Póczós, and A. Smola, “Stochastic
Process. Syst., 2007, pp. 217–224. variance reduction for nonconvex optimization,” in Proc. Int. Conf.
[81] Y.-F. Li, I. W. Tsang, J. T. Kwok, and Z.-H. Zhou, “Convex and scal- Mach. Learn., 2016, pp. 314–323.
able weakly labeled SVMs,” J. Mach. Learn. Res., vol. 14, no. 1, [110] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental
pp. 2151–2188, 2013. gradient method with support for non-strongly convex composite objec-
[82] F. Murtagh, “A survey of recent advances in hierarchical clustering tives,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1646–1654.
algorithms,” Comput. J., vol. 26, no. 4, pp. 354–359, 1983. [111] M. J. Powell, “A method for nonlinear constraints in minimization
[83] V. Estivill-Castro and J. Yang, “A fast and robust general purpose clus- problems,” in Optimization. New York, NY, USA: Academic, 1969,
tering algorithm,” in Proc. 4th Eur. Workshop Principles Knowl. Disc. pp. 283–298.
Databases Data Min., 2000, pp. 208–218. [112] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed
[84] G. H. Ball and D. J. Hall, “A clustering technique for summarizing optimization and statistical learning via the alternating direction method
multivariate data,” Behav. Sci., vol. 12, no. 2, pp. 153–155, 1967. of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122,
[85] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” 2011.
Chemometr. Intell. Lab. Syst., vol. 2, nos. 1–3, pp. 37–52, 1987. [113] A. Nagurney and P. Ramanujam, “Transportation network pol-
[86] I. T. Jolliffe, “Principal component analysis,” in International icy modeling with goal targets and generalized penalty functions,”
Encyclopedia of Statistical Science. Heidelberg, Germany: Springer, Transport. Sci., vol. 30, pp. 3–13, 1996.
2011, pp. 1094–1096. [114] B. He, H. Yang, and S. Wang, “Alternating direction method with self-
[87] M. E. Tipping and C. M. Bishop, “Probabilistic principal compo- adaptive penalty parameters for monotone variational inequalities,” J.
nent analysis,” J. Roy. Stat. Soc. B (Stat. Methodol.), vol. 61, no. 3, Optim. Theory Appl., vol. 106, no. 1, pp. 337–356, 2000.
pp. 611–622, 1999. [115] D. Hallac, C. Wong, S. Diamond, A. Sharang, S. Boyd, and
[88] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. J. Leskovec, “SnapVX: A network-based convex optimization solver,”
Cambridge, MA, USA: MIT Press, 2018. J. Mach. Learn. Res., vol. 18, nos. 1–5, pp. 1–5, 2017.
[89] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement [116] B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang, “An ADMM algo-
learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, rithm for a class of total variation regularized estimation problems,”
Mar. 1996. IFAC Proc. Vol., vol. 45, no. 16, pp. 83–88, 2012.
[117] M. Frank and P. Wolfe, “An algorithm for quadratic programming,”
[90] S. Ruder, “An overview of gradient descent optimization algorithms,”
Naval Res. Logistics Quart., vol. 3, nos. 1–2, pp. 95–110, 1956.
arXiv preprint arXiv:1609.04747, pp. 1–14, Jun. 2016.
[118] M. Jaggi, “Revisiting Frank–Wolfe: Projection-free sparse convex
[91] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.:
optimization,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 427–435.
Cambridge Univ. Press, 2004.
[119] M. Fukushima, “A modified Frank–Wolfe algorithm for solving the
[92] J. Alspector, R. Meir, B. Yuhas, A. Jayakumar, and D. Lippe, “A
traffic assignment problem,” Transport. Res. B Methodol., vol. 18, no. 2,
parallel gradient descent method for learning in analog VLSI neural
pp. 169–177, 1984.
networks,” in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 836–844.
[120] M. Patriksson, The Traffic Assignment Problem: Models and Methods.
[93] J. Nocedal and S. J. Wright, Numerical Optimization. New York, NY,
Mineola, NY, USA: Dover, 2015.
USA: Springer, 2006.
[121] K. L. Clarkson, “Coresets, sparse greedy approximation, and the Frank–
[94] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Wolfe algorithm,” ACM Trans. Algorithms, vol. 6, no. 4, pp. 63–96,
Efficiency in Optimization. Chichester, U.K.: Wiley, 1983. 2010.
[95] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic [122] T. Huckle, “Approximate sparsity patterns for the inverse of a
approximation approach to stochastic programming,” SIAM J. Optim., matrix and preconditioning,” Appl. Numer. Math., vol. 30, nos. 2–3,
vol. 19, no. 4, pp. 1574–1609, 2009. pp. 291–303, 1999.
[96] A. Agarwal, M. J. Wainwright, P. L. Bartlett, and P. K. Ravikumar, [123] M. Benzi, “Preconditioning techniques for large linear systems: A
“Information-theoretic lower bounds on the oracle complexity of con- survey,” J. Comput. Phys., vol. 182, no. 2, pp. 418–477, 2002.
vex optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2009, [124] M. Grant and S. Boyd. (2014). CVX: MATLAB Software for
pp. 1–9. Disciplined Convex Programming, Version 2.1. [Online]. Available:
[97] C. Darken, J. Chang, and J. Moody, “Learning rate schedules for faster http://cvxr.com/cvx
stochastic gradient search,” in Proc. Neural Netw. Signal Process., [125] S. Diamond and S. Boyd, “CVXPY: A python-embedded modeling
1992, pp. 3–12. language for convex optimization,” J. Mach. Learn. Res., vol. 17,
[98] I. Sutskever, “Training recurrent neural networks,” Ph.D. dissertation, pp. 2909–2913, Apr. 2016.
Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2013. [126] M. Andersen, J. Dahl, and L. Vandenberghe. (2013). CVXOPT: A
[99] Z. Allen-Zhu, “Natasha 2: Faster non-convex optimization than SGD,” Python Package for Convex Optimization, Version 1.1.6. [Online].
in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2675–2686. Available: https://cvxopt.org/
[100] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points— [127] J. D. Hedengren, R. A. Shishavan, K. M. Powell, and T. F. Edgar,
Online stochastic gradient for tensor decomposition,” in Proc. Conf. “Nonlinear modeling, estimation and predictive control in APMonitor,”
Learn. Theory, 2015, pp. 797–842. Comput. Chem. Eng., vol. 70, pp. 133–148, Nov. 2014.
[101] B. T. Polyak, “Some methods of speeding up the convergence of [128] J. Mairal, F. Bach, J. Ponce, G. Sapiro, R. Jenatton, and G. Obozinski.
iteration methods,” USSR Comput. Math. Math. Phys., vol. 4, no. 5, (2014). SPAMS: A Sparse Modeling Software, Version 2.3. [Online].
pp. 1–17, 1964. Available: http://spams-devel.gforge.inria.fr/downloads.html
[102] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. [129] M. Abadi et al., “TensorFlow: A system for large-scale machine learn-
Cambridge, MA, USA: MIT Press, 2016. ing,” in Proc. USENIX Symp. Oper. Syst. Design Implement., 2016,
[103] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance pp. 265–283.
of initialization and momentum in deep learning,” in Proc. Int. Conf. [130] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, no. 2,
Mach. Learn., 2013, pp. 1139–1147. pp. 123–140, 1996.
[104] Y. Nesterov, “A method for unconstrained convex minimization [131] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network
problem with the rate of convergence O( 12 ),” Doklady Akademii Nauk regularization,” arXiv preprint arXiv:1409.2329, pp. 1–8, Feb. 2015.
k
SSSR, vol. 269, no. 3, pp. 543–547, 1983. [132] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans.
[105] L. C. Baird, III, and A. W. Moore, “Gradient descent for general rein- Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
forcement learning,” in Proc. Adv. Neural Inf. Process. Syst., 1999, [133] P. Jain and P. Kar, “Non-convex optimization for machine learning,”
pp. 968–974. Found. Trends Mach. Learn., vol. 10, pp. 142–336, Dec. 2017.

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: A SURVEY OF OPTIMIZATION METHODS FROM MACHINE LEARNING PERSPECTIVE 3681

[134] C. S. Adjiman and S. Dallwig, “A global optimization method, Shiliang Sun received the Ph.D. degree in pat-
αBB, for general twice-differentiable constrained NLPs—I. Theoretical tern recognition and intelligent systems from the
advances,” Comput. Chem. Eng., vol. 22, no. 9, pp. 1137–1158, 1998. Department of Automation, State Key Laboratory
[135] C. S. Adjiman, C. A. Schweiger, and C. A. Floudas, “Mixed- of Intelligent Technology and Systems, Tsinghua
integer nonlinear optimization in process synthesis,” in Handbook University, Beijing, China, in 2007.
of Combinatorial Optimization. Boston, MA, USA: Springer, 1998, He is a Professor with the School of Computer
pp. 1–76. Science and Technology and the Head of the
[136] T. Pock, A. Chambolle, D. Cremers, and H. Bischof, “A convex relax- Pattern Recognition and Machine Learning Research
ation approach for computing minimal partitions,” in Proc. IEEE Conf. Group, East China Normal University, Shanghai,
Comput. Vis. Pattern Recognit., 2009, pp. 810–817. China. His research results have expounded
[137] L. Xu and D. Schuurmans, “Unsupervised and semi-supervised multi- over 100 publications at peer-reviewed jour-
class support vector machines,” in Proc. Assoc. Adv. Artif. Intell., 2005, nals and conferences, such as the Journal of Machine Learning
pp. 904–910. Research, the IEEE T RANSACTIONS ON N EURAL N ETWORKS AND
[138] Y. Chen and M. J. Wainwright, “Fast low-rank estimation by projected L EARNING S YSTEMS, the IEEE T RANSACTIONS ON C YBERNETICS, the
gradient descent: General statistical and algorithmic guarantees,” arXiv IEEE T RANSACTIONS ON M ULTIMEDIA, the IEEE T RANSACTIONS ON
preprint arXiv:1509.03025, pp. 1–63, Sep. 2015. I NTELLIGENT T RANSPORTATION S YSTEMS, ICML, IJCAI, and ECML. His
[139] D. Park, A. Kyrillidis, S. Bhojanapalli, C. Caramanis, and current research interests include kernel methods, multiview learning, learning
S. Sanghavi, “Provable non-convex projected gradient descent for a theory, approximate inference, sequential modeling, and their applications.
class of constrained matrix optimization problems,” arXiv preprint Prof. Sun is on the editorial board of multiple international journals, includ-
arXiv:1606.01316, pp. 1–29, Jun. 2016. ing Neurocomputing and the IEEE T RANSACTIONS ON N EURAL N ETWORKS
[140] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank matrix completion AND L EARNING S YSTEMS .
using alternating minimization,” in Proc. ACM Annu. Symp. Theory
Comput., 2013, pp. 665–674. Zehui Cao is currently pursuing the master’s degree
[141] M. Hardt, “Understanding alternating minimization for matrix com- with the Pattern Recognition and Machine Learning
pletion,” in Proc. IEEE Annu. Symp. Found. Comput. Sci., 2014, Research Group, School of Computer Science
pp. 651–660. and Technology, East China Normal University,
[142] M. Hardt and M. Wootters, “Fast matrix completion without the Shanghai, China.
condition number,” in Proc. Conf. Learn. Theory, 2014, pp. 638–678. Her current research interests include machine
[143] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees learning and pattern recognition.
for the EM algorithm: From population to sample-based analysis,” Ann.
Stat., vol. 45, no. 1, pp. 77–120, 2017.
[144] Z. Wang, Q. Gu, Y. Ning, and H. Liu, “High dimensional expecta-
tion maximization algorithm: Statistical optimization and asymptotic
normality,” arXiv preprint arXiv:1412.8729, pp. 1–84, Dec. 2014. Han Zhu is currently pursuing the master’s degree
[145] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and with the Pattern Recognition and Machine Learning
P. T. P. Tang, “On large-batch training for deep learning: Generalization Research Group, School of Computer Science
gap and sharp minima,” in Proc. Int. Conf. Learn. Represent., 2016, and Technology, East China Normal University,
pp. 1–16. Shanghai, China.
[146] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation Her current research interests include machine
of gated recurrent neural networks on sequence modeling,” in Proc. learning and pattern recognition.
Adv. Neural Inf. Process. Syst. Workshop, 2014, pp. 1–9.
[147] J. Martens, Second-Order Optimization For Neural Networks, Univ.
Toronto, Toronto, ON, Canada, 2016.
[148] N. N. Schraudolph and T. Graepel, “Conjugate directions for stochas-
tic gradient descent,” in Proc. Int. Conf. Artif. Neural Netw., 2002, Jing Zhao received the Ph.D. degree in pat-
pp. 1351–1356. tern recognition and machine learning from the
[149] N. N. Schraudolph, J. Yu, and S. Günter, “A stochastic quasi-Newton Department of Computer Science and Technology,
method for online convex optimization,” in Proc. Art. Intell. Stat., 2007, East China Normal University, Shanghai, China, in
pp. 436–443. 2016.
[150] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, “A stochastic She is an Assistant Professor with the Pattern
quasi-Newton method for large-scale optimization,” SIAM J. Optim., Recognition and Machine Learning Research Group,
vol. 26, no. 2, pp. 1008–1031, 2016. School of Computer Science and Technology, East
[151] A. Bordes, L. Bottou, and P. Gallinari, “SGD-QN: Careful quasi- China Normal University. Her research results have
Newton stochastic gradient descent,” J. Mach. Learn. Res., vol. 10, expounded in over ten publications at peer-reviewed
pp. 1737–1754, Jan. 2009. journals and conferences, such as the Journal of
[152] X.-B. Jin, X.-Y. Zhang, K. Huang, and G.-G. Geng, “Stochastic Machine Learning Research, Pattern Recognition, the IEEE T RANSACTIONS
conjugate gradient algorithm with variance reduction,” IEEE Trans. ON I NTELLIGENT T RANSPORTATION S YSTEMS , and IJCAI. Her current
Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1360–1369, research interests include Bayesian methods, kernel methods, sequential
May 2019. modeling, and their applications.

Authorized licensed use limited to: Univ of Calif Davis. Downloaded on January 05,2025 at 08:34:34 UTC from IEEE Xplore. Restrictions apply.

You might also like