
A Survey of Optimization Methods from a Machine Learning Perspective

Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao

arXiv:1906.06821v2 [cs.LG] 23 Oct 2019

This work was supported by NSFC Project 61370175 and Shanghai Sailing Program 17YF1404600. Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao are with the School of Computer Science and Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai 200062, P. R. China. E-mail: [email protected], [email protected] (Shiliang Sun); [email protected], [email protected] (Jing Zhao).

Abstract—Machine learning develops rapidly, which has made many theoretical breakthroughs and is widely applied in various fields. Optimization, as an important part of machine learning, has attracted much attention from researchers. With the exponential growth of the amount of data and the increase of model complexity, optimization methods in machine learning face more and more challenges. A lot of work on solving optimization problems or improving optimization methods in machine learning has been proposed successively. The systematic retrospect and summary of the optimization methods from the perspective of machine learning are of great significance, which can offer guidance for both developments of optimization and machine learning research. In this paper, we first describe the optimization problems in machine learning. Then, we introduce the principles and progress of commonly used optimization methods. Next, we summarize the applications and developments of optimization methods in some popular machine learning fields. Finally, we explore and give some challenges and open problems for the optimization in machine learning.

Index Terms—Machine learning, optimization method, deep neural network, reinforcement learning, approximate Bayesian inference.

I. INTRODUCTION

RECENTLY, machine learning has grown at a remarkable rate, attracting a great number of researchers and practitioners. It has become one of the most popular research directions and plays a significant role in many fields, such as machine translation, speech recognition, image recognition, recommendation systems, etc. Optimization is one of the core components of machine learning. The essence of most machine learning algorithms is to build an optimization model and learn the parameters in the objective function from the given data. In the era of immense data, the effectiveness and efficiency of the numerical optimization algorithms dramatically influence the popularization and application of machine learning models. In order to promote the development of machine learning, a series of effective optimization methods were put forward, which have improved the performance and efficiency of machine learning methods.

From the perspective of the gradient information in optimization, popular optimization methods can be divided into three categories: first-order optimization methods, which are represented by the widely used stochastic gradient methods; high-order optimization methods, in which Newton's method is a typical example; and heuristic derivative-free optimization methods, in which the coordinate descent method is a representative.

As the representative of first-order optimization methods, the stochastic gradient descent method [1], [2], as well as its variants, has been widely used in recent years and is evolving at a high speed. However, many users pay little attention to the characteristics or application scope of these methods. They often adopt them as black box optimizers, which may limit the functionality of the optimization methods. In this paper, we comprehensively introduce the fundamental optimization methods. Particularly, we systematically explain their advantages and disadvantages, their application scope, and the characteristics of their parameters. We hope that the targeted introduction will help users to choose the first-order optimization methods more conveniently and make parameter adjustment more reasonable in the learning process.

Compared with first-order optimization methods, high-order methods [3], [4], [5] converge at a faster speed, in which the curvature information makes the search direction more effective. High-order optimizations attract widespread attention but face more challenges. The difficulty in high-order methods lies in the operation and storage of the inverse matrix of the Hessian matrix. To solve this problem, many variants based on Newton's method have been developed, most of which try to approximate the Hessian matrix through some techniques [6], [7]. In subsequent studies, the stochastic quasi-Newton method and its variants are introduced to extend high-order methods to large-scale data [8], [9], [10].

Derivative-free optimization methods [11], [12] are mainly used in the case that the derivative of the objective function may not exist or be difficult to calculate. There are two main ideas in derivative-free optimization methods. One is adopting a heuristic search based on empirical rules, and the other is fitting the objective function with samples. Derivative-free optimization methods can also work in conjunction with gradient-based methods.

Most machine learning problems, once formulated, can be solved as optimization problems. Optimization in the fields of deep neural networks, reinforcement learning, meta learning, variational inference and Markov chain Monte Carlo encounters different difficulties and challenges. The optimization methods developed in the specific machine learning fields are different, which can be inspiring to the development of general optimization methods.

Deep neural networks (DNNs) have shown great success in pattern recognition and machine learning. There are two very popular NNs, i.e., convolutional neural networks (CNNs) [13] and recurrent neural networks (RNNs), which play important roles in various fields of machine learning. CNNs are feedforward neural networks with convolution calculation. CNNs have been successfully used in many fields such as image processing [14], [15], video processing [16] and natural language processing (NLP) [17], [18]. RNNs are a kind of sequential model and very active in NLP [19], [20], [21], [22]. Besides, RNNs are also popular in the fields of image processing [23], [24] and video processing [25]. In the field of constrained optimization, RNNs can achieve excellent results [26], [27], [28], [29]. In these works, the parameters of weights in RNNs can be learned by analytical methods, and these methods can find the optimal solution according to the trajectory of the state solution. Stochastic gradient-based algorithms are widely used in deep neural networks [30], [31], [32], [33]. However, various problems are emerging when employing stochastic gradient-based algorithms. For example, the learning rate will be oscillating in the later training stage of some adaptive methods [34], [35], which may lead to the problem of non-convergence. Thus, further optimization algorithms based on variance reduction were proposed to improve the convergence rate [36], [37]. Moreover, combining the stochastic gradient descent and the characteristics of its variants is a possible direction to improve the optimization. Especially, switching from an adaptive algorithm to the stochastic gradient descent method can improve the accuracy and convergence speed of the algorithm [38].

Reinforcement learning (RL) is a branch of machine learning, for which an agent interacts with the environment by a trial-and-error mechanism and learns an optimal policy by maximizing cumulative rewards [39]. Deep reinforcement learning combines RL and deep learning techniques, and enables the RL agent to have a good perception of its environment. Recent research has shown that deep learning can be applied to learn a useful representation for reinforcement learning problems [40], [41], [42], [43], [44]. Stochastic optimization algorithms are commonly used in RL and deep RL models.

Meta learning [45], [46] has recently become very popular in the field of machine learning. The goal of meta learning is to design a model that can efficiently adapt to a new environment with as few samples as possible. The application of meta learning in supervised learning can solve the few-shot learning problems [47]. In general, meta learning methods can be summarized into the following three types [48]: metric-based methods [49], [50], [51], [52], model-based methods [53], [54] and optimization-based methods [55], [56], [47]. We will describe the details of optimization-based meta learning methods in the subsequent sections.

Variational inference is a useful approximation method which aims to approximate the posterior distributions in Bayesian machine learning. It can be considered as an optimization problem. For example, mean-field variational inference uses coordinate ascent to solve this optimization problem [57]. As the amount of data increases continuously, it is not friendly to use the traditional optimization method to handle the variational inference. Thus, stochastic variational inference was proposed, which introduced natural gradients and extended the variational inference to large-scale data [58].

Optimization methods have a significant influence on various fields of machine learning. For example, [5] proposed the transformer network using Adam optimization [33], which is applied to machine translation tasks. [59] proposed the super-resolution generative adversarial network for image super-resolution, which is also optimized by Adam. [60] proposed Actor-Critic using trust region optimization to solve deep reinforcement learning on Atari games as well as the MuJoCo environments.

The stochastic optimization method can also be applied to Markov chain Monte Carlo (MCMC) sampling to improve efficiency. In this kind of application, stochastic gradient Hamiltonian Monte Carlo (HMC) is a representative method [61] where the stochastic gradient accelerates the step of gradient update when handling large-scale samples. The noise introduced by the stochastic gradient can be characterized by introducing Gaussian noise and friction terms. Additionally, the deviation caused by HMC discretization can be eliminated by the friction term, and thus the Metropolis-Hastings step can be omitted. The hyper-parameter settings in the HMC will affect the performance of the model. There are some efficient ways to automatically adjust the hyper-parameters and improve the performance of the sampler.

The development of optimization brings a lot of contributions to the progress of machine learning. However, there are still many challenges and open problems for optimization problems in machine learning. 1) How to improve optimization performance with insufficient data in deep neural networks is a tricky problem. If there are not enough samples in the training of deep neural networks, it is prone to cause the problem of high variance and overfitting [62]. In addition, non-convex optimization has been one of the difficulties in deep neural networks, which makes the optimization tend to get a locally optimal solution rather than the globally optimal solution. 2) For sequential models, the samples are often truncated by batches when the sequence is too long, which will cause deviation. How to analyze the deviation of stochastic optimization in this case and correct it is vital. 3) The stochastic variational inference is graceful and practical, and it is probably a good choice to develop methods of applying high-order gradient information to stochastic variational inference. 4) It may be a great idea to introduce the stochastic technique to the conjugate gradient method to obtain an elegant and powerful optimization algorithm. The detailed techniques for making improvements to the stochastic conjugate gradient method are an interesting and challenging problem.

The purpose of this paper is to summarize and analyze classical and modern optimization methods from a machine learning perspective. The remainder of this paper is organized as follows. Section II summarizes the machine learning problems from the perspective of optimization. Section III discusses the classical optimization algorithms and their latest developments in machine learning. Particularly, the recent popular optimization methods including the first and second order optimization algorithms are emphatically introduced.

Section IV describes the developments and applications of optimization methods in some specific machine learning fields. Section V presents the challenges and open problems in the optimization methods. Finally, we conclude the whole paper.

II. MACHINE LEARNING FORMULATED AS OPTIMIZATION

Almost all machine learning algorithms can be formulated as an optimization problem to find the extremum of an objective function. Building models and constructing reasonable objective functions are the first step in machine learning methods. With the determined objective function, appropriate numerical or analytical optimization methods are usually used to solve the optimization problem.

According to the modeling purpose and the problem to be solved, machine learning algorithms can be divided into supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. Particularly, supervised learning is further divided into the classification problem (e.g., sentence classification [17], [63], image classification [64], [65], [66], etc.) and the regression problem; unsupervised learning is divided into clustering and dimension reduction [67], [68], [69], among others.

A. Optimization Problems in Supervised Learning

For supervised learning, the goal is to find an optimal mapping function f(x) to minimize the loss function of the training samples,

  min_θ (1/N) Σ_{i=1}^N L(y^i, f(x^i, θ)),   (1)

where N is the number of training samples, θ is the parameter of the mapping function, x^i is the feature vector of the ith sample, y^i is the corresponding label, and L is the loss function.

There are many kinds of loss functions in supervised learning, such as the square of Euclidean distance, cross-entropy, contrast loss, hinge loss, information gain and so on. For regression problems, the simplest way is using the square of Euclidean distance as the loss function, that is, minimizing square errors on training samples. But the generalization performance of this kind of empirical loss is not necessarily good. Another typical form is structured risk minimization, whose representative method is the support vector machine. On the objective function, regularization items are usually added to alleviate overfitting, e.g., in terms of the L2 norm,

  min_θ (1/N) Σ_{i=1}^N L(y^i, f(x^i, θ)) + λ ||θ||_2²,   (2)

where λ is the compromise parameter, which can be determined through cross-validation.

B. Optimization Problems in Semi-supervised Learning

Semi-supervised learning (SSL) is the method between supervised and unsupervised learning, which incorporates labeled data and unlabeled data during the training process. It can deal with different tasks including classification tasks [70], [71], regression tasks [72], clustering tasks [73], [74] and dimensionality reduction tasks [75], [76]. There are different kinds of semi-supervised learning methods including self-training, generative models, semi-supervised support vector machines (S3VM) [77], graph-based methods, multi-learning methods and others. We take S3VM as an example to introduce the optimization in semi-supervised learning.

S3VM is a learning model that can deal with binary classification problems, and only part of the training set in this problem is labeled. Let D_l be the labeled data, which can be represented as D_l = {{x^1, y^1}, {x^2, y^2}, ..., {x^l, y^l}}, and D_u be the unlabeled data, which can be represented as D_u = {x^{l+1}, x^{l+2}, ..., x^N} with N = l + u. In order to use the information of the unlabeled data, an additional constraint on the unlabeled data is added to the original objective of SVM with slack variables ζ^i. Specifically, define ε^j as the misclassification error of an unlabeled instance if its true label is positive and z^j as the misclassification error of an unlabeled instance if its true label is negative. The constraint means to make Σ_{j=l+1}^N min(ε^j, z^j) as small as possible. Thus, an S3VM problem can be described as

  min ||w|| + C [ Σ_{i=1}^l ζ^i + Σ_{j=l+1}^N min(ε^j, z^j) ],

subject to

  y^i (w · x^i + b) + ζ^i ≥ 1, ζ^i ≥ 0, i = 1, ..., l,
  w · x^j + b + ε^j ≥ 1, ε^j ≥ 0, j = l + 1, ..., N,
  −(w · x^j + b) + z^j ≥ 1, z^j ≥ 0,   (3)

where C is a penalty coefficient. The optimization problem in S3VM is a mixed-integer problem which is difficult to deal with [78]. There are various methods summarized in [79] to deal with this problem, such as the branch and bound techniques [80] and convex relaxation methods [81].

C. Optimization Problems in Unsupervised Learning

Clustering algorithms [67], [82], [83], [84] divide a group of samples into multiple clusters, ensuring that the differences between the samples in the same cluster are as small as possible, and samples in different clusters are as different as possible. The optimization problem for the k-means clustering algorithm is formulated as minimizing the following loss function:

  min_S Σ_{k=1}^K Σ_{x∈S_k} ||x − μ_k||_2²,   (4)

where K is the number of clusters, x is the feature vector of samples, μ_k is the center of cluster k, and S_k is the sample set of cluster k. The implication of this objective function is to make the sum of variances of all clusters as small as possible.
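As a concrete illustration of objective (4), the sketch below is not from the paper; the toy data, cluster count, and function names (kmeans_loss, lloyd_step) are made up for illustration. It evaluates the k-means loss and performs one assignment-and-update step of the standard Lloyd heuristic that is commonly used to minimize it.

```python
import numpy as np

def kmeans_loss(X, centers, labels):
    # Objective of Eq. (4): sum of squared distances of samples to their cluster centers.
    return np.sum((X - centers[labels]) ** 2)

def lloyd_step(X, centers):
    # One Lloyd iteration: assign each sample to its nearest center, then
    # move each center to the mean of the samples assigned to it.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
    labels = dists.argmin(axis=1)
    new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                            for k in range(centers.shape[0])])
    return new_centers, labels

# Toy usage with made-up data and K = 3 clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centers = X[rng.choice(len(X), size=3, replace=False)]
for _ in range(10):
    centers, labels = lloyd_step(X, centers)
print(kmeans_loss(X, centers, labels))
```

Each Lloyd step can only decrease the objective in (4), but, as with most clustering heuristics, it may stop at a local minimum that depends on the initial centers.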

The dimensionality reduction algorithm ensures that the original information from data is retained as much as possible after projecting them into the low-dimensional space. Principal component analysis (PCA) [85], [86], [87] is a typical algorithm of dimensionality reduction methods. The objective of PCA is formulated to minimize the reconstruction error as

  min Σ_{i=1}^N ||x^i − x̃^i||_2²,  where x̃^i = Σ_{j=1}^{D'} z_j^i e_j, D ≫ D',   (5)

where N represents the number of samples, x^i is a D-dimensional vector, and x̃^i is the reconstruction of x^i. z^i = {z_1^i, ..., z_{D'}^i} is the projection of x^i in the D'-dimensional coordinates, and e_j is the standard orthogonal basis under the D'-dimensional coordinates.

Another common optimization goal in probabilistic models is to find an optimal probability density function p(x), which maximizes the logarithmic likelihood function (MLE) of the training samples,

  max_θ Σ_{i=1}^N ln p(x^i; θ).   (6)

In the framework of Bayesian methods, some prior distributions are often assumed on the parameter θ, which also has the effect of alleviating overfitting.

D. Optimization Problems in Reinforcement Learning

Reinforcement learning [42], [88], [89], unlike supervised learning and unsupervised learning, aims to find an optimal strategy function, whose output varies with the environment. For a deterministic strategy, the mapping function from state s to action a is the learning target. For an uncertain strategy, the probability of executing each action is the learning target. In each state, the action is determined by a = π(s), where π(s) is the policy function.

The optimization problem in reinforcement learning can be formulated as maximizing the cumulative return after executing a series of actions which are determined by the policy function,

  max_π V_π(s),  where V_π(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k} | S_t = s ],   (7)

where V_π(s) is the value function of state s under policy π, r is the reward, and γ ∈ [0, 1] is the discount factor.

E. Optimization for Machine Learning

Overall, the main steps of machine learning are to build a model hypothesis, define the objective function, and solve the maximum or minimum of the objective function to determine the parameters of the model. In these three vital steps, the first two steps are the modeling problems of machine learning, and the third step is to solve the desired model by optimization methods.

III. FUNDAMENTAL OPTIMIZATION METHODS AND PROGRESSES

From the perspective of gradient information, fundamental optimization methods can be divided into first-order optimization methods, high-order optimization methods and derivative-free optimization methods. These methods have a long history and are constantly evolving. They are progressing in many practical applications and have achieved good performance. Besides these fundamental methods, preconditioning is a useful technique for optimization methods. Applying reasonable preconditioning can reduce the number of iterations and obtain better spectral characteristics. These technologies have been widely used in practice. For the convenience of researchers, we summarize the existing common optimization toolkits in a table at the end of this section.

A. First-Order Methods

In the field of machine learning, the most commonly used first-order optimization methods are mainly based on gradient descent. In this section, we introduce some of the representative algorithms along with the development of the gradient descent methods. At the same time, the classical alternating direction method of multipliers and the Frank-Wolfe method in numerical optimization are also introduced.

1) Gradient Descent: The gradient descent method is the earliest and most common optimization method. The idea of the gradient descent method is that variables update iteratively in the (opposite) direction of the gradients of the objective function. The update is performed to gradually converge to the optimal value of the objective function. The learning rate η determines the step size in each iteration, and thus influences the number of iterations to reach the optimal value [90].

The steepest descent algorithm is a widely known algorithm. The idea is to select an appropriate search direction in each iteration so that the value of the objective function decreases the fastest. Gradient descent and steepest descent are not the same, because the direction of the negative gradient does not always descend fastest. Gradient descent is an example of using the Euclidean norm in steepest descent [91].

Next, we give the formal expression of the gradient descent method. For a linear regression model, we assume that f_θ(x) is the function to be learned, L(θ) is the loss function, and θ is the parameter to be optimized. The goal is to minimize the loss function with

  L(θ) = (1/2N) Σ_{i=1}^N (y^i − f_θ(x^i))²,   (8)

  f_θ(x) = Σ_{j=1}^D θ_j x_j,   (9)

where N is the number of training samples, D is the number of input features, x^i is an independent variable with x^i = (x_1^i, ..., x_D^i) for i = 1, ..., N, and y^i is the target output. The gradient descent alternates the following two steps until it converges:

1) Derive L(θ) with respect to θ_j to get the gradient corresponding to each θ_j:

  ∂L(θ)/∂θ_j = −(1/N) Σ_{i=1}^N (y^i − f_θ(x^i)) x_j^i.   (10)

2) Update each θ_j in the negative gradient direction to minimize the risk function:

  θ'_j = θ_j + η · (1/N) Σ_{i=1}^N (y^i − f_θ(x^i)) x_j^i.   (11)
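As an illustration, the following minimal NumPy sketch (not code from the paper; the learning rate, iteration count, and toy data are arbitrary choices) applies the full-batch update of Eqs. (10)-(11) to the linear model of Eq. (9).

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, iters=1000):
    # X: (N, D) design matrix, y: (N,) targets; implements the two steps in Eqs. (10)-(11).
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(iters):
        residual = y - X @ theta          # y^i - f_theta(x^i) for all samples
        grad = -(X.T @ residual) / N      # Eq. (10), for every theta_j at once
        theta = theta - eta * grad        # Eq. (11): move against the gradient
    return theta

# Toy usage: recover a known linear model from synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)
print(batch_gradient_descent(X, y))
```

Note that every iteration touches all N samples, which is exactly the O(ND) per-iteration cost discussed next.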

The gradient descent method is simple to implement. The solution is globally optimal when the objective function is convex. It often converges at a slower speed if the variable is closer to the optimal solution, and more careful iterations need to be performed.

In the above linear regression example, note that all the training data are used in each iteration step, so the gradient descent method is also called the batch gradient descent. If the number of samples is N and the dimension of x is D, the computation complexity for each iteration will be O(ND). In order to mitigate the cost of computation, some parallelization methods were proposed [92], [93]. However, the cost is still hard to accept when dealing with large-scale data. Thus, the stochastic gradient descent method emerges.

2) Stochastic Gradient Descent: Since the batch gradient descent has high computational complexity in each iteration for large-scale data and does not allow online updates, stochastic gradient descent (SGD) was proposed [1]. The idea of stochastic gradient descent is using one sample randomly to update the gradient per iteration, instead of directly calculating the exact value of the gradient. The stochastic gradient is an unbiased estimate of the real gradient [1]. The cost of the stochastic gradient descent algorithm is independent of sample numbers and can achieve sublinear convergence speed [37]. SGD reduces the update time for dealing with large numbers of samples and removes a certain amount of computational redundancy, which significantly accelerates the calculation. In the strongly convex problem, SGD can achieve the optimal convergence speed [94], [95], [96], [36]. Meanwhile, it overcomes the disadvantage of batch gradient descent that it cannot be used for online learning.

The loss function (8) can be written as the following equation:

  L(θ) = (1/N) Σ_{i=1}^N (1/2)(y^i − f_θ(x^i))² = (1/N) Σ_{i=1}^N cost(θ, (x^i, y^i)).   (12)

If a random sample i is selected in SGD, the loss function will be L*(θ):

  L*(θ) = cost(θ, (x^i, y^i)) = (1/2)(y^i − f_θ(x^i))².   (13)

The gradient update in SGD uses the random sample i rather than all samples in each iteration,

  θ' = θ + η (y^i − f_θ(x^i)) x^i.   (14)

Since SGD uses only one sample per iteration, the computation complexity for each iteration is O(D), where D is the number of features. The update rate for each iteration of SGD is much faster than that of batch gradient descent when the number of samples N is large. SGD increases the overall optimization efficiency at the expense of more iterations, but the increased iteration number is insignificant compared with the high computation complexity caused by large numbers of samples. It is possible to use only thousands of samples overall to get the optimal solution even when the sample size is hundreds of thousands. Therefore, compared with batch methods, SGD can effectively reduce the computational complexity and accelerate convergence.

However, one problem in SGD is that the gradient direction oscillates because of additional noise introduced by random selection, and the search process is blind in the solution space. Unlike batch gradient descent, which always moves towards the optimal value along the negative direction of the gradient, the variance of gradients in SGD is large and the movement direction in SGD is biased. So, a compromise between the two methods, the mini-batch gradient descent method (MSGD), was proposed [1].

The MSGD uses b independent identically distributed samples (b is generally in 50 to 256 [90]) as the sample sets to update the parameters in each iteration. It reduces the variance of the gradients and makes the convergence more stable, which helps to improve the optimization speed. For brevity, we will call MSGD as SGD in the following sections.

As a common feature of stochastic optimization, SGD has a better chance of finding the globally optimal solution for complex problems. The deterministic gradient in batch gradient descent may cause the objective function to fall into a local minimum for the multimodal problem. The fluctuation in SGD helps the objective function jump to another possible minimum. However, the fluctuation in SGD always exists, which may more or less slow down the process of converging.

There are still many details to be noted about the use of SGD in the concrete optimization process [90], such as the choice of a proper learning rate. A too small learning rate will result in a slower convergence rate, while a too large learning rate will hinder convergence, making the loss function fluctuate around the minimum. One way to solve this problem is to set up a predefined list of learning rates or a certain threshold and adjust the learning rate during the learning process [97], [98]. However, these lists or thresholds need to be defined in advance according to the characteristics of the dataset. It is also inappropriate to use the same learning rate for all parameters. If data are sparse and features occur at different frequencies, it is not expected to update the corresponding variables with the same learning rate. A higher learning rate is often expected for less frequently occurring features [30], [33].

Besides the learning rate, how to avoid the objective function being trapped in infinite numbers of local minima is a common challenge. Some work has proved that this difficulty does not come from the local minimum values, but comes from the "saddle point" [99]. The slope of a saddle point is positive in one direction and negative in another direction, and gradient values in all directions are zero. It is an important problem for SGD to escape from these points. Some research about escaping from saddle points was developed [100], [101].
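To make the difference from the batch update concrete, the following sketch (illustrative only; the batch size, learning rate, epochs, and toy data are arbitrary) implements mini-batch SGD for the same linear regression loss. Setting the batch size to 1 gives exactly the single-sample update of Eq. (14).

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=32, epochs=20, seed=0):
    # Mini-batch SGD (MSGD) for the linear regression loss of Eq. (8); each inner
    # step uses a stochastic estimate of the gradient in Eq. (10).
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(N), max(N // batch_size, 1)):
            Xb, yb = X[idx], y[idx]
            grad = -(Xb.T @ (yb - Xb @ theta)) / len(idx)
            theta = theta - eta * grad
    return theta

# Toy usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=1000)
print(minibatch_sgd(X, y))
```

Each epoch shuffles the data and visits every sample once, so the per-step cost depends only on the batch size, not on N.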

3) Nesterov Accelerated Gradient Descent: Although SGD is popular and widely used, its learning process is sometimes prolonged. How to adjust the learning rate, how to speed up the convergence, and how to prevent being trapped at a local minimum during the search are worthwhile research directions. Much work has been presented to improve SGD. For example, the momentum idea was proposed to be applied in SGD [102]. The concept of momentum is derived from the mechanics of physics, which simulates the inertia of objects. The idea of applying momentum in SGD is to preserve the influence of the previous update direction on the next iteration to a certain degree. The momentum method can speed up the convergence when dealing with high curvature, small but consistent gradients, or noisy gradients [103]. The momentum algorithm introduces the variable v as the speed, which represents the direction and the rate of the parameter's movement in the parameter space. The speed is set as the average exponential decay of the negative gradient.

In the gradient descent method, the speed update is v = η · (−∂L(θ)/∂θ) each time. Using the momentum algorithm, the amount of the update v is not just the amount of gradient descent calculated by η · (−∂L(θ)/∂θ). It also takes into account the friction factor, which is represented as the previous update v_old multiplied by a momentum factor ranging between [0, 1]. Generally, the mass of the object is set to 1. The formulation is expressed as

  v = η · (−∂L(θ)/∂θ) + v_old · mtm,   (15)

where mtm is the momentum factor. If the current gradient is parallel to the previous speed v_old, the previous speed can speed up this search. The proper momentum plays a role in accelerating the convergence when the learning rate is small. If the derivative decays to 0, it will continue to update v to reach equilibrium and will be attenuated by friction. It is beneficial for escaping from the local minimum in the training process so that the search process can converge more quickly [102], [104]. If the current gradient is opposite to the previous update v_old, the value v_old will have a deceleration effect on this search.

The momentum method with a proper momentum factor plays a positive role in reducing the oscillation of convergence when the learning rate is large. How to select the proper size of the momentum factor is also a problem. If the momentum factor is small, it is hard to obtain the effect of improving the convergence speed. If the momentum factor is large, the current point may jump out of the optimal value point. Many experiments have empirically verified that the most appropriate setting for the momentum factor is 0.9 [90].

Nesterov Accelerated Gradient Descent (NAG) makes a further improvement over the traditional momentum method [104], [105]. In Nesterov momentum, the momentum v_old · mtm is added to θ, denoted as θ̃. The gradient of θ̃ is used when updating. The detailed update formulae for the parameters θ are as follows:

  θ̃ = θ + v_old · mtm,
  v = v_old · mtm + η · (−∂L(θ̃)/∂θ),   (16)
  θ' = θ + v.

The improvement of Nesterov momentum over momentum is reflected in updating the gradient of the future position instead of the current position. From the update formula, we can find that Nesterov momentum includes more gradient information compared with the traditional momentum method. Note that Nesterov momentum improves the convergence from O(1/k) (after k steps) to O(1/k²), when not using stochastic optimization [105].
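A compact sketch of both updates is given below (illustrative, not the authors' code; the quadratic test function and hyper-parameters are made up). The nesterov flag switches between the classical momentum update of Eq. (15) and the look-ahead gradient of Eq. (16), using the paper's sign convention in which v is added to the parameters.

```python
import numpy as np

def sgd_momentum(grad, theta0, eta=0.01, mtm=0.9, nesterov=False, iters=500):
    # grad(theta) returns dL/dtheta. With nesterov=False this is Eq. (15);
    # with nesterov=True the gradient is evaluated at the look-ahead point of Eq. (16).
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(iters):
        lookahead = theta + mtm * v if nesterov else theta
        v = mtm * v - eta * grad(lookahead)   # v = v_old * mtm + eta * (-grad)
        theta = theta + v
    return theta

# Toy usage: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
print(sgd_momentum(lambda th: th, theta0=np.ones(3), nesterov=True))
```

The only structural difference between the two variants is where the gradient is evaluated, which is why NAG is often described as "looking ahead" before stepping.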

Another issue worth considering is how to determine the size of the learning rate. Oscillation is more likely to occur as the search gets closer to the optimal point. Thus, the learning rate should be adjusted. The learning rate decay factor d is commonly used in the SGD's momentum method, which makes the learning rate decrease with the iteration period [106]. The formula of the learning rate decay is defined as

  η_t = η_0 / (1 + d · t),   (17)

where η_t is the learning rate at the tth iteration, η_0 is the original learning rate, and d is a decimal in [0, 1]. As can be seen from the formula, the smaller the d is, the slower the decay of the learning rate will be. The learning rate remains unchanged when d = 0 and the learning rate decays fastest when d = 1.

4) Adaptive Learning Rate Method: The manually regulated learning rate greatly influences the effect of the SGD method. It is a tricky problem to set an appropriate value of the learning rate [30], [33], [107]. Some adaptive methods were proposed to adjust the learning rate automatically. These methods are free of parameter adjustment, fast to converge, and often achieve fairly good results. They are widely used in deep neural networks to deal with optimization problems.

The most straightforward improvement to SGD is AdaGrad [30]. AdaGrad adjusts the learning rate dynamically based on the historical gradients in some previous iterations. The update formulae are as follows:

  g_t = ∂L(θ_t)/∂θ_t,
  V_t = √( Σ_{i=1}^t (g_i)² + ε ),   (18)
  θ_{t+1} = θ_t − η · g_t / V_t,

where g_t is the gradient of parameter θ at iteration t, V_t is the accumulated historical gradient of parameter θ at iteration t, and θ_t is the value of parameter θ at iteration t.

The difference between AdaGrad and gradient descent is that during the parameter update process, the learning rate is no longer fixed, but is computed using all the historical gradients accumulated up to this iteration. One main benefit of AdaGrad is that it eliminates the need to tune the learning rate manually. Most implementations use a default value of 0.01 for η in (18).

Although AdaGrad adaptively adjusts the learning rate, it still has two issues. 1) The algorithm still needs to set the global learning rate η manually. 2) As the training time increases, the accumulated gradient will become larger and larger, making the learning rate tend to zero, resulting in ineffective parameter updates.

AdaGrad was further improved to AdaDelta [31] and RMSProp [32] for solving the problem that the learning rate will eventually go to zero. The idea is to consider not accumulating all historical gradients, but focusing only on the gradients in a window over a period, and using the exponential moving average to calculate the second-order cumulative momentum,

  V_t = √( β V_{t−1} + (1 − β)(g_t)² ),   (19)

where β is the exponential decay parameter. Both RMSProp and AdaDelta were developed independently around the same time, stemming from the need to resolve the radically diminishing learning rates of AdaGrad.

Adaptive moment estimation (Adam) [33] is another advanced SGD method, which introduces an adaptive learning rate for each parameter. It combines the adaptive learning rate and momentum methods. In addition to storing an exponentially decaying average of past squared gradients V_t, like AdaDelta and RMSProp, Adam also keeps an exponentially decaying average of past gradients m_t, similar to the momentum method:

  m_t = β_1 m_{t−1} + (1 − β_1) g_t,   (20)

  V_t = √( β_2 V_{t−1} + (1 − β_2)(g_t)² ),   (21)

where β_1 and β_2 are exponential decay rates. The final update formula for the parameter θ is

  θ_{t+1} = θ_t − η · ( √(1 − β_2) / (1 − β_1) ) · m_t / ( V_t + ε ).   (22)

The default values of β_1, β_2, and ε are suggested to be set to 0.9, 0.999, and 10⁻⁸, respectively. Adam works well in practice and compares favorably to other adaptive learning rate algorithms.
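The following sketch (illustrative, not the authors' code; the test function, learning rate, and iteration count are made up) implements Adam in its common bias-corrected form from Kingma and Ba [33]. It keeps the raw exponential moving averages and applies explicit bias corrections, which differs slightly in presentation from Eqs. (20)-(22) but uses the default β_1, β_2, and ε mentioned above.

```python
import numpy as np

def adam(grad, theta0, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, iters=2000):
    # Adam with explicit bias correction; cf. Eqs. (20)-(22).
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, iters + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # Eq. (20): EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g      # EMA of squared gradients, cf. Eq. (21)
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)   # cf. Eq. (22)
    return theta

# Toy usage: minimize f(theta) = ||theta - 3||^2 / 2.
print(adam(lambda th: th - 3.0, theta0=np.zeros(4)))
```

Because the update divides by the root of the squared-gradient average, each coordinate effectively receives its own learning rate, which is the behavior the adaptive methods above are designed to provide.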

5) Variance Reduction Methods: Due to a large amount of redundant information in the training samples, the SGD methods have been very popular since they were proposed. However, the stochastic gradient method can only converge at a sublinear rate and the variance of the gradient is often very large. How to reduce the variance and improve SGD to linear convergence has always been an important problem.

Stochastic Average Gradient. The stochastic average gradient (SAG) method [36] is a variance reduction method proposed to improve the convergence speed. The SAG algorithm maintains a parameter d recording the sum of the N latest gradients {g_i} in memory, where g_i is calculated using one sample i, i ∈ {1, ..., N}. The detailed implementation is to select a sample i_t randomly to update d, and use d to update the parameter θ in iteration t:

  d = d − ĝ_{i_t} + g_{i_t}(θ_{t−1}),
  ĝ_{i_t} = g_{i_t}(θ_{t−1}),   (23)
  θ_t = θ_{t−1} − (α/N) d,

where the updated item d is calculated by replacing the old gradient ĝ_{i_t} in d with the new gradient g_{i_t}(θ_{t−1}) in iteration t, and α is a constant representing the learning rate. Thus, each update only needs to calculate the gradient of one sample, not the gradients of all samples. The computational overhead is no different from SGD, but the memory overhead is much larger. This is a typical way of using space for saving time. The SAG has been shown to be a linear convergence algorithm [36], which is much faster than SGD, and has great advantages over other stochastic gradient algorithms.

However, the SAG method is only applicable to the case where the loss function is smooth and the objective function is convex [36], [108], such as convex linear prediction problems. In this case, the SAG achieves a faster convergence rate than the SGD. In addition, under some specific problems, it can even deliver better convergence than the standard batch gradient descent.

Stochastic Variance Reduction Gradient. Since the SAG method is only applicable to smooth and convex functions and needs to store the gradient of each sample, it is inconvenient to apply it in non-convex neural networks. The stochastic variance reduction gradient (SVRG) [37] method was proposed to improve the performance of optimization in complex models.

The algorithm of SVRG maintains the interval average gradient μ̃ by calculating the gradients of all samples every w iterations instead of in each iteration:

  μ̃ = (1/N) Σ_{i=1}^N g_i(θ̃),   (24)

where θ̃ is the interval update parameter. The interval parameter μ̃ contains the average memory of all sample gradients in the past time for each time interval w. SVRG picks a uniform i_t ∈ {1, ..., N} randomly, and executes gradient updates to the current parameters:

  θ_t = θ_{t−1} − η · ( g_{i_t}(θ_{t−1}) − g_{i_t}(θ̃) + μ̃ ).   (25)

The gradient is calculated up to two times in each update. After w iterations, perform θ̃ ← θ_w and start the next w iterations. Through these updates, θ_t and the interval update parameter θ̃ will converge to the optimal θ*, and then μ̃ → 0, and

  g_{i_t}(θ_{t−1}) − g_{i_t}(θ̃) + μ̃ → g_{i_t}(θ_{t−1}) − g_{i_t}(θ*) → 0.   (26)

SVRG proposes a vital concept called variance reduction. This concept is related to the convergence analysis of SGD, in which it is necessary to assume that there is a constant upper bound for the variance of the gradients. This constant upper bound implies that the SGD cannot achieve linear convergence. However, in SVRG, the upper bound of the variance can be continuously reduced due to the special update item g_{i_t}(θ_{t−1}) − g_{i_t}(θ̃) + μ̃, thus achieving linear convergence [37].

The strategies of SAG and SVRG are related to variance reduction. Compared with SAG, SVRG does not need to maintain all gradients in memory, which means that memory resources are saved, and it can be applied to complex problems efficiently. Experiments have shown that the performance of SVRG is remarkable on a non-convex neural network [37], [109], [110]. There are also many variants of such linear convergence stochastic optimization algorithms, such as the SAGA algorithm [111].
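The sketch below (illustrative only; the least-squares test problem, snapshot interval, and step size are made up) follows the SVRG iteration of Eqs. (24)-(25): every w inner steps, a full "snapshot" gradient μ̃ is recomputed, and each inner update uses the variance-reduced direction g_i(θ) − g_i(θ̃) + μ̃.

```python
import numpy as np

def svrg(grad_i, n_samples, theta0, eta=0.01, w=100, epochs=20, seed=0):
    # grad_i(i, theta) returns the gradient of the loss on sample i at theta.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        theta_snap = theta.copy()
        mu = np.mean([grad_i(i, theta_snap) for i in range(n_samples)], axis=0)  # Eq. (24)
        for _ in range(w):
            i = rng.integers(n_samples)
            direction = grad_i(i, theta) - grad_i(i, theta_snap) + mu            # Eq. (25)
            theta = theta - eta * direction
    return theta

# Toy usage: least squares, where the gradient of sample i is x_i (x_i^T theta - y_i).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ np.arange(1.0, 6.0)
grad_i = lambda i, th: X[i] * (X[i] @ th - y[i])
print(svrg(grad_i, len(X), np.zeros(5)))
```

Unlike SAG, only the snapshot parameters and the single averaged gradient μ̃ are stored, which is the memory saving discussed above.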

6) Alternating Direction Method of Multipliers: The augmented Lagrangian multiplier method is a common method to solve optimization problems with linear constraints. Compared with the naive Lagrangian multiplier method, it makes problems easier to solve by adding a penalty term to the objective. Consider the following example,

  min { θ_1(x) + θ_2(y) | Ax + By = b, x ∈ X, y ∈ Y }.   (27)

The augmented Lagrange function for problem (27) is

  L_β(x, y, λ) = θ_1(x) + θ_2(y) − λ⊤(Ax + By − b) + (β/2)||Ax + By − b||².   (28)

When solved by the augmented Lagrangian multiplier method, its tth step iteration starts from the given λ_t, and the optimization turns out to be

  (x_{t+1}, y_{t+1}) = arg min { L_β(x, y, λ_t) | x ∈ X, y ∈ Y },
  λ_{t+1} = λ_t − β(Ax_{t+1} + By_{t+1} − b).   (29)

Separating the (x, y) sub-problem in (29), the augmented Lagrange multiplier method can be relaxed to the following alternating direction method of multipliers (ADMM) [112], [113]. Its tth step iteration starts with the given (y_t, λ_t), and the details of the iterative optimization are as follows:

  x_{t+1} = arg min { θ_1(x) − (λ_t)⊤ Ax + (β/2)||Con_x||² | x ∈ X },
  y_{t+1} = arg min { θ_2(y) − (λ_t)⊤ By + (β/2)||Con_y||² | y ∈ Y },   (30)
  λ_{t+1} = λ_t − β(Ax_{t+1} + By_{t+1} − b),

where Con_x = Ax + By_t − b and Con_y = Ax_{t+1} + By − b.

The penalty parameter β has a certain impact on the convergence rate of the ADMM. The larger β is, the greater the penalties for the constraint term. In general, a monotonically increasing sequence of {β_t} can be adopted instead of the fixed β [114]. Specifically, an auto-adjustment criterion that automatically adjusts {β_t} based on the current value of {x_t} during the iteration was proposed, and applied for solving some convex optimization problems [115], [116].

The ADMM method uses the separable operators in the convex optimization problem to divide a large problem into multiple small problems that can be solved in a distributed manner. In theory, the framework of ADMM can solve most of the large-scale optimization problems. However, there are still some problems in practical applications. For example, if we use a stopping criterion to determine whether convergence occurs, the original residuals and dual residuals are both related to β, and β with a large value will lead to difficulty in meeting the convergence conditions [117].

7) Frank-Wolfe Method: In 1956, Frank and Wolfe proposed an algorithm for solving linear constraint problems [118]. The basic idea is to approximate the objective function with a linear function, then solve the linear programming to find the feasible descending direction, and finally make a one-dimensional search along the direction in the feasible domain. This method is also called the approximate linearization method.

Here, we give a simple example of the Frank-Wolfe method. Consider the optimization problem,

  min f(x),
  s.t. Ax = b, x ≥ 0,   (31)

where A is an m × n full row rank matrix, and the feasible region is S = {x | Ax = b, x ≥ 0}. Expand f(x) linearly at x_0, f(x) ≈ f(x_0) + ∇f(x_0)⊤(x − x_0), and substitute it into equation (31). Then we have

  min f(x_t) + ∇f(x_t)⊤(x − x_t),
  s.t. x ∈ S,   (32)

which is equivalent to

  min ∇f(x_t)⊤ x,
  s.t. x ∈ S.   (33)

Suppose there exists an optimal solution y_t, and then there must be

  ∇f(x_t)⊤ y_t < ∇f(x_t)⊤ x_t,
  ∇f(x_t)⊤(y_t − x_t) < 0.   (34)

So y_t − x_t is the decreasing direction of f(x) at x_t. A step size λ_t then updates the search point along this feasible direction. The detailed operation is shown in Algorithm 1.

Algorithm 1 Frank-Wolfe Method [118], [119]
Input: x_0, ε ≥ 0, t := 0
Output: x*
  y_t ← arg min_{x∈S} ∇f(x_t)⊤ x
  while |∇f(x_t)⊤(y_t − x_t)| > ε do
    λ_t = arg min_{0≤λ≤1} f(x_t + λ(y_t − x_t))
    x_{t+1} = x_t + λ_t(y_t − x_t)
    t := t + 1
    y_t ← arg min_{x∈S} ∇f(x_t)⊤ x
  end while
  x* ← x_t

The algorithm satisfies the following convergence theorem [118]:
(1) x_t is the Kuhn-Tucker point of (31) when ∇f(x_t)⊤(y_t − x_t) = 0.
(2) Since y_t is an optimal solution for problem (33), the vector d_t satisfies d_t = y_t − x_t and is the feasible descending direction of f at point x_t when ∇f(x_t)⊤(y_t − x_t) ≠ 0.

The Frank-Wolfe algorithm is a first-order iterative method for solving convex optimization problems with constrained conditions. It consists of determining the feasible descent direction and calculating the search step size. The algorithm is characterized by fast convergence in early iterations and slower convergence in later phases. When the iterative point is close to the optimal solution, the search direction and the gradient direction of the objective function tend to be orthogonal. Such a direction is not the best downward direction, so the Frank-Wolfe algorithm can be improved and extended in terms of the selection of the descending directions [120], [121], [122].
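As a small illustration (not from the paper; the feasible set, test objective, and step-size rule are chosen for simplicity), the sketch below runs Frank-Wolfe over the probability simplex, a special case of S = {x | Ax = b, x ≥ 0} from Eq. (31). The linear subproblem (33) over the simplex has a closed-form solution, the vertex with the smallest gradient coordinate, and the classical 2/(t+2) step size replaces the exact line search of Algorithm 1.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, iters=200):
    # Frank-Wolfe over {x | sum(x) = 1, x >= 0}.
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        g = grad(x)
        y = np.zeros_like(x)
        y[np.argmin(g)] = 1.0            # arg min over the simplex of g^T x (Eq. (33))
        lam = 2.0 / (t + 2.0)            # diminishing step size instead of line search
        x = x + lam * (y - x)            # move along the feasible direction y - x
    return x

# Toy usage: project p onto the simplex by minimizing f(x) = 0.5 * ||x - p||^2.
p = np.array([0.9, 0.4, -0.2, 0.1])
x = frank_wolfe_simplex(lambda x: x - p, x0=np.full(4, 0.25))
print(x, x.sum())
```

Because every iterate is a convex combination of feasible points, the method never leaves the feasible region, which is one of its main practical attractions.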

8) Summary: We summarize the mentioned first-order optimization methods in terms of properties, advantages, and disadvantages in Table I.

TABLE I: Summary of First-Order Optimization Methods

GD
  Properties: Solves for the optimal value along the direction of the gradient descent. The method converges at a linear rate.
  Advantages: The solution is globally optimal when the objective function is convex.
  Disadvantages: In each parameter update, gradients over the total samples need to be calculated, so the calculation cost is high.

SGD [1]
  Properties: The update parameters are calculated using a randomly sampled mini-batch. The method converges at a sublinear rate.
  Advantages: The calculation time for each update does not depend on the total number of training samples, and a lot of calculation cost is saved.
  Disadvantages: It is difficult to choose an appropriate learning rate, and using the same learning rate for all parameters is not appropriate. The solution may be trapped at a saddle point in some cases.

NAG [105]
  Properties: Accelerates the current gradient descent by accumulating the previous gradient as momentum and performs the gradient update process with momentum.
  Advantages: When the gradient direction changes, the momentum can slow the update speed and reduce the oscillation; when the gradient direction remains, the momentum can accelerate the parameter update. Momentum helps to jump out of locally optimal solutions.
  Disadvantages: It is difficult to choose a suitable learning rate.

AdaGrad [30]
  Properties: The learning rate is adaptively adjusted according to the sum of the squares of all historical gradients.
  Advantages: In the early stage of training, the cumulative gradient is smaller, the learning rate is larger, and learning speed is faster. The method is suitable for dealing with sparse gradient problems. The learning rate of each parameter adjusts adaptively.
  Disadvantages: As the training time increases, the accumulated gradient will become larger and larger, making the learning rate tend to zero, resulting in ineffective parameter updates. A manual learning rate is still needed. It is not suitable for dealing with non-convex problems.

AdaDelta/RMSProp [31], [32]
  Properties: Change the way of total gradient accumulation to an exponential moving average.
  Advantages: Improves the ineffective learning problem in the late stage of AdaGrad. It is suitable for optimizing non-stationary and non-convex problems.
  Disadvantages: In the late training stage, the update process may be repeated around the local minimum.

Adam [33]
  Properties: Combines the adaptive methods and the momentum method. Uses the first-order moment estimation and the second-order moment estimation of the gradient to dynamically adjust the learning rate of each parameter. Adds the bias correction.
  Advantages: The gradient descent process is relatively stable. It is suitable for most non-convex optimization problems with large data sets and high dimensional space.
  Disadvantages: The method may not converge in some cases.

SAG [36]
  Properties: The old gradient of each sample and the summation of gradients over all samples are maintained in memory. For each update, one sample is randomly selected and the gradient sum is recalculated and used as the update direction.
  Advantages: The method is a linear convergence algorithm, which is much faster than SGD.
  Disadvantages: The method is only applicable to smooth and convex functions and needs to store the gradient of each sample. It is inconvenient to apply it in non-convex neural networks.

SVRG [37]
  Properties: Instead of saving the gradient of each sample, the average gradient is saved at regular intervals. The gradient sum is updated at each iteration by calculating the gradients with respect to the old parameters and the current parameters for the randomly selected samples.
  Advantages: The method does not need to maintain all gradients in memory, which saves memory resources. It is a linear convergence algorithm.
  Disadvantages: To apply it to larger/deeper neural nets whose training cost is a critical issue, further investigation is still needed.

ADMM [123]
  Properties: The method solves optimization problems with linear constraints by adding a penalty term to the objective and separating variables into sub-problems which can be solved iteratively.
  Advantages: The method uses the separable operators in the convex optimization problem to divide a large problem into multiple small problems that can be solved in a distributed manner. The framework is practical in most large-scale optimization problems.
  Disadvantages: The original residuals and dual residuals are both related to the penalty parameter whose value is difficult to determine.

Frank-Wolfe [118]
  Properties: The method approximates the objective function with a linear function, solves the linear programming to find the feasible descending direction, and makes a one-dimensional search along the direction in the feasible domain.
  Advantages: The method can solve optimization problems with linear constraints, and its convergence speed is fast in early iterations.
  Disadvantages: The method converges slowly in later phases. When the iterative point is close to the optimal solution, the search direction and the gradient of the objective function tend to be orthogonal. Such a direction is not the best downward direction.

B. High-Order Methods

The second-order methods can be used for addressing the problem where an objective function is highly non-linear and ill-conditioned. They work effectively by introducing curvature information.

This section begins with introducing the conjugate gradient method, which is a method that only needs first-order derivative information for well-defined quadratic programming, but overcomes the shortcoming of the steepest descent method, and avoids the disadvantages of Newton's method of storing and calculating the inverse Hessian matrix. But note that when applying it to general optimization problems, the second-order gradient is needed to get an approximation to quadratic programming. Then, the classical quasi-Newton method using second-order information is described. Although the convergence of the algorithm can be guaranteed, the computational process is costly and thus rarely used for solving large machine learning problems. In recent years, with the continuous improvement of high-order optimization methods, more and more high-order methods have been proposed to handle large-scale data by using stochastic techniques [124], [125], [126]. From this perspective, we discuss several high-order methods including the stochastic quasi-Newton method (integrating the second-order information and the stochastic method) and their variants. These algorithms allow us to use high-order methods to process large-scale data.

1) Conjugate Gradient Method: The conjugate gradient (CG) approach is a very interesting optimization method, which is one of the most effective methods for solving large-scale linear systems of equations. It can also be used for solving nonlinear optimization problems [93]. As we know, the first-order methods are simple but have a slow convergence speed, and the second-order methods need a lot of resources. Conjugate gradient optimization is an intermediate algorithm, which can utilize only the first-order information for some problems but ensures a convergence speed like high-order methods.

Early in the 1960s, a conjugate gradient method for solving a linear system was proposed, which is an alternative to Gaussian elimination [127]. Then in 1964, the conjugate gradient method was extended to handle nonlinear optimization for general functions [93]. For years, many different algorithms have been presented based on this method, some of which have been widely used in practice. The main feature of these algorithms is that they have a faster convergence speed than steepest descent. Next, we describe the conjugate gradient method.

Consider a linear system,

  Aθ = b,   (35)

where A is an n × n symmetric, positive-definite matrix. The matrix A and vector b are known, and we need to solve for the value of θ. The problem (35) can also be considered as an optimization problem that minimizes the quadratic positive definite function,

  min_θ F(θ) = (1/2) θ⊤Aθ − b⊤θ + c.   (36)

The above two equations have an identical unique solution. It enables us to regard the conjugate gradient as a method for solving optimization problems.

The gradient of F(θ) can be obtained by simple calculation, and it equals the residual of the linear system [93]: r(θ) = ∇F(θ) = Aθ − b.

Definition 1 (Conjugate): Given an n × n symmetric positive-definite matrix A, two non-zero vectors d_i, d_j are conjugate with respect to A if

  d_i⊤ A d_j = 0.   (37)

A set of non-zero vectors {d_1, d_2, d_3, ..., d_n} is said to be conjugate with respect to A if any two unequal vectors are conjugate with respect to A [93].

Next, we introduce the detailed derivation of the conjugate gradient method. θ_0 is a starting point, and {d_t}_{t=1}^{n−1} is a set of conjugate directions. In general, one can generate the update sequence {θ_1, θ_2, ..., θ_n} by an iteration formula:

  θ_{t+1} = θ_t + η_t d_t.   (38)

The step size η_t can be obtained by a linear search, which means choosing η_t to minimize the objective function f(·) along θ_t + η_t d_t. After some calculations (more details in [93], [128]), the update formula of η_t is

  η_t = r_t⊤ r_t / (d_t⊤ A d_t).   (39)

The search direction d_t is obtained by a linear combination of the negative residual and the previous search direction,

  d_t = −r_t + β_t d_{t−1},   (40)

where r_t can be updated by r_t = r_{t−1} + η_{t−1} A d_{t−1}. The scalar β_t is the update parameter, which can be determined by satisfying the requirement that d_t and d_{t−1} are conjugate with respect to A, i.e., d_t⊤ A d_{t−1} = 0. Multiplying both sides of equation (40) by d_{t−1}⊤ A, one can obtain β_t by

  β_t = d_{t−1}⊤ A r_t / (d_{t−1}⊤ A d_{t−1}).   (41)

After several derivations of the above formula according to [93], the simplified version of β_t is

  β_t = r_t⊤ r_t / (r_{t−1}⊤ r_{t−1}).   (42)

The CG method has a graceful property that it generates a new vector d_t using only the previous vector d_{t−1}, which does not need to know all the previous vectors d_0, d_1, d_2, ..., d_{t−2}. The linear conjugate gradient algorithm is shown in Algorithm 2.
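Complementing the pseudocode of Algorithm 2 below, the following NumPy sketch (illustrative; the random test system is made up) follows the same residual and direction recurrences of Eqs. (38)-(42).

```python
import numpy as np

def conjugate_gradient(A, b, theta0, tol=1e-10, max_iter=None):
    # Linear CG for A theta = b with A symmetric positive definite.
    theta = np.asarray(theta0, dtype=float)
    r = A @ theta - b                      # residual r = grad F(theta)
    d = -r
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        eta = (r @ r) / (d @ A @ d)        # Eq. (39)
        theta = theta + eta * d            # Eq. (38)
        r_new = r + eta * (A @ d)          # residual update
        beta = (r_new @ r_new) / (r @ r)   # Eq. (42)
        d = -r_new + beta * d              # Eq. (40)
        r = r_new
    return theta

# Toy usage on a random symmetric positive-definite system.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.normal(size=6)
print(np.allclose(conjugate_gradient(A, b, np.zeros(6)), np.linalg.solve(A, b)))
```

In exact arithmetic the loop terminates in at most n iterations, since the n conjugate directions span the whole space.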

Algorithm 2 Conjugate Gradient Method [128]
Input: A, b, θ0
Output: The solution θ∗
  r0 = Aθ0 − b
  d0 = −r0, t = 0
  while unsatisfied convergence condition do
    ηt = (rt⊤ rt) / (dt⊤ A dt)
    θt+1 = θt + ηt dt
    rt+1 = rt + ηt A dt
    βt+1 = (rt+1⊤ rt+1) / (rt⊤ rt)
    dt+1 = −rt+1 + βt+1 dt
    t = t + 1
  end while

2) Quasi-Newton Methods: Gradient descent employs only first-order information, and its convergence rate is slow. A natural idea is therefore to use second-order information, e.g., Newton's method [129]. The basic idea of Newton's method is to use both the first-order derivative (gradient) and the second-order derivative (Hessian matrix) to approximate the objective function with a quadratic function, and then solve the minimization of that quadratic function. This process is repeated until the updated variable converges.

The one-dimensional Newton's iteration formula is

  θt+1 = θt − f′(θt) / f′′(θt),   (43)

where f is the objective function. More generally, the high-dimensional Newton's iteration formula is

  θt+1 = θt − ∇²f(θt)⁻¹ ∇f(θt), t ≥ 0,   (44)

where ∇²f is the Hessian matrix of f. More precisely, if the learning rate (step size factor) is introduced, the iteration formula becomes

  dt = −∇²f(θt)⁻¹ ∇f(θt),
  θt+1 = θt + ηt dt,   (45)

where dt is the Newton's direction and ηt is the step size. This method can be called the damped Newton's method [130]. Geometrically speaking, Newton's method fits the local surface of the current position with a quadratic surface, while the gradient descent method fits the current local surface with a plane [131].

Quasi-Newton Method Newton's method is an iterative algorithm that requires the computation of the inverse Hessian matrix of the objective function at each step, which makes the storage and computation very expensive. To overcome this expensive storage and computation, an approximate algorithm called the quasi-Newton method was considered. The essential idea of the quasi-Newton method is to use a positive definite matrix to approximate the inverse of the Hessian matrix, thus simplifying the complexity of the operation. The quasi-Newton method is one of the most effective methods for solving non-linear optimization problems. Moreover, the second-order gradient is not directly needed in the quasi-Newton method, so it is sometimes more efficient than Newton's method. In the following, we introduce several quasi-Newton methods, in which the Hessian matrix and its inverse matrix are approximated in different ways.

Quasi-Newton Condition We first introduce the quasi-Newton condition. Assuming that the objective function f can be approximated by a quadratic function, we can expand f(θ) in a Taylor series at θ = θt+1, i.e.,

  f(θ) ≈ f(θt+1) + ∇f(θt+1)⊤(θ − θt+1) + (1/2)(θ − θt+1)⊤ ∇²f(θt+1)(θ − θt+1).   (46)

Then we can compute the gradient on both sides of the above equation, and obtain

  ∇f(θ) ≈ ∇f(θt+1) + ∇²f(θt+1)(θ − θt+1).   (47)

Setting θ = θt in (47), we have

  ∇f(θt) ≈ ∇f(θt+1) + ∇²f(θt+1)(θt − θt+1).   (48)

Use B to represent the approximate matrix of the Hessian matrix. Set st = θt+1 − θt and ut = ∇f(θt+1) − ∇f(θt). The matrix Bt+1 satisfies

  ut = Bt+1 st.   (49)

This equation is called the quasi-Newton condition, or secant equation.

The search direction of the quasi-Newton method is

  dt = −Bt⁻¹ gt,   (50)

where gt is the gradient of f, and the quasi-Newton update is

  θt+1 = θt + ηt dt.   (51)

The step size ηt is chosen to satisfy the Wolfe conditions, a set of inequalities for the inexact line search minηt f(θt + ηt dt) [132]. Unlike Newton's method, the quasi-Newton method uses Bt to approximate the true Hessian matrix. In the following paragraphs, we introduce some particular quasi-Newton methods, in which Ht is used to denote the inverse of Bt, i.e., Ht = Bt⁻¹.

DFP In the 1950s, the physicist William C. Davidon [133] proposed a new approach to solve nonlinear problems. Fletcher and Powell [134] then explained and improved this method, which sparked a lot of research in the late 1960s and early 1970s [6]. DFP is the first quasi-Newton method, named after the initials of these three names. The DFP correction formula is one of the most creative inventions in the field of non-linear optimization, shown as below:

  Bt+1^(DFP) = (I − (ut st⊤)/(ut⊤ st)) Bt (I − (st ut⊤)/(ut⊤ st)) + (ut ut⊤)/(ut⊤ st).   (52)

The update formula of Ht+1 is

  Ht+1^(DFP) = Ht − (Ht ut ut⊤ Ht)/(ut⊤ Ht ut) + (st st⊤)/(ut⊤ st).   (53)
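To illustrate how such an update behaves, here is a minimal NumPy sketch of the DFP inverse-Hessian update (53) applied to a small quadratic objective. The test problem and the exact line search used below are illustrative assumptions, not part of the survey; a general implementation would use a Wolfe line search.

```python
import numpy as np

# Quadratic test problem f(theta) = 0.5 * theta^T A theta - b^T theta (assumed for illustration).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda th: A @ th - b

theta = np.zeros(2)
H = np.eye(2)                                 # initial inverse-Hessian approximation
for _ in range(10):
    g = grad(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    d = -H @ g                                # quasi-Newton search direction, Eq. (50)
    eta = -(g @ d) / (d @ A @ d)              # exact minimizing step for a quadratic
    theta_new = theta + eta * d
    s = theta_new - theta                     # s_t
    u = grad(theta_new) - g                   # u_t
    # DFP update of the inverse Hessian approximation, Eq. (53).
    H = H - np.outer(H @ u, H @ u) / (u @ H @ u) + np.outer(s, s) / (u @ s)
    theta = theta_new

print("solution:", theta, "exact:", np.linalg.solve(A, b))
```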

BFGS Broyden, Fletcher, Goldfarb and Shanno proposed the BFGS method [135], [136], [137], [3], in which Bt+1 is updated according to

  Bt+1^(BFGS) = Bt − (Bt st st⊤ Bt)/(st⊤ Bt st) + (ut ut⊤)/(ut⊤ st).   (54)

The corresponding update of Ht+1 is

  Ht+1^(BFGS) = (I − (st ut⊤)/(st⊤ ut)) Ht (I − (ut st⊤)/(st⊤ ut)) + (st st⊤)/(st⊤ ut).   (55)

The quasi-Newton algorithm still cannot solve large-scale data optimization problems, because the method generates a sequence of matrices to approximate the Hessian matrix. Storing these matrices consumes computer resources, especially for high-dimensional problems. It is also impossible to retain these matrices in the high-speed storage of computers, restricting its use to even small and midsize problems [138].

L-BFGS The limited-memory quasi-Newton method, named L-BFGS [138], [139], is an improvement based on the quasi-Newton method, which is feasible in dealing with the high-dimensional situation. The method stores just a few n-dimensional vectors, instead of retaining and computing fully dense n × n approximations of the Hessian [140]. The basic idea of L-BFGS is to store the vector sequence used in the calculation of the approximation Ht+1, instead of storing the complete matrix Ht. L-BFGS makes a further consolidation of the update formula of Ht+1,

  Ht+1 = (I − (st ut⊤)/(ut⊤ st)) Ht (I − (ut st⊤)/(ut⊤ st)) + (st st⊤)/(ut⊤ st)
       = Vt⊤ Ht Vt + ρt st st⊤,   (56)

where

  Vt = I − ρt ut st⊤,  ρt = 1/(st⊤ ut).   (57)

The above equation means that the inverse Hessian approximation Ht+1 can be obtained from the sequence of pairs {sl, ul}, l = t − p + 1, …, t. In other words, instead of storing and calculating the complete matrix Ht+1, L-BFGS only keeps the latest p pairs {sl, ul}. According to the equation, a recursive procedure can be reached. When the latest p steps are retained, the calculation of Ht+1 can be expressed as [139]

  Ht+1 = (Vt⊤ Vt−1⊤ ··· Vt−p+1⊤) Ht⁰ (Vt−p+1 Vt−p+2 ··· Vt)
       + ρt−p+1 (Vt⊤ Vt−1⊤ ··· Vt−p+2⊤) st−p+1 st−p+1⊤ (Vt−p+2 ··· Vt)
       + ρt−p+2 (Vt⊤ Vt−1⊤ ··· Vt−p+3⊤) st−p+2 st−p+2⊤ (Vt−p+3 ··· Vt)
       + ···
       + ρt st st⊤.   (58)

The update direction dt = Ht gt can then be calculated, where gt is the gradient of the objective function f. The detailed algorithm is shown in Algorithms 3 and 4.

Algorithm 3 Two-Loop Recursion for Ht gt [93]
Input: ∇ft, ut, st
Output: Ht+1 gt+1
  gt = ∇ft
  Ht⁰ = ((st⊤ ut)/‖ut‖²) I
  for l = t − 1 to t − p do
    ηl = ρl sl⊤ gl+1
    gl = gl+1 − ηl ul
  end for
  rt−p−1 = Ht⁰ gt−p
  for l = t − p to t − 1 do
    βl = ρl ul⊤ rl−1
    rl = rl−1 + sl (ηl − βl)
  end for
  Ht+1 gt+1 = rt−1

Algorithm 4 Limited-BFGS [139]
Input: θ0 ∈ Rⁿ, ǫ > 0
Output: the solution θ∗
  t = 0, g0 = ∇f0, u0 = 1, s0 = 1
  while ‖gt‖ > ǫ do
    Choose Ht⁰, for example Ht⁰ = ((st⊤ ut)/‖ut‖²) I
    gt = ∇ft
    dt = −Ht gt from the two-loop recursion for Ht gt (Algorithm 3)
    Search a step size ηt through the Wolfe rule
    θt+1 = θt + ηt dt
    if t > p then
      Discard the vector pair {st−p, ut−p} from storage
    end if
    Compute and save st = θt+1 − θt, ut = gt+1 − gt
    t = t + 1
  end while

For more information about the BFGS and L-BFGS algorithms, one can refer to [93], [138]. Recently, the batch L-BFGS for machine learning was proposed [141], which uses overlapping mini-batches for consecutive samples in the quasi-Newton update. It means that the calculation of ut becomes ut = ∇St+1 f(θt+1) − ∇St f(θt), where St is a small subset of samples; meanwhile St+1 and St are not independent, perhaps containing a relatively large overlap. Some numerical results in [141] have shown that this modification of L-BFGS is effective in practice.
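For readers who want to see the limited-memory idea in code, below is a minimal NumPy sketch of the standard two-loop recursion for computing Ht gt from the stored pairs {(sl, ul)}. Variable names follow common practice rather than the exact notation of Algorithm 3, and the initial scaling is the one used there.

```python
import numpy as np

def two_loop_recursion(g, s_list, u_list):
    """Approximate H @ g from the latest (s, u) pairs, as in L-BFGS.

    g      : current gradient (1-D array)
    s_list : list of parameter differences s_l = theta_{l+1} - theta_l
    u_list : list of gradient differences  u_l = grad_{l+1} - grad_l
    """
    q = g.copy()
    rho = [1.0 / (u @ s) for s, u in zip(s_list, u_list)]
    alpha = [0.0] * len(s_list)

    # First loop: from the newest stored pair to the oldest.
    for l in reversed(range(len(s_list))):
        alpha[l] = rho[l] * (s_list[l] @ q)
        q -= alpha[l] * u_list[l]

    # Initial scaling H_t^0 = (s^T u / ||u||^2) I, as in Algorithm 3.
    s, u = s_list[-1], u_list[-1]
    r = (s @ u) / (u @ u) * q

    # Second loop: from the oldest stored pair to the newest.
    for l in range(len(s_list)):
        beta = rho[l] * (u_list[l] @ r)
        r += (alpha[l] - beta) * s_list[l]
    return r   # approximates H_t @ g
```

The search direction is then dt = −two_loop_recursion(gt, S, U), and only the latest p pairs are kept in s_list and u_list, so the memory cost grows linearly with the dimension rather than quadratically.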

3) Stochastic Quasi-Newton Method: In many large-scale machine learning models, it is necessary to use a stochastic approximation algorithm in which each update is based on a relatively small training subset [125]. Stochastic algorithms often obtain the best generalization performance in large-scale learning systems [142]. The quasi-Newton method only uses first-order gradient information to approximate the Hessian matrix. It is a natural idea to combine the quasi-Newton method with the stochastic method, so that it can perform well on large-scale problems. Online-BFGS and online-LBFGS are two such variants of BFGS [124].

Consider the minimization of a convex stochastic function,

  min_θ F(θ) = E[f(θ, ξ)],   (59)

where ξ is a random seed. We assume that ξ represents a sample (or a set of samples) consisting of an input-output pair (x, y). In machine learning, x typically represents an input and y is the target output. f usually has the following form:

  f(θ; ξ) = f(θ; xi, yi) = l(h(θ; xi); yi),   (60)

where h is a prediction model parameterized by θ, and l is a loss function. We define fi(θ) = f(θ; xi, yi), and use the empirical loss to define the objective,

  F(θ) = (1/N) Σ_{i=1}^{N} fi(θ).   (61)

Typically, if a large amount of training data is used to train the machine learning models, a better choice is to use the mini-batch stochastic gradient,

  ∇F_{St}(θt) = (1/c) Σ_{i∈St} ∇fi(θt),   (62)

where the subset St ⊂ {1, 2, 3, …, N} is randomly selected, c is the cardinality of St and c ≪ N. Let StH ⊂ {1, 2, 3, …, N} be a randomly chosen subset of the training samples; the stochastic Hessian estimate can be

  ∇²F_{StH}(θt) = (1/ch) Σ_{i∈StH} ∇²fi(θt),   (63)

where ch is the cardinality of StH. With a given stochastic gradient, a direct approach to developing a stochastic quasi-Newton method is to transform deterministic gradients into stochastic gradients throughout the iterations, as in online-BFGS and online-LBFGS [124], which are two stochastic adaptations of the BFGS algorithm. Specifically, following the BFGS method described in the previous section, st and ut are modified as

  st := θt+1 − θt and ut := ∇F_{St}(θt+1) − ∇F_{St}(θt).   (64)

One disadvantage of this method is that each iteration requires two gradient estimates. Besides this, a more worrying fact is that updating the inverse Hessian approximation in each step may not be reasonable [143]. The stochastic quasi-Newton (SQN) method was then proposed, which uses sub-sampled Hessian-vector products to update Ht by L-BFGS according to [125]. Meanwhile, the authors proposed an effective approach that decouples the stochastic gradient and curvature estimate calculations to obtain a stable Hessian approximation. In particular, since

  ∇F(θt+1) − ∇F(θt) ≈ ∇²F(θt)(θt+1 − θt),   (65)

ut can be rewritten as

  ut := ∇²F_{StH}(θt) st.   (66)

Based on these techniques, an SQN framework was proposed, and the detailed procedure is shown in Algorithm 5.

Algorithm 5 SQN Framework [143]
Input: θ0, V, m, ηt
Output: The solution θ∗
  for t = 1, 2, 3, 4, … do
    s′t = Ht gt using the two-loop recursion
    st = −ηt s′t
    θt+1 = θt + st
    if update pairs then
      Compute st and ut
      Add a new displacement pair {st, ut} to V
      if |V| > m then
        Remove the eldest pair from V
      end if
    end if
  end for

In the above algorithm, V = {st, ut} is a collection of m displacement pairs, and gt is the current stochastic gradient ∇F_{St}(θt). Meanwhile, the matrix-vector product Ht gt can be computed by the two-loop recursion described in the previous section. Recently, more and more work has achieved very good results with stochastic quasi-Newton methods. Specifically, a regularized stochastic BFGS method was proposed, together with a corresponding analysis of the convergence of this optimization method [144]. Further, an online L-BFGS was presented in [145]. A linearly convergent method was proposed [126], which combines the L-BFGS method in [125] with the variance reduction technique. Besides these, a variance-reduced block L-BFGS method was proposed, which works by employing the actions of a sub-sampled Hessian on a set of random vectors [146].

To sum up, we have discussed the techniques of using stochastic methods in second-order optimization. The stochastic quasi-Newton method is a combination of the stochastic method and the quasi-Newton method, which extends the quasi-Newton method to large datasets. We have introduced the related work on the stochastic quasi-Newton method in recent years, which reflects the potential of the stochastic quasi-Newton method in machine learning applications.
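To make the decoupled curvature estimate in (66) concrete, the following NumPy sketch computes ut = ∇²F_{StH}(θt) st as a sub-sampled Hessian-vector product for a logistic-regression loss. The model, data, and batch size are illustrative assumptions rather than the survey's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 10
X = rng.normal(size=(N, d))
y = rng.integers(0, 2, size=N).astype(float)

def hessian_vector_product(theta, v, idx):
    """u = (1/|S_H|) * sum_{i in S_H} (Hessian of f_i at theta) @ v, cf. Eq. (66),
    for the logistic (cross-entropy) loss, whose Hessian is X^T diag(p(1-p)) X / n."""
    Xb = X[idx]
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))        # predicted probabilities
    w = p * (1.0 - p)                            # per-sample curvature weights
    return Xb.T @ (w * (Xb @ v)) / len(idx)

theta_t = rng.normal(size=d)
s_t = 0.01 * rng.normal(size=d)                  # displacement s_t = theta_{t+1} - theta_t
S_H = rng.choice(N, size=100, replace=False)     # sub-sampled index set S_t^H (assumed size)
u_t = hessian_vector_product(theta_t, s_t, S_H)  # curvature pair fed to the L-BFGS update
```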

4) Hessian-Free Optimization Method: The main idea of the Hessian-free (HF) method is similar to Newton's method, which employs second-order gradient information. The difference is that the HF method does not need to directly calculate the Hessian matrix H. It estimates the product Hv by some techniques, and is thus called "Hessian free".

Consider a local quadratic approximation Qθ(dt) of the objective F around the parameter θ,

  F(θt + dt) ≈ Qθ(dt) = F(θt) + ∇F(θt)⊤ dt + (1/2) dt⊤ Bt dt,   (67)

where dt is the search direction. The HF method applies the conjugate gradient method to compute an approximate solution dt of the linear system

  Bt dt = −∇F(θt),   (68)

where Bt = H(θt) is the Hessian matrix, but in practice Bt is often defined as Bt = H(θt) + λI, λ ≥ 0 [7]. The new update is then given by

  θt+1 = θt + ηt dt,   (69)

where ηt is the step size that ensures a sufficient decrease in the objective function, usually obtained by a line search. According to [7], the basic framework of HF optimization is shown in Algorithm 6.

Algorithm 6 Hessian-Free Optimization Method [7]
Input: θ0, ∇f(θ0), λ
Output: The solution θ∗
  t = 0
  repeat
    gt = ∇f(θt)
    Compute λ by some method
    Bt(v) ≡ H(θt)v + λv
    Compute the step size ηt
    dt = CG(Bt, −gt)
    θt+1 = θt + ηt dt
    t = t + 1
  until satisfy convergence condition

The advantage of using the conjugate gradient method is that it can calculate the Hessian-vector product without directly calculating the Hessian matrix. In the CG algorithm, the Hessian matrix always appears paired with a vector, so we can compute the Hessian-vector product and avoid the calculation of the inverse Hessian matrix. There are many ways to calculate Hessian-vector products, one of which is the finite difference [7]:

  Hv = lim_{ε→+0} (∇f(θ + εv) − ∇f(θ)) / ε.   (70)

Sub-sampled Hessian-Free Method HF is a well-known method and has been studied for decades in the optimization literature, but it has shortcomings when applied to deep neural networks with large-scale data [7]. Therefore, a sub-sampling technique is employed in HF, resulting in an efficient HF method [7], [147]. The cost in each iteration can be reduced by using only a small sample set S to calculate Hv. The objective function has the following form:

  min F(θ) = (1/N) Σ_{i=1}^{N} fi(θ).   (71)

In the tth iteration, the stochastic gradient estimate can be written as

  ∇F_{St}(θt) = (1/|St|) Σ_{i∈St} ∇fi(θt),   (72)

and the stochastic Hessian estimate is expressed as

  ∇²F_{StH}(θt) = (1/|StH|) Σ_{i∈StH} ∇²fi(θt).   (73)

As described above, we can obtain the approximate solution of the direction dt by employing the CG method to solve the linear system

  ∇²F_{StH}(θt) dt = −∇F_{St}(θt),   (74)

in which the stochastic gradient and the stochastic Hessian matrix are used. The basic framework of the sub-sampled HF algorithm is given in [147].

A natural question is how to determine the size of StH. On the one hand, StH can be chosen small enough so that the total cost of CG iterations is not much greater than a gradient evaluation. On the other hand, StH should be large enough to obtain useful curvature information from the Hessian-vector product. How to balance the size of StH is a challenge still being studied [147].

5) Natural Gradient: The natural gradient method can potentially be applied to any objective function which measures the performance of some statistical model [148]. It enjoys richer theoretical properties when applied to objective functions based on the KL divergence between the model's distribution and the target distribution, or certain approximation surrogates of these [149].

The traditional gradient descent algorithm is based on the Euclidean space. However, in many cases, the parameter space is not Euclidean, and it may have a Riemannian metric structure. In this case, the steepest direction of the objective function cannot be given by the ordinary gradient and should be given by the natural gradient [148].

We consider a model distribution p(y|x, θ), and π(x, y) is an empirical distribution. We need to fit the parameters θ ∈ R^N. Assume that x is an observation vector, and y is its associated label. The objective function is

  F(θ) = E_{(x,y)∼π}[− log p(y|x, θ)],   (75)

and we need to solve the optimization problem

  θ∗ = argmin_θ F(θ).   (76)

According to [148], the natural gradient can be obtained from the traditional gradient multiplied by the inverse of the Fisher information matrix, i.e.,

  ∇N F = G⁻¹ ∇F,   (77)

where F is the objective function, ∇F is the traditional gradient, ∇N F is the natural gradient, and G is the Fisher information matrix, with the following form:

  G = E_{x∼π}[ E_{y∼p(y|x,θ)}[ (∂ log p(y|x; θ)/∂θ)(∂ log p(y|x; θ)/∂θ)⊤ ] ].   (78)

The update formula with the natural gradient is

  θt+1 = θt − ηt ∇N F.   (79)

We cannot ignore that the application of the natural gradient is very limited because of the heavy computation. It is expensive to estimate the Fisher information matrix and calculate its inverse matrix. To overcome this limitation, the truncated Newton's method was developed [7], in which the inverse is calculated by an iterative procedure, thus avoiding the direct calculation of the inverse of the Fisher information matrix.
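As a small illustration of (77)-(79), the sketch below performs damped natural-gradient steps for logistic regression, using the empirical Fisher approximation that is often used in practice instead of the exact expectation in (78). The data, damping term, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 5
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

theta = np.zeros(d)
eta, damping = 0.5, 1e-3          # step size and damping (assumed values)

for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    # Gradient of the negative log-likelihood objective, cf. Eq. (75).
    grad = X.T @ (p - y) / N
    # Empirical Fisher approximation of G in Eq. (78), built from per-sample score vectors.
    scores = (y - p)[:, None] * X
    G = scores.T @ scores / N + damping * np.eye(d)
    # Natural-gradient update, Eqs. (77) and (79).
    theta -= eta * np.linalg.solve(G, grad)

print("estimated parameters:", np.round(theta, 2))
```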

In addition, the factorized natural gradient (FNG) [150] and Kronecker-factored approximate curvature (K-FAC) [151] methods were proposed to use the derivatives of probabilistic models to calculate an approximate natural gradient update.

6) Trust Region Method: The update process of most methods introduced above can be described as θt + ηt dt. The displacement of the point in the direction of dt can be written as st. The typical trust region method (TRM) can be used for unconstrained nonlinear optimization problems [140], [152], [153], in which the displacement st is directly determined without the search direction dt.

For the problem min fθ(x), the TRM [140] uses the second-order Taylor expansion to approximate the objective function fθ(x), denoted as qt(s). Each search is done within the range of a trust region with radius △t. The problem can be described as

  min qt(s) = f(xt) + gt⊤ s + (1/2) s⊤ Bt s,
  s.t. ‖st‖ ≤ △t,   (80)

where gt is the approximate gradient of the objective function f(x) at the current iteration point xt, gt ≈ ∇f(xt), Bt is a symmetric matrix which approximates the Hessian matrix ∇²fθ(xt), and △t > 0 is the radius of the trust region. If the L2 norm is used in the constraint function, it becomes the Levenberg-Marquardt algorithm [154].

If st is the solution of the trust region subproblem (80), the displacement st of each update is limited by the trust region radius △t. The core part of the TRM is the update of △t. In each update process, the similarity between the quadratic model q(st) and the objective function fθ(x) is measured, and △t is updated dynamically. The actual amount of descent in the tth iteration is [140]

  △ft = ft − f(xt + st).   (81)

The predicted drop in the tth iteration is

  △qt = ft − q(st).   (82)

The ratio rt is defined to measure the approximation degree of both,

  rt = △ft / △qt.   (83)

It indicates that the model is more realistic than expected when rt is close to 1, and then we should consider expanding △t. On the other hand, it indicates that the model predicts a large drop while the actual drop is small when rt is close to 0, and then we should reduce △t. Moreover, if rt is between 0 and 1, we can leave △t unchanged. The thresholds 0 and 1 are generally set as the left and right boundaries of rt [140].

7) Summary: We summarize the mentioned high-order optimization methods in terms of properties, advantages and disadvantages in Table II.

C. Derivative-Free Optimization

For some optimization problems in practical applications, the derivative of the objective function may not exist or may not be easy to calculate. Finding the optimal point in this case is called derivative-free optimization, which is a discipline of mathematical optimization [155], [156], [157]. It can find the optimal solution without gradient information.

There are mainly two types of ideas for derivative-free optimization. One is to use heuristic algorithms. This type is characterized by empirical rules and chooses methods that have already worked well, rather than deriving solutions systematically. There are many kinds of heuristic optimization methods, including classical simulated annealing, genetic algorithms, ant colony algorithms, and particle swarm optimization [158], [159], [160]. These heuristic methods usually yield approximate global optimal values, and their theoretical support is weak. We do not focus on such techniques in this section. The other idea is to fit an appropriate function according to samples of the objective function. This type of method usually attaches some constraints to the search space to derive the samples. The coordinate descent method is a typical derivative-free algorithm [161], and it can be easily extended and applied to optimization algorithms for machine learning problems. In this section, we mainly introduce the coordinate descent method.

The coordinate descent method is a derivative-free optimization algorithm for multi-variable functions. Its idea is that a one-dimensional search can be performed sequentially along each axis direction to obtain updated values for each dimension. This method is suitable for some problems in which the loss function is non-differentiable.

The vanilla approach is to select a set of bases e1, e2, ..., eD in the linear space as the search directions and to minimize the value of the objective function along each direction. For the target function L(Θ), when Θt is already obtained, the jth dimension of Θt+1 is solved by [155]

  θj^{t+1} = argmin_{θj∈R} L(θ1^{t+1}, ..., θ_{j−1}^{t+1}, θj, θ_{j+1}^{t}, ..., θD^{t}).   (84)

Thus, L(Θt+1) ≤ L(Θt) ≤ ... ≤ L(Θ0) is guaranteed. The convergence of this method is similar to that of the gradient descent method. The order of the updates can be an arbitrary arrangement from e1 to eD in each iteration. The descent direction can be generalized from the coordinate axes to coordinate blocks [162].

The main difference between coordinate descent and gradient descent is that each update direction in the gradient descent method is determined by the gradient of the current position, which may not be parallel to any coordinate axis. In the coordinate descent method, the optimization directions are fixed from beginning to end. It does not need to calculate the gradient of the objective function. In each iteration, the update is only executed along the direction of one axis, and thus the calculation of the coordinate descent method is simple even for some complicated problems. For indivisible functions, the algorithm may not be able to find the optimal solution in a small number of iteration steps. An appropriate coordinate system can be used to accelerate the convergence. For example, the adaptive coordinate descent method employs principal component analysis to obtain a new coordinate system with as little correlation as possible between the coordinates [163]. The coordinate descent method still has limitations when performed on a non-smooth objective function, where it may fall into a non-stationary point.
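As a concrete illustration of the coordinate-wise update (84), the sketch below applies cyclic coordinate descent to a Lasso-style objective, where each one-dimensional sub-problem has a closed-form soft-thresholding solution. The data and the regularization strength are illustrative assumptions.

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

rng = np.random.default_rng(0)
N, D = 100, 8
X = rng.normal(size=(N, D))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=N)  # sparse ground truth
lam = 0.1                          # L1 regularization strength (assumed)

theta = np.zeros(D)
for _ in range(50):                # outer sweeps over all coordinates
    for j in range(D):             # one-dimensional problem along coordinate j, cf. Eq. (84)
        r_j = y - X @ theta + X[:, j] * theta[j]     # partial residual excluding coordinate j
        rho = X[:, j] @ r_j / N
        z = X[:, j] @ X[:, j] / N
        theta[j] = soft_threshold(rho, lam) / z      # closed-form coordinate minimizer

print("estimated coefficients:", np.round(theta, 3))
```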

TABLE II: Summary of High-Order Optimization Methods

Conjugate Gradient [127]
  Properties: It is an optimization method between the first-order and second-order gradient methods. It constructs a set of conjugated directions using the gradient of known points, and searches along the conjugated directions to find the minimum points of the objective function.
  Advantages: The CG method only calculates the first-order gradient but has faster convergence than the steepest descent method.
  Disadvantages: Compared with the first-order gradient method, the calculation of the conjugate gradient is more complex.

Newton's Method [129]
  Properties: Newton's method calculates the inverse matrix of the Hessian matrix to obtain faster convergence than the first-order gradient descent method.
  Advantages: Newton's method uses second-order gradient information, which gives faster convergence than the first-order gradient method. Newton's method has quadratic convergence under certain conditions.
  Disadvantages: It needs long computing time and large storage space to calculate and store the inverse matrix of the Hessian matrix at each iteration.

Quasi-Newton Method [93]
  Properties: The quasi-Newton method uses an approximate matrix to approximate the Hessian matrix or its inverse matrix. Popular quasi-Newton methods include DFP, BFGS and LBFGS.
  Advantages: The quasi-Newton method does not need to calculate the inverse matrix of the Hessian matrix, which reduces the computing time. In general cases, the quasi-Newton method can achieve superlinear convergence.
  Disadvantages: The quasi-Newton method needs a large storage space, which is not suitable for handling the optimization of large-scale problems.

Stochastic Quasi-Newton Method [143]
  Properties: The stochastic quasi-Newton method employs techniques of stochastic optimization. Representative methods are online-LBFGS [124] and SQN [125].
  Advantages: The stochastic quasi-Newton method can deal with large-scale machine learning problems.
  Disadvantages: Compared with the stochastic gradient method, the calculation of the stochastic quasi-Newton method is more complex.

Hessian-Free Method [7]
  Properties: The HF method performs a sub-optimization using the conjugate gradient, which avoids the expensive computation of the inverse Hessian matrix.
  Advantages: The HF method can employ second-order gradient information but does not need to directly calculate Hessian matrices. Thus, it is suitable for high-dimensional optimization.
  Disadvantages: The cost of computation for the matrix-vector product in the HF method increases linearly with the increase of training data. It does not work well for large-scale problems.

Sub-sampled Hessian-Free Method [147]
  Properties: The sub-sampled Hessian-free method uses the stochastic gradient and sub-sampled Hessian-vector products during the process of updating.
  Advantages: The sub-sampled HF method can deal with large-scale machine learning optimization problems.
  Disadvantages: Compared with the stochastic gradient method, the calculation is more complex and needs more computing time in each iteration.

Natural Gradient [148]
  Properties: The basic idea of the natural gradient is to construct the gradient descent algorithm in the predictive function space rather than the parametric space.
  Advantages: The natural gradient uses the Riemann structure of the parametric space to adjust the update direction, which is more suitable for finding the extremum of the objective function.
  Disadvantages: In the natural gradient method, the calculation of the Fisher information matrix is complex.

D. Preconditioning in Optimization

Preconditioning is a very important technique in optimization methods. Reasonable preconditioning can reduce the number of iterations of optimization algorithms. For many important iterative methods, the convergence depends largely on the spectral properties of the coefficient matrix [164]. It can be simply considered that preconditioning transforms a difficult linear system Aθ = b into an equivalent system with the same solution but better spectral characteristics. For example, if M is a nonsingular approximation of the coefficient matrix A, the transformed system

  M⁻¹Aθ = M⁻¹b   (85)

will have the same solution as the system Aθ = b. But (85) may be easier to solve, and the spectral properties of the coefficient matrix M⁻¹A may be more favorable.

In most linear systems, e.g., Aθ = b, the matrix A is often complex and makes the system hard to solve. Therefore, some transformation is needed to simplify this system. M is called the preconditioner. If the matrix obtained by using the preconditioner is obviously structured, or sparse, it will be beneficial to the calculation [165].

The conjugate gradient algorithm mentioned previously is the most commonly used optimization method with preconditioning technology, which speeds up the convergence. The algorithm is shown in Algorithm 7.

Algorithm 7 Preconditioned Conjugate Gradient Method [93]
Input: A, θ0, M, b
Output: The solution θ∗
  f0 = f(θ0)
  g0 = ∇f(θ0) = Aθ0 − b
  y0 is the solution of M y = g0
  d0 = −g0
  t = 0
  while gt ≠ 0 do
    ηt = (gt⊤ yt) / (dt⊤ A dt)
    θt+1 = θt + ηt dt
    gt+1 = gt + ηt A dt
    yt+1 = solution of M y = gt+1
    βt+1 = (gt+1⊤ yt+1) / (gt⊤ yt)
    dt+1 = −yt+1 + βt+1 dt
    t = t + 1
  end while

E. Public Toolkits for Optimization

Fundamental optimization methods are applied in machine learning problems extensively. There are many integrated, powerful toolkits. We summarize the existing common optimization toolkits and present them in Table III.
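As a small usage illustration of one of these toolkits (CVXPY, listed in Table III), the sketch below solves a non-negative least-squares problem. The problem data are illustrative assumptions; any convex objective and constraints expressible in CVXPY's modeling language can be handled in the same way.

```python
import numpy as np
import cvxpy as cp

# Illustrative data for a small constrained least-squares problem.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

x = cp.Variable(5)
objective = cp.Minimize(cp.sum_squares(A @ x - b))   # convex objective
constraints = [x >= 0]                               # simple non-negativity constraint
problem = cp.Problem(objective, constraints)
problem.solve()

print("optimal value:", problem.value)
print("solution:", x.value)
```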

TABLE III: Available Toolkits for Optimization

CVX [166] (Matlab): CVX is a matlab-based modeling system for convex optimization but cannot handle large-scale problems. http://cvxr.com/cvx/download/

CVXPY [167] (Python): CVXPY is a python package developed by the Stanford University Convex Optimization Group for solving convex optimization problems. http://www.cvxpy.org/

CVXOPT [168] (Python): CVXOPT can be used for handling convex optimization. It is developed by Martin Andersen, Joachim Dahl, and Lieven Vandenberghe. http://cvxopt.org/

APM [169] (Python): APM python is suitable for large-scale optimization and can solve problems of linear programming, quadratic programming, integer programming, nonlinear optimization and so on. http://apmonitor.com/wiki/index.php/Main/PythonApp

SPAMS [123] (C++): SPAMS is an optimization toolbox for solving various sparse estimation problems, which is developed and maintained by Julien Mairal. Available interfaces include matlab, R, python and C++. http://spams-devel.gforge.inria.fr/

minConf (Matlab): minConf can be used for optimizing differentiable multivariate functions subject to simple constraints on parameters. It is a set of matlab functions, in which there are many methods to choose from. https://www.cs.ubc.ca/~schmidtm/Software/minConf.html

tf.train.optimizer [170] (Python; C++; CUDA): The basic optimization class, which is usually not called directly; its subclasses are often used instead. It includes classic optimization algorithms such as gradient descent and AdaGrad. https://www.tensorflow.org/api_guides/python/train

IV. DEVELOPMENTS AND APPLICATIONS FOR SELECTED MACHINE LEARNING FIELDS

Optimization is one of the cores of machine learning. Many optimization methods are further developed in the face of different machine learning problems and specific application environments. The machine learning fields selected in this section mainly include deep neural networks, reinforcement learning, meta learning, variational inference and Markov chain Monte Carlo.

A. Optimization in Deep Neural Networks

The deep neural network (DNN) is a hot topic in the machine learning community in recent years. There are many optimization methods for DNNs. In this section, we introduce them from two aspects, i.e., first-order optimization methods and high-order optimization methods.

1) First-Order Gradient Method in Deep Neural Networks: The stochastic gradient optimization method and its adaptive variants have been widely used in DNNs and have achieved good performance. SGD introduces the learning rate decay factor and AdaGrad accumulates all previous gradients, so that their learning rates are continuously decreased and converge to zero. However, the learning rates of these two methods make the update slow in the later stage of optimization. AdaDelta, RMSProp, Adam and other methods use exponential averaging to provide effective updates and simplify the calculation. These methods use the exponential moving average to alleviate the problems caused by the rapid decay of the learning rate, but they limit the current learning rate to relying on only a few gradients [34].

Reddi et al. used a simple convex optimization example to demonstrate that the RMSProp and Adam algorithms may fail to converge [34]. Almost all algorithms that rely on a fixed-size window of past gradients will suffer from this problem, including AdaDelta and Nesterov-accelerated adaptive moment estimation (Nadam) [171].

It is better to rely on the long-term memory of past gradients rather than the exponential moving average of gradients to ensure convergence. A new version of Adam [34], called AmsGrad, uses a simple correction to ensure the convergence of the model while preserving the original computational performance and advantages. Compared with the Adam method, AmsGrad makes the following changes to the first-order moment estimation and the second-order moment estimation:

  mt = β1t mt−1 + (1 − β1t) gt,
  Vt = β2 Vt−1 + (1 − β2) gt²,
  V̂t = max(V̂t−1, Vt),   (86)

where β1t is a non-constant which decreases with time, and β2 is a constant. The correction is applied to the second-order moment Vt, making V̂t monotone. V̂t is then the quantity actually used in the iteration on the target function. The AmsGrad method keeps the long-term memory of past gradients on top of the Adam method, guarantees convergence in the later stage, and works well in applications.

Further, adjusting the parameters β1 and β2 at the same time helps convergence to a certain extent. For example, β1 can decay modestly as β1t = β1/t, β1t ≤ β1, for all t ∈ [T], and β2 can be set as β2t = 1 − 1/t, for all t ∈ [T], as in the AdamNC algorithm [34].

Another idea, combining SGD and Adam, was proposed for solving the non-convergence problem of adaptive gradient algorithms [38]. Adaptive algorithms, such as Adam, converge fast and are suitable for processing sparse data. SGD with momentum can converge to more accurate results. The combination of SGD and Adam develops the advantages of both methods. Specifically, it first trains with Adam to descend quickly and then switches to SGD for precise optimization based on the previous parameters at an appropriate switch point. The strategy is named switching from Adam to SGD (SWATS) [38]. There are two core problems in SWATS. One is when to switch from Adam to SGD, and the other is how to adjust the learning rate after switching the optimization algorithm. The SWATS approach is described in detail below.

The movement dAdam of the parameter at iteration t of Adam is

  dtAdam = (ηAdam / √Vt) mt,   (87)

where ηAdam is the learning rate of Adam [38]. The movement dSGD of the parameter at iteration t of SGD is

  dtSGD = ηSGD gt,   (88)

where ηSGD is the learning rate of SGD and gt is the gradient of the current position [38]. The movement of SGD can be decomposed into the learning rates along Adam's direction and its orthogonal direction. If SGD is going to finish the trajectory but Adam has not finished due to the momentum after selecting the optimization direction, walking along Adam's direction is a good choice for SWATS. At the same time, SWATS also adjusts its optimized trajectory by moving in the orthogonal direction. Let

  ProjAdam dtSGD = dtAdam,   (89)

and derive the solution

  ηtSGD = ((dtAdam)⊤ dtAdam) / ((dtAdam)⊤ gt),   (90)

where ProjAdam means the projection in the direction of Adam. To reduce noise, a moving average can be used to correct the estimate of the learning rate,

  λtSGD = β2 λt−1SGD + (1 − β2) ηtSGD,   (91)

  λ̃tSGD = λtSGD / (1 − β2),   (92)

where λtSGD is the first moment of the learning rate ηtSGD, and λ̃tSGD is the learning rate of SGD after converting [38]. For the switch point, a simple guideline |λ̃tSGD − λtSGD| < ǫ is often used [38]. Although there is no rigorous mathematical proof for selecting this conversion criterion, it performs well across a variety of applications. Further research can be conducted on the mathematical justification of the switch point. Although SWATS is based on Adam, this switching method is also applicable to other adaptive methods, such as AdaGrad and RMSProp. The procedure is insensitive to hyper-parameters and can obtain an optimal solution comparable to SGD, but with a faster training speed in the case of deep networks.

Recently, some researchers have been trying to explain and improve the adaptive methods [172], [173]. Their strategies can also be combined with the above switching techniques to enhance the performance of the algorithm.

General fully connected neural networks cannot process sequential data such as text and audio. The recurrent neural network (RNN) is a kind of neural network that is more suitable for processing sequential data. It was generally considered that the use of first-order methods to optimize RNNs was not effective, because SGD and its variant methods had difficulty learning long-term dependencies in sequence problems [99], [104], [174].

In recent years, a well-designed random parameter initialization scheme using only SGD with momentum, without curvature information, has achieved good results in training RNNs [99]. In [104], [175], some techniques for improving the optimization in training RNNs are summarized, such as the momentum methods and NAG. The first-order optimization methods have been developed for training RNNs, but they still face the problem of slow convergence in deep RNNs. The high-order optimization methods employing curvature information can accelerate the convergence near the optimal value and are considered to be more effective in optimizing DNNs.
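To make the AmsGrad correction (86) introduced earlier in this subsection concrete, the sketch below implements a single AmsGrad step in NumPy. The hyper-parameter values and the toy quadratic objective are illustrative assumptions.

```python
import numpy as np

def amsgrad_update(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AmsGrad step: like Adam, but uses the running maximum of the second-moment
    estimate, so the effective learning rate never increases (Eq. (86))."""
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    v_hat = np.maximum(v_hat, v)                 # long-term memory: keep the running maximum
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, (m, v, v_hat)

# Usage sketch on a toy quadratic objective f(theta) = 0.5 * ||theta||^2 (assumed).
theta = np.ones(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3))
for _ in range(1000):
    grad = theta                                 # gradient of the toy objective
    theta, state = amsgrad_update(theta, grad, state)
print(theta)
```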

2) High-Order Gradient Method in Deep Neural Networks: We have described the first-order optimization methods applied in DNNs. As most DNNs use large-scale data, different versions of stochastic gradient methods were developed and have obtained excellent performance and properties. To make full use of gradient information, second-order methods are gradually being applied to DNNs. In this section, we mainly introduce the Hessian-free method in DNNs.

The Hessian-free (HF) method has been studied for a long time in the field of optimization, but it is not directly suitable for dealing with neural networks [7]. As the objective function in a DNN is not convex, the exact Hessian matrix may not be positive definite. Therefore, some modifications need to be made so that the HF method can be applied to neural networks [176].

The Generalized Gauss-Newton Matrix One solution is to use the generalized Gauss-Newton (GGN) matrix, which can be seen as an approximation of the Hessian matrix [177]. The GGN matrix is a provably positive semidefinite matrix, which avoids the trouble of negative curvature. There are at least two ways to derive the GGN matrix [176]. Both of them require that f(θ) can be expressed as a composition of two functions, written as f(θ) = Q(F(θ)), where f(θ) is the objective function and Q is convex. The GGN matrix G takes the following form:

  G = J⊤ Q′′ J,   (93)

where J is the Jacobian of F.

Damping Methods Another modification to the HF method is to use different damping methods. For example, Tikhonov damping, one of the most famous damping methods, is implemented by introducing a quadratic penalty term into the quadratic model. A quadratic penalty term (λ/2) d⊤d is added to the quadratic model,

  Q(θ) := Q(θ) + (λ/2) d⊤ d = f(θt) + ∇f(θt)⊤ d + (1/2) d⊤ B d,   (94)

where B = H + λI, and λ > 0 is a scalar parameter that determines the "strength" of the damping. Thus, Bv is formulated as Bv = (H + λI)v = Hv + λv. However, the basic Tikhonov damping method does not perform well in training RNNs [178]. Due to the complex structure of RNNs, the local quadratic approximation in certain directions of the parameter space, even at very small distances, may be highly imprecise. The Tikhonov damping method can only compensate for this by increasing the penalty in all directions, because the method lacks a selective mechanism [176]. Therefore, structural damping was proposed, which makes the performance substantially better and more robust.

The HF method with structural damping can effectively train RNNs [176]. Now we briefly introduce the HF method with structural damping. Let e(x, θ) denote a vector-valued function of θ which can be interpreted as intermediate quantities during the calculation of f(x, θ), where f(x, θ) is the objective function. For instance, e(x, θ) might contain the activations of some layers of hidden units in neural networks (like RNNs). A structural damping term can be defined as

  R(θ) = (1/|S|) Σ_{(x,y)∈S} D(e(x, θ), e(x, θt)),   (95)

where D is a distance function or a loss function. It can prevent a large change in e(x, θ) by penalizing the distance between e(x, θ) and e(x, θt). Then, the damped local objective can be written as

  Qθ(d)′ = Qθ(d) + µ R(d + θt) + (λ/2) d⊤ d,   (96)

where µ and λ are two parameters to be dynamically adjusted, and d is the direction at the tth iteration. More details of structural damping can be found in [176].

Besides, there are many second-order optimization methods employed in RNNs. For example, quasi-Newton based optimization and L-BFGS were proposed to train RNNs [179], [180].

In order to make the penalty-based damping method work better, the damping parameters can be adjusted continuously. A Levenberg-Marquardt style heuristic was used to adjust λ directly [7]. The Levenberg-Marquardt heuristic is described as follows:
1) If γ < 1/4 then λ ← (3/2) λ,
2) If γ > 3/4 then λ ← (2/3) λ,
where γ is a "reduction ratio" with the following form:

  γ = (f(θt−1 + dt) − f(θt−1)) / Mt−1(dt).   (97)

Sub-sampling As the sub-sampled Hessian can be used to handle large-scale data, several variations of the sub-sampling methods were proposed [8], [9], [10], which use either stochastic gradients or exact gradients. These approaches use Bt = ∇²_{St} f(θt) as a Hessian approximation, where St is a subset of samples. We need to compute the Hessian-vector product in some optimization methods. If we adopt the sub-sampling method, it also means that we can save a lot of computation in each iteration, as in the method proposed in [7].

Preconditioning Preconditioning can be used to simplify the optimization problems. For example, preconditioning can accelerate the CG method. It has been found that diagonal matrices are particularly effective, and one can use the following preconditioner [7]:

  M = ( diag( Σ_{i=1}^{N} ∇fi(θ) ⊙ ∇fi(θ) ) + λI )^α,   (98)

where ⊙ denotes the element-wise product and the exponent α is chosen to be less than 1.
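Both the diagonal preconditioner (98) and the Levenberg-Marquardt heuristic for λ are easy to express in code. The NumPy sketch below shows both; the per-sample gradient matrix, the exponent α, and the damping schedule are illustrative assumptions.

```python
import numpy as np

def diagonal_preconditioner(per_sample_grads, lam, alpha=0.75):
    """Diagonal preconditioner of Eq. (98).
    per_sample_grads: array of shape (N, n_params) whose i-th row is grad f_i(theta)."""
    diag = np.sum(per_sample_grads * per_sample_grads, axis=0) + lam
    return diag ** alpha          # applied element-wise inside the CG iterations

def update_damping(lam, gamma):
    """Levenberg-Marquardt heuristic for the damping parameter, using the ratio of Eq. (97)."""
    if gamma < 0.25:
        return 1.5 * lam          # quadratic model was too optimistic: increase damping
    if gamma > 0.75:
        return (2.0 / 3.0) * lam  # quadratic model fits well: decrease damping
    return lam
```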

B. Optimization in Reinforcement Learning

Reinforcement learning (RL) is an important research field of machine learning and is also one of the most popular topics. Agents using deep reinforcement learning have achieved great success in learning complex behavior skills and have solved challenging control tasks in high-dimensional raw perceptual state spaces [181], [182], [183]. An agent interacts with the environment through a trial-and-error mechanism and learns optimal strategies by maximizing cumulative rewards [39]. We describe several concepts of reinforcement learning as follows:

1) Agent: makes different actions according to the state of the external environment, and adjusts the strategy according to the reward of the external environment.
2) Environment: all things outside the agent that will be affected by the actions of the agent. It can change the state and provide the reward to the agent.
3) State s: a description of the environment.
4) Action a: a description of the behavior of the agent.
5) Reward rt(st−1, at−1, st): the timely return value at time t.
6) Policy π(a|s): a function by which the agent decides the action a according to the current state s.
7) State transition probability p(s′|s, a): the probability distribution that the environment will transfer to state s′ at the next moment, after the agent selects an action a based on the current state s.
8) p(s′, r|s, a): the probability that the agent transfers to state s′ and obtains the reward r, where the agent is in state s and selects the action a.

Many reinforcement learning problems can be described by a Markov decision process (MDP) ⟨S, A, P, γ, r⟩ [39], in which S is the state space, A is the action space, P is the state transition probability function, r is the reward function and γ is the discount factor, 0 < γ < 1. At each time, the agent accepts a state and selects an action from an action set according to the policy. The agent receives feedback from the environment and then moves to the next state. The goal of reinforcement learning is to find a strategy that allows us to obtain the maximum γ-discounted cumulative reward. The discounted return is calculated by

  Gt = Σ_{k=0}^{∞} γ^k r_{t+k}.   (99)

People do not necessarily know the MDP behind the problem. From this point of view, reinforcement learning is divided into two categories. One is model-based reinforcement learning, which knows the MDP of the whole model (including the transition probability P and the reward function r); the other is the model-free method, in which the MDP is unknown. Systematic exploration is required in the latter methods.

The most commonly used value function is the state value function,

  Vπ(s) = Eπ[Gt | St = s],   (100)

which is the expected return of executing policy π from state s. The state-action value function is also essential, which is the expected return of selecting action a under state s and policy π,

  Qπ(s, a) = Eπ[Gt | St = s, At = a].   (101)

The value function of the current state s can be calculated from the value function of the next state s′. The Bellman equations of Vπ(s) and Qπ(s, a) describe this relation by

  Vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [ r(s, a, s′) + γ Vπ(s′) ],   (102)

  Qπ(s, a) = Σ_{s′,r} p(s′, r|s, a) [ r(s, a, s′) + γ Σ_{a′} π(a′|s′) Qπ(s′, a′) ].   (103)

There are many reinforcement learning methods based on the value function. They are called value-based methods, which play a significant role in RL. For example, Q-learning [184] and SARSA [185] are two popular methods which use temporal difference algorithms. The policy-based approach is to optimize the policy πθ(a|s) directly and update the parameters θ by gradient descent [186].

The actor-critic algorithm is a reinforcement learning method combining policy gradient and temporal difference learning, which learns both a policy and a state value function. It estimates the parameters of the two structures simultaneously.

1) The actor is a policy function, which learns a policy πθ(a|s) to obtain the highest possible return.
2) The critic refers to the learned value function Vφ(s), which estimates the value function of the current policy, that is, it evaluates the quality of the actor.

In the actor-critic method, the critic solves a prediction problem, while the actor pays attention to the control [187]. More information on the actor-critic method can be found in [88], [187]. The value-based method, the policy-based method, and the actor-critic method can be summarized as follows:

1) The value-based method: It needs to calculate a value function, and usually obtains a deterministic policy.
2) The policy-based method: It optimizes the policy π directly without selecting actions according to a value function.
3) The actor-critic method: It combines the above two methods, and learns both the policy π and the state value function.

Deep reinforcement learning (DRL) combines reinforcement learning and deep learning. It defines problems and optimizes goals in the framework of RL, and solves problems such as state representation and strategy representation using deep learning techniques.

DRL has achieved great success in many challenging control tasks and uses DNNs to represent the control policy. For neural network training, a simple stochastic gradient algorithm or another first-order algorithm is usually chosen, but these algorithms are not efficient in exploring the weight space, which makes DRL methods often take several days to train [60]. A distributed method was therefore proposed to solve this problem, in which parallel actor-learners have a stabilizing effect during training [182]. It executes multiple agents to interact with the environment simultaneously, which reduces the training time. However, this method ignores the sampling efficiency.
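As a concrete instance of a value-based method built on the Bellman relation (103), the sketch below performs tabular Q-learning updates. The toy chain environment, learning rate, and exploration schedule are illustrative assumptions, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration (assumed)

def step(s, a):
    """A toy deterministic chain environment (assumption for illustration):
    action 1 moves right, action 0 moves left; reward 1 only upon reaching the last state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    for _ in range(20):
        # Epsilon-greedy action selection.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Temporal-difference (Q-learning) update toward the Bellman target.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))
```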

A scalable and sample-efficient natural gradient algorithm was proposed, which uses a Kronecker-factored approximation method to compute the natural policy gradient update, and applies the update to both the actor and the critic (ACKTR) [60].

C. Optimization in Meta Learning

Meta learning [45], [46] is a popular research direction in the field of machine learning. It addresses the problem of learning to learn. Traditionally, machine learning research obtains a large amount of data for a specific task first and then uses the data to train the model. In machine learning, adequate training data is the guarantee of achieving good performance. However, human beings can handle new tasks well with only a few training samples, which is much more efficient than traditional machine learning methods. The key point could be that the human brain has learned "how to learn" and can make full use of past knowledge and experience to guide the learning of new tasks. Therefore, how to make machines learn as efficiently as human beings has become a frontier issue in machine learning.

The goal of meta learning is to design a model that can be trained well on new tasks using as few samples as possible without overfitting. The process of adapting to a new task is essentially a learning process during meta-testing, but with only limited samples from the new task. The application of meta learning methods in supervised learning can solve few-shot learning problems [47].

As few-shot learning problems receive more and more attention, meta learning is also developing rapidly. In general, meta learning methods can be summarized into the following three types [48]: metric-based methods [49], [50], [51], [52], model-based methods [53], [54] and optimization-based methods [55], [56], [47]. In this subsection, we focus on the optimization-based meta learning methods. In meta learning, there are usually some tasks with sufficient training samples and a new task with only a few training samples. The main idea can be described as follows: in the meta-train step, sample a task τ from the total task set T, which contains (Dτtrain, Dτtest). For task τ, train and update the optimizer parameter θ with the training samples Dτtrain, and update the meta-optimizer parameter φ with the test samples Dτtest. The process of sampling tasks and updating parameters is repeated multiple times. In the meta-test step, the trained meta-optimizer is used for learning a new task.

Since the purpose of meta learning is to achieve fast learning, a key point is to make the gradient descent more accurate in the optimization. In some meta learning methods, the optimization process itself can be regarded as a learning problem, learning to predict the gradient rather than applying a predetermined gradient descent algorithm [188]. A neural network with the original gradient as input and the predicted gradient as output is often used as a meta-optimizer [55]. The network is trained using the training and test samples from other tasks and is then used on the new task. The parameter update in the process of training is as follows:

  θt+1 = θt + N(g(θt), φ),   (104)

where θt is the model parameter at iteration t, and N is the meta-optimizer with parameter φ that learns how to predict the gradient. After training, the meta-optimizer N and its parameter φ are updated according to the loss value on the test samples. Experiments have confirmed that learned neural optimizers are advantageous compared with the most advanced adaptive stochastic gradient optimization methods used in deep learning [55]. Due to the similarity between the gradient update in backpropagation and the cell state update in the long short-term memory (LSTM), an LSTM is often used as the meta-optimizer [55], [56].

The model-agnostic meta learning algorithm (MAML) is another method for meta learning, which was proposed to learn the parameters of any model subjected to gradient descent methods. It is applicable to different learning problems, including classification, regression and reinforcement learning [47]. The basic idea of the model-agnostic algorithm is to begin with multiple tasks at the same time, and then obtain the synthetic gradient direction of the different tasks, so as to learn a common base model. The main process can be described as follows: in the meta-train step, a batch of tasks τi, each containing (Ditrain, Ditest), is sampled from the total task set T. For every τi, train and update the parameter θi′ with the training samples Ditrain:

  θi′ = θ − α ∂Jτi(θ)/∂θ,   (105)

where α is the learning rate of the training process and Jτi(θ) is the loss function of task i with training samples Ditrain. After the training step, use the synthetic gradient direction of these parameters θi′ on the test samples Ditest of the respective tasks to update the parameter θ:

  θ = θ − β ∂( Σ_{τi∼p(T)} Jτi(θi′) )/∂θ,   (106)

where β is the meta learning rate of the test process and Jτi(θ) is the loss function of task i with test samples Ditest. The meta-train step is repeated multiple times to optimize a good initial parameter θ. In the meta-test step, the trained parameter θ is used as the initial parameter, so that the model has maximal performance on the new task. MAML does not introduce additional parameters for meta learning, nor does it require a specific learner architecture. The development of this method is of great significance to the optimization-based meta learning methods. Recently, an extended task-agnostic meta learning algorithm has been proposed to enhance the generalization of the meta-learner towards a variety of tasks, which achieves outstanding performance on few-shot classification and reinforcement learning tasks [189].
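The two nested updates (105)-(106) can be written compactly. Below is a minimal NumPy sketch of one meta-training loop for a linear regression task family, using the first-order approximation of the meta-gradient (so it is FOMAML-style rather than exact MAML, which would differentiate through the inner update). The task distribution, model, and learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n_tasks = 0.01, 0.001, 10   # inner and meta learning rates, tasks per batch (assumed)

def sample_task():
    """A task is a linear regression problem y = w * x + b with its own (w, b) (assumption)."""
    w, b = rng.uniform(-2, 2), rng.uniform(-1, 1)
    x_tr, x_te = rng.uniform(-5, 5, 20), rng.uniform(-5, 5, 20)
    return (x_tr, w * x_tr + b), (x_te, w * x_te + b)

def loss_and_grad(theta, data):
    x, y = data
    err = theta[0] * x + theta[1] - y
    grad = np.array([np.mean(err * x), np.mean(err)])   # gradient of the mean squared error
    return np.mean(err ** 2), grad

theta = np.zeros(2)                        # shared initialization to be meta-learned
for meta_iter in range(1000):
    meta_grad = np.zeros_like(theta)
    for _ in range(n_tasks):
        train, test = sample_task()
        _, g_tr = loss_and_grad(theta, train)
        theta_i = theta - alpha * g_tr      # inner, task-specific update, Eq. (105)
        _, g_te = loss_and_grad(theta_i, test)
        meta_grad += g_te                   # first-order approximation of the meta-gradient
    theta -= beta * meta_grad / n_tasks     # outer update on the test losses, cf. Eq. (106)
```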

approximate the posterior density of the Bayesian model, Then the CAVI algorithm can be given below in Algorithm 8.
which transforms intricate inference problems into high-
dimensional optimization problems [190], [191]. Compared
with MCMC, the variational inference is faster and more Algorithm 8 Coordinate Ascent Variational Inference [192]
suitable for dealing with large-scale data. Variational inference Input: p(X, Z), XQ
has been applied to large-scale machine learning tasks, Output: q(Z) = M i=1 qi (zi )
such as large-scale document analysis, computer vision and Initialize Variational factors qi (zi )
computational neuroscience [192]. repeat
Variational inference often defines a flexible family of for i=1,2,3....,M do
distributions indexed by free parameters on latent variables qi∗ ∝ exp{E−i [log p(zi , Z−i , X)]}
[190], and then finds the variational parameters by solving an end for
optimization problem. Compute ELBO(q):
Now let us review the principle of variational inference
[58]. Variational inference approximates the true posterior by ELBO(q) = E[log p(Z, X)] − E log q(Z)
attempting to minimize the Kullback-Leibler (KL) divergence until ELBO converges
between a potential factorized distribution and the true
posterior.
In traditional coordinate ascension algorithms, the efficiency
Let Z = {zi } represent the set of all latent variables and
of processing large data is very low, because each iteration
parameters in the model and X = {xi } be a set of all
needs to compute all the data, which is very time-consuming.
observed data. The joint likelihood of X and Z is p(Z, X) =
Modern machine learning models often need to analyze and
p(Z)p(X|Z). In Bayesian models, the posterior distribution
process large-scale data, which is difficult and costly. Stochas-
p(Z|X) should be computed to make further inference.
tic optimization enables machine learning to be extended
What we need to do is to approximate p(Z|X) with the
on massive data [193]. This reminds us of an attractive
distribution q(Z) that belongs to a constrained family of dis-
technique to handle large data sets: stochastic optimization
tributions. The goal is to make the two distributions as similar
[97], [192], [194]. By introducing stochastic optimization into
as possible. Variational inference chooses KL divergence to
variational inference, the stochastic variational inference (SVI)
measure the difference between the two distributions, that is
was proposed [58], in which the exponential family is taken
to minimize the KL divergence of q(Z) and p(Z|X). Here is
as a typical example.
the formula for the KL divergence between q and p:
  Gaussian process (GP) is an important machine learning
q(Z) method based on statistical learning and Bayesian theory. It
KL[q(Z)||p(Z|X)] = Eq log
p(Z|X) is suitable for complex regression problems such as high
= Eq [log q(Z)] − Eq [log p(Z|X)] dimensions, small samples, and nonlinearities. GP has the
= Eq [log q(Z)] − Eq [log p(Z, X)] + log p(X) advantages of strong generalization ability, flexible non-
parametric inference, and strong interpretability. However,
= −ELBO(q) + const, (107)
the complexity and storage requirements of accurate solution
where log p(X) is replaced by a constant because we are for GP are high, which hinders the development of GP
only interested in q. With the above formula, we can know under large-scale data. The stochastic variational inference
KL divergence is difficult to optimize because it requires method introduced in this section can popularize variational
knowing the distribution that we are trying to approximate. An inference on large-scale datasets, but it can only be applied to
alternative method is to maximize the evidence lower bound probabilistic models with factorized structures. For GPs whose
(ELBO), a lower bound on the logarithm of the marginal observations are correlated with each other, the stochastic
probability of the observations. We can obtain ELBO’s variational inference can be adapted by introducing the
formula as global inducing variables as variational variables [195], [196].
Specifically, the observations are assumed to be conditionally
ELBO(q) = E [log p(Z, X)] − E [log q(Z)] . (108)
independent given the inducing variables and the variational
Variational inference can be treated as an optimization distribution for the inducing variables is assumed to have an
problem with the goal of minimizing the evidence lower explicit form. Thus, the resulting GP model can be factorized
bound. A direct method is to solve this optimization problem in a necessary manner, enabling the stochastic variational
using the coordinate ascent, which is called coordinate ascent inference. This method can also be easily extended to models
variational inference (CAVI). CAVI iteratively optimizes each with non-Gaussian likelihood or latent variable models based
factor of the mean-field variational density, while holding the on GPs.
others fixed [192].
Specifically, variational
Qdistribution q has the structure of the E. Optimization in Markov Chain Monte Carlo
M
mean-field, i.e., q(Z) = i=1 qi (zi ). With this assumption, we
Markov chain Monte Carlo (MCMC) is a class of sampling
can bring the distribution q into the ELBO, by some derivation
algorithms to simulate complex distributions that are difficult
according to [57], and obtain the following formula:
to sample directly. It is a practical tool for Bayesian posterior
qi∗ ∝ exp{E−i [log p(zi , Z−i , X)]}. (109) inference. The traditional and common MCMC algorithms
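As an illustrative sketch, Algorithm 8 can be expressed as a short generic loop. The helper functions update_factor (the coordinate update of Eq. (109) for a concrete model) and elbo (Eq. (108)) are assumptions here, since their closed forms depend on the chosen probabilistic model.

    def cavi(factors, data, update_factor, elbo, tol=1e-6, max_iter=1000):
        """Generic coordinate ascent variational inference loop (cf. Algorithm 8).

        factors:       list of variational factors q_i(z_i), one per latent variable
        update_factor: callable (i, factors, data) -> new q_i, i.e. the coordinate
                       update q_i* of Eq. (109) for the model at hand
        elbo:          callable (factors, data) -> scalar ELBO of Eq. (108)
        """
        prev = -float("inf")
        for _ in range(max_iter):
            for i in range(len(factors)):                     # update one factor at a time,
                factors[i] = update_factor(i, factors, data)  # holding the others fixed
            current = elbo(factors, data)
            if abs(current - prev) < tol:                     # stop when the ELBO has converged
                break
            prev = current
        return factors

For conditionally conjugate models in the exponential family, each update_factor call has a closed form, which is what makes CAVI practical.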
In traditional coordinate ascent algorithms, the efficiency of processing large data is very low, because each iteration needs to compute over all the data, which is very time-consuming. Modern machine learning models often need to analyze and process large-scale data, which is difficult and costly. This reminds us of an attractive technique for handling large data sets: stochastic optimization, which enables machine learning to be extended to massive data [97], [192], [193], [194]. By introducing stochastic optimization into variational inference, stochastic variational inference (SVI) was proposed [58], in which the exponential family is taken as a typical example.

The Gaussian process (GP) is an important machine learning method based on statistical learning and Bayesian theory. It is suitable for complex regression problems such as those with high dimensions, small samples, and nonlinearities. GP has the advantages of strong generalization ability, flexible non-parametric inference, and strong interpretability. However, the complexity and storage requirements of the exact solution for GP are high, which hinders the application of GP to large-scale data. The stochastic variational inference method introduced in this section can popularize variational inference on large-scale datasets, but it can only be applied to probabilistic models with factorized structures. For GPs, whose observations are correlated with each other, stochastic variational inference can be adapted by introducing global inducing variables as variational variables [195], [196]. Specifically, the observations are assumed to be conditionally independent given the inducing variables, and the variational distribution for the inducing variables is assumed to have an explicit form. Thus, the resulting GP model can be factorized in the necessary manner, enabling stochastic variational inference. This method can also be easily extended to models with non-Gaussian likelihoods or latent variable models based on GPs.

E. Optimization in Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a class of sampling algorithms that simulate complex distributions which are difficult to sample directly. It is a practical tool for Bayesian posterior inference. The traditional and common MCMC algorithms include Gibbs sampling, slice sampling, Hamiltonian Monte Carlo (HMC) [197], [198], Riemann manifold variants [199], and so on. These sampling methods are limited by their computational cost and are difficult to extend to large-scale data. This section takes HMC as an example to introduce the optimization in MCMC. The bottleneck of HMC is that the gradient calculation is costly on large data sets.

We first introduce the derivation of HMC. Consider the random variable θ, which can be sampled from the posterior distribution

    p(θ|D) ∝ exp(−U(θ)),   (110)

where D is the set of observations, and U is the potential energy function with the following formula:

    U(θ) = − log p(θ|D) = − Σ_{x∈D} log p(x|θ) − log p(θ).   (111)

In HMC [197], an independent auxiliary momentum variable r is introduced from Hamiltonian dynamics. The Hamiltonian function and the joint distribution of θ and r are described by

    H(θ, r) = U(θ) + (1/2) r^T M^{-1} r = U(θ) + K(r),   (112)

    p(θ, r) ∝ exp(−U(θ) − (1/2) r^T M^{-1} r),   (113)

where M denotes the mass matrix, and K(r) is the kinetic energy function. The process of HMC sampling is derived by simulating the Hamiltonian dynamical system

    dθ = M^{-1} r dt,
    dr = −∇U(θ) dt.   (114)

Hamiltonian dynamics describes the continuous motion of a particle. For practical simulation, the Hamiltonian equations are numerically approximated by the discretized leapfrog integrator [197]. The update equations are as follows [197]:

    r_i(t + ǫ/2) = r_i(t) − (ǫ/2) dr(t),
    θ_i(t + ǫ)   = θ_i(t) + ǫ dθ(t + ǫ/2),   (115)
    r_i(t + ǫ)   = r_i(t + ǫ/2) − (ǫ/2) dr(t + ǫ).

In the case of large datasets, the gradient of U(θ) needs to be calculated on the entire data set in each leapfrog iteration. In order to improve efficiency, the stochastic gradient method can be used to calculate ∇U(θ) with a mini-batch D̃ sampled uniformly from D, which reduces the cost of the calculation [61]. However, the gradient calculated on a mini-batch instead of the full dataset introduces noise. According to the central limit theorem, this noisy gradient can be approximated as

    ∇Ũ(θ) ≈ ∇U(θ) + N(0, V(θ)),   (116)

where the gradient noise obeys a normal distribution whose covariance is V(θ). If we replace ∇U(θ) by ∇Ũ(θ) directly, the Hamiltonian dynamics are changed to

    dθ = M^{-1} r dt,
    dr = −∇U(θ) dt + N(0, 2B(θ) dt),   (117)

where B(θ) = (1/2) ǫ V(θ) is the diffusion matrix [61].

Since the discretization of the dynamical system introduces noise, a Metropolis-Hastings (MH) correction step should be done after the leapfrog steps. These MH steps require expensive calculations over all the data in each iteration. Beyond that, the naive stochastic gradient variant of HMC has an incorrect stationary distribution [200]. Thus, the Hamiltonian dynamics were further modified to minimize the effect of the additional noise, achieve the invariant distribution, and eliminate the MH steps [61]. Specifically, a friction term is added to the dynamical process of the momentum update:

    dθ = M^{-1} r dt,
    dr = −∇U(θ) dt − B M^{-1} r dt + N(0, 2B(θ) dt).   (118)

The introduced friction term is helpful for decreasing the total energy H(θ, r) and weakening the effect of the noise in the momentum update phase. The dynamical system is of the same type as second-order Langevin dynamics with friction in physics, which can explore the parameter space efficiently and counteract the effect of the noisy gradients [61], so no MH correction is required. This second-order Langevin dynamic MCMC method, called SGHMC, is used to deal with sampling problems on large data sets [61], [201].
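A minimal sketch of the resulting sampler, obtained by discretizing Eq. (118) with step size ǫ, an identity mass matrix, and a scalar friction constant C standing in for BM^{-1}; the injected noise uses variance 2Cǫ, i.e., the estimate of the gradient-noise matrix is neglected, as is common in practice. The function stoch_grad_U is a user-supplied mini-batch gradient of the potential energy.

    import numpy as np

    def sghmc(theta0, stoch_grad_U, n_steps, eps=1e-3, C=1.0, rng=None):
        """Stochastic gradient HMC: a simple discretization of Eq. (118) with M = I.

        stoch_grad_U: callable theta -> noisy gradient of U computed on a mini-batch
        C:            friction constant, standing in for B M^{-1} in Eq. (118)
        """
        rng = np.random.default_rng() if rng is None else rng
        theta = np.array(theta0, dtype=float)
        r = np.zeros_like(theta)                      # momentum variable
        noise_std = np.sqrt(2.0 * C * eps)            # injected noise, cf. N(0, 2B(θ)dt)
        samples = []
        for _ in range(n_steps):
            r = (r - eps * stoch_grad_U(theta)        # -∇Ũ(θ) dt
                   - eps * C * r                      # friction term
                   + noise_std * rng.normal(size=theta.shape))
            theta = theta + eps * r                   # dθ = M^{-1} r dt
            samples.append(theta.copy())
        return np.array(samples)

No MH correction appears in the loop; the friction and noise terms are what keep the discretized dynamics close to the desired stationary distribution.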

Moreover, HMC is highly sensitive to its hyper-parameters, such as the path length (number of leapfrog steps) L and the step size ǫ. If the hyper-parameters are not set properly, the efficiency of HMC drops dramatically. There are some methods to optimize these two hyper-parameters instead of setting them manually.

1) Path Length L: The value of the path length L has a great influence on the performance of HMC. If L is too small, the distance between the resulting sample points will be very close; if L is too large, the resulting sample points will loop back, resulting in wasted computation. In general, manually setting L cannot maximize the sampling efficiency of HMC.

Hoffman and Gelman [202] proposed an extension of the HMC method called the No-U-Turn Sampler (NUTS), which uses a recursive algorithm to generate a set of possible independent samples efficiently and stops the simulation automatically when the trajectory begins to double back on itself. There is no need to set the path length L manually. In models with multiple discrete variables, the ability of NUTS to select the trajectory length automatically allows it to generate more valid samples and perform more efficiently than the original HMC.

2) Adaptive Step Size ǫ: The performance of HMC is also highly sensitive to the step size ǫ in the leapfrog integrator. If ǫ is too small, the updates will be slow and the calculation cost will be high; if ǫ is too large, the rejection rate will be high, resulting in useless updates.

To set ǫ reasonably and adaptively, a vanishing adaptation of the dual averaging algorithm can be used in HMC [203], [204]. Specifically, a statistic H_t = δ − α_t is adopted in the dual averaging method, where δ is the desired average acceptance probability, and α_t is the current Metropolis-Hastings acceptance probability for iteration t. The expectation h(ǫ) of the statistic H_t is defined as

    h(ǫ) ≡ E_t[H_t | ǫ_t] ≡ lim_{T→∞} (1/T) Σ_{t=1}^{T} E[H_t | ǫ_t],   (119)

where ǫ_t is the step size for iteration t in the leapfrog integrator. To satisfy h(ǫ) ≡ E_t[H_t | ǫ_t] = 0, we can derive the update formula of ǫ, i.e., ǫ_{t+1} = ǫ_t − η_t H_t. Tuning ǫ by this vanishing adaptation algorithm guarantees that the average Metropolis acceptance probability converges to a fixed value.
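A minimal sketch of this vanishing adaptation rule, with a decaying adaptation rate η_t = η_0/t; the function accept_prob, which runs one HMC iteration with the current step size and returns its Metropolis-Hastings acceptance probability α_t, is a hypothetical placeholder here.

    def adapt_step_size(eps0, accept_prob, n_adapt, delta=0.65, eta0=0.05):
        """Vanishing adaptation of the leapfrog step size: eps_{t+1} = eps_t - eta_t * H_t,
        where H_t = delta - alpha_t is the statistic driven to zero (cf. Eq. (119))."""
        eps = eps0
        for t in range(1, n_adapt + 1):
            alpha_t = accept_prob(eps)     # acceptance probability of an HMC step run with eps
            H_t = delta - alpha_t
            eta_t = eta0 / t               # vanishing adaptation rate
            # acceptance below delta gives H_t > 0, so eps shrinks; above delta, eps grows
            eps = max(eps - eta_t * H_t, 1e-8)
        return eps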
The hyper-parameters in HMC include not only the step size ǫ and the number of iteration steps L, but also the mass M, etc. Optimizing these hyper-parameters can help improve the sampling performance [199], [205], [206]. It is convenient and efficient to tune the hyper-parameters automatically, without cumbersome manual adjustments, based on the data and variables in MCMC. These adaptive tuning methods can also be applied to other MCMC algorithms to improve the performance of the samplers.

In addition to the second-order SGHMC, stochastic gradient Langevin dynamics (SGLD) [207] is a first-order Langevin dynamics technique combined with stochastic optimization. Efficient variants of both SGLD and SGHMC remain an active research topic [201], [208].

V. CHALLENGES AND OPEN PROBLEMS

With the rise of practical demand and the increase of the complexity of machine learning models, the optimization methods in machine learning still face challenges. In this part, we discuss open problems and challenges for some optimization methods in machine learning, which may offer suggestions or ideas for future research and promote the wider application of optimization methods in machine learning.

A. Challenges in Deep Neural Networks

There are still many challenges in optimizing DNNs. Here we mainly discuss two challenges, with respect to data and to the model, respectively. One is insufficient data in training, and the other is the non-convex objective in DNNs.

1) Insufficient Data in Training Deep Neural Networks: In general, deep learning is based on big data sets and complex models. It requires a large number of training samples to achieve good training effects. But in some particular fields, finding a sufficient amount of training data is difficult. If we do not have enough data to estimate the parameters in the neural networks, it may lead to high variance and overfitting.

There are some techniques in neural networks that can be used to reduce the variance. Adding L2 regularization to the objective is a natural method to reduce the model complexity. Recently, a common method is dropout [62]. In the training process, each neuron is allowed to stop working with a probability of p, which can prevent the co-adaptation of certain neurons. In this way, M sub-networks can be sampled with replacement, in a manner similar to bagging [209]. The expected result at the output layer is calculated as

    o = E_M[f(x; θ, M)] = Σ_{i=1}^{M} p(M_i) f(x; θ, M_i),   (120)

where p(M_i) is the probability of the ith sub-network. Dropout can prevent overfitting and improve the generalization ability of the network, but its disadvantage is that it increases the training time, since each training pass updates only a sub-network rather than the full network [210].
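A toy sketch of the expectation in Eq. (120): for a single linear layer, averaging the outputs of many randomly sampled sub-networks (a Monte Carlo estimate) agrees with the usual deterministic approximation of scaling activations by the keep probability 1 − p. All names below (the layer f, the weight matrix W) are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)

    W = rng.normal(size=(4, 3))        # a toy linear layer
    x = rng.normal(size=4)
    p_drop = 0.2                       # each input neuron stops working with probability p

    def f(x, W, mask):
        return (mask * x) @ W          # dropped neurons contribute nothing

    # Monte Carlo estimate of Eq. (120): average the outputs of sampled sub-networks M_i.
    outputs = [f(x, W, rng.random(4) >= p_drop) for _ in range(10000)]
    mc_expectation = np.mean(outputs, axis=0)

    # Deterministic approximation used at test time: scale the activations by 1 - p.
    weight_scaling = f(x, W, (1.0 - p_drop) * np.ones(4))

    print(mc_expectation)
    print(weight_scaling)              # close to the Monte Carlo estimate for this linear layer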
Not only overfitting but also some training details will affect the performance of the model, due to the complexity of DNNs. An improper selection of the learning rate and the number of iterations in SGD will make the model unable to converge, which makes the accuracy of the model fluctuate greatly. Besides, treating the construction of the neural network as an inappropriate black box may result in training not being able to continue, so designing an appropriate neural network model is particularly important. These impacts are even greater when data are insufficient.

The technology of transfer learning [211] can be applied to build networks in the scenario of insufficient data. Its idea is that models trained on other data sources can be reused in similar target fields after certain modifications and improvements, which dramatically alleviates the problems caused by insufficient datasets. Moreover, the advantages brought by transfer learning are not limited to reducing the need for sufficient training data; it can also avoid overfitting effectively and achieve better performance in general. However, if the target data are not sufficiently relevant to the original training data, the transferred model does not bring good performance.

Meta learning methods can be used to systematically learn parameter initialization, which ensures that training begins with a suitable initial model. However, it is necessary to ensure the correlation between the tasks used for meta-training and the tasks used for meta-testing. Under the premise of models with similar data sources for training, transfer learning and meta learning can overcome the difficulties caused by insufficient training data in new data sources, but these methods usually introduce a large number of parameters or complex parameter adjustment mechanisms, which need to be further improved for specific problems. Therefore, using insufficient data for training DNNs is still a challenge.

2) Non-convex Optimization in Deep Neural Network: Convex optimization has good properties, and a comprehensive set of tools is available to solve such optimization problems. However, many machine learning problems are formulated as non-convex optimization problems. For example, almost all the optimization problems in DNNs are non-convex. Non-convex optimization is one of the difficulties in the optimization problem. Unlike convex optimization, there may be innumerable optimum solutions in the feasible domain of a non-convex problem, and the complexity of searching for the global optimal value is NP-hard [109].

In recent years, non-convex optimization has gradually attracted the attention of researchers. The methods for solving non-convex optimization problems can be roughly divided into two types. One is to transform the non-convex optimization into a convex optimization problem and then use convex optimization methods. The other is to use special optimization methods that solve non-convex functions directly. There is some work on summarizing the optimization methods for solving non-convex functions from the perspective of machine learning [212].
1) Relaxation method: Relax the problem to make it become a convex optimization problem. There are many relaxation techniques; for example, the branch-and-bound method called αBB convex relaxation [213], [214] uses a convex relaxation at each step to compute the lower bound in the region. The convex relaxation method has been used in many fields. In the field of computer vision, a convex relaxation method was proposed to calculate minimal partitions [215]. For unsupervised and semi-supervised learning, the convex relaxation method was used for solving semidefinite programming [216].

2) Non-convex optimization methods: These methods include projected gradient descent [217], [218], alternating minimization [219], [220], [221], the expectation maximization algorithm [222], [223], and stochastic optimization and its variants [37].

B. Difficulties in Sequential Models with Large-Scale Data

When dealing with large-scale time series, the usual solutions are using stochastic optimization, processing data in mini-batches, or utilizing distributed computing to improve computational efficiency [224]. For a sequential model, segmenting the sequences can affect the dependencies between the data on adjacent time indices. If the sequence length is not an integral multiple of the mini-batch size, the general operation is to add some items sampled from the previous data into the last subsequence. This operation introduces wrong dependencies into the training model. Therefore, the analysis of the difference between the approximate solution obtained in this way and the exact solution is a direction worth exploring. Particularly, in RNNs, the problems of gradient vanishing and gradient explosion are also prone to occur. So far, they are generally addressed by the specific gating mechanisms of LSTM and GRU [225] or by gradient clipping. Better solutions for dealing with these problems in RNNs are still worth investigating.

C. High-Order Methods for Stochastic Variational Inference

High-order optimization methods utilize curvature information and thus converge fast. Although computing and storing the Hessian matrices are difficult, the calculation of the Hessian matrix has made great progress with the development of research [8], [9], [226], and second-order optimization methods have become more and more attractive. Recently, stochastic methods have also been introduced into second-order methods, which extends them to large-scale data [8], [10].

We have introduced some work on stochastic variational inference, which brings the stochastic method into variational inference and is an interesting and meaningful combination. This makes variational inference able to handle large-scale data. A natural idea is whether we can incorporate second-order (or even higher-order) optimization methods into stochastic variational inference, which is interesting and challenging.

D. Stochastic Optimization in Conjugate Gradient

Stochastic methods exhibit powerful capabilities when dealing with large-scale data, especially for first-order optimization methods [227]. Researchers have also introduced this stochastic idea into second-order optimization methods [124], [125], [228] and achieved good results.

The conjugate gradient method is an elegant and attractive algorithm with the advantages of both first-order and second-order optimization methods. Its standard form, however, is not suitable for the stochastic approximation setting. By using fast Hessian-gradient products, stochastic ideas have been introduced into the conjugate gradient method, and numerical results show the validity of the resulting algorithm [227]. Another version of the stochastic conjugate gradient method employs the variance reduction technique, converges quickly within a few iterations, and requires little storage space during the running process [229]. The stochastic version of conjugate gradient is a promising optimization method and is still worth studying.

VI. CONCLUSION

This paper introduces and summarizes the frequently used optimization methods from the perspective of machine learning and studies their applications in various fields of machine learning. First, we describe the theoretical basis of optimization methods from the first-order, high-order, and derivative-free aspects, as well as the research progress in recent years. Then we describe the applications of the optimization methods in different machine learning scenarios and the approaches to improve their performance. Finally, we discuss some challenges and open problems in machine learning optimization methods.

REFERENCES

[1] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
[2] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford, "Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification," Journal of Machine Learning Research, vol. 18, 2018.
[3] D. F. Shanno, "Conditioning of quasi-Newton methods for function minimization," Mathematics of Computation, vol. 24, pp. 647–656, 1970.
[4] J. Hu, B. Jiang, L. Lin, Z. Wen, and Y.-x. Yuan, "Structured quasi-Newton methods for optimization with orthogonality constraints," SIAM Journal on Scientific Computing, vol. 41, pp. 2239–2269, 2019.
[5] J. Pajarinen, H. L. Thai, R. Akrour, J. Peters, and G. Neumann, "Compatible natural gradient policy search," Machine Learning, pp. 1–24, 2019.
[6] J. E. Dennis, Jr. and J. J. Moré, "Quasi-Newton methods, motivation and theory," SIAM Review, vol. 19, pp. 46–89, 1977.
[7] J. Martens, "Deep learning via Hessian-free optimization," in International Conference on Machine Learning, 2010, pp. 735–742.
[8] F. Roosta-Khorasani and M. W. Mahoney, "Sub-sampled Newton methods II: local convergence rates," arXiv preprint arXiv:1601.04738, 2016.
[9] P. Xu, J. Yang, F. Roosta-Khorasani, C. Ré, and M. W. Mahoney, "Sub-sampled Newton methods with non-uniform sampling," in Advances in Neural Information Processing Systems, 2016, pp. 3000–3008.
[10] R. Bollapragada, R. H. Byrd, and J. Nocedal, "Exact and inexact subsampled Newton methods for optimization," IMA Journal of Numerical Analysis, vol. 1, pp. 1–34, 2018.
[11] L. M. Rios and N. V. Sahinidis, "Derivative-free optimization: a review of algorithms and comparison of software implementations," Journal of Global Optimization, vol. 56, pp. 1247–1293, 2013.
[12] A. S. Berahas, R. H. Byrd, and J. Nocedal, "Derivative-free optimization of noisy functions via quasi-Newton methods," SIAM Journal on Optimization, vol. 29, pp. 965–993, 2019.

[13] Y. LeCun and L. Bottou, “Gradient-based learning applied to document [38] N. S. Keskar and R. Socher, “Improving generalization performance
recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998. by switching from Adam to SGD,” arXiv preprint arXiv:1712.07628,
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification 2017.
with deep convolutional neural networks,” in Advances in neural [39] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.
information processing systems, 2012, pp. 1097–1105. MIT press, 1998.
[15] P. Sermanet and D. Eigen, “Overfeat: Integrated recognition, local- [40] J. Mattner, S. Lange, and M. Riedmiller, “Learn to swing up and
ization and detection using convolutional networks,” in International balance a real pole based on raw visual input data,” in International
Conference on Learning Representations, 2014. Conference on Neural Information Processing, 2012, pp. 126–133.
[16] A. Karpathy and G. Toderici, “Large-scale video classification with [41] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
convolutional neural networks,” in IEEE Conference on Computer D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement
Vision and Pattern Recognition, 2014, pp. 1725–1732. learning,” arXiv preprint arXiv:1312.5602, 2013.
[17] Y. Kim, “Convolutional neural networks for sentence classification,” [42] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness,
in Conference on Empirical Methods in Natural Language Processing, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland,
2014, pp. 1746–1751. and G. Ostrovski, “Human-level control through deep reinforcement
[18] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks learning,” Nature, vol. 518, pp. 529–533, 2015.
for human action recognition,” IEEE Transactions on Pattern Analysis [43] Y. Bengio, “Learning deep architectures for AI,” Foundations and
and Machine Intelligence, vol. 35, pp. 221–231, 2012. Trends in Machine Learning, vol. 2, pp. 1–127, 2009.
[19] S. Lai, L. Xu, and K. Liu, “Recurrent convolutional neural networks [44] S. S. Mousavi, M. Schukat, and E. Howley, “Deep reinforcement
for text classification,” in Association for the Advancement of Artificial learning: an overview,” in SAI Intelligent Systems Conference, 2016,
Intelligence, 2015, pp. 2267–2273. pp. 426–440.
[20] K. Cho and B. Van Merriënboer, “Learning phrase representations [45] J. Schmidhuber, “Evolutionary principles in self-referential learning, or
using RNN encoder-decoder for statistical machine translation,” in on learning how to learn: the meta-meta-... hook,” Ph.D. dissertation,
Conference on Empirical Methods in Natural Language Processing, Technische Universität München, München, Germany, 1987.
2014, pp. 1724–1734. [46] T. Schaul and J. Schmidhuber, “Metalearning,” Scholarpedia, vol. 5,
[21] P. Liu and X. Qiu, “Recurrent neural network for text classification with pp. 46–50, 2010.
multi-task learning,” in International Joint Conferences on Artificial [47] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning
Intelligence, 2016, pp. 2873–2879. for fast adaptation of deep networks,” in International Conference on
[22] A. Graves and A.-r. Mohamed, “Speech recognition with deep recurrent Machine Learning, 2017, pp. 1126–1135.
neural networks,” in International Conference on Acoustics, Speech and
[48] O. Vinyals, “Model vs optimization meta learning,”
Signal processing, 2013, pp. 6645–6649.
http://metalearning-symposium.ml/files/vinyals.pdf, 2017.
[23] K. Gregor and I. Danihelka, “Draw: A recurrent neural network for
[49] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature
image generation,” arXiv preprint arXiv:1502.04623, 2015.
verification using a ”siamese” time delay neural network,” in Advances
[24] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent
in Neural Information Processing Systems, 1994, pp. 737–744.
neural networks,” arXiv preprint arXiv:1601.06759, 2016.
[50] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks
[25] A. Ullah and J. Ahmad, “Action recognition in video sequences using
for one-shot image recognition,” in International Conference on
deep bi-directional LSTM with CNN features,” IEEE Access, vol. 6,
Machine Learning WorkShop, 2015, pp. 1–30.
pp. 1155–1166, 2017.
[51] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching
[26] Y. Xia and J. Wang, “A bi-projection neural network for solving
networks for one shot learning,” in Advances in Neural Information
constrained quadratic optimization problems,” IEEE Transactions on
Processing Systems, 2016, pp. 3630–3638.
Neural Networks and Learning Systems, vol. 27, no. 2, pp. 214–224,
2015. [52] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-
[27] S. Zhang, Y. Xia, and J. Wang, “A complex-valued projection neural shot learning,” in Advances in Neural Information Processing Systems,
network for constrained optimization of real functions in complex 2017, pp. 4077–4087.
variables,” IEEE Transactions on Neural Networks and Learning [53] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap,
Systems, vol. 26, no. 12, pp. 3227–3238, 2015. “Meta-learning with memory-augmented neural networks,” in Interna-
[28] Y. Xia and J. Wang, “Robust regression estimation based on low- tional Conference on Machine Learning, 2016, pp. 1842–1850.
dimensional recurrent neural networks,” IEEE Transactions on Neural [54] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” in
Networks and Learning Systems, vol. 29, no. 12, pp. 5935–5946, 2018. International Conference on Learning Representations, 2015, pp. 1–
[29] Y. Xia, J. Wang, and W. Guo, “Two projection neural networks 15.
with reduced model complexity for nonlinear programming,” IEEE [55] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau,
Transactions on Neural Networks and Learning Systems, pp. 1–10, T. Schaul, B. Shillingford, and N. De Freitas, “Learning to learn
2019. by gradient descent by gradient descent,” in Advances in Neural
[30] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods Information Processing Systems, 2016, pp. 3981–3989.
for online learning and stochastic optimization,” Journal of Machine [56] S. Ravi and H. Larochelle, “Optimization as a model for few-shot
Learning Research, vol. 12, pp. 2121–2159, 2011. learning,” in International Conference on Learning Representations,
[31] M. D. Zeiler, “AdaDelta: An adaptive learning rate method,” arXiv 2016, pp. 1–11.
preprint arXiv:1212.5701, 2012. [57] C. M. Bishop, Pattern Recognition and Machine Learning. Springer,
[32] T. Tieleman and G. Hinton, “Divide the gradient by a running average 2006.
of its recent magnitude,” COURSERA: Neural Networks for Machine [58] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic
Learning, pp. 26–31, 2012. variational inference,” Journal of Machine Learning Research, vol. 14,
[33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” pp. 1303–1347, 2013.
in International Conference on Learning Representations, 2014, pp. 1– [59] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta,
15. A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single
[34] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam image super-resolution using a generative adversarial network,” in
and beyond,” in International Conference on Learning Representations, Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
2018, pp. 1–23. [60] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable
[35] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation trust-region method for deep reinforcement learning using Kronecker-
learning with deep convolutional generative adversarial networks,” factored approximation,” in Advances in Neural Information Processing
arXiv preprint arXiv:1511.06434, 2015. Systems, 2017, pp. 5279–5288.
[36] N. L. Roux, M. Schmidt, and F. R. Bach, “A stochastic gradient [61] T. Chen, E. Fox, and C. Guestrin, “Stochastic gradient Hamiltonian
method with an exponential convergence rate for finite training sets,” in Monte Carlo,” in International Conference on Machine Learning, 2014,
Advances in Neural Information Processing Systems, 2012, pp. 2663– pp. 1683–1691.
2671. [62] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
[37] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent dinov, “Dropout: a simple way to prevent neural networks from
using predictive variance reduction,” in Advances in Neural Information overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–
Processing Systems, 2013, pp. 315–323. 1958, 2014.

[63] W. Yin and H. Schütze, “Multichannel variable-size convolution for [91] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge
sentence classification,” in Conference on Computational Language University Press, 2004.
Learning, 2015, pp. 204–214. [92] J. Alspector, R. Meir, B. Yuhas, A. Jayakumar, and D. Lippe, “A
[64] J. Yang, K. Yu, Y. Gong, and T. S. Huang, “Linear spatial pyramid parallel gradient descent method for learning in analog VLSI neural
matching using sparse coding for image classification,” in IEEE networks,” in Advances in Neural Information Processing Systems,
Conference on Computer Vision and Pattern Recognition, 2009, pp. 1993, pp. 836–844.
1794–1801. [93] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 2006.
[65] Y. Bazi and F. Melgani, “Gaussian process approach to remote sensing [94] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method
image classification,” IEEE Transactions on Geoscience and Remote Efficiency in Optimization. John Wiley & Sons, 1983.
Sensing, vol. 48, pp. 186–197, 2010. [95] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic
[66] D. C. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep approximation approach to stochastic programming,” SIAM Journal on
neural networks for image classification,” in IEEE Conference on Optimization, vol. 19, pp. 1574–1609, 2009.
Computer Vision and Pattern Recognition, 2012, pp. 3642–3649. [96] A. Agarwal, M. J. Wainwright, P. L. Bartlett, and P. K. Ravikumar,
[67] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means “Information-theoretic lower bounds on the oracle complexity of
clustering algorithm,” Journal of the Royal Statistical Society. Series convex optimization,” in Advances in Neural Information Processing
C (Applied Statistics), vol. 28, pp. 100–108, 1979. Systems, 2009, pp. 1–9.
[68] S. Guha, R. Rastogi, and K. Shim, “ROCK: A robust clustering [97] H. Robbins and S. Monro, “A stochastic approximation method,” The
algorithm for categorical attributes,” Information Systems, vol. 25, pp. Annals of Mathematical Statistics, pp. 400–407, 1951.
345–366, 2000. [98] C. Darken, J. Chang, and J. Moody, “Learning rate schedules for faster
[69] C. Ding, X. He, H. Zha, and H. D. Simon, “Adaptive dimension stochastic gradient search,” in Neural Networks for Signal Processing,
reduction for clustering high dimensional data,” in IEEE International 1992, pp. 3–12.
Conference on Data Mining, 2002, pp. 147–154. [99] I. Sutskever, “Training recurrent neural networks,” Ph.D. dissertation,
[70] M. Guillaumin and J. Verbeek, “Multimodal semi-supervised learning University of Toronto, Ontario, Canada, 2013.
for image classification,” in Computer Vision and Pattern Recognition,
[100] Z. Allen-Zhu, “Natasha 2: Faster non-convex optimization than SGD,”
2010, pp. 902–909.
in Advances in Neural Information Processing Systems, 2018, pp.
[71] O. Chapelle and A. Zien, “Semi-supervised classification by low den- 2675–2686.
sity separation.” in International Conference on Artificial Intelligence
[101] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle pointson-
and Statistics, 2005, pp. 57–64.
line stochastic gradient for tensor decomposition,” in Conference on
[72] Z.-H. Zhou and M. Li, “Semi-supervised regression with co-training.”
Learning Theory, 2015, pp. 797–842.
in International Joint Conferences on Artificial Intelligence, 2005, pp.
908–913. [102] B. T. Polyak, “Some methods of speeding up the convergence of iter-
ation methods,” USSR Computational Mathematics and Mathematical
[73] A. Demiriz and K. P. Bennett, “Semi-supervised clustering using
Physics, vol. 4, pp. 1–17, 1964.
genetic algorithms,” Artificial Neural Networks in Engineering, vol. 1,
pp. 809–814, 1999. [103] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT
Press, 2016.
[74] B. Kulis and S. Basu, “Semi-supervised graph clustering: a kernel
approach,” Machine Learning, vol. 74, pp. 1–22, 2009. [104] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance
[75] D. Zhang and Z.-H. Zhou, “Semi-supervised dimensionality reduction,” of initialization and momentum in deep learning,” in International
in SIAM International Conference on Data Mining, 2007, pp. 629–634. Conference on Machine Learning, 2013, pp. 1139–1147.
[76] P. Chen and L. Jiao, “Semi-supervised double sparse graphs [105] Y. Nesterov, “A method for unconstrained convex minimization
based discriminant analysis for dimensionality reduction,” Pattern problem with the rate of convergence O( k12 ),” Doklady Akademii Nauk
Recognition, vol. 61, pp. 361–378, 2017. SSSR, vol. 269, pp. 543–547, 1983.
[77] K. P. Bennett and A. Demiriz, “Semi-supervised support vector [106] L. C. Baird III and A. W. Moore, “Gradient descent for general
machines,” in Advances in Neural Information processing systems, reinforcement learning,” in Advances in Neural Information Processing
1999, pp. 368–374. Systems, 1999, pp. 968–974.
[78] E. Cheung, Optimization Methods for Semi-Supervised Learning. [107] C. Darken and J. E. Moody, “Note on learning rate schedules for
University of Waterloo, 2018. stochastic optimization,” in Advances in Neural Information Processing
[79] O. Chapelle, V. Sindhwani, and S. S. Keerthi, “Optimization techniques Systems, 1991, pp. 832–838.
for semi-supervised support vector machines,” Journal of Machine [108] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with
Learning Research, vol. 9, pp. 203–233, 2008. the stochastic average gradient,” Mathematical Programming, vol. 162,
[80] ——, “Branch and bound for semi-supervised support vector pp. 83–112, 2017.
machines,” in Advances in Neural Information Processing Systems, [109] Z. Allen-Zhu and E. Hazan, “Variance reduction for faster non-convex
2007, pp. 217–224. optimization,” in International Conference on Machine Learning, 2016,
[81] Y.-F. Li and I. W. Tsang, “Convex and scalable weakly labeled svms,” pp. 699–707.
Journal of Machine Learning Research, vol. 14, pp. 2151–2188, 2013. [110] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic
[82] F. Murtagh, “A survey of recent advances in hierarchical clustering variance reduction for nonconvex optimization,” in International
algorithms,” The Computer Journal, vol. 26, pp. 354–359, 1983. Conference on Machine Learning, 2016, pp. 314–323.
[83] V. Castro and J. Yang, “A fast and robust general purpose clustering [111] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental
algorithm,” in Knowledge Discovery in Databases and Data Mining, gradient method with support for non-strongly convex composite
2000, pp. 208–218. objectives,” in Advances in Neural Information Processing Systems,
[84] G. H. Ball and D. J. Hall, “A clustering technique for summarizing 2014, pp. 1646–1654.
multivariate data,” Behavioral Science, vol. 12, pp. 153–155, 1967. [112] M. J. Powell, “A method for nonlinear constraints in minimization
[85] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” problems,” Optimization, pp. 283–298, 1969.
Chemometrics and Intelligent Laboratory Systems, vol. 2, pp. 37–52, [113] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed
1987. optimization and statistical learning via the alternating direction method
[86] I. Jolliffe, “Principal component analysis,” in International Encyclope- of multipliers,” Foundations and Trends in Machine Learning, vol. 3,
dia of Statistical Science, 2011, pp. 1094–1096. pp. 1–122, 2011.
[87] M. E. Tipping and C. M. Bishop, “Probabilistic principal component [114] A. Nagurney and P. Ramanujam, “Transportation network policy
analysis,” Journal of the Royal Statistical Society: Series B (Statistical modeling with goal targets and generalized penalty functions,”
Methodology), vol. 61, pp. 611–622, 1999. Transportation Science, vol. 30, pp. 3–13, 1996.
[88] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. [115] B. He, H. Yang, and S. Wang, “Alternating direction method with
MIT Press, 2018. self-adaptive penalty parameters for monotone variational inequalities,”
[89] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement Journal of Optimization Theory and Applications, vol. 106, pp. 337–
learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, 356, 2000.
pp. 237–285, 1996. [116] D. Hallac, C. Wong, S. Diamond, A. Sharang, S. Boyd, and
[90] S. Ruder, “An overview of gradient descent optimization algorithms,” J. Leskovec, “Snapvx: A network-based convex optimization solver,”
arXiv preprint arXiv:1609.04747, 2016. Journal of Machine Learning Research, vol. 18, pp. 1–5, 2017.

[117] B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang, “An ADMM [146] R. Gower, D. Goldfarb, and P. Richtárik, “Stochastic block BFGS:
algorithm for a class of total variation regularized estimation problems,” Squeezing more curvature out of data,” in International Conference on
arXiv preprint arXiv:1203.1828, 2012. Machine Learning, 2016, pp. 1869–1878.
[118] M. Frank and P. Wolfe, “An algorithm for quadratic programming,” [147] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal, “On the use of
Naval Research Logistics Quarterly, vol. 3, pp. 95–110, 1956. stochastic Hessian information in optimization methods for machine
[119] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse convex learning,” SIAM Journal on Optimization, vol. 21, pp. 977–995, 2011.
optimization,” in International Conference on Machine Learning, 2013, [148] S. I. Amari, “Natural gradient works efficiently in learning,” Neural
pp. 427–435. Computation, vol. 10, pp. 251–276, 1998.
[120] M. Fukushima, “A modified Frank-Wolfe algorithm for solving [149] J. Martens, “New insights and perspectives on the natural gradient
the traffic assignment problem,” Transportation Research Part B: method,” arXiv preprint arXiv:1412.1193, 2014.
Methodological, vol. 18, pp. 169–177, 1984. [150] R. Grosse and R. Salakhudinov, “Scaling up natural gradient by
[121] M. Patriksson, The Traffic Assignment Problem: Models and Methods. sparsely factorizing the inverse fisher matrix,” in International
Dover Publications, 2015. Conference on Machine Learning, 2015, pp. 2304–2313.
[122] K. L. Clarkson, “Coresets, sparse greedy approximation, and the Frank- [151] J. Martens and R. Grosse, “Optimizing neural networks with
Wolfe algorithm,” ACM Transactions on Algorithms, vol. 6, pp. 63–96, Kronecker-factored approximate curvature,” in International Confer-
2010. ence on Machine Learning, 2015, pp. 2408–2417.
[123] J. Mairal, F. Bach, J. Ponce, G. Sapiro, R. Jenatton, and [152] R. H. Byrd, J. C. Gilbert, and J. Nocedal, “A trust region method based
G. Obozinski, “SPAMS: A sparse modeling software, version 2.3,” on interior point techniques for nonlinear programming,” Mathematical
http://spams-devel.gforge.inria.fr/downloads.html, 2014. Programming, vol. 89, pp. 149–185, 2000.
[124] N. N. Schraudolph, J. Yu, and S. Günter, “A stochastic quasi-Newton
[153] L. Hei, “Practical techniques for nonlinear optimization,” Ph.D.
method for online convex optimization,” in Artificial Intelligence and
dissertation, Northwestern University, America, 2007.
Statistics, 2007, pp. 436–443.
[125] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, “A stochastic [154] M. I. Lourakis, “A brief description of the levenberg-marquardt
quasi- method for large-scale optimization,” SIAM Journal on algorithm implemented by levmar,” Foundation of Research and
Optimization, vol. 26, pp. 1008–1031, 2016. Technology, vol. 4, pp. 1–6, 2005.
[126] P. Moritz, R. Nishihara, and M. Jordan, “A linearly-convergent [155] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to
stochastic L-BFGS algorithm,” in Artificial Intelligence and Statistics, Derivative-Free Optimization. Society for Industrial and Applied
2016, pp. 249–258. Mathematics, 2009.
[127] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for [156] C. Audet and M. Kokkolaras, Blackbox and Derivative-Free Optimiza-
solving linear systems. NBS Washington, DC, 1952. tion: Theory, Algorithms and Applications. Springer, 2016.
[128] J. R. Shewchuk, “An introduction to the conjugate gradient method [157] L. M. Rios and N. V. Sahinidis, “Derivative-free optimization: a review
without the agonizing pain,” Carnegie Mellon University, Tech. Rep., of algorithms and comparison of software implementations,” Journal
1994. of Global Optimization, vol. 56, pp. 1247–1293, 2013.
[129] M. Avriel, Nonlinear Programming: Analysis and Methods. Dover [158] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by
Publications, 2003. simulated annealing,” Science, vol. 220, pp. 671–680, 1983.
[130] P. T. Harker and J. Pang, “A damped-Newton method for the linear [159] M. Mitchell, An Introduction to Genetic Algorithms. MIT press, 1998.
complementarity problem,” Lectures in Applied Mathematics, vol. 26, [160] M. Dorigo, M. Birattari, C. Blum, M. Clerc, T. Stützle, and A. Winfield,
pp. 265–284, 1990. Ant Colony Optimization and Swarm Intelligence. Springer, 2008.
[131] P. Y. Ayala and H. B. Schlegel, “A combined method for determining [161] D. P. Bertsekas, Nonlinear Programming. Athena Scientific Belmont,
reaction paths, minima, and transition state geometries,” The Journal 1999.
of Chemical Physics, vol. 107, pp. 375–384, 1997. [162] P. Richtárik and M. Takáč, “Iteration complexity of randomized block-
[132] M. Raydan, “The barzilai and borwein gradient method for the coordinate descent methods for minimizing a composite function,”
large scale unconstrained minimization problem,” SIAM Journal on Mathematical Programming, vol. 144, pp. 1–38, 2014.
Optimization, vol. 7, pp. 26–33, 1997. [163] I. Loshchilov, M. Schoenauer, and M. Sebag, “Adaptive coordinate
[133] W. C. Davidon, “Variable metric method for minimization,” SIAM descent,” in Annual Conference on Genetic and Evolutionary
Journal on Optimization, vol. 1, pp. 1–17, 1991. Computation, 2011, pp. 885–892.
[134] R. Fletcher and M. J. Powell, “A rapidly convergent descent method [164] T. Huckle, “Approximate sparsity patterns for the inverse of a matrix
for minimization,” The Computer Journal, vol. 6, pp. 163–168, 1963. and preconditioning,” Applied Numerical Mathematics, vol. 30, pp.
[135] C. G. Broyden, “The convergence of a class of double-rank 291–303, 1999.
minimization algorithms: The new algorithm,” IMA Journal of Applied [165] M. Benzi, “Preconditioning techniques for large linear systems: a
Mathematics, vol. 6, pp. 222–231, 1970. survey,” Journal of Computational Physics, vol. 182, pp. 418–477,
[136] R. Fletcher, “A new approach to variable metric algorithms,” The 2002.
Computer Journal, vol. 13, pp. 317–322, 1970. [166] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex
[137] D. Goldfarb, “A family of variable-metric methods derived by programming, version 2.1,” http://cvxr.com/cvx, 2014.
variational means,” Mathematics of Computation, vol. 24, pp. 23–26,
[167] S. Diamond and S. Boyd, “Cvxpy: A python-embedded modeling
1970.
language for convex optimization,” Journal of Machine Learning
[138] J. Nocedal, “Updating quasi-Newton matrices with limited storage,”
Research, vol. 17, pp. 2909–2913, 2016.
Mathematics of Computation, vol. 35, pp. 773–782, 1980.
[139] D. C. Liu and J. Nocedal, “On the limited memory BFGS method [168] M. Andersen, J. Dahl, and L. Vandenberghe, “Cvxopt: A python
for large scale optimization,” Mathematical programming, vol. 45, pp. package for convex optimization, version 1.1.6,” https://cvxopt.org/,
503–528, 1989. 2013.
[140] W. Sun and Y. X. Yuan, Optimization theory and methods: nonlinear [169] J. D. Hedengren, R. A. Shishavan, K. M. Powell, and T. F. Edgar,
programming. Springer Science & Business Media, 2006. “Nonlinear modeling, estimation and predictive control in apmonitor,”
[141] A. S. Berahas, J. Nocedal, and M. Takác, “A multi-batch L-BFGS Computers & Chemical Engineering, vol. 70, pp. 133–148, 2014.
method for machine learning,” in Advances in Neural Information [170] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
Processing Systems, 2016, pp. 1055–1063. S. Ghemawat, G. Irving, and M. Isard, “Tensorflow: a system for large-
[142] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in scale machine learning,” in USENIX Symposium on Operating Systems
Advances in Neural Information Processing Systems, 2008, pp. 161– Design and Implementations, 2016, pp. 265–283.
168. [171] T. Dozat, “Incorporating nesterov momentum into adam,” in Interna-
[143] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for tional Conference on Learning Representations, 2016, pp. 1–14.
large-scale machine learning,” Society for Industrial and Applied [172] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in
Mathematics Review, vol. 60, pp. 223–311, 2018. Adam,” arXiv preprint arXiv:1711.05101, 2017.
[144] A. Mokhtari and A. Ribeiro, “Res: Regularized stochastic BFGS [173] Z. Zhang, L. Ma, Z. Li, and C. Wu, “Normalized direction-preserving
algorithm,” IEEE Transactions on Signal Processing, vol. 62, pp. 6089– Adam,” arXiv preprint arXiv:1709.04546, 2017.
6104, 2014. [174] H. Salehinejad, S. Sankar, J. Barfett, E. Colak, and S. Valaee,
[145] ——, “Global convergence of online limited memory BFGS,” Journal “Recent advances in recurrent neural networks,” arXiv preprint
of Machine Learning Research, vol. 16, pp. 3151–3181, 2015. arXiv:1801.01078, 2017.

[175] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, “Advances in [202] M. D. Hoffman and A. Gelman, “The No-U-turn sampler: adaptively
optimizing recurrent networks,” in IEEE International Conference on setting path lengths in Hamiltonian monte carlo,” Journal of Machine
Acoustics, Speech and Signal Processing, 2013, pp. 8624–8628. Learning Research, vol. 15, pp. 1593–1623, 2014.
[176] J. Martens and I. Sutskever, “Training deep and recurrent networks with [203] Y. Nesterov, “Primal-dual subgradient methods for convex problems,”
Hessian-free optimization,” in Neural Networks: Tricks of the Trade, Mathematical Programming, vol. 120, pp. 221–259, 2009.
2012, pp. 479–535. [204] C. Andrieu and J. Thoms, “A tutorial on adaptive MCMC,” Statistics
[177] N. N. Schraudolph, “Fast curvature matrix-vector products for second- and Computing, vol. 18, pp. 343–373, 2008.
order gradient descent,” Neural Computation, vol. 14, pp. 1723–1738, [205] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich,
2002. M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, “Stan: A
[178] J. Martens and I. Sutskever, “Learning recurrent neural networks with probabilistic programming language,” Journal of Statistical Software,
Hessian-free optimization,” in International Conference on Machine vol. 76, pp. 1–37, 2017.
Learning, 2011, pp. 1033–1040. [206] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, “MCMC
[179] A. Likas and A. Stafylopatis, “Training the random neural network methods for functions: modifying old algorithms to make them faster,”
using quasi-Newton methods,” European Journal of Operational Statistical Science, vol. 28, pp. 424–446, 2013.
Research, vol. 126, pp. 331–339, 2000. [207] M. Welling and Y. W. Teh, “Bayesian learning via stochastic
[180] X. Liu and S. Liu, “Limited-memory bfgs optimization of recurrent gradient Langevin dynamics,” in International Conference on Machine
neural network language models for speech recognition,” in Interna- Learning, 2011, pp. 681–688.
tional Conference on Acoustics, Speech and Signal Processing, 2018, [208] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven,
pp. 6114–6118. “Bayesian sampling using stochastic gradient thermostats,” in Advances
[181] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, in Neural Information Processing Systems, 2014, pp. 3203–3211.
Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep [209] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–
reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015. 140, 1996.
[182] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, [210] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep regularization,” arXiv preprint arXiv:1409.2329, 2014.
reinforcement learning,” in International Conference on Machine
[211] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE
Learning, 2016, pp. 1928–1937.
Transactions on Knowledge and Data Engineering, vol. 22, pp. 1345–
[183] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van 1359, 2010.
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and
[212] P. Jain and P. Kar, “Non-convex optimization for machine learning,”
M. Lanctot, “Mastering the game of go with deep neural networks and
Foundations and Trends in Machine Learning, vol. 10, pp. 142–336,
tree search,” Nature, vol. 529, pp. 484–489, 2016.
2017.
[184] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8,
[213] C. S. Adjiman and S. Dallwig, “A global optimization method, αbb, for
pp. 279–292, 1992.
general twice-differentiable constrained NLPs–I. theoretical advances,”
[185] G. A. Rummery and M. Niranjan, “On-line Q-learning using connec-
Computers & Chemical Engineering, vol. 22, pp. 1137–1158, 1998.
tionist systems,” Cambridge University Engineering Department, Tech.
Rep., 1994. [214] C. Adjiman, C. Schweiger, and C. Floudas, “Mixed-integer nonlinear
optimization in process synthesis,” in Handbook of combinatorial
[186] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint
optimization, 1998, pp. 1–76.
arXiv:1701.07274, 2017.
[187] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, “Natural [215] T. Pock, A. Chambolle, D. Cremers, and H. Bischof, “A convex
actor-critic algorithms,” Automatica, vol. 45, pp. 2471–2482, 2009. relaxation approach for computing minimal partitions,” in IEEE
Conference on Computer Vision and Pattern Recognition, 2009, pp.
[188] S. Thrun and L. Pratt, Learning to Learn. Springer Science & Business
810–817.
Media, 2012.
[189] M. Abdullah Jamal and G.-J. Qi, “Task agnostic meta-learning for [216] L. Xu and D. Schuurmans, “Unsupervised and semi-supervised multi-
few-shot learning,” in The IEEE Conference on Computer Vision and class support vector machines,” in Association for the Advancement of
Pattern Recognition (CVPR), 2019, pp. 1–11. Artificial Intelligence, 904-910, p. 13.
[190] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An [217] Y. Chen and M. J. Wainwright, “Fast low-rank estimation by projected
introduction to variational methods for graphical models,” Machine gradient descent: General statistical and algorithmic guarantees,” arXiv
Learning, vol. 37, pp. 183–233, 1999. preprint arXiv:1509.03025, 2015.
[191] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential [218] D. Park and A. Kyrillidis, “Provable non-convex projected gradient
families, and variational inference,” Foundations and Trends in descent for a class of constrained matrix optimization problems,” arXiv
Machine Learning, vol. 1, pp. 1–305, 2008. preprint arXiv:1606.01316, 2016.
[192] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: [219] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank matrix completion
A review for statisticians,” Journal of the American Statistical using alternating minimization,” in ACM Annual Symposium on Theory
Association, vol. 112, pp. 859–877, 2017. of Computing, 2013, pp. 665–674.
[193] L. Bottou and Y. L. Cun, “Large scale online learning,” in Advances [220] M. Hardt, “Understanding alternating minimization for matrix
in Neural Information Processing Systems, 2004, pp. 217–224. completion,” in IEEE Annual Symposium on Foundations of Computer
[194] J. C. Spall, Introduction to Stochastic Search and Optimization: Science, 2014, pp. 651–660.
Estimation, Simulation, and Control. Wiley-Interscience, 2005. [221] M. Hardt and M. Wootters, “Fast matrix completion without the
[195] J. Hensman, N. Fusi, and N. Lawrence, “Gaussian processes for big condition number,” in Conference on Learning Theory, 2014, pp. 638–
data,” in Conference on Uncertainty in Artificial Intellegence, 2013, 678.
pp. 282–290. [222] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees
[196] J. Hensman, A. G. d. G. Matthews, and Z. Ghahramani, “Scalable for the em algorithm: From population to sample-based analysis,” The
variational gaussian process classification,” in International Conference Annals of Statistics, vol. 45, pp. 77–120, 2017.
on Artificial Intelligence and Statistics, 2015, pp. 351–360. [223] Z. Wang, Q. Gu, Y. Ning, and H. Liu, “High dimensional expectation-
[197] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid maximization algorithm: Statistical optimization and asymptotic
monte carlo,” Physics Letters B, vol. 195, pp. 216–222, 1987. normality,” arXiv preprint arXiv:1412.8729, 2014.
[198] R. Neal, “MCMC using Hamiltonian dynamics,” Handbook of Markov [224] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P.
Chain Monte Carlo, vol. 2, pp. 113–162, 2011. Tang, “On large-batch training for deep learning: Generalization gap
[199] M. Girolami and B. Calderhead, “Riemann manifold langevin and and sharp minima,” arXiv preprint arXiv:1609.04836, 2016.
hamiltonian monte carlo methods,” Journal of the Royal Statistical [225] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
Society: Series B (Statistical Methodology), vol. 73, pp. 123–214, 2011. gated recurrent neural networks on sequence modeling,” arXiv preprint
[200] M. Betancourt, “The fundamental incompatibility of scalable Hamil- arXiv:1412.3555, 2014.
tonian monte carlo and naive data subsampling,” in International [226] J. Martens, Second-Order Optimization For Neural Networks. Uni-
Conference on Machine Learning, 2015, pp. 533–540. versity of Toronto (Canada), 2016.
[201] S. Ahn, A. Korattikara, and M. Welling, “Bayesian posterior sampling [227] N. N. Schraudolph and T. Graepel, “Conjugate directions for stochastic
via stochastic gradient fisher scoring,” in International Conference on gradient descent,” in International Conference on Artificial Neural
Machine Learning, 2012, pp. 1591–1598. Networks, 2002, pp. 1351–1356.

[228] A. Bordes, L. Bottou, and P. Gallinari, “SGD-QN: Careful quasi-


Newton stochastic gradient descent,” Journal of Machine Learning
Research, vol. 10, pp. 1737–1754, 2009.
[229] X. Jin, X. Zhang, K. Huang, and G. Geng, “Stochastic conjugate
gradient algorithm with variance reduction,” IEEE transactions on
Neural Networks and Learning Systems, pp. 1–10, 2018.
