https://doi.org/10.1007/s12532-024-00265-9
Received: 19 May 2023 / Accepted: 12 August 2024 / Published online: 16 September 2024
© The Author(s) 2024
Abstract
Conjugate gradient minimization methods (CGM) and their accelerated variants are
widely used. We focus on the use of cubic regularization to improve the CGM direction
independent of the step length computation. In this paper, we propose the Hybrid Cubic
Regularization of CGM, where regularized steps are used selectively. Using Shanno’s
reformulation of CGM as a memoryless BFGS method, we derive new formulas for the
regularized step direction. We show that the regularized step direction uses the same
order of computational burden per iteration as its non-regularized version. Moreover,
the Hybrid Cubic Regularization of CGM exhibits global convergence with fewer
assumptions. In numerical experiments, the new step directions are shown to require
fewer iteration counts, improve runtime, and reduce the need to reset the step direction.
Overall, the Hybrid Cubic Regularization of CGM exhibits the same memoryless and
matrix-free properties, while outperforming CGM as a memoryless BFGS method in
iterations and runtime.
Cassidy K. Buhler
[email protected]
Hande Y. Benson
[email protected]
David F. Shanno
[email protected]
1 Department of Decision Sciences and MIS, Drexel University, Philadelphia, PA, USA
2 Rutgers University, RUTCOR (Emeritus), New Brunswick, NJ, USA
1 Introduction
where $\alpha_k$ is the step length, $\Delta x_k$ is the step direction, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, and $\beta_k$ is a scalar defined as
$$\beta_k = \frac{\nabla f(x_{k+1})^T y_k}{y_k^T \Delta x_k}. \tag{4}$$
If an exact line search is used at every iteration, (4) reduces to
$$\beta_k = \frac{\nabla f(x_{k+1})^T y_k}{\nabla f(x_k)^T \nabla f(x_k)}. \tag{5}$$
This form of $\beta_k$ gives the Polak-Ribière formula [23]. If f is quadratic, then (5) further reduces to
$$\beta_k = \frac{\nabla f(x_{k+1})^T \nabla f(x_{k+1})}{\nabla f(x_k)^T \nabla f(x_k)}, \tag{6}$$
In this paper, a restart that occurs after (7) will be called a Powell restart. We will show
in the numerical results section that nearly all of the problems in our test set require at
least one Powell restart and, on average, nearly half of all iterations are Powell restart
iterations. Therefore, in order to truly distinguish CGM from steepest descent and
improve the rate of convergence, we need a mechanism to improve the step directions.
A formal definition of “improvement” will be provided in the next section.
The second concern is about the step length calculation and impacts the amount
of effort required per iteration, and, thus, the runtime of the algorithm. An exact line
search seeks to find a step length $\alpha^*$ which solves
$$\min_{\alpha > 0}\; f(x + \alpha\,\Delta x)$$
for a given point x and step direction $\Delta x$. For a general nonlinear function f,
this minimization problem in one variable (α) is usually “solved” using an iterative
approach such as bisection or cubic interpolation [6], even though these approaches
cannot find the exact value of the minimizing α within a finite number of iterations.
For specific forms of f , such as a strictly convex quadratic function, a formula
can be used to directly calculate the minimizing α without the need for an iter-
ative approach. We should also note that at the solution of the exact line search,
$\nabla f(x + \alpha\,\Delta x)^T \Delta x = 0$.
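For instance, for a strictly convex quadratic $f(x) = \tfrac{1}{2}x^T Q x + c^T x$ (the symbols $Q$ and $c$ are introduced only for this illustration), this condition yields the minimizing step length in closed form:
$$\nabla f(x + \alpha\,\Delta x)^T \Delta x = \left(\nabla f(x) + \alpha\, Q\,\Delta x\right)^T \Delta x = 0 \quad\Longrightarrow\quad \alpha^* = -\frac{\nabla f(x)^T \Delta x}{\Delta x^T Q\,\Delta x}.$$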
By contrast, an inexact line search only seeks to approximately minimize $f(x + \alpha\,\Delta x)$, requiring lower levels of accuracy when the iterates $x_k$ are away from a stationary point of f. In order to maintain theoretical guarantees, most inexact line search techniques rely on guaranteeing sufficient descent, that is, $f(x) - f(x + \alpha\,\Delta x)$ must be sufficiently
large according to some criterion. The Armijo criterion is given by
$$f(x + \alpha\,\Delta x) \le f(x) + c_1\,\alpha\,\nabla f(x)^T \Delta x,$$
and the curvature condition by
$$\nabla f(x + \alpha\,\Delta x)^T \Delta x \ge c_2\,\nabla f(x)^T \Delta x,$$
for constants $0 < c_1 < c_2 < 1$. The two conditions together are called the Wolfe conditions. The curvature condition can also be modified as
$$\left|\nabla f(x + \alpha\,\Delta x)^T \Delta x\right| \le c_2\,\left|\nabla f(x)^T \Delta x\right|.$$
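As a small illustration of these conditions, the following sketch checks whether a candidate step length satisfies the sufficient-decrease and curvature tests; the constants c1 and c2, the function names, and the strong-form option are illustrative choices, not taken from the paper's line search.

```python
import numpy as np

def satisfies_wolfe(f, grad, x, dx, alpha, c1=1e-4, c2=0.9, strong=False):
    """Check the sufficient-decrease (Armijo) and curvature conditions for a
    candidate step length alpha along the direction dx."""
    g0 = grad(x)
    slope0 = g0 @ dx                      # directional derivative at alpha = 0
    x_new = x + alpha * dx
    armijo = f(x_new) <= f(x) + c1 * alpha * slope0
    slope_new = grad(x_new) @ dx
    if strong:
        curvature = abs(slope_new) <= c2 * abs(slope0)   # modified (strong) form
    else:
        curvature = slope_new >= c2 * slope0             # curvature condition
    return armijo and curvature

# Example on a convex quadratic with a steepest-descent direction
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x0 = np.array([1.0, -2.0])
print(satisfies_wolfe(f, grad, x0, -grad(x0), alpha=0.5))   # True
```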
In this section, we will present our proposed cubic regularization scheme and its
related step length rules. The integration of cubic regularization for CGM will be closer
to the approach taken in [1], wherein Benson and Shanno discuss the equivalence
between cubic regularization, Levenberg-Marquardt regularization (this equivalence
was originally pointed out by Griewank in [11]), trust-region radius control, and the
perturbation of the diagonal of the Hessian matrix for line-search approaches based on
Newton’s method. Therefore, in this paper, cubic regularization will generally arise
in the form of a diagonal perturbation to the approximate Hessian matrix. (Our previous
paper on symmetric rank-1 methods with cubic regularization [2] took the approach
of modifying the secant equation, which we are not proposing here but will leave for
future work.) Since the formulation of CGM given by (2) and (4) was matrix-free,
we will use Shanno’s reformulation of CGM as a memoryless BFGS method [27].
We start with a brief review of the reformulation and then present the proposed new
approach.
where $H_k \approx \left(\nabla^2 f(x_k)\right)^{-1}$, with $H_0 = I$ and $H_{k+1}$ obtained using the BFGS update
formula
$$H_{k+1} = H_k - \frac{H_k y_k p_k^T + p_k y_k^T H_k}{p_k^T y_k} + \left(1 + \frac{y_k^T H_k y_k}{p_k^T y_k}\right)\frac{p_k p_k^T}{p_k^T y_k}. \tag{9}$$
Here, $y_k$ is as before and $p_k = \alpha_k\,\Delta x_k$. A memoryless BFGS method would mean that the updates are not accumulated, that is, $H_k$ is replaced by $I$ in the update formula and
$$H_{k+1} = I - \frac{y_k p_k^T + p_k y_k^T}{p_k^T y_k} + \left(1 + \frac{y_k^T y_k}{p_k^T y_k}\right)\frac{p_k p_k^T}{p_k^T y_k}. \tag{10}$$
To derive the equivalence, Shanno [27] notes that Perry [22] expressed (2)-(5) in
matrix form as
$$\Delta x_{k+1} = -\left(I - \frac{p_k y_k^T}{p_k^T y_k} + \frac{p_k p_k^T}{p_k^T y_k}\right)\nabla f(x_{k+1}). \tag{11}$$
Shanno [27] notes that the matrix in this formulation is not symmetric and adds a
further correction:
$$\Delta x_{k+1} = -\left(I - \frac{p_k y_k^T}{y_k^T p_k} - \frac{y_k p_k^T}{y_k^T p_k} + \frac{p_k p_k^T}{p_k^T y_k}\right)\nabla f(x_{k+1}).$$
Finally, to ensure that a secant condition is satisfied, the last term in the matrix is
re-scaled:
$$\Delta x_{k+1} = -\left(I - \frac{p_k y_k^T}{y_k^T p_k} - \frac{y_k p_k^T}{y_k^T p_k} + \left(1 + \frac{y_k^T y_k}{p_k^T y_k}\right)\frac{p_k p_k^T}{p_k^T y_k}\right)\nabla f(x_{k+1}). \tag{12}$$
The matrix term is exactly the formula (10) for the memoryless BFGS update.
It is important to note here that, unlike in BFGS, we do not need to store a matrix or a series of updates to calculate $\Delta x_{k+1}$ using (12). Multiplying $\nabla f(x_{k+1})$ through the
matrix in (12) simply requires dot-products, scalar-vector multiplications, and vector
addition and subtraction. As such, each update only requires the storage of 3 vectors
of length n and O(n) operations.
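To make this O(n) claim concrete, the following sketch evaluates the direction in (12) using only dot products and vector updates and checks it against the explicit matrix form; the variable names are illustrative, and this is not the Conmin-CG implementation.

```python
import numpy as np

def memoryless_bfgs_direction(g_new, p, y):
    """Evaluate the step direction in (12) without forming a matrix: three dot
    products and a few scalar-vector updates.  g_new = grad f(x_{k+1}),
    p = p_k, y = y_k; names are illustrative."""
    py = p @ y                      # p_k^T y_k (assumed positive)
    yy = y @ y                      # y_k^T y_k
    pg = p @ g_new                  # p_k^T grad
    yg = y @ g_new                  # y_k^T grad
    Hg = g_new - (yg / py) * p - (pg / py) * y + (1.0 + yy / py) * (pg / py) * p
    return -Hg                      # O(n) work and O(n) storage

# Sanity check against the explicit matrix in (12)
rng = np.random.default_rng(0)
n = 5
p, y, g = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
if p @ y <= 0:
    y = -y                          # keep p^T y > 0 for the check
H = (np.eye(n) - np.outer(p, y) / (p @ y) - np.outer(y, p) / (p @ y)
     + (1.0 + (y @ y) / (p @ y)) * np.outer(p, p) / (p @ y))
assert np.allclose(memoryless_bfgs_direction(g, p, y), -H @ g)
```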
Furthermore, for an exact line search, (12) reduces to the Polak-Ribière formula (5), thereby ensuring that our proposed cubic regularization approach remains valid in that case as well. Finally, one advantage of using (12) for CGM is that the criterion $p_k^T y_k > 0$ is always satisfied, which ensures that the sequence of step directions it produces remains stable and is required for most proofs of global convergence, such as the one proposed in [15].
In the first iteration, CGM can be initialized using the gradient. However, a two-step
process based on the self-scaling proposed in [21] has demonstrated better stability
and improved iteration counts [29]. We will use the same initialization scheme so that
$$H_0 = I,$$
$$H_1 = \frac{p_0^T y_0}{y_0^T y_0}\left(I - \frac{p_0 y_0^T + y_0 p_0^T}{p_0^T y_0} + \frac{y_0^T y_0}{p_0^T y_0}\,\frac{p_0 p_0^T}{p_0^T y_0}\right) + \frac{p_0 p_0^T}{p_0^T y_0}. \tag{13}$$
As discussed, CGM is restarted every n iterations (Beale restart) and when condition
(7) is satisfied (Powell restart). The inverse Hessian approximation at the most recent
restart iteration t is given by
$$H_t = \frac{p_t^T y_t}{y_t^T y_t}\left(I - \frac{p_t y_t^T + y_t p_t^T}{p_t^T y_t} + \frac{y_t^T y_t}{p_t^T y_t}\,\frac{p_t p_t^T}{p_t^T y_t}\right) + \frac{p_t p_t^T}{p_t^T y_t}, \tag{14}$$
Note that Ht ∇ f (xk+1 ) and Ht yk can be obtained via dot products and scalar-vector
multiplications, so there is still no need to compute or store a matrix. With these
modifications, the CGM algorithm can be fully described as Algorithm 1.
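A sketch of the corresponding matrix-free product with the restart matrix, assuming $H_t$ as written in (14) above (all names are illustrative):

```python
import numpy as np

def apply_Ht(v, p_t, y_t):
    """Multiply the restart matrix H_t of (14) by a vector v using only dot
    products and scalar-vector operations; names are illustrative."""
    py, yy = p_t @ y_t, y_t @ y_t
    pv, yv = p_t @ v, y_t @ v
    gamma = py / yy                 # self-scaling factor p_t^T y_t / y_t^T y_t
    inner = v - (yv / py) * p_t - (pv / py) * y_t + (yy / py) * (pv / py) * p_t
    return gamma * inner + (pv / py) * p_t
```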
It should be noted that the line search is the same as the one in [27], which is an inexact line search that requires sufficient decrease of the objective function at each iteration. The implementation uses cubic interpolation [6] to approximately solve (8).
$$f_N(x_k + \Delta x) := f(x_k) + \nabla f(x_k)^T \Delta x + \frac{1}{2}\,\Delta x^T B_k\,\Delta x, \tag{16}$$
$$f_M(x_k + \Delta x) := f(x_k) + \nabla f(x_k)^T \Delta x + \frac{1}{2}\,\Delta x^T B_k\,\Delta x + \frac{M}{6}\,\|\Delta x\|^3, \tag{17}$$
where M is the approximation to the Lipschitz constant for $\nabla^2 f(x)$. The cubic step direction is found by solving the problem
$$\min_{\Delta x}\; f_M(x_k + \Delta x). \tag{18}$$
In [4, 11, 20], it is shown that for sufficiently large M, $x_k + \Delta x$ will satisfy an
Armijo condition and that a line search is not needed when using cubic regularization
within a quasi-Newton method or Newton’s method. In this new framework, we need
to control M rather than α. In [4], the ARC method starts with a sufficiently large
value of M that is decreased through the iterations and approaches (or is set to) 0 in
a neighborhood of the solution. In [1] and [2], the authors proposed setting M = 0
for all iterations where the Hessian or its estimate are positive definite and picking a
value of M using iteration-specific data only as needed.
In order to motivate the selective use of cubic regularization, we observe that determining $\Delta x$ using (17)–(18) is nontrivial as it involves the solution of an unconstrained
NLP. In [4], the authors show that it suffices to solve (18) only approximately in order
to achieve global convergence.
Moreover, the use of cubic regularization during iterations with negative curvature is
based on its equivalence to the Levenberg-Marquardt method. To see the equivalence,
let us examine the solution of (18). Note that the first-order necessary conditions for
the optimization problem are
$$\nabla f(x_k) + \left(B_k + \frac{M}{2}\,\|s\|\,I\right) s = 0. \tag{19}$$
This is exactly the Levenberg-Marquardt system
$$(B_k + \lambda I)\,\Delta x = -\nabla f(x_k),$$
provided that
$$M = \frac{2\lambda}{\|\Delta x\|}.$$
(Further details of the equivalence are provided in [1].) Thus, we simply re-interpret
the cubic regularization step as coming from a Levenberg-Marquardt regularization
of the CGM step with appropriately related values of λ and M. (A similar insight was
mentioned in [32] but not explicitly used.) If this direction is not accepted, then we
increase λ.
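The following sketch illustrates the equivalence on a small dense example: for a fixed M, it finds the step solving (19) by repeatedly solving the Levenberg-Marquardt system and updating λ = (M/2)‖s‖. The fixed-point scheme, tolerances, and all names are illustrative; this is not the memoryless CGM update developed in the rest of this section.

```python
import numpy as np

def cubic_step_via_lm(B, g, M, tol=1e-10, max_iter=100):
    """Find s satisfying (19), g + (B + (M/2)||s|| I) s = 0, by iterating the
    Levenberg-Marquardt parameter lambda = (M/2)||s||.  A small dense sketch of
    the equivalence; not the memoryless update used in the paper."""
    lam = 0.0
    for _ in range(max_iter):
        s = np.linalg.solve(B + lam * np.eye(len(g)), -g)   # (B + lam I) s = -g
        lam_new = 0.5 * M * np.linalg.norm(s)                # lambda = (M/2)||s||
        if abs(lam_new - lam) <= tol * max(1.0, lam):
            break
        lam = lam_new
    return s, lam_new

B = np.array([[2.0, 0.0], [0.0, 0.5]])
g = np.array([1.0, -1.0])
s, lam = cubic_step_via_lm(B, g, M=1.0)
# The residual of (19) should be (numerically) zero:
print(np.linalg.norm(g + (B + 0.5 * 1.0 * np.linalg.norm(s) * np.eye(2)) @ s))
```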
We posit here that the theoretical difficulties encountered by CGM can be similarly
addressed via the cubic regularization approach, without increasing, and potentially even improving, its overall solution time. In this section, we start by deriving the update
formula for an iteration during which cubic regularization is used. Then, we will show
the impact of using cubic regularization to reduce the need for restarts in the algorithm.
In Sect. 4, our numerical results show the effectiveness of this approach in practice.
As discussed in the previous section, the use of cubic regularization necessitates
that we add a term to the approximate Hessian, that is, compute the step direction
$\Delta x_k$ using $B_{k+1} + \lambda I$, instead of $B_{k+1}$. While the regularization is applied to the
approximate Hessian itself, the CGM update formula (9) updates the inverse of the
approximate Hessian, that is Hk+1 = (Bk+1 )−1 . As such, in order to compute the
regularized step direction, we need to compute (Bk+1 + λI)−1 . In previous papers that
use regularization with BFGS or with Newton’s method, there is no direct formula
for computing this inverse. As such, the solution of multiple linear systems may be
required at each iteration until a suitable λ value is found, which means that despite
improved step directions that reduce the number of iterations, the effort and time per
iteration increase, potentially increasing overall solution time as well.
When applying the regularization to the CG update formula, however, we can derive
an explicit formula for (Bk+1 + λI)−1 . This is a significant advantage for CGM, in
that the use of cubic regularization does not incur additional computational burden at
each iteration.
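As an illustration that applying $(B_t + \lambda I)^{-1}$ to a vector needs only O(n) work, the sketch below uses two Sherman-Morrison updates [30], assuming $B_t$ as written in (20) below; the closed form derived in this section is what the algorithm actually uses, so this is only a generic stand-in with illustrative names.

```python
import numpy as np

def apply_Ht_lambda(v, p_t, y_t, lam):
    """Apply (B_t + lam*I)^{-1} to a vector v in O(n), with B_t taken to be the
    restart matrix in (20), via two Sherman-Morrison updates [30]."""
    py, yy, pp = p_t @ y_t, y_t @ y_t, p_t @ p_t
    sigma = yy / py + lam                 # scalar coefficient of I in B_t + lam*I
    coef = -(yy / py) / pp                # B_t + lam*I = sigma*I + coef*p p^T + (1/py) y y^T

    def apply_A1_inv(w):
        # A1 = sigma*I + coef * p_t p_t^T; invert with Sherman-Morrison on sigma*I
        pw = p_t @ w
        return w / sigma - (coef * pw / sigma**2) / (1.0 + coef * pp / sigma) * p_t

    A1inv_v = apply_A1_inv(v)
    A1inv_y = apply_A1_inv(y_t)
    denom = 1.0 + (y_t @ A1inv_y) / py    # second Sherman-Morrison denominator
    return A1inv_v - (y_t @ A1inv_v) / (py * denom) * A1inv_y

# Sanity check against a dense solve of (B_t + lam*I) x = v
rng = np.random.default_rng(1)
n = 6
p_t, y_t, v = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
if p_t @ y_t <= 0:
    y_t = -y_t                            # keep p_t^T y_t > 0
lam = 0.7
Bt = (((y_t @ y_t) / (p_t @ y_t)) * (np.eye(n) - np.outer(p_t, p_t) / (p_t @ p_t))
      + np.outer(y_t, y_t) / (p_t @ y_t))
assert np.allclose(apply_Ht_lambda(v, p_t, y_t, lam),
                   np.linalg.solve(Bt + lam * np.eye(n), v))
```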
We start by showing the formulae for Bt and Bk+1 .
$$B_t = \hat{H}_t^{-1} = \frac{y_t^T y_t}{p_t^T y_t}\left(I - \frac{p_t p_t^T}{p_t^T p_t} + \frac{y_t y_t^T}{y_t^T y_t}\right) \tag{20}$$
$$B_{k+1} = \hat{H}_{k+1}^{-1} = B_t - \frac{B_t p_k p_k^T B_t}{p_k^T B_t p_k} + \frac{y_k y_k^T}{p_k^T y_k} \tag{21}$$
Next, we will apply the regularization to Bk+1 and take its inverse. To do so, we
will first need to compute the inverse of Bt + λI (henceforth referred to as Ht (λ)):
where
$$a = \frac{y_t^T y_t}{p_t^T p_t}, \qquad b = \frac{y_t^T y_t}{p_t^T y_t} + \lambda, \qquad c = y_t^T y_t + \lambda\,p_t^T y_t.$$
In order to see that the formula (23) consists of a sum of rank-1 updates to Ht (λ), we
introduce the following intermediate calculation:
$$H_{k+1}(\lambda) = H_t(\lambda) - \frac{\tilde{p}_k^T y_k}{d}\left(\tilde{p}_k y_k^T H_t(\lambda) + H_t(\lambda) y_k \tilde{p}_k^T\right) + \frac{p_k^T y_k + y_k^T H_t(\lambda) y_k}{d}\,\tilde{p}_k \tilde{p}_k^T - \frac{p_k^T B_t p_k - p_k^T B_t \tilde{p}_k}{d}\,H_t(\lambda) y_k y_k^T H_t(\lambda), \tag{24}$$
where d is a scalar computed from the same vectors and dot products.
Now that we know how to calculate a step direction with cubic regularization, we need
to answer two questions:
1. When do we apply cubic regularization?
2. When applying cubic regularization, how do we choose a value for λ?
The answer to the first question determines how we answer the second one. As shown
above, the computation of the regularized step direction does not require significantly
more effort than the non-regularized one, so it remains to be seen whether being selective about when to apply the regularization is as important for CGM as it was for Newton's method and quasi-Newton methods.
To start with, we decided to try finding a pair (λ, α) which minimizes $f(x_k + \alpha\,\Delta x)$, where $\Delta x$ was calculated from a regularized step in every iteration. Our preliminary
numerical studies showed that this approach reduced the number of iterations for many
of the problems, but it significantly worsened the computational effort per iteration by
requiring an iterative approach that simultaneously optimized over two variables.
Instead, using [1] as a guide, we pursued the following approach: selectively use
cubic regularization whenever the non-regularized step direction failed to satisfy the
Powell criterion, was not a descent direction, or resulted in a line search failure. The
assumption here is that cubic regularization “improves” the step direction in some
sense, so it should be deployed when the step direction needs such improvement. In
our numerical studies, there were few to no instances of failure to obtain a descent
direction or line search failure, so we could not reliably assess the impact of cubic
regularization. However, the prevalence of Powell restarts, as will be noted in Sect. 4,
provided a good opportunity to test potential improvements.
When the non-regularized step leads to a Powell restart, we will set λ > 0 and try
a regularized step, with increasing values of λ until it no longer results in a Powell
restart. The initial value of λ for a Powell restart is computed as
$$\lambda = 5\,\frac{|\nabla f(x_{k+1})^T \nabla f(x_k)|}{\|\nabla f(x_{k+1})\|^2} \tag{25}$$
and doubled as needed. For each value of λ, we will choose a corresponding optimal
α. We have also added a safeguard to bound the number of λ updates by a constant
U , which helps the numerical stability and convergence results of the algorithm. If
the number of updates reaches U , we perform a restart. However, in our numerical
testing, we set U = 5 and this bound was never invoked.
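Schematically, the λ selection described above can be written as follows; the callback `powell_violated`, which stands in for recomputing the regularized direction and re-checking criterion (7), and all other names are illustrative, and this sketch is not Algorithm 2 itself.

```python
import numpy as np

def choose_lambda(g_new, g_old, powell_violated, U=5):
    """Schematic of the lambda selection described above.  powell_violated(lam)
    stands in for recomputing the regularized direction and re-checking the
    Powell criterion (7); all names here are illustrative."""
    lam = 5.0 * abs(g_new @ g_old) / (g_new @ g_new)   # initial value, eq. (25)
    for _ in range(U):
        if not powell_violated(lam):
            return lam          # regularized step accepted
        lam *= 2.0              # double lambda and try again
    return None                 # signal a restart after U unsuccessful updates

# Toy usage: accept once lambda exceeds some (made-up) level
g_new, g_old = np.array([1.0, -0.5]), np.array([0.8, 0.1])
print(choose_lambda(g_new, g_old, powell_violated=lambda lam: lam < 1.0))
```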
With all the details complete, we now describe the approach, called Hybrid Cubic
Regularization of CGM, as Algorithm 2.
3 Theoretical results
As stated at the beginning of Sect. 2, we had four dimensions to our goal of improving
step quality within CGM. Two of those goals were theoretical in nature:
– exhibit global convergence with fewer assumptions than its non-regularized ver-
sion.
– require the same order of computational burden per iteration as its non-regularized
version
In this section, we will demonstrate that we have achieved both of these goals. Since we have already mentioned the second goal, we will formalize it first.
We start by showing that the additional work per iteration performed by Algorithm 2
does not grow with the size of the problem.
Theorem 1 Computational effort per iteration for Algorithm 2 is of the same order as
the computational effort per iteration for Algorithm 1.
Proof The Beale restart and the non-regularized steps in Algorithm 2 match those of
Algorithm 1. Therefore, we only need to analyze the steps with cubic regularization.
We know that we will need to try at most U different λ values to exit the while-loop
in the cubic regularization step with a descent direction that satisfies (7) or with a
restart. We can also see that all of the components required to compute x with cubic
regularization using (24), that is, Ht (λ), p̃k , Bt , and d, can all be obtained using vector-
vector and scalar-vector operations of the form that does not necessitate the calculation
or storage of a matrix to obtain −Hk+1 ∇ f (xk ). The vectors used in these calculations
are the same ones as before (yt , pt , yk , and pk ), which means that the computational
burden per iteration of the new approach remains the same as in Algorithm 1.
We now show that Algorithm 2 is globally convergent. Our results utilize the con-
vergence results for Algorithm 1 as provided in [26], the main theorem of which we
have included here as Appendix A for completeness. The assumptions to establish
convergence for Algorithm 1 are stated in [26] as follows:
– The eigenvalues of the Hessian of f remain uniformly bounded above.
– f (x) is bounded below.
– The line search can find a descent direction providing sufficient descent at each
step.
The first assumption is used to show that the condition number of the Hessian remains
uniformly bounded, which leads directly to the proof of the theorem given here in
Appendix A. The last assumption is to ensure that the algorithm does not cycle. Addi-
tionally, there are mild assumptions (on the line search and the initialization/restart
scaling) stated throughout [26] to ensure that the Hessian estimate remains positive
definite, so we will consider this an assumption as well, even though it is not explicitly
stated in the proof included in Appendix A here.
If the Powell restart/cubic regularization step is not invoked, Algorithm 2 is equiv-
alent to Algorithm 1. Therefore, it suffices to analyze the iterations with cubic
regularization.
The key feature of the proof of the convergence result in [26] is that the condition
number of Hk remains bounded above. Since the only change to the algorithm is to
replace Hk with Hk (λ) = (Bk + λI)−1 in the cubic regularization step, we only need
to examine the condition number of Hk (λ) in order to use the rest of the proofs given
in [26].
Lemma 1 Let W be a symmetric, positive definite matrix, and let us denote its Euclidean norm condition number by κ(W). Then, for λ > 0,
$$\kappa\left((W^{-1} + \lambda I)^{-1}\right) \le \kappa(W).$$
Proof Let χmax and χmin be the maximum and minimum eigenvalues of W, respec-
tively. By the assumptions made in [26], we know that χmin > 0. Then,
$$\kappa\left((W^{-1} + \lambda I)^{-1}\right) = \kappa\left(W^{-1} + \lambda I\right) = \frac{(1/\chi_{\min}) + \lambda}{(1/\chi_{\max}) + \lambda} = \frac{\chi_{\max} + \lambda\,\chi_{\min}\chi_{\max}}{\chi_{\min} + \lambda\,\chi_{\min}\chi_{\max}} \le \frac{\chi_{\max}}{\chi_{\min}} = \kappa(W).$$
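Lemma 1 is also easy to confirm numerically; for example, the following check draws a random symmetric positive definite W and verifies the bound for several values of λ (a quick illustration only, not part of the paper's experiments).

```python
import numpy as np

# Numerical illustration of Lemma 1: kappa((W^{-1} + lam*I)^{-1}) <= kappa(W)
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
W = A @ A.T + 0.1 * np.eye(8)              # a random symmetric positive definite W
for lam in [1e-3, 1e-1, 1.0, 10.0]:
    reg = np.linalg.inv(np.linalg.inv(W) + lam * np.eye(8))
    assert np.linalg.cond(reg) <= np.linalg.cond(W) + 1e-8
```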
With Lemma 1 and the comment above, we can invoke Theorem 2 from Appendix
A to conclude that Algorithm 2 exhibits global convergence.
While Algorithm 2 only invokes cubic regularization for a Powell restart, it can be modified to invoke it in additional cases, such as when the search direction fails to be a descent direction. Doing so would mean that the assumption on
the line search and the assumption of the Hessian estimate remaining positive definite
are no longer necessary. Moreover, by Lemma 1, we can potentially relax the upper
bound on the condition number of the Hessian. We will investigate this in future work.
As such, using cubic regularization allows us to relax one or two of the assumptions
in the convergence proof of Algorithm 1.
4 Numerical results
We start this section by describing our testing environment and then we will introduce
our numerical results on general unconstrained NLPs from the CUTEst test set [10].
4.2 Parameters
In our numerical testing, we set the threshold criterion to $1 \times 10^{-6}$ for each algorithm and the bound U = 5 for Algorithm 2.
The test problems were compiled from the CUTEst test set [10] as implemented in
ampl [31]. We chose 230 unconstrained problems, which included all of the uncon-
strained problems available from [31] except for those where the objective function
could not be evaluated at the provided or default initial solution, where the initial
solution was a stationary point (0 iterations), or where the objective function was
unbounded below. There are 38 QPs and 192 NLPs in the set. We will focus on the
two groups of problems separately in the analysis below.
Of the 192 NLPs in the test set, 187 require at least one Powell restart. That is 97.4%, a significant portion. In fact, of the 5 problems that do not use Powell restarts,
4 conclude within the first two iterations without ever reaching a check for the Powell
restart and 1 has an objective function that is the weighted sum of a quadratic term
and a nonlinear term, where the weight and function values of the nonlinear term are
negligible with respect to the quadratic term (and as such, it is effectively a QP). It
is, therefore, safe to conclude that every NLP of interest requires at least one Powell
restart and focus on these 187 problems in our ongoing analysis.
Of the 187 problems, our main code fails on 30 problems (8 exit due to line search
failures, 3 exit due to function evaluation errors, and 19 reach the iteration limit of
10,000). The remaining 157 problems are reported as solved and show at least one
Powell restart. For each of these problems, we calculated the percentage of Powell
restart iterations as
$$\phi = 100 \times \frac{\text{number of Powell restart iterations}}{\text{total number of iterations}}.$$
The average value of φ is 45.5%, and its median is 45.7%. Note that we do not test
for a Powell restart during an iteration that is already slated for a Beale restart, so the
percentage of iterations that satisfy the Powell restart criterion (7) is actually slightly
higher at around 55%.
Given the prevalence of Powell restarts, it may be natural to ask how the algorithm
performs without them. When Powell restarts are disabled, the code still converges on
the same number of problems (27 of the 30 failures remain the same, 3 get resolved,
and there are 3 new failures). We get the same iteration count on 13 problems, the
code performs better with Powell restarts on 101 problems (80% fewer iterations on
average), and the code performs better without Powell restarts on 41 problems (30% fewer iterations on average).
We will now use an example to illustrate the impact of cubic regularization on the
satisfaction of the Powell restart criterion (7). More specifically, we will examine how
increasing values of λ impact the value of the fraction $|\nabla f(x_{k+1})^T \nabla f(x_k)|\,/\,\|\nabla f(x_{k+1})\|^2$ appearing in the Powell restart criterion (7).
We start our numerical comparison with general problems of the form (1) from the
CUTEst test set. Detailed results on these problems are provided in Table 2 in the
Appendix. The detailed results include the number of iterations, runtime in CPU sec-
onds, and the objective function value at the reported solution for each algorithm. The
first, labeled “With Powell Restarts,” is the algorithm implemented in Conmin-CG
[Fig. 1: λ versus the log of the Powell fraction in Iteration 3 of solving the problem s206. The first two iterations of the problem were solved without cubic regularization. This graph corresponds with Table 1, with more values of λ between 0 and 600 to obtain a smooth graph.]
and previously presented as Algorithm 1. The second, labeled “No Powell Restarts,”
is a modified version of Algorithm 1 with the check for Powell restarts removed, so
that only Beale restarts are performed. This algorithm is included in the tables to sup-
port the results provided in Section 4.3. The last algorithm, labeled “Hybrid Cubic,”
implements Algorithm 2, which uses a hybrid approach where cubic regularization is
only invoked when the Powell restart criterion (7) is satisfied.
The results in Table 2 and Figs. 2 and 3 show that hybrid cubic regularization
improves the number of iterations and the runtime on the CUTEst test set.
– For overall success, we have that “With Powell Restarts” and “Hybrid Cubic” each
solve 190 of the 230 problems, a rate of 82.6%. This includes 180 jointly solved
problems, 10 problems solved by “With Powell Restarts” only, and 10 problems
solved by “Hybrid Cubic” only.
– For iterations, we see in Fig. 2 that “Hybrid Cubic” outperforms “With Powell
Restarts” on the jointly solved problems. The graph on the left shows a scatterplot
where each point represents one problem, with the coordinates equaling iteration
counts by the two codes. (The line y = x is included to help our assessment.) In this
graph, there are 81 yellow squares representing instances where “Hybrid Cubic”
had fewer iterations, 59 purple dots where “With Powell Restarts” had fewer
iterations, and 40 blue triangles where they had the same number of iterations.
That means on 121 of the 180 jointly solved problems (67.2%), “Hybrid Cubic”
exhibits the same or fewer number of iterations as “With Powell Restarts.” The
performance profile for iterations, as shown in Fig. 3, indicates that “Hybrid Cubic”
outperforms “With Powell Restarts” on all 230 problems of the test set (a sketch of how such profiles can be computed appears after this list).
– It is interesting to note that the improvement becomes slightly more pronounced
if we use a typical iteration limit of 1,000 (instead of 10,000). In that case, there
are 166 jointly solved problems. “Hybrid Cubic” performs fewer iterations on
76, “With Powell Restarts” performs fewer iterations on 50, and the two codes
perform the same number of iterations on 40. That means on 116 of the 166 jointly solved problems, or 69.9%, “Hybrid Cubic” exhibits the same or fewer number of iterations as “With Powell Restarts.”
[Fig. 2: Pairwise comparisons of iterations and runtimes for Conmin-CG with Powell restarts and with hybrid cubic regularization. The iterations comparison was conducted on 180 out of the 230 unconstrained problems we solved from the CUTEst test set, and the runtimes comparison was conducted on 36 problems on which both solvers exhibited runtimes of at least 0.1 CPU seconds. Yellow squares denote the problems where the code with hybrid cubic regularization outperforms the code with Powell restarts, purple dots represent the opposite relationship, and blue triangles represent a tie (or runtimes within 0.1 of each other). The dotted black line is y = x.]
– There were 14 jointly solved problems where at least one solver performed over
1,000 iterations. On these instances, there is no clearly observed pattern, as such a high iteration count typically indicates that the problem is severely ill-conditioned or that f(x) is highly nonconvex. One example is the problem watson, with n = 31,
and where
$$f(x) = \sum_{i=1}^{29}\left(\sum_{j=2}^{n} (j-1)\,x_j\left(\frac{i}{29}\right)^{j-2} - \left(\sum_{j=1}^{n} x_j\left(\frac{i}{29}\right)^{j-2}\right)^2 - 1\right)^2$$
On this problem, “With Powell Restarts” performs 2248 and “Hybrid Cubic” per-
forms 8919 iterations to reach a solution. Similar behavior is observed on s371,
which is identical to watson but with n = 9. However, on dixmaani-dixmaanl,
where m = 1, 000, n = 3, 000, and
$$f(x) = 1.0 + \sum_{i=1}^{n} \hat{a}\left(\frac{i}{n}\right)^2 x_i^2 + \sum_{i=1}^{n-1} \hat{b}\,x_i^2\,(x_{i+1} + x_{i+1}^2)^2 + \sum_{i=1}^{2m} \hat{c}\,x_i^2\,x_{i+m}^4 + \sum_{i=1}^{m} \hat{d}\left(\frac{i}{n}\right)^2 x_i\,x_{i+2m}$$
for different values of â, b̂, ĉ, and d̂. On these problems, “With Powell Restarts” performs 2,000–6,000 iterations, which is 1,000–3,000 more than “Hybrid Cubic.”
[Fig. 3: Performance profiles of the iterations and runtime results from the CUTEst test set. The iterations comparison was conducted on all 230 unconstrained problems from the CUTEst test set, and the runtimes comparison was conducted on 36 problems on which both solvers exhibited runtimes of at least 0.1 CPU seconds.]
– For the runtimes comparisons, we have taken a subset of the jointly solved prob-
lems, namely those that were solved in 0.1 or more CPU seconds by both solvers.
(This is to ensure a fair comparison, as smaller runtimes can be easily influenced
by other processes running on the same machine and/or exhibit very small differ-
ences.) The resulting set consisted of 36 problems, of which “With Powell Restarts”
was faster on 13 and “Hybrid Cubic” was faster on 23. This means that “Hybrid
Cubic” resulted in faster runtimes on 63.4% of the problems with runtimes of at
least 0.1 CPU seconds. The runtime results shown in Figs. 2 and 3 support this
conclusion.
– For runtime comparison, it may be a good idea to consider small differences as a
tie. If runtimes within 0.1 CPU seconds of each other are considered a tie, then
“With Powell Restarts” was faster on 9, “Hybrid Cubic” was faster on 20, and we
considered 7 instances as a tie.
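For reference, performance profiles of the Dolan-Moré type shown in Fig. 3 can be computed from per-problem iteration counts or runtimes as in the following sketch; the function and variable names are illustrative, and this is not the code that generated the figures.

```python
import numpy as np

def performance_profile(T):
    """Dolan-More-style performance profile from a (problems x solvers) array T
    of iteration counts or runtimes, with np.inf marking failures."""
    T = np.asarray(T, dtype=float)
    best = T.min(axis=1, keepdims=True)             # best solver per problem
    ratios = T / best                               # performance ratios r_{p,s}
    taus = np.unique(ratios[np.isfinite(ratios)])
    # rho_s(tau) = fraction of problems solved within a factor tau of the best
    rho = np.array([[np.mean(ratios[:, s] <= tau) for s in range(T.shape[1])]
                    for tau in taus])
    return taus, rho

# Example with three problems and two solvers (iteration counts)
taus, rho = performance_profile([[50, 53], [20, 9], [877, 840]])
print(taus)
print(rho)
```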
5 Conclusion
Our goal in this paper was to incorporate cubic regularization selectively into a CGM
framework so that the resulting approach would
– require fewer iterations in computational experiments than its non-regularized
version,
– exhibit global convergence with fewer assumptions than its non-regularized ver-
sion,
– require the same order of computational burden per iteration as its non-regularized
version, and
– demonstrate faster overall runtime in computational experiments than its non-
regularized version.
Our global convergence results in Sect. 3 and our numerical results in Sect. 4 showed
that we were able to attain this goal fully.
We have some additional tasks to explore in future work. In our current implemen-
tation, we find the optimal α. However, we wish to explore how these results may be
affected with a fixed step size. In addition, we hope to extend our work to incorpo-
rate subgradients so that we can solve nondifferentiable problems. Finally, we plan to
apply CGM towards solving machine learning problems. Our framework is related to
a common neural network solver, Scaled Conjugate Gradient [19]. Thus, our work in
machine learning will include implementing Hybrid Cubic Regularization of CGM as
a solver for neural networks.
Appendix
We showed global convergence in Sect. 3.2 using Theorem 7 from [26]. This theorem
is given below for completeness.
Theorem 2 Let f (x) satisfy
where u is an arbitrary vector in Rn , G(x) = ∇ 2 f (x), 0 < m < ∞, and L > −∞.
Then, for Algorithm 1, if αk satisfies
Table 2 Numerical results on the unconstrained problems from the CUTEst test set [10]. Each row lists the problem name, the number of variables n, and then the iteration count, runtime (CPU seconds), and final objective value for each of the three codes, in the order “With Powell Restarts”, “No Powell Restarts”, and “Hybrid Cubic”. (E) denotes an exit due to an error (line search or function evaluation failure) and (IL) denotes reaching the iteration limit of 10,000
aircrftb 5 50 <0.1 3.1E-16 53 <0.1 3.8E-15 53 <0.1 5.2E-13
allinitu 4 20 <0.1 5.7E+00 9 <0.1 5.7E+00 7 <0.1 5.7E+00
arglina* 100 1 <0.1 1.0E+02 1 <0.1 1.0E+02 1 <0.1 1.0E+02
arglinb* 10 1 <0.1 4.6E+00 1 <0.1 4.6E+00 1 <0.1 4.6E+00
arglinc* 8 1 <0.1 6.1E+00 1 <0.1 6.1E+00 1 <0.1 6.1E+00
arwhead 5000 8 2.6E-01 −9.6E-10 9 <0.1 −1.4E-09 4 1.2E-01 −2.7E-09
bard 3 17 <0.1 8.2E-03 16 <0.1 8.2E-03 15 <0.1 8.2E-03
bdexp 5000 6 2.5E-01 8.0E-04 3 <0.1 7.3E-121 3 <0.1 7.3E-121
bdqrtic 1000 (E) (E) 438 4.2E+00 4.0E+03
beale 2 11 <0.1 8.9E-18 11 <0.1 5.5E-19 10 <0.1 6.2E-21
biggs3 3 14 <0.1 1.7E-10 16 <0.1 4.5E-13 28 <0.1 1.1E-11
biggs5 5 84 <0.1 5.7E-03 90 <0.1 5.7E-03 169 <0.1 5.7E-03
biggs6 6 62 <0.1 3.4E-07 110 <0.1 7.2E-09 76 <0.1 3.7E-06
box2 2 5 <0.1 4.2E-15 7 <0.1 1.7E-14 5 <0.1 3.5E-14
box3 3 9 <0.1 4.4E-12 10 <0.1 2.4E-11 12 <0.1 2.9E-11
bratu1d 1001 (E) (E) (IL)
brkmcc 2 4 <0.1 1.7E-01 5 <0.1 1.7E-01 5 <0.1 1.7E-01
brownal 10 7 <0.1 4.6E-16 7 <0.1 1.8E-15 5 <0.1 4.0E-15
brownbs 2 8 <0.1 2.1E-14 11 <0.1 2.4E-22 6 <0.1 2.2E-13
brownden 4 21 <0.1 8.6E+04 37 <0.1 8.6E+04 14 <0.1 8.6E+04
broydn7d 1000 339 3.3E-01 4.0E+02 332 2.0E-01 4.0E+02 345 2.0E-01 4.0E+02
brybnd 5000 12 3.3E-01 1.5E-12 14 3.0E-01 4.1E-12 13 3.0E-01 4.1E-13
chainwoo 1000 167 <0.1 1.0E+00 380 1.7E-01 4.6E+00 217 <0.1 1.0E+00
chnrosnb 50 218 <0.1 1.0E-13 258 <0.1 3.1E-14 232 <0.1 1.1E-13
cliff 2 (E) (E) (IL)
clplatea 4970 877 5.9E+00 −1.3E-02 1306 8.4E+00 −1.3E-02 840 4.3E+00 −1.3E-02
clplateb 4970 538 6.2E+00 −7.0E+00 680 4.4E+00 −7.0E+00 900 4.8E+00 −7.0E+00
dixmaanb 3000 7 1.1E-01 1.0E+00 11 <0.1 1.0E+00 4 <0.1 1.0E+00
dixmaanc 3000 8 1.0E-01 1.0E+00 13 <0.1 1.0E+00 5 <0.1 1.0E+00
dixmaand 3000 10 1.7E-01 1.0E+00 16 <0.1 1.0E+00 5 <0.1 1.0E+00
dixmaane 3000 255 3.2E-01 1.0E+00 305 4.7E-01 1.0E+00 256 3.0E-01 1.0E+00
dixmaanf 3000 195 6.1E-01 1.0E+00 248 8.4E-01 1.0E+00 268 7.0E-01 1.0E+00
dixmaang 3000 227 7.0E-01 1.0E+00 239 8.5E-01 1.0E+00 268 7.2E-01 1.0E+00
dixmaanh 3000 181 5.6E-01 1.0E+00 316 1.0E+00 1.0E+00 220 6.0E-01 1.0E+00
dixmaani 3000 6084 1.1E+01 1.0E+00 4793 7.2E+00 1.0E+00 4806 5.0E+00 1.0E+00
dixmaanj 3000 3816 1.2E+01 1.0E+00 1409 4.6E+00 1.0E+00 675 1.7E+00 1.0E+00
dixmaank 3000 3597 1.1E+01 1.0E+00 663 2.3E+00 1.0E+00 499 1.3E+00 1.0E+00
dixmaanl 3000 2050 6.3E+00 1.0E+00 709 2.3E+00 1.0E+00 380 1.0E+00 1.0E+00
dixon3dq* 10 10 <0.1 2.2E-28 10 <0.1 2.2E-28 10 <0.1 2.4E-28
dqdrtic* 5000 6 <0.1 2.3E-15 6 <0.1 2.3E-15 6 <0.1 2.3E-15
dqrtic 5000 13 5.7E-01 1.0E-01 164 4.2E-01 9.5E-02 7 2.5E-01 9.5E-03
edensch 2000 17 <0.1 1.2E+04 17 <0.1 1.2E+04 16 <0.1 1.2E+04
eg2 1000 2 <0.1 −1.0E+03 2 <0.1 −1.0E+03 2 <0.1 −1.0E+03
engval1 5000 14 2.8E-01 5.5E+03 21 2.2E-01 5.5E+03 8 1.3E-01 5.5E+03
engval2 3 28 <0.1 5.9E-13 36 <0.1 9.0E-17 29 <0.1 1.5E-13
errinros 50 259 <0.1 4.0E+01 432 <0.1 4.0E+01 394 <0.1 4.0E+01
expfit 2 11 <0.1 2.4E-01 12 <0.1 2.4E-01 5 <0.1 2.4E-01
fminsrf2 1024 223 1.9E-01 1.0E+00 384 3.2E-01 1.0E+00 1106 1.4E+00 1.0E+00
fminsurf 1024 202 1.6E-01 1.0E+00 339 3.5E-01 1.0E+00 420 5.8E-01 1.0E+00
freuroth 5000 22 7.3E-01 6.1E+05 28 3.1E-01 6.1E+05 54 7.9E+00 6.1E+05
genhumps 5 36 <0.1 7.0E-14 34 <0.1 5.5E-13 24 <0.1 3.5E-12
genrose 500 2668 6.8E-01 1.0E+00 3682 5.5E-01 1.0E+00 2572 9.5E-01 1.0E+00
growth 3 196 <0.1 1.0E+00 159 <0.1 1.0E+00 (IL)
growthls 3 169 <0.1 1.0E+00 177 <0.1 1.0E+00 (IL)
gulf 3 41 <0.1 4.7E-13 32 <0.1 4.7E-09 41 <0.1 4.9E-10
hairy 2 18 <0.1 2.0E+01 11 <0.1 2.0E+01 5 <0.1 2.0E+01
hatfldd 3 22 <0.1 2.5E-07 26 <0.1 2.5E-07 36 <0.1 2.6E-07
hatflde 3 35 <0.1 4.4E-07 28 <0.1 4.4E-07 44 <0.1 4.4E-07
heart6ls 6 (IL) (IL) (IL)
heart8ls 8 339 <0.1 1.5E-16 382 <0.1 1.5E-12 590 <0.1 5.2E-12
helix 3 21 <0.1 3.6E-17 25 <0.1 2.2E-17 18 <0.1 8.4E-23
hilberta* 10 8 <0.1 5.8E-10 8 <0.1 5.8E-10 8 <0.1 5.8E-10
hilbertb* 50 5 <0.1 2.1E-20 5 <0.1 2.1E-20 5 <0.1 2.1E-20
himmelbb 2 4 <0.1 1.7E-18 5 <0.1 1.7E-16 4 <0.1 5.8E-19
himmelbf 4 44 <0.1 3.2E+02 81 <0.1 3.2E+02 30 <0.1 3.2E+02
himmelbg 2 6 <0.1 1.6E-16 7 <0.1 9.4E-20 6 <0.1 1.3E-16
himmelbh 2 5 <0.1 −1.0E+00 5 <0.1 −1.0E+00 5 <0.1 −1.0E+00
humps 2 55 <0.1 2.1E-12 89 <0.1 4.6E-16 48 <0.1 2.2E-12
jensmp 2 15 <0.1 1.2E+02 16 <0.1 1.2E+02 6 <0.1 1.2E+02
kowosb 4 25 <0.1 3.1E-04 45 <0.1 3.1E-04 63 <0.1 3.1E-04
liarwhd 10000 14 1.8E+00 5.5E-15 18 3.3E-01 1.3E-17 12 7.1E-01 2.4E-19
loghairy 2 184 <0.1 1.8E-01 66 <0.1 1.8E-01 4 <0.1 6.2E+00
mancino 100 15 1.6E-01 3.6E-15 15 1.9E-01 5.5E-15 15 1.8E-01 3.9E-15
maratosb 2 (E) (E) 4 <0.1 −1.0E+00
methanb8 31 2101 <0.1 4.9E-05 1377 <0.1 5.2E-05 2602 1.5E-01 6.9E-05
methanl8 31 (IL) (IL) (IL)
mexhat 2 9 <0.1 −4.0E-02 6 <0.1 −4.0E-02 6 <0.1 −4.0E-02
meyer3 3 (E) (E) (IL)
minsurf 36 13 <0.1 1.0E+00 17 <0.1 1.0E+00 15 <0.1 1.0E+00
msqrtals 1024 3376 2.2E+01 2.7E-08 3658 2.9E+01 1.1E-08 3866 4.0E+01 4.7E-08
msqrtbls 1024 2376 1.6E+01 3.4E-09 2741 2.1E+01 1.7E-08 3630 3.8E+01 6.9E-09
palmer1d* 7 (E) (E) 8357 5.0E-01 6.5E-01
palmer1e 8 (IL) (IL) 4 <0.1 0.0E+00
palmer2c* 8 (IL) (IL) (IL)
palmer2e 8 (IL) (IL) (IL)
palmer3c* 8 5844 1.0E-01 2.0E-02 (IL) (IL)
palmer3e 8 (IL) 9272 1.9E-01 5.1E-05 (IL)
palmer4c* 8 3319 <0.1 5.0E-02 4021 <0.1 5.1E-02 (IL)
palmer4e 8 4453 <0.1 1.5E-04 6286 1.3E-01 1.5E-04 (IL)
palmer5c* 6 6 <0.1 2.1E+00 6 <0.1 2.1E+00 6 <0.1 2.1E+00
palmer5d* 4 9 <0.1 8.7E+01 9 <0.1 8.7E+01 8 <0.1 8.7E+01
palmer6c* 8 2336 <0.1 2.1E-02 6634 1.0E-01 1.9E-02 (IL)
palmer7c* 8 8231 1.4E-01 6.2E-01 (IL) (IL)
palmer8c* 8 1863 <0.1 1.6E-01 4321 <0.1 1.7E-01 (IL)
penalty1 1000 25 <0.1 9.7E-03 49 <0.1 9.7E-03 25 <0.1 9.7E-03
penalty2 100 86 <0.1 9.7E+04 (E) 63 <0.1 9.7E+04
penalty3 100 (E) (E) (IL)
pfit1 3 40 <0.1 2.9E-04 38 <0.1 2.9E-04 26 <0.1 2.9E-04
s240* 3 2 <0.1 3.8E-15 2 <0.1 3.8E-15 2 <0.1 3.8E-15
s243 3 9 <0.1 8.0E-01 9 <0.1 8.0E-01 9 <0.1 8.0E-01
s245 3 11 <0.1 1.7E-15 11 <0.1 2.7E-17 30 <0.1 6.1E-14
s246 3 17 <0.1 1.8E-16 18 <0.1 5.5E-18 18 <0.1 4.1E-20
s256 4 78 <0.1 2.0E-10 67 <0.1 2.2E-11 70 <0.1 5.6E-10
s258 4 48 <0.1 6.2E-13 27 <0.1 1.8E-15 24 <0.1 3.1E-17
s260 4 48 <0.1 6.2E-13 27 <0.1 1.8E-15 24 <0.1 3.1E-17
s261 4 27 <0.1 1.2E-09 37 <0.1 6.4E-10 59 <0.1 1.1E-09
s266 5 12 <0.1 1.0E+00 13 <0.1 1.0E+00 10 <0.1 1.0E+00
s267 5 75 <0.1 2.6E-03 68 <0.1 7.7E-09 163 <0.1 5.5E-07
s271* 6 6 <0.1 0.0E+00 6 <0.1 0.0E+00 6 <0.1 0.0E+00
s272 6 34 <0.1 5.7E-03 61 <0.1 5.7E-03 69 <0.1 5.7E-03
s272a 6 67 <0.1 3.4E-02 68 <0.1 3.4E-02 (IL)
s273 6 11 <0.1 5.3E-18 14 <0.1 6.3E-15 6 <0.1 1.0E-14
s274* 2 2 <0.1 2.6E-24 2 <0.1 2.6E-24 2 <0.1 2.6E-24
s275* 4 3 <0.1 6.0E-12 3 <0.1 6.0E-12 3 <0.1 6.0E-12
s276* 6 3 <0.1 1.5E-12 3 <0.1 1.5E-12 3 <0.1 1.5E-12
s281a* 10 11 <0.1 2.0E-15 11 <0.1 2.0E-15 11 <0.1 1.3E-16
s282 10 212 <0.1 2.7E-15 220 <0.1 1.2E-16 292 <0.1 1.3E-15
s283 10 52 <0.1 1.5E-09 117 <0.1 7.3E-09 49 <0.1 2.4E-09
s286 20 24 <0.1 6.3E-16 27 <0.1 7.2E-14 22 <0.1 1.7E-17
s287 20 54 <0.1 2.4E-17 36 <0.1 9.3E-16 30 <0.1 9.4E-15
s288 20 70 <0.1 3.2E-10 80 <0.1 4.0E-10 58 <0.1 6.9E-10
s296 16 115 <0.1 5.7E-14 134 <0.1 7.6E-15 117 <0.1 1.4E-14
s297 30 203 <0.1 1.6E-13 280 <0.1 1.3E-14 190 <0.1 1.1E-14
s298 50 288 <0.1 1.2E-14 413 <0.1 7.6E-15 317 <0.1 1.7E-14
s299 100 590 <0.1 5.3E-14 809 <0.1 1.1E-14 618 <0.1 3.7E-14
s300* 20 20 <0.1 −2.0E+01 20 <0.1 −2.0E+01 20 <0.1 −2.0E+01
s301* 50 50 <0.1 −5.0E+01 50 <0.1 −5.0E+01 50 <0.1 −5.0E+01
s302* 100 100 <0.1 −1.0E+02 100 <0.1 −1.0E+02 100 <0.1 −1.0E+02
s303 20 15 <0.1 6.7E-30 18 <0.1 3.3E-19 12 <0.1 8.1E-16
s304 50 11 <0.1 7.0E-14 19 <0.1 6.6E-25 9 <0.1 3.3E-24
s305 100 13 <0.1 4.2E-23 30 <0.1 1.6E-15 14 <0.1 3.1E-28
s308 2 7 <0.1 7.7E-01 8 <0.1 7.7E-01 7 <0.1 7.7E-01
s309 2 6 <0.1 2.9E-01 7 <0.1 2.9E-01 7 <0.1 2.9E-01
s311 2 6 <0.1 1.7E-14 7 <0.1 2.1E-23 5 <0.1 1.1E-19
s312 2 33 <0.1 5.9E+00 26 <0.1 5.9E+00 17 <0.1 5.9E+00
s314 2 4 <0.1 1.7E-01 5 <0.1 1.7E-01 5 <0.1 1.7E-01
s333 3 (E) (E) 3 <0.1 0.0E+00
s334 3 17 <0.1 8.2E-03 16 <0.1 8.2E-03 15 <0.1 8.2E-03
s350 4 25 <0.1 3.1E-04 45 <0.1 3.1E-04 63 <0.1 3.1E-04
s351 4 65 <0.1 3.2E+02 46 <0.1 3.2E+02 54 <0.1 3.2E+02
s352* 4 4 <0.1 9.0E+02 4 <0.1 9.0E+02 4 <0.1 9.0E+02
s370 6 72 <0.1 2.3E-03 80 <0.1 2.3E-03 98 <0.1 2.3E-03
s371 9 771 <0.1 1.8E-06 1838 <0.1 4.0E-06 3361 1.3E-01 5.2E-06
s379 11 159 <0.1 4.0E-02 159 <0.1 4.0E-02 151 <0.1 4.0E-02
s386* 2 2 <0.1 4.7E-27 2 <0.1 4.7E-27 2 <0.1 4.7E-27
sbrybnd 5000 (IL) (IL) (IL)
schmvett 10000 (E) (E) (E)
scosine 10000 (IL) (IL) (IL)
scurly10 10000 (IL) (IL) (IL)
scurly20 10000 (IL) (IL) (IL)
scurly30 10000 (IL) (IL) (IL)
Acknowledgements Sadly, David F. Shanno passed away in July 2019. The research documented in this
paper was started by Benson and Shanno in 2015 and was presented at the SIAM Optimization Meeting
in 2017. Cassidy Buhler joined the research group after Shanno’s passing. This work represents the last
project that Benson and Shanno completed in their 20 years of collaboration, and the authors hope that it is
a tribute to his legacy and a solid foundation for the next generation of researchers. The authors would like
to thank Drs. Müge Çapan, Vasilis Gkatzelis, Chelsey Hill, and Matthew Schneider for their feedback on an
earlier version of the paper. We are especially grateful to Dr. Gkatzelis for conversations on the theoretical
results in the paper.
Author contributions Hande Benson and David Shanno contributed to the study conception, and all authors
contributed to the design. Material preparation, data collection and analysis were performed by Cassidy
Buhler and Hande Benson. The first draft of the manuscript was written by Hande Benson and David
Shanno, and all authors commented on previous versions of the manuscript. Cassidy Buhler and Hande
Benson read and approved the final manuscript. Approval of the final manuscript was also provided by a
family representative of David Shanno after his passing.
Funding The authors declare that no funds, grants, or other support were received during the preparation
of this manuscript.
Data Availability The models analyzed during the current study are available in AMPL from https://vanderbei.princeton.edu/ampl/nlmodels/cute/index.html. These models have been converted to AMPL from SIF and originated from the CUTEst repository, https://github.com/ralna/CUTEst.
Declarations
Conflict of interest The authors have no relevant financial or non-financial interests to disclose.
Code availability The software Conmin-CG [3] is open source and available for download,
https://doi.org/10.5281/zenodo.13315592.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Benson, H., Shanno, D.: Interior-point methods for nonconvex nonlinear programming: cubic regular-
ization. Comput. Optim. Appl. 58 (2014)
2. Benson, H.Y., Shanno, D.F.: Cubic regularization in symmetric rank-1 quasi-Newton methods. Math.
Program. Comput. 10(4), 457–486 (2018)
3. Buhler, C.K.: Conmin-CG (2024). https://doi.org/10.5281/zenodo.13315592
4. Cartis, C., Gould, N., Toint, P.: Adaptive cubic regularisation methods for unconstrained optimization.
Part I: motivation, convergence and numerical results. Math. Program. 127, 245–295 (2011). https://
doi.org/10.1007/s10107-009-0286-5
5. Crowder, H., Wolfe, P.: Linear convergence of the conjugate gradient method. IBM J. Res. Dev. 16(4),
431–433 (1972)
6. Davidon, W.C.: Variance algorithm for minimization. Comput. J. 10(4), 406–410 (1968)
7. Duchi, J.C., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic
optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
8. Fletcher, R., Reeves, C.M.: Function minimization by conjugate gradients. Comput. J. 7(2), 149–154
(1964)
9. Fourer, R., Gay, D., Kernighan, B.: AMPL: A Modeling Language for Mathematical Programming.
Scientific Press (1993)
10. Gould, N.I., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with
safe threads for mathematical optimization. Comput. Optim. Appl. 60, 545–557 (2015)
11. Griewank, A.: The modification of Newton’s method for unconstrained optimization by bounding cubic
terms. Technical Report NA/12, University of Cambridge (1981)
12. Hestenes, M.R., Stiefel, E.: Methods of Conjugate Gradients for Solving Linear Systems, vol. 49. NBS
(1952)
13. Hinton, G., Srivastava, N., Swersky, K.: Neural networks for machine learning lecture 6a overview
of mini-batch gradient descent. (2012). Retrieved April 27, 2021, from https://www.cs.toronto.edu/
~hinton/coursera/lecture6/lec6.pdf
14. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the International
Conference on Learning Representations (ICLR) (2015)
15. Lenard, M.: Practical convergence conditions for unconstrained optimization. Math. Program. 4, 309–
323 (1973)
16. Levenberg, K.: A method for the solution of certain problems in least squares. Q. Appl. Math. 2,
164–168 (1944)
17. Marquardt, D.: An algorithm for least-squares estimation of nonlinear parameters. SIAM J. Appl.
Math. 11, 431–441 (1963)
18. MathWorks: Choose a multilayer neural network training function. MathWorks Docu-
mentation. https://www.mathworks.com/help/deeplearning/ug/choose-a-multilayer-neural-network-
training-function.html. Retrieved April 27, 2021
19. Møller, M.F.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 6(4),
525–533 (1993)
20. Nesterov, Y., Polyak, B.: Cubic regularization of Newton method and its global performance. Math.
Program. 108, 177–205 (2006). https://doi.org/10.1007/s10107-006-0706-8
21. Oren, S.S., Spedicato, E.: Optimal conditioning of self-scaling variable metric algorithms. Math. Pro-
gram. 10(1), 70–90 (1976)
22. Perry, A.: Technical note-A modified conjugate gradient algorithm. Oper. Res. 26(6), 1073–1078
(1978). https://doi.org/10.1287/opre.26.6.1073
23. Polak, E., Ribière, G.: Note sur la convergence de méthodes de directions conjuguées. Revue française
d’informatique et de recherche opérationnelle. Série rouge 16, 35–43 (1969)
24. Powell, M.J.D.: Some convergence properties of the conjugate gradient method. Math. Program. 11(1),
42–49 (1976)
25. Powell, M.J.D.: Restart procedures for the conjugate gradient method. Math. Program. 12(1), 241–254
(1977)
26. Shanno, D.: On the convergence of a new conjugate gradient algorithm. SIAM J. Numer. Anal. 15(6),
1247–1257 (1978)
27. Shanno, D.F.: Conjugate gradient methods with inexact searches. Math. Oper. Res. 3(3), 244–256
(1978)
28. Shanno, D.F., Phua, K.H.: Algorithm 500: Minimization of unconstrained multivariate functions [e4].
ACM Trans. Math. Softw. (TOMS) 2(1), 87–94 (1976)
29. Shanno, D.F., Phua, K.H.: Matrix conditioning and nonlinear optimization. Math. Program. 14, 149–
160 (1978)
30. Sherman, J., Morrison, W.J.: Adjustment of an inverse matrix corresponding to changes in the elements
of a given column or a given row of the original matrix. In: Annals of Mathematical Statistics, vol.
20(4), pp. 621–621 (1949)
31. Vanderbei, R.J.: AMPL models (1997). https://vanderbei.princeton.edu/ampl/nlmodels/cute/index.
html
32. Weiser, M., Deuflhard, P., Erdmann, B.: Affine conjugate adaptive Newton methods for nonlinear
elastomechanics. Optim. Methods Softw. 22(3), 413–431 (2007)
33. Zeiler, M.D.: Adadelta: An adaptive learning rate method. In: Proceedings of the International Confer-
ence on Machine Learning (ICML), vol. 28, pp. 105–112 (2012). https://www.jmlr.org/proceedings/
papers/v28/zeiler13.pdf
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.