
Higher-Order Newton Methods

with Polynomial Work per Iteration



Amir Ali Ahmadi∗, Abraar Chaudhry∗, Jeffrey Zhang†

Abstract
We present generalizations of Newton’s method that incorporate deriva-
tives of an arbitrary order d but maintain a polynomial dependence on
dimension in their cost per iteration. At each step, our dth -order method
uses semidefinite programming to construct and minimize a sum of squares-
convex approximation to the dth -order Taylor expansion of the function
we wish to minimize. We prove that our dth -order method has local con-
vergence of order d. This results in lower oracle complexity compared
to the classical Newton method. We show on numerical examples that
basins of attraction around local minima can get larger as d increases.
Under additional assumptions, we present a modified algorithm, again
with polynomial cost per iteration, which is globally convergent and has
local convergence of order d.
Keywords. Newton’s method, tensor methods, semidefinite programming,
sum of squares methods, convergence analysis.

1 Introduction
Newton’s method is perhaps one of the most well-known and prominent algo-
rithms in optimization. In its attempt to minimize a function f : Rn → R,
this algorithm replaces f with its second-order Taylor expansion at an iterate
xk ∈ Rn and defines the next iterate xk+1 to be a critical point of this quadratic
approximation. This critical point coincides with a minimizer of the quadratic
approximation in the case where the Hessian of f at xk is positive semidefinite.
The work required in each iteration of Newton’s method consists of solving a
system of linear equations which arises from setting the gradient of the quadratic
approximation to zero. This can be carried out in time that grows polynomially
with the dimension n. Perhaps the most well-known theorem about the perfor-
mance of Newton’s method is its local quadratic convergence. More precisely,
under the assumptions that the second derivative of f is locally Lipschitz around
a local minimizer x∗ , and that the Hessian at x∗ is positive definite, there exists
∗ Princeton University, Operations Research and Financial Engineering. AAA and AC were partially supported by the MURI award of the AFOSR and the Sloan Fellowship.
† Yale University, Department of Biomedical Informatics and Data Science.

a full-dimensional basin around x∗ and a constant c, such that if x0 is in this
basin, one has
    ∥xk+1 − x∗∥ ≤ c ∥xk − x∗∥^2
for all k ≥ 0. We note however that Newton’s method is in general not globally
convergent. Lack of global convergence can occur even when in addition to the
previous assumptions, f is assumed to be strongly convex (see, e.g., Example 5.1
in Section 5).
As higher-order Taylor expansions provide closer local approximations to the
function f , it is natural to ask why Newton’s method limits the order of Taylor
approximation to 2. The main barrier to higher-order Newton methods is the
computational burden associated with minimizing polynomials of degree larger
than 2 which would arise from higher-order Taylor expansions. For instance, any
of the following tasks that one could consider for each iteration of a higher-order
Newton method are in general NP-hard:

(i) finding a global minimum of polynomials of degree that is even1 and at least 4 (see, e.g., [39]),

(ii) finding a local minimum of polynomials of degree at least 4 (see [8, Theo-
rem 2.1]),
(iii) finding a second-order point (i.e., a point where the gradient vanishes and
the Hessian is positive semidefinite) of polynomials of degree at least 4
(see [7, Theorem 2.2]),
(iv) finding a critical point (i.e., a point where the gradient vanishes) of poly-
nomials of degree at least 3 (see [7, Theorem 2.1]).

In addition to matters related to computation, there are geometric distinctions between Newton's method and higher-order analogues of it. For example,
even when the function f is strongly convex and the starting iterate is arbitrar-
ily close to its minimizer, Taylor expansions of even degree and larger than 2
may not be bounded below. One can see this by examining the strongly convex
univariate function f (x) = x2 − x4 + x6 and its 4th order Taylor expansion near
the origin.
Despite these barriers, the question of whether one can make higher-order
Newton methods tractable and in some way superior to Newton’s method has
been considered at least since the work of Chebyshev [20] (see Section 1.1 for
more recent literature). More specifically, the question that is of interest to us
is whether it is possible to design a higher-order Newton method (i.e., a method
which utilizes a Taylor expansion of degree d > 2 in each iteration) in such a
way that (i) the work per iteration grows polynomially with the dimension, and
(ii) the local order of convergence grows with d, hence requiring fewer function
evaluations as d increases. In this paper, we show that this is indeed possible
(Algorithm 1 and Theorem 4).
1 Note that odd-degree polynomials are unbounded below.

Our algorithm relies on sum of squares techniques in optimization [44], [30]
and semidefinite programming and does not require the function f to be convex.
For any fixed degree d, our approach is to approximate the d-th order Taylor
expansion of f with an “sos-convex” polynomial (see Section 2 for a definition).
Sos-convex polynomials form a subclass of convex polynomials whose convexity
has an explicit algebraic proof. One can then use a first-order sum of squares
relaxation to minimize this sos-convex polynomial. It turns out that both the
task of finding a suitable sos-convex polynomial and that of minimizing it can
be carried out by solving two semidefinite programs whose sizes are polynomial
in the dimension n (in fact of the same order as the number of terms in the
Taylor expansion). As is well known, semidefinite programs can be solved to
arbitrary accuracy in polynomial time; see [48] and references therein.
We work with sos-convex polynomials instead of general convex polynomials
since the latter set lacks a tractable description [5], and the former, as we show,
turns out to be sufficient for achieving an algorithm with superlinear local con-
vergence. Our sum of squares based algorithm works for higher-order Newton
methods of any order d and can be easily implemented using any sum of squares
parser (e.g., YALMIP [35] or SOSTOOLS [45]). This is in contrast to previous
work where implementable algorithms have been worked out only for d = 3 ; see
[40], [23, Sect. 1.5], [25, Sect. 5]. While we present our algorithms in the un-
constrained case, they can be readily implemented in the presence of sos-convex
constraints (such as linear constraints or convex quadratic constraints). We
note, however, that our interest in this paper is only on generalizing Newton’s
method in terms of its convergence order and polynomial work per iteration,
and not on the practical aspects of implementation. Designing more scalable
algorithms for semidefinite programs is an active area of research [36, 49]. In
addition, we believe that there are promising future research directions which
could make our algorithms more practical at larger scale (see Section 7).

1.1 Related Work


Over the years, there have been many adaptations of and extensions to Newton’s
method. A primary example is the pioneering work of Nesterov and Polyak [41],
where the idea of Newton’s method with cubic regularization was introduced.
We do not review the large literature that emerged from this work since the
order d of Taylor expansion in this line of work is still equal to 2, and hence
these methods are not considered “higher-order” (i.e., d > 2). However, the
framework that we propose, similar to most of the literature, follows the struc-
ture of [41] (and [33, 37]) in terms of minimizing, in each iteration, a Taylor
expansion of a certain order plus an appropriate regularization term. Recently,
there has been a body of work following this structure with Taylor expansions
of order higher than two [40, 9, 12, 28, 29, 25]. Unlike our paper, these works
are in the setting of convex optimization, do not study the complexity of mini-
mizing the regularized Taylor expansion in each iteration (except in the case of
d = 3 for a subset of these papers), and derive sublinear rates of global conver-
gence. There has also been work on lower bounds on the rates of convergence

for such methods [11, 1, 13, 40]. These lower bounds are nearly achieved by the
algorithms in the aforementioned papers. The recent textbook [17] provides an
accessible summary of this literature and its broader scope. See also [14, 16, 15]
and references therein.
In terms of work per iteration of higher-order Newton methods, Nesterov
presents a polynomial-time algorithm in [40] for minimizing a quartically-regularized
third-order Taylor expansion. This problem is revisited recently in [18], where an
algorithm for recovering an approximate second-order point for a possibly non-
convex quartically-regularized third-order Taylor expansion is presented. In [47],
a different third-order Newton method is presented which has polynomial work
per iteration. In each iteration, this algorithm moves to a local minimum of the
third-order Taylor expansion. It turns out that local minima of cubic polyno-
mials can be found by semidefinite programs of polynomial size [7]. To the best
of our knowledge, no efficient algorithm for higher-order Newton methods of de-
gree d > 3 has been presented. In fact, designing such an algorithm is referred
to as an open problem in [23, Sec. 1.5] and [25, Sec. 5]. Interestingly, Nesterov
asks in [40, Sec. 6] whether it is possible to tackle this problem using “some
tools from algebraic geometry and the related technique of sums of squares”.
This is precisely the approach that we take in this paper.
To our knowledge, the only works that establish superlinear rates of local
convergence for higher-order Newton methods are [47] and [24] (and the related
PhD thesis [23]), the latter of which came to our attention at the time of writing
this paper. In [47], the authors establish third-order local convergence rate
for an unregularized third-order Newton method applied to a strongly convex
function. In [24], the authors establish superlinear local convergence for higher-
order Newton methods applied to convex optimization problems with composite
objective. When the smooth part of the objective function is strongly convex,
the authors show local convergence of order d in function value and norm of
the subgradient for their proposed dth -order Newton method. An algorithm
carrying out the work per iteration of this method, however, is available only
in the case of d = 3 (and is the same as that in [40]). Moreover, similar to
much of the literature, the regularization term that is added to the Taylor
expansion in this method requires knowledge of the Lipschitz constant of the
dth derivative of f . Our proof technique for local superlinear convergence is
different than [24] both in the parts where the sum of squares programming
aspects come in and in the parts that they do not. Furthermore, our method
has polynomial work per iteration for any degree d. It also does not rely on
knowledge of any Lipschitz constants. Our regularization term is instead derived
from the optimal value of a semidefinite program which can be written down
from the coefficients of the Taylor expansion alone. This optimized approach can
potentially lead to smaller deviations from the Taylor expansion and therefore an
improved convergence factor. Finally, we note that in our work, assumptions on
convexity of f and knowledge of the Lipschitz constant of its dth derivative are
made only in Section 6, where global convergence is established. Our approach in
Section 6 is based on incorporating sum of squares methods into the framework
of Nesterov in [40], though in theory this can also be done with other globally

convergent higher-order Newton methods. In fact, at the time of revision, there
has already been interesting follow-up work to our paper which combines our
sum of squares framework with adaptive regularization techniques for tensor
methods and analyzes the complexity of the resulting algorithm for finding an
approximate stationary point of a nonconvex function [19].

1.2 Organization and Contributions


In Section 2, we review preliminaries on sos-convexity, sos-convex polynomial
optimization, and error rates of derivatives of Taylor expansions. In Section 3,
we present our main algorithm (Algorithm 1). In Section 4, we prove that
our algorithm is well-defined in the sense that the semidefinite programs it
executes are always feasible and that the next iterate is always uniquely defined
(Theorem 3). We then prove that our semidefinite programming-based dth -order
Newton scheme has local convergence of order d (Theorem 4). Compared to
the classical Newton method, this leads to fewer calls to the Taylor expansion
oracle (a common oracle in this literature; see e.g., [12], [29, Sect. 2.2], [1,
Sect. 1.1], [11, Sect. 2], [17, Chap. 1.2]) at the price of requiring higher-
order derivatives. The proof of Theorem 4 is more involved than the proof of
local quadratic convergence of Newton’s method. This is in part because the
expression for the next iterate of Newton’s method is explicit, whereas our next
iterate comes from the solution to two semidefinite programs. We also remark
that our proof framework is applicable to a broader class of higher-order Newton
methods that may not necessarily use sum of squares techniques.
In Section 5, we present three numerical examples. We give an explicit
expression and a geometric interpretation of our third-order Newton method in
dimension one. We compare the basins of attraction of local minima for our
higher-order methods to those of the classical Newton method. In Section 6,
we present a slightly modified higher-order Newton method which is globally
convergent under additional convexity and Lipschitzness assumptions similar
to those in [40]. This modified algorithm works in the case of d being an odd
integer and still has polynomial work per iteration and local convergence of
order d. Finally, in Section 7, we present a few directions for future research.

2 Preliminaries
2.1 SOS-Convex Polynomial Optimization
In each iteration of the higher-order Newton methods that we propose, two
semidefinite programs (SDPs) need to be solved. These SDPs arise from the
notion of sos-convexity, which is reviewed in this subsection.
Definition 1. A polynomial p : Rn → R is said to be a sum of squares (sos) if there exist polynomials q1 , . . . , qr : Rn → R such that p = Σ_{i=1}^{r} qi^2 .
As is well known, one can check if a polynomial is sos by solving an SDP.
The next theorem establishes this link. We denote that a symmetric matrix

A is positive semidefinite (i.e., has nonnegative eigenvalues) with the standard
notation A ⪰ 0.
Theorem 1 (see, e.g., [44]). For a variable x ∈ Rn and an even integer d, let φ_{d/2}(x) denote the vector of all monomials of degree at most d/2 in x. A polynomial p : Rn → R of degree d is sos if and only if there exists a symmetric matrix Q such that (i) p(x) = φ_{d/2}(x)^T Q φ_{d/2}(x) for all x ∈ Rn , and (ii) Q ⪰ 0.

The first constraint above can be written as a finite number of linear equa-
tions by coefficient matching. Therefore, the two constraints together represent
the intersection of an affine subspace with the cone of positive semidefinite ma-
trices. Thus, as polynomials can be encoded as an ordered vector of coefficients,
the set of sos polynomials of a given degree has a description as the feasible
region of a semidefinite program. Furthermore, the size of this SDP grows
polynomially in n when d is fixed.
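To make the coefficient-matching step concrete, here is a minimal sketch (ours, not from the paper) of the SDP in Theorem 1 for the univariate polynomial p(x) = x^4 + 2x^2 + 1, written in Python with cvxpy; the package choice and variable names are our own, and any SDP-capable solver installed with cvxpy can be used.

# Gram-matrix SDP of Theorem 1 for p(x) = x^4 + 2x^2 + 1 with monomial vector [1, x, x^2]:
# find a symmetric Q >= 0 with p(x) = [1, x, x^2] Q [1, x, x^2]^T (coefficient matching).
import cvxpy as cp

Q = cp.Variable((3, 3), symmetric=True)
constraints = [
    Q >> 0,                       # Q positive semidefinite
    Q[0, 0] == 1,                 # constant term
    2 * Q[0, 1] == 0,             # coefficient of x
    2 * Q[0, 2] + Q[1, 1] == 2,   # coefficient of x^2
    2 * Q[1, 2] == 0,             # coefficient of x^3
    Q[2, 2] == 1,                 # coefficient of x^4
]
problem = cp.Problem(cp.Minimize(0), constraints)
problem.solve()
print(problem.status)             # "optimal" certifies that p is a sum of squares

Feasibility of this SDP certifies that p is sos (indeed, p = (x^2 + 1)^2), and the number of monomials, and hence the size of Q, grows polynomially in the number of variables for fixed degree.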
Throughout this paper, we denote the gradient vector (resp. Hessian matrix) of a function g : Rn → R with the standard notation ∇g (resp. ∇2 g).
Definition 2 (SOS-Convex). A polynomial p : Rn → R is said to be sos-convex if the polynomial q : Rn × Rn → R defined as q(x, y) := y^T ∇2 p(x) y is sos.
Note that any sos-convex polynomial is convex. The converse statement is not true, except for certain dimensions and degrees (see [6]). By Theorem 1 above, the set of sos-convex polynomials of a given degree also forms the feasible region of a semidefinite program. Because the polynomial q(x, y) is quadratic in y, one can reduce the size of the underlying SDP. More specifically, a polynomial p : Rn → R of degree d is sos-convex2 if and only if there exists a symmetric matrix Q ⪰ 0 such that y^T ∇2 p(x) y = (φ_{d/2−1}(x) ⊗ y)^T Q (φ_{d/2−1}(x) ⊗ y). (Here, ⊗ denotes the Kronecker product.) We see that the size of the SDP that represents
sos-convex polynomials of degree d in n variables grows polynomially in n when
d is fixed.
We next explain why sos-convex polynomial optimization problems can be solved with the first level of the so-called Lasserre hierarchy. A polynomial optimization problem is a problem of the form

    inf_{x ∈ Rn}  g0(x)
    s.t.  gj(x) ≤ 0,  j = 1, . . . , m,                                   (1)

where gj(x) are real-valued polynomial functions of a variable x ∈ Rn . The first-level Lasserre relaxation (see [30]) corresponding to problem (1) takes the form

    sup_{γ ∈ R, λ ∈ Rm}  γ
    s.t.  g0(x) − γ + Σ_{j=1}^{m} λj gj(x)  is sos,                       (2)
          λj ≥ 0,  j = 1, . . . , m.
2 Note that an odd-degree polynomial can never be convex, except for the trivial case of affine polynomials.

The reader can check that the optimal value of (2) is always a lower bound
on that of (1). The next theorem establishes that this lower-bound is tight when
the defining polynomials of (1) are sos-convex.
Theorem 2 (See Corollary 2.5 from [31], and Theorem 3.3 from [32]). Suppose
that the polynomials g0 , . . . , gm in (1) are sos-convex, the optimal value of (1) is
finite, and that the Slater condition holds3 . Then, the optimal values of (1) and
(2) are the same. Moreover, an optimal solution to (1) can be readily recovered
from a solution to the semidefinite program that is dual to (2).
This result is already proven by Lasserre in [32] using a lemma of Helton
and Nie from [26]. For completeness and for the benefit of the reader, we give
an alternative short proof of the first claim.
Proof. Recalling that an sos polynomial is nonnegative and that λj ≥ 0 for
j = 1, . . . , m, it is easy to see that the optimal value of (1) is larger than or
equal to the optimal value of (2). To show the opposite inequality, let γ ∗ be the
optimal value of (1). Then, the convex function x ↦ g0(x) − γ∗ is nonnegative over the set {x | gj(x) ≤ 0, j = 1, . . . , m}. By the convex Farkas lemma (see, e.g., [46, Theorem 2.1]), there exists a nonnegative vector λ∗ ∈ Rm such that p(x) := g0(x) − γ∗ + Σ_{j=1}^{m} λj∗ gj(x) ≥ 0 for all x ∈ Rn . Notice that p(x) is sos-
convex since it is a conic combination of sos-convex polynomials. Thus, by [6,
Theorem 3.1], the polynomial q(x, y) := p(y) − p(x) − ∇p(x)T (y − x) is sos. Let
x∗ be an optimal solution to (1) (such a vector must exist [10]). Observe that
the polynomial y ↦ q(x∗ , y) is also sos (since it is the restriction of q(x, y) to
x = x∗ ). By optimality of x∗ to (1), we have p(x∗ ) ≤ 0. Since p is nonnegative,
we have p(x∗ ) = 0 and ∇p(x∗ ) = 0. Thus, p(y) = q(x∗ , y), and hence p(y) must
be sos. Therefore, γ ∗ , λ∗ is feasible to (2), and hence the optimal value of (2) is
at least γ ∗ ; i.e., the optimal value of (1).

For a proof of the second claim and an explicit expression of the dual of (2),
see Theorem 3.3 from [32].
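As a toy illustration of relaxation (2) (our own example, not from the paper), consider minimizing g0(x) = x^4 + x^2 subject to g1(x) = x − 1 ≤ 0. Both polynomials are sos-convex and the Slater condition holds (e.g., at x̄ = 0), so by Theorem 2 the relaxation is tight. The sketch below, again assuming cvxpy, encodes the sos constraint in (2) by coefficient matching over the monomials (1, x, x^2) and returns the true optimal value 0.

# First-level Lasserre relaxation (2) for: inf x^4 + x^2  s.t.  x - 1 <= 0.
# The sos constraint on s(x) = g0(x) - gamma + lam*(x - 1) is encoded as
# s(x) = [1, x, x^2] Q [1, x, x^2]^T with Q >= 0.
import cvxpy as cp

gamma = cp.Variable()
lam = cp.Variable(nonneg=True)
Q = cp.Variable((3, 3), symmetric=True)
constraints = [
    Q >> 0,
    Q[0, 0] == -gamma - lam,        # constant term of s
    2 * Q[0, 1] == lam,             # coefficient of x
    2 * Q[0, 2] + Q[1, 1] == 1,     # coefficient of x^2
    2 * Q[1, 2] == 0,               # coefficient of x^3
    Q[2, 2] == 1,                   # coefficient of x^4
]
cp.Problem(cp.Maximize(gamma), constraints).solve()
print(gamma.value)                  # approximately 0, matching the minimum of g0 at x = 0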

2.2 Error rates of Taylor remainders


In this subsection, we review certain error rates of multivariate Taylor expan-
sions that will be used in our arguments. We denote by ∇d f the dth order
symmetric tensor of order-d partial derivatives of the function f . We denote
the tensor product of a set of vectors x1 , . . . , xd ∈ Rn with x1 ⊠ x2 ⊠ . . . ⊠ xd .4
We use the notation x⊠d to denote the tensor product of a vector x ∈ Rn with
itself d times. With this notation, we can define the dth -order Taylor expansion
of a d-times differentiable function f at a point x̄ as

    Tx̄,d(x) := f(x̄) + Σ_{i=1}^{d} (1/i!) ⟨∇^i f(x̄), (x − x̄)^{⊠i}⟩,

3 That is, there exists some x̄ ∈ Rn such that gj (x̄) < 0 for all j = 1, . . . , m.
4 We use this slightly nonstandard notation to avoid confusion with the Kronecker product.

where ⟨·, ·⟩ denotes the standard tensor inner product. The remainder or error
term of the Taylor expansion is

Rx̄,d (x) := f (x) − Tx̄,d (x).

For a dth -order tensor D, let us define the following norm

    ∥D∥ := max_{∥x1∥,...,∥xd∥ ≤ 1} ⟨D, x1 ⊠ x2 ⊠ . . . ⊠ xd⟩,

where ||xi || denotes the Euclidean 2-norm of the vector xi ∈ Rn . Note that for
cases of d = 1 and d = 2, this expression reduces to the standard Euclidean
norm and the spectral norm, respectively.
We will need the following lemma in Section 4.
Lemma 1 (see, e.g., inequality (11) in [9]). Fix a vector x̄ ∈ Rn . Suppose ∇d f
has a Lipschitz constant L over a convex set C containing x̄, i.e.,

∥∇d f (x) − ∇d f (y)∥ ≤ L∥x − y∥

for all x, y ∈ C. Then, for any x ∈ C, we have

    ∥∇Rx̄,d(x)∥ ≤ (L/d!) ∥x − x̄∥^d

and

    ∥∇2 Rx̄,d(x)∥ ≤ (L/(d − 1)!) ∥x − x̄∥^{d−1}.
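As a quick numerical sanity check (ours, not part of the paper), the snippet below verifies the first inequality for the univariate function f(x) = e^x with x̄ = 0, d = 3, and C = [−1, 1], where L = e is a valid Lipschitz constant for f′′′ since |f′′′′| ≤ e on C.

# Check of the gradient bound in Lemma 1 for f(x) = exp(x), xbar = 0, d = 3.
# Here T_{0,3}(x) = 1 + x + x^2/2 + x^3/6, so the gradient of the remainder is
# exp(x) - (1 + x + x^2/2), and L = e works as a Lipschitz constant for f''' on [-1, 1].
import math

L, d = math.e, 3
for i in range(-100, 101):
    x = i / 100.0
    grad_remainder = math.exp(x) - (1.0 + x + x**2 / 2.0)
    bound = (L / math.factorial(d)) * abs(x) ** d
    assert abs(grad_remainder) <= bound + 1e-12, (x, grad_remainder, bound)
print("bound verified on a grid over [-1, 1]")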

3 Algorithm Definition
For a given integer d ≥ 3, we consider the task of minimizing a function f which
is assumed to have derivatives up to order d, and a local minimum x∗ satisfying
∇2 f (x∗ ) ≻ 0. We also assume that the dth derivative of f is locally Lipschitz
around the point x∗ , i.e., there is a radius rL > 0, and a scalar L ≥ 0, such that
for points x, y in the set {z ∈ Rn | ∥z − x∗ ∥ ≤ rL }, we have

∥∇d f (x) − ∇d f (y)∥ ≤ L∥x − y∥.

Note that the latter assumption is always satisfied if the (d + 1)th derivative of f exists and is continuous. Our goal is to minimize f by iteratively minimizing a surrogate function of the type

    Txk,d(x) + t ||x − xk||^{d′},

where xk is our current iterate, Txk,d is the Taylor expansion of f of order d at xk , d′ is the smallest even integer greater than d (as we require the surrogate to be a polynomial), and t is chosen according to the following sum of squares program:

    min_{t ∈ R}  t
    s.t.  Txk,d(x) + t ||x − xk||^{d′}  is sos-convex,                    (3)
          t ≥ 0.
In view of Theorem 1 and the remarks after Definition 2, this program can
be reformulated as an SDP of size polynomial in n. Letting t(xk ) denote the
optimal value of (3) for a given xk , we define our surrogate function to be

    ψxk,d(x) := Txk,d(x) + t(xk) ||x − xk||^{d′}.                         (4)
In our algorithm, we choose xk+1 to be the minimizer of ψxk ,d (which exists and
is unique; see Theorem 3 below). By Theorem 2, since ψxk ,d is sos-convex, we
can find its minimizer via another SDP of size polynomial in n.
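As a small worked illustration (ours), take the function f(x) = x^2 − x^4 + x^6 from the introduction, with xk = 0 and d = 4, so that d′ = 6. Then T0,4(x) = x^2 − x^4, which is unbounded below, and the surrogate is x^2 − x^4 + t x^6. Its second derivative, 2 − 12x^2 + 30t x^4, is nonnegative on R (equivalently, the surrogate is sos-convex, since convex univariate polynomials are sos-convex) precisely when the discriminant 144 − 240t is nonpositive. Hence t(0) = 3/5, the surrogate is ψ0,4(x) = x^2 − x^4 + (3/5) x^6, and minimizing it returns x1 = 0, the global minimizer of f.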
If xk is far from x∗ so that ∇2 f (xk ) is not positive definite, it may occur
that (3) is infeasible. If this occurs, we fix a positive scalar5 ε and instead solve
the SDP:
    min_{t̄ ∈ R}  t̄
    s.t.  Txk,d(x) + (1/2)(ε − λmin ∇2 f(xk)) ||x − xk||^2 + t̄ ||x − xk||^{d′}  is sos-convex,      (5)
          t̄ ≥ 0.
Let t̄(xk ) denote the optimal value of (5) and define
 
    ψ̄xk,d(x) := Txk,d(x) + (1/2)(ε − λmin ∇2 f(xk)) ||x − xk||^2 + t̄(xk) ||x − xk||^{d′}.          (6)

We then let xk+1 be the minimizer of ψ̄xk,d (which again exists and is unique; see
Theorem 3 below). As before, we can find a minimizer of ψ̄xk ,d by solving an
SDP of size polynomial in n; see Theorem 2.
Our overall algorithm is summarized below:
Algorithm 1: dth-order Newton method
Parameter: ε > 0
Input: x0 ∈ Rn
1 for k = 0, . . . do
2     if ∇2 f(xk) ≻ 0 then
3         Solve (3) to find t(xk)
4         Let xk+1 be the minimizer of ψxk,d (see (4))
5     else
6         Solve (5) to find t̄(xk)
7         Let xk+1 be the minimizer of ψ̄xk,d (see (6))
8     end
9 end

5 Our analysis applies to any positive value of ε.

4 Algorithm Analysis and Convergence
In this section, we present our main technical results. Theorem 3 shows that
our algorithm is well-defined for all initial conditions. Theorem 4 gives our
convergence result. We remind the reader that the assumptions made on f are
described in the first paragraph of Section 3. In particular, the function f is
not required to be convex, and the dth derivatives of f are not required to be
globally Lipschitz.
Theorem 3. Algorithm 1 is well-defined in the sense that
(i) the problems (3) and (5) are always feasible when required at Lines 3 and
6, and
(ii) the functions ψxk ,d and ψ̄xk ,d (see (4) and (6)) always possess a unique
minimizer when required at Lines 4 and 7.
Theorem 4. There exist constants r, c > 0 such that if ||x0 − x∗ || ≤ r, then the
sequence {xk } generated by Algorithm 1 satisfies

    ||xk+1 − x∗|| ≤ c ||xk − x∗||^d

for all k.
The power d in this theorem is referred to as the order of convergence and
the constant c is referred to as the factor of convergence. We note that the
factor of convergence arising from our proof is explicit.
To prove Theorems 3 and 4, we first establish some technical lemmas. Lem-
mas 2 and 3 are used to prove the first claim of Theorem 3; Lemmas 4 and 5
are for the second claim; and Lemmas 3, 4, and 6 are employed in the proof of
Theorem 4.
In Lemma 2, we show that a particular polynomial is in the interior of the
cone of sos-convex polynomials. This is used in Lemma 3 to show that we can
always make our surrogate functions defined in (3) and (5) sos-convex.
Lemma 2. Let x := (x1 , . . . , xn ). The polynomial

p(x) = xT x + (xT x)d

is in the interior of the cone of sos-convex polynomials in n variables and of degree at most 2d.
Proof. We first establish the following claim:
Claim 0. For all d ≥ 0, the polynomial

p̃d (x) = 1 + (d + 1)(xT x)d


can be written as ϕd (x)T Qϕd (x), where ϕd is the standard basis of monomials
of degree up to d with the monomials appearing in ascending order of degree,
and Q is a positive definite matrix.

To prove Claim 0, it suffices to show that for all d ≥ 0, there exists a
constant αd > 0 and a positive definite matrix Q̂d such that 1 + αd (xT x)d =
ϕd (x)T Q̂d ϕd (x). Indeed, if αd < d + 1, we can observe that
 
    p̃d(x) = (1 + αd (xT x)^d) + ((d + 1) − αd)(xT x)^d = φd(x)^T (Q̂d + Q′) φd(x),

where Q′ can be taken to be positive semidefinite since ((d + 1) − αd)(xT x)^d is sos. If αd > d + 1, we can observe that

    p̃d(x) = (d+1)/αd + (d + 1)(xT x)^d + (1 − (d+1)/αd) = φd(x)^T ((d+1)/αd Q̂d + Q′) φd(x),

where Q′ can be taken to be positive semidefinite since (1 − (d+1)/αd) is sos.
Let us now proceed by induction on d to prove the claim made in the previous paragraph. The case of d = 0 is clear since we can take any α0 > 0 and the associated matrix Q̂0 is simply a 1 × 1 matrix containing the scalar 1 + α0 . Now suppose that the induction hypothesis holds for d = k. To construct αk+1 and Q̂k+1 , we will add matrices associated with the polynomials 1 + αk (xT x)^k and α(xT x)^{k+1} − αk (xT x)^k , where α is an arbitrary scalar. From the induction hypothesis, there exist a scalar αk > 0 and a matrix Q̂k ≻ 0 of size (n+k choose k) × (n+k choose k) that satisfy

    1 + αk (xT x)^k = φk(x)^T Q̂k φk(x) = φ_{k+1}(x)^T [ Q̂k  0 ; 0  0 ] φ_{k+1}(x).

Meanwhile, observe that we can write

    α(xT x)^{k+1} − αk (xT x)^k = φ_{k+1}(x)^T [ 0  A ; A^T  αP ] φ_{k+1}(x)

for some matrices A and P ≻ 0, where the zero block is of size (n+k choose k) × (n+k choose k). Indeed, we can take the matrix P to be diagonal with its diagonal entries equalling the coefficients of (xT x)^{k+1} and move the coefficients of αk (xT x)^k to the matrix A. Adding the two identities, we observe that:

    1 + α(xT x)^{k+1} = φ_{k+1}(x)^T [ Q̂k  A ; A^T  αP ] φ_{k+1}(x).

Since Q̂k and P are both positive definite matrices, by the Schur complement condition, whenever αP − A^T Q̂k^{−1} A ≻ 0, the matrix on the right-hand side of the above expression will be positive definite. One can therefore choose αk+1 to be any large enough value of α that satisfies the previous condition and let Q̂k+1 := [ Q̂k  A ; A^T  αk+1 P ]. We have thus proved Claim 0.
AT αk+1 P
By Claim 0 (with d replaced by d − 1), we can fix a positive definite matrix

Q such that 1 + d(xT x)^{d−1} = φ_{d−1}(x)^T Q φ_{d−1}(x) for all x. One can check that

    y^T ∇2 p(x) y = y^T ( 2I + 2d(xT x)^{d−1} I + 4d(d − 1)(xT x)^{d−2} xxT ) y
                  = 2(y^T y)(1 + d(xT x)^{d−1}) + 4d(d − 1)(xT x)^{d−2} (xT y)^2
                  = 2(y^T y) φ_{d−1}(x)^T Q φ_{d−1}(x) + 4d(d − 1)(xT x)^{d−2} (xT y)^2
                  = (φ_{d−1}(x) ⊗ y)^T (Q ⊗ 2I + Q′)(φ_{d−1}(x) ⊗ y),

where Q′ can be taken to be positive semidefinite since 4d(d−1)(xT x)d−2 (xT y)2
is sos. Since the matrix Q ⊗ 2I + Q′ is positive definite, it follows that p is in
the interior of the cone of sos-convex polynomials of degree at most 2d.
Lemma 3. Suppose f : Rn → R has continuous derivatives up to order d over a compact set B ⊆ Rn . If ∇2 f(x) ≻ 0 for all x ∈ B, then t(x) (i.e., the optimal value of (3)) is uniformly bounded from above over B.
Proof. Let δ be a positive scalar such that λmin ∇2 f(x) ≥ δ for all x ∈ B. Let x′ be any vector in B, and define

    Fx′(x) := (2/δ) Tx′,d(x′ + x).

Since ∇2 f(x′) ⪰ δI, we have ∇2 Fx′(0) ⪰ 2I. Let Qx′ (resp. Cx′) be the sum of the quadratic and higher (resp. cubic and higher) terms of Fx′ . For a polynomial p, define ||p||∞ as the infinity norm of the coefficients of p when expressed in the standard monomial basis. By Lemma 2, we can fix a positive scalar R such that for any polynomial p of degree at most d with ||p||∞ ≤ R, we have that the polynomial ||x||^2 + ||x||^{d′} + p is sos-convex. Fix a scalar M such that ||Cx′||∞ < M for all x′ ∈ B. Define α := min{1, R/M}. We have ||x ↦ Cx′(αx)||∞ ≤ α^3 ||Cx′||∞ since all terms of Cx′ are of cubic or higher order. Then we can write

    (1/α^2) Qx′(αx) + ∥x∥^{d′} = (1/2) x^T ∇2 Fx′(0) x + (1/α^2) Cx′(αx) + ∥x∥^{d′}
                               = (1/2) x^T (∇2 Fx′(0) − 2I) x + (1/α^2) Cx′(αx) + (∥x∥^2 + ∥x∥^{d′}).

Since ∇2 Fx′(0) ⪰ 2I, the first term is sos-convex. We can bound the second term as follows: ∥x ↦ (1/α^2) Cx′(αx)∥∞ ≤ α ||Cx′||∞ ≤ αM ≤ R. Thus, the sum of the second and the third term is sos-convex by the definition of R. It follows that the polynomial

    (1/α^2) Qx′(αx) + ∥x∥^{d′}

is sos-convex. We can then conclude the sos-convexity of the polynomials

(a) Qx′(αx) + α^2 ∥x∥^{d′},
(b) Fx′(αx) + α^2 ∥x∥^{d′},
(c) Fx′(αx) + α^{2−d′} ∥αx∥^{d′},
(d) Fx′(x) + α^{2−d′} ∥x∥^{d′},
(e) Tx′,d(x′ + x) + (δ/2) α^{2−d′} ∥x∥^{d′}, and
(f) Tx′,d(x) + (δ/2) α^{2−d′} ∥x − x′∥^{d′},

respectively (a) by scaling, (b) by the observation that the affine terms do not affect sos-convexity, (c) by rewriting, (d) by a linear change of coordinates, (e) by another rescaling, and (f) by an affine change of coordinates. Thus, we have t(x) ≤ (δ/2) α^{2−d′} for x ∈ B.

We next use a quadrature rule for integration to establish a technical lemma that is needed for the remainder of this section. By a polynomial matrix, we mean a matrix whose entries are polynomial functions.

Lemma 4. Let M : R → S^{n×n} be a univariate polynomial matrix whose entries have degree at most d, where d is even. Suppose M(s) ⪰ 0 for all s ∈ [0, 1]. Then,

    ∫_0^1 M(s) ds ⪰ (1/(2(d^2 − 1))) M(α)

for α ∈ {0, 1}.

Proof. Using a quadrature rule for integration proposed in [21] and analyzed in [27], there exist a set of weights w0 , . . . , wd ≥ 0, with w0 = 1/(d^2 − 1), and a set of points s0 , . . . , sd ∈ [−1, 1], with s0 = 1, such that for any polynomial p of degree at most d we have

    ∫_{−1}^{1} p(s) ds = Σ_{i=0}^{d} wi p(si).

Now we can write

    ∫_0^1 M(s) ds = (1/2) ∫_{−1}^{1} M((1 − s)/2) ds
                  = (1/2) Σ_{i=0}^{d} wi M((1 − si)/2)
                  ⪰ (1/2) w0 M((1 − s0)/2) = (1/(2(d^2 − 1))) M(0).

By replacing s with 1 − s, the claim with α = 1 follows.
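As a quick numerical illustration of Lemma 4 (ours, not part of the paper), the snippet below checks the conclusion at α = 0 for a randomly generated polynomial matrix of the form M(s) = A(s)^T A(s), which is positive semidefinite for every s, approximating the integral by a fine trapezoidal rule.

# Numerical check of Lemma 4 at alpha = 0 for M(s) = A(s)^T A(s), where A(s) is a
# random polynomial matrix of degree 2, so the entries of M have even degree d = 4.
import numpy as np

rng = np.random.default_rng(0)
n, deg_A = 3, 2
coeffs = rng.standard_normal((deg_A + 1, n, n))    # A(s) = sum_j coeffs[j] * s**j

def M(s):
    A = sum(coeffs[j] * s**j for j in range(deg_A + 1))
    return A.T @ A                                  # positive semidefinite for every s

d = 2 * deg_A
grid = np.linspace(0.0, 1.0, 20001)
vals = np.array([M(s) for s in grid])
h = grid[1] - grid[0]
integral = (vals[0] + vals[-1]) / 2.0 * h + vals[1:-1].sum(axis=0) * h   # trapezoidal rule
gap = integral - M(0.0) / (2 * (d**2 - 1))
print(np.linalg.eigvalsh(gap).min())                # nonnegative, as Lemma 4 predicts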

The next lemma directly proves the second claim of Theorem 3 and is pos-
sibly of independent interest.
Lemma 5. If a convex polynomial p : Rn → R satisfies ∇2 p(x0) ≻ 0 for some point x0 ∈ Rn , then p has a unique minimizer.
Proof. Without loss of generality, assume x0 = 0. Let d be an even integer
greater than the degree of the Hessian of p. For any x ∈ Rn ,
    p(x) = p(0) + x^T ∇p(0) + x^T ( ∫_0^1 ∫_0^t ∇2 p(sx) ds dt ) x
         = p(0) + x^T ∇p(0) + x^T ( ∫_0^1 ∫_0^1 t ∇2 p(stx) ds dt ) x
         ≥ p(0) + x^T ∇p(0) + x^T ( ∫_0^1 t (1/(2(d^2 − 1))) ∇2 p(0) dt ) x
         = p(0) + x^T ∇p(0) + (1/4) (1/(d^2 − 1)) x^T ∇2 p(0) x,
where the inequality follows from Lemma 4. Thus, p is lower bounded by a
coercive6 quadratic function, and hence p is coercive itself. A coercive function
that is convex (and hence continuous) has at least one minimizer.
Suppose for the sake of contradiction that p had two minimizers x̄, ȳ. Then,
by convexity, any point on the line segment connecting x̄ and ȳ would also be
a minimizer. Since p is a polynomial, it follows that p must be constant along
the line passing through x̄ and ȳ. This contradicts coercivity.
We remark that the statement of Lemma 5 does not hold for non-polynomial
convex functions (consider, e.g., the univariate function max{0, x2 − 1}).
The next lemma is used in the proof of Theorem 4.
Lemma 6. There exists a constant r > 0 such that if ||xk − x∗|| ≤ r, then λmin ∇2 ψxk,d(x∗) ≥ (1/2) λmin ∇2 f(x∗).
Proof. We show that we can take

    r = min{ rL , ( (d − 1)! λmin ∇2 f(x∗) / (2L) )^{1/(d−1)} },

where rL and L are as in the first paragraph of Section 3. By Lemma 1, for every x satisfying ∥x − xk∥ ≤ rL , we have

    ∥∇2 f(x) − ∇2 Txk,d(x)∥ ≤ (L/(d − 1)!) ∥x − xk∥^{d−1}.

Thus, if ∥x∗ − xk∥ ≤ r, we have

    ∥∇2 f(x∗) − ∇2 Txk,d(x∗)∥ ≤ (1/2) λmin ∇2 f(x∗).
6 We recall that a function g : Rn → R is coercive if g(x) → ∞ as ||x|| → ∞.

It follows that

    λmin ∇2 Txk,d(x∗) ≥ (1/2) λmin ∇2 f(x∗).

Indeed, if there were a unit vector y such that y^T ∇2 Txk,d(x∗) y < (1/2) λmin ∇2 f(x∗), the previous inequality would be violated.
Recall from (4) that ψxk,d is obtained by adding to Txk,d the convex function t(xk)∥x − xk∥^{d′}. Therefore, we have ∇2 ψxk,d(x∗) ⪰ ∇2 Txk,d(x∗), which gives the claim.

We now have all the ingredients to prove Theorems 3 and 4.

Proof of Theorem 3.
(i) When ∇2 f (xk ) ≻ 0, the proof of Lemma 3 with B = {xk } demonstrates a
feasible solution to (3). This argument also extends to show feasibility of (5)
since the polynomial Txk,d(x) + (1/2)(ε − λmin ∇2 f(xk)) ||x − xk||^2 has a positive
definite Hessian at xk .
(ii) At Line 4 (resp. Line 7), ψxk ,d (resp. ψ̄xk ,d ) has a positive definite Hessian
at xk . Moreover, the polynomial ψxk ,d (resp. ψ̄xk ,d ) is sos-convex and therefore
convex. Thus, by Lemma 5, ψxk ,d (resp. ψ̄xk ,d ) has a unique minimizer.

Proof of Theorem 4. Since d > 1, it suffices to show that there exist constants r′, c′ > 0 such that if ||x0 − x∗|| ≤ r′, then ||x1 − x∗|| ≤ c′ ||x0 − x∗||^d .
By continuity of the map x ↦ λmin ∇2 f(x), there exists a scalar r1 > 0 such that λmin ∇2 f(x) ≥ (1/2) λmin ∇2 f(x∗) > 0 for all x with ||x − x∗|| ≤ r1 .
Let r2 > 0 be the constant needed for the conclusion of Lemma 6 to hold.
Define
r′ := min{rL , r1 , r2 }
and Ω := {x ∈ Rn | ||x − x∗ || ≤ r′ }. Suppose x0 ∈ Ω. Note that in this
case, Algorithm 1 finds the next iterate x1 by minimizing the polynomial ψx0 ,d
defined in (4).
By the fundamental theorem of calculus, we have

    ∇ψx0,d(x∗) − ∇ψx0,d(x1) = ( ∫_0^1 ∇2 ψx0,d(x1 + s(x∗ − x1)) ds ) (x∗ − x1).

Since x1 minimizes ψx0,d , we have ∇ψx0,d(x1) = 0, and thus

    ∇ψx0,d(x∗) = ( ∫_0^1 ∇2 ψx0,d(x1 + s(x∗ − x1)) ds ) (x∗ − x1).

We can bound the norm of this vector from below:

    ||∇ψx0,d(x∗)|| ≥ λmin( ∫_0^1 ∇2 ψx0,d(x1 + s(x∗ − x1)) ds ) ||x∗ − x1||.        (7)

Applying first Lemma 4 and then Lemma 6, we have

    λmin( ∫_0^1 ∇2 ψx0,d(x1 + s(x∗ − x1)) ds ) ≥ λmin ∇2 ψx0,d(x∗) / (2((d′ − 2)^2 − 1))
                                               ≥ λmin ∇2 f(x∗) / (4((d′ − 2)^2 − 1)).

Substituting this into (7) and rearranging yields

    ∥x1 − x∗∥ ≤ ( 4((d′ − 2)^2 − 1) / λmin ∇2 f(x∗) ) ∥∇ψx0,d(x∗)∥.        (8)

Expanding ∇ψx0,d(x∗), we have

    ∥∇ψx0,d(x∗)∥ = ∥ ∇Tx0,d(x∗) + ∇( t(x0) ||x − x0||^{d′} )|_{x=x∗} ∥
                 = ∥ ∇Tx0,d(x∗) + t(x0) d′ ||x∗ − x0||^{d′−2} (x∗ − x0) ∥
                 ≤ ||∇Tx0,d(x∗)|| + t(x0) d′ ||x∗ − x0||^{d′−1}.

Applying Lemma 1 and noting that ∇f(x∗) = 0, we have

    ||∇ψx0,d(x∗)|| ≤ (L/d!) ||x∗ − x0||^d + t(x0) d′ ||x∗ − x0||^{d′−1}.
Using Lemma 3 and the fact that ||x∗ − x0|| ≤ r′ , we get

    ||∇ψx0,d(x∗)|| ≤ (L/d!) ||x∗ − x0||^d + (sup_{x∈Ω} t(x)) d′ max{r′, 1} ||x∗ − x0||^d
                  = ( L/d! + (sup_{x∈Ω} t(x)) d′ max{r′, 1} ) ||x∗ − x0||^d .

Substituting into (8), we have

    ||x1 − x∗|| ≤ ( 4((d′ − 2)^2 − 1) / λmin ∇2 f(x∗) ) ( L/d! + (sup_{x∈Ω} t(x)) d′ max{r′, 1} ) ||x∗ − x0||^d

as desired.

5 Numerical Examples
We present three examples to compare the performance of our dth -order Newton
methods and the classical Newton method.

5.1 The Univariate Case
In the univariate case, the iterations of the classical Newton method read
    xk+1 = xk − f′(xk)/f′′(xk).
In terms of finding a root of f ′ , this iteration can be interpreted as first com-
puting the first-order Taylor expansion of f ′ at xk , and then finding the root of
this affine function to define xk+1 .
We derive a similar explicit formula for our higher-order Newton method
in the case where n = 1, d = 3, and f ′′ is positive. Since convex univariate
polynomials are sos-convex, finding explicit solutions to the two SDPs involved
in each iteration of our algorithm reduces to arguments about roots of univariate
polynomials.
Proposition 1. In the univariate case, when f ′′ (xk ) > 0 and f ′′′ (xk ) ̸= 0, the
next iterate of the 3rd -order version of Algorithm 1 is given by7

    xk+1 = xk − 2 f′′(xk)/f′′′(xk) − ( ( f′(xk) − (2/3)(f′′(xk))^2/f′′′(xk) ) / ( (f′′′(xk))^2/(12 f′′(xk)) ) )^{1/3}.

Proof. To simplify notation, we let T := Txk,3 and ψ := ψxk,3 . By translation, we may assume xk = 0. Then T(x) = f(xk) + x f′(xk) + (1/2) x^2 f′′(xk) + (1/6) x^3 f′′′(xk), and ψ(x) = T(x) + t x^4 , where t is the smallest constant that makes ψ convex. We have ψ′′(x) = f′′(xk) + x f′′′(xk) + 12 t x^2 . The discriminant of ψ′′ is (f′′′(xk))^2 − 48 t f′′(xk), which tells us that t = (f′′′(xk))^2 / (48 f′′(xk)).
To find xk+1 , we look for the root of ψ′ . One can write the expression for ψ′ in the following form:

    ψ′(x) = ( (f′′′(xk))^2 / (12 f′′(xk)) ) ( x + 2 f′′(xk)/f′′′(xk) )^3 + f′(xk) − (2/3) (f′′(xk))^2/f′′′(xk).

Observe that a univariate cubic polynomial of the form a(x − b)^3 + c, with a ̸= 0, has a unique root at x = b − (c/a)^{1/3} . Therefore, after a translation back by xk , we have

    xk+1 = xk − 2 f′′(xk)/f′′′(xk) − ( ( f′(xk) − (2/3)(f′′(xk))^2/f′′′(xk) ) / ( (f′′′(xk))^2/(12 f′′(xk)) ) )^{1/3}.

As in the case of the classical Newton method, the expression in Proposition 1 can be interpreted geometrically in terms of finding a root of f′ . This iteration
computes the second-order Taylor expansion of f ′ at xk , adds a sufficiently large
cubic term to enforce monotonicity, and then finds the root of this monotone
cubic function to define xk+1 .
7 Note that when f′′(xk) > 0 and f′′′(xk) = 0, the third-order Taylor series is convex and coincides with the second-order Taylor series. Therefore, the next iterates of the third-order and the classical Newton method coincide.
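The closed form in Proposition 1 is straightforward to implement. The Python sketch below (ours) applies the third-order update and the classical Newton update to the function f(x) = √(x^2 + 1) − 1 of Example 1 below, with derivatives computed by hand, falling back to the classical step when f′′′(xk) = 0, as in the footnote.

# Third-order update of Proposition 1 vs. the classical Newton update for
# f(x) = sqrt(x^2 + 1) - 1 (the function in (9)); derivatives computed by hand.
import math

def d1(x): return x / math.sqrt(x**2 + 1)             # f'
def d2(x): return (x**2 + 1) ** (-1.5)                # f''
def d3(x): return -3.0 * x * (x**2 + 1) ** (-2.5)     # f'''

def newton_step(x):
    return x - d1(x) / d2(x)

def third_order_step(x):
    if abs(d3(x)) < 1e-14:                 # footnote: reduces to the classical step
        return newton_step(x)
    a = d3(x) ** 2 / (12.0 * d2(x))
    b = -2.0 * d2(x) / d3(x)
    c = d1(x) - (2.0 / 3.0) * d2(x) ** 2 / d3(x)
    return x + b - math.copysign(abs(c / a) ** (1.0 / 3.0), c / a)   # signed cube root

x_newton = x_third = 1.5
for _ in range(6):
    x_newton, x_third = newton_step(x_newton), third_order_step(x_third)
print(x_newton, x_third)   # the classical iterate has diverged; the third-order iterate is near 0

Since the classical Newton map for this function is x ↦ −x^3, any starting point with |x0| > 1 diverges, while the third-order update converges from x0 = 1.5, consistent with the basin computed in Example 1.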

Example 1 In this example, we apply our method to the univariate function
    f(x) = √(x^2 + 1) − 1.        (9)

This is a strictly convex function with its unique minimizer at x∗ = 0. One can
check that the classical Newton method converges to this minimizer if and only
if |x0 | < 1. Using Proposition 1, we can calculate the exact basin of convergence
of our third-order Newton method to be (−β, β), where

    β = √( (1/3)( 11 + 142/(1691 + 9i√47)^{1/3} + (1691 + 9i√47)^{1/3} ) ) ∼ 3.407.

This is strictly larger than the basin of convergence of the classical method.
Figure 1 demonstrates the difference between one iteration of the classical
and our third-order Newton method starting at the point x0 = 1.5. We display
the quadratic and quartic polynomials Tx0 ,2 and ψx0 ,3 . The minimizers of these
polynomials are denoted by xNewton
1 and x3ON
1 , which are respectively the next
iterates of the classical and our third-order Newton method. Since the third-
order Taylor expansion of f provides a more accurate approximation, we see
that the next iterate of our method is closer to x∗ , while that of the classical
Newton method moves farther away from x∗ .

Figure 1: A comparison of one iteration of the classical Newton method and our
third-order Newton method applied to the function in (9) starting at x0 = 1.5.

For our dth -order Newton methods with d > 3, we calculate the radii of
convergence numerically. These radii increase with degree as the following table
demonstrates:

    Degree d               Radius of Convergence
    2 (Classical Newton)   1
    3                      ∼3.4
    4                      ∼4.5
    5                      ∼5.9
We can visualize the speed of convergence of the fifth-order method, for
example, in Figure 2. In this figure, we plot |xk − x∗|
starting at x0 = 5.9, which is close to the boundary of the basin. In just five
iterations, the method reaches a point with absolute value approximately 10−15 .

Figure 2: 5th -order Newton iterates applied to the function in (9).

Example 2 In this example, we compare our third-order method to the classical Newton method when applied to the function

    f(x) = 2x arctan(x) − log(1 + x^2) + (1/10) x^2.        (10)
This is a strongly convex function with its unique minimizer at x∗ = 0.
In Figure 3, N2 (resp. N3 ) is the map that takes a point to the corresponding
next iterate of the classical (resp. third-order) Newton method. In this example,
the third-order method satisfies |N3 (x)| < |x| for all nonzero x, implying global
convergence of the method. Meanwhile, the classical Newton method oscillates
between ±13.494 when x0 is outside of the range [−α, α], where α ∼ 1.712 is the point of intersection of the functions N2 and −x.
In Figure 4, we can see a comparison of the iterates of the third-order and
the classical Newton method starting from the initial condition x0 = 1.7. While
both methods converge to the minimizer, the third-order method converges
much faster.
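These claims about N2 are easy to check numerically. The snippet below (ours) uses the hand-computed derivatives f′(x) = 2 arctan(x) + x/5 and f′′(x) = 2/(1 + x^2) + 1/5 to verify the two-cycle near ±13.494 and to locate α by bisection on N2(x) + x = 0.

# Numerical check of the classical Newton map N2 for the function f in (10),
# using f'(x) = 2*arctan(x) + x/5 and f''(x) = 2/(1 + x^2) + 1/5 (computed by hand).
import math

def N2(x):
    return x - (2.0 * math.atan(x) + x / 5.0) / (2.0 / (1.0 + x**2) + 0.2)

print(N2(13.494))          # approximately -13.494: the two-cycle of the classical method

# Locate alpha, the positive solution of N2(x) = -x, by bisection on g(x) = N2(x) + x.
lo, hi = 1.0, 3.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if (N2(lo) + lo) * (N2(mid) + mid) <= 0.0:
        hi = mid
    else:
        lo = mid
print(0.5 * (lo + hi))     # approximately 1.712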


Figure 3: Comparison of the classical Newton map N2 and our third-order Newton map N3 applied to the function in (10). Subfigure (a) implies that
the third-order method is globally convergent, while the classical method is
not. Subfigure (b) zooms in on the behavior of these maps near the origin
to show that the basin of attraction for the classical method is approximately
(−1.712, 1.712).

Figure 4: Iterates of our third-order and the classical Newton method applied
to the function in (10) starting from a point in the basin of attraction of both
methods.

5.2 A Multivariate Example
In our last example, we compare the classical and the third-order Newton meth-
ods applied to a standard test function in nonlinear optimization called the Beale
function:

    f(x1, x2) = (1.5 − x1 + x1 x2)^2 + (2.25 − x1 + x1 x2^2)^2 + (2.625 − x1 + x1 x2^3)^2 .

This nonconvex function has a single global minimum at x∗ = (3, 0.5)T and no
other local minima. In Figure 5, we explore the behavior of both methods with
initial conditions in the region {x ∈ R2 | ∥x∥∞ ≤ 4}. We initialize the classical
method and our third-order method at a fine grid of points in this box and
run both methods for 350 iterations. For our third-order method, we take the
parameter ε in Algorithm 1 to be equal to 0.01. In Figure 5, the color yellow
corresponds to initial points that converge to x∗ , and the color blue corresponds
to any other behavior including divergence or convergence to a point which is
not a local minimum. In this example, the two basins are incomparable, but
that of the third-order method is more contiguous and larger in volume.

(a) classical Newton (b) Third-order Newton

Figure 5: The basins of attraction for the classical and the third-order Newton
methods for the minimizer of the Beale function. The basin for the classical
method has fractal structure, demonstrating more sensitivity to initialization.

6 Global convergence
In this section, we present a slightly modified algorithm which has global conver-
gence under additional assumptions. There is a vast literature on modifications
to Newton’s method that lead to global convergence in special circumstances:
see, e.g., [41, 43, 38, 22]. In the setting of our work, it turns out that we can
use a result of Nesterov from [40] to show that a simple modification to our

algorithm that still has polynomial work per iteration is globally convergent
when the Taylor expansion is made to an odd order.8 This modified algorithm
(Algorithm 2 below) also inherits the local convergence order of Algorithm 1.
As in [40], suppose the dth derivative of the function f : Rn → R that we
wish to minimize has a Lipschitz constant Ld , and that an upper bound M on
Ld is known. In this setting, consider the following algorithm:

Algorithm 2: dth-order globally convergent Newton method (d odd)
Input: x0 ∈ Rn
1 for k = 0, . . . do
2     Solve (3) to find t(xk)
3     Let xk+1 be the minimizer of Txk,d(x) + max{dM/(d+1)!, t(xk)} ∥x − xk∥^{d+1}
4 end

Using the same arguments as those in the proof of Theorem 3, one can see
that the next iterate xk+1 produced by this algorithm is well-defined whenever
∇2 f (xk ) ≻ 0. Also as before, problem (3) can be solved as a semidefinite
program of size polynomial in the dimension. This claim also holds for the
problem of finding the (unique) minimizer of the degree d + 1 polynomial

    Txk,d + max{dM/(d+1)!, t(xk)} ∥x − xk∥^{d+1}.

This is because the polynomials ∥x − xk ∥d+1 and Txk ,d + t(xk )∥x − xk ∥d+1 are
sos-convex and a conic combination of two sos-convex polynomials is sos-convex,
making Theorem 2 applicable.
Theorem 5. Suppose f : Rn → R has bounded level sets, a positive definite
Hessian everywhere, and the Lipschitz constant of its dth derivative bounded
above by M .9 Then, the iterates of Algorithm 2 starting from any x0 ∈ Rn
converge to the (unique) minimizer of f . Furthermore, Algorithm 2 has local
convergence rate of order d.
Proof. Since the Hessian of f is positive definite everywhere, the function f is
strictly convex. This, along with boundedness of the level sets, implies that f
has a unique (global) minimizer which we call x∗ .
Define ψxk,d(x) := Txk,d(x) + max{dM/(d+1)!, t(xk)} ∥x − xk∥^{d+1}. By Theorem 1 from [40], we have ψxk,d(x) ≥ f(x) for all x ∈ Rn , thus the method is monotone; i.e., f(xk+1) ≤ f(xk). Let Mk := max{M, (d+1)! t(xk)/d} and δk := f(xk) − f(x∗).
8 The reason we need the Taylor expansion order to be odd is that in the work of Nesterov,

the Taylor polynomial is regularized by a term of degree one larger. We need this new term
to be a polynomial function for sum of squares methods to be readily applicable.
9 The assumptions that we make here are the same as those in [40] except that our as-

sumption of positive definiteness of the Hessian is stronger than the assumption of positive
semidefiniteness of the Hessian made in [40].

Since the set {x ∈ Rn | f (x) ≤ f (x0 )} is compact and the method is monotone,
there exists a scalar D such that ∥xk − x∗ ∥ ≤ D for all k. By the arguments in
the proof of Theorem 2 from [40], we can conclude that

    δk − δk+1 ≥ Ck δk^{(d+1)/d},

where Ck := (d/(d+1)) ( d! / ((dMk + Ld) D^{d+1}) )^{1/d} . By Lemma 3, we know that

    tmax := sup_{∥x−x∗∥≤D} t(x)

is finite. Letting Mmax := max{M, (d+1)! tmax/d}, we have Mk ≤ Mmax , and therefore

    Ck ≥ (d/(d+1)) ( d! / ((dMmax + Ld) D^{d+1}) )^{1/d}

for all k. Continuing the argument from the proof of Theorem 2 from [40], we can conclude that

    f(xk) − f(x∗) ≤ ( (dMmax + Ld) D^{d+1} / d! ) ( (d+1)/k )^d .

Thus, we have f (xk ) − f (x∗ ) → 0 and therefore xk → x∗ .


For the local superlinear convergence rate, it suffices to show that for xk
close enough to x∗ , we have

    ||xk+1 − x∗|| ≤ c′ ||xk − x∗||^d

for some constant c′ . Let r1 and r2 be as in the proof of Theorem 4, r′ := min{r1, r2}, and Ω := {x ∈ Rn | ||x − x∗|| ≤ r′}. By the arguments in the proof of Theorem 4,
for every xk ∈ Ω, we have
 
    ||∇ψxk,d(x∗)|| ≤ (Ld/d!) ||x∗ − xk||^d + max{dM/(d+1)!, t(xk)} (d + 1) ||x∗ − xk||^d
                  ≤ (Ld/d!) ||x∗ − xk||^d + max{dM/(d+1)!, sup_{x∈Ω} t(x)} (d + 1) ||x∗ − xk||^d .

Substituting into (8) (with x0 replaced with xk ), we have

    ||xk+1 − x∗|| ≤ c′ ||x∗ − xk||^d ,

where

    c′ := ( 4((d − 1)^2 − 1) / λmin ∇2 f(x∗) ) ( Ld/d! + max{dM/(d+1)!, sup_{x∈Ω} t(x)} (d + 1) ) .

We note that by Lemma 3, supx∈Ω t(x) is finite.

7 Future directions
Besides the question of extending the results of Section 6 to the case of d even,
there are a few other potential directions for future research that we wish to
highlight:
• Can we replace the SDPs used in Algorithm 1 with more scalable conic
programs such as linear programs (LPs) or second-order cone programs
(SOCPs)? There has been work (see, e.g., [4]) on replacing methods based
on sos programming with LP or SOCP-based approaches that rely on
more tractable subsets of sos polynomials, such as the so-called diagonally
dominant sum of squares (dsos) or scaled diagonally dominant sum of
squares (sdsos) polynomials. In our setting, we might wish to replace the
constraint in (3) (or (5)) that a polynomial is sos-convex with a constraint
that it is “dsos-convex” or “sdsos-convex” (see, e.g., [3]). The results in [3]
on the difference of dsos-convex decompositions of arbitrary polynomials
could be explored to potentially replace the first SDP in each iteration
of Algorithm 1 with an LP or SOCP. One would then need to establish
an appropriate dsos or sdsos version of Theorem 2 to replace our second
SDP with an LP or SOCP. It would be interesting to compare the factor
of convergence of such an algorithm to that of the SDP-based approach.
• Can we create a method that uses a sparse subset of higher-order deriva-
tives of the function f and that perhaps approximates the remaining
derivatives in order to speed up each iteration? Such a method would be
a higher-order analogue to the so-called “quasi-Newton” methods which
rely on approximations of the Hessian of f (see, e.g., [42, Chap. 6]).
An example of such a higher-order quasi-Newton method which results in
semidefinite programs of small size in each iteration has been proposed
in [2], but its convergence properties are currently unknown.
• Can we use our method or a modification thereof to solve systems of
nonlinear equations (in a way that is superior to simply minimizing the
sum of the squares of the equations)? The classical Newton method and
its variants can be used for this purpose (see, e.g., [42, Sect. 11.1]). What
are the right higher-order analogues of these approaches?
• Each iteration of the algorithms that we have presented in this paper
can be interpreted as running just one iteration of the so-called “convex-
concave procedure” (see, e.g., [34]) to a particular difference of convex
decomposition of the Taylor expansion of f . Are there benefits of work-
ing with alternative difference of convex decompositions (see, e.g., [3]) of
the Taylor expansion, or running more iterations of the convex-concave
procedure before the Taylor polynomial is updated?

Acknowledgements
We would like to thank Jean-Bernard Lasserre for insightful discussions around
the results in [32].

References
[1] N. Agarwal and E. Hazan. Lower bounds for higher-order convex optimiza-
tion. In Proceedings of the 31st Conference On Learning Theory, volume 75
of Proceedings of Machine Learning Research, pages 774–792, 2018.
[2] A. A. Ahmadi, C. Dibek, and G. Hall. Sums of separable and quadratic
polynomials. Mathematics of Operations Research, 48, 2022.

[3] A. A. Ahmadi and G. Hall. DC decomposition of nonconvex polynomials


with algebraic techniques. Mathematical Programming, 169(1):69–94, 2018.
[4] A. A. Ahmadi and A. Majumdar. DSOS and SDSOS optimization: More
tractable alternatives to sum of squares and semidefinite optimization.
SIAM Journal on Applied Algebra and Geometry, 3(2):193–230, 2019.

[5] A. A. Ahmadi, A. Olshevsky, P. A. Parrilo, and J. N. Tsitsiklis. NP-


hardness of deciding convexity of quartic polynomials and related problems.
Mathematical Programming, 137:453–476, 2013.
[6] A. A. Ahmadi and P. A. Parrilo. A complete characterization of the
gap between convexity and sos-convexity. SIAM Journal on Optimization,
23(2):811–833, 2013.
[7] A. A. Ahmadi and J. Zhang. Complexity aspects of local minima and
related notions. Advances in Mathematics, 397:108119, 2022.
[8] A. A. Ahmadi and J. Zhang. On the complexity of finding a local minimizer
of a quadratic function over a polytope. Mathematical Programming, 195(1-
2):783–792, 2022.
[9] M. Baes. Estimate sequence methods: extensions and approximations.
Institute for Operations Research, ETH, Zürich, Switzerland, 2(1), 2009.

[10] E. G. Belousov and D. Klatte. A Frank–Wolfe type theorem for con-


vex polynomial programs. Computational Optimization and Applications,
22(1):37–48, 2002.
[11] E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and P. L.
Toint. Worst-case evaluation complexity for unconstrained nonlinear opti-
mization using high-order regularized models. Mathematical Programming,
163(1):359–368, 2017.

[12] S. Bubeck, Q. Jiang, Y. T. Lee, Y. Li, and A. Sidford. Near-optimal method
for highly smooth convex optimization. In Conference on Learning Theory,
pages 492–507. Proceedings of Machine Learning Research, 2019.
[13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for find-
ing stationary points I. Mathematical Programming, 184(1):71–120, 2020.

[14] C. Cartis, N. I. Gould, and P. L. Toint. Universal regularization methods:


varying the power, the smoothness and the accuracy. SIAM Journal on
Optimization, 29(1):595–615, 2019.
[15] C. Cartis, N. I. Gould, and P. L. Toint. A concise second-order complexity
analysis for unconstrained optimization using high-order regularized mod-
els. Optimization Methods and Software, 35(2):243–256, 2020.
[16] C. Cartis, N. I. Gould, and P. L. Toint. Sharp worst-case evaluation com-
plexity bounds for arbitrary-order nonconvex optimization with inexpensive
constraints. SIAM Journal on Optimization, 30(1):513–541, 2020.

[17] C. Cartis, N. I. Gould, and P. L. Toint. Evaluation Complexity of Algo-


rithms for Nonconvex Optimization: Theory, Computation and Perspec-
tives. SIAM, 2022.
[18] C. Cartis and W. Zhu. Second-order methods for quartically-regularised
cubic polynomials, with applications to high-order tensor methods. arXiv
preprint arXiv:2308.15336, 2023.
[19] C. Cartis and W. Zhu. Global convergence of high-order regular-
ization methods with sums-of-squares Taylor models. arXiv preprint
arXiv:2404.03035, 2024.

[20] P. L. Chebyshev. Polnoe Sobranie Sochinenii. Izd. Akad. Nauk SSSR,


5:7–25, 1951.
[21] C. W. Clenshaw and A. R. Curtis. A method for numerical integration on
an automatic computer. Numerische Mathematik, 2(1):197–205, 1960.
[22] A. Conn, N. Gould, and P. Toint. Trust Region Methods. MPS-SIAM Series
on Optimization. Society for Industrial and Applied Mathematics, 2000.
[23] N. Doikov. New second-order and tensor methods in convex optimization.
PhD thesis, Université catholique de Louvain, 2021.
[24] N. Doikov and Y. Nesterov. Local convergence of tensor methods. Mathe-
matical Programming, 193(1):315–336, 2022.
[25] G. N. Grapiglia and Y. Nesterov. Tensor methods for finding approximate
stationary points of convex functions. Optimization Methods and Software,
37(2):605–638, 2022.

[26] J. W. Helton and J. Nie. Semidefinite representation of convex sets. Math-
ematical Programming, 122:21–64, 2010.
[27] J. P. Imhof. On the method for numerical integration of Clenshaw and
Curtis. Numerische Mathematik, 5(1):138–141, 1963.
[28] B. Jiang, T. Lin, and S. Zhang. A unified adaptive tensor approximation
scheme to accelerate composite convex optimization. SIAM Journal on
Optimization, 30(4):2897–2926, 2020.
[29] B. Jiang, H. Wang, and S. Zhang. An optimal high-order tensor method
for convex optimization. Mathematics of Operations Research, 46(4):1390–
1412, 2021.
[30] J.-B. Lasserre. Global optimization with polynomials and the problem of
moments. SIAM Journal on Optimization, 11:796–817, 2000.
[31] J.-B. Lasserre. Representation of nonnegative convex polynomials. Archiv
der Mathematik, 91(2):126–130, 2008.
[32] J.-B. Lasserre. Convexity in semialgebraic geometry and polynomial opti-
mization. SIAM Journal on Optimization, 19:1995–2014, 2009.
[33] K. Levenberg. Method for the solution of certain problems in least squares.
J Numer Anal, 16:588–A604, 1944.
[34] T. Lipp and S. Boyd. Variations and extension of the convex–concave
procedure. Optimization and Engineering, 17(2):263–287, 2016.
[35] J. Löfberg. YALMIP: A toolbox for modeling and optimization in MAT-
LAB. In IEEE International Conference on Robotics and Automation,
pages 284–289, 2004.
[36] A. Majumdar, G. Hall, and A. A. Ahmadi. Recent scalability improvements
for semidefinite programming with applications in machine learning, con-
trol, and robotics. Annual Review of Control, Robotics, and Autonomous
Systems, 3:331–360, 2020.
[37] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear
parameters. Journal of the Society for Industrial and Applied Mathematics,
11(2):431–441, 1963.
[38] J. J. Moré. Recent Developments in Algorithms and Software for Trust
Region Methods, pages 258–287. Springer Berlin Heidelberg, 1983.
[39] K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic
and nonlinear programming. Mathematical Programming, 39(2):117–129,
1987.
[40] Y. Nesterov. Implementable tensor methods in unconstrained convex opti-
mization. Mathematical Programming, 186(1):157–183, 2021.

[41] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and
its global performance. Mathematical Programming, 108(1):177–205, 2006.
[42] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
[43] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equa-
tions in Several Variables. SIAM, 2000.
[44] P. A. Parrilo. Structured semidefinite programs and semialgebraic geometry
methods in robustness and optimization. PhD thesis, California Institute
of Technology, 2000.
[45] S. Prajna, A. Papachristodoulou, and P. A. Parrilo. Introducing SOS-
TOOLS: A general purpose sum of squares programming solver. In Pro-
ceedings of the 41st IEEE Conference on Decision and Control, volume 1,
pages 741–746, 2002.
[46] I. Pólik and T. Terlaky. A survey of the S-lemma. SIAM Review, 49(3):371–
418, 2007.

[47] O. Silina and J. Zhang. An unregularized third order Newton method.


arXiv preprint arXiv:2209.10051, 2022.
[48] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review,
38(1):49–95, 1996.

[49] A. Yurtsever, J. A. Tropp, O. Fercoq, M. Udell, and V. Cevher. Scalable


semidefinite programming. SIAM Journal on Mathematics of Data Science,
3(1):171–200, 2021.

