Inexact Newton Method for Minimization of Convex Piecewise Quadratic Functions
1 Introduction
The present paper is devoted to a theoretical and experimental study of novel techniques for incorporating a preconditioned conjugate gradient linear solver into an inexact Newton method. Earlier, a similar method was successfully applied to optimization problems arising in numerical grid generation [7, 8, 13]; here we consider its application to the numerical solution of piecewise quadratic unconstrained optimization problems [9, 15, 16, 17]. The latter include such problems as finding the projection of a given point onto the set of nonnegative solutions of an underdetermined system of linear equations [6] or finding the distance between two convex polyhedra [3] (both are tightly related to the standard linear programming problem).
The paper is organized as follows. In Section 2, a typical problem of minimization
A.I. Golikov
Dorodnicyn Computing Center of FRC CSC RAS, Vavilova 40, 119333 Moscow, Russia;
Moscow Institute of Physics and Technology (State University), 9 Institutskiy per., Dolgoprudny,
Moscow Region, 141701, Russia; e-mail: [email protected]
I.E. Kaporin
Dorodnicyn Computing Center of FRC CSC RAS, Vavilova 40, 119333 Moscow, Russia; e-mail:
[email protected]
where the standard notation ξ_+ = max(0, ξ) = (ξ + |ξ|)/2 is used. Problem (1) can be viewed as the dual of the problem of finding the projection of a vector onto the set of nonnegative solutions of an underdetermined linear system of equations [6, 9]:
$$x_* = \arg\min_{Ax=b,\; x\ge 0} \frac{1}{2}\,\|x - \hat{x}\|^2,$$
the solution of which is expressed via p_* as x_* = (x̂ + A^T p_*)_+. Therefore, we consider the piecewise quadratic function ϕ : R^m → R^1 defined as
$$\varphi(p) = \frac{1}{2}\,\bigl\|(\hat{x} + A^T p)_+\bigr\|^2 - b^T p, \qquad (2)$$
which is convex and differentiable. Its gradient g(p) = grad ϕ(p) is given by
$$g(p) = A\,(\hat{x} + A^T p)_+ - b. \qquad (3)$$
The relation of H(p) to ϕ (p) and g(p) will be explained later in Remark 1.
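In computational form, the quantities introduced above can be evaluated as in the following minimal Python/NumPy sketch (the function names are ours, and the explicit form of the generalized Hessian H(p) = A Diag(sign((x̂ + A^T p)_+)) A^T is assumed here following the generalized Newton literature [9, 16], not quoted from the text):

import numpy as np

# Sketch of evaluating the objective (2), its gradient (3), and a generalized
# Hessian of the assumed form H(p) = A Diag(d) A^T, d_j = 1 iff (xhat + A^T p)_j > 0.
def phi(p, A, b, xhat):
    y = np.maximum(xhat + A.T @ p, 0.0)              # (xhat + A^T p)_+
    return 0.5 * y @ y - b @ p                       # objective (2)

def grad_phi(p, A, b, xhat):
    return A @ np.maximum(xhat + A.T @ p, 0.0) - b   # gradient (3)

def gen_hessian(p, A, xhat):
    d = (xhat + A.T @ p > 0.0).astype(float)         # active-set indicator
    return (A * d) @ A.T                             # A Diag(d) A^T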
The following result is a special case of the Taylor expansion with the remainder term in integral form.
LEMMA 1. For any real scalars η and ζ it holds
$$\frac{1}{2}\bigl((\eta+\zeta)_+\bigr)^2 - \frac{1}{2}(\eta_+)^2 - \zeta\,\eta_+ \;=\; \zeta^2 \int_0^1 \!\left(\int_0^1 \operatorname{sign}(\eta + st\,\zeta)_+\, ds\right) t\, dt. \qquad (5)$$
PROOF. Setting in (6) and (7) y = x̂ + A^T p and z = A^T q readily yields the required result (taking into account the cancellation of the linear terms involving b in the left-hand side of (8)).
REMARK 1. As is seen from (9), if the condition
$$\operatorname{sign}\bigl(\hat{x} + A^T p + \vartheta A^T q\bigr)_+ = \operatorname{sign}\bigl(\hat{x} + A^T p\bigr)_+ \qquad (10)$$
holds, which is the case, in particular, if
$$|(A^T q)_j| \le |(\hat{x} + A^T p)_j| \quad \text{whenever} \quad (A^T q)_j\,(\hat{x} + A^T p)_j < 0; \qquad (12)$$
that is, if certain components of the increment q are relatively small, then ϕ is exactly quadratic (11) in the corresponding neighborhood of p.
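For a quick illustration: if (x̂ + A^T p)_j = 2 and (A^T q)_j = −1.5, then (x̂ + A^T p + ϑ A^T q)_j = 2 − 1.5ϑ ≥ 0.5 > 0 for all ϑ ∈ [0, 1], so the j-th entry of the sign pattern in (10) is unchanged; if instead (A^T q)_j = −3, the sign switches for ϑ > 2/3 and (10) fails.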
As condition (10) and its consequence (11) suggest, one can try to minimize ϕ numerically by a Newton-type method p_{k+1} = p_k − d_k, where d_k = H(p_k)^{-1} g(p_k). Note that, by (11), this immediately gives the exact minimizer p_* = p_{k+1} if the magnitudes of the components of d_k are small enough to satisfy (12) with p = p_k and q = −d_k. However, initially p_k may be rather far from the solution, and only gradual improvements are possible. First, a damping factor α_k must be used to guarantee monotone convergence (with respect to the decrease of ϕ(p_k) as k increases). Second, H(p_k) must be replaced by an appropriate approximation M_k in order to ensure its invertibility with a reasonable bound for the inverse.
Therefore, we propose the following prototype scheme
$$p_{k+1} = p_k - \alpha_k M_k^{-1} g(p_k),$$
where
$$M_k = H(p_k) + \delta\,\mathrm{Diag}(AA^T). \qquad (13)$$
The parameters 0 < α_k ≤ 1 and 0 ≤ δ ≪ 1 must be chosen properly for better convergence. Furthermore, at the initial stages of the iteration, the most efficient strategy is to use approximate Newton directions d_k ≈ M_k^{-1} g(p_k), which can be obtained by the preconditioned conjugate gradient (PCG) method applied to the Newton equation M_k d_k = g(p_k). As will be seen later, it suffices to use any vector d_k satisfying the conditions
$$d_k^T g_k = d_k^T M_k d_k = \vartheta_k^2\, g_k^T M_k^{-1} g_k \qquad (14)$$
with 0 < ϑ_k < 1 sufficiently separated from zero. For any preconditioning, the approximations constructed by the PCG method satisfy (14); see Section 5.3 below.
Taking into account the Armijo-type criterion
$$\varphi(p_k - \alpha d_k) \le \varphi(p_k) - \frac{\alpha}{2}\, d_k^T g(p_k), \qquad \alpha \in \{1,\, 1/2,\, 1/4, \ldots\}, \qquad (15)$$
where the maximum steplength α satisfying (15) is used, the inexact Newton algorithm can be presented as follows:
Algorithm 1.
Input: A ∈ R^{m×n}, b ∈ R^m, x̂ ∈ R^n;
Initialization: δ = 10^{-6}, ε = 10^{-12}, τ = 10^{-15}
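The remaining steps of Algorithm 1 implement the damped iteration (13)–(15). A minimal Python sketch of this loop is given below; it reuses the helper functions from the sketch in Section 2, uses a dense solve in place of the PCG solver for brevity, and the way the tolerance ε enters the gradient-norm test (and the omission of τ) are our assumptions:

import numpy as np

# Sketch (a reconstruction, not the authors' listing) of the damped inexact
# Newton iteration (13)-(15); phi, grad_phi, gen_hessian are from the sketch above.
def inexact_newton(A, b, xhat, delta=1e-6, eps=1e-12, max_iter=100):
    p = np.zeros(A.shape[0])
    diag_AAt = np.einsum('ij,ij->i', A, A)            # Diag(A A^T)
    for _ in range(max_iter):
        g = grad_phi(p, A, b, xhat)
        if np.linalg.norm(g) <= eps:                  # assumed role of eps
            break
        M = gen_hessian(p, A, xhat) + delta * np.diag(diag_AAt)   # regularization (13)
        d = np.linalg.solve(M, g)                     # exact solve instead of PCG
        phi_p = phi(p, A, b, xhat)
        alpha = 1.0
        while phi(p - alpha * d, A, b, xhat) > phi_p - 0.5 * alpha * (d @ g):
            alpha *= 0.5                              # Armijo backtracking (15)
            if alpha < 1e-16:                         # safeguard, not in the original scheme
                break
        p = p - alpha * d
    return p, np.maximum(xhat + A.T @ p, 0.0)         # p_*, x_* = (xhat + A^T p_*)_+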
It appears that Algorithm 1 conforms exactly to the convergence analysis presented in [13] (see also [2]). For completeness of the presentation and compatibility of notation, we reproduce here the main results of [13].
The main assumptions needed for the function ϕ(p) under consideration are that it is bounded from below, has a gradient g(p) ∈ R^m, and satisfies
$$\varphi(p+q) - \varphi(p) - q^T g(p) \le \frac{\gamma}{2}\, q^T M q \qquad (16)$$
for the symmetric positive definite m × m matrix M = M(p) defined above in (13) and some constant γ ≥ 1. Note that exact knowledge of γ is not necessary for the actual calculations. The existence of such a γ follows from (13) and Lemma 3. Indeed, denoting D = (Diag(AA^T))^{1/2} and Â = D^{-1} A, for the right-hand side of (8) one has, taking into account ‖Diag(d)‖ ≤ 1 and H(p) ≥ 0,
$$A\,\mathrm{Diag}(d)\,A^T \;\le\; AA^T \;\le\; \|\hat{A}\|^2\,\mathrm{Diag}(AA^T) \;\le\; \frac{\|\hat{A}\|^2}{\delta}\Bigl(\delta\,\mathrm{Diag}(AA^T) + H(p)\Bigr) \;=\; \frac{\|\hat{A}\|^2}{\delta}\, M;$$
therefore, (16) holds with
$$\gamma = \|\hat{A}\|^2/\delta. \qquad (17)$$
The latter formula explains our choice of M, which is more appropriate in cases of large variation in the norms of the rows of A (see the examples and discussion in Section 6).
Next we estimate the reduction in the value of ϕ attained by descent along the direction (−d) satisfying (14). One can show the following estimate for the decrease of the objective function value at each iteration (here the simplified notation p = p_k, p̂ = p_{k+1}, etc. is used), where p̂ = p − αd with α = 2^{-l}, l = 0, 1, ..., chosen according to (15):
$$\varphi(\hat{p}) \le \varphi(p) - \frac{\vartheta^2}{4\gamma}\, g^T M^{-1} g. \qquad (18)$$
Noting that the coefficient ϑ²/(4γ) in the latter estimate is bounded away from zero uniformly in k, and that ϕ is bounded from below, summing (18) over k finally yields
$$\lim_{k\to\infty} \|g(p_k)\| = 0,$$
where the PCG direction vectors are pairwise M-orthogonal: (s^{(j)})^T M s^{(l)} = 0, j ≠ l. Let us also denote the squared M-norms of the PCG directions by η^{(j)} = (s^{(j)})^T M s^{(j)}, j = 0, 1, ..., i−1. Therefore, from (21), one obtains
$$\zeta^{(i)} = (d^{(i)})^T M d^{(i)} = \sum_{j=0}^{i-1} \eta^{(j)},$$
where k is the Newton iteration number. Summing up the latter inequalities for 0 ≤ k ≤ m − 1, we get
$$c_0 \equiv 4\gamma(\varphi_0 - \varphi_*) \;\ge\; \sum_{k=0}^{m-1} \sum_{j=0}^{i_k-1} \eta_k^{(j)}. \qquad (22)$$
On the other hand, the cost measure related to the total time needed to perform m inexact Newton iterations with i_k PCG iterations at each Newton step can be estimated as proportional to
$$T_m = \sum_{k=0}^{m-1}\bigl(\varepsilon_{CG}^{-1} + i_k\bigr) \;\le\; c_0\, \frac{\sum_{k=0}^{m-1}\bigl(\varepsilon_{CG}^{-1} + i_k\bigr)}{\sum_{k=0}^{m-1}\sum_{j=0}^{i_k-1}\eta_k^{(j)}} \;\le\; c_0 \max_{k<m} \frac{\varepsilon_{CG}^{-1} + i_k}{\sum_{j=0}^{i_k-1}\eta_k^{(j)}}$$
(the first inequality uses (22), and the second one uses the fact that a ratio of sums does not exceed the maximum of the termwise ratios).
Here ε_CG is a small parameter reflecting the ratio of the cost of one linear PCG iteration to the cost of one Newton iteration (the latter including, in particular, the construction of the preconditioning and several evaluations of ϕ needed for backtracking), plus a possible efficiency loss due to early PCG termination. Thus, introducing the function ψ(i) = (ε_CG^{-1} + i)/ζ^{(i)} (here we omit the index k), one obtains a reasonable criterion to stop the PCG iterations in the form ψ(i) > ψ(i − 1). Note that the use of smaller values of ε_CG generally corresponds to an increase of the resulting iteration number bound.
Rewriting the latter condition, one obtains the final form of the PCG stopping rule:
$$\bigl(\varepsilon_{CG}^{-1} + i\bigr)\,\eta^{(i-1)} \le \zeta^{(i)}. \qquad (23)$$
Note that by this rule, the PCG iteration number is always no less than 2.
Finally, we explicitly present the resulting formulae for the PCG algorithm incorporating the new stopping rule. Following [9], we use the Jacobi preconditioning
$$C = (\mathrm{Diag}(M))^{-1}. \qquad (24)$$
Moreover, the reformulation [14] of the CG algorithm [11, 1] is used; this may give a more efficient parallel implementation, see, e.g., [9].
Following [14], recall that at each PCG iteration the M^{-1}-norm of the (i+1)-th residual r^{(i+1)} = g − M d^{(i+1)} attains its minimum over the corresponding Krylov subspace. Using the standard PCG recurrences (see Section 5.3 below), one can find d^{(i+1)} = d^{(i)} + C r^{(i)} α^{(i)} + s^{(i−1)} α^{(i)} β^{(i−1)}. Therefore, the optimum increment s^{(i)} in the recurrence d^{(i+1)} = d^{(i)} + s^{(i)}, where s^{(i)} = V^{(i)} h^{(i)} and V^{(i)} = [ C r^{(i)} | s^{(i−1)} ], can be determined via the solution of the following 2-dimensional linear least squares problem:
$$h^{(i)} = \begin{bmatrix} \alpha^{(i)} \\ \beta^{(i)} \end{bmatrix} = \arg\min_{h\in\mathbb{R}^2} \bigl\|g - M d^{(i+1)}\bigr\|_{M^{-1}} = \arg\min_{h\in\mathbb{R}^2} \bigl\|r^{(i)} - M V^{(i)} h\bigr\|_{M^{-1}}.$$
By redefining r^{(i)} := −r^{(i)} and introducing the vectors t^{(i)} = M s^{(i)}, the required PCG reformulation follows:
Algorithm 2.
r^{(0)} = −g, d^{(0)} = s^{(−1)} = t^{(−1)} = 0, ζ^{(−1)} = 0;
for i = 0, 1, ..., itmax:
  w^{(i)} = C r^{(i)},
  z^{(i)} = M w^{(i)},
  γ^{(i)} = (r^{(i)})^T w^{(i)}, ξ^{(i)} = (w^{(i)})^T z^{(i)}, η^{(i−1)} = (s^{(i−1)})^T t^{(i−1)},
  ζ^{(i)} = ζ^{(i−1)} + η^{(i−1)},
  if (ε_CG^{-1} + i) η^{(i−1)} ≤ ζ^{(i)} or γ^{(i)} ≤ ε_CG² γ^{(0)} then return {d^{(i)}};
  if i = 0 then
    α^{(i)} = −γ^{(i)}/ξ^{(i)}, β^{(i)} = 0;
  else
    δ^{(i)} = γ^{(i)} / (ξ^{(i)} η^{(i−1)} − (γ^{(i)})²),
    α^{(i)} = −η^{(i−1)} δ^{(i)}, β^{(i)} = γ^{(i)} δ^{(i)};
  end if
  t^{(i)} = z^{(i)} α^{(i)} + t^{(i−1)} β^{(i)}, r^{(i+1)} = r^{(i)} + t^{(i)},
  s^{(i)} = w^{(i)} α^{(i)} + s^{(i−1)} β^{(i)}, d^{(i+1)} = d^{(i)} + s^{(i)}.
For maximum reliability, the new stopping rule (23) is used along with the standard one; however, in almost all cases the new rule provides earlier CG termination.
Despite a somewhat larger workspace and number of vector operations compared to the standard algorithm, the above version of the CG algorithm enables a more efficient parallel implementation of the scalar product operations. At each iteration of the above algorithm, it suffices to use one MPI_Allreduce(*,*,3,...) operation instead of the two MPI_Allreduce(*,*,1,...) operations in the standard PCG recurrences. This is especially important when many MPI processes are used and the start-up time of MPI_Allreduce operations is relatively large. For other equivalent PCG reformulations that allow proper reordering of the scalar product operations, see [5] and the references cited therein.
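For concreteness, the recurrences of Algorithm 2 together with the Jacobi preconditioning (24) can be sketched in Python as follows (this is a serial, dense-matrix illustration; the function name, the default value of ε_CG, and the explicit guard excluding the degenerate check at i = 0 are our assumptions):

import numpy as np

# Sketch of Algorithm 2 (a reconstruction): PCG for M d = g with Jacobi
# preconditioning (24) and the stopping rule (23).
def pcg_algorithm2(M, g, eps_cg=0.05, itmax=200):
    C = 1.0 / np.diag(M)                         # Jacobi preconditioner, applied elementwise
    d = np.zeros_like(g); s = np.zeros_like(g); t = np.zeros_like(g)
    r = -np.asarray(g, dtype=float)              # r^(0) = -g
    zeta = 0.0
    gamma0 = None
    for i in range(itmax + 1):
        w = C * r                                # w^(i) = C r^(i)
        z = M @ w                                # z^(i) = M w^(i)
        gamma = r @ w; xi = w @ z; eta = s @ t   # gamma^(i), xi^(i), eta^(i-1)
        zeta = zeta + eta                        # zeta^(i) = zeta^(i-1) + eta^(i-1)
        if gamma0 is None:
            gamma0 = gamma
        stop_new = i > 0 and (1.0 / eps_cg + i) * eta <= zeta   # rule (23); i > 0 guard added
        stop_std = gamma <= (eps_cg ** 2) * gamma0              # standard relative test
        if stop_new or stop_std:
            return d                             # returns d^(i)
        if i == 0:
            alpha = -gamma / xi; beta = 0.0
        else:
            delta = gamma / (xi * eta - gamma ** 2)
            alpha = -eta * delta; beta = gamma * delta
        t = z * alpha + t * beta                 # t^(i)
        r = r + t                                # r^(i+1)
        s = w * alpha + s * beta                 # s^(i)
        d = d + s                                # d^(i+1)
    return d

In the Newton context, a call of the form d = pcg_algorithm2(M_k, g_k) would replace the exact solve used in the sketch of Algorithm 1 above.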
Let us recall some basic properties of the PCG algorithm; see, e.g., [1]. The standard PCG algorithm (algebraically equivalent to Algorithm 2) for the solution of the problem Md = g can be written as follows (the initial guess for the solution d_0 is set to zero):
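For comparison, a minimal Python sketch of these standard recurrences (with the Jacobi preconditioning (24) and a residual-based exit test, the latter being an illustrative assumption) reads:

import numpy as np

# Sketch of the standard PCG recurrences for M d = g with zero initial guess;
# note the two separate scalar products per iteration, in contrast to Algorithm 2.
def pcg_standard(M, g, tol=1e-12, itmax=200):
    C = 1.0 / np.diag(M)
    d = np.zeros_like(g)
    r = np.asarray(g, dtype=float)   # residual g - M d with d = 0
    w = C * r
    rho = r @ w
    s = w.copy()                     # first search direction
    for _ in range(itmax):
        q = M @ s
        alpha = rho / (s @ q)
        d = d + alpha * s
        r = r - alpha * q
        if np.linalg.norm(r) <= tol * np.linalg.norm(g):
            break
        w = C * r
        rho_new = r @ w
        beta = rho_new / rho         # standard PCG direction update
        s = w + beta * s
        rho = rho_new
    return d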
The scaling property (14) (omitting the upper and lower indices at d, it reads d^T g = d^T M d) can be proved as follows. Let d = d^{(i)} be obtained after i iterations of the PCG method applied to Md = g with zero initial guess d^{(0)} = 0. Therefore, d ∈ K_i = span{Cg, CMCg, ..., (CM)^{i−1} Cg}, and, by the PCG optimality property, for any scalar α it holds
$$\frac{1}{2}\, d^T M d - d^T g \;\le\; \frac{1}{2}\, \alpha^2 d^T M d - \alpha\, d^T g.$$
Setting here α = d^T g / d^T M d, one can easily transform this inequality into 0 ≥ (d^T M d − d^T g)², which readily yields (14). Furthermore, by the well-known estimate of the PCG iteration error [1] using Chebyshev polynomials, one gets
$$1 - \theta^2 \equiv \frac{(g - Md)^T M^{-1} (g - Md)}{g^T M^{-1} g} \;\le\; \cosh^{-2}\!\bigl(2i/\sqrt{\kappa}\bigr),$$
where
$$\kappa = \mathrm{cond}(CM) \equiv \lambda_{\max}(CM)/\lambda_{\min}(CM).$$
By the scaling condition, this gives
$$\theta^2 = d^T M d / g^T M^{-1} g \;\ge\; \tanh^2\!\bigl(2i/\sqrt{\kappa}\bigr). \qquad (26)$$
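For instance, if κ = cond(CM) = 100, then already after i = 5 PCG iterations estimate (26) gives θ² ≥ tanh²(1) ≈ 0.58, so the factor ϑ in (14) is well separated from zero.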
Below we consider two families of test problems which can be solved via the minimization of piecewise quadratic functions. The first one was described above in Section 2 (see also [6]), while the second coincides with the problem setting for the evaluation
of the distance between two convex polyhedra used in [3]. The latter problem is of key importance, e.g., in robotics and computer animation.
Matrix data from the following 11 linear programming problems (the same selection from the NETLIB collection as considered in [15]) were used to form test problems (1). Note that in what follows we only consider the case x̂ = 0. Recall also the notation x_* = (x̂ + A^T p_*)_+. The problems in Table 1 below are ordered by the number of nonzero elements nz(A) in A ∈ R^{m×n}.
It is readily seen that 3 out of the 11 matrices have null rows, and more than half of them have a rather large variation of row norms. This explains the proposed Hessian regularization (13) instead of the earlier construction [6, 15] M_k = H(p_k) + δ I_m. The latter is a proper choice only for matrices with rows of nearly equal length, such as the maros r7 example or various matrices with uniformly distributed quasirandom entries, as used for testing in [9, 15]. In particular, estimate (17) with D = I would take the form γ = ‖A‖²/δ, so the resulting method appears to be rather sensitive to the choice of δ.
In Table 2, the results presented in [15] are reproduced along with similar data obtained with our version of the Generalized Newton method. It must be stressed that we used the same fixed set of tuning parameters for all problems. Note that in [15] the parameter choice for the Armijo procedure was not specified.
Table 3, where the timing (in seconds) and precision results averaged over the same 11 problems are given. One can see that nearly the same average residual norm ‖Ax − b‖_∞ can be obtained considerably faster and with a less critical dependence on ε_CG when using the new PCG iteration stopping rule.
Let the two convex polyhedra X_1 and X_2 be described by the following two systems of linear inequalities:
and the vector b = [b_1^T, b_2^T]^T ∈ R^{n_1+n_2}, we consider the problem
$$x_*(\varepsilon) = \arg\min_{x\in\mathbb{R}^{2s}} \Bigl\{ \frac{\varepsilon}{2}\,\|x\|^2 + \frac{1}{2}\, x^T B x + \frac{1}{2\varepsilon}\,\bigl\|(A^T x - b)_+\bigr\|^2 \Bigr\},$$
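One natural way to assemble the data of this penalized problem is sketched below; the block structure of A and B and the description X_i = {y ∈ R^s : A_i^T y ≤ b_i} are our illustrative assumptions and may differ in details from the setting used in the experiments:

import numpy as np

# Illustrative assembly of the penalized polyhedra-distance problem; the block
# structure below (x = [y1; y2], x^T B x = ||y1 - y2||^2) is an assumption.
def distance_problem_data(A1, b1, A2, b2):
    s = A1.shape[0]
    A = np.block([[A1, np.zeros((s, A2.shape[1]))],
                  [np.zeros((s, A1.shape[1])), A2]])   # A^T x <= b collects both systems
    b = np.concatenate([b1, b2])
    I = np.eye(s)
    B = np.block([[I, -I], [-I, I]])                   # x^T B x = ||y1 - y2||^2
    return A, b, B

def penalized_objective(x, A, b, B, eps):
    viol = np.maximum(A.T @ x - b, 0.0)                # constraint violations
    return 0.5 * eps * (x @ x) + 0.5 * x @ (B @ x) + 0.5 / eps * (viol @ viol)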
Acknowledgements This work was supported by the Russian Foundation for Basic Research grant No. 17-07-00510 and by Program No. 26 of the Presidium of the Russian Academy of Sciences. The authors are grateful to the anonymous referees and to Prof. V. Garanzha for many useful comments which greatly improved the exposition of the paper.
Table 4 Performance of the generalized Newton method (ε = 10^{-4}) for the problem of the distance between two quasirandom convex polyhedra with n/2 faces each

n       ‖x1 − x2‖2   ‖A^T x∗ − c‖∞   time (sec)   ‖g(x∗)‖∞    #NewtIter
8       0.001815     9.69e−09        < 0.001      7.89e−13    15
16      0.481528     8.63e−05        < 0.001      1.27e−13    3
32      0.795116     8.80e−05        < 0.001      1.46e−12    28
64      1.102286     1.32e−04        < 0.001      5.58e−13    13
128     1.446262     1.36e−04        < 0.001      7.12e−13    17
256     1.449913     9.54e−05        < 0.001      4.37e−13    11
512     1.460197     1.31e−04        0.001        8.16e−13    15
1024    1.460063     1.46e−04        0.002        1.09e−12    14
2048    1.463320     1.04e−04        0.005        6.58e−13    19
4096    1.463766     1.26e−04        0.009        3.59e−13    20
8192    1.463879     1.03e−04        0.009        8.32e−14    12
16384   1.463976     7.58e−05        0.009        1.64e−12    13
32768   1.464046     3.28e−05        0.018        1.54e−12    13
References
1. Axelsson, O.: A class of iterative methods for finite element equations. Computer Meth. Appl.
Mech. Engrg. 9, 123–137 (1976)
2. Axelsson, O., Kaporin, I.E.: Error norm estimation and stopping criteria in preconditioned
conjugate gradient iterations. Numer. Linear Algebra Appls. 8 (4), 265–286 (2001)
3. Bobrow, J.E.: A direct minimization approach for obtaining the distance between convex
polyhedra. The International Journal of Robotics Research, 8(3), 65–76 (1989)
4. Dembo, R., Steihaug, T.: Truncated Newton algorithms for large-scale unconstrained optimization. Math. Program. 26, 190–212 (1983)
5. Dongarra, J., Eijkhout, V.: Finite-choice algorithm optimization in Conjugate Gradients. LAPACK Working Note 159, University of Tennessee Computer Science Report UT-CS-03-502 (2003)
6. Ganin, B.V., Golikov, A.I., Evtushenko, Y.G.: Projective-dual method for solving systems of
linear equations with nonnegative variables. Comput. Math. and Math. Phys. 58 (2), 159–169
(2018)
7. Garanzha V.A., Kaporin I.E.: Regularization of the barrier variational method of grid genera-
tion. Comput. Math. and Math. Phys. 39 (9), 1426–1440 (1999)
8. Garanzha, V., Kaporin, I., Konshin, I.: Truncated Newton type solver with application to grid
untangling problem. Numer. Linear Algebra Appls. 11 (5-6), 525–533 (2004)
9. Garanzha, V.A., Golikov, A.I., Evtushenko, Y.G., Nguen, M.K.: Parallel implementation of
Newton’s method for solving large-scale linear programs. Comput. Math. and Math. Phys. 49
(8), 1303–1317 (2009)
10. Hiriart-Urruty, J.-B., Strodiot, J.-J., Nguyen, V.H.: Generalized Hessian matrix and second-order optimality conditions for problems with C^{1,1} data. Applied Mathematics and Optimization 11 (1), 43–56 (1984)
11. Hestenes, M.R., Stiefel, E.L.: Methods of conjugate gradients for solving linear systems. J.
Research Nat. Bur. Standards 49 (1), 409–436 (1952)
12. Kaporin, I.E., Axelsson, O.: On a class of nonlinear equation solvers based on the residual
norm reduction over a sequence of affine subspaces. SIAM J. Sci. Comput. 16 (1), 228–249
(1995)
13. Kaporin, I.E.: Using inner conjugate gradient iterations in solving large-scale sparse nonlinear
optimization problems. Comput. Math. and Math. Phys. 43 (6), 766–771 (2003)
14. Kaporin, I.E., Milyukova, O.Y.: The massively parallel preconditioned conjugate gradient
method for the numerical solution of linear algebraic equations. In: Zhadan, V.G. (ed.) Col-
lection of Papers of the Department of Applied Optimization of the Dorodnicyn Computing
Center, pp. 132–157, Russian Academy of Sciences, Moscow (2011)
15. Ketabchi, S., Moosaei, H., Parandegan, M., Navidi, H.: Computing minimum norm solution
of linear systems of equations by the generalized Newton method. Numerical Algebra, Con-
trol and Optimization, 7 (2), 113–119 (2017)
16. Mangasarian, O.L.: A finite Newton method for classification. Optimization Methods and
Software, 17 (5), 913–929 (2002)
17. Mangasarian, O.L.: A Newton method for linear programming. Journal of Optimization The-
ory and Applications, 121 (1), 1–18 (2004)
18. Yu, L., Barbot, J.P., Zheng, G., Sun, H.: Compressive sensing with chaotic sequence. IEEE
Signal Processing Letters, 17(8), 731–734 (2010)