Subspace Methods For Nonlinear Optimization
Xin Liu∗
Zaiwen Wen†
Ya-xiang Yuan‡
Abstract. Subspace techniques such as Krylov subspace methods have been well known and extensively used in
numerical linear algebra. They are also ubiquitous and becoming indispensable tools in nonlinear optimization
due to their ability to handle large scale problems. There are generally two types of principles: i) the decision
variable is updated in a lower dimensional subspace; ii) the objective function or constraints are approximated
in a certain smaller functional subspace. The key ingredients are the constructions of suitable subspaces and
subproblems according to the specific structures of the variables and functions such that either the exact or
inexact solutions of the subproblems are readily available and the corresponding computational cost is
significantly reduced. A few relevant techniques include, but are not limited to, direct combinations, block
coordinate descent, active sets, limited memory, Anderson acceleration, subspace correction, sampling and
sketching. This paper gives a comprehensive survey on subspace methods and their recipes in unconstrained
and constrained optimization, nonlinear least squares problems, sparse and low rank optimization, linear and
nonlinear eigenvalue computation, semidefinite programming, stochastic optimization, etc. In order to provide
helpful guidelines, we emphasize high level concepts for the development and implementation of practical
algorithms from the subspace framework.
Key words. nonlinear optimization, subspace techniques, block coordinate descent, active sets, limited memory,
Anderson acceleration, subspace correction, subsampling, sketching
∗ State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science,
Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China ([email protected]). Re-
search supported in part by NSFC grants 11622112, 11471325, 91530204 and 11688101, the National Center for
Mathematics and Interdisciplinary Sciences, CAS, and Key Research Program of Frontier Sciences QYZDJ-SSW-
SYS010, CAS.
† Beijing International Center for Mathematical Research, Peking University, China ([email protected]). Re-
search supported in part by NSFC grant 11831002, and by Beijing Academy of Artificial Intelligence (BAAI).
‡ State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science,
Chinese Academy of Sciences, China ([email protected]). Research supported in part by NSFC grants 11331012
and 11688101.
where x is the decision variable, f (x) is the objective function and X is the feasible set. Efficient numerical
optimization algorithms have been extensively developed for (1.1) with various types of objective functions and
constraints [111, 88]. With rapidly increasing problem scales, subspace techniques are ubiquitous and becoming
indispensable tools in nonlinear optimization due to their ability to handle large scale problems. For example,
the Krylov subspace methods developed in the numerical linear algebra community have been widely used for
the linear least squares problem and the linear eigenvalue problem. The characteristics of the subspaces are
clear in many popular optimization algorithms such as the linear and nonlinear conjugate gradient methods,
Nesterov's accelerated gradient method, the quasi-Newton methods and the block coordinate descent (BCD)
method. The subspace correction method for convex optimization can be viewed as a generalization of multigrid
and domain decomposition methods. The Anderson acceleration or the direct inversion of iterative subspace
(DIIS) methods have been successful in computational quantum physics and chemistry. The stochastic gradient
type methods usually take a mini-batch from a large collection of samples so that the computational cost of
each inner iteration is small. The sketching techniques formulate a reduced problem by multiplication with
random matrices that have certain properties.
The purpose of this paper is to provide a review of the subspace methods for nonlinear optimization, for their
further improvement and for their future usage in even more diverse and emerging fields. The subspace
techniques for (1.1) are generally divided into two categories. The first type updates the decision variable in a
lower dimensional subspace, while the second type constructs approximations of the objective function or
constraints in a certain smaller subspace of functions. Usually, there are three key steps.
• Identify a suitable subspace either for the decision variables or the functions.
• Construct a proper subproblem by various restrictions or approximations.
• Find either an exact or inexact solution of the subproblems.
These steps are often mixed together using the specific structures of the problems case by case. The essence is
how to reduce the corresponding computational cost significantly.
During the practice in unconstrained and constrained optimization, nonlinear least squares problems, sparse
and low rank optimization, linear and nonlinear eigenvalue computation, semidefinite programming, stochastic
optimization, manifold optimization, phase retrieval,
1.1. Overview of Subspace Techniques. We next summarize the concepts and contexts of a few main subspace
techniques.
Direct Combinations. It is a common practice to update the decision variables using a combination of a few
known directions which span a subspace. The linear and nonlinear conjugate gradient methods [111, 88],
Nesterov's accelerated gradient method [84, 85], the heavy-ball method [90], the search direction correction
method [126] and the momentum method [47] take a linear combination of the gradient and the previous
search direction. The main difference is reflected in the choices of the coefficients according to different
explicit formulas.
BCD. The variables in many problems can be split naturally into a few blocks whose subspaces are spanned by
the coordinate directions. The Gauss-Seidel type of BCD method updates only one block by minimizing the
objective function or its surrogate while all other blocks are fixed at each iteration. It has been one of the core
algorithmic ideas in solving problems with block structures, such as convex programming [77], nonlinear
programming [9], semidefinite programming [129, 145], compressive sensing [72, 32], etc. A proximal
alternating linearized minimization method is developed in [10] for solving a summation of nonconvex but
differentiable and nonsmooth functions. The alternating direction method of multipliers (ADMM) [11, 27, 41,
45, 55, 125] minimizes the augmented Lagrangian function with respect to the primal variables by BCD, then
updates the Lagrange multiplier.
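To make the Gauss-Seidel BCD update concrete, the following minimal sketch cyclically minimizes a strongly
convex quadratic over coordinate blocks; the quadratic test problem, the block partition and the function name
are illustrative assumptions rather than the exact schemes of the cited references.

```python
# A minimal Gauss-Seidel BCD sketch for min_x 0.5*x'Ax - b'x with A positive
# definite, where x is split into coordinate blocks; names are illustrative.
import numpy as np

def bcd_quadratic(A, b, blocks, n_sweeps=50):
    """Cyclically minimize over one coordinate block while fixing the others."""
    x = np.zeros(len(b))
    for _ in range(n_sweeps):
        for idx in blocks:                      # idx: indices of one block
            r = b - A @ x + A[:, idx] @ x[idx]  # right hand side excluding this block
            # exact minimization over the block: A[idx, idx] x_idx = r[idx]
            x[idx] = np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
    return x

# usage on a random strongly convex quadratic
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)
b = rng.standard_normal(8)
blocks = [np.arange(0, 4), np.arange(4, 8)]
x = bcd_quadratic(A, b, blocks)
print(np.linalg.norm(A @ x - b))  # small after a few sweeps
```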
Active Sets. When a clear partition of variables is not available, a subset of the variables can be fixed in a
so-called active set under certain mechanisms, while the remaining variables are determined from certain
subproblems. This strategy has been applied to optimization problems with bound constraints or linear
constraints in [17, 18, 51, 81, 82], to ℓ1-regularized problems for sparse optimization in [133, 105, 64] and to
general nonlinear programs in [19, 20]. In quadratic programming, the inequality constraints that have zero
values at the optimal solution are called active, and they are replaced by equality constraints in the
subproblem [111].
Limited-memory. A typical subspace is constructed from pieces of historical information, for example, the
previous iterates {xk}, the previous gradients {∇f (xk)}, the differences between two consecutive iterates
{xk − xk−1}, and the differences between two consecutive gradients {∇f (xk) − ∇f (xk−1)}. After the new iterate
is formed, the oldest vectors in the storage are replaced by the most recent vectors if certain justification rules
are satisfied. Two examples are the limited memory BFGS method [111, 88] and the limited memory block
Krylov subspace optimization method (LMSVD) [74].
Anderson Acceleration. For a sequence {xk} generated by a general fixed-point iteration, the Anderson
acceleration produces a new point using a linear combination of a few points in {xk}, where the coefficients
are determined from an extra linear least squares problem with a normalization constraint [13, 4, 123]. A few
related schemes include the minimal polynomial extrapolation, modified minimal polynomial extrapolation,
reduced rank extrapolation, the vector epsilon algorithm and the topological epsilon algorithm. The Anderson
acceleration is also known as Anderson mixing, Pulay mixing, DIIS or the commutator DIIS [92, 93, 115] in
electronic structure calculation. These techniques have also been applied to optimization problems in [99, 147].
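The following is a minimal sketch of Anderson acceleration for a fixed-point map x = g(x); the normalization
constraint on the mixing coefficients is eliminated through an equivalent unconstrained least squares problem,
and the memory size, the test map and the function names are illustrative assumptions.

```python
# A minimal Anderson acceleration sketch for a fixed-point iteration x = g(x).
import numpy as np

def anderson(g, x0, m=5, max_iter=100, tol=1e-10):
    x = x0.copy()
    X_hist, F_hist = [], []                   # stored g(x_j) and residuals g(x_j) - x_j
    for _ in range(max_iter):
        gx = g(x)
        f = gx - x
        if np.linalg.norm(f) < tol:
            break
        X_hist.append(gx); F_hist.append(f)
        X_hist, F_hist = X_hist[-m:], F_hist[-m:]
        k = len(F_hist)
        if k == 1:
            x = gx                            # plain fixed-point step
            continue
        # coefficients alpha with sum(alpha) = 1: set alpha_last = 1 - sum(others)
        # and solve an ordinary least squares problem in the remaining variables
        dF = np.stack([F_hist[j] - F_hist[-1] for j in range(k - 1)], axis=1)
        gamma, *_ = np.linalg.lstsq(dF, -F_hist[-1], rcond=None)
        alpha = np.append(gamma, 1.0 - gamma.sum())
        x = sum(a * xg for a, xg in zip(alpha, X_hist))
    return x

# usage: accelerate the componentwise fixed-point iteration x = cos(x)
x = anderson(np.cos, np.zeros(3))
print(x, np.cos(x) - x)
```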
Subspace correction. For variational problems, the domain decomposition methods
where f (x) : Rn → R is a differentiable function. The line search and trust region methods are the two main
types of approaches for solving (2.1). The main difference between them is the order of determining the
so-called step size and search direction. Subspace techniques have been substantially studied in [26, 48, 140,
142, 143, 87, 128, 127, 49].
2.1. The Line Search Methods. At the k-th iteration xk, the line search methods first generate a descent search
direction dk and then search along this direction for a step size αk such that the objective function at the next
point

(2.2)    xk+1 = xk + αk dk
d ∈ Sk.
which is spanned by the gradient gk and the last search direction dk−1. More specifically, dk is a linear
combination of −gk and dk−1 with a weight βk−1, i.e., dk = −gk + βk−1 dk−1, where d0 = −g0 and β0 = 0. A few
widely used choices for the weight βk−1 are
βk−1 = (gk^T gk) / (g_{k−1}^T g_{k−1}),                       (F-R formula),
βk−1 = (gk^T (gk − g_{k−1})) / (d_{k−1}^T (gk − g_{k−1})),    (H-S or C-W formula),
βk−1 = (gk^T (gk − g_{k−1})) / (g_{k−1}^T g_{k−1}),           (PRP formula),
βk−1 = −(gk^T gk) / (d_{k−1}^T g_{k−1}),                      (Dixon formula),
βk−1 = (gk^T gk) / (d_{k−1}^T (gk − g_{k−1})),                (D-Y formula).
It is easy to observe that these formulas are equivalent in the sense that they yield the same search directions
when the function f (x) is quadratic with a positive definite Hessian matrix. In this case, the directions
d1, . . . , dk are conjugate to each other with respect to the Hessian matrix. It can also be proved that the CG
method has global convergence and n-step local quadratic convergence. However, for a general nonlinear
function with inexact line search, the behavior of the methods with different βk can be significantly different.
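As an illustration of how the different βk−1 formulas plug into the same subspace update dk = −gk + βk−1 dk−1,
here is a minimal nonlinear CG sketch with a simple Armijo backtracking line search; the safeguards and the
quadratic test problem are illustrative assumptions, not a recommended production implementation.

```python
# A minimal nonlinear CG sketch with interchangeable beta formulas (FR, PRP, DY).
import numpy as np

def beta_fr(g_new, g_old, d_old):  return g_new @ g_new / (g_old @ g_old)
def beta_prp(g_new, g_old, d_old): return g_new @ (g_new - g_old) / (g_old @ g_old)
def beta_dy(g_new, g_old, d_old):  return g_new @ g_new / (d_old @ (g_new - g_old))

def nonlinear_cg(f, grad, x0, beta_rule=beta_prp, max_iter=500, tol=1e-8):
    x, g = x0.copy(), grad(x0)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha, fx = 1.0, f(x)                    # backtracking Armijo line search
        while f(x + alpha * d) > fx + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = max(beta_rule(g_new, g, d), 0.0)  # nonnegativity safeguard
        d = -g_new + beta * d
        if g_new @ d >= 0:                       # restart if not a descent direction
            d = -g_new
        x, g = x_new, g_new
    return x

# usage on a convex quadratic f(x) = 0.5 x'Ax - b'x
rng = np.random.default_rng(1)
A = np.diag(np.arange(1.0, 51.0)); b = rng.standard_normal(50)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x = nonlinear_cg(f, grad, np.zeros(50))
print(np.linalg.norm(A @ x - b))
```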
2.1.2. Nesterov's Accelerated Gradient Method. The steepest descent gradient method simply uses dk = −gk in
(2.2) for unconstrained optimization. Assume that the function f (x) is convex, the optimal value f ∗ of (2.1) is
finite and attained at a point x∗, and the gradient ∇f (x) is Lipschitz continuous with a constant L, i.e.,
An illustration of the FISTA method is shown in Figure 2.1. Under the same assumptions as
[Figure 2.1: one FISTA step xk = yk − tk ∇f (yk), constructed from the previous points xk−2, xk−1 and the
extrapolated point yk.]
2.1.3. The Heavy-ball Method. The heavy-ball method [90] is also a two-step scheme:
where Qk (d) is an approximation to f (xk + d) for d in the subspace Sk. It would be desirable that the
approximation model Qk (d) has the following properties: (i) it is easy to minimize in the subspace Sk; (ii) it is
a good approximation to f, so that the solution of the subspace subproblem gives a sufficient reduction with
respect to the original objective function f.
It is natural to use quadratic approximations to the objective function. This leads to quadratic models in
subspaces. A successive two-dimensional search algorithm is developed by Stoer and Yuan in [143] based on
Let the dimension dim(Sk) = τk and Sk be the set generated by all linear combinations of vectors
p1, p2, . . . , pτk ∈ Rn, i.e.,

Sk = span{p1, p2, . . . , pτk}.

Define Pk = [p1, p2, . . . , pτk]. Then d ∈ Sk can be represented as d = Pk d̄ with d̄ ∈ Rτk. Hence, a quadratic
function Qk (d) defined in the subspace can be expressed as a function Q̄k in the lower dimensional space Rτk
via Qk (d) = Q̄k (d̄). Since τk usually is quite small, the Newton method can be used to solve (2.13) efficiently.
We now discuss a few possible choices for the subspace Sk. A special subspace is a combination of the current
gradient and the previous search directions, i.e.,
where d̄ = (α, β1, · · · , βm)^T ∈ R^{m+1}. All second order terms of the Taylor expansion of f (xk + d) in the
subspace Sk can be approximated by secant conditions
except gk^T ∇2 f (xk) gk. Therefore, it is reasonable to use the following quadratic model in the subspace Sk:

(2.17)    Q̄k (d̄) = (−‖gk‖^2, gk^T s_{k−1}, · · · , gk^T s_{k−m}) d̄ + (1/2) d̄^T B̄k d̄,

where

(2.18)    B̄k = [ ρk              −gk^T y_{k−1}     · · ·   −gk^T y_{k−m}
                 −gk^T y_{k−1}    y_{k−1}^T s_{k−1}  · · ·   y_{k−m}^T s_{k−1}
                 ...              ...               . . .   ...
                 −gk^T y_{k−m}    y_{k−1}^T s_{k−m}  · · ·   y_{k−m}^T s_{k−m} ].
(2.19)    ρk = 2 (s_{k−1}^T gk)^2 / (s_{k−1}^T y_{k−1}),

due to the fact that the mean value of cos^2(θ) is 1/2, which gives

(2.20)    gk^T ∇2 f (xk) gk = (s_{k−1}^T ∇2 f (xk) gk)^2 / (cos^2 θk · s_{k−1}^T ∇2 f (xk) s_{k−1})
                            ≈ 2 (s_{k−1}^T gk)^2 / (s_{k−1}^T y_{k−1}),

where θk is the angle between (∇2 f (xk))^{1/2} gk and (∇2 f (xk))^{1/2} s_{k−1}. Another way to estimate
gk^T ∇2 f (xk) gk is by replacing ∇2 f (xk) with a quasi-Newton matrix. We can also obtain ρk by computing an
extra function value f (xk + t gk) and setting
where Wk = [−gk, y_{k−1}, . . . , y_{k−m}] ∈ R^{n×(m+1)}. The Newton step in the subspace Sk is Wk d̄k with

(2.25)    d̄k = −(Wk^T ∇2 f (xk) Wk)^{−1} Wk^T ∇f (xk).

Thus, the remaining issue is to obtain a good estimate of d̄k, using the fact that all the elements of
Wk^T (∇2 f (xk))^{−1} Wk are known except the single entry gk^T (∇2 f (xk))^{−1} gk.
(2.28)    min_{d∈Sk} d^T gk / (‖d‖2 ‖gk‖2) ≤ −√(τ/n).

When (gk^{i_{τ+1}})^2 ≤ Σ_{j=1}^{τ} (gk^{i_j})^2, the following estimation can be established:

(2.29)    min_{d∈Sk} d^T gk / (‖d‖2 ‖gk‖2) ≤ −1/√(1 + (n − τ)).

Consequently, a sequential steepest coordinates search (SSCS) technique can be designed by augmenting the
steepest coordinate directions into the subspace sequentially. For example, consider minimizing a convex
quadratic function

Q(x) = g^T x + (1/2) x^T Bx.

At the k-th iteration of SSCS, we first compute gk = ∇Q(xk), then choose
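A minimal sketch of the τ-steepest descent coordinate subspace idea for the convex quadratic Q(x) = g^T x +
(1/2) x^T Bx is given below: at each iteration the τ coordinates with the largest gradient magnitudes span the
subspace and Q is minimized exactly over it. The full SSCS scheme augments the coordinates sequentially; this
simplified variant and its parameter values are illustrative assumptions.

```python
# A minimal tau-steepest coordinate subspace sketch for Q(x) = g'x + 0.5 x'Bx.
import numpy as np

def tau_steepest_coordinate(g, B, tau=5, max_iter=200, tol=1e-10):
    n = len(g)
    x = np.zeros(n)
    for _ in range(max_iter):
        grad = g + B @ x
        if np.linalg.norm(grad) < tol:
            break
        S = np.argsort(-np.abs(grad))[:tau]          # tau steepest coordinates
        # exact minimization of Q over span{e_i : i in S}: B[S, S] d_S = -grad[S]
        d_S = np.linalg.solve(B[np.ix_(S, S)], -grad[S])
        x[S] += d_S
    return x

rng = np.random.default_rng(2)
M = rng.standard_normal((30, 30))
B = M @ M.T + 30 * np.eye(30)
g = rng.standard_normal(30)
x = tau_steepest_coordinate(g, B)
print(np.linalg.norm(g + B @ x))   # stationarity residual
```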
2.2. Trust Region Methods. The trust region methods for (2.1) compute a search direction in a ball determined
by a given trust region radius whose role is similar to the step size. The trust region subproblem (TRS) is
normally

(2.30)    min_{s∈Rn} Qk (s) = gk^T s + (1/2) s^T Bk s,   s. t.  ‖s‖2 ≤ ∆k,
where Bk is an approximation to the Hessian ∇2 f (xk) and ∆k > 0 is the trust region radius. A subspace version
of the trust region subproblem is suggested in [101]:

min_{s∈Rn} Qk (s),   s. t.  ‖s‖2 ≤ ∆k,  s ∈ Sk.
The Steihaug truncated CG method [107] for solving (2.30) is essentially a subspace method. When the
approximate Hessian Bk is generated by quasi-Newton updates such as the SR1, PSB or the Broyden family
[111, 88], the TRS has subspace properties. Suppose B1 = σI with σ > 0, let sk be an optimal solution of the
TRS (2.30) and set xk+1 = xk + sk. Let Gk = span{g1, g2, · · · , gk}. Then it can be proved that sk ∈ Gk and, for
any z ∈ Gk and w ∈ Gk⊥, it holds

(2.31)    Bk z ∈ Gk,   Bk w = σw.

Therefore, the subspace trust region algorithm generates the same sequences as the full space trust region
quasi-Newton algorithm. Based on the above results, Wang and Yuan [128]
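Since the Steihaug truncated CG method is mentioned above, the following minimal sketch shows how it solves
(2.30) within the Krylov subspace generated by gk and Bk, stopping at the trust region boundary or at negative
curvature; the tolerances and the random test data are illustrative assumptions.

```python
# A minimal Steihaug truncated CG sketch for the trust region subproblem (2.30).
import numpy as np

def _to_boundary(s, d, delta):
    """Positive tau with ||s + tau*d|| = delta."""
    a, b, c = d @ d, 2 * s @ d, s @ s - delta ** 2
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)

def steihaug_cg(g, B, delta, tol=1e-8, max_iter=None):
    n = len(g)
    max_iter = max_iter or 2 * n
    s = np.zeros(n)
    r, d = g.copy(), -g.copy()
    if np.linalg.norm(r) < tol:
        return s
    for _ in range(max_iter):
        Bd = B @ d
        dBd = d @ Bd
        if dBd <= 0:                                   # negative curvature: go to boundary
            return s + _to_boundary(s, d, delta) * d
        alpha = (r @ r) / dBd
        if np.linalg.norm(s + alpha * d) >= delta:     # step leaves the trust region
            return s + _to_boundary(s, d, delta) * d
        s = s + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) < tol:
            return s
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return s

# usage: one subproblem with a possibly indefinite Hessian approximation
rng = np.random.default_rng(3)
B = rng.standard_normal((20, 20)); B = 0.5 * (B + B.T)
g = rng.standard_normal(20)
print(np.linalg.norm(steihaug_cg(g, B, delta=1.0)))    # at most 1.0
```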
where Jk is the Jacobian of F at xk. Similarly, one can construct a subspace version of the
Levenberg-Marquardt method for (3.2) as

min_z ‖F (xk) + Jk Qk z‖2^2 + (λk/2) ‖z‖2^2,

where λk is a regularization parameter adjusted to ensure global convergence.
3.2. Subspace by Subsampling/Sketching. We start from solving a linear least squares problem on massive data
sets:
where A ∈ Rm×n and b ∈ Rm. Although applying direct or iterative methods to (3.5) is straightforward, it may
be prohibitive for large values of m. The sketching technique chooses a matrix W ∈ Rr×m with r ≪ m and
formulates a reduced problem
It can be proved that the solution of (3.6) is a good approximation to that of (3.5) in a certain sense if the
matrix W is chosen suitably. For example, one may randomly select r rows from the identity matrix to form W
so that W A is a submatrix of A. Another choice is that each element of W is sampled i.i.d. from a normal
distribution with mean zero and variance 1/r. These concepts have been extensively investigated for
randomized algorithms in numerical linear algebra [78, 136].
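A minimal sketch-and-solve example for (3.5)-(3.6) with a Gaussian sketching matrix is given below; the
problem sizes and the sketch dimension r are illustrative assumptions.

```python
# A minimal sketch-and-solve example: a Gaussian sketch W with r << m rows and
# entries i.i.d. N(0, 1/r) reduces an m x n least squares problem to an r x n one.
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 10000, 50, 400
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.01 * rng.standard_normal(m)

x_full, *_ = np.linalg.lstsq(A, b, rcond=None)         # full problem (3.5)

W = rng.standard_normal((r, m)) / np.sqrt(r)           # Gaussian sketching matrix
x_sk, *_ = np.linalg.lstsq(W @ A, W @ b, rcond=None)   # reduced problem (3.6)

print(np.linalg.norm(x_full - x_sk) / np.linalg.norm(x_full))
```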
which is an incomplete set of equations. To solve the nonlinear equations (3.1) is to find an x at which F maps
to the origin [141]. Let Pk^T be a mapping from Rm to a lower dimensional subspace. Solving the reduced system
Of course, the efficiency of such an approach depends on how to select Pk and Qk. We can borrow ideas from
subspace techniques for large scale linear systems [98]. Instead of using (3.10), we can use a subproblem of
the following form:
where Jˆk ∈ R^{ik×ik} is an approximation to Pk^T Jk Qk. The reason for preferring (3.11) over (3.10) is that in
(3.11) we do not need the Jacobian matrix Jk, whose size is normally significantly larger than that of Jˆk.
A similar idea has also been studied for nonlinear least squares problems. At the k-th iteration, we consider
minimizing the sum of squares of some randomly selected terms in an index set Ik ⊂ {1, . . . , m} instead of all
terms:

(3.12)    min_{x∈Rn} Σ_{i∈Ik} (F^i (x))^2.

This approach works quite well on the distance geometry problem, which has many applications including
protein structure prediction, where the nonlinear least squares of all the terms has many local minimizers
[103]. Combining subspace with sketching yields a counterpart to (3.9) for nonlinear least squares:
A projected nonlinear least squares method is studied in [57] to fit a model ψ to (noisy) measurements y for
the exponential fitting problem:
where ψ(x) ∈ Rm and n ≪ m. Since computing the Jacobian of (3.14) can be expensive, a sequence of
low-dimensional surrogate problems is constructed from a sequence of subspaces {Wℓ} ⊂ Rm. Let PWℓ be the
orthogonal projection onto Wℓ and Wℓ be an orthonormal basis for Wℓ, i.e., PWℓ = Wℓ Wℓ^T. Then it solves the
following minimization problem:

min_x ‖PWℓ [ψ(x) − y]‖2^2 = min_x ‖Wℓ^T ψ(x) − Wℓ^T y‖2^2.
It is easy to see that a partition of variables uses special subspaces spanned by coordinate directions. An
obvious generalization of the partition of variables is a decomposition of the space which uses subspaces
spanned by any given directions.
3.4. τ-steepest Descent Coordinate Subspace. The τ-steepest descent coordinate subspace discussed in Section
2 can also be extended to nonlinear equations and nonlinear least squares. Assume that
at the k-th iteration. If F (x) is a monotone operator, applying the method in a subspace spanned by the
coordinate directions {e_{i_j}, j = 1, . . . , τ} generates a system
For general nonlinear functions F (x), it is better to replace e_{i_j} by the steepest descent coordinate direction
of the function F^{i_j}(x) at xk, i.e., substituting i_j by an index l_j such that

l_j = arg max_{t=1,...,n} |∂F^{i_j}(xk)/∂x_t|.

However, it may happen that two different j are mapped to the same l_j, so that subproblem (3.17) has no
solution in the subspace spanned by {e_{l_1}, . . . , e_{l_τ}}. Further discussion on subspace methods for
nonlinear equations and nonlinear least squares can be found in [141].
4. Stochastic Optimization. The supervised learning model in machine learning assumes that (a, b) follows a
probability distribution P, where a is an input data point and b is the corresponding label. Let φ(a, x) be a
prediction function in a certain functional space and ℓ(·, ·) represent a loss function that measures the
accuracy between the prediction and the true label. The task is to predict a label b from a given input a, that
is, to find a function φ such that the expected risk E[ℓ(φ(a, x), b)] is minimized. In practice, the real
probability distribution P is unknown, but a data set D = {(a1, b1), (a2, b2), · · · , (aN, bN)} is obtained by
random sampling, where (ai, bi) ∼ P i.i.d. Then an empirical risk minimization is formulated as

(4.1)    min_x f (x) := (1/N) Σ_{i=1}^{N} fi (x),

where fi (x) = ℓ(φ(ai; x), bi). In machine learning, a large number of problems can be expressed in the form of
(4.1). For example, the function φ in deep learning is expressed by a deep neural network. Since the size N
usually is huge, it is time consuming to use the information of all components fi (x). However, it is affordable
to compute the information at a few samples so that the amount of calculation in each step is greatly reduced.
Obviously, subsampling defines a kind of subspace in terms of the component functions {f1 (x), . . . , fN (x)}.
For simplicity, we next only consider extensions based on (4.2).
After calculating a stochastic gradient ∇f_{sk}(xk) at the current point, the momentum method does not apply
it directly to update the variable xk. It introduces a velocity variable v, which represents the direction and
magnitude of the parameter movements. Formally, the iterative scheme is

(4.4)    v_{k+1} = µk vk − αk ∇f_{sk}(xk),
         x_{k+1} = xk + v_{k+1}.

The new update direction v_{k+1} is a linear combination of the previous update direction vk and the gradient
∇f_{sk}(xk). When µk = 0, the algorithm degenerates to SGD. In the momentum method, the parameter µk is
often in the range [0, 1). A typical value is µk ≥ 0.5, which means that the iteration point has a large inertia
and each iteration makes a small correction to the previous direction.
Since the dimension of the variable x can be more than 10 million and the convergence speed of each variable
may be different, updating all components of x using a single step size αk may not be suitable. The adaptive
subgradient method (AdaGrad) controls the step sizes of each component separately by monitoring the
accumulation of the gradients elementwise:

Gk = Σ_{i=1}^{k} ∇f_{si}(xi) ⊙ ∇f_{si}(xi),

where ⊙ denotes the Hadamard product between two vectors. The AdaGrad method is

(4.5)    x_{k+1} = xk − αk/√(Gk + 1n) ⊙ ∇f_{sk}(xk),
         G_{k+1} = Gk + ∇f_{s_{k+1}}(x_{k+1}) ⊙ ∇f_{s_{k+1}}(x_{k+1}),

where the division in αk/√(Gk + 1n) is also performed elementwise. Adding the term 1n prevents division by
zero.
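Minimal sketches of the momentum update (4.4) and the AdaGrad update (4.5) are given below; the mini-batch
sampling, the callable stoch_grad and all parameter values are illustrative assumptions.

```python
# Minimal sketches of momentum (4.4) and AdaGrad (4.5) for the finite sum (4.1);
# stoch_grad(x, batch) stands for the sampled gradient and is user supplied.
import numpy as np

def sgd_momentum(stoch_grad, x0, n_samples, alpha=0.01, mu=0.9, batch=32, iters=1000):
    rng = np.random.default_rng(0)
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        s = rng.choice(n_samples, size=batch, replace=False)
        v = mu * v - alpha * stoch_grad(x, s)    # combine old direction and new gradient
        x = x + v
    return x

def adagrad(stoch_grad, x0, n_samples, alpha=0.1, batch=32, iters=1000):
    rng = np.random.default_rng(0)
    x, G = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        s = rng.choice(n_samples, size=batch, replace=False)
        g = stoch_grad(x, s)
        G = G + g * g                            # accumulated squared gradients G_k
        x = x - alpha / np.sqrt(G + 1.0) * g     # elementwise step sizes; +1 avoids 0
    return x

# usage on a least squares finite sum f_i(x) = 0.5 (a_i'x - b_i)^2
rng = np.random.default_rng(5)
A = rng.standard_normal((5000, 20)); b = A @ np.ones(20)
sg = lambda x, s: A[s].T @ (A[s] @ x - b[s]) / len(s)
print(np.linalg.norm(adagrad(sg, np.zeros(20), 5000) - np.ones(20)))  # error shrinks
```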
The adaptive moment estimation (Adam) method combines the momentum and AdaGrad together by adding a
few small corrections. At each iteration, it performs:

vk = ρ1 v_{k−1} + (1 − ρ1) ∇f_{sk}(xk),
Therefore, the subspace techniques in section 2 can also be adopted here.
Assume that the loss function is the negative logarithm of the probability associated with a distribution whose
density function p(y|a, x) is defined by the neural network and parameterized by x. The so-called KFAC method
[79] is based on the Kronecker-factored approximate Fisher matrix. Take an L-layer feed-forward neural
network for example. Namely, each layer j ∈ {1, 2, . . . , L} is given by

(4.6)    sj = Tj w_{j−1},   wj = ψj (sj),

where w0 = a is the input of the neural network, wL (x) ∈ Rm is the output of the neural network under the
input a, the constant term 1 is not considered in w_{j−1} for simplicity, Tj is the weight matrix and ψj is the
block-wise activation function. The j-th diagonal block of F corresponding to the parameters in the j-th layer
using a sample set B can be written in the following way:

(4.7)    F_j := Q_{j−1,j−1} ⊗ G_{j,j},
where

Q_{j−1,j−1} = (1/|B|) Σ_{i∈B} w^i_{j−1} (w^i_{j−1})^T,
G_{j,j} = (1/|B|) Σ_{i∈B} E_{z∼p(z|ai,x)} [ g̃^i_j(z) g̃^i_j(z)^T ],

and g̃^i_j(z) := ∂ℓ(φ(ai, x), z)/∂s^i_j. Therefore, the KFAC method computes a search direction in the j-th layer
from

F_j d^j_k = −g^j_k,
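The practical advantage of the Kronecker factorization (4.7) is that the system F_j d = −g_j can be solved
through the two small factors instead of the large Kronecker product. The following minimal sketch illustrates
this with a damped factored solve and checks it against the explicit Kronecker system; the damping strategy,
matrix sizes and function name are illustrative assumptions rather than the exact KFAC implementation of [79].

```python
# A minimal Kronecker-factored solve: since F_j ~ Q kron G, the system
# (Q kron G) vec(D) = vec(V) is solved as D = G^{-1} V Q^{-1}, avoiding the
# explicit Kronecker product; a simple factored damping is included.
import numpy as np

def kfac_solve(Q, G, V, damping=1e-3):
    """Approximately solve (Q kron G + damping*I) vec(D) = vec(V) via the factors."""
    Qd = Q + np.sqrt(damping) * np.eye(Q.shape[0])
    Gd = G + np.sqrt(damping) * np.eye(G.shape[0])
    # for column-stacking vec and symmetric Q, (Q kron G) vec(D) = vec(G D Q)
    return np.linalg.solve(Gd, np.linalg.solve(Qd, V.T).T)

# consistency check against the explicit Kronecker system (small sizes only)
rng = np.random.default_rng(6)
a = rng.standard_normal((50, 4)); Q = a.T @ a / 50      # ~ activation covariance
g = rng.standard_normal((50, 3)); G = g.T @ g / 50      # ~ backprop covariance
V = rng.standard_normal((3, 4))                         # layer gradient, shaped like D
D = kfac_solve(Q, G, V, damping=0.0)
lhs = (np.kron(Q, G) @ D.flatten(order="F")).reshape((3, 4), order="F")
print(np.linalg.norm(lhs - V))                          # close to zero
```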
where ‖x‖0 = |{j | xj ≠ 0}|, i.e., the number of nonzero elements of x. An exact recovery of the sparsest signal
often requires the so-called restricted isometry property (RIP), i.e., there exists a constant δr such that
The greedy pursuit methods build up an approximation in a subspace at the k-th iteration. Let Tk be a subset
of {1, . . . , n}, x_{Tk} be the subvector of x corresponding to the set Tk and A_{Tk} be the column submatrix of
A whose column indices are collected in the set Tk. Then the subspace problem is

x^k_{Tk} = arg min_x (1/2) ‖A_{Tk} x − b‖2^2.

Clearly, the solution is x^k_{Tk} = A†_{Tk} b, where A†_{Tk} is the pseudoinverse of A_{Tk}. Since the size of Tk
is controlled to be small, A_{Tk} roughly has full column rank due to the RIP. All other elements of xk whose
indices are in the complementary set of Tk are set to 0.
We next explain the choices of Tk. Assume that the initial index set T0 is empty. The orthogonal matching
pursuit (OMP) [116] computes the gradient
then selects an index such that tk = arg max_{j=1,...,n} |gj|. If multiple indices attain the maximum, one can
break the tie deterministically or randomly. Then the index set at the next iteration is augmented as

T_{k+1} = Tk ∪ {tk}.

The CoSaMP method [83] generates an s-sparse solution, i.e., the number of nonzero components is at most s.
Let (xk)s be a truncation of xk such that only the s largest entries in absolute value are kept and all other
elements are set to 0. The support of (xk)s is denoted as supp((xk)s). Then a gradient gk is computed at the
point (xk)s and the set T_{k+1} is updated by

T_{k+1} = supp((gk)_{2s}) ∪ supp((xk)s).
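A minimal OMP sketch following the index-selection and restricted least squares steps described above is given
below; the measurement sizes and the sparsity level are illustrative assumptions.

```python
# A minimal OMP sketch: at each step the index of the largest gradient magnitude
# is added to T_k and a least squares problem over the columns A_{T_k} is solved.
import numpy as np

def omp(A, b, sparsity):
    n = A.shape[1]
    T = []                                   # current index set T_k
    x = np.zeros(n)
    r = b.copy()
    for _ in range(sparsity):
        g = A.T @ r                          # correlation with the residual
        t = max(np.setdiff1d(np.arange(n), T), key=lambda j: abs(g[j]))
        T.append(t)
        x_T, *_ = np.linalg.lstsq(A[:, T], b, rcond=None)
        x = np.zeros(n); x[T] = x_T
        r = b - A @ x
    return x

# usage: recover a 5-sparse vector from 60 Gaussian measurements
rng = np.random.default_rng(7)
A = rng.standard_normal((60, 200)) / np.sqrt(60)
x_true = np.zeros(200); x_true[rng.choice(200, 5, replace=False)] = rng.standard_normal(5)
x_hat = omp(A, A @ x_true, sparsity=5)
print(np.linalg.norm(x_hat - x_true))
```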
5.2. Active Set Methods. Consider the ℓ1-regularized minimization problem
where µ > 0 and f (x) : Rn → R is continuously differentiable. The optimality condition of (5.2) is

(5.3)    (∇f (x))^i  = −µ,        if x^i > 0,
                     = +µ,        if x^i < 0,
                     ∈ [−µ, µ],   otherwise.
x_{k+1} := arg min_x µ‖x‖1 + (x − xk)^T gk + (1/(2αk)) ‖x − xk‖2^2,

S(y, ν) = arg min_x ν‖x‖1 + (1/2) ‖x − y‖2^2 = sign(y) max{|y| − ν, 0}.

Note that the scheme (5.4) first executes a gradient step with a step size αk, then performs a shrinkage. In
practice, αk can be computed by a non-monotone line search in which the initial value is set to the BB step
size. The convergence of (5.4) has been studied in [53] under suitable conditions on αk and the Hessian ∇2 f.
An appealing feature proved in [53] is that (5.4) yields the support and the signs of the minimizer x∗ of (5.2)
after a finite number of steps under favorable conditions. For more references related to shrinkage, the reader
is referred to [133].
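A minimal proximal gradient (shrinkage) sketch for (5.2) with f (x) = (1/2)‖Ax − b‖2^2 is given below; it uses a
fixed step size 1/L instead of the BB/non-monotone strategy mentioned above, and the test data are illustrative
assumptions.

```python
# A minimal shrinkage (proximal gradient) sketch: a gradient step with step size
# alpha followed by the soft-thresholding operator S(y, nu).
import numpy as np

def soft_threshold(y, nu):
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)

def ista(A, b, mu, max_iter=500):
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T @ (A @ x - b)                  # gradient of the smooth part
        x = soft_threshold(x - alpha * g, alpha * mu)
    return x

rng = np.random.default_rng(8)
A = rng.standard_normal((100, 400)) / 10.0
x0 = np.zeros(400); x0[:5] = 1.0
x = ista(A, A @ x0, mu=0.01)
print(np.count_nonzero(np.abs(x) > 1e-6))      # size of the identified support
```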
We now describe the subspace optimization in the second stage. For a given vector x ∈ Rn, the active set is
denoted by A(x) and the inactive set (or support) is denoted by I(x) as follows:

(5.5)    A(x) := {i ∈ {1, · · · , n} | |x^i| = 0}  and  I(x) := {i ∈ {1, · · · , n} | |x^i| > 0}.

We require that each component x^i either has the same sign as x^i_k or is zero, i.e., x is required to be in the
set
Let x_{h,k} be a vector where the first subscript h denotes the discretization level of the multigrid and the
second subscript k denotes the iteration count. We next briefly describe a two-level subspace method in [24]
which, instead of simply finding a point x_{h,k+1} in the coarser grid space VH, seeks a point x_{h,k+1} in
S_{h,k} + VH satisfying some conditions, where S_{h,k} is a subspace including the descent information, such as
the coordinate directions of the current and previous iterations or the gradient Dh f (x_{h,k}) in the
finite-dimensional space Vh. Then, we solve
When x_{h,k} is not optimal on a coarse level H ∈ {1, 2, . . . , N}, we go to this level and compute a new
solution x_{h,k+1} by
and the coarse grid correction procedure is the operation on the coarse level H, namely, to find a direction
d_{h,k} to approximate the solution
The concept of the subspace correction methods can be used to solve subproblems (6.5) and (6.6). Let
{φ_h^{(j)}}_{j=1}^{n_h} be a basis for Vh, where n_h is the dimension of Vh. Denote Vh as a direct sum of the
one-dimensional subspaces Vh = V_h^{(1)} ⊕ · · · ⊕ V_h^{(n_h)}. Then for each j = 1, · · · , n_h in turn, we
perform the following correction step for subproblem (6.5) at the k-th iteration:

(6.7)    d_{h,k}^{(j)∗} = arg min_{d_{h,k}^{(j)} ∈ V_h^{(j)}} f_h (x_{h,k} + d_{h,k}^{(j)}),
         x_{h,k} = x_{h,k} + d_{h,k}^{(j)∗}.
For subproblem (6.6), a similar strategy can be adopted by decomposing the space VH into a direct sum. Global
convergence of this algorithm has been proved in [113] for strictly convex functions under some assumptions.
The subspace correction method can be viewed as a generalization of the coordinate search method or the
pattern search method.
6.3. Parallel Line Search Subspace Correction Method. In this subsection, we consider the following
optimization problem
where f (x) is a differentiable convex function and h(x) is a convex function that is possibly nonsmooth. The
ℓ1-regularized minimization (LASSO) [114] and the sparse logistic regression [100] are examples of (6.8). The
PSC methods have been studied for LASSO in [36, 39] and total variation minimization in [37, 38, 39, 68].
Suppose that Rn is split into p subspaces, namely,

(6.9)    Rn = X^1 + X^2 + · · · + X^p,

where
X^i = {x ∈ Rn | supp(x) ⊂ Ji},  1 ≤ i ≤ p,

such that J := {1, . . . , n} = ∪_{i=1}^{p} Ji. For any i ≠ j, 1 ≤ i, j ≤ p, Ji ∩ Jj = ∅ holds in a non-overlapping
domain decomposition of Rn. Otherwise, there exist i, j ∈ {1, . . . , p} with i ≠ j such that Ji ∩ Jj ≠ ∅ in an
overlapping domain decomposition of Rn.
Let ϕ^i_k be a surrogate function of ϕ restricted to the i-th subspace at the k-th iteration. The PSC framework
for solving (6.8) is:
The convergence can be proved if the step sizes α^i_k (1 ≤ i ≤ p) satisfy the conditions Σ_{i=1}^{p} α^i_k ≤ 1
and α^i_k > 0 (1 ≤ i ≤ p). Usually, the step size α^i_k is quite small under these conditions and convergence
becomes slow. For example, the diminishing step size α^i_k = 1/p tends to be smaller and smaller as the number
of subspaces increases.
A parallel subspace correction method (PSCL) with the Armijo backtracking line search for a large step size is
proposed in [29]. At the k-th iteration, it chooses surrogate functions ϕ^i_k and solves the subproblem (6.10)
for each block, then computes a summation of the directions dk = Σ_{i=1}^{p} d^i_k. The next iteration is

x_{k+1} = xk + αk dk,
where αk satisfies the Armijo backtracking conditions. When h(x) = 0 and f (x) is strongly convex, the
surrogate function can be set to the original objective function ϕ. Otherwise, it can be a first-order Taylor
expansion of the smooth part f (x) with a proximal term plus the nonsmooth part h(x):

(6.11)    ϕ^i_k (d^i) = ∇f (xk)^T d^i + (1/(2λi)) ‖d^i‖2^2 + h(xk + d^i),   for d^i ∈ X^i.

Both non-overlapping and overlapping schemes can be designed for PSCL.
The directions from different subproblems can be equipped with different step sizes. Let
Zk = [d^1_k, d^2_k, . . . , d^p_k]. The next iteration is set to

x_{k+1} = xk + Zk αk.
The global convergence of PSCL is established by following the convergence analysis of the subspace correction
methods for strongly convex problems [112], the active-set method for ℓ1 minimization [134] and the BCD
method for nonsmooth separable minimization [119]. Specifically, a linear convergence rate is proved for the
strongly convex case, and global convergence to the solution set of problem (6.8) is obtained for the general
nonsmooth case.
7. General Constrained Optimization. In this section, we first present subspace methods for solving general
equality constrained optimization problems:

(7.1)    min_{x∈Rn} f (x)   s. t.  c(x) = 0,

where c(x) = (c1 (x), · · · , cm (x))^T, f (x) and ci (x) are real functions defined on Rn and at least one of the
functions f (x) and ci (x) is nonlinear. Note that inequality constraints can also be added to (7.1), but they are
omitted to simplify our discussion in the first few subsections.
(7.2)    Qk (d) = gk^T d + (1/2) d^T Bk d,

where gk = ∇f (xk) and Bk is an approximation to the Hessian of the Lagrangian function. The search direction
dk of a line search type SQP method is obtained by solving the following QP subproblem

(7.3)    min_{d∈Rn} Qk (d)
(7.4)    s. t.  c(xk) + Ak^T d = 0,

where Ak = ∇c(xk). Although the SQP subproblem is simpler than (7.1), it is still difficult when the dimension
n is large.
In general, the subspace SQP method constructs the search direction dk by solving a QP in a subspace:

(7.5)    min Qk (d)   s. t.  ck + Ak^T d = 0,  d ∈ Sk,

with
Sk = span{gk, s_{k−m̄}, . . . , s_{k−1}, ȳ_{k−m̄}, . . . , ȳ_{k−1}, ∇c1 (xk), . . . , ∇cm (xk)},

where m̄ is the memory size of the limited memory BFGS method for constructing Bk in (7.2), and ȳi is a linear
combination of yi and Bi si. Let Uk be a matrix of linearly independent vectors in Sk. A reduced constrained
version of (7.5) is

(7.6)    min_z (Uk^T gk)^T z + (1/2) z^T Uk^T Bk Uk z
         s. t.  Tk^T (ck + Ak^T Uk z) = 0,

where Tk is a projection matrix so that the constraints are not over-determined.
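Because the reduced subproblem (7.6) is a small equality constrained QP, it can be solved directly through its
KKT system. The following minimal sketch assumes that the reduced quantities H = Uk^T Bk Uk, c = Uk^T gk,
B = Tk^T Ak^T Uk and d = −Tk^T ck have already been formed; the random test data are illustrative.

```python
# A minimal equality constrained QP solver via the KKT linear system, intended
# for small reduced subproblems such as (7.6).
import numpy as np

def eq_qp(H, c, B, d):
    """Solve min 0.5 z'Hz + c'z  s.t.  Bz = d via its KKT system."""
    nz, m = H.shape[0], B.shape[0]
    K = np.block([[H, B.T], [B, np.zeros((m, m))]])
    rhs = np.concatenate([-c, d])
    sol = np.linalg.solve(K, rhs)
    return sol[:nz], sol[nz:]                  # primal step z and multipliers

# usage with random small data (H positive definite, B full row rank)
rng = np.random.default_rng(9)
M = rng.standard_normal((6, 6)); H = M @ M.T + np.eye(6)
c = rng.standard_normal(6); B = rng.standard_normal((2, 6)); d = rng.standard_normal(2)
z, lam = eq_qp(H, c, B, d)
print(np.linalg.norm(B @ z - d), np.linalg.norm(H @ z + c + B.T @ lam))
```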
7.2. Second Order Correction Steps. The SQP step dk can be decomposed into two parts dk = hk + vk, where
vk ∈ Range(Ak) and hk ∈ Null(Ak^T). Thus, vk is a solution of the linearized constraint (7.4) in the range space
of Ak, while hk is the minimizer of the quadratic function Qk (vk + d) in the null space of Ak^T.
One good property of the SQP method is its superlinear convergence rate, namely, when xk is close to a
Karush–Kuhn–Tucker (KKT) point x∗ it holds
because the left hand side of (7.8) is a better approximation to c(xk + d) close to the point d = dk. Since the
modification of the constraints is a second order term, the new step can be viewed as the SQP step dk plus a
second order correction step dˆk. Consequently, the Maratos effect is overcome. For detailed discussions on the
SQP method and the second order correction step, we refer the reader to [111].
We now examine the second order correction step from a subspace point of view. It can be verified that the
second order correction step dˆk is a solution of

(7.9)    min_{d∈Rn} Qk (dk + d)   s. t.  c(xk + dk) + Ak^T d = 0.
Since dk is the SQP step, it follows that gk + Bk dk ∈ Range(Ak), which implies that the minimization problem
(7.9) is equivalent to

(7.10)    min_{h∈Null(Ak^T)} (1/2) (v̂k + h)^T Bk (v̂k + h).
Examining the SQP method from the subspace perspective enables us to get more insights. If Yk^T Bk Zk = 0, it
holds that ĥk = 0, which means that the second order correction step dˆk ∈ Range(Ak) is also a range space
step. Hence, the second order correction uses two range space steps and one null space step. Note that a range
space step is fast since it is a Newton step, while a null space step is normally slower because Bk is often a
quasi-Newton approximation to the Hessian of the Lagrangian function. Intuitively, it might be better to have
two slower steps with one fast step. Therefore, it might be reasonable to study a correction step
dˆk ∈ Null(Ak^T) in a modified SQP method.
7.3. The Celis-Dennis-Tapia (CDT) Subproblem. The CDT subproblem [23] is often needed in some trust region
algorithms for constrained optimization. It has two trust region ball constraints:

(7.11)    min_{d∈Rn} Qk (d) = gk^T d + (1/2) d^T Bk d
          s. t.  ‖ck + Ak^T d‖2 ≤ ξk,  ‖d‖2 ≤ ∆k,
where ḡk = Zk^T gk, B̄k = Zk^T Bk Zk and Āk = Zk^T Ak. Consequently, a subspace version of the Powell-Yuan
trust region algorithm [91] was developed in [50].
7.4. Simple Bound-constrained Problems. We now consider optimization problems with simple bound
constraints:

(7.12)    min_{x∈Rn} f (x)   s. t.  l ≤ x ≤ u,

where l and u are two given vectors in Rn. In this subsection, the superscript of a vector denotes its indices;
for example, x^i is the i-th component of x.
A subspace adaptation of the Coleman-Li trust region and interior method is proposed in [12]. The affine
scaling matrices Dk and Ck are defined from examining the KKT conditions of (7.12) as:
where J^v (x) is a diagonal matrix whose diagonal elements equal zero or ±1, and

v^i =  x^i − u^i,   if g^i < 0 and u^i < +∞,
       x^i − l^i,   if g^i ≥ 0 and l^i > −∞,
       −1,          if g^i < 0 and u^i = +∞,
       +1,          if g^i ≥ 0 and l^i = −∞.
(7.13)    min_s gk^T s + (1/2) s^T (Hk + Ck) s   s. t.  ‖Dk s‖2 ≤ ∆k,  s ∈ Sk.

Sk = span{Dk^{−2} gk, wk},

where wk is either ŝ^N_k = −M̂k^{−1} ĝk or its inexact version. Otherwise, Sk is set to
The major purpose of orthogonalization is to guarantee the full-rankness of the matrix Z since AU may lose
rank numerically. The so-called deflation can be executed after each RR projection to fix the numerically
converged eigenvectors since the convergence rates for different eigenpairs are not the same. Moreover, q extra
vectors, often called guard vectors, are added to U to accelerate convergence. Although the iteration cost is
increased at the initial stage, the overall performance may be better.
Due to fast memory access and highly parallelizable computation on modern computer architectures,
simultaneous matrix-block multiplications have advantages over individual matrix-vector multiplications.
Whenever there is a gap between the p-th and the (p + 1)-th eigenvalues of A, the SSI method is ensured to
converge to the largest p eigenpairs from any generic starting point. However, the convergence speed of the
SSI method depends critically on eigenvalue distributions. It can be intolerably slow if the eigenvalue
distributions are not favorable.
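A minimal sketch of the classic subspace iteration with a Rayleigh-Ritz projection is given below; guard vectors
and deflation are omitted, and the diagonal test matrix is an illustrative assumption.

```python
# A minimal subspace iteration (SSI) sketch with a Rayleigh-Ritz projection for
# the p largest eigenpairs of a symmetric matrix A.
import numpy as np

def ssi(A, p, iters=200):
    n = A.shape[0]
    rng = np.random.default_rng(0)
    U = np.linalg.qr(rng.standard_normal((n, p)))[0]   # generic orthonormal start
    for _ in range(iters):
        Z = np.linalg.qr(A @ U)[0]                     # orthonormalize A*U
        H = Z.T @ A @ Z                                # RR projection onto span(Z)
        w, V = np.linalg.eigh(H)
        U = Z @ V[:, ::-1]                             # Ritz vectors, largest first
    return w[::-1], U

A = np.diag(np.arange(1.0, 101.0))
vals, vecs = ssi(A, p=5)
print(vals)                                            # approximately the 5 largest eigenvalues
```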
8.2. Polynomial Filtering. The idea of polynomial filtering originates from the well-known fact that polynomials
are able to manipulate the eigenvalues of any symmetric matrix A while keeping its eigenvectors unchanged.
Due to the eigenvalue decomposition (8.1), it holds that

(8.3)    ρ(A) = Q ρ(Λ) Q^T = Σ_{i=1}^{n} ρ(λi) qi qi^T,

where ρ(Λ) = diag(ρ(λ1), ρ(λ2), . . . , ρ(λn)). Ideally, the eigenvalue distribution of ρ(A) is more favorable than
the original one.
The convergence of the desired eigenspace in SSI is determined by the gap of the eigenvalues, which can be
very slow if the gap is nearly zero. Polynomial filtering has been used to manipulate the gap in eigenvalue
computation in various ways [97, 109, 150, 34] in order to obtain faster convergence. One popular choice of
ρ(t) is the Chebyshev polynomial of the first kind, which can be written as
945 of the first kind, which can be written as
(
cos(d arccos t) |t| ≤ 1,
946 (8.4) Td (t) = 1
√ √
((t − t 2 − 1)d + (t + t 2 − 1)d ) |t| > 1,
2
947 where d is the degree of the polynomial. Since Chebyshev polynomials grow pretty fast
948 outside the interval [−1, 1], they can help to suppress all unwanted eigenvalues in this interval
949 efficiently. For these eigenvalues in a general interval [a, b], the polynomial can be chosen as
t − (b + a)/2
950 (8.5) ρ(t) = Td .
(b − a)/2
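A minimal sketch of applying the filter (8.4)-(8.5) to a block of vectors through the Chebyshev three-term
recurrence is given below; the interval [a, b], the degree and the test matrix are illustrative assumptions.

```python
# A minimal Chebyshev filtering sketch: eigenvalue components inside [a, b] are
# damped while those outside are magnified, via T_{j+1}(t) = 2t T_j(t) - T_{j-1}(t).
import numpy as np

def chebyshev_filter(A, X, degree, a, b):
    """Return rho(A) X with rho(t) = T_degree((t - (b+a)/2) / ((b-a)/2))."""
    center, half = (b + a) / 2.0, (b - a) / 2.0
    Y_prev = X
    Y = (A @ X - center * X) / half                   # degree-1 term
    for _ in range(2, degree + 1):
        Y_next = 2.0 * (A @ Y - center * Y) / half - Y_prev
        Y_prev, Y = Y, Y_next
    return Y

# usage: suppress the part of the spectrum in [0, 9] to enhance the 5 largest eigenvalues
rng = np.random.default_rng(10)
A = np.diag(np.concatenate([np.linspace(0.0, 9.0, 95), np.array([10, 11, 12, 13, 14.0])]))
X = rng.standard_normal((100, 5))
Y = chebyshev_filter(A, X, degree=10, a=0.0, b=9.0)
Q = np.linalg.qr(Y)[0]
print(np.sort(np.linalg.eigvalsh(Q.T @ A @ Q)))       # close to [10, 11, 12, 13, 14]
```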
where Λ = X^T AX ∈ Rp×p is the matrix of Lagrange multipliers. Once the matrix Λ is diagonalized, the matrix
pair (Λ, X) provides p eigenpairs of A. When maximization is replaced by minimization, (8.7) computes an
eigenspace associated with the p smallest eigenvalues. A few block algorithms have been designed based on
solving (8.7), including the locally optimal block preconditioned conjugate gradient method (LOBPCG) [65] and
the limited memory block Krylov subspace optimization method (LMSVD) [74]. At each iteration, these methods
in fact solve a subspace trace maximization problem of the form
Obviously, the closed-form solution of (8.8) can be obtained by using the RR procedure. The subspace S varies
from method to method. In LOBPCG, S is the span of the two most recent iterates Xi−1 and Xi, and the
residual AXi − Xi Λi at Xi, which is essentially equivalent to
The term AXi can be pre-multiplied by a preconditioning matrix if it is available. The LMSVD method constructs
the subspace S as a limited memory of the current i-th iterate and the previous t iterates, i.e.,
In general, the subspace S should be constructed such that the computational cost of solving the subproblem
(8.8) is relatively small.
8.4. Augmented Rayleigh-Ritz Method. We next introduce the augmented Rayleigh-Ritz (ARR) procedure. It is
easy to see that the RR map (Y, Σ) = RR(A, Z) is equivalent to solving the trace-maximization subproblem (8.8)
with the subspace S = R(Z), while requiring Y^T AY to be a diagonal matrix Σ. For a fixed number p, the larger
the subspace R(Z) is, the greater the chance is to extract better Ritz pairs. The augmentation of the subspaces
in LOBPCG and LMSVD is the main reason why they generally achieve faster convergence than the classic SSI.
The augmentation in ARR is based on a block Krylov subspace structure, i.e., for some integer t ≥ 0,
Then the optimal solution of the trace maximization problem (8.8), restricted to the subspace S in (8.11), is
computed via the RR procedure using (Ŷ, Σ̂) = RR(A, Kt), where Kt = [X, AX, A^2 X, . . . , A^t X]. Finally, the p
leading Ritz pairs (Y, Σ) are extracted from (Ŷ, Σ̂). This augmented RR procedure is simply referred to as ARR.
It looks identical to a
where the parameter α > 0 is a step size. The classic power iteration can be modified to run without
orthogonalization at each step. For X = [x1 x2 · · · xm] ∈ Rn×m, the power iteration is applied individually to
all columns of the iterate matrix X, i.e.,

xi = ρ(A) xi   and   xi = xi / ‖xi‖2,   i = 1, 2, · · · , m.

The next iterate X_{i+1} is generated from an SSI step on X̂i, i.e.,

(8.13)    X_{i+1} ∈ orth(AA^T X̂i).

We collect a limited memory of the last few iterates in (8.10) into a matrix
Note that X ∈ Si if and only if X = QV for some V ∈ Rq×p. The generalized eigenvalue problem (8.12) is
converted into an equivalent eigenvalue problem

(8.16)    max_{V∈Rq×p} ‖RV‖F^2,   s. t.  V^T V = I,

where

(8.17)    R = R^t_i := A^T Q^t_i.

The matrix product R in (8.17) can be computed from historical information without any additional computation
involving the matrix A. Since Q ∈ orth(X) and X has full rank, there exists a nonsingular matrix C ∈ Rq×q such
that X = QC. Therefore, Q = XC^{−1}, and R in (8.17) can be assembled as

(8.18)    R = A^T Q = (A^T X) C^{−1} = Y C^{−1},

where Y = A^T X is accessible from our limited memory. Once R is available, a solution V̂ to (8.16) can be
computed from the p leading eigenvectors of the q × q matrix R^T R. The matrix product can then be calculated
as

(8.19)    AA^T X̂i = A R V̂ = A Y C^{−1} V̂.
We now explain how to efficiently and stably compute Q and R when the matrix X is numerically rank deficient.
Since each block in X is itself orthonormal, keeping the latest block Xi intact and projecting the rest of the
blocks onto the null space of Xi^T yields

(8.20)    P_X = P_{X_i} := (I − Xi Xi^T) [X_{i−1} · · · X_{i−p}].

An orthonormalization of P_X is performed via the eigenvalue decomposition of its Gram matrix

(8.21)    P_X^T P_X = U_X Λ_X U_X^T,

where P_Y = P_{Y_i} := A^T P_X before the stabilization procedure, but some of the columns of P_Y may have
been deleted due to the stabilization steps. Therefore, the R matrix in (8.23) is well defined, as is the Q
matrix in (8.22), after the numerical rank deficiency is removed.
A ≈ QQ^T A.
A prototype randomized SVD in [54] is essentially one step of the power method using a random initial input.
We select an oversampling parameter l ≥ 2 and an exponent t (for example, t = 1 or t = 2), then perform the
following steps.
• Generate an n × (p + l) Gaussian matrix Ω.
• Compute Y = (AA^T)^t AΩ by multiplications with A and A^T alternately.
• Construct a matrix Q = orth(Y) by the QR factorization.
• Form the matrix B = Q^T A.
• Calculate an SVD of B to obtain B = Ũ Σ V^T, and set U = QŨ.
Consequently, we have the approximation A ≈ U Σ V^T. For the eigenvalue computation, we can simply run the
SSI (8.2) for only one step with a Gaussian matrix U. Assume that the computation is performed in exact
arithmetic. It is shown in [54] that

E ‖A − QQ^T A‖2 ≤ (1 + 4√(p + l)/(l − 1)) σ_{p+1},

where the expectation is taken with respect to the random matrix Ω and σ_{p+1} is the (p + 1)-th largest
singular value of A.
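A minimal implementation sketch of the prototype randomized SVD described above is given below; the
oversampling, the power exponent and the test matrix are illustrative assumptions.

```python
# A minimal randomized SVD sketch following the prototype algorithm above.
import numpy as np

def randomized_svd(A, p, l=5, t=1, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, p + l))       # Gaussian test matrix
    Y = A @ Omega
    for _ in range(t):                            # power iterations: Y = (AA')^t A Omega
        Y = A @ (A.T @ Y)
    Q = np.linalg.qr(Y)[0]                        # orthonormal range approximation
    B = Q.T @ A                                   # small (p + l) x n matrix
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_tilde)[:, :p], S[:p], Vt[:p]

rng = np.random.default_rng(11)
A = rng.standard_normal((500, 80)) @ rng.standard_normal((80, 300))   # rank <= 80
U, S, Vt = randomized_svd(A, p=10, l=5, t=1)
S_exact = np.linalg.svd(A, compute_uv=False)[:10]
print(np.max(np.abs(S - S_exact) / S_exact))      # small relative errors
```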
Suppose that a low rank approximation of A with a target rank r is needed. A sketching method is further
developed in [118] for selected p and ℓ. Again, we draw independent Gaussian matrices Ω ∈ Rn×p and
Ψ ∈ Rℓ×m, and compute the matrix-matrix multiplications:
8.7. Truncated Subspace Method for Tensor Train. In this subsection, we consider the trace maximization
problem (8.7) whose dimension reaches the magnitude of O(10^42). Due to the scale of data storage, a tensor
train (TT) format is used to express the data matrices and eigenvectors in [148]. The corresponding eigenvalue
problem can be solved by a subspace algorithm and the alternating direction method with suitable truncations.
The goal is to express a vector x ∈ Rn as a tensor x ∈ R^{n1×n2×···×nd} for some positive integers n1, . . . , nd
such that n = n1 n2 . . . nd, using a collection of three-dimensional tensor cores Xµ ∈ R^{rµ−1×rµ×nµ} with fixed
dimensions rµ, µ = 1, . . . , d and r0 = rd = 1. A tensor x is stored in the TT format if its elements can be
written as
where Xµ(iµ) ∈ R^{rµ−1×rµ} is the iµ-th slice of Xµ for iµ = 1, 2, . . . , nµ. The values rµ are often equal to a
constant r, which is then called the TT-rank. Consequently, storing a vector x ∈ R^{n1^d} only needs O(d n1 r^2)
entries if the corresponding tensor x has a TT format. The representation of x is shown as graphs in Figure
8.1.
Fig. 8.1 The first row is the TT format of x with cores Xµ, µ = 1, 2, . . . , d. The second row is a representation
of its elements x_{i1 i2 ... id}.
There are several ways to express a matrix X ∈ Rn×p with p ≪ n in the TT format. A direct way is to store
each column of X as tensors x1, x2, . . . , xp in the TT format separately. Another, more economical, choice is
that these p tensors share all except one core. Let the shared cores be Xi, i ≠ µ, and the µ-th core of xi be
Xµ,i, for i = 1, 2, . . . , p. Then the i1 i2 · · · id component of xj is
The above scheme generates a block-µ TT (µ-BTT) format, which is depicted in Figure 8.2. Similarly, a matrix
A ∈ Rn×n is in an operator TT format A if the components of A can be assembled as

(8.25)    A_{i1 i2 ···id, j1 j2 ···jd} = A1 (i1, j1) A2 (i2, j2) · · · Ad (id, jd),
[Fig. 8.2: the block-µ TT (µ-BTT) format, with shared cores X1, . . . , Xµ−1, Xµ+1, . . . , Xd and the block cores
Xµ,1, . . . , Xµ,p at position µ.]
Assume that the matrix A itself can be written in the operator TT format. Let X ∈ Rn×p with n = n1 n2 . . . nd
whose BTT format is X, and let T_{n,r,p} be the set of BTT formats whose TT-ranks are no more than r. Then the
eigenvalue problem in the block BTT format is

(8.27)    S^T_k = span{P_T (A X_k), X_k, X_{k−1}},

(8.28)    S^T_k = span{X_k, P_T (R_k), P_T (P_k)},

where the conjugate gradient direction is P_k = X_k − X_{k−1} and the residual is R_k = A X_k − X_k Λ_k.
Consequently, the subspace problem in the BTT format is
Note that Y_{k+1} ∉ T_{n,r,p} because the rank of Y_{k+1} is larger than r due to several additions between the
BTT formats. Since Y_{k+1} is a linear combination of the BTT formats in S^T_k, problem (8.29) can still be
solved easily, and only the coefficients of the linear combinations are stored.
We next project Y_{k+1} to the required space T_{n,r,p} as
This problem can be solved by using the alternating minimization scheme. By fixing all except the µ-th core, we
obtain
where
X_{≠µ} := (X_{≥µ+1} ⊗ I_{nµ} ⊗ X_{≤µ−1}),
and
X_{≤µ} = [X1 (i1) X2 (i2) · · · Xµ (iµ)] ∈ R^{n1 n2 ···nµ × rµ},
X_{≥µ} = [Xµ (iµ) Xµ+1 (iµ+1) · · · Xd (id)]^T ∈ R^{nµ nµ+1 ···nd × rµ−1},
whose optimal solution can be computed by the p-dominant SVD of X_{≠µ}^T vec(Y_{k+1}).
where f (X) : Cn×p → R is an R-differentiable function [67]. The set St(n, p) := {X ∈ Cn×p : X∗X = Ip} is called
the Stiefel manifold. Obviously, the eigenvalue problem in section 8 is a special case of (9.1). Other important
applications include density functional theory [131], Bose-Einstein condensates [137], low rank nearest
correlation matrix completion [121], etc. Although (9.1) can be treated from the perspective of general nonlinear
programming, the intrinsic structure of the Stiefel manifold enables us to develop more efficient algorithms. In
fact, it can be solved by the Riemannian gradient descent, Riemannian conjugate gradient and proximal
Riemannian gradient methods [40, 104, 2, 59]. The Riemannian Newton, trust-region and adaptively regularized
Newton methods [120, 1, 2, 59] can be used when the Hessian information is available. Otherwise, quasi-Newton
type methods are good alternatives [62, 61, 58].
The tangent space is T_X := {ξ ∈ Cn×p : X∗ξ + ξ∗X = 0}. The operator Proj_X (Z) := Z − X sym(X∗Z) is the
projection of Z onto the tangent space T_X, where sym(A) := (A + A∗)/2. The symbols ∇f (X) (∇2 f (X)) and
grad f (X) (Hess f (X)) denote the Euclidean and Riemannian gradient (Hessian) of f at X. Using the real part of
the Frobenius inner product ℜ⟨A, B⟩ as the Euclidean metric, the Riemannian Hessian Hess f (X) [31, 3] can be
written as
where ξ is any tangent vector in T_X. A retraction R is a smooth mapping from the tangent bundle to the
manifold. Moreover, the restriction R_X of R to T_X has to satisfy R_X (0_X) = X and DR_X (0_X) = id_{T_X},
where id_{T_X} is the identity mapping on T_X.
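A minimal sketch of the basic tools used below, the tangent space projection Proj_X and a QR-based retraction
on St(n, p), is given here, together with a plain Riemannian gradient iteration for the trace model as a usage
example; the step size rule and the test matrix are illustrative assumptions.

```python
# Minimal Stiefel-manifold tools: tangent projection Proj_X and a QR retraction,
# illustrated on min tr(X'AX) over St(n, p) with Riemannian gradient steps.
import numpy as np

def sym(M):
    return 0.5 * (M + M.conj().T)

def proj_tangent(X, Z):
    """Proj_X(Z) = Z - X sym(X* Z)."""
    return Z - X @ sym(X.conj().T @ Z)

def retract_qr(X, xi):
    """A retraction: map X + xi back to the Stiefel manifold via thin QR."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))               # fix the sign ambiguity of QR

# usage: Riemannian gradient steps for min tr(X'AX) on St(n, p)
rng = np.random.default_rng(12)
n, p = 50, 3
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
X = np.linalg.qr(rng.standard_normal((n, p)))[0]
step = 0.1 / np.linalg.norm(A, 2)
for _ in range(2000):
    rgrad = proj_tangent(X, 2.0 * A @ X)         # Euclidean gradient of tr(X'AX) is 2AX
    X = retract_qr(X, -step * rgrad)
print(np.sort(np.linalg.eigvalsh(X.T @ A @ X)))  # approaches the p smallest eigenvalues
```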
9.1. Regularized Newton Type Approaches. We now describe an adaptively regularized Riemannian Newton type
method with a subspace refinement procedure [59, 58]. Note that the Riemannian Hessian-vector multiplication
(9.2) involves the Euclidean Hessian and gradient with simple structures. We construct a second-order Taylor
approximation in the Euclidean space rather than the Riemannian space at the k-th iteration:

(9.3)    mk (X) := ℜ⟨∇f (Xk), X − Xk⟩ + (1/2) ℜ⟨Bk [X − Xk], X − Xk⟩ + (τk/2) ‖X − Xk‖F^2,

where Bk is either ∇2 f (Xk) or an approximation of it, depending on whether ∇2 f (Xk) is affordable, and τk is a
regularization parameter that controls the distance between X and Xk. Then the subproblem is
After obtaining an approximate solution Zk of (9.4), we calculate the ratio between the predicted reduction and
the actual reduction, then use the ratio to decide whether X_{k+1} is set to Zk or Xk and to adjust the
parameter τk, similarly to the trust region methods.
In particular, the model (9.4) can be minimized by using a modified CG method to solve a single Riemannian
Newton system:
Once the direction ξk is computed, a trial point Zk is searched along ξk followed by a retraction, i.e.,

(9.8)    Zk = R_{Xk}(αk ξk).

The step size αk = α0 δ^h is chosen by the Armijo rule such that h is the smallest integer satisfying

(9.9)    mk (R_{Xk}(α0 δ^h ξk)) ≤ ρ α0 δ^h ⟨grad mk (Xk), ξk⟩,

where ρ, δ ∈ (0, 1) and α0 ∈ (0, 1] are given constants.
The performance of the Newton-type method may deteriorate seriously when the Hessian is close to singular.
One reason is that the Riemannian Newton direction is nearly parallel to the negative gradient direction.
Consequently, the next iterate X_{k+1} very likely belongs to the subspace span{Xk, grad f (Xk)}, which is
similar to the Riemannian gradient approach. To overcome this numerical difficulty, we can further solve (9.1)
in a restricted subspace. Specifically, a q-dimensional subspace Sk is constructed with an orthogonal basis
Qk ∈ Cn×q (p ≤ q ≤ n). Then the representation of any point X in the subspace Sk is

X = Qk M

for some M ∈ Cq×p. In a similar fashion to the constructions for the linear eigenvalue problems in section 8,
the subspace Sk can be built by using the historical information {Xk, X_{k−1}, . . .},
{grad f (Xk), grad f (X_{k−1}), . . .} and other useful information. Once a subspace Sk is given, (9.1) with the
additional constraint X ∈ Sk becomes

(9.10)    min_{M∈Cq×p} f (Qk M)   s. t.  M∗M = Ip.

Suppose that Mk is an inexact solution of problem (9.10) obtained from existing optimization methods on
manifolds. Then X_{k+1} = Qk Mk is a better point than Xk. For extremely difficult problems, one may alternate
between the Newton type method and the subspace refinement procedure for a few cycles.
9.2. A Structured Quasi-Newton Update with Nyström Approximation. The secant condition in the classical
quasi-Newton methods for constructing the quasi-Newton matrix Bk

(9.11)    Bk [Sk] = ∇f (Xk) − ∇f (X_{k−1}),
where L is a discretized Laplacian operator, the charge density is ρ(X) = diag(XX∗), Vion is the constant ionic
pseudopotential, wl represents a discretized pseudopotential reference projection function, ζl is a constant
whose value is ±1, and εxc is related to the exchange correlation energy. The Fock exchange operator
V(·) : Cn×n → Cn×n is usually a fourth-order tensor [69] which satisfies the following properties: (i)
⟨V(D1), D2⟩ = ⟨V(D2), D1⟩ for any D1, D2 ∈ Cn×n; (ii) V(D) is Hermitian if D is Hermitian. Then the Fock
exchange energy is

(9.19)    Ef (X) := (1/4) ⟨V(XX∗) X, X⟩ = (1/4) ⟨V(XX∗), XX∗⟩.
Therefore, the total energy minimization problem can be formulated as

(9.21)    Hks (X) := (1/2) L + Vion + Σ_l ζl wl wl∗ + Diag((ℜL†) ρ) + Diag(µxc (ρ)∗ en),
A detailed calculation shows that the Euclidean gradient of Eks (X) is
The gradient of Ef (X) is ∇Ef (X) = V(XX∗) X. Assuming that εxc (ρ(X)) is twice differentiable with respect to
ρ(X), the Hessian of Eks (X) is
Other types of mixing include Broyden mixing, Kerker mixing, Anderson mixing, etc. Charge mixing is widely used for improving the convergence of SCF even though its convergence properties are still unclear in some cases.
1332 In HF, the SCF method at the k-th iteration solves:
1333 H̃k X = XΛ, X ∗ X = Ip ,
1334 where H̃k is formed from certain mixing schemes. Note that the Hamiltonian (9.22) can be
1335 written as Hhf (D) with respect to the density matrix D = XX ∗ . In the commutator DIIS
1336 (C-DIIS) method [92, 93], the residual Wj is defined as the commutator between Hhf (Dj )
1337 and Dj , i.e.,
1338 (9.28) Wj = Hhf (Dj )Dj − Dj Hhf (Dj ).
We next solve the following minimization problem to obtain a coefficient vector c:
min_c  || Σ_{j=0}^{m−1} c_j W_j ||_F^2,   s. t.   c^⊤ e_m = 1.
Then, a new Hamiltonian matrix is obtained as H̃_k = Σ_{j=0}^{m−1} c_j H_{k−j}. Since an explicit storage
1342 of the density matrix can be prohibitive, the projected C-DIIS in [60] uses projections of the
1343 density and commutator matrices so that the sizes are much smaller.
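The constrained least squares above reduces to a small bordered linear system in the Gram matrix of the stored residuals. The sketch below is one standard way to compute the coefficients and to mix the stored Hamiltonians; the function names and the use of a least-squares solve (to guard against a nearly singular Gram matrix) are choices of this sketch.

```python
import numpy as np

def diis_coefficients(W_list):
    """Solve min_c || sum_j c_j W_j ||_F^2  s.t.  sum_j c_j = 1 via the bordered
    (KKT) system built from the Gram matrix of the residuals W_j."""
    m = len(W_list)
    G = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = np.real(np.vdot(W_list[i], W_list[j]))   # Gram entry <W_i, W_j>
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = G
    A[:m, m] = 1.0
    A[m, :m] = 1.0                                              # bordered system [G 1; 1^T 0]
    rhs = np.zeros(m + 1)
    rhs[m] = 1.0
    sol = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return sol[:m]

def mix_hamiltonians(H_list, c):
    """Form the mixed Hamiltonian H_tilde = sum_j c_j H_j from stored Hamiltonians."""
    return sum(cj * Hj for cj, Hj in zip(c, H_list))
```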
is able to reduce the computational cost significantly. Note that the adaptive compression method in [73] compresses the operator V(X_k X_k^*) on the subspace span{X_k}. Consequently, we can keep the easier part E_ks but approximate E_f(X) by using (9.29). Hence, a new subproblem is formulated as
(9.30)    min_{X∈C^{n×p}}  E_ks(X) + (1/4)⟨V̂(X_k X_k^*)X, X⟩,   s. t.   X^*X = I_p.
1356 The subproblem (9.30) can be solved by the SCF iteration, the Riemannian gradient method
1357 or the modified CG method based on the following linear equation
Proj_{X_k}( ∇^2 E_ks(X_k)[ξ] + (1/2) V̂(X_k X_k^*)ξ ) − ξ sym(X_k^* ∇f(X_k)) = −grad E_hf(X_k).
9.3.4. A Regularized Newton Type Method. Computing the p smallest eigenpairs of H_ks(ρ̃) is equivalent to a trace minimization problem
(9.31)    min_{X∈C^{n×p}}  q(X) := (1/2) tr(X^* H_ks(ρ̃)X),   s. t.   X^*X = I_p.
1362 Note that q(X) is a second-order approximation to the total energy Eks (X) without consid-
1363 ering the second term in the Hessian (9.24). Hence, the SCF method may not converge if this
second term dominates. The regularized Newton method in (9.1) can be applied to solve both KSDFT
1365 and HF with convergence guarantees. We next explain a particular version in [138] whose
1366 subproblem is
(9.32)    min_{X∈C^{n×p}}  (1/2) tr(X^* H_ks(ρ̃)X) + (τ_k/4) ||XX^⊤ − X_k X_k^⊤||_F^2,   s. t.   X^*X = I_p.
Note that
||XX^⊤ − X_k X_k^⊤||_F^2 = tr((XX^⊤ − X_k X_k^⊤)(XX^⊤ − X_k X_k^⊤)) = 2p − 2 tr(X^⊤ X_k X_k^⊤ X).
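Restricting to the real case for simplicity, the identity above is easy to verify numerically; the snippet below is only a sanity check with random matrices having orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X, _ = np.linalg.qr(rng.standard_normal((n, p)))    # X^T X = I_p
Xk, _ = np.linalg.qr(rng.standard_normal((n, p)))   # X_k^T X_k = I_p

lhs = np.linalg.norm(X @ X.T - Xk @ Xk.T, 'fro') ** 2
rhs = 2 * p - 2 * np.trace(X.T @ Xk @ Xk.T @ X)
assert np.isclose(lhs, rhs)   # the regularizer depends on X only through tr(X^T X_k X_k^T X)
```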
1374 Y = [Xk , Pk , Rk ],
1375 where Pk = Xk − Xk−1 and Rk = Hks (Xk )Xk − Xk Λk . Then the variable X can be
1376 expressed as X = Y G where G ∈ C3p×p . The total energy minimization problem (9.20)
1377 becomes:
min_G  E_ks(Y G),   s. t.   G^* Y^* Y G = I_p,
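A schematic version of this construction is sketched below. It assumes real arithmetic, H_apply is a placeholder that applies the current Hamiltonian H_ks(X_k) to a block of vectors, and, instead of minimizing E_ks(Y G) exactly, a Rayleigh-Ritz step on the projected Hamiltonian is used as one natural inexact inner solver.

```python
import numpy as np

def three_block_subspace_step(H_apply, Xk, Xk_prev, Lambda_k, p):
    """Update X within span(Y) with Y = [X_k, P_k, R_k]; the reduced problem is
    approximated here by a Rayleigh-Ritz step on the projected Hamiltonian."""
    Pk = Xk - Xk_prev
    Rk = H_apply(Xk) - Xk @ Lambda_k                 # residual H_ks(X_k) X_k - X_k Lambda_k
    Q, _ = np.linalg.qr(np.hstack([Xk, Pk, Rk]))     # orthonormal basis of span(Y)
    Hq = Q.T @ H_apply(Q)                            # projected Hamiltonian
    vals, W = np.linalg.eigh(0.5 * (Hq + Hq.T))      # symmetrize against round-off
    G = W[:, :p]                                     # p smallest Ritz pairs
    return Q @ G, np.diag(vals[:p])                  # new orthonormal X and Ritz values
```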
The subspace refinement method may help when the regularized Newton method does not perform well. Note that the total energy minimization problem (9.20) is not necessarily equivalent to the nonlinear eigenvalue problem (9.26) of finding the p smallest eigenvalues of H(X). Although an intermediate iterate X is orthogonal and contains eigenvectors of H(X), these eigenvectors are not necessarily the ones corresponding to the p smallest eigenvalues. Hence, we can form a subspace which contains these possible target eigenvectors. In particular, we first compute the γp smallest eigenvalues for some small integer γ. Their corresponding eigenvectors of H(X_k), denoted by Γ_k, are put in a subspace as
Numerical experience shows that the refinement scheme in subsection 9.1 with this subspace is likely to escape a stagnation point.
10. Semidefinite Programming (SDP). In this section, we present two specialized subspace methods for solving the maxcut SDP and the maxcut SDP with nonnegative constraints arising from community detection.
10.1. The Maxcut SDP. The maxcut problem partitions the vertices of a graph into two sets so that the sum of the weights of the edges connecting vertices in one set with those in the other set is maximized. The corresponding SDP relaxation [46, 16, 56, 8] is
(10.1)    min_{X∈S^n}  ⟨C, X⟩,   s. t.   X_ii = 1, i = 1, . . . , n,   X ⪰ 0.
We first describe a second-order cone program (SOCP) restriction for the SDP problem (10.1) by fixing all except one row and column of the matrix X. For any integer i ∈ {1, . . . , n}, the complement of the set {i} is i^c = {1, . . . , n}\{i}. Let B = X^{i^c,i^c} be the submatrix of X after deleting its i-th row and column, and y = X^{i^c,i} be the i-th column of the matrix X without the element X_{i,i}. Since X_ii = 1, the variable X of (10.1) can be written as
X := [ 1, y^⊤ ; y, B ] := [ 1, y^⊤ ; y, X^{i^c,i^c} ]
without loss of generality. Suppose that the submatrix B is fixed. It then follows from the Schur complement theorem that X ⪰ 0 is equivalent to
1 − y^⊤ B^{−1} y ≥ 0.
Fixing B, minimizing ⟨C, X⟩ over the i-th row and column reduces to
(10.2)    min_{y∈R^{n−1}}  ĉ^⊤ y,   s. t.   1 − y^⊤ B^† y ≥ ν,   y ∈ Range(B),
where ĉ := 2C^{i^c,i}. If γ := ĉ^⊤ B ĉ > 0, an explicit solution of (10.2) is given by
(10.3)    y = −√((1 − ν)/γ) · B ĉ.
If γ := ĉ^⊤ B ĉ > 0, the solution of problem (10.5) is
(10.6)    y = −((√(σ^2 + γ) − σ)/γ) · B ĉ.
Consequently, the subproblem (10.2) has the same solution as (10.5) if ν = 2σ(√(σ^2 + γ) − σ)/γ.
10.1.1. Examples: Phase Retrieval. Given a matrix A ∈ C^{m×n} and a vector b ∈ R^m, the phase retrieval problem can be formulated as a feasibility problem, which in turn can be written as
(10.7)    min_{x∈C^n, y∈C^m}  (1/2)||Ax − y||_2^2,   s. t.   |y| = b,
By fixing the variable u, it becomes a least squares problem with respect to x, whose explicit solution is x = A^† diag(b)u. Substituting x back into (10.7) yields a general maxcut problem:
min_{u∈C^m}  u^* M u,   s. t.   |u_i| = 1, i = 1, . . . , m,
where M = diag(b)(I − AA^†) diag(b) is positive semidefinite. Hence, the corresponding SDP relaxation is
min_{U∈S^m}  tr(U M),   s. t.   U_ii = 1, i = 1, . . . , m,   U ⪰ 0.
1444 The above problem can be further solved by the RBR method.
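Putting the pieces of this subsection together, the sketch below performs row-by-row sweeps for the maxcut SDP (10.1) using the closed-form solution (10.3) of the subproblem (10.2); the small constant nu, the sweep limit and the stopping rule are illustrative choices of this sketch.

```python
import numpy as np

def rbr_maxcut(C, nu=1e-3, max_sweeps=50, tol=1e-6):
    """Row-by-row (RBR) method for (10.1): cyclically update one row/column of X
    using the closed-form solution (10.3)."""
    n = C.shape[0]
    X = np.eye(n)                                   # feasible starting point
    for _ in range(max_sweeps):
        change = 0.0
        for i in range(n):
            idx = np.delete(np.arange(n), i)        # the index set i^c
            B = X[np.ix_(idx, idx)]                 # submatrix X^{i^c, i^c}
            c_hat = 2.0 * C[idx, i]                 # c_hat = 2 C^{i^c, i}
            Bc = B @ c_hat
            gamma = c_hat @ Bc                      # gamma = c_hat^T B c_hat
            y = -np.sqrt((1.0 - nu) / gamma) * Bc if gamma > 0 else np.zeros(n - 1)
            change = max(change, np.max(np.abs(X[idx, i] - y)))
            X[idx, i] = y
            X[i, idx] = y
        if change < tol:
            break
    return X
```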
1445 10.2. Community Detection. Suppose that the nodes [n] = {1, . . . , n} of a network
1446 can be partitioned into r ≥ 2 disjoint sets {K1 , . . . , Kr }. A binary matrix X is called a
partition matrix if X_ij = 1 for i, j ∈ K_t, t ∈ {1, . . . , r}, and X_ij = 0 otherwise. Let A be the adjacency matrix and d be the degree vector, where d_i = Σ_j A_ij, i ∈ [n]. Define the matrix
where λ = 1/||d||_1. A popular method for the community detection problem is to maximize the modularity [86] as:
(10.10)    min_{X∈R^{n×n}}  ⟨C, X⟩,   s. t.   X_ii = 1, i = 1, . . . , n,   0 ≤ X_ij ≤ 1, ∀ i, j,   X ⪰ 0.
The RBR method in subsection 10.1 cannot be applied to (10.10) directly due to the componentwise constraints 0 ≤ X_ij ≤ 1.
Note that the true partition matrix X^* can be decomposed as X^* = Φ^*(Φ^*)^⊤, where Φ^* ∈ {0, 1}^{n×r} is the true assignment matrix. This decomposition is unique up to a permutation of the columns of Φ^*. The structure of Φ^* leads to a new relaxation of the original partition matrix X [146]. Define a matrix
U = [u_1, . . . , u_n]^⊤ ∈ R^{n×r}.
We can consider a decomposition X = UU^⊤. The constraints X_ii = 1 and Φ^* ∈ {0, 1}^{n×r} imply that
||u_i||_2 = 1,   U ≥ 0,   ||u_i||_0 ≤ p,
The relaxed problem is then
min_{U∈R^{n×r}}  ⟨C, UU^⊤⟩  subject to the constraints above.
For the i-th subproblem, we fix all except the i-th row of U and formulate the subproblem as
u_i = arg min_{x∈U}  f(u_1, . . . , u_{i−1}, x, u_{i+1}, . . . , u_n) + (σ/2) ||x − ū_i||^2,
where the last term in the objective function is a proximal term and σ > 0 is a parameter. Note that the quadratic term ||x||^2 is eliminated due to the constraint ||x||_2 = 1. Therefore, the subproblem becomes
where b = 2C^{i,i^c} U_{−i} − σ ū_i, C^{i,i^c} is the i-th row of C without the i-th component, and U_{−i} is the matrix U without the i-th row. Define b_+ = max{b, 0} and b_− = max{−b, 0}, where the max is taken componentwise. Then the closed-form solution of (10.13) is given by
(10.14)    u = b_−^p / ||b_−^p||   if b_− ≠ 0,   and   u = e_{j_0} with j_0 = arg min_j b_j   otherwise,
where b_−^p is obtained by keeping the p largest components of b_− and setting the others to zero, and b_−^p = b_− when ||b_−||_0 ≤ p. Then the RBR method goes over all rows of U by using (10.14).
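For concreteness, the sketch below implements the row update (10.14) and a sequential sweep using (10.15); the random nonnegative initialization and the parameter values are assumptions of this sketch, while lam = 1/||d||_1 follows the definition above.

```python
import numpy as np

def rbr_row_update(b, p):
    """Closed-form solution (10.14): a unit-norm, nonnegative, p-sparse row."""
    b_minus = np.maximum(-b, 0.0)
    if np.count_nonzero(b_minus) > 0:
        bp = np.zeros_like(b_minus)
        keep = np.argsort(b_minus)[-p:]          # keep the p largest entries of b_-
        bp[keep] = b_minus[keep]
        return bp / np.linalg.norm(bp)
    u = np.zeros_like(b)
    u[np.argmin(b)] = 1.0                        # e_{j0} with j0 = argmin_j b_j
    return u

def rbr_community(A, d, r, p, sigma=0.1, max_sweeps=30, seed=0):
    """Sequential RBR sweeps for the nonnegative low-rank relaxation of (10.10)."""
    n = A.shape[0]
    lam = 1.0 / np.sum(d)
    U = np.abs(np.random.default_rng(seed).standard_normal((n, r)))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    for _ in range(max_sweeps):
        for i in range(n):
            mask = np.arange(n) != i
            # b^T = -2 A^{i,i^c} U_{-i} + 2 lam d_i (d^{i^c})^T U_{-i} - sigma * u_i, cf. (10.15)
            b = (-2.0 * (A[i, mask] @ U[mask])
                 + 2.0 * lam * d[i] * (d[mask] @ U[mask]) - sigma * U[i])
            U[i] = rbr_row_update(b, p)
    return U
```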
We next briefly describe the parallelization of the RBR method on a shared-memory computer with many threads. The variable U is stored in the shared memory so that it can be accessed by all threads. Even when some row u_i is being updated by a thread, the other threads can still access U whenever necessary. In the sequential RBR method, the main cost of updating one row u_i is the computation of b = 2C^{i,i^c} U_{−i} − σ ū_i, where ū_i and U are the current iterates. The definition of C in (10.8) gives
(10.15)    b^⊤ = −2A^{i,i^c} U_{−i} + 2λ d_i (d^{i^c})^⊤ U_{−i} − σ ū_i,
where A^{i,i^c} is the i-th row of A without the i-th component. The parallel RBR method is outlined in Figure 10.1, where many threads work simultaneously. The vector d^⊤U and the matrix U are stored in the shared memory, and all threads can access and update them.
Every thread picks up its own row u_i at a time and then reads U and the vector d^⊤U. Then, a private copy of b^⊤ is computed. Thereafter, the variable u_i is updated and d^⊤U is set to d^⊤U ← d^⊤U + d_i(u_i − ū_i) in the shared memory. The thread immediately proceeds to another row without waiting for other threads to finish their tasks. Therefore, when a thread is updating its variables, the other blocks of variables u_j, j ≠ i, are not necessarily the most recent versions. Moreover, if this thread is reading some row u_j or the vector d^⊤U from memory while another thread is modifying them, the data used to update u_i may be only partially updated. Since the memory locking is removed, the parallel RBR method may be able to provide near-linear speedups. See also HOGWILD! [94] and CYCLADES [89] for asynchronous methods.
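The sketch below mimics this lock-free update pattern with a thread pool. It reuses the rbr_row_update helper from the previous sketch and only illustrates the data flow: in Python the global interpreter lock prevents genuine parallel speedups for this kind of NumPy code, so a compiled implementation would be needed to observe the near-linear scaling discussed above.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_rbr_sweep(A, d, U, dU, lam, sigma, p, n_threads=8):
    """One HOGWILD!-style sweep: the shared arrays U and dU = d^T U are read and
    written by all threads without any locking."""
    n = U.shape[0]

    def update_row(i):
        u_old = U[i].copy()
        # Private copy of b^T from possibly stale shared data, cf. (10.15); the i-th
        # contributions are removed explicitly instead of slicing out row i.
        b = (-2.0 * (A[i] @ U) + 2.0 * A[i, i] * U[i]
             + 2.0 * lam * d[i] * (dU - d[i] * U[i]) - sigma * u_old)
        U[i] = rbr_row_update(b, p)               # closed-form solution (10.14)
        dU[:] = dU + d[i] * (U[i] - u_old)        # d^T U <- d^T U + d_i (u_i - u_old)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(update_row, range(n)))
```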
11. Low Rank Matrix Optimization. Optimization problems whose variables are related to low-rank matrices arise in many applications, for example, semidefinite programming (SDP), matrix completion, robust principal component analysis, control and systems theory, model reduction [76], phase retrieval, blind deconvolution, data mining, pattern recognition [33], latent semantic indexing, collaborative prediction and low-dimensional embedding.
1509 11.1. Low Rank Structure of First-order Methods. A common feature of many
first-order methods for low-rank matrix optimization problems is that the next iterate x_{k+1} is defined by the current iterate x_k and a partial eigenvalue decomposition of a certain matrix.
1512 They can be unified as the following fixed-point iteration scheme [71]:
The operator Ψ has the low-rank property at X if there exists an orthogonal matrix V_I ∈ R^{n×p} (p ≪ n) whose columns span a p-dimensional eigenspace corresponding to λ_i(X), i ∈ I, such that Ψ(X) = Φ(P_{V_I}(X)), where Φ is either the same as Ψ or a different spectral operator induced by φ, and I is an index set depending on X. The low-rank property ensures that the full eigenvalue decomposition is not needed.
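For instance, when the inducing function φ vanishes on the part of the spectrum that is not computed (as in soft-thresholding only the leading eigenvalues), Ψ(X) can be assembled from a few extreme eigenpairs. The sketch below uses SciPy's Lanczos-based eigsh for this purpose; the function name, the number p of computed eigenpairs and the example choice of φ are assumptions of this sketch.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_operator_lowrank(B, phi, p, which='LA'):
    """Evaluate Psi(B) = V diag(phi(lambda)) V^T from p extreme eigenpairs only,
    assuming phi vanishes on the remaining (uncomputed) part of the spectrum."""
    vals, vecs = eigsh(B, k=p, which=which)       # p algebraically largest eigenpairs
    return (vecs * phi(vals)) @ vecs.T

# Example: phi(l) = max(l - mu, 0) acts only on eigenvalues above the threshold mu,
# so a few leading eigenpairs suffice when the rest of the spectrum lies below mu.
# Psi_B = spectral_operator_lowrank(B, lambda l: np.maximum(l - mu, 0.0), p=10)
```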
1529 The scheme (11.1) is time-consuming for large scale problems since first-order methods
1530 often take thousands of iterations to converge and each iteration requires at least one full
1531 or partial eigenvalue decomposition for evaluating Ψ. However, Ψ(B(xk )) often lives in a
low-dimensional eigenspace in practice. A common remedy is to use inexact methods such as the Lanczos method, LOBPCG, and randomized methods with early stopping rules [149, 6, 106]. The so-called subspace method performs refinement on a low-dimensional subspace for the univariate maximal eigenvalue optimization problem [66, 102, 63] and in the SCF iteration
1536 for KSDFT [151]. In the rest of this section, we present approaches [71] which integrate
1537 eigenvalue computation coherently with the underlying optimization methods.
1538 11.2. A Polynomial-filtered Subspace Method. We now describe a general sub-
1539 space framework for the scheme (11.1) using Chebyshev polynomials ρk (·) defined in (8.5).
1540 Assume that x∗ is a limit point of the fixed-point iteration (11.1) and the low-rank property
1541 holds for every B(xk ) in (11.1). Consequently, the scheme (11.1) is equivalent to
where V_{I_k} is determined by B(x_k). Although the exact subspace V_{I_k} is usually unknown,
1544 it can be approximated by an estimated subspace Uk so that the computational cost of Ψ
1545 is significantly reduced. After the next point xk+1 is formed, a polynomial filter step is
1546 performed in order to extract a new subspace Uk+1 based on Uk . Therefore, combining the
1547 two steps (8.6) and (11.4) together gives
where q_k is the degree (a small number, e.g., 1 to 3) of the polynomial filter ρ_k(·) applied to U_k. The
1552 Chebyshev polynomials are suitable when the targeted eigenvalues are located within an inter-
1553 val, for example, finding a few largest/smallest eigenvalues in magnitude or all positive/neg-
1554 ative eigenvalues.
The main feature is that the exact subspace V_{I_k} is substituted by its approximation U_k in (11.5). The principal angle between the true and the extracted subspaces is controlled by the polynomial degree, so the error between one exact and one inexact iteration is bounded. When the initial space is not orthogonal to the target space, the convergence of (11.5)-(11.6) is established under mild assumptions. In fact, the subspace often becomes more and more accurate, so that the warm-start property is helpful, i.e., the subspace of the current iteration can be refined from the previous one.
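The sketch below mirrors the two coupled steps only schematically: an inexact evaluation of Ψ within the current subspace U_k by a Rayleigh-Ritz step, followed by a low-degree Chebyshev filter and re-orthonormalization to refresh the subspace. The callables T, phi and B_of (mapping x to an operator that applies B(x) to a block), as well as the damping interval [a, b], are placeholders of this sketch rather than the exact form of (11.5)-(11.6).

```python
import numpy as np

def chebyshev_filter(apply_B, U, degree, a, b):
    """Apply a Chebyshev polynomial of the operator, scaled to damp the spectrum
    inside [a, b], to the block U (standard three-term recurrence)."""
    e, c = (b - a) / 2.0, (b + a) / 2.0
    Y, U_prev = (apply_B(U) - c * U) / e, U
    for _ in range(2, degree + 1):
        Y_new = 2.0 * (apply_B(Y) - c * Y) / e - U_prev
        U_prev, Y = Y, Y_new
    return Y

def polynomial_filtered_step(T, phi, B_of, x, U, degree=3, a=-1.0, b=0.0):
    """One step: inexact Psi on span(U), fixed-point update, then subspace refresh."""
    apply_B = B_of(x)
    M = U.T @ apply_B(U)                      # projected matrix U^T B(x) U (U orthonormal)
    vals, S = np.linalg.eigh(M)
    V = U @ S                                 # Ritz vectors
    Psi_approx = (V * phi(vals)) @ V.T        # inexact Psi(B(x)) built within span(U)
    x_next = T(x, Psi_approx)                 # fixed-point update of (11.1)
    U_next, _ = np.linalg.qr(chebyshev_filter(B_of(x_next), U, degree, a, b))
    return x_next, U_next
```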
where F(x) = f ∘ λ(B(x)) with B(x) = G + A^*(x), and R(x) is a regularization term with simple structure but not necessarily smooth. Here G is a known matrix in S^n, and the linear operator A and its adjoint operator A^* are defined as
(11.8)    A(X) = [⟨A_1, X⟩, . . . , ⟨A_m, X⟩]^T,    A^*(x) = Σ_{i=1}^{m} x_i A_i,
1570 for given symmetric matrices Ai ∈ S n . The function f : Rn → R is smooth and absolutely
1571 symmetric, i.e., f (x) = f (P x) for all x ∈ Rn and any permutation matrix P ∈ Rn×n .
1572 Let Ψ be a spectral operator induced by ψ = ∇f . It can be verified that the gradient of
1573 F in (11.7) is
where τ_k is the step size. Therefore, the iteration (11.11) is a special case of (11.1) with
T(x, X) = prox_{τ_k R}(x − A(X)),    Ψ(X) = V Diag(∇f(λ(X))) V^⊤.
1581 Assume that the low-rank property holds at every iteration. The corresponding polynomial-
1582 filtered method can be written as
1586 11.3.1. Examples: Maximal Eigenvalue and Matrix Completion. Consider the
1587 maximal eigenvalue optimization problem:
where B(x) = G + A^*(x). Certain specific formulations of phase recovery and blind deconvolution are special cases of (11.14). The subgradient of F(x) is
where U_1 ∈ R^{n×r_1} spans the eigenspace associated with λ_1(B(x)), which has multiplicity r_1. For simplicity, we assume r_1 = 1 and λ_1(B(x)) > 0, which means that ∂F(x) has only one element and the function F(x) is differentiable. Then the polynomial-filtered method is
(11.19)    max  ⟨W S, G⟩,   s. t.   G_ii = I_2,   G ⪰ 0,   ||G||_2 ≤ αK,
1640 where G = (Gij )i,j=1,...,K ∈ S 2K is the variable, with each block Gij being a 2-by-2 small
matrix, and ||·||_2 is the spectral norm. A three-block ADMM is introduced to solve (11.19).
1642 The cost of the projection onto the semidefinite cone can be reduced by the polynomial filters.
1643 The semidefinite relaxation of the LUD problem is
P
min 1≤i<j≤K kcij − Gij cji k2 ,
s.t. Gii = I2 ,
1644 (11.20)
G 0,
kGk2 ≤ αK,
where G, G_ij and K are defined as in (11.19), and c_ij ∈ R^2 are known vectors. The spectral norm constraint in (11.20) is optional. A four-block ADMM is proposed to solve (11.20). Similarly, the polynomial filters can be inserted into the ADMM updates to reduce the computational cost.
12. Conclusion. In this paper, we provide a comprehensive survey on various subspace techniques for nonlinear optimization. The main idea of subspace algorithms is to tackle large-scale nonlinear problems by performing iterations in lower-dimensional subspaces. We next summarize a few typical scenarios as follows.
• Find a linear combination of several known directions. Examples are the linear and nonlinear conjugate gradient methods, Nesterov's accelerated gradient method, the heavy-ball method and the momentum method.
1656 • Keep the objective function and constraints, but add an extra restriction in a cer-
1657 tain subspace. Examples are OMP, CoSaMP, LOBPCG, LMSVD, Arrabit, subspace
1658 refinement and multilevel methods.
• Approximate the objective function but keep the constraints. Examples are BCD, RBR, trust region with subspaces and parallel subspace correction.
• Approximate the objective function and design new constraints. Examples are trust region with subspaces and FPC_AS.
• Add a post-processing procedure after the subspace problem is solved. An example is the truncated subspace method for tensor train.
1665 • Use subspace techniques to approximate the objective functions. Examples are sam-
1666 pling, sketching and Nyström approximation.
1667 • Integrate the optimization method and subspace update in one framework. An ex-
1668 ample is the polynomial-filtered subspace method for low-rank matrix optimization.
REFERENCES
[1] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Found. Comput. Math., 7 (2007), pp. 303–330.
[2] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, Princeton, NJ, 2008.
[3] P.-A. Absil, R. Mahony, and J. Trumpf, An extrinsic look at the Riemannian Hessian, in Geometric science of information, Springer, 2013, pp. 361–368.
[4] D. G. Anderson, Iterative procedures for nonlinear integral equations, Journal of the ACM (JACM), 12 (1965), pp. 547–560.
[5] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[6] S. Becker, V. Cevher, and A. Kyrillidis, Randomized low-memory singular value projection, arXiv preprint arXiv:1303.0167, (2013).
[7] S. Bellavia and B. Morini, A globally convergent Newton-GMRES subspace method for systems of nonlinear equations, SIAM J. Sci. Comput., 23 (2001), pp. 940–960.
[8] S. J. Benson, Y. Ye, and X. Zhang, Solving large-scale sparse semidefinite programs for combinatorial optimization, SIAM J. Optim., 10 (2000), pp. 443–461.
[9] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, September 1999.
[10] J. Bolte, S. Sabach, and M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Math. Program., 146 (2014), pp. 459–494.
[11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3 (2011), pp. 1–122.
[12] M. A. Branch, T. F. Coleman, and Y. Li, A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems, SIAM J. Sci. Comput., 21 (1999), pp. 1–23.
[13] C. Brezinski, M. Redivo-Zaglia, and Y. Saad, Shanks sequence transformations and Anderson acceleration, SIAM Rev., 60 (2018), pp. 646–669.
[14] P. N. Brown and Y. Saad, Hybrid Krylov methods for nonlinear systems of equations, SIAM J. Sci. Statist. Comput., 11 (1990), pp. 450–481.
[15] O. Burdakov, L. Gong, S. Zikrin, and Y.-X. Yuan, On efficiently combining limited-memory and trust-region techniques, Math. Program. Comput., 9 (2017), pp. 101–134.
[16] S. Burer and R. D. C. Monteiro, A projected gradient algorithm for solving the maxcut SDP relaxation, Optim. Methods Softw., 15 (2001), pp. 175–200.
[17] J. V. Burke and J. J. Moré, On the identification of active constraints, SIAM J. Numer. Anal., 25 (1988), pp. 1197–1211.
[18] J. V. Burke and J. J. Moré, Exposing constraints, SIAM J. Optim., 4 (1994), pp. 573–595.
[19] R. H. Byrd, N. I. M. Gould, J. Nocedal, and R. A. Waltz, An algorithm for nonlinear optimization using linear programming and equality constrained subproblems, Math. Program., 100 (2004), pp. 27–48.
[20] R. H. Byrd, N. I. M. Gould, J. Nocedal, and R. A. Waltz, On the convergence of successive linear-quadratic programming algorithms, SIAM J. Optim., 16 (2005), pp. 471–489.
[21] R. H. Byrd, J. Nocedal, and R. B. Schnabel, Representations of quasi-Newton matrices and their use in limited memory methods, Math. Programming, 63 (1994), pp. 129–156.
[22] C. Carstensen, Domain decomposition for a non-smooth convex minimization problem and its application to plasticity, Numerical Linear Algebra with Applications, 4 (1997), pp. 177–190.
[23] M. R. Celis, J. E. Dennis, and R. A. Tapia, A trust region strategy for nonlinear equality constrained optimization, in Numerical optimization, 1984 (Boulder, Colo., 1984), SIAM, Philadelphia, PA, 1985, pp. 71–82.
[24] C. Chen, Z. Wen, and Y.-X. Yuan, A general two-level subspace method for nonlinear optimization, J. Comput. Math., 36 (2018), pp. 881–902.
[25] Y. Chen, X. Li, and J. Xu, Convexified modularity maximization for degree-corrected stochastic block