
1 Subspace Methods for

2 Nonlinear Optimization
3

4 Xin Liu∗
5 Zaiwen Wen†
6 Ya-xiang Yuan‡

7 Abstract. Subspace techniques such as Krylov subspace methods have been well known and extensively used in
8 numerical linear algebra. They are also ubiquitous and becoming indispensable tools in nonlinear opti-
mization due to their ability to handle large scale problems. There are generally two types of principles: i)
10 the decision variable is updated in a lower dimensional subspace; ii) the objective function or constraints
11 are approximated in a certain smaller functional subspace. The key ingredients are the constructions of
12 suitable subspaces and subproblems according to the specific structures of the variables and functions
13 such that either the exact or inexact solutions of subproblems are readily available and the corresponding
computational cost is significantly reduced. A few relevant techniques include but are not limited to direct
15 combinations, block coordinate descent, active sets, limited-memory, Anderson acceleration, subspace
16 correction, sampling and sketching. This paper gives a comprehensive survey on the subspace meth-
17 ods and their recipes in unconstrained and constrained optimization, nonlinear least squares problem,
18 sparse and low rank optimization, linear and nonlinear eigenvalue computation, semidefinite program-
ming, stochastic optimization, etc. In order to provide helpful guidelines, we emphasize high level
20 concepts for the development and implementation of practical algorithms from the subspace framework.

21 Key words. nonlinear optimization, subspace techniques, block coordinate descent, active sets, limited memory,
22 Anderson acceleration, subspace correction, subsampling, sketching

23 AMS subject classification. 65K05, 90C30

24 1 Introduction 3
25 1.1 Overview of Subspace Techniques . . . . . . . . . . . . . . . . . . . . . . . 4
26 1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
27 1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

28 2 General Unconstrained Optimization 5


29 2.1 The Line Search Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
30 2.1.1 The Nonlinear Conjugate Gradient (CG) Method . . . . . . . . . . . 6
31 2.1.2 Nesterov’s Accelerated Gradient Method . . . . . . . . . . . . . . . 6
32 2.1.3 The Heavy-ball Method . . . . . . . . . . . . . . . . . . . . . . . . 7
33 2.1.4 A Search Direction Correction (SDC) Method . . . . . . . . . . . . . 7
34 2.1.5 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . 8
35 2.1.6 Acceleration Techniques . . . . . . . . . . . . . . . . . . . . . . . . 8

∗ State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science,

Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China ([email protected]). Re-
search supported in part by NSFC grants 11622112, 11471325, 91530204 and 11688101, the National Center for
Mathematics and Interdisciplinary Sciences, CAS, and Key Research Program of Frontier Sciences QYZDJ-SSW-
SYS010, CAS.
† Beijing International Center for Mathematical Research, Peking University, China ([email protected]). Re-

search supported in part by NSFC grant 11831002, and by Beijing Academy of Artificial Intelligence (BAAI).
‡ State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science,

Chinese Academy of Sciences, China ([email protected]). Research supported in part by NSFC grants 11331012
and 11688101.


36 2.1.7 Search Direction From Minimization Subproblems . . . . . . . . . . 9
37 2.1.8 Subspace By Coordinate Directions . . . . . . . . . . . . . . . . . . 10
38 2.2 Trust Region Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

39 3 Nonlinear Equations and Nonlinear Least Squares Problem 13


40 3.1 General Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
41 3.2 Subspace by Subsampling/Sketching . . . . . . . . . . . . . . . . . . . . . . 13
42 3.3 Partition of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
43 3.4 τ −steepest Descent Coordinate Subspace . . . . . . . . . . . . . . . . . . . 15

44 4 Stochastic Optimization 15
45 4.1 Stochastic First-order Methods . . . . . . . . . . . . . . . . . . . . . . . . . 16
46 4.2 Stochastic Second-Order method . . . . . . . . . . . . . . . . . . . . . . . . 17

47 5 Sparse Optimization 18
48 5.1 Basis Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
49 5.2 Active Set Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

50 6 The Domain Decomposition Methods 20


51 6.1 A Two-level Subspace Method . . . . . . . . . . . . . . . . . . . . . . . . . 20
52 6.2 The Subspace Correction Method . . . . . . . . . . . . . . . . . . . . . . . 21
53 6.3 Parallel Line Search Subspace Correction Method . . . . . . . . . . . . . . . 21

54 7 General Constrained Optimization 22


55 7.1 Direct Subspace Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 23
56 7.2 Second Order Correction Steps . . . . . . . . . . . . . . . . . . . . . . . . . 23
57 7.3 The Celis-Dennis-Tapia (CDT) Subproblem . . . . . . . . . . . . . . . . . . 24
58 7.4 Simple Bound-constrained Problems . . . . . . . . . . . . . . . . . . . . . . 25

59 8 Eigenvalue Computation 26
60 8.1 Classic Subspace Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
61 8.2 Polynomial Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
62 8.3 Limited Memory Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
63 8.4 Augmented Rayleigh-Ritz Method . . . . . . . . . . . . . . . . . . . . . . . 28
64 8.5 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 29
65 8.6 Randomized SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
66 8.7 Truncated Subspace Method for Tensor Train . . . . . . . . . . . . . . . . . 31

67 9 Optimization with Orthogonality Constraints 34


68 9.1 Regularized Newton Type Approaches . . . . . . . . . . . . . . . . . . . . . 34
69 9.2 A Structured Quasi-Newton Update with Nyström Approximation . . . . . . 35
70 9.3 Electronic Structure Calculations . . . . . . . . . . . . . . . . . . . . . . . . 36
71 9.3.1 The Mathematical Models . . . . . . . . . . . . . . . . . . . . . . . 37
72 9.3.2 The Self-Consistent Field (SCF) Iteration . . . . . . . . . . . . . . . 38
73 9.3.3 Subspace Methods For HF using Nyström Approximation . . . . . . 39
74 9.3.4 A Regularized Newton Type Method . . . . . . . . . . . . . . . . . 39
75 9.3.5 Subspace Refinement for KSDFT . . . . . . . . . . . . . . . . . . . 40

76 10 Semidefinite Programming (SDP) 40


77 10.1 The Maxcut SDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
78 10.1.1 Examples: Phase Retrieval . . . . . . . . . . . . . . . . . . . . . . . 41


79 10.2 Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

80 11 Low Rank Matrix Optimization 44


81 11.1 Low Rank Structure of First-order Methods . . . . . . . . . . . . . . . . . . 44
82 11.2 A Polynomial-filtered Subspace Method . . . . . . . . . . . . . . . . . . . . 45
83 11.3 The Polynomial-filtered Proximal Gradient Method . . . . . . . . . . . . . . 46
84 11.3.1 Examples: Maximal Eigenvalue and Matrix Completion . . . . . . . 46
85 11.4 The Polynomial-filtered ADMM Method . . . . . . . . . . . . . . . . . . . . 47
86 11.4.1 Examples: 2-RDM and Cryo-EM . . . . . . . . . . . . . . . . . . . 48

87 12 Conclusion 48

88 References 49

91 1. Introduction. Large scale optimization problems appear in a wide variety of scien-


92 tific and engineering domains. In this paper, we consider a general optimization problem

(1.1)  min_x f(x),  s.t. x ∈ X,

94 where x is the decision variable, f (x) is the objective function and X is the feasible set. Effi-
95 cient numerical optimization algorithms have been extensively developed for (1.1) with vari-
96 ous types of objective functions and constraints [111, 88]. With the rapidly increasing prob-
97 lem scales, subspace techniques are ubiquitous and becoming indispensable tools in nonlinear
98 optimization due to their ability to handle large scale problems. For example, the Krylov sub-
99 space methods developed in the numerical linear algebraic community have been widely used
100 for the linear least squares problem and linear eigenvalue problem. The characteristics of the
101 subspaces are clear in many popular optimization algorithms such as the linear and nonlin-
102 ear conjugate gradient methods, Nesterov’s accelerated gradient method, the Quasi-Newton
103 methods and the block coordinate decent (BCD) method. The subspace correction method
104 for convex optimization can be viewed as generalizations of multigrid and domain decompo-
105 sition methods. The Anderson acceleration or the direct inversion of iterative subspace (DIIS)
106 methods have been successful in computational quantum physics and chemistry. The stochas-
tic gradient type methods usually take a mini-batch from a large collection of samples so that
108 the computational cost of each inner iteration is small. The sketching techniques formulate a
109 reduced problem by a multiplication with random matrices with certain properties.
110 The purpose of this paper is to provide a review of the subspace methods for nonlinear
111 optimization, for their further improvement and for their future usage in even more diverse
and emerging fields. The subspace techniques for (1.1) are generally divided into two cat-
113 egories. The first type is to update the decision variable in a lower dimensional subspace,
114 while the second type is to construct approximations of the objective function or constraints
115 in a certain smaller subspace of functions. Usually, there are three key steps.
116 • Identify a suitable subspace either for the decision variables or the functions.
117 • Construct a proper subproblem by various restrictions or approximations.
118 • Find either an exact or inexact solution of subproblems.
119 These steps are often mixed together using the specific structures of the problems case by
120 case. The essence is how to reduce the corresponding computational cost significantly.
121 During the practice in unconstrained and constrained optimization, nonlinear least squares
122 problem, sparse and low rank optimization, linear and nonlinear eigenvalue computation,
123 semidefinite programming, stochastic optimization, manifold optimization, phase retrieval,


variational minimization, etc., the collection of subspace techniques is growing ever richer. It
includes but is not limited to direct combinations, BCD, active sets, limited-memory, Ander-
126 son acceleration, subspace correction, sampling and sketching. We aim to provide helpful
127 guidelines for the development and implementation of practical algorithms using the sub-
128 space framework. Hence, only high level algorithmic ideas rather than theoretical properties
129 of the subspace techniques are covered in various contexts.

130 1.1. Overview of Subspace Techniques. We next summarize the concepts and
131 contexts of a few main subspace techniques.
132 Direct Combinations. It is a common practice to update the decision variables using a
133 combination of a few known directions which forms a subspace. The linear and nonlinear
134 conjugate gradient methods [111, 88], the Nesterov’s accelerated gradient method [84, 85],
135 the Heavy-ball method [90], the search direction correction method [126] and the momentum
136 method [47] take a linear combination of the gradient and the previous search direction. The
137 main difference is reflected in the choices of the coefficients according to different explicit
138 formulas.
139 BCD. The variables in many problems can be split naturally into a few blocks whose sub-
140 spaces are spanned by the coordinate directions. The Gauss-Seidel type of the BCD method
141 updates only one block by minimizing the objective function or its surrogate while all other
blocks are fixed at each iteration. It has been one of the core algorithmic ideas in solving
143 problems with block structures, such as convex programming [77], nonlinear programming
144 [9], semidefinite programming [129, 145], compressive sensing [72, 32], etc. A proximal
145 alternating linearized minimization method is developed in [10] for solving a summation of
146 nonconvex but differentiable and nonsmooth functions. The alternating direction methods of
147 multipliers (ADMM) [11, 27, 41, 45, 55, 125] minimize the augmented Lagrangian function
148 with respect to the primal variables by BCD, then update the Lagrangian multiplier.
149 Active Sets. When a clear partition of variables is not available, a subset of the variables
150 can be fixed in the so-called active sets under certain mechanisms and the remaining variables
151 are determined from certain subproblems for optimization problems with bound constraints
152 or linear constraints in [17, 18, 51, 81, 82], `1 -regularized problem for sparse optimization
153 in [133, 105, 64] and general nonlinear programs in [19, 20]. In quadratic programming, the
154 inequality constraints that have zero values at the optimal solution are called active, and they
155 are replaced by equality constraints in the subproblem [111].
Limited-memory. A typical subspace is constructed from historical information, for example, the previous iterates {x_k}, the previous gradients {∇f(x_k)}, the dif-
158 ferences between two consecutive iterates {xk − xk−1 }, and the differences between two
159 consecutive gradients {∇f (xk ) − ∇f (xk−1 )}. After the new iterate is formed, the oldest
160 vectors in the storage are replaced by the most recent vectors if certain justification rules are
161 satisfied. Two examples are the limited memory BFGS method [111, 88], and the limited
162 memory block Krylov subspace optimization method (LMSVD) [74].
163 Anderson Acceleration. For a sequence {xk } generated by a general fixed-point iter-
164 ation, the Anderson acceleration produces a new point using a linear combination of a few
165 points in {xk }, where the coefficients are determined from an extra linear least squares prob-
166 lem with a normalized constraint [13, 4, 123]. A few related schemes include the minimal
167 polynomial extrapolation, modified minimal polynomial extrapolation, reduced rank extrap-
168 olation, the vector Epsilon algorithm and the topological Epsilon algorithm. The Anderson
169 acceleration is also known as Anderson mixing, Pulay mixing, DIIS or the commutator DIIS
170 [92, 93, 115] in electronic structure calculation. These techniques have also been applied to
171 optimization problems in [99, 147].
172 Subspace correction. For variational problems, the domain decomposition methods


173 split the spatial domain into several subdomains and solve the corresponding problems on
174 these subdomains iteratively using certain strategies. The successive subspace correction
175 (SSC) and parallel subspace correction (PSC) methods [22, 36, 39, 38, 68, 112] are similar to
176 the Gauss-Seidel-type and Jacobian-type BCD methods, respectively. However, the subspace
177 correction is significantly different from BCD due to the strong connections between variables
178 in the subdomains. The PSC methods have been studied for LASSO in [36, 39, 29] and total
179 variation minimization in [37, 38, 39, 68].
180 Sampling. Assume that there are a large number of data. The general concept of sam-
181 pling is to randomly select a small set of samples with an appropriate probability distribution
182 with or without replacement. In the stochastic gradient descent type methods, the gradient in
183 expectation is approximated by a sum of sample gradients over a mini-batch [47]. Random
184 sampling is also helpful in many other contexts, for example, a greedy algorithm for a mixed
185 integer programming in volumetric modulated arc therapy [139].
186 Sketching. For huge data represented in matrices, the sketching technique builds low-
187 dimensional approximations using random linear maps [78, 136, 118]. It has been adopted
188 for nonlinear least squares problems in [141, 103] and large scale SDP problems in [144].
189 The Nyström approximation can be viewed as a special sketching scheme. An initial quasi-
190 Newton matrix can be constructed if a single Hessian-matrix multiplication is affordable in
191 [58].
192 1.2. Notation. Let S n be the collection of all n-by-n symmetric matrices. For any
193 matrix X ∈ Rn×n , diag(X) denotes a column vector consisting of all diagonal entries of X.
194 For any vector x ∈ Rn , Diag(x) is an n-by-n diagonal matrix whose i-th diagonal entry is
x_i. Given two matrices A, B ∈ C^{n×p}, the Frobenius inner product is defined as ⟨A, B⟩ = tr(A^*B), and the corresponding Frobenius norm is defined as ‖A‖_F = √(tr(A^*A)). The
operation A ⊙ B denotes the Hadamard product between two matrices A and B of the same
sizes. Let e_n be a vector of all ones in R^n. For any matrix X ∈ R^{n×p}, Range(X) denotes the
subspace spanned by the columns of X. The subscript usually denotes the iteration number,
while the superscript is reserved as the index of a vector or matrix.
201 1.3. Organization. The rest of this paper is organized as follows. The subspace meth-
202 ods applied in general unconstrained optimization, nonlinear equations and nonlinear least
203 squares problem, stochastic optimization, sparse optimization, the domain decomposition,
204 general constrained optimization, eigenvalue computation, optimization problems with or-
205 thogonality constraints, semidefinite programming and low rank matrix optimization are dis-
206 cussed in Sections 2 to 11, respectively. Finally, a few typical scenarios are summarized in
207 Section 12.
208 2. General Unconstrained Optimization. In this section, we consider the uncon-
209 strained optimization
(2.1)  min_{x∈R^n} f(x),

211 where f (x) : Rn → R is a differentiable function. The line search and trust region methods
212 are the two main types of approaches for solving (2.1). The main difference between them
213 is the order of determining the so-called step size and search direction. Subspace techniques
214 have been substantially studied in [26, 48, 140, 142, 143, 87, 128, 127, 49].
215 2.1. The Line Search Methods. At the k-th iteration xk , the line search methods
216 first generate a descent search direction dk and then search along this direction for a step size
217 αk such that the objective function at the next point
218 (2.2) xk+1 = xk + αk dk


219 is suitably reduced. The step size αk is often selected by the monotone line search procedures
with the Armijo, Goldstein or the Wolfe-Powell rules. The nonmonotone line search procedures are
221 also widely used. Interested readers are referred to [111, 88] for further information. Here,
222 we mainly focus on generating the direction dk in a subspace Sk , i.e.,

223 d ∈ Sk .

224 For simplicity, we often denote gk = ∇f (xk ).


225 2.1.1. The Nonlinear Conjugate Gradient (CG) Method. The nonlinear CG method
226 is popular for solving large scale optimization problems. The search direction dk lies in a par-
227 ticular subspace

228 (2.3) Sk = span{gk , dk−1 },

229 which is spanned by the gradient gk and the last search direction dk−1 . More specifically, dk
230 is a linear combination of −gk and dk−1 with a weight βk−1 , i.e.,

231 (2.4) dk = −gk + βk−1 dk−1 ,

232 where d0 = −g0 and β0 = 0. A few widely used choices for the weight βk−1 are
β_{k−1} = g_k^⊤ g_k / (g_{k−1}^⊤ g_{k−1}),   (F-R formula),
β_{k−1} = g_k^⊤ (g_k − g_{k−1}) / (d_{k−1}^⊤ (g_k − g_{k−1})),   (H-S or C-W formula),
β_{k−1} = g_k^⊤ (g_k − g_{k−1}) / (g_{k−1}^⊤ g_{k−1}),   (PRP formula),
β_{k−1} = − g_k^⊤ g_k / (d_{k−1}^⊤ g_{k−1}),   (Dixon formula),
β_{k−1} = g_k^⊤ g_k / (d_{k−1}^⊤ (g_k − g_{k−1})),   (D-Y formula).

239 It is easy to observe that these formulas are equivalent in the sense that they yield the same
240 search directions when the function f (x) is quadratic with a positive definite Hessian matrix.
241 In this case, the directions d1 , . . . , dk are conjugate to each other with respect to the Hessian
242 matrix. It can also be proved that the CG method has global convergence and n-step local
243 quadratic convergence. However, for a general nonlinear function with inexact line search,
244 the behavior of the methods with different βk can be significantly different.
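For illustration only, the following minimal Python sketch (not from the survey; function names, the PRP+ safeguard, and the Armijo parameters are our own choices) implements the nonlinear CG iteration (2.4) with the PRP weight and a simple backtracking line search.

```python
import numpy as np

def nonlinear_cg_prp(f, grad, x0, max_iter=200, tol=1e-6):
    """Nonlinear CG with the PRP weight and a simple Armijo backtracking line search."""
    x = x0.copy()
    g = grad(x)
    d = -g                                     # d_0 = -g_0, beta_0 = 0
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha, c, fx = 1.0, 1e-4, f(x)         # Armijo backtracking along d
        while f(x + alpha * d) > fx + c * alpha * g.dot(d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = g_new.dot(g_new - g) / g.dot(g) # PRP formula
        d = -g_new + max(beta, 0.0) * d        # PRP+ safeguard keeps d a descent direction
        x, g = x_new, g_new
    return x

# usage on a small convex quadratic
A = np.diag([1.0, 10.0, 100.0])
x_star = nonlinear_cg_prp(lambda x: 0.5 * x.dot(A @ x), lambda x: A @ x, np.ones(3))
```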
245 2.1.2. Nesterov’s Accelerated Gradient Method. The steepest descent gradient
246 method simply uses dk = −gk in (2.2) for unconstrained optimization. Assume that the
function f(x) is convex, the optimal value f^* of (2.1) is finite and attained at a point x^*, and the gradient ∇f(x) is Lipschitz continuous with a constant L, i.e.,

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.

Let {x_k}_{k=0}^∞ be a sequence generated by the gradient method with a fixed step size α_k = 1/L. Then it can be proved that the convergence of the objective function values is

f(x_k) − f(x^*) ≤ (L / (2k)) ‖x_0 − x^*‖²,


which is often described as a convergence rate of O(1/k).
254 A natural question is whether a faster convergence rate can be achieved if only the gra-
255 dient information is used. We now present the so-called FISTA method proposed by Beck
and Teboulle [5], which is equivalent to Nesterov's accelerated gradient method [84, 85]. The
257 FISTA method first calculates a new point by an extrapolation of the previous two points,
258 then performs a gradient step at this new point:
y_k = x_{k−1} + ((k − 2)/(k + 1)) (x_{k−1} − x_{k−2}),
x_k = y_k − α_k ∇f(y_k).

An illustration of the FISTA method is shown in Figure 2.1.

Fig. 2.1 The FISTA method: x_{k−2}, x_{k−1} and the extrapolated point y_k lie on a line, and x_k = y_k − t_k ∇f(y_k) is the gradient step taken at y_k.

Under the same assumptions as the gradient method, the FISTA method with a fixed step size α_k = 1/L has a convergence rate of O(1/k²), i.e.,

f(x_k) − f^* ≤ (2L / (k + 1)²) ‖x_0 − x^*‖².
266 is

267 (2.5) Sk = span{xk−1 , xk−2 , ∇f (yk )}.
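A minimal Python sketch of the scheme above for a smooth convex f, assuming the Lipschitz constant L is known; the function names and the toy data are ours, not part of the survey.

```python
import numpy as np

def accelerated_gradient(grad, x0, L, max_iter=500):
    """Nesterov/FISTA-type iteration: extrapolate the previous two points, then take a gradient
    step with step size alpha = 1/L at the extrapolated point."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, max_iter + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)   # extrapolation of the previous two points
        x_prev = x
        x = y - (1.0 / L) * grad(y)                # gradient step at the extrapolated point
    return x

# usage: minimize 0.5*||Ax - b||^2, whose gradient is L-Lipschitz with L = ||A^T A||_2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5)); b = rng.standard_normal(20)
L = np.linalg.norm(A.T @ A, 2)
x = accelerated_gradient(lambda z: A.T @ (A @ z - b), np.zeros(5), L)
```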

268 2.1.3. The Heavy-ball Method. The heavy-ball method [90] is also a two-step scheme:

d_k = −g_k + β d_{k−1},
x_{k+1} = x_k + α d_k,

with d_0 = 0 and α, β > 0. If β ∈ [0, 1) and α ∈ (0, (1 − β)/L], then under the same assumptions as in Sec. 2.1.2, it is established in [42] that

f(x̄_k) − f^* ≤ (1/(k + 1)) [ (β/(1 − β)) (f(x_0) − f^*) + ((1 − β)/(2α)) ‖x_0 − x^*‖² ],

where x̄_k = (1/(1 + k)) Σ_{i=1}^{k} x_i. We can see that the Heavy-ball method is the same as the nonlinear CG method (2.4) except that the parameter β is different.
276 2.1.4. A Search Direction Correction (SDC) Method. The search direction (2.4)
277 can also be modified by adding a non-trivial weight to gk . Let d0 = 0. At the beginning of
278 the (k + 1)-th iteration, if a descent condition

279 (2.6) hgk , dk i ≤ 0




280 holds, we update
(2.7)  d_{k+1} = (1 − β_k) d_k − γ_k g_k − (‖d_k‖/‖g_k‖) g_k.

The parameters β_k and γ_k are updated as

(2.8)  β_k = r / (l_k − 1 + r),   γ_k = (r − 3) / (l_k − 1 + r),

where r ≥ 3 and {l_k} is a sequence of parameters with l_1 = 1 and l_{k+1} = l_k + 1. If the
285 criterion (2.6) is not met, we reset dk+1 , βk+1 and γk+1 as
286 dk+1 = −gk , βk+1 = β1 , γk+1 = γ1 , lk+1 = l1 .
287 For more details, we refer the reader to [126].
288 2.1.5. Quasi-Newton Methods. The search directions of the limited-memory quasi-
Newton methods [111, 88] also lie in subspaces. Let B_k be the limited-memory BFGS (L-
290 BFGS) matrix and Hk be its inverse matrix generated from a few most recent pairs {si , yi },
291 where
292 si = xi+1 − xi , yi = gi+1 − gi .
293 Then the search direction is
294 (2.9) dk = −Bk−1 gk = −Hk gk ,
295 which is usually computed by the two-loop recursion. In fact, both Bk and Hk can be written
296 in a compact representation [21]. Assume that there are p pairs of vectors:
297 (2.10) Uk = [sk−p , . . . , sk−1 ] ∈ Rn×p , Yk = [yk−p , . . . , yk−1 ] ∈ Rn×p .
298 For a given initial matrix Hk0 , the Hk matrix is:
299 (2.11) Hk = Hk0 + Ck Pk Ck> ,
where

C_k := [U_k, H_k^0 Y_k] ∈ R^{n×2p},   D_k = diag(s_{k−p}^⊤ y_{k−p}, . . . , s_{k−1}^⊤ y_{k−1}),

P_k := [ R_k^{−⊤}(D_k + Y_k^⊤ H_k^0 Y_k) R_k^{−1}    −R_k^{−⊤} ]
       [ −R_k^{−1}                                   0        ],

(R_k)_{i,j} = { s_{k−p+i−1}^⊤ y_{k−p+j−1}, if i ≤ j;   0, otherwise. }
304 The initial matrix Hk0 is usually set to be a positive scalar γk times the identity matrix, i.e.,
305 γk I. Therefore, we have
306 dk ∈ span{gk , sk−1 , . . . , sk−p , yk−1 , . . . , yk−p }.
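The two-loop recursion mentioned above is standard [111]; the sketch below (with our own variable names and toy data) computes d_k = −H_k g_k directly from the p stored pairs without forming H_k.

```python
import numpy as np

def lbfgs_direction(g, S, Y, gamma):
    """Two-loop recursion: return d = -H g for the L-BFGS inverse Hessian H built from the
    stored pairs (s_i, y_i), ordered oldest to newest, with initial matrix H^0 = gamma * I."""
    q = g.copy()
    alphas = []
    rhos = [1.0 / y.dot(s) for s, y in zip(S, Y)]
    for s, y, rho in zip(reversed(S), reversed(Y), reversed(rhos)):   # newest to oldest
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    r = gamma * q                                                     # apply H^0 = gamma*I
    for s, y, rho, a in zip(S, Y, rhos, reversed(alphas)):            # oldest to newest
        b = rho * y.dot(r)
        r += (a - b) * s
    return -r

# usage: pairs generated along a quadratic f(x) = 0.5 x^T A x, where y_i = A s_i
A = np.diag([1.0, 5.0, 20.0])
xs = [np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.8, 0.2]), np.array([0.3, 0.4, 0.05])]
S = [xs[i + 1] - xs[i] for i in range(2)]
Y = [A @ s for s in S]
d = lbfgs_direction(A @ xs[-1], S, Y, gamma=S[-1].dot(Y[-1]) / Y[-1].dot(Y[-1]))
```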
307 2.1.6. Acceleration Techniques. Gradient descent algorithms may converge slowly
308 after certain iterations. This issue can be resolved by using acceleration techniques such
309 as Anderson Acceleration (AA) [4, 123]. An extrapolation-based acceleration techniques
310 proposed in [99] can be applied to overcome the instability of the Anderson Acceleration. To
311 be precise, we perform linear combinations of the points xk every l + 2 iterations to obtain a
better estimation x̃ = Σ_{i=0}^{l} c̃_i x_{k−l+i}. Define the differences of the l + 2 iteration points as
313 U = [xk−l+1 − xk−l , . . . , xk+1 − xk ].
Then the coefficient vector c̃ = (c̃_0, . . . , c̃_l)^⊤ is the solution of the following problem

(2.12)  c̃ = arg min_{c^⊤ e_{l+1} = 1} c^⊤ (U^⊤ U + λI) c,

316 where λ > 0 is a regularization parameter.
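A small sketch of the extrapolation step (2.12), assuming the last l + 2 iterates are stored as columns of a matrix; the closed-form solution of the equality-constrained quadratic program via a linear solve and the names below are our own.

```python
import numpy as np

def regularized_extrapolation(X, lam=1e-8):
    """Given iterates x_{k-l}, ..., x_{k+1} as columns of X, return x_tilde = sum_i c_i x_{k-l+i},
    where c solves min_{c^T e = 1} c^T (U^T U + lam*I) c and U holds consecutive differences."""
    U = np.diff(X, axis=1)                      # columns x_{k-l+1}-x_{k-l}, ..., x_{k+1}-x_k
    M = U.T @ U + lam * np.eye(U.shape[1])
    e = np.ones(U.shape[1])
    z = np.linalg.solve(M, e)
    c = z / e.dot(z)                            # closed form of the constrained QP (2.12)
    return X[:, :-1] @ c                        # combine the first l+1 iterates with weights c

# usage: smooth a slowly converging fixed-point sequence x_{j+1} = 0.9 x_j + 1
X = np.empty((1, 8)); X[0, 0] = 0.0
for j in range(7):
    X[0, j + 1] = 0.9 * X[0, j] + 1.0
x_tilde = regularized_extrapolation(X)
```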




317 2.1.7. Search Direction From Minimization Subproblems. We next construct
318 the search direction by solving a subproblem defined in a subspace Sk as

(2.13)  min_{d∈S_k} Q_k(d),

320 where Qk (d) is an approximation to f (xk + d) for d in the subspace Sk . It would be de-
sirable that the approximation model Q_k(d) has the following properties: (i) it is easy to
minimize in the subspace S_k; (ii) it is a good approximation to f and the solution of the
323 subspace subproblem will give a sufficient reduction with respect to the original objective
324 function f .
325 It is natural to use quadratic approximations to the objective function. This leads to
326 quadratic models in subspaces. A successive two-dimensional search algorithm is developed
327 by Stoer and Yuan in [143] based on

min_{d ∈ span{−g_k, d_{k−1}}} Q_k(d).

Let the dimension dim(Sk ) = τk and Sk be a set generated by all linear combinations of
vectors p1 , p2 , . . . , pτk ∈ Rn , i.e.,

Sk = span{p1 , p2 , . . . , pτk } .

329 Define Pk = [p1 , p2 , ..., pτk ]. Then d ∈ Sk can be represented as d = Pk d¯ with d¯ ∈ Rτk .
Hence, a quadratic function Q_k(d) defined in the subspace can be expressed as a function Q̄_k in a lower dimensional space R^{τ_k} in terms of Q_k(d) = Q̄_k(d̄). Since τ_k usually is quite small, the Newton method can be used to solve (2.13) efficiently.
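As a sketch of this reduction, suppose the gradient g_k and a Hessian-vector product routine are available; the reduced quadratic model can then be formed and minimized in R^{τ_k} directly. The helper names and the toy data below are ours.

```python
import numpy as np

def subspace_newton_step(g, hess_vec, P):
    """Minimize the quadratic model Q(d) = g^T d + 0.5 d^T B d over d in Range(P):
    write d = P dbar, form the reduced gradient/Hessian and solve the small system."""
    g_bar = P.T @ g                                           # reduced gradient P^T g
    B_bar = P.T @ np.column_stack([hess_vec(P[:, j]) for j in range(P.shape[1])])
    d_bar = np.linalg.solve(B_bar, -g_bar)                    # Newton step in R^{tau_k}
    return P @ d_bar                                          # lift back to R^n

# usage with an explicit positive definite Hessian and the subspace span{-g, s}
n = 50
rng = np.random.default_rng(0)
B = 0.1 * rng.standard_normal((n, n)); B = 0.5 * (B + B.T) + n * np.eye(n)
g, s = rng.standard_normal(n), rng.standard_normal(n)
P, _ = np.linalg.qr(np.column_stack([-g, s]))                 # orthonormal basis of span{-g, s}
d = subspace_newton_step(g, lambda v: B @ v, P)
```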
333 We now discuss a few possible choices for the subspace Sk . A special subspace is a
334 combination of the current gradient and the previous search directions, i.e.,

335 (2.14) Sk = span{−gk , sk−1 , ..., sk−m } .

336 In this case, any vector d in the subspace Sk can be expressed as


m
X
337 (2.15) d = αgk + βi sk−i = (−gk , sk−1 , · · · , sk−m )d¯
i=1

338 where d¯ = (α, β1 , · · · , βm )> ∈ Rm+1 . All second order terms of the Taylor expansion of
339 f (xk + d) in the subspace Sk can be approximated by secant conditions

(2.16)  s_{k−i}^⊤ ∇²f(x_k) s_{k−j} ≈ s_{k−i}^⊤ y_{k−j},    s_{k−i}^⊤ ∇²f(x_k) g_k ≈ y_{k−i}^⊤ g_k,

341 except gk> ∇2 f (xk )gk . Therefore, it is reasonable to use the following quadratic model in the
342 subspace Sk :

(2.17)  Q̄_k(d̄) = (−‖g_k‖², g_k^⊤ s_{k−1}, · · · , g_k^⊤ s_{k−m}) d̄ + (1/2) d̄^⊤ B̄_k d̄,

where

(2.18)  B̄_k = [ ρ_k              −g_k^⊤ y_{k−1}      . . .   −g_k^⊤ y_{k−m}     ]
              [ −g_k^⊤ y_{k−1}   y_{k−1}^⊤ s_{k−1}   . . .   y_{k−m}^⊤ s_{k−1}  ]
              [ ...              ...                 . . .   ...                ]
              [ −g_k^⊤ y_{k−m}   y_{k−m}^⊤ s_{k−1}   . . .   y_{k−m}^⊤ s_{k−m}  ]


346 with ρk ≈ gk> ∇2 f (xk )gk . Hence, once a good estimation to the term gk> ∇2 f (xk )gk is
347 available, we can obtain a good quadratic model in the subspace Sk .
348 There are different ways to choose ρk . Similar to the approach in [143], we can let

(2.19)  ρ_k = 2 (s_{k−1}^⊤ g_k)² / (s_{k−1}^⊤ y_{k−1}),

due to the fact that the mean value of cos²(θ) is 1/2, which gives

(2.20)  g_k^⊤ ∇²f(x_k) g_k = (1/cos²θ_k) · (s_{k−1}^⊤ ∇²f(x_k) g_k)² / (s_{k−1}^⊤ ∇²f(x_k) s_{k−1}) ≈ 2 (s_{k−1}^⊤ g_k)² / (s_{k−1}^⊤ y_{k−1}),

where θ_k is the angle between (∇²f(x_k))^{1/2} g_k and (∇²f(x_k))^{1/2} s_{k−1}. Another way to es-
353 timate gk> (∇2 f (xk ))gk is by replacing ∇2 f (xk ) by a quasi-Newton matrix. We can also
354 obtain ρk by computing an extra function value f (xk + tgk ) and setting

(2.21)  ρ_k = 2( f(x_k + t g_k) − f(x_k) − t ‖g_k‖₂² ) / t².
356 By letting the second order curvature along gk to be the average of those along sk−i (i =
357 1, ..., m), we obtain
(2.22)  ρ_k = (‖g_k‖₂² / m) Σ_{i=1}^{m} (s_{k−i}^⊤ y_{k−i}) / (s_{k−i}^⊤ s_{k−i}).

359 Similar to (2.14), a slightly different subspace is

360 (2.23) Sk = span{−gk , yk−1 , ..., yk−m } .

361 In this case, any vector in Sk can be represented as


(2.24)  d = α g_k + Σ_{i=1}^{m} β_i y_{k−i} = W_k d̄

363 where Wk = [−gk , yk−1 , ..., yk−m ] ∈ Rn×(m+1) . The Newton step in the subspace Sk is
W_k d̄_k with

(2.25)  d̄_k = −( W_k^⊤ ∇²f(x_k) W_k )^{−1} W_k^⊤ ∇f(x_k).

Thus, the remaining issue is to obtain a good estimate of d̄_k, using the fact that all the elements of W_k^⊤ (∇²f(x_k))^{−1} W_k are known except one entry g_k^⊤ (∇²f(x_k))^{−1} g_k.

368 2.1.8. Subspace By Coordinate Directions. We next consider subspaces spanned


369 by coordinate directions with sparsity structures. Let gki be the i-th component of the gradient
370 gk . The absolute values |gki | are sorted in the descending order such that

(2.26)  |g_k^{i_1}| ≥ |g_k^{i_2}| ≥ |g_k^{i_3}| ≥ · · · ≥ |g_k^{i_n}|.

372 The subspace

373 (2.27) Sk = span{ei1 , ei2 , ..., eiτ }




is called the τ-steepest coordinate subspace, where e_i is a vector of all zeros except that the i-th component is one. Then, the steepest descent direction in the subspace is a sufficient
descent direction, namely

(2.28)  min_{d∈S_k} d^⊤ g_k / (‖d‖₂ ‖g_k‖₂) ≤ − τ/n.

When (g_k^{i_{τ+1}})² ≤ ε Σ_{j=1}^{τ} (g_k^{i_j})², the following estimation can be established:

(2.29)  min_{d∈S_k} d^⊤ g_k / (‖d‖₂ ‖g_k‖₂) ≤ − 1/√(1 + (n − τ)ε).
Consequently, a sequential steepest coordinates search (SSCS) technique can be designed
by augmenting the steepest coordinate directions into the subspace sequentially. For example,
consider minimizing a convex quadratic function
Q(x) = g^⊤ x + (1/2) x^⊤ B x.
380 At the k-th iteration of SSCS, we first compute gk = ∇Q(xk ), then choose

i_k = arg max_i {|g_k^i|}.

382 Let Sk = span{ei1 , ..., eik }. The next iteration is to find

x_{k+1} = arg min_{x ∈ x_k + S_k} Q(x).
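A minimal Python sketch of SSCS for the convex quadratic above (function names and data are ours): at each step it adds the coordinate with the largest gradient magnitude among those not yet selected and re-minimizes over the accumulated coordinate subspace.

```python
import numpy as np

def sscs(B, g, steps):
    """Sequential steepest coordinates search for Q(x) = g^T x + 0.5 x^T B x (B symmetric PD).
    The subspace S_k = span{e_{i_1}, ..., e_{i_k}} grows by one coordinate per iteration."""
    n = len(g)
    x = np.zeros(n)
    idx = []                                                    # selected coordinate indices
    for _ in range(min(steps, n)):
        grad = g + B @ x
        candidates = [i for i in range(n) if i not in idx]
        idx.append(max(candidates, key=lambda i: abs(grad[i]))) # steepest remaining coordinate
        sub = np.ix_(idx, idx)                                  # minimize Q over span{e_i : i in idx}
        x = np.zeros(n)
        x[idx] = np.linalg.solve(B[sub], -g[idx])
    return x

# usage
B = np.diag(np.arange(1.0, 6.0))
x = sscs(B, np.array([5.0, -3.0, 2.0, -1.0, 0.5]), steps=3)
```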

384 2.2. Trust Region Methods. The trust region methods for (2.1) compute a search
385 direction in a ball determined by a given trust region radius whose role is similar to the step
386 size. The trust region subproblem (TRS) is normally
(2.30)  min_{s∈R^n} Q_k(s) = g_k^⊤ s + (1/2) s^⊤ B_k s,   s.t. ‖s‖₂ ≤ Δ_k,

388 where Bk is an approximation to the Hessian ∇2 f (xk ) and ∆k > 0 is the trust region radius.
389 A subspace version of the trust region subproblem is suggested in [101]:

min_{s∈R^n} Q_k(s),   s.t. ‖s‖₂ ≤ Δ_k,  s ∈ S_k.

391 The Steihaug truncated CG method [107] for solving (2.30) is essentially a subspace method.
392 When the approximate Hessian Bk is generated by the quasi-Newton updates such as the SR1,
393 PSB or the Broyden family [111, 88], the TRS has subspace properties. Suppose B1 = σI
394 with σ > 0, let sk be an optimal solution of TRS (2.30) and set xk+1 = xk + sk . Let
G_k = span{g_1, g_2, · · · , g_k}. Then it can be proved that s_k ∈ G_k, and for any z ∈ G_k, w ∈ G_k^⊥, it holds

(2.31)  B_k z ∈ G_k,   B_k w = σ w.

398 Therefore, the subspace trust region algorithm generates the same sequences as the full space
399 trust region quasi-Newton algorithm. Based on the above results, Wang and Yuan [128]


400 presented a subspace trust region quasi-Newton method for large scale unconstrained opti-
401 mization. Similar results for the line search quasi-Newton methods were obtained by Gill
402 and Leonard [44, 43].
403 We next discuss a special trust region subproblem which can make good use of subspace
properties. If the norm ‖·‖₂ is replaced by a general norm ‖·‖_W in (2.30), we can obtain a general TRS subproblem

min_{s∈R^n} g^⊤ s + (1/2) s^⊤ B s,   s.t. ‖s‖_W ≤ Δ.
Here, the subscript k in g_k and B_k is omitted for simplicity. Intuitively, we should choose the norm ‖·‖_W properly so that the TRS can easily be solved by using the corresponding subspace properties of the objective function g^⊤ s + (1/2) s^⊤ B s. Assume that B is a limited memory quasi-Newton matrix which can be expressed as

B = σI + P D P^⊤,   P ∈ R^{n×l},
where P^⊤ P = I. Let P_⊥^⊤ be the projection onto the space orthogonal to Range(P). Then the following function

(2.32)  ‖s‖_P = max{ ‖P^⊤ s‖_∞, ‖P_⊥^⊤ s‖₂ }
410 is a well-defined norm, which leads to a P -norm TRS:
(2.33)  min_{s∈R^n} g^⊤ s + (1/2) s^⊤ B s,   s.t. ‖s‖_P ≤ Δ.
Due to the definition of the norm ‖·‖_P, the solution s of the TRS (2.33) can be expressed by
413 s = P s1 + P⊥ s2 ,
414 where s1 and s2 can be computed easily. In fact, s1 is the solution of the box constrained
415 quadratic program (QP):
min_{s∈R^l} s^⊤ (P^⊤ g) + (1/2) s^⊤ (σI + D) s,   s.t. ‖s‖_∞ ≤ Δ.
It can be verified that s_1 has a closed form solution:

(s_1)_i = { −(P^⊤ g)_i / (σ + D_{ii}),   if |(P^⊤ g)_i| < (σ + D_{ii}) Δ;
            Δ · sign(−(P^⊤ g)_i),        otherwise, }
for i = 1, ..., l. On the other hand, s_2 is the solution of the 2-norm constrained special QP

min_{s∈R^{n−l}} s^⊤ (P_⊥^⊤ g) + (σ/2) s^⊤ s,   s.t. ‖s‖₂ ≤ Δ,

whose closed-form solution is

s_2 = − min{ 1/σ, Δ/‖P_⊥^⊤ g‖ } P_⊥^⊤ g.
Numerical results based on a trust region algorithm that uses the W-norm TRS were
424 shown in [15].
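The two closed-form pieces above can be sketched directly. In the code below (our own names and toy data), P, D, sigma, g, and Delta are assumed given, with σ + D_ii > 0.

```python
import numpy as np

def p_norm_trs(g, sigma, P, D, Delta):
    """Solve the P-norm TRS for B = sigma*I + P*diag(D)*P^T with P^T P = I.
    The solution splits as s = P s1 + P_perp s2: s1 solves a separable box-constrained QP,
    and s2 a 2-norm constrained QP with a closed form."""
    Pg = P.T @ g
    # s1: componentwise minimizer subject to ||s||_inf <= Delta
    s1 = np.where(np.abs(Pg) < (sigma + D) * Delta,
                  -Pg / (sigma + D),
                  Delta * np.sign(-Pg))
    # s2 in the complement: P_perp s2 = -min{1/sigma, Delta/||g_perp||} g_perp,
    # where g_perp = (I - P P^T) g is the component of g orthogonal to Range(P)
    g_perp = g - P @ Pg
    nrm = np.linalg.norm(g_perp)
    t = min(1.0 / sigma, Delta / nrm) if nrm > 0 else 0.0
    return P @ s1 - t * g_perp

# usage with a small random limited-memory-like model
rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.standard_normal((30, 4)))
s = p_norm_trs(rng.standard_normal(30), sigma=2.0, P=P,
               D=np.array([1.0, 0.5, -0.3, 0.2]), Delta=1.0)
```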


425 3. Nonlinear Equations and Nonlinear Least Squares Problem. In this sec-
426 tion, we consider the system of nonlinear equations

427 (3.1) F (x) = 0, x ∈ Rn ,

428 and nonlinear least squares problem:

(3.2)  min_{x∈R^n} ‖F(x)‖₂²,

where F(x) = (F¹(x), F²(x), . . . , F^m(x))^⊤ ∈ R^m.


431 3.1. General Subspace Methods. Due to the special structures of nonlinear equa-
432 tions, several implementations of Newton-like iteration schemes based on Krylov subspace
433 projection methods are considered in [14]. Newton–Krylov methods with a global strategy
434 restricted to a suitable Krylov subspace are developed in [7]. Because the nonlinear least
435 squares problem (3.2) is also an unconstrained optimization problem, all the subspace tech-
436 niques discussed in Section 2 can be applied. For example, assume that there are ik lin-
early independent vectors {q_k^1, q_k^2, ..., q_k^{i_k}} which span S_k. Let Q_k = [q_k^1, q_k^2, ..., q_k^{i_k}]. Then
d ∈ S_k can be expressed as Q_k z with respect to a variable z ∈ R^{i_k}. For (3.1), one can find a
439 subspace step from

440 (3.3) F (xk + Qk z) = 0.

441 The linearized system for subproblem (3.3) is

442 (3.4) F (xk ) + Jk Qk z = 0,

where J_k is the Jacobian of F at x_k. Similarly, one can construct a subspace type of the
444 Levenberg-Marquardt method for (3.2) as

λk
445 min kF (xk ) + Jk Qk zk22 + kzk22 ,
z 2
446 where λk is a regularization parameter adjusted to ensure global convergence.
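A small sketch of this subspace Levenberg-Marquardt step, assuming products of the form J_k Q_k are affordable; solving the damped problem through its regularized normal equations is one standard choice, and the names and toy residual below are ours.

```python
import numpy as np

def subspace_lm_step(F_k, J_k, Q_k, lam):
    """One subspace Levenberg-Marquardt step: minimize ||F_k + J_k Q_k z||^2 + (lam/2)||z||^2
    over the reduced variable z, then lift the step back as d = Q_k z."""
    A = J_k @ Q_k                                    # m x i_k reduced Jacobian
    M = A.T @ A + 0.5 * lam * np.eye(A.shape[1])     # normal equations of the damped problem
    z = np.linalg.solve(M, -A.T @ F_k)
    return Q_k @ z

# usage on a toy residual F(x) = (x1^2 - 1, x1*x2 - 2) with subspace basis Q = [e_1]
x = np.array([1.5, 1.0])
F = np.array([x[0]**2 - 1.0, x[0]*x[1] - 2.0])
J = np.array([[2*x[0], 0.0], [x[1], x[0]]])
Q = np.eye(2)[:, :1]
d = subspace_lm_step(F, J, Q, lam=1e-2)
```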
447 3.2. Subspace by Subsampling/Sketching. We start from solving a linear least
448 squares problem on massive data sets:

(3.5)  min_x ‖Ax − b‖₂²,

450 where A ∈ Rm×n and b ∈ Rm . Although applying the direct or iterative methods to (3.5) is
451 straightforward, it may be prohibitive for large values of m. The sketching technique chooses
452 a matrix W ∈ Rr×m with r  m and formulates a reduced problem

(3.6)  min_x ‖W(Ax − b)‖₂².

454 It can be proved that the solution of (3.6) can be a good approximation to that of (3.5) in
455 certain sense if the matrix W is chosen suitably. For example, one may randomly select r
456 rows from the identity matrix to form W so that W A is a submatrix of A. Another choice is
457 that each element of W is sampled from an i.i.d. normal random variable with mean zero and
458 variance 1/r. These concepts have been extensively investigated for randomized algorithms
459 in numerical linear algebra [78, 136].


460 For nonlinear equations, the simple sketching approach is to ignore some equations.
461 Instead of requiring the original system (3.1), we consider

462 (3.7) F i (x) = 0, i ∈ Ik ,

which is an incomplete set of equations. To solve the nonlinear equations (3.1) is to find an x
464 at which F maps to the origin [141]. Let Pk> be a mapping from Rm to a lower dimensional
465 subspace. Solving the reduced system

466 (3.8) Pk> F (x) = 0

467 is exactly replacing F = 0 by requiring its mapping to the subspace spanned by Pk to be


zero. Combining this with (3.3) yields

(3.9)  P_k^⊤ F(x_k + Q_k z) = 0.

470 The linearized system for subproblem (3.9) is

471 (3.10) Pk> [ F (xk ) + Jk Qk z ] = 0 .

472 Of course, the efficiency of such an approach depends on how to select Pk and Qk . We can
473 borrow ideas from subspace techniques for large scale linear systems [98]. Instead of using
474 (3.10), we can use a subproblem of the following form:

475 (3.11) Pk> F (xk ) + Jˆk z = 0 ,

476 where Jˆk ∈ Rik ×ik is an approximation to Pk> Jk Qk . The reason for preferring (3.11) over
477 (3.10) is that in (3.11) we do not need the Jacobian matrix Jk , whose size is normally signifi-
478 cantly larger than that of Jˆk .
A similar idea has also been studied for nonlinear least squares problems. At the k-th
480 iteration, we consider minimizing the sum of squares of some randomly selected terms in an
481 index set Ik ⊂ {1, ..., m} instead of all terms:
(3.12)  min_{x∈R^n} Σ_{i∈I_k} (F^i(x))².

This approach works quite well on the distance geometry problem, which has many applications including protein structure prediction, where the nonlinear least squares problem over all the terms has many local minimizers [103]. Combining subspaces with sketching yields a counterpart
486 to (3.9) for nonlinear least squares:

(3.13)  min_{d∈S_k} ‖P_k^⊤ F(x_k + d)‖₂².
d∈Sk

488 A projected nonlinear least squares method is studied in [57] to fit a model ψ to (noisy)
489 measurements y for the exponential fitting problem:

(3.14)  min_{x∈R^n} ‖ψ(x) − y‖₂²,

where ψ(x) ∈ R^m and n ≪ m. Since computing the Jacobian of (3.14) can be expensive,
492 a sequence of low-dimensional surrogate problems are constructed from a sequence of sub-
493 spaces {W` } ⊂ Rm . Let PW` be an orthogonal projection onto W` and W` is an orthonormal
494 basis for W` , i.e., PW` = W` W`> . Then it solves the following minimization problem:

min_x ‖P_{W_ℓ}[ψ(x) − y]‖₂² = min_x ‖W_ℓ^⊤ ψ(x) − W_ℓ^⊤ y‖₂².


496 3.3. Partition of Variables. We now consider the partition of variables, which is also
497 a subspace technique for nonlinear least squares problem. Let Ik be a subset of {1, ..., n}.
The variables are partitioned into two groups x = (x̄, x̂), where x̄ = (x_i, i ∈ I_k) and
499 x̂ = (xi , i 6∈ Ik ). At the k-th iteration, the variables xi (i 6∈ Ik ) are fixed and xi (i ∈ Ik ) are
500 free to be changed in order to obtain a better iterate point. This procedure yields a subproblem
501 with fewer variables:
(3.15)  min_{x̄∈R^{|I_k|}} Σ_{i=1}^{n} (F^i(x̄, x̂_k))².

It is easy to see that the partition of variables uses special subspaces spanned by coordinate directions. An obvious generalization of the partition of variables is decomposition of the space, which uses subspaces spanned by any given directions.
506 3.4. τ −steepest Descent Coordinate Subspace. The τ −steepest descent coor-
507 dinate subspace discussed in Section 2 can also be extended to nonlinear equations and non-
508 linear least squares. Assume that

509 (3.16) |F i1 (xk )| > · · · > |F iτ (xk )| > · · ·

510 at the k−th iteration. If F (x) is a monotone operator, applying the method in a subspace
511 spanned by the coordinate directions {eij , j = 1, ..., τ } generates a system

512 (3.17) F ij (xk ) + d> ∇F ij (xk ) = 0, j = 1, ..., τ .

For general nonlinear functions F (x), it is better to replace eij by the steepest descent coor-
dinate direction of the function F^{i_j}(x) at x_k, i.e., substituting i_j by an index l_j such that

l_j = arg max_{t=1,...,n} |∂F^{i_j}(x_k)/∂x_t|.

However, it may happen that two different j share the same l_j, so that subproblem (3.17) has no
514 solution in the subspace spanned by {el1 , ..., elτ }. Further discussion on subspace methods
515 for nonlinear equations and nonlinear least squares can be found in [141].
516 4. Stochastic Optimization. The supervised learning model in machine learning as-
517 sumes that (a, b) follows a probability distribution P , where a is an input data and b is the
518 corresponding label. Let φ(a, x) be a prediction function in a certain functional space and
519 `(·, ·) represent a loss function to measure the accuracy between the prediction and the true la-
520 bel. The task is to predict a label b from a given input a, that is, finding a function φ such that
521 the expected risk E[`(φ(a, x), b)] is minimized. In practice, the real probability distribution
522 P is unknown, but a data set D = {(a1 , b1 ), (a2 , b2 ), · · · , (aN , bN )} is obtained by random
523 sampling, where (ai , bi ) ∼ P i.i.d. Then an empirical risk minimization is formulated as
(4.1)  min_x f(x) := (1/N) Σ_{i=1}^{N} f_i(x),

where f_i(x) = ℓ(φ(a_i; x), b_i). In machine learning, a large number of problems can be
526 expressed in the form of (4.1). For example, the function φ in deep learning is expressed
527 by a deep neural network. Since the size N usually is huge, it is time consuming to use the
528 information of all components fi (x). However, it is affordable to compute the information at
529 a few samples so that the amount of calculation in each step is greatly reduced.


530 4.1. Stochastic First-order Methods. In this subsection, we briefly review a few
531 widely used stochastic first-order methods [47]. Instead of using the full gradient ∇f (xk ),
532 the stochastic gradient method (SGD) for (4.1) selects a uniformly random sample sk from
533 {1, . . . , N } and updates
534 (4.2) xk+1 = xk − αk ∇fsk (xk ).
535 A common assumption for convergence is that the expectation of the stochastic gradient is
536 equal to the full gradient, i.e.,
537 E[∇fsk (xk ) | xk ] = ∇f (xk ).
538 When fi (xk ) is not differentiable, then its subgradient is used in (4.2). Note that only one
539 sample is used in (4.2). The mini-batch SGD method tries to balance between the robustness
540 of the SGD and the computational cost. It randomly selects a mini-batch Ik ⊂ {1, . . . , N }
such that |I_k| is quite small, then computes

(4.3)  x_{k+1} = x_k − (α_k/|I_k|) Σ_{s_k∈I_k} ∇f_{s_k}(x_k).

543 Obviously, subsampling defines a kind of subspace in terms of the component functions
544 {f1 (x), . . . , fN (x)}. For simplicity, we next only consider extensions based on (4.2).
545 After calculating a random gradient ∇fsk (xk ) at the current point, the momentum method
546 does not directly update it to the variable xk . It introduces a speed variable v, which represents
547 the direction and magnitude of the parameter movements. Formally, the iterative scheme is
(4.4)  v_{k+1} = µ_k v_k − α_k ∇f_{s_k}(x_k),   x_{k+1} = x_k + v_{k+1}.
549 This new update direction v is a linear combination of the previous update direction vk and
550 the gradient ∇fsk (xk ) to obtain a new vk+1 . When µk = 0, the algorithm degenerates to
551 SGD. In the momentum method, the parameter µk is often in the range of [0, 1). A typical
552 value is µk ≥ 0.5, which means that the iteration point has a large inertia and each iteration
553 will make a small correction to the previous direction.
554 Since the dimension of the variable x can be more than 10 million and the convergence
555 speed of each variable may be different, updating all components of x using a single step size
556 αk may not be suitable. The adaptive subgradient method (AdaGrad) controls the step sizes of
each component separately by monitoring the elementwise accumulation of the gradients:

G_k = Σ_{i=1}^{k} ∇f_{s_i}(x_i) ⊙ ∇f_{s_i}(x_i),

where ⊙ denotes the Hadamard product between two vectors. The AdaGrad method is

(4.5)  x_{k+1} = x_k − (α_k/√(G_k + e_n)) ∇f_{s_k}(x_k),
       G_{k+1} = G_k + ∇f_{s_{k+1}}(x_{k+1}) ⊙ ∇f_{s_{k+1}}(x_{k+1}),

where the division in α_k/√(G_k + e_n) is also performed elementwise. Adding the term e_n prevents division by zero.
563 The adaptive moment estimation (Adam) method combines the momentum and AdaGrad
564 together by adding a few small corrections. At each iteration, it performs:
565 vk = ρ1 vk−1 + (1 − ρ1 )∇fsk (xk ),


G_k = ρ_2 G_{k−1} + (1 − ρ_2) ∇f_{s_k}(x_k) ⊙ ∇f_{s_k}(x_k),
v̂_k = v_k / (1 − ρ_1^k),
Ĝ_k = G_k / (1 − ρ_2^k),
x_{k+1} = x_k − (α_k/√(Ĝ_k + e_n)) v̂_k,
570 where the typical values for ρ1 and ρ2 are ρ1 = 0.9 and ρ2 = 0.999. We can see that the
571 direction vk is a convex combination of vk−1 and ∇fsk (xk ), then it is corrected to v̂k . The
572 value Ĝk is also obtained in a similar fashion. The main advantage of Adam is that after the
573 deviation correction, the step size of each iteration has a certain range, making the parameters
574 relatively stable.
575 The above algorithms have been implemented in mainstream deep learning frameworks,
576 which can be very convenient for training neural networks. The algorithms implemented
577 in Pytorch are AdaDelta, AdaGrad, Adam, Nesterov, RMSProp, etc. The algorithms im-
578 plemented in Tensorflow are AdaDelta, AdaGradDA, AdaGrad, ProximalAdagrad, Ftrl, Mo-
579 mentum, Adam and CenteredRMSProp, etc.
580 4.2. Stochastic Second-Order method. The subsampled Newton method takes an
additional random set I_k^H ⊂ {1, . . . , N} independent of I_k and computes a search direction as

d_k = −( (1/|I_k^H|) Σ_{i∈I_k^H} ∇²f_i(x_k) )^{−1} (1/|I_k|) Σ_{s_k∈I_k} ∇f_{s_k}(x_k).

583 Therefore, the subspace techniques in section 2 can also be adopted here.
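A sketch of the subsampled Newton direction above, assuming the component gradients and Hessians are available per sample; in practice a small damping may be added to keep the subsampled Hessian invertible. The names and toy data are ours.

```python
import numpy as np

def subsampled_newton_direction(grad_fi, hess_fi, x, n_samples, batch_g=64, batch_h=16, seed=0):
    """d = -(average Hessian over I_H)^{-1} (average gradient over I), with I and I_H independent."""
    rng = np.random.default_rng(seed)
    I = rng.choice(n_samples, size=batch_g, replace=False)
    I_H = rng.choice(n_samples, size=batch_h, replace=False)
    g = np.mean([grad_fi(i, x) for i in I], axis=0)
    H = np.mean([hess_fi(i, x) for i in I_H], axis=0)
    return np.linalg.solve(H, -g)

# usage with least squares components f_i(x) = 0.5 (a_i^T x - b_i)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5)); b = rng.standard_normal(500)
d = subsampled_newton_direction(lambda i, x: (A[i].dot(x) - b[i]) * A[i],
                                lambda i, x: np.outer(A[i], A[i]),
                                np.zeros(5), 500)
```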
584 Assume that the loss function is the negative logarithm probability associated with a
585 distribution with a density function p(y|a, x) defined by the neural network and parameterized
586 by x. The so-called KFAC method [79] is based on the Kronecker-factored approximate
587 Fisher matrix. Take an L-layer feed-forward neural network for example. Namely, each layer
588 j ∈ {1, 2, . . . , L} is given by
589 (4.6) sj = Tj wj−1 , wj = ψj (sj ),
590 where w0 = a is the input of the neural network, wL (x) ∈ Rm is the output of the neural
591 network under the input a, the constant term 1 is not considered in wj−1 for simplicity, Tj is
592 the weight matrix and ψj is the block-wise activation function. The jth diagonal block of F
593 corresponding to the parameters in the jth layer using a sample set B can be written in the
594 following way:
595 (4.7) F j := Qj−1,j−1 ⊗ Gj,j ,
596 where
Q_{j−1,j−1} = (1/|B|) Σ_{i∈B} w_{j−1}^i (w_{j−1}^i)^⊤,
G_{j,j} = (1/|B|) Σ_{i∈B} E_{z∼p(z|a_i,x)}[ g̃_j^i(z) g̃_j^i(z)^⊤ ],

and g̃_j^i(z) := ∂ℓ(φ(a_i, x), z)/∂s_j^i. Therefore, the KFAC method computes a search direction in the jth layer from

F^j d_k^j = −g_k^j,


601 where gkj is the corresponding subsampled gradient in the jth layer.
602 5. Sparse Optimization.
603 5.1. Basis Pursuit. Given a matrix A ∈ Rm×n and a vector b ∈ Rm such that m 
604 n, basis pursuit is to find the sparsest signal among all solutions of the equation Ax = b. It
605 leads to a NP-hard problem:

(5.1)  min_x ‖x‖₀,   s.t. Ax = b,

where ‖x‖₀ = |{j | x_j ≠ 0}|, i.e., the number of nonzero elements of x. An exact
608 recovery of the sparest signal often requires the so-called restricted isometry property (RIP),
609 i.e., there exists a constant δr such that

610 (1 − δr )kxk22 ≤ kAxk22 ≤ (1 + δr )kxk22 , whenever kxk0 ≤ r.

611 The greedy pursuit methods build up an approximation in a subspace at the k-th iteration.
612 Let Tk be a subset of {1, . . . , n}, xTk be a subvector of x corresponding to the set Tk and
613 ATk be a column submatrix of A whose column indices are collected in the set Tk . Then the
614 subspace problem is
x_{T_k}^k = arg min_x (1/2) ‖A_{T_k} x − b‖₂².
Clearly, the solution is x_{T_k}^k = A_{T_k}^† b, where A_{T_k}^† is the pseudoinverse of A_{T_k}. Since the size of
T_k is controlled to be small, A_{T_k} generally has full column rank due to the RIP property. All
618 other elements of xk whose indices are in the complementary set of Tk are set to 0.
619 We next explain the choices of Tk . Assume that the initial index set T0 is empty. The
620 orthogonal matching pursuit (OMP) [116] computes the gradient

g_k = A^⊤ (A_{T_k} x_{T_k}^k − b),

622 then selects an index such that tk = arg maxj=1,...,n |gj |. If multiple indices attain the
623 maximum, one can break the tie deterministically or randomly. Then the index set at the next
624 iteration is augmented as
625 Tk+1 = Tk ∪ {tk }.
626 The CoSaMP [83] method generates an s-sparse solution, i.e., the number of nonzero com-
627 ponents is at most s. Let (xk )s be a truncation of xk such that only the s largest entries in the
628 absolute values are kept and all other elements are set to 0. The support of (xk )s is denoted as
629 supp((xk )s ). Then a gradient gk is computed at the point (xk )s and the set Tk+1 is updated
630 by
631 Tk+1 = supp((gk )2s ) ∪ supp((xk )s ).
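A minimal OMP sketch following the description above (names and data are ours): at each step the index with the largest correlation is added and the coefficients are re-fitted by least squares on the selected columns.

```python
import numpy as np

def omp(A, b, sparsity):
    """Orthogonal matching pursuit: greedily grow the support T and set x_T = A_T^+ b."""
    n = A.shape[1]
    T = []
    x = np.zeros(n)
    for _ in range(sparsity):
        g = A.T @ (A[:, T] @ x[T] - b) if T else -A.T @ b      # gradient of 0.5*||A_T x_T - b||^2
        t = int(np.argmax(np.abs(g)))                          # index of largest correlation
        if t in T:
            break
        T.append(t)
        x = np.zeros(n)
        x[T], *_ = np.linalg.lstsq(A[:, T], b, rcond=None)     # x_T = pseudoinverse of A_T times b
    return x

# usage: recover a 3-sparse signal
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)); A /= np.linalg.norm(A, axis=0)
x_true = np.zeros(100); x_true[[5, 17, 60]] = [1.0, -2.0, 0.5]
x_hat = omp(A, A @ x_true, sparsity=3)
```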
632 5.2. Active Set Methods. Consider the `1 -regularized minimization problem

(5.2)  min_{x∈R^n} ψ_µ(x) := µ ‖x‖₁ + f(x),

634 where µ > 0 and f (x) : Rn → R is continuously differentiable. The optimality condition of
635 (5.2) is that there exists a vector

= −µ,
 xi > 0,
i
636 (5.3) (∇f (x)) = +µ, xi < 0,

∈ [−µ, µ], otherwise.



637 A two-stage active-set algorithm called “FPC AS” is proposed in [133]. First, an active set
638 is identified by a first-order type method using the so-called “shrinkage” operation. Then,
639 a smooth and smaller subproblem is constructed based on this active set and solved by a
640 second-order type method. These two operations are iterated until convergence criteria are
641 satisfied. While shrinkage is very effective in obtaining a support superset, it can take a lot
642 of steps to find the corresponding values. On the other hand, if one imposes the signs of the
643 components of the variable x that are the same as those of the exact solution, problem (5.2)
644 reduces to a small smooth optimization problem, which can be relatively easily solved to
645 obtain x. Consequently, the key components are the identification of a “good” support set by
646 using shrinkage and the construction of a suitable approximate smooth optimization problem.
647 The iterative shrinkage procedure for solving (5.2) is indeed a proximal gradient method.
648 Given an initial point x0 , the algorithm iteratively computes

x_{k+1} := arg min_x µ ‖x‖₁ + (x − x_k)^⊤ g_k + (1/(2α_k)) ‖x − x_k‖₂²,

650 where gk := ∇f (xk ) and αk > 0. A simple calculation shows that

651 (5.4) xk+1 = S (xk − αk gk , µαk ) ,

652 where for y ∈ Rn and ν ∈ R, the shrinkage operator is defined as

S(y, ν) = arg min_x ν ‖x‖₁ + (1/2) ‖x − y‖₂² = sign(y) ⊙ max{ |y| − ν, 0 }.

655 Note that the scheme (5.4) first executes a gradient step with a step size αk , then performs
656 a shrinkage. In practice, αk can be computed by a non-monotone line search in which the
657 initial value is set to the BB step size. The convergence of (5.4) has been studied in [53]
658 under suitable conditions on αk and the Hessian ∇2 f . An appealing feature proved in [53] is
659 that (5.4) yields the support and the signs of the minimizer x∗ of (5.2) after a finite number
660 of steps under favorable conditions. For more references related to shrinkage, the reader is
661 referred to [133].
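A short Python sketch of the shrinkage iteration (5.4) for the quadratic case f(x) = 0.5‖Ax − b‖₂²; the fixed step size 1/‖A‖₂² is one simple valid choice, and the names and data below are ours.

```python
import numpy as np

def shrink(y, nu):
    """Soft-thresholding operator S(y, nu) = sign(y) * max(|y| - nu, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)

def ista(A, b, mu, max_iter=500):
    """Proximal gradient (iterative shrinkage) for min_x mu*||x||_1 + 0.5*||Ax - b||_2^2."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2          # step size 1/L with L = ||A^T A||_2
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T @ (A @ x - b)                        # gradient of the smooth part
        x = shrink(x - alpha * g, mu * alpha)        # x_{k+1} = S(x_k - alpha*g_k, mu*alpha)
    return x

# usage: the resulting support can serve as the active set for the second-stage subproblem
rng = np.random.default_rng(3)
A = rng.standard_normal((50, 200))
x_true = np.zeros(200); x_true[[3, 77, 150]] = [2.0, -1.5, 1.0]
x = ista(A, A @ x_true, mu=0.1)
support = np.flatnonzero(np.abs(x) > 1e-8)
```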
662 We now describe the subspace optimization in the second stage. For a given vector
663 x ∈ Rn , the active set is denoted by A(x) and the inactive set (or support) is denoted by I(x)
664 as follows

665 (5.5) A(x) := {i ∈ {1, · · · , n} | |xi | = 0} and I(x) := {i ∈ {1, · · · , n} | |xi | > 0}.

666 We require that each component xi either has the same sign as xik or is zero, i.e., x is required
667 to be in the set

(5.6)  Ω(x_k) := { x ∈ R^n : sign(x_k^i) x^i ≥ 0, i ∈ I(x_k), and x^i = 0, i ∈ A(x_k) }.

669 Then, a smooth subproblem is formulated as either an essentially unconstrained problem

(5.7)  min_x µ sign(x_{I_k}^k)^⊤ x_{I_k} + f(x),   s.t. x_i = 0, i ∈ A(x_k),

or a simple bound-constrained problem

(5.8)  min_x µ sign(x_{I_k}^k)^⊤ x_{I_k} + f(x),   s.t. x ∈ Ω(x_k).


673 Since the size of the variables in (5.7) and (5.8) is relatively small, these two problems can be
674 solved efficiently by methods such as L-BFGS-B. If f (x) is quadratic, problem (5.7) can be
675 solved by the CG method for a system of linear equations.
676 The active set strategies have also been studied in [105, 64]. Specifically, the method
677 in [64] solves a smooth quadratic subproblem determined by the active sets and invokes a
678 corrective cycle that greatly improves the efficiency and robustness of the algorithm. The
679 method is globalized by using a proximal gradient step to check the desired progress.
680 6. The Domain Decomposition Methods.
681 6.1. A Two-level Subspace Method. Consider an infinite dimensional minimiza-
682 tion problem

(6.1)  min_{x∈V} F(x),

684 where F is a mapping from an infinite-dimensional space V to R. Many large-scale finite


685 dimensional optimization problems arise from infinite dimensional optimization problems
686 [28]. Since explicit solutions for these problems are usually not available, we solve the dis-
687 cretized version of them from the “discretize-then-optimize” strategy by using the concept of
688 multilevel optimization.
689 Let Vh be a finite dimensional subset of V at the grid level h, for example, a standard
690 finite element space associated with the grid level h. For consecutive coarser levels, we
691 choose nested spaces, so that V1 ⊂ · · · ⊂ VN−1 ⊂ VN ⊂ V, where N is reserved for the
692 index of the finest level and 1 for the coarsest level. The functional F(x) restricted on Vh is
693 constructed as fh (xh ) for xh ∈ Vh . Therefore, the discretization of problem (6.1) is

(6.2)  min_{x_h∈V_h} f_h(x_h).

695 Let xh,k be a vector where the first subscript h denotes the discretization level of the
696 multigrid and the second subscript k denotes the iteration count. We next briefly describe a
two-level subspace method in [24]. Instead of simply finding a point x_{h,k+1} in the coarser grid space V_H, we seek a point x_{h,k+1} in S_{h,k} + V_H satisfying some conditions, where S_{h,k} is a subspace including descent information, such as the coordinate directions of the current and previous iterations or the gradient D_h f(x_{h,k}) in the finite space V_h. A typical choice is

702 (6.3) Sh,k = span{xh,k−1 , xh,k , Dh f (xh,k )} ⊆ Vh .

703 When xh,k is not optimal on a coarse level H ∈ {1, 2, . . . , N }, we go to this level and
704 compute a new solution xh,k+1 by

705 (6.4) xh,k+1 ≈ arg min f (x), s. t. x ∈ Sh,k + VH .

706 Otherwise, we find a point xh,k+1 ∈ Vh on level h.


The so-called full multigrid or mesh refinement technique can often help to generate
708 a good initial point so that the total number of iterations is reduced. Firstly, we solve the
709 target problem at the coarsest level which is computationally cheap. After an approximate
710 solution x_h^* at the current level is obtained, we prolongate it to the next finer level h + 1 by
711 interpolation to obtain an initial point, and then apply the two-level subspace method at this new
712 level to find a solution x∗h+1 . This process is repeated until the finest level is reached.
713 6.2. The Subspace Correction Method. We next briefly review the subspace cor-
714 rection methods [112]. Given the current point xh,k , the relaxation (or smoothing) procedure
715 is the operation on the current level h, namely, to find a direction dh,k to approximate the
716 solution of

717 (6.5)    min_{d ∈ V_h} f_h(x_{h,k} + d),

718 and the coarse grid correction procedure is the operation on the coarse level H, namely, to
719 find a direction d_{h,k} to approximate the solution of

720 (6.6)    min_{d ∈ V_H} f_h(x_{h,k} + d).

721 The concept of the subspace correction methods can be used to solve subproblems (6.5)
722 and (6.6). Let {φ_h^{(j)}}_{j=1}^{n_h} be a basis for V_h, where n_h is the dimension of V_h. Denote V_h
723 as a direct sum of the one-dimensional subspaces V_h = V_h^{(1)} ⊕ · · · ⊕ V_h^{(n_h)}. Then for each
724 j = 1, · · · , n_h in turn, we perform the following correction step for subproblem (6.5) at the
725 k-th iteration:

726 (6.7)    d_{h,k}^{(j)*} = arg min_{d_{h,k}^{(j)} ∈ V_h^{(j)}} f_h(x_{h,k} + d_{h,k}^{(j)}),    x_{h,k} = x_{h,k} + d_{h,k}^{(j)*}.

727 For subproblem (6.6), a similar strategy can be adopted by decomposing the space VH into a di-
728 rect sum. Global convergence of this algorithm has been proved in [113] for strictly convex
729 functions under some assumptions. The subspace correction method can be viewed as a gen-
730 eralization of the coordinate search method or the pattern search method.
731 6.3. Parallel Line Search Subspace Correction Method. In this subsection, we
732 consider the following optimization problem

733 (6.8)    min_{x ∈ R^n} ϕ(x) := f(x) + h(x),

734 where f(x) is a differentiable convex function and h(x) is a convex function that is possibly
735 nonsmooth. The ℓ1-regularized minimization (LASSO) [114] and the sparse logistic regres-
736 sion [100] are examples of (6.8). The parallel subspace correction (PSC) methods have been studied for LASSO in [36, 39]
737 and total variation minimization in [37, 38, 39, 68].
738 Suppose that Rn is split into p subspaces, namely,

739 (6.9) Rn = X 1 + X 2 + · · · + X p ,

where

X^i = {x ∈ R^n : supp(x) ⊂ J_i}, 1 ≤ i ≤ p,

740 such that J := {1, ..., n} and J = ∪_{i=1}^p J_i. For any i ≠ j, 1 ≤ i, j ≤ p, J_i ∩ J_j = ∅ holds in
741 a non-overlapping domain decomposition of R^n. Otherwise, there exist i, j ∈ {1, ..., p} with
742 i ≠ j such that J_i ∩ J_j ≠ ∅ in an overlapping domain decomposition of R^n.
743 Let ϕik be a surrogate function of ϕ restricted to the i-th subspace at k-th iteration. The
744 PSC framework for solving (6.8) is:

745 (6.10)    d_k^i = arg min_{d^i ∈ X^i} ϕ_k^i(d^i), i = 1, ..., p,
746    x_{k+1} = x_k + ∑_{i=1}^p α_k^i d_k^i.

747 The convergence can be proved if the step sizes α_k^i (1 ≤ i ≤ p) satisfy the conditions
748 ∑_{i=1}^p α_k^i ≤ 1 and α_k^i > 0 (1 ≤ i ≤ p). Usually, the step size α_k^i is quite small under these
749 conditions and convergence becomes slow. For example, the diminishing step size α_k^i = 1/p
750 tends to be smaller and smaller as the number of subspaces increases.
751 A parallel subspace correction method (PSCL) with the Armijo backtracking line search
752 for a large step size is proposed in [29]. At the k-th iteration, it chooses a surrogate function
753 ϕ_k^i and solves the subproblem (6.10) for each block, then computes a summation of the
754 directions d_k = ∑_{i=1}^p d_k^i. The next iterate is

755 xk+1 = xk + αk dk ,

756 where αk satisfies the Armijo backtracking conditions. When h(x) = 0 and f (x) is strongly
757 convex, the surrogate function can be set to the original objective function ϕ. Otherwise, it
758 can be a first-order Taylor expansion of the smooth part f (x) with a proximal term and the
759 nonsmooth part h(x):
760 (6.11)    ϕ_k^i(d^i) = ∇f(x_k)^⊤ d^i + (1/(2λ_i)) ‖d^i‖_2^2 + h(x_k + d^i), for d^i ∈ X^i.
761 Both non-overlapping and overlapping schemes can be designed for PSCL.
762 The directions from different subproblems can be equipped with different step sizes. Let
763 Z_k = [d_k^1, d_k^2, . . . , d_k^p]. The next iterate is set to

764 xk+1 = xk + Zk αk .

765 One can find αk as an optimal solution of

766    α_k = arg min_{α ∈ R^p} ϕ(x_k + Z_k α).

767 Alternatively, we can solve the following approximation:


768    α_k ≈ arg min_{α ∈ R^p} ∇f(x_k)^⊤ Z_k α + (1/(2t_k)) ‖Z_k α‖_2^2 + h(x_k + Z_k α).

769 The global convergence of PSCL is established by following the convergence analysis
770 of the subspace correction methods for strongly convex problems [112], the active-set method
771 for ℓ1 minimization [134] and the BCD method for nonsmooth separable minimization [119].
772 Specifically, a linear convergence rate is proved for the strongly convex case, and global convergence
773 to the solution set of problem (6.8) is obtained for the general nonsmooth case.
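The following sketch illustrates the PSCL framework on LASSO with a non-overlapping splitting, the surrogate (6.11) and an Armijo backtracking line search along the summed direction; it is a minimal illustration with hypothetical parameter choices rather than the implementation of [29].

```python
import numpy as np

def pscl_lasso(A, b, mu, n_blocks=4, lam=1.0, max_iter=100):
    """Parallel subspace correction with line search for
    min 0.5*||Ax-b||^2 + mu*||x||_1 on a non-overlapping coordinate splitting."""
    n = A.shape[1]
    blocks = np.array_split(np.arange(n), n_blocks)
    x = np.zeros(n)
    phi = lambda y: 0.5 * np.linalg.norm(A @ y - b) ** 2 + mu * np.abs(y).sum()
    for _ in range(max_iter):
        g = A.T @ (A @ x - b)                      # gradient of the smooth part
        d = np.zeros(n)
        for J in blocks:                           # independent (parallelizable) subproblems
            z = x[J] - lam * g[J]                  # minimizer of the surrogate (6.11) on block J
            d[J] = np.sign(z) * np.maximum(np.abs(z) - lam * mu, 0.0) - x[J]
        alpha, f0 = 1.0, phi(x)                    # Armijo backtracking on the summed direction
        while phi(x + alpha * d) > f0 - 1e-4 * alpha * np.dot(d, d) / lam and alpha > 1e-10:
            alpha *= 0.5
        x = x + alpha * d
    return x
```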
774 7. General Constrained Optimization. In this section, we first present subspace
775 methods for solving general equality constrained optimization problems:

776 (7.1)    min_{x ∈ R^n} f(x)    s. t. c(x) = 0,

777 where c(x) = (c1 (x), · · · , cm (x))> , f (x) and ci (x) are real functions defined in Rn and at
778 least one of the functions f (x) and ci (x) is nonlinear. Note that inequality constraints can also
779 be added to (7.1) but they are omitted to simplify our discussion in the first few subsections.
780 In the last subsection, we discuss methods for bound-constrained minimization problems.
781 Problem (7.1) is often minimized by computing solutions of a sequence of subproblems which
782 are simpler than (7.1) itself. However, they are still large-scale linear or quadratic problems
783 because normally subproblems are also defined in the same dimensional space as the original
784 nonlinear problem.
785 7.1. Direct Subspace Techniques. The sequential quadratic programming (SQP)
786 is an important method for solving (7.1). It successively minimizes quadratic approximations
787 to the Lagrangian function subject to the linearized constraints. Let Qk (d) be a quadratic
788 approximation to the Lagrangian function of (7.1) at the k-th iteration:

789 (7.2)    Q_k(d) = g_k^⊤ d + (1/2) d^⊤ B_k d,
790 where gk = ∇f (xk ) and Bk is an approximation to the Hessian of the Lagrangian function.
791 The search direction dk of a line search type SQP method is obtained by solving the following
792 QP subproblem

793 (7.3)    min_{d ∈ R^n} Q_k(d)
794
795 (7.4)    s. t. c(x_k) + A_k^⊤ d = 0,

796 where Ak = ∇c(xk ). Although the SQP subproblem is simpler than (7.1), it is still difficult
797 when the dimension n is large.
798 In general, the subspace SQP method constructs the search direction dk by solving a QP
799 in a subspace:

800 (7.5)    min_{d ∈ R^n} Q_k(d)    s. t. c_k + A_k^⊤ d = 0, d ∈ S_k,

where Sk is a subspace. Lee et al. [70] considered the following choice:

Sk = span{gk , sk−m̄ , ..., sk−1 , ȳk−m̄ , ..., ȳk−1 , ∇c1 (xk ), ..., ∇cm (xk )},

801 where m̄ is the memory size of the limited memory BFGS method for constructing Bk in
802 (7.2), and ȳi is a linear combination of yi and Bi si . Let Uk be a matrix of linearly independent
803 vectors in Sk . A reduced constrained version of (7.5) is

804 (7.6)    min_z (U_k^⊤ g_k)^⊤ z + (1/2) z^⊤ U_k^⊤ B_k U_k z    s. t. T_k^⊤ (c_k + A_k^⊤ U_k z) = 0,

805 where Tk is a projection matrix so that the constraints are not over-determined.
806 7.2. Second Order Correction Steps. The SQP step dk can be decomposed into
807 two parts d_k = h_k + v_k where v_k ∈ Range(A_k) and h_k ∈ Null(A_k^⊤). Thus, v_k is a solution of
808 the linearized constraint (7.4) in the range space of A_k, while h_k is the minimizer of the
809 quadratic function Q_k(v_k + d) in the null space of A_k^⊤.
810 One good property of the SQP method is its superlinear convergence rate, namely when
811 xk is close to a Karush–Kuhn–Tucker (KKT) point x∗ it holds

812 (7.7)    ‖x_k + d_k − x^*‖ = o(‖x_k − x^*‖).


However, a superlinearly convergent step dk may generate a point that seems “bad” since
it may increase both the objective function and the constraint violations. Even though (7.7)
holds, the Maratos effect shows that it is possible for the SQP step dk to have both
f (xk + dk ) > f (xk ), kc(xk + dk )k > kc(xk )k.
813 The second order correction step method [35, 80] solves a QP subproblem whose constraints
814 (7.4) are replaced by
815 (7.8)    c(x_k + d_k) + A_k^⊤ (d − d_k) = 0,

816 because the left hand side of (7.8) is a better approximation to c(xk + d) close to the point
817 d = dk . Since the modification of the constraints is a second order term, the new step can
818 be viewed as the SQP step d_k plus a second order correction step d̂_k. Consequently, the
819 Maratos effect is overcome. For detailed discussions on the SQP method and the second
820 order correction step, we refer the reader to [111].
821 We now examine the second order correction step from the subspace point of view. It can
822 be verified that the second order correction step d̂_k is a solution of

823    min_{d ∈ R^n} Q_k(d_k + d)    s. t. c(x_k + d_k) + A_k^⊤ d = 0.

824 Compute the QR factorization


 
825    A_k = [Y_k, Z_k] [R_k; 0]
826 and assume that Rk is nonsingular. Therefore, the second order correction step can be repre-
827 sented as dˆk = v̂k + ĥk , where v̂k = −Yk Rk−T c(xk + dk ) and ĥk is the minimizer of
828 (7.9)    min_{h ∈ Null(A_k^⊤)} Q_k(d_k + v̂_k + h).

829 Since dk is the SQP step, it follows that gk + Bk dk ∈ Range(Ak ), which implies that the
830 minimization problem (7.9) is equivalent to
831 (7.10)    min_{h ∈ Null(A_k^⊤)} (1/2) (v̂_k + h)^⊤ B_k (v̂_k + h).
832 Examining the SQP method from the perspective of subspace enables us to get more
833 insights. If Yk> Bk Zk = 0, it holds ĥk = 0, which means that the second order correction
834 step dˆk ∈ Range(Ak ) is also a range space step. Hence, the second order correction uses
835 two range space steps and one null space step. Note that a range space step is fast since it is
836 a Newton step, while a null space step is normally slower because Bk is often approximated
837 by a quasi-Newton approximation to the Hessian of the Lagrangian function. Intuitively, it
838 might be better to have two slower steps with one fast step. Therefore, it might be reasonable
839 to study a correction step d̂_k ∈ Null(A_k^⊤) in a modified SQP method.

840 7.3. The Celis-Dennis-Tapia (CDT) Subproblem. The CDT subproblem [23] is
841 often needed in some trust region algorithms for constrained optimization. It has two trust
842 region ball constraints:
843 (7.11)    min_{d ∈ R^n} Q_k(d) = g_k^⊤ d + (1/2) d^⊤ B_k d    s. t. ‖c_k + A_k^⊤ d‖_2 ≤ ξ_k, ‖d‖_2 ≤ ∆_k,
844 where ξk and ∆k are both trust region radii. Let Sk = span{Zk }, Zk> Zk = I, span{Ak , gk } ⊂
845 Sk and Bk u = σu, ∀u ∈ Sk⊥ . It is shown in [50] that the CDT subproblem is equivalent to

846    min_{d̄ ∈ R^r} Q̄_k(d̄) = ḡ_k^⊤ d̄ + (1/2) d̄^⊤ B̄_k d̄    s. t. ‖c_k + Ā_k^⊤ d̄‖_2 ≤ ξ_k, ‖d̄‖_2 ≤ ∆_k,

847 where ḡk = Zk> gk , B̄k = Zk> Bk Zk and Āk = Zk> Ak . Consequently, a subspace version of
848 the Powell-Yuan trust region algorithm [91] was developed in [50].
849 7.4. Simple Bound-constrained Problems. We now consider the optimization
850 problems with simple bound-constraints:

851 (7.12)    min_{x ∈ R^n} f(x)    s. t. l ≤ x ≤ u,

852 where l and u are two given vectors in Rn . In this subsection, the superscript of a vector
853 denotes its indices, for example, xi is the ith component of x.
854 A subspace adaptation of the Coleman-Li trust region and interior method is proposed in
855 [12]. The affine scaling matrices Dk and Ck are defined from examining the KKT conditions
856 of (7.12) as:

857 Dk = D(xk ) = diag(|v(xk )|−1/2 ), Ck = Dk diag(gk )Jkv Dk

858 where J^v(x) is a diagonal matrix whose diagonal elements are equal to zero or ±1, and

859    v^i = x^i − u^i,  if g^i < 0 and u^i < +∞,
       v^i = x^i − l^i,  if g^i ≥ 0 and l^i > −∞,
       v^i = −1,  if g^i < 0 and u^i = +∞,
       v^i = +1,  if g^i ≥ 0 and l^i = −∞.

860 Let H_k be an approximation to the Hessian matrix ∇²f(x_k) and define

861 ĝk = Dk−1 gk , M̂k = Dk−1 Hk Dk−1 + diag(gk )Jkv .

862 Then the subspace trust region subproblem is

863 (7.13)    min_s g_k^⊤ s + (1/2) s^⊤ (H_k + C_k) s    s. t. ‖D_k s‖_2 ≤ ∆_k, s ∈ S_k.

864 If the matrix M̂k is positive definite, the subspace is taken as

865 Sk = span{Dk−2 gk , wk },
866 where w_k is either ŝ_k^N = −M̂_k^{−1} ĝ_k or its inexact version. Otherwise, S_k is set to

867 span{Dk−2 sign(gk )} or span{Dk−2 sign(gk ), wk },

868 where w_k is a vector of nonpositive curvature of M̂_k.


869 A subspace limited memory quasi-Newton method is developed by Ni and Yuan in [87].
870 There are three types of search directions: a subspace quasi-Newton direction, subspace gra-
871 dient and modified gradient directions. The limited memory quasi-Newton method is used
872 to update the variables with indices outside of the active set, while the projected gradient
873 method is used to update the active variables. An active set algorithm is designed in [52].
874 The algorithm consists of a nonmonotone gradient projection step, an unconstrained opti-
875 mization step, and a set of rules for branching between the two steps. After a suitable active
876 set is detected, some components of variables are fixed and the method is switched to the
877 unconstrained optimization algorithm in a lower-dimensional space.
878 8. Eigenvalue Computation. The eigenvalue decomposition (EVD) and singular value
879 decomposition (SVD) are fundamental computational tools with extraordinarily wide-ranging
880 applications in science and engineering. For example, most algorithms in high dimensional-
881 ity reduction, such as the principal component analysis (PCA), the multidimensional scaling,
882 Isomap, etc., use them to transform the data into a meaningful representation of reduced
883 dimensionality. More recently, identifying dominant eigenvalue or singular value decom-
884 positions of a sequence of closely related matrices has become an indispensable algorithmic
885 component for many first-order optimization methods for various convex and nonconvex opti-
886 mization problems, such as semidefinite programming, low-rank matrix computation, robust
887 principal component analysis, sparse principal component analysis, sparse inverse covari-
888 ance matrix estimation, nearest correlation matrix estimation and the self-consistent iteration
889 in electronic structure calculation. The computational cost of these decompositions is a major
890 bottleneck which significantly affects the overall efficiency of these algorithms.
891 For a given real symmetric matrix A ∈ Rn×n , let λ1 , λ2 , · · · , λn be the eigenvalues
892 of A sorted in a descending order: λ1 ≥ λ2 ≥ · · · ≥ λn , and q1 , . . . , qn ∈ Rn be the
893 corresponding eigenvectors such that Aqi = λi qi , kqi k2 = 1, i = 1, . . . , n and qi> qj = 0
894 for i 6= j. The eigenvalue decomposition of A is defined as A = Qn Λn Q> n , where, for any
895 integer i ∈ [1, n],
896 (8.1) Qi = [q1 , q2 , . . . , qi ] ∈ Rn×i , Λi = Diag(λ1 , λ2 , . . . , λi ) ∈ Ri×i ,
897 and Diag(·) denotes a diagonal matrix with its arguments on the diagonal. For simplicity,
898 we also write A = QΛQ> where Q = Qn and Λ = Λn . Without loss of generality, we
899 assume for convenience that A is positive definite (after a shift if necessary). Our task is to
900 compute the p largest eigenpairs (Q_p, Λ_p) for some p ≪ n, where by definition AQ_p = Q_p Λ_p
901 and Q_p^⊤ Q_p = I ∈ R^{p×p}. Replacing A by a suitable function of A, say λ_1 I − A, one can also
902 in principle apply the same algorithms to finding p smallest eigenpairs as well.
903 We next describe the Rayleigh-Ritz (RR) step which is to extract approximate eigenpairs,
904 called Ritz-pairs, from the range space R(Z) spanned by a given matrix Z ∈ Rn×m . This
905 procedure is widely used as an important component for an approximation to a desired m-
906 dimensional eigenspace of A. It consists of the following four steps.
907 (i) Given Z ∈ Rn×m , orthonormalize Z to obtain U ∈ orth(Z), where orth(Z) is the
908 set of orthonormal bases for the range space of Z.
909 (ii) Compute H = U > AU ∈ Rm×m , the projection of A onto the range space of U .
910 (iii) Compute the eigenvalue decomposition H = V ΣV^⊤, where V^⊤ V = I and Σ is
911 diagonal.
912 (iv) Assemble the Ritz pairs (Y, Σ) where Y = U V ∈ Rn×m satisfies Y > Y = I.
913 The RR procedure is denoted as a map (Y, Σ) = RR(A, Z) where the output (Y, Σ) is a Ritz
914 pair block.
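A minimal dense NumPy sketch of the map (Y, Σ) = RR(A, Z) is given below; in practical large-scale settings A is accessed only through matrix-block products and step (ii) is organized accordingly.

```python
import numpy as np

def rayleigh_ritz(A, Z):
    """Rayleigh-Ritz map (Y, Sigma) = RR(A, Z) for a symmetric A and n x m block Z."""
    U, _ = np.linalg.qr(Z)              # (i) orthonormal basis of range(Z)
    H = U.T @ A @ U                     # (ii) projected m x m matrix
    sigma, V = np.linalg.eigh(H)        # (iii) small eigenvalue decomposition of H
    order = np.argsort(sigma)[::-1]     # sort Ritz values in descending order
    sigma, V = sigma[order], V[:, order]
    Y = U @ V                           # (iv) Ritz vectors, Y^T Y = I
    return Y, sigma
```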
915 8.1. Classic Subspace Iteration. The simple (simultaneous) subspace iteration (SSI)
916 method [95, 96, 108, 110] is an extension of the power method which computes a single eigen-
917 pair corresponding to the largest eigenvalue in magnitude. Starting from an initial matrix U ,
918 SSI repeatedly performs the matrix-matrix multiplications AU , followed by an orthogonal-
919 ization and RR projection, i.e.,

920 (8.2) Z = orth(AU ), U = RR(A, Z).

921 The major purpose of orthogonalization is to guarantee the full-rankness of the matrix Z
922 since AU may lose rank numerically. The so-called deflation can be executed after each
923 RR projection to fix the numerically converged eigenvectors since the convergence rates for
924 different eigenpairs are not the same. Moreover, q extra vectors, often called guard vectors,
925 are added to U to accelerate convergence. Although the iteration cost is increased at the initial
926 stage, the overall performance may be better.
927 Due to fast memory access and highly parallelizable computation on modern computer
928 architectures, simultaneous matrix-block multiplications have advantages over individual matrix-
929 vector multiplications. Whenever there is a gap between the p-th and the (p + 1)-th eigen-
930 values of A, the SSI method is ensured to converge to the largest p eigenpairs from any
931 generic starting point. However, the convergence speed of the SSI method depends critically
932 on eigenvalue distributions. It can be intolerably slow if the eigenvalue distributions are not
933 favorable.
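The following sketch implements the basic SSI loop (8.2) with a few guard vectors; deflation and stopping tests are omitted for brevity, and the dense linear algebra is only for illustration.

```python
import numpy as np

def subspace_iteration(A, p, guard=2, iters=50, rng=np.random.default_rng(0)):
    """Simple simultaneous subspace iteration (8.2) for the p largest eigenpairs of A."""
    n = A.shape[0]
    U = np.linalg.qr(rng.standard_normal((n, p + guard)))[0]   # random orthonormal start
    for _ in range(iters):
        Z, _ = np.linalg.qr(A @ U)        # Z = orth(A U)
        H = Z.T @ A @ Z                   # Rayleigh-Ritz projection
        w, V = np.linalg.eigh(H)          # ascending Ritz values
        U = Z @ V[:, ::-1]                # leading Ritz vectors first
    ritz = np.sort(np.linalg.eigvalsh(U[:, :p].T @ A @ U[:, :p]))[::-1]
    return U[:, :p], ritz
```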

934 8.2. Polynomial Filtering. The idea of polynomial filtering is originated from a well-
935 known fact that polynomials are able to manipulate the eigenvalues of any symmetric matrix
936 A while keeping its eigenvectors unchanged. Due to the eigenvalue decomposition (8.1), it
937 holds that
938 (8.3)    ρ(A) = Q ρ(Λ) Q^⊤ = ∑_{i=1}^n ρ(λ_i) q_i q_i^⊤,

939 where ρ(Λ) = diag(ρ(λ_1), ρ(λ_2), . . . , ρ(λ_n)). Ideally, the eigenvalue distribution of ρ(A) is
940 more favorable than the original one.
941 The convergence of the desired eigen-space of SSI is determined by the gap of the eigen-
942 values, which can be very slow if the gap is nearly zero. Polynomial filtering has been used
943 to manipulate the gap in eigenvalue computation through various ways [97, 109, 150, 34] in
944 order to obtain a faster convergence. One popular choice of ρ(t) is the Chebyshev polynomial
945 of the first kind, which can be written as
946 (8.4)    T_d(t) = cos(d arccos t) if |t| ≤ 1,  and  T_d(t) = ((t − √(t² − 1))^d + (t + √(t² − 1))^d)/2 if |t| > 1,

947 where d is the degree of the polynomial. Since Chebyshev polynomials grow rapidly
948 outside the interval [−1, 1], they can help to suppress all unwanted eigenvalues in this interval
949 efficiently. For eigenvalues in a general interval [a, b], the polynomial can be chosen as
 
950 (8.5)    ρ(t) = T_d( (t − (b + a)/2) / ((b − a)/2) ).

951 From an initial matrix U , the polynomial-filtered subspace iteration is simply

952 (8.6) Z = orth(ρ(A)U ), U = RR(A, Z).
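A filtered block ρ(A)U with ρ given by (8.4)-(8.5) can be applied through the three-term Chebyshev recurrence without forming ρ(A); the sketch below assumes the unwanted spectrum is contained in [a, b].

```python
import numpy as np

def chebyshev_filter(A, U, a, b, degree):
    """Apply rho(A)U for the Chebyshev filter (8.5) using the recurrence
    T_{j+1}(t) = 2 t T_j(t) - T_{j-1}(t)."""
    e = (b - a) / 2.0                    # half-width of the unwanted interval
    c = (b + a) / 2.0                    # its center
    Y_prev = U
    Y = (A @ U - c * U) / e              # degree-1 term T_1((A - cI)/e) U
    for _ in range(2, degree + 1):
        Y_next = 2.0 * (A @ Y - c * Y) / e - Y_prev
        Y_prev, Y = Y, Y_next
    return Y
```

The filtered iteration (8.6) is then simply Z = orth(chebyshev_filter(A, U, a, b, d)) followed by an RR projection.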


953 8.3. Limited Memory Methods. Finding a p-dimensional eigenspace associated with
954 the p largest eigenvalues of A is equivalent to solving a trace maximization problem with orthog-
955 onality constraints:

956 (8.7)    max_{X ∈ R^{n×p}} tr(X^⊤ AX), s. t. X^⊤ X = I.

957 The first-order optimality conditions of (8.7) are

958 AX = XΛ, X > X = I,

959 where Λ = X > AX ∈ Rp×p is the matrix of Lagrangian multipliers. Once the matrix Λ
960 is diagonalized, the matrix pair (Λ, X) provides p eigenpairs of A. When maximization is
961 replaced by minimization, (8.7) computes an eigenspace associated with p smallest eigenval-
962 ues. A few block algorithms have been designed based on solving (8.7), including the locally
963 optimal block preconditioned conjugate gradient method (LOBPCG) [65] and the limited
964 memory block Krylov subspace optimization method (LMSVD) [74]. At each iteration, these
965 methods in fact solve a subspace trace maximization problem of the form

966 (8.8)    Y = arg max_{X ∈ R^{n×p}} { tr(X^⊤ AX) : X^⊤ X = I, X ∈ S }.

967 Obviously, the closed-form solution of (8.8) can be obtained by using the RR procedure.
968 The subspace S is varied from method to method. In LOBPCG, S is the span of the two
969 most recent iterations Xi−1 and Xi , and the residual AXi − Xi Λi at Xi , which is essentially
970 equivalent to

971 (8.9) S = span {Xi−1 , Xi , AXi } .

972 The term AXi can be pre-multiplied by a pre-conditioning matrix if it is available. The
973 LMSVD method constructs the subspace S as a limited memory of the current i-th iterate
974 and the previous t iterates; i.e.,

975 (8.10) S = span {Xi , Xi−1 , ..., Xi−t } .

976 In general, the subspace S should be constructed such that the computational cost of solving
977 the subproblem (8.8) is relatively small.
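As an illustration, one LOBPCG-type iteration of (8.8) with the subspace (8.9) can be sketched as follows (dense, unpreconditioned, and without the rank stabilization needed in practice).

```python
import numpy as np

def lobpcg_like_step(A, X_prev, X):
    """One subspace trace-maximization step (8.8) over S = span{X_{i-1}, X_i, A X_i}."""
    p = X.shape[1]
    S = np.hstack([X_prev, X, A @ X])     # basis candidates for the subspace
    Q, _ = np.linalg.qr(S)                # orthonormalize the subspace basis
    H = Q.T @ A @ Q
    w, V = np.linalg.eigh(H)
    Y = Q @ V[:, -p:][:, ::-1]            # p leading Ritz vectors solve (8.8)
    return Y
```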
978 8.4. Augmented Rayleigh-Ritz Method. We next introduce the augmented Rayleigh-
979 Ritz (ARR) procedure. It is easy to see that the RR map (Y, Σ) = RR(A, Z) is equivalent to
980 solving the trace-maximization subproblem (8.8) with the subspace S = R(Z), while requir-
981 ing Y > AY to be a diagonal matrix Σ. For a fixed number p, the larger the subspace R(Z)
982 is, the greater chance there is to extract better Ritz pairs. The augmentation of the subspaces
983 in LOGPCG and LMSVD is the main reason why they generally achieve faster convergence
984 than the classic SSI.
985 The augmentation in ARR is based on a block Krylov subspace structure, i.e., for some
986 integer t ≥ 0,

987 (8.11) S = span{X, AX, A2 X, . . . , At X}.

988 Then the optimal solution of the trace maximization problem (8.8), restricted in the sub-
989 space S in (8.11), is computed via the RR procedure using (Ŷ , Σ̂) = RR(A, Kt ), where
990 K_t = [X, AX, A²X, . . . , A^t X]. Finally, the p leading Ritz pairs (Y, Σ) are extracted from
991 (Ŷ, Σ̂). This augmented RR procedure is simply referred to as ARR. It looks identical to a
992 block Lanczos algorithm. However, a fundamental dissimilarity is that the ARR is primarily
993 developed to compute a relatively large number of eigenpairs by using only a few augmenta-
994 tion blocks.
995 We next describe an “Arrabit” algorithmic framework with two main steps at each outer
996 iteration: a subspace update (SU) step and an ARR projection step, for computing a subset
997 of eigenpairs of large matrices. The goal of the subspace update step is finding a matrix
998 X ∈ Rn×p so that its column space is a good approximation to the p-dimensional eigenspace
999 spanned by p desired eigenvectors. Once X is obtained, the projection step aims to extract
1000 from X a set of approximate eigenpairs that are optimal in certain sense. The SU step is
1001 often performed on a transformed matrix ρ(A), where ρ(t) : R → R is a suitable polynomial
1002 function. For a reasonable choice X ∈ Rn×p , it follows from (8.3) that ρ(A)X ≈ Qp QTp X
1003 would be an approximate basis for the desired eigenspace. The analysis of ARR in [135,
1004 Corollary 4.6] shows that the convergence rate of SSI is improved from |ρ(λp+1 )/ρ(λp )| for
1005 RR (t = 0) to |ρ(λ(t+1)p+1 )/ρ(λp )| for ARR (t > 0). Therefore, a significant improvement
1006 is possible with a suitably chosen polynomial ρ(·) such that |ρ(λ_{(t+1)p+1})| ≪ |ρ(λ_{p+1})|.
1007 In principle, the SU step can be fulfilled by many kinds of updating schemes without
1008 explicit orthogonalizations. The Gauss-Newton (GN) algorithm in [75] solves the nonlinear
1009 least squares problem:
1010 min kXX > − Ak2F .
1011 For any full-rank matrix X ∈ Rn×p , it takes the simple closed form
 
1
X+ = X + α I − X(X > X)−1 X > AX(X > X)−1 − X ,

1012
2

1013 where the parameter α > 0 is a step size. The classic power iteration can be modified without
1014 orthogonalization at each step. For X = [x1 x2 · · · , xm ] ∈ Rn×m , the power iteration is
1015 applied individually to all columns of the iterate matrix X, i.e.,

xi
1016 xi = ρ(A)xi and xi = , i = 1, 2, · · · , m.
kxi k2

1017 This scheme is called a multi-power method.


1018 8.5. Singular Value Decomposition. Computing the singular value decomposition
1019 of a real matrix A ∈ R^{m×n} is equivalent to finding the eigenvalue decomposition
1020 of AA> . Although the methods in the previous subsections can be applied to AA> directly,
1021 the efficiency can be improved when some operations are performed carefully. We first state
1022 the abstract form of the LMSVD method [74], then describe a few implementation details.
1023 There are two main steps. For a chosen subspace Si with a block Krylov subspace
1024 structure, an intermediate iterate is computed from

1025 (8.12)    X̂_i := arg max_{X ∈ R^{m×p}} ‖A^⊤ X‖_F^2, s. t. X^⊤ X = I, X ∈ S_i.

1026 The next iterate X_{i+1} is generated from an SSI step on X̂_i, i.e.,

1027 (8.13)    X_{i+1} ∈ orth(AA^⊤ X̂_i).

1028 We collect a limited memory of the last few iterates in (8.10) into a matrix

1029 (8.14)    X = X_i^t := [X_i, X_{i−1}, ..., X_{i−t}] ∈ R^{m×q},


1030 where q = (t + 1)p is the total number of columns in Xti . For simplicity of notation, the su-
1031 perscript and subscript of Xti are dropped whenever no confusion would arise. The collection
1032 matrix X is written in boldface to differentiate it from its blocks. Similarly, a collection of
1033 matrix-vector multiplications from the SSI steps is saved in

1034 (8.15)    Y = Y_i^t := A^⊤ X_i^t := [A^⊤ X_i, A^⊤ X_{i−1}, ..., A^⊤ X_{i−t}] ∈ R^{n×q}.
1035 Assume that X has a full rank and this assumption will be relaxed later. A stable ap-
1036 proach for solving (8.12) is to find an orthonormal basis for Si , say,
1037    Q = Q_i^t ∈ orth(X_i^t).

1038 Note that X ∈ Si if and only if X = QV for some V ∈ Rq×p . The generalized eigenvalue
1039 problem (8.12) is converted into an equivalent eigenvalue problem
1040 (8.16)    max_{V ∈ R^{q×p}} ‖RV‖_F^2, s. t. V^⊤ V = I,

1041 where
1042 (8.17)    R = R_i^t := A^⊤ Q_i^t.
1043 The matrix product R in (8.17) can be computed from historical information without any
1044 additional computation involving the matrix A. Since Q ∈ orth(X) and X has a full rank,
1045 there exists a nonsingular matrix C ∈ Rq×q such that X = QC. Therefore, Q = XC −1 , and
1046 R in (8.17) can be assembled as
1047 (8.18) R = A> Q = (A> X)C −1 = YC −1 ,
1048 where Y = A> X is accessible from our limited memory. Once R is available, a solution V̂
1049 to (8.16) can be computed from the p leading eigenvectors of the q × q matrix R> R. The
1050 matrix product can then be calculated as
1051 (8.19) AA> X̂i = ARV̂ = AYC −1 V̂ .
1052 We now explain how to efficiently and stably compute Q and R when the matrix X is
1053 numerically rank deficient. Since each block itself in X is orthonormal, keeping the latest
1054 block Xi intact and projecting the rest of the blocks onto the null space of Xi> yields
1055 (8.20)    P_X = P_X^i := (I − X_i X_i^⊤) [X_{i−1}, · · · , X_{i−t}].
1056 An orthonormalization of PX is performed via the eigenvalue decomposition of its Gram
1057 matrix
1058 (8.21)    P_X^⊤ P_X = U_X Λ_X U_X^⊤,

1059 where UX is orthogonal and ΛX is diagonal. If ΛX is invertible, it holds


1060 (8.22)    Q = Q_i^t := [X_i, P_X U_X Λ_X^{−1/2}] ∈ orth(X_i^t).
1061 The above procedure can be stabilized by deleting the columns of PX whose Euclidean
1062 norms are below a threshold or deleting the small eigenvalues in ΛX and the corresponding
1063 columns in UX . The same notations are still used for PX , UX and ΛX after these possible
1064 deletions. Therefore, a stable construction of Q is still provided by formula (8.22) and the
1065 corresponding R matrix can be formulated as
1066 (8.23)    R = R_i^t := [Y_i, P_Y U_X Λ_X^{−1/2}],

1067 where PY = PYi := A> PX before the stabilization procedure but some of the columns of
1068 PY may have been deleted due to the stabilization steps. Therefore, the R matrix in (8.23) is
1069 well defined as is the Q matrix in (8.22) after the numerical rank deficiency is removed.
1070 8.6. Randomized SVD. Given an m × n matrix A and an integer p < min(m, n), we
1071 want to find an orthonormal m × p matrix Q such that

1072 A ≈ QQT A.

1073 A prototype randomized SVD in [54] is essentially one step of the Power method using an
1074 initial random input. We select an oversampling parameter l ≥ 2 and an exponent t (for
1075 example, t = 1 or t = 2), then perform the following steps.
1076 • Generate an n × (p + l) Gaussian matrix Ω.
1077 • Compute Y = (AA^⊤)^t AΩ by multiplying with A and A^⊤ alternately.
1078 • Construct a matrix Q = orth(Y ) by the QR factorization.
1079 • Form the matrix B = Q> A.
1080 • Calculate an SVD of B to obtain B = Ũ ΣV > , and set U = QŨ .
1081 Consequently, we have the approximation A ≈ U ΣV > . For the eigenvalue computation, we
1082 can simply run the SSI (8.2) for only one step with a Gaussian random matrix U. Assume that the
1083 computation is performed in exact arithmetic. It is shown in [54] that
 √ 
1084    E‖A − QQ^⊤ A‖_2 ≤ (1 + 4√(p + l)/(l − 1)) σ_{p+1},
1085 where the expectation is taken with respect to the random matrix Ω and σp+1 is the (p + 1)-th
1086 largest singular value of A.
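The five steps of the prototype randomized SVD can be summarized in the following sketch; the random seed, the default oversampling l and the exponent t are illustrative choices.

```python
import numpy as np

def randomized_svd(A, p, l=10, t=1, rng=np.random.default_rng(0)):
    """Prototype randomized SVD: range sketch, power iteration, QR, small SVD."""
    m, n = A.shape
    Omega = rng.standard_normal((n, p + l))          # Gaussian test matrix
    Y = A @ Omega
    for _ in range(t):                               # Y = (A A^T)^t A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                           # orthonormal range approximation
    B = Q.T @ A                                      # small (p+l) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :p], s[:p], Vt[:p, :]                # A ~= U diag(s) V^T
```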
1087 Suppose that a low rank approximation of A with a target rank r is needed. A sketching
1088 method is further developed in [118] for selected p and ℓ. Again, we draw independent
1089 Gaussian matrices Ω ∈ R^{n×p} and Ψ ∈ R^{ℓ×m}, and compute the matrix-matrix multiplications:

1090 Y = AΩ, W = ΨA,

1091 Then an approximation Â is computed:
1092 • Calculate an orthogonal-triangular factorization Y = QR where Q ∈ R^{m×p}.
1093 • Solve a least-squares problem to derive X = (ΨQ)^† W ∈ R^{p×n}.
1094 • Assemble the rank-p approximation Â = QX.
1095 Assume that p = 2r + 1 and ℓ = 4r + 2. It is established that

1096    E‖A − Â‖_F ≤ 2 min_{rank(Z) ≤ r} ‖A − Z‖_F.
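A corresponding sketch of the two-sided sketching method with the suggested sizes p = 2r + 1 and ℓ = 4r + 2 is given below.

```python
import numpy as np

def sketchy_low_rank(A, r, rng=np.random.default_rng(0)):
    """Two-sided sketching approximation A ~= Q X with rank at most p = 2r+1."""
    m, n = A.shape
    p, l = 2 * r + 1, 4 * r + 2
    Omega = rng.standard_normal((n, p))        # range sketch
    Psi = rng.standard_normal((l, m))          # co-range sketch
    Y, W = A @ Omega, Psi @ A                  # the only passes over A
    Q, _ = np.linalg.qr(Y)
    X, *_ = np.linalg.lstsq(Psi @ Q, W, rcond=None)   # X = (Psi Q)^† W
    return Q, X
```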

1097 8.7. Truncated Subspace Method for Tensor Train. In this subsection, we con-
1098 sider the trace maximization problem (8.7) whose dimension reaches the magnitude of O(10^{42}).
1099 Due to the scale of data storage, a tensor train (TT) format is used to express data matrices
1100 and eigenvectors in [148]. The corresponding eigenvalue problem can be solved based on the
1101 subspace algorithm and the alternating direction method with suitable truncations.
1102 The goal is to express a vector x ∈ Rn as a tensor x ∈ Rn1 ×n2 ×···×nd for some positive
1103 integers n1 , . . . , nd such that n = n1 n2 . . . nd using a collection of three-dimensional tensor
1104 cores X_µ ∈ R^{r_{µ−1} × r_µ × n_µ} with fixed dimensions r_µ, µ = 1, . . . , d and r_0 = r_d = 1. A
1105 tensor x is stored in the TT format if its elements can be written as

1106 xi1 i2 ...id = X1 (i1 )X2 (i2 ) · · · Xd (id ),

1107 where Xµ (iµ ) ∈ Rrµ−1 ×rµ is the iµ -th slice of Xµ for iµ = 1, 2, . . . , nµ . The values rµ are
1108 often equal to a constant r, which is then called the TT-rank. Consequently, storing a vector
1109 x ∈ R^{n_1^d} only needs O(d n_1 r²) entries if the corresponding tensor x has a TT format. The
1110 representation of x is shown as graphs in Figure 8.1.
Fig. 8.1 The first row is a TT format of x with cores X_µ, µ = 1, 2, . . . , d. The second row is a representation of its elements x_{i_1 i_2 ... i_d}.

1111 There are several ways to express a matrix X ∈ R^{n×p} with p ≪ n in the TT format. A
1112 direct way is to store each column of X as tensors x1 , x2 , . . . , xp in the TT format separately.
1113 Another economic choice is that these p tensors share all except one core. Let the shared
1114 cores be X_i, i ≠ µ, and the µ-th core of x_j be X_{µ,j}, for j = 1, 2, . . . , p. Then the i_1 i_2 · · · i_d
1115 component of x_j is

1116 (8.24)    X(i_1, . . . , i_µ, . . . , i_d; j) = X_1(i_1) · · · X_{µ,j}(i_µ) · · · X_d(i_d).

1117 The above scheme generates a block-µ TT (µ-BTT) format, which is depicted in Figure 8.2.
1118 Similarly, a matrix A ∈ Rn×n is in an operator TT format A if the components of A can be
1119 assembled as

1120 (8.25) Ai1 i2 ···id ,j1 j2 ···jd = A1 (i1 , j1 )A2 (i2 , j2 ) · · · Ad (id , jd ),

1121 where Aµ (iµ , jµ ) ∈ Rrµ−1 ×rµ for iµ , jµ ∈ {1, . . . , nµ }.


rµ−1

Xµ,1 nµ+1

nµ-1 nd


n1
X1 rµ-2
Xµ−1 rµ
Xµ+1 rd-1
Xd
r0 r1 nµ
rµ-1 rd

rµ+1

Xµ,p
rµ−1

Fig. 8.2 Demonstration of a µ-BTT format.

1122 Assume that the matrix A itself can be written in the operator TT format. Let X ∈ Rn×p
1123 with n = n1 n2 . . . nd whose BTT format is X, and Tn,r,p be the set of the BTT formats
1124 whose TT-ranks are no more than r. Then the eigenvalue problem in the block BTT format is

1125 (8.26)    min_{X ∈ R^{n×p}} tr(X^⊤ AX), s. t. X^⊤ X = I_p and X ∈ T_{n,r,p},
1126 where X ∈ Tn,r,p means that all calculations are performed in the BTT format. Since
1127 the TT-ranks increase dramatically after operations such as the addition and matrix-vector
1128 multiplication in the TT formats, the computational cost and the storage become more and
1129 more expensive as the TT-ranks increase. Therefore, the subspace methods in subsection 8.3
1130 can only be applied with projections to Tn,r,p at some suitable places so that the overall
1131 computational cost is still tractable.
1132 In our truncated subspace optimization methods, solving the subproblem (8.8) is split
1133 into a few steps. First, the subspace Sk is modified with truncations so that the computation of
1134 the coefficient matrix U > AU in the RR procedure is affordable. Let PT (X) be the truncation
1135 of X to the BTT format Tn,r,p . One can choose either the following subspace

1136 (8.27) ST
k = span{PT (AXk ), Xk , Xk−1 },

1137 or a subspace similar to that of LOBPCG with two truncations as

1138 (8.28) ST
k = span{Xk , PT (Rk ), PT (Pk )},

1139 where the conjugate gradient direction is Pk = Xk − Xk−1 and the residual vector is Rk =
1140 AXk − Xk Λk .
1141 Consequently, the subspace problem in the BTT format is

1142 (8.29) Yk+1 := arg min tr(X> AX), s. t. X> X = Ip , X ∈ ST


k,
X∈Rn×p

1143 which is equivalent to a generalized eigenvalue decomposition problem:

1144 (8.30) min tr(V > (S > AS)V ), s. t. V > S > SV = Ip .


V ∈Rq×p

1145 Note that Y_{k+1} ∉ T_{n,r,p} because the rank of Y_{k+1} is larger than r due to several additions
1146 between the BTT formats. Since Y_{k+1} is a linear combination of the BTT formats in S_k^T,
1147 problem (8.29) still can be solved easily but only the coefficients of the linear combinations
1148 are stored.
1149 We next project Yk+1 to the required space Tn,r,p as

1150 (8.31)    X_{k+1} = arg min_{X ∈ R^{n×p}} ‖X − Y_{k+1}‖_F^2, s. t. X^⊤ X = I_p, X ∈ T_{n,r,p}.

1151 This problem can be solved by using the alternating minimization scheme. By fixing all
1152 except the µth core, we obtain

1153 (8.32)    min_V ‖X_{≠µ} V − vec(Y_{k+1})‖_F^2, s. t. V^⊤ X_{≠µ}^⊤ X_{≠µ} V = I_p,

1154 where
1155    X_{≠µ} := (X_{≥µ+1} ⊗ I_{n_µ} ⊗ X_{≤µ−1}),
1156 and

1157 X≤µ = [X1 (i1 )X2 (i2 ) · · · Xµ (iµ )] ∈ Rn1 n2 ···nµ ×rµ ,
1158 X≥µ = [Xµ (iµ )Xµ+1 (iµ+1 ) · · · Xd (id )]> ∈ Rnµ nµ+1 ···nd ×rµ−1 .

1159 Therefore, after imposing orthogonality on X_{≠µ}, (8.32) is reformulated as

1160 (8.33)    min_V ‖V − X_{≠µ}^⊤ vec(Y_{k+1})‖_F^2, s. t. V^⊤ V = I_p,

1161 whose optimal solution can be computed by the p-dominant SVD of X_{≠µ}^⊤ vec(Y_{k+1}).
1162 9. Optimization with Orthogonality Constraints. In this section, we consider the
1163 optimization problem with orthogonality constraints [132, 59, 2]:

1164 (9.1)    min_{X ∈ C^{n×p}} f(X) s. t. X^* X = I_p,

1165 where f(X) : C^{n×p} → R is an R-differentiable function [67]. The set St(n, p) := {X ∈
1166 Cn×p : X ∗ X = Ip } is called the Stiefel manifold. Obviously, the eigenvalue problem in sec-
1167 tion 8 is a special case of (9.1). Other important applications include the density functional
1168 theory [131], Bose-Einstein condensates [137], low rank nearest correlation matrix comple-
1169 tion [121], etc. Although (9.1) can be treated from the perspective of general nonlinear
1170 programming, the intrinsic structure of the Stiefel manifold enables us to develop more effi-
1171 cient algorithms. In fact, it can be solved by the Riemannian gradient descent, Riemannian
1172 conjugate gradient, proximal Riemannian gradient methods [40, 104, 2, 59]. The Rieman-
1174 nian Newton, trust-region, adaptive regularized Newton methods [120, 1, 2, 59] can be used
1175 when the Hessian information is available. Otherwise, the quasi-Newton type methods are
1175 good alternatives [62, 61, 58].
1176 The tangent space is TX := {ξ ∈ Cn×p : X ∗ ξ + ξ ∗ X = 0}. The operator ProjX (Z) :=
1177 Z − Xsym(X ∗ Z) is the projection of Z onto the tangent space TX and sym(A) := (A +
1178 A∗ )/2. The symbols ∇f (X) (∇2 f (X)) and grad f (X) (Hess f (X)) denote the Euclidean
1179 and Riemannian gradient (Hessian) of f at X. Using the real part of the Frobenius inner
1180 product ℜ⟨A, B⟩ as the Euclidean metric, the Riemannian Hessian Hess f(X) [31, 3] can be
1181 written as

1182 (9.2) Hess f (X)[ξ] = ProjX (∇2 f (X)[ξ] − ξsym(X ∗ ∇f (X))),

1183 where ξ is any tangent vector in TX . A retraction R is a smooth mapping from the tangent
1184 bundle to the manifold. Moreover, the restriction RX of R to TX has to satisfy RX (0X ) = X
1185 and DRX (0X ) = idTX , where idTX is the identity mapping on TX .
1186 9.1. Regularized Newton Type Approaches. We now describe an adaptively reg-
1187 ularized Riemannian Newton type method with a subspace refinement procedure [59, 58].
1188 Note that the Riemannian Hessian-vector multiplication (9.2) involves the Euclidean Hessian
1189 and gradient with simple structures. We construct a second-order Taylor approximation in the
1190 Euclidean space rather than the Riemannian space at the k-th iteration:

1191 (9.3)    m_k(X) := ℜ⟨∇f(X_k), X − X_k⟩ + (1/2) ℜ⟨B_k[X − X_k], X − X_k⟩ + (τ_k/2) ‖X − X_k‖_F^2,
1192 where Bk is either ∇2 f (Xk ) or its approximation based on whether ∇2 f (Xk ) is affordable
1193 or not, and τk is a regularization parameter to control the distance between X and Xk . Then
1194 the subproblem is

1195 (9.4)    min_{X ∈ C^{n×p}} m_k(X) s. t. X^* X = I.

1196 After obtaining an approximate solution Zk of (9.4), we calculate a ratio between the pre-
1197 dicted reduction and the actual reduction, then use the ratio to decide whether Xk+1 is set to
1198 Zk or Xk and to adjust the parameter τk similar to the trust region methods.
1199 In particular, the model (9.4) can be minimized by using a modified CG method to solve
1200 a single Riemannian Newton system:

1201 (9.5) grad mk (Xk ) + Hess mk (Xk )[ξ] = 0.


1202 A simple calculation yields:
1203 (9.6)    Hess m_k(X_k)[ξ] = Proj_{X_k}(B_k[ξ] − ξ sym(X_k^* ∇f(X_k))) + τ_k ξ, ξ ∈ T_{X_k}.
1204 Hence, the regularization term shifts the spectrum of the Riemannian Hessian by τk . The
1205 modified CG method is a direct adaptation of the truncated CG method for solving the classic
1206 trust region subproblem, see [88, Chapter 5] and [2, Chapter 7] for a comparison. It is ter-
1207 minated when either the residual becomes small or a negative curvature is detected since the
1208 Hessian may be indefinite. During the process, two different vectors sk and dk are generated,
1209 where the vector dk represents the negative curvature direction and sk corresponds to the con-
1210 jugate direction from the CG iteration. The direction dk is zero unless a negative curvature is
1211 detected. Therefore, a possible choice of the search direction ξk is
1212 (9.7)    ξ_k = s_k + τ_k d_k if d_k ≠ 0, and ξ_k = s_k if d_k = 0, with τ_k := ⟨d_k, grad m_k(X_k)⟩ / ⟨d_k, Hess m_k(X_k)[d_k]⟩.

1213 Once the direction ξk is computed, a trial point Zk is searched along ξk followed by a retrac-
1214 tion, i.e.,
1215 (9.8) Zk = RXk (αk ξk ).
1216 The step size αk = α0 δ h is chosen by the Armijo rule such that h is the smallest integer
1217 satisfying
1218 (9.9) mk (RXk (α0 δ h ξk )) ≤ ρα0 δ h hgrad mk (Xk ), ξk i ,
1219 where ρ, δ ∈ (0, 1) and α0 ∈ (0, 1] are given constants.
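For concreteness, the backtracking search (9.9) can be sketched with a QR-based retraction as follows; the choice of retraction is an assumption for illustration, and m and grad_m_X stand for the model m_k and its Riemannian gradient at X_k (recall that m_k(X_k) = 0, so no constant term appears in the test).

```python
import numpy as np

def qr_retraction(X, xi):
    """Retraction on the Stiefel manifold: Q factor of X + xi (one common choice)."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)   # fix the QR sign ambiguity

def armijo_step(m, grad_m_X, X, xi, rho=1e-4, delta=0.5, alpha0=1.0, max_back=30):
    """Backtracking search (9.9) along a tangent direction xi; m is a callable model."""
    slope = np.real(np.vdot(grad_m_X, xi))          # <grad m_k(X_k), xi_k>
    alpha = alpha0
    for _ in range(max_back):
        if m(qr_retraction(X, alpha * xi)) <= rho * alpha * slope:
            break
        alpha *= delta
    return qr_retraction(X, alpha * xi), alpha
```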
1220 The performance of the Newton-type method may deteriorate seriously when the
1221 Hessian is close to singular. One reason is that the Riemannian Newton direction is nearly
1222 parallel to the negative gradient direction. Consequently, the next iterate X_{k+1} very likely
1223 belongs to the subspace span{Xk , grad f (Xk )}, which is similar to the Riemannian gradient
1224 approach. To overcome the numerical difficulty, we can further solve (9.1) in a restricted
1225 subspace. Specifically, a q-dimensional subspace Sk is constructed with an orthogonal basis
1226 Qk ∈ Cn×q (p ≤ q ≤ n). Then the representation of any point X in the subspace Sk is
1227 X = Qk M
1228 for some M ∈ Cq×p . In a similar fashion to these constructions for the linear eigen-
1229 value problems in section 8, the subspace Sk can be built by using the history information
1230 {Xk , Xk−1 , . . .}, {grad f (Xk ), grad f (Xk−1 ), . . .} and other useful information. Once a
1231 subspace Sk is given, (9.1) with an additional constraint X ∈ Sk becomes
1232 (9.10)    min_{M ∈ C^{q×p}} f(Q_k M) s. t. M^* M = I_p.

1233 Suppose that Mk is an inexact solution of the problem (9.10) from existing optimization
1234 methods on manifold. Then Xk+1 = Qk Mk is a better point than Xk . For extremely difficult
1235 problems, one may alternate between the Newton type method and the subspace refinement
1236 procedure for a few cycles.
1237 9.2. A Structured Quasi-Newton Update with Nyström Approximation. The
1238 secant condition in the classical quasi-Newton methods for constructing the quasi-Newton
1239 matrix B_k is
1240 (9.11)    B_k[S_k] = ∇f(X_k) − ∇f(X_{k−1}),
1241 where
1242 Sk := Xk − Xk−1 .
1243 Assume that the Euclidean Hessian ∇2 f (X) is a summation of a relatively cheap part Hc (X)
1244 and a relatively expensive or even inaccessible part He (X), i.e.,
1245 (9.12) ∇2 f (X) = Hc (X) + He (X).
1246 Then it is reasonable to keep the cheaper part Hc (X) but approximate He (X) using the
1247 quasi-Newton update Ek . It yields an approximation Bk to the Hessian ∇2 f (Xk ) as
1248 (9.13)    B_k = H_c(X_k) + E_k.
1249 Plugging (9.13) into (9.11) gives the following revised secant condition
1250 (9.14) Ek [Sk ] = Yk ,
1251 where
1252 (9.15) Yk := ∇f (Xk ) − ∇f (Xk−1 ) − Hc (Xk )[Sk ].
1253 A good initial matrix Ek0 to Ek is important to ensure the convergence speed of the limited-
1254 memory quasi-Newton method. We assume that a known matrix Ê_k^0 can approximate the ex-
1255 pensive part of the Hessian H_e(X_k) well, and that a very limited number of matrix-matrix products
1256 involving Ê_k^0 is affordable but many of them are still prohibitive. We next use the Nyström
1257 approximation [117] to construct a low rank matrix. Let Ω be a matrix whose columns consti-
1258 tute an orthogonal basis of a well-chosen subspace S and denote W = Êk0 [Ω]. The Nyström
1259 approximation is
1260 (9.16) Ek0 [U ] := W (W ∗ Ω)† W ∗ U,
where U ∈ Cn×p is any direction. When the dimension of the subspace S is small enough,
the rank of W (W ∗ Ω)† W ∗ is also small so that the computational cost of Ek0 [U ] is significantly
cheaper than the original Êk0 [U ]. Suppose the subspace S is chosen as
span{Xk−1 , Xk },
1261 which contains the element Sk . If Êk0 [U V ] = Êk0 [U ]V for any matrices U, V with proper
1262 dimension (this condition is satisfied when Êk0 is a matrix), then the secant condition still
1263 holds at Ek0 , i.e.,
1264 Ek0 [Sk ] = Yk .
1265 The subspace S can also be defined as
1266 (9.17) span{Xk−1 , Xk , Êk0 [Xk ]} or span{Xk−h , . . . , Xk−1 , Xk }
1267 with small memory length h. Consequently, we obtain a limited-memory Nyström approxi-
1268 mation.
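The limited-memory Nyström approximation (9.16) can be realized as a cheap linear operator once the action of Ê_k^0 on an orthonormal basis Ω of the chosen subspace has been computed; the sketch below takes this action as a user-supplied callable.

```python
import numpy as np

def nystrom_operator(apply_E0, Omega):
    """Build the Nystrom approximation (9.16): U -> W (W^* Omega)^† W^* U,
    where W = E0_hat[Omega] and Omega is an orthonormal basis of the subspace."""
    W = apply_E0(Omega)                         # the only expensive applications
    core = np.linalg.pinv(W.conj().T @ Omega)   # (W^* Omega)^†, a small matrix
    def apply(U):
        return W @ (core @ (W.conj().T @ U))
    return apply
```

For instance, Ω can be taken as an orthonormal basis of span{X_{k−1}, X_k} obtained from a QR factorization of [X_{k−1}, X_k].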
1269 9.3. Electronic Structure Calculations. The density functional theory (DFT) in
1270 electronic structure calculation is an important source of optimization problems with orthog-
1271 onality constraints. By abuse of notation, we refer to Kohn-Sham (KS) equations with lo-
1272 cal or semi-local exchange-correlation functionals as KSDFT, and KS equations with hybrid
1273 functionals as HF (short for Hartree-Fock). The KS/HF equations try to identify orthogo-
1274 nal eigenvectors to satisfy the nonlinear eigenvalue problems, while the KS/HF minimization
1275 problem minimizes the KS/HF total energy functionals under the orthogonality constraints.
1276 These two problems are connected by the optimality conditions.
1277 9.3.1. The Mathematical Models. The wave functions of p occupied states can be
1278 expressed as X = [x1 , . . . , xp ] ∈ Cn×p with X ∗ X = Ip after some suitable discretization.
1279 The KS total energy functional is defined as
(9.18)
1280    E_ks(X) := (1/4) tr(X^* L X) + (1/2) tr(X^* V_ion X) + (1/2) ∑_l ∑_i ζ_l |x_i^* w_l|² + (1/4) ρ^⊤ L^† ρ + (1/2) e_n^⊤ ε_xc(ρ),
l

1281 where L is a discretized Laplacian operator, the charge density is ρ(X) = diag(XX ∗ ), Vion
1282 is the constant ionic pseudopotentials, wl represents a discretized pseudopotential reference
1283 projection function, ζ_l is a constant whose value is ±1, and ε_xc is related to the exchange
1284 correlation energy. The Fock exchange operator V(·) : Cn×n → Cn×n is usually a fourth-
1285 order tensor [69] which satisfies the following properties: (i) hV(D1 ), D2 i = hV(D2 ), D1 i
1286 for any D1 , D2 ∈ Cn×n ; (ii) V(D) is Hermitian if D is Hermitian. Then the Fock exchange
1287 energy is
1288 (9.19)    E_f(X) := (1/4) ⟨V(XX^*)X, X⟩ = (1/4) ⟨V(XX^*), XX^*⟩.
1289 Therefore, the total energy minimization problem can be formulated as

1290 (9.20) min E(X), s. t. X ∗ X = Ip ,


X∈Cn×p

1291 where E(X) is Eks (X) in KSDFT and

1292 Ehf (X) := Eks (X) + Ef (X)

1293 in HF. Computing Ef (X) is very expensive since a multiplication between an n × n × n × n


1294 fourth-order tensor and an n-by-n matrix is needed in V(·).
1295 Denote the KS Hamiltonian Hks (X) as

1296 (9.21)    H_ks(X) := (1/2) L + V_ion + ∑_l ζ_l w_l w_l^* + Diag((ℜL^†)ρ) + Diag(µ_xc(ρ)^* e_n),

1297 where µ_xc(ρ) = ∂ε_xc(ρ)/∂ρ. Since H_ks(X) is essentially determined by the charge density ρ(X),
1298 it is often written as Hks (ρ). The HF Hamiltonian is

1299 (9.22) Hhf (X) := Hks (X) + V(XX ∗ ).

1300 A detailed calculation shows that the Euclidean gradient of Eks (X) is

1301 (9.23) ∇Eks (X) = Hks (X)X.

1302 The gradient of E_f(X) is ∇E_f(X) = V(XX^*)X. Assuming that ε_xc(ρ(X)) is twice differen-
1303 tiable with respect to ρ(X), the Hessian of E_ks(X) is

1304 (9.24)    ∇²E_ks(X)[U] = H_ks(X)U + R(X)[U],

1305 where U ∈ C^{n×p} and R(X)[U] := Diag( (ℜL^† + (∂²ε_xc/∂ρ²) e_n) (X̄ ⊙ U + X ⊙ Ū) e_n ) X. The
1306 Hessian of Ef (X) is

1307 (9.25) ∇2 Ef (X)[U ] = V(XX ∗ )U + V(XU ∗ + U X ∗ )X.


1308 9.3.2. The Self-Consistent Field (SCF) Iteration. The first-order optimality con-
1309 ditions for the total energy minimization problem are
1310 (9.26) H(X)X = XΛ, X ∗ X = Ip ,
1311 where X ∈ Cn×p , Λ is a diagonal matrix and H represents Hks in (9.21) or Hhf in (9.22). For
1312 KSDFT, one of the most popular methods is the SCF iteration. At the k-th iteration, we first
1313 fix the Hamiltonian to be Hks (ρ̃k ) for a given ρ̃k and solve the following linear eigenvalue
1314 problem
1315 (9.27) Hks (ρ̃k )X = XΛ, X ∗ X = Ip .
1316 The eigenvectors corresponding to the p smallest eigenvalues of H_ks(ρ̃_k) are denoted as X_{k+1},
1317 which leads to a new charge density ρk+1 = ρ(Xk+1 ). It is then mixed with charge densities
1318 from previous steps to produce the new charge density ρ̃k+1 in order to accelerate the con-
1319 vergence instead of using ρk+1 directly. This procedure is repeated until self-consistency is
1320 reached.
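A minimal SCF sketch with simple linear mixing is given below; build_h is a user-supplied routine assembling H_ks(ρ) and the mixing parameter is an illustrative choice (the Pulay mixing described next can be substituted for the linear mixing).

```python
import numpy as np

def scf(build_h, rho0, p, mix=0.3, tol=1e-8, max_iter=100):
    """SCF iteration: diagonalize H_ks(rho), occupy the p lowest states, mix the density."""
    rho = rho0
    for _ in range(max_iter):
        H = build_h(rho)
        w, V = np.linalg.eigh(H)
        X = V[:, :p]                                      # p smallest eigenpairs
        rho_new = np.real(np.sum(X * X.conj(), axis=1))   # diag(X X^*)
        if np.linalg.norm(rho_new - rho) < tol:           # self-consistency reached
            return X, rho_new
        rho = (1 - mix) * rho + mix * rho_new             # simple charge mixing
    return X, rho
```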
1321 A particular charge mixing scheme is the direct inversion in the iterative subspace (DIIS) or
1322 the Pulay mixing [92, 93, 115]. Choose an integer m with m ≤ k. Let
1323 W = (∆ρk , ∆ρk−1 , . . . , ∆ρk−m+1 ), ∆ρj = ρj − ρj−1 .
1324 The Pulay mixing generates the charge density ρ̃_k by a linear combination of the previous
1325 charge densities

1326    ρ̃_k = ∑_{j=0}^{m−1} c_j ρ_{k−j},

1327 where c = (c0 , c1 , . . . , cm−1 ) is the solution to the minimization problem:


1328 min kW ck2 , s. t. c> em = 1.
c

1329 Other types of mixing include Broyden mixing, Kerker mixing and Anderson mixing, etc.
1330 Charge mixing is widely used for improving the convergence of SCF even though its conver-
1331 gence properties are still not fully understood.
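The Pulay coefficients can be obtained from the KKT system of the equality-constrained least squares problem above, as in the following sketch; the mixed density is then ρ̃_k = ∑_j c_j ρ_{k−j}, and the same solver applies to the C-DIIS coefficients below with the W_j replaced by commutator residuals.

```python
import numpy as np

def pulay_coefficients(residuals):
    """Solve min ||W c||_2 s.t. c^T e = 1 via the KKT system of the constrained problem."""
    W = np.column_stack(residuals)
    m = W.shape[1]
    G = W.T @ W                                   # Gram matrix of the residuals
    kkt = np.block([[G, np.ones((m, 1))],
                    [np.ones((1, m)), np.zeros((1, 1))]])
    rhs = np.zeros(m + 1)
    rhs[-1] = 1.0                                 # enforces c^T e = 1
    sol = np.linalg.solve(kkt, rhs)
    return sol[:m]                                # mixing coefficients c_0, ..., c_{m-1}
```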
1332 In HF, the SCF method at the k-th iteration solves:
1333 H̃k X = XΛ, X ∗ X = Ip ,

1334 where H̃k is formed from certain mixing schemes. Note that the Hamiltonian (9.22) can be
1335 written as Hhf (D) with respect to the density matrix D = XX ∗ . In the commutator DIIS
1336 (C-DIIS) method [92, 93], the residual Wj is defined as the commutator between Hhf (Dj )
1337 and Dj , i.e.,
1338 (9.28) Wj = Hhf (Dj )Dj − Dj Hhf (Dj ).
1339 We next solve the following minimization to obtain a coefficient c:
1340    min_c ‖ ∑_{j=0}^{m−1} c_j W_j ‖_F^2, s. t. c^⊤ e_m = 1.
1341 Then, a new Hamiltonian matrix is obtained as H̃_k = ∑_{j=0}^{m−1} c_j H_{k−j}. Since an explicit storage
1342 of the density matrix can be prohibitive, the projected C-DIIS in [60] uses projections of the
1343 density and commutator matrices so that the sizes are much smaller.
1344 9.3.3. Subspace Methods For HF using Nyström Approximation. Note that
1345 the most expensive part in HF is the evaluation of Ef (X) and the related derivatives. We
1346 apply the limited-memory Nyström technique to approximate V (Xk Xk∗ ) by V̂ (Xk Xk∗ ). Let
1347 Z = V (Xk Xk∗ ) Ω where Ω is an orthogonal basis of the subspace such as

1348 span{Xk }, span{Xk−1 , Xk } or span{Xk−1 , Xk , V (Xk Xk∗ )Xk }.

1349 Then the low rank approximation

1350 (9.29) V̂ (Xk Xk∗ ) := Z(Z ∗ Ω)† Z ∗

1351 is able to reduce the computational cost significantly. Note that the adaptive compression
1352 method in [73] compresses the operator V(Xk Xk∗ ) on the subspace span{Xk }. Conse-
1353 quently, we can keep the easier parts Eks but approximate Ef (X) by using (9.29). Hence, a
1354 new subproblem is formulated as

1355 (9.30)    min_{X ∈ C^{n×p}} E_ks(X) + (1/4) ⟨V̂(X_k X_k^*) X, X⟩ s. t. X^* X = I_p.

1356 The subproblem (9.30) can be solved by the SCF iteration, the Riemannian gradient method
1357 or the modified CG method based on the following linear equation

1358    Proj_{X_k}( ∇²E_ks(X_k)[ξ] + (1/2) V̂(X_k X_k^*) ξ − ξ sym(X_k^* ∇f(X_k)) ) = −grad E_hf(X_k).

1359 9.3.4. A Regularized Newton Type Method. Computing the p smallest eigenpairs
1360 of Hks (ρ̃) is equivalent to a trace minimization problem

1361 (9.31)    min_{X ∈ C^{n×p}} q(X) := (1/2) tr(X^* H_ks(ρ̃) X) s. t. X^* X = I_p.

1362 Note that q(X) is a second-order approximation to the total energy Eks (X) without consid-
1363 ering the second term in the Hessian (9.24). Hence, the SCF method may not converge if this
1364 second term dominates. The regularized Newton method in subsection 9.1 can be applied to solve both KSDFT
1365 and HF with convergence guarantees. We next explain a particular version in [138] whose
1366 subproblem is

1367 (9.32)    min_{X ∈ C^{n×p}} (1/2) tr(X^* H_ks(ρ̃) X) + (τ_k/4) ‖XX^⊤ − X_k X_k^⊤‖_F^2 s. t. X^* X = I_p.

1368 Since Xk and X are orthonormal matrices, we have

1369    ‖XX^⊤ − X_k X_k^⊤‖_F^2 = tr((XX^⊤ − X_k X_k^⊤)(XX^⊤ − X_k X_k^⊤)) = 2p − 2 tr(X^⊤ X_k X_k^⊤ X).

1370 Therefore, (9.32) is a linear eigenvalue problem:

1371    (H_ks(ρ̃) − τ_k X_k X_k^⊤) X = XΛ, X^⊤ X = I_p.
1372 9.3.5. Subspace Refinement for KSDFT. The direct minimization method in [138]
1373 is a kind of subspace refinement procedure using

1374 Y = [Xk , Pk , Rk ],

1375 where Pk = Xk − Xk−1 and Rk = Hks (Xk )Xk − Xk Λk . Then the variable X can be
1376 expressed as X = Y G where G ∈ C3p×p . The total energy minimization problem (9.20)
1377 becomes:
1378    min_G E_ks(Y G), s. t. G^* Y^* Y G = I_p,

1379 whose first-order optimality condition is a generalized linear eigenvalue problem:

1380 (Y ∗ Hks (Y G)Y )G = Y ∗ Y GΩ, G∗ Y ∗ Y G = Ip .

1381 The subspace refinement method may help when the regularized Newton method does
1382 not perform well. Note that the total energy minimization problem (9.20) is not necessarily
1383 equivalent to the nonlinear eigenvalue problem (9.26) for finding the p smallest eigenvalues of
1384 H(X). Although an intermediate iterate X is orthogonal and contains eigenvectors of H(X),
1385 these eigenvectors are not necessarily the eigenvectors corresponding to the p smallest eigen-
1386 values. Hence, we can form a subspace which contains these possible target eigenvectors. In
1387 particular, we first compute the first γp smallest eigenvalues for some small integer γ. Their
1388 corresponding eigenvectors of H(Xk ), denoted by Γk , are put in a subspace as

1389 (9.33) span{Xk−1 , Xk , grad E(Xk ), Γk }.

1390 Numerical experience shows that the refinement scheme in subsection 9.1 with this subspace
1391 is likely to escape a stagnation point.
1392 10. Semidefinite Programming (SDP). In this section, we present two specialized
1393 subspace methods for solving the maxcut SDP and the maxcut SDP with nonnegative con-
1394 straints from community detection.
1395 10.1. The Maxcut SDP. The maxcut problem partitions the vertices of a graph into
1396 two sets so that the sum of the weights of the edges connecting vertices in one set with those
1397 in the other set is maximized. The corresponding SDP relaxation [46, 16, 56, 8] is

1398 (10.1)    min_{X ∈ S^n} ⟨C, X⟩ s. t. X_{ii} = 1, i = 1, · · · , n, X ⪰ 0.

1399 We first describe a second-order cone program (SOCP) restriction for the SDP prob-
1400 lem (10.1) by fixing all except one row and column of the matrix X. For any integer
1401 i ∈ {1, . . . , n}, the complement of the set {i} is i^c = {1, . . . , n}\{i}. Let B = X^{i^c, i^c}
1402 be the submatrix of X after deleting its i-th row and column, and y = X^{i^c, i} be the i-th col-
1403 umn of the matrix X without the element X^{i,i}. Since X_{ii} = 1, the variable X of (10.1) can
1404 be written as
1405    X := ( 1  y^⊤ ; y  B ) = ( 1  y^⊤ ; y  X^{i^c, i^c} )
1406 without loss of generality. Suppose that the submatrix B is fixed. It then follows from the
1407 Schur complement theorem that X ⪰ 0 is equivalent to

1408    1 − y^⊤ B^{−1} y ≥ 0.
1409 In order to maintain the strict positive definiteness of X, we require 1 − y > B −1 y ≥ ν for a
1410 small constant ν > 0. Therefore, the SDP problem (10.1) is reduced to a SOCP:

1411 (10.2)    min_{y ∈ R^{n−1}} ĉ^⊤ y    s. t. 1 − y^⊤ B^† y ≥ ν, y ∈ Range(B),
1412 where ĉ := 2C^{i^c, i}. If γ := ĉ^⊤ B ĉ > 0, an explicit solution of (10.2) is given by

1413 (10.3)    y = −√((1 − ν)/γ) B ĉ.

1414 Otherwise, the solution is y = 0.


1415 We next describe the row-by-row (RBR) method [130]. Starting from a positive definite feasible so-
1416 lution X_1, it updates one row/column of X at each of the inner steps. The operations from
1417 the first row to the last row are called a cycle. At the first step of the k-th cycle, the matrix B
1418 is set to X_k^{1^c,1^c} and y is computed by (10.3). Then the first row/column of X_k is substituted
1419 by X_k^{1^c,1} := y. Other rows/columns are updated in a similar fashion until all rows/columns
1420 are updated. Then we set X_{k+1} := X_k and this procedure is repeated until certain stopping
1421 criteria are met.
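The RBR cycle for (10.1) can be sketched as follows, using the explicit solution (10.3) for each row/column; this is a dense illustration without the stopping tests used in practice.

```python
import numpy as np

def rbr_maxcut(C, nu=1e-3, cycles=50):
    """Row-by-row method for the maxcut SDP (10.1) with the explicit update (10.3)."""
    n = C.shape[0]
    X = np.eye(n)                                   # positive definite feasible start
    for _ in range(cycles):
        for i in range(n):
            idx = np.delete(np.arange(n), i)        # the index set i^c
            B = X[np.ix_(idx, idx)]
            c_hat = 2.0 * C[idx, i]
            Bc = B @ c_hat
            gamma = c_hat @ Bc
            y = -np.sqrt((1.0 - nu) / gamma) * Bc if gamma > 0 else np.zeros(n - 1)
            X[idx, i] = y                           # overwrite the i-th row/column
            X[i, idx] = y
    return X
```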
1422 The RBR method can also be derived from the logarithmic barrier problem

1423 (10.4)    min_{X ∈ S^n} φ_σ(X) := ⟨C, X⟩ − σ log det X    s. t. X_{ii} = 1, ∀ i = 1, · · · , n, X ≻ 0.
1424 Fixing the block B = X^{i^c, i^c} gives

1425    det(X) = det(B)(1 − (X^{i^c, i})^⊤ B^{−1} X^{i^c, i}).

1426 Therefore, the RBR subproblem for (10.4) is

1427 (10.5)    min_{y ∈ R^{n−1}} ĉ^⊤ y − σ log(1 − y^⊤ B^{−1} y).

1428 If γ := ĉ^⊤ B ĉ > 0, the solution of problem (10.5) is

1429 (10.6)    y = −((√(σ² + γ) − σ)/γ) B ĉ.

1430 Consequently, the subproblem (10.2) has the same solution as (10.5) if ν = 2σ(√(σ² + γ) − σ)/γ.

10.1.1. Examples: Phase Retrieval. Given a matrix A \in \mathbb{C}^{m\times n} and a vector b \in \mathbb{R}^m, the phase retrieval problem can be formulated as a feasibility problem:

    \text{find } x, \quad \text{s.t.} \ |Ax| = b.

An equivalent model in [122] is

    \min_{x \in \mathbb{C}^n,\, y \in \mathbb{R}^m} \ \frac{1}{2}\|Ax - y\|_2^2 \quad \text{s.t.} \ |y| = b,
which can be further reformulated as

(10.7)    \min_{x \in \mathbb{C}^n,\, u \in \mathbb{C}^m} \ \frac{1}{2}\|Ax - \mathrm{diag}(b)u\|_2^2 \quad \text{s.t.} \ |u_i| = 1, \ i = 1, \ldots, m.

By fixing the variable u, it becomes a least squares problem with respect to x, whose explicit solution is x = A^\dagger \mathrm{diag}(b) u. Substituting x back into (10.7) yields a general maxcut problem:

    \min_{u \in \mathbb{C}^m} \ u^* M u \quad \text{s.t.} \ |u_i| = 1, \ i = 1, \ldots, m,

where M = \mathrm{diag}(b)(I - AA^\dagger)\mathrm{diag}(b) is positive semidefinite. Hence, the corresponding SDP relaxation is

    \min_{U \in S^m} \ \mathrm{tr}(UM) \quad \text{s.t.} \ U_{ii} = 1, \ i = 1, \ldots, m, \quad U \succeq 0.
1444 The above problem can be further solved by the RBR method.
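For illustration, a sketch of how the matrix M could be assembled with numpy is given below; the function name and the use of a dense pseudoinverse are assumptions made for readability, not a scalable implementation.

import numpy as np

def phase_retrieval_maxcut_matrix(A, b):
    """Assemble M = diag(b) (I - A A^dagger) diag(b) for the phase retrieval SDP."""
    m = A.shape[0]
    P = A @ np.linalg.pinv(A)                      # projector A A^dagger onto Range(A)
    return np.diag(b) @ (np.eye(m) - P) @ np.diag(b)

# Once a unimodular vector u is recovered (e.g., from the leading eigenvector of
# the SDP solution U), the signal is obtained as x = A^dagger diag(b) u.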
10.2. Community Detection. Suppose that the nodes [n] = \{1, \ldots, n\} of a network can be partitioned into r \ge 2 disjoint sets \{K_1, \ldots, K_r\}. A binary matrix X is called a partition matrix if X_{ij} = 1 for i, j \in K_t, t \in \{1, \ldots, r\}, and X_{ij} = 0 otherwise. Let A be the adjacency matrix and d be the degree vector, where d_i = \sum_j A_{ij}, i \in [n]. Define the matrix

(10.8)    C = -(A - \lambda d d^\top),

where \lambda = 1/\|d\|_1. A popular method for the community detection problem is to maximize the modularity [86] as:

(10.9)    \min_{X} \ \langle C, X \rangle \quad \text{s.t.} \ X \in \mathcal{P}_n^r,

where \mathcal{P}_n^r is the set of all partition matrices of n nodes with no more than r subsets. Since the modularity optimization (10.9) is NP-hard, an SDP relaxation proposed in [25] is:

(10.10)    \min_{X \in \mathbb{R}^{n\times n}} \ \langle C, X \rangle \quad \text{s.t.} \ X_{ii} = 1, \ i = 1, \ldots, n, \quad 0 \le X_{ij} \le 1, \ \forall i, j, \quad X \succeq 0.

The RBR method in subsection 10.1 cannot be applied to (10.10) directly due to the componentwise constraints 0 \le X_{ij} \le 1.
Note that the true partition matrix X^* can be decomposed as X^* = \Phi^*(\Phi^*)^\top, where \Phi^* \in \{0,1\}^{n\times r} is the true assignment matrix. This decomposition is unique up to a permutation of the columns of \Phi^*. The structure of \Phi^* leads to a new relaxation of the original partition matrix X [146]. Define a matrix

    U = \bigl[u_1, \ldots, u_n\bigr]^\top \in \mathbb{R}^{n\times r}.

We can consider a decomposition X = UU^\top. The constraints X_{ii} = 1 and \Phi^* \in \{0,1\}^{n\times r} imply that

    \|u_i\|_2 = 1, \quad U \ge 0, \quad \|u_i\|_0 \le p,
where the cardinality constraints are added to impose sparsity of the solution. Therefore, an alternative relaxation to (10.9) is

(10.11)    \min_{U \in \mathbb{R}^{n\times r}} \ \langle C, UU^\top \rangle \quad \text{s.t.} \ \|u_i\|_2 = 1, \ \|u_i\|_0 \le p, \ i = 1, \ldots, n, \quad U \ge 0.

Although (10.11) is still NP-hard, it enables us to develop a computationally efficient RBR method. The feasible set for each block u_i is

    \mathcal{U} = \{u \in \mathbb{R}^r \mid \|u\|_2 = 1, \ u \ge 0, \ \|u\|_0 \le p\}.

Then, problem (10.11) can be rewritten as

(10.12)    \min_{U \in \mathbb{R}^{n\times r}} \ f(U) \equiv \langle C, UU^\top \rangle, \quad \text{s.t.} \ u_i \in \mathcal{U}.

For the i-th subproblem, we fix all except the i-th row of U and formulate the subproblem as

    u_i = \arg\min_{x \in \mathcal{U}} \ f(u_1, \ldots, u_{i-1}, x, u_{i+1}, \ldots, u_n) + \frac{\sigma}{2}\|x - \bar{u}_i\|^2,

where the last part of the objective function is the proximal term and \sigma > 0 is a parameter. Note that the quadratic term \|x\|^2 is eliminated due to the constraint \|x\|_2 = 1. Therefore, the subproblem becomes

(10.13)    u_i = \arg\min_{x \in \mathcal{U}} \ b^\top x,

where b = 2C_{i,i^c} U_{-i} - \sigma\bar{u}_i, C_{i,i^c} is the i-th row of C without the i-th component, and U_{-i} is the matrix U without the i-th row. Define b_+ = \max\{b, 0\} and b_- = \max\{-b, 0\}, where the max is taken componentwise. Then the closed-form solution of (10.13) is given by

(10.14)    u = \begin{cases} \dfrac{b_-^p}{\|b_-^p\|}, & \text{if } b_- \neq 0, \\ e_{j_0} \ \text{with} \ j_0 = \arg\min_j b_j, & \text{otherwise}, \end{cases}

where b_-^p is obtained by keeping the largest p components of b_- and setting the others to zero, and b_-^p = b_- when \|b_-\|_0 \le p. Then the RBR method goes over all rows of U using (10.14).
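The closed-form solution (10.14) amounts to keeping the p largest entries of b_-, normalizing, and falling back to a coordinate vector when b_- vanishes. The following minimal numpy sketch implements exactly this update; the function name rbr_row_update is illustrative.

import numpy as np

def rbr_row_update(b, p):
    """Closed-form solution (10.14) of the sparse row subproblem (10.13)."""
    b_minus = np.maximum(-b, 0.0)              # componentwise negative part b_-
    if np.any(b_minus > 0):
        bp = np.zeros_like(b_minus)
        keep = np.argsort(b_minus)[-p:]        # indices of the p largest entries
        bp[keep] = b_minus[keep]               # keep the largest p components of b_-
        return bp / np.linalg.norm(bp)         # unit-norm, nonnegative, p-sparse row
    u = np.zeros_like(b)
    u[np.argmin(b)] = 1.0                      # fall back to e_{j0}, j0 = argmin_j b_j
    return u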
We next briefly describe the parallelization of the RBR method on a shared memory computer with many threads. The variable U is stored in the shared memory so that it can be accessed by all threads. Even when some row u_i is being updated in a thread, the other threads can still access U whenever necessary. In the sequential RBR method, the main cost of updating one row u_i is the computation of b = 2C_{i,i^c} U_{-i} - \sigma\bar{u}_i, where \bar{u}_i and U are the current iterates. The definition of C in (10.8) gives

(10.15)    b^\top = -2A_{i,i^c} U_{-i} + 2\lambda d_i (d_{i^c})^\top U_{-i} - \sigma\bar{u}_i,

where A_{i,i^c} is the i-th row of A without the i-th component. The parallel RBR method is outlined in Figure 10.1, where many threads work simultaneously. The vector d^\top U and the matrix U are stored in the shared memory and all threads can access and update them.
[Figure 10.1: shared memory holding U, A and the vector d^\top U; each core reads \bar{u}_i, forms its own b^\top, updates u_i and d^\top U, and synchronization occurs once all rows are processed.]

Fig. 10.1 An illustration of the asynchronous parallel proximal RBR method

Every thread picks up its own row u_i at a time and then reads U and the vector d^\top U. Then a private copy of b^\top is computed. Thereafter, the variable u_i is updated and d^\top U is set to d^\top U \leftarrow d^\top U + d_i(u_i - \bar{u}_i) in the shared memory. The thread immediately proceeds to another row without waiting for other threads to finish their tasks. Therefore, when a thread is updating its variables, the other blocks of variables u_j, j \neq i, are not necessarily the newest versions. Moreover, if this thread is reading some row u_j or the vector d^\top U from memory while another thread is modifying them, the data it reads may be only partially updated. Since memory locking is removed, the parallel RBR method may be able to provide near-linear speedups. See also HOGWILD! [94] and CYCLADES [89] for asynchronous methods.
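A sequential sketch of the work performed by each thread is given below; it reuses the rbr_row_update sketch above and maintains the cached vector d^\top U (stored in dU) exactly as described. Running this loop concurrently over rows, without locks, gives the asynchronous variant; the function and variable names are assumptions.

import numpy as np

def rbr_community_cycle(A, d, lam, U, dU, sigma, p):
    """One cycle of the proximal RBR method for (10.11); dU holds d^T U."""
    n = U.shape[0]
    for i in range(n):
        u_bar = U[i].copy()
        # b^T = -2 A_{i,i^c} U_{-i} + 2 lam d_i (d_{i^c})^T U_{-i} - sigma u_bar, cf. (10.15)
        b = (-2.0 * (A[i] @ U - A[i, i] * U[i])
             + 2.0 * lam * d[i] * (dU - d[i] * U[i])
             - sigma * u_bar)
        U[i] = rbr_row_update(b, p)            # closed-form row update (10.14)
        dU += d[i] * (U[i] - u_bar)            # keep the cached d^T U consistent
    return U, dU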
1504 11. Low Rank Matrix Optimization. Optimization problems whose variable is re-
1505 lated to low-rank matrices arise in many applications, for example, semidefinite programming
(SDP), matrix completion, robust principal component analysis, control and systems theory, model reduction [76], phase retrieval, blind deconvolution, data mining, pattern recognition
1508 [33], latent semantic indexing, collaborative prediction and low-dimensional embedding.
11.1. Low Rank Structure of First-order Methods. A common feature of many first-order methods for low rank matrix optimization problems is that the next iterate x_{k+1} is defined by the current iterate x_k and a partial eigenvalue decomposition of a certain matrix. They can be unified as the following fixed-point iteration scheme [71]:

(11.1)    x_{k+1} = T(x_k, \Psi(B(x_k))), \quad x_k \in \mathcal{D},

where B : \mathcal{D} \to S^n is a bounded mapping from a given Euclidean space \mathcal{D} to the n-dimensional symmetric matrix space S^n, and T is a general mapping from \mathcal{D} \times S^n to \mathcal{D}. The spectral operator \Psi : S^n \to S^n is given by

(11.2)    \Psi(X) = V \mathrm{Diag}(\psi(\lambda(X))) V^\top,


where X = V \mathrm{Diag}(\lambda_1, \ldots, \lambda_n) V^\top is the eigenvalue decomposition of X with eigenvalues in descending order \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n, \lambda(X) = (\lambda_1, \ldots, \lambda_n)^\top, and the operator \psi : \mathbb{R}^n \to \mathbb{R}^n is a vector-valued symmetric mapping, i.e., \psi(P\lambda) = P\psi(\lambda) for any permutation matrix P. The orthogonal projection of a symmetric matrix X onto a given \mathrm{Range}(Q) with Q^\top Q = I is defined as:

(11.3)    P_Q(X) := \arg\min_{Y \in S^n,\ \mathrm{Range}(Y) = \mathrm{Range}(Q)} \|Y - X\|_F^2 = QQ^\top X QQ^\top.

The operator \Psi has the low-rank property at X if there exists an orthogonal matrix V_I \in \mathbb{R}^{n\times p} (p \ll n) that spans a p-dimensional eigenspace corresponding to \lambda_i(X), i \in I, such that \Psi(X) = \Phi(P_{V_I}(X)), where \Phi is either the same as \Psi or a different spectral operator induced by \phi, and I is an index set depending on X. The low-rank property ensures that the full eigenvalue decomposition is not needed.
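The two operations (11.2) and (11.3) are simple to state in code. The sketch below is a direct, dense numpy implementation; the function names are illustrative, and a practical code would of course avoid full eigenvalue decompositions.

import numpy as np

def spectral_operator(X, psi):
    """Evaluate (11.2): apply psi to the eigenvalues of the symmetric matrix X."""
    lam, V = np.linalg.eigh(X)            # eigh returns eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]        # reorder to the descending convention of (11.2)
    return (V * psi(lam)) @ V.T           # V Diag(psi(lambda(X))) V^T

def project_onto_range(X, Q):
    """Evaluate (11.3): P_Q(X) = Q Q^T X Q Q^T for an orthonormal basis Q."""
    return Q @ (Q.T @ X @ Q) @ Q.T

# For instance, the projection onto the positive semidefinite cone used in
# subsection 11.4 is recovered with psi(lam) = np.maximum(lam, 0.0).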
1529 The scheme (11.1) is time-consuming for large scale problems since first-order methods
1530 often take thousands of iterations to converge and each iteration requires at least one full
1531 or partial eigenvalue decomposition for evaluating Ψ. However, Ψ(B(xk )) often lives in a
low-dimensional eigenspace in practice. A common practice is to use inexact methods such as the Lanczos method, LOBPCG, and randomized methods with early stopping rules [149, 6, 106]. The so-called subspace method performs refinement on a low-dimensional subspace for the univariate maximal eigenvalue optimization problem [66, 102, 63] and in the SCF iteration
1536 for KSDFT [151]. In the rest of this section, we present approaches [71] which integrate
1537 eigenvalue computation coherently with the underlying optimization methods.
1538 11.2. A Polynomial-filtered Subspace Method. We now describe a general sub-
1539 space framework for the scheme (11.1) using Chebyshev polynomials ρk (·) defined in (8.5).
1540 Assume that x∗ is a limit point of the fixed-point iteration (11.1) and the low-rank property
1541 holds for every B(xk ) in (11.1). Consequently, the scheme (11.1) is equivalent to

(11.4)    x_{k+1} = T(x_k, \Phi(P_{V_{I_k}}(B(x_k)))),

where V_{I_k} is determined by B(x_k). Although the exact subspace V_{I_k} is usually unknown, it can be approximated by an estimated subspace U_k so that the computational cost of \Psi is significantly reduced. After the next point x_{k+1} is formed, a polynomial filter step is performed in order to extract a new subspace U_{k+1} based on U_k. Therefore, combining the two steps (8.6) and (11.4) together gives

(11.5)    x_{k+1} = T(x_k, \Phi(P_{U_k}(B(x_k)))),
(11.6)    U_{k+1} = \mathrm{orth}(\rho_{k+1}^{q_{k+1}}(B(x_{k+1})) U_k),

where q_{k+1} is a small number (e.g., 1 to 3) of applications of the polynomial filter \rho_{k+1}(\cdot) to U_k. The
1552 Chebyshev polynomials are suitable when the targeted eigenvalues are located within an inter-
1553 val, for example, finding a few largest/smallest eigenvalues in magnitude or all positive/neg-
1554 ative eigenvalues.
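A generic sketch of the extraction step (11.6) is given below. It applies a standard three-term Chebyshev recurrence on an interval [a, b] that is assumed to enclose the unwanted part of the spectrum of B, followed by re-orthonormalization. Since the filter \rho_k of (8.5) is not restated here, this particular recurrence and the helper names cheb_filter and subspace_update are assumptions.

import numpy as np

def cheb_filter(B, U, degree, a, b):
    """Apply a degree-`degree` Chebyshev filter on [a, b] to the block U."""
    e = (b - a) / 2.0
    c = (b + a) / 2.0
    Y = (B @ U - c * U) / e                    # first-order term T_1((B - cI)/e) U
    for _ in range(2, degree + 1):
        Y_new = 2.0 * (B @ Y - c * Y) / e - U  # three-term Chebyshev recurrence
        U, Y = Y, Y_new
    return Y

def subspace_update(B, U, degree=3, a=0.0, b=1.0):
    """One extraction step (11.6): orth(rho(B) U) via QR."""
    Q, _ = np.linalg.qr(cheb_filter(B, U, degree, a, b))
    return Q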
The main feature is that the exact subspace V_{I_k} is substituted by its approximation U_k in (11.5). The principal angle between the true and the extracted subspaces is controlled by the polynomial degree, so the error between one exact and one inexact iteration is bounded. When
1558 the initial space is not orthogonal to the target space, the convergence of (11.5)-(11.6) is
1559 established under mild assumptions. In fact, the subspace often becomes more and more
1560 accurate so that the warm start property is helpful, i.e., the subspace of the current iteration
1561 can be refined from the previous one.
45

This manuscript is for review purposes only.


1562 11.3. The Polynomial-filtered Proximal Gradient Method. We next show how
to apply the subspace updates (11.5) and (11.6) to the proximal gradient method for a class of composite optimization problems

(11.7)    \min_x \ h(x) := F(x) + R(x),

where F(x) = (f \circ \lambda)(B(x)) with B(x) = G + A^*(x), and R(x) is a regularization term with simple structure that need not be smooth. Here G is a known matrix in S^n, and the linear operator A and its adjoint operator A^* are defined as

(11.8)    A(X) = [\langle A_1, X \rangle, \ldots, \langle A_m, X \rangle]^\top, \quad A^*(x) = \sum_{i=1}^m x_i A_i,

1570 for given symmetric matrices Ai ∈ S n . The function f : Rn → R is smooth and absolutely
1571 symmetric, i.e., f (x) = f (P x) for all x ∈ Rn and any permutation matrix P ∈ Rn×n .
1572 Let Ψ be a spectral operator induced by ψ = ∇f . It can be verified that the gradient of
1573 F in (11.7) is

1574 (11.9) ∇F (x) = A(Ψ(B(x))).

The proximal operator is defined by

(11.10)    \mathrm{prox}_{tR}(x) = \arg\min_u \ R(u) + \frac{1}{2t}\|u - x\|_2^2.

Consequently, the proximal gradient method is

(11.11)    x_{k+1} = \mathrm{prox}_{\tau_k R}(x_k - \tau_k A(\Psi(B(x_k)))),

where \tau_k is the step size. Therefore, the iteration (11.11) is a special case of (11.1) with

    T(x, X) = \mathrm{prox}_{\tau_k R}(x - A(X)), \quad \Psi(X) = V \mathrm{Diag}(\nabla f(\lambda(X))) V^\top.

Assume that the low-rank property holds at every iteration. The corresponding polynomial-filtered method can be written as

(11.12)    x_{k+1} = \mathrm{prox}_{\tau_k R}(x_k - \tau_k A(\Phi(P_{U_k}(B(x_k))))),
(11.13)    U_{k+1} = \mathrm{orth}(\rho_{k+1}^{q_{k+1}}(B(x_{k+1})) U_k).
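Putting the pieces together, the following high-level sketch alternates the filtered proximal gradient step (11.12) with the subspace extraction (11.13). The callables B_map, A_map, Phi and prox_R stand for B(x), the linear map A, the spectral operator \Phi and \mathrm{prox}_{\tau R}; they, together with the subspace_update sketch given earlier, are assumptions supplied by the user rather than parts of the original method's implementation.

import numpy as np

def pf_proximal_gradient(x, U, B_map, A_map, Phi, prox_R, tau,
                         degree=3, a=0.0, b=1.0, maxit=100):
    """Polynomial-filtered proximal gradient iteration (11.12)-(11.13)."""
    for _ in range(maxit):
        Bx = B_map(x)
        W = U @ (U.T @ Bx @ U) @ U.T              # P_{U_k}(B(x_k)) as in (11.3)
        x = prox_R(x - tau * A_map(Phi(W)), tau)  # gradient step + prox, (11.12)
        U = subspace_update(B_map(x), U, degree, a, b)  # filter step, (11.13)
    return x, U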

1586 11.3.1. Examples: Maximal Eigenvalue and Matrix Completion. Consider the
1587 maximal eigenvalue optimization problem:

(11.14)    \min_x \ F(x) + R(x) := \lambda_1(B(x)) + R(x),

where B(x) = G + A^*(x). Certain specific formulations of phase recovery and blind deconvolution are special cases of (11.14). The subdifferential of F(x) is

    \partial F(x) = \{A(U_1 S U_1^\top) \mid S \succeq 0, \ \mathrm{tr}(S) = 1\},

where the columns of U_1 \in \mathbb{R}^{n\times r_1} span the eigenspace of \lambda_1(B(x)) with multiplicity r_1. For simplicity, we assume r_1 = 1 and \lambda_1(B(x)) > 0, which means that \partial F(x) has only one element and the function F(x) is differentiable. Then the polynomial-filtered method is

    x_{k+1} = \mathrm{prox}_{\tau R}(x_k - \tau A(u_1 u_1^\top)),


where u_1 is the eigenvector corresponding to \lambda_1(B(x_k)). Hence, we have

    T(x, W) = \mathrm{prox}_{\tau R}(x - \tau A(W)), \quad \Psi(X) = u_1 u_1^\top.

In addition, \Psi(\cdot) satisfies the low-rank property around x^* with I = \{1\} and

    (\psi(\lambda))_i = (\phi(\lambda))_i = \begin{cases} 1, & i = 1, \\ 0, & \text{otherwise}. \end{cases}
Another example is the penalized formulation of the matrix completion problem:

(11.15)    \min_X \ \|X\|_* + \frac{1}{2}\|P_\Omega(X - M)\|_F^2,

where \Omega is a given index set of observed entries of the true matrix M, and P_\Omega : \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n} denotes the projection operator onto the space of sparse matrices with non-zero entries on \Omega. Problem
1604 (11.15) can be solved by the proximal gradient method. At the k-th iteration, the main cost
1605 is to compute the truncated SVD of a matrix. Although (11.15) is not a direct special case of
1606 (11.7), we can still insert the polynomial filter into the proximal gradient method to reduce
1607 the cost of SVD.
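For concreteness, one (unfiltered) proximal gradient step for (11.15) consists of a gradient step on the smooth part followed by singular value soft-thresholding. In the polynomial-filtered variant the SVD would be computed only within the extracted subspace; the dense SVD below is used purely for clarity and the function name is illustrative.

import numpy as np

def mc_prox_step(X, M, mask, tau):
    """One proximal gradient step for (11.15); mask marks the observed entries Omega."""
    G = np.where(mask, X - M, 0.0)             # gradient of 0.5*||P_Omega(X - M)||_F^2
    Y = X - tau * G
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s = np.maximum(s - tau, 0.0)               # prox of tau*||.||_* (soft-thresholding)
    return (U * s) @ Vt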
11.4. The Polynomial-filtered ADMM Method. Consider the standard SDP:

(11.16)    \min_X \ \langle C, X \rangle \quad \text{s.t.} \ AX = b, \quad X \succeq 0,

where C, A and b are given, and the linear operator A and its adjoint are defined in (11.8). Note that the ADMM on the dual problem of (11.16) is equivalent to the Douglas-Rachford Splitting (DRS) method [30] on the primal SDP (11.16). Define F(X) = \mathbf{1}_{\{X \succeq 0\}}(X) and G(X) = \mathbf{1}_{\{AX=b\}}(X) + \langle C, X \rangle, where \mathbf{1}_\Omega(X) is the indicator function of a set \Omega. The proximal operators \mathrm{prox}_{tF}(Z) and \mathrm{prox}_{tG}(Y) can be computed explicitly as

    \mathrm{prox}_{tF}(Z) = P_+(Z),
    \mathrm{prox}_{tG}(Y) = (Y + tC) - A^*(AA^*)^{-1}(AY + tAC - b),

where P_+(Z) is the projection operator onto the positive semidefinite cone. Hence, DRS can be formulated as

    Z_{k+1} = T_{\mathrm{DRS}}(Z_k) = \mathrm{prox}_{tG}(2\,\mathrm{prox}_{tF}(Z_k) - Z_k) - \mathrm{prox}_{tF}(Z_k) + Z_k,

which is also a special case of (11.1) with

    T(x, X) = \mathrm{prox}_{tG}(2X - x) - X + x, \quad \Psi(X) = P_+(X).
Note that P_+(X) is a spectral operator induced by \psi with the form

    (\psi(\lambda))_i = \max\{\lambda_i, 0\}, \quad \forall i.

It can be verified that \Psi(X) = \Psi(P_{V_I}(X)), where I contains all indices of the positive eigenvalues \lambda_i(X). The operator \Psi(X) satisfies the low-rank property if X only has a few positive eigenvalues. Hence, the polynomial-filtered method can be written as

(11.17)    Z_{k+1} = \mathrm{prox}_{tG}(2P_+(P_{U_k}(Z_k)) - Z_k) - P_+(P_{U_k}(Z_k)) + Z_k,
(11.18)    U_{k+1} = \mathrm{orth}(\rho_{k+1}^{q_{k+1}}(Z_{k+1}) U_k).
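A sketch of the resulting iteration is given below; A_map, At_map and AAt_solve stand for A, A^* and the application of (AA^*)^{-1}, and the spectral_operator and subspace_update sketches from the previous subsections are assumed to be available. The interval [lo, hi] passed to the filter is assumed to enclose the unwanted (nonpositive) part of the spectrum; all names here are illustrative rather than part of an existing code.

import numpy as np

def prox_G(Y, t, C, b, A_map, At_map, AAt_solve):
    """prox_{tG}(Y) = (Y + tC) - A^*((AA^*)^{-1}(A(Y + tC) - b))."""
    W = Y + t * C
    return W - At_map(AAt_solve(A_map(W) - b))

def pf_drs_sdp(Z, U, t, C, b, A_map, At_map, AAt_solve,
               degree=3, lo=-1.0, hi=0.0, maxit=100):
    """Polynomial-filtered DRS iteration (11.17)-(11.18) for the SDP (11.16)."""
    psd = lambda lam: np.maximum(lam, 0.0)              # psi for P_+
    for _ in range(maxit):
        PZ = U @ (U.T @ Z @ U) @ U.T                    # P_{U_k}(Z_k)
        X = spectral_operator(PZ, psd)                  # P_+(P_{U_k}(Z_k))
        Z = prox_G(2.0 * X - Z, t, C, b, A_map, At_map, AAt_solve) - X + Z   # (11.17)
        U = subspace_update(Z, U, degree, lo, hi)       # filter step (11.18)
    return Z, U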
1629 11.4.1. Examples: 2-RDM and Cryo-EM. The two-body reduced density matrix
1630 (2-RDM) problem can be formulated as a standard SDP. It has a block diagonal structure
1631 with respect to the variable X, where each block is a low rank matrix. Hence, the polynomial
1632 filters can be applied to each block to reduce the cost. As an extension, we can plug poly-
nomial filters into multi-block ADMM for the nonlinear SDPs from the weighted LS model with spectral norm constraints and the least unsquared deviations (LUD) model in orientation determination of cryo-EM images [124]. For these examples we only introduce the formulations of the corresponding models. The details of the multi-block ADMM can be found in [124].
Suppose that K is a given integer and S and W are two known matrices. The weighted LS model with spectral norm constraints is

(11.19)    \max_G \ \langle WS, G \rangle \quad \text{s.t.} \ G_{ii} = I_2, \quad G \succeq 0, \quad \|G\|_2 \le \alpha K,

1640 where G = (Gij )i,j=1,...,K ∈ S 2K is the variable, with each block Gij being a 2-by-2 small
1641 matrix, and k · k2 is the spectral norm. A three-block ADMM is introduced to solve (11.19).
1642 The cost of the projection onto the semidefinite cone can be reduced by the polynomial filters.
The semidefinite relaxation of the LUD problem is

(11.20)    \min_G \ \sum_{1 \le i < j \le K} \|c_{ij} - G_{ij} c_{ji}\|_2 \quad \text{s.t.} \ G_{ii} = I_2, \quad G \succeq 0, \quad \|G\|_2 \le \alpha K,

where G, G_{ij} and K are defined as in (11.19), and c_{ij} \in \mathbb{R}^2 are known vectors. The
1646 spectral norm constraint in (11.20) is optional. A four-block ADMM is proposed to solve
1647 (11.20). Similarly, the polynomial filters can be inserted into the ADMM update to reduce
1648 the computational cost.
1649 12. Conclusion. In this paper, we provide a comprehensive survey on various sub-
space techniques for nonlinear optimization. The main idea of subspace algorithms is to tackle large scale nonlinear problems by performing iterations in a lower dimensional
1652 subspace. We next summarize a few typical scenarios as follows.
1653 • Find a linear combination of several known directions. Examples are the linear and
nonlinear conjugate gradient methods, Nesterov's accelerated gradient method,
1655 the Heavy-ball method and the momentum method.
1656 • Keep the objective function and constraints, but add an extra restriction in a cer-
1657 tain subspace. Examples are OMP, CoSaMP, LOBPCG, LMSVD, Arrabit, subspace
1658 refinement and multilevel methods.
• Approximate the objective function but keep the constraints. Examples are
1660 BCD, RBR, trust region with subspaces and parallel subspace correction.
• Approximate the objective function and design new constraints. Examples
1662 are trust region with subspaces and FPC AS.
1663 • Add a postprocess procedure after the subspace problem is solved. An example is
1664 the truncated subspace method for tensor train.
1665 • Use subspace techniques to approximate the objective functions. Examples are sam-
1666 pling, sketching and Nyström approximation.
1667 • Integrate the optimization method and subspace update in one framework. An ex-
1668 ample is the polynomial-filtered subspace method for low-rank matrix optimization.
The competitive performance of the methods that adopt the above-mentioned subspace techniques in the related examples implies that subspace methods are very promising
1671 tools for large scale optimization problems. In fact, how to choose subspaces, how to con-
1672 struct subproblems, and how to solve them efficiently are the key questions of designing a
1673 successful subspace method. A good tradeoff between the simplicity of subproblems and the
computational cost has to be made carefully. We are confident that many future directions are worth pursuing from the viewpoint of subspaces.

1676 REFERENCES

1677 [1] P.-A. A BSIL , C. G. BAKER , AND K. A. G ALLIVAN, Trust-region methods on Riemannian manifolds,
1678 Found. Comput. Math., 7 (2007), pp. 303–330.
1679 [2] P.-A. A BSIL , R. M AHONY, AND R. S EPULCHRE, Optimization algorithms on matrix manifolds, Princeton
1680 University Press, Princeton, NJ, 2008.
1681 [3] P.-A. A BSIL , R. M AHONY, AND J. T RUMPF, An extrinsic look at the Riemannian Hessian, in Geometric
1682 science of information, Springer, 2013, pp. 361–368.
1683 [4] D. G. A NDERSON, Iterative procedures for nonlinear integral equations, Journal of the ACM (JACM), 12
1684 (1965), pp. 547–560.
1685 [5] A. B ECK AND M. T EBOULLE, A fast iterative shrinkage-thresholding algorithm for linear inverse problems,
1686 SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
1687 [6] S. B ECKER , V. C EVHER , AND A. K YRILLIDIS, Randomized low-memory singular value projection, arXiv
1688 preprint arXiv:1303.0167, (2013).
1689 [7] S. B ELLAVIA AND B. M ORINI, A globally convergent Newton-GMRES subspace method for systems of
1690 nonlinear equations, SIAM J. Sci. Comput., 23 (2001), pp. 940–960.
1691 [8] S. J. B ENSON , Y. Y E , AND X. Z HANG, Solving large-scale sparse semidefinite programs for combinatorial
1692 optimization, SIAM J. Optim., 10 (2000), pp. 443–461.
1693 [9] D. P. B ERTSEKAS, Nonlinear Programming, Athena Scientific, September 1999.
1694 [10] J. B OLTE , S. S ABACH , AND M. T EBOULLE, Proximal alternating linearized minimization for nonconvex
1695 and nonsmooth problems, Math. Program., 146 (2014), pp. 459–494.
1696 [11] S. B OYD , N. PARIKH , E. C HU , B. P ELEATO , AND J. E CKSTEIN, Distributed optimization and statistical
learning via the alternating direction method of multipliers, Foundations and Trends in Machine
1698 Learning, 3 (2011), pp. 1–122.
1699 [12] M. A. B RANCH , T. F. C OLEMAN , AND Y. L I, A subspace, interior, and conjugate gradient method for
1700 large-scale bound-constrained minimization problems, SIAM J. Sci. Comput., 21 (1999), pp. 1–23.
1701 [13] C. B REZINSKI , M. R EDIVO -Z AGLIA , AND Y. S AAD, Shanks sequence transformations and Anderson
1702 acceleration, SIAM Rev., 60 (2018), pp. 646–669.
1703 [14] P. N. B ROWN AND Y. S AAD, Hybrid Krylov methods for nonlinear systems of equations, SIAM J. Sci.
1704 Statist. Comput., 11 (1990), pp. 450–481.
1705 [15] O. B URDAKOV, L. G ONG , S. Z IKRIN , AND Y.- X . Y UAN, On efficiently combining limited-memory and
1706 trust-region techniques, Math. Program. Comput., 9 (2017), pp. 101–134.
1707 [16] S. B URER AND R. D. C. M ONTEIRO, A projected gradient algorithm for solving the maxcut SDP relax-
1708 ation, Optim. Methods Softw., 15 (2001), pp. 175–200.
1709 [17] J. V. B URKE AND J. J. M OR É, On the identification of active constraints, SIAM J. Numer. Anal., 25 (1988),
1710 pp. 1197–1211.
1711 [18] , Exposing constraints, SIAM J. Optim., 4 (1994), pp. 573–595.
1712 [19] R. H. B YRD , N. I. M. G OULD , J. N OCEDAL , AND R. A. WALTZ, An algorithm for nonlinear optimization
1713 using linear programming and equality constrained subproblems, Math. Program., 100 (2004), pp. 27–
1714 48.
1715 [20] , On the convergence of successive linear-quadratic programming algorithms, SIAM J. Optim., 16
1716 (2005), pp. 471–489.
1717 [21] R. H. B YRD , J. N OCEDAL , AND R. B. S CHNABEL, Representations of quasi-Newton matrices and their
1718 use in limited memory methods, Math. Programming, 63 (1994), pp. 129–156.
1719 [22] C. C ARSTENSEN, Domain decomposition for a non-smooth convex minimization problem and its applica-
1720 tion to plasticity, Numerical linear algebra with applications, 4 (1997), pp. 177–190.
1721 [23] M. R. C ELIS , J. E. D ENNIS , AND R. A. TAPIA, A trust region strategy for nonlinear equality constrained
1722 optimization, in Numerical optimization, 1984 (Boulder, Colo., 1984), SIAM, Philadelphia, PA, 1985,
1723 pp. 71–82.
1724 [24] C. C HEN , Z. W EN , AND Y.- X . Y UAN, A general two-level subspace method for nonlinear optimization, J.
1725 Comput. Math., 36 (2018), pp. 881–902.
1726 [25] Y. C HEN , X. L I , AND J. X U, Convexified modularity maximization for degree-corrected stochastic block
1727 models, Ann. Statist., 46 (2018), pp. 1573–1602.
1728 [26] A. C ONN , N. G OULD , A. S ARTENAER , AND P. T OINT, On iterated-subspace methods for nonlinear op-
1729 timization, in Linear and Nonlinear Conjugate Gradient-Related Methods, J. Adams and J. Nazareth,
1730 eds., 1996, pp. 50–79.
1731 [27] W. D ENG , M.-J. L AI , Z. P ENG , AND W. Y IN, Parallel multi-block ADMM with o(1/k) convergence, J.
1732 Sci. Comput., 71 (2017), pp. 712–736.
1733 [28] E. D. D OLAN , J. J. M OR É , AND T. S. M UNSON, Benchmarking optimization software with cops 3.0, tech.
1734 rep., Mathematics and Computer Science Division, Argonne National Laboratory, February 2004.
1735 [29] Q. D ONG , X. L IU , Z.-W. W EN , AND Y.-X. Y UAN, A parallel line search subspace correction method for
1736 composite convex optimization, J. Oper. Res. Soc. China, 3 (2015), pp. 163–187.
1737 [30] J. D OUGLAS AND H. H. R ACHFORD, On the numerical solution of heat conduction problems in two and
1738 three space variables, Transactions of the American mathematical Society, 82 (1956), pp. 421–439.
1739 [31] A. E DELMAN , T. A. A RIAS , AND S. T. S MITH, The geometry of algorithms with orthogonality constraints,
1740 SIAM J. Matrix Anal. Appl., 20 (1999), pp. 303–353.
1741 [32] M. E LAD , B. M ATALON , AND M. Z IBULEVSKY, Coordinate and subspace optimization methods for linear
1742 least squares with non-quadratic regularization, Appl. Comput. Harmon. Anal., 23 (2007), pp. 346–
1743 367.
1744 [33] L. E LD ÉN, Matrix Methods in Data Mining and Pattern Recognition (Fundamentals of Algorithms), Society
1745 for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2007.
1746 [34] H.-R. FANG AND Y. S AAD, A filtered Lanczos procedure for extreme and interior eigenvalue problems,
1747 SIAM J. Sci. Comput., 34 (2012), pp. A2220–A2246.
1748 [35] R. F LETCHER, Second order corrections for nondifferentiable optimization, in Numerical analysis (Dundee,
1749 1981), vol. 912 of Lecture Notes in Math., Springer, Berlin-New York, 1982, pp. 85–114.
1750 [36] M. F ORNASIER, Domain decomposition methods for linear inverse problems with sparsity constraints, In-
1751 verse Problems, 23 (2007), p. 2505.
1752 [37] M. F ORNASIER , Y. K IM , A. L ANGER , AND C.-B. S CH ÖNLIEB, Wavelet decomposition method for l2 /tv-
1753 image deblurring, SIAM Journal on Imaging Sciences, 5 (2012), pp. 857–885.
1754 [38] M. F ORNASIER , A. L ANGER , AND C.-B. S CH ÖNLIEB, A convergent overlapping domain decomposition
1755 method for total variation minimization, Numerische Mathematik, 116 (2010), pp. 645–685.
1756 [39] M. F ORNASIER AND C.-B. S CH ÖNLIEB, Subspace correction methods for total variation and l1 -
1757 minimization, SIAM Journal on Numerical Analysis, 47 (2009), pp. 3397–3428.
1758 [40] D. G ABAY, Minimizing a differentiable function over a differential manifold, J. Optim. Theory Appl., 37
1759 (1982), pp. 177–219.
1760 [41] D. G ABAY AND B. M ERCIER, A dual algorithm for the solution of nonlinear variational problems via finite
1761 element approximation, Computers & Mathematics with Applications, 2 (1976), pp. 17–40.
1762 [42] E. G HADIMI , H. R. F EYZMAHDAVIAN , AND M. J OHANSSON, Global convergence of the heavy-ball
1763 method for convex optimization, in 2015 European Control Conference (ECC), 2015, pp. 310–315.
1764 [43] P. E. G ILL AND M. W. L EONARD, Reduced-Hessian quasi-Newton methods for unconstrained optimiza-
1765 tion, SIAM J. Optim., 12 (2001), pp. 209–237.
1766 [44] , Limited-memory reduced-Hessian methods for large-scale unconstrained optimization, SIAM J.
1767 Optim., 14 (2003), pp. 380–401.
1768 [45] R. G LOWINSKI AND A. M ARROCCO, Sur l’approximation, par éléments finis d’ordre un, et la résolution,
1769 par pénalisation-dualité, d’une classe de problèmes de Dirichlet non linéaires, Rev. Française Automat.
1770 Informat. Recherche Opérationnelle Sér. Rouge Anal. Numér., 9 (1975), pp. 41–76.
1771 [46] M. X. G OEMANS AND D. P. W ILLIAMSON, Improved approximation algorithms for maximum cut and
1772 satisfiability problems using semidefinite programming, J. Assoc. Comput. Mach., 42 (1995), pp. 1115–
1773 1145.
[47] I. GOODFELLOW, Y. BENGIO, AND A. COURVILLE, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.
1776 [48] N. G OULD , D. O RBAN , AND P. T OINT, Numerical methods for large-scale nonlinear optimization, Acta
1777 Numer., 14 (2005), pp. 299–361.
1778 [49] G. N. G RAPIGLIA , J. Y UAN , AND Y.- X . Y UAN, A subspace version of the powell–yuan trust-region al-
1779 gorithm for equality constrained optimization, Journal of the Operations Research Society of China, 1
1780 (2013), pp. 425–451.
1781 [50] G. N. G RAPIGLIA , Y. Y UAN , AND Y.- X . Y UAN, A subspace version of the powell–yuan trust-region al-
1782 gorithm for equality constrained optimization, J. Operations Research Society of China, 1 (2013),
1783 pp. 425–451.
1784 [51] W. W. H AGER AND H. Z HANG, A new active set algorithm for box constrained optimization, SIAM J.
1785 Optim., 17 (2006), pp. 526–557.
1786 [52] , A new active set algorithm for box constrained optimization, SIAM J. Optim., 17 (2006), pp. 526–
1787 557.
1788 [53] E. T. H ALE , W. Y IN , AND Y. Z HANG, Fixed-point continuation for l1 -minimization: methodology and

1789 convergence, SIAM J. Optim., 19 (2008), pp. 1107–1130.
1790 [54] N. H ALKO , P. G. M ARTINSSON , AND J. A. T ROPP, Finding structure with randomness: probabilistic
1791 algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), pp. 217–288.
1792 [55] B. H E , H.-K. X U , AND X. Y UAN, On the proximal Jacobian decomposition of ALM for multiple-block
1793 separable convex minimization problems and its relationship to ADMM, J. Sci. Comput., 66 (2016),
1794 pp. 1204–1217.
1795 [56] C. H ELMBERG AND F. R ENDL, A spectral bundle method for semidefinite programming, SIAM J. Optim.,
1796 10 (2000), pp. 673–696.
1797 [57] J. M. H OKANSON, Projected nonlinear least squares for exponential fitting, SIAM J. Sci. Comput., 39
1798 (2017), pp. A3107–A3128.
1799 [58] J. H U , B. J IANG , L. L IN , Z. W EN , AND Y.- X . Y UAN, Structured quasi-Newton methods for optimization
1800 with orthogonality constraints, SIAM J. Sci. Comput., 41 (2019), pp. A2239–A2269.
1801 [59] J. H U , A. M ILZAREK , Z. W EN , AND Y. Y UAN, Adaptive quadratically regularized Newton method for
1802 Riemannian optimization, SIAM J. Matrix Anal. Appl., 39 (2018), pp. 1181–1207.
1803 [60] W. H U , L. L IN , AND C. YANG, Projected commutator DIIS method for accelerating hybrid functional
1804 electronic structure calculations, J. Chem. Theory Comput., (2017).
1805 [61] W. H UANG , P.-A. A BSIL , AND K. G ALLIVAN, A Riemannian BFGS method without differentiated retrac-
1806 tion for nonconvex optimization problems, SIAM J. Optim., 28 (2018), pp. 470–495.
1807 [62] W. H UANG , K. A. G ALLIVAN , AND P.-A. A BSIL, A Broyden class of quasi-Newton methods for Rieman-
1808 nian optimization, SIAM J. Optim., 25 (2015), pp. 1660–1685.
1809 [63] F. K ANGAL , K. M EERBERGEN , E. M ENGI , AND W. M ICHIELS, A subspace method for large-scale eigen-
1810 value optimization, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 48–82.
1811 [64] N. K ESKAR , J. N OCEDAL , F. Ö ZTOPRAK , AND A. W ÄCHTER, A second-order method for convex `1 -
1812 regularized optimization with active-set prediction, Optim. Methods Softw., 31 (2016), pp. 605–621.
1813 [65] A. V. K NYAZEV, Toward the optimal preconditioned eigensolver: locally optimal block preconditioned con-
1814 jugate gradient method, SIAM J. Sci. Comput., 23 (2001), pp. 517–541. Copper Mountain Conference
1815 (2000).
1816 [66] D. K RESSNER , D. L U , AND B. VANDEREYCKEN, Subspace acceleration for the crawford number and re-
1817 lated eigenvalue optimization problems, SIAM Journal on Matrix Analysis and Applications, 39 (2018),
1818 pp. 961–982.
1819 [67] K. K REUTZ -D ELGADO, The complex gradient operator and the CR-calculus, 2009.
http://arxiv.org/abs/0906.4835.
1821 [68] A. L ANGER , S. O SHER , AND C.-B. S CH ÖNLIEB, Bregmanized domain decomposition for image restora-
1822 tion, Journal of Scientific Computing, 54 (2013), pp. 549–576.
1823 [69] C. L E B RIS, Computational chemistry from the perspective of numerical analysis, Acta Numer., 14 (2005),
1824 pp. 363–444.
1825 [70] J. H. L EE , Y. M. J UNG , Y.- X . Y UAN , AND S. Y UN, A subspace SQP method for equality constrained
1826 optimization, Comput. Optim. Appl., 74 (2019), pp. 177–194.
1827 [71] Y. L I , H. L IU , Z. W EN , AND Y. Y UAN, Low-rank matrix optimization using polynomial-filtered subspace
1828 extraction.
1829 [72] Y. L I AND S. O SHER, Coordinate descent optimization for `1 minimization with application to compressed
1830 sensing; a greedy algorithm, Inverse Probl. Imaging, 3 (2009), pp. 487–503.
1831 [73] L. L IN, Adaptively compressed exchange operator, J. Chem. Theory Comput., 12 (2016), pp. 2242–2249.
1832 [74] X. L IU , Z. W EN , AND Y. Z HANG, Limited memory block Krylov subspace optimization for computing
1833 dominant singular value decompositions, SIAM J. Sci. Comput., 35 (2013), pp. A1641–A1668.
1834 [75] , An efficient Gauss-Newton algorithm for symmetric low-rank product matrix approximations, SIAM
1835 J. Optim., 25 (2015), pp. 1571–1608.
1836 [76] Z. L IU AND L. VANDENBERGHE, Interior-point method for nuclear norm approximation with application
1837 to system identification, SIAM Journal on Matrix Analysis and Applications, 31 (2009), pp. 1235–1256.
1838 [77] Z. Q. L UO AND P. T SENG, On the convergence of the coordinate descent method for convex differentiable
1839 minimization, J. Optim. Theory Appl., 72 (1992), pp. 7–35.
[78] M. W. MAHONEY, Randomized algorithms for matrices and data, Foundations and Trends in Machine
1841 Learning, 3 (2011), pp. 123–224.
1842 [79] J. M ARTENS AND R. G ROSSE, Optimizing neural networks with kronecker-factored approximate curvature,
1843 in International Conference on Machine Learning, 2015, pp. 2408–2417.
1844 [80] D. Q. M AYNE AND E. P OLAK, A superlinearly convergent algorithm for constrained optimization prob-
1845 lems, Math. Programming Stud., (1982), pp. 45–61.
1846 [81] J. J. M OR É AND G. T ORALDO, Algorithms for bound constrained quadratic programming problems, Nu-
1847 mer. Math., 55 (1989), pp. 377–400.
1848 [82] , On the solution of large quadratic programming problems with bound constraints, SIAM J. Optim.,
1849 1 (1991), pp. 93–113.
1850 [83] D. N EEDELL AND J. A. T ROPP, CoSaMP: iterative signal recovery from incomplete and inaccurate sam-

1851 ples, Appl. Comput. Harmon. Anal., 26 (2009), pp. 301–321.
1852 [84] Y. N ESTEROV, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science &
1853 Business Media, 2013.
[85] Y. E. NESTEROV, A method for solving the convex programming problem with convergence rate O(1/k^2), in
1855 Dokl. Akad. Nauk SSSR, vol. 269, 1983, pp. 543–547.
1856 [86] M. E. N EWMAN, Modularity and community structure in networks, Proceedings of the national academy of
1857 sciences, 103 (2006), pp. 8577–8582.
1858 [87] Q. N I AND Y. Y UAN, A subspace limited memory quasi-Newton algorithm for large-scale nonlinear bound
1859 constrained optimization, Math. Comp., 66 (1997), pp. 1509–1520.
1860 [88] J. N OCEDAL AND S. J. W RIGHT, Numerical Optimization, Springer Series in Operations Research and
1861 Financial Engineering, Springer, New York, second ed., 2006.
1862 [89] X. PAN , M. L AM , S. T U , D. PAPAILIOPOULOS , C. Z HANG , M. I. J ORDAN , K. R AMCHANDRAN , C. R E ,
1863 AND B. R ECHT , Cyclades: Conflict-free asynchronous machine learning, in Advances in Neural Infor-
1864 mation Processing Systems, 2016, p. 2576–2584.
1865 [90] B. P OLYAK, Some methods of speeding up the convergence of iteration methods, USSR Computational
1866 Mathematics and Mathematical Physics, 4 (1964), pp. 1 – 17.
1867 [91] M. J. D. P OWELL AND Y. Y UAN, A trust region algorithm for equality constrained optimization, Math.
1868 Programming, 49 (1990/91), pp. 189–211.
1869 [92] P. P ULAY, Convergence acceleration of iterative sequences. the case of SCF iteration, Chemical Physics
1870 Letters, 73 (1980), pp. 393–398.
1871 [93] P. P ULAY, Improved SCF convergence acceleration, Journal of Computational Chemistry, 3 (1982),
1872 pp. 556–560.
1873 [94] B. R ECHT, C. R E , S. W RIGHT, AND F. N IU, Hogwild: A lock-free approach to parallelizing stochastic
1874 gradient descent, in Advances in Neural Information Processing Systems, 2011, pp. 693–701.
1875 [95] H. RUTISHAUSER, Computational aspects of F. L. Bauer’s simultaneous iteration method, Numer. Math.,
1876 13 (1969), pp. 4–13.
1877 [96] H. RUTISHAUSER, Simultaneous iteration method for symmetric matrices, Numer. Math., 16 (1970),
1878 pp. 205–223.
1879 [97] Y. S AAD, Chebyshev acceleration techniques for solving nonsymmetric eigenvalue problems, Mathematics
1880 of Computation, 42 (1984), pp. 567–588.
1881 [98] Y. S AAD, Iterative methods for sparse linear systems, Society for Industrial and Applied Mathematics,
1882 Philadelphia, PA, second ed., 2003.
1883 [99] D. S CIEUR , A. D ’A SPREMONT, AND F. BACH, Regularized nonlinear acceleration, in Advances In Neural
1884 Information Processing Systems, 2016, pp. 712–720.
1885 [100] S. K. S HEVADE AND S. S. K EERTHI, A simple and efficient algorithm for gene selection using sparse
1886 logistic regression, Bioinformatics, 19 (2003), pp. 2246–2253.
1887 [101] G. A. S HULTZ , R. B. S CHNABEL , AND R. H. B YRD, A family of trust-region-based algorithms for uncon-
1888 strained minimization with strong global convergence properties, SIAM J. Numer. Anal., 22 (1985),
1889 pp. 47–67.
1890 [102] P. S IRKOVIC AND D. K RESSNER, Subspace acceleration for large-scale parameter-dependent Hermitian
1891 eigenproblems, SIAM Journal on Matrix Analysis and Applications, 37 (2016), pp. 695–718.
1892 [103] A. S IT, Z. W U , AND Y. Y UAN, A geometric buildup algorithm for the solution of the distance geometry
1893 problem using least-squares approximation, Bull. Math. Biol., 71 (2009), pp. 1914–1933.
1894 [104] S. T. S MITH, Optimization techniques on Riemannian manifolds, Fields Institute Communications, 3 (1994).
1895 [105] S. S OLNTSEV, J. N OCEDAL , AND R. H. B YRD, An algorithm for quadratic `1 -regularized optimization
1896 with a flexible active-set strategy, Optim. Methods Softw., 30 (2015), pp. 1213–1237.
1897 [106] M. S OLTANI AND C. H EGDE, Fast low-rank matrix estimation without the condition number, arXiv preprint
1898 arXiv:1712.03281, (2017).
1899 [107] T. S TEIHAUG, The conjugate gradient method and trust regions in large scale optimization, SIAM J. Numer.
1900 Anal., 20 (1983), pp. 626–637.
1901 [108] G. W. S TEWART, Simultaneous iteration for computing invariant subspaces of non-Hermitian matrices,
1902 Numer. Math., 25 (1975/76), pp. 123–136.
1903 [109] , Matrix algorithms Vol. II: Eigensystems, Society for Industrial and Applied Mathematics (SIAM),
1904 Philadelphia, PA, 2001.
1905 [110] W. J. S TEWART AND A. J ENNINGS, A simultaneous iteration algorithm for real matrices, ACM Trans.
1906 Math. Software, 7 (1981), pp. 184–198.
1907 [111] W. S UN AND Y. Y UAN, Optimization theory and methods: nonlinear programming, vol. 1, Springer Science
1908 & Business Media, 2006.
1909 [112] X.-C. TAI AND J. X U, Global and uniform convergence of subspace correction methods for some convex
1910 optimization problems, Mathematics of Computation, 71 (2002), pp. 105–124.
1911 [113] , Global and uniform convergence of subspace correction methods for some convex optimization
1912 problems, Math. Comp., 71 (2002), pp. 105–124.

1913 [114] R. T IBSHIRANI, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society.
1914 Series B (Methodological), (1996), pp. 267–288.
1915 [115] A. T OTH , J. A. E LLIS , T. E VANS , S. H AMILTON , C. K ELLEY, R. PAWLOWSKI , AND S. S LATTERY, Local
1916 improvement results for Anderson acceleration with inaccurate function evaluations, SIAM Journal on
1917 Scientific Computing, 39 (2017), pp. S47–S65.
1918 [116] J. A. T ROPP AND A. C. G ILBERT, Signal recovery from random measurements via orthogonal matching
1919 pursuit, IEEE Trans. Inform. Theory, 53 (2007), pp. 4655–4666.
1920 [117] J. A. T ROPP, A. Y URTSEVER , M. U DELL , AND V. C EVHER, Fixed-rank approximation of a positive-
1921 semidefinite matrix from streaming data, in Advances in Neural Information Processing Systems, 2017,
1922 pp. 1225–1234.
1923 [118] J. A. T ROPP, A. Y URTSEVER , M. U DELL , AND V. C EVHER, Practical sketching algorithms for low-rank
1924 matrix approximation, SIAM J. Matrix Anal. Appl., 38 (2017), pp. 1454–1485.
1925 [119] P. T SENG AND S. Y UN, A coordinate gradient descent method for nonsmooth separable minimization,
1926 Mathematical Programming, 117 (2009), pp. 387–423.
1927 [120] C. U DRISTE, Convex functions and optimization methods on Riemannian manifolds, vol. 297, Springer
1928 Science & Business Media, 1994.
1929 [121] B. VANDEREYCKEN, Low-rank matrix completion by Riemannian optimization, SIAM Journal on Opti-
1930 mization, 23 (2013), pp. 1214–1236.
1931 [122] I. WALDSPURGER , A. D ’A SPREMONT, AND S. M ALLAT, Phase recovery, MaxCut and complex semidefi-
1932 nite programming, Math. Program., 149 (2015), pp. 47–81.
1933 [123] H. F. WALKER AND P. N I, Anderson acceleration for fixed-point iterations, SIAM Journal on Numerical
1934 Analysis, 49 (2011), pp. 1715–1735.
1935 [124] L. WANG , A. S INGER , AND Z. W EN, Orientation determination of cryo-em images using least unsquared
1936 deviations, SIAM journal on imaging sciences, 6 (2013), pp. 2450–2483.
1937 [125] X. WANG , M. H ONG , S. M A , AND Z.-Q. L UO, Solving multiple-block separable convex minimization
1938 problems using two-block alternating direction method of multipliers, arXiv preprint arXiv:1308.5294,
1939 (2013).
1940 [126] Y. WANG , Z. J IA , AND Z. W EN, The search direction correction makes first-order methods faster, (2019).
1941 Arxiv 1905.06507.
1942 [127] Z. WANG , Z. W EN , AND Y. Y UAN, A subspace trust region method for large scale unconstrained opti-
1943 mization, in Numerical Linear Algebra and Optimization, Y.Yuan, ed., 2004, pp. 265–274.
1944 [128] Z.-H. WANG AND Y.-X. Y UAN, A subspace implementation of quasi-Newton trust region methods for
1945 unconstrained optimization, Numer. Math., 104 (2006), pp. 241–269.
1946 [129] Z. W EN , D. G OLDFARB , AND K. S CHEINBERG, Block coordinate descent methods for semidefinite pro-
1947 gramming, Handbook on Semidefinite, Cone and Polynomial Optimization, (2011).
1948 [130] , Block coordinate descent methods for semidefinite programming, in Handbook on semidefinite,
1949 conic and polynomial optimization, vol. 166 of Internat. Ser. Oper. Res. Management Sci., Springer,
1950 New York, 2012, pp. 533–564.
1951 [131] Z. W EN , A. M ILZAREK , M. U LBRICH , AND H. Z HANG, Adaptive regularized self-consistent field iteration
1952 with exact Hessian for electronic structure calculation, SIAM J. Sci. Comput., 35 (2013), pp. A1299–
1953 A1324.
1954 [132] Z. W EN AND W. Y IN, A feasible method for optimization with orthogonality constraints, Math. Program.,
1955 142 (2013), pp. 397–434.
1956 [133] Z. W EN , W. Y IN , D. G OLDFARB , AND Y. Z HANG, A fast algorithm for sparse reconstruction based on
1957 shrinkage, subspace optimization and continuation, SIAM Journal on Scientific Computing, 32 (2010),
1958 pp. 1832–1857.
1959 [134] Z. W EN , W. Y IN , H. Z HANG , AND D. G OLDFARB, On the convergence of an active-set method for l1
1960 minimization, Optimization Methods and Software, 27 (2012), pp. 1127–1146.
1961 [135] Z. W EN AND Y. Z HANG, Accelerating convergence by augmented rayleigh–ritz projections for large-scale
1962 eigenpair computation, SIAM Journal on Matrix Analysis and Applications, 38 (2017), pp. 273–296.
1963 [136] D. P. W OODRUFF, Sketching as a tool for numerical linear algebra, Found. Trends Theor. Comput. Sci., 10
1964 (2014), pp. iv+157.
1965 [137] X. W U , Z. W EN , AND W. BAO, A regularized newton method for computing ground states of Bose-Einstein
1966 condensates, arXiv preprint arXiv:1504.02891, (2015).
1967 [138] C. YANG , J. C. M EZA , AND L.-W. WANG, A trust region direct constrained minimization algorithm for
1968 the Kohn-Sham equation, SIAM J. Sci. Comput., 29 (2007), pp. 1854–1875.
1969 [139] Y. YANG , B. D ONG , AND Z. W EN, Randomized algorithms for high quality treatment planning in volu-
1970 metric modulated arc therapy, Inverse Problems, 33 (2017), pp. 025007, 22.
1971 [140] Y.- X . Y UAN, Subspace techniques for nonlinear optimization, in Some topics in industrial and applied
1972 mathematics, vol. 8 of Ser. Contemp. Appl. Math. CAM, Higher Ed. Press, Beijing, 2007, pp. 206–
1973 218.
1974 [141] Y.-X. Y UAN, Subspace methods for large scale nonlinear equations and nonlinear least squares, Optim.

1975 Eng., 10 (2009), pp. 207–218.
1976 [142] Y.- X . Y UAN, A review on subspace methods for nonlinear optimization, in Proceedings of the International
1977 Congress of Mathematicians—Seoul 2014. Vol. IV, Kyung Moon Sa, Seoul, 2014, pp. 807–827.
1978 [143] Y.- X . Y UAN AND J. S TOER, A subspace study on conjugate gradient algorithms, Z. Angew. Math. Mech.,
1979 75 (1995), pp. 69–77.
1980 [144] A. Y URTSEVER , J. A. T ROPP, O. F ERCOQ , M. U DELL , AND V. C EVHER, Scalable semidefinite program-
1981 ming, 2019. arXiv:1912.02949.
1982 [145] J. Z HANG , H. L IU , Z. W EN , AND S. Z HANG, A sparse completely positive relaxation of the modularity
1983 maximization for community detection, SIAM J. Sci. Comput., 40 (2018), pp. A3091–A3120.
1984 [146] , A sparse completely positive relaxation of the modularity maximization for community detection,
1985 SIAM J. Sci. Comput., 40 (2018), pp. A3091–A3120.
1986 [147] J. Z HANG , B. O’D ONOGHUE , AND S. B OYD, Globally convergent Type-I Anderson acceleration for non-
1987 smooth fixed-point iterations, (2018). arXiv:1808.03971.
1988 [148] J. Z HANG , Z. W EN , AND Y. Z HANG, Subspace methods with local refinements for eigenvalue computation
1989 using low-rank tensor-train format, Journal of Scientific Computing, 70 (2017), pp. 478–499.
1990 [149] T. Z HOU AND D. TAO, Godec: Randomized low-rank & sparse matrix decomposition in noisy case, in
1991 International conference on machine learning, Omnipress, 2011.
1992 [150] Y. Z HOU AND Y. S AAD, A Chebyshev–Davidson algorithm for large symmetric eigenproblems, SIAM J.
1993 Matrix Anal. and Appl., 29 (2007), pp. 954–971.
1994 [151] Y. Z HOU , Y. S AAD , M. L. T IAGO , AND J. R. C HELIKOWSKY, Self-consistent-field calculations using
1995 Chebyshev-filtered subspace iteration, Journal of Computational Physics, 219 (2006), pp. 172–184.
