Optimization Algorithms for Data Analysis
Stephen J. Wright
Contents
1 Introduction
   1.1 Omissions
   1.2 Notation
2 Optimization Formulations of Data Analysis Problems
   2.1 Setup
   2.2 Least Squares
   2.3 Matrix Completion
   2.4 Nonnegative Matrix Factorization
   2.5 Sparse Inverse Covariance Estimation
   2.6 Sparse Principal Components
   2.7 Sparse Plus Low-Rank Matrix Decomposition
   2.8 Subspace Identification
   2.9 Support Vector Machines
   2.10 Logistic Regression
   2.11 Deep Learning
3 Preliminaries
   3.1 Solutions
   3.2 Convexity and Subgradients
   3.3 Taylor's Theorem
   3.4 Optimality Conditions for Smooth Functions
   3.5 Proximal Operators and the Moreau Envelope
   3.6 Convergence Rates
4 Gradient Methods
   4.1 Steepest Descent
   4.2 General Case
   4.3 Convex Case
   4.4 Strongly Convex Case
   4.5 General Case: Line-Search Methods
   4.6 Conditional Gradient Method
5 Prox-Gradient Methods
6 Accelerating Gradient Methods
   6.1 Heavy-Ball Method
   6.2 Conjugate Gradient
   6.3 Nesterov's Accelerated Gradient: Weakly Convex Case
   6.4 Nesterov's Accelerated Gradient: Strongly Convex Case
   6.5 Lower Bounds on Rates
7 Newton Methods
   7.1 Basic Newton's Method
   7.2 Newton's Method for Convex Functions
   7.3 Newton Methods for Nonconvex Functions
   7.4 A Cubic Regularization Approach
8 Conclusions
1. Introduction

In this article, we consider algorithms for solving smooth optimization problems, possibly with simple constraints or structured nonsmooth regularizers. One such canonical formulation is

(1.0.1)   min_{x ∈ R^n} f(x),

where f : R^n → R has at least Lipschitz continuous gradients. Additional assumptions about f, such as convexity and Lipschitz continuity of the Hessian, are introduced as needed. Another formulation we consider is

(1.0.2)   min_{x ∈ R^n} f(x) + λψ(x),

where f is as in (1.0.1), ψ : R^n → R is a function that is usually convex and usually nonsmooth, and λ > 0 is a regularization parameter.¹ We refer to (1.0.2) as a regularized minimization problem because the presence of the term involving ψ induces certain structural properties on the solution that make it more desirable or plausible in the context of the application. We describe iterative algorithms that generate a sequence {x^k}_{k=0,1,2,...} of points that, in the case of convex objective functions, converges to the set of solutions. (Some algorithms also generate other "auxiliary" sequences of iterates.)
We are motivated to study problems of the forms (1.0.1) and (1.0.2) by their ubiquity in data analysis applications. Accordingly, Section 2 describes some canonical problems in data analysis and their formulation as optimization problems. After some preliminaries in Section 3, we describe in Section 4 algorithms that take steps based on the gradients ∇f(x^k). Extensions of these methods to
¹ A set S is said to be convex if for any pair of points z′, z″ ∈ S, we have αz′ + (1 − α)z″ ∈ S for all α ∈ [0, 1]. A function φ : R^n → R is convex if φ(αz′ + (1 − α)z″) ≤ αφ(z′) + (1 − α)φ(z″) for all z′, z″ in the (convex) domain of φ and all α ∈ [0, 1].
the case (1.0.2) of regularized objectives are described in Section 5. Section 6 describes accelerated gradient methods, which achieve better worst-case complexity than basic gradient methods, while still only using first-derivative information. We discuss Newton's method in Section 7, outlining variants that can guarantee convergence to points that approximately satisfy second-order conditions for a local minimizer of a smooth nonconvex function.
1.1. Omissions Our approach throughout is to give a concise description of some of the most important algorithmic tools for smooth nonlinear optimization and regularized optimization, along with the basic convergence theory for each. (In any given context, we mean by "smooth" that the function is differentiable as many times as is necessary for the discussion to make sense.) In most cases, the theory is elementary enough to include here in its entirety. In the few remaining cases, we provide citations to works in which complete proofs can be found.

Although we allow nonsmoothness in the regularization term in (1.0.2), we do not cover subgradient methods or mirror descent explicitly in this chapter. We also do not discuss stochastic gradient methods, a class of methods that is central to modern machine learning. All these topics are discussed in the contribution of John Duchi to the current volume [22]. Other omissions include the following.
• Coordinate descent methods; see [47] for a recent review.
• Augmented Lagrangian methods, including alternating direction methods of multipliers (ADMM) [23]. The review [5] remains a good reference for the latter topic, especially as it applies to problems from data analysis.
• Semidefinite programming (see [43, 45]) and conic optimization (see [6]).
• Methods tailored specifically to linear or quadratic programming, such as the simplex method or interior-point methods (see [46] for a discussion of the latter).
• Quasi-Newton methods, which modify Newton's method by approximating the Hessian or its inverse, thus attaining attractive theoretical and practical performance without using any second-derivative information. For a discussion of these methods, see [36, Chapter 6]. One important method of this class, which is useful in data analysis and many other large-scale problems, is the limited-memory method L-BFGS [30]; see also [36, Section 7.2].
1.2. Notation Our notational conventions in this chapter are as follows. We use upper-case Roman characters (A, L, R, and so on) for matrices and lower-case Roman (x, v, u, and so on) for vectors. (Vectors are assumed to be column vectors.) Transposes are indicated by a superscript "T". Elements of matrices and vectors are indicated by subscripts, for example, A_ij and x_j. Iteration numbers are indicated by superscripts, for example, x^k. We denote the set of real numbers by R, so that R^n denotes the Euclidean space of dimension n. The set of symmetric real n × n matrices is denoted by SR^{n×n}. Real scalars are usually denoted by
Greek characters, for example, α, β, and so on, though in deference to convention, we sometimes use Roman capitals (for example, L for the Lipschitz constant of a gradient). Where vector norms appear, the type of norm in use is indicated by a subscript (for example ‖x‖₁), except that when no subscript appears, the Euclidean norm ‖·‖₂ is assumed. Matrix norms are defined where first used.
where the jth term ℓ(a_j, y_j; x) is a measure of the mismatch between φ(a_j) and y_j, and x is the vector of parameters that determines φ.
One use of φ is to make predictions about future data items. Given another previously unseen item of data â of the same type as a_j, j = 1, 2, ..., m, we predict that the label ŷ associated with â would be φ(â). The mapping may also expose other structure and properties in the data set. For example, it may reveal that only a small fraction of the features in a_j are needed to reliably predict the label y_j. (This is known as feature selection.) The function φ or its parameter x may also reveal important structure in the data. For example, X could reveal a low-dimensional subspace that contains most of the a_j, or X could reveal a matrix with particular structure (low-rank, sparse) such that observations of X prompted by the feature vectors a_j yield results close to y_j.
Examples of labels y_j include the following.
• A real number, leading to a regression problem.
• A label, say y_j ∈ {1, 2, ..., M}, indicating that a_j belongs to one of M classes. This is a classification problem. We have M = 2 for binary classification and M > 2 for multiclass classification.
• Null. Some problems only have feature vectors a_j and no labels. In this case, the data analysis task may consist of grouping the a_j into clusters (where the vectors within each cluster are deemed to be functionally similar), or identifying a low-dimensional subspace (or a collection of low-dimensional subspaces) that approximately contains the a_j. Such problems require the labels y_j to be learned, alongside the function φ. For example, in a clustering problem, y_j could represent the cluster to which a_j is assigned.
Even after cleaning and preparation, the setup above may contain many complications that need to be dealt with in formulating the problem in rigorous mathematical terms. The quantities (a_j, y_j) may contain noise, or may be otherwise corrupted. We would like the mapping φ to be robust to such errors. There may be missing data: parts of the vectors a_j may be missing, or we may not know all the labels y_j. The data may be arriving in streaming fashion rather than being available all at once. In this case, we would learn φ in an online fashion.

One particular consideration is that we wish to avoid overfitting the model to the data set D in (2.1.1). The particular data set D available to us can often be thought of as a finite sample drawn from some underlying larger (often infinite) collection of data, and we wish the function φ to perform well on the unobserved data points as well as the observed subset D. In other words, we want φ to be not too sensitive to the particular sample D that is used to define empirical objective functions such as (2.1.2). The optimization formulation can be modified in various ways to achieve this goal, by the inclusion of constraints or penalty terms that limit some measure of "complexity" of the function (such techniques are called generalization or regularization). Another approach is to terminate the optimization algorithm early, the rationale being that overfitting occurs mainly in the later stages of the optimization process.
2.2. Least Squares Probably the oldest and best-known data analysis problem is linear least squares. Here, the data points (a_j, y_j) lie in R^n × R, and we solve

(2.2.1)   min_x (1/2m) Σ_{j=1}^m (a_j^T x − y_j)² = (1/2m) ‖Ax − y‖₂²,
where A is the matrix whose rows are a_j^T, j = 1, 2, ..., m, and y = (y₁, y₂, ..., y_m)^T. In the terminology above, the function φ is defined by φ(a) := a^T x. (We could also introduce a nonzero intercept by adding an extra parameter β ∈ R and defining φ(a) := a^T x + β.) This formulation can be motivated statistically, as a maximum-likelihood estimate of x when the observations y_j are exact but for i.i.d. Gaussian noise. Randomized linear algebra methods for large-scale instances of this problem are discussed in Section 5 of the lectures of Drineas and Mahoney [20] in this volume.
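As a minimal illustration (a sketch added here, with synthetic placeholder data), the problem (2.2.1) can be solved directly by a standard least-squares routine:

import numpy as np

# Synthetic data: m = 100 observations of an n = 5 dimensional linear model.
rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + 0.01 * rng.standard_normal(m)  # exact observations plus Gaussian noise

# Minimize (1/2m)||Ax - y||^2; the 1/2m scaling does not change the minimizer.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.linalg.norm(x_ls - x_true))  # small recovery error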
Various modifications of (2.2.1) impose desirable structure on x and hence on φ. For example, Tikhonov regularization with a squared ℓ₂-norm, which is

min_x (1/2m) ‖Ax − y‖₂² + λ‖x‖₂², for some parameter λ > 0,
yields a solution x with less sensitivity to perturbations in the data (a_j, y_j). The LASSO formulation

(2.2.2)   min_x (1/2m) ‖Ax − y‖₂² + λ‖x‖₁
tends to yield solutions x that are sparse, that is, containing relatively few nonzero components [42]. This formulation performs feature selection: The locations of the nonzero components in x reveal those components of a_j that are instrumental in determining the observation y_j. Besides its statistical appeal (predictors that depend on few features are potentially simpler and more comprehensible than those depending on many features), feature selection has practical appeal in making predictions about future data. Rather than gathering all components of a new data vector â, we need to find only the "selected" features, since only these are needed to make a prediction. The LASSO formulation (2.2.2) is an important prototype for many problems in data analysis, in that it involves a regularization term λ‖x‖₁ that is nonsmooth and convex, but with relatively simple structure that can potentially be exploited by algorithms.
2.3. Matrix Completion Matrix completion is in one sense a natural extension of least-squares to problems in which the data a_j are naturally represented as matrices rather than vectors. Changing notation slightly, we suppose that each A_j is an n × p matrix, and we seek another n × p matrix X that solves

(2.3.1)   min_X (1/2m) Σ_{j=1}^m (⟨A_j, X⟩ − y_j)²,
combinations (where the elements of A_j are selected i.i.d. from some distribution) or single-element observations (in which each A_j has 1 in a single location and zeros elsewhere). A regularized version of (2.3.1), leading to solutions X that are low-rank, is

(2.3.2)   min_X (1/2m) Σ_{j=1}^m (⟨A_j, X⟩ − y_j)² + λ‖X‖_*,
where ‖X‖_* is the nuclear norm, which is the sum of singular values of X [39]. The nuclear norm plays a role analogous to the ℓ₁ norm in (2.2.2). Although the nuclear norm is a somewhat complex nonsmooth function, it is at least convex, so that the formulation (2.3.2) is also convex. This formulation can be shown to yield a statistically valid solution when the true X is low-rank and the observation matrices A_j satisfy a "restricted isometry" property, commonly satisfied by random matrices, but not by matrices with just one nonzero element. The formulation is also valid in a different context, in which the true X is incoherent (roughly speaking, it does not have a few elements that are much larger than the others), and the observations A_j are of single elements [10].
In another form of regularization, the matrix X is represented explicitly as a product of two "thin" matrices L and R, where L ∈ R^{n×r} and R ∈ R^{p×r}, with r ≪ min(n, p). We set X = LR^T in (2.3.1) and solve

(2.3.3)   min_{L,R} (1/2m) Σ_{j=1}^m (⟨A_j, LR^T⟩ − y_j)².
In this formulation, the rank r is "hard-wired" into the definition of X, so there is no need to include a regularizing term. This formulation is also typically much more compact than (2.3.2); the total number of elements in (L, R) is (n + p)r, which is much less than np. A disadvantage is that it is nonconvex. An active line of current research, pioneered in [9] and also drawing on statistical sources, shows that the nonconvexity is benign in many situations, and that under certain assumptions on the data (A_j, y_j), j = 1, 2, ..., m, and careful choice of algorithmic strategy, good solutions can be obtained from the formulation (2.3.3). A clue to this good behavior is that although this formulation is nonconvex, it is in some sense an approximation to a tractable problem: If we have a complete observation of X, then a rank-r approximation can be found by performing a singular value decomposition of X, and defining L and R in terms of the r leading left and right singular vectors.
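To make the last remark concrete, here is a small sketch (with placeholder dimensions) of recovering factors L and R from a fully observed X by truncated SVD:

import numpy as np

rng = np.random.default_rng(1)
n, p, r = 50, 40, 3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))  # a rank-r matrix

# Keep the r leading singular triples and split the singular values between factors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
L = U[:, :r] * np.sqrt(s[:r])       # n x r
R = Vt[:r, :].T * np.sqrt(s[:r])    # p x r

print(np.linalg.norm(L @ R.T - X))  # ~0: a rank-r X is recovered exactly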
2.4. Nonnegative Matrix Factorization Some applications in computer vision, chemometrics, and document clustering require us to find factors L and R like those in (2.3.3) in which all elements are nonnegative. If the full matrix Y ∈ R^{n×p} is observed, this problem has the form

min_{L,R} ‖LR^T − Y‖²_F, subject to L ≥ 0, R ≥ 0.
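The text does not prescribe an algorithm here; as an illustrative sketch, the classical multiplicative updates of Lee and Seung decrease this objective while preserving nonnegativity (all names and sizes below are placeholders):

import numpy as np

rng = np.random.default_rng(2)
n, p, r = 30, 20, 4
Y = rng.random((n, r)) @ rng.random((r, p))    # nonnegative data matrix
L, R = rng.random((n, r)), rng.random((p, r))  # nonnegative initial factors

eps = 1e-12  # guard against division by zero
for _ in range(500):
    # Each multiplicative update keeps L, R >= 0 and does not increase ||L R^T - Y||_F^2.
    L *= (Y @ R) / (L @ (R.T @ R) + eps)
    R *= (Y.T @ L) / (R @ (L.T @ L) + eps)

print(np.linalg.norm(L @ R.T - Y))  # residual after 500 updates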
2.5. Sparse Inverse Covariance Estimation In this problem, the labels y_j are null, and the vectors a_j ∈ R^n are viewed as independent observations of a random vector A ∈ R^n, which has zero mean. The sample covariance matrix constructed from these observations is

S = (1/(m − 1)) Σ_{j=1}^m a_j a_j^T.
The element S_il is an estimate of the covariance between the ith and lth elements of the random vector A. Our interest is in calculating an estimate X of the inverse covariance matrix that is sparse. The structure of X yields important information about A. In particular, if X_il = 0, we can conclude that the i and l components of A are conditionally independent. (That is, they are independent given knowledge of the values of the other n − 2 components of A.) Stated another way, the nonzero locations in X indicate the arcs in the dependency graph whose nodes correspond to the n components of A.

One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix X is the following:

(2.5.1)   min_{X ∈ SR^{n×n}, X ≻ 0} ⟨S, X⟩ − log det(X) + λ‖X‖₁,
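A small sketch (added for illustration, not the text's algorithm) that evaluates the objective of (2.5.1); it assumes ‖X‖₁ is taken elementwise, since the definition falls in a gap above:

import numpy as np

def sparse_inv_cov_objective(S, X, lam):
    """Evaluate <S, X> - log det(X) + lam * ||X||_1 for symmetric positive definite X."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return np.trace(S @ X) - logdet + lam * np.abs(X).sum()

rng = np.random.default_rng(3)
a = rng.standard_normal((200, 5))                # 200 observations of a 5-dimensional vector
S = (a - a.mean(0)).T @ (a - a.mean(0)) / 199    # sample covariance matrix
print(sparse_inv_cov_objective(S, np.eye(5), lam=0.1))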
2.6. Sparse Principal Components The setup for this problem is similar to the previous section, in that we have a sample covariance matrix S that is estimated from a number of observations of some underlying random vector. The principal components of this matrix are the eigenvectors corresponding to the largest eigenvalues. It is often of interest to find sparse principal components, approximations to the leading eigenvectors that also contain few nonzeros. An explicit optimization formulation of this problem is

(2.6.1)   max_{v ∈ R^n} v^T S v  s.t. ‖v‖₂ = 1, ‖v‖₀ ≤ k,
where ‖·‖₀ indicates the cardinality of v (that is, the number of nonzeros in v) and k is a user-defined parameter indicating a bound on the cardinality of v. The problem (2.6.1) is NP-hard, so exact formulations (for example, as a quadratic program with binary variables) are intractable. We consider instead a relaxation, due to [18], which replaces vv^T by a positive semidefinite proxy M ∈ SR^{n×n}:

(2.6.2)   max_{M ∈ SR^{n×n}} ⟨S, M⟩  s.t. M ⪰ 0, ⟨I, M⟩ = 1, ‖M‖₁ ≤ ρ,

for some parameter ρ > 0 that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact, a semidefinite programming problem.
This formulation can be generalized to find the leading r > 1 sparse principal components. Ideally, we would obtain these from a matrix V ∈ R^{n×r} whose columns are mutually orthogonal and have at most k nonzeros each. We can write a convex relaxation of this problem, once again a semidefinite program, as

(2.6.3)   max_{M ∈ SR^{n×n}} ⟨S, M⟩  s.t. 0 ⪯ M ⪯ I, ⟨I, M⟩ = r, ‖M‖₁ ≤ ρ.
A more compact (but nonconvex) formulation is

max_{F ∈ R^{n×r}} ⟨S, FF^T⟩  s.t. ‖F‖₂ ≤ 1, ‖F‖_{2,1} ≤ R̄,

where ‖F‖_{2,1} := Σ_{i=1}^n ‖F_{i·}‖₂ [15]. The latter regularization term is often called a "group-sparse" or "group-LASSO" regularizer. (An early use of this type of regularizer was described in [44].)
2.7. Sparse Plus Low-Rank Matrix Decomposition Another useful paradigm is to decompose a partly or fully observed n × p matrix Y into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully-observed problem is

min_{M,S} ‖M‖_* + λ‖S‖₁  s.t. Y = M + S,

where ‖S‖₁ := Σ_{i=1}^n Σ_{j=1}^p |S_ij| [11, 14]. Compact, nonconvex formulations that allow noise in the observations include the following:

min_{L,R,S} (1/2) ‖LR^T + S − Y‖²_F   (fully observed),
min_{L,R,S} (1/2) ‖P_Φ(LR^T + S − Y)‖²_F   (partially observed),

where Φ represents the locations of the observed entries of Y and P_Φ is projection onto this set [15, 48].
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations. Another application is to foreground-background separation in video processing. Here, each column of Y represents the pixels in one frame of video, whereas each row of Y shows the evolution of one pixel over time.
2.8. Subspace Identification In this application, the a_j ∈ R^n, j = 1, 2, ..., m, are vectors that lie (approximately) in a low-dimensional subspace. The aim is to identify this subspace, expressed as the column subspace of a matrix X ∈ R^{n×r}. If the a_j are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the n × m matrix A = [a_j]_{j=1}^m, and take X to be the leading r left singular vectors. In interesting variants of this problem, however, the vectors a_j may be arriving in streaming fashion and may be only partly observed, for example in indices Φ_j ⊂ {1, 2, ..., n}. We would thus need to identify a matrix X and vectors s_j ∈ R^r such that

P_{Φ_j}(a_j − X s_j) ≈ 0, j = 1, 2, ..., m.

The algorithm for identifying X, described in [1], is a manifold-projection scheme that takes steps in incremental fashion for each a_j in turn. Its validity relies on incoherence of the matrix X with respect to the principal axes, that is, the matrix X should not have a few elements that are much larger than the others. A local convergence analysis of this method is given in [2].
2.9. Support Vector Machines Classification via support vector machines (SVM) is a classical paradigm in machine learning. This problem takes as input data (a_j, y_j) with a_j ∈ R^n and y_j ∈ {−1, 1}, and seeks a vector x ∈ R^n and a scalar β ∈ R such that
Note that the jth term in this summation is zero if the conditions (2.9.1) are satisfied, and positive otherwise. Even if no pair (x, β) exists with H(x, β) = 0, the pair (x, β) that minimizes (2.1.2) will be the one that comes as close as possible to satisfying (2.9.1), in a suitable sense. A term λ‖x‖₂², where λ is a small positive parameter, is often added to (2.9.2), yielding the following regularized version:

(2.9.3)   H(x, β) = (1/m) Σ_{j=1}^m max(1 − y_j(a_j^T x − β), 0) + (λ/2) ‖x‖₂².
If λ is sufficiently small (but positive), and if separating hyperplanes exist, the pair (x, β) that minimizes (2.9.3) is the maximum-margin separating hyperplane. The maximum-margin property is consistent with the goals of generalizability and robustness. For example, if the observed data (a_j, y_j) is drawn from an underlying "cloud" of positive and negative cases, the maximum-margin solution usually does a reasonable job of separating other empirical data samples drawn from the same clouds, whereas a hyperplane that passes close by several of the observed data points may not do as well (see Figure 2.9.4).

The problem of minimizing (2.9.3) can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables s_j, j = 1, 2, ..., m, to represent the residual terms. Then,
(2.9.5a)   min_{x,β,s} (1/m) 1^T s + (λ/2) ‖x‖₂²,
(2.9.5b)   subject to s_j ≥ 1 − y_j(a_j^T x − β), s_j ≥ 0, j = 1, 2, ..., m,

where 1 = (1, 1, ..., 1)^T ∈ R^m.
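As a minimal illustration (added here, with placeholder data), the regularized objective (2.9.3) can be evaluated directly:

import numpy as np

def svm_objective(x, beta, A, y, lam):
    """Evaluate H(x, beta) of (2.9.3): mean hinge loss plus (lam/2)*||x||^2."""
    margins = 1.0 - y * (A @ x - beta)        # one entry per data point
    hinge = np.maximum(margins, 0.0).mean()   # average of max(1 - y_j(a_j^T x - beta), 0)
    return hinge + 0.5 * lam * np.dot(x, x)

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 3))
y = np.sign(rng.standard_normal(50))
print(svm_objective(np.zeros(3), 0.0, A, y, lam=1e-3))  # equals 1 at x = 0, beta = 0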
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. One solution is to transform all of the raw data vectors a_j by a mapping ζ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors ζ(a_j), j = 1, 2, ..., m. The conditions (2.9.1) would thus be replaced by
Interestingly, problem (2.9.8) can be formulated and solved without any explicit knowledge or definition of the mapping ζ. We need only a technique to define the elements of Q. This can be done with the use of a kernel function K : R^n × R^n → R, where K(a_k, a_l) replaces ζ(a_k)^T ζ(a_l) [4, 16]. This is the so-called "kernel trick." (The kernel function K can also be used to construct a classification function φ from the solution of (2.9.8).) A particularly popular choice of kernel is the Gaussian kernel:

K(a_k, a_l) := exp(−‖a_k − a_l‖²/(2σ)),

where σ is a positive parameter.
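A short sketch (added for illustration) of forming the Gaussian kernel matrix Q for a set of data vectors stored as the rows of A:

import numpy as np

def gaussian_kernel_matrix(A, sigma):
    """Q[k, l] = exp(-||a_k - a_l||^2 / (2*sigma)) for rows a_k of A."""
    sq_norms = (A ** 2).sum(axis=1)
    # ||a_k - a_l||^2 = ||a_k||^2 + ||a_l||^2 - 2 a_k^T a_l
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma))

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
Q = gaussian_kernel_matrix(A, sigma=1.0)
print(np.allclose(np.diag(Q), 1.0))  # True: each point has unit similarity with itself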
2.10. Logistic Regression Logistic regression can be viewed as a variant of binary support-vector machine classification, in which rather than the classification function φ giving an unqualified prediction of the class in which a new data vector a lies, it returns an estimate of the odds of a belonging to one class or the other. We seek an "odds function" p parametrized by a vector x ∈ R^n as follows:

(2.10.1)   p(a; x) := (1 + exp(a^T x))⁻¹,

and aim to choose the parameter x so that
We can perform feature selection using this model by introducing a regularizer λ‖x‖₁, as follows:

(2.10.4)   max_x (1/m) [ Σ_{j : y_j = −1} log(1 − p(a_j; x)) + Σ_{j : y_j = 1} log p(a_j; x) ] − λ‖x‖₁,

where λ > 0 is a regularization parameter. (Note that we subtract rather than add the regularization term λ‖x‖₁ to the objective, because this problem is formulated as a maximization rather than a minimization.) As we see later, this term has the effect of producing a solution in which few components of x are nonzero, making it possible to evaluate p(a; x) by knowing only those components of a that correspond to the nonzeros in x.
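A minimal sketch (added here, with placeholder data) evaluating the regularized objective (2.10.4), using the odds function p of (2.10.1):

import numpy as np

def logistic_objective(x, A, y, lam):
    """Evaluate (2.10.4) for labels y in {-1, +1}, with p(a; x) = 1/(1 + exp(a^T x))."""
    p = 1.0 / (1.0 + np.exp(A @ x))                       # p(a_j; x) for every j
    ll = np.where(y == 1, np.log(p), np.log(1.0 - p)).mean()
    return ll - lam * np.abs(x).sum()                     # regularizer subtracted: we maximize

rng = np.random.default_rng(6)
A = rng.standard_normal((100, 4))
y = np.sign(rng.standard_normal(100))
print(logistic_objective(np.zeros(4), A, y, lam=0.01))    # log(1/2) at x = 0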
An important extension of this technique is to multiclass (or multinomial) logistic regression, in which the data vectors a_j belong to more than two classes. Such applications are common in modern data analysis. For example, in a speech recognition system, the M classes could each represent a phoneme of speech, one of the potentially thousands of distinct elementary sounds that can be uttered by
(2.10.5)   p_k(a; X) := exp(a^T x_[k]) / Σ_{l=1}^M exp(a^T x_[l]),   k = 1, 2, ..., M,

where we define X := {x_[k] | k = 1, 2, ..., M}. Note that for all a and for all k = 1, 2, ..., M, we have p_k(a) ∈ (0, 1) and also Σ_{k=1}^M p_k(a) = 1. The operation in (2.10.5) is referred to as a "softmax" on the quantities {a^T x_[l] | l = 1, 2, ..., M}. If one of these inner products dominates the others, that is, a^T x_[k] ≫ a^T x_[l] for all l ≠ k, the formula (2.10.5) will yield p_k(a; X) ≈ 1 and p_l(a; X) ≈ 0 for all l ≠ k.
In the setting of multiclass logistic regression, the labels y_j are vectors in R^M, whose elements are defined as follows:

(2.10.6)   y_jk = 1 when a_j belongs to class k, and y_jk = 0 otherwise.
Similarly to (2.10.2), we seek to define the vectors x_[k] so that

(2.10.8)   L(X) := −(1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_jℓ (x_[ℓ]^T a_j) − log ( Σ_{ℓ=1}^M exp(x_[ℓ]^T a_j) ) ].
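The following sketch (added for illustration; it assumes the negative log-likelihood reading of (2.10.8), consistent with later references to minimizing this loss) evaluates the softmax (2.10.5) and the loss, with the usual shift for numerical stability:

import numpy as np

def softmax_probs(a, X):
    """p_k(a; X) of (2.10.5); shifting the inner products leaves probabilities unchanged."""
    z = X @ a
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def multiclass_logistic_loss(X, A, Y):
    """L(X) of (2.10.8) for a one-hot label matrix Y built as in (2.10.6)."""
    Z = A @ X.T                                               # m x M inner products
    logsum = np.log(np.exp(Z - Z.max(1, keepdims=True)).sum(1)) + Z.max(1)
    return -np.mean((Y * Z).sum(1) - logsum)

rng = np.random.default_rng(7)
m, n, M = 40, 5, 3
A = rng.standard_normal((m, n))
Y = np.eye(M)[rng.integers(0, M, size=m)]                     # one-hot labels (2.10.6)
print(softmax_probs(A[0], np.zeros((M, n))))                  # uniform: [1/3, 1/3, 1/3]
print(multiclass_logistic_loss(np.zeros((M, n)), A, Y))       # log(M) at X = 0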
[Figure 2.11.1: a deep neural network, with input nodes, several hidden layers, and output nodes.]
and defining X := {x_[k] | k = 1, 2, ..., M} as in Section 2.10, we can write the loss function for deep learning as follows:

(2.11.3)   L(w, X) := −(1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_jℓ (x_[ℓ]^T a_j^D(w)) − log ( Σ_{ℓ=1}^M exp(x_[ℓ]^T a_j^D(w)) ) ].
Note that this is exactly the function (2.10.8) applied to the output of the top hidden layer a_j^D(w). We write a_j^D(w) to make explicit the dependence of a_j^D on the parameters w of (2.11.2), as well as on the input vector a_j. (We can view multiclass logistic regression (2.10.8) as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, ..., m.)
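Since (2.11.2) is not reproduced in full above, the following sketch assumes the standard layer recursion a^l = σ(W^l a^{l−1} + g^l) with σ = tanh; the layer sizes and parameter containers are illustrative assumptions:

import numpy as np

def forward(a, Ws, gs):
    """Map an input vector a to the top-layer output a^D, fed into the softmax (2.10.5)."""
    for W, g in zip(Ws, gs):
        a = np.tanh(W @ a + g)   # componentwise transformation sigma = tanh (an assumption)
    return a

rng = np.random.default_rng(8)
sizes = [10, 8, 8, 4]            # input dimension, two hidden layers, top hidden layer
Ws = [0.1 * rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(3)]
gs = [np.zeros(sizes[l + 1]) for l in range(3)]
a_top = forward(rng.standard_normal(10), Ws, gs)
print(a_top.shape)               # (4,)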
Neural networks in use for particular applications (in image recognition and speech recognition, for example, where they have been very successful) include many variants on the basic design above. These include restricted connectivity between layers (that is, enforcing structure on the matrices W^l, l = 1, 2, ..., D), layer arrangements that are more complex than the linear layout illustrated in Figure 2.11.1, with outputs coming from different levels, connections across non-adjacent layers, different componentwise transformations σ at different layers, and so on. Deep neural networks for practical applications are highly engineered objects.
The loss function (2.11.3) shares with many other applications the "summation" form (2.1.2), but it has several features that set it apart from the other applications discussed above. First, and possibly most important, it is nonconvex in the parameters w. There is reason to believe that the "landscape" of L is complex, with the global minimizer being exceedingly difficult to find. Second, the total number of parameters in (w, X) is usually very large. The most popular algorithms for minimizing (2.11.3) are of stochastic gradient type, which like most optimization methods come with no guarantee for finding the minimizer of a nonconvex function. Effective training of deep learning classifiers typically requires a great deal of data and computation power. Huge clusters of powerful computers, often using multicore processors, GPUs, and even specially architected processing units, are devoted to this task. Efficiency also requires many heuristics in the formulation and the algorithm (for example, in the choice of regularization functions and in the steplengths for stochastic gradient).
3. Preliminaries

We discuss here some foundations for the analysis of subsequent sections. These include useful facts about smooth and nonsmooth convex functions, Taylor's theorem and some of its consequences, optimality conditions, and proximal operators.

In the discussion of this section, our basic assumption is that f is a mapping from R^n to R ∪ {+∞}, continuous on its effective domain D := {x | f(x) < ∞}. Further assumptions about f are introduced as needed.
3.1. Solutions Consider the problem of minimizing f (1.0.1). We have the following terminology:
• x∗ is a local minimizer of f if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N.
• x∗ is a global minimizer of f if f(x) ≥ f(x∗) for all x ∈ R^n.
• x∗ is a strict local minimizer if it is a local minimizer on some neighborhood N and in addition f(x) > f(x∗) for all x ∈ N with x ≠ x∗.
• x∗ is an isolated local minimizer if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N and in addition, N contains no local minimizers other than x∗.
3.2. Convexity and Subgradients A convex set Ω ⊂ R^n has the property that

(3.2.1)   x, y ∈ Ω ⇒ (1 − α)x + αy ∈ Ω for all α ∈ [0, 1].

We usually deal with closed convex sets in this article. For a convex set Ω ⊂ R^n we define the indicator function I_Ω(x) as follows:

I_Ω(x) = 0 if x ∈ Ω, and I_Ω(x) = +∞ otherwise.

Indicator functions are useful devices for deriving optimality conditions for constrained problems, and even for developing algorithms. The constrained optimization problem

(3.2.2)   min_{x ∈ Ω} f(x)

can be restated equivalently as follows:

(3.2.3)   min_x f(x) + I_Ω(x).
We noted already that a convex function φ : R^n → R ∪ {+∞} has the following defining property:

(3.2.4)   φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y), for all x, y ∈ R^n and all α ∈ [0, 1].

The concepts of "minimizer" are simpler in the case of convex objective functions than in the general case. In particular, the distinction between "local" and "global" minimizers disappears. For f convex in (1.0.1), we have the following.
(a) Any local minimizer of (1.0.1) is also a global minimizer.
(b) The set of global minimizers of (1.0.1) is a convex set.
If there exists a value γ > 0 such that

(3.2.5)   φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y) − (1/2) γ α(1 − α) ‖x − y‖₂²

for all x and y in the domain of φ and α ∈ [0, 1], we say that φ is strongly convex with modulus of convexity γ.
We summarize some definitions and results about subgradients of convex functions here. For a more extensive discussion, see [22].
Proof. From the convexity of f and the definitions of a and b, we deduce that f(y) ≥ f(x) + a^T(y − x) and f(x) ≥ f(y) + b^T(x − y). The result follows by adding these two inequalities.

We can easily characterize a minimum in terms of the subdifferential.

Theorem 3.2.8. The point x∗ is a minimizer of a convex function f if and only if 0 ∈ ∂f(x∗).
Lemma 3.3.10. Given convex f satisfying (3.2.5), with ∇f uniformly Lipschitz continuous with constant L, we have for any x, y that

(3.3.11)   (γ/2) ‖y − x‖² ≤ f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ‖y − x‖².

For later convenience, we define a condition number κ as follows:

(3.3.12)   κ := L/γ.

When f is twice continuously differentiable, we can characterize the constants γ and L in terms of the eigenvalues of the Hessian ∇²f(x). Specifically, we can show that (3.3.11) is equivalent to

(3.3.13)   γI ⪯ ∇²f(x) ⪯ LI, for all x.

When f is strictly convex and quadratic, κ defined in (3.3.12) is the condition number of the (constant) Hessian, in the usual sense of linear algebra.

Strongly convex functions have unique minimizers, as we now show.
598 Strongly convex functions have unique minimizers, as we now show.
599 Theorem 3.3.14. Let f be differentiable and strongly convex with modulus γ > 0. Then
600 the minimizer x∗ of f exists and is unique.
Proof. We show first that for any point x⁰, the level set {x | f(x) ≤ f(x⁰)} is closed and bounded, and hence compact. Suppose for contradiction that there is a sequence {x^ℓ} such that ‖x^ℓ‖ → ∞ and

(3.3.15)   f(x^ℓ) ≤ f(x⁰).

By strong convexity of f, we have for some γ > 0 that

f(x^ℓ) ≥ f(x⁰) + ∇f(x⁰)^T (x^ℓ − x⁰) + (γ/2) ‖x^ℓ − x⁰‖².

By rearranging slightly, and using (3.3.15), we obtain

(γ/2) ‖x^ℓ − x⁰‖² ≤ −∇f(x⁰)^T (x^ℓ − x⁰) ≤ ‖∇f(x⁰)‖ ‖x^ℓ − x⁰‖.

By dividing both sides by (γ/2)‖x^ℓ − x⁰‖, we obtain ‖x^ℓ − x⁰‖ ≤ (2/γ)‖∇f(x⁰)‖ for all ℓ, which contradicts unboundedness of {x^ℓ}. Thus, the level set is bounded. Since it is also closed (by continuity of f), it is compact.

Since f is continuous, it attains its minimum on the compact level set, which is also the solution of min_x f(x), and we denote it by x∗. Suppose for contradiction that the minimizer is not unique, so that we have two points x∗₁ and x∗₂ that minimize f. Obviously, these points must attain equal objective values, so that f(x∗₁) = f(x∗₂) = f∗ for some f∗. By taking (3.2.5) with φ = f, x = x∗₁, y = x∗₂, and α = 1/2, we obtain

f((x∗₁ + x∗₂)/2) ≤ (1/2)(f(x∗₁) + f(x∗₂)) − (γ/8) ‖x∗₁ − x∗₂‖² < f∗,

so the point (x∗₁ + x∗₂)/2 has a smaller function value than both x∗₁ and x∗₂, contradicting our assumption that x∗₁ and x∗₂ are both minimizers. Hence, the minimizer x∗ is unique.
3.4. Optimality Conditions for Smooth Functions We consider the case of a smooth (twice continuously differentiable) function f that is not necessarily convex. Before designing algorithms to find a minimizer of f, we need to identify properties of f and its derivatives at a point x̄ that tell us whether or not x̄ is a minimizer, of one of the types described in Subsection 3.1. We call such properties optimality conditions.

A first-order necessary condition for optimality is that ∇f(x̄) = 0. More precisely, if x̄ is a local minimizer, then ∇f(x̄) = 0. We can prove this by using Taylor's theorem. Supposing for contradiction that ∇f(x̄) ≠ 0, we can show by setting x = x̄ and p = −α∇f(x̄) for α > 0 in (3.3.3) that f(x̄ − α∇f(x̄)) < f(x̄) for all α > 0 sufficiently small. Thus any neighborhood of x̄ will contain points x with f(x) < f(x̄), so x̄ cannot be a local minimizer.
If f is convex, as well as smooth, the condition ∇f(x̄) = 0 is sufficient for x̄ to be a global solution. This claim follows immediately from Theorems 3.2.8 and 3.2.9.

A second-order necessary condition for x̄ to be a local solution is that ∇f(x̄) = 0 and ∇²f(x̄) is positive semidefinite. The proof is by an argument similar to that of the first-order necessary condition, but using the second-order Taylor series expansion (3.3.5) instead of (3.3.3). A second-order sufficient condition is that ∇f(x̄) = 0 and ∇²f(x̄) is positive definite. This condition guarantees that x̄ is a strict local minimizer, that is, there is a neighborhood of x̄ such that x̄ has a strictly smaller function value than all other points in this neighborhood. Again, the proof makes use of (3.3.5).
We call x̄ a stationary point for smooth f if it satisfies the first-order necessary condition ∇f(x̄) = 0. Stationary points are not necessarily local minimizers. In fact, local maximizers satisfy the same condition. More interestingly, stationary points can be saddle points. These are points for which there exist directions u and v such that f(x̄ + αu) < f(x̄) and f(x̄ + αv) > f(x̄) for all positive α sufficiently small. When the Hessian ∇²f(x̄) has both strictly positive and strictly negative eigenvalues, it follows from (3.3.5) that x̄ is a saddle point. When ∇²f(x̄) is positive semidefinite or negative semidefinite, second derivatives alone are insufficient to classify x̄; higher-order derivative information is needed.
3.5. Proximal Operators and the Moreau Envelope Here we present some tools for analyzing the convergence of algorithms for the regularized problem (1.0.2), where the objective is the sum of a smooth function and a convex (usually nonsmooth) function.

We start with a formal definition.
Definition 3.5.1. For a closed proper convex function h and a positive scalar λ, the Moreau envelope is

(3.5.2)   M_{λ,h}(x) := inf_u { h(u) + (1/2λ) ‖u − x‖² } = (1/λ) inf_u { λh(u) + (1/2) ‖u − x‖² }.

The proximal operator of the function λh is the value of u that achieves the infimum in (3.5.2), that is,

(3.5.3)   prox_{λh}(x) := arg min_u { λh(u) + (1/2) ‖u − x‖² }.
From optimality properties for (3.5.3) (see Theorem 3.2.8), we have

(3.5.4)   0 ∈ λ ∂h(prox_{λh}(x)) + (prox_{λh}(x) − x).

The Moreau envelope can be viewed as a kind of smoothing or regularization of the function h. It has a finite value for all x, even when h takes on infinite values for some x ∈ R^n. In fact, it is differentiable everywhere, with gradient

∇M_{λ,h}(x) = (1/λ) (x − prox_{λh}(x)).

Moreover, x∗ is a minimizer of h if and only if it is a minimizer of M_{λ,h}.
670 Moreover, x∗ is a minimizer of h if and only if it is a minimizer of Mλ,h .
671 The proximal operator satisfies a nonexpansiveness property. From the opti-
672 mality conditions (3.5.4) at two points x and y, we have
673 x − proxλh (x) ∈ λ∂(proxλh (x)), y − proxλh (y) ∈ λ∂(proxλh (y)).
674 By applying monotonicity (Lemma 3.2.7), we have
T
675 (1/λ) (x − proxλh (x)) − (y − proxλh (y)) (proxλh (x) − proxλh (y)) > 0,
zero of {dist(0, ∂f(x^k))} (the sequence of distances from 0 to the subdifferential ∂f(x^k)). Other error measures for which we may be able to prove convergence rates include ‖x^k − x∗‖ (where x∗ is a solution) and f(x^k) − f∗ (where f∗ is the optimal value of the objective function f). For generality, we denote by {φ_k} the sequence of nonnegative scalars whose rate of convergence to 0 we wish to find.

We say that linear convergence holds if there is some σ ∈ (0, 1) such that

(3.6.1)   φ_{k+1}/φ_k ≤ 1 − σ, for all k sufficiently large.

(This property is sometimes also called geometric or exponential convergence, but the term linear is standard in the optimization literature, so we use it here.) It follows from (3.6.1) that there is some positive constant C such that

(3.6.2)   φ_k ≤ C(1 − σ)^k, k = 1, 2, . . . .

While (3.6.1) implies (3.6.2), the converse does not hold. The sequence

φ_k = 2^{−k} for k even,   φ_k = 0 for k odd,

satisfies (3.6.2) with C = 1 and σ = 0.5, but does not satisfy (3.6.1). To distinguish between these two slightly different definitions, (3.6.1) is sometimes called Q-linear while (3.6.2) is called R-linear.
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence are encountered in optimization algorithms for data analysis, including the following:

(3.6.3a)   φ_k ≤ C/√k, k = 1, 2, . . . ,
(3.6.3b)   φ_k ≤ C/k, k = 1, 2, . . . ,
(3.6.3c)   φ_k ≤ C/k², k = 1, 2, . . . ,

where in each case, C is some positive constant.
Superlinear convergence occurs when the constant σ ∈ (0, 1) in (3.6.1) can be chosen arbitrarily close to 1. Specifically, we say that the sequence {φ_k} converges Q-superlinearly to 0 if

(3.6.4)   lim_{k→∞} φ_{k+1}/φ_k = 0.

Q-quadratic convergence occurs when

(3.6.5)   φ_{k+1}/φ_k² ≤ C, k = 1, 2, . . . ,

for some sufficiently large C. We say that the convergence is R-superlinear if there is a Q-superlinearly convergent sequence {ν_k} that dominates {φ_k} (that is, 0 ≤ φ_k ≤ ν_k for all k). R-quadratic convergence is defined similarly. Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.

When a convergence rate applies globally, from any reasonable starting point, it can be used to derive a complexity bound for the algorithm, which takes the
For x = x^k and d = −∇f(x^k), the value of α that minimizes the expression on the right-hand side is α = 1/L. By substituting these values, we obtain

(4.1.3)   f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/2L) ‖∇f(x^k)‖².

This expression is one of the foundational inequalities in the analysis of optimization methods. Depending on the assumptions about f, we can derive a variety of different convergence rates from this basic inequality.
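A minimal sketch (added for illustration) of the steepest descent iteration with the constant steplength 1/L used in (4.1.3), run on a convex quadratic placeholder:

import numpy as np

def gradient_descent(grad, x0, L, iters):
    """Steepest descent x^{k+1} = x^k - (1/L) grad f(x^k)."""
    x = x0.copy()
    for _ in range(iters):
        x -= grad(x) / L
    return x

# Illustration on f(x) = 0.5 x^T A x - b^T x, where L = lambda_max(A).
rng = np.random.default_rng(9)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)
b = rng.standard_normal(5)
L = np.linalg.eigvalsh(A).max()
x = gradient_descent(lambda z: A @ z - b, np.zeros(5), L, iters=2000)
print(np.linalg.norm(A @ x - b))  # gradient norm, driven toward zero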
4.2. General Case We consider first a function f that is Lipschitz continuously differentiable and bounded below, but that need not necessarily be convex. Using (4.1.3) alone, we can prove a sublinear convergence result for the steepest descent method.

Theorem 4.2.1. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6), and that f is bounded below by a constant f̄. Then for the steepest descent method with constant steplength α_k ≡ 1/L, applied from a starting point x⁰, we have for any integer T ≥ 1 that

min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √(2L[f(x⁰) − f(x^T)]/T) ≤ √(2L[f(x⁰) − f̄]/T).

Proof. Rearranging (4.1.3) and summing over the first T − 1 iterates, we have

(4.2.2)   Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ≤ 2L Σ_{k=0}^{T−1} [f(x^k) − f(x^{k+1})] = 2L[f(x⁰) − f(x^T)].

(Note the telescoping sum.) Since f is bounded below by f̄, the right-hand side is bounded above by the constant 2L[f(x⁰) − f̄]. We also have that

min_{0≤k≤T−1} ‖∇f(x^k)‖ = √( min_{0≤k≤T−1} ‖∇f(x^k)‖² ) ≤ √( (1/T) Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ).

The result is obtained by combining this bound with (4.2.2).

This result shows that within the first T − 1 steps of steepest descent, at least one of the iterates has gradient norm less than √(2L[f(x⁰) − f̄]/T), which represents sublinear convergence of type (3.6.3a). It follows too from (4.2.2) that for f bounded below, any accumulation point of the sequence {x^k} is stationary.
4.3. Convex Case When f is also convex, we have the following stronger result for the steepest descent method.

Theorem 4.3.1. Suppose that f is convex and Lipschitz continuously differentiable, satisfying (3.3.6), and that (1.0.1) has a solution x∗. Then the steepest descent method with stepsize α_k ≡ 1/L generates a sequence {x^k}_{k=0}^∞ that satisfies

(4.3.2)   f(x^T) − f∗ ≤ (L/2T) ‖x⁰ − x∗‖².
Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/2) Σ_{k=0}^{T−1} [ ‖x^k − x∗‖² − ‖x^{k+1} − x∗‖² ]
   = (L/2) [ ‖x⁰ − x∗‖² − ‖x^T − x∗‖² ]
   ≤ (L/2) ‖x⁰ − x∗‖².

Since {f(x^k)} is a nonincreasing sequence, we have, as required,

f(x^T) − f(x∗) ≤ (1/T) Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/2T) ‖x⁰ − x∗‖².
4.4. Strongly Convex Case Recall that the definition (3.3.9) of strong convexity shows that f can be bounded below by a quadratic with Hessian γI. A strongly convex f with L-Lipschitz gradients is also bounded above by a similar quadratic (see (3.3.7)) differing only in the quadratic term, which becomes LI. From this "sandwich" effect, we derive a linear convergence rate for the gradient method, stated formally in the following theorem.

Theorem 4.4.1. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6), and strongly convex, satisfying (3.2.5) with modulus of convexity γ. Then f has a unique minimizer x∗, and the steepest descent method with stepsize α_k ≡ 1/L generates a sequence {x^k}_{k=0}^∞ that satisfies

f(x^{k+1}) − f(x∗) ≤ (1 − γ/L) (f(x^k) − f(x∗)), k = 0, 1, 2, . . . .
Proof. Existence of the unique minimizer x∗ follows from Theorem 3.3.14. Minimizing both sides of the inequality (3.3.9) with respect to y, we find that the minimizer on the left side is attained at y = x∗, while on the right side it is attained at y = x − ∇f(x)/γ. Plugging these optimal values into (3.3.9), we obtain

min_y f(y) ≥ min_y [ f(x) + ∇f(x)^T (y − x) + (γ/2) ‖y − x‖² ]
⇒ f(x∗) ≥ f(x) − (1/γ) ∇f(x)^T ∇f(x) + (γ/2) ‖(1/γ)∇f(x)‖²
⇒ f(x∗) ≥ f(x) − (1/2γ) ‖∇f(x)‖².
Theorem 4.5.3. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6), and that f is bounded below by a constant f̄. Consider the method that takes steps of the form (4.0.1), where d^k satisfies (4.5.1) for some η > 0 and the conditions (4.5.2) hold at all k, for some constants c₁ and c₂ with 0 < c₁ < c₂ < 1. Then for any integer T ≥ 1, we have

min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √( L / (η² c₁(1 − c₂)) ) √( (f(x⁰) − f̄) / T ).

Proof. By combining the Lipschitz property (3.3.6) with (4.5.2b), we have

−(1 − c₂) ∇f(x^k)^T d^k ≤ [∇f(x^k + α_k d^k) − ∇f(x^k)]^T d^k ≤ L α_k ‖d^k‖².

By comparing the first and last terms in these inequalities, we obtain the following lower bound on α_k:

α_k ≥ −((1 − c₂)/L) (∇f(x^k)^T d^k / ‖d^k‖²).

By substituting this bound into (4.5.2a), and using (4.5.1) and the step definition (4.0.1), we obtain

(4.5.4)   f(x^{k+1}) = f(x^k + α_k d^k) ≤ f(x^k) + c₁ α_k ∇f(x^k)^T d^k
   ≤ f(x^k) − (c₁(1 − c₂)/L) (∇f(x^k)^T d^k)² / ‖d^k‖²
   ≤ f(x^k) − (c₁(1 − c₂)/L) η² ‖∇f(x^k)‖²,

which by rearrangement yields

(4.5.5)   ‖∇f(x^k)‖² ≤ (L / (c₁(1 − c₂) η²)) [ f(x^k) − f(x^{k+1}) ].

The result now follows as in the proof of Theorem 4.2.1.
It follows by taking limits on both sides of (4.5.5) that

(4.5.6)   lim_{k→∞} ‖∇f(x^k)‖ = 0,

and therefore all accumulation points x̄ of the sequence {x^k} generated by the algorithm (4.0.1) have ∇f(x̄) = 0. In the case of f convex, this condition guarantees that x̄ is a solution of (1.0.1). When f is nonconvex, x̄ may be a local minimum, but it may also be a saddle point or a local maximum.

The paper [29] uses the stable manifold theorem to show that line-search gradient methods are highly unlikely to converge to stationary points x̄ at which some eigenvalues of the Hessian ∇²f(x̄) are negative. Although it is easy to construct examples for which such bad behavior occurs, it requires special choices of starting point x⁰. Possibly the most obvious example is where f(x₁, x₂) = x₁² − x₂², starting from x⁰ = (1, 0)^T, where d^k = −∇f(x^k) at each k. For this example, all iterates have x₂^k = 0 and, under appropriate conditions, converge to the saddle point x̄ = 0. Any starting point with x₂⁰ ≠ 0 cannot converge to 0; in fact, x₂^k diverges away from 0.
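A common practical way to find a steplength is backtracking; this hedged sketch enforces only the sufficient decrease condition (4.5.2a), not the curvature condition (4.5.2b), which is a simplification relative to the theorem above:

import numpy as np

def backtracking_step(f, gradf, x, d, c1=1e-4, alpha0=1.0, shrink=0.5):
    """Return alpha with f(x + alpha*d) <= f(x) + c1*alpha*gradf(x)^T d."""
    fx, slope = f(x), np.dot(gradf(x), d)
    assert slope < 0, "d must be a descent direction"
    alpha = alpha0
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= shrink            # shrink the step until sufficient decrease holds
    return alpha

# One line-search step on f(x) = ||x||^2 / 2 along the steepest descent direction.
f = lambda x: 0.5 * np.dot(x, x)
gradf = lambda x: x
x = np.array([2.0, -1.0])
alpha = backtracking_step(f, gradf, x, -gradf(x))
print(alpha, f(x - alpha * gradf(x)))  # alpha = 1 already satisfies the condition here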
4.6. Conditional Gradient Method The conditional gradient approach, often known as "Frank-Wolfe" after the authors who devised it [24], is a method for convex nonlinear optimization over compact convex sets. This is the problem

(4.6.1)   min_{x ∈ Ω} f(x)

(see earlier discussion around (3.2.2)), where Ω is a compact convex set and f is a convex function whose gradient is Lipschitz continuous in a neighborhood of Ω, with Lipschitz constant L. We assume that Ω has diameter D, that is, ‖x − y‖ ≤ D for all x, y ∈ Ω.

The conditional gradient method replaces the objective in (4.6.1) at each iteration by a linear Taylor-series approximation around the current iterate x^k, and minimizes this linear objective over the original constraint set Ω. It then takes a step from x^k towards the minimizer of this linearized subproblem. The full method, with its standard steplength choice, is as follows:

(4.6.2a)   v^k := arg min_{v ∈ Ω} v^T ∇f(x^k);
(4.6.2b)   x^{k+1} := x^k + α_k (v^k − x^k), where α_k := 2/(k + 2).
f(x^{k+1}) − f(x∗) ≤ (1 − 2/(k + 2)) [f(x^k) − f(x∗)] + (1/2) (2/(k + 2))² L D²   from (4.6.6), (4.6.2b)
   ≤ (2k/(k + 2)²) L D² + (2/(k + 2)²) L D²   from (4.6.4)
   = 2L D² (k + 1)/(k + 2)²
   = 2L D² ((k + 1)/(k + 2)) (1/(k + 2))
   ≤ 2L D² ((k + 2)/(k + 3)) (1/(k + 2)) = 2L D²/(k + 3),

as required.
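A hedged sketch of the method on the ℓ1-ball constraint Ω = {x : ‖x‖₁ ≤ τ}, chosen here because the linearized subproblem has a closed-form vertex solution; the quadratic objective and τ are illustrative placeholders:

import numpy as np

def frank_wolfe_l1(grad, tau, x0, iters):
    """Conditional gradient for min f(x) s.t. ||x||_1 <= tau, with alpha_k = 2/(k+2)."""
    x = x0.copy()
    for k in range(iters):
        g = grad(x)
        # Linear oracle: argmin_{||v||_1 <= tau} g^T v is a signed vertex of the l1 ball.
        i = np.argmax(np.abs(g))
        v = np.zeros_like(x)
        v[i] = -tau * np.sign(g[i])
        x += (2.0 / (k + 2)) * (v - x)   # step toward the subproblem minimizer
    return x

rng = np.random.default_rng(10)
M = rng.standard_normal((8, 8))
A = M.T @ M + np.eye(8)
b = rng.standard_normal(8)
x = frank_wolfe_l1(lambda z: A @ z - b, tau=1.0, x0=np.zeros(8), iters=500)
print(np.abs(x).sum())               # iterates remain inside the l1 ball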
One way to verify this equivalence is to note that the objective in (5.0.3) can be written as

(1/α_k) [ (1/2) ‖z − (x^k − α_k ∇f(x^k))‖² + α_k λ ψ(z) ],

(modulo a term (α_k/2)‖∇f(x^k)‖² that does not involve z). The subproblem objective in (5.0.3) consists of a linear term ∇f(x^k)^T (z − x^k) (the first-order term in a Taylor-series expansion), a proximality term (1/2α_k)‖z − x^k‖² that becomes more strict as α_k ↓ 0, and the regularization term λψ(z) in unaltered form. When λ = 0, we have x^{k+1} = x^k − α_k ∇f(x^k), so the iteration (5.0.2) (or (5.0.3)) reduces to the usual steepest-descent approach discussed in Section 4 in this case. It is useful to continue thinking of α_k as playing the role of a line-search parameter, though here the line search is expressed implicitly through a proximal term.
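A minimal sketch of the iteration (5.0.2) for the LASSO case ψ = ‖·‖₁ (problem (2.2.2)) with the constant step α_k = 1/L; the soft-thresholding formula for the prox is standard, and the data below are placeholders:

import numpy as np

def prox_gradient_lasso(A, y, lam, iters):
    """Prox-gradient steps for f = (1/2m)||Ax - y||^2, psi = ||.||_1, alpha_k = 1/L."""
    m, n = A.shape
    L = np.linalg.eigvalsh(A.T @ A).max() / m          # Lipschitz constant of grad f
    x = np.zeros(n)
    for _ in range(iters):
        z = x - (A.T @ (A @ x - y) / m) / L            # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)||.||_1
    return x

rng = np.random.default_rng(11)
A = rng.standard_normal((100, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]   # sparse ground truth
y = A @ x_true + 0.01 * rng.standard_normal(100)
print(np.round(prox_gradient_lasso(A, y, lam=0.01, iters=2000), 2))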
We will demonstrate convergence of the method (5.0.2) at a sublinear rate, for functions f whose gradients satisfy a Lipschitz continuity property with Lipschitz constant L (see (3.3.6)), and for the constant steplength choice α_k = 1/L. The proof makes use of a "gradient map" defined by

(5.0.4)   G_α(x) := (1/α) [ x − prox_{αλψ}(x − α∇f(x)) ].

By comparing with (5.0.2), we see that this map defines the step taken at iteration k:

(5.0.5)   x^{k+1} = x^k − α_k G_{α_k}(x^k) ⇔ G_{α_k}(x^k) = (1/α_k)(x^k − x^{k+1}).
The following technical lemma reveals some useful properties of G_α(x).

Lemma 5.0.6. Suppose that in problem (5.0.1), ψ is a closed convex function and that f is convex with Lipschitz continuous gradient on R^n, with Lipschitz constant L. Then for the definition (5.0.4) with α > 0, the following claims are true.
(a) G_α(x) ∈ ∇f(x) + λ∂ψ(x − αG_α(x)).
(b) For any z, and any α ∈ (0, 1/L], we have that

φ(x − αG_α(x)) ≤ φ(z) + G_α(x)^T (x − z) − (α/2) ‖G_α(x)‖².

Proof. For part (a), we use the optimality property (3.5.4) of the prox operator, and make the following substitutions: x − α∇f(x) for "x", αλ for "λ", and ψ for "h" to obtain

0 ∈ αλ ∂ψ(prox_{αλψ}(x − α∇f(x))) + (prox_{αλψ}(x − α∇f(x)) − (x − α∇f(x))).

We make the substitution prox_{αλψ}(x − α∇f(x)) = x − αG_α(x), using definition (5.0.4), to obtain

0 ∈ αλ ∂ψ(x − αG_α(x)) − α(G_α(x) − ∇f(x)),

and the result follows when we divide by α.

For (b), we start with the following consequence of Lipschitz continuity of ∇f, from Lemma 3.3.10:

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖².
where α_k and β_k are positive scalars. That is, a momentum term β_k(x^k − x^{k−1}) is added to the usual steepest descent update. Although this method can be applied to any smooth convex f (and even to nonconvex functions), the convergence analysis is most straightforward for the special case of strongly convex quadratic functions (see [38]). (This analysis also suggests appropriate values for the step lengths α_k and β_k.) Consider the function

(6.1.2)   min_{x ∈ R^n} f(x) := (1/2) x^T A x − b^T x,

where the (constant) Hessian A has eigenvalues in the range [γ, L], with 0 < γ ≤ L. For the following constant choices of steplength parameters:

α_k = α := 4/(√L + √γ)²,   β_k = β := (√L − √γ)/(√L + √γ),

it can be shown that ‖x^k − x∗‖ ≤ Cβ^k, for some (possibly large) constant C. We can use (3.3.7) to translate this into a bound on the function error, as follows:

f(x^k) − f(x∗) ≤ (L/2) ‖x^k − x∗‖² ≤ (LC²/2) β^{2k},

allowing a direct comparison with the rate (4.4.3) for the steepest descent method. If we suppose that L ≫ γ, we have

β ≈ 1 − 2√(γ/L),

so that we achieve approximate convergence f(x^k) − f(x∗) ≤ ε (for small positive ε) in O(√(L/γ) log(1/ε)) iterations, compared with O((L/γ) log(1/ε)) for steepest descent, a significant improvement.
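A hedged sketch of the heavy-ball iteration on the quadratic (6.1.2); since (6.1.1) is not reproduced above, this assumes the momentum form x^{k+1} = x^k − α∇f(x^k) + β(x^k − x^{k−1}) described in the surrounding text, with the constant α, β given above:

import numpy as np

def heavy_ball(A, b, iters):
    """Heavy-ball method on f(x) = 0.5 x^T A x - b^T x with constant alpha, beta."""
    gamma, L = np.linalg.eigvalsh(A)[[0, -1]]
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x = x_prev = np.zeros_like(b)
    for _ in range(iters):
        # Gradient step plus momentum term beta*(x^k - x^{k-1}).
        x, x_prev = x - alpha * (A @ x - b) + beta * (x - x_prev), x
    return x

rng = np.random.default_rng(12)
M = rng.standard_normal((10, 10))
A = M.T @ M + 0.1 * np.eye(10)
b = rng.standard_normal(10)
print(np.linalg.norm(A @ heavy_ball(A, b, 300) - b))  # residual, driven toward zero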
The heavy-ball method is fundamental, but several points should be noted. First, the analysis for convex quadratic f is based on linear algebra arguments, and does not generalize to general strongly convex nonlinear functions. Second, the method requires knowledge of γ and L, for the purposes of defining the parameters α and β. Third, it is not a descent method; we usually have f(x^{k+1}) > f(x^k) for many k. These properties are not specific to the heavy-ball method; some of them are shared by other methods that use momentum.
6.2. Conjugate Gradient The conjugate gradient method for solving linear systems Ax = b (or, equivalently, minimizing the convex quadratic (6.1.2)), where A is symmetric positive definite, is one of the most important algorithms in computational science. Though invented earlier than the other algorithms discussed in this section (see [27]) and motivated in a different way, conjugate gradient clearly makes use of momentum. Its steps have the form

(6.2.1)   x^{k+1} = x^k + α_k p^k, where p^k = −∇f(x^k) + ξ_k p^{k−1},

for some choices of α_k and ξ_k, which is identical to (6.1.1) when we define β_k appropriately. For strongly convex quadratic problems (6.1.2), conjugate gradient has excellent properties. It does not require prior knowledge of the range [γ, L] of the eigenvalues of A.
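A minimal sketch (added for illustration) of the standard conjugate gradient iteration for a symmetric positive definite system, showing the momentum mixing in (6.2.1):

import numpy as np

def conjugate_gradient(A, b, iters, tol=1e-10):
    """CG for Ax = b, A symmetric positive definite."""
    x = np.zeros_like(b)
    r = b - A @ x                    # residual = -grad f(x) for f in (6.1.2)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # exact minimizer of f along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # new direction mixes in the previous one
        rs = rs_new
    return x

rng = np.random.default_rng(13)
M = rng.standard_normal((20, 20))
A = M.T @ M + np.eye(20)
b = rng.standard_normal(20)
print(np.linalg.norm(A @ conjugate_gradient(A, b, 20) - b))  # tiny: exact in <= n steps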
Theorem 6.3.5. Suppose that f in (1.0.1) is convex, with ∇f Lipschitz continuous with constant L (as in (3.3.6)), and that the minimum of f is attained at x∗, with f∗ := f(x∗). Then the method defined by (6.3.2), (6.3.3) with x⁰ = y⁰ yields an iteration sequence {x^k} with the following property:

f(x^T) − f∗ ≤ 2L‖x⁰ − x∗‖² / (T + 1)², T = 1, 2, . . . .
Proof. From convexity of f and (3.3.7), we have for any x and y that

(6.3.6)   f(y − ∇f(y)/L) − f(x)
   ≤ f(y − ∇f(y)/L) − f(y) + ∇f(y)^T (y − x)
   ≤ ∇f(y)^T (y − ∇f(y)/L − y) + (L/2) ‖y − ∇f(y)/L − y‖² + ∇f(y)^T (y − x)
   = −(1/2L) ‖∇f(y)‖² + ∇f(y)^T (y − x).

Setting y = y^k and x = x^k in this bound, we obtain
Theorem 6.4.2. Suppose that f is such that ∇f is Lipschitz continuous with constant L, and that f is strongly convex with modulus of convexity γ and unique minimizer x∗. Then the method (6.3.2), (6.4.1) with starting point x⁰ = y⁰ satisfies

f(x^T) − f(x∗) ≤ ((L + γ)/2) ‖x⁰ − x∗‖² (1 − 1/√κ)^T, T = 1, 2, . . . .
Proof. The proof makes use of a family of strongly convex functions Φ_k(z) defined inductively as follows:

(6.4.3a)   Φ₀(z) = f(y⁰) + (γ/2) ‖z − y⁰‖²,
(6.4.3b)   Φ_{k+1}(z) = (1 − 1/√κ) Φ_k(z) + (1/√κ) [ f(y^k) + ∇f(y^k)^T (z − y^k) + (γ/2) ‖z − y^k‖² ].
Each Φ_k(·) is a quadratic, and an inductive argument shows that ∇²Φ_k(z) = γI for all k and all z. Thus, each Φ_k has the form

(6.4.4)   Φ_k(z) = Φ_k* + (γ/2) ‖z − v^k‖², k = 0, 1, 2, . . . ,

where v^k is the minimizer of Φ_k(·) and Φ_k* is its optimal value. (From (6.4.3a), we have v⁰ = y⁰.) We note too that Φ_k becomes a tighter overapproximation to f as k → ∞. To show this, we use (3.3.9) to replace the final term in parentheses in (6.4.3b) by f(z), then subtract f(z) from both sides of (6.4.3b) to obtain

(6.4.5)   Φ_{k+1}(z) − f(z) ≤ (1 − 1/√κ)(Φ_k(z) − f(z)).

In the remainder of the proof, we show that the following bound holds:

(6.4.6)   f(x^k) ≤ min_z Φ_k(z) = Φ_k*, k = 0, 1, 2, . . . .
The upper bound in Lemma 3.3.10 for x = x∗ gives f(z) − f(x∗) ≤ (L/2)‖z − x∗‖². By combining this bound with (6.4.5) and (6.4.6), we have

(6.4.7)   f(x^k) − f(x∗) ≤ Φ_k* − f(x∗)
   ≤ Φ_k(x∗) − f(x∗)
   ≤ (1 − 1/√κ)^k (Φ₀(x∗) − f(x∗))
   ≤ (1 − 1/√κ)^k [ (Φ₀(x∗) − f(x⁰)) + (f(x⁰) − f(x∗)) ]
   ≤ (1 − 1/√κ)^k ((γ + L)/2) ‖x⁰ − x∗‖².
The proof is completed by establishing (6.4.6), by induction on k. Since x⁰ = y⁰, it holds by definition at k = 0. By using the step formula (6.3.2a), the convexity property (3.3.8) (with x = y^k), and the inductive hypothesis, we have

(6.4.8)   f(x^{k+1}) ≤ f(y^k) − (1/2L) ‖∇f(y^k)‖²
   = (1 − 1/√κ) f(x^k) + (1 − 1/√κ)(f(y^k) − f(x^k)) + f(y^k)/√κ − (1/2L) ‖∇f(y^k)‖²
   ≤ (1 − 1/√κ) Φ_k* + (1 − 1/√κ) ∇f(y^k)^T (y^k − x^k) + f(y^k)/√κ − (1/2L) ‖∇f(y^k)‖².
38 Optimization Algorithms for Data Analysis
1157 Thus the claim is established (and the theorem is proved) if we can show that the
1158 right-hand side in (6.4.8) is bounded above by Φ∗k+1 .
1159 Recalling the observation (6.4.4), we have by taking derivatives of both sides of
1160 (6.4.3b) with respect to z that
√ √ √
1161 (6.4.9) ∇Φk+1 (z) = γ(1 − 1/ κ)(z − vk ) + ∇f(yk )/ κ + γ(z − yk )/ κ.
1162 Since vk+1 is the minimizer of Φk+1 we can set ∇Φk+1 (vk+1 ) = 0 in (6.4.9) to
1163 obtain
√ √ √
1164 (6.4.10) vk+1 = (1 − 1/ κ)vk + yk / κ − ∇f(yk )/(γ κ).
1165 By subtracting yk from both sides of this expression, and taking k · k2 of both
1166 sides, we obtain
√
kvk+1 − yk k2 = (1 − 1/ κ)2 kyk − vk k2 + k∇f(yk )k2 /(γ2 κ)
1167 (6.4.11) √ √
− 2(1 − 1/ κ)/(γ κ)∇f(yk )T (vk − yk ).
1168 By evaluating Φk+1 at z = yk , using both (6.4.4) and (6.4.3b), we obtain
γ
Φ∗k+1 + kyk − vk+1 k2
2√ √
1169 (6.4.12) = (1 − 1/ κ)Φk (yk ) + f(yk )/ κ
√ γ √ √
= (1 − 1/ κ)Φ∗k + (1 − 1/ κ)kyk − vk k2 + f(yk )/ κ.
2
1170 By substituting (6.4.11) into (6.4.12), we obtain
√ √ √ √
Φ∗k+1 = (1 − 1/ κ)Φ∗k + f(yk )/ κ + γ(1 − 1/ κ)/(2 κ)kyk − vk k2
1 √ √
− k∇f(yk )k2 + (1 − 1/ κ)∇f(yk )T (vk − yk )/ κ
2L
1171 (6.4.13) √ √
> (1 − 1/ κ)Φ∗k + f(yk )/ κ
1 √ √
− k∇f(yk )k2 + (1 − 1/ κ)∇f(yk )T (vk − yk )/ κ,
2L
1172 where we simply dropped a nonnegative term from the right-hand side to obtain
1173 the inequality. The final step is to show that
√
1174 (6.4.14) vk − yk = κ(yk − xk ),
1175 which we do by induction. Note that v0 = x0 = y0 , so the claim holds for k = 0.
1176 We have
√ √ √
vk+1 − yk+1 = (1 − 1/ κ)vk + yk / κ − ∇f(yk )/(γ κ) − yk+1
√ √ √
= κyk − ( κ − 1)xk − κ∇f(yk )/L − yk+1
1177 (6.4.15) √ √
= κxk+1 − ( κ − 1)xk − yk+1
√
= κ(yk+1 − xk+1 ),
1178 where the first equality is from (6.4.10), the second equality is from the inductive
1179 hypothesis, the third equality is from the iteration formula (6.3.2a), and the final
1180 equality is from the iteration formula (6.3.2b) with the definition of βk+1 from
1181 (6.4.1). We have thus proved (6.4.14), and by substituting this equality into (6.4.13),
Stephen J. Wright 39
1182 we obtain that Φ∗k+1 is an upper bound on the right-hand side of (6.4.8). This
1183 establishes (6.4.6) and thus completes the proof of the theorem.
1184 6.5. Lower Bounds on Rates The term “optimal” in Nesterov’s optimal method
1185 is used because the convergence rate achieved by the method is the best possible
1186 (possibly up to a constant), among algorithms that make use of gradient informa-
1187 tion at the iterates xk . This claim can be proved by means of a carefully designed
1188 function, for which no method that makes use of all gradients observed up to and
1189 including iteration k (namely, ∇f(xi ), i = 0, 1, 2, . . . , k) can produce a sequence
1190 {xk } that achieves a rate better than that of Theorem 6.3.5. The function proposed
1191 in [32] is a convex quadratic f(x) = (1/2)xT Ax − eT1 x, where
2 −1 0 0 ... ... 0
−1 2 −1 0 . . . . . . 0
1
0
0 −1 2 −1 0 . . . 0
1192 A=
. . . , e1 =
0 .
.. .. ..
.
..
0 ... 0 −1 2 −1
0
0 ... 0 −1 2
1193 The solution x∗ satisfies Ax∗ = e1 ; its components are x∗i = 1 − i/(n + 1), for
1194 i = 1, 2, . . . , n. If we use x0 = 0 as the starting point, and construct the iterate
1195 xk+1 as
Xk
k+1 k
1196 x =x + ξj ∇f(xj ),
j=0
1260 Theorem 7.1.6. Consider the problem (7.1.1) with f twice Lipschitz continuously differ-
1261 entiable with Lipschitz constant M defined in (7.1.2). Suppose that the second-order suf-
1262 ficient conditions are satisfied for the problem (7.1.1) at the point x∗ , that is, ∇f(x∗ ) = 0
γ
1263 and ∇2 f(x∗ ) γI for some γ > 0. Then if kx0 − x∗ k 6 2M , the sequence defined by
1264 (7.1.5) converges to x∗ at a quadratic rate, with
M k
1265 (7.1.7) kxk+1 − x∗ k 6 kx − x∗ k2 , k = 0, 1, 2, . . . .
γ
Proof. From (7.1.4) and (7.1.5), and using ∇f(x∗ ) = 0, we have
1273 where λmin (·) denotes the smallest eigenvalue of a symmetric matrix. Thus for
γ
1274 (7.1.10) kxk − x∗ k 6 ,
2M
1275 we have
γ γ
1276 λmin (∇2 f(xk )) > λmin (∇2 f(x∗ )) − Mkxk − x∗ k > γ − M > ,
2M 2
1277 so that k∇2 f(xk )−1 k 6 2/γ. By substituting this result together with (7.1.9) into
1278 (7.1.8), we obtain
2M k M k
1279 kxk+1 − x∗ k 6 kx − x∗ k2 = kx − x∗ k2 ,
γ 2 γ
1280 verifying the local quadratic convergence rate. By applying (7.1.10) again, we
1281 have
M k 1
1282 kxk+1 − x∗ k 6 kx − x∗ k kxk − x∗ k 6 kxk − x∗ k,
γ 2
1283 so, by arguing inductively, we see that the sequence converges to x∗ provided
1284 that x0 satisfies (7.1.10), as claimed.
1285 Of course, we do not need to explicitly identify a starting point x0 in the stated
1286 region of convergence. Any sequence that approaches to x∗ will eventually enter
1287 this region, and thereafter the quadratic convergence guarantees apply.
1288 We have established that Newton’s method converges rapidly once the iterates
1289 enter the neighborhood of a point x∗ satisfying second-order sufficient optimality
1290 conditions. But what happens when we start far from such a point?
1291 7.2. Newton’s Method for Convex Functions When the function f is convex as
1292 well as smooth, we can devise variants of Newton’s method for which global
1293 convergence and complexity results (in particular, results based on those of Sec-
1294 tion 4.5) can be proved in addition to local quadratic convergence.
1295 When f is strongly convex with modulus γ and satisfies Lipschitz continuity
1296 of the gradient (3.3.6), the Hessian ∇2 f(xk ) is positive definite for all k, with
1297 all eigenvalues in the interval [γ, L]. Thus, the Newton direction (7.1.4) is well
1298 defined at all iterates xk , and is a descent direction satisfying the condition (4.5.1)
1299 with η = γ/L. To verify this claim, note first
1
1300 kpk k 6 k∇2 f(xk )−1 kk∇f(xk )k 6 k∇f(xk )k.
γ
Then
(pk )T ∇f(xk ) = −∇f(xk )T ∇2 f(xk )−1 ∇f(xk )
1
6 − k∇f(xk )k2
L
γ
6 − k∇f(xk )kkpk k.
L
1301 We can use the Newton direction in the line-search framework of Subsection 4.5
1302 to obtain a method for which xk → x∗ , where x∗ is the (unique) global minimizer
1303 of f. (This claim follows from the property (4.5.6) together with the fact that x∗ is
1304 the only point for which ∇f(x∗ ) = 0.) We can even obtain a complexity result —
√
1305 and O(1/ T ) bound on min06k6T −1 k∇f(xk )k — from Theorem 4.5.3.
Stephen J. Wright 43
1306 These global convergence properties are enhanced by the local quadratic con-
1307 vergence property of Theorem 7.1.6 if we modify the line-search framework by
1308 accepting the step length αk = 1 in (4.0.1) whenever it satisfies the weak Wolfe
1309 conditions (4.5.2). (It can be shown, by again using arguments based on Taylor’s
1310 theorem (Theorem 3.3.1), that these conditions will be satisfied by αk = 1 for all
1311 xk sufficiently close to the minimizer x∗ .)
1312 Consider now the case in which f is convex and satisfies condition (3.3.6) but
1313 is not strongly convex. Here, the Hessian ∇2 f(xk ) may be singular for some k, so
1314 the direction (7.1.4) may not be well defined. However, by adding any positive
1315 number λk > 0 to the diagonal, we can ensure that the modified Newton direction
1316 defined by
1317 (7.2.1) pk = −[∇2 f(xk ) + λk I]−1 ∇f(xk ),
1318 is well defined and is a descent direction for f. For any η ∈ (0, 1) in (4.5.1),
1319 we have by choosing λk large enough that λk /(L + λk ) > η that the condition
1320 (4.5.1) is satisfied too, so we can use the resulting direction pk in the line-search
1321 framework of Subsection 4.5, to obtain a method that convergence to a solution
1322 x∗ of (1.0.1), when one exists.
1323 If, in addition, the minimizer x∗ is unique and satisfies a second-order suffi-
1324 cient condition (so that ∇2 f(x∗ ) is positive definite), then ∇2 f(xk ) will be positive
1325 definite too for k sufficiently large. Thus, provided that η is sufficiently small,
1326 the unmodified Newton direction (with λk = 0 in (7.2.1)) will satisfy the condi-
1327 tion (4.5.1). If we use (7.2.1) in the line-search framework of Section 4.5, but set
1328 λk = 0 where possible, and accept αk = 1 as the step length whenever it satisfies
1329 (4.5.2), we can obtain local quadratic convergence to x∗ , in addition to the global
1330 convergence and complexity promised by Theorem 4.5.3.
1331 7.3. Newton Methods for Nonconvex Functions For smooth nonconvex f, the
1332 Hessian ∇2 f(xk ) may be indefinite for some k. The Newton direction (7.1.4)
1333 may not exist (when ∇2 f(xk ) is singular) or it may not be a descent direction
1334 (when ∇2 f(xk ) has negative eigenvalues). However, we can still define a modified
1335 Newton direction as in (7.2.1), which will be a descent direction for λk sufficiently
1336 large, and thus can be used in the line-search framework of Section 4.5. For a
1337 given η in (4.5.1), a sufficient condition for pk from (7.2.1) to satisfy (4.5.1) is that
λk + λmin (∇2 f(xk ))
1338 > η,
λk + L
1339 where λmin (∇2 f(xk )) is the minimum eigenvalue of the Hessian, which may be
1340 negative. The line-search framework of Section 4.5 can then be applied to ensure
1341 that ∇f(xk ) → 0.
1342 Once again, if the iterates {xk } enter the neighborhood of a local solution x∗
1343 for which ∇2 f(x∗ ) is positive definite, some enhancements of the strategy for
1344 choosing λk and the step length αk can recover the local quadratic convergence
1345 of Theorem 7.1.6.
44 Optimization Algorithms for Data Analysis
1346 Formula (7.2.1) is not the only way to modify the Newton direction to ensure
1347 descent in a line-search framework. Other approaches are outlined in [36, Chap-
1348 ter 3]. One such technique is to modify the Cholesky factorization of ∇2 (fk ) by
1349 adding positive elements to the diagonal only as needed to allow the factoriza-
1350 tion to proceed (that is, to avoid taking the square root of a negative number),
1351 then using the modified factorization in place of ∇2 f(xk ) in the calculation of the
1352 Newton step pk . Another technique is to compute an eigenvalue decomposition
1353 ∇2 f(xk ) = Qk Λk QTk (where Qk is orthogonal and Λk is the diagonal matrix con-
1354 taining the eigenvalues), then define Λ̃k to be a modified version of Λk in which
1355 all the diagonals are positive. Then, following (7.1.4), pk can be defined as
1356 pk := −Qk Λ̃−1 T k
k Qk ∇f(x ).
1357 When an appropriate strategy is used to define Λ̃k , we can ensure satisfaction
1358 of the descent condition (4.5.1) for some η > 0. As above, the line-search frame-
1359 work of Section 4.5 can be used to obtain an algorithm that generates a sequence
1360 {xk } such that ∇f(xk ) → 0. We noted earlier that this condition ensures that all
1361 accumulation points x̂ are stationary points, that is, they satisfy ∇f(x̂) = 0.
1362 Stronger guarantees can be obtained from a trust-region version of Newton’s
1363 method, which ensures convergence to a point satisfying second-order necessary
1364 conditions, that is, ∇2 f(x̂) 0 in addition to ∇f(x̂) = 0. The trust-region approach
1365 was developed in the late 1970s and early 1980s, and has become popular again
1366 recently because of this appealing global convergence behavior. A trust-region
1367 Newton method also recovers quadratic convergence to solutions x∗ satisfying
1368 second-order-sufficient conditions, without any special modifications. (The trust-
1369 region Newton approach is closely related to cubic regularization [26, 35], which
1370 we discuss in the next section.)
1371 We now outline the trust-region approach. (Further details can be found in
1372 [36, Chapter 4].) The subproblem to be solved at each iteration is
1
1373 (7.3.1) min f(xk ) + ∇f(xk )T d + dT ∇2 f(xk )d subject to kdk2 6 ∆k .
d 2
1374 The objective is a second-order Taylor-series approximation while ∆k is the radius
1375 of the trust region — the region within which we trust the second-order model
1376 to capture the true behavior of f. Somewhat surprisingly, the problem (7.3.1) is
1377 not too difficult to solve, even when the Hessian ∇2 f(xk ) is indefinite. In fact, the
1378 solution dk of (7.3.1) satisfies the linear system
1379 (7.3.2) [∇2 f(xk ) + λI]dk = −∇f(xk ), for some λ > 0,
1380 where λ is chosen such that ∇2 f(xk ) + λI is positive semidefinite and λ > 0 only if
1381 kdk k = ∆k (see [31]). Solving (7.3.1) thus reduces to a search for the appropriate
1382 value of the scalar λk , for which specialized methods have been devised.
1383 For large-scale problems, it may be too expensive to solve (7.3.1) near-exactly,
1384 since the process may require several factorizations of an n × n matrix (namely,
1385 the coefficient matrix in (7.3.2), for different values of λ). A popular approach
Stephen J. Wright 45
1386 for finding approximate solutions of (7.3.1), which can be used when ∇2 f(xk )
1387 is positive definite, is the dogleg method. In this method the curved path traced
1388 out by solutions of (7.3.2) for values of λ in the interval [0, ∞) is approximated
1389 by simpler path consisting of two line segments. The first segment joins 0 to
1390 the point dk C that minimizes the objective in (7.3.1) along the direction −∇f(x ),
k
1391
k
while the second segment joins dC to the pure Newton step defined in (7.1.4). The
1392 approximate solution is taken to be the point at which this “dogleg” path crosses
1393 the boundary of the trust region kdk 6 ∆k . If the dogleg path lies entirely inside
1394 the trust region, we take dk to be the pure Newton step. See [36, Section 4.1].
1395 Having discussed the trust-region subproblem (7.3.1), let us outline how it can
1396 be used as the basis for a complete algorithm. A crucial role is played by the ratio
1397 between the amount of decrease in f predicted by the quadratic objective in (7.3.1) and
1398 the actual decrease in f, namely, f(xk ) − f(xk + dk ). Ideally, this ratio would be close
1399 to 1. If it is at least greater than a small tolerance (say, 10−4 ) we accept the step
1400 and proceed to the next iteration. Otherwise, we conclude that the trust-region
1401 radius ∆k is too large, so we do not take the step, shrink the trust region, and
1402 re-solve (7.3.1) to obtain a new step. Additionally, when the actual-to-predicted
1403 ratio is close to 1, we conclude that a larger trust region may hasten progress, so
1404 we increase ∆ for the next iteration, provided that the bound kdk k 6 ∆k really is
1405 active at the solution of (7.3.1).
1406 Unlike a basic line-search method, the trust-region Newton method can “es-
1407 cape” from a saddle point. Suppose we have ∇f(xk ) = 0 and ∇2 f(xk ) indefinite
1408 with some strictly negative eigenvalues. Then, the solution dk to (7.3.1) will be
1409 nonzero, and the algorithm will step away from the saddle point, in the direc-
1410 tion of most negative curvature for ∇2 f(xk ). Another appealing feature of the
1411 trust-region Newton approach is that when the sequence {xk } approaches a point
1412 x∗ satisfying second-order sufficient conditions, the trust region bound becomes
1413 inactive, and the method takes pure Newton steps (7.1.4) for all sufficiently large
1414 k so the local quadratic convergence that characterizes Newton’s method.
1415 The basic difference between line-search and trust-region methods can be sum-
1416 marized as follows. Line-search methods first choose a direction pk , then decide
1417 how far to move along that direction. Trust-region methods do the opposite: They
1418 choose the distance ∆k first, then find the direction that makes the best progress
1419 for this step length.
1420 7.4. A Cubic Regularization Approach Trust-region Newton methods have the
1421 significant advantage of guaranteeing that any accumulation points will satisfy
1422 second-order necessary conditions. A related approach based on cubic regulariza-
1423 tion has similar properties, plus some additional complexity guarantees. Cubic
1424 regularization requires the Hessian to be Lipschitz continuous, as in (7.1.2). It
1425 follows that the following cubic function yields a global upper bound for f:
1 M
1426 (7.4.1) TM (z; x) := f(x) + ∇f(x)T (z − x) + (z − x)T ∇2 f(x)(z − x) + kz − xk3 .
2 6
46 Optimization Algorithms for Data Analysis
1458 Using the lower bound f̄ on the objective f, we see that the number of iterations
1459 K required must satisfy the condition
!
2g 2 3H
1460 K min , 6 f(x0 ) − f̄,
2L 3 M2
1461 from which we conclude that
3 2 −3 0
1462 K 6 max 2L−2
g , M H f(x ) − f̄ .
2
1463 We also observe that that the maximum number of iterates required to identify a
1464 point at which only the approximate stationarity condition k∇f(xk )k 6 g holds
is 2L−2 0
1465 g (f(x ) − f̄). (We can just omit the second-order part of the algorithm.)
1466 Note too that it is easy to devise approximate versions of this algorithm with simi-
1467 lar complexity. For example, the negative curvature direction pk in step (ii) above
1468 can be replaced by an approximation to the direction of most negative curvature,
1469 obtained by the Lanczos iteration with random initialization.
1470 In algorithms that make more complete use of the cubic model (7.4.1), the term
−3/2
1471 −2
g in the complexity expression becomes g , and the constants are different.
1472 The subproblems (7.4.1) are more complicated to solve than those in the simple
1473 scheme above. Active research is going on into other algorithms that achieve
1474 complexities similar to those of the cubic regularization approach. A variety of
1475 methods that make use of Newton-type steps, approximate negative curvature di-
1476 rections, accelerated gradient methods, random perturbations, randomized Lanc-
1477 zos and conjugate gradient methods, and other algorithmic elements have been
1478 proposed.
1479 8. Conclusions
1480 We have outlined various algorithmic tools from optimization that are useful
1481 for solving problems in data analysis and machine learning, and presented their
1482 basic theoretical properties. The intersection of optimization and machine learn-
1483 ing is a fruitful and very popular area of current research. All the major machine
1484 learning conferences have a large contingent of optimization papers, and there is
1485 a great deal of interest in developing algorithmic tools to meet new challenges
1486 and in understanding their properties. The edited volume [41] contains a snap-
1487 shot of the state of the art circa 2010, but this is a fast-moving field and there have
1488 been many developments since then.
1489 Acknowledgments
1490 I thank Ching-pei Lee for a close reading and many helpful suggestions, and
1491 David Hong and an anonymous referee for detailed, excellent comments.
48 References
1492 References
1493 [1] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from highly incom-
1494 plete information, 48th Annual Allerton Conference on Communication, Control, and Computing,
1495 2010, pp. 704–711. ←9
1496 [2] L. Balzano and S. J. Wright, Local convergence of an algorithm for subspace identification from partial
1497 data, Foundations of Computational Mathematics 14 (2014), 1–36. DOI: 10.1007/s10208-014-9227-
1498 7. ←10
1499 [3] A. Beck and M. Teboulle, A fast iterative shrinkage-threshold algorithm for linear inverse problems,
1500 SIAM Journal on Imaging Sciences 2 (2009), no. 1, 183–202. ←32, 35
1501 [4] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers,
1502 Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–
1503 152. ←12
1504 [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learn-
1505 ing via the alternating direction methods of multipliers, Foundations and Trends in Machine Learning
1506 3 (2011), no. 1, 1–122. ←3
1507 [6] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, 2003. ←3
1508 [7] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine
1509 Learning 8 (2015), no. 3–4, 231–357. ←35, 36
1510 [8] S. Bubeck, Y. T. Lee, and M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent,
1511 Technical Report arXiv:1506.08187, Microsoft Research, 2015. ←35
1512 [9] S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs
1513 via low-rank factorizations, Mathematical Programming, Series B 95 (2003), 329–257. ←7
1514 [10] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computa-
1515 tional Mathematics 9 (2009), 717–772. ←7
1516 [11] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, Journal of the ACM
1517 58.3 (2011), 11. ←9
1518 [12] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained op-
1519 timization. Part I: Motivation, convergence and numerical results, Mathematical Programming, Series
1520 A 127 (2011), 245–295. ←46
1521 [13] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained
1522 optimization. Part II: Worst-case function-and derivative-evaluation complexity, Mathematical Program-
1523 ming, Series A 130 (2011), no. 2, 295–319. ←46
1524 [14] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, Rank-sparsity incoherence for
1525 matrix decomposition, SIAM Journal on Optimization 21 (2011), no. 2, 572–596. ←9
1526 [15] Y. Chen and M. J. Wainwright, Fast low-rank estimation by projected gradent descent: General statistical
1527 and algorithmic guarantees, Technical Report arXiv:1509.03025, University of California-Berkeley,
1528 2015. ←9
1529 [16] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning 20 (1995), 273–297. ←12
1530 [17] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection,
1531 SIAM Journal on Matrix Analysis and Applications 30 (2008), 56–66. ←8
1532 [18] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. Lanckriet, A direct formulation for sparse PCA
1533 using semidefinte programming, SIAM Review 49 (2007), no. 3, 434–448. ←8
1534 [19] T. Dasu and T. Johnson, Exploratory data mining and data cleaning, John Wiley & Sons, 2003. ←4
1535 [20] P. Drineas and M. W. Mahoney, Lectures on randomized numerical linear algebra, The mathematics
1536 of data, 2018. ←6
1537 [21] D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first-order method based on optimal quadratic av-
1538 eraging, Technical Report arXiv:1604.06543, University of Washington, 2016. To appear in SIAM
1539 Journal on Optimization. ←35
1540 [22] J. C. Duchi, Introductory lectures on stochastic optimization, The mathematics of data, 2018. ←3, 16,
1541 23
1542 [23] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point
1543 algorithm for maximal monotone operators, Mathematical Programming 55 (1992), 293–318. ←3
1544 [24] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly
1545 3 (1956), 95–110. ←28
1546 [25] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso,
1547 Biostatistics 9 (2008), no. 3, 432–441. ←8
References 49
1548 [26] A. Griewank, The modification of Newton’s method for unconstrained optimization by bounding cubic
1549 terms, Technical Report NA/12, DAMTP, Cambridge University, 1981. ←44, 46
1550 [27] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of
1551 Research of the National Bureau of Standards 49 (1952), 409–436. ←33
1552 [28] A. J. Hoffman and H. Weilandt, The variation of the spectrum of a normal matrix, Duke Mathematical
1553 Journal 20 (1953), no. 1, 37–39. ←41
1554 [29] J. D Lee, M. Simchowitz, M. I Jordan, and B. Recht, Gradient descent only converges to minimizers,
1555 Conference on learning theory, 2016, pp. 1246–1257. ←27
1556 [30] D. C. Liu and J. Nocedal, On the limited-memory BFGS method for large scale optimization, Mathe-
1557 matical Programming 45 (1989), 503–528. ←3, 34
1558 [31] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM Journal on Scientific and
1559 Statistical Computing 4 (1983), 553–572. ←44
1560 [32] A. S. Nemirovski and D. B. Yudin, Problem complexity and method efficiency in optimization, John
1561 Wiley, 1983. ←39
1562 [33] Y. Nesterov, A method for unconstrained convex problem with the rate of convergence O(1/k2 ), Dok-
1563 lady AN SSSR 269 (1983), 543–547. ←32
1564 [34] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science and Busi-
1565 ness Media, New York, 2004. ←32, 36
1566 [35] Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Math-
1567 ematical Programming, Series A 108 (2006), 177–205. ←44, 46
1568 [36] J. Nocedal and S. J. Wright, Numerical Optimization, Second, Springer, New York, 2006. ←3, 26,
1569 34, 44, 45
1570 [37] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational
1571 Mathematics and Mathematical Physics 4.5 (1964), 1–17. ←32
1572 [38] B. T. Polyak, Introduction to optimization, Optimization Software, 1987. ←32, 33
1573 [39] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions to linear matrix equations via
1574 nuclear norm minimization, SIAM Review 52 (2010), no. 3, 471–501. ←7
1575 [40] R. T. Rockafellar, Convex analysis, Princeton University Press, Princeton, N.J., 1970. ←17
1576 [41] S. Sra, S. Nowozin, and S. J. Wright (eds.), Optimization for machine learning, NIPS Workshop
1577 Series, MIT Press, 2011. ←47
1578 [42] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical
1579 Society B 58 (1996), 267–288. ←6
1580 [43] M. J. Todd, Semidefinite optimization, Acta Numerica 10 (2001), 515–560. ←3
1581 [44] B. Turlach, W. N. Venables, and S. J. Wright, Simultaneous variable selection, Technometrics 47
1582 (2005), no. 3, 349–363. ←9
1583 [45] L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Review 38 (1996), 49–95. ←3
1584 [46] S. J. Wright, Primal-dual interior-point methods, SIAM, Philadelphia, PA, 1997. ←3
1585 [47] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, Series B 151 (2015Decem-
1586 ber), 3–34. ←3
1587 [48] X. Yi, D. Park, Y. Chen, and C. Caramanis, Fast algorithms for robust PCA via gradient descent,
1588 Advances in Neural Information Processing Systems 29, 2016, pp. 4152–4160. ←9