
IAS/Park City Mathematics Series

Volume 00, Pages 000–000


S 1079-5634(XX)0000-0

1 Optimization Algorithms for Data Analysis

2 Stephen J. Wright

3 Contents
4 1 Introduction 2
5 1.1 Omissions 3
6 1.2 Notation 3
7 2 Optimization Formulations of Data Analysis Problems 4
8 2.1 Setup 4
9 2.2 Least Squares 6
10 2.3 Matrix Completion 6
11 2.4 Nonnegative Matrix Factorization 7
12 2.5 Sparse Inverse Covariance Estimation 8
13 2.6 Sparse Principal Components 8
14 2.7 Sparse Plus Low-Rank Matrix Decomposition 9
15 2.8 Subspace Identification 9
16 2.9 Support Vector Machines 10
17 2.10 Logistic Regression 12
18 2.11 Deep Learning 13
19 3 Preliminaries 15
20 3.1 Solutions 16
21 3.2 Convexity and Subgradients 16
22 3.3 Taylor’s Theorem 17
23 3.4 Optimality Conditions for Smooth Functions 19
24 3.5 Proximal Operators and the Moreau Envelope 20
25 3.6 Convergence Rates 21
26 4 Gradient Methods 23
27 4.1 Steepest Descent 23
28 4.2 General Case 24
29 4.3 Convex Case 24
30 4.4 Strongly Convex Case 25
31 4.5 General Case: Line-Search Methods 26
32 4.6 Conditional Gradient Method 27

33 5 Prox-Gradient Methods 29
34 6 Accelerating Gradient Methods 32
35 6.1 Heavy-Ball Method 32
36 6.2 Conjugate Gradient 33
37 6.3 Nesterov’s Accelerated Gradient: Weakly Convex Case 34
38 6.4 Nesterov’s Accelerated Gradient: Strongly Convex Case 36
39 6.5 Lower Bounds on Rates 39
40 7 Newton Methods 40
41 7.1 Basic Newton’s Method 40
42 7.2 Newton’s Method for Convex Functions 42
43 7.3 Newton Methods for Nonconvex Functions 43
44 7.4 A Cubic Regularization Approach 45
45 8 Conclusions 47

46 1. Introduction
47 In this article, we consider algorithms for solving smooth optimization prob-
48 lems, possibly with simple constraints or structured nonsmooth regularizers. One
49 such canonical formulation is
50 (1.0.1) min_{x∈Rn} f(x),
51 where f : Rn → R has at least Lipschitz continuous gradients. Additional as-
52 sumptions about f, such as convexity and Lipschitz continuity of the Hessian, are
53 introduced as needed. Another formulation we consider is
54 (1.0.2) min_{x∈Rn} f(x) + λψ(x),
55 where f is as in (1.0.1), ψ : Rn → R is a function that is usually convex and usually
56 nonsmooth, and λ > 0 is a regularization parameter.1 We refer to (1.0.2) as a
57 regularized minimization problem because the presence of the term involving ψ
58 induces certain structural properties on the solution that make it more desirable
59 or plausible in the context of the application. We describe iterative algorithms
60 that generate a sequence {xk }k=0,1,2,... of points that, in the case of convex objective
61 functions, converges to the set of solutions. (Some algorithms also generate other
62 “auxiliary” sequences of iterates.)
63 We are motivated to study problems of the forms (1.0.1) and (1.0.2) by their
64 ubiquity in data analysis applications. Accordingly, Section 2 describes some
65 canonical problems in data analysis and their formulation as optimization prob-
66 lems. After some preliminaries in Section 3, we describe in Section 4 algorithms
67 that take steps based on the gradients ∇f(xk ). Extensions of these methods to
1A set S is said to be convex if for any pair of points z′, z″ ∈ S, we have that αz′ + (1 − α)z″ ∈ S for
all α ∈ [0, 1]. A function φ : Rn → R is convex if φ(αz′ + (1 − α)z″) ≤ αφ(z′) + (1 − α)φ(z″)
for all z′, z″ in the (convex) domain of φ and all α ∈ [0, 1].

68 the case (1.0.2) of regularized objectives are described in Section 5. Section 6 de-
69 scribes accelerated gradient methods, which achieve better worst-case complexity
70 than basic gradient methods, while still only using first-derivative information.
71 We discuss Newton’s method in Section 7, outlining variants that can guarantee
72 convergence to points that approximately satisfy second-order conditions for a
73 local minimizer of a smooth nonconvex function.
74 1.1. Omissions Our approach throughout is to give a concise description of
75 some of the most important algorithmic tools for smooth nonlinear optimization
76 and regularized optimization, along with the basic convergence theory for each.
77 (In any given context, we mean by “smooth” that the function is differentiable as
78 many times as is necessary for the discussion to make sense.) In most cases, the
79 theory is elementary enough to include here in its entirety. In the few remaining
80 cases, we provide citations to works in which complete proofs can be found.
81 Although we allow nonsmoothness in the regularization term in (1.0.2), we do
82 not cover subgradient methods or mirror descent explicitly in this chapter. We
83 also do not discuss stochastic gradient methods, a class of methods that is central
84 to modern machine learning. All these topics are discussed in the contribution of
85 John Duchi to the current volume [22]. Other omissions include the following.
86 • Coordinate descent methods; see [47] for a recent review.
87 • Augmented Lagrangian methods, including alternating direction meth-
88 ods of multipliers (ADMM) [23]. The review [5] remains a good reference
89 for the latter topic, especially as it applies to problems from data analysis.
90 • Semidefinite programming (see [43, 45]) and conic optimization (see [6]).
91 • Methods tailored specifically to linear or quadratic programming, such as
92 the simplex method or interior-point methods (see [46] for a discussion of
93 the latter).
94 • Quasi-Newton methods, which modify Newton’s method by approximat-
95 ing the Hessian or its inverse, thus attaining attractive theoretical and
96 practical performance without using any second-derivative information.
97 For a discussion of these methods, see [36, Chapter 6]. One important
98 method of this class, which is useful in data analysis and many other
99 large-scale problems, is the limited-memory method L-BFGS [30]; see also
100 [36, Section 7.2].
101 1.2. Notation Our notational conventions in this chapter are as follows. We
102 use upper-case Roman characters (A, L, R, and so on) for matrices and lower-
103 case Roman (x, v, u, and so on) for vectors. (Vectors are assumed to be column
104 vectors.) Transposes are indicated by a superscript “T .” Elements of matrices and
105 vectors are indicated by subscripts, for example, Aij and xj . Iteration numbers are
106 indicated by superscripts, for example, xk . We denote the set of real numbers by
107 R, so that Rn denotes the Euclidean space of dimension n. The set of symmetric
108 real n × n matrices is denoted by SRn×n . Real scalars are usually denoted by

109 Greek characters, for example, α, β, and so on, though in deference to convention,
110 we sometimes use Roman capitals (for example, L for the Lipschitz constant of
111 a gradient). Where vector norms appear, the type of norm in use is indicated
112 by a subscript (for example ‖x‖1 ), except that when no subscript appears, the
113 Euclidean norm ‖ · ‖2 is assumed. Matrix norms are defined where first used.

114 2. Optimization Formulations of Data Analysis Problems


115 In this section, we describe briefly some representative problems in data anal-
116 ysis and machine learning, emphasizing their formulation as optimization prob-
117 lems. Our list is by no means exhaustive. In many cases, there are a number of
118 different ways to formulate a given application as an optimization problem. We
119 do not try to describe all of them. But our list here gives a flavor of the interface
120 between data analysis and optimization.
121 2.1. Setup Practical data sets are often extremely messy. Data may be misla-
122 beled, noisy, incomplete, or otherwise corrupted. Much of the hard work in data
123 analysis is done by professionals, familiar with the underlying applications, who
124 “clean” the data and prepare it for analysis, while being careful not to change the
125 essential properties that they wish to discern from the analysis. Dasu and John-
126 son [19] point out that “80% of data analysis is spent on the process of cleaning
127 and preparing the data.” We do not discuss this aspect of the process, focusing in-
128 stead on the part of the data analysis pipeline in which the problem is formulated
129 and solved.
130 The data set in a typical analysis problem consists of m objects:
131 (2.1.1) D := {(aj , yj ), j = 1, 2, . . . , m},
132 where aj is a vector (or matrix) of features and yj is an observation or label. (Each
133 pair (aj , yj ) has the same size and shape for all j = 1, 2, . . . , m.) The analysis
134 task then consists of discovering a function φ such that φ(aj ) ≈ yj holds for
135 most j = 1, 2, . . . , m. The process of discovering the mapping φ is often called
136 “learning” or “training.”
137 The function φ is often defined in terms of a vector or matrix of parameters,
138 which we denote by x or X. (Other notation also appears below.) With these
139 parametrizations, the problem of identifying φ becomes a data-fitting problem:
140 “Find the parameters x defining φ such that φ(aj ) ≈ yj , j = 1, 2, . . . , m in some
141 optimal sense.” Once we come up with a definition of the term “optimal,” we
142 have an optimization problem. Many such optimization formulations have objec-
143 tive functions of the “summation” type
144 (2.1.2) LD (x) := ∑_{j=1}^m ℓ(aj , yj ; x),

145 where the jth term ℓ(aj , yj ; x) is a measure of the mismatch between φ(aj ) and
146 yj , and x is the vector of parameters that determines φ.
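To make the summation structure of (2.1.2) concrete, the following minimal NumPy sketch (the quadratic loss and the data shapes are illustrative assumptions, not prescribed by the text) evaluates an objective of this form:

import numpy as np

def summation_objective(x, A, y, loss):
    # L_D(x) = sum_j loss(a_j, y_j; x), as in (2.1.2)
    return sum(loss(a_j, y_j, x) for a_j, y_j in zip(A, y))

# Example: squared-error loss for a linear model phi(a) = a^T x
squared_loss = lambda a, y, x: 0.5 * (a @ x - y) ** 2

A = np.random.randn(100, 5)   # m = 100 feature vectors a_j, each in R^5
y = np.random.randn(100)      # m observations y_j
x = np.zeros(5)               # parameter vector defining phi
print(summation_objective(x, A, y, squared_loss))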

147 One use of φ is to make predictions about future data items. Given another
148 previously unseen item of data â of the same type as aj , j = 1, 2, . . . , m, we
149 predict that the label ŷ associated with â would be φ(â). The mapping may also
150 expose other structure and properties in the data set. For example, it may reveal
151 that only a small fraction of the features in aj are needed to reliably predict the
152 label yj . (This is known as feature selection.) The function φ or its parameter x
153 may also reveal important structure in the data. For example, X could reveal a
154 low-dimensional subspace that contains most of the aj , or X could reveal a matrix
155 with particular structure (low-rank, sparse) such that observations of X prompted
156 by the feature vectors aj yield results close to yj .
157 Examples of labels yj include the following.
158 • A real number, leading to a regression problem.
159 • A label, say yj ∈ {1, 2, . . . , M} indicating that aj belongs to one of M
160 classes. This is a classification problem. We have M = 2 for binary classifi-
161 cation and M > 2 for multiclass classification.
162 • Null. Some problems only have feature vectors aj and no labels. In this
163 case, the data analysis task may consist of grouping the aj into clusters
164 (where the vectors within each cluster are deemed to be functionally sim-
165 ilar), or identifying a low-dimensional subspace (or a collection of low-
166 dimensional subspaces) that approximately contains the aj . Such prob-
167 lems require the labels yj to be learned, alongside the function φ. For
168 example, in a clustering problem, yj could represent the cluster to which
169 aj is assigned.
170 Even after cleaning and preparation, the setup above may contain many com-
171 plications that need to be dealt with in formulating the problem in rigorous math-
172 ematical terms. The quantities (aj , yj ) may contain noise, or may be otherwise
173 corrupted. We would like the mapping φ to be robust to such errors. There may
174 be missing data: parts of the vectors aj may be missing, or we may not know all
175 the labels yj . The data may be arriving in streaming fashion rather than being
176 available all at once. In this case, we would learn φ in an online fashion.
177 One particular consideration is that we wish to avoid overfitting the model to
178 the data set D in (2.1.1). The particular data set D available to us can often be
179 thought of as a finite sample drawn from some underlying larger (often infinite)
180 collection of data, and we wish the function φ to perform well on the unobserved
181 data points as well as the observed subset D. In other words, we want φ to
182 be not too sensitive to the particular sample D that is used to define empirical
183 objective functions such as (2.1.2). The optimization formulation can be modified
184 in various ways to achieve this goal, by the inclusion of constraints or penalty
185 terms that limit some measure of “complexity” of the function (such techniques
186 are called generalization or regularization). Another approach is to terminate the
187 optimization algorithm early, the rationale being that overfitting occurs mainly in
188 the later stages of the optimization process.

189 2.2. Least Squares Probably the oldest and best-known data analysis problem is
190 linear least squares. Here, the data points (aj , yj ) lie in Rn × R, and we solve
191 (2.2.1) min_x (1/(2m)) ∑_{j=1}^m (aj^T x − yj)^2 = (1/(2m)) ‖Ax − y‖_2^2 ,

192 where A is the matrix whose rows are aTj , j = 1, 2, . . . , m and y = (y1 , y2 , . . . , ym )T .
193 In the terminology above, the function φ is defined by φ(a) := aT x. (We could
194 also introduce a nonzero intercept by adding an extra parameter β ∈ R and
195 defining φ(a) := aT x + β.) This formulation can be motivated statistically, as a
196 maximum-likelihood estimate of x when the observations yj are exact but for
197 i.i.d. Gaussian noise. Randomized linear algebra methods for large-scale in-
198 stances of this problem are discussed in Section 5 of the lectures of Drineas and
199 Mahoney [20] in this volume.
200 Various modifications of (2.2.1) impose desirable structure on x and hence on
201 φ. For example, Tikhonov regularization with a squared ℓ2 -norm, which is
202 min_x (1/(2m)) ‖Ax − y‖_2^2 + λ‖x‖_2^2 , for some parameter λ > 0,
203 yields a solution x with less sensitivity to perturbations in the data (aj , yj ). The
204 LASSO formulation
205 (2.2.2) min_x (1/(2m)) ‖Ax − y‖_2^2 + λ‖x‖_1
206 tends to yield solutions x that are sparse, that is, containing relatively few nonzero
207 components [42]. This formulation performs feature selection: The locations of
208 the nonzero components in x reveal those components of aj that are instrumental
209 in determining the observation yj . Besides its statistical appeal — predictors that
210 depend on few features are potentially simpler and more comprehensible than
211 those depending on many features — feature selection has practical appeal in
212 making predictions about future data. Rather than gathering all components of a
213 new data vector â, we need to find only the “selected” features, since only these
214 are needed to make a prediction. The LASSO formulation (2.2.2) is an important
215 prototype for many problems in data analysis, in that it involves a regularization
216 term λkxk1 that is nonsmooth and convex, but with relatively simple structure
217 that can potentially be exploited by algorithms.
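As a concrete illustration, here is a minimal NumPy sketch (data, dimensions, and the value of λ are arbitrary placeholders, not from the text) that evaluates the least-squares, Tikhonov, and LASSO objectives (2.2.1)–(2.2.2):

import numpy as np

def least_squares(x, A, y):
    # (1/(2m)) * ||Ax - y||_2^2, as in (2.2.1)
    m = A.shape[0]
    return np.sum((A @ x - y) ** 2) / (2 * m)

def tikhonov(x, A, y, lam):
    # squared-l2 (Tikhonov) regularization
    return least_squares(x, A, y) + lam * np.sum(x ** 2)

def lasso(x, A, y, lam):
    # l1 (LASSO) regularization, as in (2.2.2)
    return least_squares(x, A, y) + lam * np.sum(np.abs(x))

A = np.random.randn(200, 10)
y = np.random.randn(200)
x = np.random.randn(10)
print(least_squares(x, A, y), tikhonov(x, A, y, 0.1), lasso(x, A, y, 0.1))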
218 2.3. Matrix Completion Matrix completion is in one sense a natural extension
219 of least-squares to problems in which the data aj are naturally represented as
220 matrices rather than vectors. Changing notation slightly, we suppose that each
221 Aj is an n × p matrix, and we seek another n × p matrix X that solves
222 (2.3.1) min_X (1/(2m)) ∑_{j=1}^m (⟨Aj , X⟩ − yj)^2 ,

223 where ⟨A, B⟩ := trace(A^T B). Here we can think of the Aj as “probing” the un-
224 known matrix X. Commonly considered types of observations are random linear

225 combinations (where the elements of Aj are selected i.i.d. from some distribution)
226 or single-element observations (in which each Aj has 1 in a single location and
227 zeros elsewhere). A regularized version of (2.3.1), leading to solutions X that are
228 low-rank, is
229 (2.3.2) min_X (1/(2m)) ∑_{j=1}^m (⟨Aj , X⟩ − yj)^2 + λ‖X‖_* ,

230 where kXk∗ is the nuclear norm, which is the sum of singular values of X [39].
231 The nuclear norm plays a role analogous to the `1 norm in (2.2.2). Although the
232 nuclear norm is a somewhat complex nonsmooth function, it is at least convex, so
233 that the formulation (2.3.2) is also convex. This formulation can be shown to yield
234 a statistically valid solution when the true X is low-rank and the observation ma-
235 trices Aj satisfy a “restricted isometry” property, commonly satisfied by random
236 matrices, but not by matrices with just one nonzero element. The formulation is
237 also valid in a different context, in which the true X is incoherent (roughly speak-
238 ing, it does not have a few elements that are much larger than the others), and
239 the observations Aj are of single elements [10].
240 In another form of regularization, the matrix X is represented explicitly as a
241 product of two “thin” matrices L and R, where L ∈ Rn×r and R ∈ Rp×r , with
242 r ≪ min(n, p). We set X = LR^T in (2.3.1) and solve
243 (2.3.3) min_{L,R} (1/(2m)) ∑_{j=1}^m (⟨Aj , LR^T⟩ − yj)^2 .

244 In this formulation, the rank r is “hard-wired” into the definition of X, so there is
245 no need to include a regularizing term. This formulation is also typically much
246 more compact than (2.3.2); the total number of elements in (L, R) is (n + p)r,
247 which is much less than np. A disadvantage is that it is nonconvex. An active
248 line of current research, pioneered in [9] and also drawing on statistical sources,
249 shows that the nonconvexity is benign in many situations, and that under certain
250 assumptions on the data (Aj , yj ), j = 1, 2, . . . , m and careful choice of algorithmic
251 strategy, good solutions can be obtained from the formulation (2.3.3). A clue to
252 this good behavior is that although this formulation is nonconvex, it is in some
253 sense an approximation to a tractable problem: If we have a complete observation
254 of X, then a rank-r approximation can be found by performing a singular value
255 decomposition of X, and defining L and R in terms of the r leading left and right
256 singular vectors.
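That last observation is easy to make concrete. The following minimal NumPy sketch (sizes are arbitrary) forms L and R from the r leading singular vectors of a fully observed matrix, so that LR^T recovers a rank-r X exactly:

import numpy as np

n, p, r = 50, 40, 5
X = np.random.randn(n, r) @ np.random.randn(r, p)   # a rank-r matrix, fully observed

U, s, Vt = np.linalg.svd(X, full_matrices=False)
L = U[:, :r] * np.sqrt(s[:r])        # n x r: scaled leading left singular vectors
R = Vt[:r, :].T * np.sqrt(s[:r])     # p x r: scaled leading right singular vectors

print(np.linalg.norm(L @ R.T - X))   # essentially zero, since X has rank r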
257 2.4. Nonnegative Matrix Factorization Some applications in computer vision,
258 chemometrics, and document clustering require us to find factors L and R like
259 those in (2.3.3) in which all elements are nonnegative. If the full matrix Y ∈ Rn×p
260 is observed, this problem has the form
261 min_{L,R} ‖LR^T − Y‖_F^2 , subject to L ≥ 0, R ≥ 0.
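One simple way to attack this nonconvex problem, shown here only as an illustrative sketch (the algorithm, step length, and iteration count below are assumptions of this example, not recommendations from the text), is projected gradient descent on (L, R), clipping negative entries after each step:

import numpy as np

def nmf_projected_gradient(Y, r, alpha=1e-3, iters=500):
    n, p = Y.shape
    L = np.abs(np.random.randn(n, r))
    R = np.abs(np.random.randn(p, r))
    for _ in range(iters):
        E = L @ R.T - Y          # residual; grad of (1/2)||LR^T - Y||_F^2 is (E @ R, E^T @ L)
        gL, gR = E @ R, E.T @ L
        L = np.maximum(L - alpha * gL, 0.0)   # gradient step, then projection onto L >= 0
        R = np.maximum(R - alpha * gR, 0.0)
    return L, R

Y = np.abs(np.random.randn(30, 20))
L, R = nmf_projected_gradient(Y, r=4)
print(np.linalg.norm(L @ R.T - Y))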

262 2.5. Sparse Inverse Covariance Estimation In this problem, the labels yj are
263 null, and the vectors aj ∈ Rn are viewed as independent observations of a ran-
264 dom vector A ∈ Rn , which has zero mean. The sample covariance matrix con-
265 structed from these observations is
266 S = (1/(m − 1)) ∑_{j=1}^m aj aj^T .

267 The element Sil is an estimate of the covariance between the ith and lth elements
268 of the random variable vector A. Our interest is in calculating an estimate X of
269 the inverse covariance matrix that is sparse. The structure of X yields important
270 information about A. In particular, if Xil = 0, we can conclude that the i and
271 l components of A are conditionally independent. (That is, they are independent
272 given knowledge of the values of the other n − 2 components of A.) Stated an-
273 other way, the nonzero locations in X indicate the arcs in the dependency graph
274 whose nodes correspond to the n components of A.
275 One optimization formulation that has been proposed for estimating the in-
276 verse sparse covariance matrix X is the following:
277 (2.5.1) min_{X∈SRn×n, X≻0} ⟨S, X⟩ − log det(X) + λ‖X‖_1 ,
278 where SRn×n is the set of n × n symmetric matrices, X ≻ 0 indicates that X is
279 positive definite, and ‖X‖_1 := ∑_{i,l=1}^n |Xil | (see [17, 25]).
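The following minimal NumPy sketch (random data and an arbitrary λ; it evaluates the objective only and is not a solver) computes the objective of (2.5.1) at a given positive definite X:

import numpy as np

def sparse_inv_cov_objective(X, S, lam):
    # <S, X> - log det(X) + lam * ||X||_1, as in (2.5.1); X must be positive definite
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return np.trace(S @ X) - logdet + lam * np.sum(np.abs(X))

m, n = 500, 8
Avecs = np.random.randn(m, n)             # rows are the observations a_j (zero mean)
S = (Avecs.T @ Avecs) / (m - 1)           # sample covariance matrix
X = np.linalg.inv(S + 0.1 * np.eye(n))    # a crude positive definite candidate
print(sparse_inv_cov_objective(X, S, lam=0.05))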

280 2.6. Sparse Principal Components The setup for this problem is similar to the
281 previous section, in that we have a sample covariance matrix S that is estimated
282 from a number of observations of some underlying random vector. The princi-
283 pal components of this matrix are the eigenvectors corresponding to the largest
284 eigenvalues. It is often of interest to find sparse principal components, approxi-
285 mations to the leading eigenvectors that also contain few nonzeros. An explicit
286 optimization formulation of this problem is
287 (2.6.1) max_{v∈Rn} v^T S v s.t. ‖v‖_2 = 1, ‖v‖_0 ≤ k,
288 where k · k0 indicates the cardinality of v (that is, the number of nonzeros in v)
289 and k is a user-defined parameter indicating a bound on the cardinality of v. The
290 problem (2.6.1) is NP-hard, so exact formulations (for example, as a quadratic
291 program with binary variables) are intractable. We consider instead a relaxation,
292 due to [18], which replaces vvT by a positive semidefinite proxy M ∈ SRn×n :
293 (2.6.2) max_{M∈SRn×n} ⟨S, M⟩ s.t. M ⪰ 0, ⟨I, M⟩ = 1, ‖M‖_1 ≤ ρ,
294 for some parameter ρ > 0 that can be adjusted to attain the desired sparsity. This
295 formulation is a convex optimization problem, in fact, a semidefinite program-
296 ming problem.
297 This formulation can be generalized to find the leading r > 1 sparse principal
298 components. Ideally, we would obtain these from a matrix V ∈ Rn×r whose

299 columns are mutually orthogonal and have at most k nonzeros each. We can
300 write a convex relaxation of this problem, once again a semidefinite program, as
301 (2.6.3) max_{M∈SRn×n} ⟨S, M⟩ s.t. 0 ⪯ M ⪯ I, ⟨I, M⟩ = 1, ‖M‖_1 ≤ ρ .
302 A more compact (but nonconvex) formulation is
303 max_{F∈Rn×r} ⟨S, FF^T⟩ s.t. ‖F‖_2 ≤ 1, ‖F‖_{2,1} ≤ R̄,
304 where ‖F‖_{2,1} := ∑_{i=1}^n ‖Fi·‖_2 [15]. The latter regularization term is often called
305 a “group-sparse” or “group-LASSO” regularizer. (An early use of this type of
306 regularizer was described in [44].)
307 2.7. Sparse Plus Low-Rank Matrix Decomposition Another useful paradigm
308 is to decompose a partly or fully observed n × p matrix Y into the sum of a
309 sparse matrix and a low-rank matrix. A convex formulation of the fully-observed
310 problem is
311 min_{M,S} ‖M‖_* + λ‖S‖_1 s.t. Y = M + S,
where ‖S‖_1 := ∑_{i=1}^n ∑_{j=1}^p |Sij | [11, 14]. Compact, nonconvex formulations that
allow noise in the observations include the following:
min_{L,R,S} (1/2) ‖LR^T + S − Y‖_F^2 (fully observed)
min_{L,R,S} (1/2) ‖PΦ (LR^T + S − Y)‖_F^2 (partially observed),
312 where Φ represents the locations of the observed entries of Y and PΦ is projection
313 onto this set [15, 48].
314 One application of these formulations is to robust PCA, where the low-rank
315 part represents principal components and the sparse part represents “outlier”
316 observations. Another application is to foreground-background separation in
317 video processing. Here, each column of Y represents the pixels in one frame of
318 video, whereas each row of Y shows the evolution of one pixel over time.
319 2.8. Subspace Identification In this application, the aj ∈ Rn , j = 1, 2, . . . , m are
320 vectors that lie (approximately) in a low-dimensional subspace. The aim is to
321 identify this subspace, expressed as the column subspace of a matrix X ∈ Rn×r .
322 If the aj are fully observed, an obvious way to solve this problem is to perform
323 a singular value decomposition of the n × m matrix A = [aj]_{j=1}^m , and take X to
324 be the leading r left singular vectors. In interesting variants of this problem,
325 however, the vectors aj may be arriving in streaming fashion and may be only
326 partly observed, for example in indices Φj ⊂ {1, 2, . . . , n}. We would thus need to
327 identify a matrix X and vectors sj ∈ Rr such that
328 PΦj (aj − Xsj ) ≈ 0, j = 1, 2, . . . , m.
329 The algorithm for identifying X, described in [1], is a manifold-projection scheme
330 that takes steps in incremental fashion for each aj in turn. Its validity relies on

331 incoherence of the matrix X with respect to the principal axes, that is, the matrix
332 X should not have a few elements that are much larger than the others. A local
333 convergence analysis of this method is given in [2].
2.9. Support Vector Machines Classification via support vector machines (SVM)
is a classical paradigm in machine learning. This problem takes as input data
(aj , yj ) with aj ∈ Rn and yj ∈ {−1, 1}, and seeks a vector x ∈ Rn and a scalar
β ∈ R such that

(2.9.1a) aj^T x − β ≥ 1 when yj = +1;
(2.9.1b) aj^T x − β ≤ −1 when yj = −1.
334 Any pair (x, β) that satisfies these conditions defines a separating hyperplane in
335 Rn , that separates the “positive” cases {aj | yj = +1} from the “negative” cases
336 {aj | yj = −1}. (In the language of Section 2.1, we could define the function
337 φ as φ(aj ) = sign(aTj x − β).) Among all separating hyperplanes, the one that
338 minimizes kxk2 is the one that maximizes the margin between the two classes,
339 that is, the hyperplane whose distance to the nearest point aj of either class is
340 greatest.
341 We can formulate the problem of finding a separating hyperplane as an opti-
342 mization problem by defining an objective with the summation form (2.1.2):
343 (2.9.2) H(x, β) = (1/m) ∑_{j=1}^m max(1 − yj (aj^T x − β), 0).

344 Note that the jth term in this summation is zero if the conditions (2.9.1) are
345 satisfied, and positive otherwise. Even if no pair (x, β) exists with H(x, β) = 0,
346 the pair (x, β) that minimizes (2.1.2) will be the one that comes as close as possible
347 to satisfying (2.9.1), in a suitable sense. A term λkxk22 , where λ is a small positive
348 parameter, is often added to (2.9.2), yielding the following regularized version:
349 (2.9.3) H(x, β) = (1/m) ∑_{j=1}^m max(1 − yj (aj^T x − β), 0) + (1/2) λ‖x‖_2^2 .

350 If λ is sufficiently small (but positive), and if separating hyperplanes exist, the
351 pair (x, β) that minimizes (2.9.3) is the maximum-margin separating hyperplane.
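As a concrete illustration, this minimal NumPy sketch (random data and an arbitrary λ; an evaluation only, not a training procedure) computes the regularized hinge-loss objective (2.9.3):

import numpy as np

def svm_objective(x, beta, A, y, lam):
    # H(x, beta) in (2.9.3): averaged hinge loss plus (lam/2)*||x||_2^2
    margins = 1.0 - y * (A @ x - beta)
    return np.mean(np.maximum(margins, 0.0)) + 0.5 * lam * np.dot(x, x)

A = np.random.randn(100, 4)                            # rows are the a_j
y = np.where(np.random.rand(100) < 0.5, -1.0, 1.0)     # labels y_j in {-1, +1}
x, beta = np.zeros(4), 0.0
print(svm_objective(x, beta, A, y, lam=1e-2))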
352 The maximum-margin property is consistent with the goals of generalizability
353 and robustness. For example, if the observed data (aj , yj ) is drawn from an
354 underlying “cloud” of positive and negative cases, the maximum-margin solution
355 usually does a reasonable job of separating other empirical data samples drawn
356 from the same clouds, whereas a hyperplane that passes close by several of the
357 observed data points may not do as well (see Figure 2.9.4).
358 The problem of minimizing (2.9.3) can be written as a convex quadratic pro-
359 gram — having a convex quadratic objective and linear constraints — by intro-
360 ducing variables sj , j = 1, 2, . . . , m to represent the residual terms. Then,

Figure 2.9.4. Linear support vector machine classification, with
one class represented by circles and the other by squares. One
possible choice of separating hyperplane is shown at left. If the
observed data is an empirical sample drawn from a cloud of un-
derlying data points, this plane does not do well in separating
the two clouds (middle). The maximum-margin separating hy-
perplane does better (right).

(2.9.5a) min_{x,β,s} (1/m) 1^T s + (1/2) λ‖x‖_2^2 ,
(2.9.5b) subject to sj ≥ 1 − yj (aj^T x − β), sj ≥ 0, j = 1, 2, . . . , m,
361 where 1 = (1, 1, . . . , 1)T ∈ Rm .
362 Often it is not possible to find a hyperplane that separates the positive and
363 negative cases well enough to be useful as a classifier. One solution is to trans-
364 form all of the raw data vectors aj by a mapping ζ into a higher-dimensional
365 Euclidean space, then perform the support-vector-machine classification on the
366 vectors ζ(aj ), j = 1, 2, . . . , m.
The conditions (2.9.1) would thus be replaced by

(2.9.6a) ζ(aj )^T x − β ≥ 1 when yj = +1;
(2.9.6b) ζ(aj )^T x − β ≤ −1 when yj = −1,
367 leading to the following analog of (2.9.3):
368 (2.9.7) H(x, β) = (1/m) ∑_{j=1}^m max(1 − yj (ζ(aj )^T x − β), 0) + (1/2) λ‖x‖_2^2 .

369 When transformed back to Rn , the surface {a | ζ(a)^T x − β = 0} is nonlinear and


370 possibly disconnected, and is often a much more powerful classifier than the
371 hyperplanes resulting from (2.9.3).
372 We can formulate (2.9.7) as a convex quadratic program in exactly the same
373 manner as we derived (2.9.5) from (2.9.3). By taking the dual of this quadratic
374 program, we obtain another convex quadratic program, in m variables:
375 (2.9.8) min_{α∈Rm} (1/2) α^T Qα − 1^T α subject to 0 ≤ α ≤ (1/λ)1, y^T α = 0,
376 where
377 Qkl = yk yl ζ(ak )^T ζ(al ), y = (y1 , y2 , . . . , ym )^T , 1 = (1, 1, . . . , 1)^T ∈ Rm .

378 Interestingly, problem (2.9.8) can be formulated and solved without any explicit
379 knowledge or definition of the mapping ζ. We need only a technique to define the
380 elements of Q. This can be done with the use of a kernel function K : Rn × Rn → R,
381 where K(ak , al ) replaces ζ(ak )T ζ(al ) [4, 16]. This is the so-called “kernel trick.”
382 (The kernel function K can also be used to construct a classification function
383 φ from the solution of (2.9.8).) A particularly popular choice of kernel is the
384 Gaussian kernel:
385 K(ak , al ) := exp(−‖ak − al ‖^2 /(2σ)),
386 where σ is a positive parameter.
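The kernel trick is easy to demonstrate: the sketch below (random data and an arbitrary σ; for illustration only) builds the Gaussian kernel matrix and the matrix Q of (2.9.8) without ever forming the mapping ζ explicitly:

import numpy as np

def gaussian_kernel_matrix(A, sigma):
    # K[k, l] = exp(-||a_k - a_l||^2 / (2*sigma))
    sq_dists = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma))

A = np.random.randn(50, 3)                            # rows are the a_j
y = np.where(np.random.rand(50) < 0.5, -1.0, 1.0)     # labels in {-1, +1}
K = gaussian_kernel_matrix(A, sigma=1.0)
Q = (y[:, None] * y[None, :]) * K                     # Q_kl = y_k y_l K(a_k, a_l), as in (2.9.8)
print(Q.shape)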
387 2.10. Logistic Regression Logistic regression can be viewed as a variant of bi-
388 nary support-vector machine classification, in which rather than the classification
389 function φ giving an unqualified prediction of the class in which a new data vector
390 a lies, it returns an estimate of the odds of a belonging to one class or the other.
391 We seek an “odds function” p parametrized by a vector x ∈ Rn as follows:
392 (2.10.1) p(a; x) := (1 + exp(a^T x))^{−1} ,
and aim to choose the parameter x so that

(2.10.2a) p(aj ; x) ≈ 1 when yj = +1;


(2.10.2b) p(aj ; x) ≈ 0 when yj = −1.
393 (Note the similarity to (2.9.1).) The optimal value of x can be found by maximizing
394 a log-likelihood function:
 
395 (2.10.3) L(x) := (1/m) [ ∑_{j:yj =−1} log(1 − p(aj ; x)) + ∑_{j:yj =1} log p(aj ; x) ] .

396 We can perform feature selection using this model by introducing a regularizer
397 λkxk1 , as follows:
 
398 (2.10.4) max_x (1/m) [ ∑_{j:yj =−1} log(1 − p(aj ; x)) + ∑_{j:yj =1} log p(aj ; x) ] − λ‖x‖_1 ,

399 where λ > 0 is a regularization parameter. (Note that we subtract rather than add
400 the regularization term λkxk1 to the objective, because this problem is formulated
401 as a maximization rather than a minimization.) As we see later, this term has
402 the effect of producing a solution in which few components of x are nonzero,
403 making it possible to evaluate p(a; x) by knowing only those components of a
404 that correspond to the nonzeros in x.
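As a concrete illustration, here is a minimal NumPy sketch (random data, an arbitrary λ, and no attempt at numerical safeguards) that evaluates the odds function (2.10.1) and the regularized log-likelihood (2.10.4):

import numpy as np

def odds(A, x):
    # p(a; x) = 1 / (1 + exp(a^T x)), as in (2.10.1), applied row-wise to A
    return 1.0 / (1.0 + np.exp(A @ x))

def regularized_loglik(x, A, y, lam):
    # (2.10.4): average log-likelihood minus lam * ||x||_1 (to be maximized)
    p = odds(A, x)
    ll = np.where(y == 1, np.log(p), np.log(1.0 - p))
    return np.mean(ll) - lam * np.sum(np.abs(x))

A = np.random.randn(200, 6)
y = np.where(np.random.rand(200) < 0.5, -1, 1)   # labels in {-1, +1}
x = 0.01 * np.random.randn(6)
print(regularized_loglik(x, A, y, lam=0.1))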
405 An important extension of this technique is to multiclass (or multinomial) lo-
406 gistic regression, in which the data vectors aj belong to more than two classes.
407 Such applications are common in modern data analysis. For example, in a speech
408 recognition system, the M classes could each represent a phoneme of speech, one
409 of the potentially thousands of distinct elementary sounds that can be uttered by

410 humans in a few tens of milliseconds. A multinomial logistic regression problem


411 requires a distinct odds function pk for each class k ∈ {1, 2, . . . , M}. These func-
412 tions are parametrized by vectors x[k] ∈ Rn , k = 1, 2, . . . , M, defined as follows:
414 (2.10.5) pk (a; X) := exp(a^T x[k] ) / ∑_{l=1}^M exp(a^T x[l] ), k = 1, 2, . . . , M,
415 where we define X := {x[k] | k = 1, 2, . . . , M}. Note that for all a and for all
416 k = 1, 2, . . . , M, we have pk (a) ∈ (0, 1) and also ∑_{k=1}^M pk (a) = 1.
417 in (2.10.5) is referred to as a “softmax” on the quantities {aT x[l] | l = 1, 2, . . . , M}.
418 If one of these inner products dominates the others, that is, a^T x[k] ≫ a^T x[l] for
419 all l 6= k, the formula (2.10.5) will yield pk (a; X) ≈ 1 and pl (a; X) ≈ 0 for all l 6= k.
420 In the setting of multiclass logistic regression, the labels yj are vectors in RM ,
421 whose elements are defined as follows:

422 (2.10.6) yjk = 1 when aj belongs to class k, and yjk = 0 otherwise.
Similarly to (2.10.2), we seek to define the vectors x[k] so that

(2.10.7a) pk (aj ; X) ≈ 1 when yjk = 1


(2.10.7b) pk (aj ; X) ≈ 0 when yjk = 0.
423 The problem of finding values of x[k] that satisfy these conditions can again be
424 formulated as one of maximizing a log-likelihood:

425 (2.10.8) L(X) := (1/m) ∑_{j=1}^m [ ∑_{ℓ=1}^M yjℓ (x[ℓ]^T aj ) − log ( ∑_{ℓ=1}^M exp(x[ℓ]^T aj ) ) ] .

426 “Group-sparse” regularization terms can be included in this formulation to se-
427 lect a set of features in the vectors aj , common to each class, that distinguish
428 effectively between the classes.
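To make the softmax construction concrete, the following minimal NumPy sketch (random data; one-hot labels as in (2.10.6); dimensions are arbitrary) evaluates the class probabilities (2.10.5) and the log-likelihood (2.10.8):

import numpy as np

def softmax_probs(A, X):
    # p_k(a; X) of (2.10.5); rows of A are the a_j, columns of X are the x_[k]
    Z = A @ X                                  # Z[j, k] = a_j^T x_[k]
    Z = Z - Z.max(axis=1, keepdims=True)       # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_loglik(X, A, Y):
    # L(X) of (2.10.8); Y[j, k] = 1 if a_j belongs to class k, else 0 (cf. (2.10.6))
    Z = A @ X
    return np.mean(np.sum(Y * Z, axis=1) - np.log(np.sum(np.exp(Z), axis=1)))

m, n, M = 100, 5, 3
A = np.random.randn(m, n)
X = 0.1 * np.random.randn(n, M)
Y = np.eye(M)[np.random.randint(M, size=m)]    # one-hot label vectors y_j
print(softmax_probs(A, X).sum(axis=1)[:3])     # each row of probabilities sums to 1
print(multiclass_loglik(X, A, Y))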
429 2.11. Deep Learning Deep neural networks are often designed to perform the
430 same function as multiclass logistic regression, that is, to classify a data vector a
431 into one of M possible classes, where M > 2 is large in some key applications.
432 The difference is that the data vector a undergoes a series of structured transfor-
433 mations before being passed through a multiclass logistic regression classifier of
434 the type described in the previous subsection.
435 The simple neural network shown in Figure 2.11.1 illustrates the basic ideas.
436 In this figure, the data vector aj enters at the bottom of the network, each node in
437 the bottom layer corresponding to one component of aj . The vector then moves
438 upward through the network, undergoing a structured nonlinear transformation
439 as it moves from one layer to the next. A typical form of this transformation,
440 which converts the vector aj^{l−1} at layer l − 1 to input vector aj^l at layer l, is
441 aj^l = σ(W^l aj^{l−1} + g^l ), l = 1, 2, . . . , D,

Figure 2.11.1. A deep neural network, showing connections between adjacent layers (input nodes at the bottom, hidden layers in the middle, output nodes at the top).

442 where W^l is a matrix of dimension |aj^l | × |aj^{l−1} | and g^l is a vector of length |aj^l |, σ
443 is a componentwise nonlinear transformation, and D is the number of hidden layers,


444 defined as the layers situated strictly between the bottom and top layers. Each
445 arc in Figure 2.11.1 represents one of the elements of a transformation matrix W l .
446 We define aj^0 to be the “raw” input vector aj , and let aj^D be the vector formed
447 by the nodes at the topmost hidden layer in Figure 2.11.1. Typical forms of the
448 function σ include the following, acting identically on each component t ∈ R of
449 its input vector:
450 • Logistic function: t → 1/(1 + e−t );
451 • Hinge loss: t → max(t, 0);
452 • Bernoulli: a random function that outputs 1 with probability 1/(1 + e−t )
453 and 0 otherwise.
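The layer-to-layer transformation aj^l = σ(W^l aj^{l−1} + g^l) is easy to sketch. The snippet below (arbitrary layer sizes and random parameters; the hinge function is used as σ) computes aj^D for a single input vector:

import numpy as np

def forward_pass(a0, weights, biases, sigma=lambda t: np.maximum(t, 0.0)):
    # Apply a^l = sigma(W^l a^{l-1} + g^l) for l = 1, ..., D; sigma defaults to the hinge max(t, 0)
    a = a0
    for W, g in zip(weights, biases):
        a = sigma(W @ a + g)
    return a                                            # a^D, the top hidden-layer vector

sizes = [10, 7, 4]                                      # input size and two hidden-layer sizes (D = 2)
weights = [np.random.randn(sizes[l + 1], sizes[l]) for l in range(2)]
biases = [np.random.randn(sizes[l + 1]) for l in range(2)]
a0 = np.random.randn(10)                                # the raw input vector a_j
print(forward_pass(a0, weights, biases).shape)          # (4,)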
454 Each node in the top layer corresponds to a particular class, and the output of
455 each node corresponds to the odds of the input vector belonging to each class. As
456 mentioned, the “softmax” operator is typically used to convert the transformed
457 input vector in the second-top layer (layer D) to a set of odds at the top layer. As-
458 sociated with each input vector aj are labels yjk , defined as in (2.10.6) to indicate
459 which of the M classes that aj belongs to.
460 The parameters in this neural network are the matrix-vector pairs (W l , gl ),
461 l = 1, 2, . . . , D that transform the input vector aj into its form aj^D at the topmost
462 hidden layer, together with the parameters X of the multiclass logistic regression
463 operation that takes place at the very top stage, where X is defined exactly as
464 in the discussion of Section 2.10. We aim to choose all these parameters so that
465 the network does a good job on classifying the training data correctly. Using the
466 notation w for the hidden layer transformations, that is,
467 (2.11.2) w := (W 1 , g1 , W 2 , g2 , . . . , W D , gD ),

468 and defining X := {x[k] | k = 1, 2, . . . , M} as in Section 2.10, we can write the loss
469 function for deep learning as follows:

470 (2.11.3) L(w, X) := (1/m) ∑_{j=1}^m [ ∑_{ℓ=1}^M yjℓ (x[ℓ]^T aj^D (w)) − log ( ∑_{ℓ=1}^M exp(x[ℓ]^T aj^D (w)) ) ] .

471 Note that this is exactly the function (2.10.8) applied to the output of the top
472 hidden layer aj^D (w). We write aj^D (w) to make explicit the dependence of aj^D
473 on the parameters w of (2.11.2), as well as on the input vector aj . (We can view
474 multiclass logistic regression (2.10.8) as a special case of deep learning in which
475 there are no hidden layers, so that D = 0, w is null, and aj^D = aj , j = 1, 2, . . . , m.)
476 Neural networks in use for particular applications (in image recognition and
477 speech recognition, for example, where they have been very successful) include
478 many variants on the basic design above. These include restricted connectivity
479 between layers (that is, enforcing structure on the matrices W l , l = 1, 2, . . . , D),
480 layer arrangements that are more complex than the linear layout illustrated in
481 Figure 2.11.1, with outputs coming from different levels, connections across non-
482 adjacent layers, different componentwise transformations σ at different layers,
483 and so on. Deep neural networks for practical applications are highly engineered
484 objects.
485 The loss function (2.11.3) shares with many other applications the “summation”
486 form (2.1.2), but it has several features that set it apart from the other applications
487 discussed above. First, and possibly most important, it is nonconvex in the param-
488 eters w. There is reason to believe that the “landscape” of L is complex, with the
489 global minimizer being exceedingly difficult to find. Second, the total number
490 of parameters in (w, X) is usually very large. The most popular algorithms for
491 minimizing (2.11.3) are of stochastic gradient type, which like most optimization
492 methods come with no guarantee for finding the minimizer of a nonconvex func-
493 tion. Effective training of deep learning classifiers typically requires a great deal
494 of data and computation power. Huge clusters of powerful computers, often us-
495 ing multicore processors, GPUs, and even specially architected processing units,
496 are devoted to this task. Efficiency also requires many heuristics in the formula-
497 tion and the algorithm (for example, in the choice of regularization functions and
498 in the steplengths for stochastic gradient).

499 3. Preliminaries
500 We discuss here some foundations for the analysis of subsequent sections.
501 These include useful facts about smooth and nonsmooth convex functions, Tay-
502 lor’s theorem and some of its consequences, optimality conditions, and proximal
503 operators.
504 In the discussion of this section, our basic assumption is that f is a mapping
505 from Rn to R ∪ {+∞}, continuous on its effective domain D := {x | f(x) < ∞}.
506 Further assumptions on f are introduced as needed.

507 3.1. Solutions Consider the problem of minimizing f (1.0.1). We have the fol-
508 lowing terminology:
509 • x∗ is a local minimizer of f if there is a neighborhood N of x∗ such that
510 f(x) ≥ f(x∗ ) for all x ∈ N.
511 • x∗ is a global minimizer of f if f(x) ≥ f(x∗ ) for all x ∈ Rn .
512 • x∗ is a strict local minimizer if it is a local minimizer on some neighborhood
513 N and in addition f(x) > f(x∗ ) for all x ∈ N with x ≠ x∗ .
514 • x∗ is an isolated local minimizer if there is a neighborhood N of x∗ such that
515 f(x) ≥ f(x∗ ) for all x ∈ N and in addition, N contains no local minimizers
516 other than x∗ .
517 3.2. Convexity and Subgradients A convex set Ω ⊂ Rn has the property that
518 (3.2.1) x, y ∈ Ω ⇒ (1 − α)x + αy ∈ Ω for all α ∈ [0, 1].
519 We usually deal with closed convex sets in this article. For a convex set Ω ⊂ Rn
520 we define the indicator function IΩ (x) as follows:

521 IΩ (x) = 0 if x ∈ Ω, and IΩ (x) = +∞ otherwise.
522 Indicator functions are useful devices for deriving optimality conditions for con-
523 strained problems, and even for developing algorithms. The constrained opti-
524 mization problem
525 (3.2.2) min f(x)
x∈Ω
526 can be restated equivalently as follows:
527 (3.2.3) min f(x) + IΩ (x).
528 We noted already that a convex function φ : Rn → R ∪ {+∞} has the following
529 defining property:
(3.2.4)
530 φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y), for all x, y ∈ Rn and all α ∈ [0, 1].
531 The concepts of “minimizer” are simpler in the case of convex objective func-
532 tions than in the general case. In particular, the distinction between “local” and
533 “global” minimizers disappears. For f convex in (1.0.1), we have the following.
534 (a) Any local minimizer of (1.0.1) is also a global minimizer.
535 (b) The set of global minimizers of (1.0.1) is a convex set.
536 If there exists a value γ > 0 such that
537 (3.2.5) φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y) − (1/2) γα(1 − α)‖x − y‖_2^2
538 for all x and y in the domain of φ and α ∈ [0, 1], we say that φ is strongly convex
539 with modulus of convexity γ.
540 We summarize some definitions and results about subgradients of convex func-
541 tions here. For a more extensive discussion, see [22].

542 Definition 3.2.6. A vector v ∈ Rn is a subgradient of f at a point x if


543 f(x + d) ≥ f(x) + v^T d for all d ∈ Rn .
544 The subdifferential, denoted ∂f(x), is the set of all subgradients of f at x.

545 Subdifferentials satisfy a monotonicity property, as we show now.

546 Lemma 3.2.7. If a ∈ ∂f(x) and b ∈ ∂f(y), we have (a − b)^T (x − y) ≥ 0.

547 Proof. From the convexity of f and the definitions of a and b, we deduce that
548 f(y) ≥ f(x) + a^T (y − x) and f(x) ≥ f(y) + b^T (x − y). The result follows by adding
549 these two inequalities. 
550 We can easily characterize a minimum in terms of the subdifferential.

551 Theorem 3.2.8. The point x∗ is the minimizer of a convex function f if and only if
552 0 ∈ ∂f(x∗ ).

553 Proof. Suppose that 0 ∈ ∂f(x∗ ). Then by substituting x = x∗ and v = 0 into
554 Definition 3.2.6, we have that f(x∗ + d) ≥ f(x∗ ) for all d ∈ Rn , which implies that x∗ is a
555 minimizer of f.
556 The converse follows trivially by showing that v = 0 satisfies Definition 3.2.6
557 when x∗ is a minimizer. 
558 The subdifferential is the generalization to nonsmooth convex functions of the
559 concept of derivative of a smooth function.

560 Theorem 3.2.9. If f is convex and differentiable at x, then ∂f(x) = {∇f(x)}.

561 A converse of this result is also true. Specifically, if the subdifferential of a


562 convex function f at x contains a single subgradient, then f is differentiable with
563 gradient equal to this subgradient (see [40, Theorem 25.1]).
564 3.3. Taylor’s Theorem Taylor’s theorem is a foundational result for optimization
565 of smooth nonlinear functions. It shows how smooth functions can be approxi-
566 mated locally by low-order (linear or quadratic) functions.

567 Theorem 3.3.1. Given a continuously differentiable function f : Rn → R, and given


568 x, p ∈ Rn , we have that
(3.3.2) f(x + p) = f(x) + ∫_0^1 ∇f(x + ξp)^T p dξ,
(3.3.3) f(x + p) = f(x) + ∇f(x + ξp)^T p, for some ξ ∈ (0, 1).

569 If f is twice continuously differentiable, we have


(3.3.4) ∇f(x + p) = ∇f(x) + ∫_0^1 ∇^2 f(x + ξp) p dξ,
(3.3.5) f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇^2 f(x + ξp) p, for some ξ ∈ (0, 1).

570 We can derive an important consequence of this theorem when f is Lipschitz


571 continuously differentiable with constant L, that is,
572 (3.3.6) ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y ∈ Rn .
573 We have by setting y = x + p in (3.3.2) and subtracting the term ∇f(x)^T (y − x)
574 from both sides that
575 f(y) − f(x) − ∇f(x)^T (y − x) = ∫_0^1 [∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) dξ.
By using (3.3.6), we have
[∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) ≤ ‖∇f(x + ξ(y − x)) − ∇f(x)‖ ‖y − x‖ ≤ Lξ‖y − x‖^2 .
576 By substituting this bound into the previous integral, we obtain
577 (3.3.7) f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ‖y − x‖^2 .
578 For the remainder of Section 3.3, we assume that f is continuously differ-
579 entiable and also convex. The definition of convexity (3.2.4) and the fact that
580 ∂f(x) = {∇f(x)} implies that
581 (3.3.8) f(y) ≥ f(x) + ∇f(x)^T (y − x), for all x, y ∈ Rn .
582 We defined “strong convexity with modulus γ” in (3.2.5). When f is differentiable,
583 we have the following equivalent definition, obtained by rearranging (3.2.5) and
584 letting α ↓ 0.
585 (3.3.9) f(y) ≥ f(x) + ∇f(x)^T (y − x) + (γ/2) ‖y − x‖^2 .
586 By combining this expression with (3.3.7), we have the following result.

587 Lemma 3.3.10. Given convex f satisfying (3.2.5), with ∇f uniformly Lipschitz continu-
588 ous with constant L, we have for any x, y that
589 (3.3.11) (γ/2) ‖y − x‖^2 ≤ f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ‖y − x‖^2 .
590 For later convenience, we define a condition number κ as follows:
591 (3.3.12) κ := L/γ .
592 When f is twice continuously differentiable, we can characterize the constants γ
593 and L in terms of the eigenvalues of the Hessian ∇^2 f(x). Specifically, we can show
594 that (3.3.11) is equivalent to
595 (3.3.13) γI ⪯ ∇^2 f(x) ⪯ LI, for all x.
596 When f is strictly convex and quadratic, κ defined in (3.3.12) is the condition
597 number of the (constant) Hessian, in the usual sense of linear algebra.
598 Strongly convex functions have unique minimizers, as we now show.

599 Theorem 3.3.14. Let f be differentiable and strongly convex with modulus γ > 0. Then
600 the minimizer x∗ of f exists and is unique.

601 Proof. We show first that for any point x0 , the level set {x | f(x) ≤ f(x0 )} is closed
602 and bounded, and hence compact. Suppose for contradiction that there is a se-
603 quence {xℓ } such that ‖xℓ ‖ → ∞ and
604 (3.3.15) f(xℓ ) ≤ f(x0 ).
605 By strong convexity of f, we have for some γ > 0 that
606 f(xℓ ) ≥ f(x0 ) + ∇f(x0 )^T (xℓ − x0 ) + (γ/2) ‖xℓ − x0 ‖^2 .
607 By rearranging slightly, and using (3.3.15), we obtain
608 (γ/2) ‖xℓ − x0 ‖^2 ≤ −∇f(x0 )^T (xℓ − x0 ) ≤ ‖∇f(x0 )‖ ‖xℓ − x0 ‖.
609 By dividing both sides by (γ/2)‖xℓ − x0 ‖, we obtain ‖xℓ − x0 ‖ ≤ (2/γ)‖∇f(x0 )‖
610 for all ℓ, which contradicts unboundedness of {xℓ }. Thus, the level set is bounded.
611 Since it is also closed (by continuity of f), it is compact.
612 Since f is continuous, it attains its minimum on the compact level set, which is
613 also the solution of minx f(x), and we denote it by x∗ . Suppose for contradiction
614 that the minimizer is not unique, so that we have two points x∗1 and x∗2 that
615 minimize f. Obviously, these points must attain equal objective values, so that
616 f(x∗1 ) = f(x∗2 ) = f∗ for some f∗ . By taking (3.2.5) with φ = f, x = x∗1 ,
617 y = x∗2 , and α = 1/2, we obtain
618 f((x∗1 + x∗2 )/2) ≤ (1/2)(f(x∗1 ) + f(x∗2 )) − (1/8) γ‖x∗1 − x∗2 ‖^2 < f∗ ,
619 so the point (x∗1 + x∗2 )/2 has a smaller function value than both x∗1 and x∗2 , contra-
620 dicting our assumption that x∗1 and x∗2 are both minimizers. Hence, the minimizer
621 x∗ is unique. 
622 3.4. Optimality Conditions for Smooth Functions We consider the case of a
623 smooth (twice continuously differentiable) function f that is not necessarily con-
624 vex. Before designing algorithms to find a minimizer of f, we need to identify
625 properties of f and its derivatives at a point x̄ that tell us whether or not x̄ is a
626 minimizer, of one of the types described in Subsection 3.1. We call such properties
627 optimality conditions.
628 A first-order necessary condition for optimality is that ∇f(x̄) = 0. More precisely,
629 if x̄ is a local minimizer, then ∇f(x̄) = 0. We can prove this by using Taylor’s
630 theorem. Supposing for contradiction that ∇f(x̄) ≠ 0, we can show by setting
631 x = x̄ and p = −α∇f(x̄) for α > 0 in (3.3.3) that f(x̄ − α∇f(x̄)) < f(x̄) for all
632 α > 0 sufficiently small. Thus any neighborhood of x̄ will contain points x with
633 f(x) < f(x̄), so x̄ cannot be a local minimizer.
634 If f is convex, as well as smooth, the condition ∇f(x̄) = 0 is sufficient for x̄ to be
635 a global solution. This claim follows immediately from Theorems 3.2.8 and 3.2.9.
636 A second-order necessary condition for x̄ to be a local solution is that ∇f(x̄) = 0
637 and ∇2 f(x̄) is positive semidefinite. The proof is by an argument similar to that
638 of the first-order necessary condition, but using the second-order Taylor series ex-
639 pansion (3.3.5) instead of (3.3.3). A second-order sufficient condition is that ∇f(x̄) = 0

640 and ∇2 f(x̄) is positive definite. This condition guarantees that x̄ is a strict local
641 minimizer, that is, there is a neighborhood of x̄ such that x̄ has a strictly smaller
642 function value than all other points in this neighborhood. Again, the proof makes
643 use of (3.3.5).
644 We call x̄ a stationary point for smooth f if it satisfies the first-order necessary
645 condition ∇f(x̄) = 0. Stationary points are not necessarily local minimizers. In
646 fact, local maximizers satisfy the same condition. More interestingly, stationary
647 points can be saddle points. These are points for which there exist directions u
648 and v such that f(x̄ + αu) < f(x̄) and f(x̄ + αv) > f(x̄) for all positive α suffi-
649 ciently small. When the Hessian ∇2 f(x̄) has both strictly positive and strictly
650 negative eigenvalues, it follows from (3.3.5) that x̄ is a saddle point. When ∇2 f(x̄)
651 is positive semidefinite or negative semidefinite, second derivatives alone are in-
652 sufficient to classify x̄; higher-order derivative information is needed.
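These conditions are easy to check numerically on toy examples. The sketch below (two simple quadratics chosen for illustration; they are not from the text) verifies stationarity and inspects Hessian eigenvalues to distinguish a saddle point from a strict local minimizer:

import numpy as np

# f(x) = x1^2 - x2^2 has a saddle point at the origin; g(x) = x1^2 + x2^2 has a strict minimizer there.
def grad_f(x):  return np.array([2.0 * x[0], -2.0 * x[1]])
def hess_f(x):  return np.diag([2.0, -2.0])
def grad_g(x):  return np.array([2.0 * x[0], 2.0 * x[1]])
def hess_g(x):  return np.diag([2.0, 2.0])

xbar = np.zeros(2)
for name, grad, hess in [("f", grad_f, hess_f), ("g", grad_g, hess_g)]:
    stationary = np.linalg.norm(grad(xbar)) < 1e-12        # first-order necessary condition
    eigs = np.linalg.eigvalsh(hess(xbar))                   # eigenvalue signs classify the stationary point
    print(name, "stationary:", stationary, "Hessian eigenvalues:", eigs)
# f: eigenvalues of mixed sign, so the origin is a saddle point;
# g: all eigenvalues positive, so the second-order sufficient condition holds.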
653 3.5. Proximal Operators and the Moreau Envelope Here we present some tools
654 for analyzing the convergence of algorithms for the regularized problem
655 (1.0.2), where the objective is the sum of a smooth function and a convex (usually
656 nonsmooth) function.
657 We start with a formal definition.

658 Definition 3.5.1. For a closed proper convex function h and a positive scalar λ,
659 the Moreau envelope is
660 (3.5.2) Mλ,h (x) := inf_u { h(u) + (1/(2λ)) ‖u − x‖^2 } = (1/λ) inf_u { λh(u) + (1/2) ‖u − x‖^2 }.
661 The proximal operator of the function λh is the value of u that achieves the infi-
662 mum in (3.5.2), that is,
663 (3.5.3) proxλh (x) := arg min_u { λh(u) + (1/2) ‖u − x‖^2 }.
664 From optimality properties for (3.5.3) (see Theorem 3.2.8), we have
665 (3.5.4) 0 ∈ λ∂h(proxλh (x)) + (proxλh (x) − x).
666 The Moreau envelope can be viewed as a kind of smoothing or regularization
667 of the function h. It has a finite value for all x, even when h takes on infinite
668 values for some x ∈ Rn . In fact, it is differentiable everywhere, with gradient
669 ∇Mλ,h (x) = (1/λ)(x − proxλh (x)).
670 Moreover, x∗ is a minimizer of h if and only if it is a minimizer of Mλ,h .
671 The proximal operator satisfies a nonexpansiveness property. From the opti-
672 mality conditions (3.5.4) at two points x and y, we have
673 x − proxλh (x) ∈ λ∂h(proxλh (x)), y − proxλh (y) ∈ λ∂h(proxλh (y)).
674 By applying monotonicity (Lemma 3.2.7), we have
675 (1/λ) [(x − proxλh (x)) − (y − proxλh (y))]^T (proxλh (x) − proxλh (y)) ≥ 0,

Rearranging this and applying the Cauchy-Schwarz inequality yields
‖proxλh (x) − proxλh (y)‖^2 ≤ (x − y)^T (proxλh (x) − proxλh (y)) ≤ ‖x − y‖ ‖proxλh (x) − proxλh (y)‖,
676 from which we obtain ‖proxλh (x) − proxλh (y)‖ ≤ ‖x − y‖, as claimed.
677 We list the prox operator for several instances of h that are common in data
678 analysis applications. These definitions are useful in implementing the prox-
679 gradient algorithms of Section 5.
680 • h(x) = 0 for all x, for which we have proxλh (x) = 0. (This observation is
681 useful in proving that the prox-gradient method reduces to the familiar
682 steepest descent method when the objective contains no regularization
683 term.)
684 • h(x) = IΩ (x), the indicator function for a closed convex set Ω. In this
685 case, we have for any λ > 0 that
686 proxλIΩ (x) = arg min_u { λIΩ (u) + (1/2) ‖u − x‖^2 } = arg min_{u∈Ω} (1/2) ‖u − x‖^2 ,
687 which is simply the projection of x onto the set Ω.
688 • h(x) = ‖x‖_1 . By substituting into definition (3.5.3) we see that the mini-
689 mization separates into its n separate components, and that the ith com-
690 ponent of proxλk·k1 (x) is
691 [prox_{λ‖·‖_1} (x)]i = arg min_{ui} { λ|ui | + (1/2)(ui − xi )^2 }.
692 We can thus verify that


 if xi > λ;
xi − λ
693 (3.5.5) [proxλk·k1 (x)]i = 0 if xi ∈ [−λ, λ];


x + λ if xi < −λ,
i

694 an operation that is known as soft-thresholding.


695 • h(x) = ‖x‖_0 , where ‖x‖_0 denotes the cardinality of the vector x, its number
696 of nonzero components. Although this h is not a convex function (as
697 we can see by considering convex combinations of the vectors (0, 1)T and
698 (1, 0)T in R2 ), its proximal operator is well defined, and is known as hard
699 thresholding:
700 [prox_{λ‖·‖_0} (x)]i = xi if |xi | > √(2λ); 0 if |xi | < √(2λ).
701 As in (3.5.5), the definition (3.5.3) separates into n individual components.
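The sketch below (a minimal NumPy illustration with arbitrary inputs) implements three of the prox operators just listed: soft-thresholding (3.5.5), hard thresholding, and projection onto a convex set (here a box):

import numpy as np

def prox_l1(x, lam):
    # Soft-thresholding (3.5.5): the prox operator of lam * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    # Hard thresholding: the prox operator of lam * ||.||_0
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

def prox_indicator_box(x, lo, hi):
    # Prox of the indicator of the box [lo, hi]^n, i.e. projection onto that set
    return np.clip(x, lo, hi)

x = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
print(prox_l1(x, lam=0.5))            # [-1.0, -0.0, 0.0, 0.0, 1.5]
print(prox_l0(x, lam=0.5))            # keeps entries with |x_i| > sqrt(2 * 0.5) = 1
print(prox_indicator_box(x, -1.0, 1.0))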
702 3.6. Convergence Rates An important measure for evaluating algorithms is the
703 rate of convergence to zero of some measure of error. For smooth f, we may be
704 interested in how rapidly the sequence of gradient norms {k∇f(xk )k} converges
705 to zero. For nonsmooth convex f, a measure of interest may be convergence to

706 zero of {dist(0, ∂f(xk ))} (the sequence of distances from 0 to the subdifferential
707 ∂f(xk )). Other error measures for which we may be able to prove convergence
708 rates include kxk − x∗ k (where x∗ is a solution) and f(xk ) − f∗ (where f∗ is the
709 optimal value of the objective function f). For generality, we denote by {φk } the
710 sequence of nonnegative scalars whose rate of convergence to 0 we wish to find.
711 We say that linear convergence holds if there is some σ ∈ (0, 1) such that
712 (3.6.1) φk+1 /φk ≤ 1 − σ, for all k sufficiently large.
713 (This property is sometimes also called geometric or exponential convergence, but
714 the term linear is standard in the optimization literature, so we use it here.) It
715 follows from (3.6.1) that there is some positive constant C such that
716 (3.6.2) φk ≤ C(1 − σ)^k , k = 1, 2, . . . .
717 While (3.6.1) implies (3.6.2), the converse does not hold. The sequence

718 φk = 2^{−k} for k even, φk = 0 for k odd,
719 satisfies (3.6.2) with C = 1 and σ = .5, but does not satisfy (3.6.1). To distinguish
720 between these two slightly different definitions, (3.6.1) is sometimes called Q-
721 linear while (3.6.2) is called R-linear.
Sublinear convergence is, as its name suggests, slower than linear. Several
varieties of sublinear convergence are encountered in optimization algorithms
for data analysis, including the following

(3.6.3a) φk ≤ C/√k, k = 1, 2, . . . ,
(3.6.3b) φk ≤ C/k, k = 1, 2, . . . ,
(3.6.3c) φk ≤ C/k^2 , k = 1, 2, . . . ,
722 where in each case, C is some positive constant.
723 Superlinear convergence occurs when the constant σ ∈ (0, 1) in (3.6.1) can be
724 chosen arbitrarily close to 1. Specifically, we say that the sequence {φk } converges
725 Q-superlinearly to 0 if
726 (3.6.4) lim φk+1 /φk = 0.
k→∞
727 Q-Quadratic convergence occurs when
728 (3.6.5) φk+1 /φk^2 ≤ C, k = 1, 2, . . . ,
729 for some sufficiently large C. We say that the convergence is R-superlinear if
730 there is a Q-superlinearly convergent sequence {νk } that dominates {φk } (that is,
731 0 ≤ φk ≤ νk for all k). R-quadratic convergence is defined similarly. Quadratic
732 and superlinear rates are associated with higher-order methods, such as Newton
733 and quasi-Newton methods.
734 When a convergence rate applies globally, from any reasonable starting point,
735 it can be used to derive a complexity bound for the algorithm, which takes the

736 form of a bound on the number of iterations K required to reduce φk below


737 some specified tolerance ε. For a sequence satisfying the R-linear convergence
738 condition (3.6.2), a sufficient condition for φK ≤ ε is C(1 − σ)^K ≤ ε. By using the
739 estimate log(1 − σ) ≤ −σ for all σ ∈ (0, 1), we have that
740 C(1 − σ)^K ≤ ε ⇔ K log(1 − σ) ≤ log(ε/C) ⇐ K ≥ log(C/ε)/σ.
741 It follows that for linearly convergent algorithms, the number of iterations re-
742 quired to converge to a tolerance ε depends logarithmically on 1/ε and inversely
743 on the rate constant σ. For an algorithm that satisfies the sublinear rate (3.6.3a), a
744 sufficient condition for φK ≤ ε is C/√K ≤ ε, which is equivalent to K ≥ (C/ε)^2 ,
745 so the complexity is O(1/ε^2 ). Similar analyses for (3.6.3b) reveal complexity of
746 O(1/ε), while for (3.6.3c), we have complexity O(1/√ε).
747 For quadratically convergent methods, the complexity is doubly logarithmic
748 in ε (that is, O(log log(1/ε))). Once the algorithm enters a neighborhood of qua-
749 dratic convergence, just a few additional iterations are required for convergence
750 to a solution of high accuracy.
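To make these complexity estimates concrete, the following short Python sketch (an illustration added here, not part of the analysis; the values C = 10, σ = 0.1, ε = 10⁻⁶ are arbitrary choices) evaluates the iteration counts implied by the linear rate and by the sublinear rates (3.6.3a)–(3.6.3c).

```python
import math

C, sigma, eps = 10.0, 0.1, 1e-6   # illustrative constants (assumptions)

# Linear (R-linear) rate: K >= log(C/eps)/sigma iterations suffice.
K_linear = math.ceil(math.log(C / eps) / sigma)

# Sublinear rates: phi_k <= C/sqrt(k), C/k, C/k^2 respectively.
K_sqrt = math.ceil((C / eps) ** 2)        # O(1/eps^2)
K_k    = math.ceil(C / eps)               # O(1/eps)
K_k2   = math.ceil(math.sqrt(C / eps))    # O(1/sqrt(eps))

print(K_linear, K_sqrt, K_k, K_k2)
```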

751 4. Gradient Methods


752 We consider here iterative methods for solving the unconstrained smooth prob-
753 lem (1.0.1) that make use of the gradient ∇f (see also [22], which describes sub-
754 gradient methods for nonsmooth convex functions). We consider mostly methods
755 that generate an iteration sequence {xk } via the formula
756 (4.0.1) xk+1 = xk + αk dk ,
757 where dk is the search direction and αk is a steplength.
758 We consider the steepest descent method, which searches along the negative
759 gradient direction dk = −∇f(xk ), proving convergence results for nonconvex
760 functions, convex functions, and strongly convex functions. In Subsection 4.5, we
761 consider methods that use more general descent directions dk , proving conver-
762 gence of methods that make careful choices of the line search parameter αk at
763 each iteration. In Subsection 4.6, we consider the conditional gradient method for
764 minimization of a smooth function f over a compact set.
765 4.1. Steepest Descent The simplest stepsize protocol is the short-step variant
766 of steepest descent. We assume here that f is differentiable, with gradient ∇f
767 satisfying the Lipschitz continuity condition (3.3.6) with constant L. We choose
768 the search direction dk = −∇f(xk ) in (4.0.1), and set the steplength αk to be the
769 constant 1/L, to obtain the iteration
770 (4.1.1) x^{k+1} = x^k − (1/L) ∇f(x^k), k = 0, 1, 2, . . . .
771 To estimate the amount of decrease in f obtained at each iterate of this method,
772 we use Taylor's theorem. From (3.3.7), we have
773 (4.1.2) f(x + αd) ≤ f(x) + α∇f(x)^T d + (L/2) α² ‖d‖².
774 For x = x^k and d = −∇f(x^k), the value of α that minimizes the expression on the
775 right-hand side is α = 1/L. By substituting these values, we obtain
776 (4.1.3) f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖².
777 This expression is one of the foundational inequalities in the analysis of optimiza-
778 tion methods. Depending on the assumptions about f, we can derive a variety of
779 different convergence rates from this basic inequality.
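As an illustration of the short-step scheme (4.1.1), here is a minimal Python sketch; the gradient map grad_f, the constant L, and the small quadratic test problem are assumptions supplied only for this example, not taken from the text.

```python
import numpy as np

def steepest_descent_short_step(grad_f, x0, L, num_iters=1000):
    """Short-step gradient descent (4.1.1): x^{k+1} = x^k - (1/L) grad f(x^k).

    grad_f : callable returning the gradient at a point (assumed L-Lipschitz).
    L      : Lipschitz constant of the gradient (assumed known).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - (1.0 / L) * grad_f(x)
    return x

# Illustrative use on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()            # gradient Lipschitz constant
x_hat = steepest_descent_short_step(lambda x: A @ x - b, np.zeros(2), L)
```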
780 4.2. General Case We consider first a function f that is Lipschitz continuously
781 differentiable and bounded below, but that need not necessarily be convex. Using
782 (4.1.3) alone, we can prove a sublinear convergence result for the steepest descent
783 method.

784 Theorem 4.2.1. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6),
785 and that f is bounded below by a constant f̄. Then for the steepest descent method with
786 constant steplength αk ≡ 1/L, applied from a starting point x0 , we have for any integer
787 T > 1 that
788 min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √( 2L[f(x⁰) − f(x^T)] / T ) ≤ √( 2L[f(x⁰) − f̄] / T ).
789 Proof. Rearranging (4.1.3) and summing over the first T − 1 iterates, we have
790 (4.2.2) Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ≤ 2L Σ_{k=0}^{T−1} [f(x^k) − f(x^{k+1})] = 2L[f(x⁰) − f(x^T)].
791 (Note the telescoping sum.) Since f is bounded below by f̄, the right-hand side is
792 bounded above by the constant 2L[f(x⁰) − f̄]. We also have that
793 min_{0≤k≤T−1} ‖∇f(x^k)‖ = √( min_{0≤k≤T−1} ‖∇f(x^k)‖² ) ≤ √( (1/T) Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ).
794 The result is obtained by combining this bound with (4.2.2). 
795 This result shows that within the first T − 1 steps of steepest descent, at least
796 one of the iterates has gradient norm less than √(2L[f(x⁰) − f̄]/T), which repre-
797 sents sublinear convergence of type (3.6.3a). It follows too from (4.2.2) that for f
798 bounded below, any accumulation point of the sequence {xk } is stationary.
799 4.3. Convex Case When f is also convex, we have the following stronger result
800 for the steepest descent method.

801 Theorem 4.3.1. Suppose that f is convex and Lipschitz continuously differentiable, sat-
802 isfying (3.3.6), and that (1.0.1) has a solution x∗ . Then the steepest descent method with
803 stepsize αk ≡ 1/L generates a sequence {xk }∞ k=0 that satisfies
804 (4.3.2) f(x^T) − f* ≤ (L/(2T)) ‖x⁰ − x*‖².
Proof. By convexity of f, we have f(x*) ≥ f(x^k) + ∇f(x^k)^T(x* − x^k), so by substi-
tuting into (4.1.3), we obtain for k = 0, 1, 2, . . . that
f(x^{k+1}) ≤ f(x*) + ∇f(x^k)^T (x^k − x*) − (1/(2L)) ‖∇f(x^k)‖²
          = f(x*) + (L/2) ( ‖x^k − x*‖² − ‖x^k − x* − (1/L)∇f(x^k)‖² )
          = f(x*) + (L/2) ( ‖x^k − x*‖² − ‖x^{k+1} − x*‖² ).
By summing over k = 0, 1, 2, . . . , T − 1, and noting the telescoping sum, we have
Σ_{k=0}^{T−1} (f(x^{k+1}) − f*) ≤ (L/2) Σ_{k=0}^{T−1} ( ‖x^k − x*‖² − ‖x^{k+1} − x*‖² )
                               = (L/2) ( ‖x⁰ − x*‖² − ‖x^T − x*‖² )
                               ≤ (L/2) ‖x⁰ − x*‖².
805 Since {f(x^k)} is a nonincreasing sequence, we have, as required,
806 f(x^T) − f(x*) ≤ (1/T) Σ_{k=0}^{T−1} (f(x^{k+1}) − f*) ≤ (L/(2T)) ‖x⁰ − x*‖². □

807 4.4. Strongly Convex Case Recall that the definition (3.3.9) of strong convexity
808 shows that f can be bounded below by a quadratic with Hessian γI. A strongly
809 convex f with L-Lipschitz gradients is also bounded above by a similar quadratic
810 (see (3.3.7)) differing only in the quadratic term, which becomes LI. From this
811 “sandwich” effect, we derive a linear convergence rate for the gradient method,
812 stated formally in the following theorem.
813 Theorem 4.4.1. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6),
814 and strongly convex, satisfying (3.2.5) with modulus of convexity γ. Then f has a unique
815 minimizer x∗ , and the steepest descent method with stepsize αk ≡ 1/L generates a se-
816 quence {xk }∞
k=0 that satisfies
817 f(x^{k+1}) − f(x*) ≤ (1 − γ/L) (f(x^k) − f(x*)), k = 0, 1, 2, . . . .
Proof. Existence of the unique minimizer x∗ follows from Theorem 3.3.14. Min-
imizing both sides of the inequality (3.3.9) with respect to y, we find that the
minimizer on the left side is attained at y = x∗ , while on the right side it is
attained at x − ∇f(x)/γ. Plugging these optimal values into (3.3.9), we obtain
min_y f(y) ≥ min_y [ f(x) + ∇f(x)^T (y − x) + (γ/2) ‖y − x‖² ]
⇒ f(x*) ≥ f(x) − (1/γ) ∇f(x)^T ∇f(x) + (γ/2) ‖(1/γ)∇f(x)‖²
⇒ f(x*) ≥ f(x) − (1/(2γ)) ‖∇f(x)‖².
818 By rearrangement, we obtain


819 (4.4.2) k∇f(x)k2 > 2γ[f(x) − f(x∗ )].
820 By substituting (4.4.2) into our basic inequality (4.1.3), we obtain
821 f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖² ≤ f(x^k) − (γ/L) (f(x^k) − f*).
822 Subtracting f∗ from both sides of this inequality yields the result. 
823 Note that after T steps, we have
824 (4.4.3) f(x^T) − f* ≤ (1 − γ/L)^T (f(x⁰) − f*),
825 which is convergence of type (3.6.2) with constant σ = γ/L.
826 4.5. General Case: Line-Search Methods Returning to the case in which f has
827 Lipschitz continuous gradients but is possibly nonconvex, we consider algorithms
828 that take steps of the form (4.0.1), where dk is a descent direction, that is, it
829 makes a positive inner product with the negative gradient −∇f(xk ), so that
830 ∇f(xk )T dk < 0. This condition ensures that f(xk + αdk ) < f(xk ) for sufficiently
831 small positive values of step length α — we obtain improvement in f by taking
832 small steps along dk . (This claim follows from (3.3.3).) Line-search methods are
833 built around this fundamental observation. By introducing additional conditions
834 on dk and αk , that can be verified in practice with reasonable effort, we can estab-
835 lish a bound on decrease similar to (4.1.3) on each iteration, and thus a conclusion
836 similar to that of Theorem 4.2.1.
837 We assume that dk satisfies the following for some η > 0:
838 (4.5.1) ∇f(x^k)^T d^k ≤ −η ‖∇f(x^k)‖ ‖d^k‖.
For the steplength αk, we assume the following weak Wolfe conditions hold, for
some constants c1 and c2 with 0 < c1 < c2 < 1:
(4.5.2a) f(x^k + α_k d^k) ≤ f(x^k) + c1 α_k ∇f(x^k)^T d^k,
(4.5.2b) ∇f(x^k + α_k d^k)^T d^k ≥ c2 ∇f(x^k)^T d^k.
839 Condition (4.5.2a) is called “sufficient decrease;” it ensures descent at each step of
840 at least a small fraction c1 of the amount promised by the first-order Taylor-series
841 expansion (3.3.3). Condition (4.5.2b) ensures that the directional derivative of f
842 along the search direction dk is significantly less negative at the chosen steplength
843 αk than at α = 0. This condition ensures that the step is “not too short.” It can
844 be shown that it is always possible to find αk that satisfies both conditions (4.5.2)
845 simultaneously.
846 Line-search procedures, which are specialized optimization procedures for
847 minimizing functions of one variable, have been devised to find such values effi-
848 ciently; see [36, Chapter 3] for details.
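As a concrete (and deliberately simple) illustration of how such a steplength can be found, the following Python sketch uses bracketing and bisection to locate an α_k satisfying (4.5.2); it is a sketch under the stated assumptions, not the specialized procedure of [36, Chapter 3].

```python
import numpy as np

def weak_wolfe_step(f, grad, x, d, c1=1e-4, c2=0.9, max_trials=50):
    """Find a steplength alpha satisfying the weak Wolfe conditions (4.5.2)
    by simple bracketing/bisection.  A sketch, not a production line search."""
    g0 = grad(x) @ d            # directional derivative at alpha = 0 (negative)
    f0 = f(x)
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_trials):
        if f(x + alpha * d) > f0 + c1 * alpha * g0:    # (4.5.2a) fails: too long
            hi = alpha
        elif grad(x + alpha * d) @ d < c2 * g0:        # (4.5.2b) fails: too short
            lo = alpha
        else:
            return alpha                               # both conditions hold
        alpha = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * lo
    return alpha
```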
849 For line-search methods of this type, we have the following generalization of
850 Theorem 4.2.1.
851 Theorem 4.5.3. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6),
852 and that f is bounded below by a constant f̄. Consider the method that takes steps of the
853 form (4.0.1), where dk satisfies (4.5.1) for some η > 0 and the conditions (4.5.2) hold at
854 all k, for some constants c1 and c2 with 0 < c1 < c2 < 1. Then for any integer T > 1,
855 we have
856 min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √( L / (η² c1 (1 − c2)) ) · √( (f(x⁰) − f̄) / T ).

857 Proof. By combining the Lipschitz property (3.3.6) with (4.5.2b), we have
858 −(1 − c2) ∇f(x^k)^T d^k ≤ [∇f(x^k + α_k d^k) − ∇f(x^k)]^T d^k ≤ L α_k ‖d^k‖².
859 By comparing the first and last terms in these inequalities, we obtain the following
860 lower bound on αk:
861 α_k ≥ −((1 − c2)/L) · ∇f(x^k)^T d^k / ‖d^k‖².
862 By substituting this bound into (4.5.2a), and using (4.5.1) and the step definition
863 (4.0.1), we obtain
864 (4.5.4) f(x^{k+1}) = f(x^k + α_k d^k) ≤ f(x^k) + c1 α_k ∇f(x^k)^T d^k
                    ≤ f(x^k) − (c1(1 − c2)/L) · (∇f(x^k)^T d^k)² / ‖d^k‖²
                    ≤ f(x^k) − (c1(1 − c2)/L) η² ‖∇f(x^k)‖²,
865 which by rearrangement yields
866 (4.5.5) ‖∇f(x^k)‖² ≤ ( L / (c1(1 − c2) η²) ) ( f(x^k) − f(x^{k+1}) ).
867 The result now follows as in the proof of Theorem 4.2.1. 
868 It follows by taking limits on both sides of (4.5.5) that
869 (4.5.6) lim_{k→∞} ‖∇f(x^k)‖ = 0,

870 and therefore all accumulation points x̄ of the sequence {xk } generated by the
871 algorithm (4.0.1) have ∇f(x̄) = 0. In the case of f convex, this condition guarantees
872 that x̄ is a solution of (1.0.1). When f is nonconvex, x̄ may be a local minimum,
873 but it may also be a saddle point or a local maximum.
874 The paper [29] uses the stable manifold theorem to show that line-search gra-
875 dient methods are highly unlikely to converge to stationary points x̄ at which
876 some eigenvalues of the Hessian ∇2 f(x̄) are negative. Although it is easy to con-
877 struct examples for which such bad behavior occurs, it requires special choices of
878 starting point x⁰. Possibly the most obvious example is where f(x1, x2) = x1² − x2²,
879 starting from x⁰ = (1, 0)^T, where d^k = −∇f(x^k) at each k. For this example, all
880 iterates have x^k_2 = 0 and, under appropriate conditions, converge to the saddle
881 point x̄ = 0. Any starting point with x⁰_2 ≠ 0 cannot converge to 0; in fact, it is easy
882 to see that x^k_2 diverges away from 0.
883 4.6. Conditional Gradient Method The conditional gradient approach, often
884 known as “Frank-Wolfe” after the authors who devised it [24], is a method for
885 convex nonlinear optimization over compact convex sets. This is the problem
886 (4.6.1) min f(x),
x∈Ω
887 (see earlier discussion around (3.2.2)), where Ω is a compact convex set and f
888 is a convex function whose gradient ∇f is Lipschitz continuous in a
889 neighborhood of Ω, with Lipschitz constant L. We assume that Ω has diameter
890 D, that is, kx − yk 6 D for all x, y ∈ Ω.
The conditional gradient method replaces the objective in (4.6.1) at each iter-
ation by a linear Taylor-series approximation around the current iterate xk , and
minimizes this linear objective over the original constraint set Ω. It then takes
a step from xk towards the minimizer of this linearized subproblem. The full
method is as follows:

(4.6.2a) v^k := arg min_{v∈Ω} v^T ∇f(x^k);
(4.6.2b) x^{k+1} := x^k + α_k (v^k − x^k),  α_k := 2/(k + 2).
891 The method has a sublinear convergence rate, as we show below, and indeed
892 requires many iterations in practice to obtain an accurate solution. Despite this
893 feature, it makes sense in many interesting applications, because the subproblems
894 (4.6.2a) can be solved very cheaply in some settings, and because highly accurate
895 solutions are not required in some applications.
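For concreteness, here is a minimal Python sketch of the iteration (4.6.2) for the particular choice Ω = {x : ‖x‖₁ ≤ τ} (an assumption made only for this illustration), for which the linearized subproblem (4.6.2a) is solved in closed form by a signed, scaled coordinate vector.

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, tau, num_iters=500):
    """Conditional gradient (4.6.2) on the l1 ball {x : ||x||_1 <= tau}.
    The linear oracle (4.6.2a) over this set is attained at the vertex
    -tau * sign(g_i) * e_i, where i maximizes |g_i|."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad_f(x)
        i = np.argmax(np.abs(g))
        v = np.zeros_like(x)
        v[i] = -tau * np.sign(g[i])      # minimizer of v^T g over the l1 ball
        alpha = 2.0 / (k + 2.0)          # steplength from (4.6.2b)
        x = x + alpha * (v - x)
    return x
```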
896 We have the following result for sublinear convergence of the conditional gra-
897 dient method.
898 Theorem 4.6.3. Under the conditions above, where L is the Lipschitz constant for ∇f on
899 an open neighborhood of Ω and D is the diameter of Ω, the conditional gradient method
900 (4.6.2) applied to (4.6.1) satisfies
901 (4.6.4) f(x^k) − f(x*) ≤ 2LD²/(k + 2), k = 1, 2, . . . ,
902 where x∗ is any solution of (4.6.1).
903 Proof. Setting x = xk and y = xk+1 = xk + αk (vk − xk ) in (3.3.7), we have
904 (4.6.5) f(x^{k+1}) ≤ f(x^k) + α_k ∇f(x^k)^T (v^k − x^k) + (1/2) α_k² L ‖v^k − x^k‖²
                    ≤ f(x^k) + α_k ∇f(x^k)^T (v^k − x^k) + (1/2) α_k² L D²,
905 where the second inequality comes from the definition of D. For the first-order
906 term, we have since vk solves (4.6.2a) and x∗ is feasible for (4.6.2a) that
907 ∇f(x^k)^T (v^k − x^k) ≤ ∇f(x^k)^T (x* − x^k) ≤ f(x*) − f(x^k),
908 where the second inequality is by convexity of f. By substituting in (4.6.5) and subtracting f(x*) from both sides, we obtain
909 (4.6.6) f(x^{k+1}) − f(x*) ≤ (1 − α_k) [f(x^k) − f(x*)] + (1/2) α_k² L D².
910 We now apply an inductive argument. For k = 0, we have α0 = 1 and
911 f(x¹) − f(x*) ≤ (1/2) L D² < (2/3) L D²,
so that (4.6.4) holds in this case. Supposing that (4.6.4) holds for some value of k,
we aim to show that it holds for k + 1 too. We have
f(x^{k+1}) − f(x*)
  ≤ (1 − 2/(k + 2)) [f(x^k) − f(x*)] + (1/2) (4/(k + 2)²) L D²     from (4.6.6), (4.6.2b)
  ≤ LD² ( 2k/(k + 2)² + 2/(k + 2)² )                               from (4.6.4)
  = 2LD² (k + 1)/(k + 2)²
  = 2LD² ((k + 1)/(k + 2)) (1/(k + 2))
  ≤ 2LD² ((k + 2)/(k + 3)) (1/(k + 2)) = 2LD²/(k + 3),
912 as required. 

913 5. Prox-Gradient Methods


914 We now describe an elementary but powerful approach for solving the regu-
915 larized optimization problem
916 (5.0.1) min_{x∈R^n} φ(x) := f(x) + λψ(x),
917 where f is a smooth convex function, ψ is a convex regularization function (known
918 simply as the “regularizer”), and λ > 0 is a regularization parameter. The tech-
919 nique we describe here is a natural extension of the steepest-descent approach,
920 in that it reduces to the steepest-descent method analyzed in Theorems 4.3.1 and
921 4.4.1 applied to f when the regularization term is not present (λ = 0). It is useful
922 when the regularizer ψ has a simple structure that is easy to account for explicitly,
923 as is true for many regularizers that arise in data analysis, such as the `1 function
924 (ψ(x) = ‖x‖₁) or the indicator function for a simple set Ω (ψ(x) = I_Ω(x)), such
925 as a box Ω = [l1 , u1 ] ⊗ [l2 , u2 ] ⊗ . . . ⊗ [ln , un ]. For such regularizers, the proximal
926 operators can be computed explicitly and efficiently.2
927 Each step of the algorithm is defined as follows:
928 (5.0.2) xk+1 := proxαk λψ (xk − αk ∇f(xk )),
929 for some steplength αk > 0, and the prox operator defined in (3.5.3). By substitut-
930 ing into this definition, we can verify that xk+1 is the solution of an approximation
931 to the objective φ of (5.0.1), namely:
932 (5.0.3) x^{k+1} := arg min_z  ∇f(x^k)^T (z − x^k) + (1/(2α_k)) ‖z − x^k‖² + λψ(z).
2 For the analysis of this section I am indebted to class notes of L. Vandenberghe, from 2013-14.
933 One way to verify this equivalence is to note that the objective in (5.0.3) can be
934 written as
935 (1/α_k) [ (1/2) ‖z − (x^k − α_k ∇f(x^k))‖² + α_k λψ(z) ],
936 (modulo a term involving ‖∇f(x^k)‖² that does not involve z). The subproblem objective
937 in (5.0.3) consists of a linear term ∇f(x^k)^T (z − x^k) (the first-order term in a Taylor-
938 series expansion), a proximality term (1/(2α_k)) ‖z − x^k‖² that becomes more strict as
939 αk ↓ 0, and the regularization term λψ(z) in unaltered form. When λ = 0, we
940 have xk+1 = xk − αk ∇f(xk ), so the iteration (5.0.2) (or (5.0.3)) reduces to the
941 usual steepest-descent approach discussed in Section 4 in this case. It is useful
942 to continue thinking of αk as playing the role of a line-search parameter, though
943 here the line search is expressed implicitly through a proximal term.
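As an illustration, the following Python sketch implements (5.0.2) with αk ≡ 1/L for the ℓ1 regularizer ψ(x) = ‖x‖₁, whose proximal operator is the familiar soft-thresholding map; the least-squares test problem at the end is an assumption added only for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_l1(grad_f, x0, L, lam, num_iters=500):
    """Prox-gradient iteration (5.0.2) with psi = ||.||_1 and alpha_k = 1/L."""
    x = np.asarray(x0, dtype=float)
    alpha = 1.0 / L
    for _ in range(num_iters):
        x = soft_threshold(x - alpha * grad_f(x), alpha * lam)
    return x

# Illustrative use on a least-squares objective f(x) = 0.5 ||Ax - b||^2.
A = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
b = np.array([1.0, 0.0, 2.0])
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f
x_hat = prox_gradient_l1(lambda x: A.T @ (A @ x - b), np.zeros(2), L, lam=0.1)
```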
944 We will demonstrate convergence of the method (5.0.2) at a sublinear rate, for
945 functions f whose gradients satisfy a Lipschitz continuity property with Lipschitz
946 constant L (see (3.3.6)), and for the constant steplength choice αk = 1/L. The proof
947 makes use of a “gradient map” defined by
948 (5.0.4) G_α(x) := (1/α) ( x − prox_{αλψ}(x − α∇f(x)) ).
949 By comparing with (5.0.2), we see that this map defines the step taken at iteration
950 k:
951 (5.0.5) x^{k+1} = x^k − α_k G_{α_k}(x^k)  ⇔  G_{α_k}(x^k) = (1/α_k)(x^k − x^{k+1}).
952 The following technical lemma reveals some useful properties of Gα (x).

953 Lemma 5.0.6. Suppose that in problem (5.0.1), ψ is a closed convex function and that f
954 is convex with Lipschitz continuous gradient on Rn , with Lipschitz constant L. Then for
955 the definition (5.0.4) with α > 0, the following claims are true.
956 (a) Gα (x) ∈ ∇f(x) + λ∂ψ(x − αGα (x)).
957 (b) For any z, and any α ∈ (0, 1/L], we have that
958 φ(x − αG_α(x)) ≤ φ(z) + G_α(x)^T (x − z) − (α/2) ‖G_α(x)‖².
959 Proof. For part (a), we use the optimality property (3.5.4) of the prox operator,
960 and make the following substitutions: x − α∇f(x) for “x”, αλ for “λ”, and ψ for
961 “h” to obtain
962 0 ∈ αλ ∂ψ( prox_{αλψ}(x − α∇f(x)) ) + ( prox_{αλψ}(x − α∇f(x)) − (x − α∇f(x)) ).
963 We make the substitution proxαλψ (x − α∇f(x)) = x − αGα (x), using definition
964 (5.0.4), to obtain
965 0 ∈ αλ∂ψ(x − αGα (x)) − α(Gα (x) − ∇f(x)),
966 and the result follows when we divide by α.
967 For (b), we start with the following consequence of Lipschitz continuity of ∇f,
968 from Lemma 3.3.10:
969 f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖².
970 By setting y = x − αG_α(x), for any α ∈ (0, 1/L], we have
971 (5.0.7) f(x − αG_α(x)) ≤ f(x) − α G_α(x)^T ∇f(x) + (Lα²/2) ‖G_α(x)‖²
                        ≤ f(x) − α G_α(x)^T ∇f(x) + (α/2) ‖G_α(x)‖².
972 (The second inequality uses α ∈ (0, 1/L].) We also have by convexity of f and ψ
973 that for any z and any v ∈ ∂ψ(x − αG_α(x)) the following are true:
974 (5.0.8) f(z) ≥ f(x) + ∇f(x)^T (z − x),
        ψ(z) ≥ ψ(x − αG_α(x)) + v^T (z − (x − αG_α(x))).
975 From part (a), we have that v := (G_α(x) − ∇f(x))/λ ∈ ∂ψ(x − αG_α(x)). Making this choice
976 of v in (5.0.8) and using (5.0.7) we have for any α ∈ (0, 1/L] that
φ(x − αG_α(x))
 = f(x − αG_α(x)) + λψ(x − αG_α(x))
 ≤ f(x) − α G_α(x)^T ∇f(x) + (α/2) ‖G_α(x)‖² + λψ(x − αG_α(x))          (from (5.0.7))
 ≤ f(z) + ∇f(x)^T (x − z) − α G_α(x)^T ∇f(x) + (α/2) ‖G_α(x)‖²
   + λψ(z) + (G_α(x) − ∇f(x))^T (x − αG_α(x) − z)                        (from (5.0.8))
 = f(z) + λψ(z) + G_α(x)^T (x − z) − (α/2) ‖G_α(x)‖²,
977 where the last equality follows from cancellation of several terms in the previous
978 line. Thus (b) is proved. 
979 Theorem 5.0.9. Suppose that in problem (5.0.1), ψ is a closed convex function and that f
980 is convex with Lipschitz continuous gradient on Rn , with Lipschitz constant L. Suppose
981 that (5.0.1) attains a minimizer x∗ (not necessarily unique) with optimal objective value
982 φ∗ . Then if αk = 1/L for all k in (5.0.2), we have
983 φ(x^k) − φ* ≤ L‖x⁰ − x*‖²/(2k), k = 1, 2, . . . .
984 Proof. Since αk = 1/L satisfies the conditions of Lemma 5.0.6, we can use part (b)
985 of this result to show that the sequence {φ(xk )} is decreasing and that the distance
986 to the optimum x∗ also decreases at each iteration. Setting x = z = xk and α = αk
987 in Lemma 5.0.6, and recalling (5.0.5), we have
988 φ(x^{k+1}) = φ(x^k − α_k G_{α_k}(x^k)) ≤ φ(x^k) − (α_k/2) ‖G_{α_k}(x^k)‖²,
989 justifying the first claim. For the second claim, we have by setting x = xk , α = αk ,
990 and z = x∗ in Lemma 5.0.6 that
(5.0.10) 0 ≤ φ(x^{k+1}) − φ* = φ(x^k − α_k G_{α_k}(x^k)) − φ*
          ≤ G_{α_k}(x^k)^T (x^k − x*) − (α_k/2) ‖G_{α_k}(x^k)‖²
          = (1/(2α_k)) ( ‖x^k − x*‖² − ‖x^k − x* − α_k G_{α_k}(x^k)‖² )
          = (1/(2α_k)) ( ‖x^k − x*‖² − ‖x^{k+1} − x*‖² ),
992 from which ‖x^{k+1} − x*‖ ≤ ‖x^k − x*‖ follows.
993 By setting αk = 1/L in (5.0.10), and summing over k = 0, 1, 2, . . . , K − 1, we
994 obtain from a telescoping sum on the right-hand side that
995 Σ_{k=0}^{K−1} (φ(x^{k+1}) − φ*) ≤ (L/2) ( ‖x⁰ − x*‖² − ‖x^K − x*‖² ) ≤ (L/2) ‖x⁰ − x*‖².
996 By monotonicity of {φ(x^k)}, we have
997 K (φ(x^K) − φ*) ≤ Σ_{k=0}^{K−1} (φ(x^{k+1}) − φ*).
998 The result follows immediately by combining these last two expressions. 

999 6. Accelerating Gradient Methods


1000 We showed in Section 4 that the basic steepest descent method for solving
1001 (1.0.1) for smooth f converges sublinearly at a 1/k rate when f is convex, and
1002 linearly at a rate of (1 − γ/L) when f is strongly convex, satisfying (3.3.13) for
1003 positive γ and L. We show in this section that by using the gradient information
1004 in a more clever way, faster convergence rates can be attained.
1005 The key idea is momentum. In iteration k of a momentum method, we tend to
1006 continue moving along the previous search direction at each iteration, making a
1007 small adjustment toward the negative gradient −∇f evaluated at xk or a nearby
1008 point. (Steepest descent simply uses −∇f(xk ) as the search direction.) Although
1009 not obvious at first, there is some intuition behind the momentum idea. The step
1010 taken at the previous iterate xk−1 was based on negative gradient information
1011 at that iteration, along with the search direction from the iteration prior to that
1012 one, namely, xk−2 . By continuing this line of reasoning backwards, we see that
1013 the previous step is a linear combination of all the gradient information that we
1014 have encountered at all iterates so far, going back to the initial iterate x0 . If this
1015 information is aggregated properly, it can produce a richer overall picture of the
1016 function than the latest negative gradient alone, and thus has the potential to
1017 yield better convergence.
1018 Sure enough, several intricate methods that use the momentum idea have been
1019 proposed, and have been widely successful. These methods are often called accel-
1020 erated gradient methods. A major contributor in this area is Yuri Nesterov, dating to
1021 his seminal contribution in 1983 [33] and explicated further in his book [34] and
1022 other publications. Another key contribution is [3], which derived an accelerated
1023 method for the regularized case (1.0.2).
1024 6.1. Heavy-Ball Method Possibly the most elementary method of momentum
1025 type is the heavy-ball method of Polyak [37]; see also [38]. Each iteration of this
1026 method has the form
1027 (6.1.1) xk+1 = xk − αk ∇f(xk ) + βk (xk − xk−1 ),
1028 where αk and βk are positive scalars. That is, a momentum term βk (xk − xk−1 )
1029 is added to the usual steepest descent update. Although this method can be ap-
1030 plied to any smooth convex f (and even to nonconvex functions), the convergence
1031 analysis is most straightforward for the special case of strongly convex quadratic
1032 functions (see [38]). (This analysis also suggests appropriate values for the step
1033 lengths αk and βk .) Consider the function
1034 (6.1.2) min_{x∈R^n} f(x) := (1/2) x^T A x − b^T x,
1035 where the (constant) Hessian A has eigenvalues in the range [γ, L], with 0 < γ ≤ L.
1036 For the following constant choices of steplength parameters:
1037 α_k ≡ α := 4/(√L + √γ)²,  β_k ≡ β := (√L − √γ)/(√L + √γ),
1038 it can be shown that ‖x^k − x*‖ ≤ Cβ^k, for some (possibly large) constant C. We
1039 can use (3.3.7) to translate this into a bound on the function error, as follows:
1040 f(x^k) − f(x*) ≤ (L/2) ‖x^k − x*‖² ≤ (LC²/2) β^{2k},
1041 allowing a direct comparison with the rate (4.4.3) for the steepest descent method.
1042 If we suppose that L ≫ γ, we have
1043 β ≈ 1 − 2√(γ/L),
1044 so that we achieve approximate convergence f(x^k) − f(x*) ≤ ε (for small pos-
1045 itive ε) in O(√(L/γ) log(1/ε)) iterations, compared with O((L/γ) log(1/ε)) for
1046 steepest descent — a significant improvement.
1047 The heavy-ball method is fundamental, but several points should be noted.
1048 First, the analysis for convex quadratic f is based on linear algebra arguments,
1049 and does not generalize to general strongly convex nonlinear functions. Second,
1050 the method requires knowledge of γ and L, for the purposes of defining parame-
1051 ters α and β. Third, it is not a descent method; we usually have f(xk+1 ) > f(xk )
1052 for many k. These properties are not specific to the heavy-ball method — some
1053 of them are shared by other methods that use momentum.
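The following Python sketch (an added illustration, not from the text) applies the heavy-ball iteration (6.1.1) to the quadratic (6.1.2), using the constant parameter choices given above; A and b are assumed supplied by the user.

```python
import numpy as np

def heavy_ball_quadratic(A, b, x0, num_iters=500):
    """Heavy-ball iteration (6.1.1) on f(x) = 0.5 x^T A x - b^T x, with the
    constant parameters alpha and beta suggested by the quadratic analysis."""
    eigs = np.linalg.eigvalsh(A)
    gamma, L = eigs.min(), eigs.max()
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        grad = A @ x - b
        x_next = x - alpha * grad + beta * (x - x_prev)   # momentum step
        x_prev, x = x, x_next
    return x
```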
1054 6.2. Conjugate Gradient The conjugate gradient method for solving linear sys-
1055 tems Ax = b (or, equivalently, minimizing the convex quadratic (6.1.2)) where A
1056 is symmetric positive definite, is one of the most important algorithms in compu-
1057 tational science. Though invented earlier than the other algorithms discussed in
1058 this section (see [27]) and motivated in a different way, conjugate gradient clearly
1059 makes use of momentum. Its steps have the form
1060 (6.2.1) xk+1 = xk + αk pk , where pk = −∇f(xk ) + ξk pk−1 ,
1061 for some choices of αk and ξk , which is identical to (6.1.1) when we define βk ap-
1062 propriately. For strongly convex quadratic problems (6.1.2), conjugate gradient
1063 has excellent properties. It does not require prior knowledge of the range [γ, L] of
1064 the eigenvalue spectrum of A, choosing the steplengths αk and ξk in an adaptive


1065 fashion. (In fact, αk is chosen to be the exact minimizer along the search direction
1066 pk .) The main arithmetic operation per iteration is one matrix-vector multiplica-
1067 tion involving A, the same cost as a gradient evaluation for f in (6.1.2). Most
1068 importantly, there is a rich convergence theory, that characterizes convergence in
1069 terms of the properties of the full spectrum of A (not just its extreme elements),
1070 showing in particular that good approximate solutions can be obtained quickly
1071 if the eigenvalues are clustered. Convergence to an exact solution of (6.1.2) in at
1072 most n iterations is guaranteed (provided, naturally, that the arithmetic is carried
1073 out exactly).
1074 There has been much work over the years on extending the conjugate gradi-
1075 ent method to general smooth functions f. Few of the theoretical properties for
1076 the quadratic case carry over to the nonlinear setting, though several results are
1077 known; see [36, Chapter 5], for example. Such “nonlinear” conjugate gradient
1078 methods vary in the accuracy with which they perform the line search for αk in
1079 (6.2.1) and — more fundamentally — in the choice of ξk . The latter is done in
1080 a way that ensures that each search direction pk is a descent direction. In some
1081 methods, ξk is set to zero on some iterations, which causes the method to take
1082 a steepest descent step, effectively “restarting” the conjugate gradient method at
1083 the latest iterate.
1084 Despite these qualifications, nonlinear conjugate gradient is quite commonly
1085 used in practice, because of its minimal storage requirements and the fact that
1086 it requires only one gradient evaluation per iteration. Its popularity has been
1087 eclipsed in recent years by the limited-memory quasi-Newton method L-BFGS
1088 [30], [36, Section 7.2], which requires more storage (though still O(n)) and is
1089 similarly economical and easy to implement.
1090 6.3. Nesterov’s Accelerated Gradient: Weakly Convex Case We now describe
1091 Nesterov’s method for (1.0.1) and prove its convergence — sublinear at a 1/k2
1092 rate — for the case of f convex with Lipschitz continuous gradients satisfying
1093 (3.3.6). Each iteration of this method has the form
 
1094 (6.3.1) x^{k+1} = x^k − α_k ∇f( x^k + β_k(x^k − x^{k−1}) ) + β_k (x^k − x^{k−1}),
for choices of the parameters αk and βk to be defined. Note immediately the
similarity to the heavy-ball formula (6.1.1). The only difference is that the extrap-
olation step xk → xk + βk (xk − xk−1 ) is taken before evaluation of the gradient
∇f in (6.3.1), whereas in (6.1.1) the gradient is simply evaluated at xk . It is con-
venient for purposes of analysis (and implementation) to introduce an auxiliary
sequence {yk }, fix αk ≡ 1/L, and rewrite the update (6.3.1) as follows:
1
(6.3.2a) xk+1 = yk − ∇f(yk ),
L
(6.3.2b) yk+1 = xk+1 + βk+1 (xk+1 − xk ), k = 0, 1, 2, . . . ,
1095 where we initialize at an arbitrary y0 and set x0 = y0 . We define βk with reference


1096 to another scalar sequence λ_k in the following manner:
1097 (6.3.3) λ₀ = 0,  λ_{k+1} = (1/2) ( 1 + √(1 + 4λ_k²) ),  β_k = (λ_k − 1)/λ_{k+1}.
1098 Since λk > 1 for k = 1, 2, . . . , we have βk+1 > 0 for k = 0, 1, 2, . . . . It also follows
1099 from the definition of λk+1 that
1100 (6.3.4) λ_{k+1}² − λ_{k+1} = λ_k².
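A minimal Python sketch of the scheme (6.3.2)–(6.3.3) follows; it is an added illustration in which grad_f and the Lipschitz constant L are assumed to be supplied by the user.

```python
import numpy as np

def nesterov_agd(grad_f, x0, L, num_iters=500):
    """Nesterov's accelerated gradient (6.3.2)-(6.3.3) for convex f with
    L-Lipschitz gradient.  grad_f and L are assumed supplied by the user."""
    x = y = np.asarray(x0, dtype=float)
    lam = 0.0                                                     # lambda_0
    for _ in range(num_iters):
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))    # lambda_{k+1}
        lam_next2 = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam_next ** 2))  # lambda_{k+2}
        x_next = y - (1.0 / L) * grad_f(y)                        # (6.3.2a)
        beta = (lam_next - 1.0) / lam_next2                       # beta_{k+1}
        y = x_next + beta * (x_next - x)                          # (6.3.2b)
        x, lam = x_next, lam_next
    return x
```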
1101 We have the following result for convergence of Nesterov’s scheme on general
1102 convex functions. We prove it using an argument from [3], as reformulated in
1103 [7, Section 3.7]. The analysis is famously technical, and intuition is hard to come
1104 by. Some recent progress has been made in deriving algorithms similar to (6.3.2)
1105 that have a plausible geometric or algebraic motivation; see [8, 21].

1106 Theorem 6.3.5. Suppose that f in (1.0.1) is convex, with ∇f Lipschitz continuous
1107 with constant L (as in (3.3.6)), and that the minimum of f is attained at x*,
1108 with f* := f(x*). Then the method defined by (6.3.2), (6.3.3) with x⁰ = y⁰ yields an
1109 iteration sequence {x^k} with the following property:
1110 f(x^T) − f* ≤ 2L‖x⁰ − x*‖²/(T + 1)², T = 1, 2, . . . .
1111 Proof. From convexity of f and (3.3.7), we have for any x and y that
(6.3.6) f(y − ∇f(y)/L) − f(x)
  ≤ f(y − ∇f(y)/L) − f(y) + ∇f(y)^T (y − x)
  ≤ ∇f(y)^T (y − ∇f(y)/L − y) + (L/2) ‖y − ∇f(y)/L − y‖² + ∇f(y)^T (y − x)
  = −(1/(2L)) ‖∇f(y)‖² + ∇f(y)^T (y − x).
1113 Setting y = y^k and x = x^k in this bound, we obtain
(6.3.7) f(x^{k+1}) − f(x^k) = f(y^k − ∇f(y^k)/L) − f(x^k)
  ≤ −(1/(2L)) ‖∇f(y^k)‖² + ∇f(y^k)^T (y^k − x^k)
  = −(L/2) ‖x^{k+1} − y^k‖² − L (x^{k+1} − y^k)^T (y^k − x^k).
1115 We now set y = y^k and x = x* in (6.3.6), and use (6.3.2a) to obtain
1116 (6.3.8) f(x^{k+1}) − f(x*) ≤ −(L/2) ‖x^{k+1} − y^k‖² − L (x^{k+1} − y^k)^T (y^k − x*).
Introducing notation δ_k := f(x^k) − f(x*), we multiply (6.3.7) by λ_{k+1} − 1 and add
it to (6.3.8) to obtain
(λ_{k+1} − 1)(δ_{k+1} − δ_k) + δ_{k+1}
  ≤ −(L/2) λ_{k+1} ‖x^{k+1} − y^k‖² − L (x^{k+1} − y^k)^T ( λ_{k+1} y^k − (λ_{k+1} − 1) x^k − x* ).
We multiply this bound by λ_{k+1}, and use (6.3.4) to obtain
(6.3.9) λ_{k+1}² δ_{k+1} − λ_k² δ_k
  ≤ −(L/2) [ ‖λ_{k+1}(x^{k+1} − y^k)‖² + 2 λ_{k+1}(x^{k+1} − y^k)^T ( λ_{k+1} y^k − (λ_{k+1} − 1) x^k − x* ) ]
  = −(L/2) [ ‖λ_{k+1} x^{k+1} − (λ_{k+1} − 1) x^k − x*‖² − ‖λ_{k+1} y^k − (λ_{k+1} − 1) x^k − x*‖² ],
where in the final equality we used the identity ‖a‖² + 2a^T b = ‖a + b‖² − ‖b‖².
By multiplying (6.3.2b) by λ_{k+2}, and using λ_{k+2} β_{k+1} = λ_{k+1} − 1 from (6.3.3), we
have
λ_{k+2} y^{k+1} = λ_{k+2} x^{k+1} + λ_{k+2} β_{k+1} (x^{k+1} − x^k)
             = λ_{k+2} x^{k+1} + (λ_{k+1} − 1)(x^{k+1} − x^k).
= λk+2 xk+1 + (λk+1 − 1)(xk+1 − xk ).
1117 By rearranging this equality, we have
1118 λk+1 xk+1 − (λk+1 − 1)xk = λk+2 yk+1 − (λk+2 − 1)xk+1 .
1119 By substituting into the first term on the right-hand side of (6.3.9), and using the
1120 definition
1121 (6.3.10) uk := λk+1 yk − (λk+1 − 1)xk − x∗ ,
1122 we obtain
1123 λ_{k+1}² δ_{k+1} − λ_k² δ_k ≤ −(L/2) ( ‖u^{k+1}‖² − ‖u^k‖² ).
1124 By summing both sides of this inequality over k = 0, 1, . . . , T − 1, and using λ₀ = 0,
1125 we obtain
1126 λ_T² δ_T ≤ (L/2) ( ‖u⁰‖² − ‖u^T‖² ) ≤ (L/2) ‖x⁰ − x*‖²,
1127 so that
1128 (6.3.11) δ_T = f(x^T) − f(x*) ≤ L‖x⁰ − x*‖²/(2λ_T²).
1129 A simple induction confirms that λk > (k + 1)/2 for k = 1, 2, . . . , and the claim of
1130 the theorem follows by substituting this bound into (6.3.11). 
1131 6.4. Nesterov’s Accelerated Gradient: Strongly Convex Case We turn now to
1132 Nesterov’s approach for smooth strongly convex functions, which satisfy (3.2.5)
1133 with γ > 0. Again, we follow the proof in [7, Section 3.7], which is based on
1134 the analysis in [34]. The method uses the same update formula (6.3.2) as in the
1135 weakly convex case, and the same initialization, but with a different choice of
1136 βk+1 , namely:
1137 (6.4.1) β_{k+1} ≡ (√L − √γ)/(√L + √γ) = (√κ − 1)/(√κ + 1).
1138 The condition measure κ is defined in (3.3.12). We prove the following conver-
1139 gence result.

1140 Theorem 6.4.2. Suppose that f is such that ∇f is Lipschitz continuous
1141 with constant L, and that f is strongly convex with modulus of convexity γ and unique
1142 minimizer x*. Then the method (6.3.2), (6.4.1) with starting point x⁰ = y⁰ satisfies
1143 f(x^T) − f(x*) ≤ ((L + γ)/2) ‖x⁰ − x*‖² (1 − 1/√κ)^T, T = 1, 2, . . . .
Proof. The proof makes use of a family of strongly convex functions Φk (z) de-
fined inductively as follows:
(6.4.3a) Φ₀(z) = f(y⁰) + (γ/2) ‖z − y⁰‖²,
(6.4.3b) Φ_{k+1}(z) = (1 − 1/√κ) Φ_k(z) + (1/√κ) ( f(y^k) + ∇f(y^k)^T (z − y^k) + (γ/2) ‖z − y^k‖² ).
1144 Each Φ_k(·) is a quadratic, and an inductive argument shows that ∇²Φ_k(z) = γI
1145 for all k and all z. Thus, each Φ_k has the form
1146 (6.4.4) Φ_k(z) = Φ_k* + (γ/2) ‖z − v^k‖², k = 0, 1, 2, . . . ,
1147 where vk is the minimizer of Φk (·) and Φ∗k is its optimal value. (From (6.4.3a),
1148 we have v0 = y0 .) We note too that Φk becomes a tighter overapproximation to f
1149 as k → ∞. To show this, we use (3.3.9) to replace the final term in parentheses in
1150 (6.4.3b) by f(z), then subtract f(z) from both sides of (6.4.3b) to obtain

1151 (6.4.5) Φ_{k+1}(z) − f(z) ≤ (1 − 1/√κ)(Φ_k(z) − f(z)).
1152 In the remainder of the proof, we show that the following bound holds:
1153 (6.4.6) f(x^k) ≤ min_z Φ_k(z) = Φ_k*, k = 0, 1, 2, . . . .
1154 The upper bound in Lemma 3.3.10 for x = x* gives f(z) − f(x*) ≤ (L/2) ‖z − x*‖².
1155 By combining this bound with (6.4.5) and (6.4.6), we have
(6.4.7) f(x^k) − f(x*) ≤ Φ_k* − f(x*)
  ≤ Φ_k(x*) − f(x*)
  ≤ (1 − 1/√κ)^k (Φ₀(x*) − f(x*))
  ≤ (1 − 1/√κ)^k [ (Φ₀(x*) − f(x⁰)) + (f(x⁰) − f(x*)) ]
  ≤ (1 − 1/√κ)^k ((γ + L)/2) ‖x⁰ − x*‖².
The proof is completed by establishing (6.4.6), by induction on k. Since x0 = y0 ,
it holds by definition at k = 0. By using step formula (6.3.2a), the convexity
property (3.3.8) (with x = yk ), and the inductive hypothesis, we have

(6.4.8) f(xk+1 )
1
6 f(yk ) − k∇f(yk )k2
2L
√ √ √ 1
= (1 − 1/ κ)f(xk ) + (1 − 1/ κ)(f(yk ) − f(xk )) + f(yk )/ κ − k∇f(yk )k2
2L
√ ∗ √ k √ 1
6 (1 − 1/ κ)Φk + (1 − 1/ κ)∇f(y ) (y − x ) + f(y )/ κ − k∇f(yk )k2 .
k T k k
2L
1157 Thus the claim is established (and the theorem is proved) if we can show that the
1158 right-hand side in (6.4.8) is bounded above by Φ∗k+1 .
1159 Recalling the observation (6.4.4), we have by taking derivatives of both sides of
1160 (6.4.3b) with respect to z that
1161 (6.4.9) ∇Φ_{k+1}(z) = γ(1 − 1/√κ)(z − v^k) + ∇f(y^k)/√κ + γ(z − y^k)/√κ.
1162 Since v^{k+1} is the minimizer of Φ_{k+1}, we can set ∇Φ_{k+1}(v^{k+1}) = 0 in (6.4.9) to
1163 obtain
1164 (6.4.10) v^{k+1} = (1 − 1/√κ) v^k + y^k/√κ − ∇f(y^k)/(γ√κ).
1165 By subtracting yk from both sides of this expression, and taking k · k2 of both
1166 sides, we obtain

(6.4.11) ‖v^{k+1} − y^k‖² = (1 − 1/√κ)² ‖y^k − v^k‖² + ‖∇f(y^k)‖²/(γ²κ)
                           − 2 (1 − 1/√κ)/(γ√κ) ∇f(y^k)^T (v^k − y^k).
1168 By evaluating Φ_{k+1} at z = y^k, using both (6.4.4) and (6.4.3b), we obtain
(6.4.12) Φ_{k+1}* + (γ/2) ‖y^k − v^{k+1}‖²
  = (1 − 1/√κ) Φ_k(y^k) + f(y^k)/√κ
  = (1 − 1/√κ) Φ_k* + (γ/2)(1 − 1/√κ) ‖y^k − v^k‖² + f(y^k)/√κ.
1170 By substituting (6.4.11) into (6.4.12), we obtain
(6.4.13) Φ_{k+1}* = (1 − 1/√κ) Φ_k* + f(y^k)/√κ + γ(1 − 1/√κ)/(2√κ) ‖y^k − v^k‖²
           − (1/(2L)) ‖∇f(y^k)‖² + (1 − 1/√κ) ∇f(y^k)^T (v^k − y^k)/√κ
  ≥ (1 − 1/√κ) Φ_k* + f(y^k)/√κ
           − (1/(2L)) ‖∇f(y^k)‖² + (1 − 1/√κ) ∇f(y^k)^T (v^k − y^k)/√κ,
1172 where we simply dropped a nonnegative term from the right-hand side to obtain
1173 the inequality. The final step is to show that

1174 (6.4.14) v^k − y^k = √κ (y^k − x^k),
1175 which we do by induction. Note that v0 = x0 = y0 , so the claim holds for k = 0.
1176 We have
(6.4.15) v^{k+1} − y^{k+1} = (1 − 1/√κ) v^k + y^k/√κ − ∇f(y^k)/(γ√κ) − y^{k+1}
  = √κ y^k − (√κ − 1) x^k − √κ ∇f(y^k)/L − y^{k+1}
  = √κ x^{k+1} − (√κ − 1) x^k − y^{k+1}
  = √κ (y^{k+1} − x^{k+1}),
1178 where the first equality is from (6.4.10), the second equality is from the inductive
1179 hypothesis, the third equality is from the iteration formula (6.3.2a), and the final
1180 equality is from the iteration formula (6.3.2b) with the definition of βk+1 from
1181 (6.4.1). We have thus proved (6.4.14), and by substituting this equality into (6.4.13),
1182 we obtain that Φ∗k+1 is an upper bound on the right-hand side of (6.4.8). This
1183 establishes (6.4.6) and thus completes the proof of the theorem. 
1184 6.5. Lower Bounds on Rates The term “optimal” in Nesterov’s optimal method
1185 is used because the convergence rate achieved by the method is the best possible
1186 (possibly up to a constant), among algorithms that make use of gradient informa-
1187 tion at the iterates xk . This claim can be proved by means of a carefully designed
1188 function, for which no method that makes use of all gradients observed up to and
1189 including iteration k (namely, ∇f(xi ), i = 0, 1, 2, . . . , k) can produce a sequence
1190 {xk } that achieves a rate better than that of Theorem 6.3.5. The function proposed
1191 in [32] is a convex quadratic f(x) = (1/2) x^T A x − e₁^T x, where
1192 A = tridiag(−1, 2, −1) ∈ R^{n×n} is the tridiagonal matrix with 2 on the diagonal and −1 on the
first sub- and superdiagonals, and e₁ = (1, 0, . . . , 0)^T is the first coordinate vector.
1193 The solution x∗ satisfies Ax∗ = e1 ; its components are x∗i = 1 − i/(n + 1), for
1194 i = 1, 2, . . . , n. If we use x0 = 0 as the starting point, and construct the iterate
1195 xk+1 as
1196 x^{k+1} = x^k + Σ_{j=0}^{k} ξ_j ∇f(x^j),

1197 for some coefficients ξj , j = 0, 1, . . . , k, an elementary inductive argument shows


1198 that each iterate xk can have nonzero entries only in its first k components. It
1199 follows that for any such algorithm, we have
1200 (6.5.1) ‖x^k − x*‖² ≥ Σ_{j=k+1}^{n} (x_j*)² = Σ_{j=k+1}^{n} ( 1 − j/(n + 1) )².

1201 A little arithmetic shows that
1202 (6.5.2) ‖x^k − x*‖² ≥ (1/8) ‖x⁰ − x*‖², k = 1, 2, . . . , n/2 − 1.
1203 It can be shown further that
1204 (6.5.3) f(x^k) − f* ≥ (3L/(32(k + 1)²)) ‖x⁰ − x*‖², k = 1, 2, . . . , n/2 − 1,
1205 where L = ‖A‖₂. This lower bound on f(x^k) − f* is within a constant factor of
1206 the upper bound of Theorem 6.3.5.
1207 The restriction k 6 n/2 in the argument above is not fully satisfying. A more
1208 compelling example would show that the lower bound (6.5.3) holds for all k, but
1209 an example of this type is not currently known.
1210 7. Newton Methods


1211 So far, we have dealt with methods that use first-order (gradient or subgra-
1212 dient) information about the objective function. We have shown that such algo-
1213 rithms can yield sequences of iterates that converge at linear or sublinear rates.
1214 We turn our attention in this chapter to methods that exploit second-derivative
1215 (Hessian) information. The canonical method here is Newton’s method, named
1216 after Isaac Newton, who proposed a version of the method for polynomial equa-
1217 tions in around 1670.
1218 For many functions, including many that arise in data analysis, second-order
1219 information is not difficult to compute, in the sense that the functions that we
1220 deal with are simple (usually compositions of elementary functions). In compar-
1221 ing with first-order methods, there is a tradeoff. Second-order methods typically
1222 have local superlinear or quadratic convergence rates: Once the iterates reach a
1223 neighborhood of a solution at which second-order sufficient conditions are sat-
1224 isfied, convergence is rapid. Moreover, their global convergence properties are
1225 attractive. With appropriate enhancements, they can provably avoid convergence
1226 to saddle points. But the costs of calculating and handling the second-order infor-
1227 mation and of computing the step is higher. Whether this tradeoff makes them
1228 appealing depends on the specifics of the application and on whether the second-
1229 derivative computations are able to take advantage of structure in the objective
1230 function.
1231 We start by sketching the basic Newton’s method for the unconstrained smooth
1232 optimization problem min f(x), and prove local convergence to a minimizer x∗
1233 that satisfies second-order sufficient conditions. Subsection 7.2 discusses perfor-
1234 mance of Newton’s method on convex functions, where the use of Newton search
1235 directions in the line search framework (4.0.1) can yield global convergence. Mod-
1236 ifications of Newton’s method for nonconvex functions are discussed in Subsec-
1237 tion 7.3. Subsection 7.4 discusses algorithms for smooth nonconvex functions
1238 that use gradient and Hessian information but guarantee convergence to points
1239 that approximately satisfy second-order necessary conditions. Some variants of
1240 these methods are related closely to the trust-region methods discussed in Sub-
1241 section 7.3, but the motivation and mechanics are somewhat different.
1242 7.1. Basic Newton’s Method Consider the problem
1243 (7.1.1) min f(x),
1244 where f : Rn → R is a Lipschitz twice continuously differentiable function, where
1245 the Hessian has Lipschitz constant M, that is,
1246 (7.1.2) k∇2 f(x 0 ) − ∇2 f(x 00 )k 6 Mkx 0 − x 00 k,
1247 where k · k denotes the Euclidean vector norm and its induced matrix norm. New-
1248 ton’s method generates a sequence of iterates {xk }k=0,1,2,... .
1249 A second-order Taylor series approximation to f around the current iterate xk


1250 is
1251 (7.1.3) f(x^k + p) ≈ f(x^k) + ∇f(x^k)^T p + (1/2) p^T ∇²f(x^k) p.
1252 When ∇2 f(xk ) is positive definite, the minimizer pk of the right-hand side is
1253 unique; it is
1254 (7.1.4) pk = −∇2 f(xk )−1 ∇f(xk ).
1255 This is the Newton step. In its most basic form, then, Newton’s method is defined
1256 by the following iteration:
1257 (7.1.5) xk+1 = xk − ∇2 f(xk )−1 ∇f(xk ).
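A minimal Python sketch of the iteration (7.1.5) is given below (an added illustration); grad_f and hess_f are assumed to be supplied by the user, and the Newton step is obtained by a linear solve rather than by forming the inverse Hessian explicitly.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, num_iters=20):
    """Basic Newton iteration (7.1.5).  grad_f and hess_f are assumed supplied
    by the user; the Newton step (7.1.4) is computed with a linear solve."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        p = np.linalg.solve(hess_f(x), -grad_f(x))   # Newton step p^k
        x = x + p
    return x
```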
1258 We have the following local convergence result in the neighborhood of a point x∗
1259 satisfying second-order sufficient conditions.

1260 Theorem 7.1.6. Consider the problem (7.1.1) with f twice Lipschitz continuously differ-
1261 entiable with Lipschitz constant M defined in (7.1.2). Suppose that the second-order suf-
1262 ficient conditions are satisfied for the problem (7.1.1) at the point x∗ , that is, ∇f(x∗ ) = 0
γ
1263 and ∇2 f(x∗ )  γI for some γ > 0. Then if kx0 − x∗ k 6 2M , the sequence defined by
1264 (7.1.5) converges to x∗ at a quadratic rate, with
M k
1265 (7.1.7) kxk+1 − x∗ k 6 kx − x∗ k2 , k = 0, 1, 2, . . . .
γ
Proof. From (7.1.4) and (7.1.5), and using ∇f(x∗ ) = 0, we have

xk+1 − x∗ = xk − x∗ − ∇2 f(xk )−1 ∇f(xk )


= ∇2 f(xk )−1 [∇2 f(xk )(xk − x∗ ) − (∇f(xk ) − ∇f(x∗ ))].
1266 so that
1267 (7.1.8) kxk+1 − x∗ k 6 k∇2 f(xk )−1 kk∇2 f(xk )(xk − x∗ ) − (∇f(xk ) − ∇f(x∗ ))k.
1268 By using Taylor’s theorem (see (3.3.4) with x = xk and p = x∗ − xk ), we have
∇f(x^k) − ∇f(x*) = ∫₀¹ ∇²f(x^k + t(x* − x^k)) (x^k − x*) dt.
1269 By using this result along with the Lipschitz condition (7.1.2), we have
(7.1.9) ‖∇²f(x^k)(x^k − x*) − (∇f(x^k) − ∇f(x*))‖
  = ‖ ∫₀¹ [∇²f(x^k) − ∇²f(x^k + t(x* − x^k))] (x^k − x*) dt ‖
  ≤ ∫₀¹ ‖∇²f(x^k) − ∇²f(x^k + t(x* − x^k))‖ ‖x^k − x*‖ dt
  ≤ ( ∫₀¹ Mt dt ) ‖x^k − x*‖² = (1/2) M ‖x^k − x*‖².

1271 From the Weilandt-Hoffman inequality[28] and (7.1.2), we have that


1272 |λmin (∇2 f(xk )) − λmin (∇2 f(x∗ ))| 6 k∇2 f(xk ) − ∇2 f(x∗ )k 6 Mkxk − x∗ k,
1273 where λ_min(·) denotes the smallest eigenvalue of a symmetric matrix. Thus for
1274 (7.1.10) ‖x^k − x*‖ ≤ γ/(2M),
1275 we have
1276 λ_min(∇²f(x^k)) ≥ λ_min(∇²f(x*)) − M ‖x^k − x*‖ ≥ γ − M γ/(2M) ≥ γ/2,
1277 so that ‖∇²f(x^k)^{−1}‖ ≤ 2/γ. By substituting this result together with (7.1.9) into
1278 (7.1.8), we obtain
1279 ‖x^{k+1} − x*‖ ≤ (2/γ) (M/2) ‖x^k − x*‖² = (M/γ) ‖x^k − x*‖²,
1280 verifying the local quadratic convergence rate. By applying (7.1.10) again, we
1281 have
1282 ‖x^{k+1} − x*‖ ≤ ( (M/γ) ‖x^k − x*‖ ) ‖x^k − x*‖ ≤ (1/2) ‖x^k − x*‖,
1283 so, by arguing inductively, we see that the sequence converges to x∗ provided
1284 that x0 satisfies (7.1.10), as claimed. 
1285 Of course, we do not need to explicitly identify a starting point x0 in the stated
1286 region of convergence. Any sequence that approaches x* will eventually enter
1287 this region, and thereafter the quadratic convergence guarantees apply.
1288 We have established that Newton’s method converges rapidly once the iterates
1289 enter the neighborhood of a point x∗ satisfying second-order sufficient optimality
1290 conditions. But what happens when we start far from such a point?
1291 7.2. Newton’s Method for Convex Functions When the function f is convex as
1292 well as smooth, we can devise variants of Newton’s method for which global
1293 convergence and complexity results (in particular, results based on those of Sec-
1294 tion 4.5) can be proved in addition to local quadratic convergence.
1295 When f is strongly convex with modulus γ and satisfies Lipschitz continuity
1296 of the gradient (3.3.6), the Hessian ∇2 f(xk ) is positive definite for all k, with
1297 all eigenvalues in the interval [γ, L]. Thus, the Newton direction (7.1.4) is well
1298 defined at all iterates xk , and is a descent direction satisfying the condition (4.5.1)
1299 with η = γ/L. To verify this claim, note first
1300 ‖p^k‖ ≤ ‖∇²f(x^k)^{−1}‖ ‖∇f(x^k)‖ ≤ (1/γ) ‖∇f(x^k)‖.
Then
(p^k)^T ∇f(x^k) = −∇f(x^k)^T ∇²f(x^k)^{−1} ∇f(x^k)
  ≤ −(1/L) ‖∇f(x^k)‖²
  ≤ −(γ/L) ‖∇f(x^k)‖ ‖p^k‖.
1301 We can use the Newton direction in the line-search framework of Subsection 4.5
1302 to obtain a method for which xk → x∗ , where x∗ is the (unique) global minimizer
1303 of f. (This claim follows from the property (4.5.6) together with the fact that x∗ is
1304 the only point for which ∇f(x∗ ) = 0.) We can even obtain a complexity result —

1305 an O(1/√T) bound on min_{0≤k≤T−1} ‖∇f(x^k)‖ — from Theorem 4.5.3.
1306 These global convergence properties are enhanced by the local quadratic con-
1307 vergence property of Theorem 7.1.6 if we modify the line-search framework by
1308 accepting the step length αk = 1 in (4.0.1) whenever it satisfies the weak Wolfe
1309 conditions (4.5.2). (It can be shown, by again using arguments based on Taylor’s
1310 theorem (Theorem 3.3.1), that these conditions will be satisfied by αk = 1 for all
1311 xk sufficiently close to the minimizer x∗ .)
1312 Consider now the case in which f is convex and satisfies condition (3.3.6) but
1313 is not strongly convex. Here, the Hessian ∇2 f(xk ) may be singular for some k, so
1314 the direction (7.1.4) may not be well defined. However, by adding any positive
1315 number λk > 0 to the diagonal, we can ensure that the modified Newton direction
1316 defined by
1317 (7.2.1) pk = −[∇2 f(xk ) + λk I]−1 ∇f(xk ),
1318 is well defined and is a descent direction for f. For any η ∈ (0, 1) in (4.5.1),
1319 we have by choosing λk large enough that λk /(L + λk ) > η that the condition
1320 (4.5.1) is satisfied too, so we can use the resulting direction pk in the line-search
1321 framework of Subsection 4.5, to obtain a method that converges to a solution
1322 x∗ of (1.0.1), when one exists.
1323 If, in addition, the minimizer x∗ is unique and satisfies a second-order suffi-
1324 cient condition (so that ∇2 f(x∗ ) is positive definite), then ∇2 f(xk ) will be positive
1325 definite too for k sufficiently large. Thus, provided that η is sufficiently small,
1326 the unmodified Newton direction (with λk = 0 in (7.2.1)) will satisfy the condi-
1327 tion (4.5.1). If we use (7.2.1) in the line-search framework of Section 4.5, but set
1328 λk = 0 where possible, and accept αk = 1 as the step length whenever it satisfies
1329 (4.5.2), we can obtain local quadratic convergence to x∗ , in addition to the global
1330 convergence and complexity promised by Theorem 4.5.3.
1331 7.3. Newton Methods for Nonconvex Functions For smooth nonconvex f, the
1332 Hessian ∇2 f(xk ) may be indefinite for some k. The Newton direction (7.1.4)
1333 may not exist (when ∇2 f(xk ) is singular) or it may not be a descent direction
1334 (when ∇2 f(xk ) has negative eigenvalues). However, we can still define a modified
1335 Newton direction as in (7.2.1), which will be a descent direction for λk sufficiently
1336 large, and thus can be used in the line-search framework of Section 4.5. For a
1337 given η in (4.5.1), a sufficient condition for pk from (7.2.1) to satisfy (4.5.1) is that
1338 ( λ_k + λ_min(∇²f(x^k)) ) / ( λ_k + L ) ≥ η,
1339 where λmin (∇2 f(xk )) is the minimum eigenvalue of the Hessian, which may be
1340 negative. The line-search framework of Section 4.5 can then be applied to ensure
1341 that ∇f(xk ) → 0.
1342 Once again, if the iterates {xk } enter the neighborhood of a local solution x∗
1343 for which ∇2 f(x∗ ) is positive definite, some enhancements of the strategy for
1344 choosing λk and the step length αk can recover the local quadratic convergence
1345 of Theorem 7.1.6.
1346 Formula (7.2.1) is not the only way to modify the Newton direction to ensure
1347 descent in a line-search framework. Other approaches are outlined in [36, Chap-
1348 ter 3]. One such technique is to modify the Cholesky factorization of ∇²f(x^k) by
1349 adding positive elements to the diagonal only as needed to allow the factoriza-
1350 tion to proceed (that is, to avoid taking the square root of a negative number),
1351 then using the modified factorization in place of ∇2 f(xk ) in the calculation of the
1352 Newton step pk . Another technique is to compute an eigenvalue decomposition
1353 ∇2 f(xk ) = Qk Λk QTk (where Qk is orthogonal and Λk is the diagonal matrix con-
1354 taining the eigenvalues), then define Λ̃k to be a modified version of Λk in which
1355 all the diagonals are positive. Then, following (7.1.4), pk can be defined as
1356 pk := −Qk Λ̃−1 T k
k Qk ∇f(x ).

1357 When an appropriate strategy is used to define Λ̃k , we can ensure satisfaction
1358 of the descent condition (4.5.1) for some η > 0. As above, the line-search frame-
1359 work of Section 4.5 can be used to obtain an algorithm that generates a sequence
1360 {xk } such that ∇f(xk ) → 0. We noted earlier that this condition ensures that all
1361 accumulation points x̂ are stationary points, that is, they satisfy ∇f(x̂) = 0.
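The following Python sketch illustrates the eigenvalue-modification idea just described; the specific strategy of flooring the eigenvalues at a small positive δ is an assumption made only for this illustration, one of several ways to define Λ̃_k.

```python
import numpy as np

def modified_newton_direction(grad, hess, delta=1e-3):
    """Eigenvalue-modified Newton direction: p = -Q diag(lam_tilde)^{-1} Q^T grad.
    Flooring the eigenvalues at delta > 0 is one possible modification strategy
    (an assumption for this sketch), chosen so the result is a descent direction."""
    lam, Q = np.linalg.eigh(hess)            # hess = Q diag(lam) Q^T
    lam_tilde = np.maximum(lam, delta)       # replace small/negative eigenvalues
    return -Q @ ((Q.T @ grad) / lam_tilde)
```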
1362 Stronger guarantees can be obtained from a trust-region version of Newton’s
1363 method, which ensures convergence to a point satisfying second-order necessary
1364 conditions, that is, ∇2 f(x̂)  0 in addition to ∇f(x̂) = 0. The trust-region approach
1365 was developed in the late 1970s and early 1980s, and has become popular again
1366 recently because of this appealing global convergence behavior. A trust-region
1367 Newton method also recovers quadratic convergence to solutions x∗ satisfying
1368 second-order-sufficient conditions, without any special modifications. (The trust-
1369 region Newton approach is closely related to cubic regularization [26, 35], which
1370 we discuss in the next section.)
1371 We now outline the trust-region approach. (Further details can be found in
1372 [36, Chapter 4].) The subproblem to be solved at each iteration is
1373 (7.3.1) min_d f(x^k) + ∇f(x^k)^T d + (1/2) d^T ∇²f(x^k) d   subject to ‖d‖₂ ≤ ∆_k.
1374 The objective is a second-order Taylor-series approximation while ∆k is the radius
1375 of the trust region — the region within which we trust the second-order model
1376 to capture the true behavior of f. Somewhat surprisingly, the problem (7.3.1) is
1377 not too difficult to solve, even when the Hessian ∇2 f(xk ) is indefinite. In fact, the
1378 solution dk of (7.3.1) satisfies the linear system
1379 (7.3.2) [∇2 f(xk ) + λI]dk = −∇f(xk ), for some λ > 0,
1380 where λ is chosen such that ∇2 f(xk ) + λI is positive semidefinite and λ > 0 only if
1381 kdk k = ∆k (see [31]). Solving (7.3.1) thus reduces to a search for the appropriate
1382 value of the scalar λk , for which specialized methods have been devised.
1383 For large-scale problems, it may be too expensive to solve (7.3.1) near-exactly,
1384 since the process may require several factorizations of an n × n matrix (namely,
1385 the coefficient matrix in (7.3.2), for different values of λ). A popular approach
1386 for finding approximate solutions of (7.3.1), which can be used when ∇2 f(xk )
1387 is positive definite, is the dogleg method. In this method the curved path traced
1388 out by solutions of (7.3.2) for values of λ in the interval [0, ∞) is approximated
1389 by a simpler path consisting of two line segments. The first segment joins 0 to
1390 the point d^k_C that minimizes the objective in (7.3.1) along the direction −∇f(x^k),
1391 while the second segment joins d^k_C to the pure Newton step defined in (7.1.4). The
1392 approximate solution is taken to be the point at which this “dogleg” path crosses
1393 the boundary of the trust region kdk 6 ∆k . If the dogleg path lies entirely inside
1394 the trust region, we take dk to be the pure Newton step. See [36, Section 4.1].
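As an added illustration of the dogleg computation (assuming a positive definite Hessian), the following Python sketch combines the Cauchy point along −∇f(x^k) with the pure Newton step and returns the point where the two-segment path meets the trust-region boundary.

```python
import numpy as np

def dogleg_step(grad, hess, delta):
    """Dogleg approximation to the trust-region subproblem (7.3.1), assuming
    the Hessian is positive definite.  A sketch following the standard recipe."""
    p_newton = np.linalg.solve(hess, -grad)
    if np.linalg.norm(p_newton) <= delta:          # dogleg path lies inside region
        return p_newton
    # Cauchy point: minimizer of the quadratic model along -grad.
    p_cauchy = -(grad @ grad) / (grad @ hess @ grad) * grad
    if np.linalg.norm(p_cauchy) >= delta:          # boundary hit on first segment
        return delta * p_cauchy / np.linalg.norm(p_cauchy)
    # Otherwise find tau in [0, 1] with ||p_cauchy + tau (p_newton - p_cauchy)|| = delta.
    d = p_newton - p_cauchy
    a, b, c = d @ d, 2 * p_cauchy @ d, p_cauchy @ p_cauchy - delta ** 2
    tau = (-b + np.sqrt(b ** 2 - 4 * a * c)) / (2 * a)
    return p_cauchy + tau * d
```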
1395 Having discussed the trust-region subproblem (7.3.1), let us outline how it can
1396 be used as the basis for a complete algorithm. A crucial role is played by the ratio
1397 between the amount of decrease in f predicted by the quadratic objective in (7.3.1) and
1398 the actual decrease in f, namely, f(xk ) − f(xk + dk ). Ideally, this ratio would be close
1399 to 1. If it is at least greater than a small tolerance (say, 10−4 ) we accept the step
1400 and proceed to the next iteration. Otherwise, we conclude that the trust-region
1401 radius ∆k is too large, so we do not take the step, shrink the trust region, and
1402 re-solve (7.3.1) to obtain a new step. Additionally, when the actual-to-predicted
1403 ratio is close to 1, we conclude that a larger trust region may hasten progress, so
1404 we increase ∆ for the next iteration, provided that the bound kdk k 6 ∆k really is
1405 active at the solution of (7.3.1).
1406 Unlike a basic line-search method, the trust-region Newton method can “es-
1407 cape” from a saddle point. Suppose we have ∇f(xk ) = 0 and ∇2 f(xk ) indefinite
1408 with some strictly negative eigenvalues. Then, the solution dk to (7.3.1) will be
1409 nonzero, and the algorithm will step away from the saddle point, in the direc-
1410 tion of most negative curvature for ∇2 f(xk ). Another appealing feature of the
1411 trust-region Newton approach is that when the sequence {xk } approaches a point
1412 x∗ satisfying second-order sufficient conditions, the trust region bound becomes
1413 inactive, and the method takes pure Newton steps (7.1.4) for all sufficiently large
1414 k, so that it enjoys the local quadratic convergence that characterizes Newton's method.
1415 The basic difference between line-search and trust-region methods can be sum-
1416 marized as follows. Line-search methods first choose a direction pk , then decide
1417 how far to move along that direction. Trust-region methods do the opposite: They
1418 choose the distance ∆k first, then find the direction that makes the best progress
1419 for this step length.
1420 7.4. A Cubic Regularization Approach Trust-region Newton methods have the
1421 significant advantage of guaranteeing that any accumulation points will satisfy
1422 second-order necessary conditions. A related approach based on cubic regulariza-
1423 tion has similar properties, plus some additional complexity guarantees. Cubic
1424 regularization requires the Hessian to be Lipschitz continuous, as in (7.1.2). It
1425 follows that the following cubic function yields a global upper bound for f:
1426 (7.4.1) T_M(z; x) := f(x) + ∇f(x)^T (z − x) + (1/2) (z − x)^T ∇²f(x) (z − x) + (M/6) ‖z − x‖³.
1427 Specifically, we have for any x that


1428 f(z) 6 TM (z; x), for all z.
1429 The basic cubic regularization algorithm starting from x0 proceeds as follows:
1430 (7.4.2) xk+1 = arg min TM (z; xk ), k = 0, 1, 2, . . . .
z
1431 The complexity properties of this approach were analyzed in [35], with variants
1432 being studied in [26] and [12, 13]. Rather than present the theory for the method
1433 based on (7.4.2), we describe an elementary algorithm that makes use of the ex-
1434 pansion (7.4.1) as well as the steepest-descent theory of Subsection 4.1. Our algo-
1435 rithm aims to identify a point that approximately satisfies second-order necessary
1436 conditions, that is,
1437 (7.4.3)  ‖∇f(x)‖ ≤ ε_g,   λ_min(∇²f(x)) ≥ −ε_H,
1438 where ε_g and ε_H are two small constants. In addition to Lipschitz continuity of
1439 the Hessian (7.1.2), we assume Lipschitz continuity of the gradient with constant
1440 L (see (3.3.6)), and also that the objective f is lower-bounded by some number f̄.
1441 Our algorithm takes steps of two types: a steepest-descent step, as in Subsec-
1442 tion 4.1, or a step in a negative curvature direction for ∇2 f. Iteration k proceeds
1443 as follows:
1444 (i) If ‖∇f(x^k)‖ > ε_g, take the steepest-descent step (4.1.1).
1445 (ii) Otherwise, if λ_min(∇²f(x^k)) < −ε_H, choose p^k to be the eigenvector cor-
1446 responding to the most negative eigenvalue of ∇²f(x^k). Choose the size
1447 and sign of p^k such that ‖p^k‖ = 1 and (p^k)^T ∇f(x^k) ≤ 0, and set
1448 (7.4.4)  x^{k+1} = x^k + α_k p^k, where α_k = 2ε_H/M.
1449 If neither of these conditions holds, then x^k satisfies the approximate second-order
1450 necessary conditions (7.4.3), so we terminate.
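A direct Python transcription of this two-step scheme might look as follows (our own sketch: we use the steplength 1/L for the steepest-descent step, which is consistent with the decrease bound (7.4.5) below, and we obtain the negative-curvature direction from a full eigendecomposition for simplicity).

    def crude_two_step_method(grad, hess, x, L, M, eps_g=1e-5, eps_H=1e-3, max_iter=100_000):
        """Steepest-descent steps while the gradient is large; negative-curvature
        steps (7.4.4) otherwise; stop when the approximate conditions (7.4.3) hold."""
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) > eps_g:
                x = x - g / L                      # step (i): steepest descent (4.1.1)
                continue
            lam, V = np.linalg.eigh(hess(x))       # eigenvalues in ascending order
            if lam[0] < -eps_H:
                p = V[:, 0]                        # unit eigenvector of the most negative eigenvalue
                if p @ g > 0:
                    p = -p                         # enforce (p^k)^T grad f(x^k) <= 0
                x = x + (2.0 * eps_H / M) * p      # step (ii): update (7.4.4)
                continue
            break                                  # (7.4.3) holds approximately; terminate
        return x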
1451 For the steepest-descent step (i), we have from (4.1.3) that
1452 (7.4.5)  f(x^{k+1}) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖² ≤ f(x^k) − ε_g²/(2L).
1453 For a step of type (ii), we have from (7.4.1) that
f(x^{k+1}) ≤ f(x^k) + α_k ∇f(x^k)^T p^k + (1/2) α_k² (p^k)^T ∇²f(x^k) p^k + (1/6) M α_k³ ‖p^k‖³
1454 (7.4.6)          ≤ f(x^k) − (1/2) ε_H (2ε_H/M)² + (1/6) M (2ε_H/M)³
                  = f(x^k) − (2/3) ε_H³/M².
1455 By aggregating (7.4.5) and (7.4.6), we have that at each x^k for which the condition
1456 (7.4.3) does not hold, we attain a decrease in the objective of at least
1457 min ( ε_g²/(2L), (2/3) ε_H³/M² ).

1458 Using the lower bound f̄ on the objective f, we see that the number of iterations
1459 K required must satisfy the condition
1460 K min ( ε_g²/(2L), (2/3) ε_H³/M² ) ≤ f(x^0) − f̄,
1461 from which we conclude that
1462 K ≤ max ( 2L ε_g^{-2}, (3/2) M² ε_H^{-3} ) ( f(x^0) − f̄ ).
1463 We also observe that the maximum number of iterates required to identify a
1464 point at which only the approximate stationarity condition ‖∇f(x^k)‖ ≤ ε_g holds
1465 is 2L ε_g^{-2} (f(x^0) − f̄). (We can just omit the second-order part of the algorithm.)
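To get a feel for the scale of these bounds, the following back-of-the-envelope calculation uses made-up constants (they are purely illustrative); for these values the ε_g^{-2} term dominates.

    # Illustrative constants only (not from the text).
    L, M, eps_g, eps_H = 10.0, 1.0, 1e-3, 1e-2
    gap = 100.0                                              # assumed value of f(x^0) - f_bar
    K_bound = max(2.0 * L / eps_g**2, 1.5 * M**2 / eps_H**3) * gap
    print(K_bound)                                           # 2e9 iterations in the worst case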
1466 Note too that it is easy to devise approximate versions of this algorithm with simi-
1467 lar complexity. For example, the negative curvature direction pk in step (ii) above
1468 can be replaced by an approximation to the direction of most negative curvature,
1469 obtained by the Lanczos iteration with random initialization.
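A minimal sketch of such a Lanczos-based estimate appears below (our own code, not from the text). It requires only Hessian-vector products hess_vec(v); the iteration count and random seed are arbitrary choices, and no reorthogonalization is performed, so it is meant only to convey the idea.

    def approx_min_curvature(hess_vec, n, iters=20, seed=0):
        """Estimate the smallest eigenvalue of the Hessian and a corresponding unit
        direction via a short Lanczos run with a random starting vector, using only
        Hessian-vector products hess_vec(v)."""
        rng = np.random.default_rng(seed)
        q = rng.normal(size=n)
        q /= np.linalg.norm(q)
        Q, alphas, betas = [q], [], []
        beta, q_prev = 0.0, np.zeros(n)
        for _ in range(iters):
            w = hess_vec(Q[-1]) - beta * q_prev
            alpha = Q[-1] @ w
            w = w - alpha * Q[-1]
            beta = np.linalg.norm(w)
            alphas.append(alpha)
            if beta < 1e-12:
                break                              # invariant subspace found; stop early
            q_prev = Q[-1]
            Q.append(w / beta)
            betas.append(beta)
        m = len(alphas)
        T = np.diag(alphas) + np.diag(betas[:m - 1], 1) + np.diag(betas[:m - 1], -1)
        lam, U = np.linalg.eigh(T)                 # eigenpairs of the small tridiagonal matrix
        v = np.column_stack(Q[:m]) @ U[:, 0]       # Ritz vector for the smallest Ritz value
        return lam[0], v / np.linalg.norm(v)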
1470 In algorithms that make more complete use of the cubic model (7.4.1), the term
1471 ε_g^{-2} in the complexity expression becomes ε_g^{-3/2}, and the constants are different.
1472 The subproblems (7.4.1) are more complicated to solve than those in the simple
1473 scheme above. Research is ongoing into other algorithms that achieve
1474 complexities similar to those of the cubic regularization approach. A variety of
1475 methods that make use of Newton-type steps, approximate negative curvature di-
1476 rections, accelerated gradient methods, random perturbations, randomized Lanc-
1477 zos and conjugate gradient methods, and other algorithmic elements have been
1478 proposed.

1479 8. Conclusions
1480 We have outlined various algorithmic tools from optimization that are useful
1481 for solving problems in data analysis and machine learning, and presented their
1482 basic theoretical properties. The intersection of optimization and machine learn-
1483 ing is a fruitful and very popular area of current research. All the major machine
1484 learning conferences have a large contingent of optimization papers, and there is
1485 a great deal of interest in developing algorithmic tools to meet new challenges
1486 and in understanding their properties. The edited volume [41] contains a snap-
1487 shot of the state of the art circa 2010, but this is a fast-moving field and there have
1488 been many developments since then.

1489 Acknowledgments
1490 I thank Ching-pei Lee for a close reading and many helpful suggestions, and
1491 David Hong and an anonymous referee for detailed, excellent comments.

1492 References
1493 [1] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from highly incom-
1494 plete information, 48th Annual Allerton Conference on Communication, Control, and Computing,
1495 2010, pp. 704–711. ←9
1496 [2] L. Balzano and S. J. Wright, Local convergence of an algorithm for subspace identification from partial
1497 data, Foundations of Computational Mathematics 14 (2014), 1–36. DOI: 10.1007/s10208-014-9227-
1498 7. ←10
1499 [3] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems,
1500 SIAM Journal on Imaging Sciences 2 (2009), no. 1, 183–202. ←32, 35
1501 [4] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers,
1502 Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–
1503 152. ←12
1504 [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learn-
1505 ing via the alternating direction method of multipliers, Foundations and Trends in Machine Learning
1506 3 (2011), no. 1, 1–122. ←3
1507 [6] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, 2004. ←3
1508 [7] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine
1509 Learning 8 (2015), no. 3–4, 231–357. ←35, 36
1510 [8] S. Bubeck, Y. T. Lee, and M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent,
1511 Technical Report arXiv:1506.08187, Microsoft Research, 2015. ←35
1512 [9] S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs
1513 via low-rank factorizations, Mathematical Programming, Series B 95 (2003), 329–357. ←7
1514 [10] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computa-
1515 tional Mathematics 9 (2009), 717–772. ←7
1516 [11] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, Journal of the ACM
1517 58.3 (2011), 11. ←9
1518 [12] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained op-
1519 timization. Part I: Motivation, convergence and numerical results, Mathematical Programming, Series
1520 A 127 (2011), 245–295. ←46
1521 [13] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained
1522 optimization. Part II: Worst-case function- and derivative-evaluation complexity, Mathematical Program-
1523 ming, Series A 130 (2011), no. 2, 295–319. ←46
1524 [14] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, Rank-sparsity incoherence for
1525 matrix decomposition, SIAM Journal on Optimization 21 (2011), no. 2, 572–596. ←9
1526 [15] Y. Chen and M. J. Wainwright, Fast low-rank estimation by projected gradient descent: General statistical
1527 and algorithmic guarantees, Technical Report arXiv:1509.03025, University of California-Berkeley,
1528 2015. ←9
1529 [16] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning 20 (1995), 273–297. ←12
1530 [17] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection,
1531 SIAM Journal on Matrix Analysis and Applications 30 (2008), 56–66. ←8
1532 [18] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. Lanckriet, A direct formulation for sparse PCA
1533 using semidefinite programming, SIAM Review 49 (2007), no. 3, 434–448. ←8
1534 [19] T. Dasu and T. Johnson, Exploratory data mining and data cleaning, John Wiley & Sons, 2003. ←4
1535 [20] P. Drineas and M. W. Mahoney, Lectures on randomized numerical linear algebra, The mathematics
1536 of data, 2018. ←6
1537 [21] D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first-order method based on optimal quadratic av-
1538 eraging, Technical Report arXiv:1604.06543, University of Washington, 2016. To appear in SIAM
1539 Journal on Optimization. ←35
1540 [22] J. C. Duchi, Introductory lectures on stochastic optimization, The mathematics of data, 2018. ←3, 16,
1541 23
1542 [23] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point
1543 algorithm for maximal monotone operators, Mathematical Programming 55 (1992), 293–318. ←3
1544 [24] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly
1545 3 (1956), 95–110. ←28
1546 [25] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso,
1547 Biostatistics 9 (2008), no. 3, 432–441. ←8

1548 [26] A. Griewank, The modification of Newton’s method for unconstrained optimization by bounding cubic
1549 terms, Technical Report NA/12, DAMTP, Cambridge University, 1981. ←44, 46
1550 [27] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of
1551 Research of the National Bureau of Standards 49 (1952), 409–436. ←33
1552 [28] A. J. Hoffman and H. Wielandt, The variation of the spectrum of a normal matrix, Duke Mathematical
1553 Journal 20 (1953), no. 1, 37–39. ←41
1554 [29] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, Gradient descent only converges to minimizers,
1555 Conference on Learning Theory, 2016, pp. 1246–1257. ←27
1556 [30] D. C. Liu and J. Nocedal, On the limited-memory BFGS method for large scale optimization, Mathe-
1557 matical Programming 45 (1989), 503–528. ←3, 34
1558 [31] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM Journal on Scientific and
1559 Statistical Computing 4 (1983), 553–572. ←44
1560 [32] A. S. Nemirovski and D. B. Yudin, Problem complexity and method efficiency in optimization, John
1561 Wiley, 1983. ←39
1562 [33] Y. Nesterov, A method for unconstrained convex problem with the rate of convergence O(1/k²), Dok-
1563 lady AN SSSR 269 (1983), 543–547. ←32
1564 [34] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science and Busi-
1565 ness Media, New York, 2004. ←32, 36
1566 [35] Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Math-
1567 ematical Programming, Series A 108 (2006), 177–205. ←44, 46
1568 [36] J. Nocedal and S. J. Wright, Numerical Optimization, Second Edition, Springer, New York, 2006. ←3, 26,
1569 34, 44, 45
1570 [37] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational
1571 Mathematics and Mathematical Physics 4.5 (1964), 1–17. ←32
1572 [38] B. T. Polyak, Introduction to optimization, Optimization Software, 1987. ←32, 33
1573 [39] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions to linear matrix equations via
1574 nuclear norm minimization, SIAM Review 52 (2010), no. 3, 471–501. ←7
1575 [40] R. T. Rockafellar, Convex analysis, Princeton University Press, Princeton, N.J., 1970. ←17
1576 [41] S. Sra, S. Nowozin, and S. J. Wright (eds.), Optimization for machine learning, NIPS Workshop
1577 Series, MIT Press, 2011. ←47
1578 [42] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical
1579 Society B 58 (1996), 267–288. ←6
1580 [43] M. J. Todd, Semidefinite optimization, Acta Numerica 10 (2001), 515–560. ←3
1581 [44] B. Turlach, W. N. Venables, and S. J. Wright, Simultaneous variable selection, Technometrics 47
1582 (2005), no. 3, 349–363. ←9
1583 [45] L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Review 38 (1996), 49–95. ←3
1584 [46] S. J. Wright, Primal-dual interior-point methods, SIAM, Philadelphia, PA, 1997. ←3
1585 [47] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, Series B 151 (2015),
1586 3–34. ←3
1587 [48] X. Yi, D. Park, Y. Chen, and C. Caramanis, Fast algorithms for robust PCA via gradient descent,
1588 Advances in Neural Information Processing Systems 29, 2016, pp. 4152–4160. ←9

1589 Computer Sciences Department, University of Wisconsin-Madison, Madison, WI 53706


1590 E-mail address: [email protected]
