Optimization for Data Analysis
Optimization techniques are at the core of data science, including data analysis and
machine learning. An understanding of basic optimization techniques and their
fundamental properties provides important grounding for students, researchers, and
practitioners in these areas. This text covers the fundamentals of optimization
algorithms in a compact, self-contained way, focusing on the techniques most relevant
to data science. An introductory chapter demonstrates that many standard problems in
data science can be formulated as optimization problems. Next, many fundamental
methods in optimization are described and analyzed, including gradient and
accelerated gradient methods for unconstrained optimization of smooth (especially
convex) functions; the stochastic gradient method, a workhorse algorithm in machine
learning; the coordinate descent approach; several key algorithms for constrained
optimization problems; algorithms for minimizing nonsmooth functions arising in data
science; foundations of the analysis of nonsmooth functions and optimization duality;
and the back-propagation approach, relevant to neural networks.
STEPHEN J. WRIGHT
University of Wisconsin–Madison
BENJAMIN RECHT
University of California, Berkeley
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467
www.cambridge.org
Information on this title: www.cambridge.org/9781316518984
DOI: 10.1017/9781009004282
© Stephen J. Wright and Benjamin Recht 2022
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2022
Printed in the United Kingdom by TJ Books Ltd, Padstow, Cornwall
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Wright, Stephen J., 1960– author. | Recht, Benjamin, author.
Title: Optimization for data analysis / Stephen J. Wright and Benjamin Recht.
Description: New York : Cambridge University Press, [2021] | Includes
bibliographical references and index.
Identifiers: LCCN 2021028671 (print) | LCCN 2021028672 (ebook) |
ISBN 9781316518984 (hardback) | ISBN 9781009004282 (epub)
Subjects: LCSH: Big data. | Mathematical optimization. | Quantitative
research. | Artificial intelligence. | BISAC: MATHEMATICS / General |
MATHEMATICS / General
Classification: LCC QA76.9.B45 W75 2021 (print) | LCC QA76.9.B45 (ebook)
| DDC 005.7–dc23
LC record available at https://lccn.loc.gov/2021028671
LC ebook record available at https://lccn.loc.gov/2021028672
ISBN 978-1-316-51898-4 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Preface page ix
1 Introduction 1
1.1 Data Analysis and Optimization 1
1.2 Least Squares 4
1.3 Matrix Factorization Problems 5
1.4 Support Vector Machines 6
1.5 Logistic Regression 9
1.6 Deep Learning 11
1.7 Emphasis 13
2 Foundations of Smooth Optimization 15
2.1 A Taxonomy of Solutions to Optimization Problems 15
2.2 Taylor’s Theorem 16
2.3 Characterizing Minima of Smooth Functions 18
2.4 Convex Sets and Functions 20
2.5 Strongly Convex Functions 22
3 Descent Methods 26
3.1 Descent Directions 27
3.2 Steepest-Descent Method 28
3.2.1 General Case 28
3.2.2 Convex Case 29
3.2.3 Strongly Convex Case 30
3.2.4 Comparison between Rates 32
3.3 Descent Methods: Convergence 33
3.4 Line-Search Methods: Choosing the Direction 36
3.5 Line-Search Methods: Choosing the Steplength 38
Bibliography 216
Index 223
Preface
This book has been a work in progress since about 2010, when we began
to revamp our optimization courses, trying to balance the viewpoints of
practical optimization techniques against renewed interest in non-asymptotic
analyses of optimization algorithms. At that time, the flavor of analysis of
optimization algorithms was shifting to include a greater emphasis on worst-
case complexity. But algorithms were being judged more by their worst-case
bounds than by their performance on practical problems in applied
sciences. This book occupies a middle ground between analysis and practice.
Beginning with our courses CS726 and CS730 at the University of Wisconsin,
we began writing notes, problems, and drafts. After Ben moved to UC Berkeley
in 2013, these notes became the core of the class EECS227C. Our material
drew heavily from the evolving theoretical understanding of optimization
algorithms. For instance, in several parts of the text, we have made use of the
excellent slides written and refined over many years by Lieven Vandenberghe
for the UCLA course ECE236C. Our presentation of accelerated methods
reflects a trend in viewing optimization algorithms as dynamical systems,
and was heavily influenced by collaborative work with Laurent Lessard and
Andrew Packard. In choosing what material to include, we tried to not be
distracted by methods that are not widely used in practice but also to highlight
how theory can guide algorithm selection and design by applied researchers.
We are indebted to many other colleagues whose input shaped the material
in this book. Moritz Hardt initially inspired us to try to write down our views
after we presented a review of optimization algorithms at the bootcamp for
the Simons Institute Program on Big Data in Fall 2013. He has subsequently
provided feedback on the presentation and organization of drafts of this
book. Ashia Wilson was Ben’s TA in EECS227C, and her input and notes
helped us to clarify our pedagogical messages in several ways. More recently,
Martin Wainwright taught EECS227C and provided helpful feedback, and
Jelena Diakonikolas provided corrections for the early chapters after she
taught CS726. André Wibisono provided perspectives on accelerated gradient
methods, and Ching-pei Lee gave useful advice on coordinate descent. We are
also indebted to the many students who took CS726 and CS730 at Wisconsin
and EECS227C at Berkeley who found typos and beta tested homework
problems, and who continue to make this material a joy to teach. Finally,
we would like to thank the Simons Institute for supporting us on multiple
occasions, including Fall 2017 when we both participated in their program
on Optimization.
$$L_D(x) := \frac{1}{m}\sum_{j=1}^{m} \ell(a_j, y_j; x). \tag{1.2}$$
The function $\ell(a,y;x)$ here represents a “loss” incurred for not properly
aligning our prediction φ(a) with y. Thus, the objective LD (x) measures the
average loss accrued over the entire data set when the parameter vector is
equal to x.
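To make (1.2) concrete, here is a minimal Python sketch (not from the text) that evaluates the empirical risk for an arbitrary loss; the squared-error loss and the random data are illustrative choices only:

```python
import numpy as np

def empirical_risk(loss, A, y, x):
    """L_D(x) = (1/m) * sum_j loss(a_j, y_j; x), as in (1.2)."""
    return np.mean([loss(aj, yj, x) for aj, yj in zip(A, y)])

# Illustrative loss: squared error for the linear predictor phi(a) = a^T x.
def squared_loss(aj, yj, x):
    return 0.5 * (aj @ x - yj) ** 2

A = np.random.randn(100, 5)   # m = 100 feature vectors a_j in R^5
y = A @ np.ones(5)            # labels generated by a known linear model
print(empirical_risk(squared_loss, A, y, np.zeros(5)))
```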
Once an appropriate value of x (and thus φ) has been learned from the data,
we can use it to make predictions about other items of data not in the set D
(1.1). Given an unseen item of data â of the same type as aj , j = 1,2, . . . ,m,
we predict the label ŷ associated with â to be φ(â). The mapping φ may also
expose other structures and properties in the data set. For example, it may
reveal that only a small fraction of the features in aj are needed to reliably
predict the label yj . (This is known as feature selection.) When the parameter
x is a matrix, it could reveal a low-dimensional subspace that contains most of
the vectors aj , or it could reveal a matrix with particular structure (low-rank,
sparse) such that observations of X prompted by the feature vectors aj yield
results close to yj .
The form of the labels yj differs according to the nature of the data analysis
problem. In some problems, no labels $y_j$ are given at all; the task may then be to group
the $a_j$ into clusters (where the vectors within each cluster are deemed to be
functionally similar) or identify a low-dimensional subspace (or a
collection of low-dimensional subspaces) that approximately contains the
aj . In such problems, we are essentially learning the labels yj alongside the
function φ. For example, in a clustering problem, yj could represent the
cluster to which aj is assigned.
Even after cleaning and preparation, the preceding setup may contain many
complications that need to be dealt with in formulating the problem in rigorous
mathematical terms. The quantities (aj ,yj ) may contain noise or may be
otherwise corrupted, and we would like the mapping φ to be robust to such
errors. There may be missing data: Parts of the vectors aj may be missing,
or we may not know all the labels yj . The data may be arriving in streaming
fashion rather than being available all at once. In this case, we would learn φ
in an online fashion.
One consideration that arises frequently is that we wish to avoid overfitting
the model to the data set D in (1.1). The particular data set D available to us
can often be thought of as a finite sample drawn from some underlying larger
(perhaps infinite) collection of possible data points, and we wish the function φ
to perform well on the unobserved data points as well as the observed subset D.
In other words, we want φ to be not too sensitive to the particular sample D that
is used to define empirical objective functions such as (1.2). One way to avoid
this issue is to modify the objective function by adding constraints or penalty
terms, in a way that limits the “complexity” of the function φ. This process is
typically called regularization. An optimization formulation that balances fit
to the training data D, model complexity, and model structure is
$$\min_{x\in\Omega} \; L_D(x) + \lambda\,\mathrm{pen}(x), \tag{1.3}$$
1 Interestingly, the concept of overfitting has been reexamined in recent years, particularly in the
context of deep learning, where models that perfectly fit the training data are sometimes
observed to also do a good job of classifying previously unseen data. This phenomenon is a
topic of intense current research in the machine learning community.
The constraint set $\Omega$ in (1.3) may be chosen to exclude values of x that are
not relevant or useful in the context of the data analysis problem. For example,
in some applications, we may not wish to consider values of x in which one
or more components are negative, so we could set $\Omega$ to be the set of vectors
whose components are all greater than or equal to zero.
We now examine some particular problems in data science that give rise to
formulations that are special cases of our master problem (1.3). We will see that
a large variety of problems can be formulated using this general framework, but
we will also see that within this framework, there is a wide range of structures
that must be taken into account in choosing algorithms to solve these problems
efficiently.
$$\min_x \; \frac{1}{2m}\sum_{j=1}^{m}\bigl(a_j^T x - y_j\bigr)^2 = \frac{1}{2m}\,\|Ax - y\|_2^2, \tag{1.4}$$
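As a sketch of how (1.4) might be evaluated and minimized in practice (not from the text; the random problem instance is an illustrative assumption):

```python
import numpy as np

def ls_objective(A, y, x):
    """f(x) = (1/(2m)) * ||Ax - y||_2^2, the objective of (1.4)."""
    r = A @ x - y
    return (r @ r) / (2 * A.shape[0])

def ls_gradient(A, y, x):
    """grad f(x) = (1/m) * A^T (Ax - y)."""
    return A.T @ (A @ x - y) / A.shape[0]

A, y = np.random.randn(50, 3), np.random.randn(50)
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)    # solves (1.4) directly
print(np.linalg.norm(ls_gradient(A, y, x_star)))  # ~0: gradient vanishes at x*
```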
$$\min_X \; \frac{1}{2m}\sum_{j=1}^{m}\bigl(\langle A_j, X\rangle - y_j\bigr)^2, \tag{1.6}$$
where $\langle A,B\rangle := \operatorname{trace}(A^T B)$. Here we can think of the $A_j$ as “probing” the
unknown matrix X. Commonly considered types of observations are random
linear combinations (where the elements of Aj are selected i.i.d. from some
distribution) or single element observations (in which each Aj has 1 in a
single location and zeros elsewhere). A regularized version of (1.6), leading
to solutions X that are low rank, is
$$\min_X \; \frac{1}{2m}\sum_{j=1}^{m}\bigl(\langle A_j, X\rangle - y_j\bigr)^2 + \lambda\,\|X\|_*, \tag{1.7}$$
where $\|X\|_*$ is the nuclear norm, which is the sum of singular values of X
(Recht et al., 2010). The nuclear norm plays a role analogous to the $\ell_1$ norm in
(1.5): whereas the $\ell_1$ norm favors sparse vectors, the nuclear norm favors low-
rank matrices. Although the nuclear norm is a somewhat complex nonsmooth
function, it is at least convex so that the formulation (1.7) is also convex. This
formulation can be shown to yield a statistically valid solution when the true
$$\min_{L,R} \; \frac{1}{2m}\sum_{j=1}^{m}\bigl(\langle A_j, LR^T\rangle - y_j\bigr)^2. \tag{1.8}$$
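A rough sketch of evaluating the factored objective (1.8), with $\langle A,B\rangle = \operatorname{trace}(A^T B)$; the synthetic rank-2 instance and Gaussian probes are illustrative assumptions:

```python
import numpy as np

def factored_loss(A_list, y, L, R):
    """(1/(2m)) * sum_j (<A_j, L R^T> - y_j)^2, as in (1.8)."""
    X = L @ R.T
    r = np.array([np.trace(Aj.T @ X) - yj for Aj, yj in zip(A_list, y)])
    return (r @ r) / (2 * len(A_list))

n, rank, m = 6, 2, 30
L0, R0 = np.random.randn(n, rank), np.random.randn(n, rank)
A_list = [np.random.randn(n, n) for _ in range(m)]   # random Gaussian probes
y = np.array([np.trace(Aj.T @ (L0 @ R0.T)) for Aj in A_list])
print(factored_loss(A_list, y, L0, R0))   # zero at the generating factors
```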
Given data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, +1\}$, SVM seeks a vector $x \in \mathbb{R}^n$ and
a scalar $\beta \in \mathbb{R}$ such that
$$a_j^T x - \beta \ge 1 \;\text{ when } y_j = +1; \qquad a_j^T x - \beta \le -1 \;\text{ when } y_j = -1. \tag{1.9}$$
Any pair (x,β) that satisfies these conditions defines a separating hyperplane
in $\mathbb{R}^n$ that separates the “positive” cases $\{a_j \mid y_j = +1\}$ from the “negative”
cases $\{a_j \mid y_j = -1\}$. Among all separating hyperplanes, the one that
minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes –
that is, the hyperplane whose distance to the nearest point aj of either class is
greatest.
We can formulate the problem of finding a separating hyperplane as an
optimization problem by defining an objective with the summation form (1.2):
$$H(x,\beta) = \frac{1}{m}\sum_{j=1}^{m}\max\bigl(1 - y_j(a_j^T x - \beta),\, 0\bigr). \tag{1.10}$$
Note that the j th term in this summation is zero if the conditions (1.9) are
satisfied, and it is positive otherwise. Even if no pair (x,β) exists for which
H (x,β) = 0, a value (x,β) that minimizes (1.2) will be the one that comes
as close as possible to satisfying (1.9) in some sense. A term $\frac{1}{2}\lambda\|x\|_2^2$ (for some
parameter λ > 0) is often added to (1.10), yielding the following regularized
version:
$$H(x,\beta) = \frac{1}{m}\sum_{j=1}^{m}\max\bigl(1 - y_j(a_j^T x - \beta),\, 0\bigr) + \frac{1}{2}\lambda\,\|x\|_2^2. \tag{1.11}$$
Note that, in contrast to the examples presented so far, the SVM problem has
a nonsmooth loss function and a smooth regularizer.
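The objective (1.11) is straightforward to evaluate; here is a minimal Python sketch (not from the text), with two Gaussian point clouds as an illustrative data set:

```python
import numpy as np

def svm_objective(A, y, x, beta, lam):
    """(1/m) * sum_j max(1 - y_j (a_j^T x - beta), 0) + (lam/2) * ||x||_2^2,
    the regularized hinge loss (1.11)."""
    margins = y * (A @ x - beta)
    return np.mean(np.maximum(1.0 - margins, 0.0)) + 0.5 * lam * (x @ x)

# A point a_j contributes zero hinge loss exactly when y_j (a_j^T x - beta) >= 1,
# i.e., when the conditions (1.9) hold for it.
A = np.vstack([np.random.randn(20, 2) + 2.0, np.random.randn(20, 2) - 2.0])
y = np.concatenate([np.ones(20), -np.ones(20)])
print(svm_objective(A, y, np.array([1.0, 1.0]), 0.0, 0.01))
```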
If λ is sufficiently small, and if separating hyperplanes exist, the pair
(x,β) that minimizes (1.11) is the maximum-margin separating hyperplane.
The maximum-margin property is consistent with the goals of generalizability
and robustness. For example, if the observed data (aj ,yj ) is drawn from
an underlying “cloud” of positive and negative cases, the maximum-margin
solution usually does a reasonable job of separating other empirical data
samples drawn from the same clouds, whereas a hyperplane that passes close
to several of the observed data points may not do as well (see Figure 1.1).
Often, it is not possible to find a hyperplane that separates the positive
and negative cases well enough to be useful as a classifier. One solution is
to transform all of the raw data vectors aj by some nonlinear mapping ψ and
and Vapnik, 1995). This is the so-called kernel trick. (The kernel function K
can also be used to construct a classification function φ from the solution of
(1.14).) A particularly popular choice of kernel is the Gaussian kernel:
$$K(a_k, a_l) := \exp\left(-\frac{1}{2\sigma}\,\|a_k - a_l\|^2\right),$$
where σ is a positive parameter.
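A small sketch of the Gaussian kernel and the Gram matrix it induces (not from the text; the vectorized pairwise-distance computation is a standard trick):

```python
import numpy as np

def gaussian_kernel(ak, al, sigma):
    """K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2*sigma)), with sigma > 0."""
    d = ak - al
    return np.exp(-(d @ d) / (2.0 * sigma))

def gram_matrix(A, sigma):
    """m x m matrix with entries K(a_k, a_l) for the rows a_k of A."""
    sq = np.sum(A * A, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (A @ A.T), 0.0)
    return np.exp(-D2 / (2.0 * sigma))
```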
Note that the definition (1.15) ensures that $p(a;x) \in (0,1)$ for all a and x;
thus, $\log(1 - p(a_j;x)) < 0$ and $\log p(a_j;x) < 0$ for all j and all x. When the
conditions (1.16) are satisfied, these log terms will be only slightly negative,
so values of x that satisfy (1.17) will be near optimal.
We can perform feature selection using the model (1.17) by introducing a
regularizer $\lambda\|x\|_1$ (as in the LASSO technique for least squares (1.5)):
$$\min_x \; -\frac{1}{m}\left[\sum_{j:\,y_j=-1}\log\bigl(1-p(a_j;x)\bigr) + \sum_{j:\,y_j=1}\log p(a_j;x)\right] + \lambda\,\|x\|_1, \tag{1.18}$$
where λ > 0 is a regularization parameter. As we see later, this term has
the effect of producing a solution in which few components of x are nonzero,
$$p_k(a;X) := \frac{\exp(a^T x_{[k]})}{\sum_{l=1}^{M}\exp(a^T x_{[l]})}, \qquad k = 1,2,\ldots,M, \tag{1.19}$$
The problem of finding values of x[k] that satisfy these conditions can again be
formulated as one of minimizing a negative-log likelihood:
$$L(X) := -\frac{1}{m}\sum_{j=1}^{m}\left[\sum_{\ell=1}^{M} y_{j\ell}\,\bigl(x_{[\ell]}^T a_j\bigr) - \log\sum_{\ell=1}^{M}\exp\bigl(x_{[\ell]}^T a_j\bigr)\right]. \tag{1.22}$$
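The softmax probabilities (1.19) and the loss (1.22) can be computed stably by shifting exponents, as in this sketch (not from the text); Y is assumed to hold one-hot label rows:

```python
import numpy as np

def softmax_probs(a, X):
    """p_k(a; X) from (1.19); X has columns x_[1], ..., x_[M]."""
    z = X.T @ a
    z = z - z.max()   # shifting leaves (1.19) unchanged and avoids overflow
    e = np.exp(z)
    return e / e.sum()

def multiclass_nll(A, Y, X):
    """L(X) from (1.22), with Y[j] the one-hot label vector of example j."""
    total = 0.0
    for aj, yj in zip(A, Y):
        z = X.T @ aj
        lse = z.max() + np.log(np.sum(np.exp(z - z.max())))  # stable log-sum-exp
        total += yj @ z - lse
    return -total / A.shape[0]
```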
Figure 1.2 A deep neural network, showing connections between adjacent layers,
where each layer is represented by a shaded rectangle.
outputs in the rightmost layer, and a loss function similar to (1.22) is obtained,
as we describe now.
Consider the special (but not uncommon) case in which the neural net
structure is a linear graph of D levels, in which the output for layer $l-1$
becomes the input for layer l (for $l = 1,2,\ldots,D$), with $a_j = a_j^0$, $j =
1,2,\ldots,m$, and the transformation within each box has the form (1.23). A
softmax is applied to the output of the rightmost layer to obtain a set of odds.
The parameters in this neural network are the matrix–vector pairs $(W^l, g^l)$,
$l = 1,2,\ldots,D$, that transform the input vector $a_j = a_j^0$ into the output $a_j^D$ of
the final layer. We aim to choose all these parameters so that the network does
a good job of classifying the training data correctly. Using the notation w for
the layer-to-layer transformations, the problem of training the network is that
of minimizing a loss of this type with respect to w. Vast computational resources –
often using multicore processors, GPUs, and even specially architected
processing units – are devoted to this task.
1.7 Emphasis
Many problems can be formulated as in the framework (1.3), and their
properties may differ significantly. They might be convex or nonconvex, and
smooth or nonsmooth. But there are important features that they all share.
Typically, these problems need not be solved to high accuracy for the
solution to be useful in the application that gave rise to the problem. Worst-case
complexity guarantees are only a piece of the story here, and understanding the
various parameters and heuristics that form part of any practical algorithmic
strategy is critical for building reliable solvers.
1 See the Appendix for a description of the order notation O(·) and o(·).
$$p^T \nabla^2 f(x + \gamma\alpha p)\,p \le L\,\|p\|^2.$$
$$f(y) \ge f(x^*) + \nabla f(x^*)^T(y - x^*) = f(x^*),$$
for all x and y in the domain of f , we say that f is strongly convex with
modulus of convexity m. When f is differentiable, we have the following
$$u^T \nabla^2 f(x + \gamma\alpha u)\,u \ge m\,\|u\|^2.$$
We obtain the strong convexity expression when we bound the last term as
follows:
Notation
We use $\|\cdot\|$ to denote the Euclidean norm $\|\cdot\|_2$ of a vector in $\mathbb{R}^n$. Other norms,
such as $\|\cdot\|_1$ and $\|\cdot\|_\infty$, will be denoted explicitly.
Exercises
1. Prove that the effective domain of a convex function f (that is, the set of
points x ∈ Rn such that f (x) < ∞) is a convex set.
2. Prove that epi f is a convex subset of Rn × R for any convex function f .
3. Suppose that f : Rn → R is convex and concave. Show that f must be an
affine function.
4. Suppose that f : Rn → R is convex and upper bounded. Show that f must
be a constant function.
5. Suppose f : Rn → R is strongly convex and Lipschitz. Show that no such
f exists.
6. Show rigorously how (2.19) is derived from (2.18) when f is continuously
differentiable.
7. Suppose that f : Rn → R is a convex function with L Lipschitz gradient
and a minimizer x ∗ with function value f ∗ = f (x ∗ ).
(a) Show (by minimizing both sides of (2.9) with respect to y) that for any
$x \in \mathbb{R}^n$, we have
$$f(x) - f^* \ge \frac{1}{2L}\,\|\nabla f(x)\|^2.$$
(b) Prove the following co-coercivity property: For any x,y ∈ Rn , we have
$$[\nabla f(x) - \nabla f(y)]^T(x - y) \ge \frac{1}{L}\,\|\nabla f(x) - \nabla f(y)\|^2.$$
Hint: Apply part (a) to the following two functions:
$$h_x(z) := f(z) - \nabla f(x)^T z, \qquad h_y(z) := f(z) - \nabla f(y)^T z.$$
8. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is an m-strongly convex function with
L-Lipschitz gradient and (unique) minimizer $x^*$ with function value
$f^* = f(x^*)$.
(a) Show that the function $q(x) := f(x) - \frac{m}{2}\|x\|^2$ is convex with
$(L - m)$-Lipschitz continuous gradients.
(b) By applying the co-coercivity property of the previous question to this
function q, show that the following property holds:
$$[\nabla f(x) - \nabla f(y)]^T(x - y) \ge \frac{mL}{m+L}\,\|x - y\|^2 + \frac{1}{m+L}\,\|\nabla f(x) - \nabla f(y)\|^2. \tag{2.21}$$
3
Descent Methods
Methods that use information about gradients to obtain descent in the objective
function at each iteration form the basis of all of the schemes studied in this
book. We describe several fundamental methods of this type and analyze their
convergence and complexity properties. This chapter can be read as an introduction
both to elementary methods based on gradients of the objective and
to the fundamental tools of analysis that are used to understand optimization
algorithms.
Throughout the chapter, we consider the unconstrained minimization of a
smooth convex function:
$$\min_{x\in\mathbb{R}^n} f(x). \tag{3.1}$$
The algorithms of this chapter are suited to the case in which f and its gradient
∇f can be evaluated – exactly, in principle – at arbitrary points x. Bearing in
mind that this setup may not hold for many data analysis problems, we focus on
those fundamental algorithms that can be extended to more general situations,
for example:
3.1 Descent Directions
Definition 3.1 d is a descent direction for f at x if f (x + td) < f (x) for all
t > 0 sufficiently small.
An iteration that steps along the negative gradient direction has the form
$$x^{k+1} = x^k - \alpha_k\,\nabla f(x^k), \tag{3.2}$$
for some steplength $\alpha_k > 0$. At each iteration, we are guaranteed that there is
some nonnegative step α that decreases the function value, unless ∇f (x k ) = 0.
But note that when ∇f (x) = 0 (that is, x is stationary), we will have found a
point that satisfies a first-order necessary condition for local optimality. (If f is
also convex, this point will be a global minimizer of f .) The algorithm defined
by (3.2) is called the gradient descent method or the steepest-descent method.
(We use the latter term in this chapter.) In the next section, we will discuss the
choice of steplengths αk and analyze how many iterations are required to find
points where the gradient nearly vanishes.
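A minimal sketch of the iteration (3.2) with the fixed steplength $\alpha_k = 1/L$ analyzed below (not from the text; the quadratic test function is an illustrative choice):

```python
import numpy as np

def steepest_descent(grad_f, x0, L, T):
    """Run x^{k+1} = x^k - (1/L) * grad f(x^k), i.e., (3.2) with alpha_k = 1/L."""
    x = x0.copy()
    for _ in range(T):
        x = x - grad_f(x) / L
    return x

# f(x) = 0.5 * x^T Q x has gradient Qx and Lipschitz constant L = lambda_max(Q).
Q = np.diag([1.0, 10.0])
x_T = steepest_descent(lambda x: Q @ x, np.array([5.0, 5.0]), L=10.0, T=200)
print(x_T)   # approaches the minimizer x* = 0
```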
(In the case that f has a global minimizer $x^*$, $\bar f$ could be any value such that
$\bar f \le f(x^*)$.) By summing the inequalities (3.5) over $k = 0,1,\ldots,T-1$, and
canceling terms, we find that
$$f(x^T) \le f(x^0) - \frac{1}{2L}\sum_{k=0}^{T-1}\|\nabla f(x^k)\|^2.$$
Since $\bar f \le f(x^T)$, we have
$$\sum_{k=0}^{T-1}\|\nabla f(x^k)\|^2 \le 2L\,\bigl[f(x^0) - \bar f\,\bigr],$$
so that
$$\min_{0\le k\le T-1}\|\nabla f(x^k)\|^2 \le \frac{1}{T}\sum_{k=0}^{T-1}\|\nabla f(x^k)\|^2 \le \frac{2L\,\bigl[f(x^0) - \bar f\,\bigr]}{T}.$$
Thus, we have shown that after T steps of steepest descent, we can find a point
x satisfying
$$\min_{0\le k\le T-1}\|\nabla f(x^k)\| \le \sqrt{\frac{2L\,\bigl[f(x^0) - \bar f\,\bigr]}{T}}. \tag{3.7}$$
Note that this convergence rate is slow and tells us only that we will find a
point x k that is nearly stationary. We need to assume stronger properties of f
to guarantee faster convergence and global optimality.
$$f(x^T) - f^* \le \frac{L}{2T}\,\|x^0 - x^*\|^2, \qquad T = 1,2,\ldots. \tag{3.8}$$
$$\sum_{k=0}^{T-1}\bigl(f(x^{k+1}) - f^*\bigr) \le \frac{L}{2}\sum_{k=0}^{T-1}\Bigl(\|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2\Bigr) = \frac{L}{2}\Bigl(\|x^0 - x^*\|^2 - \|x^T - x^*\|^2\Bigr) \le \frac{L}{2}\,\|x^0 - x^*\|^2.$$
Since $\{f(x^k)\}$ is a nonincreasing sequence, we have
$$f(x^T) - f^* \le \frac{1}{T}\sum_{k=0}^{T-1}\bigl(f(x^{k+1}) - f^*\bigr) \le \frac{L}{2T}\,\|x^0 - x^*\|^2,$$
as desired.
A convex function can be made strongly convex by adding any small positive
multiple of the squared Euclidean norm. In fact, if f is any L-smooth function, then
$$f_\mu(x) = f(x) + \mu\,\|x\|^2$$
is strongly convex for μ large enough. (Exercise: Prove this!)
As another canonical example, note that a quadratic function $f(x) = \frac{1}{2}x^T Q x$
is strongly convex if and only if the smallest eigenvalue of Q is
strictly positive. We saw in Theorem 2.8 that a strongly convex f has a unique
minimizer, which we denote by $x^*$.
Strongly convex functions are, in essence, the “easiest” functions to opti-
mize by first-order methods. First, the norm of the gradient provides useful
information about how far away we are from optimality. Suppose we minimize
both sides of the inequality (3.9) with respect to z. The minimizer on the left-
hand side is clearly attained at z = x ∗ , while on the right-hand side, it is
attained at x − ∇f (x)/m. By plugging these optimal values into (3.9), we
obtain
$$f(x^*) \ge f(x) - \frac{1}{m}\,\nabla f(x)^T\nabla f(x) + \frac{m}{2}\left\|\frac{1}{m}\nabla f(x)\right\|^2 = f(x) - \frac{1}{2m}\,\|\nabla f(x)\|^2.$$
By rearrangement, we obtain
$$\|\nabla f(x)\|^2 \ge 2m\,\bigl[f(x) - f(x^*)\bigr]. \tag{3.10}$$
If $\|\nabla f(x)\| < \delta$, we have
$$f(x) - f(x^*) \le \frac{\|\nabla f(x)\|^2}{2m} \le \frac{\delta^2}{2m}.$$
Thus, for strongly convex functions, when the gradient is small, we are close
to having found a point with minimal function value.
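The bound (3.10) is easy to check numerically; this sketch (not from the text) uses a strongly convex quadratic whose minimizer and modulus are known:

```python
import numpy as np

# f(x) = 0.5 * x^T Q x is m-strongly convex with m = lambda_min(Q),
# minimizer x* = 0, and f(x*) = 0.
Q = np.diag([2.0, 5.0])
m = 2.0
for _ in range(1000):
    x = np.random.randn(2)
    grad = Q @ x
    gap = 0.5 * x @ Q @ x                          # f(x) - f(x*)
    assert grad @ grad >= 2.0 * m * gap - 1e-12    # inequality (3.10)
```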
We can derive an estimate of the distance of x to the optimal point $x^*$
in terms of the gradient by using (3.9) and the Cauchy–Schwarz inequality.
We have
$$f(x^*) \ge f(x) + \nabla f(x)^T(x^* - x) + \frac{m}{2}\,\|x - x^*\|^2 \ge f(x) - \|\nabla f(x)\|\,\|x - x^*\| + \frac{m}{2}\,\|x - x^*\|^2.$$
By rearranging terms, we have
$$\|x - x^*\| \le \frac{2}{m}\,\|\nabla f(x)\|. \tag{3.11}$$
We summarize this discussion in the following lemma.
For the general convex case, we have from (3.8) that $f(x^k) - f^* \le \epsilon$ when
$$k \ge \frac{L\,\|x^0 - x^*\|^2}{2\epsilon}. \tag{3.16}$$
For the strongly convex case, we have from (3.15) that $f(x^k) - f^* \le \epsilon$ for all
k satisfying
$$k \ge \frac{L}{m}\,\log\bigl((f(x^0) - f^*)/\epsilon\bigr). \tag{3.17}$$
Note that in all three cases, we can get bounds in terms of the initial distance
to optimality $\|x^0 - x^*\|$ rather than the initial optimality gap $f(x^0) - f^*$ by
using the inequality
$$f(x^0) - f^* \le \frac{L}{2}\,\|x^0 - x^*\|^2.$$
The linear rate (3.17) depends only logarithmically on $\epsilon$, whereas the
sublinear rates depend on $1/\epsilon$ or $1/\epsilon^2$. When $\epsilon$ is small (for example, $\epsilon =
10^{-6}$), the linear rate would appear to be dramatically faster, and, indeed, this
is usually the case. The only exception would be when m is extremely small,
so that m/L is of the same order as $\epsilon$. The problem is extremely ill conditioned
in this case, and there is little difference between the linear rate (3.17) and the
sublinear rate (3.16).
All of these bounds depend on knowledge of L. What happens when we do
not know L? Even when we do know it, is the steplength αk ≡ 1/L good in
practice? We have reason to suspect not, since the inequality (3.5) on which it
is based uses the conservative global upper bound L on curvature. (A sharper
bound could be obtained in terms of the curvature in the neighborhood of the
current iterate x k .) In the remainder of this chapter, we expand our view to
more general choices of search directions and steplengths.
3.3 Descent Methods: Convergence

We now analyze general descent methods of the form
$$x^{k+1} = x^k + \alpha_k d^k, \tag{3.18}$$
provided that $d^k$ and $\alpha_k$ satisfy certain intuitive properties. Specifically, we
show that an inequality of the following form holds, for some constant $C > 0$:
$$f(x^{k+1}) \le f(x^k) - C\,\|\nabla f(x^k)\|^2. \tag{3.19}$$
The remainder of the analyses in the previous section used properties about
the function f itself that were independent of the algorithm: smoothness,
convexity, and strong convexity. For a general descent method, we can provide
similar analyses based on the property (3.19).
What can we say about the sequence of iterates {x k } generated by a scheme
that guarantees (3.19)? The following elementary theorem shows one basic
property.
Theorem 3.5 Suppose that f is bounded below, with Lipschitz continuous
gradient. Then all accumulation points $\bar x$ of the sequence $\{x^k\}$ generated by a
scheme that satisfies (3.19) are stationary; that is, $\nabla f(\bar x) = 0$. If, in addition,
f is convex, each such $\bar x$ is a solution of (3.1).
Proof Note first from (3.19) that
For the case in which f is strongly convex with modulus m (and unique
solution x ∗ ), we can combine (3.12) with (3.19) to deduce that
and we obtain the result by taking the inverse of both sides in this bound and
using $\epsilon_T = f(x^T) - f(x^*)$.
$$0 < \bar\epsilon \le \frac{-(d^k)^T\nabla f(x^k)}{\|\nabla f(x^k)\|\,\|d^k\|}, \tag{3.22a}$$
$$0 < \gamma_1 \le \frac{\|d^k\|}{\|\nabla f(x^k)\|} \le \gamma_2. \tag{3.22b}$$
Condition (3.22a) says that the angle between $-\nabla f(x^k)$ and $d^k$ is acute and
bounded away from π/2 for all k, and condition (3.22b) ensures that $d^k$ and
$\nabla f(x^k)$ are not too much different in length. (If $x^k$ is a stationary point, we
have $\nabla f(x^k) = 0$, so our algorithm will set $d^k = 0$ and terminate.)
For the negative gradient (steepest-descent) search direction $d^k =
-\nabla f(x^k)$, the conditions (3.22) hold trivially, with $\bar\epsilon = \gamma_1 = \gamma_2 = 1$.
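The two conditions (3.22) are cheap to test for a candidate direction; a small sketch (not from the text):

```python
import numpy as np

def satisfies_322(g, d, eps_bar, gamma1, gamma2):
    """Check the angle condition (3.22a) and the length condition (3.22b)
    for g = grad f(x^k) and a candidate search direction d."""
    cos_angle = -(d @ g) / (np.linalg.norm(g) * np.linalg.norm(d))
    ratio = np.linalg.norm(d) / np.linalg.norm(g)
    return cos_angle >= eps_bar and gamma1 <= ratio <= gamma2

g = np.array([1.0, 2.0])
print(satisfies_322(g, -g, 1.0, 1.0, 1.0))   # steepest descent: True
```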
We can use Taylor’s theorem to bound the change in f when we move along
$d^k$ from the current iterate $x^k$. By setting $x = x^k$ and $d = d^k$ in (3.4), we
obtain
$$\begin{aligned} f(x^{k+1}) = f(x^k + \alpha d^k) &\le f(x^k) + \alpha\,\nabla f(x^k)^T d^k + \frac{L}{2}\,\alpha^2\|d^k\|^2 \\ &\le f(x^k) - \alpha\,\bar\epsilon\,\|\nabla f(x^k)\|\,\|d^k\| + \frac{L}{2}\,\alpha^2\|d^k\|^2 \\ &\le f(x^k) - \alpha\left(\bar\epsilon - \frac{L}{2}\,\alpha\gamma_2\right)\|\nabla f(x^k)\|\,\|d^k\|, \end{aligned} \tag{3.23}$$
where we used (3.22) for the last two inequalities. It is clear from this
expression that for all values of α sufficiently small – to be precise, for
$\alpha \in (0,\, 2\bar\epsilon/(L\gamma_2))$ – we have $f(x^{k+1}) < f(x^k)$ – unless, of course, $x^k$ is a
stationary point.
Fixed Steplength. As we have seen in Section 3.2, fixed steplengths can yield
useful convergence results. One drawback of the fixed steplength approach is
that some prior information is needed to properly choose the steplength.
The first approach to choosing a fixed steplength (one commonly used in
machine learning, where the steplength is often known as the “learning rate”)
is trial and error. Extensive experience in applying gradient (or stochastic
gradient) algorithms to a particular class of problems may reveal that a par-
ticular steplength is reliable and reasonably efficient. Typically, a reasonable
heuristic is to pick α as large as possible such that the algorithm does not
diverge. In some sense, this approach is estimating the Lipschitz constant of the
gradient of f by trial and error. Slightly enhanced variants are also possible;
for example, αk may be held constant for many successive iterations and then
decreased periodically. Since such schemes are highly application and problem
dependent, we cannot say much more about them here.
A second approach, a special case of which was investigated already in
Section 3.2, is to base the choice of αk on knowledge of the global properties
of the function f , particularly on the Lipschitz constant L for the gradient (see
(2.7)) or the modulus of convexity m (see (2.18)). Given the expression (3.23),
for example, and supposing we have estimates of all the quantities $\bar\epsilon$, $\gamma_2$, and L
that appear therein, we could choose α to maximize the coefficient of the last
term. Setting $\alpha = \bar\epsilon/(L\gamma_2)$, we obtain from (3.23) and (3.22) that
$$f(x^{k+1}) \le f(x^k) - \frac{\bar\epsilon^{\,2}}{2L\gamma_2}\,\|\nabla f(x^k)\|\,\|d^k\| \le f(x^k) - \frac{\bar\epsilon^{\,2}\gamma_1}{2L\gamma_2}\,\|\nabla f(x^k)\|^2. \tag{3.24}$$
Exact Line Search. Another possibility is to choose $\alpha_k$ to solve
$$\min_{\alpha > 0}\; f(x^k + \alpha d^k). \tag{3.25}$$
Approximate Line Search. In full generality, exact line searches are expen-
sive and unnecessary. Better empirical performance is achieved by approx-
imate line search. Much research in the 1970s and 1980s focused on finding
conditions that should be satisfied by approximate line searches so as to
guarantee good convergence properties, and on identifying line-search
procedures that find such approximate solutions economically. (By
“economically,” we mean that an average of three or fewer evaluations of f
is required.) One popular pair of conditions that the approximate minimizer
α = αk is required to satisfy, called the weak Wolfe conditions, is as
follows:
$$f(x^k + \alpha d^k) \le f(x^k) + c_1\,\alpha\,\nabla f(x^k)^T d^k, \tag{3.26a}$$
$$\nabla f(x^k + \alpha d^k)^T d^k \ge c_2\,\nabla f(x^k)^T d^k. \tag{3.26b}$$
Here, c1 and c2 are constants that satisfy 0 < c1 < c2 < 1. The condition
(3.26a) is often known as the “sufficient decrease condition,” because it ensures
that the actual amount of decrease in f is at least a multiple c1 of the amount
predicted by the first-order Taylor expansion of f around $x^k$.
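A standard way to find a step satisfying the sufficient decrease condition (3.26a) is backtracking, sketched below (not from the text; note that backtracking alone does not enforce the second condition (3.26b), which requires a bracketing procedure):

```python
def backtracking(f, grad_f, x, d, c1=1e-4, rho=0.5, alpha=1.0):
    """Shrink alpha until (3.26a) holds:
    f(x + alpha*d) <= f(x) + c1 * alpha * grad_f(x)^T d.
    Assumes d is a descent direction, so the slope below is negative
    and the loop terminates."""
    fx = f(x)
    slope = grad_f(x) @ d
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha
```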