
Optimization for Data Science

Lecture Notes, FS 23

Bernd Gärtner and Niao He, ETH


Martin Jaggi, EPFL

June 25, 2023


Contents

1 Introduction 2

2 Theory of Convex Functions 40

3 Gradient Descent 91

4 Projected Gradient Descent 114

5 Coordinate Descent 127

6 Nonconvex functions 142

7 The Frank-Wolfe Algorithm 161

8 Newton’s Method 178

9 Quasi-Newton Methods 190

10 Subgradient Methods 208

11 Mirror Descent, Smoothing, Proximal Algorithms 227

12 Stochastic Optimization 248

13 Finite Sum Optimization 262

14 Min-Max Optimization 277

Chapter 1

Introduction

Contents
1.1 About this lecture . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Algorithms in Data Science . . . . . . . . . . . . . . . . . . . 6
1.2.1 Case Study: Learning the minimum spanning tree . . 7
1.3 Expected risk minimization . . . . . . . . . . . . . . . . . . . 8
1.3.1 Running example: Learning a halfplane . . . . . . . . 10
1.4 Empirical risk minimization . . . . . . . . . . . . . . . . . . . 11
1.5 Empirical versus expected risk . . . . . . . . . . . . . . . . . 13
1.5.1 A counterexample . . . . . . . . . . . . . . . . . . . . 15
1.6 The map of learning . . . . . . . . . . . . . . . . . . . . . . . 16
1.6.1 What are the guarantees? . . . . . . . . . . . . . . . . 20
1.7 Vapnik-Chervonenkis theory . . . . . . . . . . . . . . . . . . 22
1.7.1 The growth function . . . . . . . . . . . . . . . . . . . 22
1.7.2 The result . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.8 Distribution-dependent guarantees . . . . . . . . . . . . . . . 26
1.9 Worst-case versus average-case complexity . . . . . . . . . . 28
1.10 The story of the simplex method . . . . . . . . . . . . . . . . 30
1.10.1 Initial wandering . . . . . . . . . . . . . . . . . . . . . 30
1.10.2 Worst-case complexity . . . . . . . . . . . . . . . . . . 31
1.10.3 Average-case complexity . . . . . . . . . . . . . . . . 32
1.10.4 Smoothed complexity . . . . . . . . . . . . . . . . . . 33
1.10.5 An open end . . . . . . . . . . . . . . . . . . . . . . . . 36
1.11 The estimation-optimization tradeoff . . . . . . . . . . . . . . 36

1.12 Further listening . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

1.1 About this lecture
These are lecture notes for the ETH course 261-5110-00L Optimization for
Data Science. This course has partially been co-developed with the EPFL
course CS-439 Optimization for Machine Learning; the two courses
share some of their content but also differ in some material.
This course provides an in-depth theoretical treatment of classical and
modern optimization methods that are relevant in data science. The em-
phasis is on the motivations and design principles behind the algorithms,
on provable performance bounds, and on the mathematical tools and tech-
niques to prove them. The goal is to equip students with a fundamental
understanding about why optimization algorithms work, and what their
limits are. This understanding will be of help in selecting suitable algo-
rithms in a given application, but providing concrete practical guidance is
not our focus.
In this first chapter, we discuss the role of optimization for Data Sci-
ence. This turns out to be quite different from the classical role where an
optimization algorithm (for example Kruskal’s) solves a well-defined op-
timization problem (find the minimum spanning tree). In Data Science,
we have learning problems, and they are not always well-defined. Opti-
mization typically happens on training data, but even a perfect result may
fail to solve the underlying learning problem. This is not a failure of the
optimization algorithm but of the model in which it was applied. In Data
Science, optimization is merely one ingredient in solving a learning prob-
lem. While this course focuses on optimization, it does not turn you into
a holistic data scientist. But it will allow you to better understand and ex-
plore the (after all not so small) optimization corner in the Data Science
landscape.
In his article 50 years of Data Science, David Donoho outlines the six
divisions of what he calls Greater Data Science [Don17]:
1. Data Gathering, Preparation, and Exploration
2. Data Representation and Transformation
3. Computing with Data
4. Data Modeling
5. Data Visualization and Presentation

6. Science about Data Science

Even Computing with Data (which some may think Data Science
is mainly about) is only one of six divisions, and optimization is a sub-
division of that. But optimization is important also in Data Modeling:
towards being able to actually optimize a given model, the model designer
must already be aware of the computational methods that are available,
what they can in principle achieve, and how efficiently they can do this.
The classical worst-case complexity of an optimization algorithm is rarely
the right measure of applicability in practice. In Data Modeling, one also
needs to make sure that the result of the optimization is meaningful for
the learning problem at hand.
To summarize the state of affairs at this point: optimization is an en-
abling technology in Data Science, but it’s not what Data Science is about.
Optimization algorithms should not be used blindly, by confusing the op-
timization problem being solved with the ground truth. The optimization
problem helps to solve a learning problem, and the latter is the ground
truth. The same learning problem can in principle be modeled with many
different optimization problems. What counts in the end is whether the
result of the optimization is useful towards learning.
In the remainder of this chapter, we elaborate on this high-level sum-
mary. We start by framing the minimum spanning tree problem as a learn-
ing problem and present some surprising implications that this has for
the classical minimum spanning tree algorithms. Then we introduce the
predominant application of optimization in Data Science, namely empiri-
cal risk minimization as a way to learn from training data. We concisely
explain overfitting, underfitting, generalization, regularization and early
stopping. We then present two classical theorems as showcases for distribution-
independent and distribution-dependent techniques to guarantee that em-
pirical risk minimization entails learning. Finally, we talk about the role
of computational complexity in Data Science, by discussing the evolu-
tion from classical worst-case complexity over average-case complexity to
smoothed complexity. The simplex method for linear programming serves
as a prime example to document this evolution.

1.2 Algorithms in Data Science
The classical picture of an algorithm, as portrayed in many textbooks, is
that of a computational procedure that transforms input into output. Here
is an example of how Cormen, Leiserson, Rivest, and Stein describe such
a transformation in their Introduction to Algorithms [CLRS09, Section 23.1]:
Assume that we have a connected, undirected graph G =
(V, E) with a weight function w : E → R and wish to find a
minimum spanning tree for G.
Then they present two algorithms for solving this problem, Kruskal’s
algorithm, and Prim’s algorithm. Kruskal’s algorithm maintains a forest (ini-
tially empty), and in each step adds an edge of minimum weight that con-
nects two trees in the forest. Prim’s algorithm maintains a tree (initially
consisting of an arbitrary vertex), and in each step adds an edge of mini-
mum weight that connects the tree to its complement.
Both algorithms are analyzed in terms of their worst case performance.
Kruskal’s algorithm with a suitable union-find data structure needs time
O (|E| log |E|). Prim’s algorithm takes time O (|E| log |V |) which can be im-
proved to O (|E| + |V | log |V |) using Fibonacci Heaps. So Prim’s algorithm
is faster for dense graphs. Case closed.
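For concreteness, here is a minimal Python sketch of Kruskal’s algorithm with a union-find data structure (union by rank plus path compression); the edge-list input format and the toy graph at the end are just one possible choice for illustration.

```python
# A minimal sketch of Kruskal's algorithm with union-find (path compression
# and union by rank). The (weight, u, v) edge-list format is an assumption
# made for this illustration.

def kruskal(num_vertices, edges):
    """edges: list of (weight, u, v) with vertices 0..num_vertices-1."""
    parent = list(range(num_vertices))
    rank = [0] * num_vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return False
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx
        parent[ry] = rx
        if rank[rx] == rank[ry]:
            rank[rx] += 1
        return True

    tree = []
    for w, u, v in sorted(edges):       # O(|E| log |E|) for the sort
        if union(u, v):                 # add the edge iff it joins two trees
            tree.append((u, v, w))
    return tree

# toy example: a 4-cycle with one diagonal
print(kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]))
```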
In Data Science, the viewpoint is very different. Here, the starting point
are data which we want to explain in some (not always prespecified) way.
An algorithm is a tool to help us in finding such an explanation. The data
typically come from measurements or experiments, and there is not a sin-
gle input, but a typically unknown input distribution from which a mea-
surement or an experiment draws a (noisy) sample. In such a situation,
we are not interested in explaining a concrete sample, but we want to ex-
plain the nature of the distribution. And this changes the way in which
we should evaluate an algorithm.
The prevalent quality measure is how good an explanation is in expec-
tation (over the input distribution). If we don’t know the distribution, we
are stuck with the empirical approach: sample a finite number of inputs
(these are the training data), and based on them, construct an explanation
for the whole input distribution. An algorithm is evaluated according to
how good this explanation is. Classical criteria such as runtime or space
are considered only to ensure that we can actually run the algorithm effi-
ciently on possibly huge data.

We can summarize the situation as follows: in Data Science, an algo-
rithm solves a learning problem (finding an explanation for data), not a clas-
sical computational problem in which every input must be mapped to a
specified output.

1.2.1 Case Study: Learning the minimum spanning tree


For the minimum spanning tree problem, the above Data Science view-
point changes the picture quite radically as shown in a case study by Gron-
skiy [Gro18, Chapter 4]. The setup is that there are ground truth weights
which are all roughly the same, but they are subject to random noise which
has the potential to produce different orderings of edges by weight.
As a consequence, we may see different minimum spanning trees for
different samples drawn from the induced weight distribution. The de-
sired explanation of the data is the unknown ground truth minimum span-
ning tree. So we want an algorithm that produces a spanning tree as
close to the ground truth as possible (measured in the number of common
edges, say).
If the noise is negligible, we can solve the problem perfectly from one
sample, using any algorithm that computes the minimum spanning tree
of the sample. But as the noise becomes more and more significant, this
algorithm inevitably produces worse and worse explanations. Whether
Kruskal’s or Prim’s algorithm is used doesn’t make any difference.
But the picture becomes more interesting once we have two samples
X′, X″ (formally, these are weighted graphs with the same underlying
undirected graph). Consider running algorithm A (think of Kruskal’s or
Prim’s) in parallel on these two different noisy versions of the ground
truth. In each step, one edge is added to each of the two trees. After
step t, let At(X′) and At(X″) be the sets of spanning trees that are still
possible for X′ and X″, taking the edges already processed into account. We
have A0(X′) = A0(X″), the set of all spanning trees, but as t grows, the sets
At(X′) and At(X″) (and also their intersection) are shrinking.
There is a sweet spot (a value of t) that maximizes the joint information,
defined as the probability that two spanning trees, chosen uniformly at
random from At(X′) and At(X″), respectively, are equal. At this point,
we stop the algorithm and return a spanning tree randomly chosen from
At(X′) ∩ At(X″) as our explanation.

It can be shown experimentally that this early stopping works quite well
(compared to standard attempts to learn the ground truth minimum span-
ning tree from two samples). More interestingly, it now matters which al-
gorithm is chosen, not because of runtime as in the classical setting, but
because of explanation quality. As Gronskiy shows, Kruskal’s algorithm
with early stopping leads in expectation to a much better explanation than
Prim’s algorithm with early stopping.
In fact, Gronskiy shows that a third algorithm (not discussed by Cor-
men, Leiserson, Rivest and Stein at all, due to its abysmal worst-case com-
plexity) is even better than Kruskal’s algorithm. This is the Reverse Delete
algorithm. It starts with the full graph, and in every step it deletes the
edge of maximum weight that does not disconnect the graph.
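The following is a minimal sketch of Reverse Delete; using the networkx package for the connectivity test is our own choice, and any other connectivity check would do.

```python
# A minimal sketch of the Reverse Delete algorithm: repeatedly delete the
# heaviest edge whose removal keeps the graph connected.
import networkx as nx

def reverse_delete(graph: nx.Graph) -> nx.Graph:
    """Return a minimum spanning tree of a connected weighted graph."""
    tree = graph.copy()
    # process edges from heaviest to lightest
    for u, v, w in sorted(tree.edges(data="weight"), key=lambda e: -e[2]):
        tree.remove_edge(u, v)
        if not nx.is_connected(tree):   # deleting would disconnect: put it back
            tree.add_edge(u, v, weight=w)
    return tree
```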
The takeaway message at this point is that classical algorithms and Data
Science algorithms have quite different design goals, and this in particular
holds in optimization. In the following sections, we will elaborate on this
in some more detail, with a focus on optimization.

1.3 Expected risk minimization


Optimization is a key ingredient in the process of learning from data. To
make this point, we start by introducing a general framework that encom-
passes many concrete learning problems, and we frame the learning task
as an idealized optimization problem that typically cannot be solved due
to lack of knowledge.
We have a data source X (a probability space, equipped with a prob-
ability distribution that we in general do not know). But we can tap the
source by drawing independent samples X1 , X2 , . . . ∼ X , as many as we
want (or can afford). The notation X ∼ X stands for “X randomly chosen
from X , according to the probability distribution over X ”. Our goal is to
explain X through these samples (we deliberately leave this vague, as it
can mean many things).
Formally, a probability space is a triple consisting of a ground set, a
sigma algebra (set of events), and a probability distribution function. It is
common to abuse notation by identifying X with the ground set. We will
also do this in the following.
As a concrete example, consider the situation where X = {0, 1} models
a biased coin flip (1 = head, 0 = tail), meaning that head comes up with

(unknown) probability p? for X ∼ X . The desired explanation of X is the
bias p? . By the law of large numbers, we have p? = limn→∞ |{i : Xi = 1}|/n,
but we will not be able to compute this limit via finitely many samples.
When sampling X1 , X2 , . . . , Xn ∼ X , we can only hope to approximate p? .
Concretely, the Chernoff bound tells us how large n needs to be in order
for the empirical estimate |{i : Xi = 1}|/n to yield a good approximation
of p? with high probability.
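As a quick illustration, the following simulation sketch (the particular bias and sample sizes are chosen only for the demo) shows the empirical estimate approaching the unknown bias:

```python
# Empirical estimation of the bias of a coin: the relative frequency of heads
# concentrates around p* as n grows.
import numpy as np

rng = np.random.default_rng(0)
p_star = 0.3                              # unknown bias, chosen here for the demo
for n in (10, 100, 1000, 10000):
    flips = rng.random(n) < p_star        # X_1, ..., X_n ~ Bernoulli(p*)
    print(n, flips.mean())                # empirical estimate of p*
```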
In our abstract framework, we have a class H of hypotheses that we
can think of as possible explanations of X , and our goal is to select the
hypothesis that best explains X . To quantify this, we use a risk or loss
function ` : H × X → R that says by how much we believe that a given
hypothesis H ∈ H fails to explain given data X ∈ X . Then we consider
the expected risk
\ell(H) := \mathbb{E}_X[\ell(H, X)]   (1.1)
as the measure of quality of H for the whole data source (other quality
measures are conceivable, but expected risk is the standard one). Formally,
our goal is therefore to find a hypothesis with smallest expected risk, i.e.

H^\star = \operatorname{argmin}_{H \in \mathcal{H}} \ell(H),   (1.2)

assuming that such an optimal hypothesis exists. To put the biased coin
example into this framework, we choose hypothesis class H = [0, 1] (the
candidates for the bias p? ) and loss function `(H, X) = (X − H)2 . In Ex-
ercise 1, you will analyze this case and in particular prove that a unique
optimal hypothesis H ? as in (1.2) exists and equals p? , the bias of the coin.
So mathematically, we may be able to argue about expected risk.
But computationally, we are typically at a loss (no pun intended). Not
knowing the probability distribution over X , we cannot compute the ex-
pected risk `(H) of a hypothesis H as in (1.1), let alone find H ? as in (1.2).
Having access to X only through finitely many samples, we generally can-
not expect to perfectly explain X .
Alternatively, and more realistically, we can try to be probably approxi-
mately correct (PAC). This means the following: given any tolerances δ, ε >
0, we can produce with probability at least 1 − δ a hypothesis H̃ ∈ H such
that
\ell(\tilde{H}) \le \inf_{H \in \mathcal{H}} \ell(H) + \varepsilon.   (1.3)

This means that with high probability, we approximately solve the opti-
mization problem (1.2). Here, the probability is over the joint probability
distribution of the individual samples. So formally H̃ is a (hypothesis-
valued) random variable. If the algorithm producing H̃ is randomized,
the random variable additionally depends on the random choices made
by the algorithm.

1.3.1 Running example: Learning a halfplane


Since the biased coin flip is a toy example not worth an abstract frame-
work, we also want to introduce a “real” learning problem here. Let X =
R2 × {0, 1}. The source X comes with an unknown halfplane H ? . A half-
plane is a set of the form H = {(x1 , x2 ) ∈ R2 : ax1 + bx2 ≤ c} for given
parameters a, b, c ∈ R. We can think of a halfplane as a classifier for points
in R2 . One class consists of the points in the halfplane, the other class of
the points in the halfplane’s complement; see Figure 1.1. Our goal is to
learn the unknown classifier H ? , or at least get close to it. The hypothesis
class H is the set of all halfplanes.

Figure 1.1: A halfplane in R2

Every sample X = (x, y) ∈ X tells us whether the sampled point x is


in the unknown halfplane or not. Concretely, y = I({x ∈ H ? }), where I
denotes the indicator function of an event.
The probability distribution of the source X is therefore in fact a dis-
tribution over R2 from which we can sample points x, and the label y =
I({x ∈ H ? }) ∈ {0, 1} is a dependent variable. This is the scenario of su-
pervised learning where we obtain labeled training samples from the data
source.
Our loss function ` is very simple in this example: For a given halfplane
H and given sample X = (x, y), the value of `(H, X) tells us whether x is
misclassified by H (loss 1) or not (loss 0). This is known as the 0-1-loss and

is formally defined as

\ell(H, X) = I(\{I(\{x \in H\}) \ne y\}) = I(\{x \in H \oplus H^\star\}),   (1.4)

where ⊕ denotes symmetric difference.


The expected risk ℓ(H) of a halfplane is then the probability that H
misclassifies a point x randomly sampled from X:

\ell(H) = \operatorname{prob}(H \oplus H^\star).

Halfplane H* solves the expected risk minimization problem (1.1), with
value ℓ(H*) = 0. Through finitely many samples, we cannot expect to fully
nail down H*, but we can hope to find a PAC halfplane H̃ as in (1.3),
meaning that with high probability, the measure of misclassified points is
small; see Figure 1.2.

H?

Figure 1.2: An almost optimal halfplane in R2 ; the gray region H̃ ⊕ H ? of


misclasified points has small measure.

1.4 Empirical risk minimization


Empirical risk minimization is arguably the most prominent “customer”
of optimization in Data Science. As outlined in the previous Section, we
would like to solve the expected risk minimization problem (1.1) or its
approximate version (1.3). But the crucial data we need for that, namely
the expected risks `(H), H ∈ H are not available to us. But what we can
do is draw n independent samples X1 , X2 , . . . , Xn from X which form our
training data. Given these data, we compute the empirical risk
\ell_n(H) = \frac{1}{n} \sum_{i=1}^{n} \ell(H, X_i)   (1.5)

of hypothesis H. Note that ℓn(H) is a random variable. In probability, the
sequence (ℓn(H))_{n∈N} converges to ℓ(H). We know this from rolling a die
n times: as n → ∞, the average number of pips converges to its expected
value 3.5.
Formally, convergence in probability means the following.

Lemma 1.1 ((Weak) law of large numbers). Let H ∈ H be a hypothesis. For
any δ, ε > 0, there exists n_0 ∈ N such that for n ≥ n_0,

|\ell_n(H) - \ell(H)| \le \varepsilon   (1.6)

with probability at least 1 − δ.

In the language of the previous section, the empirical risk of a hypoth-


esis is a PAC estimate of its expected risk. This motivates empirical risk
minimization as a proxy for expected risk minimization (1.3). For n ∈ N
and given training data X1, X2, ..., Xn from X, empirical risk minimization
attempts to produce a hypothesis H̃n such that

\ell_n(\tilde{H}_n) \le \inf_{H \in \mathcal{H}} \ell_n(H) + \varepsilon.   (1.7)

This may still be a hard problem, but at least, we have all the data that we
need in order to solve it. This suggests an algorithmic view of empirical
risk minimization as a mapping from input to output.

Input: training data X1 , X2 , . . . , Xn


Output: an almost optimal hypothesis H̃n as in (1.7)

The output H̃n is a (hypothesis-valued) random variable (possibly also


depending on random choices made by the algorithm that computes H̃n ,
but this does not matter for our discussion).
In the coin flip example where a hypothesis H ∈ [0, 1] is a candidate
for the bias p? , and the Xi ∈ {0, 1} are biased coin flips, we have used loss
function `(H, X) = (X − H)2 . Hence
n
1X
`n (H) = (Xi − H)2 .
n i=1

High school calculus shows that this is (exactly and unsurprisingly) mini-
mized by

\tilde{H}_n = \frac{1}{n} \sum_{i=1}^{n} X_i,

the relative frequency of heads in a sequence of n biased coin flips.
In our running example from Section 1.3.1, X1 , X2 , . . . , Xn are points
in R2 , labeled with either 1 or 0, signifying whether the point is in the
unknown halfplane or not. For n = 9, we may therefore see a picture as in
Figure 1.3 (left).

Figure 1.3: Left: training data (filled circles are the ones in the unknown
halfplane H*); Right: a halfplane H̃9 with minimum empirical risk
ℓ9(H̃9) = 0.

In this case, empirical risk minimization is easy to do. We know that


the training samples with label 1 are separated from the ones with label 0
by the line bounding the unknown halfplane H ? . Hence, we can select any
separating line and achieve optimal empirical risk 0. Such a separating line
can efficiently be found through linear programming. With loss functions
other than the 0-1-loss, separating hyperplanes may differ in empirical risk
and it may matter which one we choose; here, they are all the same.
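To make the linear programming step concrete, here is a minimal sketch of one possible formulation (our own construction, not necessarily the one used later in the notes): we maximize a margin t, and any solution with t > 0 is a separating line with empirical risk 0 under the 0-1-loss.

```python
# A minimal sketch: find a separating line a*x1 + b*x2 <= c via linear
# programming, by maximizing a margin t with a, b kept bounded.
import numpy as np
from scipy.optimize import linprog

def separating_halfplane(points, labels):
    """points: (n, 2) array; labels[i] = 1 if the point should satisfy
    a*x1 + b*x2 <= c, and 0 otherwise. Returns (a, b, c)."""
    A_ub, b_ub = [], []
    for (x1, x2), y in zip(points, labels):
        if y == 1:   # a*x1 + b*x2 - c + t <= 0
            A_ub.append([x1, x2, -1.0, 1.0])
        else:        # -(a*x1 + b*x2) + c + t <= 0
            A_ub.append([-x1, -x2, 1.0, 1.0])
        b_ub.append(0.0)
    # variables (a, b, c, t); maximize t, keep a and b bounded to normalize
    res = linprog(c=[0, 0, 0, -1], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-1, 1), (-1, 1), (None, None), (0, None)])
    a, b, c, t = res.x
    assert t > 0, "training data are not separable by a halfplane"
    return a, b, c

pts = np.array([[0, 0], [1, 0], [0, 1], [2, 2], [3, 1]])
lab = np.array([1, 1, 1, 0, 0])
print(separating_halfplane(pts, lab))
```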

1.5 Empirical versus expected risk


Empirical risk minimization produces an (approximate) empirical risk min-
imizer H̃n according to (1.7). In an ideal world, H̃n is at the same time PAC
for the minimum expected risk (as n → ∞). Let us make it crystal-clear
again what this formally means: H̃n is a random variable (depending on
the training data X1 , X2 , . . . , Xn ), and we want that this random variable

satisfies (1.3):
\ell(\tilde{H}_n) \le \inf_{H \in \mathcal{H}} \ell(H) + \varepsilon

with probability at least 1 − δ.


In the coin flip example, we have seen that H̃n is the relative frequency
of heads in a sequence X1 , X2 , . . . , Xn of n biased coin flips, and using the
formula for ℓ(H) in Exercise 1 (i), we compute

\ell(\tilde{H}_n) = \ell(p^\star) + \left( p^\star - \frac{1}{n} \sum_{i=1}^{n} X_i \right)^2.

Now the Chernoff bound ensures that for n sufficiently large, the expected
risk of H̃n is a good approximation of the minimum expected risk, with
high probability.
In general, the law of large numbers (1.6) seems to ensure that we are
living in an ideal world, just with adapted tolerances. Indeed, let H̃ be
some approximately optimal hypothesis w.r.t. expected risk, meaning that
ℓ(H̃) ≤ inf_{H∈H} ℓ(H) + ε. Then we get

\ell(\tilde{H}_n)
  \overset{(1.6)}{\le} \ell_n(\tilde{H}_n) + \varepsilon
  \overset{(1.7)}{\le} \inf_{H \in \mathcal{H}} \ell_n(H) + 2\varepsilon
  \le \ell_n(\tilde{H}) + 2\varepsilon
  \overset{(1.6)}{\le} \ell(\tilde{H}) + 3\varepsilon
  \le \inf_{H \in \mathcal{H}} \ell(H) + 4\varepsilon.   (1.8)

Here, the inequalities using (1.6) hold with probability at least 1 − δ each,
while the other ones are certain, so the whole chain of inequalities holds
with probability at least 1 − 2δ.
If you think that this derivation was complete nonsense, you are
right.
But where exactly is the problem? It’s in the first inequality; all the
other ones are correct.
The problem is that we cannot apply the law of large numbers to a data-
dependent hypothesis H̃n . In (1.6), we need to fix a hypothesis and can then
argue that its empirical risk converges in probability to its expected risk.
This is what we do with H̃ in the second inequality using (1.6).
But in the first inequality, H̃n depends on the data and was deliberately
chosen such that the empirical risk is minimized for the given data. As

we show in Section 1.5.1 below, this may lead to the empirical risk being
much smaller than the expected risk, so that the crucial first inequality
does not hold.
Before doing so, let us establish a sufficient condition (a uniform version
of the law of large numbers) under which the above derivation is sound.
It does not hold in general, but if it holds, we are indeed in an ideal world:
minimizing the empirical risk also minimizes the expected risk.
Theorem 1.2. Suppose that for any δ, ε > 0, there exists n_0 ∈ N such that for
n ≥ n_0,

\sup_{H \in \mathcal{H}} |\ell_n(H) - \ell(H)| \le \varepsilon   (1.9)

with probability at least 1 − δ. Then, for n ≥ n_0, an approximate empirical risk
minimizer H̃n as in (1.7) is PAC for expected risk minimization, meaning that it
satisfies

\ell(\tilde{H}_n) \le \inf_{H \in \mathcal{H}} \ell(H) + 3\varepsilon

with probability at least 1 − δ.


Proof. By (1.9),

|\ell_n(\tilde{H}_n) - \ell(\tilde{H}_n)| \le \sup_{H \in \mathcal{H}} |\ell_n(H) - \ell(H)| \le \varepsilon,   (1.10)

with probability at least 1 − δ. Then we get

\ell(\tilde{H}_n)
  \overset{(1.10)}{\le} \ell_n(\tilde{H}_n) + \varepsilon
  \overset{(1.7)}{\le} \inf_{H \in \mathcal{H}} \ell_n(H) + 2\varepsilon
  \overset{(1.9)}{\le} \inf_{H \in \mathcal{H}} \ell(H) + 3\varepsilon   (1.11)

with probability at least 1 − δ.

1.5.1 A counterexample
We provide a simple (artificial) example to show that the inequality `(H̃n ) ≤
`n (H̃n ) + ε in our false derivation (1.8) can in general not be achieved. As
a consequence, the uniform law of large numbers (1.9) also fails in this
example.

Let X = [0, 1], equipped with the uniform distribution. The set of hy-
potheses H consists of certain subsets of X , and for H ∈ H, X ∈ X , we let
ℓ(H, X) = I({X ∉ H}).
Concretely, we choose H such that it satisfies two properties:

(i) every H ∈ H has length (Lebesgue measure) 1/2, so that `(H) = 1/2
for all H ∈ H;

(ii) for all training data X1 , X2 , . . . , Xn ∈ X , there exists H̃n ∈ H with


H̃n ⊇ {X1 , X2 , . . . , Xn }, so that `n (H̃n ) = 0.

Hence, `(H̃n ) ≤ `n (H̃n ) + ε fails for small ε.


We could choose H as the collection of all subsets of [0, 1] of length
1/2, but there is also a way of making H countable; see Exercise 2. H be-
ing infinite is necessary; for every finite set of hypotheses, a simple union
bound deduces the uniform law of large numbers (1.9) from the version
(1.6) with probability at least 1 − δ|H|, and then Theorem 1.2 implies that
the minimum expected risk converges to the minimum empirical risk.
In this example, empirical risk minimization still manages to compute
an optimal hypothesis H ? , one that minimizes the expected risk as in (1.2):
Since `(H) = 1/2 for all H ∈ H, every hypothesis is optimal, so there is
simply no chance to make a mistake.
But we can easily tweak the example such that an empirical risk mini-
mizer may not be PAC for minimum expected risk. For this, we add to H
sets of lengths 1/4, say, such that property (ii) above also holds for them.
Then there is always a hypothesis H̃n such that `n (H̃n ) = 0 and `(H̃n ) = 3/4
which is not PAC, since inf H∈H `(H) = 1/2 still.
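A small simulation sketch of this tweaked example follows; the concrete "memorizing" hypothesis below (tiny intervals around the training points, padded to total length roughly 1/4) is our own choice.

```python
# A hypothesis that memorizes the training points: empirical risk 0,
# but expected risk close to 3/4 under the loss I({X not in H}).
import numpy as np

rng = np.random.default_rng(1)
n = 50
train = np.sort(rng.random(n))
eps = 1e-6                                     # width of each memorizing interval
intervals = [(x - eps / 2, x + eps / 2) for x in train]
intervals.append((0.5, 0.5 + 0.25 - n * eps))  # pad to total length about 1/4

def in_H(x):
    return any(a <= x <= b for a, b in intervals)

def risk(points):                              # 0-1 loss: 1 if x is NOT in H
    return np.mean([0.0 if in_H(x) else 1.0 for x in points])

print("empirical risk:", risk(train))                        # 0.0
print("estimated expected risk:", risk(rng.random(20_000)))  # about 0.75
```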

1.6 The map of learning


Whenever we select a data-dependent hypothesis Hn (by an algorithm that
maps the training data X1 , X2 , . . . , Xn to a hypothesis), we can visualize it
as a point in the (`n , `)-plane:

[Figure: a hypothesis Hn shown as a point in the (ℓn, ℓ)-plane.]

This is to some extent a cartoon picture. Hn is a random variable, so it


does not have fixed empirical and expected risk and therefore not a fixed
location in the (`n , `)-plane. The way we should view this picture is that
it tells us where Hn ends up for n → ∞, under the reasonable assumption
that the algorithm with high probability “homes in” on some hypothesis
as it sees more and more training data. If we adopt this view, the map that
emerges is detailed in Figure 1.4. We discuss its different “countries” and
“roads” in turn.

Algorithm. This is the computational procedure according to which a


hypothesis Hn is obtained from training data X1 , X2 , . . . , Xn . One possible
algorithm is (approximate) empirical risk minimization, but there may be
other algorithms. The algorithm controls the empirical risk and therefore
locates Hn in the `n -dimension.

Validation. Once a hypothesis Hn has been obtained through training,


one also needs to assess its expected risk, i.e. locate Hn in the `-dimension.
This is important in order to understand whether the hypothesis actually
solves the learning task at hand. Using the weak law of large numbers,
`(Hn ) can be estimated via its empirical risk on test data—fresh samples
from X that the algorithm has not seen. This process is called validation.
Using fresh samples ensures that Hn is a fixed hypothesis w.r.t. the test
data, so the weak law of large numbers actually applies.

Empirical risk minimization. The gray area above the main diagonal
(and slightly extending below) in Figure 1.4 is where the empirical risk
minimizer H̃n ends up. In words, the empirical risk `n (H̃n ) always under-
estimates the expected risk `(H̃n ), up to an error that we can make arbi-
trarily small. We have implicitly shown this before, through the correct

part of the chain of inequalities (1.8): we simply need to observe that the
last term in this chain can further be bounded by ℓ(H̃n) + 3ε.

[Figure 1.4: The map of learning. The (ℓn, ℓ)-plane, with regions for learning
(both risks low), overfitting (low ℓn, high ℓ), underfitting (high ℓn), and bad
model (high ℓn, low ℓ); a generalization band around the diagonal; the region
reached by empirical risk minimization just above the diagonal; and arrows
indicating validation, regularization, and early stopping.]

Overfitting. If our learning algorithm returns a hypothesis with low em-


pirical risk but high expected risk, we have a case of overfitting. The ex-
planation quality on the data source is much worse than on the training
data. The main cause of overfitting is that our theory (hypothesis class
H and loss function `) is so complex that it allows us to almost perfectly
explain any training data. But such a tailored explanation is unlikely to
explain unseen data from X . This is like trying to explain when your fa-
vorite sports team wins. From observing a few games (training data), one
can always find “perfect” explanations of the form “player X wore black
socks in exactly the won games.” Unless you are superstitious, you may

find this entertaining, but you don’t seriously expect such explanations to
last for more games.

Underfitting. If the learning algorithm returns a hypothesis with high


empirical risk, we cannot even explain the training data. In this case, there
is no justified hope to be able to explain unseen data, and we have a case
of underfitting. The main cause of underfitting is that our theory is too
simple to capture the nature of the data. In the sports team analogy, if
you only have the two hypotheses “the team always wins”, or “the team
always loses”, you will not be able to explain the results of a typical team.
It can in principle happen that a hypothesis with high empirical risk
has low expected risk, but then we are in the bad model region where we
have used a questionable theory.

Learning. If both empirical and expected risk are low, we can make a
case that we have learned something. We have invested some effort into
explaining the training data, and this explanation is also good for unseen
data (maybe not spectacularly so, but sufficient to gain some insights). If
you are able to predict the future outcomes of your favorite team’s matches
more reliably than a biased coin based on the current ranking, say, this is
already an achievement.

Generalization. Ideally, the expected risk is close to the empirical risk,


and if this happens, we have generalization. This means that the hypoth-
esis explains unseen data equally well as the training data, so there are no
surprises in using the explanation for the whole data source. But it does
not mean that the explanation is good. For example, an easy way to get
hold of a generalizing hypothesis is to simply ignore the training data and
always return a fixed hypothesis. By the law of large numbers, this gener-
alizes very well, but is not informative at all. In this situation, we are in the
upper right generalization region. On the other hand, if a hypothesis has
low empirical risk and generalizes, learning is implied, and this scenario is
the one that learning algorithms are shooting for.

Regularization. In the case that overfitting is observed, a possible rem-


edy is to add a regularization term r to the loss function ` with the goal

of “punishing” complex hypotheses. Typically, r has a unique minimizer
and nothing to do with the learning problem at hand.
Empirically minimizing ℓ′ = ℓ + λr for a real number λ > 0 therefore
has the effect that we introduce a bias, meaning that we deviate more and
more from our theory, with the effect that the empirical risk increases. But
as the intended consequence, the variance (sensitivity to the training data)
decreases, and this may reduce the expected risk. For large λ, the new loss
function ℓ′ is dominated by r, so the empirical risk minimizer eventually
becomes uninformative and ends up in the underfitting region. The art
is to optimize this bias-variance tradeoff: find the sweet spot λ* that mini-
mizes the expected risk. The curve in Figure 1.4 symbolically depicts the
development of the expected risk as we increase λ to move away from the
original overfitting hypothesis. This also visualizes the bias-variance trade-
off saying that in order to get smaller variance (less dependence on the
training data), we need to introduce a higher bias (a worse loss function).
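The following toy sketch (ridge regression on synthetic data; all numbers are our own choices) shows this tradeoff: as λ grows, the empirical risk increases, while the risk on fresh data first decreases and eventually increases again.

```python
# Minimizing the regularized empirical risk ||y - X b||^2 / n + lam * ||b||^2:
# a higher training (empirical) risk can buy a lower risk on fresh data.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 30, 20, 1.0
X = rng.normal(size=(n, d))
beta_star = rng.normal(size=d)
y = X @ beta_star + sigma * rng.normal(size=n)            # noisy training data
X_test = rng.normal(size=(10_000, d))
y_test = X_test @ beta_star + sigma * rng.normal(size=10_000)

for lam in (0.0, 0.01, 0.1, 1.0, 10.0, 100.0):
    # closed-form minimizer of the regularized empirical risk
    b = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    train_risk = np.mean((y - X @ b) ** 2)
    test_risk = np.mean((y_test - X_test @ b) ** 2)
    print(f"lambda={lam:7.2f}  empirical risk={train_risk:.3f}  test risk={test_risk:.3f}")
```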

Early stopping. Another way to deal with overfitting is early stopping.


Typically we are searching for the empirical risk minimizer using an itera-
tive method such as gradient descent (Chapter 3). In each step, we decrease
the empirical risk of our candidate hypothesis until we (hopefully) con-
verge to an approximately optimal hypothesis H̃n . We have seen that if
H̃n overfits, moving away from it through regularization may be benefi-
cial. This can also be achieved by not even getting to H̃n in the first place,
via early stopping of the stepwise optimization algorithm. How early we
stop is a choice we have to make, similar to choosing the parameter λ in
regularization. Symbolically, you can think of regularization and early
stopping as exploring the same bias-variance tradeoff curve in Figure 1.4,
but in opposite directions.

1.6.1 What are the guarantees?


The important takeaway at this point is the following: Although optimiza-
tion (in the form of empirical risk minimization) is widely and successfully
used in practice, it should never be used blindly.
In almost every concrete application, there are potentially many sensi-
ble theories (hypothesis classes H and loss functions `). Which one to pick
is often more an art than a science. There are conflicting goals that we have

already touched upon in discussing regularization.
On the one hand, the theory should be informative, meaning that the
way in which hypotheses are chosen and evaluated as good or bad is
meaningful in the application at hand. For example, the constant loss
function ` ≡ 0 ensures perfect generalization and learning according to
the map in Figure 1.4 but is obviously useless. On the other hand, the the-
ory should be robust, meaning that empirical risk minimization is a good
way of learning about the expected risk.
Finally, we should also be able to perform empirical risk minimization
efficiently. This computational aspect favors convex loss functions (see
Chapter 2), although these might not be the most informative ones in the
given application.
In this course, we do not cover the art of choosing the “right” theory. In
subsequent chapters, we focus on how to efficiently solve the optimization
problems that arise after this artwork has been done.
That said, there is a large body of results supporting this artwork. Over
the last fifty years, and starting long before machine learning became
mainstream, the field of statistical learning theory has paved the way to-
wards understanding why machine learning works.
In the next section, we sketch a historical cornerstone that started what
is now known as the VC (Vapnik-Chervonenkis) theory. Originally devel-
oped in the Soviet Union in the late 1960s, it became known to the rest
of the world through English translations of the original Russian articles
from 1968 and the later one from 1971, which can be considered as the
“full paper” containing all proofs [VC71].
The paper establishes a notion of complexity of the hypothesis class H
that is directly and provably related to the success of empirical risk mini-
mization under 0-1-loss. The beauty of this result is that it does not depend
on the probability distribution over the data source X . The restriction is
that it only works for the 0-1-loss (in supervised learning) which—being
nonconvex—is computationally intractable. But the main insight is con-
ceptual: there is a well-defined mathematical way of quantifying our pre-
viously informal understanding of a (too) complex theory.

1.7 Vapnik-Chervonenkis theory
Here we are in the (abstract) setting of supervised learning with two classes,
generalizing our running example from Section 1.3.1. In this setting, the
data source X is of the form X = D × {0, 1}, with an unknown probability
distribution over data D, and with samples of the form X = (x, y) where
x ∈ D and y = I({x ∈ H ? }) for an unknown subset H ? ⊆ D. The goal is to
learn H ? .
The hypothesis class H consists of candidate subsets H ⊆ D and in
particular contains H ? . For H ∈ H and X = (x, y) ∈ X , the 0-1-loss is

\ell(H, X) = I(\{I(\{x \in H\}) \ne y\}) = I(\{x \in H \oplus H^\star\}),   (1.12)

telling us whether H misclassifies x. We have `(H ? , X) = 0 for all X, and


hence the minimum expected risk is `(H ? ) = 0.
In our running example, H is the set of all halfplanes in R2 , but in the
general theory, H may consist of arbitrary subsets of D.

1.7.1 The growth function


Now we introduce the crucial complexity measure for H, its growth func-
tion. Let us fix n training samples x1 , x2 , . . . , xn from D (we ignore their
labels yi as these come from H ? only and have nothing to do with H). For
hypothesis H ∈ H, we now look at the cut

H ∩ {x1 , x2 , . . . , xn }

induced by H. In our running example, the cut induced by halfplane H is


the set of training samples in the halfplane.
How many different cuts H ∩ {x1 , x2 , . . . , xn } can we get as we let H
run through H? For sure at most 2^n. We define

H ∩ {x1 , x2 , . . . , xn } = {H ∩ {x1 , x2 , . . . , xn } : H ∈ H}

to be the set of cuts that H induces on x1 , x2 , . . . , xn , and we let

H(n) := max{|H ∩ {x1 , x2 , . . . , xn }| : x1 , x2 , . . . , xn ∈ D} (1.13)

be the largest number of cuts that H induces on any sequence of n training


samples. The function H : N → N (please note and excuse the slight abuse

of notation) is the growth function of H. We have already observed that
H(n) ≤ 2^n. But this may be a (gross) overestimate.
In order to understand the function H(n) for halfplanes, let us look at
small cases first. We first note that H(3) = 8. Indeed, taking as training
samples the three corners of a triangle, we can cut out all subsets by half-
planes; see Figure 1.5.

Figure 1.5: Halfplanes can cut a 3-point set in all possible ways.

Next, we claim that H(4) = 14, so it is not possible for halfplanes to cut
out all 16 subsets of a 4-point set. A simple argument (omitted) shows that
|H ∩ {x1 , x2 , x3 , x4 }| is maximized when the 4 points are in general position
(no three on a line). Then there are only two types of configurations, and
in each of them, exactly two subsets cannot be cut out by a halfplane; see
Figure 1.6.

Figure 1.6: Left: Four points in convex position. Right: One point in the
convex hull of the others. In both cases, the set of black points (and its
complement) cannot be cut out by a halfplane.

Here is the bound for general n showing that the truth is very far away
from the worst-case bound of 2^n.

Example 1.3. Let H be the set of all halfplanes in R2 . Then

H(n) = O(n^2).


Proof. Let x1, x2, ..., xn ∈ R2. For every halfplane H inducing a nonempty
cut, we construct a canonical halfplane H′ ⊆ H that induces the same cut.
To obtain H′, we translate H until there is a sample point xi on the bound-
ary (we may have H′ = H). For each i, the halfplanes having xi on the
boundary induce only O(n) cuts: while rotating a line around xi, the cut
induced by any of its two halfplanes can change only when the line passes
another sample point. The bound of O(n^2) follows. We remark that the
precise bound is H(n) = 2\binom{n}{2} + 2; see Exercise 3.
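The growth function of halfplanes can also be checked by brute force for small n. The sketch below (our own construction, assuming points in general position) enumerates all halfplane cuts by observing that every cut is a prefix of the points sorted along the halfplane's inward normal, and that this sorted order only changes at finitely many critical directions.

```python
# Brute-force count of halfplane cuts: the ordering along a direction d only
# changes when d crosses a direction perpendicular to some difference x_i - x_j,
# so one test direction per angular interval between critical directions suffices.
import itertools
import math
import random

def halfplane_cuts(points):
    crit = sorted({(math.atan2(q[1] - p[1], q[0] - p[0]) + math.pi / 2) % math.pi
                   for p, q in itertools.combinations(points, 2)})
    # midpoints between consecutive critical angles, in both orientations
    angles = []
    for a, b in zip(crit, crit[1:] + [crit[0] + math.pi]):
        m = (a + b) / 2
        angles += [m, m + math.pi]
    cuts = set()
    for t in angles:
        d = (math.cos(t), math.sin(t))
        order = sorted(points, key=lambda p: p[0] * d[0] + p[1] * d[1])
        for k in range(len(points) + 1):            # all prefixes = all cuts
            cuts.add(frozenset(order[:k]))
    return cuts

n = 6
pts = [(random.random(), random.random()) for _ in range(n)]  # general position (a.s.)
print(len(halfplane_cuts(pts)), "cuts;  2*C(n,2)+2 =", n * (n - 1) + 2)
```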
Let us next look at the growth function of the counterexample in Sec-
tion 1.5.1.

Example 1.4. Let H be the collection of all subsets of [0, 1] of length 1/2, n ∈ N.
Then
H(n) = 2^n.

Proof. Let S be a set of distinct samples from [0, 1]. For every T ⊆ S, we
construct H ∈ H such that H ∩ S = T . For this, we let U be a set of length
1/2 disjoint from S and set H = T ∪ U .
Exercise 4 asks you to prove that there is also a countable collection H
of subsets of [0, 1] with H(n) = 2^n for all n. No finite collection can have
this property, since H(n) ≤ |H|. Indeed, to induce H(n) different cuts, we
need at least H(n) different hypotheses.

1.7.2 The result


Recall from Theorem 1.2 that in order for an (approximate) empirical risk
minimizer to also (approximately) minimize the expected risk, condition
(1.9) is sufficient: supH∈H |`n (H) − `(H)| ≤ ε with probability at least 1 − δ.
The result is that this can indeed be achieved for any hypothesis class with
polynomially bounded growth function, such as the class of halfplanes in
R2 (Example 1.3).

This follows from Theorem 2 of Vapnik and Chervonenkis [VC71] that
we state below. This theorem handles the case H ? = ∅. Exercise 5 asks you
to derive the general case.
Assuming that H ? = ∅, the 0-1 loss (1.12) simplifies to

`(H, X) = I({x ∈ H}).

This means that we can think of H as an event whose expected loss is


simply its probability:
`(H) = prob(H).
The empirical risk
n n
1X 1X
`n (H) = `(H, Xi ) = I({xi ∈ H}) =: probn (H)
n i=1 n i=1

is the relative frequency of H, the empirical approximation of prob(H). The


result is therefore already contained in the title of the Vapnik-Chervonenkis
paper [VC71]: On the uniform convergence of relative frequencies of events to
their probabilities.
Theorem 1.5 (Theorem 2 [VC71]). Let D be a probability space, H a set of
events, n ∈ N, ε > 0. Then

\sup_{H \in \mathcal{H}} |\operatorname{prob}_n(H) - \operatorname{prob}(H)| > \varepsilon

with probability at most

4\,\mathcal{H}(2n) \cdot \exp(-\varepsilon^2 n / 8).

If the growth function is polynomially bounded (H(n) = O(n^k) for
some constant k), then the error probability in the theorem tends to 0 as
n → ∞.
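For halfplanes, where the growth function is 2\binom{m}{2} + 2 by Example 1.3, one can compute numerically how large n must be before the bound of Theorem 1.5 drops below a given δ. A quick sketch (the tolerances are our own choice):

```python
# Smallest n with 4 * H(2n) * exp(-eps^2 * n / 8) <= delta for halfplanes,
# where H(m) = 2*C(m,2) + 2 = m*(m-1) + 2 (Example 1.3).
import math

def vc_bound(n, eps):
    growth_2n = (2 * n) * (2 * n - 1) + 2        # H(2n) for halfplanes
    return 4 * growth_2n * math.exp(-eps ** 2 * n / 8)

eps, delta = 0.05, 0.01                          # tolerances chosen for the demo
n = 1
while vc_bound(n, eps) > delta:
    n += 1
print(n, vc_bound(n, eps))
```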
Here is the general version that follows from it (Exercise 5). To be self-
contained at this point, we repeat some terminology from before.
Theorem 1.6. Let X = D × {0, 1} be a data source, H ⊆ 2^D a hypothesis class,
H* ∈ H an unknown ground truth classifier.
For H ∈ H and X = (x, y) ∈ X with y = I({x ∈ H*}), let

\ell(H, X) = I(\{I(\{x \in H\}) \ne y\}) = I(\{x \in H \oplus H^\star\})

be the 0-1-loss, telling us whether H misclassifies X. Let

\ell(H) = \mathbb{E}_X[\ell(H, X)] \quad \text{and} \quad \ell_n(H) = \frac{1}{n} \sum_{i=1}^{n} \ell(H, X_i), \quad n \in \mathbb{N},

be the expected risk and the empirical risk of H, respectively, where X1, X2, ..., Xn
are training data, independently chosen from X (ℓn(H) is a random variable). Let
ε > 0. Then

\sup_{H \in \mathcal{H}} |\ell_n(H) - \ell(H)| > \varepsilon

with probability at most

4\,\mathcal{H}(2n) \cdot \exp(-\varepsilon^2 n / 8).

If the growth function is polynomially bounded (as in the case of half-
planes), Theorem 1.2 follows: as n → ∞, the minimum empirical risk ap-
proximates the minimum expected risk up to any desired accuracy.
What if the growth function is not polynomially bounded? It seems
that as long as it grows less than exponentially, we are still fine. But here
is a surprising fact: either H(n) = O(n^k) for some constant k, or H(n) = 2^n
for all n. So there is nothing between polynomial growth and worst-case
growth [VC71, Theorem 1]. In other words, empirical risk minimization
either works perfectly as n → ∞ (in case of a polynomially bounded
growth function), or not at all (in case of H(n) = 2^n for all n). By “not at all”,
we mean that in this case (and under suitable assumptions), distribution-
independent PAC learning cannot be achieved [BEHW89]. The literature
also presents these results in terms of the VC-dimension of H, which is either
some finite number k (in which case H(n) = O(n^k)), or infinite (and
then H(n) = 2^n).

1.8 Distribution-dependent guarantees


Often, we know (or assume) something about the probability distribution
underlying our data source, and this additional information can lead to
provable guarantees for empirical risk minimization that we otherwise
wouldn’t get. Let us consider linear regression with fixed design and centered
noise as an example for this scenario.

We have fixed ground truth vectors x_1, x_2, ..., x_n ∈ R^d, n ≥ d (this is
the design). We assume that the x_i span R^d. The data source is Y = R^n
and comes with an unknown vector β* ∈ R^d. A sample y ∈ Y has entries
y_i = x_i^T β* + w_i, where the w_i are independent noise terms with expectation
0 and variance σ² each.
This means that there is a ground truth linear function x_i ↦ x_i^T β* that we
would like to learn, but the function values y_i that we get are corrupted
with centered noise.
The goal is to learn β*, using just one sample y ∈ Y (which already
provides n values). As hypothesis class, we therefore use H = R^d.
If σ² = 0, meaning that there is no noise, we can simply compute β*
from the design by solving the following system of equations in the d
unknowns β_1, ..., β_d:

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
  = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{pmatrix}
    \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \end{pmatrix}.

In the presence of noise, we will not be able to nail down β* exactly,
but we can try to get close to it. A natural loss function ℓ : H × Y → R is

\ell(\beta, y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2,

telling us by how much (in squared Euclidean norm) β fails to explain the
observed values y. The expected risk is

\ell(\beta) = \frac{1}{n} \sum_{i=1}^{n} \big(x_i^\top(\beta^\star - \beta)\big)^2 + \sigma^2.   (1.14)

Indeed, for each i, we have

\mathbb{E}_Y\big[(y_i - x_i^\top \beta)^2\big]
  = \mathbb{E}_Y\big[(x_i^\top \beta^\star + w_i - x_i^\top \beta)^2\big]
  = \mathbb{E}_Y\big[(x_i^\top(\beta^\star - \beta))^2 + 2 w_i\, x_i^\top(\beta^\star - \beta) + w_i^2\big]
  = (x_i^\top(\beta^\star - \beta))^2 + \sigma^2,

using that E[w_i] = 0 and E[w_i^2] = Var[w_i] = σ².

This means that an expected risk of σ 2 is unavoidable but can also be at-
tained by choosing β = β*. The following result quantifies how well em-
pirical risk minimization performs in this scenario. A proof can be found
in the lecture notes of Rigollet and Hütter on High Dimensional Statistics.1

Theorem 1.7. Given a sample y ∈ R^n, let β̃ = β̃_1 minimize the empirical risk

\ell_1(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2.

Then for any δ > 0,

\frac{1}{n} \sum_{i=1}^{n} \big(x_i^\top(\beta^\star - \tilde{\beta})\big)^2 = O\left(\sigma^2 \cdot \frac{d + \log(1/\delta)}{n}\right),

with probability at least 1 − δ. By (1.14), this further implies

\ell(\tilde{\beta}) = \ell(\beta^\star) + O\left(\sigma^2 \cdot \frac{d + \log(1/\delta)}{n}\right),

with probability at least 1 − δ.

In this case, empirical risk minimization is easy to do analytically by


solving a least squares problem. We will get back to this in Section 2.4.2.
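A quick simulation sketch of this rate (the design, β*, and noise below are our own toy choices; the empirical risk minimizer is computed with a standard least-squares solver):

```python
# Fixed-design least squares: the excess risk of the empirical risk minimizer
# decays roughly like sigma^2 * d / n, in line with Theorem 1.7.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 1.0
for n in (20, 200, 2000, 20000):
    X = rng.normal(size=(n, d))                 # fixed design x_1, ..., x_n
    beta_star = rng.normal(size=d)
    y = X @ beta_star + sigma * rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # empirical risk minimizer
    excess = np.mean((X @ (beta_star - beta_hat)) ** 2)
    print(f"n={n:6d}  excess risk={excess:.5f}  sigma^2*d/n={sigma**2 * d / n:.5f}")
```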

1.9 Worst-case versus average-case complexity


We have seen that empirical risk minimization can be successful in settings
where we have no information about the probability distribution over our
data source X (Section 1.7). But we may also have distribution-dependent
guarantees as in the previous Section 1.8. Even in settings of the first kind,
the training data are independent samples from the distribution in ques-
tion, and not just any data. This has implications when we use an opti-
mization algorithm to perform empirical risk minimization. In fact, what
we care about is not the worst-case performance of the algorithm but the
average case performance.
1 https://klein.mit.edu/˜rigollet/PDFs/RigNotes17.pdf, Theorem 2.2

The classical measure of algorithm performance is its worst-case com-
plexity, the function that maps n to the maximum runtime of the algorithm
over all possible inputs of size n. For example, the worst case complex-
ity of (deterministic) Quicksort for sorting n numbers is Θ(n2 ). But if the
input numbers are independent samples from the same distribution, they
come in random order; therefore, the average-case complexity of Quicksort is
much better, namely O (n log n). The average case complexity is the func-
tion that maps n to the expected runtime of the algorithm, taken over its
input distribution.
Still, many sources that describe and analyze (optimization) algorithms
in Data Science do this in terms of worst-case complexity. These lecture
notes are not an exception. The main reason is that it’s in most cases very
difficult to understand the average-case complexity. There are two prob-
lems.

Lack of Knowledge. The first problem occurs if we have no informa-


tion about the probability distribution over our data source X . In case of
Quicksort, we can in this case still argue that the average-case complex-
ity is better than the worst-case complexity, but this is not a “Data Sci-
ence phenomenon”. If the only thing that we need towards an improved
average-case complexity is that the input comes in random order, then we
can artificially enforce this for any input, by simply permuting the input
randomly before feeding it to the algorithm. If we do this in case of Quick-
sort, we obtain randomized Quicksort whose maximum expected runtime
on n numbers is O (n log n), leading to an optimal worst case bound for
the expected performance.
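A minimal sketch of this "permute first, then run" idea for Quicksort (our own illustration):

```python
# Randomized Quicksort: shuffling the input up front enforces the random
# order that makes the expected runtime O(n log n) on any input.
import random

def randomized_quicksort(a):
    a = list(a)
    random.shuffle(a)          # random permutation up front
    _qsort(a, 0, len(a) - 1)
    return a

def _qsort(a, lo, hi):
    while lo < hi:
        p = _partition(a, lo, hi)
        _qsort(a, lo, p - 1)   # recurse on the left part,
        lo = p + 1             # iterate on the right part

def _partition(a, lo, hi):     # Lomuto partition with the last element as pivot
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

print(randomized_quicksort([5, 3, 8, 1, 9, 2, 7]))
```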
True “Data Science” analyses of the average-case complexity typically
require some understanding of the data source X that we do not have.

Too specific knowledge. If we know (or assume) the distribution over


the data source X , we may be able to actually compute the average-case
complexity of the algorithm over that specific distribution. But this typically
doesn’t give us results for other distributions that may occur in other ap-
plications. While there is only one worst-case complexity of an algorithm,
there are infinitely many average-case complexities, and it’s typically al-
ready an arduous task to analyze one of them. Unless the result covers an
important distribution (or even an important family of distributions), we

have hardly more than an ad-hoc result for one particular application.
Worst-case complexity may be pessimistic in any given concrete ap-
plication, but it does provide a valid upper bound for the runtime in all
applications, via one proof. This makes worst-case complexity an attractive
complexity measure from a theoretical point of view. Having said this,
one should always be aware that the resulting bounds may be pessimistic
(sometimes to the point of being useless), and that even as a theoretician,
one should strive for better results that ideally cover a number of relevant
applications.

1.10 The story of the simplex method


The simplex method for linear programming is probably the optimiza-
tion algorithm whose complexity has puzzled (and is still puzzling) re-
searchers the most since its invention in the 1940s. As the term simplex
method suggests, it is actually a family of algorithms, with members dif-
fering in their pivot rules according to which they decide how to make
progress on the way when there are several choices. In this section, we pro-
vide a historical overview with a focus on the computational complexity
of the simplex method. The tension between worst-case and average-case
complexity is present from the very beginning and has ultimately led to
the development of smoothed complexity which in some sense combines
the best of both worlds.

1.10.1 Initial wandering


In his seminal textbook from 1963, George Dantzig, the inventor of the
simplex method, is setting the stage [Dan16, p. 160]:

While the simplex method appears a natural one to try in


the n-dimensional space of the variables, it might be expected,
a priori, to be inefficient, as there could be considerable wan-
dering . . . before an optimal extreme point is reached.

Here, Dantzig is alluding to the fact that in 1963, no theoretical results


are known to rule out the pessimistic scenario of “considerable wander-
ing” and bad runtime. He then continues:

However, empirical evidence with thousands of practical
problems indicates that the number of iterations is usually close
to the number of basic variables in the final set which were not
present in the initial set.

As there are n (nonnegative) variables, this is stating that the runtime


in practice is O(n). With m denoting the number of constraints, Dantzig is
hinting at the possibility of a theoretical average-case analysis:
Some believe that for a randomly chosen problem with fixed
m, the number of iterations grows in proportion to n.
Dantzig does not further elaborate on what he means by a “randomly
chosen” linear program.
In 1969, David Gale publishes an article in the American Mathematical
Monthly where he presents a variant of the simplex method for solving a
system of linear inequalities that he considers to be a more fundamental
problem than linear programming itself (which asks for an optimal so-
lution to a system of linear inequalities) [Gal69]. After summarizing the
(un)known complexity results, he concludes:

Thus there is a large and embarrassing gap between what


has been observed and what has been proved. This gap stood
as a challenge to workers in the field for twenty years now and
remains, in my opinion, the principal open question in the the-
ory of linear computation.

1.10.2 Worst-case complexity


Three years later, Victor Klee and George J. Minty make progress on this
question [KM72], but not in the desired direction of proving a theoretical
bound that matches the observations. Instead, they prove that the worst-
case complexity of the simplex method (with Dantzig’s original pivot rule)
is exponential: there are linear programs (the Klee-Minty cubes) in n non-
negative variables that require the method to perform 2n iterations. But
Klee and Minty also clearly point out the limitations of their research:
On the other hand, our results may not be especially sig-
nificant for the practical aspect of linear programming (see the
final section for comments on this point).

Despite these limitations, their results create quite some excitement in
the community; a sequence of papers ensues in which many other pivot
rules are shown to require an exponential number of iterations as well.
While each of these papers exhibits a different construction, they all seem
to follow a similar approach; Manfred Padberg calls this the period of
worstcasitis [Pad95]. Only later, Nina Amenta and Günter Ziegler will
show that the “similar approach” intuition can be formalized, and that
all known constructions are in fact special cases of a general deformed prod-
uct construction [AZ99]. One particular pivot rule, developed by Norman
Zadeh in 1980 with the goal of defeating all “similar approaches”, will
eventually be shown to also require exponentially many iterations; but this
will happen only in 2011, via an indeed quite different approach. While
this involves money and nudity, it is for the purpose of our discussion a
side story, so we refer the interested reader to Günter Ziegler’s blog entry.2

1.10.3 Average-case complexity


The first substantial average-case analysis of the simplex method is per-
formed by Karl Heinz Borgwardt in a sequence of results that are summa-
rized in his book from 1987 [Bor87]. By “substantial” we mean an anal-
ysis that goes beyond easy and artificial cases. In the preface, Borgwardt
writes:

The subject and purpose of this book is to explain the great


efficiency in practice by assuming certain distributions on the
”real-world”-problems. Other stochastic models are realistic as
well and so this analysis should be considered as one of many
possibilities.

In Borgwardt’s distribution, the origin is assumed to satisfy all the


constraints; the (normal vectors of the) constraints as well as the (vector
defining the) objective function are rotationally symmetric random vec-
tors. Special cases of this setting occur if the constraints and objective
function are chosen from a multivariate normal distribution, or from a ball
(or sphere) centered at the origin. Borgwardt proves that the number of
2 https://gilkalai.wordpress.com/2011/01/20/gunter-ziegler-1000-from-beverly-hills-for-a-math-problem-ipam-remote-blogging/
iterations of the simplex method with the shadow-vertex pivot rule is poly-
nomial in expectation over the distribution. The analysis is very technical
and involved; on a high level, it boils down to proving that the shadow
(projection onto a two-dimensional plane) of a random linear program
has in expectation only a small number of vertices. As these are exactly
the ones that the shadow-vertex pivot rule visits, polynomial runtime on
average follows.
Borgwardt’s distribution may be what Dantzig had in mind when he
was talking about “randomly chosen” linear programs. On the other hand,
linear programs observed in practice are typically not random but highly
structured. If they follow any distribution at all, it is certainly not the
one assumed by Borgwardt. Therefore, Borgwardt’s average-case analysis
does not offer a full explanation for the efficiency of the simplex method
in practice. But it is still an important step forward as it shows that there
is a natural (although practically not very relevant) distribution over lin-
ear programs on which the simplex method is fast on average. In Section
0.9, Borgwardt also speculates what the right “real-world”-model is and
writes the following:

This is a philosophical question and nobody can answer it


satisfactorily. But one should discuss the ideas, conjectures and
experiences of practical and theoretical experts of linear pro-
gramming.

1.10.4 Smoothed complexity


In 2004, and bluntly ignoring Borgwardt’s “nobody can”, Daniel Spielman
and Shang-Hua Teng finally provide a “real-world”-model that is versa-
tile enough to encompass many applications and still allows to prove that
the simplex method is fast on average over a random linear program cho-
sen according to the model [ST04]. The smoothed analysis they suggest is a
hybrid between worst-case and average-case analysis. The high-level de-
scription of Spielman and Teng is crisp:

Worst-case analysis can improperly suggest that an algo-


rithm will perform poorly by examining its performance under
the most contrived circumstances. . . .

. . . However, average-case analysis may be unconvincing as
the inputs encountered in many application domains may bear
little resemblance to the random inputs that dominate the anal-
ysis. . . .
. . . In smoothed analysis, we measure the performance of
an algorithm under slight random perturbations of arbitrary
inputs.
Following Spielman and Teng [ST04], we define the three complexity
measures formally. Let CA (X) be the runtime of algorithm A on input
X ∼ X . Here X is again some data source. If A is randomized, we consider
the expected runtime. We let Xn denote the set of all inputs of encoding
size n. A has worst-case complexity f (n) if
              max_{X∈Xn} CA (X) = f (n).

The maximum may be achieved in “the most contrived circumstances”


and not in typical circumstances. On the plus side, the worst-case com-
plexity is independent from the distribution over X that we may not know.
So the worst-case complexity can in principle be determined.
Algorithm A has average-case complexity f (n) if
              E_{X∼Xn} [CA (X)] = f (n).
This seems to be exactly what we want: the expected runtime over ran-
dom data sampled from X . The catch is that—not knowing the distri-
bution over X —we can also not determine the average-case complexity.
Therefore, similar to expected loss, this is an idealized measure that we
may only hope to approximate. Unlike in empirical risk minimization, the
approximation here does not consist of a computation, but a proof that is
supposed to work for all values of n. Towards this, one makes some (sim-
plifying) assumptions on the distribution that allow one to prove a bound on the
average-case complexity under these assumptions. The consequence of
this approach is that X “may bear little resemblance to the random inputs
that dominate the analysis.”
Smoothed complexity looks at the worst-case expected complexity af-
ter injecting random noise into the data. For a real number σ ≥ 0, the
algorithm has smoothed complexity f (n, σ) if
              max_{X∈Xn} E_{wi∼N(0,σkXk)} CA (X + w) = f (n, σ).

Some explanations are in order here. We assume that the data are such
that we can inject arbitrarily small noise. In discrete settings (for example
X = {0, 1}n to model n coin flips), this is not the case. Concretely, we as-
sume that the input X can be written as a vector of real numbers, and for
all indices i, we add independent centered Gaussian noise wi to the i-th
entry of this vector. The standard deviation of each noise term is propor-
tional to the size kXk of the input which may for example be measured by
Euclidean norm. The factor of proportionality is σ. It is important to un-
derstand that the smoothed complexity is independent of the probability
distribution over X .
If σ = 0, we simply have the worst-case complexity. If σ is large, the
noise dominates, and the smoothed complexity becomes meaningless for
the application. So the scenario that we are interested in is that of small
nonzero σ.
The smoothed complexity is called polynomial if f (n, σ) is polynomial
in n and 1/σ. This allows the runtime to tend to infinity as σ → 0, but at
a rate that is polynomial in 1/σ. Smoothed complexity is a very natural
complexity measure in Data Science where data come from measurements
or experiments and are therefore per se noisy. In this case, the measure-
ment or experiment can itself be considered as injecting the random noise
into (unknown) ground truth data. Smoothed complexity then covers the
worst possible ground truth data.
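
To make the definition concrete, here is a small Python sketch (our own illustration, not from the literature): it estimates the inner expectation for one fixed input X by Monte Carlo, where runtime is a placeholder for the (hypothetical) cost function CA.

import numpy as np

def smoothed_cost_estimate(runtime, X, sigma, trials=1000, seed=0):
    # Monte Carlo estimate of E_{w_i ~ N(0, sigma*||X||)} [ C_A(X + w) ]
    # for one fixed input X; "runtime" is a stand-in for the cost C_A.
    rng = np.random.default_rng(seed)
    scale = sigma * np.linalg.norm(X)          # noise level proportional to ||X||
    total = 0.0
    for _ in range(trials):
        w = rng.normal(0.0, scale, size=X.shape)
        total += runtime(X + w)
    return total / trials

# toy illustration with a made-up cost function (not a real simplex solver)
cost = lambda Z: float(np.sum(np.abs(Z)))
X = np.array([1.0, -2.0, 0.5])
print(smoothed_cost_estimate(cost, X, sigma=0.1))

Taking the maximum of such estimates over inputs X of a fixed encoding size would then mimic the outer max in the definition of f (n, σ).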
The main technical achievement of Spielman and Teng (earning them
a number of prestigious prizes) is to show that the smoothed complexity
of the shadow vertex simplex algorithm is polynomial, while its worst-
case complexity is known to be exponential. Intuitively, this means that
the worst-case linear programs are rare and isolated points in the input
space: by slightly perturbing them, we arrive at linear programs that the
algorithm can solve in polynomial time. Indeed, the deformed products
that serve as worst-case inputs for the shadow vertex and other pivot rules
are very sensitive to noise; their geometric features are highly structured in
tiny regions of space, with the consequence that this structure completely
falls apart under small perturbations.
Smoothed analysis does not help for problems where the worst-case in-
puts have some volume. Let’s say that somewhere in input space, there is
a ball of fixed radius only containing (near) worst-case inputs. In this case,
injecting small random noise does not speed up the algorithm. Also, the
smoothed complexity of an algorithm is usually hard to determine, and

even for discrete algorithms such as the simplex method, smoothed anal-
ysis is integral-heavy due to the Gaussian noise terms. When applicable
and technically feasible, smoothed analysis is an excellent tool to analyze
Data Science algorithms, but this is by far not always the case. The 2009
survey of Spielman and Teng contains a few examples where smoothed
analysis works [ST09].

1.10.5 An open end


Coming back to the simplex method for one last time: it is still unknown
whether there is a pivot rule under which the simplex method has a poly-
nomial number of iterations in the worst case. While one candidate after
the other has been ruled out over the last 50 years, the theoretical possi-
bility of such a “wonder rule” remains. But it will be very hard and prob-
ably require new methodology to discover it. As a corollary, such a rule
would prove the polynomial Hirsch conjecture in the affirmative, something
that researchers have tried in vain for decades, with considerable effort.
For details, we refer the interested reader to the paper by Santos whose
disproof of the original (much stronger) Hirsch conjecture from 1957 is a
significant breakthrough [San12].

1.11 The estimation-optimization tradeoff


We recall from Section 1.4 and equation (1.7) that the goal in empirical risk
minimization is not to find the absolutely best explanation of the training
data but an almost best one H̃n . This is motivated by the fact that the em-
pirical risk of H̃n is anyway only an approximation of its expected risk, the
measure we actually care about. As we inevitably lose precision in going
from empirical to expected risk, it doesn’t help to optimize the empirical
risk to a significantly higher precision. Let us call the precision that we
lose in going from empirical to expected risk the estimation error; the pre-
cision we lose in finding only an almost best explanation of the training
data is the optimization error. The literature discusses a third kind of error,
the approximation error that arises when the expected risk minimizer H ? is
not in our hypothesis class H, so that even the best explanation from H
loses some precision as compared to H ? . As we have not discussed this
scenario, we will ignore the approximation error here.

In a given application, we may have constraints on how many train-
ing data we can afford to sample, and on how much time we can afford
to spend on optimization. Constraints on the number of training samples
typically come from the fact that samples are expensive to obtain. Indeed,
a training sample may come from an actual physical measurement or an
experiment that has a significant cost; or it requires human intervention
to label a training sample with its correct class. Reducing the cost of hu-
man intervention is all that services such as Amazon Mechanical Turk are
about. We call this scenario small-scale learning.
Constraints on optimization time typically come from large training
data. The “good old days” where every algorithm of polynomial runtime
was considered efficient are long gone in Data Science. With large data,
we typically need linear-time or even sublinear-time algorithms in order to
cope with the data. This is the scenario of large-scale learning.
In small-scale learning, it doesn’t hurt to go for as small an optimiza-
tion error as we can. But in large-scale learning, we may need to give up
on some optimization precision in order to be able to stay within the opti-
mization time budget.
The estimation-optimization tradeoff consists in finding the most efficient
way of spending the resources under the given constraints. The optimiza-
tion algorithms that we will analyze in this course support this tradeoff.
They are usually stepwise methods, gradually improving a candidate so-
lution. The runtime guarantees that we provide are of the form “on data of
this and that type, this and that algorithm is guaranteed to have optimiza-
tion error at most ε after at most c(ε) many steps.” Here, c is a function that
grows as ε → 0, and c is typically not even defined at ε = 0. Hence, most
of our algorithms cannot even be used to “optimize to the end.” But as
this is not needed, our main concern is to bound the growth of c as ε → 0,
and algorithms can significantly differ in this growth. For example, an al-
gorithm with c(ε) = O(log(1/ε)) is preferable over one with c(ε) = O(1/ε),
and the latter is better than an algorithm with c(ε) = O(1/ε2 ).
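
To see how different these growth rates are, the following lines (a small illustration of ours, with all constants inside the O(·) set to 1) print the step counts for a few target accuracies.

import math

for eps in (1e-2, 1e-4, 1e-6):
    # steps needed under c(eps) = log(1/eps), 1/eps, and 1/eps^2, respectively
    print(eps, math.log(1 / eps), 1 / eps, 1 / eps ** 2)

Already at ε = 10^-6, the three rates give roughly 14, a million, and a trillion steps, which is the whole point of caring about the growth of c.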
Following up on Section 1.9, we point out that our bounds on c(ε) will
usually be worst-case bounds, and as such, they may be overly pessimistic
in a concrete application even if they are tight on contrived input. But
given the difficulty of obtaining average-case bounds in more than very
specific applications, the worst-case bounds still provide some (and some-
times the only) useful guidance concerning optimization time.
The material of this section is based on (and discussed in much more

detail by) Bottou and Bousquet[BB07]; we refer the interested reader to
this paper.

1.12 Further listening


Theoretical results in machine learning, in particular the VC theory (Sec-
tion 1.7) as well as distribution-dependent guarantees (Section 1.8) are cov-
ered in the ETH lectures 263-5300-00L Guarantees for Machine Learning,
401-2684-00L Mathematics of Machine Learning, and 263-4508-00L Algo-
rithmic Foundations of Data Science.
The lecture 252-0526-00L Statistical Learning Theory advocates and
explains an approach to learning that is quite different from empirical risk
minimization and yields better results in a number of applications. The
method is called maximum entropy and posterior agreement.
The lectures 227-0690-11L Large-Scale Convex Optimization and 263-
4400-00L Advanced Graph Algorithms and Optimization have a signifi-
cant overlap with Chapters 2 (Theory of Convex Functions) and 3 (Gradi-
ent Descent) of this lecture.
The course 401-3901-00L Linear & Combinatorial Optimization is con-
cerned with discrete optimization and in particular discusses polyhedral
approaches (linear programming and the simplex method are key ingre-
dients of the polyhedral approach). An introductory course teaching in
particular the simplex method (Section 1.10) is 401-0647-00L Introduction
to Mathematical Optimization.

1.13 Exercises
Exercise 1. Let X = {0, 1} be such that the event X = 1 has probability p? for
X ∼ X . We want to model the task of finding p? as an expected risk minimization
problem.
For X ∈ X and H ∈ H = [0, 1], we define `(H, X) = (X − H)2 . The
expected risk of H is `(H) = EX [`(H, X)].

(i) Compute `(H) for given H ∈ H.

(ii) Prove that p? is the unique minimizer of the expected risk `, and that the
minimum expected risk is p? (1 − p? ), the variance of the biased coin.

Exercise 2. Prove that there exists a countable collection H of subsets H ⊆ [0, 1]
such that (i) every H ∈ H has length 1/2; (ii) for every finite set S ⊆ [0, 1], there
is H ∈ H with H ∩ S = ∅.

Exercise 3. Let H be the set of all halfplanes in R2 , and let S be a set of n ≥ 1
points in R2 . Prove that H cuts S in at most 2\binom{n}{2} + 2 many different ways, and
that there are sets S for which this bound is attained. More precisely, prove that
the set H ∩ S = {H ∩ S : H ∈ H} has size at most 2\binom{n}{2} + 2, with equality if
and only if S is in general position, meaning that no three points of S are on a
common line.

Exercise 4. Prove that there exists a countable collection H of subsets H ⊆ [0, 1]


with the following property: for every finite set S ⊆ [0, 1] and every T ⊆ S, there
is H ∈ H with H ∩ S = T .

Exercise 5. Derive Theorem 1.6 from Theorem 1.5!

Chapter 2

Theory of Convex Functions

Contents
2.1 Mathematical Background . . . . . . . . . . . . . . . . . . . . 42
2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.1.2 The Cauchy-Schwarz inequality . . . . . . . . . . . . 42
2.1.3 The spectral norm . . . . . . . . . . . . . . . . . . . . 44
2.1.4 The mean value theorem . . . . . . . . . . . . . . . . . 45
2.1.5 The fundamental theorem of calculus . . . . . . . . . 45
2.1.6 Differentiability . . . . . . . . . . . . . . . . . . . . . . 46
2.2 Convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.1 The mean value inequality . . . . . . . . . . . . . . . 48
2.3 Convex functions . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.1 First-order characterization of convexity . . . . . . . 54
2.3.2 Second-order characterization of convexity . . . . . . 57
2.3.3 Operations that preserve convexity . . . . . . . . . . 59
2.4 Minimizing convex functions . . . . . . . . . . . . . . . . . . 59
2.4.1 Strictly convex functions . . . . . . . . . . . . . . . . . 61
2.4.2 Example: Least squares . . . . . . . . . . . . . . . . . 62
2.4.3 Constrained Minimization . . . . . . . . . . . . . . . . 63
2.5 Existence of a minimizer . . . . . . . . . . . . . . . . . . . . . 64
2.5.1 Sublevel sets and the Weierstrass Theorem . . . . . . 65
2.5.2 Recession cone and lineality space . . . . . . . . . . . 66
2.5.3 Coercive convex functions . . . . . . . . . . . . . . . . 70
2.5.4 Weakly coercive convex functions . . . . . . . . . . . 71

2.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.6.1 Handwritten digit recognition . . . . . . . . . . . . . 72
2.6.2 Master’s Admission . . . . . . . . . . . . . . . . . . . 74
2.7 Convex programming . . . . . . . . . . . . . . . . . . . . . . 80
2.7.1 Lagrange duality . . . . . . . . . . . . . . . . . . . . . 80
2.7.2 Karush-Kuhn-Tucker conditions . . . . . . . . . . . . 84
2.7.3 Computational complexity . . . . . . . . . . . . . . . 86
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

This chapter develops the basic theory of convex functions that we will
need later. Much of the material is also covered in other courses, so we will
refer to the literature for standard material and focus more on material that
we feel is less standard (but important in our context).

2.1 Mathematical Background


2.1.1 Notation
For vectors in Rd , we use bold font, and for their coordinates normal font,
e.g. x = (x1 , . . . , xd ) ∈ Rd . x1 , x2 , . . . denotes a sequence of vectors. Vectors
are considered as column vectors, unless they are explicitly transposed.
So x is a column vector, and x> , its transpose, is a row vector. x> y is the
scalar product Σ_{i=1}^d xi yi of two vectors x, y ∈ Rd .
   kxk denotes the Euclidean norm (ℓ2-norm or 2-norm) of vector x,

              kxk^2 = x>x = Σ_{i=1}^d xi^2.

We also use
N = {1, 2, . . .} and R+ := {x ∈ R : x ≥ 0}
to denote the natural and non-negative real numbers, respectively. We are
freely using basic notions and material from linear algebra and analysis,
such as open and closed sets, vector spaces, matrices, continuity, conver-
gence, limits, triangle inequality, among others.

2.1.2 The Cauchy-Schwarz inequality


Lemma 2.1 (Cauchy-Schwarz inequality). Let u, v ∈ Rd . Then
|u> v| ≤ kuk kvk .
The inequality holds beyond the Euclidean norm; all we need is an
inner product, and a norm induced by it. But here, we only discuss the
Euclidean case.
For nonzero vectors, the Cauchy-Schwarz inequality is equivalent to
              −1 ≤ u>v / (kuk kvk) ≤ 1,

and this fraction can be used to define the angle α between u and v:
              cos(α) = u>v / (kuk kvk),
where α ∈ [0, π]. The following shows the situation for two unit vectors
(kuk = kvk = 1): The scalar product u> v is the length of the projection of
v onto u (which is considered to be negative when α > π/2). This is just
the highschool definition of the cosine.
[Figure: two unit vectors u, v with angle α; the projection of v onto u gives u>v > 0 (left) and u>v < 0 (right)]

Hence, equality in Cauchy-Schwarz is obtained if α = 0 (u and v point


into the same direction), or if α = π (u and v point into opposite direc-
tions):

[Figure: v = u gives u>v = 1; v = −u (i.e. α = π) gives u>v = −1]

Fix u 6= 0. We see that the vector v maximizing the scalar product u> v
among all vectors v of some fixed length is a positive multiple of u, while
the scalar product is minimized by a negative multiple of u.

Proof of the Cauchy-Schwarz inequality. There are many proofs, but the
authors particularly like this one: define the quadratic function

   f (x) = Σ_{i=1}^d (ui x + vi)^2 = (Σ_{i=1}^d ui^2) x^2 + 2 (Σ_{i=1}^d ui vi) x + Σ_{i=1}^d vi^2 =: ax^2 + bx + c.

We know that f (x) = ax2 + bx + c = 0 has the two solutions

              x1,2 = (−b ± √(b^2 − 4ac)) / (2a).
This is known as the Mitternachtsformel in German-speaking countries, as
you are supposed to know it even when you are asleep at midnight.
As by definition, f (x) ≥ 0 for all x, f (x) = 0 has at most one real solu-
tion, and this is equivalent to having discriminant b2 − 4ac ≤ 0. Plugging
in the definitions of a, b, c, we get
   b^2 − 4ac = (2 Σ_{i=1}^d ui vi)^2 − 4 (Σ_{i=1}^d ui^2)(Σ_{i=1}^d vi^2) = 4(u>v)^2 − 4 kuk^2 kvk^2 ≤ 0.

Dividing by 4 and taking square roots yields the Cauchy-Schwarz inequal-


ity.
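
As a quick sanity check (our own, assuming NumPy is available), the inequality can also be verified numerically on random vectors:

import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=5), rng.normal(size=5)
# |u^T v| never exceeds ||u|| ||v||
print(abs(u @ v), "<=", np.linalg.norm(u) * np.linalg.norm(v))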

2.1.3 The spectral norm


Definition 2.2 (Spectral norm). Let A be an (m × d)-matrix. Then
              kAk := max_{v∈Rd, v6=0} kAvk / kvk = max_{kvk=1} kAvk

is the 2-norm (or spectral norm) of A.


In words, the spectral norm is the largest factor by which a vector can
be stretched in length under the mapping v → Av. Note that as a simple
consequence,
kAvk ≤ kAkkvk
for all v.
It is good to remind ourselves what a norm is, and why the spectral
norm is actually a norm. We need that it is absolutely homogeneous:
kλAk = |λ|kAk, which follows from the fact that the Euclidean norm is ab-
solutely homogeneous. Then we need the triangle inequality: kA + Bk ≤
kAk + kBk for two matrices of the same dimensions. Again, this follows
from the triangle inequality for the Euclidean norm. Finally, we need that
kAk = 0 implies A = 0. Which is true, since for any nonzero matrix A,
there is a vector v such that Av and hence the Euclidean norm of Av is
nonzero.
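
For intuition, here is a short NumPy sketch (ours, not part of the notes) that computes kAk as the largest singular value and compares it with the stretch factors kAvk/kvk of a few random vectors; none of them can exceed the spectral norm.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
spec = np.linalg.norm(A, 2)               # spectral norm = largest singular value
for _ in range(5):
    v = rng.normal(size=4)
    print(np.linalg.norm(A @ v) / np.linalg.norm(v), "<=", spec)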

2.1.4 The mean value theorem
We also recall the mean value theorem that we will frequently need:
Theorem 2.3 (Mean value theorem). Let a < b be real numbers, and let h :
[a, b] → R be a continuous function that is differentiable on (a, b); we denote the
derivative by h0 . Then there exists c ∈ (a, b) such that
              h0 (c) = (h(b) − h(a)) / (b − a).
Geometrically, this means the following: We can interpret the value
(h(b) − h(a))/(b − a) as the slope of the line through the two points (a, h(a))
and (b, h(b)). Then the mean value theorem says that between a and b, we
find a tangent to the graph of h that has the same slope:

h(a)

h(b)

a c b

2.1.5 The fundamental theorem of calculus


If a function h is continuously differentiable in an interval [a, b], we have
another way of expressing h(b) − h(a) in terms of the derivative.
Theorem 2.4 (Fundamental theorem of calculus). Let a < b be real num-
bers, and let h : dom(h) → R be a differentiable function on an open domain
dom(h) ⊃ [a, b], and such that h0 is continuous on [a, b]. Then
              h(b) − h(a) = ∫_a^b h0 (t) dt.

This theorem is the theoretical underpinning of typical definite inte-
gral computations in high school. For example, to evaluate ∫_2^4 x^2 dx, we
integrate x^2 (giving us x^3/3), and then compute

              ∫_2^4 x^2 dx = 4^3/3 − 2^3/3 = 56/3.
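
A quick numerical cross-check of this value (our own illustration, plain Python) with a midpoint Riemann sum:

# midpoint Riemann sum for the integral of x^2 over [2, 4]
n = 100000
h = (4 - 2) / n
approx = sum((2 + (i + 0.5) * h) ** 2 for i in range(n)) * h
print(approx, 56 / 3)   # both are approximately 18.6667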

2.1.6 Differentiability
For univariate functions f : dom(f ) → R with dom(f ) ⊆ R, differentia-
bility is covered in high school. We will need the concept for multivari-
ate and vector-valued functions f : dom(f ) → Rm with dom(f ) ⊆ Rd .
Mostly, we deal with the case m = 1: real-valued functions in d variables.
As we frequently need this material, we include a refresher here.

Definition 2.5. Let f : dom(f ) → Rm where dom(f ) ⊆ Rd . The function f


is called differentiable at x in the interior of dom(f ) if there exists an (m × d)-
matrix A and an error function r : Rd → Rm defined in some neighborhood of
0 ∈ Rd such that for all y in some neighborhood of x,

f (y) = f (x) + A(y − x) + r(y − x),

where
kr(v)k
lim = 0.
v→0 kvk

It then also follows that the matrix A is unique, and it is called the differential
or Jacobian of f at x. We will denote it by Df (x). More precisely, Df (x) is the
matrix of partial derivatives at the point x,

              Df (x)ij = ∂fi /∂xj (x).

f is called differentiable if f is differentiable at all x ∈ dom(f ) (which implies


that dom(f ) is open).

Differentiability at x means that in some neighborhood of x, f is ap-


proximated by a (unique) affine function f (x) + Df (x)(y − x), up to a
sublinear error term. If m = 1, Df (x) is a row vector typically denoted
by ∇f (x)> , where the (column) vector ∇f (x) is called the gradient of f at
x. Geometrically, this means that the graph of the affine function f (x) +
∇f (x)> (y − x) is a tangent hyperplane to the graph of f at (x, f (x)); see
Figure 2.1.
It also follows easily that a differentiable function is continuous, see
Exercise 6.
Let us do a simple example to illustrate the concept of differentiability.

Figure 2.1: If f is differentiable at x, the graph of f is locally (around x) approximated by a tangent hyperplane

Example 2.6. Consider the function f (x) = x2 . We know that its derivative is
f 0 (x) = 2x. But why? For fixed x and y = x + v, we compute

f (y) = (x + v)2 = x2 + 2vx + v 2


= f (x) + 2x · v + v 2
= f (x) + A(y − x) + r(y − x),

where A := 2x, r(y − x) = r(v) := v^2. We have lim_{v→0} |r(v)|/|v| = lim_{v→0} |v| = 0.
Hence, A = 2x is indeed the differential (a.k.a. derivative) of f at x.

In computing differentials, the chain rule is particularly useful.

Lemma 2.7 (Chain rule). Let f : dom(f ) → Rm , dom(f ) ⊆ Rd and g :


dom(g) → Rd . Suppose that g is differentiable at x ∈ dom(g) and that f is
differentiable at g(x) ∈ dom(f ). Then f ◦ g (the composition of f and g) is
differentiable at x, with the differential given by the matrix equation

D(f ◦ g)(x) = Df (g(x))Dg(x).

Here is an application of the chain rule that we will use frequently. Let
f : dom(f ) → Rm be a differentiable function with (open) convex domain,
and fix x, y ∈ dom(f ). There is an open interval I containing [0, 1] such

that x + t(y − x) ∈ dom(f ) for all t ∈ I. Define g : I → Rd by g(t) =
x + t(y − x) and set h = f ◦ g. Thus, h : I → Rm with h(t) = f (x + t(y − x)),
and for all t ∈ I, we have

h0 (t) = Dh(t) = Df (g(t))Dg(t) = Df (x + t(y − x))(y − x). (2.1)

Since we mostly consider real-valued functions, we will encounter dif-


ferentials in the form of gradients. For example, if f (x) = c>x = Σ_{j=1}^d cj xj ,
then ∇f (x) = c; and if f (x) = kxk^2 = Σ_{j=1}^d xj^2, then ∇f (x) = 2x.
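
These formulas are easy to confirm numerically with finite differences; the sketch below (ours, assuming NumPy) compares the claimed gradients of c>x and kxk^2 with a central-difference approximation.

import numpy as np

def num_grad(f, x, h=1e-6):
    # central finite-difference approximation of the gradient of f at x
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

c = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 1.5, -1.0])
print(num_grad(lambda z: c @ z, x), "vs", c)        # gradient of c^T x is c
print(num_grad(lambda z: z @ z, x), "vs", 2 * x)    # gradient of ||x||^2 is 2x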

2.2 Convex sets


Definition 2.8. A set C ⊆ Rd is convex if for any two points x, y ∈ C, the
connecting line segment is contained in C. In formulas, if for all λ ∈ [0, 1],
λx + (1 − λ)y ∈ C; see Figure 2.2.

Figure 2.2: A convex set (left) and a non-convex set (right)

Observation 2.9. Let Ci , i ∈ I be convex sets, where I is a (possibly infinite)
index set. Then C = ∩_{i∈I} Ci is a convex set.

For d = 1, convex sets are intervals.

2.2.1 The mean value inequality


The mean value inequality can be considered as a generalization of the
mean value theorem to multivariate and vector-valued functions over con-
vex sets (a “mean value equality” does not exist in this full generality).

To motivate it, let us consider the univariate and real-valued case first.
Let f : dom(f ) → R be differentiable and suppose that f has bounded
derivatives over an interval X ⊆ dom(f ), meaning that for some real
number B, we have |f 0 (x)| ≤ B for all x ∈ X. The mean value theorem
then gives the mean value inequality

|f (y) − f (x)| = |f 0 (c)(y − x)| ≤ B|y − x|

for all x, y ∈ X and some in-between c. In other words, f is not only


continuous but actually B-Lipschitz over X.
Vice versa, suppose that f is B-Lipschitz over a nonempty open interval
X, then for all c ∈ X,
              |f 0 (c)| = | lim_{δ→0} (f (c + δ) − f (c)) / δ | ≤ B,
so f has bounded derivatives over X. Hence, over an open interval, Lip-
schitz functions are exactly the ones with bounded derivative. Even if the
interval is not open, bounded derivatives still yield the Lipschitz property,
but the other direction may fail. As a trivial example, the Lipschitz con-
dition is always satisfied over a singleton interval X = {x}, but that does
not say anything about the derivative at x. In any case, we need X to be
an interval; if X has “holes”, the previous arguments break down.
These considerations extend to multivariate and vector-valued func-
tions over convex subsets of the domain.

Theorem 2.10. Let f : dom(f ) → Rm be differentiable, X ⊆ dom(f ) a con-


vex set, B ∈ R+ . If X ⊆ dom(f ) is nonempty and open, the following two
statements are equivalent.

(i) f is B-Lipschitz, meaning that

kf (x) − f (y)k ≤ B kx − yk , ∀x, y ∈ X

(ii) f has differentials bounded by B (in spectral norm), meaning that

kDf (x)k ≤ B, ∀x ∈ X.

Moreover, for every (not necessarily open) convex X ⊆ dom(f ), (ii) implies (i),
and this is the mean value inequality.

Proof. Suppose that f is B-Lipschitz over an open set X. For v ∈ Rd ,
v → 0, differentiability at x ∈ X yields for small v ∈ Rd that x + v ∈ X
and therefore

B kvk ≥ kf (x + v) − f (x)k = kDf (x)v + r(v)k ≥ kDf (x)vk − kr(v)k ,

where kr(v)k / kvk → 0, the first inequality uses (i), and the last is the
reverse triangle inequality. Rearranging and dividing by kvk, we get

              kDf (x)vk / kvk ≤ B + kr(v)k / kvk .

Let v? be a unit vector such that kDf (x)k = kDf (x)v? k / kv? k and let v =
tv? for t → 0. Then we further get

              kDf (x)k ≤ B + kr(v)k / kvk → B,

and kDf (x)k ≤ B follows, so differentials are bounded by B.


For the other direction, suppose that differentials are bounded by B
over X (not necessarily open); we proceed as in [FM91].
For fixed x, y ∈ X ⊆ dom(f ), x 6= y, and z ∈ Rm (to be determined
later), we define
h(t) = z> f (x + t(y − x))
over dom(h) = [0, 1], in which case the chain rule yields

h0 (t) = z> Df (x + t(y − x))(y − x), t ∈ (0, 1),

see also (2.1). Note that x + t(y − x) ∈ X for t ∈ [0, 1] by convexity of X.


The mean value theorem guarantees c ∈ (0, 1) such that h0 (c) = h(1)−h(0).
Now we compute

   |z>(f (y) − f (x))| = |h(1) − h(0)| = |h0 (c)|
                       = |z> Df (x + c(y − x))(y − x)|
                       ≤ kzk kDf (x + c(y − x))(y − x)k     (Cauchy-Schwarz)
                       ≤ kzk kDf (x + c(y − x))k k(y − x)k  (spectral norm)
                       ≤ B kzk k(y − x)k                    (bounded differentials).

We assume w.l.o.g. that f (x) 6= f (y), as otherwise, (i) trivially holds;
now we set

              z = (f (y) − f (x)) / kf (y) − f (x)k .
With this, the previous inequality reduces to (i), so f is indeed B-Lipschitz
over X.

2.3 Convex functions


We are considering real-valued functions f : dom(f ) → R, dom(f ) ⊆ Rd .
Definition 2.11 ([BV04, 3.1.1]). A function f : dom(f ) → R is convex if (i)
dom(f ) is convex and (ii) for all x, y ∈ dom(f ) and all λ ∈ [0, 1], we have

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y). (2.2)

Geometrically, the condition means that the line segment connecting


the two points (x, f (x)), (y, f (y)) ∈ Rd+1 lies pointwise above the graph
of f ; see Figure 2.3. (Whenever we say “above”, we mean “above or on”.)
An important special case arises when f : Rd → R is an affine function,
i.e. f (x) = c> x + c0 for some vector c ∈ Rd and scalar c0 ∈ R. In this case,
(2.2) is always satisfied with equality, and line segments connecting points
on the graph lie pointwise on the graph.
While the graph of f is the set {(x, f (x)) ∈ Rd+1 : x ∈ dom(f )}, the
epigraph (Figure 2.4) is the set of points above the graph,

epi(f ) := {(x, α) ∈ Rd+1 : x ∈ dom(f ), α ≥ f (x)}.

Observation 2.12. f is a convex function if and only if epi(f ) is a convex set.


Proof. This is easy but let us still do it to illustrate the concepts. Let f be a
convex function and consider two points (x, α), (y, β) ∈ epi(f ), λ ∈ [0, 1].
This means, f (x) ≤ α, f (y) ≤ β, hence by convexity of f ,

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) ≤ λα + (1 − λ)β.

Therefore, by definition of the epigraph,

λ(x, α) + (1 − λ)(y, β) = (λx + (1 − λ)y, λα + (1 − λ)β) ∈ epi(f ),

Figure 2.3: A convex function

so epi(f ) is a convex set. In the other direction, let epi(f ) be a convex set
and consider two points x, y ∈ dom(f ), λ ∈ [0, 1]. By convexity of epi(f ),
we have

epi(f ) 3 λ(x, f (x)) + (1 − λ)(y, f (y)) = (λx + (1 − λ)y, λf (x) + (1 − λ)f (y)),

and this is just a different way of writing (2.2).


Lemma 2.13 (Jensen’s inequality). Let f : Rd → R be a convex function,
x1 , . . . , xm ∈ dom(f ), and λ1 , . . . , λm ∈ R+ such that Σ_{i=1}^m λi = 1. Then

              f ( Σ_{i=1}^m λi xi ) ≤ Σ_{i=1}^m λi f (xi ).

For m = 2, this is (2.2). The proof of the general case is Exercise 7.
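
A small numerical illustration of Jensen's inequality (ours, not from the notes), with the convex function f (x) = x^2 and a random convex combination:

import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(size=6)                  # points x_1, ..., x_m (here d = 1)
lam = rng.random(6)
lam /= lam.sum()                         # convex coefficients summing to 1
f = lambda t: t ** 2                     # a convex function
print(f(lam @ xs), "<=", lam @ f(xs))    # f(sum λ_i x_i) <= sum λ_i f(x_i)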


Lemma 2.14. Let f be convex and suppose that dom(f ) is open. Then f is
continuous.
This is not entirely obvious (see Exercise 8) and really needs dom(f ) ⊆
Rd . It becomes false if we consider convex functions over vector spaces of
infinite dimension. In fact, in this case, even linear functions (which are in
particular convex) may fail to be continuous.

Figure 2.4: Graph and epigraph of a non-convex function (left) and a convex function (right)

Lemma 2.15. There exists an (infinite dimensional) vector space V and a linear
function f : V → R such that f is discontinuous at all v ∈ V .

Proof. This is a classical example. Let us consider the vector space V of all
univariate polynomials; the vector space operations are addition of two
polynomials, and multiplication of a polynomial with a scalar. We con-
sider a polynomial such as 3x5 + 2x2 + 1 as a function x 7→ 3x5 + 2x2 + 1
over the domain [−1, 1].
The standard norm in a function space such as V is the supremum norm
k · k∞ , defined for any bounded function h : [−1, 1] → R via khk∞ :=
supx∈[−1,1] |h(x)|. Polynomials are continuous and as such bounded over
[−1, 1].
We now consider the linear function f : V → R defined by f (p) = p0 (0),
the derivative of p at 0. The function f is linear, simply because the deriva-
tive is a linear operator. As dom(f ) is the whole space V , dom(f ) is open.
We claim that f is discontinuous at 0 (the zero polynomial). Since f is
linear, this implies discontinuity at every polynomial p ∈ V . To prove dis-
continuity at 0, we first observe that f (0) = 0 and then show that there are
polynomials p of arbitrarily small supremum norm with f (p) = 1. Indeed,

for n, k ∈ N, n > 0, consider the polynomial

   pn,k (x) = (1/n) Σ_{i=0}^k (−1)^i (nx)^{2i+1} / (2i + 1)!
            = (1/n) ( nx − (nx)^3/3! + (nx)^5/5! − · · · ± (nx)^{2k+1}/(2k + 1)! )

which—for any fixed n and sufficiently large k—approximates the function

   sn (x) = (1/n) sin(nx) = (1/n) Σ_{i=0}^∞ (−1)^i (nx)^{2i+1} / (2i + 1)!
up to any desired precision over the whole interval [−1, 1] (Taylor’s theo-
rem with remainder). In formulas, kpn,k − sn k∞ → 0 as k → ∞. Moreover,
ksn k∞ → 0 as n → ∞. Using the triangle inequality, this implies that
kpn,k k → 0 as n, k → ∞. On the other hand, f (pn,k ) = p0n,k (0) = 1 for all
n, k.

2.3.1 First-order characterization of convexity


As an example of a convex function, let us consider f (x1 , x2 ) = x21 + x22 .
The graph of f is the unit paraboloid in R3 which looks convex. However,
to verify (2.2) directly is somewhat cumbersome. Next, we develop better
ways to do this if the function under consideration is differentiable.
Lemma 2.16 ([BV04, 3.1.3]). Suppose that dom(f ) is open and that f is differ-
entiable; in particular, the gradient (vector of partial derivatives)
 
              ∇f (x) := ( ∂f/∂x1 (x), . . . , ∂f/∂xd (x) )
exists at every point x ∈ dom(f ). Then f is convex if and only if dom(f ) is
convex and
f (y) ≥ f (x) + ∇f (x)> (y − x) (2.3)
holds for all x, y ∈ dom(f ).
Geometrically, this means that for all x ∈ dom(f ), the graph of f lies
above its tangent hyperplane at the point (x, f (x)); see Figure 2.5.
Proof. Suppose that f is convex, meaning that for t ∈ (0, 1),

f (x+t(y−x)) = f ((1−t)x+ty) ≤ (1−t)f (x)+tf (y) = f (x)+t(f (y)−f (x)).

Figure 2.5: First-order characterization of convexity

Dividing by t and using differentiability at x, we get

   f (y) ≥ f (x) + (f (x + t(y − x)) − f (x)) / t
        = f (x) + (∇f (x)> t(y − x) + r(t(y − x))) / t
        = f (x) + ∇f (x)> (y − x) + r(t(y − x)) / t,
where the error term r(t(y − x))/t goes to 0 as t → 0. The inequality
f (y) ≥ f (x) + ∇f (x)> (y − x) follows.
Now suppose this inequality holds for all x, y ∈ dom(f ), let λ ∈ [0, 1],
and define z := λx + (1 − λ)y ∈ dom(f ) (by convexity of dom(f )). Then
we have
f (x) ≥ f (z) + ∇f (z)> (x − z),
f (y) ≥ f (z) + ∇f (z)> (y − z).
After multiplying the first inequality by λ and the second one by (1 − λ),
the gradient terms cancel in the sum of the two inequalities, and we get
λf (x) + (1 − λ)f (y) ≥ f (z) = f (λx + (1 − λ)y).
This is convexity.

For f (x1 , x2 ) = x21 + x22 , we have ∇f (x) = (2x1 , 2x2 ), hence (2.3) boils
down to
y12 + y22 ≥ x21 + x22 + 2x1 (y1 − x1 ) + 2x2 (y2 − x2 ),
which after some rearranging of terms is equivalent to
(y1 − x1 )2 + (y2 − x2 )2 ≥ 0,
hence true. There are relevant convex functions that are not differentiable,
see Figure 2.6 for an example. More generally, Exercise 14 asks you to
prove that the `1 -norm (or 1-norm) f (x) = kxk1 is convex.

Figure 2.6: A non-differentiable convex function

There is another useful and less standard first-order characterization of


convexity that we can easily derive from the standard one above.
Lemma 2.17. Suppose that dom(f ) is open and that f is differentiable. Then f
is convex if and only if dom(f ) is convex and
(∇f (y) − ∇f (x))> (y − x) ≥ 0 (2.4)
holds for all x, y ∈ dom(f ).
The inequality (2.4) is known as monotonicity of the gradient.
Proof. If f is convex, the first-order characterization in Lemma 2.16 yields
f (y) ≥ f (x) + ∇f (x)> (y − x),
f (x) ≥ f (y) + ∇f (y)> (x − y),
for all x, y ∈ dom(f ). After adding up these two inequalities, f (x) + f (y)
appears on both sides and hence cancels, so that we get
0 ≥ ∇f (x)> (y − x) + ∇f (y)> (x − y) = (∇f (y) − ∇f (x))> (x − y).

Multiplying this by −1 yields (2.4).
For the other direction, suppose that monotonicty of the gradient (2.4)
holds. Then we in particular have
(∇f (x + t(y − x)) − ∇f (x))> (t(y − x)) ≥ 0
for all x, y ∈ dom(f ) and t ∈ (0, 1). Dividing by t, this yields
(∇f (x + t(y − x)) − ∇f (x))> (y − x) ≥ 0. (2.5)
Fix x, y ∈ dom(f ). For t ∈ [0, 1], let h(t) := f (x + t(y − x)). In our case
where f is real-valued, (2.1) yields h0 (t) = ∇f (x + t(y − x))> (y − x), t ∈
(0, 1). Hence, (2.5) can be rewritten as
h0 (t) ≥ ∇f (x)> (y − x), t ∈ (0, 1).
By the mean value theorem, there is c ∈ (0, 1) such that h0 (c) = h(1) − h(0).
Then
f (y) = h(1) = h(0) + h0 (c) = f (x) + h0 (c)
≥ f (x) + ∇f (x)> (y − x).
This is the first-order characterization of convexity (Lemma 2.16).

2.3.2 Second-order characterization of convexity


If f : dom(f ) → R is twice differentiable (meaning that f is differentiable
and the gradient function ∇f is also differentiable), convexity can be char-
acterized as follows.
Lemma 2.18. Suppose that dom(f ) is open and that f is twice differentiable; in
particular, the Hessian (matrix of second partial derivatives)
              [ ∂2f/∂x1∂x1 (x)   ∂2f/∂x1∂x2 (x)   · · ·   ∂2f/∂x1∂xd (x) ]
   ∇2 f (x) = [ ∂2f/∂x2∂x1 (x)   ∂2f/∂x2∂x2 (x)   · · ·   ∂2f/∂x2∂xd (x) ]
              [       ...               ...        · · ·        ...       ]
              [ ∂2f/∂xd∂x1 (x)   ∂2f/∂xd∂x2 (x)   · · ·   ∂2f/∂xd∂xd (x) ]

exists at every point x ∈ dom(f ) and is symmetric. Then f is convex if and only
if dom(f ) is convex, and for all x ∈ dom(f ), we have

              ∇2 f (x) ⪰ 0   (i.e. ∇2 f (x) is positive semidefinite).        (2.6)

(A symmetric matrix M is positive semidefinite, denoted by M ⪰ 0, if x>M x ≥
0 for all x, and positive definite, denoted by M ≻ 0, if x>M x > 0 for all x 6= 0.)

The fact that the Hessians of a twice continuously differentiable function


are symmetric is a classical result known as the Schwarz theorem [AE08,
Corollary 5.5]. But symmetry in fact already holds if f is twice differen-
tiable [Die69, (8.12.3)]. However, if f is only twice partially differentiable,
we may get non-symmetric Hessians [AE08, Remark 5.6].
Proof. Once again, we employ our favorite univariate function h(t) :=
f (x + t(y − x)), for fixed x, y ∈ dom(f ) and t ∈ I where I ⊃ [0, 1] is a
suitable open interval. But this time, we also need h’s second derivative.
For t ∈ I, v := y − x, we have

h0 (t) = ∇f (x + tv)> v,
h00 (t) = v> ∇2 f (x + tv)v.

The formula for h0 (t) has already been derived in the proof of Lemma 2.17,
and the formula for h00 (t) is Exercise 15.
If f is convex, we always have h00 (0) ≥ 0, as we will show next. Given
this, ∇2 f (x) ⪰ 0 follows for every x ∈ dom(f ): by openness of dom(f ),
for every v ∈ Rd of sufficiently small norm, there is y ∈ dom(f ) such that
v = y − x, and then v> ∇2 f (x)v = h00 (0) ≥ 0. By scaling, this inequality
extends to all v ∈ Rd .
To show h00 (0) ≥ 0, we observe that for all sufficiently small δ, x + δv ∈
dom(f ) and hence

   (h0 (δ) − h0 (0)) / δ = (∇f (x + δv) − ∇f (x))> v / δ = (∇f (x + δv) − ∇f (x))> (δv) / δ^2 ≥ 0,
by monotonicity of the gradient for convex f (Lemma 2.17). It follows that
h00 (0) = limδ→0 (h0 (δ) − h0 (0))/δ ≥ 0.
For the other direction, the mean value theorem applied to h0 yields
c ∈ (0, 1) such that h0 (1) − h0 (0) = h00 (c), and spelled out, this is

∇f (y)> v − ∇f (x)> v = v> ∇2 f (x + cv)v ≥ 0, (2.7)

since ∇2 f (z) ⪰ 0 for all z ∈ dom(f ). Hence, we have proved monotonicity


of the gradient which by Lemma 2.17 implies convexity of f .

Geometrically, Lemma 2.18 means that the graph of f has non-negative
curvature everywhere and hence “looks like a bowl”. For f (x1 , x2 ) = x21 +
x22 , we have

              ∇2 f (x) = [ 2  0 ]
                         [ 0  2 ],
which is a positive definite matrix. In higher dimensions, the same ar-
gument can be used to show that the squared distance dy (x) = kx −
yk2 to a fixed point y is a convex function; see Exercise 9. The non-
squared Euclidean distance kx − yk is also convex in x, as a consequence
of Lemma 2.19(ii) below and the fact that every seminorm (in particular
the Euclidean norm kxk) is convex (Exercise 16). The squared Euclidean
distance has the advantage that it is differentiable, while the Euclidean
distance itself (whose graph is an “ice cream cone” for d = 2) is not.
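
Condition (2.6) can also be checked numerically at sample points by computing the eigenvalues of the Hessian; here is a minimal sketch (ours, assuming NumPy) for f (x1 , x2 ) = x21 + x22 and for the squared distance to a fixed point y.

import numpy as np

H = np.array([[2.0, 0.0], [0.0, 2.0]])   # Hessian of x1^2 + x2^2 (the same at every x)
print(np.linalg.eigvalsh(H))             # eigenvalues 2, 2 > 0: positive definite

# Hessian of ||x - y||^2 is 2*I for any fixed y, again positive definite
print(np.linalg.eigvalsh(2 * np.eye(2)))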

2.3.3 Operations that preserve convexity


There are three important operations that preserve convexity.
Lemma 2.19 (Exercise 10).

(i) Let f1 , f2 , . . . , fm be convex functions, λ1 , λ2 , . . . , λm ∈ R+ . Then
    f := max_{i=1,...,m} fi as well as f := Σ_{i=1}^m λi fi are convex on
    dom(f ) := ∩_{i=1}^m dom(fi ).
(ii) Let f be a convex function with dom(f ) ⊆ Rd , g : Rm → Rd an affine
function, meaning that g(x) = Ax + b, for some matrix A ∈ Rd×m and
some vector b ∈ Rd . Then the function f ◦ g (that maps x to f (Ax + b))
is convex on dom(f ◦ g) := {x ∈ Rm : g(x) ∈ dom(f )}.

2.4 Minimizing convex functions


The main feature that makes convex functions attractive in optimization
is that every local minimum is a global one, so we cannot “get stuck” in
local optima. This is quite intuitive if we think of the graph of a convex
function as being bowl-shaped.
Definition 2.20. A local minimum of f : dom(f ) → R is a point x such that
there exists ε > 0 with
f (x) ≤ f (y) ∀y ∈ dom(f ) satisfying ky − xk < ε.

Lemma 2.21. Let x? be a local minimum of a convex function f : dom(f ) → R.
Then x? is a global minimum, meaning that

f (x? ) ≤ f (y) ∀y ∈ dom(f ).

Proof. Suppose there exists y ∈ dom(f ) such that f (y) < f (x? ) and define
y0 := λx? + (1 − λ)y for λ ∈ (0, 1). From convexity (2.2), we get
that f (y0 ) < f (x? ). Choosing λ so close to 1 that ky0 − x? k < ε yields a
contradiction to x? being a local minimum.
This does not mean that a convex function always has a global mini-
mum. Think of f (x) = x as a trivial example. But also if f is bounded from
below over dom(f ), it may fail to have a global minimum (f (x) = ex ).
To ensure the existence of a global minimum, we need additional condi-
tions. For example, it suffices if outside some ball B, all function values
are larger than some value f (x), x ∈ B. In this case, we can restrict f
to B, without changing the smallest attainable value. And on B (which is
compact), f attains a minimum by continuity (Lemma 2.14). An easy ex-
ample: for f (x1 , x2 ) = x21 + x22 , we know that outside any ball containing 0,
f (x) > f (0) = 0.
Another easy condition in the differentiable case is given by the follow-
ing result.

Lemma 2.22. Suppose that f : dom(f ) → R is convex and differentiable over


an open domain dom(f ) ⊆ Rd . Let x ∈ dom(f ). If ∇f (x) = 0, then x is a
global minimum.

Proof. Suppose that ∇f (x) = 0. According to Lemma 2.16, we have

f (y) ≥ f (x) + ∇f (x)> (y − x) = f (x)

for all y ∈ dom(f ), so x is a global minimum.


The converse is also true and does not even require convexity.

Lemma 2.23. Suppose that f : dom(f ) → R is differentiable over an open


domain dom(f ) ⊆ Rd . Let x ∈ dom(f ). If x is a global minimum then
∇f (x) = 0.

Proof. Suppose that ∇f (x)i 6= 0 for some i. For t ∈ R, we define x(t) =
x + tei , where ei is the i-th unit vector. For |t| sufficiently small, we have
x(t) ∈ dom(f ) since dom(f ) is open. Let z(t) = f (x(t)). By the chain rule,
z 0 (0) = ∇f (x)> ei = ∇f (x)i 6= 0. Hence, z decreases in one direction as we
move away from 0, and this yields f (x(t)) < f (x) for some t, so x is not a
global minimum.

2.4.1 Strictly convex functions


In general, a global minimum of a convex function is not unique (think of
f (x) = 0 as a trivial example). However, if we forbid “flat” parts of the
graph of f , a global minimum becomes unique (if it exists at all).

Definition 2.24 ([BV04, 3.1.1]). A function f : dom(f ) → R is strictly con-


vex if (i) dom(f ) is convex and (ii) for all x 6= y ∈ dom(f ) and all λ ∈ (0, 1),
we have
f (λx + (1 − λ)y) < λf (x) + (1 − λ)f (y). (2.8)

This means that the open line segment connecting (x, f (x)) and (y, f (y))
is pointwise strictly above the graph of f . For example, f (x) = x2 is strictly
convex.

Lemma 2.25 ([BV04, 3.1.4]). Suppose that dom(f ) is open and that f is twice
continuously differentiable. If the Hessian ∇2 f (x) ≻ 0 for every x ∈ dom(f )
(i.e., z> ∇2 f (x)z > 0 for any z 6= 0), then f is strictly convex.

The converse is false, though: f (x) = x4 is strictly convex but has van-
ishing second derivative at x = 0.

Lemma 2.26. Let f : dom(f ) → R be strictly convex. Then f has at most one
global minimum.

Proof. Suppose x? 6= y? are two global minima with fmin = f (x? ) = f (y? ),
and let z = (1/2) x? + (1/2) y? . By (2.8),

              f (z) < (1/2) fmin + (1/2) fmin = fmin ,
a contradiction to x? and y? being global minima.

2.4.2 Example: Least squares
Suppose we want to fit a hyperplane to a set of data points x1 , . . . , xm in
Rd , based on the hypothesis that the points actually come (approximately)
from a hyperplane. A classical method for this is least squares. For con-
creteness, let us do this in R2 . Suppose that the data points are

(1, 10), (2, 11), (3, 11), (4, 10), (5, 9), (6, 10), (7, 9), (8, 10),

see Figure 2.7 (left).


Figure 2.7: Data points in R2 (left) and least-squares fit (right)

Also, for simplicity (and quite appropriately in this case), let us restrict
to fitting a linear model, or more formally to fit non-vertical lines of the
form y = w0 + w1 x. If (xi , yi ) is the i-th data point, the least squares fit
chooses w0 , w1 such that the least squares objective
              f (w0 , w1 ) = Σ_{i=1}^8 (w1 xi + w0 − yi )^2

is minimized. It easily follows from Lemma 2.19 that f is convex. In fact,

f (w0 , w1 ) = 204w12 + 72w1 w0 − 706w1 + 8w02 − 160w0 + 804, (2.9)

so we can check convexity directly using the second order condition. We
have gradient

∇f (w0 , w1 ) = (72w1 + 16w0 − 160, 408w1 + 72w0 − 706)

and Hessian
              ∇2 f (w0 , w1 ) = [ 16   72 ]
                                [ 72  408 ].
A symmetric 2 × 2 matrix is positive definite if the diagonal elements and the
determinant are positive, which is the case here, so f is actually strictly
convex and has a unique global minimum. To find it, we solve the linear
system ∇f (w0 , w1 ) = (0, 0) of two equations in two unknowns and obtain
the global minimum
 43 1 
? ?
(w0 , w1 ) = ,− .
4 6
Hence, the “optimal” line is
              y = −(1/6) x + 43/4,
see Figure 2.7 (right).
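
The same fit can be reproduced numerically; the sketch below (ours, assuming NumPy) minimizes the least squares objective directly and recovers (w0? , w1? ) = (43/4, −1/6).

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([10, 11, 11, 10, 9, 10, 9, 10], dtype=float)
A = np.column_stack([np.ones_like(x), x])   # columns correspond to w0 and w1
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes sum (w0 + w1*x_i - y_i)^2
print(w)                                    # approximately [10.75, -0.1667]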

2.4.3 Constrained Minimization


Frequently, we are interested in minimizing a convex function only over a
subset X of its domain.

Definition 2.27. Let f : dom(f ) → R be convex and let X ⊆ dom(f ) be a


convex set. A point x ∈ X is a minimizer of f over X if

f (x) ≤ f (y) ∀y ∈ X.

If f is differentiable, minimizers of f over X have a very useful charac-


terization.

Lemma 2.28 ([BV04, 4.2.3]). Suppose that f : dom(f ) → R is convex and


differentiable over an open domain dom(f ) ⊆ Rd , and let X ⊆ dom(f ) be a
convex set. Point x? ∈ X is a minimizer of f over X if and only if

∇f (x? )> (x − x? ) ≥ 0 ∀x ∈ X.

If X does not contain the global minimum, then Lemma 2.28 has a
nice geometric interpretation. Namely, it means that X is contained in the
halfspace {x ∈ Rd : ∇f (x? )> (x − x? ) ≥ 0} (normal vector ∇f (x? ) at x?
pointing into the halfspace); see Figure 2.8. In still other words, x − x?
forms a non-obtuse angle with ∇f (x? ) for all x ∈ X.
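
To get a feel for this condition, here is a tiny numerical sketch (ours, assuming NumPy): we minimize f (x) = kx − pk^2 over the unit ball X, whose minimizer is the projection x? = p/kpk when p lies outside the ball, and check that ∇f (x? )> (x − x? ) ≥ 0 for sampled points x ∈ X.

import numpy as np

rng = np.random.default_rng(3)
p = np.array([2.0, 1.0])                  # a point outside the unit ball
xstar = p / np.linalg.norm(p)             # minimizer of ||x - p||^2 over the unit ball
grad = 2 * (xstar - p)                    # gradient of f at x*
for _ in range(5):
    x = rng.normal(size=2)
    x = x / max(1.0, np.linalg.norm(x))   # a point inside the unit ball
    print(grad @ (x - xstar) >= -1e-9)    # optimality condition of Lemma 2.28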

Figure 2.8: Optimality condition for constrained optimization

We typically write constrained minimization problems in the form

argmin{f (x) : x ∈ X} (2.10)

or
              minimize    f (x)
              subject to  x ∈ X.                                        (2.11)

2.5 Existence of a minimizer


The existence of a minimizer (or a global minimum if X = dom(f )) will
be an assumption made by most minimization algorithms that we discuss
later. In practice, such algorithms are being used (and often also work)
if there is no minimizer. By “work”, we mean in this case that they com-
pute a point x such that f (x) is close to inf y∈X f (y), assuming that the
infimum is finite (as in f (x) = ex ). But a sound theoretical analysis usu-
ally requires the existence of a minimizer. Therefore, this section develops
tools that may help us in analyzing whether this is the case for a given

convex function. To avoid technicalities, we restrict ourselves to the case
dom(f ) = Rd .

2.5.1 Sublevel sets and the Weierstrass Theorem


Definition 2.29. Let f : Rd → R, α ∈ R. The set
f ≤α := {x ∈ Rd : f (x) ≤ α}
is the α-sublevel set of f ; see Figure 2.9

Figure 2.9: Sublevel set of a non-convex function (left) and a convex function (right)

It is easy to see from the definition that every sublevel set of a convex
function is convex. Moreover, as a consequence of continuity of f , sublevel
sets are closed. The following (known as the Weierstrass Theorem) just
formalizes an argument that we have made earlier.
Theorem 2.30. Let f : Rd → R be a continuous function, and suppose there is
a nonempty and bounded sublevel set f ≤α . Then f has a global minimum.
Proof. As the set (−∞, α] is closed, its pre-image f ≤α by the continuous
function f is closed. We know that f —as a continuous function—attains a
minimum over the (non-empty) closed and bounded (= compact) set f ≤α
at some x? . This x? is also a global minimum as it has value f (x? ) ≤ α,
while any x ∈/ f ≤α has value f (x) > α ≥ f (x? ).

Note that Theorem 2.30 holds for convex functions as convexity on Rd
implies continuity (Exercise 8).

2.5.2 Recession cone and lineality space


What happens if there is no bounded sublevel set? Then f may (f (x) = 0)
or may not (f (x) = x, f (x) = ex ) have a global minimum. But what can-
not happen is that some nonempty sublevel sets are bounded, and others
are not. Also, in the unbounded case, all nonempty sublevel sets have
the same “reason” for being unbounded, namely the same directions of re-
cession. Again, we assume for simplicity that dom(f ) = Rd , and we are
only considering unconstrained minimization. But most of the material
can be adapted to general domains and to constrained optimization over
a convex set X ⊆ dom(f ). We closely follow Bertsekas [Ber05].

Recession cone and lineality space of a convex set.

Definition 2.31. Let C ⊆ Rd be a convex set. Then y ∈ Rd is a direction of


recession of C if for some x ∈ C and all λ ≥ 0, it holds that x + λy ∈ C.

This means that C is unbounded in direction y. Whether y is a direc-


tion of recession only depends on C and not on the particular x, assuming
that C is closed (otherwise, it may be false).

Lemma 2.32. Let C ⊆ Rd be a nonempty closed convex set, and let y ∈ Rd . The
following statements are equivalent.

(i) ∃x ∈ C : x + λy ∈ C for all λ ≥ 0.

(ii) ∀x ∈ C : x + λy ∈ C for all λ ≥ 0.

Proof. We need to show that (i) implies (ii), so choose x ∈ C, y ∈ Rd such


that x + λy ∈ C for all λ ≥ 0. Fix λ > 0, let z = λy and x0 ∈ C. To get (ii),
we prove that x0 +z ∈ C. To this end, we define sequences (wk ), (zk ), k ∈ N
via

wk := x + kz ∈ C (by (i))
zk := (1/k)(wk − x0 ) = z + (1/k)(x − x0 ),

see Figure 2.10. By definition of a convex set, we have x0 + zk = (1/k) wk +
(1 − 1/k) x0 ∈ C. Moreover, zk converges to z, so x0 + zk converges to x0 + z,
and this is an element of C, since C is closed.

Figure 2.10: The proof of Lemma 2.32

The directions of recession of C actually form a convex cone, a set that is


closed under taking non-negative linear combinations. This is known as
the recession cone R(C) of C; see Figure 2.11 (left).

Figure 2.11: The recession cone and lineality space of a convex set

Lemma 2.33. Let C ⊆ Rd be a closed convex set, and let y1 , y2 be directions of


recession of C; λ1 , λ2 ∈ R+ . Then y = λ1 y1 + λ2 y2 is also a direction of recession
of C.

Proof. The statement is trivial if y = 0. Otherwise, after scaling y by


1/(λ1 + λ2 ) > 0 (which does not affect the property of being a direction

of recession), we may assume that λ1 + λ2 = 1. Now, for all x ∈ C and all
λ ≥ 0, we get

              x + λy = x + λ1λ y1 + λ2λ y2 = λ1 (x + λy1 ) + λ2 (x + λy2 ) ∈ C,

since both x + λy1 ∈ C and x + λy2 ∈ C (as y1 , y2 are directions of recession)
and λ1 + λ2 = 1.

Definition 2.34. Let C ⊆ Rd be a convex set. y ∈ Rd is a direction of con-


stancy of C if both y and −y are directions of recession of C.

This means that C is unbounded along the whole line spanned by y.


The directions of constancy form a linear subspace, the lineality space L(C)
of C; see Figure 2.11 (right).

Lemma 2.35. Let C ⊆ Rd be a closed convex set, and let y1 , y2 be directions of


constancy of C; λ1 , λ2 ∈ R. Then y = λ1 y1 +λ2 y2 is also a direction of constancy
of C.

Proof. After replacing yi with the direction of recession −yi if necessary


(i = 1, 2), we may assume that λ1 , λ2 ≥ 0, so y is a direction of recession
by Lemma 2.33. The same argument works for −y, so y is a direction of
constancy.

Recession cone and lineality space of a convex function. For this, we


look at directions of recession of sublevel sets (which are closed and con-
vex in our case).

Lemma 2.36. Let f : Rd → R be a convex function. Any two nonempty sublevel
sets f ≤α , f ≤α0 have the same recession cones, i.e. R(f ≤α ) = R(f ≤α0 ).

Proof. Let y be a direction of recession for f ≤α , i.e. for all x ∈ f ≤α and all
λ ≥ 0, we have
f (x + λy) ≤ α.
We claim that this implies the stronger bound

f (x + λy) ≤ f (x). (2.12)

In words, f is non-increasing along any direction of recession. Using this,


the statement follows: Because f ≤α and f ≤α0 are nonempty, there exists
x0 ∈ f ≤α ∩ f ≤α0 , and then we have f (x0 + λy) ≤ f (x0 ) ≤ α0 , so y is a
direction of recession for f ≤α0 .
To prove (2.12), we fix λ and let z = λy. With wk := x + kz ∈ f ≤α , we
have
              x + z = (1 − 1/k) x + (1/k) wk ,

so convexity of f and the fact that wk ∈ f ≤α yields

   f (x + z) ≤ (1 − 1/k) f (x) + (1/k) f (wk ) ≤ (1 − 1/k) f (x) + (1/k) α.   (2.13)

Thus, as k → ∞, the right side of (2.13) tends to f (x), and therefore


f (x + z) ≤ f (x); see Figure 2.12.

 
Figure 2.12: The proof of (2.12)

Definition 2.37. Let f : Rd → R be a convex function. Then y ∈ Rd is a


direction of recession (of constancy, respectively) of f if y is a direction of recession
(of constancy, respectively) for some (equivalently, for every) nonempty sublevel
set. The set of directions of recession of f is called the recession cone R(f ) of f .
The set of directions of constancy of f is called the lineality space L(f ) of f .

We can characterize recession cone and lineality space of f directly,


without looking at sublevel sets (the proof is Exercise 11). The conditions
of Lemma 2.38(ii) and Lemma 2.39(ii) finally explain the terms “direction
of recession” and “direction of constancy”.

Lemma 2.38. Let f : Rd → R be a convex function. The following statements


are equivalent.

(i) y ∈ Rd is a direction of recession of f .
(ii) f (x + λy) ≤ f (x) for all x ∈ Rd and all λ ∈ R+ .
(iii) (y, 0) is a (“horizontal”) direction of recession of (the closed convex set)
epi(f ).
Lemma 2.39. Let f : Rd → R be convex. The following statements are equiva-
lent.
(i) y ∈ Rd is a direction of constancy of f .
(ii) f (x + λy) = f (x) for all x ∈ Rd and all λ ∈ R.
(iii) (y, 0) is a (“horizontal”) direction of constancy of (the closed convex set)
epi(f ).

2.5.3 Coercive convex functions


Definition 2.40. A convex function f is coercive if its recession cone is trivial,
meaning that 0 is its only direction of recession.1
Coercivity means that along any direction, f (x) goes to infinity. An
example of a coercive convex function is f (x1 , x2 ) = x1^2 + x2^2 . Non-coercive
functions are f (x) = x and f (x) = e^x (any y ≤ 0 is a direction of recession).
For a constant function f : Rd → R, every direction y is a direction of
recession. In general, affine functions are never coercive.
Lemma 2.41. Let f : Rd → R be a coercive convex function. Then every
nonempty sublevel set f ≤α is bounded.
This may seem obvious, as f ≤α is bounded in every direction by coer-
civity. But we still need an argument that there is a global bound.
Proof. Let f ≤α be a nonempty sublevel set, and assume without loss of gener-
ality that 0 ∈ f ≤α , i.e., α ≥ f (0).
Let S d−1 = {y ∈ Rd : kyk = 1} be the unit sphere. We define a function
g : S d−1 → R via

g(y) = max{λ ≥ 0 : f (λy) ≤ α}, y ∈ S d−1 .


1
The usual definition of a coercive function is that f (x) → ∞ whenever kxk → ∞. In
the convex case, both definitions agree.

Since f is continuous and has no nonzero direction of recession, we know
that for each y ∈ S d−1 the set {λ ≥ 0 : f (λy) ≤ α} is closed and bounded
(it is actually an interval, by convexity of f ), so the maximum exists and
g(y) is well-defined. We claim that g is continuous.
Let (yk )k∈N be a sequence of unit vectors such that limk→∞ yk = y ∈
S d−1 . We need to show that limk→∞ g(yk ) = g(y). Let us fix ε > 0 arbitrarily
small. For λ := g(y) − ε ≥ 0, we have f (λy) ≤ α (an easy consequence of
convexity of f and α ≥ f (0)). And for λ̄ := g(y) + ε, we get f (λ̄y) > α by
definition of g(y). Continuity of f then yields limk→∞ f (λyk ) = f (λy) ≤ α
and limk→∞ f (λ̄yk ) = f (λ̄y) > α. Hence, for sufficiently large k, g(yk ) ∈
[λ, λ̄] = [g(y) − ε, g(y) + ε], and limk→∞ g(yk ) = g(y) follows.
As a continuous function, g attains a maximum λ? over the compact set
S d−1 , and this means that f ≤α is contained in the closed λ? -ball around the
origin. Hence, f ≤α is bounded.


Together with Theorem 2.30, we obtain

Theorem 2.42. Let f : Rd → R be a coercive convex function. Then f has a


global minimum.

2.5.4 Weakly coercive convex functions


It turns out that we can allow nontrivial directions of recession and still
guarantee a global minimum.

Definition 2.43. Let f : Rd → R be a convex function. Function f is called


weakly coercive if its recession cone equals its lineality space, i.e. every direction
of recession is a direction of constancy.

Function f (x) = 0 is a trivial example of a non-coercive but weakly


coercive function. A more interesting example is f (x1 , x2 ) = x1^2 . Here, the
directions of recession are all vectors of the form y = (0, x2 ), and these are
at the same time directions of constancy.

Theorem 2.44. Let f : Rd → R be a weakly coercive convex function. Then f


has a global minimum.

Proof. We know that the lineality space L of f is a linear subspace of Rd : by


Definition 2.37, L is the lineality space of every nonempty (closed and con-
vex) sublevel set, and as such it is closed under taking linear combinations

(Lemma 2.35). Let L⊥ be the orthogonal complement of L. Restricted to
L⊥ , f is coercive, as L⊥ is orthogonal to any direction of constancy, equiva-
lently to every direction of recession, since f is weakly coercive. Therefore,
L⊥ can contain only the trivial direction of recession. It follows that f|L⊥
has a global minimum x? ∈ L⊥ by Theorem 2.42 (which we can apply af-
ter identifying L⊥ w.l.o.g. with Rm for some m ≤ d). This is also a global
minimum of f . To see this, let z ∈ Rd and write it (uniquely) in the form
z = x + y with x ∈ L⊥ and y ∈ L. Then we get

f (z) = f (x + y) = f (x) ≥ f (x? ),

where the second equality follows from y being a direction of constancy;


see Lemma 2.39(ii).

2.6 Examples
In the following two sections, we give two examples of convex function
minimization tasks that arise from machine learning applications.

2.6.1 Handwritten digit recognition


Suppose you want to write a program that recognizes handwritten deci-
mal digits 0, 1, . . . , 9. You have a set P of grayscale images (28 × 28 pixels,
say) that represent handwritten decimal digits, and for each image x ∈ P ,
you know the digit d(x) ∈ {0, . . . , 9} that it represents, see Figure 2.13.
You want to train your program with the set P , and after that, use it to
recognize handwritten digits in arbitrary 28 × 28 images.
The classical approach is the following. We represent an image as a
feature vector x ∈ R784 , where xi is the gray value of the i-th pixel (in some
order). During the training phase, we compute a matrix W ∈ R10×784 and
then use the vector y = W x ∈ R10 to predict the digit seen in an arbitrary
image x. The idea is that yj , j = 0, . . . , 9 corresponds to the probability
of the digit being j. This does not work directly, since the entries of y
may be negative and generally do not sum up to 1. But we can convert y
to a vector z of actual probabilities, such that a small yj leads to a small
probability zj and a large yj to a large probability zj . How to do this is not
canonical, but here is a well-known formula that works:

Figure 2.13: Some training images from the MNIST data set (picture from
http://corochann.com/mnist-dataset-introduction-1138.html)

zj = zj (y) = e^{yj} / Σ_{k=0}^{9} e^{yk} .    (2.14)
The classification then simply outputs digit j with probability zj . The
matrix W is chosen such that it (approximately) minimizes the classifica-
tion error on the training set P . Again, it is not canonical how we measure
classification error; here we use the following loss function to evaluate the
error induced by a given matrix W .
`(W ) = − Σ_{x∈P} ln zd(x) (W x) = Σ_{x∈P} ( ln ( Σ_{k=0}^{9} e^{(W x)k} ) − (W x)d(x) ).    (2.15)

This function “punishes” images for which the correct digit j has low
probability zj (corresponding to a significantly negative value of log zj ).
In an ideal world, the correct digit would always have probability 1, re-
sulting in `(W ) = 0. But under (2.14), probabilities are always strictly
between 0 and 1, so we have `(W ) > 0 for all W .
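To make (2.14) and (2.15) concrete, here is a small numerical sketch (ours, not part of the original development); the synthetic data, array shapes and variable names are illustrative assumptions, and real MNIST images would take the place of the random vectors.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the training set P: 5 "images" with 784 gray
    # values each, together with their digits d(x).
    X = rng.random((5, 784))                     # rows are feature vectors x
    d = rng.integers(0, 10, size=5)              # true digits d(x)
    W = rng.normal(scale=0.01, size=(10, 784))   # weight matrix to evaluate

    def probabilities(W, x):
        """Vector z of (2.14): softmax of y = Wx."""
        y = W @ x
        e = np.exp(y - y.max())                  # subtracting max(y) avoids overflow
        return e / e.sum()

    def loss(W, X, d):
        """The loss ell(W) of (2.15)."""
        return -sum(np.log(probabilities(W, x)[dx]) for x, dx in zip(X, d))

    print(loss(W, X, d))                         # strictly positive, as discussed above

Note that subtracting max(y) before exponentiating does not change zj (y) but avoids numerical overflow; this is the standard way of evaluating (2.14) in practice.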

Exercise 12 asks you to prove that ` is convex. In Exercise 13, you will
characterize the situations in which ` has a global minimum.

2.6.2 Master’s Admission


The computer science department of a well known Swiss university is ad-
mitting top international students to its MSc program, in a competitive
application process. Applicants are submitting various documents (GPA,
TOEFL test score, GRE test scores, reference letters,. . . ). During the evalu-
ation of an application, the admission committee would like to compute a
(rough) forecast of the applicant’s performance in the MSc program, based
on the submitted documents.2
Data on the actual performance of students admitted in the past is
available. To keep things simple in the following example, let us base
the forecast on GPA (grade point average) and TOEFL (Test of English as
a Foreign Language) only. GPA scores are normalized to a scale with a
minimum of 0.0 and a maximum of 4.0, where admission starts from 3.5.
TOEFL scores are on an integer scale between 0 and 120, where admission
starts from 100.
Table 2.1 contains the known data. GGPA (graduation grade point av-
erage on a Swiss grading scale) is the average grade obtained by an ad-
mitted student over all courses in the MSc program. The Swiss scale goes
from 1 to 6 where 1 is the lowest grade, 6 is the highest, and 4 is the lowest
passing grade.
As in Section 2.4.2, we are attempting a linear regression with least
squares fit, i.e. we are making the hypothesis that

GGPA ≈ w0 + w1 · GPA + w2 · TOEFL. (2.16)

However, in our scenario, the relevant GPA scores span a range of only
0.5 while the relevant TOEFL scores span a range of 20. The resulting least
squares objective would be somewhat ugly; we already saw this in our
previous example (2.9), where the data points had large second coordinate,
resulting in the w1 -scale being very different from the w2 -scale. This time,
we normalize first, so that w1 and w2 become comparable and allow us to
understand the relative influences of GPA and TOEFL.
2
Any resemblance to real departments is purely coincidental. Also, no serious depart-
ment will base performance forecasts on data from 10 students, as we will do it here.

74
GPA TOEFL GGPA
3.52 100 3.92
3.66 109 4.34
3.76 113 4.80
3.74 100 4.67
3.93 100 5.52
3.88 115 5.44
3.77 115 5.04
3.66 107 4.73
3.87 106 5.03
3.84 107 5.06

Table 2.1: Data for 10 admitted students: GPA and TOEFL scores (at time
of application), GGPA (at time of graduation)

The general setting is this: we have n inputs x1 , . . . , xn , where each vec-


tor xi ∈ Rd consists of d input variables; then we have n outputs y1 , . . . , yn ∈
R. Each pair (xi , yi ) is an observation. In our case, d = 2, n = 10, and for
example, ((3.93, 100), 5.52) is an observation (of a student doing very well).
With variable weights w0 , w = (w1 , . . . , wd ) ∈ Rd , we plan to minimize
the least squares objective
f (w0 , w) = Σ_{i=1}^{n} (w0 + w> xi − yi )^2 .

We first want to assume that the inputs and outputs are centered, mean-
ing that
(1/n) Σ_{i=1}^{n} xi = 0,    (1/n) Σ_{i=1}^{n} yi = 0.

This can be achieved by simply subtracting the mean x̄ = (1/n) Σ_{i=1}^{n} xi from
every input and the mean ȳ = (1/n) Σ_{i=1}^{n} yi from every output. In our exam-
ple, this yields the numbers in Table 2.2 (left).
After centering, the global minimum (w0? , w? ) of the least squares ob-
jective satisfies w0? = 0 while w? is unaffected by centering (Exercise 17),
so that we can simply omit the variable w0 in the sequel.

GPA TOEFL GGPA GPA TOEFL GGPA
-0.24 -7.2 -0.94 -2.04 -1.28 -0.94
-0.10 1.8 -0.52 -0.88 0.32 -0.52
-0.01 5.8 -0.05 -0.05 1.03 -0.05
-0.02 -7.2 -0.18 -0.16 -1.28 -0.18
0.17 -7.2 0.67 1.42 -1.28 0.67
0.12 7.8 0.59 1.02 1.39 0.59
0.01 7.8 0.19 0.06 1.39 0.19
-0.10 -0.2 -0.12 -0.88 -0.04 -0.12
0.11 -1.2 0.17 0.89 -0.21 0.17
0.07 -0.2 0.21 0.62 -0.04 0.21

Table 2.2: Centered observations (left); normalized inputs (right)

Finally, we assume that all d input variables are on the same scale,
meaning that
(1/n) Σ_{i=1}^{n} x_{ij}^2 = 1,    j = 1, . . . , d.

To achieve this for fixed j (assuming that no variable is 0 in all inputs),
we multiply all x_{ij} by s(j) = sqrt( n / Σ_{i=1}^{n} x_{ij}^2 ) (which, in the optimal solution
w? , just multiplies wj? by 1/s(j), an argument very similar to the one in
Exercise 17). For our data set, the resulting normalized data are shown in
Table 2.2 (right). Now the least squares objective (after omitting w0 ) is

f (w1 , w2 ) = Σ_{i=1}^{10} (w1 xi1 + w2 xi2 − yi )^2
           ≈ 10w1^2 + 10w2^2 + 1.99w1 w2 − 8.7w1 − 2.79w2 + 2.09.

This is minimized at

w? = (w1? , w2? ) ≈ (0.43, 0.097),

so if our initial hypothesis (2.16) is true, we should have

yi ≈ yi? = 0.43xi1 + 0.097xi2    (2.17)
in the normalized data. This can quickly be checked, and the results are
not perfect, but not too bad, either; see Table 2.3 (ignore the last column
for now).

xi1 xi2 yi yi? zi?
-2.04 -1.28 -0.94 -1.00 -0.87
-0.88 0.32 -0.52 -0.35 -0.37
-0.05 1.03 -0.05 0.08 -0.02
-0.16 -1.28 -0.18 -0.19 -0.07
1.42 -1.28 0.67 0.49 0.61
1.02 1.39 0.59 0.57 0.44
0.06 1.39 0.19 0.16 0.03
-0.88 -0.04 -0.12 -0.38 -0.37
0.89 -0.21 0.17 0.36 0.38
0.62 -0.04 0.21 0.26 0.27

Table 2.3: Outputs yi? predicted by the linear model (2.17) and by the model
zi? = 0.43xi1 that simply ignores the second input variable
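The centering, rescaling and least squares fit described above are easy to reproduce; the following sketch (our own, with variable names chosen for illustration) recovers w? ≈ (0.43, 0.097) from the raw data of Table 2.1.

    import numpy as np

    # GPA, TOEFL, GGPA from Table 2.1
    data = np.array([
        [3.52, 100, 3.92], [3.66, 109, 4.34], [3.76, 113, 4.80],
        [3.74, 100, 4.67], [3.93, 100, 5.52], [3.88, 115, 5.44],
        [3.77, 115, 5.04], [3.66, 107, 4.73], [3.87, 106, 5.03],
        [3.84, 107, 5.06]])
    X, y = data[:, :2], data[:, 2]

    # center inputs and outputs, then scale each input variable so that
    # (1/n) * sum_i x_ij^2 = 1
    X = X - X.mean(axis=0)
    y = y - y.mean()
    X = X / np.sqrt((X ** 2).mean(axis=0))

    # least squares solution of min_w sum_i (w^T x_i - y_i)^2
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w)     # approximately [0.43, 0.097]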

What we also see from (2.17) is that the first input variable (GPA) has a
much higher influence on the output (GGPA) than the second one (TOEFL).
In fact, if we drop the second one altogether, we obtain outputs zi? (last col-
umn in Table 2.3) that seem equivalent to the predicted outputs yi? within
the level of noise that we have anyway.
We conclude that TOEFL scores are probably not indicative for the per-
formance of admitted students, so the admission committee should not
care too much about them. Requiring a minimum score of 100 might make
sense, but whenever an applicant reaches at least this score, the actual
value does not matter.

The LASSO. So far, we have computed linear functions y = 0.43x1 +


0.097x2 and z = 0.43x1 that “explain” the historical data from Table 2.1.
However, they are optimized to fit the historical data, not the future. We
may have overfitting. This typically leads to unreliable predictions of
high variance in the future. Also, ideally, we would like non-indicative
variables (such as the TOEFL in our example) to actually have weight 0,
so that the model “knows” the important variables and is therefore better
to interpret.
The question is: how can we in general improve the quality of our
forecast? There are various heuristics to identify the “important” variables

(subset selection). A very simple one is just to forget about weights close to
0 in the least squares solution. However, for this, we need to define what
it means to be close to 0; and it may happen that small changes in the data
lead to different variables being dropped if their weights are around the
threshold. On the other end of the spectrum, there is best subset selection
where we compute the least squares solution subject to the constraint that
there are at most k nonzero weights, for some k that we believe is the right
number of important variables. This is NP-hard, though.
A popular approach that in many cases improves forecasts and at the
same time identifies important variables has been suggested by Tibshirani
in 1996 [Tib96]. Instead of minimizing the least squares objective glob-
ally, it is minimized over a suitable `1 -ball (ball in the 1-norm kwk1 = Σ_{j=1}^{d} |wj |):

minimize   Σ_{i=1}^{n} (w> xi − yi )^2    (2.18)
subject to kwk1 ≤ R,
where R ∈ R+ is some parameter. In our case, if we for example

minimize   f (w1 , w2 ) = 10w1^2 + 10w2^2 + 1.99w1 w2 − 8.7w1 − 2.79w2 + 2.09
subject to |w1 | + |w2 | ≤ 0.2,
                                                                   (2.19)

we obtain weights w? = (w1? , w2? ) = (0.2, 0): the non-indicative TOEFL
score has disappeared automatically! For R = 0.3, the same happens (with
w1? = 0.3, respectively). For R = 0.4, the TOEFL score starts creeping
back in: we get (w1? , w2? ) ≈ (0.36, 0.036). For R = 0.5, we have (w1? , w2? ) ≈
(0.41, 0.086), while for R = 0.6 (and all larger values of R), we recover the
original solution (w1? , w2? ) = (0.43, 0.097).
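These numbers are easy to verify. Since the problem is two-dimensional, a brute-force sketch suffices (ours, not from the text): we evaluate the objective of (2.19) on a fine grid of the `1 -ball of radius R and keep the best point.

    import numpy as np

    def f(w1, w2):
        # least squares objective in the normalized variables, cf. (2.19)
        return (10 * w1**2 + 10 * w2**2 + 1.99 * w1 * w2
                - 8.7 * w1 - 2.79 * w2 + 2.09)

    def lasso_grid(R, steps=2001):
        """Brute-force minimization of f over {|w1| + |w2| <= R}."""
        grid = np.linspace(-R, R, steps)
        W1, W2 = np.meshgrid(grid, grid)
        vals = np.where(np.abs(W1) + np.abs(W2) <= R, f(W1, W2), np.inf)
        i, j = np.unravel_index(np.argmin(vals), vals.shape)
        return W1[i, j], W2[i, j]

    for R in (0.2, 0.3, 0.4, 0.5):
        print(R, lasso_grid(R))   # w2 stays at 0 for R = 0.2 and R = 0.3

For larger instances one would of course not use a grid but a proper solver for the constrained problem (2.18).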
It is important to understand that using the “fixed” weights (which
may be significantly shrunken), we make predictions worse on the histori-
cal data (this must be so, since least squares was optimal for the historical
data). But future predictions may benefit (a lot). To quantify this benefit,
we need to make statistical assumptions about future observations; this is
beyond the scope of our treatment here.
The phenomenon that adding a constraint on kwk1 tends to set weights
to 0 is not restricted to d = 2. The constrained minimization problem (2.18)
is called the LASSO (least absolute shrinkage and selection operator) and
has the tendency to assign weights of 0 and thus to select a subset of input
variables, where R controls how aggressive the selection is.

In our example, it is easy to get an intuition why this works. Let us look
at the case R = 0.2. The smallest value attainable in (2.19) is the smallest α
such that the (elliptical) sublevel set f ≤α of the least squares objective
f still intersects the `1 -ball {(w1 , w2 ) : |w1 |+|w2 | ≤ 0.2}. This smallest value
turns out to be α = 0.75, see Figure 2.14. For this value of α, the sublevel
set intersects the `1 -ball exactly in one point, namely (0.2, 0).

Figure 2.14: Lasso: the ellipse 10w1^2 + 10w2^2 + 1.99w1 w2 − 8.7w1 − 2.79w2 + 2.09 = 0.75
touches the `1 -ball |w1 | + |w2 | ≤ 0.2 in the corner (0.2, 0); the unconstrained
minimum is at (0.43, 0.097)

At (0.2, 0), the ellipse {(w1 , w2 ) : f (w1 , w2 ) = α} is “vertical enough”


to just intersect the corner of the `1 -ball. The reason is that the center of
the ellipse is relatively close to the w1 -axis, when compared to its size. As
R increases, the relevant value of α decreases, the ellipse gets smaller and
less vertical around the w1 -axis; until it eventually stops intersecting the `1 -
ball {(w1 , w2 ) : |w1 |+|w2 | ≤ R} in a corner (dashed situation in Figure 2.14,
for R = 0.4).
Even though we have presented a toy example in this section, the back-
ground is real. The theory of admission and in particular performance
forecasts has been developed in a recent PhD thesis by Zimmermann [Zim16].

2.7 Convex programming
Convex programs are specific convex constrained minimization problems.
They arise when we minimize a convex function f over a convex set X
defined by finitely many convex inequality and affine equality constraints.
This turns out to be an important class of problems with a rich theory. For
a large part of this section, we do not need to assume convexity.
According to Boyd and Vandenberghe [BV04, 4.1.1], an optimization
problem in standard form is given by

minimize f0 (x)
subject to fi (x) ≤ 0, i = 1, . . . , m    (2.20)
           hi (x) = 0, i = 1, . . . , p

The problem has domain D = (∩_{i=0}^{m} dom(fi )) ∩ (∩_{i=1}^{p} dom(hi )). We assume
that D is open. You may think of D as equal to Rd in many cases.
A convex program arises when the fi are convex functions, and the hi
are affine functions with domain Rd . In this case, Observation 2.9 has the
following consequences: the problem domain D is convex, and so are all
sets of the form {x ∈ D : fi (x) ≤ 0} and {x ∈ D : hi (x) = 0} (here we use
that the hi are affine). Also
X = {x ∈ Rd : fi (x) ≤ 0, i = 1, . . . , m; hi (x) = 0, i = 1, . . . , p},
the feasible region of (2.20) is then a convex set. So we are in constrained
minimization as discussed in Section 2.4.3, with a feasible region X in-
duced by finitely many (in)equality constraints.

2.7.1 Lagrange duality


Lagrange duality is a powerful tool that (under suitable conditions) allows
us to express the optimization problem (2.20) differently, and this dual
view often provides new insights and may help us in solving the original
problem. For example, we will see below that linear programming duality
with its many applications is a special case of Lagrange duality.
Definition 2.45. Given optimization problem (2.20), its Lagrangian is the func-
tion L : D × Rm × Rp → R given by
L(x, λ, ν) = f0 (x) + Σ_{i=1}^{m} λi fi (x) + Σ_{i=1}^{p} νi hi (x).    (2.21)

The λi , νi are called Lagrange multipliers.
The Lagrange dual function is the function g : Rm × Rp → R ∪ {−∞}
defined by
g(λ, ν) = inf_{x∈D} L(x, λ, ν).    (2.22)

The fact that g can assume value −∞ is not a pathology but typical. We
will see that the “interesting” (λ, ν) are the ones for which g(λ, ν) > −∞.
Let’s discuss linear programming, the case where all involved func-
tions are affine. Concretely, we consider a linear program of the form
minimize c> x
subject to Ax = b (2.23)
x ≥ 0.
We assume there are d variables, so x ∈ Rd . This is of the form (2.20): the
vector equation Ax = b summarizes the equality constraints induced by
hi (x) := ai> x − bi , i = 1, . . . , p, while the nonnegativity constraints come
from fi (x) := −xi , i = 1, . . . , m. We finally have f0 (x) := c> x. As all
functions are defined everywhere, the domain is D = Rd .
Then the Lagrangian is
L(x, λ, ν) = c> x − λ> x + ν > (Ax − b) = −b> ν + (c> − λ> + ν > A)x.
It follows that g(λ, ν) > −∞ if and only if c> − λ> + ν > A = 0. And in this
case, we have g(λ, ν) = −b> ν.
The significance of the Lagrangian dual function is that it provides a
lower bound on the infimum value of (2.20), provided that λ ≥ 0. But a
nontrivial lower bound is obtained only if g(λ, ν) > −∞, this is why we
are interested in this case.
Lemma 2.46 (Weak Lagrange duality). Let x be a feasible solution of the
optimization problem (2.20), meaning that fi (x) ≤ 0 for i = 1, . . . , m and
hi (x) = 0 for i = 1, . . . , p. Let g be the Lagrange dual function of (2.20) and
λ ∈ Rm , ν ∈ Rp such that λ ≥ 0. Then
g(λ, ν) ≤ f0 (x).
Proof.

g(λ, ν) ≤ L(x, λ, ν) = f0 (x) + Σ_{i=1}^{m} λi fi (x) + Σ_{i=1}^{p} νi hi (x) ≤ f0 (x),

since λi fi (x) ≤ 0 and hi (x) = 0 for all i.

It is natural to ask how λ and ν must be chosen such that we get the
best lower bound. The answer is provided by the Lagrange dual.
Definition 2.47. Let g be the Lagrange dual function of the optimization problem
(2.20). Then the Lagrange dual of (2.20) is the optimization problem

maximize g(λ, ν)
(2.24)
subject to λ ≥ 0.

By Lemma 2.46, the supremum value of the Lagrange dual (2.24) is a


lower bound for the infimum value of the (primal) problem (2.20).
What is even nicer is that the Lagrange dual is a convex program, even
if (2.20) is not! By this, we mean that the equivalent minimization problem

minimize −g(λ, ν)
subject to λ ≥ 0.

is a convex program in standard form (2.20). As there are no equality con-


straints, and the inequality constraints are obviously convex, we only need
to show that the function −g is a convex function. Here we have a slight
problem, since −g may also assume value ∞, and our Definition 2.11 of
convexity does not cover this case. But this is easy to fix, and we defer the
details (and the proof of convexity in this extended setting) to Exercise 18.
As an example, let’s look at our linear program (2.23) again. We have
already seen that its Lagrangian dual function satisfies

−b> ν if c> − λ> + ν > A = 0,



g(λ, ν) =
−∞ otherwise.

So the Lagrange dual (2.24) becomes

maximize −b> ν
subject to c> + ν > A ≥ 0

Renaming −ν to y and transposing the constraints, we arrive at the “stan-


dard” dual linear program

maximize   b> y                    (2.25)
subject to A> y ≤ c

In the case of linear programming, the primal (2.23) and the dual (2.25)
have the same optimal value: inf c> x = sup b> y. It may happen that this
value is −∞ (if the primal is unbounded and the dual is infeasible), or
∞ (if the primal is infeasible and the dual is unbounded). If the value is
finite, it is attained in both the primal and the dual, so we actually have
min c> x = max b> y.
This is the strong duality theorem of linear programming [MG07, Sec-
tion 6.1]. It strengthens weak duality which says that inf c> x ≥ sup b> y.
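Strong duality is easy to observe numerically. The following sketch solves a small primal (2.23) and its dual (2.25) with scipy; the concrete data A, b, c are a toy instance of our own choosing.

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 1.0, 1.0],
                  [1.0, 2.0, 0.0]])
    b = np.array([4.0, 3.0])
    c = np.array([2.0, 3.0, 1.0])

    # primal (2.23): min c^T x  s.t.  Ax = b, x >= 0
    # (linprog minimizes and uses x >= 0 as its default bounds)
    primal = linprog(c, A_eq=A, b_eq=b)

    # dual (2.25): max b^T y  s.t.  A^T y <= c, y free;
    # we minimize -b^T y instead
    dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * len(b))

    print(primal.fun, -dual.fun)    # both values coincide (here: 7)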
For general convex programs (2.20), strong duality still holds (here, we
do need convexity!), but some extra conditions are needed. There are a
number of known sufficient conditions; these are usually named constraint
qualifications. Here is a concrete result.
Theorem 2.48 ([BV04, 5.3.2]). Suppose that (2.20) is a convex program with a
feasible solution x̃ that in addition satisfies fi (x̃) < 0, i = 1, . . . , m (a Slater
point). Then the infimum value of the primal (2.20) equals the supremum value
of its Lagrange dual (2.24). Moreover, if this value is finite, it is attained by a
feasible solution of the dual (2.24).
Unlike in linear programming, a finite value is not necessarily attained
by a feasible solution of the primal. So in the case of finite value, the the-
orem can be summarized as inf f0 (x) = max g(λ, ν); see Exercise 19 for an
illustration of the theorem.
A common application of Lagrange duality is to turn the “hard” con-
straints of the optimization problem (2.20) into “soft” ones by moving
them to the objective function. Instead of the constrained minimization
problem (2.20), we consider (for some fixed λ ≥ 0 and fixed ν) the uncon-
strained minimization problem
minimize   f0 (x) + Σ_{i=1}^{m} λi fi (x) + Σ_{i=1}^{p} νi hi (x).    (2.26)
As the objective function is the Lagrangian L(x, λ, ν), the infimum value
of this unconstrained problem is by definition the value g(λ, ν) of the
Lagrange dual function. If we have strong Lagrange duality, there exist
λ = λ? ≥ 0 and ν = ν ? such that the unconstrained problem (2.26) has
the same infimum as the constrained optimization problem (2.20). For all
λ ≥ 0 and ν, we know that the infimum of (2.26) provides a lower bound
on the infimum of (2.20).
In practice, one could repeatedly solve (2.26), with a number of sen-
sible candidates for λ ≥ 0 and ν, and use the largest resulting value as

an approximation of the infimum value of (2.20). Naturally, this approach
comes without any theoretical guarantees, unless we know that strong du-
ality actually holds, and that “sensible” is quantifiable in some way.
Strong duality (inf f0 (x) = sup g(λ, ν)) may also hold when there is no
Slater point, or even when (2.20) is not a convex program. Theorem 2.48
simply provides one particular and very useful sufficient condition.

2.7.2 Karush-Kuhn-Tucker conditions


A case of particular interest is that strong duality holds and the joint value
is attained in both the primal and the dual, meaning that min f0 (x) =
max g(λ, ν). If the defining functions of the optimization problem (2.20)
are differentiable, the Karush-Kuhn-Tucker conditions provide necessary
and—under convexity—also sufficient conditions for this case to occur.
Definition 2.49 (Zero duality gap). Let x̃ be feasible for the primal (2.20) and
(λ̃, ν̃) feasible for the Lagrange dual (2.24). The primal and dual solutions x̃ and
(λ̃, ν̃) are said to have zero duality gap if f0 (x̃) = g(λ̃, ν̃).

If x̃ and (λ̃, ν̃) have zero duality gap, we have the following crucial
chain of (in)equalities:

f0 (x̃) = g(λ̃, ν̃)
       = inf_{x∈D} ( f0 (x) + Σ_{i=1}^{m} λ̃i fi (x) + Σ_{i=1}^{p} ν̃i hi (x) )
       ≤ f0 (x̃) + Σ_{i=1}^{m} λ̃i fi (x̃) + Σ_{i=1}^{p} ν̃i hi (x̃)          (2.27)
       ≤ f0 (x̃),

where the last inequality uses λ̃i fi (x̃) ≤ 0 and hi (x̃) = 0.

As a consequence, all inequalities in (2.27) are actually equalities.


From this we can draw two interesting consequences.

Lemma 2.50 (Complementary slackness). If x̃ and (λ̃, ν̃) have zero duality
gap, then
λ̃i fi (x̃) = 0, i = 1, . . . , m.

This is called complementary slackness, since if there is slack in the i-
th inequality of the primal (fi (x̃) < 0), then there is no slack in the i-th
inequality of the dual (λ̃i = 0); and vice versa.

Lemma 2.51 (Vanishing Lagrangian gradient). If x̃ and (λ̃, ν̃) have zero du-
ality gap, and if all fi and hi are differentiable, then
∇f0 (x̃) + Σ_{i=1}^{m} λ̃i ∇fi (x̃) + Σ_{i=1}^{p} ν̃i ∇hi (x̃) = 0.

Proof. According to the (in)equality in the third line of (2.27), x̃ minimizes


the differentiable function
f0 (x) + Σ_{i=1}^{m} λ̃i fi (x) + Σ_{i=1}^{p} ν̃i hi (x),

and hence its gradient vanishes by Lemma 2.23.


In summary, we get the following result.

Theorem 2.52 (Karush-Kuhn-Tucker necessary conditions). Let x̃ and (λ̃, ν̃)


be feasible solutions of the primal optimization problem (2.20) and its Lagrange
dual (2.24), respectively, with zero duality gap. If all fi and hi in (2.20) are dif-
ferentiable, then

λ̃i fi (x̃) = 0, i = 1, . . . , m,    (2.28)

∇f0 (x̃) + Σ_{i=1}^{m} λ̃i ∇fi (x̃) + Σ_{i=1}^{p} ν̃i ∇hi (x̃) = 0.    (2.29)

Theorem 2.53 (Karush-Kuhn-Tucker sufficient conditions). Let x̃ and (λ̃, ν̃)


be feasible solutions of the primal optimization problem (2.20) and its Lagrange
dual (2.24).
Further suppose that all fi and hi in (2.20) are differentiable, all fi are convex,
all hi are affine, and (2.28) as well as (2.29) hold. Then x̃ and (λ̃, ν̃) have zero
duality gap.

Proof. If we can establish the chain of equalities in (2.27), the statement


follows. The equality in the second line is the definition of the Lagrange
dual function; in the third line, we obtain equality through the vanishing

Lagrangian gradient (2.29) along with Lemma 2.22, showing that x̃ min-
imizes the function f (x) := L(x, λ̃, ν̃). This is convex by Lemma 2.19 (i).
The inequality under the brace in the third line is complementary slack-
ness (2.28), and from this, equality in the fourth line follows.
The Karush-Kuhn-Tucker conditions (primal and dual feasibility, com-
plementary slackness, vanishing Lagrangian gradient) can be of signifi-
cant help in solving the primal problem (2.20). If we can find x̃ and (λ̃, ν̃)
satisfying them, we know that x̃ is a minimizer of (2.20). This may be
easier to “solve” than (2.20) itself.
However, we cannot always count on the Karush-Kuhn-Tucker condi-
tions being solvable. Theorem 2.52 guarantees them only if there are pri-
mal and dual solutions of zero duality gap. But if the primal has a Slater
point, then inf f0 (x) = max g(λ, ν) by Theorem 2.48, and in this case, the
Karush-Kuhn-Tucker conditions are indeed equivalent to the existence of
a minimizer of (2.20).
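As a small illustration (a toy problem of our own, not from the text), take f0 (x) = x1^2 + x2^2 with the single constraint f1 (x) = 1 − x1 − x2 ≤ 0. The candidate x̃ = (1/2, 1/2) with multiplier λ̃ = 1 satisfies all Karush-Kuhn-Tucker conditions, so by Theorem 2.53 it is a minimizer; the sketch below checks the conditions numerically.

    import numpy as np

    f0_grad = lambda x: 2 * x                    # gradient of f0(x) = x1^2 + x2^2
    f1      = lambda x: 1 - x[0] - x[1]          # inequality constraint f1(x) <= 0
    f1_grad = lambda x: np.array([-1.0, -1.0])

    x_tilde = np.array([0.5, 0.5])               # candidate primal solution
    lam = 1.0                                    # candidate Lagrange multiplier

    print("primal feasible:         ", f1(x_tilde) <= 0)
    print("dual feasible:           ", lam >= 0)
    print("complementary slackness: ", np.isclose(lam * f1(x_tilde), 0.0))
    print("Lagrangian gradient:     ", f0_grad(x_tilde) + lam * f1_grad(x_tilde))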

2.7.3 Computational complexity


By a celebrated result of Khachiyan from 1979, linear programs can be
solved in polynomial time, using the (impractical) ellipsoid method [Kha80].
In 1984, Karmarkar showed that interior point methods provide a new and
practical approach to solve linear programs in polynomial time [Kar84].
The scope of both methods is not restricted to linear programming; in
fact, they had been developed before in order to solve convex programs.
Still, the fact that they result in polynomial-time methods when applied to
linear programming was a significant insight.
By design, both methods are iterative and can achieve any desired op-
timization error ε > 0 as introduced in Section 1.11. But they will usually
never find the true minimizer, even if it exists. But in the case of linear
programming (and some other “benign” convex optimization problems),
there is a finite set of candidate solutions guaranteed to contain a mini-
mizer. In linear programming, these are the basic feasible solutions. More-
over, from any feasible solution one can easily find a candidate solution
that is not worse. Finally, one argues that if ε is small enough (depending
on the description size of the problem), the only candidate solutions one
can still find in this way are minimizers.

In general convex programs, this approach does not work. In partic-
ular, there may not be any minimizers, even if the program’s infimum
value is finite; see Exercise 19. Still, one can try to understand how long
the methods take in order to reach optimization error at most ε. We argued
in Section 1.11 that this is enough for most Data Science purposes.
For convex programs and interior point methods, this has been pio-
neered by Nesterov and Nemirovskii in their 1994 book [NN94]. Boyd and
Vandenberghe present the full theory in Section 11 of their book [BV04].
Here we only give (part of) the high-level summary of the state of af-
fairs [BV04, 11.5.5].
It is important to say upfront that the analysis does not work for arbi-
trary convex programs of the form (2.20); some assumptions are needed.
Mainly, the function f0 to be minimized has to be self-concordant which is a
condition involving its second- and third-order derivatives. Also, we need
some bound M on the maximum value of an optimal solution.
In a first phase, one needs to find a feasible solution, and the runtime
of this phase inversely depends on how close the problem is to being in-
feasible. In the second phase, the number of iterations is of the order

O( √m log( (M − p? )/ε ) ),

where p? is the infimum value of (2.20). The bound does not depend on
p, the number of equality constraints, and also not on d, the number of
variables. These two problem dimensions appear in the complexity of the
individual iterations, but we will not go into this.
What we can say, though, is that the individual iterations are com-
putationally very heavy, making them unsuitable for large-scale learning
where optimization time is a bottleneck.
Hence, it makes sense to consider simple algorithms with low computa-
tional cost per step, even if they need more iterations than the best possible
algorithms. This is the approach that we will take in the subsequent chap-
ters.

2.8 Exercises
Exercise 6. Prove that a differentiable function is continuous!

Exercise 7. Prove Jensen’s inequality (Lemma 2.13)!
Exercise 8. Prove that a convex function (with dom(f ) open) is continuous
(Lemma 2.14)!
Hint: First prove that a convex function f is bounded on any cube C =
[l1 , u1 ] × [l2 , u2 ] × · · · × [ld , ud ] ⊆ dom(f ), with the maximum value occurring
on some corner of the cube (a point z such that zi ∈ {li , ui } for all i). Then use
this fact to show that—given x ∈ dom(f ) and ε > 0—all y in a sufficiently
small ball around x satisfy |f (y) − f (x)| < ε.
Exercise 9. Prove that the function dy : Rd → R, x 7→ kx − yk2 is strictly
convex for any y ∈ Rd . (Use Lemma 2.25.)
Exercise 10. Prove Lemma 2.19! Can (ii) be generalized to show that for two
convex functions f, g, the function f ◦ g is convex as well?
Exercise 11. Prove Lemmata 2.38 and 2.39!
Exercise 12. Consider the function ` defined in (2.15). Prove that ` is convex!
Exercise 13. Consider the function ` defined in (2.15). Let us call an argument
matrix W a separator for P if for all x ∈ P ,
(W x)d(x) = max_{j=0,...,9} (W x)j ,

i.e. under (2.14), the correct digit has highest probability (possibly along with
other digits). A separator is trivial if for all x ∈ P and all i, j ∈ {0, . . . , 9},

(W x)i = (W x)j .

For example, whenever the rows of W are pairwise identical, we obtain a trivial
separator. But depending on the data, there may be other trivial separators. For
example, if some pixel is black (gray value 0) in all images, arbitrarily changing
the entries in the corresponding column of a trivial separator gives us another
trivial separator. For a trivial separator W , (2.15) yields `(W ) = |P | ln 10.
Prove the following statement: ` has a global minimum if and only if all sepa-
rators are trivial.
As a special case, consider the situation in which there exists a strong (and
in particular nontrivial) separator: a matrix W ? such that for all x ∈ P and all
j 6= d(x),
(W ? x)d(x) > (W ? x)j ,

i.e. the correct digit has unique highest probability. In this case, it is easy to see
that `(λW ? ) → 0 as λ → ∞, so we cannot have a global minimum, as inf W `(W ) = 0
is not attainable.
Exercise 14. Prove that the function f (x) = kxk1 = Σ_{i=1}^{d} |xi | (`1 -norm) is
convex!

Exercise 15. Let f : dom(f ) → R be twice differentiable. For fixed x, y ∈


dom(f ), consider the univariate function h(t) = f (x + t(y − x)) over a suitable
open interval dom(h) ⊇ [0, 1] such that x + t(y − x) ∈ dom(f ) for all t ∈
dom(h). Let us abbreviate v = y − x. We already know that h′(t) = ∇f (x +
tv)> v for t ∈ dom(h). Prove that

h′′(t) = v> ∇2 f (x + tv)v,    t ∈ dom(h).

Exercise 16. A seminorm is a function f : Rd → R satisfying the following two


properties for all x, y ∈ Rd and all λ ∈ R.

(i) f (λx) = |λ|f (x),

(ii) f (x + y) ≤ f (x) + f (y) (triangle inequality).

Prove that every seminorm is convex!


Exercise 17. Suppose that we have centered observations (xi , yi ) such that
Σ_{i=1}^{n} xi = 0, Σ_{i=1}^{n} yi = 0. Let (w0? , w? ) be the global minimum of the least
squares objective

f (w0 , w) = Σ_{i=1}^{n} (w0 + w> xi − yi )^2 .

Prove that w0? = 0. Also, suppose xi′ and yi′ are such that for all i, xi′ = xi + q,
yi′ = yi + r. Show that (w0 , w) minimizes f if and only if (w0 − w> q + r, w)
minimizes

f ′ (w0 , w) = Σ_{i=1}^{n} (w0 + w> xi′ − yi′ )^2 .

Exercise 18. A function f : dom(f ) → R ∪ {∞} is called convex if dom(f ) is


convex and the inequality (2.2) defining convexity holds when f (x), f (y) < ∞.
This in particular implies that the “finite domain” {x ∈ dom(f ) : f (x) < ∞}
is convex as well (if it wasn’t, we could construct x, y in the finite domain and

some convex combination λx + (1 − λ)y ∈ dom(f ) with f (λx + (1 − λ)y) = ∞,
violating convexity).
Prove that the negative of the Lagrangian dual function g : Rm × Rp →
R ∪ {−∞} as in Definition 2.45 is convex in the above sense.

Exercise 19. Consider the optimization problem

minimize   x2 − x1
subject to √(x1^2 + 1) − x2 ≤ 0.

(i) Prove that this is a convex program with a Slater point so that the conditions
of Theorem 2.48 are satisfied!

(ii) Compute the Lagrange dual function g, the maximum value γ of the La-
grange dual problem and a feasible solution of value γ.

(iii) Show that the primal problem does not attain its infimum value (which is
also γ by Theorem 2.48)!

Chapter 3

Gradient Descent

Contents
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.1.1 Convergence rates . . . . . . . . . . . . . . . . . . . . 93
3.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3 Vanilla analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.4 Lipschitz convex functions: O(1/ε2 ) steps . . . . . . . . . . . 97
3.5 Smooth convex functions: O(1/ε) steps . . . . . . . . . . . . 99
3.6 Acceleration for smooth convex functions: O(1/√ε) steps . . . . . 104
3.7 Interlude . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.8 Smooth and strongly convex functions:
O(log(1/ε)) steps . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

3.1 Overview
The gradient descent algorithm (including variants such as projected or
stochastic gradient descent) is the most useful workhorse for minimizing
loss functions in practice. The algorithm is extremely simple and surpris-
ingly robust in the sense that it also works well for many loss functions
that are not convex. While it is easy to construct (artificial) non-convex
functions on which gradient descent goes completely astray, such func-
tions do not seem to be typical in practice; however, understanding this
on a theoretical level is an open problem, and only few results exist in this
direction.
The vast majority of theoretical results concerning the performance of
gradient descent hold for convex functions only. In this and the following
chapters, we will present some of these results, but maybe more impor-
tantly, the main ideas behind them. As it turns out, the number of ideas
that we need is rather small, and typically, they are shared between dif-
ferent results. Our approach is therefore to fully develop each idea once,
in the context of a concrete result. If the idea reappears, we will typically
only discuss the changes that are necessary in order to establish a new re-
sult from this idea. In order to avoid boredom from ideas that reappear
too often, we omit other results and variants that one could also get along
the lines of what we discuss.
Let f : Rd → R be a convex and differentiable function. We also assume
that f has a global minimum x? , and the goal is to find (an approximation
of) x? . This usually means that for a given ε > 0, we want to find x ∈ Rd
such that
f (x) − f (x? ) < ε.
Notice that we are not making an attempt to get near to x? itself — there
can be several minima y? 6= x? with f (x? ) = f (y? ).
Gradient descent is an iterative method, meaning that it generates a se-
quence x0 , x1 , . . . of solutions such that in some iteration T , we eventually
have f (xT ) − f (x? ) < ε.
Table 3.1 gives an overview of the results that we will prove. They con-
cern several variants of gradient descent as well as several classes of func-
tions. The significance of each algorithm and function class will briefly be
discussed when it first appears.
In Chapter 6, we will also look at gradient descent on functions that

                      Lipschitz      smooth         smooth &         strongly
                      convex         convex         strongly convex  convex
                      functions      functions      functions        functions

 gradient             Thm. 3.1       Thm. 3.8       Thm. 3.14
 descent              O(1/ε^2)       O(1/ε)         O(log(1/ε))

 accelerated                         Thm. 3.9
 gradient descent                    O(1/√ε)

 projected            Thm. 4.2       Thm. 4.4       Thm. 4.5
 gradient descent     O(1/ε^2)       O(1/ε)         O(log(1/ε))

 subgradient          Thm. 10.20                                     Thm. 10.22
 descent              O(1/ε^2)                                       O(1/ε)

 stochastic           Thm. 12.4                                      Thm. 12.4
 gradient descent     O(1/ε^2)                                       O(1/ε)

Table 3.1: Results on gradient descent. Below each theorem, the number
of steps is given which the respective variant needs on the respective func-
tion class to achieve additive approximation error at most ε.

are not convex. In this case, provably small approximation error can still
be obtained for some particularly well-behaved functions (we will give an
example). For smooth (but not necessarily convex) functions, we gener-
ally cannot show convergence in error, but a (much) weaker convergence
property still holds.

3.1.1 Convergence rates


You sometimes hear terms such as linear convergence, quadratic conver-
gence, or sublinear convergence. They refer to iterative optimization meth-
ods and describe how quickly the error “goes down” from one iteration
to the next. Let εt denote the error in iteration t; in the context of mini-
mization , we often consider εt = f (xt ) − f (x? ), but there could be other
error measures. An algorithm is said to exhibit (at least) linear convergence

whenever there is a real number 0 < c < 1 such that

εt+1 ≤ cεt for all sufficiently large t.

The word linear comes from the fact that the error in step t + 1 is bounded
by a linear function of the error in step t.
This means that for t large enough, the error goes down by at least a
constant factor in each step. Linear convergence implies that an error of at
most ε is achieved within O(log(1/ε)) iterations. For example, this is the
bound provided by Theorem 3.14 (last entry in the first row of Table 3.1),
and it is proved by showing linear convergence of the algorithm.
The term superlinear convergence refers to an algorithm for which there
are constants r > 1 and c > 0 such that

εt+1 ≤ c(εt )r for all sufficiently large t.

The case r = 2 is known as quadratic convergence. Under quadratic con-


vergence, an error of at most ε is achieved within O(log log(1/ε)) iterations.
We will see an algorithm with quadratic convergence in Chapter 8.
If a (converging) algorithm does not exhibit at least linear convergence,
we say that it has sublinear convergence. One can also quantify sublinear
convergence more precisely if needed.

3.2 The algorithm


Gradient descent is a very simple iterative algorithm for finding the de-
sired approximation x, under suitable conditions that we will get to. It
computes a sequence x0 , x1 , . . . of vectors such that x0 is arbitrary, and for
each t ≥ 0, xt+1 is obtained from xt by making a step of vt ∈ Rd :

xt+1 = xt + vt .

How do we choose vt in order to get closer to optimality, meaning that


f (xt+1 ) < f (xt )?
From differentiability of f at xt (Definition 2.5), we know that for kvt k
tending to 0,

f (xt + vt ) = f (xt ) + ∇f (xt )> vt + r(vt ) ≈ f (xt ) + ∇f (xt )> vt ,

where the error term satisfies r(vt ) = o(kvt k).

To get any decrease in function value at all, we have to choose vt such that
∇f (xt )> vt < 0. But among all steps vt of the same length, we should in
fact choose the one with the most negative value of ∇f (xt )> vt , so that we
maximize our decrease in function value. This is achieved when vt points
into the direction of the negative gradient −∇f (xt ). But as differentiability
guarantees decrease only for small steps, we also want to control how far
we go along the direction of the negative gradient.
Therefore, the step of gradient descent is defined by

xt+1 := xt − γ∇f (xt ). (3.1)

Here, γ > 0 is a fixed stepsize, but it may also make sense to have γ depend
on t. For now, γ is fixed. We hope that for some reasonably small integer
t, in the t-th iteration we get that f (xt ) − f (x? ) < ε; see Figure 3.1 for an
example.
Now it becomes clear why we are assuming that dom(f ) = Rd : The
update step (3.1) may in principle take us “anywhere”, so in order to get
a well-defined algorithm, we want to make sure that f is defined and dif-
ferentiable everywhere.
The choice of γ is critical for the performance. If γ is too small, the
process might take too long, and if γ is too large, we are in danger of
overshooting. It is not clear at this point whether there is a “right” stepsize.
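A direct implementation of the update rule (3.1) is a few lines; the following sketch reproduces the setting of Figure 3.1 (our own code, with the function and parameters taken from the figure caption).

    import numpy as np

    def gradient_descent(grad, x0, gamma, steps):
        """Iterate x_{t+1} = x_t - gamma * grad(x_t), cf. (3.1)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            x = x - gamma * grad(x)
        return x

    # f(x1, x2) = 2(x1 - 4)^2 + 3(x2 - 3)^2 as in Figure 3.1
    grad = lambda x: np.array([4 * (x[0] - 4), 6 * (x[1] - 3)])

    print(gradient_descent(grad, x0=[0.0, 0.0], gamma=0.1, steps=50))
    # approaches the global minimum (4, 3)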

3.3 Vanilla analysis


The first-order characterization of convexity provides us with a way to
bound terms of the form f (xt ) − f (x? ): With x = xt , y = x? , (2.3) gives us

f (xt ) − f (x? ) ≤ ∇f (xt )> (xt − x? ). (3.2)

So we have reduced the problem to the one of bounding ∇f (xt )> (xt − x? ),
and this is what we do next.
Let xt be some iterate in the sequence (3.1). We abbreviate gt := ∇f (xt ).
By definition of gradient descent (3.1), gt = (xt − xt+1 )/γ, hence

gt> (xt − x? ) = (1/γ) (xt − xt+1 )> (xt − x? ).    (3.3)

Figure 3.1: Example run of gradient descent on the quadratic function
f (x1 , x2 ) = 2(x1 − 4)^2 + 3(x2 − 3)^2 with global minimum (4, 3); we have
chosen x0 = (0, 0), γ = 0.1; dashed lines represent level sets of f (points of
constant f -value)

Now we apply (somewhat out of the blue, but this will clear up in the next
step) the basic vector equation 2v> w = kvk2 + kwk2 − kv − wk2 (a.k.a. the
cosine theorem) to rewrite the same expression as

gt> (xt − x? ) = (1/(2γ)) ( kxt − xt+1 k^2 + kxt − x? k^2 − kxt+1 − x? k^2 )
             = (1/(2γ)) ( γ^2 kgt k^2 + kxt − x? k^2 − kxt+1 − x? k^2 )
             = (γ/2) kgt k^2 + (1/(2γ)) ( kxt − x? k^2 − kxt+1 − x? k^2 ).    (3.4)

Next we sum this up over the iterations t, so that the latter two terms in

the bracket cancel in a telescoping sum.
Σ_{t=0}^{T−1} gt> (xt − x? ) = (γ/2) Σ_{t=0}^{T−1} kgt k^2 + (1/(2γ)) ( kx0 − x? k^2 − kxT − x? k^2 )
                          ≤ (γ/2) Σ_{t=0}^{T−1} kgt k^2 + (1/(2γ)) kx0 − x? k^2    (3.5)

Now we recall from (3.2) that

f (xt ) − f (x? ) ≤ gt> (xt − x? ).

Hence we further obtain


Σ_{t=0}^{T−1} (f (xt ) − f (x? )) ≤ (γ/2) Σ_{t=0}^{T−1} kgt k^2 + (1/(2γ)) kx0 − x? k^2 .    (3.6)

This gives us an upper bound for the average error f (xt ) − f (x? ), t =
0, . . . , T − 1, hence in particular for the error incurred by the iterate with
the smallest function value. The last iterate is not necessarily the best one:
gradient descent with fixed stepsize γ will in general also make steps that
overshoot and actually increase the function value; see Exercise 23(i).
The question is of course: is this result any good? In general, the an-
swer is no. A dependence on kx0 − x? k is to be expected (the further we
start from x? , the longer we will take); the dependence on the squared gra-
dients kgt k2 is more of an issue, and if we cannot control them, we cannot
say much.

3.4 Lipschitz convex functions: O(1/ε2) steps


Here is the cheapest “solution” to squeeze something out of the vanilla
analysis (3.5): let us simply assume that all gradients of f are bounded
in norm. Equivalently, such functions are Lipschitz continuous over Rd
by Theorem 2.10. (A small subtlety here is that in the situation of real-
valued functions, Theorem 2.10 is talking about the spectral norm of the
(1 × d)-matrix (or row vector) ∇f (x)> , while below, we are talking about
the Euclidean norm of the (column) vector ∇f (x); but these two norms are
the same; see Exercise 20.)

Assuming bounded gradients rules out many interesting functions,
though. For example, f (x) = x2 (a supermodel in the world of convex
functions) already doesn’t qualify, as ∇f (x) = 2x—and this is unbounded
as x tends to infinity. But let’s care about supermodels later.
Theorem 3.1. Let f : Rd → R be convex and differentiable with a global mini-
mum x? ; furthermore, suppose that kx0 − x? k ≤ R and k∇f (x)k ≤ B for all x.
Choosing the stepsize

γ := R / (B √T ),

gradient descent (3.1) yields

(1/T ) Σ_{t=0}^{T−1} (f (xt ) − f (x? )) ≤ RB / √T .

Proof. This is a simple calculation on top of (3.6): after plugging in the


bounds kx0 − x? k ≤ R and kgt k ≤ B, we get
Σ_{t=0}^{T−1} (f (xt ) − f (x? )) ≤ (γ/2) B^2 T + (1/(2γ)) R^2 ,

so we want to choose γ such that

q(γ) = (γ/2) B^2 T + R^2 / (2γ)

is minimized. Setting the derivative to zero yields the above value of γ,
and q(R/(B √T )) = RB √T . Dividing by T , the result follows.
This means that in order to achieve min_{t=0,...,T−1} (f (xt ) − f (x? )) ≤ ε, we need

T ≥ R^2 B^2 / ε^2
many iterations. This is not particularly good when it comes to concrete
numbers (think of desired error ε = 10−6 when R, B are somewhat larger).
On the other hand, the number of steps does not depend on d, the di-
mension of the space. This is very important since we often optimize in
high-dimensional spaces. Of course, R and B may depend on d, but in
many relevant cases, this dependence is mild.
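To illustrate with concrete (made-up) numbers: for the desired error ε = 10^{−6} mentioned above and, say, R = B = 10, the bound requires

T ≥ R^2 B^2 / ε^2 = 10^2 · 10^2 / 10^{−12} = 10^{16}

iterations, which is hopeless as a concrete number; the dimension-independence is the real selling point of Theorem 3.1.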

What happens if we don’t know R and/or B? An idea is to “guess”
R and B, run gradient descent with T and γ resulting from the guess,
check whether the result has absolute error at most ε, and repeat with a
different guess otherwise. This fails, however, since in order to compute
the absolute error, we need to know f (x? ) which we typically don’t. But
Exercise 24 asks you to show that knowing R is sufficient.

3.5 Smooth convex functions: O(1/ε) steps


Our workhorse in the vanilla analysis was the first-order characterization
of convexity: for all x, y ∈ dom(f ), we have

f (y) ≥ f (x) + ∇f (x)> (y − x). (3.7)

Next we want to look at functions for which f (y) can be bounded from
above by f (x)+∇f (x)> (y−x), up to at most quadratic error. The following
definition applies to all differentiable functions, convexity is not required.

Definition 3.2. Let f : dom(f ) → R be a differentiable function, X ⊆ dom(f )


convex and L ∈ R+ . Function f is called smooth (with parameter L) over X if

f (y) ≤ f (x) + ∇f (x)> (y − x) + (L/2) kx − yk^2 ,    ∀x, y ∈ X.    (3.8)
If X = dom(f ), f is simply called smooth.

Recall that (3.7) says that for any x, the graph of f is above its tangential
hyperplane at (x, f (x)). In contrast, (3.8) says that for any x ∈ X, the
graph of f is below a not-too-steep tangential paraboloid at (x, f (x)); see
Figure 3.2.
This notion of smoothness has become standard in convex optimiza-
tion, but the naming is somewhat unfortunate, since there is an (older)
definition of a smooth function in mathematical analysis where it means a
function that is infinitely often differentiable.
We have the following simple characterization of smoothness.

Lemma 3.3 (Exercise 21). Suppose that dom(f ) is open and convex, and that
f : dom(f ) → R is differentiable. Let L ∈ R+ . Then the following two state-
ments are equivalent.

Figure 3.2: A smooth convex function: the graph of f lies below the tangential
paraboloid f (x) + ∇f (x)> (y − x) + (L/2) kx − yk^2 and above the tangential
hyperplane f (x) + ∇f (x)> (y − x)

(i) f is smooth with parameter L.

(ii) g defined by g(x) = (L/2) x> x − f (x) is convex over dom(g) := dom(f ).

Let us discuss some cases. If L = 0, (3.7) and (3.8) together require that

f (y) = f (x) + ∇f (x)> (y − x), ∀x, y ∈ dom(f ),

meaning that f is an affine function. A simple calculation shows that our
supermodel function f (x) = x^2 is smooth with parameter L = 2:

f (y) = y^2 = x^2 + 2x(y − x) + (x − y)^2 = f (x) + f ′ (x)(y − x) + (L/2)(x − y)^2 .
= f (x) + f 0 (x)(y − x) + (x − y)2 .
2
More generally, we also claim that all quadratic functions of the form
f (x) = x> Qx + b> x + c are smooth, where Q is a (d × d) matrix, b ∈ Rd
and c ∈ R. Because x> Qx = x> Q> x, we get that f (x) = x> Qx = (1/2) x> (Q +
Q> )x, where (1/2)(Q + Q> ) is symmetric. Therefore, we can assume without
loss of generality that Q is symmetric, i.e., it suffices to show that quadratic
functions defined by symmetric matrices are smooth.

Lemma 3.4 (Exercise 22). Let f (x) = x> Qx+b> x+c, where Q is a symmetric
(d × d) matrix, b ∈ Rd , c ∈ R. Then f is smooth with parameter 2 kQk, where
kQk is the spectral norm of Q (Definition 2.2).

The (univariate) convex function f (x) = x^4 is not smooth (over R): at
x = 0, condition (3.8) reads as

y^4 ≤ (L/2) y^2 ,
and there is obviously no L that works for all y. The function is smooth,
however, over any bounded set X (Exercise 27).
In general—and this is the important message here—only functions of
asymptotically at most quadratic growth can be smooth. It is tempting to
believe that any such “subquadratic” function is actually smooth, but this
is not true. Exercise 23(iii) provides a counterexample.
While bounded gradients are equivalent to Lipschitz continuity of f
(Theorem 2.10), smoothness turns out to be equivalent to Lipschitz con-
tinuity of ∇f —if f is convex over the whole space. In general, Lipschitz
continuity of ∇f implies smoothness, but not the other way around.

Lemma 3.5. Let f : Rd → R be convex and differentiable. The following two


statements are equivalent.

(i) f is smooth with parameter L.

(ii) k∇f (x) − ∇f (y)k ≤ Lkx − yk for all x, y ∈ Rd .

We will derive the direction (ii)⇒(i) as Lemma 6.1 in Chapter 6 (which


neither requires convexity nor domain Rd ). The other direction is a bit
more involved. A proof of the equivalence can be found in the lecture
slides of L. Vandenberghe, http://www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf.
The operations that we have shown to preserve convexity (Lemma 2.19)
also preserve smoothness. This immediately gives us a rich collection of
smooth functions.

Lemma 3.6 (Exercise 25).

(i) Let f1 , f2 , . . . , fm be smooth with parameters L1 , L2 , . . . , Lm , and let
λ1 , λ2 , . . . , λm ∈ R+ . Then the function f := Σ_{i=1}^{m} λi fi is smooth with
parameter Σ_{i=1}^{m} λi Li over dom(f ) := ∩_{i=1}^{m} dom(fi ).

(ii) Let f : dom(f ) → R with dom(f ) ⊆ Rd be smooth with parameter L,


and let g : Rm → Rd be an affine function, meaning that g(x) = Ax + b,
for some matrix A ∈ Rd×m and some vector b ∈ Rd . Then the function
f ◦ g (that maps x to f (Ax + b)) is smooth with parameter LkAk2 on
dom(f ◦ g) := {x ∈ Rm : g(x) ∈ dom(f )}, where kAk is the spectral
norm of A (Definition 2.2).
We next show that for smooth convex functions, the vanilla analysis
provides a better bound than it does under bounded gradients. In partic-
ular, we are now able to serve the supermodel f (x) = x2 .
We start with a preparatory lemma showing that gradient descent (with
suitable stepsize γ) makes progress in function value on smooth functions
in every step. We call this sufficient decrease, and maybe surprisingly, it
does not require convexity.
Lemma 3.7 (Sufficient decrease). Let f : Rd → R be differentiable and smooth
with parameter L according to (3.8). With
γ := 1/L,

gradient descent (3.1) satisfies

f (xt+1 ) ≤ f (xt ) − (1/(2L)) k∇f (xt )k^2 ,    t ≥ 0.
More specifically, this already holds if f is smooth with parameter L over the line
segment connecting xt and xt+1 .
Proof. We apply the smoothness condition (3.8) and the definition of gra-
dient descent that yields xt+1 − xt = −∇f (xt )/L. We compute
f (xt+1 ) ≤ f (xt ) + ∇f (xt )> (xt+1 − xt ) + (L/2) kxt − xt+1 k^2
         = f (xt ) − (1/L) k∇f (xt )k^2 + (1/(2L)) k∇f (xt )k^2
         = f (xt ) − (1/(2L)) k∇f (xt )k^2 .
Theorem 3.8. Let f : Rd → R be convex and differentiable with a global min-
imum x? ; furthermore, suppose that f is smooth with parameter L according
to (3.8). Choosing stepsize
γ := 1/L,

gradient descent (3.1) yields

f (xT ) − f (x? ) ≤ (L/(2T )) kx0 − x? k^2 ,    T > 0.
Proof. We apply sufficient decrease (Lemma 3.7) to bound the sum of the
kgt k2 = k∇f (xt )k2 after step (3.6) of the vanilla analysis as follows:
(1/(2L)) Σ_{t=0}^{T−1} k∇f (xt )k^2 ≤ Σ_{t=0}^{T−1} (f (xt ) − f (xt+1 )) = f (x0 ) − f (xT ).    (3.9)

With γ = 1/L, (3.6) then yields

Σ_{t=0}^{T−1} (f (xt ) − f (x? )) ≤ (1/(2L)) Σ_{t=0}^{T−1} k∇f (xt )k^2 + (L/2) kx0 − x? k^2
                              ≤ f (x0 ) − f (xT ) + (L/2) kx0 − x? k^2 ,

equivalently

Σ_{t=1}^{T} (f (xt ) − f (x? )) ≤ (L/2) kx0 − x? k^2 .    (3.10)

Because f (xt+1 ) ≤ f (xt ) for each 0 ≤ t ≤ T by Lemma 3.7, by taking the
average we get that

f (xT ) − f (x? ) ≤ (1/T ) Σ_{t=1}^{T} (f (xt ) − f (x? )) ≤ (L/(2T )) kx0 − x? k^2 .

This improves over the bounds of Theorem 3.1. With R2 := kx0 − x? k2 ,


we now only need

T ≥ R^2 L / (2ε)

iterations instead of R^2 B^2 /ε^2 to achieve absolute error at most ε.
Exercise 26 shows that we do not need to know L to obtain the same
asymptotic runtime.
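As a quick sanity check (our own sketch, not part of the text), we can run gradient descent with γ = 1/L on the quadratic from Figure 3.1, which is smooth with L = 6 (its Hessian is diag(4, 6)), and compare the error after T steps with the bound of Theorem 3.8.

    import numpy as np

    L = 6.0                                    # smoothness parameter of f below
    x_star = np.array([4.0, 3.0])              # global minimum, f(x_star) = 0
    f    = lambda x: 2 * (x[0] - 4)**2 + 3 * (x[1] - 3)**2
    grad = lambda x: np.array([4 * (x[0] - 4), 6 * (x[1] - 3)])

    x = np.array([0.0, 0.0])
    R2 = np.sum((x - x_star)**2)               # ||x_0 - x*||^2

    T = 20
    for _ in range(T):
        x = x - grad(x) / L                    # stepsize gamma = 1/L

    print(f(x), "<=", L * R2 / (2 * T))        # error is well below the bound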
Interestingly, the bound in Theorem 3.8 can be improved—but not by
much. Fixing L and R = kx0 − x? k, the bound is of the form O(1/T ). Lee
and Wright have shown that a better upper bound of o(1/T ) holds, but
that for any fixed δ > 0, a lower bound of Ω(1/T 1+δ ) also holds [LW19].

3.6 Acceleration for smooth convex functions: O(1/√ε) steps
Let’s take a step back, forget about gradient descent for a moment, and just
think about what we actually use the algorithm for: we are minimizing a
differentiable convex function f : Rd → R, where we are assuming that
we have access to the gradient vector ∇f (x) at any given point x.
But is it clear that gradient descent is the best algorithm for this task?
After all, it is just some algorithm that is using gradients to make progress
locally, but there might be other (and better) such algorithms. Let us define
a first-order method as an algorithm that only uses gradient information to
minimize f . More precisely, we allow a first-order method to access f only
via an oracle that is able to return values of f and ∇f at arbitrary points.
Gradient descent is then just a specific first-order method.
For any class of convex functions, one can then ask a natural ques-
tion: What is the best first-order method for the function class, the one that
needs the smallest number of oracle calls in the worst case, as a function
of the desired error ε? In particular, is there a method that asymptotically
beats gradient descent?
There is an interesting history here: in 1979, Nemirovski and Yudin
have shown that every first-order method needs in the worst case Ω(1/√ε)
steps (gradient evaluations) in order to achieve an additive error of ε on
smooth functions [NY83]. Recall that we have seen an upper bound of
O(1/ε) for gradient descent in the previous section; in fact, this upper
bound was known to Nemirovski and Yudin already. Reformulated in the
language of the previous section, there is a first-order method (gradient
descent) that attains additive error O(1/T ) after T steps, and all first-order
methods have additive error Ω(1/T 2 ) in the worst case.

The obvious question resulting from this was whether there actually
exists a first-order method that has additive error O(1/T 2 ) after T steps, on
every smooth function. This was answered in the affirmative by Nesterov
in 1983 when he proposed an algorithm that is now known as (Nesterov’s)
accelerated gradient descent [Nes83]. Nesterov’s book (Sections 2.1 and 2.2)
is a comprehensive source for both lower and upper bound [Nes18].
It is not easy to understand why the accelerated gradient descent algo-
rithm is an optimal first-order method, and how Nesterov even arrived at
it. A number of alternative derivations of optimal algorithms have been
given by other authors, usually claiming that they provide a more natural
or easier-to-grasp approach. However, each alternative approach requires
some understanding of other things, and there is no well-established “sim-
plest approach”. Here, we simply throw the algorithm at the reader, with-
out any attempt to motivate it beyond some obvious words. Then we
present a short proof that the algorithm is indeed optimal.
Let f : Rd → R be convex, differentiable, and smooth with parame-
ter L. Accelerated gradient descent is the following algorithm: choose z0 =
y0 = x0 arbitrary. For t ≥ 0, set

yt+1 := xt − (1/L) ∇f (xt ),                          (3.11)
zt+1 := zt − ((t + 1)/(2L)) ∇f (xt ),                 (3.12)
xt+1 := ((t + 1)/(t + 3)) yt+1 + (2/(t + 3)) zt+1 .   (3.13)
This means, we are performing a normal “smooth step” from xt to obtain
yt+1 and a more aggressive step from zt to get zt+1 . The next iterate xt+1
is a weighted average of yt+1 and zt+1 , where we compensate for the more
aggressive step by giving zt+1 a relatively low weight.
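The three update rules translate directly into a few lines of NumPy. The sketch below is our own illustration (the function name agd and the quadratic example are not from the notes); it assumes access to ∇f as a callable.

import numpy as np

def agd(grad, x0, L, T):
    # Nesterov's accelerated gradient descent, steps (3.11)-(3.13)
    x = y = z = np.array(x0, dtype=float)
    for t in range(T):
        g = grad(x)
        y = x - g / L                       # smooth step (3.11)
        z = z - (t + 1) / (2 * L) * g       # more aggressive step (3.12)
        x = ((t + 1) * y + 2 * z) / (t + 3) # weighted average (3.13)
    return y                                # Theorem 3.9 bounds f(y_T)

# Toy example: f(x) = 1/2 x^T Q x, smooth with L = ||Q||.
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x
y_T = agd(grad, x0=[5.0, 5.0], L=10.0, T=100)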
Theorem 3.9. Let f : Rd → R be convex and differentiable with a global min-
imum x? ; furthermore, suppose that f is smooth with parameter L according
to (3.8). Accelerated gradient descent (3.11), (3.12), and (3.13), yields

f (yT ) − f (x? ) ≤ 2L kz0 − x? k2 /(T (T + 1)),   T > 0.
Comparing this bound with the one from Theorem 3.8, we see that the
error is now indeed O(1/T 2 ) instead of O(1/T ); to reach error at most ε,
accelerated gradient descent therefore only needs O(1/√ε) steps instead
of O(1/ε).
Proof. The analysis uses a potential function argument [BG17]. We assign a
potential Φ(t) to each time t and show that Φ(t + 1) ≤ Φ(t). The potential
is
Φ(t) := t(t + 1) (f (yt ) − f (x? )) + 2L kzt − x? k2 .
If we can show that the potential always decreases, we get

T (T + 1) (f (yT ) − f (x? )) + 2L kzT − x? k2 = Φ(T ) ≤ Φ(0) = 2L kz0 − x? k2 ,

from which the statement immediately follows. For the argument, we


need three well-known ingredients: (i) sufficient decrease (Lemma 3.7) for
step (3.11) with γ = 1/L:

f (yt+1 ) ≤ f (xt ) − (1/(2L)) k∇f (xt )k2 ;   (3.14)
(ii) the vanilla analysis (Section 3.3) for step (3.12) with γ = (t + 1)/(2L), gt = ∇f (xt ):

gt> (zt − x? ) = ((t + 1)/(4L)) kgt k2 + (L/(t + 1)) (kzt − x? k2 − kzt+1 − x? k2 );   (3.15)
(iii) convexity:

f (xt ) − f (w) ≤ gt> (xt − w), w ∈ Rd . (3.16)

On top of this, we perform some simple calculations next. By defini-


tion, the potentials are

Φ(t + 1) = t(t + 1) (f (yt+1 ) − f (x? )) + 2(t + 1) (f (yt+1 ) − f (x? )) + 2L kzt+1 − x? k2


Φ(t) = t(t + 1) (f (yt ) − f (x? )) + 2L kzt − x? k2

Now,

∆ := (Φ(t + 1) − Φ(t))/(t + 1)

can be bounded as follows.

∆ = t (f (yt+1 ) − f (yt )) + 2 (f (yt+1 ) − f (x? )) + (2L/(t + 1)) (kzt+1 − x? k2 − kzt − x? k2 )
  = t (f (yt+1 ) − f (yt )) + 2 (f (yt+1 ) − f (x? )) + ((t + 1)/(2L)) kgt k2 − 2gt> (zt − x? )   (by (3.15))
  ≤ t (f (xt ) − f (yt )) + 2 (f (xt ) − f (x? )) − (1/(2L)) kgt k2 − 2gt> (zt − x? )             (by (3.14))
  ≤ t (f (xt ) − f (yt )) + 2 (f (xt ) − f (x? )) − 2gt> (zt − x? )
  ≤ tgt> (xt − yt ) + 2gt> (xt − x? ) − 2gt> (zt − x? )                                          (by (3.16))
  = gt> ((t + 2)xt − tyt − 2zt )
  = gt> 0 = 0.                                                                                   (by (3.13))
Hence, we indeed have Φ(t + 1) ≤ Φ(t).

3.7 Interlude
Let us get back to the supermodel f (x) = x2 (that is smooth with param-
eter L = 2, as we observed before). According to Theorem 3.8, gradient
descent (3.1) with stepsize γ = 1/2 satisfies

f (xT ) ≤ x20 /T.   (3.17)
Here we used that the minimizer is x? = 0. Let us check how good this
bound really is. For our concrete function and concrete stepsize, (3.1) reads as

xt+1 = xt − (1/2) ∇f (xt ) = xt − xt = 0,
so we are always done after one step! But we will see in the next section
that this is only because the function is particularly beautiful, and on top of
that, we have picked the best possible smoothness parameter. To simulate
a more realistic situation here, let us assume that we have not looked at the
supermodel too closely and found it to be smooth with parameter L = 4
only (which is a suboptimal but still valid parameter). In this case, γ = 1/4
and (3.1) becomes

xt+1 = xt − (1/4) ∇f (xt ) = xt − xt /2 = xt /2.
So, we in fact have

f (xT ) = f (x0 /2T ) = x20 /22T .   (3.18)

This is still vastly better than the bound of (3.17)! While (3.17) requires
T ≈ x20 /ε to achieve f (xT ) ≤ ε, (3.18) requires only

T ≈ (1/2) log(x20 /ε),
which is an exponential improvement in the number of steps.
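A few lines of Python reproduce this comparison numerically; the script and its variable names are ours, purely for illustration.

# Gradient descent on the supermodel f(x) = x^2, with gradient f'(x) = 2x.
x0, eps = 100.0, 1e-6

# stepsize 1/2 (exact smoothness parameter L = 2): done after one step
x = x0 - 0.5 * (2 * x0)
assert x == 0.0

# stepsize 1/4 (pessimistic smoothness parameter L = 4): iterate is halved each step
x, steps = x0, 0
while x * x > eps:
    x = x - 0.25 * (2 * x)   # x_{t+1} = x_t / 2
    steps += 1
# steps is roughly 0.5 * log2(x0^2 / eps), far fewer than the x0^2 / eps of (3.17)
print(steps)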

3.8 Smooth and strongly convex functions: O(log(1/ε)) steps
The supermodel function f (x) = x2 is not only smooth (“not too curved”)
but also strongly convex (“not too flat”). It will turn out that this is the
crucial ingredient that makes gradient descent fast.
Definition 3.10. Let f : dom(f ) → R be a convex and differentiable function,
X ⊆ dom(f ) convex and µ ∈ R+ , µ > 0. Function f is called strongly convex
(with parameter µ) over X if

f (y) ≥ f (x) + ∇f (x)> (y − x) + (µ/2) kx − yk2 , ∀x, y ∈ X.   (3.19)
If X = dom(f ), f is simply called strongly convex.
While smoothness according to (3.8) says that for any x ∈ X, the graph
of f is below a not-too-steep tangential paraboloid at (x, f (x)), strong con-
vexity means that the graph of f is above a not-too-flat tangential paraboloid
at (x, f (x)). The graph of a smooth and strongly convex function is there-
fore at every point wedged between two paraboloids; see Figure 3.3.
We can also interpret (3.19) as a strengthening of convexity. In the form
of (3.7), convexity reads as

f (y) ≥ f (x) + ∇f (x)> (y − x), ∀x, y ∈ dom(f ),

and therefore says that every convex function satisfies (3.19) with µ = 0.
In the spirit of Lemma 3.3 for smooth functions, we can characterize
strong convexity via convexity of another function.

[Figure 3.3: A smooth and strongly convex function. At every x, the graph of f
is wedged between the tangential paraboloids f (x) + ∇f (x)> (y − x) + (L/2) kx − yk2
from above and f (x) + ∇f (x)> (y − x) + (µ/2) kx − yk2 from below.]

Lemma 3.11 (Exercise 28). Suppose that dom(f ) is open and convex, and that
f : dom(f ) → R is differentiable. Let µ ∈ R+ . Then the following two state-
ments are equivalent.

(i) f is strongly convex with parameter µ.

(ii) g defined by g(x) = f (x) − (µ/2) x> x is convex over dom(g) := dom(f ).

Lemma 3.12 (Exercise 29). If f : Rd → R is strongly convex with parameter


µ > 0, then f is strictly convex and has a unique global minimum.

The supermodel f (x) = x2 is particularly beautiful since it is both


smooth and strongly convex with the same parameter L = µ = 2 (go-
ing through the calculations in Exercise 22 will reveal this). We can easily
characterize the class of particularly beautiful functions. These are exactly
the ones whose sublevel sets are `2 -balls.

Lemma 3.13 (Exercise 30). Let f : Rd → R be strongly convex with parameter
µ > 0 and smooth with parameter µ. Prove that f is of the form

f (x) = (µ/2) kx − bk2 + c,
where b ∈ Rd , c ∈ R.

Once we have a unique global minimum x? , we can attempt to prove


that limt→∞ xt = x? in gradient descent. We start from the vanilla analysis
(3.4) and plug in the lower bound gt> (xt − x? ) = ∇f (xt )> (xt − x? ) ≥ f (xt ) −
f (x? ) + (µ/2) kxt − x? k2 resulting from strong convexity. We get

f (xt ) − f (x? ) ≤ (1/(2γ)) (γ 2 k∇f (xt )k2 + kxt − x? k2 − kxt+1 − x? k2 ) − (µ/2) kxt − x? k2 .
                                                                                        (3.20)
Rewriting this yields a bound on kxt+1 − x? k2 in terms of kxt − x? k2 , along
with some “noise” that we still need to take care of:

kxt+1 −x? k2 ≤ 2γ(f (x? )−f (xt ))+γ 2 k∇f (xt )k2 +(1−µγ)kxt −x? k2 . (3.21)

Theorem 3.14. Let f : Rd → R be convex and differentiable. Suppose that f is


smooth with parameter L according to (4.5) and strongly convex with parameter
µ > 0 according to (4.9). Exercise 33 asks you to prove that there is a unique
global minimum x? of f . Choosing

γ := 1/L,

gradient descent (3.1) with arbitrary x0 satisfies the following two properties.

(i) Squared distances to x? are geometrically decreasing:

    kxt+1 − x? k2 ≤ (1 − µ/L) kxt − x? k2 ,   t ≥ 0.

(ii) The absolute error after T iterations is exponentially small in T :

    f (xT ) − f (x? ) ≤ (L/2) (1 − µ/L)T kx0 − x? k2 ,   T > 0.

Proof. For (i), we show that the noise in (3.21) disappears. By sufficient
decrease (Lemma 3.7), we know that

f (x? ) − f (xt ) ≤ f (xt+1 ) − f (xt ) ≤ −(1/(2L)) k∇f (xt )k2 .

Using γ = 1/L, multiplying this by 2γ and rearranging terms, the noise can be
bounded as

2γ (f (x? ) − f (xt )) + γ 2 k∇f (xt )k2 ≤ 0.

Therefore, (3.21) actually yields

kxt+1 − x? k2 ≤ (1 − µγ)kxt − x? k2 = (1 − µ/L) kxt − x? k2

and

kxT − x? k2 ≤ (1 − µ/L)T kx0 − x? k2 .

The bound in (ii) follows from smoothness (3.8), using ∇f (x? ) = 0
(Lemma 2.23):

f (xT ) − f (x? ) ≤ ∇f (x? )> (xT − x? ) + (L/2) kxT − x? k2 = (L/2) kxT − x? k2 .

From this, we can derive a rate in terms of the required number of steps T .
Since 1 − µ/L ≤ e−µ/L (by the inequality ln(1 + x) ≤ x), it follows that after

T ≥ (L/µ) ln(R2 L/(2ε))

iterations, we reach absolute error at most ε, where R = kx0 − x? k.

3.9 Exercises
Exercise 20. Let c ∈ Rd . Prove that the spectral norm of c> equals the Euclidean
norm of c, meaning that

max_{x ≠ 0} |c> x| / kxk = kck .

Exercise 21. Prove Lemma 3.3! (Alternative characterization of smoothness)

Exercise 22. Prove Lemma 3.4: The quadratic function f (x) = x> Qx+b> x+c,
Q symmetric, is smooth with parameter 2 kQk.

Exercise 23. Consider the function f (x) = |x|3/2 for x ∈ R.

(i) Prove that f is strictly convex and differentiable, with a unique global min-
imum x? = 0.

(ii) Prove that for every fixed stepsize γ in gradient descent (3.1) applied to f ,
there exists x0 for which f (x1 ) > f (x0 ).

(iii) Prove that f is not smooth.

(iv) Let X ⊆ R be a closed convex set such that 0 ∈ X and X 6= {0}. Prove
that f is not smooth over X.

Exercise 24. In order to obtain average error at most ε in Theorem 3.1, we need
to choose iteration number and stepsize as

T ≥ (RB/ε)2 ,   γ := R/(B √T ).
If R or B are unknown, we cannot do this.
Suppose now that we know R but not B. This means, we know a concrete
number R such that kx0 − x? k ≤ R; we also know that there exists a number B
such that k∇f (x)k ≤ B for all x, but we don’t know a concrete such number.
Develop an algorithm that—not knowing B—finds a vector x such that f (x)−
f (x? ) < ε, using at most

O((RB/ε)2 )
many gradient descent steps!

Exercise 25. Prove Lemma 3.6! (Operations which preserve smoothness)

Exercise 26. In order to obtain average error at most ε in Theorem 3.8, we need
to choose

γ := 1/L,   T ≥ R2 L/(2ε),

if kx0 − x? k ≤ R. If L is unknown, we cannot do this.
Now suppose that we know R but not L. This means, we know a concrete
number R such that kx0 − x? k ≤ R; we also know that there exists a number
L such that f is smooth with parameter L, but we don’t know a concrete such
number.
Develop an algorithm that—not knowing L—finds a vector x such that f (x)−
f (x? ) < ε, using at most

O(R2 L/ε)
many gradient descent steps!

Exercise 27. Let a ∈ R. Prove that f (x) = x4 is smooth over X = (−a, a) and
determine a concrete smoothness parameter L.

Exercise 28. Prove Lemma 3.11! (Alternative characterization of strong convex-


ity)

Exercise 29. Prove Lemma 3.12! (Strongly convex functions have unique global
minimum)

Exercise 30. Prove Lemma 3.13! (Strongly convex and smooth functions)

Chapter 4

Projected Gradient Descent

Contents
4.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.2 Bounded gradients: O(1/ε2 ) steps . . . . . . . . . . . . . . . . 116
4.3 Smooth convex functions: O(1/ε) steps . . . . . . . . . . . . 117
4.4 Smooth and strongly convex functions: O(log(1/ε)) steps . . 120
4.5 Projecting onto `1 -balls . . . . . . . . . . . . . . . . . . . . . . 122
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

4.1 The Algorithm
Another way to control gradients in (3.5) is to minimize f over a closed
convex subset X ⊆ Rd . For example, we may have a constrained opti-
mization problem to begin with (for example the LASSO in Section 2.6.2),
or we happen to know some region X containing a global minimum x? , so
that we can restrict our search to that region. In this case, gradient descent
also works, but we need an additional projection step. After all, it can hap-
pen that some iteration of (3.1) takes us “into the wild” (out of X) where
we have no business to do. Projected gradient descent is the following
modification. We choose x0 ∈ X arbitrary and for t ≥ 0 define

yt+1 := xt − γ∇f (xt ),                                (4.1)
xt+1 := ΠX (yt+1 ) := argmin_{x∈X} kx − yt+1 k2 .      (4.2)

This means, after each iteration, we project the obtained iterate yt+1 back
to X. This may be very easy (think of X as the unit ball in which case
we just have to scale yt+1 down to length 1 if it is longer). But it may
also be very difficult. In general, computing ΠX (yt+1 ) means to solve an
auxiliary convex constrained minimization problem in each step! Here,
we are just assuming that we can do this. The projection is well-defined:
the squared distance function dy (x) := kx − yk2 is strongly convex, and
hence, a unique minimum over the nonempty closed and convex set X
exists by Exercise 33.
We note that finding an initial x0 ∈ X also reduces to projection (of 0,
for example) onto X.
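For concreteness, here is a NumPy sketch of projected gradient descent in which the projection oracle ΠX is passed in as a function; the helper names and the ball example are our own illustration, not part of the notes.

import numpy as np

def projected_gd(grad, project, x0, gamma, T):
    # projected gradient descent, steps (4.1)-(4.2)
    x = project(np.array(x0, dtype=float))
    for _ in range(T):
        y = x - gamma * grad(x)   # plain gradient step (4.1)
        x = project(y)            # project back onto X (4.2)
    return x

# Example: X is the Euclidean ball of radius r around 0 (projection = rescaling).
def project_ball(y, r=1.0):
    n = np.linalg.norm(y)
    return y if n <= r else (r / n) * y

grad = lambda x: 2 * (x - np.array([3.0, 0.0]))   # f(x) = ||x - (3,0)||^2, smooth with L = 2
x = projected_gd(grad, project_ball, x0=[0.0, 0.0], gamma=0.5, T=50)
# x should be close to (1, 0), the point of the unit ball closest to (3, 0)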
We will frequently need the following

Fact 4.1. Let X ⊆ Rd be closed and convex, x ∈ X, y ∈ Rd . Then

(i) (x − ΠX (y))> (y − ΠX (y)) ≤ 0.

(ii) kx − ΠX (y)k2 + ky − ΠX (y)k2 ≤ kx − yk2 .

Part (i) says that the vectors x − ΠX (y) and y − ΠX (y) form an obtuse
angle, and (ii) equivalently says that the square of the long side x − y in
the triangle formed by the three points is at least the sum of squares of the
two short sides; see Figure 4.1.

[Figure 4.1: Illustration of Fact 4.1. The angle α at ΠX (y) between x − ΠX (y)
and y − ΠX (y) satisfies α ≥ 90°.]

Proof. ΠX (y) is by definition a minimizer of the (differentiable) convex


function dy (x) = kx − yk2 over X, and (i) is just the equivalent optimal-
ity condition of Lemma 2.28. We need X to be closed in the first place in
order to ensure that we can project onto X (see Exercise 33 applied with
dy (x)). Indeed, for example, the number 1 has no closest point in the non-closed
set (−∞, 0) ⊆ R. Part (ii) follows from (i) via the (by now well-known) equa-
tion 2v> w = kvk2 + kwk2 − kv − wk2 .
Exercise 31 asks you to prove that if xt+1 = xt in projected gradient
descent (i.e. we project back to the previous iterate), then xt is a minimizer
of f over X.

4.2 Bounded gradients: O(1/ε2) steps


As in the unconstrained case, let us first assume that gradients are bounded
by a constant B—this time over X. This implies that f is B-Lipschitz over
X (see Theorem 2.10), but the converse may not hold.
If we minimize f over a closed and bounded (= compact) convex set X,
we get the existence of a minimizer and a bound R for the initial distance
to it for free; assuming that f is continuously differentiable, we also have a
bound B for the gradient norms over X. This is because then x 7→ k∇f (x)k
is a continuous function that attains a maximum over X. In this case, our
vanilla analysis yields a much more useful result than the one in Theo-
rem 3.1, with the same stepsize and the same number of steps.

Theorem 4.2. Let f : dom(f ) → R be convex and differentiable, X ⊆ dom(f )
closed and convex, x? a minimizer of f over X; furthermore, suppose that kx0 −
x? k ≤ R, and that k∇f (x)k ≤ B for all x ∈ X. Choosing the constant stepsize

γ := R/(B √T ),

projected gradient descent (4.1) with x0 ∈ X yields

(1/T ) Σ_{t=0}^{T−1} (f (xt ) − f (x? )) ≤ RB/√T .

Proof. The only required changes to the vanilla analysis are that in steps
(3.3) and (3.4), xt+1 needs to be replaced by yt+1 as this is the real next
(non-projected) gradient descent iterate after these steps; we therefore get

gt> (xt − x? ) = (1/(2γ)) (γ 2 kgt k2 + kxt − x? k2 − kyt+1 − x? k2 ).   (4.3)

From Fact 4.1 (ii) (with x = x? , y = yt+1 ), we obtain kxt+1 − x? k2 ≤ kyt+1 −


x? k2 , hence we get

gt> (xt − x? ) ≤ (1/(2γ)) (γ 2 kgt k2 + kxt − x? k2 − kxt+1 − x? k2 )   (4.4)

and return to the previous vanilla analysis for the remainder of the proof.

4.3 Smooth convex functions: O(1/ε) steps


We recall from Definition 3.2 that f is smooth over X if

f (y) ≤ f (x) + ∇f (x)> (y − x) + (L/2) kx − yk2 , ∀x, y ∈ X.   (4.5)
To minimize f over X, we use projected gradient descent again. The
runtime turns out to be the same as in the unconstrained case. Again, we
have sufficient decrease. This is not obvious from the following lemma,
but you are asked to prove it in Exercise 32.

Lemma 4.3. Let f : dom(f ) → R be differentiable and smooth with parameter L
over a closed and convex set X ⊆ dom(f ), according to (4.5). Choosing stepsize

γ := 1/L,

projected gradient descent (4.1) with arbitrary x0 ∈ X satisfies

f (xt+1 ) ≤ f (xt ) − (1/(2L)) k∇f (xt )k2 + (L/2) kyt+1 − xt+1 k2 ,   t ≥ 0.
More specifically, this already holds if f is smooth with parameter L over the line
segment connecting xt and xt+1 .
Proof. We proceed similar to the proof of the “unconstrained” sufficient
decrease Lemma 3.7, except that we now need to deal with projected gra-
dient descent. We again start from smoothness but then use yt+1 = xt −
∇f (xt )/L, followed by the usual equation 2v> w = kvk2 + kwk2 − kv − wk2 :

f (xt+1 ) ≤ f (xt ) + ∇f (xt )> (xt+1 − xt ) + (L/2) kxt − xt+1 k2
         = f (xt ) − L(yt+1 − xt )> (xt+1 − xt ) + (L/2) kxt − xt+1 k2
         = f (xt ) − (L/2) (kyt+1 − xt k2 + kxt+1 − xt k2 − kyt+1 − xt+1 k2 ) + (L/2) kxt − xt+1 k2
         = f (xt ) − (L/2) kyt+1 − xt k2 + (L/2) kyt+1 − xt+1 k2
         = f (xt ) − (1/(2L)) k∇f (xt )k2 + (L/2) kyt+1 − xt+1 k2 .

Theorem 4.4. Let f : dom(f ) → R be convex and differentiable. Let X ⊆


dom(f ) be a closed convex set, and assume that there is a minimizer x? of f over
X; furthermore, suppose that f is smooth over X with parameter L according
to (4.5). Choosing stepsize

γ := 1/L,

projected gradient descent (4.1) with x0 ∈ X satisfies

f (xT ) − f (x? ) ≤ (L/(2T )) kx0 − x? k2 ,   T > 0.
Proof. The plan is as in the proof of Theorem 3.8 to use the inequality

(1/(2L)) k∇f (xt )k2 ≤ f (xt ) − f (xt+1 ) + (L/2) kyt+1 − xt+1 k2   (4.6)
resulting from sufficient decrease (Lemma 4.3) to bound the squared gra-
dient kgt k2 = k∇f (xt )k2 in the vanilla analysis. Unfortunately, (4.6) has
an extra term compared to what we got in the unconstrained case. But we
can compensate for this in the vanilla analysis itself. Let us go back to its
“constrained” version (4.3), featuring yt+1 instead of xt+1 :

gt> (xt − x? ) = (1/(2γ)) (γ 2 kgt k2 + kxt − x? k2 − kyt+1 − x? k2 ).


Previously, we applied kxt+1 − x? k2 ≤ kyt+1 − x? k2 (Fact 4.1(ii)) to get back


on the unconstrained vanilla track. But in doing so, we dropped a term
that now becomes useful. Indeed, Fact 4.1(ii) actually yields kxt+1 − x? k2 +
kyt+1 − xt+1 k2 ≤ kyt+1 − x? k2 , so that we get the following upper bound
for gt> (xt − x? ):

(1/(2γ)) (γ 2 kgt k2 + kxt − x? k2 − kxt+1 − x? k2 − kyt+1 − xt+1 k2 ).   (4.7)

Using f (xt ) − f (x? ) ≤ gt> (xt − x? ) from convexity, we have (with γ = 1/L) that

Σ_{t=0}^{T−1} (f (xt ) − f (x? )) ≤ Σ_{t=0}^{T−1} gt> (xt − x? )   (4.8)
                                ≤ (1/(2L)) Σ_{t=0}^{T−1} kgt k2 + (L/2) kx0 − x? k2 − (L/2) Σ_{t=0}^{T−1} kyt+1 − xt+1 k2 .

To bound the sum of the squared gradients, we use (4.6):

(1/(2L)) Σ_{t=0}^{T−1} kgt k2 ≤ Σ_{t=0}^{T−1} (f (xt ) − f (xt+1 ) + (L/2) kyt+1 − xt+1 k2 )
                              = f (x0 ) − f (xT ) + (L/2) Σ_{t=0}^{T−1} kyt+1 − xt+1 k2 .

Plugging this into (4.8), the extra terms cancel, and we arrive—as in the
unconstrained case—at

Σ_{t=1}^{T} (f (xt ) − f (x? )) ≤ (L/2) kx0 − x? k2 .
The statement follows as in the proof of Theorem 3.8 from the fact that due
to sufficient decrease (Exercise 32), the last iterate is the best one.

4.4 Smooth and strongly convex functions: O(log(1/ε)) steps
Assuming that f is smooth and strongly convex over a set X, we can also
prove fast convergence of projected gradient descent. This does not re-
quire any new ideas, we have seen all the ingredients before.
We recall from Definition 3.10 that f is strongly convex with parameter
µ > 0 over X if

f (y) ≥ f (x) + ∇f (x)> (y − x) + (µ/2) kx − yk2 , ∀x, y ∈ X.   (4.9)
Theorem 4.5. Let f : dom(f ) → R be convex and differentiable. Let X ⊆
dom(f ) be a nonempty closed and convex set and suppose that f is smooth over
X with parameter L according to (4.5) and strongly convex over X with param-
eter µ > 0 according to (4.9). Exercise 33 asks you to prove that there is a unique
minimizer x? of f over X. Choosing

γ := 1/L,

projected gradient descent (4.1) with arbitrary x0 satisfies the following two prop-
erties.

(i) Squared distances to x? are geometrically decreasing:

    kxt+1 − x? k2 ≤ (1 − µ/L) kxt − x? k2 ,   t ≥ 0.

(ii) The absolute error after T iterations is exponentially small in T :

    f (xT ) − f (x? ) ≤ k∇f (x? )k (1 − µ/L)T /2 kx0 − x? k + (L/2) (1 − µ/L)T kx0 − x? k2 ,   T > 0.
We note that this is almost the same result as in Theorem 3.14 for the
unconstrained case; in fact, the result in part (i) is identical, but in part (ii),
we get an additional term. This is due to the fact that in the constrained
case, we cannot argue that ∇f (x? ) = 0. In fact, this additional term is the
dominating one, once the error becomes small. It has the effect that the
required number of steps to reach error at most ε will roughly double, in
comparison to the bound of Theorem 3.14.
Proof. In the strongly convex case, the “constrained” vanilla bound (4.7)

(1/(2γ)) (γ 2 k∇f (xt )k2 + kxt − x? k2 − kxt+1 − x? k2 − kyt+1 − xt+1 k2 )

on f (xt ) − f (x? ) can be strengthened to

(1/(2γ)) (γ 2 k∇f (xt )k2 + kxt − x? k2 − kxt+1 − x? k2 − kyt+1 − xt+1 k2 ) − (µ/2) kxt − x? k2 .
                                                                                       (4.10)
Now we proceed as in the proof of Theorem 3.14 and rewrite the latter
bound into a bound on kxt+1 − x? k2 that is
2γ(f (x? ) − f (xt )) + γ 2 k∇f (xt )k2 − kyt+1 − xt+1 k2 + (1 − µγ)kxt − x? k2 ,
so we have geometric decrease in squared distance to x? , up to some noise.
Again, we show that by sufficient decrease, the noise in this bound disap-
pears. From Lemma 4.3, we know that

f (x? ) − f (xt ) ≤ f (xt+1 ) − f (xt ) ≤ −(1/(2L)) k∇f (xt )k2 + (L/2) kyt+1 − xt+1 k2 ,

and using this, the noise can be bounded: multiplying the previous inequality
by 2/L and rearranging terms, we get

(2/L) (f (x? ) − f (xt )) + (1/L2 ) k∇f (xt )k2 − kyt+1 − xt+1 k2 ≤ 0.
With γ = 1/L, this exactly shows that the noise is nonpositive. This yields
(i). The bound in (ii) follows from smoothness (3.8):

f (xT ) − f (x? ) ≤ ∇f (x? )> (xT − x? ) + (L/2) kx? − xT k2
                 ≤ k∇f (x? )k kxT − x? k + (L/2) kx? − xT k2                            (Cauchy-Schwarz)
                 ≤ k∇f (x? )k (1 − µ/L)T /2 kx0 − x? k + (L/2) (1 − µ/L)T kx0 − x? k2 .

4.5 Projecting onto `1-balls
Problems that are `1 -regularized appear among the most commonly used
models in machine learning and signal processing, and we have already
discussed the Lasso as an important example of that class. We will now
address how to perform projected gradient as an efficient optimization for
`1 -constrained problems. Let

X = B1 (R) := { x ∈ Rd : kxk1 = Σ_{i=1}^{d} |xi | ≤ R }

be the `1 -ball of radius R > 0 around 0, i.e., the set of all points with 1-
norm at most R. Our goal is to compute ΠX (v) for a given vector v, i.e. the
projection of v onto X; see Figure 4.2.

[Figure 4.2: Projecting onto an `1 -ball X = B1 (R): a point v outside the ball is
mapped to the closest point ΠX (v) on the ball.]

At first sight, this may look like a rather complicated task. Geometri-
cally, X is a cross polytope (square for d = 2, octahedron for d = 3), and as
such it has 2d many facets. But we can start with some basic simplifying
observations.
Fact 4.6. We may assume without loss of generality that (i) R = 1, (ii) vi ≥ 0 for
all i, and (iii) Σ_{i=1}^{d} vi > 1.

Proof. If we project v/R onto B1 (1), we obtain ΠX (v)/R (just scale Fig-
ure 4.2), so we can restrict to the case R = 1. For (ii), we observe that

simultaneously flipping the signs of a fixed subset of coordinates in both
v and x ∈ X yields vectors v0 and x0 ∈ X such that kx − vk = kx0 − v0 k;
thus, x minimizes the distance to v if and only if x0 minimizes the distance
to v0 . Hence, it suffices to compute ΠX (v) for vectors with nonnegative
entries. If di=1 vi ≤ 1, we have ΠX (v) = v and are done, so the interesting
P
case is (iii).
Fact 4.7. Under the assumptions of Fact 4.6, x = ΠX (v) satisfies xi ≥ 0 for all i
and Σ_{i=1}^{d} xi = 1.

Proof. If xi < 0 for some i, then (−xi − vi )2 ≤ (xi − vi )2 (since vi ≥ 0),


so flipping the i-th sign in x would yield another vector in X at least as
close to v as x, but such a vector cannot exist by strict convexity of the
squared distance. And if di=1 xi < 1, then x0 = x + λ(v − x) ∈ X for some
P
small positive λ, with kx0 − vk = (1 − λ)kx − vk, again contradicting the
optimality of x.
Corollary 4.8. Under the assumptions of Fact 4.6,

ΠX (v) = argmin kx − vk2 ,


x∈∆d

where
n d
X o
d
∆d := x ∈ R : xi = 1, xi ≥ 0 ∀i
i=1

is the standard simplex.


This means, we have reduced the projection onto an `1 -ball to the pro-
jection onto the standard simplex; see Figure 4.3.
To address the latter task, we make another assumption that can be
established by suitably permuting the entries of v (which just permutes
the entries of its projection onto ∆d in the same way).
Fact 4.9. We may assume without loss of generality that v1 ≥ v2 ≥ · · · ≥ vd .
Lemma 4.10. Let x? := argminx∈∆d kx−vk2 . Under the assumption of Fact 4.9,
there exists (a unique) p ∈ {1, . . . , d} such that

x?i > 0, i ≤ p,
x?i = 0, i > p.

[Figure 4.3: Projecting onto the standard simplex ∆d : the point v is mapped to
its closest point in ∆d .]

Proof. We are using the optimality criterion of Lemma 2.28:

∇dv (x? )> (x − x? ) = 2(x? − v)> (x − x? ) ≥ 0, x ∈ ∆d , (4.11)

where dv (z) := kz − vk2 is the squared distance to v.


Because Σ_{i=1}^{d} x?i = 1, there is at least one positive entry in x? . It remains
to show that we cannot have x?i = 0 and x?i+1 > 0. Indeed, in this situa-
tion, we could decrease x?i+1 by some small positive ε and simultaneously
increase x?i to ε to obtain a vector x ∈ ∆d such that

(x? − v)> (x − x? ) = (0 − vi )ε − (x?i+1 − vi+1 )ε = ε((vi+1 − vi ) − x?i+1 ) < 0,

since vi+1 − vi ≤ 0 (the entries of v are sorted) and x?i+1 > 0,
contradicting the optimality (4.11).


But we can say even more about x? .

Lemma 4.11. Under the assumption of Fact 4.9, and with p as in Lemma 4.10,

x?i = vi − Θp , i ≤ p,

where

Θp = (1/p) (Σ_{i=1}^{p} vi − 1).

Proof. Again, we argue by contradiction. If not all x?i − vi , i ≤ p have the
same value −Θp , then we have x?i −vi < x?j −vj for some i, j ≤ p. As before,
we can then decrease x?j > 0 by some small positive ε and simultaneously
increase x?i by ε to obtain x ∈ ∆d such that

(x? − v)> (x − x? ) = (x?i − vi )ε − (x?j − vj )ε = ε((x?i − vi ) − (x?j − vj )) < 0,

again contradicting (4.11). The expression for Θp is then obtained from


1 = Σ_{i=1}^{p} x?i = Σ_{i=1}^{p} (vi − Θp ) = Σ_{i=1}^{p} vi − pΘp .

Let us summarize the situation: we now have d candidates for x? ,


namely the vectors

x? (p) := (v1 − Θp , . . . , vp − Θp , 0, . . . , 0), p ∈ {1, . . . , d}, (4.12)

and we just need to find the right one. In order for candidate x? (p) to
comply with Lemma 4.10, we must have

vp − Θp > 0, (4.13)

and this actually ensures x? (p)i > 0 for all i ≤ p by the assumption of
Fact 4.9 and therefore x? (p) ∈ ∆d . But there could still be several values of
p satisfying (4.13). Among them, we simply pick the one for which x? (p)
minimizes the distance to v. It is not hard to see that this can be done in
time O(d log d), by first sorting v and then carefully updating the values
Θp and kx? (p) − vk2 as we vary p to check all candidates.
But actually, there is an even simpler criterion that saves us from com-
paring distances.
Lemma 4.12. Under the assumption of Fact 4.9, with x? (p) as in (4.12), and
with

p? := max{ p ∈ {1, . . . , d} : vp − (1/p) (Σ_{i=1}^{p} vi − 1) > 0 },

it holds that

argmin_{x∈∆d } kx − vk2 = x? (p? ).

The proof is Exercise 34. Together with our previous reductions, we
obtain the following result.

Theorem 4.13. Let v ∈ Rd , R ∈ R+ , X = B1 (R) the `1 -ball around 0 of


radius R. The projection

ΠX (v) = argmin_{x∈X} kx − vk2

of v onto B1 (R) can be computed in time O(d log d).

This can be improved to time O(d), based on the observation that a


given p can be compared to the value p? in Lemma 4.12 in linear time,
without the need to presort v [DSSSC08].
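The reductions above translate into a short NumPy routine. The code below is a sketch of the O(d log d) variant (with sorting), using our own function names; it applies Facts 4.6–4.9 and Lemma 4.12 and allows inputs with arbitrary signs and norm.

import numpy as np

def project_simplex(v):
    # projection onto the standard simplex, via Lemma 4.12
    d = len(v)
    u = np.sort(v)[::-1]                  # Fact 4.9: sort in decreasing order
    css = np.cumsum(u)
    ps = np.arange(1, d + 1)
    cond = u - (css - 1) / ps > 0         # candidates satisfying (4.13)
    p = ps[cond][-1]                      # p* from Lemma 4.12
    theta = (css[p - 1] - 1) / p
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, R=1.0):
    # projection onto B_1(R), using Fact 4.6 (scaling and signs) and Corollary 4.8
    if np.sum(np.abs(v)) <= R:
        return v.copy()
    x = project_simplex(np.abs(v) / R)    # reduce to the simplex case
    return R * np.sign(v) * x

v = np.array([0.8, -2.0, 0.3])
print(project_l1_ball(v, R=1.0))          # result has 1-norm exactly 1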

4.6 Exercises
Exercise 31. Consider the projected gradient descent algorithm as in (4.1) and
(4.2), with a convex differentiable function f . Suppose that for some iteration t,
xt+1 = xt . Prove that in this case, xt is a minimizer of f over the closed and
convex set X!

Exercise 32. Prove that in Theorem 4.4 (i),

f (xt+1 ) ≤ f (xt ).

Exercise 33. Let X ⊆ Rd be a nonempty closed and convex set, and let f be
strongly convex over X. Prove that f has a unique minimizer x? over X! In
particular, for X = Rd , we obtain the existence of a unique global minimum.

Exercise 34. Prove Lemma 4.12!


Hint: It is useful to prove that with x? (p) as in (4.12) and satisfying (4.13),

x? (p) = argmin{ kx − vk : Σ_{i=1}^{d} xi = 1, xp+1 = · · · = xd = 0 }.

Chapter 5

Coordinate Descent

Contents
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2 Alternative analysis of gradient descent . . . . . . . . . . . . 128
5.2.1 The Polyak-Łojasiewicz inequality . . . . . . . . . . . 128
5.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3 Coordinate-wise smoothness . . . . . . . . . . . . . . . . . . 130
5.4 Coordinate descent algorithms . . . . . . . . . . . . . . . . . 131
5.4.1 Randomized coordinate descent . . . . . . . . . . . . 132
5.4.2 Importance Sampling . . . . . . . . . . . . . . . . . . 134
5.4.3 Steepest coordinate descent . . . . . . . . . . . . . . . 135
5.4.4 Greedy coordinate descent . . . . . . . . . . . . . . . 138
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

5.1 Overview
In large-scale learning, an issue with the gradient descent algorithms dis-
cussed in Chapter 3 is that in every iteration, we need to compute the full
gradient ∇f (xt ) in order to obtain the next iterate xt+1 . If the number of
variables d is large, this can be very costly. The idea of coordinate descent
is to update only one coordinate of xt at a time, and to do this, we only
need to compute one coordinate of ∇f (xt ) (one partial derivative). We ex-
pect this to be by a factor of d faster than computation of the full gradient
and update of the full iterate.
But we also expect to pay a price for this in terms of a higher number of
iterations. In this chapter, we will analyze a number of coordinate descent
variants on smooth and strongly convex functions. It turns out that in
the worst case, the number of iterations will increase by a factor of d, so
nothing is gained (but also nothing is lost).
But under suitable additional assumptions about the function f , coor-
dinate descent variants can actually lead to provable speedups. In prac-
tice, coordinate descent algorithms are popular due to their simplicity and
often good performance.
Much of this chapter’s material is from Karimi et al. [KNS16] and Nu-
tini et al. [NSL+ 15]. As a warm-up, we return to gradient descent.

5.2 Alternative analysis of gradient descent


We have analyzed gradient descent on smooth and strongly convex func-
tions before (Section 3.8) and in particular proved that the sequence of it-
erates converges to the unique global minimum x? . Here we go a different
route. We will only prove that the sequence of function values converges
to the optimal function value. To do so, we do not need strong convexity
but only the Polyak-Łojasiewicz inequality, a consequence of strong convex-
ity that we derive next. This alternative simple analysis of gradient descent
will also pave the way for our later analysis of coordinate descent.

5.2.1 The Polyak-Łojasiewicz inequality


Definition 5.1. Let f : Rd → R be a differentiable function with a global min-
imum x? . We say that f satisfies the Polyak-Łojasiewicz inequality (PL in-
equality) if the following holds for some µ > 0:

(1/2) k∇f (x)k2 ≥ µ(f (x) − f (x? )), ∀ x ∈ Rd .   (5.1)
The inequality was proposed by Polyak in 1963, and also by Łojasiewicz
in the same year; see Karimi et al. and the references therein [KNS16]. It
says that the squared gradient norm at every point x is at least propor-
tional to the error in objective function value at x. It also directly implies
that every critical point (a point where ∇f (x) = 0) is a minimizer of f .
The interesting result for us is that strong convexity over Rd implies
the PL inequality.
Lemma 5.2 (Strong Convexity ⇒ PL inequality). Let f : Rd → R be dif-
ferentiable and strongly convex with parameter µ > 0 (in particular, a global
minimum x? exists by Lemma 3.12). Then f satisfies the PL inequality for the
same µ.
Proof. Using strong convexity, we get

f (x? ) ≥ f (x) + ∇f (x)> (x? − x) + (µ/2) kx? − xk2
       ≥ f (x) + min_y ( ∇f (x)> (y − x) + (µ/2) ky − xk2 )
       = f (x) − (1/(2µ)) k∇f (x)k2 .
The latter equation results from solving a convex minimization problem
in y by finding a critical point (Lemma 2.22). The PL inequality follows.

The PL inequality is a strictly weaker condition than strong convexity.


For example, consider f (x1 , x2 ) = x21 which is not strongly convex: every
point (0, x2 ) is a global minimum. But f still satisfies the PL inequality,
since it behaves like the strongly convex function x → x2 in (5.1).
There are even nonconvex functions satisfying the PL inequality (Exer-
cise 35).

5.2.2 Analysis
We can now easily analyze gradient descent on smooth functions that in
addition satisfy the PL inequality. By Exercise 35, this result also covers
some nonconvex optimization problems.

Theorem 5.3. Let f : Rd → R be differentiable with a global minimum x? .
Suppose that f is smooth with parameter L according to (4.5) and satisfies the PL
inequality (5.1) with parameter µ > 0. Choosing stepsize

γ = 1/L,

gradient descent (3.1) with arbitrary x0 satisfies

f (xT ) − f (x? ) ≤ (1 − µ/L)T (f (x0 ) − f (x? )),   T > 0.
Proof. For all t, we have

f (xt+1 ) ≤ f (xt ) − (1/(2L)) k∇f (xt )k2        (sufficient decrease, Lemma 3.7)
          ≤ f (xt ) − (µ/L) (f (xt ) − f (x? ))   (PL inequality (5.1)).

If we subtract f (x? ) on both sides, we get

f (xt+1 ) − f (x? ) ≤ (1 − µ/L) (f (xt ) − f (x? )),
and the statement follows.

5.3 Coordinate-wise smoothness


To analyze coordinate descent, we work with coordinate-wise smoothness.

Definition 5.4. Let f : Rd → R be differentiable, and L = (L1 , L2 , . . . , Ld ) ∈


Rd+ . Function f is called coordinate-wise smooth (with parameter L) if for
every coordinate i = 1, 2, . . . , d,

f (x + λei ) ≤ f (x) + λ∇i f (x) + (Li /2) λ2 ,   ∀x ∈ Rd , λ ∈ R.   (5.2)
If Li = L for all i, f is said to be coordinate-wise smooth with parameter L.

Let’s compare this to our standard definition of smoothness in Defi-


nition 3.2. It is easy to see that if f is smooth with parameter L, then f
is coordinate-wise smooth with parameter L. Indeed, (5.2) then coincides

with the regular smoothness inequality (3.8), when applied to vectors y of
the form y = x + λei .
But we may be able to say more. For example, f (x1 , x2 ) = x21 + 10x22
is smooth with parameter L = 20 (due to the 10x22 term, no smaller value
will do), but f is coordinate-wise smooth with parameter L = (2, 20). So
coordinate-wise smoothness allows us to obtain a more fine-grained pic-
ture of f than smoothness.
There are even cases where the best possible smoothness parameter
is L, but we can choose coordinate-wise smoothness parameters Li (sig-
nificantly) smaller than L for all i. Consider f (x1 , x2 ) = x21 + x22 + M x1 x2
for a constant M > 0. For y = (y, y) and x = 0, smoothness requires
that (M + 2)y 2 = f (y) ≤ L2 kyk2 = Ly 2 , so we need smoothness parameter
L ≥ (M + 2).
On the other hand, f is coordinate-wise smooth with L = (2, 2): fixing
one coordinate, we obtain a univariate function of the form x2 + ax + b. This
is smooth with parameter 2 (use Lemma 3.6 (i) along with the fact that
affine functions are smooth with parameter 0).

5.4 Coordinate descent algorithms


Coordinate descent methods generate a sequence {xt }t≥0 of iterates. In
iteration t, they do the following:

choose an active coordinate i ∈ [d]


xt+1 := xt − λi ei . (5.3)

Here, ei denotes the i-th unit basis vector in Rd , and λi is a suitable stepsize
for the selected coordinate i. We will focus on the gradient-based choice
of the stepsize as

xt+1 := xt − γi ∇i f (xt ) ei , (5.4)

Here, ∇i f (x) denotes the i-th entry of the gradient ∇f (x), and in this
regime, we refer to γi > 0 as the stepsize.
In the coordinate-wise smooth case, we obtain a variant of sufficient
decrease for coordinate descent.

Lemma 5.5. Let f : Rd → R be differentiable and coordinate-wise smooth with
parameter L = (L1 , L2 , . . . , Ld ) according to (5.2). With active coordinate i in
iteration t and stepsize

γi = 1/Li ,

coordinate descent (5.4) satisfies

f (xt+1 ) ≤ f (xt ) − (1/(2Li )) |∇i f (xt )|2 .
Proof. We apply the coordinate-wise smoothness condition (5.2) with λ =
−∇i f (xt )/Li , for which we have xt+1 = xt + λei . Hence

f (xt+1 ) ≤ f (xt ) + λ∇i f (xt ) + (Li /2) λ2
          = f (xt ) − (1/Li ) |∇i f (xt )|2 + (1/(2Li )) |∇i f (xt )|2
          = f (xt ) − (1/(2Li )) |∇i f (xt )|2 .

In the next two sections, we consider randomized variants of coordi-


nate descent that pick the coordinate to consider in a given step at ran-
dom (from some distribution). Using elementary techniques, we will be
able to bound the expected number of iterations. It requires more elaborate
techniques to prove tail estimates of the form that with high probability, a
certain number of steps will not be exceeded [Nes12].

5.4.1 Randomized coordinate descent


In randomized coordinate descent, the active coordinate in step t is chosen
uniformly at random from the set [d]:

sample i ∈ [d] uniformly at random


xt+1 := xt − γi ∇i f (xt )ei . (5.5)
Nesterov shows that randomized coordinate descent is at least as fast
as gradient descent on smooth functions, if we assume that it is d times
cheaper to update one coordinate than the full iterate [Nes12].
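As an illustration, here is a small NumPy sketch of coordinate descent (5.5). The helper and its rule argument are our own; the same loop also covers the selection rules discussed later in this chapter (importance sampling and the steepest/Gauss-Southwell rule), which only change how the active coordinate i is chosen.

import numpy as np

def coordinate_descent(grad_i, x0, Ls, T, rule="uniform", rng=np.random.default_rng(0)):
    # coordinate descent: x_{t+1} = x_t - (1/L_i) * grad_i(x, i) * e_i
    x = np.array(x0, dtype=float)
    d = len(x)
    Ls = np.full(d, Ls) if np.isscalar(Ls) else np.asarray(Ls, dtype=float)
    for _ in range(T):
        if rule == "uniform":                 # randomized coordinate descent (5.5)
            i = rng.integers(d)
        elif rule == "importance":            # importance sampling (5.6)
            i = rng.choice(d, p=Ls / Ls.sum())
        else:                                 # steepest / Gauss-Southwell rule (5.7)
            i = int(np.argmax(np.abs([grad_i(x, j) for j in range(d)])))
        x[i] -= grad_i(x, i) / Ls[i]          # coordinate-wise stepsize 1/L_i
    return x

# Example: f(x) = sum_i L_i * x_i^2 / 2, with partial derivative L_i * x_i.
Ls = np.array([1.0, 10.0, 100.0])
grad_i = lambda x, i: Ls[i] * x[i]
x = coordinate_descent(grad_i, x0=[1.0, 1.0, 1.0], Ls=Ls, T=100, rule="importance")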

If we additionally assume the PL inequality, we can obtain fast conver-
gence as follows.
Theorem 5.6. Let f : Rd → R be differentiable with a global minimum x? .
Suppose that f is coordinate-wise smooth with parameter L according to Defini-
tion 5.4 and satisfies the PL inequality (5.1) with parameter µ > 0. Choosing
stepsize

γi = 1/L,

randomized coordinate descent (5.5) with arbitrary x0 satisfies

E[f (xT ) − f (x? )] ≤ (1 − µ/(dL))T (f (x0 ) − f (x? )),   T > 0.
Comparing this to the result for gradient descent in Theorem 5.3, the
number of iterations to reach optimization error at most ε is by a factor
of d higher. To see this, note that (for µ/L small)

1 − µ/L ≈ (1 − µ/(dL))d .
This means, while each iteration of coordinate descent is by a factor of d
cheaper, the number of iterations is by a factor of d higher, so we have a
zero-sum game here. But in the next section, we will refine the analysis
and show that there are cases where coordinate descent will actually be
faster. But first, let’s prove Theorem 5.6.
Proof. By definition, f is coordinate-wise smooth with (L, L, . . . , L), so suf-
ficient decrease according to Lemma 5.5 yields

f (xt+1 ) ≤ f (xt ) − (1/(2L)) |∇i f (xt )|2 .
By taking the expectation of both sides with respect to the choice of i, we have

E [f (xt+1 ) | xt ] ≤ f (xt ) − (1/(2L)) Σ_{i=1}^{d} (1/d) |∇i f (xt )|2
                   = f (xt ) − (1/(2dL)) k∇f (xt )k2
                   ≤ f (xt ) − (µ/(dL)) (f (xt ) − f (x? ))    (PL inequality (5.1)).
In the second line, we conveniently used the fact that the squared Eu-
clidean norm is additive. Subtracting f (x? ) from both sides, we therefore
obtain

E[f (xt+1 ) − f (x? ) | xt ] ≤ (1 − µ/(dL)) (f (xt ) − f (x? )).

Taking expectations (over xt ), we obtain

E[f (xt+1 ) − f (x? )] ≤ (1 − µ/(dL)) E[f (xt ) − f (x? )].
The statement follows.
In the proof, we have used conditional expectations: E [f (xt+1 )|xt ] is a
random variable whose expectation is E [f (xt+1 )].

5.4.2 Importance Sampling


Uniformly random selection of the active coordinate is not the best choice
when the coordinate-wise smoothness parameters Li differ. In this case,
it makes sense to sample proportional to the Li ’s as suggested by Nes-
terov [Nes12]. This is coordinate descent with importance sampling:

sample i ∈ [d] with probability Li / Σ_{j=1}^{d} Lj ,
xt+1 := xt − (1/Li ) ∇i f (xt ) ei .   (5.6)
Here is the result.
Theorem 5.7. Let f : Rd → R be differentiable with a global minimum x? .
Suppose that f is coordinate-wise smooth with parameter L = (L1 , L2 , . . . , Ld )
according to (5.2) and satisfies the PL inequality (5.1) with parameter µ > 0. Let

L̄ = (1/d) Σ_{i=1}^{d} Li

be the average of all coordinate-wise smoothness constants. Then coordinate de-
scent with importance sampling (5.6) and arbitrary x0 satisfies

E[f (xT ) − f (x? )] ≤ (1 − µ/(dL̄))T (f (x0 ) − f (x? )),   T > 0.
The proof of the theorem is Exercise 36. We note that L̄ can be much
smaller than the value L = maxdi=1 Li that appears in Theorem 5.6, so coor-
dinate descent with importance sampling is potentially much faster than
randomized coordinate descent. In the worst-case (all Li are equal), both
algorithms are the same.

5.4.3 Steepest coordinate descent


In contrast to random coordinate descent, steepest coordinate descent (or
greedy coordinate descent) chooses the active coordinate according to

choose i = argmax_{i∈[d]} |∇i f (xt )|
xt+1 := xt − γi ∇i f (xt ) ei .   (5.7)

This is a deterministic algorithm and also called the Gauss-Southwell


rule.
It is easy to show that the same convergence rate that we have obtained
for random coordinate descent in Theorem 5.6 also holds for steepest co-
ordinate descent. To see this, the only ingredient we need is the fact that

max_{i} |∇i f (x)|2 ≥ (1/d) Σ_{i=1}^{d} |∇i f (x)|2 ,

and since we now have a deterministic algorithm, there is no need to take


expectations in the proof.

Corollary 5.8. Let f : Rd → R be differentiable with a global minimum x? .


Suppose that f is coordinate-wise smooth with parameter L according to Defini-
tion 5.4 and satisfies the PL inequality (5.1) with parameter µ > 0. Choosing
stepsize

γi = 1/L,

steepest coordinate descent (5.7) with arbitrary x0 satisfies

f (xT ) − f (x? ) ≤ (1 − µ/(dL))T (f (x0 ) − f (x? )),   T > 0.

This result is a bit disappointing: individual iterations seem to be as
costly as in gradient descent, but the number of iterations is by factor of d
larger. This comparison with Theorem 5.3 is not fully fair, though, since
in contrast to gradient descent, steepest coordinate descent requires only
coordinate-wise smoothness, and as we have seen in Section 5.3, this can
be better than global smoothness. But steepest coordinate descent also
cannot compete with randomized coordinate descent (same number of it-
erations, but higher cost per iteration). However, we show next that the
algorithm allows for a speedup in certain cases; also, it may be possible to
efficiently maintain the maximum absolute gradient value throughout the
iterations, so that evaluation of the full gradient can be avoided.

Strong convexity with respect to `1 -norm. It was shown by Nutini et


al. [NSL+ 15] that a better convergence result can be obtained for strongly
convex functions, when strong convexity is measured with respect to `1 -
norm instead of the standard Euclidean norm, i.e.

f (y) ≥ f (x) + ∇f (x)> (y − x) + (µ1 /2) ky − xk21 ,   x, y ∈ Rd .   (5.8)
Due to ky − xk1 ≥ ky − xk, f is then also strongly convex with µ = µ1 in
the usual sense. On the other hand, if f is µ-strongly convex in the usual
sense, then f satisfies (5.8) with µ1 = µ/d, due to ky − xk ≥ ky − xk1 /√d.
Hence, µ1 may be up to factor of d smaller than µ, and if this happens,
(5.8) will not lead to a speedup of the algorithm. But isn’t µ1 necessarily
smaller than µ by a factor of d? After all, there are always x, y such that
ky − xk = ky − xk1 /√d. But if for those worst-case x, y, the inequality of
strong convexity holds with µ0 > µ, we can achieve µ1 > µ/d. As an exam-
ple for this scenario, Nutini et al. [NSL+ 15, Appendix C of arXiv version]
compute the best parameters µ, µ1 of strong convexity for a convex func-
tion of the form f (x) = 12 di=1 Li xi and show that µ1 can be significiantly
P
larger than µ/d.
The proof of convergence under (5.8) is similar to the one of Theo-
rem 5.6, after proving the following lemma: if f is strongly convex with re-
spect to `1 -norm, it satisfies the PL inequality with respect to `∞ -norm. The
proof is Exercise 38 and follows the same strategy as the earlier Lemma 5.2
for the Euclidean norm. While this requires only elementary calculations,
it does not reveal the deeper reason why `1 -norm in strong convexity leads

to `∞ -norm in the PL inequality. This has to do with convex conjugates,
but we will not go into it here.

Lemma 5.9 (Exercise 38). Let f : Rd → R be differentiable and strongly convex


with parameter µ1 > 0 w.r.t. `1 -norm as in (5.8). (In particular, f is µ1 -strongly
convex w.r.t. Euclidean norm, so a global minimum x? exists by Lemma 3.12.)
Then f satisfies the PL inequality w.r.t. `∞ -norm with the same µ1 :

(1/2) k∇f (x)k2∞ ≥ µ1 (f (x) − f (x? )), ∀x ∈ Rd .   (5.9)

Theorem 5.10. Let f : Rd → R be differentiable with a global minimum x? .


Suppose that f is coordinate-wise smooth with parameter L according to Defini-
tion 5.4 and satisfies the PL inequality (5.9) with parameter µ1 > 0. Choosing
stepsize

γi = 1/L,

steepest coordinate descent (5.7) with arbitrary x0 satisfies

f (xT ) − f (x? ) ≤ (1 − µ1 /L)T (f (x0 ) − f (x? )),   T > 0.
Proof. By definition, f is coordinate-wise smooth with (L, L, . . . , L), so suf-
ficient decrease according to Lemma 5.5 yields

f (xt+1 ) ≤ f (xt ) − (1/(2L)) |∇i f (xt )|2 = f (xt ) − (1/(2L)) k∇f (xt )k2∞ ,

by definition of steepest coordinate descent. Using the PL inequality (5.9),
we further get

f (xt+1 ) ≤ f (xt ) − (µ1 /L) (f (xt ) − f (x? )).

Now we proceed as in the alternative analysis of gradient descent: Sub-
tracting f (x? ) from both sides, we obtain

f (xt+1 ) − f (x? ) ≤ (1 − µ1 /L) (f (xt ) − f (x? )),
and the statement follows.

5.4.4 Greedy coordinate descent
This is a variant that does not even require f to be differentiable. In each
iteration, we make the step that maximizes the progress in the chosen co-
ordinate. This requires to perform a line search by solving a 1-dimensional
optimization problem:

choose i ∈ [d]
xt+1 := xt + λ? ei ,   where λ? := argmin_{λ∈R} f (xt + λei )   (5.10)

There are cases where the line search can exactly be done analytically,
or approximately by some other means. In the differentiable case, we can
take any of the previously studied coordinate descent variants and replace
some of its steps by greedy steps if it turns out that we can perform line
search along the selected coodinate. This will not compromise the conver-
gence analysis, as stepwise progress can only be better.

Figure 5.1: The non-differentiable function f (x) := kxk2 + |x1 − x2 |. The


global minimum is (0, 0), but greedy coordinate descent cannot escape any
point (x, x), |x| ≤ 1/2. Figure by Alp Yurtsever & Volkan Cevher, EPFL

Some care is in order when applying the greedy variant in the nondif-
ferentiable case for which the previous variants don’t work. The algorithm
can get stuck in non-optimal points, as for example in the objective func-
tion of Figure 5.1. But not all hope is lost. There are relevant cases where
this scenario does not happen, as we show next.

Theorem 5.11. Let f : Rd → R be of the form

f (x) := g(x) + h(x)   with   h(x) = Σ_i hi (xi ),   x ∈ Rd ,   (5.11)

with g convex and differentiable, and the hi convex.


Let x ∈ Rd be a point such that greedy coordinate descent cannot make
progress in any coordinate. Then x is a global minimum of f .
A function h as in the theorem is called separable. Figure 5.2 illustrates
the theorem.

Figure 5.2: The function f (x) := kxk2 + kxk1 . Greedy coordinate descent
cannot get stuck. Figure by Alp Yurtsever & Volkan Cevher, EPFL

Proof. We follow Ryan Tibshirani’s lecture.1 Let y ∈ Rd . Using the first-
order characterization of convexity for g and the definition of h, we obtain

f (y) = g(y) + h(y)
      ≥ g(x) + ∇g(x)> (y − x) + h(x) + Σ_{i=1}^{d} (hi (yi ) − hi (xi ))
      = f (x) + Σ_{i=1}^{d} (∇i g(x)(yi − xi ) + hi (yi ) − hi (xi )) ≥ f (x),

using that ∇i g(x)(yi − xi ) + hi (yi ) − hi (xi ) ≥ 0 for all i (Exercise 39).


1 https://www.stat.cmu.edu/~ryantibs/convexopt-S15/lectures/22-coord-desc.pdf

One very important class of applications here is objective functions of
the form
f (x) + λkxk1 ,
where f is convex and smooth, and h(x) = λkxk1 is a (separable) `1 -
regularization term. The LASSO (Section 2.6) in its regularized form gives
rise to a concrete such case:

min_{x∈Rd } kAx − bk2 + λkxk1 .   (5.12)

Whether greedy coordinate descent actually converges on functions as


in Theorem 5.11 is a different question; this was answered in the affirma-
tive by Tseng under mild regularity conditions on g, using the cyclic order
of coordinates throughout the iterations [Tse01].
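As a sketch (our own, not from the notes), greedy coordinate descent with exact line search can be implemented with a generic one-dimensional solver. For the `1 -regularized problems above, the one-dimensional subproblem even has a closed form (Exercise 37), but the generic version below already illustrates the scheme; the toy data A, b, lam are hypothetical.

import numpy as np
from scipy.optimize import minimize_scalar

def greedy_cd(f, x0, T):
    # coordinate descent with exact line search, step (5.10)
    x = np.array(x0, dtype=float)
    d = len(x)
    for t in range(T):
        i = t % d                               # cyclic order of coordinates, as in [Tse01]
        e_i = np.zeros(d); e_i[i] = 1.0
        phi = lambda lam: f(x + lam * e_i)      # one-dimensional restriction along e_i
        x[i] += minimize_scalar(phi).x          # numerical line search
    return x

# Example: the regularized Lasso objective (5.12) on toy data.
A = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])
b = np.array([1.0, 0.0, 0.5])
lam = 0.1
f = lambda x: np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
x = greedy_cd(f, x0=[0.0, 0.0], T=40)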

5.5 Summary
Coordinate descent methods are used widely in machine learning appli-
cations. Variants of coordinate methods form the state of the art for the
class of generalized linear models, including linear classifiers and regression
models, as long as separable convex regularizers are used (e.g. `1 -norm or
squared `2 -norm).
The following table summarizes the convergence bounds of coordi-
nate descent algorithms on coordinate-wise smooth and strongly convex
functions (we only use the PL inequality, a consequence of strong convex-
ity). The Bound column contains the factor by which the error is guaran-
teed to decrease in every step.

Algorithm                  PL norm   Smoothness               Bound         Result

Randomized                 `2        L                        1 − µ/(dL)    Thm. 5.6
Importance sampling        `2        (L1 , L2 , . . . , Ld )  1 − µ/(dL̄)    Thm. 5.7
Steepest                   `2        L                        1 − µ/(dL)    Cor. 5.8
Steeper (than Steepest)    `1        L                        1 − µ1 /L     Thm. 5.10

In the worst case, all algorithms have a Bound of 1 − µ/(dL) and there-
fore need d times more iterations than gradient descent. This can fully be
compensated if iterations are d times cheaper.

In the best case, Steeper (than Steepest) matches the performance of
gradient descent in terms of iteration count. The algorithm is therefore an
attractive choice for problems where we can obtain (or maintain) the steep-
est coordinate of the gradient efficiently. This includes several practical
cases, for example when the gradients are sparse, e.g. because the original
data is sparse.
Importance sampling is attractive when most coordinate-wise smooth-
ness parameters Li are much smaller than the maximum. In the best case,
it can be d times faster than gradient descent. On the downside, applying
the method requires to know all the Li . In the other methods, an upper
bound on all Li is sufficient in order to run the algorithm.

5.6 Exercises
Exercise 35. Provide an example of a nonconvex function that satisfies the PL
inequality 5.1!

Exercise 36 (Importance Sampling). Prove Theorem 5.7! Can you come up


with an example from machine learning where L̄ ≪ L = max_{i=1,...,d} Li ?

Exercise 37. Derive the solution to exact coordinate minimization for the Lasso
problem (5.12), for the i-th coordinate. Write A−i for the n × (d − 1) matrix
obtained by removing the i-th column from A, and same for the vector x−i with
one entry removed accordingly.

Exercise 38. Prove Lemma 5.9, proceeding as in the proof of Lemma 5.2!

Exercise 39. Let f be as in Theorem 5.11 and x ∈ Rd such that f (x + λei ) ≥


f (x) for all λ and all i. Prove that ∇i g(x)(yi − xi ) + hi (yi ) − hi (xi ) ≥ 0 for all
y ∈ Rd and all i ∈ [d].

Chapter 6

Nonconvex functions

Contents
6.1 Smooth functions . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.2 Trajectory analysis . . . . . . . . . . . . . . . . . . . . . . . . 149
6.2.1 Deep linear neural networks . . . . . . . . . . . . . . 150
6.2.2 A simple nonconvex function . . . . . . . . . . . . . . 152
6.2.3 Smoothness along the trajectory . . . . . . . . . . . . 155
6.2.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . 158
6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

So far, all convergence results that we have given for variants of gra-
dient descent have been for convex functions. And there is a good reason
for this: on nonconvex functions, gradient descent can in general not be
expected to come close (in distance or function value) to the global mini-
mum x? , even if there is one.
As an example, consider the nonconvex function from Figure 2.4 (left).
Figure 6.1 shows what happens if we start gradient descent somewhere “to
the right”, with a not too large stepsize so that we do not overshoot. For
any sufficiently large T , the iterate xT will be close to the local minimum
y? , but not to the global minimum x? .


Figure 6.1: Gradient descent may get stuck in a local minimum y? 6= x?

Even if the global minimum is the unique local minimum, gradient


descent is not guaranteed to get there, as it may also get stuck in a saddle
point, or even fail to reach anything at all; see Figure 6.2.
In practice, variants of gradient descent are often observed to perform
well even on nonconvex functions, but theoretical explanations for this are
mostly missing.
In this chapter, we show that under favorable conditions, we can still
say something useful about the behavior of gradient descent, even on non-
convex functions.


Figure 6.2: Gradient descent may get stuck in a flat region (saddle point)
y? (left), or reach neither a local minimum nor a saddle point (right).

6.1 Smooth functions


A particularly low hanging fruit is the analysis of gradient descent on
smooth (but not necessarily convex) functions. We recall from Defini-
tion 3.2 that a differentiable function f : dom(f ) → R is smooth with
parameter L ∈ R+ over a convex set X ⊆ dom(f ) if
L
f (y) ≤ f (x) + ∇f (x)> (y − x) + kx − yk2 , ∀x, y ∈ X.
2
This means that at every point x ∈ X, the graph of f is below a not-too-
steep tangential paraboloid, and this may happen even if the function is
not convex; see Figure 6.3.
There is a class of arbitrarily smooth nonconvex functions, namely the
differentiable concave functions. A function f is called concave if −f is
convex. Hence, for all x, the graph of a differentiable concave function
is below the tangent hyperplane at x, hence f is smooth with parameter
L = 0; see Figure 6.4.
However, from our optimization point of view, concave functions are
boring, since they have no global minimum (at least in the unconstrained
setting that we are treating here). Gradient descent will then simply “run
off to infinity”.
We will therefore consider smooth functions that have a global min-
imum x? . Are there even such functions that are not convex? Actually,

144
f (x) + ∇f (x)> (y − x) + L2 kx − yk2
f (y)

x y

Figure 6.3: A smooth and nonconvex function

many. As we show next, any twice differentiable function with bounded


Hessians over some convex set X is smooth over X. A concrete example of
a smooth function that is not convex but has a global minimum (actually,
many), is f (x) = sin(x).
Lemma 6.1. Let f : dom(f ) → R be twice differentiable, with X ⊆ dom(f ) a
convex set, and k∇2 f (x)k ≤ L for all x ∈ X, where k·k is again spectral norm.
Then f is smooth with parameter L over X.
Proof. By Theorem 2.10 (applied to the gradient function ∇f ), bounded
Hessians imply Lipschitz continuity of the gradient,
k∇f (x) − ∇f (y)k ≤ L kx − yk , x, y ∈ X. (6.1)
We show that this in turn implies smoothness. This is in fact the easy
direction of Lemma 3.5 (in the twice differentiable case).
For any fixed x, y ∈ X, we use the (by now) very familar function
h : dom(h) → Rd over a suitable open domain I ⊃ [0, 1], given by

h(t) = f x + t(y − x) , t ∈ I,

145
x y

f (x) + ∇f (x)> (y − x)

f (y)

Figure 6.4: A concave function and the first-order characterization of con-


cavity: f (y) ≤ f (x) + ∇f (x)> (y − x), ∀x, y ∈ Rd

for which we have shown in (2.1) that


>
h0 (t) = ∇f x + t(y − x) (y − x), t ∈ I.

As f is twice differentiable, ∇f and hence also h0 are actually continuous,


so we can apply the fundamental theorem of calculus (in the second line
of the lengthy but easy derivation below). We compute

f (y) − f (x) − ∇f (x)>(y − x)
  = h(1) − h(0) − ∇f (x)>(y − x)
  = ∫₀¹ h′(t) dt − ∇f (x)>(y − x)
  = ∫₀¹ ∇f (x + t(y − x))>(y − x) dt − ∇f (x)>(y − x)
  = ∫₀¹ (∇f (x + t(y − x))>(y − x) − ∇f (x)>(y − x)) dt
  = ∫₀¹ (∇f (x + t(y − x)) − ∇f (x))>(y − x) dt.

So far, we had only equalities, now we start estimating:

f (y) − f (x) − ∇f (x)>(y − x)
  = ∫₀¹ (∇f (x + t(y − x)) − ∇f (x))>(y − x) dt
  ≤ ∫₀¹ |(∇f (x + t(y − x)) − ∇f (x))>(y − x)| dt
  ≤ ∫₀¹ k∇f (x + t(y − x)) − ∇f (x)k ky − xk dt   (Cauchy-Schwarz)
  ≤ ∫₀¹ L kt(y − x)k ky − xk dt   (Lipschitz continuous gradients)
  = ∫₀¹ Lt kx − yk2 dt
  = (L/2) kx − yk2 .
This is smoothness over X according to Definition 3.2.
For twice differentiable convex functions, the converse is also (almost)
true. If f is smooth over an open convex subset X ⊆ dom(f ), it has
bounded Hessians over X (Exercise 40 ). Convexity is needed here since
e.g. concave functions are smooth with parameter L = 0 but generally
have unbounded Hessians. It is also not hard to understand why open-
ness is necessary in general. Indeed, for a point x on the boundary of X,
the smoothness condition does not give us any information about nearby
points not in X. As a consequence, even at points with large Hessians, f
might look smooth inside X. As a simple example, consider f (x1 , x2 ) =
x1² + M x2² with M ∈ R+ large. The function f is smooth with L = 2 over
X = {(x1 , x2 ) : x2 = 0}: indeed, over this set, f looks just like the supermodel.
But for all x, we have k∇2 f (x)k = 2M .
Now we get back to gradient descent on smooth functions with a global
minimum. The punchline is so unspectacular that there is no harm in
spoiling it already now: What we can prove is that k∇f (xt )k2 converges to
0 at the same rate as f (xt ) − f (x? ) converges to 0 in the convex case. Nat-
urally, f (xt ) − f (x? ) itself is not guaranteed to converge in the nonconvex
case, for example if xt converges to a local minimum that is not global, as
in Figure 6.1.

It is tempting to interpret convergence of k∇f (xt )k2 to 0 as convergence
to a critical point of f (a point where the gradient vanishes). But this inter-
pretation is not fully accurate in general, as Figure 6.2 (right) shows: The
algorithm may enter a region where f asymptotically approaches some
value, without reaching it (think of the rightmost piece of the function in
the figure as f (x) = e−x ). In this case, the gradient converges to 0, but the
iterates are nowhere near a critical point.

Theorem 6.2. Let f : Rd → R be differentiable with a global minimum x? ; furthermore, suppose that f is smooth with parameter L according to Definition 3.2. Choosing stepsize
γ := 1/L,
gradient descent (3.1) yields
(1/T) Σ_{t=0}^{T−1} k∇f (xt )k2 ≤ (2L/T) (f (x0 ) − f (x? )),   T > 0.
In particular, k∇f (xt )k2 ≤ (2L/T)(f (x0 ) − f (x? )) for some t ∈ {0, . . . , T − 1}. And also, lim_{t→∞} k∇f (xt )k2 = 0 (Exercise 41).

Proof. We recall that sufficient decrease (Lemma 3.7) does not require convexity, and this gives

f (xt+1 ) ≤ f (xt ) − (1/(2L)) k∇f (xt )k2 ,   t ≥ 0.

Rewriting this into a bound on the gradient yields

k∇f (xt )k2 ≤ 2L (f (xt ) − f (xt+1 )).

Hence, we get a telescoping sum

Σ_{t=0}^{T−1} k∇f (xt )k2 ≤ 2L (f (x0 ) − f (xT )) ≤ 2L (f (x0 ) − f (x? )).

The statement follows.
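To see the guarantee of Theorem 6.2 in action, here is a minimal numerical sketch (in Python with NumPy; the test function sin, the starting point and the horizon T are our own illustrative choices, not part of the text). It runs gradient descent with stepsize 1/L on the smooth but nonconvex function f (x) = sin(x) (smooth with L = 1, global minimum value −1) and checks the bound on the average squared gradient norm.

import numpy as np

f = np.sin          # smooth with parameter L = 1, nonconvex, minimum value -1
grad = np.cos       # derivative of sin
L, x, T = 1.0, 2.0, 100
sq_grad_norms = []
for t in range(T):
    g = grad(x)
    sq_grad_norms.append(g ** 2)
    x = x - (1.0 / L) * g                  # gradient descent step (3.1)

average = np.mean(sq_grad_norms)
bound = 2 * L * (f(2.0) - (-1.0)) / T      # 2L (f(x0) - f(x*)) / T from Theorem 6.2
print(average <= bound)                    # expected: True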

In the smooth setting, gradient descent has another interesting prop-
erty: with stepsize 1/L, it cannot overshoot. By this, we mean that it
cannot pass a critical point (in particular, not the global minimum) when
moving from xt to xt+1 . Equivalently, with a smaller stepsize, no critical
point can be reached. With stepsize 1/L, it is possible to reach a critical
point, as we have demonstrated for the supermodel function f (x) = x2 in
Section 3.7.

Lemma 6.3 (Exercise 42). Let f : Rd → R be differentiable; let x ∈ Rd such
that ∇f (x) ≠ 0, i.e. x is not a critical point. Suppose that f is smooth with
parameter L over the line segment connecting x and x′ = x − γ∇f (x), where
γ = 1/L′ < 1/L. Then x′ is also not a critical point.

Figure 6.5 illustrates the situation.

Figure 6.5: Gradient descent on smooth functions: When moving from x
to x′ = x − γ∇f (x) with γ < 1/L, x′ will not be a critical point (left);
equivalently, with γ = 1/L, we cannot overshoot, i.e. pass a critical point
(middle); with γ = 1/L, we may exactly reach a critical point (right).

6.2 Trajectory analysis


Even if the “landscape” (graph) of a nonconvex function has local minima,
saddle points, and flat parts, it is sometimes possible to prove that gradient
descent avoids these bad spots and still converges to a global minimum.
For this, one needs a good starting point and some theoretical understand-
ing of what happens when we start there—this is trajectory analysis.
In 2018, results along these lines have appeared that prove convergence
of gradient descent to a global minimum in training deep linear neural networks,
under suitable conditions. In this section, we will study a vastly

simplified setting that allows us to show the main ideas (and limitations)
behind one particular trajectory analysis [ACGH18].
In our simplified setting, we will look at the task of minimizing a con-
crete and very simple nonconvex function. This function turns out to be
smooth along the trajectories that we analyze, and this is one important
ingredient. However, smoothness alone does not suffice to prove con-
vergence to the global minimum, let alone fast convergence: As we have
seen in the last section, we can in general only guarantee that the gradient
norms converge to 0, and at a rather slow rate. To get beyond this, we will
need to exploit additional properties of the function under consideration.

6.2.1 Deep linear neural networks


Let us go back to the problem of learning linear models as discussed in
Section 2.6.2, using the example of Master’s admission. We had n inputs
x1 , . . . , xn , where each input xi ∈ Rd consisted of d input variables; and
we had n outputs y1 , . . . , yn ∈ R. Then we made the hypothesis that (after
centering), output values depend (approximately) linearly on the input,

y i ≈ w > xi ,

for a weight vector w = (w1 , . . . , wd ) ∈ Rd to be learned.


Now we consider the more general case where there is not just one
output yi ∈ R as response to the i-th input, but m outputs yi ∈ Rm . In this
case, the linear hypothesis becomes

yi ≈ W xi ,

for a weight matrix W ∈ Rm×d to be learned. The matrix that best fits this
hypothesis on the given observations is the least-squares matrix

W ? = argmin_{W ∈ Rm×d} Σ_{i=1}^n kW xi − yi k2 .

If we let X ∈ Rd×n be the matrix whose columns are the xi and Y ∈ Rm×n
the matrix whose columns are the yi , we can equivalently write this as

W ? = argmin_{W ∈ Rm×d} kW X − Y k2F ,   (6.2)

where kAkF = (Σ_{i,j} a2ij )^{1/2} is the Frobenius norm of a matrix A.
Finding W ∗ (the global minimum of a convex quadratic function) is a
simple task that boils down to solving a system of linear equations; see
also Section 2.4.2. A fancy way of saying this is that we are training a
linear neural network with one layer, see Figure 6.6 (left).

Figure 6.6: Left: A linear neural network over d input variables x =


(x1 , . . . , xd ) and m output variables y = (y1 , . . . , ym ). The edge connecting
input variable xj with output variable yi has a weight wij (to be learned),
and all weights together form a weight matrix W ∈ Rm×d . Given the
weights, the network computes the linear transformation y = W x be-
tween inputs and outputs. Right: a deep linear neural network of depth
3 with weight matrices W1 , W2 , W3 . Given the weights, the network com-
putes the linear transformation y = W3 W2 W1 x.

But what if we have ℓ layers (Figure 6.6, right)? Training such a network corresponds to minimizing
kWℓ Wℓ−1 · · · W1 X − Y k2F ,
over ℓ weight matrices W1 , . . . , Wℓ to be learned. In case of linear neural
networks, there is no benefit in adding layers, as any linear transformation
x 7→ Wℓ Wℓ−1 · · · W1 x can of course be represented as x 7→ W x with
W := Wℓ Wℓ−1 · · · W1 . But from a theoretical point of view, a deep linear neu-
ral network gives us a simple playground in which we can try to under-
stand why training deep neural networks with gradient descent works,

despite the fact that the objective function is no longer convex. The hope
is that such an understanding can ultimately lead to an analysis of gradient
descent (or other suitable methods) for “real” (meaning non-linear) deep
neural networks.
In the next section, we will discuss the case where all matrices are 1 × 1,
so they are just numbers. This is arguably a toy example in our already
simple playground. Still, it gives rise to a nontrivial nonconvex function,
and the analysis of gradient descent on it will require similar ingredients
as the one on general deep linear neural networks [ACGH18].

6.2.2 A simple nonconvex function


The function (that we consider fixed throughout the section) is f : Rd → R
defined by

f (x) := (1/2) (∏_{k=1}^d xk − 1)² .   (6.3)

As d is fixed, we will abbreviate ∏_{k=1}^d xk by ∏_k xk throughout. Minimizing
this function corresponds to training a deep linear neural network with d
layers, one neuron per layer, with just one training input x = 1 and a
corresponding output y = 1. Figure 6.7 visualizes the function f for d = 2.
First of all, the function f does have global minima, as it is nonnegative,
and value 0 can be achieved (in many ways). Hence, we immediately
know how to minimize this (for example, set xk = 1 for all k). The question
is whether gradient descent also knows, and if so, how we prove this.
Let us start by computing the gradient. We have

∇f (x) = (∏_k xk − 1) · (∏_{k≠1} xk , . . . , ∏_{k≠d} xk )> .   (6.4)

What are the critical points, the ones where ∇f (x) vanishes? This happens when ∏_k xk = 1, in which case we have a global minimum (level 0
in Figure 6.7). But there are other critical points. Whenever at least two
of the xk are zero, the gradient also vanishes, and the value of f is 1/2 at
such a point (point 0 in Figure 6.7). This already shows that the function
cannot be convex, as for convex functions, every critical point is a global
minimum (Lemma 2.22). It is easy to see that every non-optimal critical
point must have two or more zeros.

Figure 6.7: Level sets of f (x1 , x2 ) = (1/2)(x1 x2 − 1)²

In fact, all critical points except the global minima are saddle points.
This is because at any such point x, we can slightly perturb the (two or
more) zero entries in such a way that the product of all entries becomes
either positive or negative, so that the function value either decreases or
increases.
Figure 6.8 visualizes (scaled) negative gradients of f for d = 2; these are
the directions in which gradient descent would move from the tails of the
respective arrows. The figure already indicates that it is difficult to avoid
convergence to a global minimum, but it is possible (see Exercise 44).
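Before the formal analysis, here is a small numerical sketch of gradient descent on f from (6.3), using the gradient formula (6.4) (Python with NumPy; the dimension, the starting point and the constant stepsize are our own ad hoc illustrative choices):

import numpy as np

def f(x):
    return 0.5 * (np.prod(x) - 1.0) ** 2       # the function (6.3)

def gradient(x):
    p = np.prod(x)
    return (p - 1.0) * p / x                   # formula (6.4); valid only if no entry of x is zero

d = 5
x = np.full(d, 0.8)            # x0 > 0 with prod(x0) < 1 (and all entries equal)
gamma = 0.05                   # small constant stepsize, chosen ad hoc for this sketch
for t in range(5000):
    x = x - gamma * gradient(x)

print(f(x), np.prod(x))        # f(x) close to 0, product of the entries close to 1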
We now want to show that for any dimension d, and from anywhere in
X = {x : x > 0, ∏_k xk ≤ 1}, gradient descent will converge to a global
minimum. Unfortunately, our function f is not smooth over X. For the
analysis, we will therefore show that f is smooth along the trajectory of

Figure 6.8: Scaled negative gradients of f (x1 , x2 ) = (1/2)(x1 x2 − 1)²

gradient descent for suitable L, so that we get sufficient decrease

f (xt+1 ) ≤ f (xt ) − (1/(2L)) k∇f (xt )k2 ,   t ≥ 0
by Lemma 3.7.
This already shows that gradient descent cannot converge to a saddle
point: all these have (at least two) zero entries and therefore function value
1/2. But for starting point x0 ∈ X, we have f (x0 ) < 1/2, so we can never
reach a saddle while decreasing f .
But doesn’t this mean that we necessarily have to converge to a global
minimum? No, because the sublevel sets of f are unbounded, so it could in
principle happen that gradient descent runs off to infinity while constantly
improving f (xt ) (an example is gradient descent on f (x) = e−x ). Or some

other bad behavior occurs (we haven’t characterized what can go wrong).
So there is still something to prove.
How about convergence from other starting points? For x > 0, ∏_k xk ≥
1, we also get convergence (Exercise 43). But there are also starting points
from which gradient descent will not converge to a global minimum (Ex-
ercise 44).
The following simple lemma is the key to showing that gradient de-
scent behaves nicely in our case.
Definition 6.4. Let x > 0 (componentwise), and let c ≥ 1 be a real number. x
is called c-balanced if xi ≤ cxj for all 1 ≤ i, j ≤ d.
In fact, any initial iterate x0 > 0 is c-balanced for some (possibly large) c.
Lemma 6.5. Let x > 0 be c-balanced with ∏_k xk ≤ 1. Then for any stepsize
γ > 0, x′ := x − γ∇f (x) satisfies x′ ≥ x (componentwise) and is also c-balanced.
If c = 1 (all entries of x are equal), this is easy to see since then also
all entries of ∇f (x) in (6.4) are equal. Later we will show that for suitable
stepsize, we also maintain that ∏_k x′k ≤ 1, so that gradient descent only
goes through balanced iterates.
Proof. Set ∆ := −γ(∏_k xk − 1)(∏_k xk ) ≥ 0. Then the gradient descent
update assumes the form

x′k = xk + ∆/xk ≥ xk ,   k = 1, . . . , d.

For i, j, we have xi ≤ cxj and xj ≤ cxi (⇔ 1/xi ≤ c/xj ). We therefore get

x′i = xi + ∆/xi ≤ cxj + ∆c/xj = cx′j .

6.2.3 Smoothness along the trajectory


It will turn out that our function f —despite not being globally smooth—
is smooth over the trajectory of gradient descent, assuming that we start
with x0 > 0, ∏_k (x0 )k < 1. We will derive this from bounded Hessians.
Let us therefore start by computing the Hessian matrix ∇2 f (x), where by

definition, ∇2 f (x)ij is the j-th partial derivative of the i-th entry of ∇f (x).
This i-th entry is

(∇f )i = (∏_k xk − 1) ∏_{k≠i} xk ,

and its j-th partial derivative is therefore

∇2 f (x)ij = (∏_{k≠i} xk )² ,   if j = i,
∇2 f (x)ij = 2 (∏_{k≠i} xk )(∏_{k≠j} xk ) − ∏_{k≠i,j} xk ,   if j ≠ i.
This looks promising: if ∏_k xk ≤ 1, then we would also expect that the
products ∏_{k≠i} xk and ∏_{k≠i,j} xk are small, in which case all entries of the
Hessian are small, giving us a bound on k∇2 f (x)k that we need to establish
smoothness of f . However, for general x, this fails. If x contains entries
close to 0, it may happen that some terms ∏_{k≠i} xk and ∏_{k≠i,j} xk are actually
very large.
What comes to our rescue is again c-balancedness.
Lemma 6.6. Suppose that x > 0 is c-balanced (Definition 6.4). Then for any
I ⊆ {1, . . . , d}, we have

(1/c)^{|I|} (∏_k xk )^{1−|I|/d} ≤ ∏_{k∉I} xk ≤ c^{|I|} (∏_k xk )^{1−|I|/d} .

Proof. For any i, we have xi^d ≥ (1/c)^d ∏_k xk by balancedness, hence xi ≥
(1/c)(∏_k xk )^{1/d} . It follows that

∏_{k∉I} xk = (∏_k xk ) / (∏_{i∈I} xi ) ≤ (∏_k xk ) / ((1/c)^{|I|} (∏_k xk )^{|I|/d}) = c^{|I|} (∏_k xk )^{1−|I|/d} .

The lower bound follows in the same way from xi^d ≤ c^d ∏_k xk .

This lets us bound the Hessians of c-balanced points.


Lemma 6.7. Let x > 0 be c-balanced with ∏_k xk ≤ 1. Then

k∇2 f (x)k ≤ k∇2 f (x)kF ≤ 3dc2 ,

where k·kF is the Frobenius norm and k·k the spectral norm.

Proof. The fact that kAk ≤ kAkF is Exercise 45. To bound the Frobenius
norm, we use the previous lemma to compute

∇2 f (x)ii = (∏_{k≠i} xk )² ≤ c2

and for i ≠ j,

|∇2 f (x)ij | ≤ 2 (∏_{k≠i} xk )(∏_{k≠j} xk ) + ∏_{k≠i,j} xk ≤ 3c2 .

Hence, k∇2 f (x)k2F ≤ 9d2 c4 . Taking square roots, the statement follows.
This now implies smoothness of f along the whole trajectory of gradi-
ent descent, under the usual “smooth stepsize” γ = 1/L = 1/3dc2 .

Lemma 6.8. Let x > 0 be c-balanced with ∏_k xk < 1, L = 3dc2 , γ := 1/L.
We already know from Lemma 6.5 that

x′ := x − γ∇f (x) ≥ x

is c-balanced. Furthermore (and this is the statement of the lemma), f is smooth
with parameter L over the line segment connecting x and x′ . Lemma 6.3 (no
overshooting) also yields ∏_k x′k ≤ 1.

Proof. Imagine traveling from x to x′ along the line segment. As long as the
product of all variables remains bounded by 1, Hessians remain bounded
by Lemma 6.7, and f is smooth over the part of the segment traveled so
far, by Lemma 6.1. So f can only fail to be smooth over the whole segment
when there is y ≠ x′ on the segment such that ∏_k yk = 1. Consider the
first such y. Note that f is still smooth with parameter L over the segment
connecting x and y. Also, ∇f (x) ≠ 0 (due to x > 0, ∏_k xk < 1), so x is
not a critical point, and y results from x by a gradient descent step with
stepsize < 1/L (stepsize 1/L takes us to x′ ). Hence, y is also not a critical
point by Lemma 6.3, and we cannot have ∏_k yk = 1.
Consequently, f is smooth over the whole line segment connecting x
and x′ .

6.2.4 Convergence
Theorem 6.9. Let c ≥ 1 and δ > 0 such that x0 > 0 is c-balanced with
δ ≤ ∏_k (x0 )k < 1. Choosing stepsize

γ = 1/(3dc2 ),

gradient descent satisfies

f (xT ) ≤ (1 − δ2 /(3c4 ))^T f (x0 ),   T ≥ 0.
This means that the loss indeed converges to its optimal value 0, and
does so with a fast exponential error decrease. Exercise 46 asks you to
prove that also the iterates themselves converge (to an optimal solution),
so gradient descent will not run off to infinity.
Proof. For each t ≥ 0, f is smooth over conv({xt , xt+1 }) with parameter
L = 3dc2 , hence Lemma 3.7 yields sufficient decrease:

f (xt+1 ) ≤ f (xt ) − (1/(6dc2 )) k∇f (xt )k2 .   (6.5)
For every c-balanced x with δ ≤ ∏_k xk ≤ 1, we have

k∇f (x)k2 = 2f (x) Σ_{i=1}^d (∏_{k≠i} xk )²
          ≥ 2f (x) (d/c2 ) (∏_k xk )^{2−2/d}   (Lemma 6.6)
          ≥ 2f (x) (d/c2 ) (∏_k xk )²
          ≥ 2f (x) (d/c2 ) δ2 .
Then, (6.5) further yields

f (xt+1 ) ≤ f (xt ) − (1/(6dc2 )) · 2f (xt ) (d/c2 ) δ2 = f (xt ) (1 − δ2 /(3c4 )),
proving the theorem.

This looks great: just as for strongly convex functions, we seem to have
fast convergence since the function value goes down by a constant factor
in each step. There is a catch, though. To see this, consider the starting
solution x0 = (1/2, . . . , 1/2). This is c-balanced with c = 1, but the δ that
we get is 1/2^d . Hence, the “constant factor” is

1 − 1/(3 · 4^d ),

and we need T ≈ 4^d to reduce the initial error by a constant factor not


depending on d.
Indeed, for this starting value x0 , the gradient is exponentially small,
so we are crawling towards the optimum at exponentially small speed. In
order to get polynomial-time convergence, we need to start with a δ that
decays at most polynomially with d. For large d, this requires us to start
very close to optimality. As a concrete example, let us try to achieve a
constant δ (not depending on d) with a 1-balanced solution of the form
xi = (1 − b/d) for all i. For this, we need that

(1 − b/d)^d ≈ e^{−b} = Ω(1),

and this requires b = O(1). Hence, we need to start at distance O(1/√d)
from the optimal solution (1, . . . , 1).
The problem is due to constant stepsize. Indeed, f is locally much
smoother at small x0 than Lemma 6.8 predicts, so we could afford much
larger steps in the beginning. The lemma covers the “worst case” when
we are close to optimality already.
So could we improve using a time-varying stepsize? The question is
moot: if we know the function f under consideration, we do not need
to run any optimization in the first place. The question we were trying
to address is whether and how a standard gradient descent algorithm is
able to optimize nonconvex functions as well. Above, we have given a
(partially satisfactory) answer for a concrete function: yes, it can, but at a
very slow rate, if d is large and the starting point not close to optimality
yet.

6.3 Exercises
Exercise 40. Let f : dom(f ) → R be convex and twice differentiable, with
X ⊆ dom(f ) an open convex set, and suppose that f is smooth with parameter
L over X. Prove that under these conditions, k∇2 f (x)k ≤ L for all x ∈ X, where
k·k is the spectral norm.

Exercise 41. Prove that the statement of Theorem 6.2 implies that

lim k∇f (xt )k2 = 0.


t→∞

Exercise 42. Prove Lemma 6.3 (gradient descent does not overshoot on smooth
functions).
Exercise 43. Consider the function f (x) = (1/2)(∏_{k=1}^d xk − 1)² . Prove that for
any starting point x0 ∈ X = {x ∈ Rd : x > 0, ∏_k xk ≥ 1} and any ε > 0,
gradient descent attains f (xT ) ≤ ε for some iteration T .
Exercise 44. Consider the function f (x) = (1/2)(∏_{k=1}^d xk − 1)² . Prove that for
even dimension d ≥ 2, there is a point x0 (not a critical point) such that gradient
descent does not converge to a global minimum when started at x0 , regardless of
step size(s).

Exercise 45. Prove that for any matrix A, kAk ≤ kAkF , where k·k is the spectral
norm and k·kF the Frobenius norm.

Exercise 46. Prove that the sequence (xT )T ≥0 of iterates in Theorem 6.9 con-
verges to an optimal solution x? .

Chapter 7

The Frank-Wolfe Algorithm

Contents
7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.3 On linear minimization oracles . . . . . . . . . . . . . . . . . 165
7.3.1 LASSO and the `1 -ball . . . . . . . . . . . . . . . . . . 165
7.3.2 Semidefinite Programming and the Spectahedron . . 166
7.4 Duality gap — A certificate for optimization quality . . . . . 167
7.5 Convergence in O(1/ε) steps . . . . . . . . . . . . . . . . . . . 168
7.5.1 Convergence analysis for γt = 2/(t + 2) . . . . . . . . 169
7.5.2 Stepsize variants . . . . . . . . . . . . . . . . . . . . . 170
7.5.3 Affine invariance . . . . . . . . . . . . . . . . . . . . . 171
7.5.4 The curvature constant . . . . . . . . . . . . . . . . . . 172
7.5.5 Convergence in duality gap . . . . . . . . . . . . . . . 174
7.6 Sparsity, extensions and use cases . . . . . . . . . . . . . . . . 175
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

7.1 Overview
As constrained optimization problems do appear often in practice, we will
give them a second look here. We again consider problems of the form

minimize f (x)
(7.1)
subject to x ∈ X,

which we have introduced already in Section 2.4.3.


Figure 7.1: A constrained optimization problem in dimension d = 2.

The only algorithm we have discussed for this case was projected gra-
dient descent in Chapter 4. This comes with the clear downside that pro-
jections onto a set X can sometimes be very complex to compute, even in
cases when the set is convex. Would it still be possible to solve constrained
optimization problems using a gradient-based algorithm, but without any
projection steps?
From a different perspective, coordinate descent, as we have discussed
in Chapter 5, had the attractive advantage that it only modified one coor-
dinate in every step, keeping all others unchanged. Yet, it is not applicable
in the general constrained case, as we can not easily know when a coordi-
nate step would exit the constraint set X (except in easy cases when X is
defined as a product of intervals). Is there a coordinate-like algorithm also
for general constraint sets X?
It turns out the answer to both previous questions is yes. The algorithm
was discovered by Marguerite Frank and Philip Wolfe in 1956 [FW56],

giving rise to the name of the method. Historically, the motivation for
the method was different from the two aspects mentioned above. After
the second world war, linear programming (that is to minimize a linear
function over a set of linear constraints) had significant impact for many in-
dustrial applications (e.g. in logistics). Given these successes with linear
objectives, Marguerite Frank and Philip Wolfe studied if similar methods
could be generalized to non-linear objectives (including quadratic as well
as general objectives), that is problems of the form (7.1).

7.2 The Algorithm


Similar to projected gradient descent, the Frank-Wolfe algorithm uses a
nontrivial primitive. Here, it is the linear minimization oracle (LMO). For
the feasible region X ⊆ Rd and an arbitrary vector g ∈ Rd (which we can
think of as an optimization direction),

LMOX (g) := argmin_{z∈X} g> z   (7.2)

is any minimizer of the linear function g> z over X. We will assume that
a minimizer exists whenever we apply the oracle. If X is closed and
bounded, this is guaranteed.
The Frank-Wolfe algorithm proceeds iteratively, starting from an initial
feasible point x0 ∈ X, using a (time-dependent) stepsize γt ∈ [0, 1].

s := LMOX (∇f (xt )), (7.3)


xt+1 := (1 − γt )xt + γt s, (7.4)

We immediately see that the algorithm reduces non-linear constrained


optimization to linear optimization over the same set X: It is able to solve
general non-linear constrained optimization problems (7.1), by only solv-
ing a simpler linear constrained optimization over the same set X in each
iteration — that is the call to the linear minimization oracle LMOX (7.2).
But which linear problem is actually helpful to solve in each step —
that is which direction should we give to the linear oracle LMOX ? The
Frank-Wolfe algorithm uses the gradient g = ∇f (xt ). The rationale is
that the gradient defines the best linear approximation of f at xt . In each
step, the algorithm minimizes this linear approximation over the set X

Figure 7.2: Illustration of a Frank-Wolfe step from an iterate x.

and makes a step into the direction of the minimizer; see Figure 7.2.

We identify several attractive properties of this algorithm:

• Iterates are always feasible, if the constraint set X is convex. In other


words, x0 , x1 , . . . , xt ∈ X. This follows thanks to the definition of the
linear minimization oracle returning a point s within X, and the fact
that the next iterate xt+1 is on the line segment [s, xt ], for γt ∈ [0, 1].
This requires that the stepsize in each iteration is chosen in 0 ≤ γt ≤
1. We postpone the further discussion of the stepsizes to later when
we give the convergence analysis.

• The algorithm is projection-free. As we are going to see later, depend-


ing on the geometry of the constraint set X, the subproblem LMOX
is often easier to solve than a projection onto the same set X. Intu-
itively, this is the case because LMOX is only a linear problem, while a
projection operation is a quadratic optimization problem.

• The iterates always have a simple sparse representation: xt is always a


convex combination of the initial iterate and the minimizers s used
so far. We will come back to this point in Section 7.6 below.
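To make the update (7.3)–(7.4) concrete, here is a minimal generic implementation sketch (Python with NumPy; the function names, arguments and the horizon T are our own, and the stepsize 2/(t + 2) anticipates the analysis of Section 7.5):

import numpy as np

def frank_wolfe(gradient, lmo, x0, T):
    """Frank-Wolfe (7.3)-(7.4): gradient and lmo are callables, x0 a feasible start."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        s = lmo(gradient(x))                  # step (7.3): call the linear minimization oracle
        gamma = 2.0 / (t + 2.0)               # stepsize in [0, 1], see Section 7.5
        x = (1.0 - gamma) * x + gamma * s     # step (7.4): stays in X by convexity
    return x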

7.3 On linear minimization oracles
The algorithm is particularly useful for cases when the constraint set X can
be described as a convex hull of a finite or otherwise “nice” set of points
A, formally conv(A) = X. We call A the atoms describing the constraint
set.
In this case, a solution to the linear subproblem LMOX defined in (7.2)
is always attained by an atom a ∈ A. Indeed, every s ∈ X = conv(A) is a
convex combination s = Σ_{i=1}^n λi ai of finitely many atoms (Σ_{i=1}^n λi = 1, all
λi nonnegative). It follows that for every g, there is always an atom ai such
that g> s ≥ g> ai . Hence, if s minimizes g> z, then there is also an atomic
minimizer.
This allows us to significantly reduce the candidate solutions for the
step directions used by the Frank-Wolfe algorithm. (Note that subprob-
lem (7.2) might still have optimal solutions which are not atoms, but there
is always at least one atomic solution LMOX (g) ∈ A).
The set A = X is a valid (but not too useful) set of atoms. The “opti-
mal” set of atoms is the set of extreme points. A point x ∈ X is extreme if
x ∉ conv(X \ {x}). Such an extreme point must be in every set of atoms,
but not every atom must be extreme. All that we require for A to be a set
of atoms is that conv(A) = X.
We give two interesting examples next.

7.3.1 LASSO and the ℓ1-ball


The LASSO problem in its standard (primal) form is given as

min_{x∈Rd} kAx − bk2   subject to kxk1 ≤ 1   (7.5)

Here we observe that the constraint set X = {x ∈ Rd : kxk1 ≤ 1} is the unit


ℓ1-ball, the convex hull of the unit basis vectors: X = conv({±e1 , . . . , ±ed }).
Linear problems over the unit ℓ1-ball are easy to solve: For any direc-
tion g, the minimizer can be chosen as one of the atoms (the unit basis

vectors and their negatives):

LMOX (g) = argmin_{z∈X} z> g
         = argmin_{z∈{±e1 ,...,±ed }} z> g   (7.6)
         = −sgn(gi ) ei   with i := argmax_{i∈[d]} |gi |   (7.7)

So we only have to look at the vector g and identify its largest coordinate
(in absolute value). This operation is of course significantly more efficient
than projection onto an ℓ1-ball. The latter we have analyzed in Section 4.5
and have shown a more sophisticated algorithm that still did not have
runtime linear in the dimension.
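For the unit ℓ1-ball, the oracle (7.7) is only a few lines of code. A sketch in Python/NumPy follows; the helper name and the commented LASSO usage, which plugs into the frank_wolfe sketch from Section 7.2, are our own:

import numpy as np

def lmo_l1_ball(g):
    """(7.7): return -sgn(g_i) e_i for a coordinate i maximizing |g_i|."""
    i = int(np.argmax(np.abs(g)))
    s = np.zeros_like(g, dtype=float)
    s[i] = -np.sign(g[i])
    return s

# Example use for LASSO (7.5), with data A (n x d) and b (n):
#   gradient = lambda x: 2.0 * A.T @ (A @ x - b)
#   x = frank_wolfe(gradient, lmo_l1_ball, np.zeros(d), T=1000)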

7.3.2 Semidefinite Programming and the Spectahedron


Hazan’s algorithm [Haz08] is an application of the Frank-Wolfe algorithm
to semidefinite programming. We use the notation of Gärtner and Ma-
toušek [GM12, Chapter 5]. In Hazan’s algorithm, each LMO assumes the
form
argmin   G • Z
subject to   Tr(Z) = 1   (7.8)
             Z ⪰ 0.
Here, the feasible region X is the spectahedron, the set of all (symmetric)
positive semidefinite matrices Z ∈ Rd×d of trace 1, and G is a symmetric
matrix. For two square matrices A and B, the notation A • B stands for
their “scalar product” Σ_{i,j} aij bij , so G • Z is the matrix analog of g> z. In
fact, (7.8) is a semidefinite program itself, but of a simple form that allows
an explicit solution, as we show next.
The atoms of the spectahedron turn out to be the matrices of the form
zz> with z ∈ Rd , kzk = 1 (these are positive semidefinite of trace 1). It re-
mains to show that every positive semidefinite matrix of trace 1 is a convex
combination of suitable atoms. To see this, we diagonalize such a matrix
Z as Z = T DT > where T is orthogonal and D is diagonal, again of trace
1. The diagonal elements λ1 , . . . , λd are the (nonnegative) eigenvalues of
Z, summing up to the trace 1. Let ai be the i-th column of T . As T is
orthogonal, we have kai k = 1. It follows that Z = Σ_{i=1}^d λi ai ai> is the de-

sired convex combination of atoms. We remark that ai is a (unit length)
eigenvector of Z w.r.t. eigenvalue λi .
Lemma 7.1. Let λ1 be the smallest eigenvalue of G, and let s1 be a corresponding
eigenvector of unit length. Then we can choose LMOX (G) = s1 s1> .

Proof. Since it is sufficient to minimize over atoms, we have


min_{Tr(Z)=1, Z⪰0} G • Z = min_{kzk=1} G • zz> = min_{kzk=1} z> Gz = λ1 .

The second equality follows from G • zz> = z> Gz for all z (simple rewrit-
ing), and the last equality is a standard result from linear algebra that can
be proved via elementary calculations, involving diagonalization of G.
Now, s1 is easily seen to attain the last minimum, hence s1 s1> attains the
first minimum, and LMOX (G) = s1 s1> follows.
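In code, Lemma 7.1 amounts to one eigenvector computation. A sketch in Python/NumPy (a full eigendecomposition is used here purely for simplicity; in practice, and in Hazan's algorithm, one would use an approximate method such as Lanczos for just the smallest eigenpair):

import numpy as np

def lmo_spectahedron(G):
    """Lemma 7.1: return s1 s1^T for a unit eigenvector s1 of the smallest eigenvalue of G."""
    eigenvalues, eigenvectors = np.linalg.eigh(G)   # eigenvalues in ascending order
    s1 = eigenvectors[:, 0]                         # unit eigenvector for the smallest eigenvalue
    return np.outer(s1, s1)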

7.4 Duality gap — A certificate for optimization


quality
A rather unexpected side benefit of the linear minimization oracle is that
it can be used as a certificate of the optimization quality at our current iter-
ate. Even if the true optimum value f (x? ) of the problem is unknown, the
point s returned by LMOX (∇f (xt )) lets us compute an upper bound on
the optimality gap f (xt ) − f (x? ).
Given x ∈ X, we define the duality gap (also known as the Hearn gap)
at x as
g(x) := ∇f (x)> (x − s) for s := LMOX (∇f (x)). (7.9)
Note that g(x) is well-defined since it only depends on the minimum value
∇f (x)> s of LMOX (∇f (x)), but not on the concrete minimizer s of which
there may be many. The duality gap g(x) can be interpreted as the opti-
mality gap ∇f (x)> x − ∇f (x)> s of the linear subproblem. In particular,
g(x) ≥ 0; see Figure 7.3.
Lemma 7.2. Suppose that the constrained minimization problem (7.1) has a min-
imizer x? . Let x ∈ X. Then
g(x) ≥ f (x) − f (x? ),
meaning that the duality gap is an upper bound for the optimality gap.


Figure 7.3: Illustration of the duality gap at iterate x.

Proof. Using that s minimizes ∇f (x)> z over X, we argue that


g(x) = ∇f (x)> (x − s)
≥ ∇f (x)> (x − x? )
≥ f (x) − f (x? ) (7.10)
where in the last inequality we have used the first-order characterization
of convexity of f (Lemma 2.16).
So the duality gap g(xt )—a value which is available for every itera-
tion of the Frank-Wolfe algorithm—always gives us a guaranteed upper
bound on the unknown error f (xt ) − f (x? ). This contrasts unconstrained
optimization, where we don’t have any such certificate in general.
We argue that it is also a useful upper bound. At any optimal point x?
of the constrained optimization problem, the gap vanishes, i.e. g(x? ) = 0.
This follows from the optimality conditions for constrained convex opti-
mization, given in Lemma 2.28, stating that ∇f (x? )> (x−x? ) ≥ 0 ∀x ∈ X.
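Since s is computed in every Frank-Wolfe iteration anyway, the gap (7.9) comes essentially for free and can be used as a stopping certificate. A minimal sketch (Python/NumPy; the names and the threshold eps are our own, and the loop referred to is the frank_wolfe sketch from Section 7.2):

import numpy as np

def duality_gap(grad_x, x, s):
    # (7.9): g(x) = grad f(x)^T (x - s), with s = LMO_X(grad f(x))
    return float(grad_x @ (x - s))

# Inside the Frank-Wolfe loop, right after s = lmo(gradient(x)), one may add:
#     if duality_gap(gradient(x), x, s) <= eps:
#         break   # then f(x) - f(x*) <= eps is certified by Lemma 7.2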

7.5 Convergence in O(1/ε) steps


We first address the standard way of choosing the stepsize in the Frank-
Wolfe algorithm. We need to assume that the function f is smooth, but
unlike for gradient descent, the stepsize can be chosen independently from
the smoothness parameter.

7.5.1 Convergence analysis for γt = 2/(t + 2)
Theorem 7.3. Consider the constrained minimization problem (7.1) where f :
Rd → R is convex and smooth with parameter L, and X is convex, closed
and bounded (in particular, a minimizer x? of f over X exists, and all linear
minimization oracles have minimizers). With any x0 ∈ X, and with stepsizes
γt = 2/(t + 2), the Frank-Wolfe algorithm yields

f (xT ) − f (x? ) ≤ 2L diam(X)2 / (T + 1),   T ≥ 1,
where diam(X) := maxx,y∈X kx − yk is the diameter of X (which exists since X
is closed and bounded).
The following descent lemma forms the core of the convergence proof:
Lemma 7.4. For a step xt+1 := xt + γt (s − xt ) with stepsize γt ∈ [0, 1], it holds
that

f (xt+1 ) ≤ f (xt ) − γt g(xt ) + (γt2 L/2) ks − xt k2 ,

where s = LMOX (∇f (xt )).
Proof. From the definition of smoothness of f , we have

f (xt+1 ) = f (xt + γt (s − xt ))
         ≤ f (xt ) + ∇f (xt )> γt (s − xt ) + (γt2 L/2) ks − xt k2   (7.11)
         = f (xt ) − γt g(xt ) + (γt2 L/2) ks − xt k2 ,
using the definition (7.9) of the duality gap.
Proof of Theorem 7.3. Writing h(x) := f (x) − f (x? ) for the (unknown) optimization gap at point x, and using the certificate property (7.10) of the
duality gap, that is h(x) ≤ g(x), Lemma 7.4 implies that

h(xt+1 ) ≤ h(xt ) − γt g(xt ) + (γt2 L/2) ks − xt k2
         ≤ h(xt ) − γt h(xt ) + (γt2 L/2) ks − xt k2
         = (1 − γt ) h(xt ) + (γt2 L/2) ks − xt k2
         ≤ (1 − γt ) h(xt ) + γt2 C,   (7.12)

where C := (L/2) diam(X)2 .
The convergence proof finishes by induction. Exercise 47 asks you to
prove that for γt = 2/(t + 2), we obtain

h(xt ) ≤ 4C/(t + 1),   t ≥ 1.

7.5.2 Stepsize variants


The previous runtime analysis also holds for two alternative stepsizes. In
practice, convergence might even be faster with these alternatives, since
they are trying to optimize progress, in two different ways. For both al-
ternative stepsizes, we will establish inequality (7.12) with the standard
stepsize γt = 2/(t + 2) =: µt from which h(xt ) ≤ 4C/(t + 1) follows.

Line search stepsize. Here, γt ∈ [0, 1] is chosen such that the progress in
f -value (and hence also in h-value) is maximized,

γt := argmin_{γ∈[0,1]} f ((1 − γ)xt + γs).

Let yt+1 be the iterate obtained from xt with the standard stepsize µt .
From (7.12) and the definition of γt , we obtain the desired inequality
h(xt+1 ) ≤ h(yt+1 ) ≤ (1 − µt )h(xt ) + µ2t C. (7.13)

Gap-based stepsize. This chooses γt such that the right-hand side in the
first line of (7.12) is minimized. A simple calculation shows that this results
in

γt := min{ g(xt ) / (L ks − xt k2 ), 1 }.

Now we establish (7.13) as follows:

h(xt+1 ) ≤ h(xt ) − γt g(xt ) + (γt2 L/2) ks − xt k2
         ≤ h(xt ) − µt g(xt ) + (µt2 L/2) ks − xt k2
         ≤ h(xt ) − µt h(xt ) + (µt2 L/2) ks − xt k2
         ≤ (1 − µt ) h(xt ) + µt2 C.

Directly plugging in the definition of γt yields

h(xt+1 ) ≤ h(xt ) (1 − γt /2),   if γt < 1,
h(xt+1 ) ≤ h(xt )/2,   if γt = 1.
So we make progress in every iteration under the gap-based stepsize (this


is not guaranteed under the standard stepsize), but faster convergence is
not implied.
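For reference, the three stepsize rules discussed so far could be coded as follows (a sketch in Python/NumPy; L denotes the smoothness parameter, and the line search is approximated by a simple grid, which is our own simplification rather than an exact minimization):

import numpy as np

def stepsize_standard(t):
    return 2.0 / (t + 2.0)

def stepsize_gap_based(gap, x, s, L):
    # minimizes the right-hand side of the descent lemma (Lemma 7.4); assumes s != x
    return min(gap / (L * float(np.dot(s - x, s - x))), 1.0)

def stepsize_line_search(f, x, s, grid_size=100):
    # approximately minimizes f((1 - gamma) x + gamma s) over gamma in [0, 1]
    gammas = np.linspace(0.0, 1.0, grid_size)
    values = [f((1.0 - g) * x + g * s) for g in gammas]
    return float(gammas[int(np.argmin(values))])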

7.5.3 Affine invariance


The convergence bound on the Frank-Wolfe method that we have devel-
oped in Theorem 7.3,

f (xT ) − f (x? ) ≤ 2L diam(X)2 / (T + 1),
is in some sense bad. Consider the problem of minimizing f (x1 , x2 ) =
x1² + x2² over the unit square X = {(x1 , x2 ) : 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1}. The
function f (the two-dimensional supermodel) is smooth with L = 2, and
diam(X)2 = 2. Next consider f ′(x1 , x2 ) = x1² + (10x2 )² over the rectangle
X ′ = {(x1 , x2 ) : 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1/10}. The function f ′ is smooth with
L′ = 200, and diam(X ′)2 = 1 + 1/100. Hence, our convergence analysis
seems to suggest that the error after T steps of the Frank-Wolfe algorithm
on f 0 over X 0 is roughly 100 times larger than on f over X.
In reality, however, there is no such difference. The reason is that the
two problems (f, X) and (f ′, X ′) are equivalent under rescaling of variable x2 , and the Frank-Wolfe algorithm is invariant under this and more
generally all affine transformations of space. Figure 7.4 depicts the two
problems (f, X) and (f ′, X ′) from our example above.
To argue about the affine invariance formally, we call two problems
(f, X) and (f ′, X ′) affinely equivalent if f ′(x) = f (Ax + b) for some invertible
matrix A and some vector b, and X ′ = {A−1 (x − b) : x ∈ X}. The equivalence is that x ∈ X with function value f (x) if and only if x′ = A−1 (x − b) ∈
X ′ with the same function value f ′(x′ ) = f (AA−1 (x − b) + b) = f (x). We
call x′ the vector corresponding to x. In Figure 7.4, we have

A = [ 1 0 ; 0 10 ],   b = 0.

Figure 7.4: Two optimization problems (f, X) and (f ′, X ′) that are equivalent under an affine transformation.

By the chain rule, we get

∇f ′(x′ ) = A> ∇f (Ax′ + b) = A> ∇f (x).   (7.14)

Now consider performing an iteration of the Frank-Wolfe algorithm

(a) on (f, X), starting from some iterate x, and

(b) on (f ′, X ′), starting from the corresponding iterate x′ ,

in both cases with the same stepsize. Because of


(7.14)
∇f ′(x′ )> z′ = ∇f (x)> AA−1 (z − b) = ∇f (x)> z − c,

where c is some constant, the linear minimization oracle in (b) returns the
step direction s′ = A−1 (s − b) ∈ X ′ corresponding to the step direction s ∈
X in (a). It follows that also the next iterates in (a) and (b) will correspond
to each other and have the same function values. In particular, after any
number of steps, both (a) and (b) will incur the same optimization error.

7.5.4 The curvature constant


It follows from the above discussion that a good analysis of the Frank-
Wolfe algorithm should provide a bound that is invariant under affine

transformations, unlike the bound of Theorem 7.3. For this, we define
a curvature constant of the constrained optimization problem (7.1). The
quantity serves as a combined notion of complexity of both the objective
function f and the constraint set X:

C(f,X) := sup_{x,s∈X, γ∈(0,1], y=(1−γ)x+γs}  (1/γ2 ) ( f (y) − f (x) − ∇f (x)> (y − x) ).   (7.15)

To gain an understanding of this quantity, note that d(y) := f (y) −


f (x) − ∇f (x)> (y − x) is the pointwise vertical distance between the graph
of f and its linear approximation at x. By convexity, d(y) ≥ 0 for all
y ∈ X. For y resulting from x by a Frank-Wolfe step with stepsize γ, we
normalize the vertical distance with γ 2 (a natural choice if we think of f as
being smooth), and take the supremum over all possible such normalized
vertical distances.
We will see that the convergence rate of the Frank-Wolfe algorithm can
be described purely in terms of this quantity, without resorting to any
smoothness constants L or diameters diam(X). As we have already seen,
the latter two quantities are not affine invariant.
In a similar way as we have done it for the algorithm itself, we can
prove that the curvature constant C(f,X) is affine invariant. Hence, here is
the envisioned good analysis of the Frank-Wolfe algorithm.

Theorem 7.5. Consider the constrained minimization problem (7.1) where f :


Rd → R is convex, and X is convex, closed and bounded. Let C(f,X) be the
curvature constant (7.15) of f over X. With any x0 ∈ X, and with stepsizes
γt = 2/(t + 2), the Frank-Wolfe algorithm yields

f (xT ) − f (x? ) ≤ 4C(f,X) / (T + 1),   T ≥ 1.
Proof. The crucial step is to prove the following version of (7.11):

f (xt+1 ) ≤ f (xt ) − ∇f (xt )> γt (xt − s) + γt2 C(f,X) . (7.16)

After this, we can follow the remainder of the proof of Theorem 7.3, with
C(f,X) instead of C = (L/2) diam(X)2 . To show (7.16), we use

x := xt , y := xt+1 = (1 − γt )xt + γt s, y − x = −γt (x − s),

and rewrite the definition of the curvature constant (7.15) to get

f (y) ≤ f (x) + ∇f (x)> (y − x) + γt2 C(f,X) .

Plugging in the previous definitions of x and y, (7.16) follows.


One might suspect this affine independent bound to be worse than the
best bound obtainable from Theorem 7.3 after an affine transformation.
As we show next, this is not the case: when f is twice differentiable, C(f,X)
is always bounded by the constant C = (L/2) diam(X)2 that determines the
convergence rate in Theorem 7.3.

Lemma 7.6 (Exercise 48). Let f be a convex function which is smooth with
parameter L over X. Then

C(f,X) ≤ (L/2) diam(X)2 .

7.5.5 Convergence in duality gap


The following result shows that the duality gap converges as well, essen-
tially at the same rate as the primal error.

Theorem 7.7. Let f : Rd → R be convex and smooth with parameter L, and


x0 ∈ X, T ≥ 2. Then choosing any of the stepsizes in Section 7.5.2, the Frank-Wolfe algorithm yields an iterate xt , 1 ≤ t ≤ T , such that

g(xt ) ≤ (27/2) · C(f,X) / (T + 1).
Still, compared to our previous theorem, the convergence of the gap
here is a stronger and more useful result, because g(xt ) is easy to compute
in any iteration of the Frank-Wolfe algorithm, and as we have seen in (7.10)
serves as an upper bound (certificate) to the unknown primal error, that is
f (xt ) − f (x? ) ≤ g(xt ).
The proof of the theorem is left as Exercise 49, and is difficult. The
argument leverages that not all gaps can be large, and will again crucially
rely on the descent Lemma 7.4.

7.6 Sparsity, extensions and use cases
A very important feature of the Frank-Wolfe algorithm has been pointed
out before, but we would like to make it explicit here. Consider the con-
vergence bound of Theorem 7.5,

f (xT ) − f (x? ) ≤ 4C(f,X) / (T + 1),   T ≥ 1.
This means that O(1/ε) many iterations are sufficient to obtain optimality
gap at most ε. At this time, the current solution is a convex combination of
x0 and O(1/ε) many atoms of the constraint set X. Thinking of ε as a con-
stant (such as 0.01), this means that constantly many atoms are sufficient in
order to get an almost optimal solution. This is quite remarkable, and it
connects to the notion of coresets in computational geometry. A coreset is a
small subset of a given set of objects that is representative (with respect to
some measure) for the set of all objects. Some algorithms for finding small
coresets are inspired by the Frank-Wolfe algorithm [Cla10].
The algorithm and analysis above can be extended to several settings,
including
• Approximate LMO, that is we can allow a linear minimization oracle
which is not exact but is of a certain additive or multiplicative ap-
proximation quality for the subproblem (7.2). Convergence bounds
are essentially as in the exact case [Jag13].
• Randomized LMO, that is, the LMOX solves the linear minimiza-
tion oracle only over a random subset of X. Convergence in O(1/ε)
steps still holds [KPd18].
• Stochastic LMO, that is LMOX is fed with a stochastic gradient instead
of the true gradient [HL20].
• Unconstrained problems. This is achieved by considering growing ver-
sions of a constraint set X. For instance when X is an `1 -norm ball,
the algorithm will become similar to popular steepest coordinate
methods as we have discussed in Section 5.4.3. In this case, the
resulting algorithms are also known as matching-pursuit, and are
widely used in the literature on sparse recovery of a signal, also
known as compressed sensing. For more details, we refer the reader
to [LKTJ17].

The Frank-Wolfe algorithm and its variants have many popular use-
cases. The most attractive uses are for constraint sets X where a projection
step bears significantly more computational cost compared to solving a
linear problem over X. Some examples of such sets include:
• Lasso and other L1-constrained problems, as discussed in Section 7.3.1.
• Matrix Completion. For several low-rank approximation problems,
including matrix completion as in recommender systems, the Frank-
Wolfe algorithm is a very scalable algorithm, and has much lower
iteration cost compared to projected gradient descent. For a more
formal treatment, see Exercise 50.
• Relaxation of combinatorial problems, where we would like to opti-
mize over a discrete set A (e.g. matchings, network flows etc). In
this case, the Frank-Wolfe algorithm is often used together with early
stopping, in order to achieve a good iterate xt being a combination
of at most t of the original points A.
Many of these applications can also be written as constraint sets of the
form X := conv(A) for some set of atoms A, as illustrated in the following
table:
Examples       | A                  | |A| | dim. | LMOX (g)
L1-ball        | {±ei }             | 2d  | d    | ±ei with i = argmax_i |gi |
Simplex        | {ei }              | d   | d    | ei with i = argmin_i gi
Spectahedron   | {xx> , kxk = 1}    | ∞   | d2   | argmin_{kxk=1} x> Gx
Norms          | {x, kxk ≤ 1}       | ∞   | d    | argmin_{s, ksk≤1} ⟨s, g⟩
Nuclear norm   | {Y, kY k∗ ≤ 1}     | ∞   | d2   | ..
Wavelets       | ..                 | ∞   | ∞    | ..

7.7 Exercises
Exercise 47 (Induction for the Frank-Wolfe convergence analysis). Given
some constant C > 0 and a sequence of real values h0 , h1 , . . . satisfying (7.12),
i.e.
ht+1 ≤ (1 − γt ) ht + γt2 C   for t = 0, 1, . . .

for γt = 2/(t + 2), prove that

ht ≤ 4C/(t + 1)   for t ≥ 1.
Exercise 48 (Relating Curvature and Smoothness). Prove Lemma 7.6.

Exercise 49 (Duality gap convergence for the Frank-Wolfe algorithm). Prove


Theorem 7.7 on the convergence of the duality gap (which is an upper bound to
the primal error f (xt ) − f (x? )). The proof will again crucially rely on the descent
Lemma 7.4.

Exercise 50 (Frank-Wolfe for Matrix completion). Consider the matrix com-


pletion problem, that is to find a matrix Y solving

min_{Y ∈ X ⊆ Rn×m} Σ_{(i,j)∈Ω} (Zij − Yij )2

where the optimization domain X is the set of matrices in the unit ball of the trace
norm (or nuclear norm), which is defined as the convex hull of the rank-1 matrices

X := conv(A)   with   A := {uv> : u ∈ Rn , kuk2 = 1, v ∈ Rm , kvk2 = 1}.

Here Ω ⊆ [n] × [m] is the set of observed entries from a given data matrix Z
(collecting the ratings given by users to items for example).

1. Derive the LMOX for this set X for a gradient at iterate Y ∈ Rn×m .

2. Derive the projection step onto X. How do the LMOX and the projection
step compare, in terms of computational cost?

Chapter 8

Newton’s Method

Contents
8.1 1-dimensional case . . . . . . . . . . . . . . . . . . . . . . . . 179
8.2 Newton’s method for optimization . . . . . . . . . . . . . . . 181
8.3 Once you’re close, you’re there. . . . . . . . . . . . . . . . . . . 183
8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

8.1 1-dimensional case
The Newton method (or Newton-Raphson method, invented by Sir Isaac
Newton and formalized by Joseph Raphson) is an iterative method for
finding a zero of a differentiable univariate function f : R → R. Starting
from some number x0 , it computes

xt+1 := xt − f (xt )/f ′(xt ),   t ≥ 0.   (8.1)

Figure 8.1 shows what happens. xt+1 is the point where the tangent line
to the graph of f at (xt , f (xt )) intersects the x-axis. In formulas, xt+1 is the
solution of the linear equation

f (xt ) + f ′(xt )(x − xt ) = 0,

and this yields the update formula (8.1).

Figure 8.1: One step of Newton’s method

The Newton step (8.1) obviously fails if f ′(xt ) = 0 and may get out of
control if |f ′(xt )| is very small. Any theoretical analysis will have to make
suitable assumptions to avoid this. But before going into this, we look at
Newton’s method in a benign case.

Let f (x) = x2 − R, where R ∈ R+ . f has two zeros, √R and −√R.
Starting for example at x0 = R, we hope to converge to √R quickly. In this
case, (8.1) becomes

xt+1 = xt − (xt2 − R)/(2xt ) = (1/2)(xt + R/xt ).   (8.2)
This is in fact the Babylonian method to compute square roots, and here we
see that it is just a special case of Newton’s method.
Can we prove that we indeed quickly converge to √R? What we immediately see from (8.2) is that all iterates will be positive and hence

xt+1 = (1/2)(xt + R/xt ) ≥ xt /2.

So we cannot be too fast. Suppose R ≥ 1. In order to even get xt < 2√R,
we need at least T ≥ log(R)/2 steps. It turns out that the Babylonian
method starts taking off only when xt − √R < 1/2, say (Exercise 51 asks
you to prove that it takes O(log R) steps to get there).
To watch takeoff, let us now suppose that x0 − √R < 1/2, so we are
starting close to √R already. We rewrite (8.2) as
xt+1 − √R = xt /2 + R/(2xt ) − √R = (1/(2xt )) (xt − √R)² .   (8.3)

Assuming for now that R ≥ 1/4, all iterates have value at least √R ≥
1/2, hence we get

xt+1 − √R ≤ (xt − √R)² .
This means that the error goes to 0 quadratically, and

xT − √R ≤ (x0 − √R)^(2^T) < (1/2)^(2^T),   T ≥ 0.   (8.4)

What does this tell us? In order to get xT − √R < ε, we only need
T = log log(1/ε) steps! Hence, it takes a while to get to roughly √R, but
from there, we achieve high accuracy very fast.
Let us do a concrete example of the practical behavior (on a computer
with IEEE 754 double arithmetic). If R = 1000, the method takes 7 steps to
get x7 − √1000 < 1/2, and then 3 more steps to get x10 equal to √1000 up to
the machine precision (53 binary digits). In this last phase, we essentially
double the number of correct digits in each iteration!

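The Babylonian iteration (8.2) in code (a short Python sketch; the printed comparison with the library square root is just for illustration):

import math

def babylonian_sqrt(R, T):
    """Newton's method (8.2) for f(x) = x^2 - R, started at x0 = R."""
    x = float(R)
    for _ in range(T):
        x = 0.5 * (x + R / x)
    return x

print(babylonian_sqrt(1000.0, 10), math.sqrt(1000.0))   # agree to machine precision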
8.2 Newton’s method for optimization
Suppose we want to find a global minimum x? of a differentiable con-
vex function f : R → R (assuming that a global minimum exists). Lem-
mata 2.22 and 2.23 guarantee that we can equivalently search for a zero of
the derivative f ′. To do this, we can apply Newton’s method if f is twice dif-
differentiable; the update step then becomes
f 0 (xt )
xt+1 := xt − = xt − f 00 (xt )−1 f 0 (xt ), t ≥ 0. (8.5)
f 00 (xt )
There is no reason to restrict to d = 1. Here is Newton’s method for min-
imizing a convex function f : Rd → R. We choose x0 arbitrarily and then
iterate:
xt+1 := xt − ∇2 f (xt )−1 ∇f (xt ), t ≥ 0. (8.6)
The update vector ∇2 f (xt )−1 ∇f (xt ) is the result of a matrix-vector mul-
tiplication: we invert the Hessian at xt and multiply the result with the
gradient at xt . As before, this fails if the Hessian is not invertible, and may
get out of control if the Hessian has small norm.
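A direct implementation sketch of (8.6) follows (Python/NumPy; the names are our own, and as a design choice we solve the linear system instead of explicitly forming the inverse Hessian, which is numerically preferable; the commented sanity check refers ahead to Lemma 8.1):

import numpy as np

def newton_method(gradient, hessian, x0, T):
    """Newton's method (8.6): x_{t+1} = x_t - (Hessian at x_t)^{-1} (gradient at x_t)."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        x = x - np.linalg.solve(hessian(x), gradient(x))   # solve instead of invert
    return x

# Sanity check on a nondegenerate quadratic (Lemma 8.1): one step reaches M^{-1} q.
#   M = np.diag([2.0, 200.0]); q = np.array([1.0, 1.0])
#   gradient = lambda x: M @ x - q
#   hessian = lambda x: M
#   newton_method(gradient, hessian, np.zeros(2), 1)   # equals [0.5, 0.005]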
We have introduced iteration (8.6) simply as a (more or less natural)
generalization of (8.5), but there’s more to it. If we consider (8.6) as a
special case of a general update scheme
xt+1 = xt − H(xt )∇f (xt ),
where H(xt ) ∈ Rd×d is some matrix, then we see that also gradient de-
scent (3.1) is of this form, with H(xt ) = γI. Hence, Newton’s method can
also be thought of as “adaptive gradient descent” where the adaptation is
w.r.t. the local geometry of the function at xt . Indeed, as we show next,
this allows Newton’s method to converge on all nondegenerate quadratic
functions in one step, while gradient descent only does so with the right
stepsize on “beautiful” quadratic functions whose sublevel sets are Eu-
clidean balls (Exercise 30).
Lemma 8.1. A nondegenerate quadratic function is a function of the form

f (x) = (1/2) x> M x − q> x + c,
d×d
where M ∈ R is an invertible symmetric matrix, q ∈ Rd , c ∈ R. Let x? =
M −1 q be the unique solution of ∇f (x) = 0 (the unique global minimum if f is
convex). With any starting point x0 ∈ Rd , Newton’s method (8.6) yields x1 = x? .

Proof. We have ∇f (x) = M x − q (this implies x? = M −1 q) and ∇2 f (x) =
M . Hence,

x0 − ∇2 f (x0 )−1 ∇f (x0 ) = x0 − M −1 (M x0 − q) = M −1 q = x? .

In particular, Newton’s method can solve an invertible system M x = q


of linear equations in one step. But no miracle is happening here, as this
step involves the inversion of the matrix ∇2 f (x0 ) = M .
More generally, the behavior of Newton’s method is affine invariant.
By this, we mean that it is invariant under any invertible affine transfor-
mation, as follows:

Lemma 8.2 (Exercise 52). Let f : Rd → R be twice differentiable, A ∈ Rd×d


an invertible matrix, b ∈ Rd . Let g : Rd → Rd be the (bijective) affine function
g(y) = Ay + b, y ∈ Rd . Finally, for a twice differentiable function h : Rd → R,
let Nh : Rd → Rd denote the Newton step for h, i.e.

Nh (x) := x − ∇2 h(x)−1 ∇h(x),

whenever this is defined. Then we have Nf ◦g = g −1 ◦ Nf ◦ g.

This says that in order to perform a Newton step for f ◦ g on yt , we


can transform yt to xt = g(yt ), perform the Newton step for f on x and
transform the result xt+1 back to yt+1 = g −1 (xt+1 ). Another way of saying
this is that the following diagram commutes:
xt ──Nf──> xt+1
 ↑            │
 g            g −1
 │            ↓
yt ─Nf ◦g─> yt+1

Hence, while gradient descent suffers if the coordinates are at very dif-
ferent scales, Newton’s method doesn’t.
We conclude the general exposition with another interpretation of New-
ton’s method: each step minimizes the local second-order Taylor approxi-
mation.

Lemma 8.3 (Exercise 55). Let f be convex and twice differentiable at xt ∈


dom(f ), with ∇2 f (xt ) ⪰ 0 being invertible. The vector xt+1 resulting from
the Newton step (8.6) satisfies

xt+1 = argmin_{x∈Rd} ( f (xt ) + ∇f (xt )> (x − xt ) + (1/2)(x − xt )> ∇2 f (xt )(x − xt ) ).

8.3 Once you’re close, you’re there. . .


We will prove a result about Newton’s method that may seem rather weak:
under suitable conditions, and starting close to the global minimum, we
will reach distance at most ε to the minimum within log log(1/ε) steps.
The weak part here is of course not the number of steps log log(1/ε)—this
is much faster than anything we have seen so far—but the assumption that
we are starting close to the minimum already. Under such an assumption,
we say that we have a local convergence result.
To compensate for the above weakness to some extent, we will be able
to handle non-convex functions as well. More precisely, we show that un-
der the aforementioned suitable conditions, and starting close to a crit-
ical point, we will reach distance at most ε to the critical point within
log log(1/ε) steps. This can of course only work if the conditions ensure
that we are close to only one critical point; so we have a unique critical
point nearby, and Newton’s method will have no choice other than to con-
verge to it.
For convex functions, we can ask about global convergence results that
hold for every starting point. In general, such results were unknown for
Newton’s method as in (8.6) until recently. Under a stability assump-
tion on the Hessian, global convergence was shown to hold by [KSJ18].
There are some variants of Newton’s method for which such results can
be proved, most notably the cubic regularization variant of Nesterov and
Polyak [NP06]. Weak global convergence results can be obtained by adding

a step size to (8.6) and always making only steps that decrease the function
value (which may not happen under the full Newton step).
An alternative is to use gradient descent to get us sufficiently close to
the global minimum, and then switch to Newton’s method for the rest. In
Chapter 3, we have seen that under favorable conditions, we may know
when gradient descent has taken us close enough.
In practice, Newton’s method is often (but not always) much faster
than gradient descent in terms of the number of iterations. The price to pay
is a higher iteration cost, since we need to compute (and invert) Hessians.
After this disclaimer, let us state the main result right away. We follow
Vishnoi [Vis15], except that we do not require convexity.
Theorem 8.4. Let f : dom(f ) → R be twice differentiable with a critical
point x? . Suppose that there is a ball X ⊆ dom(f ) with center x? such that
the following two properties hold.
(i) Bounded inverse Hessians: There exists a real number µ > 0 such that

k∇2 f (x)−1 k ≤ 1/µ, ∀x ∈ X.

(ii) Lipschitz continuous Hessians: There exists a real number B ≥ 0 such


that
k∇2 f (x) − ∇2 f (y)k ≤ Bkx − yk ∀x, y ∈ X.

In both cases, the matrix norm is the spectral norm defined in Lemma 3.6. Prop-
erty (i) in particular stipulates that Hessians are invertible at all points in X.
Then, for xt ∈ X and xt+1 resulting from the Newton step (8.6), we have
kxt+1 − x? k ≤ (B/(2µ)) kxt − x? k2 .

As an example, let us consider a nondegenerate quadratic function f
(constant Hessian M = ∇2 f (x) for all x; see Lemma 8.1). Then f has ex-
actly one critical point x? . Property (i) is satisfied with µ = 1/kM −1 k over
X = Rd ; property (ii) is satisfied for B = 0. According to the statement of
the theorem, Newton’s method will thus reach x? after one step—which
we already know from Lemma 8.1.
In general, there could be several critical points for which properties
(i) and (ii) hold, and it may seem surprising that the theorem makes a

statement about all of them. But in fact, if xt is far away from such a
critical point, the statement allows xt+1 to be even further away from it; we
cannot expect to make progress towards all critical points simultaneously.
The theorem becomes interesting only if we are very close to some critical
point. In this case, we will actually converge to it. In particular, this critical
point is then isolated and the only one nearby, so that Newton’s method
cannot avoid getting there.

Corollary 8.5 (Exercise 53). With the assumptions and terminology of Theo-
rem 8.4, and if x0 ∈ X satisfies
kx0 − x? k ≤ µ/B,

then Newton’s method (8.6) yields

kxT − x? k ≤ (µ/B) (1/2)^(2^T − 1) ,   T ≥ 0.

Hence, we have a bound as in (8.4) for the last phase of the Babylonian
method: in order to get kxT − x? k < ε, we only need T = log log(1/ε) steps.
But before this fast behavior kicks in, we need to be µ/B-close to x? al-
ready. The fact that x0 is this close to only one critical point necessarily
follows.
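To see this doubly exponential error decay numerically, here is a small sketch on a made-up one-dimensional function f (x) = e^x − x with unique critical point x? = 0 (the function and starting point are chosen for illustration only):

import math

# f(x) = exp(x) - x: f'(x) = exp(x) - 1, f''(x) = exp(x) > 0,
# with unique critical point (and minimum) x* = 0
x = 1.0   # start within the region of fast convergence
for t in range(6):
    print(f"t = {t}, |x_t - x*| = {abs(x):.3e}")
    x = x - (math.exp(x) - 1.0) / math.exp(x)   # Newton step (8.6) in one dimension
# the printed error is roughly squared in each step, as predicted by Theorem 8.4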
An intuitive reason for a unique critical point near x0 (and for fast con-
vergence to it) is that under our assumptions, the Hessians we encounter
are almost constant when we are close to x? . This means that locally, our
function behaves almost like a nondegenerate quadratic function which
has truly constant Hessians and allows Newton’s method to convergence
to its unique critical point in one step (Lemma 8.1).

Lemma 8.6 (Exercise 54). With the assumptions and terminology of Theorem 8.4,
and if x0 ∈ X satisfies
kx0 − x? k ≤ µ/B,

then the Hessians in Newton’s method satisfy the relative error bound

k∇2 f (xt ) − ∇2 f (x? )k / k∇2 f (x? )k ≤ (1/2)^(2^t − 1) ,   t ≥ 0.

We now still owe the reader the proof of the main convergence result,
Theorem 8.4:
Proof of Theorem 8.4. To simplify notation, let us abbreviate H := ∇2 f , x =
xt , x0 = xt+1 . Subtracting x? from both sides of (8.6), we get

x0 − x? = x − x? − H(x)−1 ∇f (x)
        = x − x? + H(x)−1 (∇f (x? ) − ∇f (x))
        = x − x? + H(x)−1 ∫₀¹ H(x + t(x? − x))(x? − x) dt.

The last step, which applies the fundamental theorem of calculus, needs
some explanations. In fact, we have applied it to each component hi (t) of
the vector-valued function h(t) = ∇f (x + t(x? − x)):
hi (1) − hi (0) = ∫₀¹ h′i (t) dt,   i = 1, . . . , d.

These d equations can be summarized as


∇f (x? ) − ∇f (x) = h(1) − h(0) = ∫₀¹ h′ (t) dt,

where h′ (t) has components h′1 (t), . . . , h′d (t), and the integral is also under-
stood componentwise. Furthermore, as hi (t) = ∂f/∂xi (x + t(x? − x)), the chain
rule yields h′i (t) = Σ^d_{j=1} ∂²f/(∂xi ∂xj ) (x + t(x? − x)) (x?j − xj ). This summarizes to

h′ (t) = H(x + t(x? − x))(x? − x).
Also note that we are allowed to apply the fundamental theorem of
calculus in the first place, since f is twice continuously differentiable over
X (as a consequence of assuming Lipschitz continuous Hessians), so also
h0 (t) is continuous.
After this justifying intermezzo, we further massage the expression we
have obtained last. Using
x − x? = H(x)−1 H(x)(x − x? ) = H(x)−1 ∫₀¹ −H(x)(x? − x) dt,

we can now write


x0 − x? = H(x)−1 ∫₀¹ (H(x + t(x? − x)) − H(x)) (x? − x) dt.

Taking norms, we have
kx0 − x? k ≤ kH(x)−1 k · k ∫₀¹ (H(x + t(x? − x)) − H(x)) (x? − x) dt k,

by the properties of the spectral norm. As we also have


k ∫₀¹ g(t) dt k ≤ ∫₀¹ kg(t)k dt

for any vector-valued function g (Exercise 57), we can further bound


kx0 − x? k ≤ kH(x)−1 k ∫₀¹ k (H(x + t(x? − x)) − H(x)) (x? − x) k dt
           ≤ kH(x)−1 k ∫₀¹ kH(x + t(x? − x)) − H(x)k · kx? − xk dt
           = kH(x)−1 k · kx? − xk ∫₀¹ kH(x + t(x? − x)) − H(x)k dt.

We can now use the properties (i) and (ii) (bounded inverse Hessians, Lip-
schitz continuous Hessians) to conclude that
kx0 − x? k ≤ (1/µ) kx? − xk ∫₀¹ B kt(x? − x)k dt = (B/µ) kx? − xk2 ∫₀¹ t dt = (B/(2µ)) kx? − xk2 .

How realistic are properties (i) and (ii)? If f is twice continuously dif-
ferentiable (meaning that the second derivative ∇2 f is continuous), then
we will always find suitable values of µ and B over a ball X with center
x? —provided that ∇2 f (x? ) 6= 0.
Indeed, already in the one-dimensional case, we see that under f ′′ (x? ) =
0 (a vanishing second derivative at the global minimum), Newton’s method
will in the worst case reduce the distance to x? only by a constant factor in
each step, no matter how close to x? we start. Exercise 56 asks you to find
such an example. In such a case, we have linear convergence, but the fast
quadratic convergence (O(log log(1/ε)) steps) cannot be proven.
One way to ensure bounded inverse Hessians is to require strong con-
vexity over X.

Lemma 8.7 (Exercise 58). Let f : dom(f ) → R be twice differentiable and
strongly convex with parameter µ over an open convex subset X ⊆ dom(f )
according to Definition 3.10, meaning that
f (y) ≥ f (x) + ∇f (x)> (y − x) + (µ/2) kx − yk2 , ∀x, y ∈ X.
Then ∇2 f (x) is invertible and k∇2 f (x)−1 k ≤ 1/µ for all x ∈ X, where k · k is
the spectral norm defined in Lemma 3.6.

8.4 Exercises
Exercise 51. Consider the Babylonian method (8.2). Prove that we get xT −
√R < 1/2 for T = O(log R).
Exercise 52. Prove Lemma 8.2!
Exercise 53. Prove Corollary 8.5!
Exercise 54. Prove Lemma 8.6!
Exercise 55. Prove Lemma 8.3!
Exercise 56. Let δ > 0 be any real number. Find an example of a convex function
f : R → R such that (i) the unique global minimum x? has a vanishing second
derivative f 00 (x? ) = 0, and (ii) Newton’s method satisfies
|xt+1 − x? | ≥ (1 − δ)|xt − x? |,
for all xt 6= x? .
Exercise 57. This exercise is just meant to recall some basics around integrals.
Show that for a vector-valued function g : R → Rd , the inequality
k ∫₀¹ g(t) dt k ≤ ∫₀¹ kg(t)k dt

holds, where k · k is the 2-norm (always assuming that the functions under consid-
eration are integrable)! You may assume (i) that integrals are linear:

∫₀¹ (λ1 g1 (t) + λ2 g2 (t)) dt = λ1 ∫₀¹ g1 (t) dt + λ2 ∫₀¹ g2 (t) dt,

and (ii), if g(t) ≥ 0 for all t ∈ [0, 1], then ∫₀¹ g(t) dt ≥ 0.

Exercise 58. Prove Lemma 8.7! You may want to proceed in the following steps.

(i) Prove that the function g(x) = f (x) − (µ/2) kxk2 is convex over X (see also
Exercise 28).

(ii) Prove that ∇2 f (x) is invertible for all x ∈ X.

(iii) Prove that all eigenvalues of ∇2 f (x)−1 are positive and at most 1/µ.

(iv) Prove that for a symmetric matrix M , the spectral norm kM k is the largest
absolute eigenvalue.

Chapter 9

Quasi-Newton Methods

Contents
9.1 The secant method . . . . . . . . . . . . . . . . . . . . . . . . 191
9.2 The secant condition . . . . . . . . . . . . . . . . . . . . . . . 193
9.3 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . 193
9.4 Greenstadt’s approach . . . . . . . . . . . . . . . . . . . . . . 194
9.4.1 The method of Lagrange multipliers . . . . . . . . . . 196
9.4.2 Application to Greenstadt’s Update . . . . . . . . . . 197
9.4.3 The Greenstadt family . . . . . . . . . . . . . . . . . . 198
9.4.4 The BFGS method . . . . . . . . . . . . . . . . . . . . 201
9.4.5 The L-BFGS method . . . . . . . . . . . . . . . . . . . 202
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

The main computational bottleneck in Newton’s method (8.6) is the
computation and inversion of the Hessian matrix in each step. This matrix
has size d × d, so it will take up to O(d3 ) time to invert it (or to solve the
system ∇2 f (xt )∆x = −∇f (xt ) that gives us the next Newton step ∆x).
Already in the 1950s, attempts were made to circumvent this costly step,
the first one going back to Davidon [Dav59].
In this chapter, we will (for a change) not prove convergence results;
rather, we focus on the development of Quasi-Newton methods, and how
state-of-the-art methods arise from first principles. To motivate the ap-
proach, let us go back to the 1-dimensional case.

9.1 The secant method


Like Newton’s method (8.1), the secant method is an iterative method for
finding a zero of a univariate function. Unlike Newton’s method, it does
not use derivatives and hence does not require the function under con-
sideration to be differentiable. In fact, it is (therefore) much older than
Newton’s method. Reversing history and starting from the Newton step
xt+1 := xt − f (xt ) / f ′ (xt ),   t ≥ 0,

we can derive the secant method by replacing the derivative f ′ (xt ) with its
finite difference approximation

(f (xt ) − f (xt−1 )) / (xt − xt−1 ).

As we (in the differentiable case) have

f ′ (xt ) = lim_{x→xt} (f (xt ) − f (x)) / (xt − x),

we get

(f (xt ) − f (xt−1 )) / (xt − xt−1 ) ≈ f ′ (xt )

for |xt − xt−1 | small. As the method proceeds, we expect consecutive iter-
ates xt−1 , xt to become closer and closer, so that the secant step

xt+1 := xt − f (xt ) · (xt − xt−1 ) / (f (xt ) − f (xt−1 )),   t ≥ 1        (9.1)

approximates the Newton step (two starting values x0 , x1 need to be cho-
sen here). Figure 9.1 shows what the method does: it constructs the line
through the two points (xt−1 , f (xt−1 )) and (xt , f (xt )) on the graph of f ; the
next iterate xt+1 is where this line intersects the x-axis. Exercise 59 asks
you to formally prove this.

Figure 9.1: One step of the secant method (the secant line through
(xt−1 , f (xt−1 )) and (xt , f (xt )) intersects the x-axis at xt+1 )

Convergence of the secant method can be analyzed, but we don’t do


this here. The main point for us is that we now have a derivative-free ver-
sion of Newton’s method.
When the task is to optimize a differentiable univariate function, we
can apply the secant method to its derivative to obtain the secant method
for optimization:
xt+1 := xt − f ′ (xt ) · (xt − xt−1 ) / (f ′ (xt ) − f ′ (xt−1 )),   t ≥ 1.        (9.2)

This is a second-derivative-free version of Newton’s method (8.5) for opti-


mization. The plan is now to generalize this to higher dimensions to obtain
a Hessian-free version of Newton’s method (8.6) for optimization over Rd .
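A minimal sketch of the secant method (9.2) for optimization in Python; the objective f (x) = x² + e^x and the two starting values are made up for illustration:

import math

# made-up objective f(x) = x**2 + exp(x); its derivative is f'(x) = 2x + exp(x)
fprime = lambda x: 2 * x + math.exp(x)

x_prev, x_curr = 0.0, 1.0          # two starting values are required
for t in range(10):
    denom = fprime(x_curr) - fprime(x_prev)
    if denom == 0.0:               # iterates have converged to machine precision
        break
    x_prev, x_curr = x_curr, x_curr - fprime(x_curr) * (x_curr - x_prev) / denom
print(x_curr)   # ≈ -0.3517, where f'(x*) = 0, found without any second derivatives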

9.2 The secant condition
Applying finite difference approximation to the second derivative of f
(we’re still in the 1-dimensional case), we get
Ht := (f ′ (xt ) − f ′ (xt−1 )) / (xt − xt−1 ) ≈ f ′′ (xt ),
which we can write as
f 0 (xt ) − f 0 (xt−1 ) = Ht (xt − xt−1 ) ≈ f 00 (xt )(xt − xt−1 ). (9.3)
Now, while Newton’s method for optimization uses the update step
xt+1 = xt − f 00 (xt )−1 f 0 (xt ), t ≥ 0,
the secant method works with the approximation Ht ≈ f 00 (xt ):
xt+1 = xt − Ht−1 f 0 (xt ), t ≥ 1. (9.4)
The fact that Ht approximates f 00 (xt ) in the twice differentiable case
was our motivation for the secant method, but in the method itself, there
is no reference to f 00 (which is exactly the point). All that is needed is the
secant condition from (9.3) that defines Ht :
f 0 (xt ) − f 0 (xt−1 ) = Ht (xt − xt−1 ). (9.5)
This view can be generalized to higher dimensions. If f : Rd → R is
differentiable, (9.4) becomes
xt+1 = xt − Ht−1 ∇f (xt ), t ≥ 1, (9.6)
where Ht ∈ Rd×d is now supposed to be a symmetric matrix satisfying the
d-dimensional secant condition
∇f (xt ) − ∇f (xt−1 ) = Ht (xt − xt−1 ). (9.7)

9.3 Quasi-Newton methods


If f is twice differentiable, the secant condition (9.7) along with the first-
order Taylor approximation of ∇f (x) yields the d-dimensional analog of
(9.3):
∇f (xt ) − ∇f (xt−1 ) = Ht (xt − xt−1 ) ≈ ∇2 f (xt )(xt − xt−1 ),

We might therefore hope that Ht ≈ ∇2 f (xt ), and this would mean that
(9.6) approximates Newton’s method. Therefore, whenever we use (9.6)
with a symmetric matrix satisfying the secant condition (9.7), we say that
we have a Quasi-Newton method.
In the 1-dimensional case, there is only one Quasi-Newton method—
the secant method (9.1). Indeed, equation (9.5) uniquely defines the num-
ber Ht in each step.
But in the d-dimensional case, the matrix Ht in the secant condition is
underdetermined, starting from d = 2: Taking the symmetry requirement
into account, (9.7) is a system of d equations in d(d + 1)/2 unknowns, so if
it is satisfiable at all, there are many solutions Ht . This raises the question
of which one to choose, and how to do so efficiently; after all, we want to
get some savings over Newton’s method.
Newton’s method is a Quasi-Newton method if and only if f is a non-
degenerate quadratic function (Exercise 60). Hence, Quasi-Newton meth-
ods do not generalize Newton’s method but form a family of related algo-
rithms.
The first Quasi-Newton method was developed by William C. Davi-
don in 1956; he desperately needed iterations that were faster than those
of Newton’s method in order to obtain results in the short time spans be-
tween expected failures of the room-sized computer that he used to run
his computations on.
But the paper he wrote about his new method got rejected for lacking
a convergence analysis, and for allegedly dubious notation. It became a
very influential Technical Report in 1959 [Dav59] and was finally officially
published in 1991, with a foreword giving the historical context [Dav91].
Ironically, Quasi-Newton methods are today the methods of choice in a
number of relevant machine learning applications.

9.4 Greenstadt’s approach


For efficiency reasons (we want to avoid matrix inversions), Quasi-Newton
methods typically directly deal with the inverse matrices H_t^{−1} . Suppose
that we have the iterates xt−1 , xt as well as the matrix H_{t−1}^{−1} ; now we want
to compute a matrix H_t^{−1} to perform the next Quasi-Newton step (9.6).
How should we choose H_t^{−1} ?

We draw some intuition from (the analysis of) Newton’s method. Re-
call that we have shown ∇2 f (xt ) to fluctuate only very little in the region
of extremely fast convergence (Lemma 8.6); in fact, Newton’s method is
optimal (one step!) when ∇2 f (xt ) is actually constant— this is the case of
a quadratic function (Lemma 8.1). Hence, in a Quasi-Newton method, it
also makes sense to have that Ht ≈ Ht−1 , or H_t^{−1} ≈ H_{t−1}^{−1} .
Greenstadt’s approach from 1970 [Gre70] is to update H_{t−1}^{−1} by an “error
matrix” Et to obtain

H_t^{−1} = H_{t−1}^{−1} + Et .
Moreover, the errors should be as small as possible, subject to the con-
straints that Ht−1 is symmetric and satisfies the secant condition (9.7). A
simple measure of error introduced by an update matrix E is its squared
Frobenius norm
kEk2_F := Σ^d_{i=1} Σ^d_{j=1} e_{ij}^2 .

Since Greenstadt considered the resulting Quasi-Newton method as “too


specialized”, he searched for a compromise between variability in the method
and simplicity of the resulting formulas. This led him to minimize the er-
ror term
kAEA> k2F ,
where A ∈ Rd×d is some fixed invertible transformation matrix. If A = I,
we recover the squared Frobenius norm of E.
Let us now fix t and simplify notation by setting
H := H_{t−1}^{−1} ,
H ′ := H_t^{−1} ,
E := Et ,
σ := xt − xt−1 ,
y := ∇f (xt ) − ∇f (xt−1 ),
r := σ − Hy.
The update formula then is

H ′ = H + E, (9.8)

and the secant condition ∇f (xt ) − ∇f (xt−1 ) = Ht (xt − xt−1 ) becomes

H ′ y = σ (⇔ Ey = r). (9.9)

Greenstadt’s approach can now be distilled into the following convex
constrained minimization problem in the d2 variables Eij :

minimize   (1/2) kAEA> k2F
subject to Ey = r                     (9.10)
           E > − E = 0

9.4.1 The method of Lagrange multipliers


Minimization subject to equality constraints can be done via the method
of Lagrange multipliers. Here we need it only for the case of linear equality
constraints in which case the method assumes a very simple form.
Theorem 9.1. Let f : Rd → R be convex and differentiable, C ∈ Rm×d for some
m ∈ N, e ∈ Rm , x? ∈ Rd such that Cx? = e. Then the following two statements
are equivalent.
(i) x? = argmin{f (x) : x ∈ Rd , Cx = e}
(ii) There exists a vector λ ∈ Rm such that
∇f (x? )> = λ> C.
The entries of λ are known as the Lagrange multipliers.
This is a consequence of earlier material. By Theorem 2.48, a Slater
point implies strong Lagrange duality. If, as in (i), there are only affine
equality constraints, the Slater point condition is void, and we obtain strong
Lagrange duality “for free”. In this case, the equivalence of (i) and (ii)
follows from the Karush-Kuhn-Tucker necessary and sufficient conditions
(Theorems 2.52 and 2.53).
For completeness we reprove Theorem 9.1 here, via elementary argu-
ments.
Proof. The easy direction is (ii)⇒(i): if λ as specified exists and x ∈ Rd
satisfies Cx = e, we get
∇f (x? )> (x − x? ) = λ> C(x − x? ) = λ> (e − e) = 0.
Hence, x? is a minimizer of f over {x ∈ Rd : Cx = e} by the optimality
condition of Lemma 2.28.
The other direction is Exercise 61.

9.4.2 Application to Greenstadt’s Update
In order to apply this method to (9.10), we need to compute the gradient
of f (E) = (1/2) kAEA> k2F . Formally, this is a d2 -dimensional vector, but it is
customary and more practical to write it as a matrix again,

∇f (E) = ( ∂f (E)/∂Eij )_{1≤i,j≤d} .

Fact 9.2 (Exercise 62). Let A, B ∈ Rd×d be two matrices. With f : Rd×d → R,
f (E) := (1/2) kAEBk2F , we have

∇f (E) = A> AEBB > .


The second step is to write the system of equations Ey = r, E > − E = 0
in Greenstadt’s convex program (9.10) in matrix form Cx = e so that we
can apply the method of Lagrange multipliers according to Theorem 9.1.
As there are d + d2 equations in d2 variables, it is best to think of the
rows of C as being indexed with elements i ∈ [d] := {1, . . . , d} for the first
d equations Ey = r, and pairs (i, j) ∈ [d] × [d] for the last d2 symmetry
constraints (more than half of which are redundant but we don’t care).
Columns of C are indexed with pairs (i, j) as well.
Let us denote by λ ∈ Rd the Lagrange multipliers for the first d equa-
tions and Γ ∈ Rd×d the ones for the last d2 ones.
In column (i, j) of C corresponding to variable Eij , we have entry yj in
row i as well as entries 1 (row (j, i)) and −1 (row (i, j)). Taking the inner
product with the Lagrange multipliers, this column therefore yields
λi yj + Γji − Γij .
After aggregating these entries into a d × d matrix, Theorem 9.1 tells us
that we should aim for equality with ∇f (E) as derived in Fact 9.2. We
have proved the following intermediate result.
Lemma 9.3. An update matrix E ? satisfying the constraints Ey = r (secant
condition in the next step) and E > − E = 0 (symmetry) is a minimizer of the
error function f (E) := (1/2) kAEA> k2F subject to the aforementioned constraints if
and only if there exists a vector λ ∈ Rd and a matrix Γ ∈ Rd×d such that
W E ? W = λy> + Γ> − Γ, (9.11)
where W := A> A (a symmetric and positive definite matrix).

Note that λy> is the outer product of a column and a row vector and
hence a matrix. As we assume A to be invertible, the quadratic func-
tion f (E) is easily seen to be strongly convex and as a consequence has
a unique minimizer E ? subject to the set of linear equations in (9.10) (see
Lemma 3.12 which also applies if we minimize over a closed set). Hence,
we know that the minimizer E ? and corresponding Lagrange multipliers
λ, Γ exist.

9.4.3 The Greenstadt family


We need to solve the system of equations
Ey = r, (9.12)
>
E − E = 0, (9.13)
W EW = λy> + Γ> − Γ. (9.14)
This system is linear in E, λ, Γ, hence easy to solve computationally. How-
ever, we want a formula for the unique solution E ? in terms of the pa-
rameters W, y, σ = r + Hy. In the following derivation, we closely follow
Greenstadt [Gre70, pages 4–5].
With M := W −1 (which exists since W = A> A is positive definite),
(9.14) can be rewritten as
E = M (λy> + Γ> − Γ) M. (9.15)
Transposing this system (using that M is symmetric) yields
E > = M (yλ> + Γ − Γ> ) M.


By symmetry (9.13), we can subtract the latter two equations to obtain


M (λy> − yλ> + 2Γ> − 2Γ) M = 0.


As M is invertible, this is equivalent to


Γ> − Γ = (1/2) (yλ> − λy> ),
so we can eliminate Γ by substituting back into (9.15):
 
E = M (λy> + (1/2) yλ> − (1/2) λy> ) M = (1/2) M (λy> + yλ> ) M. (9.16)

To also eliminate λ, we now use (9.12)—the secant condition in the next
step—to get
Ey = (1/2) M (λy> + yλ> ) M y = r.
Premultiplying with 2M −1 gives

2M −1 r = (λy> + yλ> ) M y = λ y> M y + y λ> M y.


Hence,
λ = (1/(y> M y)) ( 2M −1 r − y λ> M y ). (9.17)
To get rid of λ on the right hand side, we premultiply this with y> M to
obtain

y> M λ = (1/(y> M y)) ( 2y> r − (y> M y)(λ> M y) ) = 2y> r/(y> M y) − λ> M y.

Since the scalar z := y> M λ = λ> M y appears on both sides, it follows that

z = λ> M y = y> r / (y> M y).
This in turn can be substituted into the right-hand side of (9.17) to remove
λ there, and we get
λ = (1/(y> M y)) ( 2M −1 r − (y> r)/(y> M y) · y ).

Consequently,

λy> = (1/(y> M y)) ( 2M −1 ry> − (y> r)/(y> M y) · yy> ),
yλ> = (1/(y> M y)) ( 2yr> M −1 − (y> r)/(y> M y) · yy> ).
This gives us an explicit formula for E, by substituting the previous ex-
pressions back into (9.16). For this, we compute
M λy> M = (1/(y> M y)) ( 2ry> M − (y> r)/(y> M y) · M yy> M ),
M yλ> M = (1/(y> M y)) ( 2M yr> − (y> r)/(y> M y) · M yy> M ),

and consequently,

E = (1/2) M (λy> + yλ> ) M = (1/(y> M y)) ( ry> M + M yr> − (y> r)/(y> M y) · M yy> M ).
(9.18)
Finally, we use r = σ − Hy to obtain the update matrix E ? in terms
of the original parameters H = H_{t−1}^{−1} (previous approximation of the in-
verse Hessian that we now want to update to H_t^{−1} = H ′ = H + E ? ),
σ = xt − xt−1 (previous Quasi-Newton step) and y = ∇f (xt ) − ∇f (xt−1 )
(previous change in gradients). This gives us the Greenstadt family of
Quasi-Newton methods.

Definition 9.4. Let M ∈ Rd×d be a symmetric and invertible matrix. Consider


the Quasi-Newton method

xt+1 = xt − Ht−1 ∇f (xt ), t ≥ 1,

where H0 = I (or some other positive definite matrix), and H_t^{−1} = H_{t−1}^{−1} + Et is
chosen for all t ≥ 1 in such a way that H_t^{−1} is symmetric and satisfies the secant
condition
condition
∇f (xt ) − ∇f (xt−1 ) = Ht (xt − xt−1 ).
For any fixed t, set
H := H_{t−1}^{−1} ,
H ′ := H_t^{−1} ,
σ := xt − xt−1 ,
y := ∇f (xt ) − ∇f (xt−1 ),

and define

E ? = (1/(y> M y)) ( σy> M + M yσ > − Hyy> M − M yy> H
      − (y> σ − y> Hy)/(y> M y) · M yy> M ). (9.19)

If the update matrix Et = E ? is used, the method is called the Greenstadt


method with parameter M .

9.4.4 The BFGS method
In his paper, Greenstadt suggested two obvious choices for the matrix M
in Definition 9.4, namely M = H (the previous approximation of the in-
verse Hessian) and M = I. In the next paper of the same issue of the same
journal, Goldfarb suggested to use the matrix M = H 0 , the next approxi-
mation of the inverse Hessian. Even though we don’t yet have it, we can
use it in the formula (9.19) since we know that H 0 will by design satisfy the
secant condition H 0 y = σ. And as M always appears next to y in (9.19),
M y = H 0 y = σ, so H 0 disappears from the formula!
Definition 9.5. The BFGS method is the Greenstadt method with parameter
M = H ′ = H_t^{−1} in step t, in which case the update matrix E ? assumes the form

E ? = (1/(y> σ)) ( 2σσ > − Hyσ > − σy> H − (y> σ − y> Hy)/(σ > y) · σσ > )
    = (1/(y> σ)) ( − Hyσ > − σy> H + (1 + y> Hy/(y> σ)) σσ > ), (9.20)

where H = H_{t−1}^{−1} , σ = xt − xt−1 , y = ∇f (xt ) − ∇f (xt−1 ).
We leave it as Exercise 63 (i) to prove that the denominator y> σ appear-
ing twice in the formula is positive, unless the function f is flat between
the iterates xt−1 and xt . And under y> σ > 0, the BFGS method has an-
other nice property: if the previous matrix H is positive definite, then also
the next matrix H 0 is positive definite; see Exercise 63 (ii). In this sense, the
matrices Ht−1 behave like proper inverse Hessians.
The method is named after Broyden, Fletcher, Goldfarb and Shanno
who all came up with it independently around 1970. Greenstadt’s name is
mostly forgotten.
Let’s take a step back and see what we have achieved. Recall that our
starting point was that Newton’s method needs to compute and invert
Hessian matrices in each iteration and therefore has in practice a cost of
O(d3 ) per iteration. Did we improve over this?
First of all, any method in Greenstadt’s family avoids the computation
of Hessian matrices altogether. Only gradients are needed. In the BFGS
method in particular, the cost per iteration drops to O(d2 ). Indeed, the
computation of the update matrix E ? in Definition 9.5 reduces to matrix-
vector multiplications and outer-product computations, all of which can
be done in O(d2 ) time.
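As a concrete illustration, here is a minimal Python sketch of the BFGS update (9.20) combined with the Quasi-Newton step (9.6) on a made-up, well-conditioned quadratic test problem (the matrix A, the vector b, the plain unit step size and the iteration count are illustrative assumptions, not part of the text):

import numpy as np

A = np.array([[1.0, 0.2], [0.2, 0.8]])   # made-up strictly convex quadratic
b = np.array([1.0, -0.5])                # f(x) = 0.5 x^T A x - b^T x
grad = lambda x: A @ x - b               # only gradients are used, no Hessians

H = np.eye(2)                            # H approximates the *inverse* Hessian
x_prev = np.zeros(2)
x = x_prev - H @ grad(x_prev)            # first step with H_0 = I
for t in range(20):
    sigma = x - x_prev                   # σ = x_t - x_{t-1}
    y = grad(x) - grad(x_prev)           # y = ∇f(x_t) - ∇f(x_{t-1})
    ys = float(y @ sigma)
    if ys <= 1e-12:                      # safeguard; y^T σ > 0 for non-flat f
        break
    Hy = H @ y
    # BFGS update (9.20): H' = H + E*
    E = (-(np.outer(Hy, sigma) + np.outer(sigma, Hy))
         + (1.0 + float(y @ Hy) / ys) * np.outer(sigma, sigma)) / ys
    H = H + E
    x_prev, x = x, x - H @ grad(x)       # Quasi-Newton step (9.6)

print(x, np.linalg.solve(A, b))          # both ≈ the minimizer A^{-1} b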

Newton and Quasi-Newton methods are often performed with scaled
steps. This means that the iteration becomes

xt+1 = xt − αt Ht−1 ∇f (xt ), t ≥ 1, (9.21)

for some αt ∈ R+ . This parameter can for example be chosen such that
f (xt+1 ) is minimized (line search). Another approach is backtracking line
search where we start with αt = 1, and as long as this does not lead to
sufficient progress, we halve αt . Line search ensures that the matrices Ht−1
in the BFGS method remain positive definite [Gol70].
As the Greenstadt update method just depends on the step σ = xt −
xt−1 but not on how it was obtained, the update works in exactly the same
way as before even if scaled steps are being used.

9.4.5 The L-BFGS method


In high dimensions d, even an iteration cost of O(d2 ) as in the BFGS method
may be prohibitive. In fact, already at the end of the 1970s, the first limited
memory (and limited time) variants of the method have been proposed.
Here we essentially follow Nocedal [Noc80]. The idea is to use only in-
formation from the previous m iterations, for some small value of m, and
“forget” anything older. In order to describe the resulting L-BFGS method,
we first rewrite the BFGS update formula in product form.

Observation 9.6. With E ? as in Definition 9.5 and H 0 = H + E ? , we have

H ′ = ( I − σy>/(y> σ) ) H ( I − yσ >/(y> σ) ) + σσ >/(y> σ). (9.22)

To verify this, simply expand the product in the right-hand side and
compare with (9.20).
We further observe that we do not need the actual matrix H 0 = Ht−1 to
perform the next Quasi-Newton step (9.6), but only the vector H 0 ∇f (xt ).
Here is the crucial insight.

Lemma 9.7. Let H, H 0 as in Observation 9.6, i.e.

H ′ = ( I − σy>/(y> σ) ) H ( I − yσ >/(y> σ) ) + σσ >/(y> σ).

Let g0 ∈ Rd . Suppose that we have an oracle to compute s = Hg for any vector
g. Then s0 = H 0 g0 can be computed with one oracle call and O(d) additional
arithmetic operations, assuming that σ and y are known.

Proof. From (9.22), we conclude that

H ′ g′ = ( I − σy>/(y> σ) ) H ( I − yσ >/(y> σ) ) g′ + ( σσ >/(y> σ) ) g′ ,

where we abbreviate, from the inside out, g := ( I − yσ >/(y> σ) ) g′ , s := Hg,
w := ( I − σy>/(y> σ) ) s, h := ( σσ >/(y> σ) ) g′ , and z := w + h = H ′ g′ .

We compute the vectors h, g, s, w, z in turn. We have

h = ( σσ >/(y> σ) ) g′ = σ · (σ > g′ )/(y> σ),

so h can be computed with two inner products, a real division, and a mul-
tiplication of σ with a scalar. For g, we obtain

g = ( I − yσ >/(y> σ) ) g′ = g′ − y · (σ > g′ )/(y> σ),

which is a multiplication of y with a scalar that we already know, followed


by a vector addition. To get s = Hg, we call the oracle. For w, we similarly
have
w = ( I − σy>/(y> σ) ) s = s − σ · (y> s)/(y> σ),

which is one inner product (the other one we already know), a real division,
a multiplication of σ with a scalar, and a vector addition. Finally,

H 0 g0 = z = w + h

is a vector addition. In total, we needed three inner product computations,


three scalar multiplications, three vector additions, two real divisions, and
one oracle call.

How do we implement the oracle? We simply apply the previous
Lemma recursively. Let

σ k = xk − xk−1 ,
yk = ∇f (xk ) − ∇f (xk−1 )

be the values of σ and y in iteration k ≤ t. When we perform the Quasi-


Newton step xt+1 = xt − Ht−1 ∇f (xt ) in iteration t ≥ 1, we have already
computed these vectors for k = 1, . . . , t. Using Lemma 9.7, we could there-
fore call the recursive procedure in Algorithm 1 with k = t, g0 = ∇f (xt )
to compute the required vector Ht−1 ∇f (xt ) in iteration t. To maintain the
immediate connection to Lemma 9.7, we refrain from introducing extra
variables for values that occur several times; but in an actual implementa-
tion, this would be done, of course.

Algorithm 1 Recursive view of the BFGS method. To compute Ht−1 ∇f (xt ),


call the function with arguments (t, ∇f (xt )); values σ k , yk from iterations
1, . . . , t are assumed to be available.
function BFGS-STEP(k, g′ )              ▷ returns H_k^{−1} g′
    if k = 0 then
        return H_0^{−1} g′
    else                                ▷ apply Lemma 9.7
        h = σ k · (σ >k g′ ) / (y>k σ k )
        g = g′ − yk · (σ >k g′ ) / (y>k σ k )
        s = BFGS-STEP(k − 1, g)
        w = s − σ k · (y>k s) / (y>k σ k )
        z = w + h
        return z
    end if
end function

By Lemma 9.7, the runtime of BFGS- STEP(t, ∇f (xt )) is O(td). For t >
d, this is slower (and needs more memory) than the standard BFGS step
according to Definition 9.5 which always takes O(d2 ) time.

The benefit of the recursive variant is that it can easily be adapted to
a step that is faster (and needs less memory) than the standard BFGS step.
The idea is to let the recursion bottom out after a fixed number m of recur-
sive calls (in practice, values of m ≤ 10 are not uncommon). The step then
has runtime O(md) which is a substantial saving over the standard step if
m is much smaller than d.
The only remaining question is what we return when the recursion
now bottoms out prematurely at k = t − m. As we don’t know the matrix
H_{t−m}^{−1} , we cannot return H_{t−m}^{−1} g′ (which would be the correct output in this
case). Instead, we pretend that we have started the whole method just now
and use our initial matrix H0 instead of Ht−m .1 The resulting algorithm is
depicted in Algorithm 2.

Algorithm 2 The L-BFGS method. To compute Ht−1 ∇f (xt ) based on the


previous m iterations, call the function with arguments (t, m, ∇f (xt )); val-
ues σ k , yk from iterations t − m + 1, . . . , t are assumed to be available.
function L-BFGS-STEP(k, ℓ, g′ )         ▷ ℓ ≤ k; returns s′ ≈ H_k^{−1} g′
    if ℓ = 0 then
        return H_0^{−1} g′
    else                                ▷ apply Lemma 9.7
        h = σ k · (σ >k g′ ) / (y>k σ k )
        g = g′ − yk · (σ >k g′ ) / (y>k σ k )
        s = L-BFGS-STEP(k − 1, ℓ − 1, g)
        w = s − σ k · (y>k s) / (y>k σ k )
        z = w + h
        return z
    end if
end function

Note that the L-BFGS method is still a Quasi-Newton method as long


as m ≥ 1: if we go through at least one update step of the form H 0 = H +E,
¹ In practice, we can do better: as we already have some information from previous
steps, we can use this information to construct a more tuned H0 . We don’t go into this
here.

the matrix H 0 will satisfy the secant condition by design, irrespective of H.
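A minimal Python sketch of Algorithm 2 follows; the memory size m, the quadratic test function, the unit step size and the bookkeeping via dictionaries are illustrative assumptions, not part of the text:

import numpy as np

def lbfgs_step(k, ell, g, sigmas, ys):
    # Recursive L-BFGS step (Algorithm 2): returns s' ≈ H_k^{-1} g.
    # sigmas[k] = σ_k = x_k - x_{k-1}, ys[k] = y_k = ∇f(x_k) - ∇f(x_{k-1});
    # the recursion bottoms out with H_0 = I after ell levels.
    if ell == 0:
        return g                           # H_0 = I (any positive definite matrix would do)
    sk, yk = sigmas[k], ys[k]
    rho = 1.0 / float(yk @ sk)
    h = sk * (rho * float(sk @ g))         # h = σ_k (σ_k^T g') / (y_k^T σ_k)
    g2 = g - yk * (rho * float(sk @ g))    # g = g' - y_k (σ_k^T g') / (y_k^T σ_k)
    s = lbfgs_step(k - 1, ell - 1, g2, sigmas, ys)
    w = s - sk * (rho * float(yk @ s))     # w = s - σ_k (y_k^T s) / (y_k^T σ_k)
    return w + h                           # z = w + h

# illustrative driver on a made-up quadratic f(x) = 0.5 x^T A x - b^T x
A = np.diag([1.0, 0.8, 1.2]); b = np.array([1.0, 1.0, 1.0])
grad = lambda x: A @ x - b
m = 5
sigmas, ys = {}, {}
x_prev = np.zeros(3); x = x_prev - grad(x_prev)        # first step with H_0 = I
for t in range(1, 16):
    s_t, y_t = x - x_prev, grad(x) - grad(x_prev)
    if float(y_t @ s_t) <= 1e-16:                      # already converged numerically
        break
    sigmas[t], ys[t] = s_t, y_t
    direction = lbfgs_step(t, min(m, t), grad(x), sigmas, ys)
    x_prev, x = x, x - direction
print(x, np.linalg.solve(A, b))                        # both ≈ A^{-1} b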

9.5 Exercises
Exercise 59. Consider a step of the secant method:
xt+1 = xt − f (xt ) · (xt − xt−1 ) / (f (xt ) − f (xt−1 )),   t ≥ 1.

Assuming that xt 6= xt−1 and f (xt ) 6= f (xt−1 ), prove that the line through
the two points (xt−1 , f (xt−1 )) and (xt , f (xt )) intersects the x-axis at the point
x = xt+1 .

Exercise 60. Let f : Rd → R be a twice differentiable function with nonzero


Hessians everywhere. Prove that the following two statements are equivalent.

(i) f is a nondegenerate quadratic function, meaning that


f (x) = (1/2) x> M x − q> x + c,
where M ∈ Rd×d is an invertible symmetric matrix, q ∈ Rd , c ∈ R (see
also Lemma 8.1).

(ii) Applied to f , Newton’s update step

xt+1 := xt − ∇2 f (xt )−1 ∇f (xt ), t≥1

defines a Quasi-Newton method for all x0 , x1 ∈ Rd .

Exercise 61. Prove the direction (i)⇒(ii) of Theorem 9.1! You may want to
proceed in the following steps.

1. Prove the Poor Man’s Farkas Lemma: a system of n linear equations


Ax = b in d variables has a solution if and only if for all λ ∈ Rn , λ> A =
0> implies λ> b = 0. (You may use the fact that the row rank of a matrix
equals its column rank.)

2. Argue that x? = argmin{∇f (x? )> x : x ∈ Rd , Cx = e}.

3. Apply the Poor Man’s Farkas Lemma.

Exercise 62. Prove Fact 9.2!

Exercise 63. Consider the BFGS method (Definition 9.5).

(i) Prove that y> σ > 0, unless xt = xt−1 , or f (λxt +(1−λ)xt−1 ) = λf (xt )+
(1 − λ)f (xt−1 ) for all λ ∈ (0, 1).

(ii) Prove that if H is positive definite and y> σ > 0, then also H 0 is positive
definite. You may want to use the product form of the BFGS update as
developed in Observation 9.6.

Chapter 10

Subgradient Methods

Contents
10.1 Subgradient and Subdifferential . . . . . . . . . . . . . . . . . 209
10.1.1 Definition and examples . . . . . . . . . . . . . . . . . 209
10.1.2 Topological properties . . . . . . . . . . . . . . . . . . 211
10.1.3 Subdifferential and directional derivative . . . . . . . 213
10.1.4 Calculus of Subgradient . . . . . . . . . . . . . . . . . 214
10.2 Subgradient Method . . . . . . . . . . . . . . . . . . . . . . . 215
10.2.1 Subgradient Descent . . . . . . . . . . . . . . . . . . . 216
10.2.2 O(1/√t) convergence for convex functions . . . . . . 217
10.2.3 O(1/t) convergence for strongly convex functions . . 221
10.3 Lower Bound Complexity . . . . . . . . . . . . . . . . . . . . 223
10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

208
Figure 10.1: Subgradients

10.1 Subgradient and Subdifferential


10.1.1 Definition and examples
Definition 10.1. Let f : dom(f ) → R ∪ {+∞} be a convex function. A vector
g ∈ Rd is a subgradient of f at a point x ∈ dom(f ) if
f (y) ≥ f (x) + g> (y − x), ∀y ∈ dom(f ).
The set of all subgradients at x is called the subdifferential of f at x, denoted as
∂f (x).
The notion of subgradient can be viewed as a generalization of gradi-
ent, for functions which are not necessarily differentiable. The subgradient
is not always unique. As shown in the figure below, there exists multiple
subgradients at x = x2 while only one subgradient at x = x1 .
As can be easily seen from the above figure, any subgradient at x forms
a supporting hyperplane for the epigraph of a function. Note that for any
fixed x ∈ dom(f ), and any subgradient g ∈ ∂f (x), we have by definition
that
f (y) − g> y ≥ f (x) − g> x, ∀y ∈ dom(f ),
which is equivalent to saying that t − g> y ≥ f (x) − g> x, ∀(y, t) ∈ epi(f ). This
further implies that

(−g, 1)> (y, t) ≥ (−g, 1)> (x, f (x)),   ∀(y, t) ∈ epi(f ).

Hence, the hyperplane H := {(y, t) : (−g, 1)> (y, t) = (−g, 1)> (x, f (x))} is
a supporting hyperplane of the convex set epi(f ).

Lemma 10.2. If f is convex and differentiable at x ∈ dom(f ), then ∂f (x) =
{∇f (x)}.
Proof. By definition, with y = x + εd and g ∈ ∂f (x), we have f (x + εd) ≥
f (x) + εg> d, i.e.,

(f (x + εd) − f (x)) / ε ≥ g> d,   ∀d, ∀ε > 0.
Letting ε → 0, we have ∇f (x)> d ≥ g> d, ∀d. This only holds when g =
∇f (x).
Lemma 10.3 (Exercise 64). Prove that if f is differentiable at x ∈ dom(f ), then
∂f (x) ⊆ {∇f (x)}.
Example 10.4. Below, we provide some specific examples of subgradients.
(a) f (x) = (1/2) x2 , ∂f (x) = {x};

(b) f (x) = |x|, ∂f (x) = {sgn(x)} for x ≠ 0 and ∂f (0) = [−1, 1];

(c) f (x) = −√x for x ≥ 0 and f (x) = +∞ otherwise; ∂f (0) = ∅;

(d) f (x) = 1 for x = 0, f (x) = 0 for x > 0, and f (x) = +∞ otherwise; ∂f (0) = ∅.

Note in examples (b)-(d), the functions are non-differentiable at x = 0.
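Example (b) extends coordinatewise to the ℓ1-norm in Rd. Here is a small sketch of a corresponding subgradient oracle together with a numerical check of the defining inequality of Definition 10.1 (the choice 0 at a kink is just one arbitrary valid selection from [−1, 1]):

import numpy as np

def subgradient_l1(x):
    # one valid subgradient of f(x) = ||x||_1: sign(x_i) where x_i != 0,
    # and 0 (an arbitrary value in [-1, 1]) where x_i == 0
    return np.sign(x)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, -2.0])
g = subgradient_l1(x)
for _ in range(1000):      # check f(y) >= f(x) + g^T (y - x) at random points y
    y = rng.normal(size=3)
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-12
print("subgradient inequality holds at all sampled points")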

Lemma 10.5 (Exercise 65). Let f : dom(f ) → R be convex, dom(f ) open,


B ∈ R+ . Then the following two statements are equivalent.
(i) kgk ≤ B for all x ∈ dom(f ) and all g ∈ ∂f (x).
(ii) |f (x) − f (y)| ≤ Bkx − yk for all x, y ∈ dom(f ).
Lemma 10.6. Suppose that f : dom(f ) → R and x ∈ dom(f ). If 0 ∈ ∂f (x),
then x is a global minimum.
Proof. By definition of subgradients, g = 0 ∈ ∂f (x) gives
f (y) ≥ f (x) + g> (y − x) = f (x), ∀y ∈ dom(f ),
so x is a global minimum.

10.1.2 Topological properties
In the following theorem, we discuss some topological properties of the
subdifferential set.
First of all, it can be easily seen that the subdifferential set is always
closed and convex.
Lemma 10.7. Let f be a convex function and x ∈ dom(f ). Then ∂f (x) is
convex and closed.
Proof. Convexity and closedness are evident due to

∂f (x) = g ∈ Rn : f (y) ≥ f (x) + g> (y − x), ∀y




= ∩y g ∈ Rn : f (y) ≥ f (x) + g> (y − x)




is the solution to an infinite system of linear inequalities. The intersection


of arbitrary number of closed and convex sets is still closed and convex.

Secondly, for a convex function, the subgradient always exists at any


point in the relative interior of the domain. This result is tied to the so-
called hyperplane separation theorem for convex sets, which is one of the
most important properties that distinguish convex sets from nonconvex
sets.
Definition 10.8. The relative interior of set X is defined as

relint(X) = {x : ∃r > 0, such that B(x, r) ∩ Aff(X) ⊆ X} ,

which is the set of interior points relative to the affine subspace that contains X.
Definition 10.9. Let S and T be two nonempty convex sets in Rn . A hyperplane
H = {x ∈ Rn : a> x = b} with a ≠ 0 is said to separate S and T if S ∪ T ⊄ H
and

S ⊂ H − = {x ∈ Rn : a> x ≤ b},
T ⊂ H + = {x ∈ Rn : a> x ≥ b}.

The hyperplane H is said to strictly separate S and T if

S ⊂ H −− = {x ∈ Rn : a> x < b},
T ⊂ H ++ = {x ∈ Rn : a> x > b}.

Figure 10.2: (a) Separation of two sets, (b) Strict separation of two sets

Theorem 10.10 (Hyperplane separation theorem, [Roc97]). Let S and T be


two nonempty convex sets. Then S and T can be separated if and only if

relint(S) ∩ relint(T ) = ∅.

As an immediate corollary, we have

Corollary 10.11. Let S be a nonempty convex set and x0 ∈ ∂S (the boundary of
S). There exists a supporting hyperplane H = {x : a> x = a> x0 } with a ≠ 0 such
that
S ⊂ {x : a> x ≤ a> x0 }, and x0 ∈ H.


We are now ready to show the existence of subgradient in the relative


interior of the domain of a convex function.
Theorem 10.12. Let f be a convex function. Then ∂f (x) is nonempty and
bounded if x ∈ relint(dom(f )).
Proof. (Non-emptiness) W.l.o.g., let’s assume dom(f ) is full-dimensional
and x ∈ int(dom(f )). Since epi(f ) is convex and (x, f (x)) belongs to its
boundary, by the hyperplane separation theorem, ∃α = (s, β) 6= 0, s.t.

s> y + βt ≥ s> x + βf (x), ∀(y, t) ∈ epi(f ).

Clearly, we must have β ≥ 0 (letting t → ∞). Since x ∈ int(dom(f )), we
cannot have β = 0: otherwise, to ensure s> y ≥ s> x, ∀y ∈ B(x, δ) for some small
enough δ > 0, we would need s = 0, which is impossible. Hence, β > 0, and
setting g = −β −1 s gives

f (y) ≥ f (x) + g> (y − x), ∀y.

(Boundedness) Suppose ∂f (x) is unbounded, i.e. ∃gk ∈ ∂f (x), s.t.
kgk k2 → ∞, as k → ∞. Since x ∈ int(dom(f )), ∃δ > 0, s.t. B(x, δ) ⊆
dom(f ). Hence, yk = x + δ gk /kgk k2 ∈ dom(f ). By convexity,

f (yk ) ≥ f (x) + gk> (yk − x) = f (x) + δ kgk k2 → ∞.

However, this contradicts the continuity of f over int(dom(f )).


We also have the converse of the theorem. Note that the condition (of
existence of subgradient) should hold for the whole domain but not just
for the relative interior of the domain.
Lemma 10.13. Let f : dom(f ) → R be a function such that dom(f ) is convex
and ∂f (x) 6= ∅ for all x ∈ dom(f ). Then f is convex.
Proof. For any x, y ∈ dom(f ) and λ ∈ [0, 1], denote z = λx + (1 − λ)y ∈
dom(f ). Since ∂f (z) 6= ∅, let g ∈ ∂f (z) and we have
f (x) ≥ f (z) + g> (x − z),
f (y) ≥ f (z) + g> (y − z),

and taking the convex combination of these two inequalities with weights λ and
1 − λ yields λf (x) + (1 − λ)f (y) ≥ f (λx + (1 − λ)y).

Remark 10.14 (Exercise 66). The subdifferential of a convex function f (x) at


x ∈ dom(f ) is a monotone operator, i.e.,

(u − v)> (x − y) ≥ 0, ∀x, y ∈ dom(f ), u ∈ ∂f (x), v ∈ ∂f (y).

10.1.3 Subdifferential and directional derivative


Recall that the directional derivative of a function f at x along direction d
is
f ′ (x; d) = lim_{δ→0+} (f (x + δd) − f (x)) / δ.
If f is differentiable, then f 0 (x; d) = ∇f (x)> d.
Lemma 10.15 (Exercise 67). When f is convex, the ratio φ(δ) := (f (x + δd) − f (x))/δ
is non-decreasing in δ > 0.
Theorem 10.16. Let f be convex and x ∈ int(dom(f )), then

f ′ (x; d) = max_{g∈∂f (x)} g> d.

Proof. By definition, we have f (x + δd) − f (x) ≥ δg> d for all δ and g ∈
∂f (x). Hence, f ′ (x; d) ≥ g> d, ∀g ∈ ∂f (x). Moreover, this implies that

f ′ (x; d) ≥ max_{g∈∂f (x)} g> d.

It suffices to show that ∃g̃ ∈ ∂f (x), s.t. f 0 (x; d) ≤ g̃> d. Consider the two
sets

C1 = {(y, t) : f (y) < t} ,


C2 = {(y, t) : y = x + αd, t = f (x) + αf 0 (x; d), α ≥ 0} .

Claim: C1 ∩ C2 = ∅ and C1 , C2 are convex and nonempty. This is because


f (x + αd) ≥ f (x) + αf 0 (x; d), ∀α ≥ 0. (Due to Lemma 10.15).
By the hyperplane separation theorem, ∃(g0 , β) 6= 0, s.t.

g0> (x + αd) + β(f (x) + αf 0 (x; d)) ≤ g0> y + βt, ∀α ≥ 0, ∀t > f (y)

One can easily show that β > 0. Let g̃ = β −1 g0 ,

g̃T (x + αd) + f (x) + αf 0 (x; d) ≤ g̃> y + f (y), ∀α ≥ 0.

Setting α = 0, we have

g̃> x + f (x) ≤ g̃> y + f (y) ⇔ −g̃ ∈ ∂f (x).

Setting α = 1 and y = x, we have

g̃> d + f 0 (x; d) ≤ 0 ⇔ f 0 (x; d) ≤ −g̃> d.

Therefore, we have shown that f 0 (x; d) = maxg∈∂f (x) g> d.

10.1.4 Calculus of Subgradient


Determining the subdifferential set of a convex function at a given point
is in general very difficult. The following calculus of subdifferential sets
provides a constructive way to compute subgradients of convex func-
tions arising from convexity-preserving operators.
1. Taking conic combination: If h(x) = λf (x) + µg(x), where λ, µ ≥ 0
and f, g are both convex, then

∂h(x) = λ∂f (x) + µ∂g(x), ∀x ∈ int(dom(h)).

2. Taking affine composition: If h(x) = f (Ax + b), where f is convex,
then
∂h(x) = A> ∂f (Ax + b).
3. Taking supremum: If h(x) = supα∈A fα (x) and each fα (x) is convex,
then
∂h(x) ⊇ conv{∂fα (x)|α ∈ α(x)}
where α(x) := {α : h(x) = fα (x)}.
4. Taking superposition: If h(x) = F (f1 (x), . . . , fm (x)), where F (y1 , . . . , ym )
is non-decreasing and convex, then
( m )
X
∂h(x) ⊇ di ∂fi (x) : (d1 , . . . , dm ) ∈ ∂F (y1 , . . . , ym ) .
i=1

Example 10.17. Let h(x) = maxy∈C f (x, y) where f (x, y) is convex in x for any
y and C is closed, then ∂f (x, y∗ (x)) ⊂ ∂h(x), where y∗ (x) = argmaxy∈C f (x, y).
This is because if g ∈ ∂f (x, y∗ (x)), we have
h(z) ≥ f (z, y∗ (x)) ≥ f (x, y∗ (x)) + g> (z − x) = h(x) + g> (z − x).

Lemma 10.18 (Exercise 68). Consider the function f (x) = kxk, here k·k is a
general norm. Then
∂f (x) = {g : g> x = kxk and kgk∗ ≤ 1}.
where k·k∗ is the dual norm: kyk∗ = maxx:kxk≤1 y> x. In particular, ∂f (0) :=
{g : kgk∗ ≤ 1}.

10.2 Subgradient Method


Consider the generic convex minimization
min f (x)
s.t. x ∈ X
where f is convex, possibly non-differentiable, and X ⊆ dom(f ) is closed
and convex. Assume the problem is solvable with the optimal solution
and value denoted as x? , f ? . Accordingly, we can define two important
quantities of X and f as

• R2 := max_{x,y∈X} kx − yk22 is the squared diameter of X.

• B := sup_{x,y∈X} |f (x) − f (y)| / kx − yk2 < +∞ is the constant that characterizes
  the Lipschitz continuity of f under the k·k2 norm.

Note that the convexity of the function f can actually lead to local Lips-
chitz continuity. We will simply make this an assumption in the subse-
quent text. In addition, as we showed earlier, a subgradient of f always
exists at any interior point of X, which motivates the subgradient method.

10.2.1 Subgradient Descent


The subgradient method, also called Subgradient Descent, was first proposed
by Naum Zuselevich Shor in 1967.

Algorithm 3 Subgradient Method


1: Initialize x1 ∈ X
2: for t = 1, . . . , T do
3: xt+1 = ΠX (xt − γt gt ), where gt ∈ ∂f (xt )
4: end for

In the above algorithm,

• gt ∈ ∂f (xt ) is a subgradient of f at xt ;

• γt > 0 is the stepsize;

• ΠX (x) := argminy∈X kx − yk22 is the projection operation.

When f is differentiable, this reduces to Projected Gradient Descent.

Remark 10.19. Note that unlike Gradient Descent, Subgradient Descent is not
a descent method, i.e., moving along the negative direction of subgradient is not
necessarily decreasing the objective function.
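A minimal Python sketch of Algorithm 3 on a made-up nonsmooth problem (minimizing f (x) = kx − ck1 over the Euclidean unit ball; the data c, the stepsize and the iteration count are illustrative assumptions). Note that, in line with the remark above, f (xt) itself need not decrease, which is why the best value seen so far is tracked:

import numpy as np

c = np.array([2.0, -1.0])                # made-up problem data

def subgrad(x):                          # a subgradient of ||x - c||_1
    return np.sign(x - c)

def project(x):                          # projection onto X = unit Euclidean ball
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

x = np.zeros(2)
best = np.inf
for t in range(1, 2001):
    g = subgrad(x)
    gamma = 1.0 / np.sqrt(t)             # a diminishing stepsize
    x = project(x - gamma * g)           # subgradient step followed by projection
    best = min(best, np.sum(np.abs(x - c)))
print(best)                              # best objective value seen so far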

Choices of Stepsizes Stepsize γt is an important parameter that needs to
be selected during the iterations, which will affect the convergence analy-
sis as we will show later. The most commonly used stepsizes include:

1. Constant stepsize: γt ≡ γ > 0.


2. Scaled stepsize: γt = γ / kgt k2 .

3. Non-summable but diminishing stepsize satisfying:


Σ^∞_{t=1} γt = ∞,   lim_{t→∞} γt = 0.

4. Non-summable but square-summable stepsize satisfying:


Σ^∞_{t=1} γt = ∞,   Σ^∞_{t=1} γt2 < ∞.

This is also called the Robbins-Monro stepsize. For example, γt = 1/t.

5. Polyak stepsize: Assuming f ? = f (x? ) is known, choose

γt = (f (xt ) − f ? ) / kgt k22 .

To be discussed later, Subgradient Descent behaves substantially differ-


ent from Gradient Descent. The choices of stepsize, rates of convergence,
and criterion used to measure the convergence are different. As mentioned
earlier, subgradient descent is not a descent method. Hence, we will need
to introduce other quantities to measure the convergence, instead of the
quantity f (xt ) − f ? used earlier for Gradient Descent.


10.2.2 O(1/√t) convergence for convex functions
Theorem 10.20. Assume f is convex, then Subgradient Descent satisfies

min_{1≤t≤T} f (xt ) − f ? ≤ ( Σ^T_{t=1} γt )^{−1} ( (1/2) kx1 − x? k22 + (1/2) Σ^T_{t=1} γt2 kgt k22 ). (10.1)

and

f (x̂T ) − f ? ≤ ( Σ^T_{t=1} γt )^{−1} ( (1/2) kx1 − x? k22 + (1/2) Σ^T_{t=1} γt2 kgt k22 ), (10.2)

where x̂T = ( Σ^T_{t=1} γt )^{−1} Σ^T_{t=1} γt xt ∈ X.

Proof. The proof uses the similar technique as in the convergence for Gra-
dient Descent. First, by definition, we have

kxt+1 − x? k22 = kΠX (xt − γt gt ) − ΠX (x? )k22


≤ kxt − γt gt − x? k22
= kxt − x? k22 − 2γt gt> (xt − x? ) + γt2 kgt k22 ,

where the inequality comes from the non-expansiveness of the projection


operation. Therefore, it follows that
γt gt> (xt − x? ) ≤ (1/2) ( kxt − x? k22 − kxt+1 − x? k22 + γt2 kgt k22 ). (10.3)
Due to the convexity of f , we have

γt gt> (xt − x? ) ≥ γt (f (xt ) − f ? ) . (10.4)

Combining (10.3) and (10.4) and adding both sides of the inequality from
t = 1 to t = T , we obtain
Σ^T_{t=1} γt (f (xt ) − f ? ) ≤ (1/2) ( kx1 − x? k22 − kxT +1 − x? k22 + Σ^T_{t=1} γt2 kgt k22 )
                      ≤ (1/2) ( kx1 − x? k22 + Σ^T_{t=1} γt2 kgt k22 ). (10.5)

For the proof of (10.1), by definition, the left hand side of (10.5) can be
lower bounded by
Σ^T_{t=1} γt (f (xt ) − f ? ) ≥ ( Σ^T_{t=1} γt ) · ( min_{1≤t≤T} f (xt ) − f ? ).

For the proof of (10.2), by convexity, the left hand side of (10.5) is lower
bounded by
Σ^T_{t=1} γt (f (xt ) − f ? ) ≥ ( Σ^T_{t=1} γt ) · (f (x̂T ) − f ? ).

Bounds (10.1) and (10.2) are hence proved.

Remark. Invoking the definitions of B and R, we have (1/2) kx1 − x? k22 ≤ (1/2) R2
and kgt k2 ≤ B. As a corollary,

min_{T0 ≤t≤T} f (xt ) − f ? ≤ ( (1/2) R2 + (1/2) Σ^T_{t=T0} γt2 B 2 ) / ( Σ^T_{t=T0} γt ),   ∀1 ≤ T0 ≤ T.

Note that the above general result is obtained by slightly modifying the
summation or averaging from T0 to T instead of from 1 to T .

Convergence with various stepsizes. Below we see how the bounds in
(10.1) and (10.2) imply convergence and even convergence rates for
different choices of stepsizes. By abuse of notation, we denote both
min_{1≤t≤T} f (xt ) − f ? and f (x̂T ) − f ? as εT .

1. Constant stepsize: with γt ≡ γ,

εT ≤ ( (1/2) R2 + (T /2) γ 2 B 2 ) / (T γ) = R2 /(2T γ) + (B 2 /2) γ −→ (B 2 /2) γ as T → ∞.

Note that the error does not diminish to zero as T grows to infinity,
which shows one of the drawbacks of using arbitrary constant step-
sizes. By minimizing the upper bound, we can select the optimal
stepsize γ∗ to obtain:

γ∗ = R / (B √T ) ⇒ εT ≤ RB / √T .
Similar analysis applies to the scaled stepsize.

2. Non-summable but diminishing stepsize:

εT ≤ ( (1/2) R2 + (1/2) Σ^T_{t=1} γt2 B 2 ) / ( Σ^T_{t=1} γt )
   ≤ ( (1/2) R2 + (1/2) Σ^{T1}_{t=1} γt2 B 2 ) / ( Σ^T_{t=1} γt ) + (B 2 /2) ( Σ^T_{t=T1+1} γt2 ) / ( Σ^T_{t=T1+1} γt ),

where 1 ≤ T1 ≤ T . When T → ∞, select large T1 and the first term on
the right hand side → 0 since γt is non-summable. The second term
also → 0 because γt2 always approaches zero faster than γt . Conse-
quently, we know that εT → 0 as T → ∞.

An example choice of the stepsize is γt = O(1/t^q ) with q ∈ (0, 1]. If we
choose γt = R/(B √t), then

εT ≤ BR (1 + Σ^T_{t=1} 1/t) / ( 2 Σ^T_{t=1} 1/√t ) = O( BR ln(T ) / √T ).

In fact, if we start the averaging from ⌊T /2⌋ instead of 1, we have

min_{⌊T /2⌋≤t≤T} f (xt ) − f ? ≤ BR (1 + Σ^T_{t=⌊T /2⌋} 1/t) / ( 2 Σ^T_{t=⌊T /2⌋} 1/√t ) ≤ O( BR / √T ).

This further implies that εT ≤ O( BR / √T ).

3. Non-summable but square-summable stepsize: It is obvious that


εT ≤ ( (1/2) R2 + (B 2 /2) Σ^T_{t=1} γt2 ) / ( Σ^T_{t=1} γt ) −→ 0 as T → ∞.

4. Polyak stepsize: The motivation of choosing this stepsize comes from


the fact that
kxt+1 − x? k22 ≤ kxt − x? k22 − 2γt gt> (xt − x? ) + γt2 kgt k22
≤ kxt − x? k22 − 2γt (f (xt ) − f ? ) + γt2 kgt k22 .

?
The Polyak step γt = f (xkgt )−f2 is exactly the minimizer of the right
t k2
hand side. In fact, the stepsize yields

(f (xt ) − f ? )2
kxt+1 − x? k22 ≤ kxt − x? k22 − , (10.6)
kgt k22

which guarantees the decrease of kxt − x? k22 at each step. Applying


(10.6) recursively, we obtain
XT
(f (xt ) − f ? )2 ≤ R2 · B < ∞.
t=1

Therefore, we have εT → 0 as T → ∞.
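The practical differences between these stepsize rules are easy to observe numerically. A small Python sketch; the objective f (x) = kxk1 (so that f ? = 0 is known and the Polyak stepsize is available), the starting point and the stepsize constants are made up, and the problem is left unconstrained for simplicity:

import numpy as np

f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)

def run(stepsize, T=3000):
    x = np.array([3.0, -2.0, 1.0])
    best = np.inf
    for t in range(1, T + 1):
        g = subgrad(x)
        x = x - stepsize(t, x, g) * g
        best = min(best, f(x))
    return best

const  = lambda t, x, g: 0.01                                  # constant stepsize
sqrt_t = lambda t, x, g: 1.0 / np.sqrt(t)                      # diminishing stepsize
polyak = lambda t, x, g: f(x) / (g @ g) if g @ g > 0 else 0.0  # Polyak (f* = 0)
for name, s in [("constant", const), ("1/sqrt(t)", sqrt_t), ("Polyak", polyak)]:
    print(name, run(s))
# the constant stepsize stalls at a level proportional to the stepsize,
# 1/sqrt(t) keeps improving slowly, and Polyak reaches the optimum here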

10.2.3 O(1/t) convergence for strongly convex functions


Note that a non-differentiable convex function can also be strongly convex.

Definition 10.21. Let f : dom(f ) → R be convex, X ⊆ dom(f ) convex and


µ ∈ R+ . Function f is called strongly convex (with parameter µ) over X if
f (y) ≥ f (x) + g> (y − x) + (µ/2) kx − yk2 , ∀x, y ∈ X, ∀g ∈ ∂f (x). (10.7)
If X = dom(f ), then f is simply called strongly convex.

For strongly convex function f , we obtain the following theorem.

Theorem 10.22. Assume f is µ-strongly convex, then Subgradient Descent with
stepsize γt = 1/(µt) satisfies

min_{1≤t≤T} f (xt ) − f ? ≤ B 2 (ln(T ) + 1) / (2µT ) (10.8)

and

f (x̂T ) − f ? ≤ B 2 (ln(T ) + 1) / (2µT ), (10.9)

where x̂T := (1/T ) Σ^T_{t=1} xt .

Proof. First recall that µ-strongly convex implies that
f (y) ≥ f (x) + g> (y − x) + (µ/2) kx − yk22 , ∀x, y ∈ X, ∀g ∈ ∂f (x).
Similarly as the proof for the convex case, the left hand side of (10.3) can
be lower bounded by
γt gt> (xt − x? ) ≥ γt ( f (xt ) − f ? + (µ/2) kxt − x? k22 ).
Combining with (10.3) and plugging in γt = 1/(µt), we have

f (xt ) − f ? ≤ (µ/2) (t − 1) kxt − x? k22 − (µ/2) t kxt+1 − x? k22 + (1/(2µt)) kgt k22 .

By recursively adding both sides from t = 1 to t = T , we obtain


Σ^T_{t=1} [f (xt ) − f ? ] ≤ Σ^T_{t=1} (1/(2µt)) kgt k22 ≤ (B 2 /(2µ)) Σ^T_{t=1} 1/t ≤ (B 2 /(2µ)) (ln(T ) + 1).

In addition, we have Σ^T_{t=1} (f (xt ) − f ? ) ≥ T · εT for either εT = min_{1≤t≤T} f (xt ) − f ?
or εT = f (x̂T ) − f ? with x̂T = (1/T ) Σ^T_{t=1} xt , which leads to the desired results.


P

With another choice of stepsize and averaging strategy, we can get rid
of the log factor in the bound. The following theorem can be obtained
[Bubeck (2014)].

Theorem 10.23. Assume f is µ-strongly convex, then the subgradient method with
stepsize γt = 2/(µ(t + 1)) satisfies

min_{1≤t≤T} f (xt ) − f ? ≤ 2B 2 /(µ(T + 1))   and   f (x̂T ) − f ? ≤ 2B 2 /(µ(T + 1)) (10.10)

where x̂T = Σ^T_{t=1} (2t/(T (T + 1))) xt .

Proof. Omitted and left as an exercise.

Table 10.1: Comparison of nonsmooth and smooth convex optimization.

                                 Convex             Strongly Convex
Subgradient method               O( BR/√t )         O( B 2 /(µt) )
Accelerated gradient descent     O( LR2 /t2 )       O( ((1 − √κ)/(1 + √κ))^{2t} )

Summary Table 10.1 compares the convergence rate of subgradient method


for nonsmooth convex optimization with that of the accelerated gradient
descent for smooth optimization. For both convex and strongly convex
cases, subgradient method achieves slower convergence than accelerated
gradient descent. Particularly, subgradient method can only achieve sub-
linear convergence even under the strongly convex case, instead of linear
rate in the smooth case.

10.3 Lower Bound Complexity


While the convergence rates achieved by subgradient descent seem much
worse than those achieved by gradient descent for smooth problems, we
show below that in the worst case, one cannot improve the O(1/√t) and
O(1/t) rates for the convex and strongly convex situations, respectively,
when using black-box oriented methods that only have access to subgra-
dients of the objective function.
Theorem 10.24 (Nemirovski & Yudin 1979). For any 1 ≤ t ≤ d and x1 ∈ Rd ,
(1) there exists a B-Lipschitz continuous function f : Rd → R and a convex
set X ⊆ dom(f ) of diameter R, such that for any first-order method that
generates xt ∈ x1 + span(g1 , . . . , gt−1 ), where gi ∈ ∂f (xi ), i = 1, . . . , t −
1, we always have

min_{1≤s≤t} f (xs ) − f ? ≥ B · R / ( 4(1 + √t) ).
(2) there exists a µ-strongly convex, B-Lipschitz continuous function f and a
convex set X, for any first-order method as described above, we always have
min_{1≤s≤t} f (xs ) − f ? ≥ B 2 / (8µt).

Proof. From the following construction, one can see that we can assume
x1 = 0 without loss of generality.
Let X = {x ∈ Rd : kxk2 ≤ R/2}. Then by the triangle inequality we know that
the diameter of X is R. Let f : Rd → R be the function

f (x) = C · max_{1≤j≤t} xj + (µ/2) kxk22 ,
where C is some constant to be determined. Note that it is never optimal
to have (x? )i ≠ 0 for t < i ≤ d and by symmetry we know that (x? )1 =
(x? )2 = . . . = (x? )t . Thus, as long as C ≤ Rµ√t/2, the optimal solution and
optimal value of the problem minx∈X f (x) are given by

(x? )i = −C/(µt) for 1 ≤ i ≤ t,   (x? )i = 0 for t < i ≤ d,   and   f ? = −C 2 /(2µt).
On the other hand, the subdifferential of function f is
∂f (x) = µx + C · conv{ei : i such that xi = max_{1≤j≤t} xj }.

Consider the worst-case subgradient oracle that, given an input x, returns
g(x) = C · ei + µx, with i being the first coordinate such that xi = max_{1≤j≤t} xj .
By induction, we can show that xt ∈ span(e1 , . . . , et−1 ). This implies that
f (xs ) ≥ 0 for 1 ≤ s ≤ t. Therefore

min_{1≤s≤t} f (xs ) − f ? ≥ C 2 /(2µt).
√ √
B √t 2B√ Rµ t
(1) Let C = 1+ t
, µ = R(1+ t)
. (Note that C = 2
.) By triangle
inequality, for any g ∈ ∂f (x), kgk2 ≤ C + µ kxk2 ≤ C + µ · R2 = B. This
implies that f is B-Lipschitz continuous. Moreover, we have
C2 B·R
min f (xs ) − f ? ≥ = √ .
1≤s≤t 2µt 4(1 + t)

(2) Let C = B/2 and R = B/µ. (Note that C ≤ Rµ√t/2 holds.) By the triangle in-
equality, for any g ∈ ∂f (x), kgk2 ≤ C + µ · R/2 = B. This implies that f is
B-Lipschitz continuous. Since f (x) − (µ/2) kxk22 is convex in x, we know that
f is µ-strongly convex. Moreover,

min_{1≤s≤t} f (xs ) − f ? ≥ C 2 /(2µt) = B 2 /(8µt).

Remark 10.25. Recall that to obtain an ε-solution, the number of subgradient
calls for a convex function required by Subgradient Descent is at most

O( ( max_{x,y∈X} |f (x) − f (y)|/kx − yk2 · max_{x,y∈X} kx − yk2 )^2 / ε^2 ),

where the term max_{x,y∈X} |f (x) − f (y)|/kx − yk2 · max_{x,y∈X} kx − yk2 can be considered as the k·k2 -
variation of f on X, and O( B 2 /(µε) ) for a strongly convex function. The above
order methods based on subgradient oracles.

10.4 Exercises
Exercise 64. Prove that if f is differentiable at x ∈ dom(f ), then ∂f (x) ⊆
{∇f (x)}.
Exercise 65 (Lipschitz continuity and bounded gradient). Let f : dom(f ) →
R be convex, dom(f ) open, B ∈ R+ . Then the following two statements are
equivalent.
(i) kgk ≤ B for all x ∈ dom(f ) and all g ∈ ∂f (x).
(ii) |f (x) − f (y)| ≤ B kx − yk for all x, y ∈ dom(f ).
Exercise 66 (Monotonicity). Prove that the subdifferential of a convex function
f (x) at x ∈ dom(f ) is a monotone operator, i.e.,
(u − v)T (x − y) ≥ 0, ∀x, y ∈ dom(f ), u ∈ ∂f (x), v ∈ ∂f (y).
Exercise 67 (Directional derivative). Let f be a convex function and x ∈
dom(f ) and let d be such that x + αd ∈ dom(f ) for α ∈ (0, δ) for some
δ > 0. Show that the scalar function
f (x + αd) − f (x)
φ(α) =
α
is non-decreasing function of α on (0, δ).

Exercise 68. Consider the function f (x) = kxk, here k·k is a general norm.
Show that
∂f (x) = {g : g> x = kxk and kgk∗ ≤ 1}.
where k·k∗ is the dual norm: kyk∗ = maxx:kxk≤1 y> x. In particular, ∂f (0) :=
{g : kgk∗ ≤ 1}.

Exercise 69 (Subgradient Descent, Polyak stepsize). Analyze the convergence
rate of subgradient descent under Polyak's stepsize for both convex and strongly
convex objectives.

Chapter 11

Mirror Descent, Smoothing,


Proximal Algorithms

Contents
11.1 Mirror Descent . . . . . . . . . . . . . . . . . . . . . . . . . . 228
11.1.1 Bregman divergence . . . . . . . . . . . . . . . . . . . 228
11.1.2 Mirror Descent . . . . . . . . . . . . . . . . . . . . . . 228
11.1.3 O(1/√t) convergence for convex functions . . . . . . 230
11.2 A Quick Tour of Convex Conjugate Theory . . . . . . . . . . 232
11.3 Smoothing Techniques . . . . . . . . . . . . . . . . . . . . . . 233
11.3.1 Common smoothing techniques . . . . . . . . . . . . 235
11.4 Nesterov’s smoothing . . . . . . . . . . . . . . . . . . . . . . 237
11.4.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . 238
11.4.2 Theoretical Guarantees . . . . . . . . . . . . . . . . . . 240
11.5 Moreau-Yosida Regularization . . . . . . . . . . . . . . . . . 241
11.5.1 Proximal Operators . . . . . . . . . . . . . . . . . . . . 242
11.5.2 Proximal Point Algorithm . . . . . . . . . . . . . . . 244
11.6 Proximal Gradient Methods . . . . . . . . . . . . . . . . . . . 246
11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

11.1 Mirror Descent
Recall that the subgradient descent update rule can be equivalently written
as
    x_{t+1} = \operatorname*{argmin}_{x\in X}\left\{\frac{1}{2}\|x - x_t\|_2^2 + \langle\gamma_t g_t, x\rangle\right\}.
Why should we restrict to the Euclidean k.k2 distance? Next we introduce
another algorithm, Mirror Descent, that generalizes subgradient descent
with non-Euclidean distances.

11.1.1 Bregman divergence


Definition 11.1. Let ω(x) : X → R be a function that is strictly convex, contin-
uously differentiable on a closed convex set X. The Bregman divergence is defined
as
Vω (x, y) = ω(x) − ω(y) − ∇ω(y)> (x − y), ∀x, y ∈ X.
Bregman divergence is not a valid distance: it is asymmetric, namely,
Vω (x, y) 6= Vω (y, x) and triangle inequality may not hold. We call ω(·) the
distance-generating function. If ω(x) is σ-strongly convex with respect to
some norm, namely, ω(x) ≥ ω(y) + ∇ω(y)^⊤(x − y) + (σ/2)‖x − y‖² for all x, y ∈ X,
then we always have V_ω(x, y) ≥ (σ/2)‖x − y‖².
Lemma 11.2 (Generalized Pythagorean Theorem, Exercise 70). If x? is the
Bregman projection of x0 onto a convex set C ⊂ X: x? = argminx∈C Vω (x, x0 ).
Then for all y ∈ C, it holds that

Vω (y, x0 ) ≥ Vω (y, x? ) + Vω (x? , x0 ).
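
A minimal numerical sketch (not from the notes, purely for illustration) of Bregman divergences induced by two common distance-generating functions: the squared Euclidean norm and negative entropy, the latter recovering the Kullback-Leibler divergence on the simplex.

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    """V_omega(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>."""
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

# l2 setup: omega(x) = 1/2 ||x||^2 gives V(x, y) = 1/2 ||x - y||^2
sq = lambda x: 0.5 * x @ x
sq_grad = lambda x: x

# negative entropy: omega(x) = sum_i x_i log x_i gives the KL divergence on the simplex
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(sq, sq_grad, x, y))          # equals 0.5 * ||x - y||^2
print(bregman(negent, negent_grad, x, y))  # equals KL(x || y) since x, y lie on the simplex
```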

11.1.2 Mirror Descent


Given an input x and vector ξ, we will define the prox-mapping:

    \mathrm{Prox}_x(\xi) = \operatorname*{argmin}_{u\in X}\{V_\omega(u, x) + \langle\xi, u\rangle\},   (11.1)

where the distance-generating function ω(·) is 1-strongly convex with respect to the norm ‖·‖ on X.

The Mirror Descent algorithm, originally introduced by Nemirovski &
Yudin in 1983, adopts the update

    x_{t+1} = \operatorname*{argmin}_{x\in X}\{V_\omega(x, x_t) + \langle\gamma_t g_t, x\rangle\} = \operatorname*{argmin}_{x\in X}\{\omega(x) + \langle\gamma_t g_t - \nabla\omega(x_t), x\rangle\},

which can be written compactly as x_{t+1} = Prox_{x_t}(γ_t g_t).

Algorithm 4 Mirror Descent


1: Initialize x1 ∈ X
2: for t = 1, . . . , T do
3: xt+1 = Proxxt (γt gt ), where gt ∈ ∂f (xt )
4: end for

Example 11.3 (ℓ2-setup). X ⊆ R^n, ω(x) = ½‖x‖₂², ‖·‖ = ‖·‖₂, then

(a) Bregman divergence: V_ω(x, y) = ½‖x − y‖₂²;

(b) Prox-mapping: Prox_x(ξ) = Π_X(x − ξ);

(c) Mirror Descent reduces to Subgradient Descent.

Example 11.4 (ℓ1-setup). X = {x ∈ R^n_+ : ∑_{i=1}^n x_i = 1}, ω(x) = ∑_{i=1}^n x_i ln(x_i),
‖·‖ = ‖·‖₁. One can verify that ω(x) is 1-strongly convex with respect to the
‖·‖₁-norm on X. In this case, we have

(a) Bregman divergence: V_ω(x, y) = ∑_{i=1}^n x_i ln(x_i/y_i), known as the Kullback-Leibler divergence;

(b) Prox-mapping:
    \mathrm{Prox}_x(\xi) = \left(\sum_{i=1}^n x_i e^{-\xi_i}\right)^{-1}\begin{pmatrix} x_1 e^{-\xi_1} \\ \vdots \\ x_n e^{-\xi_n} \end{pmatrix};

(c) Mirror Descent gives rise to multiplicative updates with normalization.
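
A small sketch of the ℓ1-setup prox-mapping from Example 11.4: Mirror Descent on the probability simplex becomes a multiplicative-weights update followed by normalization. The toy objective (a linear function given by a vector c) is an illustrative assumption, not from the notes.

```python
import numpy as np

def simplex_prox(x, xi):
    """Prox_x(xi) for omega(x) = sum_i x_i log x_i on the simplex."""
    w = x * np.exp(-xi)
    return w / w.sum()

def mirror_descent_simplex(subgrad, d, steps, gammas):
    x = np.full(d, 1.0 / d)          # x_1 = argmin omega = uniform distribution
    for t in range(steps):
        g = subgrad(x)               # g_t in the subdifferential of f at x_t
        x = simplex_prox(x, gammas[t] * g)
    return x

# toy example: f(x) = <c, x> over the simplex; the minimizer puts all mass on argmin c
c = np.array([0.9, 0.1, 0.5])
x_final = mirror_descent_simplex(lambda x: c, d=3, steps=200, gammas=[0.5] * 200)
print(x_final)   # mass concentrates on the coordinate with the smallest c
```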


11.1.3 O(1/√t) convergence for convex functions
We first present the useful three point identity lemma:
Lemma 11.5. (Three point identity) For any x, y, z ∈ dom(ω):

Vω (x, z) = Vω (x, y) + Vω (y, z) − h∇ω(z) − ∇ω(y), x − yi.

Proof. This can be easily derived from the definition. We have

Vω (x, y) + Vω (y, z) = ω(x) − ω(y) + ω(y) − ω(z) − h∇ω(y), x − yi − h∇ω(z), y − zi


= Vω (x, z) + h∇ω(z), x − zi − h∇ω(y), x − yi − h∇ω(z), y − zi
= Vω (x, z) + h∇ω(z) − ∇ω(y), x − yi.

Remark 11.6. When ω(x) = 12 kxk22 , this is the same as law of cosines, i.e.,

kz − xk22 = kz − yk22 + ky − xk22 + 2hz − y, y − xi.

Theorem 11.7. For Mirror Descent, let f be convex and ω(·) be 1-strongly convex on X with respect to the norm ‖·‖. Then we have
    \min_{1\le t\le T} f(x_t) - f^\star \le \frac{V_\omega(x^\star, x_1) + \frac{1}{2}\sum_{t=1}^T \gamma_t^2\|g_t\|_*^2}{\sum_{t=1}^T \gamma_t},   (11.2)
and
    f\!\left(\frac{\sum_{t=1}^T \gamma_t x_t}{\sum_{t=1}^T \gamma_t}\right) - f^\star \le \frac{V_\omega(x^\star, x_1) + \frac{1}{2}\sum_{t=1}^T \gamma_t^2\|g_t\|_*^2}{\sum_{t=1}^T \gamma_t},   (11.3)
where ‖·‖_* denotes the dual norm.
Proof. Since xt+1 = argminx∈X {ω(x) + hγt gt − ∇ω(xt ), xi}, by optimality
condition, we have

h∇ω(xt+1 ) + γt gt − ∇ω(xt ), x − xt+1 i ≥ 0.

From the three point identity, we have for ∀x ∈ X:

hγt gt , xt+1 −xi ≤ h∇ω(xt+1 )−∇ω(xt ), x−xt+1 i = Vω (x, xt )−Vω (x, xt+1 )−Vω (xt+1 , xt )

hγt gt , xt − xi ≤ Vω (x, xt ) − Vω (x, xt+1 ) − Vω (xt+1 , xt ) + hγt gt , xt − xt+1 i.

By Young's inequality,
    \langle\gamma_t g_t, x_t - x_{t+1}\rangle \le \frac{\gamma_t^2}{2}\|g_t\|_*^2 + \frac{1}{2}\|x_t - x_{t+1}\|^2.
From the strong convexity of ω(x), V_ω(x_{t+1}, x_t) ≥ ½‖x_t − x_{t+1}‖². Adding
these two inequalities, we get the key inequality:
    \langle\gamma_t g_t, x_t - x^\star\rangle \le V_\omega(x^\star, x_t) - V_\omega(x^\star, x_{t+1}) + \frac{\gamma_t^2}{2}\|g_t\|_*^2.   (11.4)
Following a similar line of proof as for Subgradient Descent, we obtain the desired results.

Remark 11.8 (Mirror Descent vs. Subgradient Descent). Let f be convex. With a
proper choice of stepsize as before, we can see that the convergence of these two
methods looks very similar.

• For Subgradient Descent, ε_t ∼ O\left(\frac{RB}{\sqrt{t}}\right),
  where R² := ‖x⋆ − x₁‖₂² and B := max_{x∈X} max_{g∈∂f(x)} ‖g‖₂.

• For Mirror Descent, ε_t ∼ O\left(\frac{\sqrt{\Omega}B}{\sqrt{t}}\right),
  where Ω := max_{x∈X} V_ω(x, x₁) and B := max_{x∈X} max_{g∈∂f(x)} ‖g‖_*.

The rate remains the same, but the constants differ. In some cases, one can
show that using Mirror Descent significantly improves upon the constant. Consider X = {x ∈ R^d : x_i ≥ 0, ∑_{i=1}^d x_i = 1}.

• For Subgradient Descent, under the ℓ2-setup, ω(x) = ½‖x‖₂², ‖·‖ = ‖·‖_* = ‖·‖₂, we know R² ≤ 2.

• For Mirror Descent, under the ℓ1-setup, ω(x) = ∑_{i=1}^d x_i ln x_i, ‖·‖ = ‖·‖₁,
  ‖·‖_* = ‖·‖_∞; if we choose x₁ = argmin_{x∈X} ω(x), then Ω ≤ max_{x∈X} ω(x) − min_{x∈X} ω(x) = 0 − (−ln d) = ln d.

Therefore, the ratio between the efficiency estimates of GD and MD is
    O\left(\frac{1}{\sqrt{\ln d}}\cdot\frac{\max_{x\in X}\max_{g\in\partial f(x)}\|g\|_2}{\max_{x\in X}\max_{g\in\partial f(x)}\|g\|_\infty}\right).
Note that ‖g‖_∞ ≤ ‖g‖₂ ≤ √d ‖g‖_∞. Hence, in the worst case, Mirror Descent can be O\left(\sqrt{d/\ln d}\right) times faster than Subgradient Descent.

11.2 A Quick Tour of Convex Conjugate Theory
Definition 11.9 (convex conjugate). For a function f : dom(f ) → R, its
convex conjugate is given as:

f ∗ (y) = sup {x> y − f (x)} (11.5)


x∈dom(f )

The convex conjugate is also known as Legendre-Fenchel transforma-


tion. Note that f need not necessarily be convex for the above definition.
Also, f ∗ will always be convex (regardless of f ) since it is the supremum
over linear functions of y. By definition, we have:

f ∗ (y) ≥ x> y − f (x), ∀x, y =⇒ x> y ≤ f (x) + f ∗ (y), ∀x, y

The last inequality above is known as the Fenchel inequality, which is a
generalization of Young's inequality:

    x^\top y \le \frac{\|x\|^2}{2} + \frac{\|y\|_*^2}{2}, \quad \forall x, y.
Lemma 11.10 (Sec.12, [Roc97]). If function f is convex, lower semi-continuous
and proper, then (f ∗ )∗ = f .

Here lower semi-continuity means that lim inf_{x→x₀} f(x) ≥ f(x₀); in
other words, every level set {x : f(x) ≤ α} of f is a closed set. A proper convex function is one that satisfies f(x) > −∞ for all x and is finite at some point. Hence, for f satisfying the lemma,
f(x) admits the Fenchel representation

    f(x) = \max_{y\in\mathrm{dom}(f^*)}\{y^\top x - f^*(y)\}.

Proposition 11.11. If the function f is µ-strongly convex, then f^* is continuously
differentiable and (1/µ)-Lipschitz smooth.

Proof. By definition, we have f ∗ (y) = supx∈dom(f ) {y> x − f (x)}. This gives


us the subdifferential set

    \partial f^*(y) = \operatorname*{argmax}_{x\in\mathrm{dom}(f)}\{y^\top x - f(x)\}.

Note that for all y, the optimal solution of the above problem is unique due
to strong convexity. Hence, ∂f^*(y) is a singleton, i.e. ∂f^*(y) = {∇f^*(y)}, and
f^* is differentiable. Now, we need to show the following:

    \|\nabla f^*(y_1) - \nabla f^*(y_2)\|_2 \le \frac{1}{\mu}\|y_1 - y_2\|_2, \quad \forall y_1, y_2.   (11.6)
Let x1 = argmaxx∈dom(f ) {y1> x−f (x)}. Similarly, let x2 = argmaxx∈dom(f ) {y2> x−
f (x)}. From the optimality condition, we get:
hy1 , x2 − x1 i ≤ h∂f (x1 ), x2 − x1 i (11.7)
hy2 , x1 − x2 i ≤ h∂f (x2 ), x1 − x2 i (11.8)
From the µ-strong convexity of f, we have:
    f(x_2) \ge f(x_1) + \partial f(x_1)^\top(x_2 - x_1) + \frac{\mu}{2}\|x_2 - x_1\|_2^2,   (11.9)
    f(x_1) \ge f(x_2) + \partial f(x_2)^\top(x_1 - x_2) + \frac{\mu}{2}\|x_1 - x_2\|_2^2.   (11.10)
Combining equations (11.7), (11.8) with (11.9), (11.10), we get:
    \langle y_1 - y_2, x_1 - x_2\rangle \ge \mu\|x_1 - x_2\|_2^2.
From the Cauchy-Schwarz inequality, this further implies that
    \|x_1 - x_2\|_2 \le \frac{1}{\mu}\|y_1 - y_2\|_2.
Hence, (11.6) follows from the definitions of x1 , x2 .
Lemma 11.12 (Exercise 73). Let f and g be two proper, convex and lower semi-continuous
functions. Then
(a) (f + g)^*(x) = \inf_y\{f^*(y) + g^*(x - y)\},
(b) (\alpha f)^*(x) = \alpha f^*(x/\alpha) for α > 0.


11.3 Smoothing Techniques


Previously, we discussed Subgradient Descent and Mirror Descent for non-
smooth convex optimization. These algorithms are designed for general-purpose
nonsmooth optimization problems and do not exploit the structure of the
problem at hand. In practice, we often know something
about the structure of the optimization problem we intend to solve. One
can then utilize this structure to come up with algorithms that are more efficient than Subgradient Descent and Mirror Descent. Consider the problem

    \min_{x\in X} f(x)   (11.11)

where f is a convex but possibly non-differentiable function, and X is a


convex compact set. Another intuitive way to approach the above problem
is to approximate the non-smooth function f (x) by a smooth and convex
function fµ (x), so that we can leverage gradient descent and acceleration
techniques to solve the smoothed problem:

    \min_{x\in X} f_\mu(x)   (11.12)

where f_µ(x) is convex and L_µ-Lipschitz smooth (i.e., its gradient is L_µ-Lipschitz continuous).

Figure 11.1: Huber function f_µ(x) with varying thresholds µ = 1, 2, 3, 4.

Example 11.13. Consider the simplest non-smooth and convex function, f(x) =
|x|. The following function, known as the Huber function (Figure 11.1),
    f_\mu(x) = \begin{cases} \frac{x^2}{2\mu}, & \text{if } |x| \le \mu,\\ |x| - \frac{\mu}{2}, & \text{if } |x| > \mu, \end{cases}   (11.13)
is a smooth approximation of the absolute value function. f_µ(x) is clearly continuous and differentiable everywhere. It can also be easily seen that
    f(x) - \frac{\mu}{2} \le f_\mu(x) \le f(x).
Hence, as µ → 0, f_µ(x) → f(x). Moreover, we also have f_µ''(x) ≤ 1/µ wherever the second derivative exists.
This implies that the derivative of f_µ is (1/µ)-Lipschitz continuous, i.e., f_µ is (1/µ)-smooth. Therefore, µ determines the
approximation accuracy and the smoothness level.
The Huber function approximation has been widely used in machine learn-
ing to approximate non-smooth loss functions, e.g. absolute loss (robust regres-
sion), hinge loss (SVM), etc. For example, suppose we have m data samples
(a1 , b1 ), ..., (am , bm ), and we intend to solve the regression problem with absolute
loss:
    \min_{x\in\mathbb{R}^d} \sum_{i=1}^m |a_i^\top x - b_i|.
We can approximate the absolute loss by the Huber loss and solve instead the following smooth convex optimization problem:
    \min_{x\in\mathbb{R}^d} \sum_{i=1}^m f_\mu(a_i^\top x - b_i).
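
A hedged numerical sketch of this smoothed robust-regression problem: the Huber function of Example 11.13 replaces the absolute loss, and plain gradient descent is applied. The data, stepsize and iteration count below are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def huber(r, mu):
    return np.where(np.abs(r) <= mu, r**2 / (2 * mu), np.abs(r) - mu / 2)

def huber_grad(r, mu):
    # derivative of the Huber function: r/mu on [-mu, mu], sign(r) outside
    return np.clip(r / mu, -1.0, 1.0)

def smoothed_objective(x, A, b, mu):
    return np.mean(huber(A @ x - b, mu))

def gd_on_huber(A, b, mu, gamma, steps):
    m, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        # gradient of (1/m) sum_i f_mu(a_i^T x - b_i)
        x -= gamma * A.T @ huber_grad(A @ x - b, mu) / m
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
b = A @ np.ones(5) + 0.01 * rng.normal(size=100)
x_hat = gd_on_huber(A, b, mu=0.1, gamma=0.05, steps=2000)
print(x_hat)                                 # close to the all-ones vector
print(smoothed_objective(x_hat, A, b, 0.1))  # small smoothed loss
```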

11.3.1 Common smoothing techniques


In this section, we will briefly introduce several major smoothing tech-
niques used for non-smooth convex optimization.

(1) Nesterov’s smoothing based on conjugate function.


This technique uses the following function to approximate f (x):

    f_\mu(x) = \max_{y\in\mathrm{dom}(f^*)}\{x^\top y - f^*(y) - \mu\cdot d(y)\}   (11.14)

where f ∗ is the convex conjugate of f and d(y) is some proximity


function that is strongly convex and nonnegative everywhere. Note
that by definition, we have

    f_\mu(x) = \max_{y\in\mathrm{dom}(f^*)}\{x^\top y - f^*(y) - \mu\cdot d(y)\} = (f^* + \mu d)^*(x).

By adding the strongly convex term µd(y), the function (f^* + µd)
is strongly convex. Therefore, based on Proposition 11.11, function
fµ (x) is continuously differentiable and Lipschitz-smooth.

(2) Moreau-Yosida smoothing/regularization.


This technique uses the following function to approximate f (x):

    f_\mu(x) = \min_{y\in\mathrm{dom}(f)}\left\{f(y) + \frac{1}{2\mu}\|x - y\|_2^2\right\}   (11.15)

where µ > 0 is the approximation parameter.

Remark 11.14. Under the simple proximity function d(y) = ½‖y‖₂², Nesterov's smoothing is indeed equivalent to Moreau-Yosida regularization.
Let f^* denote the conjugate of the function f. Suppose f is proper, convex and lower semi-continuous; then

    f(x) = \max_y\{y^\top x - f^*(y)\}.

Then we can show that

    f_\mu(x) = \max_y\left\{y^\top x - f^*(y) - \frac{\mu}{2}\|y\|_2^2\right\}   (Nesterov's smoothing)
             = \left(f^* + \frac{\mu}{2}\|\cdot\|_2^2\right)^*(x)
             = \inf_y\left\{f(y) + \frac{1}{2\mu}\|x - y\|_2^2\right\}   (Moreau-Yosida regularization)

where the last equation follows from Lemma 11.12(a).

(3) Lasry-Lions regularization.


This smoothing technique considers double application of the Moreau-
Yosida smoothing with function flipping:

    f_{\mu,\delta}(x) = \max_y\min_z\left\{f(z) + \frac{1}{2\mu}\|z - y\|_2^2 - \frac{1}{2\delta}\|y - x\|_2^2\right\}   (11.16)

where δ, µ > 0. Similarly, based on Proposition 11.11, function fµ,δ (x)


is continuously differentiable and Lipschitz-smooth.

(4) Ben-Tal-Teboulle smoothing based on recession function.

This smoothing technique is only applicable to a particular class of
function which can be represented as:
f (x) = F (f1 (x), f2 (x), . . . , fm (x)), (11.17)
where F (y) = maxx∈dom(g) {g(x + y) − g(x)} is the recession func-
tion of some function g : Rm → R here. For a function f satisfying
the above condition, the Ben-Tal and Teboulle’s smoothing technique
uses the following function to approximate f(x):
    f_\mu(x) = \mu\, g\!\left(\frac{f_1(x)}{\mu}, \ldots, \frac{f_m(x)}{\mu}\right).   (11.18)

(5) Randomized smoothing.


The randomized smoothing paradigm uses the following function to
approximate f (x):
fµ (x) = EZ f (x + µZ) (11.19)
where Z is an isotropic Gaussian or uniform random variable.

11.4 Nesterov’s smoothing


In this section, we mainly focus on Nesterov’s smoothing and discuss its
properties to gain insight into this smoothing technique. We consider a
general problem setting: minx∈X f (x), where function f can be represented
by
    f(x) = \max_{y\in Y}\{\langle Ax + b, y\rangle - \phi(y)\},

with φ(y) being a convex and continuous function and Y a convex and
compact set. Note that the aforementioned representation generalizes the
Fenchel representation using the conjugate function and need not be unique.
Indeed, in many cases, such a representation is much easier to construct
than the convex conjugate itself.
Example 11.15. Let f(x) = \max_{1\le i\le m} |a_i^\top x - b_i|. Computing the convex conjugate of f is a cumbersome task, and the conjugate is quite complex. Instead, we can easily
represent f as follows:
    f(x) = \max_{y\in Y}\sum_{i=1}^m (a_i^\top x - b_i)y_i, \quad \text{where } Y := \Big\{y \in \mathbb{R}^m : \sum_{i=1}^m |y_i| \le 1\Big\}.

Proximity function. We now proceed to discuss some properties of the
proximity function d(y). The function d(y) should satisfy the following
properties:

(i) d(y) is continuous and 1-strongly convex on Y ;

(ii) d(y0 ) = 0, for y0 ∈ argminy∈Y d(y);

(iii) d(y) ≥ 0, ∀y ∈ Y .

Example 11.16. Let y0 ∈ Y , here are some examples of valid proximity functions:
• d(y) = ½‖y − y₀‖₂²

• d(y) = ½ ∑_i w_i (y_i − (y₀)_i)² with w_i ≥ 1

• d(y) = ω(y) − ω(y₀) − ∇ω(y₀)^⊤(y − y₀) with ω(·) being 1-strongly
convex on Y
We can check that these proximity functions satisfy all the properties mentioned
above.

Nesterov’s smoothing considers the following smooth approximation:

    f_\mu(x) = \max_{y\in Y}\{\langle Ax + b, y\rangle - \phi(y) - \mu\cdot d(y)\}.

11.4.1 Example
Below, we provide a simple example to illustrate the smoothed function
under different choices of proximity function d(·) and Fenchel representa-
tion. Consider the absolute value function f (x) = |x|. Note that f admits
the following two different representations:

    f(x) = \sup_{|y|\le 1} yx \qquad \text{or} \qquad f(x) = \sup_{y_1, y_2\ge 0,\ y_1+y_2=1} (y_1 - y_2)x.

Here, Y = {y : |y| ≤ 1} or Y = {y = (y1 , y2 ) : y1 , y2 ≥ 0, y1 + y2 = 1}, and


φ(y) := 0.
Now we consider different choices for the distance function d(y).

1. d(y) = ½y². Clearly, d(·) is 1-strongly convex on Y = {y : |y| ≤ 1},
and d(y) ≥ 0.
Nesterov's smoothing gives rise to
    f_\mu(x) = \sup_{|y|\le 1}\left\{yx - \frac{\mu}{2}y^2\right\} = \begin{cases} \frac{x^2}{2\mu}, & |x| \le \mu,\\ |x| - \frac{\mu}{2}, & |x| > \mu, \end{cases}   (11.20)
which is the well-known Huber function.

Remark. The same approximation can be obtained from the Moreau-Yosida smoothing technique as follows:
    f_\mu(x) = \inf_{y\in\mathbb{R}}\left\{|y| + \frac{1}{2\mu}\|y - x\|^2\right\}.
p
2. d(y) = 1− 1 − y 2 . Clearly, d(·) is 1-strongly convex on Y = {y : |y| ≤ 1}
and d(y) ≥ 0.
Nesterov’s smoothing gives rise to
n  p o p
fµ (x) = sup yx − µ 1 − 1 − y 2 = x2 + µ 2 − µ (11.21)
|y|≤1

Remark. The same approximation can be obtained from Ben-Tal &
Teboulle's smoothing based on the recession function:
    |x| = \sup_y\{g(y + x) - g(y)\}, \qquad g(y) = \sqrt{1 + y^2},
    f_\mu(x) = \mu\, g\!\left(\frac{x}{\mu}\right) = \sqrt{x^2 + \mu^2}.

3. d(y) = y_1\log y_1 + y_2\log y_2 + \log 2. Clearly, d(·) is 1-strongly convex on
Y = {(y_1, y_2) : y_1, y_2 ≥ 0, y_1 + y_2 = 1} and d(y) ≥ 0.
Nesterov's smoothing gives rise to
    f_\mu(x) = \sup_{y_1, y_2\ge 0,\ y_1+y_2=1}\{(y_1 - y_2)x - \mu(y_1\log y_1 + y_2\log y_2 + \log 2)\}   (11.22)
             = \mu\log\left(\frac{e^{-x/\mu} + e^{x/\mu}}{2}\right).   (11.23)

Remark. The same approximation can be obtained from Ben-Tal &
Teboulle smoothing based on the recession function:
    |x| = \max\{x, -x\} = \sup_y\{g(y + (x, -x)) - g(y)\}, \qquad g(y) = \log(e^{y_1} + e^{y_2}),
    f_\mu(x) = \mu\, g\!\left(\frac{x}{\mu}, \frac{-x}{\mu}\right) = \mu\log\left(e^{-x/\mu} + e^{x/\mu}\right).

11.4.2 Theoretical Guarantees


We describe below the Lipschitz smoothness of the function [Nesterov,
2005].

Proposition 11.17. For fµ (x), we have

• fµ (x) is continuously differentiable.

• ∇f_µ(x) = A^⊤ y(x), where y(x) = \operatorname*{argmax}_{y\in Y}\{\langle Ax + b, y\rangle - \phi(y) - \mu\cdot d(y)\}.

• f_µ(x) is \frac{\|A\|_2^2}{\mu}-Lipschitz smooth, where ‖A‖₂ := max_{x:‖x‖₂≤1} ‖Ax‖₂.

This can be derived similarly to Proposition 11.11; we omit the proof
here. Now let us look at the approximation accuracy.

here. Now let us look at the approximation accuracy.

Proposition 11.18. For any µ > 0, let DY2 = maxy∈Y d(y), we have

f (x) − µDY2 ≤ fµ (x) ≤ f (x).

Remark. Let f∗ = minx∈X f (x) and fµ,∗ = minx∈X fµ (x), we have fµ,∗ ≤
f∗ . Moreover, for any xt generated by an algorithm

    f(x_t) - f_* \le \underbrace{f(x_t) - f_\mu(x_t)}_{\text{approximation error}} + \underbrace{f_\mu(x_t) - f_{\mu,*}}_{\text{optimization error}}.

Suppose we have access to the gradient of fµ (x) when solving the re-
sulting smooth convex optimization problem

    \min_{x\in X} f_\mu(x).

(i) If we apply projected gradient descent to solve the smooth problem,
    f(x_t) - f_* \le O\left(\frac{\|A\|_2^2 D_X^2}{\mu t} + \mu D_Y^2\right).
Therefore, if we want the error to be less than a threshold ε, we need
to set µ = O\left(\frac{\varepsilon}{D_Y^2}\right), and the total number of iterations is at most
T = O\left(\frac{\|A\|_2^2 D_X^2}{\mu\varepsilon}\right) = O\left(\frac{\|A\|_2^2 D_X^2 D_Y^2}{\varepsilon^2}\right).

(ii) If we apply accelerated gradient descent to solve the smooth problem,
    f(x_t) - f_* \le O\left(\frac{\|A\|_2^2 D_X^2}{\mu t^2} + \mu D_Y^2\right).
Therefore, if we want the error to be less than a threshold ε, we need
to set µ = O\left(\frac{\varepsilon}{D_Y^2}\right), and the total number of iterations is at most
T = O\left(\frac{\|A\|_2 D_X}{\sqrt{\mu\varepsilon}}\right) = O\left(\frac{\|A\|_2 D_X D_Y}{\varepsilon}\right), which is substantially better than the
O(1/ε²) complexity if we were to directly apply subgradient descent.
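
A small sketch of this recipe for f(x) = max_i (a_i^⊤x − b_i): with Y the simplex, φ = 0 and the entropy proximity function, the smoothed function is a scaled log-sum-exp with gradient A^⊤ softmax((Ax − b)/µ), so plain gradient descent applies. The data and parameter choices below are illustrative assumptions.

```python
import numpy as np

def smoothed_max(x, A, b, mu):
    """Entropy-smoothed version of max_i (a_i^T x - b_i) and its gradient."""
    z = (A @ x - b) / mu
    zmax = z.max()
    w = np.exp(z - zmax)
    f_mu = mu * (zmax + np.log(w.sum()) - np.log(len(b)))   # mu*logsumexp(z) - mu*log m
    grad = A.T @ (w / w.sum())                              # A^T softmax(z)
    return f_mu, grad

def gd_on_smoothed_max(A, b, mu, steps):
    L = np.linalg.norm(A, 2) ** 2 / mu     # Lipschitz constant of grad f_mu (Prop. 11.17)
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        _, g = smoothed_max(x, A, b, mu)
        x -= g / L
    return x

rng = np.random.default_rng(1)
A, b = rng.normal(size=(20, 4)), rng.normal(size=20)
x_hat = gd_on_smoothed_max(A, b, mu=0.05, steps=5000)
print(np.max(A @ x_hat - b))   # the original (non-smooth) objective value achieved
```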

11.5 Moreau-Yosida Regularization


Recall that the Moreau-Yosida smoothing technique considers the smooth
approximation function
    f_\mu(x) = \min_{y\in\mathrm{dom}(f)}\left\{f(y) + \frac{1}{2\mu}\|x - y\|_2^2\right\},   (11.24)
where µ > 0 is the smoothness parameter. The function f_µ(x) is also called the
Moreau envelope of the function f(x).
Note that when computing the gradient of the smoothed function ∇f_µ(x)
defined in (11.24), we will need to solve subproblems of the form
    \min_y\left\{f(y) + \frac{1}{2}\|x - y\|^2\right\}.
The optimal solution to this subproblem is often referred to as the proximal operator, which shares many similarities with the projection operator we
discussed earlier. We provide some basic results below.

11.5.1 Proximal Operators
Definition 11.19. Given a convex function f, the proximal operator of f at a
given point x is defined as
    \mathrm{prox}_f(x) = \operatorname*{argmin}_y\left\{f(y) + \frac{1}{2}\|x - y\|^2\right\}.

As an immediate observation, for any µ > 0, we have
    \mathrm{prox}_{\mu f}(x) = \operatorname*{argmin}_y\left\{f(y) + \frac{1}{2\mu}\|x - y\|^2\right\}.

Note that for continuous convex function f , the proximal operator al-
ways exists and is unique. In fact, for simple functions, proximal operators
can sometimes be computed efficiently with closed-form solutions at low
computation cost.

Example 11.20. Let f be the indicator function of a convex set X, namely,
    f(x) = \delta_X(x) = \begin{cases} 0, & x \in X,\\ +\infty, & x \notin X. \end{cases}
Then the proximal operator reduces to the projection operator onto X, i.e.,
    \mathrm{prox}_f(x) = \operatorname*{argmin}_{y\in X}\left\{\frac{1}{2}\|x - y\|^2\right\} = \Pi_X(x).
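
A small sketch of two standard closed-form proximal operators (not spelled out in the text but well known): soft-thresholding for the ℓ1-norm, and, as in Example 11.20, projection for the indicator of a box.

```python
import numpy as np

def prox_l1(x, mu):
    """prox of mu*||.||_1 at x, i.e. argmin_y mu*||y||_1 + 1/2 ||y - x||^2 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def prox_box_indicator(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^d = Euclidean projection onto the box."""
    return np.clip(x, lo, hi)

x = np.array([2.0, -0.3, 0.1, -1.5])
print(prox_l1(x, mu=0.5))                # [ 1.5, -0. ,  0. , -1. ]
print(prox_box_indicator(x, -1.0, 1.0))  # [ 1. , -0.3,  0.1, -1. ]
```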

In general, the proximal operator possesses many similar properties as


the projection operator as discussed earlier, e.g. treating optimal solution
as fixed point, non-expansiveness, and decomposition.

Proposition 11.21. Let f be a convex function, then we have

(a) (Fixed Point) A point x? minimizes f (x) iff x? = proxf (x? ).

(b) (Non-expansive) ‖prox_f(x) − prox_f(y)‖ ≤ ‖x − y‖.

(c) (Moreau Decomposition) For any x, x = proxf (x) + proxf ∗ (x).

Proof. (a) First, if x⋆ minimizes f(x), we have f(x) ≥ f(x⋆), ∀x ∈ X.
Hence,
    f(x) + \frac{1}{2}\|x - x^\star\|^2 \ge f(x^\star) + \frac{1}{2}\|x^\star - x^\star\|^2.
This implies that
    x^\star = \operatorname*{argmin}_x\left\{f(x) + \frac{1}{2}\|x - x^\star\|^2\right\} = \mathrm{prox}_f(x^\star).
To prove the converse, suppose that
    x^\star = \mathrm{prox}_f(x^\star) = \operatorname*{argmin}_x\left\{f(x) + \frac{1}{2}\|x - x^\star\|^2\right\}.
By the optimality condition, this implies that
    0 \in \partial f(x^\star) + (x^\star - x^\star) \implies 0 \in \partial f(x^\star).
Therefore, x? minimizes f .
(b) Let us denote ux = proxf (x) and uy = proxf (y). Equivalently,

x − ux ∈ ∂f (ux ) and y − uy ∈ ∂f (uy ).

Now we use the fact that ∂f is a monotone mapping, which tells us


that

hx − ux − (y − uy ), ux − uy i ≥ 0

Hence, we have

hx − y, ux − uy i ≥ kux − uy k2 .

By the Cauchy-Schwarz inequality, this leads to ‖u_x − u_y‖ ≤ ‖x − y‖ as


desired.
(c) Let u = proxf (x), or equivalently, x − u ∈ ∂f (u). Note that we also
have u ∈ ∂f ∗ (x − u), this is equivalent to x − u = proxf ∗ (x). Hence,
x = u + (x − u) = proxf (x) + proxf ∗ (x).

Recall the definition of f_µ(x); by Danskin's theorem, we can prove that
the gradient of f_µ(x) is given by
    \nabla f_\mu(x) = \frac{1}{\mu}\left(x - \mathrm{prox}_{\mu f}(x)\right).   (11.25)
Since f_µ is (1/µ)-smooth, gradient descent with stepsize µ on the smoothed function works as follows:
    x_{t+1} = x_t - \mu\nabla f_\mu(x_t).
From equation (11.25), this is equivalent to
    x_{t+1} = \mathrm{prox}_{\mu f}(x_t),
which is known as the proximal point algorithm, initially proposed by Rockafellar in 1976. In the next subsection, we discuss the general algorithm and its convergence results in more detail.

11.5.2 Proximal Point Algorithm


The goal is to minimize a non-smooth convex function f (x), i.e. minx f (x).
The proximal point algorithm works as follows:

xt+1 = proxγt f (xt ) t = 0, 1, 2, . . .

where γt > 0 are the stepsizes.
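
A minimal sketch of the proximal point algorithm for the particular choice f(x) = ‖x‖₁, whose proximal operator is the soft-thresholding map from the sketch above; the starting point and stepsizes are illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_point_l1(x0, gammas):
    x = x0.copy()
    for gamma in gammas:
        x = soft_threshold(x, gamma)     # x_{t+1} = prox_{gamma_t * f}(x_t)
    return x

x0 = np.array([3.0, -0.5, 1.2])
print(proximal_point_l1(x0, gammas=[0.5] * 10))   # reaches the minimizer x* = 0
```

Note the dependence on γ_t discussed in the remark below: larger stepsizes move each iterate further in a single prox evaluation, at the price of a potentially harder subproblem when f has no closed-form prox.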

Theorem 11.22. Let f be a convex function. The proximal point algorithm satisfies
    f(x_t) - f^\star \le \frac{\|x_0 - x^\star\|_2^2}{2\sum_{\tau=0}^{t-1}\gamma_\tau}.
Proof. First, by optimality of x_{t+1}:
    f(x_{t+1}) + \frac{1}{2\gamma_t}\|x_{t+1} - x_t\|_2^2 \le f(x_t),
i.e.,
    f(x_t) - f(x_{t+1}) \ge \frac{1}{2\gamma_t}\|x_{t+1} - x_t\|_2^2.

This further implies that f(x_t) is non-increasing at each iteration. Let g ∈
∂f(x_{t+1}); by convexity of f, we have f(x_{t+1}) − f⋆ ≤ g^⊤(x_{t+1} − x⋆). From
the optimality condition of x_{t+1}, we have
    0 \in \partial f(x_{t+1}) + \frac{1}{\gamma_t}(x_{t+1} - x_t) \implies \frac{x_t - x_{t+1}}{\gamma_t} \in \partial f(x_{t+1}).
Hence, for every τ,
    f(x_{\tau+1}) - f^\star \le \frac{1}{\gamma_\tau}(x_\tau - x_{\tau+1})^\top(x_{\tau+1} - x^\star)
        = \frac{1}{\gamma_\tau}(x_\tau - x^\star + x^\star - x_{\tau+1})^\top(x_{\tau+1} - x^\star)
        = \frac{1}{\gamma_\tau}\left[(x_\tau - x^\star)^\top(x_{\tau+1} - x^\star) - \|x_{\tau+1} - x^\star\|^2\right].
Since (x_\tau - x^\star)^\top(x_{\tau+1} - x^\star) \le \frac{1}{2}\left[\|x_\tau - x^\star\|^2 + \|x_{\tau+1} - x^\star\|^2\right], this implies that
    \gamma_\tau\left(f(x_{\tau+1}) - f^\star\right) \le \frac{1}{2}\left[\|x_\tau - x^\star\|^2 - \|x_{\tau+1} - x^\star\|^2\right].
Summing this inequality for τ = 0, 1, 2, . . . , t − 1,
    \sum_{\tau=0}^{t-1}\gamma_\tau\left(f(x_{\tau+1}) - f^\star\right) \le \frac{\|x_0 - x^\star\|^2}{2} - \frac{\|x_t - x^\star\|^2}{2} \le \frac{\|x_0 - x^\star\|^2}{2}.
Since f(x_\tau) is non-increasing, we have
    \left(\sum_{\tau=0}^{t-1}\gamma_\tau\right)\left(f(x_t) - f^\star\right) \le \sum_{\tau=0}^{t-1}\gamma_\tau\left(f(x_{\tau+1}) - f^\star\right).
Therefore,
    f(x_t) - f^\star \le \frac{\|x_0 - x^\star\|^2}{2\sum_{\tau=0}^{t-1}\gamma_\tau}.

Remark. Note that:
1. Unlike most algorithms we discussed so far in this course, this algorithm is not a gradient-based algorithm.
2. γ_t can be chosen arbitrarily; the algorithm converges as long as ∑_t γ_t → ∞.
However, the cost of the proximal operator will depend on γ_t: for
larger γ_t, the algorithm converges faster, but the proximal operator
prox_{γ_t f}(x_t) might be harder to compute.
3. If γ_t = µ (constant), then f(x_t) − f⋆ = O\left(\frac{1}{\mu t}\right). This matches the
O(1/t) rate we obtain from the gradient descent perspective.

11.6 Proximal Gradient Methods

11.7 Exercises
Exercise 70 (Generalized Pythagorean Theorem). Let ω(·) : Ω → R be
strictly convex and continuously differentiable, X ⊆ Ω be closed and convex.
Define the Bregman projection of a point y onto X as:

ΠωX (y) := argmin Vω (x, y).


x∈X

Then for any x ∈ X, y ∈ Ω it holds that

Vω (x, y) ≥ Vω (x, ΠωX (y)) + Vω (ΠωX (y), y).

Exercise 71 (Mirror Descent, smooth setting). Let f be convex and if the gradi-
ent of f is Lipschitz continuous such that k∇f (x) − ∇f (y)k∗ ≤ L kx − yk , ∀x, y.
Show that by setting γt = 1/L, the sequence of iterates {xt } generated by Mirror
Descent satisfies that

? L · Vω (x? , x1 )
min f (xt ) − f ≤ .
1≤t≤T T
Exercise 72 (Compute Conjugate). Calculate the conjugate of the following
convex functions:
(a) f (x) = ex on R

(b) f (x) = kxk on Rd

(c) f (x) = 12 kxk2 on Rd

(d) f(x) = \log\left(\sum_{i=1}^d \exp\{x_i\}\right) on R^d

Exercise 73 (Calculus of conjugate). Prove the following

(a) (Scalar Multiplication) Let f (x) be convex and α > 0, then

(αf )∗ (y) = αf ∗ (y/α)

(b) (Direct Summation) Let f (x1 ) and g(x2 ) be convex and h(x1 , x2 ) = f (x1 )+
g(x2 ), then
h∗ (y1 , y2 ) = f ∗ (y1 ) + g ∗ (y2 )

(c) (Weighted Summation) Let f(x) and g(x) be closed convex functions, and
h(x) = f(x) + g(x), then
    h^*(y) = \inf_z\{f^*(z) + g^*(y - z)\},
where the latter is the infimal convolution of f^* and g^*.


[Hint: First show that (inf z {F (z) + G(y − z)})∗ = F ∗ (y) + G∗ (y), and
then apply with F = f ∗ , and G = g ∗ .]

Exercise 74 (Fenchel inequality). From the definition of conjugate function, we


have for any x and y,
x> y ≤ f (x) + f ∗ (y).
Show that x> y = f (x) + f ∗ (y) if and only if y ∈ ∂f (x).

Chapter 12

Stochastic Optimization

Contents
12.1 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . 250
12.1.1 Convergence for strongly convex functions . . . . . . 252
12.1.2 Convergence for convex functions . . . . . . . . . . . 253
12.1.3 Convergence of SGD under constant stepsize . . . . . 253
12.1.4 Convergence for nonconvex functions . . . . . . . . . 255
12.2 Adaptive Stochastic Gradient Methods . . . . . . . . . . . . . 257
12.2.1 Popular variants: AdaGrad, RMSProp and Adam . . 257
12.2.2 Theory and Practice . . . . . . . . . . . . . . . . . . . 259
12.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

Stochastic optimization involves decision-making in the presence of
randomness and lies at the heart of Data Science. The stochastic optimization problem is often formulated as
    \min_{x\in X} F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x),   (12.1)
or in a more general form
    \min_{x\in X} F(x) = \mathbb{E}_\xi\left[f(x, \xi)\right],   (12.2)
where f(x, ξ) is a function involving the decision variable x and a random variable (vector) ξ. Here ξ is some well-defined random variable with support Ξ ⊂ R^m that follows the distribution P(ξ). The former
problem (12.1), also known as the finite-sum problem or big-n problem, can
be viewed as a special case of the latter problem (12.2). This can be seen
by setting ξ to be the uniform distribution over the index set {1, 2, . . . , n}; then
F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) = \mathbb{E}_\xi[f_\xi(x)].

Example: supervised learning. In a supervised learning task, we are
given a set of training data points (a_1, b_1), . . . , (a_n, b_n), where a_i is the feature vector, b_i is the label, and n is the size of the training data. Oftentimes,
we assume the data are generated i.i.d. from some unknown data distribution. The goal is to find a predictor (either a regressor or a classifier)
h(·) ∈ H from some hypothesis class H that fits the data by minimizing
the empirical risk (training error):
    \min_{h(\cdot)\in\mathcal{H}} L_n(h) := \frac{1}{n}\sum_{i=1}^n \ell(h(a_i), b_i),   (12.3)

where `(·, ·) is some loss function such as least square loss, hinge loss, lo-
gistic loss, etc., and the predictor h(·) can come from a parametric fam-
ily such as linear models or neural networks or a non-parametric family
such as reproducing kernel Hilbert space or random forests. In statistical
learning, instead, we often seek to minimize the expected risk over the
population rather than the empirical distribution (testing error):

    \min_{h(\cdot)\in\mathcal{H}} L(h) := \mathbb{E}_{(a,b)\sim P}\left[\ell(h(a), b)\right],   (12.4)

249
where P is the unknown data distribution, and (a, b) is a random testing
data point sampled from the distribution. The expected risk measures the
generalization performance and can be well approximated by the empiri-
cal risk if n is sufficiently large.

Computational challenge. In general, for finite-sum problems of the form
(12.1), computing the full gradient can be expensive when n is large. In
big data applications, we cannot afford to go through the data many times. For the purely stochastic objective of the form (12.2), computing the
gradient ∇F(x) can be intractable, as it involves integration over a distribution, and the distribution P(ξ) is often unknown in practice. Hence,
gradient descent methods and their variants as discussed before are not
immediately applicable to such stochastic optimization problems.

12.1 Stochastic Gradient Descent


Stochastic Gradient Descent (SGD), also known as Stochastic Approximation (SA), is a popular method to solve the stochastic optimization problem; it dates back to Robbins and Monro (1951). Assume f(x, ξ) is
continuously differentiable for any realization ξ ∈ Ξ. Given a point x_1, the
idea of classic Stochastic Approximation is to update x_{t+1} by
    x_{t+1} := \Pi_X\left(x_t - \gamma_t\nabla f(x_t, \xi_t)\right), \quad t \ge 1,
where ξ_t ∼ P(ξ) i.i.d. Here the gradient ∇ is taken with respect to the argument x.
In the special case of the finite-sum problem (12.1), the SGD update is
given by
    x_{t+1} := \Pi_X\left(x_t - \gamma_t\nabla f_{i_t}(x_t)\right),
where i_t is sampled uniformly at random from {1, . . . , n}. It follows immediately that the stochastic gradient ∇f_{i_t}(x_t) is unbiased, i.e.,
    \mathbb{E}_{i_t}[\nabla f_{i_t}(x_t)\,|\,x_t] = \sum_{i=1}^n \nabla f_i(x_t)\cdot P(i_t = i) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t) = \nabla F(x_t).

For the general setting, without loss of generality, we always assume


that ∇F (x, ξ) is well-defined for any x ∈ X almost everywhere on ξ ∈ Ξ
such that gradients and integrals can be exchanged (this holds under mild

250
regularity conditions, say e.g., following dominated convergence theo-
rem). In other words, the stochastic gradient is unbiased, i.e.,

Z Z
E [∇f (x, ξ)] = ∇f (x, ξ)P (ξ)dξ = ∇ f (x, ξ)P (ξ)dξ = ∇F (x).
ξ ξ

Example 12.1. Consider the ordinary least-squares regression problem:
    \min_{x\in\mathbb{R}^d} F(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(a_i^\top x - b_i)^2,
where {(a_i, b_i)}_{i=1}^n are the training data. Full GD works as follows:
    x_{t+1} = x_t - \frac{\gamma}{n}\sum_{i=1}^n (a_i^\top x_t - b_i)a_i;
while SGD works as follows:
    x_{t+1} = x_t - \gamma_t(a_{i_t}^\top x_t - b_{i_t})a_{i_t}, \quad \text{where } i_t \sim \mathrm{Unif}\{1, \ldots, n\}.
As a comparison, the per-iteration cost of GD is a factor O(n) more expensive
than SGD.
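
A sketch comparing the two updates of Example 12.1 on synthetic data; the data, stepsizes and iteration counts below are illustrative choices, not prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def full_gd(steps, gamma):
    x = np.zeros(d)
    for _ in range(steps):
        x -= gamma * A.T @ (A @ x - b) / n           # one step costs O(n d)
    return x

def sgd(steps):
    x = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)
        gamma_t = 1.0 / (10 + t)                     # decreasing stepsize of order 1/t
        x -= gamma_t * (A[i] @ x - b[i]) * A[i]      # one step costs O(d)
    return x

print(np.linalg.norm(full_gd(200, 0.5) - x_true))    # distance to x_true after 200 full passes
print(np.linalg.norm(sgd(20000) - x_true))           # distance after 20000 single-sample steps
```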

Remark 12.2. We make a few observations here.

1. We need a decreasing sequence {γt }t≥1 and γt → 0 as t goes to infinity to


ensure convergence. Suppose xt converges to a limit point y, we have y =
y − γ∇f (y, ξ). However, since ∇f (y, ξ) is random, we cannot guarantee
that ∇f (y, ξ) = 0, ∀ξ ∈ Ξ. Hence, we need γt → 0 as t → ∞.

2. For the SA algorithm, the iterate xt = xt (ξ [t−1] ) is a function of the i.i.d.


historic sample ξ [t−1] = (ξ 1 , . . . , ξ t−1 ) of the generated random process,
so xt and F (xt ) are random variables. We cannot use the previous error
functions to measure the optimality, e.g. [F (xt ) − F ? ] and kxt − x? k22 .
Instead, a more appropriate criterion is to consider expectation or
high-probability results.

12.1.1 Convergence for strongly convex functions
Theorem 12.3 ([NJLS09]). Assume F(x) is µ-strongly convex and ∃M > 0
s.t. E[‖∇f(x, ξ)‖₂²] ≤ M², ∀x ∈ X. Then the SA method with γ_t = γ/t at iteration t,
where γ > 1/(2µ), satisfies
    \mathbb{E}\left[\|x_t - x^\star\|_2^2\right] \le \frac{C(\gamma)}{t}, \quad \text{where } C(\gamma) = \max\left\{\frac{\gamma^2 M^2}{2\mu\gamma - 1}, \|x_1 - x^\star\|_2^2\right\}.
Proof. For any given x_t, we compute x_{t+1} using the sample ξ_t generated in this
iteration, and the distance of x_{t+1} to the optimum x⋆ satisfies
    \|x_{t+1} - x^\star\|_2^2 = \|\Pi_X(x_t - \gamma_t\nabla f(x_t, \xi_t)) - \Pi_X(x^\star)\|_2^2
        \le \|x_t - \gamma_t\nabla f(x_t, \xi_t) - x^\star\|_2^2
        = \|x_t - x^\star\|_2^2 - 2\gamma_t\langle\nabla f(x_t, \xi_t), x_t - x^\star\rangle + \gamma_t^2\|\nabla f(x_t, \xi_t)\|_2^2.
Taking expectation on both sides of the above inequality, we have
    \mathbb{E}\left[\|x_{t+1} - x^\star\|_2^2\right] \le \mathbb{E}\left[\|x_t - x^\star\|_2^2\right] - 2\gamma_t\,\mathbb{E}\left[\langle\nabla f(x_t, \xi_t), x_t - x^\star\rangle\right] + \gamma_t^2 M^2.   (12.6)
By the law of total expectation and the linearity of expectation, we have
    \mathbb{E}\left[\langle\nabla f(x_t, \xi_t), x_t - x^\star\rangle\right] = \mathbb{E}\left[\mathbb{E}\left[\langle\nabla f(x_t, \xi_t), x_t - x^\star\rangle\,|\,x_t\right]\right]
        = \mathbb{E}\left[\langle\mathbb{E}[\nabla f(x_t, \xi_t)\,|\,x_t], x_t - x^\star\rangle\right]
        = \mathbb{E}\left[\langle\nabla F(x_t), x_t - x^\star\rangle\right],
where the last equality holds because ξ_t is independent of x_t = x_t(ξ_[t−1]).
By the µ-strong convexity of F(x), it follows that
    \langle\nabla F(x_t), x_t - x^\star\rangle \ge \mu\|x_t - x^\star\|_2^2 + \langle\nabla F(x^\star), x_t - x^\star\rangle \ge \mu\|x_t - x^\star\|_2^2, \quad \forall x_t \in X.
Combining this with inequality (12.6), we get
    \mathbb{E}\left[\|x_{t+1} - x^\star\|_2^2\right] \le (1 - 2\mu\gamma_t)\,\mathbb{E}\left[\|x_t - x^\star\|_2^2\right] + \gamma_t^2 M^2.
Since γ_t = γ/t and γ > 1/(2µ), the inequality above becomes
    \mathbb{E}\left[\|x_{t+1} - x^\star\|_2^2\right] \le \left(1 - \frac{2\mu\gamma}{t}\right)\mathbb{E}\left[\|x_t - x^\star\|_2^2\right] + \frac{\gamma^2 M^2}{t^2}.
By induction, we conclude that \mathbb{E}\left[\|x_t - x^\star\|_2^2\right] \le \frac{C(\gamma)}{t}.

12.1.2 Convergence for convex functions
Analogous to the deterministic setting, we can also generalize SGD to non-
Euclidean setting by leveraging Bregman divergence. This will immedi-
ately lead to the Stochastic Mirror Descent, where we simply replace the
gradient by the stochastic gradient estimator.
Stochastic Mirror Descent, works as follows:

xt+1 = argmin {Vω (x, xt ) + hγt G(xt , ξ t ), xi} ,


x∈X

where for given input x, ξ, the estimator G(x, ξ) satisfies that E[G(x, ξ)] ∈
∂F (x) and E[kG(x, ξ)k2∗ ] ≤ M 2 . Note that we don’t necessarily require
F (x) or f (x, ξ) to be differentiable. Here ω(x) is the distance-generating
function that is continuously differentiable and 1-strongly convex function
w.r.t. some norm k·k on X. And k·k∗ is the dual norm of k·k. The Bregman
divergence is defined as Vω (x, y) = ω(x)−ω(y)−∇ω(y)> (x−y), ∀x, y ∈ X.

Theorem 12.4 ([NJLS09]). Let F be convex. Then Stochastic Mirror Descent satisfies
    \mathbb{E}[F(\hat{x}_T) - F(x^\star)] \le \frac{R^2 + \frac{M^2}{2}\sum_{t=1}^T \gamma_t^2}{\sum_{t=1}^T \gamma_t},
where R^2 = \max_{x\in X} V_\omega(x, x_1) and \hat{x}_T = \frac{\sum_{t=1}^T \gamma_t x_t}{\sum_{t=1}^T \gamma_t}.

The proof follows similarly to the proof for Mirror Descent, which we
omit here. As an immediate result, if we set the stepsize γ_t = O\left(\frac{R}{M\sqrt{T}}\right), we have
\mathbb{E}[F(\hat{x}_T) - F(x^\star)] \le O\left(\frac{RM}{\sqrt{T}}\right), which implies an overall
sample complexity of O(1/ε²) in order to achieve an ε-optimal solution in
expectation.
It is worth pointing out that these complexity bounds, namely, O(1/ε)
and O(1/ε²) for the strongly convex and convex cases, respectively, match
the information-theoretic lower bounds established in [AWBR09]; thus
they are unimprovable without further structural assumptions.

12.1.3 Convergence of SGD under constant stepsize


So far, we have seen the necessity of using diminishing stepsize for SGD to
converge to an optimal solution for convex problems, albeit at a sublinear

253
convergence rate. On the other hand, SGD with constant stepsize may
converge faster, but is only guaranteed to converge to a neighborhood of
the optimal solution, up to constant error.
Below we present one of such results established in the literature. We
omit the proof here and leave it as a self-exercise.
Theorem 12.5 ([BCN18]). Assume that F(x) is both µ-strongly convex and
L-smooth. Moreover, assume that the stochastic gradient satisfies
    \mathbb{E}[\|\nabla f(x, \xi)\|_2^2] \le \sigma^2 + c\,\|\nabla F(x)\|_2^2.   (12.5)
Then SGD with constant stepsize γ_t ≡ γ ≤ \frac{1}{Lc} achieves
    \mathbb{E}[F(x_t) - F(x^\star)] \le \frac{\gamma L\sigma^2}{2\mu} + (1 - \gamma\mu)^{t-1}[F(x_1) - F(x^\star)],
where x⋆ is the optimal solution.
The assumption on the stochastic gradient in (12.5) is much weaker
than bounded stochastic gradients used in previous results. The assump-
tion can be viewed as a generalization of bounded variance assumption:
if the stochastic gradient has bounded variance, namely,
E[k∇f (x, ξ) − ∇F (x)k22 ] ≤ σ 2 ,
then the assumption in (12.5) holds with c = 1. When σ 2 = 0, the condition
reduces to E[k∇f (x, ξ)k22 ] ≤ c k∇F (x)k22 , which is also called the strong
growth condition (SGC) with constant c. This implies that the stochastic
gradients shrink relative to the full gradient. Note that in the purely de-
terministic setting, we have σ 2 = 0 and c = 1.
The above theorem implies that SGD with constant stepsize converges
linearly to some neighborhood of the optimal solution. Moreover, there
is a tradeoff between the convergence speed and the accuracy of the solu-
tion: as γ decreases, the convergence rate deteriorates, but the last iterate
is closer to the optimal solution.
Remark 12.6 (SGC). Under the strong growth condition, namely, when
E[k∇f (x, ξ)k22 ] ≤ c k∇F (x)k22 ,
SGD with constant stepsize converges to the global optimum at a linear rate.
Recently, it has been shown that modern overparameterized machine learning
models falling into the interpolation regime commonly satisfy such strong growth
condition and enjoys fast linear convergence [VBS19].

254
Remark 12.7 (Interpolation). Consider the finite-sum objective
n
1X
F (x) = fi (x),
n i=1

where each component function fi (·) corresponds to the loss under one data point.
In the interpolation regime, each individual loss attains its minimum at x? , thus
∇fi (x? ) = 0, ∀i = 1, . . . , n. In particular, for many regression or classification
problems in the realizable setting, using over-parameterized model often leads
to interpolation and zero training loss. Observe that the SGC in the finite-sum
setting implies interpolation.

12.1.4 Convergence for nonconvex functions


So far, we discussed the convergence of SGD for convex and strongly con-
vex functions. What about nonconvex functions? In general, SGD will
only converge to a stationary point and is not guaranteed to converge to
a local or global minima. For simplicity of presentation, here we consider
the unconstrained setting with X = Rd .

Theorem 12.8. Consider the stochastic optimization problem \min_{x\in X} F(x) :=
\mathbb{E}[f(x, \xi)], where F is L-smooth and the stochastic gradient has bounded variance, namely, \mathbb{E}[\|\nabla f(x, \xi) - \nabla F(x)\|_2^2] \le \sigma^2. Then SGD with stepsize γ_t ≡
γ := \min\left\{\frac{1}{L}, \frac{\gamma_0}{\sigma\sqrt{T}}\right\} achieves
    \mathbb{E}[\|\nabla F(\hat{x}_T)\|^2] \le \frac{2L(F(x_1) - F(x^*))}{T} + \frac{\sigma}{\sqrt{T}}\left(\frac{2(F(x_1) - F(x^*))}{\gamma_0} + L\gamma_0\right),
where x̂_T is selected uniformly at random from {x_1, . . . , x_T}.

Proof. Based on L-smoothness of the objective, we have
    \mathbb{E}[F(x_{t+1}) - F(x_t)] \le \mathbb{E}\left[\nabla F(x_t)^\top(x_{t+1} - x_t) + \frac{L}{2}\|x_{t+1} - x_t\|^2\right].
Plugging in the definition of x_{t+1} = x_t − γ_t∇f(x_t, ξ_t), we have
    \mathbb{E}[F(x_{t+1}) - F(x_t)] \le \mathbb{E}\left[-\gamma_t\nabla F(x_t)^\top\nabla f(x_t, \xi_t) + \frac{L\gamma_t^2}{2}\|\nabla f(x_t, \xi_t)\|^2\right].
Invoking the unbiasedness of the stochastic gradient, \mathbb{E}[\nabla f(x_t, \xi_t)\,|\,x_t] =
\nabla F(x_t), and the fact that \mathbb{E}[\|\nabla f(x_t, \xi_t)\|^2\,|\,x_t] \le \sigma^2 + \|\nabla F(x_t)\|_2^2, this further
implies that
    \mathbb{E}[F(x_{t+1}) - F(x_t)] \le -\left(\gamma_t - \frac{L\gamma_t^2}{2}\right)\mathbb{E}\|\nabla F(x_t)\|^2 + \frac{L\sigma^2\gamma_t^2}{2}.
Since γ_t ≤ 1/L, we have \gamma_t - \frac{L\gamma_t^2}{2} \ge \gamma_t/2. Therefore,
    \mathbb{E}[F(x_{t+1}) - F(x_t)] \le -\frac{\gamma_t}{2}\mathbb{E}\|\nabla F(x_t)\|^2 + \frac{L\sigma^2\gamma_t^2}{2}.
The rest follows by a telescoping sum and the definitions of x̂_T and γ_t. Recall
γ_t = γ = \min\left\{\frac{1}{L}, \frac{\gamma_0}{\sigma\sqrt{T}}\right\}:
    \mathbb{E}[\|\nabla F(\hat{x}_T)\|^2] = \frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(x_t)\|^2
        \le \frac{2(F(x_1) - F(x^*))}{\gamma T} + \gamma\sigma^2 L
        \le \frac{2(F(x_1) - F(x^*))}{T}\max\left\{L, \frac{\sigma\sqrt{T}}{\gamma_0}\right\} + \frac{\gamma_0}{\sigma\sqrt{T}}\sigma^2 L
        \le \frac{2L(F(x_1) - F(x^*))}{T} + \frac{\sigma}{\sqrt{T}}\left(\frac{2(F(x_1) - F(x^*))}{\gamma_0} + L\gamma_0\right).

Remark 12.9. The above theorem implies that to find an ε-stationary point x̄
such that E[k∇F (x̄)k] ≤ ε, SGD requires at most O(1/ε4 ) stochastic gradient
evaluations. In other words, the sample complexity for finding an approximate
stationary point is at most O(1/ε4 ). Recently, [ACD+ 22] proved that this sample
complexity is unimprovable for general stochastic optimization given unbiased
stochastic oracles, without further assumptions.
Before we proceed, below we summarize the computation complexity
between SGD and GD for solving finite-sum problems of form (12.1) for a
comparison.
As we can see that GD converges faster but with expensive iteration
cost, whereas SGD converges slowly but with cheaper iteration cost. For
problems with large n and moderate accuracy ε, SGD is more appealing
than GD as the total complexity is independent of n.

Table 12.1: Complexity for smooth and strongly convex problems (κ = L/µ)

        iteration complexity     iteration cost   total
GD      O(κ log(1/ε))            O(n)             O(nκ log(1/ε))
SGD     O(1/ε)                   O(1)             O(1/ε)

Table 12.2: Complexity for nonconvex problems

        iteration complexity     iteration cost   total
GD      O(1/ε²)                  O(n)             O(n/ε²)
SGD     O(1/ε⁴)                  O(1)             O(1/ε⁴)

12.2 Adaptive Stochastic Gradient Methods


As we discussed earlier, in theory, it is impossible to improve the con-
vergence rate of SGD for convex and strongly convex functions without
imposing additional structure. In practice, the performance of SGD can be
very sensitive to the choice of stepsize and hyperparameter tuning.
Adaptive gradient methods, whose stepsizes and search directions are
adjusted based on past gradients, have received phenomenal popular-
ity and are proven successful in a variety of large-scale machine learn-
ing applications. Prominent examples include AdaGrad [DHS11a], RM-
SProp [HSS12], AdaDelta [Zei12], Adam [KB15], and AMSGrad [RKK18],
just to name a few. Their empirical success is especially pronounced for
nonconvex optimization such as training deep neural networks. Besides
improved performance, being parameter-agnostic is another important trait
of adaptive methods. Unlike (stochastic) gradient descent, adaptive meth-
ods often do not require a priori knowledge about problem-specific pa-
rameters (such as Lipschitz constants, smoothness, etc.)

12.2.1 Popular variants: AdaGrad, RMSProp and Adam


Below, we introduce some popular variants and their key steps:

1. AdaGrad: AdaGrad rescales the learning rate component-wise by
the square root of the cumulative sum of the squared past gradients:
    v_t = v_{t-1} + \nabla f(x_t, \xi_t)\odot\nabla f(x_t, \xi_t),
    x_{t+1} = x_t - \frac{\gamma_0}{\varepsilon + \sqrt{v_t}}\odot\nabla f(x_t, \xi_t).
Here the operator ⊙ stands for the component-wise product (the square root and division are also component-wise), and ε > 0
ensures the denominator is positive. The key idea is to adapt the
learning rate to the parameters non-uniformly: use smaller stepsizes for parameters with frequent features. The downside of AdaGrad is that the learning rate decreases too aggressively and
becomes too small at a later stage.
AdaGrad can also be viewed as adaptively setting the Bregman divergence of Mirror Descent as follows:
    \omega(x) = \frac{1}{2}x^\top x \implies \omega_t(x) = \frac{1}{2}x^\top H_t x, \quad \text{where } H_t = \varepsilon I + \Big[\sum_{\tau=1}^t g_\tau g_\tau^\top\Big]^{1/2}
and g_t = ∇f(x_t, ξ_t).
2. RMSProp: RMSProp uses a moving average of the squared gradients with a discount factor to slow down the decay of the learning rates:
    v_t = \beta v_{t-1} + (1 - \beta)\,\nabla f(x_t, \xi_t)\odot\nabla f(x_t, \xi_t),
    x_{t+1} = x_t - \frac{\gamma_0}{\varepsilon + \sqrt{v_t}}\odot\nabla f(x_t, \xi_t).
Here β ∈ (0, 1) is chosen to be close to 1.
3. Adam: Adam combines RMSProp with momentum estimation. Similar to RMSProp, Adam keeps an exponentially decaying average of past squared gradients, and in addition an exponentially decaying average of the past gradients themselves (the momentum term). Because
of the factors β1, β2, the estimates m_t and v_t of the first and second
moments of the gradient are biased; Adam counteracts these
biases by normalizing these terms:
    v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,\nabla f(x_t, \xi_t)\odot\nabla f(x_t, \xi_t),
    m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla f(x_t, \xi_t),
    x_{t+1} = x_t - \frac{\gamma_0}{\varepsilon + \sqrt{\tilde{v}_t}}\odot\tilde{m}_t.
Here \tilde{v}_t = v_t/(1 - \beta_2^t) and \tilde{m}_t = m_t/(1 - \beta_1^t) are the bias-corrected estimates, and both β1
and β2 are chosen to be close to 1. A compact code sketch of these updates is given below.
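
A compact sketch of the Adam update above applied to a toy stochastic problem; all operations are component-wise, and the hyperparameter values and the toy objective are illustrative defaults, not from the notes.

```python
import numpy as np

def adam_step(x, m, v, grad, t, gamma0=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    x = x - gamma0 * m_hat / (eps + np.sqrt(v_hat))
    return x, m, v

# minimize f(x) = 1/2 ||x||^2 with noisy gradient evaluations
rng = np.random.default_rng(0)
x, m, v = np.ones(5), np.zeros(5), np.zeros(5)
for t in range(1, 2001):
    g = x + 0.1 * rng.normal(size=5)    # stochastic gradient of f at x
    x, m, v = adam_step(x, m, v, g, t)
print(x)    # close to the minimizer 0
```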

Below we introduce a generic framework proposed in [RKK19]. This
framework recovers several popular adaptive methods. The generic setup
is as follow:

Algorithm 5 Generic Adaptive Scheme


1: for t = 1, 2, . . . , T do
2: gt = ∇f (xt , ξ t )
3: mt = φt (g1 , . . . , gt )
4: Vt = ψt (g1 , . . . , gt )
−1/2
5: x̂t = xt − αt Vt mt
1/2
6: xt+1 = argminx∈X {(x − x̂t )> Vt (x − x̂t )}
7: end for

Here φ_t and ψ_t are functions to be specified. In particular, when φ_t(g_1, . . . , g_t) =
g_t and ψ_t(g_1, . . . , g_t) = I (the identity matrix), this simply reduces to
standard stochastic gradient descent. When
    \phi_t(g_1, \ldots, g_t) = g_t, \qquad \psi_t(g_1, \ldots, g_t) = \frac{\sum_{\tau=1}^t g_\tau^2}{t},
this recovers the AdaGrad algorithm originally introduced in [DHS11b].
Here g_τ² refers to the element-wise square.
When we use the exponential moving average of gradients, namely,
t
X t
X
φt (g1 , . . . , gt ) = (1−β1 ) β1t−τ gτ , ψt (g1 , . . . , gt ) = (1−β2 )diag( β2t−τ gτ2 ),
τ =1 τ =1

for some β1 , β2 ∈ (0, 1), this recovers the well-known Adam algorithm [KB14].
In practice, a common choice is to set β1 = 0.9 and β2 = 0.99.

12.2.2 Theory and Practice


In practice, adaptive methods are less sensitive to parameter tuning and
adaptive to sparse features. Numerical experiments show that adaptive
methods significantly outperform SGD on natural language models, and
training GANs, but they are less effective in computer vision tasks. [WRS+ 17]
showed that for some tasks, adaptive methods tend to overfit and gener-
alize worse than their non-adaptive counterparts. They often display fast

259
initial progress on the training set, but their performance quickly plateaus
on the testing set.
On the theoretical front, some adaptive methods can achieve nearly the
same convergence guarantees as (stochastic) gradient descent [DHS11a,
WWB19, RKK18]. For instance, [DHS11a] √  introduce AdaGrad for con-
vex online learning and achieve O T regrets. [LO19] and [WWB19]
show an O(εe −4 ) complexity for AdaGrad in the nonconvex stochastic op-
timization. On the other hand, [RKK18] point out √ the non-convergence of
Adam for simple convex problems when β1 < β2 and provide a remedy
with non-increasing stepsizes. There is a surge in the study of Adam-
type algorithms due to their popularity in the deep neural network train-
ing [ZRS+ 18, CLSH19, LJH+ 20]. Some work provides the convergence re-
sults for adaptive methods in the strongly-convex optimization [WLC+ 20,
Lev17, MH17].

12.3 Exercises
Exercise 75 (Weak Growth Condition). Suppose F (·) is L-smooth and has a
minima at x? . We say the stochastic gradient satisfies the weak growth condi-
tion with constant c if

E[k∇f (x, ξ)k22 ] ≤ 2cL[F (x) − F (x? )].

Prove that

1. For convex function F , strong growth condition implies weak growth con-
dition.

2. For µ-strongly convex function F , weak growth condition implies strong


growth condition.

Exercise 76 (Interpolation). If F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) is convex and each component function f_i(x) is L-smooth, then interpolation implies the weak growth condition.

Exercise 77 (Individual Smoothness). Let F (x) = E[f (x, ξ)], where f (x, ξ)
is convex and L-smooth for any realization of ξ. Define x? = argminx F (x).
Show that

260
(a) E[k∇f (x, ξ) − ∇f (x? , ξ)k22 ] ≤ 2L[F (x) − F (x? )].

(b) E[k∇f (x, ξ)k22 ] ≤ 4L[F (x) − F (x? )] + 2E[k∇f (x? , ξ)k22 ].

Exercise 78 (SGD in the Interpolation Regime). Consider the finite-sum problem \min_x F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x). Suppose each f_i(x) is non-negative, L-smooth
and convex, and the function F(x) is µ-strongly convex. Define x⋆ = argmin_x F(x).
Assume that for all 1 ≤ i ≤ n, f_i(x⋆) = 0. In the context of least-squares regression, this means the model interpolates the data with zero loss.
Under the above assumptions, prove that SGD with stepsize γ_t = 1/L achieves
linear convergence:
    \mathbb{E}[\|x_{t+1} - x^\star\|_2^2] \le \left(1 - \frac{\mu}{L}\right)\mathbb{E}[\|x_t - x^\star\|_2^2].
Furthermore, we have
    \mathbb{E}[F(x_{t+1})] \le \frac{L}{2}\left(1 - \frac{\mu}{L}\right)^t\|x_1 - x^\star\|_2^2.

Chapter 13

Finite Sum Optimization

Contents
13.1 Variance Reduction Technique . . . . . . . . . . . . . . . . . . 263
13.1.1 Mini-batch and importance sampling . . . . . . . . . 263
13.1.2 Using control variate . . . . . . . . . . . . . . . . . . . 264
13.2 Stochastic Variance-Reduced Algorithms . . . . . . . . . . . 265
13.2.1 SAG/SAGA . . . . . . . . . . . . . . . . . . . . . . . . 266
13.2.2 SVRG . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
13.2.3 SPIDER/SARAH/STORM . . . . . . . . . . . . . . . 272
13.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 274

13.1 Variance Reduction Technique
In the previous theoretical analysis, we have seen the important role of
variance in the convergence. For instance, when using constant stepsize,
for strongly convex and smooth objectives, SGD achieves
    \mathbb{E}[F(x_t) - F(x^\star)] \le \frac{\gamma L\sigma^2}{2\mu} + (1 - \gamma\mu)^{t-1}[F(x_1) - F(x^\star)],
where σ² is an upper bound on the variance, which controls the accuracy
of the solution.
Although we cannot improve the convergence rate of SGD, there are
many ways that we can still improve the performance by reducing the
variance. We discuss several common variance reduction strategies below.

13.1.1 Mini-batch and importance sampling


• Mini-batch sampling: use a small batch of samples instead of one to
estimate the gradient at every iteration
    \nabla f(x_t, \xi_t) \implies \frac{1}{b}\sum_{i=1}^b \nabla f(x_t, \xi_{t,i})

Consequently, the variance of the new stochastic gradient will be


O(b) times smaller, i.e. the constant term σ 2 in the convergence now
reduces to σ 2 /b. However, the overall computation cost or sample
complexity remains the same.

• Importance sampling: Instead of sampling from ξ ∼ P , we can


obtain samples from another well defined random variable η with
nominal distribution Q, and use a different stochastic gradient,

P (η t )
G(xt , ξ t ) =⇒ G(xt , η t )
Q(η t )

The variance of the new stochastic gradient under properly chosen


distribution Q could be smaller.

263
• Momentum: add momentum to the gradient step
    x_{t+1} = x_t - \gamma_t\hat{m}_t, \quad \text{where } \hat{m}_t = c\cdot\sum_{\tau=1}^t \alpha^{t-\tau}\nabla f_{i_\tau}(x_\tau).
  Here \hat{m}_t is a weighted average of the past stochastic gradients,
  with larger weights on the more recently sampled gradients.

13.1.2 Using control variate


In recent years, another common variance reduction technique based on
control variate has been exploited and led to significant theoretical and
practical improvements. Below we first discuss the main idea.
Suppose we want to estimate Θ = E[X], the expected value of a ran-
dom variable X. Suppose we also have access to a random variable Y
which is highly correlated with X, and we can compute E[Y ] easily. Let’s
consider the following point estimator Θ̂α with α ∈ [0, 1]:

Θ̂α = α(X − Y ) + E[Y ] (13.1)


The expectation and variance are given by,

E[Θ̂α ] = αE[X] + (1 − α)E[Y ] (13.2)


Var[Θ̂α ] = α2 (Var[X] + Var[Y ] − 2Cov[X, Y ]) (13.3)
Note that:
• When α = 1, this estimator becomes (X − Y) + E[Y], which is an unbiased
estimator.
• When α = 0, this estimator reduces to the constant E[Y], which has zero
variance but could be heavily biased.

• If Cov[X, Y ] is sufficiently large, then Var[Θ̂α ] < Var[X]. The new


estimator Θ̂α has smaller variance than the direct estimator X.
• As α increases from 0 to 1, the bias decreases and the variance in-
creases.
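
A quick numerical illustration of the control-variate estimator (13.1) in the unbiased case α = 1: Y is correlated with X and E[Y] is known exactly. The particular random variables chosen below are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
U = rng.uniform(size=N)
X = np.exp(U)            # we want Theta = E[e^U] = e - 1
Y = U                    # control variate with known mean E[Y] = 1/2

theta_plain = X.mean()
theta_cv = (X - Y).mean() + 0.5          # alpha = 1 estimator: mean(X - Y) + E[Y]

print(theta_plain, theta_cv)             # both near e - 1 ~ 1.7183
print(X.var(), (X - Y).var())            # the control variate substantially cuts the variance
```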
In the next section, we show how these variance reduction techniques
can be integrated with SGD to achieve faster convergence for solving finite-
sum problems.

13.2 Stochastic Variance-Reduced Algorithms
In this section, we focus on the finite-sum optimization problem
    \min_x F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x).
Recall that the gradient descent update is given as follows,
    x_{t+1} = x_t - \gamma\cdot\nabla F(x_t) = x_t - \frac{\gamma}{n}\sum_{i=1}^n \nabla f_i(x_t),

which requires expensive per-iteration cost, namely, O (n) gradient com-


putation at every iteration, yet with a fast convergence rate (linear rate for
strongly convex and smooth functions with large constant stepsize). On
the other hand, the stochastic gradient descent update,
xt+1 = xt − γt ∇fit (xt ),
only requires O (1) gradient computation at every iteration, yet suffers
from a slow convergence rate and small stepsize.
A natural question is: can we achieve best of both worlds, namely, can
we design algorithms with fast convergence rate like GD but with cheap
iteration cost like SGD?
Inspired by the seminal work on stochastic average gradient (SAG)
[SLRB17], a family of stochastic variance-reduced methods have been in-
troduced in the last decade, both for convex stochastic optimization and
nonconvex stochastic optimization. For convex problems, representative
variance-reduced algorithms include SVRG (stochastic variance-reduced
gradient) [JZ13], SAGA (stochastic average gradient amélioré) [DBLJ14].
There also exist many other variants, including but not limited to SDCA,
MISO, Finito, Catalyst-SVRG, S2GD, etc. See [GSBR20] for a comprehen-
sive survey. These methods are as cheap to update as SGD, but have as
fast convergence as full gradient descent. For nonconvex optimization,
there are several recently developed variance-reduced variants: SPIDER
[FLLZ18, WJZ+ 19], SARAH [NLST17], STORM [CO19], etc.
All of these algorithms can be viewed as applying the variance reduc-
tion technique described above, in one way or the other. Below we discuss
some representative algorithms designed for the convex and nonconvex
settings, respectively.

13.2.1 SAG/SAGA
SAG: The key idea of SAG is to keep track of the average of the past
stored gradients of each component (denoted v_i) as an estimate of the
full gradient:
    \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_t) = \nabla F(x_t) \approx g_t = \frac{1}{n}\sum_{i=1}^n v_i^t,
where the stored gradients {v_i}_{i=1}^n of the component functions are updated as
    v_i^t = \begin{cases} \nabla f_{i_t}(x_t), & \text{if } i = i_t,\\ v_i^{t-1}, & \text{if } i \ne i_t. \end{cases}
Note that at every iteration, only the vector v_{i_t} that corresponds to the
random index i_t is updated.
Equivalently, we can compute the average g_t via the recursive update
    g_t = g_{t-1} - \frac{1}{n}v_{i_t}^{t-1} + \frac{1}{n}\nabla f_{i_t}(x_t).
This implies that computing the gradient estimator g_t only requires O(d)
computation cost, as cheap as SGD. To see the connection to the variance-reduction technique, let us further rewrite the gradient estimator as follows:
    g_t = \frac{1}{n}\left(\nabla f_{i_t}(x_t) - v_{i_t}^{t-1}\right) + \frac{1}{n}\sum_{i=1}^n v_i^{t-1}.
From the discussion in the previous section, we see that this is a biased estimator that is likely to have smaller variance than the direct stochastic gradient
estimator ∇f_{i_t}(x_t).
To summarize, in compact form, SAG works as follows:
    x_{t+1} = x_t - \frac{\gamma}{n}\sum_{i=1}^n v_i^t, \quad \text{where } v_i^t = \begin{cases} \nabla f_{i_t}(x_t), & \text{if } i = i_t,\\ v_i^{t-1}, & \text{otherwise.} \end{cases}
We further describe the detailed implementation in Algorithm 6 below.
Compared to SGD, the per-iteration cost is almost the same, but there
is an additional O (nd) memory cost to store the past gradients of each
components. On the theory side, this is the first stochastic method to enjoy
linear rate using a constant stepsize for finite-sum problems with strongly-
convex and smooth objectives.

Algorithm 6 SAG
1: Initialize v_i = 0, i = 1, . . . , n, and g_0 = 0
2: for t = 1, 2, . . . , T do
3:   Randomly pick i_t ∈ {1, 2, . . . , n}
4:   g_t = g_{t−1} − (1/n) v_{i_t}
5:   v_{i_t} = ∇f_{i_t}(x_t)
6:   g_t = g_t + (1/n) v_{i_t}
7:   x_{t+1} = x_t − γ g_t
8: end for
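
A minimal sketch of Algorithm 6 for a least-squares finite sum f_i(x) = ½(a_i^⊤x − b_i)²; the data and the stepsize rule (γ = 1/(16 L_max) as in Theorem 13.1 below) are illustrative choices.

```python
import numpy as np

def sag_least_squares(A, b, gamma, epochs):
    n, d = A.shape
    V = np.zeros((n, d))            # stored gradients v_i
    g = np.zeros(d)                 # running average (1/n) * sum_i v_i
    x = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs * n):
        i = rng.integers(n)
        new_vi = (A[i] @ x - b[i]) * A[i]
        g += (new_vi - V[i]) / n    # update the average in O(d) time
        V[i] = new_vi
        x -= gamma * g
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 10))
b = A @ np.ones(10)
L_max = np.max(np.sum(A**2, axis=1))          # L_i = ||a_i||^2 for least squares
x_hat = sag_least_squares(A, b, gamma=1.0 / (16 * L_max), epochs=100)
print(np.linalg.norm(x_hat - np.ones(10)))    # distance to the solution, should be small
```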

Theorem 13.1 ([SLRB17]). Suppose F is µ-strongly convex and each f_i is L_i-smooth
and convex, for i = 1, . . . , n. Setting γ = 1/(16L_max), SAG satisfies
    \mathbb{E}[F(x_t) - F(x^\star)] \le C\cdot\left(1 - \min\left\{\frac{1}{8n}, \frac{\mu}{16L_{\max}}\right\}\right)^t.
Here L_max := max{L_1, . . . , L_n} and C is a constant depending on the initialization.
As a result, the total complexity for SAG to achieve an ε-optimal solution is O\left((n + \kappa_{\max})\log(\frac{1}{\varepsilon})\right). In contrast, full gradient descent would require a total complexity of O\left(n\kappa\log(\frac{1}{\varepsilon})\right), where κ = L_F/µ and L_F ≤ \frac{1}{n}\sum_{i=1}^n L_i. In the case when L_i = L, ∀i, the difference is between O(nκ)
and O(n + κ), which can be huge, especially when both n and κ are large.

SAGA: The idea of SAGA is similar to SAG, except that SAGA uses a
different coefficient to keep the gradient estimator unbiased. SAGA works
as follows:
    x_{t+1} = x_t - \gamma\left[\left(\nabla f_{i_t}(x_t) - v_{i_t}^{t-1}\right) + \frac{1}{n}\sum_{i=1}^n v_i^{t-1}\right].
Unlike SAG, the gradient estimator used in SAGA is unbiased. Similar to
SAG, it requires a memory cost of O(nd). On the theory side, it has a similar
linear convergence guarantee as SAG, but with a much simpler proof.

13.2.2 SVRG
We now introduce another algorithm called SVRG, which is one of the
most popular variance-reduced algorithms. In particular, SVRG algorithm

267
does not require storage of gradients as seen in SAG or SAGA. Moreover,
as we shall see, the convergence rate for SVRG can be proved easily and a
very intuitive explanation can be provided by linking increased speed to
reduced variance.
The idea of the algorithm is to use fixed reference point to build the
variance-reduced gradient:
gt = ∇fit (xt ) − ∇fit (x̃) + ∇F (x̃)
where the reference point x̃ is only updated once a while.
The intuition is that the closer x̃ is to xt , the smaller the variance of the
gradient estimator:
E[kgt − ∇F (xt )k2 ] ≤ E[k∇fit (xt ) − ∇fit (x̃)k2 ] ≤ L2max kxt − x̃k2 .
The full algorithm of SVRG is described in Algorithm 7 below.

Algorithm 7 Stochastic Variance Reduced Gradient (SVRG)
 1: Parameters: update frequency m and learning rate η
 2: Initialize x̃_0
 3: for s = 1, 2, . . . do
 4:   x̃ = x̃_{s−1}
 5:   θ̃ = (1/n) Σ_{i=1}^n ∇f_i(x̃)
 6:   x_0 = x̃
 7:   for t = 1, 2, . . . , m do
 8:     Randomly pick i_t ∈ {1, 2, . . . , n} and update the weight
 9:     x_t = x_{t−1} − η (∇f_{i_t}(x_{t−1}) − ∇f_{i_t}(x̃) + θ̃)
10:   end for
11:   Update x̃_s:
12:     Option I:   x̃_s = x_m
13:     Option II:  x̃_s = (1/m) Σ_{t=0}^{m−1} x_t
14:     Option III: x̃_s = x_t for a uniformly random t ∈ {0, 1, . . . , m − 1}
15: end for
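The following is a minimal NumPy sketch of SVRG with Option III on a least squares objective; the problem instance and the specific stepsize/epoch-length values are illustrative assumptions, not prescribed by the notes.

import numpy as np

def svrg(A, b, eta, m, epochs):
    n, d = A.shape
    grad = lambda x, i: 2 * (A[i] @ x - b[i]) * A[i]       # gradient of f_i
    full_grad = lambda x: 2 * A.T @ (A @ x - b) / n         # gradient of F
    rng = np.random.default_rng(0)
    x_tilde = np.zeros(d)
    for _ in range(epochs):
        theta = full_grad(x_tilde)           # checkpoint gradient at the reference point
        x = x_tilde.copy()
        snapshots = [x.copy()]               # will hold x_0, ..., x_m
        for _ in range(m):
            i = rng.integers(n)
            g = grad(x, i) - grad(x_tilde, i) + theta   # variance-reduced gradient
            x = x - eta * g
            snapshots.append(x.copy())
        x_tilde = snapshots[rng.integers(m)]  # Option III: uniform over {x_0, ..., x_{m-1}}
    return x_tilde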

Theorem 13.2. Assume each f_i(x) is convex and L-smooth, and F(x) := (1/n) Σ_{i=1}^n f_i(x) is µ-strongly convex. Let x⋆ = argmin_x F(x). Assume m is sufficiently large (and η < 1/(2L)) so that
\[
\rho = \frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} < 1.
\]
Then SVRG (under Option II or Option III) converges geometrically in expectation, i.e.,
\[
\mathbb{E}[F(\tilde{x}_s) - F(x^\star)] \le \rho^s\, [F(\tilde{x}_0) - F(x^\star)].
\]

To establish the main theorem, we first provide the following lemma.

Lemma 13.3. For any x, we have
\[
\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(x^\star)\|_2^2 \le 2L\,(F(x) - F(x^\star)). \tag{13.4}
\]
[We omit the proof of this lemma and leave it as an exercise.]

We now proceed to prove the theorem.

Proof. Consider a fixed stage s. Denote x̃ = x̃_{s−1} and x̃′ = x̃_s, where x̃_s is selected after all the inner updates have been completed. Let g_t = ∇f_{i_t}(x_{t−1}) − ∇f_{i_t}(x̃) + ∇F(x̃). Taking expectation with respect to i_t conditional on x_{t−1} and x̃, we have
\begin{align*}
\mathbb{E}[\|g_t\|^2 \mid x_{t-1}, \tilde{x}]
&= \mathbb{E}\bigl[\|[\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(x^\star)] + [\nabla f_{i_t}(x^\star) - \nabla f_{i_t}(\tilde{x}) + \nabla F(\tilde{x})]\|_2^2 \mid x_{t-1}, \tilde{x}\bigr] \\
&\le 2\,\mathbb{E}\bigl[\|\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(x^\star)\|_2^2 \mid x_{t-1}, \tilde{x}\bigr] + 2\,\mathbb{E}\bigl[\|\nabla f_{i_t}(\tilde{x}) - \nabla f_{i_t}(x^\star) - \nabla F(\tilde{x})\|_2^2 \mid x_{t-1}, \tilde{x}\bigr] \\
&\le 2\,\mathbb{E}\bigl[\|\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(x^\star)\|_2^2 \mid x_{t-1}, \tilde{x}\bigr] + 2\,\mathbb{E}\bigl[\|\nabla f_{i_t}(\tilde{x}) - \nabla f_{i_t}(x^\star)\|_2^2 \mid x_{t-1}, \tilde{x}\bigr] \\
&\le 4L\,[F(x_{t-1}) - F(x^\star) + F(\tilde{x}) - F(x^\star)],
\end{align*}
where the first inequality uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the second uses $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2$ together with $\nabla F(x^\star) = 0$, and the third uses Lemma 13.3. From the definition of g_t, $\mathbb{E}[g_t \mid x_{t-1}, \tilde{x}] = \nabla F(x_{t-1})$. This leads to
\begin{align*}
\mathbb{E}\bigl[\|x_t - x^\star\|_2^2 \mid x_{t-1}, \tilde{x}\bigr]
&= \mathbb{E}\bigl[\|x_{t-1} - \eta g_t - x^\star\|_2^2 \mid x_{t-1}, \tilde{x}\bigr] \\
&= \|x_{t-1} - x^\star\|_2^2 - 2\eta (x_{t-1} - x^\star)^\top \mathbb{E}[g_t \mid x_{t-1}, \tilde{x}] + \eta^2\, \mathbb{E}\bigl[\|g_t\|_2^2 \mid x_{t-1}, \tilde{x}\bigr] \\
&\le \|x_{t-1} - x^\star\|_2^2 - 2\eta (x_{t-1} - x^\star)^\top \nabla F(x_{t-1}) + 4L\eta^2 [F(x_{t-1}) - F(x^\star) + F(\tilde{x}) - F(x^\star)] \\
&\le \|x_{t-1} - x^\star\|_2^2 - 2\eta (F(x_{t-1}) - F(x^\star)) + 4L\eta^2 [F(x_{t-1}) - F(x^\star) + F(\tilde{x}) - F(x^\star)] \\
&= \|x_{t-1} - x^\star\|_2^2 - 2\eta(1 - 2L\eta)(F(x_{t-1}) - F(x^\star)) + 4L\eta^2 (F(\tilde{x}) - F(x^\star)),
\end{align*}
where the second inequality uses the convexity of F.
Taking the telescoping sum over t = 1, 2, . . . , m and applying the law of total expectation, we obtain
\[
\mathbb{E}\bigl[\|x_m - x^\star\|_2^2\bigr] \le \mathbb{E}\bigl[\|x_0 - x^\star\|_2^2\bigr] - 2\eta(1 - 2L\eta) \sum_{t=1}^m \mathbb{E}[F(x_{t-1}) - F(x^\star)] + 4Lm\eta^2\, \mathbb{E}[F(\tilde{x}) - F(x^\star)].
\]
If Option II is used, then by the convexity of F we have $F(\tilde{x}') \le \frac{1}{m}\sum_{t=1}^m F(x_{t-1})$, from which, by taking expectation, we obtain
\[
-\sum_{t=1}^m \mathbb{E}[F(x_{t-1}) - F(x^\star)] \le -m\, \mathbb{E}[F(\tilde{x}') - F(x^\star)]. \tag{13.5}
\]
If Option III is used, namely x̃′ is chosen uniformly at random from {x_0, . . . , x_{m−1}}, we have $\mathbb{E}[F(\tilde{x}') \mid x_0, \ldots, x_{m-1}] = \frac{1}{m}\sum_{t=1}^m F(x_{t-1})$, which also leads to (13.5) by the law of total expectation. Therefore, we have
\begin{align*}
&\mathbb{E}\bigl[\|x_m - x^\star\|^2\bigr] + 2\eta(1 - 2L\eta)m\, \mathbb{E}[F(\tilde{x}') - F(x^\star)] \\
&\quad\le \mathbb{E}\bigl[\|x_0 - x^\star\|^2\bigr] + 4Lm\eta^2\, \mathbb{E}[F(\tilde{x}) - F(x^\star)] \\
&\quad= \mathbb{E}\bigl[\|\tilde{x} - x^\star\|^2\bigr] + 4Lm\eta^2\, \mathbb{E}[F(\tilde{x}) - F(x^\star)] \\
&\quad\le \frac{2}{\mu}\, \mathbb{E}[F(\tilde{x}) - F(x^\star)] + 4Lm\eta^2\, \mathbb{E}[F(\tilde{x}) - F(x^\star)].
\end{align*}
The last inequality holds because F is µ-strongly convex. Since $\mathbb{E}[\|x_m - x^\star\|^2] \ge 0$, the above inequality gives
\[
\mathbb{E}[F(\tilde{x}') - F(x^\star)] \le \left( \frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} \right) \mathbb{E}[F(\tilde{x}) - F(x^\star)].
\]
This gives the desired linear convergence rate.
Remark 13.4. Setting η = θ/L for some constant θ ∈ (0, 1/2), this gives
\[
\rho = \frac{L}{\mu\theta(1 - 2\theta)m} + \frac{2\theta}{1 - 2\theta} = O\!\left( \frac{L}{\mu m} \right) + \text{const}.
\]
Hence, if we set m on the order of L/µ, this results in a constant rate ρ < 1. The number of epochs needed to achieve an ε-optimal solution is then O(log(1/ε)). Therefore, the overall complexity of SVRG is
\[
O\!\left( (2m + n) \log\frac{1}{\varepsilon} \right) = O\!\left( \left( n + \frac{L}{\mu} \right) \log\frac{1}{\varepsilon} \right).
\]
Note that this complexity is significantly better than that of gradient descent, i.e.,
\[
O\!\left( n \cdot \frac{L}{\mu} \log\frac{1}{\varepsilon} \right),
\]
when the condition number L/µ is large.

Extensions.

1. Non-uniform sampling: The SVRG algorithm above uses uniform sampling; however, one may instead choose the non-uniform sampling distribution
\[
\mathbb{P}(i_t = i) = \frac{L_i}{\sum_{j=1}^n L_j},
\]
where L_i is the smoothness parameter of f_i. This sampling strategy immediately improves the complexity from O((n + L_max/µ) log(1/ε)) to O((n + L_avg/µ) log(1/ε)). Intuitively, a component f_i with a larger Lipschitz constant (whose gradient is prone to change relatively rapidly) gets a higher probability of being selected.

2. Composite convex minimization: These are problems of the form
\[
\min_x \; \frac{1}{n}\sum_{i=1}^n f_i(x) + g(x),
\]
where the f_i(x) are smooth and convex, while g(x) is convex but possibly nonsmooth. Such problems can be handled by prox-SVRG [XZ14], which applies an additional proximal operator of g at each iteration.

3. Acceleration: SVRG can be further accelerated to achieve an optimal complexity of
\[
O\!\left( \left( n + \sqrt{\frac{nL}{\mu}} \right) \log\frac{1}{\varepsilon} \right).
\]
This improvement is significant in problems where L/µ is large.

Table 13.1: Comparison between SVRG and SAG/SAGA

                       SVRG                        SAG/SAGA
memory cost            O(d)                        O(nd)
epoch-based            yes                         no
# gradients per step   at least 2                  1
parameters             stepsize & epoch length     stepsize
unbiasedness           yes                         yes/no
total complexity       O((n + κ_max) log(1/ε))     O((n + κ_max) log(1/ε))

Before concluding this section, we summarize the comparison between SVRG and SAG/SAGA in Table 13.1.
Compared to SAG/SAGA, SVRG has a cheap memory cost, with no need to store past gradients or past iterates. On the other hand, it has a nested two-loop structure, which requires more parameter tuning, whereas SAG/SAGA are much easier to implement in this respect.

13.2.3 SPIDER/SARAH/STORM
The previous variance-reduced methods are designed primarily for solving smooth and strongly-convex finite-sum problems. In other settings the variance-reduction technique is sometimes not theoretically beneficial; at other times, modifications need to be made. Below we introduce some recent variance-reduced methods designed specifically for nonconvex optimization.
Consider smooth functions with the finite-sum structure F(x) := (1/n) Σ_{i=1}^n f_i(x). As we have shown in previous sections, GD finds a point with ‖∇F(x̄)‖ ≤ ε in O(n/ε²) gradient evaluations, whereas SGD finds a point with E[‖∇F(x̄)‖] ≤ ε in O(1/ε⁴) gradient evaluations.
Recently, several algorithms have been designed to achieve better performance than both GD and SGD by exploiting variance-reduction techniques, for example SPIDER [FLLZ18, WJZ+19], SARAH [NLST17], STORM [CO19], and PAGE [LBZR21]. Specifically, when the objective satisfies the so-called average-smoothness condition
\[
\mathbb{E}_i \|\nabla f_i(x) - \nabla f_i(y)\|^2 \le L^2 \|x - y\|^2,
\]
an improved complexity of O(min{√n/ε², 1/ε³}) can be obtained. Note that when each individual function f_i(x) is smooth, the average-smoothness property automatically holds. The complexity of O(min{√n/ε², 1/ε³}) improves the O(n/ε²) complexity of GD by an O(√n) factor and the O(1/ε⁴) complexity of SGD by an O(1/ε) factor.
Consider the general problem setting
\[
\min_{x \in X} \; F(x) := \mathbb{E}_\xi[f(x, \xi)],
\]
where F(x) is smooth but may be non-convex.
Below we present a general framework of variance-reduced algorithms in Algorithm 8, based on recursive gradient estimators, which encapsulates most of the existing variants.

Algorithm 8 General VR Framework [Zha21]
Input: T, Q, D, S, x_0, α, η, ω(x).
for t = 0, 1, . . . , T − 1 do
  if t ≡ 0 (mod Q) then
    compute g_t = (1/D) Σ_{i=1}^D ∇f(x_t; ξ_t^i)
  else
    compute g_t = (1 − η) (g_{t−1} − (1/S) Σ_{i=1}^S ∇f(x_{t−1}; ξ_t^i)) + (1/S) Σ_{i=1}^S ∇f(x_t; ξ_t^i)
  end if
  x_{t+1} = argmin_{x∈X} { g_t^⊤ x + (1/α) V_ω(x, x_t) }
end for
Output: x_τ with τ chosen uniformly at random from {0, 1, . . . , T − 1}.

In the algorithm, T is the total number of iterations, Q denotes the epoch length, D and S are batch sizes, α is the fixed stepsize, and η is the momentum parameter used to compute the recursive gradient. To be specific, every Q iterations form an epoch. At the beginning of each epoch, we use a mini-batch of size D to compute the checkpoint gradient g_t = (1/D) Σ_{i=1}^D ∇f(x_t; ξ_t^i). In all other iterations, we use a mini-batch of size S evaluated at both points x_{t−1} and x_t and compute the recursive gradient
\[
g_t = (1 - \eta)\left(g_{t-1} - \frac{1}{S}\sum_{i=1}^S \nabla f(x_{t-1}; \xi_t^i)\right) + \frac{1}{S}\sum_{i=1}^S \nabla f(x_t; \xi_t^i). \tag{13.6}
\]
Here S is typically much smaller than D. The gradient complexity of Algorithm 8 is O(T(2S + D/Q)).
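The following is a minimal sketch of the recursive gradient estimator (13.6) in the Euclidean case (V_ω(x, x') = ‖x − x'‖²/2, so the mirror step becomes a plain gradient step). The stochastic oracle grad_f(x, xi), the sampler sample_xi, and all parameter values are assumptions for illustration; η = 0 recovers a SPIDER/SARAH-style recursion, while η = 1 reduces to mini-batch SGD.

import numpy as np

def recursive_vr(grad_f, sample_xi, x0, T, Q, D, S, alpha, eta, rng):
    x, g = x0.copy(), None
    x_prev = x0.copy()
    for t in range(T):
        if t % Q == 0:
            # checkpoint: large batch of size D
            g = np.mean([grad_f(x, sample_xi(rng)) for _ in range(D)], axis=0)
        else:
            # recursive update: small batch of size S evaluated at x_{t-1} and x_t
            xis = [sample_xi(rng) for _ in range(S)]
            g_prev = np.mean([grad_f(x_prev, xi) for xi in xis], axis=0)
            g_curr = np.mean([grad_f(x, xi) for xi in xis], axis=0)
            g = (1 - eta) * (g - g_prev) + g_curr
        x_prev, x = x, x - alpha * g       # unconstrained step with omega = ||.||^2 / 2
    return x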

When the momentum parameter η = 1, g_t reduces to a mini-batch estimator of ∇F(x_t) and loses the variance-recursion property; the framework then reduces to mini-batch SGD. When η < 1, g_t takes the same form as in the classical variance-reduction techniques SPIDER and SARAH (when η = 0) and STORM (when a time-varying η_t ∈ (0, 1) is used).
The following table summarizes different choices of parameters and their relation to existing VR methods and two alternative variants [Zha21].

Table 13.2: Summary of parameter selections for Algorithm 8 for finding an ε-stationary point. T stands for the iteration complexity, T/Q for the number of epochs, D for the batch size at checkpoints, S for the batch size at other iterations, η for the momentum parameter, and α for the stepsize. [Zha21]

Parameters     SPIDER       SARAH        STORM          New 1           New 2
T              O(ε^{-2})    O(ε^{-3})    O(ε^{-3})      O(ε^{-5/2})     O(ε^{-3})
T/Q            O(ε^{-1})    O(ε^{-1})    1              O(ε^{-1})       1
D              O(ε^{-2})    O(ε^{-2})    O(1)           O(ε^{-2})       O(ε^{-1})
S              O(ε^{-1})    O(1)         O(1)           O(ε^{-1/2})     O(1)
η (or η_t)     0            0            O(t^{-2/3})    0               O(ε^2)
α (or α_t)     O(1)         O(ε)         O(t^{-1/3})    O(ε^{1/2})      O(ε)
Complexity     O(ε^{-3})    O(ε^{-3})    Õ(ε^{-3})      O(ε^{-3})       O(ε^{-3})

Lastly, we illustrate the performance of these VR methods for the multi-


class classification task on the commonly-used MNIST and CIFAR-10 datasets.
Detailed numerical results can be found in [Zha21]. These algorithms are
tested on a three-layer fully-connected neural network with ELU activa-
tion function to guarantee smoothness. Figure 13.1 indicates that STORM
still performs the best, suggesting the merits of using single-loop algo-
rithms without using large batch sizes.

13.2.4 Summary
So far, we have seen several different methods for solving stochastic optimization problems with finite-sum structure, such as GD, AGD, SGD, and variance-reduced methods. The following two tables summarize the complexity comparisons for smooth strongly-convex and nonconvex objectives, respectively.

[Figure 13.1: The performance of different algorithms for the multi-class classification task using a three-layer neural network, with panels (a), (c) on MNIST and (b), (d) on CIFAR-10. Both metrics are plotted against the number of samples used during training, with n_train = 50000 as the training size.]

Table 13.3: Complexity of finding an ε-optimal solution for strongly-convex finite-sum optimization, κ = L_F/µ, κ_max = L_max/µ

Algorithm     # of Iterations              Per-iteration Cost
GD            O(κ log(1/ε))                O(n)
AGD           O(√κ log(1/ε))               O(n)
SGD           O(κ_max/ε)                   O(1)
SAG/SAGA      O((n + κ_max) log(1/ε))      O(1)

Table 13.4: Complexity of finding an ε-stationary point of nonconvex finite-sum optimization

Algorithm     # of Iterations               Per-iteration Cost
GD            O(1/ε^2)                      O(n)
SGD           O(1/ε^4)                      O(1)
STORM         O(min{√n/ε^2, 1/ε^3})         O(1)
Chapter 14

Min-Max Optimization

Contents
14.1 Motivational Applications . . . . . . . . . . . . . . . . . . . . 278
14.1.1 Zero-sum Matrix Games . . . . . . . . . . . . . . . . . 278
14.1.2 Quadratic Minimax Problems . . . . . . . . . . . . . . 279
14.1.3 Nonsmooth Optimization . . . . . . . . . . . . . . . . 279
14.2 Saddle Points and Global Minimax Points . . . . . . . . . . . 280
14.3 Minimax Theorem and Existence of Saddle Point . . . . . . . 284
14.4 First-order Methods for Minimax Problems . . . . . . . . . . 288
14.4.1 Strongly-Convex-Strongly-Concave Setting . . . . . . 288
14.4.2 Convex-Concave Setting . . . . . . . . . . . . . . . . . 291
14.5 Extension to Variational Inequalities . . . . . . . . . . . . . . 296
14.5.1 Stampacchia VI and Minty VI . . . . . . . . . . . . . . 296
14.5.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 298
14.5.3 Algorithms for Solving VIs . . . . . . . . . . . . . . . 300
14.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

The past decade has witnessed a paradigm shift from risk minimiza-
tion to min-max optimization for a myriad of emerging machine learning
applications from training generative adversarial networks (GANs) to re-
inforcement learning (RL), and from adversarial training to algorithmic
fairness. The problem of interest in such applications is often a smooth minimax optimization problem (also referred to as a saddle point problem):
\[
\min_{x \in X} \max_{y \in Y} \; \phi(x, y), \tag{14.1}
\]
where the function φ : R^m × R^n → R is smooth (i.e., has a Lipschitz gradient), X is a convex set in R^m, and Y is a convex set in R^n.
In this chapter, we will introduce the relevant solution concepts for
solving such problems and the first-order algorithms that have been de-
veloped in the classical and recent literature. We will also extend the min-
max optimization paradigm to even more general problem settings such
as N -player games and variational inequalities. This chapter will focus on
the golden convex regime.

14.1 Motivational Applications


14.1.1 Zero-sum Matrix Games
Consider a 2-player game where the players have opposite evaluations of outcomes. Let I denote the non-empty finite set of strategies of player 1, and J the strategy set of player 2. If player 1 chooses action i ∈ I and player 2 chooses action j ∈ J, player 1 suffers the cost a_ij and player 2 receives the payoff a_ij. Let A = [a_ij]_{i∈I, j∈J} ∈ R^{|I|×|J|}. Denote the mixed strategy set of player 1 by ∆(I) = {x ∈ R^{|I|} : x_i ≥ 0, i ∈ I, Σ_{i∈I} x_i = 1}, and define ∆(J) analogously for player 2. For given mixed strategies x ∈ ∆(I) and y ∈ ∆(J), the expected cost of player 1 is x^⊤A y. In this matrix game, player 1 aims to minimize its cost while player 2 aims to maximize its payoff. This can be formulated as the min-max optimization problem
\[
\min_{x \in \Delta(I)} \max_{y \in \Delta(J)} \; x^\top A\, y. \tag{14.2}
\]

This is a special case of the smooth minimax optimization problem


with bilinear objective and simplex constraints.

14.1.2 Quadratic Minimax Problems
Quadratic minimax problems are fundamental problems which arise in
numerical analysis, optimal control, and many other areas. The minimax objective is quadratic in x and y:
\[
\min_{x \in \mathbb{R}^m} \max_{y \in \mathbb{R}^n} \; \phi(x, y) = x^\top B x + y^\top A x - y^\top C y. \tag{14.3}
\]
Here we assume the matrices B ⪰ 0 and C ⪰ 0 are positive semi-definite (p.s.d.).

Example 14.1. Consider the robust least squares problem with a coefficient matrix A and a noisy vector y = y_0 + δ, where y is corrupted by a deterministic perturbation δ of norm at most ρ:
\[
\min_{x} \max_{y: \|y - y_0\| \le \rho} \; \|Ax - y\|^2.
\]
This problem can be viewed as a special case of the quadratic minimax problem.

14.1.3 Nonsmooth Optimization


Many nonsmooth optimization problems can be naturally reformulated as smooth minimax problems. Let f be a smooth convex function, g a convex and possibly nonsmooth function, and A ∈ R^{m×n} a matrix. Consider the composite problem
\[
\min_{x \in \mathbb{R}^n} \; f(x) + g(Ax).
\]
Recall that g(Ax) = max_{y∈R^m} ⟨Ax, y⟩ − g*(y), where g* is the Fenchel conjugate of g. We can then rewrite the original problem in min-max form:
\[
\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} \; f(x) + \langle Ax, y\rangle - g^*(y). \tag{14.4}
\]

Example 14.2. Suppose g(x) = max_{1≤i≤m} g_i(x), where each g_i(x) is smooth and convex. Note that g(x), being a maximum of convex functions, is convex but typically non-smooth. It can be written as g(x) = max_{y∈∆_m} Σ_{i=1}^m y_i g_i(x), where the simplex ∆_m = {y : y ≥ 0, Σ_i y_i = 1} is a compact convex set. Hence, we have
\[
\min_x g(x) \;\;\Rightarrow\;\; \min_x \max_{y \in \Delta_m} \; \phi(x, y) = \sum_{i=1}^m y_i g_i(x).
\]
Note that φ(x, y) is a smooth function since each g_i is smooth, and it is concave (linear) in y for any x and convex in x for any fixed y ∈ ∆_m.

Example 14.3. Consider g(x) = ‖Ax − b‖_p, where ‖·‖_p denotes the p-norm given by ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}. The function g(x) is convex but non-smooth because it is not differentiable at zero. It can be rewritten as g(x) = max_{‖y‖_q ≤ 1} ⟨Ax − b, y⟩. Denote by Y = {y : ‖y‖_q ≤ 1} the unit q-norm ball, where q is such that 1/p + 1/q = 1. We have
\[
\min_x g(x) \;\;\Rightarrow\;\; \min_x \max_{y \in Y} \; \phi(x, y) = \langle Ax - b, y\rangle.
\]
If p = 1, we obtain robust regression and q = ∞. If p = 2, we obtain least squares regression and q = 2.

Example 14.4. Now consider g(x) = Σ_{i=1}^m max(1 − (a_i^⊤ x) b_i, 0), which is a convex piecewise-linear function. This is the hinge loss used widely in support vector machines. We can rewrite the function as g(x) = max_{0≤y_i≤1, i=1,...,m} Σ_{i=1}^m y_i (1 − (a_i^⊤ x) b_i). Denote the set Y = {y = (y_1, . . . , y_m) : 0 ≤ y_i ≤ 1, 1 ≤ i ≤ m}. We have
\[
\min_x g(x) \;\;\Rightarrow\;\; \min_x \max_{y \in Y} \; \phi(x, y) = \sum_{i=1}^m y_i \bigl(1 - (a_i^\top x) b_i\bigr).
\]
Note that φ(x, y) is a smooth function that is concave (linear) in y for any x and convex (linear) in x for any fixed y.

14.2 Saddle Points and Global Minimax Points


We now define the notions of optimality, or solution concepts, for the minimax optimization problem
\[
\min_{x \in X} \max_{y \in Y} \; \phi(x, y). \tag{14.5}
\]
A well-known notion of optimality in this setting is the Nash equilibrium (also known as a saddle point for two-player games): no player can benefit by changing strategy unilaterally while the other player keeps hers unchanged. For a saddle point (x*, y*) ∈ X × Y, x* is an optimal solution to min_{x∈X} φ(x, y*) and y* is an optimal solution to max_{y∈Y} φ(x*, y).

Figure 14.1: Saddle point

Such a notion of optimality is commonly used in the context of simultaneous games, and it does not reflect any order between the min-player and the max-player. On the other hand, for sequential games, where one player acts first and the other acts second, it is natural to consider an optimality notion such as the Stackelberg equilibrium (also known as a global minimax point for two-player games). For a global minimax point (x*, y*) ∈ X × Y, x* is a global minimum of the function φ̄(x) := max_{y∈Y} φ(x, y), and y* is a global maximum of the function φ(x*, y).

Definition 14.5 (Saddle point). (x*, y*) ∈ X × Y is a saddle point of φ(x, y) if for any (x, y) ∈ X × Y, we have
\[
\phi(x^*, y) \le \phi(x^*, y^*) \le \phi(x, y^*). \tag{14.6}
\]
Definition 14.6 (Global minimax point). (x*, y*) ∈ X × Y is a global minimax point of φ(x, y) if for any (x, y) ∈ X × Y, we have
\[
\phi(x^*, y) \le \phi(x^*, y^*) \le \max_{y' \in Y} \phi(x, y'). \tag{14.7}
\]

We now define the primal and dual problems induced by the minimax optimization problem (14.5):
\[
\mathrm{Opt}(P) = \min_{x \in X} \bar{\phi}(x), \qquad \bar{\phi}(x) = \max_{y \in Y} \phi(x, y), \tag{P}
\]
\[
\mathrm{Opt}(D) = \max_{y \in Y} \underline{\phi}(y), \qquad \underline{\phi}(y) = \min_{x \in X} \phi(x, y). \tag{D}
\]
Immediately, based on the definitions, we see that weak duality holds, i.e., Opt(D) ≤ Opt(P), namely,
\[
\max_{y \in Y} \min_{x \in X} \phi(x, y) \le \min_{x \in X} \max_{y \in Y} \phi(x, y). \tag{14.8}
\]
Indeed, for any x ∈ X, y ∈ Y, we have φ(x, y) ≤ max_{y'∈Y} φ(x, y'). Taking the minimum over x on both sides implies min_{x∈X} φ(x, y) ≤ min_{x∈X} max_{y'∈Y} φ(x, y') for any y ∈ Y, namely $\underline{\phi}(y) \le \mathrm{Opt}(P)$ for any y ∈ Y. Therefore, $\mathrm{Opt}(D) = \max_{y\in Y}\underline{\phi}(y) \le \mathrm{Opt}(P)$.
In the following, we show that if a saddle point exists, then strong duality holds, namely Opt(D) = Opt(P).

Lemma 14.7. The point (x*, y*) is a saddle point of φ(x, y) if and only if
\[
\max_{y \in Y} \min_{x \in X} \phi(x, y) = \min_{x \in X} \max_{y \in Y} \phi(x, y),
\]
and $x^* \in \mathrm{argmin}_{x\in X}\, \bar{\phi}(x)$, $y^* \in \mathrm{argmax}_{y\in Y}\, \underline{\phi}(y)$.

In other words, (x*, y*) is a saddle point if and only if strong duality holds and x*, y* are, respectively, optimal solutions to the induced primal problem (P) and dual problem (D).

Proof. (⇒) We first check the "only if" part. If (x*, y*) is a saddle point, then for all x ∈ X, y ∈ Y: φ(x*, y) ≤ φ(x*, y*) ≤ φ(x, y*). Hence,
\[
\underline{\phi}(y^*) = \min_{x \in X} \phi(x, y^*) = \phi(x^*, y^*) = \max_{y \in Y} \phi(x^*, y) = \bar{\phi}(x^*).
\]
Therefore,
\[
\mathrm{Opt}(D) = \max_{y \in Y} \underline{\phi}(y) \ge \underline{\phi}(y^*) = \phi(x^*, y^*).
\]
Similarly, we have
\[
\mathrm{Opt}(P) = \min_{x \in X} \bar{\phi}(x) \le \bar{\phi}(x^*) = \phi(x^*, y^*).
\]
This implies Opt(P) ≤ Opt(D). Together with weak duality, we obtain Opt(P) = Opt(D). In fact, all the inequalities above must hold with equality, namely,
\[
\bar{\phi}(x^*) = \mathrm{Opt}(P) = \phi(x^*, y^*) = \mathrm{Opt}(D) = \underline{\phi}(y^*).
\]
This means x* is optimal for (P) and y* is optimal for (D).
(⇐) We now check the "if" part. If $x^* \in \mathrm{argmin}_{x\in X}\, \bar{\phi}(x)$ and $y^* \in \mathrm{argmax}_{y\in Y}\, \underline{\phi}(y)$, we have
\[
\mathrm{Opt}(D) = \underline{\phi}(y^*) = \min_{x \in X} \phi(x, y^*) \le \phi(x^*, y^*),
\]
and
\[
\mathrm{Opt}(P) = \bar{\phi}(x^*) = \max_{y \in Y} \phi(x^*, y) \ge \phi(x^*, y^*).
\]
Now if Opt(D) = Opt(P), then we must have $\min_{x\in X} \phi(x, y^*) = \phi(x^*, y^*) = \max_{y\in Y} \phi(x^*, y)$, which means (x*, y*) is a saddle point.

Remark 14.8. Based on the above analysis, it can be seen that a saddle point, if it exists, is also a global minimax point. Moreover, there is no advantage for a player in knowing the opponent's choice or in playing second: the minimax, the maximin, and the equilibrium all give the same payoff.
However, a saddle point may not always exist. In contrast, a global minimax point always exists under mild regularity conditions. For example, if the function φ(x, y) is continuous on X × Y and X, Y are compact sets, then a global minimax point exists by the extreme-value theorem.

Example 14.9. Consider the Rock-Paper-Scissors game, which can be viewed as the bilinear minimax problem min_{x∈∆(I)} max_{y∈∆(J)} x^⊤A y, where I = J = {rock, paper, scissors} and the payoffs (row player, column player) are given below:

            rock      paper     scissors
rock        0, 0      −1, 1     1, −1
paper       1, −1     0, 0      −1, 1
scissors    −1, 1     1, −1     0, 0

In this case, it is easily verified that the mixed strategy x* = (1/3, 1/3, 1/3), y* = (1/3, 1/3, 1/3) is a saddle point. In fact, it is also the unique saddle point.

Example 14.10. Consider the function φ(x, y) = (x − y)², with X = [0, 1], Y = [0, 1]. It is easy to show that the primal function is φ̄(x) = max{x², (x − 1)²} and min_{x∈[0,1]} φ̄(x) = 1/4, namely Opt(P) = 1/4. On the other hand, the dual function is $\underline{\phi}(y) = 0$, thus max_{y∈[0,1]} $\underline{\phi}(y)$ = 0, namely Opt(D) = 0. Hence, a saddle point does not exist in this case.

14.3 Minimax Theorem and Existence of Saddle
Point
In the last section, we have seen that saddle point exists if and only if the
induced primal and dual problems are solvable and there is no duality
gap. But in general, what structural conditions guarantee the existence of
saddle point. This has been extensively studied since the seminal work by
John von Neumann in 1929. Below we present a few classical results.
Theorem 14.11. (von Neumann’s Minimax theorem) For any payoff matrix
A ∈ Rm×n ,
min max x> Ay = max min x> Ay,
x∈∆m y∈∆n y∈∆n x∈∆m

where ∆m = {x ∈ R+ : i=1 xi = 1}, ∆n = {y ∈ Rn+ : nj=1 yj = 1}.


m
Pm P

There are different proof techniques for von Neumann's Minimax Theorem. The classical proof is based on the separating hyperplane theorem from convex analysis. A modern proof is based on regret analysis from online learning. Below we briefly sketch the high-level proof based on low-regret analysis.
Consider the repeated two-player game played for T rounds. At each round t = 1, . . . , T:
• the row player chooses a strategy x_t ∈ ∆_m;
• the column player chooses a strategy y_t ∈ ∆_n;
• the row player incurs the cost x_t^⊤ A y_t.
The regret of the row player after T rounds is defined as the difference in total cost compared to the best fixed strategy in hindsight:
\[
R_T(y_1, \ldots, y_T) = \sum_{t=1}^T x_t^\top A y_t - \min_{x \in \Delta_m} \sum_{t=1}^T x^\top A y_t.
\]
We say the row player has low regret if, for any y_1, . . . , y_T,
\[
R_T(y_1, \ldots, y_T) \le \bar{R}_T \quad \text{with} \quad \bar{R}_T/T \to 0.
\]
There exist many low-regret learning algorithms for the row player. Online (projected) gradient descent is one of the simplest, and it guarantees R_T(y_1, . . . , y_T) ≤ O(√T).

Lemma 14.12 (Exercise 79). Consider the repeated zero-sum matrix game. Suppose the row player chooses x_{t+1} according to the gradient descent update rule at each round:
\[
x_{t+1} = \Pi_{\Delta_m}(x_t - \eta A y_t),
\]
where Π_{∆_m} is the projection operator onto ∆_m and η > 0 is the stepsize. Let G = max_{y∈∆_n} ‖Ay‖. Then with η = √(2/(G²T)), the row player's regret satisfies
\[
R_T(y_1, \ldots, y_T) := \sum_{t=1}^T x_t^\top A y_t - \min_{x \in \Delta_m} \sum_{t=1}^T x^\top A y_t \le \sqrt{2 G^2 T}.
\]

In the following, we show that the existence of a low-regret algorithm for the row player implies the minimax theorem. Suppose the column player chooses the best response against the row player's choice:
\[
x_t^\top A y_t = \max_{y \in \Delta_n} x_t^\top A y.
\]
Define $\bar{x} = \frac{1}{T}\sum_{t=1}^T x_t \in \Delta_m$ and $\bar{y} = \frac{1}{T}\sum_{t=1}^T y_t \in \Delta_n$. We have
\begin{align*}
\min_{x \in \Delta_m} \max_{y \in \Delta_n} x^\top A y
&\le \max_{y \in \Delta_n} \bar{x}^\top A y
= \max_{y \in \Delta_n} \frac{1}{T}\sum_{t=1}^T x_t^\top A y \\
&\le \frac{1}{T}\sum_{t=1}^T \max_{y \in \Delta_n} x_t^\top A y
= \frac{1}{T}\sum_{t=1}^T x_t^\top A y_t \\
&\le \frac{1}{T} \min_{x \in \Delta_m} \sum_{t=1}^T x^\top A y_t + R_T(y_1, \ldots, y_T)/T \\
&= \min_{x \in \Delta_m} x^\top A \bar{y} + R_T(y_1, \ldots, y_T)/T \\
&\le \max_{y \in \Delta_n} \min_{x \in \Delta_m} x^\top A y + \bar{R}_T/T.
\end{align*}
Letting T → ∞, this implies that
\[
\min_{x \in \Delta_m} \max_{y \in \Delta_n} x^\top A y \le \max_{y \in \Delta_n} \min_{x \in \Delta_m} x^\top A y.
\]
Combined with weak duality, this yields the minimax theorem.
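The argument above can be simulated numerically. The sketch below (an illustration under assumptions, not from the notes) lets the row player run online projected gradient descent while the column player best-responds; the duality gap of the average strategies shrinks like O(1/√T). The simplex-projection helper is the standard sort-based routine.

import numpy as np

def project_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, len(v) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def self_play(A, T):
    m, n = A.shape
    G = max(np.linalg.norm(A[:, j]) for j in range(n))   # max of ||A y|| over the simplex
    eta = np.sqrt(2.0 / (G**2 * T))
    x = np.ones(m) / m
    x_bar, y_bar = np.zeros(m), np.zeros(n)
    for _ in range(T):
        j = np.argmax(A.T @ x)                 # column player's best response to x
        y = np.zeros(n); y[j] = 1.0
        x_bar += x / T; y_bar += y / T
        x = project_simplex(x - eta * (A @ y)) # OGD step on the cost x^T A y
    gap = np.max(A.T @ x_bar) - np.min(A @ y_bar)  # duality gap of the averages
    return x_bar, y_bar, gap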
The von Neumann Minimax Theorem is specifically for bilinear matrix
games. In fact, for general continuous games with convex-concave cost
function, the minimax theorem also holds and a saddle point exists.

Theorem 14.13 (Sion–Kakutani Minimax Theorem). Let X ⊂ R^m and Y ⊂ R^n be two convex compact sets. Let φ(x, y) : X × Y → R be a continuous function such that for any fixed y ∈ Y it is convex in the variable x, and for any fixed x ∈ X it is concave in the variable y. Then φ has a saddle point on X × Y and
\[
\max_{y \in Y} \min_{x \in X} \phi(x, y) = \min_{x \in X} \max_{y \in Y} \phi(x, y).
\]

Remark 14.14. Note that the above assumptions are only sufficient conditions for the existence of a saddle point. There are many extensions of the minimax theorem with weaker assumptions. For example, continuity of φ(x, y) can be relaxed to semi-continuity, and convexity of φ(·, y) can be relaxed to quasi-convexity. It also suffices that only one of the two sets is compact. When φ(x, y) is both strongly convex in x and strongly concave in y, we can further remove the compactness assumption. Results can be generalized to Hilbert spaces and even function spaces; see e.g. [Roc97, ET99] for detailed results.

Below we provide a self-contained proof of this general result, based on two key ingredients.

Lemma 14.15 (Minimax Lemma, Exercise 80). Let f_i(x), i = 1, . . . , n, be convex and continuous on a convex compact set X, and let ∆_n = {λ ∈ R^n_+ : Σ_{i=1}^n λ_i = 1}. We have
\[
\min_{x \in X} \max_{\lambda \in \Delta_n} \sum_{i=1}^n \lambda_i f_i(x) = \max_{\lambda \in \Delta_n} \min_{x \in X} \sum_{i=1}^n \lambda_i f_i(x).
\]
Equivalently, this means there exists some λ* ∈ ∆_n such that
\[
\min_{x \in X} \max_{1 \le i \le n} f_i(x) = \min_{x \in X} \sum_{i=1}^n \lambda_i^* f_i(x).
\]
The above lemma can be proven via Lagrangian duality, or alternatively via a regret analysis similar to the one above; we leave it as an exercise.

Lemma 14.16 (Helly's Theorem). Let F be any collection of compact convex sets in R^m. If every m + 1 of the sets have a point in common, then all sets have a point in common.

Now we are ready to prove Sion's minimax theorem based on these two results. Define the induced primal and dual problems:
\[
(P): \; \min_{x \in X} \bar{\phi}(x) := \max_{y \in Y} \phi(x, y), \qquad (D): \; \max_{y \in Y} \underline{\phi}(y) := \min_{x \in X} \phi(x, y).
\]
First, since X and Y are compact and φ(x, y) is continuous, both φ̄(x) and $\underline{\phi}(y)$ are continuous and attain their optima on the respective compact sets. By weak duality, it is sufficient to show Opt(D) ≥ Opt(P), i.e.,
\[
\max_{y \in Y} \min_{x \in X} \phi(x, y) \ge \min_{x \in X} \max_{y \in Y} \phi(x, y).
\]
Consider the sets X(y) = {x ∈ X : φ(x, y) ≤ Opt(D)}. Note that X(y) is nonempty, compact, and convex for any y ∈ Y. We first show that every finite collection of these sets has a point in common.
Suppose there exist y_1, . . . , y_n such that X(y_1) ∩ · · · ∩ X(y_n) = ∅. This implies
\[
\min_{x \in X} \max_{i=1,\ldots,n} \phi(x, y_i) > \mathrm{Opt}(D).
\]
By the Minimax Lemma, there exists λ* ∈ ∆_n such that
\begin{align*}
\min_{x \in X} \max_{i=1,\ldots,n} \phi(x, y_i)
&= \min_{x \in X} \sum_{i=1}^n \lambda_i^* \phi(x, y_i) \\
&\le \min_{x \in X} \phi\Bigl(x, \sum_{i=1}^n \lambda_i^* y_i\Bigr) \\
&= \underline{\phi}(\bar{y}) \le \mathrm{Opt}(D),
\end{align*}
where ȳ = Σ_{i=1}^n λ_i^* y_i and the inequality follows from the concavity of φ(x, ·). This contradicts the strict inequality above. Hence, every finite collection of the sets X(y) has a common point. By Helly's theorem, all of the sets {X(y) : y ∈ Y} have a common point. Therefore, there exists x* ∈ X with x* ∈ X(y) for all y ∈ Y, which means φ(x*, y) ≤ Opt(D) for all y ∈ Y. Hence Opt(P) ≤ Opt(D).

14.4 First-order Methods for Minimax Problems
In this section, we introduce first-order methods for finding an approximate saddle point of the convex-concave minimax problem
\[
\min_{x \in X} \max_{y \in Y} \; \phi(x, y).
\]
Throughout, we assume that a saddle point exists. First of all, we need to define a good performance criterion to characterize the accuracy of a solution.
Given a candidate solution ẑ = (x̂, ŷ), we quantify the inaccuracy or error by the saddle point gap
\[
E_{\mathrm{sad}}(\hat{z}) := \max_{y \in Y} \phi(\hat{x}, y) - \min_{x \in X} \phi(x, \hat{y}) = \bar{\phi}(\hat{x}) - \underline{\phi}(\hat{y}).
\]
Note that E_sad(z) ≥ 0 for all z ∈ X × Y, and E_sad(z) = 0 if and only if z is a saddle point.
Since Opt(P) = Opt(D), E_sad(ẑ) can be written as
\[
E_{\mathrm{sad}}(\hat{z}) = \bar{\phi}(\hat{x}) - \mathrm{Opt}(P) + \mathrm{Opt}(D) - \underline{\phi}(\hat{y}),
\]
and hence we have
\[
\bar{\phi}(\hat{x}) - \mathrm{Opt}(P) \le E_{\mathrm{sad}}(\hat{z}), \qquad \mathrm{Opt}(D) - \underline{\phi}(\hat{y}) \le E_{\mathrm{sad}}(\hat{z}).
\]
In other words, the gap function E_sad(ẑ) upper bounds the optimality errors of both the primal and the dual problem. Next, we examine the iteration complexity of first-order methods for finding an ε-saddle point, namely a point ẑ with E_sad(ẑ) ≤ ε.
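For bilinear games over simplices, the gap E_sad is cheap to evaluate, since a linear function over the simplex is maximized or minimized at a vertex. The snippet below is a small sketch (an assumption for illustration, not from the notes): φ̄(x̂) = max_j (A^T x̂)_j and the dual value is min_i (A ŷ)_i, and the cost matrix used in the example is one common encoding of rock-paper-scissors.

import numpy as np

def saddle_point_gap(A, x_hat, y_hat):
    phi_bar = np.max(A.T @ x_hat)     # best response of the max-player to x_hat
    phi_under = np.min(A @ y_hat)     # best response of the min-player to y_hat
    return phi_bar - phi_under        # E_sad(x_hat, y_hat) >= 0

# Example: a cost matrix for rock-paper-scissors; the uniform strategies give gap 0.
A = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])
print(saddle_point_gap(A, np.ones(3) / 3, np.ones(3) / 3))  # 0.0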

14.4.1 Strongly-Convex-Strongly-Concave Setting


We first consider the setting when the cost function φ(x, y) is strongly con-
vex in x and strongly concave in y. More specifically, we make the follow-
ing assumption:

Assumption 14.17 (SC-SC setting). We assume φ(x, y) is µ-strongly convex in x for any fixed y ∈ Y and µ-strongly concave in y for any fixed x ∈ X, namely, for any x, x_1, x_2 ∈ X and y, y_1, y_2 ∈ Y,
\begin{align*}
\phi(x_1, y) &\ge \phi(x_2, y) + \nabla_x \phi(x_2, y)^\top (x_1 - x_2) + \frac{\mu}{2}\|x_1 - x_2\|^2, \\
-\phi(x, y_1) &\ge -\phi(x, y_2) - \nabla_y \phi(x, y_2)^\top (y_1 - y_2) + \frac{\mu}{2}\|y_1 - y_2\|^2.
\end{align*}
In the SC-SC setting, it can easily be shown that the saddle point is unique. We also assume the function is smooth.

Assumption 14.18 (Smoothness). We assume the function φ(x, y) is L-Lipschitz smooth jointly in x and y: for any x_1, x_2 ∈ X and y_1, y_2 ∈ Y,
\begin{align*}
\|\nabla_x \phi(x_1, y_1) - \nabla_x \phi(x_2, y_2)\| &\le L\bigl(\|x_1 - x_2\| + \|y_1 - y_2\|\bigr), \\
\|\nabla_y \phi(x_1, y_1) - \nabla_y \phi(x_2, y_2)\| &\le L\bigl(\|x_1 - x_2\| + \|y_1 - y_2\|\bigr).
\end{align*}

Gradient Descent Ascent. The simplest gradient-based algorithm for solving the above minimax problem is Gradient Descent Ascent (GDA): for t = 1, . . . , T,
\begin{align*}
x_{t+1} &= \Pi_X\bigl(x_t - \eta \nabla_x \phi(x_t, y_t)\bigr), \\
y_{t+1} &= \Pi_Y\bigl(y_t + \eta \nabla_y \phi(x_t, y_t)\bigr).
\end{align*}
The algorithm updates x and y simultaneously at each iteration, using only gradient information. In the following theorem, we show that the algorithm converges linearly with a sufficiently small constant stepsize.
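The following is a minimal sketch of projected GDA, assuming the user supplies callables grad_x(x, y), grad_y(x, y) and projections proj_X, proj_Y; these names, the test problem, and the stepsize are illustrative assumptions.

import numpy as np

def gda(grad_x, grad_y, proj_X, proj_Y, x0, y0, eta, T):
    x, y = x0.copy(), y0.copy()
    for _ in range(T):
        gx, gy = grad_x(x, y), grad_y(x, y)      # simultaneous gradient evaluation
        x = proj_X(x - eta * gx)                 # descent step on x
        y = proj_Y(y + eta * gy)                 # ascent step on y
    return x, y

# Example on the unconstrained SC-SC quadratic phi(x, y) = x^2/2 + x*y - y^2/2,
# whose unique saddle point is (0, 0):
gx = lambda x, y: x + y
gy = lambda x, y: x - y
identity = lambda z: z
print(gda(gx, gy, identity, identity, np.array([1.0]), np.array([1.0]), 0.1, 500))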

Theorem 14.19 (Convergence of GDA). Under Assumptions 14.17 and 14.18, GDA with stepsize η < µ/(2L²) converges linearly:
\[
\|x_{t+1} - x^*\|^2 + \|y_{t+1} - y^*\|^2 \le (1 + 4\eta^2 L^2 - 2\eta\mu)\bigl(\|x_t - x^*\|^2 + \|y_t - y^*\|^2\bigr).
\]
When η = µ/(4L²),
\[
\|x_T - x^*\|^2 + \|y_T - y^*\|^2 \le \bigl(1 - \mu^2/(4L^2)\bigr)^T \bigl(\|x_0 - x^*\|^2 + \|y_0 - y^*\|^2\bigr).
\]

Proof. First of all, by the definitions of strong convexity and strong concavity, we have
\begin{align*}
\phi(x_2, y_1) &\ge \phi(x_1, y_1) + \nabla_x \phi(x_1, y_1)^\top (x_2 - x_1) + \tfrac{\mu}{2}\|x_2 - x_1\|^2, \\
\phi(x_1, y_2) &\ge \phi(x_2, y_2) + \nabla_x \phi(x_2, y_2)^\top (x_1 - x_2) + \tfrac{\mu}{2}\|x_1 - x_2\|^2, \\
-\phi(x_1, y_2) &\ge -\phi(x_1, y_1) - \nabla_y \phi(x_1, y_1)^\top (y_2 - y_1) + \tfrac{\mu}{2}\|y_2 - y_1\|^2, \\
-\phi(x_2, y_1) &\ge -\phi(x_2, y_2) - \nabla_y \phi(x_2, y_2)^\top (y_1 - y_2) + \tfrac{\mu}{2}\|y_1 - y_2\|^2.
\end{align*}
Summing the four inequalities, we get
\begin{align*}
&(\nabla_x \phi(x_1, y_1) - \nabla_x \phi(x_2, y_2))^\top (x_1 - x_2) + (\nabla_y \phi(x_2, y_2) - \nabla_y \phi(x_1, y_1))^\top (y_1 - y_2) \\
&\qquad \ge \mu\|x_1 - x_2\|^2 + \mu\|y_1 - y_2\|^2.
\end{align*}
By the L-smoothness, we have
\begin{align*}
\|\nabla_x \phi(x, y) - \nabla_x \phi(x^*, y^*)\|^2 &\le 2L^2\|x - x^*\|^2 + 2L^2\|y - y^*\|^2, \\
\|\nabla_y \phi(x, y) - \nabla_y \phi(x^*, y^*)\|^2 &\le 2L^2\|x - x^*\|^2 + 2L^2\|y - y^*\|^2.
\end{align*}
Using the fixed-point property of the saddle point, x* = Π_X(x* − η∇_xφ(x*, y*)) and y* = Π_Y(y* + η∇_yφ(x*, y*)), and the non-expansiveness of the projection, we then have
\begin{align*}
&\|x_{t+1} - x^*\|^2 + \|y_{t+1} - y^*\|^2 \\
&= \|\Pi_X(x_t - \eta\nabla_x\phi(x_t, y_t)) - \Pi_X(x^* - \eta\nabla_x\phi(x^*, y^*))\|^2 + \|\Pi_Y(y_t + \eta\nabla_y\phi(x_t, y_t)) - \Pi_Y(y^* + \eta\nabla_y\phi(x^*, y^*))\|^2 \\
&\le \|x_t - \eta\nabla_x\phi(x_t, y_t) - x^* + \eta\nabla_x\phi(x^*, y^*)\|^2 + \|y_t + \eta\nabla_y\phi(x_t, y_t) - y^* - \eta\nabla_y\phi(x^*, y^*)\|^2 \\
&= \|x_t - x^*\|^2 + \eta^2\|\nabla_x\phi(x_t, y_t) - \nabla_x\phi(x^*, y^*)\|^2 - 2\eta(\nabla_x\phi(x_t, y_t) - \nabla_x\phi(x^*, y^*))^\top (x_t - x^*) \\
&\quad + \|y_t - y^*\|^2 + \eta^2\|\nabla_y\phi(x_t, y_t) - \nabla_y\phi(x^*, y^*)\|^2 - 2\eta(\nabla_y\phi(x^*, y^*) - \nabla_y\phi(x_t, y_t))^\top (y_t - y^*).
\end{align*}
Plugging in the previous two sets of inequalities (with (x_1, y_1) = (x_t, y_t) and (x_2, y_2) = (x*, y*)) leads to the desired result:
\[
\|x_{t+1} - x^*\|^2 + \|y_{t+1} - y^*\|^2 \le (1 + 4\eta^2 L^2 - 2\eta\mu)\bigl(\|x_t - x^*\|^2 + \|y_t - y^*\|^2\bigr).
\]
Remark 14.20 (Upper bound complexity). Define the condition number κ = L/µ. The above theorem implies that the iteration complexity of GDA to output a solution ε-close to the saddle point is at most O(κ² log(1/ε)). The best-known complexity for solving SC-SC minimax problems under the exact same setting is O(κ log(1/ε)) [Tse95, MOP20], which can be achieved by the extragradient method (EG) and optimistic GDA, to be introduced in the next subsection.

Remark 14.21 (Lower bound complexity). The above results can be extended to the more general class of SC-SC minimax problems F(L_x, L_y, L_xy, µ_x, µ_y), where L_x, L_y, L_xy are the gradient Lipschitz constants with respect to the different blocks of variables, and µ_x, µ_y are the strong convexity and strong concavity constants with respect to x and y. To find an ε-approximate saddle point, [ZHZ19] recently showed that any first-order algorithm satisfying the linear span assumption requires at least
\[
\Omega\!\left( \sqrt{\frac{L_x}{\mu_x} + \frac{L_{xy}^2}{\mu_x \mu_y} + \frac{L_y}{\mu_y}} \cdot \log\frac{1}{\varepsilon} \right)
\]
calls to a gradient oracle for φ(x, y). Directly applying the above GDA would yield the upper bound complexity O((L²/µ²) log(1/ε)), where µ = min{µ_x, µ_y} and L = max{L_x, L_y, L_xy}.

14.4.2 Convex-Concave Setting


In this section, we focus on the general convex-concave setting, where the
cost function φ(x, y) is convex in x for any fixed y and concave in y for any
fixed x. One representative example is the bilinear matrix game, where
φ(x, y) = x> Ay.
Assumption 14.22 (C-C setting). We assume φ(x, y) is convex in x for any
fixed y ∈ Y and is concave in y for any fixed x ∈ X .

GDA with constant stepsize may not converge. Consider the function φ(x, y) = xy, which has the saddle point (0, 0). GDA with constant stepsize gives the update x_{t+1} = x_t − ηy_t, y_{t+1} = y_t + ηx_t. This implies
\[
x_{t+1}^2 + y_{t+1}^2 = (x_t - \eta y_t)^2 + (y_t + \eta x_t)^2 = (1 + \eta^2)(x_t^2 + y_t^2).
\]
Hence the iterates do not converge to the saddle point (0, 0): for any η > 0, the iterate (x_t, y_t) diverges.

Extragradient Method (EG). The extragradient method [Kor76] is a classical method introduced by Korpelevich in 1976. The main idea of EG is to use the gradient at the current point to find a mid-point, and then use the gradient at that mid-point to find the next iterate:
\begin{align*}
x_{t+\frac{1}{2}} &= \Pi_X\bigl(x_t - \eta \nabla_x \phi(x_t, y_t)\bigr), & y_{t+\frac{1}{2}} &= \Pi_Y\bigl(y_t + \eta \nabla_y \phi(x_t, y_t)\bigr), \\
x_{t+1} &= \Pi_X\bigl(x_t - \eta \nabla_x \phi(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}})\bigr), & y_{t+1} &= \Pi_Y\bigl(y_t + \eta \nabla_y \phi(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}})\bigr).
\end{align*}
As suggested by its name, EG requires evaluating extra gradients at the midpoint in each iteration, so it doubles the per-iteration computational cost compared to GDA. With an appropriate choice of constant stepsize, EG provably converges to a saddle point in the C-C setting.
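The following is a minimal sketch of EG for the bilinear game min_{x∈∆_m} max_{y∈∆_n} x^T A y; the simplex-projection helper, the stepsize, and the averaging of mid-points (as in Theorem 14.23 below) are assumptions for illustration.

import numpy as np

def project_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, len(v) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def extragradient(A, eta, T):
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    x_avg, y_avg = np.zeros(m), np.zeros(n)
    for _ in range(T):
        # half step from (x, y) using the gradients at (x, y)
        x_half = project_simplex(x - eta * (A @ y))
        y_half = project_simplex(y + eta * (A.T @ x))
        # full step from (x, y) using the gradients at the mid-point
        x = project_simplex(x - eta * (A @ y_half))
        y = project_simplex(y + eta * (A.T @ x_half))
        x_avg += x_half / T
        y_avg += y_half / T
    return x_avg, y_avg   # average of the mid-points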

Theorem 14.23 (Convergence of EG). Assume D_X := max_{x,x'∈X} ‖x − x'‖ < ∞ and D_Y := max_{y,y'∈Y} ‖y − y'‖ < ∞. Under Assumptions 14.22 and 14.18, EG with stepsize η ≤ 1/(2L) satisfies
\[
\max_{y \in Y} \phi\!\left( \frac{1}{T}\sum_{t=1}^T x_{t+\frac{1}{2}},\, y \right) - \min_{x \in X} \phi\!\left( x,\, \frac{1}{T}\sum_{t=1}^T y_{t+\frac{1}{2}} \right) \le \frac{D_X^2 + D_Y^2}{2\eta T}.
\]
Denote $\hat{x}_T = \frac{1}{T}\sum_{t=1}^T x_{t+\frac{1}{2}}$ and $\hat{y}_T = \frac{1}{T}\sum_{t=1}^T y_{t+\frac{1}{2}}$. Setting η = 1/(2L), this implies that
\[
E_{\mathrm{sad}}(\hat{x}_T, \hat{y}_T) \le \frac{L(D_X^2 + D_Y^2)}{T}.
\]

Proof. For convenience, we denote
\begin{align*}
\tilde{x}_{t+\frac{1}{2}} &= x_t - \eta\nabla_x \phi(x_t, y_t), & \tilde{y}_{t+\frac{1}{2}} &= y_t + \eta\nabla_y \phi(x_t, y_t), \\
\tilde{x}_{t+1} &= x_t - \eta\nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr), & \tilde{y}_{t+1} &= y_t + \eta\nabla_y \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr),
\end{align*}
so that
\[
x_{t+\frac{1}{2}} = \Pi_X\bigl(\tilde{x}_{t+\frac{1}{2}}\bigr), \quad y_{t+\frac{1}{2}} = \Pi_Y\bigl(\tilde{y}_{t+\frac{1}{2}}\bigr), \quad x_{t+1} = \Pi_X(\tilde{x}_{t+1}), \quad y_{t+1} = \Pi_Y(\tilde{y}_{t+1}).
\]
First, we note that
\begin{align*}
\nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x\bigr)
&= \nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top (x_{t+1} - x) + \nabla_x \phi(x_t, y_t)^\top \bigl(x_{t+\frac{1}{2}} - x_{t+1}\bigr) \\
&\quad + \bigl(\nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr) - \nabla_x \phi(x_t, y_t)\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x_{t+1}\bigr).
\end{align*}
We bound each term on the right-hand side. For the first term, we have
\begin{align*}
\nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top (x_{t+1} - x)
&= \frac{1}{\eta}(x_t - \tilde{x}_{t+1})^\top (x_{t+1} - x) \\
&\le \frac{1}{\eta}(x_t - x_{t+1})^\top (x_{t+1} - x)
= \frac{1}{2\eta}\bigl( \|x - x_t\|^2 - \|x - x_{t+1}\|^2 - \|x_t - x_{t+1}\|^2 \bigr),
\end{align*}
where the inequality uses the property of the projection. For the second term, we have
\begin{align*}
\nabla_x \phi(x_t, y_t)^\top \bigl(x_{t+\frac{1}{2}} - x_{t+1}\bigr)
&= \frac{1}{\eta}\bigl(x_t - \tilde{x}_{t+\frac{1}{2}}\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x_{t+1}\bigr)
\le \frac{1}{\eta}\bigl(x_t - x_{t+\frac{1}{2}}\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x_{t+1}\bigr) \\
&= \frac{1}{2\eta}\Bigl[ \|x_{t+1} - x_t\|^2 - \|x_{t+\frac{1}{2}} - x_{t+1}\|^2 - \|x_t - x_{t+\frac{1}{2}}\|^2 \Bigr].
\end{align*}
For the third term, we have
\begin{align*}
&\bigl(\nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr) - \nabla_x \phi(x_t, y_t)\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x_{t+1}\bigr) \\
&\le \bigl\| \nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr) - \nabla_x \phi(x_t, y_t) \bigr\| \, \bigl\| x_{t+\frac{1}{2}} - x_{t+1} \bigr\| \\
&\le L\Bigl[ \|x_{t+\frac{1}{2}} - x_t\| + \|y_{t+\frac{1}{2}} - y_t\| \Bigr] \bigl\| x_{t+\frac{1}{2}} - x_{t+1} \bigr\| \\
&\le \frac{L}{2}\|x_{t+\frac{1}{2}} - x_t\|^2 + \frac{L}{2}\|y_{t+\frac{1}{2}} - y_t\|^2 + L\|x_{t+\frac{1}{2}} - x_{t+1}\|^2,
\end{align*}
where the last inequality uses the fact that ab ≤ ½a² + ½b². Combining the three inequalities above gives
\begin{align*}
\nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x\bigr)
&\le \frac{1}{2\eta}\bigl( \|x_t - x\|^2 - \|x - x_{t+1}\|^2 \bigr) + \Bigl( L - \frac{1}{2\eta} \Bigr)\|x_{t+\frac{1}{2}} - x_{t+1}\|^2 \\
&\quad + \Bigl( \frac{L}{2} - \frac{1}{2\eta} \Bigr)\|x_{t+\frac{1}{2}} - x_t\|^2 + \frac{L}{2}\|y_{t+\frac{1}{2}} - y_t\|^2.
\end{align*}
Similarly, we can show
\begin{align*}
-\nabla_y \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top \bigl(y_{t+\frac{1}{2}} - y\bigr)
&\le \frac{1}{2\eta}\bigl( \|y_t - y\|^2 - \|y - y_{t+1}\|^2 \bigr) + \Bigl( L - \frac{1}{2\eta} \Bigr)\|y_{t+\frac{1}{2}} - y_{t+1}\|^2 \\
&\quad + \Bigl( \frac{L}{2} - \frac{1}{2\eta} \Bigr)\|y_{t+\frac{1}{2}} - y_t\|^2 + \frac{L}{2}\|x_{t+\frac{1}{2}} - x_t\|^2.
\end{align*}
Since η ≤ 1/(2L), we have L − 1/(2η) ≤ 0, so after adding the two inequalities all squared-distance terms except the telescoping ones are non-positive. Taking the telescoping sum over t = 1, . . . , T leads to
\[
\sum_{t=1}^T \Bigl[ \nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x\bigr) - \nabla_y \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top \bigl(y_{t+\frac{1}{2}} - y\bigr) \Bigr]
\le \frac{1}{2\eta}\bigl( \|x_1 - x\|^2 + \|y_1 - y\|^2 \bigr).
\]
Lastly, invoking the convexity of φ(·, y) and the concavity of φ(x, ·), for any x ∈ X, y ∈ Y it holds that
\begin{align*}
\phi\!\left( \frac{1}{T}\sum_{t=1}^T x_{t+\frac{1}{2}},\, y \right) - \phi\!\left( x,\, \frac{1}{T}\sum_{t=1}^T y_{t+\frac{1}{2}} \right)
&\le \frac{1}{T}\sum_{t=1}^T \Bigl[ \phi\bigl(x_{t+\frac{1}{2}}, y\bigr) - \phi\bigl(x, y_{t+\frac{1}{2}}\bigr) \Bigr] \\
&\le \frac{1}{T}\sum_{t=1}^T \Bigl[ \nabla_x \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top \bigl(x_{t+\frac{1}{2}} - x\bigr) - \nabla_y \phi\bigl(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}}\bigr)^\top \bigl(y_{t+\frac{1}{2}} - y\bigr) \Bigr] \\
&\le \frac{D_X^2 + D_Y^2}{2\eta T}.
\end{align*}

Remark 14.24. The above theorem shows that the average of the mid-point iterates of the EG method achieves a convergence rate of O(1/T): E_sad(x̂_T, ŷ_T) ≤ O(L/T). This rate is known to be unimprovable among first-order methods for solving general convex-concave minimax problems without further assumptions [OX21]. It has been shown that the last iterate of EG has a slower O(1/√T) convergence rate than the average iterate [GPDO20, COZ22].

Optimistic Gradient Descent Ascent (OGDA). Similar to EG, OGDA is another popular algorithm for solving convex-concave minimax problems; it was originally introduced by Popov in 1980. There are several slightly different versions of the algorithm in the literature, coined with different names. In the constrained setting, OGDA performs the update
\begin{align*}
x_{t+\frac{1}{2}} &= \Pi_X\bigl(x_t - \eta\nabla_x \phi(x_{t-\frac{1}{2}}, y_{t-\frac{1}{2}})\bigr), & y_{t+\frac{1}{2}} &= \Pi_Y\bigl(y_t + \eta\nabla_y \phi(x_{t-\frac{1}{2}}, y_{t-\frac{1}{2}})\bigr), \\
x_{t+1} &= \Pi_X\bigl(x_t - \eta\nabla_x \phi(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}})\bigr), & y_{t+1} &= \Pi_Y\bigl(y_t + \eta\nabla_y \phi(x_{t+\frac{1}{2}}, y_{t+\frac{1}{2}})\bigr).
\end{align*}
Note that, in sharp contrast to EG, OGDA only requires evaluating one gradient at each iteration and reuses the previous gradient. In the unconstrained setting, the algorithm simplifies to
\[
\begin{cases}
x_{t+1} = x_t - 2\eta\nabla_x \phi(x_t, y_t) + \eta\nabla_x \phi(x_{t-1}, y_{t-1}), \\
y_{t+1} = y_t + 2\eta\nabla_y \phi(x_t, y_t) - \eta\nabla_y \phi(x_{t-1}, y_{t-1}),
\end{cases}
\]
or equivalently,
\[
\begin{cases}
x_{t+1} = x_t - \eta\nabla_x \phi(x_t, y_t) - \eta\bigl(\nabla_x \phi(x_t, y_t) - \nabla_x \phi(x_{t-1}, y_{t-1})\bigr), \\
y_{t+1} = y_t + \eta\nabla_y \phi(x_t, y_t) + \eta\bigl(\nabla_y \phi(x_t, y_t) - \nabla_y \phi(x_{t-1}, y_{t-1})\bigr).
\end{cases}
\]
The last term in each update can be regarded as a "negative momentum", which makes OGDA converge faster than GDA. The convergence guarantees of OGDA are similar to those of EG, and we omit them here.
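As a small sanity check, here is a sketch of unconstrained OGDA on the bilinear example φ(x, y) = xy from above, whose GDA iterates diverge; the stepsize and initialization are illustrative assumptions.

def ogda_bilinear(eta=0.1, T=200):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = y, x            # gradients at the initial point
    for _ in range(T):
        gx, gy = y, x                  # grad_x = y, grad_y = x for phi = x*y
        x = x - 2 * eta * gx + eta * gx_prev
        y = y + 2 * eta * gy - eta * gy_prev
        gx_prev, gy_prev = gx, gy
    return x, y   # approaches the saddle point (0, 0)

print(ogda_bilinear())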

Connections to the Proximal Point Algorithm (PPA). For convex minimization, we introduced the proximal point method, which iteratively computes the proximal operator of the objective. Similarly, we can extend the proximal point algorithm to solve convex-concave minimax problems; this was initially studied by Rockafellar in 1976. At each iteration, PPA performs the update
\[
(x_{t+1}, y_{t+1}) = \operatorname*{argmin}_{x \in X}\, \operatorname*{argmax}_{y \in Y} \; \phi(x, y) + \frac{1}{2\eta}\|x - x_t\|^2 - \frac{1}{2\eta}\|y - y_t\|^2.
\]
Note that for any η > 0, the above subproblem is strongly convex in x and strongly concave in y, and thus admits a unique solution. In the unconstrained case X = R^m, Y = R^n, the PPA update can be rewritten as the implicit update
\[
x_{t+1} = x_t - \eta\nabla_x \phi(x_{t+1}, y_{t+1}), \qquad y_{t+1} = y_t + \eta\nabla_y \phi(x_{t+1}, y_{t+1}).
\]
This is a conceptual algorithm, as implementing it requires solving an SC-SC auxiliary problem at each iteration, which may not admit a closed-form solution. EG and OGDA can be viewed as approximately computing the implicit update:
• EG: ∇_x φ(x_{t+1}, y_{t+1}) ≈ ∇_x φ(x_{t+1/2}, y_{t+1/2});
• OGDA: ∇_x φ(x_{t+1}, y_{t+1}) ≈ ∇_x φ(x_t, y_t) + (∇_x φ(x_t, y_t) − ∇_x φ(x_{t−1}, y_{t−1})).
As in convex minimization, this approximate proximal point perspective can be used to design and analyze accelerated algorithms for solving minimax problems.

14.5 Extension to Variational Inequalities


In this section, we introduce a unified and convenient framework – vari-
ational inequalities with monotone operator – which encapsulates convex
minimization, convex-concave minimax problems, and general concave
games as special cases.

14.5.1 Stampacchia VI and Minty VI


Variational Inequality (VI) Problem. Let Z ⊂ R^d be a nonempty subset and consider a mapping F : Z → R^d. The goal of the VI problem is to find a (strong) solution z* ∈ Z such that
\[
\langle F(z^*), z - z^* \rangle \ge 0 \quad \text{for all } z \in Z. \tag{SVI}
\]

Figure 14.2: Variational inequality problem. In geometric terms, the variational inequality states that F(z*) makes a non-negative inner product with every feasible direction z − z*, i.e., −F(z*) lies in the normal cone of the feasible set at the point z*.

This is also known as the Stampacchia Variational Inequality (SVI), introduced by Hartman and Stampacchia in 1966 [HS66]. A solution to the Stampacchia variational inequality is called a strong solution. By the Brouwer fixed point theorem, if Z is a nonempty convex compact subset of R^d and F : Z → R^d is continuous, then a solution z* to (SVI) exists. Throughout, we assume the solution set is nonempty.
This formulation, as we see later, offers a unified treatment of systems
of equations, optimization problems, equilibrium problems, complemen-
tarity problems, fixed point problems, etc.
A closely related problem is the Minty Variational Inequality (MVI), introduced by Minty in 1962 [Min62], which asks for a (weak) solution z* ∈ Z such that
\[
\langle F(z), z - z^* \rangle \ge 0 \quad \text{for all } z \in Z. \tag{MVI}
\]
A solution to the Minty variational inequality is called a weak solution. Below we discuss the relationship between these two solution concepts.

Definition 14.25 (Monotonicity). The operator F : Z → R^d is said to be monotone if
\[
\langle F(u) - F(v), u - v \rangle \ge 0 \quad \forall\, u, v \in Z.
\]
It is said to be strongly monotone with modulus µ > 0 if
\[
\langle F(u) - F(v), u - v \rangle \ge \mu\|u - v\|^2 \quad \forall\, u, v \in Z.
\]

Lemma 14.26 (Exercise 81). The following statements hold:

(i) If F is monotone, then a solution to SVI is also a solution to MVI.

(ii) If F is continuous and Z is convex, then a solution to MVI is also a solution to SVI.

Notably, the monotonicity of the operator F is sufficient but not necessary for a strong solution to also be a weak solution. It can be relaxed to much weaker assumptions, such as pseudo-monotonicity:
\[
\langle F(v), u - v \rangle \ge 0 \;\Rightarrow\; \langle F(u), u - v \rangle \ge 0, \quad \forall\, u, v \in Z.
\]

Accuracy Measure. A natural inaccuracy measure of a candidate solution ẑ ∈ Z to the (Minty) VI(Z, F) is the dual gap function
\[
E_{\mathrm{VI}}(\hat{z}) := \max_{z \in Z} \langle F(z), \hat{z} - z \rangle. \tag{14.9}
\]
Note that E_VI(ẑ) ≥ 0 for any ẑ ∈ Z, and the inaccuracy or error vanishes exactly on the set of weak solutions: if ẑ is a weak solution, then ⟨F(z), ẑ − z⟩ ≤ 0 for all z ∈ Z, so E_VI(ẑ) ≤ 0, which implies E_VI(ẑ) = 0.

14.5.2 Examples
Many problems of interest that we discussed in this course can be framed
as solving VIs.

Convex Minimization. Consider the convex minimization problem

min f (x)
x∈X

where f is a convex and continuously differentiable function and X is con-


vex set. Set F (x) = ∇f (x). Since f is convex and continuously differen-
tiable, the gradient field F is monotone and continuous. Finding a global
minimum of f over X is equivalent to finding a weak solution to Minty
VI(X , F ) with F (x) = ∇f (x).

Convex-concave Saddle Point Problems. Consider the convex-concave
saddle point problem:
min max φ(x, y)
x∈X y∈Y

where φ(x, y) is convex in x for any fixed y ∈ Y and concave in y for any fixed x ∈ X, and X, Y are convex sets. Assume φ(·, ·) is also continuously differentiable. Set

z = [x; y], Z = X × Y, F (z) = [∇x φ(x, y); −∇y φ(x, y)].

The operator F is monotone due to the convexity of φ(·, y) and −φ(x, ·).
Finding a saddle point of φ(x, y) is equivalent to finding a weak solution
to the corresponding Minty VI(Z, F ) with Z and F as defined above.

Concave Nash Equilibrium Problems. Consider a game with finite num-


ber of players i ∈ N = {1, . . . , N}. Each player chooses an action x_i ∈ X_i, where X_i is a compact convex subset of R^{d_i}. We denote the action profile by x = (x_i, x_{−i}) = (x_1, . . . , x_N) ∈ X = Π_i X_i. Each player's payoff (or utility) function is given by u_i(x_i, x_{−i}) : X → R, which depends on the whole action profile of all players.

Definition 14.27. An action profile x* ∈ X is said to be a Nash equilibrium if it is resilient to unilateral deviations, i.e.,
\[
u_i(x_i^*, x_{-i}^*) \ge u_i(x_i, x_{-i}^*) \quad \forall\, x_i \in X_i,\; i \in N.
\]
We assume that the payoff functions are continuously differentiable and, moreover, that each u_i(x_i, x_{−i}) is concave in x_i for all x_{−i} ∈ X_{−i} = Π_{j≠i} X_j, i ∈ N. Under the concavity assumption, Nash equilibria can be characterized via first-order optimality:
\[
\langle \nabla_i u_i(x_i^*, x_{-i}^*), x_i - x_i^* \rangle \le 0 \quad \forall\, x_i \in X_i,
\]
where ∇_i denotes differentiation with respect to x_i. Set
\[
F(x) = \bigl(-\nabla_i u_i(x_i, x_{-i})\bigr)_{i \in N}.
\]

Note that F is monotone for the important subclass of monotone concave games (monotonicity does not hold for every concave game, but it does for many games of interest). In that case, finding a Nash equilibrium of the concave game is equivalent to finding a weak solution to the Minty VI(X, F) with the operator F defined above.

14.5.3 Algorithms for Solving VIs
In this section, we introduce algorithms for solving VIs. Let Z ⊂ R^d be a nonempty subset and consider a mapping F : Z → R^d. The goal is to find a solution z* ∈ Z that satisfies
\[
\langle F(z), z - z^* \rangle \ge 0 \quad \text{for all } z \in Z. \tag{MVI}
\]
Throughout, we make the following blanket assumptions:

• The set Z is a closed convex subset of R^d.

• The solution set of the VI is nonempty.

• The mapping F is monotone.

• The mapping F is Lipschitz continuous with constant L > 0, i.e., ‖F(u) − F(v)‖ ≤ L‖u − v‖ for all u, v ∈ Z.

As a result, finding a weak solution to the MVI is equivalent to finding a strong solution to the SVI. We are interested in the iteration complexity of finding an approximate solution ẑ ∈ Z such that E_VI(ẑ) ≤ ε, where
\[
E_{\mathrm{VI}}(\hat{z}) := \max_{z \in Z} \langle F(z), \hat{z} - z \rangle.
\]

EG and OGDA. The extragradient method (EG) and optimistic gradient descent ascent (OGDA) can be directly extended to solving VIs. We present them in Algorithm 9 and Algorithm 10. A key difference is that OGDA requires only one evaluation of the operator F per iteration (at the iterate z_t), while EG requires two evaluations per iteration (at the iterate z_t and at the mid-point z̃_t).

Algorithm 9 EG
1: Initialize z1 ∈ Z
2: for t = 1, 2, . . . , T do
3: z̃t = ΠZ (zt − ηt F (zt ))
4: zt+1 = ΠZ (zt − ηt F (z̃t ))
5: end for

Algorithm 10 OGDA
1: Initialize z_1 = z_{1/2} ∈ Z
2: for t = 1, 2, . . . , T do
3:   z_{t+1/2} = Π_Z(z_{t−1/2} − η_t F(z_t))
4:   z_{t+1} = Π_Z(z_{t+1/2} − η_t F(z_t))
5: end for

Mirror Prox algorithm. We can further extend EG to non-Euclidean geometry by leveraging the Bregman divergence. This is known as the Mirror Prox algorithm [Nem04]. Let ω(z) : Z → R be a distance-generating function, i.e., ω is 1-strongly convex with respect to some norm ‖·‖ on the underlying space and continuously differentiable. The Bregman distance induced by ω(·) is given by
\[
V(z, z') = \omega(z) - \omega(z') - \nabla\omega(z')^\top (z - z') \ge \frac{1}{2}\|z - z'\|^2.
\]
The Mirror Prox algorithm works as follows: initialize z_1 ∈ Z and update at each iteration t,
\begin{align}
\hat{z}_t &= \operatorname*{argmin}_{z \in Z}\,\{ V(z, z_t) + \langle \gamma_t F(z_t), z \rangle \}, \tag{14.10} \\
z_{t+1} &= \operatorname*{argmin}_{z \in Z}\,\{ V(z, z_t) + \langle \gamma_t F(\hat{z}_t), z \rangle \}. \tag{14.11}
\end{align}

Figure 14.3: EG or Mirror Prox algorithm

Note that this is not the same as two consecutive steps of the mirror descent algorithm, since the first term in the minimization, V(z, z_t), is the same in both steps of the update. This is illustrated in Figure 14.3.
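For a concrete instance, here is a minimal sketch of Mirror Prox with the entropy distance-generating function on the product of two simplices, applied to the matrix-game operator F(x, y) = (A y, −A^T x); the stepsize, problem data, and the uniform averaging of the intermediate points are illustrative assumptions.

import numpy as np

def md_step_simplex(z, g, gamma):
    # argmin over the simplex of V(z', z) + <gamma*g, z'> with entropic V:
    # closed-form multiplicative-weights update.
    w = z * np.exp(-gamma * g)
    return w / w.sum()

def mirror_prox_matrix_game(A, gamma, T):
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    x_bar, y_bar = np.zeros(m), np.zeros(n)
    for _ in range(T):
        # (14.10): intermediate point using the operator at (x, y)
        x_hat = md_step_simplex(x, A @ y, gamma)
        y_hat = md_step_simplex(y, -(A.T @ x), gamma)
        # (14.11): step from (x, y) using the operator at the intermediate point
        x = md_step_simplex(x, A @ y_hat, gamma)
        y = md_step_simplex(y, -(A.T @ x_hat), gamma)
        x_bar += x_hat / T
        y_bar += y_hat / T
    return x_bar, y_bar   # average of the intermediate points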

Theorem 14.28 ([Nem04]). Denote the Bregman diameter Ω = max_{z∈Z} V(z, z_1). The Mirror Prox algorithm with stepsizes γ_t ≤ 1/L satisfies
\[
E_{\mathrm{VI}}(\bar{z}_T) := \max_{z \in Z} \langle F(z), \bar{z}_T - z \rangle \le \frac{\Omega}{\sum_{t=1}^T \gamma_t}, \qquad \text{where } \bar{z}_T = \frac{\sum_{t=1}^T \gamma_t \hat{z}_t}{\sum_{t=1}^T \gamma_t}.
\]

Proof. From the Bregman three-point identity and the optimality condition for ẑ_t to be the solution of (14.10), we have
\[
\langle \gamma_t F(z_t), \hat{z}_t - z \rangle \le V(z, z_t) - V(z, \hat{z}_t) - V(\hat{z}_t, z_t), \quad \forall z \in Z. \tag{14.12}
\]
Similarly, optimality of z_{t+1} for (14.11) gives
\[
\langle \gamma_t F(\hat{z}_t), z_{t+1} - z \rangle \le V(z, z_t) - V(z, z_{t+1}) - V(z_{t+1}, z_t), \quad \forall z \in Z. \tag{14.13}
\]
Setting z = z_{t+1} in (14.12), we obtain
\[
\langle \gamma_t F(z_t), \hat{z}_t - z_{t+1} \rangle \le V(z_{t+1}, z_t) - V(z_{t+1}, \hat{z}_t) - V(\hat{z}_t, z_t). \tag{14.14}
\]
Combining (14.13) and (14.14), we have
\begin{align*}
\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z \rangle
&= \langle \gamma_t F(\hat{z}_t), \hat{z}_t - z_{t+1} \rangle + \langle \gamma_t F(\hat{z}_t), z_{t+1} - z \rangle \\
&= \gamma_t \langle F(\hat{z}_t) - F(z_t), \hat{z}_t - z_{t+1} \rangle + \langle \gamma_t F(z_t), \hat{z}_t - z_{t+1} \rangle + \langle \gamma_t F(\hat{z}_t), z_{t+1} - z \rangle \\
&\le \gamma_t \langle F(\hat{z}_t) - F(z_t), \hat{z}_t - z_{t+1} \rangle - V(z_{t+1}, \hat{z}_t) - V(\hat{z}_t, z_t) + V(z, z_t) - V(z, z_{t+1}).
\end{align*}
Let σ_t = γ_t⟨F(ẑ_t) − F(z_t), ẑ_t − z_{t+1}⟩ − V(z_{t+1}, ẑ_t) − V(ẑ_t, z_t). By the smoothness assumption, we have ‖F(ẑ_t) − F(z_t)‖_* ≤ L‖ẑ_t − z_t‖. Invoking the Cauchy–Schwarz inequality and the property of the Bregman distance, V(z, z') ≥ ½‖z − z'‖², we obtain
\[
\sigma_t \le \gamma_t L \|z_{t+1} - \hat{z}_t\| \cdot \|\hat{z}_t - z_t\| - \frac{1}{2}\|z_{t+1} - \hat{z}_t\|^2 - \frac{1}{2}\|\hat{z}_t - z_t\|^2.
\]
Since γ_t ≤ 1/L, we have σ_t ≤ 0, and thus
\[
\langle \gamma_t F(\hat{z}_t), \hat{z}_t - z \rangle \le V(z, z_t) - V(z, z_{t+1}).
\]
By monotonicity of F, we have ⟨F(ẑ_t), ẑ_t − z⟩ ≥ ⟨F(z), ẑ_t − z⟩. Hence, summing over t,
\[
\sum_{t=1}^T \gamma_t \langle F(z), \hat{z}_t - z \rangle \le V(z, z_1), \quad \forall z \in Z.
\]
The left-hand side equals $(\sum_{t=1}^T \gamma_t)\,\langle F(z), \bar{z}_T - z \rangle$, so taking the maximum over z ∈ Z on both sides leads to the desired result, E_VI(z̄_T) ≤ Ω / Σ_{t=1}^T γ_t.

Remark. If the stepsize is constant, γ_t = 1/L, then we have
\[
E_{\mathrm{VI}}(\bar{z}_T) \le \frac{\Omega L}{T}.
\]
The Mirror Prox algorithm thus achieves an O(1/T) rate of convergence for solving VIs. EG can be viewed as a special case, obtained by using the Euclidean distance as the Bregman divergence, and achieves similar guarantees. If additional structure is available, e.g., strong monotonicity of the operator F, then linear convergence can be obtained for the EG, OGDA, and Mirror Prox algorithms.

Remark. While the VI perspective provides a unified framework to an-


alyze a broad class of optimization problems, it might not fully exploit
the underlying fine-grained structure of the problem of interest. For spe-
cific applications, the Lipschitz constants may vary for different blocks,
or the strong monotonicity parameters may vary for different blocks, etc.
Efficiently exploiting such structure would require more sophisticated al-
gorithm design and analysis.

14.6 Exercises
Exercise 79. Consider the repeated zero-sum matrix game. Suppose the row player chooses x_{t+1} according to the gradient descent update rule at each round:
\[
x_{t+1} = \Pi_{\Delta_m}(x_t - \eta A y_t),
\]
where Π_{∆_m} is the projection operator onto ∆_m and η > 0 is the stepsize. Let G = max_{y∈∆_n} ‖Ay‖. Show that with η = √(2/(G²T)), the row player's regret satisfies
\[
R_T(y_1, \ldots, y_T) := \sum_{t=1}^T x_t^\top A y_t - \min_{x \in \Delta_m} \sum_{t=1}^T x^\top A y_t \le \sqrt{2 G^2 T}.
\]

Exercise 80. Let f_i(x), i = 1, . . . , n, be convex and continuous on a convex compact set X, and let ∆_n = {y ∈ R^n_+ : Σ_{i=1}^n y_i = 1}. Show that there exists some y* ∈ ∆_n such that
\[
\min_{x \in X} \max_{1 \le i \le n} f_i(x) = \min_{x \in X} \sum_{i=1}^n y_i^* f_i(x).
\]

Exercise 81. Show that the following statements hold:

(i) If F is monotone, then a solution to SVI is also a solution to MVI.

(ii) If F is continuous and Z is convex, then a solution to MVI is also a solution


to SVI.

Bibliography

[ACD+ 22] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster,
Nathan Srebro, and Blake Woodworth. Lower bounds for non-
convex stochastic optimization. Mathematical Programming,
pages 1–50, 2022.

[ACGH18] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A
convergence analysis of gradient descent for deep linear neu-
ral networks. CoRR, abs/1810.02281, 2018.

[AE08] Herbert Amann and Joachim Escher. Analysis II. Birkhäuser,


2008.

[AWBR09] Alekh Agarwal, Martin J Wainwright, Peter Bartlett, and


Pradeep Ravikumar. Information-theoretic lower bounds on
the oracle complexity of convex optimization. Advances in Neu-
ral Information Processing Systems, 22, 2009.

[AZ99] Nina Amenta and Günter M. Ziegler. Deformed products and


maximal shadows of polytopes. In B. Chazelle, J.E. Goodman,
and R. Pollack, editors, Advances in Discrete and Computational
Geometry, volume 223 of Contemporary Mathematics, pages 57–
90. American Mathematical Society, 1999.

[BB07] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale
learning. Advances in neural information processing systems, 20,
2007.

[BCN18] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimiza-


tion methods for large-scale machine learning. Siam Review,
60(2):223–311, 2018.

[BEHW89] Anselm Blumer, A. Ehrenfeucht, David Haussler, and Man-
fred K. Warmuth. Learnability and the vapnik-chervonenkis
dimension. J. ACM, 36(4):929–965, 1989.

[Ber05] Dimitri P. Bertsekas. Lecture slides on convex analysis


and optimization, 2005. http://athenasc.com/Convex_
Slides.pdf.

[BG17] Nikhil Bansal and Anupam Gupta. Potential-function proofs


for first-order methods. CoRR, abs/1712.04581, 2017.

[Bor87] Karl Heinz Borgwardt. The Simplex Method. Algorithms and


Combinatorics. Springer, 1987.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimiza-


tion. Cambridge University Press, New York, NY, USA, 2004.
https://web.stanford.edu/˜boyd/cvxbook/.

[Cla10] Kenneth L. Clarkson. Coresets, sparse greedy approximation,


and the Frank-Wolfe algorithm. ACM Trans. Algorithms, 6(4),
sep 2010.

[CLRS09] Thomas H. Cormen, Charles E. Leiserson, Ron L. Rivest, and


Clifford Stein. Introduction to algorithms. MIT Press, Cam-
bridge, Mass, 3rd ed. edition, 2009.

[CLSH19] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On
the convergence of a class of adam-type algorithms for non-
convex optimization. In ICLR, 2019.

[CO19] Ashok Cutkosky and Francesco Orabona. Momentum-based


variance reduction in non-convex sgd. Advances in Neural In-
formation Processing Systems, 32:15236–15245, 2019.

[COZ22] Yang Cai, Argyris Oikonomou, and Weiqiang Zheng. Tight


last-iterate convergence of the extragradient method for con-
strained monotone variational inequalities. arXiv preprint
arXiv:2204.09228, 2022.

[Dan16] George Dantzig. Linear Programming and Extensions. Princeton


University Press, 2016.

[Dav59] William C. Davidon. Variable metric method for minimization.
Technical Report ANL-5990, AEC Research and Development,
1959.
[Dav91] William C. Davidon. Variable metric method for minimization.
SIAM J. Optimization, 1(1):1–17, 1991.
[DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga:
A fast incremental gradient method with support for non-
strongly convex composite objectives. In Advances in neural
information processing systems, pages 1646–1654, 2014.
[DHS11a] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgra-
dient methods for online learning and stochastic optimization.
Journal of machine learning research, 12(7), 2011.
[DHS11b] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgra-
dient methods for online learning and stochastic optimization.
Journal of machine learning research, 12(7), 2011.
[Die69] J. Dieudonneé. Foundations of Modern Analysis. Academic
Press, 1969.
[Don17] David Donoho. Fifty years of data science. Journal of Computa-
tional and Graphical Statistics, 24(6):745–766, 2017.
[DSSSC08] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar
Chandra. Efficient projections onto the `1 -ball for learning in
high dimensions. In Proceedings of the 25th International Confer-
ence on Machine Learning, pages 272–279, 2008.
[ET99] Ivar Ekeland and Roger Temam. Convex analysis and variational
problems. SIAM, 1999.
[FLLZ18] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang.
Spider: Near-optimal non-convex optimization via stochastic
path-integrated differential estimator. In Advances in Neural
Information Processing Systems, pages 689–699, 2018.
[FM91] M. Furi and M. Martelli. On the mean value theorem, in-
equality, and inclusion. The American Mathematical Monthly,
98(9):840–846, 1991.

[FW56] Marguerite Frank and Philip Wolfe. An algorithm for
quadratic programming. Naval Research Logistics Quarterly,
3(1-2):95–110, 1956.

[Gal69] David Gale. How to solve linear inequalities. The American


Mathematical Monthly, 76(6):589–599, 1969.

[GM12] Bernd Gärtner and Jiřı́ Matoušek. Approximation Algorithms


and Semidefinite Programming. Springer, 2012.

[Gol70] D. Goldfarb. A family of variable-metric methods derived by


variational means. Mathematics of Computation, 24(109):23–26,
1970.

[GPDO20] Noah Golowich, Sarath Pattathil, Constantinos Daskalakis,
and Asuman Ozdaglar. Last iterate is slower than averaged
iterate in smooth convex-concave saddle point problems. In
Conference on Learning Theory, pages 1758–1784. PMLR, 2020.

[Gre70] J. Greenstadt. Variations on variable-metric methods.
Mathematics of Computation, 24(109):1–22, 1970.

[Gro18] Alexey Gronskiy. Statistical Mechanics and Information Theory in
Approximate Robust Inference. PhD thesis, ETH Zurich, Zurich, 2018.

[GSBR20] Robert M Gower, Mark Schmidt, Francis Bach, and Peter
Richtárik. Variance-reduced methods for machine learning.
Proceedings of the IEEE, 108(11):1968–1983, 2020.

[Haz08] Elad Hazan. Sparse approximate solutions to semidefinite
programs. In Eduardo Sany Laber, Claudson Bornstein, Loana Tito
Nogueira, and Luerbio Faria, editors, LATIN 2008: Theoretical
Informatics, pages 306–316, Berlin, Heidelberg, 2008. Springer
Berlin Heidelberg.

[HL20] Haihao Lu and Robert M. Freund. Generalized stochastic Frank-Wolfe
algorithm with stochastic substitute gradient for structured convex
optimization. Mathematical Programming, pages 317–349, 2020.

[HS66] Philip Hartman and Guido Stampacchia. On some non-linear
elliptic differential-functional equations. Acta Mathematica,
115:271–310, 1966.

[HSS12] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural
networks for machine learning, lecture 6a: Overview of mini-batch
gradient descent. Coursera lecture slides, 2012.

[Jag13] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse
convex optimization. In ICML - International Conference on Machine
Learning, pages 427–435, 2013.

[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient
descent using predictive variance reduction. Advances in neural
information processing systems, 26:315–323, 2013.

[Kar84] Narendra Karmarkar. A new polynomial-time algorithm for
linear programming. Combinatorica, 4(4):373–395, 1984.

[KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KB15] Diederik P. Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. In ICLR, 2015.

[Kha80] Leonid G. Khachiyan. Polynomial algorithms in linear
programming. U.S.S.R. Comput. Math. and Math. Phys., 20:53–72, 1980.

[KM72] Victor Klee and George J. Minty. How good is the simplex
algorithm? In Oliver Shisha, editor, Inequalities, III, pages 159–
175, New York, 1972. Academic Press.

[KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear Con-
vergence of Gradient and Proximal-Gradient Methods Under
the Polyak-Łojasiewicz Condition. In ECML PKDD 2016: Ma-
chine Learning and Knowledge Discovery in Databases, pages 795–
811. Springer, 2016.

[Kor76] Galina M Korpelevich. The extragradient method for finding
saddle points and other problems. Matecon, 12:747–756, 1976.

[KPd18] Thomas Kerdreux, Fabian Pedregosa, and Alexandre
d’Aspremont. Frank-Wolfe with subsampling oracle, 2018.

[KSJ18] Sai Praneeth Karimireddy, Sebastian U Stich, and Martin
Jaggi. Global linear convergence of Newton’s method without
strong-convexity or Lipschitz gradients. arXiv, 2018.

[LBZR21] Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter
Richtárik. Page: A simple and optimal probabilistic gradient
estimator for nonconvex optimization. In International Conference
on Machine Learning, pages 6286–6295. PMLR, 2021.

[Lev17] Kfir Levy. Online to offline conversions, universality and
adaptive minibatch sizes. NeurIPS, 30, 2017.

[LJH+20] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xi-
aodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of
the adaptive learning rate and beyond. In ICLR, 2020.

[LKTJ17] Francesco Locatello, Rajiv Khanna, Michael Tschannen, and
Martin Jaggi. A Unified Optimization View on Generalized
Matching Pursuit and Frank-Wolfe. In AISTATS - Proceedings
of the 20th International Conference on Artificial Intelligence and
Statistics, volume 54 of PMLR, pages 860–868, 2017.

[LO19] Xiaoyu Li and Francesco Orabona. On the convergence of
stochastic gradient descent with adaptive stepsizes. In AISTATS,
pages 983–992. PMLR, 2019.

[LW19] Ching-Pei Lee and Stephen Wright. First-order algorithms
converge faster than O(1/k) on convex problems. In ICML - Proceedings
of the 36th International Conference on Machine Learning,
volume 97 of PMLR, pages 3754–3762, Long Beach, California,
USA, 2019.

[MG07] Jiří Matoušek and Bernd Gärtner. Understanding and Using
Linear Programming. Universitext. Springer-Verlag, 2007.

[MH17] Mahesh Chandra Mukkamala and Matthias Hein. Variants
of RMSProp and Adagrad with logarithmic regret bounds. In
ICML, pages 2545–2553. PMLR, 2017.

[Min62] George J Minty. Monotone (nonlinear) operators in Hilbert
space. Duke Mathematical Journal, 29(3):341–346, 1962.

[MOP20] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil.
A unified analysis of extra-gradient and optimistic gradient
methods for saddle point problems: Proximal point approach.
In International Conference on Artificial Intelligence and Statistics,
pages 1497–1507. PMLR, 2020.

[Nem04] Arkadi Nemirovski. Prox-method with rate of convergence
O(1/t) for variational inequalities with Lipschitz continuous
monotone operators and smooth convex-concave saddle point
problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

[Nes83] Yurii Nesterov. A method of solving a convex programming
problem with convergence rate O(1/k^2). Soviet Math. Dokl.,
27(2), 1983.

[Nes12] Yurii Nesterov. Efficiency of coordinate descent methods on
huge-scale optimization problems. SIAM Journal on Optimization,
22(2):341–362, 2012.

[Nes18] Yurii Nesterov. Lectures on Convex Optimization, volume 137
of Springer Optimization and Its Applications. Springer, second
edition, 2018.

[NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and
Alexander Shapiro. Robust stochastic approximation approach
to stochastic programming. SIAM Journal on Optimization,
19(4):1574–1609, 2009.

[NLST17] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč.
Sarah: A novel method for machine learning problems using
stochastic recursive gradient. In Proceedings of the 34th Inter-
national Conference on Machine Learning-Volume 70, pages 2613–
2621. JMLR. org, 2017.

[NN94] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial
Algorithms in Convex Programming. Society for Industrial and
Applied Mathematics, 1994.

[Noc80] J. Nocedal. Updating quasi-Newton matrices with limited
storage. Mathematics of Computation, 35(151):773–782, 1980.

[NP06] Yurii Nesterov and B.T. Polyak. Cubic regularization of Newton
method and its global performance. Mathematical Programming,
108(1):177–205, 2006.

[NSL+15] Julie Nutini, Mark W Schmidt, Issam H Laradji, Michael P
Friedlander, and Hoyt A Koepke. Coordinate Descent Converges
Faster with the Gauss-Southwell Rule Than Random Selection. In
ICML - Proceedings of the 32nd International Conference on Machine
Learning, pages 1632–1641, 2015.

[NY83] Arkady S. Nemirovsky and D. B. Yudin. Problem complexity
and method efficiency in optimization. Wiley, 1983.

[OX21] Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds
of first-order methods for convex-concave bilinear saddle-point
problems. Mathematical Programming, 185(1):1–35, 2021.

[Pad95] Manfred Padberg. Linear Optimization and Extensions, volume 12
of Algorithms and Combinatorics. Springer-Verlag, Berlin
Heidelberg, 1995.

[RKK18] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the
convergence of Adam and beyond. In ICLR, 2018.

[RKK19] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the
convergence of Adam and beyond. arXiv preprint arXiv:1904.09237,
2019.

[Roc97] R. Tyrrell Rockafellar. Convex Analysis. Princeton Landmarks
in Mathematics. Princeton University Press, 1997.

[San12] Francisco Santos. A counterexample to the Hirsch conjecture.
Annals of Mathematics, 176:383–412, 2012.

[SLRB17] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing
finite sums with the stochastic average gradient. Mathematical
Programming, 162(1):83–112, 2017.

[ST04] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis
of algorithms: Why the simplex algorithm usually takes poly-
nomial time. J. ACM, 51(3):385–463, 2004.

[ST09] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis:
An attempt to explain the behavior of algorithms in practice.
Commun. ACM, 52(10):76–84, 2009.

[Tib96] Robert Tibshirani. Regression shrinkage and selection via the
LASSO. J. R. Statist. Soc. B, 58(1):267–288, 1996.

[Tse95] Paul Tseng. On linear convergence of iterative methods for the
variational inequality problem. Journal of Computational and
Applied Mathematics, 60(1-2):237–252, 1995.

[Tse01] P. Tseng. Convergence of a block coordinate descent method
for nondifferentiable minimization. Journal of Optimization Theory
and Applications, 109(3):475–494, 2001.

[VBS19] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and
faster convergence of sgd for over-parameterized models and
an accelerated perceptron. In The 22nd international conference
on artificial intelligence and statistics, pages 1195–1204. PMLR,
2019.

[VC71] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform
convergence of relative frequencies of events to their probabilities.
Theory of Probability & Its Applications, 16(2):264–280, 1971.

[Vis15] Nisheeth Vishnoi. A mini-course on convex optimization
(with a view toward designing fast algorithms), 2015.
https://theory.epfl.ch/vishnoi/Nisheeth-VishnoiFall2014-ConvexOptimization.pdf.

[WJZ+19] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh.
Spiderboost and momentum: Faster variance reduction algo-
rithms. In Advances in Neural Information Processing Systems,
pages 2406–2416, 2019.

[WLC+20] Guanghui Wang, Shiyin Lu, Quan Cheng, Wei-wei Tu, and Li-
jun Zhang. Sadam: A variant of adam for strongly convex
functions. In ICLR, 2020.

[WRS+17] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro,
and Benjamin Recht. The marginal value of adaptive gradient
methods in machine learning. Advances in neural information
processing systems, 30, 2017.

[WWB19] Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad step-
sizes: Sharp convergence over nonconvex landscapes. In
ICML, pages 6677–6686. PMLR, 2019.

[XZ14] Lin Xiao and Tong Zhang. A proximal stochastic gradient
method with progressive variance reduction. SIAM Journal on
Optimization, 24(4):2057–2075, 2014.

[Zei12] Matthew D Zeiler. Adadelta: an adaptive learning rate
method. arXiv preprint arXiv:1212.5701, 2012.

[Zha21] Liang Zhang. Variance reduction for non-convex stochastic
optimization: General analysis and new applications. Master’s
thesis, ETH Zurich, 2021.

[ZHZ19] Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. On lower
iteration complexity bounds for the saddle point problems.
arXiv preprint arXiv:1912.07481, 2019.

[Zim16] Judith Zimmermann. Information Processing for Effective and
Stable Admission. PhD thesis, ETH Zurich, 2016.

[ZRS+18] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale,
and Sanjiv Kumar. Adaptive methods for nonconvex opti-
mization. NeurIPS, 31, 2018.

