
B6.2 Optimisation for Data Science

Lecture Notes, Lectures 1–8
Oxford Mathematical Institute, HT 2022

Prof. Raphael Hauser

March 10, 2022


Table of Contents

1 Scope of this Course
  1.1 Optimisation Models
  1.2 Sparse Objective
  1.3 First Order Methods
  1.4 Data Analysis Problems

2 Motivating Examples

3 Optimisation Terminology and Prerequisites
  3.1 Terminology
  3.2 Prerequisites
  3.3 Characterisation of Convergence Speed

4 Method of Steepest Descent
  4.1 Specifications
  4.2 Convergence Theory
  4.3 Long-Step Descent Methods

5 The Proximal Method
  5.1 Proximal Operators
  5.2 The Prox-Gradient Method
  5.3 Convergence Theory

6 Acceleration of Gradient Methods
  6.1 Summary of Complexity Results Seen so Far
  6.2 The Heavy Ball Method
  6.3 Nesterov Acceleration
  6.4 Nesterov Acceleration of L-Smooth Convex Functions


1 Scope of this Course

1.1 Optimisation Models


Models considered in this course can take any of the following forms.

Unconstrained model
$$\min_{x \in \mathbb{R}^n} f(x)$$
where $f$ is smooth, which in the context of this course means $C^1$ with Lipschitz continuous gradient.

Regularised model
$$\min_{x \in \mathbb{R}^n} f(x) + \lambda \Psi(x)$$
where $\Psi : \mathbb{R}^n \to \mathbb{R}$ is convex, possibly nonsmooth, and controls the complexity and structure of the optimal solution $x^*$ (a point where the objective function $f(x) + \lambda \Psi(x)$ achieves a minimum); $\lambda$ is called the regularisation parameter.

Constrained model
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } x \in \mathcal{F},$$
e.g., $\mathcal{F} = \{x \in \mathbb{R}^n : g(x) \leq 0,\ x \geq 0\}$ for some $g : \mathbb{R}^n \to \mathbb{R}^q$ (all vector inequalities are to be interpreted componentwise, i.e. $g(x) \leq 0$ means $g_s(x) \leq 0$ for $s = 1, \dots, q$).

1.2 Sparse Objective


In data analysis the objective function $f(x)$ often takes the form of a sum
$$f(x) = \sum_{j=1}^{m} f_j(x),$$
where each $f_j$ is sparse (depends only on a few of the coordinates of $x$), and since $n, m \gg 1$, often only the gradients $\nabla f(x)$ or $\nabla f_j(x)$ are available, but no higher derivatives.

1.3 First Order Methods


First order methods are iterative algorithms designed to produce a sequence $(x^k)_{k \in \mathbb{N}}$ of solutions (points $x \in \mathbb{R}^n$ that satisfy the constraints $x \in \mathcal{F}$) converging to an optimal solution $x^*$ via updates
$$x^{k+1} = x^k + \alpha_k d^k$$
with a search direction $d^k$ computed from $\nabla f(x)$ or the $\nabla f_j(x)$, and a step length $\alpha_k > 0$ that is either fixed or computed iteratively.
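A minimal sketch of this template in Python/NumPy; the direction and step rules are placeholders to be specified by the methods developed in later sections:

```python
import numpy as np

def first_order_method(grad_f, x0, direction, step, n_iters=100):
    """Generic first-order update x^{k+1} = x^k + alpha_k d^k.

    direction(g): maps the current gradient to a search direction d^k.
    step(k):      returns the step length alpha_k > 0.
    """
    x = x0.copy()
    for k in range(n_iters):
        g = grad_f(x)
        d = direction(g)      # e.g. d = -g for steepest descent
        alpha = step(k)       # e.g. a constant 1/L
        x = x + alpha * d
    return x

# Example: steepest descent on f(x) = 0.5 ||x||^2, so grad f(x) = x and L = 1
x_star = first_order_method(grad_f=lambda x: x,
                            x0=np.ones(5),
                            direction=lambda g: -g,
                            step=lambda k: 1.0)
```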


1.4 Data Analysis Problems

Data analysis problems typically combine the following elements:

Data set $\mathcal{D} := \{(a_j, y_j) : j = 1, \dots, m\}$, where $a_j \in \mathcal{V}$ lies in a vector space of features, and $y_j \in \mathcal{W}$ in a space of observations.

Parametric model $\Phi(\cdot\,; x) : \mathcal{V} \to \mathcal{W}$ of a feature-observation relation, parameterised by a vector $x \in \mathbb{R}^n$.

Data fitting problem: find $x \in \mathbb{R}^n$ such that $\Phi(a_j; x) \approx y_j$ for all $j$ by solving $\min_{x \in \mathbb{R}^n} f(x)$, with objective
$$f(x) = L_{\mathcal{D}}(x) := \sum_{j=1}^{m} \ell(a_j, y_j; x).$$
$L_{\mathcal{D}}(x)$ is the loss on the data set $\mathcal{D}$ associated with parameter values $x$. A typical example is $\ell(a_j, y_j; x) = \|\Phi(a_j; x) - y_j\|^2$.

2 Motivating Examples
Example 1. Examples of observation sets:

i) $\mathcal{W} = \mathbb{R}$: regression problems.

ii) $\mathcal{W} = \{1, \dots, M\}$: classification problems.

iii) $\mathcal{W} = \emptyset$: data clustering, structured dimensionality reduction.

Example 2 (Regression).

i) Regression without intercept: Choose $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \mathbb{R}$, $\Phi(a; x) = a^T x$. Data fitting model
$$\min_{x \in \mathbb{R}^n} \frac{1}{2m} \sum_{j=1}^{m} (a_j^T x - y_j)^2 = \frac{1}{2m} \|Ax - y\|_2^2,$$
where $A = (a_1 \ \dots \ a_m)^T$.

ii) Regression with intercept: Same formulation, but replace $x$ and $a_j$ by
$$\tilde{x} = \begin{pmatrix} x \\ \beta \end{pmatrix}, \qquad \tilde{a}_j = \begin{pmatrix} a_j \\ 1 \end{pmatrix},$$
so that $\Phi(\tilde{a}; \tilde{x}) = \tilde{a}^T \tilde{x} = a^T x + \beta$.


iii) Tychonov-regularised regression: this method uses the data fitting model
$$\min_{x \in \mathbb{R}^n} \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_2^2,$$
which leads to robust regression, less sensitive to noise in the data $(a_j, y_j)$.

iv) Lasso regression: the data fitting model
$$\min_{x \in \mathbb{R}^n} \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_1$$
tends to yield a sparse optimal solution $x^*$ (feature selection) if one exists.

v) Dictionary learning: Extending linear regression to learning the regressors. In dictionary learning we seek a dictionary $A \in \mathbb{R}^{m \times n}$ and sparse regressors $X \in \mathbb{R}^{n \times p}$ so that the data $Y \in \mathbb{R}^{m \times p}$ can be approximately expressed as $AX \approx Y$. We use $\mathcal{V} = (\mathbb{R}^{m \times n}, \mathbb{R}^{n \times p})$ as data space and $\mathcal{W} = \mathbb{R}^{1 \times p}$ as observation space, with data $\mathcal{D} = \{y_\ell\}_{\ell=1}^{m}$, where $y_\ell \in \mathbb{R}^{1 \times p}$ and $p \gg n$. The data fitting model is designed as
$$\min_{A \in \mathbb{R}^{m \times n},\, X \in \mathcal{F}_k(n, p)} \frac{1}{2mp} \|AX - Y\|_F^2,$$
where
$$\mathcal{F}_k(n, p) = \left\{ X \in \mathbb{R}^{n \times p} : |\{i : x_{i,j} \neq 0\}| \leq k \ \forall j = 1, \dots, p \right\},$$
indicating that only $k$ of the learned regressors from $A$ are used to approximate each column of data in $Y$.
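Returning to models i), iii) and iv) above, the following sketch sets up the corresponding objectives and the gradient of the smooth part in NumPy; the data `A`, `y` are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 10
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
lam = 0.1

def f_ls(x):                      # least-squares loss (1/2m)||Ax - y||^2
    r = A @ x - y
    return r @ r / (2 * m)

def grad_f_ls(x):                 # its gradient A^T(Ax - y)/m
    return A.T @ (A @ x - y) / m

def f_tychonov(x):                # Tychonov-regularised objective
    return f_ls(x) + lam * (x @ x)

def f_lasso(x):                   # lasso objective (nonsmooth l1 term)
    return f_ls(x) + lam * np.abs(x).sum()
```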

Example 3. Matrix completion: Choose $\mathcal{V} = \mathbb{R}^{n \times p}$, $\mathcal{W} = \mathbb{R}$, $\mathcal{D} = \{(A_j, y_j) : j = 1, \dots, m\}$. Define the trace inner product on $\mathcal{V}$,
$$\langle \cdot, \cdot \rangle : \mathcal{V} \times \mathcal{V} \to \mathbb{R}, \qquad (X, Y) \mapsto \operatorname{tr}(X^T Y).$$
The data fitting problem is given by
$$\min_{X \in \mathcal{V}} \frac{1}{2m} \sum_{j=1}^{m} \left( \langle A_j, X \rangle - y_j \right)^2 + \lambda \|X\|_*,$$
where $\langle A_j, X \rangle$ models the sensing of $X$, $y_j$ the target value, $\lambda > 0$, and the nuclear norm $\|X\|_* = \sum_{i=1}^{p} \sigma_i$ is the sum of the singular values of $X$. Regularisation by the nuclear norm tends to yield a low-rank optimal solution if one exists. Typical choice $A_j = [a_{s,t}]$ with
$$a_{s,t} = \begin{cases} 1 & \text{if } s = s_j,\ t = t_j, \\ 0 & \text{otherwise} \end{cases}$$


for fixed $s_j, t_j$, so that $\langle A_j, X \rangle = x_{s_j, t_j}$ and the $y_j$ correspond to a few known entries of $X$. This problem may be seen as a sophisticated form of regression.
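For this typical choice of $A_j$, sensing reduces to reading off single entries of $X$; a minimal sketch (the observed index pairs `obs` are hypothetical placeholders):

```python
import numpy as np

n, p = 6, 4
X = np.arange(n * p, dtype=float).reshape(n, p)   # stand-in for the unknown matrix
obs = [(0, 1), (3, 2), (5, 0)]                    # observed index pairs (s_j, t_j)

for s, t in obs:
    A_j = np.zeros((n, p))
    A_j[s, t] = 1.0
    # trace inner product <A_j, X> = tr(A_j^T X) picks out the entry x_{s,t}
    assert np.trace(A_j.T @ X) == X[s, t]
```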
Example 4. Sparse inverse covariance estimation: $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \emptyset$, the $a_j$ are i.i.d. samples of a random vector $A : \Omega \to \mathbb{R}^n$ with $\mathbb{E}[A] = 0$ and $\operatorname{Cov}(A) = \mathbb{E}[A A^T] = \Sigma < \infty$ (all covariances are finite). The sample estimate of $\Sigma$ is $S = \frac{1}{m} \sum_{j=1}^{m} a_j a_j^T$. In many applications, components of $A$ are conditionally independent,
$$\mathcal{D}(A_j \mid A_i, \{A_k : k \neq i, j\}) = \mathcal{D}(A_j \mid \{A_k : k \neq i, j\}).$$
For example, when
$$A_i \xrightarrow{\text{causality}} A_k \xrightarrow{\text{causality}} A_j,$$
then $\operatorname{Cov}(A_i, A_j) \neq 0$ but $\mathcal{D}(A_j \mid A_i, A_k) = \mathcal{D}(A_j \mid A_k)$. Fact: If $A_i, A_j$ are conditionally independent, then $(\Sigma^{-1})_{i,j} = 0$ (component $(i,j)$ of $\Sigma^{-1}$ is zero). Therefore, we correct the sample estimate $S$ so that its inverse becomes sparse by solving the data fitting problem
$$\min_{X \in S\mathbb{R}^{n \times n},\, X \succeq 0} \langle S, X \rangle - \log \det X + \lambda \|X\|_1,$$

where $S\mathbb{R}^{n \times n}$ is the set of $n \times n$ real symmetric matrices, and $X \succeq 0$ means $X$ is positive semidefinite. In this example we extract structure from the data without any measurement, that is, $\mathcal{W} = \emptyset$ and
$$\begin{aligned}
L_{\mathcal{D}}(X) &= \langle S, X \rangle - \log \det X + \lambda \|X\|_1 \\
&= \frac{1}{m} \sum_{j=1}^{m} \langle a_j a_j^T, X \rangle - \log \det X + \lambda \|X\|_1 \\
&= \frac{1}{m} \sum_{j=1}^{m} a_j^T X a_j - \log \det X + \lambda \|X\|_1 \\
&= -\frac{2}{m} \log \left( \prod_{j=1}^{m} (2\pi)^{-n/2} \det(X^{-1})^{-1/2} \exp\left( -\frac{1}{2} a_j^T X a_j \right) \right) - n \log(2\pi) + \lambda \|X\|_1,
\end{aligned}$$
so that the first two terms of the objective represent a shift plus a negative multiple of the log-likelihood function for $X = \operatorname{Cov}(A)^{-1}$ under the parametric model that the data $a_j$ are i.i.d. samples of mean-centred multivariate Gaussian vectors. Note also that the barrier term $-\log \det X$ is used to keep $X$ away from the boundary and hence invertible. The Lasso term $\lambda \|X\|_1$ incentivises a sparse solution that allows one to identify the conditional dependencies between the variables if the reconstruction is successful.

Example 5 (Principal Component Analysis).

i) Robust PCA: We want to render classical PCA (see the prelims course on data analysis and statistics) robust to data outliers. Data space $\mathcal{V} = (\mathbb{R}^{m \times n}, \mathbb{R}^{m \times n})$, observation space $\mathcal{W} = \mathbb{R}$, data $\mathcal{D} = Y \in \mathbb{R}^{m \times n}$, the entire data matrix. The data fitting model is chosen as
$$\min_{L, S \in \mathbb{R}^{m \times n}} \|Y - (L + S)\|_F^2 + \lambda \|L\|_* + \gamma \|S\|_1,$$
where the nuclear norm $\|L\|_* = \sum_k \sigma_k$ encourages a low-rank component $L$ and $\|S\|_1 = \sum_{i,j} |S_{i,j}|$ encourages a sparse component $S$. The weights $\lambda$ and $\gamma$ trade off the rank of $L$, the sparsity of $S$, and how well their sum matches the data $Y$. The sparse component $S$ can be viewed as outlier errors in $Y$.
ii) Sparse principal component analysis: The first component of the PCA of a matrix $S \succeq 0$ is computed as follows,
$$\max_{v \in \mathbb{R}^n} v^T S v, \quad \text{subject to } \|v\|_2 = 1.$$
A sparse 1st principal component is defined in principle by
$$\max_{v} v^T S v, \quad \text{subject to } \|v\|_2 = 1,\ \|v\|_0 \leq k,$$
where $\|v\|_0 := |\{i : v_i \neq 0\}|$ is the number of nonzero components of $v$, with $k < n$. Since the latter problem is NP-hard, we replace it by its convex relaxation,
$$\max_{M \in S\mathbb{R}^n} \langle S, M \rangle, \quad \text{subject to } M \succeq 0,\ \langle I, M \rangle = 1,\ \|M\|_1 \leq \rho,$$
where $\langle S, M \rangle$ replaces $v^T S v = \operatorname{tr}(S v v^T)$ and $\langle I, M \rangle = 1$ replaces $1 = v^T v = \langle I, v v^T \rangle$.

Example 6 (Data Separation).

i) Support vector machines: $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \{+1, -1\}$, the problem of classifying data into two classes. The ansatz is to seek a separating hyperplane $\{a \in \mathbb{R}^n : a^T x - \beta = 0\}$ such that
$$y_j \times (a_j^T x - \beta) \geq 1. \tag{1}$$
We would like to use $\Phi(a; x, \beta) := \operatorname{sign}(a^T x - \beta)$ and recall that we want $\Phi(a_j; x, \beta) = y_j$ for as many data points $j$ as possible. To avoid having to solve an NP-hard problem and to find a robust solution, we maximise the separation (margin) between the two sets, which is obtained by solving
$$\min_{(x, \beta)} \|x\|_2^2, \quad \text{subject to (1)}. \tag{2}$$


But (2) may not have any solution $(x, \beta)$ at all that satisfies (1). Therefore, we solve the following data fitting model instead,
$$\min_{(x, \beta)} \frac{1}{m} \sum_{j=1}^{m} \max\left( 1 - y_j \times (a_j^T x - \beta),\, 0 \right) + \frac{\lambda}{2} \|x\|_2^2$$
$$\Leftrightarrow \quad \min_{x \in \mathbb{R}^n,\, \beta \in \mathbb{R},\, s \in \mathbb{R}^m} \frac{1}{m} \sum_{j=1}^{m} s_j + \frac{\lambda}{2} \|x\|_2^2$$
$$\text{subject to } s_j \geq 1 - y_j (a_j^T x - \beta),\ s_j \geq 0, \quad (j = 1, \dots, m).$$

ii) Nonlinear separation: the idea is to use a nonlinear transformation $\zeta : \mathbb{R}^n \to \tilde{\mathcal{V}}$, where $(\tilde{\mathcal{V}}, \langle \cdot, \cdot \rangle)$ is a Hilbert space. We replace (1) by
$$y_j \times (\langle \zeta(a_j), x \rangle - \beta) \geq 1 \tag{3}$$
and solve the data fitting model
$$\min_{x \in \tilde{\mathcal{V}},\, \beta \in \mathbb{R}} \frac{1}{m} \sum_{j=1}^{m} \max\left( 1 - y_j (\langle \zeta(a_j), x \rangle - \beta),\, 0 \right) + \frac{\lambda}{2} \langle x, x \rangle$$
$$\Leftrightarrow \quad \min_{x \in \tilde{\mathcal{V}},\, \beta \in \mathbb{R},\, s \in \mathbb{R}^m} \frac{1}{m} \sum_{j=1}^{m} s_j + \frac{\lambda}{2} \|x\|_{\tilde{\mathcal{V}}}^2$$
$$\text{subject to } s_j \geq 1 - y_j (\langle \zeta(a_j), x \rangle - \beta),\ s_j \geq 0, \quad (j = 1, \dots, m)$$
$$\overset{\text{(convex duality)}}{\Longleftrightarrow} \quad \min_{\alpha \in \mathbb{R}^m} \frac{1}{2} \alpha^T Q \alpha - \sum_{j=1}^{m} \alpha_j$$
$$\text{subject to } 0 \leq \alpha_j \leq \frac{1}{\lambda}, \quad (j = 1, \dots, m), \qquad y^T \alpha = 0,$$
where $Q = (q_{k,\ell}) \succeq 0$ in $S\mathbb{R}^{m \times m}$ is defined by $q_{k,\ell} = y_k y_\ell \langle \zeta(a_k), \zeta(a_\ell) \rangle$.

Note: to solve this convex quadratic programming problem (QP), there is no need to know $\zeta$ explicitly; only the ability to compute $K(a_k, a_\ell) := \langle \zeta(a_k), \zeta(a_\ell) \rangle$ is required, where $K : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is a kernel function, i.e., such that $(K(a_j, a_\ell))_{j=1:m,\, \ell=1:m} \succeq 0$ for all choices of vectors $\{a_1, \dots, a_m\} \subset \mathcal{V} = \mathbb{R}^n$. For example, use a Gaussian kernel $K(a_j, a_\ell) = \exp\left( -\|a_j - a_\ell\|_2^2 / 2\sigma^2 \right)$.
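A short sketch of how the dual QP data can be assembled with the Gaussian kernel, without ever forming $\zeta$ explicitly; the data `a`, `y` are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma = 20, 3, 1.0
a = rng.standard_normal((m, n))          # feature vectors a_1, ..., a_m
y = rng.choice([-1.0, 1.0], size=m)      # class labels in {+1, -1}

# Gaussian kernel matrix K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2 sigma^2))
sq_dists = ((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma**2))

# Q_{k,l} = y_k y_l K(a_k, a_l), the matrix in the dual objective
Q = (y[:, None] * y[None, :]) * K
# The dual QP  min 0.5 a^T Q a - sum(a)  s.t.  0 <= a_j <= 1/lambda, y^T a = 0
# can now be passed to any off-the-shelf QP solver.
```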

Figure 1: The support vector machine model maximises the margin between two point clouds.

Example 7 (Multiclass Classification).

i) Logistic regression: $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \{e_1, \dots, e_M\} \subset \mathbb{R}^M$, where $y_j = e_i$ (the $i$-th canonical basis vector in $\mathbb{R}^M$) indicates that $a_j$ is in class $C_i$ for classes $C_1, \dots, C_M$. The multiclass classification problem is to output a probability map
$$\Phi(a; X) = \begin{pmatrix} p_1(a; X) \\ \vdots \\ p_M(a; X) \end{pmatrix}, \qquad p_i(a; X) = \mathbb{P}(a \in C_i).$$
We wish to calibrate the parametric models
$$p_i(a; X) = \frac{\exp(a^T x_{[i]})}{\sum_{k=1}^{M} \exp(a^T x_{[k]})}, \quad (i = 1, \dots, M)$$
such that
$$p_i(a_j; X) \approx \begin{cases} 1 & \text{if } y_j = e_i, \\ 0 & \text{if } y_j \neq e_i, \end{cases}$$
where $x_{[i]} \in \mathbb{R}^n$ $(i = 1, \dots, M)$ and $X = \{x_{[i]} : i = 1, \dots, M\}$. Idea: maximise the log-likelihood function, which leads to the data fitting model
$$\max_{X} \frac{1}{m} \sum_{j=1}^{m} \left( y_j^T \begin{pmatrix} x_{[1]}^T \\ \vdots \\ x_{[M]}^T \end{pmatrix} a_j - \log\left( \sum_{k=1}^{M} \exp(x_{[k]}^T a_j) \right) \right).$$
The objective is concave (its negative is convex) and in summation form. Note that if $y_j = e_i$ then
$$y_j^T \begin{pmatrix} x_{[1]}^T \\ \vdots \\ x_{[M]}^T \end{pmatrix} = x_{[i]}^T.$$
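A minimal sketch of the calibrated probability map as a numerically stable softmax; the weight matrix `X`, stacking the rows $x_{[i]}^T$, is a random placeholder:

```python
import numpy as np

def prob_map(a, X):
    """Softmax probabilities p_i(a; X) = exp(a^T x_[i]) / sum_k exp(a^T x_[k])."""
    z = X @ a                       # z_i = x_[i]^T a
    z = z - z.max()                 # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
M, n = 4, 6
X = rng.standard_normal((M, n))     # row i is x_[i]^T
a = rng.standard_normal(n)
p = prob_map(a, X)                  # p sums to 1, p_i estimates P(a in C_i)
```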


Figure 2: Nonlinear separation of point clouds.

ii) Nonlinear multiclass logistic regression via deep learning: Combines logistic regression with a nonlinear transformation of the inputs in $D \gg 1$ hidden layers $1, \dots, D$ of a neural network,
$$\begin{aligned}
a^0 &= a, \\
a^\ell &= \sigma(W^\ell a^{\ell-1} + g^\ell), \quad (\ell = 1, \dots, D - 1), \\
a^D &= W^D a^{D-1} + g^D,
\end{aligned}$$
where $W^\ell \in \mathbb{R}^{|a^\ell| \times |a^{\ell-1}|}$, $g^\ell \in \mathbb{R}^{|a^\ell|}$, and $\sigma$ is a nonlinear activation function, such as

• $\sigma(t) = \frac{1}{1 + \exp(-t)}$ (logistic function),
• $\sigma(t) = \max(t, 0)$ (hinge loss, also called ReLU),
• $\sigma(t) = \tanh(t)$ (smooth version of ReLU),
• $\sigma(t) = 1$ with probability $1/(1 + \exp(-t))$ and $0$ otherwise.

Let $w = \left( (W^1, g^1), \dots, (W^D, g^D) \right)$ be the parameters of the NN, and set $a^D = a^D(w)$ as a function of the network parameters. The data fitting model becomes
$$\max_{X, w} \frac{1}{m} \sum_{j=1}^{m} \left( y_j^T \begin{pmatrix} x_{[1]}^T \\ \vdots \\ x_{[M]}^T \end{pmatrix} a_j^D(w) - \log\left( \sum_{k=1}^{M} \exp(x_{[k]}^T a_j^D(w)) \right) \right).$$
The objective is still in summation form but now non-convex.


Figure 3: In this region we see one strict global minimiser and many local minimisers. Source: http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf

3 Optimisation Terminology and Prerequisites

3.1 Terminology
Consider the optimisation model
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } x \in \mathcal{F}. \tag{4}$$

Definition 1. A solution $x^* \in \mathcal{F}$ is

i) a local minimiser if $f(x^*) \leq f(x)$ $\forall x \in \mathcal{F} \cap B_\varepsilon(x^*)$ for some $\varepsilon > 0$,

ii) a strict local minimiser if $f(x^*) < f(x)$ $\forall x \in \mathcal{F} \cap B_\varepsilon(x^*) \setminus \{x^*\}$ for some $\varepsilon > 0$,

iii) a global minimiser if $f(x^*) \leq f(x)$ $\forall x \in \mathcal{F}$.

Definition 2.

i) $f : \mathbb{R}^n \to \mathbb{R}$ is convex if
$$f((1 - \alpha) x + \alpha y) \leq (1 - \alpha) f(x) + \alpha f(y), \quad \forall \alpha \in [0, 1],\ x, y \in \mathbb{R}^n. \tag{5}$$

ii) $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper convex if (5) holds and $f(x) < +\infty$ for at least one point $x$.

iii) For $\Omega \subset \mathbb{R}^n$, define the indicator function as follows,
$$I_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega, \\ +\infty & \text{otherwise.} \end{cases}$$
Then $\Omega \neq \emptyset$ is a convex set if and only if $I_\Omega$ is a proper convex function.


Figure 4: A strongly convex function.

iv) We say that (4) is a convex optimisation problem if $f$ and $\mathcal{F} \neq \emptyset$ are both convex. Then (4) is equivalent to the unconstrained problem
$$\min_{x \in \mathbb{R}^n} f(x) + I_{\mathcal{F}}(x)$$
with proper convex objective.

v) $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex with modulus of convexity $\gamma > 0$ if for all $\alpha \in [0, 1]$ and $x, y \in \mathbb{R}^n$,
$$f((1 - \alpha) x + \alpha y) \leq (1 - \alpha) f(x) + \alpha f(y) - \frac{1}{2} \gamma \alpha (1 - \alpha) \|x - y\|^2.$$

vi) $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if it is differentiable everywhere with $L$-Lipschitz continuous gradient, $L > 0$, that is,
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \forall x, y \in \mathbb{R}^n.$$

vii) $v \in \mathbb{R}^n$ is a subgradient of $f$ at $x$ if
$$f(y) \geq f(x) + v^T (y - x), \quad \forall y \in \mathbb{R}^n.$$

viii) The subdifferential $\partial f(x)$ of $f$ at $x$ is the set of subgradients of $f$ at $x$.


Figure 5: At points where f is not differentiable the subgradient is non-unique.

3.2 Prerequisites

Lemma 1.

i) If $a \in \partial f(x)$ and $b \in \partial f(y)$, then $(a - b)^T (x - y) \geq 0$.

ii) For any $\lambda \geq 0$, we have $\partial(\lambda f)(x) = \lambda \partial f(x)$.

iii) $\partial(f + g)(x) = \partial f(x) + \partial g(x)$, that is, for every $z \in \partial(f + g)(x)$ there exist $u \in \partial f(x)$ and $v \in \partial g(x)$ such that $z = u + v$.

Proposition 1. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper convex.

i) Then $x^* \in \mathbb{R}^n$ is a local minimiser if and only if it is a global minimiser.

ii) The set of minimisers
$$\arg\min_{x \in \mathbb{R}^n} f(x) := \{x^* : f(x^*) = \min_{x \in \mathbb{R}^n} f(x)\}$$
is convex.

iii) $x^* \in \arg\min_{x \in \mathbb{R}^n} f(x)$ if and only if $0 \in \partial f(x^*)$.¹

Proposition 2. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper convex.

i) If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

ii) If $|\partial f(x)| = 1$, then $f$ is differentiable at $x$.

¹ To simplify notation, we write $0$ instead of $\vec{0}$ for vectors of zeroes of size $n$ or any other size.


iii) If $f$ is $\gamma$-strongly convex and differentiable at $x \in \mathbb{R}^n$, then
$$\frac{\gamma}{2} \|y - x\|^2 \leq f(y) - f(x) - \nabla f(x)^T (y - x) \quad \forall y \in \mathbb{R}^n. \tag{6}$$

iv) If $f$ is $L$-smooth, then
$$f(y) - f(x) - \nabla f(x)^T (y - x) \leq \frac{L}{2} \|y - x\|^2 \quad \forall y \in \mathbb{R}^n. \tag{7}$$

Proposition 3. If $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex, then there exists a unique global minimiser.

For proofs of slightly weakened versions of Lemma 1 and Propositions 1–3, see the problem sheets.

3.3 Characterisation of Convergence Speed


To obtain a notion of convergence for iterative algorithms, we could show that $\|\nabla f(x^k)\| \to 0$ as $k \to \infty$, or $\operatorname{dist}(0, \partial f(x^k)) \to 0$ (where $\operatorname{dist}$ denotes the Euclidean distance), or $\|x^k - x^*\| \to 0$, or $f(x^k) \to f(x^*)$. The rate of convergence can be characterised in several different ways:

Definition 3. Let $(\phi_k)_{k \in \mathbb{N}} \subset \mathbb{R}_+$ be a sequence such that $\lim_{k \to \infty} \phi_k = 0$. We say that $(\phi_k)_{k \in \mathbb{N}}$ converges

i) Q-linearly if $\exists \sigma \in (0, 1)$ such that $\frac{\phi_{k+1}}{\phi_k} \leq 1 - \sigma$, $\forall k \in \mathbb{N}$,

ii) R-linearly if $\exists \sigma \in (0, 1)$ and $C > 0$ such that $\phi_k \leq C (1 - \sigma)^k$, $\forall k \in \mathbb{N}$,

iii) Q-superlinearly if $\lim_{k \to \infty} \frac{\phi_{k+1}}{\phi_k} = 0$,

iv) Q-quadratically if $\exists C > 0$ such that $\frac{\phi_{k+1}}{\phi_k^2} \leq C$, $\forall k$,

v) R-superlinearly if $\exists (\nu_k)_{k \in \mathbb{N}} \subset \mathbb{R}_+$ such that $(\nu_k)_{k \in \mathbb{N}} \to 0$ Q-superlinearly and $\phi_k \leq \nu_k$ for all $k \in \mathbb{N}$.

Lemma 2.
Q-linear convergence ⇒ R-linear convergence ⇏ Q-linear convergence,
Q-quadratic convergence ⇒ Q-superlinear convergence ⇒ R-superlinear convergence,
R-superlinear convergence ⇏ Q-superlinear convergence ⇏ Q-quadratic convergence.

See the problem sheets for a proof and counterexamples.

Definition 4. If $\phi_k \to 0$ more slowly than R-linearly, we say that the convergence is sublinear.

Example 8. The following sequences all converge to zero sublinearly,
$$\left( \frac{c}{\sqrt{k}} \right)_{k \in \mathbb{N}}, \quad \left( \frac{c}{k} \right)_{k \in \mathbb{N}}, \quad \left( \frac{c}{\ln(k + 1)} \right)_{k \in \mathbb{N}}.$$


4 Method of Steepest Descent

4.1 Specifications

Gradient methods solve unconstrained optimisation models of the form
$$\min_{x \in \mathbb{R}^n} f(x), \tag{8}$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth, via iterative updates of the form
$$x^{k+1} = x^k + \alpha_k d^k. \tag{9}$$
The scalar $\alpha_k > 0$ is called the step length, and $d^k \in \mathbb{R}^n$ is the search direction. The Method of Steepest Descent is a gradient method based on making the following choices: $d^k = -\nabla f(x^k)$ for all $k$ (the direction of steepest descent), and $\alpha_k \equiv L^{-1}$ for all $k$ (constant step length). This leads to the steepest descent updates
$$x^{k+1} = x^k - \frac{1}{L} \nabla f(x^k). \tag{10}$$
Our main concern is whether the choice of $\alpha_k$ is neither too short (leading to excessively slow convergence) nor too long (leading to zig-zagging behaviour). The updates (10) are motivated by the minimisation of the first order Taylor approximation bound
$$f(x + \alpha d) \leq f(x) + \alpha \nabla f(x)^T d + \frac{L}{2} \alpha^2 \|d\|^2 \tag{11}$$
at $x = x^k$ over $\alpha$ for fixed $d = -\nabla f(x)$; (11) is an immediate consequence of (7). Writing $\varphi(\alpha)$ for the r.h.s. of (11) with $d = -\nabla f(x)$, we obtain the convex quadratic
$$\varphi(\alpha) = f(x) - \alpha \|\nabla f(x)\|^2 + \frac{L \alpha^2}{2} \|\nabla f(x)\|^2.$$
Minimising $\varphi(\alpha)$ by solving $\varphi'(\alpha) = 0$ yields $\alpha^* = 1/L$.
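A minimal sketch of the short-step updates (10) in Python, on a small quadratic test problem where $L$ is available as the largest eigenvalue of the Hessian:

```python
import numpy as np

def steepest_descent(grad_f, x0, L, n_iters=500):
    """Short-step steepest descent: x^{k+1} = x^k - (1/L) grad f(x^k)."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - grad_f(x) / L
    return x

# Test problem: f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite,
# so grad f(x) = A x - b and L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(A).max()
x = steepest_descent(lambda x: A @ x - b, np.zeros(2), L)
# x approximates the minimiser A^{-1} b
```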
The following will be one of our main tools of convergence analysis for the
method of steepest descent:

Lemma 3 (Foundational Inequality of Steepest Descent). Starting steepest descent iterations at some arbitrary point $x^0$, we have
$$f(x^{k+1}) = f\left( x^k - \frac{1}{L} \nabla f(x^k) \right) \leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2 \quad \forall k.$$

Proof. Substitute $\alpha = 1/L$ and $d = -\nabla f(x^k)$ into (11).


4.2 Convergence Theory

Overview of the convergence results below:

• $(\nabla f(x^k))_{k \in \mathbb{N}} \to 0$ sublinearly with rate $O(1/\sqrt{k})$ for general $f$.

• $(f(x^k) - f(x^*))_{k \in \mathbb{N}} \to 0$ sublinearly with rate $O(1/k)$ for convex $f$.

• $(f(x^k) - f(x^*))_{k \in \mathbb{N}} \to 0$ Q-linearly with rate $(1 - \gamma/L)$ when $f$ is $\gamma$-strongly convex.

Theorem 1. For $f(x)$ $L$-smooth and bounded below by $\underline{f}$, the method of steepest descent with constant step size $\alpha_k = 1/L$ satisfies
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{2L \times (f(x^0) - \underline{f})}{T}}, \quad \forall T \geq 1.$$

Proof. Lemma 3 implies
$$\sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2 \leq 2L \sum_{k=0}^{T-1} \left[ f(x^k) - f(x^{k+1}) \right] = 2L \left[ f(x^0) - f(x^T) \right].$$
Therefore,
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{1}{T} \sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2} \leq \sqrt{\frac{2L \left[ f(x^0) - f(x^T) \right]}{T}},$$
and the claim follows from $f(x^T) \geq \underline{f}$.

Theorem 2. If $f$ is $L$-smooth and convex with minimiser² $x^*$, then the method of steepest descent with constant step size $\alpha_k = 1/L$ satisfies
$$f(x^T) - f(x^*) \leq \frac{L}{2T} \|x^0 - x^*\|^2, \quad \forall T \geq 1.$$

Proof. By convexity of $f$, $f(x^*) \geq f(x^k) + \nabla f(x^k)^T (x^* - x^k)$. Substitution into Lemma 3 yields
$$\begin{aligned}
f(x^{k+1}) &\leq f(x^*) + \nabla f(x^k)^T (x^k - x^*) - \frac{1}{2L} \|\nabla f(x^k)\|^2 && \text{(use upper bound on } f(x^k)\text{)} \\
&= f(x^*) + \frac{L}{2} \left( \|x^k - x^*\|^2 - \left\| x^k - x^* - \frac{1}{L} \nabla f(x^k) \right\|^2 \right) && \text{(expand to check)} \\
&= f(x^*) + \frac{L}{2} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right) && \text{(use def. of } x^{k+1}\text{)}.
\end{aligned}$$

² Note that the minimiser $x^*$ is not necessarily unique.


Repeated use of this bound in a telescoping sum yields
$$\sum_{k=0}^{T-1} \left( f(x^{k+1}) - f(x^*) \right) \leq \frac{L}{2} \sum_{k=0}^{T-1} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right) \leq \frac{L}{2} \|x^0 - x^*\|^2,$$
and since $f(x^{k+1}) \geq f(x^T)$ for $k + 1 \leq T$ by virtue of Lemma 3, this implies
$$f(x^T) - f(x^*) \leq \frac{1}{T} \sum_{k=0}^{T-1} \left( f(x^{k+1}) - f(x^*) \right) \leq \frac{L}{2T} \|x^0 - x^*\|^2.$$

Theorem 3. If $f$ is $L$-smooth and $\gamma$-strongly convex then there exists a unique minimiser $x^*$, and
$$f(x^{k+1}) - f(x^*) \leq \left( 1 - \frac{\gamma}{L} \right) \left( f(x^k) - f(x^*) \right) \quad \forall k.$$

Proof. For the uniqueness of $x^*$, see Proposition 3. Proposition 2 implies that for all $y \in \mathbb{R}^n$,
$$f(y) \geq f(x^k) + \nabla f(x^k)^T (y - x^k) + \frac{\gamma}{2} \|y - x^k\|^2.$$
Therefore,
$$\min_y f(y) \geq \min_y \left( f(x^k) + \nabla f(x^k)^T (y - x^k) + \frac{\gamma}{2} \|y - x^k\|^2 \right). \tag{12}$$
The arg min of the right hand side of (12) equals $y^* = x^k - (1/\gamma) \nabla f(x^k)$, and substitution back into (12) gives
$$f(x^*) \geq f(x^k) - \frac{1}{\gamma} \nabla f(x^k)^T \nabla f(x^k) + \frac{\gamma}{2} \left\| \frac{1}{\gamma} \nabla f(x^k) \right\|^2 = f(x^k) - \frac{1}{2\gamma} \left\| \nabla f(x^k) \right\|^2,$$
which in turn yields
$$\left\| \nabla f(x^k) \right\|^2 \geq 2\gamma \left( f(x^k) - f(x^*) \right). \tag{13}$$
Substituting into the bound of Lemma 3, we find
$$f(x^{k+1}) - f(x^*) \leq f(x^k) - f(x^*) - \frac{\gamma}{L} \left( f(x^k) - f(x^*) \right).$$

Figure 6: A step size αk that satisfies Conditions ii) and iii) can be found by
bisection.

4.3 Long-Step Descent Methods

We continue solving the unconstrained model (8) via iterative updates (9), but now with more general (and possibly much larger) step sizes $\alpha_k$ and search directions $d^k$ than before.

Throughout this section we assume $f(x)$ to be $L$-smooth, possibly nonconvex, and bounded below, $f(x) \geq \underline{f}$ for all $x \in \mathbb{R}^n$. The search directions $d^k$ are assumed to lie within a bounded angle of the steepest descent direction. Specifically, our assumptions on $d^k$ and $\alpha_k$ are as follows:

Assumption 1. There exist constants $0 < \eta \leq 1$ and $0 < c_1 < c_2 < 1$ such that for all $k \in \mathbb{N}$,

i) $\nabla f(x^k)^T d^k \leq -\eta \|\nabla f(x^k)\| \cdot \|d^k\|$ (sufficient local descent),

ii) $f(x^k + \alpha_k d^k) \leq f(x^k) + c_1 \alpha_k \nabla f(x^k)^T d^k$ (step not too long),

iii) $\nabla f(x^k + \alpha_k d^k)^T d^k \geq c_2 \nabla f(x^k)^T d^k$ (step not too short).

Note that Assumption i) is trivially met if we choose the steepest descent direction $d^k = -\nabla f(x^k)$, in which case $\eta = 1$. The condition allows us more generality in computing $d^k$, for example via an inexact computation of $-\nabla f(x^k)$ that is numerically cheaper. Assumptions ii) and iii) look very technical at first sight, but all we need to know for the purposes of this course is that there exist very simple and numerically cheap approximate 1-D optimisation schemes, called inexact line searches, that are able to produce values $\alpha_k$ satisfying both conditions. The additional generality is important in practical computations, because the Lipschitz constant $L$ is generally not known, so that taking constant step sizes $\alpha_k \equiv 1/L$ is largely a theoretical choice. Line searches also allow $\alpha_k$ to take much larger values than $1/L$ when warranted, so that the algorithm may make more rapid progress. A minimal bisection-based sketch of such a line search is given below.
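The following sketch produces such an $\alpha_k$ by expansion and bisection, in the spirit of Figure 6; the constants and interface are illustrative assumptions, not part of the notes:

```python
import numpy as np

def inexact_line_search(f, grad_f, x, d, c1=1e-4, c2=0.9, alpha0=1.0, max_iter=50):
    """Find alpha with  ii) f(x+ad) <= f(x) + c1*a*g^T d   (not too long)
    and               iii) grad f(x+ad)^T d >= c2*g^T d    (not too short),
    by expansion followed by bisection."""
    g_d = grad_f(x) @ d                 # directional derivative, < 0 for descent d
    lo, hi, alpha = 0.0, np.inf, alpha0
    for _ in range(max_iter):
        if f(x + alpha * d) > f(x) + c1 * alpha * g_d:
            hi = alpha                  # condition ii) fails: step too long
        elif grad_f(x + alpha * d) @ d < c2 * g_d:
            lo = alpha                  # condition iii) fails: step too short
        else:
            return alpha                # both conditions hold
        alpha = 2 * lo if hi == np.inf else (lo + hi) / 2
    return alpha

# Example on f(x) = 0.5 ||x||^2 with the steepest descent direction:
x = np.array([2.0, -1.0])
alpha = inexact_line_search(lambda z: 0.5 * z @ z, lambda z: z, x, -x)
```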
Theorem 4. Under Assumption 1, we have
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{L}{\eta^2 c_1 (1 - c_2)}} \times \sqrt{\frac{f(x^0) - \underline{f}}{T}}.$$

Proof. Assumption iii) implies
$$-(1 - c_2) \nabla f(x^k)^T d^k \overset{\text{Assn iii)}}{\leq} \left( \nabla f(x^k + \alpha_k d^k) - \nabla f(x^k) \right)^T d^k \overset{\text{L-smoothness \& C.S.}}{\leq} L \alpha_k \|d^k\|^2.$$
Solving for $\alpha_k$ yields
$$\alpha_k \geq -\frac{1 - c_2}{L \|d^k\|^2} \cdot \nabla f(x^k)^T d^k \overset{\text{Assn i)}}{\geq} \frac{\eta (1 - c_2) \|\nabla f(x^k)\|}{L \|d^k\|}.$$
Substitution into Assumption ii) gives
$$f(x^{k+1}) = f(x^k + \alpha_k d^k) \overset{\text{Assn ii) \& i)}}{\leq} f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \|\nabla f(x^k)\|^2, \tag{14}$$
which generalises Lemma 3, used in Theorem 1 for the analysis of the short-step case. The rest of the proof follows the blueprint of the proof of Theorem 1: Solving (14) for $\|\nabla f(x^k)\|^2$ yields
$$\|\nabla f(x^k)\|^2 \leq \frac{L}{c_1 (1 - c_2) \eta^2} \left( f(x^k) - f(x^{k+1}) \right).$$
Nota bene, this implies $f(x^k) - f(x^{k+1}) \geq 0$, so that we have descent in each iteration. Summing over $k$, we find
$$\sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2 \leq \frac{L}{c_1 (1 - c_2) \eta^2} \sum_{k=0}^{T-1} \left( f(x^k) - f(x^{k+1}) \right) = \frac{L}{c_1 (1 - c_2) \eta^2} \left( f(x^0) - f(x^T) \right) \leq \frac{L}{c_1 (1 - c_2) \eta^2} \left( f(x^0) - \underline{f} \right).$$
The result now follows from
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{1}{T} \sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2}.$$


Figure 7: Construction of the Moreau envelope.

5 The Proximal Method

5.1 Proximal Operators

The following notion is useful in designing algorithms for minimising regularised problems:

Definition 5. Let $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper³ convex and closed⁴.

i) The Moreau envelope of $h$ associated with $\lambda > 0$ is the function
$$M_{\lambda, h} : \mathbb{R}^n \to \mathbb{R}, \qquad x \mapsto \frac{1}{\lambda} \inf_u \left( \lambda h(u) + \frac{1}{2} \|u - x\|^2 \right) = \inf_u \left( h(u) + \frac{1}{2\lambda} \|u - x\|^2 \right).$$
The Moreau envelope is a smoothed version of $h$.

ii) The proximal operator is given by
$$\operatorname{prox}_{\lambda h} : \mathbb{R}^n \to \mathbb{R}^n, \qquad x \mapsto \arg\min_u \left( \lambda h(u) + \frac{1}{2} \|u - x\|^2 \right).$$
The function is well defined by virtue of Proposition 3.

³ By definition, $h(u)$ is finite for at least one point $u \in \mathbb{R}^n$.
⁴ A function is closed if all sublevel sets $\{x : f(x) \leq \alpha\}$ are closed.


Figure 8: Soft and hard thresholding functions.

Example 9.

• $h(x) = 0$ has prox operator $\operatorname{prox}_{\lambda h}(x) = \arg\min_u \frac{1}{2} \|u - x\|^2 = x$, that is, the identity.

• $h(x) = I_\Omega(x)$, with $\Omega \subseteq \mathbb{R}^n$ closed convex, has prox operator
$$\operatorname{prox}_{\lambda I_\Omega}(x) = \arg\min_u \left( \lambda I_\Omega(u) + \frac{1}{2} \|u - x\|^2 \right) = \arg\min_{u \in \Omega} \|u - x\|^2,$$
that is, the projection of $x$ onto $\Omega$.

• $h(x) = \|x\|_1$ has prox operator with $i$-th component
$$\left( \operatorname{prox}_{\lambda \|\cdot\|_1}(x) \right)_i = \begin{cases} x_i - \lambda & \text{if } x_i > \lambda, \\ 0 & \text{if } x_i \in [-\lambda, \lambda], \\ x_i + \lambda & \text{if } x_i < -\lambda. \end{cases}$$
This is called the soft thresholding operator.

• $h(x) = \|x\|_0 := |\{i : x_i \neq 0\}|$ has prox operator with $i$-th component
$$\left( \operatorname{prox}_{\lambda \|\cdot\|_0}(x) \right)_i = \begin{cases} x_i & \text{if } |x_i| > \sqrt{2\lambda}, \\ 0 & \text{if } |x_i| < \sqrt{2\lambda}. \end{cases}$$
This is called hard thresholding.
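Both thresholding operators in a few lines of vectorised NumPy:

```python
import numpy as np

def prox_l1(x, lam):
    """Soft thresholding: componentwise prox of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    """Hard thresholding: componentwise prox of lam * ||.||_0."""
    return np.where(np.abs(x) > np.sqrt(2 * lam), x, 0.0)

x = np.array([-2.0, -0.3, 0.1, 1.5])
print(prox_l1(x, 0.5))   # -2 -> -1.5; -0.3, 0.1 -> 0; 1.5 -> 1.0
print(prox_l0(x, 0.5))   # entries with |x_i| <= 1 are zeroed out
```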

Proposition 4. Let $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper convex and closed. For fixed $x \in \mathbb{R}^n$, let $\Phi(u) := \lambda h(u) + \frac{1}{2} \|u - x\|^2$. Then the following properties hold.

i) $0 \in \lambda \partial h\left( \operatorname{prox}_{\lambda h}(x) \right) + \operatorname{prox}_{\lambda h}(x) - x$.

ii) $M_{\lambda, h}(x) < +\infty$ for all $x \in \mathbb{R}^n$, even if $h(x) = +\infty$.

iii) $M_{\lambda, h}(x)$ is convex in $x$.

iv) $M_{\lambda, h}(x)$ is differentiable in $x$, with gradient
$$\nabla M_{\lambda, h}(x) = \frac{1}{\lambda} \left( x - \operatorname{prox}_{\lambda h}(x) \right).$$

v) $x^* \in \arg\min_x h(x)$ if and only if $x^* \in \arg\min_x M_{\lambda, h}(x)$.

vi) $\|\operatorname{prox}_{\lambda h}(x) - \operatorname{prox}_{\lambda h}(y)\| \leq \|x - y\|$ for all $x, y \in \mathbb{R}^n$.

See problem sheets for a proof.

5.2 The Prox-Gradient Method

We now extend the scope of problems to solve and consider the regularised optimisation model
$$\min_{x \in \mathbb{R}^n} \phi(x) := f(x) + \lambda \psi(x), \tag{15}$$
where $f$ is $L$-smooth and convex, and $\psi$ is convex and closed. This includes the case of constrained optimisation, as
$$\min_{x \in \Omega} f(x) \iff \min_{x \in \mathbb{R}^n} f(x) + \lambda I_\Omega(x),$$
where the equivalence is understood in the sense that the argmins are the same in both cases.

To construct an iterative algorithm for Model (15), we use the following template:

1. $y^{k+1} = x^k - \alpha_k \nabla f(x^k)$,
2. $x^{k+1} = \arg\min_z \frac{1}{2} \|z - y^{k+1}\|^2 + \alpha_k \lambda \psi(z)$,

that is, we minimise $\psi(z)$ while staying close to the steepest descent update $y^{k+1}$. Note the factor $\alpha_k$ in front of $\lambda \psi(z)$, which avoids $x^{k+1} \to y^{k+1}$ when $\alpha_k \to \infty$; both terms on the r.h.s. of the second step should be of the same order. We may also express the update of the template algorithm as
$$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda \psi} \left( x^k - \alpha_k \nabla f(x^k) \right). \tag{16}$$
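Specialising (16) to the lasso model from Example 2 iv), with $\psi = \|\cdot\|_1$ and the soft-thresholding prox of Example 9, gives the classical iterative soft-thresholding scheme; a minimal sketch with constant step length $1/L$:

```python
import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_gradient_lasso(A, y, lam, n_iters=500):
    """Prox-gradient method for  min (1/2m)||Ax - y||^2 + lam ||x||_1."""
    m, n = A.shape
    L = np.linalg.eigvalsh(A.T @ A).max() / m        # Lipschitz constant of grad f
    alpha = 1.0 / L                                  # constant step length
    x = np.zeros(n)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y) / m                 # forward gradient step on f
        x = prox_l1(x - alpha * grad, alpha * lam)   # backward prox step on lam*psi
    return x
```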

An alternative interpretation and motivation of the update (16) is as follows:
$$x^{k+1} = \arg\min_z f(x^k) + \nabla f(x^k)^T (z - x^k) + \frac{1}{2 \alpha_k} \|z - x^k\|^2 + \lambda \psi(z).$$
Now note that $f(x^k) + \nabla f(x^k)^T (z - x^k)$ is the 1st order Taylor approximation of $f(z)$ around $x^k$, so that $f(x^k) + \nabla f(x^k)^T (z - x^k) + \frac{1}{2 \alpha_k} \|z - x^k\|^2$ is a simplified 2nd order model of $f(z)$ that does not depend on 2nd order derivatives of $f$. The updates are thus obtained by iteratively building very simple 2nd order models by which $f$ is replaced in (15) to obtain subproblems that are easier to solve.
Example 10. Unregularised unconstrained minimisation $\min_x f(x)$: this case reduces to $\psi(x) \equiv 0$ and the steepest descent update
$$x^{k+1} = \operatorname{prox}_0 \left( x^k - \alpha_k \nabla f(x^k) \right) = \arg\min_z \frac{1}{2} \left\| z - \left( x^k - \alpha_k \nabla f(x^k) \right) \right\|^2 = x^k - \alpha_k \nabla f(x^k).$$

Example 11. Constrained convex minimisation $\min_x f(x)$ subject to $x \in \Omega$ with $\Omega \subset \mathbb{R}^n$ convex: this case reduces to $\psi(x) = I_\Omega(x)$ and prox updates
$$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda I_\Omega} \left( x^k - \alpha_k \nabla f(x^k) \right) = \arg\min_{z \in \Omega} \frac{1}{2} \left\| z - \left( x^k - \alpha_k \nabla f(x^k) \right) \right\|^2,$$
which is the orthogonal projection of the steepest descent update $x^k - \alpha_k \nabla f(x^k)$ onto the feasible set $\Omega$.
Definition 6. The gradient map of model (15) is the map
$$G_\alpha : \mathbb{R}^n \to \mathbb{R}^n, \qquad x \mapsto \frac{1}{\alpha} \left( x - \operatorname{prox}_{\alpha \lambda \psi} (x - \alpha \nabla f(x)) \right).$$
We note that the notion of the gradient map allows us to write the prox update (16) as
$$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda \psi} \left( x^k - \alpha_k \nabla f(x^k) \right) = x^k - \alpha_k G_{\alpha_k}(x^k), \tag{17}$$
so that in the proximal method $G_{\alpha_k}(x^k)$ takes the role of $\nabla f(x^k)$ in comparison with the method of steepest descent.
Lemma 4 (Foundational Inequality of the Prox-Method). The gradient map has the following properties:

i) $G_\alpha(x) \in \nabla f(x) + \lambda \partial \psi (x - \alpha G_\alpha(x))$.

ii) For all $\alpha \in \left( 0, \frac{1}{L} \right]$ and $z \in \mathbb{R}^n$,
$$\phi(x - \alpha G_\alpha(x)) \leq \phi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2. \tag{18}$$
Notes:

• Property i) shows that (17) has an interpretation as a generalised subgradient step, with the gradient map replacing the gradient, with the following subtlety: $-\alpha_k G_{\alpha_k}(x^k)$ consists of an explicit forward descent step for $f$ and an implicit backward ascent step for $\lambda \psi$.

• Property ii) holds true for all $z \in \mathbb{R}^n$. The term $\phi(z) + G_\alpha(x)^T (x - z)$ can be seen as a backward first order model of $\phi(x)$. Using (17) and setting $x = x^k$ and $\alpha = \alpha_k$, (18) reads
$$\phi(x^{k+1}) \leq \phi(z) + G_{\alpha_k}(x^k)^T (x^k - z) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2.$$
In particular, setting $z = x^k$ we obtain
$$\phi(x^{k+1}) \leq \phi(x^k) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2. \tag{19}$$
Compare this result to the foundational inequality of the short-step steepest descent, i.e., Lemma 3, which reads
$$f(x^{k+1}) \leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2. \tag{20}$$
Note that when $\alpha_k = 1/L$, (19) becomes (20) with $f(x)$ replaced by the merit function $\phi(x)$ and $\nabla f(x)$ by the gradient map. Inequality (18) thus generalises the foundational inequality and can be used the same way to analyse the convergence of the prox-gradient method as Lemma 3 was used in the analysis of the Method of Steepest Descent.

Proof. Using $x - \alpha G_\alpha(x) = \operatorname{prox}_{\alpha \lambda \psi}(x - \alpha \nabla f(x)) = \arg\min_u \Xi(u)$, where $\Xi(u) := \alpha \lambda \psi(u) + \frac{1}{2} \|u - (x - \alpha \nabla f(x))\|^2$ is convex, it must be true that
$$0 \in \partial \Xi \left( \operatorname{prox}_{\alpha \lambda \psi}(x - \alpha \nabla f(x)) \right) = \partial \Xi (x - \alpha G_\alpha(x)) = \alpha \lambda \partial \psi (x - \alpha G_\alpha(x)) + (x - \alpha G_\alpha(x)) - (x - \alpha \nabla f(x)).$$
Therefore, $\alpha G_\alpha(x) \in \alpha \lambda \partial \psi (x - \alpha G_\alpha(x)) + \alpha \nabla f(x)$, as claimed in i).

Further, since $f$ is $L$-smooth, (7) implies
$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2, \quad \forall y \in \mathbb{R}^n.$$
Substituting $y = x - \alpha G_\alpha(x)$ yields
$$f(x - \alpha G_\alpha(x)) \leq f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{L \alpha^2}{2} \|G_\alpha(x)\|^2 \leq f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{\alpha}{2} \|G_\alpha(x)\|^2, \tag{21}$$
where the last inequality follows from the assumption that $\alpha \in (0, 1/L]$. On the other hand, the convexity of $f(x)$ implies (see Definition 2 and properties of the subdifferential)
$$f(z) \geq f(x) + \nabla f(x)^T (z - x). \tag{22}$$
By virtue of part i) we also have $G_\alpha(x) - \nabla f(x) \in \lambda \partial \psi (x - \alpha G_\alpha(x))$, so that by the definition of subgradients we have
$$\lambda \psi(z) \geq \lambda \psi (x - \alpha G_\alpha(x)) + \left[ G_\alpha(x) - \nabla f(x) \right]^T \left( z - (x - \alpha G_\alpha(x)) \right). \tag{23}$$
It follows that
$$\begin{aligned}
\phi(x - \alpha G_\alpha(x)) &\overset{\text{def of } \phi}{=} f(x - \alpha G_\alpha(x)) + \lambda \psi (x - \alpha G_\alpha(x)) \\
&\overset{(21)}{\leq} f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 + \lambda \psi (x - \alpha G_\alpha(x)) \\
&\overset{(22)}{\leq} f(z) - \nabla f(x)^T (z - x + \alpha G_\alpha(x)) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 + \lambda \psi (x - \alpha G_\alpha(x)) \\
&\overset{(23)}{\leq} f(z) - \nabla f(x)^T (z - x + \alpha G_\alpha(x)) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 + \lambda \psi(z) - \left[ G_\alpha(x) - \nabla f(x) \right]^T (z - (x - \alpha G_\alpha(x))) \\
&= f(z) + \lambda \psi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2 \\
&= \phi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2,
\end{aligned}$$
as claimed in part ii).

5.3 Convergence Theory

Theorem 5. If Model (15) has a minimiser $x^*$ and the prox-gradient algorithm is run with constant step length $\alpha_k \equiv 1/L$ starting from an initial point $x^0$, then
$$\phi(x^k) - \phi(x^*) \leq \frac{L}{2k} \|x^0 - x^*\|^2, \quad \forall k \geq 1.$$

Proof. Equation (19) establishes that $(\phi(x^k))_{k \in \mathbb{N}}$ is monotone decreasing,
$$\phi(x^{k+1}) \leq \phi(x^k), \quad \forall k. \tag{24}$$
Further, using $z = x^*$ in Lemma 4 ii), we have
$$\begin{aligned}
0 \leq \phi(x^{k+1}) - \phi(x^*) &\leq G_{\alpha_k}(x^k)^T (x^k - x^*) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2 \\
&= \frac{1}{2 \alpha_k} \left( \|x^k - x^*\|^2 - \|x^k - x^* - \alpha_k G_{\alpha_k}(x^k)\|^2 \right) \\
&= \frac{1}{2 \alpha_k} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right). \tag{25}
\end{aligned}$$
Therefore, $(\|x^k - x^*\|)_{k \in \mathbb{N}}$ is also monotone decreasing.


Combining both insights, we find
$$K \times \left( \phi(x^K) - \phi(x^*) \right) \overset{(24)}{\leq} \sum_{k=0}^{K-1} \left( \phi(x^{k+1}) - \phi(x^*) \right) \overset{(25),\ \alpha_k = L^{-1}}{\leq} \frac{L}{2} \sum_{k=0}^{K-1} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right) \overset{\text{telescoping sum}}{\leq} \frac{L}{2} \|x^0 - x^*\|^2.$$

6 Acceleration of Gradient Methods

6.1 Summary of Complexity Results Seen so Far


• The Method of Steepest Descent solves
$$\min_x f(x)$$
  – in sublinear $O(1/\sqrt{k})$ time when $f$ is smooth (but possibly nonconvex) and bounded below, both in short and long step variants,
  – in sublinear $O(1/k)$ time when $f$ is smooth, convex and bounded below,
  – in linear time with rate $\left( 1 - \frac{\gamma}{L} \right)$ when $f$ is $L$-smooth and $\gamma$-strongly convex.

• The Prox-Gradient Method solves
$$\min_x f(x) + \lambda \psi(x)$$
in sublinear $O(1/k)$ time when $f$ is smooth and convex and $\psi$ is convex and closed.

Accelerated gradient methods are designed to obtain faster, in some cases provably optimal, convergence rates.

6.2 The Heavy Ball Method


This algorithm is the easiest to understand among accelerated gradient methods, but it is designed for the somewhat restricted scope of minimising convex quadratic functions,
$$\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^T A x - b^T x, \tag{26}$$
where $A$ is symmetric with eigenvalues in $[\gamma, L]$ for some $0 < \gamma < L$, so that $f(x)$ is $L$-smooth and $\gamma$-strongly convex.

Figure 9: Heavy ball updates with momentum term.

The iterative updates are defined as
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1}),$$
consisting of a steepest descent step $x^k - \alpha_k \nabla f(x^k)$ and an additional momentum term $\beta_k (x^k - x^{k-1})$ for some $\beta_k > 0$. To avoid the requirement of two starting points $x^{-1}, x^0$, the first iteration takes a steepest descent update. The following theorem is given without proof:
Theorem 6. There exists a constant $C > 0$ such that the Heavy Ball Method, applied to Model (26) with constant step lengths
$$\alpha_k \equiv \alpha = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k \equiv \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$
satisfies $\|x^k - x^*\| \leq C \beta^k$ for all $k$.
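A sketch of the heavy ball iteration with the constant parameters from Theorem 6, on the quadratic model (26); the test matrix is an arbitrary placeholder:

```python
import numpy as np

def heavy_ball(A, b, n_iters=200):
    """Heavy ball method for f(x) = 0.5 x^T A x - b^T x."""
    eigs = np.linalg.eigvalsh(A)
    gamma, L = eigs.min(), eigs.max()
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = np.zeros_like(b)
    x = x_prev - (A @ x_prev - b) / L          # first step: steepest descent
    for _ in range(n_iters):
        grad = A @ x - b
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x
    return x

A = np.diag(np.linspace(0.01, 1.0, 20))        # ill-conditioned: gamma/L = 0.01
b = np.ones(20)
x = heavy_ball(A, b)                           # approximates A^{-1} b
```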

As an immediate consequence of Theorem 6 we find that
$$\begin{aligned}
f(x^k) - f(x^*) &\leq \nabla f(x^*)^T (x^k - x^*) + \frac{L}{2} \|x^k - x^*\|^2 && \text{(by (7))} \\
&\leq \frac{L C^2}{2} \beta^{2k} && \text{(since } \nabla f(x^*) = 0 \text{, and by Theorem 6)} \\
&\approx \frac{L C^2}{2} \left( 1 - 2 \sqrt{\frac{\gamma}{L}} \right)^{2k} && \text{(approximation for } L \gg \gamma \text{)}.
\end{aligned}$$
The case where $L \gg \gamma$ is especially interesting, as this corresponds to ill-conditioned $A$, for which Model (26) is numerically difficult to solve. In this case the Heavy Ball Method has a much faster convergence rate than the Method of Steepest Descent, which as we saw in Theorem 3 satisfies
$$f(x^k) - f(x^*) \leq \left( f(x^0) - f(x^*) \right) \times \left( 1 - \frac{\gamma}{L} \right)^k.$$


Figure 10: Heavy ball updates converge faster due to reduced zig-zagging.

We conclude that the Heavy Ball Method also converges linearly, but at a faster rate than the Method of Steepest Descent.

Example 12. If $\gamma = 0.01 L$, which corresponds to a matrix $A$ with moderately sized condition number of $100$, we have $(1 - 2\sqrt{\gamma/L})^2 = 0.64$. Hence, the Heavy Ball Method shrinks $f(x^k) - f(x^*)$ by at least a third in each iteration, while the steepest descent method shrinks $f(x^k) - f(x^*)$ only by $1\%$, since $(1 - \gamma/L) = 0.99$.

6.3 Nesterov Acceleration


Nesterov’s Accelerated Gradient Method is a modification of the Heavy Ball Method
that works for general strongly convex objectives to solve
min f (x) (27)
x∈Rn

via iterative updates


xk+1 = xk − αk ∇f xk + βk (xk − xk−1 ) + βk (xk − xk−1 ),

(28)
which, compared to the heavy ball updates
xk+1 = xk − αk ∇f xk + βk (xk − xk−1 ),

(29)
only differ in the point where the gradient ∇f is evaluated: in the case of Nes-
terov’s method this point contains the momentum term also. To avoid the re-
quirement of two starting points, the first update is again taken as the steepest
descent step. The next result is given without proof:
Theorem 7. For $f$ $L$-smooth and $\gamma$-strongly convex, Nesterov's Accelerated Gradient Method applied to Model (27) from a given starting point $x^0 \in \mathbb{R}^n$ and with constant step lengths
$$\alpha_k \equiv \alpha = \frac{1}{L}, \qquad \beta_k \equiv \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}}$$
satisfies
$$f(x^k) - f(x^*) \leq \frac{L + \gamma}{2} \|x^0 - x^*\|^2 \times \left( 1 - \sqrt{\frac{\gamma}{L}} \right)^k \quad \forall k \geq 1.$$

Theorem 7 shows that the Accelerated Gradient Method with constant step size again converges linearly, with a faster rate than the Method of Steepest Descent, as
$$1 - \sqrt{\frac{\gamma}{L}} < 1 - \frac{\gamma}{L}.$$

Example 13. For $\gamma = 0.01 L$ we have $(1 - \sqrt{\gamma/L}) = 0.9$, which compares favourably with the decrease rate of $(1 - \gamma/L) = 0.99$ for steepest descent. It takes more than 10 steepest descent iterations to guarantee the same relative decrease in $f(x^k) - f(x^*)$ as a single iteration of Nesterov's Method.

6.4 Nesterov Acceleration of L-Smooth Convex Functions

We will now discuss a variant of Nesterov's Method for the unconstrained minimisation of an $L$-smooth convex function $f(x)$. By (7), $f(x)$ is bounded from above by simple quadratic models:
$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2. \tag{30}$$
The Lipschitz constant is only assumed to exist but is not necessarily known. In contrast to the case characterised by Theorem 7, the step size varies in each iteration.

This method has an iteration complexity of $O(\varepsilon^{-1/2})$, that is, there exists a constant $C > 0$ (which depends of course on the starting point) such that for any given error tolerance $\varepsilon > 0$ it takes only $C/\sqrt{\varepsilon}$ iterations to guarantee that $f(x^k) - f(x^*) < \varepsilon$. It is known theoretically that this iteration complexity cannot be improved, thus Nesterov's Method is optimal from an iteration complexity perspective.

The algorithm proceeds via two sequences of iterates, $x^k$ and $y^k$, whose computation is interlaced. The points $x^k$ are obtained as the minimisers of a quadratic upper bound function, as discussed earlier in the course, while the iterates $y^k$ are obtained as approximate second order corrections to the points $x^k$.

Algorithm 1 (Nesterov Acceleration for L-Smooth Convex Minimisation).

// initialisation
find $z \neq y^0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y^0 - z\|}{\|\nabla f(y^0) - \nabla f(z)\|}$;
$k := 0$, $\lambda_0 := 1$, $x^{-1} := y^0$;
// main body
while $\|x^k - x^{k-1}\| > \text{tol}$ do
    find minimal $i \in \mathbb{N}$ such that
    $$f\left( y^k - 2^{-i} \alpha_{k-1} \nabla f(y^k) \right) \leq f(y^k) - 2^{-(i+1)} \alpha_{k-1} \|\nabla f(y^k)\|^2; \tag{31}$$
    $\alpha_k := 2^{-i} \alpha_{k-1}$;
    $x^k := y^k - \alpha_k \nabla f(y^k)$;
    $\lambda_{k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda_k^2 + 1} \right)$;
    $y^{k+1} := x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} \left( x^k - x^{k-1} \right)$;
    $k \leftarrow k + 1$;
end
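A direct transcription of Algorithm 1 into Python; the test problem and tolerance are illustrative placeholders:

```python
import numpy as np

def nesterov_lsmooth(f, grad_f, y0, z, tol=1e-8, max_iter=1000):
    """Nesterov acceleration for L-smooth convex f with L unknown (Algorithm 1)."""
    # initialisation: overestimate the optimal step length 1/L
    alpha = np.linalg.norm(y0 - z) / np.linalg.norm(grad_f(y0) - grad_f(z))
    lam, x_prev, y = 1.0, y0.copy(), y0.copy()
    for _ in range(max_iter):
        g = grad_f(y)
        # backtracking: halve alpha until the sufficient decrease (31) holds
        while f(y - alpha * g) > f(y) - 0.5 * alpha * (g @ g):
            alpha /= 2
        x = y - alpha * g
        lam_next = 0.5 * (1 + np.sqrt(4 * lam**2 + 1))
        y = x + ((lam - 1) / lam_next) * (x - x_prev)
        if np.linalg.norm(x - x_prev) <= tol:
            return x
        x_prev, lam = x, lam_next
    return x

# Example: min 0.5 x^T A x - b^T x on an ill-conditioned diagonal matrix
A = np.diag(np.linspace(0.01, 1.0, 10))
b = np.ones(10)
x0 = np.zeros(10)
x = nesterov_lsmooth(lambda v: 0.5 * v @ (A @ v) - b @ v,
                     lambda v: A @ v - b, x0, z=x0 + 1.0)
```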

A few remarks are in order:

• Since $y^0 \neq z$, we have $\|\nabla f(y^0) - \nabla f(z)\| \leq L \|y^0 - z\|$, and hence $\alpha_{-1} \geq L^{-1}$ is an overestimate of the optimal step length. This guarantees that the steps are not unnecessarily short initially.

• Subsequently, $\alpha_{k+1} \leq \alpha_k$ for all $k$. The step length approaches the unknown optimal step length $L^{-1}$ from above until some iteration $k_0$ when $\alpha_{k_0} \leq L^{-1}$ occurs for the first time. From that point on, in all subsequent iterations $k > k_0$, Equation (30) implies that
$$f\left( y^k - \alpha_{k-1} \nabla f(y^k) \right) \leq f(y^k) + \langle \nabla f(y^k), -\alpha_{k-1} \nabla f(y^k) \rangle + \frac{L}{2} \|\alpha_{k-1} \nabla f(y^k)\|^2 \leq f(y^k) - \frac{\alpha_{k-1}}{2} \|\nabla f(y^k)\|^2.$$
The sufficient decrease condition (31) is therefore satisfied with $i = 0$, so that $\alpha_k = \alpha_{k-1} = \dots = \alpha_{k_0}$ ceases to decrease any further, and we have
$$\frac{L^{-1}}{2} \leq \alpha_k \leq L^{-1}, \quad \forall k \geq k_0, \tag{32}$$
where the lower bound inequality holds for all $k \in \mathbb{N}$. Note that $\alpha \leq L^{-1}$ is exactly the condition we need to ensure that
$$m_{k, \alpha}(y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1}}{2} \|y - y^k\|^2$$
is an upper bound function on $f(y)$. If we had a priori knowledge of $L$, we could simply set $\alpha = L^{-1}$ at the outset, but Algorithm 1 is designed for the case where $L$ is not known explicitly, and it learns an appropriate value of $\alpha$ from recycled information, that is, from quantities that have to be computed anyway as the iterations proceed. The backtracking decreases the value of $\alpha$ at most
$$\lfloor \log_2(2 L \alpha_{-1}) \rfloor$$
times before the final value is attained.


• The point $x^k$ is simply the global minimiser of
$$m_{k, \alpha}(x) = f(y^k) + \langle \nabla f(y^k), x - y^k \rangle + \frac{\alpha^{-1}}{2} \|x - y^k\|^2.$$

• The update $\lambda_{k+1}$ is obtained by solving the quadratic equation
$$\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2. \tag{33}$$
It follows by induction that
$$\lambda_{k+1} \geq \lambda_k + \frac{1}{2} \geq 1 + \frac{k+1}{2}, \tag{34}$$
and furthermore,
$$\frac{\lambda_0 - 1}{\lambda_1} = 0, \qquad \frac{\lambda_k - 1}{\lambda_{k+1}} < 1 \quad \forall k \in \mathbb{N}, \qquad \lim_{k \to \infty} \frac{\lambda_k - 1}{\lambda_{k+1}} = 1.$$

• The step from $x^k$ to $y^{k+1}$ is designed to go along the "ravine" in ill-conditioned problems that have a long, narrow valley of near-optimal points. Pure gradient descent methods tend to stall in such situations and jump back and forth across the ravine. To speed up the convergence, a step along the ravine is required. The vector $x^k - x^{k-1}$ asymptotically points in the right direction, and the step size factor $\frac{\lambda_k - 1}{\lambda_{k+1}}$ is designed to start out timidly and to become asymptotically bolder.

Let us now analyse the convergence of Algorithm 1. Note that asymptotically we have $x^k - y^k \to 0$, so that it does not matter whether we analyse the algorithm in terms of the points $y^k$ or $x^k$. It turns out to be easiest to carry out the analysis in terms of the vectors
$$p^k = (\lambda_k - 1)(x^{k-1} - x^k). \tag{35}$$


Theorem 8. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex $L$-smooth function that has at least one finite minimiser $x^* \in \arg\min f(x)$ and a finite minimum $f^* = \min f(x)$. Then the sequence $(x^k)_{k \in \mathbb{N}}$ of iterates produced by Algorithm 1 satisfies
$$f(x^k) \leq f^* + \frac{4 L \|x^0 - x^*\|^2}{(k + 2)^2}$$
for all $k \in \mathbb{N}$. Hence, $x^k$ is $\varepsilon$-optimal for all
$$k \geq N(\varepsilon) = \left\lfloor \frac{2 \sqrt{L} \times \|x^0 - x^*\|}{\sqrt{\varepsilon}} \right\rfloor - 1.$$

Proof. Recall that we have
$$y^{k+1} = x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} (x^k - x^{k-1}) = x^k - \frac{1}{\lambda_{k+1}} (\lambda_k - 1)(x^{k-1} - x^k) = x^k - \frac{1}{\lambda_{k+1}} p^k, \tag{36}$$
and
$$x^{k+1} = y^{k+1} - \alpha_{k+1} \nabla f(y^{k+1}) = x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} (x^k - x^{k-1}) - \alpha_{k+1} \nabla f(y^{k+1}),$$
which implies
$$0 = \lambda_{k+1} (x^k - x^{k+1}) + (\lambda_k - 1)(x^k - x^{k-1}) - \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}). \tag{37}$$
Using the auxiliary vectors $p^k$ defined in (35), we have
$$\begin{aligned}
p^{k+1} - x^{k+1} &= (\lambda_{k+1} - 1)(x^k - x^{k+1}) - x^{k+1} \\
&= \lambda_{k+1} (x^k - x^{k+1}) - x^k \\
&\overset{(37)}{=} (\lambda_k - 1)(x^{k-1} - x^k) - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}) \\
&= p^k - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}).
\end{aligned}$$
It follows that
$$\begin{aligned}
\|p^{k+1} - x^{k+1} + x^*\|^2 &= \|p^k - x^k + x^*\|^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k - x^k + x^* \rangle \\
&= \|p^k - x^k + x^*\|^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 + 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle \\
&\qquad + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^* - x^k + \lambda_{k+1}^{-1} p^k \rangle \\
&\overset{(36)}{=} \|p^k - x^k + x^*\|^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 + 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle \\
&\qquad + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle. \tag{38}
\end{aligned}$$


Next, we will derive bounds on the last two terms in the right hand side of (38). To bound the last term, we use convexity of $f$, which implies $f(x^*) \geq f(y^{k+1}) + \langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle$, and hence
$$f(y^{k+1}) - f^* \leq -\langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle. \tag{39}$$
On the other hand, the sufficient decrease condition (31) yields
$$f(x^k) = f\left( y^k - \alpha_k \nabla f(y^k) \right) \leq f(y^k) - \frac{\alpha_k}{2} \|\nabla f(y^k)\|^2,$$
and at iteration $k + 1$ this implies
$$f(y^{k+1}) \geq f(x^{k+1}) + \frac{\alpha_{k+1}}{2} \|\nabla f(y^{k+1})\|^2. \tag{40}$$
Substitution into (39) yields
$$\langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle \leq -\left( f(x^{k+1}) - f^* \right) - \frac{\alpha_{k+1}}{2} \|\nabla f(y^{k+1})\|^2. \tag{41}$$
To bound the penultimate term, we use (36), $x^k = y^{k+1} + \lambda_{k+1}^{-1} p^k$. Convexity now implies $f(y^{k+1}) + \lambda_{k+1}^{-1} \langle \nabla f(y^{k+1}), p^k \rangle \leq f(x^k)$, so (40) yields
$$\frac{\alpha_{k+1}}{2} \|\nabla f(y^{k+1})\|^2 \leq f(x^k) - f(x^{k+1}) - \lambda_{k+1}^{-1} \langle \nabla f(y^{k+1}), p^k \rangle,$$
and hence,
$$(\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 \leq 2 (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1} \left( f(x^k) - f(x^{k+1}) \right) - 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle. \tag{42}$$

Let us combine what we have derived so far:
$$\begin{aligned}
\|p^{k+1} - x^{k+1} + x^*\|^2 - \|p^k - x^k + x^*\|^2 &\overset{(38),(41)}{\leq} 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle - 2 \lambda_{k+1} \alpha_{k+1} \left( f(x^{k+1}) - f^* \right) + (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 \\
&\overset{(42)}{\leq} -2 \lambda_{k+1} \alpha_{k+1} \left( f(x^{k+1}) - f^* \right) + 2 (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1} \left( f(x^k) - f(x^{k+1}) \right) \\
&= 2 \alpha_{k+1} \lambda_k^2 \left( f(x^k) - f^* \right) - 2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) \\
&\leq 2 \alpha_k \lambda_k^2 \left( f(x^k) - f^* \right) - 2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right),
\end{aligned}$$
where the last inequality follows from the fact that the $\alpha_k$ are monotonically decreasing, and the penultimate equality is a consequence of $\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2$. Applying this inequality iteratively, we find
$$\begin{aligned}
2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) &\leq 2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) + \|p^{k+1} - x^{k+1} + x^*\|^2 \\
&\leq 2 \alpha_k \lambda_k^2 \left( f(x^k) - f^* \right) + \|p^k - x^k + x^*\|^2 \\
&\leq \dots \\
&\leq 2 \alpha_0 \lambda_0^2 \left( f(x^0) - f^* \right) + \|p^0 - x^0 + x^*\|^2 \\
&\leq \|y^0 - x^*\|^2, \tag{43}
\end{aligned}$$


where the last inequality follows from $\lambda_0 = 1$, $p^0 = (\lambda_0 - 1)(x^{-1} - x^0) = 0$, and
$$\begin{aligned}
\|p^0 - x^0 + x^*\|^2 &= \|x^0 - x^*\|^2 = \|y^0 - \alpha_0 \nabla f(y^0) - x^*\|^2 \\
&= \|y^0 - x^*\|^2 + \alpha_0^2 \|\nabla f(y^0)\|^2 - 2 \alpha_0 \langle \nabla f(y^0), y^0 - x^* \rangle \\
&\overset{(41)}{\leq} \|y^0 - x^*\|^2 + \alpha_0^2 \|\nabla f(y^0)\|^2 - 2 \alpha_0 \left( f(x^0) - f^* + \frac{\alpha_0}{2} \|\nabla f(y^0)\|^2 \right) \\
&= \|y^0 - x^*\|^2 - 2 \alpha_0 \lambda_0^2 \left( f(x^0) - f^* \right).
\end{aligned}$$
To recap, (43), (34) and (32) revealed that
$$2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) \leq \|y^0 - x^*\|^2, \qquad \lambda_{k+1} \geq 1 + \frac{k+1}{2}, \qquad \frac{1}{2L} \leq \alpha_{k+1}.$$
In combination this yields
$$L^{-1} \left( \frac{k+3}{2} \right)^2 \left( f(x^{k+1}) - f^* \right) \leq 2 \alpha_{k+1} \left( 1 + \frac{k+1}{2} \right)^2 \left( f(x^{k+1}) - f^* \right) \leq \|y^0 - x^*\|^2,$$
so that
$$f(x^{k+1}) - f^* \leq \frac{4 L \|y^0 - x^*\|^2}{(k + 3)^2},$$
as claimed by the theorem.
