
B6.2 Optimisation for Data Science

Lecture Notes, Lectures 1–8
Oxford Mathematical Institute, HT 2022

Prof. Raphael Hauser

March 10, 2022


Table of Contents

1 Scope of this Course
  1.1 Optimisation Models
  1.2 Sparse Objective
  1.3 First Order Methods
  1.4 Data Analysis Problems

2 Motivating Examples

3 Optimisation Terminology and Prerequisites
  3.1 Terminology
  3.2 Prerequisites
  3.3 Characterisation of Convergence Speed

4 Method of Steepest Descent
  4.1 Specifications
  4.2 Convergence Theory
  4.3 Long-Step Descent Methods

5 The Proximal Method
  5.1 Proximal Operators
  5.2 The Prox-Gradient Method
  5.3 Convergence Theory

6 Acceleration of Gradient Methods
  6.1 Summary of Complexity Results Seen so Far
  6.2 The Heavy Ball Method
  6.3 Nesterov Acceleration
  6.4 Nesterov Acceleration of L-Smooth Convex Functions


1 Scope of this Course

1.1 Optimisation Models


Models considered in this course can take any of the following forms.

Unconstrained model
$$\min_{x \in \mathbb{R}^n} f(x)$$
where $f$ is smooth, which in the context of this course means $C^1$ with Lipschitz continuous gradient.

Regularised model
$$\min_{x \in \mathbb{R}^n} f(x) + \lambda \Psi(x)$$
where $\Psi : \mathbb{R}^n \to \mathbb{R}$ is convex, possibly nonsmooth, and controls the complexity and structure of the optimal solution $x^*$ (a point where the objective function $f(x) + \lambda \Psi(x)$ achieves a minimum); $\lambda$ is called the regularisation parameter.

Constrained model
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } x \in \mathcal{F},$$
e.g., $\mathcal{F} = \{x \in \mathbb{R}^n : g(x) \leq 0,\ x \geq 0\}$ for some $g : \mathbb{R}^n \to \mathbb{R}^q$ (all vector inequalities are to be interpreted componentwise, i.e. $g(x) \leq 0$ means $g_s(x) \leq 0$ for $s = 1, \dots, q$).

1.2 Sparse Objective


In data analysis the objective function $f(x)$ often takes the form of a sum
$$f(x) = \sum_{j=1}^{m} f_j(x),$$
where each $f_j$ is sparse (depends only on a few of the coordinates of $x$), and since $n, m \gg 1$, often only the gradients $\nabla f(x)$ or $\nabla f_j(x)$ are available, but no higher derivatives.

1.3 First Order Methods


First order methods are iterative algorithms designed to produce a sequence $(x^k)_{k \in \mathbb{N}}$ of solutions (points $x \in \mathbb{R}^n$ that satisfy the constraints $x \in \mathcal{F}$) converging to an optimal solution $x^*$ via updates
$$x^{k+1} = x^k + \alpha_k d^k$$
with a search direction $d^k$ computed from $\nabla f(x)$ or the $\nabla f_j(x)$, and a step length $\alpha_k > 0$ that is either fixed or computed iteratively.
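A minimal sketch of this template in Python/NumPy; the direction and step rules are placeholders to be specified by the methods developed in later sections:

```python
import numpy as np

def first_order_method(grad_f, x0, direction, step, n_iters=100):
    """Generic first-order update x^{k+1} = x^k + alpha_k d^k.

    direction(g): maps the current gradient to a search direction d^k.
    step(k):      returns the step length alpha_k > 0.
    """
    x = x0.copy()
    for k in range(n_iters):
        g = grad_f(x)
        d = direction(g)      # e.g. d = -g for steepest descent
        alpha = step(k)       # e.g. a constant 1/L
        x = x + alpha * d
    return x

# Example: steepest descent on f(x) = 0.5 ||x||^2, so grad f(x) = x and L = 1
x_star = first_order_method(grad_f=lambda x: x,
                            x0=np.ones(5),
                            direction=lambda g: -g,
                            step=lambda k: 1.0)
```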


1.4 Data Analysis Problems

Data analysis problems typically combine the following elements:

Data set $\mathcal{D} := \{(a_j, y_j) : j = 1, \dots, m\}$, where $a_j \in \mathcal{V}$ lies in a vector space of features, and $y_j \in \mathcal{W}$ in a space of observations.

Parametric model $\Phi(\cdot\,; x) : \mathcal{V} \to \mathcal{W}$ of a feature-observation relation, parameterised by a vector $x \in \mathbb{R}^n$.

Data fitting problem: find $x \in \mathbb{R}^n$ such that $\Phi(a_j; x) \approx y_j$ for all $j$ by solving $\min_{x \in \mathbb{R}^n} f(x)$, with objective
$$f(x) = L_{\mathcal{D}}(x) := \sum_{j=1}^{m} \ell(a_j, y_j; x).$$
$L_{\mathcal{D}}(x)$ is the loss on the data set $\mathcal{D}$ associated with parameter values $x$. A typical example is $\ell(a_j, y_j; x) = \|\Phi(a_j; x) - y_j\|^2$.

2 Motivating Examples
Example 1. Examples of observation sets:

i) $\mathcal{W} = \mathbb{R}$: regression problems.

ii) $\mathcal{W} = \{1, \dots, M\}$: classification problems.

iii) $\mathcal{W} = \emptyset$: data clustering, structured dimensionality reduction.

Example 2 (Regression).

i) Regression without intercept: Choose $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \mathbb{R}$, $\Phi(a; x) = a^T x$. Data fitting model
$$\min_{x \in \mathbb{R}^n} \frac{1}{2m} \sum_{j=1}^{m} (a_j^T x - y_j)^2 = \frac{1}{2m} \|Ax - y\|_2^2,$$
where $A = (a_1 \ \dots \ a_m)^T$.

ii) Regression with intercept: Same formulation, but replace $x$ and $a_j$ by
$$\tilde{x} = \begin{pmatrix} x \\ \beta \end{pmatrix}, \qquad \tilde{a}_j = \begin{pmatrix} a_j \\ 1 \end{pmatrix},$$
so that $\Phi(\tilde{a}; \tilde{x}) = \tilde{a}^T \tilde{x} = a^T x + \beta$.


iii) Tychonov-regularised regression: this method uses the data fitting model
$$\min_{x \in \mathbb{R}^n} \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_2^2,$$
which leads to robust regression, less sensitive to noise in the data $(a_j, y_j)$.

iv) Lasso regression: the data fitting model
$$\min_{x \in \mathbb{R}^n} \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_1$$
tends to yield a sparse optimal solution $x^*$ (feature selection) if one exists.

v) Dictionary learning: Extending linear regression to learning the regressors. In dictionary learning we seek a dictionary $A \in \mathbb{R}^{m \times n}$ and sparse regressors $X \in \mathbb{R}^{n \times p}$ so that the data $Y \in \mathbb{R}^{m \times p}$ can be approximately expressed as $AX \approx Y$. We use $\mathcal{V} = (\mathbb{R}^{m \times n}, \mathbb{R}^{n \times p})$ as data space and $\mathcal{W} = \mathbb{R}^{1 \times p}$ as observation space, with data $\mathcal{D} = \{y_\ell\}_{\ell=1}^{m}$, where $y_\ell \in \mathbb{R}^{1 \times p}$ and $p \gg n$. The data fitting model is designed as
$$\min_{A \in \mathbb{R}^{m \times n},\, X \in \mathcal{F}_k(n, p)} \frac{1}{2mp} \|AX - Y\|_F^2,$$
where
$$\mathcal{F}_k(n, p) = \left\{ X \in \mathbb{R}^{n \times p} : |\{i : x_{i,j} \neq 0\}| \leq k \ \forall j = 1, \dots, p \right\},$$
indicating that only $k$ of the learned regressors from $A$ are used to approximate each column of data in $Y$.
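Returning to models i), iii) and iv) above, the following sketch sets up the corresponding objectives and the gradient of the smooth part in NumPy; the data `A`, `y` are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 10
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
lam = 0.1

def f_ls(x):                      # least-squares loss (1/2m)||Ax - y||^2
    r = A @ x - y
    return r @ r / (2 * m)

def grad_f_ls(x):                 # its gradient A^T(Ax - y)/m
    return A.T @ (A @ x - y) / m

def f_tychonov(x):                # Tychonov-regularised objective
    return f_ls(x) + lam * (x @ x)

def f_lasso(x):                   # lasso objective (nonsmooth l1 term)
    return f_ls(x) + lam * np.abs(x).sum()
```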

Example 3. Matrix completion: Choose $\mathcal{V} = \mathbb{R}^{n \times p}$, $\mathcal{W} = \mathbb{R}$, $\mathcal{D} = \{(A_j, y_j) : j = 1, \dots, m\}$. Define the trace inner product on $\mathcal{V}$,
$$\langle \cdot, \cdot \rangle : \mathcal{V} \times \mathcal{V} \to \mathbb{R}, \qquad (X, Y) \mapsto \operatorname{tr}(X^T Y).$$
The data fitting problem is given by
$$\min_{X \in \mathcal{V}} \frac{1}{2m} \sum_{j=1}^{m} \left( \langle A_j, X \rangle - y_j \right)^2 + \lambda \|X\|_*,$$
where $\langle A_j, X \rangle$ models the sensing of $X$, $y_j$ the target value, $\lambda > 0$, and the nuclear norm $\|X\|_* = \sum_{i=1}^{p} \sigma_i$ is the sum of the singular values of $X$. Regularisation by the nuclear norm tends to yield a low-rank optimal solution if one exists. Typical choice $A_j = [a_{s,t}]$ with
$$a_{s,t} = \begin{cases} 1 & \text{if } s = s_j,\ t = t_j, \\ 0 & \text{otherwise} \end{cases}$$


for fixed $s_j, t_j$, so that $\langle A_j, X \rangle = x_{s_j, t_j}$ and the $y_j$ correspond to a few known entries of $X$. This problem may be seen as a sophisticated form of regression.
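For this typical choice of $A_j$, sensing reduces to reading off single entries of $X$; a minimal sketch (the observed index pairs `obs` are hypothetical placeholders):

```python
import numpy as np

n, p = 6, 4
X = np.arange(n * p, dtype=float).reshape(n, p)   # stand-in for the unknown matrix
obs = [(0, 1), (3, 2), (5, 0)]                    # observed index pairs (s_j, t_j)

for s, t in obs:
    A_j = np.zeros((n, p))
    A_j[s, t] = 1.0
    # trace inner product <A_j, X> = tr(A_j^T X) picks out the entry x_{s,t}
    assert np.trace(A_j.T @ X) == X[s, t]
```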
Example 4. Sparse inverse covariance estimation: $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \emptyset$, the $a_j$ are i.i.d. samples of a random vector $A : \Omega \to \mathbb{R}^n$ with $\mathbb{E}[A] = 0$ and $\operatorname{Cov}(A) = \mathbb{E}[A A^T] = \Sigma < \infty$ (all covariances are finite). The sample estimate of $\Sigma$ is $S = \frac{1}{m} \sum_{j=1}^{m} a_j a_j^T$. In many applications, components of $A$ are conditionally independent,
$$\mathcal{D}(A_j \mid A_i, \{A_k : k \neq i, j\}) = \mathcal{D}(A_j \mid \{A_k : k \neq i, j\}).$$
For example, when
$$A_i \xrightarrow{\text{causality}} A_k \xrightarrow{\text{causality}} A_j,$$
then $\operatorname{Cov}(A_i, A_j) \neq 0$ but $\mathcal{D}(A_j \mid A_i, A_k) = \mathcal{D}(A_j \mid A_k)$. Fact: If $A_i, A_j$ are conditionally independent, then $(\Sigma^{-1})_{i,j} = 0$ (component $(i,j)$ of $\Sigma^{-1}$ is zero). Therefore, we correct the sample estimate $S$ so that its inverse becomes sparse by solving the data fitting problem
$$\min_{X \in S\mathbb{R}^{n \times n},\, X \succeq 0} \langle S, X \rangle - \log \det X + \lambda \|X\|_1,$$

where $S\mathbb{R}^{n \times n}$ is the set of $n \times n$ real symmetric matrices, and $X \succeq 0$ means $X$ is positive semidefinite. In this example we extract structure from the data without any measurement, that is, $\mathcal{W} = \emptyset$ and
$$\begin{aligned}
L_{\mathcal{D}}(X) &= \langle S, X \rangle - \log \det X + \lambda \|X\|_1 \\
&= \frac{1}{m} \sum_{j=1}^{m} \langle a_j a_j^T, X \rangle - \log \det X + \lambda \|X\|_1 \\
&= \frac{1}{m} \sum_{j=1}^{m} a_j^T X a_j - \log \det X + \lambda \|X\|_1 \\
&= -\frac{2}{m} \log \left( \prod_{j=1}^{m} (2\pi)^{-n/2} \det(X^{-1})^{-1/2} \exp\left( -\frac{1}{2} a_j^T X a_j \right) \right) - n \log(2\pi) + \lambda \|X\|_1,
\end{aligned}$$
so that the first two terms of the objective represent a shift plus a negative multiple of the log-likelihood function for $X = \operatorname{Cov}(A)^{-1}$ under the parametric model that the data $a_j$ are i.i.d. samples of mean-centred multivariate Gaussian vectors. Note also that the barrier term $-\log \det X$ is used to keep $X$ away from the boundary and hence invertible. The Lasso term $\lambda \|X\|_1$ incentivises a sparse solution that allows one to identify the conditional dependencies between the variables if the reconstruction is successful.

Example 5 (Principal Component Analysis).

i) Robust PCA: We want to render classical PCA (see the prelims course on data analysis and statistics) robust to data outliers. Data space $\mathcal{V} = (\mathbb{R}^{m \times n}, \mathbb{R}^{m \times n})$, observation space $\mathcal{W} = \mathbb{R}$, data $\mathcal{D} = Y \in \mathbb{R}^{m \times n}$, the entire data matrix. The data fitting model is chosen as
$$\min_{L, S \in \mathbb{R}^{m \times n}} \|Y - (L + S)\|_F^2 + \lambda \|L\|_* + \gamma \|S\|_1,$$
where the nuclear norm $\|L\|_* = \sum_k \sigma_k$ encourages a low-rank component $L$ and $\|S\|_1 = \sum_{i,j} |S_{i,j}|$ encourages a sparse component $S$. The weights $\lambda$ and $\gamma$ trade off the rank of $L$, the sparsity of $S$, and how well their sum matches the data $Y$. The sparse component $S$ can be viewed as outlier errors in $Y$.
ii) Sparse principal component analysis: The first component of the PCA of a matrix $S \succeq 0$ is computed as follows,
$$\max_{v \in \mathbb{R}^n} v^T S v, \quad \text{subject to } \|v\|_2 = 1.$$
A sparse 1st principal component is defined in principle by
$$\max_{v} v^T S v, \quad \text{subject to } \|v\|_2 = 1,\ \|v\|_0 \leq k,$$
where $\|v\|_0 := |\{i : v_i \neq 0\}|$ is the number of nonzero components of $v$, with $k < n$. Since the latter problem is NP-hard, we replace it by its convex relaxation,
$$\max_{M \in S\mathbb{R}^n} \langle S, M \rangle, \quad \text{subject to } M \succeq 0,\ \langle I, M \rangle = 1,\ \|M\|_1 \leq \rho,$$
where $\langle S, M \rangle$ replaces $v^T S v = \operatorname{tr}(S v v^T)$ and $\langle I, M \rangle = 1$ replaces $1 = v^T v = \langle I, v v^T \rangle$.

Example 6 (Data Separation).

i) Support vector machines: $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \{+1, -1\}$, the problem of classifying data into two classes. The ansatz is to seek a separating hyperplane $\{a \in \mathbb{R}^n : a^T x - \beta = 0\}$ such that
$$y_j \times (a_j^T x - \beta) \geq 1. \tag{1}$$
We would like to use $\Phi(a; x, \beta) := \operatorname{sign}(a^T x - \beta)$ and recall that we want $\Phi(a_j; x, \beta) = y_j$ for as many data points $j$ as possible. To avoid having to solve an NP-hard problem and to find a robust solution, we maximise the separation (margin) between the two sets, which is obtained by solving
$$\min_{(x, \beta)} \|x\|_2^2, \quad \text{subject to (1)}. \tag{2}$$


But (2) may not have any solution $(x, \beta)$ at all that satisfies (1). Therefore, we solve the following data fitting model instead,
$$\min_{(x, \beta)} \frac{1}{m} \sum_{j=1}^{m} \max\left( 1 - y_j \times (a_j^T x - \beta),\, 0 \right) + \frac{\lambda}{2} \|x\|_2^2$$
$$\Leftrightarrow \quad \min_{x \in \mathbb{R}^n,\, \beta \in \mathbb{R},\, s \in \mathbb{R}^m} \frac{1}{m} \sum_{j=1}^{m} s_j + \frac{\lambda}{2} \|x\|_2^2$$
$$\text{subject to } s_j \geq 1 - y_j (a_j^T x - \beta),\ s_j \geq 0, \quad (j = 1, \dots, m).$$

ii) Nonlinear separation: the idea is to use a nonlinear transformation $\zeta : \mathbb{R}^n \to \tilde{\mathcal{V}}$, where $(\tilde{\mathcal{V}}, \langle \cdot, \cdot \rangle)$ is a Hilbert space. We replace (1) by
$$y_j \times (\langle \zeta(a_j), x \rangle - \beta) \geq 1 \tag{3}$$
and solve the data fitting model
$$\min_{x \in \tilde{\mathcal{V}},\, \beta \in \mathbb{R}} \frac{1}{m} \sum_{j=1}^{m} \max\left( 1 - y_j (\langle \zeta(a_j), x \rangle - \beta),\, 0 \right) + \frac{\lambda}{2} \langle x, x \rangle$$
$$\Leftrightarrow \quad \min_{x \in \tilde{\mathcal{V}},\, \beta \in \mathbb{R},\, s \in \mathbb{R}^m} \frac{1}{m} \sum_{j=1}^{m} s_j + \frac{\lambda}{2} \|x\|_{\tilde{\mathcal{V}}}^2$$
$$\text{subject to } s_j \geq 1 - y_j (\langle \zeta(a_j), x \rangle - \beta),\ s_j \geq 0, \quad (j = 1, \dots, m)$$
$$\overset{\text{(convex duality)}}{\Longleftrightarrow} \quad \min_{\alpha \in \mathbb{R}^m} \frac{1}{2} \alpha^T Q \alpha - \sum_{j=1}^{m} \alpha_j$$
$$\text{subject to } 0 \leq \alpha_j \leq \frac{1}{\lambda}, \quad (j = 1, \dots, m), \qquad y^T \alpha = 0,$$
where $Q = (q_{k,\ell}) \succeq 0$ in $S\mathbb{R}^{m \times m}$ is defined by $q_{k,\ell} = y_k y_\ell \langle \zeta(a_k), \zeta(a_\ell) \rangle$.

Note: to solve this convex quadratic programming problem (QP), there is no need to know $\zeta$ explicitly; only the ability to compute $K(a_k, a_\ell) := \langle \zeta(a_k), \zeta(a_\ell) \rangle$ is required, where $K : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is a kernel function, i.e., such that $(K(a_j, a_\ell))_{j=1:m,\, \ell=1:m} \succeq 0$ for all choices of vectors $\{a_1, \dots, a_m\} \subset \mathcal{V} = \mathbb{R}^n$. For example, use a Gaussian kernel $K(a_j, a_\ell) = \exp\left( -\|a_j - a_\ell\|_2^2 / 2\sigma^2 \right)$.
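A short sketch of how the dual QP data can be assembled with the Gaussian kernel, without ever forming $\zeta$ explicitly; the data `a`, `y` are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma = 20, 3, 1.0
a = rng.standard_normal((m, n))          # feature vectors a_1, ..., a_m
y = rng.choice([-1.0, 1.0], size=m)      # class labels in {+1, -1}

# Gaussian kernel matrix K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2 sigma^2))
sq_dists = ((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * sigma**2))

# Q_{k,l} = y_k y_l K(a_k, a_l), the matrix in the dual objective
Q = (y[:, None] * y[None, :]) * K
# The dual QP  min 0.5 a^T Q a - sum(a)  s.t.  0 <= a_j <= 1/lambda, y^T a = 0
# can now be passed to any off-the-shelf QP solver.
```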

Figure 1: The support vector machine model maximises the margin between two point clouds.

Example 7 (Multiclass Classification).

i) Logistic regression: $\mathcal{V} = \mathbb{R}^n$, $\mathcal{W} = \{e_1, \dots, e_M\} \subset \mathbb{R}^M$, where $y_j = e_i$ (the $i$-th canonical basis vector in $\mathbb{R}^M$) indicates that $a_j$ is in class $C_i$ for classes $C_1, \dots, C_M$. The multiclass classification problem is to output a probability map
$$\Phi(a; X) = \begin{pmatrix} p_1(a; X) \\ \vdots \\ p_M(a; X) \end{pmatrix}, \qquad p_i(a; X) = \mathbb{P}(a \in C_i).$$
We wish to calibrate the parametric models
$$p_i(a; X) = \frac{\exp(a^T x_{[i]})}{\sum_{k=1}^{M} \exp(a^T x_{[k]})}, \quad (i = 1, \dots, M)$$
such that
$$p_i(a_j; X) \approx \begin{cases} 1 & \text{if } y_j = e_i, \\ 0 & \text{if } y_j \neq e_i, \end{cases}$$
where $x_{[i]} \in \mathbb{R}^n$ $(i = 1, \dots, M)$ and $X = \{x_{[i]} : i = 1, \dots, M\}$. Idea: maximise the log-likelihood function, which leads to the data fitting model
$$\max_{X} \frac{1}{m} \sum_{j=1}^{m} \left( y_j^T \begin{pmatrix} x_{[1]}^T \\ \vdots \\ x_{[M]}^T \end{pmatrix} a_j - \log\left( \sum_{k=1}^{M} \exp(x_{[k]}^T a_j) \right) \right).$$
The objective is concave (its negative is convex) and in summation form. Note that if $y_j = e_i$ then
$$y_j^T \begin{pmatrix} x_{[1]}^T \\ \vdots \\ x_{[M]}^T \end{pmatrix} = x_{[i]}^T.$$
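A minimal sketch of the calibrated probability map as a numerically stable softmax; the weight matrix `X`, stacking the rows $x_{[i]}^T$, is a random placeholder:

```python
import numpy as np

def prob_map(a, X):
    """Softmax probabilities p_i(a; X) = exp(a^T x_[i]) / sum_k exp(a^T x_[k])."""
    z = X @ a                       # z_i = x_[i]^T a
    z = z - z.max()                 # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
M, n = 4, 6
X = rng.standard_normal((M, n))     # row i is x_[i]^T
a = rng.standard_normal(n)
p = prob_map(a, X)                  # p sums to 1, p_i estimates P(a in C_i)
```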


Figure 2: Nonlinear separation of point clouds.

ii) Nonlinear multiclass logistic regression via deep learning: Combines logistic regression with a nonlinear transformation of the inputs in $D \gg 1$ hidden layers $1, \dots, D$ of a neural network,
$$\begin{aligned}
a^0 &= a, \\
a^\ell &= \sigma(W^\ell a^{\ell-1} + g^\ell), \quad (\ell = 1, \dots, D - 1), \\
a^D &= W^D a^{D-1} + g^D,
\end{aligned}$$
where $W^\ell \in \mathbb{R}^{|a^\ell| \times |a^{\ell-1}|}$, $g^\ell \in \mathbb{R}^{|a^\ell|}$, and $\sigma$ is a nonlinear activation function, such as

• $\sigma(t) = \frac{1}{1 + \exp(-t)}$ (logistic function),
• $\sigma(t) = \max(t, 0)$ (hinge loss, also called ReLU),
• $\sigma(t) = \tanh(t)$ (smooth version of ReLU),
• $\sigma(t) = 1$ with probability $1/(1 + \exp(-t))$ and $0$ otherwise.

Let $w = \left( (W^1, g^1), \dots, (W^D, g^D) \right)$ be the parameters of the NN, and set $a^D = a^D(w)$ as a function of the network parameters. The data fitting model becomes
$$\max_{X, w} \frac{1}{m} \sum_{j=1}^{m} \left( y_j^T \begin{pmatrix} x_{[1]}^T \\ \vdots \\ x_{[M]}^T \end{pmatrix} a_j^D(w) - \log\left( \sum_{k=1}^{M} \exp(x_{[k]}^T a_j^D(w)) \right) \right).$$
The objective is still in summation form but now non-convex.


Figure 3: In this region we see one strict global minimiser and many local minimisers. Source: http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf

3 Optimisation Terminology and Prerequisites

3.1 Terminology
Consider the optimisation model
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } x \in \mathcal{F}. \tag{4}$$

Definition 1. A solution $x^* \in \mathcal{F}$ is

i) a local minimiser if $f(x^*) \leq f(x)$ $\forall x \in \mathcal{F} \cap B_\varepsilon(x^*)$ for some $\varepsilon > 0$,

ii) a strict local minimiser if $f(x^*) < f(x)$ $\forall x \in \mathcal{F} \cap B_\varepsilon(x^*) \setminus \{x^*\}$ for some $\varepsilon > 0$,

iii) a global minimiser if $f(x^*) \leq f(x)$ $\forall x \in \mathcal{F}$.

Definition 2.

i) $f : \mathbb{R}^n \to \mathbb{R}$ is convex if
$$f((1 - \alpha) x + \alpha y) \leq (1 - \alpha) f(x) + \alpha f(y), \quad \forall \alpha \in [0, 1],\ x, y \in \mathbb{R}^n. \tag{5}$$

ii) $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper convex if (5) holds and $f(x) < +\infty$ for at least one point $x$.

iii) For $\Omega \subset \mathbb{R}^n$, define the indicator function as follows,
$$I_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega, \\ +\infty & \text{otherwise.} \end{cases}$$
Then $\Omega \neq \emptyset$ is a convex set if and only if $I_\Omega$ is a proper convex function.


Figure 4: A strongly convex function.

iv) We say that (4) is a convex optimisation problem if $f$ and $\mathcal{F} \neq \emptyset$ are both convex. Then (4) is equivalent to the unconstrained problem
$$\min_{x \in \mathbb{R}^n} f(x) + I_{\mathcal{F}}(x)$$
with proper convex objective.

v) $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex with modulus of convexity $\gamma > 0$ if for all $\alpha \in [0, 1]$ and $x, y \in \mathbb{R}^n$,
$$f((1 - \alpha) x + \alpha y) \leq (1 - \alpha) f(x) + \alpha f(y) - \frac{1}{2} \gamma \alpha (1 - \alpha) \|x - y\|^2.$$

vi) $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if it is differentiable everywhere with $L$-Lipschitz continuous gradient, $L > 0$, that is,
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \forall x, y \in \mathbb{R}^n.$$

vii) $v \in \mathbb{R}^n$ is a subgradient of $f$ at $x$ if
$$f(y) \geq f(x) + v^T (y - x), \quad \forall y \in \mathbb{R}^n.$$

viii) The subdifferential $\partial f(x)$ of $f$ at $x$ is the set of subgradients of $f$ at $x$.


Figure 5: At points where f is not differentiable the subgradient is non-unique.

3.2 Prerequisites

Lemma 1.

i) If $a \in \partial f(x)$ and $b \in \partial f(y)$, then $(a - b)^T (x - y) \geq 0$.

ii) For any $\lambda \geq 0$, we have $\partial(\lambda f)(x) = \lambda \partial f(x)$.

iii) $\partial(f + g)(x) = \partial f(x) + \partial g(x)$, that is, for every $z \in \partial(f + g)(x)$ there exist $u \in \partial f(x)$ and $v \in \partial g(x)$ such that $z = u + v$.

Proposition 1. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper convex.

i) Then $x^* \in \mathbb{R}^n$ is a local minimiser if and only if it is a global minimiser.

ii) The set of minimisers
$$\arg\min_{x \in \mathbb{R}^n} f(x) := \{x^* : f(x^*) = \min_{x \in \mathbb{R}^n} f(x)\}$$
is convex.

iii) $x^* \in \arg\min_{x \in \mathbb{R}^n} f(x)$ if and only if $0 \in \partial f(x^*)$.¹

Proposition 2. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper convex.

i) If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

ii) If $|\partial f(x)| = 1$, then $f$ is differentiable at $x$.

¹ To simplify notation, we write $0$ instead of $\vec{0}$ for vectors of zeroes of size $n$ or any other size.


iii) If $f$ is $\gamma$-strongly convex and differentiable at $x \in \mathbb{R}^n$, then
$$\frac{\gamma}{2} \|y - x\|^2 \leq f(y) - f(x) - \nabla f(x)^T (y - x) \quad \forall y \in \mathbb{R}^n. \tag{6}$$

iv) If $f$ is $L$-smooth, then
$$f(y) - f(x) - \nabla f(x)^T (y - x) \leq \frac{L}{2} \|y - x\|^2 \quad \forall y \in \mathbb{R}^n. \tag{7}$$

Proposition 3. If $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex, then there exists a unique global minimiser.

For proofs of slightly weakened versions of Lemma 1 and Propositions 1–3, see the problem sheets.

3.3 Characterisation of Convergence Speed


To obtain a notion of convergence for iterative algorithms, we could show that $\|\nabla f(x^k)\| \to 0$ as $k \to \infty$, or $\operatorname{dist}(0, \partial f(x^k)) \to 0$ (where $\operatorname{dist}$ denotes the Euclidean distance), or $\|x^k - x^*\| \to 0$, or $f(x^k) \to f(x^*)$. The rate of convergence can be characterised in several different ways:

Definition 3. Let $(\phi_k)_{k \in \mathbb{N}} \subset \mathbb{R}_+$ be a sequence such that $\lim_{k \to \infty} \phi_k = 0$. We say that $(\phi_k)_{k \in \mathbb{N}}$ converges

i) Q-linearly if $\exists \sigma \in (0, 1)$ such that $\frac{\phi_{k+1}}{\phi_k} \leq 1 - \sigma$, $\forall k \in \mathbb{N}$,

ii) R-linearly if $\exists \sigma \in (0, 1)$ and $C > 0$ such that $\phi_k \leq C (1 - \sigma)^k$, $\forall k \in \mathbb{N}$,

iii) Q-superlinearly if $\lim_{k \to \infty} \frac{\phi_{k+1}}{\phi_k} = 0$,

iv) Q-quadratically if $\exists C > 0$ such that $\frac{\phi_{k+1}}{\phi_k^2} \leq C$, $\forall k$,

v) R-superlinearly if $\exists (\nu_k)_{k \in \mathbb{N}} \subset \mathbb{R}_+$ such that $(\nu_k)_{k \in \mathbb{N}} \to 0$ Q-superlinearly and $\phi_k \leq \nu_k$ for all $k \in \mathbb{N}$.

Lemma 2.
Q-linear convergence ⇒ R-linear convergence ⇏ Q-linear convergence,
Q-quadratic convergence ⇒ Q-superlinear convergence ⇒ R-superlinear convergence,
R-superlinear convergence ⇏ Q-superlinear convergence ⇏ Q-quadratic convergence.

See the problem sheets for a proof and counterexamples.

Definition 4. If $\phi_k \to 0$ more slowly than R-linearly, we say that the convergence is sublinear.

Example 8. The following sequences all converge to zero sublinearly,
$$\left( \frac{c}{\sqrt{k}} \right)_{k \in \mathbb{N}}, \quad \left( \frac{c}{k} \right)_{k \in \mathbb{N}}, \quad \left( \frac{c}{\ln(k + 1)} \right)_{k \in \mathbb{N}}.$$


4 Method of Steepest Descent

4.1 Specifications

Gradient methods solve unconstrained optimisation models of the form
$$\min_{x \in \mathbb{R}^n} f(x), \tag{8}$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth, via iterative updates of the form
$$x^{k+1} = x^k + \alpha_k d^k. \tag{9}$$
The scalar $\alpha_k > 0$ is called the step length, and $d^k \in \mathbb{R}^n$ is the search direction. The Method of Steepest Descent is a gradient method based on making the following choices: $d^k = -\nabla f(x^k)$ for all $k$ (the direction of steepest descent), and $\alpha_k \equiv L^{-1}$ for all $k$ (constant step length). This leads to the steepest descent updates
$$x^{k+1} = x^k - \frac{1}{L} \nabla f(x^k). \tag{10}$$
Our main concern is whether the choice of $\alpha_k$ is neither too short (leading to excessively slow convergence) nor too long (leading to zig-zagging behaviour). The updates (10) are motivated by the minimisation of the first order Taylor approximation bound
$$f(x + \alpha d) \leq f(x) + \alpha \nabla f(x)^T d + \frac{L}{2} \alpha^2 \|d\|^2 \tag{11}$$
at $x = x^k$ over $\alpha$ for fixed $d = -\nabla f(x)$; (11) is an immediate consequence of (7). Writing $\varphi(\alpha)$ for the r.h.s. of (11) with $d = -\nabla f(x)$, we obtain the convex quadratic
$$\varphi(\alpha) = f(x) - \alpha \|\nabla f(x)\|^2 + \frac{L \alpha^2}{2} \|\nabla f(x)\|^2.$$
Minimising $\varphi(\alpha)$ by solving $\varphi'(\alpha) = 0$ yields $\alpha^* = 1/L$.
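A minimal sketch of the short-step updates (10) in Python, on a small quadratic test problem where $L$ is available as the largest eigenvalue of the Hessian:

```python
import numpy as np

def steepest_descent(grad_f, x0, L, n_iters=500):
    """Short-step steepest descent: x^{k+1} = x^k - (1/L) grad f(x^k)."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - grad_f(x) / L
    return x

# Test problem: f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite,
# so grad f(x) = A x - b and L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(A).max()
x = steepest_descent(lambda x: A @ x - b, np.zeros(2), L)
# x approximates the minimiser A^{-1} b
```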
The following will be one of our main tools of convergence analysis for the
method of steepest descent:

Lemma 3 (Foundational Inequality of Steepest Descent). Starting steepest descent iterations at some arbitrary point $x^0$, we have
$$f(x^{k+1}) = f\left( x^k - \frac{1}{L} \nabla f(x^k) \right) \leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2 \quad \forall k.$$

Proof. Substitute $\alpha = 1/L$ and $d = -\nabla f(x^k)$ into (11).


4.2 Convergence Theory

Overview of the convergence results below:

• $(\nabla f(x^k))_{k \in \mathbb{N}} \to 0$ sublinearly with rate $O(1/\sqrt{k})$ for general $f$.

• $(f(x^k) - f(x^*))_{k \in \mathbb{N}} \to 0$ sublinearly with rate $O(1/k)$ for convex $f$.

• $(f(x^k) - f(x^*))_{k \in \mathbb{N}} \to 0$ Q-linearly with rate $(1 - \gamma/L)$ when $f$ is $\gamma$-strongly convex.

Theorem 1. For $f(x)$ $L$-smooth and bounded below by $\underline{f}$, the method of steepest descent with constant step size $\alpha_k = 1/L$ satisfies
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{2L \times (f(x^0) - \underline{f})}{T}}, \quad \forall T \geq 1.$$

Proof. Lemma 3 implies
$$\sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2 \leq 2L \sum_{k=0}^{T-1} \left[ f(x^k) - f(x^{k+1}) \right] = 2L \left[ f(x^0) - f(x^T) \right].$$
Therefore,
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{1}{T} \sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2} \leq \sqrt{\frac{2L \left[ f(x^0) - f(x^T) \right]}{T}},$$
and the claim follows from $f(x^T) \geq \underline{f}$.

Theorem 2. If $f$ is $L$-smooth and convex with minimiser² $x^*$, then the method of steepest descent with constant step size $\alpha_k = 1/L$ satisfies
$$f(x^T) - f(x^*) \leq \frac{L}{2T} \|x^0 - x^*\|^2, \quad \forall T \geq 1.$$

Proof. By convexity of $f$, $f(x^*) \geq f(x^k) + \nabla f(x^k)^T (x^* - x^k)$. Substitution into Lemma 3 yields
$$\begin{aligned}
f(x^{k+1}) &\leq f(x^*) + \nabla f(x^k)^T (x^k - x^*) - \frac{1}{2L} \|\nabla f(x^k)\|^2 && \text{(use upper bound on } f(x^k)\text{)} \\
&= f(x^*) + \frac{L}{2} \left( \|x^k - x^*\|^2 - \left\| x^k - x^* - \frac{1}{L} \nabla f(x^k) \right\|^2 \right) && \text{(expand to check)} \\
&= f(x^*) + \frac{L}{2} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right) && \text{(use def. of } x^{k+1}\text{)}.
\end{aligned}$$

² Note that the minimiser $x^*$ is not necessarily unique.


Repeated use of this bound in a telescoping sum yields
$$\sum_{k=0}^{T-1} \left( f(x^{k+1}) - f(x^*) \right) \leq \frac{L}{2} \sum_{k=0}^{T-1} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right) \leq \frac{L}{2} \|x^0 - x^*\|^2,$$
and since $f(x^{k+1}) \geq f(x^T)$ for $k + 1 \leq T$ by virtue of Lemma 3, this implies
$$f(x^T) - f(x^*) \leq \frac{1}{T} \sum_{k=0}^{T-1} \left( f(x^{k+1}) - f(x^*) \right) \leq \frac{L}{2T} \|x^0 - x^*\|^2.$$

Theorem 3. If $f$ is $L$-smooth and $\gamma$-strongly convex then there exists a unique minimiser $x^*$, and
$$f(x^{k+1}) - f(x^*) \leq \left( 1 - \frac{\gamma}{L} \right) \left( f(x^k) - f(x^*) \right) \quad \forall k.$$

Proof. For the uniqueness of $x^*$, see Proposition 3. Proposition 2 implies that for all $y \in \mathbb{R}^n$,
$$f(y) \geq f(x^k) + \nabla f(x^k)^T (y - x^k) + \frac{\gamma}{2} \|y - x^k\|^2.$$
Therefore,
$$\min_y f(y) \geq \min_y \left( f(x^k) + \nabla f(x^k)^T (y - x^k) + \frac{\gamma}{2} \|y - x^k\|^2 \right). \tag{12}$$
The arg min of the right hand side of (12) equals $y^* = x^k - (1/\gamma) \nabla f(x^k)$, and substitution back into (12) gives
$$f(x^*) \geq f(x^k) - \frac{1}{\gamma} \nabla f(x^k)^T \nabla f(x^k) + \frac{\gamma}{2} \left\| \frac{1}{\gamma} \nabla f(x^k) \right\|^2 = f(x^k) - \frac{1}{2\gamma} \left\| \nabla f(x^k) \right\|^2,$$
which in turn yields
$$\left\| \nabla f(x^k) \right\|^2 \geq 2\gamma \left( f(x^k) - f(x^*) \right). \tag{13}$$
Substituting into the bound of Lemma 3, we find
$$f(x^{k+1}) - f(x^*) \leq f(x^k) - f(x^*) - \frac{\gamma}{L} \left( f(x^k) - f(x^*) \right).$$

Figure 6: A step size αk that satisfies Conditions ii) and iii) can be found by
bisection.

4.3 Long-Step Descent Methods

We continue solving the unconstrained model (8) via iterative updates (9), but now with more general (and possibly much larger) step sizes $\alpha_k$ and search directions $d^k$ than before.

Throughout this section we assume $f(x)$ to be $L$-smooth, possibly nonconvex, and bounded below, $f(x) \geq \underline{f}$ for all $x \in \mathbb{R}^n$. The search directions $d^k$ are assumed to lie within a bounded angle of the steepest descent direction. Specifically, our assumptions on $d^k$ and $\alpha_k$ are as follows:

Assumption 1. There exist constants $0 < \eta \leq 1$ and $0 < c_1 < c_2 < 1$ such that for all $k \in \mathbb{N}$,

i) $\nabla f(x^k)^T d^k \leq -\eta \|\nabla f(x^k)\| \cdot \|d^k\|$ (sufficient local descent),

ii) $f(x^k + \alpha_k d^k) \leq f(x^k) + c_1 \alpha_k \nabla f(x^k)^T d^k$ (step not too long),

iii) $\nabla f(x^k + \alpha_k d^k)^T d^k \geq c_2 \nabla f(x^k)^T d^k$ (step not too short).

Note that Assumption i) is trivially met if we choose the steepest descent direction $d^k = -\nabla f(x^k)$, in which case $\eta = 1$. The condition allows us more generality in computing $d^k$, for example via an inexact computation of $-\nabla f(x^k)$ that is numerically cheaper. Assumptions ii) and iii) look very technical at first sight, but all we need to know for the purposes of this course is that there exist very simple and numerically cheap approximate 1-D optimisation schemes, called inexact line searches, that are able to produce values $\alpha_k$ satisfying both conditions. The additional generality is important in practical computations, because the Lipschitz constant $L$ is generally not known, so that taking constant step sizes $\alpha_k \equiv 1/L$ is largely a theoretical choice. Line searches also allow $\alpha_k$ to take much larger values than $1/L$ when warranted, so that the algorithm may make more rapid progress. A minimal bisection-based sketch of such a line search is given below.
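The following sketch produces such an $\alpha_k$ by expansion and bisection, in the spirit of Figure 6; the constants and interface are illustrative assumptions, not part of the notes:

```python
import numpy as np

def inexact_line_search(f, grad_f, x, d, c1=1e-4, c2=0.9, alpha0=1.0, max_iter=50):
    """Find alpha with  ii) f(x+ad) <= f(x) + c1*a*g^T d   (not too long)
    and               iii) grad f(x+ad)^T d >= c2*g^T d    (not too short),
    by expansion followed by bisection."""
    g_d = grad_f(x) @ d                 # directional derivative, < 0 for descent d
    lo, hi, alpha = 0.0, np.inf, alpha0
    for _ in range(max_iter):
        if f(x + alpha * d) > f(x) + c1 * alpha * g_d:
            hi = alpha                  # condition ii) fails: step too long
        elif grad_f(x + alpha * d) @ d < c2 * g_d:
            lo = alpha                  # condition iii) fails: step too short
        else:
            return alpha                # both conditions hold
        alpha = 2 * lo if hi == np.inf else (lo + hi) / 2
    return alpha

# Example on f(x) = 0.5 ||x||^2 with the steepest descent direction:
x = np.array([2.0, -1.0])
alpha = inexact_line_search(lambda z: 0.5 * z @ z, lambda z: z, x, -x)
```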
Theorem 4. Under Assumption 1, we have
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{L}{\eta^2 c_1 (1 - c_2)}} \times \sqrt{\frac{f(x^0) - \underline{f}}{T}}.$$

Proof. Assumption iii) implies
$$-(1 - c_2) \nabla f(x^k)^T d^k \overset{\text{Assn iii)}}{\leq} \left( \nabla f(x^k + \alpha_k d^k) - \nabla f(x^k) \right)^T d^k \overset{\text{L-smoothness \& C.S.}}{\leq} L \alpha_k \|d^k\|^2.$$
Solving for $\alpha_k$ yields
$$\alpha_k \geq -\frac{1 - c_2}{L \|d^k\|^2} \cdot \nabla f(x^k)^T d^k \overset{\text{Assn i)}}{\geq} \frac{\eta (1 - c_2) \|\nabla f(x^k)\|}{L \|d^k\|}.$$
Substitution into Assumption ii) gives
$$f(x^{k+1}) = f(x^k + \alpha_k d^k) \overset{\text{Assn ii) \& i)}}{\leq} f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \|\nabla f(x^k)\|^2, \tag{14}$$
which generalises Lemma 3, used in Theorem 1 for the analysis of the short-step case. The rest of the proof follows the blueprint of the proof of Theorem 1: Solving (14) for $\|\nabla f(x^k)\|^2$ yields
$$\|\nabla f(x^k)\|^2 \leq \frac{L}{c_1 (1 - c_2) \eta^2} \left( f(x^k) - f(x^{k+1}) \right).$$
Nota bene, this implies $f(x^k) - f(x^{k+1}) \geq 0$, so that we have descent in each iteration. Summing over $k$, we find
$$\sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2 \leq \frac{L}{c_1 (1 - c_2) \eta^2} \sum_{k=0}^{T-1} \left( f(x^k) - f(x^{k+1}) \right) = \frac{L}{c_1 (1 - c_2) \eta^2} \left( f(x^0) - f(x^T) \right) \leq \frac{L}{c_1 (1 - c_2) \eta^2} \left( f(x^0) - \underline{f} \right).$$
The result now follows from
$$\min_{0 \leq k \leq T-1} \|\nabla f(x^k)\| \leq \sqrt{\frac{1}{T} \sum_{k=0}^{T-1} \|\nabla f(x^k)\|^2}.$$


Figure 7: Construction of the Moreau envelope.

5 The Proximal Method

5.1 Proximal Operators

The following notion is useful in designing algorithms for minimising regularised problems:

Definition 5. Let $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper³ convex and closed⁴.

i) The Moreau envelope of $h$ associated with $\lambda > 0$ is the function
$$M_{\lambda, h} : \mathbb{R}^n \to \mathbb{R}, \qquad x \mapsto \frac{1}{\lambda} \inf_u \left( \lambda h(u) + \frac{1}{2} \|u - x\|^2 \right) = \inf_u \left( h(u) + \frac{1}{2\lambda} \|u - x\|^2 \right).$$
The Moreau envelope is a smoothed version of $h$.

ii) The proximal operator is given by
$$\operatorname{prox}_{\lambda h} : \mathbb{R}^n \to \mathbb{R}^n, \qquad x \mapsto \arg\min_u \left( \lambda h(u) + \frac{1}{2} \|u - x\|^2 \right).$$
The function is well defined by virtue of Proposition 3.

³ By definition, $h(u)$ is finite for at least one point $u \in \mathbb{R}^n$.
⁴ A function is closed if all sublevel sets $\{x : f(x) \leq \alpha\}$ are closed.


Figure 8: Soft and hard thresholding functions.

Example 9.

• $h(x) = 0$ has prox operator $\operatorname{prox}_{\lambda h}(x) = \arg\min_u \frac{1}{2} \|u - x\|^2 = x$, that is, the identity.

• $h(x) = I_\Omega(x)$, with $\Omega \subseteq \mathbb{R}^n$ closed convex, has prox operator
$$\operatorname{prox}_{\lambda I_\Omega}(x) = \arg\min_u \left( \lambda I_\Omega(u) + \frac{1}{2} \|u - x\|^2 \right) = \arg\min_{u \in \Omega} \|u - x\|^2,$$
that is, the projection of $x$ onto $\Omega$.

• $h(x) = \|x\|_1$ has prox operator with $i$-th component
$$\left( \operatorname{prox}_{\lambda \|\cdot\|_1}(x) \right)_i = \begin{cases} x_i - \lambda & \text{if } x_i > \lambda, \\ 0 & \text{if } x_i \in [-\lambda, \lambda], \\ x_i + \lambda & \text{if } x_i < -\lambda. \end{cases}$$
This is called the soft thresholding operator.

• $h(x) = \|x\|_0 := |\{i : x_i \neq 0\}|$ has prox operator with $i$-th component
$$\left( \operatorname{prox}_{\lambda \|\cdot\|_0}(x) \right)_i = \begin{cases} x_i & \text{if } |x_i| > \sqrt{2\lambda}, \\ 0 & \text{if } |x_i| < \sqrt{2\lambda}. \end{cases}$$
This is called hard thresholding.
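Both thresholding operators in a few lines of vectorised NumPy:

```python
import numpy as np

def prox_l1(x, lam):
    """Soft thresholding: componentwise prox of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    """Hard thresholding: componentwise prox of lam * ||.||_0."""
    return np.where(np.abs(x) > np.sqrt(2 * lam), x, 0.0)

x = np.array([-2.0, -0.3, 0.1, 1.5])
print(prox_l1(x, 0.5))   # -2 -> -1.5; -0.3, 0.1 -> 0; 1.5 -> 1.0
print(prox_l0(x, 0.5))   # entries with |x_i| <= 1 are zeroed out
```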

Proposition 4. Let $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper convex and closed. For fixed $x \in \mathbb{R}^n$, let $\Phi(u) := \lambda h(u) + \frac{1}{2} \|u - x\|^2$. Then the following properties hold.

i) $0 \in \lambda \partial h\left( \operatorname{prox}_{\lambda h}(x) \right) + \operatorname{prox}_{\lambda h}(x) - x$.

ii) $M_{\lambda, h}(x) < +\infty$ for all $x \in \mathbb{R}^n$, even if $h(x) = +\infty$.

iii) $M_{\lambda, h}(x)$ is convex in $x$.

iv) $M_{\lambda, h}(x)$ is differentiable in $x$, with gradient
$$\nabla M_{\lambda, h}(x) = \frac{1}{\lambda} \left( x - \operatorname{prox}_{\lambda h}(x) \right).$$

v) $x^* \in \arg\min_x h(x)$ if and only if $x^* \in \arg\min_x M_{\lambda, h}(x)$.

vi) $\|\operatorname{prox}_{\lambda h}(x) - \operatorname{prox}_{\lambda h}(y)\| \leq \|x - y\|$ for all $x, y \in \mathbb{R}^n$.

See problem sheets for a proof.

5.2 The Prox-Gradient Method

We now extend the scope of problems to solve and consider the regularised optimisation model
$$\min_{x \in \mathbb{R}^n} \phi(x) := f(x) + \lambda \psi(x), \tag{15}$$
where $f$ is $L$-smooth and convex, and $\psi$ is convex and closed. This includes the case of constrained optimisation, as
$$\min_{x \in \Omega} f(x) \iff \min_{x \in \mathbb{R}^n} f(x) + \lambda I_\Omega(x),$$
where the equivalence is understood in the sense that the argmins are the same in both cases.

To construct an iterative algorithm for Model (15), we use the following template:

1. $y^{k+1} = x^k - \alpha_k \nabla f(x^k)$,
2. $x^{k+1} = \arg\min_z \frac{1}{2} \|z - y^{k+1}\|^2 + \alpha_k \lambda \psi(z)$,

that is, we minimise $\psi(z)$ while staying close to the steepest descent update $y^{k+1}$. Note the factor $\alpha_k$ in front of $\lambda \psi(z)$, which avoids $x^{k+1} \to y^{k+1}$ when $\alpha_k \to \infty$; both terms on the r.h.s. of the second step should be of the same order. We may also express the update of the template algorithm as
$$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda \psi} \left( x^k - \alpha_k \nabla f(x^k) \right). \tag{16}$$
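Specialising (16) to the lasso model from Example 2 iv), with $\psi = \|\cdot\|_1$ and the soft-thresholding prox of Example 9, gives the classical iterative soft-thresholding scheme; a minimal sketch with constant step length $1/L$:

```python
import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_gradient_lasso(A, y, lam, n_iters=500):
    """Prox-gradient method for  min (1/2m)||Ax - y||^2 + lam ||x||_1."""
    m, n = A.shape
    L = np.linalg.eigvalsh(A.T @ A).max() / m        # Lipschitz constant of grad f
    alpha = 1.0 / L                                  # constant step length
    x = np.zeros(n)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y) / m                 # forward gradient step on f
        x = prox_l1(x - alpha * grad, alpha * lam)   # backward prox step on lam*psi
    return x
```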

An alternative interpretation and motivation of the update (16) is as follows:
$$x^{k+1} = \arg\min_z f(x^k) + \nabla f(x^k)^T (z - x^k) + \frac{1}{2 \alpha_k} \|z - x^k\|^2 + \lambda \psi(z).$$
Now note that $f(x^k) + \nabla f(x^k)^T (z - x^k)$ is the 1st order Taylor approximation of $f(z)$ around $x^k$, so that $f(x^k) + \nabla f(x^k)^T (z - x^k) + \frac{1}{2 \alpha_k} \|z - x^k\|^2$ is a simplified 2nd order model of $f(z)$ that does not depend on 2nd order derivatives of $f$. The updates are thus obtained by iteratively building very simple 2nd order models by which $f$ is replaced in (15) to obtain subproblems that are easier to solve.
Example 10. Unregularised unconstrained minimisation $\min_x f(x)$: this case reduces to $\psi(x) \equiv 0$ and the steepest descent update
$$x^{k+1} = \operatorname{prox}_0 \left( x^k - \alpha_k \nabla f(x^k) \right) = \arg\min_z \frac{1}{2} \left\| z - \left( x^k - \alpha_k \nabla f(x^k) \right) \right\|^2 = x^k - \alpha_k \nabla f(x^k).$$

Example 11. Constrained convex minimisation $\min_x f(x)$ subject to $x \in \Omega$ with $\Omega \subset \mathbb{R}^n$ convex: this case reduces to $\psi(x) = I_\Omega(x)$ and prox updates
$$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda I_\Omega} \left( x^k - \alpha_k \nabla f(x^k) \right) = \arg\min_{z \in \Omega} \frac{1}{2} \left\| z - \left( x^k - \alpha_k \nabla f(x^k) \right) \right\|^2,$$
which is the orthogonal projection of the steepest descent update $x^k - \alpha_k \nabla f(x^k)$ onto the feasible set $\Omega$.
Definition 6. The gradient map of model (15) is the map
$$G_\alpha : \mathbb{R}^n \to \mathbb{R}^n, \qquad x \mapsto \frac{1}{\alpha} \left( x - \operatorname{prox}_{\alpha \lambda \psi} (x - \alpha \nabla f(x)) \right).$$
We note that the notion of the gradient map allows us to write the prox update (16) as
$$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda \psi} \left( x^k - \alpha_k \nabla f(x^k) \right) = x^k - \alpha_k G_{\alpha_k}(x^k), \tag{17}$$
so that in the proximal method $G_{\alpha_k}(x^k)$ takes the role of $\nabla f(x^k)$ in comparison with the method of steepest descent.
Lemma 4 (Foundational Inequality of the Prox-Method). The gradient map has the following properties:

i) $G_\alpha(x) \in \nabla f(x) + \lambda \partial \psi (x - \alpha G_\alpha(x))$.

ii) For all $\alpha \in \left( 0, \frac{1}{L} \right]$ and $z \in \mathbb{R}^n$,
$$\phi(x - \alpha G_\alpha(x)) \leq \phi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2. \tag{18}$$
Notes:

• Property i) shows that (17) has an interpretation as a generalised subgradient step, with the gradient map replacing the gradient, with the following subtlety: $-\alpha_k G_{\alpha_k}(x^k)$ consists of an explicit forward descent step for $f$ and an implicit backward ascent step for $\lambda \psi$.

• Property ii) holds true for all $z \in \mathbb{R}^n$. The term $\phi(z) + G_\alpha(x)^T (x - z)$ can be seen as a backward first order model of $\phi(x)$. Using (17) and setting $x = x^k$ and $\alpha = \alpha_k$, (18) reads
$$\phi(x^{k+1}) \leq \phi(z) + G_{\alpha_k}(x^k)^T (x^k - z) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2.$$
In particular, setting $z = x^k$ we obtain
$$\phi(x^{k+1}) \leq \phi(x^k) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2. \tag{19}$$
Compare this result to the foundational inequality of the short-step steepest descent, i.e., Lemma 3, which reads
$$f(x^{k+1}) \leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2. \tag{20}$$
Note that when $\alpha_k = 1/L$, (19) becomes (20) with $f(x)$ replaced by the merit function $\phi(x)$ and $\nabla f(x)$ by the gradient map. Inequality (18) thus generalises the foundational inequality and can be used the same way to analyse the convergence of the prox-gradient method as Lemma 3 was used in the analysis of the Method of Steepest Descent.

Proof. Using $x - \alpha G_\alpha(x) = \operatorname{prox}_{\alpha \lambda \psi}(x - \alpha \nabla f(x)) = \arg\min_u \Xi(u)$, where $\Xi(u) := \alpha \lambda \psi(u) + \frac{1}{2} \|u - (x - \alpha \nabla f(x))\|^2$ is convex, it must be true that
$$0 \in \partial \Xi \left( \operatorname{prox}_{\alpha \lambda \psi}(x - \alpha \nabla f(x)) \right) = \partial \Xi (x - \alpha G_\alpha(x)) = \alpha \lambda \partial \psi (x - \alpha G_\alpha(x)) + (x - \alpha G_\alpha(x)) - (x - \alpha \nabla f(x)).$$
Therefore, $\alpha G_\alpha(x) \in \alpha \lambda \partial \psi (x - \alpha G_\alpha(x)) + \alpha \nabla f(x)$, as claimed in i).

Further, since $f$ is $L$-smooth, (7) implies
$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2, \quad \forall y \in \mathbb{R}^n.$$
Substituting $y = x - \alpha G_\alpha(x)$ yields
$$f(x - \alpha G_\alpha(x)) \leq f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{L \alpha^2}{2} \|G_\alpha(x)\|^2 \leq f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{\alpha}{2} \|G_\alpha(x)\|^2, \tag{21}$$
where the last inequality follows from the assumption that $\alpha \in (0, 1/L]$. On the other hand, the convexity of $f(x)$ implies (see Definition 2 and properties of the subdifferential)
$$f(z) \geq f(x) + \nabla f(x)^T (z - x). \tag{22}$$
By virtue of part i) we also have $G_\alpha(x) - \nabla f(x) \in \lambda \partial \psi (x - \alpha G_\alpha(x))$, so that by the definition of subgradients we have
$$\lambda \psi(z) \geq \lambda \psi (x - \alpha G_\alpha(x)) + \left[ G_\alpha(x) - \nabla f(x) \right]^T \left( z - (x - \alpha G_\alpha(x)) \right). \tag{23}$$
It follows that
$$\begin{aligned}
\phi(x - \alpha G_\alpha(x)) &\overset{\text{def of } \phi}{=} f(x - \alpha G_\alpha(x)) + \lambda \psi (x - \alpha G_\alpha(x)) \\
&\overset{(21)}{\leq} f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 + \lambda \psi (x - \alpha G_\alpha(x)) \\
&\overset{(22)}{\leq} f(z) - \nabla f(x)^T (z - x + \alpha G_\alpha(x)) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 + \lambda \psi (x - \alpha G_\alpha(x)) \\
&\overset{(23)}{\leq} f(z) - \nabla f(x)^T (z - x + \alpha G_\alpha(x)) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 + \lambda \psi(z) - \left[ G_\alpha(x) - \nabla f(x) \right]^T (z - (x - \alpha G_\alpha(x))) \\
&= f(z) + \lambda \psi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2 \\
&= \phi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2,
\end{aligned}$$
as claimed in part ii).

5.3 Convergence Theory

Theorem 5. If Model (15) has a minimiser $x^*$ and the prox-gradient algorithm is run with constant step length $\alpha_k \equiv 1/L$ starting from an initial point $x^0$, then
$$\phi(x^k) - \phi(x^*) \leq \frac{L}{2k} \|x^0 - x^*\|^2, \quad \forall k \geq 1.$$

Proof. Equation (19) establishes that $(\phi(x^k))_{k \in \mathbb{N}}$ is monotone decreasing,
$$\phi(x^{k+1}) \leq \phi(x^k), \quad \forall k. \tag{24}$$
Further, using $z = x^*$ in Lemma 4 ii), we have
$$\begin{aligned}
0 \leq \phi(x^{k+1}) - \phi(x^*) &\leq G_{\alpha_k}(x^k)^T (x^k - x^*) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2 \\
&= \frac{1}{2 \alpha_k} \left( \|x^k - x^*\|^2 - \|x^k - x^* - \alpha_k G_{\alpha_k}(x^k)\|^2 \right) \\
&= \frac{1}{2 \alpha_k} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right). \tag{25}
\end{aligned}$$
Therefore, $(\|x^k - x^*\|)_{k \in \mathbb{N}}$ is also monotone decreasing.


Combining both insights, we find
$$K \times \left( \phi(x^K) - \phi(x^*) \right) \overset{(24)}{\leq} \sum_{k=0}^{K-1} \left( \phi(x^{k+1}) - \phi(x^*) \right) \overset{(25),\ \alpha_k = L^{-1}}{\leq} \frac{L}{2} \sum_{k=0}^{K-1} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right) \overset{\text{telescoping sum}}{\leq} \frac{L}{2} \|x^0 - x^*\|^2.$$

6 Acceleration of Gradient Methods

6.1 Summary of Complexity Results Seen so Far


• The Method of Steepest Descent solves
$$\min_x f(x)$$
  – in sublinear $O(1/\sqrt{k})$ time when $f$ is smooth (but possibly nonconvex) and bounded below, both in short and long step variants,
  – in sublinear $O(1/k)$ time when $f$ is smooth, convex and bounded below,
  – in linear time with rate $\left( 1 - \frac{\gamma}{L} \right)$ when $f$ is $L$-smooth and $\gamma$-strongly convex.

• The Prox-Gradient Method solves
$$\min_x f(x) + \lambda \psi(x)$$
in sublinear $O(1/k)$ time when $f$ is smooth and convex and $\psi$ is convex and closed.

Accelerated gradient methods are designed to obtain faster, in some cases provably optimal, convergence rates.

6.2 The Heavy Ball Method


This algorithm is the easiest to understand among accelerated gradient methods, but it is designed for the somewhat restricted scope of minimising convex quadratic functions,
$$\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^T A x - b^T x, \tag{26}$$
where $A$ is symmetric with eigenvalues in $[\gamma, L]$ for some $0 < \gamma < L$, so that $f(x)$ is $L$-smooth and $\gamma$-strongly convex.

Figure 9: Heavy ball updates with momentum term.

The iterative updates are defined as
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1}),$$
consisting of a steepest descent step $x^k - \alpha_k \nabla f(x^k)$ and an additional momentum term $\beta_k (x^k - x^{k-1})$ for some $\beta_k > 0$. To avoid the requirement of two starting points $x^{-1}, x^0$, the first iteration takes a steepest descent update. The following theorem is given without proof:
Theorem 6. There exists a constant $C > 0$ such that the Heavy Ball Method, applied to Model (26) with constant step lengths
$$\alpha_k \equiv \alpha = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k \equiv \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$
satisfies $\|x^k - x^*\| \leq C \beta^k$ for all $k$.
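A sketch of the heavy ball iteration with the constant parameters from Theorem 6, on the quadratic model (26); the test matrix is an arbitrary placeholder:

```python
import numpy as np

def heavy_ball(A, b, n_iters=200):
    """Heavy ball method for f(x) = 0.5 x^T A x - b^T x."""
    eigs = np.linalg.eigvalsh(A)
    gamma, L = eigs.min(), eigs.max()
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = np.zeros_like(b)
    x = x_prev - (A @ x_prev - b) / L          # first step: steepest descent
    for _ in range(n_iters):
        grad = A @ x - b
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x
    return x

A = np.diag(np.linspace(0.01, 1.0, 20))        # ill-conditioned: gamma/L = 0.01
b = np.ones(20)
x = heavy_ball(A, b)                           # approximates A^{-1} b
```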

As an immediate consequence of Theorem 6 we find that
$$\begin{aligned}
f(x^k) - f(x^*) &\leq \nabla f(x^*)^T (x^k - x^*) + \frac{L}{2} \|x^k - x^*\|^2 && \text{(by (7))} \\
&\leq \frac{L C^2}{2} \beta^{2k} && \text{(since } \nabla f(x^*) = 0 \text{, and by Theorem 6)} \\
&\approx \frac{L C^2}{2} \left( 1 - 2 \sqrt{\frac{\gamma}{L}} \right)^{2k} && \text{(approximation for } L \gg \gamma \text{)}.
\end{aligned}$$
The case where $L \gg \gamma$ is especially interesting, as this corresponds to ill-conditioned $A$, for which Model (26) is numerically difficult to solve. In this case the Heavy Ball Method has a much faster convergence rate than the Method of Steepest Descent, which as we saw in Theorem 3 satisfies
$$f(x^k) - f(x^*) \leq \left( f(x^0) - f(x^*) \right) \times \left( 1 - \frac{\gamma}{L} \right)^k.$$


Figure 10: Heavy ball updates converge faster due to reduced zig-zagging.

We conclude that the Heavy Ball Method also converges linearly, but at a faster rate than the Method of Steepest Descent.

Example 12. If $\gamma = 0.01 L$, which corresponds to a matrix $A$ with moderately sized condition number of $100$, we have $(1 - 2\sqrt{\gamma/L})^2 = 0.64$. Hence, the Heavy Ball Method shrinks $f(x^k) - f(x^*)$ by at least a third in each iteration, while the steepest descent method shrinks $f(x^k) - f(x^*)$ only by $1\%$, since $(1 - \gamma/L) = 0.99$.

6.3 Nesterov Acceleration


Nesterov’s Accelerated Gradient Method is a modification of the Heavy Ball Method
that works for general strongly convex objectives to solve
min f (x) (27)
x∈Rn

via iterative updates


xk+1 = xk − αk ∇f xk + βk (xk − xk−1 ) + βk (xk − xk−1 ),

(28)
which, compared to the heavy ball updates
xk+1 = xk − αk ∇f xk + βk (xk − xk−1 ),

(29)
only differ in the point where the gradient ∇f is evaluated: in the case of Nes-
terov’s method this point contains the momentum term also. To avoid the re-
quirement of two starting points, the first update is again taken as the steepest
descent step. The next result is given without proof:
Theorem 7. For $f$ $L$-smooth and $\gamma$-strongly convex, Nesterov's Accelerated Gradient Method applied to Model (27) from a given starting point $x^0 \in \mathbb{R}^n$ and with constant step lengths
$$\alpha_k \equiv \alpha = \frac{1}{L}, \qquad \beta_k \equiv \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}}$$
satisfies
$$f(x^k) - f(x^*) \leq \frac{L + \gamma}{2} \|x^0 - x^*\|^2 \times \left( 1 - \sqrt{\frac{\gamma}{L}} \right)^k \quad \forall k \geq 1.$$

Theorem 7 shows that the Accelerated Gradient Method with constant step size again converges linearly, with a faster rate than the Method of Steepest Descent, as
$$1 - \sqrt{\frac{\gamma}{L}} < 1 - \frac{\gamma}{L}.$$

Example 13. For $\gamma = 0.01 L$ we have $(1 - \sqrt{\gamma/L}) = 0.9$, which compares favourably with the decrease rate of $(1 - \gamma/L) = 0.99$ for steepest descent. It takes more than 10 steepest descent iterations to guarantee the same relative decrease in $f(x^k) - f(x^*)$ as a single iteration of Nesterov's Method.

6.4 Nesterov Acceleration of L-Smooth Convex Functions

We will now discuss a variant of Nesterov's Method for the unconstrained minimisation of an $L$-smooth convex function $f(x)$. By (7), $f(x)$ is bounded from above by simple quadratic models:
$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2. \tag{30}$$
The Lipschitz constant is only assumed to exist but is not necessarily known. In contrast to the case characterised by Theorem 7, the step size varies in each iteration.

This method has an iteration complexity of $O(\varepsilon^{-1/2})$, that is, there exists a constant $C > 0$ (which depends of course on the starting point) such that for any given error tolerance $\varepsilon > 0$ it takes only $C/\sqrt{\varepsilon}$ iterations to guarantee that $f(x^k) - f(x^*) < \varepsilon$. It is known theoretically that this iteration complexity cannot be improved, thus Nesterov's Method is optimal from an iteration complexity perspective.

The algorithm proceeds via two sequences of iterates, $x^k$ and $y^k$, whose computation is interlaced. The points $x^k$ are obtained as the minimisers of a quadratic upper bound function, as discussed earlier in the course, while the iterates $y^k$ are obtained as approximate second order corrections to the points $x^k$.

Algorithm 1 (Nesterov Acceleration for L-Smooth Convex Minimisation).

// initialisation
find $z \neq y^0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y^0 - z\|}{\|\nabla f(y^0) - \nabla f(z)\|}$;
$k := 0$, $\lambda_0 := 1$, $x^{-1} := y^0$;
// main body
while $\|x^k - x^{k-1}\| > \text{tol}$ do
    find minimal $i \in \mathbb{N}$ such that
    $$f\left( y^k - 2^{-i} \alpha_{k-1} \nabla f(y^k) \right) \leq f(y^k) - 2^{-(i+1)} \alpha_{k-1} \|\nabla f(y^k)\|^2; \tag{31}$$
    $\alpha_k := 2^{-i} \alpha_{k-1}$;
    $x^k := y^k - \alpha_k \nabla f(y^k)$;
    $\lambda_{k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda_k^2 + 1} \right)$;
    $y^{k+1} := x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} \left( x^k - x^{k-1} \right)$;
    $k \leftarrow k + 1$;
end
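A direct transcription of Algorithm 1 into Python; the test problem and tolerance are illustrative placeholders:

```python
import numpy as np

def nesterov_lsmooth(f, grad_f, y0, z, tol=1e-8, max_iter=1000):
    """Nesterov acceleration for L-smooth convex f with L unknown (Algorithm 1)."""
    # initialisation: overestimate the optimal step length 1/L
    alpha = np.linalg.norm(y0 - z) / np.linalg.norm(grad_f(y0) - grad_f(z))
    lam, x_prev, y = 1.0, y0.copy(), y0.copy()
    for _ in range(max_iter):
        g = grad_f(y)
        # backtracking: halve alpha until the sufficient decrease (31) holds
        while f(y - alpha * g) > f(y) - 0.5 * alpha * (g @ g):
            alpha /= 2
        x = y - alpha * g
        lam_next = 0.5 * (1 + np.sqrt(4 * lam**2 + 1))
        y = x + ((lam - 1) / lam_next) * (x - x_prev)
        if np.linalg.norm(x - x_prev) <= tol:
            return x
        x_prev, lam = x, lam_next
    return x

# Example: min 0.5 x^T A x - b^T x on an ill-conditioned diagonal matrix
A = np.diag(np.linspace(0.01, 1.0, 10))
b = np.ones(10)
x0 = np.zeros(10)
x = nesterov_lsmooth(lambda v: 0.5 * v @ (A @ v) - b @ v,
                     lambda v: A @ v - b, x0, z=x0 + 1.0)
```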

A few remarks are in order:

• Since $y^0 \neq z$, we have $\|\nabla f(y^0) - \nabla f(z)\| \leq L \|y^0 - z\|$, and hence $\alpha_{-1} \geq L^{-1}$ is an overestimate of the optimal step length. This guarantees that the steps are not unnecessarily short initially.

• Subsequently, $\alpha_{k+1} \leq \alpha_k$ for all $k$. The step length approaches the unknown optimal step length $L^{-1}$ from above until some iteration $k_0$ when $\alpha_{k_0} \leq L^{-1}$ occurs for the first time. From that point on, in all subsequent iterations $k > k_0$, Equation (30) implies that
$$f\left( y^k - \alpha_{k-1} \nabla f(y^k) \right) \leq f(y^k) + \langle \nabla f(y^k), -\alpha_{k-1} \nabla f(y^k) \rangle + \frac{L}{2} \|\alpha_{k-1} \nabla f(y^k)\|^2 \leq f(y^k) - \frac{\alpha_{k-1}}{2} \|\nabla f(y^k)\|^2.$$
The sufficient decrease condition (31) is therefore satisfied with $i = 0$, so that $\alpha_k = \alpha_{k-1} = \dots = \alpha_{k_0}$ ceases to decrease any further, and we have
$$\frac{L^{-1}}{2} \leq \alpha_k \leq L^{-1}, \quad \forall k \geq k_0, \tag{32}$$
where the lower bound inequality holds for all $k \in \mathbb{N}$. Note that $\alpha \leq L^{-1}$ is exactly the condition we need to ensure that
$$m_{k, \alpha}(y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1}}{2} \|y - y^k\|^2$$
is an upper bound function on $f(y)$. If we had a priori knowledge of $L$, we could simply set $\alpha = L^{-1}$ at the outset, but Algorithm 1 is designed for the case where $L$ is not known explicitly, and it learns an appropriate value of $\alpha$ from recycled information, that is, from quantities that have to be computed anyway as the iterations proceed. The backtracking decreases the value of $\alpha$ at most
$$\lfloor \log_2(2 L \alpha_{-1}) \rfloor$$
times before the final value is attained.


• The point $x^k$ is simply the global minimiser of
$$m_{k, \alpha}(x) = f(y^k) + \langle \nabla f(y^k), x - y^k \rangle + \frac{\alpha^{-1}}{2} \|x - y^k\|^2.$$

• The update $\lambda_{k+1}$ is obtained by solving the quadratic equation
$$\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2. \tag{33}$$
It follows by induction that
$$\lambda_{k+1} \geq \lambda_k + \frac{1}{2} \geq 1 + \frac{k+1}{2}, \tag{34}$$
and furthermore,
$$\frac{\lambda_0 - 1}{\lambda_1} = 0, \qquad \frac{\lambda_k - 1}{\lambda_{k+1}} < 1 \quad \forall k \in \mathbb{N}, \qquad \lim_{k \to \infty} \frac{\lambda_k - 1}{\lambda_{k+1}} = 1.$$

• The step from $x^k$ to $y^{k+1}$ is designed to go along the "ravine" in ill-conditioned problems that have a long, narrow valley of near-optimal points. Pure gradient descent methods tend to stall in such situations and jump back and forth across the ravine. To speed up the convergence, a step along the ravine is required. The vector $x^k - x^{k-1}$ asymptotically points in the right direction, and the step size factor $\frac{\lambda_k - 1}{\lambda_{k+1}}$ is designed to start out timidly and to become asymptotically bolder.

Let us now analyse the convergence of Algorithm 1. Note that asymptotically we have $x^k - y^k \to 0$, so that it does not matter whether we analyse the algorithm in terms of the points $y^k$ or $x^k$. It turns out to be easiest to carry out the analysis in terms of the vectors
$$p^k = (\lambda_k - 1)(x^{k-1} - x^k). \tag{35}$$


Theorem 8. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex $L$-smooth function that has at least one finite minimiser $x^* \in \arg\min f(x)$ and a finite minimum $f^* = \min f(x)$. Then the sequence $(x^k)_{k \in \mathbb{N}}$ of iterates produced by Algorithm 1 satisfies
$$f(x^k) \leq f^* + \frac{4 L \|x^0 - x^*\|^2}{(k + 2)^2}$$
for all $k \in \mathbb{N}$. Hence, $x^k$ is $\varepsilon$-optimal for all
$$k \geq N(\varepsilon) = \left\lfloor \frac{2 \sqrt{L} \times \|x^0 - x^*\|}{\sqrt{\varepsilon}} \right\rfloor - 1.$$

Proof. Recall that we have
$$y^{k+1} = x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} (x^k - x^{k-1}) = x^k - \frac{1}{\lambda_{k+1}} (\lambda_k - 1)(x^{k-1} - x^k) = x^k - \frac{1}{\lambda_{k+1}} p^k, \tag{36}$$
and
$$x^{k+1} = y^{k+1} - \alpha_{k+1} \nabla f(y^{k+1}) = x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} (x^k - x^{k-1}) - \alpha_{k+1} \nabla f(y^{k+1}),$$
which implies
$$0 = \lambda_{k+1} (x^k - x^{k+1}) + (\lambda_k - 1)(x^k - x^{k-1}) - \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}). \tag{37}$$
Using the auxiliary vectors $p^k$ defined in (35), we have
$$\begin{aligned}
p^{k+1} - x^{k+1} &= (\lambda_{k+1} - 1)(x^k - x^{k+1}) - x^{k+1} \\
&= \lambda_{k+1} (x^k - x^{k+1}) - x^k \\
&\overset{(37)}{=} (\lambda_k - 1)(x^{k-1} - x^k) - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}) \\
&= p^k - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}).
\end{aligned}$$
It follows that
$$\begin{aligned}
\|p^{k+1} - x^{k+1} + x^*\|^2 &= \|p^k - x^k + x^*\|^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k - x^k + x^* \rangle \\
&= \|p^k - x^k + x^*\|^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 + 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle \\
&\qquad + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^* - x^k + \lambda_{k+1}^{-1} p^k \rangle \\
&\overset{(36)}{=} \|p^k - x^k + x^*\|^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 + 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle \\
&\qquad + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle. \tag{38}
\end{aligned}$$


Next, we will derive bounds on the last two terms in the right hand side of (38). To bound the last term, we use convexity of $f$, which implies $f(x^*) \geq f(y^{k+1}) + \langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle$, and hence
$$f(y^{k+1}) - f^* \leq -\langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle. \tag{39}$$
On the other hand, the sufficient decrease condition (31) yields
$$f(x^k) = f\left( y^k - \alpha_k \nabla f(y^k) \right) \leq f(y^k) - \frac{\alpha_k}{2} \|\nabla f(y^k)\|^2,$$
and at iteration $k + 1$ this implies
$$f(y^{k+1}) \geq f(x^{k+1}) + \frac{\alpha_{k+1}}{2} \|\nabla f(y^{k+1})\|^2. \tag{40}$$
Substitution into (39) yields
$$\langle \nabla f(y^{k+1}), x^* - y^{k+1} \rangle \leq -\left( f(x^{k+1}) - f^* \right) - \frac{\alpha_{k+1}}{2} \|\nabla f(y^{k+1})\|^2. \tag{41}$$
To bound the penultimate term, we use (36), $x^k = y^{k+1} + \lambda_{k+1}^{-1} p^k$. Convexity now implies $f(y^{k+1}) + \lambda_{k+1}^{-1} \langle \nabla f(y^{k+1}), p^k \rangle \leq f(x^k)$, so (40) yields
$$\frac{\alpha_{k+1}}{2} \|\nabla f(y^{k+1})\|^2 \leq f(x^k) - f(x^{k+1}) - \lambda_{k+1}^{-1} \langle \nabla f(y^{k+1}), p^k \rangle,$$
and hence,
$$(\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 \leq 2 (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1} \left( f(x^k) - f(x^{k+1}) \right) - 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle. \tag{42}$$

Let us combine what we have derived so far:
$$\begin{aligned}
\|p^{k+1} - x^{k+1} + x^*\|^2 - \|p^k - x^k + x^*\|^2 &\overset{(38),(41)}{\leq} 2 (\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle - 2 \lambda_{k+1} \alpha_{k+1} \left( f(x^{k+1}) - f^* \right) + (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1}^2 \|\nabla f(y^{k+1})\|^2 \\
&\overset{(42)}{\leq} -2 \lambda_{k+1} \alpha_{k+1} \left( f(x^{k+1}) - f^* \right) + 2 (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1} \left( f(x^k) - f(x^{k+1}) \right) \\
&= 2 \alpha_{k+1} \lambda_k^2 \left( f(x^k) - f^* \right) - 2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) \\
&\leq 2 \alpha_k \lambda_k^2 \left( f(x^k) - f^* \right) - 2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right),
\end{aligned}$$
where the last inequality follows from the fact that the $\alpha_k$ are monotonically decreasing, and the penultimate equality is a consequence of $\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2$. Applying this inequality iteratively, we find
$$\begin{aligned}
2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) &\leq 2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) + \|p^{k+1} - x^{k+1} + x^*\|^2 \\
&\leq 2 \alpha_k \lambda_k^2 \left( f(x^k) - f^* \right) + \|p^k - x^k + x^*\|^2 \\
&\leq \dots \\
&\leq 2 \alpha_0 \lambda_0^2 \left( f(x^0) - f^* \right) + \|p^0 - x^0 + x^*\|^2 \\
&\leq \|y^0 - x^*\|^2, \tag{43}
\end{aligned}$$


where the last inequality follows from $\lambda_0 = 1$, $p^0 = (\lambda_0 - 1)(x^{-1} - x^0) = 0$, and
$$\begin{aligned}
\|p^0 - x^0 + x^*\|^2 &= \|x^0 - x^*\|^2 = \|y^0 - \alpha_0 \nabla f(y^0) - x^*\|^2 \\
&= \|y^0 - x^*\|^2 + \alpha_0^2 \|\nabla f(y^0)\|^2 - 2 \alpha_0 \langle \nabla f(y^0), y^0 - x^* \rangle \\
&\overset{(41)}{\leq} \|y^0 - x^*\|^2 + \alpha_0^2 \|\nabla f(y^0)\|^2 - 2 \alpha_0 \left( f(x^0) - f^* + \frac{\alpha_0}{2} \|\nabla f(y^0)\|^2 \right) \\
&= \|y^0 - x^*\|^2 - 2 \alpha_0 \lambda_0^2 \left( f(x^0) - f^* \right).
\end{aligned}$$
To recap, (43), (34) and (32) revealed that
$$2 \alpha_{k+1} \lambda_{k+1}^2 \left( f(x^{k+1}) - f^* \right) \leq \|y^0 - x^*\|^2, \qquad \lambda_{k+1} \geq 1 + \frac{k+1}{2}, \qquad \frac{1}{2L} \leq \alpha_{k+1}.$$
In combination this yields
$$L^{-1} \left( \frac{k+3}{2} \right)^2 \left( f(x^{k+1}) - f^* \right) \leq 2 \alpha_{k+1} \left( 1 + \frac{k+1}{2} \right)^2 \left( f(x^{k+1}) - f^* \right) \leq \|y^0 - x^*\|^2,$$
so that
$$f(x^{k+1}) - f^* \leq \frac{4 L \|y^0 - x^*\|^2}{(k + 3)^2},$$
as claimed by the theorem.
