Optimisation for Data Science (B6.2), Lectures 1-8, 2022
Table of Contents
1 Scope of this Course
1.1 Optimisation Models
1.2 Sparse Objective
1.3 First Order Methods
1.4 Data Analysis Problems
2 Motivating Examples
where f is smooth, which in the context of this course means C^1 with Lipschitz continuous gradient.

Regularised model
min_{x∈R^n} f(x) + λΨ(x).

Sparse objective
min_{x∈R^n} f(x) = (1/m) Σ_{i=1}^m f_i(x),
where each f_i is sparse (depends only on a few of the coordinates of x), and since n, m ≫ 1, often only gradients ∇f(x) or ∇f_i(x) are available, but no higher derivatives.

First order methods generate iterates
x^{k+1} = x^k + α_k d^k
with search direction d^k computed from ∇f(x) or ∇f_i(x), and a step length α_k > 0 that is either fixed or computed iteratively.
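A minimal Python sketch of this generic first-order template, instantiated with d^k = -∇f(x^k) and a fixed step length; the quadratic test function and all parameter values are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

def first_order_method(grad, x0, alpha=0.1, iters=100):
    """Generic first-order update x^{k+1} = x^k + alpha_k d^k."""
    x = x0.copy()
    for _ in range(iters):
        d = -grad(x)       # search direction from gradient information only
        x = x + alpha * d  # fixed step length alpha_k = alpha
    return x

# Illustrative smooth objective f(x) = 0.5 ||x||^2, whose gradient is x.
x_final = first_order_method(lambda x: x, x0=np.ones(5))
```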
1.4 Data Analysis Problems
L_D(x) is the loss on the data set D associated with parameter values x. A typical example is ℓ(a_j, y_j; x) = ‖Φ(a_j; x) − y_j‖^2.
2 Motivating Examples
Example 1. Examples of observation sets:
i) W = R: regression problems.
Example 2 (Regression).
min_{x∈R^n} (1/(2m)) Σ_{j=1}^m (a_j^T x − y_j)^2 = (1/(2m)) ‖Ax − y‖_2^2,
where A = [a_1, …, a_m]^T.
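A short NumPy sketch checking this identity on synthetic data; the problem sizes and random data are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
A = rng.standard_normal((m, n))   # rows of A are the a_j^T
y = rng.standard_normal(m)
x = rng.standard_normal(n)

# Sum form (1/2m) sum_j (a_j^T x - y_j)^2 equals matrix form (1/2m)||Ax - y||^2.
f_sum = sum((A[j] @ x - y[j]) ** 2 for j in range(m)) / (2 * m)
f_mat = np.linalg.norm(A @ x - y) ** 2 / (2 * m)
assert np.isclose(f_sum, f_mat)

grad = A.T @ (A @ x - y) / m      # gradient of the least-squares objective
```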
ii) Regression with intercept: Same formulation, but replace x and a_j by the augmented vectors
x̃ = (x, β)^T, ã_j = (a_j, 1)^T.
iii) Tikhonov-regularised regression: this method uses the data fitting model
min_{x∈R^n} (1/(2m)) ‖Ax − y‖_2^2 + λ‖x‖_2^2,
which leads to robust regression, less sensitive to noise in the data (a_j, y_j).
iv) Lasso regression: the data fitting model
min_{x∈R^n} (1/(2m)) ‖Ax − y‖_2^2 + λ‖x‖_1
tends to yield a sparse optimal solution x* (feature selection) if one exists.
v) Dictionary learning: Extending linear regression to learning the regressors. In dictionary learning we seek a dictionary A ∈ R^{m×n} and sparse regressors X ∈ R^{n×p} so that the data Y ∈ R^{m×p} can be approximately expressed as AX ≈ Y. We use V = (R^{m×n}, R^{n×p}) as data space and W = R^{1×p} as observation space, with data D = {y_ℓ}_{ℓ=1}^m, where y_ℓ ∈ R^{1×p} denotes the ℓ-th row of Y. The data fitting model is
min_{A∈R^{m×n}, X∈F_k(n,p)} (1/2) ‖AX − Y‖_F^2,
where
F_k(n, p) = { X ∈ R^{n×p} : #{i : x_{i,j} ≠ 0} ≤ k, ∀ j = 1, …, p },
indicating that only k of the learned regressors from A are used to approximate each column of data in Y.
Example 3 (Matrix completion). The data fitting model
min_X (1/(2m)) Σ_{j=1}^m (⟨A_j, X⟩ − y_j)^2 + λ‖X‖_*,
where ⟨A_j, X⟩ models the sensing of X, y_j the target value, λ > 0, and the nuclear norm ‖X‖_* = Σ_{i=1}^p σ_i is the sum of the singular values of X. Regularisation by the nuclear norm tends to yield a low-rank optimal solution if one exists. Typical choice A_j = [a_{s,t}] with
a_{s,t} = 1 if s = s_j, t = t_j, and a_{s,t} = 0 otherwise,
for fixed s_j, t_j, so that ⟨A_j, X⟩ = x_{s_j,t_j} and the y_j correspond to a few known entries of X. This problem may be seen as a sophisticated form of regression.
Example 4 (Sparse inverse covariance estimation). V = R^n, W = ∅, the a_j are i.i.d. samples of a random vector A: Ω → R^n with E[A] = 0 and Cov(A) = E[AA^T] = Σ < ∞ (all covariances are finite). The sample estimate of Σ is
S = (1/m) Σ_{j=1}^m a_j a_j^T.
In many applications, components of A are conditionally independent,
D(A_j | A_i, {A_k : k ≠ i, j}) = D(A_j | {A_k : k ≠ i, j}).
For example, when A_i and A_j are linked only through the causal chain
A_i → A_k → A_j,
then Cov(A_i, A_j) ≠ 0 but D(A_j | A_i, A_k) = D(A_j | A_k). Fact: If A_i, A_j are conditionally independent, then (Σ^{-1})_{i,j} = 0 (component (i, j) of Σ^{-1} is zero).
Therefore, we correct the sample estimate S so that its inverse becomes sparse by solving the data fitting problem
min_{X∈SR^{n×n}, X≻0} ⟨S, X⟩ − log det X + λ‖X‖_1,
so that the first two terms of the objective represent a shift plus a negative multiple of the log-likelihood function for X = Cov(A)^{-1} under the parametric model that the data a_j are i.i.d. samples of mean-centred multivariate Gaussian vectors. Note also that the barrier term −log det X is used to keep X away from the boundary and hence invertible. The Lasso term λ‖X‖_1 incentivises a sparse solution that allows one to identify the conditional dependencies between the variables if the reconstruction is successful.
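A sketch of this data fitting problem using the cvxpy modelling package; the choice of cvxpy, the synthetic data, and the value of λ are assumptions made purely for illustration.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 5, 200
a = rng.standard_normal((m, n))    # rows: mean-zero samples a_j
S = (a.T @ a) / m                  # sample estimate S = (1/m) sum_j a_j a_j^T
lam = 0.1

X = cp.Variable((n, n), PSD=True)  # symmetric positive semidefinite variable
objective = cp.Minimize(cp.trace(S @ X) - cp.log_det(X) + lam * cp.sum(cp.abs(X)))
cp.Problem(objective).solve()
Sigma_inv_estimate = X.value       # sparse estimate of the inverse covariance
```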
y_j (a_j^T x − β) ≥ 1. (1)
But (2) may not have any solution (x, β) at all that satisfies (1). Therefore, solve the following data fitting model instead,
min_{(x,β)} (1/m) Σ_{j=1}^m max( 1 − y_j (a_j^T x − β), 0 ) + (λ/2) ‖x‖_2^2
⇔ min_{x∈R^n, β∈R, s∈R^m} (1/m) Σ_{j=1}^m s_j + (λ/2) ‖x‖_2^2
subject to s_j ≥ 1 − y_j (a_j^T x − β), s_j ≥ 0, (j = 1, …, m).
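A cvxpy sketch of the hinge-loss model above; the synthetic data, the labels, and λ are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 100, 2
A = rng.standard_normal((m, n))
y = np.where(A[:, 0] + 0.1 * rng.standard_normal(m) > 0, 1.0, -1.0)  # labels in {-1, +1}
lam = 0.1

x, beta = cp.Variable(n), cp.Variable()
margins = cp.multiply(y, A @ x - beta)            # y_j (a_j^T x - beta)
hinge = cp.sum(cp.pos(1 - margins)) / m           # (1/m) sum_j max(1 - ..., 0)
cp.Problem(cp.Minimize(hinge + (lam / 2) * cp.sum_squares(x))).solve()
```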
y_j (⟨ζ(a_j), x⟩ − β) ≥ 1 (3)
{a_1, …, a_m} ⊂ V = R^n.
For example, use a Gaussian kernel K(a_j, a_ℓ) = exp( −‖a_j − a_ℓ‖_2^2 / (2σ^2) ).
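A NumPy sketch of assembling the Gaussian kernel matrix; the data and the value of σ are placeholders.

```python
import numpy as np

def gaussian_kernel_matrix(A, sigma=1.0):
    """K[j, l] = exp(-||a_j - a_l||^2 / (2 sigma^2)) for the rows a_j of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))

K = gaussian_kernel_matrix(np.random.default_rng(0).standard_normal((10, 3)))
```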
Figure 1: The support vector machine model maximises the margin between
two point clouds.
such that
p_i(a_j; X) ≈ 1 if y_j = e_i, and p_i(a_j; X) ≈ 0 if y_j ≠ e_i,
where x_[i] ∈ R^n (i = 1, …, M) and X = {x_[i] : i = 1, …, M}. Idea: maximise the log-likelihood function, which leads to the data fitting model
max_X (1/m) Σ_{j=1}^m ( y_j^T [x_[1]^T; …; x_[M]^T] a_j − log Σ_{k=1}^M exp(x_[k]^T a_j) ),
where [x_[1]^T; …; x_[M]^T] ∈ R^{M×n} denotes the matrix with rows x_[i]^T. Note that if y_j = e_i, then
y_j^T [x_[1]^T; …; x_[M]^T] = x_[i]^T.
ii) Nonlinear multiclass logistic regression via deep learning: Combines logistic regression with a nonlinear transformation of the inputs in D ≫ 1 hidden layers 1, …, D of a neural network,
a^0 = a,
a^ℓ = σ(W^ℓ a^{ℓ−1} + g^ℓ), (ℓ = 1, …, D − 1),
a^D = W^D a^{D−1} + g^D,
where W^ℓ ∈ R^{|a^ℓ|×|a^{ℓ−1}|}, g^ℓ ∈ R^{|a^ℓ|}, and σ is a nonlinear activation function (see the sketch after this list), such as
• σ(t) = 1/(1 + exp(−t)) (logistic function),
• σ(t) = max(t, 0) (hinge loss, also called ReLU),
• σ(t) = tanh(t) (smooth version of ReLU),
• σ(t) = 1 with probability 1/(1 + exp(−t)) and 0 otherwise.
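The four activation functions above, sketched in NumPy; the vectorised form and the random number generator are implementation assumptions.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def relu(t):
    return np.maximum(t, 0.0)               # the "hinge" activation

def smooth_tanh(t):
    return np.tanh(t)

def stochastic_binary(t, rng=np.random.default_rng(0)):
    # 1 with probability 1/(1 + exp(-t)), and 0 otherwise
    return (rng.random(np.shape(t)) < logistic(t)).astype(float)
```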
Let w = ((W^1, g^1), …, (W^D, g^D)) be the parameters of the NN, and set a^D = a^D(w) as a function of the network parameters. The data fitting model becomes
max_{X,w} (1/m) Σ_{j=1}^m ( y_j^T [x_[1]^T; …; x_[M]^T] a_j^D(w) − log Σ_{k=1}^M exp(x_[k]^T a_j^D(w)) ).
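A NumPy sketch of the forward recursion a^0, …, a^D defining a^D(w); the layer sizes and the tanh activation are illustrative assumptions.

```python
import numpy as np

def forward(a, weights, biases, sigma=np.tanh):
    """Compute a^D(w) for parameters w = ((W^1, g^1), ..., (W^D, g^D))."""
    for W, g in zip(weights[:-1], biases[:-1]):
        a = sigma(W @ a + g)          # a^l = sigma(W^l a^{l-1} + g^l)
    W, g = weights[-1], biases[-1]
    return W @ a + g                  # final layer is affine: a^D = W^D a^{D-1} + g^D

rng = np.random.default_rng(0)
dims = [4, 8, 8, 3]                   # |a^0|, ..., |a^D| (illustrative)
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
gs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]
aD = forward(rng.standard_normal(dims[0]), Ws, gs)
```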
3.1 Terminology
Consider the optimisation model
min_{x∈R^n} f(x) subject to x ∈ F. (4)
Definition 1. A solution x∗ ∈ F is
Definition 2.
i) f : Rn → R is convex if
f ((1 − α)x + αy) ≤ (1 − α)f (x) + αf (y), ∀α ∈ [0, 1], x, y ∈ Rn . (5)
ii) f : Rn → R ∪ {+∞} is proper convex if (5) holds and f (x) < +∞ for at
least one point x.
iii) For Ω ⊂ R^n, define the indicator function as follows,
I_Ω(x) = 0 if x ∈ Ω, and I_Ω(x) = +∞ otherwise.
vii) v ∈ R^n is a subgradient of f at x if f(z) ≥ f(x) + v^T (z − x) for all z ∈ R^n.
3.2 Prerequisites
Lemma 1.
is convex.
3.3 Characterisation of Convergence Speed
4.1 Specifications
The iterative updates take the form
x^{k+1} = x^k + α_k d^k. (9)
The scalar α_k > 0 is called the step length, and d^k ∈ R^n is the search direction. The Method of Steepest Descent is a gradient method based on making the following choices: d^k = −∇f(x^k) for all k (the direction of steepest descent), and α_k ≡ L^{−1} for all k (constant step length). This leads to the steepest descent updates
x^{k+1} = x^k − (1/L) ∇f(x^k). (10)
Our main concern is to ensure that the step length α_k is neither too short (leading to excessively slow convergence) nor too long (leading to zig-zagging behaviour).
The updates (10) are motivated by the minimisation of the first order Taylor approximation bound
f(x + αd) ≤ f(x) + α ∇f(x)^T d + (L/2) α^2 ‖d‖^2 (11)
at x = x^k over α for fixed d = −∇f(x). (11) is an immediate consequence of (7). Writing ϕ(α) for the r.h.s. of (11) and setting d = −∇f(x), we obtain the convex quadratic
ϕ(α) = f(x) − α ‖∇f(x)‖^2 + (α^2/2) L ‖∇f(x)‖^2.
Minimising ϕ(α) by solving ϕ′(α) = 0 yields α* = 1/L.
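A NumPy sketch of the short-step updates (10) on a quadratic test problem; the matrix Q, the iteration budget, and taking L as the largest eigenvalue are illustrative assumptions.

```python
import numpy as np

def steepest_descent(grad, L, x0, iters=200):
    """Short-step steepest descent: x^{k+1} = x^k - (1/L) grad f(x^k)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - grad(x) / L
    return x

# Test problem f(x) = 0.5 x^T Q x with gradient Qx; here L = lambda_max(Q) = 100.
Q = np.diag([1.0, 10.0, 100.0])
x_final = steepest_descent(lambda x: Q @ x, L=100.0, x0=np.ones(3))
```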
The following will be one of our main tools of convergence analysis for the
method of steepest descent:
4.2 Convergence Theory
Rearranging the foundational inequality of Lemma 3 and summing over k = 0, …, T − 1 gives
Σ_{k=0}^{T−1} ‖∇f(x^k)‖^2 ≤ 2L [f(x^0) − f(x^T)].
Therefore,
min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √( (1/T) Σ_{k=0}^{T−1} ‖∇f(x^k)‖^2 ) ≤ √( 2L [f(x^0) − f(x^T)] / T ).
k=0
L 0
f (xT ) − f (x∗ ) ≤ kx − x∗ k2 , ∀ T ≥ 1.
2T
f(x^{k+1}) ≤ f(x*) + ∇f(x^k)^T (x^k − x*) − (1/(2L)) ‖∇f(x^k)‖^2   (use upper bound on f(x^k))
= f(x*) + (L/2) ( ‖x^k − x*‖^2 − ‖x^k − x* − (1/L) ∇f(x^k)‖^2 )   (expand to check)
= f(x*) + (L/2) ( ‖x^k − x*‖^2 − ‖x^{k+1} − x*‖^2 )   (use def of x^{k+1}).
Note that the minimiser x* is not necessarily unique.
Theorem 3. If f is L-smooth and γ-strongly convex then there exists a unique minimiser x*, and
f(x^{k+1}) − f(x*) ≤ (1 − γ/L) ( f(x^k) − f(x*) ) ∀ k.
The arg min of the right hand side of (12) equals y* = x^k − (1/γ) ∇f(x^k), and substitution back into (12) gives
f(x*) ≥ f(x^k) − (1/γ) ∇f(x^k)^T ∇f(x^k) + (γ/2) ‖(1/γ) ∇f(x^k)‖^2
= f(x^k) − (1/(2γ)) ‖∇f(x^k)‖^2,
4.3 Long-Step Descent Methods
Figure 6: A step size αk that satisfies Conditions ii) and iii) can be found by
bisection.
We continue solving the unconstrained model (8) via iterative updates (9), but now with more general (and possibly much larger) step sizes α_k and search directions d^k than before.
Throughout this section we assume f(x) to be L-smooth, possibly nonconvex, and bounded below, f(x) ≥ f̲ for all x ∈ R^n. The search directions d^k are assumed to lie within a bounded angle of the steepest descent direction. Specifically, our assumptions on d^k and α_k are as follows:
Assumption 1. There exist constants 0 < η ≤ 1 and 0 < c1 < c2 < 1 such that for all k ∈ N,
i) −∇f(x^k)^T d^k ≥ η ‖∇f(x^k)‖ ‖d^k‖,
ii) f(x^k + α_k d^k) ≤ f(x^k) + c1 α_k ∇f(x^k)^T d^k,
iii) ∇f(x^k + α_k d^k)^T d^k ≥ c2 ∇f(x^k)^T d^k.
rection dk = −∇f (xk ), in which case η = 1. The condition allows us more gen-
erality in computing dk , for example via an inexact computation of −∇f (xk )
that is numerically cheaper to compute. Assumptions ii) and iii) look very
technical at first sight, but all we need to know for the purposes of this course
is that there exist very simple and numerically cheap approximate 1-D opti-
misation schemes called inexact line searches that are able to produce values αk
that satisfy both conditions. The additional generality is important in practi-
cal computations, because the Lipschitz constant L is generally not known, so
that taking constant step sizes αk ≡ 1/L is largely a theoretical choice. Line
searches also allow αk to take much larger values than 1/L when warranted,
so that the algorithm may make more rapid progress.
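A sketch of one such inexact line search: a bisection/expansion scheme that returns a step length satisfying both a sufficient decrease and a curvature condition. The constants c1, c2 and the bracketing strategy below are common textbook choices and an assumption here, not the notes' prescribed scheme.

```python
import numpy as np

def wolfe_bisection(f, grad, x, d, c1=1e-4, c2=0.9, max_iter=50):
    """Bisection line search for the conditions of Assumption 1 ii) and iii)."""
    g0 = grad(x) @ d                  # directional derivative; negative for descent
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + alpha * d) > f(x) + c1 * alpha * g0:
            hi = alpha                              # sufficient decrease fails: shrink
            alpha = 0.5 * (lo + hi)
        elif grad(x + alpha * d) @ d < c2 * g0:
            lo = alpha                              # curvature fails: expand or bisect
            alpha = 2.0 * alpha if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha                            # both conditions hold
    return alpha

# Example on f(x) = 0.5 ||x||^2 along the steepest descent direction.
x0 = np.array([1.0, -2.0])
step = wolfe_bisection(lambda x: 0.5 * x @ x, lambda x: x, x0, -x0)
```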
Theorem 4. Under Assumption 1, we have
min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √( L / (η^2 c1 (1 − c2)) ) × √( (f(x^0) − f̲) / T ).
M_{λ,h}: R^n → R,
x ↦ (1/λ) inf_u ( λ h(u) + (1/2) ‖u − x‖^2 ) = inf_u ( h(u) + (1/(2λ)) ‖u − x‖^2 ).
prox_{λh}: R^n → R^n,
x ↦ arg min_u ( λ h(u) + (1/2) ‖u − x‖^2 ).
Example 9.
• h(x) = 0 has prox operator prox_{λh}(x) = arg min_u (1/2) ‖u − x‖^2 = x, that is, the identity.
• For the indicator function I_Ω of a set Ω,
prox_{λI_Ω}(x) = arg min_u ( λ I_Ω(u) + (1/2) ‖u − x‖^2 ) = arg min_{u∈Ω} ‖u − x‖^2,
the orthogonal projection of x onto Ω (see also the sketches below).
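NumPy sketches of these two prox operators, together with soft-thresholding, the standard prox operator of λ‖·‖_1 (an extra example, not listed above); the box form of Ω is an illustrative assumption.

```python
import numpy as np

def prox_zero(x, lam=1.0):
    return x                              # prox of h = 0 is the identity

def prox_indicator_box(x, low, high, lam=1.0):
    return np.clip(x, low, high)          # projection onto Omega = [low, high]^n

def prox_l1(x, lam):
    # Soft-thresholding, the prox operator of lam * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```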
∇M_{λ,h}(x) = (1/λ) ( x − prox_{λh}(x) ).
We now extend the scope of problems to solve and consider the regularised optimisation model
min_{x∈R^n} φ(x) := f(x) + λψ(x), (15)
where f is L-smooth and convex, and ψ(x) is convex and closed. This includes the case of constrained optimisation, as
min_{x∈Ω} f(x) ⇔ min_{x∈R^n} f(x) + I_Ω(x),
where the equivalence is understood in the sense that the argmin are the same in both cases.
To construct an iterative algorithm for Model (15), we use the following template:
1. y^{k+1} = x^k − α_k ∇f(x^k),
2. x^{k+1} = arg min_z (1/2) ‖z − y^{k+1}‖^2 + α_k λ ψ(z),
that is, we minimise ψ(z) while staying close to the steepest descent update y^{k+1}. Note the factor α_k in front of λψ(z), which avoids x^{k+1} → y^{k+1} as α_k → ∞; both terms on the r.h.s. of the second step should be of the same order. We may also express the update of the template algorithm as
x^{k+1} = prox_{α_k λψ} ( x^k − α_k ∇f(x^k) ) = arg min_z f(x^k) + ∇f(x^k)^T (z − x^k) + (1/(2α_k)) ‖z − x^k‖^2 + λψ(z). (16)
Now note that f(x^k) + ∇f(x^k)^T (z − x^k) is the 1st order Taylor approximation of f(z) around x^k, so that f(x^k) + ∇f(x^k)^T (z − x^k) + (1/(2α_k)) ‖z − x^k‖^2 is a simplified 2nd order model of f(z) that does not depend on 2nd order derivatives of f. The updates are thus obtained by iteratively building very simple 2nd order models by which f is replaced in (15) to obtain subproblems that are easier to solve.
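A NumPy sketch of this template applied to the Lasso model of Example 2 iv), with ψ = ‖·‖_1 and its soft-thresholding prox; the synthetic data, λ, and step length are illustrative assumptions.

```python
import numpy as np

def prox_gradient(grad_f, prox_psi, x0, alpha, lam, iters=500):
    """y^{k+1} = x^k - alpha grad f(x^k);  x^{k+1} = prox_{alpha lam psi}(y^{k+1})."""
    x = x0.copy()
    for _ in range(iters):
        y = x - alpha * grad_f(x)        # forward (gradient) step
        x = prox_psi(y, alpha * lam)     # backward (proximal) step
    return x

rng = np.random.default_rng(0)
A, y = rng.standard_normal((40, 10)), rng.standard_normal(40)
m = A.shape[0]
L = np.linalg.norm(A, 2) ** 2 / m        # Lipschitz constant of grad f
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x_hat = prox_gradient(lambda x: A.T @ (A @ x - y) / m, soft, np.zeros(10), 1 / L, lam=0.1)
```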
Example 10. Unregularised unconstrained minimisation min_x f(x). This case reduces to ψ(x) ≡ 0 and the steepest descent update
x^{k+1} = prox_0 ( x^k − α_k ∇f(x^k) ) = arg min_z (1/2) ‖z − (x^k − α_k ∇f(x^k))‖^2 = x^k − α_k ∇f(x^k).
Example 11. Constrained convex minimisation min_x f(x) subject to x ∈ Ω with Ω ⊂ R^n convex. This case reduces to ψ(x) = I_Ω(x) and prox updates
x^{k+1} = prox_{α_k λ I_Ω} ( x^k − α_k ∇f(x^k) ) = arg min_{z∈Ω} (1/2) ‖z − (x^k − α_k ∇f(x^k))‖^2,
which is the orthogonal projection of the steepest descent update x^k − α_k ∇f(x^k) onto the feasible set Ω.
Definition 6. The gradient map of model (15) is the map
G_α: R^n → R^n,
x ↦ (1/α) ( x − prox_{αλψ}(x − α∇f(x)) ).
We note that the notion of gradient map allows us to write the prox update (16) as
x^{k+1} = prox_{α_k λψ} ( x^k − α_k ∇f(x^k) ) = x^k − α_k G_{α_k}(x^k), (17)
so that in the proximal method G_{α_k}(x^k) takes the role of ∇f(x^k) in comparison with the method of steepest descent.
Lemma 4 (Foundational Inequality of Prox-Method). The gradient map has the following properties:
i) G_α(x) ∈ λ∂ψ ( x − αG_α(x) ) + ∇f(x);
ii) for all α ∈ (0, 1/L],
φ ( x − αG_α(x) ) ≤ φ(z) + G_α(x)^T (x − z) − (α/2) ‖G_α(x)‖^2. (18)
Notes:
• Property ii) holds true for all z ∈ R^n. The term φ(z) + G_α(x)^T (x − z) can be seen as a backward first order model of φ(x). Using (17) and setting x = x^k and α = α_k, (18) reads
φ(x^{k+1}) ≤ φ(z) + G_{α_k}(x^k)^T (x^k − z) − (α_k/2) ‖G_{α_k}(x^k)‖^2.
In particular, setting z = x^k we obtain
φ(x^{k+1}) ≤ φ(x^k) − (α_k/2) ‖G_{α_k}(x^k)‖^2. (19)
Compare this result to the foundational inequality of the short-step steepest descent, i.e., Lemma 3, which reads
f(x^{k+1}) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖^2. (20)
Note that when α_k = 1/L, (19) becomes (20) with f(x) replaced by the merit function φ(x) and ∇f(x) by the gradient map. Inequality (18) thus generalises the foundational inequality and can be used the same way to analyse the convergence of the prox-gradient method as Lemma 3 was used in the analysis of the Method of Steepest Descent.
Proof. Using x − αG_α(x) = prox_{αλψ}(x − α∇f(x)) = arg min_u Ξ(u), where Ξ(u) := αλψ(u) + (1/2) ‖u − (x − α∇f(x))‖^2 is convex, it must be true that
0 ∈ ∂Ξ ( prox_{αλψ}(x − α∇f(x)) ) = ∂Ξ ( x − αG_α(x) )
= αλ∂ψ ( x − αG_α(x) ) + ( x − αG_α(x) ) − ( x − α∇f(x) ).
Therefore, αG_α(x) ∈ αλ∂ψ(x − αG_α(x)) + α∇f(x), as claimed in i).
Further, since f is L-smooth, (7) implies
f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖^2, ∀ y ∈ R^n.
Substituting y = x − αG_α(x) yields
f ( x − αG_α(x) ) ≤ f(x) − αG_α(x)^T ∇f(x) + (Lα^2/2) ‖G_α(x)‖^2
≤ f(x) − αG_α(x)^T ∇f(x) + (α/2) ‖G_α(x)‖^2, (21)
where the last inequality follows from the assumption that α ∈ (0, 1/L]. On the other hand, the convexity of f(x) implies (see Definition 2 and properties of the subdifferential)
f(x) ≤ f(z) − ∇f(x)^T (z − x), ∀ z ∈ R^n. (22)
By virtue of part i) we also have G_α(x) − ∇f(x) ∈ λ∂ψ ( x − αG_α(x) ), so that by the definition of subgradients we have
λψ(z) ≥ λψ ( x − αG_α(x) ) + [G_α(x) − ∇f(x)]^T ( z − (x − αG_α(x)) ). (23)
It follows that
φ ( x − αG_α(x) ) = f ( x − αG_α(x) ) + λψ ( x − αG_α(x) )   (def of φ)
≤ f(x) − αG_α(x)^T ∇f(x) + (α/2) ‖G_α(x)‖^2 + λψ ( x − αG_α(x) )   (by (21))
≤ f(z) − ∇f(x)^T ( z − x + αG_α(x) ) + (α/2) ‖G_α(x)‖^2 + λψ ( x − αG_α(x) )   (by (22))
≤ f(z) − ∇f(x)^T ( z − x + αG_α(x) ) + (α/2) ‖G_α(x)‖^2 + λψ(z) − [G_α(x) − ∇f(x)]^T ( z − (x − αG_α(x)) )   (by (23))
= f(z) + λψ(z) + G_α(x)^T (x − z) − (α/2) ‖G_α(x)‖^2
= φ(z) + G_α(x)^T (x − z) − (α/2) ‖G_α(x)‖^2,
as claimed in part ii).
5.3 Convergence Theory
Theorem 5. If Model (15) has a minimiser x* and the prox-gradient algorithm is run with constant step length α_k ≡ 1/L starting from an initial point x^0, then
φ(x^k) − φ(x*) ≤ (L/(2k)) ‖x^0 − x*‖^2, ∀ k ≥ 1.
Proof. Equation (19) establishes that (φ(x^k))_{k∈N} is monotone decreasing,
K ( φ(x^K) − φ(x*) ) ≤ Σ_{k=0}^{K−1} ( φ(x^{k+1}) − φ(x*) )   (by (24))
≤ (L/2) Σ_{k=0}^{K−1} ( ‖x^k − x*‖^2 − ‖x^{k+1} − x*‖^2 )   (by (25) with α_k = L^{−1})
≤ (L/2) ‖x^0 − x*‖^2   (telescoping sum).
In summary:
• The Method of Steepest Descent solves min_x f(x)
– in sublinear O(1/k) time when f is L-smooth, convex and bounded below,
– in linear time with rate (1 − γ/L) when f is L-smooth and γ-strongly convex.
• The Prox-Gradient Method solves
min_x f(x) + λψ(x)
in sublinear O(1/k) time when f is smooth and convex and ψ is convex and closed.
6.2 The Heavy Ball Method
Consider the quadratic model
min_{x∈R^n} f(x) = (1/2) x^T A x + b^T x, (26)
where A is symmetric with eigenvalues in [γ, L] for some 0 < γ < L, so that f(x) is L-smooth and γ-strongly convex.
The iterative updates are defined as
x^{k+1} = x^k − α_k ∇f(x^k) + β_k (x^k − x^{k−1}),
consisting of a steepest descent step x^k − α_k ∇f(x^k) and an additional momentum term β_k (x^k − x^{k−1}) for some β_k > 0.
To avoid the requirement of two starting points x^{−1}, x^0, the first iteration takes a steepest descent update. The following theorem is given without proof:
Theorem 6. There exists a constant C > 0 such that when the Heavy Ball Method is applied to Model (26) with constant step lengths
α_k ≡ α = 4 / (√L + √γ)^2, β_k ≡ β = (√L − √γ) / (√L + √γ),
the iterates satisfy ‖x^k − x*‖ ≤ C β^k for all k.
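A NumPy sketch of the Heavy Ball updates with the constant parameters of Theorem 6; the quadratic test problem is an illustrative assumption.

```python
import numpy as np

def heavy_ball(grad, L, gamma, x0, iters=200):
    """x^{k+1} = x^k - alpha grad f(x^k) + beta (x^k - x^{k-1})."""
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = x0.copy()
    x = x0 - grad(x0) / L                 # first iteration: plain steepest descent
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# Quadratic f(x) = 0.5 x^T A x with spectrum in [gamma, L] = [1, 100].
A = np.diag([1.0, 50.0, 100.0])
x_hb = heavy_ball(lambda x: A @ x, L=100.0, gamma=1.0, x0=np.ones(3))
```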
Figure 10: Heavy ball updates converge faster due to reduced zig-zagging.
We conclude that the Heavy Ball Method also converges linearly, but at a faster
rate than the Method of Steepest Descent.
Example 12. If γ = 0.01L, which corresponds to a matrix A with moderately sized condition number of 100, we have (1 − 2√(γ/L))^2 = 0.64. Hence, the Heavy Ball Method shrinks f(x^k) − f(x*) by at least a third in each iteration, while the steepest descent method shrinks f(x^k) − f(x*) only by 1%, since (1 − γ/L) = 0.99.
6.3 Nesterov Acceleration
Theorem 7. If f is L-smooth and γ-strongly convex, then the Accelerated Gradient Method run with constant step size from a starting point x^0 satisfies
f(x^k) − f(x*) ≤ ((L + γ)/2) ‖x^0 − x*‖^2 × (1 − √(γ/L))^k, ∀ k ≥ 1.
Theorem 7 shows that the Accelerated Gradient Method with constant step size again converges linearly with a faster rate than the Method of Steepest Descent, as
1 − √(γ/L) < 1 − γ/L.
Example 13. For γ = 0.01L we have (1 − √(γ/L)) = 0.9, which compares favourably with the decrease rate of (1 − γ/L) = 0.99 for steepest descent. It takes more than 10 steepest descent iterations to guarantee the same relative decrease in f(x^k) − f(x*) as for one iteration of Nesterov's Method.
6.4 Nesterov Acceleration of L-Smooth Convex Functions
We will now discuss a variant of Nesterov's Method for the unconstrained minimisation of an L-smooth convex function f(x). Therefore, f(x) is bounded from above by simple quadratic models:
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖^2. (30)
The Lipschitz constant is only assumed to exist but is not necessarily known. In contrast to the case characterised by Theorem 7, the step size varies in each iteration.
This method has an iteration complexity of O(ε^{−1/2}), that is, there exists a constant C > 0 (which depends of course on the starting point) such that for any given error tolerance ε > 0 it takes only C/√ε iterations to guarantee that f(x^k) − f(x*) < ε. It is known theoretically that this iteration complexity cannot be improved, thus Nesterov's Method is optimal from an iteration complexity perspective.
The algorithm proceeds via two sequences of iterates, xk and y k , whose
computation is interlaced. The points xk are obtained as the minimisers of a
quadratic upper bound function, as discussed earlier in the course, while the
iterates y k are obtained as approximate second order corrections to the points
xk .
α_k := 2^{−i} α_{k−1};
x^k := y^k − α_k ∇f(y^k);
λ_{k+1} := (1/2) ( 1 + √(1 + 4λ_k^2) );
y^{k+1} := x^k + ((λ_k − 1)/λ_{k+1}) (x^k − x^{k−1});
k ← k + 1;
end
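A NumPy sketch along the lines of Algorithm 1; the initialisation, the test problem, and the exact form of the backtracking loop are illustrative assumptions rather than a transcription of the algorithm's missing header.

```python
import numpy as np

def nesterov_backtracking(f, grad, x0, alpha0=1.0, iters=100):
    """Accelerated gradient with backtracking; L is never used explicitly."""
    x_prev, y = x0.copy(), x0.copy()
    alpha, lam = alpha0, 1.0
    for _ in range(iters):
        g = grad(y)
        # Backtracking: halve alpha until the sufficient decrease condition
        # f(y - alpha g) <= f(y) - (alpha/2) ||g||^2 holds (cf. (31)).
        while f(y - alpha * g) > f(y) - 0.5 * alpha * (g @ g):
            alpha *= 0.5
        x = y - alpha * g                                  # x^k = y^k - alpha_k grad f(y^k)
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))
        y = x + ((lam - 1.0) / lam_next) * (x - x_prev)    # momentum step
        x_prev, lam = x, lam_next
    return x_prev

A = np.diag([1.0, 10.0, 100.0])
x_acc = nesterov_backtracking(lambda x: 0.5 * x @ (A @ x), lambda x: A @ x, np.ones(3))
```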
(1/2) L^{−1} ≤ α_k ≤ L^{−1}, ∀ k ≥ k_0, (32)
and where the lower bound inequality holds for all k ∈ N. Note that α ≤ L^{−1} is exactly the condition we need to ensure that
m_{k,α}(y) = f(y^k) + ⟨∇f(y^k), y − y^k⟩ + (1/(2α)) ‖y − y^k‖^2
is an upper bound function on f(y). If we had a priori knowledge of L, we could simply set α = L^{−1} at the outset, but Algorithm 1 is designed
for the case where L is not known explicitly and learns an appropriate value of α from recycled information, that is, from quantities that have to be computed anyway as the iterations proceed. The backtracking decreases the value of α at most
⌊log_2(2Lα_{−1})⌋
times in total: once α ≤ L^{−1}, the quadratic model
m_{k,α}(x) = f(y^k) + ⟨∇f(y^k), x − y^k⟩ + (1/(2α)) ‖x − y^k‖^2
is an upper bound on f and no further halving is triggered.
λ_{k+1} ≥ λ_k + 1/2 ≥ 1 + (k + 1)/2, (34)
and furthermore,
(λ_0 − 1)/λ_1 = 0,
(λ_k − 1)/λ_{k+1} < 1, ∀ k ∈ N,
lim_{k→∞} (λ_k − 1)/λ_{k+1} = 1.
Next, we will derive bounds on the last two terms on the right hand side of (38). To bound the second term, we use convexity of f, which implies f(x*) ≥ f(y^{k+1}) + ⟨∇f(y^{k+1}), x* − y^{k+1}⟩, and hence
f(y^{k+1}) − f* ≤ −⟨∇f(y^{k+1}), x* − y^{k+1}⟩. (39)
On the other hand, the sufficient decrease condition (31) yields
f(x^k) = f ( y^k − α_k ∇f(y^k) ) ≤ f(y^k) − (α_k/2) ‖∇f(y^k)‖^2,
and at iteration k + 1 this implies
f(y^{k+1}) ≥ f(x^{k+1}) + (α_{k+1}/2) ‖∇f(y^{k+1})‖^2. (40)
Substitution into (39) yields
⟨∇f(y^{k+1}), x* − y^{k+1}⟩ ≤ − ( f(x^{k+1}) − f* ) − (α_{k+1}/2) ‖∇f(y^{k+1})‖^2. (41)
To bound the first term, we use (36), x^k = y^{k+1} + λ_{k+1}^{−1} p^k. Convexity now implies f(y^{k+1}) + λ_{k+1}^{−1} ⟨∇f(y^{k+1}), p^k⟩ ≤ f(x^k), so (40) yields
(α_{k+1}/2) ‖∇f(y^{k+1})‖^2 ≤ f(x^k) − f(x^{k+1}) − λ_{k+1}^{−1} ⟨∇f(y^{k+1}), p^k⟩,
and hence,
‖p^0 − x^0 + x*‖^2 = ‖x^0 − x*‖^2
= ‖y^0 − α_0 ∇f(y^0) − x*‖^2
= ‖y^0 − x*‖^2 + α_0^2 ‖∇f(y^0)‖^2 − 2α_0 ⟨∇f(y^0), y^0 − x*⟩
≤ ‖y^0 − x*‖^2 + α_0^2 ‖∇f(y^0)‖^2 − 2α_0 ( f(x^0) − f* + (α_0/2) ‖∇f(y^0)‖^2 )   (by (41))
= ‖y^0 − x*‖^2 − 2α_0 λ_0^2 ( f(x^0) − f* ),
so that
f(x^{k+1}) − f* ≤ 4L ‖y^0 − x*‖^2 / (k + 3)^2,
as claimed by the theorem.