
WEEK 1: OVERVIEW

"Nothing in the World takes place without Optimization, and there is no doubt that all aspects of the World, that have a rational basis, can be explained by Optimization Methods."

– Leonhard Euler

1. A Brief History of Continuous Optimization

1.1. The Classical Age: the Calculus of Variations.


• Fermat, 1636, Method of Minima and Maxima, motivated by the Ancient Greeks' study of geometry, e.g., what is the largest rectangle with a given perimeter, or Snell's law. The principle of adequality [ignoring the higher-order terms]. The intuition: a solution is not optimal if some slight local perturbation leads to a better point,

f(x) ∼ f(x + e).

E.g., for the largest rectangle with a fixed circumference: x(1 − x) ∼ (x + e)(1 − [x + e]) = x − x² + e − 2xe − e², which gives 2xe ∼ e after discarding the e² term, i.e., x = 1/2.
The fact that this principle stays relevant even today shows Fermat was a true prophet.

• Newton (Principia Mathematica): the solid of revolution with the least resistance (the physical model was wrong, but the mathematics is solid). Leibniz studied smooth problems and formalized the second-order sufficiency and necessity conditions for local optimality:

sufficiency: ∇f(x) = 0, ∇²f(x) ≻ 0;  necessity: ∇f(x) = 0, ∇²f(x) ⪰ 0.


• Euler, Lagrange: the isoperimetric problem, Queen Dido's problem, the Lagrange multiplier for constrained optimization,

max_y ∫_0^1 y(x) dx
s.t. y(0) = 0, y(1) = 1,
∫_0^1 √(ẏ² + 1) dx = l.

Solving the Euler–Lagrange equation leads to a differential equation. Weierstrass (1879) gave the first complete proof of the optimality of the circle.
The Newton–Raphson method was developed for solving nonlinear systems; Cauchy proposed the gradient descent method.

1.2. The Age of Computerized Planning, ISMP conference 1949.


• Kantorovich (Soviet Union), Mathematical Methods in the Organization and Plan-
ning of Production, 1939


– "questions of the optimum distribution of the work of machines and mechanisms, the minimization of scrap, the best utilization of raw materials and local materials, fuel, transportation, and so on. These problems are not directly comparable to problems considered in mathematical analysis. It is more correct to say that they are formally similar, and even turn out to be formally very simple, but the process of solving them with which one is faced [i.e., by mathematical analysis] is practically completely unusable, since it requires the solution of tens of thousands or even millions of systems of equations for completion."
– In the sense that we get more productivity from the current technology with just better organization & planning, the name "imaginary engineering" is not unwarranted.
– Introduced LP and the shadow price; Nobel Prize 1975 (optimal allocation of resources).
– Hand calculation: 6 hours to find the solution and 15 minutes to check the validity of the solution.
• Dantzig: the simplex method (1947, kept secret until after the war), Linear Programming and Extensions.
– Manufacturing, revenue management, telecommunications, advertising, architecture, circuit design, and countless other areas.
– Planning for the Air Force to reduce expenses and to increase losses to the enemy.
• Von Neumann, 1947: the LP duality theory.
• The KKT conditions; Karush's paper in 1939 was not published because it only covered the finite-dimensional problem.
• Bellman's optimality principle, 1957.
• 80s–90s: complexity results and the ellipsoid method from Arkadi Nemirovski, and Nesterov; the race to generate high-accuracy solutions for a wide range of problems.
• Interior point methods for SDP, LP, QP; SQP for non-convex problems.

1.3. The Age of Big Data (Stochastic and Uncertainty). Enormous data, gigantic models, a lot of noise, a lot of uncertainty, learning from interaction,

• the reign of first-order and zeroth-order methods.

2. Overview of The Computation Challenges

2.1. Stochastic & Statistics Setting. Consider the following large-scale regression problem


min_β ∥Xβ − Y∥² + r(β) = Σ_i (x_i^⊤ β − y_i)² + r(β),
where X ∈ R^{n×d} is the design (data) matrix whose rows are the x_i^⊤. If there is no regularizer term, this is the classical least-squares problem and the optimal solution can be computed in closed form:

β* = (X^⊤X)^{−1} X^⊤ Y.

Small-scale problems can be solved efficiently by carrying out this matrix inversion. In practice, both n and d can be prohibitively large, which creates the computational challenge. For example,

• Gene feature detection: obesity study in mice, n = 311, d = 23,388 genes. (Outline the method.)
• Image classification: CIFAR (small), n = 50k, d = 1000.

Figure 2.1. Gene Microarray
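To make the cost contrast concrete, here is a minimal sketch (assuming NumPy; the synthetic data and sizes are purely illustrative) comparing the closed-form solution with a first-order alternative that never forms or inverts X^⊤X:

```python
import numpy as np

# Synthetic regression data (illustrative sizes only).
rng = np.random.default_rng(0)
n, d = 500, 50
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Closed form: solve the normal equations X^T X beta = X^T Y.
# Costs O(n d^2 + d^3); prohibitive when d is large.
beta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# First-order alternative: gradient descent on ||X beta - Y||^2.
# Each iteration costs only O(n d): two matrix-vector products.
beta = np.zeros(d)
step = 1.0 / (2 * np.linalg.norm(X, ord=2) ** 2)  # 1/L, L = 2*sigma_max(X)^2
for _ in range(2000):
    beta -= step * 2 * X.T @ (X @ beta - Y)

print(np.linalg.norm(beta - beta_star))  # should be ~0
```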

2.2. Non-Convex Optimization. Burer–Monteiro factorization for solving large SDP systems (approximations to combinatorial problems, control, circuit design, etc.):

min_X f(X) → min_U f(UU^⊤),

where U is a skinny (tall and thin) matrix.

• Main challenge: no convexity; we can only resort to stationarity as the termination condition.
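A minimal sketch of the factorization idea (not a general SDP solver; the toy objective f(X) = ∥X − A∥_F², the symmetric target A, and the rank r are illustrative assumptions): replace the d×d variable X by a d×r factor U and run gradient descent directly on U.

```python
import numpy as np

# Toy instance: minimize f(X) = ||X - A||_F^2 over PSD X via X = U U^T.
rng = np.random.default_rng(1)
d, r = 30, 3                       # ambient dimension, factorization rank
B = rng.standard_normal((d, r))
A = B @ B.T                        # rank-r PSD target, so the optimal value is 0

# Chain rule: gradient of g(U) = f(U U^T) is 4 (U U^T - A) U.
U = 0.1 * rng.standard_normal((d, r))  # small random init, d x r instead of d x d
step = 1e-3
for _ in range(5000):
    G = 4 * (U @ U.T - A) @ U      # gradient through the factorization
    U -= step * G

print(np.linalg.norm(U @ U.T - A))  # small: a stationary point, here also global
```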

2.3. Non-smooth Non-convex Optimization. The output of a (feedforward) neural network is represented as

ψ(W, x) = f_1(W_1; ·) ∘ f_2(W_2; ·) ∘ ··· ∘ f_L(W_L; x), where f_i(W_i; y) = relu(W_i^⊤ y).

Figure 2.2. Neural Network


(Illustrate the non-smoothness by highlighting the kink.)
The challenge from non-smoothness: the gradient provides less meaningful convergence guidance. Example:

f(x) = ½∥x∥² + |x_1|,

∇f(x) = [x_1 + sign(x_1), x_2] (for x_1 ≠ 0).

(Illustrate from (0.0001, 0.1).)
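A minimal sketch of this illustration (assuming NumPy; the fixed step size is an illustrative choice): running (sub)gradient descent from (0.0001, 0.1), the x_1 coordinate keeps jumping across the kink at x_1 = 0 instead of settling, while x_2 decays smoothly.

```python
import numpy as np

def grad(x):
    # A subgradient of f(x) = 0.5*||x||^2 + |x_1| (taking sign(0) = 0).
    return np.array([x[0] + np.sign(x[0]), x[1]])

x = np.array([1e-4, 0.1])
eta = 0.1                      # fixed step size: too large to damp the kink
for t in range(10):
    x = x - eta * grad(x)
    print(t, x)
# x_1 oscillates around 0 with magnitude ~eta, while x_2 contracts by (1 - eta).
```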

Figure 2.3. The Zig-zag from non-smoothness

2.4. Variational Inequality. The goal of reinforcement learning is to optimize the policy to maximize the expected accumulated reward v_π(s_0), that is,

max_π v_π(s_0),

where

v_π(s) = E_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s]
       = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
       = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [r + γ v_π(s′)].

Notice we even lose access to the (even stochastic) first-order oracle, and the goal is to learn the optimal policy by exploring the environment. The Bellman operator can be reformulated as a fixed-point equation, or, more precisely, as a VI problem.
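As a concrete instance of the fixed-point view, here is a minimal sketch of policy evaluation as fixed-point iteration on the Bellman operator (the tiny two-state MDP is a made-up example):

```python
import numpy as np

# Made-up 2-state MDP under a fixed policy: P[s, s'] transition probabilities,
# R[s] expected one-step reward.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Policy evaluation: iterate the Bellman operator T(v) = R + gamma * P v.
# T is a gamma-contraction in the sup-norm, so v_t converges to the fixed point.
v = np.zeros(2)
for _ in range(200):
    v = R + gamma * P @ v

# Compare with the exact fixed point v = (I - gamma P)^{-1} R.
print(v, np.linalg.solve(np.eye(2) - gamma * P, R))
```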

Figure 3.1. The Oracle Computation

3. Introduction to Complexity Analysis

3.1. The Setup (Information-Theoretic). We want to solve the following problem:

min_{x∈X} f(x).

Now let's think about the requirements for a reasonable algorithm (solver). If I buy a solver, say, for solving regression problems, then I would want to use it to handle all regression problems I encounter, right? So for the algorithm designer, the key requirement is for one algorithm to solve a whole class of problems, rather than just one problem. To achieve this kind of generality, we need some interface between the algorithm and the problem instances. Specifically, the designer needs to specify what the algorithm needs from the user, say the design matrix and the label values in the regression case, and the user only needs to supply this information to utilize the algorithm on their problem.

For continuous optimization problems, the interface corresponds to the black-box oracle model shown in Figure 3.1. It is called a black-box oracle because the solver only sees the gradients and function values provided by the oracle, and has no idea about the objective function f beyond this information. The optimization algorithm is often iterative: in each iteration, the algorithm proposes a new point x^{t+1} and asks the oracle "hey, what's your opinion on this point?" As the algorithm gathers more information about the unknown function f, hopefully it can return a good solution.

Now let us make this more precise.

• The task of solving a class of problems is specified by a triple Σ = (F, O, T_ϵ), where
– F denotes the class of all objective functions, O denotes the oracle access, and T_ϵ denotes the termination criterion.
– For example, F could be the class of all Lipschitz-smooth functions, which includes all the polynomials, all the analytic functions, and many more. For O, we often utilize first-order access. For T_ϵ, there can be many choices: the function value gap, a stationarity condition, the distance to the optimal solution, etc. The choice of T_ϵ is an active area of research.
• In each step, the optimization algorithm computes the next iterate in some way (a mapping from all collected information to a new iterate x^{t+1}):
– x^{t+1} ← m(x^0, O(x^0), ..., x^t, O(x^t)).

• How do we characterize the performance of such an algorithm m for the task Σ? Averaging is problematic because it would require an underlying distribution over all problem instances. We focus on the worst-case complexity in our setting.
– P_Σ(m; f) := #iterations for m to find x^t satisfying T_ϵ.

– The minimax oracle complexity: P*_Σ(M; F) := min_{m∈M} max_{f∈F} P_Σ(m; f).
– Clearly, we have

P_Σ(M; f) ≤ max_{f∈F} min_{m∈M} P_Σ(m; f) ≤ P*_Σ(M; F) ≤ P_Σ(m; F),

where P_Σ(m; F) := max_{f∈F} P_Σ(m; f) and P_Σ(M; f) := min_{m∈M} P_Σ(m; f). Any upper bound on P*_Σ(M; F) is called an upper complexity bound, and any lower bound on it is called a lower complexity bound. The relation above illustrates ways to establish such upper and lower bounds (think about what needs to be done).
– In many cases we are going to see P_Σ(M; f) = O(1)·P_Σ(m; F) for some f and m; then we call m an optimal method, since its worst-case performance matches the lower bound up to a constant.

3.2. The Lower Oracle Complexity for Global Non-smooth Non-convex Optimization.

Definition 1. A function f : R^d → R is called M-Lipschitz if |f(x) − f(y)| ≤ M∥x − y∥ for all x, y.

Theorem 2. Consider the task Σ of finding an ϵ-optimal solution, i.e., x ∈ B_2^d(0, 1) such that f(x) − f* ≤ ϵ, over the class F of M-Lipschitz continuous functions, with the first-order oracle. The oracle complexity of deterministic iterative first-order methods, P*_Σ(M; F), is lower bounded by Ω((M/ϵ)^d).
Proof. For any given method m, let {x^t} := (x^1, x^2, ..., x^N) denote the iterates generated when every oracle answer is O(x^t) = [f(x^t), f′(x^t)] = [0, 0], with N = (M/(4ϵ))^d.
The balls centered at the x^t with radius 2ϵ/M do not cover B_2^d(0, 1) (a volumetric argument; see the calculation below), so there must exist some ball B(x̄; 2ϵ/M), with x̄ ∈ B_2^d(0, 1), that contains no iterate x^t. (Why?) We can then construct a worst-case function

f(x) = 0 if x ∉ B(x̄; 2ϵ/M), and f(x) = M∥x − x̄∥ − 2ϵ otherwise.

Clearly f is M-Lipschitz and consistent with the oracle answers, since f vanishes with zero gradient at every x^t. Its minimum is f* = f(x̄) = −2ϵ, so running algorithm m only achieves f(x^i) − f* ≥ 2ϵ ∀i ∈ [N]. Thus we have shown P*_Σ(M; F) ≥ (M/(4ϵ))^d. □
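The volumetric argument is a one-line calculation; a sketch in the notation above:

```latex
% The N = (M/4\epsilon)^d balls of radius 2\epsilon/M have total volume
\mathrm{vol}\Big(\textstyle\bigcup_{t=1}^{N} B(x^t;\, 2\epsilon/M)\Big)
  \;\le\; N\Big(\tfrac{2\epsilon}{M}\Big)^{d}\,\mathrm{vol}\big(B_2^d(0,1)\big)
  \;=\; 2^{-d}\,\mathrm{vol}\big(B_2^d(0,1)\big)
  \;<\; \mathrm{vol}\big(B_2^d(0,1)\big),
% so the balls cannot cover the unit ball: some \bar{x} \in B_2^d(0,1) lies at
% distance more than 2\epsilon/M from every iterate x^t.
```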
Remark: this result shows that finding a globally optimal solution is simply impossible in the high-dimensional setting. Instead we need to define narrower classes F and think about the corresponding T_ϵ when defining the optimization task.
So long, the dream of a universal algorithm that solves everything. But there is also a silver lining: those of us who do algorithm design will be able to keep our jobs for quite a while.

4. Convexity and the Three-Point Inequality

The central difficulty associated with the preceding example is non-convexity. The gradient, f′(x) := lim_{t→0} [f(x + t) − f(x)]/t, only contains local information in general, so we have to sample the whole landscape to obtain a good enough solution in terms of the function value gap.
To resolve this issue, one global condition we could impose is convexity.

Theorem 3. Given f : R^d → R, the following definitions of a convex function are equivalent:

• Classical: f((1 − α)x + αy) ≤ (1 − α)f(x) + αf(y) ∀α ∈ [0, 1], ∀x, y.
• Jensen: f(Σ_i p_i x_i) ≤ Σ_i p_i f(x_i) ∀p_i ≥ 0 s.t. Σ_i p_i = 1.
• If f is differentiable: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ ∀x, y.
• If ∇²f exists everywhere: ∇²f(x) ⪰ 0 ∀x.
Corollary 4. Given a C¹ convex function with domain R^d, x* is optimal iff ∇f(x*) = 0.
Remark: notice all the conditions are required to hold globally. In particular, the lower-support property (the gradient inequality) allows us to construct a separation oracle and also to certify optimality, and Jensen's inequality is incredibly useful in convergence analysis.

Definition 5. A set X is said to be convex if it contains every line segment between its points: (1 − α)x + αy ∈ X ∀α ∈ [0, 1] if x, y ∈ X.

Remark: a function is convex iff its epigraph (the region above its graph) is a convex set.

4.1. The Sub-differential Calculus. Now let us derive an upper oracle complexity bound for the task of minimizing a convex Lipschitz-continuous function, f(x) = |x| = max{x, −x} for instance. The kink at 0 can be a little troublesome for us because the gradient is not defined there. Resolving this issue requires generalizing the concept of gradient a little bit. (Draw the figure to illustrate the sub-gradient, and illustrate that it keeps the lower-support feature.)

Theorem 6. Given a convex function f : R^d → R, the following definitions of the sub-differential are equivalent [Rockafellar, Convex Analysis]:
(1) ∂f(x) = ∩_{y∈X} {s : f(y) ≥ f(x) + ⟨s, y − x⟩}.
(2) ∂f(x) = ∩_{d∈S(0,1)} {s : ⟨s, d⟩ ≤ f′(x; d) := lim_{t↓0} [f(x + td) − f(x)]/t}.
(3) ∂_C f(x) = conv{s : s = lim_{x_i→x} ∇f(x_i)}.
Proof. This is left as an exercise. Showing the equivalence between (1) and (2), and that ∂_C f(x) ⊂ ∂f(x), should be easy. □
Proposition 7. Given a closed convex function f : R^d → R̄, the following properties hold [Rockafellar, Variational Analysis, Chap. 10]:
(1) ∂f(x) ≠ ∅ for every x ∈ dom(f) := {x ∈ R^d : f(x) < ∞}.
(2) f(x) ≥ f(x̄) + ⟨f′(x̄), x − x̄⟩ ∀x, ∀f′(x̄) ∈ ∂f(x̄).
(3) (FO condition) x* is optimal if and only if 0 ∈ ∂f(x*).
(4) If f is separable, i.e., f(x) = Σ_i f_i(x_i), then ∂f(x) = ∂f_1(x_1) × ∂f_2(x_2) × ... × ∂f_m(x_m).
(5) Chain rule: if f(x) = g(F(x)) with F smooth and g convex, then ∂f(x) = J_F(x)^⊤ ∂g(F(x)) under a certain regularity condition on x, where J_F denotes the Jacobian of F.
(6) Summation rule: if f(x) = f_1(x) + f_2(x), then ∂f(x) = ∂f_1(x) + ∂f_2(x) under a certain regularity condition on x.

Some examples:
• f(x) = ∥x∥_1: we know ∂f(0) = B_∞^d(0, 1). [Use rule (4).]
• f(x) = max_i x_i: then ∂f(x̄) = Δ_+^n (the probability simplex) if x̄_1 = x̄_2 = ... = x̄_n.
• f(x) = max{f_1(x), f_2(x)}: we know ∂f(x̄) = Conv(∇f_1(x̄), ∇f_2(x̄)) when f_1(x̄) = f_2(x̄). [Use the chain rule (5).]
• For a piecewise C¹ function, we can find a close enough approximation of a sub-gradient by taking the derivative at a nearby point.
• Another example: given a closed convex set X, the indicator function δ_X(x) is given by


δ_X(x) := 0 if x ∈ X, and +∞ otherwise.

Using Theorem 6(1), we can calculate its sub-differential to be the normal cone associated with the convex set X,

∂δ_X(x̄) = {s : ⟨s, x − x̄⟩ ≤ 0 ∀x ∈ X} = N_X(x̄),

which reduces to {0} when x̄ is in the interior of X.
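A minimal numerical sketch of the first two examples (assuming NumPy; the coordinate-wise construction for ∥·∥_1 follows the separability rule (4), and the particular elements returned are one choice among many):

```python
import numpy as np

def subgrad_l1(x):
    # One element of the sub-differential of ||x||_1, built coordinate-wise
    # (rule 4): sign(x_i) where x_i != 0, and any value in [-1, 1] at a kink
    # (we pick 0 here).
    return np.sign(x)

def subgrad_max(x):
    # One element of the sub-differential of f(x) = max_i x_i: any convex
    # combination of the unit vectors e_i with i attaining the max. We pick
    # the uniform combination, a point of the probability simplex.
    active = np.isclose(x, x.max())
    return active / active.sum()

print(subgrad_l1(np.array([1.5, 0.0, -2.0])))   # [ 1.  0. -1.]
print(subgrad_max(np.array([3.0, 3.0, 1.0])))   # [0.5 0.5 0. ]
```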

5. The Oracle Complexity of the Sub-gradient Descent Method

We establish the upper oracle complexity bound for the sub-gradient descent (sub-GD) method in this section. The method is listed in Algorithm 1 below; we make a few observations.

Algorithm 1 The Subgradient Descent Method

• x^t ← argmin_{x∈X} ⟨x, f′(x^{t−1})⟩ + (η_t/2)∥x − x^{t−1}∥²

(1) The update can be interpreted as minimizing the linear support function evaluated at x^{t−1} (why?) plus a proximal term ∥x − x^{t−1}∥² that ensures stability. (Illustrate.)
(2) Another perspective: we know x^t ∈ argmin_{x∈X} ⟨x, f′(x^t)⟩ would certify the optimality of x^t, because f(x) ≥ l_f(x; x^t) ≥ l_f(x^t; x^t) ∀x, where l_f(x; x^t) := f(x^t) + ⟨f′(x^t), x − x^t⟩ is the linear support function. In that case we would not need the proximity term. However, when at x^{t−1} we do not know x^t; that is why we add the proximity term to account for the unknown, i.e., in the hope that x^t does not differ too much from x^{t−1}, so that f′(x^t) does not differ too much from f′(x^{t−1}). A word of caution: this is only intuition, and it is important to look at the convergence proof to really understand what's going on.
(3) The magnitude of the proximity term is controlled by the step-size parameter η_t. Notice it is a function of t; choosing the right step-size sequence is often crucial for convergence.
(4) If X = R^d, the update is the same as a sub-gradient descent step (i.e., descent along the sub-gradient direction), as the sketch below illustrates.
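A minimal sketch of Algorithm 1 (assuming NumPy; the unconstrained case X = R^d, the 1/√t step-size rule, and a simple full-run weighted average in place of the tail average of Theorem 10 below are all illustrative choices), applied to the M-Lipschitz function f(x) = ∥x∥_1:

```python
import numpy as np

def subgd(subgrad, x0, gamma, iters):
    # Algorithm 1 with X = R^d: the proximal update reduces to
    # x^t = x^{t-1} - f'(x^{t-1}) / eta_t, with eta_t = gamma * sqrt(t).
    x = x0.copy()
    avg, wsum = np.zeros_like(x), 0.0
    for t in range(1, iters + 1):
        eta = gamma * np.sqrt(t)
        x = x - subgrad(x) / eta
        avg, wsum = avg + x / eta, wsum + 1.0 / eta   # ergodic (weighted) average
    return avg / wsum

# np.sign is a valid subgradient map for f(x) = ||x||_1.
x_bar = subgd(np.sign, x0=np.array([2.0, -1.0, 0.5]), gamma=1.0, iters=10000)
print(np.abs(x_bar).sum())  # f(x_bar): close to the optimal value 0
```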

The following lemma illustrates the desired property of the proximity term. Let's parse the resulting inequality before going to the proof.

• The inequality compares the generated iterate x⁺ with an arbitrary point x ∈ X (we really care about x = x*). It says that under the "linear support function + proximity" model, the minimizer x⁺ is better than any point x ∈ X; more specifically, it is better by a margin of (η/2)∥x̄ − x⁺∥², which is (η_t/2)∥x^t − x^{t−1}∥² in Algorithm 1. This term will help us a lot in keeping the unknown difference f′(x^t) − f′(x^{t−1}) under control.

Lemma 8. Given a closed, convex, and non-empty set X, the iterate x⁺ generated by the proximal update

x⁺ ← argmin_{x∈X} ⟨x, g⟩ + (η/2)∥x − x̄∥²

satisfies

⟨g, x⁺ − x⟩ + (η/2)∥x⁺ − x∥² + (η/2)∥x̄ − x⁺∥² ≤ (η/2)∥x̄ − x∥² ∀x ∈ X.

Proof. By the FO condition for convex optimization, we get

0 ∈ g + η(x⁺ − x̄) + N_X(x⁺) ⇒ −[g + η(x⁺ − x̄)] ∈ N_X(x⁺),

that is,

⟨−g − η(x⁺ − x̄), x − x⁺⟩ ≤ 0 ∀x ∈ X.

Since 2⟨x⁺ − x̄, x⁺ − x⟩ = ∥x − x⁺∥² + ∥x̄ − x⁺∥² − ∥x̄ − x∥², the result follows immediately. □

Now we turn to analyzing Algorithm 1.

• An essential feature of Algorithm 1, and in fact of all algorithms we are going to see, is its recursive nature (illustrate: the relation between x^t and x^{t+1}, and so on). This recursive nature gives simplicity to both the algorithm and the analysis, but it also makes the analysis somewhat intricate.
• The next proposition develops a recursive inequality useful for our analysis.

Proposition 9. For an M-Lipschitz continuous function, the iterates generated by Algorithm 1 satisfy

(1/η_{t+1})⟨f′(x^t), x^t − x⟩ + ½∥x − x^{t+1}∥² ≤ ½∥x − x^t∥² + M²/(2η²_{t+1}) ∀x ∈ X.

Proof. Apply Lemma 8 with g = f′(x^t), x̄ = x^t, x⁺ = x^{t+1}, and η = η_{t+1}, then use Young's inequality, ⟨f′(x^t), x^t − x^{t+1}⟩ − (η_{t+1}/2)∥x^t − x^{t+1}∥² ≤ ∥f′(x^t)∥²/(2η_{t+1}), together with ∥f′(x^t)∥ ≤ M; the chain of inequalities is written out below. □
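In more detail (a sketch writing out the proof just given):

```latex
\langle f'(x^t), x^t - x\rangle
  = \langle f'(x^t), x^t - x^{t+1}\rangle + \langle f'(x^t), x^{t+1} - x\rangle
% bound the first term by Young's inequality, the second by Lemma 8:
\le \Big[\frac{\|f'(x^t)\|^2}{2\eta_{t+1}}
      + \frac{\eta_{t+1}}{2}\|x^t - x^{t+1}\|^2\Big]
  + \Big[\frac{\eta_{t+1}}{2}\|x^t - x\|^2
      - \frac{\eta_{t+1}}{2}\|x^{t+1} - x\|^2
      - \frac{\eta_{t+1}}{2}\|x^t - x^{t+1}\|^2\Big]
\le \frac{M^2}{2\eta_{t+1}}
  + \frac{\eta_{t+1}}{2}\|x^t - x\|^2
  - \frac{\eta_{t+1}}{2}\|x^{t+1} - x\|^2.
% Dividing through by \eta_{t+1} gives Proposition 9.
```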

Let me say something more about the result.

• When we chain these inequalities up, the terms ½∥x − x^{t+1}∥² − ½∥x − x^t∥² telescope (that is the nice part), while the terms M²/(2η²_{t+1}) accumulate. This is what happens with this sort of recursive algorithm: any error term incurred in one iteration propagates, so we try quite hard to design it to be just right. Sometimes it becomes a game of algebra, and some geniuses turn out to be really good at that game.
• In particular, a telescoping sum of the preceding relation leads us to the following (why? Illustrate how convexity allows us to turn the inner products into a function value bound):

(5.1)
Σ_{t=1}^{N} (1/η_t)[f(x^{t−1}) − f(x*)] + ½∥x^N − x*∥² ≤ Σ_{t=1}^{N} (1/η_t)⟨f′(x^{t−1}), x^{t−1} − x*⟩ + ½∥x^N − x*∥²
≤ ½∥x^0 − x*∥² + Σ_{t=1}^{N} M²/(2η_t²).
• Let's do a rough calculation with step size η_t = γt^a for some a ∈ [0, 1] (do it; you will see that √t has the optimal order). A sketch of the calculation follows.
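A sketch of that calculation, assuming η_t = γt^a and writing D := ∥x^0 − x*∥:

```latex
% Plugging \eta_t = \gamma t^a into (5.1) and dividing by \sum_{t=1}^N 1/\eta_t:
\sum_{t=1}^{N} \frac{1}{\eta_t} \;\gtrsim\; \frac{N^{1-a}}{\gamma},
\qquad
\sum_{t=1}^{N} \frac{M^2}{2\eta_t^2} \;\lesssim\;
  \frac{M^2}{\gamma^2}\,\max\{N^{1-2a},\ \log N\},
% so that
\min_{t \le N} f(x^{t-1}) - f_{*} \;\lesssim\;
  \gamma D^2 N^{a-1} \;+\; \frac{M^2}{\gamma}\,\max\{N^{-a},\ N^{a-1}\log N\}.
% The two terms balance at a = 1/2, giving an O(N^{-1/2}) rate up to a \log N
% factor, which the tail averaging in Theorem 10 below removes.
```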

Theorem 10. With η_t = γ√(t − 1), the ergodic average solution x̄_{⌊k/2⌋} := Σ_{t=⌊k/2⌋}^{k} (1/η_{t+1}) x^t / Σ_{t=⌊k/2⌋}^{k} (1/η_{t+1}), for k ≥ 4, satisfies (see Chapter 3 of [2] for more information)

f(x̄_{⌊k/2⌋}) − f* ≤ Σ_{t=⌊k/2⌋}^{k} (1/η_{t+1})⟨f′(x^t), x^t − x*⟩ / Σ_{t=⌊k/2⌋}^{k} (1/η_{t+1}) ≤ O(1) · (1/√k) · (M²/γ + γD²),

where D denotes a bound on the distances ∥x^t − x*∥.


Proof. The important inequalities for the analysis (recall η_{t+1} = γ√t):

Σ_{t=⌊k/2⌋}^{k} 1/√t ≥ ∫_{⌊k/2⌋}^{k} dt/√(t+1) = 2√(t+1) |_{t=⌊k/2⌋}^{t=k} ≥ ½√(k+1),

Σ_{t=⌊k/2⌋}^{k} 1/t ≤ ∫_{⌊k/2⌋}^{k} dt/(t−1) = ln(t−1) |_{t=⌊k/2⌋}^{t=k} ≤ ln(3) ∀k ≥ 4.

Plugging these into the telescoped recursion of Proposition 9, started from t = ⌊k/2⌋ as in (5.1), gives the exact bound

f(x̄_{⌊k/2⌋}) − f* ≤ (1/√(k+1)) · (ln(3)M²/γ + γD²). □

Remark 11. The complexity bound is dimension-independent. The quantity Σ_{t=⌊k/2⌋}^{k} (1/η_{t+1})⟨f′(x^t), x^t − x⟩ contains verifiable primal and dual information.

6. The Dual Averaging Method [3]

A closer look at (5.1) reveals an apparent contradiction.

• More than a function value gap, we have a primal & dual gap: the difference between the primal solution f(x̄^N) and a lower bound model l_N(x). Indeed,

Σ_{t=1}^{N} (1/η_t)⟨f′(x^{t−1}), x^{t−1} − x⟩ = Σ_{t=1}^{N} (1/η_t) f(x^{t−1}) − Σ_{t=1}^{N} (1/η_t)[f(x^{t−1}) + ⟨f′(x^{t−1}), x − x^{t−1}⟩]
= Σ_{t=1}^{N} (1/η_t) f(x^{t−1}) − (Σ_{t=1}^{N} 1/η_t) l_N(x),

with the weighted-average support model

l_N(x) := Σ_{t=1}^{N} (1/η_t)[f(x^{t−1}) + ⟨f′(x^{t−1}), x − x^{t−1}⟩] / (Σ_{t=1}^{N} 1/η_t) ≤ f(x) ∀x, by convexity.

• Specifically, even though we do not know f*, if we know an upper bound D_0 ≥ ∥x* − x^0∥, we can compute a lower bound on it:

f* ≥ min_{x∈B(x^0, D_0)} l_N(x).

• To ensure the convergence of the sub-GD sequence, we need 1/η_t → 0. This is necessary from the primal perspective of searching for the optimal solution, because the sub-gradient does not vanish near x*.
• However, from the dual perspective of building a lower approximation, this makes little sense, because it means we pay less and less attention to the support functions generated near x*.

6.1. The Method. The dual averaging method attempts to reconcile this apparent contradiction.

Algorithm 2 The Dual Averaging Method

Let x^0 = 0 and s^0 = 0. For t = 1, 2, ...:
(1) g^{t−1} = f′(x^{t−1}).
(2) s^t = s^{t−1} + g^{t−1}.
(3) Compute x^t = argmin_{x∈X} ⟨s^t, x − x^0⟩ + (β_t/2)∥x − x^0∥².

We make a few observations regarding the algorithm.


(1) The method is designed entirely from the dual perspective. It computes the dual solution s^t by averaging the sub-gradient information with uniform weights, which resolves the observed contradiction, and the sub-gradient is evaluated at the proximal minimizer associated with the lower approximation model.
(2) The progress is controlled by the step-size parameter β_t = O(√t). However, the convergence proof in the original paper [3] is quite involved.
(3) This dual perspective of finding a good lower bound model is very useful in optimization. The insight that we only need to obtain a good lower approximation model has been especially useful in distributed optimization [1]. A minimal implementation sketch follows.
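A minimal sketch of Algorithm 2 (assuming NumPy; the unconstrained case X = R^d, the anchor x^0 = 0, β_t = √t, and the shifted objective f(x) = ∥x − c∥_1 with a made-up target c are all illustrative choices):

```python
import numpy as np

# f(x) = ||x - c||_1 with a made-up target c; sign(x - c) is a valid subgradient.
c = np.array([2.0, -1.0, 0.5])
subgrad = lambda x: np.sign(x - c)

# Algorithm 2 with X = R^d and x0 = 0: step (3) has the closed form
# x^t = -s^t / beta_t, since argmin_x <s, x> + (beta/2)||x||^2 = -s/beta.
x, s = np.zeros(3), np.zeros(3)
for t in range(1, 10001):
    s += subgrad(x)               # accumulate dual information, uniform weights
    x = -s / np.sqrt(t)           # beta_t = sqrt(t), proximal step anchored at x0
print(np.abs(x - c).sum())        # f(x) - f_*: small, hovering at O(1/sqrt(t))
```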

6.2. Another Interpretation. An alternative realization of the same idea is the double stabilization method:

Algorithm 3 The Double Stabilization Method

(1) x^t ← argmin_{x∈X} ⟨g^{t−1}, x⟩ + (η_{t−1}/2)∥x − x^{t−1}∥² + ((η_t − η_{t−1})/2)∥x − x^0∥².

Here D_Φ(x; y) := Φ(y... Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩ denotes the Bregman divergence of a strongly convex function Φ; Algorithm 3 corresponds to Φ = ½∥·∥², for which D_Φ(x; y) = ½∥x − y∥². Recall that we have the following three-point inequality.
Lemma 12. For x⁺ ← argmin_{x∈X} ⟨g, x⟩ + αD_Φ(x; x̄) + βD_Φ(x; x̂), we have

(6.1) ⟨g, x⁺ − x⟩ + αD_Φ(x⁺; x̄) + βD_Φ(x⁺; x̂) + (α + β)D_Φ(x; x⁺) ≤ αD_Φ(x; x̄) + βD_Φ(x; x̂).
Proof. We have −[g + α(∇Φ(x⁺) − ∇Φ(x̄)) + β(∇Φ(x⁺) − ∇Φ(x̂))] ∈ N_X(x⁺). Then the result follows from the three-point identity

⟨∇Φ(x̄) − ∇Φ(x⁺), x − x⁺⟩ = D_Φ(x; x⁺) + D_Φ(x⁺; x̄) − D_Φ(x; x̄). □



This leads us to the following one-step recursion relation.

Theorem 13. If η_t is monotonically non-decreasing, then for Algorithm 3 we have

Σ_{t=0}^{N} ⟨g^t, x^t − x⟩ + η_{N+1} D_Φ(x; x^N) ≤ η_N D_Φ(x; x^0) + Σ_{t=0}^{N} ∥g^t∥²/η_t.

In particular, if η_t = γ√(t + 1), then we get the following O(1/√N) convergence rate:

(1/N) Σ_{t=0}^{N} ⟨g^t, x^t − x⟩ ≤ O(1) · (1/√N) · (γD_Φ(x; x^0) + M²/(αγ)),

where α is the strong convexity modulus of Φ.
Proof. The following relation is an immediate consequence of (6.1), applied to Algorithm 3 with α = η_{t−1} and β = η_t − η_{t−1}:

⟨g^{t−1}, x^{t−1} − x⟩ + η_t D_Φ(x; x^t) + (η_t − η_{t−1}) D_Φ(x^t; x^0) ≤ η_{t−1} D_Φ(x; x^{t−1}) + (η_t − η_{t−1}) D_Φ(x; x^0) + ∥g^{t−1}∥²/η_{t−1};

summing over t telescopes. □

Remark:

• This alternative method still satisfies the motivation: the dual model constructed utilizes the average of all the dual information, and the primal iterates remain stabilized in some ball around x*. But the analysis is more transparent.
• The key question is how to stabilize the sequence of generated iterates. This question of stabilization, and how to do it efficiently, is still being actively researched. There are still many opportunities in that direction!

References

[1] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
[2] Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer Nature, 2020.
[3] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
