1. Intro
• Fermat (adequality): at an extremum, f(x) ∼ f(x + e) for a small perturbation e. E.g., for the largest rectangle with a fixed perimeter (sides x and 1 − x): x(1 − x) ∼ (x + e)(1 − [x + e]) = x + e − x² − 2xe − e², and discarding the e² term gives 2xe = e, i.e., x = 1/2. The fact that this principle stays relevant even today shows Fermat was a true prophet.
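This adequality computation can be replayed symbolically; below is a minimal sketch using sympy (our own illustration, not part of the original notes).

```python
# Fermat's adequality for the rectangle example, checked symbolically.
import sympy as sp

x, e = sp.symbols('x e')
diff = sp.expand((x + e) * (1 - x - e) - x * (1 - x))  # f(x+e) - f(x)
print(diff)                                            # e - 2*e*x - e**2
# Divide by e, then "adequate" by setting the leftover e to zero:
print(sp.solve(sp.expand(diff / e).subs(e, 0), x))     # [1/2]
```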
• Newton (Principia Mathematica): the solid of revolution with the least resistance (the physical model was wrong, but the mathematics is solid). Leibniz studied smooth problems and formalized second-order sufficiency and necessity conditions for local optimality.
• The isoperimetric (Dido's) problem: maximize the area under the curve subject to a fixed arc length,

    max_y ∫₀¹ y(x) dx
    s.t. y(0) = 0, y(1) = 1,
         ∫₀¹ √(ẏ² + 1) dx = l.
Solving the Euler–Lagrange equation leads to a differential equation. Weierstrass (1879) gave the first complete proof of the optimality of the circle.
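For concreteness, here is a sketch of how the Euler–Lagrange route goes for this problem, with the length constraint adjoined by a multiplier λ (standard calculus of variations; the notation is ours).

```latex
% Adjoin the constraint with a multiplier and extremize
%   J[y] = \int_0^1 \bigl( y - \lambda \sqrt{1 + \dot y^2} \bigr)\, dx .
% The Euler--Lagrange equation \frac{d}{dx}\partial_{\dot y} L = \partial_y L reads
\frac{d}{dx}\left( \frac{-\lambda \dot y}{\sqrt{1 + \dot y^2}} \right) = 1
\quad \Longrightarrow \quad
\frac{\lambda \dot y}{\sqrt{1 + \dot y^2}} = c - x ,
% and one more integration gives (x - c)^2 + (y - c')^2 = \lambda^2 :
% the extremals are circular arcs, consistent with Weierstrass's result.
```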
• Newton–Raphson method for solving nonlinear systems; Cauchy proposed the gradient descent method.
1.3. The Age of Big Data (Stochasticity and Uncertainty). Enormous data, gigantic models, lots of noise, lots of uncertainty, learning from interaction.
2.1. Stochastic & Statistical Setting. Consider the following large-scale regression problem
    min_β ∥Xβ − Y∥² + r(β) = Σ_i (x_i^⊤ β − y_i)² + r(β),

where X ∈ R^{n×d} is the design (data) matrix. If there is no regularizer term, this is the classical least squares problem and the optimal solution can be computed as

    β^* = (X^⊤ X)^{−1} X^⊤ Y.
Small-scale problems can be solved efficiently by matrix inversion. In practice, both n and d can be prohibitively large, leading to computational challenges (a numerical sketch follows the examples). For example,
• Gene feature detection (obesity study in mice): n = 311 samples, d = 23,388 genes.
• Image classification: CIFAR (small), n = 50k, d = 1,000.
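As a small numerical sketch of the closed-form solution (NumPy; the synthetic data and all names are ours):

```python
# Closed-form least squares vs. the numerically preferred lstsq route.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10                          # small n, d: the closed form is cheap
X = rng.standard_normal((n, d))         # design matrix
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

beta_direct = np.linalg.solve(X.T @ X, X.T @ Y)     # beta* = (X^T X)^{-1} X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_direct, beta_lstsq))         # True

# For large n and d, even forming X^T X (O(n d^2)) and factorizing it (O(d^3))
# becomes the bottleneck -- which is what motivates first-order methods.
```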
2.2. Nonconvex Optimization. Burer–Monteiro factorization for solving large SDPs (approximations to combinatorial problems, control, circuit design, etc.):

    min_X f(X) → min_U f(UU^⊤),

where U is a skinny (n × r with r ≪ n) matrix.
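A minimal sketch of the factorization idea on one concrete instance, the max-cut-style SDP min_X ⟨C, X⟩ s.t. diag(X) = 1, X ⪰ 0 (the instance, step size, and projection are our illustrative choices, not from the notes):

```python
# Burer-Monteiro: optimize over a skinny U with X = U U^T instead of over X.
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 6                                        # r << n
C = rng.standard_normal((n, n)); C = (C + C.T) / 2  # symmetric cost matrix

U = rng.standard_normal((n, r))
U /= np.linalg.norm(U, axis=1, keepdims=True)       # diag(U U^T) = 1: feasible

step = 1e-2
for _ in range(2000):
    G = 2 * C @ U                                   # gradient of <C, U U^T> in U
    U -= step * G
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # project rows back to unit norm

print("objective <C, UU^T> =", np.sum(C * (U @ U.T)))
```

The point is that U has only nr entries instead of n², so each step is far cheaper than maintaining a full PSD matrix.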
2.4. Variational Inequality. The goal of reinforcement learning is to optimize the policy so as to maximize the expected accumulated reward v_π(s_0), that is,

    max_π v_π(s_0),

where v_π(s_0) = E[Σ_{t≥0} γ^t r(s_t, a_t) | s_0, a_t ∼ π(·|s_t)] denotes the expected (discounted) accumulated reward.
Notice that here we lose access even to a (stochastic) first-order oracle, and the goal is to learn the optimal policy by exploring the environment. The Bellman operator can be reformulated as a fixed-point equation, or, to be more precise, as a VI problem.
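To make the fixed-point view concrete, here is a minimal value-iteration sketch on a tiny synthetic MDP (the MDP and all parameters are fabricated for illustration):

```python
# The Bellman optimality operator T as a fixed-point iteration: v <- T v.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)  # P[a, s, s']
R = rng.random((nA, nS))                                          # R[a, s]

v = np.zeros(nS)
for _ in range(500):
    v = np.max(R + gamma * P @ v, axis=0)   # T is a gamma-contraction

pi = np.argmax(R + gamma * P @ v, axis=0)   # greedy policy w.r.t. the fixed point
print(v, pi)
```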
3.1. The Setup (Information-Theoretic). We want to solve the following problem

    min_{x∈X} f(x).
Now let's think about the requirements for a reasonable algorithm (solver). If I buy a solver, say, for solving regression problems, then I would want to use it to handle all regression problems I encounter, right? So for the algorithm designer, the key requirement is for one algorithm to solve a whole class of problems, rather than just one problem. To achieve this kind of generality, we need some interface between the algorithm and the problem instances. Specifically, the designer needs to specify what the algorithm needs from the user, say the design matrix and the label values in the regression case, and the user only needs to supply this information to utilize the algorithm on his problem.
For continuous optimization problems, the interface corresponds to the black-box oracle model shown in Figure 3.1. It is called a black-box oracle because the solver only sees the gradients and function values provided by the oracle, and has no idea about the objective function f beyond this information. The optimization algorithm is often iterative. In each iteration, the algorithm proposes a new point x^{t+1} and asks the oracle "hey, what's your opinion on this point?" As the algorithm gathers more information about the unknown function f, hopefully it can return a good solution.
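A minimal sketch of what this interface could look like in code (all names here are our own illustration):

```python
# A black-box first-order oracle: the solver sees only (f(x), f'(x)) pairs.
from typing import Callable, Tuple
import numpy as np

Oracle = Callable[[np.ndarray], Tuple[float, np.ndarray]]  # x -> (f(x), f'(x))

def make_quadratic_oracle(A: np.ndarray, b: np.ndarray) -> Oracle:
    """Oracle for f(x) = 0.5 x^T A x - b^T x; the solver never sees A or b."""
    return lambda x: (0.5 * x @ A @ x - b @ x, A @ x - b)

def solve(oracle: Oracle, x0: np.ndarray, steps: int = 100, lr: float = 0.1):
    x = x0.copy()
    for _ in range(steps):
        _, g = oracle(x)   # "hey, what's your opinion on this point?"
        x -= lr * g        # the solver acts on oracle answers alone
    return x

A, b = np.eye(3), np.ones(3)
print(solve(make_quadratic_oracle(A, b), np.zeros(3)))  # ~ [1, 1, 1]
```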
Now let us make this more precise.
3.2. The Lower Oracle Complexity for Global Non-smooth Non-convex Optimization.
Definition 1. A function f : R^d → R is called M-Lipschitz if |f(x) − f(y)| ≤ M∥x − y∥ for all x, y ∈ R^d.
Theorem 2. Consider the task Σ of finding an ϵ-optimal solution, i.e., x ∈ B_2^d(0, 1) such that f(x) − f_* ≤ ϵ, over the class F of M-Lipschitz continuous functions, equipped with the first-order oracle. The oracle complexity of deterministic iterative first-order methods, P*_Σ(M; F), is lower bounded by Ω((M/ϵ)^d).
Proof. For any given method m, let {x^t} := (x^1, x^2, ..., x^N) denote the iterates generated under the oracle responses O(x^t) = [f(x^t), f′(x^t)] = [0, 0], for N = (M/(4ϵ))^d.
Since the balls centered at the {x^t} with radius 2ϵ/M do not form a covering of B_2^d(0, 1) (volumetric argument: their total volume is at most N(2ϵ/M)^d = 2^{−d} < 1 times that of the unit ball), there must exist some ball B(x̄; 2ϵ/M) with x̄ ∈ B_2^d(0, 1) that does not intersect any of the x^t. We can construct a worst-case function f with

    f(x) = 0 if x ∉ B(x̄; 2ϵ/M), and f(x) = M∥x − x̄∥ − 2ϵ otherwise.

Clearly f is M-Lipschitz and consistent with the oracle responses, and the iterates generated by running algorithm m satisfy f(x^t) − f_* ≥ 2ϵ for all t ∈ [N]. Thus we have shown P*_Σ(M; F) ≥ (M/(4ϵ))^d. □
Remark: this result shows that finding a globally optimal solution is simply hopeless in the high-dimensional setting. Instead we need to define narrower classes F and think about the corresponding T_ϵ when defining the optimization task.
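To get a feel for how quickly this lower bound blows up, a quick computation (the values of M and ϵ are chosen purely for illustration):

```python
# N = (M / (4 eps))^d oracle calls are needed in the worst case.
M, eps = 1.0, 0.01
for d in (2, 10, 100):
    print(d, (M / (4 * eps)) ** d)   # 625.0, ~9.5e13, ~6.2e139
```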
So long, the dream of a universal algorithm that solves everything. But there is also a silver lining: those of us who do algorithm design will keep our jobs for quite a while.
The central difficulty associated with the preceding example is non-convexity. The gradient, f′(x) := lim_{t→0} [f(x + t) − f(x)]/t, only contains local information in general, so we have to sample the whole landscape to obtain a good enough solution in terms of the function value gap.
To resolve this issue, one global condition we could impose is convexity.
4.1. The sub-differential calculus. Now let us derive some upper oracle complexity bounds for the task of minimizing convex Lipschitz continuous functions, f(x) = |x| = max{x, −x} for instance. The kink at 0 can be a little troublesome for us because the gradient is not defined there. Resolving this issue requires us to generalize the concept of gradient a little bit: g is a sub-gradient of f at x̄, written g ∈ ∂f(x̄), if f(x) ≥ f(x̄) + ⟨g, x − x̄⟩ for all x. (Draw the figure to illustrate the sub-gradient, and illustrate that it keeps the lower support feature.)
Some examples:
• f(x) = ∥x∥_1: at x̄ = 0 we know ∂f(0) = B_∞^d(0, 1). [Use Rule 4]
• f(x) = max_i x_i: ∂f(x̄) = Δ_+^n (the probability simplex) if x̄_1 = x̄_2 = ... = x̄_n.
• f(x) = max{f_1(x), f_2(x)}: we know ∂f(x̄) = Conv(∇f_1(x̄), ∇f_2(x̄)) when f_1(x̄) = f_2(x̄). [Use Chain rule 5]
• For a piecewise C^1 function, we can find a close enough approximation by taking the derivative at a close enough point.
• Another example: given a closed convex set X, the indicator function δ_X(x) is given by
    δ_X(x) := 0 if x ∈ X, and +∞ otherwise.
Using Theorem 4.a) we can calculate its sub-differential: ∂δ_X(x̄) = {0} when x̄ is in the interior of X, and when x̄ is on the boundary of X, ∂δ_X(x̄) matches the normal cone associated with the convex set.
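These sub-differential formulas are easy to sanity-check numerically; below is a small sketch for f = ∥·∥_1 (our own illustration):

```python
# Check the sub-gradient inequality f(y) >= f(x) + <g, y - x> for f = ||.||_1.
import numpy as np

def subgrad_l1(x):
    # np.sign picks 0 at zero coordinates, which lies in [-1, 1]: a valid choice
    return np.sign(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
g = subgrad_l1(x)
for _ in range(1000):
    y = rng.standard_normal(5)
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-12
print("sub-gradient inequality holds on all sampled points")
```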
We establish the upper oracle complexity bound associated with the sub-gradient descent (sub-GD) method in this section. As listed in Algorithm 1, we make a few observations (a code sketch follows the list).
(1) The update can be interpreted as minimizing the linear support functional evaluated at x^{t−1} (why?) plus a proximal term ∥x − x^{t−1}∥² to ensure its stability. (Illustrate)
(2) Another perspective: we know x^t ∈ arg min_{x∈X} ⟨x, f′(x^t)⟩ would ensure the optimality of x^t, because f(x) ≥ l_f(x; x^t) ≥ l_f(x^t; x^t) = f(x^t) for all x ∈ X. In that case we would not need the proximal term. However, standing at x^{t−1} we do not know x^t; that is why we add the proximity term to account for the unknown, i.e., in the hope that x^t does not differ too much from x^{t−1}, so that f′(x^t) does not differ too much from f′(x^{t−1}). A word of caution: this is only intuition, and it is important to look at the convergence proof to really understand what's going on.
(3) The magnitude of the proximity term is controlled by the step-size parameter η_t. Notice it is a function of t; choosing the right step-size sequence is often crucial for convergence.
(4) If X = R^d, the update is the same as the sub-gradient descent step (i.e., descent along the negative sub-gradient direction).
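Algorithm 1 itself is not reproduced in this excerpt; the following minimal sketch implements the proximal update above for the illustrative choices X = unit Euclidean ball and f(x) = ∥x − a∥_1 (both ours):

```python
# Projected sub-gradient descent:
#   x^t = argmin_{x in X} <f'(x^{t-1}), x> + (eta_t / 2) ||x - x^{t-1}||^2
#       = P_X( x^{t-1} - f'(x^{t-1}) / eta_t ).
import numpy as np

def project_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

rng = np.random.default_rng(0)
a = rng.standard_normal(10); a /= 2 * np.linalg.norm(a)  # put the target inside X
x = np.zeros(10)
for t in range(1, 501):
    g = np.sign(x - a)               # a sub-gradient of ||x - a||_1
    eta = np.sqrt(t)                 # eta_t ~ sqrt(t), cf. Theorem 10
    x = project_ball(x - g / eta)    # closed form of the proximal step
print(np.sum(np.abs(x - a)))         # shrinks toward 0, up to O(1/sqrt(t)) wiggle
```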
The following lemma illustrates the desired property of the proximity term. Let's parse the resulting inequality before going to the proof.
• The inequality compares the generated iterate x^+ with an arbitrary point x ∈ X (we really care about x^*). It says that under the linear support function & proximity model, the minimizer is always better than any point x ∈ X. More specifically, it is better by a margin of (η/2)∥x̄ − x^+∥², or (η_t/2)∥x^t − x^{t−1}∥² in Algorithm 1. This term will help us a lot in keeping the unknown term f′(x^t) − f′(x^{t−1}) under control.
Lemma 8. Given a convex, closed and non-empty set X, the iterate x^+ generated by the proximal update

    x^+ ← arg min_{x∈X} ⟨x, g⟩ + (η/2)∥x − x̄∥²

satisfies

    ⟨g, x^+ − x⟩ + (η/2)∥x^+ − x∥² + (η/2)∥x̄ − x^+∥² ≤ (η/2)∥x̄ − x∥², ∀x ∈ X.
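A quick numeric sanity check of the lemma, with X the unit ball and the update computed in closed form as a projection (our setup; in the unconstrained case the inequality even holds with equality):

```python
# Verify Lemma 8 on random instances: x+ = P_X(xbar - g / eta).
import numpy as np

def project_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

rng = np.random.default_rng(1)
for _ in range(1000):
    g = rng.standard_normal(4)
    xbar = project_ball(rng.standard_normal(4))
    eta = rng.uniform(0.5, 5.0)
    xplus = project_ball(xbar - g / eta)       # the proximal update
    x = project_ball(rng.standard_normal(4))   # an arbitrary point of X
    lhs = g @ (xplus - x) + eta / 2 * (np.sum((xplus - x) ** 2)
                                       + np.sum((xbar - xplus) ** 2))
    rhs = eta / 2 * np.sum((xbar - x) ** 2)
    assert lhs <= rhs + 1e-9
print("Lemma 8 verified on all random instances")
```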
• An essential feature of Algorithm 1, and in fact of all algorithms we are going to see, is its recursive nature (illustrate: the relation between x^t and x^{t+1}, and so on and so forth). This recursive nature gives both the algorithm and the analysis their simplicity, but it also makes the analysis somewhat intricate.
• The next proposition develops a recursive inequality useful for our analysis.
Proof. Use Young's inequality: ⟨f′(x^t), x^t − x^{t−1}⟩ + (η_t/2)∥x^t − x^{t−1}∥² ≥ −∥f′(x^t)∥²/(2η_t). Combining this with Lemma 8 and summing over t = 1, ..., N yields the recursive bound

(5.1)
    Σ_{t=1}^N (1/η_t)[f(x^{t−1}) − f(x^*)] + (1/2)∥x^N − x^*∥²
        ≤ Σ_{t=1}^N (1/η_t)⟨f′(x^{t−1}), x^{t−1} − x^*⟩ + (1/2)∥x^N − x^*∥²
        ≤ (1/2)∥x^0 − x^*∥² + Σ_{t=1}^N M²/(2η_t²). □
• Let's do a rough calculation with stepsize η_t = t^x for some x ∈ [0, 1] (do it; you will see that √t has the optimal order — a sketch follows).
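A rough version of that calculation, with constants dropped (our sketch; D stands for ∥x^0 − x^*∥):

```latex
% With \eta_t = t^x, use \sum_{t=1}^N t^{-x} \asymp N^{1-x} and
% \sum_{t=1}^N t^{-2x} \asymp N^{1-2x} (\asymp \log N at x = 1/2). Dividing (5.1)
% by \sum_t 1/\eta_t gives
\min_{t \le N} f(x^{t-1}) - f(x^*)
\;\lesssim\;
\frac{\tfrac{1}{2} D^2 + \sum_{t=1}^N M^2 / (2\eta_t^2)}{\sum_{t=1}^N 1/\eta_t}
\;\asymp\;
\frac{D^2 + M^2 N^{1-2x}}{N^{1-x}} .
% For x < 1/2 this is of order N^{-x}; for x >= 1/2 it is of order N^{x-1}
% (up to logs). Both orders are best at x = 1/2, i.e. \eta_t \sim \sqrt{t},
% giving the O(1/\sqrt{N}) rate of Theorem 10.
```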
Theorem 10. With η_t = γ√(t − 1), the ergodic average solution x̄_{⌊k/2⌋} = Σ_{t=⌊k/2⌋}^k (1/η_{t+1}) x^t / Σ_{t=⌊k/2⌋}^k (1/η_{t+1}) for k ≥ 4 satisfies (see Chapter 3 of [2] for more information)

    f(x̄_{⌊k/2⌋}) − f_* ≤ Σ_{t=⌊k/2⌋}^k (1/η_{t+1})⟨f′(x^t), x^t − x^*⟩ / Σ_{t=⌊k/2⌋}^k (1/η_{t+1}) ≤ O(1) (1/√k)(M²/γ + γD²),

and, more precisely,

    f(x̄_{⌊k/2⌋}) − f_* ≤ (1/√(k+1))(ln(3)M²/γ + γD²). □
Remark 11. The complexity bound is dimension independent. The quantity Σ_{t=⌊k/2⌋}^k (1/η_{t+1})⟨f′(x^t), x^t − x^*⟩ contains verifiable primal and dual information.
• Beyond the function value gap, we have a primal & dual gap: the difference between the primal solution f(x̄^N) and a lower bound model l_N(x),

    Σ_{t=1}^N (1/η_t)⟨f′(x^{t−1}), x^{t−1} − x⟩
        = Σ_{t=1}^N (1/η_t) f(x^{t−1}) − Σ_{t=1}^N (1/η_t)[f(x^{t−1}) + ⟨f′(x^{t−1}), x − x^{t−1}⟩]
        = Σ_{t=1}^N (1/η_t) f(x^{t−1}) − (Σ_{t=1}^N 1/η_t) l_N(x),

where

    l_N(x) = Σ_{t=1}^N (1/η_t)[f(x^{t−1}) + ⟨f′(x^{t−1}), x − x^0⟩ + ⟨f′(x^{t−1}), x^0 − x^{t−1}⟩] / (Σ_{t=1}^N 1/η_t).
• To ensure the convergence of the sub-GD sequence, we need 1/η_t → 0. This is necessary from the primal perspective of searching for an optimal solution, because the sub-gradient does not vanish near x^*.
• However, from the dual perspective of developing a lower approximation, this makes little sense, because we pay less and less attention to the support functions generated near x^*.
6.1. The Method. The dual averaging method attempts to reconcile this apparent contradiction.
We make a few observations regarding the algorithm (a code sketch follows the list).
(1) The method is designed entirely from the dual perspective. It computes the dual solution s^t by averaging the sub-gradient information with uniform weights, which resolves the observed contradiction. The sub-gradient is evaluated at the proximal minimizer associated with the lower approximation model.
(2) The progress is controlled by the stepsize parameter β_t = O(√t). However, the convergence proof in the original paper [3] is quite involved.
(3) This dual perspective of finding a good lower bound model is very useful in optimization. The insight that we only need a good lower approximation model has been especially useful in distributed optimization [1].
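A minimal sketch of the method with a Euclidean prox (the problem instance and parameters are our illustration; see [3] for the general scheme):

```python
# Dual averaging: accumulate sub-gradients, prox against the *initial* point.
#   s^t = sum_{i<=t} g^i,
#   x^{t+1} = argmin_{x in X} <s^t, x> + (beta_t/2)||x - x^0||^2
#           = P_X( x^0 - s^t / beta_t ).
import numpy as np

def project_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

rng = np.random.default_rng(0)
a = rng.standard_normal(10); a /= 2 * np.linalg.norm(a)  # target inside X
x0 = np.zeros(10)
x, s = x0.copy(), np.zeros(10)
for t in range(1, 501):
    s += np.sign(x - a)                # uniform-weight sum of sub-gradients
    beta = np.sqrt(t)                  # beta_t ~ sqrt(t)
    x = project_ball(x0 - s / beta)    # prox anchored at x^0, not at x^{t-1}
print(np.sum(np.abs(x - a)))           # shrinks for this toy problem
```

Note the contrast with sub-GD: later support functions are not down-weighted, while stability comes from the growing β_t.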
6.2. Another Interpretation. Recall that we have the following three point inequality.

Lemma 12. For x^+ ← arg min_{x∈X} ⟨g, x⟩ + αD_Φ(x; x̄) + βD_Φ(x; x̂), we have

(6.1)    ⟨g, x^+ − x⟩ + αD_Φ(x^+; x̄) + βD_Φ(x^+; x̂) + (α + β)D_Φ(x; x^+) ≤ αD_Φ(x; x̄) + βD_Φ(x; x̂).
Proof. We have −[g + α(∇Φ(x^+) − ∇Φ(x̄)) + β(∇Φ(x^+) − ∇Φ(x̂))] ∈ N_X(x^+), and the result follows from the three-point identity of the Bregman divergence. Applying the lemma with g = g^{t−1}, x̄ = x^{t−1}, x̂ = x^0, α = η_{t−1}, β = η_t − η_{t−1}, and bounding the error term by Young's inequality, we obtain the recursion

    ⟨g^{t−1}, x^{t−1} − x⟩ + η_t D_Φ(x; x^t) + (η_t − η_{t−1})D_Φ(x^t; x^0) ≤ η_{t−1}D_Φ(x; x^{t−1}) + (η_t − η_{t−1})D_Φ(x; x^0) + ∥g^{t−1}∥²/η_{t−1}. □
Remark:
• This alternative method still satisfies the motivating requirements: the dual model constructed utilizes the average of all the dual information, and the primal iterates remain stabilized in some ball around x^*. But the analysis is more transparent.
• The key question is how to stabilize the generated sequence of iterates. This question of stabilization, and how to achieve it efficiently, is still being actively researched. There are still many opportunities in that direction!
References
[1] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
[2] Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer Nature, 2020.
[3] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.