Some interpretations:
(1) A closed convex set admits both an inner and an outer description. Inner: it contains the line segment between any two of its points. Outer: it is the intersection of half-spaces. If we consider the epigraph G(f) := {(x, t) : t ≥ f(x)}, then f* specifies the half-spaces for G(f). (Draw some pictures.)
(2) Specifically, it captures the intercepts of the linear support functionals, because
$$\min_x \; f(x) - \langle \pi, x \rangle = -\max_x \; \langle \pi, x \rangle - f(x) = -f^*(\pi).$$
Thus ⟨π, x⟩ − f*(π) is a linear support functional for f associated with the dual variable π, and vice versa. (This is made precise in Lemma 3.)
(3) It can be regarded as the analog of the Laplace transform for convex analysis, with the integration replaced by maximization. (See Chapter 4 of [1] for calculations of the conjugates of commonly used functions.)
(4) Example: for an indicator function δ_C(x), the conjugate function is the support function σ_C(π) = max_{x∈C} ⟨π, x⟩ (why?). For instance, σ_{Δ+}(y) = max_i y_i = (δ_{Δ+})*(y).
(5) And the conjugate exponent relation (hence the name conjugate; proved using Hölder's inequality): for 1/p + 1/q = 1,
$$\frac{1}{p}\|y\|_p^p = \max_x \; \langle x, y \rangle - \frac{1}{q}\|x\|_q^q.$$
(6) In microeconomics, given a cost function c(Q), the conjugate c*(P) = max_Q PQ − c(Q) expresses the maximal profit as a function of the price. The conjugate is also used in probability via the moment generating function, as part of the Chernoff bound.
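To make the conjugate exponent relation in (5) concrete, here is a minimal numerical sketch (assuming NumPy; the grid maximization is a crude stand-in for the exact supremum over R) that approximates the conjugate of f(x) = |x|^q/q in one dimension and compares it against the closed form |y|^p/p:

```python
import numpy as np

# One-dimensional check of (1/p)|y|^p = max_x { x*y - (1/q)|x|^q }.
q = 3.0
p = q / (q - 1.0)                       # conjugate exponent: 1/p + 1/q = 1

xs = np.linspace(-10.0, 10.0, 200001)   # grid standing in for all of R
f_vals = np.abs(xs) ** q / q

def conjugate(y):
    """Approximate f*(y) = sup_x { x*y - f(x) } by grid maximization."""
    return np.max(xs * y - f_vals)

for y in (0.5, 1.0, 2.0):
    print(f"y={y}: grid f*(y) = {conjugate(y):.6f}, |y|^p/p = {abs(y)**p / p:.6f}")
```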
Focusing on CCP (convex, closed, and proper) functions, the Fenchel conjugate admits the following desirable properties.
Lemma 3. Given a convex, closed, and proper function f : Rd → R, the Fenchel transformation satisfies the following properties.
a) Involution: f** ≡ f (Theorem 11.1 of [6]).
b) f* is a convex, closed, and proper function.
c) Linear support functional (the Fenchel–Young inequality): f(x) + f*(π) ≥ ⟨π, x⟩ for all x, π ∈ Rd.
Since the Fenchel conjugate characterizes the linear support functionals, we can also use it to provide an alternative characterization of the subdifferential.
Theorem 4. Given a CCP function f and its Fenchel conjugate f ∗ , the following are equivalent (See Section
4.6 of [1]).
a) ⟨π, x⟩ = f (x) + f ∗ (π).
b.i) π ∈ ∂f (x).
b.ii) x ∈ ∂f ∗ (π).
c.i) π ∈ arg maxπ̄ ⟨x, π̄⟩ − f ∗ (π̄).
c.ii) x ∈ arg maxx̄ ⟨π, x̄⟩ − f (x̄).
Proof. The equivalence between a) and c) follows from the definition of the Fenchel conjugate and Lemma 3: c.ii) ⇒ a) by the definition of the Fenchel conjugate, and a) ⇒ c.ii) because, by Lemma 3.c), no other point can attain a larger value. The equivalence for c.i) then follows from involution, and the equivalence with b) follows from the first-order optimality conditions of the maximization problems in c).
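For a quick sanity check, take f(x) = ½∥x∥², so that f*(π) = ½∥π∥². Then condition a), ⟨π, x⟩ = ½∥x∥² + ½∥π∥², rearranges to ½∥x − π∥² = 0, i.e., π = x. This matches b.i) and b.ii), since ∇f(x) = x and ∇f*(π) = π.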
$$\ell_f(\bar x; \bar\pi) = \langle \bar x, \bar\pi \rangle - f^*(\bar\pi) = \langle \bar x, \bar\pi \rangle + f(\bar x) - \langle \bar x, \bar\pi \rangle = f(\bar x),$$
$$\ell_f(x; \bar\pi) = \langle x, \bar\pi \rangle - f^*(\bar\pi) \le \langle x, \bar\pi \rangle + f(x) - \langle x, \bar\pi \rangle = f(x).$$
Next, fixing x̄, the maximization in Lemma 3 can be interpreted as going upward: finding the lower linear functional ℓ_f(·; π) that “touches” f at x̄.
$$\langle g, x^+ - x \rangle + \eta D_\Phi(x; x^+) + D_r(x; x^+) + r(x^+) + \eta D_\Phi(x^+; \bar x) \le \eta D_\Phi(x; \bar x) + r(x) \quad \forall x \in X.$$
Proof. The result follows from the first-order optimality condition that
Similarly, we get
$$\langle -\bar\pi, \bar x - \hat x \rangle + \frac{\mu}{2}\|\hat x - \bar x\|^2 + f(\bar x) \le f(\hat x). \tag{5.1}$$
Adding up the preceding formulae then leads to
$$f(y) - f(\bar x) - \langle \nabla f(\bar x), y - \bar x \rangle \ge \frac{1}{2L}\|\nabla f(y) - \nabla f(\bar x)\|_*^2 \quad \forall y, \bar x \in \operatorname{dom}(\partial f). \tag{5.2}$$
We will prove this claim later; let us assume it is true for now. Again, let π̂, π̄ ∈ dom(∂f*) be given.
Consider any x̄ ∈ ∂f ∗ (π̄) and x̂ ∈ ∂f ∗ (π̂), i.e., π̄ = ∇f (x̄) and π̂ = ∇f (x̂) (why?). We have from (5.2) that
$$f(\hat x) - f(\bar x) - \langle \nabla f(\bar x), \hat x - \bar x \rangle \ge \frac{1}{2L}\|\hat\pi - \bar\pi\|_*^2,$$
$$f(\bar x) - f(\hat x) - \langle \nabla f(\hat x), \bar x - \hat x \rangle \ge \frac{1}{2L}\|\hat\pi - \bar\pi\|_*^2.$$
Adding them up leads to
$$\frac{1}{L}\|\hat\pi - \bar\pi\|_*^2 \le \langle \hat\pi - \bar\pi, \hat x - \bar x \rangle \le \|\hat x - \bar x\| \, \|\hat\pi - \bar\pi\|_*.$$
Thus we get the desired strong convexity relation ∥(f*)′(π̂) − (f*)′(π̄)∥ ≥ (1/L)∥π̂ − π̄∥_*.
Now we show the claim in (5.2). Consider ϕ(x) = f(x) − ⟨∇f(x̄), x⟩. Clearly x̄ ∈ arg min_{x∈Rd} ϕ(x), and we have
$$\phi(\bar x) \le \phi(x) \le \phi(y) + \langle \nabla\phi(y), x - y \rangle + \frac{L}{2}\|x - y\|^2 \quad \forall x \in \mathbb{R}^d.$$
Let v ∈ arg max_{x∈B(0,1)} ⟨∇ϕ(y), x⟩, so that ⟨v, ∇ϕ(y)⟩ = ∥∇ϕ(y)∥_*. Taking x = y − (∥∇ϕ(y)∥_*/L) v leads to
$$\phi(\bar x) \le \phi(y) - \frac{1}{L}\|\nabla\phi(y)\|_*^2 + \frac{1}{2L}\|\nabla\phi(y)\|_*^2$$
$$\Rightarrow \; \frac{1}{2L}\|\nabla f(y) - \nabla f(\bar x)\|_*^2 \le f(y) - f(\bar x) - \langle \nabla f(\bar x), y - \bar x \rangle.$$
□
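As a numerical sanity check of the claim (5.2), here is a minimal sketch (assuming NumPy and working in the Euclidean norm, so ∥·∥_* = ∥·∥) using the L-smooth quadratic f(x) = ½xᵀQx, for which ∇f(x) = Qx and L = λ_max(Q):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
Q = M.T @ M                        # PSD Hessian: f(x) = 0.5 x'Qx is convex
L = np.linalg.eigvalsh(Q)[-1]      # smoothness constant = largest eigenvalue

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

# Check (5.2): f(y) - f(xb) - <grad f(xb), y - xb> >= ||grad f(y) - grad f(xb)||^2 / (2L)
for _ in range(1000):
    y, xb = rng.standard_normal(d), rng.standard_normal(d)
    lhs = f(y) - f(xb) - grad(xb) @ (y - xb)
    rhs = np.linalg.norm(grad(y) - grad(xb)) ** 2 / (2 * L)
    assert lhs >= rhs - 1e-9, (lhs, rhs)
print("(5.2) holds on all sampled pairs")
```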
In this zero-sum game, the max-player π seeks to maximize his payoff L(x; π), which depends on the action of the primal player x, while the primal player seeks to minimize his loss. In the t-th iteration of the AGD Algorithm 2, since his reward depends on the yet-unknown action x^t of the x player, the dual player constructs an estimate x̃^t of it by extrapolating from the history, and then maximizes his reward L(x̃^t; π) subject to some prox-penalty (Line 2). After that, having seen the action π^t, the x player responds by minimizing L(x; π^t) subject to some prox-penalty term.
The surprising gradient evaluation associated with Algorithm 1 can be interpreted as a dual proximal step. Before going into the details, let us see that the π^t evaluation can indeed be reinterpreted as a gradient evaluation step.
$$\omega_{t-1} = \omega_t \theta_t, \qquad \omega_{t-1}(1 + \tau_{t-1}) \ge \omega_t \tau_t, \qquad \omega_{t-1}\eta_{t-1} \ge \omega_t \eta_t, \qquad \tau_t \eta_{t-1} \ge L\theta_t, \qquad (\tau_N + 1)\eta_N \ge L.$$
Then the following relation is valid for any z := (x, π) ∈ X × Rd .
$$\sum_{t=1}^N \omega_t Q(z^t; z) + \frac{\omega_N \eta_N}{2}\|x^N - x\|^2 \le \frac{\omega_1 \eta_1}{2}\|x^0 - x\|^2 + \omega_1 \tau_1 D_{f^*}(\pi; \pi^0).$$
Proof. The following algebraic decomposition is crucial for our analysis. (The extrapolation prediction is what makes the next decomposition work.)
Using the three-point inequality associated with the primal update leads us to
$$L(x^t; \pi^t) - L(x; \pi^t) + \frac{\eta_t}{2}\|x^t - x\|^2 + \frac{\eta_t}{2}\|x^{t-1} - x^t\|^2 \le \frac{\eta_t}{2}\|x^{t-1} - x\|^2.$$
So the ω_t-weighted sum satisfies
$$\sum_{t=1}^N \omega_t \big[L(x^t; \pi^t) - L(x; \pi^t)\big] + \frac{\omega_N \eta_N}{2}\|x^N - x\|^2 + \sum_{t=1}^N \frac{\omega_t \eta_t}{2}\|x^{t-1} - x^t\|^2 \le \frac{\omega_1 \eta_1}{2}\|x^0 - x\|^2.$$
Similarly, the three-point inequality for the dual update implies that (why?)
$$\sum_{t=1}^N \omega_t \big[L(x^t; \pi) - L(x^t; \pi^t)\big] + \bar A_N + \omega_N(\tau_N + 1) D_{f^*}(\pi; \pi^N) + \sum_{t=1}^N \omega_t \tau_t D_{f^*}(\pi^t; \pi^{t-1}) \le \tau_1 \omega_1 D_{f^*}(\pi; \pi^0),$$
where
$$\bar A_N := \omega_N \langle \pi^N - \pi, x^N - x^{N-1} \rangle - \sum_{t=1}^N \theta_t \omega_t \langle \pi^t - \pi^{t-1}, x^{t-1} - x^{t-2} \rangle.$$
In particular, we have from the 1/L strong convexity of f ∗ (see Theorem 6) that
$$\theta_t \omega_t \langle \pi^t - \pi^{t-1}, x^{t-1} - x^{t-2} \rangle \le \omega_t \tau_t D_{f^*}(\pi^t; \pi^{t-1}) + \frac{\omega_{t-1} \eta_{t-1}}{2}\|x^{t-1} - x^{t-2}\|^2,$$
$$-\omega_N \langle \pi^N - \pi, x^N - x^{N-1} \rangle \le \omega_N(\tau_N + 1) D_{f^*}(\pi; \pi^N) + \frac{\omega_N \eta_N}{2}\|x^N - x^{N-1}\|^2.$$
Adding up all the relations leads to the desired convergence guarantee. □
Now we specify some concrete stepsize choices to obtain the desired function value convergence bound.
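As a concrete point of reference, here is a minimal sketch (assuming NumPy) of the classical momentum form of AGD for an L-smooth convex f [4, 5]; the (t − 1)/(t + 2) extrapolation weight below is one standard stepsize instantiation, not necessarily the one implied by the conditions above:

```python
import numpy as np

def agd(grad_f, x0, L, num_iters):
    """Nesterov's accelerated gradient descent for an L-smooth convex f.

    Extrapolate (momentum), then take a 1/L gradient step at the
    extrapolated point; this attains the O(L ||x0 - x*||^2 / N^2) rate.
    """
    x_prev = x = np.asarray(x0, dtype=float)
    for t in range(1, num_iters + 1):
        y = x + (t - 1) / (t + 2) * (x - x_prev)   # extrapolation step
        x_prev, x = x, y - grad_f(y) / L           # gradient step at y
    return x

# Example: minimize f(x) = 0.5 ||Ax - b||^2, whose smoothness constant is ||A||_2^2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_out = agd(lambda x: A.T @ (A @ x - b), np.zeros(2), np.linalg.norm(A, 2) ** 2, 200)
```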
Remarks:
• The above convergence bound could also be obtained from a purely primal analysis; see [2]. We could also obtain the desired linear convergence rate $O(\sqrt{\kappa} \log(L\|x^0 - x^*\|^2/\epsilon))$ if f is a strongly convex function. Because of the heavy use of the three-point inequality, the extension to non-Euclidean geometry is straightforward.
• Nesterov retired this year. However, the exact reason why this scheme works is still not well understood.
9.1. The Linear Span Assumption. Without loss of generality, we can assume the initial point x^0 to be 0 (why? hint: apply a shift to the worst-case function).
Definition 11. A first-order method is said to satisfy the linear span assumption if the iterates it generates satisfy
$$x^t \in \operatorname{span}\{\nabla f(x^0), \nabla f(x^1), \ldots, \nabla f(x^{t-1})\} \quad \forall t \ge 1.$$
9.2. The Lower Bound Gadget. We present a parameterized class of hard functions which can be tuned to provide the desired lower complexity bound. In particular, the hard instance, parameterized by l > 0, r > 0, n ∈ N+, is
$$h(x; l, r, n) := \frac{l}{4}\Big[\frac{1}{2}\sum_{i=1}^{n-1}(x_i - x_{i+1})^2 + \frac{1}{2}x_n^2 - r x_1\Big]. \tag{9.2}$$
Its gradient is
$$\nabla h(\bar x) = \frac{l}{4}\left\{\begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} \bar x - \begin{pmatrix} r \\ 0 \\ \vdots \\ 0 \end{pmatrix}\right\}. \tag{9.3}$$
Clearly, h is an l-Lipschitz smooth function (why?). Solving the first-order optimality condition ∇h(x*) = 0 yields x* = r[n, n − 1, n − 2, ..., 1], so that ∥x* − 0∥² ≤ r²n³ (a rough upper bound for n ≥ 2). Moreover, constrained to having non-zeros only in the first n/2 coordinates, the optimal solution is x̄* = r[n/2, n/2 − 1, ..., 1, 0, ..., 0]. Thus the function value gap is lower bounded by
$$h(\bar x^*) - h(x^*) = \frac{r^2 l}{4}\Big[\frac{n}{2} - \frac{n}{4}\Big] = \frac{r^2 l n}{16}.$$
Property c) follows from the structure of the hard instance. □
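To see the gadget in action, here is a minimal sketch (assuming NumPy) that builds the tridiagonal matrix from (9.3), verifies x* = r[n, n−1, ..., 1], and illustrates the mechanism behind the lower bound: starting from x^0 = 0, each gradient evaluation propagates information only one coordinate down the chain, so after k iterations any linear-span method is supported on the first k coordinates.

```python
import numpy as np

n, r, l = 8, 1.0, 4.0

# Tridiagonal matrix from (9.3): diagonal (1, 2, ..., 2), off-diagonals -1.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, 0] = 1.0
e1 = np.zeros(n); e1[0] = r

h = lambda x: (l / 4) * (0.5 * x @ A @ x - r * x[0])
grad = lambda x: (l / 4) * (A @ x - e1)

# Verify the optimum x* = r [n, n-1, ..., 1].
x_star = r * np.arange(n, 0, -1, dtype=float)
assert np.allclose(grad(x_star), 0.0)

print("eigenvalues of A:", np.round(np.linalg.eigvalsh(A), 3))  # spread across (0, 4)

# Gradient descent from 0: after k steps only the first k coordinates can be
# nonzero, since A is tridiagonal and the linear term touches only coordinate 1.
x = np.zeros(n)
for k in range(1, n // 2 + 1):
    x = x - grad(x) / l                 # any linear-span step behaves alike
    assert np.allclose(x[k:], 0.0)      # support confined to first k coordinates
print("gap after n/2 steps:", h(x) - h(x_star), ">=", r**2 * l * n / 16)
```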
9.3. The Lower Bound Result. Now we are ready to utilize the hard-instance gadget to provide the lower complexity bound.
Theorem 13. Consider the task of finding an ϵ-optimal solution (in function value gap) for the class of L-smooth convex functions with ∥x^0 − x*∥² ≤ D_0² and ϵ ≤ LD_0²/256. The number of iterations required by any first-order method satisfying the linear span assumption to find an ϵ-optimal solution for every problem inside the problem class is lower bounded by $\lfloor \frac{1}{8}\sqrt{LD_0^2/\epsilon} \rfloor$, that is, $P_\Sigma^*(\mathcal{M}; \mathcal{F}) \ge \Omega(1)\sqrt{LD_0^2/\epsilon}$.
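To see where the $\sqrt{LD_0^2/\epsilon}$ scaling comes from, here is a sketch based on the gadget above. Set l = L and, for an iteration budget k, take n = 2k. Under the linear span assumption, after k iterations the iterates are supported on the first k = n/2 coordinates (each gradient evaluation propagates information only one coordinate down the chain), so the attainable gap is at least h(x̄*) − h(x*) = r²Ln/16. Choosing r² = D_0²/n³ ensures ∥x^0 − x*∥² ≤ r²n³ = D_0², and the gap becomes LD_0²/(16n²) = LD_0²/(64k²), which exceeds ϵ whenever k < (1/8)√(LD_0²/ϵ). (The assumption ϵ ≤ LD_0²/256 ensures this range contains k ≥ 2, so the gadget is nondegenerate.)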
• This is the state of the art for optimal methods for smooth optimization. However, it is not the end of the story in the search for optimal methods.
• The eigenvalues associated with the matrix in (9.3) are plotted in Figure 9.1. The hard instance is particularly bad because the eigenvalues spread almost uniformly over the range [0, 4].
• If these eigenvalues are clustered in some way, say, most of them are small except for a few, then the
conjugate gradient (CG) method could lead to a much faster convergence rate.
• Moreover, if the dimension is small, then the ellipsoid method could also significantly improve the
complexity result.
• Going further requires us to define more refined problem classes and to design more adaptive algorithms.
References
[1] Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
[2] Guanghui Lan. First-Order and Stochastic Optimization Methods for Machine Learning. Springer Nature, 2020.
[3] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.
[4] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). In Doklady AN USSR, volume 269, pages 543–547, 1983.
[5] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
[6] R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.