Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization

A. Beck, M. Teboulle

E-mail addresses: [email protected] (A. Beck), [email protected] (M. Teboulle).
Abstract
The mirror descent algorithm (MDA) was introduced by Nemirovsky and Yudin for solving convex optimization problems. This method exhibits an efficiency estimate that is mildly dependent on the dimension of the decision variables, and is thus suitable for solving very large-scale optimization problems. We present a new derivation and analysis of this algorithm. We show that the MDA can be viewed as a nonlinear projected-subgradient type method, derived from using a general distance-like function instead of the usual Euclidean squared distance. Within this interpretation, we derive in a simple way convergence and efficiency estimates. We then propose an entropic mirror descent algorithm for convex minimization over the unit simplex, with a global efficiency estimate proven to be mildly dependent on the dimension of the problem.
© 2003 Elsevier Science B.V. All rights reserved.
Keywords: Nonsmooth convex minimization; Projected subgradient methods; Nonlinear projections; Mirror descent algorithms; Relative
entropy; Complexity analysis; Global rate of convergence
1. Introduction
Consider the following nonsmooth convex minimization problem:

(P)  min{f(x) : x ∈ X},

where X ⊆ ℝⁿ is a closed convex set and f : ℝⁿ → ℝ is a convex function, assumed Lipschitz continuous on X with constant L_f. A standard method to solve (P) is the subgradient projection algorithm (see e.g. [2] and references therein), which generates iterates via

x^{k+1} = π_X(x^k − t_k f'(x^k)),  (1.1)

where f'(x^k) ∈ ∂f(x^k) is a subgradient of f at x^k, t_k > 0 are stepsizes, and π_X denotes the Euclidean projection onto X. With a suitable stepsize choice, this method exhibits the efficiency estimate

min_{1≤s≤k} f(x^s) − min_{x∈X} f(x) ≤ L_f ‖x^1 − x^*‖ / √k,  (1.2)

where x^* is an optimal solution of (P).
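To make the scheme concrete, here is a minimal Python sketch of iteration (1.1) on a toy instance; the box feasible set, the ℓ1 objective, and the 1/√k stepsizes are illustrative assumptions of ours, not prescribed by the paper.

```python
import numpy as np

def proj_box(z, lo=0.0, hi=1.0):
    """Euclidean projection onto the box X = [lo, hi]^n."""
    return np.clip(z, lo, hi)

def projected_subgradient(p, x1, K=500):
    """Iteration (1.1) for f(x) = ||x - p||_1 over a box (toy instance)."""
    x = x1.copy()
    best = np.inf
    for k in range(1, K + 1):
        g = np.sign(x - p)              # a subgradient f'(x^k)
        t = 1.0 / np.sqrt(k)            # stepsize t_k
        x = proj_box(x - t * g)         # x^{k+1} = pi_X(x^k - t_k f'(x^k))
        best = min(best, np.abs(x - p).sum())  # track min_{1<=s<=k} f(x^s)
    return x, best

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=10)
x_K, best = projected_subgradient(p, x1=np.zeros(10))
print(best)  # decreases toward min_X f = 0 at an O(1/sqrt(k)) rate
```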
2. The mirror descent algorithm

The MDA of Nemirovsky and Yudin can be described as follows. Given a convex function ψ with conjugate ψ*(y) = max_{x∈X} {⟨x, y⟩ − ψ(x)}, the MDA generates two sequences {y^k} and {x^k} ⊂ X via

x^k = ∇ψ*(y^k),  (2.3)
y^{k+1} = y^k − t_k f'(x^k),  f'(x^k) ∈ ∂f(x^k),  (2.4)

where t_k > 0 are stepsizes. To relate this scheme to projection-type methods, recall that the Euclidean projection onto X,

π_X(z) := argmin_{x∈X} (1/2)‖x − z‖²,  (2.5)

is characterized via the normal cone operator N_X of X: z − x ∈ N_X(x) if and only if x = (I + N_X)^{-1}(z), so that

π_X(z) = (I + N_X)^{-1}(z).

With this notation, the subgradient projection algorithm (1.1) can be written in the equivalent two-step form

x^k = π_X(y^k),  (2.6)
y^{k+1} = x^k − t_k f'(x^k).  (2.7)

Comparing (2.6)–(2.7) with (2.3)–(2.4), the MDA replaces the Euclidean projection π_X by the map ∇ψ*, which acts as a nonlinear projection, and its iterations can be written compactly as

x^{k+1} = ∇ψ*(∇ψ(x^k) − t_k f'(x^k)).  (2.8)

3. A subgradient algorithm with nonlinear projections

Consider the subgradient algorithm with nonlinear projections (SANP), which generates iterates via

x^{k+1} = argmin_{x∈X} {⟨x − x^k, t_k f'(x^k)⟩ + D(x, x^k)},  (3.9)

where D(·, ·) is a given distance-like function. We will show that the MDA is nothing else but the nonlinear subgradient projection method (3.9), with a particular choice of D based on a Bregman-like distance generated by a function ψ. Note that the hypotheses on D will be somewhat different from the usual Bregman-based distances assumed in the literature (see e.g. [8,14], and references therein).
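As a quick illustration of (3.9), the following sketch (our own construction, with an assumed box feasible set) solves the inner minimization numerically and confirms that the quadratic choice D(x, y) = (1/2)‖x − y‖² recovers the Euclidean projected subgradient step (1.1).

```python
import numpy as np
from scipy.optimize import minimize

def sanp_step(x_k, g_k, t_k, D, bounds):
    """One step of (3.9): argmin_{x in X} <x - x_k, t_k g_k> + D(x, x_k)."""
    obj = lambda x: t_k * g_k @ (x - x_k) + D(x, x_k)
    return minimize(obj, x_k, method="L-BFGS-B", bounds=bounds).x

n = 5
rng = np.random.default_rng(1)
x_k = rng.uniform(0.2, 0.8, n)          # current iterate in X = [0, 1]^n
g_k = rng.normal(size=n)                # a subgradient f'(x^k)
t_k = 0.1
D_quad = lambda x, y: 0.5 * np.sum((x - y) ** 2)

step = sanp_step(x_k, g_k, t_k, D_quad, bounds=[(0.0, 1.0)] * n)
explicit = np.clip(x_k - t_k * g_k, 0.0, 1.0)  # projected subgradient step
print(np.allclose(step, explicit, atol=1e-5))  # True: (3.9) recovers (1.1)
```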
Let ψ : X → ℝ be strongly convex and continuously differentiable on int X. The distance-like function B_ψ : X × int X → ℝ is defined by

B_ψ(x, y) = ψ(x) − ψ(y) − ⟨x − y, ∇ψ(y)⟩.  (3.10)

Specifically, we assume that ψ is σ-strongly convex with respect to a norm ‖·‖, i.e.,

⟨∇ψ(x) − ∇ψ(y), x − y⟩ ≥ σ‖x − y‖²,  ∀x, y ∈ int X,  (3.11)

and that the gradient map ∇ψ is invertible, with inverse given by the gradient of the conjugate ψ*:

(∇ψ)^{-1} = ∇ψ*.  (3.12)

With the choice D = B_ψ, the SANP (3.9) takes the form

x^{k+1} = argmin_{x∈X} {⟨x, t_k f'(x^k) − ∇ψ(x^k)⟩ + ψ(x)}.  (3.13)

Using these relations, SANP can be written as follows. Let y^{k+1} := ∇ψ(x^k) − t_k f'(x^k) and set x^k = ∇ψ*(y^k). Then SANP given by (3.13) reduces to x^{k+1} = ∇ψ*(y^{k+1}), which are exactly the iterations generated by MDA. Note that when ψ satisfies (3.12), SANP reduces to x^{k+1} = (∇ψ)^{-1}(∇ψ(x^k) − t_k f'(x^k)).
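The equivalence just described is easy to exercise numerically. The sketch below (our construction, with function names of our choosing) implements the single step x^{k+1} = ∇ψ*(∇ψ(x^k) − t_k f'(x^k)) for two classical choices of ψ; the entropic case anticipates the explicit algorithm of Section 5.

```python
import numpy as np

def mirror_step(x, g, t, grad_psi, grad_psi_star):
    """x^{k+1} = grad_psi_star(grad_psi(x^k) - t_k f'(x^k))."""
    return grad_psi_star(grad_psi(x) - t * g)

identity = lambda v: v
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

x = np.array([0.2, 0.3, 0.5])   # a point of the unit simplex
g = np.array([1.0, -2.0, 0.5])  # a subgradient
t = 0.1

# psi(x) = 0.5||x||^2: both maps are the identity, so the step is x - t g.
print(mirror_step(x, g, t, identity, identity))

# psi = entropy: grad psi(x) = 1 + log x and grad psi* = softmax, so the step
# is x_j exp(-t g_j) / sum_i x_i exp(-t g_i), a multiplicative update (Sec. 5).
print(mirror_step(x, g, t, lambda v: 1.0 + np.log(v), softmax))
```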
4. Convergence analysis

With this interpretation of the MDA, viewed as SANP, its convergence analysis can be derived in a simple way. The key of the analysis relies essentially on the following simple identity, which appears to be a natural generalization of the quadratic identity valid for the Euclidean norm.

Lemma 4.1 (Chen and Teboulle [5]). Let S ⊆ ℝⁿ be an open set with closure S̄, and let ψ : S̄ → ℝ be continuously differentiable on S. Then for any three points a, b ∈ S and c ∈ S̄ the following identity holds true:

B_ψ(c, a) + B_ψ(a, b) − B_ψ(c, b) = ⟨∇ψ(b) − ∇ψ(a), c − a⟩.  (4.14)

Theorem 4.2. Let {x^k} be the sequence generated by SANP (3.13), with ψ σ-strongly convex with respect to ‖·‖, and let x^* be an optimal solution of (P). Then, for every k ≥ 1,

min_{1≤s≤k} f(x^s) − min_{x∈X} f(x) ≤ [B_ψ(x^*, x^1) + (2σ)^{-1} Σ_{s=1}^k t_s² ‖f'(x^s)‖_*²] / Σ_{s=1}^k t_s,  (4.15)

where ‖·‖_* denotes the norm dual to ‖·‖.

Proof. Since x^{k+1} solves (3.13), for every x ∈ X we have

⟨x − x^{k+1}, ∇ψ(x^{k+1}) − ∇ψ(x^k) + t_k f'(x^k)⟩ ≥ 0.  (4.16)
Using the subgradient inequality for the convex function f one obtains
0 ≤ t_k(f(x^k) − f(x^*))
≤ t_k⟨x^k − x^*, f'(x^k)⟩  (4.17)
= ⟨x^* − x^{k+1}, ∇ψ(x^k) − ∇ψ(x^{k+1}) − t_k f'(x^k)⟩  (4.18)
+ ⟨x^* − x^{k+1}, ∇ψ(x^{k+1}) − ∇ψ(x^k)⟩  (4.19)
+ ⟨x^k − x^{k+1}, t_k f'(x^k)⟩  (4.20)
:= s1 + s2 + s3.
By (4.16) with x = x^*, we have s1 ≤ 0. Moreover,

s2 = B_ψ(x^*, x^k) − B_ψ(x^*, x^{k+1}) − B_ψ(x^{k+1}, x^k)  (by Lemma 4.1),

s3 ≤ (2σ)^{-1} t_k² ‖f'(x^k)‖_*² + (σ/2)‖x^k − x^{k+1}‖²,

the latter inequality following from ⟨a, b⟩ ≤ (2σ)^{-1}‖a‖_*² + (σ/2)‖b‖², ∀a, b ∈ ℝⁿ. Therefore, recalling that ψ is σ-strongly convex, so that −B_ψ(x^{k+1}, x^k) + (σ/2)‖x^k − x^{k+1}‖² ≤ 0, it follows that

t_k(f(x^k) − f(x^*)) ≤ s1 + s2 + s3
≤ B_ψ(x^*, x^k) − B_ψ(x^*, x^{k+1}) + (2σ)^{-1} t_k² ‖f'(x^k)‖_*².  (4.21)
Summing (4.21) over the iterations s = 1, …, k, and using B_ψ(x^*, x^{k+1}) ≥ 0, we obtain

Σ_{s=1}^k t_s (f(x^s) − f(x^*)) ≤ B_ψ(x^*, x^1) + (2σ)^{-1} Σ_{s=1}^k t_s² ‖f'(x^s)‖_*²,

and hence, since the left-hand side is bounded below by (Σ_{s=1}^k t_s)(min_{1≤s≤k} f(x^s) − f(x^*)),

min_{1≤s≤k} f(x^s) − f(x^*) ≤ [B_ψ(x^*, x^1) + (2σ)^{-1} Σ_{s=1}^k t_s² ‖f'(x^s)‖_*²] / Σ_{s=1}^k t_s,  (4.22)

which is precisely (4.15). □
In particular, if ‖f'(x^s)‖_* ≤ L_f for all s and the stepsizes are chosen as

t_s = √(2σ B_ψ(x^*, x^1)) / (L_f √k),  s = 1, …, k,  (4.23)

which minimizes the right-hand side of (4.22), then

min_{1≤s≤k} f(x^s) − min_{x∈X} f(x) ≤ L_f √(2 B_ψ(x^*, x^1)/σ) · (1/√k).  (4.24)
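Since the whole analysis hinges on Lemma 4.1, a quick numerical check is reassuring. The sketch below (ours, not part of the paper) verifies the three-point identity (4.14) for the Bregman distance generated by the entropy function at random points of the simplex.

```python
import numpy as np

def B(x, y):
    """Bregman distance B_psi(x, y) for psi(x) = sum_j x_j log x_j."""
    return np.sum(x * np.log(x)) - np.sum(y * np.log(y)) - (x - y) @ (1.0 + np.log(y))

rng = np.random.default_rng(2)
a, b, c = (rng.dirichlet(np.ones(6)) for _ in range(3))  # interior simplex points

lhs = B(c, a) + B(a, b) - B(c, b)
rhs = (np.log(b) - np.log(a)) @ (c - a)  # <grad psi(b) - grad psi(a), c - a>
print(np.isclose(lhs, rhs))  # True, as asserted by (4.14)
```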
5. The entropic mirror descent algorithm

Consider now problem (P) where X = Δ is the unit simplex,

Δ := {x ∈ ℝⁿ : Σ_{j=1}^n x_j = 1, x ≥ 0}.  (5.25)

For this setting, an appropriate choice of the generating function ψ (denoted ψ_1 below, yielding the method we call (MDA_{ℓ1})) leads to an efficiency estimate of the form

min_{1≤s≤k} f(x^s) − min_{x∈Δ} f(x) ≤ O(1) √(ln n) L_f / √k,  (5.26)

i.e., mildly dependent on the dimension n. We now derive a completely explicit method with the same efficiency, based on the entropy function ψ_e defined by

ψ_e(x) = Σ_{j=1}^n x_j ln x_j if x ∈ Δ; +∞ otherwise,  (5.27)

with the convention 0 ln 0 = 0.

Proposition 5.1. Let ψ_e : Δ → ℝ be the entropy function defined in (5.27). Then,
(a) ψ_e is 1-strongly convex over int Δ with respect to the ℓ1 norm, i.e.,

⟨∇ψ_e(x) − ∇ψ_e(y), x − y⟩ = Σ_{j=1}^n (x_j − y_j) ln(x_j/y_j) ≥ ‖x − y‖₁²,  ∀x, y ∈ int Δ;

(b) the conjugate of ψ_e is given by ψ_e*(z) = ln(Σ_{j=1}^n e^{z_j}), with ∇ψ_e*(z)_j = e^{z_j} / Σ_{i=1}^n e^{z_i}, j = 1, …, n;
(c) with x^1 = n^{-1} e (e being the vector of all ones), B_{ψ_e}(x, x^1) ≤ ln n for every x ∈ Δ.
Proof. (a) Using the elementary inequality

(t − 1) ln t ≥ 2(t − 1)²/(t + 1),  ∀t > 0,

with t = x_j/y_j, we obtain

Σ_{j=1}^n (x_j − y_j) ln(x_j/y_j) ≥ Σ_{j=1}^n 2(x_j − y_j)²/(x_j + y_j)

= Σ_{j=1}^n ((x_j + y_j)/2) · ((x_j − y_j)/((x_j + y_j)/2))²

≥ (Σ_{j=1}^n ((x_j + y_j)/2) · |x_j − y_j|/((x_j + y_j)/2))²  (*)

= (Σ_{j=1}^n |x_j − y_j|)² = ‖x − y‖₁²,

where the inequality (*) follows from the convexity of the quadratic function and the fact that (x + y)/2 ∈ Δ.
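A direct numerical check of part (a) (our sketch, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    x, y = rng.dirichlet(np.ones(8)), rng.dirichlet(np.ones(8))
    lhs = np.sum((x - y) * np.log(x / y))  # <grad psi_e(x) - grad psi_e(y), x - y>
    assert lhs >= np.abs(x - y).sum() ** 2 - 1e-12
print("Proposition 5.1(a) verified on 1000 random pairs")
```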
(b) Using the definition of the conjugate and simple calculus gives the desired results; see also [11].
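For completeness, here is a sketch of the "simple calculus" behind part (b) (our derivation): maximize the defining problem of the conjugate over Δ with a multiplier λ for the equality constraint,

ψ_e*(z) = max_{x∈Δ} {⟨z, x⟩ − Σ_{j=1}^n x_j ln x_j}.

Stationarity of the Lagrangian gives z_j − ln x_j − 1 − λ = 0, i.e., x_j = e^{z_j − 1 − λ}, and the constraint Σ_j x_j = 1 forces e^{1+λ} = Σ_i e^{z_i}. Hence the maximizer is x_j = e^{z_j}/Σ_{i=1}^n e^{z_i} = ∇ψ_e*(z)_j, and substituting back yields ψ_e*(z) = ln(Σ_{j=1}^n e^{z_j}).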
(c) Substituting ψ = ψ_e(x) = Σ_{j=1}^n x_j ln x_j in the definition of B_ψ, we obtain with x_j^1 = 1/n for all j:

B_ψ(x, x^1) = Σ_{j=1}^n x_j ln(x_j/x_j^1) = Σ_{j=1}^n x_j ln x_j + ln n ≤ ln n,  ∀x ∈ Δ,

since Σ_{j=1}^n x_j ln x_j ≤ 0 on Δ. □
The MDA applied with ψ = ψ_e, which by Proposition 5.1(b) admits the explicit map ∇ψ_e*, yields the entropic mirror descent algorithm (EMDA):

x_j^{k+1} = x_j^k e^{−t_k f'_j(x^k)} / Σ_{i=1}^n x_i^k e^{−t_k f'_i(x^k)},  j = 1, …, n,  t_k = (√(2 ln n)/L_f) · (1/√k),

where f'(x^k) = (f'_1(x^k), …, f'_n(x^k))^T ∈ ∂f(x^k).
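A compact Python sketch of the EMDA follows; the toy instance f(x) = ‖x − p‖₁ over the simplex (minimal value 0) and the function names are our illustrative assumptions.

```python
import numpy as np

def emda(f, f_grad, L_f, n, K):
    x = np.full(n, 1.0 / n)  # starting point x^1 = (1/n) e
    best = np.inf
    for k in range(1, K + 1):
        g = f_grad(x)                                      # f'(x^k)
        t = np.sqrt(2.0 * np.log(n)) / (L_f * np.sqrt(k))  # stepsize t_k
        w = x * np.exp(-t * g)                             # x_j^k exp(-t_k f'_j(x^k))
        x = w / w.sum()                                    # explicit multiplicative update
        best = min(best, f(x))                             # track min_{1<=s<=k} f(x^s)
    return x, best

rng = np.random.default_rng(4)
n, K = 100, 2000
p = rng.dirichlet(np.ones(n))
f = lambda x: np.abs(x - p).sum()
f_grad = lambda x: np.sign(x - p)  # ||f'(x)||_inf <= 1, so L_f = 1
x_K, best = emda(f, f_grad, L_f=1.0, n=n, K=K)
print(best, np.sqrt(2 * np.log(n)) / np.sqrt(K))  # observed gap vs. bound (5.28)
```

Each iteration costs O(n) operations, and the stepsize requires only L_f and n, in line with the explicitness emphasized below.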
Applying Theorem 4.2 and Proposition 5.1, we immediately obtain the following efficiency estimate for the EMDA.

Theorem 5.1. Let {x^k} be the sequence generated by EMDA with starting point x^1 = n^{-1} e. Then, for all k ≥ 1, one has
min_{1≤s≤k} f(x^s) − min_{x∈Δ} f(x) ≤ (√(2 ln n)/√k) · max_{1≤s≤k} ‖f'(x^s)‖_∞.  (5.28)

We remark that similar results can be developed for convex minimization problems over the set of symmetric positive semidefinite matrices with unit trace,

{Z : tr(Z) = 1, Z ⪰ 0},

which often arise in relaxations of combinatorial optimization problems. This can be analyzed within the use of a corresponding entropic function defined over the space of positive semidefinite symmetric matrices (see for example [6] and references therein).
Thus, the EMDA appears as another useful candidate algorithm for solving large-scale convex minimization problems over the unit simplex. Indeed, EMDA shares the same efficiency estimate as (MDA_{ℓ1}) obtained with ψ_1, but has the advantage of being completely explicit, as opposed to (MDA_{ℓ1}), whose iterations are not available in closed form.
References
[1] A. Ben-Tal, T. Margalit, A. Nemirovski, The ordered subsets mirror descent optimization method with applications to tomography, SIAM J. Optim. 12 (2001) 79–108.
[2] D. Bertsekas, Nonlinear Programming, 2nd Edition, Athena Scientific, Belmont, MA, 1999.
[3] L.M. Bregman, A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics 7 (1967) 200–217.