Mirror Descent Slides

1. Mirror descent is a generalization of projected subgradient descent that replaces the squared Euclidean distance with a Bregman divergence, allowing the use of different distance-like functions.
2. The convergence rate of projected subgradient descent depends on the boundedness of subgradients and the initialization radius; mirror descent enjoys similar guarantees by exploiting properties of Bregman divergences such as strong convexity.
3. Standard setups for mirror descent include the squared Euclidean norm, which recovers projected subgradient descent, and negative entropy, which corresponds to the generalized Kullback–Leibler divergence over the simplex.


Mirror Descent and Variable Metric Methods

Stephen Boyd & John Duchi & Mert Pilanci

EE364b, Stanford University

April 24, 2019

1
Mirror descent

• due to Nemirovski and Yudin (1983)

• recall the projected subgradient method:

(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) update
\[
x^{(k+1)} = \operatorname*{argmin}_{x \in C} \left( g^{(k)T} x + \frac{1}{2\alpha_k} \left\| x - x^{(k)} \right\|_2^2 \right)
\]

2
replace $\|\cdot\|_2^2$ with an alternate distance-like function

2
Convergence rate of projected subgradient method

Consider $\min_{x \in C} f(x)$

• Bounded subgradients: $\|g\|_2 \le G$ for all $g \in \partial f$
• Initialization radius: $\|x^{(1)} - x^\star\|_2 \le R$

The projected subgradient method iterates satisfy
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{R^2 + G^2 \sum_{i=1}^k \alpha_i^2}{2 \sum_{i=1}^k \alpha_i}
\]
setting $\alpha_i = (R/G)/\sqrt{k}$ gives
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{RG}{\sqrt{k}}
\]
where $G = \max_{x \in C} \|\partial f(x)\|_2$ and $R = \max_{x,y \in C} \|x - y\|_2$. The analysis and the convergence results depend on the Euclidean ($\ell_2$) norm.
3
Bregman Divergence

Let $h$ be convex and differentiable over an open convex set $C$.

• The Bregman divergence associated with $h$ is defined by
\[
D_h(x, y) = h(x) - \left( h(y) + \nabla h(y)^T (x - y) \right)
\]
can be interpreted as the distance between $x$ and $y$ as measured by the function $h$

Example: $h(x) = \|x\|_2^2$

4
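Working out the example above (a derivation added here for completeness): for $h(x) = \|x\|_2^2$ we have $\nabla h(y) = 2y$, so
\[
D_h(x, y) = \|x\|_2^2 - \|y\|_2^2 - 2 y^T (x - y) = \|x\|_2^2 + \|y\|_2^2 - 2 y^T x = \|x - y\|_2^2 ,
\]
the squared Euclidean distance.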
Strong convexity

$h$ is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ if
\[
h(x) \ge h(y) + \nabla h(y)^T (x - y) + \frac{\lambda}{2} \|x - y\|^2
\]

5
Properties of Bregman divergence

For a $\lambda$-strongly convex function $h$, the Bregman divergence
\[
D_h(x, y) = h(x) - \left( h(y) + \nabla h(y)^T (x - y) \right)
\]
satisfies
\[
D_h(x, y) \ge \frac{\lambda}{2} \|x - y\|^2 \ge 0
\]

6
Pythagorean theorem
The Bregman projection
\[
P_C^h(y) = \operatorname*{argmin}_{x \in C} D_h(x, y)
\]
satisfies, for $x \in C$,
\[
D_h(x, y) \ge D_h(x, P_C^h(y)) + D_h(P_C^h(y), y)
\]


7
Projected Gradient Descent

 
\[
x^{(k+1)} = P_C\left( \operatorname*{argmin}_{x}\; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{2\alpha_k} \|x - x^{(k)}\|_2^2 \right)
\]
where $P_C$ is the Euclidean projection onto $C$.

8
Mirror Descent

 
\[
x^{(k+1)} = P_C^h\left( \operatorname*{argmin}_{x}\; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)}) \right)
\]
where $D_h(x, y)$ is the Bregman divergence
\[
D_h(x, y) = h(x) - \left( h(y) + \nabla h(y)^T (x - y) \right)
\]
and $h$ is strongly convex with respect to $\|\cdot\|$.

$P_C^h$ is the Bregman projection:
\[
P_C^h(y) = \operatorname*{argmin}_{x \in C} D_h(x, y)
\]

9
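To make this abstract update concrete, here is a minimal Python sketch of one mirror descent step in the mirror-map form derived on the following slides; the function names (`grad_h`, `grad_h_conj`, `bregman_proj`) are illustrative choices, not from the slides:

```python
import numpy as np

def mirror_descent_step(x, g, alpha, grad_h, grad_h_conj, bregman_proj):
    """One mirror descent step (a sketch).

    grad_h:       the mirror map, grad h
    grad_h_conj:  its inverse, grad h* (gradient of the Fenchel conjugate)
    bregman_proj: the Bregman projection P_C^h onto the constraint set C
    """
    theta = grad_h(x) - alpha * g   # subgradient step in the dual ("mirror") space
    y = grad_h_conj(theta)          # map back to the primal space
    return bregman_proj(y)          # Bregman-project onto C

# For negative entropy over the simplex (cf. the Negative Entropy slide):
# grad_h = lambda x: 1.0 + np.log(x); grad_h_conj = lambda t: np.exp(t - 1.0)
# bregman_proj = lambda y: y / y.sum()
```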
Mirror Descent update rule

 
\begin{align*}
x^{(k+1)} &= P_C^h\left( \operatorname*{argmin}_x\; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)}) \right) \\
&= P_C^h\left( \operatorname*{argmin}_x\; g^{(k)T} x + \frac{1}{\alpha_k} D_h(x, x^{(k)}) \right) \\
&= P_C^h\left( \operatorname*{argmin}_x\; g^{(k)T} x + \frac{1}{\alpha_k} h(x) - \frac{1}{\alpha_k} \nabla h(x^{(k)})^T x \right)
\end{align*}

optimality condition for $y = \operatorname{argmin}$: $g^{(k)} + \frac{1}{\alpha_k} \nabla h(y) - \frac{1}{\alpha_k} \nabla h(x^{(k)}) = 0$

$D_h(x, y)$ is the Bregman divergence
\[
D_h(x, y) = h(x) - \left( h(y) + \nabla h(y)^T (x - y) \right)
\]
$P_C^h$ is the Bregman projection:
\begin{align*}
P_C^h(y) &= \operatorname*{argmin}_{x \in C}\; D_h(x, y) = \operatorname*{argmin}_{x \in C}\; h(x) - \nabla h(y)^T x \\
&= \operatorname*{argmin}_{x \in C}\; h(x) - \left( \nabla h(x^{(k)}) - \alpha_k g^{(k)} \right)^T x \\
&= \operatorname*{argmin}_{x \in C}\; D_h(x, x^{(k)}) + \alpha_k g^{(k)T} x
\end{align*}

10
Mirror Descent update rule (simplified)

 
\begin{align*}
x^{(k+1)} &= P_C^h\left( \operatorname*{argmin}_x\; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)}) \right) \\
&= \operatorname*{argmin}_{x \in C}\; D_h(x, x^{(k)}) + \alpha_k g^{(k)T} x
\end{align*}

where $D_h(x, y)$ is the Bregman divergence
\[
D_h(x, y) = h(x) - \left( h(y) + \nabla h(y)^T (x - y) \right)
\]

11
Convergence guarantees

Let $\|g\|_* \le G_{\|\cdot\|}$ for all $g \in \partial f$, or equivalently
\[
f(x) - f(y) \le G_{\|\cdot\|} \|x - y\|
\]
Let $x_h = \operatorname*{argmin}_{x \in C} h(x)$ and $R_{\|\cdot\|}^h = \left( 2 \max_{y \in C} D_h(x_h, y) / \lambda \right)^{1/2}$; then
\[
\|x - x_h\| \le R_{\|\cdot\|}^h
\]
General guarantee:
\[
\sum_{i=1}^k \alpha_i \left[ f(x^{(i)}) - f(x^\star) \right] \le D_h(x^\star, x^{(1)}) + \frac{1}{2} \sum_{i=1}^k \alpha_i^2 \|g^{(i)}\|_*^2
\]
Choose step size $\alpha_k = \dfrac{\lambda R_{\|\cdot\|}^h}{G_{\|\cdot\|} \sqrt{k}}$.

Mirror Descent iterates satisfy
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{R_{\|\cdot\|}^h \, G_{\|\cdot\|}}{\sqrt{k}}
\]
12
Standard setups for Mirror Descent

 
\begin{align*}
x^{(k+1)} &= P_C^h\left( \operatorname*{argmin}_x\; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)}) \right) \\
&= \operatorname*{argmin}_{x \in C}\; D_h(x, x^{(k)}) + \alpha_k g^{(k)T} x
\end{align*}

where $D_h(x, y)$ is the Bregman divergence
\[
D_h(x, y) = h(x) - \left( h(y) + \nabla h(y)^T (x - y) \right)
\]

• simplest version is when $h(x) = \frac{1}{2}\|x\|_2^2$, which is strongly convex w.r.t. $\|\cdot\|_2$; Mirror Descent = Projected Subgradient Descent, with $D_h(x, y) = \frac{1}{2}\|x - y\|_2^2$
• negative entropy $h(x) = \sum_{i=1}^n x_i \log x_i$, which is 1-strongly convex w.r.t. $\|\cdot\|_1$ over the simplex (Pinsker's inequality); $D_h(x, y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - (x_i - y_i)$ is the generalized Kullback–Leibler divergence
13
Negative Entropy
• negative entropy $h(x) = \sum_{i=1}^n x_i \log x_i$; $D_h(x, y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - (x_i - y_i)$ is the generalized Kullback–Leibler divergence
• unit simplex $C = \Delta_n = \{x \in \mathbf{R}^n_+ : \sum_i x_i = 1\}$
• Bregman projection onto the simplex is a simple renormalization:
\[
P_C^h(y) = \frac{y}{\|y\|_1}
\]
• Mirror Descent:
\[
x^{(k+1)} = P_C^h\left( \operatorname*{argmin}_x\; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)}) \right)
\]
• $y \in \operatorname{argmin} \implies \nabla h(y) = \log(y) + \mathbf{1} = \nabla h(x^{(k)}) - \alpha_k g^{(k)} \implies y_i = x_i^{(k)} \exp(-\alpha_k g_i^{(k)})$
• Mirror Descent update:
\[
x_i^{(k+1)} = \frac{x_i^{(k)} \exp(-\alpha_k g_i^{(k)})}{\sum_{j=1}^n x_j^{(k)} \exp(-\alpha_k g_j^{(k)})}
\]
14
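A direct implementation of this multiplicative update (a sketch; the function name is ours). Subtracting the maximum exponent before exponentiating is a standard numerical-stability detail that cancels in the ratio:

```python
import numpy as np

def entropic_md_step(x, g, alpha):
    """Exponentiated-gradient (entropic mirror descent) step on the simplex."""
    z = -alpha * g
    y = x * np.exp(z - z.max())   # the shift multiplies numerator and denominator
    return y / y.sum()            # by the same constant, so the update is unchanged
```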
Mirror descent examples

• Usual (projected) subgradient descent: $h(x) = \frac{1}{2}\|x\|_2^2$
• With simplex constraints, $C = \{x \in \mathbf{R}^n_+ \mid \mathbf{1}^T x = 1\}$, use negative entropy
\[
h(x) = \sum_{i=1}^n x_i \log x_i
\]
(1) Strongly convex with respect to the $\ell_1$-norm
(2) With $x^{(1)} = (1/n)\mathbf{1}$, have $D_h(x^\star, x^{(1)}) \le \log n$ for $x^\star \in C$
(3) If $\|g\|_\infty \le G_\infty$ for $g \in \partial f(x)$, $x \in C$,
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{\log n}{\alpha k} + \frac{\alpha}{2} G_\infty^2
\]
(4) Can be much better than regular subgradient descent...

15
Example

Robust regression problem (an LP):


\[
\begin{array}{ll}
\text{minimize} & f(x) = \|Ax - b\|_1 = \sum_{i=1}^m |a_i^T x - b_i| \\
\text{subject to} & x \in C = \{x \in \mathbf{R}^n_+ \mid \mathbf{1}^T x = 1\}
\end{array}
\]
subgradient of the objective is $g = \sum_{i=1}^m \operatorname{sign}(a_i^T x - b_i)\, a_i$

• Projected subgradient update ($h(x) = \frac{1}{2}\|x\|_2^2$): homework
• Mirror descent update ($h(x) = \sum_{i=1}^n x_i \log x_i$):
\[
x_i^{(k+1)} = \frac{x_i^{(k)} \exp(-\alpha g_i^{(k)})}{\sum_{j=1}^n x_j^{(k)} \exp(-\alpha g_j^{(k)})}
\]

16
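Putting the pieces together, a minimal sketch of the entropic mirror descent loop for this problem; the step size and iteration count here are illustrative choices, not from the slides:

```python
import numpy as np

def robust_regression_md(A, b, alpha=0.1, iters=60):
    """Entropic mirror descent for minimize ||Ax - b||_1 over the simplex (a sketch)."""
    n = A.shape[1]
    f = lambda x: np.abs(A @ x - b).sum()
    x = np.ones(n) / n                    # start at the uniform distribution
    x_best, f_best = x, f(x)
    for _ in range(iters):
        g = A.T @ np.sign(A @ x - b)      # subgradient: sum_i sign(a_i^T x - b_i) a_i
        z = -alpha * g
        y = x * np.exp(z - z.max())       # stabilized multiplicative update
        x = y / y.sum()                   # renormalize (Bregman projection)
        if f(x) < f_best:
            x_best, f_best = x, f(x)
    return x_best, f_best
```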
Example

Robust regression problem with $a_i \sim \mathcal{N}(0, I_{n \times n})$ and $b_i = (a_{i,1} + a_{i,2})/2 + \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, 10^{-2})$, $m = 20$, $n = 3000$

[Figure: $f_{\mathrm{best}}^{(k)} - f^\star$ versus iteration $k$ for the entropy (mirror descent) and gradient (projected subgradient) methods]

stepsizes chosen according to best bounds (but still sensitive to stepsize choice)
17
Example: Spectrahedron

minimizing a function on the spectrahedron $\mathcal{S}_n$, defined as
\[
\mathcal{S}_n = \{X \in \mathbf{S}^n_+ : \mathbf{tr}(X) = 1\}
\]

18
Example: Spectrahedron

\[
\mathcal{S}_n = \{X \in \mathbf{S}^n_+ : \mathbf{tr}(X) = 1\}
\]

• von Neumann entropy:
\[
h(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X)
\]
where $\lambda_1(X), \ldots, \lambda_n(X)$ are the eigenvalues of $X$.
• $\frac{1}{2}$-strongly convex with respect to the norm
\[
\|X\|_{\mathrm{tr}} = \sum_{i=1}^n \lambda_i(X)
\]
• Mirror Descent update:
\[
Y_{t+1} = \exp\left( \log X_t - \alpha_t \nabla f(X_t) \right), \qquad X_{t+1} = P_C^h(Y_{t+1}) = Y_{t+1} / \|Y_{t+1}\|_{\mathrm{tr}}
\]
19
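A sketch of this matrix update via eigendecompositions, so the matrix log and exp act on the eigenvalues; it assumes $X_t \succ 0$ and a symmetric gradient `G`:

```python
import numpy as np

def spectrahedron_md_step(X, G, alpha):
    """Matrix entropic mirror descent step on the spectrahedron (a sketch)."""
    w, V = np.linalg.eigh(X)
    log_X = (V * np.log(w)) @ V.T          # matrix logarithm: V diag(log w) V^T
    w2, V2 = np.linalg.eigh(log_X - alpha * G)
    Y = (V2 * np.exp(w2)) @ V2.T           # matrix exponential
    return Y / np.trace(Y)                 # for PSD Y, ||Y||_tr = tr(Y)
```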
Mirror Descent Analysis

distance generating function $h$, 1-strongly-convex w.r.t. $\|\cdot\|$:
\[
h(y) \ge h(x) + \nabla h(x)^T (y - x) + \frac{1}{2} \|x - y\|^2
\]
Fenchel conjugate
\[
h^*(\theta) = \sup_{x \in C} \left( \theta^T x - h(x) \right), \qquad \nabla h^*(\theta) = \operatorname*{argmax}_{x \in C} \left( \theta^T x - h(x) \right)
\]
$\nabla h$, $\nabla h^*$ take us "through the mirror" and back:
\[
x \;\underset{\nabla h^*}{\overset{\nabla h}{\rightleftarrows}}\; \theta
\]
mirror descent iterations for $C = \mathbf{R}^n$:
\[
x^{(k+1)} = \operatorname*{argmin}_{x \in C} \left\{ \alpha_k g^{(k)T} x + D_h(x, x^{(k)}) \right\} = \nabla h^*\left( \nabla h(x^{(k)}) - \alpha_k g^{(k)} \right)
\]
$h(x) = \frac{1}{2}\|x\|_2^2$ recovers the standard case
20
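As a concrete instance of the mirror map (worked out here; it matches the Negative Entropy slide): for $h(x) = \sum_i x_i \log x_i$ on the positive orthant,
\[
\nabla h(x)_i = 1 + \log x_i, \qquad h^*(\theta) = \sum_{i=1}^n e^{\theta_i - 1}, \qquad \nabla h^*(\theta)_i = e^{\theta_i - 1},
\]
so $\nabla h^*\left( \nabla h(x^{(k)}) - \alpha_k g^{(k)} \right)$ becomes the multiplicative update $x_i^{(k+1)} = x_i^{(k)} e^{-\alpha_k g_i^{(k)}}$.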
Convergence analysis

\[
g^{(k)} \in \partial f(x^{(k)}), \qquad \theta^{(k+1)} = \theta^{(k)} - \alpha_k g^{(k)}, \qquad x^{(k+1)} = \nabla h^*(\theta^{(k+1)})
\]
Bregman divergence
\[
D_{h^*}(\theta', \theta) = h^*(\theta') - h^*(\theta) - \nabla h^*(\theta)^T (\theta' - \theta)
\]
Let $\theta^\star = \nabla h(x^\star)$; then
\begin{align*}
D_{h^*}(\theta^{(k+1)}, \theta^\star) = D_{h^*}(\theta^{(k)}, \theta^\star) &+ (\theta^{(k+1)} - \theta^{(k)})^T \left( \nabla h^*(\theta^{(k)}) - \nabla h^*(\theta^\star) \right) \\
&+ D_{h^*}(\theta^{(k+1)}, \theta^{(k)})
\end{align*}
and
\[
(\theta^{(k+1)} - \theta^{(k)})^T \left( \nabla h^*(\theta^{(k)}) - \nabla h^*(\theta^\star) \right) = -\alpha_k g^{(k)T} (x^{(k)} - x^\star)
\]

21
Convergence analysis continued

From convexity and $g^{(k)} \in \partial f(x^{(k)})$,
\[
f(x^{(k)}) - f(x^\star) \le g^{(k)T} (x^{(k)} - x^\star)
\]
Therefore
\[
\alpha_k \left[ f(x^{(k)}) - f(x^\star) \right] \le D_{h^*}(\theta^{(k)}, \theta^\star) - D_{h^*}(\theta^{(k+1)}, \theta^\star) + D_{h^*}(\theta^{(k+1)}, \theta^{(k)})
\]
Fact: $h$ is 1-strongly-convex w.r.t. $\|\cdot\|$ $\Leftrightarrow$ $D_h(x', x) \ge \frac{1}{2}\|x' - x\|^2$ $\Leftrightarrow$ $h^*$ is 1-smooth w.r.t. $\|\cdot\|_*$ $\Leftrightarrow$ $D_{h^*}(\theta', \theta) \le \frac{1}{2}\|\theta' - \theta\|_*^2$

Bounding the $D_{h^*}(\theta^{(k+1)}, \theta^{(k)})$ terms and telescoping gives
\[
\sum_{i=1}^k \alpha_i \left[ f(x^{(i)}) - f(x^\star) \right] \le D_{h^*}(\theta^{(1)}, \theta^\star) + \frac{1}{2} \sum_{i=1}^k \alpha_i^2 \|g^{(i)}\|_*^2
\]

22
Convergence guarantees

Note: $D_{h^*}(\theta^{(1)}, \theta^\star) = D_h(x^\star, x^{(1)})$

Most general guarantee:
\[
\sum_{i=1}^k \alpha_i \left[ f(x^{(i)}) - f(x^\star) \right] \le D_h(x^\star, x^{(1)}) + \frac{1}{2} \sum_{i=1}^k \alpha_i^2 \|g^{(i)}\|_*^2
\]
Fixed step size $\alpha_k = \alpha$:
\[
\frac{1}{k} \sum_{i=1}^k f(x^{(i)}) - f(x^\star) \le \frac{1}{\alpha k} D_h(x^\star, x^{(1)}) + \frac{\alpha}{2} \max_i \|g^{(i)}\|_*^2
\]
in general, converges if
• $D_h(x^\star, x^{(1)}) < \infty$
• $\sum_k \alpha_k = \infty$ and $\alpha_k \to 0$
• for all $g \in \partial f(x)$ and $x \in C$, $\|g\|_* \le G$ for some $G < \infty$

Stochastic gradients are fine!
23
Variable metric subgradient methods

subgradient method with variable metric $H_k \succ 0$:

(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) update (diagonal) metric $H_k$
(3) update $x^{(k+1)} = x^{(k)} - H_k^{-1} g^{(k)}$

• matrix $H_k$ generalizes the step-length $\alpha_k$

there are many such methods (ellipsoid method, AdaGrad, ...)

24
Variable metric projected subgradient method

same, with the projection carried out in the $H_k$ metric:

(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) update (diagonal) metric $H_k$
(3) update $x^{(k+1)} = \Pi_X^{H_k}\left( x^{(k)} - H_k^{-1} g^{(k)} \right)$

where
\[
\Pi_X^H(y) = \operatorname*{argmin}_{x \in X} \|x - y\|_H^2
\]
and $\|x\|_H = \left( x^T H x \right)^{1/2}$.

25
Convergence analysis

since $\Pi_X^{H_k}$ is non-expansive in the $\|\cdot\|_{H_k}$ norm, we get
\begin{align*}
\|x^{(k+1)} - x^\star\|_{H_k}^2 &= \left\| \Pi_X^{H_k}\left( x^{(k)} - H_k^{-1} g^{(k)} \right) - \Pi_X^{H_k}(x^\star) \right\|_{H_k}^2 \\
&\le \|x^{(k)} - H_k^{-1} g^{(k)} - x^\star\|_{H_k}^2 \\
&= \|x^{(k)} - x^\star\|_{H_k}^2 - 2 (g^{(k)})^T (x^{(k)} - x^\star) + \|g^{(k)}\|_{H_k^{-1}}^2 \\
&\le \|x^{(k)} - x^\star\|_{H_k}^2 - 2 \left( f(x^{(k)}) - f^\star \right) + \|g^{(k)}\|_{H_k^{-1}}^2,
\end{align*}
using $f^\star = f(x^\star) \ge f(x^{(k)}) + g^{(k)T} (x^\star - x^{(k)})$

26
apply recursively, use
\[
\sum_{i=1}^k \left( f(x^{(i)}) - f^\star \right) \ge k \left( f_{\mathrm{best}}^{(k)} - f^\star \right)
\]
and rearrange to get
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{\|x^{(1)} - x^\star\|_{H_1}^2 + \sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2}{2k} + \frac{\sum_{i=2}^k \left( \|x^{(i)} - x^\star\|_{H_i}^2 - \|x^{(i)} - x^\star\|_{H_{i-1}}^2 \right)}{2k}
\]

27
numerator of the additional term can be bounded to get estimates

• for general $H_k = \mathbf{diag}(h_k)$
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{R_\infty^2 \|H_1\|_1 + \sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2}{2k} + \frac{R_\infty^2 \sum_{i=2}^k \|H_i - H_{i-1}\|_1}{2k}
\]
• for $H_k = \mathbf{diag}(h_k)$ with $h_i \ge h_{i-1}$ for all $i$
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{R_\infty^2 \|h_k\|_1}{2k} + \frac{\sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2}{2k}
\]
where $\max_{1 \le i \le k} \|x^{(i)} - x^\star\|_\infty \le R_\infty$

28
converges if
• $R_\infty < \infty$ (e.g. if $X$ is compact)
• $\sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2$ grows slower than $k$
• $\sum_{i=2}^k \|H_i - H_{i-1}\|_1$ grows slower than $k$, or $h_i \ge h_{i-1}$ for all $i$ and $\|h_k\|_1$ grows slower than $k$

29
AdaGrad

AdaGrad — adaptive subgradient method

(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) choose metric $H_k$:
  • set $S_k = \sum_{i=1}^k \mathbf{diag}(g^{(i)})^2$
  • set $H_k = \frac{1}{\alpha} S_k^{1/2}$
(3) update $x^{(k+1)} = \Pi_X^{H_k}\left( x^{(k)} - H_k^{-1} g^{(k)} \right)$

where $\alpha > 0$ is the step-size

30
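A minimal sketch of the diagonal AdaGrad iteration for the unconstrained case $X = \mathbf{R}^n$, so the projection step is omitted; the small `eps` guarding against division by zero is an implementation detail, not on the slide:

```python
import numpy as np

def adagrad(subgrad, x0, alpha, iters, eps=1e-12):
    """Diagonal AdaGrad, unconstrained case (a sketch)."""
    x = x0.copy()
    s = np.zeros_like(x0)          # running sum of squared subgradients
    for _ in range(iters):
        g = subgrad(x)
        s += g * g                 # diagonal of S_k = sum_i diag(g^(i))^2
        h = np.sqrt(s) / alpha     # diagonal of H_k = (1/alpha) S_k^{1/2}
        x = x - g / (h + eps)      # x^(k+1) = x^(k) - H_k^{-1} g^(k)
    return x
```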
AdaGrad – motivation

• for fixed $H_k = H$ we have the estimate:
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{1}{2k} (x^{(1)} - x^\star)^T H (x^{(1)} - x^\star) + \frac{1}{2k} \sum_{i=1}^k \|g^{(i)}\|_{H^{-1}}^2
\]
• idea: choose the diagonal $H_k \succ 0$ that minimizes this estimate in hindsight:
\[
H_k = \operatorname*{argmin}_{h} \max_{x,y \in C} \left\{ (x - y)^T \mathbf{diag}(h) (x - y) + \sum_{i=1}^k \|g^{(i)}\|_{\mathbf{diag}(h)^{-1}}^2 \right\}
\]
• optimal $H_k = \frac{1}{R_\infty} \mathbf{diag}\left( \sqrt{\textstyle\sum_{i=1}^k (g_1^{(i)})^2}, \; \ldots, \; \sqrt{\textstyle\sum_{i=1}^k (g_n^{(i)})^2} \right)$
• intuition: adapt the step-length based on historical step lengths

31
AdaGrad – convergence

by construction, $H_i = \frac{1}{\alpha}\mathbf{diag}(h_i)$ and $h_i \ge h_{i-1}$, so
\begin{align*}
f_{\mathrm{best}}^{(k)} - f^\star &\le \frac{1}{2k} \sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2 + \frac{1}{2k\alpha} R_\infty^2 \|h_k\|_1 \\
&\le \frac{\alpha}{k} \|h_k\|_1 + \frac{1}{2k\alpha} R_\infty^2 \|h_k\|_1
\end{align*}
(second line is a theorem)

also have (with $\alpha = R_\infty$), for compact sets $C$,
\[
f_{\mathrm{best}}^{(k)} - f^\star \le \frac{2}{k} \inf_{h \ge 0} \sup_{x,y \in C} \left\{ (x - y)^T \mathbf{diag}(h) (x - y) + \sum_{i=1}^k \|g^{(i)}\|_{\mathbf{diag}(h)^{-1}}^2 \right\}
\]

32
Example

Classification problem:
• Data: $\{a_i, b_i\}$, $i = 1, \ldots, 50000$
• $a_i \in \mathbf{R}^{1000}$
• $b_i \in \{-1, 1\}$
• Data created with 5% mis-classifications w.r.t. $w = \mathbf{1}$, $v = 0$
• Objective: find classifiers $w \in \mathbf{R}^{1000}$ and $v \in \mathbf{R}$ such that
  • $a_i^T w + v > 1$ if $b_i = 1$
  • $a_i^T w + v < -1$ if $b_i = -1$
• Optimization method:
  • Minimize the hinge loss: $\sum_i \max\left(0,\; 1 - b_i (a_i^T w + v)\right)$
  • Choose an example uniformly at random, take a subgradient step w.r.t. that example

33
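For concreteness, a sketch of one stochastic subgradient step for this hinge loss (names and step-size handling are ours; the AdaGrad variant would feed the same per-example subgradient into the metric update from the preceding slides):

```python
import numpy as np

def hinge_subgrad_step(w, v, A, b, alpha, rng):
    """One stochastic subgradient step for sum_i max(0, 1 - b_i (a_i^T w + v))."""
    i = rng.integers(len(b))            # pick an example uniformly at random
    if b[i] * (A[i] @ w + v) < 1:       # hinge active: subgradient is -b_i (a_i, 1)
        w = w + alpha * b[i] * A[i]     # step on w
        v = v + alpha * b[i]            # step on v
    return w, v

# usage: rng = np.random.default_rng(0)
#        w, v = hinge_subgrad_step(w, v, A, b, alpha=0.1, rng=rng)
```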
Best subgradient method vs best AdaGrad

[Figure: relative convergence versus iteration $k$ for the best-tuned subgradient method and AdaGrad]

Often the best-tuned AdaGrad performs better than the best-tuned subgradient method


34
AdaGrad with different step-sizes α:

[Figure: relative convergence versus iteration $k$ for AdaGrad with four different step-sizes $\alpha$]

Sensitive to step-size selection (like standard subgradient method)


35
