Mirror Descent Slides
Methods
Mirror descent

Replace ‖·‖_2 with an alternate distance²-like function.
Convergence rate of projected subgradient method

f_best^{(k)} − f* ≤ R G / √k

where G = max_{x∈C} ‖∂f(x)‖_2 and R = max_{x,y∈C} ‖x − y‖_2. The analysis and the convergence results depend on the Euclidean (ℓ_2) norm.
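A minimal numerical sketch of this method with the step size α_k = R/(G√k); the problem instance and constants below are hypothetical, chosen only to illustrate the update:

```python
import numpy as np

def projected_subgradient(f, subgrad, project, x0, R, G, iters):
    """Projected subgradient method with step size alpha_k = R/(G*sqrt(k)),
    tracking the best iterate f_best^(k)."""
    x = project(x0)
    x_best, f_best = x, f(x)
    for k in range(1, iters + 1):
        x = project(x - (R / (G * np.sqrt(k))) * subgrad(x))
        if f(x) < f_best:
            x_best, f_best = x, f(x)
    return x_best, f_best

# hypothetical instance: minimize ||x - c||_1 over the unit Euclidean ball
c = np.array([2.0, 0.0])
f = lambda x: np.sum(np.abs(x - c))
subgrad = lambda x: np.sign(x - c)                   # a subgradient of f
project = lambda x: x / max(1.0, np.linalg.norm(x))  # unit-ball projection
x_best, f_best = projected_subgradient(f, subgrad, project,
                                       np.zeros(2), R=2.0, G=np.sqrt(2), iters=500)
# optimum is x = (1, 0) with f* = 1
```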
Bregman Divergence

D_h(x, y) = h(x) − h(y) − ∇h(y)^T (x − y)

can be interpreted as the distance between x and y as measured by the function h.

Example: h(x) = ‖x‖_2^2 gives D_h(x, y) = ‖x − y‖_2^2
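A quick numerical check of this definition; the two choices of h below (squared Euclidean norm and negative entropy) are standard illustrations, not part of the slide:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - grad_h(y)^T (x - y)."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

x, y = np.array([0.2, 0.3, 0.5]), np.array([0.5, 0.25, 0.25])

# h(x) = ||x||_2^2 gives the squared Euclidean distance
d_sq = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)

# h(x) = sum_i x_i log x_i (negative entropy) gives the KL divergence
# between points on the probability simplex
d_kl = bregman(lambda v: v @ np.log(v), lambda v: np.log(v) + 1, x, y)
```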
Strong convexity

h is λ-strongly convex if

h(x) ≥ h(y) + ∇h(y)^T (x − y) + (λ/2) ‖x − y‖^2
Properties of Bregman divergence

For λ-strongly convex h, the Bregman divergence satisfies

D_h(x, y) ≥ (λ/2) ‖x − y‖^2 ≥ 0
Pythagorean theorem

Bregman projection:

P_C^h(y) = argmin_{x∈C} D_h(x, y)

Projected subgradient method, proximal form:

x^{(k+1)} = P_C( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/(2α_k)) ‖x − x^{(k)}‖_2^2 )
Mirror Descent

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
Mirror Descent update rule

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
          = P_C^h( argmin_x g^{(k)T} x + (1/α_k) D_h(x, x^{(k)}) )
          = P_C^h( argmin_x g^{(k)T} x + (1/α_k) h(x) − (1/α_k) ∇h(x^{(k)})^T x )

Optimality condition for y = argmin: g^{(k)} + (1/α_k) ∇h(y) − (1/α_k) ∇h(x^{(k)}) = 0,
i.e. ∇h(y) = ∇h(x^{(k)}) − α_k g^{(k)}.

D_h(x, y) is the Bregman divergence

D_h(x, y) = h(x) − h(y) − ∇h(y)^T (x − y)

P_C^h is the Bregman projection:

P_C^h(y) = argmin_{x∈C} D_h(x, y) = argmin_{x∈C} h(x) − ∇h(y)^T x

Substituting ∇h(y) = ∇h(x^{(k)}) − α_k g^{(k)}:

x^{(k+1)} = argmin_{x∈C} h(x) − (∇h(x^{(k)}) − α_k g^{(k)})^T x
          = argmin_{x∈C} D_h(x, x^{(k)}) + α_k g^{(k)T} x
Mirror Descent update rule (simplified)

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
          = argmin_{x∈C} D_h(x, x^{(k)}) + α_k g^{(k)T} x
Convergence guarantees

Suppose ‖x − x^{(1)}‖ ≤ R_{‖·‖}^h for all x ∈ C.

General guarantee:

Σ_{i=1}^k α_i [f(x^{(i)}) − f(x*)] ≤ D_h(x*, x^{(1)}) + (1/2) Σ_{i=1}^k α_i^2 ‖g^{(i)}‖_*^2

Choose step size α_k = λ R_{‖·‖}^h / (G_{‖·‖} √k). Mirror Descent iterates satisfy

f_best^{(k)} − f* ≤ R_{‖·‖}^h G_{‖·‖} / √k
Standard setups for Mirror Descent

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
          = argmin_{x∈C} D_h(x, x^{(k)}) + α_k g^{(k)T} x

For the entropy setup on the probability simplex:

f_best^{(k)} − f* ≤ log n / (α k) + (α/2) G_∞^2

Can be much better than regular subgradient descent...
Example

x_i^{(k+1)} = x_i^{(k)} exp(−α g_i^{(k)}) / Σ_{j=1}^n x_j^{(k)} exp(−α g_j^{(k)})
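This exponentiated-gradient update can be sketched directly in code; the linear objective below is a hypothetical test problem, not from the slides:

```python
import numpy as np

def entropic_md_step(x, g, alpha):
    """Exponentiated-gradient step on the probability simplex:
    multiply elementwise by exp(-alpha * g) and renormalize."""
    w = x * np.exp(-alpha * g)
    return w / w.sum()

# minimize f(x) = c^T x over the simplex; the minimum puts all mass
# on the smallest coordinate of c
c = np.array([0.5, 0.2, 0.9])
x = np.ones(3) / 3                          # uniform starting point
for _ in range(200):
    x = entropic_md_step(x, c, alpha=0.5)   # subgradient of c^T x is c
# x concentrates on index argmin(c)
```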
Example

[Figure: f_best^{(k)} − f* versus k (log scale, 10^1 down to 10^{−6}), k = 0, …, 60, comparing the entropy (mirror descent) and gradient setups.]
Example: Spectrahedron

S^n = {X ∈ S_+^n : tr(X) = 1}

h(x) = (1/2)‖x‖_2^2 recovers the standard case
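A small sketch of this reduction: with h(x) = (1/2)‖x‖_2^2 the mirror step is exactly a projected subgradient step. The projection and test points below are hypothetical:

```python
import numpy as np

def mirror_step(x, g, alpha, project):
    """Mirror descent step for h(x) = 0.5*||x||_2^2: grad h is the identity,
    so grad h(y) = grad h(x) - alpha*g gives y = x - alpha*g, and the
    Bregman projection reduces to the Euclidean projection."""
    return project(x - alpha * g)

project = lambda z: z / max(1.0, np.linalg.norm(z))   # unit-ball projection
x, g = np.array([0.3, 0.4]), np.array([-1.0, 0.0])
x_new = mirror_step(x, g, alpha=0.5, project=project)
# identical to one projected subgradient step
```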
Convergence analysis

Bregman divergence: let θ* = ∇h(x*),

Convergence analysis continued

Therefore
Convergence guarantees

In general, converges if
• D_h(x*, x^{(1)}) < ∞
• Σ_k α_k = ∞ and α_k → 0
• for all g ∈ ∂f(x) and x ∈ C, ‖g‖_* ≤ G for some G < ∞

Stochastic gradients are fine!
Variable metric subgradient methods
Variable metric projected subgradient method

x^{(k+1)} = Π_X^{H_k}( x^{(k)} − H_k^{−1} g^{(k)} )

where

Π_X^H(y) = argmin_{x∈X} ‖x − y‖_H^2

and ‖x‖_H = √(x^T H x).
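For concreteness, here is the H-norm projection in closed form for a halfspace (the constraint set and numbers are hypothetical; for a general X this argmin has no closed form):

```python
import numpy as np

def project_halfspace_H(y, a, b, H):
    """Pi^H_X(y) for X = {x : a^T x <= b} under ||x||_H = sqrt(x^T H x);
    closed form derived from the KKT conditions of the projection problem."""
    viol = a @ y - b
    if viol <= 0:
        return y                      # y is already feasible
    Hinv_a = np.linalg.solve(H, a)
    return y - (viol / (a @ Hinv_a)) * Hinv_a

H = np.diag([1.0, 4.0])               # a diagonal metric
a, y = np.array([1.0, 1.0]), np.array([2.0, 2.0])
x = project_halfspace_H(y, a, b=1.0, H=H)
# lands on the boundary a^T x = 1; differs from the Euclidean
# projection of y, which would be (0.5, 0.5)
```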
Convergence analysis

Since Π_X^{H_k} is non-expansive in the ‖·‖_{H_k} norm, we get

‖x^{(k+1)} − x*‖_{H_k}^2 = ‖Π_X^{H_k}(x^{(k)} − H_k^{−1} g^{(k)}) − Π_X^{H_k}(x*)‖_{H_k}^2
                         ≤ ‖x^{(k)} − H_k^{−1} g^{(k)} − x*‖_{H_k}^2
                         = ‖x^{(k)} − x*‖_{H_k}^2 − 2 (g^{(k)})^T (x^{(k)} − x*) + ‖g^{(k)}‖_{H_k^{−1}}^2
Apply recursively, use

Σ_{i=1}^k [f(x^{(i)}) − f*] ≥ k (f_best^{(k)} − f*)

to get

f_best^{(k)} − f* ≤ (‖x^{(1)} − x*‖_{H_1}^2 + Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2) / (2k)
                    + (Σ_{i=2}^k [‖x^{(i)} − x*‖_{H_i}^2 − ‖x^{(i)} − x*‖_{H_{i−1}}^2]) / (2k)
The numerator of the additional term can be bounded to get the estimates:

• for general H_k = diag(h_k):

f_best^{(k)} − f* ≤ (R_∞^2 ‖H_1‖_1 + Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2) / (2k) + (R_∞^2 Σ_{i=2}^k ‖H_i − H_{i−1}‖_1) / (2k)

• for H_k = diag(h_k) with h_i ≥ h_{i−1} for all i:

f_best^{(k)} − f* ≤ (R_∞^2 ‖h_k‖_1) / (2k) + (Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2) / (2k)
Converges if
• R_∞ < ∞ (e.g. if X is compact)
• Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2 grows slower than k
• Σ_{i=2}^k ‖H_i − H_{i−1}‖_1 grows slower than k, or
  h_i ≥ h_{i−1} for all i and ‖h_k‖_1 grows slower than k
AdaGrad
AdaGrad – motivation
AdaGrad – convergence

By construction, H_i = (1/α) diag(h_i) and h_i ≥ h_{i−1}, so

f_best^{(k)} − f* ≤ (1/(2k)) Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2 + (1/(2kα)) R_∞^2 ‖h_k‖_1
                  ≤ (α/k) ‖h_k‖_1 + (1/(2kα)) R_∞^2 ‖h_k‖_1

(second line is a theorem)

Also have (with α = R_∞) and for compact sets C:

f_best^{(k)} − f* ≤ (2/k) inf_{h≥0} sup_{x,y∈C} { (x − y)^T diag(h)(x − y) + Σ_{i=1}^k ‖g^{(i)}‖_{diag(h)^{−1}}^2 }
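A minimal diagonal-AdaGrad sketch matching this construction: h_k accumulates the root-sum-of-squares of past gradients, so h_i ≥ h_{i−1} elementwise. The badly scaled quadratic test problem is hypothetical:

```python
import numpy as np

def adagrad_step(x, g, s, alpha):
    """Diagonal AdaGrad: s accumulates squared gradients, h_k = sqrt(s) is
    elementwise nondecreasing, and the step is alpha * g scaled by 1/h_k
    (i.e. a step in the H_k^{-1} metric with H_k = (1/alpha) diag(h_k))."""
    s = s + g ** 2
    x = x - alpha * g / (np.sqrt(s) + 1e-8)
    return x, s

# badly scaled quadratic f(x) = 0.5 * x^T diag(d) x
d = np.array([100.0, 1.0])
f = lambda x: 0.5 * x @ (d * x)
x, s = np.array([1.0, 1.0]), np.zeros(2)
f0 = f(x)
for _ in range(500):
    x, s = adagrad_step(x, d * x, s, alpha=0.5)
# the per-coordinate scaling makes progress comparable in both coordinates
# despite the factor-100 difference in curvature
```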
Example

Classification problem:
• Data: {a_i, b_i}, i = 1, …, 50000
• a_i ∈ R^1000
• b_i ∈ {−1, 1}
• Data created with 5% mis-classifications w.r.t. w = 1, v = 0
• Objective: find classifiers w ∈ R^1000 and v ∈ R such that
  • a_i^T w + v > 1 if b_i = 1
  • a_i^T w + v < −1 if b_i = −1
• Optimization method:
  • Minimize hinge loss: Σ_i max(0, 1 − b_i (a_i^T w + v))
  • Choose an example uniformly at random, take a subgradient step w.r.t. that example
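A scaled-down sketch of this experiment (smaller dimensions and a fixed seed so it runs quickly; the 1/√k step size is an assumption, the slide does not specify the schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data in the spirit of the slide, at smaller scale
n, m = 50, 200
A = rng.standard_normal((m, n))
b = np.sign(A @ np.ones(n))             # labels from the true classifier w = 1, v = 0
flip = rng.random(m) < 0.05
b[flip] = -b[flip]                      # 5% mis-classifications

w, v = np.zeros(n), 0.0
for k in range(1, 5001):
    i = rng.integers(m)                 # example chosen uniformly at random
    if b[i] * (A[i] @ w + v) < 1:       # hinge active: nonzero subgradient
        alpha = 1.0 / np.sqrt(k)        # assumed step-size schedule
        w += alpha * b[i] * A[i]        # subgradient of the hinge is -b_i * a_i
        v += alpha * b[i]
acc = np.mean(np.sign(A @ w + v) == b)  # training accuracy
```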
Best subgradient method vs best AdaGrad

[Figure: relative convergence versus k (k = 0, …, 5000, log scale 10^{−2} to 10^1), top panel comparing the subgradient method with AdaGrad, bottom panel comparing four step-size choices α_1, …, α_4.]