Mirror Descent Slides
Methods
Mirror descent

Replace ‖·‖_2 with an alternate distance²-like function.
Convergence rate of projected subgradient method

f_best^{(k)} − f* ≤ R G / √k

where G = max_{x∈C} ‖∂f(x)‖_2 and R = max_{x,y∈C} ‖x − y‖_2. The analysis and the convergence results depend on the Euclidean (ℓ_2) norm.
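A minimal numerical sketch of this method with the step size α_k = R/(G√k); the problem instance and constants below are hypothetical, chosen only to illustrate the update:

```python
import numpy as np

def projected_subgradient(f, subgrad, project, x0, R, G, iters):
    """Projected subgradient method with step size alpha_k = R/(G*sqrt(k)),
    tracking the best iterate f_best^(k)."""
    x = project(x0)
    x_best, f_best = x, f(x)
    for k in range(1, iters + 1):
        x = project(x - (R / (G * np.sqrt(k))) * subgrad(x))
        if f(x) < f_best:
            x_best, f_best = x, f(x)
    return x_best, f_best

# hypothetical instance: minimize ||x - c||_1 over the unit Euclidean ball
c = np.array([2.0, 0.0])
f = lambda x: np.sum(np.abs(x - c))
subgrad = lambda x: np.sign(x - c)                   # a subgradient of f
project = lambda x: x / max(1.0, np.linalg.norm(x))  # unit-ball projection
x_best, f_best = projected_subgradient(f, subgrad, project,
                                       np.zeros(2), R=2.0, G=np.sqrt(2), iters=500)
# optimum is x = (1, 0) with f* = 1
```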
Bregman Divergence

D_h(x, y) = h(x) − h(y) − ∇h(y)^T (x − y)

can be interpreted as the distance between x and y as measured by the function h.

Example: h(x) = ‖x‖_2^2 gives D_h(x, y) = ‖x − y‖_2^2
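A quick numerical check of this definition; the two choices of h below (squared Euclidean norm and negative entropy) are standard illustrations, not part of the slide:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - grad_h(y)^T (x - y)."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

x, y = np.array([0.2, 0.3, 0.5]), np.array([0.5, 0.25, 0.25])

# h(x) = ||x||_2^2 gives the squared Euclidean distance
d_sq = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)

# h(x) = sum_i x_i log x_i (negative entropy) gives the KL divergence
# between points on the probability simplex
d_kl = bregman(lambda v: v @ np.log(v), lambda v: np.log(v) + 1, x, y)
```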
Strong convexity

h is λ-strongly convex if

h(x) ≥ h(y) + ∇h(y)^T (x − y) + (λ/2) ‖x − y‖^2
Properties of Bregman divergence

For λ-strongly convex h, the Bregman divergence satisfies

D_h(x, y) ≥ (λ/2) ‖x − y‖^2 ≥ 0
Pythagorean theorem

Bregman projection:

P_C^h(y) = argmin_{x∈C} D_h(x, y)

Projected subgradient method, proximal form:

x^{(k+1)} = P_C( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/(2α_k)) ‖x − x^{(k)}‖_2^2 )
Mirror Descent

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
Mirror Descent update rule

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
          = P_C^h( argmin_x g^{(k)T} x + (1/α_k) D_h(x, x^{(k)}) )
          = P_C^h( argmin_x g^{(k)T} x + (1/α_k) h(x) − (1/α_k) ∇h(x^{(k)})^T x )

Optimality condition for y = argmin: g^{(k)} + (1/α_k) ∇h(y) − (1/α_k) ∇h(x^{(k)}) = 0,
i.e. ∇h(y) = ∇h(x^{(k)}) − α_k g^{(k)}.

D_h(x, y) is the Bregman divergence

D_h(x, y) = h(x) − h(y) − ∇h(y)^T (x − y)

P_C^h is the Bregman projection:

P_C^h(y) = argmin_{x∈C} D_h(x, y) = argmin_{x∈C} h(x) − ∇h(y)^T x

Substituting ∇h(y) = ∇h(x^{(k)}) − α_k g^{(k)}:

x^{(k+1)} = argmin_{x∈C} h(x) − (∇h(x^{(k)}) − α_k g^{(k)})^T x
          = argmin_{x∈C} D_h(x, x^{(k)}) + α_k g^{(k)T} x
Mirror Descent update rule (simplified)

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
          = argmin_{x∈C} D_h(x, x^{(k)}) + α_k g^{(k)T} x
Convergence guarantees

Suppose ‖x − x^{(1)}‖ ≤ R_{‖·‖}^h for all x ∈ C.

General guarantee:

Σ_{i=1}^k α_i [f(x^{(i)}) − f(x*)] ≤ D_h(x*, x^{(1)}) + (1/2) Σ_{i=1}^k α_i^2 ‖g^{(i)}‖_*^2

Choose step size α_k = λ R_{‖·‖}^h / (G_{‖·‖} √k). Mirror Descent iterates satisfy

f_best^{(k)} − f* ≤ R_{‖·‖}^h G_{‖·‖} / √k
Standard setups for Mirror Descent

x^{(k+1)} = P_C^h( argmin_x f(x^{(k)}) + g^{(k)T}(x − x^{(k)}) + (1/α_k) D_h(x, x^{(k)}) )
          = argmin_{x∈C} D_h(x, x^{(k)}) + α_k g^{(k)T} x

For the entropy setup on the probability simplex:

f_best^{(k)} − f* ≤ log n / (α k) + (α/2) G_∞^2

Can be much better than regular subgradient descent...
Example

x_i^{(k+1)} = x_i^{(k)} exp(−α g_i^{(k)}) / Σ_{j=1}^n x_j^{(k)} exp(−α g_j^{(k)})
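This exponentiated-gradient update can be sketched directly in code; the linear objective below is a hypothetical test problem, not from the slides:

```python
import numpy as np

def entropic_md_step(x, g, alpha):
    """Exponentiated-gradient step on the probability simplex:
    multiply elementwise by exp(-alpha * g) and renormalize."""
    w = x * np.exp(-alpha * g)
    return w / w.sum()

# minimize f(x) = c^T x over the simplex; the minimum puts all mass
# on the smallest coordinate of c
c = np.array([0.5, 0.2, 0.9])
x = np.ones(3) / 3                          # uniform starting point
for _ in range(200):
    x = entropic_md_step(x, c, alpha=0.5)   # subgradient of c^T x is c
# x concentrates on index argmin(c)
```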
Example

[Figure: f_best^{(k)} − f* versus k (log scale, 10^1 down to 10^{−6}), k = 0, …, 60, comparing the entropy (mirror descent) and gradient setups.]
Example: Spectrahedron

S^n = {X ∈ S_+^n : tr(X) = 1}

h(x) = (1/2)‖x‖_2^2 recovers the standard case
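A small sketch of this reduction: with h(x) = (1/2)‖x‖_2^2 the mirror step is exactly a projected subgradient step. The projection and test points below are hypothetical:

```python
import numpy as np

def mirror_step(x, g, alpha, project):
    """Mirror descent step for h(x) = 0.5*||x||_2^2: grad h is the identity,
    so grad h(y) = grad h(x) - alpha*g gives y = x - alpha*g, and the
    Bregman projection reduces to the Euclidean projection."""
    return project(x - alpha * g)

project = lambda z: z / max(1.0, np.linalg.norm(z))   # unit-ball projection
x, g = np.array([0.3, 0.4]), np.array([-1.0, 0.0])
x_new = mirror_step(x, g, alpha=0.5, project=project)
# identical to one projected subgradient step
```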
Convergence analysis

Bregman divergence: let θ* = ∇h(x*),

Convergence analysis continued

Therefore
Convergence guarantees

In general, converges if
• D_h(x*, x^{(1)}) < ∞
• Σ_k α_k = ∞ and α_k → 0
• for all g ∈ ∂f(x) and x ∈ C, ‖g‖_* ≤ G for some G < ∞

Stochastic gradients are fine!
Variable metric subgradient methods
Variable metric projected subgradient method

x^{(k+1)} = Π_X^{H_k}( x^{(k)} − H_k^{−1} g^{(k)} )

where

Π_X^H(y) = argmin_{x∈X} ‖x − y‖_H^2

and ‖x‖_H = √(x^T H x).
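For concreteness, here is the H-norm projection in closed form for a halfspace (the constraint set and numbers are hypothetical; for a general X this argmin has no closed form):

```python
import numpy as np

def project_halfspace_H(y, a, b, H):
    """Pi^H_X(y) for X = {x : a^T x <= b} under ||x||_H = sqrt(x^T H x);
    closed form derived from the KKT conditions of the projection problem."""
    viol = a @ y - b
    if viol <= 0:
        return y                      # y is already feasible
    Hinv_a = np.linalg.solve(H, a)
    return y - (viol / (a @ Hinv_a)) * Hinv_a

H = np.diag([1.0, 4.0])               # a diagonal metric
a, y = np.array([1.0, 1.0]), np.array([2.0, 2.0])
x = project_halfspace_H(y, a, b=1.0, H=H)
# lands on the boundary a^T x = 1; differs from the Euclidean
# projection of y, which would be (0.5, 0.5)
```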
Convergence analysis

Since Π_X^{H_k} is non-expansive in the ‖·‖_{H_k} norm, we get

‖x^{(k+1)} − x*‖_{H_k}^2 = ‖Π_X^{H_k}(x^{(k)} − H_k^{−1} g^{(k)}) − Π_X^{H_k}(x*)‖_{H_k}^2
                         ≤ ‖x^{(k)} − H_k^{−1} g^{(k)} − x*‖_{H_k}^2
                         = ‖x^{(k)} − x*‖_{H_k}^2 − 2 (g^{(k)})^T (x^{(k)} − x*) + ‖g^{(k)}‖_{H_k^{−1}}^2
Apply recursively, use

Σ_{i=1}^k [f(x^{(i)}) − f*] ≥ k (f_best^{(k)} − f*)

to get

f_best^{(k)} − f* ≤ (‖x^{(1)} − x*‖_{H_1}^2 + Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2) / (2k)
                    + (Σ_{i=2}^k [‖x^{(i)} − x*‖_{H_i}^2 − ‖x^{(i)} − x*‖_{H_{i−1}}^2]) / (2k)
The numerator of the additional term can be bounded to get the estimates:

• for general H_k = diag(h_k):

f_best^{(k)} − f* ≤ (R_∞^2 ‖H_1‖_1 + Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2) / (2k) + (R_∞^2 Σ_{i=2}^k ‖H_i − H_{i−1}‖_1) / (2k)

• for H_k = diag(h_k) with h_i ≥ h_{i−1} for all i:

f_best^{(k)} − f* ≤ (R_∞^2 ‖h_k‖_1) / (2k) + (Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2) / (2k)
Converges if
• R_∞ < ∞ (e.g. if X is compact)
• Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2 grows slower than k
• Σ_{i=2}^k ‖H_i − H_{i−1}‖_1 grows slower than k, or
  h_i ≥ h_{i−1} for all i and ‖h_k‖_1 grows slower than k
AdaGrad
AdaGrad – motivation
AdaGrad – convergence

By construction, H_i = (1/α) diag(h_i) and h_i ≥ h_{i−1}, so

f_best^{(k)} − f* ≤ (1/(2k)) Σ_{i=1}^k ‖g^{(i)}‖_{H_i^{−1}}^2 + (1/(2kα)) R_∞^2 ‖h_k‖_1
                  ≤ (α/k) ‖h_k‖_1 + (1/(2kα)) R_∞^2 ‖h_k‖_1

(second line is a theorem)

Also have (with α = R_∞) and for compact sets C:

f_best^{(k)} − f* ≤ (2/k) inf_{h≥0} sup_{x,y∈C} { (x − y)^T diag(h)(x − y) + Σ_{i=1}^k ‖g^{(i)}‖_{diag(h)^{−1}}^2 }
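A minimal diagonal-AdaGrad sketch matching this construction: h_k accumulates the root-sum-of-squares of past gradients, so h_i ≥ h_{i−1} elementwise. The badly scaled quadratic test problem is hypothetical:

```python
import numpy as np

def adagrad_step(x, g, s, alpha):
    """Diagonal AdaGrad: s accumulates squared gradients, h_k = sqrt(s) is
    elementwise nondecreasing, and the step is alpha * g scaled by 1/h_k
    (i.e. a step in the H_k^{-1} metric with H_k = (1/alpha) diag(h_k))."""
    s = s + g ** 2
    x = x - alpha * g / (np.sqrt(s) + 1e-8)
    return x, s

# badly scaled quadratic f(x) = 0.5 * x^T diag(d) x
d = np.array([100.0, 1.0])
f = lambda x: 0.5 * x @ (d * x)
x, s = np.array([1.0, 1.0]), np.zeros(2)
f0 = f(x)
for _ in range(500):
    x, s = adagrad_step(x, d * x, s, alpha=0.5)
# the per-coordinate scaling makes progress comparable in both coordinates
# despite the factor-100 difference in curvature
```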
Example

Classification problem:
• Data: {a_i, b_i}, i = 1, …, 50000
• a_i ∈ R^1000
• b_i ∈ {−1, 1}
• Data created with 5% mis-classifications w.r.t. w = 1, v = 0
• Objective: find classifiers w ∈ R^1000 and v ∈ R such that
  • a_i^T w + v > 1 if b_i = 1
  • a_i^T w + v < −1 if b_i = −1
• Optimization method:
  • Minimize hinge loss: Σ_i max(0, 1 − b_i (a_i^T w + v))
  • Choose an example uniformly at random, take a subgradient step w.r.t. that example
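A scaled-down sketch of this experiment (smaller dimensions and a fixed seed so it runs quickly; the 1/√k step size is an assumption, the slide does not specify the schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data in the spirit of the slide, at smaller scale
n, m = 50, 200
A = rng.standard_normal((m, n))
b = np.sign(A @ np.ones(n))             # labels from the true classifier w = 1, v = 0
flip = rng.random(m) < 0.05
b[flip] = -b[flip]                      # 5% mis-classifications

w, v = np.zeros(n), 0.0
for k in range(1, 5001):
    i = rng.integers(m)                 # example chosen uniformly at random
    if b[i] * (A[i] @ w + v) < 1:       # hinge active: nonzero subgradient
        alpha = 1.0 / np.sqrt(k)        # assumed step-size schedule
        w += alpha * b[i] * A[i]        # subgradient of the hinge is -b_i * a_i
        v += alpha * b[i]
acc = np.mean(np.sign(A @ w + v) == b)  # training accuracy
```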
Best subgradient method vs best AdaGrad

[Figure: relative convergence versus k (k = 0, …, 5000, log scale 10^{−2} to 10^1), top panel comparing the subgradient method with AdaGrad, bottom panel comparing four step-size choices α_1, …, α_4.]