Bregman Divergence and Mirror Descent
1 Bregman Divergence
Motivation
• Generalize squared Euclidean distance to a class of distances that all share similar properties
Definition 1 (Bregman divergence) Let $\psi : \Omega \to \mathbb{R}$ be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set $\Omega$. Then the Bregman divergence is defined as
\[
\Delta_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle, \qquad \forall\, x, y \in \Omega. \tag{1}
\]
That is, the difference between the value of $\psi$ at $x$ and the first-order Taylor expansion of $\psi$ around $y$, evaluated at the point $x$.
Examples
• Euclidean distance. Let $\psi(x) = \frac{1}{2}\|x\|^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x - y\|^2$.
• Relative entropy. Let $\psi(x) = \sum_i x_i \log x_i$ (negative entropy). Then $\Delta_\psi(x, y) = \sum_i x_i \log\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i$, which for probability distributions $x$ and $y$ reduces to $\sum_i x_i \log\frac{x_i}{y_i}$. This is called relative entropy, or Kullback–Leibler divergence, between probability distributions $x$ and $y$.
• $\ell_p$ norm. Let $p \ge 1$ and $\frac{1}{p} + \frac{1}{q} = 1$, and let $\psi(x) = \frac{1}{2}\|x\|_q^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x\|_q^2 + \frac{1}{2}\|y\|_q^2 - \left\langle x, \nabla \tfrac{1}{2}\|y\|_q^2 \right\rangle$. Note $\frac{1}{2}\|y\|_q^2$ is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.
• Nonnegativity: $\Delta_\psi(x, y) \ge 0$ for all $x, y$, and $\Delta_\psi(x, y) = 0$ if and only if $x = y$. This is immediate from strict convexity.
• Asymmetry: in general, $\Delta_\psi(x, y) \ne \Delta_\psi(y, x)$; e.g., the KL divergence. Symmetrization is not always useful.
• Non-convexity in the second argument. Let $\Omega = [1, \infty)$ and $\psi(x) = -\log x$. Then $\Delta_\psi(x, y) = -\log x + \log y + \frac{x - y}{y}$. One can check that its second-order derivative in $y$ is $\frac{1}{y^2}\left(\frac{2x}{y} - 1\right)$, which is negative when $2x < y$.
• Linearity in $\psi$. For any $a > 0$, $\Delta_{\psi + a\varphi}(x, y) = \Delta_\psi(x, y) + a\,\Delta_\varphi(x, y)$.
• Gradient in $x$: $\frac{\partial}{\partial x}\Delta_\psi(x, y) = \nabla\psi(x) - \nabla\psi(y)$. The gradient in $y$ is trickier, and not commonly used.
• Duality. Let $\psi^*(y) = \sup_{x \in \Omega}\{\langle x, y\rangle - \psi(x)\}$ denote the convex conjugate of $\psi$. The sup must be attainable because $\psi$ is strongly convex and $\Omega$ is closed, and $x$ is a maximizer if and only if $y = \nabla\psi(x)$. So
\[
\psi^*(y) + \psi(x) = \langle x, y\rangle \;\Leftrightarrow\; y = \nabla\psi(x). \tag{8}
\]
Since $\psi^{**} = \psi$, we also have $\psi^{**}(x) + \psi^*(y) = \langle x, y\rangle$, which means $y$ is the maximizer in
\[
\psi^{**}(x) = \sup_z\{\langle x, z\rangle - \psi^*(z)\}. \tag{9}
\]
Hence $x = \nabla\psi^*(y)$, i.e., $\nabla\psi^* = (\nabla\psi)^{-1}$.
• Mean of distribution. Suppose $U$ is a random variable over an open set $S$ with distribution $\mu$. Then
\[
\min_{x \in S}\; \mathbb{E}_{U \sim \mu}[\Delta_\psi(U, x)] \tag{10}
\]
is optimized at $\bar{u} := \mathbb{E}_\mu[U] = \int_{u \in S} u \, d\mu(u)$.
Proof: For any $x \in S$, we have
\begin{align}
&\mathbb{E}_{U\sim\mu}[\Delta_\psi(U, x)] - \mathbb{E}_{U\sim\mu}[\Delta_\psi(U, \bar{u})] \tag{11}\\
&= \mathbb{E}_\mu\left[\psi(U) - \psi(x) - (U - x)^\top\nabla\psi(x) - \psi(U) + \psi(\bar{u}) + (U - \bar{u})^\top\nabla\psi(\bar{u})\right] \tag{12}\\
&= \psi(\bar{u}) - \psi(x) + x^\top\nabla\psi(x) - \bar{u}^\top\nabla\psi(\bar{u}) + \mathbb{E}_\mu\left[-U^\top\nabla\psi(x) + U^\top\nabla\psi(\bar{u})\right] \tag{13}\\
&= \psi(\bar{u}) - \psi(x) - (\bar{u} - x)^\top\nabla\psi(x) \tag{14}\\
&= \Delta_\psi(\bar{u}, x). \tag{15}
\end{align}
This must be nonnegative, and is 0 if and only if x = ū.
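The following small numeric sketch (not from the original notes; all function names and the sampling distribution are ad hoc) illustrates the divergences from the examples above and checks the mean-of-distribution property empirically for two choices of $\psi$.

```python
# A minimal sketch; assumes all points have strictly positive coordinates so
# that the negative-entropy psi is well defined.
import numpy as np

def bregman_euclidean(x, y):
    # psi(x) = 0.5 * ||x||^2  =>  Delta_psi(x, y) = 0.5 * ||x - y||^2
    return 0.5 * np.sum((x - y) ** 2)

def bregman_negentropy(x, y):
    # psi(x) = sum_i x_i log x_i  =>  Delta_psi(x, y) = sum_i x_i log(x_i/y_i) - x_i + y_i
    return np.sum(x * np.log(x / y) - x + y)

rng = np.random.default_rng(0)
U = rng.uniform(0.5, 2.0, size=(20000, 3))    # samples of the random variable U
u_bar = U.mean(axis=0)                        # empirical mean, stands in for E[U]

# E[Delta_psi(U, x)] should be minimized over x at x = u_bar, for any valid psi.
for div in (bregman_euclidean, bregman_negentropy):
    at_mean = np.mean([div(u, u_bar) for u in U])
    at_off = np.mean([div(u, 1.1 * u_bar) for u in U])
    assert at_mean <= at_off
print("the mean minimizes the expected Bregman divergence for both choices of psi")
```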
• Generalized Pythagorean theorem. Let $C \subseteq \Omega$ be a closed convex set and let
\[
x^* = \operatorname*{argmin}_{x \in C}\; \Delta_\psi(x, x_0) \tag{16}
\]
be the Bregman projection of $x_0$ onto $C$. Then for all $y \in C$,
\[
\Delta_\psi(y, x_0) \ge \Delta_\psi(y, x^*) + \Delta_\psi(x^*, x_0). \tag{17}
\]
In the Euclidean case, this means the angle $\angle\, y x^* x_0$ is obtuse. More generally:
Lemma 2 Suppose $L$ is a proper convex function whose domain is an open set containing $C$ ($L$ is not necessarily differentiable). Let
\[
x^* = \operatorname*{argmin}_{x \in C}\; \{L(x) + \Delta_\psi(x, x_0)\}. \tag{18}
\]
Then for all $y \in C$,
\[
L(y) + \Delta_\psi(y, x_0) \ge L(x^*) + \Delta_\psi(x^*, x_0) + \Delta_\psi(y, x^*). \tag{19}
\]
The projection in (16) is just a special case of L = 0. This property is the key to the analysis of many
optimization algorithms using Bregman divergence.
Proof: Denote $J(x) = L(x) + \Delta_\psi(x, x_0)$. Since $x^*$ minimizes $J$ over $C$, there must exist a subgradient $d \in \partial J(x^*)$ such that
\[
\langle d, x - x^*\rangle \ge 0, \qquad \forall\, x \in C. \tag{20}
\]
Now $\partial J(x^*) = \{g + \nabla_{x = x^*}\Delta_\psi(x, x_0) : g \in \partial L(x^*)\} = \{g + \nabla\psi(x^*) - \nabla\psi(x_0) : g \in \partial L(x^*)\}$, so there must be a subgradient $g \in \partial L(x^*)$ such that
\[
\langle g + \nabla\psi(x^*) - \nabla\psi(x_0), x - x^*\rangle \ge 0, \qquad \forall\, x \in C. \tag{21}
\]
Therefore, using the property of subgradients, we have for all $y \in C$ that
\begin{align}
L(y) &\ge L(x^*) + \langle g, y - x^*\rangle \tag{22}\\
&\ge L(x^*) + \langle \nabla\psi(x_0) - \nabla\psi(x^*), y - x^*\rangle \tag{23}\\
&= L(x^*) - \langle\nabla\psi(x_0), x^* - x_0\rangle + \psi(x^*) - \psi(x_0) \tag{24}\\
&\quad + \langle\nabla\psi(x_0), y - x_0\rangle - \psi(y) + \psi(x_0) \tag{25}\\
&\quad - \langle\nabla\psi(x^*), y - x^*\rangle + \psi(y) - \psi(x^*) \tag{26}\\
&= L(x^*) + \Delta_\psi(x^*, x_0) - \Delta_\psi(y, x_0) + \Delta_\psi(y, x^*). \tag{27}
\end{align}
Rearranging completes the proof.
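As a sanity check of the $L = 0$ special case of Lemma 2, i.e. inequality (17), here is a small sketch for the Euclidean $\psi$, where the Bregman projection onto a convex set is the ordinary Euclidean projection. This is only an illustration; the box constraint and all names are assumptions of the example.

```python
# Check (17) numerically for psi(x) = 0.5 * ||x||^2 and C = [0, 1]^n (a box),
# where the Bregman projection coincides with the Euclidean projection np.clip.
import numpy as np

def d_euc(x, y):
    return 0.5 * np.sum((x - y) ** 2)

rng = np.random.default_rng(1)
x0 = 3.0 * rng.normal(size=5)          # an arbitrary point, possibly outside C
x_star = np.clip(x0, 0.0, 1.0)         # projection of x0 onto C
for _ in range(1000):
    y = rng.uniform(0.0, 1.0, size=5)  # an arbitrary point inside C
    assert d_euc(y, x0) >= d_euc(y, x_star) + d_euc(x_star, x0) - 1e-12
print("Delta(y, x0) >= Delta(y, x*) + Delta(x*, x0) held on all sampled y in C")
```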
2 Mirror Descent
Why bother? Because the rate of convergence of subgradient descent often depends on the dimension of the
problem.
Suppose we want to minimize a function $f$ over a set $C$. Recall the subgradient descent rule
\[
x_{k+\frac{1}{2}} = x_k - \alpha_k g_k, \quad \text{where } g_k \in \partial f(x_k), \tag{28}
\]
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\; \frac{1}{2}\left\|x - x_{k+\frac{1}{2}}\right\|^2 = \operatorname*{argmin}_{x \in C}\; \frac{1}{2}\|x - (x_k - \alpha_k g_k)\|^2. \tag{29}
\]
This can be interpreted as follows. First approximate $f$ around $x_k$ by a first-order Taylor expansion
\[
f(x) \approx f(x_k) + \langle g_k, x - x_k\rangle. \tag{30}
\]
Then penalize the displacement by $\frac{1}{2\alpha_k}\|x - x_k\|^2$. So the update rule is to find a regularized minimizer of the model:
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\; \left\{ f(x_k) + \langle g_k, x - x_k\rangle + \frac{1}{2\alpha_k}\|x - x_k\|^2 \right\}. \tag{31}
\]
It is trivial to see that this is exactly equivalent to (29). To generalize the method beyond the Euclidean distance, it is straightforward to use the Bregman divergence as a measure of displacement:
\begin{align}
x_{k+1} &= \operatorname*{argmin}_{x \in C}\; \left\{ f(x_k) + \langle g_k, x - x_k\rangle + \frac{1}{\alpha_k}\Delta_\psi(x, x_k) \right\} \tag{32}\\
&= \operatorname*{argmin}_{x \in C}\; \{\alpha_k f(x_k) + \alpha_k\langle g_k, x - x_k\rangle + \Delta_\psi(x, x_k)\}. \tag{33}
\end{align}
Mirror descent interpretation. Suppose the constraint set $C$ is the whole space (i.e., no constraint). Then we can take the gradient with respect to $x$ and find the optimality condition
\begin{align}
& g_k + \frac{1}{\alpha_k}\left(\nabla\psi(x_{k+1}) - \nabla\psi(x_k)\right) = 0 \tag{34}\\
\Leftrightarrow\;& \nabla\psi(x_{k+1}) = \nabla\psi(x_k) - \alpha_k g_k \tag{35}\\
\Leftrightarrow\;& x_{k+1} = (\nabla\psi)^{-1}\left(\nabla\psi(x_k) - \alpha_k g_k\right) = \nabla\psi^*\left(\nabla\psi(x_k) - \alpha_k g_k\right). \tag{36}
\end{align}
For example, with the KL divergence over the simplex, the update rule becomes
\[
x_{k+1}(i) = x_k(i)\exp(-\alpha_k g_k(i)). \tag{37}
\]
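A minimal sketch of this update (an illustration, not code from the notes): with the negative-entropy $\psi$ on the probability simplex, the unconstrained update is exactly (37), and re-imposing the simplex constraint via the KL projection amounts to renormalizing, which gives the classical exponentiated-gradient update.

```python
import numpy as np

def entropic_mirror_descent(grad, x1, steps, alpha):
    """Mirror descent with psi(x) = sum_i x_i log x_i on the simplex."""
    x = x1.copy()
    for _ in range(steps):
        g = grad(x)
        x = x * np.exp(-alpha * g)   # multiplicative update, cf. (37)
        x = x / x.sum()              # KL projection back onto the simplex
    return x

# Example (assumed for illustration): minimize f(x) = <c, x> over the simplex;
# the minimizer puts all mass on the smallest coordinate of c.
c = np.array([0.9, 0.3, 0.7])
x = entropic_mirror_descent(lambda x: c, np.full(3, 1.0 / 3), steps=500, alpha=0.1)
print(x)   # concentrates on coordinate 1
```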
Rate of convergence. Recall in unconstrained subgradient descent we followed 4 steps.
1. Bound on a single update:
\begin{align}
\|x_{k+1} - x^*\|_2^2 &= \|x_k - \alpha_k g_k - x^*\|_2^2 \tag{38}\\
&= \|x_k - x^*\|_2^2 - 2\alpha_k\langle g_k, x_k - x^*\rangle + \alpha_k^2\|g_k\|_2^2 \tag{39}\\
&\le \|x_k - x^*\|_2^2 - 2\alpha_k\left(f(x_k) - f(x^*)\right) + \alpha_k^2\|g_k\|_2^2. \tag{40}
\end{align}
2. Telescope:
\[
\|x_{T+1} - x^*\|_2^2 \le \|x_1 - x^*\|_2^2 - 2\sum_{k=1}^{T}\alpha_k\left(f(x_k) - f(x^*)\right) + \sum_{k=1}^{T}\alpha_k^2\|g_k\|_2^2. \tag{41}
\]
3. Bounding by $\|x_1 - x^*\|_2^2 \le R^2$ and $\|g_k\|_2^2 \le G^2$:
\[
2\sum_{k=1}^{T}\alpha_k\left(f(x_k) - f(x^*)\right) \le R^2 + G^2\sum_{k=1}^{T}\alpha_k^2. \tag{42}
\]
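The fourth step, choosing the step size, is not reproduced above; a standard completion (assumed here, consistent with (42)) is to balance the two terms with a constant step size $\alpha_k = \frac{R}{G\sqrt{T}}$, which gives
\[
2\,\frac{R}{G\sqrt{T}}\sum_{k=1}^{T}\left(f(x_k) - f(x^*)\right) \le R^2 + G^2\cdot T\cdot\frac{R^2}{G^2 T} = 2R^2
\quad\Rightarrow\quad
\min_{k\le T} f(x_k) - f(x^*) \le \frac{1}{T}\sum_{k=1}^{T}\left(f(x_k) - f(x^*)\right) \le \frac{RG}{\sqrt{T}}.
\]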
Acceleration 1: $f$ is strongly convex. We say $f$ is strongly convex with respect to another convex function $\psi$ with modulus $\lambda$ if
\[
f(x) \ge f(y) + \langle g, x - y\rangle + \lambda\,\Delta_\psi(x, y), \qquad \forall\, g \in \partial f(y). \tag{51}
\]
Note we do not assume $f$ is differentiable. Now in the step from (47) to (48), we can plug in the definition of strong convexity:
\begin{align}
\Delta_\psi(x^*, x_{k+1}) &= \ldots + \alpha_k\langle g_k, x^* - x_k\rangle + \ldots \qquad \text{(copy of (47))} \tag{52}\\
&\le \ldots - \alpha_k\left(f(x_k) - f(x^*) + \lambda\,\Delta_\psi(x^*, x_k)\right) + \ldots \tag{53}\\
&\le \ldots \tag{54}\\
&\le (1 - \lambda\alpha_k)\,\Delta_\psi(x^*, x_k) - \alpha_k\left(f(x_k) - f(x^*)\right) + \frac{\alpha_k^2}{2\sigma}\|g_k\|_*^2. \tag{55}
\end{align}
Denote $\delta_k = \Delta_\psi(x^*, x_k)$ and $\epsilon_k = f(x_k) - f(x^*)$. Set $\alpha_k = \frac{1}{\lambda k}$. Then
\[
\delta_{k+1} \le \frac{k-1}{k}\,\delta_k - \frac{1}{\lambda k}\,\epsilon_k + \frac{G^2}{2\sigma\lambda^2 k^2}
\quad\Rightarrow\quad
k\,\delta_{k+1} \le (k-1)\,\delta_k - \frac{\epsilon_k}{\lambda} + \frac{G^2}{2\sigma\lambda^2 k}. \tag{56}
\]
Now telescope (sum up both sides from $k = 1$ to $T$):
\[
T\,\delta_{T+1} \le -\frac{1}{\lambda}\sum_{k=1}^{T}\epsilon_k + \frac{G^2}{2\sigma\lambda^2}\sum_{k=1}^{T}\frac{1}{k}
\quad\Rightarrow\quad
\min_{i\in\{1,\ldots,T\}}\epsilon_i \le \frac{G^2}{2\sigma\lambda}\cdot\frac{1}{T}\sum_{k=1}^{T}\frac{1}{k} \le \frac{G^2}{2\sigma\lambda}\cdot\frac{O(\log T)}{T}. \tag{57}
\]
Acceleration 2: $f$ has Lipschitz continuous gradient. If the gradient of $f$ is Lipschitz continuous, there exists $L > 0$ such that
\[
\|\nabla f(x) - \nabla f(y)\|_* \le L\,\|x - y\|, \qquad \forall\, x, y. \tag{58}
\]
Sometimes we just directly say $f$ is smooth. It is also known that this is equivalent to
\[
f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|^2. \tag{59}
\]
We bound the $\langle g_k, x^* - x_{k+1}\rangle$ term in (46) as follows:
\begin{align}
\langle g_k, x^* - x_{k+1}\rangle &= \langle g_k, x^* - x_k\rangle + \langle g_k, x_k - x_{k+1}\rangle \tag{60}\\
&\le f(x^*) - f(x_k) + f(x_k) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2 \tag{61}\\
&= f(x^*) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2. \tag{62}
\end{align}
Plugging into (46), we get
\[
\Delta_\psi(x^*, x_{k+1}) \le \Delta_\psi(x^*, x_k) + \alpha_k\left(f(x^*) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2\right) - \frac{\sigma}{2}\|x_k - x_{k+1}\|^2. \tag{63}
\]
Setting $\alpha_k = \frac{\sigma}{L}$, we get
\[
\Delta_\psi(x^*, x_{k+1}) \le \Delta_\psi(x^*, x_k) - \frac{\sigma}{L}\left(f(x_{k+1}) - f(x^*)\right). \tag{64}
\]
Telescoping, we get
\[
\min_{k\in\{2,\ldots,T+1\}} f(x_k) - f(x^*) \le \frac{L\,\Delta_\psi(x^*, x_1)}{\sigma T} \le \frac{LR^2}{\sigma T}. \tag{65}
\]
This gives an $O(\frac{1}{T})$ convergence rate. But if we are smarter, like Nesterov, the rate can be improved to $O(\frac{1}{T^2})$. We will not go into the details, but the algorithm and proof are again based on Lemma 2. This is often called the accelerated proximal gradient method.
2.1 Composite Objective
Suppose the objective function is $h(x) = f(x) + r(x)$, where $f$ is smooth and $r(x)$ is simple, like $\|x\|_1$. If we directly apply the above rates for optimizing $h$, we get an $O(\frac{1}{\sqrt{T}})$ rate of convergence because $h$ is not smooth. It would be nice if we could enjoy the $O(\frac{1}{T})$ rate as in smooth optimization. Fortunately this is possible thanks to the simplicity of $r(x)$, and we only need to extend the proximal operator (33) as follows:
\begin{align}
x_{k+1} &= \operatorname*{argmin}_{x\in C}\; \left\{ f(x_k) + \langle g_k, x - x_k\rangle + r(x) + \frac{1}{\alpha_k}\Delta_\psi(x, x_k) \right\} \tag{66}\\
&= \operatorname*{argmin}_{x\in C}\; \{\alpha_k f(x_k) + \alpha_k\langle g_k, x - x_k\rangle + \alpha_k r(x) + \Delta_\psi(x, x_k)\}. \tag{67}
\end{align}
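As an illustration of (66) (a sketch under assumptions, not the notes' own code): with the Euclidean $\psi$ and $r(x) = \lambda\|x\|_1$, the argmin in (66) has a closed form given by soft-thresholding, and the update reduces to the classical proximal gradient (ISTA) iteration. The lasso instance below is made up for demonstration.

```python
import numpy as np

def soft_threshold(v, tau):
    # closed-form solution of argmin_x 0.5 * ||x - v||^2 + tau * ||x||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_step(x, grad_f, alpha, lam):
    # one instance of (66): linearize f, keep r(x) = lam * ||x||_1 exact,
    # and use the Euclidean Bregman term 1/(2 alpha) * ||x - x_k||^2
    return soft_threshold(x - alpha * grad_f(x), alpha * lam)

# Example problem: f(x) = 0.5 * ||A x - b||^2, r(x) = lam * ||x||_1
rng = np.random.default_rng(2)
A, b, lam = rng.normal(size=(20, 5)), rng.normal(size=20), 0.5
grad_f = lambda x: A.T @ (A @ x - b)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L for this smooth f
x = np.zeros(5)
for _ in range(300):
    x = composite_step(x, grad_f, alpha, lam)
print(x)                                     # typically sparse due to the l1 term
```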
Algorithm 1: Protocol of online learning
1 The player initializes a model $x_1$.
2 for $k = 1, 2, \ldots$ do
3   The player proposes a model $x_k$.
4   The rival picks a function $f_k$.
5   The player suffers a loss $f_k(x_k)$.
6   The player gets access to $f_k$ and uses it to update its model to $x_{k+1}$.
Here we use a first-order Taylor approximation of $f$ around $x_k$, but keep $r(x)$ exact. Assuming this proximal operator can be computed efficiently, we can show that all the above rates carry over. We here only show the case of general $f$ (not necessarily smooth or strongly convex), and leave the other two cases as an exercise. In fact, we can again achieve an $O(\frac{1}{T^2})$ rate when $f$ has Lipschitz continuous gradient.
Consider $\alpha_k\left(f(x_k) + \langle g_k, x - x_k\rangle + r(x)\right)$ as the $L$ in Lemma 2. Then
\begin{align}
&\alpha_k\left(f(x_k) + \langle g_k, x^* - x_k\rangle + r(x^*)\right) + \Delta_\psi(x^*, x_k) \tag{68}\\
&\ge \alpha_k\left(f(x_k) + \langle g_k, x_{k+1} - x_k\rangle + r(x_{k+1})\right) + \Delta_\psi(x_{k+1}, x_k) + \Delta_\psi(x^*, x_{k+1}). \tag{69}
\end{align}
Following exactly the derivations from (46) to (50), we obtain
\begin{align}
\Delta_\psi(x^*, x_{k+1}) &\le \Delta_\psi(x^*, x_k) + \alpha_k\langle g_k, x^* - x_{k+1}\rangle + \alpha_k\left(r(x^*) - r(x_{k+1})\right) - \Delta_\psi(x_{k+1}, x_k) \tag{70}\\
&\le \ldots \tag{71}\\
&\le \Delta_\psi(x^*, x_k) - \alpha_k\left(f(x_k) + r(x_{k+1}) - f(x^*) - r(x^*)\right) + \frac{\alpha_k^2}{2\sigma}\|g_k\|_*^2. \tag{72}
\end{align}
This is almost the same as (50), except that we want to have $r(x_k)$ here, not $r(x_{k+1})$. Fortunately this is not a problem as long as we use a slightly different way of telescoping. Denote $\delta_k = \Delta_\psi(x^*, x_k)$; then
\[
f(x_k) + r(x_{k+1}) - f(x^*) - r(x^*) \le \frac{1}{\alpha_k}\left(\delta_k - \delta_{k+1}\right) + \frac{\alpha_k}{2\sigma}\|g_k\|_*^2. \tag{73}
\]
Summing up from $k = 1$ to $T$, we obtain
\begin{align}
r(x_{T+1}) - r(x_1) + \sum_{k=1}^{T}\left(h(x_k) - h(x^*)\right)
&\le \frac{\delta_1}{\alpha_1} + \sum_{k=2}^{T}\delta_k\left(\frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}}\right) - \frac{\delta_{T+1}}{\alpha_T} + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\alpha_k \tag{74}\\
&\le R^2\left(\frac{1}{\alpha_1} + \sum_{k=2}^{T}\left(\frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}}\right)\right) + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\alpha_k \tag{75}\\
&= \frac{R^2}{\alpha_T} + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\alpha_k. \tag{76}
\end{align}
Suppose we choose $x_1 = \operatorname*{argmin}_x r(x)$, which ensures $r(x_{T+1}) - r(x_1) \ge 0$. Setting $\alpha_k = \frac{R}{G}\sqrt{\frac{\sigma}{k}}$, we get
\[
\sum_{k=1}^{T}\left(h(x_k) - h(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\left(\sqrt{T} + \frac{1}{2}\sum_{k=1}^{T}\frac{1}{\sqrt{k}}\right) = \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{77}
\]
Then it is easy to derive the regret bound. Using $f_k$ in step (50), we have
\[
f_k(x_k) - f_k(x^*) \le \frac{1}{\alpha_k}\left(\Delta_\psi(x^*, x_k) - \Delta_\psi(x^*, x_{k+1})\right) + \frac{\alpha_k}{2\sigma}\|g_k\|_*^2. \tag{80}
\]
Summing up from $k = 1$ to $T$ and using the same process as in (74) to (77), we get
\[
\sum_{k=1}^{T}\left(f_k(x_k) - f_k(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{81}
\]
So the regret grows in the order of $O(\sqrt{T})$.
$f$ is strongly convex. Use (55) exactly, with $f_k$ in place of $f$, and we can derive the $O(\log T)$ regret bound immediately.
$f$ has Lipschitz continuous gradient. The result in (64) can NOT be extended to the online setting, because if we replace $f$ by $f_k$ we will get $f_k(x_{k+1}) - f_k(x^*)$ on the right-hand side, and telescoping will not give a regret bound. In fact, it is known that in the online setting, having a Lipschitz continuous gradient by itself cannot reduce the regret bound from $O(\sqrt{T})$ (as for nonsmooth objectives) to $O(\log T)$.
Composite objective. In the online setting, both the player and the rival know $r(x)$, and the rival changes $f_k(x)$ at each iteration. The loss incurred at each iteration is $h_k(x_k) = f_k(x_k) + r(x_k)$. The update rule is
\[
x_{k+1} = \operatorname*{argmin}_{x\in C}\; \left\{ f_k(x_k) + \langle g_k, x - x_k\rangle + r(x) + \frac{1}{\alpha_k}\Delta_\psi(x, x_k) \right\}, \quad \text{where } g_k \in \partial f_k(x_k). \tag{82}
\]
Note that in this setting, (73) becomes
\[
f_k(x_k) + r(x_{k+1}) - f_k(x^*) - r(x^*) \le \frac{1}{\alpha_k}\left(\delta_k - \delta_{k+1}\right) + \frac{\alpha_k}{2\sigma}\|g_k\|_*^2. \tag{83}
\]
Although we have $r(x_{k+1})$ here rather than $r(x_k)$, this is fine because $r$ does not change through the iterations. Choosing $x_1 = \operatorname*{argmin}_x r(x)$ and telescoping in the same way as from (74) to (77), we immediately obtain
\[
\sum_{k=1}^{T}\left(h_k(x_k) - h_k(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{84}
\]
So the regret grows at $O(\sqrt{T})$.
When the $f_k$ are strongly convex, we can get $O(\log T)$ regret for the composite case. But as expected, Lipschitz continuity of $\nabla f_k$ alone cannot reduce the regret from $O(\sqrt{T})$ to $O(\log T)$.
2.3 Stochastic optimization
Let us consider optimizing a function which takes the form of an expectation:
\[
\min_x\; F(x) := \mathbb{E}_{\omega\sim p}[f(x; \omega)], \tag{85}
\]
where $p$ is a distribution over $\omega$. This subsumes a lot of machine learning models. For example, the SVM objective is
\[
F(x) = \frac{1}{m}\sum_{i=1}^{m}\max\{0, 1 - c_i\langle a_i, x\rangle\} + \frac{\lambda}{2}\|x\|^2. \tag{86}
\]
It can be interpreted as (85) where $\omega$ is uniformly distributed in $\{1, 2, \ldots, m\}$ (i.e., $p(\omega = i) = \frac{1}{m}$), and
\[
f(x; i) = \max\{0, 1 - c_i\langle a_i, x\rangle\} + \frac{\lambda}{2}\|x\|^2. \tag{87}
\]
When m is large, it can be costly to calculate F and its subgradient. So a simple idea is to base the
updates on a single randomly chosen data point. It can be considered as a special case of online learning in
Algorithm 1, where the rival in step 4 now randomly picks fk as f (x; ωk ) with ωk being drawn independently
from p. Ideally we hope that by using the mirror descent updates, xk will gradually approach the minimizer
Algorithm 2: Protocol of stochastic optimization as online learning
1 The player initializes a model $x_1$.
2 for $k = 1, 2, \ldots$ do
3   The player proposes a model $x_k$.
4   The rival randomly draws an $\omega_k$ from $p$, which defines a function $f_k(x) := f(x; \omega_k)$.
5   The player suffers a loss $f_k(x_k)$.
6   The player gets access to $f_k$ and uses it to update its model to $x_{k+1}$ by, e.g., mirror descent (79).
of $F(x)$. Intuitively this is quite reasonable: by using $f_k$ we can compute an unbiased estimate of $F(x_k)$ and of a subgradient of $F(x_k)$ (because the $\omega_k$ are sampled i.i.d. from $p$). This is a particular case of stochastic optimization, and we recap it in Algorithm 2.
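For concreteness, here is a minimal sketch (an illustration with made-up data, not the notes' code) of Algorithm 2 on the SVM objective (86): each round samples one index $i$, and the player takes a step along a subgradient of $f(x; i)$ from (87), using the step size $\alpha_k = \frac{1}{\lambda k}$ suggested by the strongly convex case.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, lam = 200, 5, 0.1
w_true = rng.normal(size=d)
a = rng.normal(size=(m, d))
c = np.sign(a @ w_true)                        # labels in {-1, +1}

def subgrad(x, i):
    # a subgradient of f(x; i) = max{0, 1 - c_i <a_i, x>} + (lam/2) ||x||^2
    g = lam * x
    if c[i] * (a[i] @ x) < 1.0:
        g = g - c[i] * a[i]
    return g

x = np.zeros(d)
for k in range(1, 5001):
    i = rng.integers(m)                        # the "rival" draws omega_k ~ p
    x = x - (1.0 / (lam * k)) * subgrad(x, i)  # stochastic subgradient step
print(x)
```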
In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays $\omega_k$ at iteration $k$. Then an online learning algorithm $A$ is simply a deterministic mapping from an ordered set $\{\omega_1, \ldots, \omega_k\}$ to $x_{k+1}$. Denote by $A(\emptyset)$, i.e., $A$ applied to the empty sequence, the initial model $x_1$. Then the following theorem is the key for online to batch conversion.
Theorem 3 Suppose an online learning algorithm $A$ has regret bound $R_k$ after running Algorithm 1 for $k$ iterations. Suppose $\omega_1, \ldots, \omega_{T+1}$ are drawn i.i.d. from $p$. Define $\hat{x} = A(\omega_{j+1}, \ldots, \omega_T)$, where $j$ is drawn uniformly at random from $\{0, \ldots, T\}$. Then
\[
\mathbb{E}[F(\hat{x})] - \min_x F(x) \le \frac{R_{T+1}}{T+1}, \tag{88}
\]
where the expectation is with respect to the randomness of $\omega_1, \ldots, \omega_T$, and $j$.
Similarly, we can have high probability bounds, which can be stated in a form like (not exactly true)
\[
F(\hat{x}) - \min_x F(x) \le \frac{R_{T+1}}{T+1}\log\frac{1}{\delta} \tag{89}
\]
with probability $1 - \delta$, where the probability is with respect to the randomness of $\omega_1, \ldots, \omega_T$, and $j$.
Proof of Theorem 3.
\begin{align}
\mathbb{E}[F(\hat{x})] &= \mathbb{E}_{j,\,\omega_1,\ldots,\omega_{T+1}}[f(\hat{x}; \omega_{T+1})] = \mathbb{E}_{j,\,\omega_1,\ldots,\omega_{T+1}}[f(A(\omega_{j+1}, \ldots, \omega_T); \omega_{T+1})] \tag{90}\\
&= \mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\frac{1}{T+1}\sum_{j=0}^{T} f(A(\omega_{j+1}, \ldots, \omega_T); \omega_{T+1})\right] \quad\text{(as $j$ is drawn uniformly at random)} \tag{91}\\
&= \frac{1}{T+1}\,\mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\sum_{j=0}^{T} f(A(\omega_1, \ldots, \omega_{T-j}); \omega_{T+1-j})\right] \quad\text{(shift iteration index, by i.i.d.-ness of the $\omega_i$)} \tag{92}\\
&= \frac{1}{T+1}\,\mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\sum_{s=1}^{T+1} f(A(\omega_1, \ldots, \omega_{s-1}); \omega_s)\right] \quad\text{(change of variable $s = T - j + 1$)} \tag{93}\\
&\le \frac{1}{T+1}\,\mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\min_x \sum_{s=1}^{T+1} f(x; \omega_s) + R_{T+1}\right] \quad\text{(apply the regret bound)} \tag{94}\\
&\le \min_x\,\mathbb{E}_\omega[f(x; \omega)] + \frac{R_{T+1}}{T+1} \quad\text{(expectation of min is smaller than min of expectation)} \tag{95}\\
&= \min_x F(x) + \frac{R_{T+1}}{T+1}. \tag{96}
\end{align}
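The construction in Theorem 3 can be sketched in a few lines (an illustration only; the one-dimensional quadratic loss and the step size are assumptions of the example): run the deterministic online algorithm $A$ on the random suffix $\omega_{j+1}, \ldots, \omega_T$ for a uniformly drawn $j$ and return its output as $\hat{x}$.

```python
import numpy as np

def A(omegas, alpha=0.1):
    # a deterministic online algorithm: online gradient descent on
    # f(x; omega) = 0.5 * (x - omega)^2, starting from the initial model x_1 = 0
    x = 0.0
    for w in omegas:
        x = x - alpha * (x - w)
    return x

rng = np.random.default_rng(4)
T = 1000
omegas = rng.normal(loc=2.0, scale=1.0, size=T)   # omega_1, ..., omega_T iid from p
j = rng.integers(0, T + 1)                        # j uniform in {0, ..., T}
x_hat = A(omegas[j:])                             # x_hat = A(omega_{j+1}, ..., omega_T)
print(x_hat)   # in expectation, F(x_hat) is close to min_x F(x), attained at x = 2.0
```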