Bregman Divergence and Mirror Descent
1 Bregman Divergence
Motivation
• Generalize squared Euclidean distance to a class of distances that all share similar properties
Definition 1 (Bregman divergence) Let $\psi : \Omega \to \mathbb{R}$ be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set $\Omega$. Then the Bregman divergence is defined as
\[
\Delta_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle, \qquad \forall\, x, y \in \Omega. \tag{1}
\]
That is, the difference between the value of $\psi$ at $x$ and the first-order Taylor expansion of $\psi$ around $y$, evaluated at the point $x$.
Examples
• Euclidean distance. Let $\psi(x) = \frac{1}{2}\|x\|^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x - y\|^2$.
• Relative entropy. Let $\psi(x) = \sum_i x_i \log x_i$ (negative entropy). Then $\Delta_\psi(x, y) = \sum_i x_i \log\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i$, which for probability distributions $x$ and $y$ reduces to $\sum_i x_i \log\frac{x_i}{y_i}$. This is called relative entropy, or Kullback–Leibler divergence, between probability distributions $x$ and $y$.
• $\ell_p$ norm. Let $p \ge 1$ and $\frac{1}{p} + \frac{1}{q} = 1$, and let $\psi(x) = \frac{1}{2}\|x\|_q^2$. Then $\Delta_\psi(x, y) = \frac{1}{2}\|x\|_q^2 + \frac{1}{2}\|y\|_q^2 - \left\langle x, \nabla \tfrac{1}{2}\|y\|_q^2 \right\rangle$. Note $\frac{1}{2}\|y\|_q^2$ is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.
• Nonnegativity: $\Delta_\psi(x, y) \ge 0$ for all $x, y$, and $\Delta_\psi(x, y) = 0$ if and only if $x = y$. This is immediate from strict convexity.
• Asymmetry: in general, $\Delta_\psi(x, y) \ne \Delta_\psi(y, x)$; e.g., the KL divergence. Symmetrization is not always useful.
• Non-convexity in the second argument. Let $\Omega = [1, \infty)$ and $\psi(x) = -\log x$. Then $\Delta_\psi(x, y) = -\log x + \log y + \frac{x - y}{y}$. One can check that its second-order derivative in $y$ is $\frac{1}{y^2}\left(\frac{2x}{y} - 1\right)$, which is negative when $2x < y$.
• Linearity in $\psi$. For any $a > 0$, $\Delta_{\psi + a\varphi}(x, y) = \Delta_\psi(x, y) + a\,\Delta_\varphi(x, y)$.
• Gradient in $x$: $\frac{\partial}{\partial x}\Delta_\psi(x, y) = \nabla\psi(x) - \nabla\psi(y)$. The gradient in $y$ is trickier, and not commonly used.
• Duality. Let $\psi^*(y) = \sup_{x \in \Omega}\{\langle x, y\rangle - \psi(x)\}$ denote the convex conjugate of $\psi$. The sup must be attainable because $\psi$ is strongly convex and $\Omega$ is closed, and $x$ is a maximizer if and only if $y = \nabla\psi(x)$. So
\[
\psi^*(y) + \psi(x) = \langle x, y\rangle \;\Leftrightarrow\; y = \nabla\psi(x). \tag{8}
\]
Since $\psi^{**} = \psi$, we also have $\psi^{**}(x) + \psi^*(y) = \langle x, y\rangle$, which means $y$ is the maximizer in
\[
\psi^{**}(x) = \sup_z\{\langle x, z\rangle - \psi^*(z)\}. \tag{9}
\]
Hence $x = \nabla\psi^*(y)$, i.e., $\nabla\psi^* = (\nabla\psi)^{-1}$.
• Mean of distribution. Suppose $U$ is a random variable over an open set $S$ with distribution $\mu$. Then
\[
\min_{x \in S}\; \mathbb{E}_{U \sim \mu}[\Delta_\psi(U, x)] \tag{10}
\]
is optimized at $\bar{u} := \mathbb{E}_\mu[U] = \int_{u \in S} u \, d\mu(u)$.
Proof: For any $x \in S$, we have
\begin{align}
&\mathbb{E}_{U\sim\mu}[\Delta_\psi(U, x)] - \mathbb{E}_{U\sim\mu}[\Delta_\psi(U, \bar{u})] \tag{11}\\
&= \mathbb{E}_\mu\left[\psi(U) - \psi(x) - (U - x)^\top\nabla\psi(x) - \psi(U) + \psi(\bar{u}) + (U - \bar{u})^\top\nabla\psi(\bar{u})\right] \tag{12}\\
&= \psi(\bar{u}) - \psi(x) + x^\top\nabla\psi(x) - \bar{u}^\top\nabla\psi(\bar{u}) + \mathbb{E}_\mu\left[-U^\top\nabla\psi(x) + U^\top\nabla\psi(\bar{u})\right] \tag{13}\\
&= \psi(\bar{u}) - \psi(x) - (\bar{u} - x)^\top\nabla\psi(x) \tag{14}\\
&= \Delta_\psi(\bar{u}, x). \tag{15}
\end{align}
This must be nonnegative, and is 0 if and only if x = ū.
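The following small numeric sketch (not from the original notes; all function names and the sampling distribution are ad hoc) illustrates the divergences from the examples above and checks the mean-of-distribution property empirically for two choices of $\psi$.

```python
# A minimal sketch; assumes all points have strictly positive coordinates so
# that the negative-entropy psi is well defined.
import numpy as np

def bregman_euclidean(x, y):
    # psi(x) = 0.5 * ||x||^2  =>  Delta_psi(x, y) = 0.5 * ||x - y||^2
    return 0.5 * np.sum((x - y) ** 2)

def bregman_negentropy(x, y):
    # psi(x) = sum_i x_i log x_i  =>  Delta_psi(x, y) = sum_i x_i log(x_i/y_i) - x_i + y_i
    return np.sum(x * np.log(x / y) - x + y)

rng = np.random.default_rng(0)
U = rng.uniform(0.5, 2.0, size=(20000, 3))    # samples of the random variable U
u_bar = U.mean(axis=0)                        # empirical mean, stands in for E[U]

# E[Delta_psi(U, x)] should be minimized over x at x = u_bar, for any valid psi.
for div in (bregman_euclidean, bregman_negentropy):
    at_mean = np.mean([div(u, u_bar) for u in U])
    at_off = np.mean([div(u, 1.1 * u_bar) for u in U])
    assert at_mean <= at_off
print("the mean minimizes the expected Bregman divergence for both choices of psi")
```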
• Generalized Pythagorean theorem. Let $C \subseteq \Omega$ be a closed convex set and let
\[
x^* = \operatorname*{argmin}_{x \in C}\; \Delta_\psi(x, x_0) \tag{16}
\]
be the Bregman projection of $x_0$ onto $C$. Then for all $y \in C$,
\[
\Delta_\psi(y, x_0) \ge \Delta_\psi(y, x^*) + \Delta_\psi(x^*, x_0). \tag{17}
\]
In the Euclidean case, this means the angle $\angle\, y x^* x_0$ is obtuse. More generally:
Lemma 2 Suppose $L$ is a proper convex function whose domain is an open set containing $C$ ($L$ is not necessarily differentiable). Let
\[
x^* = \operatorname*{argmin}_{x \in C}\; \{L(x) + \Delta_\psi(x, x_0)\}. \tag{18}
\]
Then for all $y \in C$,
\[
L(y) + \Delta_\psi(y, x_0) \ge L(x^*) + \Delta_\psi(x^*, x_0) + \Delta_\psi(y, x^*). \tag{19}
\]
The projection in (16) is just a special case of L = 0. This property is the key to the analysis of many
optimization algorithms using Bregman divergence.
Proof: Denote $J(x) = L(x) + \Delta_\psi(x, x_0)$. Since $x^*$ minimizes $J$ over $C$, there must exist a subgradient $d \in \partial J(x^*)$ such that
\[
\langle d, x - x^*\rangle \ge 0, \qquad \forall\, x \in C. \tag{20}
\]
Now $\partial J(x^*) = \{g + \nabla_{x = x^*}\Delta_\psi(x, x_0) : g \in \partial L(x^*)\} = \{g + \nabla\psi(x^*) - \nabla\psi(x_0) : g \in \partial L(x^*)\}$, so there must be a subgradient $g \in \partial L(x^*)$ such that
\[
\langle g + \nabla\psi(x^*) - \nabla\psi(x_0), x - x^*\rangle \ge 0, \qquad \forall\, x \in C. \tag{21}
\]
Therefore, using the property of subgradients, we have for all $y \in C$ that
\begin{align}
L(y) &\ge L(x^*) + \langle g, y - x^*\rangle \tag{22}\\
&\ge L(x^*) + \langle \nabla\psi(x_0) - \nabla\psi(x^*), y - x^*\rangle \tag{23}\\
&= L(x^*) - \langle\nabla\psi(x_0), x^* - x_0\rangle + \psi(x^*) - \psi(x_0) \tag{24}\\
&\quad + \langle\nabla\psi(x_0), y - x_0\rangle - \psi(y) + \psi(x_0) \tag{25}\\
&\quad - \langle\nabla\psi(x^*), y - x^*\rangle + \psi(y) - \psi(x^*) \tag{26}\\
&= L(x^*) + \Delta_\psi(x^*, x_0) - \Delta_\psi(y, x_0) + \Delta_\psi(y, x^*). \tag{27}
\end{align}
Rearranging completes the proof.
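As a sanity check of the $L = 0$ special case of Lemma 2, i.e. inequality (17), here is a small sketch for the Euclidean $\psi$, where the Bregman projection onto a convex set is the ordinary Euclidean projection. This is only an illustration; the box constraint and all names are assumptions of the example.

```python
# Check (17) numerically for psi(x) = 0.5 * ||x||^2 and C = [0, 1]^n (a box),
# where the Bregman projection coincides with the Euclidean projection np.clip.
import numpy as np

def d_euc(x, y):
    return 0.5 * np.sum((x - y) ** 2)

rng = np.random.default_rng(1)
x0 = 3.0 * rng.normal(size=5)          # an arbitrary point, possibly outside C
x_star = np.clip(x0, 0.0, 1.0)         # projection of x0 onto C
for _ in range(1000):
    y = rng.uniform(0.0, 1.0, size=5)  # an arbitrary point inside C
    assert d_euc(y, x0) >= d_euc(y, x_star) + d_euc(x_star, x0) - 1e-12
print("Delta(y, x0) >= Delta(y, x*) + Delta(x*, x0) held on all sampled y in C")
```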
2 Mirror Descent
Why bother? Because the rate of convergence of subgradient descent often depends on the dimension of the
problem.
Suppose we want to minimize a function $f$ over a set $C$. Recall the subgradient descent rule
\[
x_{k+\frac{1}{2}} = x_k - \alpha_k g_k, \quad \text{where } g_k \in \partial f(x_k), \tag{28}
\]
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\; \frac{1}{2}\left\|x - x_{k+\frac{1}{2}}\right\|^2 = \operatorname*{argmin}_{x \in C}\; \frac{1}{2}\|x - (x_k - \alpha_k g_k)\|^2. \tag{29}
\]
This can be interpreted as follows. First approximate $f$ around $x_k$ by a first-order Taylor expansion
\[
f(x) \approx f(x_k) + \langle g_k, x - x_k\rangle. \tag{30}
\]
Then penalize the displacement by $\frac{1}{2\alpha_k}\|x - x_k\|^2$. So the update rule is to find a regularized minimizer of the model:
\[
x_{k+1} = \operatorname*{argmin}_{x \in C}\; \left\{ f(x_k) + \langle g_k, x - x_k\rangle + \frac{1}{2\alpha_k}\|x - x_k\|^2 \right\}. \tag{31}
\]
It is trivial to see that this is exactly equivalent to (29). To generalize the method beyond the Euclidean distance, it is straightforward to use the Bregman divergence as a measure of displacement:
\begin{align}
x_{k+1} &= \operatorname*{argmin}_{x \in C}\; \left\{ f(x_k) + \langle g_k, x - x_k\rangle + \frac{1}{\alpha_k}\Delta_\psi(x, x_k) \right\} \tag{32}\\
&= \operatorname*{argmin}_{x \in C}\; \{\alpha_k f(x_k) + \alpha_k\langle g_k, x - x_k\rangle + \Delta_\psi(x, x_k)\}. \tag{33}
\end{align}
Mirror descent interpretation. Suppose the constraint set $C$ is the whole space (i.e., no constraint). Then we can take the gradient with respect to $x$ and find the optimality condition
\begin{align}
& g_k + \frac{1}{\alpha_k}\left(\nabla\psi(x_{k+1}) - \nabla\psi(x_k)\right) = 0 \tag{34}\\
\Leftrightarrow\;& \nabla\psi(x_{k+1}) = \nabla\psi(x_k) - \alpha_k g_k \tag{35}\\
\Leftrightarrow\;& x_{k+1} = (\nabla\psi)^{-1}\left(\nabla\psi(x_k) - \alpha_k g_k\right) = \nabla\psi^*\left(\nabla\psi(x_k) - \alpha_k g_k\right). \tag{36}
\end{align}
For example, with the KL divergence over the simplex, the update rule becomes
\[
x_{k+1}(i) = x_k(i)\exp(-\alpha_k g_k(i)). \tag{37}
\]
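A minimal sketch of this update (an illustration, not code from the notes): with the negative-entropy $\psi$ on the probability simplex, the unconstrained update is exactly (37), and re-imposing the simplex constraint via the KL projection amounts to renormalizing, which gives the classical exponentiated-gradient update.

```python
import numpy as np

def entropic_mirror_descent(grad, x1, steps, alpha):
    """Mirror descent with psi(x) = sum_i x_i log x_i on the simplex."""
    x = x1.copy()
    for _ in range(steps):
        g = grad(x)
        x = x * np.exp(-alpha * g)   # multiplicative update, cf. (37)
        x = x / x.sum()              # KL projection back onto the simplex
    return x

# Example (assumed for illustration): minimize f(x) = <c, x> over the simplex;
# the minimizer puts all mass on the smallest coordinate of c.
c = np.array([0.9, 0.3, 0.7])
x = entropic_mirror_descent(lambda x: c, np.full(3, 1.0 / 3), steps=500, alpha=0.1)
print(x)   # concentrates on coordinate 1
```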
Rate of convergence. Recall in unconstrained subgradient descent we followed 4 steps.
1. Bound on a single update:
\begin{align}
\|x_{k+1} - x^*\|_2^2 &= \|x_k - \alpha_k g_k - x^*\|_2^2 \tag{38}\\
&= \|x_k - x^*\|_2^2 - 2\alpha_k\langle g_k, x_k - x^*\rangle + \alpha_k^2\|g_k\|_2^2 \tag{39}\\
&\le \|x_k - x^*\|_2^2 - 2\alpha_k\left(f(x_k) - f(x^*)\right) + \alpha_k^2\|g_k\|_2^2. \tag{40}
\end{align}
2. Telescope:
\[
\|x_{T+1} - x^*\|_2^2 \le \|x_1 - x^*\|_2^2 - 2\sum_{k=1}^{T}\alpha_k\left(f(x_k) - f(x^*)\right) + \sum_{k=1}^{T}\alpha_k^2\|g_k\|_2^2. \tag{41}
\]
3. Bounding by $\|x_1 - x^*\|_2^2 \le R^2$ and $\|g_k\|_2^2 \le G^2$:
\[
2\sum_{k=1}^{T}\alpha_k\left(f(x_k) - f(x^*)\right) \le R^2 + G^2\sum_{k=1}^{T}\alpha_k^2. \tag{42}
\]
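The fourth step, choosing the step size, is not reproduced above; a standard completion (assumed here, consistent with (42)) is to balance the two terms with a constant step size $\alpha_k = \frac{R}{G\sqrt{T}}$, which gives
\[
2\,\frac{R}{G\sqrt{T}}\sum_{k=1}^{T}\left(f(x_k) - f(x^*)\right) \le R^2 + G^2\cdot T\cdot\frac{R^2}{G^2 T} = 2R^2
\quad\Rightarrow\quad
\min_{k\le T} f(x_k) - f(x^*) \le \frac{1}{T}\sum_{k=1}^{T}\left(f(x_k) - f(x^*)\right) \le \frac{RG}{\sqrt{T}}.
\]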
Acceleration 1: $f$ is strongly convex. We say $f$ is strongly convex with respect to another convex function $\psi$ with modulus $\lambda$ if
\[
f(x) \ge f(y) + \langle g, x - y\rangle + \lambda\,\Delta_\psi(x, y), \qquad \forall\, g \in \partial f(y). \tag{51}
\]
Note we do not assume $f$ is differentiable. Now in the step from (47) to (48), we can plug in the definition of strong convexity:
\begin{align}
\Delta_\psi(x^*, x_{k+1}) &= \ldots + \alpha_k\langle g_k, x^* - x_k\rangle + \ldots \qquad \text{(copy of (47))} \tag{52}\\
&\le \ldots - \alpha_k\left(f(x_k) - f(x^*) + \lambda\,\Delta_\psi(x^*, x_k)\right) + \ldots \tag{53}\\
&\le \ldots \tag{54}\\
&\le (1 - \lambda\alpha_k)\,\Delta_\psi(x^*, x_k) - \alpha_k\left(f(x_k) - f(x^*)\right) + \frac{\alpha_k^2}{2\sigma}\|g_k\|_*^2. \tag{55}
\end{align}
Denote $\delta_k = \Delta_\psi(x^*, x_k)$ and $\epsilon_k = f(x_k) - f(x^*)$. Set $\alpha_k = \frac{1}{\lambda k}$. Then
\[
\delta_{k+1} \le \frac{k-1}{k}\,\delta_k - \frac{1}{\lambda k}\,\epsilon_k + \frac{G^2}{2\sigma\lambda^2 k^2}
\quad\Rightarrow\quad
k\,\delta_{k+1} \le (k-1)\,\delta_k - \frac{\epsilon_k}{\lambda} + \frac{G^2}{2\sigma\lambda^2 k}. \tag{56}
\]
Now telescope (sum up both sides from $k = 1$ to $T$):
\[
T\,\delta_{T+1} \le -\frac{1}{\lambda}\sum_{k=1}^{T}\epsilon_k + \frac{G^2}{2\sigma\lambda^2}\sum_{k=1}^{T}\frac{1}{k}
\quad\Rightarrow\quad
\min_{i\in\{1,\ldots,T\}}\epsilon_i \le \frac{G^2}{2\sigma\lambda}\cdot\frac{1}{T}\sum_{k=1}^{T}\frac{1}{k} \le \frac{G^2}{2\sigma\lambda}\cdot\frac{O(\log T)}{T}. \tag{57}
\]
Acceleration 2: $f$ has Lipschitz continuous gradient. If the gradient of $f$ is Lipschitz continuous, there exists $L > 0$ such that
\[
\|\nabla f(x) - \nabla f(y)\|_* \le L\,\|x - y\|, \qquad \forall\, x, y. \tag{58}
\]
Sometimes we just directly say $f$ is smooth. It is also known that this is equivalent to
\[
f(x) \le f(y) + \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|^2. \tag{59}
\]
We bound the $\langle g_k, x^* - x_{k+1}\rangle$ term in (46) as follows:
\begin{align}
\langle g_k, x^* - x_{k+1}\rangle &= \langle g_k, x^* - x_k\rangle + \langle g_k, x_k - x_{k+1}\rangle \tag{60}\\
&\le f(x^*) - f(x_k) + f(x_k) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2 \tag{61}\\
&= f(x^*) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2. \tag{62}
\end{align}
Plugging into (46), we get
\[
\Delta_\psi(x^*, x_{k+1}) \le \Delta_\psi(x^*, x_k) + \alpha_k\left(f(x^*) - f(x_{k+1}) + \frac{L}{2}\|x_k - x_{k+1}\|^2\right) - \frac{\sigma}{2}\|x_k - x_{k+1}\|^2. \tag{63}
\]
Setting $\alpha_k = \frac{\sigma}{L}$, we get
\[
\Delta_\psi(x^*, x_{k+1}) \le \Delta_\psi(x^*, x_k) - \frac{\sigma}{L}\left(f(x_{k+1}) - f(x^*)\right). \tag{64}
\]
Telescoping, we get
\[
\min_{k\in\{2,\ldots,T+1\}} f(x_k) - f(x^*) \le \frac{L\,\Delta_\psi(x^*, x_1)}{\sigma T} \le \frac{LR^2}{\sigma T}. \tag{65}
\]
This gives an $O(\frac{1}{T})$ convergence rate. But if we are smarter, like Nesterov, the rate can be improved to $O(\frac{1}{T^2})$. We will not go into the details, but the algorithm and proof are again based on Lemma 2. This is often called the accelerated proximal gradient method.
2.1 Composite Objective
Suppose the objective function is $h(x) = f(x) + r(x)$, where $f$ is smooth and $r(x)$ is simple, like $\|x\|_1$. If we directly apply the above rates for optimizing $h$, we get an $O(\frac{1}{\sqrt{T}})$ rate of convergence because $h$ is not smooth. It would be nice if we could enjoy the $O(\frac{1}{T})$ rate as in smooth optimization. Fortunately this is possible thanks to the simplicity of $r(x)$, and we only need to extend the proximal operator (33) as follows:
\begin{align}
x_{k+1} &= \operatorname*{argmin}_{x\in C}\; \left\{ f(x_k) + \langle g_k, x - x_k\rangle + r(x) + \frac{1}{\alpha_k}\Delta_\psi(x, x_k) \right\} \tag{66}\\
&= \operatorname*{argmin}_{x\in C}\; \{\alpha_k f(x_k) + \alpha_k\langle g_k, x - x_k\rangle + \alpha_k r(x) + \Delta_\psi(x, x_k)\}. \tag{67}
\end{align}
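As an illustration of (66) (a sketch under assumptions, not the notes' own code): with the Euclidean $\psi$ and $r(x) = \lambda\|x\|_1$, the argmin in (66) has a closed form given by soft-thresholding, and the update reduces to the classical proximal gradient (ISTA) iteration. The lasso instance below is made up for demonstration.

```python
import numpy as np

def soft_threshold(v, tau):
    # closed-form solution of argmin_x 0.5 * ||x - v||^2 + tau * ||x||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_step(x, grad_f, alpha, lam):
    # one instance of (66): linearize f, keep r(x) = lam * ||x||_1 exact,
    # and use the Euclidean Bregman term 1/(2 alpha) * ||x - x_k||^2
    return soft_threshold(x - alpha * grad_f(x), alpha * lam)

# Example problem: f(x) = 0.5 * ||A x - b||^2, r(x) = lam * ||x||_1
rng = np.random.default_rng(2)
A, b, lam = rng.normal(size=(20, 5)), rng.normal(size=20), 0.5
grad_f = lambda x: A.T @ (A @ x - b)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L for this smooth f
x = np.zeros(5)
for _ in range(300):
    x = composite_step(x, grad_f, alpha, lam)
print(x)                                     # typically sparse due to the l1 term
```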
Algorithm 1: Protocol of online learning
1 The player initializes a model $x_1$.
2 for $k = 1, 2, \ldots$ do
3   The player proposes a model $x_k$.
4   The rival picks a function $f_k$.
5   The player suffers a loss $f_k(x_k)$.
6   The player gets access to $f_k$ and uses it to update its model to $x_{k+1}$.
Here we use a first-order Taylor approximation of $f$ around $x_k$, but keep $r(x)$ exact. Assuming this proximal operator can be computed efficiently, we can show that all the above rates carry over. We here only show the case of general $f$ (not necessarily smooth or strongly convex), and leave the other two cases as an exercise. In fact, we can again achieve an $O(\frac{1}{T^2})$ rate when $f$ has Lipschitz continuous gradient.
Consider $\alpha_k\left(f(x_k) + \langle g_k, x - x_k\rangle + r(x)\right)$ as the $L$ in Lemma 2. Then
\begin{align}
&\alpha_k\left(f(x_k) + \langle g_k, x^* - x_k\rangle + r(x^*)\right) + \Delta_\psi(x^*, x_k) \tag{68}\\
&\ge \alpha_k\left(f(x_k) + \langle g_k, x_{k+1} - x_k\rangle + r(x_{k+1})\right) + \Delta_\psi(x_{k+1}, x_k) + \Delta_\psi(x^*, x_{k+1}). \tag{69}
\end{align}
Following exactly the derivations from (46) to (50), we obtain
\begin{align}
\Delta_\psi(x^*, x_{k+1}) &\le \Delta_\psi(x^*, x_k) + \alpha_k\langle g_k, x^* - x_{k+1}\rangle + \alpha_k\left(r(x^*) - r(x_{k+1})\right) - \Delta_\psi(x_{k+1}, x_k) \tag{70}\\
&\le \ldots \tag{71}\\
&\le \Delta_\psi(x^*, x_k) - \alpha_k\left(f(x_k) + r(x_{k+1}) - f(x^*) - r(x^*)\right) + \frac{\alpha_k^2}{2\sigma}\|g_k\|_*^2. \tag{72}
\end{align}
This is almost the same as (50), except that we want to have $r(x_k)$ here, not $r(x_{k+1})$. Fortunately this is not a problem as long as we use a slightly different way of telescoping. Denote $\delta_k = \Delta_\psi(x^*, x_k)$; then
\[
f(x_k) + r(x_{k+1}) - f(x^*) - r(x^*) \le \frac{1}{\alpha_k}\left(\delta_k - \delta_{k+1}\right) + \frac{\alpha_k}{2\sigma}\|g_k\|_*^2. \tag{73}
\]
Summing up from $k = 1$ to $T$, we obtain
\begin{align}
r(x_{T+1}) - r(x_1) + \sum_{k=1}^{T}\left(h(x_k) - h(x^*)\right)
&\le \frac{\delta_1}{\alpha_1} + \sum_{k=2}^{T}\delta_k\left(\frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}}\right) - \frac{\delta_{T+1}}{\alpha_T} + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\alpha_k \tag{74}\\
&\le R^2\left(\frac{1}{\alpha_1} + \sum_{k=2}^{T}\left(\frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}}\right)\right) + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\alpha_k \tag{75}\\
&= \frac{R^2}{\alpha_T} + \frac{G^2}{2\sigma}\sum_{k=1}^{T}\alpha_k. \tag{76}
\end{align}
Suppose we choose $x_1 = \operatorname*{argmin}_x r(x)$, which ensures $r(x_{T+1}) - r(x_1) \ge 0$. Setting $\alpha_k = \frac{R}{G}\sqrt{\frac{\sigma}{k}}$, we get
\[
\sum_{k=1}^{T}\left(h(x_k) - h(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\left(\sqrt{T} + \frac{1}{2}\sum_{k=1}^{T}\frac{1}{\sqrt{k}}\right) = \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{77}
\]
Then it is easy to derive the regret bound. Using $f_k$ in step (50), we have
\[
f_k(x_k) - f_k(x^*) \le \frac{1}{\alpha_k}\left(\Delta_\psi(x^*, x_k) - \Delta_\psi(x^*, x_{k+1})\right) + \frac{\alpha_k}{2\sigma}\|g_k\|_*^2. \tag{80}
\]
Summing up from $k = 1$ to $T$ and using the same process as in (74) to (77), we get
\[
\sum_{k=1}^{T}\left(f_k(x_k) - f_k(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{81}
\]
So the regret grows in the order of $O(\sqrt{T})$.
$f$ is strongly convex. Use (55) exactly, with $f_k$ in place of $f$, and we can derive the $O(\log T)$ regret bound immediately.
$f$ has Lipschitz continuous gradient. The result in (64) can NOT be extended to the online setting, because if we replace $f$ by $f_k$ we will get $f_k(x_{k+1}) - f_k(x^*)$ on the right-hand side, and telescoping will not give a regret bound. In fact, it is known that in the online setting, having a Lipschitz continuous gradient by itself cannot reduce the regret bound from $O(\sqrt{T})$ (as for nonsmooth objectives) to $O(\log T)$.
Composite objective. In the online setting, both the player and the rival know $r(x)$, and the rival changes $f_k(x)$ at each iteration. The loss incurred at each iteration is $h_k(x_k) = f_k(x_k) + r(x_k)$. The update rule is
\[
x_{k+1} = \operatorname*{argmin}_{x\in C}\; \left\{ f_k(x_k) + \langle g_k, x - x_k\rangle + r(x) + \frac{1}{\alpha_k}\Delta_\psi(x, x_k) \right\}, \quad \text{where } g_k \in \partial f_k(x_k). \tag{82}
\]
Note that in this setting, (73) becomes
\[
f_k(x_k) + r(x_{k+1}) - f_k(x^*) - r(x^*) \le \frac{1}{\alpha_k}\left(\delta_k - \delta_{k+1}\right) + \frac{\alpha_k}{2\sigma}\|g_k\|_*^2. \tag{83}
\]
Although we have $r(x_{k+1})$ here rather than $r(x_k)$, this is fine because $r$ does not change through the iterations. Choosing $x_1 = \operatorname*{argmin}_x r(x)$ and telescoping in the same way as from (74) to (77), we immediately obtain
\[
\sum_{k=1}^{T}\left(h_k(x_k) - h_k(x^*)\right) \le \frac{RG}{\sqrt{\sigma}}\,O(\sqrt{T}). \tag{84}
\]
So the regret grows at $O(\sqrt{T})$.
When the $f_k$ are strongly convex, we can get $O(\log T)$ regret for the composite case. But as expected, Lipschitz continuity of $\nabla f_k$ alone cannot reduce the regret from $O(\sqrt{T})$ to $O(\log T)$.
2.3 Stochastic optimization
Let us consider optimizing a function which takes the form of an expectation:
\[
\min_x\; F(x) := \mathbb{E}_{\omega\sim p}[f(x; \omega)], \tag{85}
\]
where $p$ is a distribution over $\omega$. This subsumes a lot of machine learning models. For example, the SVM objective is
\[
F(x) = \frac{1}{m}\sum_{i=1}^{m}\max\{0, 1 - c_i\langle a_i, x\rangle\} + \frac{\lambda}{2}\|x\|^2. \tag{86}
\]
It can be interpreted as (85) where $\omega$ is uniformly distributed in $\{1, 2, \ldots, m\}$ (i.e., $p(\omega = i) = \frac{1}{m}$), and
\[
f(x; i) = \max\{0, 1 - c_i\langle a_i, x\rangle\} + \frac{\lambda}{2}\|x\|^2. \tag{87}
\]
When m is large, it can be costly to calculate F and its subgradient. So a simple idea is to base the
updates on a single randomly chosen data point. It can be considered as a special case of online learning in
Algorithm 1, where the rival in step 4 now randomly picks fk as f (x; ωk ) with ωk being drawn independently
from p. Ideally we hope that by using the mirror descent updates, xk will gradually approach the minimizer
Algorithm 2: Protocol of stochastic optimization as online learning
1 The player initializes a model $x_1$.
2 for $k = 1, 2, \ldots$ do
3   The player proposes a model $x_k$.
4   The rival randomly draws an $\omega_k$ from $p$, which defines a function $f_k(x) := f(x; \omega_k)$.
5   The player suffers a loss $f_k(x_k)$.
6   The player gets access to $f_k$ and uses it to update its model to $x_{k+1}$ by, e.g., mirror descent (79).
of $F(x)$. Intuitively this is quite reasonable: by using $f_k$ we can compute an unbiased estimate of $F(x_k)$ and of a subgradient of $F(x_k)$ (because the $\omega_k$ are sampled i.i.d. from $p$). This is a particular case of stochastic optimization, and we recap it in Algorithm 2.
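For concreteness, here is a minimal sketch (an illustration with made-up data, not the notes' code) of Algorithm 2 on the SVM objective (86): each round samples one index $i$, and the player takes a step along a subgradient of $f(x; i)$ from (87), using the step size $\alpha_k = \frac{1}{\lambda k}$ suggested by the strongly convex case.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, lam = 200, 5, 0.1
w_true = rng.normal(size=d)
a = rng.normal(size=(m, d))
c = np.sign(a @ w_true)                        # labels in {-1, +1}

def subgrad(x, i):
    # a subgradient of f(x; i) = max{0, 1 - c_i <a_i, x>} + (lam/2) ||x||^2
    g = lam * x
    if c[i] * (a[i] @ x) < 1.0:
        g = g - c[i] * a[i]
    return g

x = np.zeros(d)
for k in range(1, 5001):
    i = rng.integers(m)                        # the "rival" draws omega_k ~ p
    x = x - (1.0 / (lam * k)) * subgrad(x, i)  # stochastic subgradient step
print(x)
```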
In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays $\omega_k$ at iteration $k$. Then an online learning algorithm $A$ is simply a deterministic mapping from an ordered set $\{\omega_1, \ldots, \omega_k\}$ to $x_{k+1}$. Denote by $A(\emptyset)$, i.e., $A$ applied to the empty sequence, the initial model $x_1$. Then the following theorem is the key for online to batch conversion.
Theorem 3 Suppose an online learning algorithm $A$ has regret bound $R_k$ after running Algorithm 1 for $k$ iterations. Suppose $\omega_1, \ldots, \omega_{T+1}$ are drawn i.i.d. from $p$. Define $\hat{x} = A(\omega_{j+1}, \ldots, \omega_T)$, where $j$ is drawn uniformly at random from $\{0, \ldots, T\}$. Then
\[
\mathbb{E}[F(\hat{x})] - \min_x F(x) \le \frac{R_{T+1}}{T+1}, \tag{88}
\]
where the expectation is with respect to the randomness of $\omega_1, \ldots, \omega_T$, and $j$.
Similarly, we can have high probability bounds, which can be stated in a form like (not exactly true)
\[
F(\hat{x}) - \min_x F(x) \le \frac{R_{T+1}}{T+1}\log\frac{1}{\delta} \tag{89}
\]
with probability $1 - \delta$, where the probability is with respect to the randomness of $\omega_1, \ldots, \omega_T$, and $j$.
Proof of Theorem 3.
\begin{align}
\mathbb{E}[F(\hat{x})] &= \mathbb{E}_{j,\,\omega_1,\ldots,\omega_{T+1}}[f(\hat{x}; \omega_{T+1})] = \mathbb{E}_{j,\,\omega_1,\ldots,\omega_{T+1}}[f(A(\omega_{j+1}, \ldots, \omega_T); \omega_{T+1})] \tag{90}\\
&= \mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\frac{1}{T+1}\sum_{j=0}^{T} f(A(\omega_{j+1}, \ldots, \omega_T); \omega_{T+1})\right] \quad\text{(as $j$ is drawn uniformly at random)} \tag{91}\\
&= \frac{1}{T+1}\,\mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\sum_{j=0}^{T} f(A(\omega_1, \ldots, \omega_{T-j}); \omega_{T+1-j})\right] \quad\text{(shift iteration index, by i.i.d.-ness of the $\omega_i$)} \tag{92}\\
&= \frac{1}{T+1}\,\mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\sum_{s=1}^{T+1} f(A(\omega_1, \ldots, \omega_{s-1}); \omega_s)\right] \quad\text{(change of variable $s = T - j + 1$)} \tag{93}\\
&\le \frac{1}{T+1}\,\mathbb{E}_{\omega_1,\ldots,\omega_{T+1}}\left[\min_x \sum_{s=1}^{T+1} f(x; \omega_s) + R_{T+1}\right] \quad\text{(apply the regret bound)} \tag{94}\\
&\le \min_x\,\mathbb{E}_\omega[f(x; \omega)] + \frac{R_{T+1}}{T+1} \quad\text{(expectation of min is smaller than min of expectation)} \tag{95}\\
&= \min_x F(x) + \frac{R_{T+1}}{T+1}. \tag{96}
\end{align}
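The construction in Theorem 3 can be sketched in a few lines (an illustration only; the one-dimensional quadratic loss and the step size are assumptions of the example): run the deterministic online algorithm $A$ on the random suffix $\omega_{j+1}, \ldots, \omega_T$ for a uniformly drawn $j$ and return its output as $\hat{x}$.

```python
import numpy as np

def A(omegas, alpha=0.1):
    # a deterministic online algorithm: online gradient descent on
    # f(x; omega) = 0.5 * (x - omega)^2, starting from the initial model x_1 = 0
    x = 0.0
    for w in omegas:
        x = x - alpha * (x - w)
    return x

rng = np.random.default_rng(4)
T = 1000
omegas = rng.normal(loc=2.0, scale=1.0, size=T)   # omega_1, ..., omega_T iid from p
j = rng.integers(0, T + 1)                        # j uniform in {0, ..., T}
x_hat = A(omegas[j:])                             # x_hat = A(omega_{j+1}, ..., omega_T)
print(x_hat)   # in expectation, F(x_hat) is close to min_x F(x), attained at x = 2.0
```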