Appendix

Theoretical Analysis
We first list the requirement for AGLD and the assumptions on $f$.

Requirement 1. For the gradient snapshot $A^{(k)}$, we have $\alpha_i^{(k)} \in \{\nabla f_i(x^{(j)})\}_{j=k-D+1}^{k}$, where $D$ is a fixed constant.
Assumption 1 (Smoothness). Each individual $f_i$ is $\tilde{M}$-smooth. That is, $f_i$ is twice differentiable and there exists a constant $\tilde{M} > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x\rangle + \frac{\tilde{M}}{2}\|x - y\|_2^2. \qquad (1)$$
Accordingly, we can verify that the sum $f$ of the $f_i$'s is $M = \tilde{M}N$-smooth.
Assumption 2 (Strong Convexity). The sum $f$ is $\mu$-strongly convex. That is, there exists a constant $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|x - y\|_2^2. \qquad (2)$$
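As a quick numerical illustration (not part of the paper), the sketch below checks inequalities (1) and (2) on randomly generated quadratic components $f_i(x) = \frac{1}{2}x^\top A_i x + b_i^\top x$, for which $\tilde{M}$ can be taken as $\max_i \lambda_{\max}(A_i)$ and $\mu$ as $\lambda_{\min}(\sum_i A_i)$; all names and sizes here are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy instance: quadratic components f_i(x) = 0.5 x^T A_i x + b_i^T x,
# used only to sanity-check Assumptions 1 and 2 numerically.
rng = np.random.default_rng(0)
N, d = 5, 3
As = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(N)]   # SPD Hessians
bs = [rng.normal(size=d) for _ in range(N)]

f_i = lambda i, x: 0.5 * x @ As[i] @ x + bs[i] @ x
grad_i = lambda i, x: As[i] @ x + bs[i]
f = lambda x: sum(f_i(i, x) for i in range(N))
grad = lambda x: sum(grad_i(i, x) for i in range(N))

M_tilde = max(np.linalg.eigvalsh(A).max() for A in As)        # per-component smoothness constant
mu = np.linalg.eigvalsh(sum(As)).min()                        # strong convexity of the sum

for _ in range(1000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    i = rng.integers(N)
    # Assumption 1 (smoothness of each f_i), inequality (1)
    assert f_i(i, y) <= f_i(i, x) + grad_i(i, x) @ (y - x) + 0.5 * M_tilde * np.sum((x - y) ** 2) + 1e-9
    # Assumption 2 (strong convexity of the sum f), inequality (2)
    assert f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * mu * np.sum((x - y) ** 2) - 1e-9
```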

Lemma 1. Suppose one uniformly samples $n$ data points from a dataset of size $N$ in each trial. Let $T$ denote the number of trials needed to collect all the data points. Then we have
$$P\Big(T > \beta \frac{N \ln N}{n}\Big) < N^{1-\beta}. \qquad (3)$$
Proof. Let $Y_i^{(k)}$ denote the event that the $i$-th sample is not selected in the first $k$ trials. Then we have
$$P\big(Y_i^{(k)}\big) = \Big(1 - \frac{n}{N}\Big)^{k} \le e^{-nk/N}.$$
Thus $P\big[Y_i^{(k)}\big] \le N^{-\beta}$ for $k = \beta N \ln N/n$. Hence,
$$P\Big(T > \beta \frac{N \ln N}{n}\Big) = P\Big(\bigcup_i Y_i^{(\beta N \ln N/n)}\Big) \le N\, P\big[Y_i^{(\beta N \ln N/n)}\big] \le N^{1-\beta}.$$
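As a sanity check (not from the paper), the following simulation estimates the tail probability in (3) for a hypothetical small instance and compares it against the bound $N^{1-\beta}$, assuming each trial draws $n$ distinct indices uniformly at random (consistent with the $(1 - n/N)^k$ factor in the proof).

```python
import numpy as np

# Monte Carlo sanity check of Lemma 1 (not part of the paper): each trial draws n
# distinct indices uniformly from {0, ..., N-1}; T is the number of trials until
# every index has been seen at least once.
rng = np.random.default_rng(1)
N, n, beta = 200, 10, 2.0          # hypothetical sizes, chosen only for illustration
threshold = beta * N * np.log(N) / n

def collect_time():
    seen = np.zeros(N, dtype=bool)
    t = 0
    while not seen.all():
        seen[rng.choice(N, size=n, replace=False)] = True
        t += 1
    return t

trials = 2000
tail = np.mean([collect_time() > threshold for _ in range(trials)])
print(f"empirical P(T > beta*N*ln(N)/n) = {tail:.4f}, bound N^(1-beta) = {N ** (1 - beta):.4f}")
```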

From equation (14) in the main paper, we can derive
$$\begin{aligned}
\Delta^{(k+1)} &= y^{(k+1)} - x^{(k+1)} \\
&= \Delta^{(k)} - \Big(\int_{k\eta}^{(k+1)\eta} \nabla f(y(s)) - \nabla f(y^{(k)})\, ds\Big) - \eta\big(\nabla f(y^{(k)}) - \nabla f(x^{(k)})\big) + \eta\big(g^{(k)} - \nabla f(x^{(k)})\big) \\
&= \Delta^{(k)} - V^{(k)} - \eta U^{(k)} + \eta \Psi^{(k)} + \eta E^{(k)},
\end{aligned}$$
where
$$V^{(k)} = \int_{k\eta}^{(k+1)\eta} \nabla f(y(s)) - \nabla f(y^{(k)})\, ds, \qquad U^{(k)} = \nabla f(y^{(k)}) - \nabla f(x^{(k)}),$$
$$\Psi^{(k)} = \frac{N}{n}\sum_{i\in I_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) + \sum_{i=1}^{N} \alpha_i^{(k)} - \nabla f(x^{(k)}),$$
$$E^{(k)} = \frac{N}{n}\Big(\sum_{i\in S_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \sum_{i\in I_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big).$$
Lemma 2. Assuming the $M$-smoothness and $\mu$-strong convexity of $f$ and that $\eta \le \frac{2}{M+\mu}$, we have
$$\mathbb{E}\|V^{(k)}\|^2 \le \frac{1}{3}\eta^4 M^3 d + \eta^3 M^2 d, \qquad (4)$$
$$\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 \le (1 - \eta\mu)^2\, \mathbb{E}\|\Delta^{(k)}\|^2. \qquad (5)$$
Lemma 3. Assuming the $M$-smoothness and $\mu$-strong convexity of $f$ and Requirement 1, we have the following upper bounds on $\mathbb{E}\|\Psi^{(k)}\|^2$ and $\mathbb{E}\|E^{(k)}\|^2$:
$$\mathbb{E}\|\Psi^{(k)}\|^2 \le \frac{N}{n}\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2,$$
$$\mathbb{E}\|E^{(k)}\|^2 \le \frac{2N(N+n)}{n}\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2,$$
and
$$\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2 \le 32\eta^2 D^2 M^2\, \mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + 4\eta D d + 48\eta^3 M^2 D^3 d(\eta D M + 1) + 8\eta^2 M N D^2 d,$$
where $\mathbb{E}\|E^{(k)}\|^2 = 0$ if the data access strategy is RA and $\Delta^{(k:k-2D)} := [\Delta^{(k)}, \Delta^{(k-1)}, \cdots, \Delta^{([k-2D]^+)}]$.
Proof.
$$\begin{aligned}
\mathbb{E}\|\Psi^{(k)}\|^2 &= \mathbb{E}\Big\|\frac{N}{n}\sum_{i\in I_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) + \sum_{i=1}^{N} \alpha_i^{(k)} - \nabla f(x^{(k)})\Big\|^2 \\
&= \mathbb{E}\Big\|\frac{1}{n}\sum_{i\in I_k} \Big( N\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \big(\nabla f(x^{(k)}) - \sum_{j=1}^{N}\alpha_j^{(k)}\big)\Big)\Big\|^2 \\
&= \frac{1}{n}\,\mathbb{E}\Big\| N\big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \big(\nabla f(x^{(k)}) - \sum_{j=1}^{N}\alpha_j^{(k)}\big)\Big\|^2 \\
&\le \frac{N^2}{n}\,\mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2 \\
&= \frac{N}{n}\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2.
\end{aligned}$$
The third equality follows from the fact that the indices in $I_k$ are chosen uniformly and independently. The first inequality is due to the fact that $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2$ for any random variable $X$. In the last equality, we use that $i$ is chosen uniformly from $\{1, \cdots, N\}$, so that the summation index on the right-hand side is no longer random.

$$\begin{aligned}
\mathbb{E}\|E^{(k)}\|^2 &= \mathbb{E}\Big\|\frac{N}{n}\Big(\sum_{i\in S_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big) - \sum_{i\in I_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big)\Big\|^2 \\
&\le \frac{2N^2}{n^2}\,\mathbb{E}\Big(\Big\|\sum_{i\in S_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big\|^2 + \Big\|\sum_{i\in I_k} \big(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big)\Big\|^2\Big) \\
&\le \frac{2N^2}{n^2}\,\mathbb{E}\Big(n\sum_{i\in S_k} \big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2 + n\sum_{i\in I_k} \big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2\Big) \\
&\le \frac{2N^2}{n^2}\Big(n\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2 + \frac{n^2}{N}\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2\Big) \\
&= \frac{2N(N+n)}{n}\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2.
\end{aligned}$$
In the first two inequalities, we use that $\|\sum_{i=1}^{n} a_i\|^2 \le n\sum_{i=1}^{n}\|a_i\|^2$. The third inequality follows from the fact that $S_k$ is a subset of $\{1, \cdots, N\}$ in CA and RS, and that the indices in $I_k$ are chosen uniformly and independently from $\{1, \cdots, N\}$. When we use RA, $S_k$ simply equals $I_k$ and $\mathbb{E}\|E^{(k)}\|^2 = 0$.
Suppose that in the $k$-th iteration, the snapshot $\alpha_i^{(k)}$ was taken at $x^{(k_i)}$, where $k_i \in \{(k-1)\vee 0, (k-2)\vee 0, \cdots, (k-D)\vee 0\}$. By the $\tilde{M}$-smoothness of $f_i$, we have
$$\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2 \le \sum_{i=1}^{N} \tilde{M}^2\, \mathbb{E}\big\|x^{(k)} - x^{(k_i)}\big\|^2 = \sum_{i=1}^{N} \frac{M^2}{N^2}\, \mathbb{E}\big\|x^{(k)} - x^{(k_i)}\big\|^2.$$
According to the update rule of $x^{(k)}$, we have
$$\mathbb{E}\big\|x^{(k)} - x^{(k_i)}\big\|^2 = \mathbb{E}\Big\|\eta \sum_{j=k_i}^{k-1} g^{(j)} + \sqrt{2\eta}\sum_{j=k_i}^{k-1} \xi^{(j)}\Big\|^2 \le 2D\eta^2 \sum_{j=k-D}^{k-1} \mathbb{E}\big\|g^{(j)}\big\|^2 + 4Dd\eta,$$
where the inequality follows from $\|a + b\|^2 \le 2(\|a\|^2 + \|b\|^2)$, the fact that the $\xi^{(j)}$ are independent Gaussian vectors, and $k_i \ge k - D$.
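For orientation, the update just used can be sketched concretely. The following is a minimal, hypothetical implementation (names such as agld_step and grad_i are illustrative, not from the paper) of one step $x^{(k+1)} = x^{(k)} - \eta g^{(k)} + \sqrt{2\eta}\,\xi^{(k)}$, with $g^{(k)} = \sum_i \alpha_i^{(k)} + \frac{N}{n}\sum_{i\in S_k}(\nabla f_i(x^{(k)}) - \alpha_i^{(k)})$ as in the expansion of $g^{(j)}$ below.

```python
import numpy as np

def agld_step(x, alpha, grad_i, S_k, eta, rng):
    """One hypothetical AGLD-style step; an illustration only, not the paper's exact code.

    x       : current iterate x^(k), shape (d,)
    alpha   : gradient snapshots alpha_i^(k), shape (N, d)
    grad_i  : callable grad_i(i, x) returning the gradient of f_i at x
    S_k     : indices of the components accessed at this step (size n)
    eta     : step size
    """
    N, d = alpha.shape
    n = len(S_k)
    # Gradients of the accessed components at the current iterate.
    fresh = {i: grad_i(i, x) for i in S_k}
    # Gradient estimate built from the stale snapshots:
    # g = sum_i alpha_i + (N / n) * sum_{i in S_k} (grad f_i(x) - alpha_i).
    g = alpha.sum(axis=0) + (N / n) * sum(fresh[i] - alpha[i] for i in S_k)
    # Langevin update with injected Gaussian noise.
    x_new = x - eta * g + np.sqrt(2 * eta) * rng.normal(size=d)
    # Refresh the snapshots of the components accessed at this step (Requirement 1
    # is met provided every index is visited at least once every D steps).
    for i in S_k:
        alpha[i] = fresh[i]
    return x_new, alpha
```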
By expanding $g^{(j)}$, we have
$$\begin{aligned}
\mathbb{E}\big\|g^{(j)}\big\|^2 &= \mathbb{E}\Big\|\frac{N}{n}\sum_{p\in S_j} \big(\nabla f_p(x^{(j)}) - \alpha_p^{(j)}\big) + \sum_{p=1}^{N} \alpha_p^{(j)}\Big\|^2 \\
&\le \underbrace{2\,\mathbb{E}\Big\|\frac{N}{n}\sum_{p\in S_j} \big(\nabla f_p(x^{(j)}) - \alpha_p^{(j)}\big)\Big\|^2}_{A} + \underbrace{2\,\mathbb{E}\Big\|\sum_{p=1}^{N} \alpha_p^{(j)}\Big\|^2}_{B}.
\end{aligned}$$

For $A$, we have
$$\begin{aligned}
A &\le 2n\,\frac{N^2}{n^2}\sum_{p\in S_j} \mathbb{E}\big\|\big(\nabla f_p(x^{(j)}) - \nabla f_p(y^{(j)})\big) + \big(\nabla f_p(y^{(j)}) - \nabla f_p(y^{(j_p)})\big) + \big(\nabla f_p(y^{(j_p)}) - \alpha_p^{(j)}\big)\big\|^2 \\
&\le \frac{6N^2}{n}\sum_{p\in S_j} \Big(\mathbb{E}\big\|\nabla f_p(x^{(j)}) - \nabla f_p(y^{(j)})\big\|^2 + \mathbb{E}\big\|\nabla f_p(y^{(j)}) - \nabla f_p(y^{(j_p)})\big\|^2 + \mathbb{E}\big\|\nabla f_p(y^{(j_p)}) - \alpha_p^{(j)}\big\|^2\Big) \\
&\le \frac{6M^2}{n}\sum_{p\in S_j} \Big(\mathbb{E}\big\|x^{(j)} - y^{(j)}\big\|^2 + \mathbb{E}\big\|y^{(j)} - y^{(j_p)}\big\|^2 + \mathbb{E}\big\|y^{(j_p)} - x^{(j_p)}\big\|^2\Big),
\end{aligned}$$
where the last inequality follows from the smoothness of $f_p$.


By further expanding $y^{(j)}$ and $y^{(j_p)}$, we have
$$\begin{aligned}
\mathbb{E}\big\|y^{(j)} - y^{(j_p)}\big\|^2 &= \mathbb{E}\Big\|\int_{j_p\eta}^{j\eta} \nabla f(y(s))\,ds - \sqrt{2}\sum_{q=j_p}^{j} \xi^{(q)}\Big\|^2 \\
&\le 2(j - j_p)\eta \int_{j_p\eta}^{j\eta} \mathbb{E}\big\|\nabla f(y(s))\big\|^2\, ds + 4\eta D d \\
&\le 2D\eta \cdot D\eta M d + 4\eta D d \\
&= 2D^2\eta^2 M d + 4\eta D d.
\end{aligned}$$
Here, the first inequality is due to Jensen's inequality, and the second inequality follows from Lemma 3 in (Dalalyan and Karagulyan 2017), which bounds $\mathbb{E}\|\nabla f(y(s))\|^2 \le M d$.
Then we can bound $A$ above by
$$\begin{aligned}
A &\le \frac{6M^2}{n}\sum_{p\in S_j} \Big(\mathbb{E}\|\Delta^{(j)}\|^2 + 2D^2\eta^2 M d + 4\eta D d + \mathbb{E}\|\Delta^{(j_p)}\|^2\Big) \\
&\le 6M^2\Big(\mathbb{E}\|\Delta^{(j)}\|^2 + 2D^2\eta^2 M d + 4\eta D d + \mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2\Big).
\end{aligned}$$
Now we can bound $B$ with a similar technique:
$$\begin{aligned}
B &= 2\,\mathbb{E}\Big\|\sum_{p=1}^{N} \big(\alpha_p^{(j)} - \nabla f_p(y^{(j_p)})\big) + \sum_{p=1}^{N} \nabla f_p(y^{(j_p)})\Big\|^2 \\
&\le 4N\sum_{p=1}^{N} \mathbb{E}\big\|\nabla f_p(x^{(j_p)}) - \nabla f_p(y^{(j_p)})\big\|^2 + 4N\sum_{p=1}^{N} \mathbb{E}\big\|\nabla f_p(y^{(j_p)})\big\|^2 \\
&\le \frac{4M^2}{N}\sum_{p=1}^{N} \mathbb{E}\|\Delta^{(j_p)}\|^2 + 4NMd \\
&\le 4M^2\,\mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2 + 4NMd.
\end{aligned}$$
By substituting all of these back, we have
$$\begin{aligned}
\sum_{i=1}^{N} \mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2 &\le \sum_{i=1}^{N} \frac{M^2}{N^2}\, \mathbb{E}\big\|x^{(k)} - x^{(k_i)}\big\|^2 \\
&\le \frac{M^2}{N^2}\sum_{i=1}^{N} \Big(2D\eta^2 \sum_{j=k-D}^{k-1} \mathbb{E}\big\|g^{(j)}\big\|^2 + 4Dd\eta\Big) \\
&\le \frac{M^2}{N^2}\sum_{i=1}^{N} \Big(2D\eta^2 \sum_{j=k-D}^{k-1} \Big(6M^2\big(\mathbb{E}\|\Delta^{(j)}\|^2 + 2D^2\eta^2 M d + 4\eta D d + \mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2\big) \\
&\qquad\qquad + 4M^2\,\mathbb{E}\|\Delta^{(j:j-D)}\|_{2,\infty}^2 + 4NMd\Big) + 4Dd\eta\Big) \\
&\le \frac{M^2}{N}\Big(4Dd\eta + 2D^2\eta^2\big(16\,\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + 24M^2 Dd\eta(\eta D M + 1) + 4NMd\big)\Big).
\end{aligned}$$
Then we can conclude this lemma.
Lemma 4. Given a positive sequence $\{a_i\}_{i=0}^{N}$ and $\rho \in (0,1)$, if we have $\frac{C}{\rho} < a_i$ for all $i \in \{1, 2, \cdots, N\}$ and $a_k \le (1-\rho)\max\big(a_{[k-1]^+}, a_{[k-2]^+}, \cdots, a_{[k-D]^+}\big) + C$, then we can conclude
$$a_k \le (1-\rho)^{\lceil k/D\rceil} a_0 + \sum_{i=1}^{\lceil k/D\rceil} (1-\rho)^{i-1} C \le \exp\big(-\rho \lceil k/D\rceil\big) a_0 + \frac{C}{\rho}.$$
Proof. For all $i \in \{1, 2, \cdots, D\}$, we have $a_i \le (1-\rho)a_0 + C < a_0$.
Then $a_{D+1} \le (1-\rho)\max(a_D, a_{D-1}, \cdots, a_1) + C \le (1-\rho)^2 a_0 + \sum_{i=1}^{2}(1-\rho)^{i-1}C < (1-\rho)a_0 + C$.
Similarly, $a_{D+2} \le (1-\rho)\max(a_{D+1}, a_D, \cdots, a_2) + C \le (1-\rho)^2 a_0 + \sum_{i=1}^{2}(1-\rho)^{i-1}C < (1-\rho)a_0 + C$.
By repeating this argument, we can conclude $a_k \le (1-\rho)^{\lceil k/D\rceil} a_0 + \sum_{i=1}^{\lceil k/D\rceil}(1-\rho)^{i-1}C$ by induction.
Since $1 - x \le e^{-x}$ and $\sum_{i=1}^{\lceil k/D\rceil}(1-\rho)^{i-1}C \le \frac{C}{\rho}$, we conclude this lemma.
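As a quick sanity check of Lemma 4 (illustration only, with hypothetical constants), the sketch below drives the recursion at its worst case $a_k = (1-\rho)\max(a_{k-1}, \cdots, a_{k-D}) + C$ and compares it with the stated bound.

```python
import math

# Illustration of Lemma 4 (not from the paper): drive the recursion at its worst case
# a_k = (1 - rho) * max(a_{k-1}, ..., a_{k-D}) + C and compare with the closed-form bound.
rho, C, D, a0, K = 0.05, 0.01, 4, 10.0, 200   # hypothetical constants with C / rho < a_0
a = [a0]
for k in range(1, K + 1):
    window = a[max(0, k - D):k]
    a.append((1 - rho) * max(window) + C)

for k in (10, 50, 100, 200):
    bound = math.exp(-rho * math.ceil(k / D)) * a0 + C / rho
    print(f"k={k:4d}  a_k={a[k]:.4f}  bound={bound:.4f}  ok={a[k] <= bound}")
```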


Proposition 1. Assuming the $M$-smoothness and $\mu$-strong convexity of $f$ and Requirement 1, if $\eta \le \min\Big\{\frac{\mu\sqrt{n}}{8\sqrt{10(N+n)}\,DM^2}, \frac{2}{\mu+M}\Big\}$, we have for all $k \ge 0$
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 \le \Big(1 - \frac{\eta\mu}{2}\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C_1\eta^3 + C_2\eta^2,$$
where both $C_1$ and $C_2$ are constants that only depend on $M, N, n, D, d, \mu$.
Proof. We give the proof sketch here. For the full proof, please refer to the Supplementary. Since $\mathbb{E}[\Psi^{(k)}|x^{(k)}] = 0$, we have
$$\begin{aligned}
\mathbb{E}\|\Delta^{(k+1)}\|^2 &= \mathbb{E}\|\Delta^{(k)} - \eta U^{(k)} - V^{(k)} + \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2 \\
&\le (1+\alpha)\,\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 + \Big(1 + \frac{1}{\alpha}\Big)\mathbb{E}\|V^{(k)} - \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2 \\
&\le (1+\alpha)\,\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 + 2\Big(1 + \frac{1}{\alpha}\Big)\big(\mathbb{E}\|V^{(k)}\|^2 + \eta^2\,\mathbb{E}\|E^{(k)}\|^2\big) + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2,
\end{aligned}$$
where the first and the second inequalities are due to Young's inequality.
By substituting the bounds in Lemma 2 and Lemma 3, we can get a one-step result for $\mathbb{E}\|\Delta^{(k+1)}\|^2$:
$$\begin{aligned}
\mathbb{E}\|\Delta^{(k+1)}\|^2 &\le (1+\alpha)(1-\eta\mu)^2\,\mathbb{E}\|\Delta^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) \\
&\qquad + \Big(4\Big(1+\frac{1}{\alpha}\Big)(N+n) + 1\Big)\frac{N\eta^2}{n}\sum_{i=1}^{N}\mathbb{E}\big\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\big\|^2 \\
&\le (1+\alpha)(1-\eta\mu)^2\,\mathbb{E}\|\Delta^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) \\
&\qquad + \frac{5N(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\frac{M^2}{N}\Big(4\eta D d + 32\eta^2 D^2 M^2\,\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + 48\eta^3 M^2 D^3 d(\eta D M + 1) + 8\eta^2 M N D^2 d\Big) \\
&\le \Big((1+\alpha)(1-\eta\mu)^2 + \frac{160 D^2 M^4 (N+n)\eta^4}{n}\Big(1+\frac{1}{\alpha}\Big)\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C,
\end{aligned}$$
where
$$C = 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) + \frac{5M^2(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\Big(4\eta D d + 48\eta^3 M^2 D^3 d(\eta D M + 1) + 8\eta^2 M N D^2 d\Big).$$
By choosing $\alpha = \eta\mu < 1$ and $\eta \le \frac{\mu\sqrt{n}}{8\sqrt{10(N+n)}\,DM^2}$, we have $(1+\alpha)(1-\eta\mu)^2 + \frac{160 D^2 M^4 (N+n)\eta^4}{n}\big(1+\frac{1}{\alpha}\big) \le 1 - \frac{\eta\mu}{2}$ and

$$\begin{aligned}
C &\le 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4 M^3 d}{3} + \eta^3 M^2 d\Big) + \frac{5M^2(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\Big(4\eta D d + 48\eta^3 M^2 D^3 d(\eta D M + 1) + 8\eta^2 M N D^2 d\Big) \\
&\le \eta^3\underbrace{\Bigg(\frac{4M^3 d}{\mu} + \frac{10 M^2 (N+n)}{n\mu}\Big(\frac{3\mu^2 D^2 n d}{40(N+n)M} + \frac{6 D^2 d\mu\sqrt{n}}{\sqrt{10(N+n)}} + 8MND^2 d\Big)\Bigg)}_{C_1} + \eta^2\underbrace{\Big(\frac{4M^2 d}{\mu} + \frac{40 M^2(N+n)Dd}{n\mu}\Big)}_{C_2}.
\end{aligned}$$

Then we can simplify the one-iteration relation into
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 \le \Big(1 - \frac{\eta\mu}{2}\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C_1\eta^3 + C_2\eta^2.$$

Theorem 1. Under the same assumptions as in Proposition 1, AGLD is guaranteed to reach $\epsilon$-accuracy in 2-Wasserstein distance in $k = O\big(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\big)$ iterations by setting $\eta = O(\epsilon^2)$.
Proof. We now aim for an $\epsilon$-accurate approximation in 2-Wasserstein distance. In order to use Lemma 4, we can assume that $\mathbb{E}\|\Delta^{(k)}\|^2 > \frac{\epsilon^2}{4}$ (otherwise, we have already reached $\epsilon/2$-accuracy) and that $\frac{C_1\eta^3}{\eta\mu/2} \le \frac{\epsilon^2}{16}$ and $\frac{C_2\eta^2}{\eta\mu/2} \le \frac{\epsilon^2}{16}$. Then, by using Lemma 4 and the fact that $|a|^2 + |b|^2 + |c|^2 \le (|a| + |b| + |c|)^2$, the Wasserstein distance between $p^{(k)}$ and $p^*$ is bounded by
$$W_2(p^{(k)}, p^*) \le \exp\Big(-\frac{\mu\eta\lceil k/(2D)\rceil}{4}\Big)W_0 + \frac{\sqrt{C_1}\,\eta}{\sqrt{\mu}} + \frac{\sqrt{C_2\eta}}{\sqrt{\mu}}.$$
Then, by requiring that $\exp\big(-\frac{\mu\eta\lceil k/(2D)\rceil}{4}\big)W_0 \le \frac{\epsilon}{2}$, $\frac{\sqrt{C_1}\,\eta}{\sqrt{\mu}} \le \frac{\epsilon}{4}$, and $\frac{\sqrt{C_2\eta}}{\sqrt{\mu}} \le \frac{\epsilon}{4}$, we have $W_2(p^{(k)}, p^*) \le \epsilon$. That is, $\eta = O(\epsilon^2)$ and $k = O\big(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\big)$.
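As a worked illustration of these scalings (with hypothetical constants $C_1$, $C_2$, $\mu$, $W_0$, $D$ that are placeholders rather than values computed in the paper), one can solve the three requirements above for $\eta$ and $k$:

```python
import math

def agld_budget(eps, C1, C2, mu, W0, D):
    """Hypothetical helper: pick eta and k satisfying the three conditions at the
    end of the proof of Theorem 1 (all constants here are illustrative placeholders).
    """
    # sqrt(C1) * eta / sqrt(mu) <= eps/4   and   sqrt(C2 * eta) / sqrt(mu) <= eps/4
    eta = min(eps * math.sqrt(mu) / (4 * math.sqrt(C1)),
              mu * eps ** 2 / (16 * C2))                 # the eta = O(eps^2) term dominates
    # exp(-mu * eta * ceil(k / (2D)) / 4) * W0 <= eps/2
    k = math.ceil(2 * D * 4 / (mu * eta) * math.log(2 * W0 / eps))
    return eta, k

for eps in (1e-1, 1e-2, 1e-3):
    eta, k = agld_budget(eps, C1=1.0, C2=1.0, mu=1.0, W0=10.0, D=5)
    print(f"eps={eps:.0e}  eta={eta:.2e}  k={k:.2e}")
```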

Improved results under additional smoothness assumptions

Under the Hessian Lipschitz continuity condition, we can improve the convergence rate of IAGLD with random access.
Hessian Lipschitz: There exists a constant $L > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L\|x - y\|_2. \qquad (6)$$
We first give a technical lemma.
Lemma 5 [(Dalalyan and Karagulyan 2017)]. Assuming the $M$-smoothness, $\mu$-strong convexity, and $L$-Hessian Lipschitz smoothness of $f$, we have
$$\mathbb{E}\|S^{(k)}\|^2 \le \frac{\eta^3 M^2 d}{3}, \qquad (7)$$
$$\mathbb{E}\|V^{(k)} - S^{(k)}\|^2 \le \frac{\eta^4(L^2 d^2 + M^3 d)}{2}, \qquad (8)$$
where $S^{(k)} = \sqrt{2}\int_{k\eta}^{(k+1)\eta}\int_{k\eta}^{s} \nabla^2 f(y(r))\, dW(r)\, ds$.

Theorem 2. Under the same assumptions as in Proposition 1 together with Hessian Lipschitz continuity, AGLD with the RA procedure can achieve $\epsilon$-accuracy after $k = O\big(\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big)$ iterations by setting $\eta = O(\epsilon)$.
Proof. The proof is similar to that of Theorem 1, but there are some key differences. First, we again give the one-iteration result. Since $\mathbb{E}[\Psi^{(k)}|x^{(k)}] = 0$, we have
$$\begin{aligned}
\mathbb{E}\|\Delta^{(k+1)}\|^2 &= \mathbb{E}\|\Delta^{(k)} - \eta U^{(k)} - (V^{(k)} - S^{(k)}) - S^{(k)} + \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2 \\
&\le (1+\alpha)\,\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)} - S^{(k)}\|^2 + \Big(1+\frac{1}{\alpha}\Big)\mathbb{E}\|V^{(k)} - S^{(k)} - \eta E^{(k)}\|^2 + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2 \\
&\le (1+\alpha)\big(\mathbb{E}\|\Delta^{(k)} - \eta U^{(k)}\|^2 + \mathbb{E}\|S^{(k)}\|^2\big) + \eta^2\,\mathbb{E}\|\Psi^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\big(\mathbb{E}\|V^{(k)} - S^{(k)}\|^2 + \eta^2\,\mathbb{E}\|E^{(k)}\|^2\big),
\end{aligned}$$
where in the second inequality, we use the fact that $\mathbb{E}(S^{(k)}|\Delta^{(k)}, U^{(k)}) = 0$. By substituting the bounds in Lemma 2, Lemma 3, and Lemma 5, we can get a one-step result for $\mathbb{E}\|\Delta^{(k+1)}\|^2$ in the same way as in Proposition 1:
$$\mathbb{E}\|\Delta^{(k+1)}\|^2 \le \Big(1 - \frac{\eta\mu}{2}\Big)\mathbb{E}\|\Delta^{(k:k-2D)}\|_{2,\infty}^2 + C_1\eta^3 + C_2\eta^2\big(1 - \mathbb{I}_{\{\mathrm{RA}\}}\big).$$
Here, we can see that for RA the $\eta^2$ term disappears, which is the reason why we can get a better result. Then, following a similar argument to the proof of Theorem 1, it can be verified that IAGLD with the RA procedure can achieve $\epsilon$-accuracy after $k = O\big(\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big)$ iterations by setting $\eta = O(\epsilon)$. However, for CA and RS, we still need $\eta = O(\epsilon^2)$ and $k = O\big(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\big)$.
References
Dalalyan, A. S., and Karagulyan, A. G. 2017. User-friendly guarantees for the langevin monte carlo with inaccurate gradient.
arXiv preprint arXiv:1710.00095.
