Appendix
Theoretical Analysis
We first list the requirement for AGLD and the assumptions on $f$.

Requirement 1. For the gradient snapshot $A^{(k)}$, we have $\alpha_i^{(k)} \in \{\nabla f_i(x^{(j)})\}_{j=k-D+1}^{k}$, where $D$ is a fixed constant.
Assumption 1 (Smoothness). Each individual $f_i$ is $\tilde{M}$-smooth. That is, $f_i$ is twice differentiable and there exists a constant $\tilde{M} > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{\tilde{M}}{2}\|x - y\|_2^2. \tag{1}$$
Accordingly, we can verify that the sum $f$ of the $f_i$'s is $M$-smooth with $M = \tilde{M}N$.
Assumption 2 (Strong Convexity). The sum $f$ is $\mu$-strongly convex. That is, there exists a constant $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|x - y\|_2^2. \tag{2}$$
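For instance, the quadratic components $f_i(x) = \frac{\tilde{M}}{2}\|x - b_i\|_2^2$ with fixed $b_i \in \mathbb{R}^d$ satisfy Assumption 1 with constant $\tilde{M}$, and their sum satisfies Assumption 2 with $\mu = \tilde{M}N = M$.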
Lemma 1. Suppose one uniformly samples $n$ data points from a dataset of size $N$ in each trial. Let $T$ denote the number of trials needed to collect all $N$ data points. Then we have
$$P\Big(T > \beta \frac{N \ln N}{n}\Big) < N^{1-\beta}. \tag{3}$$
Proof. Let $Y_i^{(k)}$ denote the event that the $i$-th sample is not selected in the first $k$ trials. Then we have
$$P\big[Y_i^{(k)}\big] = \Big(1 - \frac{n}{N}\Big)^{k} \le e^{-nk/N}.$$
Thus $P[Y_i^{(k)}] \le N^{-\beta}$ for $k = \beta N \ln N / n$. Hence, by the union bound,
$$P\Big(T > \beta \frac{N \ln N}{n}\Big) = P\Big[\bigcup_{i=1}^{N} Y_i^{(\beta N \ln N / n)}\Big] \le N\, P\big[Y_i^{(\beta N \ln N / n)}\big] \le N^{1-\beta}.$$
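As a quick numerical sanity check (not from the paper; $N$, $n$, and $\beta$ below are arbitrary choices of ours), one can estimate $P(T > \beta N \ln N / n)$ by simulation and compare it against the bound $N^{1-\beta}$:

import numpy as np

def trials_to_cover(N, n, rng):
    # Count trials of uniform n-subsets until every one of the N points is seen.
    seen = np.zeros(N, dtype=bool)
    t = 0
    while not seen.all():
        seen[rng.choice(N, size=n, replace=False)] = True
        t += 1
    return t

rng = np.random.default_rng(0)
N, n, beta = 200, 10, 2.0
threshold = beta * N * np.log(N) / n
freq = np.mean([trials_to_cover(N, n, rng) > threshold for _ in range(1000)])
print(f"empirical tail probability {freq:.4f} vs bound N^(1-beta) = {N**(1 - beta):.4f}")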
We next bound the error term $E\|E^{(k)}\|^2$:
$$E\|E^{(k)}\|^2 = E\Big\|\frac{N}{n}\Big(\sum_{i\in S_k}(\nabla f_i(x^{(k)}) - \alpha_i^{(k)}) - \sum_{i\in I_k}(\nabla f_i(x^{(k)}) - \alpha_i^{(k)})\Big)\Big\|^2$$
$$\le \frac{2N^2}{n^2}\,E\Big(\Big\|\sum_{i\in S_k}(\nabla f_i(x^{(k)}) - \alpha_i^{(k)})\Big\|^2 + \Big\|\sum_{i\in I_k}(\nabla f_i(x^{(k)}) - \alpha_i^{(k)})\Big\|^2\Big)$$
$$\le \frac{2N^2}{n^2}\,E\Big(n\sum_{i\in S_k}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 + n\sum_{i\in I_k}\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2\Big)$$
$$\le \frac{2N^2}{n^2}\Big(n\sum_{i=1}^{N}E\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 + \frac{n^2}{N}\sum_{i=1}^{N}E\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2\Big)$$
$$= \frac{2N(N+n)}{n}\sum_{i=1}^{N}E\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2.$$
In the first two inequalities, we use $\|\sum_{i=1}^{n} a_i\|^2 \le n\sum_{i=1}^{n}\|a_i\|^2$. The third inequality follows from the fact that $S_k$ is a subset of $\{1,\dots,N\}$ in CA and RS, and that the elements of $I_k$ are chosen uniformly and independently from $\{1,\dots,N\}$. When we use RA, $S_k$ simply equals $I_k$, so $E\|E^{(k)}\|^2 = 0$.
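The elementary inequality $\|\sum_{i=1}^{n} a_i\|^2 \le n\sum_{i=1}^{n}\|a_i\|^2$ invoked above is easy to verify numerically; a minimal sketch with arbitrary random vectors (ours, for illustration):

import numpy as np

# Check ||sum_i a_i||^2 <= n * sum_i ||a_i||^2 for random vectors.
rng = np.random.default_rng(1)
n, d = 8, 5
a = rng.standard_normal((n, d))
lhs = np.linalg.norm(a.sum(axis=0)) ** 2
rhs = n * (np.linalg.norm(a, axis=1) ** 2).sum()
assert lhs <= rhs
print(f"lhs = {lhs:.3f} <= rhs = {rhs:.3f}")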
Suppose that in the $k$-th iteration the snapshot $\alpha_i^{(k)}$ was taken at $x^{(k_i)}$, where $k_i \in \{(k-1)\vee 0, (k-2)\vee 0, \dots, (k-D)\vee 0\}$. By the $\tilde{M}$-smoothness of $f_i$, we have
$$\sum_{i=1}^{N}E\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 \le \tilde{M}^2\sum_{i=1}^{N}E\|x^{(k)} - x^{(k_i)}\|^2 = \frac{M^2}{N^2}\sum_{i=1}^{N}E\|x^{(k)} - x^{(k_i)}\|^2.$$
According to the update rule of $x^{(k)}$, we have
$$E\|x^{(k)} - x^{(k_i)}\|^2 = E\Big\|\eta\sum_{j=k_i}^{k-1} g^{(j)} + \sqrt{2\eta}\sum_{j=k_i}^{k-1}\xi^{(j)}\Big\|^2 \le 2D\eta^2\sum_{j=k-D}^{k-1}E\|g^{(j)}\|^2 + 4Dd\eta,$$
where the inequality follows from $\|a+b\|^2 \le 2(\|a\|^2 + \|b\|^2)$, the fact that the $\xi^{(j)}$ are independent standard Gaussian vectors, and $k_i \ge k - D$.
By expanding $g^{(j)}$, we have
$$E\|g^{(j)}\|^2 = E\Big\|\frac{N}{n}\sum_{p\in S_j}(\nabla f_p(x^{(j)}) - \alpha_p^{(j)}) + \sum_{p=1}^{N}\alpha_p^{(j)}\Big\|^2 \le \underbrace{2E\Big\|\frac{N}{n}\sum_{p\in S_j}(\nabla f_p(x^{(j)}) - \alpha_p^{(j)})\Big\|^2}_{A} + \underbrace{2E\Big\|\sum_{p=1}^{N}\alpha_p^{(j)}\Big\|^2}_{B}.$$
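Before bounding $A$ and $B$, it may help to see the estimator in code. The sketch below performs a single step consistent with the update rule $x^{(k+1)} = x^{(k)} - \eta g^{(k)} + \sqrt{2\eta}\,\xi^{(k)}$ and the expansion of $g^{(j)}$ above; the gradient oracle grad_f, the snapshot-refresh rule, and the without-replacement sampling of $S_k$ are illustrative assumptions of ours, not the authors' implementation.

import numpy as np

def agld_step(x, alpha, grad_f, eta, n, rng):
    # x      : current iterate, shape (d,)
    # alpha  : snapshot table; alpha[p] stores a delayed gradient of f_p
    #          (Requirement 1), shape (N, d)
    # grad_f : grad_f(p, x) -> gradient of the component f_p at x
    N = alpha.shape[0]
    S = rng.choice(N, size=n, replace=False)      # minibatch S_k (assumed sampling)
    # g = (N/n) * sum_{p in S} (grad f_p(x) - alpha_p) + sum_p alpha_p
    correction = sum(grad_f(p, x) - alpha[p] for p in S)
    g = (N / n) * correction + alpha.sum(axis=0)
    xi = rng.standard_normal(x.shape)             # injected Gaussian noise
    x_new = x - eta * g + np.sqrt(2.0 * eta) * xi
    for p in S:                                   # refresh sampled snapshots (assumed rule)
        alpha[p] = grad_f(p, x)
    return x_new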
For $A$, we have
$$A \le \frac{2N^2}{n^2}\, n\sum_{p\in S_j} E\big\|(\nabla f_p(x^{(j)}) - \nabla f_p(y^{(j)})) + (\nabla f_p(y^{(j)}) - \nabla f_p(y^{(j_p)})) + (\nabla f_p(y^{(j_p)}) - \alpha_p^{(j)})\big\|^2$$
$$\le \frac{6N^2}{n}\sum_{p\in S_j}\Big(E\|\nabla f_p(x^{(j)}) - \nabla f_p(y^{(j)})\|^2 + E\|\nabla f_p(y^{(j)}) - \nabla f_p(y^{(j_p)})\|^2 + E\|\nabla f_p(y^{(j_p)}) - \alpha_p^{(j)}\|^2\Big)$$
$$\le \frac{6M^2}{n}\sum_{p\in S_j}\Big(E\|x^{(j)} - y^{(j)}\|^2 + E\|y^{(j)} - y^{(j_p)}\|^2 + E\|y^{(j_p)} - x^{(j_p)}\|^2\Big)$$
$$\le 6M^2\Big(E\|\Delta^{(j)}\|^2 + 2D^2\eta^2 M d + 4\eta D d + E\|\Delta^{(j:j-D)}\|^2_{2,\infty}\Big),$$
and for $B$,
$$B \le 4M^2 E\|\Delta^{(j:j-D)}\|^2_{2,\infty} + 4NMd.$$
By substituting all these bounds back, we have
$$\sum_{i=1}^{N}E\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2 \le \frac{M^2}{N^2}\sum_{i=1}^{N}E\|x^{(k)} - x^{(k_i)}\|^2$$
$$\le \frac{M^2}{N^2}\sum_{i=1}^{N}\Big(2D\eta^2\sum_{j=k-D}^{k-1}E\|g^{(j)}\|^2 + 4Dd\eta\Big)$$
$$\le \frac{M^2}{N^2}\sum_{i=1}^{N}\Big(2D\eta^2\sum_{j=k-D}^{k-1}\Big[6M^2\big(E\|\Delta^{(j)}\|^2 + 2D^2\eta^2 Md + 4\eta Dd + E\|\Delta^{(j:j-D)}\|^2_{2,\infty}\big) + 4M^2E\|\Delta^{(j:j-D)}\|^2_{2,\infty} + 4NMd\Big] + 4Dd\eta\Big)$$
$$\le \frac{M^2}{N}\Big(4Dd\eta + 2D^2\eta^2\big(16M^2E\|\Delta^{(k:k-2D)}\|^2_{2,\infty} + 24M^2Dd\eta(\eta DM + 1) + 4NMd\big)\Big).$$
Then we can conclude this lemma.
Lemma 4. Given a positive sequence $\{a_i\}_{i=0}^{N}$ and $\rho \in (0,1)$, if we have $\frac{C}{\rho} < a_i$ for all $i \in \{1, 2, \dots, N\}$ and $a_k \le (1-\rho)\max\big(a_{[k-1]_+}, a_{[k-2]_+}, \dots, a_{[k-D]_+}\big) + C$, then we can conclude
$$a_k \le (1-\rho)^{\lceil k/D\rceil} a_0 + \sum_{i=1}^{\lceil k/D\rceil}(1-\rho)^{i-1}C \le \exp\big(-\rho\lceil k/D\rceil\big)\,a_0 + \frac{C}{\rho}.$$
Proof. For all $i \in \{1, 2, \dots, D\}$, we have $a_i \le (1-\rho)a_0 + C < a_0$.
Then $a_{D+1} \le (1-\rho)\max(a_D, a_{D-1}, \dots, a_1) + C \le (1-\rho)^2 a_0 + \sum_{i=1}^{2}(1-\rho)^{i-1}C < (1-\rho)a_0 + C$.
And $a_{D+2} \le (1-\rho)\max(a_{D+1}, a_D, \dots, a_2) + C \le (1-\rho)^2 a_0 + \sum_{i=1}^{2}(1-\rho)^{i-1}C < (1-\rho)a_0 + C$.
By repeating this argument, we can conclude $a_k \le (1-\rho)^{\lceil k/D\rceil} a_0 + \sum_{i=1}^{\lceil k/D\rceil}(1-\rho)^{i-1}C$ by induction.
Since $1 - x \le \exp(-x)$ and $\sum_{i=1}^{N}(1-\rho)^{i-1}C \le \frac{C}{\rho}$, we conclude this lemma.
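Lemma 4 is easy to sanity-check numerically. The snippet below (parameters are arbitrary choices of ours) runs the worst-case recursion with equality and verifies it stays below the claimed bound:

import numpy as np

# Worst case of the recursion a_k = (1 - rho) * max(previous D terms) + C,
# compared against the bound exp(-rho * ceil(k/D)) * a_0 + C / rho.
rho, C, D, a0, K = 0.1, 0.05, 3, 10.0, 60
a = [a0]
for k in range(1, K + 1):
    window = a[max(0, k - D):k]
    a.append((1 - rho) * max(window) + C)
for k in range(1, K + 1):
    bound = np.exp(-rho * np.ceil(k / D)) * a0 + C / rho
    assert a[k] <= bound + 1e-12
print("recursion stays below the Lemma 4 bound for all k")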
Proposition 1. Assuming the $M$-smoothness and $\mu$-strong convexity of $f$ and Requirement 1, if $\eta \le \min\Big\{\frac{\mu\sqrt{n}}{8\sqrt{10(N+n)}DM^2}, \frac{2}{\mu+M}\Big\}$, we have for all $k \ge 0$
$$E\|\Delta^{(k+1)}\|^2 \le \Big(1 - \frac{\eta\mu}{2}\Big)E\|\Delta^{(k:k-2D)}\|^2_{2,\infty} + C_1\eta^3 + C_2\eta^2,$$
where both $C_1$ and $C_2$ are constants that depend only on $M, N, n, D, d, \mu$.
Proof. We give the proof sketch here. For the full proof, please refer to the Supplementary. Since $E[\Psi^{(k)} \mid x^{(k)}] = 0$, we have
$$E\|\Delta^{(k+1)}\|^2 = E\|\Delta^{(k)} - \eta U^{(k)} - V^{(k)} + \eta E^{(k)}\|^2 + \eta^2 E\|\Psi^{(k)}\|^2$$
$$\le (1+\alpha)E\|\Delta^{(k)} - \eta U^{(k)}\|^2 + \Big(1+\frac{1}{\alpha}\Big)E\|V^{(k)} + \eta E^{(k)}\|^2 + \eta^2 E\|\Psi^{(k)}\|^2$$
$$\le (1+\alpha)E\|\Delta^{(k)} - \eta U^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\big(E\|V^{(k)}\|^2 + \eta^2 E\|E^{(k)}\|^2\big) + \eta^2 E\|\Psi^{(k)}\|^2,$$
where the first and the second inequalities are due to Young's inequality, $\|a+b\|^2 \le (1+\alpha)\|a\|^2 + (1+\frac{1}{\alpha})\|b\|^2$ for any $\alpha > 0$.
By substituting the bounds from Lemma 2 and Lemma 3, we can get a one-step result for $E\|\Delta^{(k+1)}\|^2$:
$$E\|\Delta^{(k+1)}\|^2 \le (1+\alpha)(1-\eta\mu)^2 E\|\Delta^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4M^3d}{3} + \eta^3M^2d\Big) + \Big(4\Big(1+\frac{1}{\alpha}\Big)(N+n) + 1\Big)\frac{N\eta^2}{n}\sum_{i=1}^{N}E\|\nabla f_i(x^{(k)}) - \alpha_i^{(k)}\|^2$$
$$\le (1+\alpha)(1-\eta\mu)^2 E\|\Delta^{(k)}\|^2 + 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4M^3d}{3} + \eta^3M^2d\Big) + \frac{5N(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\frac{M^2}{N}\Big(4\eta Dd + 32\eta^2D^2M^2E\|\Delta^{(k:k-2D)}\|^2_{2,\infty} + 48\eta^3M^2D^3d(\eta DM + 1) + 8\eta^2MND^2d\Big)$$
$$\le \Big((1+\alpha)(1-\eta\mu)^2 + \frac{160D^2M^4(N+n)\eta^4}{n}\Big(1+\frac{1}{\alpha}\Big)\Big)E\|\Delta^{(k:k-2D)}\|^2_{2,\infty} + C,$$
where $C = 2(1+\frac{1}{\alpha})\big(\frac{\eta^4M^3d}{3} + \eta^3M^2d\big) + \frac{5M^2(N+n)\eta^2}{n}(1+\frac{1}{\alpha})\big(4\eta Dd + 48\eta^3M^2D^3d(\eta DM + 1) + 8\eta^2MND^2d\big)$.
By choosing $\alpha = \eta\mu < 1$ and $\eta \le \frac{\mu\sqrt{n}}{8\sqrt{10(N+n)}DM^2}$, we have $(1+\alpha)(1-\eta\mu)^2 + \frac{160D^2M^4(N+n)\eta^4}{n}\big(1+\frac{1}{\alpha}\big) \le 1 - \frac{\eta\mu}{2}$ and
$$C \le 2\Big(1+\frac{1}{\alpha}\Big)\Big(\frac{\eta^4M^3d}{3} + \eta^3M^2d\Big) + \frac{5M^2(N+n)\eta^2}{n}\Big(1+\frac{1}{\alpha}\Big)\Big(4\eta Dd + 48\eta^3M^2D^3d(\eta DM + 1) + 8\eta^2MND^2d\Big)$$
$$\le \eta^3\underbrace{\Bigg(\frac{4M^3d}{\mu} + \frac{10M^2(N+n)}{n\mu}\Big(\frac{3\mu^2D^2nd}{40(N+n)M} + \frac{6D^2d\mu\sqrt{n}}{\sqrt{10(N+n)}} + 8MND^2d\Big)\Bigg)}_{C_1} + \eta^2\underbrace{\Bigg(\frac{4M^2d}{\mu} + \frac{40M^2(N+n)Dd}{n\mu}\Bigg)}_{C_2}.$$
Theorem 1. Under the same assumptions as in Proposition 1, AGLD is guaranteed to reach $\epsilon$-accuracy in 2-Wasserstein distance within $k = O\big(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\big)$ iterations by setting $\eta = O(\epsilon^2)$.
Proof. We now aim for an $\epsilon$-accurate approximation in 2-Wasserstein distance. In order to use Lemma 4, we may assume that $E\|\Delta^{(k)}\|^2 > \frac{\epsilon^2}{4}$ (for otherwise we already have $\epsilon/2$-accuracy) and require $\frac{C_1\eta^3}{\eta\mu/2} \le \frac{\epsilon^2}{16}$ and $\frac{C_2\eta^2}{\eta\mu/2} \le \frac{\epsilon^2}{16}$. Then by using Lemma 4 and the fact that $|a|^2 + |b|^2 + |c|^2 \le (|a| + |b| + |c|)^2$, the Wasserstein distance between $p^{(k)}$ and $p^*$ is bounded by
$$W_2(p^{(k)}, p^*) \le \exp\Big(-\frac{\mu\eta\lceil k/(2D)\rceil}{4}\Big)W_0 + \frac{\sqrt{C_1}\,\eta}{\sqrt{\mu}} + \frac{\sqrt{C_2\eta}}{\sqrt{\mu}}.$$
Then by requiring that $\exp\big(-\frac{\mu\eta\lceil k/(2D)\rceil}{4}\big)W_0 \le \frac{\epsilon}{2}$, $\frac{\sqrt{C_1}\,\eta}{\sqrt{\mu}} \le \frac{\epsilon}{4}$, and $\frac{\sqrt{C_2\eta}}{\sqrt{\mu}} \le \frac{\epsilon}{4}$, we have $W_2(p^{(k)}, p^*) \le \epsilon$. That is, $\eta = O(\epsilon^2)$ and $k = O\big(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\big)$.
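As a usage note, the theorem turns a target accuracy $\epsilon$ into a step size and an iteration budget. In the sketch below, c_eta and c_k are unspecified placeholder constants of ours (in the paper they would depend on $M, N, n, D, d, \mu$), not values from the analysis:

import math

def agld_budget(eps, c_eta=1.0, c_k=1.0):
    # eta = O(eps^2) and k = O(eps^-2 * log(1/eps)) per Theorem 1,
    # with hypothetical constants c_eta and c_k.
    eta = c_eta * eps ** 2
    k = math.ceil(c_k * (1.0 / eps ** 2) * math.log(1.0 / eps))
    return eta, k

print(agld_budget(0.1))  # eps = 0.1 -> eta = 0.01, k = 231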