
LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

Tianyi Chen⋆   Georgios B. Giannakis⋆   Tao Sun†,∗   Wotao Yin∗

⋆ University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA
† National University of Defense Technology, Changsha, Hunan 410073, China
∗ University of California - Los Angeles, Los Angeles, CA 90095, USA
{chen3827,[email protected]}  [email protected]  [email protected]

Abstract

This paper presents a new class of gradient methods for distributed machine learning that adaptively
skip the gradient calculations to learn with reduced communication and computation. Simple rules
are designed to detect slowly-varying gradients and, therefore, trigger the reuse of outdated gradients.
The resultant gradient-based algorithms are termed Lazily Aggregated Gradient — justifying our
acronym LAG used henceforth. Theoretically, the merits of this contribution are: i) the convergence
rate is the same as batch gradient descent in strongly-convex, convex, and nonconvex cases; and,
ii) if the distributed datasets are heterogeneous (quantified by certain measurable constants), the
communication rounds needed to achieve a targeted accuracy are reduced thanks to the adaptive
reuse of lagged gradients. Numerical experiments on both synthetic and real data corroborate a
significant communication reduction compared to alternatives.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction
In this paper, we develop communication-efficient algorithms to solve the following problem

min_{θ∈ℝ^d} L(θ)   with   L(θ) := ∑_{m∈M} L_m(θ)                                              (1)

where θ ∈ Rd is the unknown vector, L and {Lm , m ∈ M} are smooth (but not necessarily convex)
functions with M := {1, . . . , M }. Problem (1) naturally arises in a number of areas, such as
multi-agent optimization [1], distributed signal processing [2], and distributed machine learning [3].
Considering the distributed machine learning paradigm, each L_m is itself a sum of functions, e.g.,
L_m(θ) := ∑_{n∈N_m} ℓ_n(θ), where ℓ_n is the loss function (e.g., the square or the logistic loss) with respect
to the model vector θ evaluated at the training sample x_n; that is, ℓ_n(θ) := ℓ(θ; x_n).
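For concreteness, the finite-sum structure in (1) can be sketched in a few lines; the square loss, the data split, and the helper names below are our own illustrative choices (in Python/NumPy), not part of the paper.

```python
# A minimal sketch of problem (1) with the square loss: the global objective is the sum
# of local losses L_m, each itself a sum over the samples held by worker m.
import numpy as np

def local_loss(theta, X_m, y_m):
    # L_m(theta) = sum_{n in N_m} (x_n^T theta - y_n)^2
    residual = X_m @ theta - y_m
    return float(residual @ residual)

def global_loss(theta, shards):
    # L(theta) = sum_{m in M} L_m(theta), cf. (1); shards = [(X_1, y_1), ..., (X_M, y_M)]
    return sum(local_loss(theta, X_m, y_m) for X_m, y_m in shards)
```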
While machine learning tasks are traditionally carried out at a single server, for datasets with massive
samples {xn }, running gradient-based iterative algorithms at a single server can be prohibitively
slow; e.g., the server needs to sequentially compute gradient components given limited processors.
A simple yet popular solution in recent years is to parallelize the training across multiple computing
units (a.k.a. workers) [3]. Specifically, assuming the batch of samples is distributedly stored across a
total of M workers, with worker m ∈ M holding the samples {x_n, n ∈ N_m}, a globally shared
model θ is updated at the central server by aggregating the gradients computed by the workers. Due
to bandwidth and privacy concerns, each worker m does not upload its data {x_n, n ∈ N_m} to the
server; thus, the learning task needs to be performed by iteratively communicating with the server.
We are particularly interested in the scenarios where communication between the central server and
the local workers is costly, as is the case in the Federated Learning setting [4, 5], cloud-edge
AI systems [6], and, more broadly, the emerging Internet-of-Things paradigm [7]. In those cases,
communication latency is the bottleneck of the overall performance. More precisely, the communication
latency results from initiating communication links, queueing, and propagating the message. For sending
small messages, e.g., the d-dimensional model θ or the aggregated gradient, this latency dominates the
message-size-dependent transmission latency. It is therefore important to reduce the number of
communication rounds, even more so than the number of bits per round. In short, our goal is to find
the model parameter θ that minimizes (1) with as little communication overhead as possible.

1.1 Prior art

To put our work in context, we review prior contributions that we group in two categories.
Large-scale machine learning. Solving (1) at a single server has been extensively studied for large-
scale learning tasks, where the “workhorse approach” is the simple yet efficient stochastic gradient
descent (SGD) [8, 9]. Despite its low per-iteration complexity, the inherent variance prevents SGD
from achieving fast convergence. Recent advances leverage the so-termed variance reduction
techniques to achieve both low complexity and fast convergence [10–12]. For learning beyond
a single server, distributed parallel machine learning is an attractive solution to tackle large-scale
learning tasks, where the parameter server architecture is the most commonly used one [3, 13]. Dif-
ferent from the single server case, parallel implementation of the batch gradient descent (GD) is a
popular choice, since SGD, despite its low per-iteration complexity, requires a large number of iterations
and thus communication rounds [14]. For traditional parallel learning algorithms, however, latency, band-
width limits, and unexpected drains on resources that delay the update of even a single worker will
slow down the entire system. Recent research efforts in this line have been centered on
understanding asynchronous-parallel algorithms to speed up machine learning by eliminating costly
synchronization; e.g., [15–20]. All these approaches reduce either the computational complexity or
the run time, but they do not save communication.
Communication-efficient learning. When learning goes beyond a single server, the high communication
overhead becomes the bottleneck of the overall system performance [14], and communication-efficient
learning algorithms have gained popularity [21, 22]. Distributed learning approaches have been de-
veloped based on quantized (gradient) information, e.g., [23–26], but they only reduce the required
bandwidth per communication, not the rounds. For machine learning tasks where the loss function
is convex and its conjugate dual is expressible, the dual coordinate ascent-based approaches have
been demonstrated to yield impressive empirical performance [5, 27, 28]. But these algorithms run
in a double-loop manner, and the communication reduction has not been formally quantified. To
reduce communication by accelerating convergence, approaches leveraging (inexact) second-order
information have been studied in [29, 30]. Roughly speaking, algorithms in [5, 27–30] reduce com-
munication by increasing local computation (relative to GD), while our method does not increase lo-
cal computation. In settings different from the one considered in this paper, communication-efficient
approaches have been recently studied with triggered communication protocols [31, 32]. Aside from
convergence guarantees, however, no theoretical justification for communication reduction has been
established in [31]. While a sublinear convergence rate can be achieved by the algorithms in [32], the
proposed gradient selection rule is nonadaptive and requires double-loop iterations.

1.2 Our contributions

Before introducing our approach, we revisit the popular GD method for (1) in the setting of one
parameter server and M workers. At iteration k, the server broadcasts the current model θ^k to all
the workers; every worker m ∈ M computes ∇L_m(θ^k) and uploads it to the server; and once it has
received the gradients from all workers, the server updates the model parameters via

GD iteration    θ^{k+1} = θ^k − α∇^k_GD   with   ∇^k_GD := ∑_{m∈M} ∇L_m(θ^k)                  (2)

where α is a stepsize, and ∇^k_GD is an aggregated gradient that summarizes the model change. To
implement (2), the server has to communicate with all workers to obtain fresh {∇L_m(θ^k)}.
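The update (2) has a direct parameter-server implementation; a minimal sketch follows (Python for illustration, whereas the paper's experiments use MATLAB), where worker_grads stands for an assumed list of per-worker gradient oracles.

```python
# A minimal sketch of the batch GD iteration (2): the server collects a fresh gradient
# from every one of the M workers at every iteration and sums them.
import numpy as np

def gd_step(theta, worker_grads, alpha):
    grad_gd = sum(g(theta) for g in worker_grads)   # M uploads per iteration, one per worker
    return theta - alpha * grad_gd                  # theta^{k+1} = theta^k - alpha * grad^k_GD
```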

In this context, the present paper puts forward a new batch gradient method (as simple as GD)
that can skip communication at certain rounds, which justifies the term Lazily Aggregated Gradient
(LAG).
            Communication                          Computation                         Memory
Algorithm   PS→WK_m             WK_m→PS             PS           WK_m                   PS                   WK_m
GD          θ^k                 ∇L_m(θ^k)           (2)          ∇L_m(θ^k)              θ^k                  /
LAG-PS      θ^k, if m ∈ M^k     δ∇^k_m, if m ∈ M^k  (4), (12b)   ∇L_m(θ^k), if m ∈ M^k  θ^k, ∇^k, {θ̂^k_m}   ∇L_m(θ̂^k_m)
LAG-WK      θ^k                 δ∇^k_m, if m ∈ M^k  (4)          ∇L_m(θ^k), (12a)       θ^k, ∇^k             ∇L_m(θ̂^k_m)

Table 1: A comparison of communication, computation and memory requirements. PS denotes the
parameter server, WK denotes the worker, PS→WK_m is the communication link from the server to
worker m, and WK_m→PS is the communication link from worker m to the server.
With its derivations deferred to Section 2, LAG resembles (2), and is given by

LAG iteration    θ^{k+1} = θ^k − α∇^k   with   ∇^k := ∑_{m∈M} ∇L_m(θ̂^k_m)                    (3)

where each ∇L_m(θ̂^k_m) is either ∇L_m(θ^k), when θ̂^k_m = θ^k, or an outdated gradient that has been
computed using an old copy θ̂^k_m ≠ θ^k. Instead of requesting a fresh gradient from every worker as
in (2), the twist is to obtain ∇^k by refining the previous aggregated gradient ∇^{k−1}; that is, using
only the new gradients from the selected workers in M^k, while reusing the outdated gradients from
the rest of the workers. Therefore, with θ̂^k_m := θ^k, ∀m ∈ M^k, and θ̂^k_m := θ̂^{k−1}_m, ∀m ∉ M^k,
LAG in (3) is equivalent to

LAG iteration    θ^{k+1} = θ^k − α∇^k   with   ∇^k = ∇^{k−1} + ∑_{m∈M^k} δ∇^k_m               (4)

where δ∇^k_m := ∇L_m(θ^k) − ∇L_m(θ̂^{k−1}_m) is the difference between the two evaluations of ∇L_m
at the current iterate θ^k and the old copy θ̂^{k−1}_m. If ∇^{k−1} is stored at the server, this simple
modification scales down the per-iteration communication rounds from GD's M to LAG's |M^k|.
We develop two different rules to select M^k. The first rule is adopted by the parameter server (PS),
and the second one by every worker (WK). At iteration k,

LAG-PS: the server determines M^k and sends θ^k to the workers in M^k; each worker m ∈ M^k
computes ∇L_m(θ^k) and uploads δ∇^k_m; each worker m ∉ M^k does nothing; the server updates via (4);

LAG-WK: the server broadcasts θ^k to all workers; every worker computes ∇L_m(θ^k) and checks
whether it belongs to M^k; only the workers in M^k upload δ∇^k_m; the server updates via (4).

See a comparison of the two LAG variants with GD in Table 1.
Naively reusing outdated gradients, while saving communication per iteration, can increase the total
number of iterations. To keep this number in check, we judiciously design our simple trigger rules so
that LAG can: i) achieve the same order of convergence rates (thus iteration complexities) as batch GD
in the strongly-convex, convex, and nonconvex smooth cases; and, ii) require reduced communication
to achieve a targeted learning accuracy, when the distributed datasets are heterogeneous (measured by
a certain quantity specified later). In certain learning settings, LAG requires only O(1/M) of the
communication of GD. Empirically, we found that LAG can reduce the communication required by GD
and other distributed learning methods by an order of magnitude.

Figure 1: LAG in a parameter server setup.
Notation. Bold lowercase letters denote column vectors; (·)⊤ denotes transposition, and ∥x∥ denotes
the ℓ2-norm of x. Inequalities between vectors, e.g., x > 0, are defined entrywise.

2 LAG: Lazily Aggregated Gradient Approach


In this section, we formally develop our LAG method, and present the intuition and basic principles
behind its design. The original idea of LAG comes from a simple rewriting of the GD iteration (2)
as

θ^{k+1} = θ^k − α ∑_{m∈M} ∇L_m(θ^{k−1}) − α ∑_{m∈M} (∇L_m(θ^k) − ∇L_m(θ^{k−1})).              (5)

Let us view ∇L_m(θ^k) − ∇L_m(θ^{k−1}) as a refinement of ∇L_m(θ^{k−1}), and recall that obtaining this
refinement requires a round of communication between the server and the worker m. Therefore, to
save communication, we can skip the server's communication with worker m if this refinement is
small compared to the old gradient; that is, ∥∇L_m(θ^k) − ∇L_m(θ^{k−1})∥ ≪ ∥∑_{m∈M} ∇L_m(θ^{k−1})∥.

Generalizing on this intuition, given the generic outdated gradient components {∇L_m(θ̂^{k−1}_m)} with
θ̂^{k−1}_m = θ^{k−1−τ^{k−1}_m} for a certain τ^{k−1}_m ≥ 0, if communicating with some workers will bring
only small gradient refinements, we skip those communications (contained in the set M^k_c) and end up with

θ^{k+1} = θ^k − α ∑_{m∈M} ∇L_m(θ̂^{k−1}_m) − α ∑_{m∈M^k} (∇L_m(θ^k) − ∇L_m(θ̂^{k−1}_m))        (6a)

        = θ^k − α∇L(θ^k) − α ∑_{m∈M^k_c} (∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k))                        (6b)

where M^k and M^k_c are the sets of workers that do and do not communicate with the server,
respectively. It is easy to verify that (6) is identical to (3) and (4). Comparing (2) with (6b), when
M^k_c includes more workers, more communication is saved, but θ^k is updated by a coarser gradient.
Key to addressing this communication versus accuracy tradeoff is a principled criterion for selecting
the subset of workers M^k_c that do not communicate with the server at each round. To achieve this
“sweet spot,” we will rely on the fundamental descent lemma. For GD, it is given as follows [33].

Lemma 1 (GD descent in objective) Suppose L(θ) is L-smooth, and θ̄^{k+1} is generated by running
one GD iteration (2) given θ^k and stepsize α. Then the objective values satisfy

L(θ̄^{k+1}) − L(θ^k) ≤ −(α − α²L/2) ∥∇L(θ^k)∥² := Δ^k_GD(θ^k).                                (7)
Likewise, for our desired iteration (6), the following holds; its proof is given in the Supplement.

Lemma 2 (LAG descent in objective) Suppose L(θ) is L-smooth, and θ^{k+1} is generated by running
one LAG iteration (4) given θ^k. Then the objective values satisfy (cf. δ∇^k_m in (4))

L(θ^{k+1}) − L(θ^k) ≤ −(α/2) ∥∇L(θ^k)∥² + (α/2) ∥∑_{m∈M^k_c} δ∇^k_m∥² + (L/2 − 1/(2α)) ∥θ^{k+1} − θ^k∥² := Δ^k_LAG(θ^k).    (8)

Lemmas 1 and 2 estimate the objective value descent obtained by performing one iteration of GD and
LAG, respectively, conditioned on a common iterate θ^k. GD finds Δ^k_GD(θ^k) by performing
M rounds of communication with all the workers, while LAG yields Δ^k_LAG(θ^k) by performing only
|M^k| rounds of communication with a selected subset of workers. Our pursuit is to select M^k to
ensure that LAG enjoys a larger per-communication descent than GD; that is,

Δ^k_LAG(θ^k)/|M^k| ≤ Δ^k_GD(θ^k)/M.                                                           (9)

Choosing the standard α = 1/L, we can show that in order to guarantee (9), it is sufficient to have
(see the supplementary material for the derivation)

∥∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k)∥² ≤ ∥∇L(θ^k)∥²/M²,   ∀m ∈ M^k_c.                               (10)

However, directly checking (10) at each worker is expensive, since obtaining ∥∇L(θ^k)∥² requires
information from all the workers. Instead, we approximate ∥∇L(θ^k)∥² in (10) by

∥∇L(θ^k)∥² ≈ (1/α²) ∑_{d=1}^{D} ξ_d ∥θ^{k+1−d} − θ^{k−d}∥²                                    (11)

where {ξ_d}_{d=1}^{D} are constant weights, and the constant D determines the number of recent iterate
changes that LAG incorporates to approximate the current gradient. The rationale here is that, since L
is smooth, ∇L(θ^k) cannot be very different from the recent gradients or the recent iterate lags.
Building upon (10) and (11), we will include worker m in M^k_c of (6) if it satisfies

LAG-WK condition    ∥∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k)∥² ≤ (1/(α²M²)) ∑_{d=1}^{D} ξ_d ∥θ^{k+1−d} − θ^{k−d}∥².    (12a)

Algorithm 1 LAG-WK
1: Input: Stepsize α > 0, and threshold {ξ_d}.
2: Initialize: θ^1, {∇L_m(θ̂^0_m), ∀m}.
3: for k = 1, 2, . . . , K do
4:   Server broadcasts θ^k to all workers.
5:   for worker m = 1, . . . , M do
6:     Worker m computes ∇L_m(θ^k).
7:     Worker m checks condition (12a).
8:     if worker m violates (12a) then
9:       Worker m uploads δ∇^k_m.
10:      ▷ Save ∇L_m(θ̂^k_m) = ∇L_m(θ^k)
11:    else
12:      Worker m uploads nothing.
13:    end if
14:  end for
15:  Server updates via (4).
16: end for

Algorithm 2 LAG-PS
1: Input: Stepsize α > 0, {ξ_d}, and L_m, ∀m.
2: Initialize: θ^1, {θ̂^0_m, ∇L_m(θ̂^0_m), ∀m}.
3: for k = 1, 2, . . . , K do
4:   for worker m = 1, . . . , M do
5:     Server checks condition (12b).
6:     if worker m violates (12b) then
7:       Server sends θ^k to worker m.
8:       ▷ Save θ̂^k_m = θ^k at server
9:       Worker m computes ∇L_m(θ^k).
10:      Worker m uploads δ∇^k_m.
11:    else
12:      No actions at server and worker m.
13:    end if
14:  end for
15:  Server updates via (4).
16: end for

Table 2: A comparison of LAG-WK and LAG-PS.

Condition (12a) is checked at the worker side after each worker receives θ^k from the server and
computes its ∇L_m(θ^k). If broadcasting is also costly, we can resort to the following server-side rule:

LAG-PS condition    L²_m ∥θ̂^{k−1}_m − θ^k∥² ≤ (1/(α²M²)) ∑_{d=1}^{D} ξ_d ∥θ^{k+1−d} − θ^{k−d}∥².          (12b)

The values of {ξ_d} and D admit simple choices, e.g., ξ_d = 1/D, ∀d, with D = 10 used in the simulations.
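As a concrete illustration of the server-side rule, a minimal helper for checking (12b) might look as follows; the function name and calling convention are our own assumptions, not the paper's code.

```python
# A minimal sketch of the LAG-PS check (12b): worker m can be skipped when
# L_m^2 * ||theta_hat_m - theta_k||^2 is below the weighted sum of recent iterate changes.
import numpy as np

def lag_ps_skip(L_m, theta_hat_m, theta_k, recent_diffs_sq, xi, alpha, M):
    """recent_diffs_sq[d-1] = ||theta^{k+1-d} - theta^{k-d}||^2 for d = 1, ..., D (or fewer)."""
    lhs = L_m**2 * float(np.sum((theta_hat_m - theta_k)**2))
    rhs = sum(x * d2 for x, d2 in zip(xi, recent_diffs_sq)) / (alpha**2 * M**2)
    return lhs <= rhs   # True: reuse worker m's outdated gradient, no communication this round
```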
LAG-WK vs LAG-PS. To perform (12a), the server needs to broadcast the current model θ^k, and
all the workers need to compute their gradients; to perform (12b), the server needs an estimate of the
smoothness constant L_m of every local function. On the other hand, as will be shown in Section 3,
(12a) and (12b) lead to the same worst-case convergence guarantees. In practice, however, the
server-side condition is more conservative than the worker-side one at communication reduction,
because the smoothness of L_m readily implies that satisfying (12b) necessarily satisfies (12a), but not
vice versa. Empirically, (12a) leads to a larger M^k_c than (12b), and thus extra communication
overhead is saved. Hence, (12a) and (12b) can be chosen according to users' preferences. LAG-WK
and LAG-PS are summarized as Algorithms 1 and 2.
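To make the lazy aggregation concrete, below is a minimal single-process simulation of LAG-WK, i.e., recursion (3)-(4) with the worker-side trigger (12a) and ξ_d = 1/D. The quadratic local losses, the function name run_lag_wk, and all bookkeeping choices are our own illustrative assumptions; the paper's implementation is in MATLAB.

```python
# A sketch of LAG-WK: every worker computes its fresh gradient, but only workers that
# violate (12a) upload the correction delta; the server refines the aggregate via (4).
import numpy as np

def run_lag_wk(grads, L_smooth, theta0, K=500, D=10):
    """grads[m](theta) returns the gradient of L_m; L_smooth is the smoothness constant of L."""
    M = len(grads)
    alpha = 1.0 / L_smooth
    xi = np.full(D, 1.0 / D)                       # xi_d = 1/D, as in the simulations
    theta = theta0.copy()
    old_grads = [g(theta) for g in grads]          # gradients at each worker's last upload
    agg = sum(old_grads)                           # aggregated gradient stored at the server
    diffs_sq = []                                  # ||theta^{k+1-d} - theta^{k-d}||^2, most recent first
    uploads = 0
    for _ in range(K):
        rhs = sum(x * d2 for x, d2 in zip(xi, diffs_sq)) / (alpha**2 * M**2)
        for m in range(M):
            g_new = grads[m](theta)                # every worker computes a fresh gradient
            delta = g_new - old_grads[m]
            if float(delta @ delta) > rhs:         # worker m violates (12a): upload delta
                agg = agg + delta                  # server-side refinement, cf. (4)
                old_grads[m] = g_new
                uploads += 1
        theta_next = theta - alpha * agg           # LAG update (3)
        diffs_sq.insert(0, float((theta_next - theta) @ (theta_next - theta)))
        diffs_sq = diffs_sq[:D]
        theta = theta_next
    return theta, uploads

# Toy run with M = 9 quadratic losses L_m(theta) = 0.5 * c_m * ||theta - b_m||^2, whose
# smoothness constants c_m = (1.3^{m-1} + 1)^2 increase across workers as in Section 4.
rng = np.random.default_rng(0)
c = (1.3 ** np.arange(9) + 1.0) ** 2
b = [rng.standard_normal(4) for _ in range(9)]
grads = [lambda th, cm=cm, bm=bm: cm * (th - bm) for cm, bm in zip(c, b)]
theta, uploads = run_lag_wk(grads, L_smooth=float(c.sum()), theta0=np.zeros(4))
print("LAG-WK uploads:", uploads, "vs", 500 * 9, "for batch GD")
```

In such a run, workers with smaller c_m tend to satisfy (12a) more often and hence upload less frequently, which is the behavior formalized by Lemma 4 in Section 3.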
Regarding our proposed LAG method, three remarks are in order.
R1) With recursive update of the lagged gradients in (4) and the lagged iterates in (12), implementing
LAG is as simple as GD; see Table 1. Both empirically and theoretically, we will further demonstrate
that using lagged gradients even reduces the overall delay by cutting down costly communication.
R2) Although both LAG and the asynchronous-parallel algorithms in [15–20] leverage stale gradients,
they are very different. LAG actively creates staleness and, by design, reduces total communication
despite the staleness. Asynchronous algorithms passively receive staleness and increase total
communication due to the staleness, but they save run time.
R3) Compared with existing efforts for communication-efficient learning such as quantized gradients,
Nesterov's acceleration, dual coordinate ascent, and second-order methods, LAG is complementary to
all of them: LAG can be combined with these methods to develop even more powerful learning
schemes. An extension to a proximal LAG that covers nonsmooth regularizers is also possible.

3 Iteration and communication complexity

In this section, we establish the convergence of LAG, under the following standard conditions.
Assumption 1: Each loss function L_m(θ) is L_m-smooth, and L(θ) is L-smooth.
Assumption 2: L(θ) is convex and coercive.
Assumption 3: L(θ) is µ-strongly convex.

The subsequent convergence analysis critically builds on the following Lyapunov function:

V^k := L(θ^k) − L(θ^*) + ∑_{d=1}^{D} β_d ∥θ^{k+1−d} − θ^{k−d}∥²                               (13)

where θ^* is the minimizer of (1), and {β_d} is a sequence of constants to be determined later.
We start with the sufficient descent of V^k in (13).
Lemma 3 (descent lemma) Under Assumption 1, if α and {ξ_d} are chosen properly, there exist
constants c_0, . . . , c_D ≥ 0 such that the Lyapunov function in (13) satisfies

V^{k+1} − V^k ≤ −c_0 ∥∇L(θ^k)∥² − ∑_{d=1}^{D} c_d ∥θ^{k+1−d} − θ^{k−d}∥²                      (14)

which implies the descent of our Lyapunov function, that is, V^{k+1} ≤ V^k.

Lemma 3 is a generalization of GD's descent lemma. As specified in the supplementary material,
under properly chosen {ξ_d}, any stepsize α ∈ (0, 2/L), including α = 1/L, guarantees (14), matching
the stepsize region of GD. With M^k = M and β_d = 0, ∀d, in (13), Lemma 3 reduces to Lemma 1.

3.1 Convergence in strongly convex case

We first present the convergence under the smooth and strongly convex condition.

Theorem 1 (strongly convex case) Under Assumptions 1-3, the iterates {θ^k} of LAG satisfy

L(θ^K) − L(θ^*) ≤ (1 − c(α; {ξ_d}))^K V^0                                                     (15)

where θ^* is the minimizer of L(θ) in (1), and c(α; {ξ_d}) ∈ (0, 1) is a constant depending on α, {ξ_d},
{β_d}, and the condition number κ := L/µ, which are specified in the supplementary material.
Iteration complexity. The iteration complexity in its generic form is complicated, since c(α; {ξ_d})
depends on the choice of several parameters. Specifically, if we choose the parameters as

ξ_1 = · · · = ξ_D := ξ < 1/D,   α := (1 − √(Dξ))/L,   and   β_d := (D − d + 1)/(2α√(D/ξ)),  d = 1, . . . , D,    (16)

then, following Theorem 1, the iteration complexity of LAG in this case is

I_LAG(ϵ) = (κ/(1 − √(Dξ))) log(ϵ⁻¹).                                                          (17)

The iteration complexity in (17) is on the same order of GD’s iteration complexity κ log(ϵ−1 ), but
has a worse constant. This is the consequence of using a smaller stepsize in (16) (relative to α = 1/L
in GD) to simplify the choice of other parameters. In practice, LAG with α = 1/L can achieve
almost the same empirical iteration complexity as GD; see Section 4. Building on the iteration
complexity, we study next the communication complexity of LAG. In the setting of our interest, we
define the communication complexity as the total number of uploads over all the workers needed to
achieve accuracy ϵ. While the accuracy refers to the objective optimality error in the strongly convex
case, it is considered as the gradient norm in general (non)convex cases.
The power of LAG is best illustrated by numerical examples; see an example of LAG-WK in Figure 2.
Clearly, workers with a small smoothness constant communicate with the server less frequently.
This intuition is formally treated in the next lemma.

Lemma 4 (lazy communication) Define the importance factor of every worker m as H(m) := L_m/L.
If the stepsize α and the constants {ξ_d} in the conditions (12) satisfy ξ_D ≤ · · · ≤ ξ_d ≤ · · · ≤ ξ_1,
and worker m satisfies

H²(m) ≤ ξ_d/(d α² L² M²) := γ_d                                                               (18)

then, until the k-th iteration, worker m communicates with the server at most k/(d + 1) rounds.

Lemma 4 asserts that if worker m has a small L_m (a close-to-linear loss function) such that
H²(m) ≤ γ_d, then under LAG it communicates with the server at most k/(d + 1) rounds.
This is in contrast to the total of k communication rounds involved per worker under GD. Ideally,
we want as many workers as possible satisfying (18), especially for large d.

To quantify the overall communication reduction, we define the heterogeneity score function as

h(γ) := (1/M) ∑_{m∈M} 1(H²(m) ≤ γ)                                                            (19)

where the indicator 1(·) equals 1 when H²(m) ≤ γ holds, and 0 otherwise. Clearly, h(γ) is a
nondecreasing function of γ that depends on the distribution of the smoothness constants
L_1, L_2, . . . , L_M. It is also instructive to view it as the cumulative distribution function of the
deterministic quantity H²(m), implying h(γ) ∈ [0, 1]. Putting it in our context, the critical quantity
h(γ_d) lower bounds the fraction of workers that communicate with the server at most k/(d + 1)
rounds until the k-th iteration. We are now ready to present the communication complexity.

Figure 2: Communication events of workers 1, 3, 5, 7, 9 over 1,000 iterations. Each stick is an upload.
A setup with L_1 < . . . < L_9.
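As an aside, the score (19) is easy to evaluate for a given set of smoothness constants; the short sketch below is our own, and it assumes L = ∑_m L_m (in general this sum only upper-bounds L).

```python
# A minimal sketch of the heterogeneity score (19): the fraction of workers whose
# importance factor H(m) = L_m / L satisfies H(m)^2 <= gamma.
import numpy as np

def heterogeneity_score(L_locals, gamma):
    L_locals = np.asarray(L_locals, dtype=float)
    H_sq = (L_locals / L_locals.sum()) ** 2        # assumes L = sum_m L_m (an upper bound on L)
    return float(np.mean(H_sq <= gamma))

# e.g., nine workers with the increasing smoothness constants used in Section 4
print(heterogeneity_score((1.3 ** np.arange(9) + 1.0) ** 2, gamma=1e-3))
```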
Proposition 5 (communication complexity) With γ_d defined in (18) and the function h(γ) in (19),
the communication complexity of LAG, denoted as C_LAG(ϵ), is bounded by

C_LAG(ϵ) ≤ (1 − ∑_{d=1}^{D} (1/d − 1/(d+1)) h(γ_d)) M I_LAG(ϵ) := (1 − ΔC̄(h; {γ_d})) M I_LAG(ϵ)    (20)

where the constant is defined as ΔC̄(h; {γ_d}) := ∑_{d=1}^{D} (1/d − 1/(d+1)) h(γ_d).
The communication complexity in (20) crucially depends on the iteration complexity I_LAG(ϵ), as
well as on what we call the fraction of reduced communication per iteration, ΔC̄(h; {γ_d}). Simply
choosing the parameters as in (16), it follows from (17) and (20) that (cf. γ_d = ξ(1 − √(Dξ))⁻² M⁻² d⁻¹)

C_LAG(ϵ) ≤ (1 − ΔC̄(h; ξ)) C_GD(ϵ) / (1 − √(Dξ))                                              (21)

where GD's communication complexity is C_GD(ϵ) = Mκ log(ϵ⁻¹). In (21), due to the nondecreasing
property of h(γ), increasing the constant ξ yields a smaller fraction of workers 1 − ΔC̄(h; ξ) that
communicate per iteration, yet a larger number of iterations (cf. (17)). The key enabler of LAG's
communication reduction is a heterogeneous environment associated with a favorable h(γ), ensuring
that the benefit of increasing ξ outweighs its effect on increasing the iteration complexity. More
precisely, for a given ξ, if h(γ) guarantees ΔC̄(h; ξ) > √(Dξ), then we have C_LAG(ϵ) < C_GD(ϵ).
Intuitively speaking, if there is a large fraction of workers with small L_m, LAG has a lower
communication complexity than GD. An example follows to illustrate this reduction.
Example. Consider L_m = 1, m ≠ M, and L_M = L ≥ M² ≫ 1, so that H(m) = 1/L, m ≠ M, and
H(M) = 1, implying that h(γ) ≥ 1 − 1/M if γ ≥ 1/L². Choosing D ≥ M and ξ = M²D/L² < 1/D
in (16) such that γ_D ≥ 1/L² in (18), we have (cf. (21))

C_LAG(ϵ)/C_GD(ϵ) ≤ [1 − (1 − 1/(D+1))(1 − 1/M)] / (1 − MD/L) ≈ (M + D)/(M(D + 1)) ≈ 2/M.      (22)
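As a quick numeric sanity check of (22), the snippet below (our own, with the illustrative choice M = D = 9 and L = M²D = 729, which satisfies L ≥ M², D ≥ M, and ξ < 1/D) evaluates the bound to roughly 0.225, close to 2/M ≈ 0.22.

```python
# Numeric check of the communication-reduction bound (22) for an illustrative setup.
M, D = 9, 9
L = M**2 * D                                   # L = 729 >= M^2, as required in the example
xi = M**2 * D / L**2                           # xi = M^2 D / L^2 < 1/D
assert xi < 1 / D
numerator = 1 - (1 - 1 / (D + 1)) * (1 - 1 / M)
denominator = 1 - M * D / L                    # 1 - sqrt(D * xi) = 1 - MD/L
print(numerator / denominator, (M + D) / (M * (D + 1)), 2 / M)   # ~0.225, 0.2, ~0.222
```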
Due to technical issues in the convergence analysis, the current condition on h(γ) that ensures LAG's
communication reduction is relatively restrictive. Establishing communication reduction in a broader
learning setting that matches LAG's intriguing empirical performance is on our agenda.

3.2 Convergence in (non)convex case

LAG’s convergence and communication reduction guarantees go beyond the strongly-convex case.
We next establish the convergence of LAG for general convex functions.
Theorem 2 (convex case) Under Assumptions 1 and 2, if α and {ξ_d} are chosen properly, then

L(θ^K) − L(θ^*) = O(1/K).                                                                     (23)

For nonconvex objective functions, LAG can guarantee the following convergence result.

Theorem 3 (nonconvex case) Under Assumption 1, if α and {ξ_d} are chosen properly, then

min_{1≤k≤K} ∥θ^{k+1} − θ^k∥² = o(1/K)   and   min_{1≤k≤K} ∥∇L(θ^k)∥² = o(1/K).                (24)

Figure 3: Iteration and communication complexity in synthetic datasets (left two panels: increasing
L_m; right two panels: uniform L_m). Each panel plots the objective error versus the number of
iterations or the number of communications (uploads) for Cyc-IAG, Num-IAG, LAG-PS, LAG-WK,
and Batch-GD.

Figure 4: Iteration and communication complexity in real datasets (left two panels: linear regression;
right two panels: logistic regression). Each panel plots the objective error versus the number of
iterations or the number of communications (uploads) for Cyc-IAG, Num-IAG, LAG-PS, LAG-WK,
and Batch-GD.

Theorems 2 and 3 assert that, with the judiciously designed lazy gradient aggregation rules, LAG
achieves convergence rates of the same order as GD for general (non)convex objective functions.
Similar to Proposition 5, in the supplementary material, we have also shown that in the (non)convex
case, LAG still requires less communication than GD, under certain conditions on the function h(γ).

4 Numerical tests and conclusions

To validate the theoretical results, this section evaluates the empirical performance of LAG in linear
and logistic regression tasks. All experiments were performed using MATLAB on an Intel CPU @
3.4 GHz (32 GB RAM) desktop. By default, we consider one server and nine workers. Throughout
the tests, we use L(θ^k) − L(θ^*) as the figure of merit of our solution. For logistic regression, the
regularization parameter is set to λ = 10⁻³. To benchmark LAG, we consider the following approaches.
▷ Cyc-IAG is the cyclic version of the incremental aggregated gradient (IAG) method [34, 35] that
resembles the recursion (4), but communicates with one worker per iteration in a cyclic fashion.
▷ Num-IAG also resembles the recursion (4), and is the non-uniform-sampling enhancement of SAG
[12]: it randomly selects one worker to obtain a fresh gradient per iteration, with the probability of
choosing worker m equal to L_m / ∑_{m'∈M} L_{m'}.
▷ Batch-GD is the GD iteration (2) that communicates with all the workers per iteration.
For LAG-WK, we choose ξ_d = ξ = 1/D with D = 10, and for LAG-PS, we choose the more aggressive
ξ_d = ξ = 10/D with D = 10. The stepsizes for LAG-WK, LAG-PS, and GD are chosen as α = 1/L;
to optimize performance and guarantee stability, α = 1/(ML) is used for Cyc-IAG and Num-IAG.
We consider two synthetic data tests: a) linear regression with increasing smoothness constants, e.g.,
L_m = (1.3^{m−1} + 1)², ∀m; and, b) logistic regression with uniform smoothness constants, e.g.,
L_1 = . . . = L_9 = 4; see Figure 3. For the case of increasing L_m, it is not surprising that both LAG
variants need fewer communication rounds. Interestingly, for uniform L_m, LAG-WK still attains
marked improvements in communication, thanks to its ability to exploit the hidden smoothness of the
loss functions; that is, the local curvature of L_m may not be as steep as the constant L_m suggests.
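For readers who want to reproduce a setup of this flavor, the sketch below (our own construction; the paper's exact data generation is described in its supplementary material and implemented in MATLAB) builds per-worker linear-regression data whose smoothness constants follow L_m = (1.3^{m−1} + 1)², assuming the local loss L_m(θ) = ½∥X_mθ − y_m∥².

```python
# A sketch of synthetic per-worker data with prescribed smoothness constants: for the
# assumed loss L_m(theta) = 0.5 * ||X_m theta - y_m||^2, the smoothness constant equals
# lambda_max(X_m^T X_m), so each worker's features are rescaled to hit the target exactly.
import numpy as np

def make_worker_data(m, n_samples=50, dim=10, seed=0):
    rng = np.random.default_rng(seed + m)
    X = rng.standard_normal((n_samples, dim))
    target_L = (1.3 ** (m - 1) + 1.0) ** 2                 # L_m = (1.3^{m-1} + 1)^2
    X *= np.sqrt(target_L / np.linalg.eigvalsh(X.T @ X).max())
    theta_true = rng.standard_normal(dim)
    y = X @ theta_true + 0.01 * rng.standard_normal(n_samples)
    return X, y

shards = [make_worker_data(m) for m in range(1, 10)]       # nine workers, m = 1, ..., 9
```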
Performance is also tested on real datasets [36]: a) linear regression using the Housing, Body fat, and
Abalone datasets; and, b) logistic regression using the Ionosphere, Adult, and Derm datasets; see
Figure 4. Each dataset is evenly split across three workers, with the number of features used in the
test equal to the minimal number of features among all datasets; see the details of parameters and
data allocation in the supplementary material. In all tests, LAG-WK outperforms the alternatives in
terms of both metrics, especially by reducing the needed communication rounds by several orders of
magnitude. Its needed communication rounds can even be smaller than the number of iterations, if
none of the workers violates the trigger condition (12) at certain iterations.
                    Linear regression              Logistic regression
Algorithm       M = 9    M = 18    M = 27      M = 9    M = 18    M = 27
Cyc-IAG          5271     10522     15773      33300     65287     97773
Num-IAG          3466      5283      5815      22113     30540     37262
LAG-PS           1756      3610      5944      14423     29968     44598
LAG-WK            412       657      1058        584      1098      1723
Batch-GD         5283     10548     15822      33309     65322     97821

Table 3: Communication complexity (ϵ = 10⁻⁸) in a real dataset under different numbers of workers.

Figure 5: Iteration and communication complexity in the Gisette dataset: objective error versus the
number of iterations and the number of communications (uploads) for Cyc-IAG, Num-IAG, LAG-PS,
LAG-WK, and Batch-GD.

Additional tests under different numbers of workers are listed in Table 3, which corroborate the
effectiveness of LAG when it comes to communication reduction. A similar performance gain has also
been observed in an additional logistic regression test on the larger Gisette dataset, which was taken
from [37] and constructed from the MNIST data [38]. After randomly selecting a subset of samples
and eliminating all-zero features, it contains 2,000 samples x_n ∈ ℝ^4837. We randomly split this
dataset across nine workers. The performance of all the algorithms is reported in Figure 5 in terms of
the iteration and communication complexity. Clearly, LAG-WK and LAG-PS achieve the same iteration
complexity as GD, and outperform Cyc-IAG and Num-IAG. In terms of communication complexity,
the two LAG variants reduce the needed communication rounds by several orders of magnitude
compared with the alternatives.
In summary, supported by impressive empirical performance on both synthetic and real datasets, this
paper developed a promising communication-cognizant method for distributed machine learning that
we term the Lazily Aggregated Gradient (LAG) approach. LAG achieves the same convergence rates
as batch gradient descent (GD) in the smooth strongly-convex, convex, and nonconvex cases, and
requires fewer communication rounds than GD when the datasets at different workers are heterogeneous.
To overcome the limitations of LAG, future work consists of incorporating smoothing techniques to
handle nonsmooth loss functions, and robustifying our aggregation rules to deal with cyber attacks.

Acknowledgments
The work by T. Chen and G. Giannakis is supported in part by NSF 1500713 and 1711471, and NIH
1R01GM104975-01. The work by T. Chen is also supported by the Doctoral Dissertation Fellowship
from the University of Minnesota. The work by T. Sun is supported in part by China Scholarship
Council. The work by W. Yin is supported in part by NSF DMS-1720237 and ONR N0001417121.

References
[1] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans.
Automat. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.

[2] G. B. Giannakis, Q. Ling, G. Mateos, I. D. Schizas, and H. Zhu, “Decentralized Learning for Wireless
Communications and Networking,” in Splitting Methods in Communication and Imaging, Science and
Engineering. New York: Springer, 2016.

[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le
et al., “Large scale distributed deep networks,” in Proc. Advances in Neural Info. Process. Syst., Lake
Tahoe, NV, 2012, pp. 1223–1231.

[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning


of deep networks from decentralized data,” in Proc. Intl. Conf. Artificial Intell. and Stat., Fort Lauderdale,
FL, Apr. 2017, pp. 1273–1282.

[5] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Ad-
vances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 4427–4437.

[6] I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jor-


dan, J. M. Hellerstein, J. E. Gonzalez et al., “A Berkeley view of systems challenges for AI,” arXiv
preprint:1712.05855, Dec. 2017.

[7] T. Chen, S. Barbarossa, X. Wang, G. B. Giannakis, and Z.-L. Zhang, “Learning and management for
Internet-of-Things: Accounting for adaptivity and scalability,” Proc. of the IEEE, Nov. 2018.

[8] L. Bottou, “Large-Scale Machine Learning with Stochastic Gradient Descent,” in Proc. of COMP-
STAT’2010, Y. Lechevallier and G. Saporta, Eds. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186.

[9] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” arXiv
preprint:1606.04838, Jun. 2016.

[10] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,”
in Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, NV, Dec. 2013, pp. 315–323.

[11] A. Defazio, F. Bach, and S. Lacoste-Julien, “Saga: A fast incremental gradient method with support for
non-strongly convex composite objectives,” in Proc. Advances in Neural Info. Process. Syst., Montreal,
Canada, Dec. 2014, pp. 1646–1654.

[12] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,”
Mathematical Programming, vol. 162, no. 1-2, pp. 83–112, Mar. 2017.

[13] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning
with the parameter server,” in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2014,
pp. 19–27.

[14] B. McMahan and D. Ramage, “Federated learning: Collaborative machine learning without centralized
training data,” Google Research Blog, Apr. 2017. [Online]. Available: https://research.googleblog.com/
2017/04/federated-learning-collaborative.html

[15] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algorithms for nonconvex
big-data optimization: Model and convergence,” arXiv preprint:1607.04818, Jul. 2016.

[16] T. Sun, R. Hannah, and W. Yin, “Asynchronous coordinate descent under more realistic assumptions,” in
Proc. Advances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 6183–6191.

[17] Z. Peng, Y. Xu, M. Yan, and W. Yin, “Arock: an algorithmic framework for asynchronous parallel coordi-
nate updates,” SIAM J. Sci. Comp., vol. 38, no. 5, pp. 2851–2879, Sep. 2016.

[18] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradi-
ent descent,” in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, Dec. 2011, pp. 693–701.

[19] J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Sridhar, “An asynchronous parallel stochastic coordinate
descent algorithm,” J. Machine Learning Res., vol. 16, no. 1, pp. 285–322, 2015.

[20] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimiza-
tion,” in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2015, pp. 2737–2745.

[21] M. I. Jordan, J. D. Lee, and Y. Yang, “Communication-efficient distributed statistical inference,” J. Amer-
ican Statistical Association, vol. to appear, 2018.

[22] Y. Zhang, J. C. Duchi, and M. J. Wainwright, “Communication-efficient algorithms for statistical opti-
mization.” J. Machine Learning Res., vol. 14, no. 11, 2013.
[23] A. T. Suresh, X. Y. Felix, S. Kumar, and H. B. McMahan, “Distributed mean estimation with limited
communication,” in Proc. Intl. Conf. Machine Learn., Sydney, Australia, Aug. 2017, pp. 3329–3337.

[24] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via
gradient quantization and encoding,” in Proc. Advances in Neural Info. Process. Syst., Long Beach, CA,
Dec. 2017, pp. 1709–1720.

[25] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary gradients to reduce
communication in distributed deep learning,” in Proc. Advances in Neural Info. Process. Syst., Long Beach,
CA, Dec. 2017, pp. 1509–1519.

[26] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proc. of Empirical
Methods in Natural Language Process., Copenhagen, Denmark, Sep. 2017, pp. 440–445.
[27] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, “Communication-
efficient distributed dual coordinate ascent,” in Proc. Advances in Neural Info. Process. Syst., Montreal,
Canada, Dec. 2014, pp. 3068–3076.
[28] C. Ma, J. Konečnỳ, M. Jaggi, V. Smith, M. I. Jordan, P. Richtárik, and M. Takáč, “Distributed optimization
with arbitrary local solvers,” Optimization Methods and Software, vol. 32, no. 4, pp. 813–848, Jul. 2017.

[29] O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approx-
imate newton-type method,” in Proc. Intl. Conf. Machine Learn., Beijing, China, Jun. 2014, pp. 1000–
1008.

[30] Y. Zhang and X. Lin, “DiSCO: Distributed optimization for self-concordant empirical loss,” in Proc. Intl.
Conf. Machine Learn., Lille, France, Jun. 2015, pp. 362–370.

[31] Y. Liu, C. Nowzari, Z. Tian, and Q. Ling, “Asynchronous periodic event-triggered coordination of multi-
agent systems,” in Proc. IEEE Conf. Decision Control, Melbourne, Australia, Dec. 2017, pp. 6696–6701.

[32] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic opti-
mization,” arXiv preprint:1701.03961, Jan. 2017.

[33] Y. Nesterov, Introductory Lectures on Convex Optimization: A basic course. Berlin, Germany: Springer,
2013, vol. 87.
[34] D. Blatt, A. O. Hero, and H. Gauchman, “A convergent incremental gradient method with a constant step
size,” SIAM J. Optimization, vol. 18, no. 1, pp. 29–51, Feb. 2007.

[35] M. Gurbuzbalaban, A. Ozdaglar, and P. A. Parrilo, “On the convergence rate of incremental aggregated
gradient algorithms,” SIAM J. Optimization, vol. 27, no. 2, pp. 1035–1048, Jun. 2017.

[36] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml

[37] L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo, “Supervised feature selection via depen-
dence estimation,” in Proc. Intl. Conf. Machine Learn., Corvallis, OR, Jun. 2007, pp. 823–830.

[38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recogni-
tion,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
