
LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

Tianyi Chen⋆   Georgios B. Giannakis⋆   Tao Sun†,∗   Wotao Yin∗

⋆ University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA
† National University of Defense Technology, Changsha, Hunan 410073, China
∗ University of California - Los Angeles, Los Angeles, CA 90095, USA
{chen3827,[email protected]}  [email protected]  [email protected]

Abstract

This paper presents a new class of gradient methods for distributed machine learning that adaptively
skip the gradient calculations to learn with reduced communication and computation. Simple rules
are designed to detect slowly-varying gradients and, therefore, trigger the reuse of outdated gradients.
The resultant gradient-based algorithms are termed Lazily Aggregated Gradient — justifying our
acronym LAG used henceforth. Theoretically, the merits of this contribution are: i) the convergence
rate is the same as batch gradient descent in strongly-convex, convex, and nonconvex cases; and,
ii) if the distributed datasets are heterogeneous (quantified by certain measurable constants), the
communication rounds needed to achieve a targeted accuracy are reduced thanks to the adaptive
reuse of lagged gradients. Numerical experiments on both synthetic and real data corroborate a
significant communication reduction compared to alternatives.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction
In this paper, we develop communication-efficient algorithms to solve the following problem

min_{θ∈ℝ^d} L(θ)   with   L(θ) := ∑_{m∈M} L_m(θ)                                              (1)

where θ ∈ Rd is the unknown vector, L and {Lm , m ∈ M} are smooth (but not necessarily convex)
functions with M := {1, . . . , M }. Problem (1) naturally arises in a number of areas, such as
multi-agent optimization [1], distributed signal processing [2], and distributed machine learning [3].
Considering the distributed machine learning paradigm, each L_m is itself a sum of functions, e.g.,
L_m(θ) := ∑_{n∈N_m} ℓ_n(θ), where ℓ_n is the loss function (e.g., the square or the logistic loss) with respect
to the model vector θ evaluated at the training sample x_n; that is, ℓ_n(θ) := ℓ(θ; x_n).
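For concreteness, the finite-sum structure in (1) can be sketched in a few lines; the square loss, the data split, and the helper names below are our own illustrative choices (in Python/NumPy), not part of the paper.

```python
# A minimal sketch of problem (1) with the square loss: the global objective is the sum
# of local losses L_m, each itself a sum over the samples held by worker m.
import numpy as np

def local_loss(theta, X_m, y_m):
    # L_m(theta) = sum_{n in N_m} (x_n^T theta - y_n)^2
    residual = X_m @ theta - y_m
    return float(residual @ residual)

def global_loss(theta, shards):
    # L(theta) = sum_{m in M} L_m(theta), cf. (1); shards = [(X_1, y_1), ..., (X_M, y_M)]
    return sum(local_loss(theta, X_m, y_m) for X_m, y_m in shards)
```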
While machine learning tasks are traditionally carried out at a single server, for datasets with massive
samples {xn }, running gradient-based iterative algorithms at a single server can be prohibitively
slow; e.g., the server needs to sequentially compute gradient components given limited processors.
A simple yet popular solution in recent years is to parallelize the training across multiple computing
units (a.k.a. workers) [3]. Specifically, assuming the batch of samples is distributedly stored across a
total of M workers, with worker m ∈ M holding the samples {x_n, n ∈ N_m}, a globally shared
model θ is updated at the central server by aggregating the gradients computed by the workers. Due
to bandwidth and privacy concerns, each worker m does not upload its data {x_n, n ∈ N_m} to the
server; thus, the learning task needs to be performed by iteratively communicating with the server.
We are particularly interested in the scenarios where communication between the central server and
the local workers is costly, as is the case in the Federated Learning setting [4, 5], cloud-edge
AI systems [6], and, more broadly, the emerging Internet-of-Things paradigm [7]. In those cases,
communication latency is the bottleneck of the overall performance. More precisely, the communication
latency results from initiating communication links, queueing, and propagating the message. For sending
small messages, e.g., the d-dimensional model θ or the aggregated gradient, this latency dominates the
message-size-dependent transmission latency. It is therefore important to reduce the number of
communication rounds, even more so than the number of bits per round. In short, our goal is to find
the model parameter θ that minimizes (1) with as little communication overhead as possible.

1.1 Prior art

To put our work in context, we review prior contributions that we group in two categories.
Large-scale machine learning. Solving (1) at a single server has been extensively studied for large-
scale learning tasks, where the “workhorse approach” is the simple yet efficient stochastic gradient
descent (SGD) [8, 9]. Despite its low per-iteration complexity, the inherent variance prevents SGD
from achieving fast convergence. Recent advances leverage the so-termed variance reduction
techniques to achieve both low complexity and fast convergence [10–12]. For learning beyond
a single server, distributed parallel machine learning is an attractive solution to tackle large-scale
learning tasks, where the parameter server architecture is the most commonly used one [3, 13]. Dif-
ferent from the single server case, parallel implementation of the batch gradient descent (GD) is a
popular choice, since SGD, despite its low per-iteration complexity, requires a large number of iterations
and thus communication rounds [14]. For traditional parallel learning algorithms, however, latency, band-
width limits, and unexpected drains on resources that delay the update of even a single worker will
slow down the entire system. Recent research efforts in this line have been centered on
understanding asynchronous-parallel algorithms to speed up machine learning by eliminating costly
synchronization; e.g., [15–20]. All these approaches reduce either the computational complexity or
the run time, but they do not save communication.
Communication-efficient learning. When learning goes beyond a single server, the high communication
overhead becomes the bottleneck of the overall system performance [14], and communication-efficient
learning algorithms have gained popularity [21, 22]. Distributed learning approaches have been de-
veloped based on quantized (gradient) information, e.g., [23–26], but they only reduce the required
bandwidth per communication, not the rounds. For machine learning tasks where the loss function
is convex and its conjugate dual is expressible, the dual coordinate ascent-based approaches have
been demonstrated to yield impressive empirical performance [5, 27, 28]. But these algorithms run
in a double-loop manner, and the communication reduction has not been formally quantified. To
reduce communication by accelerating convergence, approaches leveraging (inexact) second-order
information have been studied in [29, 30]. Roughly speaking, algorithms in [5, 27–30] reduce com-
munication by increasing local computation (relative to GD), while our method does not increase lo-
cal computation. In settings different from the one considered in this paper, communication-efficient
approaches have been recently studied with triggered communication protocols [31, 32]. Aside from
convergence guarantees, however, no theoretical justification for communication reduction has been
established in [31]. While a sublinear convergence rate can be achieved by the algorithms in [32], the
proposed gradient selection rule is nonadaptive and requires double-loop iterations.

1.2 Our contributions

Before introducing our approach, we revisit the popular GD method for (1) in the setting of one
parameter server and M workers. At iteration k, the server broadcasts the current model θ^k to all
the workers; every worker m ∈ M computes ∇L_m(θ^k) and uploads it to the server; and once it has
received the gradients from all workers, the server updates the model parameters via

GD iteration    θ^{k+1} = θ^k − α∇^k_GD   with   ∇^k_GD := ∑_{m∈M} ∇L_m(θ^k)                  (2)

where α is a stepsize, and ∇^k_GD is an aggregated gradient that summarizes the model change. To
implement (2), the server has to communicate with all workers to obtain fresh {∇L_m(θ^k)}.
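The update (2) has a direct parameter-server implementation; a minimal sketch follows (Python for illustration, whereas the paper's experiments use MATLAB), where worker_grads stands for an assumed list of per-worker gradient oracles.

```python
# A minimal sketch of the batch GD iteration (2): the server collects a fresh gradient
# from every one of the M workers at every iteration and sums them.
import numpy as np

def gd_step(theta, worker_grads, alpha):
    grad_gd = sum(g(theta) for g in worker_grads)   # M uploads per iteration, one per worker
    return theta - alpha * grad_gd                  # theta^{k+1} = theta^k - alpha * grad^k_GD
```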

In this context, the present paper puts forward a new batch gradient method (as simple as GD)
that can skip communication at certain rounds, which justifies the term Lazily Aggregated Gradient
(LAG).
            Communication                          Computation                         Memory
Algorithm   PS→WK_m             WK_m→PS             PS           WK_m                   PS                   WK_m
GD          θ^k                 ∇L_m(θ^k)           (2)          ∇L_m(θ^k)              θ^k                  /
LAG-PS      θ^k, if m ∈ M^k     δ∇^k_m, if m ∈ M^k  (4), (12b)   ∇L_m(θ^k), if m ∈ M^k  θ^k, ∇^k, {θ̂^k_m}   ∇L_m(θ̂^k_m)
LAG-WK      θ^k                 δ∇^k_m, if m ∈ M^k  (4)          ∇L_m(θ^k), (12a)       θ^k, ∇^k             ∇L_m(θ̂^k_m)

Table 1: A comparison of communication, computation and memory requirements. PS denotes the
parameter server, WK denotes the worker, PS→WK_m is the communication link from the server to
worker m, and WK_m→PS is the communication link from worker m to the server.
With its derivations deferred to Section 2, LAG resembles (2), and is given by

LAG iteration    θ^{k+1} = θ^k − α∇^k   with   ∇^k := ∑_{m∈M} ∇L_m(θ̂^k_m)                    (3)

where each ∇L_m(θ̂^k_m) is either ∇L_m(θ^k), when θ̂^k_m = θ^k, or an outdated gradient that has been
computed using an old copy θ̂^k_m ≠ θ^k. Instead of requesting a fresh gradient from every worker as
in (2), the twist is to obtain ∇^k by refining the previous aggregated gradient ∇^{k−1}; that is, using
only the new gradients from the selected workers in M^k, while reusing the outdated gradients from
the rest of the workers. Therefore, with θ̂^k_m := θ^k, ∀m ∈ M^k, and θ̂^k_m := θ̂^{k−1}_m, ∀m ∉ M^k,
LAG in (3) is equivalent to

LAG iteration    θ^{k+1} = θ^k − α∇^k   with   ∇^k = ∇^{k−1} + ∑_{m∈M^k} δ∇^k_m               (4)

where δ∇^k_m := ∇L_m(θ^k) − ∇L_m(θ̂^{k−1}_m) is the difference between the two evaluations of ∇L_m
at the current iterate θ^k and the old copy θ̂^{k−1}_m. If ∇^{k−1} is stored at the server, this simple
modification scales down the per-iteration communication rounds from GD's M to LAG's |M^k|.
We develop two different rules to select M^k. The first rule is adopted by the parameter server (PS),
and the second one by every worker (WK). At iteration k,

LAG-PS: the server determines M^k and sends θ^k to the workers in M^k; each worker m ∈ M^k
computes ∇L_m(θ^k) and uploads δ∇^k_m; each worker m ∉ M^k does nothing; the server updates via (4);

LAG-WK: the server broadcasts θ^k to all workers; every worker computes ∇L_m(θ^k) and checks
whether it belongs to M^k; only the workers in M^k upload δ∇^k_m; the server updates via (4).

See a comparison of the two LAG variants with GD in Table 1.
Naively reusing outdated gradients, while saving communication per iteration, can increase the total
number of iterations. To keep this number in check, we judiciously design our simple trigger rules so
that LAG can: i) achieve the same order of convergence rates (thus iteration complexities) as batch GD
in the strongly-convex, convex, and nonconvex smooth cases; and, ii) require reduced communication
to achieve a targeted learning accuracy, when the distributed datasets are heterogeneous (measured by
a certain quantity specified later). In certain learning settings, LAG requires only O(1/M) of the
communication of GD. Empirically, we found that LAG can reduce the communication required by GD
and other distributed learning methods by an order of magnitude.

Figure 1: LAG in a parameter server setup.
Notation. Bold lowercase letters denote column vectors; (·)⊤ denotes transposition, and ∥x∥ denotes
the ℓ2-norm of x. Inequalities between vectors, e.g., x > 0, are defined entrywise.

2 LAG: Lazily Aggregated Gradient Approach


In this section, we formally develop our LAG method, and present the intuition and basic principles
behind its design. The original idea of LAG comes from a simple rewriting of the GD iteration (2)
as

θ^{k+1} = θ^k − α ∑_{m∈M} ∇L_m(θ^{k−1}) − α ∑_{m∈M} (∇L_m(θ^k) − ∇L_m(θ^{k−1})).              (5)

Let us view ∇L_m(θ^k) − ∇L_m(θ^{k−1}) as a refinement of ∇L_m(θ^{k−1}), and recall that obtaining this
refinement requires a round of communication between the server and the worker m. Therefore, to
save communication, we can skip the server's communication with worker m if this refinement is
small compared to the old gradient; that is, ∥∇L_m(θ^k) − ∇L_m(θ^{k−1})∥ ≪ ∥∑_{m∈M} ∇L_m(θ^{k−1})∥.

Generalizing on this intuition, given the generic outdated gradient components {∇L_m(θ̂^{k−1}_m)} with
θ̂^{k−1}_m = θ^{k−1−τ^{k−1}_m} for a certain τ^{k−1}_m ≥ 0, if communicating with some workers will bring
only small gradient refinements, we skip those communications (contained in the set M^k_c) and end up with

θ^{k+1} = θ^k − α ∑_{m∈M} ∇L_m(θ̂^{k−1}_m) − α ∑_{m∈M^k} (∇L_m(θ^k) − ∇L_m(θ̂^{k−1}_m))        (6a)

        = θ^k − α∇L(θ^k) − α ∑_{m∈M^k_c} (∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k))                        (6b)

where M^k and M^k_c are the sets of workers that do and do not communicate with the server,
respectively. It is easy to verify that (6) is identical to (3) and (4). Comparing (2) with (6b), when
M^k_c includes more workers, more communication is saved, but θ^k is updated by a coarser gradient.
Key to addressing this communication versus accuracy tradeoff is a principled criterion for selecting
the subset of workers M^k_c that do not communicate with the server at each round. To achieve this
“sweet spot,” we will rely on the fundamental descent lemma. For GD, it is given as follows [33].

Lemma 1 (GD descent in objective) Suppose L(θ) is L-smooth, and θ̄^{k+1} is generated by running
one GD iteration (2) given θ^k and stepsize α. Then the objective values satisfy

L(θ̄^{k+1}) − L(θ^k) ≤ −(α − α²L/2) ∥∇L(θ^k)∥² := Δ^k_GD(θ^k).                                (7)
Likewise, for our desired iteration (6), the following holds; its proof is given in the Supplement.

Lemma 2 (LAG descent in objective) Suppose L(θ) is L-smooth, and θ^{k+1} is generated by running
one LAG iteration (4) given θ^k. Then the objective values satisfy (cf. δ∇^k_m in (4))

L(θ^{k+1}) − L(θ^k) ≤ −(α/2) ∥∇L(θ^k)∥² + (α/2) ∥∑_{m∈M^k_c} δ∇^k_m∥² + (L/2 − 1/(2α)) ∥θ^{k+1} − θ^k∥² := Δ^k_LAG(θ^k).    (8)

Lemmas 1 and 2 estimate the objective value descent obtained by performing one iteration of GD and
LAG, respectively, conditioned on a common iterate θ^k. GD finds Δ^k_GD(θ^k) by performing
M rounds of communication with all the workers, while LAG yields Δ^k_LAG(θ^k) by performing only
|M^k| rounds of communication with a selected subset of workers. Our pursuit is to select M^k to
ensure that LAG enjoys a larger per-communication descent than GD; that is,

Δ^k_LAG(θ^k)/|M^k| ≤ Δ^k_GD(θ^k)/M.                                                           (9)

Choosing the standard α = 1/L, we can show that in order to guarantee (9), it is sufficient to have
(see the supplementary material for the derivation)

∥∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k)∥² ≤ ∥∇L(θ^k)∥²/M²,   ∀m ∈ M^k_c.                               (10)

However, directly checking (10) at each worker is expensive, since obtaining ∥∇L(θ^k)∥² requires
information from all the workers. Instead, we approximate ∥∇L(θ^k)∥² in (10) by

∥∇L(θ^k)∥² ≈ (1/α²) ∑_{d=1}^{D} ξ_d ∥θ^{k+1−d} − θ^{k−d}∥²                                    (11)

where {ξ_d}_{d=1}^{D} are constant weights, and the constant D determines the number of recent iterate
changes that LAG incorporates to approximate the current gradient. The rationale here is that, since L
is smooth, ∇L(θ^k) cannot be very different from the recent gradients or the recent iterate lags.
Building upon (10) and (11), we will include worker m in M^k_c of (6) if it satisfies

LAG-WK condition    ∥∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k)∥² ≤ (1/(α²M²)) ∑_{d=1}^{D} ξ_d ∥θ^{k+1−d} − θ^{k−d}∥².    (12a)

Algorithm 1 LAG-WK
1: Input: Stepsize α > 0, and threshold {ξ_d}.
2: Initialize: θ^1, {∇L_m(θ̂^0_m), ∀m}.
3: for k = 1, 2, . . . , K do
4:   Server broadcasts θ^k to all workers.
5:   for worker m = 1, . . . , M do
6:     Worker m computes ∇L_m(θ^k).
7:     Worker m checks condition (12a).
8:     if worker m violates (12a) then
9:       Worker m uploads δ∇^k_m.
10:      ▷ Save ∇L_m(θ̂^k_m) = ∇L_m(θ^k)
11:    else
12:      Worker m uploads nothing.
13:    end if
14:  end for
15:  Server updates via (4).
16: end for

Algorithm 2 LAG-PS
1: Input: Stepsize α > 0, {ξ_d}, and L_m, ∀m.
2: Initialize: θ^1, {θ̂^0_m, ∇L_m(θ̂^0_m), ∀m}.
3: for k = 1, 2, . . . , K do
4:   for worker m = 1, . . . , M do
5:     Server checks condition (12b).
6:     if worker m violates (12b) then
7:       Server sends θ^k to worker m.
8:       ▷ Save θ̂^k_m = θ^k at server
9:       Worker m computes ∇L_m(θ^k).
10:      Worker m uploads δ∇^k_m.
11:    else
12:      No actions at server and worker m.
13:    end if
14:  end for
15:  Server updates via (4).
16: end for

Table 2: A comparison of LAG-WK and LAG-PS.

Condition (12a) is checked at the worker side after each worker receives θ^k from the server and
computes its ∇L_m(θ^k). If broadcasting is also costly, we can resort to the following server-side rule:

LAG-PS condition    L²_m ∥θ̂^{k−1}_m − θ^k∥² ≤ (1/(α²M²)) ∑_{d=1}^{D} ξ_d ∥θ^{k+1−d} − θ^{k−d}∥².          (12b)

The values of {ξ_d} and D admit simple choices, e.g., ξ_d = 1/D, ∀d, with D = 10 used in the simulations.
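As a concrete illustration of the server-side rule, a minimal helper for checking (12b) might look as follows; the function name and calling convention are our own assumptions, not the paper's code.

```python
# A minimal sketch of the LAG-PS check (12b): worker m can be skipped when
# L_m^2 * ||theta_hat_m - theta_k||^2 is below the weighted sum of recent iterate changes.
import numpy as np

def lag_ps_skip(L_m, theta_hat_m, theta_k, recent_diffs_sq, xi, alpha, M):
    """recent_diffs_sq[d-1] = ||theta^{k+1-d} - theta^{k-d}||^2 for d = 1, ..., D (or fewer)."""
    lhs = L_m**2 * float(np.sum((theta_hat_m - theta_k)**2))
    rhs = sum(x * d2 for x, d2 in zip(xi, recent_diffs_sq)) / (alpha**2 * M**2)
    return lhs <= rhs   # True: reuse worker m's outdated gradient, no communication this round
```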
LAG-WK vs LAG-PS. To perform (12a), the server needs to broadcast the current model θ^k, and
all the workers need to compute their gradients; to perform (12b), the server needs an estimate of the
smoothness constant L_m of every local function. On the other hand, as will be shown in Section 3,
(12a) and (12b) lead to the same worst-case convergence guarantees. In practice, however, the
server-side condition is more conservative than the worker-side one at communication reduction,
because the smoothness of L_m readily implies that satisfying (12b) necessarily satisfies (12a), but not
vice versa. Empirically, (12a) leads to a larger M^k_c than (12b), and thus extra communication
overhead is saved. Hence, (12a) and (12b) can be chosen according to users' preferences. LAG-WK
and LAG-PS are summarized as Algorithms 1 and 2.
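To make the lazy aggregation concrete, below is a minimal single-process simulation of LAG-WK, i.e., recursion (3)-(4) with the worker-side trigger (12a) and ξ_d = 1/D. The quadratic local losses, the function name run_lag_wk, and all bookkeeping choices are our own illustrative assumptions; the paper's implementation is in MATLAB.

```python
# A sketch of LAG-WK: every worker computes its fresh gradient, but only workers that
# violate (12a) upload the correction delta; the server refines the aggregate via (4).
import numpy as np

def run_lag_wk(grads, L_smooth, theta0, K=500, D=10):
    """grads[m](theta) returns the gradient of L_m; L_smooth is the smoothness constant of L."""
    M = len(grads)
    alpha = 1.0 / L_smooth
    xi = np.full(D, 1.0 / D)                       # xi_d = 1/D, as in the simulations
    theta = theta0.copy()
    old_grads = [g(theta) for g in grads]          # gradients at each worker's last upload
    agg = sum(old_grads)                           # aggregated gradient stored at the server
    diffs_sq = []                                  # ||theta^{k+1-d} - theta^{k-d}||^2, most recent first
    uploads = 0
    for _ in range(K):
        rhs = sum(x * d2 for x, d2 in zip(xi, diffs_sq)) / (alpha**2 * M**2)
        for m in range(M):
            g_new = grads[m](theta)                # every worker computes a fresh gradient
            delta = g_new - old_grads[m]
            if float(delta @ delta) > rhs:         # worker m violates (12a): upload delta
                agg = agg + delta                  # server-side refinement, cf. (4)
                old_grads[m] = g_new
                uploads += 1
        theta_next = theta - alpha * agg           # LAG update (3)
        diffs_sq.insert(0, float((theta_next - theta) @ (theta_next - theta)))
        diffs_sq = diffs_sq[:D]
        theta = theta_next
    return theta, uploads

# Toy run with M = 9 quadratic losses L_m(theta) = 0.5 * c_m * ||theta - b_m||^2, whose
# smoothness constants c_m = (1.3^{m-1} + 1)^2 increase across workers as in Section 4.
rng = np.random.default_rng(0)
c = (1.3 ** np.arange(9) + 1.0) ** 2
b = [rng.standard_normal(4) for _ in range(9)]
grads = [lambda th, cm=cm, bm=bm: cm * (th - bm) for cm, bm in zip(c, b)]
theta, uploads = run_lag_wk(grads, L_smooth=float(c.sum()), theta0=np.zeros(4))
print("LAG-WK uploads:", uploads, "vs", 500 * 9, "for batch GD")
```

In such a run, workers with smaller c_m tend to satisfy (12a) more often and hence upload less frequently, which is the behavior formalized by Lemma 4 in Section 3.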
Regarding our proposed LAG method, three remarks are in order.
R1) With recursive update of the lagged gradients in (4) and the lagged iterates in (12), implementing
LAG is as simple as GD; see Table 1. Both empirically and theoretically, we will further demonstrate
that using lagged gradients even reduces the overall delay by cutting down costly communication.
R2) Although both LAG and the asynchronous-parallel algorithms in [15–20] leverage stale gradients,
they are very different. LAG actively creates staleness and, by design, reduces total communication
despite the staleness. Asynchronous algorithms passively receive staleness and increase total
communication due to the staleness, but they save run time.
R3) Compared with existing efforts for communication-efficient learning such as quantized gradients,
Nesterov's acceleration, dual coordinate ascent, and second-order methods, LAG is complementary to
all of them: LAG can be combined with these methods to develop even more powerful learning
schemes. An extension to a proximal LAG that covers nonsmooth regularizers is also possible.

3 Iteration and communication complexity

In this section, we establish the convergence of LAG, under the following standard conditions.
Assumption 1: Each loss function L_m(θ) is L_m-smooth, and L(θ) is L-smooth.
Assumption 2: L(θ) is convex and coercive.
Assumption 3: L(θ) is µ-strongly convex.

The subsequent convergence analysis critically builds on the following Lyapunov function:

V^k := L(θ^k) − L(θ^*) + ∑_{d=1}^{D} β_d ∥θ^{k+1−d} − θ^{k−d}∥²                               (13)

where θ^* is the minimizer of (1), and {β_d} is a sequence of constants to be determined later.
We start with the sufficient descent of V^k in (13).
Lemma 3 (descent lemma) Under Assumption 1, if α and {ξ_d} are chosen properly, there exist
constants c_0, . . . , c_D ≥ 0 such that the Lyapunov function in (13) satisfies

V^{k+1} − V^k ≤ −c_0 ∥∇L(θ^k)∥² − ∑_{d=1}^{D} c_d ∥θ^{k+1−d} − θ^{k−d}∥²                      (14)

which implies the descent of our Lyapunov function, that is, V^{k+1} ≤ V^k.

Lemma 3 is a generalization of GD's descent lemma. As specified in the supplementary material,
under properly chosen {ξ_d}, any stepsize α ∈ (0, 2/L), including α = 1/L, guarantees (14), matching
the stepsize region of GD. With M^k = M and β_d = 0, ∀d, in (13), Lemma 3 reduces to Lemma 1.

3.1 Convergence in strongly convex case

We first present the convergence under the smooth and strongly convex condition.

Theorem 1 (strongly convex case) Under Assumptions 1-3, the iterates {θ^k} of LAG satisfy

L(θ^K) − L(θ^*) ≤ (1 − c(α; {ξ_d}))^K V^0                                                     (15)

where θ^* is the minimizer of L(θ) in (1), and c(α; {ξ_d}) ∈ (0, 1) is a constant depending on α, {ξ_d},
{β_d}, and the condition number κ := L/µ, which are specified in the supplementary material.
Iteration complexity. The iteration complexity in its generic form is complicated, since c(α; {ξ_d})
depends on the choice of several parameters. Specifically, if we choose the parameters as

ξ_1 = · · · = ξ_D := ξ < 1/D,   α := (1 − √(Dξ))/L,   and   β_d := (D − d + 1)/(2α√(D/ξ)),  d = 1, . . . , D,    (16)

then, following Theorem 1, the iteration complexity of LAG in this case is

I_LAG(ϵ) = (κ/(1 − √(Dξ))) log(ϵ⁻¹).                                                          (17)

The iteration complexity in (17) is on the same order of GD’s iteration complexity κ log(ϵ−1 ), but
has a worse constant. This is the consequence of using a smaller stepsize in (16) (relative to α = 1/L
in GD) to simplify the choice of other parameters. In practice, LAG with α = 1/L can achieve
almost the same empirical iteration complexity as GD; see Section 4. Building on the iteration
complexity, we study next the communication complexity of LAG. In the setting of our interest, we
define the communication complexity as the total number of uploads over all the workers needed to
achieve accuracy ϵ. While the accuracy refers to the objective optimality error in the strongly convex
case, it is considered as the gradient norm in general (non)convex cases.
The power of LAG is best illustrated by numerical examples; see an example of LAG-WK in Figure 2.
Clearly, workers with a small smoothness constant communicate with the server less frequently.
This intuition is formally treated in the next lemma.

Lemma 4 (lazy communication) Define the importance factor of every worker m as H(m) := L_m/L.
If the stepsize α and the constants {ξ_d} in the conditions (12) satisfy ξ_D ≤ · · · ≤ ξ_d ≤ · · · ≤ ξ_1,
and worker m satisfies

H²(m) ≤ ξ_d/(d α² L² M²) := γ_d                                                               (18)

then, until the k-th iteration, worker m communicates with the server at most k/(d + 1) rounds.

Lemma 4 asserts that if worker m has a small L_m (a close-to-linear loss function) such that
H²(m) ≤ γ_d, then under LAG it communicates with the server at most k/(d + 1) rounds.
This is in contrast to the total of k communication rounds involved per worker under GD. Ideally,
we want as many workers as possible satisfying (18), especially for large d.

To quantify the overall communication reduction, we define the heterogeneity score function as

h(γ) := (1/M) ∑_{m∈M} 1(H²(m) ≤ γ)                                                            (19)

where the indicator 1(·) equals 1 when H²(m) ≤ γ holds, and 0 otherwise. Clearly, h(γ) is a
nondecreasing function of γ that depends on the distribution of the smoothness constants
L_1, L_2, . . . , L_M. It is also instructive to view it as the cumulative distribution function of the
deterministic quantity H²(m), implying h(γ) ∈ [0, 1]. Putting it in our context, the critical quantity
h(γ_d) lower bounds the fraction of workers that communicate with the server at most k/(d + 1)
rounds until the k-th iteration. We are now ready to present the communication complexity.

Figure 2: Communication events of workers 1, 3, 5, 7, 9 over 1,000 iterations. Each stick is an upload.
A setup with L_1 < . . . < L_9.
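As an aside, the score (19) is easy to evaluate for a given set of smoothness constants; the short sketch below is our own, and it assumes L = ∑_m L_m (in general this sum only upper-bounds L).

```python
# A minimal sketch of the heterogeneity score (19): the fraction of workers whose
# importance factor H(m) = L_m / L satisfies H(m)^2 <= gamma.
import numpy as np

def heterogeneity_score(L_locals, gamma):
    L_locals = np.asarray(L_locals, dtype=float)
    H_sq = (L_locals / L_locals.sum()) ** 2        # assumes L = sum_m L_m (an upper bound on L)
    return float(np.mean(H_sq <= gamma))

# e.g., nine workers with the increasing smoothness constants used in Section 4
print(heterogeneity_score((1.3 ** np.arange(9) + 1.0) ** 2, gamma=1e-3))
```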
Proposition 5 (communication complexity) With γ_d defined in (18) and the function h(γ) in (19),
the communication complexity of LAG, denoted as C_LAG(ϵ), is bounded by

C_LAG(ϵ) ≤ (1 − ∑_{d=1}^{D} (1/d − 1/(d+1)) h(γ_d)) M I_LAG(ϵ) := (1 − ΔC̄(h; {γ_d})) M I_LAG(ϵ)    (20)

where the constant is defined as ΔC̄(h; {γ_d}) := ∑_{d=1}^{D} (1/d − 1/(d+1)) h(γ_d).
The communication complexity in (20) crucially depends on the iteration complexity I_LAG(ϵ), as
well as on what we call the fraction of reduced communication per iteration, ΔC̄(h; {γ_d}). Simply
choosing the parameters as in (16), it follows from (17) and (20) that (cf. γ_d = ξ(1 − √(Dξ))⁻² M⁻² d⁻¹)

C_LAG(ϵ) ≤ (1 − ΔC̄(h; ξ)) C_GD(ϵ) / (1 − √(Dξ))                                              (21)

where GD's communication complexity is C_GD(ϵ) = Mκ log(ϵ⁻¹). In (21), due to the nondecreasing
property of h(γ), increasing the constant ξ yields a smaller fraction of workers 1 − ΔC̄(h; ξ) that
communicate per iteration, yet a larger number of iterations (cf. (17)). The key enabler of LAG's
communication reduction is a heterogeneous environment associated with a favorable h(γ), ensuring
that the benefit of increasing ξ outweighs its effect on increasing the iteration complexity. More
precisely, for a given ξ, if h(γ) guarantees ΔC̄(h; ξ) > √(Dξ), then we have C_LAG(ϵ) < C_GD(ϵ).
Intuitively speaking, if there is a large fraction of workers with small L_m, LAG has a lower
communication complexity than GD. An example follows to illustrate this reduction.
Example. Consider L_m = 1, m ≠ M, and L_M = L ≥ M² ≫ 1, so that H(m) = 1/L, m ≠ M, and
H(M) = 1, implying that h(γ) ≥ 1 − 1/M if γ ≥ 1/L². Choosing D ≥ M and ξ = M²D/L² < 1/D
in (16) such that γ_D ≥ 1/L² in (18), we have (cf. (21))

C_LAG(ϵ)/C_GD(ϵ) ≤ [1 − (1 − 1/(D+1))(1 − 1/M)] / (1 − MD/L) ≈ (M + D)/(M(D + 1)) ≈ 2/M.      (22)
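As a quick numeric sanity check of (22), the snippet below (our own, with the illustrative choice M = D = 9 and L = M²D = 729, which satisfies L ≥ M², D ≥ M, and ξ < 1/D) evaluates the bound to roughly 0.225, close to 2/M ≈ 0.22.

```python
# Numeric check of the communication-reduction bound (22) for an illustrative setup.
M, D = 9, 9
L = M**2 * D                                   # L = 729 >= M^2, as required in the example
xi = M**2 * D / L**2                           # xi = M^2 D / L^2 < 1/D
assert xi < 1 / D
numerator = 1 - (1 - 1 / (D + 1)) * (1 - 1 / M)
denominator = 1 - M * D / L                    # 1 - sqrt(D * xi) = 1 - MD/L
print(numerator / denominator, (M + D) / (M * (D + 1)), 2 / M)   # ~0.225, 0.2, ~0.222
```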
Due to technical issues in the convergence analysis, the current condition on h(γ) that ensures LAG's
communication reduction is relatively restrictive. Establishing communication reduction in a broader
learning setting that matches LAG's intriguing empirical performance is on our agenda.

3.2 Convergence in (non)convex case

LAG’s convergence and communication reduction guarantees go beyond the strongly-convex case.
We next establish the convergence of LAG for general convex functions.
Theorem 2 (convex case) Under Assumptions 1 and 2, if α and {ξ_d} are chosen properly, then

L(θ^K) − L(θ^*) = O(1/K).                                                                     (23)

For nonconvex objective functions, LAG can guarantee the following convergence result.

Theorem 3 (nonconvex case) Under Assumption 1, if α and {ξ_d} are chosen properly, then

min_{1≤k≤K} ∥θ^{k+1} − θ^k∥² = o(1/K)   and   min_{1≤k≤K} ∥∇L(θ^k)∥² = o(1/K).                (24)

Figure 3: Iteration and communication complexity in synthetic datasets (left two panels: increasing
L_m; right two panels: uniform L_m). Each panel plots the objective error versus the number of
iterations or the number of communications (uploads) for Cyc-IAG, Num-IAG, LAG-PS, LAG-WK,
and Batch-GD.

Figure 4: Iteration and communication complexity in real datasets (left two panels: linear regression;
right two panels: logistic regression). Each panel plots the objective error versus the number of
iterations or the number of communications (uploads) for Cyc-IAG, Num-IAG, LAG-PS, LAG-WK,
and Batch-GD.

Theorems 2 and 3 assert that, with the judiciously designed lazy gradient aggregation rules, LAG
achieves convergence rates of the same order as GD for general (non)convex objective functions.
Similar to Proposition 5, in the supplementary material, we have also shown that in the (non)convex
case, LAG still requires less communication than GD, under certain conditions on the function h(γ).

4 Numerical tests and conclusions

To validate the theoretical results, this section evaluates the empirical performance of LAG in linear
and logistic regression tasks. All experiments were performed using MATLAB on an Intel CPU @
3.4 GHz (32 GB RAM) desktop. By default, we consider one server and nine workers. Throughout
the tests, we use L(θ^k) − L(θ^*) as the figure of merit of our solution. For logistic regression, the
regularization parameter is set to λ = 10⁻³. To benchmark LAG, we consider the following approaches.
▷ Cyc-IAG is the cyclic version of the incremental aggregated gradient (IAG) method [34, 35] that
resembles the recursion (4), but communicates with one worker per iteration in a cyclic fashion.
▷ Num-IAG also resembles the recursion (4), and is the non-uniform-sampling enhancement of SAG
[12]: it randomly selects one worker to obtain a fresh gradient per iteration, with the probability of
choosing worker m equal to L_m / ∑_{m'∈M} L_{m'}.
▷ Batch-GD is the GD iteration (2) that communicates with all the workers per iteration.
For LAG-WK, we choose ξ_d = ξ = 1/D with D = 10, and for LAG-PS, we choose the more aggressive
ξ_d = ξ = 10/D with D = 10. The stepsizes for LAG-WK, LAG-PS, and GD are chosen as α = 1/L;
to optimize performance and guarantee stability, α = 1/(ML) is used for Cyc-IAG and Num-IAG.
We consider two synthetic data tests: a) linear regression with increasing smoothness constants, e.g.,
L_m = (1.3^{m−1} + 1)², ∀m; and, b) logistic regression with uniform smoothness constants, e.g.,
L_1 = . . . = L_9 = 4; see Figure 3. For the case of increasing L_m, it is not surprising that both LAG
variants need fewer communication rounds. Interestingly, for uniform L_m, LAG-WK still attains
marked improvements in communication, thanks to its ability to exploit the hidden smoothness of the
loss functions; that is, the local curvature of L_m may not be as steep as the constant L_m suggests.
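For readers who want to reproduce a setup of this flavor, the sketch below (our own construction; the paper's exact data generation is described in its supplementary material and implemented in MATLAB) builds per-worker linear-regression data whose smoothness constants follow L_m = (1.3^{m−1} + 1)², assuming the local loss L_m(θ) = ½∥X_mθ − y_m∥².

```python
# A sketch of synthetic per-worker data with prescribed smoothness constants: for the
# assumed loss L_m(theta) = 0.5 * ||X_m theta - y_m||^2, the smoothness constant equals
# lambda_max(X_m^T X_m), so each worker's features are rescaled to hit the target exactly.
import numpy as np

def make_worker_data(m, n_samples=50, dim=10, seed=0):
    rng = np.random.default_rng(seed + m)
    X = rng.standard_normal((n_samples, dim))
    target_L = (1.3 ** (m - 1) + 1.0) ** 2                 # L_m = (1.3^{m-1} + 1)^2
    X *= np.sqrt(target_L / np.linalg.eigvalsh(X.T @ X).max())
    theta_true = rng.standard_normal(dim)
    y = X @ theta_true + 0.01 * rng.standard_normal(n_samples)
    return X, y

shards = [make_worker_data(m) for m in range(1, 10)]       # nine workers, m = 1, ..., 9
```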
Performance is also tested on real datasets [36]: a) linear regression using the Housing, Body fat, and
Abalone datasets; and, b) logistic regression using the Ionosphere, Adult, and Derm datasets; see
Figure 4. Each dataset is evenly split across three workers, with the number of features used in the
test equal to the minimal number of features among all datasets; see the details of parameters and
data allocation in the supplementary material. In all tests, LAG-WK outperforms the alternatives in
terms of both metrics, especially by reducing the needed communication rounds by several orders of
magnitude. Its needed communication rounds can even be smaller than the number of iterations, if
none of the workers violates the trigger condition (12) at certain iterations.
                    Linear regression              Logistic regression
Algorithm       M = 9    M = 18    M = 27      M = 9    M = 18    M = 27
Cyc-IAG          5271     10522     15773      33300     65287     97773
Num-IAG          3466      5283      5815      22113     30540     37262
LAG-PS           1756      3610      5944      14423     29968     44598
LAG-WK            412       657      1058        584      1098      1723
Batch-GD         5283     10548     15822      33309     65322     97821

Table 3: Communication complexity (ϵ = 10⁻⁸) in a real dataset under different numbers of workers.

Figure 5: Iteration and communication complexity in the Gisette dataset: objective error versus the
number of iterations and the number of communications (uploads) for Cyc-IAG, Num-IAG, LAG-PS,
LAG-WK, and Batch-GD.

Additional tests under different numbers of workers are listed in Table 3, which corroborate the
effectiveness of LAG when it comes to communication reduction. A similar performance gain has also
been observed in an additional logistic regression test on the larger Gisette dataset, which was taken
from [37] and constructed from the MNIST data [38]. After randomly selecting a subset of samples
and eliminating all-zero features, it contains 2,000 samples x_n ∈ ℝ^4837. We randomly split this
dataset across nine workers. The performance of all the algorithms is reported in Figure 5 in terms of
the iteration and communication complexity. Clearly, LAG-WK and LAG-PS achieve the same iteration
complexity as GD, and outperform Cyc-IAG and Num-IAG. In terms of communication complexity,
the two LAG variants reduce the needed communication rounds by several orders of magnitude
compared with the alternatives.
In summary, supported by impressive empirical performance on both synthetic and real datasets, this
paper developed a promising communication-cognizant method for distributed machine learning that
we term the Lazily Aggregated Gradient (LAG) approach. LAG achieves the same convergence rates
as batch gradient descent (GD) in the smooth strongly-convex, convex, and nonconvex cases, and
requires fewer communication rounds than GD when the datasets at different workers are heterogeneous.
To overcome the limitations of LAG, future work consists of incorporating smoothing techniques to
handle nonsmooth loss functions, and robustifying our aggregation rules to deal with cyber attacks.

Acknowledgments
The work by T. Chen and G. Giannakis is supported in part by NSF 1500713 and 1711471, and NIH
1R01GM104975-01. The work by T. Chen is also supported by the Doctoral Dissertation Fellowship
from the University of Minnesota. The work by T. Sun is supported in part by China Scholarship
Council. The work by W. Yin is supported in part by NSF DMS-1720237 and ONR N0001417121.

References
[1] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans.
Automat. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.

[2] G. B. Giannakis, Q. Ling, G. Mateos, I. D. Schizas, and H. Zhu, “Decentralized Learning for Wireless
Communications and Networking,” in Splitting Methods in Communication and Imaging, Science and
Engineering. New York: Springer, 2016.

[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le
et al., “Large scale distributed deep networks,” in Proc. Advances in Neural Info. Process. Syst., Lake
Tahoe, NV, 2012, pp. 1223–1231.

[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning


of deep networks from decentralized data,” in Proc. Intl. Conf. Artificial Intell. and Stat., Fort Lauderdale,
FL, Apr. 2017, pp. 1273–1282.

[5] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Ad-
vances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 4427–4437.

[6] I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jor-


dan, J. M. Hellerstein, J. E. Gonzalez et al., “A Berkeley view of systems challenges for AI,” arXiv
preprint:1712.05855, Dec. 2017.

[7] T. Chen, S. Barbarossa, X. Wang, G. B. Giannakis, and Z.-L. Zhang, “Learning and management for
Internet-of-Things: Accounting for adaptivity and scalability,” Proc. of the IEEE, Nov. 2018.

[8] L. Bottou, “Large-Scale Machine Learning with Stochastic Gradient Descent,” in Proc. of COMP-
STAT’2010, Y. Lechevallier and G. Saporta, Eds. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186.

[9] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” arXiv
preprint:1606.04838, Jun. 2016.

[10] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,”
in Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, NV, Dec. 2013, pp. 315–323.

[11] A. Defazio, F. Bach, and S. Lacoste-Julien, “Saga: A fast incremental gradient method with support for
non-strongly convex composite objectives,” in Proc. Advances in Neural Info. Process. Syst., Montreal,
Canada, Dec. 2014, pp. 1646–1654.

[12] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,”
Mathematical Programming, vol. 162, no. 1-2, pp. 83–112, Mar. 2017.

[13] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning
with the parameter server,” in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2014,
pp. 19–27.

[14] B. McMahan and D. Ramage, “Federated learning: Collaborative machine learning without centralized
training data,” Google Research Blog, Apr. 2017. [Online]. Available: https://research.googleblog.com/
2017/04/federated-learning-collaborative.html

[15] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algorithms for nonconvex
big-data optimization: Model and convergence,” arXiv preprint:1607.04818, Jul. 2016.

[16] T. Sun, R. Hannah, and W. Yin, “Asynchronous coordinate descent under more realistic assumptions,” in
Proc. Advances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 6183–6191.

[17] Z. Peng, Y. Xu, M. Yan, and W. Yin, “Arock: an algorithmic framework for asynchronous parallel coordi-
nate updates,” SIAM J. Sci. Comp., vol. 38, no. 5, pp. 2851–2879, Sep. 2016.

[18] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradi-
ent descent,” in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, Dec. 2011, pp. 693–701.

[19] J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Sridhar, “An asynchronous parallel stochastic coordinate
descent algorithm,” J. Machine Learning Res., vol. 16, no. 1, pp. 285–322, 2015.

[20] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimiza-
tion,” in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2015, pp. 2737–2745.

[21] M. I. Jordan, J. D. Lee, and Y. Yang, “Communication-efficient distributed statistical inference,” J. Amer-
ican Statistical Association, vol. to appear, 2018.

[22] Y. Zhang, J. C. Duchi, and M. J. Wainwright, “Communication-efficient algorithms for statistical opti-
mization.” J. Machine Learning Res., vol. 14, no. 11, 2013.
[23] A. T. Suresh, X. Y. Felix, S. Kumar, and H. B. McMahan, “Distributed mean estimation with limited
communication,” in Proc. Intl. Conf. Machine Learn., Sydney, Australia, Aug. 2017, pp. 3329–3337.

[24] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via
gradient quantization and encoding,” in Proc. Advances in Neural Info. Process. Syst., Long Beach, CA,
Dec. 2017, pp. 1709–1720.

[25] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary gradients to reduce
communication in distributed deep learning,” in Proc. Advances in Neural Info. Process. Syst., Long Beach,
CA, Dec. 2017, pp. 1509–1519.

[26] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proc. of Empirical
Methods in Natural Language Process., Copenhagen, Denmark, Sep. 2017, pp. 440–445.
[27] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, “Communication-
efficient distributed dual coordinate ascent,” in Proc. Advances in Neural Info. Process. Syst., Montreal,
Canada, Dec. 2014, pp. 3068–3076.
[28] C. Ma, J. Konečnỳ, M. Jaggi, V. Smith, M. I. Jordan, P. Richtárik, and M. Takáč, “Distributed optimization
with arbitrary local solvers,” Optimization Methods and Software, vol. 32, no. 4, pp. 813–848, Jul. 2017.

[29] O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approx-
imate newton-type method,” in Proc. Intl. Conf. Machine Learn., Beijing, China, Jun. 2014, pp. 1000–
1008.

[30] Y. Zhang and X. Lin, “DiSCO: Distributed optimization for self-concordant empirical loss,” in Proc. Intl.
Conf. Machine Learn., Lille, France, Jun. 2015, pp. 362–370.

[31] Y. Liu, C. Nowzari, Z. Tian, and Q. Ling, “Asynchronous periodic event-triggered coordination of multi-
agent systems,” in Proc. IEEE Conf. Decision Control, Melbourne, Australia, Dec. 2017, pp. 6696–6701.

[32] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic opti-
mization,” arXiv preprint:1701.03961, Jan. 2017.

[33] Y. Nesterov, Introductory Lectures on Convex Optimization: A basic course. Berlin, Germany: Springer,
2013, vol. 87.
[34] D. Blatt, A. O. Hero, and H. Gauchman, “A convergent incremental gradient method with a constant step
size,” SIAM J. Optimization, vol. 18, no. 1, pp. 29–51, Feb. 2007.

[35] M. Gurbuzbalaban, A. Ozdaglar, and P. A. Parrilo, “On the convergence rate of incremental aggregated
gradient algorithms,” SIAM J. Optimization, vol. 27, no. 2, pp. 1035–1048, Jun. 2017.

[36] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml

[37] L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo, “Supervised feature selection via depen-
dence estimation,” in Proc. Intl. Conf. Machine Learn., Corvallis, OR, Jun. 2007, pp. 823–830.

[38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recogni-
tion,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
