Communication Complexity of Distributed Convex Learning and Optimization (NIPS 2015)
Abstract
We study the fundamental limits to communication-efficient distributed methods
for convex learning and optimization, under different assumptions on the informa-
tion available to individual machines, and the types of functions considered. We
identify cases where existing algorithms are already worst-case optimal, as well as
cases where room for further improvement is still possible. Among other things,
our results indicate that without similarity between the local objective functions
(due to statistical data similarity or otherwise) many communication rounds may
be required, even if the machines have unbounded computational power.
1 Introduction
We consider the problem of distributed convex learning and optimization, where a set of m ma-
chines, each with access to a different local convex function Fi : Rd → R and a convex domain
W ⊆ Rd, attempt to solve the optimization problem
$$\min_{w\in W} F(w) \quad \text{where} \quad F(w)=\frac{1}{m}\sum_{i=1}^{m}F_i(w). \qquad (1)$$
A prominent application is empirical risk minimization, where the goal is to minimize the average
loss over some dataset, where each machine has access to a different subset of the data. Letting
{z1, . . . , zN} be the dataset composed of N examples, and assuming the loss function ℓ(w, z) is convex in w, the empirical risk minimization problem $\min_{w\in W}\frac{1}{N}\sum_{i=1}^{N}\ell(w,z_i)$ can be written as in Eq. (1), where Fi(w) is the average loss over machine i's examples.
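To make this instance of Eq. (1) concrete, here is a minimal Python sketch (ours, not part of the paper) for the squared loss, where each of m machines holds an equal share of the data and its local function Fi is the average loss over that share:

```python
import numpy as np

def local_objective(w, X_i, y_i):
    """F_i(w): average squared loss over machine i's examples."""
    residuals = X_i @ w - y_i
    return 0.5 * np.mean(residuals ** 2)

def global_objective(w, local_data):
    """F(w) = (1/m) * sum_i F_i(w), exactly the form of Eq. (1)."""
    return np.mean([local_objective(w, X_i, y_i) for X_i, y_i in local_data])

# N = 1000 points in d = 10 dimensions, split equally among m = 4 machines.
rng = np.random.default_rng(0)
d, m, n = 10, 4, 250
w_true = rng.normal(size=d)
local_data = []
for _ in range(m):
    X_i = rng.normal(size=(n, d))
    y_i = X_i @ w_true + 0.1 * rng.normal(size=n)
    local_data.append((X_i, y_i))

print(global_objective(np.zeros(d), local_data))
```

Since the shares are of equal size, averaging the local averages recovers the global empirical risk exactly.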
The main challenge in solving such problems is that communication between the different machines
is usually slow and constrained, at least compared to the speed of local processing. On the other
hand, the datasets involved in distributed learning are usually large and high-dimensional. Therefore,
machines cannot simply communicate their entire data to each other, and the question is how well
we can solve problems such as Eq. (1) using as little communication as possible.
As datasets continue to increase in size, and parallel computing platforms become more and more
common (from multiple cores on a single CPU to large-scale and geographically distributed com-
puting grids), distributed learning and optimization methods have been the focus of much research
in recent years, with just a few examples including [25, 4, 2, 27, 1, 5, 13, 23, 16, 17, 8, 7, 9, 11, 20,
19, 3, 26]. Most of this work studied algorithms for this problem, which provide upper bounds on
the required time and communication complexity.
In this paper, we take the opposite direction, and study the fundamental performance limitations
in solving Eq. (1), under several different sets of assumptions. We identify cases where
existing algorithms are already optimal (at least in the worst-case), as well as cases where room for
further improvement is still possible.
Since a major constraint in distributed learning is communication, we focus on studying the amount
of communication required to optimize Eq. (1) up to some desired accuracy ε. More precisely,
we consider the number of communication rounds that are required, where in each communication
round the machines can generally broadcast to each other information linear in the problem’s di-
mension d (e.g. a point in W or a gradient). This applies to virtually all algorithms for large-scale
learning we are aware of, where sending vectors and gradients is feasible, but computing and sending
larger objects, such as Hessians (d × d matrices) is not.
Our results pertain to several possible settings (see Sec. 2 for precise definitions). First, we distin-
guish between the local functions being merely convex or strongly-convex, and whether they are
smooth or not. These distinctions are standard in studying optimization algorithms for learning, and
capture important properties such as the regularization and the type of loss function used. Second,
we distinguish between a setting where the local functions are related – e.g., because they reflect
statistical similarities in the data residing at different machines – and a setting where no relationship
is assumed. For example, in the extreme case where data was split uniformly at random between
machines, one can show that quantities such as the values, gradients and Hessians of the local functions differ only by δ = O(1/√n), where n is the sample size per machine, due to concentration
of measure effects. Such similarities can be used to speed up the optimization/learning process, as
was done in e.g. [20, 26]. Both the δ-related and the unrelated setting can be considered in a unified
way, by letting δ be a parameter and studying the attainable lower bounds as a function of δ. Our
results can be summarized as follows:
• First, we define a mild structural assumption on the algorithm (which is satisfied by reasonable approaches we are aware of), which allows us to provide the lower bounds described below on the number of communication rounds required to reach a given suboptimality ε (these bounds are also collected in the summary table following this list).
– When the local functions can be unrelated, we prove a lower bound of Ω(√(1/λ) log(1/ε)) for smooth and λ-strongly convex functions, and Ω(√(1/ε)) for smooth convex functions. These lower bounds are matched by a straightforward distributed implementation of accelerated gradient descent. In particular, the results imply that many communication rounds may be required to get a high-accuracy solution, and moreover, that no algorithm satisfying our structural assumption can do better, even if we endow the local machines with unbounded computational power. For non-smooth functions, we show a lower bound of Ω(√(1/(λε))) for λ-strongly convex functions, and Ω(1/ε) for general convex functions. Although we leave a full derivation to future work, it seems these lower bounds can be matched in our framework by an algorithm combining acceleration and Moreau proximal smoothing of the local functions.
– When the local functions are related (as quantified by the parameter δ), we prove a communication round lower bound of Ω(√(δ/λ) log(1/ε)) for smooth and λ-strongly convex functions. For quadratics, this bound is matched (up to constants and logarithmic factors) by the recently-proposed DISCO algorithm [26]. However, getting an optimal algorithm for general strongly convex and smooth functions in the δ-related setting, let alone for non-smooth or non-strongly convex functions, remains open.
• We also study the attainable performance without posing any structural assumptions on the algo-
rithm, but in the more restricted case where only a single round of communication is allowed. We
prove that in a broad regime, the performance of any distributed algorithm may be no better than a
‘trivial’ algorithm which returns the minimizer of one of the local functions, as long as the number
of bits communicated is less than Ω(d²). Therefore, in our setting, no communication-efficient
1-round distributed algorithm can provide non-trivial performance in the worst case.
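For quick reference, the communication-round lower bounds listed in the first bullet above can be collected in one table (a restatement of the results just listed, with 1-smoothness normalized and ε denoting the target suboptimality; it adds nothing beyond the list):

```latex
\begin{tabular}{llc}
\hline
Setting & Local functions & Lower bound on rounds \\
\hline
unrelated & smooth, $\lambda$-strongly convex & $\Omega\big(\sqrt{1/\lambda}\,\log(1/\epsilon)\big)$ \\
unrelated & smooth, convex & $\Omega\big(\sqrt{1/\epsilon}\big)$ \\
unrelated & non-smooth, $\lambda$-strongly convex & $\Omega\big(\sqrt{1/(\lambda\epsilon)}\big)$ \\
unrelated & non-smooth, convex & $\Omega\big(1/\epsilon\big)$ \\
$\delta$-related & smooth, $\lambda$-strongly convex & $\Omega\big(\sqrt{\delta/\lambda}\,\log(1/\epsilon)\big)$ \\
\hline
\end{tabular}
```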
Related Work
There have been several previous works which considered lower bounds in the context of distributed
learning and optimization, but to the best of our knowledge, none of them provide a similar type of
results. Perhaps the most closely-related paper is [22], which studied the communication complexity
of distributed optimization, and showed that Ω(d log(1/ε)) bits of communication are necessary between the machines, for d-dimensional convex problems. However, in our setting this does not lead to any non-trivial lower bound on the number of communication rounds (indeed, just specifying a d-dimensional vector up to accuracy ε requires O(d log(1/ε)) bits). More recently, [2] considered
lower bounds for certain types of distributed learning problems, but not convex ones in an agnostic
distribution-free framework. In the context of lower bounds for one-round algorithms, the results of
[6] imply that Ω(d²) bits of communication are required to solve linear regression in one round of
communication. However, that paper assumes a different model than ours, where the function to be
optimized is not split among the machines as in Eq. (1), where each Fi is convex. Moreover, issues
such as strong convexity and smoothness are not considered. [20] proves an impossibility result for
a one-round distributed learning scheme, even when the local functions are not merely related, but
actually result from splitting data uniformly at random between machines. On the flip side, that
result is for a particular algorithm, and doesn’t apply to any possible method.
Finally, we emphasize that distributed learning and optimization can be studied under many settings,
including ones different than those studied here. For example, one can consider distributed learning
on a stream of i.i.d. data [19, 7, 10, 8], or settings where the computing architecture is different, e.g.
where the machines have a shared memory, or the function to be optimized is not split as in Eq. (1).
Studying lower bounds in such settings is an interesting topic for future work.
are δ-related, if for any i, j ∈ {1, . . . , k}, it holds that
‖Ai − Aj‖ ≤ δ,  ‖bi − bj‖ ≤ δ,  |ci − cj| ≤ δ.
For example, in the context of linear regression with the squared loss over a bounded subset of Rd, and assuming mn data points with bounded norm are randomly and equally split among m machines, it can be shown that the conditions above hold with δ = O(1/√n) [20]. The choice of δ provides us with a spectrum of learning problems ranked by difficulty: when δ = Ω(1), this generally corresponds to the unrelated setting discussed earlier. When δ = O(1/√n), we get the situation typical of randomly partitioned data. When δ = 0, all the local functions have essentially the same minimizers, in which case Eq. (1) can be trivially solved with zero communication, just by letting one machine optimize its own local function. We note that although Definition 1 can be generalized to non-quadratic functions, we do not need it for the results presented here.
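The concentration behaviour described above can be illustrated numerically. The sketch below (ours, under the bounded-norm least-squares setting mentioned in the text) estimates the largest pairwise spectral-norm gap between the local quadratic terms for randomly split data; the gap indeed shrinks roughly like 1/√n:

```python
import numpy as np

# With n points per machine, the local quadratic terms A_i = (1/n) X_i^T X_i concentrate
# around the population covariance, so pairwise gaps ||A_i - A_j|| behave like O(1/sqrt(n)).

def max_pairwise_gap(n, m=4, d=20, seed=0):
    rng = np.random.default_rng(seed)
    A = []
    for _ in range(m):
        X_i = rng.normal(size=(n, d)) / np.sqrt(d)   # rows with roughly unit norm
        A.append(X_i.T @ X_i / n)
    return max(np.linalg.norm(A[i] - A[j], ord=2)     # spectral norm of the difference
               for i in range(m) for j in range(i + 1, m))

for n in [100, 400, 1600, 6400]:
    print(n, max_pairwise_gap(n))   # the gap roughly halves each time n quadruples
```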
We end this section with an important remark. In this paper, we prove lower bounds for the δ-related
setting, which includes as a special case the commonly-studied setting of randomly partitioned data (in which case δ = O(1/√n)). However, our bounds do not apply to random partitioning, since
they use δ-related constructions which do not correspond to randomly partitioned data. In fact, very
recent work [12] has cleverly shown that for randomly partitioned data, and for certain reasonable
regimes of strong convexity and smoothness, it is actually possible to get better performance than
what is indicated by our lower bounds. However, this encouraging result crucially relies on the
random partition property, and in parameter regimes which limit how much each data point needs to
be “touched”, hence preserving key statistical independence properties. We suspect that it may be
difficult to improve on our lower bounds under substantially weaker assumptions.
for some γ, ν ≥ 0 such that γ + ν > 0. After every communication round, let $W_j := \cup_{i=1}^{m} W_i$ for all j. The algorithm's final output (provided by the designated machine j) is a point in the span of Wj.
• Note that Wj is not an explicit part of the algorithm: It simply includes all points computed by
machine j so far, or communicated to it by other machines, and is used to define the set of new
points which the machine is allowed to compute.
• The assumption bears some resemblance to – but is far weaker than – standard assumptions used to provide lower bounds for iterative optimization algorithms. For example, a common assumption
(see [14]) is that each computed point w must lie in the span of the previous gradients. This corre-
sponds to a special case of Assumption 1, where γ = 1, ν = 0, and the span is only over gradients
of previously computed points. Moreover, it also allows (for instance) exact optimization of each
local function, which is a subroutine in some distributed algorithms (e.g. [27, 25]), by setting
γ = 0, ν = 1 and computing a point w satisfying γw + ν∇Fj (w) = 0. By allowing the span
to include previous gradients, we also incorporate algorithms which perform optimization of the
local function plus terms involving previous gradients and points, such as [20], as well as algo-
rithms which rely on local Hessian information and preconditioning, such as [26]. In summary,
the assumption is satisfied by most techniques for black-box convex optimization that we are
aware of. Finally, we emphasize that we do not restrict the number or computational complexity
of the operations performed between communication rounds.
• The requirement that γ, ν ≥ 0 is to exclude algorithms which solve non-convex local optimization problems of the form $\min_w F_j(w) + \gamma\|w\|^2$ with γ < 0, which are unreasonable in practice and can sometimes break our lower bounds.
• The assumption that Wj is initially {0} (namely, that the algorithm starts from the origin) is purely
for convenience, and our results can be easily adapted to any other starting point by shifting all
functions accordingly.
The techniques we employ in this section are inspired by lower bounds on the iteration complex-
ity of first-order methods for standard (non-distributed) optimization (see for example [14]). These
are based on the construction of ‘hard’ functions, where each gradient (or subgradient) computa-
tion can only provide a small improvement in the objective value. In our setting, the dynamics are
roughly similar, but the necessity of many gradient computations is replaced by many communica-
tion rounds. This is achieved by constructing suitable local functions, where at any time point no
individual machine can ‘progress’ on its own, without information from other machines.
We begin by presenting a lower bound when the local functions Fi are strongly-convex and smooth:
Theorem 1. For any even number m of machines, any distributed algorithm which satisfies Assumption 1, and for any λ ∈ [0, 1), δ ∈ (0, 1), there exist m local quadratic functions over Rd (where d is sufficiently large) which are 1-smooth, λ-strongly convex, and δ-related, such that if w∗ = arg minw∈Rd F(w), then the number of communication rounds required to obtain ŵ satisfying F(ŵ) − F(w∗) ≤ ε (for any ε > 0) is at least
$$\frac{1}{4}\left(\sqrt{1+\delta\left(\frac{1}{\lambda}-1\right)}-1\right)\log\left(\frac{\lambda\|w^*\|^2}{4\epsilon}\right)-\frac{1}{2} \;=\; \Omega\left(\sqrt{\frac{\delta}{\lambda}}\,\log\left(\frac{\lambda\|w^*\|^2}{\epsilon}\right)\right)$$
if λ > 0, and at least $\sqrt{\frac{3\delta}{32\epsilon}}\,\|w^*\| - 2$ if λ = 0.
The assumption of m being even is purely for technical convenience, and can be discarded at the
cost of making the proof slightly more complex. Also, note that m does not appear explicitly in
the bound, but may appear implicitly, via δ (for example, in a statistical setting δ may depend on
the number of data points per machine, and may be larger if the same dataset is divided among more
machines).
Let us contrast our lower bound with some existing algorithms and guarantees in the literature. First,
regardless of whether the local functions are similar or not, we can always simulate any gradient-
based method designed for a single machine, by iteratively computing gradients of the local func-
tions, and performing a communication round to compute their average. Clearly, this will be a gradient of the objective function $F(\cdot)=\frac{1}{m}\sum_{i=1}^{m}F_i(\cdot)$, which can be fed into any gradient-based method such as gradient descent or accelerated gradient descent [14]. The resulting number of required communication rounds is then equal to the number of iterations. In particular, using accelerated gradient descent for smooth and λ-strongly convex functions yields a round complexity of O(√(1/λ) log(‖w∗‖²/ε)), and O(‖w∗‖√(1/ε)) for smooth convex functions. This matches our
lower bound (up to constants and log factors) when the local functions are unrelated (δ = Ω(1)).
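The simulation argument in the previous paragraph can be sketched in a few lines. The code below (ours, for illustration on toy local quadratics) runs plain distributed gradient descent, with exactly one communication round, used to average the local gradients, per iteration; an accelerated method plugs into the same loop and yields the round complexities quoted above:

```python
import numpy as np

def distributed_gradient_descent(local_grads, w0, step, rounds):
    """One communication round per iteration: average the m local gradients, then step."""
    w = w0.copy()
    for _ in range(rounds):
        g = np.mean([grad(w) for grad in local_grads], axis=0)  # the communication round
        w = w - step * g
    return w

# Toy local quadratics F_i(w) = 0.5 * w^T A_i w - b_i^T w with symmetric positive definite A_i.
rng = np.random.default_rng(1)
d, m = 5, 3
A, b = [], []
for _ in range(m):
    C = rng.normal(size=(d, d))
    A.append(C.T @ C / d + np.eye(d))
    b.append(rng.normal(size=d))
local_grads = [lambda w, A_i=A_i, b_i=b_i: A_i @ w - b_i for A_i, b_i in zip(A, b)]

w_hat = distributed_gradient_descent(local_grads, np.zeros(d), step=0.05, rounds=200)
A_avg, b_avg = np.mean(A, axis=0), np.mean(b, axis=0)
print(np.linalg.norm(w_hat - np.linalg.solve(A_avg, b_avg)))   # close to the exact minimizer
```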
When the functions are related, however, the upper bounds above are highly sub-optimal: Even if
the local functions are completely identical, and δ = 0, the number of communication rounds will
remain the same as when δ = Ω(1). To utilize function similarity while guaranteeing arbitrarily small ε, the two most relevant algorithms are DANE [20], and the more recent DISCO [26]. For smooth
and λ-strongly convex functions which are either quadratic or satisfy a certain self-concordance condition, DISCO achieves an Õ(1 + √(δ/λ)) round complexity ([26, Thm. 2]), which matches our lower
bound in terms of dependence on δ, λ. However, for non-quadratic losses, the round complexity
bounds are somewhat worse, and there are no guarantees for strongly convex and smooth functions
which are not self-concordant. Thus, the question of the optimal round complexity for such functions
remains open.
The full proof of Thm. 1 appears in the supplementary material, and is based on the following idea:
For simplicity, suppose we have two machines, with local functions F1 , F2 defined as follows,
$$F_1(w) \;=\; \frac{\delta(1-\lambda)}{4}\, w^\top A_1 w \;-\; \frac{\delta(1-\lambda)}{2}\, e_1^\top w \;+\; \frac{\lambda}{2}\|w\|^2 \qquad (3)$$
$$F_2(w) \;=\; \frac{\delta(1-\lambda)}{4}\, w^\top A_2 w \;+\; \frac{\lambda}{2}\|w\|^2, \quad \text{where}$$
$$A_1=\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots\\
0 & 1 & -1 & 0 & 0 & \cdots\\
0 & -1 & 1 & 0 & 0 & \cdots\\
0 & 0 & 0 & 1 & -1 & \cdots\\
0 & 0 & 0 & -1 & 1 & \cdots\\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix},
\qquad
A_2=\begin{pmatrix}
1 & -1 & 0 & 0 & 0 & \cdots\\
-1 & 1 & 0 & 0 & 0 & \cdots\\
0 & 0 & 1 & -1 & 0 & \cdots\\
0 & 0 & -1 & 1 & 0 & \cdots\\
0 & 0 & 0 & 0 & 1 & \cdots\\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}.$$
It is easy to verify that for δ, λ ≤ 1, both F1 (w) and F2 (w) are 1-smooth and λ-strongly convex,
as well as δ-related. Moreover, the optimum of their average is a point w∗ with non-zero entries
at all coordinates. However, since each local function has a block-diagonal quadratic term, it can
be shown that for any algorithm satisfying Assumption 1, after T communication rounds, the points
computed by the two machines can only have the first T + 1 coordinates non-zero. No machine
will be able to further ‘progress’ on its own, and cause additional coordinates to become non-zero,
without another communication round. This leads to a lower bound on the optimization error which
depends on T , resulting in the theorem statement after a few computations.
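The 'one new coordinate per round' mechanism can be checked numerically. The sketch below (ours) builds A1 and A2 as in the construction above, with the coefficients as reconstructed in Eq. (3), and tracks which coordinates of the iterates can become non-zero when the two machines alternate local gradient computations, exchanging their points once per round:

```python
import numpy as np

d = 12
A1 = np.zeros((d, d)); A1[0, 0] = 1.0
for k in range(1, d - 1, 2):                       # 2x2 blocks on coordinate pairs (2,3), (4,5), ...
    A1[k:k+2, k:k+2] = np.array([[1., -1.], [-1., 1.]])
A2 = np.zeros((d, d))
for k in range(0, d - 1, 2):                       # 2x2 blocks on coordinate pairs (1,2), (3,4), ...
    A2[k:k+2, k:k+2] = np.array([[1., -1.], [-1., 1.]])

delta, lam = 0.5, 0.1
e1 = np.zeros(d); e1[0] = 1.0
grad1 = lambda w: 0.5 * delta * (1 - lam) * (A1 @ w - e1) + lam * w   # gradient of F1
grad2 = lambda w: 0.5 * delta * (1 - lam) * (A2 @ w) + lam * w        # gradient of F2

support = np.zeros(d, dtype=bool)                  # coordinates that can be non-zero so far
for t in range(8):                                 # t-th communication round
    grad = grad1 if t % 2 == 0 else grad2          # machines alternate making progress
    w = np.where(support, 1.0, 0.0)                # a representative point on the current support
    support |= np.abs(grad(w)) > 1e-12
    print(f"round {t}: non-zero coordinates = {support.sum()}")
```

The printed counts grow by exactly one per round, which is the behaviour the proof exploits.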
Remaining in the framework of algorithms satisfying Assumption 1, we now turn to discuss the
situation where the local functions are not necessarily smooth or differentiable. For simplicity, our
formal results here will be in the unrelated setting, and we only informally discuss their extension
to a δ-related setting (in a sense relevant to non-smooth functions). Formally defining δ-related
non-smooth functions is possible but not altogether trivial, and is therefore left to future work.
We adapt Assumption 1 to the non-smooth case, by allowing gradients to be replaced by arbitrary
subgradients at the same points. Namely, we replace Eq. (2) by the requirement that for some
g ∈ ∂Fj (w), and γ, ν ≥ 0, γ + ν > 0,
$$\gamma w + \nu g \;\in\; \mathrm{span}\Big\{\, w',\; g',\; (\nabla^2 F_j(w') + D)\,w'',\; (\nabla^2 F_j(w') + D)^{-1} w'' \;:\; w', w'' \in W_j,\; g' \in \partial F_j(w'),\; D \text{ diagonal},\; \nabla^2 F_j(w') \text{ exists},\; (\nabla^2 F_j(w') + D)^{-1} \text{ exists} \,\Big\}.$$
The lower bound for this setting is stated in the following theorem.
Theorem 2. For any even number m of machines, any distributed optimization algorithm which
satisfies Assumption 1, and for any λ ≥ 0, there exist λ-strongly convex (1+λ)-Lipschitz continuous
convex local functions F1 (w) and F2 (w) over the unit Euclidean ball in Rd (where d is sufficiently
large), such that if w∗ = arg minw:kwk≤1 F (w), the number of communication rounds required to
obtain ŵ satisfying F(ŵ) − F(w∗) ≤ ε (for any sufficiently small ε > 0) is at least $\frac{1}{8\epsilon} - 2$ for λ = 0, and at least $\sqrt{\frac{1}{16\lambda\epsilon}} - 2$ for λ > 0.
indicates that performing accelerated gradient descent on smoothed versions of the local functions
(using Moreau proximal smoothing, e.g. [15, 24]), can match these lower bounds up to log factors.
We leave a full formal derivation (which has some subtleties) to future work.
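For reference, the smoothing alluded to above is the standard Moreau envelope; the facts below are textbook properties (cf. [15, 24]), stated as a sketch rather than a result of this paper:

```latex
% Standard Moreau (proximal) smoothing, for a convex F and smoothing parameter \mu > 0:
F_{\mu}(w) \;=\; \min_{v \in \mathbb{R}^d}\Big\{\, F(v) + \tfrac{1}{2\mu}\,\lVert v - w\rVert^2 \,\Big\}.
% F_\mu is (1/\mu)-smooth; if F is \lambda-strongly convex, F_\mu is
% \lambda/(1+\lambda\mu)-strongly convex; and if F is G-Lipschitz, then
% 0 \le F(w) - F_\mu(w) \le \mu G^2 / 2.
% Choosing \mu as a function of the target accuracy \epsilon and running accelerated
% gradient descent on the smoothed local functions is the route alluded to in the text.
```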
The full proof of Thm. 2 appears in the supplementary material. The proof idea relies on the fol-
lowing construction: Assume that we fix the number of communication rounds to be T , and (for
simplicity) that T is even and the number of machines is 2. Then we use local functions of the form
$$F_1(w) \;=\; \frac{1}{\sqrt{2}}\,|b - w_1| \;+\; \frac{1}{\sqrt{2(T+2)}}\Big(|w_2 - w_3| + |w_4 - w_5| + \cdots + |w_T - w_{T+1}|\Big) \;+\; \frac{\lambda}{2}\|w\|^2$$
$$F_2(w) \;=\; \frac{1}{\sqrt{2(T+2)}}\Big(|w_1 - w_2| + |w_3 - w_4| + \cdots + |w_{T+1} - w_{T+2}|\Big) \;+\; \frac{\lambda}{2}\|w\|^2,$$
where b is a suitably chosen parameter. It is easy to verify that both local functions are λ-strongly
convex and (1 + λ)-Lipschitz continuous over the unit Euclidean ball. Similar to the smooth case,
we argue that after T communication rounds, the resulting points w computed by machine 1 will
be non-zero only on the first T + 1 coordinates, and the points w computed by machine 2 will be
non-zero only on the first T coordinates. As in the smooth case, these functions allow us to ’control’
the progress of any algorithm which satisfies Assumption 1.
Finally, although the result is in the unrelated setting, it is straightforward to have a similar construc-
tion in a ‘δ-related’ setting, by multiplying F1 and F2 by δ. The resulting two functions have their
gradients and subgradients at most δ-different from each other, and the construction above leads to a lower bound of Ω(δ/ε) for convex Lipschitz functions, and Ω(δ√(1/(λε))) for λ-strongly convex
Lipschitz functions. In terms of upper bounds, we are actually unaware of any relevant algorithm
in the literature adapted to such a setting, and the question of attainable performance here remains
wide open.
• For any machine j, if ŵj = arg minw∈Rd Fj(w), then F(ŵj) − minw∈Rd F(w) ≤ c0δ²/λ.
The theorem shows that unless the communication budget is extremely large (quadratic in the di-
mension), there are functions which cannot be optimized to non-trivial accuracy in one round of
communication, in the sense that the same accuracy (up to a universal constant) can be obtained
with a ‘trivial’ solution where we just return the optimum of a single local function. This comple-
ments an earlier result in [20], which showed that a particular one-round algorithm is no better than
returning the optimum of a local function, under the stronger assumption that the local functions are
not merely δ-related, but are actually the average loss over some randomly partitioned data.
The full proof appears in the supplementary material, but we sketch the main ideas below. As
before, focusing on the case of two machines, and assuming machine 2 is responsible for providing
the output, we use
$$F_1(w) \;=\; 3\lambda\, w^\top\!\left(\Big(I + \frac{1}{2c\sqrt{d}}\,M\Big)^{-1} - \frac{1}{2}I\right) w, \qquad F_2(w) \;=\; \frac{3\lambda}{2}\,\|w\|^2 - \delta\, e_j^\top w,$$
where M is essentially a randomly chosen {−1, +1}-valued d × d symmetric matrix with spectral norm at most c√d, and c is a suitable constant. These functions can be shown to be δ-related as well as λ-strongly convex. Moreover, the optimum of F(w) = ½(F1(w) + F2(w)) equals
$$w^* \;=\; \frac{\delta}{6\lambda}\left(I + \frac{1}{2c\sqrt{d}}\,M\right) e_j.$$
Thus, we see that the optimal point w∗ depends on the j-th column of M . Intuitively, the machines
need to approximate this column, and this is the source of hardness in this setting: Machine 1
knows M but not j, yet needs to communicate to machine 2 enough information to construct its j-th
column. However, given a communication budget much smaller than the size of M (which is d2 ), it
is difficult to convey enough information on the j-th column without knowing what j is. Carefully
formalizing this intuition, and using some information-theoretic tools, allows us to prove the first
part of Thm. 3. Proving the second part of Thm. 3 is straightforward, using a few computations.
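As a numerical sanity check of this construction (under our reconstruction of F1 and F2 above, and with an arbitrary illustrative constant c rather than the constant from the proof), the following sketch verifies that the minimizer of the averaged objective indeed equals (δ/6λ)(I + M/(2c√d))e_j:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, delta, c, j = 50, 0.1, 0.05, 3.0, 7

S = np.triu(rng.choice([-1.0, 1.0], size=(d, d)))
M = S + np.triu(S, 1).T                                    # symmetric {-1,+1}-valued matrix
assert np.linalg.norm(M, ord=2) <= c * np.sqrt(d)          # spectral norm O(sqrt(d)) w.h.p.

B = np.eye(d) + M / (2 * c * np.sqrt(d))
H1 = 3 * lam * (np.linalg.inv(B) - 0.5 * np.eye(d))        # F1(w) = w^T H1 w, so grad F1 = 2 H1 w
e_j = np.zeros(d); e_j[j] = 1.0

# grad F = 0.5 * (2 H1 w + 3 lam w - delta e_j); set it to zero and solve for w*.
H = 0.5 * (2 * H1 + 3 * lam * np.eye(d))
w_star = np.linalg.solve(H, 0.5 * delta * e_j)
w_closed_form = (delta / (6 * lam)) * (B @ e_j)
print(np.allclose(w_star, w_closed_form))                  # True
```

In particular, the optimum is (a scaled copy of) the j-th column of M shifted by e_j, which is the quantity machine 2 must learn about through communication.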
Acknowledgments: This research is supported in part by an FP7 Marie Curie CIG grant, the Intel
ICRI-CI Institute, and Israel Science Foundation grant 425/13. We thank Nati Srebro for several
helpful discussions and insights.
References
[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear
learning system. CoRR, abs/1110.4198, 2011.
[2] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication com-
plexity and privacy. In COLT, 2012.
[3] M.-F. Balcan, V. Kanchanapally, Y. Liang, and D. Woodruff. Improved distributed principal
component analysis. In NIPS, 2014.
[4] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and
distributed approaches. Cambridge University Press, 2011.
[5] S.P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statis-
tical learning via ADMM. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[6] K. Clarkson and D. Woodruff. Numerical linear algebra in the streaming model. In STOC,
2009.
[7] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated
gradient methods. In NIPS, 2011.
[8] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction
using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
[9] J. Duchi, A. Agarwal, and M. Wainwright. Dual averaging for distributed optimization: Con-
vergence analysis and network scaling. IEEE Trans. Automat. Contr., 57(3):592–606, 2012.
[10] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Competing with the empirical risk minimizer in
a single pass. arXiv preprint arXiv:1412.6606, 2014.
[11] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. Jordan.
Communication-efficient distributed dual coordinate ascent. In NIPS, 2014.
[12] J. Lee, T. Ma, and Q. Lin. Distributed stochastic variance reduced gradient methods. CoRR, abs/1507.07595, 2015.
[13] D. Mahajan, S. Keerthy, S. Sundararajan, and L. Bottou. A parallel SGD method with strong
convergence. CoRR, abs/1311.0636, 2013.
[14] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer, 2004.
[15] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical programming,
103(1):127–152, 2005.
[16] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochas-
tic gradient descent. In NIPS, 2011.
[17] P. Richtárik and M. Takác. Distributed coordinate descent method for learning with big data.
CoRR, abs/1310.2059, 2013.
[18] O. Shamir. Fundamental limits of online and distributed algorithms for statistical learning and
estimation. In NIPS, 2014.
[19] O. Shamir and N. Srebro. On distributed stochastic optimization and learning. In Allerton
Conference on Communication, Control, and Computing, 2014.
[20] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using
an approximate newton-type method. In ICML, 2014.
[21] T. Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
[22] J. Tsitsiklis and Z.-Q. Luo. Communication complexity of convex optimization. J. Complexity,
3(3):231–243, 1987.
[23] T. Yang. Trading computation for communication: Distributed SDCA. In NIPS, 2013.
[24] Y.-L. Yu. Better approximation and faster algorithm using proximal average. In NIPS, 2013.
[25] Y. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical
optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
[26] Y. Zhang and L. Xiao. Communication-efficient distributed optimization of self-concordant
empirical loss. In ICML, 2015.
[27] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In
NIPS, 2010.