
This article has been accepted for publication in IEEE Journal on Selected Areas in Communications. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/JSAC.2023.3242710

Communication-Efficient Distributed Learning: An Overview


Xuanyu Cao (IEEE Senior Member), Tamer Başar (IEEE Life Fellow), Suhas Diggavi (IEEE Fellow),
Yonina C. Eldar (IEEE Fellow), Khaled B. Letaief (IEEE Fellow), H. Vincent Poor (IEEE Life Fellow),
and Junshan Zhang (IEEE Fellow)

Abstract—Distributed learning is envisioned as the bedrock of next-generation intelligent networks, where intelligent agents, such as mobile devices, robots, and sensors, exchange information with each other or with a parameter server to train machine learning models collaboratively without uploading raw data to a central entity for centralized processing. By utilizing the computation/communication capabilities of individual agents, the distributed learning paradigm can mitigate the burden on central processors and help preserve the data privacy of users. Despite its promising applications, a downside of distributed learning is its need for iterative information exchange over wireless channels, which may lead to communication overhead that is unaffordable in many practical systems with limited radio resources such as energy and bandwidth. To overcome this communication bottleneck, there is an urgent need for communication-efficient distributed learning algorithms capable of reducing the communication cost while simultaneously achieving satisfactory learning/optimization performance. In this paper, we present a comprehensive survey of prevailing methodologies for communication-efficient distributed learning, including reduction of the number of communications, compression and quantization of the exchanged information, radio resource management for efficient learning, and game-theoretic mechanisms incentivizing user participation. We also point out potential directions for future research to further enhance the communication efficiency of distributed learning in various scenarios.

Index Terms—Distributed learning, communication efficiency, event-triggering, quantization, compression, sparsification, resource allocation, incentive mechanisms, single-task learning, multitask learning, meta-learning, online learning

This work was supported by the U.S. National Science Foundation Grant CNS-2128448, ARO MURI Grant AG285, US Army Research Laboratory Cooperative Agreement W911NF-17-2-0196, and the National Natural Science Foundation of China Grant 62203373.
Xuanyu Cao and Khaled B. Letaief are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (email: {eexcao, eekhaled}@ust.hk).
Tamer Başar is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA (email: [email protected]).
Suhas Diggavi is with the Department of Electrical and Computer Engineering, University of California, Los Angeles, CA, USA (email: [email protected]).
Yonina C. Eldar is with the Department of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel (email: [email protected]).
H. Vincent Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, USA (email: [email protected]).
Junshan Zhang is with the Department of Electrical and Computer Engineering, University of California, Davis, CA, USA (email: [email protected]).

I. INTRODUCTION

Machine learning is one of the most important technologies for enabling ubiquitous artificial intelligence (AI). In conventional centralized machine learning, all data is delivered from data owners to a central entity, which conducts centralized training and then sends the trained model to users of AI services. Such a centralized learning paradigm has several disadvantages. First, transmitting huge amounts of raw data to a central processor can lead to significant traffic congestion and large communication delays. This renders centralized learning inappropriate for time-sensitive applications such as autonomous driving. Second, conducting the entire training procedure in a centralized manner imposes a substantial, if not prohibitive, computation burden on the central processor, and may lead to large computation latency. Third, the training data of individual users may contain private, sensitive information (e.g., health data and financial data), and users with privacy concerns may not be willing to share their raw data with others.

To resolve the aforementioned issues, distributed learning has emerged as an alternative paradigm, in which data owners train machine learning models collaboratively and distributively without uploading raw data to a central entity for centralized processing. In distributed learning, by utilizing their communication and computation resources, intelligent devices (e.g., smartphones) conduct local training steps on local datasets and exchange information with other devices or a parameter server. Such a framework alleviates the computation and communication burden of centralized learning, and helps preserve the data privacy of users. Due to its great potential, distributed learning has been extensively studied in the past decades, and many distributed learning/optimization algorithms have been proposed for a variety of distributed learning settings (e.g., single-task learning, personalized learning, online learning, fully decentralized learning over networks, and more). Examples include distributed (sub)gradient descent [1], distributed primal-dual methods [2], the alternating direction method of multipliers [3], distributed Newton's method [4], etc. The convergence performance of these distributed learning algorithms has been comprehensively analyzed for learning problems under various conditions (convexity, nonconvexity, strong convexity, smoothness, etc.).

Distributed learning algorithms require agents to exchange information with each other or with a parameter server. The information often needs to be transmitted over wireless channels and may consume a substantial amount of radio resources (e.g., energy and bandwidth), which are scarce in practice. For instance, mobile devices may have scarce energy due to their limited battery capacity, and may only be able to use very narrow bandwidth in communication systems located in densely populated urban regions. Sensors deployed in the wild may have little energy supply and are difficult to recharge when they run out of energy.

In conventional distributed learning algorithms, agents have

© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Weizmann Institute of Science. Downloaded on February 08,2023 at 07:20:33 UTC from IEEE Xplore. Restrictions apply.

• Section II: Preliminaries of Distributed Learning (A. Distributed Learning with Parameter Server; B. Fully Decentralized Learning without Parameter Server)
• Section III: Reducing the Number of Communications in Distributed Learning (A. Multiple Local Update Steps Between Communications; B. Event-Triggering; C. Performance Limits; D. Future Directions)
• Section IV: Compressing the Communications in Distributed Learning (A. Quantization; B. Sparsification; C. Error-Compensated Compression; D. Other Compression Methods; E. Future Directions)
• Section V: Resource Management for Communication-Efficient Distributed Learning (A. Power Allocation; B. Bandwidth Allocation; C. Future Directions)
• Section VI: Game Theory for Communication-Efficient Distributed Learning (A. Existing Works; B. Future Directions)

Fig. 1: Organization of the paper.

to send high-dimensional dense real-valued vectors to other agents or the parameter server in every time slot, leading to high radio resource consumption. To cope with communication factors such as channel fading and noise, agents need to make the most of their limited radio resources to align well with the nature of distributed learning algorithms. Moreover, due to the substantial consumption of radio resources, agents are not well motivated to participate in distributed learning algorithms when they are deficient in resources. If not addressed adequately, the scarcity of wireless resources may greatly restrict the application of distributed learning in many practical scenarios.

To reduce the communication overhead of distributed learning algorithms, a variety of methods have been proposed in the literature. Methodologies for communication-efficient distributed learning can be divided into four categories. The first type of methods aims to reduce the number of communication rounds of distributed learning algorithms and requires information exchange only when necessary [5]. The conditions for communications to occur are devised to balance the tradeoff between learning performance and communication overhead. Alternatively, the second type of methods seeks to compress the information to be sent into a finite number of bits or sparse vectors through data compression techniques such as quantization [6] and sparsification [7], [8]. The compression methods and learning algorithms are designed jointly to mitigate the negative impact of compressed communications on the learning performance. The third type of methods takes practical wireless communication factors (noise, fading, interference, etc.) into consideration and aims to manage radio resources optimally for learning purposes (e.g., [9]). The goal is to achieve the best learning performance under radio resource budget constraints. The fourth type of works investigates the strategic behavior of agents in distributed learning [10]. Game-theoretic mechanisms are designed to incentivize agent participation in distributed learning algorithms, which consume agents' precious communication resources. There are works combining the aforementioned four types of techniques to further mitigate the communication overhead.

In this paper, we present a holistic overview of existing works on communication-efficient distributed learning. The organization of the paper is depicted in Fig. 1 and elucidated as follows.
• In Section II, we provide a brief overview of the basic problem formulations, algorithms, and convergence results of distributed learning, which is categorized into two scenarios, namely, distributed learning in the presence of a central parameter server and fully decentralized learning over networks without parameter servers. For both scenarios, we first consider single-task learning, where all agents seek to learn a common model. Then, we consider personalized learning (including multitask learning and meta-learning), where different agents aim to learn different (but related) models.
• In Section III, we survey communication-efficient distributed learning algorithms that reduce the number of communication rounds. Such algorithms may conduct multiple local update steps between consecutive communication rounds according to some pre-defined rules, or trigger communications only when certain conditions are met as the algorithms progress. We also provide an overview of results on characterizing the fundamental lower bounds for the number of communications needed to achieve certain learning performance guarantees. We then introduce several possible future research directions for reducing the number of communications in various distributed learning settings.
• In Section IV, we consider communication-efficient distributed learning algorithms using compressed communications to reduce redundant information transmission. These compression techniques include quantization, sparsification, error-compensated compression, as well as other methods exploiting special structures (e.g., low rank) of the exchanged information. Potential directions


for future works on distributed learning with compressed communications are also mentioned.
• In Section V, we survey resource management techniques for distributed learning, which seek to achieve the best learning performance under radio resource budget constraints. We review results on both power allocation and bandwidth allocation, including their integration with other communication-efficient techniques such as user selection. We further point out some future research directions on this topic.
• In Section VI, we review several recent works on game-theoretic incentive mechanism design for encouraging user participation in distributed learning algorithms, which consume a substantial amount of users' radio resources. Some potential future directions are also discussed.
• In Section VII, we conclude the paper.

Fig. 2: Two multi-agent systems for distributed learning. (a) Distributed learning in a server-agent system; (b) fully decentralized learning over a network.

II. PRELIMINARIES OF DISTRIBUTED LEARNING

In this section, we provide a brief overview of distributed learning, a research topic extensively studied over multiple decades. We categorize distributed learning settings based on the presence or absence of a central entity coordinating the learning processes. For both scenarios, we present the basic problem formulations, prevailing algorithms, and convergence results.

A. Distributed Learning with Parameter Server

We first consider distributed learning over a system consisting of multiple agents and a central parameter server (abbreviated as server henceforth), as illustrated in Fig. 2-(a), where the server is able to exchange information with all agents. Such multi-agent systems are ubiquitous. For instance, in federated learning (FL) over cellular networks, the base station (server) can communicate with the mobile devices (agents) [11]–[13]. In sensor networks, the fusion center (server) can exchange information with the sensors (agents). In the following, we categorize distributed learning problems into two classes depending on whether the model parameters of the agents are the same or not.

1) Single-Task Learning: Let $L(x; u, d)$ be the loss function of the learning problem, where $x$, $u$, $d$ are the model parameter, input feature, and output value or label, respectively. For example, we have $L(x; u, d) = (u^T x - d)^2$ for linear regression, and $L(x; u, d) = \log(1 + \exp(-d \cdot u^T x))$ for logistic regression ($d = \pm 1$). The most standard and commonly used distributed learning setting is the single-task learning problem below:

$$\min_x f(x) := \sum_{i=1}^n f_i(x), \quad (1)$$

where $f_i(x) = \sum_{k \in S_i} L(x; u_{ik}, d_{ik})$ is the local loss function of agent $i$, and $\{u_{ik}, d_{ik}\}_{k \in S_i}$ is the training set of agent $i$. Problem (1) is referred to as empirical risk minimization or consensus optimization in the literature of distributed optimization. In such a single-task learning problem, agents aim to learn a common model $x$ collaboratively based on all agents' training data. For instance, in sensor networks, sensors may seek to estimate the location of an object jointly by using every sensor's local measurements. In deep learning, to alleviate the computational burden of training, data may be distributed among multiple computers, which collaborate to train a common neural network in parallel.

Problem (1) has been studied for decades [14], and a variety of algorithms have been proposed. One of the most standard algorithms is gradient descent (GD). At each time $t$, the server broadcasts the current model $x(t)$ to all agents. Each agent $i$ computes the local gradient $\nabla f_i(x(t))$ by using local training data, and sends it to the server. The server then aggregates all the local gradients, and updates the model according to

$$x(t+1) = x(t) - \eta_t \sum_{i=1}^n \nabla f_i(x(t)),$$

where $\eta_t > 0$ is the stepsize. If each $f_i$ is convex and has a Lipschitz continuous gradient with constant $L_i$, then a fixed stepsize $\eta_t = \eta \le \frac{1}{\sum_{i=1}^n L_i}$ will guarantee that the GD algorithm converges at rate $O(1/t)$.

Another popular algorithm for solving problem (1) is the distributed alternating direction method of multipliers (ADMM). At each time $t$, each agent $i$ sends its current local model $x_i(t)$ and local multiplier $\lambda_i(t)$ to the server. The server broadcasts $z(t+1) = \bar{x}(t) + \frac{1}{\rho}\bar{\lambda}(t)$ to all agents, where $\bar{x}(t) = \frac{1}{n}\sum_{i=1}^n x_i(t)$, $\bar{\lambda}(t) = \frac{1}{n}\sum_{i=1}^n \lambda_i(t)$, and $\rho > 0$ is an algorithm parameter. Then, each agent $i$ updates its local model and multiplier in parallel as follows:

$$x_i(t+1) = \arg\min_{x_i} \left\{ f_i(x_i) + \frac{\rho}{2} \left\| x_i + \frac{1}{\rho}\lambda_i(t) - z(t+1) \right\|^2 \right\},$$
$$\lambda_i(t+1) = \lambda_i(t) + \rho \left( x_i(t+1) - z(t+1) \right).$$

When the loss functions $f_i$ are strongly convex and have Lipschitz continuous gradients, distributed ADMM converges to the global optimal solution at a linear rate [15]. Many other optimization algorithms can also be used to solve the single-task distributed learning problem (1), such as momentum acceleration methods (e.g., heavy-ball and Nesterov's algorithms) and (quasi-)Newton's methods.

In some applications, the training data changes with time. Agents may collect new data in real time and discard outdated data. Correspondingly, the loss functions also vary across time. Such a scenario is referred to as online learning and has been investigated extensively [16], [17]. Let us denote the local loss function of agent $i$ at time $t$ by $f_{i,t}$ and let


$f_t(x) = \sum_{i=1}^n f_{i,t}(x)$ be the global loss function at time $t$. Let $\{x^*(t)\}$ be some performance benchmark, e.g., the dynamic optimal model $x^*(t) = \arg\min_{x \in X} f_t(x)$ or the best fixed model $x^*(t) = x^* = \arg\min_{x \in X} \sum_{t=1}^T f_t(x)$, where $X$ is the set of admissible model parameters and $T$ is the time horizon. Our goal is to determine a series of model parameters $x(t)$ sequentially such that the regret, i.e., $\sum_{t=1}^T f_t(x(t)) - \sum_{t=1}^T f_t(x^*(t))$, is minimized. In particular, if the regret is sublinear with respect to $T$, then the time-average loss incurred by the selected models $x(t)$ is no greater than that of the benchmark $x^*(t)$ asymptotically, as $T$ goes to infinity. One of the most standard online optimization algorithms is online gradient descent (OGD) [18], i.e.,

$$x(t+1) = P_X\left( x(t) - \eta_t \sum_{i=1}^n \nabla f_{i,t}(x(t)) \right), \quad (2)$$

where $P_X$ stands for projection onto $X$. In the algorithm, the server broadcasts the current model $x(t)$ to the agents and each agent $i$ sends the local gradient $\nabla f_{i,t}(x(t))$ to the server. With the stepsize being $\eta_t = \frac{1}{\sqrt{t}}$, under certain technical assumptions, it has been shown that the regret of OGD is upper bounded by $O(\sqrt{T})$ and is thus sublinear [18].

2) Personalized Learning: In practice, different agents may have different model parameters to learn, in which case problem (1) is not a suitable formulation. Such a scenario is referred to as personalized learning, where each agent has its own personal model to infer. One viable formulation for personalized learning is multitask learning, where each agent $i$ seeks to learn its own model $x_i$. Even though the models of different agents are distinct, they are still related, and we should take their relationship into account when formulating the learning problem. This leads to the following standard formulation for multitask learning [19]:

$$\min_{X, \Omega} \sum_{i=1}^n f_i(x_i) + \gamma \cdot \mathrm{tr}\left( X \Omega^{-1} X^T \right), \quad (3a)$$
$$\text{s.t.} \quad \Omega \succeq 0, \quad \mathrm{tr}(\Omega) = 1, \quad (3b)$$

where $X = [x_1, ..., x_n]$, $\gamma > 0$ is a regularization parameter, and $\mathrm{tr}(\cdot)$ stands for the trace of a matrix. The matrix $\Omega$ characterizes the relationship between the models of different agents, and the regularization term $\mathrm{tr}(X \Omega^{-1} X^T)$ is used to promote such relationship in the learning outcome. In problem (3), we aim to learn both the models of all agents and the relationship between these models jointly. To this end, we can use alternating optimization methods [19], [20]. In other words, the agents first optimize over $X$ with fixed $\Omega$ in a parallel manner and send their local models to the server. Then, the server optimizes over $\Omega$ with fixed $X$, and broadcasts the new relationship $\Omega$ to all agents. Under certain technical conditions, convergence of such alternating optimization methods to the globally optimal solution can be guaranteed [19].

In addition to multitask learning, another recently popular framework for personalized learning is meta-learning, initiated in [21], [22]. In meta-learning, agents collaborate to learn a common meta-model. Starting from the meta-model, an agent can adapt to new tasks readily by using very limited local data and simple training iterations, e.g., a few gradient descent steps. The most standard form of meta-learning can be cast as the following optimization problem:

$$\min_x \sum_{i=1}^n f_i(x - \alpha \nabla f_i(x)), \quad (4)$$

where $\alpha > 0$ is the stepsize for local adaptation, i.e., one-step gradient descent. Let $F_i(x) := f_i(x - \alpha \nabla f_i(x))$ be the meta-function of agent $i$. To solve problem (4), we can still use the GD algorithm, i.e., $x(t+1) = x(t) - \eta_t \sum_{i=1}^n \nabla F_i(x(t))$, where each agent $i$ sends $\nabla F_i(x(t)) = (I - \alpha \nabla^2 f_i(x(t))) \nabla f_i(x(t) - \alpha \nabla f_i(x(t)))$ to the server at each time $t$. Various distributed personalized learning algorithms have been studied in [23]–[26] from the viewpoint of distributed optimization.

B. Fully Decentralized Learning without Parameter Server

Many multi-agent systems do not have any central entity capable of communicating with all agents. Instead, the agents form a network, where two agents linked by an edge are able to exchange information with each other, as illustrated in Fig. 2-(b). For instance, large-scale sensor networks may not have fusion centers, and sensors can only communicate with other nearby sensors. In ad hoc networks without base stations (e.g., battlefield networks without communication infrastructure), mobile devices can only communicate with other nearby devices. In the absence of central servers, the learning algorithms have to be fully decentralized, and only communications between one-hop neighbors are allowed. In the following, we discuss fully decentralized single-task learning and multitask learning over multi-agent networks without central entities.

1) Single-Task Learning: One of the most prevailing fully decentralized optimization algorithms for solving the single-task learning problem (1) is the decentralized gradient descent (DGD) method proposed in [1]. Let $N_i$ be the set of neighbors of agent $i$. In DGD, each agent $i$ updates its local model $x_i(t)$ by using a convex combination of its neighbors' local models, followed by a local gradient descent step, i.e.,

$$x_i(t+1) = \sum_{j \in N_i \cup \{i\}} a_{ij} x_j(t) - \eta_t \nabla f_i(x_i(t)),$$

where $a_{ij}$ is the $(i,j)$-th entry of a doubly stochastic weight matrix $A$. We have $a_{ij} = 0$ for $j \notin N_i \cup \{i\}$, so that each agent only communicates with its neighbors. It has been shown in [1] that, if a constant stepsize $\eta_t$ is used, all local models converge to a neighborhood of the optimal solution to (1) with rate $O(1/t)$. If diminishing stepsizes are used, the DGD algorithm can converge to the exact optimal solution with rate $O(1/\sqrt{t})$. Since the seminal work [1], a variety of first-order decentralized optimization algorithms have been developed to solve the consensus optimization problem (1) in various settings, including constrained decentralized optimization in [27], decentralized optimization over time-varying networks in [28], decentralized optimization over directed networks (e.g., the push-pull algorithm in [29]), and decentralized optimization over time-varying directed networks (e.g., the push-subgradient algorithm in [30]). Additionally, by using gradient information of the last two steps, the EXTRA algorithm proposed in [31] can converge to the exact optimal solution with a constant stepsize. To accelerate the convergence rate, a distributed Nesterov


gradient descent algorithm was developed in [32]. Distributed xi , x j ) for each pair of neighboring agents i and j. For
gij (x
zero-order algorithms with gradient tracking were studied in instance, gij (x xi − xj k2 − b2ij can be used to make
xi , xj ) = kx
[33], where one could only evaluate the objective functions neighbors’ models close to each other, where bij is some
at finitely many points. Further, second-order decentralized constant. The link costs are either added to the objective
optimization algorithms have also been studied, such as de- function of the learning problem, i.e.,
centralized Newton’s method with truncated approximation n
X n X
X
of inverse Hessian matrices in [4], and decentralized BFGS min xi ) +
fi (x xi , x j ),
gij (x (6)
xn
x 1 ,...,x
algorithm (a quasi-Newton method using gradient information i=1 i=1 j∈Ni
to approximate Newton steps) [34].
or used as constraints of the learning problem, i.e.,
In addition to the aforementioned primal-domain meth-
n
ods, primal-dual algorithms have also been developed for X
min xi )
fi (x (7a)
decentralized optimization problems. One of the most widely xn
x 1 ,...,x
i=1
used primal-dual methods for solving problem (1) is the
s.t. xi , x j ) ≤ 0, ∀i, j ∈ Ni .
gij (x (7b)
decentralized ADMM, in which each agent i updates its local
model x i (t) (i.e., primal variable) and multiplier φ i (t) (i.e., For problem (6), decentralized linearized ADMM and decen-
dual variable) as follows: tralized Newton’s method were proposed in [41] and [42],
( respectively, both of which could achieve linear convergence
xi ) + φ i (t)Tx i + ρ|Ni |kx
x i (t + 1) = arg min fi (x xi k2 rate. For problem (7), a primal-dual optimization method was
xi developed in [43] to handle the constraints, and convergence
!T ) rate for the objective and constraint functions were shown to
be O(t−1/2 ) and O(t−1/4 ), respectively.
X
− ρ |Ni |x
xi (t) + x j (t) xi ,
j∈Ni Furthermore, when the training data is collected sequentially
(5a) and the loss functions vary across time, decentralized multitask
adaptive learning algorithms have been proposed in [44],
 
X
φ i (t + 1) = φ i (t) + ρ |Ni |x
xi (t + 1) − x j (t + 1) , [45], where agents are clustered and neighboring clusters
where | · | stands for the cardinality of a set. At each time t, each agent i needs to broadcast its current local model x_i(t) to all the neighbors in N_i. When the loss functions are strongly convex and have Lipschitz continuous gradients, it has been shown in [3] that decentralized ADMM has a linear convergence rate. Following [3], a series of variants of decentralized ADMM have been developed. To reduce the computational burden and avoid solving optimization subproblems in each iteration, linearized ADMM and quadratically approximated ADMM have been proposed in [35], [36], which use linear and quadratic approximations for f_i(x_i) in step (5a) to obtain closed-form update equations.

When the training data is collected in real time and the loss functions are time-varying, decentralized online optimization problems have been studied. A decentralized online gradient descent algorithm was developed in [37], where the regret of every agent was upper bounded by O(√T). A decentralized online saddle-point algorithm was proposed in [38], and a decentralized online push-sum algorithm was developed for directed graphs in [39]. Moreover, dynamic decentralized ADMM was studied in [40] and was shown to converge to a neighborhood of the dynamic optimal solution, where the size of the neighborhood depended on the variation speed of the loss functions.

2) Multitask Learning: In addition to the single-task problem (1), decentralized multitask learning problems over multi-agent networks without any central entity have also been studied in the literature. In such a case, each agent i has an individual model x_i to learn, and the local models of different agents are related. One of the most common methods of characterizing this relationship is to introduce a link cost so that an agent i and its neighbors j ∈ N_i have similar models. When the model parameters are sparse, ADMM-based and subgradient-based decentralized multitask adaptive learning algorithms have been developed in [46].

III. REDUCING THE NUMBER OF COMMUNICATIONS IN DISTRIBUTED LEARNING

Conventional distributed learning algorithms require agents to exchange information with the server or neighboring agents at every time instant, which can lead to a high communication overhead. In this section, we provide an overview of communication-efficient distributed learning algorithms that reduce the number of communications, and point out several potential directions for future work.

A. Multiple Local Update Steps Between Communications

One of the most commonly used approaches for improving the communication efficiency of distributed learning is to exchange information periodically instead of at every time instant. Between consecutive communications, agents conduct multiple steps of local model updates based on local data. In [5], such a method for distributed learning in a server-agent system has been investigated. Let τ ∈ {1, 2, ...} be the number of local model update steps between two consecutive global aggregations, i.e., communications between the server and the agents. When the time index t is not an integer multiple of τ, each agent i conducts a local gradient descent step to update the local model x_i(t), i.e.,

x_i(t) = x_i(t − 1) − η∇f_i(x_i(t − 1)).
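To make the scheme concrete, the following is a small self-contained sketch (our own illustration, not code from [5] or [47]; the function name, toy quadratic losses, and parameter values are all assumptions) of τ local gradient steps interleaved with periodic server-side averaging:

```python
def local_gd_with_periodic_averaging(grads, x0, eta=0.1, tau=5, rounds=40):
    """Run `rounds` global aggregations; between two aggregations each
    agent takes `tau` local gradient descent steps on its own loss."""
    n = len(grads)
    x = [x0] * n                        # local models, one per agent
    for _ in range(rounds):
        for i in range(n):
            for _ in range(tau):        # local steps: x_i <- x_i - eta*grad f_i(x_i)
                x[i] -= eta * grads[i](x[i])
        x_bar = sum(x) / n              # server averages the local models
        x = [x_bar] * n                 # broadcast: every agent adopts x_bar
    return x[0]

# Toy scalar example: f_i(x) = (x - b_i)^2 / 2, whose average over agents
# is minimized at the mean of the b_i.
b = [1.0, 2.0, 3.0, 6.0]
grads = [lambda x, bi=bi: x - bi for bi in b]
print(round(local_gd_with_periodic_averaging(grads, 0.0), 6))  # prints 3.0
```

With these quadratic losses the averaged iterates converge to the mean of the b_i, the global minimizer, while the server is contacted only once every τ local steps instead of at every step.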

© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Weizmann Institute of Science. Downloaded on February 08,2023 at 07:20:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Journal on Selected Areas in Communications. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JSAC.2023.3242710

Otherwise, when t is an integer multiple of τ, each agent i sends x_i(t − 1) − η∇f_i(x_i(t − 1)) to the server. The server aggregates the information from all agents to obtain

x̃(t) = (1/n) Σ_{i=1}^{n} [x_i(t − 1) − η∇f_i(x_i(t − 1))].

Then, the server broadcasts x̃(t) to all agents and each agent i updates its new local model to be x_i(t) = x̃(t). In such an algorithm, global aggregations occur once every τ time instants.

Suppose we are concerned with M types of radio resources, e.g., energy and bandwidth, and the budget for the type-m resource is R_m, m = 1, ..., M. When models are updated locally without information exchange (i.e., t is not an integer multiple of τ), the multi-agent system consumes c_m amount of the type-m resource. If, in addition to local model update steps, global aggregation happens and information exchange between the server and the agents is needed, the system consumes b_m amount of the type-m resource. We usually have b_m > c_m since global aggregation consumes additional resources. Let T be the number of time instants of the algorithm and K = T/τ be the number of global aggregations. When global aggregation occurs, the server sets f̃ ← min{f(x̃(t)), f̃} so that f̃ records the best loss function value at times t = 0, τ, 2τ, .... Our goal is to achieve the best loss function value subject to the resource constraints, i.e.,

min_{τ,K∈{1,2,...}} min_{k=0,...,K} f(x̃(kτ))   (8a)
s.t. (T + 1)c_m + (K + 1)b_m ≤ R_m, ∀m = 1, ..., M,   (8b)
T = τK,   (8c)

where the additional “+1” in (8b) accounts for the last global aggregation. In [5], an algorithm for solving problem (8) approximately has been proposed when the resource consumptions {c_m, b_m} are known in advance. When {c_m, b_m} are unknown and can vary with time, a control algorithm estimating the parameters and adjusting the value of τ on-the-fly has been developed.

Similarly, the federated averaging (FedAvg) algorithm in [47] lets each agent conduct multiple steps of local model updates between two global aggregations in distributed learning of deep networks. In addition, FedAvg selects a dynamic subset of agents, instead of all agents, to participate in model updating and global aggregation, which further improves the communication efficiency. It was shown in [47] through extensive numerical experiments that FedAvg could reduce the communication overhead by one to two orders of magnitude. Further, in [48], the authors studied rigorously the reason why periodic model averaging (i.e., global aggregation) could work as well as parallel mini-batch SGD (with global aggregation at every time instant) and achieve linear speedup with respect to the number of agents. In particular, it was shown that the dominant term in the convergence bound for distributed learning with periodic model averaging was O(1/√(nt)), which was not affected by the model-averaging period. Further, in [49], the convergence rate of local stochastic gradient descent (SGD) was analyzed, where global aggregation occurred only at certain time instants. For smooth strongly convex learning problems, it was shown that local SGD converged at the same rate as standard mini-batch SGD. By using local SGD, the number of communication rounds could be reduced by a factor of O(√T), where T is the total number of update steps. Additionally, post-local SGD, i.e., a mixture of mini-batch SGD and local SGD, was proposed in [50], and was shown to achieve a better tradeoff between communication efficiency and generalization performance for deep learning. The convergence rate of local SGD with periodic averaging was further analyzed in [51] for nonconvex loss functions satisfying the Polyak-Łojasiewicz condition. It was shown that O((nT)^{1/3}) rounds of communication suffice to achieve a convergence rate of O(1/(nT)), which maintained linear speedup with respect to the number of agents. Further, the number of local model updates per round of global aggregation was adjusted in an adaptive manner in [52], so that the runtime of the distributed learning algorithm was minimized when communication delay existed. FL with a heterogeneous number of local updates among agents was studied in [53]. The authors developed a novel FL algorithm to compensate for the heterogeneity caused by agents' different computation speeds and dataset sizes. Additionally, to improve the convergence rate of local SGD, a slow momentum algorithm was proposed in [54], where agents performed local momentum model updates and synchronized periodically through global aggregations. A comprehensive comparison between local SGD and mini-batch SGD was presented in [55], and it was shown that the two algorithms could outperform each other in certain regimes.

In addition to local SGD, SGD with elastic averaging was proposed in [56], where proximal terms were included in the loss functions to allow some slack between the local models at the agents and the global model at the server. The approach was shown to have better learning performance in the deep learning setting, where many local minima exist. Momentum versions of elastic averaging SGD were also developed in [56]. Further, cooperative SGD, a unified framework for a variety of local SGD algorithms (e.g., local SGD with averaging, elastic averaging SGD, and decentralized local SGD over networks without a central server), was proposed and analyzed in [57], which improved upon prior results on local SGD in terms of convergence bounds and applicability. A new decentralized primal-dual algorithm named the decentralized communication sliding method was developed in [58] for multi-agent networks without central entities, where inter-agent communications were skipped while individual agents solved local optimization subproblems iteratively. In [59], the authors investigated semi-decentralized FL over a clustered network, which consisted of a server and multiple clusters of agents. Each cluster was comprised of a cluster head and multiple normal agents. Within each cluster, agents performed multiple SGD iterations based on local datasets and aperiodically engaged in consensus procedures within the cluster by using fully decentralized device-to-device (D2D) communications. Meanwhile, the cluster heads conducted inter-cluster model aggregation through the help of the central server. Within such a framework, an adaptive control algorithm was developed in [59] to tune the stepsize, D2D communication rounds, and global aggregation periods, with the goal of minimizing the overall system loss due to energy consumption, delay, and FL


performance. Moreover, in [60], a hierarchical FL framework was presented, where clients, edge servers, and a cloud server exchanged information with each other to learn collaboratively.

Different from GD, a communication-efficient dual coordinate ascent algorithm was put forth in [61], where local computation was used in a primal-dual method to reduce the communication overhead dramatically. A communication-efficient federated deep learning method was proposed in [62], where parameters of the deep layers were updated less frequently than those of the shallow layers to reduce the communication overhead. A temporally weighted aggregation strategy was introduced at the server to make use of the previously trained local models of the agents. Besides the single-task learning problem (1), meta-learning algorithms (cf. problem (4)) with a reduced number of communications have also been studied to facilitate communication-efficient personalized learning. In [23], a personalized FedAvg algorithm was proposed for distributed meta-learning problems, where a subset of agents conducted multiple local gradient descent steps with respect to their local meta-functions and global aggregation was performed periodically. For nonconvex loss functions, the convergence rate (to a first-order stationary point) of the algorithm was analyzed, and the impact of the closeness of the underlying distributions of agents' data (measured in terms of total variation and Wasserstein distance) on the learning performance was characterized.

B. Event-Triggering

The communication patterns of the distributed learning algorithms in the previous subsection follow predefined rules independent of the algorithm iterates, e.g., periodic global aggregation with a predefined period. Another generic approach to reducing the number of communications is to exchange information only when certain conditions related to the algorithm iterates are met during algorithm execution. Such an approach is named event-triggering, where communications occur only when a certain event is triggered. The triggering event can be devised so that information is exchanged only when necessary. This can potentially reduce the communication cost without degrading the learning performance much.

We use the event-triggered projected DGD algorithm for problem (1) in [63] as a concrete example to illustrate the event-triggering approach. Consider a fully decentralized network without central entities. Each agent i sends its local model x_i(t) to its neighbors only when certain conditions are met. In addition to x_i(t), each agent i maintains another variable x̃_i(t), which stands for the latest sent local model up to time t. Thus, at time t, agent i has access to x_i(t), x̃_i(t), and {x̃_j(t)}_{j∈N_i}. Then, agent i updates its local model as follows:

x_i(t + 1) = P_X ( x_i(t) + Σ_{j∈N_i} a_ij (x̃_j(t) − x̃_i(t)) − η∇f_i(x_i(t)) ),

where X is a common constraint set for the local models, and A = [a_ij] is a symmetric doubly stochastic weight matrix. The triggering event for communications depends on the gap between the new local model x_i(t + 1) and the latest sent local model x̃_i(t). Let C_i(t) be the triggering threshold of agent i at time t. If ‖x_i(t + 1) − x̃_i(t)‖ ≥ C_i(t), agent i sends x_i(t + 1) to all neighbors and sets x̃_i(t + 1) = x_i(t + 1). Otherwise, agent i does not send anything and sets x̃_i(t + 1) = x̃_i(t). In other words, agents communicate with neighbors only when the differences between the latest sent models and the current true models are large enough. The impact of the event-triggering thresholds {C_i(t)} on the performance of the decentralized learning algorithm was analyzed in [63]. Convergence can be guaranteed as long as the event-triggering thresholds are square-summable. If the loss functions are strongly convex and the event-triggering thresholds are geometrically decaying, the local models converge to some neighborhood of the optimal solution at a linear convergence rate, where the size of the neighborhood is proportional to the constant stepsize η.

In [64], the authors proposed an event-triggered multi-agent optimization algorithm over a complete network, where each agent was able to communicate with all other agents. Each agent sent its current local model to the others when it detected that other agents' estimates of its local model were sufficiently different from the true local model. Later, an edge-based event-triggered projected DGD algorithm over fully decentralized networks was developed in [65], where an agent sent its current local model to one of its neighbors only when the difference between the current model and the latest sent one was larger than an edge-specific threshold. With diminishing stepsizes and event-triggering thresholds, the convergence of the algorithm was analyzed for convex loss functions and the impact of the triggering thresholds on the convergence rate was investigated. The convergence rate of event-triggered decentralized SGD was further analyzed in [66] for nonconvex loss functions, in the presence of diminishing stepsizes and triggering thresholds. Moreover, a continuous-time decentralized event-triggered DGD algorithm was proposed in [67], which was independent of the parameters of the loss functions and free of Zeno behavior (i.e., not requiring an infinite number of communications within a finite period of time).

A decentralized event-triggered continuous-time zero-gradient-sum algorithm was proposed in [68], where the triggering condition depended on the distance between the latest sent local model and the current local model, as well as the consensus gap between the neighboring agents' models. The algorithm was shown to be free of Zeno behavior. In particular, the inter-communication time was lower bounded by some positive constant. For strongly convex loss functions, an exponential convergence rate of the algorithm to the optimal solution was established. Moreover, event-triggered decentralized zero-gradient-sum algorithms over directed networks were proposed in [69], where both continuous-time and discrete-time algorithms were considered. Further, in [70], the event-triggering approach was applied to a more general distributed optimization problem with affine constraints, which encompassed the distributed learning problem (1) and the network utility maximization problem as special cases. An event-triggered augmented Lagrangian method was put forth, where the triggering condition was related to the primal gradient of the augmented Lagrangian. In addition, a decentralized event-triggered gradient tracking algorithm was


proposed in [71], where linear convergence to the optimal solution was established using sporadic communications. A decentralized event-triggered gradient-push algorithm over directed networks was developed in [72], and the convergence of the algorithm was established under summable stepsizes and triggering thresholds. Moreover, a decentralized event-triggered coordinate descent algorithm was studied in [73]. An event-triggered (a.k.a. communication-censored) decentralized ADMM algorithm was developed in [74]. It was shown that the censored ADMM converged to the optimal solution if the loss functions were convex and the event-triggering thresholds were summable. If the loss functions were strongly convex and the triggering thresholds decayed geometrically, then the censored ADMM exhibited a linear convergence rate. Further, when the loss functions were time-varying, an event-triggered decentralized online subgradient method was developed in [75], where the impact of the triggering thresholds on the regret of each agent was characterized explicitly.

The integration of event-triggering and quantization was considered in [76], which proposed a continuous-time event-triggered DGD algorithm with dynamic quantization. The dynamic quantization scheme consisted of a dynamic encoder for the transmitting agent and a dynamic decoder for the receiving agent. The scheme quantized the difference between the latest sent local model and the current local model with increasing accuracy, which made use of the convergence effect of the algorithm. It was shown in [76] that the algorithm could converge to the optimal solution without encountering Zeno behavior. Analogously, a discrete-time event-triggered quantized DGD algorithm was developed in [77] for time-varying directed graphs, where the dynamic quantization scheme still included dynamic encoding and decoding methods with a finite number of quantization levels. It was shown that the algorithm could converge to the optimal solution even with one-bit information exchange at each triggered time instant, and the convergence rate was O(log t/√t) for convex loss functions.

An event-triggered distributed learning algorithm termed lazily aggregated gradient (LAG) for server-agent systems was developed in [78]. In LAG, each agent sent the difference between the current local gradient and the last sent local gradient to the server when this difference was larger than some threshold related to the weighted temporal variation of the global model. Meanwhile, the server sent the current model to an agent only when the difference between the local model of the agent and the global model of the server was larger than some threshold pertaining to the temporal variation of the global model. In other words, LAG conducted event-triggering for both the downlink and uplink communications between the server and the agents. It was shown in [78] that LAG exhibited a linear convergence rate for strongly convex loss functions and an O(1/t) convergence rate for convex loss functions. When the loss functions were nonconvex, LAG converged to a first-order stationary point at rate O(1/√t). In addition, LAG with quantized gradients was put forth in [79], which saved both the number of communication rounds and the number of bits per communication round. For strongly convex loss functions, such an algorithm was shown to have the same linear convergence rate as standard GD. The LAG algorithm was further extended to the policy gradient (PG) method for reinforcement learning (RL) in [80]. It was shown that the LAG approach could achieve the same convergence rate as the vanilla PG method, and the number of communications could be significantly reduced, especially when the reward functions of the agents were sufficiently heterogeneous. Other approaches to reducing the number of communications include dynamically increasing batch sizes in parallel SGD to achieve the best tradeoff between communication and computation (measured by the number of stochastic gradients called) [81], and properly infusing redundancy into the training data for distributed SGD [82].

C. Performance Limits

Several papers have investigated fundamental lower bounds on the number of communications needed to achieve certain learning performance guarantees.

The communication complexity of distributed convex optimization was investigated in [83]. The paper considered a simple setting where each of two processors had access to a different convex function f_i, i = 1, 2. The two processors exchanged binary information with each other until they found a point minimizing f_1(x) + f_2(x) (corresponding to the single-task learning problem (1) with two agents) within some error ε. It was shown in [83] that the minimal number of communication rounds to achieve this goal was Ω(d log(1/ε)), where d is the dimension of the decision variable (i.e., the model parameter in the context of learning). In [84], the authors studied lower bounds for the number of communication rounds needed to solve distributed learning problems over complete networks, where each agent was capable of broadcasting to everyone. They identified cases where existing distributed learning algorithms were worst-case optimal, as well as scenarios where improvements were possible. They showed that, if the loss functions of different agents were not similar, a large number of communications was necessary even when agents had infinite computation power. Lower bounds for the communication complexity of solving distributed linear systems and linear programs were studied in [85]. Further, the minimax communication complexity of distributed convex stochastic optimization problems was examined in [86], where every agent had access to the stochastic gradients of a common objective function. Lower bounds on the number of communications and a corresponding optimal algorithm with matching upper bounds (up to logarithmic factors) were presented. In addition, information-theoretic lower bounds on the query complexity of stochastic convex optimization were investigated in [87], [88].

D. Future Directions

Several potential directions for future work in this domain are listed below.

1) Reducing the Number of Communications for Distributed Online Learning: There are relatively few works on reducing the number of communications for distributed online learning. In [75], an event-triggered distributed online subgradient method was developed to reduce the number of communications for distributed online learning. Nevertheless, reference [75] did not quantify the communication overhead of the algorithm explicitly and did not study the optimal tradeoff between

© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Weizmann Institute of Science. Downloaded on February 08,2023 at 07:20:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Journal on Selected Areas in Communications. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JSAC.2023.3242710

learning performance and communication cost. Moreover, [75] was focused on online single-task learning and did not take other forms of learning problems into consideration. Many aspects of distributed online learning with a reduced number of communications are yet to be explored.

For distributed online learning in server-agent systems, the conventional approach is to use the OGD algorithm (2), where global aggregation occurs at every time instant. Alternatively, we can let each agent conduct local GD update steps for τ time instants based on the local training data collected during this time period (i.e., the local loss functions during this period). Every τ time instants, each agent sends the difference of the local model, i.e., x_i(t + τ) − x_i(t), to the server. The server aggregates all the local models and computes the new global model, which is broadcast to all agents. Suppose K rounds of global aggregation occur during the execution of the algorithm, i.e., T = Kτ time instants in total. We can then characterize the relation between the parameters τ, K and the regret of the online learning algorithm through regret analysis. Under given communication resource budgets, one can seek to obtain the optimal τ, K yielding the minimal time-average regret. It is also possible to extend this framework to other distributed online learning problems, such as those with constraints not amenable to computationally efficient projection operators. These problems can be handled through primal-dual methods using the Lagrangian (cf. [2]), and we can study the optimal tradeoff between the communication overhead and the learning performance as measured by regret and constraint violations.

In addition, it is possible to revisit the event-triggered decentralized online optimization problem over fully decentralized networks without any central server. In [75], the relation between the event-triggering thresholds and the regret of each agent has been characterized. One can further study the relation between the event-triggering thresholds and the communication overhead, based on which the optimal triggering thresholds can be designed to achieve the best regret under a given communication budget.

2) Reducing the Number of Communications for Distributed Personalized Learning: Most prior works on distributed learning algorithms with a reduced number of communications are focused on single-task learning problems. Two exceptions are [23] and [25], where infrequent communications are conducted to alleviate the communication overhead for solving personalized learning problems. One can further develop an event-triggered approach to distributed meta-learning so that the number of local updates per communication round is not fixed and can vary according to the needs of the algorithm based on triggering rules. It would also be possible to consider distributed online meta-learning algorithms with a reduced number of communications, when the training data is collected in real time.

Additionally, one can study distributed multitask learning algorithms for solving problems (3) (for server-agent systems) and (6), (7) (for fully decentralized networks without central servers) using a reduced number of communications. It would be possible to examine the relations between the communication patterns (e.g., the number of local updates per communication round or the event-triggering rules) and the learning performance, and to design algorithms achieving the best tradeoff between communication cost and learning performance.

3) Performance Limits for Generic Distributed Learning Problems: Even though some prior works have studied the fundamental performance tradeoff between the number of communications and the learning performance, they are only concerned with specific scenarios of distributed learning, e.g., linear programming and learning over complete graphs. We still lack a clear understanding of the fundamental tradeoff between learning performance and the number of communications in the general distributed learning setting. One can start from the most basic setting, namely the static single-task distributed learning problem (1), and determine lower bounds on the number of communications needed to arrive at an ε-suboptimal model. It would be interesting to see whether standard algorithms (e.g., GD and ADMM) or their variants can achieve the best communication complexity. If not, one can seek to design such optimal (in order sense) algorithms achieving the best learning performance with a limited communication budget. Afterwards, it would be possible to extend the framework to more complicated scenarios, such as distributed online learning and distributed personalized learning.

IV. COMPRESSING THE COMMUNICATIONS IN DISTRIBUTED LEARNING

In addition to reducing the number of communication rounds, another general approach to improving the communication efficiency of distributed learning algorithms is to compress the information exchanged in each communication round. In this section, an overview of distributed learning algorithms using compressed communications is presented.

A. Quantization

One of the most widely used compression methods for distributed learning is quantization, where the information to be exchanged is transformed into discrete values that can be encoded into a finite number of bits. Quantization techniques can reduce the number of communicated bits, and thus enable distributed learning in systems with scarce communication bandwidth, such as crowded metropolitan areas with scarce spectrum resources. The study of quantized incremental distributed learning algorithms was pioneered in [6], which aimed at solving the single-task learning problem (1) by using a finite number of bits per communication. In the incremental algorithm, all agents were numbered (labeled) in advance and took turns to update the model parameters according to the order prescribed by the labeling. The model updates were cycled through the network. Let Λ ⊂ R^d be a d-dimensional lattice, where each entry of x ∈ Λ is an integer multiple of some given δ > 0 (the width of the lattice). Denote the set of possible model parameters by X. At each cycle k, agent i receives the model x_{i−1,k} from its predecessor, agent i − 1, and computes the new model by a quantized gradient descent step as follows:

x_{i,k} = Q(x_{i−1,k} − η∇f_i(x_{i−1,k})),

where the quantizer Q is the projection operator associated with the set X ∩ Λ. Then, agent i sends the new model x_{i,k} to its successor, agent i + 1. After one completed cycle of model


updates, we set x_{0,k+1} = x_k = x_{n,k} and begin the next cycle. It has been shown in [6] that, as k goes to infinity, the gap between f(x_k) and the optimal value is upper bounded by some number pertaining to the quantization resolution δ. The smaller δ is, the better the learning performance becomes (i.e., the smaller f(x_k) becomes).

The approach in [6] required the network to maintain an ordering of the agents and was not fully decentralized. Alternatively, a fully decentralized quantized DGD algorithm was proposed in [89], where each agent i updated its local model in each time slot t as follows:

x_i(t+1) = Q( ∑_{j∈N_i∪{i}} a_ij x_j(t) − η_t ∇f_i(x_i(t)) ),

so that each agent only needed to transmit a quantized local model to its neighbors. The impact of the number of quantization levels on the convergence rate was investigated. To mitigate the negative effect of quantization on the learning performance, a universal vector quantization scheme was put forth in [90] for FL over rate-constrained wireless channels in a server-agent system. It was shown that the distortion due to quantization vanishes as the number of agents increases. Moreover, a distributed dual-averaging method using quantized communications was developed in [91]. When deterministic quantizers were used, the algorithm converged to a suboptimal point, where the suboptimality depended on the quantization resolution. When probabilistic quantizers were used, the algorithm converged to the optimal solution in expectation, and the impact of the quantization resolution on the convergence rate was investigated. Analogously, a quantized ADMM algorithm was studied in [92] using deterministic and probabilistic quantizers, and the effect of quantization accuracy on the learning performance was characterized. In addition, a variant of DGD using multiple quantized consensus communication steps per local gradient descent was proposed in [93] to allow a more flexible tradeoff between communication and computational costs.

A compression scheme named quantized SGD (QSGD) was proposed in [94] to allow for a smooth tradeoff between communication bandwidth and convergence time of the learning algorithms. QSGD enjoyed guaranteed convergence for both convex and nonconvex loss functions, and could be equipped with stochastic variance-reduction techniques to further accelerate convergence. Another method of achieving convergence to the exact optimal solution by exchanging only quantized values was proposed in [95], which studied fully decentralized quantized optimization problems over networks. At each time t, each agent i updated its local model as follows:

x_i(t+1) = (1 − ε + ε a_ii) x_i(t) + ε ∑_{j∈N_i} a_ij Q(x_j(t)) − η ∇f_i(x_i(t)),

where ε is some positive parameter to be chosen and η is the stepsize. The stochastic quantizer Q was assumed to be unbiased and have bounded variance. By setting ε = O(1/T^{3γ/2}) and η = O(1/T^{γ/2}), it was shown for strongly convex loss functions that E[‖x_i(T) − x*‖²] ≤ O(1/T^γ), where γ is an arbitrary number in (0, 1/2) and T is the number of time slots. The algorithm achieved exact convergence to the optimal solution x* by allocating diminishing weights to the quantized information received from neighbors. A similar approach was adopted in [96], where the weights for neighboring agents' quantized models converged to zero as the quantized DGD algorithm progressed. It was shown that, with random quantization schemes, the convergence rates for convex loss functions and strongly convex loss functions were O(δ² log t / ((1−σ)² t^{1/4})) and O(δ² log t / ((1−σ)³ t^{1/3})), respectively, where δ is the length of the quantization interval and 1 − σ is the spectral gap of the underlying communication graph.

A decentralized lazy mirror descent method with differential exchanges was developed in [97] for fully decentralized learning problems over rate-constrained noisy wireless channels. To combat the channel noise and rate constraints, the algorithm used quantization and power control techniques jointly. Besides local models, agents also maintained the disagreements in their estimates of neighbors' local models due to noise and rate constraints, and exchanged the quantized differences with neighbors. To guarantee convergence to the optimal solution, the algorithm designed two sequences. One sequence controlled the consensus rate (i.e., the weights of neighbors' noisy quantized information), and the other one controlled the transmission power when sending the differential signals. The impact of transmission power and quantization resolution on the convergence rate was characterized. A quantized FL algorithm was devised in [98], where transmission power and quantization bits were jointly allocated across the agents to minimize the communication errors.

An iteratively refined quantization scheme was proposed for inexact (accelerated) proximal gradient methods in [99]. During the progression of the algorithm, the center of the quantization range changed as the estimates of the optimal point varied, and the quantization range shrank as the estimates became more and more accurate. If the loss functions were strongly convex, with an appropriately designed dynamic quantization scheme (an appropriate shrinkage rate of the quantization range), the algorithm converged to the optimal solution at a linear rate. A similar approach based on DGD was adopted in [100], where an adaptive quantization scheme was used. As the algorithm progresses, one becomes more and more confident about the location of the optimal solution and adjusts the quantization codebook accordingly to make the quantized values more accurate. For convex or strongly convex loss functions, it was shown in [100] that such an adaptive quantization approach would not degrade the convergence rate compared to vanilla DGD with perfect communications, except for constant factors depending on the quantization resolution. Following this line of research, in [101], the authors designed dynamic quantization methods compressing the exchanged information into a few bits while still maintaining the linear convergence rate of the distributed learning algorithms. The convergence time of the algorithm was characterized as a function of the information transmission rate. Similar dynamic quantizers were applied to distributed gradient tracking algorithms to achieve linear convergence rates by using finite-bit communications in [102], [103]. By using an analogous dynamic quantization scheme, [104] sought to minimize the number of quantization levels for achieving exact convergence.
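As a concrete illustration of the quantized DGD update above, the following sketch runs it on a hypothetical three-agent network with quadratic losses f_i(x) = ½‖x − c_i‖². The mixing weights, quantizer resolution, and stepsizes are illustrative choices, not taken from [89].

```python
import numpy as np

def uniform_quantize(v, delta=0.05):
    """Uniform quantizer: round every entry to a grid of resolution delta."""
    return delta * np.round(v / delta)

# Hypothetical 3-agent network with doubly stochastic mixing weights a_ij.
A = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
c = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])  # f_i(x) = 0.5*||x - c_i||^2
x = np.zeros((3, 2))                                  # local models x_i(t)

for t in range(300):
    eta = 1.0 / (t + 10)            # diminishing stepsize
    grads = x - c                   # grad f_i(x_i(t)) for the quadratic losses
    # x_i(t+1) = Q( sum_j a_ij x_j(t) - eta_t grad f_i(x_i(t)) ): only the
    # quantized models ever need to cross the links between agents.
    x = uniform_quantize(A @ x - eta * grads)

# The agents settle near the minimizer of the sum of losses (the mean of
# the c_i), up to an error on the order of the quantization resolution.
print(x)
```

As the text notes for [6] and [89], the residual error shrinks with the quantization resolution: rerunning with a smaller `delta` tightens the final gap.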

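The shrinking-range idea behind the dynamic quantization schemes of [99], [100] can be sketched in one dimension as follows; the contraction factor, bit budget, and toy loss are assumptions for illustration only.

```python
import numpy as np

def range_quantize(v, center, radius, bits=4):
    """b-bit uniform quantizer on the interval [center - radius, center + radius]."""
    levels = 2 ** bits
    step = 2 * radius / (levels - 1)
    clipped = np.clip(v, center - radius, center + radius)
    return center - radius + step * np.round((clipped - (center - radius)) / step)

# Toy strongly convex problem f(x) = 0.5*(x - 3)^2; the transmitted iterate
# passes through the quantizer, whose range tracks the iterates and shrinks
# as confidence in the optimum's location grows.
x, center, radius = 0.0, 0.0, 8.0
for t in range(60):
    x = x - 0.5 * (x - 3.0)                      # gradient step (contraction 0.5)
    q = range_quantize(x, center, radius)        # what actually gets transmitted
    center, radius = q, max(radius * 0.7, 1e-9)  # recenter and shrink the range
    x = q                                        # proceed from the quantized point
print(round(x, 3))                               # prints 3.0
```

Because the range shrinks slower than the iterates contract, the optimum never leaves the quantization interval, and the effective resolution improves geometrically, which is the mechanism behind the linear rates reported in [99]–[101].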
Exploiting dynamic quantizers, [105] explored the minimal number of quantization levels to ensure convergence of DGD over time-varying directed graphs. It was shown that one-bit communications sufficed when certain system parameters were chosen properly. A hierarchical gradient quantization method for distributed learning was proposed in [106]. The stochastic gradient was decomposed into its norm and normalized gradient blocks, which were quantized using a uniform quantizer and a low-dimensional Grassmannian codebook, respectively. A bit-allocation scheme was used to determine the resolution of the low-dimensional quantizers for the gradient blocks. The convergence rate of this algorithm was analyzed in terms of the quantization bits. A double quantization method for distributed learning was developed in [107], where both the gradients (uplink transmission) and the models (downlink transmission) were quantized. The method was amenable to asynchronous implementation, and could be combined with gradient sparsification and momentum techniques to further improve the communication efficiency and convergence rate. Moreover, a quantized Frank-Wolfe algorithm was put forth in [108] to obtain a communication-efficient projection-free (thus alleviating the computational burden) approach. The convergence of the algorithm was analyzed for both convex and nonconvex problems. Quantized saddle-point algorithms were developed in [109] for decentralized stochastic optimization with pairwise constraints between neighbors, which could be used for multitask learning. The impact of quantization resolution on the convergence rate of the algorithms was examined for both the sample feedback and the bandit feedback (where only the values of the loss functions at two random points were revealed at each time) settings. Quantization of data instead of gradients was proposed in [110], which outperforms gradient compression significantly when the model dimension is large.

A more aggressive quantization approach is to compress the exchanged information to two possible values, i.e., one bit, or three possible values. A ternary gradient approach was proposed for distributed learning in [111], where only three possible values were transmitted. The convergence of the algorithm was established theoretically. It was shown via numerical experiments that the algorithm could reduce the bandwidth requirement significantly without affecting the learning performance much. In addition, a signSGD algorithm was studied in [112], where each agent sent only the signs of the local gradients and the server used a majority vote to aggregate the signs. An FL algorithm using one-bit gradient quantization and over-the-air majority rule aggregation was proposed in [113] for distributed learning over noisy fading wireless channels. The effects of wireless communication factors, e.g., channel fading, noise, and channel estimation errors, were investigated comprehensively. It was shown that the negative effects of these factors vanished as the number of agents grew. Another one-bit quantization approach proposed in [114] used only the signs of the relative models of neighbors, i.e., the signs of the differences between agents' models and neighbors' models. In the model adopted, at each time t, each agent i updates its local model according to

x_i(t+1) = x_i(t) + γ η_t ∑_{j∈N_i} a_ij sgn(x_j(t) − x_i(t)) − η_t ∇f_i(x_i(t)),

where γ > 0 is some algorithm parameter to be chosen. It was shown in [114] that the convergence of the algorithm could be guaranteed if γ is sufficiently large, and the convergence rate was the same as that of the vanilla DGD using the exact models of neighbors. The DGD algorithm based on signs of relative models was extended to the online scenario in [115], where the training data was collected sequentially and the loss functions varied across time. It was proved that the method could achieve the same regret (in order sense) as standard OGD. Additionally, an FL framework for training binary neural networks (BNNs) with binary model parameters was proposed in [116], where agents only needed to upload binary parameters to the server. Conditions ensuring the convergence of the proposed BNN training algorithm were derived theoretically.

To further reduce the communication overhead, quantization techniques can be used in conjunction with other methods. An FL algorithm using quantization, probabilistic device selection, and resource allocation jointly was proposed in [117]. The method could improve the learning performance and reduce the training time significantly. Quantization techniques were integrated with variance reduction to further accelerate the convergence in [118]. Moreover, in [119], an FL algorithm combining periodic averaging, partial agent participation, and quantization was developed. The impact of these communication-efficient techniques on the convergence rate was investigated for strongly convex as well as nonconvex problems. Convergence analysis of the FedAvg algorithm with non-i.i.d. dataset distributions, partial agent participation, and finite-precision quantization was presented in [120]. It was shown that, to achieve an O(1/t) convergence rate, transmitting the models required a logarithmic number of quantization levels, while transmitting the model differentials required only a constant number of quantization levels. A joint quantization and noise insertion approach for distributed learning was put forth in [121], which was able to achieve differential privacy and communication efficiency simultaneously.

B. Sparsification

In addition to quantization, another popular approach to compressing the communications in distributed learning algorithms is sparsification, where only a small subset of entries of the raw information vectors are transmitted.

It was observed in [7], [122] that most entries of the gradients used in the DGD algorithm are very close to zero. Motivated by this observation, in [122], the authors proposed to map 99% of the gradient entries to zero and only transmit the rest. Empirical experiments indicated that this could reduce the communication cost significantly without degrading the learning performance much. The author of [7] also reduced the amount of communications by three orders of magnitude for training deep neural networks. The authors in [123] proposed to sparsify the gradients used in SGD based on their magnitudes. Combining sparsified gradients and local error correction, the algorithm could provide convergence guarantees for both convex and nonconvex loss functions. A variant of the parallel block coordinate descent algorithm based on independent sparsification of local gradients was proposed in [124].
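A minimal sketch of the relative-sign update of [114] described above, on a hypothetical three-agent path graph with scalar quadratic losses (the weights, γ, and stepsizes are illustrative guesses, not the paper's tuning):

```python
import numpy as np

c = np.array([0.0, 1.0, 2.0])            # minimizer of the summed losses is x* = 1
neighbors = {0: [1], 1: [0, 2], 2: [1]}  # path graph, weight a_ij = 0.5 per edge
x = np.array([5.0, -4.0, 7.0])
gamma = 4.0                              # must be large enough for convergence

for t in range(1, 3001):
    eta = t ** -0.75                     # diminishing stepsize
    new_x = x.copy()
    for i in range(3):
        # x_i(t+1) = x_i(t) + gamma*eta_t*sum_j a_ij sgn(x_j - x_i) - eta_t*grad f_i:
        # a single sign bit per neighbor is the only communication needed.
        pull = 0.5 * sum(np.sign(x[j] - x[i]) for j in neighbors[i])
        new_x[i] = x[i] + gamma * eta * pull - eta * (x[i] - c[i])
    x = new_x
print(x)
```

With γ large enough, the bounded sign force dominates the disagreement between local gradients and holds the agents together, mirroring the condition for convergence stated in [114].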

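The observation motivating sparsification, that most gradient entries are near zero so a magnitude-based (Top-k) sparsifier keeps nearly all of the gradient energy, can be checked numerically. The heavy-tailed synthetic vector below is an assumption standing in for real gradients, not data from [7] or [122].

```python
import numpy as np

rng = np.random.default_rng(5)
d = 10_000
g = rng.standard_t(df=1.5, size=d)      # heavy-tailed synthetic "gradient"

k = d // 100                            # transmit only 1% of the entries
idx = np.argsort(np.abs(g))[-k:]        # indices of the k largest magnitudes
kept = np.zeros(d)
kept[idx] = g[idx]

energy_ratio = float((kept ** 2).sum() / (g ** 2).sum())
print(round(energy_ratio, 3))           # most of the squared norm survives
```

The heavier the tails of the entry distribution, the closer this ratio gets to one, which is why the 99%-zeroing rule of [122] loses so little in practice.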
Moreover, [125] proposed a sparsification scheme that minimized the total error incurred by sparsification throughout the learning process under a total communication budget constraint. It was found that the hard-threshold sparsifier, a variant of the Top-k sparsifier (sending the k entries with largest magnitudes and discarding the rest) with k determined by a constant threshold, was the optimal sparsifier under such a criterion. For convex as well as nonconvex loss functions, the convergence of distributed learning algorithms using such a hard-threshold sparsifier in conjunction with error feedback was analyzed. It was proved in [125] that the algorithm had the same asymptotic convergence and linear speedup properties as SGD, and unlike the conventional Top-k sparsifier, had no performance loss due to data heterogeneity. To further reduce the communication overhead of distributed learning, a global Top-k sparsifier was proposed in [126], where the k gradient entries with the globally largest absolute values from all agents were transmitted. It was shown that such a sparsifier incurred much less communication cost compared to the conventional local Top-k sparsifier. Additionally, a modified sparsified SGD algorithm, namely the global renovating SGD, was proposed in [127], where previous-round global gradients were utilized to estimate the current global gradient and renovate the current zero-sparsified gradients. While mitigating the communication overhead, the algorithm made the convergence direction closer to that of centralized optimization, thus accelerating the distributed learning. Convergence guarantees of rate O(1/√t) were provided for nonconvex learning problems.

The impact of wireless communication factors (e.g., channel fading, noise, power control) on sparsified distributed learning algorithms has also been investigated in the literature. FL over bandwidth-limited fading multiple access channels was studied in [128]. The authors proposed a compressed analog distributed SGD algorithm, where agents first sparsified their local gradients and then projected the resultant sparse vector into a low-dimensional vector for bandwidth reduction. Through bandwidth-limited wireless channels, these low-dimensional vectors from the agents were sent to the server, where the aggregation was conducted by over-the-air computations. A power allocation scheme was devised to align the received gradients at the server. A convergence analysis for this approach was presented in [129]. It was shown that the probability of reaching a small neighborhood of the optimal solution converged to one as time went to infinity. In [130], an online learning approach was developed to minimize the overall training time of FL algorithms and achieve the near-optimal communication-computation tradeoff by controlling the sparsity of the gradients. A compressive sensing (CS) approach was proposed in [131] for FL over massive MIMO systems, where sparse signals constructed from local gradients were transmitted by devices and a CS algorithm was developed to reconstruct the local gradients at the central server.

In [132], the authors integrated sparsification with atomic decomposition (e.g., singular value decomposition, Fourier transform), where the atoms of the atomic decomposition of the gradients were sparsified. Notable methods such as QSGD in [94] and TernGrad in [111] could be regarded as special cases of the sparsified atomic decomposition algorithm. It was shown in [132] that sparsifying the singular values of neural network gradients, rather than their entries, led to significantly faster distributed training. A convex optimization formulation for minimizing the coding length of the stochastic gradients in distributed learning was proposed in [133], where entries of the gradients were randomly dropped out and the remaining entries were amplified to keep the sparsified gradients unbiased. A simple and fast algorithm for solving this optimization problem was developed with guaranteed sparsity. The convergence rates of distributed learning algorithms with sparse model averaging and gradient quantization were investigated for both convex and nonconvex problems in [134]. Besides first-order algorithms, second-order distributed learning algorithms with sparsification were also studied. In [135], a distributed approximated Newton's method was proposed based on δ-approximate compressors, which include the Top-k sparsifier as a special case. It was shown that the algorithm was able to achieve the same rate of convergence as state-of-the-art second-order distributed learning algorithms while incurring much less communication overhead. Sparsification was also applied to deep learning in [136], where only the important entries of the gradients were sent. Momentum residual accumulation was designed for tracking outdated residual gradient coordinates to avoid the low convergence rate caused by sparse updates. A sparsified gradient descent algorithm was implemented as a library in [137].

C. Error-Compensated Compression

Compressing the exchanged information usually leads to errors in distributed learning. As the learning algorithms progress, the errors caused by compression in each time slot accumulate and may degrade the learning performance severely. A remedy to this issue is to provide error feedback to the agents, who compensate for the errors dynamically to avoid error accumulation. Recently, following this general approach, a series of distributed learning algorithms with error-compensated compression have been developed, which can reduce the communication overhead significantly without compromising the learning performance much.

We use here the communication-compressed decentralized SGD algorithm proposed in [138] as an illustrative example for the error-compensated compression approach. The problem considered in [138] is the single-task decentralized learning problem (1) over a connected undirected network. The expected local loss function of each agent i is given by f_i(x) = E_{ξ_i∼D_i}[F_i(x, ξ_i)], where ξ_i is the local data, D_i is the data distribution, and F_i is the loss function. Let Q : R^d → R^d be a (possibly probabilistic) compression operator satisfying the following property:

E_Q[‖Q(x) − x‖²] ≤ (1 − δ)‖x‖², ∀x ∈ R^d, (9)

where the expectation is taken with respect to the internal randomness of the compressor Q, and δ ∈ (0, 1) is a constant. Many popular compressors satisfy property (9), including sparsifiers (e.g., Top-k and Rand-k (randomly picking k out of d entries to transmit)), random gossiping (transmitting with a certain probability), and other random quantizers. Each agent i maintains (|N_i| + 2) variables, namely x_i(t) and {x̂_j(t)}_{j∈N_i∪{i}}, where x̂_j(t) is an approximate local model of agent j.
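Sparsification with local error correction, as discussed above for [123] and [125], can be sketched as follows: the residual discarded by Top-k is stored in a memory and added back before the next compression, so no gradient mass is permanently lost. The toy quadratic problem and all constants are illustrative assumptions.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(1)
d, k = 20, 2
c = rng.standard_normal(d)             # toy problem: f(x) = 0.5*||x - c||^2
x = np.zeros(d)
memory = np.zeros(d)                   # local error-correction buffer
for t in range(2000):
    g = (x - c) + 0.05 * rng.standard_normal(d)   # stochastic gradient
    p = 0.05 * g + memory              # stepsize folded in before compression
    sparse = top_k(p, k)               # only k of the d entries are transmitted
    memory = p - sparse                # discarded residual kept for next round
    x = x - sparse
print(float(np.linalg.norm(x - c)))    # small: every coordinate is eventually served
```

Without the `memory` term, 18 of the 20 coordinates would be starved in any given round with no guarantee of recovery; with it, each coordinate's accumulated update is eventually released, which is the mechanism behind the SGD-matching rates cited above.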

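Property (9) can be verified numerically for the two sparsifiers named above: Top-k satisfies it deterministically and Rand-k in expectation, both with δ = k/d.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 50, 5
x = rng.standard_normal(d)

# Top-k: the squared error is the sum of the d - k smallest squared entries,
# so ||Q(x) - x||^2 <= (1 - k/d)||x||^2 holds for every x.
topk_err = np.sort(x ** 2)[: d - k].sum()
assert topk_err <= (1 - k / d) * (x ** 2).sum()

# Rand-k: property (9) holds in expectation (in fact with equality);
# estimate E||Q(x) - x||^2 by Monte Carlo over the random index choice.
errs = []
for _ in range(20000):
    keep = rng.choice(d, size=k, replace=False)
    q = np.zeros(d)
    q[keep] = x[keep]
    errs.append(((q - x) ** 2).sum())
print(np.mean(errs) / (x ** 2).sum())   # close to 1 - k/d = 0.9
```

Note that neither compressor is required to be unbiased; the contraction (9) is all that the analysis of [138] needs.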
At each time t, agent i first samples ξ_i(t) ∼ D_i. Then it updates its local model according to

x_i(t+1) = x_i(t) + γ ∑_{j∈N_i} a_ij (x̂_j(t) − x̂_i(t)) − η_t ∇F_i(x_i(t), ξ_i(t)), (10)

where γ > 0 is an algorithm parameter to be selected. Note that in (10), agent i uses the approximate model x̂_j(t) instead of the exact model x_j(t) (j ∈ N_i), which is not accessible to agent i. Afterwards, agent i computes

q_i(t) = Q(x_i(t+1) − x̂_i(t)), (11)

and sends q_i(t) to all neighbors in N_i. Symmetrically, it receives q_j(t) from all neighbors j ∈ N_i, and updates the approximate local models by

x̂_j(t+1) = q_j(t) + x̂_j(t), ∀j ∈ N_i ∪ {i}. (12)

This algorithm conducts error compensation in steps (11) and (12). Specifically, step (11) compresses the difference between the new local model x_i(t+1) and the previous approximate local model x̂_i(t), which contains the errors caused by compressed communications so far. Thus, in (12), q_j(t) is able to partially offset the compression errors in the previous approximate model x̂_j(t). In particular, if Q is replaced by an identity mapping at time t, then combining (11) and (12) yields x̂_i(t+1) = x_i(t+1) readily (i.e., zero error), no matter how large the gap x̂_i(t) − x_i(t) was previously. It has been shown in [138] that, if the loss functions {f_i} are µ-strongly convex, then the algorithm converges at rate O(σ²/(µnt)), where σ² is the variance of the stochastic gradients ∇F_i(x, ξ_i). This recovers the convergence rate of mini-batch SGD with perfect communications. In the convergence bound, communication compression (e.g., the compression accuracy factor δ) only affects higher-order terms that are negligible as time t goes to infinity. This suggests that the error-compensated decentralized learning algorithm in [138] is able to reduce communication overhead significantly (by sending information compressed by Q) without degrading the learning performance much. Numerical experiments show that, to achieve the same learning performance, the number of bits communicated by the error-compensated algorithm is smaller than that of vanilla SGD by orders of magnitude. Decentralized learning algorithms with error-compensated compressed communications were also studied in [139], where two different compression strategies, namely extrapolation compression and difference compression, were used. When the compressors were unbiased and had bounded variances, it was shown for nonconvex learning problems that the algorithm converged at rate O(1/√(nt)), matching the convergence rate of centralized learning with perfect communications. Error-compensated compression and event-triggered communications were combined to further improve the communication efficiency of decentralized optimization algorithms in [140]. Further, momentum SGD with error-compensated compressed communications was studied in [141], which imposed weaker assumptions on the variance and dissimilarity of the gradients. Decentralized optimization with sparsification and error-compensated compression was investigated in [142].

Distributed learning algorithms with error-compensated communication compression have also been studied for server-agent systems. A sparsified SGD algorithm with error compensation was developed in [143], and was shown to converge at the same rate as vanilla SGD. Distributed SGD with error-compensated stochastic quantization was proposed in [144], and its convergence was analyzed for the case of quadratic optimization, though its convergence rate was not shown to be the same as vanilla SGD. Error-compensated signSGD was developed in [145], and the algorithm was shown to achieve the same convergence rate as vanilla SGD. An asynchronous error-compensated distributed SGD algorithm composing quantization and sparsification was proposed in [146], where each agent communicated with the server infrequently at different time instants. It was shown in [146] that despite this aggressive compression, the algorithm could achieve the same convergence rate as vanilla SGD for both convex and nonconvex problems. A general framework for devising and analyzing error-compensated quantized distributed learning algorithms was presented in [147], where linear convergence rates could be guaranteed. Linearly converging error-compensated distributed SGD with an improved convergence rate was developed in [148] based on the loopless Katyusha method. Error-compensated communication compression was further extended to distributed learning algorithms with variance reduction techniques in [149], where the variance of the stochastic gradient was reduced by taking a moving average over all historical gradients. In such a case, only using the compression error from the previous time instant was not enough to fully compensate for the compression errors. An error-compensation algorithm using the compression errors from the previous two time instants was proposed and was shown to achieve the same convergence rate as the case without compression. A distributed SGD with double-pass error-compensated compression was proposed in [150], where the compression was conducted at both the server and the agents. Hessian-based error-compensated compression was developed in [151], which was especially suitable for ill-conditioned problems. A saddle-point algorithm with error-compensated compression was studied in [152] to solve decentralized multitask learning problems.

D. Other Compression Methods

In addition to quantization, sparsification, and error-compensated compression techniques, researchers have devised other communication compression methods for distributed learning. In [153], the models sent by the agents to the server were restricted to have certain structures, such as low rank, in order to reduce the communication overhead. In [8], a variety of techniques were employed to reduce the communication bandwidth of distributed learning algorithms comprehensively, including momentum correction, local gradient clipping, momentum factor masking, and warm-up training. In [154], a low-rank gradient compressor based on power iterations was proposed for distributed learning that could achieve test performance on par with SGD. Communication-efficient FL algorithms based on sketching were devised in [155].

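The error-compensated updates (10)–(12) of [138] described above can be sketched as follows, with Rand-1 playing the role of the contraction compressor Q; the ring topology, quadratic losses, and constants are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(3)

def rand_1(v):
    """Rand-k with k = 1: transmit a single randomly chosen entry of v."""
    q = np.zeros_like(v)
    i = rng.integers(len(v))
    q[i] = v[i]
    return q

# Toy 3-agent ring with doubly stochastic weights a_ij (a_ii = 0).
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
c = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])  # f_i(x) = 0.5*||x - c_i||^2
x = np.zeros((3, 2))          # exact local models x_i(t)
x_hat = np.zeros((3, 2))      # shared approximate models x_hat_j(t)
gamma = 0.1

for t in range(4000):
    eta = 2.0 / (t + 50)
    g = (x - c) + 0.05 * rng.standard_normal((3, 2))       # stochastic gradients
    x = x + gamma * (A @ x_hat - x_hat) - eta * g          # step (10)
    q = np.array([rand_1(x[i] - x_hat[i]) for i in range(3)])  # step (11)
    x_hat = x_hat + q                                      # step (12), everywhere

print(x.mean(axis=0))   # near the minimizer of the sum, the mean of the c_i
```

Only the compressed differences `q_i(t)` ever cross the links, yet because each difference is taken against the shared approximation `x_hat_i`, the compression errors are folded back into the next round instead of accumulating.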
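A rank-1 power-iteration gradient compressor in the spirit of [154] can be sketched as follows: a gradient matrix is communicated as a pair of factor vectors (cost m + n scalars instead of m·n). The matrix shapes and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 32, 16
# A nearly rank-1 synthetic "gradient" matrix (real layer gradients are
# only approximately low-rank, which is the premise of this compressor).
G = rng.standard_normal((m, 1)) @ rng.standard_normal((1, n))
G = G + 0.01 * rng.standard_normal((m, n))

q = rng.standard_normal(n)              # right factor (warm-started in practice)
p = G @ q                               # one power iteration -> left factor
p_hat = p / np.linalg.norm(p)
q_new = G.T @ p_hat                     # refined right factor
G_hat = np.outer(p_hat, q_new)          # rank-1 reconstruction at the receiver

rel_err = float(np.linalg.norm(G - G_hat) / np.linalg.norm(G))
print(rel_err)                          # small when G is close to rank-1
```

A single matrix-vector product per side keeps the compressor cheap, and in [154] the right factor is reused across rounds so the power iteration keeps refining as training proceeds.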
Additionally, a communication-efficient multi-agent actor-critic algorithm for multi-agent RL over directed graphs was examined in [156], where each agent only sent two scalars at each time.

FL based on over-the-air computation was proposed in [157] to reduce the bandwidth requirement by exploiting the superposition property of wireless multiple-access channels. The algorithm used joint device selection and beamforming design, which were modeled as a sparse low-rank optimization problem. To solve this nonconvex problem, a difference-of-convex (DC) algorithm with a global convergence guarantee was developed. The effects of over-the-air analog aggregation (e.g., waveform superposition and communication latency reduction) on the performance of FL algorithms were further investigated in [158], [159]. Moreover, [160] developed a band-limited coordinate descent approach by k-sparsifying the gradients and transmitting the gradient entries over k subcarriers through wireless channels. Learning-driven communication error minimization was studied by jointly optimizing the power allocation and learning rates. In [161], the learning rate of the FL algorithm was optimized dynamically, and beamforming subject to power constraints was also designed.

E. Future Directions

We provide two potential directions for future work on distributed learning with compressed communications.

1) Communication Compression for Distributed Online Learning: Prior work on distributed online learning with compressed communications is rather limited. A decentralized online learning algorithm using the signs of the relative local models of neighboring agents has been proposed in [115]. The approach required each agent to be able to observe the signs of the models of neighbors relative to its own model, which might not be the case in practice. One future direction of research would be to devise distributed online learning algorithms that quantize/compress the local models directly (instead of the relative local models, i.e., the differences between neighbors' models). The quantization/compression schemes will have to be designed such that the degradation of online learning performance (e.g., regret and constraint violations) is minimal. One possible approach is to design a dynamic quantizer, which adjusts the length and the center of the quantization interval on-the-fly. Specifically, as the algorithm progresses and becomes more confident about the location of the dynamic optimal solution, the length of the quantization interval could be shrunk, leading to higher quantization resolution. This can potentially reduce the communication overhead of distributed online algorithms without hurting the regret and constraint violations in order sense. A challenge to this approach is that, unlike static learning problems, the optimal solutions of online learning problems change with time and it is more difficult to locate them with high confidence. Technical assumptions limiting the temporal variation speed of the online problems may be needed to resolve this issue. Another approach would be to adjust the algorithm parameters, e.g., the combination weights of the decentralized OGD algorithm, so that the impact of the quantization errors vanishes gradually as the algorithm

it would be interesting to see if the compression errors can still be compensated for dynamically so that the regret is not affected by communication compression in order sense.

2) Performance Limits under Communication Rate Constraints: Most existing works on distributed learning with compressed communications are focused on algorithm design. Yet little is known about the fundamental performance limits of distributed learning when communications are compressed. With limited communication bandwidth, the data rate of information exchange in distributed learning algorithms is constrained. Under such communication rate constraints, one would seek to establish lower bounds for the training loss (e.g., the gap between the loss functions of the trained model and the optimal model) or testing performance (e.g., generalization error), and ascertain the impact of the communication rate on these lower bounds. It would be interesting to see if existing learning algorithms with compressed communications can achieve such lower bounds in order sense. If not, one could look into designing novel compression methods to match the derived performance lower bounds.

V. RESOURCE MANAGEMENT FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

The information exchange required by distributed learning algorithms consumes a substantial amount of radio resources, such as energy and bandwidth, which are scarce in many practical circumstances. In this section, we provide an overview of resource management techniques for communication-efficient distributed learning algorithms, which seek to achieve the best learning performance under resource budget constraints.

A. Power Allocation

A variety of power allocation schemes have been proposed to obtain satisfactory performance for FL under energy constraints. For FL over wireless networks, in [162], the authors took transmission energy (arising from sending local models to the server) and computation energy (arising from the local training steps) into consideration, and minimized the total energy consumption subject to constraints on computation and communication latencies. An iterative algorithm was developed to solve this optimization problem, where closed-form solutions for time/power/bandwidth allocation were derived. In [9], a joint transmit power allocation and device selection problem was studied to achieve the best FL performance over wireless networks. A closed-form expression for the convergence rate of the FL algorithm was first derived to quantify the impact of wireless factors on the training loss. Then, based on this convergence rate, the optimal scheme for transmit power allocation, user selection, and uplink resource block allocation was developed. Additionally, a resource allocation problem was formulated in [163], [164] to achieve the optimal tradeoff between FL convergence and energy consumption. Such a resource allocation problem was nonconvex, and was decomposed into three convex subproblems. The globally optimal resource allocation scheme was obtained by characterizing the
progresses. solution structures of the subproblems.
It is also possible to devise error-compensated compression Convergence analysis of FL over noisy fading wireless
schemes for distributed online learning. When the training data channels was studied in [165] recently. Power allocation
is collected sequentially and loss functions vary with time, problems were formulated to minimize the convergence bound
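To make the dynamic-quantizer idea discussed above concrete, the following toy sketch shrinks the quantization interval around a running center as the iterates stabilize. The re-centering rule, the decay schedule, and all constants are our own illustrative assumptions, not a scheme taken from the literature surveyed here.

```python
import numpy as np

def quantize(x, center, radius, bits=4):
    """Uniformly quantize x over the interval [center - radius, center + radius]."""
    levels = 2 ** bits
    lo = center - radius
    step = 2 * radius / (levels - 1)
    clipped = np.clip(x, lo, center + radius)
    return lo + np.round((clipped - lo) / step) * step

# As the local iterate stabilizes, shrink the quantization interval around a
# running estimate: the same bit budget then yields finer resolution.
rng = np.random.default_rng(0)
x_star = 0.7                       # unknown (slowly varying) optimum
center, radius = 0.0, 2.0          # initial quantization interval [-2, 2]
for t in range(1, 21):
    x = x_star + rng.normal(scale=0.5 / t)   # local iterate approaching x_star
    q = quantize(x, center, radius)          # value actually communicated
    center = 0.9 * center + 0.1 * q          # re-center on recently sent values
    radius = max(0.05, 0.9 * radius)         # shrink interval over time
```

With 4 bits, the quantization step shrinks from 2·2/15 ≈ 0.27 initially to about 0.03 after twenty rounds, without increasing the per-round communication cost.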

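The error-compensated (error-feedback) compression schemes mentioned above share a simple template: compress the gradient plus the residual carried over from previous rounds, transmit only the compressed part, and remember the new residual. A minimal sketch on a toy least-squares problem follows; the top-1 sparsifier, the step size, and the data are our own illustrative choices.

```python
import numpy as np

def top1(v):
    """Toy sparsifier: keep only the largest-magnitude coordinate of v."""
    out = np.zeros_like(v)
    i = np.argmax(np.abs(v))
    out[i] = v[i]
    return out

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
b = A @ np.ones(5)                  # least-squares target; optimum is all-ones
x = np.zeros(5)                     # model iterate
e = np.zeros(5)                     # compression-error memory
lr = 0.02
for _ in range(3000):
    g = A.T @ (A @ x - b) / 50      # full local gradient
    c = top1(g + e)                 # compress gradient plus carried-over error
    e = (g + e) - c                 # residual is remembered, not discarded
    x = x - lr * c                  # update uses only the transmitted part
# x approaches the optimum although each "message" carries one coordinate.
```

Dropping the memory term `e` (plain top-1 compression) would bias the updates; the carried-over residual is what restores convergence.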
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Weizmann Institute of Science. Downloaded on February 08,2023 at 07:20:33 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Journal on Selected Areas in Communications. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JSAC.2023.3242710
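The transmission-energy/latency tradeoff exploited by power allocation schemes such as the one in [162] rests on a classical observation: for a fixed number of bits, transmitting more slowly within the latency budget reduces transmission energy. The single-user AWGN toy model below, including all numbers, is our own illustrative assumption; it inverts the Shannon capacity formula to obtain the power needed to deliver b bits in time t.

```python
# Sending b bits in time t over bandwidth B at channel gain h and noise power
# spectral density N0 requires rate b/t = B*log2(1 + p*h/(N0*B)), hence power
#   p(t) = (N0*B/h) * (2**(b/(B*t)) - 1)
# and transmission energy E(t) = t*p(t), which decreases in t: the optimal
# strategy spends all of the latency budget left over from local computation.

b, B, h, N0 = 1e6, 1e6, 1.0, 1e-9   # bits, Hz, channel gain, noise PSD (toy)

def tx_energy(t):
    """Energy to deliver b bits in t seconds over an AWGN channel."""
    power = (N0 * B / h) * (2 ** (b / (B * t)) - 1)
    return t * power

deadline, compute_time = 2.0, 0.5    # latency budget and local-training time
t_star = deadline - compute_time     # transmit during all remaining time
energies = [tx_energy(t) for t in (0.5, 1.0, t_star)]
# energies decreases: slower transmission within the budget costs less energy.
```

Here t_star = 1.5 s costs about 0.88 mJ versus 1.5 mJ at t = 0.5 s; joint time/power/bandwidth schemes generalize this monotonicity to many users sharing the uplink.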


subject to a set of average and maximum power constraints at individual edge devices. The problems were transformed into convex forms, and their structured optimal solutions, appearing in the form of regularized channel inversion, were obtained by using the Lagrangian duality method. Moreover, an FL system with over-the-air analog gradient aggregation was examined in [166]. Dynamic agent participation scheduling and power allocation schemes were proposed to optimize the training performance under energy constraints of the agents, where both the communication and computation energy consumptions were taken into account. The energy consumption of FL algorithms has been studied by other approaches as well, beyond power control. In [167], the total cost of FL, arising from the training time and energy consumption, was minimized by choosing agent participation and the number of local iterations. Solution properties of the formulated problem were derived to identify the design principles of FL algorithms. Further, a semi-asynchronous federated learning algorithm was developed in [168], where the server aggregated a certain number of local models based on their arrival orders each time. A convergence bound for the algorithm was established, and the training time was minimized under communication cost constraints and FL accuracy constraints by choosing an appropriate number of participating agents.

B. Bandwidth Allocation

Bandwidth allocation has also been investigated extensively and is often utilized in conjunction with other techniques (such as agent selection and power control) to improve the communication efficiency of FL. In [169], for FL over wireless networks, a stochastic optimization problem minimizing the long-term learning loss under long-term energy constraints was studied by selecting agent participation and allocating bandwidth. An algorithm utilizing only the currently available wireless channel information was devised to solve this stochastic optimization problem. A joint probabilistic user selection and resource block (spectrum bands) allocation scheme was developed in [170] to minimize the training loss and convergence time of the FL algorithm, where only those users with significant impact on the global model were selected to upload their local models. A joint bandwidth allocation and device selection scheme was proposed in [171] to maximize the training accuracy subject to total training time constraints for latency-constrained FL. Moreover, joint power and bandwidth allocation was investigated in [172] to minimize the energy consumption, computation cost and time cost of the FL algorithm. In [173], a channel allocation problem was investigated to minimize the training delays subject to differential privacy and training performance constraints. A joint bandwidth allocation and user selection problem was examined in [174] in the scenario of visible light communication.

Asynchronous FL with limited wireless resources was studied in [175]. A metric named effectivity score was proposed to represent the amount of learning. An asynchronous learning-aware transmission scheduling (ALS) problem to maximize the effectivity score subject to resource constraints (e.g., spectrum constraints) was formulated. When the statistical information of the system uncertainties (e.g., channel conditions, data arrivals, and radio resource availability) was unknown, the scheduling problem could be solved through a Bayesian learning approach. Hierarchical FL was introduced in [176], where small-cell base stations coordinated the mobile users within their cells and periodically exchanged model updates with the main base station, i.e., the server. A method was proposed to optimize the allocation of subcarriers so as to reduce the communication latency of the FL algorithm. In addition, a collaborative FL architecture supporting deep neural network (DNN) training was considered in [177], which sought to optimally select participating devices and allocate computing and spectrum resources. A stochastic optimization problem with the objective of minimizing learning loss while satisfying delay and long-term energy consumption requirements was formulated. A deep multi-agent reinforcement learning approach was developed to solve the problem. Moreover, FL with the assistance of an intelligent reflecting surface was proposed in [178], and the delay of FL was analyzed in [179].

C. Future Directions

Several promising directions for future research on resource management in distributed learning are discussed below.

1) Improving Communication Efficiency and Data Privacy Simultaneously: In distributed learning, the loss functions of the agents depend on their local private data, which often contain sensitive information, e.g., health information and financial information. Even though the agents do not need to share their raw data with others in a distributed learning setting, the exchanged information between agents and the server may still be overheard and utilized by malicious adversaries to (partially) infer the private data of the agents. The noise in wireless channels, a nuisance from the perspective of communication, can help preserve data privacy by preventing adversaries from inferring the private data of agents accurately based on the overheard noisy information. The transmission power of agents also influences their data privacy. Large transmission power enhances the signal-to-noise ratio at adversaries and makes it easier to infer the private data of agents. On the other hand, low transmission power hinders accurate information exchange in distributed learning algorithms, and thus degrades the learning performance. It is therefore imperative to devise power allocation schemes (across time and agents) to balance the learning performance and data privacy under the energy budget constraints of agents. The goal is to achieve an optimal tradeoff between learning performance, data privacy, and communication efficiency for distributed learning over wireless networks.

2) Impact of Wireless Interference in Decentralized Networks: Most of the existing works on resource management for communication-efficient distributed learning are focused on the server-agent setting, where all agents communicate with a server through a multiple-access channel. When the multi-agent system is a fully decentralized network without a central server, each agent exchanges information with its neighbors, and the concurrent information transmissions of different agents can cause mutual interference, which affects the communication accuracy and the learning performance. It is therefore important to design novel transmission scheduling and power allocation mechanisms to mitigate the negative effects of wireless interference on decentralized learning. For


instance, the information transmission should be scheduled so that agents located close to each other do not transmit simultaneously to avoid strong interference. In addition, agents should avoid using high transmission power to compensate for poor channel conditions, as this would lead to strong interference to nearby agents.

3) Resource Allocation for Communication-Efficient Distributed Online Learning: Most existing resource allocation schemes are designed for communication-efficient distributed learning in static settings, where the loss functions are fixed. When the training data is collected sequentially and learning is conducted in real time, it is still unclear how to allocate radio resources (e.g., power and bandwidth) to achieve the best online learning performance under resource budget constraints. One can first study the impact of limited radio resources on the regret and constraint violations of various distributed online learning algorithms (e.g., distributed OGD, online saddle-point algorithm). Then, one can minimize the performance bounds on the regret and constraint violations by allotting the radio resources in an optimal manner.

VI. GAME THEORY FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

The information exchange required by distributed learning algorithms consumes a substantial amount of communication resources, which are often scarce for users. For instance, mobile devices may have a limited amount of energy and communication bandwidth. Therefore, users may not be willing to participate in the distributed learning algorithms or may not devote sufficient radio resources to the learning algorithms, which would then lead to deterioration of the learning performance. To cope with this challenge, several recent works have devised game-theoretic mechanisms to compensate for the resource consumption of users and incentivize their participation in distributed learning. In this section, we present an overview of this line of research and point out several potential directions for future research.

A. Existing Works

In [10], the authors studied FL in a server-agent system, and sought to design an optimal incentive mechanism from the server's perspective in the presence of users' multi-dimensional private information, including training cost (such as communication and computation energy cost) and communication delays. The authors proposed a multi-dimensional contract-theoretic mechanism, which summarized users' multi-dimensional private information into a one-dimensional criterion that entails a complete ordering of users. Analysis in various information scenarios was conducted to reveal the impact of information asymmetry levels on the server's optimal strategy. Reputation was introduced as a metric to measure the reliability and trustworthiness of the mobile users in [180]. A reputation-based user selection scheme was developed for reliable FL by using a multiweight subjective logic model. An incentive mechanism combining reputation and contract theory was devised to motivate high-reputation users with high-quality data to participate in the FL algorithm. Moreover, in [181], an incentive mechanism based on deep reinforcement learning was devised for FL to determine the optimal pricing strategy for the server and the optimal training strategies for the users, where the utility functions of users took their communication and computation costs into account. Auction mechanisms were proposed in [182] to incentivize users to contribute communication/computation resources and private data to FL algorithms. An approximate strategy-proof mechanism with guaranteed truthfulness, individual rationality and computational efficiency was designed. To further improve the social welfare, an automated strategy-proof mechanism based on deep reinforcement learning was also devised. Additionally, a hierarchical FL framework was studied in [183], where users first transmitted local models to edge servers for intermediate aggregation, and then edge servers communicated with the model owner for global aggregation. Such an approach could reduce the number of global communications and mitigate the straggler effect of users. A hierarchical game was proposed for the edge association and resource allocation problem, where users' strategies were their edge associations and the edge servers' strategies were their bandwidth allocation schemes. The lower-level interaction between the users was modeled as an evolutionary game. The upper-level interaction between the edge servers and the model owner was modeled as a Stackelberg differential game, where the model owner decided an optimal reward scheme given the expected bandwidth allocation strategies of the edge servers.

B. Future Directions

Two possible future directions on the use of game theory for communication-efficient distributed learning are presented below.

1) Game Theory for Communication-Efficient Fully Decentralized Learning over Networks: Most existing works on the use of game theory for distributed learning are focused on server-agent systems, in which a central server interacts with a set of strategic agents. Yet little is known about the strategic behavior of agents in a fully decentralized learning setting over a network without a central entity. Decentralized learning algorithms require agents to exchange information with neighbors to facilitate collaborative learning. When determining the amount of radio resources (e.g., energy and bandwidth) devoted to information transmission, agents need to take into consideration both their local resource budget constraints and the interference to other nearby agents sending/receiving information concurrently. For example, if an agent sends information with very large transmission power, other nearby concurrent transmissions cannot be received accurately due to the strong interference, which may degrade the collaborative learning performance. Using a non-cooperative game framework, one can study the strategic behavior (e.g., power control and spectrum usage) of agents in such a decentralized learning setting. It would be interesting to examine the price of anarchy by comparing the learning performance of a non-cooperative game and that of a fully cooperative scenario with a globally optimal resource allocation scheme. Further, one may devise game-theoretic incentive mechanisms (e.g., auction and bargaining) to guide agents' behavior and ameliorate the performance of decentralized learning.

2) Incentive Mechanism Design for Communication-Efficient Personalized Learning: Existing incentive mechanisms are mostly designed for single-task distributed learning (problem (1)), where all agents collaborate to learn a common model. It would be interesting to study the strategic behavior of agents in distributed personalized learning, where each agent has its own model to train and different agents' models are distinct (but related). In personalized learning, e.g., multitask learning (problems (3), (6), (7)) and meta-learning (problem (4)), each agent aims to obtain the best personal model by using its scarce radio resources, and is indifferent about the learning accuracy of other agents' models. The decision-making processes are coupled across agents since the local models of individual agents are related. Through a non-cooperative game framework, one can investigate the equilibrium resource allocation strategies of agents and the performance of personalized learning algorithms at the equilibrium. To improve the learning performance, one may further devise game-theoretic mechanisms incentivizing agents to contribute sufficient radio resources to personalized learning algorithms.

VII. CONCLUSION

In this paper, we have presented a holistic overview of communication-efficient distributed learning. First, we have surveyed methods reducing the number of communication rounds for distributed learning, including multiple local training steps between consecutive communications and event-triggered communications. Second, we have reviewed various communication compression schemes for distributed learning, such as quantization, sparsification, and error-compensated compression. Third, resource management techniques, e.g., power control and bandwidth allocation, have been presented to make the most of the limited radio resources to achieve the best learning performance. Finally, several recent studies on game-theoretic mechanism design for incentivizing user participation in distributed learning have been discussed. In addition to reviewing existing works, for each of these communication-efficient distributed learning methods, we have also pointed out potential directions for future research.

REFERENCES

[1] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[2] M. Mahdavi, R. Jin, and T. Yang, "Trading regret for efficiency: online convex optimization with long term constraints," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 2503–2528, 2012.
[3] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
[4] A. Mokhtari, Q. Ling, and A. Ribeiro, "Network Newton distributed optimization methods," IEEE Transactions on Signal Processing, vol. 65, no. 1, pp. 146–161, 2016.
[5] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.
[6] M. G. Rabbat and R. D. Nowak, "Quantized incremental algorithms for distributed optimization," IEEE Journal on Selected Areas in Communications, vol. 23, no. 4, pp. 798–808, 2005.
[7] N. Strom, "Scalable distributed DNN training using commodity GPU cloud computing," in Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[8] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," in Proceedings of the International Conference on Learning Representations, 2018.
[9] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 269–283, 2020.
[10] N. Ding, Z. Fang, and J. Huang, "Optimal contract design for efficient federated learning with multi-dimensional private information," IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 186–200, 2020.
[11] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., "Advances and open problems in federated learning," Foundations and Trends in Machine Learning, 2021.
[12] K. B. Letaief, Y. Shi, J. Lu, and J. Lu, "Edge artificial intelligence for 6G: Vision, enabling technologies, and applications," IEEE Journal on Selected Areas in Communications, vol. 40, no. 1, pp. 5–36, 2022.
[13] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, "Federated learning: A signal processing perspective," IEEE Signal Processing Magazine, vol. 39, no. 3, pp. 14–41, 2022.
[14] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.
[15] S. Boyd, N. Parikh, and E. Chu, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers Inc, 2011.
[16] S. Shalev-Shwartz et al., "Online learning and online convex optimization," Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
[17] E. Hazan, "Introduction to online convex optimization," Foundations and Trends in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016.
[18] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proceedings of the 20th International Conference on Machine Learning, pp. 928–936, 2003.
[19] S. Liu, S. J. Pan, and Q. Ho, "Distributed multi-task relationship learning," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 937–946, 2017.
[20] Y. Zhang and D.-Y. Yeung, "A regularization approach to learning task relationships in multitask learning," ACM Transactions on Knowledge Discovery from Data, vol. 8, no. 3, pp. 1–31, 2014.
[21] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the International Conference on Machine Learning, pp. 1126–1135, 2017.
[22] C. Finn, K. Xu, and S. Levine, "Probabilistic model-agnostic meta-learning," in Proceedings of the Conference on Neural Information Processing Systems, 2018.
[23] A. Fallah, A. Mokhtari, and A. Ozdaglar, "Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach," in Proceedings of the Advances in Neural Information Processing Systems, vol. 33, pp. 3557–3568, 2020.
[24] C. T. Dinh, N. Tran, and J. Nguyen, "Personalized federated learning with Moreau envelopes," Advances in Neural Information Processing Systems, vol. 33, pp. 21394–21405, 2020.
[25] K. Ozkara, N. Singh, D. Data, and S. Diggavi, "QuPeD: Quantized personalization via distillation with applications to federated learning," Advances in Neural Information Processing Systems, vol. 34, 2021.
[26] F. Hanzely, S. Hanzely, S. Horváth, and P. Richtárik, "Lower bounds and optimal algorithms for personalized federated learning," in Proceedings of the Advances in Neural Information Processing Systems, vol. 33, pp. 2304–2315, 2020.
[27] A. Nedic, A. Ozdaglar, and P. A. Parrilo, "Constrained consensus and optimization in multi-agent networks," IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
[28] A. Nedic, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
[29] S. Pu, W. Shi, J. Xu, and A. Nedić, "Push–pull gradient methods for distributed optimization in networks," IEEE Transactions on Automatic Control, vol. 66, no. 1, pp. 1–16, 2020.
[30] A. Nedić and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2015.
[31] W. Shi, Q. Ling, G. Wu, and W. Yin, "Extra: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[32] G. Qu and N. Li, "Accelerated distributed Nesterov gradient descent," IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2566–2581, 2019.
[33] Y. Tang, J. Zhang, and N. Li, "Distributed zero-order algorithms for nonconvex multiagent optimization," IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2021.


[34] M. Eisen, A. Mokhtari, and A. Ribeiro, "Decentralized quasi-Newton methods," IEEE Transactions on Signal Processing, vol. 65, no. 10, pp. 2613–2628, 2017.
[35] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, "DLM: Decentralized linearized alternating direction method of multipliers," IEEE Transactions on Signal Processing, vol. 63, no. 15, pp. 4051–4064, 2015.
[36] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, "DQM: Decentralized quadratically approximated alternating direction method of multipliers," IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5158–5173, 2016.
[37] F. Yan, S. Sundaram, S. Vishwanathan, and Y. Qi, "Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 11, pp. 2483–2493, 2013.
[38] A. Koppel, F. Y. Jakubiec, and A. Ribeiro, "A saddle point algorithm for networked online convex optimization," IEEE Transactions on Signal Processing, vol. 63, no. 19, pp. 5149–5164, 2015.
[39] M. Akbari, B. Gharesifard, and T. Linder, "Distributed online convex optimization on time-varying directed graphs," IEEE Transactions on Control of Network Systems, vol. 4, no. 3, pp. 417–428, 2015.
[40] Q. Ling and A. Ribeiro, "Decentralized dynamic optimization through the alternating direction method of multipliers," IEEE Transactions on Signal Processing, vol. 62, no. 5, pp. 1185–1197, 2013.
[41] X. Cao and K. J. R. Liu, "Distributed linearized ADMM for network cost minimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 4, no. 3, pp. 626–638, 2018.
[42] X. Cao and K. J. R. Liu, "Distributed Newton's method for network cost minimization," IEEE Transactions on Automatic Control, vol. 66, no. 3, pp. 1278–1285, 2021.
[43] A. Koppel, B. M. Sadler, and A. Ribeiro, "Proximity without consensus in online multiagent optimization," IEEE Transactions on Signal Processing, vol. 65, no. 12, pp. 3062–3077, 2017.
[44] J. Chen, C. Richard, and A. H. Sayed, "Multitask diffusion adaptation over networks," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4129–4144, 2014.
[45] J. Chen, C. Richard, and A. H. Sayed, "Diffusion LMS over multitask networks," IEEE Transactions on Signal Processing, vol. 63, no. 11, pp. 2733–2748, 2015.
[46] X. Cao and K. J. R. Liu, "Decentralized sparse multitask RLS over networks," IEEE Transactions on Signal Processing, vol. 65, no. 23, pp. 6217–6232, 2017.
[47] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
[48] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5693–5700, 2019.
[49] S. U. Stich, "Local SGD converges fast and communicates little," in Proceedings of the International Conference on Learning Representations, 2019.
[50] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, "Don't use large mini-batches, use local SGD," in Proceedings of the International Conference on Learning Representations, 2020.
[51] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. R. Cadambe, "Local SGD with periodic averaging: Tighter analysis and adaptive
[57] J. Wang and G. Joshi, "Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms," Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021.
[58] G. Lan, S. Lee, and Y. Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization," Mathematical Programming, vol. 180, no. 1, pp. 237–284, 2020.
[59] F. P.-C. Lin, S. Hosseinalipour, S. S. Azam, C. G. Brinton, and N. Michelusi, "Semi-decentralized federated learning with cooperative D2D local model aggregations," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3851–3869, 2021.
[60] L. Liu, J. Zhang, S. Song, and K. B. Letaief, "Client-edge-cloud hierarchical federated learning," in Proceedings of the IEEE International Conference on Communications, pp. 1–6, IEEE, 2020.
[61] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, "Communication-efficient distributed dual coordinate ascent," in Proceedings of the Advances in Neural Information Processing Systems, pp. 3068–3076, 2014.
[62] Y. Chen, X. Sun, and Y. Jin, "Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 10, pp. 4229–4238, 2019.
[63] C. Liu, H. Li, Y. Shi, and D. Xu, "Distributed event-triggered gradient method for constrained convex minimization," IEEE Transactions on Automatic Control, vol. 65, no. 2, pp. 778–785, 2019.
[64] M. Zhong and C. G. Cassandras, "Asynchronous distributed optimization with event-driven communication," IEEE Transactions on Automatic Control, vol. 55, no. 12, pp. 2735–2750, 2010.
[65] Y. Kajiyama, N. Hayashi, and S. Takai, "Distributed subgradient method with edge-based event-triggered communication," IEEE Transactions on Automatic Control, vol. 63, no. 7, pp. 2248–2255, 2018.
[66] J. George and P. Gurram, "Distributed stochastic gradient descent with event-triggered communication," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 7169–7178, 2020.
[67] Z. Wu, Z. Li, Z. Ding, and Z. Li, "Distributed continuous-time optimization with scalable adaptive event-based mechanisms," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 9, pp. 3252–3257, 2018.
[68] W. Du, X. Yi, J. George, K. H. Johansson, and T. Yang, "Distributed optimization with dynamic event-triggered mechanisms," in Proceedings of the IEEE Conference on Decision and Control, pp. 969–974, IEEE, 2018.
[69] W. Chen and W. Ren, "Event-triggered zero-gradient-sum distributed consensus optimization over directed networks," Automatica, vol. 65, pp. 90–97, 2016.
[70] P. Wan and M. D. Lemmon, "Event-triggered distributed optimization in sensor networks," in Proceedings of the International Conference on Information Processing in Sensor Networks, pp. 49–60, 2009.
[71] L. Gao, S. Deng, H. Li, and C. Li, "An event-triggered approach for gradient tracking in consensus-based distributed optimization," IEEE Transactions on Network Science and Engineering, 2022.
[72] J. Kim and W. Choi, "Gradient-push algorithm for distributed optimization with event-triggered communications," arXiv preprint arXiv:2111.06315, 2021.
[73] B. Hu, Z.-H. Guan, G. Chen, and X. Shen, "A distributed hybrid event-time-driven scheme for optimization over sensor networks," IEEE Transactions on Industrial Electronics, vol. 66, no. 9, pp. 7199–7208, 2018.
[74] Y. Liu, W. Xu, G. Wu, Z. Tian, and Q. Ling, "Communication-censored ADMM for decentralized consensus optimization," IEEE Transactions
synchronization,” in Proceedings of the Conference on Neural Infor- on Signal Processing, vol. 67, no. 10, pp. 2565–2579, 2019.
mation Processing Systems, 2019.
[75] X. Cao and T. Başar, “Decentralized online convex optimization
[52] J. Wang and G. Joshi, “Adaptive communication strategies to achieve with event-triggered communications,” IEEE Transactions on Signal
the best error-runtime trade-off in local-update SGD,” Proceedings of Processing, vol. 69, pp. 284–299, 2021.
the Conference on Machine Learning and Systems, 2019. [76] S. Liu, L. Xie, and D. E. Quevedo, “Event-triggered quantized
[53] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the communication-based distributed convex optimization,” IEEE Trans-
objective inconsistency problem in heterogeneous federated optimiza- actions on Control of Network Systems, vol. 5, no. 1, pp. 167–178,
tion,” in Proceedings of the Advances in Neural Information Processing 2016.
Systems, vol. 33, pp. 7611–7623, 2020. [77] H. Li, S. Liu, Y. C. Soh, and L. Xie, “Event-triggered communication
[54] J. Wang, V. Tantia, N. Ballas, and M. Rabbat, “SlowMo: Improving and data rate constraint for distributed optimization of multiagent sys-
communication-efficient distributed SGD with slow momentum,” in tems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems,
Proceedings of the International Conference on Learning Represen- vol. 48, no. 11, pp. 1908–1919, 2017.
tations, 2020. [78] T. Chen, G. B. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated
[55] B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, B. Mcmahan, gradient for communication-efficient distributed learning,” in Proceed-
O. Shamir, and N. Srebro, “Is local SGD better than minibatch SGD?,” ings of the Conference on Neural Information Processing Systems,
in Proceedings of the International Conference on Machine Learning, 2018.
pp. 10334–10343, 2020. [79] J. Sun, T. Chen, G. B. Giannakis, and Z. Yang, “Communication-
[56] S. Zhang, A. Choromanska, and Y. LeCun, “Deep learning with efficient distributed learning via lazily aggregated quantized gradients,”
elastic averaging SGD,” in Proceedings of the Conference on Neural in Proceedings of the Conference on Neural Information Processing
Information Processing Systems, 2015. Systems, 2019.

© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Xuanyu Cao (SM’20) received the B.S. degree in electrical engineering from Shanghai Jiao Tong University, in 2013, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 2016 and 2017, respectively. From August 2017 to October 2021, he was successively a Postdoctoral Research Associate with the Department of Electrical Engineering at Princeton University and the Coordinated Science Lab at the University of Illinois, Urbana-Champaign. Since October 2021, he has been an Assistant Professor with the Department of Electronic and Computer Engineering at The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. He is a senior member of the IEEE. He has served MobiHoc 2022 and 2023, and was the Lead Guest Editor for the special issue on “Communication-Efficient Distributed Learning over Networks” in the IEEE Journal on Selected Areas in Communications. His research interests encompass distributed/online optimization, communication-efficient distributed learning, federated learning over wireless networks, and game theory and network economics.

Tamer Başar (S’71-M’73-SM’79-F’83-LF’13) has been with the University of Illinois Urbana-Champaign since 1981, where he is currently Swanlund Endowed Chair Emeritus and Center for Advanced Study (CAS) Professor Emeritus of Electrical and Computer Engineering, with affiliations also with the Coordinated Science Laboratory, the Information Trust Institute, and Mechanical Science and Engineering. At Illinois, he has also served as Director of CAS (2014-2020), Interim Dean of Engineering (2018), and Interim Director of the Beckman Institute (2008-2010). He received the B.S.E.E. degree from Robert College, Istanbul, and the M.S., M.Phil., and Ph.D. degrees from Yale University, from which he received the Wilbur Cross Medal in 2021. He is a member of the US National Academy of Engineering and a Fellow of IEEE, IFAC, and SIAM. He has served as president of IEEE CSS (Control Systems Society), ISDG (International Society of Dynamic Games), and AACC (American Automatic Control Council). He has received several awards and recognitions over the years, including the highest awards of IEEE CSS, IFAC, AACC, and ISDG, the IEEE Control Systems Award, and a number of international honorary doctorates and professorships. He has over 1000 publications in systems, control, communications, optimization, networks, and dynamic games, including books on non-cooperative dynamic game theory, robust control, network security, wireless and communication networks, and stochastic networked control. He was the Editor-in-Chief of Automatica between 2004 and 2014 and is currently editor of several book series. His current research interests include stochastic teams, games, and networks; risk-sensitive estimation and control; mean-field game theory; multi-agent systems and learning; data-driven distributed optimization; epidemics modeling and control over networks; strategic information transmission, spread of disinformation, and deception; security and trust; energy systems; and cyber-physical systems.

Suhas Diggavi is currently a Professor of Electrical and Computer Engineering at UCLA. His undergraduate education is from IIT, Delhi, and his Ph.D. is from Stanford University. He has worked as a principal member of research staff at AT&T Shannon Laboratories and directed the Laboratory for Information and Communication Systems (LICOS) at EPFL. At UCLA, he directs the Information Theory and Systems Laboratory. His research interests include information theory and its applications to several areas including machine learning, security & privacy, wireless networks, data compression, cyber-physical systems, bio-informatics, and neuroscience; more information can be found at http://licos.ee.ucla.edu. He has received several recognitions for his research from IEEE and ACM, including the 2013 IEEE Information Theory Society & Communications Society Joint Paper Award, the 2021 ACM Conference on Computer and Communications Security (CCS) best paper award, the 2013 ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc) best paper award, and the 2006 IEEE Donald Fink prize paper award, among others. He was selected as a Guggenheim fellow in 2021. He also received the 2019 Google Faculty Research Award, the 2020 Amazon faculty research award, and the 2021 Facebook/Meta faculty research award. He served as an IEEE Distinguished Lecturer and also served on the Board of Governors of the IEEE Information Theory Society (2016-2021). He is a Fellow of the IEEE. He has been an associate editor for IEEE Transactions on Information Theory, ACM/IEEE Transactions on Networking, and other journals and special issues, as well as serving on the program committees of several conferences. He has also helped organize IEEE and ACM conferences, including serving as the Technical Program Co-Chair for the 2012 IEEE Information Theory Workshop (ITW), the Technical Program Co-Chair for the 2015 IEEE International
of IEEE, an Editor for the IEEE Transactions on Wireless Communications Symposium on Information Theory (ISIT) and General co-chair for ACM
and IEEE Transactions on Vehicular Technology, a TPC member for ACM Mobihoc 2018. He has 8 issued patents.

© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Weizmann Institute of Science. Downloaded on February 08, 2023 at 07:20:33 UTC from IEEE Xplore. Restrictions apply.

Yonina C. Eldar (Fellow, IEEE) received the B.Sc. degree in physics and the B.Sc. degree in electrical engineering from Tel-Aviv University (TAU), Tel-Aviv, Israel, in 1995 and 1996, respectively, and the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, in 2002. She is currently a Professor with the Department of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot, Israel, where she heads the Center for Biomedical Engineering. She is also a Visiting Professor at MIT, a Visiting Scientist at the Broad Institute, and an Adjunct Professor at Duke University, and was a Visiting Professor at Stanford. She is a member of the Israel Academy of Sciences and Humanities and a EURASIP Fellow. She has received many awards for excellence in research and teaching, including the IEEE Signal Processing Society Technical Achievement Award in 2013, the IEEE/AESS Fred Nathanson Memorial Radar Award in 2014, and the IEEE Kiyo Tomiyasu Award in 2016. She was a Horev Fellow of the Leaders in Science and Technology Program at the Technion and was selected as one of the 50 most influential women in Israel and Asia. She was a member of the Young Israel Academy of Science and Humanities and the Israel Committee for Higher Education. She is the Editor-in-Chief of Foundations and Trends in Signal Processing and serves on many IEEE committees.

H. Vincent Poor (S’72, M’77, SM’82, F’87) received the Ph.D. degree in EECS from Princeton University in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990, he has been on the faculty at Princeton, where he is currently the Michael Henry Strater University Professor. From 2006 to 2016, he served as the Dean of Princeton’s School of Engineering and Applied Science. He has also held visiting appointments at several other universities, including most recently at Berkeley and Cambridge. His research interests are in the areas of information theory, machine learning, and network science, and their applications in wireless networks, energy systems, and related fields. Among his publications in these areas is the forthcoming book Machine Learning and Wireless Communications (Cambridge University Press). Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences and is a foreign member of the Chinese Academy of Sciences, the Royal Society, and other national and international academies. He received the IEEE Alexander Graham Bell Medal in 2017.

Khaled B. Letaief is an internationally recognized leader in wireless communications and networks. He is a Member of the United States National Academy of Engineering, a Fellow of IEEE, a Fellow of the Hong Kong Institution of Engineers, a Member of the India National Academy of Sciences, and a Member of the Hong Kong Academy of Engineering Sciences. He is also recognized by Thomson Reuters as an ISI Highly Cited Researcher and was listed among the 2020 top 30 of the AI 2000 Internet of Things Most Influential Scholars.
Dr. Letaief is the recipient of many distinguished awards and honors, including the 2022 IEEE Communications Society Edwin Howard Armstrong Achievement Award; the 2021 IEEE Communications Society Best Survey Paper Award; the 2019 IEEE Communications Society and Information Theory Society Joint Paper Award; the 2016 IEEE Marconi Prize Paper Award in Wireless Communications; the 2011 IEEE Communications Society Harold Sobol Award; the 2010 Purdue University Outstanding Electrical and Computer Engineer Award; the 2007 IEEE Communications Society Joseph LoCicero Publications Exemplary Award; and over 19 IEEE Best Paper Awards.
From 1990 to 1993, he was a faculty member at the University of Melbourne, Australia. Since 1993, he has been with the Hong Kong University of Science & Technology (HKUST), where he has held many administrative positions, including Acting Provost, Dean of Engineering, Head of the Electronic and Computer Engineering Department, Director of the Wireless IC Design Center, founding Director of the Huawei Innovation Laboratory, and Director of the Hong Kong Telecom Institute of Information Technology. From September 2015 to March 2018, he joined HBKU as Provost to help establish a research-intensive university in Qatar in partnership with strategic partners that include Northwestern University, Carnegie Mellon University, Cornell, and Texas A&M.
Dr. Letaief is well recognized for his dedicated service to professional societies and IEEE, where he has served in many leadership positions, including founding Editor-in-Chief of the prestigious IEEE Transactions on Wireless Communications. He also served as President of the IEEE Communications Society (2018–19), the world’s leading organization for communications professionals, headquartered in New York City with members in 162 countries. He is currently serving as a member of the IEEE Board of Directors.
Dr. Letaief received the B.S. degree with distinction in electrical engineering from Purdue University, West Lafayette, Indiana, USA, in December 1984. He received the M.S. and Ph.D. degrees in electrical engineering from Purdue University in August 1986 and May 1990, respectively. He has also received a Ph.D. Honoris Causa from the University of Johannesburg, South Africa, in 2022.

Junshan Zhang is a Professor in the ECE Department at the University of California, Davis. He received his Ph.D. degree from the School of ECE at Purdue University in August 2000 and was on the faculty of the School of ECEE at Arizona State University from 2000 to 2021. His research interests fall in the general field of information networks and data science, including edge intelligence, reinforcement learning, continual learning, network optimization and control, and game theory, with applications in connected and automated vehicles, 5G and beyond, wireless networks, IoT data privacy/security, and smart grid.
Prof. Zhang is a Fellow of the IEEE and a recipient of the ONR Young Investigator Award in 2005 and the NSF CAREER Award in 2003. He received the IEEE Wireless Communication Technical Committee Recognition Award in 2016. His papers have won several awards, including the Best Student Paper Award at WiOPT 2018, the Kenneth C. Sevcik Outstanding Student Paper Award of ACM SIGMETRICS/IFIP Performance 2016, the Best Paper Runner-up Award of IEEE INFOCOM 2009 and IEEE INFOCOM 2014, and the Best Paper Award at IEEE ICC 2008 and ICC 2017. Building on his research findings, he co-founded Smartiply Inc., a fog computing startup company delivering boosted network connectivity and embedded artificial intelligence. Prof. Zhang is currently serving as Editor-in-Chief for IEEE Transactions on Wireless Communications and a Senior Editor for IEEE/ACM Transactions on Networking. He was TPC Co-Chair for a number of major conferences in communication networks, including IEEE INFOCOM 2012 and ACM MobiHoc 2015. He was the General Chair for ACM/IEEE SEC 2017 and WiOPT 2016. He was a Distinguished Lecturer of the IEEE Communications Society.

