Distributed Optimization in Networked Systems: Algorithms and Applications
Wireless Networks
Series Editor
Xuemin Sherman Shen, University of Waterloo, Waterloo, ON, Canada
The purpose of Springer’s Wireless Networks book series is to establish the state
of the art and set the course for future research and development in wireless
communication networks. The scope of this series includes not only all aspects
of wireless networks (including cellular networks, WiFi, sensor networks, and
vehicular networks), but also related areas such as cloud computing and big data.
The series serves as a central source of references for wireless networks research
and development. It aims to publish thorough and cohesive overviews on specific
topics in wireless networks, as well as works that are larger in scope than survey
articles and that contain more detailed background information. The series also
provides coverage of advanced and timely topics worthy of monographs, contributed
volumes, textbooks and handbooks.
** Indexing: Wireless Networks is indexed in EBSCO databases and DBLP **
Qingguo Lü • Xiaofeng Liao • Huaqing Li • Shaojiang Deng • Shanfu Gao

Distributed Optimization in Networked Systems: Algorithms and Applications

Qingguo Lü
College of Computer Science
Chongqing University
Chongqing, China

Xiaofeng Liao
College of Computer Science
Chongqing University
Chongqing, China

Shanfu Gao
College of Computer Science
Chongqing University
Chongqing, China
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
To My Family
Q. Lü
To my family
X. Liao
To my family
H. Li
To my family
S. Deng
To my family
S. Gao
Preface
In recent years, the Internet of Things (IoT) and big data have been interconnected
to a wide and deep extent through the sensing, computing, communication, and
control of intelligent information. Networked systems are playing an increasingly
important role in the interconnected information environment, profoundly affecting
computer science, artificial intelligence, and other related fields. The core of such systems, composed of many nodes, is to accomplish certain global goals efficiently through mutual collaboration, with each node making its own decisions based on different preferences; in this way, the system solves large-scale complex problems that are difficult for individual nodes to handle, with strong resistance to interference and strong environmental adaptability. In addition, such systems require participating nodes to access only
their own local information. This may be due to the consideration of security and
privacy issues in the network, or simply because the network is too large, making
the aggregation of global information to a central node practically impossible or
very inefficient. Currently, as a hot research topic with wide applicability and great
application value across multiple disciplines, distributed optimization of networked
systems has laid an important foundation for promoting and leading the frontier
development in computer science and artificial intelligence. However, networked
systems cover a large number of intelligent devices (nodes), and the network
environment is often dynamic and changing, making it extremely hard to optimize
and analyze them. It is problematic for existing theories and methods to effectively
address the new needs and challenges of optimization brought about by the rapid
development of technologies related to networked systems. Hence, it is urgent to
develop new theories and methods of distributed optimization over networks.
Analysis and synthesis are thoroughly studied in this monograph, covering distributed unconstrained optimization, distributed constrained optimization, distributed nonsmooth optimization, distributed online optimization, distributed economic dispatch in smart grids, undirected networks, directed networks, time-varying networks, consensus control protocols, gradient tracking techniques, event-triggered communication strategies, Nesterov and heavy-ball accelerated mechanisms, variance-reduction techniques, differential privacy strategies, gradient descent algorithms, accelerated algorithms, stochastic gradient algorithms, and online algorithms.
This book was supported in part by the Natural Science Foundation of Chongqing under Grant CSTB2022NSCQ-MSX1627, in part by the Chongqing Postdoctoral Science Foundation under Grant 2021XM1006, in part by the China Postdoctoral Science Foundation under Grant 2021M700588, in part by the National Natural Science Foundation of China under Grant 62173278, in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJQN202100228, in part by the project of Key Laboratory of Industrial Internet of Things & Networked Control, Ministry of Education under Grant 2021FF09, in part by the project funded by Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (Wuhan University of Science and Technology) under Grant ZNXX2022004, in part by the project funded by Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology) under Grant HBIR202205, and in part by the National Key R&D Program of China under Grant 2018AAA0100101. We would like to begin by acknowledging Yingjue Chen and Keke Zhang, who have unselfishly given their valuable time to arranging raw materials. Their assistance has been invaluable to the completion of this book. The authors are especially grateful to their families for their encouragement and never-ending support when it was most required. Finally, we would like to thank the editors at Springer for their professional and efficient handling of this book.
Chapter 1
Accelerated Algorithms for Distributed
Convex Optimization
1.1 Introduction
In the past decades, with the development of artificial intelligence and the emergence of 5G, many researchers have become interested in distributed optimization. This chapter considers a class of widely studied distributed optimization problems in which each node cooperatively attempts to optimize a global cost function
in the context of local interactions and local computations [1]. Instances of such
formulation characterized by distributed computing have several important and
widespread applications in various fields, including wireless sensor networks for
decision-making and information-processing [2], distributed resource allocation
in smart grids [3], distributed learning in robust control [4], and time-varying
formation control [5, 6], among many others [7–13]. Unlike traditional centralized
optimization, distributed optimization involves multiple nodes that gain access to
their private local information over networks, and typically no central coordinator
(node) can acquire the entire information over the networks.
Recently, an increasing number of distributed algorithms have emerged based on various local computation schemes for individual nodes. Some known approaches for different networks rely on distributed (sub)gradient descent, with extensions that handle interaction delays, asynchronous updates, stochastic (sub)gradient scenarios, etc. [14–22]. It is noteworthy that these algorithms are intuitive and flexible with respect to the cost functions and networks; however, their convergence rates are considerably slow owing to the diminishing step-size required to guarantee convergence to an exact optimal solution [14]. The convergence rate of these algorithms, even for strongly convex functions, is only sublinear [15]. An algorithm with a constant step-size achieves a linear rate, but only to a sub-optimal solution [20]. Methods that resolve this exactness-speed dilemma, such as the distributed alternating direction method of multipliers (ADMM) [23, 24] and distributed dual decomposition [25], are based on Lagrangian duality and have nice provable convergence rates (a linear convergence rate for strongly convex functions) [26]. In addition, extensions to various real-world factors, including stochastic errors [27] and privacy preservation [28], and techniques including proximal (sub)gradients [29] and formation-containment control [30], have been extensively studied. However, due to the need to solve sub-problems at each iteration, the computational complexity is considerably high. To overcome these difficulties effectively, quite a few approaches have been proposed that achieve linear convergence for smooth and strongly convex cost functions [31–38]. Nonetheless, these approaches [31–38] are only suitable for undirected networks.
Distributed optimization over directed networks was first studied in [39], where the (sub)gradient-push (SP) method was employed to eliminate the requirement of network balancing, i.e., by using column-stochastic weights. Since SP is built on (sub)gradient descent with diminishing step-size, it also suffers from a slow sublinear convergence rate. To accelerate convergence, Xi and Khan [40] proposed a linearly convergent distributed method (DEXTRA) with constant step-size by combining the push-sum strategy with the protocol (EXTRA) in [31]. Further, Xi et al. [41] (fixed directed networks) and Nedic et al. [42] (time-varying directed networks) combined the push-sum strategy with distributed inexact gradient tracking under constant step-size (ADD-OPT [41] and Push-DIGing [42]) to acquire linear convergence to the exact optimal solution. Then, Lü et al. [43, 44] extended the work of [42] to non-uniform step-sizes and showed linear convergence. A different class of approaches that do not utilize the push-sum mechanism has recently been proposed in [45–50], where both row- and column-stochastic weights are adopted simultaneously to acquire linear convergence over directed networks. It is noteworthy that although these approaches [39–50] avoid the construction of doubly-stochastic weights, they still require each node to know (at least) its own out-degree exactly. Therefore, all the nodes in the networks of [39–50] can adjust their outgoing weights to ensure that each column of the weight matrix sums to one. This requirement, however, is likely to be unrealistic in broadcast-based interaction schemes (i.e., where a node neither accesses its out-neighbors nor regulates its outgoing weights).
In this chapter, the algorithm that we construct depends crucially on gradient tracking and is a variation of the methods that appeared in [47–55]. To be specific, Qu and Li [54] combined gradient tracking with the distributed Nesterov gradient descent (DNGD) method [55] and thereby investigated two accelerated distributed Nesterov methods, i.e., Acc-DNGD-SC and Acc-DNGD-NSC, which exhibited fast convergence compared with the centralized gradient descent (CGD) method for different cost functions. Note that although the convergence rates are improved, the two approaches in [54] assume that the interaction networks are undirected, which limits the applicability of the methods in many fields, such as wireless sensor networks. To remove this deficiency, Xin et al. [48] established an acceleration and generalization of first-order methods with gradient tracking and a momentum term, i.e., ABm, which overcame the conservatism (eigenvector estimation or doubly-stochastic weights) of the related work by implementing both row- and column-stochastic weights. In this setting, some interesting generalized methods were proposed for random link failures [46] and interaction delays [49]. Regrettably, the construction of column-stochastic weights demands that each node possess at least its own out-degree information, which is arduous to implement, for example, in broadcast-based interaction scenarios. In light of this challenge, Xin et al. [52] investigated the case of a row-stochastic weight matrix, which avoids such global requirements on the network, and proposed a fast distributed method (FROST) with non-uniform step-sizes motivated by the idea of [51]. Related works also address demand response and economic scheduling in power systems [53, 56]. However, these methods [51–53, 56] do not adopt momentum terms [54, 55, 57], through which nodes acquire more information from in-neighbors in the network for fast convergence. Moreover, two accelerated methods based on Nesterov's momentum for distributed optimization over arbitrary networks were presented in [50]. Unfortunately, the work in [50] neither considers non-uniform step-sizes nor provides a rigorous theoretical analysis of the methods. Hence, it is of great significance to address such a challenging issue due to its practicality.
The main interest of this chapter is to study the distributed convex optimization problem over a directed network. To solve this problem, a linearly convergent algorithm is designed that utilizes non-uniform step-sizes, momentum terms, and a row-stochastic weight matrix. We hope to develop a broad theory of distributed convex optimization, and the underlying purpose of designing a distributed optimization algorithm is to adapt to and promote real scenarios.
1.2 Preliminaries
1.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Let R and R^p denote the set of real numbers and of p-dimensional real column vectors, respectively. The subscripts i and j are utilized to denote the indices of nodes, and the superscript t denotes the iteration index of an algorithm; e.g., x_i^t denotes the variable of node i at time t. We let the notations 1_n and 0_n denote the column vectors with all entries equal to one and zero, respectively. Let I_n and z_ij denote the identity matrix of size n and the entry of matrix Z in its i-th row and j-th column, respectively. The Euclidean norm for vectors and the induced 2-norm for matrices are represented by the symbol ||·||_2. Let the notation Z = diag{y} represent the diagonal matrix of the vector y = [y_1, y_2, ..., y_n]^T, which satisfies z_ii = y_i, ∀i = 1, ..., n, and z_ij = 0, ∀i ≠ j. We define the symbol diag{Z} as the diagonal matrix whose diagonal elements are the same as those of the matrix Z. The transposes of a vector z and a matrix W are denoted by z^T and W^T, respectively. Let e_i = [0, ..., 1_i, ..., 0]^T. The gradient of a differentiable function f at z is denoted by ∇f(z), with ∇f: R^p → R^p. A non-negative square matrix Z ∈ R^{n×n} is row-stochastic if Z1_n = 1_n, column-stochastic if Z^T 1_n = 1_n, and doubly stochastic if Z1_n = 1_n and Z^T 1_n = 1_n.
Consider a set of n nodes connected over a directed network. The global objective is to find x ∈ R^p that minimizes the average of all local cost functions, i.e.,

    min_{x ∈ R^p} f(x) = (1/n) Σ_{i=1}^n f_i(x),        (1.1)
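To make the setup concrete, the following minimal sketch (ours, not from the original text) instantiates problem (1.1) with hypothetical quadratic local costs f_i(x) = (1/2)||A_i x − b_i||²; the names A, b, grad_f, and x_star are our own, and the global minimizer is computed in closed form for later comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 2                       # number of nodes, decision dimension

# Hypothetical local costs f_i(x) = 0.5 * ||A_i x - b_i||^2.
A = [rng.standard_normal((5, p)) for _ in range(n)]
b = [rng.standard_normal(5) for _ in range(n)]

def grad_f(x):
    """Gradient of the global cost f(x) = (1/n) * sum_i f_i(x)."""
    return sum(Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A, b)) / n

# For this quadratic instance the global minimizer is available in closed form.
H = sum(Ai.T @ Ai for Ai in A) / n
g = sum(Ai.T @ bi for Ai, bi in zip(A, b)) / n
x_star = np.linalg.solve(H, g)
assert np.allclose(grad_f(x_star), np.zeros(p))
```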
where f_i: R^p → R is the local cost function privately known by node i, and the nodes interact over a directed network G. If there exists a directed path between any two nodes, then G is said to be strongly connected. In addition, the following assumptions are adopted.
Assumption 1.1 ([51]) The network G corresponding to the set of nodes is directed
and strongly connected.
Remark 1.1 Assumption 1.1 is fundamental to assure that nodes in the network can
always affect others directly or indirectly when studying distributed optimization
problems [39–53].
On the basis of the above section, we first review the centralized Nesterov gradient descent (CNGD) method and then propose the directed distributed Nesterov-like gradient tracking algorithm, named D-DNGT, to solve problem (1.1).
Here, CNGD, derived from [57], is briefly introduced for an L̄-smooth and μ̄-strongly convex cost function. At each time t ≥ 0, CNGD keeps three vectors y^t, x^t, v^t ∈ R^p, which, given α = √(μ̄/L̄) and γ = μ̄, are updated as

    y^t = (x^t + αv^t)/(1 + α),
    x^{t+1} = y^t − (1/L̄)∇f(y^t),        (1.4)
    v^{t+1} = (1 − α)v^t + (αμ̄/γ)y^t − (α/γ)∇f(y^t),

which can be equivalently expressed in the momentum form

    x^{t+1} = y^t − (1/L̄)∇f(y^t),
    y^{t+1} = x^{t+1} + β(x^{t+1} − x^t),        (1.5)

where β = (√L̄ − √μ̄)/(√L̄ + √μ̄). It is well known that among all centralized gradient approaches, CNGD [57] achieves the optimal convergence rate in terms of first-order oracle complexity. Under Assumptions 1.2 and 1.3, the convergence rate of CNGD (1.5) is O((1 − √(μ̄/L̄))^t), whose dependence on the condition number L̄/μ̄ improves over CGD's rate O((1 − μ̄/L̄)^t) in the large L̄/μ̄
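The following sketch (a hand-written illustration, reusing H, grad_f, x_star, and p from the previous snippet) runs CNGD in its momentum form (1.5) on the quadratic example; the constants L̄ and μ̄ are read off the Hessian of that particular instance.

```python
# Smoothness and strong-convexity constants of f from the Hessian H.
eigs = np.linalg.eigvalsh(H)
L_bar, mu_bar = eigs.max(), eigs.min()
beta = (np.sqrt(L_bar) - np.sqrt(mu_bar)) / (np.sqrt(L_bar) + np.sqrt(mu_bar))

x_prev = x = y = np.zeros(p)
for t in range(300):
    x_prev, x = x, y - grad_f(y) / L_bar     # gradient step from y^t
    y = x + beta * (x - x_prev)              # Nesterov extrapolation
print(np.linalg.norm(x - x_star))            # the error decays linearly in t
```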
regime. In this chapter, we devote ourselves to the study of a directed distributed Nesterov-like gradient tracking (D-DNGT) algorithm, which is not only suitable for a directed network but also converges linearly and accurately to the optimal solution of (1.1). To the best of our knowledge, this problem has not yet been studied and is worthwhile to investigate.
We now describe D-DNGT, which deals with problem (1.1) in a distributed fashion. Each node i ∈ V at time t ≥ 0 stores four variables: x_i^t ∈ R^p, y_i^t ∈ R^p, s_i^t ∈ R^n, and z_i^t ∈ R^p. For t ≥ 0 (with the convention x_i^{−1} = x_i^0), node i ∈ V updates its variables as follows:

    x_i^{t+1} = Σ_{j=1}^n r_ij y_j^t + β_i(x_i^t − x_i^{t−1}) − α_i z_i^t,
    y_i^{t+1} = x_i^{t+1} + β_i(x_i^{t+1} − x_i^t),
    s_i^{t+1} = Σ_{j=1}^n r_ij s_j^t,                                        (1.6)
    z_i^{t+1} = Σ_{j=1}^n r_ij z_j^t + ∇f_i(y_i^{t+1})/[s_i^{t+1}]_i − ∇f_i(y_i^t)/[s_i^t]_i,

where α_i > 0 and β_i ≥ 0 are the local step-size and momentum coefficient of node i, and the weights r_ij satisfy

    r_ij > ε, ∀j ∈ N_i^in;   r_ij = 0, ∀j ∉ N_i^in;   r_ii = 1 − Σ_{j∈N_i^in} r_ij > ε, ∀i,        (1.7)

where 0 < ε < 1.^1 Each node i ∈ V starts with initial states x_i^0 = y_i^0 ∈ R^p, s_i^0 = e_i, and z_i^0 = ∇f_i(y_i^0).^2
Denote R = [r_ij] ∈ R^{n×n} as the collection of the weights r_ij, i, j ∈ V, in (1.7), which is obviously row-stochastic. In essence, the update of z_i^t in (1.6) is a distributed inexact gradient tracking step, where each local cost function's gradient is scaled by [s_i^t]_i, which is generated by the third update in (1.6). Actually, the update of s_i^t in (1.6) is a consensus iteration aiming to estimate the Perron eigenvector w = [w_1, ..., w_n]^T (associated with the eigenvalue 1) of the weight matrix R satisfying 1_n^T w = 1. This iteration is similar to that employed in [51–53]. To sum up, D-DNGT (1.6) transforms CNGD (1.5) into a distributed form via gradient tracking and can be applied to a directed network.

Remark 1.3 For the sake of brevity, we mainly concentrate on the one-dimensional case, i.e., p = 1; the multi-dimensional case can be proven similarly.
Define x^t = [x_1^t, ..., x_n^t]^T ∈ R^n, y^t = [y_1^t, ..., y_n^t]^T ∈ R^n, z^t = [z_1^t, ..., z_n^t]^T ∈ R^n, S^t = [s_1^t, ..., s_n^t]^T ∈ R^{n×n}, ∇F(y^t) = [∇f_1(y_1^t), ..., ∇f_n(y_n^t)]^T ∈ R^n, and S̃^t = diag{S^t}. Therefore, the aggregated form of D-DNGT (1.6) can be written as follows:

    x^{t+1} = Ry^t + D_β(x^t − x^{t−1}) − D_α z^t,
    y^{t+1} = x^{t+1} + D_β(x^{t+1} − x^t),
    S^{t+1} = RS^t,                                        (1.8)
    z^{t+1} = Rz^t + [S̃^{t+1}]^{−1}∇F(y^{t+1}) − [S̃^t]^{−1}∇F(y^t),

where D_α = diag{α_1, ..., α_n} and D_β = diag{β_1, ..., β_n}.
^1 It is worth noticing that the weights r_ij, i, j ∈ V, associated with the network G given in (1.7) are valid. For all i ∈ V, the conditions on the weights in (1.7) can be satisfied when we set r_ij = 1/|N_i^in|, ∀j ∈ N_i^in, and r_ij = 0 otherwise.
^2 Suppose that each node possesses and knows its unique identifier in the network, e.g., 1, ..., n [45–50].
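As a hedged illustration of the updates (1.6)/(1.8), the following self-contained sketch (ours; all names hypothetical) runs D-DNGT on a randomly generated strongly connected digraph with quadratic local costs. The step-sizes and momentum coefficients are small hand-tuned values, not the bounds derived in Sect. 1.4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 2

# A directed cycle plus random extra edges: strongly connected by design.
# adj[i, j] = True means node i receives information from node j.
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    adj[i, (i - 1) % n] = True
    adj[i, rng.integers(n)] = True
np.fill_diagonal(adj, True)                 # every node uses its own state

# Row-stochastic weights as in (1.7): uniform over in-neighbors (self included).
R = adj / adj.sum(axis=1, keepdims=True)

# Hypothetical local costs f_i(x) = 0.5 * ||A_i x - b_i||^2.
A = [rng.standard_normal((5, p)) for _ in range(n)]
b = [rng.standard_normal(5) for _ in range(n)]

def grads(Y):
    """Stack of local gradients evaluated at the rows of Y."""
    return np.stack([A[i].T @ (A[i] @ Y[i] - b[i]) for i in range(n)])

alpha = np.full(n, 0.005)                   # non-uniform step-sizes allowed
beta = np.full(n, 0.1)                      # momentum coefficients

X = Xold = Y = np.zeros((n, p))             # x_i^{-1} = x_i^0 = y_i^0
S = np.eye(n)                               # s_i^0 = e_i
Z = grads(Y)                                # z_i^0 = grad f_i(y_i^0)
for t in range(10000):
    Xnew = R @ Y + beta[:, None] * (X - Xold) - alpha[:, None] * Z
    Ynew = Xnew + beta[:, None] * (Xnew - X)
    Snew = R @ S                            # Perron eigenvector learning
    Z = (R @ Z + grads(Ynew) / np.diag(Snew)[:, None]
               - grads(Y) / np.diag(S)[:, None])
    Xold, X, Y, S = X, Xnew, Ynew, Snew
# For sufficiently small step-sizes, the rows of X reach consensus at the
# minimizer of (1/n) * sum_i f_i.
```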
In this subsection, some distributed optimization methods that are not only suitable for directed networks but also related to D-DNGT (1.6) are discussed on the basis of an intuitive explanation. In particular, we consider ADD-OPT/Push-DIGing [41, 42], FROST [52], and ABm [48].^3
(a) Relation to ADD-OPT/Push-DIGing ADD-OPT [41] (Push-DIGing [42] is suitable for time-varying networks in comparison with ADD-OPT) keeps updating four variables x_i^t, s_i^t, y_i^t, and z_i^t ∈ R for each node i. Starting from the initial states s_i^0 = 1, z_i^0 = ∇f_i(y_i^0), and an arbitrary x_i^0, the updating rule of ADD-OPT is given by

    x_i^{t+1} = Σ_{j=1}^n c_ij x_j^t − αz_i^t,
    s_i^{t+1} = Σ_{j=1}^n c_ij s_j^t,   y_i^{t+1} = x_i^{t+1}/s_i^{t+1},        (1.9)
    z_i^{t+1} = Σ_{j=1}^n c_ij z_j^t + ∇f_i(y_i^{t+1}) − ∇f_i(y_i^t),

where α > 0 is a uniform step-size and the weights C = [c_ij] ∈ R^{n×n} are column-stochastic.
^3 Notice that some notations used in the related methods may conflict with the notations used for the distributed optimization problem/algorithm/analysis throughout the chapter. Therefore, we declare here that the symbols in this subsection should not be applied to other parts.
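A minimal sketch of ADD-OPT (1.9) follows, reusing the graph adj and the local gradients grads from the D-DNGT sketch above; the column-stochastic weights and the uniform step-size are our own hypothetical choices.

```python
# Column-stochastic weights: node j scales its outgoing links using its
# out-degree, so each column of C sums to one.
C = adj / adj.sum(axis=0, keepdims=True)
alpha0 = 0.005                               # uniform step-size (hand-tuned)

X = np.zeros((n, p))
s = np.ones(n)                               # s_i^0 = 1 (push-sum weights)
Y = X / s[:, None]
Z = grads(Y)
for t in range(10000):
    Xnew = C @ X - alpha0 * Z
    snew = C @ s
    Ynew = Xnew / snew[:, None]              # de-bias by the push-sum weight
    Z = C @ Z + grads(Ynew) - grads(Y)
    X, s, Y = Xnew, snew, Ynew
# For suitably small alpha0, each y_i converges to the global minimizer.
```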
(b) Relation to FROST FROST [52] keeps updating three variables x_i^t ∈ R, s_i^t ∈ R^n, and z_i^t ∈ R for each node i according to

    x_i^{t+1} = Σ_{j=1}^n r_ij x_j^t − α_i z_i^t,
    s_i^{t+1} = Σ_{j=1}^n r_ij s_j^t,                                        (1.10)
    z_i^{t+1} = Σ_{j=1}^n r_ij z_j^t + ∇f_i(x_i^{t+1})/[s_i^{t+1}]_i − ∇f_i(x_i^t)/[s_i^t]_i,

where α_i > 0 is a step-size locally chosen at each node i and the row-stochastic weights R = [r_ij] ∈ R^{n×n} comply with (1.7); the initialization is x_i^0 ∈ R, s_i^0 = e_i, and z_i^0 = ∇f_i(x_i^0). FROST utilizes row-stochastic weights with non-uniform step-sizes among the nodes and exhibits fast convergence over a directed network, converging at a linear rate to the optimal solution under Assumptions 1.1–1.3.
(c) Relation to ABm The ABm method, investigated in [48], combines gradient tracking with a momentum term and utilizes non-uniform step-sizes; it is described as follows:

    x_i^{t+1} = Σ_{j=1}^n r_ij x_j^t − α_i z_i^t + β_i(x_i^t − x_i^{t−1}),
    z_i^{t+1} = Σ_{j=1}^n c_ij z_j^t + ∇f_i(x_i^{t+1}) − ∇f_i(x_i^t),        (1.11)

initialized with z_i^0 = ∇f_i(x_i^0) and an arbitrary x_i^0 at each node i, where, as before, α_i > 0 and β_i ≥ 0 represent the local step-size and the momentum coefficient of node i. By simultaneously implementing both row-stochastic (R = [r_ij] ∈ R^{n×n}) and column-stochastic (C = [c_ij] ∈ R^{n×n}) weights, it is deduced from [48] that ABm reduces to AB [45] when β_i = 0, ∀i, and AB lies at the heart of existing methods that employ gradient tracking [42, 43, 48].
Notice that ADD-OPT/Push-DIGing, FROST, and D-DNGT, described above, each contain a non-linear term arising from the division by the eigenvector learning term ((1.6), (1.9), and (1.10)). ABm eliminates this non-linear calculation and is still suitable for directed networks. However, ABm requires each node to gain access to its out-degree information to build column-stochastic weights, which is challenging to realize directly in a distributed manner, as explained earlier. It is worth highlighting that our algorithm, D-DNGT, extends CNGD to a distributed form and, in contrast to CNGD [57] and Acc-DNGD-SC/Acc-DNGD-NSC [54], is suitable for directed networks. In addition, D-DNGT combines FROST with two kinds of momentum terms (heavy-ball momentum and Nesterov momentum), which ensures that nodes acquire more information from in-neighbors in the network than under FROST and thereby achieve much faster convergence.
In this section, we will prove that D-DNGT (1.6) converges at a linear rate to the optimal solution x* provided that the coefficients (non-uniform step-sizes and momentum coefficients) are bounded by properly chosen constants. The following notations and relations are employed.

Recalling that R is irreducible and row-stochastic with positive diagonals, under Assumption 1.1, there exists a normalized left Perron eigenvector w = [w_1, ..., w_n]^T ∈ R^n (w_i > 0, ∀i) of R such that w^T R = w^T, w^T 1_n = 1, and lim_{t→∞}(R)^t = 1_n w^T =: (R)^∞.^4
Before showing the main results, we introduce some auxiliary results. First, the following crucial lemma is given, which is a direct implication of Assumption 1.1 and (1.7) (see Section II-B in [32]).

Lemma 1.4 ([32]) Suppose that Assumption 1.1 holds and that the weight matrix R = [r_ij] ∈ R^{n×n} follows (1.7). Then, there exist a norm ||·|| and a constant 0 < ρ < 1 such that

    ||Rx − (R)^∞ x|| ≤ ρ||x − (R)^∞ x||

for all x ∈ R^n.
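The following quick numerical check (ours, reusing R, n, and rng from the D-DNGT sketch) illustrates the limit (R)^∞ = 1_n w^T and inspects the contraction ratio in the plain 2-norm; note that Lemma 1.4 only guarantees the existence of some norm in which the ratio stays below 1.

```python
# Empirical illustration of (R)^infty = 1_n w^T and of the contraction.
Rinf = np.linalg.matrix_power(R, 1000)       # (R)^t -> 1_n w^T
w = Rinf[0]                                  # every row approaches w^T
assert np.allclose(Rinf, np.outer(np.ones(n), w), atol=1e-8)

x = rng.standard_normal(n)
ratio = (np.linalg.norm(R @ x - Rinf @ x)
         / np.linalg.norm(x - Rinf @ x))
print(ratio)   # 2-norm ratio; the lemma's rho < 1 holds in a suitable norm
```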
According to the result established in Lemma 1.4, we next present an additional lemma from Markov chain and consensus theory [60].

Lemma 1.5 ([60]) Let S^t be generated by (1.8). Then S^t converges to a limit S^∞ = (R)^∞, and there exist 0 < θ < ∞ and 0 < λ < 1 such that ||S^t − S^∞||_2 ≤ θ(λ)^t for all t ≥ 0.

^4 Throughout the chapter, for any arbitrary matrix/vector/scalar Z, we utilize the symbol (Z)^t to represent the t-th power of Z, to distinguish it from the iteration index of variables.
For convenience of the convergence analysis, we will make frequent use of the following well-known lemma (see, e.g., [32] for a proof).

Lemma 1.8 ([32]) Suppose that Assumptions 1.2–1.3 hold. Since the global cost function f is μ̄-strongly convex and L̄-smooth, for all x ∈ R and 0 < ε < 2/L̄, we get

    ||x − ε∇f(x) − x*||_2 ≤ l||x − x*||_2,

where l = max{|1 − L̄ε|, |1 − μ̄ε|}, x* is the optimal solution to (1.1), and ∇f(x) is the gradient of f(x) at x.
Lemma 1.9 Suppose that Assumption 1.1 holds. Then, for all t > 0, the consensus error ||x^{t+1} − (R)^∞ x^{t+1}|| satisfies the inequality (1.13), which is obtained from Lemma 1.4 and the fact that (R)^∞ R = (R)^∞. The desired result of Lemma 1.9 is then acquired.
The next lemma presents a bound on the optimality residual associated with the weighted average ||(R)^∞ x^{t+1} − 1_n x*||_2 (notice that (R)^∞ x^{t+1} = 1_n x̄^{t+1}).

Lemma 1.10 Suppose that Assumptions 1.2 and 1.3 hold. If 0 < n(w^T α) < 2/L̄, then ||(R)^∞ x^{t+1} − 1_n x*||_2 obeys the bound (1.14) for all t > 0, where κ_3 = d_1||(R)^∞||_2 and l_1 = max{|1 − L̄n(w^T α)|, |1 − μ̄n(w^T α)|}; θ and λ are introduced in Lemma 1.5.
Proof Notice that (R)^∞ R = (R)^∞. Recalling the updates of x^t and y^t in D-DNGT (1.8), we get from Lemma 1.7 that

    ||(R)^∞ x^{t+1} − 1_n x*||_2
    = ||(R)^∞(x^t + 2D_β(x^t − x^{t−1}) − D_α z^t + D_α(R)^∞ z^t − D_α(R)^∞ z^t) − 1_n x*||_2
    ≤ ||(R)^∞ x^t − (R)^∞ D_α(R)^∞[S̃^t]^{−1}∇F(y^t) − 1_n x*||_2 + ⋯.        (1.15)
We now discuss the first term in the inequality of (1.15). Note that (R)^∞ = 1_n w^T and ∇F(1_n x̄^t) = [∇f_1(x̄^t), ..., ∇f_n(x̄^t)]^T. By utilizing 1_n w^T D_α 1_n w^T = (w^T α)1_n w^T, one obtains (1.16), where ∇f(x̄^t) = (1/n)1_n^T ∇F(1_n x̄^t). By Lemma 1.8, when 0 < n(w^T α) < 2/L̄, Λ_1 is bounded by

    Λ_1 ≤ l_1 √n ||w^T x^t − x*||_2 = l_1 ||(R)^∞ x^t − 1_n x*||_2,        (1.17)

where l_1 = max{|1 − L̄n(w^T α)|, |1 − μ̄n(w^T α)|}. Then, Λ_2 can be bounded as in (1.18), where ∇F(x^t) = [∇f_1(x_1^t), ..., ∇f_n(x_n^t)]^T. Since ∇f(x̄^t) = (1/n)1_n^T ∇F(1_n x̄^t), it follows from Assumption 1.2 that (1.19) holds. Next, by employing Lemma 1.6 and the relation S^∞[S̃^∞]^{−1} = 1_n 1_n^T, we obtain (1.20), where ŝ = sup_{t≥0} ||S^t||_2 and s̃ = sup_{t≥0} ||[S̃^t]^{−1}||_2. The lemma follows by plugging (1.16)–(1.20) into (1.15).
For the bound of the estimate difference ||x^{t+1} − x^t||, the following lemma is shown.

Lemma 1.11 Suppose that Assumption 1.2 holds. For all t > 0, ||x^{t+1} − x^t|| obeys the bound (1.21).

Proof Recalling that (R)^∞ R = (R)^∞, the bound is obtained from the updates of x^t and y^t in D-DNGT (1.8).

The next lemma bounds the gradient tracking error ||z^{t+1} − (R)^∞ z^{t+1}||.

Lemma 1.12 Suppose that Assumptions 1.1 and 1.2 hold. For all t > 0,

    ||z^{t+1} − (R)^∞ z^{t+1}|| ≤ κ_4κ_6(1 + β̂)||x^t − (R)^∞ x^t|| + κ_6 d_2(1 + β̂)α̂||z^t||_2 + ⋯.

Proof First, we have

    ||z^{t+1} − (R)^∞ z^{t+1}|| ≤ ||I_n − (R)^∞|| ||[S̃^{t+1}]^{−1}∇F(y^{t+1}) − [S̃^t]^{−1}∇F(y^t)|| + ρ||z^t − (R)^∞ z^t||,        (1.24)

where we employ the triangle inequality and Lemma 1.4 to deduce the inequality. As for the first term of the inequality in (1.24), we apply the update of y^t in D-DNGT (1.8) and the result in Lemma 1.6 to obtain (1.25). Combining Lemma 1.11 with (1.25), the result in Lemma 1.12 is obtained.
The final lemma establishes a necessary bound on the estimate ||z^t||_2 for deriving the aforementioned linear system.

Lemma 1.13 Assume that Assumption 1.2 holds. Then, ||z^t||_2 obeys the bound (1.26) for all t > 0.

Proof In view of Lemma 1.7, using S^∞[S̃^∞]^{−1} = 1_n 1_n^T and (R)^∞ = S^∞, it suffices to bound the right-hand side of (1.27). Substituting (1.28) and (1.29) into (1.27) yields the desired result in Lemma 1.13. The proof is completed.
With the supporting relationships, i.e., the above Lemmas 1.9–1.13, in hand, the main convergence results of D-DNGT are now established as follows.
For the sake of convenience, we define wmin = mini∈V {wi }, ν1 = κ2 d1 nL̂,
ν2 = κ2 nL̂, ν3 = κ2 d1 , ν4 = d1 nL̂, ν5 = d1 ŝ s̃ L̂, ν6 = d2 d1 nL̂, ν7 = d2 nL̂,
ν8 = d2 d1 , ν9 = κ4 κ6 , ν10 = κ6 d2 d1 nL̂, ν11 = κ6 d2 nL̂, ν12 = κ6 + κ5 κ6 , ν13 =
κ5 κ6 , ν14 = κ6 d2 d1 , ν15 = κ2 α̂ ŝ(s̃)2 θ , ν16 = ŝ(s̃)2 θ α̂, ν17 = d2 α̂ ŝ(s̃)2 θ , ν18 =
(2||In − (R)∞ || + κ6 (1 + β̂)α̂ ŝ)(s̃)2 θ d2 , ν19 = ν13 η3 + ν10 η3 α̂, ν20 = ν9 η1 +
ν10 η1 α̂ + ν11 η2 α̂ + ν12 η3 + ν10 η3 α̂ + ν14 η4 α̂ and ν21 = η4 (1 − ρ) − ν9 η1 − (ν10 η1 +
ν11 η2 + ν14 η4 )α̂. Then, the first result, i.e., Theorem 1.14, is introduced below.
and γ_41 = ν_9 + ν_10α̂ + ν_9β̂ + ν_10α̂β̂, γ_42 = ν_11α̂ + ν_11α̂β̂, γ_43 = ν_12β̂ + ν_13β̂² + ν_10α̂β̂ + ν_10α̂β̂², and γ_44 = ρ + ν_14α̂ + ν_14α̂β̂; φ_1^t = ν_15(λ)^t||∇F(y^t)||_2, φ_2^t = ν_16(λ)^t||∇F(y^t)||_2, φ_3^t = ν_17(λ)^t||∇F(y^t)||_2, and φ_4^t = ν_18(λ)^t||∇F(y^t)||_2. Assume in addition that the largest step-size satisfies

    0 < α̂ < min{ 1/(nL̄),
                  η_1(1 − ρ)/(ν_1η_1 + ν_2η_2 + ν_3η_4),
                  (η_3 − κ_4η_1)/(ν_6η_1 + ν_7η_2 + ν_8η_4),
                  (η_4(1 − ρ) − ν_9η_1)/(ν_10η_1 + ν_11η_2 + ν_14η_4) }.        (1.31)

Then, the spectral radius of Γ, defined as ρ(Γ), is strictly less than 1, where η_1, η_2, η_3, and η_4 are arbitrary constants such that

    η_1 > 0,   η_2 > (ν_4η_1 + κ_3η_4)/(μ̄ n w_min),   η_3 > κ_4η_1,   η_4 > ν_9η_1/(1 − ρ).        (1.33)
Proof First, plugging Lemma 1.13 into Lemmas 1.9–1.12 and rearranging the acquired inequalities, it is immediate to verify (1.30). Next, we provide conditions under which the relation ρ(Γ) < 1 holds. According to Theorem 8.1.29 in [60], we know that, for a positive vector η = [η_1, ..., η_4]^T ∈ R^4, if Γη < η, then ρ(Γ) < 1 holds. By the definition of Γ, the inequality Γη < η is equivalent to

    (κ_1η_3 + ν_1η_3α̂)β̂ < η_1(1 − ρ) − (ν_1η_1 + ν_2η_2 + ν_3η_4)α̂,
    (2κ_2η_3 + ν_5η_3α̂)β̂ < η_2(1 − l_1) − (ν_4η_1 + κ_3η_4)α̂,
    (κ_5η_3 + ν_6η_3α̂)β̂ < η_3 − κ_4η_1 − (ν_6η_1 + ν_7η_2 + ν_8η_4)α̂,        (1.34)
    2ν_19β̂ < −ν_20 + √((ν_20)² + 4ν_19ν_21).

When 0 < α̂ < 1/(nL̄), it follows from Lemma 1.10 that l_1 = 1 − μ̄n(w^T α) ≤ 1 − μ̄ n w_min α̂. To ensure the positivity of β̂ (i.e., that the right-hand sides of (1.34) are always positive), (1.34) further implies that
    α̂ < η_1(1 − ρ)/(ν_1η_1 + ν_2η_2 + ν_3η_4),
    η_2 > (ν_4η_1 + κ_3η_4)/(μ̄ n w_min),
    α̂ < (η_3 − κ_4η_1)/(ν_6η_1 + ν_7η_2 + ν_8η_4),   η_3 > κ_4η_1,        (1.35)
    α̂ < (η_4(1 − ρ) − ν_9η_1)/(ν_10η_1 + ν_11η_2 + ν_14η_4),   η_4 > ν_9η_1/(1 − ρ).
Lemma 1.16 ([39]) Let {v^t}, {u^t}, {a^t}, and {b^t} be non-negative sequences such that, for all t ≥ 0,

    v^{t+1} ≤ (1 + a^t)v^t − u^t + b^t.

Also, let Σ_{t=0}^∞ a^t < ∞ and Σ_{t=0}^∞ b^t < ∞. Then, lim_{t→∞} v^t = v for some v ≥ 0, and Σ_{t=0}^∞ u^t < ∞.
By (1.30), we have

    ϕ^{t+1} ≤ Γϕ^t + P^t Q^t.        (1.36)

Recursively applying (1.36) yields

    ϕ^t ≤ (Γ)^t ϕ^0 + Σ_{k=0}^{t−1} (Γ)^{t−k−1} P^k Q^k.        (1.37)

Since the spectral radius of Γ is strictly less than 1, it can be concluded (see [52]) that ||(Γ)^t||_2 ≤ ϑ(δ_0)^t and ||(Γ)^{t−k−1}P^k||_2 ≤ ϑ(δ_0)^t for some ϑ > 0 and λ < δ_0 < 1. Taking the 2-norm on both sides of (1.37) yields

    ||ϕ^t||_2 ≤ ||(Γ)^t||_2 ||ϕ^0||_2 + Σ_{k=0}^{t−1} ||(Γ)^{t−k−1}P^k||_2 ||Q^k||_2
             ≤ ϑ||ϕ^0||_2 (δ_0)^t + ϑ(δ_0)^t Σ_{k=0}^{t−1} ||Q^k||_2.        (1.38)

It then follows that

    ||ϕ^t||_2 ≤ (ϑ||ϕ^0||_2 + (1 + d_1)(1 + β̂)L̂ϑ Σ_{k=0}^{t−1} ||ϕ^k||_2 + ϑt||∇F(1_n x*)||_2)(δ_0)^t.        (1.40)

Define v^t = Σ_{k=0}^{t−1} ||ϕ^k||_2, ν_22 = (1 + d_1)(1 + β̂)L̂ϑ, and p^t = ϑ||ϕ^0||_2 + ϑt||∇F(1_n x*)||_2; then (1.40) implies (1.41), which is equivalent to (1.42).
By applying Lemma 1.16 to (1.42), we achieve that v^t converges and thus is bounded. Following from (1.41), we obtain that lim_{t→∞} ||ϕ^t||_2/(δ_1)^t ≤ lim_{t→∞}(ν_22 v^t + p^t)(δ_0)^t/(δ_1)^t = 0 for all δ_0 < δ_1 < 1, and thus there exist a positive constant m and an arbitrarily small constant τ such that, for all t ≥ 0, ||ϕ^t||_2 ≤ m(δ_0 + τ)^t.
Under the step-size condition (1.31), each node can choose a relatively wider step-size. This is in contrast to the earlier work on non-uniform step-sizes within the framework of gradient tracking [33, 35, 43, 44], which depends on the heterogeneity of the step-sizes (||(I_n − W)α||_2/||Wα||_2, with W the weight matrix, in [35], and α̂/α̃, α̃ = min_{i∈V}{α_i}, in [33, 43, 44]). Besides, the analysis there showed that the algorithms in [33, 35, 43, 44] converge linearly to the optimal solution only if both the heterogeneity and the largest step-size are small. Moreover, the largest step-size obeys a bound that is a function of the heterogeneity, so there is a trade-off between the tolerable heterogeneity and the largest achievable step-size. Finally, the bounds on the non-uniform step-sizes in this chapter allow the existence of zero step-sizes among some (not all) of the nodes, provided that the largest step-size is positive and sufficiently small.
1.4.4 Discussion

The idea of D-DNGT can be applied to other directed distributed gradient tracking methods to relax the condition that the weight matrices be only column-stochastic [41, 42] or both row- and column-stochastic [45, 46]. Next, three possible Nesterov-like optimization algorithms are presented. In this chapter, we only highlight them and verify their feasibility by means of simulations; a rigorous theoretical analysis of the three possible algorithms is left for future work.
(a) D-DNGT with Only Column-Stochastic Weights [41, 42] Here, we present an extended algorithm, named D-DNGT-C, obtained by applying the momentum terms to ADD-OPT [41]/Push-DIGing [42] (whose weight matrices are only column-stochastic). Specifically, the updates of D-DNGT-C are stated as follows:

    x_i^{t+1} = Σ_{j=1}^n c_ij h_j^t + β_i(x_i^t − x_i^{t−1}) − α_i z_i^t,
    h_i^{t+1} = x_i^{t+1} + β_i(x_i^{t+1} − x_i^t),
    s_i^{t+1} = Σ_{j=1}^n c_ij s_j^t,   y_i^{t+1} = h_i^{t+1}/s_i^{t+1},        (1.44)
    z_i^{t+1} = Σ_{j=1}^n c_ij z_j^t + ∇f_i(y_i^{t+1}) − ∇f_i(y_i^t),

initialized with x_i^0 = h_i^0 = y_i^0 ∈ R, s_i^0 = 1, and z_i^0 = ∇f_i(y_i^0), where, as before, C = [c_ij] ∈ R^{n×n} is column-stochastic, and α_i > 0 and β_i ≥ 0 represent the local step-size and the momentum coefficient of node i. Unlike ADD-OPT [41]/Push-DIGing [42], D-DNGT-C, by means of column-stochastic weights, adds two types of momentum terms (heavy-ball momentum and Nesterov momentum) to ensure that nodes acquire more information from in-neighbors in the network and thereby achieve fast convergence.
(b) D-DNGT with Both Row- and Column-Stochastic Weights [45, 46] D-DNGT with both row- and column-stochastic weights does not need the eigenvector estimation in D-DNGT (1.6) or D-DNGT-C (1.44). Hence, an extended algorithm (named D-DNGT-RC), which utilizes both row-stochastic (R = [r_ij] ∈ R^{n×n}) and column-stochastic (C = [c_ij] ∈ R^{n×n}) weights, is presented as follows:

    x_i^{t+1} = Σ_{j=1}^n r_ij y_j^t + β_i(x_i^t − x_i^{t−1}) − α_i z_i^t,
    y_i^{t+1} = x_i^{t+1} + β_i(x_i^{t+1} − x_i^t),        (1.45)
    z_i^{t+1} = Σ_{j=1}^n c_ij z_j^t + ∇f_i(y_i^{t+1}) − ∇f_i(y_i^t),

where x_i^0 = y_i^0 ∈ R and z_i^0 = ∇f_i(y_i^0), and α_i > 0 and β_i ≥ 0 represent the local step-size and the momentum coefficient of node i. D-DNGT-RC not only avoids the additional iterations of eigenvector learning but also guarantees that nodes obtain more information from in-neighbors, which may yield faster convergence than [45] and [46].
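A sketch of D-DNGT-RC (1.45) follows, reusing R, C, grads, alpha, and beta from the earlier snippets (ours, hand-tuned); as the update shows, no eigenvector-learning iteration is required.

```python
# Sketch of D-DNGT-RC (1.45): row-stochastic mixing for the estimates,
# column-stochastic mixing for the gradient tracker.
X = Xold = Y = np.zeros((n, p))
Z = grads(Y)
for t in range(10000):
    Xnew = R @ Y + beta[:, None] * (X - Xold) - alpha[:, None] * Z
    Ynew = Xnew + beta[:, None] * (Xnew - X)
    Z = C @ Z + grads(Ynew) - grads(Y)
    Xold, X, Y = X, Xnew, Ynew
# For small enough step-sizes, the rows of X again approach the global
# minimizer (a push-pull-type structure).
```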
(c) D-DNGT-RC with Interaction Delays [49] Note that nodes may confront arbitrary but uniformly bounded interaction delays in the process of gaining information from in-neighbors [49]. Specifically, to solve problem (1.1), we denote by ς_ij^t an arbitrary, a priori unknown delay induced by the interaction link (j, i) at time t ≥ 0.^5 Then, the updates of D-DNGT-RC with delays (D-DNGT-RC-D) become

    x_i^{t+1} = Σ_{j=1}^n r_ij y_j^{t−ς_ij^t} + β_i(x_i^t − x_i^{t−1}) − α_i z_i^t,
    y_i^{t+1} = x_i^{t+1} + β_i(x_i^{t+1} − x_i^t),        (1.46)
    z_i^{t+1} = Σ_{j=1}^n c_ij z_j^{t−ς_ij^t} + ∇f_i(y_i^{t+1}) − ∇f_i(y_i^t).
^5 For all t > 0, the interaction delays ς_ij^t are assumed to be uniformly bounded; that is, there exists some finite ς̂ > 0 such that 0 ≤ ς_ij^t ≤ ς̂. In addition, each node can access its own estimate without delay, i.e., ς_ii^t = 0, ∀i ∈ V and t > 0.
1.5 Numerical Examples
We consider the distributed training of a binary classifier via logistic regression:

    min_{x∈R^p, v∈R} f(x, v) = Σ_{i=1}^n f_i(x, v),

where x ∈ R^p and v ∈ R are the optimization variables for learning the separating hyperplane. Here, the local cost function f_i is given by

    f_i(x, v) = (ω/2)(||x||_2² + v²) + Σ_{j=1}^{m_i} ln(1 + exp(−(c_ij^T x + v)b_ij)),
where each node i ∈ {1, ..., n} privately knows m_i training examples (c_ij, b_ij) ∈ R^p × {−1, +1}; c_ij is the p-dimensional feature vector of the j-th training sample at the i-th node, drawn from a Gaussian distribution with zero mean, and b_ij is the corresponding label, drawn from a Bernoulli distribution. In terms of parameter design, we choose n = 10, m_i = 10 for all i, and p = 2. The network topology, a directed and strongly connected network, is depicted in Fig. 1.1. In addition, we utilize a simple uniform weighting strategy, r_ij = 1/|N_i^in|, ∀i, to construct the row-stochastic weights.
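The following sketch (ours) generates synthetic data matching this experimental setup and evaluates the gradient of the local cost f_i, which is what each node feeds into D-DNGT; the regularization weight omega = 0.1 is a hypothetical choice, as the surviving text does not fix its value.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 10, 10, 2
omega = 0.1                                    # regularization weight (our choice)

feats = rng.standard_normal((n, m, p))         # c_ij: zero-mean Gaussian features
labels = rng.choice([-1.0, 1.0], size=(n, m))  # b_ij: labels on {-1, +1}

def grad_fi(i, x, v):
    """Gradient of f_i(x, v) = (omega/2)(||x||^2 + v^2)
    + sum_j ln(1 + exp(-(c_ij^T x + v) b_ij)) with respect to (x, v)."""
    margins = labels[i] * (feats[i] @ x + v)
    coef = -labels[i] / (1.0 + np.exp(margins))  # derivative of ln(1 + e^{-z})
    return omega * x + feats[i].T @ coef, omega * v + coef.sum()
```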
The simulation results are plotted in Figs. 1.2, 1.3, and 1.4. Figure 1.2 indicates that D-DNGT with momentum terms accelerates convergence in comparison with the applicable algorithms without momentum terms.
Fig. 1.2 Performance comparisons between D-DNGT and the methods without momentum terms (residual versus time steps)
Figure 1.3 shows that D-DNGT with two momentum terms (heavy-ball momentum [48] and Nesterov momentum [50, 54, 55]) improves convergence compared with the applicable algorithms with a single momentum term. We note that although the eigenvector learning embedded in D-DNGT may slow down convergence, D-DNGT is more suitable for broadcast-based protocols than the other optimization methods (AB, ADD-OPT/Push-DIGing, ABm, and ABN) because it only requires row-stochastic weights. Finally, it is concluded from Fig. 1.4 that the algorithms with momentum terms successfully promote convergence regardless of whether the interaction links undergo interaction delays or the weight matrices are only column-stochastic or both row- and column-stochastic.
Fig. 1.3 Performance comparisons between D-DNGT and the methods with momentum terms (residual versus time steps)
Fig. 1.4 Performance comparisons between the extensions of D-DNGT and their closely related methods (residual versus time steps)
1.6 Conclusion
References
1. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence
of communication delays. IEEE Trans. Syst., Man, Cybern., Syst. 47(5), 717–728 (2017)
2. J. Chen, A. Sayed, Diffusion adaptation strategies for distributed optimization and learning
over networks. IEEE Trans. Signal Process. 60(8), 4289–4305 (2012)
3. K. Li, Q. Liu, S. Yang, J. Cao, G. Lu, Cooperative optimization of dual multiagent system for
optimal resource allocation. IEEE Trans. Syst., Man, Cybern., Syst. 50(11), 4676–4687 (2020)
4. S. Wang, C. Li, Distributed robust optimization in networked system. IEEE Trans. Cybern.
47(8), 2321–2333 (2017)
5. X. Dong, G. Hu, Time-varying formation tracking for linear multi-agent systems with multiple
leaders. IEEE Trans. Autom. Control 62(7), 3658–3664 (2017)
6. X. Dong, G. Hu, Time-varying formation control for general linear multi-agent systems with
switching directed topologies. Automatica 73, 47–55 (2016)
7. C. Shi, G. Yang, Augmented Lagrange algorithms for distributed optimization over multi-agent
networks via edge-based method. Automatica 94, 55–62 (2018)
8. S. Zhu, C. Chen, W. Li, B. Yang, X. Guan, Distributed state estimation of sensor-network
systems subject to Markovian channel switching with application to a chemical process. IEEE
Trans. Syst. Man Cybern. Syst. 48(6), 864–874 (2018)
9. D. Jakovetic, A unification and generalization of exact distributed first order methods. IEEE
Trans. Signal Inform. Process. Over Netw. 5(1), 31–46 (2019)
10. Z. Wu, Z. Li, Z. Ding, Z. Li, Distributed continuous-time optimization with scalable adaptive
event-based mechanisms. IEEE Trans. Syst. Man Cybern. Syst. 50(9), 3252–3257 (2020)
11. K. Scaman, F. Bach, S. Bubeck, Y. Lee, L. Massoulie, Optimal algorithms for smooth and
strongly convex distributed optimization in networks, in Proceedings of the 34th International
Conference on Machine Learning (PMLR), vol. 70 (2017), pp. 3027–3036
12. X. He, T. Huang, J. Yu, C. Li, Y. Zhang, A continuous-time algorithm for distributed
optimization based on multiagent networks. IEEE Trans. Syst. Man Cybern. Syst. 49(12),
2700–2709 (2019)
13. Y. Zhu, W. Ren, W. Yu, G. Wen, Distributed resource allocation over directed graphs via
continuous-time algorithms. IEEE Trans. Syst. Man Cybern. Syst. 51(2), 1097–1106 (2021)
14. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
15. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent
networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
16. H. Li, S. Liu, Y. Soh, L. Xie, Event-triggered communication and data rate constraint for
distributed optimization of multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 48(11),
1908–1919 (2018)
17. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
18. I. Matei, J. Baras, Performance evaluation of the consensus-based distributed subgradient
method under random communication topologies. IEEE J. Sel. Topics Signal Process. 5(4),
754–771 (2011)
19. C. Xi, U. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans.
Autom. Control 62(8), 3986–3992 (2017)
20. D. Yuan, D. Ho, G. Jiang, An adaptive primal-dual subgradient algorithm for online distributed
constrained optimization. IEEE Trans. Cybern. 48(11), 3045–3055 (2018)
21. C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning,
IEEE Trans. Knowl. Data Eng. 30(8), 1440–1453 (2018)
22. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inform. Process. Over Netw. 4(1), 4–17 (2018)
23. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in
decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014)
24. J. Mota, J. Xavier, P. Aguiar, M. Puschel, D-ADMM: a communication-efficient distributed
algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013)
25. H. Terelius, U. Topcu, R. Murray, Decentralized multi-agent optimization via dual decomposi-
tion. IFAC Proc. Volumes 44(1), 11245–11251 (2011)
26. E. Wei, A. Ozdaglar, On the O(1/k) convergence of asynchronous distributed alternating
direction method of multipliers, in 2013 IEEE Global Conference on Signal and Information
Processing (2013). https://doi.org/10.1109/GlobalSIP.2013.6736937
27. M. Hong, T. Chang, Stochastic proximal gradient consensus over random networks. IEEE
Trans. Signal Process. 65(11), 2933–2948 (2017)
28. H. Xiao, Y. Yu, S. Devadas, On privacy-preserving decentralized optimization through
alternating direction method of multipliers (2019). Preprint arXiv:1902.06101
29. A. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in 2012 50th Annual
Allerton Conference on Communication, Control, and Computing (Allerton) (2012). https://
doi.org/10.1109/Allerton.2012.6483273
30. X. Dong, Y. Hua, Y. Zhou, Z. Ren, Y. Zhong, Theory and experiment on formation-containment
control of multiple multirotor unmanned aerial vehicle systems. IEEE Trans. Autom. Sci. Eng.
16(1), 229–240 (2019)
31. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
32. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans.
Control Netw. Syst. 5(3), 1245–1260 (2018)
33. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization
with uncoordinated step-sizes, in 2017 American Control Conference (ACC) (2017). https://
doi.org/10.23919/ACC.2017.7963560
34. M. Maros, J. Jalden, A geometrically converging dual method for distributed optimization over
time-varying graphs. IEEE Trans. Autom. Control 66(6), 2465–2479 (2021)
35. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over
stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018)
36. S. Pu, A. Nedic, Distributed stochastic gradient tracking methods. Math. Program. 187(1),
409–457 (2021)
37. Y. Tian, Y. Sun, B. Du, G. Scutari, ASY-SONATA: Achieving geometric convergence for
distributed asynchronous optimization, in 2018 56th Annual Allerton Conference on Communi-
cation, Control, and Computing (Allerton) (2018). https://doi.org/10.1109/ALLERTON.2018.
8636055
38. M. Maros, J. Jalden, Panda: A dual linearly converging method for distributed optimization
over time-varying undirected graphs, in 2018 IEEE Conference on Decision and Control
(CDC) (2018). https://doi.org/10.1109/CDC.2018.8619626
39. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
40. C. Xi, U. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans.
Autom. Control 62(10), 4980–4993 (2017)
41. C. Xi, R. Xin, U. Khan, ADD-OPT: accelerated distributed directed optimization. IEEE Trans.
Autom. Control 63(5), 1329–1339 (2018)
42. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
43. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
44. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Control Theory
Appl. 13(17), 2800–2810 (2019)
45. R. Xin, U. Khan, A linear algorithm for optimization over directed graphs with geometric
convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018)
46. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
47. F. Saadatniaki, R. Xin, U. Khan, Decentralized optimization over time-varying directed graphs
with row and column-stochastic matrices. IEEE Trans. Autom. Control 65(11), 4769–4780
(2020)
48. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
49. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under
time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
50. R. Xin, D. Jakovetic, U. Khan, Distributed Nesterov gradient methods over arbitrary graphs.
IEEE Signal Process. Lett. 26(8), 1247–1251 (2019)
51. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with
row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018)
52. R. Xin, C. Xi, U. Khan, FROST: fast row-stochastic optimization with uncoordinated step-
sizes. EURASIP J. Advanc. Signal Process. 2019(1), 1–14 (2019)
53. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a
general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–
248 (2019)
54. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control
65(6), 2566–2581 (2020)
55. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom.
Control 59(5), 1131–1146 (2014)
56. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained
optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst.
50(7), 2612–2622 (2020)
57. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer Science
& Business Media, Berlin, 2013)
58. H. Wang, X. Liao, T. Huang, C. Li, Cooperative distributed optimization in multiagent
networks with delays. IEEE Trans. Syst. Man Cybern. Syst. 45(2), 363–369 (2015)
59. A. Defazio, On the curved geometry of accelerated optimization (2018). Preprint
arXiv:1812.04634
60. R. Horn, C. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
61. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for
convex and non-convex optimization (2016). Preprint arXiv:1604.03257
Chapter 2
Projection Algorithms for Distributed
Stochastic Optimization
Abstract This chapter focuses on introducing and solving the problem of composite constrained convex optimization with a sum of smooth convex functions and non-smooth regularization terms (ℓ1 norm) subject to locally general constraints. Each of the smooth objective functions is further regarded as the average of several constituent functions, a structure motivated by modern large-scale information processing problems in machine learning (where the samples of a training dataset are randomly distributed across multiple computing nodes). We present a novel computation-efficient distributed stochastic gradient algorithm that leverages both the variance-reduction methodology and the distributed stochastic gradient projection method with constant step-size to solve the problem in a distributed manner. Theoretical analysis shows that the suggested algorithm can find the exact optimal solution in expectation when each constituent (smooth) function is strongly convex, provided that the constant step-size is less than an explicitly calculated upper bound. Compared with existing distributed methods, the suggested technique not only has a low computation cost in terms of the overall number of local gradient evaluations but is also suited to addressing general constrained optimization problems. Finally, numerical evidence is offered to show the suggested algorithm's attractive performance.
2.1 Introduction
Given the limited computational and storage capacity of nodes, it has become
unrealistic to deal with large-scale tasks centrally on a single compute node
[1]. Distributed optimization is a classic topic [2–9] yet has recently aroused
considerable interest in many emerging applications (large-scale tasks), such as
parameter estimation [3, 4], network attacks [5], machine learning [6], IoT networks
[7], and some others. At least two facts [8] have contributed to this resurgence
of interest: (a) recent developments in high-performance computing platforms
gradient tracking [38]. However, in practice, these methods converge slowly due to the large variance of the stochastic gradient and the adoption of a carefully tuned sequence of decaying step-sizes. To address this deficiency, various variance-reduction techniques have been leveraged in developing stochastic gradient descent methods, yielding representative centralized methods such as S2GD [39], SAG [40], SAGA [41], SVRG [42, 43], and SARAH [44]. The idea of the variance-reduction technique is to reduce the variance of the stochastic gradient and thereby substantially improve the convergence.
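As a single-node illustration of this idea, the following sketch (ours; a SAGA-style estimator in the spirit of [41], on hypothetical least-squares constituents) replaces the full gradient with a stochastic direction whose variance vanishes as the iterates converge.

```python
import numpy as np

rng = np.random.default_rng(3)
e, d = 100, 5                                # constituent functions, dimension
A = rng.standard_normal((e, d))
b = rng.standard_normal(e)

def grad_j(j, x):
    """Gradient of the j-th constituent f_j(x) = 0.5 * (a_j^T x - b_j)^2."""
    return A[j] * (A[j] @ x - b[j])

x = np.zeros(d)
table = np.stack([grad_j(j, x) for j in range(e)])  # last gradient per sample
avg = table.mean(axis=0)
step = 0.01
for t in range(5000):
    j = rng.integers(e)
    g = grad_j(j, x)
    x -= step * (g - table[j] + avg)         # unbiased, shrinking variance
    avg += (g - table[j]) / e                # keep the running average exact
    table[j] = g
```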
Motivated by the centralized variance reduced methods, the distributed variance
reduced methods have been extensively studied, which outperform their centralized
counterparts in handling large-scale tasks. Of relevance to our work are the recent
developments in [45] and [46]. The distributed stochastic averaging gradient method
(DSA) proposed in [45] incorporates the variance-reduction technique in SAGA
[41] to the algorithm design ideas of EXTRA [14], which not only obtains the
expected linear convergence of distributed stochastic optimization for the first
time but also performs better than the previous works [14, 35] in dealing with
machine learning problems. Similar works also involve DSBA [47], diffusion-AVRG [48], ADFS [49], SAL-Edge [50], GT-SAGA/GT-SVRG [2, 51, 52], and Network-DANE [8], which utilize various strategies. However, to the best knowledge of the authors, no existing method focuses on solving general composite constrained convex optimization problems. Recently, the distributed neurodynamic-based consensus algorithm proposed in [46] was developed to solve the problem of minimizing a sum of smooth convex functions and ℓ1 norms subject to locally general constraints (linear equality, convex inequality, and bound constraints), which generalizes the work in [53] to the case where the objective function and the constraint conditions are broader. In particular, based on Lyapunov stability theory, the method in [46] can achieve consensus at the global optimal solution with a constant step-size. The work in [46] is insightful, but unfortunately, the algorithm does not take into account the high computational cost of evaluating the full gradient of the local objective function at each iteration.
In this chapter, we are concerned with solving the composite constrained convex optimization problem with a sum of smooth convex functions and non-smooth regularization terms (ℓ1 norm), where the smooth objective functions are further composed of the average of several constituent functions and the locally general constraints are constituted by linear equality, convex inequality, and bound constraints. To this end, a computation-efficient distributed stochastic gradient algorithm is proposed, which is adaptable and facilitates real-world applications. In general, the novelties of the present work are summarized as follows:

(i) We propose and analyze a novel computation-efficient distributed stochastic gradient algorithm by leveraging the variance-reduction technique and the distributed stochastic gradient projection method with constant step-size. In contrast with most existing distributed methods [29–33, 45, 47–51, 53],
2.2 Preliminaries
2.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Let R, R^n, and R^{m×n} denote the set of real numbers, n-dimensional real column vectors, and m × n real matrices, respectively. The n × n identity matrix is denoted by I_n, and the column vectors of all ones and all zeros (of appropriate dimension) are denoted by 1 and 0, respectively. A quantity (possibly a vector) of node i is indexed by a subscript i; e.g., x_i^k is the estimate of node i at time k. We use χ_max(A) and χ_min(A) to represent the largest and smallest eigenvalues of a real symmetric matrix A, respectively. We let the symbols x^T and A^T denote the transposes of a vector x and a matrix A. The Euclidean norm (for vectors) and the ℓ1 norm are denoted by ||·|| and ||·||_1, respectively. We let ||x||_A = √(x^T A x), where the matrix A ∈ R^{n×n} is positive semi-definite. The Kronecker product and the Cartesian product are represented by the symbols ⊗ and ×, respectively. Given a random variable x, the probability and expectation are represented by P[x] and E[x], respectively. We utilize Z = diag{x} to represent the diagonal matrix of the vector x = [x_1, x_2, ..., x_n]^T, which satisfies z_ii = x_i, ∀i = 1, ..., n, and z_ij = 0, ∀i ≠ j. Denote (·)^+ = max{0, ·}.
For a set Ω ⊆ R^d, the projection of a vector x ∈ R^d onto Ω is denoted by
P_Ω(x), i.e., P_Ω(x) = arg min_{y∈Ω} ||y − x||². Notice that this projection always
exists and is unique if Ω is nonempty, closed, and convex [53]. Moreover, let Ω be a
nonempty closed convex set; then the projection operator P_Ω(·) has the following
properties: (a) (y − P_Ω(y))^T (P_Ω(y) − x) ≥ 0 for any x ∈ Ω and y ∈ R^d, and (b)
||P_Ω(y) − P_Ω(x)|| ≤ ||y − x|| for any x, y ∈ R^d.
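To make the two projection properties concrete, the following minimal Python sketch checks them numerically for a box set; the box Ω = [−1, 1]^d, the function names, and the random test points are illustrative assumptions, not part of the original development.

```python
import numpy as np

def proj_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d, one convenient choice of Omega."""
    return np.clip(x, lo, hi)

rng = np.random.default_rng(0)
d, lo, hi = 5, -1.0, 1.0
x = proj_box(rng.normal(size=d), lo, hi)   # a point inside Omega
y = 3.0 * rng.normal(size=d)               # an arbitrary point in R^d
py = proj_box(y, lo, hi)

# Property (a): (y - P(y))^T (P(y) - x) >= 0 for every x in Omega.
assert (y - py) @ (py - x) >= -1e-12
# Property (b): the projection is non-expansive.
assert np.linalg.norm(py - proj_box(x, lo, hi)) <= np.linalg.norm(y - x) + 1e-12
```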
For any given vector z ∈ R^d, denote φ(z) as another form of projection of the
vector z, which satisfies φ(z) = [ψ(z_1), ..., ψ(z_d)]^T ∈ R^d with elements such
that, for all i = 1, ..., d,

ψ(z_i) = 1 if z_i > 0;  ψ(z_i) ∈ [−1, 1] if z_i = 0;  ψ(z_i) = −1 if z_i < 0.
min_{x̂ ∈ ∩_{i=1}^n Ω_i}  J(x̂) = Σ_{i=1}^n ( f_i(x̂) + ||P_i x̂ − q_i||_1 ),
s.t.  B_i x̂ = c_i,  D_i x̂ ≤ s_i,  i = 1, ..., n,        (2.1)

where

f_i(x̂) = (1/e_i) Σ_{j=1}^{e_i} f_{i,j}(x̂),  i = 1, ..., n.        (2.2)
In addition, the main results of this chapter are based on the following assump-
tions.
Assumption 2.1 ([46]) The network G corresponding to the set of nodes is undi-
rected and connected.
Assumption 2.2 ([45]) Each local constituent function fi,j , i ∈ V, j ∈ {1, . . . , ei },
is ν-smooth and μ-strongly convex, where ν > μ > 0.
Remark 2.1 The formulated problem (2.1) with (2.2) is frequently found in
machine learning (such as modern large-scale information processing problems,
reinforcement learning problems, etc.) with large-scale training samples randomly
distributed across multiple computing nodes, which focus on collectively training
a model x̂ ∈ R^d utilizing the neighboring nodes' data. However, performing full
gradient evaluation becomes prohibitively expensive when the local data batch at a
node is large.
min_{x∈Ω}  J(x) = Σ_{i=1}^n f_i(x_i) + ||Px − q||_1,
s.t.  Bx = c,  Dx ≤ s,  Lx = 0,        (2.3)
where f_i(x_i) = (1/e_i) Σ_{j=1}^{e_i} f_{i,j}(x_i).
Remark 2.2 Note that the equality constraint Lx = 0 in (2.3) is equivalent to the
condition x_1 = x_2 = ... = x_n if the undirected network is connected. It is worth
highlighting that if x̂^* is the optimal solution to problem (2.1), then x^* = 1_n ⊗
x̂^* is the optimal solution to problem (2.3). According to this observation, the
main motivation of this chapter is to construct a computation-efficient algorithm to
search for the optimal solution to problem (2.3) over undirected and connected
networks.
We notice that under Assumption 2.2, problem (2.1) has a unique optimal
solution, denoted as x̂^*. Therefore, problem (2.3) also has a unique optimal solution
x^* = 1_n ⊗ x̂^* under Assumption 2.1. By utilizing the Lagrangian function, the
necessary and sufficient conditions for the optimality of problem (2.3) are given in
the following lemma, whose proof follows directly from [46, 53].
Lemma 2.3 ([46, 53]) Let Assumptions 2.1 and 2.2 hold, and let η > 0 be a given
scalar. Then, x^* is an optimal solution to (2.3) if and only if there exist α^* ∈ R^m,
β^* ∈ R^w, λ^* ∈ R^a, and γ^* ∈ R^{nd} such that (x^*, α^*, β^*, λ^*, γ^*) satisfies the
following relations:
⎧ x^* = P_Ω[ x^* − η( ∇f(x^*) + P^T α^* + B^T β^* + D^T λ^* + Lγ^* ) ]
⎪ α^* = φ(α^* + P x^* − q)
⎨ Bx^* = c,  Lx^* = 0                                                (2.4)
⎩ λ^* = (λ^* + D x^* − s)_+.
In this subsection, inspired by the algorithm design ideas of the methods in [41, 46,
53], we introduce the proposed computation-efficient distributed stochastic gradient
algorithm for solving problem (2.3). To motivate our algorithm design, observe that
existing distributed algorithms suffer from high computation costs when evaluating
the local full gradients in machine learning applications. This phenomenon inspires
us to explore a method that can significantly improve computational efficiency.
Therefore, the proposed algorithm leverages the variance-reduction technique of
SAGA [41] and the distributed stochastic gradient projection method with constant
step-size, which effectively alleviates the computational burden of local full
gradient evaluation.
The computation-efficient distributed stochastic gradient algorithm at each node
i is formally described in Algorithm 1. To locally implement the estimators of
Algorithm 1, each node i must own a gradient table that stores all local constituent
gradients ∇f_{i,j}(t_{i,j}), where t_{i,j} is the most recent estimate at which the
constituent gradient ∇f_{i,j} was evaluated. At each iteration k ≥ 0, each node i
uniformly at random selects one constituent function, indexed by χ_i^k ∈ {1, ..., e_i},
from its own local data batch and then generates the local stochastic gradient g_i^k
as in step 4 of Algorithm 1. After generating g_i^k, the entry ∇f_{i,χ_i^k}(t_{i,χ_i^k}^k)
is replaced by the new constituent gradient ∇f_{i,χ_i^k}(x_i^k), while the other entries
remain the same. Then, the projection step of the estimate x_i^k is implemented using
the local stochastic gradient g_i^k, and the updates of the other estimates, α_i^k,
β_i^k, λ_i^k, γ̃_i^k, are implemented subsequently.
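The gradient-table bookkeeping described above can be sketched as follows. This is a minimal Python illustration of the SAGA-style estimator in step 4 of Algorithm 1; the class name and interface are hypothetical, and the table average is maintained in O(1) work per iteration.

```python
import numpy as np

class SagaGradientTable:
    """Local gradient table sketching step 4 of Algorithm 1 at one node
    (class and method names are hypothetical, not from the original text)."""

    def __init__(self, grad_fns, x0):
        self.grad_fns = grad_fns                   # constituent gradients grad f_{i,j}
        self.table = [g(x0) for g in grad_fns]     # stored grad f_{i,j}(t_{i,j})
        self.avg = np.mean(self.table, axis=0)     # running average of the table

    def stochastic_gradient(self, x, rng):
        j = rng.integers(len(self.grad_fns))       # uniformly random index chi_i^k
        new = self.grad_fns[j](x)
        g = new - self.table[j] + self.avg         # variance-reduced gradient g_i^k
        self.avg += (new - self.table[j]) / len(self.grad_fns)  # O(1) refresh
        self.table[j] = new                        # replace the selected entry
        return g
```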
Define x^k = vec[x_1^k, ..., x_n^k], α^k = vec[α_1^k, ..., α_n^k], β^k = vec[β_1^k, ..., β_n^k],
λ^k = vec[λ_1^k, ..., λ_n^k], γ̃^k = vec[γ̃_1^k, ..., γ̃_n^k], and g^k = vec[g_1^k, ..., g_n^k].
Let γ̃^k = Lγ^k, where γ^k = vec[γ_1^k, ..., γ_n^k] and the initial condition satisfies
γ̃^0 = Lγ^0 [46, 53]. Then, we write Algorithm 1 in the following compact matrix form
for the convenience of analysis:
⎧ x^{k+1} = P_Ω[ x^k − η( g^k + P^T φ(α^k + P x^k − q) + L(γ^k + x^k)
⎪                       + B^T(β^k + B x^k − c) + D^T(λ^k + D x^k − s)_+ ) ]
⎪ α^{k+1} = φ(α^k + P x^{k+1} − q)
⎨ β^{k+1} = β^k + B x^{k+1} − c                                        (2.5)
⎪ λ^{k+1} = (λ^k + D x^{k+1} − s)_+
⎩ γ^{k+1} = γ^k + x^{k+1}.
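For concreteness, one iteration of the compact recursion (2.5) can be sketched in Python as below. The projection P_Ω is passed in as a generic `proj` callable, and the set-valued map φ is realized by the single-valued selection sign(·) (choosing 0 when the argument is 0); all names are illustrative assumptions, not part of the original text.

```python
import numpy as np

def phi(z):
    # single-valued selection of the set-valued map phi (picks 0 when z_i = 0)
    return np.sign(z)

def iterate_2_5(x, alpha, beta, lam, gamma, g, P, q, B, c, D, s, L, eta, proj):
    """One pass of the compact recursion (2.5); proj plays the role of P_Omega
    and g is the local stochastic gradient g^k. All names are illustrative."""
    direction = (g + P.T @ phi(alpha + P @ x - q) + L @ (gamma + x)
                 + B.T @ (beta + B @ x - c)
                 + D.T @ np.maximum(lam + D @ x - s, 0.0))
    x_new = proj(x - eta * direction)
    alpha_new = phi(alpha + P @ x_new - q)
    beta_new = beta + B @ x_new - c
    lam_new = np.maximum(lam + D @ x_new - s, 0.0)
    gamma_new = gamma + x_new
    return x_new, alpha_new, beta_new, lam_new, gamma_new
```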
We notice here that the randomness of Algorithm 1 rests with the set of independent
random variables {χ_i^k}_{i∈{1,...,n}}^{k≥0} used for calculating the local stochastic
gradients g_i^k. Based on this, we utilize F^k to denote the entire history of the
dynamic system constructed by {χ_i^k̃}_{i∈{1,...,n}}^{k̃≤k−1}. Therefore, from prior
results in [41, 45, 51], we know that the local stochastic gradient g_i^k calculated
in step 4 of Algorithm 1 is an unbiased estimator of the local batch gradient
∇f_i(x_i^k). Specifically, when F^k is given, we have

E[ g_i^k | F^k ] = ∇f_i(x_i^k).        (2.6)
The local stochastic gradient in step 4 of Algorithm 1 is generated as

g_i^k = ∇f_{i,χ_i^k}(x_i^k) − ∇f_{i,χ_i^k}(t_{i,χ_i^k}^k) + (1/e_i) Σ_{j=1}^{e_i} ∇f_{i,j}(t_{i,j}^k).
In this section, we first introduce several auxiliary results related to the
stochastic gradient g^k, k ≥ 0. Then, we design a Lyapunov function and derive
the upper bounds of two parts of the Lyapunov function to support the main results.
Subsequently, we provide the theoretical guarantees for the convergence behavior
of the computation-efficient distributed stochastic gradient algorithm described in
Algorithm 1 by using the Lyapunov method. Finally, for some special cases, we
propose a distributed stochastic proximal gradient algorithm by using the variance-
reduction technique and study its convergence rate.
r_i^k = (1/e_i) Σ_{j=1}^{e_i} [ f_{i,j}(t_{i,j}^k) − f_{i,j}(x̂^*) − ∇f_{i,j}(x̂^*)^T (t_{i,j}^k − x̂^*) ].        (2.7)
Lemma 2.6 provides an upper bound on E[ ||g^k − ∇f(x^*)||² ]. Notice that if the
sequence of iterates x^k tends to the optimal solution x^*, then the auxiliary
variables t_{i,j}^k approach x̂^*, which yields that r^k converges to zero. This
fact combined with the result in Lemma 2.6 indicates that the expected value of
the distance between the stochastic gradient g^k and the gradient ∇f(x^*) diminishes
when x^k tends to x^*.
V(x^k, α^k, β^k, λ^k, γ^k, r^k)
= V_1(x^k) + η[ V_2(α^k) + V_3(β^k) + V_4(λ^k) + V_5(γ^k) ] + b r^k,        (2.9)

and b is a positive constant which will be specified in the subsequent analysis. To
simplify notation, we denote V^k = V(x^k, α^k, β^k, λ^k, γ^k, r^k) for all k ≥ 0.
Then, we give two crucial lemmas that provide the upper bounds of two parts of the
Lyapunov function to support the main results.
The following lemma, which involves an expected upper bound for r^k, is essential
to the subsequent convergence analysis. The concrete proof can be found in [45, 47–51].
Lemma 2.7 ([45]) Consider the sequence {r^k} generated by Algorithm 1. Under
Assumptions 2.1 and 2.2, we have the following inequality: ∀k ≥ 0,

E[r^{k+1}] ≤ (1 − 1/ê) r^k + (1/ě) [ f(x^k) − f(x^*) − ∇f(x^*)^T (x^k − x^*) ],        (2.10)
E[ V_1(x^{k+1}) − V_1(x^k) ]
≤ (4ην/a) r^k + (4η(ν − μ)/a) [ f(x^k) − f(x^*) − ∇f(x^*)^T (x^k − x^*) ]
  + 2η E[ (x^{k+1} − x^k)^T (∇f(x^{k+1}) − ∇f(x^k)) ] − (1 − aη) E[ ||x^{k+1} − x^k||² ]
V_1(x^{k+1}) − V_1(x^k) = ||ψ_1^k − x^*||² − ||x^k − x^*||²
V_1(x^{k+1}) − V_1(x^k)
≤ −||x^{k+1} − x^k||² − 2η(x^{k+1} − x^*)^T P^T φ(α^k + P x^k − q)
  − 2η(x^{k+1} − x^*)^T L(γ^k + x^k) − 2η(x^{k+1} − x^*)^T D^T (λ^k + D x^k − s)_+
  − 2η(x^{k+1} − x^*)^T B^T (β^k + B x^k − c) − 2η(x^{k+1} − x^*)^T g^k.        (2.13)
(x^{k+1} − x^*)^T ∇f(x^k)
= (x^k − x^*)^T ∇f(x^k) + (x^{k+1} − x^k)^T ∇f(x^k)
≥ f(x^k) − f(x^*) + (x^{k+1} − x^k)^T (∇f(x^k) − ∇f(x^{k+1}))
  + (x^{k+1} − x^k)^T ∇f(x^{k+1})
≥ f(x^{k+1}) − f(x^*) + (x^{k+1} − x^k)^T (∇f(x^k) − ∇f(x^{k+1})).        (2.15)
V_1(x^{k+1}) − V_1(x^k)
≤ −(1 − aη)||x^{k+1} − x^k||² − 2η(x^k − x^*)^T (g^k − ∇f(x^k))
  + (η/a)||g^k − ∇f(x^k)||² − 2η( f(x^{k+1}) − f(x^*) )
f(x^{k+1}) − f(x^*) + (x^{k+1} − x^*)^T ( P^T α^* + B^T β^* + D^T λ^* + Lγ^* ) ≥ 0.        (2.18)
(x^{k+1} − x^*)^T P^T ( φ(α^k + P x^k − q) − α^* )

E[ ||g^k − ∇f(x^k)||² ]

E[ V_2(α^{k+1}) − V_2(α^k) ]
Proof Denote ψ_2^k = α^{k+1} = φ(υ_2^k), where υ_2^k = α^k + P x^{k+1} − q. From the
definition of V_2(α^k), one has
From the projection property, we obtain that (ψ_2^k − α^k − P x^{k+1} + q)^T (ψ_2^k − α^*) =
(φ(υ_2^k) − υ_2^k)^T (φ(υ_2^k) − α^*) ≤ 0. Combining this with (2.24) and then taking the
conditional expectation on F^k, we complete the proof of Lemma 2.9.
V_3(β^{k+1}) − V_3(β^k)
= ||β^{k+1} − β^*||² − ||β^k − β^*||²
= ||β^k − β^* + Bx^{k+1} − c||² − ||β^k − β^*||²
= (Bx^{k+1} − c)^T ( 2(β^k − β^*) + Bx^{k+1} − c )
= 2(Bx^{k+1} − c)^T (β^k − β^*) + ||Bx^{k+1} − Bx^k||²
  + 2(Bx^{k+1} − c + c − Bx^k)^T (Bx^k − c) + ||Bx^k − c||²
= 2(Bx^{k+1} − c)^T (β^k − β^* + Bx^k − c) + ||Bx^{k+1} − Bx^k||² − ||Bx^k − c||².        (2.26)
Proof Denote ψ_3^k = λ^{k+1} = (υ_3^k)_+, where υ_3^k = λ^k + D x^{k+1} − s. From the
definition of V_4(λ^k), one has
From the projection property, we obtain that (ψ_3^k − λ^k − D x^{k+1} + s)^T (ψ_3^k − λ^*) =
((υ_3^k)_+ − υ_3^k)^T ((υ_3^k)_+ − λ^*) ≤ 0. Combining this with (2.28) and then taking the
conditional expectation on F^k, we complete the proof of Lemma 2.11.
Lemma 2.12 Consider the sequences of iterates x^k, α^k, β^k, λ^k, and γ^k generated
by Algorithm 1. Under Assumptions 2.1 and 2.2, ∀k ≥ 0, the following inequality
holds for V_5(γ^k):
V_5(γ^{k+1}) − V_5(γ^k)
= (γ^{k+1} − γ^*)^T L(γ^{k+1} − γ^*) − (γ^k − γ^*)^T L(γ^k − γ^*)
= (γ^k − γ^* + x^{k+1})^T L(γ^k − γ^* + x^{k+1}) − (γ^k − γ^*)^T L(γ^k − γ^*)
= 2(x^{k+1})^T L(γ^k − γ^*) + (x^{k+1})^T L x^{k+1}
= 2(x^{k+1})^T L(γ^k − γ^*) + (x^{k+1} − x^k)^T L(x^{k+1} − x^k)
  + 2(x^{k+1})^T L x^k − (x^k)^T L x^k
= 2(x^{k+1})^T L(γ^k − γ^* + x^k) + (x^{k+1} − x^k)^T L(x^{k+1} − x^k) − (x^k)^T L x^k.        (2.30)
Then, taking conditional expectation on F k , we get the result of Lemma 2.12. This
completes the proof.
In Theorem 2.13, we will show that Algorithm 1 converges under an appropriate step-size
η and constant b by combining the results in Lemmas 2.7–2.12. Before presenting
Theorem 2.13, we first set the constant

b ∈ ( 4êνη/a , 2êηχ_min(B^T B)/ν − 4êνη(ν − μ)/(aν) )

and the tunable parameter

a ∈ ( (4êν²η + 4êνη(ν − μ)) / (2êηχ_min(B^T B)) , +∞ ).
Theorem 2.13 Suppose that Assumptions 2.1 and 2.2 hold. Consider the
computation-efficient distributed stochastic gradient algorithm described in
Algorithm 1.
where

Ξ_5 = (4ην/a) r^k + (4η(ν − μ)/a) ( f(x^k) − f(x^*) − ∇f(x^*)^T (x^k − x^*) )
      + η E[ ||x^{k+1} − x^k||²_L ] − η ||x^k||²_L.
Next, we derive the upper bound for each term on the right-hand side of inequality (2.31).
From the smoothness of the global objective function f(x), one obtains that
(x^{k+1} − x^k)^T (∇f(x^{k+1}) − ∇f(x^k)) ≤ ν||x^{k+1} − x^k||². Thus, the first term Ξ_1 is
bounded by
where inequality (2.32) follows from the equality ||Bx^k − c||² = ||Bx^k − Bx^*||² =
(x^k − x^*)^T B^T B (x^k − x^*) = ||x^k − x^*||²_{B^T B}. From the projection property, we
have that E[ (P x^{k+1} − q)^T (α^{k+1} − φ(α^k + P x^k − q)) ] ≤ E[ ||x^{k+1} − x^k||²_{P^T P} ] +
E[ (P x^k − q)^T (α^{k+1} − φ(α^k + P x^k − q)) ]. Therefore, the second term Ξ_2 is bounded
by
Similar to the procedure for deriving the bound of Ξ2 , we achieve the bound of Ξ3
as
Ξ_4 ≤ −(b/ê) r^k + (b/ě) ( f(x^k) − f(x^*) − ∇f(x^*)^T (x^k − x^*) )
      + η E[ ||x^{k+1} − x^k||²_{B^T B} ].        (2.36)
Substituting (2.32)–(2.37) into the right-hand side of (2.31) and rearranging
the obtained terms, one gets

E[ V^{k+1} − V^k ] ≤ −E[ ||x^{k+1} − x^k||²_{I_{nd} − η(aI_{nd} + 2νI_{nd} + B^T B + L + 2D^T D + 2P^T P)} ]
  − ||x^k − x^*||²_{ηB^T B − (νb/(2ě) + 2νη(ν−μ)/a) I_{nd}} − ( b/ê − 4ην/a ) r^k.        (2.38)
To ensure that the Lyapunov function V^k decreases monotonically over the entire
iteration horizon k, it is equivalent to prove that inequality (2.39) holds.
If each term on the right-hand side of (2.38) is non-negative, inequality (2.39)
holds. Therefore, the following conditions should be satisfied: (a) the matrices
I_{nd} − η(aI_{nd} + 2νI_{nd} + B^T B + L + 2D^T D + 2P^T P) and ηB^T B − (νb/(2ê) + 2νη(ν−μ)/a) I_{nd}
should be positive definite, and (b) b/ê − 4ην/a > 0. The results of Theorem 2.13
are then achieved.
Remark 2.14 Theorem 2.13 indicates that, even utilizing the stochastic gradient g^k,
Algorithm 1 is guaranteed to solve the composite constrained convex optimization
problem (2.1) provided that the conditions on a, b, and η are satisfied and the
assumptions on the objective functions and the communication network hold.
However, an explicit convergence rate is not established in Theorem 2.13 due to the
existence of the locally general constraints constituted by linear equality,
convex inequality, and bounded constraints. When dealing with a composite non-smooth
problem, although the global linear convergence of distributed proximal gradient
methods has been well proved in the recent work [33], there is still no work
analyzing the global linear convergence of primal–dual methods (similar to the
proposed algorithm). This issue deserves further study. We therefore draw support
from simulations to explore possible results.
2.4.4 Discussion
reformulated as follows:
min_x  J(x) = Σ_{i=1}^n ( f_i(x_i) + ||P̃ x_i − q̃||_1 ),  s.t.  W^{1/2} x = 0,        (2.40)
Note from the definition of the matrix W that W is symmetric and its singular values
are in [0, 1) (i.e., ρ_min = 0 < ρ ≤ ρ_max < 1, where ρ is the minimum nonzero
singular value of W).
−2(γ̂^k)^T W^{1/2} ( x̂^k − η(g^k − ∇f(x^*)) − W^{1/2} x̂^k ).        (2.44)
Combining (2.45) with (2.44) and noting that ||W^{1/2} γ̂^k||² ≥ ρ ||γ̂^k||², we have
≤ ( 1 − 2η νμ/(ν+μ) ) ||x̂^k||² + ( η²/(1 − ρ_max) ) ||g^k − ∇f(x^*)||²
  − ( 4ημ/(ν+μ) ) ( f(x^k) − f(x^*) − ∇f^T(x^*) x̂^k ) − 2η (x̂^k)^T (g^k − ∇f(x^k)),        (2.47)
and therefore
where 0 < θ < 1. From here, the proof is similar to that of Theorem 1 in [33], and
it suffices to obtain the desired results.
2.5 Numerical Examples
In this section, two numerical examples are provided to examine the convergence
and the practical behavior of the proposed algorithms. Notice that all the simulations
are carried out in MATLAB on an HP 288 Pro G4 MT Business PC with a 3.2 GHz
Intel(R) Core(TM) i7-8700 processor and 8 GB of memory. For the sake of comparison,
the optimal solutions to the following examples are obtained by the centralized
method with proper step-sizes run for a sufficiently long time.
First, the proposed algorithms are applied to solve a general distributed minimiza-
tion problem which is described as follows:
min_x̂  Σ_{i=1}^n ( (1/e_i) Σ_{j=1}^{e_i} ||C_{i,j} x̂ − b_{i,j}||² + ||P_i x̂ − q_i||_1 ),
(Figure: transient behavior of the estimates over the first 1000 iterations.)
(Figure: residual versus iteration for the proposed algorithm (2.5) and the method in [46].)
The comparison indicates that, compared with the method in [46], the proposed
algorithm (2.5) demands a smaller number of gradient evaluations, which largely
reduces the computational cost.
Second, we further verify the application behavior of the proposed algorithm with
numerical simulations on real datasets. We consider the distributed sparse logistic
regression problem using the breast cancer Wisconsin (diagnostic) dataset provided
(Figure: residual versus number of gradient evaluations (×10^4) for the proposed algorithm (2.5) and the method in [46].)
in the UCI Machine Learning Repository [56]. In the breast cancer Wisconsin
(diagnostic) dataset, we adopt N = 200 samples as training data, where each
training sample has dimension d = 9. All the features have been preprocessed
and normalized to unit vectors. For the network, we generate a randomly connected
network with n = 20 nodes utilizing an Erdos–Renyi model with connection probability
p = 0.4. The distributed sparse logistic regression problem can be formally
described as
min_{x̂∈R^d}  Σ_{i=1}^n f_i(x̂) + κ_1 ||x̂||_1,        (2.55)

f_i(x̂) = (1/e_i) Σ_{j=1}^{e_i} ln( 1 + exp(−b_{i,j} c_{i,j}^T x̂) ) + (κ_2/2)||x̂||_2²,
where b_{i,j} ∈ {−1, 1} and c_{i,j} ∈ R^d are local data kept by node i for j ∈
{1, ..., e_i}; the regularization term κ_1||x̂||_1 is applied to impose sparsity on
the optimal solution, and (κ_2/2)||x̂||_2² is added to avoid overfitting. In the
simulation, we assign data randomly to the local nodes, i.e., Σ_{i=1}^n e_i = N. We set
the regularization parameters κ_1 = 0.05 and κ_2 = 10, respectively.
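As a concrete illustration of the local computations in (2.55), the following Python sketch evaluates the gradient of f_i and the soft-thresholding operator associated with the κ_1||x̂||_1 term; the function names and array layout are assumptions made here for illustration only.

```python
import numpy as np

def local_logistic_grad(x, C, b, kappa2):
    """Gradient of f_i in (2.55): averaged logistic losses plus (kappa2/2)||x||^2.
    C stores one sample c_{i,j} per row; b holds the labels in {-1, 1}."""
    margins = -b * (C @ x)
    sig = 1.0 / (1.0 + np.exp(-margins))          # sigmoid of -b_j * c_j^T x
    return -(C * (b * sig)[:, None]).mean(axis=0) + kappa2 * x

def soft_threshold(v, tau):
    """Proximal mapping of tau*||.||_1, the operator induced by the kappa1 term."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```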
Then, we compare the proposed algorithms with the existing distributed methods,
including DL-ADMM [30], PG-EXTRA [31], NIDS [32], and P2D2 [33], that can deal
with the composite non-smooth optimization problem. When κ_1 = 0, we also
compare the proposed algorithms with the existing distributed methods, including
DSA [45] and GT-SAGA [51], that use the variance-reduction technique. The
simulation results are described as follows:
(1) Figure 2.4 shows that the proposed algorithms can achieve the same linear
convergence rate as the existing distributed methods [30–33] that can deal with
the composite non-smooth optimization problem under the real training set. Figure 2.5
indicates that, compared with the existing distributed methods [30–33] that do not
adopt the variance-reduction technique, the proposed algorithms demand a smaller
number of gradient evaluations, which is cheaper in terms of computational cost.
(Figures 2.4 and 2.5: residual versus iteration and versus number of gradient evaluations (×10^4) for the compared methods. Figures 2.6 and 2.7: the corresponding results for κ_1 = 0.)
(2) When κ_1 = 0, Figs. 2.6 and 2.7 tell us that the proposed algorithms show
performance similar to the existing distributed variance-reduced methods [45, 51].
2.6 Conclusion
This chapter has developed a computation-efficient distributed stochastic gradient
algorithm based on the variance-reduction technique, which highly reduces the expense
of full gradient evaluation. Through constructing an appropriate Lyapunov function,
we proved that the proposed algorithm converges in expectation to the optimal solution
with a suitably selected constant step-size. Furthermore, the privacy properties of
the proposed algorithm have also been explored via a differential privacy strategy.
Extensive numerical experiments have been conducted to verify the superior performance
of the proposed algorithm. However, some nontrivial issues still deserve further study.
For example, the convergence rate of the proposed algorithm for the composite
constrained optimization problem needs to be studied in-depth, and general non-
smooth terms as well as more complex networks still demand further consideration.
In the future, we will further investigate the convergence rate of the proposed
algorithm and extend the algorithm to be applicable to more complex directed
networks. The extensions of the current algorithm to general non-smooth terms and
the distributed non-convex stochastic optimization are also two promising research
directions.
References
1. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms
outperform centralized algorithms? A case study for decentralized parallel stochastic gradient
descent, in Advances in Neural Information Processing Systems (NIPS), vol. 30 (2017), pp. 1–
11
2. R. Xin, S. Kar, U.A. Khan, Decentralized stochastic optimization and machine learning: a
unified variance-reduction framework for robust performance and fast convergence. IEEE
Signal Process. Mag. 37(3), 102–113 (2020)
3. S. Khobahi, M. Soltanalian, F. Jiang, A.L. Swindlehurst, Optimized transmission for parameter
estimation in wireless sensor networks. IEEE Trans. Signal Inf. Proc. Netw. 6, 35–47 (2019)
4. A. Nedic, J. Liu, Distributed optimization for control. Ann. Rev. Control Robot. Auton. Syst.
1, 77–103 (2018)
5. J. Li, W. Abbas, X. Koutsoukos, Resilient distributed diffusion in networks with adversaries.
IEEE Trans. Signal Inf. Proc. Netw. 6, 1–17 (2019)
6. A. Nedic, Distributed gradient methods for convex machine learning problems in networks:
distributed optimization. IEEE Signal Process. Mag. 37(3), 92–101 (2020)
7. M. Rossi, M. Centenaro, A. Ba, S. Eleuch, T. Erseghe, M. Zorzi, Distributed learning
algorithms for optimal data routing in IoT networks. IEEE Trans. Signal Inf. Proc. Netw. 6,
175–195 (2020)
8. B. Li, S. Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks
with gradient tracking and variance reduction, in Proceedings of the Twenty Third International
Conference on Artificial Intelligence and Statistics (PMLR), vol. 108 (2020), pp. 1662–1672
9. H. Li, C. Huang, Z. Wang, G. Chen, H. Umar, Computation-efficient distributed algorithm
for convex optimization over time-varying networks with limited bandwidth communication.
IEEE Trans. Signal Inf. Proc. Netw. 6, 140–151 (2020)
10. T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, K. Johansson, A
survey of distributed optimization. Annu. Rev. Control 47, 278–305 (2019)
11. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
12. S. Ram, A. Nedic, V. Veeravalli, Distributed stochastic subgradient projection algorithms for
convex optimization. J. Optim. Theory Appl. 147, 516–545 (2010)
References 59
13. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: conver-
gence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
14. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
15. C. Xi, U. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans.
Autom. Control 62(10), 4980–4993 (2017)
16. M. Maros, J. Jalden, On the Q-linear convergence of distributed generalized ADMM under
non-strongly convex function components. IEEE Trans. Signal Inf. Proc. Netw. 5(3), 442–453
(2019)
17. C. Zhang, H. Gao, Y. Wang, Privacy-preserving decentralized optimization via decomposition
(2018). Preprint. arXiv:1808.09566
18. J. Chen, S. Liu, P. Chen, Zeroth-order diffusion adaptation over networks, in Proceedings of
the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(2018). https://doi.org/10.1109/ICASSP.2018.8461448
19. J. Xu, S. Zhu, Y.C. Soh, L. Xie, Augmented distributed gradient methods for multi-agent
optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual
Conference on Decision and Control (2015). https://doi.org/10.1109/CDC.2015.7402509
20. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
21. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
22. V.S. Mai, E.H. Abed, Distributed optimization over directed graphs with row stochasticity and
constraint regularity. Automatica 102(102), 94–104 (2019)
23. B. Huang, Y. Zou, Z. Meng, W. Ren, Distributed time-varying convex optimization for a class
of nonlinear multiagent systems. IEEE Trans. Autom. Control 65(2), 801–808 (2020)
24. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inf. Proc. Netw. 4(1), 4–17 (2018)
25. M. Hong, D. Hajinezhad, M. Zhao, Prox-PDA: the proximal primal-dual algorithm for fast
distributed nonconvex optimization and learning over networks, in Proceedings of the 34th
International Conference on Machine Learning (ICML), vol. 70 (2017), pp. 1529–1538
26. F. Hua, R. Nassif, C. Richard, H. Wang, A.H. Sayed, Online distributed learning over graphs
with multitask graph-filter models. IEEE Trans. Signal Inf. Proc. Netw. 6, 63–77 (2020)
27. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for
economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron.
64(6), 5095–5106 (2017)
28. T. Yang, D. Wu, H. Fang, W. Ren, H. Wang, Y. Hong, K. Johansson, Distributed energy
resource coordination over time-varying directed communication networks. IEEE Trans.
Control Netw. Syst. 6(3), 1124–1134 (2019)
29. A.I. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in Proceedings of the
50th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
(2012), https://doi.org/10.1109/Allerton.2012.6483273
30. T.-H. Chang, M. Hong, X. Wang, Multi-agent distributed optimization via inexact consensus
ADMM. IEEE Trans. Signal Process. 63(2), 482–497 (2015)
31. W. Shi, Q. Ling, G. Wu, W. Yin, A proximal gradient algorithm for decentralized composite
optimization. IEEE Trans. Signal Process. 63(22), 6013–6023 (2015)
32. Z. Li, W. Shi, M. Yan, A decentralized proximal-gradient method with network independent
step-sizes and separated convergence rates. IEEE Trans. Signal Process. 67(17), 4494–4506
(2019)
33. S. Alghunaim, K. Yuan, A.H. Sayed, A linearly convergent proximal gradient algorithm for
decentralized optimization, in Advances in Neural Information Processing Systems (NIPS),
vol. 32 (2019), pp. 1–11
34. K. Zheng, Z. Yang, K. Zhang, P. Chatzimisios, K. Yang, W. Xiang, Big data-driven optimiza-
tion for mobile networks toward 5G. IEEE Netw. 30(1), 44–51 (2016)
60 2 Projection Algorithms for Distributed Stochastic Optimization
35. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and conver-
gence to local minima (2020). Preprint. arXiv:2003.02818v1
36. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep
learning, in Proceedings of the 36th International Conference on Machine Learning (ICML)
(2019). https://doi.org/10.48550/arxiv.1811.10792
37. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
38. R. Xin, A. Sahu, U. Khan, S. Kar, Distributed stochastic optimization with gradient tracking
over strongly-connected networks, in Proceedings of the 2019 IEEE 58th Conference on
Decision and Control (CDC) (2019). https://doi.org/10.1109/CDC40024.2019.9029217
39. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the
proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)
40. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient.
Math. Program. 162(1), 83–112 (2017)
41. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support
for non-strongly convex composite objectives, in Advances in Neural Information Processing
Systems (NIPS), vol. 27 (2014), pp. 1–9
42. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance
reduction, in Advances in Neural Information Processing Systems (NIPS), vol. 26 (2013),
pp. 1–9
43. C. Tan, S. Ma, Y. Dai, Y. Qian, Barzilai-Borwein step size for stochastic average gradient, in
Advances in Neural Information Processing Systems, vol. 29 (2016), pp. 1–9
44. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning
problems using stochastic recursive gradient, in Proceedings of the 34th International Confer-
ence on Machine Learning (ICML) (2017), pp. 2613–2621
45. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm.
J. Mach. Learn. Res. 17(1), 2165–2199 (2016)
46. Y. Zhao, Q. Liu, A consensus algorithm based on collective neurodynamic system for
distributed optimization with linear and bound constraints. Math. Program. 122, 144–151
(2020)
47. Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, H. Qian, Towards more efficient stochastic decen-
tralized learning: faster convergence and sparse communication, in Proceedings of the 35th
International Conference on Machine Learning (PMLR), vol. 80 (2018), pp. 4624–4633
48. K. Yuan, B. Ying, J. Liu, A. Sayed, Variance-reduced stochastic learning by networked agents
under random reshuffling. IEEE Trans. Signal Process. 67(2), 1–11 (2019)
49. H. Hendrikx, F. Bach, L. Massoulie, An accelerated decentralized stochastic proximal algo-
rithm for finite sums, in Advances in Neural Information Processing Systems, vol. 32 (2019),
pp. 4624–4633
50. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE
Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020)
51. R. Xin, U. Khan, S. Kar, Variance-reduced decentralized stochastic optimization with acceler-
ated convergence. IEEE Trans. Signal Process. 68, 6255–6271 (2020)
52. R. Xin, A. Sahu, S. Kar, U.A. Khan, Distributed empirical risk minimization over directed
graphs, in Proceedings of the 53rd Asilomar Conference on Signals, Systems, and Computers
(2019). https://doi.org/10.1109/IEEECONF44664.2019.9049065
53. Q. Liu, S. Yang, Y. Hong, Constrained consensus algorithms with fixed step size for distributed
convex optimization over multi-agent networks, IEEE Trans. Autom. Control 62(8), 4259–
4265 (2017)
54. M. Bazaraa, H. Sherali, C. Shetty, Nonlinear Programming: Theory and Algorithms, 3rd edn.
(John Wiley & Sons, Hoboken, 2006)
55. S. Alghunaim, E.K. Ryu, K. Yuan, A.H. Sayed, Decentralized proximal gradient algorithms
with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021)
56. D. Dua, C. Graff, UCI machine learning repository, Dept. School Inf. Comput. Sci., Univ.
California, Irvine, CA, USA (2019)
Chapter 3
Proximal Algorithms for Distributed
Coupled Optimization
3.1 Introduction
Over the last decade, distributed optimization over networks has become a hotspot
of research along with the rapid development of network technologies, where nodes
focus on minimizing the sum of local functions (owned by each node) through local
communication [1, 2]. Traditional centralized approaches to solving optimization
problems usually require an entity to obtain essential information from all nodes,
which is costly, prone to a single point of failure, and lacks robustness to new
environments. Compared to centralized algorithms, distributed algorithms avoid
long-distance communication and have greater flexibility and scalability because
they have the ability to decompose large-scale problems into a series of smaller
problems [3, 4]. Considering this, distributed algorithms possess better robustness,
less communication, and good privacy protection in many applications [5–14],
including but not limited to machine learning [5, 6], online optimization [7, 8],
privacy masking [9, 10], resource allocation [11, 12], and data processing [13, 14].
Recently, researchers have made significant efforts to study distributed approaches
for successfully solving optimization problems [15–24]. Distributed approaches that
depend only on gradient information constitute the majority reported in the
literature because of their good performance and excellent scalability [1, 2]. There
are quite a few well-known approaches, such as distributed gradient descent (DGD)
[15, 16], distributed dual averaging (DDA) [17], EXTRA [18, 19], distributed
ADMM [20], distributed adaptive diffusion [21], and distributed gradient tracking
[22–24]. Building on these known approaches [15–24], many efficient distributed
approaches have been proposed for handling the factors that may exist in
the problem or for achieving desired targets, such as transmission delays [25],
complex constraints [26], computation efficiency [27], communication efficiency
[28], and privacy security [29]. Besides the aforementioned works on discrete-time
iteration, distributed continuous-time approaches have been well investigated
[30] as well, exhibiting flexible applications in continuous-time physical
systems and hardware implementations [31].
In addition to the aforementioned works handling problems with a single
objective, composite optimization problems with smooth plus non-smooth objectives
have sparked considerable interest in the distributed optimization community
due to their broad applications. Usually, the approaches that solve composite
optimization problems include the (fast) distributed proximal gradient [32], the
distributed linearized ADMM (DL-ADMM) [33], PG-ADMM [34], PG-EXTRA
[35], and NIDS [36]. From the perspective of convergence rate, the aforementioned
approaches only achieve sublinear convergence, and there is still a clear gap
compared with their centralized counterparts. Until recently, distributed linearly
convergent approaches have been investigated to successfully fill this gap [37].
In particular, the authors in [37] proposed a distributed proximal gradient algorithm
based on a general primal–dual algorithmic framework, which not only attained a
linear convergence rate but also unified many existing related approaches. Then,
based on the gradient tracking mechanism, the authors in [38, 39] introduced the
NEXT/SONATA algorithm with linear convergence. Subsequently, the authors in
[40] gave a unified distributed algorithmic framework that obtains similar results
on the basis of operator splitting theory. For the case where the non-smooth function
couples all nodes, the authors in [41, 42] first designed a distributed proximal
primal–dual algorithm from the novel perspective of transforming the saddle-point
problem and theoretically established its linear convergence. Under a much
weaker condition than the strong convexity assumed in [37–41], the authors in [43]
developed a distributed randomized block-coordinate proximal algorithm, which
achieves asymptotic linear convergence. However, the aforementioned approaches
are largely affected by the burden of computation.
3.2 Preliminaries
3.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors.
Let the symbol ||·|| denote the 2-norm. Given a vector x and a positive semi-definite
matrix W, we denote ||x||²_W = x^T W x. The Kronecker product is denoted as ⊗.
The vector that stacks x_1, ..., x_n on top of each other is indicated as col{x_i}_{i=1}^n. The
symbol blkdiag{X_i}_{i=1}^n denotes the block diagonal matrix that consists of diagonal
blocks {X_i}. The proximal operator of a function f(x): R^n → R at x is defined as
prox_{ηf}(x) = arg min_{y∈R^n} { f(y) + (1/(2η))||x − y||² }, where η > 0 is a parameter
(step-size). We denote the conjugate function of a function f at x ∈ R^n as f†(x) =
sup_{y∈R^n} { x^T y − f(y) }. The gradient of a function f at x ∈ R^n is denoted as ∇f(x).
The subdifferential ∂f(x) of a function f at x ∈ R^n is the set of all its subgradients.
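As a concrete illustration of the proximal operator and the conjugate, the following minimal Python sketch computes the prox of the ℓ_1 norm (soft-thresholding) and numerically checks Moreau's decomposition x = prox_f(x) + prox_{f†}(x) with unit parameter for f = ||·||_1, whose conjugate is the indicator of the unit ℓ∞-ball; the function names here are illustrative assumptions.

```python
import numpy as np

def prox_l1(x, eta):
    """prox_{eta*||.||_1}(x): the classical soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

def prox_l1_conjugate(x):
    """prox of f^dagger for f = ||.||_1: f^dagger is the indicator of the unit
    infinity-norm ball, so its prox is the projection onto [-1, 1]^d (and it is
    unaffected by the step-size, since scaling an indicator leaves it unchanged)."""
    return np.clip(x, -1.0, 1.0)

x = np.array([2.0, -0.3, 0.7])
# Moreau's decomposition with unit parameter: x = prox_f(x) + prox_{f^dagger}(x).
print(np.allclose(x, prox_l1(x, 1.0) + prox_l1_conjugate(x)))   # True
```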
where f_i: R^{q_i} → R is a convex function that we view as the private cost of node
i, g: R^p → R ∪ {+∞} is a convex, possibly non-smooth, global cost function known
by all nodes,1 and the matrix A_i ∈ R^{p×q_i} (full row rank) is a linear transform only
known by node i. Furthermore, each f_i is described by

f_i(x_i) = (1/m_i) Σ_{h=1}^{m_i} f_{i,h}(x_i),  i = 1, ..., n.
1 Here, it is also worth noticing that the non-smooth function g may be expressed as an indicator
function for inequality constraints or equality constraints. For example, in a distributed resource
allocation problem [60], this non-smooth term may be an indicator function of the equality
constraints. In a distributed ridge regression problem [42], this non-smooth term may be an
indicator function of the inequality constraints. In addition, the non-smooth function g may
represent the regularization term, see, e.g., [3].
Here, several quantities that will support the problem reformulation are defined
below:

x = col{x_i}_{i=1}^n ∈ R^q,  q = Σ_{i=1}^n q_i,  f(x) = Σ_{i=1}^n f_i(x_i),  A = [A_1, ..., A_n] ∈ R^{p×q}.
Moreover, we make the assumptions on the constituent functions fi,h and the global
cost function g below.
Assumption 3.1 (i) Each local constituent function f_{i,h}, i ∈ {1, ..., n}, h ∈
{1, ..., m_i}, is β-smooth and α-strongly convex. (ii) The function g: R^p → R ∪ {+∞}
is proper, lower semi-continuous, and convex. (iii) There is x ∈ R^q such that Ax
belongs to the relative interior of the domain of g.
Notice from Assumption 3.1 that the global cost function f : Rq → R is also
α -strongly convex and β -smooth, where 0 < α ≤ β , and problem (3.2) possesses a
unique optimal solution x ∗ = col{x1∗ , · · · , xn∗ } ∈ Rq , which achieves the minimum
of this problem.
Problem (3.1) is the sharing problem, where different individual variables possessed
by nodes are coupled through a function g . Notice that problems of form (3.1)
appear in many engineering applications [1], including smart grids, basis pursuit,
and resource allocation in wireless networks. They also appear in machine learning
applications [1], such as regression over distributed features. Here, we provide two
motivational physical applications that fit (3.1).
Example 1 A well-known example of problem (3.1) is the distributed resource
allocation problem. Inspired by Xu et al. [60], Scaman et al. [62], we set the
distributed resource allocation problem as follows:
min_{x_1,...,x_n}  C(x) = Σ_{i=1}^n C_i(x_i),  s.t.  Σ_{i=1}^n (x_i − r_i) = 0,        (3.3)
2 If g = 0 and A is the Laplacian matrix of the graph for instance (or a square root of the Laplacian
matrix), then problem (3.2) will become a general consensus optimization problem that can be
well solved by distributed primal–dual algorithms, such as [31, 40, 59, 61]. From this perspective,
problem (3.2) is more general.
which has a similar form to problem (3.1) with C_i(x_i) = (1/m_i) Σ_{h=1}^{m_i} C_{i,h}(x_i) and
A_i = 1 for all i. Here, we note that the above transformed problem is reasonable
because, in an actual resource allocation problem, we want to minimize the cost of
the entire network under the premise of coordinating the optimization variables of
each node, and the cost encountered by each node is composed of many different costs.
Example 2 Another notable example of problem (3.1) is the distributed logistic
regression problem, which possesses important applications in machine learning
[42, 53, 56–58] and can be described as follows:
min_{x_1,...,x_n}  Σ_{i=1}^n (1/m_i) Σ_{h=1}^{m_i} ln( 1 + exp(−b_{i,h} c_{i,h}^T x_i) ),  s.t.  x_i = x_j,
∀j ∈ N_i, where N_i is the set of neighbors of node i. Here, b_{i,h} ∈ {−1, 1} and
c_{i,h} ∈ R^{q_i}, h ∈ {1, ..., m_i}, are the local data kept by node i. Similar to
Example 1, the above problem can be transformed into problem (3.1) if we define the
indicator function g(·) such that

g(·) = 0 if x_1 = ··· = x_n,  and  g(·) = +∞ otherwise.
In this case, the matrix A = [A1 , · · · , An ] is sparse and encodes the communication
network between nodes. It is worth noticing that the matrix A in the problem (3.1) is
not necessarily sparse, and Ai is a private matrix only known by node i . Therefore,
the general distributed logistic regression problem can be transformed into a special
case of problem (3.1) if we utilize the above indicator function g .
Notice from Assumption 3.1 that strong duality holds (Corollary 31.2.1 in [63]).
Similar to the existing works [41], the saddle-point reformulation of problem (3.2)
yields (Proposition 19.18 in [64]):

−A^T y^* = ∇f(x^*),  Ax^* ∈ ∂g†(y^*),        (3.5)
where ∇f(x^*) = [∇f_1(x_1^*)^T, ..., ∇f_n(x_n^*)^T]^T. Notice that the dual variable y in
(3.4) is multiplied by A, which couples all nodes. Thus, algorithms solving problem
(3.4) directly in a distributed fashion cannot exist because the dual update needs to
be calculated by a central coordinator. Further reformulation is required to arrive at
a distributed solution. To this end, we in the following reformulate problem (3.4)
into another equivalent saddle-point problem that avoids the above dilemma.
First, let y_i be a local copy of y at node i, and the following quantities are needed:

ŷ = col{y_i}_{i=1}^n ∈ R^{pn},  G†(ŷ) = (1/n) Σ_{i=1}^n g†(y_i),  A_c = blkdiag{A_i}_{i=1}^n ∈ R^{pn×q},
Lŷ = 0 ⇐⇒ y1 = y2 = · · · = yn . (3.6)
where z = col{zi }ni=1 ∈ Rpn . Since the matrix Ac is block diagonal and the matrix
L encodes the network sparsity structure, it suffices to conclude that problem (3.7)
can be resolved in a distributed manner. Then, the optimality conditions of problem
(3.7) are given as follows [41]:
⎧ −A_c^T ŷ^* = ∇f(x^*)
⎨ Lŷ^* = 0                                        (3.8)
⎩ A_c x^* + Lz^* ∈ ∂G†(ŷ^*),
According to (3.8), the following lemma shows that problems (3.4) and (3.7)
share the same optimal solution in terms of x. Since the saddle-point reformulation
from (3.4) to (3.7) is the same as in [41], this result follows directly from [41],
and we state it here for completeness.
Lemma 3.1 (Adapted from Lemma 1 in [41]) If (x^*, z^*, ŷ^*) fulfills the optimality
condition (3.8), then ŷ^* = 1_n ⊗ y^* holds with (x^*, y^*) satisfying (3.5).
Lemma 3.1 indicates that if a designed algorithm for solving problem (3.7)
can achieve the optimal solution (x ∗ , z∗ , ŷ ∗ ), then the corresponding primal-dual
pair (x ∗ , y ∗ ) is the optimal solution to problem (3.4). That is to say, the designed
algorithm can solve problem (3.4) indirectly and problems (3.1), (3.4), and (3.7)
share the same optimal solution in terms of x . However, unlike problem (3.4),
problem (3.7) can be resolved in a distributed manner because the matrices Ac and
L encode the network sparsity structure. Therefore, based on Lemma 3.1, we have
the opportunity to design distributed algorithms to finally solve the problem (3.1).
Since the local function f_i(x_i) at each node i is the average of m_i local constituent
functions f_{i,h}(x_i), the implementations of most existing distributed primal–dual
algorithms [41, 43] require that each node i at time t ≥ 0 calculate the local full
gradient of f_i at x_i^t as

∇f_i(x_i^t) = (1/m_i) Σ_{h=1}^{m_i} ∇f_{i,h}(x_i^t),  i = 1, ..., n,        (3.9)
which may result in a high computation cost when the number of constituent functions
m_i is large. This issue motivates us to investigate an effective technique that
can improve computation efficiency significantly. Fortunately, the unbiased stochastic
average gradient (SAGA) can be substituted for the local full gradients to resolve
this issue. The idea is to keep a gradient list of all constituent functions, where
a randomly selected element is replaced each time, and the average value of the
elements in this list is applied for gradient approximation. Specifically, denote
χ_i^t ∈ {1, ..., m_i} as the function index of node i, which is uniformly and randomly
selected at time t. Then, let e_{i,h} be the auxiliary variable that stores the point
at which the constituent gradient of the function f_{i,h} was last evaluated.
Therefore, the recursive updates of the variables e_{i,h} are

e_{i,h}^{t+1} = x_i^t if h = χ_i^t;  e_{i,h}^{t+1} = e_{i,h}^t if h ≠ χ_i^t.
s_i^t = ∇f_{i,χ_i^t}(x_i^t) − ∇f_{i,χ_i^t}(e_{i,χ_i^t}^t) + (1/m_i) Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^t),        (3.10)
Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^t) = Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^{t−1}) + ∇f_{i,χ_i^{t−1}}(x_i^{t−1}) − ∇f_{i,χ_i^{t−1}}(e_{i,χ_i^{t−1}}^{t−1}),        (3.12)
the above cost can be avoided, and we can calculate Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^t) in a
computationally efficient way. In addition, we also point out that the O(m_i)-order
computational cost cannot be overcome in the existing methods [35–37, 40, 41, 43]
that use deterministic gradient information.
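The O(1) maintenance of the running gradient sum in (3.12) can be sketched as follows; the data structures and names are illustrative assumptions made here, not part of the original algorithm listing.

```python
import numpy as np

def refresh_gradient_sum(grad_sum, grad_table, grad_fns, chi_prev, x_prev):
    """O(1) maintenance of sum_h grad f_{i,h}(e_{i,h}) following (3.12): only the
    entry selected at the previous time changes. grad_table stores the constituent
    gradients; all names here are illustrative."""
    new_grad = grad_fns[chi_prev](x_prev)          # grad f_{i,chi}(x_i^{t-1})
    grad_sum = grad_sum + new_grad - grad_table[chi_prev]
    grad_table[chi_prev] = new_grad                # update the stored gradient
    return grad_sum, grad_table
```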
Inspired by the variance-reduction technique of SAGA [55, 56, 58] and the distributed
proximal primal–dual approach [41], we now propose a novel distributed stochastic
algorithm (VR-DPPD) to resolve problem (3.7), followed by its distributed
implementation.
Define the auxiliary variable w = col{w_i}_{i=1}^n ∈ R^{pn} and the stochastic gradient
s = col{s_i}_{i=1}^n ∈ R^q. Let x^0 and ŷ^0 be arbitrary, z^0 = 0_{pn}, and s^0 = ∇f(x^0).
Then, VR-DPPD is given by the updates in (3.13).
Remark 3.2 Requiring the matrix B to satisfy the condition in Assumption 3.2 is
necessary to ensure that all nodes converge to the same optimal variable, and this is not
difficult to construct over an undirected connected network, see, e.g., the lazy Metropolis
matrix designed in [23]. Moreover, the eigenvalues of Bc belong to (−1, 1]. Then, given
Bc , there exist plentiful choices for matrix L. For example, we can denote L2 = Ipn −Bc2
and verify the correctness of the condition 0 < Ipn −L2 . If it is true, then Assumption 3.2
can be satisfied. If not, we can let L2 = d(Ipn − Bc2 ) for any d ∈ (0, 1). Although there
are many choices for the matrices Bc and L, we only keep one choice for simplicity of
the following analysis and presentation.
By utilizing the design idea of the lazy Metropolis matrix (for more details,
please refer to [23]), a primitive symmetric doubly stochastic matrix B̃ = [b̃_ij] ∈
R^{n×n} is first constructed, which satisfies b̃_ij > 0 if nodes i and j are connected
through an edge of the undirected connected network and b̃_ij = 0 otherwise. Then,
let B = (I_n + B̃)/2 = [b_ij] ∈ R^{n×n}, which satisfies Assumption 3.2. Based on this,
we can set L² = I_{pn} − B_c² to satisfy the related conditions in Assumption 3.2.
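A minimal Python sketch of this construction is given below; the adjacency-matrix input and the function name are assumptions, and the code builds one admissible B via Metropolis weights followed by the lazy step.

```python
import numpy as np

def lazy_metropolis(adj):
    """Build B = (I_n + B~)/2 from a 0/1 adjacency matrix of an undirected
    connected graph, with B~ given by Metropolis weights; this is one admissible
    construction for Assumption 3.2 (the function name is illustrative)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    Bt = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j] and i != j:
                Bt[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        Bt[i, i] = 1.0 - Bt[i].sum()     # make each row (and column) sum to one
    return 0.5 * (np.eye(n) + Bt)        # lazy step: eigenvalues lie in (0, 1]

# Given B (and B_c = B kron I_p), one admissible L satisfies L^2 = I_pn - B_c^2.
```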
With the above choices of matrices in hand, we here show the distributed
implementation of (3.13). Specifically, it follows from the updates of w^t and z^t in
(3.13) that, for all t ≥ 1,
which eliminates the auxiliary variable z^t, and algorithm (3.13) can be rewritten as
follows:

⎧ x^{t+1} = x^t − η_x s^t − η_x A_c^T ŷ^t
⎪ w^{t+1} = (I_{pn} − L²) w^t + ŷ^t − ŷ^{t−1} + η_y A_c (x^{t+1} − x^t)
⎨ ϕ^{t+1} = B_c w^{t+1}
⎩ ŷ^{t+1} = prox_{η_y G†}(ϕ^{t+1}),
3 By using Moreau’s decomposition [64], the update of yit+1 can be executed in a convenient
fashion that avoids computing the proximal mapping of the conjugate function g † at each time.
2: for t = 0, 1, 2, ... do
3:   if t = 0 then
4:     Compute and store the sum of the local gradients Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^0), which actually equals ∇f_i(x_i^0).
5:   else
6:     Choose χ_i^t uniformly and randomly from {1, ..., m_i}.
7:     Compute and store the summation term Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^t) according to (3.12).
Remark 3.3 At present, the classical proximal primal–dual fixed-point method (PDFP)
[65] and later distributed methods (including the proximal exact dual diffusion
method (PED²) [41], the dual consensus proximal algorithm (DCPA) [42], and the
primal–dual hybrid gradient method (PDHG) [66]) have been proposed to solve the
minimization problem from the novel perspective of transforming (3.1) into a saddle-point
formulation. When dealing with large-scale tasks, the above methods [41, 66, 67]
may suffer from high computation costs. Compared with [41, 42, 66, 67], Algorithm 2
leverages the variance-reduction technique of SAGA [55, 58] in order to evaluate the
local full gradient in a more cost-efficient way. In addition, via Moreau's
decomposition [64], the update of y_i^{t+1} in Algorithm 2 does not need to compute the
conjugate function g† at each time, so it can be implemented in a convenient manner.
In this section, we show the convergence behavior of VR-DPPD (3.13). First, some
auxiliary results related to the stochastic gradient and the fixed-point of (3.13) are
provided for further convergence analysis.
v_i^t = (1/m_i) Σ_{h=1}^{m_i} ( f_{i,h}(e_{i,h}^t) − f_{i,h}(x_i^*) − ∇f_{i,h}(x_i^*)^T (e_{i,h}^t − x_i^*) ).

Here f_{i,h}(e_{i,h}^t) − f_{i,h}(x_i^*) − ∇f_{i,h}(x_i^*)^T (e_{i,h}^t − x_i^*), ∀t ≥ 0, is
non-negative by the strong convexity of the local constituent function f_{i,h}, and thus
v^t, ∀t ≥ 0, is also non-negative. To simplify notation, we denote E[·|F^t] = E_t, ∀t ≥ 1,
in the following analysis. Moreover, we define ∇f(x^t) = [∇f_1(x_1^t)^T, ..., ∇f_n(x_n^t)^T]^T.
Lemma 3.4 (Adapted from Lemma 6 in [53]) Consider the definition of v^t. Under
Assumption 3.1, the following recursive relation holds: ∀t ≥ 0,

E_t[v^{t+1}] ≤ (1/m̌) ( f(x^t) − f(x^*) − ∇f(x^*)^T (x^t − x^*) ) + (1 − 1/m̂) v^t,        (3.17)

where m̌ and m̂ are the smallest and largest numbers of local constituent functions
over the whole network, respectively, i.e., m̌ = min_{i∈{1,...,n}}{m_i} and m̂ =
max_{i∈{1,...,n}}{m_i}.
In addition, an upper bound for the mean-squared stochastic gradient variance
between the stochastic gradient s^t and the gradient ∇f(x^*), i.e.,
E_t[ ||s^t − ∇f(x^*)||² ], is given next; its proof can be found in [53].
Lemma 3.5 (Adapted from Lemma 4 in [53]) Under Assumption 3.1, the following
recursive relation holds: ∀t ≥ 0,
From Lemma 3.5, we can deduce that, for each node i = 1, ..., n, when x_i^t
approaches x_i^*, the auxiliary variables e_{i,h}^t, h ∈ {1, ..., m_i}, tend to x_i^*,
which indicates that the mean-squared stochastic gradient variance between the
stochastic gradient s^t and the gradient ∇f(x^*) vanishes. Subsequently, we continue
to show the existence and optimality of the fixed points of (3.13), which is taken
from [41].
First, we give a crucial lemma that plays an important role in supporting the
convergence results. Define the following error terms:

x̃^t = x^t − x^*,  ỹ^t = ŷ^t − ŷ^*,  w̃^t = w^t − w^*,  z̃^t = z^t − z^*.
Based on (3.20), we next establish a critical equality to support the main results.
Lemma 3.7 Suppose that ηx and ηy are strictly positive. Under Assumption 3.1, the
following recursive relation holds: ∀t ≥ 0,
where κ = ηx /ηy .
Proof First, it follows from the update of the error term x̃ t in (3.20) that
Then, it follows from the update of the error term w̃^t in (3.20) that
= κ||ỹ^t||² + κ||η_y A_c x̃^{t+1} + Lz̃^t||² + 2κη_y (ỹ^t)^T A_c x̃^{t+1} + 2κ(ỹ^t)^T Lz̃^t
= κ||ỹ^t||² + κ||η_y A_c x̃^{t+1} + Lz̃^t||² + 2η_x (ỹ^t)^T A_c x̃^{t+1} + 2κ(ỹ^t)^T Lz̃^t
= κ||ỹ^t||² + η_x η_y ||x̃^{t+1}||²_{A_c^T A_c} + κ||z̃^t||²_{L²} + 2η_x (z̃^t)^T L^T A_c x̃^{t+1}
  + 2η_x (ỹ^t)^T A_c x̃^{t+1} + 2κ(ỹ^t)^T Lz̃^t.        (3.23)
Similarly, one obtains from the update of the error term z̃^t in (3.20) that
Substituting the update of the error term x̃^t in (3.20) into (3.26) yields the result of
Lemma 3.7. The proof is completed.
Let ρmin (·), ρmax (·), and λmin (·) be the smallest non-zero singular value, the
largest singular value, and the smallest eigenvalue of its argument, respectively.
Notice from the condition (3.14) that 0 ≤ L2 < Ipn , which further implies that
0 < ρmin (L) < 1. Denote a tunable parameter τ = ηx τ1 , where τ1 < 4α m̌/(α + β)
is a constant. Then, we will deduce the convergence results of VR-DPPD (3.13) by
using the result in Lemma 3.7.
Theorem 3.8 Consider VR-DPPD (3.13) and let Assumption 3.1 hold. If the step-sizes
η_x and η_y satisfy

0 < η_x < min{ 2/(α+β),  τ_1/(4βm̂),  (1/(2(2β−α))) ( 4α/(α+β) − τ_1/m̌ ) },

0 < η_y < 2αβ / ( ρ²_max(A_c)(α+β) ),

and
||x̃^t − η_x(s^t − ∇f(x^*))||²
≤ ( 1 − 2η_x αβ/(α+β) ) ||x̃^t||² − ( 2η_x/(α+β) ) ||∇f(x^t) − ∇f(x^*)||² + η_x² ||s^t − ∇f(x^*)||²
  − 2η_x (x̃^t)^T (s^t − ∇f(x^t)),        (3.28)
||x̃^t − η_x(s^t − ∇f(x^*))||²
≤ ( 1 − 2η_x αβ/(α+β) ) ||x̃^t||² + η_x² ||s^t − ∇f(x^*)||² − 2η_x (x̃^t)^T (s^t − ∇f(x^t))
  − ( 4αη_x/(α+β) ) ( f(x^t) − f(x^*) − ∇f(x^*)^T (x^t − x^*) ).        (3.29)
( 1 − 2η_x αβ/(α+β) ) ||x̃^t||² ≤ [ (1 − 2η_x αβ/(α+β)) / (1 − η_x η_y ρ²_max(A_c)) ] ||x̃^t||²_{I_q − η_x η_y A_c^T A_c}.        (3.31)
In addition, since the proximal mapping is non-expansive, we have from the update of
the error term ỹ^t in (3.20) that
which follows from Assumption 3.2. Since each A_i has full row rank, we have 0 <
λ_min(A_c A_c^T) I_{pn} ≤ A_c A_c^T. Thus, it holds that
Moreover, since z^0 = 0 and z^* are in the range space of L, the error quantity z̃^t
always belongs to the range space of L, which indicates that ||z̃^t||²_{L²} ≥ ρ²_min(L) ||z̃^t||².
≤ [ (1 − 2η_x αβ/(α+β)) / (1 − η_x η_y ρ²_max(A_c)) ] ||x̃^t||²_{I_q − η_x η_y A_c^T A_c}
  − ( 4αη_x/(α+β) ) ( f(x^t) − f(x^*) − ∇f(x^*)^T (x^t − x^*) )
  + (1 − η_x η_y λ_min(A_c A_c^T)) κ ||ỹ^t||² + (1 − ρ²_min(L)) κ ||z̃^t||²
≤ [ (1 − 2η_x αβ/(α+β)) / (1 − η_x η_y ρ²_max(A_c)) ] ||x̃^t||²_{I_q − η_x η_y A_c^T A_c}
  − ( 4αη_x/(α+β) ) ( f(x^t) − f(x^*) − ∇f(x^*)^T (x^t − x^*) ) + η_x² E_t[ ||s^t − ∇f(x^*)||² ]
  + (1 − η_x η_y λ_min(A_c A_c^T)) κ ||ỹ^t||² + (1 − ρ²_min(L)) κ ||z̃^t||².        (3.36)
where ϑ^{t+1} = ||x̃^{t+1}||²_{I_q − η_x η_y A_c^T A_c} + κ||ỹ^{t+1}||² + κ||z̃^{t+1}||² + τ||v^{t+1}||² is defined to
simplify the symbolic expression, τ is a tunable parameter that is specified below, and

Γ_1 = (1 − 2η_x αβ/(α+β)) / (1 − η_x η_y ρ²_max(A_c)),  Γ_2 = 1 − η_x η_y λ_min(A_c A_c^T),  Γ_3 = 1 − ρ²_min(L),

Γ_4 = 1 − 1/m̂ + 4βη_x²/τ,  Γ_5 = 4αη_x/(α+β) − 2(2β−α)η_x² − τ/m̌.
Moreover, if

0 < η_y < 2αβ / ( ρ²_max(A_c)(α+β) ),  0 < η_x < τ_1/(4βm̂),        (3.40)

we obtain that Γ_1 < 1 and Γ_4 < 1. Under (3.28) and (3.40), it can be verified that
I_q − η_x η_y A_c^T A_c > 0 and Γ = max{Γ_1, Γ_2, Γ_3, Γ_4} < 1. In addition, it can also be
verified that

(1 − η_x η_y ρ²_max(A_c)) ||x̃^{t+1}||² ≤ ||x̃^{t+1}||²_{I_q − η_x η_y A_c^T A_c}.        (3.41)

Iterating the inequality E_t[ϑ^{t+1}] ≤ Γ̂ ϑ^t, which is deduced from (3.39), we get

E[ϑ^{t+1}] ≤ Γ̂^{t+1} ϑ^0,        (3.42)

which by means of (3.41) achieves the expected results. The proof is completed.
Remark 3.9 Theorem 3.8 is proved following the convergence analysis method of
[41]; it implies that VR-DPPD can ensure a linear convergence rate in solving
the problem with the coupled non-smooth function g under some conditions (on
η_x, η_y, and τ_1) and the assumptions on the objective functions. The main difference
in the convergence analysis compared to [41] is that we need to leverage the results
(Lemmas 3.4 and 3.5) associated with the stochastic gradient to establish the main
recurrence relation (3.37). Furthermore, compared with [41], VR-DPPD further enjoys
the appealing feature of computation efficiency thanks to the variance-reduction technique.
From Theorem 3.8, it is known that the constant that controls the convergence
rate can be simplified by selecting specific values for τ1 , ηx , and ηy . This uncovers
connections to the properties of the local objective functions and the network
topology. To make this clearer, we define the condition number of the local
constituent function as θf = β/α . Then, the following corollary illustrates that
the parameters related to the local objective functions and the network topology
determine the convergence rate of VR-DPPD.
Corollary 3.10 Consider the VR-DPPD as given in Algorithm 2 and suppose the conditions of Theorem 3.8 hold. Assume that the number of local constituent functions $f_{i,h}$ at each node is the same, i.e., $\hat{m} = \check{m} = m$. Setting the constants $\tau_1$, $\eta_x$, and $\eta_y$ as
$$\tau_1 = \frac{2\check{m}}{1+\theta_f}, \qquad \eta_x = \frac{1}{4\beta(1+\theta_f)}, \qquad \eta_y = \frac{\beta}{\rho_{\max}^2(A_c)(1+\theta_f)},$$
then the linear convergence constant $0 < \sigma < 1$ in Theorem 3.8 reduces to
$$\sigma = \max\left\{1 - \frac{1}{4(1+\theta_f)^2},\; 1 - \rho_{\min}^2(L),\; 1 - \frac{\lambda_{\min}(A_cA_c^T)}{4\rho_{\max}^2(A_c)(1+\theta_f)^2},\; 1 - \frac{1}{2m}\right\}.$$
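To make the corollary concrete, the following sketch evaluates the four candidate terms of $\sigma$ and returns their maximum. The numerical inputs ($\theta_f$, $\rho_{\max}(A_c)$, $\lambda_{\min}(A_cA_c^T)$, $\rho_{\min}(L)$, $m$) are hypothetical placeholders rather than values from this chapter's experiments.

```python
def rate_constant(theta_f, rho_max_Ac, lam_min_AAT, rho_min_L, m):
    """Evaluate the linear convergence constant of Corollary 3.10."""
    candidates = [
        1.0 - 1.0 / (4.0 * (1.0 + theta_f) ** 2),                           # from sigma_1
        1.0 - rho_min_L ** 2,                                               # from sigma_3
        1.0 - lam_min_AAT / (4.0 * rho_max_Ac ** 2 * (1.0 + theta_f) ** 2),  # from sigma_2
        1.0 - 1.0 / (2.0 * m),                                              # from sigma_4
    ]
    return max(candidates)

# Hypothetical inputs: condition number theta_f = 10, a well-connected network.
sigma = rate_constant(theta_f=10.0, rho_max_Ac=1.0,
                      lam_min_AAT=0.5, rho_min_L=0.6, m=100)
print(f"sigma = {sigma:.4f}")  # the closer to 1, the slower the linear rate
```

As the corollary suggests, a larger condition number $\theta_f$ or a sparser network (smaller $\lambda_{\min}(A_cA_c^T)$ or $\rho_{\min}(L)$) pushes $\sigma$ toward $1$ and thus slows convergence.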
Remark 3.13 The theoretical analysis in this chapter may not be applicable to the case where each local constituent function $f_{i,h}$ is generally convex. The main reasons for this include two aspects. On the one hand, when the cost function is generally convex, it is difficult for us to obtain a desired upper bound for $\mathbb{E}_t[\|s^t - \nabla f(x^*)\|^2]$ in Lemma 3.5, which is the core result supporting the convergence analysis. On the other hand, we cannot obtain the asymptotic convergence of $\mathbb{E}_{t-1}[\|\tilde{x}^t\|]$ but may only achieve the asymptotic convergence of $\mathbb{E}_{t-1}[\|f(x^t) - f(x^*)\|]$, which may make our convergence analysis inapplicable.

3.5 Numerical Examples
Consider the following distributed logistic regression problem:
$$\min_{x_1,\dots,x_n}\; \sum_{i=1}^n \frac{1}{m_i}\sum_{h=1}^{m_i} \ln\big(1 + \exp(-b_{i,h}c_{i,h}^Tx_i)\big), \tag{3.43}$$
subject to the global constraint $\sum_{i=1}^n x_i \le a$ (here, we assume that all variables $x_i$ have the same dimension, i.e., $q_i$ is a constant for all $i$), where the local objective function $f_i(x_i)$ is the average of $m_i$ constituent functions $f_{i,h}$, i.e., $f_i(x_i) = (1/m_i)\sum_{h=1}^{m_i} f_{i,h}(x_i)$ for all $i$, where
$$f_{i,h}(x_i) = \ln\big(1 + \exp(-b_{i,h}c_{i,h}^Tx_i)\big),$$
with $b_{i,h} \in \{-1, 1\}$ and $c_{i,h} \in \mathbb{R}^{q_i}$, $\forall h \in \{1, \dots, m_i\}$; $a \in \mathbb{R}^{q_i}$ is a vector with constant entries. Define $\sum_{i=1}^n m_i = m$.
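For concreteness, the constituent loss $f_{i,h}$ and its gradient, which VR-DPPD samples one at a time, can be coded as in the following sketch (NumPy; the data $b_{i,h}$, $c_{i,h}$ and the dimensions are placeholders).

```python
import numpy as np

def f_ih(x, b, c):
    """Constituent logistic loss f_{i,h}(x) = ln(1 + exp(-b * c^T x))."""
    return np.log1p(np.exp(-b * c.dot(x)))

def grad_f_ih(x, b, c):
    """Gradient of f_{i,h}: -b * c * sigmoid(-b * c^T x)."""
    z = -b * c.dot(x)
    return -b * c * (1.0 / (1.0 + np.exp(-z)))

# Quick check on random data (hypothetical dimension q = 5).
rng = np.random.default_rng(0)
x, c, b = rng.normal(size=5), rng.normal(size=5), 1.0
print(f_ih(x, b, c), grad_f_ih(x, b, c))
```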
Problem (3.43) is a distributed optimization problem with a coupled inequality constraint, which we cannot solve directly with the VR-DPPD proposed in this chapter. To successfully solve this problem, an effective method is to equivalently transform problem (3.43) into problem (3.1). To this aim, we resort to a transformation with the help of an indicator function, which is well applied in many works, such as [30, 42, 43]. In particular, to apply VR-DPPD to solve (3.43), we first define an indicator function $g(\cdot): \mathbb{R}^{q_i} \to \mathbb{R}\cup\{+\infty\}$ such that
$$g(\check{x}) = \begin{cases} 0, & \text{if } \check{x} \le a, \\ +\infty, & \text{otherwise}, \end{cases}$$
which is non-smooth in terms of $\check{x}$. Notice that $\check{x}$ represents the coupled term $\sum_{i=1}^n x_i$ here. Based on this, the above constrained problem can be further transformed into
$$\min_{x_1,\dots,x_n}\; \sum_{i=1}^n f_i(x_i) + g\Big(\sum_{i=1}^n x_i\Big), \tag{3.44}$$
which possesses a similar form to problem (3.1) with $A_i = I_{q_i}$. Then, we can utilize VR-DPPD to solve problem (3.44). Here, we assume that $g$ is a convex, possibly non-smooth, global cost function known by all nodes. Similar to [41], the following centralized linearized prox-ascent algorithm can be employed to deduce the optimal solution:
$$\begin{aligned}
x^{k+1} &= x^k - \eta_x\nabla f(x^k) - \eta_x A^Ty^k, \\
y^{k+1} &= \operatorname{prox}_{\eta_y g}\big(y^k + \eta_y Ax^{k+1}\big). \end{aligned} \tag{3.45}$$
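A minimal sketch of iteration (3.45) follows. It transcribes the printed update literally, assuming (as in the indicator-function transformation above) that the proximal mapping reduces to the entrywise projection $\min(\cdot, a)$; the callable grad_f and the matrix A are placeholders, so this is an illustration rather than the chapter's reference implementation.

```python
import numpy as np

def prox_indicator_leq(v, a):
    """Proximal mapping of the indicator of {v <= a}: the projection min(v, a)."""
    return np.minimum(v, a)

def prox_ascent(grad_f, A, a, eta_x, eta_y, x0, y0, iters=1000):
    """Centralized linearized prox-ascent iteration (3.45)."""
    x, y = x0.astype(float), y0.astype(float)
    for _ in range(iters):
        x = x - eta_x * grad_f(x) - eta_x * A.T @ y   # primal gradient step
        y = prox_indicator_leq(y + eta_y * A @ x, a)  # dual prox step
    return x, y
```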
In the following two examples, we leverage two real datasets, i.e., the breast cancer dataset (dataset 1) and the mushroom dataset (dataset 2), from the UCI Machine Learning Repository [68] to support the simulation experiments. Notice that the optimization problem (3.43) is not the same as the traditional logistic regression binary classification problem. Similar to [3, 41, 42], we only apply these two real datasets to verify the effectiveness of VR-DPPD, without performing related model training and testing. In addition, since the scales of the two datasets differ, the motivation for adopting both is clear: dataset 1 can be leveraged to test the performance of VR-DPPD, while dataset 2 can be utilized to validate the advantages of VR-DPPD over other related methods in handling large-scale data.
Fig. 3.1 (a) Random network with a connection probability p = 0.8. (b) Complete network. (c)
Cycle network. (d) Star network
Fig. 3.2 The transient behaviors of the second dimension of each primal variable xi
primal variable $x_i$ converges in the mean to the optimal solution, and the optimal solutions together satisfy the global inequality constraint in (3.43) by computation.
(ii) Comparison: in this simulation ($\eta_x = 0.05$, $\eta_y = 2$), we compare VR-DPPD with algorithm (3.45) and the related algorithm PED2 proposed in [41] to show the appealing features of VR-DPPD. The simulation results are shown in Fig. 3.3, where the x-axis is the number of iterations and the number of gradient evaluations, respectively. Figure 3.3a clearly shows that VR-DPPD converges linearly in this setup. In addition, from Fig. 3.3a, we can also find
Fig. 3.3 Comparisons between VR-DPPD and other algorithms. (a) The x-axis is the iterations.
(b) The x-axis is the number of gradient evaluations
Fig. 3.4 Residuals of VR-DPPD under different network topologies: cycle network, random networks ($p = 0.7$ and $p = 0.5$), complete network, and star network (the x-axis is the iterations)
that the performance of VR-DPPD does not degrade much even though it utilizes stochastic gradients with the variance-reduction technique; that is, it is only slightly slower than PED2 [41]. Figure 3.3b tells us that, compared with the centralized algorithm (3.45) and PED2 [41], VR-DPPD demands a smaller number of local gradient evaluations, which reduces the computation cost to a certain extent.
(iii) Impacts of network sparsity: in this simulation ($\eta_x = 0.05$, $\eta_y = 2$), we discuss the impacts of the network sparsity on the convergence of VR-DPPD. The simulation results are depicted in Fig. 3.4, which shows that the sparsity of the network has a certain degree of influence on the convergence rate of VR-DPPD (other parameters are fixed); that is, as the network becomes denser, the convergence of VR-DPPD becomes faster.
Fig. 3.5 Comparisons between VR-DPPD and other algorithms. (a) The x-axis is the iterations.
(b) The x-axis is the number of gradient evaluations
Remark 3.14 It is worth emphasizing that we can further verify the performance of VR-DPPD when the non-smooth coupling function $g$ is not an indicator function, e.g., $g = \|\sum_{i=1}^n A_ix_i\|_1$ in problem (3.44). Here, we notice that in practical applications the non-smooth function $g$ mostly takes the form of an indicator function [3, 42, 60]. As mentioned before, the studied problem (3.1) can recover the well-investigated consensus problem if we choose $g$ as the indicator function of the consensus constraint. Considering the practical needs (the indicator-function modeling of $g$ is more common and has wider applicability) and the limited length of the chapter (mainly showing the superiority of VR-DPPD in terms of computational efficiency), we do not conduct similar simulation experiments in this section when $g$ is not an indicator function.
3.6 Conclusion
References
1. T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, K. H. Johansson,
A survey of distributed optimization. Annu. Rev. Control 47, 278–305 (2019)
2. H. Li, C. Huang, Z. Wang, G. Chen, H. Umar, Computation-efficient distributed algorithm
for convex optimization over time-varying networks with limited bandwidth communication.
IEEE Trans. Signal Inf. Proc. Netw. 6, 140–151 (2020)
3. S. Alghunaim, K. Yuan, A.H. Sayed, A proximal diffusion strategy for multiagent optimization
with sparse affine constraints. IEEE Trans. Autom. Control 65(11), 4554–4567 (2020)
4. J. Li, W. Abbas, X. Koutsoukos, Resilient distributed diffusion in networks with adversaries.
IEEE Trans. Signal Inf. Proc. Netw. 6, 1–17 (2019)
5. A. Nedic, Distributed gradient methods for convex machine learning problems in networks:
distributed optimization. IEEE Signal Process. Mag. 37(3), 92–101 (2020)
6. Z. Yang, W.U. Bajwa, ByRDiE: byzantine-resilient distributed coordinate descent for decen-
tralized learning. IEEE Trans. Signal Inf. Proc. Netw. 5(4), 611–627 (2019)
7. F. Hua, R. Nassif, C. Richard, H. Wang, A.H. Sayed, Online distributed learning over graphs
with multitask graph-filter models. IEEE Trans. Signal Inf. Proc. Netw. 6, 63–77 (2020)
8. D. Yuan, A. Proutiere, G. Shi, Distributed online linear regressions. IEEE Trans. Inf. Theory
67(1), 616–639 (2021)
9. Q. Lü, X. Liao, T. Xiang, H. Li, T. Huang, Privacy masking stochastic subgradient-push
algorithm for distributed online optimization. IEEE Trans. Cybern. 51(6), 3224–3237 (2021)
10. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inf. Proc. Netw. 4, 4–17 (2018)
11. B. Huang, L. Liu, H. Zhang, Y. Li, Q. Sun, Distributed optimal economic dispatch for
microgrids considering communication delays. IEEE Trans. Syst. Man Cybern. Syst. Hum.
49(8), 1634–1642 (2019)
12. L. Liu, G. Yang, Distributed optimal economic environmental dispatch for microgrids over
time-varying directed communication graph. IEEE Trans. Netw. Sci. Eng. 8(2), 1913–1924
(2021)
13. M. Nokleby, H. Raja, W.U. Bajwa, Scaling-up distributed processing of data streams for
machine learning. Proc. IEEE 108(11), 1984–2012 (2020)
14. A. Gang, B. Xiang, W.U. Bajwa, Distributed principal subspace analysis for partitioned big
data: algorithms, analysis, and implementation. IEEE Trans. Signal Inf. Proc. Netw. 7, 699–
715 (2021)
15. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
16. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
17. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: conver-
gence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
18. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
19. C. Xi, U.A. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE
Trans. Autom. Control 62(10), 4980–4993 (2017)
20. M. Maros, J. Jalden, On the Q-linear convergence of distributed generalized ADMM under
non-strongly convex function components. IEEE Trans. Signal Inf. Proc. Netw. 5(3), 442–453
(2019)
21. J. Chen, S. Liu, P. Chen, Zeroth-order diffusion adaptation over networks, in Proceedings of
the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(2018). https://doi.org/10.1109/ICASSP.2018.8461448
22. J. Xu, S. Zhu, Y.C. Soh, L. Xie, Augmented distributed gradient methods for multi-agent
optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual
Conference on Decision and Control (2015). https://doi.org/10.1109/CDC.2015.7402509
23. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
24. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
25. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for
economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron.
64(6), 5095–5106 (2017)
26. H. Li, Q. Lü, G. Chen, T. Huang, Z. Dong, Distributed constrained optimization over
unbalanced directed networks using asynchronous broadcast-based algorithm. IEEE Trans.
Autom. Control 66(3), 1102–1115 (2021)
27. B. Li, S. Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks
with gradient tracking and variance reduction, in Proceedings of the 23rd International
Conference on Artificial Intelligence and Statistics (AISTATS) (2020), pp. 1662–1672
28. K. Yuan, B. Ying, J. Liu, A.H. Sayed, Variance-reduced stochastic learning by networked
agents under random reshuffling. IEEE Trans. Signal Process. 67(2), 351–366 (2019)
29. T. Ding, S. Zhu, J. He, C. Chen, X. Guan, Differentially private distributed optimization via
state and direction perturbation in multi-agent systems. IEEE Trans. Autom. Control 67(2),
722–737 (2022)
30. Y. Zhu, G. Wen, W. Yu, X. Yu, Continuous-time distributed proximal gradient algorithms for
nonsmooth resource allocation over general digraphs. IEEE Trans. Netw. Sci. Eng. 8(2), 1733–
1744 (2021)
31. Y. Zhu, W. Yu, G. Wen, W. Ren, Continuous-time coordination algorithm for distributed convex
optimization over weight-unbalanced directed networks. IEEE Trans. Circuits Syst. Express
Briefs 66(7), 1202–1206 (2019)
32. A.I. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in Proceedings of the
50th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
(2012). https://doi.org/10.1109/Allerton.2012.6483273
33. T.-H. Chang, M. Hong, X. Wang, Multi-agent distributed optimization via inexact consensus
ADMM. IEEE Trans. Signal Process. 63(2), 482–497 (2015)
34. N.S. Aybat, Z. Wang, T. Lin, S. Ma, Distributed linearized alternating direction method of
multipliers for composite convex consensus optimization. IEEE Trans. Autom. Control 63(1),
5–20 (2018)
35. W. Shi, Q. Ling, G. Wu, W. Yin, A proximal gradient algorithm for decentralized composite
optimization. IEEE Trans. Signal Process. 63(22), 6013–6023 (2015)
36. Z. Li, W. Shi, M. Yan, A decentralized proximal-gradient method with network independent
step-sizes and separated convergence rates. IEEE Trans. Signal Process. 67(17), 4494–4506
(2019)
37. S. Alghunaim, K. Yuan, A.H. Sayed, A linearly convergent proximal gradient algorithm for
decentralized optimization, in Advances in Neural Information Processing Systems (NIPS),
vol. 32 (2019), pp. 1–11
38. P. Di Lorenzo, G. Scutari, NEXT: in-network nonconvex optimization. IEEE Trans. Signal Inf.
Proc. Netw. 2(2), 120–136 (2016)
39. G. Scutari, Y. Sun, Distributed nonconvex constrained optimization over time-varying
digraphs. Math. Program. 176(1), 497–544 (2019)
40. J. Xu, Y. Tian, Y. Sun, G. Scutari, Distributed algorithms for composite optimization: Unified
framework and convergence analysis. IEEE Trans. Signal Process. 69, 3555–3570 (2021)
41. S. Alghunaim, K. Yuan, A.H. Sayed, A multi-agent primal-dual strategy for composite opti-
mization over distributed features, in Proceedings of the 2020 28th European Signal Processing
Conference (EUSIPCO) (2020). https://doi.org/10.23919/Eusipco47968.2020.9287370
42. S. Alghunaim, Q. Lyu, K. Yuan, A.H. Sayed, Dual consensus proximal algorithm for multi-
agent sharing problems. IEEE Trans. Signal Process. 69, 5568–5579 (2021)
43. P. Latafat, N.M. Freris, P. Patrinos, A new randomized block-coordinate primal-dual proximal
algorithm for distributed optimization. IEEE Trans. Autom. Control 64(10), 4050–4065 (2019)
44. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and conver-
gence to local minima (2020). Preprint. arXiv:2003.02818v1
45. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep
learning, in Proceedings of the 36th International Conference on Machine Learning (ICML)
(2019), pp. 344–353
46. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
47. R. Xin, A. Sahu, U. Khan, S. Kar, Distributed stochastic optimization with gradient tracking
over strongly-connected networks, in Proceedings of the 2019 IEEE 58th Conference on
Decision and Control (CDC) (2019). https://doi.org/10.1109/CDC40024.2019.9029217
48. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the
proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)
49. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient.
Math. Program. 162(1), 83–112 (2017)
50. A. Defazio, F. Bach, S. Lacoste-Julien, Saga: a fast incremental gradient method with support
for non-strongly convex composite objectives, in Advances in Neural Information Processing
Systems (NIPS), vol. 27 (2014), pp. 1–9
51. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance
reduction, in Advances in Neural Information Processing Systems (NIPS) (2013), pp. 315–323
52. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning
problems using stochastic recursive gradient, in Proceedings of the 34th International Confer-
ence on Machine Learning (ICML) (2017), pp. 2613–2621
53. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm.
J. Mach. Learn. Res. 17(1), 2165–2199 (2016)
Chapter 4
Event-Triggered Algorithms for Distributed Convex Optimization

4.1 Introduction
fields of intelligent robots [13], satellite formation flying [14], sensor networks [15–20], cloud computing [21–24], distributed measurement and monitoring [25–27], congestion control in communication networks [28], and so on [29–33]. Unlike traditional control problems, a distinct feature of multi-node coordination control is that each control strategy relies only on local information exchanged among neighbors, and the desired overall goal is achieved through this information exchange. Multi-node coordinated control encompasses problems such as consensus [3], formation control [1], and distributed filtering [34]. Among these problems, the consensus problem of multi-node systems is the core research branch because it is the basis of multi-node collaboration and provides theoretical support for the other problems.
Event-triggered control strategies can effectively overcome the drawbacks of traditional processing methods, and this interesting area has aroused considerable attention; see [35–46]. Tabuada [35] applied the event-triggered control method to control systems and provided event-triggered control strategies, which not only guaranteed the asymptotic stability of the closed-loop system but also demonstrated that the event-triggered time sequence excludes Zeno-like behavior. Building on the work of Tabuada [35], Dimarogonas et al. [37] primarily studied the consensus of first-order multi-node systems, in which each individual controller was updated based on the ratio of a certain measurement error to a threshold function of the latest local states. Meanwhile, the event-triggered control approach developed rapidly in consensus analysis for continuous-time multi-node systems. Seyboth et al. [38] studied a variant of the event-triggered average consensus problem for single-integrator and double-integrator dynamics, in which a novel control strategy for multi-node coordination was employed to simplify the performance and convergence analysis of the method. For multi-node systems, the distributed rendezvous problem for single integrators with combinational measurements was studied by Fan et al. [26]. However, it required the Laplacian matrix of the associated communication topology to be symmetric. Furthermore, Li et al. [39] considered the event-triggered distributed average consensus of discrete-time first-order multi-node systems with a limited communication data rate. In that framework, although each node has a real-valued state, it can only communicate finite-bit binary symbolic data sequences with its neighboring nodes at each time step on account of digital communication channels with energy constraints. On the basis of a novel event-triggered dynamic encoder and decoder designed for each node, a distributed control algorithm was proposed.
The recent work of Zhu et al. [40] focused on the analysis of the event-triggered consensus problem for general linear time-invariant systems with fixed topology by applying an integral inequality technique, and a novel improved algorithm was introduced to determine the event time sequences, which can reduce not only the communication between neighboring nodes but also the control updates. From an implementation point of view, event-triggered control strategies for multi-node systems with discrete-time dynamics remain a very important issue. Event-triggered and self-triggered control for discrete-time average consensus problems has also been investigated [27].
(ii) A distributed event-triggered control scheme for each node is designed, which analytically decides the next sampling time instant utilizing the exchanged local information. Based on the event-triggered control scheme, a distributed control algorithm is presented, which only employs the nodes' local information at their latest sampling time instants.
(iii) We also show the convergence of the algorithm and prove that the event-triggered distributed subgradient algorithm is able to make all nodes asymptotically converge to an optimal solution.

4.2 Preliminaries
4.2.1 Notation
4.2 Preliminaries
4.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Denote $\mathbb{R}^{N\times N}$, $\mathbb{R}^N$, and $\mathbb{R}$ as the set of $N\times N$ real matrices, the set of $N$-dimensional real column vectors, and the set of real numbers, respectively. The $N$-dimensional all-ones vector and the $N\times N$ identity matrix are denoted as $1_N$ and $I_N$, respectively. Given a vector or a matrix $W$, we denote $W^T$ and $W^{-1}$ as its transpose and inverse, respectively. We denote $\lambda(A)$ as the set of all eigenvalues of a matrix $A$. Given a vector $x \in \mathbb{R}^n$, the standard Euclidean norm is represented as $\|x\|$. The subgradient of $f(x)$ is represented as $\nabla f(x): \mathbb{R}^n \to \mathbb{R}^n$. Given a matrix $W$, we write $W_{ij}$ or $[W]_{ij}$ to denote its $(i,j)$th entry. The symbol $\otimes$ denotes the Kronecker product.
$$\min_{x\in\mathbb{R}^n} f(x), \qquad f(x) = \sum_{i=1}^N f_i(x), \tag{4.1}$$
4.3 Algorithm Development

In this section, inspired by Lou et al. [49] and Zhu and Martínez [50], we provide a novel distributed subgradient algorithm to handle the optimization problem (4.1). Consider that node $i$ can get access to the subgradient of its local objective function $f_i(x)$. The distributed subgradient algorithm is a discrete-time dynamical system, which will be depicted next.
where $l > 0$ is the constant control parameter, $w_{ij}$ are non-negative weights, $t_k^i$ denotes the instant when the $k$th event happens for node $i$, and $t_{k'}^j$ with $k' = \arg\min_{m\in\bar{k},\, t\ge t_m^j}\{t - t_m^j\}$ is the latest event-triggered instant of node $j$. Moreover, $t_{k+1}^i$ denotes node $i$'s next event-triggered time instant after $t_k^i$, which is analytically decided by
$$t_{k+1}^i = \inf\big\{t : t > t_k^i \text{ and } y_i(t) > 0\big\}, \tag{4.4}$$
where $\beta_1 > 0$, $\beta_2 > 0$, $\mu > 0$, and the measurement error is designed as $e_i(t) = x_i(t_k^i) - x_i(t)$.
Let $\nabla F(X(t)) = [\nabla f_1(x_1(t))^T, \nabla f_2(x_2(t))^T, \cdots, \nabla f_N(x_N(t))^T]^T \in \mathbb{R}^{Nn}$, $\bar{x}(t) = (1/N)\sum_{i=1}^N x_i(t)$, $\delta_i(t) = x_i(t) - \bar{x}(t)$, $i = 1, 2, \dots, N$, and $X(t) = [x_1^T(t), x_2^T(t), \cdots, x_N^T(t)]^T \in \mathbb{R}^{Nn}$. Then the distributed subgradient algorithm (4.2) with distributed control input can be rewritten in a compact matrix-vector form as follows:
By the definition of $\delta_i(t)$, it is easy to know that the consensus error $\delta(t) = [(I_N - J_N)\otimes I_n]X(t)$ holds, where $J_N = \frac{1}{N}1_N1_N^T$. It follows that
Since the undirected graph network is connected, we can conclude that the eigenvalues of the matrix $L$ are $0, \lambda_2, \dots, \lambda_N$. Then we can take an orthogonal matrix $T = [\zeta, \phi_2, \dots, \phi_N] \in \mathbb{R}^{N\times N}$ by using Gram–Schmidt orthogonalization, where $\zeta = \frac{1}{\|1_N\|}1_N$ is the unit eigenvector of $L$ with respect to the eigenvalue $0$ and $\phi_i$ is the eigenvector of $L$ with respect to the eigenvalue $\lambda_i$ ($\phi_i^TL = \lambda_i\phi_i^T$, $i = 2, 3, \dots, N$). Letting $\tilde{\delta}(t) = (T^{-1}\otimes I_n)\delta(t)$, $\tilde{e}(t) = (T^{-1}\otimes I_n)e(t)$, and $\nabla\tilde{F}(X(t)) = (T^{-1}\otimes I_n)\nabla F(X(t))$, we obtain the transformed error dynamics (4.9) and (4.10). Decomposing $\tilde{\delta}(t) = \begin{bmatrix}\tilde{\delta}_1(t)\\ \tilde{\delta}_2(t)\end{bmatrix}$, $\tilde{e}(t) = \begin{bmatrix}\tilde{e}_1(t)\\ \tilde{e}_2(t)\end{bmatrix}$, $\nabla\tilde{F}(X(t)) = \begin{bmatrix}\nabla\tilde{F}_1(X(t))\\ \nabla\tilde{F}_2(X(t))\end{bmatrix}$, and in view of (4.9) and (4.10), it follows that
$$\begin{bmatrix}\tilde{\delta}_1(t+1)\\ \tilde{\delta}_2(t+1)\end{bmatrix} = \begin{bmatrix}I_n & 0\\ 0 & (I_{N-1}-hl\tilde{L})\otimes I_n\end{bmatrix}\begin{bmatrix}\tilde{\delta}_1(t)\\ \tilde{\delta}_2(t)\end{bmatrix} - hl\begin{bmatrix}0 & 0\\ 0 & \tilde{L}\otimes I_n\end{bmatrix}\begin{bmatrix}\tilde{e}_1(t)\\ \tilde{e}_2(t)\end{bmatrix} - hg(t)\begin{bmatrix}I_n & 0\\ 0 & \hat{L}\otimes I_n\end{bmatrix}\begin{bmatrix}\nabla\tilde{F}_1(X(t))\\ \nabla\tilde{F}_2(X(t))\end{bmatrix}, \tag{4.11}$$
where $\tilde{L} = \operatorname{diag}(\lambda_2, \dots, \lambda_N)$ and $\hat{L} = [\phi_2, \dots, \phi_N]^TJ_N[\phi_2, \dots, \phi_N]$. On the other hand, from $(T^{-1}\otimes I_n)\delta(t) = (T^T\otimes I_n)\delta(t) = \tilde{\delta}(t)$, one obtains $\tilde{\delta}_1(t) = (\zeta^T\otimes I_n)\delta(t)$. Note that $\zeta^TJ_N = \frac{1}{\|1_N\|}1_N^T = \zeta^T$; it is then easy to get
$$\tilde{\delta}_1(t) = (\zeta^T\otimes I_n)[X(t) - (J_N\otimes I_n)X(t)] = \frac{1}{\|1_N\|}(1_N^T\otimes I_n)X(t) - \frac{1}{\|1_N\|}(1_N^T\otimes I_n)X(t) = 0. \tag{4.12}$$
Before giving a key definition, we need to present the following assumption regarding the step-sizes.
Assumption 4.3 The step-sizes $\{g(t)\}$ form a positive sequence satisfying
$$\sup_{t\ge0} g(t) \le 1, \qquad \sum_{t=0}^{\infty} g(t) = +\infty, \qquad \sum_{t=0}^{\infty} (g(t))^2 < +\infty.$$
Remark 4.2 On the one hand, a slowly decaying step-size function $g(t)$ improves the accuracy of each node's state estimation. On the other hand, to achieve the optimization, the step-size function $g(t)$ ought to decrease to zero as $t$ increases to infinity, i.e., $\lim_{t\to\infty} g(t) = 0$. According to the nature of series summation, we can conclude that $\sum_{t=0}^{\infty} 1/(t+1)^k = +\infty$ and $\sum_{t=0}^{\infty} (1/(t+1)^k)^2 < +\infty$ for all $0.5 < k \le 1$. Since $0 < \sup_{t\ge0} 1/(t+1)^k \le 1$ for $0.5 < k \le 1$, we can apply $g(t) = 1/(t+1)^k$, $0.5 < k \le 1$, to satisfy Assumption 4.3 in the forthcoming analysis.
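The summability properties in Remark 4.2 can be illustrated numerically; the partial sums below (for the hypothetical choice $k = 0.75$) show $\sum g(t)$ growing without bound while $\sum g(t)^2$ stabilizes.

```python
T, k = 10**6, 0.75                                    # horizon and exponent are illustrative
s1 = sum(1.0 / (t + 1) ** k for t in range(T))        # diverges as T grows
s2 = sum(1.0 / (t + 1) ** (2 * k) for t in range(T))  # converges since 2k > 1
print(f"sum g(t) = {s1:.1f}, sum g(t)^2 = {s2:.4f}")
```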
Definition 4.3 Under some distributed control input $u_i(t)$, the distributed subgradient algorithm (4.2) is said to achieve consensus if $\lim_{t\to\infty}\|x_i(t) - x_j(t)\| = 0$, $\forall i, j = 1, 2, \dots, N$, holds for any initial values.

4.4 Convergence Analysis

In this section, we first provide some supporting lemmas. Then we prove the convergence property of the optimization algorithm (4.2).
Lemma 4.5 Suppose that $0 < h < 1$ and $0 < l < \frac{2\alpha}{(\alpha^2+\beta^2)h}$, where $\alpha$ and $\beta$ denote the real and imaginary parts of any eigenvalue of $\tilde{L}$. Then one has $\rho(I_{N-1} - hl\tilde{L}) < 1$, where $\rho(I_{N-1} - hl\tilde{L})$ stands for the spectral radius of the matrix $I_{N-1} - hl\tilde{L}$.
Proof Letting $\lambda$ be any eigenvalue of $I_{N-1} - hl\tilde{L}$, then
$$0 = |(\lambda - 1)I_{N-1} + hl\tilde{L}| = \Big|\frac{(\lambda-1)}{lh}I_{N-1} + \tilde{L}\Big|. \tag{4.14}$$
Let $\sigma = \alpha + \beta\sqrt{-1}$ be an eigenvalue of the matrix $\tilde{L}$. By (4.14), we have
$$\frac{\lambda - 1}{lh} + \sigma = 0.$$
Denote
$$d(\lambda) = \lambda + lh\alpha + lh\beta\sqrt{-1} - 1 = a_1\lambda + a_0. \tag{4.15}$$
By Lemma 4.4, we have $\alpha > 0$. Noticing that $0 < h < 1$ and $0 < l < \frac{2\alpha}{(\alpha^2+\beta^2)h}$, one can therefore compute that
$$\Delta_1 = \begin{vmatrix} a_0 & a_1 \\ \bar{a}_1 & \bar{a}_0 \end{vmatrix} = lh(lh\alpha^2 - 2\alpha + lh\beta^2) < 0, \tag{4.16}$$
where $\bar{a}_0$ and $\bar{a}_1$ are the complex conjugates of $a_0$ and $a_1$, respectively. Therefore, we have $|\lambda| < 1$ by applying the Schur–Cohn stability test [51]. Thus, $\rho(I_{N-1} - hl\tilde{L}) < 1$. The proof is therefore completed.
Lemma 4.6 If $\rho(I_{N-1} - hl\tilde{L}) < 1$, there exist constants $M \ge 1$ and $0 < \gamma < 1$ such that
$$\|(I_{N-1} - hl\tilde{L})^t\| \le M\gamma^t, \quad t \ge 0.$$
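Lemmas 4.5 and 4.6 can be checked numerically for a concrete topology. The sketch below builds the Laplacian of a 5-node cycle graph (a hypothetical example), picks $l$ inside the admissible range (for undirected graphs the eigenvalues are real, so $\beta = 0$), and verifies $\rho(I_{N-1} - hl\tilde{L}) < 1$.

```python
import numpy as np

N, h = 5, 0.5
# Laplacian of the cycle graph on N nodes.
L = 2 * np.eye(N)
for i in range(N):
    L[i, (i + 1) % N] -= 1
    L[i, (i - 1) % N] -= 1

lam = np.sort(np.linalg.eigvalsh(L))   # 0 = lam[0] < lam[1] <= ... <= lam[-1]
l = 1.9 / (h * lam[-1])                # satisfies 0 < l < 2/(h*lam_i) for every nonzero lam_i
rho = np.max(np.abs(1.0 - h * l * lam[1:]))   # spectral radius of I - h*l*L~
print(f"spectral radius of I - h*l*L~ = {rho:.3f} (< 1)")
```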
We now introduce the property that the nodes reach consensus asymptotically, which means the node estimates $x_i(t)$ converge to the same point as $t$ goes to infinity.
Theorem 4.7 (Consensus) Let the connectivity Assumption 4.1, the subgradient boundedness Assumption 4.2, and the step-size Assumption 4.3 hold. Consider the distributed subgradient algorithm (4.2) with control input (4.3), where the triggering time sequence is determined by (4.4) with $\beta_1 \in (0, \frac{1}{N\|L\|})$ and $\beta_2 \in (0, \infty)$. Then, consensus for (4.2) can be achieved for $l \in (0, \frac{2\alpha}{(\alpha^2+\beta^2)h})$ and $\mu \in (\gamma, 1)$.
Proof From (4.11), we obtain
$$\tilde{\delta}_2(t) = [(I_{N-1}-hl\tilde{L})\otimes I_n]^t\tilde{\delta}_2(0) - hl\sum_{s=0}^{t-1}[(I_{N-1}-hl\tilde{L})\otimes I_n]^{t-s-1}(\tilde{L}\otimes I_n)\tilde{e}_2(s) - h\sum_{s=0}^{t-1}g(s)[(I_{N-1}-hl\tilde{L})\otimes I_n]^{t-s-1}(\hat{L}\otimes I_n)\nabla\tilde{F}_2(X(s)). \tag{4.17}$$
Since the undirected graph network of these nodes is connected, by Lemma 4.5, we have $\rho(I_{N-1}-hl\tilde{L}) < 1$. Then it follows from Lemma 4.6 that there exist constants $M \ge 1$ and $0 < \gamma < 1$ such that $\|(I_{N-1}-hl\tilde{L})^t\| \le M\gamma^t$, $t \ge 0$. Thus, from (4.17), we can get
$$\|\tilde{\delta}_2(t)\| \le M\gamma^t\|\tilde{\delta}_2(0)\| + hl\|\tilde{L}\|\sum_{s=0}^{t-1}M\gamma^{t-s-1}\|\tilde{e}_2(s)\| + h\|\hat{L}\|\sum_{s=0}^{t-1}g(s)M\gamma^{t-s-1}\|\nabla\tilde{F}_2(X(s))\|. \tag{4.18}$$
It is obvious that $\|\tilde{\delta}_2(t)\| = \|\delta(t)\|$ (recall that $\tilde{\delta}_1(t) = 0$ and $T$ is orthogonal), and that $\|\tilde{e}_2(s)\| \le \|e(s)\|$ and $\|\nabla\tilde{F}_2(X(s))\| \le \|\nabla F(X(s))\|$. Substituting these relations into (4.18) yields
$$\|\delta(t)\| \le M\gamma^t\|\delta(0)\| + hl\|\tilde{L}\|\sum_{s=0}^{t-1}M\gamma^{t-s-1}\|e(s)\| + h\|\hat{L}\|\sum_{s=0}^{t-1}g(s)M\gamma^{t-s-1}\|\nabla F(X(s))\|. \tag{4.22}$$
Finally, for all $0 < \beta_1 < \frac{1}{N\|L\|}$, we therefore obtain
$$\|\delta(t)\| \le Z\mu^t, \quad \forall t \ge 0, \tag{4.26}$$
where
$$Z = \max\left\{M\Big[\|\delta(0)\| - \frac{h\|\hat{L}\|\sqrt{N}H}{1-\gamma}\Big],\; \frac{Mhl\|\tilde{L}\|N\beta_2}{(\mu-\gamma)(1-N\beta_1\|L\|) - Mhl\|\tilde{L}\|N\beta_1\|L\|}\right\}.$$
In order to prove (4.26), we first state that the following inequality holds for any $\eta > 1$:
$$\|\delta(t)\| < \eta Z\mu^t, \quad \forall t \ge 0. \tag{4.27}$$
Suppose, to the contrary, that (4.27) fails, and let $t^*$ be the first time instant at which $\eta Z\mu^{t^*} \le \|\delta(t^*)\|$; bounding $\|\delta(t^*)\|$ via (4.22) then gives inequality (4.28). In the process of this derivation, we have used the subgradient boundedness Assumption 4.2 with $H = \max_i\{H_i\}$, the step-size Assumption 4.3 with $\sup_{t\ge0}g(t) \le 1$, and the sum formula of the geometric sequence. Then, we present the following two cases:
Case 1: $Z = M\big[\|\delta(0)\| - \frac{h\|\hat{L}\|\sqrt{N}H}{1-\gamma}\big]$, which implies that $\|\delta(0)\| - \frac{h\|\hat{L}\|\sqrt{N}H}{1-\gamma} \ge \frac{hl\|\tilde{L}\|N(\beta_1\|L\|Z+\beta_2)}{(\mu-\gamma)(1-N\beta_1\|L\|)}$. Then by (4.28), we can achieve that
$$\begin{aligned}
\eta Z\mu^{t^*} \le \|\delta(t^*)\| &< \eta M\Big[\|\delta(0)\| - \frac{hl\|\tilde{L}\|N(\beta_1\|L\|Z+\beta_2)}{(\mu-\gamma)(1-N\beta_1\|L\|)} - \frac{h\|\hat{L}\|\sqrt{N}H}{1-\gamma}\Big]\mu^{t^*} + \eta M\frac{hl\|\tilde{L}\|N(\beta_1\|L\|Z+\beta_2)}{(\mu-\gamma)(1-N\beta_1\|L\|)}\mu^{t^*} \\
&= \eta M\Big[\|\delta(0)\| - \frac{h\|\hat{L}\|\sqrt{N}H}{1-\gamma}\Big]\mu^{t^*} = \eta Z\mu^{t^*}. \tag{4.29}
\end{aligned}$$
Case 2: $Z = \frac{Mhl\|\tilde{L}\|N\beta_2}{(\mu-\gamma)(1-N\beta_1\|L\|) - Mhl\|\tilde{L}\|N\beta_1\|L\|}$, which implies that $\|\delta(0)\| - \frac{h\|\hat{L}\|\sqrt{N}H}{1-\gamma} < \frac{hl\|\tilde{L}\|N(\beta_1\|L\|Z+\beta_2)}{(\mu-\gamma)(1-N\beta_1\|L\|)}$. Then, we have
$$\eta Z\mu^{t^*} \le \|\delta(t^*)\| < \eta M\frac{hl\|\tilde{L}\|N(\beta_1\|L\|Z+\beta_2)}{(\mu-\gamma)(1-N\beta_1\|L\|)}\mu^{t^*} = \eta Z\mu^{t^*}. \tag{4.30}$$
The contradictions in (4.29) and (4.30) demonstrate that (4.27) is valid for any $\eta > 1$. Then, letting $\eta \to 1$, we obtain that inequality (4.26) holds, which further implies that consensus of (4.2) is achieved asymptotically. The proof is thus completed.
Note that all the conditions of Lemma 4.8 hold, and then, with the help of Lemma 4.8, we obtain statement (4.31) and
$$\sum_{t=0}^{\infty}\alpha(t)(h(p(t)) - h^*) < \infty. \tag{4.32}$$
Since $\sum_{t=0}^{\infty}\alpha(t) = \infty$, it is obtained from (4.32) that
$$\lim_{t\to\infty} h(p(t)) = h^*.$$
Recalling (4.31), it is clear that the sequence $\{p(t)\}$ is bounded. Without loss of generality, we can assume that $\{p(t)\}$ converges to some $\tilde{p}$. By the continuity of $h$, we therefore obtain
$$h(\tilde{p}) = h^*. \tag{4.34}$$
Thus, (4.31) and (4.34) jointly imply $\tilde{p} \in P^*$. By substituting $p^*$ in (4.31) with $\tilde{p}$, we achieve that $\{p(t)\}$ converges to $\tilde{p}$. The proof is thus completed.
Theorem 4.10 (Convergence Properties) Let the connectivity Assumption 4.1, the subgradient boundedness Assumption 4.2, and the step-size Assumption 4.3 hold. For problem (4.1), consider the distributed subgradient algorithm (4.2) with distributed control input (4.3), where the triggering time sequence is decided by (4.4). Then, there exists an optimal solution $x^* \in X^*$ such that
$$\lim_{t\to\infty} x_i(t) = x^*, \quad \forall i = 1, 2, \dots, N.$$
Proof Averaging the update (4.2) over all nodes gives
$$\frac{1}{N}\sum_{i=1}^N x_i(t+1) = \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^N w_{ij}x_j(t) - \frac{hl}{N}\sum_{i=1}^N\sum_{j\in N_i}a_{ij}(e_i(t) - e_j(t)) - \frac{h}{N}\sum_{i=1}^N g(t)\nabla f_i(x_i(t)). \tag{4.35}$$
Since $\bar{x}(t) = \frac{1}{N}\sum_{i=1}^N x_i(t)$, the control law (4.35) can be rewritten as
$$\bar{x}(t+1) = \bar{x}(t) - hg(t)\frac{1}{N}\sum_{i=1}^N\nabla f_i(x_i(t)). \tag{4.36}$$
Consider the sequence (4.36). Letting $x \in \mathbb{R}^n$ be an arbitrary vector, we have for all $t \ge 0$,
$$\begin{aligned}
\|\bar{x}(t+1) - x\|^2 &= \Big\|\bar{x}(t) - hg(t)\frac{1}{N}\sum_{i=1}^N\nabla f_i(x_i(t)) - x\Big\|^2 \\
&= \|\bar{x}(t) - x\|^2 - \frac{2hg(t)}{N}\sum_{i=1}^N\nabla f_i(x_i(t))^T(\bar{x}(t) - x) + \frac{h^2g^2(t)}{N^2}\Big\|\sum_{i=1}^N\nabla f_i(x_i(t))\Big\|^2. \tag{4.37}
\end{aligned}$$
By the subgradient boundedness Assumption 4.2, it follows that
$$\|\bar{x}(t+1) - x\|^2 \le \|\bar{x}(t) - x\|^2 + h^2H^2g^2(t) - \frac{2hg(t)}{N}\sum_{i=1}^N\nabla f_i(x_i(t))^T(\bar{x}(t) - x). \tag{4.38}$$
Next, we analyze the cross-term $\nabla f_i(x_i(t))^T(\bar{x}(t) - x)$ in (4.38). Firstly, we write
$$[\nabla f_i(x_i(t))]^T(\bar{x}(t) - x) = [\nabla f_i(x_i(t))]^T(\bar{x}(t) - x_i(t)) + [\nabla f_i(x_i(t))]^T(x_i(t) - x).$$
We can take a lower bound on the first item $[\nabla f_i(x_i(t))]^T(\bar{x}(t) - x_i(t))$ as follows by using the subgradient boundedness:
$$[\nabla f_i(x_i(t))]^T(\bar{x}(t) - x_i(t)) \ge -\|\nabla f_i(x_i(t))\|\|\bar{x}(t) - x_i(t)\|. \tag{4.40}$$
As for the second term $[\nabla f_i(x_i(t))]^T(x_i(t) - x)$, we apply the convexity of $f_i$ to obtain
$$[\nabla f_i(x_i(t))]^T(x_i(t) - x) \ge f_i(x_i(t)) - f_i(x). \tag{4.41}$$
Combining (4.38), (4.40), and (4.41), we obtain
$$\begin{aligned}
\|\bar{x}(t+1) - x\|^2 \le{}& \|\bar{x}(t) - x\|^2 + \frac{2hg(t)}{N}\sum_{i=1}^N\big[(\|\nabla f_i(x_i(t))\| + \|\nabla f_i(\bar{x}(t))\|)\|\bar{x}(t) - x_i(t)\| + f_i(x) - f_i(\bar{x}(t))\big] + h^2H^2g^2(t) \\
\le{}& \|\bar{x}(t) - x\|^2 + \frac{4hHg(t)}{N}\sum_{i=1}^N\|\bar{x}(t) - x_i(t)\| - \frac{2hg(t)}{N}\sum_{i=1}^N(f_i(\bar{x}(t)) - f_i(x)) + h^2H^2g^2(t). \tag{4.43}
\end{aligned}$$
Taking $x = x^* \in X^*$ in (4.43) yields
$$\|\bar{x}(t+1) - x^*\|^2 \le \|\bar{x}(t) - x^*\|^2 + \frac{4hHg(t)}{N}\sum_{i=1}^N\|\bar{x}(t) - x_i(t)\| - \frac{2hg(t)}{N}\sum_{i=1}^N(f_i(\bar{x}(t)) - f_i(x^*)) + h^2H^2g^2(t). \tag{4.44}$$
Rearranging the above formula and applying $f(x) = \sum_{i=1}^N f_i(x)$, it follows that
$$\frac{2h}{N}g(t)(f(\bar{x}(t)) - f^*) \le \|\bar{x}(t) - x^*\|^2 - \|\bar{x}(t+1) - x^*\|^2 + \frac{4hHg(t)}{N}\sum_{i=1}^N\|\bar{x}(t) - x_i(t)\| + h^2H^2g^2(t), \tag{4.45}$$
where $f^*$ is the optimal value. Summing (4.45) over $[0, \infty)$, dropping the negative term on the right-hand side, and multiplying both sides by $N$, we obtain
$$2h\sum_{t=0}^{\infty}g(t)(f(\bar{x}(t)) - f^*) \le N\|\bar{x}(0) - x^*\|^2 + Nh^2H^2\sum_{t=0}^{\infty}g^2(t) + 4hH\sum_{t=0}^{\infty}g(t)\sum_{i=1}^N\|\bar{x}(t) - x_i(t)\|. \tag{4.46}$$
Now, we are in a position to study inequality (4.46). The right side of (4.46) can be partitioned into three items. For the first item, it is clear that $N\|\bar{x}(0) - x^*\|^2 < \infty$, and Assumption 4.3 guarantees that the second item $Nh^2H^2\sum_{t=0}^{\infty}g^2(t)$ is finite. For the third item, it is obtained from the consensus result of Theorem 4.7 that
$$\sum_{i=1}^N\|\bar{x}(t) - x_i(t)\| \le NZ\mu^t.$$
Multiplying the above relation by $g(t)$ and summing up from $0$ to $l$, it yields that
$$\sum_{t=0}^{l}g(t)\sum_{i=1}^N\|\bar{x}(t) - x_i(t)\| \le NZ\sum_{t=0}^{l}g(t)\mu^t. \tag{4.50}$$
Using $2ab \le a^2 + b^2$, we further have
$$\sum_{t=0}^{l}g(t)\sum_{i=1}^N\|\bar{x}(t) - x_i(t)\| \le NZ\sum_{t=0}^{l}(g^2(t) + \mu^{2t}). \tag{4.51}$$
Since $\sum_{t=0}^{\infty}g(t) = \infty$ and $f(\bar{x}(t)) - f^* \ge 0$, it yields that $\liminf_{t\to\infty}f(\bar{x}(t)) = f^*$. Thus, from (4.44) and $\sum_{t=0}^{\infty}g^2(t) < \infty$, we can derive that all the conditions of Lemma 4.9 are established. With this lemma in hand, we can deduce that the average sequence $\{\bar{x}(t)\}$ asymptotically converges to an optimal solution $x^* \in X^*$. Recalling Theorem 4.7, it yields that each sequence $\{x_i(t)\}$, $i = 1, \dots, N$, converges to the same optimal solution $x^*$. The proof is thus completed.
Remark 4.11 In this chapter, we only consider the diminishing step-size rule (Assumption 4.3) to make algorithm (4.2) converge to a consistent optimal solution. It is worth noting that algorithm (4.2) and other similar algorithms can be made faster by using a fixed (constant) step-size, but they then only converge to a neighborhood of the optimal solution set.
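For illustration, one iteration of the event-triggered update (4.2) with control input (4.3) can be sketched as follows, using the equivalent form $x_i(t+1) = x_i(t) + hl\sum_j a_{ij}(\hat{x}_j - \hat{x}_i) - hg(t)\nabla f_i(x_i(t))$ under the assumption $W = I_N - hlL$ adopted throughout the analysis. The broadcast states $\hat{x}$ are refreshed only when the trigger rule (4.4) fires, which is omitted here, and all names are illustrative.

```python
import numpy as np

def subgradient_step(x, x_hat, A, subgrads, h, l, g_t):
    """One event-triggered distributed subgradient iteration.

    x        : (N, n) array of current local states x_i(t)
    x_hat    : (N, n) array of latest broadcast states x_i(t_k^i)
    A        : (N, N) nonnegative symmetric weight matrix (a_ij)
    subgrads : list of callables, subgrads[i](x_i) -> subgradient of f_i
    """
    N = x.shape[0]
    x_next = np.empty_like(x)
    for i in range(N):
        coupling = sum(A[i, j] * (x_hat[j] - x_hat[i]) for j in range(N))
        x_next[i] = x[i] + h * l * coupling - h * g_t * subgrads[i](x[i])
    return x_next
```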
rate of control inputs is $35/3000 \approx 1.17\%$. From Fig. 4.4, for node 3, it is explicit that the norm of the measurement error $\|e_3(t)\|$ asymptotically decreases to zero.
4.6 Conclusion
References
12. Y. Kang, D.-H. Zhai, G.-P. Liu, Y.-B. Zhao, P. Zhao, Stability analysis of a class of hybrid
stochastic retarded systems under asynchronous switching. IEEE Trans. Autom. Control 59(6),
1511–1523 (2014)
13. A. Jadbabaie, J. Lin, A. Morse, Coordination of groups of mobile autonomous agents using
nearest neighbor rules. IEEE Trans. Autom. Control 48(6), 988–1001 (2003)
14. T. Schetter, M. Campbell, D. Surka, Multiple agent-based autonomy for satellite constellations.
Artif. Intell. 145(1), 147–180 (2003)
15. S. Xie, Y. Wang, Construction of tree network with limited delivery latency in homogeneous
wireless sensor networks. Wirel. Pers. Commun. 78(1), 231–246 (2014)
16. S. Kar, J. Moura, Distributed average consensus in sensor networks with random link failures,
in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing – ICASSP
’07 (2007). https://doi.org/10.1109/ICASSP.2007.366410
17. S. Pereira, Z. Pages, Consensus in correlated random wireless sensor networks. IEEE Trans.
Signal Process. 59(12), 6279–6284 (2011)
18. S. Jian, H.W. Tan, J. Wang, J.W. Wang, S.Y. Lee, A novel routing protocol providing good
transmission reliability in underwater sensor networks. J. Int. Technol. 16(1), 171–178 (2015)
19. P. Guo, J. Wang, X.H. Geng, C.S. Kim, J.U. Kim, A variable threshold-value authentication
architecture for wireless mesh networks. J. Int. Technol. 15(6), 929–936 (2014)
20. S.R. Olfati, J.S. Shamma, Consensus filters for sensor networks and distributed sensor fusion,
in Proceedings of 44th IEEE Conference on Decision and Control (2005). https://doi.org/10.
1109/CDC.2005.1583238
21. Z. Fu, K. Ren, J. Shu, X. Sun, F. Huang, Enabling personalized search over encrypted
outsourced data with efficiency improvement. IEEE Trans. Parallel Distrib. Syst. 27(9), 2546–
2559 (2016)
22. Y.J. Ren, J. Shen, J. Wang, J. Han, S.Y. Lee, Mutual verifiable provable data auditing in public
cloud storage. J. Int. Technol. 16(2), 317–323 (2015)
23. Z. Xia, X. Wang, X. Sun, Q. Wang, A secure and dynamic multi-keyword ranked search scheme
over encrypted cloud data. IEEE Trans. Parallel Distrib. Syst. 27(2), 340–352 (2016)
24. Z. Fu, X. Sun, Q. Liu, L. Zhou, J. Shu, Achieving efficient cloud search services: multi-
keyword ranked search over encrypted cloud data supporting parallel computing. IEICE Trans.
Commun. 98(1), 190–200 (2015)
25. M. Cao, A. Morse, B. Anderson, Reaching a consensus in a dynamically changing environ-
ment: convergence rates, measurement delays, and asynchronous events. SIAM J. Control
Optim. 47(2), 575–600 (2008)
26. Y. Fan, G. Feng, Y. Wang, C. Song, Distributed event-triggered control of multi-agent systems
with combinational measurements. Automatica 49(2), 671–675 (2013)
27. K. Hamada, N. Hayashi, S. Takai, Event-triggered and self-triggered control for discrete-time
average consensus problems. SICE J. Control Meas. Syst. Integr. 7(5), 297–303 (2014)
28. Y.P. Tian, Stability analysis and design of the second order congestion control for networks
with heterogeneous delays. IEEE/ACM Trans. Netw. 13(5), 1082–1093 (2005)
29. B. Gu, V. Sheng, A robust regularization path algorithm for v-support vector classification.
IEEE Trans. Neural Netw. Learn. Syst. 28(5), 1241–1248 (2017)
30. Z. Xia, X. Wang, X. Sun, Q. Liu, N. Xiong, Steganalysis of LSB matching using differences
between nonadjacent pixels. Multimed. Tools Appl. 75(5), 1947–1962 (2016)
31. X. Wen, S. Ling, X. Yu, F. Wei, A rapid learning algorithm for vehicle classification. Inf. Sci.
295(1), 395–406 (2015)
32. T. Ma, J. Zhou, M. Tang, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, S. Lee, Social network
and tag sources based augmenting collaborative recommender system. IEICE Trans. Inf. Syst.
98(4), 902–910 (2015)
33. G. Chen, L. Luo, H. Shu, B. Chen, H. Zhang, Color image analysis by quaternion-type
moments. J. Math. Imaging Vis. 51(1), 124–144 (2015)
34. J. Hu, X. Hu, Nonlinear filtering in target tracking using cooperative mobile sensors.
Automatica 46(12), 2041–2046 (2010)
35. P. Tabuada, Event-triggered real-time scheduling of stabilizing control tasks. IEEE Trans.
Autom. Control 52(9), 1680–1685 (2007)
36. H. Li, X. Liao, G. Chen, D.J. Hill, Z. Dong, T. Huang, Event-triggered asynchronous
intermittent communication strategy for synchronization in complex dynamical networks.
Neural Netw. 66, 1–10 (2015)
37. D. Dimarogonas, E. Frazzoli, K. Johansson, Distributed event-triggered control for multi-agent
systems. IEEE Trans. Autom. Control 57(5), 1291–1297 (2012)
38. G. Seyboth, D. Dimarogonas, K. Johansson, Event-based broadcasting for multi-agent average
consensus. Automatica 49(1), 245–252 (2013)
39. H. Li, G. Chen, T. Huang, Z. Dong, W. Zhu, L. Gao, Event-triggered distributed consensus
over directed digital networks with limited bandwidth. IEEE Trans. Cybern. 46(12), 3098–3110
(2016)
40. W. Zhu, Z. Jiang, G. Feng, Event-based consensus of multi-agent systems with general linear
models. Automatica 50(2), 552–558 (2014)
41. X. Chen, F. Hao, Event-triggered average consensus control for discrete-time multi-agent
systems. IET Contr. Theory Appl. 16(6), 2493–2498 (2012)
42. D. Ding, Z. Wang, B. Shen, Event-triggered consensus control for a class of discrete-time
stochastic multi-agent systems, in Proceeding of the 11th World Congress on Intelligent
Control and Automation (2014). https://doi.org/10.1109/WCICA.2014.7052731
43. H. Pu, W. Zhu, D. Wang, Consensus analysis of first-order discrete-time multi-agent systems
with time delay: an event-based approach, in 2016 35th Chinese Control Conference (CCC)
(2016). https://doi.org/10.1109/ChiCC.2016.7554623
44. X. Yin, D. Yue, S. Hu, Distributed event-triggered control of discrete-time heterogeneous
multi-agent systems. J. Frankl. Inst. 350(3), 651–669 (2013)
45. W. Zhu, Z. Tian, Event-based consensus of first-order discrete time multi-agent systems, in
2016 12th World Congress on Intelligent Control and Automation (WCICA) (2016). https://
doi.org/10.1109/WCICA.2016.7578796
46. W. Du, Sunney Y. Leung, Y. Tang, A. Vasilakos, Differential evolution with event-triggered
impulsive control. IEEE Trans. Cybern. 47(1), 244–257 (2016)
47. D. Yang, X. Liu, W. Chen, Periodic event/self-triggered consensus for general continuous-time
linear multi-agent systems under general directed graphs. IET Control Theory Appl. 9(3), 428–
440 (2015)
48. A. Nedic, A. Ozdaglar, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
49. Y. Lou, G. Shi, K. H. Johansson, Y. Hong, Approximate projected consensus for convex
intersection computation: convergence analysis and critical error angle. IEEE Trans. Autom.
Control 59(7), 1722–1736 (2014)
50. M. Zhu, S. Martínez, On distributed convex optimization under inequality and equality
constraints via primal-dual subgradient methods (2010). Preprint. arXiv: 1001.2612
51. B.C. Kuo, Discrete-Data Control System (Prentice Hall, Hoboken, 1970)
52. I. Lobel, A. Ozdaglar, D. Feijer, Distributed multi-agent optimization with state-dependent
communication. Math. Program. 129(2), 255–284 (2011)
53. H. Li, C. Guo, T. Huang, Z. Wei, X. Li, Event-triggered consensus in nonlinear multi-agent
systems with nonlinear dynamics and directed network topology. Neurocomputing 185, 105–
112 (2016)
Chapter 5
Event-Triggered Acceleration Algorithms for Distributed Stochastic Optimization
Abstract In this chapter, we focus on introducing and exploring how to improve the communication and computational efficiency in distributed optimization problems. The problem under study remains distributed optimization: minimizing a finite sum of convex cost functions over the nodes of a network, where each cost function is further considered as the average of several constituent functions. Reviewing the existing work, no method improves communication efficiency and computational efficiency simultaneously. To achieve this goal, we introduce an effective event-triggered distributed accelerated stochastic gradient algorithm, namely ET-DASG. ET-DASG improves communication efficiency through an event-triggered strategy, improves computational efficiency by using SAGA's variance-reduction technique, and accelerates convergence by using Nesterov's acceleration mechanism, thus achieving the target of improving communication efficiency and computational efficiency simultaneously. Furthermore, we provide in this chapter a convergence analysis demonstrating that ET-DASG converges in the mean to the exact optimal solution with a well-selected constant step-size. Also, thanks to the gradient tracking scheme, the algorithm achieves a linear convergence rate when each constituent function is strongly convex and smooth. Moreover, under certain conditions, we prove that the time interval between two successive triggering moments is larger than the iteration interval for each node. Finally, we also confirm the attractive performance of ET-DASG through simulation results.
5.1 Introduction
[5], privacy masking [6], and signal processing [7] due to its ability to parallelize
computation and prevent nodes from sharing privacy. Distributed algorithms usually
follow an iterative process in which nodes in the network store certain estimates of
the decision vector in the context of optimization, exchange this information with
neighboring nodes, and update their estimates based on the information received.
Some of the literature for such distributed schemes include the early work on
distributed gradient descent (DGD) [8] and its various extensions in achieving
efficiency [9, 10], solving constraints [11, 12], applying to complex networks [13,
14], or performing acceleration [15, 16]. These optimization methods successfully
showed the effectiveness for dealing with problems in a distributed manner [8–16].
Nonetheless, although these methods were intuitive and flexible with respect to cost functions and networks, their convergence rates were particularly slow in comparison with those of their centralized counterparts. Besides, for DGD-based methods with constant step-sizes, only linear convergence to a neighborhood of the optimal solution could be derived [17]. Therefore,
from an optimization point of view, it is always a priority to propose and analyze
methods that are comparable in performance to centralized counterparts in terms of
convergence rate. In a recent stream of literature, distributed gradient methods that
overcome this exactness-rate dilemma have been proposed, which achieve exactly
linear convergence rate for smooth and strongly convex cost functions. Instances
of such methods, including methods based on gradient tracking [18–28], methods
based on Lagrangian multiplier [29–32], and methods based on dual decomposition
[33–36], are characterized by various mechanisms.
Toward practical optimization models, approaches of momentum acceleration
have been successfully and widely used in optimization techniques, which is
conducive to the convergence of the DGD-based methods [16, 37–42]. First-order
optimization methods based on momentum acceleration have been of significance
in the machine learning community because of their better scalability for large-
scale tasks (including deep learning, federal learning, etc.) and good performance
in practice. When solving convex or strongly convex optimization problems, many
momentum approaches have emerged, e.g., the Nesterov’s acceleration mechanism
in [16, 37–39] and the heavy-ball mechanism in [40–42], which both ensure that
nodes obtain more information from neighbors in the network than ones with
no momentum, and have been proven to largely improve the convergence rate
of gradient-based methods. Despite momentum acceleration mechanisms having
superior theoretical advantages, they do not fully exploit the performance of related
methods in terms of efficiency, e.g., communication and computation. For example,
in machine learning, the accuracy of the machine learning model can be improved
by increasing the parameter scale and training dataset. However, this operation will
lead to a substantial increase in training time, which results in low communication
and computation efficiency. Therefore, exploiting some valid techniques to achieve
provable efficiency becomes a new challenge for researchers.
To improve communication efficiency and meanwhile maintain the desired
performance of the network, various types of strategies have recently been proposed
and gained popularity in the existing works, e.g., [43–46]. The emergence of the
event-triggered strategy provides a new perspective for collecting and transmitting
information. The main idea behind the event-triggered strategy is that nodes only
take actions when necessary, that is, only when a measurement of the local
node’s state error reaches a specified threshold. Its superiority is that some desired
properties of the network can still be maintained efficiently. There are many works
on distributed event-triggered methods over networks, which can successfully solve
various practical problems and achieve expected results [43–46]. For example,
distributed event-triggered algorithms proposed in [43] have been utilized to
resolve constrained optimization problems, and event-triggered distributed gradient
tracking algorithms in [44] have been proven to linearly converge to the optimal
solution, which have further been extended to the distributed energy management
problem of smart grids in [46]. In the era of big data, nodes in the network may process large and complex data in the course of information sharing and calculation [5]. Thus, the above methods bear a heavy computational burden.
To reduce the computational pressure while keeping the calculation simple, approximating the true gradient with a stochastic gradient is currently a relatively common solution. Based on this scheme, related stochastic gradient methods have emerged [47–49]. However, because of the large variance of the stochastic gradient, these approaches exhibit weaker convergence. By decreasing
the variance of the stochastic gradient, many centralized stochastic methods [50–
54] adopt various variance-reduction techniques to surmount this shortcoming and
improve convergence. Inspired by Konecny et al. [50], Schmidt et al. [51], Defazio
et al. [52], Tan et al. [53], Nguyen et al. [54], many distributed variance-reduced
approaches [2, 55–60] have been extensively investigated, and their performance in
processing machine learning tasks is better than their centralized counterparts.
In this chapter, we focus on promoting the execution (i.e., communication and computation) efficiency and accelerating the convergence of distributed optimization in dealing with machine learning tasks. As far as the authors know, there is no prior work achieving this combined target. Specifically, we highlight the main contributions of this work as follows:
(i) A novel event-triggered distributed accelerated stochastic gradient algorithm,
namely ET-DASG, is proposed to solve the machine learning tasks. ET-DASG
with well-selected constant step-size can linearly converge in the mean to
the exact optimal solution if each constituent function is strongly convex and
smooth.
(ii) Unlike the time-triggered methods [38–42], ET-DASG utilizes the event-
triggered strategy which effectively avoids frequent real-time communication,
reduces the communication load, and thus improves communication efficiency.
Furthermore, for each node, we prove that the time interval between two
successive triggering instants is larger than the iteration interval.
(iii) Compared with the existing methods [38–46], ET-DASG achieves higher
computation efficiency by means of the variance-reduction technique. In
particular, at each iteration, ET-DASG only employs the gradient of one
randomly selected constituent function and adopts the unbiased stochastic
average gradient (SAGA) to estimate the local gradients, which greatly reduces
the expense of full gradient evaluations.
(iv) In comparison with the existing methods without momentum acceleration
mechanism [19–22, 44, 45], ET-DASG performs accelerated convergence with
the help of the Nesterov’s acceleration mechanism. Moreover, simulation
results verify that the convergence rate of ET-DASG improves with the increase
of the momentum coefficient.
5.2 Preliminaries
5.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors.
Let Rp and Rp×q denote the real Euclidean spaces with dimensions p and p × q,
respectively. The identity matrix and the spectral radius of a matrix are represented
as $I_p \in \mathbb{R}^{p\times p}$ and $\rho(\cdot)$, respectively. The Euclidean norm and the Kronecker product are represented as $\|\cdot\|$ and $\otimes$, respectively. Let the symbol $\mathbb{E}[\cdot]$ denote the expectation of a random variable. Let the notations $\nabla f(\cdot)$ and $(\cdot)^T$ denote the gradient of a function $f$ and the transpose of vectors (matrices), respectively. The $p$-dimensional all-ones and all-zeros vectors are represented as $1_p$ and $0_p$, respectively.
This chapter focuses on optimizing a finite sum of cost functions in machine learning, which can be described as
$$\min_{x\in\mathbb{R}^p} f(x) = \frac{1}{m}\sum_{i=1}^m f^i(x), \qquad f^i(x) = \frac{1}{n^i}\sum_{h=1}^{n^i} f^{i,h}(x), \tag{5.1}$$
Notice from Assumption 5.1 that problem (5.1) possesses a unique optimal
solution x ∗ and the global cost function f is also κ1 -strongly convex and κ2 -smooth,
where 0 < κ1 ≤ κ2 . In addition, the condition number of the global cost function f
is defined as γ = κ2 /κ1 .
We aim to solve (5.1) over an undirected network (graph) G = {V, E, Â}, where
V = {1, 2, . . . , m} is the set of nodes, E ⊆ V × V is the set of edges containing
the interactions among all nodes and  = [a ij ] ∈ Rm×m is the weight matrix
(symmetric). The edge (i, j ) ∈ E if node j can directly exchange data with node i.
If (i, j ) ∈ E, then a ij > 0; otherwise, a ij = 0. Let N i = {j |a ij > 0} denote the set
of neighbors of node i. Furthermore, we make the following assumption regarding
the network.
Assumption 5.2 (i) G is undirected and connected; (ii) Â = [a ij ] ∈ Rm×m is
primitive and doubly stochastic.
Assumption 5.2 indicates that the second largest singular value κ3 of  is less
than 1, i.e., κ3 = ||Â − (1/m)1m 1Tm || < 1, [19, 20, 22, 25, 44].
Remark 5.1 Assumptions 5.1 and 5.2 are very general and easy to be satisfied in
many machine learning tasks that can usually be expressed as problem (5.1). These
two assumptions allow us to conditionally design a distributed linearly convergent
algorithm to accurately solve the practical applications. In addition, when training
a machine learning model, it may be necessary for a single computing node to hold a large amount of local data ($n^i \gg 1$), but the limited memory of the computing node causes a significant increase in training time as well as in the amount of communication and calculation. However, it is expensive to improve the computing
and communication capabilities of a single piece of hardware. Hence, designing a
novel event-triggered distributed accelerated stochastic gradient algorithm will be
of great significance.
where $x_t^i$ and $y_t^i$ are two estimators of node $i$. Moreover, we suppose that all the nodes broadcast their estimators $x_t^i$ and $y_t^i$ at the initial time, i.e., $\hat{x}_0^i = x_0^i$ and $\hat{y}_0^i = y_0^i$ for all $i \in \mathcal{V}$. In addition, the next triggering time $t_{k(i,t)+1}^i$ after $t_{k(i,t)}^i$ for node $i \in \mathcal{V}$ is determined by
$$t_{k(i,t)+1}^i = \inf\big\{t \,\big|\, t > t_{k(i,t)}^i,\; \|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 > C\kappa_4^t\big\}, \tag{5.2}$$
where $C\kappa_4^t$ is the event-triggered threshold with parameters $C > 0$ and $0 < \kappa_4 < 1$, and $\varepsilon_t^{i,x}$, $\varepsilon_t^{i,y}$ are the measurement errors defined by
$$\varepsilon_t^{i,x} = \hat{x}_t^i - x_t^i, \qquad \varepsilon_t^{i,y} = \hat{y}_t^i - y_t^i. \tag{5.3}$$
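The triggering rule (5.2)–(5.3) amounts to a simple per-node test at every iteration, as the following sketch shows (variable names are illustrative).

```python
import numpy as np

def should_broadcast(x_hat, x, y_hat, y, C, kappa4, t):
    """Evaluate the event-triggered condition (5.2) at time t for one node."""
    eps_x = x_hat - x                 # measurement error eps_t^{i,x} in (5.3)
    eps_y = y_hat - y                 # measurement error eps_t^{i,y} in (5.3)
    threshold = C * kappa4 ** t       # geometrically decaying threshold
    return float(eps_x @ eps_x + eps_y @ eps_y) > threshold
```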
Remark 5.2 The emergence of the event-triggered strategy provides a new perspective for information sampling and transmission. In particular, once node $i$ receives the transmitted estimators $(\hat{x}_t^j, \hat{y}_t^j)$ from its neighbors $j \in \mathcal{N}^i$, node $i$ first replaces the neighbors' estimators stored in its local memory, and then updates its next estimators $(x_{t+1}^i, y_{t+1}^i)$ based on its current information $(x_t^i, y_t^i)$. Note that only when an event occurs at node $i$, i.e., the triggering condition in (5.2) is met, are the estimators $x_t^i$ and $y_t^i$ broadcast to its neighbors. From the communication perspective, the event-triggered strategy avoids real-time communication with neighbors, which plays an essential role in reducing the communication load and realizing better communication efficiency compared with time-triggered methods [38–42, 55–60].
1 In the methods based on the event-triggered strategy, the local estimators $\hat{x}_t^i$ and $\hat{y}_t^i$ are determined by node $i$'s own estimators and the latest information sent from its neighbors $j \in \mathcal{N}^i$ (at the latest triggering time of node $j$ before $t$).
2 Assume that there is no information transmission and calculation at time $t = 0$, which further means that the event-triggered and stochastic gradient processes are not executed.
$$x_{t+1}^i = x_t^i + \sum_{j=1}^m a^{ij}\big(\hat{x}_t^j - \hat{x}_t^i\big) - \eta y_t^i, \tag{5.4a}$$
$$s_{t+1}^i = x_{t+1}^i + \alpha\big(x_{t+1}^i - x_t^i\big), \tag{5.4b}$$
where $a^{ij}$ is the weight between $i$ and $j$, the step-size $\eta > 0$, and the momentum coefficient $0 < \alpha < 1$.
4: Choose $\chi_{t+1}^i$ uniformly and randomly from $\{1, \dots, n^i\}$.
5: Update $g_{t+1}^i$ according to:
$$g_{t+1}^i = \nabla f^{i,\chi_{t+1}^i}\big(s_{t+1}^i\big) - \nabla f^{i,\chi_{t+1}^i}\big(e_{t+1}^{i,\chi_{t+1}^i}\big) + \frac{1}{n^i}\sum_{h=1}^{n^i}\nabla f^{i,h}\big(e_{t+1}^{i,h}\big).$$
6: Take $e_{t+2}^{i,\chi_{t+1}^i} = s_{t+1}^i$ and replace $\nabla f^{i,\chi_{t+1}^i}(e_{t+2}^{i,\chi_{t+1}^i})$ by $\nabla f^{i,\chi_{t+1}^i}(s_{t+1}^i)$ in the $\chi_{t+1}^i$th gradient table position. All other estimators $e_{t+2}^{i,h}$, $\forall h \ne \chi_{t+1}^i$, and gradient entries in the table keep unchanged, i.e., $e_{t+2}^{i,h} = e_{t+1}^{i,h}$ and $\nabla f^{i,h}(e_{t+2}^{i,h}) = \nabla f^{i,h}(e_{t+1}^{i,h})$ for all $h \ne \chi_{t+1}^i$.
7: Update the estimator $y_{t+1}^i$ according to:
$$y_{t+1}^i = y_t^i + \sum_{j=1}^m a^{ij}\big(\hat{y}_t^j - \hat{y}_t^i\big) + g_{t+1}^i - g_t^i. \tag{5.4c}$$
8: Calculate the measurement errors $\varepsilon_t^{i,x}$, $\varepsilon_t^{i,y}$ in (5.3), and then test the triggering condition in (5.2).
9: if the triggering condition in (5.2) is satisfied then
10: Broadcast $x_{t+1}^i$ and $y_{t+1}^i$ to the neighbors $j \in \mathcal{N}^i$, and update the latest triggering time.
11: end if
12: end for
According to the measurement errors (5.3) and Assumption 5.2(ii), the updates
(5.4a)–(5.4c) of Algorithm 3 can be represented as follows:
$$\begin{cases} x_{t+1}^i = \sum_{j=1}^m a_{ij} x_t^j + \hat{\varepsilon}_t^{i,x} - \eta y_t^i \\[2pt] s_{t+1}^i = x_{t+1}^i + \alpha\left(x_{t+1}^i - x_t^i\right) \\[2pt] y_{t+1}^i = \sum_{j=1}^m a_{ij} y_t^j + \hat{\varepsilon}_t^{i,y} + g_{t+1}^i - g_t^i \end{cases}, \qquad (5.5)$$
Noting that
$$\sum_{h=1}^{n_i} \nabla f^{i,h}\left(e_{t+1}^{i,h}\right) = \sum_{h=1}^{n_i} \nabla f^{i,h}\left(e_t^{i,h}\right) + \nabla f^{i,\chi_t^i}\left(s_t^i\right) - \nabla f^{i,\chi_t^i}\left(e_t^{i,\chi_t^i}\right),$$
the above cost can be avoided and the summation term
$\sum_{h=1}^{n_i} \nabla f^{i,h}(e_{t+1}^{i,h})$ can be maintained in a computationally efficient way. However, we also point out that
the $O(n_i)$-order computational cost cannot be overcome in the existing methods
[19, 20, 22, 23, 25] that use the deterministic gradient tracking technique. From this point
of view, the per-iteration computation cost of ET-DASG does not grow with $n_i$, and ET-DASG can
therefore improve computation efficiency.
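The bookkeeping behind this running sum is the familiar SAGA-style gradient table. A minimal sketch, with hypothetical class and method names, is given below; each `replace` call costs $O(p)$, so refreshing one component gradient per iteration keeps the per-iteration cost independent of $n_i$.

```python
import numpy as np

class GradientTable:
    """SAGA-style table of component gradients for one node i.

    Stores one gradient per local sample so that the running sum
    sum_h grad f^{i,h}(e_t^{i,h}) is maintained incrementally instead
    of being recomputed at O(n_i * p) cost every iteration.
    """

    def __init__(self, grads):
        # grads: (n_i, p) array holding grad f^{i,h}(e_0^{i,h}) for all h.
        self.table = np.array(grads, dtype=float)
        self.running_sum = self.table.sum(axis=0)

    def replace(self, h, new_grad):
        # Update the h-th entry and the running sum incrementally (Step 6).
        self.running_sum += new_grad - self.table[h]
        self.table[h] = new_grad

    def average(self):
        # The term (1/n_i) * sum_h grad f^{i,h}(e^{i,h}) used in Step 5.
        return self.running_sum / len(self.table)
```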
Define $\hat{\varepsilon}_t^x = [(\hat{\varepsilon}_t^{1,x})^T, \ldots, (\hat{\varepsilon}_t^{m,x})^T]^T$, $\hat{\varepsilon}_t^y = [(\hat{\varepsilon}_t^{1,y})^T, \ldots, (\hat{\varepsilon}_t^{m,y})^T]^T$, $x_t =
[(x_t^1)^T, \ldots, (x_t^m)^T]^T$, $y_t = [(y_t^1)^T, \ldots, (y_t^m)^T]^T$, $s_t = [(s_t^1)^T, \ldots, (s_t^m)^T]^T$, $g_t =
[(g_t^1)^T, \ldots, (g_t^m)^T]^T$, and $A = \hat{A} \otimes I_p$. Then, Algorithm 3 can be rewritten in a
compact form,
$$\begin{cases} x_{t+1} = Ax_t + \hat{\varepsilon}_t^x - \eta y_t \\ s_{t+1} = x_{t+1} + \alpha(x_{t+1} - x_t) \\ y_{t+1} = Ay_t + \hat{\varepsilon}_t^y + g_{t+1} - g_t \end{cases}. \qquad (5.6)$$
In what follows, we use $\mathcal{F}_t$ to denote the history of the system up until time $t$.
Then, from the prior results of [2, 55, 57], it is clear that the stochastic averaging
gradient is unbiased. In particular, given $\mathcal{F}_t$, one gets
$$\mathbb{E}\left[g_t^i \mid \mathcal{F}_t\right] = \nabla f^i\left(s_t^i\right). \qquad (5.7)$$
In this section, we show the theoretical guarantees for the convergence of ET-DASG.
To proceed, several auxiliary variables that will support the subsequent analysis are
defined below:
$$\bar{x}_t = \frac{1}{m}\left(1_m^T \otimes I_p\right)x_t, \qquad \bar{y}_t = \frac{1}{m}\left(1_m^T \otimes I_p\right)y_t,$$
$$\nabla F(s_t) = \left[\nabla f^1(s_t^1)^T, \ldots, \nabla f^m(s_t^m)^T\right]^T, \qquad A_\infty = \frac{1_m 1_m^T}{m} \otimes I_p,$$
$$\nabla\bar{F}(s_t) = \frac{1}{m}\left(1_m^T \otimes I_p\right)\nabla F(s_t), \qquad \bar{g}_t = \frac{1}{m}\left(1_m^T \otimes I_p\right)g_t.$$
Then, note that (5.6) is a stochastic gradient tracking method [55, 59, 60]. Under
the given initial conditions, the measurement error (5.3), and Assumption 5.2(ii),
the following conclusions can be clearly drawn through induction:
$$\bar{y}_t = \bar{g}_t, \qquad \frac{1}{m}\left(1_m^T \otimes I_p\right)\hat{\varepsilon}_t^x = 0, \qquad \frac{1}{m}\left(1_m^T \otimes I_p\right)\hat{\varepsilon}_t^y = 0, \qquad \forall t \geq 0.$$
Recalling (5.7), it is clear to verify that $\mathbb{E}[\bar{g}_t \mid \mathcal{F}_t] = \nabla\bar{F}(s_t)$.
Moreover, for each node i ∈ V, the average optimality gap between the auxiliary
variables eti,h , h ∈ {1, . . . , ni }, and the optimal solution x ∗ is defined as follows:
$$v_t^i = \frac{1}{n_i}\sum_{h=1}^{n_i}\left\|e_t^{i,h} - x^*\right\|^2, \qquad v_t = \sum_{i=1}^m v_t^i, \qquad \forall t \geq 0. \qquad (5.8)$$
Before establishing the auxiliary results, we state a few useful results in the
following lemma, of which the proofs can be found in [40, 41, 46, 59, 61].
Lemma 5.5 Under Assumptions 5.1–5.2, we possess that
(i) For all $x \in \mathbb{R}^{mp}$, one has $\|Ax - A_\infty x\| \leq \kappa_3\|x - A_\infty x\|$, where $0 < \kappa_3 < 1$.
(ii) For all $x \in \mathbb{R}^p$ and if $0 < \eta \leq 1/\kappa_2$, one gets $\|x - \eta\nabla f(x) - x^*\| \leq (1 - \kappa_1\eta)\|x - x^*\|$.
From Lemma 5.6, we can find that when $x_t^i$, $i \in \mathcal{V}$, approach $x^*$, then $e_t^{i,h}$, $i \in \mathcal{V}$, $h \in \{1, \ldots, n_i\}$, tend to $x^*$, which indicates that $\mathbb{E}[\|g_t - \nabla F(s_t)\|^2]$ diminishes
to zero.
$$\mathbb{E}[v_{t+1}] \leq \left(1 - \frac{1}{\hat{n}}\right)\mathbb{E}[v_t] + \frac{2}{\check{n}}\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{2}{\check{n}}\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right], \qquad (5.10)$$
Proof Following from (5.6) and the fact that $\|A - I_{mp}\|^2 \leq 4$, one gets
where the first inequality has exploited the facts that $\bar{y}_t = \bar{g}_t$, $\forall t \geq 0$, and $(1_m^T \otimes I_p)\nabla F(x^*) = 0$. In view of (5.12) and (5.13), one has
Following from (5.14) and with reference to (5.6), it can be obtained that
Lemma 5.9 If Assumptions 5.1 and 5.2 hold, we possess the following recursive
relation: $\forall t \geq 1$,
$$\mathbb{E}\left[\|x_{t+1} - A_\infty x_{t+1}\|^2\right] \leq \frac{1+\kappa_3^2}{2}\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{4\eta^2}{1-\kappa_3^2}\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{4}{1-\kappa_3^2}\mathbb{E}\left[\|\hat{\varepsilon}_t^x\|^2\right]. \qquad (5.16)$$
Proof Utilizing the update of $x_t$ in (5.6), one acquires
$$\|x_{t+1} - A_\infty x_{t+1}\|^2 = \left\|Ax_t + \hat{\varepsilon}_t^x - \eta y_t - A_\infty\left(Ax_t + \hat{\varepsilon}_t^x - \eta y_t\right)\right\|^2 = \left\|Ax_t - A_\infty x_t + \hat{\varepsilon}_t^x - \eta y_t + \eta A_\infty y_t\right\|^2, \qquad (5.17)$$
where the second equality has employed the facts that $A_\infty A = A_\infty$ and $A_\infty\hat{\varepsilon}_t^x = 0$, $\forall t \geq 0$. Recalling the well-known inequality $\|c + d\|^2 \leq (1+a)\|c\|^2 + (1 + 1/a)\|d\|^2$, $\forall c, d \in \mathbb{R}^{mp}$, for any $a > 0$, it deduces from Lemma 5.5(i) that
$$\|x_{t+1} - A_\infty x_{t+1}\|^2 \leq (1+a)\kappa_3^2\|x_t - A_\infty x_t\|^2 + 2(1+a^{-1})\eta^2\|y_t - A_\infty y_t\|^2 + 2(1+a^{-1})\|\hat{\varepsilon}_t^x\|^2. \qquad (5.18)$$
$$\mathbb{E}\left[m\|\bar{x}_{t+1} - x^*\|^2\right] \leq \left(1 - \frac{\eta\kappa_1}{2}\right)\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right] + \frac{4\kappa_2^2\eta}{\kappa_1}\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{4\alpha^2\kappa_2^2\eta}{\kappa_1}\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + \frac{2\eta^2\kappa_2^2}{m}\mathbb{E}[v_t]. \qquad (5.19)$$
Proof Multiplying $(1_m^T \otimes I_p)/m$ on the update of $x_t$ in (5.6) and in light of the fact
that $((1_m^T \otimes I_p)/m)\hat{\varepsilon}_t^x = 0$, one has
where the last equality has leveraged the facts that $\bar{y}_t = \bar{g}_t$ and $\mathbb{E}[\bar{g}_t \mid \mathcal{F}_t] = \nabla\bar{F}(s_t)$.
Considering that $\langle\nabla f(\bar{x}_t) - \nabla\bar{F}(s_t), \mathbb{E}[\nabla\bar{F}(s_t) - \bar{g}_t \mid \mathcal{F}_t]\rangle = 0$, we have
$$\mathbb{E}\left[\|\nabla f(\bar{x}_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t\right] = \mathbb{E}\left[\|\nabla f(\bar{x}_t) - \nabla\bar{F}(s_t)\|^2 \mid \mathcal{F}_t\right] + \mathbb{E}\left[\|\nabla\bar{F}(s_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t\right]. \qquad (5.21)$$
Notice that the $\{g_t^i\}$ are independent of each other given $\mathcal{F}_t$. Then, $\mathbb{E}[\sum_{i\neq j}\langle g_t^i - \nabla f^i(s_t^i), g_t^j - \nabla f^j(s_t^j)\rangle \mid \mathcal{F}_t] = 0$ holds. Thus, the last term in (5.21) is equal to
$$\mathbb{E}\left[\|\nabla\bar{F}(s_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t\right] = \frac{1}{m^2}\mathbb{E}\left[\left\|\sum_{i=1}^m\left(g_t^i - \nabla f^i(s_t^i)\right)\right\|^2 \,\middle|\, \mathcal{F}_t\right],$$
where $b > 0$ is an arbitrary positive constant. Inserting $b = \kappa_1$ into (5.24) gives rise
to
$$\mathbb{E}\left[m\|\bar{x}_{t+1} - x^*\|^2\right] \leq (1 - \eta\kappa_1)\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right] + \frac{2\kappa_2^2\eta}{\kappa_1}\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{2\alpha^2\kappa_2^2\eta}{\kappa_1}\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + \frac{\eta^2}{m}\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right]. \qquad (5.25)$$
We note here that $1 - \eta\kappa_1 + 8\eta^2\kappa_2^2/m \leq 1 - \eta\kappa_1/2$ if $0 < \eta \leq \kappa_1 m/(16\kappa_2^2)$,
$4\eta/m + 2\kappa_1^{-1} \leq 4\kappa_1^{-1}$ if $0 < \eta \leq 1/(2\kappa_1)$, and $8\eta/m + 2\kappa_1^{-1} \leq 4\kappa_1^{-1}$ if $0 < \eta \leq
1/(4\kappa_1)$. Therefore, the results of (5.19) in Lemma 5.10 can be derived if $0 < \eta \leq
\kappa_1/(16\kappa_2^2)$.
Finally, we establish a bound for the mean-squared stochastic gradient tracking
error E[||yt − A∞ yt ||2 ].
Lemma 5.11 If $0 < \eta \leq (1 - \kappa_3^2)/(99\kappa_2)$ and supposing that Assumptions 5.1 and
5.2 hold, we possess the following recursive relation: $\forall t \geq 1$,
$$\begin{aligned} \mathbb{E}\left[\|y_{t+1} - A_\infty y_{t+1}\|^2\right] &\leq \frac{1240\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{266\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right] \\ &\quad + \frac{3+\kappa_3^2}{4}\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{188\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] \\ &\quad + \frac{42\kappa_2^2}{1-\kappa_3^2}\mathbb{E}[v_t] + \frac{640\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[\|\hat{\varepsilon}_t^x\|^2\right] + \frac{4}{1-\kappa_3^2}\mathbb{E}\left[\|\hat{\varepsilon}_t^y\|^2\right]. \end{aligned} \qquad (5.27)$$
Proof Utilizing the update of $y_t$ in (5.6) and the fact that $A_\infty A = A_\infty$, one acquires
$$\|y_{t+1} - A_\infty y_{t+1}\|^2 = \left\|Ay_t - A_\infty y_t + (I_{mp} - A_\infty)(g_{t+1} - g_t) + \hat{\varepsilon}_t^y\right\|^2 \leq (1+c)\|Ay_t - A_\infty y_t\|^2 + 2(1+c^{-1})\|g_{t+1} - g_t\|^2 + 2(1+c^{-1})\|\hat{\varepsilon}_t^y\|^2, \qquad (5.28)$$
where the inequality $\|c' + d'\|^2 \leq (1+c)\|c'\|^2 + (1+1/c)\|d'\|^2$, $\forall c', d' \in \mathbb{R}^{mp}$, for any $c > 0$, has been employed to derive (5.28).
Selecting $c = (1 - \kappa_3^2)/(2\kappa_3^2)$ in (5.28) and then taking the total expectation, we
have
$$\mathbb{E}\left[\|y_{t+1} - A_\infty y_{t+1}\|^2\right] \leq \frac{1+\kappa_3^2}{2}\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{4}{1-\kappa_3^2}\mathbb{E}\left[\|g_{t+1} - g_t\|^2\right] + \frac{4}{1-\kappa_3^2}\mathbb{E}\left[\|\hat{\varepsilon}_t^y\|^2\right], \qquad (5.29)$$
where $\|I_{mp} - A_\infty\| = 1$ and Lemma 5.5(i) have been applied to acquire (5.29).
Next, we proceed to analyze $\mathbb{E}[\|g_{t+1} - g_t\|^2]$. First, we have
$$\begin{aligned} \mathbb{E}\left[\|g_{t+1} - g_t\|^2\right] &\leq 2\mathbb{E}\left[\|g_{t+1} - g_t - (\nabla F(s_{t+1}) - \nabla F(s_t))\|^2\right] + 2\mathbb{E}\left[\|\nabla F(s_{t+1}) - \nabla F(s_t)\|^2\right] \\ &\leq 2\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right] + 2\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] + 2\kappa_2^2\mathbb{E}\left[\|x_{t+1} - x_t + \alpha(x_{t+1} - x_t) - \alpha(x_t - x_{t-1})\|^2\right] \\ &\leq 4\kappa_2^2\alpha^2\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + 2\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] + 16\kappa_2^2\mathbb{E}\left[\|x_{t+1} - x_t\|^2\right] + 2\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right], \end{aligned} \qquad (5.30)$$
where the third and second inequalities have exploited $0 < \alpha < 1$ and $\mathbb{E}[\langle g_{t+1} -
\nabla F(s_{t+1}), g_t - \nabla F(s_t)\rangle] = \mathbb{E}[\mathbb{E}[\langle g_{t+1} - \nabla F(s_{t+1}), g_t - \nabla F(s_t)\rangle \mid \mathcal{F}_{t+1}]] = 0$,
respectively. Using (5.15) with the requirement that $0 < \eta \leq 1/(32\kappa_2)$, it can be
deduced that
We next bound E[||gt +1 − ∇F (st +1 )||2 ]. Following directly from Lemma 5.6, one
has
Note that $\mathbb{E}[\|x_{t+1} - A_\infty x_{t+1}\|^2] \leq 4\eta^2\mathbb{E}[\|y_t - A_\infty y_t\|^2] + 2\kappa_3^2\mathbb{E}[\|x_t - A_\infty x_t\|^2] +
4\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2]$ if we select $a = 1$ in (5.18), and $\mathbb{E}[m\|\bar{x}_{t+1} - x^*\|^2] \leq \alpha^2\mathbb{E}[\|x_t -
x_{t-1}\|^2] + (\eta^2/m)\mathbb{E}[\|g_t - \nabla F(s_t)\|^2] + \mathbb{E}[\|x_t - A_\infty x_t\|^2] + 2\mathbb{E}[m\|\bar{x}_t - x^*\|^2]$ if
we select $b = 1/\eta$ in (5.24) for $0 < \eta < 2/\kappa_2$. Substituting the above results and
(5.10) into (5.33) leads to the following: if $0 < \eta \leq 1/(32\kappa_2)$,
$$\begin{aligned} \mathbb{E}\left[\|y_{t+1} - A_\infty y_{t+1}\|^2\right] &\leq \frac{266\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right] + \frac{188\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] \\ &\quad + \frac{640\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[\|\hat{\varepsilon}_t^x\|^2\right] + \frac{4}{1-\kappa_3^2}\mathbb{E}\left[\|\hat{\varepsilon}_t^y\|^2\right] + \frac{42\kappa_2^2}{1-\kappa_3^2}\mathbb{E}[v_t] \\ &\quad + \left(\frac{1+\kappa_3^2}{2} + \frac{2176\kappa_2^2\eta^2}{1-\kappa_3^2}\right)\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{1240\kappa_2^2}{1-\kappa_3^2}\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right]. \end{aligned} \qquad (5.36)$$
Hence, choosing $0 < \eta \leq (1-\kappa_3^2)/(99\kappa_2)$ achieves the results of (5.27) in
Lemma 5.11.
Next, we show that ET-DASG linearly converges in the mean. In the theoretical
procedure, we first construct a linear matrix inequality based on the results in
Lemmas 5.7–5.11. Then, similar to other works [20, 22, 24, 25], we further prove
that the spectral radius of the coefficient matrix is less than 1. After iterating the
linear matrix inequality, we can finally conclude that ET-DASG linearly converges in the
mean without other complex operations.
Theorem 5.12 Consider ET-DASG in Algorithm 3 under Assumptions 5.1–5.2. If
the step-size $\eta$ is chosen from the interval
$$0 < \eta \leq \min\left\{\frac{(1-\kappa_3^2)\check{n}}{12\gamma\kappa_2\hat{n}},\ \frac{m\check{n}(1-9\alpha^2)}{34\hat{n}\kappa_1\gamma^2},\ \frac{1-\kappa_3^2}{99\kappa_2},\ \frac{1}{\kappa_2\sqrt{\Lambda(\gamma,\kappa_3,\hat{n},\check{n})}}\right\},$$
then the estimator $x_t^i$, $i \in \mathcal{V}$, converges in the mean to the exact optimal solution to
problem (5.1) with a linear convergence rate $O(\lambda^t)$, where $0 < \lambda < 1$, $0 < \alpha <
1/3$, and $\Lambda(\gamma,\kappa_3,\hat{n},\check{n})$ is a constant specified in the proof.
Proof To begin with, we jointly write (5.10), (5.11), (5.16), (5.19), and (5.27) as a
linear matrix inequality, i.e.,
$$\theta_{t+1} \leq \Gamma\theta_t + \nu_t, \qquad (5.37)$$
with $\Delta_1 = \frac{4}{1-\kappa_3^2}\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2]$, $\Delta_2 = 4\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2]$, as well as $\Delta_3 = \frac{640\kappa_2^2}{1-\kappa_3^2}\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2] +
\frac{4}{1-\kappa_3^2}\mathbb{E}[\|\hat{\varepsilon}_t^y\|^2]$, and
$$\Gamma = \begin{bmatrix} \dfrac{1+\kappa_3^2}{2} & 0 & 0 & 0 & \dfrac{4\kappa_2^2\eta^2}{1-\kappa_3^2} \\[6pt] \dfrac{4\kappa_2^2\eta}{\kappa_1} & 1 - \dfrac{\eta\kappa_1}{2} & \dfrac{4\kappa_2^2\alpha^2\eta}{\kappa_1} & \dfrac{2\eta^2\kappa_2^2}{m} & 0 \\[6pt] 8 + 96\eta^2\kappa_2^2 & 144\eta^2\kappa_2^2 & 160\eta^2\kappa_2^2 & 32\eta^2\kappa_2^2 & 16\eta^2\kappa_2^2 \\[6pt] \dfrac{2}{\check{n}} & \dfrac{2}{\check{n}} & 0 & 1 - \dfrac{1}{\hat{n}} & 0 \\[6pt] \dfrac{1240}{1-\kappa_3^2} & \dfrac{266}{1-\kappa_3^2} & \dfrac{188}{1-\kappa_3^2} & \dfrac{42}{1-\kappa_3^2} & \dfrac{3+\kappa_3^2}{4} \end{bmatrix}.$$
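Given concrete problem constants, the contraction requirement $\rho(\Gamma) < 1$ can be checked numerically for a candidate step-size. The sketch below simply assembles the matrix from the display above and should be treated as illustrative rather than definitive; all names are our own.

```python
import numpy as np

def spectral_radius_ok(eta, alpha, k1, k2, k3, n_hat, n_check, m):
    """Check rho(Gamma) < 1 for a candidate step-size eta.

    k1, k2, k3 play the roles of kappa_1, kappa_2, kappa_3, and
    n_hat, n_check those of the sample-size bounds in the display above.
    """
    c = 1.0 - k3**2
    G = np.array([
        [(1 + k3**2) / 2, 0, 0, 0, 4 * k2**2 * eta**2 / c],
        [4 * k2**2 * eta / k1, 1 - eta * k1 / 2,
         4 * k2**2 * alpha**2 * eta / k1, 2 * eta**2 * k2**2 / m, 0],
        [8 + 96 * eta**2 * k2**2, 144 * eta**2 * k2**2,
         160 * eta**2 * k2**2, 32 * eta**2 * k2**2, 16 * eta**2 * k2**2],
        [2 / n_check, 2 / n_check, 0, 1 - 1 / n_hat, 0],
        [1240 / c, 266 / c, 188 / c, 42 / c, (3 + k3**2) / 4],
    ])
    # Spectral radius = largest eigenvalue magnitude.
    return np.max(np.abs(np.linalg.eigvals(G))) < 1
```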
Then, we first infer the range of $\eta$, as in the existing works [20, 22, 25, 40], satisfying
$\rho(\Gamma) < 1$ to establish the linear convergence of ET-DASG. Recalling
Lemma 5.5(iii), to ensure $\rho(\Gamma) < 1$, we can derive the range of $\eta$ and a positive
vector $w = [w_1, w_2, w_3, w_4, w_5]^T$ such that $\Gamma w < w$ holds, which equivalently
indicates that
$$\begin{cases} \eta^2\kappa_2^2 < \dfrac{(1-\kappa_3^2)^2}{8}\dfrac{w_1}{w_5} \\[6pt] \eta\kappa_2^2 < \dfrac{m\kappa_1}{8}\dfrac{w_2}{w_4} - \dfrac{m\kappa_2^2}{\kappa_1}\dfrac{w_1}{w_4} - \dfrac{m\kappa_2^2\alpha^2}{\kappa_1}\dfrac{w_3}{w_4} \\[6pt] \eta^2\kappa_2^2 < \dfrac{w_3 - 8w_1}{160w_3 + 96w_1 + 32w_4 + 16w_5} \\[6pt] \dfrac{2\hat{n}w_1}{\check{n}} + \dfrac{2\hat{n}w_2}{\check{n}} < w_4 \\[6pt] \dfrac{4960w_1}{(1-\kappa_3^2)^2} + \dfrac{1064w_2}{(1-\kappa_3^2)^2} + \dfrac{752w_3}{(1-\kappa_3^2)^2} + \dfrac{168w_4}{(1-\kappa_3^2)^2} < w_5 \end{cases} \qquad (5.38)$$
Obviously, to achieve a certain feasible range of $\eta$, we must ensure that the
right-hand sides of the first three conditions related to the step-size $\eta$ in (5.38) are positive.
According to this observation, it suffices to derive another two relations among the
elements in $w$, i.e.,
$$\begin{cases} w_3 > 8w_1 \\[4pt] w_2 > \dfrac{8\kappa_2^2 w_1}{\kappa_1^2} + \dfrac{8\kappa_2^2\alpha^2 w_3}{\kappa_1^2} \end{cases}. \qquad (5.39)$$
For instance, we can set $w_1 = 1$ and $w_3 = 9$ to satisfy the first relation, in which case the second relation reduces to
$$w_2 > \frac{8\kappa_2^2}{\kappa_1^2}\left(1 + 9\alpha^2\right).$$
Here, we assume that $0 < \alpha < 1/3$, and we thus set $w_2 = 16\gamma^2$, where $\gamma =
\kappa_2/\kappa_1$. Third, we note that the fourth condition in (5.38) can be written as
Since $\gamma > 1$, we therefore set $w_4 = 34\hat{n}\gamma^2/\check{n}$. Finally, in order to ensure that the
fifth condition in (5.38) is true, $w_5$ should satisfy
$$w_5 > \frac{8}{(1-\kappa_3^2)^2}\left(620w_1 + 133w_2 + 94w_3 + 21w_4\right) = \frac{16}{(1-\kappa_3^2)^2}\left(733 + 1064\gamma^2 + \frac{357\hat{n}\gamma^2}{\check{n}}\right).$$
If
$$\eta < \frac{m\kappa_1}{8\kappa_2^2}\frac{w_2}{w_4} - \frac{m}{\kappa_1}\frac{w_1}{w_4} - \frac{m\alpha^2}{\kappa_1}\frac{w_3}{w_4} = \frac{m\check{n}}{34\hat{n}\kappa_1\gamma^2}\left(1 - 9\alpha^2\right),$$
then the second condition in (5.38) holds. Next, to make the third condition in (5.38)
hold, the step-size $\eta$ must satisfy the following:
$$\eta < \frac{1}{\kappa_2\sqrt{160w_3 + 96w_1 + 32w_4 + 16w_5}} = \frac{1}{\kappa_2\sqrt{\Lambda(\gamma,\kappa_3,\hat{n},\check{n})}},$$
where $\Lambda(\gamma,\kappa_3,\hat{n},\check{n}) = 1536 + \dfrac{1088\hat{n}\gamma^2}{\check{n}} + \dfrac{551424\hat{n}\gamma^2}{\check{n}(1-\kappa_3^2)^2}$. Therefore, it is clear to
verify that if $\eta$ satisfies the conditions in Theorem 5.12, then $\Gamma w < w$ holds. Thus,
$\rho(\Gamma) < 1$ is satisfied.
According to the above results, i.e., $\rho(\Gamma) < 1$, we continue to establish the
remaining proof of the linear convergence rate of ET-DASG. To this aim, we
recursively iterate (5.37) as follows: $\forall t \geq 1$,
$$\theta_t \leq \Gamma^t\theta_0 + \sum_{l=0}^{t-1}\Gamma^{t-l-1}\nu_l. \qquad (5.40)$$
From the triggering condition (5.2) and the measurement errors (5.3), for all $i \in \mathcal{V}$
and $t \geq 0$, we can infer that $\|\varepsilon_t^{i,x}\|^2 = 0$ and $\|\varepsilon_t^{i,y}\|^2 = 0$ if $t = t_{k(i,t)}^i$, and $\|\varepsilon_t^{i,x}\|^2 \leq
C\kappa_4^t$, $\|\varepsilon_t^{i,y}\|^2 \leq C\kappa_4^t$ otherwise. Moreover, if the momentum coefficient $\alpha$ and the
constant step-size $\eta$ satisfy the above conditions ($\rho(\Gamma) < 1$), it suffices to verify
that $\|\Gamma^t\| \leq Q\kappa_5^t$ and $\|\Gamma^{t-l-1}\nu_l\| \leq Q\kappa_5^t$ for some constant $Q > 0$ and $0 <
\max\{\kappa_4, \rho(\Gamma)\} \leq \kappa_5 < 1$. Therefore, it can be deduced from (5.40) that
$$\|\theta_t\| \leq \|\Gamma^t\|\|\theta_0\| + \sum_{l=0}^{t-1}\left\|\Gamma^{t-l-1}\nu_l\right\| \leq Q\|\theta_0\|\kappa_5^t + \sum_{l=0}^{t-1}Q\kappa_5^t = \omega_t\kappa_5^t, \qquad (5.41)$$
where $\omega_t = Q\|\theta_0\| + Qt$. Following from (5.41), it holds that $\lim_{t\to\infty}\|\theta_t\|/\kappa_6^t =
\lim_{t\to\infty}\omega_t(\kappa_5/\kappa_6)^t = 0$ for all $\kappa_5 < \kappa_6 < 1$. Hence, there exist a positive constant $Q_1$ and
an arbitrarily small constant $\nu$ such that $\kappa_6 = \kappa_5 + \nu$ and $\|\theta_t\| \leq Q_1\kappa_6^t$ for all $t \geq 0$;
then,
existing work [16]. In terms of this issue, we will conduct a detailed study in future
work.
To verify the effectiveness of the designed event-triggered communication
strategy, in the following we will prove that for each node the time interval
between two successive triggering instants is larger than the iteration interval, i.e.,
$t_{k(i,t)+1}^i - t_{k(i,t)}^i \geq 2$, $\forall i \in \mathcal{V}$, $t \geq 1$.
Theorem 5.14 Suppose that Assumptions 5.1 and 5.2 hold. Considering Algorithm 3,
if the momentum coefficient $\alpha$ and the constant step-size $\eta$ are selected
to satisfy Theorem 5.12, then $t_{k(i,t)+1}^i - t_{k(i,t)}^i \geq 2$, $\forall i \in \mathcal{V}$, $t \geq 1$, if the event-triggered
parameters $C$ and $\kappa_4$ satisfy $C > \left(12Q_1 + 128Q_1\kappa_2^2\right)\left(1 + \kappa_4^2\right)/\kappa_4^2$.
Proof It is deduced from (5.3) that
$$\|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 = \left\|x_{t_{k(i,t)}^i}^i - x_t^i\right\|^2 + \left\|y_{t_{k(i,t)}^i}^i - y_t^i\right\|^2,$$
where in the first inequality we have used $\mathbb{E}[\bar{g}_t \mid \mathcal{F}_t] = \nabla\bar{F}(s_t)$, and in the second
inequality we have applied $\mathbb{E}[\|\nabla\bar{F}(s_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t] = (1/m^2)\mathbb{E}[\|\nabla F(s_t) - g_t\|^2 \mid \mathcal{F}_t]$
and Assumption 5.1(ii). Similarly, one can get
$$\leq \frac{4\kappa_2^2}{m}\mathbb{E}\left[\left\|x_{t_{k(i,t)}^i} - A_\infty x_{t_{k(i,t)}^i}\right\|^2 \middle| \mathcal{F}_t\right] + \frac{2\kappa_2^2}{m}\mathbb{E}\left[\left\|x_{t_{k(i,t)}^i} - x_{t_{k(i,t)}^i-1}\right\|^2 \middle| \mathcal{F}_t\right] + 4\kappa_2^2\mathbb{E}\left[\left\|\bar{x}_{t_{k(i,t)}^i} - x^*\right\|^2 \middle| \mathcal{F}_t\right] + \frac{1}{m^2}\mathbb{E}\left[\left\|\nabla F(s_{t_{k(i,t)}^i}) - g_{t_{k(i,t)}^i}\right\|^2 \middle| \mathcal{F}_t\right]. \qquad (5.45)$$
Recalling $\|\theta_t\| \leq Q_1\kappa_6^t$ from Theorem 5.12, we infer from the definition of $\theta_t$ that
$\mathbb{E}[\|x_t - A_\infty x_t\|^2] \leq Q_1\kappa_6^t$, $\mathbb{E}[m\|\bar{x}_t - x^*\|^2] \leq Q_1\kappa_6^t$, $\mathbb{E}[\|x_t - x_{t-1}\|^2] \leq Q_1\kappa_6^t$,
$\mathbb{E}[v_t] \leq Q_1\kappa_6^t$, and $\mathbb{E}[\|y_t - A_\infty y_t\|^2] \leq Q_1\kappa_6^t$. Next, we apply (5.9), (5.43), (5.44),
and (5.45) to proceed. If $0 < \alpha < 1$, we possess that
$$\|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 \leq \left(12Q_1 + 128Q_1\kappa_2^2\right)\kappa_6^t + \left(12Q_1 + 128Q_1\kappa_2^2\right)\kappa_6^{t_{k(i,t)}^i}. \qquad (5.46)$$
It is also worth noting that when the triggering condition is not satisfied, the next
event will not happen; that is, $\|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 \leq C\kappa_4^t$, $\forall t \geq 0$, $i \in \mathcal{V}$. Thus, when
$t = t_{k(i,t)+1}^i$, the following inequality must be satisfied:
$$C\kappa_4^{t_{k(i,t)+1}^i} \leq \left\|\varepsilon_{t_{k(i,t)+1}^i}^{i,x}\right\|^2 + \left\|\varepsilon_{t_{k(i,t)+1}^i}^{i,y}\right\|^2. \qquad (5.47)$$
Since $\kappa_4 < \kappa_6$, it is clear to deduce that (5.48) must hold if there is a constant $C$
satisfying
$$t_{k(i,t)+1}^i - t_{k(i,t)}^i \geq \ln\left(\frac{12Q_1 + 128Q_1\kappa_2^2}{C - \left(12Q_1 + 128Q_1\kappa_2^2\right)}\right)\bigg/\ln\kappa_4. \qquad (5.49)$$
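As a quick numerical sanity check of (5.49), the lower bound on the inter-event interval can be evaluated directly; the function below is a sketch with our own naming.

```python
import numpy as np

def min_inter_event_gap(Q1, k2, k4, C):
    """Evaluate the lower bound (5.49) on the gap between two successive
    triggering times.

    Requires C > 12*Q1 + 128*Q1*k2**2 so the logarithm argument is valid.
    """
    D = 12 * Q1 + 128 * Q1 * k2**2
    return np.log(D / (C - D)) / np.log(k4)

# With C > D * (1 + k4**2) / k4**2, as in Theorem 5.14, one has
# D / (C - D) <= k4**2, and the bound above evaluates to >= 2.
```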
Fig. 5.1 Four undirected and connected network topologies composed of 10 nodes. (a) Random
network with a connection probability pc = 0.4. (b) Complete network. (c) Cycle network. (d)
Star network
data has dimension $p = 112$. For each dataset, all features have been preprocessed
and normalized to the unit vector. The distributed logistic regression problem takes
the form
$$\min_{x\in\mathbb{R}^p}\ \frac{1}{m}\sum_{i=1}^m\frac{1}{n_i}\sum_{h=1}^{n_i}\ln\left(1 + \exp\left(-b^{i,h}(c^{i,h})^T x\right)\right) + \frac{\pi}{2}\|x\|_2^2,$$
where each local objective is
$$f^i(x) = \frac{1}{n_i}\sum_{h=1}^{n_i}\ln\left(1 + \exp\left(-b^{i,h}(c^{i,h})^T x\right)\right) + \frac{\pi}{2}\|x\|_2^2,$$
with $b^{i,h} \in \{-1, 1\}$ and $c^{i,h} \in \mathbb{R}^p$ being local data kept by node $i$ for $h \in
\{1, \ldots, n_i\}$; $(\pi/2)\|x\|_2^2$ is the regularization term for avoiding overfitting. In the
simulation, we randomly divide the data among the local nodes, i.e., $\sum_{i=1}^{10} n_i = n$. The
simulation results are described in the following five parts, i.e., (i)–(v). Here, we
also point out that dataset 1 is adopted in (i)–(iv), and datasets 1 and 2
are adopted in (v). Moreover, Fig. 5.1a–d are applied to part (iv), while Fig. 5.1a
is applied to all other parts. For the comparison in parts (iii)–(v), the residual
$(1/m)\log_{10}\left(\sum_{i=1}^m\|x_t^i - x^*\|\right)$ is treated as the comparison metric.
i
0.08
0.04
0.02
xi,1
t
xi,2
t
xi,3
t
0
0 200 400 600 800 1000
Iteration
1
X: 19
Y: 0.971
0.8
Accuracy
0.6
0.4 ET−DASG
0 5 10 15 20
Iteration
Fig. 5.2 Convergence: (a) The transient behaviors of three dimensions (randomly selected) of
state estimator x. (b) The testing accuracy of ET-DASG
convergence as other existing methods [20, 38, 59] even if it uses both the
event-triggered strategy and the variance-reduction technique at the same time.
Then, Table 5.1 summarizes the convergence time in seconds and the number
of local gradient evaluations of ET-DASG and the method in [38] for a specific
residual $10^{-7}$ under two real training datasets. Table 5.1 tells us that ET-DASG
demands less calculation time and fewer local gradient evaluations,
so it can quickly achieve the target and greatly reduce the computation cost in
machine learning tasks. Moreover, Table 5.1 also shows that when the number
$n$ and the dimension $p$ of the datasets are large, the calculation time and the
number of local gradient evaluations of ET-DASG are far less than those of the
Fig. 5.3 The triggering times for the neighbors when 5 nodes run ET-DASG under different event-triggered parameters
Fig. 5.4 Evolution of residuals under different constant step-sizes ($\eta \in \{0.002, 0.004, 0.006, 0.010, 0.012, 0.016\}$) or momentum coefficients ($\alpha \in \{0.1, 0.2, 0.3\}$)
Fig. 5.5 Evolution of residuals under different network topologies (random networks with $p_c = 0.2, 0.4, 0.6$; complete, cycle, and star networks)
Fig. 5.6 Evolution of residuals of ET-DASG and the methods in [20, 38, 59]
The maximum-likelihood estimator for the source's location is found by solving the
following problem:
$$\min_{x\in\mathbb{R}^2}\ \frac{1}{m}\sum_{i=1}^m\frac{1}{n_i}\sum_{h=1}^{n_i}\left(s^{i,h} - \frac{c}{\|x - a^i\|^d}\right)^2.$$
Table 5.1 The convergence time in seconds and the number of local gradient evaluations of ET-DASG and the method in [38] for a specific residual $10^{-7}$ under two real datasets

            ET-DASG               The method in [38]
Datasets    Time (s)    Number    Time (s)     Number
Dataset 1   4.9287      1665      5.8607       29360
Dataset 2   22.8123     3552      132.9638     7.956 × 10^5
Fig. 5.7 The randomly selected 7 paths displayed on top of contours of the log-likelihood function
that there is a stationary acoustic source $x^* \in \mathbb{R}^2$ located at $(55, 55)$ (a location
unknown to any sensor) that we aim to locate in the sensor network.
Based on the above, the randomly selected 7 paths taken by ET-DASG are shown
in Fig. 5.7, which is plotted on top of contours of the log-likelihood. Figure 5.7
illustrates that ET-DASG can successfully find the exact source location like other
verified effective algorithms [62], and is thus suitable for the practical energy-based
source localization problem.
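For concreteness, the least-squares objective of the energy-decay model can be sketched as below; the constants `c` (source energy) and `d` (decay exponent) are illustrative placeholders rather than the values used in the experiment.

```python
import numpy as np

def localization_loss(x, readings, anchors, c=100.0, d=2.0):
    """Least-squares loss of the energy-decay measurement model.

    x        : (2,) candidate source location
    readings : list over nodes of (n_i,) arrays of measurements s^{i,h}
    anchors  : (m, 2) sensor positions a^i
    """
    loss = 0.0
    for s_i, a_i in zip(readings, anchors):
        pred = c / np.linalg.norm(x - a_i) ** d   # predicted received energy
        loss += np.mean((s_i - pred) ** 2)        # average over local samples
    return loss / len(readings)
```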
5.6 Conclusion
is not suitable for networks with random link failures or for problems with
constraints. In addition, privacy issues are also not considered in ET-DASG.
Future work will further focus on investigating the privacy protection of ET-DASG
and extending the algorithm to directed networks and distributed constrained
optimization problems. The asynchronous implementation of ET-DASG over
broadcast-based or gossip-based mechanisms with random link failures is also a
promising research direction.
References
17. A. Nedic, J. Liu, Distributed optimization for control. Ann. Rev. Control Robot. Auton. Syst.
1, 77–103 (2018)
18. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a
general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–
248 (2019)
19. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
20. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans.
Control Netw. Syst. 5(3), 1245–1260 (2018)
21. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
22. R. Xin, U.A. Khan, A linear algorithm for optimization over directed graphs with geometric
convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018)
23. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
24. Y. Sun, A. Daneshmand, G. Scutari, Convergence rate of distributed optimization algorithms
based on gradient tracking (2019). Preprint. arXiv:1905.02637
25. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
26. M. Bin, I. Notarnicola, L. Marconi, G. Notarstefano, A system theoretical perspective to
gradient-tracking algorithms for distributed quadratic optimization, in Proceedings of the
2019 IEEE 58th Conference on Decision and Control (CDC) (2019). https://doi.org/10.1109/
CDC40024.2019.9029824
27. G. Scutari, Y. Sun, Distributed nonconvex constrained optimization over time-varying
digraphs. Math. Program. 176(1), 497–544 (2019)
28. M.I. Qureshi, R. Xin, S. Kar, U.A. Khan, S-ADDOPT: decentralized stochastic first-order
optimization over directed graphs. IEEE Control Syst. Lett. 5(3), 953–958 (2021)
29. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Control Theory
Appl. 13(17), 2800–2810 (2019)
30. X. He, X. Fang, J. Yu, Distributed energy management strategy for reaching cost-driven
optimal operation integrated with wind forecasting in multimicrogrids system. IEEE Trans.
Syst. Man Cybern. Syst. 49(8), 1643–1651 (2019)
31. J. Zhang, K. You, K. Cai, Distributed dual gradient tracking for resource allocation in
unbalanced networks. IEEE Trans. Signal Process. 68, 2186–2198 (2020)
32. H. Xiao, Y. Yu, S. Devadas, On privacy-preserving decentralized optimization through
alternating direction method of multipliers (2019). Preprint. arXiv:1902.06101
33. M. Maros, J. Jalden, ECO-PANDA: a computationally economic, geometrically converging
dual optimization method on time-varying undirected graphs, in Proceedings of the 2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019).
https://doi.org/10.1109/ICASSP.2019.8683797
34. K. Scaman, F. Bach, S. Bubeck, Y. Lee, L. Massoulie, Optimal convergence rates for convex
distributed optimization in networks. J. Mach. Learn. Res. 20(159), 1–31 (2019)
35. C.A. Uribe, S. Lee, A. Gasnikov, A. Nedic, A dual approach for optimal algorithms in
distributed optimization over networks, in Proceedings of the 2020 Information Theory and
Applications Workshop (ITA) (2020). https://doi.org/10.1109/ITA50056.2020.9244951
36. S.A. Alghunaim, E. Ryu, K. Yuan, A.H. Sayed, Decentralized proximal gradient algorithms
with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021)
37. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer Science
& Business Media, Berlin, 2013)
38. R. Xin, D. Jakovetic, U.A. Khan, Distributed Nesterov gradient methods over arbitrary graphs.
IEEE Signal Process. Lett. 26(8), 1247–1251 (2019)
39. Q. Lü, X. Liao, H. Li, T. Huang, A Nesterov-like gradient tracking algorithm for distributed
optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 51(10), 6258–6270
(2021)
40. R. Xin, U. A. Khan, Distributed heavy-ball: A generalization and acceleration of first-order
methods with gradient tracking, IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
41. Q. Lü, X. Liao, H. Li, T. Huang, Achieving acceleration for distributed economic dispatch in
smart grids over directed networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1988–1999 (2020)
42. Y. Zhou, Z. Wang, K. Ji, Y. Liang, Proximal gradient algorithm with momentum and flexible
parameter restart for nonconvex optimization (2020). Preprint. arXiv:2002.11582v1
43. C. Liu, H. Li, Y. Shi, D. Xu, Distributed event-triggered gradient method for constrained convex
minimization. IEEE Trans. Autom. Control 65(2), 778–785 (2020)
44. N. Hayashi, T. Sugiura, Y. Kajiyama, S. Takai, Event-triggered consensus-based optimization
algorithm for smooth and strongly convex cost functions, in Proceedings of the 2018
IEEE Conference on Decision and Control (CDC) (2018). https://doi.org/10.1109/CDC.2018.
8618863
45. C. Li, X. Yu, W. Yu, T. Huang, Z-W. Liu, Distributed event-triggered scheme for economic
dispatch in smart grids. IEEE Trans. Ind. Inform. 12(5), 1775–1785 (2016)
46. K. Zhang, J. Xiong, X. Dai, Q. Lü, On the convergence of event-triggered distributed algorithm
for economic dispatch problem. Int. J. Electr. Power Energy Syst. 122, 1–10 (2020)
47. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and conver-
gence to local minima (2020). Preprint. arXiv:2003.02818v1
48. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep
learning, in Proceedings of the 36th International Conference on Machine Learning (ICML)
(2019), pp. 344–353
49. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
50. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the
proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)
51. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient.
Math. Program. 162(1), 83–112 (2017)
52. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support
for non-strongly convex composite objectives, in Advances in Neural Information Processing
Systems (NIPS), vol. 27 (2014), pp. 1–9
53. C. Tan, S. Ma, Y. Dai, Y. Qian, Barzilai-borwein step size for stochastic average gradient, in
Advances in Neural Information Processing Systems, vol. 29 (2016), pp. 1–9
54. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning
problems using stochastic recursive gradient, in Proceedings of the 34th International Confer-
ence on Machine Learning (ICML) (2017), pp. 2613–2621
55. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm.
J. Mach. Learn. Res. 17(1), 2165–2199 (2016)
56. Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, H. Qian, Towards more efficient stochastic decen-
tralized learning: faster convergence and sparse communication, in Proceedings of the 35th
International Conference on Machine Learning (PMLR), vol. 80 (2018), pp. 4624–4633
57. K. Yuan, B. Ying, J. Liu, A.H. Sayed, Variance-reduced stochastic learning by networked
agents under random reshuffling. IEEE Trans. Signal Process. 67(2), 351–366 (2019)
58. H. Hendrikx, F. Bach, L. Massoulie, An accelerated decentralized stochastic proximal algo-
rithm for finite sums, in Advances in Neural Information Processing Systems, vol. 32 (2019),
pp. 4624–4633
59. R. Xin, S. Kar, U.A. Khan, Decentralized stochastic optimization and machine learning: a
unified variance-reduction framework for robust performance and fast convergence. IEEE
Signal Process. Mag. 37(3), 102–113 (2020)
60. B. Li, S. Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks
with gradient tracking and variance reduction, in Proceedings of the 23rd International
Conference on Artificial Intelligence and Statistics (AISTATS) (2020), pp. 1662–1672
61. R.A. Horn, C.R. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
62. D. Blatt, A. Hero, Energy-based sensor network source localization via projection onto convex
sets. IEEE Trans. Signal Process. 54(9), 3614–3619 (2006)
63. D. Dua, C. Graff, UCI machine learning repository, Dept. School Inf. Comput. Sci., Univ.
California, Irvine, CA, USA (2019)
64. R.M. Gower, M. Schmidt, F. Bach, P. Richtarik, Variance-reduced methods for machine
learning. Proc. IEEE 108(11), 1968–1983 (2020)
65. S. Horvath, L. Lei, P. Richtarik, M.I. Jordan, Adaptivity of stochastic gradient methods for
nonconvex optimization (2020). Preprint. arXiv:2002.05359
Chapter 6
Accelerated Algorithms for Distributed
Economic Dispatch
6.1 Introduction
The economic dispatch problem (EDP) is one of the fundamental issues for energy
management in the practical operation of smart grids. The target of EDP is to
allocate the generation power among the generators to meet the load demands
with minimal total operation cost (i.e., sum of the local generation costs) while
preserving all constraints of local generation capacity. In a certain sense, EDP can
be cast as a constrained optimization problem, which has attracted many
researchers in recent years [1–7]. To tackle EDP, many basic methods [8, 9] have
been implemented in a centralized manner. However, these centralized methods
6.2 Preliminaries
6.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors.
We let the subscript i denote the generator index and the superscript t denote the
time index; e.g., xit is generator i’s variable at time t. The sets of real numbers,
n-dimensional real column vectors, and n-dimensional real square matrices are
represented as R, Rn , and Rn×n , respectively. The symbol zij is denoted as the
entry of matrix Z in its i-th row and j -th column and In is denoted as the identity
matrix of size n. Given a vector y = [y1 , y2 , . . . , yn ]T , Z = diag{y} is utilized
to represent a diagonal matrix which satisfies that zii = yi , ∀i = 1, . . . , n, and
zij = 0, ∀i = j . The diagonal matrix consisting of the corresponding diagonal
elements of matrix Z is represented as diag{Z}. Two column vectors with all entries
equal to ones and zeros are denoted as 1n and 0n , respectively. The 2-norm of vectors
and matrices is denoted as || · ||2 . The symbols zT and W T are the transposes of a
vector $z$ and a matrix $W$, respectively. Given two vectors $v = [v_1, \ldots, v_n]^T$ and
$u = [u_1, \ldots, u_n]^T$, the notation $v \preceq u$ implies that $v_i \leq u_i$ for any $i \in \{1, \ldots, n\}$.
The notation $\nabla f(z) : \mathbb{R}^p \to \mathbb{R}^p$ denotes the gradient of a differentiable function $f$ at $z$.
Consider a set of n generators connected over a smart grid. The global objective
of EDP is to find the optimal allocation that meets the expected load demands
while preserving the limitations of generator capacity, therefore minimizing the total
generation cost, i.e.,
$$\min_{x\in\mathbb{R}^n} C(x) = \sum_{i=1}^n C_i(x_i), \quad \text{s.t.} \quad \sum_{i=1}^n x_i = \sum_{i=1}^n d_i, \quad x_i^{\min} \leq x_i \leq x_i^{\max}, \qquad (6.1)$$
Due to Assumption 6.1, problem (6.1) has a unique optimal solution. It is worth
emphasizing that Assumptions 6.1 and 6.2 are two standard assumptions to achieve
linear convergence when employing first-order methods [27–35].
Assumption 6.3 forces each generator in the network to directly or indirectly have
an impact on the others. In comparison with the uniformly and strongly connected
assumption [27, 28, 39], Assumption 6.3, although more restrictive, is still relatively
common [30–35].
First, we construct the dual problem of (6.1). To this end, we introduce the
Lagrangian function $L : X \times \mathbb{R} \to \mathbb{R}$ defined as
$$L(x, y) = \sum_{i=1}^n C_i(x_i) - y\sum_{i=1}^n\left(x_i - d_i\right), \qquad (6.2)$$
for all $y \in \mathbb{R}$. With the above, the dual problem of (6.1) is described as
$$\max_{y\in\mathbb{R}} \Phi(y) = \sum_{i=1}^n \Phi_i(y), \qquad (6.4)$$
where $\Phi_i(y) = \min_{x_i\in X_i}\left\{C_i(x_i) - y(x_i - d_i)\right\}$.
By Assumption 6.1, it can be demonstrated that the strong duality between (6.1) and
(6.4) holds [40], i.e., there exists at least a dual optimal solution y ∗ to (6.4) such that
C(x ∗ ) = Φ(y ∗ ), where x ∗ is the primal optimal solution to (6.1), and that a set of
the dual optimal solution to (6.4) is nonempty [40].
Note from Assumption 6.1 that, for each $i \in \mathcal{V}$ and any given $y \in \mathbb{R}$, (6.6)
has a unique minimizer as follows:
$$\tilde{x}_i(y) = \begin{cases} x_i^{\max} & \text{if } \nabla C_i^{-1}(y) \geq x_i^{\max} \\ \nabla C_i^{-1}(y) & \text{if } x_i^{\min} < \nabla C_i^{-1}(y) < x_i^{\max} \\ x_i^{\min} & \text{if } \nabla C_i^{-1}(y) \leq x_i^{\min}, \end{cases} \qquad (6.7)$$
where $\nabla C_i^{-1}(y)$ is the inverse function of $\nabla C_i$, which, according to Assumption 6.2,
exists on the interval $[x_i^{\min}, x_i^{\max}]$. Since the set of dual optimal solutions
to (6.4) is nonempty, the (unique) primal optimal solution to (6.1) for each $i \in \mathcal{V}$
becomes $x_i^* = \tilde{x}_i(y^*)$, where $y^*$ is any dual optimal solution.
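For the quadratic costs used later in the case studies, $\nabla C_i^{-1}(y) = (y - b_i)/(2a_i)$, so (6.7) reduces to a clipping operation. A one-line sketch, vectorized over generators:

```python
import numpy as np

def x_tilde_quadratic(y, a, b, x_min, x_max):
    """Minimizer (6.7) for quadratic costs C_i(x) = a_i x^2 + b_i x (a_i > 0),
    where the inverse gradient is (y - b_i) / (2 a_i)."""
    return np.clip((y - b) / (2.0 * a), x_min, x_max)
```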
Then, for any given $y \in \mathbb{R}$, the dual function $\Phi(y)$ is differentiable [26] (because
of the uniqueness of $\tilde{x}_i(y)$) at $y$, and its gradient is
$$\nabla\Phi(y) = -\sum_{i=1}^n\left(\tilde{x}_i(y) - d_i\right), \qquad (6.8)$$
and further $\nabla\Phi_i(y) = -(\tilde{x}_i(y) - d_i)$, $\forall i \in \mathcal{V}$. Thus, the dual problem (6.4) can be
solved by utilizing the standard gradient ascent method as follows:
$$y^{t+1} = y^t - \tilde{\alpha}\sum_{i=1}^n\left(\tilde{x}_i(y^t) - d_i\right), \qquad (6.9)$$
where $\tilde{\alpha} > 0$ is an appropriately selected step-size. It has been proven that method
(6.9) converges to the dual optimal solution, i.e., $y^t$ converges to $y^*$, under some mild conditions. Equivalently, one can consider the minimization problem
$$\min_{y\in\mathbb{R}} q(y) = \sum_{i=1}^n q_i(y), \qquad (6.10)$$
where $q_i(y) = -\Phi_i(y) = C_i^\perp(y) - yd_i$. It is worth highlighting that problem (6.10)
shares the same set of dual optimal solutions as problem (6.4), and it also has a
formulation similar to the distributed convex problems in [31, 32, 37]. According
to (6.10), we now describe D-DLM to deal with problem (6.1) in a distributed fashion. In D-DLM,
each generator $i \in \mathcal{V}$ at time $t \geq 0$ stores five variables: $x_i^t \in \mathbb{R}$, $y_i^t \in \mathbb{R}$,
$h_i^t \in \mathbb{R}$, $\mathbf{s}_i^t \in \mathbb{R}^n$, and $z_i^t \in \mathbb{R}$. For $t \geq 0$, generator $i \in \mathcal{V}$ updates its variables as
follows:
$$\begin{cases} x_i^{t+1} = \min\left\{\max\left\{\nabla C_i^{-1}(y_i^t), x_i^{\min}\right\}, x_i^{\max}\right\} \\[4pt] h_i^{t+1} = \sum_{j=1}^n r_{ij}y_j^t + \beta_i^t\left(h_i^t - h_i^{t-1}\right) - \alpha_i z_i^t \\[4pt] y_i^{t+1} = h_i^{t+1} + \beta_i^{t+1}\left(h_i^{t+1} - h_i^t\right) \\[4pt] \mathbf{s}_i^{t+1} = \sum_{j=1}^n r_{ij}\mathbf{s}_j^t \\[4pt] z_i^{t+1} = \sum_{j=1}^n r_{ij}z_j^t + \dfrac{x_i^{t+1} - d_i}{[\mathbf{s}_i^{t+1}]_i} - \dfrac{x_i^t - d_i}{[\mathbf{s}_i^t]_i}, \end{cases} \qquad (6.11)$$
and $r_{ii} = 1 - \sum_{j\in\mathcal{N}_i^{\text{in}}} r_{ij} > \epsilon$, $\forall i \in \mathcal{V}$, where $0 < \epsilon < 1$. Each generator $i \in \mathcal{V}$
starts with the initial states $x_i^0 \in X_i$, $y_i^0 = h_i^0 \in \mathbb{R}$, $\mathbf{s}_i^0 = e_i$, and $z_i^0 = x_i^0 - d_i$.¹
Denote by $R = [r_{ij}] \in \mathbb{R}^{n\times n}$ the collection of weights $r_{ij}$, $i, j \in \mathcal{V}$, in (6.11),
which is obviously row-stochastic. In essence, $(x_i^t - d_i)$ in the update of $z_i^t$ in (6.11)
is the gradient of the function $q_i(y)$ at $y = y_i^t$, i.e., $\nabla q_i(y_i^t) = x_i^t - d_i$, $i \in \mathcal{V}$.
Furthermore, the update of $z_i^t$ in (6.11) is a distributedly inexact gradient tracking
step, where each function's gradient is scaled by $[\mathbf{s}_i^t]_i$, which is generated by the
update of $\mathbf{s}_i^t$ in (6.11). Actually, the update of $\mathbf{s}_i^t$ in (6.11) is a consensus iteration step
aiming to overcome the unbalancedness of the directed network by learning the left
normalized Perron eigenvector $w = [w_1, \ldots, w_n]^T$, corresponding to the eigenvalue 1, of the weight
matrix $R$, i.e., the left eigenvector $w$ satisfying $1_n^T w = 1$. This iteration resembles
those employed in [30, 41] and [42]. To sum up, D-DLM (6.11) transforms the
centralized method (6.9) into a distributed one via the gradient tracking method and
can be applied to a directed network.
Define $x^t = [x_1^t, \ldots, x_n^t]^T \in \mathbb{R}^n$, $y^t = [y_1^t, \ldots, y_n^t]^T \in \mathbb{R}^n$, $h^t = [h_1^t, \ldots, h_n^t]^T
\in \mathbb{R}^n$, $z^t = [z_1^t, \ldots, z_n^t]^T \in \mathbb{R}^n$, $S^t = [\mathbf{s}_1^t, \ldots, \mathbf{s}_n^t]^T \in \mathbb{R}^{n\times n}$, $\tilde{S}^t = \text{diag}\{S^t\}$,
and $d = [d_1, \ldots, d_n]^T \in \mathbb{R}^n$. Therefore, D-DLM (6.11) can be rewritten in the
following aggregated form:
1 Suppose that each generator possesses and knows its unique identifier in the network, e.g.,
$1, \ldots, n$ [27–35].
$$\begin{cases} x_i^{t+1} = \min\left\{\max\left\{\nabla C_i^{-1}(y_i^t), x_i^{\min}\right\}, x_i^{\max}\right\}, \quad i \in \mathcal{V} \\[4pt] h^{t+1} = Ry^t + D_\beta^t\left(h^t - h^{t-1}\right) - D_\alpha z^t \\[4pt] y^{t+1} = h^{t+1} + D_\beta^{t+1}\left(h^{t+1} - h^t\right) \\[4pt] S^{t+1} = RS^t \\[4pt] z^{t+1} = Rz^t + [\tilde{S}^{t+1}]^{-1}\left(x^{t+1} - d\right) - [\tilde{S}^t]^{-1}\left(x^t - d\right), \end{cases} \qquad (6.13)$$
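A compact sketch of one aggregated D-DLM step (6.13), under the quadratic-cost specialization of (6.7), is given below. Variable names are ours, and `beta_now`/`beta_next` stand for the diagonal momentum matrices $D_\beta^t$ and $D_\beta^{t+1}$; this is an illustration, not the authors' code.

```python
import numpy as np

def d_dlm_iteration(R, y, h, h_prev, S, z, x, a, b, x_min, x_max,
                    d, alpha_vec, beta_now, beta_next):
    """One aggregated D-DLM step (6.13) with C_i(x) = a_i x^2 + b_i x.

    R          : (n, n) row-stochastic weight matrix
    y, h, h_prev, z, x : current iterates (length-n vectors)
    S          : (n, n) eigenvector-learning matrix S^t
    alpha_vec  : non-uniform step-sizes (diagonal of D_alpha)
    """
    x_new = np.clip((y - b) / (2.0 * a), x_min, x_max)   # projection step (6.7)
    h_new = R @ y + beta_now * (h - h_prev) - alpha_vec * z
    y_new = h_new + beta_next * (h_new - h)              # Nesterov-style update
    S_new = R @ S                                        # learns the Perron eigenvector
    s_diag_new, s_diag = np.diag(S_new), np.diag(S)
    z_new = R @ z + (x_new - d) / s_diag_new - (x - d) / s_diag
    return x_new, h_new, y_new, S_new, z_new
```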
In this section, the convergence properties of D-DLM (6.11) are rigorously analyzed.
Before showing the main results, some auxiliary results (borrowed from the
literature) are introduced for completeness.
First, the following lemma shows that Ci⊥ , i ∈ V, is strongly convex and smooth
[22, 27, 28, 30].
Lemma 6.1 Suppose that Assumptions 6.1 and 6.2 hold. Then, for each $i \in \mathcal{V}$, $C_i^\perp$
is strongly convex with constant $\vartheta_i$ and Lipschitz differentiable with constant $\ell_i$,
respectively, where $\vartheta_i = 1/l_i$ and $\ell_i = 1/\mu_i$.
If Lemma 6.1 holds, it follows that the global function $(-\Phi)$ is strongly convex
with parameter $\vartheta = \sum_{i=1}^n\vartheta_i$ and has Lipschitz continuous gradient with parameter
$\ell = \sum_{i=1}^n\ell_i$, respectively. In addition, we define $\hat{\ell} = \max_{i\in\mathcal{V}}\{\ell_i\}$.
Considering a sequence $\{\tilde{v}^t\}_{t=0}^\infty$ and $\gamma \in (0, 1)$, for any positive integer $T > 0$
and norm $\|\cdot\|_c$ (in this chapter, norm $\|\cdot\|_c$ may be the 2-norm or a particular norm),
let us further define
$$\|\tilde{v}\|_c^{\gamma,T} = \sup_{t=0,1,\ldots,T}\left\{\|\tilde{v}^t\|_c/\gamma^t\right\} \quad \text{and} \quad \|\tilde{v}\|_c^\gamma = \sup_{t\geq 0}\left\{\|\tilde{v}^t\|_c/\gamma^t\right\}.$$
Then, the following additional lemma from the generalized small gain theorem
[43] is presented.
Lemma 6.2 (Generalized Small Gain Theorem) Suppose that the non-negative vector
sequences $\{\tilde{v}_i^t\}_{t=0}^\infty$, $i = 1, \ldots, m$, a non-negative matrix $\tilde{\Gamma} \in \mathbb{R}^{m\times m}$, and $\tilde{u} \in \mathbb{R}^m$ satisfy, for all $T > 0$,
$$\tilde{v}^{\gamma,T} \preceq \tilde{\Gamma}\tilde{v}^{\gamma,T} + \tilde{u}, \qquad (6.14)$$
where $\tilde{v}^{\gamma,T} = [\|\tilde{v}_1\|_c^{\gamma,T}, \ldots, \|\tilde{v}_m\|_c^{\gamma,T}]^T$. If $\rho(\tilde{\Gamma}) < 1$, then $\|\tilde{v}_i\|_c^\gamma < B$, where
$B < +\infty$ and $\rho(\tilde{\Gamma})$ is the spectral radius of matrix $\tilde{\Gamma}$. Hence, each $\|\tilde{v}_i^t\|_c$, $i \in
\{1, \ldots, m\}$, linearly converges to zero at the linear rate of $O(\gamma^t)$.
Recall that $R$ is irreducible and row-stochastic with positive diagonal entries.
Under Assumption 6.3, there exists a normalized left Perron eigenvector $w =
[w_1, \ldots, w_n]^T \in \mathbb{R}^n$ ($w_i > 0$, $\forall i$) of $R$ such that $\lim_{t\to\infty}(R)^t = 1_n w^T = (R)^\infty$.
In this subsection, the linear convergence rate of D-DLM is derived by employing
Lemma 6.2. First, we cast four inequalities into a linear system of the form (6.14)
and then investigate the spectral properties of the resulting coefficient matrix. To
this aim, some essential notations are introduced to simplify the main results. Denote
$v_1^t = y^t - (R)^\infty y^t$, $\forall t \geq 0$, $v_2^t = (R)^\infty y^t - 1_n y^*$, $\forall t \geq 0$, and $v_3^t = h^t - h^{t-1}$, $\forall t >
0$, with the convention that $v_3^0 = 0_n$, and $v_4^t = z^t - (R)^\infty z^t$, $\forall t \geq 0$.
The first lemma establishes an essential bound on the estimate $\|z^t\|_2$ for
deriving the aforementioned linear system.
Lemma 6.3 Under Assumptions 6.1 and 6.2, the following inequality holds for all
$t \geq 0$:
$$\|z^t\|_2 \leq n\hat{\ell}p_1\|v_1^t\| + n\hat{\ell}\|v_2^t\|_2 + p_1\|v_4^t\| + \hat{s}(\tilde{s})^2\theta(\lambda)^t\|\nabla Q(y^t)\|_2, \qquad (6.15)$$
2 Throughout the chapter, for any arbitrary matrix (scalar or variable) $Z$, we utilize the symbol $(Z)^t$
to represent the $t$-th power of $Z$, to distinguish it from the iteration index of variables.
$$\|(R)^\infty z^t\|_2 \leq \left\|S^\infty[\tilde{S}^t]^{-1}\nabla Q(y^t) - S^\infty[\tilde{S}^\infty]^{-1}\nabla Q(y^t)\right\|_2 + \left\|S^\infty[\tilde{S}^\infty]^{-1}\nabla Q(y^t) - S^\infty[\tilde{S}^\infty]^{-1}\nabla Q(1_n y^*)\right\|_2,$$
where $\hat{s} = \sup_{t\geq 0}\|S^t\|_2$, $\tilde{s} = \sup_{t\geq 0}\|[\tilde{S}^t]^{-1}\|_2$, and the last inequality follows from
the fact that $\|[\tilde{S}^t]^{-1} - [\tilde{S}^\infty]^{-1}\|_2 \leq \theta(\tilde{s})^2(\lambda)^t$, where $0 < \theta < \infty$ and $0 < \lambda < 1$
are constants (see [41, 42] for more details). Then, one gets
Substituting (6.17) and (6.18) into (6.16) yields the desired result in Lemma 6.3.
In what follows, the bound of the consensus violation $\|v_1\|^{\gamma,T}$ of the Lagrangian
multiplier is provided.
Lemma 6.4 Suppose that Assumptions 6.1 and 6.2 hold. Then, for all $T > 0$, we
have the following inequality:
$$\|v_1\|^{\gamma,T} \leq \frac{\hat{\alpha}\kappa_1 n\hat{\ell}}{\gamma-\rho-\hat{\alpha}\kappa_1 n\hat{\ell}p_1}\|v_2\|_2^{\gamma,T} + \frac{2\hat{\beta}\kappa_1}{\gamma-\rho-\hat{\alpha}\kappa_1 n\hat{\ell}p_1}\|v_3\|^{\gamma,T} + \frac{\hat{\alpha}\kappa_1 p_1}{\gamma-\rho-\hat{\alpha}\kappa_1 n\hat{\ell}p_1}\|v_4\|^{\gamma,T} + u_1, \qquad (6.19)$$
for all $\max\{\lambda, \rho + \hat{\alpha}\kappa_1 n\hat{\ell}p_1\} < \gamma < 1$, where $0 < \rho < 1$ is a constant, $\kappa_1 =
p_2\|I_n - (R)^\infty\|$, and $u_1 = (\|v_1^0\| + \hat{\alpha}\kappa_1\hat{s}(\tilde{s})^2\theta\sup_{t=0,\ldots,T}\|\nabla Q(y^t)\|_2)/(\gamma - \rho -
\hat{\alpha}\kappa_1 n\hat{\ell}p_1)$.
Proof According to the updates of $h^t$ and $y^t$ in D-DLM (6.13), it holds that
where the inequality in (6.20) is obtained from the facts that $(R)^\infty R = (R)^\infty$ and
$(R - (R)^\infty)(I_n - (R)^\infty) = R - (R)^\infty$. Considering the weight matrix $R = [r_{ij}] \in
\mathbb{R}^{n\times n}$ in (6.12), there exist a norm $\|\cdot\|$ and a constant $0 < \rho < 1$ such that
$\|Ry - (R)^\infty y\| \leq \rho\|y - (R)^\infty y\|$ for all $y \in \mathbb{R}^n$ (see [46], Lemma 5.3). Thus,
(6.20) further implies that
$$\|v_1^{t+1}\| \leq \rho\|v_1^t\| + \hat{\alpha}\kappa_1\|z^t\|_2 + \hat{\beta}\kappa_1\|v_3^{t+1}\| + \hat{\beta}\kappa_1\|v_3^t\|, \qquad (6.21)$$
where $\hat{\alpha} = \|D_\alpha\|_2$ and $\|D_\beta^t\|_2 \leq \hat{\beta}$. By Lemma 6.3, we have that
$$\|v_1^{t+1}\| \leq \left(\rho + \hat{\alpha}\kappa_1 n\hat{\ell}p_1\right)\|v_1^t\| + \hat{\alpha}\kappa_1 n\hat{\ell}\|v_2^t\|_2 + \hat{\beta}\kappa_1\|v_3^t\| + \hat{\beta}\kappa_1\|v_3^{t+1}\| + \hat{\alpha}\kappa_1 p_1\|v_4^t\| + \hat{\alpha}\kappa_1\hat{s}(\tilde{s})^2\theta(\lambda)^t\|\nabla Q(y^t)\|_2. \qquad (6.22)$$
From here, the procedure is similar to that in the proof of Lemma 8 in [30]; we
include it for completeness. Multiplying both sides of (6.22) by $(\gamma)^{-(t+1)}$
and then taking the supremum over $t = 0, \ldots, T - 1$, one has
Also, using $(\lambda)^t \leq (\gamma)^t$ for $\lambda < \gamma < 1$, it follows that
where $\kappa_2 = p_1\|(R)^\infty\|_2$, $l_1 = \max\{|1 - \ell w^T\alpha|, |1 - \vartheta w^T\alpha|\}$, and $u_2 = (\|v_2^0\|_2 +
\hat{\alpha}\hat{s}(\tilde{s})^2\theta\sup_{t=0,\ldots,T}\|\nabla Q(y^t)\|_2)/(\gamma - l_1)$.
Proof It follows from (6.13) that
$$\|(R)^\infty y^{t+1} - 1_n y^*\|_2$$
We now discuss the first term in the inequality of (6.26). Note that $(R)^\infty = 1_n w^T$.
Through utilizing $1_n w^T D_\alpha 1_n w^T = (w^T\alpha)1_n w^T$, one obtains
where $\nabla q(\bar{y}^t) = 1_n^T\nabla Q(1_n\bar{y}^t)$ and $\nabla Q(1_n\bar{y}^t) = [\nabla q_1(\bar{y}^t), \ldots, \nabla q_n(\bar{y}^t)]^T$. Since
the global function $q$ is strongly convex and smooth (see Lemma 6.1), if $0 < w^T\alpha <
2/\ell$, $\Lambda_1$ is bounded by
$$\Lambda_1 \leq l_1\sqrt{n}\left\|w^T y^t - y^*\right\|_2 = l_1\left\|(R)^\infty y^t - 1_n y^*\right\|_2, \qquad (6.28)$$
where $l_1 = \max\{|1 - \ell w^T\alpha|, |1 - \vartheta w^T\alpha|\}$ (see reference [42] for a proof). Then,
$\Lambda_2$ can be bounded in the following way:
Next, by employing the fact $\|[\tilde{S}^t]^{-1} - [\tilde{S}^\infty]^{-1}\|_2 \leq \theta(\tilde{s})^2(\lambda)^t$ and the relationship
$S^\infty[\tilde{S}^\infty]^{-1} = 1_n 1_n^T$, we get
where $\hat{s} = \sup_{t\geq 0}\|S^t\|_2$ and $\tilde{s} = \sup_{t\geq 0}\|[\tilde{S}^t]^{-1}\|_2$. Plugging (6.27)–(6.31) into
(6.26) yields
$$\|v_2^{t+1}\|_2 \leq l_1\|v_2^t\|_2 + \hat{\alpha}n\hat{\ell}p_1\|v_1^t\| + \hat{\beta}\kappa_2\left[\|v_3^{t+1}\| + \|v_3^t\|\right] + \hat{\alpha}\kappa_2\|v_4^t\| + \hat{\alpha}\hat{s}(\tilde{s})^2\theta(\lambda)^t\|\nabla Q(y^t)\|_2. \qquad (6.32)$$
Here, we can identify the terms in (6.32) with the terms in (6.22). Hence, in order
to establish this lemma, we proceed as in the proof of Lemma 6.4 from (6.22).
For the bound of the estimate difference $\|v_3\|^{\gamma,T}$, the following lemma is shown.
Lemma 6.6 Suppose that Assumptions 6.1 and 6.2 hold. If $\max\{\lambda, 2p_2\hat{\beta}\} < \gamma < 1$,
it holds for all $T > 0$ that
where $u_3 = (\hat{\alpha}p_2\hat{s}(\tilde{s})^2\theta\sup_{t=0,\ldots,T}\|\nabla Q(y^t)\|_2)/(\gamma - 2p_2\hat{\beta})$ and $\kappa_3 = \|R - I_n\|$.
Proof Recalling that $(R)^\infty R = (R)^\infty$, we obtain from (6.13) that
where the inequality in (6.34) is obtained from the fact that $(R - I_n)(I_n - (R)^\infty) =
R - I_n$. Now, applying Lemma 6.3, we deduce that
$$\|v_3^{t+1}\| \leq 2p_2\hat{\beta}\|v_3^t\| + \left(\kappa_3 + \hat{\alpha}p_2 n\hat{\ell}p_1\right)\|v_1^t\| + \hat{\alpha}p_2 n\hat{\ell}\|v_2^t\|_2 + \hat{\alpha}p_2 p_1\|v_4^t\| + \hat{\alpha}p_2\hat{s}(\tilde{s})^2\theta(\lambda)^t\|\nabla Q(y^t)\|_2. \qquad (6.35)$$
Similar to the procedure following (6.22), this suffices to derive the desired result.
The next lemma establishes the inequality which bounds the error term $\|v_4\|^{\gamma,T}$
corresponding to the gradient estimation.
Lemma 6.7 Suppose that Assumptions 6.1–6.3 hold. If $\max\{\lambda, \rho\} < \gamma < 1$, for all
$T > 0$, we have the following estimate:
$$\|v_4\|^{\gamma,T} \leq \frac{\kappa_4 + 2\hat{\beta}\kappa_4}{\gamma - \rho}\|v_3\|^{\gamma,T} + u_4, \qquad (6.36)$$
where $u_4 = 2\|I_n - (R)^\infty\|p_2(\tilde{s})^2\theta\sup_{t=0,\ldots,T}\|\nabla Q(y^t)\|_2/(\gamma - \rho) + \|v_4^0\|/(\gamma - \rho)$
and $\kappa_4 = \|I_n - (R)^\infty\|p_1 p_2\hat{\ell}\tilde{s}$.
Proof It is immediately obtained from (6.13) that
$$\begin{aligned} \|z^{t+1} - (R)^\infty z^{t+1}\| &= \left\|\left(Rz^t + [\tilde{S}^{t+1}]^{-1}\nabla Q(y^{t+1}) - [\tilde{S}^t]^{-1}\nabla Q(y^t)\right) - (R)^\infty\left(Rz^t + [\tilde{S}^{t+1}]^{-1}\nabla Q(y^{t+1}) - [\tilde{S}^t]^{-1}\nabla Q(y^t)\right)\right\| \\ &\leq \|I_n - (R)^\infty\|\left\|[\tilde{S}^{t+1}]^{-1}\nabla Q(y^{t+1}) - [\tilde{S}^t]^{-1}\nabla Q(y^t)\right\| + \rho\|z^t - (R)^\infty z^t\|, \end{aligned} \qquad (6.37)$$
where we employ the triangle inequality and the fact $\|Ry - (R)^\infty y\| \leq \rho\|y -
(R)^\infty y\|$ to deduce the inequality. As for the first term of the inequality in (6.37),
we apply (6.13) to obtain
For convenience, we define $w_{\min} = \min_{i\in\mathcal{V}}\{w_i\}$ and $\kappa_5 = \kappa_1 n\hat{\ell}p_1$. Then, the first result,
i.e., Theorem 6.8, is introduced as follows.
Theorem 6.8 Suppose that Assumptions 6.1–6.3 hold, and consider the sequences $\{x^t\}$, $\{h^t\}$, $\{y^t\}$, $\{S^t\}$, and $\{z^t\}$ updated by D-DLM (6.13). Then, if $0 < w^T\alpha < 2/\ell$, we
obtain the following linear inequality for all $T > 0$:
$$v^{\gamma,T} \preceq \Gamma v^{\gamma,T} + u, \qquad (6.41)$$
where $v^{\gamma,T} = [\|v_1\|^{\gamma,T}, \|v_2\|_2^{\gamma,T}, \|v_3\|^{\gamma,T}, \|v_4\|^{\gamma,T}]^T$, $u = [u_1, u_2, u_3, u_4]^T$, and
the elements of the matrix $\Gamma = [\gamma_{ij}] \in \mathbb{R}^{4\times 4}$ are given by
$$\Gamma = \begin{bmatrix} 0 & \dfrac{\hat{\alpha}\kappa_1 n\hat{\ell}}{\gamma-\rho-\hat{\alpha}\kappa_5} & \dfrac{2\hat{\beta}\kappa_1}{\gamma-\rho-\hat{\alpha}\kappa_5} & \dfrac{\hat{\alpha}\kappa_1 p_1}{\gamma-\rho-\hat{\alpha}\kappa_5} \\[8pt] \dfrac{\hat{\alpha}n\hat{\ell}p_1}{\gamma-l_1} & 0 & \dfrac{2\hat{\beta}\kappa_2}{\gamma-l_1} & \dfrac{\hat{\alpha}\kappa_2}{\gamma-l_1} \\[8pt] \dfrac{\kappa_3+\hat{\alpha}p_2 n\hat{\ell}p_1}{\gamma-2p_2\hat{\beta}} & \dfrac{\hat{\alpha}p_2 n\hat{\ell}}{\gamma-2p_2\hat{\beta}} & 0 & \dfrac{\hat{\alpha}p_2 p_1}{\gamma-2p_2\hat{\beta}} \\[8pt] 0 & 0 & \dfrac{\kappa_4+2\hat{\beta}\kappa_4}{\gamma-\rho} & 0 \end{bmatrix}.$$
$$\eta_1 > 0, \qquad \eta_2 > \frac{n\hat{\ell}p_1\eta_1 + \kappa_2\eta_4}{\vartheta w_{\min}}, \qquad \eta_3 > \kappa_3\eta_1, \qquad \eta_4 > \frac{\kappa_4\eta_3}{1-\rho}. \qquad (6.45)$$
Proof First, summarizing the results of Lemmas 6.4–6.7, we can conclude (6.41)
immediately. Next, we provide some sufficient conditions to make the spectral
radius of $\Gamma$, denoted $\rho(\Gamma)$, strictly less than 1, i.e., $\rho(\Gamma) < 1$. According to
Theorem 8.1.29 in [45], we know that, for a positive vector $\eta = [\eta_1, \ldots, \eta_4]^T \in \mathbb{R}^4$,
if $\Gamma\eta < \eta$, then $\rho(\Gamma) < 1$ holds. By the definition of $\Gamma$, it is deduced that the inequality
$\Gamma\eta < \eta$ is equivalent to
$$\begin{cases} 2\hat{\beta}\kappa_1\eta_3 < \eta_1\gamma - \eta_1\rho - \left(\kappa_5\eta_1 + \kappa_1 n\hat{\ell}\eta_2 + \kappa_1 p_1\eta_4\right)\hat{\alpha} \\[4pt] 2\hat{\beta}\kappa_2\eta_3 < \eta_2\gamma - \eta_2 l_1 - \left(n\hat{\ell}p_1\eta_1 + \kappa_2\eta_4\right)\hat{\alpha} \\[4pt] 2p_2\hat{\beta}\eta_3 < \eta_3\gamma - \kappa_3\eta_1 - \left(p_2 n\hat{\ell}p_1\eta_1 + p_2 n\hat{\ell}\eta_2 + p_2 p_1\eta_4\right)\hat{\alpha} \\[4pt] 2\hat{\beta}\kappa_4\eta_3 < \eta_4\gamma - \eta_4\rho - \kappa_4\eta_3, \end{cases} \qquad (6.46)$$
choose $\eta_3$ and $\eta_4$ in accordance with the third and the fourth conditions in (6.48),
respectively, and finally select $\eta_2$ satisfying the second condition in (6.48). Hence,
following from (6.48), one obtains the upper bound on the largest step-size $\hat{\alpha}$ in (6.42),
considering the requirement that $\hat{\alpha} < 1/\ell$. If $0 < \gamma < 1$, we then achieve the
upper bound on the maximum momentum coefficient $\hat{\beta}$ according to (6.46) and the
upper bound on $\hat{\alpha}$.
Recalling that $\nabla Q(y^t) = [\nabla q_1(y_1^t), \ldots, \nabla q_n(y_n^t)]^T$ and $\nabla q_i(y_i^t) = x_i^t - d_i$, $i \in
\mathcal{V}$, it follows that $\|\nabla Q(y^t)\|_2$ is uniformly bounded by a positive constant, and then all the
elements ($u_1$, $u_2$, $u_3$, and $u_4$) of the vector $u$ are uniformly bounded. Therefore, all the
conditions of Lemma 6.2 are satisfied. By Lemma 6.2, we can deduce
that the sequence $\{y^t\}$ converges to $1_n y^*$ at a linear rate of $O(\gamma^t)$, where $\gamma$ satisfies
(6.44). This finishes the proof.
Remark 6.9 It is worth emphasizing that $\eta_1$, $\eta_2$, $\eta_3$, and $\eta_4$ in Theorem 6.8 are
tunable parameters, which depend only on the network topology and the cost
functions. Thus, the choices of the largest step-size $\hat{\alpha}$ and the maximum momentum
coefficient $\hat{\beta}$ can be calculated without much effort as long as other parameters,
such as $\lambda$, $\rho$, etc., are properly selected. Furthermore, in order to design the step-sizes
and the momentum coefficients, some global parameters, such as $\vartheta$, $\ell$, $\hat{\ell}$, and
$w_{\min}$, are needed. We note that the amount of preprocessing required to calculate these global
parameters is substantially negligible compared with the worst-case running time of
D-DLM (see [46] for a specific analysis).
Based on Theorem 6.8, below we show that the sequence $\{x^t\}$ linearly converges
to the optimal solution to (6.1). Similar to many distributed Lagrangian methods
[22, 27, 28, 30], we accomplish this by relating the primal variables to the Lagrangian
multipliers.
Theorem 6.10 Suppose that Assumptions 6.1–6.3 hold, and consider the sequences $\{x^t\}$, $\{h^t\}$, $\{y^t\}$, $\{S^t\}$, and $\{z^t\}$ updated by D-DLM (6.13). If $\hat{\alpha}$ and $\hat{\beta}$ satisfy
the conditions of Theorem 6.8, then the sequence $\{x^t\}$ converges to $x^*$ at a linear
rate of $O(\gamma^{t/2})$.
Proof The approach here is similar to that in the proof of Theorem 3 in [30];
we briefly give it for completeness. To demonstrate Theorem 6.10,
we need the following relation between the primal variables and the Lagrangian
multipliers:
$$\sum_{i=1}^n\frac{\mu_i}{2}\left(x_i^t - x_i^*\right)^2 \leq \sum_{i=1}^n\left[\nabla q_i(y^*)\left(y_i^t - y^*\right) + \frac{\ell_i}{2}\left(y_i^t - y^*\right)^2 + y_i^t\left(x_i^* - d_i\right)\right]. \qquad (6.49)$$
the gradient of the dual function is usually computed with various kinds of stochastic
noise, which yields the stochastic version of D-DLM.
In this section, a variety of studies on EDP in smart grids are provided to verify the
effectiveness of D-DLM and the correctness of the theoretical analysis. All the
simulations are carried out in MATLAB on an HP desktop with a 3.20 GHz, 6-core,
12-thread Intel i7-8700 processor and 8 GB of memory.
First, we study the EDP on the IEEE 14-bus test system [22] as described in Fig. 6.1,
where $\{1, 2, 3, 6, 8\}$ are generator buses and $\{2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14\}$ are
load buses. Suppose that each generator $i$ has a quadratic cost function, i.e.,
$C_i(x_i) = a_i(x_i)^2 + b_i x_i$ (known privately by generator $i$), where the generator
parameters are summarized in Table 6.1 [26]. Note that the power generation is zero
if a bus does not possess generators. The total demand is $\sum_{i=1}^{14} d_i = 380$ MW,
where the local demands on each load bus are $d_1 = 0$ MW, $d_2 = 9$ MW, $d_3 = 56$
In this case study, EDP on the IEEE 118-bus test system [49] continues to be
considered to demonstrate the performance of D-DLM on a large-scale network.
The IEEE 118-bus test system, as shown in Fig. 6.3 [49], contains 54 generators
connected by bus lines, and the network is assumed to be directed and strongly connected.
Each generator $i$ has a quadratic cost function (known privately by generator
$i$), i.e., $C_i(x_i) = a_i(x_i)^2 + b_i x_i + c_i$, where the coefficients $a_i \in [0.0024, 0.0697]$,
$b_i \in [8.3391, 37.6968]$, and $c_i \in [6.78, 74.33]$ with units \$/MW², \$/MW, and
\$, respectively. Each $x_i$ is constrained to an interval $[x_i^{\min}, x_i^{\max}]$ (MW),
where $x_i^{\max} \in [150, 400]$ and $x_i^{\min} \in [5, 150]$. Suppose that the total load required
from the system is 4242 MW. In addition, the communication between generators
adopts the bus data given in [22]. For convenience, we employ the same uniform
weighting strategy as explained in Case Study 1. Moreover, the non-uniform step-sizes
and the momentum coefficients are, respectively, selected as $\alpha_i = 0.0002\theta_i$
and $\beta_i^t = 0.5\tilde{\theta}_i^t$, $t \geq 0$.
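As a centralized baseline for such case studies, the optimal dispatch of a quadratic-cost system can be recovered by a bisection on the dual variable of (6.4), exploiting that the total output of the minimizers (6.7) is nondecreasing in $y$. The data below are drawn from the stated coefficient ranges and are illustrative, not the actual IEEE 118-bus values; the demand is chosen to guarantee feasibility.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 54
a = rng.uniform(0.0024, 0.0697, n)          # quadratic coefficients ($/MW^2)
b = rng.uniform(8.3391, 37.6968, n)         # linear coefficients ($/MW)
x_min = rng.uniform(5, 150, n)              # lower capacity limits (MW)
x_max = rng.uniform(150, 400, n)            # upper capacity limits (MW)
demand = 0.5 * (x_min.sum() + x_max.sum())  # guaranteed-feasible total load

def total_output(y):
    # Sum of the per-generator minimizers (6.7) at dual variable y.
    return np.clip((y - b) / (2.0 * a), x_min, x_max).sum()

# total_output is nondecreasing in y, so bisection finds the optimal
# incremental cost y* at which supply meets demand.
lo, hi = 0.0, 1000.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if total_output(mid) < demand else (lo, mid)

y_star = 0.5 * (lo + hi)
x_star = np.clip((y_star - b) / (2.0 * a), x_min, x_max)
print(abs(x_star.sum() - demand))           # supply-demand mismatch, ~0
```

A distributed run of D-DLM should drive every local multiplier $y_i^t$ to this `y_star` and every $x_i^t$ to the corresponding `x_star` entry.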
Then, numerical results are illustrated in Fig. 6.4, which demonstrates the
convergence of D-DLM for the IEEE 118-bus test system. It implies that D-DLM
Fig. 6.4 Convergence of D-DLM on the IEEE 118-bus test system: generator power outputs, and total generation versus total demand, over the iterations
successfully drives the variables to the optimal solutions within few iterations even
for this large-scale network.
Considering that the demand is not always constant in the practical operation
of smart grids, this case study involves the application of D-DLM to dynamic
economic dispatch problems (EDPs), i.e., time-varying demands. In this case study,
we simulate the performance of D-DLM utilizing the same IEEE 14-bus
test system, the same cost functions at the generators, and the other related parameters
described in Case Study 1. In addition, we divide the total iteration time into 3
identical time intervals and assume that the total demand in each time interval is
different.
Then, numerical results are illustrated in Fig. 6.5. It shows that when the total
demand changes, the generators alter their power generation accordingly to satisfy
Fig. 6.5 Convergence of D-DLM under time-varying demands on the IEEE 14-bus test system: generator power outputs, incremental costs, and total generation versus total demand, over the iterations
the current total demand, and D-DLM successfully achieves the optimal power
generation after a short period of time.
Finally, D-DLM is compared with the existing centralized primal–dual method [40]
and distributed primal–dual method [30] in terms of convergence performance,
convergence time, and computational complexity. To this aim, we consider the following
two scenarios:
(1) Comparison of convergence performance: in the first scenario, the convergence
performance comparison is conducted on the IEEE 14-bus and 118-bus test
systems, respectively, where the residual $E^t = \log_{10}\sum_{i=1}^n\|x_i^t - x_i^*\|$, $t \geq 0$,
is applied as the comparison metric. The required parameters (row-stochastic
weights, non-uniform step-sizes, etc.) correspond to Case Studies 1 and 2.
Figure 6.6 shows that D-DLM exhibits a linear convergence rate and that,
compared with the applicable methods [30, 40] without momentum terms,
increasing the momentum coefficients (below their upper bound) improves the
convergence rate considerably. In addition, the improvement in the convergence
rate brought by D-DLM is evident even in large-scale directed networks.
(2) Comparison of convergence time and computational complexity: in the second
scenario, the convergence time and computational complexity of D-DLM and
the applicable methods [30, 40] are discussed on the IEEE 14-bus and 118-bus
test systems. Here, we measure the convergence time by the wall-clock time the
algorithm takes, and the computational complexity by the number of calculations
the algorithm requires, to achieve the desired residual level $E^t = -15$.
Table 6.2 indicates that, in terms of convergence time, D-DLM outperforms
the distributed method [30], since the two momentum terms added in D-DLM
improve the convergence rate. Besides, although the centralized method needs
less convergence time than the distributed methods, it is worth highlighting that
if the distributed methods were run in parallel (as in practice), their total
time would be far less than the total time of the local optimizations running in
sequence. Table 6.2 also shows that the computational complexity of D-DLM
and the distributed method [30] increases very slowly with the number of
buses, while that of the centralized method [40] increases rapidly.
Fig. 6.6 Evolution of the residual $E^t$ for D-DLM and the methods in [30, 40] on the IEEE 14-bus (top) and IEEE 118-bus (bottom) test systems
6.6 Conclusion
In this chapter, we have considered the EDP in smart grids, where generators
collectively minimize the total generation cost while satisfying
the expected load demands and preserving the limitations of generator capacity.
To solve the EDP, a novel directed distributed Lagrangian momentum algorithm,
named D-DLM, has been presented and analyzed at length. D-DLM
extends the distributed gradient tracking method with heavy-ball momentum and
Nesterov momentum, allows generators to select non-uniform step-sizes
in a distributed manner, and only requires the weight matrix to be row-stochastic,
indicating that it is suitable for a directed network. In particular, the directed
network is assumed to be strongly connected. If the largest step-size and the
maximum momentum coefficient are subject to certain upper bounds (the bounds
rely only on the network topology and the cost functions), we have proven
that D-DLM linearly attains the optimal dispatch at the expense of eigenvector
learning, assuming smooth and strongly convex cost functions. In addition, an
explicit estimate of the convergence rate of D-DLM has also been provided.
The theoretical analysis has been further verified by simulations. In future
work, we will continue to consider a number of interesting problems (privacy masking,
utility maximization, etc.) in smart grids with the aid of D-DLM, and study the
robustness to time-varying networks, packet dropout, latency, random link failures,
and transmission losses.
References
5. R. Wang, Q. Li, B. Zhang, L. Wang, Distributed consensus based algorithm for economic
dispatch in a microgrid, IEEE Trans. Smart Grid 10(4), 3630–3640 (2019)
6. T. Kim, S. Wright, D. Bienstock, S. Harnett, Analyzing vulnerability of power systems with
continuous optimization formulations. IEEE Trans. Netw. Sci. Eng. 3(3), 132–146 (2016)
7. S. D’Aronco, P. Frossard, Online resource inference in network utility maximization problems.
IEEE Trans. Netw. Sci. Eng. 6(3), 432–444 (2019)
8. N. Li, L. Chen, S. Low, Optimal demand response based on utility maximization in power
networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES).
https://doi.org/10.1109/PES.2011.6039082
9. N. Li, L. Chen, M. Dahleh, Demand response using linear supply function bidding. IEEE Trans. Smart Grid 6(4), 1827–1838 (2015)
10. M. Ogura, V. Preciado, Stability of spreading processes over time-varying large-scale networks.
IEEE Trans. Netw. Sci. Eng. 3(1), 44–57 (2016)
11. Z. Zhang, M. Chow, Convergence analysis of the incremental cost consensus algorithm under
different communication network topologies in a smart grid. IEEE Trans. Power Syst. 27(4),
1761–1768 (2012)
12. S. Yang, S. Tan, J. Xu, Consensus based approach for economic dispatch problem in a smart
grid. IEEE Trans. Power Syst. 28(4), 4416–4426 (2013)
13. S. Kar, G. Hug, Distributed robust economic dispatch in power systems: a consensus +
innovations approach, in 2012 IEEE Power and Energy Society General Meeting. https://doi.
org/10.1109/PESGM.2012.6345156
14. S. Xie, W. Zhong, K. Xie, R. Yu, Y. Zhang, Fair energy scheduling for vehicle-to-grid networks
using adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 27(8), 1697–
1707 (2016)
15. Y. Li, H. Zhang, X. Liang, B. Huang, Event-triggered based distributed cooperative energy
management for multi-energy systems. IEEE Trans. Ind. Inform. 15(4), 2008–2022 (2019)
16. B. Huang, L. Liu, H. Zhang, Y. Li, Q. Sun, Distributed Optimal economic dispatch for
microgrids considering communication delays. IEEE Trans. Syst. Man, Cybern. Syst. 49(8),
1634–1642 (2019)
17. G. Binetti, A. Davoudi, F. Lewis, D. Naso, B. Turchiano, Distributed consensus-based
economic dispatch with transmission losses. IEEE Trans. Power Syst. 29(4), 1711–1720 (2014)
18. Z. Ni, S. Paul, A multistage game in smart grid security: a reinforcement learning solution.
IEEE Trans. Neural Netw. Learn. Syst. 30(9), 2684–2695 (2019)
19. A. Nedic, A. Olshevsky, W. Shi, Improved convergence rates for distributed resource allocation
(2017). arXiv preprint arXiv:1706.05441
20. T. Doan, C. Beck, Distributed Lagrangian methods for network resource allocation, in 2017
IEEE Conference on Control Technology and Applications (CCTA). https://doi.org/10.1109/
CCTA.2017.8062536
21. Z. Tang, D. Hill, T. Liu, A novel consensus-based economic dispatch for microgrids. IEEE
Trans. Smart Grid 9(4), 3920–3922 (2018)
22. T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic dis-
patch/demand response in power systems (2016). arXiv preprint arXiv:1609.06660
23. H. Pourbabak, J. Luo, T. Chen, W. Su, A novel consensus-based distributed algorithm for
economic dispatch based on local estimation of power mismatch. IEEE Trans. Smart Grid
9(6), 5930–5942 (2018)
24. Q. Li, D. Gao, L. Cheng, F. Zhang, W. Yan, Fully distributed DC optimal power flow
based on distributed economic dispatch and distributed state estimation (2019). arXiv preprint
arXiv:1903.01128
25. Z. Deng, X. Nian, Distributed generalized Nash equilibrium seeking algorithm design for
aggregative games over weight-balanced digraphs. IEEE Trans. Neural Netw. Learn. Syst.
30(3), 695–706 (2019)
26. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for
economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron.
64(6), 5095–5106 (2017)
27. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained
optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst.
50(7), 2612–2622 (2020)
28. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Control Theory
Appl. 13(17), 2800–2810 (2019)
29. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under
time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
30. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a
general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–
248 (2019)
31. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control
65(6), 2566–2581 (2020)
32. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
33. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom.
Control 59(5), 1131–1146 (2014)
34. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
35. R. Xin, D. Jakovetic, U. Khan, Distributed Nesterov gradient methods over arbitrary graphs.
IEEE Signal Process. Lett. 26(8), 1247–1251 (2018)
36. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, Berlin,
2013)
37. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization
with uncoordinated step-sizes, in 2017 American Control Conference (ACC). https://doi.org/
10.23919/ACC.2017.7963560
38. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
39. D. Nunez, J. Cortes, Distributed online convex optimization over jointly connected digraphs.
IEEE Trans. Netw. Sci. Eng. 1(1), 23–37 (2014)
40. D. Bertsekas, Nonlinear Programming, 2nd edn. (Athena Scientific, Belmont, 1999)
41. R. Xin, C. Xi, U. Khan, FROST: fast row-stochastic optimization with uncoordinated step-sizes. EURASIP J. Adv. Signal Process. 2019(1), 1–14 (2019)
42. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with
row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018)
43. Y. Tian, Y. Sun, G. Scutari, Achieving linear convergence in distributed asynchronous multi-
agent optimization. IEEE Trans. Autom. Control 65(12), 5264–5279 (2020)
44. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE
Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020)
45. R. Horn, C. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
46. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
47. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over
stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018)
48. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for
convex and non-convex optimization (2016). arXiv preprint arXiv:1604.03257
49. Y. Fu, M. Shahidehpour, Z. Li, AC contingency dispatch based on security-constrained unit
commitment. IEEE Trans. Power Syst. 21(2), 897–908 (2006)
Chapter 7
Primal–Dual Algorithms for Distributed
Economic Dispatch
7.1 Introduction
During recent years, multi-node systems have attracted increasing attention in the fields of distributed sensor networks, multi-robot cooperation, UAV formation flight, coordinated missile attack operations, etc. Many difficulties still exist in the control and optimization of multi-node systems due to the complexity of node dynamics, network structures, and actual target tasks. As one of the most important research topics in the field of multi-node systems, the distributed optimization problem has attracted strong research interest from various scientific disciplines. This is mainly due to its broad applications in engineering, including distributed formation control, resource allocation in peer-to-peer communication networks
[1], multiple autonomous vehicles [2], and distributed data fusion, information
processing, and decision-making in wireless sensor networks [3–5], etc. The
distributed optimization framework not only avoids the need to build long-distance
communication systems or data fusion centers but also provides a better load
balance for the network.
Among the existing literature, the (sub)gradient descent algorithm [6, 7], the primal–dual (sub)gradient algorithm [8], the fast (sub)gradient descent algorithm [9], and the (sub)gradient-push descent algorithm [10] have been widely employed to resolve distributed optimization problems. Nedic et al. [6] incorporated average consensus approaches into (sub)gradient methods to handle a multi-node unconstrained convex optimization problem. Theoretical analysis showed that the (sub)gradient descent algorithm in [6] converges at a rate of $O(1/\sqrt{t})$ for convex, Lipschitz, and possibly non-smooth objective functions, which coincides with the convergence rate of the centralized (sub)gradient descent algorithm. Then, Zhu et al. [8] devised two distributed primal–dual (sub)gradient algorithms to tackle a convex optimization problem in which the nodes are restricted by a global inequality constraint, a global equality constraint, and a global constraint set. Nedic et al. [10] proposed a (sub)gradient-push descent algorithm that finds the exact optimal solution even without knowledge of the number of nodes or of the network sequence, under the assumption that the objective functions have bounded (sub)gradients. The (sub)gradient-push algorithm was proved to converge at a rate of $O(\ln t/\sqrt{t})$ when a diminishing step-size is employed.
series of work has been extended to distributed optimization under various realistic
conditions, such as stochastic (sub)gradient errors [11], directed [12] or random
communication network [13], linear scaling in network size [14], heterogeneous
local constraints [15], and asynchrony and delays [16], to name just a few. Although these algorithms can solve different kinds of optimization problems, they are usually slow and still need a diminishing step-size to reach the optimal solution, even when the objective functions are differentiable and strongly convex [8–25]. Besides, the abovementioned algorithms all require the bounded (sub)gradient assumption to attain the exact optimal solution, which is another drawback. Nonetheless, at the expense of inexact convergence (convergence only to a neighborhood of the optimal solution), the methods described above can be accelerated to $O(1/t)$ by utilizing a constant step-size; this, however, is not the ultimate goal of solving the optimization problem. To address such issues, Xu et al. [24] studied an augmented distributed gradient method (Aug-DGM) with uncoordinated constant step-sizes over general linear time-invariant systems by employing the so-called adapt-then-combine (ATC) scheme. Shi et al. [25] removed the steady-state error by introducing a difference structure into the distributed gradient descent algorithm, thereby obtaining a novel distributed exact first-order algorithm (EXTRA). The algorithm achieves a convergence rate of $O(1/t)$ for convex objective functions and a linear rate $O(C^{-t})$ ($C > 1$ is some constant) for strongly convex objective functions.
Considering the large application domains of future smart grids, distributed algorithms [26, 27] have been broadly employed in recent years to promote energy optimization
efficiency in smart grids. He et al. [28] proposed two second-order continuous-time algorithms to exactly solve the economic power dispatch problem and showed that their convergence rate is faster than that of the first-order continuous-time algorithm. In addition, a push-sum consensus protocol was introduced in [29] to solve economic dispatch problems (EDPs) over fixed directed networks in smart grids. This line of work has been extended to a variety of realistic scenarios for distributed optimization. Li et al. [30] investigated a novel distributed event-triggered optimization algorithm to address EDPs in smart grids. By running two consensus protocols in parallel, Binetti et al. [31] established a distributed consensus-based protocol for optimization with transmission losses.
The recent literature [35, 36] is the most relevant to our work. Nedic et al. [35] concentrated on the analysis of distributed optimization problems with a coordinated step-size over time-varying undirected/directed networks. The algorithm in [35] is capable of driving the whole network to a global and consensual minimizer at a geometric rate under strong convexity and smoothness assumptions. Doan et al. [36] further took a coupling linear constraint and individual box constraints into consideration. Under the strong convexity and smoothness assumptions on the objective functions, Doan et al. conducted an explicit and detailed convergence-rate analysis on an undirected network. It is noteworthy that [35] did not study the constrained optimization problem, while [36] could not provide a detailed analysis of the constrained optimization problem over general, time-varying directed network topologies. Our work is also linked to a distributed optimization algorithm for network resource allocation [32]. However, in [32], the network was required to be undirected and the weight matrix to be doubly stochastic, which is quite strict in real applications. Other related works are [33, 34], where demand response problems (DRPs) in power networks have been considered. Regrettably, that approach fails to update the Lagrangian multiplier in a distributed way.
To sum up, when each node in the network is subject to certain constraints, the distributed constrained optimization problem over time-varying general directed networks studied in this work remains largely open. Therefore, we develop and analyze a fully distributed primal–dual optimization algorithm for the problem with both a coupling linear constraint and individual box constraints. Specially, compared with the centralized approach, the proposed algorithm is more adaptable and practical, owing to its robustness to the variability of renewable resources and its flexibility with respect to the dynamic topology of networks. In general, the main contributions of this chapter can be structured into the following four aspects:
(i) The push-sum protocol and a gradient tracking technique are incorporated into the distributed primal–dual optimization algorithm. It generalizes the work of [35], which neglected the constraints of each node in practical scenarios; moreover, it allows a wider selection of step-sizes compared with most existing distributed optimization algorithms, such as [8–23].
7.2 Preliminaries
7.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Let $\mathbb{R}$, $\mathbb{R}^N$, and $\mathbb{R}^{N\times N}$ denote the set of real numbers, the set of $N$-dimensional real column vectors, and the set of $N\times N$ real matrices, respectively. Let $I_N$ denote the $N$-dimensional identity matrix. The symbol $\mathbf{1}$ denotes an all-ones column vector of appropriate dimension. Given a matrix $W$, $W^T$ and $W^{-1}$ ($W$ invertible) denote its transpose and inverse, respectively. The symbol $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors. For a vector $x\in\mathbb{R}^N$, $\bar{x}=(1/N)\mathbf{1}^T x$ denotes the average of its entries, and its consensus violation is written as $\breve{x}=x-(1/N)\mathbf{1}\mathbf{1}^T x=(I-J)x=\breve{J}x$, where $J=(1/N)\mathbf{1}\mathbf{1}^T$ and $\breve{J}=I-J$ are two symmetric matrices. Given a vector $x\in\mathbb{R}^N$, the standard Euclidean norm is defined as $\|x\|=\sqrt{x^T x}$, and the infinity norm as $\|x\|_\infty$. The $\breve{J}$-weighted (semi-)norm is denoted as $\|x\|_{\breve{J}}=\sqrt{\langle x,\breve{J}x\rangle}$; since $\breve{J}=\breve{J}^T\breve{J}$, we have $\|x\|_{\breve{J}}=\|\breve{J}x\|$. Let $\|W\|$ denote the spectral norm of a matrix $W$ and $\nabla f(x)$ the (sub)gradient of $f(x)$. For a matrix $W$, we write $W_{ij}$ or $[W]_{ij}$ to denote its $(i,j)$'th entry.
The optimization problem considered in this chapter is

$$\min_{x\in\mathbb{R}^N} f(x)=\frac{1}{N}\sum_{i=1}^{N}f_i(x_i),\quad \text{s.t.}\quad \frac{1}{N}\sum_{i=1}^{N}x_i=P,\quad l_i\le x_i\le u_i. \qquad (7.1)$$
Given a time-varying directed network $G(k)=(V,E(k))$, the in- and out-neighborhoods of node $i$ at time $k$ are denoted by $N_i^{\text{in}}(k)=\{j\in V\,|\,(j,i)\in E(k)\}$ and $N_i^{\text{out}}(k)=\{j\in V\,|\,(i,j)\in E(k)\}$, respectively. A directed path from node $j$ to node $i$ in the directed network $G$ is a group of connected edges $(j,i_1),(i_1,i_2),\ldots,(i_m,i)$ with distinct nodes $i_k$, $k=1,2,\ldots,m$. A directed network is strongly connected if and only if, for any two distinct nodes $j$ and $i$ in the set $V$, there exists a directed path from node $j$ to node $i$. The following two standard assumptions on the communication network are adopted.
Assumption 7.1 ([6]) The time-varying general directed network sequence $G(k)$ is $B_0$-strongly connected. Namely, there exists a positive integer $B_0$ such that the general directed network with vertex set $V$ and edge set $E_{B_0}(k)=\bigcup_{s=kB_0}^{(k+1)B_0-1}E(s)$ is strongly connected for any $k\ge 0$.
Remark 7.2 Under Assumption 7.1, through repeated communications with neighbors, all nodes can repeatedly interact with each other across the network sequence $G(k)$. In particular, this assumption is considerably weaker than requiring each $G(k)$ to be strongly connected for all $k\ge 0$.
Assumption 7.2 ([6]) For any $k=0,1,\ldots$, the mixing matrix $C(k)=[C_{ij}(k)]\in\mathbb{R}^{N\times N}$ is defined as

$$C_{ij}(k)=\begin{cases}\dfrac{1}{d_j^{\text{out}}(k)+1}, & j\in N_i^{\text{in}}(k)\ \text{or}\ j=i,\\[4pt] 0, & \text{otherwise},\end{cases}$$

where $d_j^{\text{out}}(k)=|N_j^{\text{out}}(k)|$ is the out-degree of node $j$ at time $k\ge 0$ ($N_j^{\text{out}}(k)=\{i\in V\,|\,(j,i)\in E(k)\}$). Clearly, $C(k)$ is a column-stochastic matrix, i.e., $\sum_{i=1}^{N}C_{ij}(k)=1$ for any $j\in V$ and $k\ge 0$.
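For concreteness, the following minimal Python sketch (hypothetical helper name; it assumes a 0/1 adjacency array with adj[i, j] = 1 if and only if (j, i) is an edge of E(k)) builds the column-stochastic mixing matrix of Assumption 7.2:

```python
import numpy as np

def mixing_matrix(adj):
    """Column-stochastic weights of Assumption 7.2: node j splits its mass
    equally among its out-neighbors and itself.  adj[i, j] = 1 iff (j, i) is
    an edge, so column j encodes the out-neighbors of node j."""
    N = adj.shape[0]
    C = np.zeros((N, N))
    for j in range(N):
        w = 1.0 / (adj[:, j].sum() + 1.0)   # 1 / (d_j^out(k) + 1)
        C[:, j] = adj[:, j] * w             # shares sent to out-neighbors of j
        C[j, j] = w                         # node j keeps one share itself
    assert np.allclose(C.sum(axis=0), 1.0)  # column-stochasticity check
    return C
```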
where $\mu_i\in(0,+\infty)$, and we employ $\hat{\mu}=\min_i\{\mu_i\}$ in the subsequent analysis.
On the basis of the above section, we first analyze the dual problem of problem (7.1), and then we design our main algorithm, namely the distributed primal–dual gradient algorithm, to handle the optimization of the dual problem. At the end of this section, we give some lemmas to support the convergence analysis of the algorithm. The dual function can be written as

$$d(w)=-wP+\frac{1}{N}\sum_{i=1}^{N}\Big(-\sup_{x_i\in X_i}\{-wx_i-f_i(x_i)\}\Big)=\frac{1}{N}\sum_{i=1}^{N}\big(-f_i^*(-w)-wP\big)=\frac{1}{N}\sum_{i=1}^{N}d_i(w), \qquad (7.3)$$

where $d_i(w)=-f_i^*(-w)-wP$. The dual problem of (7.1), therefore, is given as

$$\max_{w\in\mathbb{R}}\, d(w), \qquad (7.4)$$

whose gradient is

$$\nabla d(w)=\frac{1}{N}\sum_{i=1}^{N}x_i-P, \qquad (7.5)$$

where each $x_i$ attains the minimum of $f_i(x_i)+wx_i$ over $X_i$.
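For smooth costs, the conjugate in (7.3) is available in closed form. As an illustration only, the sketch below (hypothetical function name) evaluates the dual gradient (7.5) for quadratic costs $f_i(x)=c_ix^2+d_ix$ on boxes $[l_i,u_i]$, where the inner minimizer is the clipped unconstrained stationary point:

```python
import numpy as np

def dual_gradient(w, c, d, lo, up, P):
    """Dual gradient (7.5) for quadratic costs f_i(x) = c[i]*x**2 + d[i]*x:
    each x_i(w) = argmin_{x in [lo_i, up_i]} f_i(x) + w*x is the clipped
    stationary point -(d_i + w)/(2*c_i); then grad d(w) = mean(x) - P."""
    x = np.clip(-(d + w) / (2.0 * c), lo, up)
    return x.mean() - P, x
```

A centralized dual ascent $w\leftarrow w+\alpha\,\nabla d(w)$ built on this oracle is what the distributed algorithm of this section emulates with local copies $w_i(k)$.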
Since $w$ is the solution of the dual problem (7.4), we can design a distributed algorithm to tackle the dual problem (7.4), which is equivalent to problem (7.1). Moreover, the dual problem (7.4) can be recast as the following minimization problem:

$$\min_{w\in\mathbb{R}}\, q(w)=\frac{1}{N}\sum_{i=1}^{N}q_i(w), \qquad (7.6)$$
where $\alpha>0$ is some step-size, and $\nabla q_i(w_i(k))$ is the gradient of node $i$'s objective function $q_i(w)$ at $w=w_i(k)$. According to the definition of $q_i(w)$, we can obtain that
where $R(k)=(V(k+1))^{-1}C(k)V(k)$ and $h(k)=(V(k))^{-1}\beta(k)$. Note that, under Assumptions 7.1 and 7.2, each matrix $V(k)$ is invertible, and we denote
$\|V^{-1}\|_{\max}=\sup_{k\ge0}\|V^{-1}(k)\|$, which is bounded. Also, we can prove that $R(k)$ is actually a row-stochastic matrix (see Lemma 4 of [10]).
Next, we use the notation $C_B(k)=C(k)C(k-1)\cdots C(k+1-B)$ for any $k=0,1,\ldots$ and $B=0,1,\ldots$ with $B\le k+1$, with the exceptional cases $C_0(k)=I$ for any $k$ and $C_B(k)=I$ for any needed $k<0$. An important property of the norm of $(I-(1/N)\mathbf{1}\mathbf{1}^T)R_B(k)$ is shown in the following lemma, which follows from the properties of the distributed primal–dual gradient algorithm and can be obtained from [33, 34].

Lemma 7.7 ([35]) Let Assumptions 7.1 and 7.2 hold, and let $B$ be an integer satisfying $B\ge B_0$. Then, for any $k=B-1,B,\ldots$ and any vectors $\phi$ and $\varphi$ with appropriate dimensions, if $\phi=R_B(k)\varphi$, we have $\|\breve{\phi}\|\le\delta\|\breve{\varphi}\|$, where $R_B(k)=(V(k+1))^{-1}C_B(k)V(k+1-B)$, $\delta=Q_1\big(1-\tau^{NB_0}\big)^{\frac{B-1}{NB_0}}<1$, $Q_1=2N\frac{1+\tau^{-NB_0}}{1-\tau^{NB_0}}$, and $\tau=\frac{1}{2+NB_0}$.
In this section, we first introduce the small gain theorem [35], followed by some supporting lemmas. Then, we present the main results of this chapter. Given a sequence $s_i=(s_i(0),s_i(1),\ldots)$ and a constant $\gamma\in(0,1)$, define

$$\|s_i\|^{\gamma,K}=\max_{k=0,\ldots,K}\frac{1}{\gamma^k}\|s_i(k)\|, \qquad (7.10)$$

$$\|s_i\|^{\gamma}=\sup_{k\ge 0}\frac{1}{\gamma^k}\|s_i(k)\|. \qquad (7.11)$$
Lemma 7.8 (The Small Gain Theorem [35]) Suppose that $s_1,\ldots,s_m$ are sequences such that, for all $K>0$ and each $i=1,\ldots,m$ (with the cyclic convention $s_{m+1}=s_1$),

$$\|s_{i+1}\|^{\gamma,K}\le\eta_i\|s_i\|^{\gamma,K}+\omega_i, \qquad (7.12)$$

and suppose that

$$\eta_1\eta_2\cdots\eta_m<1; \qquad (7.13)$$

then, we obtain

$$\|s_1\|^{\gamma}\le\frac{1}{1-\eta_1\eta_2\cdots\eta_m}\big(\eta_m\eta_{m-1}\cdots\eta_2\,\omega_1+\cdots+\eta_m\omega_{m-1}+\omega_m\big). \qquad (7.14)$$
Remark 7.9 The original version of the small gain theorem has been extensively studied and widely used in control theory [40]. Besides, since the small gain theorem involves a cyclic structure, $s_1\to s_2\to\cdots\to s_m\to s_1$, one can obtain similar bounds for $\|s_i\|^{\gamma}$ for all $i$.

Lemma 7.10 ([35]) For any sequence $s_i$ and a positive constant $\gamma\in(0,1)$, if $\|s_i\|^{\gamma}$ is bounded, then $\|s_i(k)\|$ converges to 0 at a geometric rate $O(\gamma^k)$.
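Numerically, the $\gamma$-weighted norms (7.10) and (7.11) are easy to evaluate, which is convenient for checking a geometric rate in simulations. A minimal sketch for a scalar sequence (e.g., a sequence of residual norms):

```python
import numpy as np

def gamma_norm(s, gamma, K=None):
    """||s||^{gamma,K} = max_{k=0..K} |s(k)| / gamma**k, cf. (7.10)-(7.11).
    If this stays bounded as K grows, Lemma 7.10 gives |s(k)| = O(gamma**k)."""
    s = np.asarray(s, dtype=float)
    K = len(s) - 1 if K is None else K
    return np.max(np.abs(s[: K + 1]) / gamma ** np.arange(K + 1))
```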
Before the main proof is carried out, we define the following additional symbols that are used frequently in the subsequent analysis:

where $w^*\in\mathbb{R}$ is the optimal solution of problem (7.6), and the initialization is $z(0)=0$. In view of the small gain theorem, the geometric convergence of $\|w(k)\|$ will be obtained by applying Lemma 7.8 to the following circle of arrows:

$$y \xrightarrow{\,1\,} z \xrightarrow{\,2\,} \breve{h} \xrightarrow{\,3\,} \breve{w} \xrightarrow{\,4\,} y. \qquad (7.17)$$
Remark 7.11 Recall that $y$ is the difference between the estimate and the global optimizer of the Lagrangian multiplier $w$, $z$ is the successive difference of gradients, $\breve{h}$ is the consensus violation of the estimate of the gradient average across nodes, and $\breve{w}$ is the consensus violation of the Lagrangian multiplier. In a sense, as long as $y$ is small, the error $z$ is small, since the gradients are close to zero in the vicinity of the optimal Lagrangian multiplier. Then, as long as $z$ is small, $\breve{h}$ is small by the structure of the algorithm (7.9). Furthermore, as long as $\breve{h}$ is small, the structure of algorithm (7.9) implies that $\breve{w}$ is close to zero. Finally, as long as $\breve{w}$ is close to zero, the algorithm drives $y$ to zero, which closes the whole cycle.
Remark 7.12 After each arrow is established, we apply Lemma 7.8 to conclude our main results. Specifically, we need the prerequisite that the quantities $\|y\|^{\gamma,K}$, $\|z\|^{\gamma,K}$, $\|\breve{h}\|^{\gamma,K}$, and $\|\breve{w}\|^{\gamma,K}$ are proven to be bounded. We can then conclude that all quantities in the above circle of arrows converge at a geometric rate $O(\gamma^k)$. In addition, to apply the small gain theorem in the following analysis, the product of the gains $\eta_i$ must be less than one, which is achieved by choosing an appropriate step-size $\alpha$.
Now, we are ready to establish each arrow in the above circle (7.17). The following series of lemmas is based mainly on the analysis of [35]. Before introducing Lemma 7.13, we make some definitions only for this lemma, which are distinguished from the notation used in our distributed optimization problem, algorithm, and analysis. We redefine problem (7.6) with different notation as

$$\min_{p\in\mathbb{R}^N}\, g(p)=\frac{1}{N}\sum_{i=1}^{N}g_i(p), \qquad (7.18)$$

$$p_{k+1}=p_k-\theta\,\frac{1}{N}\sum_{i=1}^{N}\nabla g_i(s_i(k)), \qquad (7.19)$$
where $\theta$ is the step-size, and $p^*$ is the global optimal solution of $g$. For the first arrow of the circle (7.17), Lemma 7.14 ($\|\breve{w}\|^{\gamma,K}\to\|y\|^{\gamma,K}$) gives

$$\|y\|^{\gamma,K}\le(1+\sqrt{N})\left(1+\frac{\sqrt{N}(1+\upsilon)}{\gamma\upsilon\hat{\mu}\tilde{\sigma}}+\frac{\vartheta}{\hat{\sigma}\tilde{\sigma}}\right)\|\breve{w}\|^{\gamma,K}+2\sqrt{N}\,\|\bar{w}(0)-w^*\|. \qquad (7.22)$$
Next, to establish the next arrows in the circle (7.17), we give the following Lemma 7.15.

Lemma 7.15 ($\|\breve{h}\|^{\gamma,K}\to\|\breve{w}\|^{\gamma,K}$ [35]) Let Assumptions 7.1–7.2 hold, and let $\gamma$ be a positive constant in $(\sqrt[B]{\delta},1)$, where $\delta$ and $B$ are the constants given in Lemma 7.7. Then, we get

$$\|\breve{w}\|^{\gamma,K}\le\frac{\alpha}{\gamma^B-\delta}\left(\delta+Q_1\frac{\gamma-\gamma^B}{1-\gamma}\right)\|\breve{h}\|^{\gamma,K}+\frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{w}(t-1)\|, \qquad (7.23)$$

$$\|\breve{h}\|^{\gamma,K}\le Q_1\|V^{-1}\|_{\max}\frac{\gamma(1-\gamma^B)}{(\gamma^B-\delta)(1-\gamma)}\|z\|^{\gamma,K}+\frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{h}(t-1)\|. \qquad (7.24)$$
The last arrow in the circle (7.17), established in the following lemma, is a simple consequence of the fact that the gradient of $q$ is Lipschitz continuous with parameter $1/\hat{\mu}$.

Lemma 7.17 ($\|y\|^{\gamma,K}\to\|z\|^{\gamma,K}$ [35]) Under Assumption 7.3, we obtain that, for all $K=0,1,\ldots$ and any $0<\gamma<1$,

$$\|z\|^{\gamma,K}\le\frac{\gamma+1}{\gamma\hat{\mu}}\|y\|^{\gamma,K}.$$
Based on the circle (7.17) established in the previous section, we now demonstrate a major result on the geometric convergence of $(x(k),w(k))$ to a saddle point $(x^*,\mathbf{1}w^*)$ of the Lagrangian function $L$ for the distributed primal–dual gradient algorithm over time-varying general directed networks. In what follows, we first prove that the sequence $\{w(k)\}$ updated by the distributed primal–dual gradient algorithm (7.9) converges to $\mathbf{1}w^*$ at a global geometric rate $O(\gamma^k)$ with the help of the small gain theorem. Then, on the basis of the geometric convergence of the sequence $\{w(k)\}$, we prove that the sequence $\{x(k)\}$ converges to $x^*$ at a global geometric rate $O((\gamma/2)^k)$. Moreover, an explicit convergence rate $\gamma$ for the distributed primal–dual gradient algorithm is given along the way.
Theorem 7.18 Let Assumptions 7.1–7.4 and Lemmas 7.3–7.17 hold. Let $B$ be a large enough integer constant such that $\delta=Q_1\big(1-\tau^{NB_0}\big)^{\frac{B-1}{NB_0}}<1$. Then, for any step-size $\alpha\in\big(0,\frac{1.5(1-\delta)^2}{\tilde{\sigma}J}\big]$, the sequence $\{w(k)\}$ generated by the distributed primal–dual gradient algorithm converges to $\mathbf{1}w^*$ at a global geometric rate $O(\gamma^k)$, where $\gamma\in(0,1)$ is given by $\gamma=\sqrt[2B]{1-\frac{\alpha\tilde{\sigma}}{1.5}}$ if $\alpha\in\Big(0,\frac{1.5\big(\sqrt{J^2+(1-\delta^2)J}-\delta J\big)^2}{\tilde{\sigma}J(1+J)^2}\Big]$, and $\gamma=\sqrt[B]{\delta+\sqrt{\frac{\alpha\tilde{\sigma}J}{1.5}}}$ if $\alpha\in\Big(\frac{1.5\big(\sqrt{J^2+(1-\delta^2)J}-\delta J\big)^2}{\tilde{\sigma}J(1+J)^2},\frac{1.5(1-\delta)^2}{\tilde{\sigma}J}\Big]$, where $J=3Q_1\|V^{-1}\|_{\max}\kappa B(\delta+Q_1(B-1))(1+\sqrt{N})(1+4\sqrt{N}\kappa)$ and $\kappa=1/(\tilde{\sigma}\hat{\mu})$.
Proof It is immediately obtained from Lemmas 7.13–7.17 that:
(i) $\|y\|^{\gamma,K}\le\eta_1\|\breve{w}\|^{\gamma,K}+\omega_1$, where $\eta_1=(1+\sqrt{N})\big(1+\frac{\sqrt{N}(1+\upsilon)}{\gamma\upsilon\hat{\mu}\tilde{\sigma}}+\frac{\vartheta}{\hat{\sigma}\tilde{\sigma}}\big)$ and $\omega_1=2\sqrt{N}\|\bar{w}(0)-w^*\|$.
(ii) $\|\breve{w}\|^{\gamma,K}\le\eta_2\|\breve{h}\|^{\gamma,K}+\omega_2$, where $\eta_2=\frac{\alpha}{\gamma^B-\delta}\big(\delta+Q_1\frac{\gamma-\gamma^B}{1-\gamma}\big)$ and $\omega_2=\frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{w}(t-1)\|$.
(iii) $\|\breve{h}\|^{\gamma,K}\le\eta_3\|z\|^{\gamma,K}+\omega_3$, where $\eta_3=Q_1\|V^{-1}\|_{\max}\frac{\gamma(1-\gamma^B)}{(\gamma^B-\delta)(1-\gamma)}$ and $\omega_3=\frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{h}(t-1)\|$.
(iv) $\|z\|^{\gamma,K}\le\eta_4\|y\|^{\gamma,K}+\omega_4$, where $\eta_4=\frac{\gamma+1}{\gamma\hat{\mu}}$ and $\omega_4=0$.
Moreover, to use the small gain theorem, we need to choose an appropriate step-size $\alpha$ such that

$$\eta_1\eta_2\eta_3\eta_4<1, \qquad (7.25)$$

where $\vartheta>0$, $\upsilon>0$, and the other constraint conditions on the parameters that occur in Lemmas 7.7, 7.14, and 7.15 are stated as follows:

$$0<\alpha\le\min\left\{\frac{\vartheta+1}{\tilde{\sigma}\vartheta},\ \frac{1}{\tilde{\mu}(1+\upsilon)}\right\}, \qquad (7.27)$$

$$\sqrt{1-\frac{\alpha\tilde{\sigma}\vartheta}{\vartheta+1}}\le\gamma<1, \qquad (7.28)$$

$$\sqrt[B]{\delta}<\gamma<1. \qquad (7.29)$$
Define two specific values for the parameters, $\vartheta=2\hat{\sigma}/\hat{\mu}$ and $\upsilon=1$, in Lemma 7.14 to obtain a concrete (probably loose) bound on the convergence rate. Furthermore, by using $0.5\le\gamma<1$ and $(1-\gamma^B)/(1-\gamma)\le B$, from relation (7.26) we obtain

$$\alpha\le\frac{\hat{\mu}(\gamma^B-\delta)^2}{2Q_1\|V^{-1}\|_{\max}B(\delta+Q_1(B-1))(1+\sqrt{N})(1+4\sqrt{N}\kappa)}, \qquad (7.31)$$
where $\kappa=1/(\tilde{\sigma}\hat{\mu})$ is the condition number. Noting that $(\vartheta+1)/\vartheta\ge 1.5$, it follows from (7.28) that

$$\frac{1.5(1-\gamma^2)}{\tilde{\sigma}}\le\alpha. \qquad (7.32)$$
Then, using relations (7.31) and (7.32), we can show that there exists a $\gamma\in(\sqrt[B]{\delta},1)$ such that

$$\left[\frac{1.5(1-\gamma^2)}{\tilde{\sigma}},\ \frac{1.5(\gamma^B-\delta)^2}{\tilde{\sigma}J}\right]\neq\emptyset, \qquad (7.33)$$
where $J=3Q_1\|V^{-1}\|_{\max}\kappa B(\delta+Q_1(B-1))(1+4\sqrt{N}\kappa)(1+\sqrt{N})$. Here, we study a smaller interval by enlarging the left endpoint in (7.33). Since $B\ge 1$, we will prove that

$$\left[\frac{1.5(1-\gamma^{2B})}{\tilde{\sigma}},\ \frac{1.5(\gamma^B-\delta)^2}{\tilde{\sigma}J}\right]\neq\emptyset. \qquad (7.34)$$
Note that, as $\gamma$ increases from $\sqrt[B]{\delta}$ to 1, the left endpoint of (7.34) decreases monotonically from $\frac{1.5(1-\delta^2)}{\tilde{\sigma}}$ to 0, while the right endpoint increases monotonically from 0 to $\frac{1.5(1-\delta)^2}{\tilde{\sigma}J}$. Thus, as $\gamma$ varies from $\sqrt[B]{\delta}$ to 1, the critical value at which the interval in (7.34) becomes valid is attained when $\gamma$ is given by

$$\gamma=\sqrt[B]{\frac{\delta+\sqrt{J^2+(1-\delta^2)J}}{1+J}}. \qquad (7.35)$$
Denote the value obtained in (7.35) by $\gamma_{\text{mid}}$. Thus, if we choose $\alpha\in\Big(0,\frac{1.5\big(\sqrt{J^2+(1-\delta^2)J}-\delta J\big)^2}{\tilde{\sigma}J(1+J)^2}\Big]$, we can set $\gamma=\sqrt[2B]{1-\frac{\alpha\tilde{\sigma}}{1.5}}$, while for $\alpha\in\Big(\frac{1.5\big(\sqrt{J^2+(1-\delta^2)J}-\delta J\big)^2}{\tilde{\sigma}J(1+J)^2},\frac{1.5(1-\delta)^2}{\tilde{\sigma}J}\Big]$, we can set $\gamma=\sqrt[B]{\delta+\sqrt{\frac{\alpha\tilde{\sigma}J}{1.5}}}$. The proof is thus completed.
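The two-branch rate of Theorem 7.18 can be transcribed directly. The sketch below (hypothetical function name, with sigma_t standing for $\tilde{\sigma}$) returns $\gamma(\alpha)$ and checks the admissible step-size range; the two branches agree at the boundary value $\alpha_{\text{mid}}$:

```python
import numpy as np

def rate_gamma(alpha, sigma_t, delta, J, B):
    """gamma(alpha) of Theorem 7.18; the two branches coincide at alpha_mid."""
    alpha_mid = 1.5 * (np.sqrt(J**2 + (1 - delta**2) * J) - delta * J) ** 2 \
        / (sigma_t * J * (1 + J) ** 2)
    alpha_max = 1.5 * (1 - delta) ** 2 / (sigma_t * J)
    assert 0 < alpha <= alpha_max, "step-size outside the admissible range"
    if alpha <= alpha_mid:
        return (1 - alpha * sigma_t / 1.5) ** (1.0 / (2 * B))
    return (delta + np.sqrt(alpha * sigma_t * J / 1.5)) ** (1.0 / B)
```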
On the basis of the geometric convergence of the sequence $\{w(k)\}$ in Theorem 7.18, we next demonstrate that the sequence $\{x(k)\}$ converges to $x^*$ at a global geometric rate $O((\gamma/2)^k)$.

Theorem 7.19 Suppose Assumptions 7.1–7.4 hold. Let the sequences $\{x(k)\}$, $\{\lambda(k)\}$, $\{v(k)\}$, $\{w(k)\}$, and $\{\beta(k)\}$ be updated by the distributed primal–dual gradient algorithm (7.9). Choose $\alpha$ as in Theorem 7.18 such that the sequence $\{w(k)\}$ converges to $\mathbf{1}w^*$ at a geometric rate $O(\gamma^k)$. Then the sequence $\{x(k)\}$ converges to $x^*$ at a geometric rate $O((\gamma/2)^k)$.
Proof To demonstrate the result of Theorem 7.19, we first show that, for any $k\ge 0$, the following inequality holds:

$$\sum_{i=1}^{N}\frac{\mu_i}{2}\big(x_i(k+1)-x_i^*\big)^2\le\sum_{i=1}^{N}\Big(\nabla q_i(w^*)\big(w_i(k)-w^*\big)+\frac{1}{2\mu_i}\big(w_i(k)-w^*\big)^2+w_i(k)\big(x_i^*-P\big)\Big). \qquad (7.36)$$

Noticing that $x^*\in S$, we have $\sum_{i=1}^{N}(x_i^*-P)=0$. Moreover, since $w_i(k)\to w^*$ as $k\to\infty$, the last term in (7.36) goes to zero as $k\to\infty$, and hence the right-hand side of (7.36) tends to zero as $k\to\infty$. Therefore, we immediately conclude that, as the sequence $\{w(k)\}$ converges to $\mathbf{1}w^*$ at a geometric rate $O(\gamma^k)$, the sequence $\{x(k)\}$ converges to $x^*$ at a geometric rate $O((\gamma/2)^k)$. Based on the above analysis, the next major task is to derive inequality (7.36).
To show inequality (7.36), we first define the local Lagrangian function $L_i:X_i\times\mathbb{R}\to\mathbb{R}$ as

$$L_i(x_i,w_i)=f_i(x_i)+w_i(x_i-P), \qquad (7.37)$$

so that

$$L(x,w)=\frac{1}{N}\sum_{i=1}^{N}L_i(x_i,w_i). \qquad (7.38)$$

Using $x_i(k+1)=\arg\min_{x_i\in X_i}\{f_i(x_i)+w_i(k)x_i\}$, we also obtain, for all $x_i\in X_i$, that

$$\frac{\mu_i}{2}\big(x_i(k+1)-x_i\big)^2\le L_i(x_i,w_i(k))-L_i(x_i(k+1),w_i(k)). \qquad (7.40)$$
Since (7.40) holds for all $x_i\in X_i$, replacing $x_i$ by $x_i^*$ and averaging (7.40) over the nodes, we immediately have

$$\frac{1}{N}\sum_{i=1}^{N}\frac{\mu_i}{2}\big(x_i(k+1)-x_i^*\big)^2\le\frac{1}{N}\sum_{i=1}^{N}\big(L_i(x_i^*,w_i(k))-L_i(x_i(k+1),w_i(k))\big)=f(x^*)+q(w(k))+\frac{1}{N}\sum_{i=1}^{N}w_i(k)\big(x_i^*-P\big)=\frac{1}{N}\sum_{i=1}^{N}\big(q_i(w_i(k))-q_i(w_i^*)+w_i(k)(x_i^*-P)\big). \qquad (7.44)$$
Since $q_i$ has a Lipschitz continuous derivative with Lipschitz parameter $1/\mu_i$, it yields that

$$\frac{1}{N}\sum_{i=1}^{N}\frac{\mu_i}{2}\big(x_i(k+1)-x_i^*\big)^2\le\frac{1}{N}\sum_{i=1}^{N}\Big(\nabla q_i(w_i^*)\big(w_i(k)-w^*\big)+\frac{1}{2\mu_i}\big(w_i(k)-w^*\big)^2+w_i(k)\big(x_i^*-P\big)\Big). \qquad (7.45)$$
Multiplying both sides by $N$, we obtain

$$\sum_{i=1}^{N}\frac{\mu_i}{2}\big(x_i(k+1)-x_i^*\big)^2\le\sum_{i=1}^{N}\Big(\nabla q_i(w^*)\big(w_i(k)-w^*\big)+\frac{1}{2\mu_i}\big(w_i(k)-w^*\big)^2+w_i(k)\big(x_i^*-P\big)\Big),$$

which is exactly (7.36). The proof is thus completed.

7.5 Numerical Examples
In this section, two numerical examples concerning economic dispatch problems and demand response problems in power systems are presented to validate the practicability of the proposed algorithm and the feasibility of the theoretical analysis throughout this chapter.
In the first example, we consider economic dispatch on the IEEE 14-bus system [43], as interpreted in Fig. 7.1. Specifically, we study a class of problems in which some generators may not be connected to the grid or may cease to exchange their power during operation. This may occur due to generator faults or the variability of renewable energy, which limits the generating capacity within a specific time. To study this problem, we use a class of uniformly strongly connected time-varying general directed networks to model the variable connections between generators. Specifically, in this chapter, we consider a system in which each generator $i$ incurs a quadratic cost as a function of the amount of its generated power $P_i$, i.e., $Y_i(P_i)=c_iP_i^2+d_iP_i$, where $c_i$ and $d_i$ are the adjustable cost coefficients of generator $i$. Assume that each generator $i$ only generates a limited amount of power, denoted by $[0,P_i^{\max}]$. In the simulation, we choose the average demand (required from the system) $P=60$, and the coefficients of each generator are shown in Table 7.1, which is also applied in [42]. The simulation results of the algorithm (7.7) are depicted in Figs. 7.2 and 7.3. The power allocation at each generator is shown in Fig. 7.2, from which one can see that the distributed primal–dual gradient algorithm (7.7) successfully allocates the optimal powers to the generators by time step $k=200$. The allocated optimal power at each generator is as follows: $P_1^*=66.25$, $P_2^*=71.61$, $P_3^*=47.15$, $P_4^*=54.98$, and $P_5^*=60.01$. From Fig. 7.3, it is clear that each generator's Lagrangian multiplier successfully converges to the dual optimal solution $w^*$.
Table 7.1 Generator parameters (MU = monetary units)

Gen | Bus | ci ($/MW^2) | di ($/MW) | [0, Pi^max] (MW)
1   | 1   | 0.04        | 2.0       | [0, 80]
2   | 2   | 0.03        | 3.0       | [0, 90]
3   | 3   | 0.035       | 4.0       | [0, 70]
4   | 6   | 0.03        | 4.0       | [0, 70]
5   | 8   | 0.04        | 2.5       | [0, 80]
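Since no capacity bound is active at the reported optimum, the dispatch can be cross-checked in a few lines from the KKT conditions: all generators run at a common marginal cost $\lambda$ with $P_i=(\lambda-d_i)/(2c_i)$ and $\sum_i P_i=5P$. A minimal sketch using the Table 7.1 data:

```python
import numpy as np

# Table 7.1 coefficients of the five generators
c = np.array([0.04, 0.03, 0.035, 0.03, 0.04])
d = np.array([2.0, 3.0, 4.0, 4.0, 2.5])
P = 60.0                          # average demand, so the total is 5*P = 300 MW

# common marginal cost from sum_i (lam - d_i) / (2*c_i) = 5*P
lam = (5 * P + np.sum(d / (2 * c))) / np.sum(1 / (2 * c))
P_star = (lam - d) / (2 * c)
print(lam, P_star)   # approx. 7.30 and [66.25, 71.65, 47.13, 54.99, 59.99]
```

The values match the reported allocation up to rounding, and the corresponding dual optimum is $w^*=-\lambda$, consistent with the multiplier trajectories of Fig. 7.3.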
Our second application concerns the demand response of 5 households in summer, served by a single generator. In particular, we consider the issue of time-varying supplies during a day. We assume that the generator can predict the average power to supply per hour of the day based on information collected over the past few days, and each household incurs a cost for its power usage. Suppose that all households are interested in cooperating to arrange their loads so as to meet the supply while minimizing their total costs. According to [34], the demand response problem can be cast as the optimization problem studied in this chapter. Specifically, we consider the energy consumption of air conditioning, and no other tunable device is involved in the demand response process; this is mainly because air conditioning typically consumes the most energy during summer. When the air conditioning uses an amount $x_i$ of power, we suppose that each household $i$ incurs a quadratic cost $Q_i(x_i)=a_i(x_i-b_i)^2$, where $a_i$ is the cost coefficient and $b_i$ is the initial energy of household $i$.
[Fig. 7.2: Power allocation $x_i(k)$ of the five generators versus time; all allocations settle by $k=200$.]
[Fig. 7.3: Lagrange multipliers $w_i(k)$ of the five generators versus time; all multipliers converge to the common dual optimum $w^*$.]
Let $S(t)$, $t\in[7,19]$, denote the average power supplied by the generator from 7:00 to 19:00. Let $S(t)$, $a_i$, and $b_i$ be chosen randomly from $[0,2000]$, $(0,1)$, and $[0,1000]$, respectively. The simulation results of the algorithm (7.7) are depicted in Figs. 7.4 and 7.5. The optimal power schedule set by each household is shown in Fig. 7.4, while the predicted price under the time-varying demand is shown in Fig. 7.5.
[Fig. 7.4: Optimal power schedule of each household from 7:00 to 19:00.]
[Fig. 7.5: Predicted price $w$ under the time-varying demand from 7:00 to 19:00.]
7.6 Conclusion
In this chapter, a fully distributed primal–dual gradient algorithm for tackling the convex optimization problem with both a coupling linear constraint and individual box constraints has been studied in detail. It has been proven that, under some fairly standard assumptions on network connectivity and the objective functions, the algorithm achieves a geometric convergence rate over time-varying general directed networks. Based on the adjustment of some parameters ($\vartheta$, $\upsilon$) and the small gain theorem, we derived an explicit convergence rate $\gamma$ for different ranges of $\alpha$. Furthermore, the correctness and effectiveness of the theoretical results have been demonstrated by applying the proposed algorithm to a few interesting problems in power systems. In addition, several meaningful questions remain for future work. For example, it would be interesting to study the more general case in which the adopted step-sizes are uncoordinated. It would also be meaningful to extend our work to event-triggered, asynchronous, and quantized communication among nodes over time-varying networks.
References
13. I. Matei, J.S. Baras, Performance evaluation of the consensus-based distributed subgradient
method under random communication topologies. IEEE J. Sel. Topics. Signal Process. 5(4),
754–771 (2011)
14. A. Olshevsky, Linear time average consensus on fixed graphs and implications for decentral-
ized optimization and multi-agent control (2014). arXiv preprint arXiv:1411.4186
15. A. Nedic, A. Ozdaglar, P. A. Parrilo, Constrained consensus and optimization in multi-agent
networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
16. T. Wu, K. Yuan, Q. Ling, W. Yin, A. H. Sayed, Decentralized consensus optimization with
asynchrony and delays. IEEE Trans. Signal Inform. Process. Netw. 4(2), 293–307 (2018)
17. A. Nedic, Asynchronous broadcast-based convex optimization over a network. IEEE Trans.
Autom. Control 56(6), 1337–1351 (2011)
18. I. Lobel, A. Ozdaglar, Convergence analysis of distributed subgradient methods over random
networks, in 2008 46th Annual Allerton Conference on Communication, Control, and Comput-
ing. https://doi.org/10.1109/ALLERTON.2008.4797579
19. P. Yi, Y. Hong, Quantized subgradient algorithm and data-rate analysis for distributed
optimization. IEEE Trans. Control Netw. Syst. 1(4), 380–392 (2014)
20. B. Johansson, T. Keviczky, M. Johansson, K. H. Johansson, Subgradient methods and consen-
sus algorithms for solving convex optimization problems, in 2008 47th IEEE Conference on
Decision and Control. https://doi.org/10.1109/CDC.2008.4739339
21. D. Yuan, S. Xu, H. Zhao, Distributed primal-dual subgradient method for multiagent optimiza-
tion via consensus algorithms. IEEE Trans. Syst. Man Cybern. B: Cybern. 41(6), 1715–1724
(2011)
22. Z.J. Towfic, A.H. Sayed, Adaptive penalty-based distributed stochastic convex optimization.
IEEE Trans. Signal. Process. 62(15), 3924–3938 (2014)
23. S.S. Ram, A. Nedic, V.V. Veeravalli, Incremental stochastic subgradient algorithms for convex
optimization. SIAM J. Control Optim. 20(2), 691–717 (2009)
24. J. Xu, S. Zhu, Y.C. Soh, L. Xie, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual Conference on Decision and Control. https://doi.org/10.1109/CDC.2015.7402509
25. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25, 944–966 (2015)
26. C. Li, X. Yu, T. Huang, X. He, Distributed optimal consensus over resource allocation network
and its application to dynamical economic dispatch. IEEE Trans. Neural Netw. Learn. Syst.
29(6), 2407–2418 (2018)
27. C. Li, X. Yu, W. Yu, G. Chen, J. Wang, Efficient computation for sparse load shifting in demand
side management. IEEE Trans. Smart Grid 8(1), 250–261 (2017)
28. X. He, D.W.C. Ho, T. Huang, J. Yu, H. Abu-Rub, C. Li, Second-order continuous-time
algorithms for economic power dispatch in smart grids. IEEE Trans. Syst. Man Cybern. Syst.
48(9), 1482–1492 (2018)
29. H. Xing, Y. Mou, M. Fu, Z. Lin, Distributed bisection method for economic power dispatch in
smart grid. IEEE Trans. Power Syst. 30(6), 3024–3035 (2015)
30. C. Li, X. Yu, W. Yu, T. Huang, Z.-W. Liu, Distributed event-triggered scheme for economic
dispatch in smart grids. IEEE Trans. Ind. Informat. 12(5), 1775–1785 (2016)
31. G. Binetti, A. Davoudi, F.L. Lewis, D. Naso, B. Turchiano, Distributed consensus-based
economic dispatch with transmission losses. IEEE Trans. Power Syst. 29(4), 1711–1720 (2014)
32. A. Nedic, A. Olshevsky, W. Shi, Improved convergence rates for distributed resource allocation
(2017). arXiv preprint arXiv:1706.05441
33. N. Li, L. Chen, M.A. Dahleh, Demand response using linear supply function bidding. IEEE
Trans. Smart Grid 6(4), 1827–1838 (2015)
34. N. Li, L. Chen, S.H. Low, Optimal demand response based on utility maximization in power
networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES).
https://doi.org/10.1109/PES.2011.6039082
35. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
Chapter 8
Event-Triggered Algorithms for Distributed Economic Dispatch
Abstract In this chapter, we again study the problem of energy management in smart grid operations, namely the economic dispatch problem: minimizing a sum of local convex cost functions subject to both local interval constraints and a coupling linear constraint over an undirected network. We propose a new event-triggered distributed accelerated primal–dual algorithm, ET-DAPDA, that reduces computation and interaction while solving the EDP with uncoordinated step-sizes. ET-DAPDA (with respect to the dual updates) adds two momentum terms to the gradient tracking scheme, and each node interacts with its neighbors only at its event-triggered sampling time instants. Assuming smoothness and strong convexity of the cost functions, the linear convergence of ET-DAPDA is analyzed using the generalized small gain theorem. In addition, ET-DAPDA strictly excludes Zeno-like behavior, which greatly reduces the interaction cost. ET-DAPDA is investigated on the IEEE 14-bus and 118-bus systems to evaluate its applicability. Simulation results on convergence rates are further compared with existing techniques to demonstrate the superiority of ET-DAPDA.
8.1 Introduction
(local interval constraints and a coupling linear constraint) convex optimization problem over an undirected network. ET-DAPDA guarantees that the interaction between any two nodes in the network occurs in an event-triggered way. To be specific, the principal contributions of this work are the following:
(i) A distributed event-triggered interaction scheme is incorporated into the primal–dual method. Compared with the recent work [20–23], ET-DAPDA reduces the energy consumption and the intensive calculations caused by interactions between nodes, which may extend the useful life of a particular network such as a power system or smart grid.
(ii) In comparison with [34, 35], ET-DAPDA does not demand centralized control to generate the Lagrangian multiplier. Specifically, ET-DAPDA incorporates gradient tracking into the distributed primal–dual method to achieve linear convergence, and adds two types of momentum terms that enable nodes to obtain more information from their neighbors than the existing methods [26, 27], thereby accelerating convergence.
(iii) The convergence of ET-DAPDA is analyzed using the generalized small gain theorem, a standard tool in control theory for analyzing the stability of interconnected dynamical systems, which is expected to be broadly applicable to other accelerated algorithms. In addition, ET-DAPDA allows a more relaxed step-size choice than most existing distributed methods presented in [20, 26, 33], etc.
(iv) Presuming that the largest step-size and the maximum momentum coefficient obey certain upper bounds, ET-DAPDA converges linearly to the optimal solution under smoothness and strong convexity of the cost functions. ET-DAPDA also rigorously excludes Zeno-like behavior [25, 27]. In addition, in comparison with [23, 33], explicit estimates of the convergence rate of ET-DAPDA are established.
8.2 Preliminaries
8.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. We let $\langle\cdot,\cdot\rangle$ denote the inner product of two vectors, $\|\cdot\|$ denote the 2-norm for both vectors and matrices, $\rho(B)$ represent the spectral radius of a matrix $B$, and $\mathrm{diag}\{y\}$ be the diagonal matrix of the vector $y$, whose diagonal elements are the $y_i$ and whose other elements are zero. The symbols $\mathbf{1}_m$, $I_m$, and $B^T$ denote the $m$-dimensional all-ones column vector, the $m\times m$ identity matrix, and the transpose of a matrix $B$ (possibly a vector), respectively. Given an infinite sequence $s_i=(s_i(0),s_i(1),s_i(2),\ldots)$, where $s_i(t)\in\mathbb{R}$ for all $t$, we define $\|s_i\|^{\gamma,T}=\max_{t=0,1,\ldots,T}\frac{1}{\gamma^t}\|s_i(t)\|$ and $\|s_i\|^{\gamma}=\sup_{t\ge 0}\frac{1}{\gamma^t}\|s_i(t)\|$ with $\gamma\in(0,1)$. Given two vectors $v=[v_1,\ldots,v_m]^T$ and $u=[u_1,\ldots,u_m]^T$, the notation $v\preceq u$ means that $v_i\le u_i$ for every $i$.
8.2.2 Nomenclature
This chapter studies a general EDP: finding the optimal solution of the sum of $m$ local convex cost functions subject to both local interval constraints and a coupling linear constraint over an undirected network of $m$ nodes, described as follows:

$$\min_{x} f(x)=\sum_{i=1}^{m}f_i(x_i),\quad \text{s.t.}\quad \sum_{i=1}^{m}x_i=\sum_{i=1}^{m}d_i,\quad x_i^{\min}\le x_i\le x_i^{\max}, \qquad (8.1)$$

for all $i=1,\ldots,m$, where $x=[x_1,\ldots,x_m]^T\in\mathbb{R}^m$, the coupling linear constraint set is $M=\{x\in\mathbb{R}^m\,|\,\sum_{i=1}^{m}x_i=\sum_{i=1}^{m}d_i\}$, and the local interval constraint set is $X_i=\{\tilde{x}\in\mathbb{R}\,|\,x_i^{\min}\le\tilde{x}\le x_i^{\max}\}$ for all $i=1,\ldots,m$. We write $X=X_1\times X_2\times\cdots\times X_m$ for the Cartesian product of the local interval constraints. In addition, we denote by $P=X\cap M$ the feasible set and by $x^*=[x_1^*,\ldots,x_m^*]^T$ the optimal solution of (8.1).
8.3 Algorithm Development
The dual problem of (8.1) can be written as

$$\max_{\lambda\in\mathbb{R}} C(\lambda)=\sum_{i=1}^{m}C_i(\lambda), \qquad (8.2)$$
where $C_i(\lambda)=-f_i^*(-\lambda)-\lambda d_i$ and $f_i^*(\lambda)=\sup_{x\in X_i}\{\lambda x-f_i(x)\}$. Then, we transform the dual problem (8.2) into the following minimization form:

$$\min_{\lambda\in\mathbb{R}} q(\lambda)=\sum_{i=1}^{m}q_i(\lambda), \qquad (8.3)$$
The next triggering time instant of node $i$ is determined by

$$t_{k(t)+1}^{i}=\inf\{t>t_{k(t)}^{i}\;|\;\|e_i^{\lambda}(t)\|+\|e_i^{y}(t)\|>Q\omega^{t}\}, \qquad (8.5)$$

where $Q>0$ and $0<\omega<1$ are the control parameters, and the measurement errors $e_i^{\lambda}(t)$ and $e_i^{y}(t)$ are given as

$$e_i^{\lambda}(t)=\lambda_i(t_{k(t)}^{i})-\lambda_i(t),\qquad e_i^{y}(t)=y_i(t_{k(t)}^{i})-y_i(t). \qquad (8.6)$$
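A local triggering test is then a one-line comparison. The sketch below (hypothetical names; lam_last and y_last are the values broadcast at node i's latest triggering instant) implements (8.5) and (8.6):

```python
import numpy as np

def should_trigger(lam_i, y_i, lam_last, y_last, t, Q, omega):
    """Event test (8.5): trigger when the measurement errors (8.6) exceed
    the exponentially decaying threshold Q * omega**t."""
    e_lam = np.linalg.norm(np.atleast_1d(lam_last - lam_i))  # e_i^lambda(t)
    e_y = np.linalg.norm(np.atleast_1d(y_last - y_i))        # e_i^y(t)
    return e_lam + e_y > Q * omega ** t
```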
where $a_{ij}\ge 0$ is the weight between node $j$ and node $i$, and $h>0$ is a tunable parameter.
5: Local variable updates: each node $i$ updates the following variables according to (8.7):

$$\begin{cases} x_i(t+1)=\arg\min_{x_i\in X_i}\{f_i(x_i)+\lambda_i(t)(x_i-d_i)\},\\ z_i(t+1)=v_i^{\lambda}(t)+\eta_i(t)(z_i(t)-z_i(t-1))-\alpha_i y_i(t),\\ \lambda_i(t+1)=z_i(t+1)-\eta_i(t)(z_i(t+1)-z_i(t)),\\ y_i(t+1)=v_i^{y}(t)-x_i(t+1)+x_i(t). \end{cases} \qquad (8.8)$$
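For readability, a per-node sketch of the update (8.8) follows (hypothetical names; primal_argmin is assumed to solve the scalar problem $\min_{x\in X_i}f_i(x)+\lambda(x-d_i)$ in closed form, and v_lam, v_y are the combined values formed in step 4 from the neighbors' last broadcasts):

```python
def local_update(lam, z, z_prev, y, x, v_lam, v_y, eta, alpha, primal_argmin):
    """One ET-DAPDA iteration (8.8) at a single node (sketch)."""
    x_new = primal_argmin(lam)                      # primal minimization step
    z_new = v_lam + eta * (z - z_prev) - alpha * y  # dual step + heavy-ball term
    lam_new = z_new - eta * (z_new - z)             # second momentum correction
    y_new = v_y - x_new + x                         # tracks the dual gradient -x_i + d_i
    return x_new, z_new, lam_new, y_new
```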
Assumption 8.1 ([25]) The undirected network $G$ is connected, and the matrix $W$ satisfies $\mathbf{1}_m^T W=\mathbf{1}_m^T$, $W\mathbf{1}_m=\mathbf{1}_m$, and $\delta=\rho(W-\mathbf{1}_m\mathbf{1}_m^T/m)<1$.

Assumption 8.2 ([26]) Each cost function $f_i$, $i\in V$, is $\mu_i$-strongly convex and has a Lipschitz continuous gradient with parameter $\sigma_i$, where $\sigma_i,\mu_i\in(0,+\infty)$.

Remark 8.2 Assumption 8.1 describes the information exchange rules in the network, while Assumption 8.2 ensures the nonemptiness of the optimal solution set.
First, the convergence analysis of the Lagrangian multiplier relies primarily on the generalized small gain theorem [36] described below.

Lemma 8.3 ([36]) Assume that non-negative vector sequences $\{\tilde{v}_i(t)\}_{t=0}^{\infty}$, $i=1,\ldots,m$, a non-negative matrix $\tilde{\Gamma}\in\mathbb{R}^{m\times m}$, $\tilde{u}\in\mathbb{R}^m$, and $\gamma\in(0,1)$ satisfy

$$\tilde{v}^{\gamma,T}\preceq\tilde{\Gamma}\,\tilde{v}^{\gamma,T}+\tilde{u}$$

for every positive integer $T$, where $\tilde{v}^{\gamma,T}=[\|\tilde{v}_1\|^{\gamma,T},\ldots,\|\tilde{v}_m\|^{\gamma,T}]^T$. Then $\|\tilde{v}_i\|^{\gamma}<B_0$ if $\rho(\tilde{\Gamma})<1$, where $B_0<\infty$. Hence, each $\|\tilde{v}_i\|$, $i\in\{1,\ldots,m\}$, converges linearly to zero at a rate of $O(\gamma^t)$.
In the following, the bound on $\|v_1\|^{\gamma,T}$ is shown.
Lemma 8.4 For all $T\ge 0$ and under Assumptions 8.1 and 8.2, we have the following inequality:

$$\|v_1\|^{\gamma,T}\le\frac{\hat{\alpha}\hat{\ell}}{\gamma-\delta-\hat{\alpha}\hat{\ell}}\|v_2\|^{\gamma,T}+\frac{2\hat{\eta}}{\gamma-\delta-\hat{\alpha}\hat{\ell}}\|v_3\|^{\gamma,T}+\frac{\hat{\alpha}}{\gamma-\delta-\hat{\alpha}\hat{\ell}}\|v_4\|^{\gamma,T}+\frac{h\|L\|}{\gamma-\delta-\hat{\alpha}\hat{\ell}}\|e^{\lambda}\|^{\gamma,T}+\frac{\|v_1(0)\|}{\gamma-\delta-\hat{\alpha}\hat{\ell}}.$$
Proof By the update of $\lambda(t)$ of ET-DAPDA in (8.10), we obtain

$$\|v_1(t+1)\|\le\Big\|\Big(W-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)v_1(t)\Big\|+\Big\|\Big(I_m-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)D_\eta(t)v_3(t)\Big\|+\Big\|\Big(I_m-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)D_\alpha y(t)\Big\|+h\|L\|\|e^{\lambda}(t)\|+\Big\|\Big(I_m-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)D_\eta(t+1)v_3(t+1)\Big\|, \qquad (8.11)$$

where the inequality in (8.11) follows from the fact that $(W-(1/m)\mathbf{1}_m\mathbf{1}_m^T)(I_m-(1/m)\mathbf{1}_m\mathbf{1}_m^T)=W-(1/m)\mathbf{1}_m\mathbf{1}_m^T$. Notice that $\hat{\alpha}=\|D_\alpha\|$ and $\|D_\eta(t)\|\le\hat{\eta}$. Built on Assumption 8.1 and the fact that $\|y(t)\|\le\hat{\ell}\|v_1(t)\|+\hat{\ell}\|v_2(t)\|+\|v_4(t)\|$ holds [12], (8.11) further implies that

$$\|v_1(t+1)\|\le(\delta+\hat{\alpha}\hat{\ell})\|v_1(t)\|+\hat{\alpha}\hat{\ell}\|v_2(t)\|+\hat{\eta}\|v_3(t+1)\|+\hat{\eta}\|v_3(t)\|+\hat{\alpha}\|v_4(t)\|+h\|L\|\|e^{\lambda}(t)\|. \qquad (8.13)$$
From here, the process resembles that in the proof of Lemma 8 in [20]; we include it for completeness. Taking the supremum over $t=0,\ldots,T$ on both sides of (8.13) after dividing by $\gamma^{t+1}$, we obtain

$$\|v_1\|^{\gamma,T}\le\frac{\delta+\hat{\alpha}\hat{\ell}}{\gamma}\|v_1\|^{\gamma,T}+\frac{\hat{\alpha}\hat{\ell}}{\gamma}\|v_2\|^{\gamma,T}+\frac{2\hat{\eta}}{\gamma}\|v_3\|^{\gamma,T}+\frac{\hat{\alpha}}{\gamma}\|v_4\|^{\gamma,T}+\frac{h\|L\|}{\gamma}\|e^{\lambda}\|^{\gamma,T}+\|v_1(0)\|, \qquad (8.15)$$

which after some algebraic manipulation yields the desired result. This completes the proof.
Notice that $(1/m)\mathbf{1}_m\mathbf{1}_m^T\lambda(t)=\mathbf{1}_m\bar{\lambda}(t)$. The bound on $\|v_2\|^{\gamma,T}$ is given in the next lemma.

Lemma 8.5 Under Assumptions 8.1–8.2, if $c_1<\gamma<1$ and $0<(1/m^2)\mathbf{1}_m^T\alpha<2/\ell$, then for all $T\ge 0$,

$$\|v_2\|^{\gamma,T}\le\frac{\hat{\alpha}\hat{\ell}}{\gamma-c_1}\|v_1\|^{\gamma,T}+\frac{2\hat{\eta}}{\gamma-c_1}\|v_3\|^{\gamma,T}+\frac{\hat{\alpha}}{\gamma-c_1}\|v_4\|^{\gamma,T}+\frac{\gamma}{\gamma-c_1}\|v_2(0)\|.$$

Proof By the update of $\lambda(t)$ in (8.10),

$$\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t+1)-\mathbf{1}_m\lambda^*\Big\|\le\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t)-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^TD_\alpha\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^Ty(t)-\mathbf{1}_m\lambda^*\Big\|+\hat{\eta}\big[\|z(t)-z(t-1)\|+\|z(t+1)-z(t)\|\big]+\hat{\alpha}\Big\|y(t)-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^Ty(t)\Big\|. \qquad (8.16)$$
We now discuss the first term in the inequality (8.16). Note that $(1/m)\mathbf{1}_m\mathbf{1}_m^Ty(t)=(1/m)\mathbf{1}_m\mathbf{1}_m^T\nabla Q(\lambda(t))$. Utilizing $\mathbf{1}_m\mathbf{1}_m^TD_\alpha\mathbf{1}_m\mathbf{1}_m^T=\mathbf{1}_m^T\alpha\,\mathbf{1}_m\mathbf{1}_m^T$, one obtains

$$\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t)-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^TD_\alpha\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^Ty(t)-\mathbf{1}_m\lambda^*\Big\|\le\Big\|\mathbf{1}_m\Big(\bar{\lambda}(t)-\frac{1}{m^2}\mathbf{1}_m^T\alpha\,\nabla q(\bar{\lambda}(t))-\lambda^*\Big)\Big\|+\frac{1}{m}\mathbf{1}_m^T\alpha\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\nabla Q(\mathbf{1}_m\bar{\lambda}(t))-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\nabla Q(\lambda(t))\Big\|=\Lambda_1+\Lambda_2, \qquad (8.17)$$

where $\nabla q(\bar{\lambda}(t))=\mathbf{1}_m^T\nabla Q(\mathbf{1}_m\bar{\lambda}(t))$. By Lemma 3 in [33], if $0<(1/m^2)\mathbf{1}_m^T\alpha<2/\ell$, $\Lambda_1$ is bounded by

$$\Lambda_1\le c_1\sqrt{m}\,\|\bar{\lambda}(t)-\lambda^*\|=c_1\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t)-\mathbf{1}_m\lambda^*\Big\|, \qquad (8.18)$$

where $c_1=\max\{|1-(1/m^2)\mathbf{1}_m^T\alpha\,\ell|,\ |1-\vartheta(1/m^2)\mathbf{1}_m^T\alpha|\}$. Then, $\Lambda_2$ can be bounded as

$$\Lambda_2\le\frac{1}{m}\mathbf{1}_m^T\alpha\,\hat{\ell}\,\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t)-\lambda(t)\Big\|. \qquad (8.19)$$
Plugging (8.17)–(8.19) into (8.16) yields the desired composite bound. Here, we can identify the terms in (8.20) with the terms in (8.13); therefore, to establish this lemma, we proceed as in the proof of Lemma 8.4 from (8.13) onward. This completes the proof.
The following lemma provides the bound on $\|v_3\|^{\gamma,T}$.

Lemma 8.6 Under Assumptions 8.1 and 8.2, if $2\hat{\eta}<\gamma<1$, one gets

$$\|v_3\|^{\gamma,T}\le\frac{\kappa_1+\hat{\alpha}\hat{\ell}}{\gamma-2\hat{\eta}}\|v_1\|^{\gamma,T}+\frac{\hat{\alpha}\hat{\ell}}{\gamma-2\hat{\eta}}\|v_2\|^{\gamma,T}+\frac{\hat{\alpha}}{\gamma-2\hat{\eta}}\|v_4\|^{\gamma,T}+\frac{h\|L\|}{\gamma-2\hat{\eta}}\|e^{\lambda}\|^{\gamma,T}+\frac{\gamma\|v_3(0)\|}{\gamma-2\hat{\eta}}.$$
Proof It follows from the updates of $z(t)$ and $\lambda(t)$ of ET-DAPDA in (8.10) that

$$\|z(t+1)-z(t)\|\le\kappa_1\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t)-\lambda(t)\Big\|+2\hat{\eta}\|z(t)-z(t-1)\|+h\|L\|\|e^{\lambda}(t)\|+\hat{\alpha}\|y(t)\|, \qquad (8.21)$$

where the inequality in (8.21) follows from the fact that $(W-I_m)(I_m-(1/m)\mathbf{1}_m\mathbf{1}_m^T)=W-I_m$. Then, one has

$$\|v_3(t+1)\|\le 2\hat{\eta}\|v_3(t)\|+(\kappa_1+\hat{\alpha}\hat{\ell})\|v_1(t)\|+\hat{\alpha}\hat{\ell}\|v_2(t)\|+\hat{\alpha}\|v_4(t)\|+h\|L\|\|e^{\lambda}(t)\|.$$

Similar to the procedure following (8.13), it is straightforward to infer the desired result. This completes the proof.
The next lemma establishes the inequality that bounds the gradient-estimation error term $\|v_4\|^{\gamma,T}$.

Lemma 8.7 Under Assumptions 8.1 and 8.2, if $\delta<\gamma<1$, one obtains, for all $T\ge 0$,

$$\|v_4\|^{\gamma,T}\le\frac{\hat{\ell}+2\hat{\eta}\hat{\ell}}{\gamma-\delta}\|v_3\|^{\gamma,T}+\frac{h\|L\|}{\gamma-\delta}\|e^{y}\|^{\gamma,T}+\frac{\gamma}{\gamma-\delta}\|v_4(0)\|.$$
Proof By the update of $y(t)$ of ET-DAPDA in (8.10), we have

$$\Big\|y(t+1)-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^Ty(t+1)\Big\|\le\delta\Big\|y(t)-\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^Ty(t)\Big\|+h\|L\|\|e^{y}(t)\|+\|x(t+1)-x(t)\|, \qquad (8.23)$$

where we utilize the triangle inequality and Assumption 8.1 to derive the inequality. With regard to the last term in (8.23), we apply the update of $\lambda(t)$ of ET-DAPDA in (8.10) and the gradient $\nabla q_i(\lambda_i(t))=-x_i(t)+d_i$ to obtain the corresponding bound. Similar to the procedure following (8.13), it is straightforward to infer the desired result if $\delta<\gamma<1$. This completes the proof.
Combining Lemmas 8.4–8.7 gives

$$v^{\gamma,T}\preceq\Gamma\, v^{\gamma,T}+u, \qquad (8.25)$$

where $u=[u_1,u_2,u_3,u_4]^T$, $v^{\gamma,T}=[\|v_1\|^{\gamma,T},\|v_2\|^{\gamma,T},\|v_3\|^{\gamma,T},\|v_4\|^{\gamma,T}]^T$, and the elements of the matrix $\Gamma=[\gamma_{ij}]\in\mathbb{R}^{4\times4}$ are given by

$$\Gamma=\begin{bmatrix} 0 & \dfrac{\hat{\alpha}\hat{\ell}}{\gamma-\delta-\hat{\alpha}\hat{\ell}} & \dfrac{2\hat{\eta}}{\gamma-\delta-\hat{\alpha}\hat{\ell}} & \dfrac{\hat{\alpha}}{\gamma-\delta-\hat{\alpha}\hat{\ell}}\\[6pt] \dfrac{\hat{\alpha}\hat{\ell}}{\gamma-c_1} & 0 & \dfrac{2\hat{\eta}}{\gamma-c_1} & \dfrac{\hat{\alpha}}{\gamma-c_1}\\[6pt] \dfrac{\kappa_1+\hat{\alpha}\hat{\ell}}{\gamma-2\hat{\eta}} & \dfrac{\hat{\alpha}\hat{\ell}}{\gamma-2\hat{\eta}} & 0 & \dfrac{\hat{\alpha}}{\gamma-2\hat{\eta}}\\[6pt] 0 & 0 & \dfrac{\hat{\ell}+2\hat{\eta}\hat{\ell}}{\gamma-\delta} & 0 \end{bmatrix}.$$
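In simulations, the small-gain condition $\rho(\Gamma)<1$ can be verified directly. A minimal sketch (hypothetical function name; ell_hat stands for the hatted cost-function constant appearing in the lemmas above):

```python
import numpy as np

def gains_matrix(alpha, eta, gamma, delta, c1, kappa1, ell_hat):
    """Assemble the 4x4 gain matrix of (8.25)."""
    g1 = gamma - delta - alpha * ell_hat
    g2 = gamma - c1
    g3 = gamma - 2.0 * eta
    return np.array([
        [0.0, alpha * ell_hat / g1, 2 * eta / g1, alpha / g1],
        [alpha * ell_hat / g2, 0.0, 2 * eta / g2, alpha / g2],
        [(kappa1 + alpha * ell_hat) / g3, alpha * ell_hat / g3, 0.0, alpha / g3],
        [0.0, 0.0, ell_hat * (1 + 2 * eta) / (gamma - delta), 0.0],
    ])

# Linear convergence at rate O(gamma**t) holds once the spectral radius
# rho = max(abs(np.linalg.eigvals(gains_matrix(...)))) is below 1.
```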
and the maximum momentum coefficient satisfies

$$\hat{\eta}<\min\left\{\frac{\beta_2(\gamma-c_1)-\hat{\alpha}(\beta_1\hat{\ell}+\beta_4)}{2\beta_3},\ \frac{\beta_4(\gamma-\delta)-\beta_3\hat{\ell}}{2\beta_3\hat{\ell}},\ \frac{\beta_3\gamma-\beta_1\kappa_1-\hat{\alpha}(\beta_1\hat{\ell}+\beta_2\hat{\ell}+\beta_4)}{2\beta_3}\right\}, \qquad (8.27)$$
where $\beta=[\beta_1,\ldots,\beta_4]^T$ is a positive vector satisfying

$$\beta_1>0,\quad \beta_2>\frac{m(\beta_1\hat{\ell}+\beta_4)}{\vartheta},\quad \beta_3>\beta_1\kappa_1,\quad \beta_4>\frac{\beta_3\hat{\ell}}{1-\delta}.$$
Proof First, combining the results of Lemmas 8.4–8.7, we infer inequality (8.25) immediately. Next, we give sufficient conditions under which the spectral radius of $\Gamma$, denoted $\rho(\Gamma)$, is strictly less than 1, i.e., $\rho(\Gamma)<1$. According to Theorem 8.1.29 in [37], for a positive vector $\beta=[\beta_1,\ldots,\beta_4]^T\in\mathbb{R}^4$, if $\Gamma\beta<\beta$, then $\rho(\Gamma)<1$ holds. The inequality $\Gamma\beta<\beta$ is equivalent to

$$\begin{cases} 2\hat{\eta}\beta_3<\beta_1(\gamma-\delta)-\hat{\alpha}(\beta_1\hat{\ell}+\beta_2\hat{\ell}+\beta_4),\\ 2\hat{\eta}\beta_3<\beta_2(\gamma-c_1)-\hat{\alpha}(\beta_1\hat{\ell}+\beta_4),\\ 2\hat{\eta}\beta_3<\beta_3\gamma-\beta_1\kappa_1-\hat{\alpha}(\beta_1\hat{\ell}+\beta_2\hat{\ell}+\beta_4),\\ 2\hat{\eta}\hat{\ell}\beta_3<\beta_4(\gamma-\delta)-\beta_3\hat{\ell}. \end{cases} \qquad (8.28)$$
$$\sum_{i=1}^{m}\frac{\mu_i}{2}\big(x_i(t+1)-x_i^*\big)^2\le\sum_{i=1}^{m}\Big(\nabla q_i(\lambda^*)\big(\lambda_i(t)-\lambda^*\big)+\frac{1}{2\mu_i}\big(\lambda_i(t)-\lambda^*\big)^2+\lambda_i(t)\big(x_i^*-d_i\big)\Big). \qquad (8.31)$$
The fact $x^*\in P$ yields $\sum_{i=1}^{m}(x_i^*-d_i)=0$. Additionally, as $t\to\infty$, $\lambda_i(t)\to\lambda^*$ for all $i\in V$. Hence, as $t\to\infty$, the right and left sides of (8.31) tend to zero. This shows that $\|x(t)-x^*\|$ converges linearly, i.e., $\|x(t)-x^*\|\le O((\gamma/2)^t)$, given that $\|\lambda(t)-\mathbf{1}_m\lambda^*\|\le O(\gamma^t)$ as established in Theorem 8.8. The proof of Theorem 8.10 is complete.
Remark 8.11 Recall from the conditions on the largest step-size $\hat{\alpha}$ and the maximum momentum coefficient $\hat{\eta}$ in (8.26) and (8.27) of Theorem 8.8 that the conditions imposed by ET-DAPDA depend on the parameters ($\ell$, $\hat{\ell}$, $m$) related to the cost functions, ($\delta$, $\kappa_1$) related to the network topology, and ($\gamma$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$). Moreover, the tunable parameters $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ in Theorem 8.8 rely only on the parameters of the network and the cost functions. Hence, the design of $\hat{\alpha}$ and $\hat{\eta}$ depends on the complexity of the EDP. It is worth noting that when the EDP involves important issues such as non-convex and/or discontinuous cost functions, transmission line losses, or valve-point effects, ET-DAPDA may not be directly suitable or may need to be modified, so the conditions on $\hat{\alpha}$ and $\hat{\eta}$ may take another form. This is an open problem and remains to be studied in the future.
Provided that

$$Q>\frac{1+\gamma_2}{\gamma_2}\big(B_1+B_2+B_4+\bar{B}_1+\bar{B}_2\big), \qquad (8.33)$$

the event-triggered condition $\|e_i^{\lambda}(t)\|+\|e_i^{y}(t)\|-Q\omega^t>0$ does not hold; thus, we have

$$\begin{aligned}\|e_i^{\lambda}(t)\|+\|e_i^{y}(t)\|\le{}&\|\lambda_i(t_{k(t)}^{i})-\bar{\lambda}(t_{k(t)}^{i})\|+\|\bar{\lambda}(t_{k(t)}^{i})-\bar{\lambda}(t)\|+\|\bar{\lambda}(t)-\lambda_i(t)\|\\ &+\|y_i(t_{k(t)}^{i})-\bar{y}(t_{k(t)}^{i})\|+\|\bar{y}(t_{k(t)}^{i})-\bar{y}(t)\|+\|\bar{y}(t)-y_i(t)\|\\ \le{}&\big(B_1+B_2+B_4+\bar{B}_1+\bar{B}_2\big)\gamma^{t_{k(t)}^{i}}+\big(B_2\gamma^t+B_1+B_4+\bar{B}_2+\bar{B}_1\big)\gamma^t. \end{aligned} \qquad (8.34)$$

If $t_{k(t)+1}^{i}$ meets condition (8.5), one has (recall that $\omega=\gamma$ is selected to satisfy Theorem 8.8)

$$Q\gamma^{t_{k(t)+1}^{i}}\le\|e_i^{\lambda}(t_{k(t)+1}^{i})\|+\|e_i^{y}(t_{k(t)+1}^{i})\|. \qquad (8.35)$$

Combining (8.34) with (8.35) yields

$$t_{k(t)+1}^{i}-t_{k(t)}^{i}\ge\ln\!\left(\frac{B_1+B_2+B_4+\bar{B}_1+\bar{B}_2}{Q-\big(B_1+B_2+B_4+\bar{B}_1+\bar{B}_2\big)}\right)\Big/\ln\gamma. \qquad (8.36)$$

Selecting $h$ such that (8.32) holds, the range for $Q$ in (8.33) is strictly nonempty, and thus $t_{k(t)+1}^{i}-t_{k(t)}^{i}\ge 2$ in (8.36) is guaranteed. The proof of Theorem 8.12 is accomplished.
Remark 8.13 Many metaheuristic algorithms (including the genetic algorithm, ant colony optimization, PSO, artificial neural network algorithms, etc.) can solve the EDP (8.1) successfully in terms of finding a good solution within reasonable computation time. We summarize three advantages of applying ET-DAPDA compared with such metaheuristic algorithms for the EDP: (1) ET-DAPDA guarantees the global optimal solution of the EDP, whereas metaheuristic algorithms involve randomness; (2) ET-DAPDA produces a fixed output for a fixed input, while a metaheuristic algorithm does not; (3) ET-DAPDA guarantees a fixed efficiency, whereas a metaheuristic algorithm cannot guarantee efficiency because of its randomness.
This section validates the theoretical results and demonstrates the improved performance of ET-DAPDA. All simulations are carried out in MATLAB on a 2017 MacBook Pro with an Intel Core i5 processor (2 cores, 2.3 GHz) and 8 GB of memory.
Consider the EDP on the IEEE 14-bus system as described in [7], where buses {1, 2, 3, 6, 8} are generator buses. Each generator's cost function is represented by $f_i(x_i) = c_i x_i + b_i x_i^2$, where the cost coefficients $(c_i, b_i)$ of each generator i and the generators' generation capacities ($x_i \in [x_i^{\min}, x_i^{\max}]$) are summarized in Table 8.1 [7]. The total demand in this system is assumed to be $\sum_{i=1}^{m} d_i = 380$ MW. By running ET-DAPDA (8.10), it is deduced from Fig. 8.1a and b that the whole system successfully implements the economic dispatch, and the optimal power generations are $x_1^* = 80$ MW, $x_2^* = 90$ MW, $x_3^* = 64.67$ MW, $x_6^* = 70$ MW, and $x_8^* = 75.33$ MW. By computation, the total cost incurred by the network is \$2176. In addition, each generator i's event-triggered sampling time instants and one randomly selected measurement error are depicted in Fig. 8.1c and d, respectively. It can be derived from the statistics of Fig. 8.1c that the numbers of sampling times for the 5 generators are [98, 84, 93, 96, 113], with an average of 97. Thus, the average sampling rate is 97/600 = 16.17%, which indicates that Zeno-like behavior does not happen. Figure 8.1d shows that the measurement error decreases to zero asymptotically. Finally, the computation time of ET-DAPDA for solving the EDP on the IEEE 14-bus system is 0.2828 s.
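For a quadratic-cost EDP of this form, the dispatch can be cross-checked centrally by equalizing marginal costs via bisection on the demand balance. The sketch below illustrates this with hypothetical cost coefficients and capacities (the actual Table 8.1 data are not reproduced here).

```python
# Centralized cross-check for a quadratic-cost EDP: the optimal dispatch
# equalizes marginal costs c_i + 2*b_i*x_i at a common price lambda (up to
# projection onto the capacity box); lambda is found by bisection on the
# demand balance. Coefficients below are hypothetical placeholders.
import numpy as np

c = np.array([2.0, 3.0, 4.0, 4.0, 2.5])         # linear cost coefficients
b = np.array([0.02, 0.01, 0.035, 0.03, 0.025])  # quadratic coefficients
xmin = np.zeros(5)
xmax = np.array([80.0, 90.0, 70.0, 70.0, 80.0]) # assumed capacities
demand = 380.0

def dispatch(lam):
    # optimal generation of each unit for a given marginal price lambda
    return np.clip((lam - c) / (2.0 * b), xmin, xmax)

lo, hi = 0.0, 100.0
for _ in range(100):  # bisection: total generation vs. total demand
    lam = 0.5 * (lo + hi)
    if dispatch(lam).sum() < demand:
        lo = lam
    else:
        hi = lam
x = dispatch(lam)
print("dispatch:", np.round(x, 2), "total:", round(x.sum(), 2))
print("total cost:", round(np.sum(c * x + b * x**2), 2))
```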
[Fig. 8.1: (a)–(b) optimal power generations and total generation versus total demand over 600 iterations; (c) event-triggered sampling time instants of the five generators; (d) a randomly selected measurement error $\|e(t)\|$ against the threshold $Q\omega^t$.]
8.6 Conclusion

This chapter develops and analyzes ET-DAPDA for handling the EDP over a connected undirected network. ET-DAPDA not only allows uncoordinated constant step-sizes,
but also, most importantly, integrates a gradient-tracking strategy with two types of momentum terms for faster convergence. If the largest step-size and the maximum momentum coefficient are positive and sufficiently smaller than certain constants, it is proved that ET-DAPDA linearly seeks the exact optimal solution with an explicit linear convergence rate under the assumptions on the networks and the cost functions. Numerical experiments validate the theoretical results. Nevertheless, ET-DAPDA is not perfect; for instance, it is not appropriate for complex networks and other practical scenarios in the EDP. Thus, future works can be devoted to these issues.
[Fig. 8.3: comparison with related methods, with the residual E(t) as the comparison metric, over 4500 time steps.]

[Fig. 8.4: comparison with related methods, with the obtained cost of the EDP as the comparison metric; (b) cost on the IEEE 118-bus system [12], reaching 9.262 × 10⁴ at iteration 1877.]
9.1 Introduction
The distributed convex optimization problem has attracted the interest of many researchers in the past few decades, with tremendous advances in technology and low-cost devices. Numerous engineering applications can be viewed as distributed convex optimization problems, such as robust control [1], smart grids
[2], model prediction [3], and smart metering [4], among others [5–11]. This type
of problem requires a fully distributed optimization algorithm. Unlike traditional
centralized approaches, distributed algorithms involve multiple nodes that have
access to their own local information without the existence of a central coordinator
(node) to obtain all the information on the network [12].
A typical feature of many practical distributed optimization scenarios is that they must adapt to dynamic changes and uncertain environments. Such problems with uncertainties can be cast as distributed online optimization, where the cost functions change over time and adaptive decisions, which depend only on past information, must be made at each time step. For distributed online optimization problems, the goal is to develop distributed online algorithms that operate in a coordinated manner. To evaluate the performance of a distributed online algorithm, it is conventional to measure the difference between the cost incurred by the algorithm on the sequence of cost functions and the cost incurred by the best fixed decision in hindsight. This difference is called the regret, and a distributed online algorithm is regarded as "good" when its regret is sublinear [13]. For distributed online optimization problems over networks, many significant results have recently emerged [14–25]. Nunez and
Cortes [14] designed a distributed online subgradient descent algorithm on time-varying balanced directed networks, which allowed proportional–integral feedback on the divergence among neighboring nodes. Akbari et al. [15] investigated a distributed online subgradient-push algorithm on time-varying unbalanced directed networks. Then, Bedi et al. [16] proposed an asynchronous stochastic saddle-point method to solve the online expected risk minimization problem, while Pradhan et al. [17] considered the distributed online non-parametric optimization problem. Subsequently, with the development of distributed optimization, distributed online optimization was explored further. Notable approaches addressing distributed online optimization problems over networks include the distributed proximal gradient algorithm [18], the saddle-point algorithm [19], the distributed mirror descent algorithm [20], the distributed dual averaging algorithm [21], and the distributed regression algorithm [22].
However, in the above algorithms, nodes have to accept the fact that their privacy, or at least some of it, will inevitably be disclosed during information sharing. From the perspective of privacy preservation, differential privacy is one of the most insightful privacy strategies. It was initially proposed by Dwork [26] and has gained much attention because of its rigorous formulation and provable security properties. In general, differential privacy guarantees that a malicious node can extract little sensitive information about any other node. Recently, several privacy-preserving optimization algorithms have been presented in the literature [27–31], where the central idea is to inject stochastic noises or offsets into the node communications and updates. Although the algorithms in [27–31] can solve distributed optimization problems with privacy considerations, they are only suitable for problems with time-invariant cost functions. On the other hand, the study of stochastic optimization problems is also profound.
9.2 Preliminaries
9.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Let $\mathbb{R}_{\ge 0}$, $\mathbb{Z}_{\ge 0}$, and $\mathbb{Z}_{>0}$ be the set of non-negative real numbers, the set of non-negative integers, and the set of positive integers, respectively. A quantity (possibly a vector) allocated to node i is indexed by a subscript i; e.g., $z_i(t)$ is the estimate of node i at time t. The $n \times n$ identity matrix is denoted by $I_n$, and the column vectors of all ones and all zeros (of appropriate dimensions) are denoted by $\mathbf{1}$ and $\mathbf{0}$, respectively. The transposes of a vector z and a matrix W are denoted by $z^T$ and $W^T$, respectively. The symbol $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors. The Euclidean norm (of vectors) and the 1-norm are written as $\|\cdot\|$ and $\|\cdot\|_1$, respectively. Given a random/stochastic variable Z, $P[Z]$ and $E[Z]$ denote its probability and expectation, respectively.
Suppose that, at each time t, the separable global cost function of the network is

$$f^t(z) = \sum_{i=1}^{n} f_i^t(z),$$

where $z \in \mathbb{R}^d$ is the decision vector. Then, at each time t, the nodes intend to minimize the following optimization problem:

$$\min_{z \in \mathbb{R}^d} f^t(z) = \sum_{i=1}^{n} f_i^t(z), \tag{9.1}$$

where node $i \in \mathcal{V}$ is only aware of its individual convex cost function $f_i^t$. Suppose that $f^t$ is neither known to the nodes collectively nor available at any single location.
In this chapter, we wish to design a class of distributed stochastic subgradient algorithms to reduce the total cost incurred by the nodes over a finite time horizon $T \in \mathbb{Z}_{>0}$. Since distributed online optimization is time-varying in nature, the regret is an indispensable metric for evaluating the convergence of the designed algorithm. Hence, the following classical network regret and individual regret of node $j \in \mathcal{V}$ are introduced [14]:

$$R(T) = \sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z_i(t)) - \sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z^*),$$
and

$$R_j(T) = \sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z_j(t)) - \sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z^*),$$

where $z^* = \arg\min_{z \in \mathbb{R}^d} f(z)$, with $f(z) = \sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z)$, is the best fixed decision in hindsight.
Notice that the subgradients and other quantities such as $\{z_i(t)\}_{t=1}^{T}$ and $\{f_i^t(z_i(t))\}_{t=1}^{T}$, $i \in \mathcal{V}$, become stochastic variables due to the presence of the injected stochastic noise. Accordingly, the expected (pseudo) regrets are considered: the pseudo-network-regret is

$$\bar{R}(T) = E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z_i(t))\Big] - E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z^*)\Big], \tag{9.2}$$

and the pseudo-individual-regret of node $j \in \mathcal{V}$ is

$$\bar{R}_j(T) = E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z_j(t))\Big] - E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} f_i^t(z^*)\Big]. \tag{9.3}$$

Such a pseudo-regret is a straightforward way to measure the difference between the expected total cost incurred by the node's state estimates and the optimal expected total cost that could have been achieved by the best fixed decision in hindsight.
Consider a time-varying directed network $\mathcal{G}(t) = \{\mathcal{V}, \mathcal{E}(t)\}$, where $\mathcal{V} = \{1, 2, \ldots, n\}$ is the node set and $\mathcal{E}(t) \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set. An edge $(i, j) \in \mathcal{E}(t)$ indicates that node i can directly route information to node j at time t, where i is regarded as an in-neighbor of j and, conversely, j is viewed as an out-neighbor of i. At time t, we define $N_i^{\text{in}}(t) = \{j \in \mathcal{V} \mid (j, i) \in \mathcal{E}(t)\}$ and $N_i^{\text{out}}(t) = \{j \in \mathcal{V} \mid (i, j) \in \mathcal{E}(t)\}$ as the in-neighbor set and out-neighbor set of node i, respectively. The network is unbalanced at time t if $|N_i^{\text{in}}(t)| \ne |N_i^{\text{out}}(t)|$, where $|\cdot|$ is the cardinality of a set; the time-varying network is said to be unbalanced if it is unbalanced at all times. The out-degree of node i at time t is denoted by $d_i^{\text{out}}(t) = |N_i^{\text{out}}(t)|$. A directed path is a series of consecutive directed edges. If there is at least one directed path between any pair of distinct nodes in the network, the network is called strongly connected. The weighted adjacency matrix at time t corresponding to $\mathcal{G}(t)$ is denoted by $W(t) = [w_{ij}(t)] \in \mathbb{R}^{n \times n}$, where $w_{ij}(t) > 0$ if $(j, i) \in \mathcal{E}(t)$ and $w_{ij}(t) = 0$ otherwise. In addition, at time t, $W(t)$ is column-stochastic if $\sum_{i=1}^{n} w_{ij}(t) = 1$, $\forall j \in \mathcal{V}$.
9.3 Algorithm Development

In this section, we first introduce the differential privacy strategy. Then, a distributed online algorithm with a differential privacy strategy is developed to solve the formulated problem.
The definition of differential privacy was first presented by Dwork in [26] and subsequently analyzed in detail in [27–31]. Differential privacy enables participating nodes to share their individual information without revealing sensitive information about their privacy. In this chapter, differential privacy is employed to preserve the local cost function of each individual node. We first give the definition of the adjacent relation.

Definition 9.1 (Adjacent Relation) Consider two datasets $D = \{f_i\}_{i=1}^{n}$ and $D' = \{f_i'\}_{i=1}^{n}$. If there is an $i \in \mathcal{V}$ such that $f_i \ne f_i'$ and $f_j = f_j'$, $\forall j \ne i$, then D and D' are called adjacent.
In other words, if the data of a single participant differs between the two datasets D and D' while the data of all other participants is the same, then the two datasets are adjacent. Informally, differential privacy requires that two adjacent datasets be nearly indistinguishable from the output of the randomized method that acts on them. In view of this, we introduce the following definition.
Definition 9.2 (ε-Differential Privacy) A randomized method M preserves ε-differential privacy if, for any two adjacent datasets D and D' and any set of outputs $\Psi \subseteq \text{Range}(M)$,

$$P[M(D) \in \Psi] \le \exp(\varepsilon)\cdot P[M(D') \in \Psi],$$

where $\varepsilon > 0$ and $\text{Range}(M)$ represents the output range of the randomized method M.
This inequality indicates that whether or not an individual node participates in the dataset makes no remarkable difference to the output of the randomized method M. Hence, a malicious node gains little information about other nodes. The constant ε measures the privacy level of M: a smaller ε means a higher level of privacy preservation. Thus, the constant ε is a compromise between the desired level of privacy preservation and the accuracy of M.
To ensure differential privacy, we need to know the "sensitivity" of M. It is deduced from [26] that the magnitude of the stochastic noise (perturbation) depends on the maximum change in the output of M caused by an individual entry of the data source. This amount is known as the sensitivity of the method, defined as follows.
Definition 9.3 (Sensitivity) At time t, the sensitivity of a method M is defined as

$$\Delta(t) = \max_{\text{adjacent } D,\,D'} \|x(t) - x'(t)\|_1,$$

where $x(t)$ and $x'(t)$ denote the outputs of M at time t under D and D', respectively.
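The Laplace mechanism built on this sensitivity can be sketched as follows; the sensitivity value, the published quantity, and the adjacent input below are illustrative assumptions, not quantities taken from DP-DSSP.

```python
# Minimal sketch of the Laplace mechanism: perturb a published quantity
# with Lap(sigma) noise, sigma = Delta / eps, where Delta is the assumed
# l1-sensitivity. All values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
eps = 0.5          # privacy level: smaller -> more private, noisier
Delta = 0.2        # assumed l1-sensitivity of the published quantity
sigma = Delta / eps

x = np.array([1.0, -0.3, 0.7])             # quantity a node would publish
h = x + rng.laplace(0.0, sigma, x.shape)   # noisy release, as in h_i(t)
print("published:", np.round(h, 3))

# density-ratio check: for any output v and adjacent inputs x, x' with
# ||x - x'||_1 <= Delta, the Laplace density ratio is bounded by exp(eps)
x_adj = x + np.array([0.2, 0.0, 0.0])      # adjacent input
v = h
log_ratio = (np.abs(v - x_adj).sum() - np.abs(v - x).sum()) / sigma
assert log_ratio <= eps + 1e-12            # triangle inequality guarantee
```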
Recall that our goal is to solve problem (9.1) in a distributed manner while preserving the privacy of the nodes. In the distributed setting, we assume that each node, at each time t, only acquires information transmitted by its in-neighbors and sends its own estimates to its out-neighbors through a time-varying unbalanced directed network $\mathcal{G}(t)$. Privacy preservation means that the sensitive information of each node $i \in \mathcal{V}$ is well protected. Suppose that a malicious node can eavesdrop by monitoring the entire set of communication channels and intercepting the messages exchanged among nodes. Traditional online approaches may therefore leak sensitive information of the nodes. To address this issue, DP-DSSP is employed to ensure that the privacy of the participating nodes is not leaked. It is worth mentioning that the performance of DP-DSSP inevitably suffers varying degrees of degradation due to the presence of stochastic noise; that is, the privacy level and the performance are inversely related.
We are now in a position to present DP-DSSP. In DP-DSSP, each node $i \in \mathcal{V}$ owns five estimates: $x_i(t) \in \mathbb{R}^d$, $h_i(t) \in \mathbb{R}^d$, $y_i(t) \in \mathbb{R}^d$, $s_i(t) \in \mathbb{R}$, and $z_i(t) \in \mathbb{R}^d$. Each node $i \in \mathcal{V}$ then implements the following updates at each time $t \in \{1, \ldots, T\}$:
$$
\begin{cases}
h_i(t) = x_i(t) + \eta_i(t), \\[2pt]
y_i(t+1) = \dfrac{h_i(t)}{d_i^{\text{out}}(t)+1} + \displaystyle\sum_{j \in N_i^{\text{in}}(t)} \dfrac{h_j(t)}{d_j^{\text{out}}(t)+1}, \\[2pt]
s_i(t+1) = \dfrac{s_i(t)}{d_i^{\text{out}}(t)+1} + \displaystyle\sum_{j \in N_i^{\text{in}}(t)} \dfrac{s_j(t)}{d_j^{\text{out}}(t)+1}, \\[2pt]
z_i(t+1) = y_i(t+1)/s_i(t+1), \\[2pt]
x_i(t+1) = y_i(t+1) - \alpha(t+1)g_i(t+1),
\end{cases} \tag{9.6}
$$
where $g_i(t+1) = \nabla f_i^{t+1}(z_i(t+1)) + \tau_i(t+1)$ is the stochastic subgradient, $\nabla f_i^{t+1}(z_i(t+1))$ represents the subgradient of $f_i^{t+1}(z)$ at $z = z_i(t+1)$, and $\tau_i(t+1) \in \mathbb{R}^d$ is an independent zero-mean stochastic noise term. The estimates of DP-DSSP (9.6) are initialized as $x_i(0) \in \mathbb{R}^d$ and $s_i(0) = 1$ for all i. In addition, for any $t \in \mathbb{Z}_{\ge 0}$ and $i, j \in \mathcal{V}$, the non-negative matrix $W(t) = [w_{ij}(t)] \in \mathbb{R}^{n \times n}$ satisfies
$$
w_{ij}(t) = \begin{cases} \dfrac{1}{d_j^{\text{out}}(t)+1}, & j \in N_i^{\text{in}}(t) \cup \{i\}, \\[4pt] 0, & \text{otherwise}. \end{cases} \tag{9.7}
$$
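A minimal single-step sketch of the update (9.6) with the weights (9.7) is given below, assuming scalar decisions (d = 1), a fixed hypothetical digraph, and a toy noisy subgradient.

```python
# One DP-DSSP iteration on a fixed digraph: each node splits its noise-
# perturbed value equally among itself and its out-neighbors (rule (9.7)),
# then de-biases by the push-sum weights. Problem data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n = 4
out_nbrs = {0: [1], 1: [2, 3], 2: [0], 3: [0, 2]}  # hypothetical digraph
eps, Delta = 1.0, 0.1
sigma = Delta / eps                                # Laplace scale

x = rng.normal(size=n)
s = np.ones(n)                                     # s_i(0) = 1
alpha = 0.1

def subgrad(i, z):
    # toy noisy subgradient g_i of a quadratic cost centered at i
    return (z - i) + rng.normal(0.0, 0.01)

h = x + rng.laplace(0.0, sigma, n)                 # h_i(t) = x_i(t) + eta_i(t)
y_new = np.zeros(n)
s_new = np.zeros(n)
for j in range(n):
    share = 1.0 / (len(out_nbrs[j]) + 1)           # 1 / (d_j^out + 1)
    for i in [j] + out_nbrs[j]:                    # push to self and out-neighbors
        y_new[i] += share * h[j]
        s_new[i] += share * s[j]
z = y_new / s_new                                  # de-biased estimate z_i(t+1)
x = y_new - alpha * np.array([subgrad(i, z[i]) for i in range(n)])
print("z(t+1) =", np.round(z, 3))
```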
Assumption 9.2 (Bounded Subgradients) For each $i \in \mathcal{V}$ and $t \in \{1, \ldots, T\}$, there is $L_i > 0$ such that $\|\nabla f_i^t(z)\| \le L_i$, $\forall z \in \mathbb{R}^d$.
Assumption 9.3 ([42]) For the sequence $\{G(t) = \{\mathcal{V}, \mathcal{E}(t)\}\}$ of time-varying unbalanced directed networks, there is an integer $B \in \mathbb{Z}_{>0}$ such that the aggregate directed network $G_B(t) = \{\mathcal{V}, \cup_{\ell=t}^{t+B-1}\mathcal{E}(\ell)\}$ is strongly connected for all $t \in \mathbb{Z}_{\ge 0}$.

The strong-connectivity bound B introduced in Assumption 9.3 is not required to be known by any of the nodes and is only employed in the convergence analysis [44]. Assumption 9.3 is more general than requiring each G(t) to be strongly connected [37, 45].
9.4 Main Results

This section presents the main results of this chapter (ε-differential privacy and expected regrets).
Theorem 9.5 (ε-Differential Privacy) Suppose that Assumptions 9.1–9.3 hold. Let $\eta_i(t) \in \mathbb{R}^d$, $i \in \mathcal{V}$, $t \in \{1, \ldots, T\}$, be the noise introduced in DP-DSSP (9.6) with corresponding parameter $\sigma(t) = \Delta(t)/\varepsilon$, where $\varepsilon > 0$. Then, DP-DSSP (9.6) guarantees ε-differential privacy.
Before bounding the expected regrets of an individual node, some crucial notations are introduced. Let $z^*$ and $Z^*$ be, respectively, the optimal solution and the nonempty optimal solution set of (9.1). Moreover, let $\bar{\mu} = (1/2n)\sum_{i=1}^{n}\mu_i$, $L_{\max} = \max_{i \in \mathcal{V}} L_i$, and $L = \sum_{i=1}^{n} L_i$, where $\mu_i > 0$, $\forall i \in \mathcal{V}$. With the above preparations, we are ready to present the expected regrets of this chapter.
Theorem 9.6 (Logarithmic Regret) Suppose that Assumptions 9.1–9.3 hold and that each local cost function $f_i^t$, $i \in \mathcal{V}$, $t \in \{1, \ldots, T\}$, is strongly convex. Consider that DP-DSSP (9.6) generates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$ with an appropriately selected learning rate $\alpha(t)$. Then, the pseudo-individual-regret of node $j \in \mathcal{V}$ satisfies

$$\bar{R}_j(T) + \sum_{j=1}^{n}\mu_j E[\|\hat{z}_j(T) - z^*\|^2] \le \Xi_1 + \Xi_2(1 + \ln T),$$

where

$$\hat{z}_i(t) = \frac{2\sum_{k=1}^{t}kz_i(k)}{t(t+1)}, \qquad \Xi_1 = n\hat{\mu}E[\|\bar{x}(0) - z^*\|^2] + \frac{14CL}{\delta(1-\lambda)}\sum_{j=1}^{n}\|x_j(0)\|_1,$$

$$\Xi_2 = \frac{14Cn^{3/2}\hat{p}L}{\delta\lambda\hat{\mu}(1-\lambda)} + \frac{42\sqrt{2}Cn^{3/2}\sqrt{d}\,\hat{p}L}{\delta\varepsilon\hat{\mu}(1-\lambda)} + \frac{16nd\hat{p}^2}{\varepsilon^2} + 2L^2.$$
Based on the above definition, the following theorem, i.e., square-root regret, is immediately established for generally convex cost functions.

Theorem 9.7 (Square-Root Regret) Suppose that Assumptions 9.1–9.3 hold. Assume that each local cost function $f_i^t$, $i \in \mathcal{V}$, is convex for all $t \in \{1, \ldots, T\}$. Consider that DP-DSSP (9.6) generates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$ with $\alpha(t)$ selected by DTS. Then, the pseudo-individual-regret of node $j \in \mathcal{V}$ can be upper bounded as

$$\bar{R}_j(T) \le \sum_{k=0}^{\lfloor\log_2 T\rfloor} R_j^{DTS}(k) \le \frac{\sqrt{2}}{\sqrt{2}-1}\Gamma_1\sqrt{T},$$

Here, we do not give a specific definition of the pseudo-network-subregret $R^{DTS}(k)$; the interested reader can define it analogously with reference to (9.2) and (9.8).
where

$$\Gamma_1 = nE[\|\bar{x}(0) - z^*\|^2] + \frac{14CL}{\delta(1-\lambda)}\sum_{j=1}^{n}\|x_j(0)\|_1 + 2L^2 + \frac{14Cn^{3/2}\hat{p}L}{\delta\lambda(1-\lambda)} + \frac{42\sqrt{2}Cn^{3/2}\sqrt{d}\,\hat{p}L}{\delta\varepsilon(1-\lambda)} + \frac{16nd\hat{p}^2}{\varepsilon^2}.$$
As pointed out in Theorems 9.6 and 9.7, the desired regrets of DP-DSSP are derived for strongly convex (logarithmic regret) and generally convex (square-root regret) cost functions. The obtained regrets depend on the network size n and the decision dimension d. Notice that the achieved regrets possess the same orders $O(\ln T)$ and $O(\sqrt{T})$ as [14, 15] (without differential privacy strategy) for a fixed ε, even though Laplace noise is injected. Besides, for a fixed T, the regrets in Theorems 9.6 and 9.7 grow arbitrarily large as $\varepsilon \to 0$; namely, the regrets are of order $O(1/\varepsilon^2)$.
Remark 9.8 DTS in Theorem 9.7 divides the original time series into subseries of increasing size (except the last subseries) and runs DP-DSSP on each subseries. Specifically, DTS divides the original time series $t \in \{1, \ldots, T\}$ into $\lfloor\log_2 T\rfloor + 1$ subseries. In each subseries (except the last), DP-DSSP updates for $2^k$ rounds. Based on this, we can calculate the corresponding subregret in each subseries $k = 0, 1, 2, \ldots, \lfloor\log_2 T\rfloor$, and the total regret is bounded by the sum of the $\lfloor\log_2 T\rfloor + 1$ subregrets. Therefore, DP-DSSP can completely implement the iterations $t = 1, \ldots, T$ even in the case of DTS. More importantly, DTS does not require access to the global information T but still guarantees that DP-DSSP achieves $\sqrt{T}$ regret. Hence, DTS is meaningful compared with existing works that do not utilize DTS and still attain $\sqrt{T}$ regret growth [19, 20].
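A minimal sketch of such a doubling scheme is given below; the block-wise step-size $\zeta_k = 1/\sqrt{2^k}$ and the placeholder run_block are assumptions for illustration.

```python
# Doubling trick scheme (DTS) as described in Remark 9.8: split the horizon
# into blocks of length 2^k and run the algorithm within each block with a
# step-size tuned to that block length; no global knowledge of T is needed.
import math

def run_block(length, zeta):
    # placeholder: run DP-DSSP for `length` rounds with step-size zeta;
    # returns the number of rounds actually consumed
    return length

T = 100  # rounds arrive online; used here only to stop the loop
t, k = 0, 0
while t < T:
    block_len = min(2 ** k, T - t)   # the last block may be shorter
    zeta = 1.0 / math.sqrt(2 ** k)   # assumed block-wise constant step-size
    t += run_block(block_len, zeta)
    k += 1
print(f"covered {t} rounds in {k} blocks")  # k = floor(log2(T)) + 1
```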
Proof of Theorem 9.5 Define $x(t) = [x_1(t), \ldots, x_n(t)]^T \in \mathbb{R}^{nd}$ and $x'(t) = [x_1'(t), \ldots, x_n'(t)]^T \in \mathbb{R}^{nd}$. Recalling Definition 9.3 and noticing that $x(t)$ and $x'(t)$ lie in $\mathbb{R}^{nd}$, the property of the 1-norm gives

$$\sum_{i=1}^{n}\sum_{k=1}^{d}|x_{i,k}(t) - x_{i,k}'(t)| = \|x(t) - x'(t)\|_1 \le \Delta(t), \tag{9.11}$$

where $x_{i,k}(t)$ and $x_{i,k}'(t)$ are, respectively, the k-th elements of $x_i(t)$ and $x_i'(t)$. Then, according to the property of the Laplace distribution [46] and (9.11), it suffices that

$$
\begin{aligned}
\prod_{i=1}^{n}\prod_{k=1}^{d}\frac{P[z_{i,k}(t) - x_{i,k}(t)]}{P[z_{i,k}(t) - x_{i,k}'(t)]}
&= \prod_{i=1}^{n}\prod_{k=1}^{d}\frac{\exp\big(-\tfrac{|z_{i,k}(t) - x_{i,k}(t)|}{\sigma(t)}\big)}{\exp\big(-\tfrac{|z_{i,k}(t) - x_{i,k}'(t)|}{\sigma(t)}\big)} \\
&\le \prod_{i=1}^{n}\prod_{k=1}^{d}\exp\Big(\frac{|z_{i,k}(t) - x_{i,k}(t) - z_{i,k}(t) + x_{i,k}'(t)|}{\sigma(t)}\Big) \\
&\le \exp\Big(\frac{\|x(t) - x'(t)\|_1}{\sigma(t)}\Big) \le \exp\Big(\frac{\Delta(t)}{\sigma(t)}\Big). 
\end{aligned} \tag{9.12}
$$
Moreover, it follows that

$$P[M(D) \in \Psi] = \prod_{t=1}^{T} P[M(D_t) \in \Psi]. \tag{9.13}$$

Combining (9.12) and (9.13) yields

$$\prod_{t=1}^{T} P[M(D_t) \in \Psi] \le \prod_{t=1}^{T}\exp\Big(\frac{\Delta(t)}{\sigma(t)}\Big)\cdot P[M(D_t') \in \Psi],$$

which, recalling $\sigma(t) = \Delta(t)/\varepsilon$, establishes the claimed differential privacy. The proof is complete.
This subsection concerns establishing the logarithmic regret of DP-DSSP (9.6) presented in Theorem 9.6. For this purpose, some supporting lemmas that are indispensable for the proof of the main results are first presented.

In light of the definition of the matrices $W(t)$ and $W(t:k)$, $t \ge k \in \mathbb{Z}_{\ge 0}$, the following lemma is stated, which directly follows from Corollary 2 in [38].

Lemma 9.11 ([38]) Suppose that Assumption 9.3 holds. If the matrix $W(t) = [w_{ij}(t)] \in \mathbb{R}^{n \times n}$ satisfies (9.7), there is a sequence of stochastic vectors $\phi(t) \in \mathbb{R}^{n}$ such that the matrix $W(t:k)$ satisfies $\big|[W(t:k)]_{ij} - \phi_i(t)\big| \le C\lambda^{t-k}$, $\forall i, j \in \mathcal{V}$ and $t \ge k \in \mathbb{Z}_{\ge 0}$, for some $C > 0$ and $\lambda \in (0, 1)$.

With the help of Lemma 9.11, we provide the upper bound of $E[\|z_i(t+1) - \bar{x}(t)\|]$, $\forall t \in \mathbb{Z}_{\ge 0}$, below, which is obtained directly from Lemma 1(a) in [38].
Lemma 9.12 ([38]) Suppose that Assumptions 9.2 and 9.3 hold. Then, DP-DSSP (9.6) generates a sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$ such that, $\forall i \in \mathcal{V}$, $t \in \mathbb{Z}_{\ge 0}$,

$$
E[\|z_i(t+1) - \bar{x}(t)\|] \le \frac{2C}{\delta}\lambda^t\sum_{j=1}^{n}\|x_j(0)\|_1 + \frac{3C\sqrt{n}}{\delta}\sum_{k=0}^{t}\lambda^{t-k}\sum_{j=1}^{n}E[\|\eta_j(k)\|] + \frac{2Cn^{3/2}\hat{p}}{\delta}\sum_{k=1}^{t}\lambda^{t-k}\alpha(k).
$$
The next lemma is dedicated to establishing a necessary relation for deriving the main results.

Lemma 9.13 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates a sequence $\{x_1(t), \ldots, x_n(t)\}_{t=1}^{T}$. Then, for any $v \in \mathbb{R}^d$ and $\forall t \in \mathbb{Z}_{\ge 0}$,

$$
\begin{aligned}
E[\|\bar{x}(t+1) - v\|^2] - E[\|\bar{x}(t) - v\|^2]
&\le \frac{2}{n}\sum_{i=1}^{n}E[\|\eta_i(t)\|^2] - \frac{2\alpha(t+1)}{n}E[f^{t+1}(\bar{x}(t)) - f^{t+1}(v)] \\
&\quad + \frac{4\alpha(t+1)}{n}E\Big[\sum_{i=1}^{n}L_i\|\bar{x}(t) - z_i(t+1)\|\Big] + \frac{2\alpha^2(t+1)L^2}{n} \\
&\quad - \frac{\alpha(t+1)}{n}E\Big[\sum_{i=1}^{n}\kappa_i^{t+1}\|z_i(t+1) - v\|^2\Big].
\end{aligned}
$$
Proof It follows from DP-DSSP (9.6) that

$$\bar{x}(t+1) = \bar{x}(t) + \frac{1}{n}\sum_{i=1}^{n}\eta_i(t) + \frac{1}{n}\sum_{i=1}^{n}\chi_i(t+1), \tag{9.14}$$

which further implies

$$
\|\bar{x}(t+1) - v\|^2 = \|\bar{x}(t) - v\|^2 + \frac{2}{n}\sum_{i=1}^{n}\langle\chi_i(t+1), \bar{x}(t) - v\rangle + \frac{2}{n}\sum_{i=1}^{n}\langle\eta_i(t), \bar{x}(t) - v\rangle + \Big\|\frac{1}{n}\sum_{i=1}^{n}\big(\eta_i(t) + \chi_i(t+1)\big)\Big\|^2. \tag{9.15}
$$
Now, the cross-term $(2/n)\sum_{i=1}^{n}\langle\chi_i(t+1), \bar{x}(t) - v\rangle$ in (9.15) is considered first. For the cross-term $\langle\nabla f_i^{t+1}(z_i(t+1)), z_i(t+1) - v\rangle$, since each local cost function $f_i^{t+1}$ is strongly convex with parameter $\kappa_i^{t+1}$, and since $f_i^{t+1}(z_i(t+1)) - f_i^{t+1}(v) = f_i^{t+1}(z_i(t+1)) - f_i^{t+1}(\bar{x}(t)) + f_i^{t+1}(\bar{x}(t)) - f_i^{t+1}(v)$ with each $f_i^{t+1}$ convex, it is obtained that

$$\langle\nabla f_i^{t+1}(z_i(t+1)), z_i(t+1) - v\rangle \ge f_i^{t+1}(z_i(t+1)) - f_i^{t+1}(v) + \frac{\kappa_i^{t+1}}{2}\|z_i(t+1) - v\|^2. \tag{9.22}$$
Using $f^{t+1}(z) = \sum_{i=1}^{n} f_i^{t+1}(z)$ and substituting (9.18) and (9.22) back into (9.17), one acquires

$$
\sum_{i=1}^{n}\langle\nabla f_i^{t+1}(z_i(t+1)), \bar{x}(t) - v\rangle \ge f^{t+1}(\bar{x}(t)) - f^{t+1}(v) + \frac{1}{2}\sum_{i=1}^{n}\kappa_i^{t+1}\|z_i(t+1) - v\|^2 - 2\sum_{i=1}^{n}L_i\|z_i(t+1) - \bar{x}(t)\|. \tag{9.23}
$$
Recall that $\eta_i(t) \sim \text{Lap}(\sigma(t))$ with $E[\eta_i(t)] = 0$. In light of (9.16) and (9.23), it thus follows that

$$
\begin{aligned}
E\Big[\frac{2}{n}\sum_{i=1}^{n}\langle\chi_i(t+1) + \eta_i(t), \bar{x}(t) - v\rangle\Big]
&\le -\frac{2\alpha(t+1)}{n}E[f^{t+1}(\bar{x}(t)) - f^{t+1}(v)] \\
&\quad - \frac{\alpha(t+1)}{n}E\Big[\sum_{i=1}^{n}\kappa_i^{t+1}\|z_i(t+1) - v\|^2\Big] \\
&\quad + \frac{4\alpha(t+1)}{n}E\Big[\sum_{i=1}^{n}L_i\|z_i(t+1) - \bar{x}(t)\|\Big]. \tag{9.24}
\end{aligned}
$$
Substituting (9.24) back into (9.15) and taking the total expectation, the claimed relation follows. This completes the proof.

Lemma 9.14 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$. Then, $\forall T \in \mathbb{Z}_{>0}$,

$$
\begin{aligned}
\sum_{t=1}^{T}\Big(R(t) + E\Big[\sum_{i=1}^{n}\kappa_i^t\|z_i(t) - z^*\|^2\Big]\Big)
&\le \frac{n}{\alpha(1)}E[\|\bar{x}(0) - z^*\|^2] + \frac{10Cn^{3/2}\hat{p}L}{\delta}\sum_{t=1}^{T}\sum_{k=1}^{t-1}\lambda^{t-k-1}\alpha(k) \\
&\quad + \frac{10CL}{\delta(1-\lambda)}\sum_{j=1}^{n}\|x_j(0)\|_1 + \Big(\frac{16nd\hat{p}^2}{\varepsilon^2} + 2L^2\Big)\sum_{t=1}^{T}\alpha(t) \\
&\quad + \frac{30\sqrt{2}Cn^{3/2}\sqrt{d}\,\hat{p}L}{\delta\varepsilon}\sum_{t=1}^{T}\sum_{k=0}^{t-1}\lambda^{t-k-1}\alpha(t+1) \\
&\quad + nE\Big[\sum_{t=2}^{T}\|\bar{x}(t-1) - z^*\|^2\Delta_1(t)\Big],
\end{aligned}
$$

where $\Delta_1(t) = \dfrac{1}{\alpha(t)} - \dfrac{1}{\alpha(t-1)} - \dfrac{1}{2n}\sum_{i=1}^{n}\kappa_i^t$.
Proof Letting $v = z^*$ in Lemma 9.13 yields

$$
\begin{aligned}
E[\|\bar{x}(t+1) - z^*\|^2] - E[\|\bar{x}(t) - z^*\|^2]
&\le \frac{2}{n}\sum_{i=1}^{n}E[\|\eta_i(t)\|^2] - \frac{2\alpha(t+1)}{n}E[f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*)] \\
&\quad + \frac{4\alpha(t+1)}{n}E\Big[\sum_{i=1}^{n}L_i\|z_i(t+1) - \bar{x}(t)\|\Big] + \frac{2\alpha^2(t+1)L^2}{n} \\
&\quad - \frac{\alpha(t+1)}{n}E\Big[\sum_{i=1}^{n}\kappa_i^{t+1}\|z_i(t+1) - z^*\|^2\Big]. \tag{9.26}
\end{aligned}
$$
We then analyze the term $f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*)$ in (9.26). First, since each $f_i^{t+1}$ is strongly convex with parameter $\kappa_i^{t+1}$, it yields that

$$f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*) \ge \frac{1}{2}\sum_{i=1}^{n}\kappa_i^{t+1}\|\bar{x}(t) - z^*\|^2. \tag{9.27}$$
In addition, one has

$$f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*) \ge f^{t+1}(z_i(t+1)) - f^{t+1}(z^*) + \frac{1}{2}\sum_{i=1}^{n}\kappa_i^{t+1}\|\bar{x}(t) - z^*\|^2.$$
Plugging (9.29) back into (9.26) and then summing the obtained inequality over t from 1 to T, it holds (after a few algebraic operations) that

$$
\begin{aligned}
\sum_{t=1}^{T}\Big(R(t) - \frac{2}{\alpha(t)}\sum_{i=1}^{n}E[\|\eta_i(t-1)\|^2]\Big)
&\le \sum_{t=1}^{T}E\Big[\sum_{i=1}^{n}L_i\|z_i(t) - \bar{x}(t-1)\|\Big] - \sum_{t=1}^{T}E\Big[\sum_{i=1}^{n}\kappa_i^t\|z_i(t) - z^*\|^2\Big] \\
&\quad + nE\Big[\sum_{t=2}^{T}\|\bar{x}(t-1) - z^*\|^2\Delta_1(t)\Big] + \frac{n}{\alpha(1)}E[\|\bar{x}(0) - z^*\|^2] \\
&\quad + 4E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n}L_i\|z_i(t) - \bar{x}(t-1)\|\Big] + 2L^2\sum_{t=1}^{T}\alpha(t). \tag{9.30}
\end{aligned}
$$
In addition, it follows from Lemma 9.12 that

$$
\begin{aligned}
\sum_{t=1}^{T}E\Big[\sum_{i=1}^{n}L_i\|z_i(t) - \bar{x}(t-1)\|\Big]
&\le \frac{2C}{\delta}\sum_{t=1}^{T}\sum_{i=1}^{n}L_i\lambda^{t-1}\sum_{j=1}^{n}\|x_j(0)\|_1 \\
&\quad + \frac{3C\sqrt{n}}{\delta}\sum_{t=1}^{T}\sum_{i=1}^{n}L_i\sum_{k=0}^{t-1}\lambda^{t-k-1}\sum_{j=1}^{n}E[\|\eta_j(k)\|] \\
&\quad + \frac{2Cn^{3/2}\hat{p}}{\delta}\sum_{t=1}^{T}\sum_{i=1}^{n}L_i\sum_{k=1}^{t-1}\lambda^{t-k-1}\alpha(k). \tag{9.31}
\end{aligned}
$$
Note that

$$\sum_{t=1}^{T}\sum_{i=1}^{n}\sum_{j=1}^{n}L_i\lambda^{t-1}\|x_j(0)\|_1 \le \frac{L}{1-\lambda}\sum_{j=1}^{n}\|x_j(0)\|_1. \tag{9.32}$$
Furthermore, since $\sum_{i=1}^{n}\|\eta_i(t)\| = n\sqrt{\sum_{k=1}^{d}|\eta_{i,k}(t)|^2}$ (with $\eta_{i,k}(t)$, $k = 1, \ldots, d$, the k-th element of $\eta_i(t) \in \mathbb{R}^d$) and each $\eta_{i,k}(t)$ is drawn from $\text{Lap}(\sigma(t))$, we deduce that $E[|\eta_{i,k}(t)|^2] = 2\sigma^2(t)$. In light of $\Delta(t)/\sigma(t) = \varepsilon$, it yields

$$E\Big[\sum_{i=1}^{n}\|\eta_i(t)\|\Big] = E\Big[n\sqrt{\sum_{k=1}^{d}|\eta_{i,k}(t)|^2}\Big] \le \frac{\sqrt{2d}\,n\Delta(t)}{\varepsilon} \le \frac{2\sqrt{2}\,n\sqrt{d}\,\hat{p}\,\alpha(t+1)}{\varepsilon}, \tag{9.33}$$

and similarly

$$\sum_{i=1}^{n}E[\|\eta_i(t)\|^2] \le \frac{8nd\hat{p}^2\alpha^2(t+1)}{\varepsilon^2}. \tag{9.35}$$
Combining (9.30)–(9.35) and Lemma 9.12 and rearranging the terms, the result in Lemma 9.14 follows immediately.
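The Laplace moment facts used in (9.33)–(9.35) can be verified by a quick Monte Carlo experiment; all values below are illustrative.

```python
# Monte Carlo check of the noise moments used above: for eta ~ Lap(sigma)
# per coordinate, E[|eta_k|^2] = 2*sigma^2, hence E[||eta||^2] = 2*d*sigma^2
# and, by Jensen's inequality, E[||eta||] <= sqrt(2*d)*sigma.
import numpy as np

rng = np.random.default_rng(3)
d, sigma, N = 8, 0.7, 200_000
eta = rng.laplace(0.0, sigma, (N, d))

emp_sq = (eta ** 2).sum(axis=1).mean()
print(f"E||eta||^2 ~ {emp_sq:.3f}, theory = {2 * d * sigma**2:.3f}")
emp_norm = np.linalg.norm(eta, axis=1).mean()
print(f"E||eta||  ~ {emp_norm:.3f} <= sqrt(2d)*sigma = {np.sqrt(2*d)*sigma:.3f}")
```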
Finally, the next lemma presents the connection between the pseudo-individual-regret of node $j \in \mathcal{V}$ and the pseudo-network-regret.

Lemma 9.15 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$. Then, it holds that
$$
\begin{aligned}
\bar{R}_j(T) - \frac{4CL}{\delta(1-\lambda)}\sum_{j=1}^{n}\|x_j(0)\|_1
&\le \bar{R}(T) + \frac{12\sqrt{2}Cn^{3/2}\sqrt{d}\,\hat{p}L}{\delta\varepsilon}\sum_{t=1}^{T}\sum_{k=0}^{t-1}\lambda^{t-k-1}\alpha(k+1) \\
&\quad + \frac{4Cn^{3/2}\hat{p}L}{\delta}\sum_{t=1}^{T}\sum_{k=1}^{t-1}\lambda^{t-k-1}\alpha(k).
\end{aligned}
$$
Proof First, one has

$$\sum_{t=1}^{T}\sum_{i=1}^{n}f_i^t(z_j(t)) - \sum_{t=1}^{T}\sum_{i=1}^{n}f_i^t(z_i(t)) \le \sum_{t=1}^{T}\sum_{i=1}^{n}L_j\|z_j(t) - z_i(t)\|, \tag{9.36}$$

which is directly derived from the convexity and bounded subgradients of $f_i^t$. Moreover, it also holds that

$$\|z_j(t+1) - z_i(t+1)\|^2 \le \|z_j(t+1) - \bar{x}(t)\|^2 + \|z_i(t+1) - \bar{x}(t)\|^2 + 2\|z_j(t+1) - \bar{x}(t)\|\|z_i(t+1) - \bar{x}(t)\|. \tag{9.37}$$
Then, in view of Lemma 9.12,

$$
E[\|z_j(t+1) - z_i(t+1)\|] \le \frac{4C}{\delta}\lambda^t\sum_{j=1}^{n}\|x_j(0)\|_1 + \frac{6C\sqrt{n}}{\delta}\sum_{k=0}^{t}\lambda^{t-k}\sum_{j=1}^{n}E[\|\eta_j(k)\|] + \frac{4Cn^{3/2}\hat{p}}{\delta}\sum_{k=1}^{t}\lambda^{t-k}\alpha(k). \tag{9.38}
$$
Since $\eta_i(t) \sim \text{Lap}(\sigma(t))$ with $E[\|\eta_i(t)\|^2] = 2\sigma^2(t)$, it follows from (9.33) and (9.38) that

$$
\begin{aligned}
E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n}L_j\|z_j(t) - z_i(t)\|\Big]
&\le \frac{4CL}{\delta(1-\lambda)}\sum_{j=1}^{n}\|x_j(0)\|_1 + \frac{4Cn^{3/2}\hat{p}L}{\delta}\sum_{t=1}^{T}\sum_{k=1}^{t-1}\lambda^{t-k-1}\alpha(k) \\
&\quad + \frac{12\sqrt{2}Cn^{3/2}\sqrt{d}\,\hat{p}L}{\delta\varepsilon}\sum_{t=1}^{T}\sum_{k=0}^{t-1}\lambda^{t-k-1}\alpha(k+1).
\end{aligned}
$$

Combining the above relations with the definitions of $\bar{R}_j(T)$ and $\bar{R}(T)$, the result in Lemma 9.15 follows.
Proof of Theorem 9.6 With the learning rate $\alpha(t)$ selected in Theorem 9.6, one can verify that

$$\sum_{t=1}^{T}\sum_{k=0}^{t-1}\lambda^{t-k-1}\alpha(k+1) \le \frac{1}{\hat{\mu}(1-\lambda)}(1 + \ln T), \tag{9.40}$$

and further

$$\sum_{t=1}^{T}\sum_{k=1}^{t-1}\lambda^{t-k-1}\alpha(k) \le \frac{1}{\lambda\hat{\mu}(1-\lambda)}(1 + \ln T). \tag{9.41}$$
Let $\theta(t) = t(t+1)/2$ and $\hat{z}_i(t) = \sum_{k=1}^{t}kz_i(k)/\theta(t)$ for all $t \in \{1, \ldots, T\}$. It then yields that

$$E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n}\mu_i\|z_i(t) - z^*\|^2\Big] \ge E\Big[\sum_{t=1}^{T}\frac{t}{\theta(t)}\sum_{i=1}^{n}\mu_i\|z_i(t) - z^*\|^2\Big] \ge E\Big[\sum_{i=1}^{n}\mu_i\|\hat{z}_i(t) - z^*\|^2\Big]. \tag{9.42}$$
Combining (9.40)–(9.42) with Lemma 9.14, it can be deduced that

$$\bar{R}(T) + \sum_{j=1}^{n}\mu_j E[\|\hat{z}_j(T) - z^*\|^2] \le \Xi_3 + \Xi_4(1 + \ln T), \tag{9.43}$$

where

$$\Xi_3 = n\hat{\mu}E[\|\bar{x}(0) - z^*\|^2] + \frac{10CL}{\delta(1-\lambda)}\sum_{j=1}^{n}\|x_j(0)\|_1,$$

$$\Xi_4 = \frac{10Cn^{3/2}\hat{p}L}{\delta\lambda\hat{\mu}(1-\lambda)} + \frac{16nd\hat{p}^2}{\varepsilon^2} + 2L^2 + \frac{30\sqrt{2}Cn^{3/2}\sqrt{d}\,\hat{p}L}{\delta\varepsilon\hat{\mu}(1-\lambda)}.$$
Thus, the result in Theorem 9.6 can be concluded by putting (9.43) into Lemma 9.15. This fulfills the proof of Theorem 9.6.
In the previous subsection, the logarithmic regret of DP-DSSP (9.6) was proved under the strong convexity of the cost functions. If the cost functions are only generally convex, it can be established that DP-DSSP (9.6) achieves a square-root regret.

Proof of Theorem 9.7 Noting that $\alpha(t)$ in Theorem 9.6 depends on T, DTS, which does not depend on T, is employed to generate the execution process.
Running DP-DSSP within each subseries with a constant step-size ζ leads to a bound of the form (9.44), where we have utilized the fact that $\frac{1}{\zeta} - \frac{1}{\zeta} - \frac{1}{2n}\sum_{i=1}^{n}\kappa_i^t \le 0$. Then, letting $\zeta = 1/\sqrt{T}$ in (9.44) and using $T \ge 1$, one achieves

$$E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n}\big(f_i^t(z_i(t)) - f_i^t(z^*)\big)\Big] \le \Gamma_2\sqrt{T}, \tag{9.45}$$

where

$$\Gamma_2 = nE[\|\bar{x}(0) - z^*\|^2] + \frac{10CL}{\delta(1-\lambda)}\sum_{j=1}^{n}\|x_j(0)\|_1 + 2L^2 + \frac{10Cn^{3/2}\hat{p}L}{\delta\lambda(1-\lambda)} + \frac{30\sqrt{2}Cn^{3/2}\sqrt{d}\,\hat{p}L}{\delta\varepsilon(1-\lambda)} + \frac{16nd\hat{p}^2}{\varepsilon^2}.$$
Hence, combining (9.45) with Lemma 9.15, the result in Theorem 9.7 is obtained.
The proof is complete.
Assumption 9.4 The communication delays $\varsigma_{ij}(t)$ experienced on each link $(j, i) \in \mathcal{E}(t)$ are arbitrary but uniformly bounded, i.e., there exists $\hat{\varsigma} \in \mathbb{Z}_{>0}$ such that $\varsigma_{ij}(t) \le \hat{\varsigma}$ for all t. Furthermore, each node possesses its individual estimate without delays, i.e., $\varsigma_{ii}(t) = 0$, $\forall i \in \mathcal{V}$, $t \in \mathbb{Z}_{>0}$.

Notice that, for each node $i \in \mathcal{V}$, the updates of $y_i(t)$ and $s_i(t)$ in DP-DSSP (9.6) depend on the information acquired from in-neighbors, while the estimates $h_i(t)$, $z_i(t)$, and $x_i(t)$ are computed locally without further interactions. Therefore, at each time $t \in \{1, \ldots, T\}$, when undergoing communication delays, DP-DSSP (9.6) is executed as follows:
$$
\begin{cases}
h_i(t) = x_i(t) + \eta_i(t), \\[2pt]
y_i(t+1) = \dfrac{h_i(t)}{d_i^{\text{out}}(t)+1} + \displaystyle\sum_{j \in N_i^{\text{in}}(t)} \dfrac{h_j(t - \varsigma_{ij}(t))}{d_j^{\text{out}}(t)+1}, \\[2pt]
s_i(t+1) = \dfrac{s_i(t)}{d_i^{\text{out}}(t)+1} + \displaystyle\sum_{j \in N_i^{\text{in}}(t)} \dfrac{s_j(t - \varsigma_{ij}(t))}{d_j^{\text{out}}(t)+1}, \\[2pt]
z_i(t+1) = y_i(t+1)/s_i(t+1), \\[2pt]
x_i(t+1) = y_i(t+1) - \alpha(t+1)g_i(t+1).
\end{cases} \tag{9.46}
$$
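One way to implement the delayed mixing step of (9.46) is to buffer past broadcasts and let each receiver read the sender's value from $\varsigma_{ij}(t)$ steps ago, as in the following sketch (synthetic digraph and delays; scalar decisions assumed).

```python
# Sketch of the delayed push-sum mixing in (9.46): each node keeps a short
# history of its broadcasts, and a receiver with delay s reads the sender's
# value from s steps ago. Own values are never delayed (varsigma_ii = 0).
import numpy as np

rng = np.random.default_rng(4)
n, max_delay = 4, 2
out_nbrs = {0: [1], 1: [2, 3], 2: [0], 3: [0, 2]}  # hypothetical digraph

h_hist = [np.zeros(n) for _ in range(max_delay + 1)]  # h_hist[s][j] = h_j(t - s)
s_hist = [np.ones(n) for _ in range(max_delay + 1)]

def delayed_mix(h_now, s_now):
    h_hist.insert(0, h_now); h_hist.pop()   # shift the broadcast buffers
    s_hist.insert(0, s_now); s_hist.pop()
    y, s = np.zeros(n), np.zeros(n)
    for j in range(n):
        w = 1.0 / (len(out_nbrs[j]) + 1)    # weight rule (9.7)
        y[j] += w * h_hist[0][j]            # own value: no delay
        s[j] += w * s_hist[0][j]
        for i in out_nbrs[j]:
            delay = rng.integers(0, max_delay + 1)  # varsigma_ij(t)
            y[i] += w * h_hist[delay][j]
            s[i] += w * s_hist[delay][j]
    return y, s

y, s = delayed_mix(rng.normal(size=n), np.ones(n))
print("y(t+1) =", np.round(y, 3))
```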
The following theorem manifests that algorithm (9.46) not only preserves differential privacy but also achieves the expected regrets when the interactions among nodes on time-varying unbalanced directed networks are subject to arbitrary but uniformly bounded communication delays.

Theorem 9.16 Suppose that Assumptions 9.1–9.4 hold and that $Z^*$ is non-empty. Consider that algorithm (9.46) updates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$. Then, the following results are derived:

(a) Letting $\eta_i(t)$, $i \in \mathcal{V}$, satisfy Theorem 9.5, algorithm (9.46) preserves differential privacy.
(b) Letting $\alpha(t)$ satisfy Theorem 9.6, algorithm (9.46) achieves a logarithmic regret of order $O(\ln T)$ for strongly convex cost functions.
(c) Letting $\alpha(t)$ satisfy Theorem 9.7, algorithm (9.46) accomplishes a square-root regret of order $O(\sqrt{T})$ for generally convex cost functions.
Proof Note that algorithm (9.46) is transformed from DP-DSSP (9.6) through the modeling of the communication delays established above. The proof of Theorem 9.16 can be divided into two steps [42]. We first convert algorithm (9.46) (with communication delays) into a delay-free algorithm by employing an augmented unbalanced directed network representation. Then, the performance of the obtained delay-free algorithm can be analyzed along the lines of the preceding subsection.

For each node i in the network, we introduce $\hat{\varsigma}$ imaginary nodes $i_{(1)}, i_{(2)}, \ldots, i_{(\hat{\varsigma})}$. At each time t, imaginary node $i_{(k)}$ stores the message that will be delivered to node i after k time steps. According to Assumption 9.4, the total number of nodes in the augmented unbalanced directed network is $n(\hat{\varsigma} + 1)$. The first n imaginary nodes model a delay of one time step on the original unbalanced directed network, the next n imaginary nodes model a delay of two time steps, and so on. For the interactions among these nodes on the augmented unbalanced directed network at each time t, we suppose that for each edge (j, i) on the original unbalanced directed network there always exist edges $(j, i_{(1)}), (j, i_{(2)}), \ldots, (j, i_{(\hat{\varsigma})})$ and edges $(i_{(1)}, i), (i_{(2)}, i_{(1)}), \ldots, (i_{(\hat{\varsigma})}, i_{(\hat{\varsigma}-1)})$ on the augmented unbalanced directed network.
Let $x_{i,(r)}$ be the estimate of imaginary node $i_{(r)}$, where $r \in \{1, \ldots, \hat{\varsigma}\}$. For convenience of analysis, we suppose $d = 1$ in this subsection. Define $\tilde{x}(t) = [x(t)^T, x_{(1)}(t)^T, \ldots, x_{(\hat{\varsigma})}(t)^T]^T \in \mathbb{R}^{n(\hat{\varsigma}+1)}$, where $x_{(r)}(t) = [x_{1,(r)}, \ldots, x_{n,(r)}]^T \in \mathbb{R}^n$, $r \in \{1, \ldots, \hat{\varsigma}\}$, is the column stack vector of $x_{i,(r)}$ for $i \in \mathcal{V}$. The notations $\tilde{h}(t)$, $\tilde{\eta}(t)$, $\tilde{y}(t)$, and $\tilde{s}(t)$ are defined similarly. Hence, at each time $t \in \{1, \ldots, T\}$, algorithm (9.46) can be written in the delay-free compact form

$$
\begin{cases}
\tilde{h}(t) = \tilde{x}(t) + \tilde{\eta}(t), \\
\tilde{y}(t+1) = \tilde{W}(t)\tilde{h}(t), \\
\tilde{s}(t+1) = \tilde{W}(t)\tilde{s}(t), \\
z_i(t+1) = y_i(t+1)/s_i(t+1), \\
\tilde{x}(t+1) = \tilde{y}(t+1) - \alpha(t+1)[g(t+1)^T, \mathbf{0}^T]^T,
\end{cases} \tag{9.47}
$$
where $g(t) = [g_1(t), \ldots, g_n(t)]^T$, $\tilde{\eta}(t) = [\eta(t)^T, \mathbf{0}^T]^T \in \mathbb{R}^{n(\hat{\varsigma}+1)}$, and $\eta(t) = [\eta_1(t), \ldots, \eta_n(t)]^T$. For the imaginary nodes, we pick $x_{(r)}(0) = \mathbf{0}$, $\eta_{(r)}(0) = \mathbf{0}$, and $s_{(r)}(0) = \mathbf{0}$ for all $r \in \{1, \ldots, \hat{\varsigma}\}$. Notice that the update of $z_i(t)$ in (9.47) is identical to that in (9.6), which guarantees that $z_i(t)$ is a finite quantity even though $s_{(r)}(0) = \mathbf{0}$. In addition, the weight matrix $\tilde{W}(t)$ associated with the augmented unbalanced directed network is given by

$$
\tilde{W}(t) = \begin{bmatrix}
\tilde{W}_{(0)}(t) & I_{n} & 0 & \cdots & 0 \\
\tilde{W}_{(1)}(t) & 0 & I_{n} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\tilde{W}_{(\hat{\varsigma}-1)}(t) & 0 & 0 & \cdots & I_{n} \\
\tilde{W}_{(\hat{\varsigma})}(t) & 0 & 0 & \cdots & 0
\end{bmatrix},
$$
where the non-negative matrices $\tilde{W}_{(0)}(t), \tilde{W}_{(1)}(t), \ldots, \tilde{W}_{(\hat{\varsigma})}(t)$ (suitably defined) depend on the communication delays encountered by the information interactions at time t. Specifically, the weight matrix $\tilde{W}_{(r)}(t) = [\tilde{w}_{ij,(r)}(t)] \in \mathbb{R}^{n \times n}$, $r \in \{1, \ldots, \hat{\varsigma}\}$, follows the rule

$$
\tilde{w}_{ij,(r)}(t) = \begin{cases} w_{ij}(t), & \text{if } \varsigma_{ij}(t) = r,\ (j, i) \in \mathcal{E}(t), \\ 0, & \text{otherwise}, \end{cases}
$$
where $w_{ij}(t)$ is introduced in (9.7). According to the definition of $\tilde{w}_{ij,(r)}(t)$, for each edge $(j, i) \in \mathcal{E}(t)$ at each time t, only one of $\tilde{w}_{ij,(1)}(t), \ldots, \tilde{w}_{ij,(\hat{\varsigma})}(t)$ is positive and equals $w_{ij}(t)$ (the others are zero). Therefore, the transformation from (9.46) (with communication delays) to (9.47) (without communication delays) is achieved.
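The block structure of $\tilde{W}(t)$ can be assembled mechanically, as in the sketch below; the per-delay matrices supplied in the example are hypothetical and merely demonstrate that column stochasticity is preserved.

```python
# Assemble the augmented weight matrix W~(t) of the delay-free form (9.47):
# the first block column stacks W~_(0)..W~_(s_hat); identity superdiagonal
# blocks shift buffered mass one step closer to delivery.
import numpy as np

def augmented_matrix(W_by_delay):
    """W_by_delay[r] = W~_(r)(t), r = 0..s_hat, each n x n."""
    s_hat = len(W_by_delay) - 1
    n = W_by_delay[0].shape[0]
    N = n * (s_hat + 1)
    Wt = np.zeros((N, N))
    for r in range(s_hat + 1):
        Wt[r * n:(r + 1) * n, :n] = W_by_delay[r]           # first block column
    for r in range(s_hat):                                  # superdiagonal I_n
        Wt[r * n:(r + 1) * n, (r + 1) * n:(r + 2) * n] = np.eye(n)
    return Wt

# tiny hypothetical example: n = 2, s_hat = 1
W0 = np.array([[0.5, 0.0], [0.0, 0.5]])   # undelayed portion of W(t)
W1 = np.array([[0.0, 0.5], [0.5, 0.0]])   # portion delayed by one step
Wt = augmented_matrix([W0, W1])
print(Wt.sum(axis=0))  # all column sums equal 1: column stochasticity kept
```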
From the definitions of $\tilde{W}(t)$ and $W(t)$, we can deduce that, under Assumptions 9.3 and 9.4, the matrices $\tilde{W}(t)$ and $W(t)$ exhibit the same properties at each time t. Thus, the remainder of the proof of Theorem 9.16 follows from techniques similar to those in the preceding subsection.
9.5 Numerical Examples

In this section, practical simulation experiments are presented to assess the effectiveness and universality of DP-DSSP. Inspired by Akbari et al. [15] and Hosseini et al. [21], we investigate the distributed collaborative localization problem. Each node in the network is employed to detect a vector $z \in \mathbb{R}^d$. At time $t \in \{1, \ldots, T\}$, each node $i \in \mathcal{V}$ acquires an uncertain and time-varying (noise, such as jamming, may exist) detection vector $q_i(t) \in \mathbb{R}^{d_i}$ and is assumed to follow the linear model $q_{i,z} = P_i z$, where $P_i \in \mathbb{R}^{d_i \times d}$ and $P_i z = 0$ if and only if $z = 0$. The main focus is to estimate the vector $\hat{z} \in \mathbb{R}^d$ that minimizes the global cost function

$$f(\hat{z}) = \sum_{t=1}^{T}\sum_{i=1}^{n}\frac{1}{2}\|q_i(t) - P_i\hat{z}\|^2,$$

where the detection vector is represented as $q_i(t) = P_i z + \beta_i(t)$ and $\beta_i(t)$ is white noise. Notice that the characteristics of the noise are not known in advance, and quite a few nodes may not work well in some situations. Therefore, it is necessary to utilize a distributed online algorithm to estimate the best selection for z.
In the simulation, we consider a large-scale network with $n = 100$ nodes and dimension $d = 1$. The n nodes and $(n-1)^2/4$ directed edges are randomly assigned in the unbalanced directed network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, and the (randomly selected) directed unbalanced network $\mathcal{G}$ is supposed to be strongly connected. Then, by uniformly and randomly sampling $\mathcal{E}(t)$ from $\mathcal{E}$ with probability 80%, we generate the time-varying unbalanced directed networks $\mathcal{G}(t) = \{\mathcal{V}, \mathcal{E}(t)\}$, $t \in \mathbb{Z}_{\ge 0}$. Additionally, at time t, the local cost function $f_i^t : \mathbb{R} \to \mathbb{R}$ associated with node $i \in \mathcal{V}$ is given by $f_i^t(\hat{z}) = (1/2)(q_i(t) - P_i\hat{z})^2$, where $q_i(t) = a_i(t)z + b_i(t)$ and $P_i \in \mathbb{R}$. The cost coefficients $a_i(t)$ and $b_i(t)$ for each node i are randomly drawn at time t from uniform distributions on $[a_{\min}, a_{\max}]$ and $[b_{\min}, b_{\max}]$, respectively. In this simulation, we employ DP-DSSP to estimate z (for clearer presentation, a few randomly selected nodes are displayed in the following scenarios) and study different scenarios to validate the performance of DP-DSSP. A sketch of this setup is given below.
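```python
# Sketch of the experimental setup described above: a random strongly
# connected digraph on n nodes (a ring plus random directed edges is used
# here to enforce strong connectivity; an illustrative assumption),
# time-varying subgraphs G(t) obtained by keeping each edge w.p. 0.8, and
# scalar local costs f_i^t(z) = 0.5*(q_i(t) - P_i z)^2.
import numpy as np

rng = np.random.default_rng(5)
n, z0 = 100, 0.3                                  # z0: assumed true scalar z
edges = {(i, (i + 1) % n) for i in range(n)}      # ring: strong connectivity
while len(edges) < (n - 1) ** 2 // 4:             # add random directed edges
    i, j = rng.integers(0, n, 2)
    if i != j:
        edges.add((int(i), int(j)))

def sample_Gt():
    # keep each base edge independently with probability 0.8
    return {e for e in edges if rng.random() < 0.8}

P = rng.uniform(0.1, 1.0, n)                      # P_i in (0, 1]
def local_cost(i, z):
    a = rng.uniform(0.0, 2.0)                                      # a_i(t)
    b = rng.uniform(-0.5 + (i - 50) / 100, 0.5 + (i - 50) / 100)   # b_i(t)
    q = a * z0 + b                                                 # q_i(t)
    return 0.5 * (q - P[i] * z) ** 2

print(len(edges), "base edges;", len(sample_Gt()), "edges in G(t)")
```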
(1) Without Communication Delays. In this scenario, suppose that there are no communication delays in the transmission of information over the network. At time $t \in \{1, \ldots, T\}$, let the coefficients $a_i(t) \in [0, 2]$ and $b_i(t) \in [-0.5 + ((i-50)/100), 0.5 + ((i-50)/100)]$ be picked at random from uniform distributions for each node i, and let the learning rate be $\alpha(t) = 5/t$. In addition, assume that $P_i \in (0, 1]$ for each node i. Based on the communication network designed above, preliminary results are displayed in Fig. 9.1. The estimates of five (randomly displayed) nodes over 100 time iterations are shown in Fig. 9.1a. By running DP-DSSP, the nodes' estimates exhibit only small fluctuations around the optimal solution (though larger than those of general online algorithms without privacy noise). Figure 9.1b depicts the maximum and minimum pseudo-individual average regrets over 100 time iterations, which exhibit sublinear behavior.
(2) With Communication Delays. In this scenario, we verify the robustness of DP-DSSP to communication delays. Specifically, let $\hat{\varsigma} = 4$ be the upper bound of the time-varying delays. At each time t, the communication delay imposed on each communication link is randomly and uniformly selected from $\{0, 1, \ldots, 4\}$. The other required parameters correspond to the first scenario. The simulation results are displayed in Fig. 9.2. It is clearly observed that, even when communication links undergo communication delays, DP-DSSP is able to achieve the desired results.
(3) DTS and Communication Delays. In this scenario ($T = 100$), we suppose that the learning rates are selected by DTS and that there exist communication delays (with upper bound $\hat{\varsigma} = 4$) in the transmission of information over the network. Comparisons of DP-DSSP with the method in [37] are shown in Figs. 9.3 and 9.4. On the one hand, the x-axes in Figs. 9.3 and 9.4a are plotted in terms of k because of the DTS employed in DP-DSSP.³ On the other hand, the x-axis in Fig. 9.4b is plotted in terms of t to better reflect the main contributions of this chapter.
We can clearly observe from Fig. 9.3 that the nodes' estimates (randomly displayed) calculated by DP-DSSP undergo only small fluctuations around the optimal solution, while the method in [37] does not perform well over time-varying unbalanced directed networks.

³ Here, it is worth highlighting that DTS divides the original time series $t \in \{1, \ldots, T\}$ into $\lfloor\log_2 T\rfloor + 1$ subseries, and DP-DSSP updates for $2^k$ rounds in each subseries (except the last). Since DP-DSSP cannot update $2^k$ rounds in the last subseries, we only plot $k = 0, 1, \ldots, \lfloor\log_2 T\rfloor$ in Figs. 9.3 and 9.4a.
Fig. 9.1 (a) Estimates of five nodes without communication delays. (b) The maximum and minimum pseudo-individual average regrets ($R_j(T)/T$) without communication delays
Fig. 9.2 (a) Estimates of five nodes with communication delays. (b) The maximum and minimum pseudo-individual average regrets ($R_j(T)/T$) with communication delays
(4) Differential Privacy Properties [47]. In this scenario, we verify the differential privacy properties of DP-DSSP with communication delays. Assume that the cost functions $f_i^t = f_i'^t$ for all $i \in \{2, \ldots, 100\}$ are identical except node 1's cost function, i.e., $f_1^t \ne f_1'^t$. Figure 9.5 (corresponding to the second scenario) plots three randomly displayed nodes' (always including node 1) outputs $x_i(t)$ and $x_i'(t)$ of DP-DSSP under the adjacent relations $f_i^t$ and $f_i'^t$, respectively. Figure 9.5 shows that the two outputs are fully fitted and thus almost indistinguishable to a malicious node.
Remark 9.17 The fourth scenario can well validate the differential privacy properties of DP-DSSP, which is done in the same way as in many existing works, such as [4, 47].
Fig. 9.3 Estimates of five nodes for DTS with communication delays. (a) Nodes' estimates (z) (DP-DSSP). (b) Nodes' estimates (z) (the method in [37])
DP-DSSP may be further applied to problems related to spy attacks on the network and dataset recovery, which are key issues in the modern era of big data. However, we note that the focus of this chapter is not spy attacks on the network or dataset recovery but the design of DP-DSSP to solve the differentially private distributed online optimization problem on time-varying unbalanced directed networks. Further research on this kind of problem will be conducted in the future.
Fig. 9.4 (a) The pseudo-individual average subregrets ($R_{j,\mathrm{av}}^{DTS}(k)$) between k and k+1 for DTS with communication delays. (b) The pseudo-individual average regrets ($R_j(T)/T$) for DTS with communication delays
9.6 Conclusion
This chapter has investigated the differentially private distributed online convex optimization problem on time-varying unbalanced directed networks that are supposed to be uniformly strongly connected. To solve such an optimization problem distributedly, we have developed a new differentially private distributed stochastic subgradient-push algorithm, named DP-DSSP. Theoretical analysis has shown that DP-DSSP (with an appropriate learning rate) preserves differential privacy and achieves expected sublinear regrets for diverse cost functions. The compromise
Fig. 9.5 The outputs $x_i(t)$ and $x_i'(t)$ ($i = 1, 41, 81$) related to the adjacent relations $f_i^t$ and $f_i'^t$
between the desired level of privacy preservation and the accuracy of DP-DSSP has been revealed. Furthermore, the robustness of DP-DSSP to communication delays has also been explored. Finally, the performance of DP-DSSP has been demonstrated via simulation experiments. Our work is still open, and more in-depth research is needed to resolve the distributed online constrained problem. As future work, we will consider extending this work in a number of directions, e.g., spy attacks on the network, dataset recovery, complex constraints, non-convex and/or non-smooth functions, and networks with random link failures.
References
1. S. Wang, C. Li, Distributed robust optimization in networked system. IEEE Trans. Cybern. 47(8), 2321–2333 (2017)
2. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst. 50(7), 2612–2622 (2020)
3. X. Shi, J. Cao, W. Huang, Distributed parametric consensus optimization with an application to model predictive consensus problem. IEEE Trans. Cybern. 48(7), 2024–2035 (2018)
4. M. Hale, P. Barooah, K. Parker, K. Yazdani, Differentially private smart metering: Implementation, analytics, and billing, in UrbSys'19: Proceedings of the 1st ACM International Workshop on Urban Building Energy Sensing, Controls, Big Data Analysis, and Visualization (2019), pp. 33–42
5. Z. Deng, X. Nian, C. Hu, Distributed algorithm design for nonsmooth resource allocation problems. IEEE Trans. Cybern. 50(7), 3208–3217 (2020)
6. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly convex optimization. Automatica 90, 196–203 (2018)