Distributed Optimization in Networked Systems - Algorithms and Applications

The document outlines the purpose and scope of the Springer Wireless Networks book series, which aims to advance research and development in wireless communication networks, including various types such as cellular and sensor networks. It emphasizes the importance of distributed optimization in networked systems, addressing the challenges posed by dynamic environments and the need for new optimization theories and methods. The book serves as a comprehensive resource for students and researchers in computer science and related fields, covering advanced topics and algorithms in distributed optimization.


Wireless Networks

Qingguo Lü · Xiaofeng Liao · Huaqing Li · Shaojiang Deng · Shanfu Gao

Distributed Optimization in Networked Systems
Algorithms and Applications
Wireless Networks

Series Editor
Xuemin Sherman Shen, University of Waterloo, Waterloo, ON, Canada
The purpose of Springer’s Wireless Networks book series is to establish the state
of the art and set the course for future research and development in wireless
communication networks. The scope of this series includes not only all aspects
of wireless networks (including cellular networks, WiFi, sensor networks, and
vehicular networks), but also related areas such as cloud computing and big data.
The series serves as a central source of references for wireless networks research
and development. It aims to publish thorough and cohesive overviews on specific
topics in wireless networks, as well as works that are larger in scope than survey
articles and that contain more detailed background information. The series also
provides coverage of advanced and timely topics worthy of monographs, contributed
volumes, textbooks and handbooks.
** Indexing: Wireless Networks is indexed in EBSCO databases and DBLP **
Qingguo Lü • Xiaofeng Liao • Huaqing Li • Shaojiang Deng • Shanfu Gao

Distributed Optimization
in Networked Systems
Algorithms and Applications
Qingguo Lü
College of Computer Science
Chongqing University
Chongqing, China

Xiaofeng Liao
College of Computer Science
Chongqing University
Chongqing, China

Huaqing Li
College of Electronic and Information Engineering
Southwest University
Chongqing, China

Shaojiang Deng
College of Computer Science
Chongqing University
Chongqing, China

Shanfu Gao
College of Computer Science
Chongqing University
Chongqing, China

ISSN 2366-1186    ISSN 2366-1445 (electronic)
Wireless Networks
ISBN 978-981-19-8558-4    ISBN 978-981-19-8559-1 (eBook)
https://doi.org/10.1007/978-981-19-8559-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
To my family
Q. Lü
To my family
X. Liao
To my family
H. Li
To my family
S. Deng
To my family
S. Gao
Preface

In recent years, the Internet of Things (IoT) and big data have been interconnected
to a wide and deep extent through the sensing, computing, communication, and
control of intelligent information. Networked systems are playing an increasingly
important role in the interconnected information environment, profoundly affecting
computer science, artificial intelligence, and other related fields. At the core of
such systems, many nodes collaborate to accomplish certain global goals efficiently
while making separate decisions based on their own preferences; together they can
solve large-scale, complex problems that are beyond the reach of individual nodes,
with strong resistance to interference and strong environmental adaptability. In
addition, such systems require participating nodes to access only
their own local information. This may be due to the consideration of security and
privacy issues in the network, or simply because the network is too large, making
the aggregation of global information to a central node practically impossible or
very inefficient. Currently, as a hot research topic with wide applicability and great
application value across multiple disciplines, distributed optimization of networked
systems has laid an important foundation for promoting and leading the frontier
development in computer science and artificial intelligence. However, networked
systems cover a large number of intelligent devices (nodes), and the network
environment is often dynamic and changing, making it extremely hard to optimize
and analyze them. It is problematic for existing theories and methods to effectively
address the new needs and challenges of optimization brought about by the rapid
development of technologies related to networked systems. Hence, it is urgent to
develop new theories and methods of distributed optimization over networks.
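To make the local-information constraint concrete, consider the classical consensus-averaging iteration, in which each node repeatedly mixes only the values of its immediate neighbors, yet the whole network agrees on the global average. The following minimal sketch is illustrative only; the four-node ring topology, mixing weights, and data are assumptions chosen for demonstration and are not taken from this monograph:

```python
import numpy as np

# Hypothetical 4-node ring network: each node holds a private value and
# exchanges messages only with its two immediate neighbors.
W = np.array([[0.50, 0.25, 0.00, 0.25],   # doubly stochastic mixing matrix
              [0.25, 0.50, 0.25, 0.00],   # (each row and column sums to 1)
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
x = np.array([1.0, 5.0, 3.0, 7.0])        # private local values

for _ in range(100):
    x = W @ x   # node i updates using only its own and its neighbors' values

print(x)        # every entry approaches the global mean 4.0
```

Because the mixing matrix is doubly stochastic and the ring is connected, the iteration converges geometrically to the average of the initial values, even though no node ever collects the global state.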
Analysis and synthesis including distributed unconstrained optimization, dis-
tributed constrained optimization, distributed nonsmooth optimization, distributed
online optimization, distributed economic dispatch in smart grids, undirected
networks, directed networks, time-varying networks, consensus control protocol,
gradient tracking technique, event-triggered communication strategy, Nesterov
and heavy-ball accelerated mechanisms, variance-reduction technique, differential
privacy strategy, gradient descent algorithm, accelerated algorithm, stochastic
gradient algorithm, and online algorithm are all thoroughly studied. This monograph
mainly investigates distributed optimization algorithms and applications in
networked control systems. In general, the following problems are investigated in
this monograph: (1) accelerated algorithms for distributed convex optimization;
(2) projection algorithms for distributed stochastic optimization; (3) proximal
algorithms for distributed coupled optimization; (4) event-triggered algorithms for
distributed convex optimization; (5) event-triggered acceleration algorithms for dis-
tributed stochastic optimization; (6) accelerated algorithms for distributed economic
dispatch; (7) primal-dual algorithms for distributed economic dispatch; (8) event-
triggered algorithms for distributed economic dispatch; and (9) privacy preserving
algorithms for distributed online learning. Among the topics, simulation results
including some typical real applications are presented to illustrate the effectiveness
and the practicability of the distributed optimization algorithms proposed in the
previous parts.
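As a brief taste of the gradient tracking technique that recurs throughout these chapters (for instance in the D-DNGT algorithm of Chapter 1), the sketch below runs the standard gradient-tracking recursion on small quadratic local costs. The network, step size, and cost data are illustrative assumptions made here for demonstration; they are not the algorithms or experiments of this monograph:

```python
import numpy as np

# Each node i privately holds f_i(x) = 0.5 * (x - b_i)^2; the network
# objective is min_x sum_i f_i(x), whose minimizer is the mean of the b_i.
W = np.array([[0.50, 0.25, 0.00, 0.25],   # doubly stochastic ring weights
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
b = np.array([1.0, 2.0, 3.0, 6.0])        # illustrative private data

def grad(z):
    return z - b                          # stacked local gradients

x = np.zeros(4)      # local estimates of the global minimizer
y = grad(x)          # trackers of the network-average gradient
alpha = 0.1          # constant step size

for _ in range(200):
    x_next = W @ x - alpha * y            # consensus step minus tracked gradient
    y = W @ y + grad(x_next) - grad(x)    # fold in the local gradient change
    x = x_next

print(x)  # each local estimate approaches the global minimizer 3.0
```

The tracker y preserves the invariant that its network average equals the average of the current local gradients, which is what lets every node descend along the global gradient direction while communicating only with its neighbors.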
This book is appropriate as a college course textbook for undergraduate and
graduate students majoring in computer science, automation, artificial intelligence,
and electrical engineering, and as a reference material for researchers and
technologists in related fields.

Chongqing, China
Qingguo Lü
Xiaofeng Liao
Huaqing Li
Shaojiang Deng
Shanfu Gao
Acknowledgments

This book was supported in part by the Natural Science Foundation of Chongqing
under Grant CSTB2022NSCQ-MSX1627, in part by the Chongqing Postdoctoral
Science Foundation under Grant 2021XM1006, in part by the China Postdoctoral
Science Foundation under Grant 2021M700588, in part by the National Natural
Science Foundation of China under Grant 62173278, in part by the Science and
Technology Research Program of Chongqing Municipal Education Commission
under Grant KJQN202100228, in part by the project of Key Laboratory of Industrial
Internet of Things & Networked Control, Ministry of Education under Grant
2021FF09, in part by the project funded by Hubei Province Key Laboratory
of Intelligent Information Processing and Real-time Industrial System (Wuhan
University of Science and Technology) under Grant ZNXX2022004, in part by
the project funded by Hubei Key Laboratory of Intelligent Robot (Wuhan Institute
of Technology) under Grant HBIR202205, and in part by the National Key R&D
Program of China under Grant 2018AAA0100101. We would like to begin by
acknowledging Yingjue Chen
and Keke Zhang, who have unselfishly given their valuable time to arranging raw
materials. Their assistance has been invaluable to the completion of this book. The
authors are especially grateful to their families for their encouragement and
never-ending support when it was most needed. Finally, we would like to thank the editors
at Springer for their professional and efficient handling of this book.

Contents

1 Accelerated Algorithms for Distributed Convex Optimization . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Communication Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Centralized Nesterov Gradient Descent Method (CNGD) . . . . 6
1.3.2 Directed Distributed Nesterov-Like Gradient
Tracking (D-DNGT) Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Auxiliary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Supporting Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Projection Algorithms for Distributed Stochastic
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.3 Communication Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Problem Reformulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.2 Computation-Efficient Distributed Stochastic
Gradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


2.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.1 Auxiliary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.2 Supporting Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.5.1 Example 1: Performance Examination . . . . . . . . . . . . . . . . . . . . . . . . 53
2.5.2 Example 2: Application Behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3 Proximal Algorithms for Distributed Coupled Optimization . . . . . . . . . . 61
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.3 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.4 The Saddle-Point Reformulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.1 Unbiased Stochastic Average Gradient (SAGA) . . . . . . . . . . . . . . 68
3.3.2 Distributed Stochastic Algorithm (VR-DPPD) . . . . . . . . . . . . . . . . 69
3.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4.1 Auxiliary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.5.1 Example 1: Simulation on General Real Data . . . . . . . . . . . . . . . . . 82
3.5.2 Example 2: Simulation on Large-Scale Real Data . . . . . . . . . . . . 85
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4 Event-Triggered Algorithms for Distributed Convex
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.3 Communication Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.1 Distributed Subgradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4.1 Auxiliary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5 Event-Triggered Acceleration Algorithms for Distributed
Stochastic Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.3 Communication Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.1 Event-Triggered Communication Strategy . . . . . . . . . . . . . . . . . . . . 120
5.3.2 Event-Triggered Distributed Accelerated Stochastic
Gradient Algorithm (ET-DASG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.1 Auxiliary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.2 Supporting Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.5.1 Example 1: Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.5.2 Example 2: Energy-based Source Localization . . . . . . . . . . . . . . . 143
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6 Accelerated Algorithms for Distributed Economic Dispatch . . . . . . . . . . . 151
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.2.3 Communication Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2.4 Centralized Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.1 Directed Distributed Lagrangian Momentum Algorithm . . . . . 158
6.3.2 Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.4.1 Auxiliary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.5.1 Case Study 1: EDP on IEEE 14-Bus Test Systems . . . . . . . . . . . 172
6.5.2 Case Study 2: EDP on IEEE 118-Bus Test Systems . . . . . . . . . . 173
6.5.3 Case Study 3: The Application to Dynamical EDPs . . . . . . . . . . 175
6.5.4 Case Study 4: Comparison with Related Methods . . . . . . . . . . . . 177
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7 Primal–Dual Algorithms for Distributed Economic Dispatch . . . . . . . . . . 183
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

7.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.2.3 Communication Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.3.1 Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.3.2 Distributed Primal–Dual Gradient Algorithm . . . . . . . . . . . . . . . . . 190
7.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.4.1 Small Gain Theorem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.4.2 Supporting Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.5.1 Example 1: EDP on the IEEE 14-Bus Test Systems . . . . . . . . . . 201
7.5.2 Example 2: Demand Response for Time-Varying
Supplies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8 Event-Triggered Algorithms for Distributed Economic Dispatch . . . . . . 209
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.2.2 Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.2.3 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.2.4 Communication Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
8.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
8.3.1 Problem Reformulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
8.3.2 Event-Triggered Control Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
8.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
8.4.1 Supporting Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
8.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
8.4.3 The Exclusion of Zeno-Like Behavior. . . . . . . . . . . . . . . . . . . . . . . . . 224
8.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
8.5.1 Example 1: EDP on the IEEE 14-Bus System . . . . . . . . . . . . . . . . 226
8.5.2 Example 2: EDP on Large-Scale Networks . . . . . . . . . . . . . . . . . . . 226
8.5.3 Example 3: Comparison with Related Methods . . . . . . . . . . . . . . . 229
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9 Privacy Preserving Algorithms for Distributed Online Learning . . . . . . 235
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9.2.2 Model of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.2.3 Communication Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.3 Algorithm Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
9.3.1 Differential Privacy Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
9.3.2 Differentially Private Distributed Online Algorithm . . . . . . . . . . 243

9.4 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
9.4.1 Differential Privacy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
9.4.2 Logarithmic Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
9.4.3 Square-Root Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
9.4.4 Robustness to Communication Delays . . . . . . . . . . . . . . . . . . . . . . . . 260
9.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
List of Figures

Fig. 1.1 A directed and strongly connected network . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Fig. 1.2 Performance comparisons between D-DNGT and the
methods without momentum terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Fig. 1.3 Performance comparisons between D-DNGT and the
methods with momentum terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Fig. 1.4 Performance comparisons between the extensions of
D-DNGT and their closely related methods . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Fig. 2.1 Convergence of x̂ for solving the optimization problem in
Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Fig. 2.2 Comparison (a): x-axis is the iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Fig. 2.3 Comparison (b): x-axis is the number of gradient
evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Fig. 2.4 Comparison (a): x-axis is the iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Fig. 2.5 Comparison (b): x-axis is the number of gradient
evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Fig. 2.6 Comparison (a): x-axis is the iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Fig. 2.7 Comparison (b): x-axis is the number of gradient
evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Fig. 3.1 (a) Random network with a connection probability
p = 0.8. (b) Complete network. (c) Cycle network. (d)
Star network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Fig. 3.2 The transient behaviors of the second dimension of each
primal variable xi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Fig. 3.3 Comparisons between VR-DPPD and other algorithms. (a)
The x-axis is the iterations. (b) The x-axis is the number of
gradient evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Fig. 3.4 Evolution of residuals under different networks . . . . . . . . . . . . . . . . . . . . . 85
Fig. 3.5 Comparisons between VR-DPPD and other algorithms. (a)
The x-axis is the iterations. (b) The x-axis is the number of
gradient evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Fig. 4.1 All nodes’ states xi (t) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


Fig. 4.2 Evolutions of all nodes’ control inputs ui (t) . . . . . . . . . . . . . . . . . . . . . . . . . 109


Fig. 4.3 All nodes’ sampling time instant sequences {tki } . . . . . . . . . . . . . . . . . . . . . 110
Fig. 4.4 Evolutions of measurement error and threshold for node 3 . . . . . . . . . 110
Fig. 5.1 Four undirected and connected network topologies
composed of 10 nodes. (a) Random network with a
connection probability pc = 0.4. (b) Complete network.
(c) Cycle network. (d) Star network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Fig. 5.2 Convergence: (a) The transient behaviors of three
dimensions (randomly selected) of state estimator x. (b)
The testing accuracy of ET-DASG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Fig. 5.3 The triggering times for the neighbors when 5 nodes run
ET-DASG under different event-triggered parameters . . . . . . . . . . . . . . . 142
Fig. 5.4 Evolution of residuals under different constant step-sizes
or momentum coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Fig. 5.5 Evolution of residuals under different networks . . . . . . . . . . . . . . . . . . . . . 144
Fig. 5.6 Comparisons between ET-DASG and other methods . . . . . . . . . . . . . . . . 144
Fig. 5.7 The randomly selected 7 paths displayed on top of contours
of log-likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Fig. 6.1 The IEEE 14-bus test system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Fig. 6.2 EDP on IEEE 14-bus test system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Fig. 6.3 The IEEE 118-bus test system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Fig. 6.4 EDP on IEEE 118-bus test system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Fig. 6.5 Dynamical EDPs on IEEE 14-bus test system . . . . . . . . . . . . . . . . . . . . . . . . 178
Fig. 6.6 Performance comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Fig. 7.1 The IEEE 14-bus test system [43] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Fig. 7.2 Power allocation at generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Fig. 7.3 Consensus of Lagrange multipliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Fig. 7.4 Optimal energy schedule of each household . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Fig. 7.5 Predicted price on the time-varying demands . . . . . . . . . . . . . . . . . . . . . . . . 204
Fig. 8.1 EDP on the IEEE 14-bus system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Fig. 8.2 EDP on the IEEE 118-bus test system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Fig. 8.3 Comparison with related methods in which the residual
E(t) as the comparison metric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Fig. 8.4 Comparison with related methods in which the obtained
cost of EDP as the comparison metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Fig. 9.1 (a) Estimations of five nodes without communication
delay. (b) The maximum and minimum pseudo-individual
average regrets (Rj (T )/T ) without communication delays . . . . . . . . . 264
Fig. 9.2 (a) Estimations of five nodes with communication delays.
(b) The maximum and minimum pseudo-individual
average regrets (Rj (T )/T ) with communication delays . . . . . . . . . . . . 265
Fig. 9.3 Estimations of five nodes for DTS with communication
delays. (a) Node’s estimate (z) (DP-DSSP). (b) Node’s
estimate (z) (the method in [37]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Fig. 9.4 (a) The pseudo-individual average subregrets (R_{j,av}^{DTS}(k))
between k and k + 1 for DTS with communication delays.
(b) The pseudo-individual average regrets (Rj (T )/T ) for
DTS with communication delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Fig. 9.5 The outputs x_i(t) and x̂_i(t) related to the adjacent relations
f_i^t and f̂_i^t . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Chapter 1
Accelerated Algorithms for Distributed
Convex Optimization

Abstract In this chapter, we introduce and solve distributed optimization problems
on directed networks, where each node has its own convex cost function while
obeying the network connectivity structure, and the principal target of these
problems is to minimize the global cost function (formulated as the average of
all local cost functions). Most of the existing methods, such as the push-sum
strategy, eliminate the unbalancedness caused by directed networks with the help
of column-stochastic weights, but those methods may be infeasible because their
distributed implementation requires each node to obtain (at least) its out-degree
information. In contrast, using only row-stochastic weights over the directed
network, we propose a new directed distributed Nesterov-like gradient tracking
algorithm, named D-DNGT, that incorporates gradient tracking into the distributed
Nesterov method with momentum terms and employs non-uniform step-sizes. This
approach effectively overcomes the abovementioned limitation of column-stochastic
weights in the implementation. The implementation of D-DNGT is straightforward
if each node locally chooses a suitable step-size and privately regulates the weights
on the information it acquires from its in-neighbors. If the largest step-size and
the maximum momentum coefficient are positive and sufficiently small, we prove
that D-DNGT converges linearly to the optimal solution provided that the cost
functions are smooth and strongly convex. We provide numerical experiments to
confirm the findings in this chapter and contrast D-DNGT with recently proposed
distributed optimization approaches.

Keywords Distributed convex optimization · Nesterov-like algorithm · Gradient
tracking · Directed network · Linear convergence

1.1 Introduction

In the past decades, with the development of artificial intelligence and the
emergence of 5G, many researchers have become interested in distributed
optimization. This chapter considers a class of widely studied distributed
optimization problems in which each node cooperatively attempts to minimize a
global cost function in the context of local interactions and local computations [1].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_1

Instances of such
formulation characterized by distributed computing have several important and
widespread applications in various fields, including wireless sensor networks for
decision-making and information-processing [2], distributed resource allocation
in smart grids [3], distributed learning in robust control [4], and time-varying
formation control [5, 6], among many others [7–13]. Unlike traditional centralized
optimization, distributed optimization involves multiple nodes that gain access to
their private local information over networks, and typically no central coordinator
(node) can acquire the entire information over the networks.
Recently, an increasing number of distributed algorithms have emerged
according to various locally computational schemes for individual nodes. Some
known approaches for different networks are usually based on distributed
(sub)gradient descent, with extensions that handle interaction delays, asynchronous
updates, stochastic (sub)gradient scenarios, etc. [14–22]. It is noteworthy in this
respect that these algorithms are intuitive and flexible with regard to the cost
functions and networks; however, their convergence is considerably slow owing to
the diminishing step-sizes required to guarantee convergence to an exact optimal
solution [14]. The convergence rate of the known algorithms, even for strongly
convex functions, is only sublinear [15]. With a constant step-size, an algorithm
attains a linear convergence rate, but only at the cost of a sub-optimal solution
[20]. Methods that resolve this exactness-speed dilemma, such as the distributed
alternating direction method of multipliers (ADMM) [23, 24] and distributed dual
decomposition [25], are based on the Lagrangian dual and have nice provable
convergence rates (a linear convergence rate for strongly convex functions) [26].
In addition, extensions covering various real-world factors, including stochastic
errors [27] and privacy preservation [28], as well as techniques including proximal
(sub)gradients [29] and formation-containment control [30], have been extensively
studied. However, owing to the need to solve a sub-problem at each iteration, the
computational complexity is considerably high. To overcome these difficulties,
quite a few approaches have been proposed that achieve linear convergence for
smooth and strongly convex cost functions [31–38]. Nonetheless, these approaches
[31–38] are only suitable for undirected networks.
Distributed optimization over directed networks was first studied in [39], where
the (sub)gradient-push (SP) method was employed to eliminate the requirement of
network balancing, i.e., by using column-stochastic weights. Since SP is built on
(sub)gradient descent with a diminishing step-size, it also suffers from a slow,
sublinear convergence rate. To accelerate convergence, Xi and Khan [40] proposed
a linearly convergent distributed method (DEXTRA) with a constant step-size by
combining the push-sum strategy with the protocol (EXTRA) in [31]. Further, Xi
et al. [41] (fixed directed networks) and Nedic et al. [42] (time-varying directed
networks) combined the push-sum strategy with distributed inexact gradient
tracking under constant step-sizes (ADD-OPT [41] and Push-DIGing [42]) to attain
linear convergence to the exact optimal solution. Then, Lü et al. [43, 44] extended
the work of [42] to non-uniform step-sizes and showed linear convergence. A
different class of approaches that do not utilize the push-sum mechanism has
recently been proposed in [45–50], where both row- and column-stochastic weights
are adopted simultaneously to achieve linear convergence over directed networks.
It is noteworthy that although these approaches [39–50] avoid the construction of
doubly-stochastic weights, they still require each node to know (at least) its own
out-degree exactly, so that every node can adjust its outgoing weights to ensure
that each column of the weight matrix sums to one. This requirement, however, is
likely to be unrealistic in broadcast-based interaction schemes (i.e., where a node
neither accesses its out-neighbors nor regulates its outgoing weights).
In this chapter, the algorithm that we construct depends crucially on gradient
tracking and is a variation of the methods that appeared in [47–55]. To be specific,
Qu and Li [54] combined gradient tracking with the distributed Nesterov gradient
descent (DNGD) method [55] and thereby investigated two accelerated distributed
Nesterov methods, Acc-DNGD-SC and Acc-DNGD-NSC, which exhibited faster
convergence rates than the centralized gradient descent (CGD) method for different
cost functions. Note that although the convergence rates are improved, the two
approaches in [54] assume that the interaction networks are undirected, which
limits the applicability of the methods in many fields, such as wireless sensor
networks. To remove this deficiency, Xin et al. [48] established an acceleration and
generalization of first-order methods with gradient tracking and a momentum term,
i.e., ABm, which overcame the conservatism (eigenvector estimation or
doubly-stochastic weights) in the related work by implementing both row- and
column-stochastic weights. In this setting, some interesting generalized methods
covering random link failures [46] and interaction delays [49] were proposed.
Regrettably, the construction of column-stochastic weights demands that each node
possess at least its out-degree information, which is arduous to implement, for
example, in broadcast-based interaction scenarios. In light of this challenge, Xin
et al. [52] investigated the case of a row-stochastic weight matrix, which avoids
such global information about the network, and proposed a fast distributed method
(FROST) with non-uniform step-sizes motivated by the idea of [51]. Related works
also address demand response and economic scheduling in power systems [53, 56].
However, these methods [51–53, 56] do not adopt momentum terms [54, 55, 57],
with which nodes acquire more information from their in-neighbors for fast
convergence. Moreover, two accelerated methods based on Nesterov's momentum
for distributed optimization over arbitrary networks were presented in [50].
Unfortunately, the related work [50] does not consider non-uniform step-sizes and
lacks a rigorous theoretical analysis of the methods. Hence, given its practical
relevance, this challenging issue deserves further study.
The main interest of this chapter is to study the distributed convex optimization
problem over a directed network. To solve this problem, a linearly convergent
algorithm is designed that utilizes non-uniform step-sizes, momentum terms,
and a row-stochastic weight matrix. We hope to develop a broad theory of
distributed convex optimization, and the underlying purpose of designing a
distributed optimization algorithm is to fit and promote real-world scenarios. To
conclude, this chapter makes the following four contributions:


(i) We design and discuss a novel directed distributed Nesterov-like gradient
tracking algorithm, named D-DNGT, with a row-stochastic weight matrix to solve
distributed convex optimization problems over a directed network. Specifically,
D-DNGT incorporates gradient tracking into the distributed Nesterov method
and adds two types of momentum terms, which allows nodes to acquire more
information from their in-neighbors than in the existing methods [51–53] and
thus achieve fast convergence. More importantly, a consensus iteration step
[50–52] is exploited in the design of D-DNGT to counteract the effect of the
unbalancedness induced by the directed network.
(ii) In comparison with Acc-DNGD-SC and Acc-DNGD-NSC proposed in [54],
D-DNGT extends the centralized Nesterov gradient descent method (CNGD)
[57] to a distributed form and is suitable for directed networks. In contrast
to [54] (doubly-stochastic matrix) and [39–50] (column-stochastic matrix),
D-DNGT with a row-stochastic matrix is relatively easy to implement in a
distributed way, since each node can privately regulate the weights on the
information it acquires from its in-neighbors. This situation is common in real
applications such as ad hoc and peer-to-peer networks.
(iii) D-DNGT adopts non-uniform step-sizes, which permits a more relaxed
step-size selection than most existing methods proposed in [41, 42, 49], etc. If
the cost functions are smooth and strongly convex, D-DNGT attains linear
convergence to the exact optimal solution when the non-uniform step-sizes
and the momentum coefficients are constrained by specific upper bounds. In
addition, some extensions of D-DNGT are discussed in the presence of two
other types of weight matrices (only column-stochastic [41, 42], or both row-
and column-stochastic [45, 46]) or network interaction delays (arbitrary but
uniformly bounded) [49, 58].
(iv) The provided bounds on the largest step-size depend only on the network
topology and the cost functions, and each node can choose a relatively wider
step-size. This is in contrast to the earlier work on non-uniform step-sizes
within the framework of gradient tracking [33, 43, 44, 53], which relies
on the heterogeneity of the step-sizes. Moreover, the bounds on the non-uniform
step-sizes in this chapter allow some (but not all) of the nodes to use zero
step-sizes.

1.2 Preliminaries

1.2.1 Notation

If not particularly stated, the vectors mentioned in this chapter are column vectors.
Let R and R^p denote the set of real numbers and the set of p-dimensional real
column vectors, respectively. The subscripts i and j are utilized to denote the
indices of nodes, and the superscript t denotes the iteration index of an algorithm;
e.g., x_i^t denotes the variable of node i at time t. We let the notations 1_n and
0_n denote the column vectors with all entries equal to one and zero, respectively.
Let I_n denote the identity matrix of size n and z_ij the entry of matrix Z in its
i-th row and j-th column. The Euclidean norm for vectors and the induced 2-norm
for matrices are represented by the symbol ||·||_2. Let the notation Z = diag{y}
represent the diagonal matrix of the vector y = [y_1, y_2, . . . , y_n]^T, i.e.,
z_ii = y_i, ∀i = 1, . . . , n, and z_ij = 0, ∀i ≠ j. We define the symbol diag{Z}
as a diagonal matrix whose diagonal elements are the same as those of the matrix
Z. The transposes of a vector z and a matrix W are denoted by z^T and W^T,
respectively. Let e_i = [0, . . . , 1, . . . , 0]^T, whose i-th entry is one. The
gradient of a differentiable function f(z) at z is denoted by ∇f(z): R^p → R^p.
A non-negative square matrix Z ∈ R^{n×n} is row-stochastic if Z1_n = 1_n,
column-stochastic if Z^T 1_n = 1_n, and doubly stochastic if Z1_n = 1_n and
Z^T 1_n = 1_n.
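These three stochasticity conditions translate directly into checks on the row and column sums. The following sketch makes the definitions concrete; the matrix Z below is an arbitrary illustrative example (row- but not column-stochastic), not a matrix from the text:

```python
import numpy as np

# Illustrative 3x3 weight matrix: every row sums to one, but the columns do not.
Z = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])

def is_row_stochastic(Z, tol=1e-12):
    # Z 1_n = 1_n: every row sums to one.
    return np.allclose(Z @ np.ones(Z.shape[0]), 1.0, atol=tol)

def is_column_stochastic(Z, tol=1e-12):
    # Z^T 1_n = 1_n: every column sums to one.
    return np.allclose(Z.T @ np.ones(Z.shape[0]), 1.0, atol=tol)

def is_doubly_stochastic(Z, tol=1e-12):
    # Both conditions hold simultaneously.
    return is_row_stochastic(Z, tol) and is_column_stochastic(Z, tol)
```

A doubly stochastic matrix (e.g., (1/n)·1_n 1_n^T) passes both checks, while the Z above passes only the row check.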

1.2.2 Model of Optimization Problem

Consider a set of n nodes connected over a directed network. The global objective
is to find x ∈ R^p that minimizes the average of all local cost functions, i.e.,

    min_{x∈R^p} f(x) = (1/n) ∑_{i=1}^n f_i(x),                        (1.1)

where each f_i: R^p → R is a convex function that we view as the local cost of
node i. Let f^* = f(x^*) and x^* denote the optimal value and an optimal solution
to (1.1), respectively. The set of optimal solutions to (1.1) is denoted by X^*,
where X^* = {x ∈ R^p | (1/n) ∑_{i=1}^n f_i(x) = f^*}.
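For intuition, consider illustrative scalar quadratic local costs f_i(x) = 0.5(x − a_i)^2; the data a_i below are assumptions for illustration only. The average cost is then minimized exactly at the mean of the a_i:

```python
import numpy as np

# Hypothetical local data: node i holds f_i(x) = 0.5*(x - a_i)^2.
a = np.array([1.0, 3.0, 5.0, 7.0])

def f(x):
    # Global cost (1.1): the average of the local costs.
    return float(np.mean(0.5 * (x - a) ** 2))

x_star = float(a.mean())                     # closed-form minimizer of the average
grad_at_star = float(np.mean(x_star - a))    # averaged gradient vanishes at x*
```

Here X^* is the singleton {mean(a_i)}: the averaged gradient (1/n)∑_i(x − a_i) is zero only at that point.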

1.2.3 Communication Network

In this chapter, we consider a group of n nodes communicating over a directed
network G = {V, E} involving the node set V = {1, . . . , n} and the edge set
E ⊆ V × V. If (i, j) ∈ E, node i can directly transmit data to node j, where i is
viewed as an in-neighbor of j and, conversely, j is regarded as an out-neighbor
of i. Let N_i^in = {j ∈ V | (j, i) ∈ E} and N_i^out = {j ∈ V | (i, j) ∈ E} be the
in-neighbor and out-neighbor sets of i, respectively. If |N_i^in| ≠ |N_i^out|, the
network is said to be unbalanced, where |·| denotes the cardinality of a set. For
the directed network G, a path of length b from node i_1 to node i_{b+1} is a
sequence of b + 1 distinct nodes i_1, . . . , i_{b+1} such that (i_k, i_{k+1}) ∈ E
for k = 1, . . . , b. If there is a path between any two nodes, then G is said to be
strongly connected. In addition, the following assumptions are adopted.
Assumption 1.1 ([51]) The network G corresponding to the set of nodes is directed
and strongly connected.
Remark 1.1 Assumption 1.1 is fundamental to assure that nodes in the network can
always affect others directly or indirectly when studying distributed optimization
problems [39–53].
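Strong connectivity can be verified numerically from an adjacency matrix: G is strongly connected exactly when I + A + · · · + A^{n−1} has no zero entries, since any node reachable at all is reachable by a path of length at most n − 1. A minimal sketch (the two example graphs are hypothetical):

```python
import numpy as np

def is_strongly_connected(adj):
    # adj[i, j] = 1 iff (i, j) is an edge, i.e., node i can send directly to j.
    # Accumulate reachability by paths of length 0, 1, ..., n-1.
    n = adj.shape[0]
    reach = np.eye(n, dtype=int)
    power = np.eye(n, dtype=int)
    for _ in range(n - 1):
        power = np.minimum(power @ adj, 1)   # pairs joined by one edge more
        reach = np.minimum(reach + power, 1)
    return bool((reach > 0).all())

cycle = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])   # 0 -> 1 -> 2 -> 0: strongly connected
chain = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])   # 0 -> 1 -> 2 only: node 0 is unreachable
```

The directed cycle satisfies Assumption 1.1, whereas the chain does not, since no path leads back to node 0.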

Assumption 1.2 ([42]) Each local cost function f_i, i ∈ V, is L_i-smooth.
Mathematically, there exists L_i > 0 such that for any x, y ∈ R^p, one has

    ||∇f_i(x) − ∇f_i(y)||_2 ≤ L_i ||x − y||_2.                        (1.2)

Assumption 1.3 ([42]) Each local cost function f_i, i ∈ V, is μ_i-strongly
convex. Mathematically, there exists μ_i ≥ 0 such that for any x, y ∈ R^p, one has

    f_i(x) ≥ f_i(y) + ∇f_i(y)^T (x − y) + (μ_i/2) ||x − y||_2^2,      (1.3)

where μ_i ∈ [0, +∞) and ∑_{i=1}^n μ_i > 0.
Remark 1.2 It is worth emphasizing that Assumptions 1.2 and 1.3 are two standard
assumptions for achieving linear convergence when first-order methods [41–53]
are employed. For Assumption 1.3 to hold, it suffices that each f_i is convex and
at least one of them is strongly convex. Moreover, under Assumption 1.3, problem
(1.1) possesses a unique optimal solution. In the following analysis, we denote
L̄ = (1/n) ∑_{i=1}^n L_i and μ̄ = (1/n) ∑_{i=1}^n μ_i as the Lipschitz continuity
and strong convexity constants, respectively, for the global cost function f.
Denote L̂ = max_i {L_i}.

1.3 Algorithm Development

On the basis of the above section, we first review the centralized Nesterov
gradient descent method (CNGD) and then propose the directed distributed
Nesterov-like gradient tracking algorithm, named D-DNGT, to solve problem (1.1).

1.3.1 Centralized Nesterov Gradient Descent Method (CNGD)

Here, CNGD, derived from [57], is briefly introduced for an L̄-smooth and
μ̄-strongly convex cost function. At each time t ≥ 0, CNGD keeps three vectors
y^t, x^t, v^t ∈ R^p and implements the following three steps:

    y^t = (αγ v^t + γ x^t)/(αμ̄ + γ)
    x^{t+1} = y^t − (1/L̄) ∇f(y^t)                                    (1.4)
    v^{t+1} = (1 − α) v^t + (αμ̄/γ) y^t − (α/γ) ∇f(y^t),

with the initial states y^0 = x^0 = v^0 ∈ R^p, where α and γ are constants related
to the parameters (L̄ and μ̄) of the cost function f. Nesterov [57] specified the
requirement that γ = (1 − α)γ + αμ̄. If α = √(μ̄/L̄), then γ must satisfy
γ = (1 − √(μ̄/L̄))γ + μ̄√(μ̄/L̄) = μ̄. After a series of transformations (see [59]
for the specific transformation), the equivalent form of CNGD (1.4) is given by

x t +1 = y t − L̄1 ∇f (y t )
(1.5)
y t +1 = x t +1 + β(x t +1 − x t ),
√ √ √ √
where β = ( L̄ − μ̄)/( L̄ + μ̄). It is well known that among all centralized
gradient approaches, CNGD [57] achieved the optimal convergence rate in terms of
the first-order oracle complexity. Under Assumptions 1.2 and 1.3, it is deduced that
the convergence rate of CNGD (1.5) was O((1 − μ̄/L̄)t ) whose dependence on
condition number L̄/μ̄ improved over CGD’s rate O((1 − μ̄/L̄)t ) in the large L̄/μ̄
regime. In this chapter, we devote ourselves to the study of a directed distributed
Nesterov-like gradient tracking (D-DNGT) algorithm, which is not only suitable for
a directed network but also converges linearly and accurately to the optimal solution
to (1.1). To the best of our knowledge, this work has not yet been involved and is
worthwhile to study.
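As a numerical sanity check, the momentum form (1.5) can be sketched on an illustrative quadratic; the matrix H, the starting point, and the iteration count below are assumptions for illustration, not values from the text:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T H x, whose smoothness and strong-convexity
# constants are the extreme eigenvalues of H (here mu = 1, L = 10).
H = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
grad = lambda v: H @ v                     # gradient of the quadratic
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))

x = np.array([5.0, 5.0])
y = x.copy()                               # y^0 = x^0
for _ in range(200):
    x_new = y - grad(y) / L                # x^{t+1} = y^t - (1/L) grad f(y^t)
    y = x_new + beta * (x_new - x)         # y^{t+1} = x^{t+1} + beta (x^{t+1} - x^t)
    x = x_new

# The iterates approach the unique minimizer x* = 0 at the accelerated rate.
```

With L/μ = 10, the per-step contraction is roughly 1 − √(1/10) ≈ 0.68, so 200 iterations drive the error far below visible precision.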

1.3.2 Directed Distributed Nesterov-Like Gradient Tracking
(D-DNGT) Algorithm

We now describe D-DNGT for solving problem (1.1) in a distributed manner. Each
node i ∈ V at time t ≥ 0 stores four variables: x_i^t ∈ R^p, y_i^t ∈ R^p,
s_i^t ∈ R^n, and z_i^t ∈ R^p. For t > 0, node i ∈ V updates its variables as
follows:

    x_i^{t+1} = ∑_{j=1}^n r_ij y_j^t + β_i (x_i^t − x_i^{t−1}) − α_i z_i^t
    y_i^{t+1} = x_i^{t+1} + β_i (x_i^{t+1} − x_i^t)
    s_i^{t+1} = ∑_{j=1}^n r_ij s_j^t                                  (1.6)
    z_i^{t+1} = ∑_{j=1}^n r_ij z_j^t + ∇f_i(y_i^{t+1})/[s_i^{t+1}]_i − ∇f_i(y_i^t)/[s_i^t]_i,
8 1 Accelerated Algorithms for Distributed Convex Optimization

where α_i > 0 and β_i ≥ 0 are, respectively, the constant step-size (non-uniform)
and the momentum (heavy-ball momentum and Nesterov momentum) coefficient
(non-uniform) locally chosen at each node i. The notations [s_i^t]_i and
∇f_i(y_i^t) (a vector) denote, respectively, the i-th entry of s_i^t and the
gradient of f_i(y) at y = y_i^t. The weights r_ij, i, j ∈ V, associated with the
network G obey the following conditions:

    r_ij > ε for j ∈ N_i^in,  r_ij = 0 otherwise;  ∑_{j=1}^n r_ij = 1, ∀i,   (1.7)

and r_ii = 1 − ∑_{j∈N_i^in} r_ij > ε, ∀i, where 0 < ε < 1.¹ Each node i ∈ V
starts with the initial states x_i^0 = y_i^0 ∈ R^p, s_i^0 = e_i, and
z_i^0 = ∇f_i(y_i^0).²
Denote R = [rij ] ∈ Rn×n as the collection of weights rij , i, j ∈ V in (1.7), which
is obviously row-stochastic. In essence, the update of zit in (1.6) is a distributed
inexact gradient tracking step, where each local cost function’s gradient is scaled
by [sti ]i , which is generated by the third update in (1.6). Actually, the update
of sti in (1.6) is a consensus iteration aiming to estimate the Perron eigenvector
w = [w1 , . . . , wn ]T (related to the eigenvalue 1) of the weight matrix R satisfying
1_n^T w = 1. This iteration is similar to that employed in [51–53]. To sum up,
D-DNGT (1.6) transforms CNGD (1.5) into a distributed form via gradient tracking
and can be applied to a directed network.
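To make the update (1.6) concrete, the following sketch runs the per-node recursions for p = 1 on a hypothetical 3-node directed network with quadratic local costs. The graph, step-sizes, and momentum coefficients are illustrative assumptions, not values from the analysis:

```python
import numpy as np

n = 3
a = np.array([1.0, 2.0, 6.0])             # f_i(x) = 0.5*(x - a_i)^2, x* = mean(a) = 3
grad = lambda i, v: v - a[i]              # gradient of f_i

R = np.array([[0.6, 0.0, 0.4],            # row-stochastic only: the columns
              [0.3, 0.7, 0.0],            # do not sum to one
              [0.0, 0.5, 0.5]])
alpha = np.array([0.03, 0.04, 0.02])      # non-uniform step-sizes
beta = np.array([0.03, 0.02, 0.04])       # non-uniform momentum coefficients

x = np.zeros(n)                           # x_i^0 = y_i^0 (arbitrary)
y = x.copy()
x_prev = x.copy()                         # convention: x^{-1} = x^0
S = np.eye(n)                             # rows are s_i^0 = e_i
z = np.array([grad(i, y[i]) for i in range(n)])   # z_i^0 = grad f_i(y_i^0)

for _ in range(3000):
    x_new = R @ y + beta * (x - x_prev) - alpha * z
    y_new = x_new + beta * (x_new - x)
    S_new = R @ S                         # eigenvector-estimation (consensus) step
    g_new = np.array([grad(i, y_new[i]) for i in range(n)])
    g_old = np.array([grad(i, y[i]) for i in range(n)])
    # gradient tracking with gradients scaled by [s_i^t]_i = diag(S^t)
    z = R @ z + g_new / np.diag(S_new) - g_old / np.diag(S)
    x_prev, x, y, S = x, x_new, y_new, S_new
```

All three nodes' estimates converge to the global minimizer x* = 3 even though R is only row-stochastic, matching the claim that no out-degree information is needed.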
Remark 1.3 For the sake of brevity, we mainly concentrate on the one-dimensional
case, i.e., p = 1; the multi-dimensional case can be proven similarly.
Define x^t = [x_1^t, . . . , x_n^t]^T ∈ R^n, y^t = [y_1^t, . . . , y_n^t]^T ∈ R^n,
z^t = [z_1^t, . . . , z_n^t]^T ∈ R^n, S^t = [s_1^t, . . . , s_n^t]^T ∈ R^{n×n},
∇F(y^t) = [∇f_1(y_1^t), . . . , ∇f_n(y_n^t)]^T ∈ R^n, and S̃^t = diag{S^t}.
Therefore, the aggregated form of D-DNGT (1.6) can be written as follows:

    x^{t+1} = R y^t + D_β (x^t − x^{t−1}) − D_α z^t
    y^{t+1} = x^{t+1} + D_β (x^{t+1} − x^t)
    S^{t+1} = R S^t                                                   (1.8)
    z^{t+1} = R z^t + [S̃^{t+1}]^{−1} ∇F(y^{t+1}) − [S̃^t]^{−1} ∇F(y^t),

where D_α = diag{α} ∈ R^{n×n} and D_β = diag{β} ∈ R^{n×n} with
α = [α_1, . . . , α_n]^T and β = [β_1, . . . , β_n]^T. The initial states are
x^0 = y^0 ∈ R^n, S^0 = I_n, and z^0 = ∇F(y^0) ∈ R^n. It is worth emphasizing
that D-DNGT (1.6) does not need the out-degree information of nodes (only a
row-stochastic matrix is adopted in D-DNGT), which makes it more amenable to
distributed implementation.

¹ It is worth noticing that the weights r_ij, i, j ∈ V, associated with the network G given in (1.7)
are valid. For all i ∈ V, the conditions on the weights in (1.7) can be satisfied when we set
r_ij = 1/|N_i^in|, ∀j ∈ N_i^in, and r_ij = 0, otherwise.
² Suppose that each node possesses and knows its unique identifier in the network, e.g., 1, . . . , n
[45–50].

1.3.3 Related Methods

In this subsection, some distributed optimization methods that are not only
suitable for directed networks but also closely related to D-DNGT (1.6) are
discussed, based on an intuitive explanation. In particular, we consider
ADD-OPT/Push-DIGing [41, 42], FROST [52], and ABm [48].³
(a) Relation to ADD-OPT/Push-DIGing ADD-OPT [41] (Push-DIGing [42] is
suitable for time-varying networks, in comparison with ADD-OPT) keeps updating
four variables x_i^t, s_i^t, y_i^t, and z_i^t ∈ R for each node i. Starting from
the initial states s_i^0 = 1, z_i^0 = ∇f_i(y_i^0), and an arbitrary x_i^0, the
updating rule of ADD-OPT is given by

    x_i^{t+1} = ∑_{j=1}^n c_ij x_j^t − α z_i^t
    s_i^{t+1} = ∑_{j=1}^n c_ij s_j^t,  y_i^{t+1} = x_i^{t+1}/s_i^{t+1},   (1.9)
    z_i^{t+1} = ∑_{j=1}^n c_ij z_j^t + ∇f_i(y_i^{t+1}) − ∇f_i(y_i^t),

where C = [c_ij] ∈ R^{n×n} is column-stochastic and α > 0 is the step-size.
Under Assumptions 1.1–1.3, ADD-OPT converges linearly to the optimal solution
over a directed network using a uniform constant step-size. Besides,
ADD-OPT/Push-DIGing apply the push-sum strategy (column-stochastic weights) to
overcome the unbalancedness induced by directed networks, which may be
infeasible in a distributed implementation because it requires each node to
possess (at least) its out-degree information. We emphasize here that
row-stochastic weights are relatively easy to achieve in a distributed setting,
and the implementation is straightforward if each node can privately regulate
the weights on the information it acquires from its in-neighbors.
(b) Relation to FROST The method FROST, proposed in [52], serves as a basis
for the development of D-DNGT (1.6). FROST maintains over time t ≥ 0 at each
node i the solution estimate x_i^t ∈ R and two auxiliary variables s_i^t ∈ R^n
and z_i^t ∈ R.

³ Notice that some notations involved in the related methods may contradict the notations used
for the distributed optimization problem/algorithm/analysis throughout the chapter. Therefore, we
declare here that the symbols in this section should not be applied to other parts.
Mathematically, the updating rule is as follows:

    x_i^{t+1} = ∑_{j=1}^n r_ij x_j^t − α_i z_i^t
    s_i^{t+1} = ∑_{j=1}^n r_ij s_j^t,                                 (1.10)
    z_i^{t+1} = ∑_{j=1}^n r_ij z_j^t + ∇f_i(x_i^{t+1})/[s_i^{t+1}]_i − ∇f_i(x_i^t)/[s_i^t]_i,

where α_i > 0 is a step-size locally chosen at each node i and the row-stochastic
weights R = [r_ij] ∈ R^{n×n} comply with (1.7); the initialization is x_i^0 ∈ R,
s_i^0 = e_i, and z_i^0 = ∇f_i(x_i^0). FROST utilizes row-stochastic weights with
non-uniform step-sizes among the nodes and exhibits fast convergence over a
directed network, converging at a linear rate to the optimal solution under
Assumptions 1.1–1.3.
(c) Relation to ABm The ABm algorithm, investigated in [48], combines gradient
tracking with a momentum term and utilizes non-uniform step-sizes; it is
described as follows:

    x_i^{t+1} = ∑_{j=1}^n r_ij x_j^t − α_i z_i^t + β_i (x_i^t − x_i^{t−1}),
    z_i^{t+1} = ∑_{j=1}^n c_ij z_j^t + ∇f_i(x_i^{t+1}) − ∇f_i(x_i^t),    (1.11)

initialized with z_i^0 = ∇f_i(x_i^0) and an arbitrary x_i^0 at each node i,
where, as before, α_i > 0 and β_i ≥ 0 represent the local step-size and the
momentum coefficient of node i. By simultaneously implementing both
row-stochastic (R = [r_ij] ∈ R^{n×n}) and column-stochastic
(C = [c_ij] ∈ R^{n×n}) weights, it is deduced from [48] that ABm reduces to AB
[45] when β_i = 0, ∀i, and AB lies at the heart of existing methods that employ
gradient tracking [42, 43, 48].

Notice that ADD-OPT/Push-DIGing, FROST, and D-DNGT, described above, contain a
non-linear term derived from the division by the eigenvector learning term
((1.6), (1.9), and (1.10)). ABm eliminates this non-linear calculation and is
still suitable for directed networks. However, ABm requires each node to gain
access to its out-degree information to build column-stochastic weights, which,
as explained earlier, is a challenge to establish directly in a distributed
manner. It is worth highlighting that our algorithm, D-DNGT, extends CNGD to a
distributed form and is suitable for directed networks, in comparison with CNGD
[57] and Acc-DNGD-SC/Acc-DNGD-NSC [54]. In addition, D-DNGT combines FROST with
two kinds of momentum terms (heavy-ball momentum and Nesterov momentum), which
ensures that nodes acquire more information from their in-neighbors in the
network than FROST and thus achieve much faster convergence.

1.4 Convergence Analysis

In this section, we prove that D-DNGT (1.6) converges at a linear rate to the optimal solution x ∗ provided that the coefficients (non-uniform step-sizes and momentum coefficients) are bounded by properly chosen constants. The following notation and relations are employed.
Recalling that R is irreducible and row-stochastic with positive diagonals, under Assumption 1.1 there exists a normalized left Perron eigenvector w = [w1 , . . . , wn ]T ∈ Rn (wi > 0, ∀i) of R such that
\[
\lim_{t\to\infty}(R)^t = (R)^\infty = 1_n w^T, \qquad w^T R = w^T, \qquad w^T 1_n = 1.
\]
Also, define S ∞ = limt→∞ S t (we obtain S ∞ = (R)∞ because S 0 = In ), S̃ ∞ = diag{S ∞ }, ŝ = supt≥0 ||S t ||2 , s̃ = supt≥0 ||[S̃ t ]−1 ||2 ,4 x̄ t = wT x t , ∇F (1n x̄ t ) = [∇f1 (x̄ t ), . . . , ∇fn (x̄ t )]T , α̂ = maxi∈V {αi }, and β̂ = maxi∈V {βi }. Since R is primitive and S 0 = In , the sequence {S t } is convergent [39, 51]; therefore, the diagonal elements of S t are positive and bounded for all t ≥ 0, and ŝ and s̃ are finite constants. In addition, we use || · || to denote either a particular matrix norm or a vector norm such that ||Rz|| ≤ ||R|| ||z|| for all matrices R and vectors z. Since all vector norms are equivalent in a finite-dimensional vector space, we have || · ||2 ≤ d1 || · || and || · || ≤ d2 || · ||2 , where d1 and d2 are some positive constants.
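The Perron-eigenvector facts above are easy to check numerically; the 4-node row-stochastic matrix in the sketch below is a made-up example.

```python
import numpy as np

# For a primitive row-stochastic R, (R)^t converges to (R)^infty = 1_n w^T,
# where w is the normalized left Perron eigenvector (w^T R = w^T, w^T 1 = 1).
R = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.0, 0.0, 0.5]])

eigvals, eigvecs = np.linalg.eig(R.T)       # left eigenvectors of R
idx = np.argmin(np.abs(eigvals - 1.0))
w = np.real(eigvecs[:, idx])
w = w / w.sum()                             # normalize so that w^T 1_n = 1

R_inf = np.outer(np.ones(4), w)             # (R)^infty = 1_n w^T
print(np.allclose(np.linalg.matrix_power(R, 200), R_inf))  # True
```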

1.4.1 Auxiliary Results

Before showing the main results, we introduce some auxiliary results. First, the
following crucial lemma is given, which is a direct implication of Assumption 1.1
and (1.7) (see Section II-B in [32]).
Lemma 1.4 ([32]) Suppose that Assumption 1.1 holds and that the weight matrix R = [rij ] ∈ Rn×n follows (1.7). Then, there exist a norm || · || and a constant 0 < ρ < 1 such that
\[
\|Rx - (R)^\infty x\| \le \rho \|x - (R)^\infty x\|,
\]
for all x ∈ Rn .
According to the result established in Lemma 1.4, in the following, we present
an additional lemma from the Markov chain and consensus theory [60].

4 Throughout the chapter, for any arbitrary matrix/vector/scalar Z, we utilize the symbol (Z)t to
represent the t-th power of Z to distinguish the iteration of variables.

Lemma 1.5 ([60]) Let S t be generated by (1.8). Then, there exist 0 < θ < ∞ and
0 < λ < 1 such that

||S t − S ∞ ||2 ≤ θ (λ)t , ∀t ≥ 0.

The next lemma, a direct consequence of Lemma 1.5, will be employed to deduce the linear convergence of the sequences {||[S̃ t ]−1 − [S̃ ∞ ]−1 ||2 } and {||[S̃ t +1 ]−1 − [S̃ t ]−1 ||2 } (for a detailed proof, see [52] or [51]).
Lemma 1.6 ([51]) Let S t be generated by (1.8). For all t ≥ 0, it holds that
(a) ||[S̃ t ]−1 − [S̃ ∞ ]−1 ||2 ≤ θ (s̃)2 (λ)t ,
(b) ||[S̃ t +1 ]−1 − [S̃ t ]−1 ||2 ≤ 2θ (s̃)2 (λ)t .
The next lemma derives the dynamics that govern the evolution of the weight
sum of zt .
Lemma 1.7 ([51]) Let zt be generated by (1.8). Recall that z0 = ∇F (y 0 ). Then,
for all t ≥ 0, it yields that

(R)∞ zt = (R)∞ [S̃ t ]−1 ∇F (y t ).

For convenience of the convergence analysis, we will make frequent use of the following well-known lemma (see, e.g., [32] for a proof).
Lemma 1.8 ([32]) Suppose that Assumptions 1.2–1.3 hold. Since the global cost function f is μ̄-strongly convex and L̄-smooth, for all x ∈ R and 0 < ε < 2/L̄ we have
\[
\|x - \varepsilon \nabla f(x) - x^*\|_2 \le l \|x - x^*\|_2,
\]
where l = max{|1 − L̄ε|, |1 − μ̄ε|}, x ∗ is the optimal solution to (1.1), and ∇f (x) is the gradient of f at x.
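Lemma 1.8 can be sanity-checked numerically; the quadratic below is an illustrative choice (so that μ̄ = L̄ = 2 and x* = 0), not one of the chapter's cost functions.

```python
import numpy as np

# Check ||x - eps*grad_f(x) - x*|| <= l * ||x - x*|| for f(x) = x^2.
mu_bar = L_bar = 2.0
grad_f = lambda u: 2.0 * u
x_star = 0.0

eps = 0.4                                   # satisfies 0 < eps < 2 / L_bar = 1
l = max(abs(1 - L_bar * eps), abs(1 - mu_bar * eps))

for u in np.linspace(-5, 5, 101):
    lhs = abs(u - eps * grad_f(u) - x_star)
    assert lhs <= l * abs(u - x_star) + 1e-12

print("contraction factor l =", l)          # approximately 0.2
```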

1.4.2 Supporting Lemmas

In this subsection, we develop the convergence analysis of D-DNGT by investigating the evolutions of ||x t +1 − (R)∞ x t +1 ||, ||(R)∞ x t +1 − 1n x ∗ ||2 , ||x t +1 − x t ||, and ||zt +1 − (R)∞ zt +1 ||. Our goal is to bound these four quantities by linear combinations of their past values and ∇F (y t ), thereby constructing a linear system of inequalities. In what follows, the bound on the consensus violation, ||x t +1 − (R)∞ x t +1 ||, is provided first.

Lemma 1.9 Suppose that Assumption 1.1 holds. Then, for all t > 0, we have the
following inequality:

||x t +1 − (R)∞ x t +1 ||

≤ ρ||x t − (R)∞ x t || + κ1 β̂||x t − x t −1|| + κ2 α̂||zt ||2 , (1.12)

where ρ is given in Lemma 1.4, κ1 = d2 (ρ + 1)||In − (R)∞ || and κ2 = (d2 )2 ||In − (R)∞ ||; α̂ and β̂ are the largest step-size and the maximum momentum coefficient among the nodes, respectively.
Proof According to the updates of x t and y t in D-DNGT (1.8), it holds that
\[
\|x^{t+1} - (R)^\infty x^{t+1}\| \le \rho \|(I_n - (R)^\infty)(x^t + D_\beta(x^t - x^{t-1}))\| + \|(I_n - (R)^\infty) D_\beta (x^t - x^{t-1})\| + \kappa_2 \hat\alpha \|z^t\|_2, \tag{1.13}
\]
where the inequality in (1.13) follows from Lemma 1.4 and the fact that (R)∞ R = (R)∞ . The desired result of Lemma 1.9 is then acquired. □
The next lemma presents the bound on the optimality residual associated with the weighted average, ||(R)∞ x t +1 − 1n x ∗ ||2 (notice that (R)∞ x t +1 = 1n x̄ t +1 ).
Lemma 1.10 Suppose that Assumptions 1.2 and 1.3 hold. If 0 < n(wT α) < 2/L̄, then the following inequality holds for all t > 0:
\[
\begin{aligned}
\|(R)^\infty x^{t+1} - 1_n x^*\|_2
&\le d_1 n \hat L \hat\alpha \|(R)^\infty x^t - x^t\| + l_1 \|(R)^\infty x^t - 1_n x^*\|_2 + \hat s (\tilde s)^2 \theta \hat\alpha \|\nabla F(y^t)\|_2 (\lambda)^t \\
&\quad + (2\kappa_3 + d_1 \hat s \tilde s \hat L \hat\alpha) \hat\beta \|x^t - x^{t-1}\| + \kappa_3 \hat\alpha \|z^t - (R)^\infty z^t\|,
\end{aligned}
\tag{1.14}
\]
where κ3 = d1 ||(R)∞ ||2 and l1 = max{|1 − L̄n(wT α)|, |1 − μ̄n(wT α)|}; θ and λ are introduced in Lemma 1.5.
Proof Notice that (R)∞ R = (R)∞ . Recalling the updates of x t and y t in D-DNGT (1.8), we get from Lemma 1.7 that

\[
\begin{aligned}
\|(R)^\infty x^{t+1} - 1_n x^*\|_2
&= \|(R)^\infty (x^t + 2D_\beta(x^t - x^{t-1}) - D_\alpha (R)^\infty z^t - D_\alpha (z^t - (R)^\infty z^t)) - 1_n x^*\|_2 \\
&\le \|(R)^\infty x^t - (R)^\infty D_\alpha (R)^\infty [\tilde S^t]^{-1} \nabla F(y^t) - 1_n x^*\|_2 \\
&\quad + 2\kappa_3 \hat\beta \|x^t - x^{t-1}\| + \kappa_3 \hat\alpha \|z^t - (R)^\infty z^t\|.
\end{aligned}
\tag{1.15}
\]



We now discuss the first term in the inequality of (1.15). Note that (R)∞ =
1n wT and ∇F (1n x̄ t ) = [∇f1 (x̄ t ), . . . , ∇fn (x̄ t )]T . By utilizing 1n wT Dα 1n wT =
(wT α)1n wT , one obtains

||(R)∞ x t − (R)∞ Dα (R)∞ [S̃ t ]−1 ∇F (y t ) − 1n x ∗ ||2


≤ ||1n (wT x t − x ∗ − n(wT α)∇f (x̄ t ))||2
+ (wT α)||n1n ∇f (x̄ t ) − 1n wT [S̃ t ]−1 ∇F (y t )||2
= Λ1 + Λ2 , (1.16)

where ∇f (x̄ t ) = (1/n)1Tn ∇F (1n x̄ t ). By Lemma 1.8, when 0 < n(wT α) < 2/L̄,
Λ1 is bounded by

Λ1 ≤ l1 n||wT x t − x ∗ ||2 = l1 ||(R)∞ x t − 1n x ∗ ||2 , (1.17)

where l1 = max{|1 − L̄n(wT α)|, |1 − μ̄n(wT α)|}. Then, Λ2 can be bounded in the
following way:

Λ2 ≤(wT α)||n1n ∇f (x̄ t ) − 1n 1Tn ∇F (x t )||2


+ (wT α)||1n 1Tn ∇F (x t ) − 1n wT [S̃ t ]−1 ∇F (y t )||2
=Λ3 + Λ4 , (1.18)

where ∇F (x t ) = [∇f1 (x1t ), . . . , ∇fn (xnt )]T . Since ∇f (x̄ t ) = (1/n)1Tn ∇F (1n x̄ t ),
it yields from Assumption 1.2 that

Λ3 ≤ nL̂α̂||(R)∞ x t − x t ||2 . (1.19)

Next, by employing Lemma 1.6 and the relation S ∞ [S̃ ∞ ]−1 = 1n 1Tn , we have

Λ4 = (wT α)||S ∞ [S̃ ∞ ]−1 ∇F (x t ) − S ∞ [S̃ t ]−1 ∇F (y t )||2

≤ ŝ s̃ L̂α̂ β̂||x t − x t −1 ||2 + ŝ(s̃)2 θ α̂||∇F (y t )||2 (λ)t , (1.20)

where ŝ = supt ≥0 ||S t ||2 and s̃ = supt ≥0||[S̃ t ]−1 ||2 . The lemma follows by plugging
(1.16)–(1.20) into (1.15). 
For the bound of the estimate difference ||x t +1 − x t ||, the following lemma is
shown.
Lemma 1.11 Suppose that Assumption 1.2 holds. For all t > 0, it holds that

||x t +1 − x t || ≤ κ4 ||x t − (R)∞ x t || + κ5 β̂||x t − x t −1|| + d2 α̂||zt ||2 , (1.21)

where κ4 = ||R − In || and κ5 = d2 + d2 ||R||.



Proof Recalling that (R)∞ R = (R)∞ , we obtain from the updates of x t and y t in
D-DNGT (1.8) that

||x t +1 − x t || ≤||R − In ||||x t − (R)∞ x t || + d2 α̂||zt ||2

+ (d2 + d2 ||R||)β̂||x t − x t −1 ||, (1.22)

and the lemma follows. 


The next lemma establishes the inequality which bounds the error term corre-
sponding to gradient estimation ||zt +1 − (R)∞ zt +1 ||.
Lemma 1.12 Suppose that Assumptions 1.1–1.3 hold. For all t > 0, we get the
following estimate:

||zt +1 − (R)∞ zt +1 ||
≤ κ4 κ6 (1 + β̂)||x t − (R)∞ x t || + κ6 d2 (1 + β̂)α̂||zt ||2

+ κ6 (1 + κ5 + κ5 β̂)β̂||x t − x t −1 || + ρ||zt − (R)∞ zt ||


+ 2||In − (R)∞ ||d2 (s̃)2 θ ||∇F (y t )||2 (λ)t , (1.23)

where κ6 = ||In − (R)∞ ||d1 d2 L̂s̃.


Proof It is immediately obtained from the update of zt in D-DNGT (1.8) that

||zt +1 − (R)∞ zt +1 ||
≤ ||In − (R)∞ ||||[S̃ t +1]−1 ∇F (y t +1) − [S̃ t ]−1 ∇F (y t )||
+ ρ||zt − (R)∞ zt ||, (1.24)

where we employ the triangle inequality and Lemma 1.4 to deduce the inequality. As
for the first term of the inequality in (1.24), we apply the update of y t in D-DNGT
(1.8) and the result in Lemma 1.6 to obtain

||[S̃ t +1]−1 ∇F (y t +1 ) − [S̃ t ]−1 ∇F (y t )||


≤ ||[S̃ t +1 ]−1 ∇F (y t +1) − [S̃ t +1 ]−1 ∇F (y t )||
+ ||[S̃ t +1 ]−1 ∇F (y t ) − [S̃ t ]−1 ∇F (y t )||

≤ d2 L̂s̃||y t +1 − y t ||2 + 2d2(s̃)2 θ ||∇F (y t )||2 (λ)t

≤ d1 d2 L̂s̃(1 + β̂)||x t +1 − x t || + d1 d2 L̂s̃ β̂||x t − x t −1||


+ 2d2 (s̃)2 θ ||∇F (y t )||2 (λ)t . (1.25)

Combining Lemma 1.11 with (1.25), the result in Lemma 1.12 is obtained. 

The final lemma establishes a bound on the estimate ||zt ||2 that is needed to derive the aforementioned linear system.
Lemma 1.13 Assume that Assumption 1.2 holds. Then, the following inequality can
be established for all t > 0,

||zt ||2 ≤d1 ||zt − (R)∞ zt || + d1 nL̂||x t − (R)∞ x t ||

+ nL̂||(R)∞ x t − 1n x ∗ ||2 + d1 nL̂β̂||x t − x t −1 ||


+ ŝ(s̃)2 θ (λ)t ||∇F (y t )||2 . (1.26)

Proof Note that

||zt ||2 ≤ ||zt − (R)∞ zt ||2 + ||(R)∞ zt ||2 . (1.27)

In view of Lemma 1.7, using S ∞ [S̃ ∞ ]−1 = 1n 1Tn and (R)∞ = S ∞ , it suffices that

||(R)∞ zt ||2 ≤||S ∞ [S̃ t ]−1 ∇F (y t ) − S ∞ [S̃ ∞ ]−1 ∇F (y t )||2


+ ||S ∞ [S̃ ∞ ]−1 ∇F (y t ) − S ∞ [S̃ ∞ ]−1 ∇F (1n x ∗ )||2

≤ŝ(s̃)2 θ (λ)t ||∇F (y t )||2 + nL̂||y t − 1n x ∗ ||2 . (1.28)

By the update of y t in D-DNGT (1.8), one gets

||y t − 1n x ∗ ||2 ≤||x t − (R)∞ x t ||2 + ||(R)∞ x t − 1n x ∗ ||2

+ β̂||x t − x t −1 ||2 . (1.29)

Substituting (1.28) and (1.29) into (1.27) yields the desired result in Lemma 1.13.
The proof is completed. 

1.4.3 Main Results

With the supporting relationships, i.e., Lemmas 1.9–1.13 above, in hand, the main convergence results of D-DNGT are now established as follows.
For the sake of convenience, we define wmin = mini∈V {wi }, ν1 = κ2 d1 nL̂,
ν2 = κ2 nL̂, ν3 = κ2 d1 , ν4 = d1 nL̂, ν5 = d1 ŝ s̃ L̂, ν6 = d2 d1 nL̂, ν7 = d2 nL̂,
ν8 = d2 d1 , ν9 = κ4 κ6 , ν10 = κ6 d2 d1 nL̂, ν11 = κ6 d2 nL̂, ν12 = κ6 + κ5 κ6 , ν13 =
κ5 κ6 , ν14 = κ6 d2 d1 , ν15 = κ2 α̂ ŝ(s̃)2 θ , ν16 = ŝ(s̃)2 θ α̂, ν17 = d2 α̂ ŝ(s̃)2 θ , ν18 =
(2||In − (R)∞ || + κ6 (1 + β̂)α̂ ŝ)(s̃)2 θ d2 , ν19 = ν13 η3 + ν10 η3 α̂, ν20 = ν9 η1 +
ν10 η1 α̂ + ν11 η2 α̂ + ν12 η3 + ν10 η3 α̂ + ν14 η4 α̂ and ν21 = η4 (1 − ρ) − ν9 η1 − (ν10 η1 +
ν11 η2 + ν14 η4 )α̂. Then, the first result, i.e., Theorem 1.14, is introduced below.

Theorem 1.14 Suppose that Assumptions 1.1–1.3 hold. Consider D-DNGT (1.8), which updates the sequences {x t }, {y t }, {S t }, and {zt }. Then, if 0 < n(wT α) < 2/L̄, one gets the following linear system of inequalities:
\[
\begin{bmatrix} \|x^{t+1} - (R)^\infty x^{t+1}\| \\ \|(R)^\infty x^{t+1} - 1_n x^*\|_2 \\ \|x^{t+1} - x^t\| \\ \|z^{t+1} - (R)^\infty z^{t+1}\| \end{bmatrix}
\le \Gamma
\begin{bmatrix} \|x^t - (R)^\infty x^t\| \\ \|(R)^\infty x^t - 1_n x^*\|_2 \\ \|x^t - x^{t-1}\| \\ \|z^t - (R)^\infty z^t\| \end{bmatrix}
+ \phi^t, \tag{1.30}
\]
where the inequality holds component-wise. The elements of the matrix Γ = [γij ] ∈ R4×4 and the vector φ t = [φ1t , φ2t , φ3t , φ4t ]T ∈ R4 are respectively given by
\[
\Gamma = \begin{bmatrix}
\rho + \nu_1\hat\alpha & \nu_2\hat\alpha & \kappa_1\hat\beta + \nu_1\hat\alpha\hat\beta & \nu_3\hat\alpha \\
\nu_4\hat\alpha & l_1 & 2\kappa_3\hat\beta + \nu_5\hat\alpha\hat\beta & \kappa_3\hat\alpha \\
\kappa_4 + \nu_6\hat\alpha & \nu_7\hat\alpha & \kappa_5\hat\beta + \nu_6\hat\alpha\hat\beta & \nu_8\hat\alpha \\
\gamma_{41} & \gamma_{42} & \gamma_{43} & \gamma_{44}
\end{bmatrix},
\]
with γ41 = ν9 + ν10 α̂ + ν9 β̂ + ν10 α̂ β̂, γ42 = ν11 α̂ + ν11 α̂ β̂, γ43 = ν12 β̂ + ν13 β̂ 2 + ν10 α̂ β̂ + ν10 α̂ β̂ 2 , and γ44 = ρ + ν14 α̂ + ν14 α̂ β̂; φ1t = ν15 (λ)t ||∇F (y t )||2 , φ2t = ν16 (λ)t ||∇F (y t )||2 , φ3t = ν17 (λ)t ||∇F (y t )||2 , and φ4t = ν18 (λ)t ||∇F (y t )||2 .
Assume in addition that the largest step-size satisfies
\[
0 < \hat\alpha < \min\left\{ \frac{1}{n\bar L},\ \frac{\eta_1(1-\rho)}{\nu_1\eta_1 + \nu_2\eta_2 + \nu_3\eta_4},\ \frac{\eta_3 - \kappa_4\eta_1}{\nu_6\eta_1 + \nu_7\eta_2 + \nu_8\eta_4},\ \frac{\eta_4(1-\rho) - \nu_9\eta_1}{\nu_{10}\eta_1 + \nu_{11}\eta_2 + \nu_{14}\eta_4} \right\}, \tag{1.31}
\]
and the maximum momentum coefficient satisfies
\[
0 \le \hat\beta < \min\left\{ \frac{\eta_1(1-\rho) - (\nu_1\eta_1 + \nu_2\eta_2 + \nu_3\eta_4)\hat\alpha}{\kappa_1\eta_3 + \nu_1\eta_3\hat\alpha},\ \frac{\eta_2(1-l_1) - (\nu_4\eta_1 + \kappa_3\eta_4)\hat\alpha}{2\kappa_3\eta_3 + \nu_5\eta_3\hat\alpha},\ \frac{-\nu_{20} + \sqrt{(\nu_{20})^2 + 4\nu_{19}\nu_{21}}}{2\nu_{19}},\ \frac{\eta_3 - \kappa_4\eta_1 - (\nu_6\eta_1 + \nu_7\eta_2 + \nu_8\eta_4)\hat\alpha}{\kappa_5\eta_3 + \nu_6\eta_3\hat\alpha} \right\}. \tag{1.32}
\]
Then, the spectral radius of Γ , denoted ρ(Γ ), is strictly less than 1, where η1 , η2 , η3 , and η4 are arbitrary constants such that
\[
\eta_1 > 0, \quad \eta_2 > \frac{\nu_4\eta_1 + \kappa_3\eta_4}{\bar\mu n w_{\min}}, \quad \eta_3 > \kappa_4\eta_1, \quad \eta_4 > \frac{\nu_9\eta_1}{1-\rho}. \tag{1.33}
\]

Proof First, plugging Lemma 1.13 into Lemmas 1.9–1.12 and rearranging the resulting inequalities, one immediately verifies (1.30). Next, we derive conditions under which ρ(Γ ) < 1 holds. According to Theorem 8.1.29 in [60], for a positive vector η = [η1 , . . . , η4 ]T ∈ R4 , if Γ η < η, then ρ(Γ ) < 1. By the definition of Γ , the inequality Γ η < η is equivalent to
\[
\begin{cases}
(\kappa_1\eta_3 + \nu_1\eta_3\hat\alpha)\hat\beta < \eta_1(1-\rho) - (\nu_1\eta_1 + \nu_2\eta_2 + \nu_3\eta_4)\hat\alpha \\
(2\kappa_3\eta_3 + \nu_5\eta_3\hat\alpha)\hat\beta < \eta_2(1-l_1) - (\nu_4\eta_1 + \kappa_3\eta_4)\hat\alpha \\
(\kappa_5\eta_3 + \nu_6\eta_3\hat\alpha)\hat\beta < \eta_3 - \kappa_4\eta_1 - (\nu_6\eta_1 + \nu_7\eta_2 + \nu_8\eta_4)\hat\alpha \\
2\nu_{19}\hat\beta < -\nu_{20} + \sqrt{(\nu_{20})^2 + 4\nu_{19}\nu_{21}}.
\end{cases}
\tag{1.34}
\]
When 0 < α̂ < 1/(nL̄), Lemma 1.10 yields l1 = 1 − μ̄n(wT α) ≤ 1 − μ̄nwmin α̂. To ensure the positivity of β̂ (i.e., that the right-hand sides of (1.34) are positive), (1.34) further implies that
\[
\begin{cases}
\hat\alpha < \dfrac{\eta_1(1-\rho)}{\nu_1\eta_1 + \nu_2\eta_2 + \nu_3\eta_4} \\[6pt]
\eta_2 > \dfrac{\nu_4\eta_1 + \kappa_3\eta_4}{\bar\mu n w_{\min}} \\[6pt]
\hat\alpha < \dfrac{\eta_3 - \kappa_4\eta_1}{\nu_6\eta_1 + \nu_7\eta_2 + \nu_8\eta_4}, \quad \eta_3 > \kappa_4\eta_1 \\[6pt]
\hat\alpha < \dfrac{\eta_4(1-\rho) - \nu_9\eta_1}{\nu_{10}\eta_1 + \nu_{11}\eta_2 + \nu_{14}\eta_4}, \quad \eta_4 > \dfrac{\nu_9\eta_1}{1-\rho}.
\end{cases}
\tag{1.35}
\]

We are now in a position to select the vector η = [η1 , . . . , η4 ]T so as to ensure the solvability of α̂. Since ρ < 1, we first pick an arbitrary positive constant η1 ; then we choose η3 and η4 in accordance with the third and fourth conditions in (1.35), respectively; and finally we select η2 satisfying the second condition in (1.35). Hence, (1.35) yields the upper bounds on the largest step-size α̂ in (1.31), taking into account the requirement 0 < n(wT α) < 2/L̄. In addition, we obtain the upper bounds on the maximum momentum coefficient β̂ from (1.34) and the largest step-size α̂. This finishes the proof. □
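The key mechanism in the proof, namely that a positive vector η with Γη < η certifies ρ(Γ) < 1 (Theorem 8.1.29 in [60]), can be checked numerically; the matrix below is a made-up non-negative example, not the Γ of Theorem 1.14.

```python
import numpy as np

# A non-negative matrix with a positive vector eta satisfying Gamma @ eta < eta.
Gamma = np.array([[0.90, 0.02, 0.05, 0.01],
                  [0.10, 0.80, 0.03, 0.02],
                  [0.20, 0.05, 0.60, 0.01],
                  [0.15, 0.04, 0.10, 0.70]])
eta = np.array([1.0, 1.0, 1.0, 1.0])

assert np.all(Gamma @ eta < eta)            # component-wise strict inequality
rho = max(abs(np.linalg.eigvals(Gamma)))    # spectral radius
print(rho < 1)                              # True
```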
Remark 1.15 It is worth emphasizing that η1 , η2 , η3 , and η4 in Theorem 1.14 are adjustable parameters that rely only on the network topology and the cost functions. Thus, the choices of the largest step-size, α̂, and the maximum momentum coefficient, β̂, can be calculated without much effort as long as the other parameters, such as λ and η, are properly selected. Furthermore, designing the step-sizes and the momentum coefficients requires some global parameters, such as L̂, L̄, μ̄, and wmin . We note that the preprocessing cost of computing these global parameters is almost negligible compared with the worst-case runtime of D-DNGT (see [42] for a specific analysis).
Before presenting the linear convergence of D-DNGT to the global optimal
solution, the following supermartingale convergence result is first introduced, which
will be crucial for the convergence analysis.

Lemma 1.16 ([39]) Let {v t }, {ut }, {a t }, and {bt } be non-negative sequences such that, for all t ≥ 0,
\[
v^{t+1} \le (1 + a^t) v^t - u^t + b^t.
\]
Also, let \(\sum_{t=0}^{\infty} a^t < \infty\) and \(\sum_{t=0}^{\infty} b^t < \infty\). Then limt→∞ v t = v for some v ≥ 0, and \(\sum_{t=0}^{\infty} u^t < \infty\).
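A quick numerical illustration of Lemma 1.16, using made-up summable sequences and u^t = 0 throughout:

```python
# With a^t and b^t summable and u^t = 0, the sequence v^t must converge.
a = [0.5 ** t for t in range(200)]          # sum of a^t is finite
b = [0.25 * 0.5 ** t for t in range(200)]   # sum of b^t is finite
v = [1.0]
for t in range(199):
    v.append((1 + a[t]) * v[t] + b[t])      # recursion with u^t = 0

print(abs(v[-1] - v[-2]) < 1e-10)           # True: v^t settles to a limit
```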


Now, we are ready to state the main convergence result.
Theorem 1.17 Suppose that Assumptions 1.1–1.3 hold. Consider that the
sequences {x t }, {y t }, {S t }, and {zt } are updated in D-DNGT (1.8). If α̂ and β̂
satisfy the conditions in Theorem 1.14, then the sequence {x t } converges to 1n x ∗ at
a linear rate of O((δ)t ), where λ < δ < 1 is a constant.
Proof Define
\[
\varphi^t = \begin{bmatrix} \|x^t - (R)^\infty x^t\| \\ \|(R)^\infty x^t - 1_n x^*\|_2 \\ \|x^t - x^{t-1}\| \\ \|z^t - (R)^\infty z^t\| \end{bmatrix}, \quad
P^t = \begin{bmatrix} \nu_{15}(\lambda)^t & 0 & 0 & 0 \\ \nu_{16}(\lambda)^t & 0 & 0 & 0 \\ \nu_{17}(\lambda)^t & 0 & 0 & 0 \\ \nu_{18}(\lambda)^t & 0 & 0 & 0 \end{bmatrix}, \quad
Q^t = \begin{bmatrix} \|\nabla F(y^t)\|_2 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
\]
Then, inequality (1.30) is equivalent to

ϕ t +1 ≤ Γ ϕ t + P t Qt . (1.36)

By iterating (1.36) recursively, for all t > 0, we see that
\[
\varphi^t \le (\Gamma)^t \varphi^0 + \sum_{k=0}^{t-1} (\Gamma)^{t-k-1} P^k Q^k. \tag{1.37}
\]

Since the spectral radius of Γ is strictly less than 1, it can be concluded from Lemma 1.16 in [52] that ||(Γ )t ||2 ≤ ϑ(δ0 )t and ||(Γ )t −k−1 P k ||2 ≤ ϑ(δ0 )t for some ϑ > 0 and λ < δ0 < 1. Taking the 2-norm on both sides of (1.37) yields
\[
\|\varphi^t\|_2 \le \|(\Gamma)^t\|_2 \|\varphi^0\|_2 + \sum_{k=0}^{t-1} \|(\Gamma)^{t-k-1} P^k\|_2 \|Q^k\|_2 \le \vartheta \|\varphi^0\|_2 (\delta_0)^t + \vartheta (\delta_0)^t \sum_{k=0}^{t-1} \|Q^k\|_2. \tag{1.38}
\]

Furthermore, for all k = 0, . . . , t − 1,
\[
\begin{aligned}
\|Q^k\|_2 &\le \|\nabla F(y^k) - \nabla F(1_n x^*)\|_2 + \|\nabla F(1_n x^*)\|_2 \\
&\le \hat L \|x^k - (R)^\infty x^k\|_2 + \hat L \|(R)^\infty x^k - 1_n x^*\|_2 + \hat L \hat\beta \|x^k - x^{k-1}\|_2 + \|\nabla F(1_n x^*)\|_2 \\
&\le (1 + d_1)(1 + \hat\beta)\hat L \|\varphi^k\|_2 + \|\nabla F(1_n x^*)\|_2,
\end{aligned}
\tag{1.39}
\]
where ∇F (1n x ∗ ) = [∇f1 (x ∗ ), . . . , ∇fn (x ∗ )]T . Thus, by combining (1.38) and (1.39), we deduce that for all t > 0,

\[
\|\varphi^t\|_2 \le \Big( \vartheta \|\varphi^0\|_2 + (1 + d_1)(1 + \hat\beta)\hat L \vartheta \sum_{k=0}^{t-1} \|\varphi^k\|_2 + \vartheta t \|\nabla F(1_n x^*)\|_2 \Big)(\delta_0)^t. \tag{1.40}
\]
Define \(v^t = \sum_{k=0}^{t-1} \|\varphi^k\|_2\), ν22 = (1 + d1 )(1 + β̂)L̂ϑ, and p t = ϑ||ϕ 0 ||2 + ϑt||∇F (1n x ∗ )||2 ; then (1.40) implies that
\[
\|\varphi^t\|_2 = v^{t+1} - v^t \le (\nu_{22} v^t + p^t)(\delta_0)^t, \tag{1.41}
\]
which is equivalent to
\[
v^{t+1} \le (1 + \nu_{22}(\delta_0)^t) v^t + p^t (\delta_0)^t. \tag{1.42}
\]


Since λ < δ0 < 1, it holds that \(\sum_{t=0}^{\infty} \nu_{22}(\delta_0)^t < \infty\) and \(\sum_{t=0}^{\infty} p^t(\delta_0)^t < \infty\). Thus, all the conditions in Lemma 1.16 are satisfied (with ut = 0 for all t ≥ 0), and we conclude that v t converges and is therefore bounded. It then follows from (1.41) that limt→∞ ||ϕ t ||2 /(δ1 )t ≤ limt→∞ (ν22 v t + p t )(δ0 )t /(δ1 )t = 0 for all δ0 < δ1 < 1, and thus there exist a positive constant m and an arbitrarily small constant τ > 0 such that for all t ≥ 0,

\[
\|x^t - 1_n x^*\|_2 \le \|x^t - (R)^\infty x^t\|_2 + \|(R)^\infty x^t - 1_n x^*\|_2 \le (1 + d_1)\|\varphi^t\|_2 \le m(\delta_0 + \tau)^t, \tag{1.43}
\]
where we define δ = δ0 + τ . This fulfills the proof. □


Remark 1.18 Theorem 1.17 establishes that D-DNGT converges linearly to the global optimal solution provided that the largest step-size, α̂, and the maximum momentum coefficient, β̂, obey the upper bounds given in Theorem 1.14. Many existing gradient tracking methods [33, 35] and our previous works [43, 44] also adopted non-uniform step-sizes and converged at a linear rate. Compared with [33, 35, 43, 44], this chapter offers three advantages. First, D-DNGT incorporates gradient tracking into the distributed Nesterov method and adds two types of momentum terms, which improves information exchange and ensures fast convergence. Second, since the bounds on the largest step-size, α̂, in Theorem 1.14 depend only on the network topology and the cost functions, each node can choose from a relatively wider range of step-sizes. This is in contrast with earlier work on non-uniform step-sizes within the gradient tracking framework [33, 35, 43, 44], whose bounds depend on the heterogeneity of the step-sizes (||(In − W )α||2 /||W α||2 , with W the weight matrix, in [35], and α̂/α̃, α̃ = mini∈V {αi }, in [33, 43, 44]). Moreover, the analysis showed that the algorithms in [33, 35, 43, 44] converge linearly to the optimal solution only if both the heterogeneity and the largest step-size are small; since the largest step-size obeys a bound that is a function of the heterogeneity, there is a trade-off between the tolerable heterogeneity and the achievable largest step-size. Finally, the bounds on the non-uniform step-sizes in this chapter allow some (but not all) of the step-sizes to be zero, provided the largest step-size is positive and sufficiently small.

1.4.4 Discussion

The idea of D-DNGT can be applied to other directed distributed gradient tracking methods to relax the requirement that the weight matrices be only column-stochastic [41, 42] or both row- and column-stochastic [45, 46]. Next, three possible Nesterov-like optimization algorithms are presented. In this chapter, we only highlight them and verify their feasibility by means of simulations; a rigorous theoretical analysis of these three algorithms is left for future work.
(a) D-DNGT with Only Column-Stochastic Weights [41, 42] Here, we present an extended algorithm, named D-DNGT-C, obtained by applying the momentum terms to ADD-OPT [41]/Push-DIGing [42] (whose weight matrices are only column-stochastic). Specifically, the updates of D-DNGT-C are stated as follows:
\[
\begin{cases}
x_i^{t+1} = \sum_{j=1}^{n} c_{ij} h_j^{t} + \beta_i (x_i^{t} - x_i^{t-1}) - \alpha_i z_i^{t}, \\[4pt]
h_i^{t+1} = x_i^{t+1} + \beta_i (x_i^{t+1} - x_i^{t}), \\[4pt]
s_i^{t+1} = \sum_{j=1}^{n} c_{ij} s_j^{t}, \quad y_i^{t+1} = h_i^{t+1} / s_i^{t+1}, \\[4pt]
z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^{t} + \nabla f_i(y_i^{t+1}) - \nabla f_i(y_i^{t}),
\end{cases}
\tag{1.44}
\]
initialized with xi0 = h0i = yi0 ∈ R, si0 = 1, and zi0 = ∇fi (yi0 ), where as before C = [cij ] ∈ Rn×n is column-stochastic, and αi > 0 and βi ≥ 0 represent the local step-size and the momentum coefficient of node i. Unlike ADD-OPT [41]/Push-DIGing [42], D-DNGT-C adds, by means of column-stochastic weights, two types of momentum terms (heavy-ball momentum and Nesterov momentum) to ensure that nodes acquire more information from in-neighbors in the network and thereby achieve fast convergence.
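A hedged sketch of the D-DNGT-C updates (1.44) on hypothetical scalar quadratics follows; the column-stochastic weights, step-size, momentum coefficient, and iteration count are all illustrative assumptions, and convergence for these particular values is only demonstrated empirically here.

```python
import numpy as np

n = 4
b = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical local data
grad = lambda i, u: u - b[i]                # f_i(u) = (u - b_i)^2 / 2

A = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)   # strongly connected digraph
C = A / A.sum(axis=0, keepdims=True)        # column-stochastic (needs out-degrees)

alpha, beta = np.full(n, 0.02), np.full(n, 0.05)
x_prev = x = h = y = np.zeros(n)
s = np.ones(n)                              # s_i^0 = 1
z = np.array([grad(i, y[i]) for i in range(n)])

for _ in range(3000):
    x_new = C @ h + beta * (x - x_prev) - alpha * z
    h = x_new + beta * (x_new - x)          # Nesterov extrapolation
    s = C @ s                               # push-sum scaling sequence
    y_new = h / s                           # de-biased estimate
    z = C @ z + np.array([grad(i, y_new[i]) - grad(i, y[i]) for i in range(n)])
    x_prev, x, y = x, x_new, y_new

print(y)   # the scaled estimates y_i approach the minimizer mean(b) = 2.5
```

Note that it is the push-sum-corrected variable y, not x itself, that converges to the optimizer, which is why the division by s is needed.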

(b) D-DNGT with Both Row- and Column-Stochastic Weights [45, 46] Consider that D-DNGT with both row- and column-stochastic weights does not need the eigenvector estimation in D-DNGT (1.6) or D-DNGT-C (1.44). Hence, an extended algorithm (named D-DNGT-RC), which utilizes both row-stochastic (R = [rij ] ∈ Rn×n ) and column-stochastic (C = [cij ] ∈ Rn×n ) weights, is presented as follows:
\[
\begin{cases}
x_i^{t+1} = \sum_{j=1}^{n} r_{ij} y_j^{t} + \beta_i (x_i^{t} - x_i^{t-1}) - \alpha_i z_i^{t}, \\[4pt]
y_i^{t+1} = x_i^{t+1} + \beta_i (x_i^{t+1} - x_i^{t}), \\[4pt]
z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^{t} + \nabla f_i(y_i^{t+1}) - \nabla f_i(y_i^{t}),
\end{cases}
\tag{1.45}
\]
where xi0 = yi0 ∈ R and zi0 = ∇fi (yi0 ), and αi > 0 and βi ≥ 0 represent the local step-size and the momentum coefficient of node i. D-DNGT-RC not only avoids the additional iterations of eigenvector learning but also guarantees that nodes obtain more information from in-neighbors, which may yield faster convergence than [45] and [46].
(c) D-DNGT-RC with Interaction Delays [49] Note that nodes confront arbitrary but uniformly bounded interaction delays in the process of gaining information from in-neighbors [49]. Specifically, to solve problem (1.1), we denote by ςijt 5 an arbitrary, a priori unknown delay induced by the interaction link (j, i) at time t ≥ 0. Then, the updates of D-DNGT-RC with delays (D-DNGT-RC-D) become
\[
\begin{cases}
x_i^{t+1} = \sum_{j=1}^{n} r_{ij} y_j^{t-\varsigma_{ij}^t} + \beta_i (x_i^{t} - x_i^{t-1}) - \alpha_i z_i^{t}, \\[4pt]
y_i^{t+1} = x_i^{t+1} + \beta_i (x_i^{t+1} - x_i^{t}), \\[4pt]
z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^{t-\varsigma_{ij}^t} + \nabla f_i(y_i^{t+1}) - \nabla f_i(y_i^{t}).
\end{cases}
\tag{1.46}
\]
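The delayed terms in (1.46) can be implemented with a short buffer of past neighbor values. The sketch below only illustrates this bounded-delay buffering pattern, with random delays and a placeholder mixing step standing in for the actual D-DNGT-RC-D recursion; everything here is a hypothetical illustration.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
n, bound = 4, 3                              # `bound` plays the role of the delay cap
history = deque(maxlen=bound + 1)            # history[d] holds y^{t-d}, newest first

y = np.zeros(n)
for t in range(20):
    history.appendleft(y.copy())
    # Each link (j, i) draws a delay in {0, ..., min(t, bound)}; self-loops
    # are delay-free, matching the footnote on the delays.
    d = rng.integers(0, min(t, bound) + 1, size=(n, n))
    np.fill_diagonal(d, 0)
    y_delayed = np.array([[history[d[i, j]][j] for j in range(n)]
                          for i in range(n)])
    # Placeholder for the actual (1.46) update: uniform mixing plus a drift.
    y = y_delayed.mean(axis=1) + 1.0

print(y)
```

The buffer length bound + 1 suffices because the delays are uniformly bounded.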

Remark 1.19 A time-varying implementation of D-DNGT is straightforward under a broadcast-based mechanism or over a random network, as in the related work in [46]. An asynchronous scheme can also follow the methods in [19, 26, 35, 37]. In addition, it follows from [61] that when D-DNGT is employed to optimize more complex problems, such as training deep neural networks, the gradient is usually replaced with a stochastic gradient, which yields a stochastic version of D-DNGT.

5 For all t > 0, the interaction delays ςijt are assumed to be uniformly bounded; that is, there exists some finite ς̂ > 0 such that 0 ≤ ςijt ≤ ς̂ . In addition, each node can access its own estimate without delay, i.e., ςiit = 0, ∀i ∈ V and t > 0.

1.5 Numerical Examples

This section provides a variety of numerical experiments to illustrate the application and performance of D-DNGT. The experiments are divided into three parts. (i) Without momentum terms: the convergence of D-DNGT is first compared with that of methods without momentum terms, including FROST [52], AB [45], and ADD-OPT/Push-DIGing [41, 42]. (ii) With momentum terms: the convergence of D-DNGT is then compared with that of methods with momentum terms, including ABm [48], ABN [50], and FROZEN [50]. (iii) Extensions of D-DNGT: in the final scenario, we compare the extensions of D-DNGT (D-DNGT-C, D-DNGT-RC, and D-DNGT-RC-D) with their closely related methods (ADD-OPT/Push-DIGing [41, 42], AB [45], and AB with delays (AB-D) [49]). For the comparisons with delays, let ς̂ = 6 be the upper bound of the time-varying delays; at each time t > 0, the interaction delays imposed on each interaction link are randomly and uniformly selected from {0, 1, . . . , 6}. For parts (i)–(iii), we plot the residual \(\log_{10}\big(\sum_{i=1}^{n} \|x_i^t - x^*\|_2 / \|x_i^0 - x^*\|_2\big)\) (t is the discrete-time iteration) for comparison.
In the experiment, we consider a distributed binary classification problem using regularized logistic regression [48]. Specifically, the application of D-DNGT to the distributed logistic regression problem is considered over a directed network:
\[
\min f(x, v) = \sum_{i=1}^{n} f_i(x, v),
\]
where x ∈ Rp and v ∈ R are the optimization variables for learning the separating hyperplane. Here, the local cost function fi is given by
\[
f_i(x, v) = \frac{\omega}{2}\big(\|x\|_2^2 + v^2\big) + \sum_{j=1}^{m_i} \ln\Big(1 + \exp\big(-\big(c_{ij}^T x + v\big) b_{ij}\big)\Big),
\]
where each node i ∈ {1, . . . , n} privately knows mi training examples; (cij , bij ) ∈ Rp × {−1, +1}, where cij is the p-dimensional feature vector of the j -th training sample at the i-th node, drawn from a Gaussian distribution with zero mean, and bij is the corresponding label generated according to a Bernoulli distribution. In terms of parameter design, we choose n = 10, mi = 10 for all i, and p = 2. The network topology, a directed and strongly connected network, is depicted in Fig. 1.1. In addition, we utilize a simple uniform weighting strategy, rij = 1/|Niin |, ∀i, to construct the row-stochastic weights.
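The local gradient used by each node in this experiment can be sketched as follows; the synthetic data generation and the regularization weight ω = 0.1 are illustrative stand-ins for unspecified experimental details.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, omega = 10, 10, 2, 0.1             # omega is a hypothetical choice

C_feat = rng.normal(size=(n, m, p))          # c_ij: zero-mean Gaussian features
B = rng.choice([-1.0, 1.0], size=(n, m))     # b_ij: random +/-1 labels

def grad_fi(i, x, v):
    """Gradient of f_i(x, v) = omega/2 (||x||^2 + v^2)
       + sum_j ln(1 + exp(-(c_ij^T x + v) b_ij)) with respect to (x, v)."""
    g_x, g_v = omega * x, omega * v
    for j in range(m):
        margin = (C_feat[i, j] @ x + v) * B[i, j]
        sigma = 1.0 / (1.0 + np.exp(margin)) # = exp(-margin) / (1 + exp(-margin))
        g_x = g_x - sigma * B[i, j] * C_feat[i, j]
        g_v = g_v - sigma * B[i, j]
    return g_x, g_v

gx, gv = grad_fi(0, np.zeros(p), 0.0)
print(gx.shape, gv)                          # a length-p gradient and a scalar
```

Each node would feed this local gradient into the D-DNGT update, with the separate components for x and v stacked into one decision variable.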
The simulation results are plotted in Figs. 1.2, 1.3 and 1.4. Figure 1.2 indicates that D-DNGT with momentum terms accelerates convergence in comparison with the applicable algorithms without momentum terms. Figure 1.3 shows that

Fig. 1.1 A directed and strongly connected network


Fig. 1.2 Performance comparisons between D-DNGT and the methods without momentum terms

D-DNGT with two momentum terms (heavy-ball momentum [48] and Nesterov momentum [50, 54, 55]) improves convergence compared with the applicable algorithms with a single momentum term. We note that although the eigenvector learning in D-DNGT may slow down convergence, D-DNGT is more suitable for broadcast-based protocols than the other optimization methods (AB, ADD-OPT/Push-DIGing, ABm, and ABN ) because it only requires row-stochastic weights. Finally, it is concluded from Fig. 1.4 that the algorithms with momentum terms successfully accelerate convergence regardless of whether the interaction links undergo delays, and whether the weight matrices are only column-stochastic or both row- and column-stochastic.


Fig. 1.3 Performance comparisons between D-DNGT and the methods with momentum terms


Fig. 1.4 Performance comparisons between the extensions of D-DNGT and their closely related
methods

1.6 Conclusion

In this chapter, we have considered a general distributed optimization problem in which nodes aim to collectively minimize the average of all local cost functions. To solve this problem, a generalized directed distributed Nesterov-like gradient tracking algorithm, named D-DNGT, has been proposed and analyzed in detail. D-DNGT extends the distributed gradient tracking method with heavy-ball momentum and Nesterov momentum, allows nodes to select non-uniform step-sizes in a distributed manner, and only requires the weight matrix to be row-stochastic, which makes it suitable for a directed network. In particular, the directed network is assumed to be strongly connected. When the largest step-size and the maximum momentum coefficient are subject to certain upper bounds (which rely only on the network topology and the cost functions), we have established the global linear convergence rate of D-DNGT, at the expense of eigenvector learning, for strongly convex and smooth cost functions. In addition, some extensions of D-DNGT have also been explored, and simulation results further verified our theoretical analysis. However, D-DNGT is not flawless, and more in-depth research is needed to perfect it. For example, D-DNGT is not yet suitable for dynamic networks, stochastic noises, or networks with random link failures and quantization effects. As future work, it would be valuable to extend D-DNGT to deal with such settings, i.e., time-varying directed networks, stochastic noises, and networks with random link failures and quantization effects. Moreover, more complex optimization problems (asynchronous interaction, inequality constraints, etc.) are also worthy of study.

References

1. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence
of communication delays. IEEE Trans. Syst., Man, Cybern., Syst. 47(5), 717–728 (2017)
2. J. Chen, A. Sayed, Diffusion adaptation strategies for distributed optimization and learning
over networks. IEEE Trans. Signal Process. 60(8), 4289–4305 (2012)
3. K. Li, Q. Liu, S. Yang, J. Cao, G. Lu, Cooperative optimization of dual multiagent system for
optimal resource allocation. IEEE Trans. Syst., Man, Cybern., Syst. 50(11), 4676–4687 (2020)
4. S. Wang, C. Li, Distributed robust optimization in networked system. IEEE Trans. Cybern.
47(8), 2321–2333 (2017)
5. X. Dong, G. Hu, Time-varying formation tracking for linear multi-agent systems with multiple
leaders. IEEE Trans. Autom. Control 62(7), 3658–3664 (2017)
6. X. Dong, G. Hu, Time-varying formation control for general linear multi-agent systems with
switching directed topologies. Automatica 73, 47–55 (2016)
7. C. Shi, G. Yang, Augmented Lagrange algorithms for distributed optimization over multi-agent
networks via edge-based method. Automatica 94, 55–62 (2018)
8. S. Zhu, C. Chen, W. Li, B. Yang, X. Guan, Distributed state estimation of sensor-network
systems subject to Markovian channel switching with application to a chemical process. IEEE
Trans. Syst. Man Cybern. Syst. 48(6), 864–874 (2018)

9. D. Jakovetic, A unification and generalization of exact distributed first order methods. IEEE
Trans. Signal Inform. Process. Over Netw. 5(1), 31–46 (2019)
10. Z. Wu, Z. Li, Z. Ding, Z. Li, Distributed continuous-time optimization with scalable adaptive
event-based mechanisms. IEEE Trans. Syst. Man Cybern. Syst. 50(9), 3252–3257 (2020)
11. K. Scaman, F. Bach, S. Bubeck, Y. Lee, L. Massoulie, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning (PMLR), vol. 70 (2017), pp. 3027–3036
12. X. He, T. Huang, J. Yu, C. Li, Y. Zhang, A continuous-time algorithm for distributed
optimization based on multiagent networks. IEEE Trans. Syst. Man Cybern. Syst. 49(12),
2700–2709 (2019)
13. Y. Zhu, W. Ren, W. Yu, G. Wen, Distributed resource allocation over directed graphs via
continuous-time algorithms. IEEE Trans. Syst. Man Cybern. Syst. 51(2), 1097–1106 (2021)
14. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
15. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent
networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
16. H. Li, S. Liu, Y. Soh, L. Xie, Event-triggered communication and data rate constraint for
distributed optimization of multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 48(11),
1908–1919 (2018)
17. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
18. I. Matei, J. Baras, Performance evaluation of the consensus-based distributed subgradient
method under random communication topologies. IEEE J. Sel. Topics Signal Process. 5(4),
754–771 (2011)
19. C. Xi, U. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans.
Autom. Control 62(8), 3986–3992 (2017)
20. D. Yuan, D. Ho, G. Jiang, An adaptive primal-dual subgradient algorithm for online distributed
constrained optimization. IEEE Trans. Cybern. 48(11), 3045–3055 (2018)
21. C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning,
IEEE Trans. Knowl. Data Eng. 30(8), 1440–1453 (2018)
22. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inform. Process. Over Netw. 4(1), 4–17 (2018)
23. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in
decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014)
24. J. Mota, J. Xavier, P. Aguiar, M. Puschel, D-ADMM: a communication-efficient distributed
algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013)
25. H. Terelius, U. Topcu, R. Murray, Decentralized multi-agent optimization via dual decomposi-
tion. IFAC Proc. Volumes 44(1), 11245–11251 (2011)
26. E. Wei, A. Ozdaglar, On the O(1/k) convergence of asynchronous distributed alternating
direction method of multipliers, in 2013 IEEE Global Conference on Signal and Information
Processing (2013). https://doi.org/10.1109/GlobalSIP.2013.6736937
27. M. Hong, T. Chang, Stochastic proximal gradient consensus over random networks. IEEE
Trans. Signal Process. 65(11), 2933–2948 (2017)
28. H. Xiao, Y. Yu, S. Devadas, On privacy-preserving decentralized optimization through
alternating direction method of multipliers (2019). Preprint arXiv:1902.06101
29. A. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in 2012 50th Annual
Allerton Conference on Communication, Control, and Computing (Allerton) (2012). https://doi.org/10.1109/Allerton.2012.6483273
30. X. Dong, Y. Hua, Y. Zhou, Z. Ren, Y. Zhong, Theory and experiment on formation-containment
control of multiple multirotor unmanned aerial vehicle systems. IEEE Trans. Autom. Sci. Eng.
16(1), 229–240 (2019)
31. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
28 1 Accelerated Algorithms for Distributed Convex Optimization

32. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans.
Control Netw. Syst. 5(3), 1245–1260 (2018)
33. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization
with uncoordinated step-sizes, in 2017 American Control Conference (ACC) (2017). https://doi.org/10.23919/ACC.2017.7963560
34. M. Maros, J. Jalden, A geometrically converging dual method for distributed optimization over
time-varying graphs. IEEE Trans. Autom. Control 66(6), 2465–2479 (2021)
35. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over
stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018)
36. S. Pu, A. Nedic, Distributed stochastic gradient tracking methods. Math. Program. 187(1),
409–457 (2021)
37. Y. Tian, Y. Sun, B. Du, G. Scutari, ASY-SONATA: Achieving geometric convergence for
distributed asynchronous optimization, in 2018 56th Annual Allerton Conference on Communi-
cation, Control, and Computing (Allerton) (2018). https://doi.org/10.1109/ALLERTON.2018.8636055
38. M. Maros, J. Jalden, Panda: A dual linearly converging method for distributed optimization
over time-varying undirected graphs, in 2018 IEEE Conference on Decision and Control
(CDC) (2018). https://doi.org/10.1109/CDC.2018.8619626
39. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
40. C. Xi, U. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans.
Autom. Control 62(10), 4980–4993 (2017)
41. C. Xi, R. Xin, U. Khan, ADD-OPT: accelerated distributed directed optimization. IEEE Trans.
Autom. Control 63(5), 1329–1339 (2018)
42. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
43. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
44. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Control Theory
Appl. 13(17), 2800–2810 (2019)
45. R. Xin, U. Khan, A linear algorithm for optimization over directed graphs with geometric
convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018)
46. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
47. F. Saadatniaki, R. Xin, U. Khan, Decentralized optimization over time-varying directed graphs
with row and column-stochastic matrices. IEEE Trans. Autom. Control 65(11), 4769–4780
(2020)
48. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
49. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under
time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
50. R. Xin, D. Jakovetic, U. Khan, Distributed Nesterov gradient methods over arbitrary graphs.
IEEE Signal Process. Lett. 26(8), 1247–1251 (2019)
51. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with
row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018)
52. R. Xin, C. Xi, U. Khan, FROST-Fast row-stochastic optimization with uncoordinated step-
sizes. EURASIP J. Adv. Signal Process. 2019(1), 1–14 (2019)
53. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a
general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–
248 (2019)
54. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control
65(6), 2566–2581 (2020)

55. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom.
Control 59(5), 1131–1146 (2014)
56. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained
optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst.
50(7), 2612–2622 (2020)
57. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer Science
& Business Media, Berlin, 2013)
58. H. Wang, X. Liao, T. Huang, C. Li, Cooperative distributed optimization in multiagent
networks with delays. IEEE Trans. Syst. Man Cybern. Syst. 45(2), 363–369 (2015)
59. A. Defazio, On the curved geometry of accelerated optimization (2018). Preprint
arXiv:1812.04634
60. R. Horn, C. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
61. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for
convex and non-convex optimization (2016). Preprint arXiv:1604.03257
Chapter 2
Projection Algorithms for Distributed
Stochastic Optimization

Abstract This chapter introduces and solves the problem of composite constrained
convex optimization with a sum of smooth convex functions and non-smooth
regularization terms (ℓ1-norm) subject to locally general constraints. Each of the
smooth objective functions is further viewed as the average of several constituent
functions, a structure motivated by modern large-scale information processing
problems in machine learning (the samples of a training dataset are randomly
distributed across multiple computing nodes). We present a novel
computation-efficient distributed stochastic gradient algorithm that combines the
variance-reduction methodology with a distributed stochastic gradient projection
method using a constant step-size to solve the problem in a distributed manner.
Theoretical analysis shows that the proposed algorithm finds the exact optimal
solution in expectation when each (smooth) constituent function is strongly convex,
provided the constant step-size is less than an explicitly calculated upper bound.
Compared with existing distributed methods, the proposed algorithm not only has a
low computation cost in terms of the total number of local gradient evaluations but
is also suited to addressing general constrained optimization problems. Finally,
numerical experiments are provided to demonstrate the proposed algorithm's
attractive performance.

Keywords Composite constrained optimization · Distributed stochastic
algorithm · Computation-efficient · Variance reduction · Non-smooth term

2.1 Introduction

Given the limited computational and storage capacity of nodes, it has become
unrealistic to deal with large-scale tasks centrally on a single compute node
[1]. Distributed optimization is a classic topic [2–9] yet has recently aroused
considerable interest in many emerging applications (large-scale tasks), such as
parameter estimation [3, 4], network attacks [5], machine learning [6], IoT networks
[7], and some others. At least two facts [8] have contributed to this resurgence
of interest: (a) recent developments in high-performance computing platforms

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_2

have enabled us to employ distributed resources to contribute significantly to
computational efficiency and (b) the size of datasets often far exceeds the storage
capacity of a single machine, requiring coordination across multiple machines.
In distributed optimization (without centralized coordination), each node is only
allowed to interact with its neighbors through a locally connected network. In
general, designing effective distributed algorithms for a wide range of optimization
problems is more challenging [8–10].
Distributed optimization methods that are only dependent on gradient informa-
tion have become a core interest in processing large-scale tasks due to their excellent
scalability. Many known methods, including distributed gradient descent (DGD)
[11, 12], dual averaging [13], EXTRA [14, 15], ADMM [16, 17], adaptive diffusion
[18], gradient tracking [19–21], and methods for constrained optimization problems
[22], have been studied in the literature. Moreover, quite a few efficient methods
for dealing with various practical problems such as complex networks [23], privacy
security [24], machine learning [25], online optimization [26], and power system
operation [27, 28] have been emerged.
More recently, significant effort has been made to design distributed methods
to solve the problem of composite non-smooth optimization. For the composite
optimization problem with global non-smooth terms, a fast distributed proximal
gradient method that adopts Nesterov’s acceleration is proposed in [29], which
achieves accelerated convergence. Other work focuses on the situation where
each node has a local non-smooth term that may be different from other nodes.
For example, a proximal distributed linearized ADMM (DL-ADMM) method is
provided in [30] to resolve such composite problems, and the convergence is
guaranteed. By extending EXTRA [14] to deal with local non-smooth terms, PG-
EXTRA is proposed in [31] and an improved convergence is established. In addition,
in comparison with PG-EXTRA [31], the NIDS method proposed in [32] is able
to employ larger step-sizes while possessing the same convergence. In terms of
convergence rate, however, a clear gap remains between the above distributed
proximal gradient methods and their centralized counterparts. Based on this, a
linearly convergent proximal gradient method is proposed in [33], which
successfully closes this gap.
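The proximal step these methods rely on has a closed form for the ℓ1-regularizer: the proximal operator of t||·||₁ is elementwise soft-thresholding. The following minimal centralized sketch illustrates this; the function names and data are ours, for illustration only, not taken from the cited works.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_step(x, grad_f, t):
    """One (centralized) proximal gradient step for f(x) + ||x||_1."""
    return prox_l1(x - t * grad_f(x), t)

# Example: minimize 0.5*||x - a||^2 + ||x||_1; the minimizer is prox_l1(a, 1).
a = np.array([3.0, -0.5, 1.5])
x_star = prox_l1(a, 1.0)  # -> [2.0, 0.0, 0.5]
```

Note that x_star is a fixed point of the proximal gradient step, which is the defining property these methods exploit.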
With the advent of the big data era, the amount of data that nodes in the network
need to process is getting larger and more complicated [34]. Therefore, the above
methods can be computationally very demanding due to the requirement that each
iteration of the algorithm needs a full gradient evaluation of local objective functions
[10–19, 21–24, 26, 29–33]. This may make these methods to be practically infeasible
when dealing with large-scale tasks, mainly because the nodes in the network need
to cope with large amounts of various data. In order to avoid extensive computation
and keep computational simplicity, one natural solution is to use the stochastic
gradient (a random subset of the local data for gradient evaluation) to approximate
the true gradient, and thus the distributed stochastic optimization methods have
emerged [35]. Considerable works have been done in investigating the distributed
stochastic optimization methods including distributed stochastic gradient descent
[35], stochastic gradient push [36], stochastic mirror descent [37], and stochastic

gradient tracking [38]. However, in practice, these methods converge slowly due
to the large variance coming from the stochastic gradient and the adoption of a
carefully tuned sequence of decaying step-sizes. To address this deficiency, various
variance-reduction techniques have been leveraged in developing stochastic
gradient descent methods, leading to representative centralized methods
such as S2GD [39], SAG [40], SAGA [41], SVRG [42, 43], and SARAH [44]. The
idea of the variance-reduction technique is to reduce the variance of the stochastic
gradient and thereby substantially improve the convergence.
Motivated by the centralized variance reduced methods, the distributed variance
reduced methods have been extensively studied, which outperform their centralized
counterparts in handling large-scale tasks. Of relevance to our work are the recent
developments in [45] and [46]. The distributed stochastic averaging gradient method
(DSA) proposed in [45] incorporates the variance-reduction technique in SAGA
[41] to the algorithm design ideas of EXTRA [14], which not only obtains the
expected linear convergence of distributed stochastic optimization for the first
time but also performs better than the previous works [14, 35] in dealing with
machine learning problems. Similar works also involve the DSBA [47], diffusion-
AVRG [48], ADFS [49], SAL-Edge [50], GT-SAGA/GT-SVRG [2, 51, 52], and
Network-DANE [8], utilizing various strategies. However, to the best of the authors'
knowledge, no existing methods focus on solving general composite constrained
convex optimization problems. Recently, the distributed neurodynamic-based
consensus algorithm in [46] was developed to solve the problem of a sum of smooth
convex functions and ℓ1-norms subject to locally general
constraints (linear equality, convex inequality, and bounded constraints), which
generalizes the work in [53] to the case where the objective function and the
constraint conditions are wider. In particular, based on the Lyapunov stability theory,
the method in [46] can achieve consensus at the global optimal solution with
constant step-size. The work in [46] is insightful, but unfortunately, the algorithm
does not take into account the high computational cost of evaluating the full gradient
of the local objective function at each iteration.
In this chapter, we are concerned with solving the composite constrained convex
optimization problem with a sum of smooth convex functions and non-smooth
regularization terms (ℓ1-norm), where each smooth objective function is further
composed of the average of several constituent functions and the locally general
constraints are constituted by linear equality, convex inequality, and bounded
constraints. To this end, a computation-efficient distributed stochastic gradient
algorithm is proposed, which is adaptable and facilitates real-world applications.
In general, the novelties of the present work are summarized as
follows:
(i) We propose and analyze a novel computation-efficient distributed stochastic
gradient algorithm by leveraging the variance-reduction technique and the
distributed stochastic gradient projection method with constant step-size. In
contrast with most existing distributed methods [29–33, 45, 47–51, 53], the

proposed algorithm is capable of solving a class of composite non-smooth


optimization problems subject to the locally general constraints.
(ii) The proposed algorithm outperforms the existing distributed methods [29–
33, 46, 53] in terms of the total number of local gradient evaluations. In
particular, at each iteration, the proposed algorithm only evaluates the gradient
of one randomly selected constituent function and employs the unbiased
stochastic average gradient (obtained by the average of all most recent stochas-
tic gradients) to estimate the local gradients. Thus, the proposed algorithm
highly reduces the expense of full gradient evaluations.
(iii) If the constant step-size is less than an explicitly estimated upper bound, the
proposed algorithm is proven to converge to the exact optimal solution in
expectation when each constituent function is smooth and strongly convex.
In the unconstrained case, we also propose a distributed stochastic proximal
gradient algorithm by using the variance-reduction technique and study its
convergence rate.

2.2 Preliminaries

2.2.1 Notation

If not particularly stated, the vectors mentioned in this chapter are column vectors.
Let R, Rn , and Rm×n denote the set of real numbers, n-dimensional real column
vectors, and m × n real matrices, respectively. The n × n identity matrix is denoted
as In , and two column vectors of all ones and all zeros are denoted as 1 and 0
(appropriate dimensional), respectively. A quantity (probably a vector) of node i
is indexed by a subscript i; e.g., let xik be the estimate of node i at time k. We
use χmax (A) and χmin (A) to represent the largest and the smallest eigenvalues of
a real symmetric matrix A, respectively. We let the symbols x T and AT denote
the transposes of a vector x and a matrix A. The Euclidean norm (for vectors) and
the ℓ1-norm are denoted as ||·|| and ||·||_1, respectively. We let ||x||_A = √(x^T A x),
where the matrix A ∈ R^{n×n} is positive semi-definite. The Kronecker product
and the Cartesian product are represented by the symbols ⊗ and ∏, respectively.
Given a random estimator x, the probability and expectation are represented by P[x]
and E[x], respectively. We utilize Z = diag{x} to represent the diagonal matrix of
the vector x = [x_1, x_2, . . . , x_n]^T, which satisfies z_ii = x_i, ∀i = 1, . . . , n, and
z_ij = 0, ∀i ≠ j. Denote (·)_+ = max{0, ·}.
For a set Ω ⊆ R^d, the projection of a vector x ∈ R^d onto Ω is denoted by
P_Ω(x), i.e., P_Ω(x) = arg min_{y∈Ω} ||y − x||_2. Notice that this projection always exists
and is unique if Ω is nonempty, closed, and convex [53]. Moreover, let Ω be a
nonempty closed convex set; then the projection operator P_Ω(·) has the following
properties: (a) (y − P_Ω(y))^T (P_Ω(y) − x) ≥ 0, for any x ∈ Ω and y ∈ R^d, and (b)
||P_Ω(y) − P_Ω(x)|| ≤ ||y − x||, for any x, y ∈ R^d.
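When Ω is a box, the projection simply clips each coordinate, and properties (a) and (b) can be checked numerically. The following sketch uses an assumed box constraint [−1, 1]^d chosen only for illustration:

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d (a closed convex set)."""
    return np.clip(x, lo, hi)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
px, py = project_box(x, -1.0, 1.0), project_box(y, -1.0, 1.0)

# Property (a): (y - P(y))^T (P(y) - x) >= 0 for any x in Omega (here x = 0).
x_in = np.zeros(5)
assert (y - py) @ (py - x_in) >= -1e-12

# Property (b): the projection operator is nonexpansive.
assert np.linalg.norm(px - py) <= np.linalg.norm(x - y) + 1e-12
```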

For any given vector z ∈ R^d, denote φ(z) as another form of projection of the
vector z, which satisfies φ(z) = [ψ(z_1), . . . , ψ(z_d)]^T ∈ R^d with elements such
that, ∀i = 1, . . . , d,

ψ(z_i) = { 1, if z_i > 0;  [−1, 1], if z_i = 0;  −1, if z_i < 0 }.
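A numerical implementation of φ must select one element of [−1, 1] when a coordinate is exactly zero; using 0 is a common choice, but this selection is our assumption, not fixed by the definition:

```python
import numpy as np

def phi(z):
    """Elementwise set-valued sign projection; 0 is the chosen element
    of [-1, 1] when a coordinate is exactly zero (np.sign returns 0 there)."""
    return np.sign(z)

z = np.array([0.3, 0.0, -2.0])
# phi(z) -> [1.0, 0.0, -1.0], and every component lies in [-1, 1]
```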

2.2.2 Model of Optimization Problem

Consider the composite constrained convex optimization problem in the following
form:

  min_{x̂ ∈ ∩_{i=1}^n Ω_i}  J(x̂) = Σ_{i=1}^n ( f_i(x̂) + ||P_i x̂ − q_i||_1 ),

  s.t.  B_i x̂ = c_i,  D_i x̂ ≤ s_i,  i = 1, . . . , n,                  (2.1)

where x̂ ∈ R^d is the optimization estimator, f_i(x̂) is the local objective function of
node i, and ||P_i x̂ − q_i||_1 is a non-smooth ℓ1-regularization term of node i; P_i ∈
R^{m_i×d} (m_i ≥ 0), q_i ∈ R^{m_i}, B_i ∈ R^{w_i×d} (0 ≤ w_i < d) is of full row rank, c_i ∈ R^{w_i},
D_i ∈ R^{a_i×d} (a_i ≥ 0), s_i ∈ R^{a_i}, and Ω_i ⊆ R^d is a nonempty and closed convex set.
Moreover, each f_i(x̂) is represented as the average of e_i constituent functions, that
is,

  f_i(x̂) = (1/e_i) Σ_{j=1}^{e_i} f_{i,j}(x̂),  i = 1, . . . , n.        (2.2)

In addition, the main results of this chapter are based on the following assump-
tions.

Assumption 2.1 ([46]) The network G corresponding to the set of nodes is undi-
rected and connected.
Assumption 2.2 ([45]) Each local constituent function fi,j , i ∈ V, j ∈ {1, . . . , ei },
is ν-smooth and μ-strongly convex, where ν > μ > 0.
Remark 2.1 The formulated problem (2.1) with (2.2) can be frequently found in
machine learning (such as modern large-scale information processing problems,
reinforcement learning problems, etc.) with large-scale training samples randomly
distributed across the multiple computing nodes which focus on collectively training
a model x̂ ∈ Rd utilizing the neighboring nodes’ data. However, performing full
gradient evaluation becomes prohibitively expensive when the local data batch at a
single computing node is very large, i.e., e_i ≫ 1. Thus, designing a
computation-efficient algorithm will have far-reaching implications.
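As a toy illustration of the finite-sum structure in (2.2), consider a local batch of e_i least-squares constituent functions (the data and dimensions below are assumed, purely for illustration): the local objective is their average, and a uniformly sampled constituent gradient is an unbiased estimate of the local batch gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
e_i, d = 20, 3
A = rng.normal(size=(e_i, d))   # one data row per constituent function
b = rng.normal(size=e_i)

def grad_constituent(x, j):
    """Gradient of f_{i,j}(x) = 0.5*(a_j^T x - b_j)^2."""
    return (A[j] @ x - b[j]) * A[j]

def grad_local(x):
    """Gradient of f_i(x) = (1/e_i) * sum_j f_{i,j}(x)."""
    return sum(grad_constituent(x, j) for j in range(e_i)) / e_i

x = rng.normal(size=d)
# Averaging the constituent gradients over all indices recovers the batch
# gradient exactly, so a uniformly random index gives an unbiased estimate.
mean_sampled = np.mean([grad_constituent(x, j) for j in range(e_i)], axis=0)
assert np.allclose(mean_sampled, grad_local(x))
```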

2.2.3 Communication Network

In this chapter, we consider a group of n nodes communicating over an undirected
graph G = {V, E, A} involving the node set V = {1, 2, . . . , n}, the edge set
E ⊆ V × V, and the adjacency matrix A = [aij ] ∈ Rn×n . If (i, j ) ∈ E, it
indicates that node i can directly exchange data with node j , where i or j is viewed
as a neighbor of j or i. The connection weight between nodes i and j in graph
G satisfies a_ij = a_ji > 0 if (i, j) ∈ E and a_ij = a_ji = 0 otherwise. Without
loss of generality, there are no self-connections in the graph, i.e., a_ii = 0 for all i.
The degree of node i ∈ V is d_i = Σ_{j=1, j≠i}^n a_ij, and the degree matrix
D_G = diag{d_1, d_2, . . . , d_n} is a diagonal matrix. The Laplacian matrix of graph G
is represented by LG = DG − A satisfying the symmetric and positive semi-definite
properties if the graph G is undirected. A path is a series of consecutive edges. If
there is a path between any two nodes, then G is said to be connected.
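These graph quantities can be assembled directly from the adjacency matrix, and the stated properties of the Laplacian (symmetry and positive semi-definiteness for undirected graphs) can be verified numerically. A sketch on an assumed three-node path graph:

```python
import numpy as np

# Adjacency matrix of an undirected path graph on 3 nodes (assumed example).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

D = np.diag(A.sum(axis=1))   # degree matrix D_G
L = D - A                    # Laplacian matrix L_G = D_G - A

assert np.allclose(L, L.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(L) >= -1e-12)  # positive semi-definite
assert np.allclose(L @ np.ones(3), 0.0)         # row sums are zero: L*1 = 0
```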

2.3 Algorithm Development

In this section, a reformulation of problem (2.1) is initially presented. Then,
a computation-efficient distributed stochastic gradient algorithm is developed to
solve the reformulated problem.

2.3.1 Problem Reformulation

Define x as a vector that stacks all the local estimators x_i, i ∈ V (i.e., x =
vec[x_1, . . . , x_n] ∈ R^{nd}). Let P, B, and D be the block diagonal matrices of
P_1 to P_n (i.e., P = blkdiag{P_1, . . . , P_n} ∈ R^{m×nd}), B_1 to B_n (i.e., B =
blkdiag{B_1, . . . , B_n} ∈ R^{w×nd}), and D_1 to D_n (i.e., D = blkdiag{D_1, . . . , D_n} ∈
R^{a×nd}), respectively, where m = Σ_{i=1}^n m_i, w = Σ_{i=1}^n w_i, and a = Σ_{i=1}^n a_i.
Denote q = [q_1^T, . . . , q_n^T]^T ∈ R^m, c = [c_1^T, . . . , c_n^T]^T ∈ R^w, s = [s_1^T, . . . , s_n^T]^T ∈
R^a, Ω = ∏_{i=1}^n Ω_i, and L = L_G ⊗ I_d. Under Assumption 2.1, problem (2.1) can be
equivalently reformulated as follows:

  min_{x∈Ω}  J(x) = Σ_{i=1}^n f_i(x_i) + ||Px − q||_1,

  s.t.  Bx = c,  Dx ≤ s,  Lx = 0,                                      (2.3)

where f_i(x_i) = (1/e_i) Σ_{j=1}^{e_i} f_{i,j}(x_i).
Remark 2.2 Note that the equality constraint Lx = 0 in (2.3) is equivalent to the
condition x1 = x2 = . . . = xn if the undirected network is connected. It is worth
highlighting that if x̂ ∗ is the optimal solution to the problem (2.1), then x ∗ = 1n ⊗
x̂ ∗ is the optimal solution to the problem (2.3). According to this observation, the
main motivation of this chapter is constructing a computation-efficient algorithm to
search for the optimal solution to the problem (2.3) over undirected and connected
networks.
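The equivalence noted in Remark 2.2 can be checked numerically for a small connected graph (the path graph and dimension d below are assumed for illustration): Lx = 0 holds exactly when all local blocks x_i agree.

```python
import numpy as np

# Laplacian of a connected path graph on 3 nodes, lifted via Kronecker product.
LG = np.array([[ 1.0, -1.0,  0.0],
               [-1.0,  2.0, -1.0],
               [ 0.0, -1.0,  1.0]])
d = 2
L = np.kron(LG, np.eye(d))   # L = L_G (x) I_d

x_consensus = np.tile([0.7, -1.3], 3)  # x_1 = x_2 = x_3
x_disagree = np.concatenate([[0.7, -1.3], [0.0, 0.0], [0.7, -1.3]])

assert np.allclose(L @ x_consensus, 0.0)     # consensus  =>  Lx = 0
assert not np.allclose(L @ x_disagree, 0.0)  # disagreement =>  Lx != 0
```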
We notice that under Assumption 2.2, problem (2.1) has a unique optimal
solution, denoted as x̂ ∗ . Therefore, problem (2.3) also has a unique optimal solution
x ∗ = 1n ⊗ x̂ ∗ under Assumption 2.1. By utilizing the Lagrangian function, the
necessary and sufficient conditions for the optimality of problem (2.3) are given in
the following lemma, whose proof is directly concluded from [46, 53].
Lemma 2.3 ([46, 53]) Let Assumptions 2.1 and 2.2 hold, and let η > 0 be a given
scalar. Then, x* is an optimal solution to (2.3) if and only if there exist α* ∈ R^m,
β* ∈ R^w, λ* ∈ R^a, and γ* ∈ R^{nd} such that (x*, α*, β*, λ*, γ*) satisfies the
following relations:

  x* = P_Ω[ x* − η( ∇f(x*) + P^T α* + B^T β* + D^T λ* + Lγ* ) ],
  α* = φ(α* + Px* − q),
  Bx* = c,  Lx* = 0,                                                   (2.4)
  λ* = (λ* + Dx* − s)_+,

where P_Ω : R^{nd} → Ω and φ : R^m → [−1, 1]^m are two projection operators.


Remark 2.4 In view of Lemma 2.3, the first condition means that x* is a fixed point
of the mapping induced by a projected gradient descent step restricted to the
bounded constraint set. The second condition is a projection step for the non-smooth
regularization term. The third and fourth conditions indicate primal feasibility, that
is, satisfaction of the two equalities and one inequality in (2.3). Therefore, if the
designed algorithm can reach a solution to (2.4), the optimal solution to (2.3) can be
obtained accordingly.

2.3.2 Computation-Efficient Distributed Stochastic Gradient Algorithm

In this subsection, inspired by the algorithm design ideas of methods in [41, 46,
53], we introduce the proposed algorithm named computation-efficient distributed
stochastic gradient algorithm for solving the problem (2.3). To motivate our
algorithm design, we observe that existing distributed algorithms have suffered

from the high computation cost when evaluating the local full gradients in machine
learning applications. This phenomenon inspires us to explore a method
that could significantly improve computational efficiency. Therefore, the pro-
posed algorithm leverages the variance-reduction technique of SAGA [41] and the
distributed stochastic gradient projection method with constant step-size, which
effectively alleviates the computational burden in locally full gradient evaluation.
The computation-efficient distributed stochastic gradient algorithm at each node
i is formally described in Algorithm 1. To locally implement estimators of Algo-
rithm 1, each node i must own a gradient table that possesses all local constituent
gradients ∇fi,j (ti,j ), where ti,j is the most recent estimator at which the constituent
gradient ∇fi,j was evaluated. At each iteration k ≥ 0, each node i uniformly at
random selects one constituent function, indexed by χ_i^k ∈ {1, . . . , e_i}, from its
own local data batch and then generates the local stochastic gradient g_i^k as in step 4
of Algorithm 1. After generating g_i^k, the entry ∇f_{i,χ_i^k}(t_{i,χ_i^k}^k) is replaced by the
newly computed constituent gradient ∇f_{i,χ_i^k}(x_i^k), while the other entries remain
the same. Then, the projection step of the estimator x_i^k is implemented using the
local stochastic gradient g_i^k, and the updates of the estimators α_i^k, β_i^k, λ_i^k, γ̃_i^k
are implemented subsequently.
Define x^k = vec[x_1^k, . . . , x_n^k], α^k = vec[α_1^k, . . . , α_n^k], β^k = vec[β_1^k, . . . , β_n^k],
λ^k = vec[λ_1^k, . . . , λ_n^k], γ̃^k = vec[γ̃_1^k, . . . , γ̃_n^k], and g^k = vec[g_1^k, . . . , g_n^k]. Let γ̃^k =
Lγ^k, where γ^k = vec[γ_1^k, . . . , γ_n^k] and the initial condition satisfies γ̃^0 = Lγ^0
[46, 53]. Then, we write Algorithm 1 in the following compact matrix form for the
convenience of analysis:


  x^{k+1} = P_Ω[ x^k − η( g^k + P^T φ(α^k + Px^k − q) + L(γ^k + x^k)
                 + B^T(β^k + Bx^k − c) + D^T(λ^k + Dx^k − s)_+ ) ],
  α^{k+1} = φ(α^k + Px^{k+1} − q),
  β^{k+1} = β^k + Bx^{k+1} − c,                                        (2.5)
  λ^{k+1} = (λ^k + Dx^{k+1} − s)_+,
  γ^{k+1} = γ^k + x^{k+1}.

We notice here that the randomness of Algorithm 1 rests with the set of independent
random variables {χ_i^k}_{i∈{1,...,n}}^{k≥0} used for calculating the local stochastic
gradients g_i^k. Based on this, we utilize F^k to denote the entire history of the
dynamical system constructed by {χ_i^k̃}_{i∈{1,...,n}}^{k̃≤k−1}. Therefore, from prior results
in [41, 45, 51], we know that the local stochastic gradient g_i^k calculated in step 4 of
Algorithm 1 is an unbiased estimator of the local batch gradient ∇f_i(x_i^k).
Specifically, when F^k is given, we have

  E[ g_i^k | F^k ] = ∇f_i(x_i^k).                                      (2.6)

Remark 2.5 Algorithm 1 adopts a novel variance-reduction technique of SAGA
[41] for gradient evaluation to avoid extensive computation. Although several

Algorithm 1 Computation-efficient distributed stochastic gradient algorithm at each
node i ∈ V
1: Initialization: Each node i starts with x_i^0 ∈ R^d, α_i^0 ∈ R^{m_i}, β_i^0 ∈ R^{w_i}, λ_i^0 ∈ R^{a_i},
   γ̃_i^0 ∈ R^d, and t_{i,j}^0 ∈ R^d for all j ∈ {1, . . . , e_i}.
2: for k = 0, 1, 2, . . . do
3:   Choose χ_i^k uniformly at random from {1, . . . , e_i};
4:   Calculate the local stochastic gradient as:

       g_i^k = ∇f_{i,χ_i^k}(x_i^k) − ∇f_{i,χ_i^k}(t_{i,χ_i^k}^k) + (1/e_i) Σ_{j=1}^{e_i} ∇f_{i,j}(t_{i,j}^k).

5:   If j = χ_i^k, then store ∇f_{i,j}(t_{i,j}^{k+1}) = ∇f_{i,j}(x_i^k) in the χ_i^k-th gradient table
     position; else ∇f_{i,j}(t_{i,j}^{k+1}) = ∇f_{i,j}(t_{i,j}^k).
6:   Update the estimator x_i^{k+1} according to:

       x_i^{k+1} = P_{Ω_i}[ x_i^k − η( g_i^k + P_i^T φ_i(α_i^k + P_i x_i^k − q_i) + B_i^T(β_i^k + B_i x_i^k − c_i)
                    + D_i^T(λ_i^k + D_i x_i^k − s_i)_+ + γ̃_i^k + Σ_{j=1, j≠i}^n a_ij(x_i^k − x_j^k) ) ],

     where the step-size η > 0.
7:   Update the estimators α_i^{k+1}, β_i^{k+1}, λ_i^{k+1}, and γ̃_i^{k+1} as follows:

       α_i^{k+1} = φ_i(α_i^k + P_i x_i^{k+1} − q_i),
       β_i^{k+1} = β_i^k + B_i x_i^{k+1} − c_i,
       λ_i^{k+1} = (λ_i^k + D_i x_i^{k+1} − s_i)_+,
       γ̃_i^{k+1} = γ̃_i^k + Σ_{j=1, j≠i}^n a_ij(x_i^{k+1} − x_j^{k+1}).

8: end for

existing distributed variance-reduced methods such as DSA [45], GT-SAGA/GT-
SVRG [51], etc., have been investigated for various kinds of problems, it is worth
highlighting that, by employing the variance-reduction technique, Algorithm 1 is
well suited to a class of composite optimization problems that include the ℓ1-norm
subject to locally linear equality, convex inequality, and bounded constraints.
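Steps 3–5 of Algorithm 1 (the SAGA-style gradient table) can be sketched in isolation for a single node. The quadratic constituent functions and dimensions below are assumed stand-ins, not the chapter's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(7)
e_i, d = 10, 4
A = rng.normal(size=(e_i, d))
b = rng.normal(size=e_i)

def grad(x, j):
    """Gradient of the j-th constituent f_{i,j}(x) = 0.5*(a_j^T x - b_j)^2."""
    return (A[j] @ x - b[j]) * A[j]

x = rng.normal(size=d)
# Gradient table: one stored constituent gradient per index j (at t_{i,j}^0 = x).
table = np.stack([grad(x, j) for j in range(e_i)])

def saga_gradient(x, table):
    """Steps 3-5: sample chi, form g_i^k, then refresh one table entry."""
    chi = rng.integers(e_i)
    g = grad(x, chi) - table[chi] + table.mean(axis=0)
    table[chi] = grad(x, chi)   # store the fresh constituent gradient
    return g

g = saga_gradient(x, table)
```

Averaged over the uniform choice of chi (with the table held fixed), the correction terms cancel and g reduces to the full local gradient, which is the unbiasedness property (2.6).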

2.4 Convergence Analysis

In this section, we first introduce several auxiliary results related to the
stochastic gradient g^k, k ≥ 0. Then, we design a Lyapunov function and derive
upper bounds on two parts of the Lyapunov function to support the main results.
Subsequently, we provide theoretical guarantees for the convergence behavior
of the computation-efficient distributed stochastic gradient algorithm described in
Algorithm 1 by using the Lyapunov method. Finally, for some special cases, we
propose a distributed stochastic proximal gradient algorithm by using the variance-
reduction technique and study its convergence rate.

2.4.1 Auxiliary Results

To establish the auxiliary results, we first denote the auxiliary sequence r^k = Σ_{i=1}^n r_i^k ∈ R, where

r_i^k = (1/e_i) Σ_{j=1}^{e_i} [ f_{i,j}(t_{i,j}^k) − f_{i,j}(x̂^*) − ∇f_{i,j}(x̂^*)^T(t_{i,j}^k − x̂^*) ]. (2.7)

Here, note that f_{i,j}(t_{i,j}^k) − f_{i,j}(x̂^*) − ∇f_{i,j}(x̂^*)^T(t_{i,j}^k − x̂^*), ∀k ≥ 0, is non-negative under Assumption 2.2, and consequently r^k, ∀k ≥ 0, is non-negative as well.
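Each summand in (2.7) is a Bregman-type divergence of f_{i,j}, which is non-negative precisely because f_{i,j} is convex. A quick numerical check with a stand-in quadratic (hypothetical data, not from the text):

```python
import numpy as np

# For f(t) = ||Ct - b||^2 the divergence f(t) - f(u) - ∇f(u)^T (t - u)
# equals ||C(t - u)||^2 >= 0; verify non-negativity on random pairs.
rng = np.random.default_rng(6)
C = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
f = lambda t: float(np.sum((C @ t - b) ** 2))
g = lambda t: 2.0 * C.T @ (C @ t - b)   # gradient of f

worst = min(
    f(t) - f(u) - g(u) @ (t - u)
    for t, u in ((rng.standard_normal(2), rng.standard_normal(2)) for _ in range(200))
)
```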
Based on the above, an expected upper bound for the distance between the stochastic gradient g^k and the gradient ∇f(x^*) is introduced in the following part, which is deduced in many existing works [45, 47–51]. To simplify the writing, we denote E[·] = E[·|F^k] as the conditional expectation on F^k in the subsequent analysis.
Lemma 2.6 ([45]) Under Assumption 2.2 and the definition of r k in (2.7), the
sequence generated by Algorithm 1 satisfies the following: ∀k ≥ 0,

E[||g^k − ∇f(x^*)||^2] ≤ 4νr^k + 2(2ν − μ)(f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*)). (2.8)

Notice that if the sequence of iterates x^k tends to the optimal solution x^*, then the auxiliary variables t_{i,j}^k approach x̂^*, which yields that r^k converges to zero. This fact, combined with the result in Lemma 2.6, indicates that the expected value of the distance between the stochastic gradient g^k and the gradient ∇f(x^*) diminishes when x^k tends to x^*.

2.4.2 Supporting Lemmas

In this subsection, we proceed to analyze the performance of Algorithm 1. To this end, we first define a Lyapunov function as

V (x k , α k , β k , λk , γ k , r k )
= V1 (x k ) + η[V2 (α k ) + V3 (β k ) + V4 (λk ) + V5 (γ k )] + br k , (2.9)

where the functions V1(x^k) = ||x^k − x^*||^2, V2(α^k) = ||α^k − α^*||^2, V3(β^k) = ||β^k − β^*||^2, V4(λ^k) = ||λ^k − λ^*||^2, and V5(γ^k) = ||γ^k − γ^*||^2_L for all k ≥ 0; η is the step-size

and b is a positive constant which will be specified in the subsequent analysis. For
simplifying notation, we denote V k = V (x k , α k , β k , λk , γ k , r k ) for all k ≥ 0.
Then, we give two crucial lemmas that involve the upper bounds of two parts of
Lyapunov function to support the main results.
The following lemma, which involves an expected upper bound for r^k, is essential to the subsequent convergence analysis. The concrete proof can be found in [45, 47–51].
Lemma 2.7 ([45]) Consider the sequence {r k } generated by Algorithm 1. Under
Assumptions 2.1 and 2.2, we have the following inequality: ∀k ≥ 0,
 
E[r^{k+1}] ≤ (1 − 1/ê) r^k + (1/ě)(f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*)), (2.10)

where ê = maxi∈V {ei } and ě = mini∈V {ei }.


Recalling the results in Lemma 2.3, we can find that there exist α ∗ ∈ Rm , β ∗ ∈
Rw , λ∗ ∈ Ra , and γ ∗ ∈ Rnd such that (x ∗ , α ∗ , β ∗ , λ∗ , γ ∗ ) satisfies the relations
in (2.4). Based on this, the following key lemmas characterize the dynamics of the
above five functions.
Lemma 2.8 Consider the sequence of iterations x k , α k , β k , λk , and γ k generated
by Algorithm 1. Under Assumptions 2.1 and 2.2, ∀k ≥ 0, the following inequality
holds for V1 (x k ):

E[V1(x^{k+1})] − V1(x^k)
≤ (4ην/a) r^k + (4η(ν − μ)/a)(f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*))
+ 2ηE[(x^{k+1} − x^k)^T(∇f(x^{k+1}) − ∇f(x^k))] − (1 − aη)E[||x^{k+1} − x^k||^2]
− 2ηE[(Px^{k+1} − q)^T(φ(α^k + Px^k − q) − α^*)]
− 2ηE[(Bx^{k+1} − c)^T(β^k − β^* + Bx^k − c)] − 2ηE[(x^{k+1})^T L(γ^k − γ^* + x^k)]
− 2ηE[(Dx^{k+1} − s)^T((λ^k + Dx^k − s)_+ − λ^*)], (2.11)

where a > 0 is a tunable parameter.


Proof Denote ψ_1^k = x^{k+1} = P_Ω[x^k − υ_1^k], where υ_1^k = η(g^k + P^T φ(α^k + Px^k − q) + B^T(β^k + Bx^k − c) + D^T(λ^k + Dx^k − s)_+ + L(γ^k + x^k)).
Recalling the definition of V1 (x k ), one has

V1(x^{k+1}) − V1(x^k)
= ||ψ_1^k − x^*||^2 − ||x^k − x^*||^2
= −||ψ_1^k − x^k||^2 + 2(ψ_1^k − x^k)^T(ψ_1^k − x^*)
= −||ψ_1^k − x^k||^2 + 2(ψ_1^k − x^k + υ_1^k)^T(ψ_1^k − x^*) − 2(υ_1^k)^T(ψ_1^k − x^*). (2.12)

From the projection property, we obtain that (ψ1k − x k + υ1k )T (ψ1k − x ∗ ) =


(PΩ [x k − υ1k ] − (x k − υ1k ))T (PΩ [x k −υ1k ]−x ∗ ) ≤ 0. Combining with (2.12) yields
that

V1 (x k+1 ) − V1 (x k )
≤ −||x k+1 − x k ||2 − 2η(x k+1 − x ∗ )T P T φ(α k + P x k − q)
− 2η(x k+1 − x ∗ )T L(γ k + x k ) − 2η(x k+1 − x ∗ )T D T (λk + Dx k − s)+
− 2η(x k+1 − x ∗ )T B T (β k + Bx k − c) − 2η(x k+1 − x ∗ )T g k . (2.13)

Next, we will analyze each cross-term in (2.13). Notice that

(x^{k+1} − x^*)^T g^k = (x^{k+1} − x^k)^T(g^k − ∇f(x^k)) + (x^{k+1} − x^*)^T ∇f(x^k)
+ (x^k − x^*)^T(g^k − ∇f(x^k))
≥ −(a/2)||x^{k+1} − x^k||^2 − (1/(2a))||g^k − ∇f(x^k)||^2
+ (x^k − x^*)^T(g^k − ∇f(x^k)) + (x^{k+1} − x^*)^T ∇f(x^k), (2.14)

where a > 0 is an adjustable parameter. Recall from the convexity of f(x) that f(x^k) − f(x^*) ≤ (x^k − x^*)^T ∇f(x^k) and f(x^{k+1}) − f(x^k) ≤ (x^{k+1} − x^k)^T ∇f(x^{k+1}). Then, we get

(x k+1 − x ∗ )T ∇f (x k )
= (x k − x ∗ )T ∇f (x k ) + (x k+1 − x k )T ∇f (x k )
≥ f (x k ) − f (x ∗ ) + (x k+1 − x k )T (∇f (x k ) − ∇f (x k+1 ))
+ (x k+1 − x k )T ∇f (x k+1 )
≥ f (x k+1 ) − f (x ∗ ) + (x k+1 − x k )T (∇f (x k ) − ∇f (x k+1 )). (2.15)

Then, substituting (2.14) and (2.15) into (2.13) deduces that

V1(x^{k+1}) − V1(x^k)
≤ −(1 − aη)||x^{k+1} − x^k||^2 − 2η(x^k − x^*)^T(g^k − ∇f(x^k))
+ (η/a)||g^k − ∇f(x^k)||^2 − 2η(f(x^{k+1}) − f(x^*))
− 2η(x^{k+1} − x^*)^T P^T φ(α^k + Px^k − q) − 2η(x^{k+1} − x^*)^T L(γ^k + x^k)
− 2η(x^{k+1} − x^*)^T B^T(β^k + Bx^k − c)
− 2η(x^{k+1} − x^*)^T D^T(λ^k + Dx^k − s)_+
+ 2η(x^{k+1} − x^k)^T(∇f(x^{k+1}) − ∇f(x^k)). (2.16)

Construct a Lagrangian function as follows:

Ψ (x, β, λ, γ ) =f (x) + ||P x − q||1 + β T (Bx − c)


+ λT (Dx − s) + γ T Lx. (2.17)

According to the Saddle-Point Theorem [54], we find that (x ∗ , β ∗ , λ∗ , γ ∗ ) is the


saddle point of Ψ (x, β, λ, γ ) in (2.17). For the saddle point, (x ∗ , β ∗ , λ∗ , γ ∗ ), it
satisfies that Ψ (x ∗ , β, λ, γ ) ≤ Ψ (x ∗ , β ∗ , λ∗ , γ ∗ ) ≤ Ψ (x, β ∗ , λ∗ , γ ∗ ). Thus, with
regard to x, (x ∗ , β ∗ , λ∗ , γ ∗ ) is a minimum point of Ψ (x, β ∗ , λ∗ , γ ∗ ), which further
means that the variational inequality (x − x ∗ )T ς ∗ ≥ 0 holds for all x ∈ Ω, where
ς ∗ ∈ ∂x Ψ (x ∗ , β ∗ , λ∗ , γ ∗ ) is a subgradient of the Lagrangian function Ψ . Since
x k+1 ∈ Ω, one obtains that (x k+1 − x ∗ )T (∇f (x ∗ ) + P T α ∗ + B T β ∗ + D T λ∗ +
Lγ ∗ ) ≥ 0. Then, by the convexity of f (x), we further acquire

f (x k+1 ) − f (x ∗ ) + (x k+1 − x ∗ )T (P T α ∗ + B T β ∗ + D T λ∗ + Lγ ∗ ) ≥ 0.
(2.18)

Notice that Bx^* = c and Lx^* = 0. Then, combining (2.16) and (2.18) yields

V1(x^{k+1}) − V1(x^k) ≤ −(1 − aη)||x^{k+1} − x^k||^2 − 2η(x^k − x^*)^T(g^k − ∇f(x^k))
+ (η/a)||g^k − ∇f(x^k)||^2 − 2η(x^{k+1})^T L(γ^k − γ^* + x^k)
− 2η(x^{k+1} − x^*)^T P^T(φ(α^k + Px^k − q) − α^*)
− 2η(Bx^{k+1} − c)^T(β^k − β^* + Bx^k − c)
− 2η(x^{k+1} − x^*)^T D^T((λ^k + Dx^k − s)_+ − λ^*)
+ 2η(x^{k+1} − x^k)^T(∇f(x^{k+1}) − ∇f(x^k)). (2.19)

Recall that α ∗ = φ(α ∗ +P x ∗ −q), where α ∗ is a solution to the variational inequality


(α − α ∗ )T (−P x ∗ +q) ≥ 0, ∀α ∈ [−1, 1]m . Thus, (φ(α k +P x k −q)−α ∗ )T (P x ∗ −
q) ≤ 0 holds due to φ(α k + P x k − q) ∈ [−1, 1]m . Then, we have

(x k+1 − x ∗ )T P T (φ(α k + P x k − q) − α ∗ )

= (P x k+1 − q)T (φ(α k + P x k − q) − α ∗ ) − (P x ∗ − q)T (φ(α k + P x k − q) − α ∗ )

≥ (P x k+1 − q)T (φ(α k + P x k − q) − α ∗ ). (2.20)



Similarly, one has ((λk + Dx k − s)+ − λ∗ )T (Dx ∗ − s) ≤ 0 owing to (λk + Dx k −


s)+ ≥ 0. Thus, we get

(x k+1 − x ∗ )T D T ((λk + Dx k − s)+ − λ∗ )

= (Dx k+1 − s)T ((λk + Dx k − s)+ − λ∗ ) − (Dx ∗ − s)T ((λk + Dx k − s)+ − λ∗ )

≥ (Dx k+1 − s)T ((λk + Dx k − s)+ − λ∗ ). (2.21)

Moreover, by utilizing the standard variance decomposition E[||a − E[a]||^2] = E[||a||^2] − ||E[a]||^2 and (2.6), the expectation E[||g^k − ∇f(x^k)||^2] = E[||g^k − ∇f(x^*)||^2] − ||∇f(x^k) − ∇f(x^*)||^2. According to the strong convexity of the global function f(x), it holds that ||∇f(x^k) − ∇f(x^*)||^2 ≥ 2μ(f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*)). Thus, it follows from (2.8) that

E[||g^k − ∇f(x^k)||^2] ≤ 4νr^k + 4(ν − μ)(f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*)). (2.22)

Substituting (2.20)–(2.21) into (2.19), taking conditional expectation on F k , and


then combining with (2.22), we complete the proof of Lemma 2.8. 
Lemma 2.9 Consider the sequence of iterations x k , α k , β k , λk , and γ k generated
by Algorithm 1. Under Assumptions 2.1 and 2.2, ∀k ≥ 0, the following inequality
holds for V2 (α k ):

E[V2(α^{k+1})] − V2(α^k) ≤ 2E[(Px^{k+1} − q)^T(α^{k+1} − α^*)] − E[||α^{k+1} − α^k||^2]. (2.23)

Proof Denote ψ2k = α k+1 = φ(υ2k ), where υ2k = α k + P x k+1 − q. From the
definition of V2 (α k ), one has

V2 (α k+1 ) − V2 (α k ) = ||α k+1 − α ∗ ||2 − ||α k − α ∗ ||2


= −||ψ2k − α k ||2 + 2(ψ2k − α k )T (ψ2k − α ∗ )
= −||ψ2k − α k ||2 + 2(ψ2k − α k − P x k+1 + q)T (ψ2k − α ∗ )
+ 2(P x k+1 − q)T (ψ2k − α ∗ ). (2.24)

From the projection property, we obtain that (ψ2k − α k − P x k+1 + q)T (ψ2k − α ∗ ) =
(φ(υ2k )−υ2k )T (φ(υ2k )−α ∗ ) ≤ 0. Combining with (2.24) and then taking conditional
expectation on F k , we complete the proof of Lemma 2.9. 
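The projection inequality invoked in this proof (and again in the proofs of Lemmas 2.8 and 2.11) is the standard obtuse-angle property (P_Ω[v] − v)^T(P_Ω[v] − u) ≤ 0 for every u ∈ Ω. It can be spot-checked numerically; the box set, bounds, and sampled points below are arbitrary illustrative choices:

```python
import numpy as np

# Spot-check of (P_Ω[v] - v)^T (P_Ω[v] - u) <= 0 for u in Ω = [lo, hi]^d.
rng = np.random.default_rng(1)
lo, hi, d = -2.0, 2.0, 4

def proj(v):
    # Euclidean projection onto the box is an entrywise clip
    return np.clip(v, lo, hi)

violation = 0.0
for _ in range(1000):
    v = 10.0 * rng.standard_normal(d)   # arbitrary point, possibly outside Ω
    u = rng.uniform(lo, hi, size=d)     # arbitrary point inside Ω
    p = proj(v)
    violation = max(violation, (p - v) @ (p - u))
```

The same inequality with φ(·) (projection onto [−1, 1]^m) and (·)_+ (projection onto the non-negative orthant) underlies the bounds on the dual updates.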

Lemma 2.10 Consider the sequence of iterations x k , α k , β k , λk , and γ k generated


by Algorithm 1. Under Assumptions 2.1 and 2.2, ∀k ≥ 0, the following inequality
holds for V3 (β k ):

E[V3(β^{k+1})] − V3(β^k) − E[||Bx^{k+1} − Bx^k||^2] ≤ 2E[(Bx^{k+1} − c)^T(β^k − β^* + Bx^k − c)] − ||Bx^k − c||^2. (2.25)

Proof Since β k+1 = β k + Bx k+1 − c, from the definition of V3 (β k ), it follows that

V3(β^{k+1}) − V3(β^k)
= ||β^{k+1} − β^*||^2 − ||β^k − β^*||^2
= ||β^k − β^* + Bx^{k+1} − c||^2 − ||β^k − β^*||^2
= (Bx^{k+1} − c)^T(2(β^k − β^*) + Bx^{k+1} − c)
= 2(Bx^{k+1} − c)^T(β^k − β^*) + ||Bx^{k+1} − Bx^k||^2
+ 2(Bx^{k+1} − c + c − Bx^k)^T(Bx^k − c) + ||Bx^k − c||^2
= 2(Bx^{k+1} − c)^T(β^k − β^* + Bx^k − c) + ||Bx^{k+1} − Bx^k||^2 − ||Bx^k − c||^2. (2.26)

Furthermore, taking conditional expectation on F k yields the result of Lemma 2.10.


This completes the proof. 
Lemma 2.11 Consider the sequence of iterations x k , α k , β k , λk , and γ k generated
by Algorithm 1. Under Assumptions 2.1 and 2.2, ∀k ≥ 0, the following inequality
holds for V4 (λk ):

E[V4(λ^{k+1})] − V4(λ^k) ≤ 2E[(Dx^{k+1} − s)^T(λ^{k+1} − λ^*)] − E[||λ^{k+1} − λ^k||^2]. (2.27)

Proof Denote ψ3k = λk+1 = (υ3k )+ , where υ3k = λk + Dx k+1 − s. From the
definition of V4 (λk ), one has

V4 (λk+1 ) − V4 (λk ) = ||λk+1 − λ∗ ||2 − ||λk − λ∗ ||2


= ||ψ3k − λ∗ ||2 − ||λk − λ∗ ||2
= −||ψ3k − λk ||2 + 2(ψ3k − λk )T (ψ3k − λ∗ )
= −||ψ3k − λk ||2 + 2(ψ3k − λk − Dx k+1 + s)T (ψ3k − λ∗ )
+ 2(Dx k+1 − s)T (ψ3k − λ∗ ). (2.28)

From the projection property, we obtain that (ψ3k − λk − Dx k+1 + s)T (ψ3k − λ∗ ) =
((υ3k )+ − υ3k )T ((υ3k )+ −λ∗ ) ≤ 0. Combining with (2.28) and then taking conditional
expectation on F k , we complete the proof of Lemma 2.11. 
Lemma 2.12 Consider the sequence of iterations x k , α k , β k , λk , and γ k generated
by Algorithm 1. Under Assumptions 2.1 and 2.2, ∀k ≥ 0, the following inequality
holds for V5 (γ k ):

E[V5(γ^{k+1})] − V5(γ^k) − E[(x^{k+1} − x^k)^T L(x^{k+1} − x^k)] ≤ 2E[(x^{k+1})^T L(γ^k − γ^* + x^k)] − (x^k)^T L x^k. (2.29)

Proof Since γ^{k+1} = γ^k + x^{k+1}, from the definition of V5(γ^k), it follows that

V5(γ^{k+1}) − V5(γ^k)
= (γ^{k+1} − γ^*)^T L(γ^{k+1} − γ^*) − (γ^k − γ^*)^T L(γ^k − γ^*)
= (γ^k − γ^* + x^{k+1})^T L(γ^k − γ^* + x^{k+1}) − (γ^k − γ^*)^T L(γ^k − γ^*)
= 2(x^{k+1})^T L(γ^k − γ^*) + (x^{k+1})^T L x^{k+1}
= 2(x^{k+1})^T L(γ^k − γ^*) + (x^{k+1} − x^k)^T L(x^{k+1} − x^k) + 2(x^{k+1})^T L x^k − (x^k)^T L x^k
= 2(x^{k+1})^T L(γ^k − γ^* + x^k) + (x^{k+1} − x^k)^T L(x^{k+1} − x^k) − (x^k)^T L x^k. (2.30)

Then, taking conditional expectation on F k , we get the result of Lemma 2.12. This
completes the proof. 

2.4.3 Main Results

In Theorem 2.13, we will show that Algorithm 1 converges under an appropriate step-size η and constant b by combining the results in Lemmas 2.7–2.12. Before presenting Theorem 2.13, we first set the constant

b ∈ ( 4êνη/a , 2ěηχ_min(B^T B)/ν − 4ěη(ν − μ)/a )

and the tunable parameter

a ∈ ( (4êην^2 + 4ěνη(ν − μ))/(2ěηχ_min(B^T B)) , +∞ ).
min

Theorem 2.13 Suppose that Assumptions 2.1 and 2.2 hold. Considering the computation-efficient distributed stochastic gradient algorithm described in Algorithm 1, if the constant step-size η is selected from the interval

( 0 , 1/χ_max(aI_{nd} + 2νI_{nd} + B^T B + L + 2D^T D + 2P^T P) ),

then the estimator x_i^k, i ∈ V, converges to the global optimal solution x̂^* in expectation.
Proof By the Lyapunov function (2.9) and the results in Lemmas 2.8–2.12, we
obtain that

E[V k+1 ] − V k ≤ Ξ1 + Ξ2 + Ξ3 + Ξ4 + Ξ5 , (2.31)

where

Ξ1 = 2ηE[(x^{k+1} − x^k)^T(∇f(x^{k+1}) − ∇f(x^k))] − η||Bx^k − c||^2 − (1 − aη)E[||x^{k+1} − x^k||^2],

Ξ2 = 2ηE[(Px^{k+1} − q)^T(α^{k+1} − φ(α^k + Px^k − q))] − ηE[||α^{k+1} − α^k||^2],

Ξ3 = 2ηE[(Dx^{k+1} − s)^T(λ^{k+1} − (λ^k + Dx^k − s)_+)] − ηE[||λ^{k+1} − λ^k||^2],

Ξ4 = b(E[r^{k+1}] − r^k) + ηE[||x^{k+1} − x^k||^2_{B^T B}],

Ξ5 = (4ην/a) r^k + (4η(ν − μ)/a)(f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*)) + ηE[||x^{k+1} − x^k||^2_L] − η||x^k||^2_L.

Next, we derive the upper bound for each term on the right side of inequality (2.31). From the smoothness of the global objective function f(x), one obtains the fact that (x^{k+1} − x^k)^T(∇f(x^{k+1}) − ∇f(x^k)) ≤ ν||x^{k+1} − x^k||^2. Thus, the first term Ξ1 is bounded by

Ξ1 ≤ −(1 − aη − 2ην)E[||x^{k+1} − x^k||^2] − η||x^k − x^*||^2_{B^T B}, (2.32)

where inequality (2.32) is deduced by the equality ||Bx^k − c||^2 = ||Bx^k − Bx^*||^2 = (x^k − x^*)^T B^T B(x^k − x^*) = ||x^k − x^*||^2_{B^T B}. From the projection property, we have that E[(Px^{k+1} − q)^T(α^{k+1} − φ(α^k + Px^k − q))] ≤ E[||x^{k+1} − x^k||^2_{P^T P}] + E[(Px^k − q)^T(α^{k+1} − φ(α^k + Px^k − q))]. Therefore, the second term Ξ2 is bounded by

Ξ2 ≤ 2ηE[(α^{k+1} − α^k)^T(α^{k+1} − φ(α^k + Px^k − q))]
− 2ηE[(α^{k+1} − (α^k + Px^k − q))^T(α^{k+1} − φ(α^k + Px^k − q))]
− ηE[||α^{k+1} − α^k||^2] + 2ηE[||x^{k+1} − x^k||^2_{P^T P}]. (2.33)

Consider that E[(α^{k+1} − (α^k + Px^k − q))^T(α^{k+1} − φ(α^k + Px^k − q))] ≥ E[||α^{k+1} − φ(α^k + Px^k − q)||^2]. In conjunction with (2.33), it follows that

Ξ2 ≤ 2ηE[||x^{k+1} − x^k||^2_{P^T P}] − η||α^k − φ(α^k + Px^k − q)||^2 − ηE[||α^{k+1} − φ(α^k + Px^k − q)||^2]
≤ 2ηE[||x^{k+1} − x^k||^2_{P^T P}] − η||α^k − φ(α^k + Px^k − q)||^2. (2.34)

Similar to the procedure for deriving the bound of Ξ2, we achieve the bound of Ξ3 as

Ξ3 ≤ 2ηE[||x^{k+1} − x^k||^2_{D^T D}] − η||λ^k − (λ^k + Dx^k − s)_+||^2 − ηE[||λ^{k+1} − (λ^k + Dx^k − s)_+||^2]
≤ 2ηE[||x^{k+1} − x^k||^2_{D^T D}] − η||λ^k − (λ^k + Dx^k − s)_+||^2. (2.35)

Then, with the result presented in Lemma 2.7, we further get

Ξ4 ≤ −(b/ê) r^k + (b/ě)(f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*)) + ηE[||x^{k+1} − x^k||^2_{B^T B}]. (2.36)

Moreover, according to the smoothness of the global objective function f(x), it follows that f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*) ≤ (ν/2)||x^k − x^*||^2. Hence, we have

Ξ4 + Ξ5 ≤ −(b/ê − 4ην/a) r^k + (νb/(2ě) + 2νη(ν − μ)/a)||x^k − x^*||^2
+ ηE[||x^{k+1} − x^k||^2_{B^T B + L}] − η||x^k||^2_L. (2.37)



Substituting (2.32)–(2.35) and (2.37) into the right side of (2.31) and rearranging the obtained terms, one gets

E[V^{k+1}] − V^k ≤ −E[||x^{k+1} − x^k||^2_{I_{nd} − η(aI_{nd} + 2νI_{nd} + B^T B + L + 2D^T D + 2P^T P)}]
− ||x^k − x^*||^2_{ηB^T B − (νb/(2ě) + 2νη(ν − μ)/a) I_{nd}} − (b/ê − 4ην/a) r^k
− η||λ^k − (λ^k + Dx^k − s)_+||^2 − η||x^k||^2_L
− η||α^k − φ(α^k + Px^k − q)||^2. (2.38)

To ensure that the Lyapunov function V^k decreases monotonically over the entire period of iteration k, it is equivalent to prove that

E[V^{k+1}] − V^k < 0. (2.39)

The inequality (2.39) holds if each term on the right side of (2.38) is non-positive. Therefore, the following conditions should be satisfied: (a) the matrices I_{nd} − η(aI_{nd} + 2νI_{nd} + B^T B + L + 2D^T D + 2P^T P) and ηB^T B − (νb/(2ě) + 2νη(ν − μ)/a)I_{nd} should be positive definite and (b) b/ê − 4ην/a > 0. The results of Theorem 2.13 are then achieved.
Remark 2.14 Theorem 2.13 indicates that even utilizing the stochastic gradient g k ,
Algorithm 1 is guaranteed to resolve the composite constrained convex optimization
problem (2.1) if some conditions (such as a, b, and η) are satisfied and the
assumptions on the objective functions and the communication network hold.
However, an explicit convergence rate is not established in Theorem 2.13 due to the
existence of the locally general constraints that are constituted by linear equality,
convex inequality, and bounded constraints. When dealing with a composite non-
smooth problem, although global linear convergence of the distributed proximal
gradient methods has been well proved in the recent work [33], there is still no work
to analyze the global linear convergence of the primal–dual method (similar to the
proposed algorithm). In terms of this issue, related work needs to be further studied.
We, therefore, draw support from simulations to explore possible results.
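The admissible step-size interval of Theorem 2.13 is easy to evaluate numerically once the constraint matrices are fixed. The small matrices, the ring Laplacian, and the values of a and ν below are placeholders, not data from the chapter:

```python
import numpy as np

# Step-size bound of Theorem 2.13:
# eta in (0, 1 / chi_max(a I + 2 nu I + B^T B + L + 2 D^T D + 2 P^T P)).
rng = np.random.default_rng(2)
nd = 6
a_par, nu = 1.0, 2.0                     # illustrative tunable parameter and smoothness
B = rng.standard_normal((4, nd))
D = rng.standard_normal((3, nd))
P = rng.standard_normal((2, nd))

Lap = 2.0 * np.eye(nd)                   # Laplacian of a 6-node ring (symmetric PSD)
for i in range(nd):
    Lap[i, (i + 1) % nd] = -1.0
    Lap[i, (i - 1) % nd] = -1.0

M = (a_par + 2.0 * nu) * np.eye(nd) + B.T @ B + Lap + 2.0 * D.T @ D + 2.0 * P.T @ P
chi_max = np.linalg.eigvalsh(M).max()    # largest eigenvalue of the symmetric M
eta_bound = 1.0 / chi_max                # any constant eta in (0, eta_bound) qualifies
```

Since every summand of M is positive semi-definite, larger constraint matrices shrink the admissible step-size, which matches the intuition behind condition (a) above.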

2.4.4 Discussion

In this subsection, by adopting the proximal operator, we show that a distributed variance-reduction algorithm achieves performance similar to that of the centralized method when dealing with a special case of problem (2.1) (the constraints are not involved and the non-smooth terms are identical), from the perspective of the convergence rate [33, 55]. In particular, the special case of problem (2.1) can be equivalently

reformulated as follows:

min_x J(x) = Σ_{i=1}^n [ f_i(x_i) + ||P̃ x_i − q̃||_1 ],  s.t. W^{1/2} x = 0, (2.40)

where x = vec[x_1, . . . , x_n] ∈ R^{nd}, P̃ ∈ R^{m̃×d} (m̃ > 0), q̃ ∈ R^{m̃}, f_i(x_i) = (1/e_i) Σ_{j=1}^{e_i} f_{i,j}(x_i), W = (1/2)(I_{nd} − A ⊗ I_d), and A is a primitive, symmetric, and doubly-stochastic matrix. For convenience, we define h(x) = Σ_{i=1}^n ||P̃ x_i − q̃||_1. To solve problem (2.40), we propose the following recursion inspired by the design ideas of the methods in [33, 55]:


z^{k+1} = x^k − ηg^k − W^{1/2} x^k − W^{1/2} γ^k,
γ^{k+1} = γ^k + W^{1/2} z^{k+1},
x^{k+1} = prox_{ηh}(z^{k+1}), (2.41)

where prox_{ηh}(z^{k+1}) = arg min_y {h(y) + (1/(2η))||y − z^{k+1}||^2} is the proximal operator at the auxiliary variable z^{k+1} ∈ R^{nd}. Then, the next proposition explores the linear convergence of algorithm (2.41).
Proposition 2.15 Suppose that Assumptions 2.1 and 2.2 hold. If γ 0 = 0nd and
the constant step-size η is small enough, algorithm (2.41) can guarantee that the
estimator x k linearly converges to the global optimal solution x ∗ in expectation.
Proof It is concluded from [33] that the fixed-point optimality of algorithm (2.41) satisfies

z^* = x^* − η∇f(x^*) − W^{1/2} x^* − W^{1/2} γ^*,
0 = W^{1/2} z^*,
x^* = prox_{ηh}(z^*), (2.42)

where (x^*, γ^*, z^*) is a fixed-point of (2.41). Define x̂^k = x^k − x^*, γ̂^k = γ^k − γ^*, and ẑ^k = z^k − z^*. Then, subtracting (2.42) from (2.41), we obtain the following error recursions:

ẑ^{k+1} = x̂^k − η(g^k − ∇f(x^*)) − W^{1/2} x̂^k − W^{1/2} γ̂^k,
γ̂^{k+1} = γ̂^k + W^{1/2} ẑ^{k+1},
x̂^{k+1} = prox_{ηh}(z^{k+1}) − prox_{ηh}(z^*). (2.43)

Note from the definition of the matrix W that W is symmetric and its singular values are in [0, 1) (i.e., ρ_min = 0 < ρ ≤ ρ_max < 1, where ρ is the minimum nonzero singular value of W). From the first equality in (2.43), one gets

||ẑ^{k+1}||^2 = ||x̂^k − η(g^k − ∇f(x^*)) − W^{1/2} x̂^k||^2 + ||W^{1/2} γ̂^k||^2
− 2(γ̂^k)^T W^{1/2}(x̂^k − η(g^k − ∇f(x^*)) − W^{1/2} x̂^k). (2.44)

From the second equality in (2.43), one gets

||γ̂^{k+1}||^2 = ||γ̂^k||^2 + ||W^{1/2} ẑ^{k+1}||^2 − 2||W^{1/2} γ̂^k||^2
+ 2(γ̂^k)^T W^{1/2}(x̂^k − η(g^k − ∇f(x^*)) − W^{1/2} x̂^k). (2.45)

Combining (2.45) with (2.44) and noting that ||W^{1/2} γ̂^k||^2 ≥ ρ||γ̂^k||^2, we have

||ẑ^{k+1}||^2_{I_{nd}−W} + ||γ̂^{k+1}||^2
≤ (1 − ρ)||γ̂^k||^2 + ||x̂^k − η(g^k − ∇f(x^*)) − W^{1/2} x̂^k||^2. (2.46)
Then, we further obtain that

||x̂^k − η(g^k − ∇f(x^*)) − W^{1/2} x̂^k||^2
≤ (1 − 2ηνμ/(ν + μ))||x̂^k||^2 + (η^2/(1 − ρ_max))||g^k − ∇f(x^*)||^2
− (4ημ/(ν + μ))(f(x^k) − f(x^*) − ∇f(x^*)^T x̂^k) − 2η(x̂^k)^T(g^k − ∇f(x^k)), (2.47)

where we have used the inequalities (x̂^k)^T(∇f(x^k) − ∇f(x^*)) ≥ (νμ/(ν + μ))||x̂^k||^2 + (1/(ν + μ))||∇f(x^k) − ∇f(x^*)||^2, ||∇f(x^k) − ∇f(x^*)||^2 ≥ 2μ(f(x^k) − f(x^*) − ∇f(x^*)^T x̂^k), and ||η(g^k − ∇f(x^*)) − W^{1/2} x̂^k||^2 ≤ (η^2/(1 − ρ_max))||g^k − ∇f(x^*)||^2 + (x̂^k)^T W^{1/2} x̂^k to derive (2.47). Substituting (2.47) into (2.46) and taking the conditional expectation, we obtain that

E[||ẑ^{k+1}||^2_{I_{nd}−W}] + E[||γ̂^{k+1}||^2]
≤ (1 − 2ηνμ/(ν + μ))||x̂^k||^2 + (η^2/(1 − ρ_max)) E[||g^k − ∇f(x^*)||^2]
− (4ημ/(ν + μ))(f(x^k) − f(x^*) − ∇f(x^*)^T x̂^k) + (1 − ρ)||γ̂^k||^2, (2.48)

where the cross-term −2η(x̂^k)^T(g^k − ∇f(x^k)) vanishes under the conditional expectation because of (2.6).

According to Lemmas 2.6 and 2.7, we can further get

E[||ẑ^{k+1}||^2_{I_{nd}−W}] + E[||γ̂^{k+1}||^2] + b_1 η E[r^{k+1}]
≤ (1 − 2ηνμ/(ν + μ))||x̂^k||^2 + (1 − b_1/ê + 4νη/(1 − ρ_max)) ηr^k
− (4μ/(ν + μ) − 2(2ν − μ)η/(1 − ρ_max) − b_1/ě) ηs^k + (1 − ρ)||γ̂^k||^2, (2.49)

where b_1 is a positive constant which will be specified next and we set s^k = f(x^k) − f(x^*) − ∇f(x^*)^T(x^k − x^*). If 0 < η ≤ (4μě − b_1(ν + μ))(1 − ρ_max)/(2ě(ν + μ)(2ν − μ)) and b_1 < 4μě/(ν + μ), then (2.49) further implies that

E[||ẑ^{k+1}||^2_{I_{nd}−W}] + E[||γ̂^{k+1}||^2] + b_1 η E[r^{k+1}]
≤ (1 − 2ηνμ/(ν + μ))||x̂^k||^2 + (1 − b_1/ê + 4νη/(1 − ρ_max)) ηr^k + (1 − ρ)||γ̂^k||^2. (2.50)

Furthermore, it follows from the third equality in (2.43) and the non-expansiveness of the proximal operator that

||x̂^{k+1}|| = ||prox_{ηh}(z^{k+1}) − prox_{ηh}(z^*)|| ≤ ||ẑ^{k+1}||, (2.51)

and therefore

(1 − ρ_max)E[||x̂^{k+1}||^2] + E[||γ̂^{k+1}||^2] + b_1 η E[r^{k+1}]
≤ (1/(1 − ρ_max) − 2ηνμ/((ν + μ)(1 − ρ_max)))(1 − ρ_max)||x̂^k||^2
+ (1/b_1 − 1/ê + 4νη/((1 − ρ_max)b_1)) b_1 ηr^k + (1 − ρ)||γ̂^k||^2. (2.52)

Then, if 0 < η ≤ min{(b_1 + b_1ê − ê)(1 − ρ_max)/(4νê), (ν + μ)/(2νμ)} and b_1 > ê/(1 + ê), there must exist a positive constant θ = max{1/(1 − ρ_max) − 2ηνμ/((ν + μ)(1 − ρ_max)), 1/b_1 − 1/ê + 4νη/((1 − ρ_max)b_1), 1 − ρ} such that

(1 − ρ_max)E[||x̂^{k+1}||^2] + E[||γ̂^{k+1}||^2] + b_1 η E[r^{k+1}] ≤ θ((1 − ρ_max)||x̂^k||^2 + b_1 ηr^k + ||γ̂^k||^2), (2.53)

where 0 < θ < 1. From here, the proof is similar to that of Theorem 1 in [33]; it then suffices to obtain the desired results.
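As a concrete illustration of Proposition 2.15, recursion (2.41) can be run on a toy instance of problem (2.40) with P̃ = I_d and q̃ = 0, so that prox_{ηh} reduces to entrywise soft-thresholding. The quadratic components f_i(x) = (1/2)||x − c_i||^2 (for which g^k below is the exact gradient), the ℓ1 weight κ, the 4-node ring, and the step-size are all stand-in choices, not data from the text:

```python
import numpy as np

# Toy run of recursion (2.41): z-update, dual update, proximal step.
rng = np.random.default_rng(3)
n, d, kappa, eta = 4, 2, 0.1, 0.3
c = rng.standard_normal((n, d))          # f_i(x) = 0.5||x - c_i||^2

A = np.zeros((n, n))                     # doubly stochastic ring mixing matrix
for i in range(n):
    A[i, i] = 0.5
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 0.25
W = 0.5 * (np.eye(n * d) - np.kron(A, np.eye(d)))
vals, V = np.linalg.eigh(W)
Wh = V @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ V.T   # W^{1/2}

soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
grad = lambda x: x - c.reshape(-1)       # stacked gradients x_i - c_i

x = np.zeros(n * d)
gamma = np.zeros(n * d)                  # gamma^0 = 0 as required by Proposition 2.15
for _ in range(2000):
    z = x - eta * grad(x) - Wh @ x - Wh @ gamma
    gamma = gamma + Wh @ z
    x = soft(z, eta * kappa)

# For this separable toy case the consensus optimum is y* = soft(mean(c), kappa)
y_star = soft(c.mean(axis=0), kappa)
err = np.max(np.abs(x.reshape(n, d) - y_star))
```

On this instance the iterates reach consensus at y* up to small numerical error, consistent with the linear rate claimed in Proposition 2.15.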

2.5 Numerical Examples

In this section, two numerical examples are provided to examine the convergence
and the practical behavior of the proposed algorithms. Notice that all the simulations
are carried out in MATLAB on an HP 288 Pro G4 MT Business PC with a 3.2 GHz Intel(R) Core(TM) i7-8700 processor and 8 GB of memory. For the sake of comparison, the optimal solutions to the following examples are obtained by the centralized method with proper step-sizes for a long enough time.

2.5.1 Example 1: Performance Examination

First, the proposed algorithms are applied to solve a general distributed minimization problem, which is described as follows:

min_{x̂} Σ_{i=1}^n [ (1/e_i) Σ_{j=1}^{e_i} ||C_{i,j} x̂ − b_{i,j}||^2 + ||P_i x̂ − q_i||_1 ],
s.t. x̂_1 + x̂_2 + x̂_3 + x̂_4 = 3,
x̂_1 − x̂_2 + x̂_3 − x̂_4 ≤ 2,
−2 ≤ x̂_i ≤ 2, i = 1, . . . , 4, (2.54)

where x̂ ∈ R4 , Ci,j ∈ R1×4 , Pi ∈ R1×4 , bi,j ∈ R, and qi ∈ R for all i, j . Let n = 10


and ei = 10 for all i. The components of Ci,j , bi,j , Pi , and qi are randomly selected
in [0, 2], [−4, 4], [0, 2], and [−4, 4], respectively. The communication among 10
nodes is modeled as a ring network. The node i is assigned the ith objective function f_i(x̂) + ||P_i x̂ − q_i||_1, where f_i(x̂) = (1/e_i) Σ_{j=1}^{e_i} ||C_{i,j} x̂ − b_{i,j}||^2 with the constituent functions f_{i,j}(x̂) = ||C_{i,j} x̂ − b_{i,j}||^2, j = 1, . . . , e_i. In the simulation, the constant
step-size η is set as 0.04 and the initial conditions (xi0 , αi0 , βi0 , λ0i , and γi0 ) are
randomly generated in Algorithm 1.
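The random data of problem (2.54) can be assembled directly from the description above (entries of C_{i,j} and P_i in [0, 2], entries of b_{i,j} and q_i in [−4, 4]); the seed and the sample feasible point below are arbitrary:

```python
import numpy as np

# Data generation and objective evaluation for problem (2.54).
rng = np.random.default_rng(4)
n, e, d = 10, 10, 4
C = rng.uniform(0.0, 2.0, size=(n, e, d))
b = rng.uniform(-4.0, 4.0, size=(n, e))
P = rng.uniform(0.0, 2.0, size=(n, d))
q = rng.uniform(-4.0, 4.0, size=n)

def local_obj(i, xhat):
    # f_i(x) + ||P_i x - q_i||_1, with f_i(x) = (1/e_i) Σ_j ||C_{i,j} x - b_{i,j}||^2
    return np.mean((C[i] @ xhat - b[i]) ** 2) + np.abs(P[i] @ xhat - q[i])

def feasible(xhat, tol=1e-9):
    # sum = 3, alternating sum <= 2, box bounds [-2, 2]
    return (abs(xhat.sum() - 3.0) <= tol
            and xhat[0] - xhat[1] + xhat[2] - xhat[3] <= 2.0 + tol
            and bool(np.all(np.abs(xhat) <= 2.0 + tol)))

xhat = np.array([0.75, 0.75, 0.75, 0.75])    # one feasible point of (2.54)
total = sum(local_obj(i, xhat) for i in range(n))
```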
Then, the simulation results are shown as follows:
(1) Figure 2.1 depicts the transient behaviors of all dimensions of the state estimator x̂. Figure 2.1 indicates that the state estimator x̂ in Algorithm 1 can successfully achieve consensus at the globally optimal solution in expectation.
(2) Figures 2.2 and 2.3 adopt the residual (1/n) Σ_{i=1}^n ||x_i^k − x̂^*|| to show a numerical comparison between the proposed algorithm (2.4) and the distributed method in [46], with the x-axis being the iteration and the number of gradient evaluations, respectively. Figure 2.2 shows that the proposed algorithm (2.4) can achieve a linear convergence rate with a constant step-size and that its performance does not degrade even when using stochastic gradients. Figure 2.3

Fig. 2.1 Convergence of x̂ for solving the optimization problem in Example 1 (transient behaviors of all dimensions of the state vector)

Fig. 2.2 Comparison (a): x-axis is the iteration

indicates that, compared with the method in [46], the proposed algorithm (2.4) demands a smaller number of gradient evaluations, which largely reduces the computational cost.

2.5.2 Example 2: Application Behavior

Second, we further verify the application behavior of the proposed algorithm with
numerical simulations for real datasets. We consider the distributed sparse logistic
regression problem using the breast cancer Wisconsin (diagnostic) dataset provided

Fig. 2.3 Comparison (b): x-axis is the number of gradient evaluations

in the UCI Machine Learning Repository [56]. In the breast cancer Wisconsin
(diagnostic) dataset, we adopt N = 200 samples as training data, where each training sample has dimension d = 9. All features have been preprocessed and normalized to unit vectors. For the network, we generate a
randomly connected network with n = 20 nodes utilizing an Erdos–Renyi network
with probability p = 0.4. The distributed sparse logistic regression problem can be
formally described as


min_{x̂ ∈ R^d} Σ_{i=1}^n f_i(x̂) + κ_1 ||x̂||_1, (2.55)

with the local objective function f_i(x̂) being

f_i(x̂) = (1/e_i) Σ_{j=1}^{e_i} ln(1 + exp(−b_{i,j} c_{i,j}^T x̂)) + (κ_2/2)||x̂||_2^2,

where b_{i,j} ∈ {−1, 1} and c_{i,j} ∈ R^d are local data kept by node i for j ∈ {1, . . . , e_i}; the regularization term κ_1||x̂||_1 is applied to impose sparsity of the optimal solution and (κ_2/2)||x̂||_2^2 is added to avoid overfitting, respectively. In the simulation, we assign the data randomly to the local nodes, i.e., Σ_{i=1}^n e_i = N. We set the regularization parameters to κ_1 = 0.05 and κ_2 = 10.
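A minimal sketch of the local objective and gradient in (2.55), together with one proximal-gradient step handling the κ_1||x̂||_1 term, is given below; the synthetic data stand in for the UCI dataset, while d = 9, κ_1 = 0.05, and κ_2 = 10 follow the text:

```python
import numpy as np

# Local sparse logistic regression objective of problem (2.55) at one node.
rng = np.random.default_rng(5)
e_i, d = 10, 9
kappa1, kappa2 = 0.05, 10.0
c = rng.standard_normal((e_i, d))
c /= np.linalg.norm(c, axis=1, keepdims=True)   # features normalized to unit norm
b = rng.choice([-1.0, 1.0], size=e_i)           # labels

def f_i(x):
    # (1/e_i) Σ_j ln(1 + exp(-b_j c_j^T x)) + (kappa2/2)||x||_2^2
    return np.mean(np.log1p(np.exp(-b * (c @ x)))) + 0.5 * kappa2 * (x @ x)

def grad_f_i(x):
    s = 1.0 / (1.0 + np.exp(b * (c @ x)))       # sigmoid of the negative margin
    return -(c.T @ (b * s)) / e_i + kappa2 * x

def soft(z, t):
    # proximal operator of t||.||_1 (entrywise soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# One proximal-gradient step from the origin (illustration only); the step-size
# 1/(0.25 + kappa2) uses a smoothness bound of the logistic-plus-ridge loss.
eta = 1.0 / (0.25 + kappa2)
x1 = soft(-eta * grad_f_i(np.zeros(d)), eta * kappa1)
```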
Then, we compare the proposed algorithms with the existing distributed methods, including DL-ADMM [30], PG-EXTRA [31], NIDS [32], and P2D2 [33], that can deal with the composite non-smooth optimization problem. When κ_1 = 0, we also compare the proposed algorithms with the existing distributed methods that use the variance-reduction technique, including DSA [45] and GT-SAGA [51]. The simulation results are described as follows:
(1) Figure 2.4 shows that the proposed algorithms can achieve a linear convergence rate, as do the existing distributed methods [30–33] that can deal with the composite non-smooth optimization problem, under the real training set. Figure 2.5 indicates that, compared with the existing distributed methods [30–33] that do not adopt the variance-reduction technique, the proposed algorithms demand a smaller number of gradient evaluations, which is cheaper in terms of the computational cost.

Fig. 2.4 Comparison (a): x-axis is the iteration

Fig. 2.5 Comparison (b): x-axis is the number of gradient evaluations



Fig. 2.6 Comparison (a): x-axis is the iteration

Fig. 2.7 Comparison (b): x-axis is the number of gradient evaluations

(2) When κ_1 = 0, Figs. 2.6 and 2.7 show that the proposed algorithms achieve performance similar to that of the existing distributed variance-reduced methods [46, 51].

2.6 Conclusion

In this chapter, we have designed a novel computation-efficient distributed stochas-


tic gradient algorithm for solving a class of strongly convex composite con-
strained optimization problems over networks. The proposed algorithm leveraged

the variance-reduction technique, which greatly reduces the expense of full gradient
evaluation. Through constructing an appropriate Lyapunov function, we proved that
the proposed algorithm converges in expectation to the optimal solution with a
suitably selected constant step-size. Furthermore, the privacy properties of the pro-
posed algorithm have also been explored via differential privacy strategy. Extensive
numerical experiments have been conducted to verify the superior performance of
the proposed algorithm. However, some nontrivial issues still deserve further study.
For example, the convergence rate of the proposed algorithm for the composite
constrained optimization problem needs to be studied in-depth, and general non-
smooth terms as well as more complex networks still demand further consideration.
In the future, we will further investigate the convergence rate of the proposed
algorithm and extend the algorithm to be applicable to more complex directed
networks. The extensions of the current algorithm to general non-smooth terms and
the distributed non-convex stochastic optimization are also two promising research
directions.

References

1. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms
outperform centralized algorithms? A case study for decentralized parallel stochastic gradient
descent, in Advances in Neural Information Processing Systems (NIPS), vol. 30 (2017), pp. 1–
11
2. R. Xin, S. Kar, U.A. Khan, Decentralized stochastic optimization and machine learning: a
unified variance-reduction framework for robust performance and fast convergence. IEEE
Signal Process. Mag. 37(3), 102–113 (2020)
3. S. Khobahi, M. Soltanalian, F. Jiang, A.L. Swindlehurst, Optimized transmission for parameter
estimation in wireless sensor networks. IEEE Trans. Signal Inf. Proc. Netw. 6, 35–47 (2019)
4. A. Nedic, J. Liu, Distributed optimization for control. Ann. Rev. Control Robot. Auton. Syst.
1, 77–103 (2018)
5. J. Li, W. Abbas, X. Koutsoukos, Resilient distributed diffusion in networks with adversaries.
IEEE Trans. Signal Inf. Proc. Netw. 6, 1–17 (2019)
6. A. Nedic, Distributed gradient methods for convex machine learning problems in networks:
distributed optimization. IEEE Signal Process. Mag. 37(3), 92–101 (2020)
7. M. Rossi, M. Centenaro, A. Ba, S. Eleuch, T. Erseghe, M. Zorzi, Distributed learning
algorithms for optimal data routing in IoT networks. IEEE Trans. Signal Inf. Proc. Netw. 6,
175–195 (2020)
8. B. Li, S. Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks
with gradient tracking and variance reduction, in Proceedings of the Twenty Third International
Conference on Artificial Intelligence and Statistics (PMLR), vol. 108 (2020), pp. 1662–1672
9. H. Li, C. Huang, Z. Wang, G. Chen, H. Umar, Computation-efficient distributed algorithm
for convex optimization over time-varying networks with limited bandwidth communication.
IEEE Trans. Signal Inf. Proc. Netw. 6, 140–151 (2020)
10. T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, K. Johansson, A
survey of distributed optimization. Annu. Rev. Control 47, 278–305 (2019)
11. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
12. S. Ram, A. Nedic, V. Veeravalli, Distributed stochastic subgradient projection algorithms for
convex optimization. J. Optim. Theory Appl. 147, 516–545 (2010)

13. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: conver-
gence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
14. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
15. C. Xi, U. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans.
Autom. Control 62(10), 4980–4993 (2017)
16. M. Maros, J. Jalden, On the Q-linear convergence of distributed generalized ADMM under
non-strongly convex function components. IEEE Trans. Signal Inf. Proc. Netw. 5(3), 442–453
(2019)
17. C. Zhang, H. Gao, Y. Wang, Privacy-preserving decentralized optimization via decomposition
(2018). Preprint. arXiv:1808.09566
18. J. Chen, S. Liu, P. Chen, Zeroth-order diffusion adaptation over networks, in Proceedings of
the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(2018). https://doi.org/10.1109/ICASSP.2018.8461448
19. J. Xu, S. Zhu, Y.C. Soh, L. Xie, Augmented distributed gradient methods for multi-agent
optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual
Conference on Decision and Control (2015). https://doi.org/10.1109/CDC.2015.7402509
20. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
21. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
22. V.S. Mai, E.H. Abed, Distributed optimization over directed graphs with row stochasticity and
constraint regularity. Automatica 102(102), 94–104 (2019)
23. B. Huang, Y. Zou, Z. Meng, W. Ren, Distributed time-varying convex optimization for a class
of nonlinear multiagent systems. IEEE Trans. Autom. Control 65(2), 801–808 (2020)
24. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inf. Proc. Netw. 4(1), 4–17 (2018)
25. M. Hong, D. Hajinezhad, M. Zhao, Prox-PDA: the proximal primal-dual algorithm for fast
distributed nonconvex optimization and learning over networks, in Proceedings of the 34th
International Conference on Machine Learning (ICML), vol. 70 (2017), pp. 1529–1538
26. F. Hua, R. Nassif, C. Richard, H. Wang, A.H. Sayed, Online distributed learning over graphs
with multitask graph-filter models. IEEE Trans. Signal Inf. Proc. Netw. 6, 63–77 (2020)
27. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for
economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron.
64(6), 5095–5106 (2017)
28. T. Yang, D. Wu, H. Fang, W. Ren, H. Wang, Y. Hong, K. Johansson, Distributed energy
resource coordination over time-varying directed communication networks. IEEE Trans.
Control Netw. Syst. 6(3), 1124–1134 (2019)
29. A.I. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in Proceedings of the
50th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
(2012), https://doi.org/10.1109/Allerton.2012.6483273
30. T.-H. Chang, M. Hong, X. Wang, Multi-agent distributed optimization via inexact consensus
ADMM. IEEE Trans. Signal Process. 63(2), 482–497 (2015)
31. W. Shi, Q. Ling, G. Wu, W. Yin, A proximal gradient algorithm for decentralized composite
optimization. IEEE Trans. Signal Process. 63(22), 6013–6023 (2015)
32. Z. Li, W. Shi, M. Yan, A decentralized proximal-gradient method with network independent
step-sizes and separated convergence rates. IEEE Trans. Signal Process. 67(17), 4494–4506
(2019)
33. S. Alghunaim, K. Yuan, A.H. Sayed, A linearly convergent proximal gradient algorithm for
decentralized optimization, in Advances in Neural Information Processing Systems (NIPS),
vol. 32 (2019), pp. 1–11
34. K. Zheng, Z. Yang, K. Zhang, P. Chatzimisios, K. Yang, W. Xiang, Big data-driven optimiza-
tion for mobile networks toward 5G. IEEE Netw. 30(1), 44–51 (2016)
60 2 Projection Algorithms for Distributed Stochastic Optimization

35. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and conver-
gence to local minima (2020). Preprint. arXiv:2003.02818v1
36. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep
learning, in Proceedings of the 36th International Conference on Machine Learning (ICML)
(2019). https://doi.org/10.48550/arxiv.1811.10792
37. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
38. R. Xin, A. Sahu, U. Khan, S. Kar, Distributed stochastic optimization with gradient tracking
over strongly-connected networks, in Proceedings of the 2019 IEEE 58th Conference on
Decision and Control (CDC) (2019). https://doi.org/10.1109/CDC40024.2019.9029217
39. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the
proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)
40. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient.
Math. Program. 162(1), 83–112 (2017)
41. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support
for non-strongly convex composite objectives, in Advances in Neural Information Processing
Systems (NIPS), vol. 27 (2014), pp. 1–9
42. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance
reduction, in Advances in Neural Information Processing Systems (NIPS), vol. 26 (2013),
pp. 1–9
43. C. Tan, S. Ma, Y. Dai, Y. Qian, Barzilai-Borwein step size for stochastic average gradient, in
Advances in Neural Information Processing Systems, vol. 29 (2016), pp. 1–9
44. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning
problems using stochastic recursive gradient, in Proceedings of the 34th International Confer-
ence on Machine Learning (ICML) (2017), pp. 2613–2621
45. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm.
J. Mach. Learn. Res. 17(1), 2165–2199 (2016)
46. Y. Zhao, Q. Liu, A consensus algorithm based on collective neurodynamic system for
distributed optimization with linear and bound constraints. Math. Program. 122, 144–151
(2020)
47. Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, H. Qian, Towards more efficient stochastic decen-
tralized learning: faster convergence and sparse communication, in Proceedings of the 35th
International Conference on Machine Learning (PMLR), vol. 80 (2018), pp. 4624–4633
48. K. Yuan, B. Ying, J. Liu, A. Sayed, Variance-reduced stochastic learning by networked agents
under random reshuffling. IEEE Trans. Signal Process. 67(2), 1–11 (2019)
49. H. Hendrikx, F. Bach, L. Massoulie, An accelerated decentralized stochastic proximal algo-
rithm for finite sums, in Advances in Neural Information Processing Systems, vol. 32 (2019),
pp. 4624–4633
50. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE
Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020)
51. R. Xin, U. Khan, S. Kar, Variance-reduced decentralized stochastic optimization with acceler-
ated convergence. IEEE Trans. Signal Process. 68, 6255–6271 (2020)
52. R. Xin, A. Sahu, S. Kar, U.A. Khan, Distributed empirical risk minimization over directed
graphs, in Proceedings of the 53rd Asilomar Conference on Signals, Systems, and Computers
(2019). https://doi.org/10.1109/IEEECONF44664.2019.9049065
53. Q. Liu, S. Yang, Y. Hong, Constrained consensus algorithms with fixed step size for distributed
convex optimization over multi-agent networks. IEEE Trans. Autom. Control 62(8), 4259–
4265 (2017)
54. M. Bazaraa, H. Sherali, C. Shetty, Nonlinear Programming: Theory and Algorithms, 3rd edn.
(John Wiley & Sons, Hoboken, 2006)
55. S. Alghunaim, E.K. Ryu, K. Yuan, A.H. Sayed, Decentralized proximal gradient algorithms
with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021)
56. D. Dua, C. Graff, UCI machine learning repository, Dept. School Inf. Comput. Sci., Univ.
California, Irvine, CA, USA (2019)
Chapter 3
Proximal Algorithms for Distributed
Coupled Optimization

Abstract In this chapter, we consider a multi-node sharing problem, where each node possesses a local smooth function that is further considered as the average
of several constituent functions, and the network aims to minimize a finite sum
of all local functions plus a coupling function (possibly non-smooth). Due to its
benefits in scalability, robustness, and flexibility, distributed optimization has been a
significant focus in engineering research to tackle this problem. To accomplish this,
an equivalent saddle-point problem of this problem that is amenable to distributed
solutions is first formulated. Then, a novel distributed stochastic algorithm called
VR-DPPD is proposed, which combines the variance-reduction technique of SAGA
with the distributed proximal primal–dual method. We present a convergence
analysis and demonstrate that if smooth local functions are strongly convex, VR-
DPPD converges linearly to the exact optimal solution in expectation. With a novel
linearly convergent algorithm that achieves low computation costs, our work advances
efforts to solve a general composite optimization problem with a convex (possibly
non-smooth) coupling function. The viability and performance of VR-DPPD are
demonstrated numerically.

Keywords Distributed optimization · Non-smooth coupling function · Proximal primal–dual method · Variance reduction · Linear convergence

3.1 Introduction

Over the last decade, distributed optimization over networks has become a hotspot
of research along with the rapid development of network technologies, where nodes
focus on minimizing the sum of local functions (owned by each node) through local
communication [1, 2]. Traditional centralized approaches to solving optimization
problems usually require an entity to obtain essential information from all nodes,
which are costly, prone to the single point of failure, and lack robustness to new
environments. Compared to centralized algorithms, distributed algorithms avoid
long-distance communication and have greater flexibility and scalability because
they have the ability to decompose large-scale problems into a series of smaller

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_3

problems [3, 4]. Considering this, distributed algorithms possess better robustness,
less communication, and good privacy protection in many applications [5–14],
including but not limited to machine learning [5, 6], online optimization [7, 8],
privacy masking [9, 10], resource allocation [11, 12], and data processing [13, 14].
Recently, researchers have made significant efforts to study distributed approaches for successfully solving optimization problems [15–24]. Distributed approaches that rely only on gradient information make up the majority reported in the literature because of their good performance and excellent scalability [1, 2]. There
are quite a few known approaches such as distributed gradient descent (DGD)
[15, 16], distributed dual averaging (DDA) [17], EXTRA [18, 19], distributed
ADMM [20], distributed adaptive diffusion [21], and distributed gradient tracking
[22–24]. Based on these known approaches [15–24], many efficient distributed approaches have been proposed to handle practical factors in the problem or to achieve desired targets, such as transmission delays [25],
complex constraints [26], computation efficiency [27], communication efficiency
[28], privacy security [29], etc. Besides the aforementioned works on discrete-time iteration, distributed continuous-time approaches have also been well investigated [30], which exhibit flexible applications in continuous-time physical systems and hardware implementations [31].
In addition to the aforementioned works handling problems with a single objective, composite optimization problems with smooth plus non-smooth objectives have sparked considerable interest in the distributed optimization community due to their broad applications. Typical approaches for solving composite optimization problems include the (fast) distributed proximal gradient method [32], the distributed linearized ADMM (DL-ADMM) [33], PG-ADMM [34], PG-EXTRA [35], and NIDS [36]. From the perspective of convergence rate, these approaches only achieve sublinear rates, and there is still a clear gap compared with their centralized counterparts. Only recently have linearly convergent distributed approaches been investigated to fill this gap [37].
In particular, the authors in [37] proposed a distributed proximal gradient algorithm
based on a general primal–dual algorithmic framework, which not only attained a
linear convergence rate but also unified many existing related approaches. Then,
based on the gradient tracking mechanism, the authors in [38, 39] introduced the
NEXT/SONATA algorithm with linear convergence. Subsequently, the authors in
[40] gave a unified distributed algorithmic framework to obtain similar results on the
basis of operator splitting theory. For the case where the non-smooth function couples all nodes, the authors in [41, 42] first designed a distributed proximal primal–dual algorithm from the novel perspective of transforming the saddle-point problem and theoretically established its linear convergence. Under a much weaker condition than the strong convexity assumed in [37–41], the authors in [43] developed a distributed randomized block-coordinate proximal algorithm, which achieved asymptotic linear convergence. However, all these approaches are burdened by the cost of computing full local gradients.

Recently, some distributed stochastic gradient methods have emerged [44–47].


Then, many representative centralized stochastic gradient approaches, including
S2GD [48], SAG [49], SAGA [50], SVRG [51], and SARAH [52], have adopted
various variance-reduction techniques to decrease the variance of the stochastic
gradient and improve the convergence. Inspired by Konecny et al. [48], Schmidt
et al. [49], Defazio et al. [50], Johnson and Zhang [51], Nguyen et al. [52], many
distributed variance-reduced approaches [28, 53–58] have been extensively investi-
gated. In particular, the authors in [56] proposed two novel distributed algorithms,
namely GT-SAGA and GT-SVRG, to achieve optimal rates for distributed stochastic
convex problems. Then, the authors in [57, 58] proposed a combination of gradient
tracking and variance reduction (SAGA-Type and SARAH-type) to address the
stochastic and distributed nature of the non-convex problem very well. In addition,
the authors in [59] investigated a novel stochastic and distributed algorithm to solve
the nonconvex and non-smooth constrained optimization problem.
However, we notice that there is currently hardly any linearly convergent distributed stochastic algorithm dedicated to the multi-node sharing problem (a composite optimization problem) in which the non-smooth function couples all nodes. Thus, the main motivation of this chapter is to develop a distributed stochastic algorithm that not only converges linearly to the exact optimal solution but also reduces the computational burden.
In this chapter, we consider a general multi-node sharing problem that subsumes
several real applications [41, 42]. We do not directly solve the primal problem but
attempt to transform it into an equivalent saddle-point problem that is amenable
to distributed solutions. Based on this, we propose a novel distributed stochastic
algorithm which possesses adaptability in real-world applications. Relative to the existing works, the novelties of the present work are summarized as follows:
(i) A novel distributed stochastic algorithm (named VR-DPPD), which combines the variance-reduction technique of SAGA with the distributed proximal primal–dual method, is proposed. Different from [15–24], VR-DPPD can successfully solve a general composite optimization problem [41, 42] concerning a coupling function (possibly non-smooth) in a distributed fashion.
(ii) Unlike the existing stochastic gradient methods [44–47], whose gradient estimates retain nonvanishing variance, VR-DPPD leverages the unbiased stochastic average gradient (SAGA) to estimate the local gradients, which reduces the variance of the stochastic gradient and promotes the convergence substantially. Using SAGA, VR-DPPD outperforms some distributed approaches [32–36, 38–42] in the computation cost of local gradients.
(iii) The convergence and convergence rate of VR-DPPD are rigorously analyzed. In particular, we demonstrate that VR-DPPD converges linearly to the exact optimal solution in expectation if the smooth local functions are strongly convex. As far as we know, this is the first linearly convergent distributed stochastic algorithm that focuses on solving the general multi-node sharing problem exactly.

3.2 Preliminaries

3.2.1 Notation

If not particularly stated, the vectors mentioned in this chapter are column vectors. Let the symbol ‖·‖ denote the 2-norm. Given a vector x and a positive semi-definite matrix W, we denote $\|x\|_W^2 = x^T W x$. The Kronecker product is denoted as ⊗. The vector that stacks x_1, · · · , x_n on top of each other is indicated as $\mathrm{col}\{x_i\}_{i=1}^{n}$. The symbol $\mathrm{blkdiag}\{X_i\}_{i=1}^{n}$ denotes the block diagonal matrix that consists of diagonal blocks {X_i}. The proximal operator of a function f : R^n → R at x is defined as $\mathrm{prox}_{\eta f}(x) = \arg\min_{y\in\mathbb{R}^n}\{f(y) + \frac{1}{2\eta}\|x - y\|^2\}$, where η > 0 is a parameter (step-size). We denote the conjugate function of a function f at x ∈ R^n as $f^\dagger(x) = \sup_{y\in\mathbb{R}^n}\{x^T y - f(y)\}$. The gradient of a function f at x ∈ R^n is denoted as ∇f(x). The subdifferential ∂f(x) of a function f at x ∈ R^n is the set of all subgradients.
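As a concrete illustration of the proximal operator defined above, consider the hypothetical example f(y) = ‖y‖₁ (not used elsewhere in this chapter), whose prox has the well-known closed form "soft-thresholding":

```python
import numpy as np

# Hypothetical illustration of the prox definition above: for f(y) = ||y||_1,
# prox_{eta*f}(x) = argmin_y { ||y||_1 + (1/(2*eta))*||x - y||^2 } has the
# closed form "soft-thresholding", applied component-wise.
def prox_l1(x, eta):
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

x = np.array([2.0, -0.3, 0.1])
print(prox_l1(x, 0.5))  # components with |x_j| <= eta are set to zero
```

Components larger than the step-size η are shrunk toward zero by η; smaller ones are set exactly to zero, which is why prox steps can produce sparse iterates.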

3.2.2 Model of Optimization Problem

This chapter is focused on cooperatively solving a general composite optimization problem, which is defined over an undirected and connected network of n nodes, as follows:

$$\min_{x_1,\cdots,x_n} \; \sum_{i=1}^{n} f_i(x_i) + g\Big(\sum_{i=1}^{n} A_i x_i\Big), \tag{3.1}$$

where f_i : R^{q_i} → R is a convex function that we view as the private cost of node i, g : R^p → R ∪ {+∞} is a convex, possibly non-smooth, global cost function known by all nodes,¹ and the matrix A_i ∈ R^{p×q_i} (full row rank) is a linear transform only known by node i. Furthermore, each f_i is described by

$$f_i(x_i) = \frac{1}{m_i}\sum_{h=1}^{m_i} f_{i,h}(x_i), \quad i = 1, \cdots, n.$$

¹ Here, it is also worth noticing that the non-smooth function g may be expressed as an indicator
function for inequality constraints or equality constraints. For example, in a distributed resource
allocation problem [60], this non-smooth term may be an indicator function of the equality
constraints. In a distributed ridge regression problem [42], this non-smooth term may be an
indicator function of the inequality constraints. In addition, the non-smooth function g may
represent the regularization term, see, e.g., [3].

Here, several quantities that will support the problem reformulation are defined below:

$$x = \mathrm{col}\{x_i\}_{i=1}^{n} \in \mathbb{R}^{q}, \quad q = \sum_{i=1}^{n} q_i, \quad f(x) = \sum_{i=1}^{n} f_i(x_i), \quad A = [A_1, \cdots, A_n] \in \mathbb{R}^{p\times q}.$$

Then, problem (3.1) can be rewritten as²

$$\min_{x} \; f(x) + g(Ax). \tag{3.2}$$

Moreover, we make the assumptions on the constituent functions fi,h and the global
cost function g below.
Assumption 3.1 (i) Each local constituent function f_{i,h}, i ∈ {1, . . . , n}, h ∈ {1, . . . , m_i}, is β-smooth and α-strongly convex. (ii) The function g : R^p → R ∪ {+∞} is proper, lower semi-continuous, and convex. (iii) There is x ∈ R^q such that Ax belongs to the relative interior of the domain of g.
Notice from Assumption 3.1 that the global cost function f : R^q → R is also α-strongly convex and β-smooth, where 0 < α ≤ β, and that problem (3.2) possesses a unique optimal solution $x^* = \mathrm{col}\{x_1^*, \cdots, x_n^*\} \in \mathbb{R}^q$, which achieves the minimum of this problem.

3.2.3 Motivating Examples

Problem (3.1) is the sharing problem, where different individual variables possessed
by nodes are coupled through a function g . Notice that problems of form (3.1)
appear in many engineering applications [1], including smart grids, basis pursuit,
and resource allocation in wireless networks. They also appear in machine learning
applications [1], such as regression over distributed features. Here, we provide two
motivational physical applications that fit (3.1).
Example 1 A well-known example of problem (3.1) is the distributed resource
allocation problem. Inspired by Xu et al. [60], Scaman et al. [62], we set the
distributed resource allocation problem as follows:


$$\min_{x_1,\cdots,x_n} \; C(x) = \sum_{i=1}^{n} C_i(x_i), \quad \text{s.t.} \quad \sum_{i=1}^{n} (x_i - r_i) = 0, \tag{3.3}$$

² If g = 0 and A is, for instance, the Laplacian matrix of the graph (or a square root of the Laplacian matrix), then problem (3.2) becomes a general consensus optimization problem that can be well solved by distributed primal–dual algorithms, such as [31, 40, 59, 61]. From this perspective, problem (3.2) is more general.

where x = [x_1, x_2, . . . , x_n]^T ∈ R^n is the optimal estimator and x_i ∈ R is the number of resources allocated to node i. The function C_i : R → R is convex, representing the cost incurred by the resource x_i. The equality constraint, $\sum_{i=1}^{n}(x_i - r_i) = 0$, couples the nodes' decisions, where r_i ∈ R is a local virtual demand of resource for node i. Problem (3.3) can be transformed into problem (3.1) by defining an indicator function g(·) : R → R ∪ {+∞} such that

$$g(\check{x}) = \begin{cases} 0, & \text{if } \check{x} = a, \\ +\infty, & \text{otherwise}, \end{cases}$$

which is non-smooth in terms of x̌. Here, x̌ and a represent the coupled term $\sum_{i=1}^{n} x_i$ and the total demand $\sum_{i=1}^{n} r_i$, respectively. Based on this, problem (3.3) can be further equivalently transformed into the following composite non-smooth problem:

$$\min_{x_1,\cdots,x_n} \; \sum_{i=1}^{n} \frac{1}{m_i}\sum_{h=1}^{m_i} C_{i,h}(x_i) + g\Big(\sum_{i=1}^{n} x_i\Big),$$

which has the same form as problem (3.1) with $C_i(x_i) = (1/m_i)\sum_{h=1}^{m_i} C_{i,h}(x_i)$ and A_i = 1 for all i. Here, we note that the above transformed problem is reasonable because, in the actual resource allocation problem, we want to minimize the cost of the entire network while coordinating the optimization variables of the nodes, and the cost encountered by each node is itself composed of many constituent costs.
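To make Example 1 concrete, the following toy sketch assumes hypothetical quadratic costs C_i(x_i) = (x_i − d_i)², where the targets d_i are illustrative and not from the book; under this assumption the KKT conditions of (3.3) give a closed-form allocation:

```python
import numpy as np

# Toy instance of problem (3.3), assuming quadratic costs C_i(x_i) = (x_i - d_i)^2
# with hypothetical targets d_i. The KKT conditions give each node the same shift:
# x_i = d_i + (sum(r) - sum(d)) / n, which satisfies the coupling constraint.
def allocate(d, r):
    shift = (np.sum(r) - np.sum(d)) / len(d)  # common Lagrange-multiplier shift
    return d + shift

d = np.array([1.0, 2.0, 3.0])   # per-node cost minimizers
r = np.array([2.0, 2.0, 2.0])   # local virtual demands r_i
x = allocate(d, r)
print(x, np.sum(x - r))          # the constraint sum_i (x_i - r_i) = 0 holds
```

Because the coupling enters only through one scalar equality, every node shifts its unconstrained minimizer by the same amount; distributed algorithms must recover this common multiplier through local communication instead of the global sums used here.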
Example 2 Another notable example of problem (3.1) is the distributed logistic
regression problem, which possesses important applications in machine learning
[42, 53, 56–58] and can be described as follows:


$$\min_{x_1,\cdots,x_n} \; \sum_{i=1}^{n} \frac{1}{m_i}\sum_{h=1}^{m_i} \ln\big(1 + \exp\big(-b_{i,h} c_{i,h}^T x_i\big)\big), \quad \text{s.t.} \quad x_i = x_j,$$

∀j ∈ N_i, where N_i is the set of neighbors of node i. Here, b_{i,h} ∈ {−1, 1} and c_{i,h} ∈ R^{q_i}, h ∈ {1, . . . , m_i}, are the local data kept by node i. Similar to Example 1, the above problem can be transformed into problem (3.1) if we define the indicator function g(·) such that

$$g(\cdot) = \begin{cases} 0, & \text{if } x_1 = \cdots = x_n, \\ +\infty, & \text{otherwise}. \end{cases}$$

In this case, the matrix A = [A1 , · · · , An ] is sparse and encodes the communication
network between nodes. It is worth noticing that the matrix A in the problem (3.1) is
not necessarily sparse, and Ai is a private matrix only known by node i . Therefore,
the general distributed logistic regression problem can be transformed into a special
case of problem (3.1) if we utilize the above indicator function g .
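For Example 2, one node's local objective and gradient can be evaluated as in the following minimal sketch, where randomly generated data stand in for the local samples (b_{i,h}, c_{i,h}):

```python
import numpy as np

# Minimal sketch (hypothetical data) of one node's local logistic objective
# f_i(x_i) = (1/m_i) * sum_h ln(1 + exp(-b_{i,h} c_{i,h}^T x_i)) and its gradient,
# i.e., the constituent-gradient evaluations this chapter refers to.
def local_loss(x, C, b):
    z = -b * (C @ x)                      # margins, shape (m_i,)
    return np.mean(np.log1p(np.exp(z)))

def local_grad(x, C, b):
    z = -b * (C @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))      # logistic function of each margin
    return -(C * (b * sigma)[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))               # m_i = 5 samples, q_i = 3 features
b = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
x = np.zeros(3)
print(local_loss(x, C, b))                # equals ln(2) at x = 0
```

Each term of the sum is one constituent function f_{i,h}; the SAGA estimator introduced later samples these terms instead of averaging all m_i gradients per step.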

3.2.4 The Saddle-Point Reformulation

Notice from Assumption 3.1 that strong duality holds (Corollary 31.2.1 in [63]). Similar to the existing works [41], since g is closed and convex we have $g(Ax) = \sup_{y}\{y^T Ax - g^\dagger(y)\}$, and the saddle-point reformulation of problem (3.2) is given by Proposition 19.18 in [64]:

$$\min_{x} \max_{y} \; f(x) + y^T A x - g^\dagger(y), \tag{3.4}$$

where y ∈ R^p is the dual variable related to the coupled term Ax. Furthermore, (x*, y*) is an optimal solution of (3.4) if, and only if, it meets the following necessary and sufficient optimality conditions of problem (3.4) [62, Proposition 19.18]:

$$-A^T y^* = \nabla f(x^*), \quad A x^* \in \partial g^\dagger(y^*), \tag{3.5}$$

where $\nabla f(x^*) = [\nabla f_1(x_1^*)^T, \cdots, \nabla f_n(x_n^*)^T]^T$. Notice that the dual variable y in
(3.4) is multiplied by A, which couples all nodes. Thus, algorithms solving problem
(3.4) directly in a distributed fashion cannot exist because the dual update needs to
be calculated by a central coordinator. Further reformulation is required to arrive at a distributed solution. To this end, we in the following reformulate problem (3.4) into another equivalent saddle-point problem that avoids the above dilemma.
First, let yi be a local copy of y at node i , and the following quantities are needed:

$$\hat{y} = \mathrm{col}\{y_i\}_{i=1}^{n} \in \mathbb{R}^{pn}, \quad G^\dagger(\hat{y}) = \frac{1}{n}\sum_{i=1}^{n} g^\dagger(y_i), \quad A_c = \mathrm{blkdiag}\{A_i\}_{i=1}^{n} \in \mathbb{R}^{pn\times q},$$

and we further denote the symmetric matrix L ∈ Rpn×pn to satisfy that

$$L\hat{y} = 0 \iff y_1 = y_2 = \cdots = y_n. \tag{3.6}$$

Then, another saddle-point reformulation is

$$\min_{x,z} \max_{\hat{y}} \; f(x) + \hat{y}^T A_c x + \hat{y}^T L z - G^\dagger(\hat{y}), \tag{3.7}$$

where z = col{zi }ni=1 ∈ Rpn . Since the matrix Ac is block diagonal and the matrix
L encodes the network sparsity structure, it suffices to conclude that problem (3.7)
can be resolved in a distributed manner. Then, the optimality conditions of problem
(3.7) are given as follows [41]:

$$\begin{cases} -A_c^T \hat{y}^* = \nabla f(x^*), \\ L\hat{y}^* = 0, \\ A_c x^* + L z^* \in \partial G^\dagger(\hat{y}^*), \end{cases} \tag{3.8}$$

where (x ∗ , z∗ , ŷ ∗ ) is an optimal solution of (3.7).



According to (3.8), the following lemma will show that problems (3.4) and (3.7)
share the same optimal solution in terms of x . Since the saddle-point reformulation
from (3.4) to (3.7) is the same as [41], this result can be directly followed from [41]
and we just show it here for completeness.
Lemma 3.1 (Adapted from Lemma 1 in [41]) If $(x^*, z^*, \hat{y}^*)$ fulfills the optimality condition (3.8), then $\hat{y}^* = 1_n \otimes y^*$ holds, where $(x^*, y^*)$ satisfies (3.5).
Lemma 3.1 indicates that if a designed algorithm for solving problem (3.7)
can achieve the optimal solution (x ∗ , z∗ , ŷ ∗ ), then the corresponding primal-dual
pair (x ∗ , y ∗ ) is the optimal solution to problem (3.4). That is to say, the designed
algorithm can solve problem (3.4) indirectly and problems (3.1), (3.4), and (3.7)
share the same optimal solution in terms of x . However, unlike problem (3.4),
problem (3.7) can be resolved in a distributed manner because the matrices Ac and
L encode the network sparsity structure. Therefore, based on Lemma 3.1, we have
the opportunity to design distributed algorithms to finally solve the problem (3.1).

3.3 Algorithm Development

Based on the reformulation of the saddle-point problem in the previous section, we give the construction of the algorithm in this section. First, the unbiased stochastic average gradient (SAGA) is introduced.

3.3.1 Unbiased Stochastic Average Gradient (SAGA)

Since the local function fi (xi ) at each node i is the average of mi local constituent
functions fi,h (xi ), the implementations of most existing distributed primal–dual
algorithms [41, 43] require that each node i at time t ≥ 0 calculates the local full
gradient of fi at xit as

$$\nabla f_i(x_i^t) = \frac{1}{m_i}\sum_{h=1}^{m_i} \nabla f_{i,h}(x_i^t), \quad i = 1, \cdots, n, \tag{3.9}$$

which may result in high computation cost when the number of constituent functions
mi is large. This issue motivates us to investigate an effective technique that
can improve computation efficiency significantly. Fortunately, unbiased stochastic
average gradient (SAGA) can be substituted for the local full gradients to resolve
this issue. The idea is to keep a gradient list of all constituent functions, where
a randomly selected element is replaced each time, and the average value of the
elements in this list is used for gradient approximation. Specifically, denote
χit ∈ {1, · · · , mi } as the function index of node i , which is uniformly and randomly

selected at time t. Then, let e_{i,h} be the auxiliary variable that records the iterate at which the constituent gradient of the function f_{i,h} was last evaluated. Therefore, the recursive updates of the variables e_{i,h} are

$$e_{i,h}^{t+1} = x_i^t, \; \text{if } h = \chi_i^t; \qquad e_{i,h}^{t+1} = e_{i,h}^t, \; \text{if } h \neq \chi_i^t.$$

At time t , the stochastic averaging gradient at node i is represented by

$$s_i^t = \nabla f_{i,\chi_i^t}(x_i^t) - \nabla f_{i,\chi_i^t}\big(e_{i,\chi_i^t}^t\big) + \frac{1}{m_i}\sum_{h=1}^{m_i} \nabla f_{i,h}\big(e_{i,h}^t\big), \tag{3.10}$$

where the gradients $\nabla f_{i,h}(e_{i,h}^t)$ need to be stored in a table data structure to enable the implementation of (3.10). Define $\mathcal{F}^t$ as the history of the system up until time t. Then, from the prior results of [28, 53], conditioned on $\mathcal{F}^t$, one obtains that

$$\mathbb{E}\big[s_i^t \,\big|\, \mathcal{F}^t\big] = \nabla f_i(x_i^t). \tag{3.11}$$

In (3.10), we note that, at each time, the computation of $s_i^t$, i ∈ {1, · · · , n}, is costly because of the calculation of $\sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^t)$. That is, we must face an O(m_i)-order computational cost if we naively implement the update in (3.10). In contrast, when we implement the following recursive formulation:

$$\sum_{h=1}^{m_i} \nabla f_{i,h}\big(e_{i,h}^t\big) = \sum_{h=1}^{m_i} \nabla f_{i,h}\big(e_{i,h}^{t-1}\big) + \nabla f_{i,\chi_i^{t-1}}\big(x_i^{t-1}\big) - \nabla f_{i,\chi_i^{t-1}}\big(e_{i,\chi_i^{t-1}}^{t-1}\big), \tag{3.12}$$

the above cost can be avoided and we can calculate $\sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^t)$ in a computationally efficient way. In addition, we also point out that the O(m_i)-order computational cost cannot be avoided in the existing methods [35–37, 40, 41, 43], which use deterministic gradient information.
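The estimator (3.10) together with the running-sum recursion (3.12) can be sketched as follows. The quadratic constituents f_{i,h}(x) = ½‖x − t_h‖² (with randomly generated data t_h) are an assumption made here so that the gradients are trivial, and the `SagaNode` class and its names are illustrative only:

```python
import numpy as np

# Sketch of one node's SAGA estimator (3.10) with the running-sum trick (3.12).
# Illustrative assumption: quadratic constituents f_{i,h}(x) = 0.5*||x - t_h||^2,
# so grad f_{i,h}(x) = x - t_h. The table sum is maintained in O(1) work per step.
class SagaNode:
    def __init__(self, targets, x0, rng):
        self.t = targets                                   # (m_i, q) constituent data
        self.e = np.tile(x0, (len(targets), 1))            # stored points e_{i,h}
        self.table_sum = np.sum(self.e - self.t, axis=0)   # sum_h grad f_{i,h}(e_{i,h})
        self.rng = rng

    def grad(self, x):
        h = self.rng.integers(len(self.t))                 # uniform draw chi_i^t
        g_new = x - self.t[h]                              # grad f_{i,h} at current x
        g_old = self.e[h] - self.t[h]                      # grad f_{i,h} at stored point
        s = g_new - g_old + self.table_sum / len(self.t)   # estimator (3.10)
        self.table_sum += g_new - g_old                    # recursion (3.12): O(1)
        self.e[h] = x
        return s

rng = np.random.default_rng(1)
node = SagaNode(rng.normal(size=(10, 2)), np.zeros(2), rng)
x = np.zeros(2)
for _ in range(1000):                 # plain stochastic descent on f_i alone
    x = x - 0.1 * node.grad(x)
print(x, np.mean(node.t, axis=0))     # x approaches the minimizer, the mean of t_h
```

Because stale gradients are swapped out of the table one index at a time, the estimator's variance vanishes as the iterates settle, which is exactly what the linear-convergence analysis of VR-DPPD exploits.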

3.3.2 Distributed Stochastic Algorithm (VR-DPPD)

Inspired by the variance-reduction technique of SAGA [55, 56, 58] and the distributed proximal primal–dual approach [41], we now propose a novel distributed stochastic algorithm (VR-DPPD) to resolve problem (3.7), followed by its distributed implementation.
Define the auxiliary variable w = col{wi }ni=1 ∈ Rpn and the stochastic gradient
s = col{si }ni=1 ∈ Rq . Let x 0 and ŷ 0 be any values, z0 = 0pn , and s 0 = ∇f (x 0 ). Then,

the general matrix form of VR-DPPD at t ≥ 0 is




$$\begin{cases} x^{t+1} = x^t - \eta_x s^t - \eta_x A_c^T \hat{y}^t, \\ w^{t+1} = \hat{y}^t + \eta_y A_c x^{t+1} + L z^t, \\ z^{t+1} = z^t - L w^{t+1}, \\ \hat{y}^{t+1} = \mathrm{prox}_{\eta_y G^\dagger}\big(B_c w^{t+1}\big), \end{cases} \tag{3.13}$$

where ηx and ηy are two constant step-sizes (tunable) and Bc = B ⊗ Ip with B ∈


Rn×n satisfying Assumption 3.2 described in the following. Here, we can conclude
that VR-DPPD (3.13) is a stochastic version of the algorithm (3.12) in [41] by using
the variance-reduction technique of SAGA [55, 56, 58].
Assumption 3.2 Suppose that the network is undirected and connected. Moreover, the matrix B is symmetric, doubly stochastic, and primitive. In addition, suppose that the matrix L satisfies condition (3.6) and

$$0 < I_{pn} - L^2, \qquad B_c^2 \le I_{pn} - L^2. \tag{3.14}$$

Remark 3.2 Requiring the matrix B to satisfy the condition in Assumption 3.2 is necessary to ensure that all nodes converge to the same optimal variable, and such a matrix is not difficult to construct over an undirected connected network; see, e.g., the lazy Metropolis matrix designed in [23]. Moreover, the eigenvalues of B_c belong to (−1, 1]. Then, given B_c, there are many choices for the matrix L. For example, we can set L² = I_{pn} − B_c² and check whether the condition 0 < I_{pn} − L² holds. If it does, then Assumption 3.2 is satisfied. If not, we can let L² = d(I_{pn} − B_c²) for some d ∈ (0, 1). Although there are many choices for the matrices B_c and L, we keep only one choice for simplicity of the following analysis and presentation.
By utilizing the design idea of the lazy Metropolis matrix (for more details,
please refer to [23]), a primitive symmetric doubly-stochastic matrix B̃ = [b̃ij ] ∈
Rn×n is first constructed, which satisfies that b̃ij > 0 if two nodes i and j are
connected through an edge over undirected connected networks, and b̃ij = 0
otherwise. Then, let B = (In + B̃)/2 = [bij ] ∈ Rn×n , which satisfies Assumption 3.2.
Based on this, we can set L2 = Ipn − Bc2 to satisfy the related conditions in
Assumption 3.2.
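The construction above can be sketched numerically for p = 1 (so B_c = B); the 4-node graph and the Metropolis weight rule b̃_ij = 1/(1 + max{deg_i, deg_j}) are illustrative assumptions, one common way to obtain a symmetric doubly-stochastic B̃:

```python
import numpy as np

# Sketch of the matrix construction described above, for p = 1 (so B_c = B).
# The graph and the Metropolis weight rule are illustrative choices.
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
n = 4
deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

B_tilde = np.zeros((n, n))
for i, j in edges:
    B_tilde[i, j] = B_tilde[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
np.fill_diagonal(B_tilde, 1.0 - B_tilde.sum(axis=1))   # make rows sum to 1

B = (np.eye(n) + B_tilde) / 2      # lazy step: eigenvalues of B now lie in (0, 1]
L2 = np.eye(n) - B @ B             # the choice L^2 = I - B_c^2 from Remark 3.2

# Condition (3.14): 0 < I - L^2 and B^2 <= I - L^2 (here with equality).
assert np.all(np.linalg.eigvalsh(np.eye(n) - L2) > 0)
assert np.allclose(B.sum(axis=0), 1) and np.allclose(B, B.T)
```

Because B = (I_n + B̃)/2 shifts all eigenvalues into (0, 1], the choice L² = I_n − B² automatically satisfies 0 < I_n − L² for this construction.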
With the above choices of matrices in hand, we here show the distributed implementation of (3.13). Specifically, it follows from the updates of w^t and z^t in (3.13) that, for all t ≥ 1,

$$w^{t+1} = (I_{pn} - L^2) w^t + \hat y^t - \hat y^{t-1} + \eta_y A_c (x^{t+1} - x^t), \tag{3.15}$$

which eliminates the auxiliary variable z^t, and algorithm (3.13) can be rewritten as follows:

$$\begin{cases} x^{t+1} = x^t - \eta_x s^t - \eta_x A_c^T \hat y^t \\ w^{t+1} = (I_{pn} - L^2) w^t + \hat y^t - \hat y^{t-1} + \eta_y A_c (x^{t+1} - x^t) \\ \varphi^{t+1} = B_c w^{t+1} \\ \hat y^{t+1} = \mathrm{prox}_{\eta_y G^\dagger}\big(\varphi^{t+1}\big), \end{cases}$$

where ϕ = col{ϕ_i}_{i=1}^n ∈ R^{pn} is an auxiliary variable. Utilizing the previous choices of the matrices B_c and L, it suffices to find that the i-th block of each of the vectors {x, w, ϕ, ŷ} can be updated by node i for i = 1, ..., n. For any i, we let x_i^0, y_i^0 be any values, ϕ_i^0 = φ_i^0 (φ = col{φ_i}_{i=1}^n ∈ R^{pn} is an auxiliary variable), e_{i,h}^0 = x_i^0, h = 1, ..., m_i, and s_i^0 = ∇f_i(x_i^0). From Moreau's decomposition [64] (i.e., a = prox_{ηg}(a) + η prox_{η^{-1}g†}(η^{-1}a), for all a ∈ R^p and η > 0) and letting (η_y/n) = η, the updates of VR-DPPD at each node i are given as follows:³ for all t ≥ 1,

$$\begin{cases} x_i^{t+1} = x_i^t - \eta_x s_i^t - \eta_x A_i^T y_i^t \\ \phi_i^{t+1} = y_i^t + \eta_y A_i x_i^{t+1} \\ w_i^{t+1} = \varphi_i^t + \phi_i^{t+1} - \phi_i^t \\ \varphi_i^{t+1} = \sum_{j=1}^{n} b_{ij} w_j^{t+1} \\ y_i^{t+1} = \varphi_i^{t+1} - \eta\, \mathrm{prox}_{\eta^{-1} g}\big(\eta^{-1} \varphi_i^{t+1}\big). \end{cases} \tag{3.16}$$
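The Moreau decomposition invoked in this step can be checked numerically; a minimal sketch, assuming the illustrative choice g(a) = ||a||_1 (whose prox is soft-thresholding and whose conjugate g† is the indicator of the unit ∞-ball, with a clip as its prox), not the chapter's g:

```python
import numpy as np

# Numerical check of Moreau's decomposition a = prox_{ηg}(a) + η·prox_{η^{-1}g†}(η^{-1}a),
# for the illustrative choice g(a) = ||a||_1.
def prox_l1(a, eta):
    # soft-thresholding: the prox of η·||·||_1
    return np.sign(a) * np.maximum(np.abs(a) - eta, 0.0)

def prox_conj_l1(a):
    # prox of the indicator of {||y||_inf <= 1}: a clip, at any prox parameter
    return np.clip(a, -1.0, 1.0)

rng = np.random.default_rng(1)
a = rng.standard_normal(6)
eta = 0.7
rhs = prox_l1(a, eta) + eta * prox_conj_l1(a / eta)
assert np.allclose(a, rhs)   # the decomposition holds componentwise
```

This is exactly why the y-update in (3.16) can avoid evaluating the proximal mapping of the conjugate g† directly: the conjugate prox is expressed through the (often cheaper) prox of g itself.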

Here, the pseudo-code of the VR-DPPD algorithm is outlined in Algorithm 2.


To locally implement Algorithm 2, suppose that each node i ∈ V has a gradient table containing all gradients ∇f_{i,h}, ∀h ∈ {1, ..., m_i}, evaluated at the local primal variable x_i or at the local auxiliary variables e_{i,h}. At each iteration t + 1, each node i first chooses one index χ_i^t ∈ {1, ..., m_i} from its own data batch uniformly at random and then updates the local stochastic gradient s_i^t via (3.10). After updating s_i^t, the local auxiliary variable e_{i,χ_i^t}^{t+1} is assigned the value of the local variable x_i^t, and the newly computed gradient ∇f_{i,χ_i^t}(x_i^t) replaces the entry ∇f_{i,χ_i^t}(e_{i,χ_i^t}^t) in the χ_i^t-th gradient table position, while the other entries remain unchanged. Then, each node i sequentially updates the local variables x_i^{t+1}, φ_i^{t+1}, and w_i^{t+1}. Subsequently, each node i transmits b_{ji} w_i^{t+1} to its neighbors j ∈ N_i, where N_i is the neighbor set of i, receives b_{ij} w_j^{t+1} from its neighbors j, and updates ϕ_i^{t+1}. Finally, each node i updates y_i^{t+1}. Since g is a global cost function known by all nodes, we can deduce that Algorithm 2 can be implemented in a distributed manner based on the above implementation process.

³ By using Moreau's decomposition [64], the update of y_i^{t+1} can be executed in a convenient fashion that avoids computing the proximal mapping of the conjugate function g† at each time.
Algorithm 2 Distributed stochastic algorithm (VR-DPPD) - from the view of node i

1: Initialization: Each node i initializes x_i^0 ∈ R^{q_i}, y_i^0 ∈ R^p, ϕ_i^0 = φ_i^0 ∈ R^p, e_{i,h}^0 = x_i^0, h = 1, ..., m_i, and s_i^0 = ∇f_i(x_i^0). Each node i knows the step-sizes η_x and η_y.
2: for t = 0, 1, 2, ... do
3:   if t = 0 then
4:     Compute and store the sum of the local gradients Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^0), which is actually ∇f_i(x_i^0), and let s_i^0 = ∇f_i(x_i^0).
5:   else
6:     Choose χ_i^t uniformly at random from {1, ..., m_i}.
7:     Compute and store the summation term Σ_{h=1}^{m_i} ∇f_{i,h}(e_{i,h}^t) according to (3.12).
8:     Compute the stochastic averaging gradient s_i^t via (3.10).
9:     Take e_{i,χ_i^t}^{t+1} = x_i^t and replace ∇f_{i,χ_i^t}(e_{i,χ_i^t}^t) by ∇f_{i,χ_i^t}(x_i^t) in the corresponding χ_i^t gradient table position. All other e_{i,h}^{t+1}, ∀h ≠ χ_i^t, and gradient entries in the table remain unchanged, i.e., e_{i,h}^{t+1} = e_{i,h}^t and ∇f_{i,h}(e_{i,h}^{t+1}) = ∇f_{i,h}(e_{i,h}^t) for all h ≠ χ_i^t.
10:  end if
11:  Update x_i^{t+1}, φ_i^{t+1}, and w_i^{t+1} sequentially via (3.16).
12:  Transmit b_{ji} w_i^{t+1} to its neighbors j ∈ N_i.
13:  Receive b_{ij} w_j^{t+1} from its neighbors j ∈ N_i, and update ϕ_i^{t+1} via (3.16).
14:  Update y_i^{t+1} via (3.16).
15: end for
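One round of the node-level updates (3.16) can be sketched mechanically; a toy 2-node setup, assuming A_i = I_p, a toy stand-in for the SAGA gradient s_i^t, and g taken as the indicator of {v ≤ a} so that prox_{η^{-1}g} is a componentwise minimum (all illustrative choices, not the chapter's experimental setup):

```python
import numpy as np

# Mechanical sketch of one VR-DPPD round (3.16) on a 2-node network with A_i = I_p.
# The mixing matrix, step-sizes, and the toy gradient stand-in are illustrative.
n, p = 2, 3
rng = np.random.default_rng(2)
B = np.array([[0.75, 0.25], [0.25, 0.75]])  # symmetric doubly stochastic
a = np.ones(p)
eta_x, eta_y = 0.05, 2.0
eta = eta_y / n                              # the scaling (η_y / n) = η used in (3.16)

x = rng.standard_normal((n, p))
y = np.zeros((n, p))
phi = np.zeros((n, p))                       # φ_i
varphi = np.zeros((n, p))                    # ϕ_i

def s_hat(i, xi):
    # stand-in for the SAGA gradient s_i^t of Algorithm 2 (toy quadratic 0.5||x_i - i||^2)
    return xi - i

# Per-node updates of (3.16); with A_i = I_p, the products A_i x_i reduce to x_i.
x_new = np.stack([x[i] - eta_x * s_hat(i, x[i]) - eta_x * y[i] for i in range(n)])
phi_new = np.stack([y[i] + eta_y * x_new[i] for i in range(n)])
w_new = varphi + phi_new - phi
varphi_new = B @ w_new                       # exchange of the b_ij-weighted w_j's
# y-update via Moreau: y_i = ϕ_i - η·prox_{η^{-1}g}(η^{-1}ϕ_i) = ϕ_i - η·min(ϕ_i/η, a)
y_new = varphi_new - eta * np.minimum(varphi_new / eta, a)

# For this indicator g, the y-update is exactly a positive-part map.
assert np.allclose(y_new, np.maximum(0.0, varphi_new - eta * a))
```

The matrix product B @ w_new stands in for the transmit/receive steps (lines 12 and 13 of Algorithm 2): row i of the result is the sum of the received b_ij w_j^{t+1}.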

Remark 3.3 At present, the classical proximal primal–dual fixed-point method (PDFP) [65] and the more recent distributed methods (including the proximal exact dual diffusion method (PED²) [41], the dual consensus proximal algorithm (DCPA) [42], and the primal–dual hybrid gradient method (PDHG) [66]) solve the minimization problem from the novel perspective of transforming (3.1) into a saddle-point formulation. When dealing with large-scale tasks, the above methods [41, 66, 67] may suffer from high computation costs. Compared with [41, 42, 66, 67], Algorithm 2 leverages the variance-reduction technique of SAGA [55, 58] in order to evaluate the local full gradient in a more cost-efficient way. In addition, via Moreau's decomposition [64], the update of y_i^{t+1} in Algorithm 2 does not need to calculate the proximal mapping of the conjugate function g† at each time, and hence can be implemented in a convenient manner.

3.4 Convergence Analysis

In this section, we show the convergence behavior of VR-DPPD (3.13). First, some
auxiliary results related to the stochastic gradient and the fixed-point of (3.13) are
provided for further convergence analysis.

3.4.1 Auxiliary Results


Before introducing the results, we define an auxiliary sequence v^t = Σ_{i=1}^n v_i^t ∈ R, ∀t ≥ 0, where

$$v_i^t = \frac{1}{m_i} \sum_{h=1}^{m_i} \Big( f_{i,h}(e_{i,h}^t) - f_{i,h}(x_i^*) - \nabla f_{i,h}(x_i^*)^T (e_{i,h}^t - x_i^*) \Big).$$

Here f_{i,h}(e_{i,h}^t) − f_{i,h}(x_i^*) − ∇f_{i,h}(x_i^*)^T(e_{i,h}^t − x_i^*), ∀t ≥ 0, is non-negative by the strong convexity of the local constituent function f_{i,h}, and thus v^t, ∀t ≥ 0, is also non-negative. To simplify notation, we denote E[· | F^t] = E_t, ∀t ≥ 1, in the following analysis. Moreover, we define ∇f(x^t) = [∇f_1(x_1^t)^T, ..., ∇f_n(x_n^t)^T]^T.
Lemma 3.4 (Adapted from Lemma 6 in [53]) Consider the definition of v^t. Under Assumption 3.1, the following recursive relation holds: ∀t ≥ 0,

$$\mathbb{E}_t[v^{t+1}] \le \frac{1}{\check m}\Big( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \Big) + \Big(1 - \frac{1}{\hat m}\Big) v^t, \tag{3.17}$$

where m̌ and m̂ are the smallest and largest numbers of local constituent functions over the whole network, respectively, i.e., m̌ = min_{i∈{1,...,n}} m_i and m̂ = max_{i∈{1,...,n}} m_i.
In addition, an upper bound on the mean-squared deviation between the stochastic gradient s^t and the gradient ∇f(x^*), i.e., E_t[||s^t − ∇f(x^*)||²], is given below; its proof can be found in [53].

Lemma 3.5 (Adapted from Lemma 4 in [53]) Under Assumption 3.1, the following recursive relation holds: ∀t ≥ 0,

$$\mathbb{E}_t\big[\|s^t - \nabla f(x^*)\|^2\big] \le 4\beta v^t + 2(2\beta - \alpha)\Big( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \Big). \tag{3.18}$$

From Lemma 3.5, we can deduce that, for each node i = 1, ..., n, when x_i^t approaches x_i^*, the variables e_{i,h}^t, h ∈ {1, ..., m_i}, tend to x_i^*, which indicates that the mean-squared deviation between the stochastic gradient s^t and the gradient ∇f(x^*) vanishes. Subsequently, we continue to show the existence and optimality of the fixed-points of (3.13), which is taken from [41].
Lemma 3.6 (Adapted from Lemma 2 in [41]) A point (x^*, z^*, ŷ^*, w^*) is a fixed-point of (3.13) if and only if the following conditions hold:

$$\begin{cases} 0 = \nabla f(x^*) + A_c^T \hat y^* \\ w^* = \hat y^* + \eta_y A_c x^* + L z^* \\ 0 = L w^* \\ \hat y^* = \mathrm{prox}_{\eta_y G^\dagger}\big(B_c w^*\big). \end{cases} \tag{3.19}$$

In addition, if a fixed-point (x^*, z^*, ŷ^*, w^*) satisfies (3.19), then ŷ^* = 1_n ⊗ y^* holds with (x^*, y^*) satisfying (3.5).
Lemma 3.6 shows that the fixed-point of (3.13) is optimal. According to Lemma 3.6, we know that the stochastic gradient s^t will be fixed at ∇f(x^*) if (x^*, z^*, ŷ^*, w^*) is the fixed-point of (3.13). This fact makes it easy to prove the sufficient and necessary conditions for the fixed-point optimality by recalling that VR-DPPD (3.13) is a stochastic version of the algorithm (3.12) in [41]. In addition, it follows from the update of z^t in (3.13) that if z^0 = 0_{pn}, then z^1 = −Lw^1, which lies in the range space of L. Accordingly, {z^t}_{t≥0} will always belong to the range space of L. Utilizing similar arguments as in [41], one can without loss of generality suppose that z^* lies in the range space of L and (x^*, z^*, ŷ^*, w^*) is a fixed-point, since adding a vector from the null space of L to z^* does not influence the optimality condition.

3.4.2 Main Results

First, we give a crucial lemma that plays an important role in supporting the convergence results. Define the following error terms:

$$\tilde x^t = x^t - x^*, \quad \tilde y^t = \hat y^t - \hat y^*, \quad \tilde w^t = w^t - w^*, \quad \tilde z^t = z^t - z^*.$$

Then, it follows from (3.13) and (3.19) that

$$\begin{cases} \tilde x^{t+1} = \tilde x^t - \eta_x (s^t - \nabla f(x^*)) - \eta_x A_c^T \tilde y^t \\ \tilde w^{t+1} = \tilde y^t + \eta_y A_c \tilde x^{t+1} + L \tilde z^t \\ \tilde z^{t+1} = \tilde z^t - L \tilde w^{t+1} \\ \tilde y^{t+1} = \mathrm{prox}_{\eta_y G^\dagger}(B_c w^{t+1}) - \mathrm{prox}_{\eta_y G^\dagger}(B_c w^*). \end{cases} \tag{3.20}$$

Based on (3.20), we next establish a critical equality to support the main results.
Lemma 3.7 Suppose that η_x and η_y are strictly positive. Under Assumption 3.1, the following recursive relation holds: ∀t ≥ 0,

$$\|\tilde x^{t+1}\|^2_{I_q - \eta_x\eta_y A_c^T A_c} + \kappa \|\tilde w^{t+1}\|^2_{I_{pn} - L^2} + \kappa \|\tilde z^{t+1}\|^2 = \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 + \kappa \|\tilde y^t\|^2_{I_{pn} - \eta_x\eta_y A_c A_c^T} + \kappa \|\tilde z^t\|^2_{I_{pn} - L^2}, \tag{3.21}$$

where κ = η_x / η_y.
Proof First, it follows from the update of the error term x̃^t in (3.20) that

$$\begin{aligned} \|\tilde x^{t+1}\|^2 &= \|\tilde x^t - \eta_x (s^t - \nabla f(x^*)) - \eta_x A_c^T \tilde y^t\|^2 \\ &= \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 + \eta_x^2 \|\tilde y^t\|^2_{A_c A_c^T} \\ &\quad - 2\eta_x (\tilde y^t)^T A_c \big(\tilde x^t - \eta_x (s^t - \nabla f(x^*))\big). \end{aligned} \tag{3.22}$$

Then, it follows from the update of the error term w̃^t in (3.20) that

$$\begin{aligned} \kappa \|\tilde w^{t+1}\|^2 &= \kappa \|\tilde y^t + \eta_y A_c \tilde x^{t+1} + L \tilde z^t\|^2 \\ &= \kappa \|\tilde y^t\|^2 + \kappa \|\eta_y A_c \tilde x^{t+1} + L \tilde z^t\|^2 + 2\kappa\eta_y (\tilde y^t)^T A_c \tilde x^{t+1} + 2\kappa (\tilde y^t)^T L \tilde z^t \\ &= \kappa \|\tilde y^t\|^2 + \kappa \|\eta_y A_c \tilde x^{t+1} + L \tilde z^t\|^2 + 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} + 2\kappa (\tilde y^t)^T L \tilde z^t \\ &= \kappa \|\tilde y^t\|^2 + \eta_x\eta_y \|\tilde x^{t+1}\|^2_{A_c^T A_c} + \kappa \|\tilde z^t\|^2_{L^2} + 2\eta_x (\tilde z^t)^T L^T A_c \tilde x^{t+1} \\ &\quad + 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} + 2\kappa (\tilde y^t)^T L \tilde z^t. \end{aligned} \tag{3.23}$$

Similarly, it follows from the update of the error term z̃^t in (3.20) that

$$\begin{aligned} \kappa \|\tilde z^{t+1}\|^2 &= \kappa \|\tilde z^t - L \tilde w^{t+1}\|^2 \\ &= \kappa \|\tilde z^t\|^2 + \kappa \|\tilde w^{t+1}\|^2_{L^2} - 2\kappa (\tilde z^t)^T L \big(\tilde y^t + \eta_y A_c \tilde x^{t+1} + L \tilde z^t\big) \\ &= \kappa \|\tilde z^t\|^2 + \kappa \|\tilde w^{t+1}\|^2_{L^2} - 2\kappa (\tilde z^t)^T L \tilde y^t \\ &\quad - 2\eta_x (\tilde z^t)^T L A_c \tilde x^{t+1} - 2\kappa \|\tilde z^t\|^2_{L^2}. \end{aligned} \tag{3.24}$$

Combining the three results (3.22)–(3.24), one can deduce that

$$\begin{aligned} \|\tilde x^{t+1}\|^2 + \kappa \|\tilde w^{t+1}\|^2 + \kappa \|\tilde z^{t+1}\|^2 &= 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} - 2\eta_x (\tilde y^t)^T A_c \big(\tilde x^t - \eta_x (s^t - \nabla f(x^*))\big) \\ &\quad + \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 + \kappa \|\tilde z^t\|^2_{I_{pn} - L^2} + \kappa \|\tilde w^{t+1}\|^2_{L^2} \\ &\quad + \kappa \|\tilde y^t\|^2_{I_{pn} + \eta_x\eta_y A_c A_c^T} + \eta_x\eta_y \|\tilde x^{t+1}\|^2_{A_c^T A_c}. \end{aligned} \tag{3.25}$$

Rearranging equation (3.25), we have that

$$\begin{aligned} \|\tilde x^{t+1}\|^2_{I_q - \eta_x\eta_y A_c^T A_c} + \kappa \|\tilde w^{t+1}\|^2_{I_{pn} - L^2} + \kappa \|\tilde z^{t+1}\|^2 &= \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 + \kappa \|\tilde y^t\|^2_{I_{pn} + \eta_x\eta_y A_c A_c^T} \\ &\quad - 2\eta_x (\tilde y^t)^T A_c \big(\tilde x^t - \eta_x (s^t - \nabla f(x^*))\big) \\ &\quad + 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} + \kappa \|\tilde z^t\|^2_{I_{pn} - L^2}. \end{aligned} \tag{3.26}$$

Substituting the update of the error term x̃^t in (3.20) into (3.26) yields the result of Lemma 3.7. The proof is completed. □
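Since (3.21) is an exact algebraic identity (it uses only the first three recursions of (3.20), the symmetry of L, and κ = η_x/η_y), it can be sanity-checked with random data; the dimensions and matrices below are arbitrary test values:

```python
import numpy as np

# Numerical check of the energy identity (3.21): generate random error vectors,
# propagate them through the first three recursions of (3.20), compare both sides.
rng = np.random.default_rng(3)
q, pn = 4, 3
eta_x, eta_y = 0.3, 0.8
kappa = eta_x / eta_y
Ac = rng.standard_normal((pn, q))
M = rng.standard_normal((pn, pn))
L = (M + M.T) / 2                       # L only needs to be symmetric here

xt = rng.standard_normal(q)             # x-tilde^t
yt = rng.standard_normal(pn)            # y-tilde^t
zt = rng.standard_normal(pn)            # z-tilde^t
gt = rng.standard_normal(q)             # s^t - grad f(x*)

u = xt - eta_x * gt
x_next = u - eta_x * Ac.T @ yt          # first line of (3.20)
w_next = yt + eta_y * Ac @ x_next + L @ zt
z_next = zt - L @ w_next

P = np.eye(q) - eta_x * eta_y * Ac.T @ Ac
Q = np.eye(pn) - eta_x * eta_y * Ac @ Ac.T
Mz = np.eye(pn) - L @ L

lhs = x_next @ P @ x_next + kappa * w_next @ Mz @ w_next + kappa * z_next @ z_next
rhs = u @ u + kappa * yt @ Q @ yt + kappa * zt @ Mz @ zt
assert np.isclose(lhs, rhs)             # (3.21) holds exactly
```

No positive-definiteness is needed for the identity itself; the conditions on η_x, η_y enter only later, when the weighted norms are turned into a contraction.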
Let ρmin (·), ρmax (·), and λmin (·) be the smallest non-zero singular value, the
largest singular value, and the smallest eigenvalue of its argument, respectively.
Notice from the condition (3.14) that 0 ≤ L2 < Ipn , which further implies that
0 < ρmin (L) < 1. Denote a tunable parameter τ = ηx τ1 , where τ1 < 4α m̌/(α + β)
is a constant. Then, we will deduce the convergence results of VR-DPPD (3.13) by
using the result in Lemma 3.7.
Theorem 3.8 Consider VR-DPPD (3.13) and let Assumption 3.1 hold. If the step-sizes η_x and η_y satisfy

$$0 < \eta_x < \min\left\{ \frac{2}{\alpha+\beta},\; \frac{\tau_1}{4\beta\hat m},\; \frac{1}{2(2\beta-\alpha)}\Big( \frac{4\alpha}{\alpha+\beta} - \frac{\tau_1}{\check m} \Big) \right\}, \qquad 0 < \eta_y < \frac{2\alpha\beta}{\rho_{\max}^2(A_c)(\alpha+\beta)},$$

then the sequence {x^t} updated by VR-DPPD (3.13) converges linearly in expectation to the optimal solution x^* of problem (3.1), i.e., E_{t-1}[||x̃^t||²] ≤ Cσ^t for all t ≥ 1 and some C ≥ 0, where

$$\sigma = \max\left\{ \frac{1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}}{1 - \eta_x\eta_y\rho_{\max}^2(A_c)},\; 1 - \eta_x\eta_y\lambda_{\min}(A_cA_c^T),\; 1 - \rho_{\min}^2(L),\; 1 - \frac{1}{\hat m} + \frac{4\beta\eta_x^2}{\tau} \right\}$$

and

$$C = \frac{\|\tilde x^0\|^2_{I_q - \eta_x\eta_y A_c^TA_c} + \kappa\|\tilde y^0\|^2 + \kappa\|\tilde z^0\|^2 + \tau v^0}{1 - \eta_x\eta_y\rho_{\max}^2(A_c)}.$$

Proof It holds that

$$\begin{aligned} \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 &= \|\tilde x^t\|^2 + \eta_x^2 \|s^t - \nabla f(x^*)\|^2 - 2\eta_x (\tilde x^t)^T (s^t - \nabla f(x^*)) \\ &= \|\tilde x^t\|^2 + \eta_x^2 \|s^t - \nabla f(x^*)\|^2 - 2\eta_x (\tilde x^t)^T (s^t - \nabla f(x^t)) \\ &\quad - 2\eta_x (\tilde x^t)^T (\nabla f(x^t) - \nabla f(x^*)). \end{aligned} \tag{3.27}$$

By Assumption 3.1, we have that (x̃^t)^T(∇f(x^t) − ∇f(x^*)) ≥ (αβ/(α+β))||x̃^t||² + (1/(α+β))||∇f(x^t) − ∇f(x^*)||². Then, for η_x ≤ 2/(α+β), we have

$$\begin{aligned} \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 &\le \Big(1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}\Big)\|\tilde x^t\|^2 - \frac{2\eta_x}{\alpha+\beta}\|\nabla f(x^t) - \nabla f(x^*)\|^2 \\ &\quad + \eta_x^2 \|s^t - \nabla f(x^*)\|^2 - 2\eta_x (\tilde x^t)^T (s^t - \nabla f(x^t)), \end{aligned} \tag{3.28}$$

which combined with the result ||∇f(x^t) − ∇f(x^*)||² ≥ 2α(f(x^t) − f(x^*) − ∇f(x^*)^T(x^t − x^*)) (the strong convexity in Assumption 3.1) yields that

$$\begin{aligned} \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 &\le \Big(1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}\Big)\|\tilde x^t\|^2 + \eta_x^2 \|s^t - \nabla f(x^*)\|^2 - 2\eta_x (\tilde x^t)^T (s^t - \nabla f(x^t)) \\ &\quad - \frac{4\alpha\eta_x}{\alpha+\beta}\Big( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \Big). \end{aligned} \tag{3.29}$$

Substituting the bound (3.29) into (3.21) gives

$$\begin{aligned} \|\tilde x^{t+1}\|^2_{I_q - \eta_x\eta_y A_c^TA_c} + \kappa\|\tilde w^{t+1}\|^2_{I_{pn} - L^2} + \kappa\|\tilde z^{t+1}\|^2 &\le \Big(1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}\Big)\|\tilde x^t\|^2 - \frac{4\alpha\eta_x}{\alpha+\beta}\Delta^t + \eta_x^2\|s^t - \nabla f(x^*)\|^2 \\ &\quad - 2\eta_x (\tilde x^t)^T (s^t - \nabla f(x^t)) + \kappa\|\tilde z^t\|^2_{I_{pn} - L^2} + \kappa\|\tilde y^t\|^2_{I_{pn} - \eta_x\eta_y A_cA_c^T}, \end{aligned} \tag{3.30}$$

where Δ^t = f(x^t) − f(x^*) − ∇f(x^*)^T(x^t − x^*) ≥ 0, ∀t ≥ 0. Notice that if the matrix I_q − η_xη_yA_c^TA_c is positive definite, one obtains that

$$\Big(1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}\Big)\|\tilde x^t\|^2 \le \frac{1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}}{1 - \eta_x\eta_y\rho_{\max}^2(A_c)}\,\|\tilde x^t\|^2_{I_q - \eta_x\eta_y A_c^TA_c}. \tag{3.31}$$

In addition, since the proximal mapping is non-expansive, we have from the update of the error term ỹ^t in (3.20) that

$$\|\tilde y^{t+1}\|^2 = \|\mathrm{prox}_{\eta_y G^\dagger}(B_c w^{t+1}) - \mathrm{prox}_{\eta_y G^\dagger}(B_c w^*)\|^2 \le \|B_c \tilde w^{t+1}\|^2 = \|\tilde w^{t+1}\|^2_{B_c^2} \le \|\tilde w^{t+1}\|^2_{I_{pn} - L^2}, \tag{3.32}$$

where the last inequality follows from Assumption 3.2. Since each A_i has full row rank, we have 0 < λ_min(A_cA_c^T)I_{pn} ≤ A_cA_c^T. Thus, it holds that

$$\|\tilde y^t\|^2_{I_{pn} - \eta_x\eta_y A_cA_c^T} \le \big(1 - \eta_x\eta_y\lambda_{\min}(A_cA_c^T)\big)\|\tilde y^t\|^2. \tag{3.33}$$

Moreover, since z^0 = 0 and z^* lie in the range space of L, the error quantity z̃^t always belongs to the range space of L, which indicates that ||z̃^t||²_{L²} ≥ ρ²_min(L)||z̃^t||²; see Lemma 1 in [41]. Therefore,

$$\|\tilde z^t\|^2_{I_{pn} - L^2} \le \big(1 - \rho_{\min}^2(L)\big)\|\tilde z^t\|^2. \tag{3.34}$$

Substituting the bounds (3.31)–(3.34) into (3.30) derives

$$\begin{aligned} \|\tilde x^{t+1}\|^2_{I_q - \eta_x\eta_y A_c^TA_c} + \kappa\|\tilde y^{t+1}\|^2 + \kappa\|\tilde z^{t+1}\|^2 &\le \frac{1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}}{1 - \eta_x\eta_y\rho_{\max}^2(A_c)}\,\|\tilde x^t\|^2_{I_q - \eta_x\eta_y A_c^TA_c} - \frac{4\alpha\eta_x}{\alpha+\beta}\Delta^t \\ &\quad + \big(1 - \eta_x\eta_y\lambda_{\min}(A_cA_c^T)\big)\kappa\|\tilde y^t\|^2 + \big(1 - \rho_{\min}^2(L)\big)\kappa\|\tilde z^t\|^2 \\ &\quad - 2\eta_x (\tilde x^t)^T (s^t - \nabla f(x^t)) + \eta_x^2\|s^t - \nabla f(x^*)\|^2. \end{aligned} \tag{3.35}$$

From (3.11), it holds that E_t[s^t − ∇f(x^t)] = 0. Taking the conditional expectation with respect to F^t, we deduce from (3.35) that

$$\begin{aligned} \mathbb{E}_t\Big[ \|\tilde x^{t+1}\|^2_{I_q - \eta_x\eta_y A_c^TA_c} + \kappa\|\tilde y^{t+1}\|^2 + \kappa\|\tilde z^{t+1}\|^2 \Big] &\le \frac{1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}}{1 - \eta_x\eta_y\rho_{\max}^2(A_c)}\,\|\tilde x^t\|^2_{I_q - \eta_x\eta_y A_c^TA_c} - \frac{4\alpha\eta_x}{\alpha+\beta}\Delta^t + \eta_x^2\,\mathbb{E}_t\big[\|s^t - \nabla f(x^*)\|^2\big] \\ &\quad + \big(1 - \eta_x\eta_y\lambda_{\min}(A_cA_c^T)\big)\kappa\|\tilde y^t\|^2 + \big(1 - \rho_{\min}^2(L)\big)\kappa\|\tilde z^t\|^2. \end{aligned} \tag{3.36}$$

Combining (3.17) and (3.18) with (3.36), we further have

$$\mathbb{E}_t[\vartheta^{t+1}] \le \sigma_1 \|\tilde x^t\|^2_{I_q - \eta_x\eta_y A_c^TA_c} + \sigma_2 \kappa\|\tilde y^t\|^2 + \sigma_3 \kappa\|\tilde z^t\|^2 + \sigma_4 \tau v^t - \sigma_5 \Delta^t, \tag{3.37}$$

where ϑ^{t+1} = ||x̃^{t+1}||²_{I_q−η_xη_yA_c^TA_c} + κ||ỹ^{t+1}||² + κ||z̃^{t+1}||² + τv^{t+1} is defined to simplify the symbolic expression, τ is a tunable parameter that is specified below, and

$$\sigma_1 = \frac{1 - 2\eta_x\frac{\alpha\beta}{\alpha+\beta}}{1 - \eta_x\eta_y\rho_{\max}^2(A_c)}, \quad \sigma_2 = 1 - \eta_x\eta_y\lambda_{\min}(A_cA_c^T), \quad \sigma_3 = 1 - \rho_{\min}^2(L),$$
$$\sigma_4 = 1 - \frac{1}{\hat m} + \frac{4\beta\eta_x^2}{\tau}, \quad \sigma_5 = \frac{4\alpha\eta_x}{\alpha+\beta} - 2(2\beta - \alpha)\eta_x^2 - \frac{\tau}{\check m}.$$
Denote τ = η_x τ_1, where τ_1 is a tunable parameter. If

$$0 < \eta_x \le \frac{1}{2(2\beta-\alpha)}\Big( \frac{4\alpha}{\alpha+\beta} - \frac{\tau_1}{\check m} \Big), \qquad \tau_1 < \frac{4\alpha\check m}{\alpha+\beta}, \tag{3.38}$$

then we have σ_5 ≥ 0, which combined with (3.37) implies that

$$\mathbb{E}_t[\vartheta^{t+1}] \le \sigma_1 \|\tilde x^t\|^2_{I_q - \eta_x\eta_y A_c^TA_c} + \sigma_2 \kappa\|\tilde y^t\|^2 + \sigma_3 \kappa\|\tilde z^t\|^2 + \sigma_4 \tau v^t. \tag{3.39}$$

Moreover, if

$$0 < \eta_y < \frac{2\alpha\beta}{\rho_{\max}^2(A_c)(\alpha+\beta)}, \qquad 0 < \eta_x < \frac{\tau_1}{4\beta\hat m}, \tag{3.40}$$

we obtain that σ_1 < 1 and σ_4 < 1. Under (3.38) and (3.40), it can be verified that I_q − η_xη_yA_c^TA_c > 0 and σ = max{σ_1, σ_2, σ_3, σ_4} < 1. In addition, it can also be verified that

$$\big(1 - \eta_x\eta_y\rho_{\max}^2(A_c)\big)\|\tilde x^{t+1}\|^2 \le \|\tilde x^{t+1}\|^2_{I_q - \eta_x\eta_y A_c^TA_c}. \tag{3.41}$$

Iterating the inequality E_t[ϑ^{t+1}] ≤ σϑ^t, which is deduced from (3.39), we get that

$$\mathbb{E}_t[\vartheta^{t+1}] \le \vartheta^0 \sigma^{t+1}, \tag{3.42}$$

which by means of (3.41) achieves the expected results. The proof is completed. □
Remark 3.9 Theorem 3.8 is proved following the convergence analysis method of [41]; it shows that VR-DPPD ensures a linear convergence rate in solving the problem with the coupled non-smooth function g under suitable conditions (on η_x, η_y, and τ_1) and the assumptions on the objective functions. The main difference in the convergence analysis compared to [41] is that we need to leverage the results associated with the stochastic gradient (Lemma 3.4 and Lemma 3.5) to establish the main recurrence relation (3.37). Furthermore, compared with [41], VR-DPPD additionally enjoys the appealing feature of computational efficiency brought by the variance-reduction technique.

From Theorem 3.8, it is known that the constant σ that controls the convergence rate can be simplified by selecting specific values for τ_1, η_x, and η_y. This uncovers connections to the properties of the local objective functions and the network topology. To make this clearer, we define the condition number of the local constituent functions as θ_f = β/α. Then, the following corollary illustrates that the parameters related to the local objective functions and the network topology determine the convergence rate of VR-DPPD.
Corollary 3.10 Consider VR-DPPD as given in Algorithm 2 and suppose that the conditions of Theorem 3.8 hold. Assume that the number of local constituent functions f_{i,h} at each node is the same, i.e., m̂ = m̌ = m. Setting the constants τ_1, η_x, and η_y as

$$\tau_1 = \frac{2\check m}{1 + \theta_f}, \qquad \eta_x = \frac{1}{4\beta(1 + \theta_f)}, \qquad \eta_y = \frac{\beta}{\rho_{\max}^2(A_c)(1 + \theta_f)},$$

the linear convergence constant 0 < σ < 1 in Theorem 3.8 reduces to

$$\sigma = \max\left\{ 1 - \frac{1}{4(1 + \theta_f)^2 - 1},\; 1 - \rho_{\min}^2(L),\; 1 - \frac{\lambda_{\min}(A_cA_c^T)}{4\rho_{\max}^2(A_c)(1 + \theta_f)^2},\; 1 - \frac{1}{2m} \right\}.$$

VR-DPPD thus achieves an ε-optimal solution of x^* in

$$O\left( \max\left\{ \theta_f^2,\; \rho_{\min}^{-2}(L),\; m \right\} \log\frac{1}{\varepsilon} \right)$$

constituent gradient computations at each node.


Proof The given values of τ_1, η_x, and η_y satisfy the conditions in Theorem 3.8. Then, substituting these values into the expression for σ in Theorem 3.8, we can easily obtain the simplified form of σ in Corollary 3.10. □
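As a numeric illustration of Corollary 3.10 (writing σ for the contraction constant of Theorem 3.8), one can plug sample values into the simplified expression; the numbers θ_f = 10, ρ_min(L) = 0.3, ρ_max(A_c) = 1, λ_min(A_cA_c^T) = 0.5, and m = 200 are purely illustrative:

```python
import numpy as np

# Sample evaluation of the simplified contraction constant of Corollary 3.10.
# All parameter values here are illustrative, not taken from the chapter.
theta_f, rho_min_L, rho_max_A, lam_min, m = 10.0, 0.3, 1.0, 0.5, 200

sigma = max(
    1 - 1 / (4 * (1 + theta_f) ** 2 - 1),
    1 - rho_min_L ** 2,
    1 - lam_min / (4 * rho_max_A ** 2 * (1 + theta_f) ** 2),
    1 - 1 / (2 * m),
)
# Iterations needed to gain one decimal digit of accuracy: log(10)/log(1/sigma).
iters_per_digit = np.log(10) / -np.log(sigma)
assert 0 < sigma < 1
```

For such sample values, the binding term is the one involving the condition number θ_f, consistent with the O(max{θ_f², ρ_min^{-2}(L), m} log(1/ε)) complexity above.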
Remark 3.11 It can be concluded from Corollary 3.10 that the convergence constant σ depends on three elements: the condition number of the local constituent functions, the structure of the network, and the number of the local constituent functions. For instance, the linear convergence constant may be influenced by the sparsity of the network (see Remark 3.2). In addition, when the condition number θ_f increases, the linear convergence constant may increase as well.
Remark 3.12 If each node is assigned a large dataset, i.e., m ≫ 1, then VR-DPPD achieves an ε-optimal solution with a network-independent constituent gradient computation complexity of O(m log(1/ε)) at each node, which performs better than the centralized SAGA [50], which processes all data on a single node and requires O((nm + θ_f) log(1/ε)) ≈ O(nm log(1/ε)) constituent gradient computations [55, 56]. That is, the number of constituent gradient calculations of VR-DPPD required per node is reduced by a factor of n in comparison with its centralized counterpart. In addition, since VR-DPPD incurs O(1) communication rounds per node at each iteration, its total communication complexity is of the same order as its computation complexity, i.e., O(max{θ_f², ρ_min^{−2}(L), m} log(1/ε)).

Remark 3.13 The theoretical analysis in this chapter may not be applicable to the case where each local constituent function f_{i,h} is only generally convex. The main reasons include two aspects. On the one hand, when the cost function is only generally convex, it is difficult to obtain the desired upper bound on E_t[||s^t − ∇f(x^*)||²] in Lemma 3.5, which is the core result supporting the convergence analysis. On the other hand, we cannot obtain the asymptotic convergence of E_{t−1}[||x̃^t||] but may only achieve the asymptotic convergence of E_{t−1}[|f(x^t) − f(x^*)|], which may make our convergence analysis method inapplicable. Finally, we note that a rigorous theoretical analysis of VR-DPPD for general convex cost functions is beyond the scope of this chapter, and we leave it to future work.

3.5 Numerical Examples

In this section, numerical examples are provided to validate VR-DPPD. Notice


that all the simulations are carried out in Matlab R2016b running on a laptop
with Intel(R) Core(TM) i5-9500 CPU, 3.0 GHz, 8 GB of RAM, and Windows 10
operating system. To facilitate verification and comparison, we adopt the centralized
approach to obtain the optimal solutions of the following examples in a long enough
time and with appropriate step-sizes. Motivated by Alghunaim et al. [3, 41, 42], we
consider the problem as follows:


n
1 
mi   
min ln 1 + exp −bi,h ci,h
T
xi , (3.43)
x1 ,··· ,xn mi
i=1 h=1


subject to the global constraint ni=1 xi ≤ a (here, we assume that all variables
xi have the same dimension, i.e., qi is a constant for all i ), where the local
objective function fi (x) is the average of mi constituent functions fi,h , i.e., fi (xi ) =
 mi
(1/mi ) h=1 fi,h (xi ) for all i , where
  
fi,h (xi ) = ln 1 + exp −bi,h ci,h
T
xi ,

with bi,h ∈ {−1, 1} and ci,h ∈ Rqi , ∀h ∈ {1, . . . , mi }; a ∈ Rqi is a vector with constant
entries. Define ni=1 mi = m.
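The constituent function f_{i,h} above has the closed-form gradient ∇f_{i,h}(x_i) = −b_{i,h} c_{i,h} σ(−b_{i,h} c_{i,h}^T x_i), with σ the logistic sigmoid; a short finite-difference check on random stand-in data confirms the expression:

```python
import numpy as np

# The logistic constituent function f_{i,h} above and its gradient.
# The data (b, c, x) are random stand-ins, not the chapter's datasets.
def f(x, b, c):
    return np.log1p(np.exp(-b * (c @ x)))

def grad_f(x, b, c):
    t = -b * (c @ x)
    return -b * c / (1.0 + np.exp(-t))    # -b*c*sigmoid(t), sigmoid(t) = 1/(1+e^{-t})

rng = np.random.default_rng(4)
q = 5
x, c = rng.standard_normal(q), rng.standard_normal(q)
b = 1.0

# Central finite differences agree with the analytic gradient.
eps = 1e-6
num = np.array([(f(x + eps * e, b, c) - f(x - eps * e, b, c)) / (2 * eps)
                for e in np.eye(q)])
assert np.allclose(num, grad_f(x, b, c), atol=1e-6)
```

These per-sample gradients are exactly the entries ∇f_{i,h}(e_{i,h}^t) stored in the SAGA gradient table of Algorithm 2.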
Problem (3.43) is a distributed optimization problem with coupled inequality constraints, which cannot be solved directly with the VR-DPPD algorithm proposed in this chapter. To solve it, an effective method is to equivalently transform problem (3.43) into an instance of problem (3.1). To this aim, we resort to a transformation with the help of an indicator function, which is well applied in many works, such as [30, 42, 43]. In particular, to apply VR-DPPD to (3.43), we first define an indicator function g(·): R^{q_i} → R such that

$$g(\check x) = \begin{cases} 0, & \text{if } \check x \le a, \\ +\infty, & \text{otherwise}, \end{cases}$$

which is non-smooth in x̌. Notice that x̌ represents the coupled term Σ_{i=1}^n x_i here. Based on this, the above constrained problem can be further
equivalently transformed into the following composite non-smooth problem:

$$\min_{x_1, \cdots, x_n} \; \sum_{i=1}^{n} \frac{1}{m_i} \sum_{h=1}^{m_i} \ln\Big( 1 + \exp\big( -b_{i,h} c_{i,h}^T x_i \big) \Big) + g\left( \sum_{i=1}^{n} x_i \right), \tag{3.44}$$

which possesses a similar form to problem (3.1) with A_i = I_{q_i}. Then, we can utilize VR-DPPD to solve problem (3.44). Here, we assume that g is a convex, possibly non-smooth, global cost function known by all nodes. Similar to [41], the following centralized linearized prox-ascent algorithm can be employed to deduce the optimal solution, i.e.,

$$\begin{cases} x^{k+1} = x^k - \eta_x \nabla f(x^k) - \eta_x A^T y^k \\ y^{k+1} = \mathrm{prox}_{\eta_y g}\big( y^k + \eta_y A x^{k+1} \big). \end{cases} \tag{3.45}$$
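For the indicator g above, the proximal mapping reduces to the Euclidean projection onto {v : v ≤ a}, i.e., a componentwise minimum, independently of the prox parameter, which is what keeps prox-based updates such as the y-step above cheap; a minimal sketch with illustrative values:

```python
import numpy as np

# Prox of the indicator of {v <= a}: the Euclidean projection, a componentwise
# minimum.  The vector a and the test point are illustrative.
def prox_g(v, a):
    return np.minimum(v, a)

a = np.array([1.0, 1.0, 1.0])
v = np.array([2.0, 0.5, -3.0])
assert np.allclose(prox_g(v, a), [1.0, 0.5, -3.0])
```

Components already satisfying the constraint pass through unchanged; violating components are clipped down to a.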

In the following two examples, we leverage two real datasets, i.e., the breast cancer dataset (dataset 1) and the mushroom dataset (dataset 2), from the UCI Machine Learning Repository [68], to support the simulation experiments. Notice that the optimization problem (3.43) is not the same as the traditional logistic regression binary classification problem. Similar to [3, 41, 42], we only apply these two real datasets to verify the effectiveness of VR-DPPD without performing related model training and testing. In addition, since the scales of the two real datasets are different, the motivation for adopting them is clear: dataset 1 can be leveraged to test the performance of VR-DPPD, while dataset 2 can be utilized to validate the advantages of VR-DPPD over other related methods in handling large-scale data.

3.5.1 Example 1: Simulation on General Real Data

In this example, we use dataset 1 to examine the performance of VR-DPPD, where m = 200 and q_i = 9, ∀i. The entries of the vector a are uniformly selected from (0, 1). For the networks, a randomly connected network with n = 10 nodes generated by the Erdos–Renyi model [27] with a connection probability p = 0.8 is plotted in Fig. 3.1a, and three further categories of networks (for comparison) are plotted in Fig. 3.1b–d, respectively. The combination matrix B is generated using the lazy Metropolis rule [23] under the networks in Fig. 3.1. The target now is to utilize VR-DPPD to seek the optimal solution to problem (3.43), and the related simulation results are described in the following three parts, i.e., (i)–(iii), based on the general real data. Define the residual as (1/n) log_10 (Σ_{i=1}^n ||x_i^t − x_i^*|| / ||x_i^0 − x_i^*||).
(i) Convergence: in this simulation, we test the convergence of VR-DPPD with
ηx = 0.05 and ηy = 2. Then, the transient behaviors of the second dimension
of each primal variable xi are shown in Fig. 3.2. Figure 3.2 indicates that each

Fig. 3.1 (a) Random network with a connection probability p = 0.8. (b) Complete network. (c)
Cycle network. (d) Star network

Fig. 3.2 The transient behaviors of the second dimension of each primal variable xi

primal variable x_i can achieve, in the mean, the optimal solution, and the optimal solutions together satisfy the global inequality constraint in (3.43) by computation.
(ii) Comparison: in this simulation (ηx = 0.05, ηy = 2), we compare VR-
DPPD with algorithm (3.45) and the related algorithm, PED2 , proposed in
[41] to show the appealing features of VR-DPPD. The simulation results
are shown in Fig. 3.3, where the x -axis is the iterations and the number of
gradient evaluations, respectively. Figure 3.3a clearly shows that VR-DPPD
converges linearly in this setup. In addition, from Fig. 3.3a, we can also find


Fig. 3.3 Comparisons between VR-DPPD and other algorithms. (a) The x-axis is the iterations.
(b) The x-axis is the number of gradient evaluations


Fig. 3.4 Evolution of residuals under different networks

that the performance of VR-DPPD does not degrade much even though it utilizes stochastic gradients with the variance-reduction technique; that is, it is only slightly slower than PED² [41]. Figure 3.3b tells us that, compared with the centralized algorithm (3.45) and PED² [41], VR-DPPD demands a smaller number of local gradient evaluations, which can reduce the computation cost to a certain extent.
(iii) Impacts of network sparsity: in this simulation (η_x = 0.05, η_y = 2), we discuss the impacts of the network sparsity on the convergence results of VR-DPPD. The simulation results are depicted in Fig. 3.4, which shows that the sparsity of the network has a certain degree of influence on the convergence rate of VR-DPPD (the other parameters being fixed); that is, as the network becomes denser, the convergence rate of VR-DPPD becomes faster.

3.5.2 Example 2: Simulation on Large-Scale Real Data

In this example, we employ dataset 2 to verify the advantages of VR-DPPD in processing large-scale real data, where m = 6000 and q_i = 112, ∀i. The entries of the vector a are uniformly selected from (0, 2). Let η_x = 0.015 and η_y = 2. The simulation results are shown in Fig. 3.5, which implies that VR-DPPD requires fewer local gradient evaluations to quickly reach the target accuracy on large-scale real data. Moreover, Figs. 3.3b and 3.5b together indicate that when the number m of data samples is large, the number of local gradient evaluations of VR-DPPD is far less than that of the centralized algorithm (3.45) and PED² [41]. Hence, by employing the unbiased stochastic average gradients (SAGA), VR-DPPD can be well adapted to large-scale (data) tasks and may be well suited to solving the coupling problem.


Fig. 3.5 Comparisons between VR-DPPD and other algorithms. (a) The x-axis is the iterations.
(b) The x-axis is the number of gradient evaluations

Remark 3.14 It is worth emphasizing that one could further verify the performance of VR-DPPD when the non-smooth coupling function g is not an indicator function, e.g., g(Σ_{i=1}^n A_i x_i) = ||Σ_{i=1}^n A_i x_i||_1 in problem (3.44). Here, we notice that in practical applications the general form of the non-smooth function g is mostly an indicator function [3, 42, 60]. As mentioned before, the studied problem (3.1) recovers the well-investigated consensus problem if we choose g as the indicator function of the consensus constraint. Considering the practical needs (the indicator-function modeling of g is more common and has wider applicability) and the limited length of the chapter (which mainly shows the superiority of VR-DPPD in terms of computational efficiency), we do not conduct similar simulation experiments in this section for the case where g is not an indicator function.

3.6 Conclusion

In this chapter, we have designed a novel distributed stochastic algorithm, named
VR-DPPD, for minimizing a finite sum of smooth local functions plus a convex
(possibly non-smooth) function coupling all nodes. VR-DPPD incorporates the
variance-reduction technique of SAGA into a distributed proximal primal–dual
method, which improves computational efficiency and handles practical
multi-node sharing problems. Theoretical analysis has demonstrated that VR-DPPD
converges linearly in expectation to the exact optimal solution when the smooth local
functions are strongly convex. Extensive simulation examples have been conducted
to verify the performance of VR-DPPD. In the future, we will focus on accelerating
VR-DPPD and consider designing distributed asynchronous stochastic algorithms
to solve the non-smooth problem. Furthermore, establishing a rigorous theoretical
analysis of VR-DPPD for general convex cost functions is also a promising
research direction.
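For readers unfamiliar with the SAGA technique [50] on which VR-DPPD builds, the core idea is a gradient estimator that stays unbiased while its variance vanishes as the stored component gradients converge. The single-node sketch below (the class name and interface are our own, not the VR-DPPD implementation) illustrates this:

```python
import numpy as np

class SagaEstimator:
    """Minimal SAGA-style gradient estimator for (1/m) * sum_j f_j(x).

    Keeps a table of the most recent gradient evaluated for each
    component j; the estimator is unbiased and its variance shrinks
    as the table entries converge to the true component gradients.
    """

    def __init__(self, grads, x0):
        self.grads = grads                      # list of component gradient functions
        self.table = [g(x0) for g in grads]     # stored gradient per component
        self.avg = np.mean(self.table, axis=0)  # running table average

    def estimate(self, x, j):
        g_new = self.grads[j](x)
        est = g_new - self.table[j] + self.avg  # SAGA gradient estimate
        m = len(self.grads)
        self.avg = self.avg + (g_new - self.table[j]) / m
        self.table[j] = g_new
        return est
```

For quadratic components f_j(x) = 0.5(x − a_j)^2, one full pass of updates at the minimizer x* = mean(a_j) drives the stored table to the true component gradients, after which the estimator returns the exact (zero) gradient, i.e., the stochastic variance has been fully removed.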

References

1. T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, K. H. Johansson,
A survey of distributed optimization. Annu. Rev. Control 47, 278–305 (2019)
2. H. Li, C. Huang, Z. Wang, G. Chen, H. Umar, Computation-efficient distributed algorithm
for convex optimization over time-varying networks with limited bandwidth communication.
IEEE Trans. Signal Inf. Proc. Netw. 6, 140–151 (2020)
3. S. Alghunaim, K. Yuan, A.H. Sayed, A proximal diffusion strategy for multiagent optimization
with sparse affine constraints. IEEE Trans. Autom. Control 65(11), 4554–4567 (2020)
4. J. Li, W. Abbas, X. Koutsoukos, Resilient distributed diffusion in networks with adversaries.
IEEE Trans. Signal Inf. Proc. Netw. 6, 1–17 (2019)
5. A. Nedic, Distributed gradient methods for convex machine learning problems in networks:
distributed optimization. IEEE Signal Process. Mag. 37(3), 92–101 (2020)
6. Z. Yang, W.U. Bajwa, ByRDiE: byzantine-resilient distributed coordinate descent for decen-
tralized learning. IEEE Trans. Signal Inf. Proc. Netw. 5(4), 611–627 (2019)
7. F. Hua, R. Nassif, C. Richard, H. Wang, A.H. Sayed, Online distributed learning over graphs
with multitask graph-filter models. IEEE Trans. Signal Inf. Proc. Netw. 6, 63–77 (2020)
8. D. Yuan, A. Proutiere, G. Shi, Distributed online linear regressions. IEEE Trans. Inf. Theory
67(1), 616–639 (2021)
9. Q. Lü, X. Liao, T. Xiang, H. Li, T. Huang, Privacy masking stochastic subgradient-push
algorithm for distributed online optimization. IEEE Trans. Cybern. 51(6), 3224–3237 (2021)
10. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inf. Proc. Netw. 4, 4–17 (2018)

11. B. Huang, L. Liu, H. Zhang, Y. Li, Q. Sun, Distributed optimal economic dispatch for
microgrids considering communication delays. IEEE Trans. Syst. Man Cybern. Syst. Hum.
49(8), 1634–1642 (2019)
12. L. Liu, G. Yang, Distributed optimal economic environmental dispatch for microgrids over
time-varying directed communication graph. IEEE Trans. Netw. Sci. Eng. 8(2), 1913–1924
(2021)
13. M. Nokleby, H. Raja, W.U. Bajwa, Scaling-up distributed processing of data streams for
machine learning. Proc. IEEE 108(11), 1984–2012 (2020)
14. A. Gang, B. Xiang, W.U. Bajwa, Distributed principal subspace analysis for partitioned big
data: algorithms, analysis, and implementation. IEEE Trans. Signal Inf. Proc. Netw. 7, 699–
715 (2021)
15. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
16. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
17. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: conver-
gence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
18. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
19. C. Xi, U.A. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE
Trans. Autom. Control 62(10), 4980–4993 (2017)
20. M. Maros, J. Jalden, On the Q-linear convergence of distributed generalized ADMM under
non-strongly convex function components. IEEE Trans. Signal Inf. Proc. Netw. 5(3), 442–453
(2019)
21. J. Chen, S. Liu, P. Chen, Zeroth-order diffusion adaptation over networks, in Proceedings of
the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(2018). https://doi.org/10.1109/ICASSP.2018.8461448
22. J. Xu, S. Zhu, Y.C. Soh, L. Xie, Augmented distributed gradient methods for multi-agent
optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual
Conference on Decision and Control (2015). https://doi.org/10.1109/CDC.2015.7402509
23. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
24. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
25. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for
economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron.
64(6), 5095–5106 (2017)
26. H. Li, Q. Lü, G. Chen, T. Huang, Z. Dong, Distributed constrained optimization over
unbalanced directed networks using asynchronous broadcast-based algorithm. IEEE Trans.
Autom. Control 66(3), 1102–1115 (2021)
27. B. Li, S. Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks
with gradient tracking and variance reduction, in Proceedings of the 23rd International
Conference on Artificial Intelligence and Statistics (AISTATS) (2020), pp. 1662–1672
28. K. Yuan, B. Ying, J. Liu, A.H. Sayed, Variance-reduced stochastic learning by networked
agents under random reshuffling. IEEE Trans. Signal Process. 67(2), 351–366 (2019)
29. T. Ding, S. Zhu, J. He, C. Chen, X. Guan, Differentially private distributed optimization via
state and direction perturbation in multi-agent systems. IEEE Trans. Autom. Control 67(2),
722–737 (2022)
30. Y. Zhu, G. Wen, W. Yu, X. Yu, Continuous-time distributed proximal gradient algorithms for
nonsmooth resource allocation over general digraphs. IEEE Trans. Netw. Sci. Eng. 8(2), 1733–
1744 (2021)
31. Y. Zhu, W. Yu, G. Wen, W. Ren, Continuous-time coordination algorithm for distributed convex
optimization over weight-unbalanced directed networks. IEEE Trans. Circuits Syst. Express
Briefs 66(7), 1202–1206 (2019)
32. A.I. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in Proceedings of the
50th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
(2012). https://doi.org/10.1109/Allerton.2012.6483273
33. T.-H. Chang, M. Hong, X. Wang, Multi-agent distributed optimization via inexact consensus
ADMM. IEEE Trans. Signal Process. 63(2), 482–497 (2015)
34. N.S. Aybat, Z. Wang, T. Lin, S. Ma, Distributed linearized alternating direction method of
multipliers for composite convex consensus optimization. IEEE Trans. Autom. Control 63(1),
5–20 (2018)
35. W. Shi, Q. Ling, G. Wu, W. Yin, A proximal gradient algorithm for decentralized composite
optimization. IEEE Trans. Signal Process. 63(22), 6013–6023 (2015)
36. Z. Li, W. Shi, M. Yan, A decentralized proximal-gradient method with network independent
step-sizes and separated convergence rates. IEEE Trans. Signal Process. 67(17), 4494–4506
(2019)
37. S. Alghunaim, K. Yuan, A.H. Sayed, A linearly convergent proximal gradient algorithm for
decentralized optimization, in Advances in Neural Information Processing Systems (NIPS),
vol. 32 (2019), pp. 1–11
38. P. Di Lorenzo, G. Scutari, NEXT: in-network nonconvex optimization. IEEE Trans. Signal Inf.
Proc. Netw. 2(2), 120–136 (2016)
39. G. Scutari, Y. Sun, Distributed nonconvex constrained optimization over time-varying
digraphs. Math. Program. 176(1), 497–544 (2019)
40. J. Xu, Y. Tian, Y. Sun, G. Scutari, Distributed algorithms for composite optimization: Unified
framework and convergence analysis. IEEE Trans. Signal Process. 69, 3555–3570 (2021)
41. S. Alghunaim, K. Yuan, A.H. Sayed, A multi-agent primal-dual strategy for composite opti-
mization over distributed features, in Proceedings of the 2020 28th European Signal Processing
Conference (EUSIPCO) (2020). https://doi.org/10.23919/Eusipco47968.2020.9287370
42. S. Alghunaim, Q. Lyu, K. Yuan, A.H. Sayed, Dual consensus proximal algorithm for multi-
agent sharing problems. IEEE Trans. Signal Process. 69, 5568–5579 (2021)
43. P. Latafat, N.M. Freris, P. Patrinos, A new randomized block-coordinate primal-dual proximal
algorithm for distributed optimization. IEEE Trans. Autom. Control 64(10), 4050–4065 (2019)
44. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and conver-
gence to local minima (2020). Preprint. arXiv:2003.02818v1
45. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep
learning, in Proceedings of the 36th International Conference on Machine Learning (ICML)
(2019), pp. 344–353
46. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
47. R. Xin, A. Sahu, U. Khan, S. Kar, Distributed stochastic optimization with gradient tracking
over strongly-connected networks, in Proceedings of the 2019 IEEE 58th Conference on
Decision and Control (CDC) (2019). https://doi.org/10.1109/CDC40024.2019.9029217
48. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the
proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)
49. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient.
Math. Program. 162(1), 83–112 (2017)
50. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support
for non-strongly convex composite objectives, in Advances in Neural Information Processing
Systems (NIPS), vol. 27 (2014), pp. 1–9
51. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance
reduction, in Advances in Neural Information Processing Systems (NIPS) (2013), pp. 315–323
52. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning
problems using stochastic recursive gradient, in Proceedings of the 34th International Confer-
ence on Machine Learning (ICML) (2017), pp. 2613–2621
53. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm.
J. Mach. Learn. Res. 17(1), 2165–2199 (2016)
54. H. Hendrikx, F. Bach, L. Massoulie, An accelerated decentralized stochastic proximal algorithm
for finite sums, in Advances in Neural Information Processing Systems, vol. 32 (2019),
pp. 1–11
55. R. Xin, S. Kar, U.A. Khan, Decentralized stochastic optimization and machine learning: a
unified variance-reduction framework for robust performance and fast convergence. IEEE
Signal Process. Mag. 37(3), 102–113 (2020)
56. R. Xin, U.A. Khan, S. Kar, Variance-reduced decentralized stochastic optimization with
accelerated convergence. IEEE Trans. Signal Process. 68, 6255–6271 (2020)
57. R. Xin, U.A. Khan, S. Kar, Fast decentralized non-convex finite-sum optimization with
recursive variance reduction. SIAM J. Optim. 32(1), 1–28 (2022)
58. R. Xin, U.A. Khan, S. Kar, A fast randomized incremental gradient method for decentralized
non-convex optimization. IEEE Trans. Autom. Control (2021). https://doi.org/10.1109/TAC.
2021.3122586
59. D. Hajinezhad, M. Hong, T. Zhao, Z. Wang, NESTT: a nonconvex primal-dual splitting method
for distributed and stochastic optimization, in Advances in Neural Information Processing
Systems (NIPS), vol. 29 (2016), pp. 1–9
60. J. Xu, S. Zhu, Y.C. Soh, L. Xie, A dual splitting approach for distributed resource allocation
with regularization. IEEE Trans. Control Netw. Syst. 6(1), 403–414 (2019)
61. K. Scaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulie, Optimal algorithms for smooth and
strongly convex distributed optimization in networks, in Proceedings of the 34th International
Conference on Machine Learning (ICML) (2017), pp. 3027–3036
62. A. Nedic, A. Olshevsky, W. Shi, Improved convergence rates for distributed resource allocation
(2017). Preprint. arXiv:1706.05441
63. R.T. Rockafellar, Convex Analysis (Princeton University Press, Princeton, 1970)
64. H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert
Spaces, vol. 408 (Springer, Berlin, 2011)
65. P.L. Combettes, V.R. Wajs, Signal recovery by proximal forward-backward splitting. Multi-
scale Model. Simul. 4(4), 1168–1200 (2005)
66. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with
applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
67. P. Chen, J. Huang, X. Zhang, A primal-dual fixed point algorithm for convex separable
minimization with applications to image restoration. Inverse Probl. 29(2), 02501-1 (2013)
68. D. Dua, C. Graff, UCI machine learning repository, Dept. School Inf. Comput. Sci., Univ.
California, Irvine, CA, USA (2019)
Chapter 4
Event-Triggered Algorithms for
Distributed Convex Optimization

Abstract In this chapter, we introduce and discuss event-triggered distributed
subgradient algorithms for solving a class of convex optimization problems for
first-order discrete-time multi-node systems over undirected networks. In such
algorithms, the communication process of the entire network is controlled by a set
of trigger conditions monitored by each node. Each node's trigger condition and
event-triggered distributed subgradient optimization algorithm are entirely
decentralized, relying only on the local objective functions and the individual
states of the node and its neighbors at their own event-triggered time sequences.
At each time instant, each node updates its state using its own objective function
and the states collected from itself and its neighboring nodes at their respective
latest event-triggered instants. Under the conditions that the topology of the
undirected network is connected and the design parameters are chosen appropriately,
we establish a sufficient condition that ensures consensus and convergence to an
optimal solution. We then give a theoretical analysis showing that the
event-triggered distributed subgradient algorithm steers the nodes of the entire
network to converge asymptotically to an optimal solution of the convex
optimization problem. Finally, simulation results validate the effectiveness of
the algorithm and demonstrate the feasibility of the theoretical analysis.

Keywords Consensus · Distributed optimization · Event-triggered control ·
Multi-node systems · Discrete-time

4.1 Introduction

In recent years, owing to the great application value of multi-node systems, a
growing number of researchers [1–12] have entered this field and achieved many
remarkable results. The multi-node system
is not only an important class of systems in the research of complex systems but
also an important branch in the research of distributed artificial intelligence. The
coordinated control of multi-node systems is the current frontier research direction
in the field of control systems and has considerable application prospects in the

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 91
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_4
fields of intelligent robots [13], satellite formation flying [14], sensor networks [15–
20], cloud computing [21–24], distributed measurement and monitoring [25–27],
congestion control in communication network [28], and so on [29–33]. Unlike the
traditional control problems, a distinct feature of the multi-node system coordination
control is that the control strategy is only based on the distributed control of local
information among neighbors, and the desired overall goal is achieved through
information exchange. Multi-node coordinated control encompasses problems such as
consensus [3], formation control [1], and distributed filtering [34]. Among these,
the consensus problem of multi-node systems is the core research branch because it
is the basis of multi-node collaboration and provides theoretical support for the
other problems.
Event-triggered control strategies can effectively overcome the drawbacks of
traditional processing methods. This area has attracted considerable attention;
see [35–46]. Tabuada [35] applied the event-triggered control method to control
systems and provided event-triggered control strategies that not only guaranteed
the asymptotic stability of the closed-loop system but also demonstrated that the
event-triggered time sequence excludes Zeno-like behavior. Building on the work of
Tabuada [35], Dimarogonas et al. [37] studied the consensus of first-order
multi-node systems in which each controller update was triggered when a certain
measurement error exceeded a given proportion of the norm of the latest local
states. Meanwhile,
event-triggered control developed rapidly for consensus analysis of
continuous-time multi-node systems. Seyboth et al. [38] studied a variant of
the event-triggered average consensus problem for single- and double-integrator
dynamics, in which a novel control strategy for multi-node coordination was
employed to simplify the performance and convergence analysis of the method. For
multi-node systems, the distributed rendezvous problem for single integrators with
combinational measurements was studied by Fan et al. [26]. However, this required
the Laplacian matrix of the associated communication topology to be symmetric.
Furthermore, Li et al. [39] considered the event-triggered distributed
average consensus of discrete-time first-order multi-node systems with limited
communication data rate. In the framework of network communication, although
each node has a real-valued state, it can only communicate finite-bit binary
symbolic data sequences with its neighboring nodes at each time step, owing to
digital communication channels with energy constraints. Based on a designed novel
event-triggered dynamic encoder and decoder for each node, a distributed control
algorithm was proposed.
The recent work of Zhu et al. [40] addressed the event-triggered consensus problem
for general linear time-invariant systems with fixed topology by applying an
integral inequality technique, and a novel improved algorithm was introduced to
determine the event time sequences, which reduces not only communication between
neighboring nodes but also control updates. From an implementation point of view,
event-triggered control strategies for multi-node systems with discrete-time
dynamics remain a very important issue. The event-triggered and self-triggered

average consensus for discrete-time multi-node systems were reported in Chen [41]
and Hamada [27], respectively. The event-triggered consensus analysis of a class
of discrete-time stochastic multi-node systems was addressed by Ding et al. [42]
using a discrete-time version of input-to-state stability in probability. The
decentralized event-triggered consensus problem was discussed in [47] for
discrete-time multi-node systems with linear dynamics. Using the Kronecker
product technique and the Lyapunov functional method, Yin et al. [44] considered
the event-triggered consensus problem for a set of discrete-time heterogeneous
multi-node systems consisting of two kinds of nodes distinguished by their
dynamics.
A sufficient condition, in terms of a linear matrix inequality, was established to
ensure the consensus of heterogeneous multi-node systems. More recently, Nedic
et al. [48] developed a broadcast-based algorithm, called subgradient-push, which
guides every node to an optimal value under a standard assumption of subgradient
boundedness. The subgradient-push requires knowledge of neither the number of
nodes nor the graph sequence to implement. The authors in [13] worked out an
event-triggered consensus problem for first-order discrete-time multi-node systems
and first proved that Zeno-like behavior of the triggering time sequence is
excluded for discrete-time models. Then, the consensus problem of first-order
discrete-time multi-node systems with time delay via an event-triggered method was
considered by Pu et al. [43]; the authors designed an event-triggered controller
with input time delay, and the triggering time sequence was determined by a given
triggering function. In addition, as far as the theoretical analysis is concerned,
Zeno-like behavior of the triggering function with time delay is excluded for the
closed-loop system.
In [48], the authors studied the distributed convex optimization problem of
discrete-time multi-node systems over time-varying directed graphs without
event-triggered control. Moreover, the authors in [43, 45] considered the
consensus problem of first-order discrete-time multi-node systems, but they did
not study the distributed convex optimization problem. Based on the works
[43, 45, 48], we further consider the convex optimization problem of multi-node
systems via an event-triggered distributed sampling control scheme, where the
controllers exchange information through a shared, limited communication channel
over an undirected network topology. On the one hand, there is little existing
work concerning the distributed convex optimization problem of first-order
discrete-time multi-node systems that fully exploits the event-triggered
broadcast communication technique so as to further save communication cost. On
the other hand, our analysis method is novel, as it incorporates the distributed
optimization technique into the event-triggered control. More precisely, the main
contributions of this chapter can be summarized as follows:
(i) We study the convex optimization problem of first-order discrete-time multi-
node systems via a distributed event-triggered sampling control scheme, where
the event-triggered control strategy removes unnecessary communications
among neighboring nodes, resulting in reduced computation costs and energy
consumption in practice.
(ii) A distributed event-triggered control scheme for each node is designed, which
analytically decides the next sampling time instant using the exchanged local
information. Based on the event-triggered control scheme, a distributed control
algorithm is presented, which only employs the nodes' local information at their
latest sampling time instants.
(iii) We also show the convergence of the algorithm and prove that the event-
triggered distributed subgradient algorithm makes all nodes asymptotically
converge to an optimal solution.

4.2 Preliminaries

4.2.1 Notation

If not particularly stated, the vectors mentioned in this chapter are column vectors.
Denote by R^{N×N}, R^N, and R the set of N×N real matrices, the set of
N-dimensional real column vectors, and the set of real numbers, respectively. The
N-dimensional all-ones vector and the N×N identity matrix are denoted by 1_N and
I_N, respectively. Given a vector or a matrix W, we denote by W^T and W^{-1} its
transpose and inverse, respectively. We denote by λ(A) the set of all eigenvalues
of a matrix A. Given a vector x ∈ R^n, its standard Euclidean norm is denoted
||x||. A subgradient of f(x) is denoted ∇f(x) : R^n → R^n. Given a matrix W, we
write W_ij or [W]_ij for its (i, j)th entry. The symbol ⊗ denotes the Kronecker
product.

4.2.2 Model of Optimization Problem

In this subsection, we study a network of nodes labeled by V = {1, 2, . . . , N}
whose target is to solve the following distributed convex optimization problem:

min f(x), f(x) = Σ_{i=1}^N f_i(x), over x ∈ R^n,   (4.1)

where f_i : R^n → R is the local objective function of node i, and x is a global
decision vector. Assume that f_i is available only to node i, and that the f_i are
possibly different. Let X* = arg min_{x∈R^n} f(x) denote the set of optimal
solutions to (4.1), which is assumed to be nonempty. The optimal value of (4.1) is
denoted by f*, and x* denotes an optimizer of problem (4.1). In this chapter, we
do not assume that the local objective function f_i is differentiable. The
subgradient plays the role of the gradient when the function is not differentiable
at a point. Consider a convex function f : R^n → R and a point x̃ ∈ R^n; a
subgradient of f at x̃ is a vector ∇f(x̃) ∈ R^n such that the following subgradient
inequality holds for any x ∈ R^n:

[∇f(x̃)]^T (x − x̃) ≤ f(x) − f(x̃).

The following assumptions are used throughout this chapter in the analysis of the
distributed optimization algorithm.

Assumption 4.1 (Connected) G is undirected and connected. That is, L has exactly
one zero eigenvalue and all the other eigenvalues are positive.

Assumption 4.2 (Subgradient Boundedness) The subgradients ∇f_i(x) of each
f_i(x) are uniformly bounded, i.e., for each i = 1, . . . , N there exists
0 < H_i < ∞ such that ||∇f_i(x)|| ≤ H_i for all subgradients ∇f_i(x) of f_i(x).
Moreover, we will use the notation H = max_i H_i.

Remark 4.1 It is not hard to find a simple function, e.g., f(x) = ln(1 + sin x),
that meets the requirements of convexity and the subgradient boundedness of
Assumption 4.2.
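The subgradient inequality and Assumption 4.2 can be checked numerically on a concrete nonsmooth function. A hedged sketch using f(x) = |x| (our illustrative choice, for which H = 1; not one of the chapter's objective functions):

```python
import numpy as np

def subgrad_abs(x):
    # A valid subgradient of f(x) = |x|: sign(x), choosing 0 at x = 0.
    return np.sign(x)

f = abs
ok = True
for x_tilde in (-1.5, 0.0, 2.0):
    g = subgrad_abs(x_tilde)
    # Subgradient inequality: g*(x - x_tilde) <= f(x) - f(x_tilde).
    for x in np.linspace(-3.0, 3.0, 61):
        ok = ok and (g * (x - x_tilde) <= f(x) - f(x_tilde) + 1e-12)
    # Uniform boundedness (Assumption 4.2): |g| <= H = 1.
    ok = ok and abs(g) <= 1.0
```

The choice of subgradient at the kink x = 0 is not unique; any value in [−1, 1] also satisfies the inequality.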

4.2.3 Communication Networks

In this chapter, we consider a group of N nodes communicating over an undirected
graph G = {V, E, W} with node set V = {1, 2, . . . , N}, edge set E ⊆ V × V, and
adjacency matrix W = [w_ij] ∈ R^{N×N}. If (j, i) ∈ E, node i can directly
exchange data with node j, and i and j are viewed as neighbors of each other. The
connection weight between nodes i and j in graph G satisfies w_ij = w_ji > 0 if
(i, j) ∈ E and w_ij = w_ji = 0 otherwise. There are no self-connections in the
graph, i.e., w_ii = 0 for all i. We denote by N_i the neighbor set of node i and
by |N_i| the number of neighbors of node i. The Laplacian matrix L = (l_ij)_{N×N}
of graph G associated with the adjacency matrix W is defined by l_ij = −w_ij for
i ≠ j and l_ii = Σ_{j=1, j≠i}^N w_ij, which assures that Σ_{j=1}^N l_ij = 0. A
path is a series of consecutive edges. If there is a path between any two nodes,
then G is said to be connected.
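The Laplacian definitions above are easy to verify numerically. A minimal sketch (the 4-node path graph with unit weights is our illustrative choice):

```python
import numpy as np

# Undirected 4-node path graph: w_ij = w_ji > 0 iff (i, j) in E, w_ii = 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Laplacian: l_ii = sum_{j != i} w_ij and l_ij = -w_ij for i != j.
L = np.diag(W.sum(axis=1)) - W

row_sums = L.sum(axis=1)               # every row sums to zero
eigs = np.sort(np.linalg.eigvalsh(L))  # 0 = lambda_1 < lambda_2 <= ...
```

For a connected graph, exactly one eigenvalue is zero and the rest are positive, which is the content of Assumption 4.1.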

4.3 Algorithm Development

In this section, inspired by Lou et al. [49] and Zhu and Martínez [50], we provide
a novel distributed subgradient algorithm to handle the optimization problem (4.1).
Assume that node i has access to the subgradient of its local objective function
f_i(x). The distributed subgradient algorithm is a discrete-time dynamical system,
which is depicted next.
4.3.1 Distributed Subgradient Algorithm

Consider a set V = {1, . . . , N} of nodes. Formally, at each iteration
t = 0, 1, . . ., node i updates its state at the next time instant according to the
following first-order difference equation:

x_i(t + 1) = x_i(t) + h[u_i(t) − g(t)∇f_i(x_i(t))],   (4.2)

where u_i(t) ∈ R^n, i = 1, 2, . . . , N, is the distributed control input of node i
to be designed; h > 0 is the network control gain; the positive scalars g(t) > 0
are step-sizes; and the vector ∇f_i(x_i(t)) is a subgradient of node i's objective
function f_i(x) at x = x_i(t). In the following, we give an event-triggered control
scheme that reduces not only communication between neighboring nodes but also the
energy consumption of incident detection at each node, while retaining the
asymptotic property of consensus. Suppose that the event-triggered instant sequence
of node i is t_k^i, k = 0, 1, . . .; at each t_k^i, node i samples its state
x_i(t_k^i) and broadcasts it to its neighboring nodes. Likewise, node j sends out
its latest sampled state x_j(t_{k'}^j) to node i if (i, j) ∈ E. Therefore, for
t ∈ [t_k^i, t_{k+1}^i), the distributed control input can be designed as

u_i(t) = −l Σ_{j∈N_i} w_ij [x_i(t_k^i) − x_j(t_{k'}^j)],   (4.3)

where l > 0 is the constant control parameter, the w_ij are non-negative weights,
t_k^i denotes the instant at which the kth event of node i occurs, and t_{k'}^j,
with k' = arg min_{m : t ≥ t_m^j} {t − t_m^j}, is the latest event-triggered
instant of node j. Moreover, t_{k+1}^i denotes node i's next event-triggered time
instant after t_k^i, which is analytically decided by

t_{k+1}^i = inf{t : t > t_k^i and y_i(t) > 0},   (4.4)

in which y_i(t) is the event-triggered function, given by

y_i(t) = ||e_i(t)|| − β_1 || Σ_{j∈N_i} w_ij [x_i(t_k^i) − x_j(t_{k'}^j)] || − β_2 μ^t,   (4.5)

where β_1 > 0, β_2 > 0, μ > 0, and the measurement error is defined as
e_i(t) = x_i(t_k^i) − x_i(t).
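The trigger test (4.4)-(4.5) for a single node can be sketched as follows (the function signature and the dict-based neighbor representation are our assumptions for illustration, not the chapter's notation):

```python
import numpy as np

def event_fires(x_i_last, x_i_now, nbr_last, weights, beta1, beta2, mu, t):
    """Evaluate the trigger function y_i(t) of (4.5) and return y_i(t) > 0.

    x_i_last : state x_i(t_k^i) broadcast at node i's latest event instant
    x_i_now  : node i's current state x_i(t)
    nbr_last : dict j -> x_j at neighbor j's latest event instant
    weights  : dict j -> w_ij
    """
    e_i = x_i_last - x_i_now                     # measurement error e_i(t)
    disagree = sum(weights[j] * (x_i_last - xj) for j, xj in nbr_last.items())
    y_i = (np.linalg.norm(e_i)
           - beta1 * np.linalg.norm(disagree)
           - beta2 * mu ** t)
    return y_i > 0   # True: sample x_i(t), broadcast it, reset e_i to zero
```

When the measurement error outgrows the weighted disagreement term plus the decaying threshold β_2 μ^t, node i broadcasts; otherwise it stays silent, saving communication.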
Let ∇F(X(t)) = [∇f_1(x_1(t))^T, ∇f_2(x_2(t))^T, · · · , ∇f_N(x_N(t))^T]^T ∈ R^{Nn},
x̄(t) = (1/N) Σ_{i=1}^N x_i(t), δ_i(t) = x_i(t) − x̄(t), i = 1, 2, . . . , N, and
X(t) = [x_1^T(t), x_2^T(t), · · · , x_N^T(t)]^T ∈ R^{Nn}. Then the distributed
subgradient algorithm

(4.2) with distributed control input can be rewritten in a compact matrix-vector form
as follows:

X(t + 1) = [(I_N − hlL) ⊗ I_n]X(t) − hl(L ⊗ I_n)e(t) − hg(t)∇F(X(t)).   (4.6)

By the definition of δ_i(t), it is easy to see that the consensus error satisfies
δ(t) = [(I_N − J_N) ⊗ I_n]X(t), where J_N = (1/N) 1_N 1_N^T. It follows that

X(t) = δ(t) + (JN ⊗ In )X(t). (4.7)

It is obtained from (4.7) that

[(IN − JN ) ⊗ In ][(IN − hlL) ⊗ In ]X(t)


= [(IN − JN ) ⊗ In ][(IN − hlL) ⊗ In ][δ(t) + (JN ⊗ In )X(t)]
= [(IN − hlL − JN ) ⊗ In ]δ(t) + [(JN − JN ) ⊗ In ]X(t)
= [(IN − hlL − JN ) ⊗ In ]δ(t)
= [(IN − hlL) ⊗ In ]δ(t). (4.8)

In the above derivation, we have used the equalities J_N J_N = J_N and
[(I_N − J_N) ⊗ I_n](L ⊗ I_n) = L ⊗ I_n (which follows from J_N L = 0). With (4.6)
and (4.8), we have

δ(t + 1) = [(IN − hlL) ⊗ In ]δ(t) − hl(L ⊗ In )e(t)


− hg(t)[(IN − JN ) ⊗ In ]∇F (X(t)). (4.9)

Since the undirected graph network is connected, the eigenvalues of the matrix L
are 0, λ_2, . . . , λ_N. Then we can take an orthogonal matrix
T = [ζ, φ_2, . . . , φ_N] ∈ R^{N×N} by the Gram-Schmidt orthogonalization method,
where ζ = (1/√N) 1_N is the unit eigenvector of L associated with eigenvalue 0 and
φ_i is the eigenvector of L associated with eigenvalue λ_i
(φ_i^T L = λ_i φ_i^T, i = 2, 3, . . . , N). Letting δ̃(t) = (T^{-1} ⊗ I_n)δ(t),
ẽ(t) = (T^{-1} ⊗ I_n)e(t), and ∇F̃(X(t)) = (T^{-1} ⊗ I_n)∇F(X(t)), we have that

δ̃(t + 1) = [T^{-1}(I_N − hlL)T ⊗ I_n]δ̃(t) − hl(T^{-1}LT ⊗ I_n)ẽ(t)
− hg(t)[T^{-1}(I_N − J_N)T ⊗ I_n]∇F̃(X(t)).   (4.10)
Decomposing δ̃(t) = [δ̃_1^T(t), δ̃_2^T(t)]^T, ẽ(t) = [ẽ_1^T(t), ẽ_2^T(t)]^T, and
∇F̃(X(t)) = [∇F̃_1^T(X(t)), ∇F̃_2^T(X(t))]^T, where the first blocks have dimension
n, and in view of (4.9) and (4.10), it follows that

δ̃_1(t + 1) = δ̃_1(t),
δ̃_2(t + 1) = [(I_{N−1} − hl L̃) ⊗ I_n]δ̃_2(t) − hl(L̃ ⊗ I_n)ẽ_2(t)
− hg(t)(L̂ ⊗ I_n)∇F̃_2(X(t)),   (4.11)

where L̃ = diag(λ_2, . . . , λ_N) and L̂ = [φ_2, . . . , φ_N]^T (I_N − J_N)
[φ_2, . . . , φ_N] is the lower-right block of T^{-1}(I_N − J_N)T; the first row of
(4.11) uses the facts that ζ corresponds to the zero eigenvalue of L and that
ζ^T(I_N − J_N) = 0. On the other hand, from
(T^{-1} ⊗ I_n)δ(t) = (T^T ⊗ I_n)δ(t) = δ̃(t), one obtains
δ̃_1(t) = (ζ^T ⊗ I_n)δ(t). Noting that ζ^T J_N = ζ^T, it is easy to get

δ̃_1(t) = (ζ^T ⊗ I_n)[X(t) − (J_N ⊗ I_n)X(t)]
= (ζ^T ⊗ I_n)X(t) − (ζ^T ⊗ I_n)X(t) = 0.   (4.12)

Then one obtains

δ̃2 (t + 1) = [(IN−1 − hl L̃) ⊗ In ]δ̃2 (t) − hl(L̃ ⊗ In )ẽ2 (t)

− hg(t)(L̂ ⊗ In )∇ F̃2 (X(t)). (4.13)

Before giving a key definition, we need the following assumption regarding the
step-sizes.

Assumption 4.3 The step-sizes {g(t)} form a positive sequence satisfying

sup_{t≥0} g(t) ≤ 1,  Σ_{t=0}^∞ g(t) = +∞,  and  Σ_{t=0}^∞ (g(t))^2 < +∞.

Remark 4.2 On the one hand, the step-size function g(t) controls the accuracy of each node's state estimate. On the other hand, to achieve optimality, the step-size function g(t) ought to decrease to zero as t increases to infinity, i.e., lim_{t→∞} g(t) = 0. According to the nature of series summation, we can conclude that Σ_{t=0}^{∞} 1/(t + 1)^k = +∞ and Σ_{t=0}^{∞} (1/(t + 1)^k)^2 < +∞ for all 0.5 < k ≤ 1. Since 0 < sup_{t≥0} 1/(t + 1)^k ≤ 1 for 0.5 < k ≤ 1, we can apply g(t) = 1/(t + 1)^k, 0.5 < k ≤ 1, to satisfy Assumption 4.3 in the forthcoming analysis.
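As a quick numerical sanity check (our own illustration, not part of the original analysis), the partial sums of g(t) = 1/(t + 1)^k for k = 0.75 exhibit exactly the three properties required by Assumption 4.3:

```python
# Illustrative check of Assumption 4.3 for g(t) = 1/(t+1)^k with k = 0.75:
# sup g(t) <= 1, a divergent sum, and a convergent sum of squares (2k > 1).
k = 0.75
T = 100_000
g = [1.0 / (t + 1) ** k for t in range(T)]

sup_g = max(g)                      # attained at t = 0, equals 1
sum_g = sum(g)                      # grows roughly like 4*T**0.25 -> infinity
sum_g2 = sum(x * x for x in g)      # partial sum of 1/(t+1)**1.5, bounded

print(sup_g)     # 1.0
print(sum_g)     # keeps growing as T increases
print(sum_g2)    # stays below its finite limit (zeta(1.5) ~ 2.612)
```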
Definition 4.3 Under some distributed control input u_i(t), the distributed subgradient algorithm (4.2) is said to achieve consensus if lim_{t→∞} ||x_i(t) − x_j(t)|| = 0, ∀i, j = 1, 2, . . . , N, holds for any initial values.

4.4 Convergence Analysis

In this section, we first provide some supporting lemmas. Then we prove the convergence properties of the optimization algorithm (4.2).

4.4.1 Auxiliary Results


Lemma 4.4 ([45]) All the eigenvalues of matrix L̃ are positive if and only if the
undirected graph network of all nodes is connected.
In the following, we first allow the eigenvalues of the matrix L̃ to have arbitrary imaginary parts and positive real parts, which extends Lemma 4.4 to a more general situation.
Lemma 4.5 ([43]) Suppose that the undirected graph network is connected and 0 < h < 1, 0 < l < 2α/((α^2 + β^2)h), where α and β represent the real and imaginary parts of any eigenvalue of L̃. Then one has ρ(I_{N−1} − hlL̃) < 1, where ρ(I_{N−1} − hlL̃) stands for the spectral radius of the matrix I_{N−1} − hlL̃.
Proof Letting λ be any eigenvalue of I_{N−1} − hlL̃, then

0 = |(λ − 1)I_{N−1} + hlL̃| = |((λ − 1)/(lh))I_{N−1} + L̃|.  (4.14)

Let σ = α + β√−1 be an eigenvalue of the matrix L̃. By (4.14), we have (λ − 1)/(lh) + σ = 0. Denote

d(λ) = λ + lhα + lhβ√−1 − 1 = a_1λ + a_0.  (4.15)

By Lemma 4.4, we have α > 0. Noticing that 0 < h < 1 and 0 < l < 2α/((α^2 + β^2)h), one can therefore compute that

Δ_1 = det[a_0, a_1; ā_1, ā_0] = lh(lhα^2 − 2α + lhβ^2) < 0,  (4.16)

where ā_0 and ā_1 are the complex conjugates of a_0 and a_1, respectively. Therefore, we have |λ| < 1 by applying the Schur–Cohn stability test [51]. Thus, ρ(I_{N−1} − hlL̃) < 1. The proof is therefore completed. □
Lemma 4.6 If ρ(I_{N−1} − hlL̃) < 1, there exist positive constants M ≥ 1 and 0 < γ < 1 such that

||(I_{N−1} − hlL̃)^t|| ≤ Mγ^t,  t ≥ 0.

Proof The proof follows the same lines as that of [47]. □
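To make Lemmas 4.5 and 4.6 concrete, the following sketch (our own illustration; the graph and the gains h, l are arbitrary choices, not from the text) builds the Laplacian of a small connected undirected graph, checks ρ(I_{N−1} − hlL̃) < 1 for admissible gains, and verifies the geometric decay of Lemma 4.6:

```python
import numpy as np

# Illustrative check of Lemmas 4.5/4.6 (our own toy graph, not from the text).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])            # adjacency of a connected graph
L = np.diag(A.sum(axis=1)) - A              # graph Laplacian

lam = np.sort(np.linalg.eigvalsh(L))        # 0 = lam_1 < lam_2 <= ... <= lam_N
assert lam[1] > 1e-9                        # connected: Fiedler value positive

# Gains: 0 < h < 1 and l < 2*alpha/(alpha**2 * h) for every nonzero
# eigenvalue alpha (real here, since L is symmetric); that means l < 1.
h, l = 0.5, 0.4
lam_nz = lam[1:]                            # spectrum of L~ (nonzero part)
rho = np.max(np.abs(1.0 - h * l * lam_nz))  # spectral radius of I - h*l*L~
print(rho < 1.0)                            # True

# Geometric decay of Lemma 4.6 with M = 1, gamma = rho
# (L~ symmetric, so the 2-norm of the power equals rho**t).
P = np.eye(len(lam_nz)) - h * l * np.diag(lam_nz)
for t in range(1, 6):
    assert np.linalg.norm(np.linalg.matrix_power(P, t), 2) <= rho ** t + 1e-12
```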

4.4.2 Main Results

We now show that the nodes reach a consensus asymptotically, which means that the node estimates x_i(t) converge to the same point as t goes to infinity.
Theorem 4.7 (Consensus) Let the connected Assumption 4.1, the subgradient boundedness Assumption 4.2, and the step-size Assumption 4.3 hold. Consider the distributed subgradient algorithm (4.2) with control input (4.3), where the triggering time sequence is determined by (4.4) with β_1 ∈ (0, 1/(N||L||)) and β_2 ∈ (0, ∞). Then, consensus for (4.2) can be achieved for l ∈ (0, 2α/((α^2 + β^2)h)) and μ ∈ (γ, 1).

Proof It is immediately obtained from (4.13) that

δ̃_2(t) = [(I_{N−1} − hlL̃) ⊗ I_n]^t δ̃_2(0) − hl Σ_{s=0}^{t−1} [(I_{N−1} − hlL̃) ⊗ I_n]^{t−s−1}(L̃ ⊗ I_n)ẽ_2(s) − h Σ_{s=0}^{t−1} g(s)[(I_{N−1} − hlL̃) ⊗ I_n]^{t−s−1}(L̂ ⊗ I_n)∇F̃_2(X(s)).  (4.17)

Since the undirected graph network of these nodes is connected, by Lemma 4.5, we have ρ(I_{N−1} − hlL̃) < 1. Then it follows from Lemma 4.6 that there exist positive constants M ≥ 1 and 0 < γ < 1 such that ||(I_{N−1} − hlL̃)^t|| ≤ Mγ^t, t ≥ 0. Thus, from (4.17), we can get

||δ̃_2(t)|| ≤ Mγ^t||δ̃_2(0)|| + hl||L̃|| Σ_{s=0}^{t−1} Mγ^{t−s−1}||ẽ_2(s)|| + h||L̂|| Σ_{s=0}^{t−1} g(s)Mγ^{t−s−1}||∇F̃_2(X(s))||.  (4.18)

It is obvious that

||δ(t + 1)|| = ||δ̃(t + 1)|| = ||δ̃_2(t + 1)||,  (4.19)

||ẽ_2(t)|| ≤ ||ẽ(t)|| = ||e(t)||,  (4.20)

and

||∇F̃_2(X(t))|| ≤ ||∇F̃(X(t))|| = ||∇F(X(t))||.  (4.21)

Together with (4.19), (4.20), and (4.21), (4.18) further implies

||δ(t)|| ≤ Mγ^t||δ(0)|| + hl||L̃|| Σ_{s=0}^{t−1} Mγ^{t−s−1}||e(s)|| + h||L̂|| Σ_{s=0}^{t−1} g(s)Mγ^{t−s−1}||∇F(X(s))||.  (4.22)

In addition, by the definition of the triggering time sequence, we deduce that y_i(t) ≤ 0, i.e.,

||e_i(t)|| ≤ β_1||Σ_{j∈N_i} a_ij(x_i(t_k^i) − x_j(t_k^j))|| + β_2μ^t
 ≤ β_1(||Σ_{j∈N_i} a_ij(x_i(t) − x_j(t))|| + ||Σ_{j∈N_i} a_ij(e_i(t) − e_j(t))||) + β_2μ^t
 ≤ β_1||L ⊗ I_n|| ||δ(t)|| + β_1||L ⊗ I_n|| ||e(t)|| + β_2μ^t
 ≤ β_1||L|| ||δ(t)|| + β_1||L|| ||e(t)|| + β_2μ^t.  (4.23)

Then, we have

||e(t)|| ≤ Nβ_1||L|| ||δ(t)|| + Nβ_1||L|| ||e(t)|| + Nβ_2μ^t.  (4.24)

Finally, for all 0 < β_1 < 1/(N||L||), we therefore obtain

||e(t)|| ≤ (Nβ_1||L||/(1 − Nβ_1||L||))||δ(t)|| + (Nβ_2/(1 − Nβ_1||L||))μ^t.  (4.25)
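The bound (4.25) comes entirely from the local trigger test y_i(t) ≤ 0. As a minimal sketch (our own reconstruction from the first line of (4.23); the exact triggering function (4.4) appears earlier in the chapter), node i's broadcast decision could be coded as:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def should_trigger(e_i, xhat_diffs, a_ij, beta1, beta2, mu, t):
    """Trigger test reconstructed from (4.23): node i broadcasts when
    ||e_i(t)|| exceeds beta1*||sum_j a_ij (x_i(t_k^i) - x_j(t_k^j))|| + beta2*mu**t.

    e_i        : current measurement error vector of node i
    xhat_diffs : list of vectors x_hat_i - x_hat_j (last broadcast states)
    a_ij       : matching edge weights
    """
    s = [sum(a * d[m] for a, d in zip(a_ij, xhat_diffs))
         for m in range(len(e_i))]
    return norm(e_i) > beta1 * norm(s) + beta2 * mu ** t

# A large error relative to the (decaying) threshold forces a broadcast.
fire = should_trigger(e_i=[0.5, 0.0],
                      xhat_diffs=[[0.1, 0.0], [0.0, 0.1]],
                      a_ij=[1.0, 1.0],
                      beta1=0.1, beta2=0.05, mu=0.9, t=10)
print(fire)  # True
```

Since μ < 1, the β_2μ^t term vanishes over time and the test becomes purely relative to the local disagreement, which is what keeps ||e(t)|| proportional to ||δ(t)|| in (4.25).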

In the following, we claim that

||δ(t)|| ≤ Zμ^t,  t ≥ 0,  (4.26)

where Z = max{M[||δ(0)|| − h||L̂||√N H/(1 − γ)], Mhl||L̃||Nβ_2/((μ − γ)(1 − Nβ_1||L||) − Mhl||L̃||Nβ_1||L||)}. In order to prove (4.26), we first state that the following inequality holds for any η > 1:

||δ(t)|| < ηZμ^t,  t ≥ 0.  (4.27)



Assuming (4.27) is not true, there must exist a t* > 0 such that ||δ(t*)|| ≥ ηZμ^{t*} and ||δ(t)|| < ηZμ^t for t ∈ [0, t*). Then, by (4.22) and (4.25), it is obtained that

ηZμ^{t*} ≤ ||δ(t*)||
 ≤ Mγ^{t*}||δ(0)|| + hlM||L̃|| Σ_{s=0}^{t*−1} γ^{t*−s−1}[(Nβ_1||L||/(1 − Nβ_1||L||))||δ(s)|| + (Nβ_2/(1 − Nβ_1||L||))μ^s] + hM||L̂|| Σ_{s=0}^{t*−1} g(s)γ^{t*−s−1}||∇F(X(s))||
 ≤ ηMγ^{t*}{||δ(0)|| + (hl||L̃||N(β_1||L||Z + β_2)/(1 − Nβ_1||L||))(1/γ) Σ_{s=0}^{t*−1} μ^s/γ^s + (h||L̂||√N H/γ) Σ_{s=0}^{t*−1} 1/γ^s}
 ≤ ηM{[||δ(0)|| − hl||L̃||N(β_1||L||Z + β_2)/((μ − γ)(1 − Nβ_1||L||)) − h||L̂||√N H/(1 − γ)]γ^{t*} + (hl||L̃||N(β_1||L||Z + β_2)/((μ − γ)(1 − Nβ_1||L||)))μ^{t*}}.  (4.28)

In the process of the above derivation, we have used the subgradient boundedness Assumption 4.2 with H = max_i{H_i}, the step-size Assumption 4.3 (sup_{t≥0} g(t) ≤ 1), and the sum formula of geometric sequences. Then, we present the following two cases:
Case 1: Z = M[||δ(0)|| − h||L̂||√N H/(1 − γ)], which implies that ||δ(0)|| − h||L̂||√N H/(1 − γ) ≥ hl||L̃||N(β_1||L||Z + β_2)/((μ − γ)(1 − Nβ_1||L||)). Then by (4.28) and γ < μ, we can achieve that

ηZμ^{t*} ≤ ||δ(t*)||
 < ηM[||δ(0)|| − hl||L̃||N(β_1||L||Z + β_2)/((μ − γ)(1 − Nβ_1||L||)) − h||L̂||√N H/(1 − γ)]μ^{t*} + ηM(hl||L̃||N(β_1||L||Z + β_2)/((μ − γ)(1 − Nβ_1||L||)))μ^{t*}
 = ηM[||δ(0)|| − h||L̂||√N H/(1 − γ)]μ^{t*} = ηZμ^{t*}.  (4.29)
Case 2: Z = Mhl||L̃||Nβ_2/((μ − γ)(1 − Nβ_1||L||) − Mhl||L̃||Nβ_1||L||), which implies that ||δ(0)|| − h||L̂||√N H/(1 − γ) < hl||L̃||N(β_1||L||Z + β_2)/((μ − γ)(1 − Nβ_1||L||)). Then, we have

ηZμ^{t*} ≤ ||δ(t*)|| < ηM(hl||L̃||N(β_1||L||Z + β_2)/((μ − γ)(1 − Nβ_1||L||)))μ^{t*} = ηZμ^{t*}.  (4.30)

The contradictions reached in (4.29) and (4.30) demonstrate that (4.27) is valid for any η > 1. Letting η → 1, we obtain that inequality (4.26) holds, which further implies that the consensus of (4.2) is achieved asymptotically. The proof is thus completed. □


We now introduce an indispensable convergence result, which is shown in Lemma 4.8.
Lemma 4.8 ([52]) Let {φ(t)} be a non-negative scalar sequence such that

φ(t + 1) ≤ (1 + v(t))φ(t) − ϕ(t) + u(t)

for all t ≥ 0, where v(t) ≥ 0, ϕ(t) ≥ 0, and u(t) ≥ 0 for all t ≥ 0 with Σ_{t=0}^{∞} v(t) < ∞ and Σ_{t=0}^{∞} u(t) < ∞. Then, the sequence {φ(t)} converges to some φ ≥ 0 and Σ_{t=0}^{∞} ϕ(t) < ∞.
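A toy numerical instance of Lemma 4.8 (sequences of our own choosing, not from [52]) shows the mechanism at work: summable v(t) and u(t) cannot prevent the nonnegative progress term ϕ(t) from driving φ(t) to a limit while Σϕ(t) stays finite:

```python
# Toy instance of Lemma 4.8: phi(t+1) = (1 + v(t))*phi(t) - varphi(t) + u(t)
# with v(t) = u(t) = 1/(t+1)**2 (both summable) and varphi(t) = 0.1*phi(t) >= 0.
phi = 5.0
varphi_sum = 0.0
for t in range(2000):
    v = u = 1.0 / (t + 1) ** 2
    varphi = 0.1 * phi
    varphi_sum += varphi
    phi = (1.0 + v) * phi - varphi + u   # equality here, i.e. the worst case

print(phi)          # converges (to 0 in this instance, since varphi ~ phi)
print(varphi_sum)   # stays finite, as the lemma guarantees
```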
In what follows, we will present a vital lemma, which is crucial in the analysis
of the distributed optimization algorithm (4.2). Thereafter, we will investigate the
convergence behavior of the algorithm, where the optimal solution can be achieved
asymptotically.
Lemma 4.9 ([48]) Consider an optimization problem min_{p∈R^n} h(p), where h : R^n → R is a continuous objective function. Suppose that the optimal solution set P* of the above optimization problem is nonempty. Let {p(t)} be a sequence such that, for all p ∈ P* and all t ≥ 0,

||p(t + 1) − p||^2 ≤ (1 + v(t))||p(t) − p||^2 − α(t)(h(p(t)) − h(p)) + u(t),

where v(t) ≥ 0, α(t) ≥ 0, and u(t) ≥ 0 for all t ≥ 0 with Σ_{t=0}^{∞} v(t) < ∞, Σ_{t=0}^{∞} α(t) = ∞, and Σ_{t=0}^{∞} u(t) < ∞. Then the sequence {p(t)} converges to a certain optimal solution p* ∈ P*.
Proof Letting p = p∗ for any p∗ ∈ P ∗ and denoting h∗ = minp∈R n h(p), it follows
that for all t ≥ 0,

||p(t + 1) − p∗ ||2 ≤ (1 + v(t))||p(t) − p∗ ||2 − α(t)(h(p(t)) − h∗ ) + u(t).



Note that all the conditions of Lemma 4.8 hold, and then with the help of Lemma 4.8, we obtain the following statements:

{||p(t) − p*||^2} converges for each p* ∈ P*,  (4.31)

and

Σ_{t=0}^{∞} α(t)(h(p(t)) − h*) < ∞.  (4.32)

Since Σ_{t=0}^{∞} α(t) = ∞, it is obtained from (4.32) that lim inf_{t→∞} h(p(t)) = h*. Denoting by {p(t_ℓ)} a subsequence of {p(t)} along which this limit inferior is attained, then

lim_{ℓ→∞} h(p(t_ℓ)) = lim inf_{t→∞} h(p(t)) = h*.  (4.33)

Recalling (4.31), it is clear that the sequence {p(t)} is bounded. Without loss of generality, we can assume that {p(t_ℓ)} converges to some p̃. By the continuity of h, we therefore obtain

lim_{ℓ→∞} h(p(t_ℓ)) = h(p̃).  (4.34)

Thus, (4.33) and (4.34) jointly imply p̃ ∈ P*. By substituting p* in (4.31) with p̃, we achieve that {p(t)} converges to p̃. The proof is thus completed. □
Theorem 4.10 (Convergence Properties) Let the connected Assumption 4.1, the
subgradient boundedness Assumption 4.2, and the step-size Assumption 4.3 hold.
As for the problem (4.1), consider the distributed subgradient algorithm (4.2) with
distributed control input (4.3), where the triggering time sequence is decided by
(4.4). Then, there exists an optimal solution x ∗ ∈ X∗ such that

lim_{t→∞} ||x_i(t) − x*|| = 0,  ∀i ∈ {1, . . . , N}.

Proof Averaging (4.2) over all nodes, we can conclude that

(1/N) Σ_{i=1}^{N} x_i(t + 1) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} w_ij x_j(t) − (hl/N) Σ_{i=1}^{N} Σ_{j∈N_i} a_ij(e_i(t) − e_j(t)) − (h/N) Σ_{i=1}^{N} g(t)∇f_i(x_i(t)).  (4.35)
Since x̄(t) = (1/N) Σ_{i=1}^{N} x_i(t), the control law (4.35) can be rewritten as

x̄(t + 1) = x̄(t) − hg(t)(1/N) Σ_{i=1}^{N} ∇f_i(x_i(t)).  (4.36)

Consider the sequence (4.36). Letting x ∈ R^n be an arbitrary vector, we have for all t ≥ 0,

||x̄(t + 1) − x||^2 = ||x̄(t) − hg(t)(1/N) Σ_{i=1}^{N} ∇f_i(x_i(t)) − x||^2
 = ||x̄(t) − x||^2 − (2hg(t)/N) Σ_{i=1}^{N} ∇f_i(x_i(t))^T(x̄(t) − x) + (h^2g^2(t)/N^2)||Σ_{i=1}^{N} ∇f_i(x_i(t))||^2.  (4.37)

Recalling that the subgradient boundedness Assumption 4.2 holds with H = max_i{H_i}, we can derive that

||x̄(t + 1) − x||^2 ≤ ||x̄(t) − x||^2 + h^2H^2g^2(t) − (2hg(t)/N) Σ_{i=1}^{N} ∇f_i(x_i(t))^T(x̄(t) − x).  (4.38)

Next, we analyze the cross-term ∇f_i(x_i(t))^T(x̄(t) − x) in (4.38). Firstly, we write

[∇f_i(x_i(t))]^T(x̄(t) − x) = [∇f_i(x_i(t))]^T(x̄(t) − x_i(t)) + [∇f_i(x_i(t))]^T(x_i(t) − x).  (4.39)

We can take a lower bound on the first term [∇f_i(x_i(t))]^T(x̄(t) − x_i(t)) as follows by using the subgradient boundedness:

[∇f_i(x_i(t))]^T(x̄(t) − x_i(t)) ≥ −||∇f_i(x_i(t))|| ||x̄(t) − x_i(t)||.  (4.40)

As for the second term [∇f_i(x_i(t))]^T(x_i(t) − x), we apply the convexity of f_i to obtain

[∇f_i(x_i(t))]^T(x_i(t) − x) ≥ f_i(x_i(t)) − f_i(x),  (4.41)


from which, by applying the Lipschitz continuity of f_i (deduced from Assumption 4.2) and adding and subtracting f_i(x̄(t)), we can further achieve

[∇f_i(x_i(t))]^T(x̄(t) − x)
 ≥ −||∇f_i(x_i(t))|| ||x̄(t) − x_i(t)|| + [f_i(x_i(t)) − f_i(x̄(t))] + [f_i(x̄(t)) − f_i(x)]
 ≥ −||∇f_i(x_i(t))|| ||x̄(t) − x_i(t)|| + [∇f_i(x̄(t))]^T(x_i(t) − x̄(t)) + f_i(x̄(t)) − f_i(x)
 ≥ −||∇f_i(x_i(t))|| ||x̄(t) − x_i(t)|| − ||∇f_i(x̄(t))|| ||x_i(t) − x̄(t)|| + f_i(x̄(t)) − f_i(x)
 = −(||∇f_i(x_i(t))|| + ||∇f_i(x̄(t))||)||x̄(t) − x_i(t)|| + f_i(x̄(t)) − f_i(x).  (4.42)

Substituting (4.42) into (4.38) yields

||x̄(t + 1) − x||^2
 ≤ ||x̄(t) − x||^2 + h^2H^2g^2(t) + (2hg(t)/N) Σ_{i=1}^{N} [(||∇f_i(x_i(t))|| + ||∇f_i(x̄(t))||)||x̄(t) − x_i(t)|| + f_i(x) − f_i(x̄(t))]
 ≤ ||x̄(t) − x||^2 + (4hHg(t)/N) Σ_{i=1}^{N} ||x̄(t) − x_i(t)|| − (2hg(t)/N) Σ_{i=1}^{N} (f_i(x̄(t)) − f_i(x)) + h^2H^2g^2(t),  (4.43)

where in the second inequality we use the subgradient boundedness of f_i. Now, we can apply (4.43) with x = x* for any x* ∈ X* to acquire

||x̄(t + 1) − x*||^2 ≤ ||x̄(t) − x*||^2 + (4hHg(t)/N) Σ_{i=1}^{N} ||x̄(t) − x_i(t)|| − (2hg(t)/N) Σ_{i=1}^{N} (f_i(x̄(t)) − f_i(x*)) + h^2H^2g^2(t).  (4.44)

Rearranging the above formula and applying f(x) = Σ_{i=1}^{N} f_i(x), it follows that

(2h/N)g(t)(f(x̄(t)) − f*) ≤ ||x̄(t) − x*||^2 − ||x̄(t + 1) − x*||^2 + (4hHg(t)/N) Σ_{i=1}^{N} ||x̄(t) − x_i(t)|| + h^2H^2g^2(t),  (4.45)
where f* is the optimal value. Summing (4.45) over [0, ∞), dropping the negative term on the right-hand side, and multiplying both sides by N, we obtain

2h Σ_{t=0}^{∞} g(t)(f(x̄(t)) − f*) ≤ N||x̄(0) − x*||^2 + Nh^2H^2 Σ_{t=0}^{∞} g^2(t) + 4hH Σ_{t=0}^{∞} g(t) Σ_{i=1}^{N} ||x̄(t) − x_i(t)||.  (4.46)

Now, we are in the position to study inequality (4.46). The right-hand side of (4.46) can be partitioned into three items. For the first item, it is clear that

N||x̄(0) − x*||^2 < ∞.  (4.47)

Similarly, in view of the step-size Assumption 4.3, it follows that

Nh^2H^2 Σ_{t=0}^{∞} g^2(t) < ∞.  (4.48)

For the remaining item of (4.46), it is obtained from the consensus result of Theorem 4.7 that

||x_i(t) − x̄(t)|| ≤ Zμ^t.  (4.49)

Multiplying the above relation by g(t) and summing up from 0 to l, it yields that

Σ_{t=0}^{l} g(t) Σ_{i=1}^{N} ||x̄(t) − x_i(t)|| ≤ NZ Σ_{t=0}^{l} g(t)μ^t.  (4.50)

By using g(t)μ^t ≤ g^2(t) + μ^{2t}, we immediately have

Σ_{t=0}^{l} g(t) Σ_{i=1}^{N} ||x̄(t) − x_i(t)|| ≤ NZ Σ_{t=0}^{l} (g^2(t) + μ^{2t}).  (4.51)

By the geometric series summation method and the step-size Assumption 4.3, we obtain Σ_{t=0}^{∞} μ^{2t} = 1/(1 − μ^2) < ∞ and Σ_{t=0}^{∞} g^2(t) < ∞. Substituting (4.47), (4.48), and (4.51) back into (4.46) yields

Σ_{t=0}^{∞} g(t)(f(x̄(t)) − f*) < ∞.  (4.52)

Since Σ_{t=0}^{∞} g(t) = ∞ and f(x̄(t)) − f* ≥ 0, it yields that

lim inf_{t→∞} (f(x̄(t)) − f*) = 0.  (4.53)
Thus, from (4.44), (4.48), and (4.51), we can derive that all the conditions of Lemma 4.9 are satisfied. With this lemma in hand, we can deduce that the average sequence {x̄(t)} asymptotically converges to an optimal solution x* ∈ X*. Recalling Theorem 4.7, it yields that each sequence {x_i(t)}, i = 1, . . . , N, converges to the same optimal solution x*. The proof is thus achieved. □
Remark 4.11 In this chapter, we only consider the diminishing step-size rule (Assumption 4.3) to make the algorithm (4.2) converge to a consistent optimal solution. It is worth noting that the algorithm (4.2) and other similar algorithms can be made faster by using a fixed (constant) step-size, but then they only converge to a neighborhood of the optimal solution set.

4.5 Numerical Examples

In this section, a numerical example is given to validate the practicability of the proposed algorithm and the correctness of the theoretical analysis throughout this chapter.
Consider the general undirected graph G = {V, E, W = [w_ij]_{5×5}}, where E = {(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2), (2, 4), (4, 2), (2, 5), (5, 2), (3, 4), (4, 3), (3, 5), (5, 3), (4, 5), (5, 4)}, w_12 = w_21 = 0.54, w_13 = w_31 = 0.81, w_23 = w_32 = 0.72, w_24 = w_42 = 0.36, w_25 = w_52 = 0.72, w_34 = w_43 = 0.9, w_35 = w_53 = 0.36, w_45 = w_54 = 0.36, and w_ij = 0 if (i, j) ∉ E. In the following example, the undirected graph G described above will be employed. Consider the optimization problem (4.1) with f_i(x) = Σ_{j=1}^{3} H_i(x_ij, u_ij) (i = 1, 2, 3, 4, 5), where H_i(x_ij, u_ij) = (x_ij − u_ij)^2/2 if |x_ij − u_ij| ≤ ε and H_i(x_ij, u_ij) = ε(|x_ij − u_ij| − ε/2) otherwise, i.e., a Huber-type penalty with threshold ε. Here, x_ij is the j-th entry of the vector x_i and u_ij is the corresponding element of the matrix U = [u_ij]_{5×3}. Note that the local objective function f_i(x) is not differentiable, and x ∈ R^3 is the global decision vector. Moreover, we discuss the distributed subgradient algorithm (4.2) with control input (4.3), where the triggering time sequence is determined by (4.4).
In the simulation, we choose the design parameters ε = 2, β_1 = e^{−2.3}, β_2 = 1/(1 − e^{−0.223}), h = 0.01, l = 1, the step-sizes g(t) = 0.005/(t + 1), and design a random matrix U = [u_ij]_{5×3} such that the means of its columns are −2, 0, and 2, respectively, and the variance of each column is 1. The simulation results of the algorithm (4.2) are described in Figs. 4.1, 4.2, 4.3, and 4.4. The state evolutions of all nodes are shown in Fig. 4.1, from which one can see that all the nodes asymptotically achieve the optimal solution within 3000 iterations. When the nodes achieve consensus, the distributed control input u_i(t) tends to 0, which can be seen from Fig. 4.2. Each node's event-triggered sampling time instants are shown in Fig. 4.3, from which we can observe that the updates of the control inputs are asynchronous. According to the statistics, the numbers of sampling instants for the 5 nodes are [22, 28, 31, 41, 53], and the average number of sampling instants is 35. Thus, the average update rate of the control inputs is 35/3000 = 1.17%. Figure 4.4 shows, for node 3, that the norm of the measurement error ||e_3(t)|| asymptotically decreases to zero.

Fig. 4.1 All nodes' states x_i(t)
Fig. 4.2 Evolutions of all nodes' control inputs u_i(t)
Fig. 4.3 All nodes' sampling time instant sequences {t_k^i}
Fig. 4.4 Evolutions of measurement error and threshold for node 3
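The experiment above can be reproduced in outline with a short script. The following sketch is our own reconstruction: the update rule is assembled from the averaged form (4.35) (equivalently, x(t + 1) = x(t) − hl(L ⊗ I_n)x̂(t) − hg(t)∇F(x(t)) with broadcast states x̂), the trigger test is taken from (4.23), and the decay rate μ = e^{−0.223} is inferred from the form of β_2, so details may differ from the book's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric edge weights of the 5-node graph from this section.
A = np.zeros((5, 5))
for i, j, w in [(0, 1, .54), (0, 2, .81), (1, 2, .72), (1, 3, .36),
                (1, 4, .72), (2, 3, .90), (2, 4, .36), (3, 4, .36)]:
    A[i, j] = A[j, i] = w
L = np.diag(A.sum(axis=1)) - A            # weighted graph Laplacian

eps, h, l = 2.0, 0.01, 1.0
beta1 = np.exp(-2.3)
beta2 = 1.0 / (1.0 - np.exp(-0.223))
mu = np.exp(-0.223)                        # assumed, from the form of beta2
U = rng.normal(loc=[-2.0, 0.0, 2.0], scale=1.0, size=(5, 3))

def subgrad(x):                            # Huber-type subgradient, elementwise
    d = x - U
    return np.where(np.abs(d) <= eps, d, eps * np.sign(d))

def disagreement(x):                       # distance of the stack to its average
    return np.linalg.norm(x - x.mean(axis=0))

x = rng.normal(size=(5, 3))                # states x_i(t)
xhat = x.copy()                            # last broadcast states x_i(t_k^i)
triggers = np.zeros(5, dtype=int)
d0 = disagreement(x)

for t in range(3000):
    g = 0.005 / (t + 1)
    for i in range(5):                     # local event-trigger tests, cf. (4.23)
        thr = beta1 * np.linalg.norm(A[i] @ (xhat[i] - xhat)) + beta2 * mu ** t
        if np.linalg.norm(xhat[i] - x[i]) > thr:
            xhat[i] = x[i]                 # broadcast: measurement error resets
            triggers[i] += 1
    x = x - h * l * (L @ xhat) - h * g * subgrad(x)

print(disagreement(x) / d0)                # consensus error shrinks markedly
print(triggers)                            # broadcasts are sparse and asynchronous
```

As in the reported experiment, the nodes reach consensus while only a small fraction of the iterations produce broadcasts, which is the communication saving that event triggering buys.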

4.6 Conclusion

In this chapter, the consensus-based first-order discrete-time multi-node system for solving the distributed convex optimization problem with event-triggered communication has been studied in detail. Based on the designed distributed event-triggered function and triggering condition, a distributed control input has been constructed. It has been proven that the algorithm makes all nodes asymptotically converge to an optimal point. Moreover, the theoretical results have been demonstrated through a numerical example. Although we have not proven that Zeno-like behavior of the triggering time sequence is excluded in this chapter, the proof of this crucial problem will capture our attention in future study. Future work should also include the case of more complex constrained convex optimization problems, and event-triggered communication among nodes in dynamic networks.

References

1. M. Porfiri, D. Roberson, D. Stilwell, Tracking and formation control of multiple autonomous


agents: a two-level consensus approach. Automatica 43(8), 1318–1328 (2007)
2. L. Cheng, Y. Wang, W. Ren, Z.-G. Hou, M. Tan, Containment control of multi-agent systems with dynamic leaders based on a PI^n-type approach. IEEE Trans. Cybern. 46(12), 3004–3017 (2016)
3. S. Olfati, R. Murray, Consensus problems in networks of agents with switching topology and
time delays. IEEE Trans. Autom. Control 49(9), 1520–1533 (2004)
4. H. Li, G. Chen, T. Huang, Z. Dong, High performance consensus control in networked systems
with limited bandwidth communication and time-varying directed topologies. IEEE Trans.
Neural Netw. Learn. Syst. 28(5), 1043–1054 (2017)
5. H. Geng, Z. Chen, Z. Liu, Q. Zhang, Consensus of a heterogeneous multi-agent system with
input saturation. Neurocomputing 166, 382–388 (2015)
6. X. Wu, Y. Tang, W. Zhang, Input-to-state stability of impulsive stochastic delayed systems
under linear assumptions. Automatica 66, 195–204 (2016)
7. H. Chu, Y. Cai, W. Zhang, Consensus tracking for multi-agent Systems with directed graph via
distributed adaptive protocol. Neurocomputing 166, 8–13 (2015)
8. H. Li, G. Chen, X. Liao, T. Huang, Quantized data-based leader-following consensus of general
discrete-time multi-agent systems. IEEE Trans. Circuits Syst. Express Briefs 63(4), 401–405
(2016)
9. G. Miao, Q. Ma, Group consensus of the first-order multi-agent systems with nonlinear input
constraints. Neurocomputing 161, 113–119 (2015)
10. C. Huang, H. Li, D. Xia, L. Xiao, Distributed consensus of multi-agent systems over general
directed networks with limited bandwidth communication. Neurocomputing 174, 681–688
(2016)
11. H. Li, X. Liao, T. Huang, W. Zhu, Y. Liu, Second-order global consensus in multiagent
networks with random directional link failure. IEEE Trans. Neural Netw. Learn. Syst. 26(3),
565–575 (2015)

12. Y. Kang, D.-H. Zhai, G.-P. Liu, Y.-B. Zhao, P. Zhao, Stability analysis of a class of hybrid
stochastic retarded systems under asynchronous switching. IEEE Trans. Autom. Control 59(6),
1511–1523 (2014)
13. A. Jadbabaie, J. Lin, A. Morse, Coordination of groups of mobile autonomous agents using
nearest neighbor rules. IEEE Trans. Autom. Control 48(6), 988–1001 (2003)
14. T. Schetter, M. Campbell, D. Surka, Multiple agent-based autonomy for satellite constellations.
Artif. Intell. 145(1), 147–180 (2003)
15. S. Xie, Y. Wang, Construction of tree network with limited delivery latency in homogeneous
wireless sensor networks. Wirel. Pers. Commun. 78(1), 231–246 (2014)
16. S. Kar, J. Moura, Distributed average consensus in sensor networks with random link failures, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing – ICASSP '07 (2007). https://doi.org/10.1109/ICASSP.2007.366410
17. S. Pereira, Z. Pages, Consensus in correlated random wireless sensor networks. IEEE Trans.
Signal Process. 59(12), 6279–6284 (2011)
18. S. Jian, H.W. Tan, J. Wang, J.W. Wang, S.Y. Lee, A novel routing protocol providing good
transmission reliability in underwater sensor networks. J. Int. Technol. 16(1), 171–178 (2015)
19. P. Guo, J. Wang, X.H. Geng, C.S. Kim, J.U. Kim, A variable threshold-value authentication
architecture for wireless mesh networks. J. Int. Technol. 15(6), 929–936 (2014)
20. S.R. Olfati, J.S. Shamma, Consensus filters for sensor networks and distributed sensor fusion, in Proceedings of the 44th IEEE Conference on Decision and Control (2005). https://doi.org/10.1109/CDC.2005.1583238
21. Z. Fu, K. Ren, J. Shu, X. Sun, F. Huang, Enabling personalized search over encrypted
outsourced data with efficiency improvement. IEEE Trans. Parallel Distrib. Syst. 27(9), 2546–
2559 (2016)
22. Y.J. Ren, J. Shen, J. Wang, J. Han, S.Y. Lee, Mutual verifiable provable data auditing in public
cloud storage. J. Int. Technol. 16(2), 317–323 (2015)
23. Z. Xia, X. Wang, X. Sun, Q. Wang, A secure and dynamic multi-keyword ranked search scheme
over encrypted cloud data. IEEE Trans. Parallel Distrib. Syst. 27(2), 340–352 (2016)
24. Z. Fu, X. Sun, Q. Liu, L. Zhou, J. Shu, Achieving efficient cloud search services: multi-
keyword ranked search over encrypted cloud data supporting parallel computing. IEICE Trans.
Commun. 98(1), 190–200 (2015)
25. M. Cao, A. Morse, B. Anderson, Reaching a consensus in a dynamically changing environ-
ment: convergence rates, measurement delays, and asynchronous events. SIAM J. Control
Optim. 47(2), 575–600 (2008)
26. Y. Fan, G. Feng, Y. Wang, C. Song, Distributed event-triggered control of multi-agent systems
with combinational measurements. Automatica 49(2), 671–675 (2013)
27. K. Hamada, N. Hayashi, S. Takai, Event-triggered and self-triggered control for discrete-time
average consensus problems. SICE J. Control Meas. Syst. Integr. 7(5), 297–303 (2014)
28. Y.P. Tian, Stability analysis and design of the second order congestion control for networks
with heterogeneous delays. IEEE/ACM Trans. Netw. 13(5), 1082–1093 (2005)
29. B. Gu, V. Sheng, A robust regularization path algorithm for v-support vector classification.
IEEE Trans. Neural Netw. Learn. Syst. 28(5), 1241–1248 (2017)
30. Z. Xia, X. Wang, X. Sun, Q. Liu, N. Xiong, Steganalysis of LSB matching using differences
between nonadjacent pixels. Multimed. Tools Appl. 75(5), 1947–1962 (2016)
31. X. Wen, S. Ling, X. Yu, F. Wei, A rapid learning algorithm for vehicle classification. Inf. Sci.
295(1), 395–406 (2015)
32. T. Ma, J. Zhou, M. Tang, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, S. Lee, Social network
and tag sources based augmenting collaborative recommender system. IEICE Trans. Inf. Syst.
98(4), 902–910 (2015)
33. G. Chen, L. Luo, H. Shu, B. Chen, H. Zhang, Color image analysis by quaternion-type
moments. J. Math. Imaging Vis. 51(1), 124–144 (2015)
34. J. Hu, X. Hu, Nonlinear filtering in target tracking using cooperative mobile sensors.
Automatica 46(12), 2041–2046 (2010)

35. P. Tabuada, Event-triggered real-time scheduling of stabilizing control tasks. IEEE Trans.
Autom. Control 52(9), 1680–1685 (2007)
36. H. Li, X. Liao, G. Chen, D.J. Hill, Z. Dong, T. Huang, Event-triggered asynchronous
intermittent communication strategy for synchronization in complex dynamical networks.
Neural Netw. 66, 1–10 (2015)
37. D. Dimarogonas, E. Frazzoli, K. Johansson, Distributed event-triggered control for multi-agent
systems. IEEE Trans. Autom. Control 57(5), 1291–1297 (2012)
38. G. Seyboth, D. Dimarogonas, K. Johansson, Event-based broadcasting for multi-agent average
consensus. Automatica 49(1), 245–252 (2013)
39. H. Li, G. Chen, T. Huang, Z. Dong, W. Zhu, L. Gao, Event-triggered distributed consensus over directed digital networks with limited bandwidth. IEEE Trans. Cybern. 46(12), 3098–3110 (2016)
40. W. Zhu, Z. Jiang, G. Feng, Event-based consensus of multi-agent systems with general linear
models. Automatica 50(2), 552–558 (2014)
41. X. Chen, F. Hao, Event-triggered average consensus control for discrete-time multi-agent
systems. IET Contr. Theory Appl. 16(6), 2493–2498 (2012)
42. D. Ding, Z. Wang, B. Shen, Event-triggered consensus control for a class of discrete-time stochastic multi-agent systems, in Proceedings of the 11th World Congress on Intelligent Control and Automation (2014). https://doi.org/10.1109/WCICA.2014.7052731
43. H. Pu, W. Zhu, D. Wang, Consensus analysis of first-order discrete-time multi-agent systems with time delay: an event-based approach, in 2016 35th Chinese Control Conference (CCC) (2016). https://doi.org/10.1109/ChiCC.2016.7554623
44. X. Yin, D. Yue, S. Hu, Distributed event-triggered control of discrete-time heterogeneous
multi-agent systems. J. Frankl. Inst. 350(3), 651–669 (2013)
45. W. Zhu, Z. Tian, Event-based consensus of first-order discrete time multi-agent systems, in 2016 12th World Congress on Intelligent Control and Automation (WCICA) (2016). https://doi.org/10.1109/WCICA.2016.7578796
46. W. Du, Sunney Y. Leung, Y. Tang, A. Vasilakos, Differential evolution with event-triggered
impulsive control. IEEE Trans. Cybern. 47(1), 244–257 (2016)
47. D. Yang, X. Liu, W. Chen, Periodic event/self-triggered consensus for general continuous-time
linear multi-agent systems under general directed graphs. IET Control Theory Appl. 9(3), 428–
440 (2015)
48. A. Nedic, A. Ozdaglar, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
49. Y. Lou, G. Shi, K. H. Johansson, Y. Hong, Approximate projected consensus for convex
intersection computation: convergence analysis and critical error angle. IEEE Trans. Autom.
Control 59(7), 1722–1736 (2014)
50. M. Zhu, S. Martínez, On distributed convex optimization under inequality and equality
constraints via primal-dual subgradient methods (2010). Preprint. arXiv: 1001.2612
51. B.C. Kuo, Discrete-Data Control System (Prentice Hall, Hoboken, 1970)
52. I. Lobel, A. Ozdaglar, D. Feijer, Distributed multi-agent optimization with state-dependent
communication. Math. Program. 129(2), 255–284 (2011)
53. H. Li, C. Guo, T. Huang, Z. Wei, X. Li, Event-triggered consensus in nonlinear multi-agent
systems with nonlinear dynamics and directed network topology. Neurocomputing 185, 105–
112 (2016)
Chapter 5
Event-Triggered Acceleration Algorithms
for Distributed Stochastic Optimization

Abstract In this chapter, we focus on how to improve the communication and computational efficiency in distributed optimization problems. The problem under study remains that of distributed optimization: minimizing a finite sum of convex cost functions over the nodes of a network, where each cost function is further considered as the average of several constituent functions.
efficiency and computational efficiency simultaneously. To achieve the above goal,
we will introduce an effective event-triggered distributed accelerated stochastic
gradient algorithm, namely ET-DASG. ET-DASG can improve communication
efficiency through an event-triggered strategy, improve computational efficiency by
using SAGA’s variance-reduction technique, and accelerate convergence by using
Nesterov’s acceleration mechanism, thus achieving the target of improving com-
munication efficiency and computational efficiency simultaneously. Furthermore,
we will provide in this chapter a convergence analysis demonstrating that ET-DASG converges in mean to the exact optimal solution with a well-selected constant step-size. Also, thanks to the gradient tracking scheme, the
algorithm can achieve linear convergence rates when each constituent function is
strongly convex and smooth. Moreover, under certain conditions, we prove that the
time interval between two successive trigger moments is larger than the iteration
interval for each node. Finally, we also confirm the attractive performance of ET-
DASG through simulation results.

Keywords Distributed optimization · Stochastic algorithm · Event-triggered ·


Variance reduction · Nesterov’s acceleration

5.1 Introduction

The emergence of networked control systems has brought with it an urgent


requirement for efficient communication and computing technologies. Distributed
optimization can tackle the interaction of multiple nodes on a network and has a
broad application in machine learning [1, 2], resource allocation [3, 4], data analysis

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_5

[5], privacy masking [6], and signal processing [7] due to its ability to parallelize
computation and prevent nodes from sharing privacy. Distributed algorithms usually
follow an iterative process in which nodes in the network store certain estimates of
the decision vector in the context of optimization, exchange this information with
neighboring nodes, and update their estimates based on the information received.
Some of the literature for such distributed schemes include the early work on
distributed gradient descent (DGD) [8] and its various extensions in achieving
efficiency [9, 10], solving constraints [11, 12], applying to complex networks [13,
14], or performing acceleration [15, 16]. These optimization methods successfully
showed the effectiveness for dealing with problems in a distributed manner [8–16].
Nonetheless, although these methods were intuitive and flexible for cost functions
and networks, their convergence rates were particularly slow in comparison with
that of centralized counterparts. Besides, only linear convergence to a sub-optimal neighborhood could be derived for DGD-based methods with constant step-sizes [17]. Therefore,
from an optimization point of view, it is always a priority to propose and analyze
methods that are comparable in performance to centralized counterparts in terms of
convergence rate. In a recent stream of literature, distributed gradient methods that
overcome this exactness-rate dilemma have been proposed, which achieve exactly
linear convergence rate for smooth and strongly convex cost functions. Instances
of such methods, including methods based on gradient tracking [18–28], methods
based on Lagrangian multiplier [29–32], and methods based on dual decomposition
[33–36], are characterized by various mechanisms.
Toward practical optimization models, momentum acceleration approaches have
been successfully and widely used in optimization techniques, which is
conducive to the convergence of DGD-based methods [16, 37–42]. First-order
optimization methods based on momentum acceleration have been of significance
in the machine learning community because of their better scalability for large-
scale tasks (including deep learning, federated learning, etc.) and good performance
in practice. When solving convex or strongly convex optimization problems, many
momentum approaches have emerged, e.g., Nesterov's acceleration mechanism
in [16, 37–39] and the heavy-ball mechanism in [40–42]. Both ensure that
nodes obtain more information from neighbors in the network than methods
without momentum, and both have been proven to largely improve the convergence
rate of gradient-based methods. Despite momentum acceleration mechanisms having
superior theoretical advantages, they do not fully exploit the potential of related
methods in terms of efficiency, e.g., communication and computation. For example,
in machine learning, the accuracy of a model can be improved by increasing the
parameter scale and the size of the training dataset. However, this operation leads
to a substantial increase in training time, which results in low communication
and computation efficiency. Therefore, exploiting valid techniques to achieve
provable efficiency becomes a new challenge for researchers.
To improve communication efficiency and meanwhile maintain the desired
performance of the network, various types of strategies have recently been proposed
and gained popularity in the existing works, e.g., [43–46]. The emergence of the
event-triggered strategy provides a new perspective for collecting and transmitting
information. The main idea behind the event-triggered strategy is that nodes only
take actions when necessary, that is, only when a measurement of the local
node’s state error reaches a specified threshold. Its advantage is that the desired
properties of the network can still be maintained efficiently. There are many works
on distributed event-triggered methods over networks, which can successfully solve
various practical problems and achieve expected results [43–46]. For example,
distributed event-triggered algorithms proposed in [43] have been utilized to
resolve constrained optimization problems, and event-triggered distributed gradient
tracking algorithms in [44] have been proven to linearly converge to the optimal
solution, which have further been extended to the distributed energy management
problem of smart grids in [46]. In the era of big data, nodes in the network may
process large and complex data during information sharing and calculation [5].
Thus, the above methods bear a heavy computational burden.
To reduce the computational burden while keeping each update simple, a
common solution at present is to approximate the true gradient with a stochastic
gradient. Based on this scheme, related stochastic gradient methods have emerged
[47–49]. However, because of the large variance of the stochastic gradient, these
approaches exhibit weaker convergence. By decreasing
the variance of the stochastic gradient, many centralized stochastic methods [50–
54] adopt various variance-reduction techniques to surmount this shortcoming and
improve convergence. Inspired by Konecny et al. [50], Schmidt et al. [51], Defazio
et al. [52], Tan et al. [53], Nguyen et al. [54], many distributed variance-reduced
approaches [2, 55–60] have been extensively investigated, and their performance in
processing machine learning tasks is better than that of their centralized counterparts.
In this chapter, we focus on promoting the execution (i.e., communication
and computation) efficiency and accelerating the convergence of distributed
optimization in dealing with machine learning tasks. As far as the authors know,
there is no existing work that achieves all of these targets simultaneously.
Specifically, we highlight the main contributions of this work as follows:
(i) A novel event-triggered distributed accelerated stochastic gradient algorithm,
namely ET-DASG, is proposed to solve the machine learning tasks. ET-DASG
with well-selected constant step-size can linearly converge in the mean to
the exact optimal solution if each constituent function is strongly convex and
smooth.
(ii) Unlike the time-triggered methods [38–42], ET-DASG utilizes the event-
triggered strategy which effectively avoids frequent real-time communication,
reduces the communication load, and thus improves communication efficiency.
Furthermore, for each node, we prove that the time interval between two
successive triggering instants is larger than the iteration interval.
(iii) Compared with the existing methods [38–46], ET-DASG achieves higher
computation efficiency by means of the variance-reduction technique. In
particular, at each iteration, ET-DASG only employs the gradient of one
randomly selected constituent function and adopts the unbiased stochastic
average gradient (SAGA) to estimate the local gradients, which greatly reduces
the expense of full gradient evaluations.
(iv) In comparison with the existing methods without momentum acceleration
mechanisms [19–22, 44, 45], ET-DASG achieves accelerated convergence with
the help of Nesterov's acceleration mechanism. Moreover, simulation
results verify that the convergence rate of ET-DASG improves with the increase
of the momentum coefficient.

5.2 Preliminaries

5.2.1 Notation

If not particularly stated, the vectors mentioned in this chapter are column vectors.
Let Rp and Rp×q denote the real Euclidean spaces with dimensions p and p × q,
respectively. The identity matrix and the spectral radius of a matrix are represented
as Ip ∈ Rp×p and ρ(·), respectively. The Euclidean norm and the Kronecker product
are represented as || · || and ⊗, respectively. Let the symbol E[·] denote the expectation of
a random variable. Let the notations ∇f (·) and (·)T denote the gradient of a function f
and the transpose of a vector (matrix), respectively. The p-dimensional vectors of
all ones and all zeros are represented as 1p and 0p , respectively.

5.2.2 Model of Optimization Problem

This chapter focuses on minimizing a finite-sum cost function in machine learning,
which can be described as

$$\min_{x \in \mathbb{R}^p} f(x) = \frac{1}{m}\sum_{i=1}^{m} f^i(x), \qquad f^i(x) = \frac{1}{n^i}\sum_{h=1}^{n^i} f^{i,h}(x), \tag{5.1}$$

where x ∈ R^p is the optimization estimator (decision vector) and f^i : R^p → R is
a convex function that we view as the private cost of node i, represented as the
average of n^i constituent functions f^{i,h}. In addition, we make the following
assumption regarding the constituent functions.

Assumption 5.1 Each local constituent function f^{i,h}, i ∈ {1, . . . , m}, h ∈ {1, . . . , n^i},
is κ1-strongly convex and κ2-smooth, i.e., for any a, b ∈ R^p:

(i) $f^{i,h}(a + b) \ge f^{i,h}(a) + \nabla f^{i,h}(a)^{\mathrm T} b + \frac{\kappa_1}{2}\|b\|^2$;
(ii) $\|\nabla f^{i,h}(a + b) - \nabla f^{i,h}(a)\| \le \kappa_2 \|b\|$.
Notice from Assumption 5.1 that problem (5.1) possesses a unique optimal
solution x ∗ and the global cost function f is also κ1 -strongly convex and κ2 -smooth,
where 0 < κ1 ≤ κ2 . In addition, the condition number of the global cost function f
is defined as γ = κ2 /κ1 .
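As a quick numerical illustration (not part of the chapter's analysis), the two inequalities in Assumption 5.1 can be checked for a hypothetical quadratic constituent function f(x) = 0.5 xᵀQx, whose strong-convexity and smoothness moduli are the extreme eigenvalues of Q:

```python
import numpy as np

def satisfies_assumption_5_1(Q, kappa1, kappa2, trials=200, seed=0):
    """Check kappa1-strong convexity and kappa2-smoothness of
    f(x) = 0.5 * x^T Q x at randomly drawn point pairs (a, b)."""
    rng = np.random.default_rng(seed)
    f = lambda x: 0.5 * x @ Q @ x
    grad = lambda x: Q @ x
    for _ in range(trials):
        a = rng.standard_normal(Q.shape[0])
        b = rng.standard_normal(Q.shape[0])
        # (i) f(a+b) >= f(a) + grad f(a)^T b + (kappa1/2)||b||^2
        if f(a + b) < f(a) + grad(a) @ b + 0.5 * kappa1 * (b @ b) - 1e-9:
            return False
        # (ii) ||grad f(a+b) - grad f(a)|| <= kappa2 ||b||
        if np.linalg.norm(grad(a + b) - grad(a)) > kappa2 * np.linalg.norm(b) + 1e-9:
            return False
    return True

# eigenvalues of this hypothetical Q lie in [1, 4], so kappa1 = 1, kappa2 = 4
Q = np.diag([1.0, 2.5, 4.0])
```

Here the condition number is γ = κ2/κ1 = 4 for the hypothetical Q above.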

5.2.3 Communication Network

We aim to solve (5.1) over an undirected network (graph) G = {V, E, Â}, where
V = {1, 2, . . . , m} is the set of nodes, E ⊆ V × V is the set of edges describing
the interactions among all nodes, and Â = [a_{ij}] ∈ R^{m×m} is the (symmetric) weight
matrix. The edge (i, j) ∈ E if node j can directly exchange data with node i.
If (i, j) ∈ E, then a_{ij} > 0; otherwise, a_{ij} = 0. Let N^i = {j | a_{ij} > 0} denote the set
of neighbors of node i. Furthermore, we make the following assumption regarding
the network.

Assumption 5.2 (i) G is undirected and connected; (ii) Â = [a_{ij}] ∈ R^{m×m} is
primitive and doubly stochastic.

Assumption 5.2 indicates that the second largest singular value κ3 of Â is less
than 1, i.e., κ3 = ||Â − (1/m)1_m 1_m^T|| < 1 [19, 20, 22, 25, 44].
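As a sketch (with a hypothetical 4-node ring graph, not an example from the chapter), a weight matrix satisfying Assumption 5.2 can be built with the standard Metropolis rule, and κ3 computed directly from its definition:

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric, doubly stochastic weights for an undirected graph
    given by a 0/1 adjacency matrix (Metropolis-Hastings rule)."""
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and adj[i, j]:
                A[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        A[i, i] = 1.0 - A[i].sum()   # make each row sum to 1
    return A

def second_largest_singular(A):
    # kappa3 = || A - (1/m) * 1 1^T ||  (spectral norm)
    m = A.shape[0]
    return np.linalg.norm(A - np.ones((m, m)) / m, 2)

# hypothetical 4-node ring
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
A = metropolis_weights(adj)
kappa3 = second_largest_singular(A)   # strictly below 1 for a connected graph
```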
Remark 5.1 Assumptions 5.1 and 5.2 are very general and easy to satisfy in
many machine learning tasks, which can usually be expressed as problem (5.1). These
two assumptions allow us to design a distributed linearly convergent
algorithm that accurately solves such practical applications. In addition, when training
a machine learning model, a single computing node may need to hold
a large amount of local data (n^i ≫ 1), but the limited memory of the computing
node causes a significant increase in training time as well as in the amount of
communication and calculation. However, it is expensive to improve the computing
and communication capabilities of a single piece of hardware. Hence, designing a
novel event-triggered distributed accelerated stochastic gradient algorithm is
of great significance.

5.3 Algorithm Development

In this section, the event-triggered communication strategy is introduced. Then,


a novel event-triggered distributed accelerated stochastic gradient algorithm (ET-
DASG) is developed.
5.3.1 Event-Triggered Communication Strategy

In this subsection, we focus on designing an event-triggered strategy, where each


node can determine online when to broadcast its current estimators to its neighbors
by testing a certain triggering condition.
Before introducing the event-triggered strategy, we first denote by t_k^i the k-th
triggering time of node i, where i ∈ V. Assume that x̂_t^i and ŷ_t^i are the
estimators¹ that node i transmitted to its neighbors at the latest triggering time before
time t, i.e.,

$$\hat{x}_t^i = x^i_{t^i_{k(i,t)}}, \quad \hat{y}_t^i = y^i_{t^i_{k(i,t)}}, \qquad \text{for } t^i_{k(i,t)} \le t < t^i_{k(i,t)+1},$$

where x_t^i and y_t^i are two estimators of node i. Moreover, we suppose that all
nodes broadcast their estimators x_t^i and y_t^i at the initial time, i.e., x̂_0^i = x_0^i and ŷ_0^i = y_0^i for
all i ∈ V. In addition, the next triggering time t^i_{k(i,t)+1} after t^i_{k(i,t)} for node i ∈ V is
determined by

$$t^i_{k(i,t)+1} = \inf\left\{ t \,\middle|\, t > t^i_{k(i,t)},\; \big\|\varepsilon_t^{i,x}\big\|^2 + \big\|\varepsilon_t^{i,y}\big\|^2 > C\kappa_4^t \right\}, \tag{5.2}$$

where Cκ_4^t is the event-triggered threshold with parameters C > 0 and 0 < κ4 < 1,
and ε_t^{i,x}, ε_t^{i,y} are the measurement errors, which are defined by

$$\varepsilon_t^{i,x} = \hat{x}_t^i - x_t^i, \qquad \varepsilon_t^{i,y} = \hat{y}_t^i - y_t^i. \tag{5.3}$$
Remark 5.2 The emergence of the event-triggered strategy provides a new perspective
for information sampling and transmission. In particular, once node i receives
the transmitted estimators (x̂_t^j, ŷ_t^j) from its neighbors j ∈ N^i, node i first replaces
the neighbors' estimators stored in its local memory, and then updates its next
estimators (x_{t+1}^i, y_{t+1}^i) based on its current information (x_t^i, y_t^i). Note that only when
an event occurs at node i, i.e., the triggering condition in (5.2) is met, are the estimators
x_t^i and y_t^i broadcast to its neighbors. From the communication perspective,
the event-triggered strategy avoids real-time communication with neighbors, which
plays an essential role in reducing the communication load and realizing better
communication efficiency compared with time-triggered methods [38–42, 55–60].

¹ In methods based on the event-triggered strategy, the local estimators x̂_t^i and ŷ_t^i are determined by
the node's own estimators and the latest information sent from its neighbors j ∈ N^i (at the latest triggering
time of node j before t).
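The triggering test in (5.2)–(5.3) can be sketched in a few lines; the helper name and the parameter values in the usage note are hypothetical:

```python
import numpy as np

def should_broadcast(x_hat, x, y_hat, y, C, kappa4, t):
    """Event-triggering test of (5.2): return True when the squared
    measurement errors of (5.3) exceed the decaying threshold C * kappa4**t."""
    eps_x = x_hat - x          # eps_t^{i,x} in (5.3)
    eps_y = y_hat - y          # eps_t^{i,y} in (5.3)
    return float(eps_x @ eps_x + eps_y @ eps_y) > C * kappa4**t
```

Because the threshold Cκ4^t decays geometrically, early iterations tolerate large deviations between the last broadcast values and the current estimators, while later iterations force increasingly accurate (and hence more frequent) communication.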
5.3.2 Event-Triggered Distributed Accelerated Stochastic


Gradient Algorithm (ET-DASG)

In this subsection, we introduce the event-triggered distributed accelerated stochastic
gradient algorithm, named ET-DASG, to solve problem (5.1).
To motivate the design of ET-DASG, recall that methods based on the event-triggered
communication strategy can achieve better communication efficiency. On
the other hand, the variance-reduction technique adopted in existing approaches
can effectively alleviate the computational burden of local full gradient evaluations
and promote computation efficiency. In addition, accelerated linear convergence
is also an important requirement for machine learning methods. Therefore,
we propose ET-DASG, which applies the event-triggered strategy to realize better
communication efficiency, applies the variance-reduction technique of SAGA to
achieve higher computation efficiency, and meanwhile leverages Nesterov's
acceleration mechanism and the gradient tracking technique to achieve accelerated
linear convergence.
The details of ET-DASG are given in Algorithm 3.² To implement Algorithm 3
locally, we first assume that each node i ∈ V maintains a gradient table containing all
gradients ∇f^{i,h}, ∀h ∈ {1, . . . , n^i}, evaluated at certain estimators (such as the
local accelerated estimator s^i or the local auxiliary estimators e^{i,h}). At each iteration
t + 1, each node i first updates the local decision estimator x_{t+1}^i and
the local accelerated estimator s_{t+1}^i as in step 3. Then, each node i uniformly and
randomly chooses one label, indexed by χ_{t+1}^i ∈ {1, . . . , n^i}, from its own data
batch, and updates the local stochastic gradient g_{t+1}^i as in step 5. After updating
g_{t+1}^i, the local auxiliary estimator e_{t+2}^{i,χ_{t+1}^i} is assigned the value of the local accelerated
estimator s_{t+1}^i at the label χ_{t+1}^i, and the newly computed constituent gradient
∇f^{i,χ_{t+1}^i}(s_{t+1}^i) is substituted into the χ_{t+1}^i-th gradient table position,
while the other entries remain unchanged. Then, the update of the local auxiliary estimator y_{t+1}^i
is performed to track the variance-reduced gradient. Subsequently, each node i
calculates the measurement errors ε_t^{i,x}, ε_t^{i,y} in (5.3) and then tests the triggering
condition in (5.2). Finally, node i broadcasts x_{t+1}^i and y_{t+1}^i to its neighbors j ∈ N^i
and updates the latest triggering time if the condition is satisfied, i.e., the event is
triggered; otherwise it keeps the local estimators unchanged.

² Assume that there is no information transmission or calculation at time t = 0, which further
means that the event-triggered and stochastic gradient processes are not executed.
Algorithm 3 Event-triggered distributed accelerated stochastic gradient algorithm
(ET-DASG) for node i ∈ V
1: Initialization: Each node i initializes x̂_0^i = x_0^i ∈ R^p, e_1^{i,h} = s_0^i ∈ R^p, ∀h ∈ {1, . . . , n^i}, and
ŷ_0^i = y_0^i = g_0^i = ∇f^i(s_0^i) ∈ R^p.
2: for t = 0, 1, 2, . . . do
3: Update estimators x_{t+1}^i and s_{t+1}^i according to:

$$x_{t+1}^i = x_t^i + \sum_{j=1}^{m} a_{ij}\left(\hat{x}_t^j - \hat{x}_t^i\right) - \eta y_t^i, \tag{5.4a}$$

$$s_{t+1}^i = x_{t+1}^i + \alpha\left(x_{t+1}^i - x_t^i\right), \tag{5.4b}$$

where a_{ij} is the weight between i and j, the step-size η > 0, and the momentum coefficient
0 < α < 1.
4: Choose χ_{t+1}^i uniformly and randomly from {1, . . . , n^i}.
5: Update g_{t+1}^i according to:

$$g_{t+1}^i = \nabla f^{i,\chi_{t+1}^i}\!\left(s_{t+1}^i\right) - \nabla f^{i,\chi_{t+1}^i}\!\left(e_{t+1}^{i,\chi_{t+1}^i}\right) + \frac{1}{n^i}\sum_{h=1}^{n^i} \nabla f^{i,h}\!\left(e_{t+1}^{i,h}\right).$$

6: Take e_{t+2}^{i,χ_{t+1}^i} = s_{t+1}^i and replace ∇f^{i,χ_{t+1}^i}(e_{t+2}^{i,χ_{t+1}^i}) by ∇f^{i,χ_{t+1}^i}(s_{t+1}^i) in the χ_{t+1}^i-th gradient
table position. All other estimators e_{t+2}^{i,h}, ∀h ≠ χ_{t+1}^i, and gradient entries in the table keep
unchanged, i.e., e_{t+2}^{i,h} = e_{t+1}^{i,h} and ∇f^{i,h}(e_{t+2}^{i,h}) = ∇f^{i,h}(e_{t+1}^{i,h}) for all h ≠ χ_{t+1}^i.
7: Update estimator y_{t+1}^i according to:

$$y_{t+1}^i = y_t^i + \sum_{j=1}^{m} a_{ij}\left(\hat{y}_t^j - \hat{y}_t^i\right) + g_{t+1}^i - g_t^i. \tag{5.4c}$$

8: Calculate the measurement errors ε_t^{i,x}, ε_t^{i,y} in (5.3), and then test the triggering condition in
(5.2).
9: if the triggering condition in (5.2) is satisfied then
10: Broadcast x_{t+1}^i and y_{t+1}^i to its neighbors j ∈ N^i, and update the latest triggering time.
11: end if
12: end for
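The steps above can be sketched as a minimal single-machine simulation on hypothetical quadratic losses f^{i,h}(x) = ½||x − c^{i,h}||², whose exact optimum is the mean of all data points; the network, data, and parameter values below are illustrative choices, not the chapter's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 3, 2, 2                        # nodes, samples per node, dimension
c = rng.standard_normal((m, n, p))       # hypothetical local data
x_star = c.reshape(-1, p).mean(axis=0)   # exact optimal solution of (5.1)

A = np.full((m, m), 1.0 / m)             # doubly stochastic weights (complete graph)
eta, alpha = 0.1, 0.3                    # step-size and momentum coefficient
C, kappa4 = 1.0, 0.9                     # event-triggered threshold parameters

grad = lambda i, h, z: z - c[i, h]       # gradient of one constituent function

x = np.zeros((m, p))
s = x.copy()
e = np.repeat(s[:, None, :], n, axis=1)  # auxiliary estimators e^{i,h}
g = np.array([sum(grad(i, h, s[i]) for h in range(n)) / n for i in range(m)])
y = g.copy()
x_hat, y_hat = x.copy(), y.copy()        # last broadcast values

for t in range(2000):
    # (5.4a): row sums of A are 1, so A @ x_hat - x_hat = sum_j a_ij (x_hat_j - x_hat_i)
    x_new = x + A @ x_hat - x_hat - eta * y
    s_new = x_new + alpha * (x_new - x)            # (5.4b), momentum step
    g_new = np.empty_like(g)
    for i in range(m):
        chi = rng.integers(n)                      # uniformly random sample index
        avg = sum(grad(i, h, e[i, h]) for h in range(n)) / n
        g_new[i] = grad(i, chi, s_new[i]) - grad(i, chi, e[i, chi]) + avg  # SAGA
        e[i, chi] = s_new[i]                       # step 6: refresh the table entry
    y_new = y + A @ y_hat - y_hat + g_new - g      # (5.4c), gradient tracking
    for i in range(m):                             # triggering condition (5.2)
        err = np.sum((x_hat[i] - x_new[i])**2) + np.sum((y_hat[i] - y_new[i])**2)
        if err > C * kappa4**t:
            x_hat[i], y_hat[i] = x_new[i], y_new[i]
    x, s, g, y = x_new, s_new, g_new, y_new
# all nodes approach consensus on the exact optimum x_star
```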

According to the measurement errors (5.3) and Assumption 5.2(ii), the updates
(5.4a)–(5.4c) of Algorithm 3 can be represented as follows:

$$\begin{cases} x_{t+1}^i = \sum_{j=1}^{m} a_{ij} x_t^j + \hat{\varepsilon}_t^{i,x} - \eta y_t^i \\ s_{t+1}^i = x_{t+1}^i + \alpha\left(x_{t+1}^i - x_t^i\right) \\ y_{t+1}^i = \sum_{j=1}^{m} a_{ij} y_t^j + \hat{\varepsilon}_t^{i,y} + g_{t+1}^i - g_t^i \end{cases} \tag{5.5}$$

where $\hat{\varepsilon}_t^{i,x} = \sum_{j=1}^{m} a_{ij}\big(\varepsilon_t^{j,x} - \varepsilon_t^{i,x}\big)$ and $\hat{\varepsilon}_t^{i,y} = \sum_{j=1}^{m} a_{ij}\big(\varepsilon_t^{j,y} - \varepsilon_t^{i,y}\big)$.
Remark 5.3 It is worth noticing that, as calculated in step 5 of Algorithm 3, the
computation of the local stochastic gradient g_t^i, i ∈ V, is costly in that this
step needs to evaluate the summation term $\sum_{h=1}^{n^i} \nabla f^{i,h}(e_{t+1}^{i,h})$ at each iteration.
Specifically, if we naively implement the update in step 5, then at each iteration we
face an O(n^i)-order computational cost when calculating $\sum_{h=1}^{n^i} \nabla f^{i,h}(e_{t+1}^{i,h})$.
This cost can be avoided by updating the summation term at each iteration with the
following recursive formula:

$$\sum_{h=1}^{n^i} \nabla f^{i,h}\!\left(e_{t+1}^{i,h}\right) = \sum_{h=1}^{n^i} \nabla f^{i,h}\!\left(e_t^{i,h}\right) + \nabla f^{i,\chi_t^i}\!\left(s_t^i\right) - \nabla f^{i,\chi_t^i}\!\left(e_t^{i,\chi_t^i}\right),$$

so that the summation term $\sum_{h=1}^{n^i} \nabla f^{i,h}(e_{t+1}^{i,h})$ can be obtained in a computationally
efficient way. However, we also point out that the O(n^i)-order computational cost
cannot be avoided in the existing methods [19, 20, 22, 23, 25] using the deterministic
gradient tracking technique. From this point of view, the per-iteration computation
cost of ET-DASG no longer scales with n^i, and ET-DASG can
improve computation efficiency.
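The recursive update of the summation term can be sketched as a small gradient-table helper (the class name and data are hypothetical); replacing one entry keeps the cached sum consistent in O(1) instead of re-summing all n^i entries:

```python
import numpy as np

class GradientTable:
    """Stores the n^i constituent gradients of one node together with
    an O(1)-maintainable running sum, as described in Remark 5.3."""

    def __init__(self, grads):
        self.table = [np.asarray(g, dtype=float).copy() for g in grads]
        self.total = np.sum(self.table, axis=0)   # cached sum of all entries

    def average(self):
        # the last term of the SAGA estimator in step 5
        return self.total / len(self.table)

    def replace(self, h, new_grad):
        new_grad = np.asarray(new_grad, dtype=float)
        # new_sum = old_sum + new_entry - old_entry: no O(n^i) re-summation
        self.total = self.total + new_grad - self.table[h]
        self.table[h] = new_grad.copy()
```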
Define $\hat{\varepsilon}_t^x = [(\hat{\varepsilon}_t^{1,x})^{\mathrm T}, \ldots, (\hat{\varepsilon}_t^{m,x})^{\mathrm T}]^{\mathrm T}$, $\hat{\varepsilon}_t^y = [(\hat{\varepsilon}_t^{1,y})^{\mathrm T}, \ldots, (\hat{\varepsilon}_t^{m,y})^{\mathrm T}]^{\mathrm T}$, $x_t = [(x_t^1)^{\mathrm T}, \ldots, (x_t^m)^{\mathrm T}]^{\mathrm T}$, $y_t = [(y_t^1)^{\mathrm T}, \ldots, (y_t^m)^{\mathrm T}]^{\mathrm T}$, $s_t = [(s_t^1)^{\mathrm T}, \ldots, (s_t^m)^{\mathrm T}]^{\mathrm T}$, $g_t = [(g_t^1)^{\mathrm T}, \ldots, (g_t^m)^{\mathrm T}]^{\mathrm T}$, and $A = \hat{A} \otimes I_p$. Then, Algorithm 3 can be rewritten in the compact form

$$\begin{cases} x_{t+1} = A x_t + \hat{\varepsilon}_t^x - \eta y_t \\ s_{t+1} = x_{t+1} + \alpha\left(x_{t+1} - x_t\right) \\ y_{t+1} = A y_t + \hat{\varepsilon}_t^y + g_{t+1} - g_t \end{cases} \tag{5.6}$$

In what follows, we use F_t to denote the history of the system up until time t.
Then, from the prior results of [2, 55, 57], it is clear that the stochastic averaging
gradient is unbiased. In particular, given F_t, one gets

$$\mathbb{E}\left[g_t^i \,\middle|\, \mathcal{F}_t\right] = \nabla f^i\!\left(s_t^i\right). \tag{5.7}$$
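The unbiasedness property (5.7) can be verified exactly for a single node by averaging the SAGA estimator of step 5 over all n equally likely choices of χ (the quadratic constituents and data below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 3
c = rng.standard_normal((n, p))            # hypothetical data points
s = rng.standard_normal(p)                 # current accelerated estimator
e = rng.standard_normal((n, p))            # stored auxiliary estimators e^{i,h}

# constituent gradient of f^{i,h}(x) = 0.5*||x - c[h]||^2
grad = lambda h, z: z - c[h]
avg_stored = np.mean([grad(h, e[h]) for h in range(n)], axis=0)

# conditional expectation of the SAGA estimator: average over all chi
mean_g = np.mean([grad(chi, s) - grad(chi, e[chi]) + avg_stored
                  for chi in range(n)], axis=0)
true_grad = np.mean([grad(h, s) for h in range(n)], axis=0)   # grad f^i(s)
```

The two displaced table terms cancel in expectation, leaving exactly ∇f^i(s), which is what (5.7) states.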

Remark 5.4 In comparison with the existing distributed accelerated approaches
[38–42], distributed event-triggered methods [43–46], and variance-reduced
distributed stochastic approaches (including GT-SAGA/GT-SVRG [59], DSA [55],
etc.), it is worth highlighting that Algorithm 3 achieves the above targets well
when processing machine learning tasks. That is to say, Algorithm 3 not only
accelerates convergence but also promotes the execution (communication and
computation) efficiency. Based on this, we consider that this chapter develops a
novel distributed optimization algorithm adapted to real scenarios.
5.4 Convergence Analysis

In this section, we show the theoretical guarantees for the convergence of ET-DASG.

5.4.1 Auxiliary Results

To proceed, several auxiliary variables that will support the subsequent analysis are
defined below:
$$\bar{x}_t = \frac{1}{m}\left(1_m^{\mathrm T} \otimes I_p\right) x_t, \qquad \bar{y}_t = \frac{1}{m}\left(1_m^{\mathrm T} \otimes I_p\right) y_t,$$

$$\nabla F(s_t) = \left[\nabla f^1(s_t^1)^{\mathrm T}, \ldots, \nabla f^m(s_t^m)^{\mathrm T}\right]^{\mathrm T}, \qquad A_\infty = \frac{1_m 1_m^{\mathrm T}}{m} \otimes I_p,$$

$$\nabla \bar{F}(s_t) = \frac{1}{m}\left(1_m^{\mathrm T} \otimes I_p\right) \nabla F(s_t), \qquad \bar{g}_t = \frac{1}{m}\left(1_m^{\mathrm T} \otimes I_p\right) g_t.$$
Then, note that (5.6) is a stochastic gradient tracking method [55, 59, 60]. Under
the given initial conditions, the measurement errors (5.3), and Assumption 5.2(ii),
the following conclusions can be drawn by induction:

$$\bar{y}_t = \bar{g}_t, \qquad \frac{1}{m}\left(1_m^{\mathrm T} \otimes I_p\right)\hat{\varepsilon}_t^x = 0, \qquad \frac{1}{m}\left(1_m^{\mathrm T} \otimes I_p\right)\hat{\varepsilon}_t^y = 0, \qquad \forall t \ge 0.$$

Recalling (5.7), it is straightforward to verify that

$$\mathbb{E}[g_t \,|\, \mathcal{F}_t] = \nabla F(s_t), \qquad \mathbb{E}[\bar{y}_t \,|\, \mathcal{F}_t] = \mathbb{E}[\bar{g}_t \,|\, \mathcal{F}_t] = \nabla \bar{F}(s_t), \qquad \forall t \ge 0.$$

Moreover, for each node i ∈ V, the average optimality gap between the auxiliary
variables e_t^{i,h}, h ∈ {1, . . . , n^i}, and the optimal solution x* is defined as follows:

$$v_t^i = \frac{1}{n^i}\sum_{h=1}^{n^i} \left\|e_t^{i,h} - x^*\right\|^2, \qquad v_t = \sum_{i=1}^{m} v_t^i, \qquad \forall t \ge 0. \tag{5.8}$$

Before establishing the auxiliary results, we state a few useful facts in the
following lemma, whose proofs can be found in [40, 41, 46, 59, 61].

Lemma 5.5 Under Assumptions 5.1–5.2, we have:
(i) For all x ∈ R^{mp}, ||Ax − A_∞x|| ≤ κ3||x − A_∞x||, where 0 < κ3 < 1.
(ii) For all x ∈ R^p, if 0 < η ≤ 1/κ2, then ||x − η∇f(x) − x*|| ≤ (1 − κ1η)||x − x*||.
(iii) Assume that Γ ∈ R^{p×p} is a non-negative matrix and w ∈ R^p is a positive
vector. If Γw < w, then ρ(Γ) < 1.
(iv) For the sequences {x_t} and {s_t} generated by the algorithm (5.6), one
obtains ||∇F̄(s_t) − ∇f(x̄_t)|| ≤ (κ2/√m)||s_t − A_∞x_t|| for all t ≥ 0.
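Lemma 5.5(iii) can be illustrated numerically: for a non-negative matrix Γ and a positive vector w with Γw < w, the spectral radius is below 1 (the matrix and vector here are hypothetical values, not the Γ of the convergence analysis):

```python
import numpy as np

# hypothetical non-negative matrix and positive vector satisfying Gamma @ w < w
Gamma = np.array([[0.5, 0.2, 0.0],
                  [0.1, 0.6, 0.1],
                  [0.2, 0.0, 0.7]])
w = np.array([1.0, 1.0, 1.0])

assert np.all(Gamma @ w < w)               # hypothesis of Lemma 5.5(iii)
rho = max(abs(np.linalg.eigvals(Gamma)))   # spectral radius rho(Gamma)
```

This is the mechanism used later in Theorem 5.12: exhibiting one positive vector w with Γw < w certifies ρ(Γ) < 1 without computing eigenvalues symbolically.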
In addition, an upper bound on E[||g_t − ∇F(s_t)||²] is introduced in the following
lemma, whose proof can be found in [55, 59].

Lemma 5.6 If Assumptions 5.1–5.2 hold, the following recursive relation holds for all t ≥ 1:

$$\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] \le 4\kappa_2^2\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 8\kappa_2^2\,\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right] + 8\alpha^2\kappa_2^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + 2\kappa_2^2\,\mathbb{E}[v_t]. \tag{5.9}$$

From Lemma 5.6, we can see that when x_t^i, i ∈ V, approach x*, the auxiliary
variables e_t^{i,h}, i ∈ V, h ∈ {1, . . . , n^i}, tend to x*, which indicates that E[||g_t − ∇F(s_t)||²]
diminishes to zero.

5.4.2 Supporting Lemmas

In this subsection, we start to analyze the performance of Algorithm 3 by establish-


ing the interactions among the following sequences for all t ≥ 1: (i) Mean-squared
consensus error E[||xt − A∞ xt ||2 ]; (ii) Mean-squared network optimality gap
E[||x̄t − x ∗ ||2 ]; (iii) Mean state optimality gap E[vt ]; (iv) Mean-squared state
difference E[||xt −xt −1 ||2 ]; and (v) Mean-squared stochastic gradient tracking error
E[||yt − A∞ yt ||2 ].
The first result, concerning the upper bound of E[v_t], is given in the following
lemma; the proof can be found in [59].

Lemma 5.7 The sequence {v_t}, ∀t ≥ 1, is upper bounded by

$$\mathbb{E}[v_{t+1}] \le \left(1 - \frac{1}{\hat{n}}\right)\mathbb{E}[v_t] + \frac{2}{\check{n}}\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{2}{\check{n}}\,\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right], \tag{5.10}$$

where $\hat{n} = \max_{i\in\mathcal V}\{n^i\}$ and $\check{n} = \min_{i\in\mathcal V}\{n^i\}$.


In the next lemma, we give the bound of the mean-squared state difference
E[||xt − xt −1||2 ].
Lemma 5.8 If Assumptions 5.1–5.2 hold, the following recursive relation holds for all t ≥ 1:

$$\begin{aligned}\mathbb{E}\left[\|x_{t+1} - x_t\|^2\right] &\le \left(96\eta^2\kappa_2^2 + 8\right)\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 32\eta^2\kappa_2^2\,\mathbb{E}[v_t] \\ &\quad + 144\eta^2\kappa_2^2\,\mathbb{E}\left[m\|\bar{x}_t - x^*\|^2\right] + 16\eta^2\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] \\ &\quad + 160\eta^2\kappa_2^2\alpha^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + 4\,\mathbb{E}\left[\|\hat{\varepsilon}_t^x\|^2\right]. \end{aligned} \tag{5.11}$$

Proof Following from (5.6) and the fact that ||A − I_{mp}||² ≤ 4, one gets

$$\begin{aligned}\|x_{t+1} - x_t\|^2 &= \|(A - I_{mp})x_t + \hat\varepsilon_t^x - \eta y_t\|^2 = \|(A - I_{mp})(x_t - A_\infty x_t) + \hat\varepsilon_t^x - \eta y_t\|^2 \\ &\le 8\|x_t - A_\infty x_t\|^2 + 4\|\hat\varepsilon_t^x\|^2 + 4\eta^2\|y_t\|^2. \end{aligned} \tag{5.12}$$

Define ∇F(x*) = [∇f¹(x*)^T, . . . , ∇f^m(x*)^T]^T. Then, we further get

$$\begin{aligned}\|y_t\| &= \|y_t - A_\infty y_t + A_\infty y_t - A_\infty \nabla F(s_t) + A_\infty \nabla F(s_t)\| \\ &\le \|y_t - A_\infty y_t\| + \|g_t - \nabla F(s_t)\| + \kappa_2\|s_t - (1_m \otimes I_p)x^*\| \\ &\le \|y_t - A_\infty y_t\| + \|g_t - \nabla F(s_t)\| + \kappa_2\|s_t - A_\infty x_t\| + \sqrt{m}\,\kappa_2\|\bar x_t - x^*\|, \end{aligned} \tag{5.13}$$

where the first inequality has exploited the facts that ȳ_t = ḡ_t, ∀t ≥ 0, and (1_m^T ⊗ I_p)∇F(x*) = 0. In view of (5.12) and (5.13), one has

$$\begin{aligned}\mathbb{E}\left[\|x_{t+1} - x_t\|^2\right] &\le 8\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 16\eta^2\kappa_2^2\,\mathbb{E}\left[\|s_t - A_\infty x_t\|^2\right] \\ &\quad + 16\eta^2\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] + 16\eta^2\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] \\ &\quad + 16\eta^2\kappa_2^2\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + 4\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right]. \end{aligned} \tag{5.14}$$

Following from (5.14) and with reference to (5.6), it can be obtained that

$$\begin{aligned}\mathbb{E}\left[\|x_{t+1} - x_t\|^2\right] &\le 16\eta^2\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] + 16\eta^2\kappa_2^2\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] \\ &\quad + 16\eta^2\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + 32\alpha^2\eta^2\kappa_2^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] \\ &\quad + \left(32\eta^2\kappa_2^2 + 8\right)\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 4\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right], \end{aligned} \tag{5.15}$$

which together with Lemma 5.6 finishes the proof. □

Then, we derive a bound for the mean-squared consensus error E[||x_t − A_∞x_t||²].
Lemma 5.9 If Assumptions 5.1 and 5.2 hold, the following recursive relation holds for all t ≥ 1:

$$\mathbb{E}\left[\|x_{t+1} - A_\infty x_{t+1}\|^2\right] \le \frac{1+\kappa_3^2}{2}\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{4\eta^2}{1-\kappa_3^2}\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{4}{1-\kappa_3^2}\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right]. \tag{5.16}$$

Proof From the update rule (5.6), it follows that

$$\|x_{t+1} - A_\infty x_{t+1}\|^2 = \left\|Ax_t + \hat\varepsilon_t^x - \eta y_t - A_\infty\left(Ax_t + \hat\varepsilon_t^x - \eta y_t\right)\right\|^2 = \left\|Ax_t - A_\infty x_t + \hat\varepsilon_t^x - \eta y_t + \eta A_\infty y_t\right\|^2, \tag{5.17}$$

where the second equality has employed the facts that A_∞A = A_∞ and A_∞ε̂_t^x = 0, ∀t ≥ 0. Recalling the well-known inequality ||c + d||² ≤ (1 + a)||c||² + (1 + 1/a)||d||², ∀c, d ∈ R^{mp}, for any a > 0, it follows from Lemma 5.5(i) that

$$\|x_{t+1} - A_\infty x_{t+1}\|^2 \le (1+a)\kappa_3^2\|x_t - A_\infty x_t\|^2 + 2(1+a^{-1})\eta^2\|y_t - A_\infty y_t\|^2 + 2(1+a^{-1})\|\hat\varepsilon_t^x\|^2. \tag{5.18}$$

Substituting a = (1 − κ3²)/(2κ3²) (since a is an arbitrary positive constant) into (5.18) leads to the results of (5.16) in Lemma 5.9. □
Next, we establish a bound for the mean-squared network optimality gap E[||x̄_t − x*||²].

Lemma 5.10 If 0 < η ≤ κ1/(16κ2) and Assumptions 5.1 and 5.2 hold, the following recursive relation holds for all t ≥ 1:

$$\begin{aligned}\mathbb{E}\left[m\|\bar x_{t+1} - x^*\|^2\right] &\le \left(1 - \frac{\eta\kappa_1}{2}\right)\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + \frac{4\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] \\ &\quad + \frac{4\alpha^2\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + \frac{2\eta^2\kappa_2^2}{m}\,\mathbb{E}[v_t]. \end{aligned} \tag{5.19}$$
Proof Multiplying the update of x_t in (5.6) by (1_m^T ⊗ I_p)/m and using the fact
that ((1_m^T ⊗ I_p)/m)ε̂_t^x = 0, one has

$$\begin{aligned}\mathbb{E}\left[\|\bar x_{t+1} - x^*\|^2 \,\middle|\, \mathcal F_t\right] &= \mathbb{E}\left[\|\bar x_t - \eta\bar y_t - x^*\|^2 \,\middle|\, \mathcal F_t\right] \\ &= \mathbb{E}\left[\|\bar x_t - \eta\nabla f(\bar x_t) - x^* + \eta\nabla f(\bar x_t) - \eta\bar y_t\|^2 \,\middle|\, \mathcal F_t\right] \\ &= \|\bar x_t - \eta\nabla f(\bar x_t) - x^*\|^2 + \eta^2\,\mathbb{E}\left[\|\nabla f(\bar x_t) - \bar g_t\|^2 \,\middle|\, \mathcal F_t\right] \\ &\quad + 2\eta\left\langle \bar x_t - \eta\nabla f(\bar x_t) - x^*,\; \nabla f(\bar x_t) - \nabla\bar F(s_t)\right\rangle, \end{aligned} \tag{5.20}$$

where the last equality has leveraged the facts that ȳ_t = ḡ_t and E[ḡ_t | F_t] = ∇F̄(s_t).
Considering that ⟨∇f(x̄_t) − ∇F̄(s_t), E[∇F̄(s_t) − ḡ_t | F_t]⟩ = 0, we have

$$\mathbb{E}\left[\|\nabla f(\bar x_t) - \bar g_t\|^2 \,\middle|\, \mathcal F_t\right] = \mathbb{E}\left[\|\nabla f(\bar x_t) - \nabla\bar F(s_t)\|^2 \,\middle|\, \mathcal F_t\right] + \mathbb{E}\left[\|\nabla\bar F(s_t) - \bar g_t\|^2 \,\middle|\, \mathcal F_t\right]. \tag{5.21}$$

Notice that the {g_t^i} are independent of each other given F_t. Then
$\mathbb{E}\big[\sum_{i\ne j}\big\langle g_t^i - \nabla f^i(s_t^i),\, g_t^j - \nabla f^j(s_t^j)\big\rangle \,\big|\, \mathcal F_t\big] = 0$ holds. Thus, the last term in (5.21) equals

$$\mathbb{E}\left[\|\nabla\bar F(s_t) - \bar g_t\|^2 \,\middle|\, \mathcal F_t\right] = \frac{1}{m^2}\,\mathbb{E}\left[\sum_{i=1}^m\|g_t^i - \nabla f^i(s_t^i)\|^2 \,\middle|\, \mathcal F_t\right] = \frac{1}{m^2}\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2 \,\middle|\, \mathcal F_t\right], \tag{5.22}$$

which together with (5.21) and Lemma 5.5(ii) in (5.20) leads to

$$\begin{aligned}\mathbb{E}\left[\|\bar x_{t+1} - x^*\|^2 \,\middle|\, \mathcal F_t\right] &\le (1-\eta\kappa_1)^2\|\bar x_t - x^*\|^2 + \eta^2\left\|\nabla f(\bar x_t) - \nabla\bar F(s_t)\right\|^2 \\ &\quad + 2\eta(1-\eta\kappa_1)\|\bar x_t - x^*\|\,\|\nabla f(\bar x_t) - \nabla\bar F(s_t)\| \\ &\quad + \frac{\eta^2}{m^2}\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2 \,\middle|\, \mathcal F_t\right]. \end{aligned} \tag{5.23}$$

By Lemma 5.5(iv) and the update of s_t in (5.6), we further have

$$\begin{aligned}\mathbb{E}\left[\|\bar x_{t+1} - x^*\|^2 \,\middle|\, \mathcal F_t\right] &\le \frac{2\eta^2\kappa_2^2}{m}\|x_t - A_\infty x_t\|^2 + \frac{\eta(1-\eta\kappa_1)}{b}\|\nabla f(\bar x_t) - \nabla\bar F(s_t)\|^2 \\ &\quad + \frac{2\alpha^2\eta^2\kappa_2^2}{m}\|x_t - x_{t-1}\|^2 + \frac{\eta^2}{m^2}\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2 \,\middle|\, \mathcal F_t\right] \\ &\quad + \eta(1-\eta\kappa_1)b\,\|\bar x_t - x^*\|^2 + (1-\eta\kappa_1)^2\|\bar x_t - x^*\|^2, \end{aligned} \tag{5.24}$$

where b > 0 is an arbitrary positive constant. Inserting b = κ1 into (5.24) gives rise to

$$\begin{aligned}\mathbb{E}\left[m\|\bar x_{t+1} - x^*\|^2\right] &\le (1-\eta\kappa_1)\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + \frac{2\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] \\ &\quad + \frac{2\alpha^2\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + \frac{\eta^2}{m}\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right]. \end{aligned} \tag{5.25}$$

Substituting Lemma 5.6 into (5.25) yields

$$\begin{aligned}\mathbb{E}\left[m\|\bar x_{t+1} - x^*\|^2\right] &\le \left(1 - \eta\kappa_1 + \frac{8\eta^2\kappa_2^2}{m}\right)\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + \frac{2\eta^2\kappa_2^2}{m}\,\mathbb{E}[v_t] \\ &\quad + \alpha^2\kappa_2^2\eta\left(\frac{2}{\kappa_1} + \frac{8\eta}{m}\right)\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] \\ &\quad + \kappa_2^2\eta\left(\frac{4\eta}{m} + \frac{2}{\kappa_1}\right)\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right]. \end{aligned} \tag{5.26}$$

We note here that 1 − ηκ1 + (8η²κ2²)/m ≤ 1 − (ηκ1)/2 if 0 < η ≤ (κ1m)/(16κ2²), that
4η/m + 2κ1⁻¹ ≤ 4κ1⁻¹ if 0 < η ≤ 1/(2κ1), and that 8η/m + 2κ1⁻¹ ≤ 4κ1⁻¹ if 0 < η ≤
1/(4κ1). Therefore, the results of (5.19) in Lemma 5.10 can be derived if 0 < η ≤
κ1/(16κ2). □
Finally, we establish a bound for the mean-squared stochastic gradient tracking
error E[||y_t − A_∞y_t||²].

Lemma 5.11 If 0 < η ≤ (1 − κ3²)/(99κ2) and Assumptions 5.1 and 5.2 hold, the following recursive relation holds for all t ≥ 1:

$$\begin{aligned}\mathbb{E}\left[\|y_{t+1} - A_\infty y_{t+1}\|^2\right] &\le \frac{1240\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + \frac{266\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] \\ &\quad + \frac{3+\kappa_3^2}{4}\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{188\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] \\ &\quad + \frac{42\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}[v_t] + \frac{640\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right] + \frac{4}{1-\kappa_3^2}\,\mathbb{E}\left[\|\hat\varepsilon_t^y\|^2\right]. \end{aligned} \tag{5.27}$$
Proof Utilizing the update of y_t in (5.6) and the fact that A_∞A = A_∞, one acquires

$$\begin{aligned}\|y_{t+1} - A_\infty y_{t+1}\|^2 &= \left\|Ay_t - A_\infty y_t + (I_{mp} - A_\infty)(g_{t+1} - g_t) + \hat\varepsilon_t^y\right\|^2 \\ &\le (1+c)\|Ay_t - A_\infty y_t\|^2 + 2(1+c^{-1})\|g_{t+1} - g_t\|^2 + 2(1+c^{-1})\|\hat\varepsilon_t^y\|^2, \end{aligned} \tag{5.28}$$

where A_∞ε̂_t^y = 0, ∀t ≥ 0, and the inequality ||a + b||² ≤ (1 + c)||a||² + (1 +
1/c)||b||², ∀a, b ∈ R^{mp}, for any c > 0, have been employed to derive (5.28).
Selecting c = (1 − κ3²)/(2κ3²) in (5.28) and then taking the total expectation, we
have

$$\mathbb{E}\left[\|y_{t+1} - A_\infty y_{t+1}\|^2\right] \le \frac{1+\kappa_3^2}{2}\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{4}{1-\kappa_3^2}\,\mathbb{E}\left[\|g_{t+1} - g_t\|^2\right] + \frac{4}{1-\kappa_3^2}\,\mathbb{E}\left[\|\hat\varepsilon_t^y\|^2\right], \tag{5.29}$$

where ||I_{mp} − A_∞|| = 1 and Lemma 5.5(i) have been applied to acquire (5.29).
Next, we proceed to analyze E[||g_{t+1} − g_t||²]. First, we have

$$\begin{aligned}\mathbb{E}\left[\|g_{t+1} - g_t\|^2\right] &\le 2\,\mathbb{E}\left[\|g_{t+1} - g_t - (\nabla F(s_{t+1}) - \nabla F(s_t))\|^2\right] + 2\,\mathbb{E}\left[\|\nabla F(s_{t+1}) - \nabla F(s_t)\|^2\right] \\ &\le 2\,\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right] + 2\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] \\ &\quad + 2\kappa_2^2\,\mathbb{E}\left[\|x_{t+1} - x_t + \alpha(x_{t+1} - x_t) - \alpha(x_t - x_{t-1})\|^2\right] \\ &\le 4\kappa_2^2\alpha^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + 2\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] \\ &\quad + 16\kappa_2^2\,\mathbb{E}\left[\|x_{t+1} - x_t\|^2\right] + 2\,\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right], \end{aligned} \tag{5.30}$$

where the second and third inequalities have exploited E[⟨g_{t+1} − ∇F(s_{t+1}), g_t − ∇F(s_t)⟩] =
E[E[⟨g_{t+1} − ∇F(s_{t+1}), g_t − ∇F(s_t)⟩ | F_{t+1}]] = 0 and 0 < α < 1,
respectively. Using (5.15) with the requirement that 0 < η ≤ 1/(32κ2), it can be
deduced that

$$\begin{aligned}\mathbb{E}\left[\|x_{t+1} - x_t\|^2\right] &\le 16\eta^2\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + 0.03125\,\alpha^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] \\ &\quad + 8.03125\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 16\eta^2\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] \\ &\quad + 0.015625\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + 4\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right], \end{aligned} \tag{5.31}$$

which together with (5.30) yields

$$\begin{aligned}\mathbb{E}\left[\|g_{t+1} - g_t\|^2\right] &\le 256\kappa_2^2\eta^2\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + 2\,\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right] \\ &\quad + 4.5\,\kappa_2^2\alpha^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + 0.25\,\kappa_2^2\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] \\ &\quad + 128.5\,\kappa_2^2\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 64\kappa_2^2\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right] \\ &\quad + 2.25\,\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right]. \end{aligned} \tag{5.32}$$

We next bound E[||g_{t+1} − ∇F(s_{t+1})||²]. Following directly from Lemma 5.6, one has

$$\begin{aligned}\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right] &\le 4\kappa_2^2\,\mathbb{E}\left[\|x_{t+1} - A_\infty x_{t+1}\|^2\right] + 8\kappa_2^2\,\mathbb{E}\left[m\|\bar x_{t+1} - x^*\|^2\right] \\ &\quad + 8\alpha^2\kappa_2^2\,\mathbb{E}\left[\|x_{t+1} - x_t\|^2\right] + 2\kappa_2^2\,\mathbb{E}[v_{t+1}]. \end{aligned} \tag{5.33}$$

Note that E[||x_{t+1} − A_∞x_{t+1}||²] ≤ 4η²E[||y_t − A_∞y_t||²] + 2κ3²E[||x_t − A_∞x_t||²] +
4E[||ε̂_t^x||²] if we select a = 1 in (5.18), and E[m||x̄_{t+1} − x*||²] ≤ α²E[||x_t −
x_{t−1}||²] + (η²/m)E[||g_t − ∇F(s_t)||²] + E[||x_t − A_∞x_t||²] + 2E[m||x̄_t − x*||²] if
we select b = 1/η in (5.24) for 0 < η < 2/κ2. Substituting the above results and
(5.10) into (5.33) leads to the following: if 0 < η ≤ 1/(32κ2),

$$\begin{aligned}\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right] &\le 84.25\,\kappa_2^2\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 144\eta^2\kappa_2^2\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] \\ &\quad + \left(\frac{8\eta^2\kappa_2^2}{m} + 128\eta^2\kappa_2^2\right)\mathbb{E}\left[\|g_t - \nabla F(s_t)\|^2\right] + 2\kappa_2^2\,\mathbb{E}[v_t] \\ &\quad + 20.125\,\kappa_2^2\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + 48\kappa_2^2\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right] \\ &\quad + 8.25\,\alpha^2\kappa_2^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right]. \end{aligned} \tag{5.34}$$

If 0 < η ≤ 1/(16κ2), we further get from (5.9) that

$$\begin{aligned}\mathbb{E}\left[\|g_{t+1} - \nabla F(s_{t+1})\|^2\right] &\le 86.25\,\kappa_2^2\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right] + 144\eta^2\kappa_2^2\,\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] \\ &\quad + 24.125\,\kappa_2^2\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + 3\kappa_2^2\,\mathbb{E}[v_t] \\ &\quad + 12.25\,\alpha^2\kappa_2^2\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] + 48\kappa_2^2\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right]. \end{aligned} \tag{5.35}$$

Substituting (5.35) into (5.32) and then the result into (5.29), it follows that

$$\begin{aligned}\mathbb{E}\left[\|y_{t+1} - A_\infty y_{t+1}\|^2\right] &\le \frac{266\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[m\|\bar x_t - x^*\|^2\right] + \frac{188\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[\|x_t - x_{t-1}\|^2\right] \\ &\quad + \frac{640\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[\|\hat\varepsilon_t^x\|^2\right] + \frac{4}{1-\kappa_3^2}\,\mathbb{E}\left[\|\hat\varepsilon_t^y\|^2\right] + \frac{42\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}[v_t] \\ &\quad + \left(\frac{1+\kappa_3^2}{2} + \frac{2176\kappa_2^2\eta^2}{1-\kappa_3^2}\right)\mathbb{E}\left[\|y_t - A_\infty y_t\|^2\right] + \frac{1240\kappa_2^2}{1-\kappa_3^2}\,\mathbb{E}\left[\|x_t - A_\infty x_t\|^2\right]. \end{aligned} \tag{5.36}$$

Hence, choosing 0 < η ≤ (1 − κ3²)/(99κ2) achieves the results of (5.27) in
Lemma 5.11. □

5.4.3 Main Results

Next, we show that ET-DASG converges linearly in the mean. In the theoretical procedure, we first construct a linear matrix inequality based on the results in Lemmas 5.7–5.11. Then, similar to other works [20, 22, 24, 25], we prove that the spectral radius of the coefficient matrix is less than 1. Iterating the linear matrix inequality then shows that ET-DASG converges linearly in the mean, without further complex operations.
Theorem 5.12 Consider ET-DASG in Algorithm 3 under Assumptions 5.1–5.2. If the step-size η is chosen from the interval

0 < η ≤ min{ (1 − κ3²)ň/(12γκ2n̂),  mň(1 − 9α²)/(34n̂κ1γ²),  (1 − κ3²)/(99κ2),  1/(κ2√Λ(γ, κ3, n̂, ň)) },

then the estimator x_t^i, i ∈ V, converges in the mean to the exact optimal solution of problem (5.1) with a linear convergence rate O(λ^t), where 0 < λ < 1, 0 < α < 1/3, and Λ(γ, κ3, n̂, ň) is a constant specified in the proof.
Proof To begin with, we jointly write (5.10), (5.11), (5.16), (5.19), and (5.27) as a
linear matrix inequality, i.e.,

θt +1 ≤ Γ θt + νt , (5.37)
5.4 Convergence Analysis 133

where θ_t ∈ R⁵, ν_t ∈ R⁵, and Γ ∈ R^{5×5} are represented by

θ_t = [ E[||x_t − A∞x_t||²],  E[m||x̄_t − x*||²],  E[||x_t − x_{t−1}||²],  E[v_t],  E[||y_t − A∞y_t||²]/κ2² ]^T,

ν_t = [ Δ1,  0,  Δ2,  0,  Δ3 ]^T,

with Δ1 = (4/(1 − κ3²))E[||ε̂_t^x||²], Δ2 = 4E[||ε̂_t^x||²], as well as Δ3 = (640κ2²/(1 − κ3²))E[||ε̂_t^x||²] + (4/(1 − κ3²))E[||ε̂_t^y||²], and
    ⎡ (1 + κ3²)/2     0              0                    0              4κ2²η²/(1 − κ3²) ⎤
    ⎢ 4κ2²η/κ1        1 − ηκ1/2      4κ2²α²η/κ1           0              2η²κ2²/m         ⎥
Γ = ⎢ 8               144η²κ2²       96η²κ1² + 16κ2²η²    160η²κ2²       32η²κ2²          ⎥ .
    ⎢ 2/ň             2/ň            0                    1 − 1/n̂        0                ⎥
    ⎣ 1240/(1 − κ3²)  266/(1 − κ3²)  188/(1 − κ3²)        42/(1 − κ3²)   (3 + κ3²)/4      ⎦

Then, following the existing works [20, 22, 25, 40], we first infer the range of η for which ρ(Γ) < 1, which establishes the linear convergence of ET-DASG. By Lemma 5.5(iii), to ensure ρ(Γ) < 1 it suffices to find a range of η and a positive vector w = [w1, w2, w3, w4, w5]^T such that Γw < w holds, which is equivalent to

⎧ η²κ2² < (1 − κ3²)²w1/(8w4)
⎪ ηκ2² < mκ1w2/(8w4) − mκ2²w1/(κ1w4) − mκ2²α²w3/(κ1w4)
⎨ η²κ2² < (w3 − 8w1)/(160w3 + 96w1 + 32w4 + 16w5)                  (5.38)
⎪ 2n̂w1/ň + 2n̂w2/ň < w4
⎩ (4960w1 + 1064w2 + 752w3 + 168w4)/(1 − κ3²)² < w5

Obviously, to achieve a certain feasible range of η, we must ensure that the right-hand sides of the first three conditions related to the step-size η in (5.38) are positive. According to this observation, it suffices to derive another two relations among the elements in w, i.e.,

⎨ w3 > 8w1
⎩ w2 > 8κ2²w1/κ1² + 8κ2²α²w3/κ1².                                  (5.39)

Based on this, we are now ready to choose values of w1, w2, w3, w4, w5 that are independent of η. First, according to the first condition in (5.39), we can set w1 = 1 and w3 = 9. Second, by using w1 and w3, the second condition in (5.39) is equivalent to the following:

w2 > (8κ2²/κ1²)(1 + 9α²).

Here, we assume that 0 < α < 1/3, and we thus set w2 = 16γ², where γ = κ2/κ1. Third, we note that the fourth condition in (5.38) can be written as

w4 > 2n̂/ň + 32n̂γ²/ň = (2n̂/ň)(1 + 16γ²).

Since γ > 1, we therefore set w4 = (34n̂γ²)/ň. Finally, in order to ensure that the fifth condition in (5.38) is true, w5 should satisfy

w5 > (8/(1 − κ3²)²)(620w1 + 133w2 + 94w3 + 21w4)
   = (16/(1 − κ3²)²)(733 + 1064γ² + 357n̂γ²/ň).

Noticing that 733 + 1064γ² + 357n̂γ²/ň ≤ 2154n̂γ²/ň, we thus set w5 = 34464n̂γ²/((1 − κ3²)²ň). We now solve for the range of η from the first three conditions in (5.38) with the previously fixed w1, w2, w3, w4, w5. According to the first condition in (5.38), we obtain

η < ((1 − κ3²)/(2κ2))√(w1/(2w4)) = (1 − κ3²)ň/(12γκ2n̂).

Moreover, we can conclude that if η meets

η < mκ1w2/(8κ2²w4) − mw1/(κ1w4) − mα²w3/(κ1w4) = (mň/(34n̂κ1γ²))(1 − 9α²),

then the second condition in (5.38) holds. Next, to make the third condition in (5.38) hold, the step-size η must satisfy the following:

η < 1/(κ2√(160w3 + 96w1 + 32w4 + 16w5)) = 1/(κ2√Λ(γ, κ3, n̂, ň)),

where Λ(γ, κ3, n̂, ň) = 1536 + 1088n̂γ²/ň + 551424n̂γ²/((1 − κ3²)²ň). Therefore, it is clear that if η satisfies the conditions in Theorem 5.12, then Γw < w holds. Thus, ρ(Γ) < 1 is satisfied.
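The w-test just used can be checked numerically: for a nonnegative matrix G, exhibiting a positive vector w with Gw < w componentwise certifies ρ(G) ≤ max_i (Gw)_i/w_i < 1, which is the property invoked from Lemma 5.5(iii). A minimal Python sketch with a made-up 2 × 2 matrix, not the actual 5 × 5 Γ of the proof:

```python
# Toy illustration of the w-test: for a nonnegative matrix G, finding w > 0
# with G w < w (componentwise) certifies rho(G) <= max_i (G w)_i / w_i < 1.
# The 2x2 matrix below is a made-up example, not the Gamma of Theorem 5.12.

def matvec(G, w):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(g * x for g, x in zip(row, w)) for row in G]

def contraction_factor(G, w):
    """Return lam = max_i (G w)_i / w_i; lam < 1 certifies rho(G) < 1."""
    Gw = matvec(G, w)
    return max(v / x for v, x in zip(Gw, w))

G = [[0.5, 0.2],
     [0.1, 0.7]]
w = [1.0, 1.0]

lam = contraction_factor(G, w)
assert lam < 1.0  # G w < w holds, so iterating theta <- G theta contracts

# Iterating theta_{t+1} = G theta_t from theta_0 = w decays at least
# geometrically with rate lam, since G^t w <= lam^t w componentwise.
theta = list(w)
for _ in range(50):
    theta = matvec(G, theta)
assert max(theta) <= lam ** 50 * max(w) + 1e-12
```

Applying the same componentwise test to Γ with the vector w constructed above is exactly what yields the step-size conditions in (5.38).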
According to the above result, i.e., ρ(Γ) < 1, we continue to establish the remaining proof of the linear convergence rate of ET-DASG. To this aim, we recursively iterate (5.37) as follows: ∀t ≥ 1,

θ_t ≤ Γ^t θ_0 + Σ_{l=0}^{t−1} Γ^{t−l−1} ν_l.                       (5.40)

From the triggering condition (5.2) and the measurement errors (5.3), for all i ∈ V and t ≥ 0, we can infer that ||ε_t^{i,x}||² = 0 and ||ε_t^{i,y}||² = 0 if t = t^i_{k(i,t)}, and ||ε_t^{i,x}||² ≤ Cκ4^t, ||ε_t^{i,y}||² ≤ Cκ4^t otherwise. Moreover, if the momentum coefficient α and the constant step-size η satisfy the above conditions (ρ(Γ) < 1), it suffices to verify that ||Γ^t|| ≤ Qκ5^t and ||Γ^{t−l−1}ν_l|| ≤ Qκ5^t for some constant Q > 0 and 0 < max{κ4, ρ(Γ)} ≤ κ5 < 1. Therefore, it can be deduced from (5.40) that

||θ_t|| ≤ ||Γ^t|| ||θ_0|| + Σ_{l=0}^{t−1} ||Γ^{t−l−1} ν_l||
       ≤ Q||θ_0||κ5^t + Σ_{l=0}^{t−1} Qκ5^t = ω_t κ5^t,            (5.41)

where ω_t = Q||θ_0|| + Qt. Following from (5.41), it holds that lim_{t→∞} ||θ_t||/κ6^t = lim_{t→∞} ω_t(κ5/κ6)^t = 0 for all κ5 < κ6 < 1. Hence, there exist a positive constant Q1 and an arbitrarily small constant ν such that κ6 = κ5 + ν and ||θ_t|| ≤ Q1κ6^t for all t ≥ 0; then,

E[||x_t − (1_m ⊗ I_p)x*||²] ≤ 2E[||x_t − A∞x_t||²] + 2E[||A∞x_t − (1_m ⊗ I_p)x*||²]
                            ≤ 2Q1(κ5 + ν)^t,                       (5.42)

where we define λ = √(κ5 + ν) = √κ6. This completes the proof of Theorem 5.12. ∎
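The closing argument, iterating (5.37) with ρ(Γ) < 1 against a geometrically decaying perturbation, can be illustrated by a scalar caricature of (5.40)–(5.41); the constants below are illustrative only, not those of the chapter:

```python
# Scalar caricature of the recursion (5.37): theta_{t+1} = rho*theta_t + nu_t,
# with rho < 1 playing the role of rho(Gamma) and nu_t = c * kappa4**t the
# geometrically decaying trigger-error perturbation.  Unrolling gives
# theta_t <= rho**t * theta_0 + t * c * kappa5**(t-1), an omega_t*kappa5**t-type
# bound with kappa5 = max{rho, kappa4}, mirroring (5.40)-(5.41).

rho, kappa4, c = 0.9, 0.8, 1.0
kappa5 = max(rho, kappa4)

theta = 1.0          # theta_0
bound_ok = True
for t in range(200):
    bound = rho ** t + t * c * kappa5 ** (t - 1)
    bound_ok = bound_ok and theta <= bound + 1e-12
    theta = rho * theta + c * kappa4 ** t

assert bound_ok      # the unrolled geometric bound holds at every iteration
```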
Remark 5.13 Theorem 5.12 shows that, even when adopting the event-triggered communication strategy and the stochastic gradient g_t, ET-DASG is guaranteed to solve the machine learning task (5.1) with a linear convergence rate, provided that the conditions on η and α and the assumptions on the communication network and the cost functions hold. In addition, it is worth emphasizing that, due to the influence of related factors in this chapter (such as the convergence analysis method, the event-triggered strategy, and the variance-reduction technique), it is difficult to obtain the optimal momentum coefficient intuitively as in the existing work [16]. We will conduct a detailed study of this issue in future work.
To verify the effectiveness of the designed event-triggered communication strategy, in the following we prove that for each node the time interval between two successive triggering instants is larger than the iteration interval, i.e., t^i_{k(i,t)+1} − t^i_{k(i,t)} ≥ 2, ∀i ∈ V, t ≥ 1.
Theorem 5.14 Suppose that Assumptions 5.1 and 5.2 hold, and consider Algorithm 3. If the momentum coefficient α and the constant step-size η are selected to satisfy Theorem 5.12, then t^i_{k(i,t)+1} − t^i_{k(i,t)} ≥ 2, ∀i ∈ V, t ≥ 1, provided that the event-triggered parameters C and κ4 satisfy C > ((12Q1 + 128Q1κ2²)(1 + κ4²))/κ4².
Proof It is deduced from (5.3) that

||ε_t^{i,x}||² + ||ε_t^{i,y}||² = ||x^i_{t^i_{k(i,t)}} − x^i_t||² + ||y^i_{t^i_{k(i,t)}} − y^i_t||²
≤ 4||x^i_{t^i_{k(i,t)}} − x̄_{t^i_{k(i,t)}}||² + 4||x̄_{t^i_{k(i,t)}} − x*||² + 4||x* − x̄_t||²
  + 4||x̄_t − x^i_t||² + 4||y^i_{t^i_{k(i,t)}} − ȳ_{t^i_{k(i,t)}}||² + 4||ȳ_t − y^i_t||²
  + 4||ḡ_{t^i_{k(i,t)}} − ∇f(x*)||² + 4||∇f(x*) − ḡ_t||²,          (5.43)
where we have employed ȳt = ḡt . Then, we further have that

E[||∇f(x*) − ḡ_t||² | F_t]
= E[||∇f(x*) − ∇F̄(s_t)||² | F_t] + E[||∇F̄(s_t) − ḡ_t||² | F_t]
≤ (1/m²)E[||∇F(s_t) − g_t||² | F_t] + (2κ2²/m)E[||x_t − x_{t−1}||² | F_t]
  + 4κ2²E[||x̄_t − x*||² | F_t] + (4κ2²/m)E[||x_t − A∞x_t||² | F_t],  (5.44)

where in the first inequality we have used that E[ḡt |Ft ] = ∇ F̄ (st ), and in the second
inequality we have applied E[||∇ F̄ (st ) − ḡt ||2 |Ft ] = (1/m2 )E[||∇F (st ) − gt ||2 |Ft ]
and Assumption 5.1(ii). Similarly, one can get

E[||ḡ_{t^i_{k(i,t)}} − ∇f(x*)||² | F_t]
≤ (4κ2²/m)E[||x_{t^i_{k(i,t)}} − A∞x_{t^i_{k(i,t)}}||² | F_t] + (2κ2²/m)E[||x_{t^i_{k(i,t)}} − x_{t^i_{k(i,t)}−1}||² | F_t]
  + 4κ2²E[||x̄_{t^i_{k(i,t)}} − x*||² | F_t] + (1/m²)E[||∇F(s_{t^i_{k(i,t)}}) − g_{t^i_{k(i,t)}}||² | F_t].  (5.45)

Recalling ||θ_t|| ≤ Q1κ6^t in Theorem 5.12, we infer from the definition of ||θ_t|| that E[||x_t − A∞x_t||²] ≤ Q1κ6^t, E[m||x̄_t − x*||²] ≤ Q1κ6^t, E[||x_t − x_{t−1}||²] ≤ Q1κ6^t, E[v_t] ≤ Q1κ6^t, and E[||y_t − A∞y_t||²] ≤ Q1κ6^t. Next, we apply (5.9), (5.43), (5.44), and (5.45) to proceed. If 0 < α < 1, we have that

||ε_t^{i,x}||² + ||ε_t^{i,y}||² ≤ (12Q1 + 128Q1κ2²)κ6^t + (12Q1 + 128Q1κ2²)κ6^{t^i_{k(i,t)}}.   (5.46)

It is also worth noting that while the triggering condition is not satisfied, the next event does not happen; that is, ||ε_t^{i,x}||² + ||ε_t^{i,y}||² ≤ Cκ4^t, ∀t ≥ 0, i ∈ V. Thus, when t = t^i_{k(i,t)+1} the following inequality must be satisfied:

Cκ4^{t^i_{k(i,t)+1}} ≤ ||ε^{i,x}_{t^i_{k(i,t)+1}}||² + ||ε^{i,y}_{t^i_{k(i,t)+1}}||²,   (5.47)
tk(i,t)+1 tk(i,t)+1

which indicates that

Cκ4^{t^i_{k(i,t)+1}} ≤ (12Q1 + 128Q1κ2²)κ6^{t^i_{k(i,t)+1}} + (12Q1 + 128Q1κ2²)κ6^{t^i_{k(i,t)}}.   (5.48)

Since κ4 < κ6, it is clear that (5.48) must hold if there is a constant C satisfying

t^i_{k(i,t)+1} − t^i_{k(i,t)} ≥ ln( (12Q1 + 128Q1κ2²)/(C − (12Q1 + 128Q1κ2²)) ) / ln κ4.   (5.49)

Here, it suffices to verify that t^i_{k(i,t)+1} − t^i_{k(i,t)} ≥ 2, ∀i ∈ V, t ≥ 1, when C and κ4 are chosen as in Theorem 5.14. ∎
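The effect of the threshold on C can be checked against the bound (5.49); in the sketch below, Q1 and κ2 are placeholder values for illustration, not the constants of the proof:

```python
# Sketch of the inter-event bound (5.49): with D = 12*Q1 + 128*Q1*kappa2**2,
# the gap between successive triggering instants is at least
# ln(D / (C - D)) / ln(kappa4), which exceeds 2 once C clears the threshold
# D*(1 + kappa4**2)/kappa4**2 of Theorem 5.14.
import math

def min_gap(C, kappa4, Q1, kappa2):
    D = 12 * Q1 + 128 * Q1 * kappa2 ** 2
    assert C > D * (1 + kappa4 ** 2) / kappa4 ** 2  # condition of Theorem 5.14
    return math.log(D / (C - D)) / math.log(kappa4)

# Placeholder constants, chosen only to exercise the formula.
Q1, kappa2, kappa4 = 0.01, 2.0, 0.9
threshold = (12 * Q1 + 128 * Q1 * kappa2 ** 2) * (1 + kappa4 ** 2) / kappa4 ** 2
gap = min_gap(1.1 * threshold, kappa4, Q1, kappa2)
assert gap >= 2   # at least one idle iteration between any two triggers
```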
Remark 5.15 The communication cost of ET-DASG is closely related to the event-triggered parameters C and κ4. When the event-triggered parameters satisfy the condition C > ((12Q1 + 128Q1κ2²)(1 + κ4²))/κ4² in Theorem 5.14 (and other parameters, such as α and η, satisfy Theorem 5.12 to achieve convergence), the time interval between two successive triggering instants for each node is at least 2, i.e., t^i_{k(i,t)+1} − t^i_{k(i,t)} ≥ 2, ∀i ∈ V, t ≥ 1. This means that, by using the event-triggered communication strategy, the per-iteration communication cost of ET-DASG is reduced by at least half compared with the related time-triggered distributed methods [18–22, 40, 41]. From this point of view, the communication cost of ET-DASG is determined by the event-triggered parameters C and κ4, and ET-DASG can promote communication efficiency.
Remark 5.16 Although the existing gradient tracking methods [19–29], including our previous methods [21, 29], also converge linearly to the optimal solution, it is worth noting that ET-DASG enjoys three appealing features: (i) communication efficiency via the event-triggered communication strategy; (ii) computation efficiency via the variance-reduction technique; (iii) accelerated convergence via Nesterov's acceleration mechanism.

5.5 Numerical Examples

In this section, two numerical examples, logistic regression in machine learning [2, 59] and energy-based source localization in sensor networks [62], are provided to examine the effectiveness of ET-DASG. All examples are implemented in MATLAB R2014a on a 2017 MacBook Pro with an Intel Core i5 CPU (2.3 GHz) and 8 GB RAM. To facilitate verification and comparison, we adopt the YALMIP toolbox in MATLAB to obtain the optimal solutions of the following examples.

5.5.1 Example 1: Logistic Regression

In this subsection, ET-DASG is leveraged to deal with a binary classification problem by logistic regression using two real datasets, i.e., the breast cancer Wisconsin (diagnostic) dataset (dataset 1) and the mushroom dataset (dataset 2), from the UCI Machine Learning Repository [63]. For the networks, a randomly connected network with m = 10 nodes generated by the Erdos–Renyi model [10, 60] with a connection probability pc = 0.4 is plotted in Fig. 5.1a, and three further categories of network (i.e., complete network, cycle network, and star network) are plotted in Fig. 5.1b–d.
The motivation for adopting the two real datasets and the Erdos–Renyi network model is as follows: (a) This subsection mainly applies ET-DASG to solve the logistic regression problem in machine learning, which is in fact a binary classification problem. On the one hand, both datasets 1 and 2 can be used for training binary classifiers (i.e., diseased vs. non-diseased and toxic vs. non-toxic). On the other hand, the scales of the two datasets are different: dataset 1 can be used to examine the performance of ET-DASG, while dataset 2 can be employed for comparison with other related distributed methods to verify the advantages of ET-DASG in processing large-scale problems. In addition, many recent works [64, 65] have run algorithms on these two datasets for machine learning. (b) The Erdos–Renyi network model employed in this subsection is an undirected and connected graph, which satisfies the conditions in Assumption 5.2 well. In addition, many works [10, 60] simulate algorithms on the Erdos–Renyi network model. Because of the randomness of the network, the simulation results are more reliable and can verify the performance of ET-DASG well.
In dataset 1, we use n = 200 samples as training data and each data has
dimension p = 9. In dataset 2, we use n = 6000 samples as training data and each

Fig. 5.1 Four undirected and connected network topologies composed of 10 nodes. (a) Random
network with a connection probability pc = 0.4. (b) Complete network. (c) Cycle network. (d)
Star network

data has dimension p = 112. For each dataset, all features have been preprocessed and normalized to the unit vector. The distributed logistic regression problem takes the form

min_{x∈R^p} (1/m) Σ_{i=1}^m (1/n^i) Σ_{h=1}^{n^i} ln(1 + exp(−b^{i,h}(c^{i,h})^T x)) + (π/2)||x||_2²,

where the local objective function f^i(x) is

f^i(x) = (1/n^i) Σ_{h=1}^{n^i} ln(1 + exp(−b^{i,h}(c^{i,h})^T x)) + (π/2)||x||_2²,

with b^{i,h} ∈ {−1, 1} and c^{i,h} ∈ R^p being local data kept by node i for h ∈ {1, . . . , n^i}; (π/2)||x||_2² is the regularization term for avoiding overfitting. In the simulation, we randomly divide the data among the local nodes, i.e., Σ_{i=1}^{10} n^i = n. The simulation results are described in the following five parts, (i)–(v). Here, we also point out that dataset 1 is adopted in (i)–(iv), and datasets 1 and 2 are adopted in (v). Moreover, Fig. 5.1a–d are applied to part (iv), while Fig. 5.1a is applied to all other parts. For the comparisons in parts (iii)–(v), the residual (1/m)log₁₀(Σ_{i=1}^m ||x_t^i − x*||) is treated as the comparison metric.
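For concreteness, one node's local objective f^i(x) from the formula above can be evaluated as follows; the tiny dataset is fabricated for illustration, and pi_reg stands for the regularization weight π:

```python
# Minimal sketch of one node's local logistic objective:
# f_i(x) = (1/n_i) * sum_h ln(1 + exp(-b_h * <c_h, x>)) + (pi_reg/2)*||x||^2.
import math

def local_loss(x, data, pi_reg):
    """data is a list of (b, c) pairs with label b in {-1, +1} and feature vector c."""
    n = len(data)
    fit = sum(math.log(1.0 + math.exp(-b * sum(ci * xi for ci, xi in zip(c, x))))
              for b, c in data) / n
    reg = 0.5 * pi_reg * sum(xi * xi for xi in x)
    return fit + reg

data = [(+1, [0.5, 1.0]), (-1, [1.0, -0.3]), (+1, [-0.2, 0.8])]
x0 = [0.0, 0.0]
# At x = 0 every sample contributes ln 2 and the regularizer vanishes,
# so the loss equals ln 2.
assert abs(local_loss(x0, data, pi_reg=4.0) - math.log(2.0)) < 1e-12
```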

(i) Convergence: In this simulation, we set η = 0.0035, α = 0.2, C = 5, κ4 = 0.985, π = 4, and Σ_{i=1}^{10} n^i = 200. The transient behaviors of three (randomly selected) dimensions of the state estimator x and the testing accuracy are shown in Fig. 5.2, which illustrate that the testing accuracy is 97.1% and that the state estimator x in ET-DASG achieves consensus in the mean at the global optimal solution.
(ii) Triggering times: In this simulation (other parameters are the same as in part (i)), we discuss the triggering times for the neighbors when 5 nodes run ET-DASG under different event-triggered parameters. The simulation results are shown in Fig. 5.3, which, combined with Fig. 5.2, imply that ET-DASG with the event-triggered communication strategy can achieve the expected results with fewer communications. In addition, Fig. 5.3 verifies that the triggering times decrease (that is, the communication cost becomes small) as the parameter κ4 increases. Here, since C is related to κ4 (Theorem 5.14) and the range of C becomes wider when κ4 is larger, it is reasonable to discuss the influence of different κ4 on the triggering times of the nodes with C unchanged.
(iii) Impacts of the constant step-size and the momentum coefficient: In this simulation (other parameters are the same as in part (i)), we discuss the impacts of the constant step-size η and the momentum coefficient α in ET-DASG on the convergence results. The simulation results are shown in Fig. 5.4. Figure 5.4a indicates that increasing the step-size η can promote the convergence of ET-DASG. However, when the step-size η exceeds an upper bound (around 0.012), the performance of ET-DASG is reduced. Figure 5.4b shows that the convergence rate of ET-DASG improves with the increase of the momentum coefficient α (η = 0.012). However, this improvement is subject to the upper bound of α (Theorem 5.12).
(iv) Impacts of network sparsity: In this simulation (η = 0.012 and other
parameters are the same as part (i)), we discuss the impacts of the networks
on the convergence results of ET-DASG. The simulation results are shown in
Fig. 5.5. It can be verified from Fig. 5.5 that ET-DASG converges faster as the
network becomes dense.
(v) Comparison: In this simulation (the related parameters are the same as part
(i)), we compare ET-DASG with other existing methods [20, 38, 59] to show
the appealing features of ET-DASG. The simulation results are shown in
Fig. 5.6 and Table 5.1. Figure 5.6 means that ET-DASG can achieve the same


Fig. 5.2 Convergence: (a) The transient behaviors of three dimensions (randomly selected) of
state estimator x. (b) The testing accuracy of ET-DASG

convergence as other existing methods [20, 38, 59] even if it uses both the
event-triggered strategy and the variance-reduction technique at the same time.
Then, Table 5.1 summarizes the convergence time in seconds and the number
of local gradient evaluations of ET-DASG and the method in [38] for a specific
residual 10−7 under two real training datasets. Table 5.1 tells us that ET-DASG
demands less calculation time and less number of local gradient evaluations,
which can quickly achieve the target and greatly reduce the computation cost in
machine learning tasks. Moreover, Table 5.1 also shows that when the number
n and the dimension p of datasets are large, the calculation time and the
number of local gradient evaluations of ET-DASG are far less than that of the


Fig. 5.3 The triggering times for the neighbors when 5 nodes run ET-DASG under different event-
triggered parameters

method in [38]. Hence, ET-DASG can be well adapted to large-scale machine learning tasks by adopting the stochastic gradient.

Remark 5.17 Due to the simultaneous use of the event-triggered communication and the stochastic gradient, ET-DASG requires more iterations to achieve a specific residual; i.e., the convergence rate of ET-DASG is slightly slower than that of other related methods [20, 38, 58]. It is worth highlighting that, for a specific termination condition, ET-DASG has advantages in communication times, the number of local gradient evaluations, and the running time, which exhibits good efficiency in communication and computation.


Fig. 5.4 Evolution of residuals under different constant step-sizes or momentum coefficients

5.5.2 Example 2: Energy-based Source Localization

In this example, ET-DASG is applied to handle the energy-based source localization problem [62] over a network of m sensors (a randomly connected network with pc = 0.8). Each sensor is randomly placed at a spatial location denoted as a^i ∈ R², i = 1, . . . , m, which is known privately by itself, and each sensor collects n^i measurements. Then, an isotropic energy propagation model is applied to measure the h-th received signal strength at sensor i, which is represented by s^{i,h} = c/||x̂ − a^i||^d + b^{i,h}, where c > 0 is a constant and d ≥ 1 is an attenuation characteristic; ||x̂ − a^i|| > 1, and b^{i,h} is an independent and identically distributed sample noise drawn from the zero-mean Gaussian distribution with variance σ².


Fig. 5.5 Evolution of residuals under different networks

Fig. 5.6 Comparisons between ET-DASG and other methods

The maximum-likelihood estimator for the source's location is found by solving the following problem:

min_{x∈R²} (1/m) Σ_{i=1}^m (1/n^i) Σ_{h=1}^{n^i} ( s^{i,h} − c/||x − a^i||^d )².

Here, we consider that m = 50 sensors are uniformly distributed in a 100 × 100 square and the source location is randomly chosen from the square. The source emits a signal with strength c = 100 and each sensor has n^i = 100 measurements. In this simulation, we set η = 0.008, α = 0.1, C = 5, and κ4 = 0.92. Assume

Table 5.1 The convergence time in seconds and the number of local gradient evaluations of ET-DASG and the method in [38] for a specific residual 10⁻⁷ under two real datasets

              ET-DASG               The method in [38]
  Datasets    Time (s)    Number    Time (s)    Number
  Dataset 1   4.9287      1665      5.8607      29360
  Dataset 2   22.8123     3552      132.9638    7.956 × 10⁵


Fig. 5.7 The randomly selected 7 paths displayed on top of contours of log-likelihood function

that there is a stationary acoustic source x* ∈ R² located at (55, 55) (a location unknown to any sensor) that we aim to locate in the sensor network.

Based on the above, the randomly selected 7 paths taken by ET-DASG are shown in Fig. 5.7, which is plotted on top of contours of the log-likelihood. Figure 5.7 illustrates that ET-DASG can successfully find the exact source location, like the other verified effective algorithm [62], and is thus suitable for the practical energy-based source localization problem.
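One sensor's local cost in this maximum-likelihood formulation can be sketched as follows; the sensor position, reading, and constants are fabricated for illustration:

```python
# Sketch of one sensor's local cost in the energy-based localization example:
# f_i(x) = (1/n_i) * sum_h (s_h - c/||x - a_i||**d)**2.

def local_cost(x, a, readings, c=100.0, d=2.0):
    """Mean squared mismatch between readings and the isotropic energy model."""
    dist = sum((xi - ai) ** 2 for xi, ai in zip(x, a)) ** 0.5
    model = c / dist ** d
    return sum((s - model) ** 2 for s in readings) / len(readings)

source = (55.0, 55.0)          # true source, as in the simulation setup
sensor = (30.0, 40.0)          # fabricated sensor position
noiseless = 100.0 / ((55 - 30) ** 2 + (55 - 40) ** 2)   # exact reading, d = 2
# With a noise-free reading, the cost vanishes at the true source location.
assert local_cost(source, sensor, [noiseless]) < 1e-12
```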

5.6 Conclusion

In this chapter, we have proposed a novel event-triggered distributed accelerated stochastic gradient algorithm, namely ET-DASG, for resolving machine learning tasks over networks. ET-DASG realizes better communication efficiency, achieves higher computation efficiency, and implements accelerated convergence. With the help of linear matrix inequality theory, we proved that ET-DASG with a suitably selected constant step-size converges linearly in the mean to the optimal solution if each constituent function is strongly convex and smooth. Furthermore, the time interval for each node between two successive triggering instants has been proven to be larger than the iteration interval. Simulation results have confirmed the appealing performance of ET-DASG. However, ET-DASG is not suitable for networks with random link failures or for problems with constraints. In addition, privacy issues are not considered in ET-DASG. Future work will focus on investigating the privacy protection of ET-DASG and extending the algorithm to directed networks and distributed constrained optimization problems. The asynchronous implementation of ET-DASG over a broadcast-based or gossip-based mechanism with random link failures is also a promising research direction.

References

1. S. Pu, A. Olshevsky, I.C. Paschalidis, Asymptotic network independence in distributed


stochastic optimization for machine learning: examining distributed and centralized stochastic
gradient descent. IEEE Signal Process. Mag. 37(3), 114–122 (2020)
2. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE
Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020)
3. C. Li, X. Yu, T. Huang, X. He, Distributed optimal consensus over resource allocation network
and its application to dynamical economic dispatch. IEEE Trans. Neural Netw. Learn. Syst.
29(6), 2407–2418 (2018)
4. N. Heydaribeni, A. Anastasopoulos, Distributed mechanism design for network resource
allocation problems. IEEE Trans. Netw. Sci. Eng. 7(2), 621–636 (2020)
5. M. Rossi, M. Centenaro, A. Ba, S. Eleuch, T. Erseghe, M. Zorzi, Distributed learning
algorithms for optimal data routing in IoT networks. IEEE Trans. Signal Inf. Proc. Netw. 6,
175–195 (2020)
6. D. Nunez, J. Cortes, Efficient privacy-preserving machine learning in hierarchical distributed
system. IEEE Trans. Netw. Sci. Eng. 6(4), 599–612 (2019)
7. A. Nedic, Distributed gradient methods for convex machine learning problems in networks:
distributed optimization. IEEE Signal Process. Mag. 37(3), 92–101 (2020)
8. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
9. D. Nunez, J. Cortes, Distributed online convex optimization over jointly connected digraphs.
IEEE Trans. Netw. Sci. Eng. 1(1), 23–37 (2014)
10. C. Li, X. Yu, W. Yu, G. Chen, J. Wang, Efficient computation for sparse load shifting in demand
side management. IEEE Trans. Smart Grid 8(1), 250–261 (2017)
11. M.O. Sayin, N.D. Vanli, S.S. Kozat, T. Basar, Stochastic subgradient algorithms for strongly
convex optimization over distributed networks. IEEE Trans. Netw. Sci. Eng. 4(4), 248–260
(2017)
12. H. Li, Q. Lü, G. Chen, T. Huang, Z. Dong, Distributed constrained optimization over
unbalanced directed networks using asynchronous broadcast-based algorithm. IEEE Trans.
Autom. Control 66(3), 1102–1115 (2021)
13. Z. Wang, D. Wang, D. Gu, Distributed optimal state consensus for multiple circuit systems
with disturbance rejection. IEEE Trans. Netw. Sci. Eng. 7(4), 2926–2939 (2020)
14. X. He, J. Yu, T. Huang, C. Li, Distributed power management for dynamic economic dispatch
in the multimicrogrids environment. IEEE Trans. Control Syst. Technol. 27(4), 1651–1658
(2019)
15. N. Loizou, M. Rabbat, P. Richtárik, Provably accelerated randomized gossip algorithms, in
Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (2019). https://doi.org/10.1109/ICASSP.2019.8683847
16. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom.
Control 59(5), 1131–1146 (2014)

17. A. Nedic, J. Liu, Distributed optimization for control. Ann. Rev. Control Robot. Auton. Syst.
1, 77–103 (2018)
18. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a
general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–
248 (2019)
19. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
20. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans.
Control Netw. Syst. 5(3), 1245–1260 (2018)
21. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
22. R. Xin, U.A. Khan, A linear algorithm for optimization over directed graphs with geometric
convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018)
23. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
24. Y. Sun, A. Daneshmand, G. Scutari, Convergence rate of distributed optimization algorithms
based on gradient tracking (2019). Preprint. arXiv:1905.02637
25. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
26. M. Bin, I. Notarnicola, L. Marconi, G. Notarstefano, A system theoretical perspective to
gradient-tracking algorithms for distributed quadratic optimization, in Proceedings of the
2019 IEEE 58th Conference on Decision and Control (CDC) (2019). https://doi.org/10.1109/
CDC40024.2019.9029824
27. G. Scutari, Y. Sun, Distributed nonconvex constrained optimization over time-varying
digraphs. Math. Program. 176(1), 497–544 (2019)
28. M.I. Qureshi, R. Xin, S. Kar, U.A. Khan, S-ADDOPT: decentralized stochastic first-order
optimization over directed graphs. IEEE Control Syst. Lett. 5(3), 953–958 (2021)
29. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Control Theory
Appl. 13(17), 2800–2810 (2019)
30. X. He, X. Fang, J. Yu, Distributed energy management strategy for reaching cost-driven
optimal operation integrated with wind forecasting in multimicrogrids system. IEEE Trans.
Syst. Man Cybern. Syst. 49(8), 1643–1651 (2019)
31. J. Zhang, K. You, K. Cai, Distributed dual gradient tracking for resource allocation in
unbalanced networks. IEEE Trans. Signal Process. 68, 2186–2198 (2020)
32. H. Xiao, Y. Yu, S. Devadas, On privacy-preserving decentralized optimization through
alternating direction method of multipliers (2019). Preprint. arXiv:1902.06101
33. M. Maros, J. Jalden, ECO-PANDA: a computationally economic, geometrically converging
dual optimization method on time-varying undirected graphs, in Proceedings of the 2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019).
https://doi.org/10.1109/ICASSP.2019.8683797
34. K. Scaman, F. Bach, S. Bubeck, Y. Lee, L. Massoulie, Optimal convergence rates for convex
distributed optimization in networks. J. Mach. Learn. Res. 20(159), 1–31 (2019)
35. C.A. Uribe, S. Lee, A. Gasnikov, A. Nedic, A dual approach for optimal algorithms in
distributed optimization over networks, in Proceedings of the 2020 Information Theory and
Applications Workshop (ITA) (2020). https://doi.org/10.1109/ITA50056.2020.9244951
36. S.A. Alghunaim, E. Ryu, K. Yuan, A.H. Sayed, Decentralized proximal gradient algorithms
with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021)
37. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer Science
& Business Media, Berlin, 2013)
38. R. Xin, D. Jakovetic, U.A. Khan, Distributed nesterov gradient methods over arbitrary graphs.
IEEE Signal Process. Lett. 26(8), 1247–1251 (2019)

39. Q. Lü, X. Liao, H. Li, T. Huang, A nesterov-like gradient tracking algorithm for distributed
optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 51(10), 6258–6270
(2021)
40. R. Xin, U. A. Khan, Distributed heavy-ball: A generalization and acceleration of first-order
methods with gradient tracking, IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
41. Q. Lü, X. Liao, H. Li, T. Huang, Achieving acceleration for distributed economic dispatch in
smart grids over directed networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1988–1999 (2020)
42. Y. Zhou, Z. Wang, K. Ji, Y. Liang, Proximal gradient algorithm with momentum and flexible
parameter restart for nonconvex optimization (2020). Preprint. arXiv:2002.11582v1
43. C. Liu, H. Li, Y. Shi, D. Xu, Distributed event-triggered gradient method for constrained convex
minimization. IEEE Trans. Autom. Control 65(2), 778–785 (2020)
44. N. Hayashi, T. Sugiura, Y. Kajiyama, S. Takai, Event-triggered consensus-based optimization
algorithm for smooth and strongly convex cost functions, in Proceedings of the 2018
IEEE Conference on Decision and Control (CDC) (2018). https://doi.org/10.1109/CDC.2018.
8618863
45. C. Li, X. Yu, W. Yu, T. Huang, Z-W. Liu, Distributed event-triggered scheme for economic
dispatch in smart grids. IEEE Trans. Ind. Inform. 12(5), 1775–1785 (2016)
46. K. Zhang, J. Xiong, X. Dai, Q. Lü, On the convergence of event-triggered distributed algorithm
for economic dispatch problem. Int. J. Electr. Power Energy Syst. 122, 1–10 (2020)
47. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and conver-
gence to local minima (2020). Preprint. arXiv:2003.02818v1
48. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep
learning, in Proceedings of the 36th International Conference on Machine Learning (ICML)
(2019), pp. 344–353
49. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
50. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the
proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)
51. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient.
Math. Program. 162(1), 83–112 (2017)
52. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support
for non-strongly convex composite objectives, in Advances in Neural Information Processing
Systems (NIPS), vol. 27 (2014), pp. 1–9
53. C. Tan, S. Ma, Y. Dai, Y. Qian, Barzilai-borwein step size for stochastic average gradient, in
Advances in Neural Information Processing Systems, vol. 29 (2016), pp. 1–9
54. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning
problems using stochastic recursive gradient, in Proceedings of the 34th International Confer-
ence on Machine Learning (ICML) (2017), pp. 2613–2621
55. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm.
J. Mach. Learn. Res. 17(1), 2165–2199 (2016)
56. Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, H. Qian, Towards more efficient stochastic decen-
tralized learning: faster convergence and sparse communication, in Proceedings of the 35th
International Conference on Machine Learning (PMLR), vol. 80 (2018), pp. 4624–4633
57. K. Yuan, B. Ying, J. Liu, A.H. Sayed, Variance-reduced stochastic learning by networked
agents under random reshuffling. IEEE Trans. Signal Process. 67(2), 351–366 (2019)
58. H. Hendrikx, F. Bach, L. Massoulie, An accelerated decentralized stochastic proximal algo-
rithm for finite sums, in Advances in Neural Information Processing Systems, vol. 32 (2019),
pp. 4624–4633
59. R. Xin, S. Kar, U.A. Khan, Decentralized stochastic optimization and machine learning: a
unified variance-reduction framework for robust performance and fast convergence. IEEE
Signal Process. Mag. 37(3), 102–113 (2020)
60. B. Li, S. Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks
with gradient tracking and variance reduction, in Proceedings of the 23rd International
Conference on Artificial Intelligence and Statistics (AISTATS) (2020), pp. 1662–1672
References 149

61. R.A. Horn, C.R. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
62. D. Blatt, A. Hero, Energy-based sensor network source localization via projection onto convex
sets. IEEE Trans. Signal Process. 54(9), 3614–3619 (2006)
63. D. Dua, C. Graff, UCI machine learning repository, Dept. School Inf. Comput. Sci., Univ.
California, Irvine, CA, USA (2019)
64. R.M. Gower, M. Schmidt, F. Bach, P. Richtarik, Variance-reduced methods for machine
learning. Proc. IEEE 108(11), 1968–1983 (2020)
65. S. Horvath, L. Lei, P. Richtarik, M.I. Jordan, Adaptivity of stochastic gradient methods for
nonconvex optimization (2020). Preprint. arXiv:2002.05359
Chapter 6
Accelerated Algorithms for Distributed
Economic Dispatch

Abstract In this chapter, we focus on introducing accelerated distributed optimization algorithms in the application scenario of the economic dispatch problem (EDP)
for smart grids. This application scenario focuses on researching how to allocate the
generation power among generators to match the load demand with the minimum
total generation cost while observing all constraints on the local generation capacity.
Each generator possesses its own local generation cost, and the total generation
cost is the sum of all local generation costs. For the EDP, most existing methods, such as push-sum-based strategies, overcome the unbalancedness induced by directed networks by employing column-stochastic weights, which may be infeasible in distributed implementations. In contrast, to handle directed networks with row-stochastic weights, we develop a new directed distributed Lagrangian momentum algorithm, D-DLM, which integrates a distributed gradient tracking method with two momentum terms and non-uniform step-sizes in the update of the Lagrangian multipliers. Next, we prove that if the maximum step-size and the maximum momentum coefficient are positive and sufficiently small, D-DLM achieves the optimal dispatch at a linear convergence rate under smooth and strongly convex generation costs. Finally, various case studies of EDP in smart grids are simulated.

Keywords Distributed economic dispatch · Smart grids · Directed network · Distributed Lagrangian momentum algorithm · Linear convergence

6.1 Introduction

The economic dispatch problem (EDP) is one of the fundamental issues for energy
management during the practical operations of smart grids. The target of EDP is
allocating the generation power among the generators to meet the load demands
with minimal total operation cost (i.e., sum of the local generation costs) while
preserving all constraints of local generation capacity. In a certain sense, EDP can be cast as a constrained optimization problem, which has attracted many researchers in recent years [1–7]. To tackle EDP, many basic methods [8, 9] have
been implemented in a centralized manner. However, these centralized methods

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_6
require a powerful centralized controller to collect global information about the
entire grid and process large amounts of data, which are often computationally and
communicationally expensive and prone to single-point failure. As the integration of
renewable resources, energy storage devices, plug-in hybrid vehicles, and potential
consumers occurs, the smart grid of the future will be highly distributed, further rendering traditional centralized methods impractical [10].
Recently, many distributed algorithms have been proposed for the optimization
problem described above with considerable applications on EDP. Some known
approaches for different networks are usually dependent on the distributed con-
sensus protocol (local calculation and local information exchange) with extensions
to figure out delays and asynchronous scenarios [11–25]. For the smart grid with
generator constraints, Zhang and Chow [11] first proposed an incremental cost
consensus algorithm, which fed the collective mismatch between demand and supply back into the algorithm, allowing the incremental cost to converge to the optimal value. However, to ensure energy balance, a leader was defined in
[11], which prevented the algorithm from operating in a fully distributed manner.
In order to remove the dependency on the leader, a two-level incremental cost
consensus algorithm was proposed in [12]. And then, Kar and Hug [13] proposed
a distributed algorithm based on the consensus+innovations framework, which
avoided the requirement of the leader. Meanwhile, extensions of various real-world
factors and techniques including vehicle-to-grid [14], event-triggered [15], delays
[16], transmission losses [17], and security [18] have been extensively considered
in the study of EDP. It is noteworthy in this aspect that all of these works only
involved the case of undirected networks [11–25].
Distributed optimization for solving EDP over directed networks was recently
studied in [26], where (sub)gradient-push (SP) method was employed to eliminate
the requirement of network balancing, i.e., with column-stochastic matrices. Since
SP was based on (sub)gradient descent with diminishing step-size, it also exhibited a
slow sublinear convergence rate. To accelerate convergence, Li et al. [27] proposed a
linearly convergent distributed algorithm with constant step-size to solve the EDP by incorporating the push-sum strategy into the distributedly inexact gradient tracking method. Then, Lü et al. [28] extended the work of [27] to non-uniform step-sizes and showed linear convergence. In addition, to deal with EDP, a different class
of approaches which did not utilize push-sum strategy was recently proposed in
[29], where a row- and a column-stochastic matrix were adopted simultaneously to
acquire linear convergence over directed networks. It is noteworthy that although
these approaches [26–29] were appropriate for directed networks, they all required each generator to possess (at least) its own out-degree information exactly. Therefore, all the generators in the networks [26–29] could adjust their outgoing weights and ensure that the sum of each column of the weight matrix is one. This
requirement, however, is likely to be unrealistic in the practical operations of smart
grids.
In this chapter, the algorithm that we will present mainly depends on the
distributed gradient tracking method and is a variation of the methods that appeared in [30] for EDP and [31, 32, 35] for classic convex optimization problems. To be specific, Li et al. [30] developed a distributed primal–dual augmented (sub)gradient
algorithm to solve the EDP over a directed network. The algorithm in [30]
employed a row-stochastic matrix and non-uniform step-sizes and yet linearly
converged to the optimal solution for smooth and strongly convex cost functions.
However, the algorithm in [30] did not adopt momentum terms [31–33], where
generators acquired more information from in-neighbors in the network for faster
convergence. In light of methods with momentum terms, Qu and Li [31] adopted
Nesterov momentum and thereby investigated two accelerated distributed Nesterov
algorithms, which exhibited faster convergence rate compared with the centralized
gradient descent (CGD) method for different cost functions. Note that, although the
convergence rate was improved, the two algorithms in [31] were only suitable for
undirected networks, which also limited the applicability of the methods in smart
grids. To overcome this deficiency, Xin et al. [32] established a generalization and
acceleration of first-order methods with heavy-ball momentum, i.e., ABm, which
removed the conservatism (doubly-stochastic weights or eigenvector estimation) in
the related works by implementing both row- and column-stochastic weights. In
this setting, some interesting generalized methods [34] (random link failures) and
[29] (delays) were proposed. Unfortunately, the construction of column-stochastic
weights required the out-degree information, which is difficult to implement,
for example, in broadcast-based or gossip scenarios. On the other hand, the works
of [31–34] did not consider the distributed constrained optimization problem. This
happens to be an indispensable issue which we must face when studying EDP in
smart grids. The related work [35] did not consider the distributed constrained
optimization problem and the non-uniform step-sizes, and a rigorous theoretical
analysis of the algorithm was also lacking. Hence, it is of great significance to
develop effective distributed algorithms to deal with the more practical EDP in smart
grids.
The main interest of this chapter is to study the EDP over a directed network. To
solve this issue, a linearly convergent algorithm is constructed, for which the non-
uniform step-sizes, two types of momentum terms, and the row-stochastic matrix are
utilized. We hope to develop a broad theory of the distributed convex optimization,
and the potential purpose of designing a distributed algorithm is to adapt and
promote the practical operations of smart grids. To conclude, the key contributions
of the presented work can be summarized in the following four aspects:
(i) We propose a novel directed distributed Lagrangian momentum algorithm,
named D-DLM, with a row-stochastic matrix to solve the EDP over a
directed network. In contrast to [31] (doubly-stochastic matrix) and [26–29]
(column-stochastic matrix) for the EDP, D-DLM with row-stochastic matrix
is relatively easy to be implemented in a distributed manner. Specifically, the
implementation of D-DLM is straightforward if each generator can privately
regulate the weights on information acquired from in-neighbors. This is more
appropriate for the practical operations of smart grids.
(ii) For the updates of Lagrangian multipliers, D-DLM extends the centralized
Nesterov gradient descent method (CNGD) [36] to a distributed form and is
appropriate for the distributed constrained optimization problem over directed
networks in comparison with the work in [31]. In particular, D-DLM extends
the distributed gradient tracking method with two types of momentum terms
to ensure that generators acquire more information from in-neighbors in the
network than the existing methods [29, 30] for faster convergence. More
importantly, a consensus iteration step is exploited for designing D-DLM to
counteract the effect of the unbalancedness induced by the directed networks.
(iii) D-DLM utilizes non-uniform step-sizes, which shows a wider range of step-
size selection than that of most of the existing methods investigated in [27, 29],
etc. More importantly, assuming that the non-uniform step-sizes and the
momentum coefficients are constrained by some specific upper bounds, D-DLM converges linearly to the optimal dispatch if the cost functions are smooth
and strongly convex. In addition, in comparison with [32] and [35], we also
establish explicit estimates for the convergence rate of D-DLM.
(iv) The provided bounds on the largest step-size only depend on the cost functions
and the network topology. This is superior to the earlier works on non-
uniform step-sizes within the framework of gradient tracking [27, 28, 37, 38],
which rely not only on the cost functions and the network topology but
also on the heterogeneity of the step-sizes (there is a compromise between
the heterogeneity and the largest achievable step-size). More importantly, the
bound of non-uniform step-sizes in this chapter allows the existence (not all)
of zero step-sizes among the generators.

6.2 Preliminaries

6.2.1 Notation

If not particularly stated, the vectors mentioned in this chapter are column vectors.
We let the subscript i denote the generator index and the superscript t denote the
time index; e.g., xit is generator i’s variable at time t. The sets of real numbers,
n-dimensional real column vectors, and n-dimensional real square matrices are
represented as R, Rn , and Rn×n , respectively. The symbol zij is denoted as the
entry of matrix Z in its i-th row and j -th column and In is denoted as the identity
matrix of size n. Given a vector y = [y1 , y2 , . . . , yn ]T , Z = diag{y} is utilized
to represent a diagonal matrix which satisfies that zii = yi , ∀i = 1, . . . , n, and
zij = 0, ∀i ≠ j. The diagonal matrix consisting of the corresponding diagonal
elements of matrix Z is represented as diag{Z}. Two column vectors with all entries
equal to ones and zeros are denoted as 1n and 0n , respectively. The 2-norm of vectors
and matrices is denoted as || · ||2 . The symbols zT and W T are the transposes of a
vector z and a matrix W , respectively. Given two vectors v = [v1 , . . . , vn ]T and
u = [u1, . . . , un]T, the notation v ⪯ u implies that vi ≤ ui for any i ∈ {1, . . . , n}. The vector ∇f(z) : Rp → Rp denotes the gradient of the (differentiable) function f at z.
Define ei = [0, . . . , 1i, . . . , 0]T. A non-negative square matrix Z ∈ Rn×n is row-stochastic if Z1n = 1n.

6.2.2 Model of Optimization Problem

Consider a set of n generators connected over a smart grid. The global objective
of EDP is to find the optimal allocation that meets the expected load demands
while preserving the limitations of generator capacity, therefore minimizing the total
generation cost, i.e.,


min_{x∈R^n} C(x) = ∑_{i=1}^{n} Ci(xi),  s.t.  ∑_{i=1}^{n} xi = ∑_{i=1}^{n} di,  ximin ≤ xi ≤ ximax,    (6.1)

∀i = 1, . . . , n, where Ci : R → R is the local cost function known privately by generator i. Denote xi ∈ R as the power generation of generator i and
x = [x1 , x2 , . . . , xn ]T ∈ Rn . For each generator i, the maximum and minimum
capacities of power generation are indicated by ximin and ximax , respectively. The set
of limitation of generator capacity is defined by Xi = {x† ∈ R|ximin ≤ x† ≤ ximax }
for i = 1, . . . , n, and denote their Cartesian products as X = X1 × · · · × Xn . To
guarantee the feasibility of problem (6.1), the total expected power demand satisfies ∑_{i=1}^{n} ximin ≤ ∑_{i=1}^{n} di ≤ ∑_{i=1}^{n} ximax. We denote by x∗ = [x1∗, x2∗, . . . , xn∗]T ∈ Rn
the optimal solution to problem (6.1). Then, the following assumptions are made in
the sequel.
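To make the setup concrete, the snippet below builds a small EDP instance with quadratic costs Ci(xi) = ai xi² + bi xi (a common choice in the EDP literature; all coefficients and demands here are illustrative assumptions, not data from this chapter) and checks the feasibility condition above:

```python
import numpy as np

# Hypothetical 4-generator instance with quadratic costs C_i(x) = a_i x^2 + b_i x.
a = np.array([0.50, 0.75, 1.00, 0.60])   # quadratic coefficients (so l_i = mu_i = 2 a_i)
b = np.array([2.0, 1.5, 3.0, 2.5])       # linear coefficients
x_min = np.array([0.0, 0.0, 0.0, 0.0])   # lower capacity limits x_i^min
x_max = np.array([8.0, 6.0, 5.0, 7.0])   # upper capacity limits x_i^max
d = np.array([3.0, 2.0, 4.0, 3.0])       # local load demands d_i

def feasible(x_min, x_max, d):
    """Feasibility of (6.1): sum(x_min) <= sum(d) <= sum(x_max)."""
    return x_min.sum() <= d.sum() <= x_max.sum()
```

For quadratic costs, li = μi = 2ai, so Assumptions 6.1 and 6.2 hold automatically.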

Assumption 6.1 (Strong Convexity [30]) Each local cost function Ci, i = 1, . . . , n, is μi-strongly convex. Mathematically, there exists μi > 0 such that for any x, y ∈ R,

θCi(x) + (1 − θ)Ci(y) ≥ Ci(θx + (1 − θ)y) + (μi/2) θ(1 − θ)(x − y)²,
where θ ∈ [0, 1] is an arbitrary constant.

Assumption 6.2 (Smoothness [30]) Each local cost function Ci, i = 1, . . . , n, is continuously differentiable and has a Lipschitz-continuous gradient. Mathematically, there exists li > 0 such that for any x, y ∈ R,

|∇Ci (x) − ∇Ci (y)| ≤ li |x − y|.

Due to Assumption 6.1, problem (6.1) has a unique optimal solution. It is worth
emphasizing that Assumptions 6.1 and 6.2 are two standard assumptions to achieve
linear convergence when employing first-order methods [27–35].

6.2.3 Communication Network

In this chapter, we consider a group of n generators communicating over a
directed network G = {V, E} involving the vertices (generators) set V = {1, . . . , n}
and the edges set E ⊆ V × V, which consists of ordered pairs of vertices. If
(i, j ) ∈ E, it indicates that generator i can directly transmit data to generator j ,
where i is viewed as an in-neighbor of j and contrarily j is regarded as an out-
neighbor of i. Let Niin = {j ∈ V|(j, i) ∈ E} and Niout = {j ∈ V|(i, j ) ∈ E} be the
in-neighbor and out-neighbor sets of i, respectively. If |Niin| ≠ |Niout|, the network is said to be unbalanced, where | · | denotes the cardinality of a set. For the directed
network G, a path of length b from generator i1 to generator ib+1 is a sequence of
b + 1 distinct generators i1 , . . . , ib+1 such that (ik , ik+1 ) ∈ E for k = 1, . . . , b. If
there exists a path between any two generators, G is strongly connected. Then, we
consider the following assumption.
Assumption 6.3 ([32]) The network G corresponding to the set of generators is
directed and strongly connected.

Assumption 6.3 forces each generator in the network to directly or indirectly have
an impact on the others. In comparison with the uniformly and strongly connected
assumption [27, 28, 39], Assumption 6.3, although more restrictive, is still relatively
common [30–35].
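Assumption 6.3 is easy to verify for a given topology. A minimal check (a generic graph routine, not part of the chapter's algorithm) confirms that node 0 reaches every node in both the graph and its reverse, which is equivalent to strong connectivity:

```python
def strongly_connected(adj):
    """Check strong connectivity of a digraph given as adjacency lists,
    where adj[i] is the list of out-neighbors of node i (double-BFS test)."""
    n = len(adj)
    radj = [[] for _ in range(n)]          # build the reversed graph
    for i in range(n):
        for j in adj[i]:
            radj[j].append(i)

    def reaches_all(graph, start):
        seen, stack = {start}, [start]
        while stack:
            for j in graph[stack.pop()]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        return len(seen) == n

    # Strongly connected iff node 0 reaches all nodes in G and in reversed G.
    return reaches_all(adj, 0) and reaches_all(radj, 0)
```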

6.2.4 Centralized Lagrangian Method

First, we construct the dual problem of (6.1). To this end, we introduce the
Lagrangian function L : X × R → R defined as


L(x, y) = ∑_{i=1}^{n} Ci(xi) − y ∑_{i=1}^{n} (xi − di),    (6.2)

where y is the Lagrangian multiplier associated with the equality constraint in (6.1). Consider a convex function Ci, i ∈ V; the conjugate function Ci⊥ of Ci is given by

Ci⊥(y) = sup_{xi∈Xi} {yxi − Ci(xi)},    (6.3)

for all y ∈ R. With the above, the dual problem of (6.1) is described as


max_{y∈R} Φ(y) = ∑_{i=1}^{n} Φi(y),    (6.4)

where the dual function Φi (y) for each i ∈ V is given by

Φi (y) = ψi (y) + ydi , (6.5)

where

ψi(y) = −Ci⊥(y) = min_{xi∈Xi} (Ci(xi) − yxi).    (6.6)

By Assumption 6.1, it can be demonstrated that the strong duality between (6.1) and
(6.4) holds [40], i.e., there exists at least a dual optimal solution y ∗ to (6.4) such that
C(x ∗ ) = Φ(y ∗ ), where x ∗ is the primal optimal solution to (6.1), and that a set of
the dual optimal solution to (6.4) is nonempty [40].
Note from Assumption 6.1 that for each i ∈ V and any given y ∈ R, (6.6) admits a unique minimizer as follows:

x̃i(y) = ximax,          if ∇Ci^{−1}(y) ≥ ximax,
         ∇Ci^{−1}(y),   if ximin < ∇Ci^{−1}(y) < ximax,    (6.7)
         ximin,          if ∇Ci^{−1}(y) ≤ ximin,

where ∇Ci^{−1}(y) is the inverse function of ∇Ci, and according to Assumption 6.2, ∇Ci^{−1}(y) exists on the interval [ximin, ximax].
to (6.4) is nonempty, the primal (unique) optimal solution to (6.1) for each i ∈ V
becomes xi∗ = x̃i (y ∗ ), where y ∗ is any dual optimal solution.
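For the quadratic costs Ci(xi) = ai xi² + bi xi used in many EDP studies, ∇Ci(x) = 2ai x + bi, so ∇Ci^{−1}(y) = (y − bi)/(2ai) and (6.7) reduces to a clipping operation. A small sketch (the coefficients are illustrative assumptions), cross-checked against a brute-force grid minimization of Ci(x) − yx:

```python
import numpy as np

def x_tilde(y, a_i, b_i, x_min_i, x_max_i):
    """Minimizer of C_i(x) - y*x over [x_min_i, x_max_i] for C_i(x) = a_i x^2 + b_i x,
    i.e., Eq. (6.7) with grad C_i^{-1}(y) = (y - b_i) / (2 a_i)."""
    return float(np.clip((y - b_i) / (2.0 * a_i), x_min_i, x_max_i))

# Sanity check against brute force on a fine grid for one (y, a_i, b_i) choice.
grid = np.linspace(0.0, 5.0, 100001)
y, a_i, b_i = 4.0, 0.5, 1.0
brute = grid[np.argmin(a_i * grid**2 + b_i * grid - y * grid)]  # interior optimum is 3.0
```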
Then, for any given y ∈ R, the dual function Φ(y) is differentiable [26] (because
of the uniqueness of x̃i (y)) at y and its gradient is


∇Φ(y) = − ∑_{i=1}^{n} (x̃i(y) − di),    (6.8)

and further ∇Φi (y) = −(x̃i (y) − di ), ∀i ∈ V. Thus, the dual problem (6.4) can be
solved by utilizing the standard gradient ascent method as follows:


y^{t+1} = y^{t} − α̃ ∑_{i=1}^{n} (x̃i(y^{t}) − di),    (6.9)

where α̃ > 0 is an appropriately selected step-size. It has been proven that method
(6.9) converges to the dual optimal solution, i.e., y t converges to y ∗ , under some
minor assumptions on cost functions and an arbitrary y^0 ∈ R, and then x^t converges to x∗ [40]. However, it is worth noting that the update of y in (6.9) requires the knowledge of each generator, thus curbing the distributed implementation of the method. To eliminate this deficiency, each generator must update y in some way via interacting only with its in-neighbors. Hence, we require an approach not only to calculate or approximate ∑_{i=1}^{n} (x̃i(y^t) − di) in (6.9) in a distributed manner, but also
to be suitable for the directed networks considered in this chapter. It is basically the
main motivation of D-DLM presented in the next section.
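As a sanity check, the centralized update (6.9) can be simulated directly for quadratic costs, where x̃i(y) from (6.7) reduces to a clip; the instance below is an illustrative assumption, not data from the chapter:

```python
import numpy as np

a = np.array([0.5, 0.8, 1.0])               # illustrative quadratic cost coefficients
b = np.array([1.0, 2.0, 1.5])
x_min, x_max = np.zeros(3), np.full(3, 10.0)
d = np.array([4.0, 3.0, 5.0])               # total demand: 12

def x_tilde(y):
    # Eq. (6.7) for C_i(x) = a_i x^2 + b_i x.
    return np.clip((y - b) / (2.0 * a), x_min, x_max)

y, step = 0.0, 0.1                           # dual variable y^0 and step-size alpha~
for _ in range(2000):
    y += step * (d.sum() - x_tilde(y).sum()) # Eq. (6.9): gradient ascent on Phi

x_opt = x_tilde(y)                           # primal recovery x_i* = x~_i(y*)
balance = abs(x_opt.sum() - d.sum())         # supply-demand mismatch at convergence
```

Note that the update needs the network-wide sum of mismatches, which is exactly the quantity D-DLM tracks in a distributed way.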

6.3 Algorithm Development

In this section, D-DLM that is capable of solving EDP in a distributed manner is presented over a directed network. First, D-DLM is constructed for the EDP via
integrating the distributed gradient tracking method with two types of momentum
terms (for the Lagrangian multiplier). Then, the two distributed optimization
methods [28, 30] that are not only suitable for directed networks but also related
to D-DLM are discussed with an intuitive explanation.

6.3.1 Directed Distributed Lagrangian Momentum Algorithm

As discussed in the previous section, the centralized Lagrangian method requires global information from generators to generate the Lagrangian multiplier. To
overcome this deficiency, a distributed algorithm is needed for each generator to
estimate this quantity. Specifically, we devote ourselves to the study of a novel
directed distributed Lagrangian momentum algorithm (D-DLM), which is not only
suitable for a directed network but also converges linearly and accurately to the
optimal solution to (6.1). To the best of our knowledge, this work has not yet been
involved and is worthwhile to study.
First, the dual problem (6.4) can be converted to the following minimization
form:


min_{y∈R} q(y) = ∑_{i=1}^{n} qi(y),    (6.10)

where qi (y) = −Φi (y) = Ci⊥ (y) − ydi . It is worth highlighting that problem (6.10)
shares the same set of dual optimal solution to problem (6.4), and it also has the
similar formulation with the distributed convex problems in [31, 32, 37]. According
to (6.10), we now describe D-DLM to distributedly deal with problem (6.1). In D-
DLM, each generator i ∈ V at time t ≥ 0 stores five variables: xit ∈ R, yit ∈ R,
hti ∈ R, sti ∈ Rn , and zit ∈ R. For t ≥ 0, generator i ∈ V updates its variables as
follows:

xi^{t+1} = min{max{∇Ci^{−1}(yi^t), ximin}, ximax}
hi^{t+1} = ∑_{j=1}^{n} rij yj^t + βi^t (hi^t − hi^{t−1}) − αi zi^t
yi^{t+1} = hi^{t+1} + βi^{t+1} (hi^{t+1} − hi^t)        (6.11)
si^{t+1} = ∑_{j=1}^{n} rij sj^t
zi^{t+1} = ∑_{j=1}^{n} rij zj^t + (xi^{t+1} − di)/[si^{t+1}]_i − (xi^t − di)/[si^t]_i,

where αi > 0 refers to the constant step-size (non-uniform) locally chosen at each generator i, and βi^t is the momentum (heavy-ball momentum and Nesterov
momentum) coefficient (non-uniform) at time t ≥ 0. The symbol [sti ]i denotes the
i-th entry of vector sti . The weights, rij , i, j ∈ V, associated with the network G
obey the following conditions:

rij > ε if j ∈ Ni^in, rij = 0 otherwise, with ∑_{j=1}^{n} rij = 1, ∀i ∈ V,    (6.12)

and rii = 1 − ∑_{j∈Ni^in} rij > ε, ∀i ∈ V, where 0 < ε < 1. Each generator i ∈ V starts with the initial states xi^0 ∈ Xi, yi^0 = hi^0 ∈ R, si^0 = ei, and zi^0 = xi^0 − di.¹
Denote by R = [rij ] ∈ Rn×n the collection of weights rij , i, j ∈ V, in (6.11),
which is obviously row-stochastic. In essence, (xit − di ) in the update of zit in (6.11)
is the gradient of the function qi (y) at y = yit , i.e., ∇qi (yit ) = xit − di , i ∈ V.
Furthermore, the update of zit in (6.11) is a distributedly inexact gradient tracking
step, where each function’s gradient is scaled by [sti ]i , which is generated by the
update of si^t in (6.11). Actually, the update of si^t in (6.11) is a consensus iteration step aiming to estimate the left normalized Perron eigenvector w = [w1, . . . , wn]^T, corresponding to the eigenvalue 1, of the weight matrix R (i.e., the left eigenvector w satisfying 1n^T w = 1), and thereby to overcome the unbalancedness induced by the directed network. This iteration resembles
those employed in [30, 41] and [42]. To sum up, D-DLM (6.11) transforms the
centralized method (6.9) into the distributed ones via gradient tracking method and
can be applied to a directed network.
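A common way to realize weights satisfying (6.12) without any out-degree knowledge is for each generator to weight itself and its in-neighbors uniformly; a minimal sketch (the graph below is an illustrative assumption):

```python
import numpy as np

def row_stochastic_weights(in_neighbors):
    """Build R = [r_ij] satisfying (6.12): each node i assigns weight
    1/(|N_i^in| + 1) to itself and to each in-neighbor, using only local info."""
    n = len(in_neighbors)
    R = np.zeros((n, n))
    for i, nbrs in enumerate(in_neighbors):
        w = 1.0 / (len(nbrs) + 1)
        R[i, i] = w                     # positive self-weight r_ii
        for j in nbrs:
            R[i, j] = w                 # positive weight on each in-neighbor
    return R

# Directed ring with one extra edge (strongly connected, unbalanced digraph).
in_nbrs = [[3], [0], [1], [2, 0]]
R = row_stochastic_weights(in_nbrs)
```

Here ε can be taken as 1/(max in-degree + 1); the columns of R generally do not sum to one, reflecting the unbalanced directed network.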
Define x t = [x1t , . . . , xnt ]T ∈ Rn , y t = [y1t , . . . , ynt ]T ∈ Rn , ht = [ht1 , . . . , htn ]T
∈ Rn , zt = [z1t , . . . , znt ]T ∈ Rn , S t = [st1 , . . . , stn ]T ∈ Rn×n , S̃ t = diag{S t },
and d = [d1, . . . , dn ]T ∈ Rn . Therefore, D-DLM (6.11) can be rewritten in the
following aggregated form:

¹ Suppose that each generator possesses and knows its unique identifier in the network, e.g., 1, . . . , n [27–35].

xi^{t+1} = min{max{∇Ci^{−1}(yi^t), ximin}, ximax}
h^{t+1} = R y^t + Dβ^t (h^t − h^{t−1}) − Dα z^t
y^{t+1} = h^{t+1} + Dβ^{t+1} (h^{t+1} − h^t)        (6.13)
S^{t+1} = R S^t
z^{t+1} = R z^t + [S̃^{t+1}]^{−1} (x^{t+1} − d) − [S̃^t]^{−1} (x^t − d),

where Dα = diag{α} ∈ Rn×n and Dβ^t = diag{β^t} ∈ Rn×n for t ≥ 0, with α = [α1, . . . , αn]T and β^t = [β1^t, . . . , βn^t]T. The initial states are x^0 ∈ X, y^0 = h^0 ∈ Rn,
S 0 = In , and z0 = (x 0 − d) ∈ Rn . It is worth emphasizing that D-DLM (6.13)
does not need the out-degree information of generators (only row-stochastic matrix
is adopted in D-DLM), which is more likely to be implemented in a distributed
manner.
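The aggregated recursion (6.13) can be prototyped directly. The sketch below runs D-DLM on a small directed network with quadratic costs and uniform momentum coefficients; the network, step-sizes, and costs are assumptions chosen for demonstration, not values from this chapter:

```python
import numpy as np

# Illustrative instance (assumed data): quadratic costs C_i(x) = a_i x^2 + b_i x.
a = np.array([0.5, 0.8, 1.0, 0.6])
b = np.array([1.0, 2.0, 1.5, 2.5])
x_lo, x_hi = np.zeros(4), np.full(4, 10.0)
d = np.array([3.0, 2.0, 4.0, 3.0])              # local demands

# Row-stochastic R for a directed ring with one extra edge (unbalanced digraph).
R = np.zeros((4, 4))
for i, nbrs in enumerate([[3], [0], [1], [2, 0]]):
    R[i, [i] + nbrs] = 1.0 / (len(nbrs) + 1)

alpha = np.array([0.010, 0.012, 0.008, 0.011])  # non-uniform step-sizes (kept small)
beta = 0.05                                     # constant momentum coefficient

def x_tilde(y):
    # Eq. (6.7) for quadratic costs: grad C_i^{-1}(y) = (y - b_i) / (2 a_i), then clip.
    return np.clip((y - b) / (2.0 * a), x_lo, x_hi)

x = np.clip(d, x_lo, x_hi)                      # x^0 in X
y = np.zeros(4)                                 # y^0 = h^0
h, h_prev = y.copy(), y.copy()
S = np.eye(4)                                   # S^0 = I_n
z = x - d                                       # z^0 = x^0 - d

for _ in range(60000):                          # D-DLM iterations, Eq. (6.13)
    x_new = x_tilde(y)
    h_new = R @ y + beta * (h - h_prev) - alpha * z
    y = h_new + beta * (h_new - h)
    S_new = R @ S
    z = R @ z + (x_new - d) / np.diag(S_new) - (x - d) / np.diag(S)
    x, h_prev, h, S = x_new, h, h_new, S_new

balance = abs(x.sum() - d.sum())                # supply-demand mismatch
y_spread = y.max() - y.min()                    # consensus violation of multipliers
```

With sufficiently small step-sizes and momentum coefficients, the multipliers yi^t reach consensus and the total generation matches the total demand, in line with the linear convergence result stated in this chapter.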

6.3.2 Related Methods

Relation to Method with Column-Stochastic Weights [28] Under Assumptions 6.1–6.3, DPD-PD proposed in [28] converged linearly to the optimal solution
over a directed network using non-uniform step-sizes. Besides, DPD-PD applied
push-sum strategy (column-stochastic weights) to overcome the unbalancedness
induced by the directed networks, which may be infeasible in the distributed
implementation because it required each generator to possess (at least) its out-
degree information. We emphasize here that row-stochastic weights are relatively
easier to achieve in a distributed setting and the implementation is straightforward
if each generator can privately regulate the weights on information acquired from
in-neighbors.
Relation to Method with Row-Stochastic Weights [30] The distributed primal–
dual method proposed in [30] served as a basis for the development of D-DLM
(6.11). The method [30] utilized row-stochastic weights with non-uniform step-sizes
among the generators and converged at a linear rate to the optimal solution over a
directed network under Assumptions 6.1–6.3. Notice that D-DLM (6.11) combines
the method [30] with two kinds of momentum terms (heavy-ball momentum and
Nesterov momentum), which ensures that generators acquire more information from
in-neighbors in the network than the method proposed in [30] for faster convergence.

6.4 Convergence Analysis

In this section, the convergence properties of D-DLM (6.11) are rigorously ana-
lyzed. Before showing the main results, some auxiliary results (borrowed from the
literature) are introduced for completeness.

6.4.1 Auxiliary Results

First, the following lemma shows that Ci⊥ , i ∈ V, is strongly convex and smooth
[22, 27, 28, 30].
Lemma 6.1 Suppose that Assumptions 6.1 and 6.2 hold. Then, for each i ∈ V, Ci⊥
is strongly convex with constant ϑi and Lipschitz differentiable with constant ℓi, respectively, where ϑi = 1/li and ℓi = 1/μi.
If Lemma 6.1 holds, it follows that the global function (−Φ) is strongly convex with parameter ϑ = ∑_{i=1}^{n} ϑi and has a Lipschitz continuous gradient with parameter ℓ = ∑_{i=1}^{n} ℓi, respectively. In addition, we define ℓ̂ = maxi∈V{ℓi}.
Considering the sequence {ṽ^t}_{t=0}^{∞} and γ ∈ (0, 1), for any positive integer T > 0 and norm || · ||c (in this chapter, || · ||c may be the 2-norm or a particular norm), let us further define

||ṽ||c^{γ,T} = sup_{t=0,1,...,T} {||ṽ^t||c / γ^t}  and  ||ṽ||c^{γ} = sup_{t≥0} {||ṽ^t||c / γ^t}.

Then, the following additional lemma from the generalized small gain theorem
[43] is presented.
Lemma 6.2 (Generalized Small Gain Theorem) Suppose that there exist non-negative vector sequences {ṽi^t}_{t=0}^{∞}, i = 1, . . . , m, a non-negative matrix Γ̃ ∈ R^{m×m}, ũ ∈ R^m, and γ ∈ (0, 1) such that for all T > 0,

ṽ^{γ,T} ⪯ Γ̃ ṽ^{γ,T} + ũ,    (6.14)

where ṽ^{γ,T} = [||ṽ1||c^{γ,T}, . . . , ||ṽm||c^{γ,T}]^T. If ρ(Γ̃) < 1, then ||ṽi||c^{γ} < B, where B < +∞ and ρ(Γ̃) is the spectral radius of the matrix Γ̃. Hence, each ||ṽi^t||c, i ∈ {1, . . . , m}, linearly converges to zero at the rate of O((γ)^t).
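The mechanics of Lemma 6.2 can be checked numerically: when ρ(Γ̃) < 1, the coupled recursion v^{t+1} = Γ̃ v^t (a simplified, illustrative instance of the coupling in (6.14) with ũ = 0, not a matrix derived in this chapter) decays geometrically:

```python
import numpy as np

Gamma = np.array([[0.5, 0.2, 0.0],
                  [0.1, 0.4, 0.2],
                  [0.0, 0.3, 0.5]])          # illustrative non-negative gain matrix

rho = max(abs(np.linalg.eigvals(Gamma)))     # spectral radius rho(Gamma)

v = np.ones(3)
norms = []
for _ in range(200):
    v = Gamma @ v                             # coupled linear recursion v^{t+1} = Gamma v^t
    norms.append(np.linalg.norm(v))
```

Here the row sums are at most 0.8, so ρ(Γ̃) < 1 by the Gershgorin bound and all three coupled sequences vanish at a linear rate.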
Recall that R is irreducible and row-stochastic with positive diagonal entries.
Under Assumption 6.3, there exists a normalized left Perron eigenvector w =
[w1 , . . . , wn ]T ∈ Rn (wi > 0, ∀i) of R such that

lim_{t→∞} (R)^t = (R)^∞ = 1n w^T,  w^T R = w^T  and  w^T 1n = 1.
Also, we define S^∞ = lim_{t→∞} S^t (we have S^∞ = (R)^∞ due to S^0 = In), S̃^∞ = diag{S^∞}, ŝ = sup_{t≥0} ||S^t||2, s̃ = sup_{t≥0} ||[S̃^t]^{−1}||2,² ȳ^t = w^T y^t,
∇Q(1n ȳ t ) = [∇q1 (ȳ t ), . . . , ∇qn (ȳ t )]T , ∇Q(y t ) = [∇q1 (y1t ), . . . , ∇qn (ynt )]T ,
α̂ = maxi∈V{αi}, and β̂ = sup_{t≥0} maxi∈V{βi^t}. Since R is primitive and S^0 = In,
it yields that {S t } is convergent [30, 41, 42], and therefore the diagonal elements of
S t are positive and bounded, for all t ≥ 0. Thus, ŝ and s̃ are two finite constants. In
addition, we employ || · || to indicate either a particular matrix norm or a particular
vector norm such that ||Ry|| ≤ ||R||||y|| for all matrices R and vectors y. Since
all vector norms on finite-dimensional vector space are equivalent, we have the
following conclusions: || · ||2 ≤ p1 || · || and || · || ≤ p2 || · ||2 , where p1 and p2
are some positive constants [45].
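The left Perron eigenvector w above can be computed (or verified in simulation) by power iteration on R^T, normalizing so that w^T 1n = 1; the matrix below is an illustrative row-stochastic choice, not one from the chapter's simulations:

```python
import numpy as np

def left_perron(R, iters=10000):
    """Left Perron eigenvector w of a row-stochastic R: w^T R = w^T, w^T 1 = 1,
    computed by power iteration on R^T (valid when R is primitive)."""
    n = R.shape[0]
    w = np.ones(n) / n
    for _ in range(iters):
        w = R.T @ w
        w /= w.sum()                      # keep the normalization w^T 1_n = 1
    return w

R = np.array([[0.5, 0.0, 0.0, 0.5],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [1/3, 0.0, 1/3, 1/3]])      # row-stochastic, strongly connected digraph
w = left_perron(R)
```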

6.4.2 Main Results

In this subsection, the linear convergence rate of D-DLM is established by employing Lemma 6.2. First, we cast four inequalities into a linear system of the form of (6.14)
and then investigate the spectral properties of the achieved coefficient matrix. To
this aim, some essential notations are introduced to simplify the main results. Denote
v1^t = y^t − (R)^∞ y^t, ∀t ≥ 0, v2^t = (R)^∞ y^t − 1n y∗, ∀t ≥ 0, v3^t = h^t − h^{t−1}, ∀t > 0, with the convention that v3^0 = 0n, and v4^t = z^t − (R)^∞ z^t, ∀t ≥ 0.
The first lemma constitutes an inevitable bound on the estimate ||zt ||2 for
deriving the aforementioned linear system.
Lemma 6.3 Under Assumptions 6.1 and 6.2, the following inequality holds for all
t ≥ 0:

||zt||2 ≤ nℓ̂p1||v1t|| + nℓ̂||v2t||2 + p1||v4t|| + ŝ(s̃)2θ(λ)t||∇Q(yt)||2, (6.15)

where 0 < θ < ∞ and 0 < λ < 1 are constants.


Proof Note that

||zt ||2 ≤ ||zt − (R)∞ zt ||2 + ||(R)∞ zt ||2 . (6.16)

Recalling that z0 = x0 − d and ∇qi(yit) = xit − di, i ∈ V, then, for all t ≥ 0, we have (R)∞zt = (R)∞[S̃t]−1∇Q(yt) (see reference [30] for a proof). Thus, using

2 Throughout the chapter, for any arbitrary matrix (scalar or variable) Z, we utilize the symbol (Z)t
to represent the t-th power of Z to distinguish the iteration of variables.
6.4 Convergence Analysis 163

S ∞ [S̃ ∞ ]−1 = 1n 1Tn and (R)∞ = S ∞ , it suffices that

||(R)∞ zt ||2 ≤||S ∞ [S̃ t ]−1 ∇Q(y t ) − S ∞ [S̃ ∞ ]−1 ∇Q(y t )||2
+ ||S ∞ [S̃ ∞ ]−1 ∇Q(y t ) − S ∞ [S̃ ∞ ]−1 ∇Q(1n y ∗ )||2

≤ ŝ(s̃)2θ(λ)t||∇Q(yt)||2 + nℓ̂||yt − 1ny∗||2, (6.17)

where ŝ = supt ≥0||S t ||2 , s̃ = supt ≥0 ||[S̃ t ]−1 ||2 , and the last inequality follows from
the fact that ||[S̃ t ]−1 − [S̃ ∞ ]−1 ||2 ≤ θ (s̃)2 (λ)t , where 0 < θ < ∞ and 0 < λ < 1
are constants (see [41, 42] for more details). Then, one gets

||y t − 1n y ∗ ||2 ≤ ||y t − (R)∞ y t ||2 + ||(R)∞ y t − 1n y ∗ ||2 . (6.18)

Substituting (6.17) and (6.18) into (6.16) yields the desired result in Lemma 6.3. 
In what follows, the bound of the consensus violation ||v1 ||γ ,T of the Lagrangian
multiplier is provided.
Lemma 6.4 Suppose that Assumptions 6.1 and 6.2 hold. Then, for all T > 0, we
have the following inequality:

||v1||γ,T ≤ (α̂κ1nℓ̂||v2||2 γ,T + 2β̂κ1||v3||γ,T + α̂κ1p1||v4||γ,T)/(γ − ρ − α̂κ1nℓ̂p1) + u1, (6.19)

for all max{λ, ρ + α̂κ1nℓ̂p1} < γ < 1, where 0 < ρ < 1 is a constant, κ1 = p2||In − (R)∞||, and u1 = (||v10|| + α̂κ1ŝ(s̃)2θ supt=0,...,T ||∇Q(yt)||2)/(γ − ρ − α̂κ1nℓ̂p1).
Proof According to the updates of ht and y t of D-DLM (6.13), it holds that

||v1t +1 || ≤ ||(R − (R)∞ )v1t || + ||(In − (R)∞ )Dβt +1 v3t +1 ||

+ ||(In − (R)∞ )Dβt v3t || + ||(In − (R)∞ )Dα zt ||, (6.20)

where the inequality in (6.20) is obtained from the facts that (R)∞ R = (R)∞ and
(R − (R)∞ )(In − (R)∞ ) = R − (R)∞ . Considering the weight matrix R = [rij ] ∈
Rn×n (6.12), then there exist a norm || · || and a constant 0 < ρ < 1 such that
||Ry − (R)∞ y|| ≤ ρ||y − (R)∞ y|| for all y ∈ Rn (see [46] Lemma 5.3). Thus,
(6.20) further implies that

||v1t +1 || ≤ρ||v1t || + α̂κ1 ||zt ||2 + β̂κ1 ||v3t +1 || + β̂κ1 ||v3t ||, (6.21)

where we use ||Dα||2 = α̂ and ||Dβt||2 ≤ β̂. By Lemma 6.3, we have that

||v1t+1|| ≤ (ρ + α̂κ1nℓ̂p1)||v1t|| + α̂κ1nℓ̂||v2t||2 + β̂κ1||v3t||
+ β̂κ1||v3t+1|| + α̂κ1p1||v4t|| + α̂κ1ŝ(s̃)2θ(λ)t||∇Q(yt)||2. (6.22)

From here, the procedure is similar to that in the proof of Lemma 8 in [30]. We
include the proof for completeness. By multiplying (γ )−(t +1) on both sides of (6.22)
and then taking the supremum for t = 0, . . . , T − 1, one has

supt=0,...,T−1 ||v1t+1||/(γ)t+1 ≤ ((ρ + α̂κ1nℓ̂p1)/γ) supt=0,...,T−1 ||v1t||/(γ)t + (α̂κ1nℓ̂/γ) supt=0,...,T−1 ||v2t||2/(γ)t
+ β̂κ1 supt=0,...,T−1 (||v3t|| + ||v3t+1||)/(γ)t+1 + (α̂κ1p1/γ) supt=0,...,T−1 ||v4t||/(γ)t
+ supt=0,...,T−1 α̂κ1ŝ(s̃)2θ(λ)t||∇Q(yt)||2/(γ)t+1. (6.23)

Also, noting that ||v1||γ,T ≤ supt=0,...,T−1 ||v1t+1||/(γ)t+1 + ||v10|| and λ < γ < 1, it follows that

||v1||γ,T ≤ ((ρ + α̂κ1nℓ̂p1)/γ)||v1||γ,T + (α̂κ1nℓ̂/γ)||v2||2 γ,T + (2β̂κ1/γ)||v3||γ,T + (α̂κ1p1/γ)||v4||γ,T
+ ||v10|| + (α̂κ1ŝ(s̃)2θ/γ) supt=0,...,T−1 ||∇Q(yt)||2, (6.24)

which after some algebraic manipulations yields the desired result. 
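The manipulation just performed (multiply by (γ)−(t+1), take suprema, rearrange) is easy to check on a scalar analogue; the constants below are made up for illustration.

```python
# Scalar analogue of the step from (6.22) to (6.24): a recursion
# a_{t+1} <= rho*a_t + c*lam^t with max(rho, lam) < gamma < 1 has a
# finite gamma-weighted sup norm  max_t a_t/gamma^t.  All constants
# here are made up for illustration.
rho, lam, c, gamma = 0.6, 0.5, 1.0, 0.8

a = [1.0]
for t in range(200):
    a.append(rho * a[-1] + c * lam ** t)

weighted = [a_t / gamma ** t for t, a_t in enumerate(a)]
bound = max(weighted)
print(bound)   # finite, hence a_t <= bound * gamma^t for every t
```

The finite value of the weighted sup norm certifies a_t ≤ bound · γ^t, i.e., geometric decay at rate γ, which is the role the ||·||γ,T norms play in Lemmas 6.4–6.7.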


The next lemma presents the bound of the dual optimality residual ||v2||2 γ,T associated with the weighted average (notice that (R)∞yt = 1nȳt).
Lemma 6.5 Suppose that Assumptions 6.1 and 6.2 hold. If max{λ, l1} < γ < 1 and 0 < wTα < 2/ℓ, then the following inequality holds for all T > 0:

||v2||2 γ,T ≤ (α̂nℓ̂p1||v1||γ,T + 2β̂κ2||v3||γ,T + α̂κ2||v4||γ,T)/(γ − l1) + u2, (6.25)

where κ2 = p1||(R)∞||2, l1 = max{|1 − ℓwTα|, |1 − ϑwTα|} and u2 = (||v20||2 + α̂ŝ(s̃)2θ supt=0,...,T ||∇Q(yt)||2)/(γ − l1).

Proof Notice that (R)∞ R = (R)∞ . Following from (6.13), we have

||(R)∞yt+1 − 1ny∗||2
= ||(R)∞(yt + Dβt(ht − ht−1) + Dβt+1(ht+1 − ht) − Dαzt + Dα(R)∞zt − Dα(R)∞zt) − 1ny∗||2
≤ ||(R)∞yt − (R)∞Dα(R)∞[S̃t]−1∇Q(yt) − 1ny∗||2
+ β̂κ2[||ht − ht−1|| + ||ht+1 − ht||]
+ α̂κ2||zt − (R)∞zt||. (6.26)

We now discuss the first term in the inequality of (6.26). Note that (R)∞ = 1n wT .
Through utilizing 1n wT Dα 1n wT = (wT α)1n wT , one obtains

||(R)∞yt − (R)∞Dα(R)∞[S̃t]−1∇Q(yt) − 1ny∗||2
≤ ||1n(wTyt − y∗ − (wTα)∇q(ȳt))||2 + (wTα)||1n∇q(ȳt) − 1nwT[S̃t]−1∇Q(yt)||2
= Λ1 + Λ2, (6.27)

where ∇q(ȳt) = 1nT∇Q(1nȳt) and ∇Q(1nȳt) = [∇q1(ȳt), . . . , ∇qn(ȳt)]T. Since the global function q is strongly convex and smooth (see Lemma 6.1), if 0 < wTα < 2/ℓ, Λ1 is bounded by

Λ1 ≤ l1√n||wTyt − y∗||2 = l1||(R)∞yt − 1ny∗||2, (6.28)

where l1 = max{|1 − ℓwTα|, |1 − ϑwTα|} (see reference [42] for a proof). Then,
Λ2 can be bounded in the following way:

Λ2 ≤ (wT α)||1n ∇q(ȳ t ) − 1n 1Tn ∇Q(y t )||2


+ (wT α)||1n 1Tn ∇Q(y t ) − 1n wT [S̃ t ]−1 ∇Q(y t )||2
=Λ3 + Λ4 . (6.29)

Since ∇q(ȳ t ) = 1Tn ∇Q(1n ȳ t ), it holds from Lemma 6.1 that

Λ3 ≤ α̂nℓ̂p1||(R)∞yt − yt||. (6.30)

Next, by employing the fact ||[S̃ t ]−1 − [S̃ ∞ ]−1 ||2 ≤ θ (s̃)2 (λ)t and the relationship
S ∞ [S̃ ∞ ]−1 = 1n 1Tn , we get

Λ4 = (wT α)||S ∞ [S̃ ∞ ]−1 ∇Q(y t ) − S ∞ [S̃ t ]−1 ∇Q(y t )||2


≤ α̂ ŝ(s̃)2 θ (λ)t ||∇Q(y t )||2 , (6.31)

where ŝ = supt ≥0 ||S t ||2 and s̃ = supt ≥0||[S̃ t ]−1 ||2 . Plugging (6.27)–(6.31) into
(6.26) yields

||v2t+1||2 ≤ l1||v2t||2 + α̂nℓ̂p1||v1t|| + β̂κ2[||v3t+1|| + ||v3t||]
+ α̂κ2||v4t|| + α̂ŝ(s̃)2θ(λ)t||∇Q(yt)||2. (6.32)

Here, we can identify the terms in (6.32) with the terms in (6.22). Hence, in order
to establish this lemma, we proceed as in the proof of Lemma 6.4 from (6.22). 
The following lemma provides the bound of the estimate difference ||v3||γ,T.
Lemma 6.6 Suppose that Assumptions 6.1 and 6.2 hold. If max{λ, 2p2β̂} < γ < 1, it holds that for all T > 0,

||v3||γ,T ≤ ((κ3 + α̂p2nℓ̂p1)||v1||γ,T + α̂p2nℓ̂||v2||2 γ,T + α̂p2p1||v4||γ,T)/(γ − 2p2β̂) + u3, (6.33)

where u3 = (α̂p2ŝ(s̃)2θ supt=0,...,T ||∇Q(yt)||2)/(γ − 2p2β̂) and κ3 = ||R − In||.
Proof Recalling that (R)∞ R = (R)∞ , we obtain from (6.13) that

||ht +1 − ht || = ||(R − In )y t + 2Dβt (ht − ht −1 ) − Dα zt ||


≤ ||(R − In )(y t − (R)∞ y t )||

+ 2p2 β̂||ht − ht −1 || + p2 α̂||zt ||2 , (6.34)

where the inequality in (6.34) is obtained from the fact (R − In )(In − (R)∞ ) =
R − In . Now, apply Lemma 6.3 to deduce that

||v3t+1|| ≤ 2p2β̂||v3t|| + (κ3 + α̂p2nℓ̂p1)||v1t|| + α̂p2nℓ̂||v2t||2
+ α̂p2p1||v4t|| + α̂p2ŝ(s̃)2θ(λ)t||∇Q(yt)||2. (6.35)

Similar to the procedure following (6.22), it suffices to derive the desired result. 
The next lemma establishes the inequality which bounds the error term ||v4 ||γ ,T
corresponding to gradient estimation.
Lemma 6.7 Suppose that Assumptions 6.1–6.3 hold. If max{λ, ρ} < γ < 1, for all
T > 0, we have the following estimate:

||v4||γ,T ≤ ((κ4 + 2β̂κ4)/(γ − ρ))||v3||γ,T + u4, (6.36)

where u4 = 2||In − (R)∞||p2(s̃)2θ supt=0,...,T ||∇Q(yt)||2/(γ − ρ) + ||v40||/(γ − ρ) and κ4 = ||In − (R)∞||p1p2ℓ̂s̃.
Proof It is immediately obtained from (6.13) that

||zt +1 − (R)∞ zt +1 ||
= ||(R)∞ (Rzt + [S̃ t +1 ]−1 ∇Q(y t +1 ) − [S̃ t ]−1 ∇Q(y t ))
− (Rzt + [S̃ t +1 ]−1 ∇Q(y t +1 ) − [S̃ t ]−1 ∇Q(y t ))||
≤ ||In − (R)∞ ||||[S̃ t +1]−1 ∇Q(y t +1 ) − [S̃ t ]−1 ∇Q(y t )||
+ ρ||zt − (R)∞ zt ||, (6.37)

where we employ the triangle inequality and the fact ||Ry − (R)∞ y|| ≤ ρ||y −
(R)∞ y|| to deduce the inequality. As for the first term of the inequality in (6.37),
we apply (6.13) to obtain

||[S̃t+1]−1∇Q(yt+1) − [S̃t]−1∇Q(yt)||
≤ ||[S̃t+1]−1∇Q(yt+1) − [S̃t+1]−1∇Q(yt)|| + ||[S̃t+1]−1∇Q(yt) − [S̃t]−1∇Q(yt)||
≤ p2ℓ̂s̃||yt+1 − yt||2 + 2p2(s̃)2θ(λ)t||∇Q(yt)||2
≤ p1p2ℓ̂s̃(1 + β̂)||ht+1 − ht|| + p1p2ℓ̂s̃β̂||ht − ht−1||
+ 2p2(s̃)2θ(λ)t||∇Q(yt)||2. (6.38)

Combining (6.37) and (6.38) leads to

||v4t +1 || ≤ ρ||v4t || + κ4 (1 + β̂)||v3t +1 || + κ4 β̂||v3t ||


+ 2||In − (R)∞ ||p2 (s̃)2 θ (λ)t ||∇Q(y t )||2 , (6.39)

which further derives

||v4t+1||/(γ)t+1 ≤ (ρ/γ)·||v4t||/(γ)t + κ4(1 + β̂)·||v3t+1||/(γ)t+1 + (κ4β̂/γ)·||v3t||/(γ)t
+ (2||In − (R)∞||p2(s̃)2θ||∇Q(yt)||2/γ)·(λ/γ)t. (6.40)
If max{λ, ρ} < γ < 1, we can conclude the desired result by taking supt =0,...,T on
both sides of (6.40) and rearranging the acquired expressions. 
With the aid of the auxiliary relationships, i.e., the above Lemmas 6.4–6.7,
the main convergence results of D-DLM are now established. For the sake of

convenience, we define wmin = mini∈V{wi} and κ5 = κ1nℓ̂p1. Then, the first result,
i.e., Theorem 6.8, is introduced as follows.
Theorem 6.8 Suppose that Assumptions 6.1–6.3 hold and that D-DLM (6.13) updates the sequences {xt}, {ht}, {yt}, {St}, and {zt}. Then, if 0 < wTα < 2/ℓ, we obtain the following linear inequality for all T > 0:

v γ,T ≤ Γ v γ,T + u, (6.41)

where v γ,T = [||v1||γ,T, ||v2||2 γ,T, ||v3||γ,T, ||v4||γ,T]T, u = [u1, u2, u3, u4]T, and the elements of the matrix Γ = [γij] ∈ R4×4 are given by

Γ = ⎡ 0                             α̂κ1nℓ̂/(γ − ρ − α̂κ5)   2β̂κ1/(γ − ρ − α̂κ5)   α̂κ1p1/(γ − ρ − α̂κ5) ⎤
    ⎢ α̂nℓ̂p1/(γ − l1)              0                        2β̂κ2/(γ − l1)         α̂κ2/(γ − l1)          ⎥
    ⎢ (κ3 + α̂p2nℓ̂p1)/(γ − 2p2β̂)  α̂p2nℓ̂/(γ − 2p2β̂)      0                      α̂p2p1/(γ − 2p2β̂)     ⎥
    ⎣ 0                             0                        (κ4 + 2β̂κ4)/(γ − ρ)   0                      ⎦

Assume in addition that the largest step-size satisfies

0 < α̂ < min{ 1/ℓ, η1(1 − ρ)/(κ5η1 + κ1nℓ̂η2 + κ1p1η4), (η3 − κ3η1)/(p2nℓ̂p1η1 + p2nℓ̂η2 + p2p1η4) }, (6.42)

and the maximum momentum coefficient satisfies

0 ≤ β̂ < min{ (η1(1 − ρ) − (κ5η1 + κ1nℓ̂η2 + κ1p1η4)α̂)/(2κ1η3), (η4 − η4ρ − κ4η3)/(2κ4η3),
(ϑwminη2 − nℓ̂p1η1 − κ2η4)α̂/(2κ2η3), (η3 − κ3η1 − (p2nℓ̂p1η1 + p2nℓ̂η2 + p2p1η4)α̂)/(2p2η3) }. (6.43)

Then the sequence {yt} converges to 1ny∗ at a linear rate of O((γ)t), where 0 < γ < 1 is a constant such that

γ = max{ λ, ρ + (2β̂κ1η3 + (κ5η1 + κ1nℓ̂η2 + κ1p1η4)α̂)/η1,
l1 + (2β̂κ2η3 + (nℓ̂p1η1 + κ2η4)α̂)/η2, ρ + (2β̂κ4η3 + κ4η3)/η4,
2p2β̂ + (κ3η1 + (p2nℓ̂p1η1 + p2nℓ̂η2 + p2p1η4)α̂)/η3 }, (6.44)
η3

where η1 , η2 , η3 , and η4 are positive constants such that

η1 > 0, η2 > (nℓ̂p1η1 + κ2η4)/(ϑwmin), η3 > κ3η1, η4 > κ4η3/(1 − ρ). (6.45)

Proof First, summarizing the results of Lemmas 6.4–6.7, we can conclude (6.41)
immediately. Next, we provide some sufficient conditions to make the spectral
radius of Γ , defined as ρ(Γ ), be strictly less than 1, i.e., ρ(Γ ) < 1. According to
Theorem 8.1.29 in [45], we know that, for a positive vector η = [η1 , . . . , η4 ]T ∈ R4 ,
if Γ η < η, then ρ(Γ ) < 1 holds. By the definition of Γ , it is deduced that inequality
Γ η < η is equivalent to


2β̂κ1η3 < η1γ − η1ρ − (κ5η1 + κ1nℓ̂η2 + κ1p1η4)α̂,
2β̂κ2η3 < η2γ − η2l1 − (nℓ̂p1η1 + κ2η4)α̂,
2p2β̂η3 < η3γ − κ3η1 − (p2nℓ̂p1η1 + p2nℓ̂η2 + p2p1η4)α̂,
2β̂κ4η3 < η4γ − η4ρ − κ4η3, (6.46)

which further implies that




γ > (2β̂κ1η3 + ρη1 + (κ5η1 + κ1nℓ̂η2 + κ1p1η4)α̂)/η1,
γ > (2β̂κ2η3 + η2l1 + (nℓ̂p1η1 + κ2η4)α̂)/η2,
γ > (2p2β̂η3 + κ3η1 + (p2nℓ̂p1η1 + p2nℓ̂η2 + p2p1η4)α̂)/η3,
γ > (2β̂κ4η3 + ρη4 + κ4η3)/η4. (6.47)

Recalling l1 in Lemma 6.5, if α̂ < 1/ℓ, it yields that l1 = 1 − ϑwTα ≤ 1 − ϑwminα̂. To ensure the positivity of β̂ (i.e., that the right-hand sides of (6.46) are positive), if 0 < γ < 1, (6.46) further gives
α̂ < η1(1 − ρ)/(κ5η1 + κ1nℓ̂η2 + κ1p1η4),
η2 > (nℓ̂p1η1 + κ2η4)/(ϑwmin),
α̂ < (η3 − κ3η1)/(p2nℓ̂p1η1 + p2nℓ̂η2 + p2p1η4), η3 > κ3η1,
η4 > κ4η3/(1 − ρ). (6.48)

Now, we are in a position to select the vector η = [η1, . . . , η4]T to ensure the solvability of α̂. Since ρ < 1, we first pick an arbitrary positive constant η1, then choose η3 and η4 in accordance with the third and fourth conditions in (6.48), respectively, and finally select η2 satisfying the second condition in (6.48). Hence, (6.48) yields the upper bound on the largest step-size α̂ in (6.42), considering the requirement that α̂ < 1/ℓ. If 0 < γ < 1, we then obtain the upper bound on the maximum momentum coefficient β̂ according to (6.46) and the upper bound on α̂.
Recalling that ∇Q(yt) = [∇q1(y1t), . . . , ∇qn(ynt)]T and ∇qi(yit) = xit − di, i ∈ V, it follows that ||∇Q(yt)||2 is uniformly bounded by a positive constant. Then all the
elements (u1 , u2 , u3 and u4 ) in vector u are uniformly bounded. Therefore, all the
conditions of Lemma 6.2 are completely satisfied. By Lemma 6.2, we can deduce
that the sequence {y t } converges to 1n y ∗ at a linear rate of O((γ )t ), where γ satisfies
(6.44). This finishes the proof. 
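The selection procedure in this proof (pick η1 freely, then η3 and η4, then η2, and finally α̂ and β̂, with γ taken slightly above the maximum in (6.44)) can be sketched numerically. Every problem constant below (ρ, l1, λ, the κi, nℓ̂, p1, p2, ϑwmin) is a placeholder rather than a quantity computed from an actual network.

```python
import numpy as np

# Placeholder problem constants (illustrative only, not from the chapter)
rho, l1, lam = 0.5, 0.8, 0.6
k1, k2, k3, k4 = 0.5, 0.5, 0.5, 0.5
n_lhat, p1, p2 = 1.0, 1.0, 1.0        # n*l_hat and norm-equivalence constants
k5 = k1 * n_lhat * p1                  # kappa_5 = kappa_1 * n * l_hat * p1
theta_wmin = 0.5                       # vartheta * w_min

# Select eta following (6.48): eta1 free, then eta3, eta4, finally eta2
eta1 = 1.0
eta3 = 2.0 * k3 * eta1
eta4 = 2.0 * k4 * eta3 / (1.0 - rho)
eta2 = 2.0 * (n_lhat * p1 * eta1 + k2 * eta4) / theta_wmin
eta = np.array([eta1, eta2, eta3, eta4])

# Step-size and momentum bounds in the spirit of (6.42)-(6.43)
d1 = k5 * eta1 + k1 * n_lhat * eta2 + k1 * p1 * eta4
d3 = p2 * n_lhat * p1 * eta1 + p2 * n_lhat * eta2 + p2 * p1 * eta4
alpha = 0.5 * min(eta1 * (1 - rho) / d1, (eta3 - k3 * eta1) / d3)
beta = 0.5 * min((eta1 * (1 - rho) - d1 * alpha) / (2 * k1 * eta3),
                 (eta4 - eta4 * rho - k4 * eta3) / (2 * k4 * eta3),
                 (theta_wmin * eta2 - n_lhat * p1 * eta1 - k2 * eta4) * alpha
                 / (2 * k2 * eta3),
                 (eta3 - k3 * eta1 - d3 * alpha) / (2 * p2 * eta3))

# gamma as in (6.44), inflated slightly so Gamma @ eta < eta holds strictly
gamma = max(lam,
            rho + (2 * beta * k1 * eta3 + d1 * alpha) / eta1,
            l1 + (2 * beta * k2 * eta3
                  + (n_lhat * p1 * eta1 + k2 * eta4) * alpha) / eta2,
            rho + (2 * beta * k4 * eta3 + k4 * eta3) / eta4,
            2 * p2 * beta + (k3 * eta1 + d3 * alpha) / eta3) + 0.05

Gamma = np.array([
    [0, k1 * n_lhat * alpha, 2 * beta * k1, k1 * p1 * alpha],
    [n_lhat * p1 * alpha, 0, 2 * beta * k2, k2 * alpha],
    [k3 + p2 * n_lhat * p1 * alpha, p2 * n_lhat * alpha, 0, p2 * p1 * alpha],
    [0, 0, k4 + 2 * beta * k4, 0]], dtype=float)
Gamma[0] /= gamma - rho - alpha * k5
Gamma[1] /= gamma - l1
Gamma[2] /= gamma - 2 * p2 * beta
Gamma[3] /= gamma - rho

spectral_radius = max(abs(np.linalg.eigvals(Gamma)))
print(gamma, spectral_radius)   # gamma < 1 and rho(Gamma) < 1
```

With these placeholders, Γη < η holds elementwise, which certifies ρ(Γ) < 1 by the cited result (Theorem 8.1.29 in [45]).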
Remark 6.9 It is worth emphasizing that η1 , η2 , η3 , and η4 in Theorem 6.8 are
tunable parameters, which only depend on the network topology and the cost
functions. Thus, the choices of the largest step-size α̂ and the maximum momentum
coefficient β̂ can be calculated without much effort as long as other parameters,
such as λ, ρ, etc., are properly selected. Furthermore, in order to design the step-
sizes and the momentum coefficients, some global parameters, such as ϑ, ℓ, ℓ̂, and
wmin , are needed. We note that the amount of preprocessing in calculating the global
parameters is substantially negligible compared with the worst-case running time of
D-DLM (see [46] for a specific analysis).
Based on Theorem 6.8, below we show that the sequence {x t } linearly converges
to the optimal solution to (6.1). Similar to many distributed Lagrangian methods
[22, 27, 28, 30], we accomplish this by relating the primal variables with Lagrangian
multipliers.
Theorem 6.10 Suppose that Assumptions 6.1–6.3 hold and that D-DLM (6.13) updates the sequences {xt}, {ht}, {yt}, {St}, and {zt}. If α̂ and β̂ satisfy
the conditions of Theorem 6.8, then the sequence {xt} converges to x∗ at a linear rate of O((γ)t/2).
Proof Here, the approach is similar to that in the proof of Theorem 3 in [30].
We briefly give the proof here for completeness. To demonstrate Theorem 6.10,
we ought to show the following relation of the primal variables and Lagrangian
multipliers:


Σi=1,...,n (μi/2)(xit − xi∗)2 ≤ Σi=1,...,n [∇qi(y∗)(yit − y∗) + (ℓi/2)(yit − y∗)2 + yit(xi∗ − di)]. (6.49)

Recall that x∗ = [x1∗, x2∗, . . . , xn∗]T is the primal optimal solution to (6.1). Then, we have that Σi=1,...,n (xi∗ − di) = 0. It also follows that each Lagrangian

multiplier yit , i ∈ V, converges to y ∗ (dual optimal solution to (6.10)) if the largest


step-size α̂ and the maximum momentum coefficient β̂ satisfy the corresponding
upper bounds in Theorem 6.8. Thus, the right-hand side of (6.49) goes to zero as
t → ∞. This immediately indicates that the sequence {xt} converges to x∗ at a linear rate of O((γ)t/2) if the sequence {yt} converges to 1ny∗ at a linear rate of
O((γ)t) given in Theorem 6.8. It remains to establish inequality (6.49). We note that various variants of (6.49) are shown in detail in the works [22, 27, 28, 30]. Hence, in order to establish inequality (6.49), we perform the remaining analysis according to the proofs given in [22, 27, 28, 30]. This completes the proof. 
Remark 6.11 Theorem 6.10 establishes that D-DLM linearly converges to the
global optimal solution provided that the largest step-size α̂ and the maximum
momentum coefficient β̂, respectively, obey the upper bounds given in Theorem 6.8.
Although most existing works (the distributed gradient tracking methods) [37, 47]
and our previous works [28, 38] adopted non-uniform step-sizes and converged at
a linear rate, compared with [28, 37, 38, 47], this chapter still has three advantages.
First, D-DLM extends the distributed gradient tracking method with heavy-ball
momentum and Nesterov momentum, which improves information exchange to
ensure faster convergence. Second, since the provided bounds on the largest step-
size α̂ in Theorem 6.8 only depend on the network topology and the cost functions,
each generator can choose a relatively wider step-size. This is in contrast to the
earlier works on non-uniform step-sizes within the framework of gradient tracking
[28, 37, 38, 47], whose bounds rely not only on the cost functions and the network topology
but also on the heterogeneity (||(In − W )α||2 /||W α||2 , W is the weight matrix in
[47] and α̂/α̃, α̃ = mini∈V {αi } in [28, 37, 38]) of the step-sizes. Besides, the analysis
has also shown that the algorithms in [28, 37, 38, 47] can linearly converge
to the optimal solution if and only if the heterogeneity and the largest step-size
are small. However, the largest step-size follows a bound which is a function of
the heterogeneity, and there is a trade-off between the tolerance of heterogeneity
and the largest step-size which can be achieved. Finally, the bound on non-uniform step-sizes in this chapter allows some (but not all) of the generators to adopt zero step-sizes, given that the largest step-size α̂ is positive and sufficiently small.
Remark 6.12 Recently, there have been many works devoted to the study of distributed methods with momentum terms, such as the method with heavy-ball
momentum [32] and method with Nesterov momentum [31, 35]. It should be noted
that the analysis method, i.e., the generalized small gain theorem [43] employed
in this chapter is significantly different from the method used in [31, 32, 35]. In
comparison with [32, 35], this chapter also establishes explicit estimates for the
convergence rate of D-DLM. It is further straightforward to conceive a time-varying
implementation of D-DLM over broadcast-based mechanism or random networks,
e.g., the related work in [34]. Asynchronous schemes may also be derived from the
methodologies in [43, 47]. In addition, it is also concluded from [48] that if D-DLM is employed for optimizing more complex problems, such as deep neural networks, the gradient of the dual function is usually calculated with various kinds of stochastic noise, which yields a stochastic version of D-DLM.

6.5 Numerical Examples

In this section, a variety of studies on EDP in smart grids are provided to verify the effectiveness of D-DLM and the correctness of the theoretical analysis. Here, all the simulations are carried out in MATLAB on an HP desktop with a 3.20 GHz Intel i7-8700 processor (6 cores, 12 threads) and 8 GB memory.

6.5.1 Case Study 1: EDP on IEEE 14-Bus Test Systems

First, we study the EDP on the IEEE 14-bus test system [22] as described in Fig. 6.1, where {1, 2, 3, 6, 8} are generator buses and {2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14} are load buses. Suppose that each generator i incurs a quadratic cost function, i.e., Ci(xi) = ai(xi)2 + bixi (known privately by generator i), where the generator parameters are summarized in Table 6.1 [26]. Note that the power generation is zero if a bus does not possess generators. The total demand is Σi=1,...,14 di = 380 MW, where the local demands on each load bus are d1 = 0 MW, d2 = 9 MW, d3 = 56

Fig. 6.1 The IEEE 14-bus test system



Table 6.1 Generator parameters in IEEE 14-bus test system

Bus    ai ($/MW^2)    bi ($/MW)    [ximin, ximax] (MW)
1      0.04           2.0          [0, 80]
2      0.03           3.0          [0, 90]
3      0.035          4.0          [0, 70]
6      0.03           4.0          [0, 70]
8      0.04           2.5          [0, 80]
MW, d4 = 55 MW, d5 = 27 MW, d6 = 27 MW, d7 = 0 MW, d8 = 0 MW, d9 = 8


MW, d10 = 24 MW, d11 = 53 MW, d12 = 46 MW, d13 = 16 MW, and d14 = 40
MW. The goal now is to minimize the total power generation cost of the generators while meeting the load demands.
Suppose that the communication between generators is represented by a directed and strongly connected circle network [44], and the weighting strategy rij = 1/|Niin|, ∀i, is utilized to construct the row-stochastic weights. The non-uniform step-sizes and the momentum coefficients are, respectively, selected as αi = 0.004θi and βit = 0.3θ̃it, t ≥ 0, where θi and θ̃it are uniformly distributed over the interval (0, 1). The simulation results are shown in Fig. 6.2. It is deduced from Fig. 6.2a and b that the optimal incremental cost is y∗ = 8.527 $/MW and the optimal power generations are x1∗ = 80 MW, x2∗ = 90 MW, x3∗ = 64.67 MW, x6∗ = 70 MW, and x8∗ = 75.33 MW. When yit and xit converge to the optimal solutions, the total generation satisfies the total demand, as depicted in Fig. 6.2c.
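The reported optimum can be cross-checked with a small centralized computation: for quadratic costs with box limits, the optimal dispatch is xi(y∗) = clip((y∗ − bi)/(2ai)) onto [ximin, ximax], where the optimal incremental cost y∗ balances supply and demand. This is only a verification aid, not the distributed D-DLM recursion.

```python
import numpy as np

# Case Study 1 data (Table 6.1): generators at buses 1, 2, 3, 6, 8
a = np.array([0.04, 0.03, 0.035, 0.03, 0.04])    # $/MW^2
b = np.array([2.0, 3.0, 4.0, 4.0, 2.5])          # $/MW
xmin = np.zeros(5)                                # MW
xmax = np.array([80.0, 90.0, 70.0, 70.0, 80.0])  # MW
demand = 380.0                                    # MW

def dispatch(y):
    """Optimal generation of each unit at incremental cost y."""
    return np.clip((y - b) / (2 * a), xmin, xmax)

lo, hi = 0.0, 50.0            # bisection on the incremental cost y
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if dispatch(mid).sum() < demand:
        lo = mid
    else:
        hi = mid

y_star = 0.5 * (lo + hi)
x_star = dispatch(y_star)
print(y_star)   # ~8.527 $/MW
print(x_star)   # ~[80, 90, 64.67, 70, 75.33] MW
```

The bisection reproduces y∗ ≈ 8.527 $/MW and the dispatch reported above, with the total generation matching the 380 MW demand.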

6.5.2 Case Study 2: EDP on IEEE 118-Bus Test Systems

In this case study, the EDP on the IEEE 118-bus test system [49] is considered to demonstrate the performance of D-DLM on a large-scale network. The IEEE 118-bus test system, as shown in Fig. 6.3 [49], contains 54 generators connected by bus lines, and the communication network is assumed to be directed and strongly connected. Each generator i has a quadratic cost function (known privately by generator i), i.e., Ci(xi) = ai(xi)2 + bixi + ci, where the coefficients ai ∈ [0.0024, 0.0697], bi ∈ [8.3391, 37.6968], and ci ∈ [6.78, 74.33] with units $/MW^2, $/MW, and $, respectively. Each xi is constrained to an interval [ximin, ximax] (MW), where ximax ∈ [150, 400] and ximin ∈ [5, 150]. Suppose that the total load required from the system is 4242 MW. In addition, the communication between generators adopts the bus data given in [22]. For convenience, we employ the same uniform weighting strategy as explained in Case Study 1. Moreover, the non-uniform step-sizes and the momentum coefficients are, respectively, selected as αi = 0.0002θi and βit = 0.5θ̃it, t ≥ 0.
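The 118-bus setup can be sketched in the same way as the centralized check above. Since the actual coefficients come from [49], the draws below are random values in the stated ranges (with a feasibility guard), intended only to illustrate the problem dimensions.

```python
import numpy as np

# Illustrative stand-in for the Case Study 2 instance: 54 generators
# with coefficients drawn uniformly from the stated ranges.  The true
# IEEE 118-bus data come from [49]; these draws only mimic them.
rng = np.random.default_rng(0)
n = 54
a = rng.uniform(0.0024, 0.0697, n)      # $/MW^2
b = rng.uniform(8.3391, 37.6968, n)     # $/MW
xmin = rng.uniform(5.0, 150.0, n)       # MW
xmax = rng.uniform(150.0, 400.0, n)     # MW

demand = 4242.0                          # MW, as in the chapter
# guard keeps this random instance feasible (the real data admit 4242 MW)
demand = min(max(demand, xmin.sum() + 1.0), xmax.sum() - 1.0)

def dispatch(y):
    """Optimal generation of each unit at incremental cost y."""
    return np.clip((y - b) / (2 * a), xmin, xmax)

lo, hi = 0.0, 200.0                      # bracket for the incremental cost
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if dispatch(mid).sum() < demand:
        lo = mid
    else:
        hi = mid

x_star = dispatch(0.5 * (lo + hi))
print(x_star.sum())                      # equals the (guarded) demand
```

Even at this scale, the centralized benchmark is immediate; the point of D-DLM is to reach the same dispatch using only local costs and directed communication.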
Then, numerical results are illustrated in Fig. 6.4, which demonstrates the
convergence of D-DLM for the IEEE 118-bus test system. It implies that D-DLM

Fig. 6.2 EDP on IEEE 14-bus test system: (a) power generation (MW); (b) incremental cost ($/MW); (c) power balance (MW)



Fig. 6.3 The IEEE 118-bus test system

successfully drives the variables to the optimal solutions within few iterations even
for this large-scale network.

6.5.3 Case Study 3: The Application to Dynamical EDPs

Considering that the demand is not always fixed in the practical operation of smart grids, this case study applies D-DLM to dynamic economic dispatch problems (EDPs), i.e., EDPs with time-varying demands. We simulate the performance of D-DLM utilizing the same IEEE 14-bus test system, cost functions at the generators, and other related parameters as described in Case Study 1. In addition, we divide the total iteration time into three equal time intervals and assume that the total demand in each time interval is different.
Then, numerical results are illustrated in Fig. 6.5. It shows that when the total demand changes, the generators alter their power generation accordingly to meet

Fig. 6.4 EDP on IEEE 118-bus test system: (a) power generation (MW); (b) incremental cost ($/MW); (c) power balance (MW)



the current total demand, and D-DLM successfully achieves the optimal power
generation after a short period of time.

6.5.4 Case Study 4: Comparison with Related Methods

Finally, D-DLM is compared with the existing centralized primal–dual method [40]
and distributed primal–dual method [30] in terms of the convergence performance,
convergence time, and computational complexity. To this end, we consider the following two scenarios:
(1) Comparison of convergence performance: in the first scenario, the convergence performance comparison is conducted on the IEEE 14-bus and 118-bus test systems, respectively, where the residual Et = log10 Σi=1,...,n ||xit − xi∗||, t ≥ 0, is applied as the comparison metric. The required parameters (row-stochastic weights, non-uniform step-sizes, etc.) correspond to Case Studies 1 and 2.
Figure 6.6 implies that D-DLM exhibits a linear convergence rate and that increasing the momentum coefficients (up to their upper bounds) improves the convergence rate in comparison with the applicable methods [30, 40] without momentum terms. In addition, the improvement in convergence rate brought by D-DLM is evident even in large-scale directed networks.
(2) Comparisons for convergence time and computational complexity: in the second
scenario, the convergence time and computational complexity of D-DLM and
the applicable methods [30, 40] are discussed on the IEEE 14-bus and 118-
bus test systems. Here, we measure the convergence time by the time it takes
the algorithm and the computational complexity by the number of calculations
required by the algorithm to achieve the desired level of residual E t = −15,
respectively.
Table 6.2 indicates that, in terms of convergence time, D-DLM outperforms the distributed method [30], since the two momentum terms added in D-DLM improve the convergence rate. Besides, although the centralized method requires less convergence time than the distributed methods, it is worth highlighting that if the distributed simulations were run in parallel (as in practice), the total time needed would be far less than the total time of the local optimizations running in sequence. Table 6.2 also shows that the computational complexity of D-DLM and the distributed method [30] increases very slowly with the number of buses, while that of the centralized method [40] increases rapidly.

Fig. 6.5 Dynamical EDPs on IEEE 14-bus test system: (a) power generation (MW); (b) incremental cost ($/MW); (c) power balance (MW)



Fig. 6.6 Performance comparison: (a) residual, IEEE 14-bus test system; (b) residual, IEEE 118-bus test system



Table 6.2 Convergence time and computational complexity

Algorithm types    Bus systems    Convergence time (s)    Computational complexity
D-DLM              14             0.0684                  1.704 × 10^4
                   118            3.1956                  2.106 × 10^5
Centralized        14             0.0156                  1.337 × 10^4
                   118            0.0693                  2.229 × 10^6
Distributed        14             0.0726                  1.568 × 10^4
                   118            4.8311                  2.619 × 10^5

6.6 Conclusion

In this chapter, we have considered the EDP in smart grids where generators
were designed to collectively minimize the total generation cost while satisfying
the expected load demands and preserving the limitations of generator capacity.
To solve EDP, a novel directed distributed Lagrangian momentum algorithm,
named D-DLM, has been presented and analyzed at length. D-DLM
extended the distributed gradient tracking method with heavy-ball momentum and
Nesterov momentum, guaranteed that generators selected non-uniform step-sizes
in a distributed manner and only required the weight matrix to be row-stochastic,
indicating that it was suitable for a directed network. In particular, the directed
network was assumed to be strongly connected. If the largest step-size and the
maximum momentum coefficient were subjected to some upper bounds (the bounds
relied only on the network topology and the cost functions), we have proven
that D-DLM linearly allocated the optimal dispatch at the expense of eigenvector
learning, supposing smooth and strongly convex cost functions. In addition, the
explicit estimation for the convergence rate of D-DLM has also been explored.
The theoretical analysis has been further verified by simulations. In the future
work, we will continue to consider a few of interesting problems (privacy masking,
utility maximization, etc.) in smart grids with the aid of D-DLM and study the
robustness of time-varying networks, packet dropout, latency, random link failures,
and transmission losses.

References

1. N. Heydaribeni, A. Anastasopoulos, Distributed mechanism design for network resource


allocation problems. IEEE Trans. Netw. Sci. Eng. 7(2), 621–636 (2020)
2. C. Li, X. Yu, T. Huang, X. He, Distributed optimal consensus over resource allocation network
and its application to dynamical economic dispatch. IEEE Trans. Neural Netw. Learn. Syst.
29(6), 2407–2418 (2018)
3. L. Wang, S. Cheng, Data-driven resource management for ultra-dense small cells: an affinity
propagation clustering approach. IEEE Trans. Netw. Sci. Eng. 6(3), 267–279 (2019)
4. Y. Kawamoto, H. Takagi, H. Nishiyama, N. Kato, Efficient resource allocation utilizing Q-
learning in multiple UA communications. IEEE Trans. Netw. Sci. Eng. 6(3), 293–302 (2019)

5. R. Wang, Q. Li, B. Zhang, L. Wang, Distributed consensus based algorithm for economic
dispatch in a microgrid. IEEE Trans. Smart Grid 10(4), 3630–3640 (2019)
6. T. Kim, S. Wright, D. Bienstock, S. Harnett, Analyzing vulnerability of power systems with
continuous optimization formulations. IEEE Trans. Netw. Sci. Eng. 3(3), 132–146 (2016)
7. S. D’Aronco, P. Frossard, Online resource inference in network utility maximization problems.
IEEE Trans. Netw. Sci. Eng. 6(3), 432–444 (2019)
8. N. Li, L. Chen, S. Low, Optimal demand response based on utility maximization in power
networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES).
https://doi.org/10.1109/PES.2011.6039082
9. N. Li, L. Chen, M. Dahleh, Demand response using linear supply function bidding. IEEE Trans.
Power Syst. 6(4), 1827–1838 (2015)
10. M. Ogura, V. Preciado, Stability of spreading processes over time-varying large-scale networks.
IEEE Trans. Netw. Sci. Eng. 3(1), 44–57 (2016)
11. Z. Zhang, M. Chow, Convergence analysis of the incremental cost consensus algorithm under
different communication network topologies in a smart grid. IEEE Trans. Power Syst. 27(4),
1761–1768 (2012)
12. S. Yang, S. Tan, J. Xu, Consensus based approach for economic dispatch problem in a smart
grid. IEEE Trans. Power Syst. 28(4), 4416–4426 (2013)
13. S. Kar, G. Hug, Distributed robust economic dispatch in power systems: a consensus +
innovations approach, in 2012 IEEE Power and Energy Society General Meeting. https://doi.
org/10.1109/PESGM.2012.6345156
14. S. Xie, W. Zhong, K. Xie, R. Yu, Y. Zhang, Fair energy scheduling for vehicle-to-grid networks
using adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 27(8), 1697–
1707 (2016)
15. Y. Li, H. Zhang, X. Liang, B. Huang, Event-triggered based distributed cooperative energy
management for multi-energy systems. IEEE Trans. Ind. Inform. 15(4), 2008–2022 (2019)
16. B. Huang, L. Liu, H. Zhang, Y. Li, Q. Sun, Distributed Optimal economic dispatch for
microgrids considering communication delays. IEEE Trans. Syst. Man, Cybern. Syst. 49(8),
1634–1642 (2019)
17. G. Binetti, A. Davoudi, F. Lewis, D. Naso, B. Turchiano, Distributed consensus-based
economic dispatch with transmission losses. IEEE Trans. Power Syst. 29(4), 1711–1720 (2014)
18. Z. Ni, S. Paul, A multistage game in smart grid security: a reinforcement learning solution.
IEEE Trans. Neural Netw. Learn. Syst. 30(9), 2684–2695 (2019)
19. A. Nedic, A. Olshevsky, W. Shi, Improved convergence rates for distributed resource allocation
(2017). arXiv preprint arXiv:1706.05441
20. T. Doan, C. Beck, Distributed Lagrangian methods for network resource allocation, in 2017
IEEE Conference on Control Technology and Applications (CCTA). https://doi.org/10.1109/
CCTA.2017.8062536
21. Z. Tang, D. Hill, T. Liu, A novel consensus-based economic dispatch for microgrids. IEEE
Trans. Smart Grid 9(4), 3920–3922 (2018)
22. T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic dis-
patch/demand response in power systems (2016). arXiv preprint arXiv:1609.06660
23. H. Pourbabak, J. Luo, T. Chen, W. Su, A novel consensus-based distributed algorithm for
economic dispatch based on local estimation of power mismatch. IEEE Trans. Smart Grid
9(6), 5930–5942 (2018)
24. Q. Li, D. Gao, L. Cheng, F. Zhang, W. Yan, Fully distributed DC optimal power flow
based on distributed economic dispatch and distributed state estimation (2019). arXiv preprint
arXiv:1903.01128
25. Z. Deng, X. Nian, Distributed generalized Nash equilibrium seeking algorithm design for
aggregative games over weight-balanced digraphs. IEEE Trans. Neural Netw. Learn. Syst.
30(3), 695–706 (2019)
26. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for
economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron.
64(6), 5095–5106 (2017)
182 6 Accelerated Algorithms for Distributed Economic Dispatch

27. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained
optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst.
50(7), 2612–2622 (2020)
28. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Control Theory
Appl. 13(17), 2800–2810 (2019)
29. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under
time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
30. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a
general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–
248 (2019)
31. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control
65(6), 2566–2581 (2020)
32. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
33. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom.
Control 59(5), 1131–1146 (2014)
34. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
35. R. Xin, D. Jakovetic, U. Khan, Distributed Nesterov gradient methods over arbitrary graphs.
IEEE Signal Process. Lett. 26(8), 1247–1251 (2018)
36. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, Berlin,
2013)
37. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization
with uncoordinated step-sizes, in 2017 American Control Conference (ACC). https://doi.org/
10.23919/ACC.2017.7963560
38. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
39. D. Nunez, J. Cortes, Distributed online convex optimization over jointly connected digraphs.
IEEE Trans. Netw. Sci. Eng. 1(1), 23–37 (2014)
40. D. Bertsekas, Nonlinear Programming, 2nd edn. (Athena Scientific, Cambridge, 1999)
41. R. Xin, C. Xi, U. Khan, FROST-Fast row-stochastic optimization with uncoordinated step-
sizes. EURASIP J. Advanc. Signal Process. 2019(1), 1–14 (2019)
42. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with
row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018)
43. Y. Tian, Y. Sun, G. Scutari, Achieving linear convergence in distributed asynchronous multi-
agent optimization. IEEE Trans. Autom. Control 65(12), 5264–5279 (2020)
44. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE
Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020)
45. R. Horn, C. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
46. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
47. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over
stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018)
48. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for
convex and non-convex optimization (2016). arXiv preprint arXiv:1604.03257
49. Y. Fu, M. Shahidehpour, Z. Li, AC contingency dispatch based on security-constrained unit
commitment. IEEE Trans. Power Syst. 21(2), 897–908 (2006)
Chapter 7
Primal–Dual Algorithms for Distributed
Economic Dispatch

Abstract In this chapter, we study a distributed primal–dual gradient algorithm over a sequence of time-varying general directed networks for the economic dispatch problem in smart grids, where each node only knows its own local convex objective function and the decision variables are restricted by a coupling linear constraint and individual box constraints. Assuming that the communication network sequence is uniformly strongly connected, the algorithm employs a column-stochastic mixing matrix and a fixed step-size to drive all nodes asymptotically to a common global optimal solution. Under the standard assumptions that the objective functions are strongly convex and smooth, we show that the algorithm steers the entire network to an optimal solution of the convex optimization problem at a geometric rate, provided that the step-size does not exceed an explicit upper bound. We also give an explicit analysis of the convergence rate of the proposed optimization algorithm. Simulations on economic dispatch problems and demand response problems in power systems illustrate the effectiveness of the proposed algorithm.

Keywords Distributed convex optimization · Multi-node systems · Time-varying directed networks · Primal–dual algorithm · Geometric convergence

7.1 Introduction

During recent years, multi-node systems have attracted increasing attention in the fields of distributed sensor networks, multi-robot cooperation, UAV formation flight, missile joint attack operations, etc. Many difficulties still exist in the control and optimization of multi-node systems due to the complexity of node dynamics, network structures, and actual target tasks. As one of the most important research topics in the field of multi-node systems, the distributed optimization problem has attracted strong research interest from various scientific disciplines. This
is mainly due to its broad applications in engineering, including resource allocation in peer-to-peer communication networks [1], formation control for multiple autonomous vehicles [2], and distributed data fusion, information processing, and decision-making in wireless sensor networks [3–5]. The distributed optimization framework not only avoids the need to build long-distance communication systems or data fusion centers but also provides a better load balance for the network.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_7
Among the existing literature, the (sub)gradient descent algorithm [6, 7], the
primal–dual (sub)gradient algorithm [8], the fast (sub)gradient descent algorithm
[9], and the (sub)gradient-push descent algorithm [10] were extensively proposed
to resolve distributed optimization problems. Nedic et al. [6] incorporated average consensus approaches into (sub)gradient methods to handle a multi-node unconstrained convex optimization problem. Theoretical analysis showed that the (sub)gradient descent algorithm in [6] converges at a rate of O(1/√t) for convex, Lipschitz, and possibly non-smooth objective functions.
This coincides with the convergence rate of the centralized (sub)gradient descent
algorithm. Then, Zhu et al. [8] devised two distributed primal–dual (sub)gradient
algorithms to tackle a convex optimization problem where the nodes are restricted
to a global inequality constraint, a global equality constraint, and a global constraint
set. Nedic et al. [10] proposed a (sub)gradient-push descent algorithm that could
find the exact optimal solution even without the knowledge of the number of nodes
or the network sequence to implement the assumption that the objective function
is (sub)gradient boundedness. The√(sub)gradient-push algorithm was proved to be
convergent with a rate of O(ln t/ t) by employing a diminishing step-size. This
series of work has been extended to distributed optimization under various realistic
conditions, such as stochastic (sub)gradient errors [11], directed [12] or random
communication network [13], linear scaling in network size [14], heterogeneous
local constraints [15], and asynchrony and delays [16], to name just a few. Although these algorithms can solve different kinds of optimization problems, they are usually slow and still need a diminishing step-size to reach the optimal solution, even if the objective functions are differentiable and strongly convex [8–25]. Besides, the abovementioned algorithms all require the assumption of bounded (sub)gradients to attain the exact optimal solution, which is another drawback. At the expense of inexact convergence (converging only to a vicinity of the optimal solution), the methods described above can be accelerated to a rate of O(1/t) by utilizing a constant step-size. However, this is not the ultimate goal of solving the optimization problem. To address such issues,
Xu et al. [24] studied a new augmented distributed gradient method (Aug-DGM) with uncoordinated constant step-sizes over general networks by employing the so-called adapt-then-combine (ATC) scheme. Shi et al. [25] removed the steady-state error by introducing a difference structure into the distributed gradient descent algorithm, thereby obtaining a novel exact distributed first-order algorithm (EXTRA). The algorithm achieves a convergence rate of O(1/t) for convex objective functions and a linear rate of O(C^{-t}) (where C > 1 is some constant) for strongly convex objective functions.
Considering the large application domains of future smart grids, distributed algorithms [26, 27] have been broadly employed in recent years to promote energy optimization efficiency in smart grids. He et al. [28] proposed two second-order continuous-time algorithms to exactly solve the economic power dispatch problem and showed that their convergence rate is faster than that of first-order continuous-time algorithms.
introduced in [29] to solve economic dispatch problems (EDPs) on directed fixed
networks in smart grids. This line of work has been extended to a variety of realistic
scenarios for distributed optimization. Li et al. [30] investigated a novel distributed
event-triggered optimization algorithm to address the economic dispatch problems
in smart grids. By running two consensus protocols in parallel, Binetti et al. [31] established a distributed consensus-based protocol for the optimization with transmission losses.
The recent works [35, 36] are the most relevant to ours. Nedic et al. [35] concentrated on the analysis of distributed optimization problems with a coordinated step-size over time-varying undirected/directed networks. The algorithm in [35] is capable of driving the whole network to converge to a global and consensual minimizer at a geometric rate under strong convexity and smoothness assumptions. Doan et al. [36] further took a coupling linear constraint and individual box constraints into consideration. Under the strong convexity and smoothness assumptions on the objective functions, Doan et al. conducted an explicit and detailed analysis of the convergence rate on an undirected network. It is noteworthy that [35] did not study the constrained optimization problem, while [36] did not provide a detailed analysis of the constrained optimization problem over general, time-varying directed network topologies. Our work is also linked to a distributed optimization algorithm for network resource allocation [32]. However, in [32], the network was required to be undirected and the weight matrix to be doubly stochastic, which is quite restrictive in real applications. Other related works are [33, 34], where the demand response problems (DRPs) in power networks have been considered. Regrettably, these approaches fail to update the Lagrangian multiplier in a distributed way.
To sum up, when each node in the network is subjected to certain constraints, the problem of distributed constrained optimization over time-varying general directed networks, which we study in this work, remains largely open. Therefore, we develop and analyze a fully distributed primal–dual optimization algorithm for the problem with both a coupling linear constraint and individual box constraints. Specifically, compared with centralized approaches, the proposed algorithm is more adaptable and practical owing to its robustness to the variability of renewable resources and its flexibility with respect to the dynamic topology of networks. In general, the main contributions of this chapter can be structured in the following four aspects:
(i) The push-sum protocol and a gradient tracking technique are incorporated into the distributed primal–dual optimization algorithm. This generalizes the work of [35], which neglected the constraints that each node faces in practical scenarios; moreover, it offers a wider selection of step-sizes compared with most existing distributed optimization algorithms, such as [8–23].
(ii) We study a distributed constrained optimization problem over time-varying general directed networks, which is a main advantage compared with the work in [36]. Moreover, the underlying networks are only assumed to be uniformly jointly strongly connected, which is considerably weaker than requiring each network to be strongly connected.
(iii) Unlike the centralized methods proposed in [33, 34], the proposed algorithm is fully distributed: each node successfully estimates the Lagrangian multiplier based only on local interactions (in a distributed manner) and updates its primal variable by applying this estimate.
(iv) The proposed algorithm achieves a geometric convergence rate as long as the step-size is smaller than an explicit upper bound, and no positive lower bound on the step-size is required, when the objective functions satisfy the strong convexity and smoothness assumptions. We also provide an explicit analysis of the convergence rate of the proposed algorithm.

7.2 Preliminaries

7.2.1 Notation

If not stated otherwise, the vectors mentioned in this chapter are column vectors. Let R, R^N, and R^{N×N} denote the set of real numbers, the set of N-dimensional real column vectors, and the set of N × N real matrices, respectively. Let I_N denote the N-dimensional identity matrix. The symbol 1 denotes the all-ones column vector of appropriate dimension. Given a matrix W, W^T and W^{-1} (when W is invertible) denote its transpose and inverse, respectively. The symbol ⟨·, ·⟩ denotes the inner product of two vectors. For a vector x ∈ R^N, x̄ = (1/N)1^T x denotes the average of its entries, and its consensus violation is written as x̆ = x − (1/N)11^T x = (I − J)x = J̆x, where J = (1/N)11^T and J̆ = I − J are two symmetric matrices. Given a vector x ∈ R^N, the standard Euclidean norm is ||x|| = √(x^T x) and the infinity norm is ||x||_∞. The J̆-weighted (semi-)norm is denoted by ||x||_J̆ = √⟨x, J̆x⟩; since J̆ = J̆^T J̆, we have ||x||_J̆ = ||J̆x||. Let ||W|| denote the spectral norm of a matrix W and ∇f(x) the gradient of f at x. For a matrix W, we write W_ij or [W]_ij to denote its (i, j)th entry.
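As a quick sanity check of the notation above (an illustrative sketch, not part of the original text; NumPy is assumed), the following snippet verifies that the consensus violation x̆ has zero average and that ||x||_J̆ = ||J̆x||.

```python
import numpy as np

N = 5
rng = np.random.default_rng(0)
x = rng.normal(size=N)

J = np.ones((N, N)) / N          # averaging matrix J = (1/N)11^T
J_breve = np.eye(N) - J          # symmetric projector J' = I - J

x_bar = x.mean()                 # average (1/N)1^T x
x_breve = J_breve @ x            # consensus violation x - (1/N)11^T x

# J' is a symmetric projector, so J' = J'^T J' and ||x||_{J'} = ||J' x||
seminorm_a = np.sqrt(x @ (J_breve @ x))
seminorm_b = np.linalg.norm(x_breve)
assert np.isclose(seminorm_a, seminorm_b)
assert np.isclose(x_breve.mean(), 0.0)   # the violation has zero average
```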

7.2.2 Model of Optimization Problem

Consider a network of N nodes labeled by V = {1, 2, . . . , N}, in which each node can only share information with its own neighbors via local communication. The goal of the multi-node group is to collectively solve the following distributed economic dispatch problem in smart grids:

min_{x∈R^N} f(x) = (1/N)∑_{i=1}^N f_i(x_i),  s.t.  (1/N)∑_{i=1}^N x_i = P,  ℓ_i ≤ x_i ≤ u_i,   (7.1)

∀i = 1, . . . , N, where x = [x_1, x_2, . . . , x_N]^T. For each node i, f_i : R → R is the local convex objective function. Assume that each node i maintains a variable x_i ∈ R and knows the function f_i privately. Throughout the chapter, we consider the condition that the average of all nodes' variables equals a fixed positive constant, i.e., (1/N)∑_{i=1}^N x_i = P, while each node's variable is restricted to an interval, i.e., ℓ_i ≤ x_i ≤ u_i, i = 1, . . . , N. We denote the equality constraint in (7.1) as the coupling constraint P, i.e., P = {x ∈ R^N | (1/N)∑_{i=1}^N x_i = P}, and the individual constraints in (7.1) as the box constraints X_i, i.e., X_i = {x̃ ∈ R | ℓ_i ≤ x̃ ≤ u_i}, ∀i = 1, . . . , N. For all nodes, we use X = X_1 × X_2 × . . . × X_N to denote their Cartesian product. Moreover, we denote by S = X ∩ P the feasible set of the optimization problem (7.1). In this chapter, we denote by x* = [x_1*, . . . , x_N*]^T the optimal solution of problem (7.1).


Remark 7.1 (Economic Dispatch Problems) EDPs are studied over a network of N generators. The goal of the generators is to minimize the total cost incurred, subject to individual capacity constraints, while seeking an optimal power generation that satisfies the average load of the network. Here we index the generators as 1, . . . , N. Based on the optimization model (7.1), we denote by x_i the power generated at generator i. Upon generating x_i units of power, generator i incurs a convex cost f_i(x_i). Thus, the average cost f of the network equals the average of the costs at the generators, i.e., f(x) = (1/N)∑_{i=1}^N f_i(x_i). It is indispensable that the average power generated at the nodes meets the average load of the network, represented by a positive constant P, i.e., P = (1/N)∑_{i=1}^N x_i. In the EDPs, we focus not only on the power balance at each individual generator but also on the balance of the entire network. Finally, we assume that each generator i can only generate a limited amount of power, represented by the box constraints in (7.1), i.e., ℓ_i ≤ x_i ≤ u_i, ∀i = 1, . . . , N.
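To make Remark 7.1 concrete, the following sketch (not from the original text) solves such an EDP centrally: with quadratic costs f_i(x) = a_i x^2 + b_i x, the minimizer of f_i(x) + wx over [ℓ_i, u_i] has a closed form, and the scalar multiplier w of the coupling constraint can be found by bisection. All generator data below are hypothetical.

```python
import numpy as np

# Hypothetical 4-generator instance: f_i(x) = a_i x^2 + b_i x on [l_i, u_i],
# average demand P with mean(l) <= P <= mean(u).
a = np.array([0.5, 0.8, 1.0, 0.6])
b = np.array([2.0, 1.5, 3.0, 2.5])
l = np.array([0.0, 0.0, 0.0, 0.0])
u = np.array([8.0, 6.0, 5.0, 7.0])
P = 3.0

def x_of_w(w):
    # minimizer of f_i(x) + w x over [l_i, u_i] (closed form for quadratics)
    return np.clip(-(b + w) / (2 * a), l, u)

def dual_grad(w):
    # (1/N) sum_i x_i(w) - P, nonincreasing in w
    return x_of_w(w).mean() - P

lo, hi = -100.0, 100.0                    # bracket containing the optimal w
for _ in range(100):                      # bisection on the scalar multiplier
    mid = 0.5 * (lo + hi)
    if dual_grad(mid) > 0:                # too much generation: raise the price w
        lo = mid
    else:
        hi = mid
w_star = 0.5 * (lo + hi)
x_star = x_of_w(w_star)
print(w_star, x_star, x_star.mean())      # x_star.mean() is close to P
```

Since each x_i(w) is nonincreasing in w, the averaged generation crosses P exactly once, which justifies the bisection.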

7.2.3 Communication Network

In this chapter, we consider a group of N nodes communicating over a general weighted and directed network G(k) = {V, E(k), A(k)} at time k ≥ 0, where V = {1, 2, . . . , N} is the set of nodes and E(k) ⊆ V × V is the set of directed edges. The weighted adjacency matrix of G(k) is denoted by A(k) = [A_ij(k)] ∈ R^{N×N} with non-negative elements A_ij(k). A directed edge (j, i) ∈ E(k) indicates that node j can send information to node i at time k ≥ 0. The in-neighbor set and out-neighbor set of node i at time k are denoted by N_i^in(k) = {j ∈ V | (j, i) ∈ E(k)} and N_i^out(k) = {j ∈ V | (i, j) ∈ E(k)}, respectively. A directed path in the directed network G from node j to node i is a sequence of connected edges (j, i_1), (i_1, i_2), . . . , (i_m, i) with distinct nodes i_k, k = 1, 2, . . . , m. A directed network is strongly connected if and only if, for any two distinct nodes j and i in V, there exists a directed path from node j to node i. The following two standard assumptions on the communication network are adopted.
Assumption 7.1 ([6]) The time-varying general directed network sequence G(k) is B_0-strongly connected. Namely, there exists a positive integer B_0 such that the general directed network with vertex set V and edge set E_{B_0}(k) = ∪_{s=kB_0}^{(k+1)B_0−1} E(s) is strongly connected for any k ≥ 0.
Remark 7.2 Under Assumption 7.1, through repeated communications with neighbors, all nodes can repeatedly exchange information with each other across the network sequence G(k). In particular, this assumption is considerably weaker than requiring each G(k) to be strongly connected for all k ≥ 0.
Assumption 7.2 ([6]) For any k = 0, 1, . . . , the mixing matrix C(k) = [C_ij(k)] ∈ R^{N×N} is defined as

C_ij(k) = 1/(d_j^out(k) + 1)  if j ∈ N_i^in(k) or j = i,  and  C_ij(k) = 0  otherwise,

where d_j^out(k) = |N_j^out(k)| is the out-degree of node j at time k ≥ 0 (N_j^out(k) = {i ∈ V | (j, i) ∈ E(k)}). It is then clear that C(k) is a column-stochastic matrix, i.e., ∑_{i=1}^N C_ij(k) = 1 for any j ∈ V and k ≥ 0.
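For concreteness (an illustrative sketch, not from the original text; NumPy assumed), the mixing matrix of Assumption 7.2 can be assembled from a directed edge list, where an edge (j, i) means that j sends to i; the digraph below is made up.

```python
import numpy as np

N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # made-up strongly connected digraph

out_deg = np.zeros(N, dtype=int)
for j, i in edges:
    out_deg[j] += 1

C = np.zeros((N, N))
for j in range(N):
    C[j, j] = 1.0 / (out_deg[j] + 1)               # self-weight (the "+1")
for j, i in edges:
    C[i, j] = 1.0 / (out_deg[j] + 1)               # j splits its mass equally

assert np.allclose(C.sum(axis=0), 1.0)             # column-stochastic: columns sum to 1
```

Column j has exactly d_j^out(k) + 1 nonzero entries, each equal to 1/(d_j^out(k) + 1), so each column sums to one by construction.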

Moreover, to facilitate the analysis of the optimization algorithm, we will make


the following two assumptions.
Assumption 7.3 ([36]) For each node i = 1, . . . , N, its local objective function f_i is differentiable everywhere and has Lipschitz continuous gradients. Namely, there exists a constant σ_i ∈ (0, +∞) such that

||∇f_i(x) − ∇f_i(y)|| ≤ σ_i||x − y||  for any x, y ∈ R,

where we use σ̂ = min_i{σ_i} in the subsequent analysis.


Assumption 7.4 ([36]) For each node i = 1, . . . , N, its objective function f_i is proper, closed, and strongly convex with strong convexity parameter μ_i. Namely, for any x, y ∈ R, we have

f_i(x) ≥ f_i(y) + ⟨∇f_i(y), x − y⟩ + (μ_i/2)||x − y||^2,

where μ_i ∈ (0, +∞), and we employ μ̂ = min_i{μ_i} in the subsequent analysis.
7.3 Algorithm Development

Building on the previous sections, we first analyze the dual problem of problem (7.1) and then design our main algorithm, namely the distributed primal–dual gradient algorithm, to handle the optimization of the dual problem. At the end of this section, we give some lemmas that support the convergence analysis of the algorithm.

7.3.1 Dual Problem

Consider the Lagrangian function L : X × R → R of problem (7.1) defined as

L(x, w) = (1/N)∑_{i=1}^N f_i(x_i) + w((1/N)∑_{i=1}^N x_i − P),   (7.2)

where w denotes the Lagrangian multiplier associated with the coupling constraint in (7.1). Given a proper, closed, and convex function f : X → R, we let f* be its conjugate, i.e., f*(y) = sup_{x∈X}{y^T x − f(x)}, y ∈ R^N. For w ∈ R, the dual function d : R → R of problem (7.1) can be written as

d(w) = min_{x∈X} L(x, w)
     = min_{x∈X} (1/N)∑_{i=1}^N f_i(x_i) + w((1/N)∑_{i=1}^N x_i − P)
     = −wP + (1/N)∑_{i=1}^N (−sup_{x_i∈X_i}{−wx_i − f_i(x_i)})
     = (1/N)∑_{i=1}^N (−f_i*(−w) − wP)
     = (1/N)∑_{i=1}^N d_i(w),   (7.3)

where d_i(w) = −f_i*(−w) − wP. The dual problem of (7.1), therefore, is given as

max_{w∈R} d(w),   (7.4)

where its gradient is obtained as

∇d(w) = (1/N)∑_{i=1}^N x_i − P.   (7.5)
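The dual gradient formula (7.5) can be checked numerically. The sketch below (illustrative data, not from the original text; NumPy assumed) evaluates d(w) for quadratic local costs on a box and compares the closed-form gradient with a central finite difference.

```python
import numpy as np

# Quadratic local costs f_i(x) = a_i x^2 + b_i x on a common box [l, u];
# all numbers are illustrative.
a = np.array([0.5, 1.0, 0.7])
b = np.array([1.0, -2.0, 0.5])
l, u, P = -2.0, 2.0, 0.3

def d(w):
    x = np.clip(-(b + w) / (2 * a), l, u)   # minimizer of f_i(x) + w x on the box
    f = a * x**2 + b * x
    return np.mean(f) + w * (np.mean(x) - P)   # L(x(w), w) = d(w)

def grad_d(w):
    x = np.clip(-(b + w) / (2 * a), l, u)
    return np.mean(x) - P                      # formula (7.5)

w0, eps = 0.4, 1e-6
fd = (d(w0 + eps) - d(w0 - eps)) / (2 * eps)   # central finite difference
assert abs(fd - grad_d(w0)) < 1e-4
```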
According to the analysis of the primal–dual gradient method in [36], the update of the Lagrangian multiplier depends on information from every node, which prevents a distributed implementation of the algorithm. This shortcoming is reflected in [33, 34], where a centralized step is required to update w and send it back to the nodes. The objective of this chapter, in contrast, is to solve the dual problem in a distributed manner, in which each node need not know the information of all other nodes but only shares its own state with its neighbors at each step. Before giving the distributed primal–dual gradient algorithm, we conclude this part by considering some important properties of the dual function (7.3). The following three lemmas show some basic properties of the conjugate of a convex function f.
Lemma 7.3 ([37]) Let f be a proper, closed, and convex function and denote f ∗ as
its conjugate. Then, we have that f ∗ is also proper, closed, and convex. Moreover,
we also have f ∗∗ = f .
Lemma 7.4 ([38, 39]) For a proper, lower semi-continuous, convex function f :
dom(f ) → R and a positive value χ, the following two properties are equivalent:
(a) f ∗ is strongly convex with constant χ; (b) f is differentiable, and the gradient
∇f is Lipschitz continuous with constant 1/χ.
Building on the above two lemmas, we state the following lemma, which is needed in the analysis of the distributed primal–dual gradient algorithm.
Lemma 7.5 ([36]) Let Assumptions 7.3 and 7.4 hold. Then the conjugate fi∗ of fi
satisfies: (i) fi∗ is strongly convex with constant 1/σi ; (ii) fi∗ is differentiable, and
the gradient ∇fi∗ is Lipschitz continuous with constant 1/μi .
Denote by w* the optimal solution of the Lagrangian dual problem. The following definition of a saddle point plays a critical role in the subsequent analysis.
Definition 7.6 (Saddle Point [8]) Consider the Lagrangian function L : X × R → R. A pair of vectors (x*, w*) is called a saddle point of L(x, w) over the set X × R if L(x*, w) ≤ L(x*, w*) ≤ L(x, w*) holds for all x ∈ X and w ∈ R.

7.3.2 Distributed Primal–Dual Gradient Algorithm

Since w* is the solution of the dual problem (7.4), we can design a distributed algorithm to tackle the dual problem (7.4), which is equivalent to problem (7.1). Moreover, the dual problem (7.4) can be recast as the following minimization problem:

min_{w∈R} q(w) = (1/N)∑_{i=1}^N q_i(w),   (7.6)
where q_i is defined as q_i(w) = f_i*(−w) + wP. We are therefore concerned with the minimization problem (7.6), since it has the same form as the problem studied in [35]. It is important to note that problems (7.6) and (7.4) share the same set of solutions. In our algorithm, each node i maintains five variables x_i(k), λ_i(k), v_i(k), w_i(k), and β_i(k) at each time instant k = 0, 1, . . ., where λ_i(k), v_i(k), β_i(k) are three auxiliary variables and x_i(k), w_i(k) are two key variables used to estimate the optimal solutions x_i* and w*, respectively. Here, recall that x* = [x_1*, . . . , x_N*]^T is the primal optimal solution of (7.1) and w* is the dual optimal solution of (7.6). The initialization of our distributed primal–dual gradient algorithm is x_i(0) ∈ X_i, w_i(0) = λ_i(0) ∈ R, β_i(0) = ∇q_i(w_i(0)) ∈ R, and v_i(0) = 1 for all i = 1, . . . , N. Then, each node i updates at each time instant k ≥ 0 according to the following rules:
x_i(k+1) ∈ arg min_{x_i∈X_i} {f_i(x_i) + w_i(k)x_i},
λ_i(k+1) = ∑_{j∈N_i^in(k)} C_ij(k)(λ_j(k) − αβ_j(k)),
v_i(k+1) = ∑_{j∈N_i^in(k)} C_ij(k)v_j(k),
w_i(k+1) = λ_i(k+1)/v_i(k+1),
β_i(k+1) = ∑_{j∈N_i^in(k)} C_ij(k)β_j(k) + (∇q_i(w_i(k+1)) − ∇q_i(w_i(k))),   (7.7)

where α > 0 is a fixed step-size, and ∇q_i(w_i(k)) is the gradient of node i's objective function q_i(w) at w = w_i(k). According to the definition of q_i(w), we obtain

∇q_i(w_i(k)) = −x_i(k) + P.   (7.8)

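The node-wise updates (7.7) can be sketched numerically as follows (not part of the original text; NumPy assumed): quadratic local costs f_i(x) = a_i x^2 + b_i x on a common box so that the primal argmin step has a closed form, and a directed ring whose orientation alternates with k as a simple uniformly strongly connected sequence. All problem data are made up for illustration, and the step-size is chosen small by hand rather than from the theoretical bound.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
a = rng.uniform(0.5, 1.5, N)
b = rng.uniform(-1.0, 1.0, N)
l, u, P = -5.0, 5.0, 0.7
alpha = 0.05                                   # fixed step-size

def mixing(k):
    # column-stochastic C(k) of Assumption 7.2 (each node has out-degree 1)
    C = np.zeros((N, N))
    for j in range(N):
        i = (j + 1) % N if k % 2 == 0 else (j - 1) % N
        C[j, j] = 0.5
        C[i, j] = 0.5
    return C

def x_of(w):
    # closed-form primal step: arg min over [l, u] of f_i(x) + w x
    return np.clip(-(b + w) / (2.0 * a), l, u)

def grad_q(w):
    # formula (7.8), evaluated elementwise at each node's own w_i
    return -x_of(w) + P

w = np.zeros(N)                                # w_i(0) = lambda_i(0) = 0
lam = np.zeros(N)
v = np.ones(N)                                 # v_i(0) = 1
beta = grad_q(w)                               # beta_i(0) = grad q_i(w_i(0))

for k in range(3000):
    C = mixing(k)
    g_old = grad_q(w)
    lam = C @ (lam - alpha * beta)             # lambda update of (7.7)
    v = C @ v                                  # push-sum weights
    w = lam / v                                # ratio gives the multiplier estimate
    beta = C @ beta + grad_q(w) - g_old        # gradient tracking

x = x_of(w)
print(np.ptp(w), abs(x.mean() - P))            # both should be close to zero
```

The column-stochastic mixing preserves the sums of λ(k) and β(k), so β(k) tracks the average gradient while the push-sum ratio λ_i(k)/v_i(k) removes the imbalance of the directed network.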
Let x(k) = [x_1(k), . . . , x_N(k)]^T ∈ R^N, w(k) = [w_1(k), . . . , w_N(k)]^T ∈ R^N, λ(k) = [λ_1(k), . . . , λ_N(k)]^T ∈ R^N, β(k) = [β_1(k), . . . , β_N(k)]^T ∈ R^N, v(k) = [v_1(k), . . . , v_N(k)]^T ∈ R^N, h(k) = [h_1(k), . . . , h_N(k)]^T ∈ R^N, and ∇q(w(k)) = [∇q_1(w_1(k)), . . . , ∇q_N(w_N(k))]^T ∈ R^N. With formula (7.8), algorithm (7.7) can be rewritten in the compact matrix form (after a simple algebraic transformation) as follows:

x(k+1) ∈ arg min_{x∈X} ∑_{i=1}^N (f_i(x_i) + w_i(k)x_i),
v(k+1) = C(k)v(k),
V(k+1) = diag{v(k+1)},
w(k+1) = R(k)(w(k) − αh(k)),
h(k+1) = R(k)h(k) + (V(k+1))^{-1}(∇q(w(k+1)) − ∇q(w(k))),   (7.9)

where R(k) = (V(k+1))^{-1}C(k)V(k) and h(k) = (V(k))^{-1}β(k). Note that, under Assumptions 7.1 and 7.2, each matrix V(k) is invertible, and we denote ||V^{-1}||_max = sup_{k≥0} ||V^{-1}(k)||, which is bounded. Also, one can show that R(k) is in fact a row-stochastic matrix (see Lemma 4 of [10]).
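The row-stochasticity of R(k) follows directly from v(k+1) = C(k)v(k): row i of R(k) sums to (1/v_i(k+1))∑_j C_ij(k)v_j(k) = 1. A numerical check (random illustrative data, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
C = rng.uniform(0.1, 1.0, (N, N))
C /= C.sum(axis=0, keepdims=True)        # make C column-stochastic

v = rng.uniform(0.5, 2.0, N)             # current push-sum weights v(k) > 0
v_next = C @ v                           # v(k+1) = C(k) v(k)

R = np.diag(1.0 / v_next) @ C @ np.diag(v)   # R(k) = V(k+1)^{-1} C(k) V(k)
assert np.allclose(R.sum(axis=1), 1.0)       # each row of R(k) sums to one
```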
In the sequel, we use the symbols C_B(k) = C(k)C(k−1) · · · C(k+1−B) for any k = 0, 1, . . . and B = 0, 1, . . . with B ≤ k + 1, with the exceptional cases C_0(k) = I for any k and C_B(k) = I for any needed k < 0. An important property of the norm of (I − (1/N)11^T)R_B(k) is shown in the following lemma, which follows from the properties of the distributed primal–dual gradient algorithm and can be obtained from [33, 34].
Lemma 7.7 ([35]) Let Assumptions 7.1 and 7.2 hold, and let B be an integer satisfying B ≥ B_0. Then, for any k = B−1, B, . . . and any vectors φ and ϕ with appropriate dimensions, if φ = R_B(k)ϕ, we have ||φ̆|| ≤ δ||ϕ̆||, where R_B(k) = (V(k+1))^{-1}C_B(k)V(k+1−B), δ = Q_1(1 − τ^{NB_0})^{(B−1)/(NB_0)} < 1, Q_1 = 2N(1 + τ^{−NB_0})/(1 − τ^{NB_0}), and τ = 1/(2 + NB_0).

7.4 Convergence Analysis

In this section, we first introduce the small gain theorem [35], followed by some
supporting lemmas. Then, we present the main results of this chapter.

7.4.1 Small Gain Theorem

To establish the geometric convergence of the distributed primal–dual gradient algorithm, we first present a preliminary tool, namely the small gain theorem. Before stating it, we introduce some notation. Let s_i denote the infinite sequence s_i = (s_i(0), s_i(1), s_i(2), . . .), where s_i(k) ∈ R^N, ∀i. Furthermore, for any positive integer K, denote

||s_i||^{γ,K} = max_{k=0,...,K} (1/γ^k)||s_i(k)||,   (7.10)

||s_i||^γ = sup_{k≥0} (1/γ^k)||s_i(k)||,   (7.11)

where the parameter γ ∈ (0, 1) will serve as the geometric convergence parameter in our later analysis.
Based on the above definitions, the geometric convergence analysis of the sequence {||w(k) − 1w*||} is mainly built on the small gain theorem, which we state next.
Lemma 7.8 (The Small Gain Theorem [35]) Suppose that s_1, . . . , s_m are sequences such that, for all K > 0 and each i = 1, . . . , m (with s_{m+1} understood as s_1),

||s_{i+1}||^{γ,K} ≤ η_i||s_i||^{γ,K} + ω_i,   (7.12)

where each ω_i is some non-negative constant, and the gains (non-negative constants) η_1, . . . , η_m satisfy

η_1η_2 · · · η_m < 1;   (7.13)

then, we obtain

||s_1||^γ ≤ (1/(1 − η_1η_2 · · · η_m))(η_mη_{m−1} · · · η_2ω_1 + · · · + η_mω_{m−1} + ω_m).   (7.14)

Remark 7.9 The original version of the small gain theorem has been extensively studied and widely used in control theory [40]. Besides, since the small gain theorem involves a cyclic structure, s_1 → s_2 → · · · → s_m → s_1, one can obtain similar bounds for ||s_i||^γ, ∀i.
Lemma 7.10 ([35]) For any sequence si and any constant γ ∈ (0, 1), if ||si||γ is bounded, then ||si(k)|| converges to 0 at a geometric rate O(γ^k).
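The bound (7.14) is mechanical to evaluate. The helper below (our own illustrative code, not from [35]) computes the right-hand side of (7.14) for general m and checks it against the m = 2 fixed-point argument: if ||s2|| ≤ η1||s1|| + ω1 and ||s1|| ≤ η2||s2|| + ω2 with η1η2 < 1, then ||s1|| ≤ (η2ω1 + ω2)/(1 − η1η2).

```python
import math

def small_gain_bound(etas, omegas):
    """Right-hand side of (7.14) for cyclic gains eta_1..eta_m, offsets omega_1..omega_m."""
    m = len(etas)
    loop_gain = math.prod(etas)
    assert loop_gain < 1, "small gain condition (7.13) violated"
    # omega_i carries the product eta_{i+1} * ... * eta_m (empty product = 1)
    numer = sum(omegas[i] * math.prod(etas[i + 1:]) for i in range(m))
    return numer / (1 - loop_gain)

# m = 2: the bound is (eta_2*omega_1 + omega_2)/(1 - eta_1*eta_2) = 3,
# the fixed point of a = 0.4*(0.5*a + 1) + 2.
print(small_gain_bound([0.5, 0.4], [1.0, 2.0]))
```

The same routine applies verbatim to the four-link cycle (7.17) used below.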
Before carrying out the main proof, we define the following additional symbols used in the subsequent analysis:

$$y(k) = w(k) - \mathbf{1}w^* \quad\text{for any } k = 0, 1, \dots, \qquad (7.15)$$

$$z(k) = \nabla q(w(k)) - \nabla q(w(k-1)) \quad\text{for any } k = 1, 2, \dots, \qquad (7.16)$$

where w∗ ∈ R is the optimal solution of problem (7.6), and we initialize z(0) = 0. In view of the small gain theorem, the geometric convergence of ||w(k) − 1w∗|| will be achieved by applying Lemma 7.8 to the following circle of arrows:

$$y \;\xrightarrow{\;4\;}\; z \;\xrightarrow{\;3\;}\; \breve{h} \;\xrightarrow{\;2\;}\; \breve{w} \;\xrightarrow{\;1\;}\; y. \qquad (7.17)$$
Remark 7.11 Recall that y is the difference between the estimate and the global
optimizer of the Lagrangian multiplier w, z is the successive difference of gradients,
h̆ is the consensus violation of the estimation of gradient average across nodes, and
w̆ is the consensus violation of the Lagrangian multiplier. In a sense, as long as y
is small, the error z is small since the gradients are close to zero in the vicinity of
the optimal Lagrangian multiplier. Then, as long as z is small, h̆ is small by the
framework of the algorithm (7.9). Furthermore, as long as h̆ is small, the framework
of algorithm (7.9) means that w̆ is close to zero. Finally, as long as w̆ is close to
zero, the algorithm will drive y to zero and thus achieve the whole cycle.
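Algorithm (7.9) itself is introduced earlier in the chapter and is not restated here. The sketch below only illustrates the generic gradient-tracking step underlying the z → h̆ arrow, on a hypothetical 4-node ring with Metropolis weights: the update h ← Wh + z_new − z_old preserves the network average, so the tracker h follows the average of the local signals z.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-node ring with Metropolis weights (doubly stochastic).
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

z = rng.standard_normal(4)   # stand-ins for the local signals z_i(k)
h = z.copy()                 # tracker initialization h_i(0) = z_i(0)

for k in range(30):
    z_new = z + 0.9**k * rng.standard_normal(4)  # slowly settling signals
    h = W @ h + z_new - z                        # gradient-tracking update
    z = z_new
    # conservation: the network average of h equals the network average of z
    assert abs(h.mean() - z.mean()) < 1e-10

print(np.max(np.abs(h - z.mean())))  # consensus violation of the tracker
```

The conservation property is what makes h̆ small once z settles, which is exactly the content of the z → h̆ arrow.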
194 7 Primal–Dual Algorithms for Distributed Economic Dispatch

Remark 7.12 After establishing each arrow, we will apply Lemma 7.8 to conclude our main results. Specifically, we must verify the prerequisite that the quantities ||y||γ,K, ||z||γ,K, ||h̆||γ,K, and ||w̆||γ,K are bounded. We can then conclude that all quantities in the above circle of arrows converge at a geometric rate O(γ^k). In addition, in order to apply the small gain theorem in the following analysis, the product of the gains ηi must be less than one, which is achieved by choosing an appropriate step-size α.

Now, we are ready to establish each arrow in the circle (7.17). The following series of lemmas is based mainly on the ideas of [35].

7.4.2 Supporting Lemmas

Before introducing Lemma 7.13, we make some definitions used only in this lemma, to distinguish its notation from that of our distributed optimization problem, algorithm, and analysis. We restate problem (7.6) with different notation as

$$\min_{p\in\mathbb{R}^N}\; g(p) = \frac{1}{N}\sum_{i=1}^{N} g_i(p), \qquad (7.18)$$

where each function gi satisfies Assumptions 7.3–7.4. Consider the following inexact gradient descent on the function g:

$$p_{k+1} = p_k - \theta\,\frac{1}{N}\sum_{i=1}^{N}\nabla g_i(s_i(k)), \qquad (7.19)$$

where θ is the step-size. Let p∗ be the global optimal solution of g, and define

rk = ||pk − p∗ || for any k = 0, 1, . . . (7.20)

On the basis of the above definitions, we introduce Lemma 7.13.

Lemma 7.13 ([35]) Suppose that $\sqrt{1-\frac{\theta\tilde{\sigma}\vartheta}{\vartheta+1}} \le \gamma < 1$ and $0 < \theta \le \min\left\{\frac{\vartheta+1}{\tilde{\sigma}\vartheta}, \frac{1}{\tilde{\mu}(1+\upsilon)}\right\}$, where $\vartheta > 0$, $\upsilon > 0$, $\tilde{\sigma} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sigma_i}$, and $\tilde{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mu_i$. Let Assumptions 7.3–7.4 hold for every function gi. For the problem (7.18), consider the sequences {rk} and {pk} updated by the inexact gradient descent algorithm (7.19). Then, for K = 0, 1, . . ., we have

$$\|r\|_{\gamma,K} \le \sqrt{2}\,r_0 + \frac{1}{\gamma N}\left(\sqrt{\frac{1+\upsilon}{\upsilon\hat{\mu}\tilde{\sigma}}} + \frac{\vartheta}{\hat{\sigma}\tilde{\sigma}}\right)\sum_{i=1}^{N}\|p - s_i\|_{\gamma,K}, \qquad (7.21)$$

where we recall that $\hat{\sigma} = \min_i\{\sigma_i\}$ and $\hat{\mu} = \min_i\{\mu_i\}$.
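Lemma 7.13 concerns the inexact gradient step (7.19), where each gradient is queried at a perturbed point s_i(k) instead of p_k. A minimal numerical sketch (quadratic g_i and geometrically decaying perturbations; all constants are illustrative, not those of the lemma):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
mu = rng.uniform(1.0, 2.0, N)        # g_i(p) = mu_i/2 * (p - c_i)^2
c = rng.uniform(-1.0, 1.0, N)
p_star = (mu * c).sum() / mu.sum()   # minimizer of the average cost g

theta = 0.5 / mu.mean()              # illustrative step-size
p = 5.0
r0 = abs(p - p_star)
for k in range(100):
    s = p + 0.8**k * rng.standard_normal(N)  # perturbed query points s_i(k)
    grad = np.mean(mu * (s - c))             # (1/N) sum_i grad g_i(s_i(k))
    p -= theta * grad

print(abs(p - p_star) / r0)  # far below 1: geometric shrinkage despite inexactness
```

Because the perturbations ||p − s_i|| themselves decay geometrically, the right-hand side of (7.21) stays bounded and r_k inherits the geometric rate.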


In the following, we establish the first arrow of the circle (7.17), which is grounded on the error bound of the inexact gradient descent (IGD) algorithm in Lemma 7.13.

Lemma 7.14 (||w̆||γ,K → ||y||γ,K [35]) Let Assumptions 7.1, 7.3–7.4 hold. In addition, suppose that the step-size α and the parameter γ satisfy

$$\sqrt{1-\frac{\alpha\tilde{\sigma}\vartheta}{\vartheta+1}} \le \gamma < 1 \quad\text{and}\quad 0 < \alpha \le \min\left\{\frac{\vartheta+1}{\tilde{\sigma}\vartheta}, \frac{1}{\tilde{\mu}(1+\upsilon)}\right\},$$

where $\vartheta > 0$, $\upsilon > 0$ are adjustable parameters, $\tilde{\sigma} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sigma_i}$, and $\tilde{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mu_i$. Then, for all K = 0, 1, . . ., we have

$$\|y\|_{\gamma,K} \le (1+\sqrt{N})\left(1+\frac{\sqrt{N}}{\gamma}\left(\sqrt{\frac{1+\upsilon}{\upsilon\hat{\mu}\tilde{\sigma}}} + \frac{\vartheta}{\hat{\sigma}\tilde{\sigma}}\right)\right)\|\breve{w}\|_{\gamma,K} + 2\sqrt{N}\,\|\bar{w}(0)-w^*\|. \qquad (7.22)$$

Next, to establish the second arrow in the circle (7.17), we give the following Lemma 7.15.

Lemma 7.15 (||h̆||γ,K → ||w̆||γ,K [35]) Let Assumptions 7.1–7.2 hold, and let γ be a positive constant in $(\sqrt[B]{\delta}, 1)$, where δ and B are the constants given in Lemma 7.7. Then, we get

$$\|\breve{w}\|_{\gamma,K} \le \frac{\alpha}{\gamma^B-\delta}\left(\delta + Q_1\frac{\gamma-\gamma^B}{1-\gamma}\right)\|\breve{h}\|_{\gamma,K} + \frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{w}(t-1)\|, \qquad (7.23)$$

for all K = 0, 1, . . ., where Q1 is the constant defined in Lemma 7.7.


Now we present the third arrow of the circle (7.17) in the following lemma.

Lemma 7.16 (||z||γ,K → ||h̆||γ,K [35]) Let Assumptions 7.1–7.4 hold, let the parameter δ be given in Lemma 7.7, and let γ be a positive constant in $(\sqrt[B]{\delta}, 1)$. Then, we obtain for all K = 0, 1, . . . that

$$\|\breve{h}\|_{\gamma,K} \le Q_1\|V^{-1}\|_{\max}\frac{\gamma(1-\gamma^B)}{(\gamma^B-\delta)(1-\gamma)}\|z\|_{\gamma,K} + \frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{h}(t-1)\|. \qquad (7.24)$$

The last arrow in the circle (7.17), established in the following lemma, is a simple consequence of the fact that the gradient of q is Lipschitz continuous with parameter 1/μ̂.

Lemma 7.17 (||y||γ,K → ||z||γ,K [35]) Under Assumption 7.3, we obtain that for all K = 0, 1, . . . and any 0 < γ < 1,

$$\|z\|_{\gamma,K} \le \frac{\gamma+1}{\gamma\hat{\mu}}\,\|y\|_{\gamma,K}.$$

7.4.3 Main Results

Based on the circle (7.17) established in the previous section, we now demonstrate a major result on the geometric convergence of (x(k), w(k)) to a saddle point (x∗, 1w∗) of the Lagrangian function L for the distributed primal–dual gradient algorithm over time-varying general directed networks. In what follows, we first prove that the sequence {w(k)} updated by the distributed primal–dual gradient algorithm (7.9) converges to 1w∗ at a global geometric rate O(γ^k) with the help of the small gain theorem. Then, on the basis of the geometric convergence of the sequence {w(k)}, we prove that the sequence {x(k)} converges to x∗ at a global geometric rate O((γ/2)^k). Moreover, an explicit convergence rate γ for the distributed primal–dual gradient algorithm is given along the way.
Theorem 7.18 Let Assumptions 7.1–7.4 hold, and let the conditions of Lemmas 7.13–7.17 be satisfied. Let B be a large enough integer constant such that $\delta = Q_1(1-\tau^{NB_0})^{\frac{B-1}{NB_0}} < 1$. Then, for any step-size $\alpha \in \left(0, \frac{1.5(1-\delta)^2}{\tilde{\sigma}J}\right]$, the sequence {w(k)} generated by the distributed primal–dual gradient algorithm converges to 1w∗ at a global geometric rate O(γ^k), where γ ∈ (0, 1) is given by

$$\gamma = \sqrt[2B]{1-\frac{\alpha\tilde{\sigma}}{1.5}} \quad\text{if}\quad \alpha \in \left(0,\; \frac{1.5\left(\sqrt{J^2+(1-\delta^2)J}-\delta J\right)^2}{\tilde{\sigma}J(1+J)^2}\right],$$

and

$$\gamma = \sqrt[B]{\delta+\sqrt{\frac{\alpha\tilde{\sigma}J}{1.5}}} \quad\text{if}\quad \alpha \in \left(\frac{1.5\left(\sqrt{J^2+(1-\delta^2)J}-\delta J\right)^2}{\tilde{\sigma}J(1+J)^2},\; \frac{1.5(1-\delta)^2}{\tilde{\sigma}J}\right],$$

where $J = 3Q_1\|V^{-1}\|_{\max}\kappa B(\delta+Q_1(B-1))(1+\sqrt{N})(1+4\sqrt{N}\kappa)$ and $\kappa = 1/(\tilde{\sigma}\hat{\mu})$.
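The two branches of the rate in Theorem 7.18 meet continuously at the threshold step-size; the helper below (illustrative) evaluates γ(α) in both regimes and checks continuity and that γ stays below one inside the admissible range.

```python
import math

def rate_gamma(alpha, delta, J, sigma_t, B):
    """Convergence rate gamma(alpha) of Theorem 7.18 (sigma_t plays the role
    of the averaged constant written sigma-tilde in the text)."""
    s = math.sqrt(J**2 + (1 - delta**2) * J)
    alpha_mid = 1.5 * (s - delta * J) ** 2 / (sigma_t * J * (1 + J) ** 2)
    if alpha <= alpha_mid:
        return (1 - alpha * sigma_t / 1.5) ** (1 / (2 * B))
    return (delta + math.sqrt(alpha * sigma_t * J / 1.5)) ** (1 / B)

# Illustrative constants, not derived from a particular network.
delta, J, sigma_t, B = 0.5, 2.0, 1.0, 2
s = math.sqrt(J**2 + (1 - delta**2) * J)
a_mid = 1.5 * (s - delta * J) ** 2 / (sigma_t * J * (1 + J) ** 2)
a_max = 1.5 * (1 - delta) ** 2 / (sigma_t * J)

g1 = rate_gamma(a_mid, delta, J, sigma_t, B)                # end of first branch
g2 = rate_gamma(a_mid * (1 + 1e-12), delta, J, sigma_t, B)  # start of second
print(g1, g2)  # the two branches agree at the threshold
```

Smaller α gives a rate nearer 1 (slower) on the first branch; past the threshold the network-dependent constant J dominates.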
Proof It follows immediately from Lemmas 7.13–7.17 that:

(i) $\|y\|_{\gamma,K} \le \eta_1\|\breve{w}\|_{\gamma,K} + \omega_1$, where $\eta_1 = (1+\sqrt{N})\left(1+\frac{\sqrt{N}}{\gamma}\left(\sqrt{\frac{1+\upsilon}{\upsilon\hat{\mu}\tilde{\sigma}}}+\frac{\vartheta}{\hat{\sigma}\tilde{\sigma}}\right)\right)$ and $\omega_1 = 2\sqrt{N}\|\bar{w}(0)-w^*\|$.
(ii) $\|\breve{w}\|_{\gamma,K} \le \eta_2\|\breve{h}\|_{\gamma,K} + \omega_2$, where $\eta_2 = \frac{\alpha}{\gamma^B-\delta}\left(\delta+Q_1\frac{\gamma-\gamma^B}{1-\gamma}\right)$ and $\omega_2 = \frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{w}(t-1)\|$.
(iii) $\|\breve{h}\|_{\gamma,K} \le \eta_3\|z\|_{\gamma,K} + \omega_3$, where $\eta_3 = Q_1\|V^{-1}\|_{\max}\frac{\gamma(1-\gamma^B)}{(\gamma^B-\delta)(1-\gamma)}$ and $\omega_3 = \frac{\gamma^B}{\gamma^B-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\breve{h}(t-1)\|$.
(iv) $\|z\|_{\gamma,K} \le \eta_4\|y\|_{\gamma,K} + \omega_4$, where $\eta_4 = \frac{\gamma+1}{\gamma\hat{\mu}}$ and $\omega_4 = 0$.

Moreover, to use the small gain theorem, we need to choose an appropriate step-size α such that

$$\eta_1\eta_2\eta_3\eta_4 < 1, \qquad (7.25)$$

which means that


$$(1+\sqrt{N})\left(1+\frac{\sqrt{N}}{\gamma}\left(\sqrt{\frac{1+\upsilon}{\upsilon\hat{\mu}\tilde{\sigma}}}+\frac{\vartheta}{\hat{\sigma}\tilde{\sigma}}\right)\right)\cdot\frac{\alpha}{\gamma^B-\delta}\left(\delta+Q_1\frac{\gamma-\gamma^B}{1-\gamma}\right)\cdot Q_1\|V^{-1}\|_{\max}\frac{\gamma(1-\gamma^B)}{(\gamma^B-\delta)(1-\gamma)}\cdot\frac{\gamma+1}{\gamma\hat{\mu}} < 1, \qquad (7.26)$$

where ϑ > 0, υ > 0, and the other parameter constraints appearing in Lemmas 7.7, 7.14, and 7.15 are as follows:

$$0 < \alpha \le \min\left\{\frac{\vartheta+1}{\tilde{\sigma}\vartheta}, \frac{1}{\tilde{\mu}(1+\upsilon)}\right\}, \qquad (7.27)$$

$$\sqrt{1-\frac{\alpha\tilde{\sigma}\vartheta}{\vartheta+1}} \le \gamma < 1, \qquad (7.28)$$

$$\sqrt[B]{\delta} < \gamma < 1. \qquad (7.29)$$

Noting that ϑ > 0 and υ > 0, it follows from (7.27)–(7.29) that

$$\max\left\{\sqrt{1-\frac{\alpha\tilde{\sigma}\vartheta}{\vartheta+1}},\; \sqrt[B]{\delta}\right\} \le \gamma < 1. \qquad (7.30)$$

Define two specific values for the parameters, ϑ = 2σ̂/μ̂ and υ = 1, in Lemma 7.14 to obtain a concrete (possibly loose) bound on the convergence rate. Furthermore, by using 0.5 ≤ γ < 1 and (1 − γ^B)/(1 − γ) ≤ B, from relation (7.26) we obtain

$$\alpha \le \frac{\hat{\mu}(\gamma^B-\delta)^2}{2Q_1\|V^{-1}\|_{\max}B(\delta+Q_1(B-1))(1+\sqrt{N})(1+4\sqrt{N}\kappa)}, \qquad (7.31)$$

where $\kappa = 1/(\tilde{\sigma}\hat{\mu})$ is the condition number. Noting that (ϑ + 1)/ϑ ≥ 1.5, it follows from (7.28) that

$$\frac{1.5(1-\gamma^2)}{\tilde{\sigma}} \le \alpha. \qquad (7.32)$$

Then, using relations (7.31) and (7.32), we can show that there exists a γ ∈ $(\sqrt[B]{\delta}, 1)$ such that

$$\left[\frac{1.5(1-\gamma^2)}{\tilde{\sigma}},\; \frac{1.5(\gamma^B-\delta)^2}{\tilde{\sigma}J}\right] \ne \emptyset, \qquad (7.33)$$

where $J = 3Q_1\|V^{-1}\|_{\max}\kappa B(\delta+Q_1(B-1))(1+\sqrt{N})(1+4\sqrt{N}\kappa)$. Here, we study a smaller interval by enlarging the left endpoint in (7.33). Since B ≥ 1, we will prove that

$$\left[\frac{1.5(1-\gamma^{2B})}{\tilde{\sigma}},\; \frac{1.5(\gamma^B-\delta)^2}{\tilde{\sigma}J}\right] \ne \emptyset. \qquad (7.34)$$

Note that as γ increases from $\sqrt[B]{\delta}$ to 1, the left endpoint of (7.34) decreases monotonically from $\frac{1.5(1-\delta^2)}{\tilde{\sigma}}$ to 0, while the right endpoint increases monotonically from 0 to $\frac{1.5(1-\delta)^2}{\tilde{\sigma}J}$. Thus, as γ varies from $\sqrt[B]{\delta}$ to 1, the critical value at which the interval in (7.34) becomes valid is attained when γ is given by

$$\gamma = \sqrt[B]{\frac{\delta+\sqrt{J^2+(1-\delta^2)J}}{1+J}}. \qquad (7.35)$$

We denote the value obtained in (7.35) by γmid; the corresponding step-size is $\alpha_{\mathrm{mid}} = \frac{1.5(\sqrt{J^2+(1-\delta^2)J}-\delta J)^2}{\tilde{\sigma}J(1+J)^2}$. Thus, if we choose $\alpha \in (0, \alpha_{\mathrm{mid}}]$, we can set $\gamma = \sqrt[2B]{1-\frac{\alpha\tilde{\sigma}}{1.5}}$, while for $\alpha \in \left(\alpha_{\mathrm{mid}}, \frac{1.5(1-\delta)^2}{\tilde{\sigma}J}\right]$, we can set $\gamma = \sqrt[B]{\delta+\sqrt{\frac{\alpha\tilde{\sigma}J}{1.5}}}$. The proof is thus completed. □
On the basis of the geometric convergence of the sequence {w(k)} in Theorem 7.18, we next demonstrate that the sequence {x(k)} converges to x∗ at a global geometric rate O((γ/2)^k).

Theorem 7.19 Suppose Assumptions 7.1–7.4 hold. Let the sequences {x(k)}, {λ(k)}, {v(k)}, {w(k)}, and {β(k)} be updated by the distributed primal–dual gradient algorithm (7.9). Choose α as in Theorem 7.18 such that the sequence {w(k)} converges to 1w∗ at a geometric rate O(γ^k). Then the sequence {x(k)} converges to x∗ at a geometric rate O((γ/2)^k).

Proof To demonstrate the result in Theorem 7.19, we first show that for any k ≥ 0, the following inequality holds:

$$\sum_{i=1}^{N}\frac{\mu_i}{2}\left(x_i(k+1)-x_i^*\right)^2 \le \sum_{i=1}^{N}\left(\nabla q_i(w^*)(w_i(k)-w^*) + \frac{1}{2\mu_i}(w_i(k)-w^*)^2 + w_i(k)(x_i^*-P)\right). \qquad (7.36)$$

Noticing that $x^* \in S$, we have $\sum_{i=1}^{N}(x_i^*-P) = 0$. Moreover, since $w_i(k) \to w^*$ as $k \to \infty$, the last term on the right-hand side of (7.36) satisfies $\sum_{i=1}^{N} w_i(k)(x_i^*-P) \to w^*\sum_{i=1}^{N}(x_i^*-P) = 0$, and the first two terms vanish as well; hence the right-hand side of (7.36) tends to zero as $k \to \infty$. Therefore, since the sequence {w(k)} converges to 1w∗ at a geometric rate O(γ^k), we immediately conclude that the sequence {x(k)} converges to x∗ at a geometric rate O((γ/2)^k). Based on the above analysis, the main remaining task is to derive inequality (7.36).
To show inequality (7.36), we first define the local Lagrangian function $L_i : X_i \times \mathbb{R} \to \mathbb{R}$ as

$$L_i(x_i, w_i) = f_i(x_i) + w_i(x_i - P). \qquad (7.37)$$
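For the quadratic costs fi(xi) = ci xi² + di xi used in Sect. 7.5, minimizing (7.37) over the box Xi has a closed form: clip the unconstrained stationary point onto the interval. A sketch (coefficients illustrative):

```python
import numpy as np

def primal_update(w, c, d, x_max):
    """argmin over [0, x_max] of c*x**2 + d*x + w*(x - P): the constant -w*P
    does not affect the minimizer, so clip the stationary point -(d + w)/(2c)."""
    return float(np.clip(-(d + w) / (2 * c), 0.0, x_max))

# With w = -7.3 (roughly the dual optimum of Example 1 in Sect. 7.5) and the
# data of generator 1 in Table 7.1, the update lands on the reported allocation:
print(primal_update(-7.3, c=0.04, d=2.0, x_max=80.0))  # approximately 66.25
```

When the multiplier is too large or too small, the clipping makes the update hit a boundary of the interval constraint instead.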

Then, the global Lagrangian function $L : X \times \mathbb{R}^N \to \mathbb{R}$ can be defined as

$$L(x, w) = \frac{1}{N}\sum_{i=1}^{N} L_i(x_i, w_i). \qquad (7.38)$$

Notice that since fi is strongly convex with parameter μi, Li is strongly convex with parameter μi for fixed wi. Therefore, for all $x_i \in X_i$,

$$\frac{\mu_i}{2}(x_i(k+1)-x_i)^2 \le L_i(x_i, w_i(k)) - L_i(x_i(k+1), w_i(k)) - \nabla L_i(x_i(k+1), w_i(k))(x_i - x_i(k+1)). \qquad (7.39)$$

Using $x_i(k+1) \in \arg\min_{x_i\in X_i}\{f_i(x_i) + w_i(k)x_i\}$, we also obtain for all $x_i \in X_i$ that

$$\frac{\mu_i}{2}(x_i(k+1)-x_i)^2 \le L_i(x_i, w_i(k)) - L_i(x_i(k+1), w_i(k)). \qquad (7.40)$$

Since (7.40) holds for all $x_i \in X_i$, replacing xi by xi∗ and averaging (7.40) over i, we immediately have

$$\frac{1}{N}\sum_{i=1}^{N}\frac{\mu_i}{2}(x_i(k+1)-x_i^*)^2 \le \frac{1}{N}\sum_{i=1}^{N}\left(L_i(x_i^*, w_i(k)) - L_i(x_i(k+1), w_i(k))\right) = L(x^*, w(k)) - L(x(k+1), w(k)). \qquad (7.41)$$

Recalling (7.3), (7.6), and (7.8), we therefore have

$$q_i(w_i(k)) = -f_i(x_i(k+1)) - w_i(k)(x_i(k+1)-P). \qquad (7.42)$$

Moreover, strong duality holds according to Assumption 3 in [41], i.e., $f^* = d^* = -q^*$. We therefore get

$$L_i(x_i^*, w_i(k)) - L_i(x_i(k+1), w_i(k)) = f_i(x_i^*) + w_i(k)(x_i^*-P) - f_i(x_i(k+1)) - w_i(k)(x_i(k+1)-P) = f_i(x_i^*) + q_i(w_i(k)) + w_i(k)(x_i^*-P). \qquad (7.43)$$

Averaging (7.43) over i from 1 to N gives

$$L(x^*, w(k)) - L(x(k+1), w(k)) = f(x^*) + q(w(k)) + \frac{1}{N}\sum_{i=1}^{N} w_i(k)(x_i^*-P) = \frac{1}{N}\sum_{i=1}^{N}\left(q_i(w_i(k)) - q_i(w^*) + w_i(k)(x_i^*-P)\right). \qquad (7.44)$$

Since qi has a Lipschitz continuous derivative with Lipschitz parameter 1/μi, it follows that

$$L(x^*, w(k)) - L(x(k+1), w(k)) \le \frac{1}{N}\sum_{i=1}^{N}\left(\nabla q_i(w^*)(w_i(k)-w^*) + \frac{1}{2\mu_i}(w_i(k)-w^*)^2 + w_i(k)(x_i^*-P)\right). \qquad (7.45)$$

Substituting (7.45) into (7.41), we obtain

$$\sum_{i=1}^{N}\frac{\mu_i}{2}(x_i(k+1)-x_i^*)^2 \le \sum_{i=1}^{N}\left(\nabla q_i(w^*)(w_i(k)-w^*) + \frac{1}{2\mu_i}(w_i(k)-w^*)^2 + w_i(k)(x_i^*-P)\right),$$

which concludes our proof. □

7.5 Numerical Examples

In this section, two numerical examples, on economic dispatch and demand response problems in power systems, are presented to validate the practicability of the proposed algorithm and the feasibility of the theoretical analysis in this chapter.

7.5.1 Example 1: EDP on the IEEE 14-Bus Test Systems

In the first example, we consider economic dispatch on the IEEE 14-bus system [43], as depicted in Fig. 7.1. Specifically, we study a class of problems in which some generators may not be connected to the grid or may cease to exchange their power during operation. This may occur due to generator faults or the variability of renewable energy, which limits the generating capacity within a specific time. To model the variable connections between generators, we use a class of uniformly strongly connected time-varying general directed networks. Specifically, in this chapter, we consider a system in which each generator i incurs a quadratic cost as a function of the amount of its generated power Pi, i.e., Yi(Pi) = ciPi² + diPi, where ci and di are the adjustable cost coefficients of generator i. Assume that each generator i can only generate a limited amount of power, denoted by [0, Pimax]. In the simulation, we choose the average demand (required from the system) P = 60, and the coefficients of each generator are shown in Table 7.1, which is also used in [42]. The simulation results of the algorithm (7.7) are shown in Figs. 7.2 and 7.3. The power allocation at each generator is shown in Fig. 7.2, from which one can see that the distributed primal–dual gradient algorithm (7.7) successfully allocates the optimal powers to each generator at time step k = 200. The allocated optimal power at each generator is as follows: P1∗ = 66.25, P2∗ = 71.61, P3∗ = 47.15, P4∗ = 54.98, and P5∗ = 60.01. From Fig. 7.3, it is clear that, for each generator, the Lagrangian multiplier successfully converges to the dual optimal solution w∗.
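The reported allocations can be cross-checked against the KKT conditions of the dispatch problem: every generator strictly inside its box satisfies 2ciPi + di = −w∗, and the common multiplier is pinned down by the total demand NP = 300 MW. A sketch (bisection on the multiplier; data from Table 7.1):

```python
import numpy as np

c = np.array([0.04, 0.03, 0.035, 0.03, 0.04])   # ci from Table 7.1
d = np.array([2.0, 3.0, 4.0, 4.0, 2.5])         # di from Table 7.1
p_max = np.array([80.0, 90.0, 70.0, 70.0, 80.0])
demand = 5 * 60.0                               # N * P = 300 MW

def alloc(w):
    """Stationarity 2*ci*Pi + di + w = 0, projected onto [0, Pi_max]."""
    return np.clip(-(d + w) / (2 * c), 0.0, p_max)

# Total allocation is non-increasing in w, so bisect for sum(alloc(w)) = demand.
lo, hi = -100.0, 0.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if alloc(mid).sum() > demand:
        lo = mid   # too much power allocated: raise the multiplier
    else:
        hi = mid

w_star = 0.5 * (lo + hi)
P = alloc(w_star)
print(round(w_star, 3), np.round(P, 2))
```

With these coefficients the multiplier comes out near −7.3, consistent with the common limit of the multipliers in Fig. 7.3, and the allocation agrees with the reported P∗ to within roughly 0.05 MW.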

Fig. 7.1 The IEEE 14-bus test system [43]

Table 7.1 Generator parameters (MU = Monetary units)

Gen   Bus   ci ($/MW^2)   di ($/MW)   [0, Pimax] (MW)
 1     1    0.04          2.0         [0, 80]
 2     2    0.03          3.0         [0, 90]
 3     3    0.035         4.0         [0, 70]
 4     6    0.03          4.0         [0, 70]
 5     8    0.04          2.5         [0, 80]

7.5.2 Example 2: Demand Response for Time-Varying Supplies

Our second application concerns the demand response of 5 households served by a single generator in summer. In particular, we consider the issue of time-varying supplies during a day. We assume that the generator can predict the average power to supply per hour of the day based on information collected over the past few days, and each household incurs a cost for its power usage. Suppose that all households cooperate to arrange their loads to meet the supply while minimizing their total costs. According to [34], such demand response problems can be cast as the optimization problem studied in this chapter. Specifically, we consider only the energy consumption of air conditioning, with no other tunable device involved in the demand response process; this is mainly because air conditioning may consume the most energy during summer. When the air conditioning uses an amount xi of power, we suppose that each household i incurs a quadratic cost Qi(xi) = ai(xi − bi)², where ai is the cost coefficient and bi is the initial energy of household i. Let


Fig. 7.2 Power allocation at generators


Fig. 7.3 Consensus of Lagrange multipliers

S(t), t ∈ [7, 19], be the average vector of power supplied by the generator from 7:00 to 19:00. Let S(t), ai, and bi be chosen randomly from [0, 2000], (0, 1), and [0, 1000], respectively. The simulation results of the algorithm (7.7) are shown in Figs. 7.4 and 7.5. The optimal schedule of power set by each household is shown in Fig. 7.4, while the predicted price on the time-varying demand is shown in Fig. 7.5.
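Each hour t of the schedule in Fig. 7.4 solves min Σ ai(xi − bi)² subject to Σ xi = N·S(t); the KKT solution shifts every household from its initial level bi in inverse proportion to ai, and the multiplier is the hourly price plotted in Fig. 7.5. A sketch (random data drawn as in the text, with ai bounded away from zero for numerical safety; box limits omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(7)
N, hours = 5, 13                     # households, 7:00 through 19:00
a = rng.uniform(0.01, 1.0, N)        # cost coefficients a_i
b = rng.uniform(0.0, 1000.0, N)      # initial energy b_i of each household
S = rng.uniform(0.0, 2000.0, hours)  # average hourly supply S(t)

for t in range(hours):
    # stationarity 2*a_i*(x_i - b_i) + w = 0 together with sum(x) = N*S(t)
    w = (b.sum() - N * S[t]) / (0.5 / a).sum()   # hourly price (dual variable)
    x = b - w / (2 * a)                          # optimal household loads
    assert abs(x.sum() - N * S[t]) < 1e-6        # supply balance holds

print(x)  # schedule for the last hour
```

Households with small ai (cheap flexibility) absorb most of each hour's surplus or deficit, which is the qualitative pattern visible in Fig. 7.4.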


Fig. 7.4 Optimal energy schedule of each household


Fig. 7.5 Predicted price on the time-varying demands



7.6 Conclusion

In this chapter, a fully distributed primal–dual gradient algorithm for tackling the convex optimization problem with both a coupling linear constraint and individual box constraints has been studied in detail. It has been proven that, under fairly standard assumptions on network connectivity and the objective functions, the algorithm achieves a geometric convergence rate over time-varying general directed networks. Based on the adjustment of the parameters (ϑ, υ) and the small gain theorem, we derived an explicit convergence rate γ for different ranges of α. Furthermore, the correctness and effectiveness of the theoretical results have been demonstrated by applying the proposed algorithm to a few interesting problems in power systems. In addition, several meaningful questions remain for future work. For example, it would be interesting to study the more general case in which the step-sizes are uncoordinated. It would also be meaningful to extend our work to event-triggered, asynchronous, and quantized communication among nodes over time-varying networks.

References

1. B. Johansson, M. Rabi, M. Johansson, A simple peer-to-peer algorithm for distributed


optimization in sensor networks, in 2007 46th IEEE Conference on Decision and Control.
https://doi.org/10.1109/CDC.2007.4434888
2. Z.-W. Liu, X. Yu, Z.-H. Guan, B. Hu, C. Li, Pulse-modulated intermittent control in consensus
of multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 783–793 (2017)
3. G. Wen, W. Yu, Y. Xia, X. Yu, J. Hu, Distributed tracking of nonlinear multiagent systems under
directed switching topology: an observer-based protocol. IEEE Trans. Syst. Man Cybern. Syst.
47(5), 869–881 (2017)
4. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence
of communication delays. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 717–728 (2017)
5. D. Wang, N. Zhang, J. Wang, W. Wang, Cooperative containment control of multiagent systems
based on follower observers with time delay. IEEE Trans. Syst. Man Cybern. Syst. 47(1), 13–
23 (2017)
6. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
7. D. Yuan, D.W.C. Ho, S. Xu, Regularized primal-dual subgradient method for distributed
constrained optimization. IEEE Trans. Cybern. 46(9), 2109–2118 (2016)
8. M. Zhu, S. Martinez, On distributed convex optimization under inequality and equality
constraints. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
9. D. Jakovetic, J. Xavier, J.M.F. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014)
10. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
11. S.S. Ram, A. Nedic, V.V. Veeravalli, Distributed stochastic subgradient projection algorithms
for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)
12. Z. Feng, G. Hu, G. Wen, Distributed consensus tracking for multi-agent systems under two
types of attacks. Int. J. Robust. Nonlinear Control 26(5), 896–918 (2016)
206 7 Primal–Dual Algorithms for Distributed Economic Dispatch

13. I. Matei, J.S. Baras, Performance evaluation of the consensus-based distributed subgradient
method under random communication topologies. IEEE J. Sel. Topics. Signal Process. 5(4),
754–771 (2011)
14. A. Olshevsky, Linear time average consensus on fixed graphs and implications for decentral-
ized optimization and multi-agent control (2014). arXiv preprint arXiv:1411.4186
15. A. Nedic, A. Ozdaglar, P. A. Parrilo, Constrained consensus and optimization in multi-agent
networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
16. T. Wu, K. Yuan, Q. Ling, W. Yin, A. H. Sayed, Decentralized consensus optimization with
asynchrony and delays. IEEE Trans. Signal Inform. Process. Netw. 4(2), 293–307 (2018)
17. A. Nedic, Asynchronous broadcast-based convex optimization over a network. IEEE Trans.
Autom. Control 56(6), 1337–1351 (2011)
18. I. Lobel, A. Ozdaglar, Convergence analysis of distributed subgradient methods over random
networks, in 2008 46th Annual Allerton Conference on Communication, Control, and Comput-
ing. https://doi.org/10.1109/ALLERTON.2008.4797579
19. P. Yi, Y. Hong, Quantized subgradient algorithm and data-rate analysis for distributed
optimization. IEEE Trans. Control Netw. Syst. 1(4), 380–392 (2014)
20. B. Johansson, T. Keviczky, M. Johansson, K. H. Johansson, Subgradient methods and consen-
sus algorithms for solving convex optimization problems, in 2008 47th IEEE Conference on
Decision and Control. https://doi.org/10.1109/CDC.2008.4739339
21. D. Yuan, S. Xu, H. Zhao, Distributed primal-dual subgradient method for multiagent optimiza-
tion via consensus algorithms. IEEE Trans. Syst. Man Cybern. B: Cybern. 41(6), 1715–1724
(2011)
22. Z.J. Towfic, A.H. Sayed, Adaptive penalty-based distributed stochastic convex optimization.
IEEE Trans. Signal. Process. 62(15), 3924–3938 (2014)
23. S.S. Ram, A. Nedic, V.V. Veeravalli, Incremental stochastic subgradient algorithms for convex optimization. SIAM J. Optim. 20(2), 691–717 (2009)
24. J. Xu, S. Zhu, Y. C. Soh, L. Xie, Augmented distributed gradient methods for multi-Agent
optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual
Conference on Decision and Control. https://doi.org/10.1109/CDC.2015.7402509
25. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized
consensus optimization. SIAM J. Optim. 25, 944–966 (2015)
26. C. Li, X. Yu, T. Huang, X. He, Distributed optimal consensus over resource allocation network
and its application to dynamical economic dispatch. IEEE Trans. Neural Netw. Learn. Syst.
29(6), 2407–2418 (2018)
27. C. Li, X. Yu, W. Yu, G. Chen, J. Wang, Efficient computation for sparse load shifting in demand
side management. IEEE Trans. Smart Grid 8(1), 250–261 (2017)
28. X. He, D.W.C. Ho, T. Huang, J. Yu, H. Abu-Rub, C. Li, Second-order continuous-time
algorithms for economic power dispatch in smart grids. IEEE Trans. Syst. Man Cybern. Syst.
48(9), 1482–1492 (2018)
29. H. Xing, Y. Mou, M. Fu, Z. Lin, Distributed bisection method for economic power dispatch in
smart grid. IEEE Trans. Power Syst. 30(6), 3024–3035 (2015)
30. C. Li, X. Yu, W. Yu, T. Huang, Z.-W. Liu, Distributed event-triggered scheme for economic
dispatch in smart grids. IEEE Trans. Ind. Informat. 12(5), 1775–1785 (2016)
31. G. Binetti, A. Davoudi, F.L. Lewis, D. Naso, B. Turchiano, Distributed consensus-based
economic dispatch with transmission losses. IEEE Trans. Power Syst. 29(4), 1711–1720 (2014)
32. A. Nedic, A. Olshevsky, W. Shi, Improved convergence rates for distributed resource allocation
(2017). arXiv preprint arXiv:1706.05441
33. N. Li, L. Chen, M.A. Dahleh, Demand response using linear supply function bidding. IEEE
Trans. Smart Grid 6(4), 1827–1838 (2015)
34. N. Li, L. Chen, S.H. Low, Optimal demand response based on utility maximization in power
networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES).
https://doi.org/10.1109/PES.2011.6039082
35. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
References 207

36. T. T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic


dispatch/demand response in power networks (2016). arXiv preprint arXiv:1609.06660
37. D.P. Bertsekas, Convex Optimization Theory (Athena Scientific, Cambridge, MA, 2009)
38. H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory (Springer,
Berlin, 2011)
39. R.T. Rockafellar, R.J.B. Wets, Variational Analysis (Springer, Berlin, 2009)
40. C.A. Desoer, M. Vidyasagar, Feedback Systems: Input-Output Properties (SIAM, 2009)
41. D.P. Bertsekas, A. Nedic, A.E. Ozdaglar, Convex Analysis and Optimization (Athena Scientific,
Cambridge, MA, 2004)
42. S. Kar, G. Hug, Distributed robust economic dispatch in power systems: a consensus +
innovations approach, in 2012 IEEE Power and Energy Society General Meeting. https://doi.org/10.1109/PESGM.2012.6345156
43. IEEE 14 bus system. http://icseg.iti.illinois.edu/ieee-14-bus-system/
Chapter 8
Event-Triggered Algorithms for
Distributed Economic Dispatch

Abstract In this chapter, we continue to study energy management in smart grid operations, namely the economic dispatch problem (EDP): minimizing a sum of local convex cost functions subject to both local interval constraints and a coupling linear constraint over an undirected network. We propose a new event-triggered distributed accelerated primal–dual algorithm, ET-DAPDA, which reduces computation and interaction in solving the EDP with uncoordinated step-sizes. ET-DAPDA adds two momentum terms to the gradient tracking scheme (with respect to the dual updates) and lets each node interact with its neighbors independently, only at event-triggered sampling time instants. Assuming smoothness and strong convexity of the cost functions, the linear convergence of ET-DAPDA is analyzed using the generalized small gain theorem. In addition, ET-DAPDA strictly excludes Zeno-like behavior, which greatly reduces the interaction cost. ET-DAPDA is investigated on 14-bus and 118-bus systems to evaluate its applicability. Simulation results on convergence rates are further compared with existing techniques to demonstrate the superiority of ET-DAPDA.

Keywords Economic dispatch · Event-triggered · Momentum terms · Accelerated algorithm · Linear convergence

8.1 Introduction

With the development of related industries in the engineering field, distributed optimization has become a topic of broad concern in smart grids [1], sensor networks [2], etc. Practical optimization problems are becoming increasingly complex owing to energy constraints [3], privacy concerns [4], dynamic resource requirements [5], and the large scale of networks [6]. Achieving optimal solutions may not be easy with traditional techniques. Therefore, developing more efficient distributed optimization algorithms to solve general convex optimization problems is a focus of current research [7–12].
In the last few years, strategies based on distributed consensus [13–19] have become mainstream among optimization approaches. The more

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 209
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_8
210 8 Event-Triggered Algorithms for Distributed Economic Dispatch

commonly used among these are the distributed (sub)gradient descent approach [13], the distributed primal–dual approach [14], and so on. Similar work has also been generalized to distributed optimization under various practical conditions, such as complex networks [16], communication constraints [17, 18], online optimization [19], and so on. Because the above approaches [13–19] all require a decreasing step-size to guide the states of all nodes to the optimal solution, they suffer from the potential disadvantage of a slow convergence rate.
To overcome this difficulty, many new distributed optimization methods have been developed [20–23]. A distributed inexact gradient method with constant step-size for time-varying undirected/directed networks was developed in [20], which adopted the gradient tracking technique to handle the optimization problem. Furthermore, in terms of uncoordinated step-sizes, Lü et al. in [21] and Nedic et al. in [22] extended the work [20], each achieving linear convergence results. Despite the good performance of these methods, disadvantages remain; for instance, the computational and interaction capabilities of the nodes are not considered, making the methods difficult to apply in practice. To address this issue, event-triggered distributed optimization algorithms have emerged, such as [24, 25]. Although their convergence rates are linear and the capabilities of the nodes are considered, these methods [24, 25] cannot be applied to the EDP.
Motivated by the EDP in smart grids, Doan and Beck [26] developed a distributed primal–dual method to solve the EDP over undirected networks. The analysis in [26] revealed that the method finds the exact optimal solution at a linear convergence rate. Unfortunately, the interactions among nodes in the network consume substantial resources. Thus, Wang et al. in [27] presented an event-triggered distributed primal–dual method for solving the EDP over time-varying networks and obtained the desired linear convergence results. Moreover, a number of effective methods have received increasing attention in recent years [28, 29]. To comprehensively address the practical EDP, a new hybrid evolutionary algorithm was discussed in [28]. Then, a reliable hybrid evolutionary algorithm was proposed to solve the multi-objective multi-area EDP [29]. Besides, more practical energy management issues have been considered recently, such as [30] for the complex optimal power flow problem and [31] for transmission expansion planning with wind farms.
It should be noted that the above methods [24–27] did not rely on momentum terms [32, 33] to achieve a faster convergence rate. Existing work has shown that both algorithms with Nesterov momentum [32] and algorithms with heavy-ball momentum [33] can improve the convergence rate to some extent. This motivates us to develop an event-triggered distributed optimization algorithm that uses Nesterov momentum and heavy-ball momentum to solve the EDP, which not only guarantees faster convergence but also saves computation and interaction resources.
In this chapter, we propose a fresh event-triggered distributed accelerated primal–dual algorithm, named ET-DAPDA, to solve the distributed constrained (local interval constraints and a coupling linear constraint) convex optimization problem over an undirected network. ET-DAPDA guarantees that the interaction between two nodes in the network occurs in an event-triggered fashion. To be specific, the principal contributions of this work are the following:
(i) A distributed event-triggered interaction scheme is built into the primal–dual method. Compared with the recent work [20–23], ET-DAPDA reduces the energy consumption and the intensive computation caused by interactions between nodes, which may extend the useful life of particular networks such as power systems or smart grids.
(ii) In comparison with [34, 35], ET-DAPDA does not demand centralized control to generate the Lagrangian multiplier. Specifically, ET-DAPDA incorporates gradient tracking into the distributed primal–dual method to achieve linear convergence, and it adds two types of momentum terms that let nodes obtain more information from their neighbors than the existing methods [26, 27], thereby accelerating convergence.
(iii) The convergence of ET-DAPDA is analyzed using the generalized small gain theorem, a standard tool in control theory for analyzing the stability of interconnected dynamical systems, and the analysis is expected to be broadly applicable to other accelerated algorithms. In addition, ET-DAPDA allows a more relaxed step-size choice than most existing distributed methods, such as those presented in [20, 26, 33].
(iv) Provided that the largest step-size and the maximum momentum coefficient obey certain upper bounds, ET-DAPDA converges linearly to the optimal solution for smooth and strongly convex cost functions. ET-DAPDA also rigorously excludes Zeno-like behavior [25, 27]. In addition, in comparison with [23, 33], explicit estimates of the convergence rate of ET-DAPDA are established.

8.2 Preliminaries

8.2.1 Notation

If not otherwise stated, the vectors in this chapter are column vectors. We let ⟨·, ·⟩ denote the inner product of two vectors, || · || denote the 2-norm for both vectors and matrices, ρ(B) represent the spectral radius of a matrix B, and diag{y} be the diagonal matrix of the vector y, whose i-th diagonal element is y_i and whose other elements are zero. The symbols 1_m, I_m, and B^T denote the m-dimensional all-ones column vector, the m × m identity matrix, and the transpose of a matrix B (or of a vector), respectively. Given an infinite sequence s_i = (s_i(0), s_i(1), s_i(2), · · ·), where s_i(t) ∈ R, ∀t, we define ||s_i||^{γ,T} = max_{t=0,1,...,T} (1/γ^t)||s_i(t)|| and ||s_i||^γ = sup_{t≥0} (1/γ^t)||s_i(t)|| with γ ∈ (0, 1). Given two vectors v = [v_1, . . . , v_m]^T and u = [u_1, . . . , u_m]^T, the notation v ⪯ u means that v_i ≤ u_i for every i.

8.2.2 Nomenclature

f: the global cost function of the EDP
f_i: the local cost function of node i
C: the dual function of the EDP
C_i: the local dual function of node i
f*: the conjugate function of the function f
x_i: node i's estimate for the EDP
d_i: node i's element of the coupling linear constraint
x_i^min: the lower bound of node i's estimate x_i
x_i^max: the upper bound of node i's estimate x_i
λ_i: the Lagrangian multiplier of node i
α_i: the constant step-size of node i
η_i(t): the momentum coefficient of node i at time t
z_i, y_i: the auxiliary variables of node i
v_i^λ, v_i^y: node i's control inputs of the algorithm
t_{k(t)}^i: the event-triggered time instant sequence of node i at time t
e_i^λ, e_i^y: the measurement errors of node i
σ_i: the Lipschitz smoothness parameter of f_i
μ_i: the strong convexity parameter of f_i

8.2.3 Model of Optimization Problem

This chapter studies a general EDP of finding the optimal solution of the sum of m local convex cost functions subject to both local interval constraints and a coupling linear constraint over an undirected network of m nodes, described as follows:

min_x f(x) = Σ_{i=1}^m f_i(x_i),  s.t.  Σ_{i=1}^m x_i = Σ_{i=1}^m d_i,  x_i^min ≤ x_i ≤ x_i^max,   (8.1)

∀i = 1, . . . , m, where x = [x_1, . . . , x_m]^T ∈ R^m, the coupling linear constraint set is expressed by M = {x ∈ R^m | Σ_{i=1}^m x_i = Σ_{i=1}^m d_i}, and the local interval constraint set is denoted by X_i = {x̃ ∈ R | x_i^min ≤ x̃ ≤ x_i^max} for all i = 1, . . . , m. We write X = X_1 × X_2 × · · · × X_m for the Cartesian product of the local interval constraints. In addition, we denote by P = X ∩ M the feasible set, by x* = [x_1*, . . . , x_m*]^T the optimal solution to (8.1), and by f(x*) = f* the optimal value. Denote the (nonempty) optimal solution set X* = {x ∈ P | Σ_{i=1}^m f_i(x_i) = f*}.
Remark 8.1 The optimization problem (8.1) is a general mathematical form of the EDP, a fundamental and important problem that arises in a variety of engineering application domains, such as smart grids [7, 12, 26, 27]. Important practical issues such as transmission line losses, valve-point effects, prohibited operating zones of generators, and generator ramp-rate limits could be incorporated into the considered problem, but they are not the focus of our work.

8.2.4 Communication Network

In this chapter, we consider a weighted undirected network G consisting of m nodes, where each node communicates only locally, that is, with a subset of the other nodes. The network is represented by G = {V, E, A}, where V = {1, . . . , m} is the node set, E ⊆ V × V is the edge set, and A = [a_ij] ∈ R^{m×m} is the weighted adjacency matrix, whose weights satisfy a_ij = a_ji > 0 if (i, j) ∈ E and a_ij = 0 otherwise. The diagonal elements satisfy a_ii = 0 since G is simple, and A is symmetric. Nodes i and j can communicate directly with each other only if (i, j) ∈ E. The neighbor set of node i is denoted by N_i = {j | (i, j) ∈ E}. The degree matrix D = diag{d̃_1, . . . , d̃_m} is diagonal, with diagonal elements defined by d̃_i = Σ_{j=1}^m a_ij, ∀i. Define d̂ = max_{i∈V} {d̃_i}. The Laplacian matrix is L = [l_ij] ∈ R^{m×m}, which satisfies L = D − A.
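To make these objects concrete, the short Python sketch below (a hypothetical 4-node ring with unit weights, our own example) builds A, the Laplacian L, and the mixing matrix W = I_m − hL used later in the chapter, and numerically checks the condition δ = ρ(W − (1/m)1_m 1_m^T) < 1 from Assumption 8.1:

```python
import numpy as np

m = 4
A = np.zeros((m, m))
for i in range(m):                    # 4-node ring with unit weights a_ij = a_ji = 1
    j = (i + 1) % m
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=1)                   # degrees d~_i = sum_j a_ij
L = np.diag(deg) - A                  # Laplacian L = D - A
d_hat = deg.max()

h = 0.5 / d_hat                       # tunable gain, 0 < h < 1/d_hat
W = np.eye(m) - h * L                 # mixing matrix W = I_m - h*L

delta = max(abs(np.linalg.eigvals(W - np.ones((m, m)) / m)))
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
print(delta)                          # here delta = 0.5 < 1: Assumption 8.1 holds
```

Any connected graph with a sufficiently small gain h yields such a symmetric, doubly stochastic W; the ring and the value of h here are illustrative choices.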

8.3 Algorithm Development

This section presents the algorithm development, followed by some assumptions. First, the problem reformulation is given.

8.3.1 Problem Reformulation

The dual problem of (8.1) can be displayed as

max_{λ∈R} C(λ) = Σ_{i=1}^m C_i(λ),   (8.2)

where C_i(λ) = −f_i*(−λ) − λd_i and f*(λ) = sup_{x∈X} {λx − f(x)}. Then, we transform the dual problem (8.2) into the following minimization form:

min_{λ∈R} q(λ) = Σ_{i=1}^m q_i(λ),   (8.3)

where q_i (convex) is given by

q_i(λ) = −C_i(λ) = f_i*(−λ) + λd_i.   (8.4)
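To make the dual objective concrete, consider a quadratic local cost f_i(x) = b x² + c x on an interval X_i (the coefficients below are our own illustrative choices, not from the chapter). The inner problem behind f_i* then has the closed-form minimizer x̂(λ) = clip(−(c + λ)/(2b), X_i), and Danskin's theorem gives ∇q_i(λ) = d_i − x̂(λ), which is exactly the gradient identity (8.9) used by the algorithm. The sketch checks this against a finite difference of q_i:

```python
import numpy as np

b, c, lo, hi, d = 0.5, 1.0, -10.0, 10.0, 2.0   # hypothetical local data

def x_hat(lam):
    """Minimizer of f(x) + lam*(x - d) over [lo, hi], with f(x) = b*x^2 + c*x."""
    return float(np.clip(-(c + lam) / (2.0 * b), lo, hi))

def q(lam):
    """Local dual objective q(lam) = f*(-lam) + lam*d = -min_x {f(x) + lam*(x - d)}."""
    x = x_hat(lam)
    return -(b * x**2 + c * x + lam * (x - d))

lam0 = 0.7
grad = d - x_hat(lam0)                          # Danskin: dq/dlam = d - x_hat(lam)
eps = 1e-6
fd = (q(lam0 + eps) - q(lam0 - eps)) / (2 * eps)
assert abs(grad - fd) < 1e-5                    # here grad = 3.7
```

The same closed form is what makes the primal update in Algorithm 4 cheap for quadratic generator costs.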

8.3.2 Event-Triggered Control Scheme

The event-triggered control scheme can effectively decrease the network burden in comparison with the traditional time-triggered control scheme [7, 26]. In the event-triggered control scheme, t_{k(t)}^i, t = 0, 1, . . ., i ∈ V, stands for the event-triggered time instant sequence of node i. In addition, each node i, at each event-triggered time instant t_{k(t)}^i, t = 0, 1, . . ., holds its estimates λ_i(t_{k(t)}^i), y_i(t_{k(t)}^i) and then transmits them to its neighbors. Also, node j disseminates its latest sampled states λ_j(t_{k^j(t)}^j), y_j(t_{k^j(t)}^j) to node i when (j, i) ∈ E, where k^j(t) = arg min_{m: t ≥ t_m^j} {t − t_m^j}.

After t_{k(t)}^i, the next event-triggered time instant for node i ∈ V is represented by t_{k(t)+1}^i, which follows

t_{k(t)+1}^i = inf{t > t_{k(t)}^i : ||e_i^λ(t)|| + ||e_i^y(t)|| > Qω^t},   (8.5)

where Q > 0 and 0 < ω < 1 are control parameters; the measurement errors e_i^λ(t) and e_i^y(t) are given as

e_i^λ(t) = λ_i(t_{k(t)}^i) − λ_i(t),
e_i^y(t) = y_i(t_{k(t)}^i) − y_i(t).   (8.6)
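In implementation terms, each node stores the values it last broadcast and fires a new broadcast exactly when condition (8.5) is violated by the measurement errors (8.6). A minimal sketch of this bookkeeping (variable names are ours, not from the chapter):

```python
def should_trigger(lam_i, y_i, lam_hat_i, y_hat_i, t, Q=1.0, omega=0.9):
    """Event condition (8.5): fire when the measurement errors (8.6)
    exceed the exponentially decaying threshold Q * omega**t."""
    e_lam = abs(lam_hat_i - lam_i)      # e_i^lambda(t) = lam_i(t_k) - lam_i(t)
    e_y = abs(y_hat_i - y_i)            # e_i^y(t) = y_i(t_k) - y_i(t)
    return e_lam + e_y > Q * omega**t

# Node i broadcasts (and its errors reset to zero) only when triggered:
lam_hat, y_hat = 1.0, 0.5               # last broadcast values
lam, y = 1.3, 0.1                       # current local values
assert not should_trigger(lam, y, lam_hat, y_hat, t=0)   # 0.7 <= 1.0
assert should_trigger(lam, y, lam_hat, y_hat, t=5)       # 0.7 > 0.9**5
if should_trigger(lam, y, lam_hat, y_hat, t=5):
    lam_hat, y_hat = lam, y             # neighbors now receive fresh values
```

Because the threshold decays geometrically, broadcasts become exact in the limit while early iterations may skip many transmissions.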

8.3.2.1 Event-Triggered Distributed Accelerated Primal–Dual Algorithm (ET-DAPDA)

Here, ET-DAPDA for solving (8.3) is formally described in Algorithm 4. It follows from (8.4) that

∇q_i(λ_i(t)) = −x_i(t) + d_i,   (8.9)
Algorithm 4 Event-triggered distributed accelerated primal–dual algorithm (ET-DAPDA) at each node i
1: Initialization: Choose the initial points x_i(0) ∈ R, z_i(0) ∈ R, λ_i(0) ∈ R, y_i(0) = d_i − x_i(0) ∈ R, v_i^λ(0) = λ_i(0) ∈ R, and v_i^y(0) = y_i(0) ∈ R.
2: Set t = 0.
3: Local information exchange: Each node i independently interacts with its neighbors only at the event-triggered sampling time instants.
4: Control input updates: Based on the event-triggered control scheme, each node i updates the auxiliary variables:

   v_i^λ(t) = λ_i(t) + h Σ_{j=1}^m a_ij (λ_j(t_{k^j(t)}^j) − λ_i(t_{k(t)}^i)),
   v_i^y(t) = y_i(t) + h Σ_{j=1}^m a_ij (y_j(t_{k^j(t)}^j) − y_i(t_{k(t)}^i)),   (8.7)

   where a_ij ≥ 0 is the weight between node j and node i, and h > 0 is a tunable parameter.
5: Local variable updates: Each node i updates the following variables according to (8.7):

   x_i(t+1) = arg min_{x_i ∈ X_i} {f_i(x_i) + λ_i(t)(x_i − d_i)},
   z_i(t+1) = v_i^λ(t) + η_i(t)(z_i(t) − z_i(t−1)) − α_i y_i(t),
   λ_i(t+1) = z_i(t+1) + η_i(t)(z_i(t+1) − z_i(t)),   (8.8)
   y_i(t+1) = v_i^y(t) − x_i(t+1) + x_i(t),

   where α_i > 0 and η_i(t) ≥ 0.
6: Set t = t + 1 and repeat.
7: Until t > t_max (t_max is a predefined maximum iteration number).

where ∇q_i(λ_i(t)) is the gradient of q_i(λ) at λ = λ_i(t), t ≥ 0. It deserves to be mentioned that ET-DAPDA neither requires repeated exchanges of information between nodes nor demands a conspicuous computational cost. Hence, ET-DAPDA with the event-triggered control scheme may maximize the lifespan of the system.
Define ∇Q(λ(t)) = [∇q_1(λ_1(t)), . . . , ∇q_m(λ_m(t))]^T ∈ R^m, x(t) = [x_1(t), x_2(t), . . . , x_m(t)]^T ∈ R^m, z(t) = [z_1(t), z_2(t), . . . , z_m(t)]^T ∈ R^m, λ(t) = [λ_1(t), λ_2(t), . . . , λ_m(t)]^T ∈ R^m, y(t) = [y_1(t), . . . , y_m(t)]^T ∈ R^m, v^λ(t) = [v_1^λ(t), . . . , v_m^λ(t)]^T ∈ R^m, v^y(t) = [v_1^y(t), . . . , v_m^y(t)]^T ∈ R^m, η(t) = [η_1(t), . . . , η_m(t)]^T ∈ R^m, and α = [α_1, . . . , α_m]^T ∈ R^m. Then, we write the vector form of Algorithm 4 below:

x(t+1) = arg min_{x∈X} Σ_{i=1}^m (f_i(x_i) + λ_i(t)(x_i − d_i)),
z(t+1) = v^λ(t) + D_η(t)(z(t) − z(t−1)) − D_α y(t),
λ(t+1) = z(t+1) + D_η(t)(z(t+1) − z(t)),   (8.10)
y(t+1) = v^y(t) − x(t+1) + x(t),

where D_α = diag{α}, D_η(t) = diag{η(t)}, v^λ(t) = Wλ(t) − hLe^λ(t), and v^y(t) = Wy(t) − hLe^y(t). Here, W = I_m − hL = [w_ij] ∈ R^{m×m}, where 0 < h < 1/d̂. The initialization is y(0) = d − x(0), where d = [d_1, . . . , d_m]^T ∈ R^m.

Assumption 8.1 ([25]) The undirected network G is connected, and the matrix W
meets 1Tm W = 1Tm , W 1m = 1m , and δ = ρ(W − 1m 1Tm /m) < 1.
Assumption 8.2 ([26]) Each cost function fi , i ∈ V, is μi -strongly convex and has
Lipschitz continuous gradient with parameter σi , where σi , μi ∈ (0, +∞).
Remark 8.2 Assumption 8.1 describes the information exchange rules in the network. Assumption 8.2 assures the nonemptiness of the optimal solution set.
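To illustrate how the pieces of (8.10) fit together, the following self-contained Python sketch runs ET-DAPDA on a toy three-node instance of our own (quadratic costs f_i(x_i) = b_i x_i², a complete graph, common step-size α and momentum η; all parameter values are illustrative and are not derived from the bounds of Sect. 8.4). The run checks that Σ_i x_i(t) approaches Σ_i d_i and that the multipliers λ_i(t) reach consensus:

```python
import numpy as np

# Toy data: f_i(x) = b_i x^2 on X_i = [-10, 10]; coupling constraint sum(x) = sum(d)
b = np.array([0.5, 1.0, 2.0])
d = np.array([1.0, 2.0, 3.0])
lo, hi = -10.0, 10.0
m = len(b)

A = np.ones((m, m)) - np.eye(m)            # complete graph, unit weights
Lap = np.diag(A.sum(axis=1)) - A           # Laplacian
h, alpha, eta = 0.2, 0.2, 0.05             # gain, step-size, momentum (illustrative)
Q, omega = 1.0, 0.9                        # trigger parameters in (8.5)

def x_hat(lam):                            # argmin_{x in X_i} f_i(x) + lam_i (x - d_i)
    return np.clip(-lam / (2.0 * b), lo, hi)

lam = np.zeros(m)
x = x_hat(lam)
y = d - x                                  # initialization y(0) = d - x(0)
z = z_prev = lam.copy()
lam_b, y_b = lam.copy(), y.copy()          # last broadcast values

for t in range(3000):
    # event-triggered broadcasts, per (8.5)-(8.6)
    fire = np.abs(lam_b - lam) + np.abs(y_b - y) > Q * omega**t
    lam_b[fire], y_b[fire] = lam[fire], y[fire]
    # control inputs (8.7): own state + h * sum_j a_ij (broadcast_j - broadcast_i)
    v_lam = lam - h * (Lap @ lam_b)
    v_y = y - h * (Lap @ y_b)
    # local updates (8.8)/(8.10)
    x_new = x_hat(lam)
    z_new = v_lam + eta * (z - z_prev) - alpha * y
    lam = z_new + eta * (z_new - z)
    y = v_y - x_new + x
    z_prev, z, x = z, z_new, x_new

assert abs(x.sum() - d.sum()) < 1e-5       # coupling constraint met
assert np.ptp(lam) < 1e-5                  # multipliers reach consensus
```

For this instance the iterates approach the equal-marginal-cost dispatch x* = (24/7, 12/7, 6/7); choosing ω equal to the convergence factor γ, as Theorem 8.8 prescribes, trades communication frequency against the guaranteed rate.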

8.4 Convergence Analysis

In this section, the linear convergence rate of ET-DAPDA is established by applying the generalized small gain theorem [36]. First, we combine four inequalities into a linear system of inequalities in matrix form and then examine the spectral properties of its coefficient matrix. For this purpose, some elementary notation is introduced to simplify the convergence results. Denote v_1(t) = λ(t) − (1/m)1_m 1_m^T λ(t), ∀t ≥ 0, v_2(t) = (1/m)1_m 1_m^T λ(t) − 1_m λ*, ∀t ≥ 0, v_3(t) = z(t) − z(t−1), ∀t > 0, with the convention v_3(0) = 0_m, and v_4(t) = y(t) − (1/m)1_m 1_m^T y(t), ∀t ≥ 0. Besides, define ∇Q(1_m λ̄(t)) = [∇q_1(λ̄(t)), . . . , ∇q_m(λ̄(t))]^T, α̂ = max_{i∈V} {α_i}, η̂ = sup_{t≥0} max_{i∈V} {η_i(t)}, ϑ = Σ_{i=1}^m (1/σ_i), ℓ = Σ_{i=1}^m (1/μ_i), ϑ̂ = max_{i∈V} {1/σ_i}, and ℓ̂ = max_{i∈V} {1/μ_i}.

8.4.1 Supporting Lemmas

First, the convergence analysis of the Lagrangian multiplier primarily relies on the generalized small gain theorem [36] described below.

Lemma 8.3 ([36]) Assume that the non-negative vector sequences {ṽ_i(t)}_{t=0}^∞, i = 1, . . . , m, a non-negative matrix Γ̃ ∈ R^{m×m}, ũ ∈ R^m, and γ ∈ (0, 1) satisfy

ṽ^{γ,T} ⪯ Γ̃ ṽ^{γ,T} + ũ,

for every positive integer T, where ṽ^{γ,T} = [||ṽ_1||^{γ,T}, . . . , ||ṽ_m||^{γ,T}]^T. Then, ||ṽ_i||^γ < B_0 for some B_0 < ∞ if ρ(Γ̃) < 1. Hence, each ||ṽ_i(t)||, i ∈ {1, . . . , m}, converges linearly to zero at a rate of O(γ^t).

In the following, the bound of ||v_1||^{γ,T} is shown.
Lemma 8.4 For all T ≥ 0 and under Assumptions 8.1 and 8.2, we have the following inequality:

||v_1||^{γ,T} ≤ (α̂ℓ̂ / (γ − δ − α̂ℓ̂)) ||v_2||^{γ,T} + (2η̂ / (γ − δ − α̂ℓ̂)) ||v_3||^{γ,T} + (α̂ / (γ − δ − α̂ℓ̂)) ||v_4||^{γ,T} + (h||L|| / (γ − δ − α̂ℓ̂)) ||e^λ||^{γ,T} + ||v_1(0)|| / (γ − δ − α̂ℓ̂),

for all δ + α̂ℓ̂ < γ < 1, where δ is provided in Assumption 8.1.
Proof On the basis of the updates of z(t) and λ(t) of ET-DAPDA in (8.10), it holds that

||v_1(t+1)|| ≤ ||(W − (1/m)1_m 1_m^T) v_1(t)|| + ||(I_m − (1/m)1_m 1_m^T) D_η(t) v_3(t)|| + ||(I_m − (1/m)1_m 1_m^T) D_α y(t)|| + h||L|| ||e^λ(t)|| + ||(I_m − (1/m)1_m 1_m^T) D_η(t+1) v_3(t+1)||,   (8.11)

where the inequality in (8.11) is acquired from the fact that (W − (1/m)1_m 1_m^T)(I_m − (1/m)1_m 1_m^T) = W − (1/m)1_m 1_m^T. Notice that α̂ = ||D_α|| and ||D_η(t)|| ≤ η̂. Built on Assumption 8.1, (8.11) further implies that

||v_1(t+1)|| ≤ δ||v_1(t)|| + α̂||y(t)|| + η̂||v_3(t+1)|| + η̂||v_3(t)|| + h||L|| ||e^λ(t)||.   (8.12)

Since ||y(t)|| ≤ ℓ̂||v_1(t)|| + ℓ̂||v_2(t)|| + ||v_4(t)|| holds [12], we have that

||v_1(t+1)|| ≤ (δ + α̂ℓ̂)||v_1(t)|| + α̂ℓ̂||v_2(t)|| + η̂||v_3(t+1)|| + η̂||v_3(t)|| + α̂||v_4(t)|| + h||L|| ||e^λ(t)||.   (8.13)

From here, the process resembles that in the proof of Lemma 8 in [20]; we include it for completeness. By taking the supremum of both sides of (8.13) for t = 0, . . . , T − 1, one has

sup_{t=0,..,T−1} ||v_1(t+1)||/γ^{t+1} ≤ ((δ + α̂ℓ̂)/γ) sup_{t=0,..,T−1} ||v_1(t)||/γ^t + (α̂ℓ̂/γ) sup_{t=0,..,T−1} ||v_2(t)||/γ^t + (α̂/γ) sup_{t=0,..,T−1} ||v_4(t)||/γ^t + (h||L||/γ) sup_{t=0,..,T−1} ||e^λ(t)||/γ^t + η̂ sup_{t=0,..,T−1} (||v_3(t)|| + ||v_3(t+1)||)/γ^{t+1}.   (8.14)

Also, accounting for the term ||v_1(0)||/γ^0 = ||v_1(0)||, one has

||v_1||^{γ,T} ≤ ((δ + α̂ℓ̂)/γ) ||v_1||^{γ,T} + (α̂ℓ̂/γ) ||v_2||^{γ,T} + (2η̂/γ) ||v_3||^{γ,T} + (α̂/γ) ||v_4||^{γ,T} + (h||L||/γ) ||e^λ||^{γ,T} + ||v_1(0)||,   (8.15)

which after some algebraic manipulations yields the desired result. This completes the proof. □
Notice that (1/m)1_m 1_m^T λ(t) = 1_m λ̄(t). Then, the bound of ||v_2||^{γ,T} is given in the next lemma.

Lemma 8.5 Under Assumptions 8.1–8.2, when c_1 < γ < 1 and 0 < (1/m²)1_m^T α < 2/ℓ, one gets, ∀T ≥ 0,

||v_2||^{γ,T} ≤ (α̂ℓ̂ / (γ − c_1)) ||v_1||^{γ,T} + (2η̂ / (γ − c_1)) ||v_3||^{γ,T} + (α̂ / (γ − c_1)) ||v_4||^{γ,T} + (γ / (γ − c_1)) ||v_2(0)||,

where c_1 = max{|1 − ℓ(1/m²)1_m^T α|, |1 − ϑ(1/m²)1_m^T α|}.


Proof Notice that 1_m^T W = 1_m^T and 1_m^T L = 0. Following the updates of z(t) and λ(t) of ET-DAPDA in (8.10), we can utilize the fact (1/m)1_m 1_m^T y(t) = (1/m)1_m 1_m^T ∇Q(λ(t)) to establish

||(1/m)1_m 1_m^T λ(t+1) − 1_m λ*|| ≤ ||(1/m)1_m 1_m^T λ(t) − (1/m)1_m 1_m^T D_α (1/m)1_m 1_m^T y(t) − 1_m λ*|| + η̂[||z(t) − z(t−1)|| + ||z(t+1) − z(t)||] + α̂||y(t) − (1/m)1_m 1_m^T y(t)||.   (8.16)

We now consider the first term in the inequality of (8.16). Note that (1/m)1_m 1_m^T y(t) = (1/m)1_m 1_m^T ∇Q(λ(t)). Utilizing 1_m 1_m^T D_α 1_m 1_m^T = 1_m^T α 1_m 1_m^T, one obtains

||(1/m)1_m 1_m^T λ(t) − (1/m)1_m 1_m^T D_α (1/m)1_m 1_m^T y(t) − 1_m λ*|| ≤ ||1_m (λ̄(t) − (1/m²)1_m^T α ∇q(λ̄(t)) − λ*)|| + (1/m²)1_m^T α ||1_m 1_m^T ∇Q(1_m λ̄(t)) − 1_m 1_m^T ∇Q(λ(t))|| = Λ_1 + Λ_2,   (8.17)

where ∇q(λ̄(t)) = 1_m^T ∇Q(1_m λ̄(t)). By Lemma 3 in [33], if 0 < (1/m²)1_m^T α < 2/ℓ, Λ_1 is bounded by

Λ_1 ≤ c_1 √m ||λ̄(t) − λ*|| = c_1 ||(1/m)1_m 1_m^T λ(t) − 1_m λ*||,   (8.18)

where c_1 = max{|1 − ℓ(1/m²)1_m^T α|, |1 − ϑ(1/m²)1_m^T α|}. Then, Λ_2 can be bounded in the following way:

Λ_2 ≤ (1/m²)1_m^T α ℓ̂ ||(1/m)1_m 1_m^T λ(t) − λ(t)||.   (8.19)

Plugging (8.17)–(8.19) into (8.16) yields

||v_2(t+1)|| ≤ c_1 ||v_2(t)|| + α̂ℓ̂||v_1(t)|| + η̂||v_3(t+1)|| + η̂||v_3(t)|| + α̂||v_4(t)||.   (8.20)

Here, we can identify the terms in (8.20) with the terms in (8.13). Therefore, to establish this lemma, we proceed as in the proof of Lemma 8.4 from (8.13) onward. This completes the proof. □
The following lemma displays the bound of ||v_3||^{γ,T}.

Lemma 8.6 Under Assumptions 8.1 and 8.2, if 2η̂ < γ < 1, one gets that, ∀T ≥ 0,

||v_3||^{γ,T} ≤ ((κ_1 + α̂ℓ̂) / (γ − 2η̂)) ||v_1||^{γ,T} + (α̂ℓ̂ / (γ − 2η̂)) ||v_2||^{γ,T} + (α̂ / (γ − 2η̂)) ||v_4||^{γ,T} + (h||L|| / (γ − 2η̂)) ||e^λ||^{γ,T} + (γ / (γ − 2η̂)) ||v_3(0)||,

where κ_1 = ||W − I_m||.


Proof It is obtained from the updates of z(t) and λ(t) of ET-DAPDA in (8.10) that

||z(t+1) − z(t)|| ≤ κ_1 ||(1/m)1_m 1_m^T λ(t) − λ(t)|| + 2η̂||z(t) − z(t−1)|| + h||L|| ||e^λ(t)|| + α̂||y(t)||,   (8.21)

where the inequality in (8.21) is obtained from the fact that (W − I_m)(I_m − (1/m)1_m 1_m^T) = W − I_m. Then, one has

||v_3(t+1)|| ≤ 2η̂||v_3(t)|| + (κ_1 + α̂ℓ̂)||v_1(t)|| + α̂ℓ̂||v_2(t)|| + α̂||v_4(t)|| + h||L|| ||e^λ(t)||.   (8.22)

Similar to the procedure following (8.13), it is straightforward to infer the desired result. This completes the proof. □
The next lemma sets up the inequality that bounds the gradient-estimation error term ||v_4||^{γ,T}.

Lemma 8.7 Under Assumptions 8.1 and 8.2, if δ < γ < 1, one obtains that, for all T ≥ 0,

||v_4||^{γ,T} ≤ ((ℓ̂ + 2η̂ℓ̂) / (γ − δ)) ||v_3||^{γ,T} + (h||L|| / (γ − δ)) ||e^y||^{γ,T} + (γ / (γ − δ)) ||v_4(0)||.
Proof By the update of y(t) of ET-DAPDA in (8.10), we have

||y(t+1) − (1/m)1_m 1_m^T y(t+1)|| ≤ δ||y(t) − (1/m)1_m 1_m^T y(t)|| + h||L|| ||e^y(t)|| + ||x(t+1) − x(t)||,   (8.23)

where we utilize the triangle inequality and Assumption 8.1 to derive the inequality. With regard to the last term of the inequality in (8.23), we apply the update of λ(t) of ET-DAPDA in (8.10) and the gradient ∇q_i(λ_i(t)) = −x_i(t) + d_i to obtain

||x(t+1) − x(t)|| ≤ ℓ̂(1 + η̂)||z(t+1) − z(t)|| + ℓ̂η̂||z(t) − z(t−1)||.   (8.24)

Combining (8.23) and (8.24) gives

||v_4(t+1)|| ≤ δ||v_4(t)|| + ℓ̂(1 + η̂)||v_3(t+1)|| + ℓ̂η̂||v_3(t)|| + h||L|| ||e^y(t)||.

Similar to the procedure following (8.13), it is straightforward to infer the desired result if δ < γ < 1. This completes the proof. □

8.4.2 Main Results

Drawing support from Lemmas 8.4–8.7, the convergence results of ET-DAPDA (8.10) are now built as follows. For convenience, we define u_1 = (h||L|| ||e^λ||^{γ,T} + ||v_1(0)||) / (γ − δ − α̂ℓ̂), u_2 = γ||v_2(0)|| / (γ − c_1), u_3 = (h||L|| ||e^λ||^{γ,T} + γ||v_3(0)||) / (γ − 2η̂), and u_4 = (h||L|| ||e^y||^{γ,T} + γ||v_4(0)||) / (γ − δ). Theorem 8.8 is introduced below.
Theorem 8.8 In consideration of ET-DAPDA, (8.10) updates the sequences {x(t)}, {z(t)}, {λ(t)}, and {y(t)}. Then, under Assumptions 8.1 and 8.2 and if 0 < (1/m²)1_m^T α < 2/ℓ, we acquire the following linear system of inequalities:

v^{γ,T} ⪯ Γ v^{γ,T} + u,   (8.25)

where u = [u_1, u_2, u_3, u_4]^T, v^{γ,T} = [||v_1||^{γ,T}, ||v_2||^{γ,T}, ||v_3||^{γ,T}, ||v_4||^{γ,T}]^T, and the elements of the matrix Γ = [γ_ij] ∈ R^{4×4} are given by

Γ =
[ 0                    α̂ℓ̂/(γ−δ−α̂ℓ̂)    2η̂/(γ−δ−α̂ℓ̂)    α̂/(γ−δ−α̂ℓ̂) ]
[ α̂ℓ̂/(γ−c_1)          0                 2η̂/(γ−c_1)       α̂/(γ−c_1)    ]
[ (κ_1+α̂ℓ̂)/(γ−2η̂)    α̂ℓ̂/(γ−2η̂)      0                 α̂/(γ−2η̂)    ]
[ 0                    0                 (ℓ̂+2η̂ℓ̂)/(γ−δ)  0             ]

Assume moreover that the largest step-size satisfies

0 < α̂ < min{ m/ℓ, β_1(γ−δ)/(β_1ℓ̂ + β_2ℓ̂ + β_4), (β_3 − β_1κ_1)/(β_1ℓ̂ + β_2ℓ̂ + β_4) },   (8.26)

and the maximum momentum coefficient satisfies

0 ≤ η̂ < min{ (β_1(γ−δ) − α̂(β_1ℓ̂ + β_2ℓ̂ + β_4))/(2β_3), (β_2(γ−c_1) − α̂(β_1ℓ̂ + β_4))/(2β_3), (β_4(γ−δ) − β_3ℓ̂)/(2β_3ℓ̂), (β_3γ − β_1κ_1 − α̂(β_1ℓ̂ + β_2ℓ̂ + β_4))/(2β_3) }.   (8.27)

Then, if 0 < γ < 1 is a constant such that

γ = max{ (2η̂β_3 + β_1δ + α̂(β_1ℓ̂ + β_2ℓ̂ + β_4))/β_1, (2η̂β_3ℓ̂ + β_3ℓ̂ + β_4δ)/β_4, (2η̂β_3 + β_2c_1 + α̂(β_1ℓ̂ + β_4))/β_2, (2η̂β_3 + β_1κ_1 + α̂(β_1ℓ̂ + β_2ℓ̂ + β_4))/β_3 },

the sequence {λ(t)} converges linearly to 1_m λ* at a rate of O(γ^t), where β_1, β_2, β_3, and β_4 are arbitrary constants such that

β_1 > 0,  β_2 > m(β_1ℓ̂ + β_4)/ϑ,  β_3 > β_1κ_1,  β_4 > β_3ℓ̂/(1 − δ).

Proof First, combining the results of Lemmas 8.4–8.7, we can infer inequality (8.25) directly. Next, we give sufficient conditions under which the spectral radius of Γ, denoted ρ(Γ), is strictly less than 1, i.e., ρ(Γ) < 1. In accordance with Theorem 8.1.29 in [37], we know that, for a positive vector β = [β_1, . . . , β_4]^T ∈ R^4, if Γβ < β, then ρ(Γ) < 1 holds. The inequality Γβ < β is equivalent to

2η̂β_3 < β_1(γ − δ) − α̂(β_1ℓ̂ + β_2ℓ̂ + β_4),
2η̂β_3 < β_2(γ − c_1) − α̂(β_1ℓ̂ + β_4),
2η̂β_3 < β_3γ − β_1κ_1 − α̂(β_1ℓ̂ + β_2ℓ̂ + β_4),   (8.28)
2η̂β_3ℓ̂ < β_4(γ − δ) − β_3ℓ̂,

which further implies that

γ > (2η̂β_3 + β_1δ + α̂(β_1ℓ̂ + β_2ℓ̂ + β_4))/β_1,
γ > (2η̂β_3 + β_2c_1 + α̂(β_1ℓ̂ + β_4))/β_2,
γ > (2η̂β_3 + β_1κ_1 + α̂(β_1ℓ̂ + β_2ℓ̂ + β_4))/β_3,   (8.29)
γ > (2η̂β_3ℓ̂ + β_3ℓ̂ + β_4δ)/β_4.

Reviewing c_1 in Lemma 8.5, if α̂ < m/ℓ, it produces that c_1 = 1 − ϑ(1/m²)1_m^T α ≤ 1 − ϑα̂/m. To assure the positivity of η̂ (the right-hand sides of (8.28) should be positive), if 0 < γ < 1, (8.28) further gives

α̂ < β_1(γ − δ)/(β_1ℓ̂ + β_2ℓ̂ + β_4),
β_2 > m(β_1ℓ̂ + β_4)/ϑ,
α̂ < (β_3 − β_1κ_1)/(β_1ℓ̂ + β_2ℓ̂ + β_4),  β_3 > β_1κ_1,   (8.30)
β_4 > β_3ℓ̂/(1 − δ).

Now, we are in the position to pick the vector β = [β_1, . . . , β_4]^T to assure the solvability of α̂. Since δ < 1, first let β_1 be an arbitrary positive constant. Then, choose β_3 and β_4 to satisfy the third and fourth conditions in (8.30), respectively. Finally, select β_2 following the second condition in (8.30). Therefore, based on (8.30), the upper bound on the largest step-size α̂ in (8.26) is generated together with α̂ < m/ℓ. If 0 < γ < 1, then we reach the upper bounds on the maximum momentum coefficient η̂ following from (8.28) and the largest step-size α̂.

Since ||e^λ(t)|| ≤ √m Qω^t and ||e^y(t)|| ≤ √m Qω^t, one obtains γ^{−t}||e^λ(t)|| ≤ √m Q and γ^{−t}||e^y(t)|| ≤ √m Q when we choose ω = γ. Then, we can infer that all the elements (u_1, u_2, u_3, and u_4) of the vector u are uniformly bounded. Hence, by Lemma 8.3 (whose conditions are all satisfied), the results of Theorem 8.8 can be acquired. This completes the proof. □
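The key step in this proof, that a positive vector β with Γβ < β (componentwise) certifies ρ(Γ) < 1, is easy to check numerically. A small sketch with an arbitrary nonnegative matrix of our own (not a Γ produced by the chapter's parameters):

```python
import numpy as np

# A nonnegative matrix standing in for Gamma in (8.25)
G = np.array([[0.0, 0.3, 0.2, 0.1],
              [0.2, 0.0, 0.3, 0.1],
              [0.4, 0.2, 0.0, 0.1],
              [0.0, 0.0, 0.5, 0.0]])

beta = np.ones(4)                       # candidate positive vector
cert = np.all(G @ beta < beta)          # Gamma * beta < beta, componentwise
rho = max(abs(np.linalg.eigvals(G)))    # spectral radius of Gamma

# Theorem 8.1.29 in [37]: the certificate implies rho(Gamma) < 1,
# which by Lemma 8.3 yields linear convergence at rate O(gamma^t)
assert cert and rho < 1.0
```

With β = 1 the certificate reduces to all row sums being below 1; general positive β allows the same test on matrices whose row sums exceed 1 in some rows.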
Remark 8.9 Theorem 8.8 indicates that, once a particular EDP is specified, α̂ and η̂ can be calculated without much effort by appropriately selecting the other parameters such as ω, δ, μ, etc. Notice that the selection of α̂ and η̂ may require some global parameters, such as ϑ, ℓ, and ℓ̂. It is derived from [20] that the amount of preprocessing required to calculate these global values is considerably smaller than the (worst-case) running time of ET-DAPDA (see [20] for a specific analysis).
On the basis of Theorem 8.8, the linear convergence of the sequence {x(t)} is shown in the following.

Theorem 8.10 Given ET-DAPDA, (8.10) updates the sequences {x(t)}, {z(t)}, {λ(t)}, and {y(t)}. Under Assumptions 8.1–8.2, and letting α̂ and η̂ satisfy Theorem 8.8, the sequence {x(t)} linearly converges to x* at a rate of O(γ^{t/2}).

Proof As in many distributed primal–dual methods [7, 12, 26, 27], we prove this theorem by associating the primal variables with the Lagrangian multipliers. Similar to the proof of Theorem 2 in [12, 26, 27], we also obtain the following inequality:

Σ_{i=1}^m (μ_i/2)(x_i(t+1) − x_i*)² ≤ Σ_{i=1}^m (∇q_i(λ*)(λ_i(t) − λ*) + (1/(2μ_i))(λ_i(t) − λ*)² + λ_i(t)(x_i* − d_i)).   (8.31)

The fact x* ∈ P yields that Σ_{i=1}^m (x_i* − d_i) = 0. Additionally, as t → ∞, λ_i(t) → λ*, ∀i ∈ V. Hence, as t → ∞, both sides of (8.31) tend to zero. This shows that ||x(t) − x*|| converges linearly, i.e., ||x(t) − x*|| ≤ O(γ^{t/2}), given that ||λ(t) − 1_m λ*|| ≤ O(γ^t) as achieved in Theorem 8.8. The proof of Theorem 8.10 is accomplished. □
Remark 8.11 Recall from the conditions on the largest step-size α̂ and the maximum momentum coefficient η̂ in (8.26) and (8.27) of Theorem 8.8 that the conditions imposed on α̂ and η̂ by ET-DAPDA depend on the parameters (ℓ, ℓ̂, m) involving the cost functions, (δ, κ_1) involving the network topology, and (γ, β_1, β_2, β_3, β_4). Moreover, the tunable parameters β_1, β_2, β_3, and β_4 in Theorem 8.8 only rely on the parameters of the network and the cost functions. Hence, the design of α̂ and η̂ devolves on the complexity of the EDP. It is worth noting that when the EDP takes into consideration important issues such as non-convex and/or discontinuous cost functions, transmission line losses, valve-point effects, etc., ET-DAPDA may not be directly suitable or may need to be modified, so the conditions on α̂ and η̂ may take another form. This is an open problem and remains to be studied in the future.

8.4.3 The Exclusion of Zeno-Like Behavior

In what follows, the Zeno-like behavior of the discrete-time system will be excluded, i.e., inf_k {t_{k(t)+1}^i − t_{k(t)}^i} ≥ 2, ∀i ∈ V and t ≥ 0.

Theorem 8.12 Considering ET-DAPDA, (8.10) updates the sequences {x(t)}, {z(t)}, {λ(t)}, and {y(t)}. Select α̂, η̂, ω, and γ to satisfy Theorem 8.8, and assume that Assumptions 8.1–8.2 hold. Let the control gain h obey

0 < h < min{1/d̂, 1/(w_1 w_2)},   (8.32)

where w_1 = (1 + γ²)||L||√m/γ² and w_2 = (1 + ℓ̂)/(γ − δ − α̂ℓ̂) + 1/(γ − δ). Then, the event-triggered time sequence {t_{k(t)}^i} exhibits no Zeno-like behavior if Q satisfies

Q > ((1 + γ²)/γ²)(B_1 + B_2 + B_4 + B_1 + B_2),   (8.33)

where B_1, B_2, B_3, and B_4 are positive constants.


Proof Recalling that ρ(Γ) < 1, we obtain from Lemma 8.3 that ||v_1||^γ ≤ B_1, ||v_2||^γ ≤ B_2, ||v_3||^γ ≤ B_3, and ||v_4||^γ ≤ B_4, where B_1, B_2, B_3, and B_4 are positive constants. Utilizing Lemma 8.3, one has ||v_1(t)|| ≤ B_1γ^t, ||v_2(t)|| ≤ B_2γ^t, ||v_3(t)|| ≤ B_3γ^t, and ||v_4(t)|| ≤ B_4γ^t. Notice that the next event will not happen while the event-triggered condition ||e_i^λ(t)|| + ||e_i^y(t)|| − Qω^t > 0 is invalid; thus, we have

||e_i^λ(t)|| + ||e_i^y(t)|| ≤ ||λ_i(t_{k(t)}^i) − λ̄(t_{k(t)}^i)|| + ||λ̄(t_{k(t)}^i) − λ̄(t)|| + ||λ̄(t) − λ_i(t)|| + ||y_i(t_{k(t)}^i) − ȳ(t_{k(t)}^i)|| + ||ȳ(t_{k(t)}^i) − ȳ(t)|| + ||ȳ(t) − y_i(t)|| ≤ (B_1 + B_2 + B_4 + B_1 + B_2)γ^{t_{k(t)}^i} + (B_2γ^t + B_1 + B_4 + B_2 + B_1)γ^t.   (8.34)

If t_{k(t)+1}^i meets the condition (8.5), one has (recall that ω = γ is selected to satisfy Theorem 8.8)

Qγ^{t_{k(t)+1}^i} ≤ ||e_i^λ(t_{k(t)+1}^i)|| + ||e_i^y(t_{k(t)+1}^i)||.   (8.35)

It is deduced from (8.34) and (8.35) that

t_{k(t)+1}^i − t_{k(t)}^i ≥ ln((B_1 + B_2 + B_4 + B_1 + B_2)/(B_1 + B_2 + B_4 + B_1 + B_2)) / ln γ.   (8.36)

Selecting h such that (8.32) holds, the set of admissible Q in (8.33) is strictly nonempty, and thus t_{k(t)+1}^i − t_{k(t)}^i ≥ 2 in (8.36) is guaranteed. The proof of Theorem 8.12 is accomplished. □
Remark 8.13 Note that many metaheuristic algorithms (including the genetic algorithm, ant colony optimization, PSO, artificial neural network algorithms, etc.) can also solve the EDP (8.1), in terms of both identifying the best solution and the computation time. We summarize three advantages of applying ET-DAPDA over applying metaheuristic algorithms to the EDP: (1) ET-DAPDA can guarantee the globally optimal solution of the EDP, whereas metaheuristic algorithms involve randomness; (2) ET-DAPDA produces a fixed output for a fixed input, while a metaheuristic algorithm generally does not; (3) ET-DAPDA can guarantee a fixed efficiency, but a metaheuristic algorithm cannot guarantee its efficiency because of the randomness.

8.5 Numerical Examples

This section validates the theoretical results and demonstrates the favorable performance of ET-DAPDA. All the simulations are carried out in MATLAB on a 2017 MacBook Pro with 8 GB of memory and a 2.3 GHz dual-core Intel Core i5 processor.

8.5.1 Example 1: EDP on the IEEE 14-Bus System

Consider the EDP on the IEEE 14-bus system as described in [7], where {1, 2, 3, 6, 8} are generator buses. Each generator's cost function is represented by f_i(x_i) = c_i x_i + b_i x_i², where the cost coefficients (c_i, b_i) of each generator i and the generators' generation capacities (x_i ∈ [x_i^min, x_i^max]) are summarized in Table 8.1 [7]. The total demand of this system is assumed to be Σ_{i=1}^m d_i = 380 MW. By running ET-DAPDA (8.10), it is deduced from Fig. 8.1a and b that the whole system successfully implements the economic dispatch, and the optimal power generations are x_1* = 80 MW, x_2* = 90 MW, x_3* = 64.67 MW, x_6* = 70 MW, and x_8* = 75.33 MW. By computation, the system incurs a total cost of $2176. In addition, each generator i's event-triggered sampling time instants and one randomly selected measurement error are depicted in Fig. 8.1c and d, respectively. It can be derived from the statistics of Fig. 8.1c that the numbers of sampling times for the 5 generators are [98, 84, 93, 96, 113], with an average of 97. Thus, the average sampling rate is 97/600 ≈ 16.17%, which indicates that Zeno-like behavior does not happen. Figure 8.1d shows that the measurement error decreases to zero asymptotically. Finally, the calculation time taken by ET-DAPDA for solving the EDP on the IEEE 14-bus system is 0.2828 s.
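The reported dispatch can be cross-checked with a simple centralized computation: with quadratic costs, every generator not at a capacity limit operates at a common incremental cost λ, and x_i(λ) = clip((λ − c_i)/(2b_i), [x_i^min, x_i^max]) is nondecreasing in λ, so a bisection on λ recovers the optimum of (8.1). The Python sketch below reproduces the Table 8.1 solution (a verification aid of ours, not the ET-DAPDA iteration):

```python
import numpy as np

# Table 8.1 data: f_i(x) = c_i*x + b_i*x^2, capacities [0, xmax_i], demand 380 MW
b = np.array([0.04, 0.03, 0.035, 0.03, 0.04])
c = np.array([2.0, 3.0, 4.0, 4.0, 2.5])
xmax = np.array([80.0, 90.0, 70.0, 70.0, 80.0])
demand = 380.0

def dispatch(lam):
    """Equal-incremental-cost rule, clipped to the generator capacities."""
    return np.clip((lam - c) / (2.0 * b), 0.0, xmax)

low, high = 0.0, 100.0
for _ in range(100):                    # bisection on the incremental cost
    lam = 0.5 * (low + high)
    if dispatch(lam).sum() < demand:
        low = lam
    else:
        high = lam

x = dispatch(lam)
cost = float(np.sum(c * x + b * x**2))
print(np.round(x, 2), round(cost, 1))   # -> [80. 90. 64.67 70. 75.33] and 2176.4
```

The result matches the dispatch and total cost reported above, with generators 1, 2, and 6 pinned at their upper capacities.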

8.5.2 Example 2: EDP on Large-Scale Networks

To demonstrate the application of ET-DAPDA to large-scale networks, the EDP on the IEEE 118-bus system [12] is considered in this example. The IEEE 118-bus system contains 54 generators, which are connected by a large number of lines [12]. Each generator i's cost function is given by f_i(x_i) = c_i x_i² + b_i x_i + a_i, where c_i ∈ [0.002, 0.071], b_i ∈ [8.335, 37.697], and a_i ∈ [6.78, 74.33] are adjustable coefficients with suitable units. The generating capacities differ from generator to generator, i.e., x_i ∈ [x_i^min, x_i^max], where x_i^min ∈ [0, 150] MW and x_i^max ∈ [20, 450] MW. Notice that the overall system demands Σ_{i=1}^m d_i = 6000 MW of power generation. Figure 8.2 depicts that ET-DAPDA successfully achieves the desired results even in this kind of large-scale network. With this optimal schedule of power at each generator, the optimal power generations of the 54 generators are [5, 5, 5, 292.1342, 292.1348, 10, 55.5241, 5, 5,

Table 8.1 Generator parameters in the IEEE 14-bus system

Bus   b_i ($/MW²)   c_i ($/MW)   [x_i^min, x_i^max] (MW)
1     0.04          2.0          [0, 80]
2     0.03          3.0          [0, 90]
3     0.035         4.0          [0, 70]
6     0.03          4.0          [0, 70]
8     0.04          2.5          [0, 80]

Fig. 8.1 EDP on the IEEE 14-bus system: (a) power generation (MW) of each generator; (b) total generation versus total demand (MW); (c) event sampling time instants; (d) measurement error ||e_1(t)|| versus the threshold Qω^t.
Fig. 8.2 EDP on the IEEE 118-bus test system: (a) power generation (MW) of each generator; (b) total generation versus total demand (MW); (c) event sampling time instants of five (randomly selected) generators; (d) measurement error versus the threshold Qω^t.
292.1353, 350, 8, 8, 55.5245, 8, 55.5244, 8, 8, 55.5257, 250, 250, 55.5234, 55.5238,


200, 200, 55.5246, 420, 420, 292.1332, 41.0539, 10, 5, 5, 55.5241, 55.5252,
292.1356, 55.5244, 10, 300, 200, 8, 20, 292.1361, 292.1348, 292.1350, 8, 55.5244,
55.5251, 8, 25, 55.5249, 55.5259, 55.5245, 25] MW, and the system is suffered a
total cost 9.262 × 104$. Finally, the calculation time obtained by ET-DAPDA for
solving EDP on the IEEE 118-bus system is 6.3838s.
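For readers who want to reproduce the flavor of these dispatch results, the box-constrained quadratic EDP can be solved by a classical lambda-iteration, i.e., bisection on the incremental cost. The sketch below is an illustration of ours, not the authors' ET-DAPDA: it is centralized, it assumes the quadratic cost model C_i(x) = b_i x^2 + c_i x with the Table 8.1 coefficients, and the 300 MW demand is a made-up value.

```python
# Illustrative sketch (not the authors' ET-DAPDA): centralized lambda-iteration
# (bisection on the incremental cost) for the quadratic-cost EDP
#   min sum_i b_i x_i^2 + c_i x_i   s.t.  sum_i x_i = D,  x_i in [lo_i, hi_i],
# using the generator data of Table 8.1. The demand D below is a made-up value.

def dispatch(b, c, lo, hi, demand, tol=1e-9):
    """Return the optimal generations via bisection on the multiplier lam."""
    def x_of(lam):
        # Unconstrained minimizer of b_i x^2 + (c_i - lam) x, clipped to the box.
        return [min(max((lam - ci) / (2.0 * bi), l), h)
                for bi, ci, l, h in zip(b, c, lo, hi)]

    lam_lo = min(c)                                  # total output 0 here
    lam_hi = max(ci + 2 * bi * h for bi, ci, h in zip(b, c, hi))  # all at max
    while lam_hi - lam_lo > tol:
        lam = 0.5 * (lam_lo + lam_hi)
        if sum(x_of(lam)) < demand:
            lam_lo = lam   # total generation too small -> raise the price
        else:
            lam_hi = lam
    return x_of(0.5 * (lam_lo + lam_hi))

# Table 8.1 data (buses 1, 2, 3, 6, 8 of the IEEE 14-bus system).
b = [0.04, 0.03, 0.035, 0.03, 0.04]
c = [2.0, 3.0, 4.0, 4.0, 2.5]
lo = [0.0] * 5
hi = [80.0, 90.0, 70.0, 70.0, 80.0]
x = dispatch(b, c, lo, hi, demand=300.0)   # 300 MW is an assumed demand
print([round(v, 3) for v in x], round(sum(x), 3))
```

Because the total generation is nondecreasing in the multiplier, the bisection always converges; at the interior optimum every online generator runs at the same incremental cost 2 b_i x_i + c_i.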

8.5.3 Example 3: Comparison with Related Methods

To verify the convergence performance of ET-DAPDA on the analyzed systems (IEEE 14-bus and 118-bus systems), the results obtained in this chapter are compared to those obtained by the existing related methods [23, 26, 27, 34]. Here, the required parameters are the same as in Examples 1 and 2.
(a) First, the convergence performance comparison is conducted on each of the above two systems, where the residual E(t) = log_10(∑_{i=1}^{m} ||x_i(t) − x_i^*||), t ≥ 0, is the comparison metric. Figure 8.3 shows the following four facts: (1) ET-DAPDA still achieves the same linear convergence rate as in [23, 26] even under event-triggered control; (2) ET-DAPDA improves the convergence rate compared with the applicable event-triggered methods [27] without momentum terms; (3) when the largest step-size is around α̂ = 0.0042, the convergence performance is good; (4) ET-DAPDA performs better than the centralized method [34] even in large-scale networks. Finally, the accuracy of the computed optimal power generation lies in [10^−30, 10^−25] on the IEEE 14-bus system and in [10^−25, 10^−20] on the IEEE 118-bus system, respectively, which means that the accuracy is not seriously affected even in large-scale networks.
(b) Second, the cost (for each analyzed system) obtained by ET-DAPDA and the other related methods [23, 26, 27, 34] is depicted in Fig. 8.4. Figure 8.4 shows the following two facts: (1) ET-DAPDA adopts two effective strategies (an event-triggered control strategy and a momentum acceleration strategy) that make it superior to the other methods [23, 26, 27, 34] in terms of computation and convergence rate, while still attaining the same optimal cost of the EDP as the other proven methods [23, 26, 27, 34] for each analyzed system; (2) ET-DAPDA quickly obtains the optimal cost in large-scale networks.
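As a side note on how such residual curves are produced, the metric E(t) can be evaluated directly from the iterate history. The sketch below uses a synthetic, geometrically convergent trajectory (made-up numbers, not the actual ET-DAPDA iterates): for such a run, E(t) decreases affinely in t, which is exactly how linear convergence appears on a log scale.

```python
# Minimal sketch of the comparison metric of Fig. 8.3:
#   E(t) = log10( sum_i ||x_i(t) - x_i*|| ).
# The trajectory below is synthetic (geometric decay toward x*), standing in
# for the actual ET-DAPDA iterates; it only illustrates how E(t) is evaluated.
import math

def residual(iterates, x_star):
    """iterates[t][i] is node i's estimate at time t; x_star[i] its optimum."""
    out = []
    for xs in iterates:
        total = sum(abs(x - s) for x, s in zip(xs, x_star))
        out.append(math.log10(total))
    return out

x_star = [55.5241, 292.1342]                      # made-up optima
iterates = [[55.5241 + 10 * 0.9 ** t, 292.1342 - 5 * 0.9 ** t]
            for t in range(50)]
E = residual(iterates, x_star)
print(E[0], E[-1])   # a linearly convergent run gives an affine-in-t residual
```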

8.6 Conclusion

This chapter develops and analyzes ET-DAPDA for handling the EDP over a connected undirected network. ET-DAPDA not only allows uncoordinated constant step-sizes but also, most importantly, integrates a gradient tracking strategy with two types of momentum terms for faster convergence. Provided that the largest step-size and the maximum momentum coefficient are positive and sufficiently smaller than certain constants, it is proved that ET-DAPDA linearly seeks the exact optimal solution with an explicit linear convergence rate under the assumptions on the networks and the cost functions. Numerical experiments validate the theoretical results. Nevertheless, ET-DAPDA is not perfect. For instance, it is not yet appropriate for complex networks and other practical EDP scenarios. Thus, future work can focus on designing similar algorithms for directed or time-varying directed networks; considering the EDP with power loss, transmission line losses, valve-point effects, prohibited operating zones of generators, and ramp-rate limits of generators; and investigating other methods (hybrid and PSO-oriented methods) for the EDP.

Fig. 8.3 Comparison with related methods, with the residual E(t) as the comparison metric: (a) residual on the IEEE 14-bus system [17]; (b) residual on the IEEE 118-bus system [12]

Fig. 8.4 Comparison with related methods, with the obtained cost of the EDP as the comparison metric: (a) cost on the IEEE 14-bus system [17], settling near 2176 $ (data tip at iteration 503); (b) cost on the IEEE 118-bus system [12], settling near 9.262 × 10^4 $ (data tip at iteration 1877)

References

1. B. Jeddi, V. Vahidinasab, P. Ramezanpour, J. Aghaei, M. Shafie-khah, J. Catalao, Robust


optimization framework for dynamic distributed energy resources planning in distribution
networks. Int. J. Electr. Power Energy Syst. 110, 419–433 (2019)
2. X. He, W. Xue, H. Fang, Consistent distributed state estimation with global observability over
sensor network. Automatica 92, 162–172 (2018)
3. X. Xing, L. Xie, H. Meng, Cooperative energy management optimization based on distributed
MPC in grid-connected microgrids community. Int. J. Electr. Power Energy Syst. 107, 186–199
(2019)
4. Q. Lü, X. Liao, T. Xiang, H. Li, T. Huang, Privacy masking stochastic subgradient-push
algorithm for distributed online optimization. IEEE Trans. Cybern. 51(6), 3224–3237 (2021)
5. X. Mao, W. Zhu, L. Wu, B. Zhou, Optimal allocation of dynamic VAR sources using zoning-
based distributed optimization algorithm. Int. J. Electr. Power Energy Syst. 113, 952–962
(2019)
6. M. Ogura, V. Preciado, Stability of spreading processes over time-varying large-scale networks.
IEEE Trans. Netw. Sci. Eng. 3(1), 44–57 (2016)
7. Y. Yuan, H. Li, J. Hu, Z. Wang, Stochastic gradient-push for economic dispatch on time-varying
directed networks with delays. Int. J. Electr. Power Energy Syst. 113, 564–572 (2019)
8. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Contr. Theory
Appl. 13(7), 2800–2810 (2019)
9. T. Liu, X. Tan, B. Sun, Y. Wu, D. Tsang, Energy management of cooperative microgrids: a
distributed optimization approach. Int. J. Electr. Power Energy Syst. 96, 335–346 (2018)
10. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained
optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst.
50(7), 2612–2622 (2020)
11. W. Liu, M. Chi, Z. Liu, Z. Guan, J. Chen, J. Xiao, Distributed optimal active power dispatch
with energy storage units and power flow limits in smart grids. Int. J. Electr. Power Energy
Syst. 105, 420–428 (2019)
12. Q. Lü, X. Liao, H. Li, T. Huang, Achieving acceleration for distributed economic dispatch in
smart grids over directed networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1988–1999 (2020)
13. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
14. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: conver-
gence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
15. Q. Lü, X. Liao, H. Li, T. Huang, A Nesterov-like gradient tracking algorithm for distributed
optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 51(10), 6258–6270
(2021)
16. A. Bedi, A. Koppel, K. Rajawat, Beyond consensus and synchrony in online network
optimization via saddle point method (2017). Preprint arXiv:1707.05816
17. Q. Lü, H. Li, Event-triggered discrete-time distributed consensus optimization over time-
varying graphs. Complexity 2017, 1–13 (2017)
18. Q. Lü, H. Li, D. Xia, Distributed optimization of first-order discrete-time multi-agent systems
with event-triggered communication. Neurocomputing 235, 255–263 (2017)
19. S. Shahrampour, A. Jadbabaie, Distributed online optimization in dynamic environments using
mirror descent. IEEE Trans. Autom. Control 63(3), 714–725 (2018)
20. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization
over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
21. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)

22. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization


with uncoordinated step-sizes, in 2017 American Control Conference (ACC) (2017). https://
doi.org/10.23919/ACC.2017.7963560
23. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under
time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
24. Y. Kajiyama, N. Hayashi, S. Takai, Distributed subgradient method with edge-based event-
triggered communication. IEEE Trans. Autom. Control 63(7), 2248–2255 (2018)
25. Q. Lü, H. Li, X. Liao, H. Li, Geometrical convergence rate for distributed optimization
with zero-like-free event-triggered communication scheme and uncoordinated step-sizes, in
Proceedings of the 7th International Conference on Information Science and Technology
(ICIST) (2017). https://doi.org/10.1109/ICIST.2017.7926783
26. T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic dis-
patch/demand response in power systems (2016). Preprint arXiv:1609.06660
27. J. Wang, H. Li, Z. Wang, Distributed event-triggered scheme for economic dispatch in power
systems with uncoordinated step-sizes. IET Gener. Transm. Distrib. 13(16), 3612–3622 (2019)
28. E. Naderi, A. Azizivahed, H. Narimani, M. Fathi, M. Narimani, A comprehensive study of
practical economic dispatch problems by a new hybrid evolutionary algorithm. Appl. Soft.
Comput. 67, 1186–1206 (2017)
29. H. Narimani, S. Razavi, A. Azizivahed, E. Naderi, M. Fathi, M. Ataei, M. Narimani, A multi-
objective framework for multi-area economic emission dispatch. Energy 154, 126–142 (2018)
30. E. Naderi, M. Pourakbari-Kasmaei, H. Abdi, An efficient particle swarm optimization
algorithm to solve optimal power flow problem integrated with FACTS devices. Appl. Soft.
Comput. 80, 243–262 (2019)
31. E. Naderi, M. Pourakbari-Kasmaei, M. Lehtonen, Transmission expansion planning integrated
with wind farms: a review, comparative study, and a novel profound search approach. Int. J.
Electr. Power Energy Syst. 115, 105460 (2020)
32. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control
65(6), 2566–2581 (2020)
33. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
34. N. Li, L. Chen, S. Low, Optimal demand response based on utility maximization in power
networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES)
(2011). https://doi.org/10.1109/PES.2011.6039082
35. N. Li, L. Chen, M. Dahleh, Demand response using linear supply function bidding. IEEE Trans.
Smart Grid 6(4), 1827–1838 (2015)
36. Y. Tian, Y. Sun, G. Scutari, Achieving linear convergence in distributed asynchronous multi-
agent optimization. IEEE Trans. Autom. Control 65(12), 5264–5279 (2020)
37. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for
convex and non-convex optimization (2016). Preprint arXiv:1604.03257
Chapter 9
Privacy Preserving Algorithms for Distributed Online Learning

Abstract In this chapter, we focus on introducing a distributed online optimization


problem for a set of nodes communicating on a time-varying unbalanced directed
network, while considering the problem of how to preserve the privacy of their local
cost functions. The main goal of this set of nodes is to cooperatively minimize the
sum of all locally known convex cost functions (global cost function). We propose a
differentially private distributed stochastic subgradient-push algorithm, named DP-
DSSP, to solve such an optimization problem in a collaborative and distributed
manner. The algorithm ensures that nodes interact with their in-neighbors and
collectively optimize the global cost function. Unlike most existing distributed
algorithms that do not consider privacy issues, DP-DSSP successfully protects
the privacy of participating nodes through a differential privacy strategy, which
is more practical in applications involving sensitive information, such as military
affairs or medical treatment. An important feature of DP-DSSP is that it handles
distributed online optimization problems in the context of time-varying unbalanced
directed networks. Theoretical analysis shows that DP-DSSP can effectively protect
differential privacy and can achieve sublinear regret. The tradeoff between the level
of privacy and accuracy of DP-DSSP is also revealed. Moreover, DP-DSSP is able
to handle arbitrarily large but uniformly bounded delays in the communication link.
Finally, simulation experiments confirm the usefulness of DP-DSSP and the findings
of this chapter.

Keywords Distributed online optimization · Differential privacy · Stochastic subgradient-push algorithm · Time-varying directed networks · Communication delays

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks,
https://doi.org/10.1007/978-981-19-8559-1_9

9.1 Introduction

The distributed convex optimization problem has attracted the interest of many researchers in the past few decades with the tremendous advances in advanced technology and low-cost devices. Numerous engineering applications can be viewed as distributed convex optimization problems, such as robust control [1], smart grids [2], model prediction [3], and smart metering [4], among others [5–11]. This type
of problem requires a fully distributed optimization algorithm. Unlike traditional
centralized approaches, distributed algorithms involve multiple nodes that have
access to their own local information without the existence of a central coordinator
(node) to obtain all the information on the network [12].
A typical feature of many practical distributed optimization scenarios is that they must adapt to dynamic changes and uncertain environments. These problems with uncertainties can be regarded as distributed online optimization, where the cost function changes over time and adaptive decisions, relying only on previously revealed information, must be made at each time step. In view of distributed online optimization problems, the target of researchers is to develop distributed online algorithms in a coordinated manner. To estimate the performance of distributed online algorithms, it is conventional to capture the difference between the cost incurred by the algorithm through the sequential cost functions and the cost incurred by the best fixed decision in hindsight. This metric, the difference between the two costs, is called regret. In addition, a distributed online algorithm is declared "good" when the achieved regret is sublinear [13]. In terms of the distributed online optimization problems
over networks, many significant results have recently emerged [14–25]. Nunez and
Cortes [14] designed a distributed online subgradient descent algorithm on time-
varying balanced directed networks, which allowed proportional–integral feedback
on the divergence among neighboring nodes. Akbari et al. [15] investigated a
distributed online subgradient-push algorithm on time-varying unbalanced directed
networks. Then, Bedi et al. [16] proposed an asynchronous stochastic saddle-
point method to solve online expected risk minimization problem, while Pradhan
et al. [17] considered distributed online non-parametric optimization problem.
Subsequently, with the development of the distributed optimization, the work of
the distributed online optimization was further explored. Some notable approaches
that addressed the distributed online optimization problems over networks mainly
included distributed proximal gradient algorithm [18], saddle-point algorithm [19],
distributed mirror descent algorithm [20], distributed dual averaging algorithm [21],
distributed regression algorithm [22], etc.
However, in the above algorithms, nodes have to accept the fact that their
privacy, at least some of it, will be inevitably disclosed during information sharing.
From the perspective of preserving nodes' privacy, differential privacy is one of the most insightful privacy strategies. It was initially proposed by Dwork [26] and
gained much attention because of its rigorous formulation and the proof of security
properties. In general, differential privacy guarantees that the malicious node
finds little sensitive information of any other nodes. Recently, several privacy
preserving optimization algorithms have been presented in the literature [27–31],
where the controlling idea is to inject stochastic noises or offsets in the node
communications and updates. Although the algorithms in [27–31] could figure
out distributed optimization problem with privacy considerations, they were only
suitable for problems with time-invariant cost functions. On the other hand, the
study of stochastic optimization problems is also profound. The addition of noise
to optimization algorithms has a long history originally arising from statistical


mechanics (such as simulated annealing) [32] and given significant treatment in [33]
that presented a stochastic approximation method and included the detailed analysis
of convergence in the context of diverse noise models. Distributed implementation
of stochastic gradient methods has received increasing attention in recent years
[34–36]. Ram et al. [34] proposed a stochastic subgradient projected algorithm for
solving convex optimization problem subject to a common convex constraint set.
Then, Srivastava and Nedic [35] utilized two diminishing step-sizes to deal with
communication noises and subgradient errors, respectively. The recent application of stochastic gradient methods to non-convex optimization in machine learning started with [36], which first ensured global convergence of the methods on non-convex functions with exponentially many local minima and saddle points. However, all of these works [32–36] neglected the preservation of nodes' privacy.
Of significant relevance to our work are the recent developments in [37] and
[38]. Under time-varying balanced directed networks, Zhu et al. [37] established
an intuitive method with differential privacy strategy to solve the distributed
online optimization problem. Then, Li et al. [39] studied the privacy preserving of distributed online learning and discussed in detail the privacy levels of the proposed differentially private online algorithm. Note that, although the privacy of nodes was preserved, the two approaches in [37] and [39] assumed that the interaction networks were balanced, whereas handling unbalanced networks is considered the main difficulty and challenge in implementing distributed algorithms. This therefore limits the applicability of these algorithms in a number of real-world settings, such as peer-to-peer, ad hoc, and wireless sensor networks. Concerning this limitation, Nedic and
Olshevsky [38] designed a subgradient-push algorithm by incorporating distributed
subgradient descent into push-sum mechanism on time-varying unbalanced directed
networks. In this setting, some interesting generalized algorithms [40] (the dis-
tributed stochastic subgradient push) and [15] (the distributed online subgradient
push) were developed. Regrettably, the privacy issues of nodes were ignored in
the above algorithms [15, 38, 40]. To sum up, there has not yet been any prior work devoted to designing algorithms that not only preserve the privacy of participating nodes but also solve distributed online optimization problems on time-varying unbalanced directed networks. It is therefore of great significance to discuss such a challenging issue owing to its practicality.
In this chapter, our focus is to initiate the study of differentially private distributed
online convex optimization problems in dynamic environment. To this aim, a fully
distributed online algorithm with differential privacy strategy is proposed, for which
a desired privacy level and the expected regrets can be obtained even on time-
varying unbalanced directed networks. We hope to contribute to a broad theory of distributed online convex optimization, and our motivation is to design a completely distributed algorithm that is adaptable and facilitates real-world applications. In general, the key contributions of the present work are as follows:

(i) We design and discuss a differentially private distributed stochastic subgradient-push algorithm, named DP-DSSP, by incorporating randomized
perturbation technique into stochastic subgradient-push mechanism. More
importantly, DP-DSSP is completely distributed, and it just needs each node
to possess its own out-degree without any further knowledge of the number of
nodes or network topologies. DP-DSSP is therefore easier to be implemented
than the existing methods [14, 30, 41].
(ii) In comparison with the algorithms in [14] for online optimization problems
and [37] for privacy issues, DP-DSSP not only solves differentially private
distributed online convex optimization problems but also applies to more
general interaction networks, i.e., time-varying unbalanced directed networks.
To overcome the imbalance induced by time-varying directed networks, a
push-sum mechanism [38, 40] is exploited for designing DP-DSSP in order
to counteract the effect of network’s imbalance.
(iii) DP-DSSP can be considered as an extension of the algorithms in [37] and
[40] for both preserving the privacy of participating nodes and solving the
distributed online convex optimization problems. Specifically, DP-DSSP not
only guarantees the privacy of nodes’ cost functions but also achieves sublinear
(logarithmic/square-root) regrets for diverse (strongly/generally convex) cost
functions for a fixed privacy level. Therefore, in DP-DSSP, the performance is
not seriously affected while maintaining differential privacy.
(iv) A compromise between the desired level of privacy preserving and the
accuracy of DP-DSSP is revealed. Namely, by fixing other parameters, the
upper bounds of expected regrets of DP-DSSP possess the order of magnitude
O(1/ε2 ). Furthermore, the robustness of DP-DSSP is investigated in the
presence of arbitrary but uniformly bounded network communication delays
[42]. This thus makes it more universal and practical in real applications.

9.2 Preliminaries

9.2.1 Notation

If not particularly stated, the vectors mentioned in this chapter are column vectors.
Let R≥0 , Z≥0 , and Z>0 be the set of non-negative real numbers, the set of non-
negative integers, and the set of positive integers, respectively. A quantity (possibly a vector) allocated to node i is indexed by a subscript i; e.g., z_i(t) is the estimate
of node i at time t. The n × n identity matrix is denoted as In , and the column
vectors of all ones and all zeros are denoted as 1 and 0 (appropriate dimensional),
respectively. The transposes of a vector z and a matrix W are represented as zT and
W T , respectively. Let the symbol ·, · denote the inner product of two vectors. The
Euclidean norm (vectors) and 1-norm are written as || · || and || · ||1 , respectively.
Given a random/stochastic variable Z, P[Z] and E[Z] denote its probability and
expectation, respectively. We let W(t : k) = W(t) · · · W(k), ∀t ≥ k ∈ Z≥0, represent the product of the time-varying matrices W(t), . . . , W(k). Also, denote W(t − 1 : t) = I, ∀t ∈ Z>0. Assume that C (convex) is a subset of R^d (d is the dimension). The function f : C → R is convex if f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (κ(x, y)/2)||y − x||², ∀x, y ∈ C, where ∇f(x) is a subgradient of f at x and κ : C × C → R≥0. If κ(x, y) = κ > 0, ∀x, y ∈ C, in this relation, the function f is κ-strongly convex on C. Concerning a convex function f and a vector x ∈ R^d, ∇f(x) ∈ R^d denotes a subgradient of f at x if, for all y ∈ R^d, f(y) − f(x) ≥ ⟨∇f(x), y − x⟩.
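As a quick numerical illustration of the subgradient inequality just stated (a toy example of ours, not from the text): for f(x) = |x|, exactly the slopes g ∈ [−1, 1] satisfy f(y) − f(x) ≥ g · (y − x) at x = 0.

```python
# Toy check of the subgradient inequality f(y) - f(x) >= <g, y - x>
# for f(x) = |x|, where every g in [-1, 1] is a subgradient at x = 0.
def is_subgradient(f, g, x, ys):
    """Check the inequality on a grid of test points ys (scalar case)."""
    return all(f(y) - f(x) >= g * (y - x) - 1e-12 for y in ys)

f = abs
ys = [i / 10.0 for i in range(-50, 51)]           # grid on [-5, 5]
print(is_subgradient(f, 0.5, 0.0, ys))            # True: 0.5 works at 0
print(is_subgradient(f, 1.5, 0.0, ys))            # False: slope exceeds 1
```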

9.2.2 Model of Optimization Problem

In distributed online optimization problems, suppose that there is a series of time-varying global convex cost functions that are not known in advance and are gradually revealed over time. More specifically, at each time t ∈ {1, . . . , T} (T ∈ Z>0 is the time horizon), each node i ∈ V selects an estimate z_i(t) ∈ R^d. After this, node i observes a local convex cost function f_i^t : R^d → R and incurs the cost f_i^t(z_i(t)). Therefore, the network cost at each time t is represented by

f^t(z) = ∑_{i=1}^{n} f_i^t(z),

where z ∈ Rd is the decision vector. Then, nodes, at each time t, intend to minimize
the following optimization problem:


min_{z∈R^d} f^t(z) = ∑_{i=1}^{n} f_i^t(z),    (9.1)

where node i ∈ V is only aware of its individual convex cost function f_i^t. Suppose that f^t is not collectively known to any node, nor is it available at any single location.
In this chapter, we wish to design a class of distributed stochastic subgradient algorithms to reduce the total cost incurred by the nodes over a finite time horizon T ∈ Z>0. Due to the time-varying nature of distributed online optimization, regret is an indispensable metric when evaluating the convergence of the designed algorithm. Hence, the following classical network regret and individual regret of node j ∈ V are introduced [14]:


R(T) = ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z_i(t)) − ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z^*),

and

R_j(T) = ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z_j(t)) − ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z^*),

where z^* = arg min_{z∈R^d} ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z) is the best fixed decision.
Notice that the subgradients and other quantities such as {z_i(t)}_{t=1}^{T} and {f_i^t(z_i(t))}_{t=1}^{T}, i ∈ V, will become stochastic variables due to the presence of the stochastic noises in calculating the subgradients. Hence, it may be impossible to obtain deterministic regrets like the classical regrets described above while adopting stochastic subgradients. Moreover, E[·] denotes the expectation of a stochastic variable. Based on this, the concepts of pseudo-network-regret and pseudo-individual-regret associated with node j ∈ V are utilized in this chapter¹ [43]:

R(T) = E[ ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z_i(t)) ] − E[ ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z^*) ],    (9.2)

and

R_j(T) = E[ ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z_j(t)) ] − E[ ∑_{t=1}^{T} ∑_{i=1}^{n} f_i^t(z^*) ].    (9.3)

As a result, the main focus of this chapter is to investigate a distributed stochastic subgradient algorithm for each node i ∈ V on time-varying unbalanced directed networks such that the regrets in (9.2) and (9.3) are sublinear with respect to T, i.e., lim_{T→∞} R(T)/T = 0 and lim_{T→∞} R_j(T)/T = 0, j ∈ V. Intuitively, the sublinearity implies that the time-averaged value of the global cost function over the horizon T approaches the optimal value as T → ∞.
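To make the regret metric concrete, the following sketch runs plain centralized online gradient descent (not DP-DSSP) on a made-up sequence of scalar quadratic costs and evaluates the regret against the best fixed decision in hindsight; the average regret shrinks as T grows, which is exactly the sublinearity required above.

```python
# Illustration of sublinear regret (not DP-DSSP): centralized online gradient
# descent on the made-up sequence f_t(z) = (z - a_t)^2 with a_t alternating
# between 0 and 1. The best fixed decision in hindsight is z* = 0.5.
import math

def regret(T, step0=0.5):
    """Cumulative regret of projected online gradient descent over even T."""
    z, costs = 0.0, 0.0
    for t in range(1, T + 1):
        a = t % 2                      # target revealed only after z is chosen
        costs += (z - a) ** 2
        z -= (step0 / math.sqrt(t)) * 2 * (z - a)   # gradient step on f_t
        z = min(max(z, 0.0), 1.0)      # projection keeps iterates bounded
    best = 0.25 * T                    # sum_t (0.5 - a_t)^2 for even T
    return costs - best

for T in (100, 1000, 10000):
    print(T, regret(T) / T)            # the average regret shrinks with T
```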

9.2.3 Communication Network

In this chapter, the intrinsic interconnections among nodes are represented by a time-varying unbalanced directed network G(t) = {V, E(t)} at time t, where V = {1, 2, . . . , n} is the node set and E(t) ⊆ V × V is the edge set. An edge (i, j) ∈ E(t) indicates that node i can directly route information to node j at time t, where i is regarded as an in-neighbor of j and, conversely, j is viewed as an out-neighbor of i. At time t, we define N_i^in(t) = {j ∈ V | (j, i) ∈ E(t)} and N_i^out(t) = {j ∈ V | (i, j) ∈ E(t)} as the in-neighbor set and out-neighbor set of node i, respectively. The network is unbalanced at time t if |N_i^in(t)| ≠ |N_i^out(t)|, where | · | is the cardinality of a set. The time-varying network is said to be unbalanced if it is unbalanced at all times. The out-degree of node i at time t is denoted by d_i^out(t) = |N_i^out(t)|. A directed path is a series of consecutive directed edges. If there is at least one directed path between any pair of distinct nodes in the network, then the network is called strongly connected. The weighted adjacency matrix at time t corresponding to network G(t) is W(t) = [w_ij(t)] ∈ R^{n×n}, where w_ij(t) > 0 if (j, i) ∈ E(t), and w_ij(t) = 0 otherwise. In addition, at time t, W(t) is column-stochastic if ∑_{i=1}^{n} w_ij(t) = 1, ∀j ∈ V.

¹ The pseudo-network-regret and pseudo-individual-regret associated with node j ∈ V are brought forward to measure the difference between the expectation of the total cost incurred by the node's state estimation and the optimal expected total cost that could have been achieved by the best fixed decision in hindsight.
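A column-stochastic W(t) can be built with purely local knowledge: each node j assigns weight 1/(d_j^out(t)+1) to itself and to each of its out-neighbors, the standard choice in push-sum-type methods. A minimal sketch on a made-up three-node unbalanced digraph:

```python
# Sketch: building a column-stochastic weight matrix from out-degrees only,
# w[i][j] = 1/(d_j^out + 1) for i = j or (j, i) an edge. Each node j can form
# its column knowing nothing but its own out-neighborhood. Digraph is made up.
def column_stochastic_weights(n, edges):
    """edges is a set of directed pairs (j, i): j can send to i."""
    d_out = [sum(1 for (a, _) in edges if a == j) for j in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for j in range(n):
        w[j][j] = 1.0 / (d_out[j] + 1)          # node j keeps one share
        for (a, i) in edges:
            if a == j:
                w[i][j] = 1.0 / (d_out[j] + 1)  # one share per out-neighbor
    return w

# Unbalanced 3-node digraph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
W = column_stochastic_weights(3, {(0, 1), (0, 2), (1, 2), (2, 0)})
print([round(sum(W[i][j] for i in range(3)), 10) for j in range(3)])  # [1.0, 1.0, 1.0]
```

Note that the columns sum to one while the rows generally do not, which is precisely the imbalance that the push-sum ratio later corrects.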

9.3 Algorithm Development

In this section, we first introduce the differential privacy strategy. Then, a distributed
online algorithm with differential privacy strategy is developed to figure out the
formulated problem.

9.3.1 Differential Privacy Strategy

The definition of differential privacy was first presented by Dwork in [26] and sub-
sequently analyzed in detail in [27–31]. Differential privacy enables participating
nodes to share their individual information without revealing sensitive information
about their privacy. In this chapter, the differential privacy is employed to preserve
the local cost function of individual node. Next, we first give the definition of
adjacent relation.
Definition 9.1 (Adjacent Relation) Taking into account two datasets D = {f_i}_{i=1}^{n} and D' = {f_i'}_{i=1}^{n}, if there is i ∈ V such that f_i ≠ f_i' and f_j = f_j', ∀j ≠ i, then D and D' are called adjacent.
In other words, if the data of a single participant differs between the two datasets D and D' while the data of all other participants is the same, then the two datasets are adjacent. Informally, differential privacy requires that two adjacent datasets be nearly indistinguishable from the output of the randomized method acting on them. In view of differential privacy, we introduce the following definition.
242 9 Privacy Preserving Algorithms for Distributed Online Learning

Definition 9.2 (Differential Privacy) A randomized method M is ε-differentially private if, for any set of outputs Ψ ⊆ Range(M) and any adjacent datasets D and D', it holds that

P[M(D) ∈ Ψ] ≤ exp(ε) · P[M(D') ∈ Ψ],    (9.4)

where ε > 0 and Range(M) represents the output range of the randomized method M.
Inequality (9.4) indicates that whether or not an individual node participates in the dataset does not incur any remarkable difference in the output of the randomized method M. Hence, a malicious node gains little information about other nodes. The constant ε is a measurement of the privacy level of M; that is, a smaller ε means a higher level of privacy preserving. Thus, the constant ε encodes a compromise between the desired level of privacy preserving and the accuracy of M.
For the sake of ensuring differential privacy, we wish to know the "sensitivity" of M. It is deduced from [26] that the magnitude of the stochastic noise (perturbation) depends on the maximum change in the output of M caused by an individual entry in the data source. This amount is known as the sensitivity of the method, which is defined as follows.
Definition 9.3 (Sensitivity) At time t, the sensitivity of a method M is denoted as

Δ(t) = sup_{D_t, D_t' : Adj(D_t, D_t')} ||M(D_t) − M(D_t')||_1,    (9.5)

where Adj(D_t, D_t') denotes the adjacency relation of the adjacent datasets D_t and D_t' at time t.


Built on the concept of sensitivity, we observe that if the method maintains
the same level of privacy preserving, a higher sensitivity requires more stochastic
noise. Thus, we can control the magnitude of the stochastic noise via bounding the
sensitivity to ensure differential privacy.
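A standard way to turn a sensitivity bound into ε-differential privacy is the Laplace mechanism of [26], which perturbs the released value with Laplace noise of scale Δ/ε. The sketch below is illustrative only; the query value and sensitivity are made-up numbers.

```python
# Sketch of the Laplace mechanism: releasing a query with Laplace(delta/eps)
# noise yields eps-differential privacy when delta bounds the query's
# l1-sensitivity. The dataset and query below are made up for illustration.
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse-CDF from u ~ U(-1/2, 1/2)."""
    u = rng.random() - 0.5
    while abs(u) >= 0.5:               # guard the measure-zero endpoint
        u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_release(value, sensitivity, eps, rng):
    # Smaller eps -> larger noise scale -> stronger privacy, lower accuracy.
    return value + laplace_noise(sensitivity / eps, rng)

rng = random.Random(0)
true_answer = 0.37                     # made-up query answer
noisy = private_release(true_answer, sensitivity=0.01, eps=0.5, rng=rng)
print(noisy)
```

This makes the privacy/accuracy tradeoff of the text explicit: halving ε doubles the noise scale, and the expected absolute error of the release equals Δ/ε.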
Remark 9.4 With the development of artificial intelligence and the emergence
of 5G, data from multiple real-world application areas, such as disease preven-
tion/treatment and online ads recommendation, has become large, widely dis-
tributed, and rapidly changing. This requires data to be processed in real time
(online) to meet the need for rapid response to users, and a large amount of data
related to personal information needs to be preserved. Thus, designing an algorithm
which not only preserves the privacy of participating nodes but also solves the distributed online optimization problem will have far-reaching implications.

9.3.2 Differentially Private Distributed Online Algorithm

Recall that our goal is to solve problem (9.1) in a distributed manner while preserving the privacy of nodes. In the distributed setting, we assume that each node, at each time t, only acquires information transmitted by in-neighbors and sends its own estimates to its out-neighbors through a time-varying unbalanced directed network G(t). Privacy preservation means that the sensitive information of each node i ∈ V is well protected. Suppose that a malicious node can eavesdrop on information by monitoring the communication channels and intercepting messages exchanged among nodes. Traditional online approaches may thus leak sensitive information of nodes. To address this issue, DP-DSSP is employed to ensure that the privacy of the participating nodes is not leaked. It is worth mentioning that the performance of DP-DSSP inevitably suffers from varying degrees of degradation due to the injected stochastic noise. That is to say, the privacy level and the performance are inversely related.
We are now in a position to present DP-DSSP. In DP-DSSP, each node i ∈ V owns five estimates: x_i(t) ∈ R^d, h_i(t) ∈ R^d, y_i(t) ∈ R^d, s_i(t) ∈ R, and z_i(t) ∈ R^d. Then, each node i ∈ V implements the following updates at each time t ∈ {1, . . . , T}:

h_i(t) = x_i(t) + η_i(t),
y_i(t+1) = h_i(t)/(d_i^out(t) + 1) + Σ_{j ∈ N_i^in(t)} h_j(t)/(d_j^out(t) + 1),
s_i(t+1) = s_i(t)/(d_i^out(t) + 1) + Σ_{j ∈ N_i^in(t)} s_j(t)/(d_j^out(t) + 1),    (9.6)
z_i(t+1) = y_i(t+1)/s_i(t+1),
x_i(t+1) = y_i(t+1) − α(t+1) g_i(t+1),

where η_i(t) ∈ R^d, generated by node i ∈ V, is an independent and identically distributed stochastic variable drawn from a Laplace distribution Lap(σ(t)), and α(t+1) > 0 is the diminishing learning rate. The noisy subgradient of f_i^{t+1}(z) at z = z_i(t+1) is denoted by g_i^{t+1}(z_i(t+1)), abbreviated as g_i(t+1), and is assumed to satisfy

g_i^{t+1}(z_i(t+1)) = ∇f_i^{t+1}(z_i(t+1)) + τ_i(t+1),

where ∇f_i^{t+1}(z_i(t+1)) represents the subgradient of f_i^{t+1}(z) at z = z_i(t+1), and τ_i(t+1) ∈ R^d is an independent zero-mean stochastic noise term. The estimates of DP-DSSP (9.6) are initialized as x_i(0) ∈ R^d and s_i(0) = 1 for all i. In addition, for any t ∈ Z≥0 and i, j ∈ V, the non-negative matrix W(t) = [w_ij(t)] ∈ R^{n×n} follows

w_ij(t) = 1/(d_j^out(t) + 1) if j ∈ N_i^in(t) ∪ {i}, and w_ij(t) = 0 otherwise.    (9.7)
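One step of the updates (9.6) with the weights (9.7) can be sketched in code. The following is a minimal illustrative implementation (not the book's code; NumPy, the helper names, and the 3-node directed ring are assumptions):

```python
import numpy as np

def mixing_matrix(in_neighbors, n):
    """Column-stochastic W(t) from (9.7): w_ij = 1/(d_j^out(t)+1) for
    j an in-neighbor of i or j = i, and 0 otherwise."""
    out_deg = [sum(j in in_neighbors[i] for i in range(n) if i != j)
               for j in range(n)]
    W = np.zeros((n, n))
    for i in range(n):
        for j in set(in_neighbors[i]) | {i}:
            W[i, j] = 1.0 / (out_deg[j] + 1)
    return W

def dp_dssp_step(x, s, W, grad, alpha, sigma, rng):
    """One DP-DSSP iteration (9.6): perturb, mix numerator and
    denominator push-sum variables, de-bias, then take a gradient step."""
    eta = rng.laplace(scale=sigma, size=x.shape)  # privacy noise
    h = x + eta
    y = W @ h                                     # numerator mixing
    s_new = W @ s                                 # denominator mixing
    z = y / s_new[:, None]                        # de-biased estimate
    x_new = y - alpha * grad(z)
    return x_new, s_new, z

rng = np.random.default_rng(0)
W = mixing_matrix([[2], [0], [1]], n=3)           # directed 3-node ring
x, s = rng.normal(size=(3, 2)), np.ones(3)
x, s, z = dp_dssp_step(x, s, W, grad=lambda z: z, alpha=0.1, sigma=0.05, rng=rng)
```

The division z = y/s is the push-sum correction that compensates for the column-stochastic (rather than doubly stochastic) weights on the unbalanced directed network.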

In addition, the following assumptions are needed.


Assumption 9.1 (Bounded Subgradient [38]) Concerning a sequence of local convex cost functions {f_1^t, . . . , f_n^t}_{t=1}^T, for all i ∈ V, the subgradient ∇f_i^t(z) of f_i^t(z) at z is L_i-bounded over R^d, i.e.,

||∇f_i^t(z)|| ≤ L_i, ∀z ∈ R^d.

Denote by F_t the σ-field generated by all the information of DP-DSSP up to time t. Then, we give the conditions on the stochastic noise τ_i(t), i ∈ V, t ∈ {1, . . . , T}, as follows.

Assumption 9.2 (Stochastic Noise [40]) The stochastic noise τ_i(t), i ∈ V, t ∈ {1, . . . , T}, is a stochastic variable which is independent and satisfies E[τ_i(t) | F_{t−1}] = 0. Also, suppose that E[||τ_i(t)||] ≤ π_i for a constant π_i > 0.

The stochastic noise τ_i(t), i ∈ V, t ∈ {1, . . . , T}, is also considered in [37] and [40] and plays a significant role in real applications. In addition, the next assumption is needed.

Assumption 9.3 ([42]) For the sequence {G(t) = {V, E(t)}} of time-varying unbalanced directed networks, there is an integer B ∈ Z>0 such that the aggregate directed network G_B(t) = {V, ∪_{ℓ=t}^{t+B−1} E(ℓ)} is strongly connected for all t ∈ Z≥0.
The strong-connectivity bound B introduced in Assumption 9.3 need not be known by any of the nodes and is only employed in the convergence analysis [44]. Assumption 9.3 is more general than requiring each G(t) to be strongly connected [37, 45].
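Assumption 9.3 can be checked numerically. The sketch below (illustrative only; the 3-node edge sets are hypothetical) tests whether the union of B consecutive edge sets is strongly connected via an integer reachability closure:

```python
import numpy as np

def is_strongly_connected(adj):
    """Strong connectivity of a directed graph from a 0/1 adjacency
    matrix (adj[i, j] = 1 for edge i -> j), via reachability closure."""
    n = len(adj)
    reach = np.minimum(np.eye(n, dtype=int) + np.asarray(adj, dtype=int), 1)
    for _ in range(n):
        reach = np.minimum(reach + reach @ reach, 1)
    return bool(reach.all())

def b_strongly_connected(edge_seq, B, t=0):
    """Assumption 9.3: the union of E(t), ..., E(t+B-1) is strongly connected."""
    union = sum(edge_seq[(t + k) % len(edge_seq)] for k in range(B))
    return is_strongly_connected(np.minimum(union, 1))

# Hypothetical 3-node example: neither snapshot is strongly connected,
# but their union over B = 2 rounds is.
E0 = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])   # edges 0->1, 1->2
E1 = np.array([[0, 0, 0], [0, 0, 0], [1, 0, 0]])   # edge  2->0
```

This is exactly the sense in which Assumption 9.3 is weaker than per-step strong connectivity: individual snapshots may be disconnected as long as every window of B rounds restores connectivity.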

9.4 Main Results

This section presents the main results (ε-differential privacy and expected regrets)
in this chapter.
Theorem 9.5 (ε-Differential Privacy) Suppose that Assumptions 9.1–9.3 hold.
Let ηi (t) ∈ Rd , i ∈ V, t ∈ {1, . . . , T }, be introduced in DP-DSSP (9.6) with the
corresponding parameter σ (t) = Δ(t)/ε, where ε > 0. Then, DP-DSSP (9.6) can
guarantee ε-differential privacy.
Before bounding the expected regrets of an individual node, several crucial notations are first introduced. Let z* and Z* be defined, respectively, as an optimal solution and the nonempty optimal solution set of (9.1). Moreover, let μ̄ = (1/(2n)) Σ_{i=1}^n μ_i, L_max = max_{i∈V} L_i, and L = Σ_{i=1}^n L_i, where μ_i > 0, ∀i ∈ V. With the above preparations, it is now ready to present the expected regrets of this chapter.

Theorem 9.6 (Logarithmic Regret) Suppose that Assumptions 9.1–9.3 hold. Assume that each local cost function f_i^t, i ∈ V, is μ_i-strongly convex for all t ∈ {1, . . . , T}. Consider that DP-DSSP updates the sequence {z_1(t), . . . , z_n(t)}_{t=1}^T with α(t) = 1/(μ̂t), where μ̂ ∈ (0, μ̄]. Then, the pseudo-individual-regret of node j ∈ V can be upper bounded as

R̄_j(T) + Σ_{j=1}^n μ_j E[||ẑ_j(T) − z*||²] ≤ Ξ₁ + Ξ₂(1 + ln T),

where

ẑ_i(t) = (2 Σ_{k=1}^t k z_i(k)) / (t(t+1)),

Ξ₁ = nμ̂ E[||x̄(0) − z*||²] + (14CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁,

Ξ₂ = 14Cn^{3/2}p̂L/(δλμ̂(1−λ)) + 42√2 Cn^{3/2}dp̂L/(δεμ̂(1−λ)) + 16ndp̂²/ε² + 2ℓ²,

where δ > 0 and 0 < λ < 1 are connected to the network topology, x̄(0) = (1/n) Σ_{i=1}^n x_i(0), p̂ = max_{i∈V}{L_i + π_i}, and ℓ² = Σ_{i=1}^n (L_i + ν_i)².
Noticing that the learning rate α(t) in Theorem 9.6 relies on the time horizon T, the Doubling Trick scheme (DTS) [41], which does not rely on T, is applied. In DTS, we pick α(t) = 1/√(2^k) in each period of 2^k rounds (t = 2^k, . . . , 2^{k+1} − 1) for k = 0, 1, 2, . . . , ⌊log₂ T⌋. By means of DTS, the pseudo-individual-subregret associated with node j ∈ V between 2^k and 2^{k+1} can be defined as follows:²

R_j^{DTS}(k) = E[Σ_{t=2^k}^{2^{k+1}−1} Σ_{i=1}^n f_i^t(z_j(t))] − E[Σ_{t=2^k}^{2^{k+1}−1} Σ_{i=1}^n f_i^t(z*)].    (9.8)

Based on the above definition, the following theorem, i.e., square-root regret, is
immediately established for generally convex cost functions.
Theorem 9.7 (Square-Root Regret) Suppose that Assumptions 9.1–9.3 hold. Assume that each local cost function f_i^t, i ∈ V, is convex for all t ∈ {1, . . . , T}. Consider that DP-DSSP (9.6) generates the sequence {z_1(t), . . . , z_n(t)}_{t=1}^T with α(t) selected by DTS. Then, the pseudo-individual-regret of node j ∈ V can be upper bounded as

R̄_j(T) ≤ Σ_{k=0}^{⌊log₂ T⌋} R_j^{DTS}(k) ≤ (√2/(√2 − 1)) Γ₁ √T,

where

Γ₁ = n E[||x̄(0) − z*||²] + (14CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁ + 2ℓ²
+ 14Cn^{3/2}p̂L/(δλ(1−λ)) + 42√2 Cn^{3/2}dp̂L/(δε(1−λ)) + 16ndp̂²/ε².

² Here, we do not give a specific definition of the pseudo-network-subregret R̄^{DTS}(k); interested readers can define it similarly with reference to (9.2) and (9.8).

As pointed out in Theorems 9.6 and 9.7, the desired regrets of DP-DSSP can be derived for strongly convex (logarithmic regret) and generally convex (square-root regret) cost functions. In addition, the obtained regrets depend on the network size n and the vector dimension d. Notice that the achieved regrets possess the same orders O(ln T) and O(√T) as [14, 15] (without a differential privacy strategy) for a fixed ε, even though Laplace random noise is injected. Besides, for a fixed T, the regrets in Theorems 9.6 and 9.7 will increase (arbitrarily) as ε → 0. Namely, the regrets possess the order O(1/ε²).
Remark 9.8 DTS in Theorem 9.7 divides the original time series into subseries of increasing size (except the last subseries) and runs DP-DSSP on each subseries. Specifically, DTS divides the original time series t ∈ {1, . . . , T} into ⌊log₂ T⌋ + 1 subseries. In each subseries (except the last), DP-DSSP updates for 2^k rounds. Based on this, we can calculate the corresponding subregret in each subseries k = 0, 1, 2, . . . , ⌊log₂ T⌋, and the total regret is bounded by the sum of ⌊log₂ T⌋ + 1 subregrets. Therefore, DP-DSSP can completely implement t = 1, . . . , T iterations even in the case of DTS. More importantly, DTS does not require access to the global information T but still guarantees that DP-DSSP achieves √T regret. Hence, DTS is meaningful when compared with the existing works that do not utilize DTS and still attain √T regret growth [19, 20].
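For illustration (a sketch, not the chapter's code), the DTS learning-rate schedule described above can be written as:

```python
import math

def dts_learning_rate(t):
    """Doubling Trick: for round t in the block [2^k, 2^{k+1}) use
    alpha(t) = 1/sqrt(2^k); no knowledge of the horizon T is required."""
    k = int(math.floor(math.log2(t)))      # block index of round t
    return 1.0 / math.sqrt(2 ** k)
```

Each block restarts with a constant step size tuned to the block length, which is what makes the √T bound of Theorem 9.7 horizon-independent.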

9.4.1 Differential Privacy Analysis

This subsection demonstrates that DP-DSSP guarantees ε-differential privacy. In the process of information sharing, the sensitive information of an individual node may be invaded by a malicious node, so a differential privacy strategy is applied to preserve the sensitive information of nodes. In differential privacy, stochastic noise (a stochastic variable) following the Laplace distribution is injected into the estimate of each node at each time t. To achieve differential privacy, the sensitivity (as defined in Definition 9.3) of DP-DSSP is employed, which identifies the amount of noise that needs to be added. By bounding the sensitivity, the amount of stochastic noise needed to ensure ε-differential privacy is determined. Hence, we first bound the sensitivity of DP-DSSP as follows.
Lemma 9.9 Suppose that Assumptions 9.1 and 9.2 hold. The sensitivity of DP-DSSP (9.6), calculated as in (9.5), is bounded by

Δ(t) ≤ 2p̂ √d α(t+1),

where p̂ = max_{i∈V}{L_i + π_i}.


Proof This chapter requires that each node pursues the privacy of its local cost function being well preserved. Thus, the adjacent relation in this chapter implies that there is a node i ∈ V such that f_i^t ≠ f_i^{′t} and f_j^t = f_j^{′t}, ∀j ≠ i. Let x_i(t) and x′_i(t) be indicative, respectively, of the implementations of M(D_t) and M(D′_t). In light of the update of x_i(t) in (9.6) and Definition 9.3, it holds that

||M(D_t) − M(D′_t)||₁ = ||x_i(t+1) − x′_i(t+1)||₁
≤ α(t+1)(||g_i(t+1)||₁ + ||g′_i(t+1)||₁)
≤ √d α(t+1)(||g_i(t+1)|| + ||g′_i(t+1)||),

where it follows from Assumptions 9.1 and 9.2 that

E[||g_i(t+1)|| | F_t] ≤ L_i + π_i ≤ p̂.    (9.9)

Due to the arbitrary nature of the pair of adjacent datasets D_t and D′_t, by Definition 9.3, one yields

Δ(t) ≤ E[||x_i(t+1) − x′_i(t+1)||₁ | F_t] ≤ 2p̂ √d α(t+1),

which implies the desired result. □
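Lemma 9.9 fixes the noise schedule: with σ(t) = Δ(t)/ε and the bound Δ(t) ≤ 2p̂√d α(t+1), the required noise shrinks with the step size. A numeric sketch (the parameter values are arbitrary assumptions for illustration):

```python
import math

def noise_scale(t, p_hat, d, eps, mu_hat):
    """sigma(t) = Delta(t)/eps with the sensitivity bound of Lemma 9.9,
    Delta(t) <= 2*p_hat*sqrt(d)*alpha(t+1), and alpha(t) = 1/(mu_hat*t)."""
    alpha_next = 1.0 / (mu_hat * (t + 1))
    delta_bound = 2.0 * p_hat * math.sqrt(d) * alpha_next
    return delta_bound / eps

scales = [noise_scale(t, p_hat=2.0, d=4, eps=0.5, mu_hat=1.0) for t in range(1, 6)]
```

As Remark 9.10 observes, for a fixed ε the scale decays like 1/t, so later iterates are perturbed less.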


Then, the Proof of Theorem 9.5 is provided as follows.

Proof of Theorem 9.5 Define x(t) = [x₁(t), . . . , x_n(t)]^T ∈ R^{nd} and x′(t) = [x′₁(t), . . . , x′_n(t)]^T ∈ R^{nd}. Recalling Definition 9.3, one has

||x(t) − x′(t)||₁ ≤ Δ(t).    (9.10)

Noticing that x(t) and x′(t) lie in R^{nd}, from the property of the 1-norm, one gets

Σ_{i=1}^n Σ_{k=1}^d |x_{i,k}(t) − x′_{i,k}(t)| = ||x(t) − x′(t)||₁ ≤ Δ(t),    (9.11)

where x_{i,k}(t) and x′_{i,k}(t) are defined, respectively, as the k-th elements of x_i(t) and x′_i(t). Then, according to the property of the Laplace distribution [46] and (9.11), it suffices that

Π_{i=1}^n Π_{k=1}^d P[z_{i,k}(t) − x_{i,k}(t)] / P[z_{i,k}(t) − x′_{i,k}(t)]
= Π_{i=1}^n Π_{k=1}^d exp(−|z_{i,k}(t) − x_{i,k}(t)|/σ(t)) / exp(−|z_{i,k}(t) − x′_{i,k}(t)|/σ(t))
≤ Π_{i=1}^n Π_{k=1}^d exp(|z_{i,k}(t) − x_{i,k}(t) − z_{i,k}(t) + x′_{i,k}(t)|/σ(t))
≤ exp(||x(t) − x′(t)||₁/σ(t)) ≤ exp(Δ(t)/σ(t)).    (9.12)

Furthermore, one has

P[M(D) ∈ Ψ] = Π_{t=1}^T P[M(D_t) ∈ Ψ].    (9.13)

Integrating both sides of (9.12), by (9.13), we then acquire

Π_{t=1}^T P[M(D_t) ∈ Ψ] ≤ Π_{t=1}^T exp(Δ(t)/σ(t)) · P[M(D′_t) ∈ Ψ],

which, according to Δ(t)/σ(t) = ε, yields

P[M(D) ∈ Ψ] ≤ exp(ε) · P[M(D′) ∈ Ψ].

By Definition 9.2, the proof of Theorem 9.5 is completed. □


Remark 9.10 Since the sensitivity of DP-DSSP (9.6) depends on α(t), in light of Theorem 9.5, the sensitivity, for a fixed ε, will decrease as DP-DSSP executes.

9.4.2 Logarithmic Regret

This subsection establishes the logarithmic regret of DP-DSSP (9.6) presented in Theorem 9.6. For this purpose, some supporting lemmas that are indispensable to the proof of the main results are first presented. In light of the definition of the matrices W(t) and W(t : k), t ≥ k ∈ Z≥0, the following lemma is stated, which follows directly from Corollary 2 in [38].
Lemma 9.11 ([38]) Suppose that Assumption 9.3 holds. If the matrix W(t) = [w_ij(t)] ∈ R^{n×n} satisfies (9.7), there is a sequence of stochastic vectors φ(t) ∈ R^n such that the matrix W(t : k) satisfies, ∀i, j ∈ V and t ≥ k ∈ Z≥0,

|[W(t : k)]_ij − φ_i(t)| ≤ C λ^{t−k},

where C ≥ 1 and 0 < λ < 1.
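The geometric decay in Lemma 9.11 can be observed numerically. The sketch below (a fixed, hypothetical column-stochastic W standing in for the product W(t : k)) measures how far the columns of W^m are from a common vector playing the role of φ:

```python
import numpy as np

# A fixed column-stochastic matrix (each column sums to 1).
W = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])

def column_deviation(m):
    """Max deviation of the columns of W^m from their average,
    mirroring |[W(t:k)]_ij - phi_i(t)| <= C * lambda^(t-k)."""
    P = np.linalg.matrix_power(W, m)
    phi = P.mean(axis=1, keepdims=True)
    return float(np.abs(P - phi).max())
```

For this W the deviation contracts geometrically with m, which is the behavior the bound formalizes for time-varying products.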


Denote η(t) = [η₁(t), . . . , η_n(t)]^T, y(t) = [y₁(t), . . . , y_n(t)]^T, s(t) = [s₁(t), . . . , s_n(t)]^T, g(t) = [g₁(t), . . . , g_n(t)]^T, and x̄(t) = (1/n) Σ_{i=1}^n x_i(t). With the help of Lemma 9.11, we provide the upper bound of E[||z_i(t+1) − x̄(t)||], ∀t ∈ Z≥0, below, which is obtained directly from Lemma 1(a) in [38].
Lemma 9.12 ([38]) Suppose that Assumptions 9.2 and 9.3 hold. Then, DP-DSSP (9.6) generates a sequence {z₁(t), . . . , z_n(t)}_{t=1}^T such that ∀i ∈ V, t ∈ Z≥0,

E[||z_i(t+1) − x̄(t)||]
≤ (2Cλ^t/δ) Σ_{j=1}^n ||x_j(0)||₁ + (3C√n/δ) Σ_{k=0}^t λ^{t−k} Σ_{j=1}^n E[||η_j(k)||]
+ (2Cn^{3/2}p̂/δ) Σ_{k=1}^t λ^{t−k} α(k),

where δ ≥ 1/(n^{nB}) weighs the imbalance of the network and 0 < λ ≤ (1 − 1/(n^{nB}))^{1/B} is a measure of connectivity (if each of the networks G(t) is regular, one may see [38] for more details on δ and λ); η_i(t) ∼ Lap(σ(t)) with E[η_i(t)] = 0 and E[||η_i(t)||²] = 2σ²(t).

The next lemma is dedicated to establishing a necessary relation for deriving the
main results.
Lemma 9.13 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates a sequence {x₁(t), . . . , x_n(t)}_{t=1}^T. Then, we have for any v ∈ R^d that, ∀t ∈ Z≥0,

E[||x̄(t+1) − v||²] − E[||x̄(t) − v||²]
≤ (2/n) Σ_{i=1}^n E[||η_i(t)||²] − (2α(t+1)/n) E[f^{t+1}(x̄(t)) − f^{t+1}(v)]
+ (4α(t+1)/n) E[Σ_{i=1}^n L_i ||x̄(t) − z_i(t+1)||] + 2α²(t+1)ℓ²/n
− (α(t+1)/n) E[Σ_{i=1}^n κ_i^{t+1} ||z_i(t+1) − v||²],

where κ_i^{t+1} is the strong convexity parameter of f_i^{t+1}.


Proof Owing to the column-stochasticity of each matrix W(t), it is concluded from the update of h_i(t) in (9.6) that

x̄(t+1) = x̄(t) + (1/n) Σ_{i=1}^n η_i(t) + (1/n) Σ_{i=1}^n χ_i(t+1),    (9.14)

where χ_i(t+1) = −α(t+1) g_i(t+1), t ∈ Z≥0. Let v ∈ R^d be an arbitrary vector. It derives from (9.14) that

||x̄(t+1) − v||² − ||x̄(t) − v||²
= (2/n) ⟨Σ_{i=1}^n χ_i(t+1), x̄(t) − v⟩ + (2/n) ⟨Σ_{i=1}^n η_i(t), x̄(t) − v⟩
+ ||(1/n) Σ_{i=1}^n (η_i(t) + χ_i(t+1))||².    (9.15)

Now, the cross-term (2/n) ⟨Σ_{i=1}^n χ_i(t+1), x̄(t) − v⟩ in (9.15) is considered first. For this purpose, we have

E[⟨χ_i(t+1), x̄(t) − v⟩ | F_t] = −α(t+1) ⟨∇f_i^{t+1}(z_i(t+1)), x̄(t) − v⟩,    (9.16)

where the cross-term ⟨∇f_i^{t+1}(z_i(t+1)), x̄(t) − v⟩ satisfies

⟨∇f_i^{t+1}(z_i(t+1)), x̄(t) − v⟩
= ⟨∇f_i^{t+1}(z_i(t+1)), x̄(t) − z_i(t+1)⟩ + ⟨∇f_i^{t+1}(z_i(t+1)), z_i(t+1) − v⟩.    (9.17)

Following from the Cauchy–Schwarz inequality, we further get

⟨∇f_i^{t+1}(z_i(t+1)), x̄(t) − z_i(t+1)⟩ ≥ −L_i ||z_i(t+1) − x̄(t)||.    (9.18)

For the cross-term ⟨∇f_i^{t+1}(z_i(t+1)), z_i(t+1) − v⟩, since each local cost function f_i^{t+1} is strongly convex with parameter κ_i^{t+1}, it holds that

⟨∇f_i^{t+1}(z_i(t+1)), z_i(t+1) − v⟩
≥ f_i^{t+1}(z_i(t+1)) − f_i^{t+1}(v) + (κ_i^{t+1}/2) ||z_i(t+1) − v||².    (9.19)

Since f_i^{t+1}(z_i(t+1)) − f_i^{t+1}(v) = f_i^{t+1}(z_i(t+1)) − f_i^{t+1}(x̄(t)) + f_i^{t+1}(x̄(t)) − f_i^{t+1}(v) and each local cost function f_i^{t+1} is convex, it is obtained that

f_i^{t+1}(z_i(t+1)) − f_i^{t+1}(v) ≥ ⟨∇f_i^{t+1}(x̄(t)), z_i(t+1) − x̄(t)⟩ + f_i^{t+1}(x̄(t)) − f_i^{t+1}(v).    (9.20)

Then, we also acquire

⟨∇f_i^{t+1}(x̄(t)), z_i(t+1) − x̄(t)⟩ ≥ −L_i ||z_i(t+1) − x̄(t)||.    (9.21)

Therefore, it follows from (9.19)–(9.21) that

⟨∇f_i^{t+1}(z_i(t+1)), z_i(t+1) − v⟩
≥ −2L_i ||z_i(t+1) − x̄(t)|| + f_i^{t+1}(x̄(t)) − f_i^{t+1}(v) + (κ_i^{t+1}/2) ||z_i(t+1) − v||².    (9.22)

Using f^{t+1}(z) = Σ_{i=1}^n f_i^{t+1}(z) and substituting (9.18) and (9.22) back into (9.17), one acquires

Σ_{i=1}^n ⟨∇f_i^{t+1}(z_i(t+1)), x̄(t) − v⟩
≥ f^{t+1}(x̄(t)) − f^{t+1}(v) + (1/2) Σ_{i=1}^n κ_i^{t+1} ||z_i(t+1) − v||²
− 2 Σ_{i=1}^n L_i ||z_i(t+1) − x̄(t)||.    (9.23)

Recall that η_i(t) ∼ Lap(σ(t)) with E[η_i(t)] = 0. In light of (9.16) and (9.23), it thus follows that

E[(2/n) ⟨Σ_{i=1}^n (χ_i(t+1) + η_i(t)), x̄(t) − v⟩]
≤ −(2α(t+1)/n) E[f^{t+1}(x̄(t)) − f^{t+1}(v)]
− (α(t+1)/n) E[Σ_{i=1}^n κ_i^{t+1} ||z_i(t+1) − v||²]
+ (4α(t+1)/n) E[Σ_{i=1}^n L_i ||z_i(t+1) − x̄(t)||].    (9.24)

Furthermore, according to Assumptions 9.1 and 9.2, we also have

E[||(1/n) Σ_{i=1}^n (η_i(t) + χ_i(t+1))||²] ≤ (2/n) Σ_{i=1}^n E[||η_i(t)||²] + 2α²(t+1)ℓ²/n,    (9.25)

where (9.25) is built upon the condition that (Σ_{i=1}^n a_i)² ≤ n Σ_{i=1}^n a_i². Substituting (9.24) and (9.25) into (9.15) after taking total expectation deduces the result in Lemma 9.13. □
For the relation of the pseudo-network-regret, the following lemma is shown.
Lemma 9.14 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates the sequence {z₁(t), . . . , z_n(t)}_{t=1}^T. Then, it holds that ∀T ∈ Z>0,

Σ_{t=1}^T R(t) + E[Σ_{t=1}^T Σ_{i=1}^n κ_i^t ||z_i(t) − z*||²]
≤ (n/α(1)) E[||x̄(0) − z*||²] + (10Cn^{3/2}p̂L/δ) Σ_{t=1}^T Σ_{k=1}^{t−1} λ^{t−k−1} α(k)
+ (10CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁ + (16ndp̂²/ε² + 2ℓ²) Σ_{t=1}^T α(t)
+ (30√2 Cn^{3/2}dp̂L/(δε)) Σ_{t=1}^T Σ_{k=0}^{t−1} λ^{t−k−1} α(k+1)
+ n E[Σ_{t=2}^T ||x̄(t−1) − z*||² Δ₁(t)],

where Δ₁(t) = 1/α(t) − 1/α(t−1) − (1/(2n)) Σ_{i=1}^n κ_i^t.
i=1

Proof Letting v = z* in Lemma 9.13, we achieve

E[||x̄(t+1) − z*||²] − E[||x̄(t) − z*||²]
≤ (2/n) Σ_{i=1}^n E[||η_i(t)||²] − (2α(t+1)/n) E[f^{t+1}(x̄(t)) − f^{t+1}(z*)]
+ (4α(t+1)/n) E[Σ_{i=1}^n L_i ||z_i(t+1) − x̄(t)||] + 2α²(t+1)ℓ²/n
− (α(t+1)/n) E[Σ_{i=1}^n κ_i^{t+1} ||z_i(t+1) − z*||²].    (9.26)

We then analyze the term f^{t+1}(x̄(t)) − f^{t+1}(z*) in (9.26). First, since f_i^{t+1} is strongly convex with parameter κ_i^{t+1}, it yields

f^{t+1}(x̄(t)) − f^{t+1}(z*) ≥ (1/2) Σ_{i=1}^n κ_i^{t+1} ||x̄(t) − z*||².    (9.27)

On the other hand, by Assumptions 9.1 and 9.2, we get

f^{t+1}(x̄(t)) − f^{t+1}(z*) ≥ f^{t+1}(z_i(t+1)) − f^{t+1}(z*) − L ||z_i(t+1) − x̄(t)||,    (9.28)

where L = Σ_{i=1}^n L_i. Thus, by combining (9.27) and (9.28), we conclude the following estimate:

2(f^{t+1}(x̄(t)) − f^{t+1}(z*))
≥ f^{t+1}(z_i(t+1)) − f^{t+1}(z*) + (1/2) Σ_{i=1}^n κ_i^{t+1} ||x̄(t) − z*||²
− L ||z_i(t+1) − x̄(t)||.    (9.29)

Plugging (9.29) back into (9.26) and then summing the obtained inequality over t from 1 to T, it holds (after quite a few algebraic operations) that

Σ_{t=1}^T [R(t) − (2/α(t)) Σ_{i=1}^n E[||η_i(t−1)||²]]
≤ L E[Σ_{t=1}^T ||z_i(t) − x̄(t−1)||] − E[Σ_{t=1}^T Σ_{i=1}^n κ_i^t ||z_i(t) − z*||²]
+ n E[Σ_{t=2}^T ||x̄(t−1) − z*||² Δ₁(t)] + (n/α(1)) E[||x̄(0) − z*||²]
+ 4 E[Σ_{t=1}^T Σ_{i=1}^n L_i ||z_i(t) − x̄(t−1)||] + 2ℓ² Σ_{t=1}^T α(t).    (9.30)

Furthermore, from Lemma 9.12, one achieves

E[Σ_{t=1}^T Σ_{i=1}^n L_i ||z_i(t) − x̄(t−1)||]
≤ (2C/δ) Σ_{t=1}^T Σ_{i=1}^n L_i λ^{t−1} Σ_{j=1}^n ||x_j(0)||₁
+ (3C√n/δ) Σ_{t=1}^T Σ_{i=1}^n L_i Σ_{k=0}^{t−1} λ^{t−k−1} Σ_{j=1}^n E[||η_j(k)||]
+ (2Cn^{3/2}p̂/δ) Σ_{t=1}^T Σ_{i=1}^n L_i Σ_{k=1}^{t−1} λ^{t−k−1} α(k).    (9.31)

Since 0 < λ < 1, we get

Σ_{t=1}^T Σ_{i=1}^n Σ_{j=1}^n L_i λ^{t−1} ||x_j(0)||₁ ≤ (L/(1−λ)) Σ_{j=1}^n ||x_j(0)||₁.    (9.32)

Furthermore, since ||η_i(t)|| = √(Σ_{k=1}^d |η_{i,k}(t)|²) (η_{i,k}(t), k = 1, . . . , d, is the k-th element of η_i(t) ∈ R^d) and each η_{i,k}(t), k = 1, . . . , d, is drawn from Lap(σ(t)), we deduce that E[|η_{i,k}(t)|²] = 2σ²(t). In light of Δ(t)/σ(t) = ε, it yields

E[Σ_{i=1}^n ||η_i(t)||] ≤ √(2d) n σ(t) = √(2d) n Δ(t)/ε ≤ 2√2 ndp̂ α(t+1)/ε.    (9.33)

Thus, by combining (9.31)–(9.33), we acquire

E[Σ_{t=1}^T Σ_{i=1}^n L_i ||z_i(t) − x̄(t−1)||]
≤ (2CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁ + (2Cn^{3/2}p̂L/δ) Σ_{t=1}^T Σ_{k=1}^{t−1} λ^{t−k−1} α(k)
+ (6√2 Cn^{3/2}dp̂L/(δε)) Σ_{t=1}^T Σ_{k=0}^{t−1} λ^{t−k−1} α(k+1).    (9.34)

In addition, from (9.33), we also obtain

Σ_{i=1}^n E[||η_i(t)||²] ≤ 8ndp̂² α²(t+1)/ε².    (9.35)

Combining (9.30)–(9.35) and Lemma 9.12 and arranging the terms, the result in Lemma 9.14 follows immediately. □
Finally, the next lemma presents the connection between the pseudo-individual-
regret of node j ∈ V and the pseudo-network-regret.
Lemma 9.15 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates the sequence {z₁(t), . . . , z_n(t)}_{t=1}^T. Then, it holds that for each j ∈ V and T ∈ Z>0,

R̄_j(T) − (4CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁
≤ R̄(T) + (12√2 Cn^{3/2}dp̂L/(δε)) Σ_{t=1}^T Σ_{k=0}^{t−1} λ^{t−k−1} α(k+1)
+ (4Cn^{3/2}p̂L/δ) Σ_{t=1}^T Σ_{k=1}^{t−1} λ^{t−k−1} α(k).

Proof First, notice that

Σ_{t=1}^T Σ_{i=1}^n f_i^t(z_j(t)) − Σ_{t=1}^T Σ_{i=1}^n f_i^t(z_i(t)) ≤ Σ_{t=1}^T Σ_{i=1}^n L_i ||z_j(t) − z_i(t)||,    (9.36)

which is directly derived from the convexity of f_i^t. Moreover, it also follows that

||z_j(t+1) − z_i(t+1)||²
≤ ||z_j(t+1) − x̄(t)||² + ||z_i(t+1) − x̄(t)||²
+ 2||z_j(t+1) − x̄(t)|| ||z_i(t+1) − x̄(t)||.    (9.37)

Thus, we further get from (9.37) that

E[||z_j(t+1) − z_i(t+1)||]
≤ (4Cλ^t/δ) Σ_{j=1}^n ||x_j(0)||₁ + (6C√n/δ) Σ_{k=0}^t λ^{t−k} Σ_{j=1}^n E[||η_j(k)||]
+ (4Cn^{3/2}p̂/δ) Σ_{k=1}^t λ^{t−k} α(k).    (9.38)

Since η_i(t) ∼ Lap(σ(t)) with E[||η_i(t)||²] = 2σ²(t), it follows from (9.33) and (9.38) that

E[Σ_{t=1}^T Σ_{i=1}^n L_i ||z_j(t) − z_i(t)||]
≤ (4CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁ + (4Cn^{3/2}p̂L/δ) Σ_{t=1}^T Σ_{k=1}^{t−1} λ^{t−k−1} α(k)
+ (12√2 Cn^{3/2}dp̂L/(δε)) Σ_{t=1}^T Σ_{k=0}^{t−1} λ^{t−k−1} α(k+1).

Hence, by the definitions of the pseudo-individual-regret and the pseudo-network-regret, the result in Lemma 9.15 is established. □
We now present the Proof of Theorem 9.6 based on the lemmas established above.
Proof of Theorem 9.6 Let κ_i^t = μ_i, ∀i ∈ V, t ∈ {1, . . . , T}. Then, the local cost function f_i^t, i ∈ V, is strongly convex with parameter μ_i for all t ∈ {1, . . . , T}. Recalling that α(t) = 1/(μ̂t) for any μ̂ ∈ (0, μ̄] and μ̄ = (1/(2n)) Σ_{i=1}^n μ_i, we obtain

Δ₁(t) = μ̂t − μ̂(t−1) − μ̄ ≤ 0.    (9.39)

Since Σ_{k=1}^T (1/k) = 1 + Σ_{k=2}^T (1/k) ≤ 1 + ∫₁^T (1/k) dk = 1 + ln T, it can be deduced that

Σ_{t=1}^T Σ_{k=0}^{t−1} λ^{t−k−1} α(k+1) ≤ (1/(μ̂(1−λ)))(1 + ln T),    (9.40)

and further

Σ_{t=1}^T Σ_{k=1}^{t−1} λ^{t−k−1} α(k) ≤ (1/(λμ̂(1−λ)))(1 + ln T).    (9.41)
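These bounds can be sanity-checked numerically; the snippet below (an illustrative check with arbitrary parameter values, not part of the book) compares the double geometric sum against its closed-form bound:

```python
import math

def lhs_941(T, lam, mu_hat):
    """Left side of (9.41): sum_{t=1}^T sum_{k=1}^{t-1} lam^(t-k-1)*alpha(k)
    with alpha(k) = 1/(mu_hat*k)."""
    return sum(lam ** (t - k - 1) / (mu_hat * k)
               for t in range(1, T + 1) for k in range(1, t))

def rhs_941(T, lam, mu_hat):
    """Right side of (9.41): (1 + ln T) / (lam * mu_hat * (1 - lam))."""
    return (1.0 + math.log(T)) / (lam * mu_hat * (1.0 - lam))
```

Exchanging the order of summation shows why the bound holds: each k contributes at most Σ_{m≥0} λ^m = 1/(1−λ) times α(k), and Σ_k 1/k ≤ 1 + ln T.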

Let θ(t) = t(t+1)/2 and ẑ_i(t) = Σ_{k=1}^t k z_i(k)/θ(t) for all t ∈ {1, . . . , T}. It then yields that

E[Σ_{t=1}^T Σ_{i=1}^n μ_i ||z_i(t) − z*||²] ≥ E[Σ_{t=1}^T (t/θ(T)) Σ_{i=1}^n μ_i ||z_i(t) − z*||²]
≥ E[Σ_{i=1}^n μ_i ||ẑ_i(T) − z*||²].    (9.42)

Substituting (9.39)–(9.42) into Lemma 9.14, by computation, one has

Σ_{t=1}^T R(t) + E[Σ_{j=1}^n μ_j ||ẑ_j(T) − z*||²] ≤ Ξ₃ + Ξ₄(1 + ln T),    (9.43)

where

Ξ₃ = nμ̂ E[||x̄(0) − z*||²] + (10CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁,
Ξ₄ = 10Cn^{3/2}p̂L/(δλμ̂(1−λ)) + 16ndp̂²/ε² + 2ℓ² + 30√2 Cn^{3/2}dp̂L/(δεμ̂(1−λ)).

Thus, the result in Theorem 9.6 can be concluded by combining (9.43) with Lemma 9.15. This completes the Proof of Theorem 9.6. □

9.4.3 Square-Root Regret

In the previous subsection, the logarithmic regret of DP-DSSP (9.6) was proved under strong convexity of the cost functions. If the cost functions are only generally convex, it can be established that DP-DSSP (9.6) achieves a square-root regret.

Proof of Theorem 9.7 Noting that α(t) in Theorem 9.6 relies on T, DTS, which does not depend on T, is employed to generate the execution process.

By selecting α(t) = ζ, ∀t ∈ {1, . . . , T}, it holds from Lemma 9.14 that

E[Σ_{t=1}^T Σ_{i=1}^n (f_i^t(z_i(t)) − f_i^t(z*))] − (n/ζ) E[||x̄(0) − z*||²]
≤ (30√2 Cn^{3/2}dp̂L/(δε(1−λ))) Tζ + (10Cn^{3/2}p̂L/(δλ(1−λ))) Tζ
+ (16ndp̂²/ε² + 2ℓ²) Tζ + (10CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁,    (9.44)

where we have utilized the fact that (1/ζ) − (1/ζ) − (1/(2n)) Σ_{i=1}^n κ_i^t ≤ 0. Then, letting ζ = 1/√T in (9.44) and using T ≥ 1, one achieves

E[Σ_{t=1}^T Σ_{i=1}^n (f_i^t(z_i(t)) − f_i^t(z*))] ≤ Γ₂ √T,    (9.45)

where

Γ₂ = n E[||x̄(0) − z*||²] + (10CL/(δ(1−λ))) Σ_{j=1}^n ||x_j(0)||₁ + 2ℓ²
+ 10Cn^{3/2}p̂L/(δλ(1−λ)) + 30√2 Cn^{3/2}dp̂L/(δε(1−λ)) + 16ndp̂²/ε².

By DTS, for k = 0, 1, 2, . . . , ⌊log₂ T⌋, DP-DSSP (9.6) is performed in periods of 2^k rounds t = 2^k, . . . , 2^{k+1} − 1 (when k = ⌊log₂ T⌋, DP-DSSP (9.6) only updates over the time period between t = 2^{⌊log₂ T⌋} and t = T). In addition, the bound of (9.45) on each period is at most Γ₂ √(2^k). Thus, the total bound is

Σ_{k=0}^{⌊log₂ T⌋} Γ₂ √(2^k) = Γ₂ (1 − √2^{⌊log₂ T⌋+1})/(1 − √2) ≤ (√2/(√2 − 1)) Γ₂ √T.

Hence, combining (9.45) with Lemma 9.15, the result in Theorem 9.7 is obtained. The proof is complete. □
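The final geometric-sum step can be verified numerically (an illustrative check, not part of the book):

```python
import math

def dts_total(T):
    """Sum of per-block bounds sqrt(2^k) (taking Gamma_2 = 1)
    over blocks k = 0, ..., floor(log2 T)."""
    K = int(math.floor(math.log2(T)))
    return sum(math.sqrt(2.0 ** k) for k in range(K + 1))

def dts_bound(T):
    """Closed-form bound sqrt(2)/(sqrt(2)-1) * sqrt(T)."""
    return math.sqrt(2.0) / (math.sqrt(2.0) - 1.0) * math.sqrt(T)
```

The worst case is T equal to a power of two, where the geometric series is tight up to the constant √2/(√2 − 1) ≈ 3.41.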

9.4.4 Robustness to Communication Delays

In this section, the robust performance of DP-DSSP (9.6) in the presence of communication delays is discussed. Assume that nodes confront arbitrary but uniformly bounded communication delays in the process of gaining information from in-neighbors. Specifically, at time t, let ς_ij(t) denote an arbitrary, a priori unknown delay induced by the communication link (j, i). In the following, we provide the bounded assumption on the delays ς_ij(t).

Assumption 9.4 (Bounded Delays [42]) For all t ∈ Z>0, the communication delays ς_ij(t), ∀i, j ∈ V, are uniformly bounded. Namely, there is a constant ς̂ ∈ Z>0 such that

0 ≤ ς_ij(t) ≤ ς̂, ∀i, j ∈ V, t ∈ Z>0.

Furthermore, each node accesses its own estimate without delay, i.e., ς_ii(t) = 0, ∀i ∈ V, t ∈ Z>0.
Notice that for each node i ∈ V, both the updates of y_i(t) and s_i(t) in DP-DSSP (9.6) depend on the information acquired from in-neighbors, while the estimates h_i(t), z_i(t), and x_i(t) are computed locally without further interactions. Therefore, at each time t ∈ {1, . . . , T}, when undergoing communication delays, DP-DSSP (9.6) is executed as follows:

h_i(t) = x_i(t) + η_i(t),
y_i(t+1) = h_i(t)/(d_i^out(t) + 1) + Σ_{j ∈ N_i^in(t)} h_j(t − ς_ij(t))/(d_j^out(t) + 1),
s_i(t+1) = s_i(t)/(d_i^out(t) + 1) + Σ_{j ∈ N_i^in(t)} s_j(t − ς_ij(t))/(d_j^out(t) + 1),    (9.46)
z_i(t+1) = y_i(t+1)/s_i(t+1),
x_i(t+1) = y_i(t+1) − α(t+1) g_i(t+1).

The following theorem manifests that algorithm (9.46) not only preserves differential privacy but also achieves the expected regrets when the interactions among nodes on time-varying unbalanced directed networks are subject to arbitrary but uniformly bounded communication delays.

Theorem 9.16 Suppose that Assumptions 9.1–9.4 hold and Z* is non-empty. Consider that algorithm (9.46) updates the sequence {z₁(t), . . . , z_n(t)}_{t=1}^T. Then, the following results are derived:
(a) Letting η_i(t), i ∈ V, satisfy Theorem 9.5, algorithm (9.46) preserves differential privacy.
(b) Letting α(t) satisfy Theorem 9.6, algorithm (9.46) achieves a logarithmic regret of order O(ln T) for strongly convex cost functions.
(c) Letting α(t) satisfy Theorem 9.7, algorithm (9.46) accomplishes a square-root regret of order O(√T) for generally convex cost functions.

Proof Note that algorithm (9.46) is transformed from DP-DSSP (9.6) through the modeling of communication delays established above. The Proof of Theorem 9.16 can be divided into two steps [42]. We first convert algorithm (9.46) (with communication delays) to a delay-free algorithm by employing an augmented unbalanced directed network representation. Then, the performance of the obtained delay-free algorithm can be analyzed in the same fashion as in the preceding subsections.

For each node i in the network, we introduce ς̂ imaginary nodes i(1), i(2), . . . , i(ς̂). At each time t, imaginary node i(k) holds a message that is eventually delivered to node i after k time steps. According to Assumption 9.4, the total number of nodes in the augmented unbalanced directed network is n(ς̂ + 1). The first n imaginary nodes model the 1-step delays of the n nodes on the original unbalanced directed network, the next n imaginary nodes model the 2-step delays, etc. For the interactions among these nodes on the augmented unbalanced directed network at each time t, we suppose that for each edge (j, i) on the original unbalanced directed network, there always exist edges (j, i(1)), (j, i(2)), . . . , (j, i(ς̂)) and edges (i(1), i), (i(2), i(1)), . . . , (i(ς̂), i(ς̂−1)) on the augmented unbalanced directed network.

Let x_{i,(r)} be the estimate of imaginary node i(r), where r ∈ {1, . . . , ς̂}. For convenience of analysis, we suppose d = 1 in this subsection. Define x̃(t) = [x(t)^T, x_(1)(t)^T, . . . , x_(ς̂)(t)^T]^T ∈ R^{n(ς̂+1)}, where x_(r)(t) = [x_{1,(r)}, . . . , x_{n,(r)}]^T ∈ R^n, r ∈ {1, . . . , ς̂}, is the column stack vector of x_{i,(r)} for i ∈ V. Then, the notations h̃(t), η̃(t), ỹ(t), and s̃(t) are defined similarly. Hence, at each time t ∈ {1, . . . , T}, algorithm (9.46) can be written in the delay-free compact format as follows:
h̃(t) = x̃(t) + η̃(t),
ỹ(t+1) = W̃(t) h̃(t),
s̃(t+1) = W̃(t) s̃(t),    (9.47)
z_i(t+1) = y_i(t+1)/s_i(t+1),
x̃(t+1) = ỹ(t+1) − α(t+1)[g(t+1)^T, 0^T]^T,

where g(t) = [g₁(t), . . . , g_n(t)]^T, η̃(t) = [η(t)^T, 0^T]^T ∈ R^{n(ς̂+1)}, and η(t) = [η₁(t), . . . , η_n(t)]^T. For the imaginary nodes, we pick x_(r)(0) = 0, η_(r)(0) = 0, and s_(r)(0) = 0 for all r ∈ {1, . . . , ς̂}. Notice that the update of z_i(t) in (9.47) is identical to the update of z_i(t) in (9.6), which guarantees that z_i(t) is a finite quantity even though s_(r)(0) = 0. In addition, the weighted matrix W̃(t) associated with the augmented unbalanced directed network is given by

W̃(t) = ⎡ W̃_(0)(t)    I_{n×n}  0        . . .  0       ⎤
        ⎢ W̃_(1)(t)    0        I_{n×n}  . . .  0       ⎥
        ⎢ . . .                                        ⎥
        ⎢ W̃_(ς̂−1)(t)  0        0        . . .  I_{n×n} ⎥
        ⎣ W̃_(ς̂)(t)    0        0        . . .  0       ⎦,

where the non-negative matrices W̃_(0)(t), W̃_(1)(t), . . . , W̃_(ς̂)(t) (suitably defined) depend on the communication delays encountered by the information interaction at time t. Specifically, the weighted matrix W̃_(r)(t) = [w̃_{ij,(r)}(t)] ∈ R^{n×n}, r ∈ {1, . . . , ς̂}, follows the rule

w̃_{ij,(r)}(t) = w_ij(t) if ς_ij(t) = r and (j, i) ∈ E(t), and w̃_{ij,(r)}(t) = 0 otherwise,

where w_ij(t) is introduced in (9.7). According to the definition of w̃_{ij,(r)}(t), it can be concluded that for each edge (j, i) ∈ E(t) at each time t, only one of w̃_{ij,(1)}(t), . . . , w̃_{ij,(ς̂)}(t) is positive and equal to w_ij(t) (the others are equal to zero). Therefore, the transformation from (9.46) (with communication delays) to (9.47) (without communication delays) is achieved.
From the definitions of W̃(t) and W(t), we can deduce that under Assumptions 9.3 and 9.4, the matrices W̃(t) and W(t) exhibit the same properties at each time t. Thus, the remainder of the Proof of Theorem 9.16 follows by the same techniques as in the preceding subsections. □
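The augmented-network construction in the proof can be sketched as follows (an illustrative implementation; the 2-node example and its delay pattern are assumptions). Entry w_ij of W(t) is routed to block W̃_(r) with r = ς_ij(t), and identity blocks shift delayed messages one step closer to delivery each round:

```python
import numpy as np

def augmented_matrix(W, delays, s_hat):
    """Build W~(t) of size n*(s_hat+1): block row r, block column 0 holds
    W~_(r) (entries of W with delay r), and block row r, block column r+1
    holds I (the one-step shift along each imaginary-node chain)."""
    n = W.shape[0]
    N = n * (s_hat + 1)
    Wt = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            if W[i, j] > 0:
                r = 0 if i == j else delays[i][j]   # own estimate: no delay
                Wt[r * n + i, j] = W[i, j]          # delayed-delivery blocks
    for r in range(s_hat):
        for i in range(n):
            Wt[r * n + i, (r + 1) * n + i] = 1.0    # identity shift blocks
    return Wt

W = np.array([[0.5, 0.5], [0.5, 0.5]])   # 2-node example with self-loops
Wt = augmented_matrix(W, delays=[[0, 1], [2, 0]], s_hat=2)
```

Since each positive w_ij is assigned to exactly one delay block and each imaginary-node column contains a single 1, the augmented matrix remains column-stochastic, which is why the push-sum analysis carries over unchanged.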

9.5 Numerical Examples

In this section, some practical simulation experiments are presented to assess the effectiveness and universality of DP-DSSP. Inspired by Akbari et al. [15] and Hosseini et al. [21], we investigate the distributed collaborative localization problem. Each node in the network is employed to detect a vector z ∈ R^d. At time t ∈ {1, . . . , T }, each node i ∈ V acquires an uncertain and time-varying (noise, such as jamming, may exist) detection vector pi (t) ∈ R^{di} and is assumed to follow the linear model pi,z = Pi z, where Pi ∈ R^{di×d} and Pi z = 0 if and only if z = 0. The main focus is
to estimate the vector ẑ ∈ R^d which minimizes the global cost function:

$$
f(\hat{z}) = \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{1}{2}\, \| q_i(t) - P_i \hat{z} \|^2,
$$
where the detection vector is represented as pi (t) = Pi z + βi (t) and βi (t) is white noise. Notice that the characteristics of the noise are not known in advance and quite a few nodes may not work well in some situations. Therefore, it is necessary to utilize a distributed online algorithm to estimate the best selection for z.
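To make the observation model concrete, here is a minimal sketch of one node's noisy detection and local cost; the dimensions, noise scale, and names (`P_i`, `z_true`) are illustrative assumptions, not the chapter's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_i = 3, 2                          # illustrative dimensions
z_true = rng.normal(size=d)            # the unknown vector z to localize
P_i = rng.normal(size=(d_i, d))        # hypothetical sensing matrix of node i

def observe():
    """One noisy detection p_i(t) = P_i z + beta_i(t), beta_i white noise."""
    return P_i @ z_true + 0.1 * rng.normal(size=d_i)

def f_it(z_hat, q_it):
    """Local cost f_i^t(z_hat) = (1/2)||q_i(t) - P_i z_hat||^2."""
    return 0.5 * float(np.sum((q_it - P_i @ z_hat) ** 2))

def grad_f_it(z_hat, q_it):
    """Its gradient, -P_i^T (q_i(t) - P_i z_hat), used by subgradient steps."""
    return -P_i.T @ (q_it - P_i @ z_hat)
```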
In the simulation, we consider a large-scale network with n = 100 nodes and the dimension d = 1. At time t ∈ Z≥0 , the n nodes and (n − 1)²/4 directed edges are randomly arranged in the unbalanced directed network G = {V, E}.
Suppose that the directed unbalanced network G = {V, E} (randomly selected) is
strongly connected. Then, by uniformly and randomly sampling E(t) from E of G at a rate of 80%, we generate the time-varying unbalanced directed networks G(t) =
{V, E(t)}, t ∈ Z≥0 . Additionally, at time t, the local cost function fit : R → R associated with node i ∈ V is given by fit (ẑ) = (1/2)(qi (t) − Pi ẑ)², where
qi (t) = ai (t)z+bi (t) and Pi ∈ R. The cost coefficients ai (t) and bi (t) for each node
i are, respectively, randomly selected from a uniform distribution on [amin, amax ]
and [bmin , bmax ] at time t. In this simulation, we employ DP-DSSP to estimate z
(for clearer presentation, we randomly select a few nodes to display in the following
scenarios) and study different scenarios to validate the performance of DP-DSSP.
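A crude way to mimic this setup is sketched below: a directed ring backbone (our own device to guarantee strong connectivity of G) plus random extra edges, with E(t) obtained by keeping each edge of E with probability 0.8; this is an assumption-laden stand-in, not the book's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_digraph(n, extra):
    """A directed ring (keeps G strongly connected) plus `extra` random
    directed edges; a stand-in for the book's random construction."""
    edges = {(i, (i + 1) % n) for i in range(n)}
    while len(edges) < n + extra:
        i, j = rng.integers(0, n, size=2)
        if i != j:
            edges.add((int(i), int(j)))
    return edges

def sample_subgraph(edges, keep=0.8):
    """E(t): retain each edge of E independently with probability `keep`."""
    return {e for e in edges if rng.random() < keep}
```

Note that an individual G(t) sampled this way need not be strongly connected; what matters is the uniform strong connectivity over time windows assumed by the chapter.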
(1) Without Communication Delays. In this scenario, suppose that there are no
communication delays in the transmission of information on the network by nodes.
At time t ∈ {1, . . . , T }, let the coefficients ai (t) ∈ [0, 2] and bi (t) ∈ [−0.5 + ((i −
50)/100), 0.5 + ((i − 50)/100)] be picked at random from a uniform distribution for
each node i and the learning rate α(t) = 5/t. In addition, assume that Pi ∈ (0, 1]
for each node i. Based on the communication network designed above, preliminary
results are displayed in Fig. 9.1. The estimates of five nodes (randomly displayed) over 100 time iterations are shown in Fig. 9.1a. By running DP-DSSP, it is demonstrated that the nodes' estimates (randomly displayed) fluctuate only mildly around the optimal solution (the fluctuations are larger than those of general online algorithms because of the injected privacy noise). Figure 9.1b depicts the maximum and minimum pseudo-individual average regrets over 100 time iterations, which exhibit the sublinear property.

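The pseudo-individual average regret R_j(T)/T plotted in Fig. 9.1b can be computed from stored estimates roughly as follows; the `costs[t]` interface is our own stand-in for the collection of local costs f_i^t:

```python
def avg_regret(costs, z_hist, z_star):
    """Pseudo-individual average regret R_j(T)/T for node j:
    the sum over t and i of f_i^t(z_j(t)) - f_i^t(z*), divided by T.
    `costs[t]` lists the local costs f_i^t at round t and `z_hist[t]`
    is node j's estimate z_j(t) (both are assumed interfaces)."""
    T = len(costs)
    reg = sum(f(z_hist[t]) - f(z_star) for t in range(T) for f in costs[t])
    return reg / T
```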
(2) With Communication Delays. In this scenario, we verify the robust performance
of DP-DSSP with communication delays. Specifically, let ς̂ = 4 be the upper
bound of the time-varying delays. At each time t, communication delays imposed
on each communication link are randomly and uniformly selected in {0, 1, . . . , 4}.
Other required parameters correspond to the first scenario. The simulation results are
displayed in Fig. 9.2. It is clearly observed that when communication links undergo
communication delays, DP-DSSP is able to achieve the desired results.

(3) DTS and Communication Delays. In this scenario (T = 100), we suppose that
the learning rates are selected by DTS and there exist communication delays (the
upper bound of the time-varying delays is ς̂ = 4) in the transmission of information
on the network by nodes. Comparisons of DP-DSSP with the method in [37] are
shown in Figs. 9.3 and 9.4. On the one hand, the x-axes in Figs. 9.3 and 9.4a are plotted in terms of k because of the DTS employed in DP-DSSP.³ On the other hand, the x-axis in Fig. 9.4b is plotted in terms of t to better reflect the main contributions of this chapter.
³ Here, it is worth highlighting that DTS actually divides the original time series t ∈ {1, . . . , T } into ⌊log₂ T⌋ + 1 subseries and DP-DSSP needs to update 2^k rounds in each subseries (except the last subseries). Since DP-DSSP cannot update 2^k rounds in the last subseries, we only plot k = 0, 1, . . . , ⌊log₂ T⌋ in Figs. 9.3 and 9.4a.

Fig. 9.1 (a) Estimations of five nodes without communication delays. (b) The maximum and minimum pseudo-individual average regrets (Rj (T )/T ) without communication delays

We can clearly observe from Fig. 9.3 that the nodes' estimates (randomly displayed) calculated by DP-DSSP undergo smaller fluctuations around the optimal solution, while the method in [37] cannot perform well over time-varying unbalanced directed networks. In Fig. 9.4a, we simulate each pseudo-individual average subregret

$$
R_{j,\mathrm{av}}^{DTS}(k) = \frac{1}{2^{k}} \left( \mathbb{E}\Big[ \sum_{t=2^{k}}^{2^{k+1}} \sum_{i=1}^{n} f_{i}^{t}(z_{j}(t)) \Big] - \mathbb{E}\Big[ \sum_{t=2^{k}}^{2^{k+1}} \sum_{i=1}^{n} f_{i}^{t}(z^{*}) \Big] \right)
$$
between k and k + 1 for k = 0, 1, 2, . . . , ⌊log₂ T⌋ − 1. Under the same settings,
Fig. 9.4a indicates that the difference between the max and the min pseudo-
individual average subregrets of the method in [37] is larger than that of DP-DSSP.
This is because the method in [37] is not applicable to time-varying unbalanced directed networks. Moreover, Fig. 9.4b implies that DP-DSSP achieves small expected regrets and exhibits the sublinear property, which makes it superior to the method in [37].
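For reference, the DTS subseries boundaries underlying the k-axis of Figs. 9.3 and 9.4a can be sketched as follows (our reading of the doubling-trick schedule, with the floor taken on log₂ T):

```python
import math

def dts_schedule(T):
    """Doubling trick: split {1, ..., T} into floor(log2 T) + 1 subseries;
    subseries k covers rounds 2^k, ..., 2^(k+1) - 1 (truncated at T), so
    every subseries but the last contains exactly 2^k rounds."""
    K = int(math.floor(math.log2(T)))
    return [(2 ** k, min(2 ** (k + 1) - 1, T)) for k in range(K + 1)]
```

For T = 100 this yields [(1, 1), (2, 3), (4, 7), (8, 15), (16, 31), (32, 63), (64, 100)]: seven subseries, the last truncated, matching the footnote's remark that the last subseries cannot complete its 2^k rounds.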
Fig. 9.2 (a) Estimations of five nodes with communication delays. (b) The maximum and minimum pseudo-individual average regrets (Rj (T )/T ) with communication delays
(4) Differential Privacy Properties [47]. In this scenario, we verify the differential privacy properties of DP-DSSP with communication delays. Assume that the cost functions satisfy fit = f̄it for all i ∈ {2, . . . , 100}, i.e., the two adjacent function sets are identical except for node 1's cost function, f1t ≠ f̄1t. Figure 9.5 (corresponding to the second scenario) plots the outputs xi (t) and x′i (t) of DP-DSSP for three randomly displayed nodes (always including node 1) under the adjacent relations fit and f̄it, respectively. Figure 9.5 shows that the two outputs almost coincide, so they are nearly indistinguishable to a malicious node.
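To build intuition for why the two outputs in Fig. 9.5 nearly coincide, consider a toy one-dimensional analogue (not DP-DSSP itself): an online gradient iteration with injected Laplace noise, run on two adjacent data sequences that differ in a single datum; all names and parameters here are illustrative assumptions:

```python
import numpy as np

def noisy_online_steps(a_seq, T=100, alpha0=5.0, scale=1.0, seed=7):
    """Toy scalar online iteration: x <- x - (alpha0/t)(x - a_t) + Laplace
    noise. A stand-in for privacy-noise-perturbed updates, not DP-DSSP."""
    rng = np.random.default_rng(seed)
    x, out = 0.0, []
    for t in range(1, T + 1):
        x = x - (alpha0 / t) * (x - a_seq[t - 1]) + rng.laplace(scale=scale)
        out.append(x)
    return np.array(out)

# Two adjacent data sequences: only the first datum differs (node 1's data).
a = np.ones(100)
a_prime = a.copy()
a_prime[0] = 2.0
out = noisy_online_steps(a)
out_prime = noisy_online_steps(a_prime)
```

With α(t) = 5/t and identical noise draws, the gap between the two trajectories is multiplied by (1 − 5/t) each round and is exactly zero from t = 5 onward; differential privacy makes the analogous statement about output distributions rather than a shared noise path.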
Fig. 9.3 Estimations of five nodes for DTS with communication delays. (a) Node's estimate (z) (DP-DSSP). (b) Node's estimate (z) (the method in [37])

Remark 9.17 The fourth scenario can well validate the differential privacy properties of DP-DSSP, which is done in the same way as many of the existing works, such as [4, 47]. DP-DSSP may be further applied to problems related to the spy
attacks on the network and dataset recovery, which are the key issues in the modern
era of big data. However, we note that the focus of this chapter is not about the
spy attacks on the network and dataset recovery but to design DP-DSSP to solve
the differentially private distributed online optimization problem on time-varying
unbalanced directed networks. In the future, further research on this kind of problem
will be conducted.
Fig. 9.4 (a) The pseudo-individual average subregrets (R_{j,av}^{DTS}(k)) between k and k + 1 for DTS with communication delays. (b) The pseudo-individual average regrets (Rj (T )/T ) for DTS with communication delays
9.6 Conclusion
This chapter has investigated the differentially private distributed online convex optimization problem over time-varying unbalanced directed networks that are assumed to be uniformly strongly connected. To solve such an optimization problem in a distributed manner, we have developed a new differentially private distributed stochastic subgradient-push algorithm, named DP-DSSP. Theoretical analysis has shown that DP-DSSP (with an appropriate learning rate) preserved differential privacy
and achieved expected sublinear regrets for diverse cost functions.

Fig. 9.5 The outputs xi (t) and x′i (t) related to the adjacent cost functions fit and f̄it

The compromise between the desired level of privacy preserving and the accuracy of DP-DSSP
has been revealed. Furthermore, the robustness of DP-DSSP to communication
delays has also been explored. Finally, the performances of DP-DSSP have been
demonstrated via simulation experiments. Our work remains open, and more in-depth research is needed to resolve the distributed online constrained problem. As future work, we will consider extending this work in a number of directions, e.g., spy attacks on the network, dataset recovery, complex constraints, non-convex and/or non-smooth functions, and networks with random link failures.
References
1. S. Wang, C. Li, Distributed robust optimization in networked system. IEEE Trans. Cybern.
47(8), 2321–2333 (2017)
2. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained
optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst.
50(7), 2612–2622 (2020)
3. X. Shi, J. Cao, W. Huang, Distributed parametric consensus optimization with an application
to model predictive consensus problem. IEEE Trans. Cybern. 48(7), 2024–2035 (2018)
4. M. Hale, P. Barooah, K. Parker, K. Yazdani, Differentially private smart metering: Implementa-
tion, analytics, and billing, in UrbSys’19: Proceedings of the 1st ACM International Workshop
on Urban Building Energy Sensing, Controls, Big Data Analysis, and Visualization (2019), pp.
33–42
5. Z. Deng, X. Nian, C. Hu, Distributed algorithm design for nonsmooth resource allocation
problems. IEEE Trans. Cybern. 50(7), 3208–3217 (2020)
6. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
7. L. Ma, W. Bian, A novel multiagent neurodynamic approach to constrained distributed convex optimization. IEEE Trans. Cybern. 51(3), 1322–1333 (2021)
8. C. Shi, G. Yang, Augmented Lagrange algorithms for distributed optimization over multi-agent
networks via edge-based method. Automatica 94, 55–62 (2018)
9. X. Wang, Y. Hong, P. Yi, H. Ji, Y. Kang, Distributed optimization design of continuous-time
multiagent systems with unknown-frequency disturbances. IEEE Trans. Cybern. 47(8), 2058–
2066 (2017)
10. B. Ning, L. Han, Z. Zuo, Distributed optimization of multiagent systems with preserved
network connectivity. IEEE Trans. Cybern. 49(11), 3980–3990 (2019)
11. S. Yang, Q. Liu, J. Wang, A multi-agent system with a proportional-integral protocol for
distributed constrained optimization. IEEE Trans. Autom. Control 62(7), 3461–3467 (2017)
12. Q. Lü, X. Liao, H. Li, T. Huang, Achieving acceleration for distributed economic dispatch in
smart grids over directed networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1988–1999 (2020)
13. D. Yuan, D. Ho, G. Jiang, An adaptive primal-dual subgradient algorithm for online distributed
constrained optimization. IEEE Trans. Cybern. 48(11), 3045–3055 (2018)
14. D. Nunez, J. Cortes, Distributed online convex optimization over jointly connected digraphs.
IEEE Trans. Netw. Sci. Eng. 1(1), 23–37 (2014)
15. M. Akbari, B. Gharesifard, T. Linder, Distributed online convex optimization on time-varying
directed graphs. IEEE Trans. Control Netw. Syst. 4(3), 417–428 (2017)
16. A. Bedi, A. Koppel, K. Rajawat, Beyond consensus and synchrony in online network
optimization via saddle point method (2017). Preprint arXiv:1707.05816
17. H. Pradhan, A. Bedi, A. Koppel, K. Rajawat, Exact decentralized online nonparametric opti-
mization, in 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
(2018). https://doi.org/10.1109/GlobalSIP.2018.8646689
18. R. Dixit, A. Bedi, K. Rajawat, Online learning over dynamic graphs via distributed proximal
gradient algorithm. IEEE Trans. Autom. Control 66(11), 5065–5079 (2021)
19. A. Koppel, F. Jakubiec, A. Ribeiro, A saddle point algorithm for networked online convex
optimization. IEEE Trans. Signal Process. 63(19), 5149–5164 (2015)
20. S. Shahrampour, A. Jadbabaie, Distributed online optimization in dynamic environments using
mirror descent. IEEE Trans. Autom. Control 63(3), 714–725 (2018)
21. S. Hosseini, A. Chapman, M. Mesbahi, Online distributed convex optimization on dynamic
networks. IEEE Trans. Autom. Control 61(11), 3545–3550 (2016)
22. D. Yuan, A. Proutiere, G. Shi, Distributed online linear regression. IEEE Trans. Inf. Theory
67(1), 616–639 (2021)
23. X. Zhou, E. Anese, L. Chen, A. Simonrtto, An incentive-based online optimization framework
for distribution grids. IEEE Trans. Autom. Control 63(7), 2019–2031 (2018)
24. A. Molina, M. Cervantes, E. Montes, M. Perez, Adaptive controller tuning method based
on online multiobjective optimization: a case study of the four-bar mechanism. IEEE Trans.
Cybern. 51(3), 1272–1285 (2021)
25. X. Li, X. Yi, L. Xie, Distributed online optimization for multi-agent networks with coupled
inequality constraints. IEEE Trans. Autom. Control 66(8), 3575–3591 (2021)
26. C. Dwork, Differential privacy: A survey of results, in International Conference on Theory and
Applications of Models of Computation (2008). https://doi.org/10.1007/978-3-540-79228-4_
27. Y. Wang, M. Hale, M. Egerstedt, G. Dullerud, Differentially private objective functions in
distributed cloud-based optimization, in 2016 IEEE 55th Conference on Decision and Control
(CDC) (2016). https://doi.org/10.1109/CDC.2016.7798824
28. E. Nozari, P. Tallapragada, J. Cortes, Differentially private distributed convex optimization via
objective perturbation, in 2016 American Control Conference (ACC) (2016). https://doi.org/10.1109/ACC.2016.7525222
29. M. Hale, M. Egerstedt, Differentially private cloud-based multi-agent optimization with
constraints, in 2015 American Control Conference (ACC) (2015). https://doi.org/10.1109/ACC.2015.7170902
30. Y. Lou, L. Yu, S. Wang, P. Yi, Privacy preservation in distributed subgradient optimization
algorithms. IEEE Trans. Cybern. 48(7), 2154–2165 (2018)
31. C. Zhang, H. Gao, Y. Wang, Privacy-preserving decentralized optimization via decomposition (2018). Preprint arXiv:1808.09566
32. S. Gelfand, S. Mitter, Recursive stochastic algorithms for global optimization in R^d. SIAM J.
Control Optim. 29(5), 999–1018 (1991)
33. K. Harold, G. Yin, Stochastic Approximation and Recursive Algorithms and Applications
(Springer Science & Business Media, Berlin, 2003)
34. S. Ram, A. Nedic, V. Veeravalli, Distributed stochastic subgradient projection algorithms for
convex optimization. J. Optim. Theory Appl. 147, 516–545 (2010)
35. K. Srivastava, A. Nedic, Distributed asynchronous constrained stochastic optimization. IEEE
J. Sel. Topics Signal Process. 5(4), 772–790 (2011)
36. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points-online stochastic gradient for
tensor decomposition, in Proceedings of The 28th Conference on Learning Theory (PMLR),
vol. 40 (2015), pp. 797–842
37. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inform. Process. Over Netw. 4(1), 4–17 (2018)
38. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
39. C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning.
IEEE Trans. Knowl. Data Eng. 30(8), 1440–1453 (2018)
40. A. Nedic, A. Olsevsky, Stochastic gradient-push for strongly convex functions on time-varying
directed graphs. IEEE Trans. Autom. Control 61(12), 3936–3947 (2016)
41. S. Shwartz, Online learning and online convex optimization. Found. Trends Mach. Learn. 4(2),
107–194 (2012)
42. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for
economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron.
64(6), 5095–5106 (2017)
43. S. Lee, A. Nedic, M. Raginsky, Stochastic dual averaging for decentralized online optimization
on time-varying communication graphs. IEEE Trans. Autom. Control 62(12), 6407–6414
(2017)
44. H. Li, C. Huang, G. Chen, X. Liao, T. Huang, Distributed consensus optimization in multi-
agent networks with time-varying directed topologies and quantized communication. IEEE
Trans. Cybern. 47(8), 2044–2057 (2017)
45. Q. Lü, X. Liao, H. Li, T. Huang, A Nesterov-like gradient tracking algorithm for distributed
optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 51(10), 6258–6270
(2021)
46. R. Durrett, Probability: Theory and Examples (Cambridge University Press, Cambridge, 2019)
47. L. Gao, S. Deng, W. Ren, Differentially private consensus with event-triggered mechanism.
IEEE Trans. Control Netw. Syst. 6(1), 60–71 (2019)
