
Accelerating Hierarchical Federated Learning with Adaptive Aggregation Frequency in Edge Computing

Suo Chen, Zhenguo Ma, Zhiyuan Wang
School of Computer Science and Technology, University of Science and Technology of China
Hefei, Anhui, China
ABSTRACT
Federated Learning (FL) has gained significant popularity as a means of handling large-scale data in Edge Computing (EC) applications. Due to the frequent communication between edge devices and the server, the parameter-server-based framework for FL may suffer from a communication bottleneck and lead to degraded training efficiency. As an alternative, Hierarchical Federated Learning (HFL), which leverages edge servers as intermediaries to perform model aggregation among devices in proximity, has emerged. However, existing HFL solutions fail to perform effective training under the constrained and heterogeneous communication resources of edge devices. In this paper, we design a communication-efficient HFL framework, named CE-HFL, to accelerate the convergence of HFL. Concretely, we propose to adjust the global and edge aggregation frequencies in HFL according to the heterogeneous communication resources of edge devices. By performing multiple local updates before communication, the communication overhead on edge servers and the cloud server can be significantly reduced. Experimental results on real-world datasets demonstrate the effectiveness of the proposed method.

CCS CONCEPTS
• Computing methodologies → Self-organization; • Networks → Layering; Network control algorithms.

KEYWORDS
Hierarchical Federated Learning, Edge Computing, Heterogeneous Communication Capacity, Adaptive Aggregation Frequency

ACM Reference Format:
Suo Chen, Zhenguo Ma, and Zhiyuan Wang. 2023. Accelerating Hierarchical Federated Learning with Adaptive Aggregation Frequency in Edge Computing. In 2023 4th International Conference on Computing, Networks and Internet of Things (CNIOT '23), May 26–28, 2023, Xiamen, China. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3603781.3604232

1 INTRODUCTION
As mobile devices become increasingly prevalent in the Internet of Things (IoT) [2], a greater amount of data is being collected at the network edge, such as at gateways and switches. To address the privacy concerns of exposing raw data to remote servers, Federated Learning (FL) [12] has emerged as a potential solution for application scenarios like Edge Computing (EC) [14]. FL follows the popular Parameter Server (PS) framework [7], which typically includes one or more servers and a group of edge devices. Each device maintains a device-specific local model based on its locally stored dataset, while the global model is produced and updated on the server by aggregating local models from edge devices [17].

However, due to the frequent communication between edge devices and the server, the communication overhead on the server may multiply as the number of devices increases, and the server may become the system bottleneck [20]. To alleviate the communication overhead of the server, recent works have focused on Hierarchical Federated Learning (HFL) [8, 9]. In HFL, each device carries out local updates on its own dataset. These updated models are then periodically synchronized through a nearby edge server, resulting in an edge-level model (known as edge aggregation [8]). Subsequently, each edge server synchronizes its model with the cloud server to obtain the global model (known as global aggregation). By utilizing edge servers, HFL reduces the network bandwidth pressure on the cloud server, which is a significant improvement over traditional FL [19].

Nevertheless, two critical factors should be considered to achieve efficient HFL in EC. (i) Limited communication capacity. The majority of edge devices are connected through wireless links such as 4G and Wi-Fi. However, these wireless links may become congested due to limited channel bandwidth, resulting in constrained bandwidth for edge devices [18]. Furthermore, limited bandwidth can cause a significant increase in communication delay between devices, ultimately leading to a decrease in training efficiency. (ii) Heterogeneous communication capacity. Generally, there exists a wide range of edge devices such as smartphones, laptops, and Raspberry Pi computers, each of which may possess a distinct communication bandwidth in its network interface card. Additionally, due to the fluctuating signal strength of wireless links at varying communication distances, the communication capacities of these edge devices are often dissimilar. For instance, the Raspberry Pi 3 B and 3 B+ are outfitted with different communication hardware: the former has a 100 Mbps interface while the latter has a Gigabit interface [15]. As a consequence, communication-efficient HFL is appealing for edge devices with limited and heterogeneous communication resources.


Several recent studies in the field of HFL have proposed different solutions to tackle the challenges mentioned above. To efficiently handle bandwidth limitations in HFL, some approaches (such as [18] and [19]) focus on allocating a group of devices to each edge server for training acceleration. The authors of [1] use gradient sparsification to improve communication efficiency, while Luo et al. [9] develop a joint resource allocation and device-edge assignment strategy to achieve global cost minimization. However, these methods aim to handle resource constraints in EC without considering the heterogeneous communication resources among devices. Consequently, they may lead to unnecessary waiting time among edge devices, and resource utilization is degraded.

In this paper, we propose a communication-efficient HFL framework, named CE-HFL, to tackle the constrained and heterogeneous communication capacities of edge devices with adaptive aggregation frequencies. Concretely, CE-HFL jointly optimizes the global and edge aggregation frequencies to reduce the waiting time and communication overhead. However, the value of the aggregation frequency is an important factor that affects the performance of the proposed algorithm. Specifically, when large aggregation frequencies are adopted, the communication overhead can be reduced, but the training performance of the global model may be degraded. In contrast, the model performance can be enhanced when a small aggregation frequency is adopted, but the communication overhead is increased for the edge servers and the cloud server. Besides, considering the heterogeneous communication capacities, the value of the global aggregation frequency may be affected by the value of the edge aggregation frequency. As a consequence, it is challenging to jointly adjust the global and edge aggregation frequencies. The main contributions of this paper are summarized as follows:

• We propose a communication-efficient HFL framework, named CE-HFL, which jointly adjusts the global and edge aggregation frequencies, so as to tackle the constrained and heterogeneous communication capacities of edge devices.
• We conduct extensive experiments to evaluate the performance of CE-HFL. The results demonstrate that CE-HFL can significantly reduce the waiting time and improve the model accuracy given the same training time.
so as to reduce the waiting time among edge devices and edge
server without incurring performance degradation. The framework
2 SYSTEM OVERVIEW
In this section, we first present the preliminaries of traditional hierarchical FL. Subsequently, we illustrate the training procedure of the proposed CE-HFL.

2.1 System Model
Under the traditional hierarchical federated learning (HFL) framework, three types of entities participate in the model training: (i) edge devices, (ii) edge servers, and (iii) the cloud server [8]. (i) Edge devices (e.g., smartphones, laptops) are generally equipped with constrained computation and communication capacities, and perform the local model updating based on their locally stored datasets. For the sake of illustration, let $\mathcal{U} = \{u_1, u_2, \cdots, u_n\}$ denote the set of edge devices that participate in HFL, where $n = |\mathcal{U}|$ represents the number of edge devices. Additionally, we denote $\mathcal{D}_u = \{(x_i, y_i)\}_{i=1}^{D_u}$ as the local dataset of each device $u \in \mathcal{U}$, where $x_i$ is the $i$-th sample, $y_i$ is the corresponding label, and $D_u = |\mathcal{D}_u|$ represents the number of samples in $\mathcal{D}_u$. (ii) Edge servers serve as the intermediaries between the cloud server and the edge devices, and are usually deployed close to the edge devices (e.g., at the network edge). Generally, edge servers are equipped with more computation and communication resources than edge devices, and are core components in HFL. Concretely, we introduce $\mathcal{V} = \{v_1, v_2, \cdots, v_m\}$ to represent the set of edge servers, where $m = |\mathcal{V}|$ denotes the number of edge servers. (iii) The cloud server is usually regarded as the coordinator of all the distributed devices and edge servers, and is responsible for aggregating the edge-level models in HFL and distributing the updated global model to all edge servers.

2.2 Traditional Hierarchical Federated Learning
During the training of a traditional HFL algorithm (e.g., HierFAVG), each device $u \in \mathcal{U}$ first performs local updating for $\tau_e$ iterations and sends the updated local model to an assigned edge server $v \in \mathcal{V}$. Subsequently, the edge server $v$ conducts the edge aggregation based on the received local models. After $\tau_g$ edge aggregations, edge server $v$ sends the aggregated model to the cloud server for global aggregation. Concretely, we denote $K$ as the total number of communication rounds on each device, which is equal to the number of performed edge aggregations. In this way, the total number of training epochs (i.e., global aggregations) can be formulated as $\frac{K}{\tau_g}$. To be concise, one training epoch is composed of $\tau_g$ communication rounds on the edge devices, and one round refers to $\tau_e$ local updates. Through the hierarchical aggregation (i.e., edge aggregation and global aggregation) of the trained model, the performance of the global model can be enhanced while significantly reducing the communication cost of the cloud server.
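To make this schedule concrete, the following minimal Python sketch mirrors a HierFAVG-style training loop with fixed frequencies $\tau_e$ and $\tau_g$. It assumes the model is a flat numpy parameter vector, and the helper names local_sgd_step and weighted_average are illustrative assumptions of this sketch rather than functions from the original paper.

def weighted_average(models, sizes):
    # Sample-size-weighted average of parameter vectors (numpy arrays).
    total = float(sum(sizes))
    return sum(w * (n / total) for w, n in zip(models, sizes))

def hierfavg(global_model, servers, tau_e, tau_g, epochs, local_sgd_step):
    # servers: list of edge servers, each a list of (local_dataset, num_samples) pairs.
    # local_sgd_step(w, dataset) returns the model after one local SGD iteration.
    for _ in range(epochs):                              # one epoch ends with a global aggregation
        edge_models, edge_sizes = [], []
        for devices in servers:
            edge_model = global_model.copy()
            for _ in range(tau_g):                       # tau_g edge aggregations per epoch
                local_models, local_sizes = [], []
                for dataset, n in devices:
                    w = edge_model.copy()
                    for _ in range(tau_e):               # tau_e local updates per round
                        w = local_sgd_step(w, dataset)
                    local_models.append(w)
                    local_sizes.append(n)
                edge_model = weighted_average(local_models, local_sizes)   # edge aggregation
            edge_models.append(edge_model)
            edge_sizes.append(sum(n for _, n in devices))
        global_model = weighted_average(edge_models, edge_sizes)           # global aggregation
    return global_model

CE-HFL keeps this overall structure but, as described next, replaces the fixed tau_e and tau_g with per-device and per-server frequencies that are re-computed every round.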


2.3 The Training Procedure of CE-HFL
However, considering the heterogeneous resources of edge devices (and edge servers), performing the same number of local updates (or edge aggregations) on different devices may incur unbearable waiting time. To this end, we propose to adjust the number of local updates and edge aggregations according to the heterogeneous resources, so as to reduce the waiting time among edge devices and edge servers without incurring performance degradation. The framework of CE-HFL is illustrated in Fig. 1, and it mainly consists of three steps, i.e., local updating, edge aggregation, and global aggregation.

Figure 1: Illustration of the training process of CE-HFL (local updating on devices u1–u9, edge aggregation on edge servers v1–v3 with per-device edge aggregation frequencies, and global aggregation on the cloud server with per-server global aggregation frequencies).
Local Updating. On each device $u \in \mathcal{U}$, the local loss function $f_u$ is defined on the local dataset $\mathcal{D}_u$ as follows [16]:

$$f_u(w_u^k) = \frac{1}{D_u} \sum_{i=1}^{D_u} F_u(x_i; y_i, w_u^k), \quad (1)$$

where $w_u^k$ represents the local model of device $u$ at round $k \in \{1, 2, \cdots, K\}$, and $F_u(x_i; y_i, w_u^k)$ denotes the loss over the $i$-th data sample. Upon receiving the aggregated model $w_v^k$ and the determined edge aggregation frequency $\tau_{e,u}^k$ from edge server $v$, each device $u$ iteratively updates the local model for $\tau_{e,u}^k$ iterations at round $k$, so as to minimize the local loss function $f_u$ and reduce the waiting time among edge devices. Based on the stochastic gradient descent algorithm [3], device $u$ updates the local model $w_u^{k,\tau'}$ as follows:

$$w_u^{k,\tau'+1} = w_u^{k,\tau'} - \eta \nabla f_u(w_u^{k,\tau'}), \quad (2)$$

where $w_u^{k,\tau'}$ represents the local model of device $u$ at iteration $\tau' \in \{1, 2, \cdots, \tau_{e,u}^k\}$ of the $k$-th round, $\eta$ denotes the learning rate, and $\nabla f_u(w_u^{k,\tau'})$ is the gradient of the local loss function at $w_u^{k,\tau'}$. Subsequently, the updated local model $w_u^{k,\tau_{e,u}^k}$ and the completion time (denoted as $t_u^k$) of round $k$ on device $u$ are transmitted to the assigned edge server $v \in \mathcal{V}$ for edge aggregation. The completion time $t_u^k$ of device $u$ at round $k$ consists of two parts: computation time and communication time [11]. The computation time $t_{u,cp}^k$ of device $u$ at the $k$-th round can be formulated as

$$t_{u,cp}^k = \tau_{e,u}^k \frac{M}{C_u}, \quad (3)$$

where $\tau_{e,u}^k$ represents the number of local updates on device $u$ at round $k$, $M$ is the required computation capacity of one local update, and $C_u$ is the computation capacity of device $u$. In addition to the computation time, $t_u^k$ also includes the time for downloading the aggregated model and the time for uploading the local model. As a consequence, the communication time of device $u$ (i.e., the sum of download time and upload time) can be regarded as a constant $t_{u,cm}$, which may vary among different edge devices. In this way, the completion time $t_u^k$ of device $u$ at round $k$ is defined as

$$t_u^k = t_{u,cp}^k + t_{u,cm} = \tau_{e,u}^k \frac{M}{C_u} + t_{u,cm}. \quad (4)$$
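As a rough illustration of the device-side step described by Eqs. (2)–(4), the sketch below performs the $\tau_{e,u}^k$ gradient iterations and returns the modeled completion time. The callable grad_fn and the constants M, C_u and t_{u,cm} are stand-ins supplied by the caller and are assumptions of this sketch, not code from the paper.

def device_round(w, dataset, tau_e, eta, grad_fn, M, C_u, t_u_cm):
    # One communication round on device u, following Eqs. (2)-(4).
    # w       : local model parameters (numpy array), initialized to the received edge model
    # tau_e   : edge aggregation frequency tau_{e,u}^k, i.e. the number of local updates
    # grad_fn : callable returning the gradient of the local loss f_u at w
    # M, C_u  : computation demand per update and device computation capacity, as in Eq. (3)
    # t_u_cm  : constant communication time t_{u,cm} of the device
    for _ in range(tau_e):
        w = w - eta * grad_fn(w, dataset)                # Eq. (2): w <- w - eta * grad f_u(w)
    t_u_cp = tau_e * M / C_u                             # Eq. (3): computation time
    t_u = t_u_cp + t_u_cm                                # Eq. (4): completion time reported to the edge server
    return w, t_u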
Edge Aggregation. On edge server $v$, the edge aggregation is triggered after receiving the updated local models and the completion times from all the assigned edge devices. Following the typical Federated Averaging algorithm [12], the edge aggregation on edge server $v$ is performed as follows:

$$w_v^{k+1} = \sum_{u \in \mathcal{S}_v} \frac{D_u \, w_u^{k,\tau_{e,u}^k}}{D_v}, \quad (5)$$

where $w_v^{k+1}$ represents the aggregated model on the edge server at round $k+1$, $\mathcal{S}_v$ is the set of edge devices assigned to $v$, and $D_v = \sum_{u \in \mathcal{S}_v} D_u$ is the accumulated number of samples on the edge devices assigned to $v$. Then, edge server $v$ calculates the edge aggregation frequency $\tau_{e,u}^{k+1}$ for each device $u \in \mathcal{S}_v$ according to the received completion times, so as to equalize the consumed time $t_u^{k+1}$ of the assigned devices at round $k+1$. The aggregated model $w_v^{k+1}$ and the updated edge aggregation frequency $\tau_{e,u}^{k+1}$ are distributed to the devices $u \in \mathcal{S}_v$ to start the next round of local updating. On the other hand, after finishing $\tau_{g,v}$ edge aggregations, edge server $v$ uploads the aggregated model and the consumed time to the cloud server for global aggregation. The completion time of edge server $v$ for accomplishing $\tau_{g,v}$ edge aggregations also consists of two parts: computation time and communication time. The computation time $t_{v,cp}^k$ of server $v$ at the $k$-th round can be formulated as

$$t_{v,cp}^k = \max_{u \in \mathcal{S}_v} t_u^k = \max_{u \in \mathcal{S}_v} \{t_{u,cp}^k + t_{u,cm}\}. \quad (6)$$

It is worth noting that we ignore the aggregation time of local models, since the computation resource of the edge server is adequate and the aggregation time is negligible. We also assume the communication time of edge server $v$ is a constant $t_{v,cm}$. In this way, the completion time $t_v^k$ of edge server $v$ at round $k$ is defined as

$$t_v^k = \begin{cases} t_{v,cp}^k + t_{v,cm}, & \text{if } \tau_{g,v} \text{ edge aggregations are finished}, & (7a) \\ t_{v,cp}^k, & \text{else}. & (7b) \end{cases}$$

As a consequence, the consumed time of edge server $v$ for accomplishing $\tau_{g,v}$ edge aggregations is obtained by accumulating the completion times of the corresponding rounds $k$.
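The edge-server side of a round can then be sketched as follows, applying the weighted average of Eq. (5) and the timing rules of Eqs. (6)–(7); representing the received updates as (w_u, D_u, t_u) triples is an assumption made here for brevity.

def edge_round(local_models, t_v_cm, finished_tau_g):
    # One edge aggregation on server v, following Eqs. (5)-(7).
    # local_models   : list of (w_u, D_u, t_u) triples received from the devices in S_v
    # t_v_cm         : constant communication time of the edge server
    # finished_tau_g : True if the required tau_{g,v} edge aggregations are finished
    D_v = float(sum(D_u for _, D_u, _ in local_models))
    w_v = sum(w_u * (D_u / D_v) for w_u, D_u, _ in local_models)    # Eq. (5): weighted averaging
    t_v_cp = max(t_u for _, _, t_u in local_models)                 # Eq. (6): slowest assigned device
    t_v = t_v_cp + t_v_cm if finished_tau_g else t_v_cp             # Eq. (7): upload time only when reporting to the cloud
    return w_v, t_v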
Global Aggregation. On the cloud server, the global aggregation is performed as soon as the aggregated models and the completion times are retrieved from all the edge servers. Concretely, the global aggregation at the $k$-th round is performed as follows:

$$w^{k+1} = \begin{cases} \sum_{v \in \mathcal{V}} \frac{D_v \, w_v^{k+1}}{D}, & \text{if the required edge aggregations are finished}, & (8a) \\ w^k, & \text{else}, & (8b) \end{cases}$$

where $w^k$ denotes the global model at round $k$, and $D = \sum_{v \in \mathcal{V}} D_v = \sum_{u \in \mathcal{U}} D_u$ represents the total number of samples on all edge devices. Subsequently, the cloud server calculates the global aggregation frequency $\tau_{g,v}$ for each edge server $v \in \mathcal{V}$ according to the received completion times, so as to equalize the consumed time of the edge servers at the next epoch. The updated global model and the global aggregation frequencies are sent back to all edge servers to replace the aggregated models. Regarding the abundant resources of the cloud server, we ignore the consumed time for model averaging. As a consequence, the completion time of round $k$ can be expressed as

$$t^k = \max_{v \in \mathcal{V}} t_v^k = \begin{cases} \max_{v \in \mathcal{V}} \{\max_{u \in \mathcal{S}_v} \{t_{u,cp}^k + t_{u,cm}\} + t_{v,cm}\}, & \text{if the required edge aggregations are finished}, & (9a) \\ \max_{v \in \mathcal{V}} \{\max_{u \in \mathcal{S}_v} \{t_{u,cp}^k + t_{u,cm}\}\}, & \text{else}. & (9b) \end{cases}$$
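Mirroring the edge-level step, the cloud-side update of Eqs. (8)–(9) reduces to another weighted average plus a maximum over the per-server completion times. The sketch below again uses illustrative (w_v, D_v, t_v) triples and assumes each t_v already includes $t_{v,cm}$ when the model was uploaded, as in Eq. (7).

def global_round(w_prev, edge_models, finished):
    # Global aggregation on the cloud server, following Eqs. (8)-(9).
    # edge_models : list of (w_v, D_v, t_v) triples from all edge servers
    # finished    : True if the required edge aggregations are finished
    t_k = max(t_v for _, _, t_v in edge_models)                      # Eq. (9): round completion time
    if not finished:
        return w_prev, t_k                                           # Eq. (8b): keep the previous global model
    D = float(sum(D_v for _, D_v, _ in edge_models))
    w_new = sum(w_v * (D_v / D) for w_v, D_v, _ in edge_models)      # Eq. (8a): weighted averaging of edge models
    return w_new, t_k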


For the sake of clear illustration, the above training process of CE-HFL is summarized in Alg. 1. It is worth noting that the above three steps are conducted iteratively until the global loss $f(w^K) = \sum_{u \in \mathcal{U}} \frac{D_u f_u(w^K)}{D}$ is minimized within $K$ rounds, or the time budget $T^B$ is exceeded.

Algorithm 1 Training procedure of CE-HFL
1: Initialize the model parameters $w^0$ and send them to all edge devices;
2: for round $k \in \{1, 2, \ldots, K\}$ and $T^k < T^B$ do
3:   Procedure at the cloud server
4:   if the aggregated models are received from all edge servers then
5:     Update the global model according to Eqs. (8a) and (8b);
6:     Calculate the global aggregation frequency for each edge server using Alg. 2;
7:     Distribute the global model and the aggregation frequencies to the edge servers;
8:     Record the completion time as $T^k = \sum_{j=1}^{k} t^j$;
9:   end if
10:  Procedure at the edge server $v$
11:  Receive the local model $w_u^{k,\tau_{e,u}^k}$ and the completion time $t_u^k$ from all the devices $u \in \mathcal{S}_v$;
12:  Update the aggregated model $w_v^{k+1}$ according to Eq. (5);
13:  if the required $\tau_{g,v}$ edge aggregations are finished then
14:    Send $w_v^{k+1}$ and the completion time to the cloud server for global aggregation;
15:    Receive $w^{k+1}$ and the updated $\tau_{g,v}$ from the cloud server and set $w_v^{k+1} = w^{k+1}$;
16:  end if
17:  Calculate the edge aggregation frequency for each device $u \in \mathcal{S}_v$ using Alg. 2;
18:  Broadcast the updated model $w_v^{k+1}$ and the edge aggregation frequencies to all the assigned devices;
19:  Procedure at the edge device $u$
20:  Receive the aggregated model and the edge aggregation frequency from edge server $v$;
21:  for local update $\tau' \in \{1, 2, \ldots, \tau_{e,u}^k\}$ do
22:    Update the local model as $w_u^{k,\tau'+1} = w_u^{k,\tau'} - \eta \nabla f_u(w_u^{k,\tau'})$;
23:  end for
24:  Send the updated model $w_u^{k,\tau_{e,u}^k}$ and the consumed time to edge server $v$;
25: end for
26: Return the trained global model $w^K$;
3 ALGORITHM DESIGN
In this section, we present the algorithm used by the cloud server and the edge servers to determine the aggregation frequencies. First of all, we initialize some critical parameters in Lines 1-3 of Alg. 2. Concretely, the completion time and the communication time of each edge device and server are initialized, which are used to calculate the aggregation frequencies. Besides, the computation capacity of each device $u \in \mathcal{U}$ is initialized as $c_u$. We note that the computation capacity $C_u$ can be replaced with a factor $c_u$ that is maintained on the edge server and is updated as $c_u = \gamma \cdot c_u + (1-\gamma) \cdot \frac{t_u^k - t_{u,cm}}{\tau_{e,u}^k}$ when the latest completion time $t_u^k$ is received. As a consequence, the proposed algorithm does not require any prior knowledge about the computation capacity of each edge device.
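A minimal sketch of this online estimate is given below, assuming the edge server simply keeps the latest reported completion time per device. Note that the factor $c_u$ effectively tracks the per-update time $M/C_u$, so Line 7 of Alg. 2 can be evaluated as $\lfloor (\hat{t}_v^{k+1} - t_{u,cm}) / c_u \rfloor$ without knowing $M$ or $C_u$ separately; the concrete value of $\gamma$ is not specified in the text.

def update_capacity_factor(c_u, t_u, t_u_cm, tau_e_u, gamma):
    # EMA estimate of the per-update computation time of device u, i.e. an estimate of M / C_u.
    # gamma is the smoothing weight from the text; its concrete value is not given in the paper.
    observed = (t_u - t_u_cm) / tau_e_u          # measured time per local update in the latest round
    return gamma * c_u + (1.0 - gamma) * observed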
Algorithm 2 Determine the aggregation frequencies for CE-HFL
1: Initialize $t_u^k$ and $t_{u,cm}$ for $\forall u \in \mathcal{U}$, and $t_v^k$ and $t_{v,cm}$ for $\forall v \in \mathcal{V}$;
2: Initialize the computation capacity $C_u$ for $\forall u \in \mathcal{U}$;
3: Calculate the average completion time of all edge servers at the current epoch, and denote it as $\bar{t}$;
4: for each $v \in \mathcal{V}$ do
5:   Let $\hat{t}_v^{k+1} = \frac{1}{|\mathcal{S}_v|} \sum_{u \in \mathcal{S}_v} t_u^k$;
6:   for each $u \in \mathcal{S}_v$ do
7:     Calculate $\tau_{e,u}^{k+1} = \lfloor \frac{(\hat{t}_v^{k+1} - t_{u,cm}) \, C_u}{M} \rfloor$;
8:   end for
9:   Calculate $\tau_{g,v} = \lfloor \frac{\bar{t} - t_{v,cm}}{\hat{t}_v^{k+1}} \rfloor$;
10: end for
11: Return the calculated $\tau_{e,u}^{k+1}$ and $\tau_{g,v}$ to the corresponding server and device;
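For readability, Alg. 2 can also be written compactly in Python. The dictionary layout, the use of the per-update time factor c_u (an estimate of M/C_u, as discussed above) in place of M and C_u, and the max(1, ...) guard against non-positive frequencies are assumptions of this sketch rather than details from the paper.

import math

def determine_frequencies(servers, t_bar):
    # Sketch of Alg. 2. servers maps each edge server v to
    #   {"t_cm": t_v_cm, "devices": {u: {"t": t_u_k, "t_cm": t_u_cm, "c": c_u}}},
    # where c_u is the per-update time factor maintained by the edge server.
    # t_bar is the average completion time of all edge servers at the current epoch (Line 3).
    tau_e, tau_g = {}, {}
    for v, info in servers.items():
        devices = info["devices"]
        # Line 5: estimated round time of server v = average device completion time at round k
        t_hat = sum(d["t"] for d in devices.values()) / len(devices)
        for u, d in devices.items():
            # Line 7: number of local updates that fit into the estimated round time
            tau_e[u] = max(1, math.floor((t_hat - d["t_cm"]) / d["c"]))
        # Line 9: number of edge aggregations that fit into the epoch budget
        tau_g[v] = max(1, math.floor((t_bar - info["t_cm"]) / t_hat))
    return tau_e, tau_g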
After the initialization, we calculate the average completion time of all edge servers at the current epoch and denote it as $\bar{t}$ (Line 3 in Alg. 2). Subsequently, for each edge server $v \in \mathcal{V}$, we calculate the edge aggregation frequency for the devices assigned to it. Specifically, we estimate the completion time of a round on server $v$ at round $k+1$ as the average value of the completion times of its devices at round $k$, as depicted in Line 5 of Alg. 2. Based on the estimated time $\hat{t}_v^{k+1}$, we can calculate the aggregation frequency for device $u$ as $\tau_{e,u}^{k+1} = \lfloor \frac{(\hat{t}_v^{k+1} - t_{u,cm}) C_u}{M} \rfloor$ (Line 7 of Alg. 2). It is worth noting that the aggregation frequencies of the edge devices are adjusted to align with the estimated $\hat{t}_v^{k+1}$, so the waiting time among the edge devices assigned to $v$ can be reduced. Similarly, the global aggregation frequency of server $v$ is adjusted based on the average time $\bar{t}$ and $\hat{t}_v^{k+1}$, as shown in Line 9 of Alg. 2. By aligning the completion time of each server at the next epoch with the estimated $\bar{t}$, the waiting time among edge servers can also be decreased. In this way, by adaptively adjusting the global aggregation frequency and the edge aggregation frequency, CE-HFL can reduce the waiting time and communication cost during model training without incurring performance degradation.

We note that the edge devices participating in FL may be various kinds of mobile devices. In addition, edge devices often connect to a base station (BS) via unstable wireless links. Hence, considering device mobility in the dynamic EC environment, a communication timeout may occur due to BS handover or unstable network conditions. The communication timeout may lead to extra waiting time among edge devices, since the fast devices are compelled to wait for the devices suffering from the timeout, and resource utilization is further lowered. To this end, we refer to the timeout retransmission mechanism of the TCP protocol and introduce a similar method that retransmits the local models if no reply is received from the server within the retransmission timeout (RTO). Furthermore, the network conditions always vary with time in EC. As a consequence, the value of the RTO (denoted as $\mathcal{T}_{RTO}$) should be adjusted based on the real-time network condition. According to [13], $\mathcal{T}_{RTO}$ can be adjusted as follows:

$$\mathcal{T}_{RTO} = \hat{\mathcal{T}}_{RTT} + \varphi \cdot \sigma(\mathcal{T}_{RTT}), \quad (9c)$$

where $\varphi$ is a factor indicating the network condition and is set as 4 in general [13], $\hat{\mathcal{T}}_{RTT}$ denotes the smoothed round-trip time and can be obtained by incorporating the history round-trip times $\mathcal{T}_{RTT}$, and $\sigma(\mathcal{T}_{RTT})$ represents the round-trip time variation.
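A sketch of this adaptation is shown below, assuming the device keeps a smoothed RTT and an RTT-variation estimate that are refreshed with every acknowledged upload; the smoothing constants alpha and beta are the usual defaults from [13] and are assumptions here rather than values given in the paper.

def update_rto(rtt_sample, srtt, rttvar, phi=4, alpha=0.125, beta=0.25):
    # One update of the retransmission timeout from a new round-trip-time sample.
    # srtt tracks the smoothed RTT and rttvar the RTT variation; phi = 4 as in [13].
    rttvar = (1.0 - beta) * rttvar + beta * abs(srtt - rtt_sample)
    srtt = (1.0 - alpha) * srtt + alpha * rtt_sample
    rto = srtt + phi * rttvar                            # Eq. (9c): RTO = smoothed RTT + phi * RTT variation
    return rto, srtt, rttvar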


4 EXPERIMENTATION AND EVALUATION
4.1 Experimental Setup
In this section, extensive experiments are conducted to evaluate the performance of our proposed framework. We conduct the experiments on two real-world datasets with two models: (i) LeNet-5 [6] on FMNIST [21], and (ii) AlexNet [5] on CIFAR-10. The two datasets are randomly divided into 15 partitions to serve as the local datasets of 15 edge devices.

4.1.1 Parameter Settings. In our experiments, the batch size and learning rate are set as 256 and 3e-5 for LeNet-5, and 64 and 0.003 for AlexNet, respectively [10]. In addition, the training time for FMNIST and CIFAR-10 is specified as 2,000s and 5,000s, respectively. We note that the global and edge aggregation frequencies are mixed with their history values using an exponential moving average [4]. The weight of the history value is set as 0.8 to guarantee system stability [11].
4.1.2 Baselines. In order to evaluate the effectiveness of the pro- effectiveness of CE-HFL.
posed CE-HFL framework, we introduce the other three methods
as baselines: (i) FedAvg [12], (ii) FedCH [19], and (iii) HierFAVG [8]. 4.2.2 Effect of Adaptive Aggregation Frequencies. The second set
Specifically, the first baseline FedAvg is the typical parameter server of experiments evaluates the effect of adaptive aggregation fre-
based FL mechanism, where the cloud server updates the global quencies on the average waiting time. The average waiting time of
model by averaging the local models from all edge devices, and different methods is shown in Fig. 3, where the horizontal axis is the
the aggregation frequency is set as 20. The second baseline FedCH number of communication rounds. In Figs. 3, given the same train-
is a hierarchical FL framework by constructing the logical cluster ing time, CE-HFL can train the models with more communication
among devices, and the cluster head is responsible for aggregating rounds. For example, by Fig. 3(a), the communication rounds for CE-
the local models from the assigned edge devices. For FedCH, the HFL, FedAvg, FedCH and HierFAVG are 59, 52, 45, 40, respectively.
global and edge aggregation frequencies are set as 20 and 1 for That is because CE-HFL assign different aggregation frequencies
fairness. The third baseline HierFAVG performs the hierarchical FL for various devices according to the heterogeneous resources, and
model training with fixed global and edge aggregation frequencies, resource utilization is improved. In addition, though producing
which are set as 4 and 15 in our experiments. higher waiting time at the start of training, the stable waiting time
of CE-HFL is the lowest among those of all the methods. Gener-
4.2 Experimental Results ally, CE-HFL reduces the average waiting time by approximately
71.1% (over FMNIST) and 78.6% (over CIFAR-10) compared with the
4.2.1 Overall Performance. In our first set of experiments, we eval-
baselines.
uate the overall performance of different methods given the same
training time budget. Concretely, with the time budget of 2,000s 4.2.3 Comparison of Completion Time. We conduct the third set
and 5,000s for FEMNIST and CIFAR-10, the test accuracy curves of of experiments to compare the performance of CE-HFL and the
CE-HFL and baselines are presents in Fig. 2, where the horizontal baseline methods when they achieve the target accuracy. The exper-
axis denotes the training time. As training progresses, the test ac- imental results are illustrated in Fig. 4, where the horizontal axis is
curacy for all methods increases. However, the models trained with the target accuracy. In Fig. 4, we can observe that more completion
CE-HFL have faster convergence rate that the models trained with time is always required to obtain higher test accuracy for each
other methods. For example, by Fig. 2(b), when the training time method. It is appealing that the completion time of CE-HFL is the
equals 1,917s, CE-HFL obtains the test accuracy of 0.72 on CIFAR-10. lowest among all strategies when achieving the same test accuracy.
In comparison, the test accuracy of other methods is lower than For instance, by Fig. 4(a), to obtain the test accuracy of 0.86 on
0.67. Moreover, given the same training time, CE-HFL achieves FMNIST, the completion time of CE-HFL, FedAvg, FedCH and Hier-
higher test accuracy than the baseline methods. Concretely, CE- FAVG is 202s, 255s, 575s, and 1,008s. In other words, CE-HFL saves
HFL obtains the test accuracy of 0.98 (over FMNIST) and 0.76 (over the completion time by 20.47%, 64.8$ and 79.9%, when achieving


Figure 4: Completion time of different methods when achieving the target accuracy. (a) LeNet-5 over FMNIST; (b) AlexNet over CIFAR-10.

4.2.4 Comparison of Traffic Consumption. Moreover, we also record the traffic consumption of the different methods when achieving the target accuracy. The experimental results are presented in Fig. 5, where the horizontal axis represents the target accuracy. By Fig. 5, as the model complexity increases, the total traffic consumption also increases. For example, the total traffic consumption of FedAvg is 6,678MB for LeNet-5, while it consumes 73,703MB for AlexNet. In addition, the proposed method always incurs less network traffic when achieving the target accuracy. That is because the global and edge aggregation frequencies are adjusted so that the server communicates less frequently with the edge servers, and thus the communication cost is reduced. In total, CE-HFL consumes 1,445MB and 21,431MB of traffic on FMNIST and CIFAR-10 when obtaining the target accuracy of 0.88 and 0.65. In comparison, the traffic consumption of the baseline methods is higher than 2,079MB (over FMNIST) and 35,022MB (over CIFAR-10). As a consequence, on FMNIST and CIFAR-10, CE-HFL can save the traffic consumption by at least 30.5% and 38.8%, respectively, when compared with the existing methods.

Figure 5: Traffic consumption of different methods when achieving the target accuracy. (a) LeNet-5 over FMNIST; (b) AlexNet over CIFAR-10.

Based on the above results, compared with the baselines, CE-HFL can reduce the waiting time among edge devices and improve the test accuracy given the same training time, which demonstrates the effectiveness of the proposed framework.

5 CONCLUSION
In this paper, we focused on accelerating hierarchical federated learning by adaptively adjusting the aggregation frequencies. We proposed an efficient algorithm to achieve a trade-off between communication cost and model performance, and extended it to real edge computing scenarios. We conducted extensive experiments on real-world datasets. The experimental results demonstrated the effectiveness of our proposed framework.

REFERENCES
[1] Mehdi Salehi Heydar Abad, Emre Ozfatura, Deniz Gunduz, and Ozgur Ercetin. 2020. Hierarchical federated learning across heterogeneous cellular networks. In ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8866–8870.
[2] Kevin Ashton et al. 2009. That 'internet of things' thing. RFID Journal 22, 7 (2009), 97–114.
[3] Léon Bottou. 2012. Stochastic gradient descent tricks. Neural Networks: Tricks of the Trade: Second Edition (2012), 421–436.
[4] Frank Klinker. 2011. Exponential moving average versus moving exponential average. Mathematische Semesterberichte 58 (2011), 97–107.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012).
[6] Yann LeCun et al. 2015. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet 20, 5 (2015), 14.
[7] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 583–598.
[8] Lumin Liu, Jun Zhang, SH Song, and Khaled B Letaief. 2020. Client-edge-cloud hierarchical federated learning. In ICC 2020 - IEEE International Conference on Communications (ICC). IEEE, 1–6.
[9] Siqi Luo, Xu Chen, Qiong Wu, Zhi Zhou, and Shuai Yu. 2020. HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning. IEEE Transactions on Wireless Communications 19, 10 (2020), 6535–6548.
[10] Zhenguo Ma, Yang Xu, Hongli Xu, Jianchun Liu, and Yinxing Xue. 2022. Like Attracts Like: Personalized Federated Learning in Decentralized Edge Computing. IEEE Transactions on Mobile Computing (2022).
[11] Zhenguo Ma, Yang Xu, Hongli Xu, Zeyu Meng, Liusheng Huang, and Yinxing Xue. 2023. Adaptive Batch Size for Federated Learning in Resource-Constrained Edge Computing. IEEE Transactions on Mobile Computing 22, 1 (2023), 37–53. https://doi.org/10.1109/TMC.2021.3075291
[12] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
[13] Vern Paxson, Mark Allman, Jerry Chu, and Matt Sargent. 2011. Computing TCP's retransmission timer. Technical Report.
[14] Weisong Shi and Schahram Dustdar. 2016. The promise of edge computing. Computer 49, 5 (2016), 78–81.
[15] Oscar C Valderrama Riveros and Fernando G Tinetti. 2021. MPI communication performance in a heterogeneous environment with Raspberry Pi. In Advances in Parallel & Distributed Processing, and Applications: Proceedings from PDPTA'20, CSC'20, MSV'20, and GCC'20. Springer, 451–460.
[16] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. 2018. When edge meets learning: Adaptive control for resource-constrained distributed machine learning. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. IEEE, 63–71.
[17] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. 2019. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37, 6 (2019), 1205–1221.
[18] Zhiyuan Wang, Hongli Xu, Jianchun Liu, He Huang, Chunming Qiao, and Yangming Zhao. 2021. Resource-efficient federated learning with hierarchical aggregation in edge computing. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. IEEE, 1–10.
[19] Zhiyuan Wang, Hongli Xu, Jianchun Liu, Yang Xu, He Huang, and Yangming Zhao. 2022. Accelerating federated learning with cluster construction and hierarchical aggregation. IEEE Transactions on Mobile Computing (2022).
[20] Xin Wu, Zhi Wang, Jian Zhao, Yan Zhang, and Yu Wu. 2020. FedBC: Blockchain-based decentralized federated learning. In 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). IEEE, 217–221.
[21] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
