charming for edge devices due to their limited and heterogeneous communication resources.

Several recent studies in the field of HFL have proposed different solutions to tackle the challenges mentioned above. To efficiently handle bandwidth limitations in HFL, some approaches (such as [18] and [19]) focus on allocating a group of devices to each edge server for training acceleration. The authors of [1] use gradient sparsification to improve communication efficiency, while Luo et al. [9] develop a joint resource allocation and device-edge assignment strategy to achieve global cost minimization. However, these methods handle resource constraints in EC without considering the heterogeneous communication resources among devices. Consequently, they may lead to unnecessary waiting time among edge devices and degraded resource utilization.

In this paper, we propose a communication-efficient HFL framework, named CE-HFL, to tackle the constrained and heterogeneous communication capacities of edge devices with adaptive aggregation frequencies. Concretely, CE-HFL jointly optimizes the global and edge aggregation frequencies to reduce the waiting time and communication overhead. However, the aggregation frequency is an important factor that affects the performance of the proposed algorithm. Specifically, when large aggregation frequencies are adopted, the communication overhead can be reduced, but the training performance of the global model may be degraded. In contrast, the model performance can be enhanced when a small aggregation frequency is adopted, but the communication overhead is increased for the edge servers and the cloud server. Besides, considering the heterogeneous communication capacities, the value of the global aggregation frequency may be affected by the value of the edge aggregation frequency. As a consequence, it is challenging to jointly adjust the global and edge aggregation frequencies. The main contributions of this paper are summarized as follows:

• We propose a communication-efficient HFL framework, named CE-HFL, which jointly adjusts the global and edge aggregation frequencies, so as to tackle the constrained and heterogeneous communication capacities of edge devices.
• We conduct extensive experiments to evaluate the performance of CE-HFL. The results demonstrate that CE-HFL can significantly reduce the waiting time and improve the model accuracy given the same training time.

2 SYSTEM OVERVIEW
In this section, we first present the preliminaries for traditional hierarchical FL. Subsequently, we illustrate the training procedure of the proposed CE-HFL and provide the convergence analysis.

2.1 System Model
Under the traditional hierarchical federated learning (HFL) framework, three types of entities participate in the model training: (i) edge devices, (ii) edge servers, and (iii) the cloud server [8]. (i) Edge devices (e.g., smartphones, laptops) are generally equipped with constrained computation and communication capacities, and perform the local model updating based on their locally stored datasets. For the sake of illustration, let U = {u_1, u_2, ..., u_n} denote the set of edge devices that participate in HFL, where n = |U| represents the number of edge devices. Additionally, we denote 𝒟_u = {(x_i, y_i)}_{i=1}^{D_u} as the local dataset of each device u ∈ U, where x_i is the i-th sample, y_i is the corresponding label, and D_u = |𝒟_u| represents the number of samples in 𝒟_u. (ii) Edge servers serve as the intermediaries between the cloud server and the edge devices, and are usually deployed close to the edge devices (e.g., at the network edge). Generally, edge servers are equipped with more computation and communication resources than edge devices, and are core components in HFL. Concretely, we introduce V = {v_1, v_2, ..., v_m} to represent the set of edge servers, where m = |V| denotes the number of edge servers in V. (iii) The cloud server is usually regarded as the coordinator of all the distributed devices and edge servers, and is responsible for aggregating the edge-level models in HFL and distributing the updated global model to all edge servers.

2.2 Traditional Hierarchical Federated Learning
During the training of a traditional HFL algorithm (e.g., HierFAVG), each device u ∈ U first performs local updating for τ_e iterations and sends the updated local model to an assigned edge server v ∈ V. Subsequently, the edge server v conducts the edge aggregation based on the received local models. After τ_g edge aggregations, edge server v sends the aggregated model to the cloud server for global aggregation. Concretely, we denote K as the total number of communication rounds on each device, which is equal to the number of performed edge aggregations. In this way, the total number of training epochs (i.e., global aggregations) can be formulated as K/τ_g. To be concise, one training epoch is composed of τ_g communication rounds on edge devices, and one round refers to τ_e local updates. Through the hierarchical aggregation (i.e., edge aggregation and global aggregation) of the trained model, the performance of the global model can be enhanced while significantly reducing the communication cost of the cloud server.
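To make this schedule concrete, the following Python sketch mimics one HierFAVG-style training run under the notation above. It is a minimal illustration under simplifying assumptions (toy least-squares losses, numpy weight vectors, and hypothetical helpers such as local_sgd and weighted_average), not the implementation used in this paper.

    import numpy as np

    def local_sgd(w, data, tau_e, eta=0.01):
        # tau_e local SGD iterations on a toy least-squares loss (placeholder for F_u)
        x, y = data
        for _ in range(tau_e):
            grad = x.T @ (x @ w - y) / len(y)
            w = w - eta * grad
        return w

    def weighted_average(models, weights):
        # sample-size weighted average, as in FedAvg-style aggregation
        return np.average(np.stack(models), axis=0, weights=np.asarray(weights, dtype=float))

    def hier_favg(global_w, device_data, assignment, tau_e, tau_g, epochs):
        # assignment[v] -> list of device ids u served by edge server v
        for _ in range(epochs):                           # one epoch = tau_g rounds
            edge_models = {v: global_w.copy() for v in assignment}
            for _ in range(tau_g):                        # tau_g edge aggregations per epoch
                for v, devs in assignment.items():
                    locals_, sizes = [], []
                    for u in devs:                        # tau_e local updates per device
                        locals_.append(local_sgd(edge_models[v], device_data[u], tau_e))
                        sizes.append(len(device_data[u][1]))
                    edge_models[v] = weighted_average(locals_, sizes)   # edge aggregation
            D_v = [sum(len(device_data[u][1]) for u in devs) for devs in assignment.values()]
            global_w = weighted_average(list(edge_models.values()), D_v)  # global aggregation
        return global_w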
2.3 The Training Procedure of CE-HFL
However, considering the heterogeneous resources of edge devices (and edge servers), performing the same number of local updates (and edge aggregations) on different devices may incur unbearable waiting time. To this end, we propose to adjust the number of local updates and edge aggregations according to their heterogeneous resources, so as to reduce the waiting time among edge devices and edge servers without incurring performance degradation. The framework of CE-HFL is illustrated in Fig. 1 and mainly consists of three steps, i.e., local updating, edge aggregation, and global aggregation.

Local Updating. On each device u ∈ U, the local loss function f_u is defined on the local dataset 𝒟_u as follows [16]:

    f_u(w_u^k) = \frac{1}{D_u} \sum_{i=1}^{D_u} F_u(x_i; y_i, w_u^k),    (1)

where w_u^k represents the local model of device u at round k ∈ {1, 2, ..., K}, and F_u(x_i; y_i, w_u^k) denotes the loss function over the i-th data sample. Upon receiving the aggregated model w_v^k and the determined edge aggregation frequency τ_{e,u}^k from edge server v, each device u iteratively updates the local model for τ_{e,u}^k iterations at each round k, so as to minimize the local loss function f_u and reduce the waiting time among edge devices.
[Figure 1: Illustration of the training process of CE-HFL: local updating on devices u1–u9, edge aggregation on edge servers v1–v3 with per-device edge aggregation frequencies, and global aggregation on the cloud server with per-server global aggregation frequencies.]

Based on the stochastic gradient descent algorithm [3], device u updates the local model w_u^{k,τ'} as follows:

    w_u^{k,\tau'+1} = w_u^{k,\tau'} - \eta \nabla f_u(w_u^{k,\tau'}),    (2)

where w_u^{k,τ'} represents the local model of device u at iteration τ' ∈ {1, 2, ..., τ_{e,u}^k} of the k-th round, η denotes the learning rate of model training, and ∇f_u(w_u^{k,τ'}) is the gradient of the local loss function at the local model w_u^{k,τ'}. Subsequently, the updated local model w_u^{k,τ_{e,u}^k} and the completion time (denoted as t_u^k) at round k on device u are transmitted to the assigned edge server v ∈ V for edge aggregation. The completion time t_u^k of device u at round k consists of two parts: computation time and communication time [11]. The computation time t_{u,cp}^k of device u at the k-th round can be formulated as

    t_{u,cp}^k = \tau_{e,u}^k \cdot \frac{M}{C_u},    (3)

where τ_{e,u}^k represents the number of local updates on device u at round k, M is the required computation capacity of one local update, and C_u is the computation resource of device u. In addition to the computation time, t_u^k also includes the time for downloading the aggregated model and the time for uploading the local model. As a consequence, the communication time of device u (i.e., the sum of download time and upload time) can be regarded as a constant t_{u,cm}, which may vary among different edge devices. In this way, the completion time t_u^k of device u at round k is defined as

    t_u^k = t_{u,cp}^k + t_{u,cm} = \tau_{e,u}^k \cdot \frac{M}{C_u} + t_{u,cm}.    (4)
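A minimal Python sketch of one device-side round under Eqs. (2)-(4) is given below; grad_fn and the constants M, C_u, and t_{u,cm} are illustrative placeholders rather than values prescribed by the paper.

    def device_round(w, grad_fn, tau_eu_k, eta, M, C_u, t_ucm):
        # Eq. (2): tau_{e,u}^k local SGD steps, w <- w - eta * grad f_u(w)
        for _ in range(tau_eu_k):
            w = w - eta * grad_fn(w)
        t_cp = tau_eu_k * M / C_u      # Eq. (3): computation time
        t_u = t_cp + t_ucm             # Eq. (4): completion time reported to edge server v
        return w, t_u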
Edge Aggregation. On edge server v, the edge aggregation is triggered after receiving the updated local models and the completion time from all the assigned edge devices. Following the typical Federated Averaging algorithm [12], the edge aggregation on edge server v is performed as follows:

    w_v^{k+1} = \sum_{u \in S_v} \frac{D_u \, w_u^{k,\tau_{e,u}^k}}{D_v},    (5)

where w_v^{k+1} represents the aggregated model on edge server v at round k + 1, S_v denotes the set of devices assigned to edge server v, and D_v = \sum_{u \in S_v} D_u is the accumulated number of samples on the edge devices assigned to v. Then, edge server v calculates the edge aggregation frequency τ_{e,u}^{k+1} for each device u ∈ S_v according to the received completion time, so as to equalize the consumed time t_u^{k+1} of the assigned devices at round k + 1. The aggregated model w_v^{k+1} and the updated edge aggregation frequency τ_{e,u}^{k+1} are distributed to the devices u ∈ S_v to start the next round of local updating. On the other hand, after finishing τ_{g,v} edge aggregations, edge server v uploads the aggregated model and the consumed time to the cloud server for global aggregation. The completion time of edge server v for accomplishing τ_{g,v} edge aggregations also consists of two parts: computation time and communication time. The computation time t_{v,cp}^k on server v at the k-th round can be formulated as

    t_{v,cp}^k = \max_{u \in S_v} t_u^k = \max_{u \in S_v} \{ t_{u,cp}^k + t_{u,cm} \}.    (6)

It is worth noting that we ignore the aggregation time of local models, since the computation resource of the edge server is adequate and the aggregation time is negligible. We also regard the communication time of edge server v as a constant t_{v,cm}. In this way, the completion time t_v^k of edge server v at round k is defined as

    t_v^k = \begin{cases}
      t_{v,cp}^k + t_{v,cm}, & \text{if } \tau_{g,v} \text{ edge aggregations are finished},  (7a) \\
      t_{v,cp}^k, & \text{otherwise}.  (7b)
    \end{cases}

As a consequence, the consumed time of edge server v for accomplishing τ_{g,v} edge aggregations is obtained by accumulating the completion time over the involved rounds k.

Global Aggregation. On the cloud server, the global aggregation is performed as soon as the aggregated models and the completion time are retrieved from all the edge servers. Concretely, the global aggregation at the k-th round is performed as follows:

    w^{k+1} = \begin{cases}
      \sum_{v \in V} \frac{D_v \, w_v^{k+1}}{D}, & \text{if the required edge aggregations are finished},  (8a) \\
      w^k, & \text{otherwise},  (8b)
    \end{cases}

where w^k denotes the global model at round k, and D = \sum_{v \in V} D_v = \sum_{u \in U} D_u represents the total number of samples on all edge devices. Subsequently, the cloud server calculates the global aggregation frequency τ_{g,v} for each edge server v ∈ V according to the received completion time, so as to equalize the consumed time of the edge servers at the next epoch. The updated global model and the global aggregation frequency are sent back to all edge servers to replace the aggregated models. Regarding the abundant resources of the cloud server, we ignore the consumed time for model averaging. As a consequence, the corresponding completion time of round k can be expressed as

    t^k = \max_{v \in V} t_v^k = \begin{cases}
      \max_{v \in V} \{ \max_{u \in S_v} \{ t_{u,cp}^k + t_{u,cm} \} + t_{v,cm} \}, & \text{if the required edge aggregations are finished},  (9a) \\
      \max_{v \in V} \{ \max_{u \in S_v} \{ t_{u,cp}^k + t_{u,cm} \} \}, & \text{otherwise}.  (9b)
    \end{cases}
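The two sample-size weighted averages in Eqs. (5) and (8) can be sketched in a few lines of Python; the list-based bookkeeping below is an illustrative assumption.

    def edge_aggregate(local_models, sample_counts):
        # Eq. (5): w_v^{k+1} = sum_{u in S_v} D_u * w_u^{k, tau_{e,u}^k} / D_v
        D_v = float(sum(sample_counts))
        return sum(D_u * w for D_u, w in zip(sample_counts, local_models)) / D_v

    def global_aggregate(edge_models, edge_sample_counts, w_prev, edge_done):
        # Eq. (8): weighted average over edge models if the required edge
        # aggregations are finished (8a), otherwise keep the previous model (8b)
        if not edge_done:
            return w_prev
        D = float(sum(edge_sample_counts))
        return sum(D_v * w for D_v, w in zip(edge_sample_counts, edge_models)) / D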
Algorithm 1 Training procedure of CE-HFL
1: Initialize the model parameters w^0 and send them to all edge devices;
2: for round k ∈ {1, 2, ..., K} and T^k < T^B do
3:   Procedure at the cloud server
4:   if the aggregated models are received from all edge servers then
5:     Update the global model according to Eqs. (8a) and (8b);
6:     Calculate the global aggregation frequency for each edge server using Alg. 2;
7:     Distribute the global model and the aggregation frequency to the edge servers;
8:     Record the completion time as T^k = \sum_{j=1}^{k} t^j;
9:   end if
10:  Procedure at the edge server v
11:  Receive the local model w_u^{k,τ_{e,u}^k} and the completion time t_u^k from all the devices u ∈ S_v;
12:  Update the aggregated model w_v^{k+1} according to Eq. (5);
13:  if the required τ_{g,v} edge aggregations are finished then
14:    Send w_v^{k+1} and the completion time to the cloud server for global aggregation;
15:    Receive w^{k+1} and the updated τ_{g,v} from the cloud server and set w_v^{k+1} = w^{k+1};
16:  end if
17:  Calculate the edge aggregation frequency for each device u ∈ S_v using Alg. 2;
18:  Broadcast the updated model w_v^{k+1} and the edge aggregation frequency to all the assigned devices;
19:  Procedure at the edge device u
20:  Receive the aggregated model and the edge aggregation frequency from edge server v;
21:  for local update τ' ∈ {1, 2, ..., τ_{e,u}^k} do
22:    Update the local model as w_u^{k,τ'+1} = w_u^{k,τ'} − η∇f_u(w_u^{k,τ'});
23:  end for
24:  Send the updated model w_u^{k,τ_{e,u}^k} and the consumed time to edge server v;
25: end for
26: Return the trained global model w^K;

Algorithm 2 Determine the aggregation frequencies for CE-HFL
1: Initialize t_u^k and t_{u,cm} for ∀u ∈ U, and t_v^k and t_{v,cm} for ∀v ∈ V;
2: Initialize the computation capacity C_u for ∀u ∈ U;
3: Calculate the average completion time of all edge servers at the current epoch, and denote it as t̄;
4: for each v ∈ V do
5:   Let t̂_v^{k+1} = (1/|S_v|) \sum_{u ∈ S_v} t_u^k;
6:   for each u ∈ S_v do
7:     Calculate τ_{e,u}^{k+1} = ⌊(t̂_v^{k+1} − t_{u,cm}) C_u / M⌋;
8:   end for
9:   Calculate τ_{g,v} = ⌊(t̄ − t_{v,cm}) / t̂_v^{k+1}⌋;
10: end for
11: Return the calculated τ_{e,u}^{k+1} and τ_{g,v} to the corresponding devices and servers;

For the sake of clear illustration, the above training process of CE-HFL is summarized in Alg. 1. It is worth noting that the above three steps are conducted iteratively until the global loss f(w^K) = \sum_{u \in U} \frac{D_u f_u(w^K)}{D} is minimized within K rounds, or the time budget T^B is exceeded.

3 ALGORITHM DESIGN
In this section, we present the algorithm for the cloud server and the edge servers to determine the aggregation frequencies. First of all, we initialize some critical parameters in Lines 1-3 of Alg. 2. Concretely, the completion time and the communication time for each edge device and server are initialized, which are used to calculate the aggregation frequencies. Besides, the computation capacity for each device u ∈ U is initialized as c_u. We note that the computation capacity C_u can be replaced with a factor c_u that is maintained on the edge server and is updated as c_u = γ · c_u + (1 − γ) · (t_u^k − t_{u,cm}) / τ_{e,u}^k whenever the latest completion time t_u^k is received. As a consequence, the proposed algorithm does not require any prior knowledge about the computation capacity of each edge device.

After the initialization, we calculate the average completion time of all edge servers at the current epoch, and denote it as t̄ (Line 3 in Alg. 2). Subsequently, for each edge server v ∈ V, we calculate the edge aggregation frequency for the devices assigned to it. Specifically, we estimate the completion time for device u at round k + 1 as the average value of the completion time at round k, as depicted in Line 5 of Alg. 2. Based on the estimated time t̂_v^{k+1}, we can calculate the aggregation frequency for device u as τ_{e,u}^{k+1} = ⌊(t̂_v^{k+1} − t_{u,cm}) C_u / M⌋ (Line 7 of Alg. 2). It is worth noting that the aggregation frequencies of the edge devices are adjusted to align with the estimated t̂_v^{k+1}, so that the waiting time among the edge devices assigned to v can be reduced. Similarly, the global aggregation frequency for server v is adjusted based on the average time t̄ and t̂_v^{k+1}, as shown in Line 9 of Alg. 2. By aligning the completion time of each server at the next epoch with the estimated t̄, the waiting time among edge servers can also be decreased. In this way, by adaptively adjusting the global aggregation frequency and the edge aggregation frequency, CE-HFL can reduce the waiting time and communication cost during model training, without incurring performance degradation.
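A compact Python rendering of the frequency rules in Alg. 2, together with the EMA update of the per-device factor c_u, is sketched below. The dictionary layout, the value of γ, and the lower clamp to one are illustrative assumptions.

    from math import floor

    def update_c_u(c_u, t_u_k, t_ucm, tau_eu_k, gamma=0.9):
        # EMA of the measured time per local update, i.e., an estimate of M / C_u,
        # so the edge server needs no prior knowledge of the device capacity
        return gamma * c_u + (1 - gamma) * (t_u_k - t_ucm) / tau_eu_k

    def determine_frequencies(per_server, t_bar):
        # per_server[v] = {'t_vcm': float, 'devices': {u: (t_u_k, t_ucm, c_u)}}
        tau_e, tau_g = {}, {}
        for v, info in per_server.items():
            devs = info['devices']
            # Line 5: estimate the next-round time as the average completion time at round k
            t_hat = sum(t_u_k for t_u_k, _, _ in devs.values()) / len(devs)
            for u, (_, t_ucm, c_u) in devs.items():
                # Line 7: tau_{e,u}^{k+1} = floor((t_hat - t_ucm) * C_u / M),
                # with M / C_u replaced by the EMA estimate c_u; clamped to >= 1 here
                tau_e[u] = max(1, floor((t_hat - t_ucm) / c_u))
            # Line 9: tau_{g,v} = floor((t_bar - t_vcm) / t_hat); clamped to >= 1 here
            tau_g[v] = max(1, floor((t_bar - info['t_vcm']) / t_hat))
        return tau_e, tau_g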
We note that the edge devices participating in FL may be various kinds of mobile devices. In addition, edge devices always connect to the base station (BS) via unstable wireless links. Hence, considering the device mobility in the dynamic EC environment, the communication timeout issue may occur due to BS handover or unstable network conditions. The communication timeout may lead to extra waiting time among edge devices, since the fast devices are compelled to wait for the devices suffering from communication timeout, and the resource utilization is further lowered. To this end, we refer to the timeout retransmission mechanism in the TCP protocol and introduce a similar method that retransmits the local models if no reply is received from the server within the retransmission timeout (RTO). Furthermore, the network conditions always vary with time in EC. As a consequence, the value of the RTO is updated according to the observed round-trip times and is calculated as
    T_{RTO} = \hat{T}_{RTT} + \varphi \cdot \sigma(T_{RTT}),    (9c)

where φ is a factor indicating the network condition and is set to 4 in general [13], T̂_{RTT} denotes the smoothed round-trip time, which is obtained by incorporating the historical round-trip times T_{RTT}, and σ(T_{RTT}) represents the round-trip time variation.
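For reference, the classic TCP-style estimator [13] maintains the smoothed RTT and the RTT variation via exponential averaging; the sketch below shows one common formulation, where the smoothing constants α = 1/8 and β = 1/4 and the initialization are standard choices rather than values specified in this paper.

    def make_rto_estimator(phi=4, alpha=1/8, beta=1/4):
        # Returns an update function that consumes RTT samples and yields the RTO
        # of Eq. (9c): smoothed RTT plus phi times the RTT variation.
        state = {'srtt': None, 'rttvar': None}

        def update(rtt_sample):
            if state['srtt'] is None:                      # first measurement
                state['srtt'] = rtt_sample
                state['rttvar'] = rtt_sample / 2
            else:
                state['rttvar'] = (1 - beta) * state['rttvar'] + beta * abs(state['srtt'] - rtt_sample)
                state['srtt'] = (1 - alpha) * state['srtt'] + alpha * rtt_sample
            return state['srtt'] + phi * state['rttvar']   # T_RTO

        return update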
4 EXPERIMENTATION AND EVALUATION
4.1 Experimental Setup
We conduct extensive experiments to evaluate the performance of our proposed framework on two real-world datasets with two models: (i) LeNet-5 [6] on FMNIST [21], and (ii) AlexNet [5] on CIFAR-10. These two datasets are randomly divided into 15 partitions to serve as the local datasets of 15 edge devices.
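The random, equal-sized split into 15 device shards can be sketched as follows; the IID partitioning and the seed are assumptions, since the paper only states that the datasets are randomly divided.

    import numpy as np

    def partition_dataset(num_samples, num_devices=15, seed=0):
        # Randomly permute the sample indices and split them into equal-sized shards,
        # one shard per edge device (IID split assumed)
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(num_samples), num_devices)

    # e.g., 60,000 FMNIST training samples -> 15 shards of 4,000 samples each
    shards = partition_dataset(60000)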
[Figure 2: Test accuracy of models trained with different methods. Panels: (a) LeNet-5 over FMNIST, (b) AlexNet over CIFAR-10; x-axis: training time (10^3 s), y-axis: test accuracy; methods: CE-HFL, FedAvg, FedCH, HierFAVG.]

[Figure: Average waiting time (s) of HierFAVG, FedAvg, FedCH, and CE-HFL.]
[Figure 4: Completion time of different methods when achieving the target accuracy. Panels: (a) LeNet-5 over FMNIST, (b) AlexNet over CIFAR-10; x-axis: test accuracy, y-axis: completion time (10^3 s); methods: CE-HFL, FedAvg, FedCH, HierFAVG.]

[Figure 5: Traffic consumption of different methods when achieving the target accuracy. Panels: (a) LeNet-5 over FMNIST, (b) AlexNet over CIFAR-10; x-axis: test accuracy, y-axis: traffic consumption (GB); methods: CE-HFL, HierFAVG, FedCH, FedAvg.]

the test accuracy of 0.86 on FMNIST. Generally, to achieve the test accuracy of 0.88 and 0.65 on FMNIST and CIFAR-10, the completion time of CE-HFL is 260s and 821s, respectively, indicating that CE-HFL obtains speedups of up to 6.3× and 4.8× on the above datasets.

4.2.4 Comparison of Traffic Consumption. Moreover, we also record the traffic consumption of different methods when achieving the target accuracy. The experimental results are presented in Fig. 5, where the horizontal axis represents the target accuracy. As shown in Fig. 5, the total traffic consumption increases with the model complexity. For example, the total traffic consumption of FedAvg is 6,678MB on LeNet-5, while it reaches 73,703MB on AlexNet. In addition, the proposed method always incurs less network traffic when achieving the target accuracy. That is because the global and edge aggregation frequencies are adjusted so that the cloud server communicates less frequently with the edge servers, and thus the communication cost is reduced. In total, CE-HFL consumes 1,445MB and 21,431MB of traffic on FMNIST and CIFAR-10 when obtaining the target accuracy of 0.88 and 0.65, respectively. In comparison, the traffic consumption of the baseline methods is higher than 2,079MB (over FMNIST) and 35,022MB (over CIFAR-10). As a consequence, on FMNIST and CIFAR-10, CE-HFL can save the traffic consumption by at least 30.5% and 38.8%, respectively, when compared with the existing methods.
Based on the above results, compared with the baselines, CE-HFL can reduce the waiting time among edge devices and improve the test accuracy given the same training time, which demonstrates the effectiveness of the proposed framework.

5 CONCLUSION
In this paper, we focused on accelerating hierarchical federated learning by adaptively adjusting the aggregation frequencies. We proposed an efficient algorithm to achieve a trade-off between communication cost and model performance, and extended it to real edge computing scenarios. We conducted extensive experiments on real-world datasets, and the experimental results demonstrated the effectiveness of our proposed framework.

REFERENCES
[1] Mehdi Salehi Heydar Abad, Emre Ozfatura, Deniz Gunduz, and Ozgur Ercetin. 2020. Hierarchical federated learning across heterogeneous cellular networks. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8866–8870.
[2] Kevin Ashton et al. 2009. That 'internet of things' thing. RFID Journal 22, 7 (2009), 97–114.
[3] Léon Bottou. 2012. Stochastic gradient descent tricks. Neural Networks: Tricks of the Trade: Second Edition (2012), 421–436.
[4] Frank Klinker. 2011. Exponential moving average versus moving exponential average. Mathematische Semesterberichte 58 (2011), 97–107.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012).
[6] Yann LeCun et al. 2015. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet 20, 5 (2015), 14.
[7] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 583–598.
[8] Lumin Liu, Jun Zhang, S. H. Song, and Khaled B. Letaief. 2020. Client-edge-cloud hierarchical federated learning. In ICC 2020 - 2020 IEEE International Conference on Communications (ICC). IEEE, 1–6.
[9] Siqi Luo, Xu Chen, Qiong Wu, Zhi Zhou, and Shuai Yu. 2020. HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning. IEEE Transactions on Wireless Communications 19, 10 (2020), 6535–6548.
[10] Zhenguo Ma, Yang Xu, Hongli Xu, Jianchun Liu, and Yinxing Xue. 2022. Like Attracts Like: Personalized Federated Learning in Decentralized Edge Computing. IEEE Transactions on Mobile Computing (2022).
[11] Zhenguo Ma, Yang Xu, Hongli Xu, Zeyu Meng, Liusheng Huang, and Yinxing Xue. 2023. Adaptive Batch Size for Federated Learning in Resource-Constrained Edge Computing. IEEE Transactions on Mobile Computing 22, 1 (2023), 37–53. https://doi.org/10.1109/TMC.2021.3075291
[12] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
[13] Vern Paxson, Mark Allman, Jerry Chu, and Matt Sargent. 2011. Computing TCP's retransmission timer. Technical Report.
[14] Weisong Shi and Schahram Dustdar. 2016. The promise of edge computing. Computer 49, 5 (2016), 78–81.
[15] Oscar C. Valderrama Riveros and Fernando G. Tinetti. 2021. MPI communication performance in a heterogeneous environment with Raspberry Pi. In Advances in Parallel & Distributed Processing, and Applications: Proceedings from PDPTA'20, CSC'20, MSV'20, and GCC'20. Springer, 451–460.
[16] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K. Leung, Christian Makaya, Ting He, and Kevin Chan. 2018. When edge meets learning: Adaptive control for resource-constrained distributed machine learning. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. IEEE, 63–71.
[17] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K. Leung, Christian Makaya, Ting He, and Kevin Chan. 2019. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37, 6 (2019), 1205–1221.
[18] Zhiyuan Wang, Hongli Xu, Jianchun Liu, He Huang, Chunming Qiao, and Yangming Zhao. 2021. Resource-efficient federated learning with hierarchical aggregation in edge computing. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. IEEE, 1–10.
[19] Zhiyuan Wang, Hongli Xu, Jianchun Liu, Yang Xu, He Huang, and Yangming Zhao. 2022. Accelerating federated learning with cluster construction and hierarchical aggregation. IEEE Transactions on Mobile Computing (2022).
[20] Xin Wu, Zhi Wang, Jian Zhao, Yan Zhang, and Yu Wu. 2020. FedBC: Blockchain-based decentralized federated learning. In 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). IEEE, 217–221.
[21] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).