Speed Up Federated Learning in Heterogeneous Environments: A Dynamic Tiering Approach
Seyed Mahmoud Sajjadi Mohammadabadi and Lei Yang are with the Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA (e-mail: [email protected]; [email protected]). S. Zawad is with IBM Research - Almaden, San Jose, CA, USA (e-mail: [email protected]). Feng Yan is with the Computer Science Department and Electrical and Computer Engineering Department of the University of Houston, TX, USA (e-mail: [email protected]).

Abstract—Federated learning enables collaborative training of a model while keeping the training data decentralized and private. However, in IoT systems, inherent heterogeneity in processing power, communication bandwidth, and task size can significantly hinder the efficient training of large models. Such heterogeneity leads to vast variations in the training time of clients, lengthening overall training and wasting the resources of faster clients. To tackle these heterogeneity challenges, we propose Dynamic Tiering-based Federated Learning (DTFL), a novel system that leverages distributed optimization principles to improve edge learning performance. Based on clients' resources, DTFL dynamically offloads part of the global model to the server, alleviating resource constraints on slower clients and speeding up training. By leveraging Split Learning, DTFL offloads different portions of the global model to clients in different tiers and enables each client to update the models in parallel via local-loss-based training. This helps reduce the computation and communication demand on resource-constrained devices, mitigating the straggler problem. DTFL introduces a dynamic tier scheduler that uses tier profiling to estimate the expected training time of each client based on their historical training time, communication speed, and dataset size. The dynamic tier scheduler then assigns clients to suitable tiers to minimize the overall training time in each round. We theoretically prove the convergence properties of DTFL and validate its effectiveness by training large models (ResNet-56 and ResNet-110) across varying numbers of clients (from 10 to 200) using popular image datasets (CIFAR-10, CIFAR-100, CINIC-10, and HAM10000) under both IID and non-IID settings. DTFL seamlessly integrates various privacy measures without sacrificing performance. Extensive experimental results show that, compared with state-of-the-art FL methods, DTFL can reduce the training time by up to 80% while maintaining model accuracy.

Index Terms—Edge computing, federated learning, heterogeneous devices, split learning, distributed optimization

I. INTRODUCTION

Federated Learning (FL) has become a popular privacy-preserving distributed learning paradigm, particularly in emerging edge computing scenarios like beyond-5G (B5G) Internet of Things (IoT) systems. FL enables collaborative training of a global model without requiring clients to share their sensitive data with others. In FL, clients update the global model using their locally trained weights to avoid sharing raw data with the server or other clients. Training large models in resource-constrained B5G IoT environments (e.g., mobile devices, sensors, and edge servers) is challenging due to device heterogeneity and dynamic wireless channels. IoT devices often have heterogeneous computational and communication capabilities, along with varying dataset sizes. Greater heterogeneity has a significant impact on training time, increasing the time needed to achieve accuracy comparable to non-heterogeneous settings [1], [2]. Motivated by the increasing demand for intelligent B5G IoT applications, we aim to address the challenges of training large models on resource-constrained heterogeneous devices.

To train large models with resource-constrained IoT devices, various methods have been proposed in the literature. One solution is to split the global model into a client-side model (i.e., the first few layers of the global model) and a server-side model, where the clients only need to train the small client-side model via Split Learning (SL) [3], [4]. Additionally, Liao et al. [5] improve the model training speed in Split Federated Learning (SFL) by giving local clients control over both the local updating frequency and the batch size. Yet in SFL, each client needs to wait for the back-propagated gradients from the server to update its model, and the communication overhead for transmitting the forward/backward signals between the server and clients can be substantial in each training round (i.e., the time needed to complete a round of training). To address these issues, FedGKT [6] and COMET [7] use a knowledge transfer training algorithm to train small models on the client side and periodically transfer their knowledge via knowledge distillation to a large server-side model. Furthermore, Han et al. [8] develop a federated SL algorithm that addresses the latency and communication issues by integrating local-loss-based training into SL. However, the client-side models in earlier works [3]–[8] remain fixed throughout the training process, and choosing suitable client-side models in heterogeneous environments is challenging as the resources of clients may change over time. Another solution is to divide clients into tiers based on their training speed and select clients from the same tier in each training round to mitigate the straggler problem (i.e., some devices significantly lag in training) [9], [10]. Yet, existing tier-based works [9], [10] still require clients to train the entire global model, which is not suitable for training large models.

In this paper, we propose the Dynamic Tiering-based Federated Learning (DTFL) system, a novel approach that leverages distributed optimization principles to speed up the training of large models in heterogeneous IoT environments. DTFL builds upon the strengths of SFL [8] and tier-based FL [9]
while addressing their limitations, particularly latency issues and training time in heterogeneous edge computing scenarios. In DTFL, clients are categorized into different tiers, with each tier responsible for offloading a specific portion of the global model from clients to the server. Then, each client and the server update the models in parallel using local-loss-based training [11], [12]. In heterogeneous environments, client training times may vary across rounds. Static tier assignments can lead to severe straggler issues, especially when resource-constrained clients (e.g., due to other concurrently running applications on IoT devices) are assigned to demanding tiers. To address this challenge, we propose a dynamic tier scheduler that assigns clients to suitable tiers based on their capacities, task size, and current training speed. The tier scheduler employs tier profiling to estimate client-side training time, using only the measured training time, reported network speed, and observed dataset size. This low-overhead approach makes it suitable for real-world system implementation. The main contributions of this paper are summarized as follows:

• We propose DTFL to address the challenges of training large models on heterogeneous devices in FL by offloading different portions of the global model from clients to the server based on clients' resources and wireless communication capacity, thereby accelerating training and mitigating the straggler problem.

• We propose a dynamic tier scheduler that continuously assesses clients' training performance and dynamically adjusts clients' tier assignments in each round of training to minimize the training time. This approach optimizes resource allocation and reduces training time in dynamic environments.

• We theoretically show the convergence of DTFL on convex and non-convex loss functions under standard assumptions in FL [13], [14] and local-loss-based training [8], [12], [15].

• Using DTFL, we train large models (ResNet-56 and ResNet-110 [16]) on different numbers of clients (from 10 to 200) using popular datasets (CIFAR-10 [17], CIFAR-100 [17], CINIC-10 [18], and HAM10000 [19]) and their non-I.I.D. (non-identically and independently distributed) variants. Our extensive experiments demonstrate that DTFL not only maintains model accuracy comparable to state-of-the-art FL methods but also achieves up to an 80% reduction in training time compared to baselines.

• We evaluate DTFL's performance when employing privacy measures, such as minimizing the distance correlation between raw data and intermediate representations, and shuffling data patches. Compared with a baseline accuracy of 87.1% without privacy measures, applying patch shuffling leads to a minor 1.7% accuracy decrease, while incorporating distance correlation with α = 0.5 results in a 3.6% accuracy reduction. This demonstrates DTFL's adaptability to privacy measures with minimal impact on performance.

The paper is organized as follows: Section II presents background and related work. Section III introduces the DTFL framework and provides a theoretical convergence analysis of DTFL. Section IV evaluates the performance of DTFL. Section V concludes the paper.

II. BACKGROUND AND RELATED WORKS

Federated Learning. Existing FL methods (see a comprehensive study of FL [20]) require clients to repeatedly download and update the global model, which is not suitable for training large models with resource-constrained devices in heterogeneous environments and may suffer a severe straggler problem. To address the straggler problem, Li et al. [13] select a smaller set of clients for training in each global iteration but require more training rounds. Bonawitz et al. [21] mitigate stragglers by neglecting the slowest 30% of clients, while FedProx [22] uses distinct local epoch numbers for clients. Both Bonawitz et al. [21] and FedProx [22] face the challenge of determining appropriate parameters (i.e., the percentage of slowest clients and the number of local epochs). Recently, tier-based FL methods [9], [10], [23] propose dividing clients into tiers based on their training speed and selecting clients from the same tier in each training round to mitigate the straggler problem. However, these methods require clients to train the entire global model and do not consider the impact of communication overhead and dynamic environmental changes, which creates significant hurdles in training large models on resource-constrained devices in heterogeneous environments.

Split Learning. To tackle the computational limitations of resource-constrained devices, SL splits the global model into a client-side model and a server-side model, so that, in contrast to FL, clients only need to update the small client-side model [3], [4]. To increase SL training speed, SplitFed [24] combines it with FL, while CPSL [25] proposes a first-parallel-then-sequential approach that clusters clients, sequentially trains a model in SL fashion within each cluster, and transfers the updated cluster model to the next cluster. In SL, clients must wait for the server's backpropagated gradients to update their models, which can cause significant communication overhead. To address these issues, He et al. [6] propose FedGKT to train small models at clients and periodically transfer their knowledge by knowledge distillation to a large server-side model. Han et al. [8] develop a federated SL algorithm that addresses latency and communication issues by integrating local-loss-based training. Clients train a model using local error signals, which eliminates the need to wait for backpropagated gradients from the server. However, the client-side models in current SL approaches [6], [8], [26] are fixed throughout the training process, and choosing suitable client-side models in heterogeneous environments is challenging as clients' resources may change over time. As opposed to these works, the proposed DTFL can dynamically adjust the size of the client-side model for each client over time, significantly reducing training time and mitigating the straggler problem.

III. DYNAMIC TIERING-BASED FEDERATED LEARNING

A. Problem Statement

We aim to collaboratively train a large model (e.g., ResNet or AlexNet) by $K$ clients on a range of heterogeneous resource-constrained devices that lack powerful computation and communication resources, without centralizing the dataset on the server side. Let $\{(x_i, y_i)\}_{i=1}^{N_k}$ denote the dataset of client $k$, where $x_i$ denotes the $i$th training sample and $y_i$ is the
Offloading the model to the server can effectively reduce the total training time, as illustrated in Table I. As a client offloads more layers to the server (moving towards tier m = 1), the model size on the client's side decreases, thereby reducing the computational workload. Meanwhile, this may increase the amount of data transmitted (i.e., the size of the intermediate data and the partial model). As indicated in Table I, there exists a non-trivial tier assignment that minimizes the overall training time. To find the optimal tier assignment, DTFL needs to consider multiple factors, including the communication link speed between the server and the clients, the computation power of each client, and the local dataset size.

C. Dynamic Tier Scheduling

In a heterogeneous environment with multiple clients, the proposed dynamic tier scheduling aims to minimize the overall training time by determining the optimal tier assignment for each client.

Specifically, let $m_k^{(r)}$ denote the tier of client $k$ in training round $r$. $T_k^{c}(m_k^{(r)})$, $T_k^{com}(m_k^{(r)})$, and $T_k^{s}(m_k^{(r)})$ represent the training time of the client-side model, the communication time, and the training time of the server-side model of client $k$ at round $r$, respectively. Using the proposed local-loss-based split training algorithm, each client and the server train the model in parallel. The overall training time $T_k$ for client $k$ in each round can be presented as:

$$T_k(m_k^{(r)}) = \max\{T_k^{c}(m_k^{(r)}) + T_k^{com}(m_k^{(r)}),\ T_k^{s}(m_k^{(r)}) + T_k^{com}(m_k^{(r)})\}. \qquad (5)$$

As clients train their models in parallel, the overall training time in each round $r$ is determined by the slowest client (i.e., the straggler). To minimize the overall training time, we minimize the maximum training time of clients in each round:

$$\min_{\{m_k^{(r)}\}} \max_k\ T_k(m_k^{(r)}), \quad \text{subject to } m_k^{(r)} \in \mathcal{M}\ \ \forall k, \qquad (6)$$

where $\mathcal{M}$ denotes the set of tiers. Note that problem (6) is an integer programming problem. Solving (6) requires knowledge of each client's training time $\{T_k(m_k^{(r)})\}$ under each tier. As the capacities of each client in a heterogeneous environment may change over time, a static tier assignment may still lead to a severe straggler problem. The key question is how to efficiently solve (6) in a heterogeneous environment.

To optimize client assignments, we introduce a dynamic tier scheduler. This scheduler leverages tier profiling to estimate each client's training time under different tiers, efficiently placing each client in the most suitable tier for each training round.

• Tier Profiling. Before training starts, the server conducts tier profiling to estimate $T_k^{c}(m_k^{(r)})$, $T_k^{com}(m_k^{(r)})$, and $T_k^{s}(m_k^{(r)})$ for each client across different tiers. These estimates are a crucial component of the dynamic tier scheduler. Client-side training time is predicted using historical data and a normalized training time profile. Communication time is calculated based on the transferred data size, client communication speed, and dataset size. Server-side training time is estimated using the observed relationship between the client's dataset size and server-side training times. Specifically, using a standard data batch, the server profiles the transferred data size (i.e., model parameter and intermediate data size) for each tier $m$ as $D_{size}^{(r)}(m_k)$. Then, for each client $k$ in tier $m$, the communication time can be estimated as $D_{size}^{(r)}(m_k)\tilde{N}_k/\nu_k^{(r)}$, where $\nu_k^{(r)}$ represents the client's communication speed and $\tilde{N}_k$ denotes the number of data batches. To track clients' training time for their respective client-side models, the server maintains and updates the set of historical client-side training times for each client $k$ in tier $m$, denoted as $T_k^{c_m}$. To mitigate measurement noise, the server applies an Exponential Moving Average (EMA) to the historical client-side training time (i.e., $\bar{T}_k^{c_m}(m_k^{(r)}) \leftarrow \mathrm{EMA}(T_k^{c_m}(m_k^{(r)}))$) and uses it as the current client's training time in tier $m$. One key challenge in tier profiling is capturing the dynamics of the training time of each client in a heterogeneous environment: we need knowledge of the training time for every client in each tier, but only the current tier's training time for each client is available in each round. To address this, we analyze the relationship between normalized training times across tiers for each client, where the normalized training time refers to the model training time using a standard data batch. Table II shows normalized training times of different tiers relative to tier 1. Notably, the table indicates that for any client within a specific model, the normalized training times relative to tier 1 for both client-side and server-side models, denoted as $T^{cp}$ and $T^{sp}$ respectively, are the same across different tiers. This is because the ratio between normalized training times in any two tiers depends solely on their respective model sizes, which do not change across clients. These model sizes remain constant within a specific model design, ensuring consistent normalized training times across tiers for all clients. Based on this tier profiling, we can estimate the training times in other tiers using the observed training time of each client in the assigned tier (see lines 24 to 29 in Algorithm 1).

• Tier Scheduling. In each round, the tier scheduler minimizes the maximum training time of clients. First, it identifies the maximum time (i.e., the straggler training time), denoted as $T_{max}$, by estimating the maximum training time over all clients if each were assigned to the tier that minimizes its training time (see line 31 in Algorithm 1). Then, it assigns the other clients to a tier with an estimated training time that is less than or equal to $T_{max}$ (see line 33 in Algorithm 1). To better utilize the resources of each client, the tier scheduler selects the tier $m$ that minimizes the offloading to the server while still ensuring that the estimated training time remains below $T_{max}$, i.e., $m_k^{(r+1)} \leftarrow \arg\max_m \{\hat{T}_k^{(r+1)}(m_k) \le T_{max}\}$.

The dynamic tier scheduler is illustrated in Figure 2 and is detailed in the TierScheduler(·) function in Algorithm 1, which also describes the DTFL training process. The scheduler has a computational complexity of $O(KM)$ and incurs negligible overhead compared to the computation and communication time.
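To make the scheduling rule above concrete, the following Python sketch estimates the per-tier round time of Eq. (5) and applies the min-max assignment heuristic of problem (6). It is a minimal illustration, not the DTFL implementation [37]; the field names (EMA-smoothed tier-1 client-side time, per-tier normalized scale as in Table II, per-batch transfer size, link speed) are assumed data structures.

```python
# Minimal sketch of the dynamic tier scheduler (assumed data structures, not
# the authors' code). `tier_profile["client_scale"]` mirrors the normalized
# client-side profile of Table II; `tier_profile["data_size"][m]` is the
# profiled per-batch transfer size for tier m.

def estimate_round_time(client, m, tier_profile):
    """Estimate T_k(m) = max{T^c_k(m) + T^com_k(m), T^s_k(m) + T^com_k(m)} (Eq. (5))."""
    n_batches = client["num_batches"]                                        # N~_k
    t_c = client["ema_client_time_tier1"] * tier_profile["client_scale"][m] * n_batches
    t_s = client["server_time_per_batch"][m] * n_batches                     # T^s_k(m)
    t_com = tier_profile["data_size"][m] * n_batches / client["net_speed"]   # D_size(m) N~_k / nu_k
    return max(t_c + t_com, t_s + t_com)

def tier_scheduler(clients, tiers, tier_profile):
    """Assign each client the largest tier (least offloading) within the straggler bound."""
    # Straggler bound T_max: the largest of the per-client best achievable times.
    t_max = max(min(estimate_round_time(c, m, tier_profile) for m in tiers)
                for c in clients.values())
    assignment = {
        cid: max(m for m in tiers
                 if estimate_round_time(c, m, tier_profile) <= t_max)
        for cid, c in clients.items()
    }
    return assignment, t_max
```

Because $T_{max}$ is the largest of the per-client best times, every client has at least one feasible tier, and the scan over K clients and M tiers matches the stated O(KM) complexity.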
TABLE II: The normalized training times for both client-side and server-side models in different tiers for each client relative to Tier 1, using
ResNet-56 with 10 clients. In each experiment, all the clients are assigned to the same tier. We change the CPU capacities of clients in each
experiment to evaluate the impact of CPU capacities.
Tier 1 2 3 4 5 6
Client-side Training Time 1.00 ± 0.04 1.63 ± 0.10 2.16 ± 0.15 2.68 ± 0.22 3.30 ± 0.24 3.81 ± 0.28
Server-side Training Time 1.00 ± 0.07 0.82 ± 0.06 0.65 ± 0.06 0.51 ± 0.04 0.33 ± 0.03 0.20 ± 0.01
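As a worked example of the profile-based estimation, the snippet below projects one observed client-side time onto every tier by rescaling with the Table II ratios; the observed value is a hypothetical measurement, not a number from the paper.

```python
# Project a single observation to all tiers using the normalized client-side
# profile of Table II (ResNet-56). The observed time is a hypothetical value.
client_scale = {1: 1.00, 2: 1.63, 3: 2.16, 4: 2.68, 5: 3.30, 6: 3.81}

observed_tier, observed_time = 3, 4.2        # seconds per standard batch (assumed)
estimates = {m: observed_time * s / client_scale[observed_tier]
             for m, s in client_scale.items()}
# e.g., estimates[5] = 4.2 * 3.30 / 2.16 ≈ 6.4 s and estimates[1] ≈ 1.9 s
```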
Fig. 2 (caption, partial): "…communication times using tier profiles, historical training time, and network speeds across different tiers. The tier scheduler assigns clients to tiers, minimizing maximum training time while limiting server load."

D. Convergence Analysis

We show the convergence of both client-side and server-side models in DTFL on convex and non-convex loss functions based on standard assumptions in FL and local-loss-based training. We assume that (A1) the client-side $f_k^{c_m}$ and server-side $f_k^{s_m}$ objective functions of each client in each tier are differentiable and L-smooth; (A2) $f_k^{c_m}$ and $f_k^{s_m}$ have expected squared norm bounded by $G_1^2$; (A3) the variance

Algorithm 1 (excerpt): server-side aggregation and the ClientUpdate routine.

9:  Receive updated $w_k^{c_m}$ from client $k$
10: $w_k^{(r+1)} = \{w_k^{c_m(r+1)}, w_k^{s_m(r+1)}\}$
11: end for
12: $w^{(r+1)} = \frac{1}{K}\sum_k w_k^{(r+1)}$
13: Update all models ($w^{c_m(r+1)}$ and $w^{s_m(r+1)}$) in each tier using $w^{(r+1)}$
14: end for
ClientUpdate($w_k^{c_m(r)}$):
15: Forward propagate on local data to calculate $z_k^{(r)}$
16: Send $(z_k^{(r)}, y_k)$ to the server
17: Forward propagate to the auxiliary layer
18: Calculate the local loss and backpropagate
19: $w_k^{c_m(r+1)} \leftarrow w_k^{c_m(r)} - \eta\nabla f_k^{c_m}(w^{c_m(r)}, w^{a_m(r)})$
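To ground the ClientUpdate routine of Algorithm 1 in code, the PyTorch sketch below splits a standard torchvision ResNet at an assumed tier-dependent point, attaches an avgpool-plus-FC auxiliary head of the kind described in the Model Architecture paragraph of Section IV-A, and runs one epoch of local-loss updates. The split point, the send_to_server callback, and all hyperparameters are illustrative assumptions rather than the DTFL implementation [37]; the paper's experiments use ResNet-56/110.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # stand-in backbone (the paper uses ResNet-56/110)

def split_model(num_client_children=4, num_classes=10, sample_shape=(1, 3, 32, 32)):
    """Build an assumed client-side model (first blocks) plus an avgpool+FC auxiliary head."""
    backbone = resnet18(num_classes=num_classes)
    client_side = nn.Sequential(*list(backbone.children())[:num_client_children])
    with torch.no_grad():                         # probe the channel width at the split layer
        feat_dim = client_side(torch.zeros(sample_shape)).shape[1]
    aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(feat_dim, num_classes))
    return client_side, aux_head

def client_update(client_side, aux_head, loader, send_to_server, lr=0.01, device="cpu"):
    """One epoch of local-loss-based training (Algorithm 1, lines 15-19)."""
    opt = torch.optim.SGD(list(client_side.parameters()) + list(aux_head.parameters()), lr=lr)
    criterion = nn.CrossEntropyLoss()
    client_side.train(); aux_head.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        z = client_side(x)                            # line 15: activations at the split layer
        send_to_server(z.detach().cpu(), y.cpu())     # line 16: server trains its part in parallel
        loss = criterion(aux_head(z), y)              # lines 17-18: auxiliary head, local loss
        opt.zero_grad(); loss.backward(); opt.step()  # lines 18-19: backpropagate, update weights
    return client_side.state_dict(), aux_head.state_dict()
```

The server-side counterpart would receive $(z, y)$, continue the forward pass through the offloaded layers, and update them with its own loss, which is what allows both sides to train in parallel.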
$F^{c_m'} := f^{c_m}(w^{c_m'})$ and $F^{s_m'} := f^{s_m}(w^{s_m'})$. $C_1 = G_1\sqrt{G_2^2 + 2LB^2 F^{s_m'}}\sum_r d_{c_m}^{(r)}$ and $C_2 = G_1\sqrt{G_2^2 + B^2 G_1^2}\sum_r d_{c_m}^{(r)}$ are convergent based on (A6). $A_m = \min_r\{A_{c_m}^{(r)} > 0\}$, where $A_{c_m}^{(r)}$ denotes the number of clients in tier $m$ at round $r$, and $d_{c_m}^{(r)}$ denotes the distance between the density function of the output of the client-side model and its converged state.

According to Theorem 1, both client-side and server-side models converge as the number of rounds $R$ increases, with varying convergence rates across different tiers. Note that as DTFL leverages local-loss-based split training, the convergence of the server-side model depends on the convergence of the client-side model, which is explicitly characterized by $C_1$ and $C_2$ in the analysis. The complete proof of the theorem is given in the supplementary material.

IV. EXPERIMENTAL EVALUATION

A. Experimental Setup

Dataset. We consider image classification on four public image datasets, including CIFAR-10 [17], CIFAR-100 [17], CINIC-10 [18], and HAM10000 [19]. We also study label distribution skew [34] (i.e., the distribution of labels varies across clients) to generate their non-I.I.D. variants using FedML [35]. The dataset distributions used in these experiments are provided in the supplementary material.

Baselines. We compare DTFL with state-of-the-art FL/SL methods, including FedAvg [27], SplitFed [24], FedYogi [29], and FedGKT [6]. For the same reasons as in FedGKT [6], we do not compare with FedProx [22] and FedMA [36]: for large CNNs like ResNets, FedProx tends to underperform FedAvg, while FedMA's incompatibility with batch normalization layers limits its use in modern DNNs.

Implementation. We conduct the experiments using Python 3.11.3 and PyTorch 1.13.1; the code is available online in the DTFL GitHub repository [37]. DTFL and the baselines are deployed on a server equipped with dual-socket Intel(R) Xeon(R) E5-2630 v4 CPUs @ 2.20 GHz with hyper-threading disabled, four NVIDIA GeForce GTX 1080 Ti GPUs, and 64 GB of memory. Each client is assigned different simulated CPU and communication resources to simulate heterogeneous resources (i.e., to simulate the training time of different CPU/network profiles). Using these resource profiles, we simulate a heterogeneous environment where clients' capacity varies in both cross-silo and cross-device FL settings. We consider 5 resource profiles: 4 CPUs with 100 Mbps, 2 CPUs with 30 Mbps, 1 CPU with 30 Mbps, 0.2 CPU with 30 Mbps, and 0.1 CPU with 10 Mbps communication speed to the server. Each client is assigned one resource profile at the beginning of training, and the profile can be changed during the training process to simulate a dynamic environment.

Model Architecture. DTFL is a versatile approach suitable for training a wide range of neural network models (e.g., Multilayer Perceptrons (MLPs), Recurrent Neural Networks (RNNs), and CNNs), particularly benefiting large-scale models. In the experiments, we evaluate large CNN models, ResNet-56 and ResNet-110 [16], which work well on the selected datasets. Furthermore, DTFL can also be applied to large language models (LLMs) like BERT [38] via splitting techniques as proposed in FedBERT [39] and FedSplitBERT [40]. For each tier, we split the global model to create client- and server-side models. The split layer varies across tiers and progressively moves toward the last layer as the tier increases. For each client-side model, we add a fully connected (FC) layer and an average pooling (avgpool) layer as the auxiliary network. More details and alternative auxiliary network architectures can be found in the supplementary material. For the FedGKT implementation, we follow the same settings as in FedGKT [6]. We split the global model after module 2 (as defined in the supplementary material) for the SplitFed model.

Fig. 3: Comparing the training time curves (in seconds) of DTFL with baselines for various I.I.D. datasets. DTFL exhibits significantly faster convergence across all datasets.

B. Training Time Improvement of DTFL

Training time comparison of DTFL to baselines. In Table III, we summarize all experimental results of training a global model (i.e., ResNet-56 or ResNet-110) with 7 tiers (i.e., M = 7) using different FL methods. The experiments are conducted on a heterogeneous client population, with 20% of clients assigned to each profile at the outset. Every 50 rounds, the client profiles (i.e., the number of simulated CPUs and the communication speed) of 30% of the clients are randomly changed to simulate a dynamic environment, while all clients participate in every training round. The training time of each method to achieve the target accuracy is provided in Table III. In all cases, for both I.I.D. and non-I.I.D. settings, DTFL reduces the training time substantially compared to the baselines (FedAvg, SplitFed, FedYogi, FedGKT). For example, DTFL reduces the training time of FedAvg by 80% to reach the target accuracy on I.I.D. CIFAR-10 with ResNet-110. This experiment illustrates the capability of DTFL to significantly reduce training time when training on distributed heterogeneous clients. Figure 3 depicts the curve of the server test accuracy during the training process of all the methods for the I.I.D. CIFAR-10 case with ResNet-110, where we observe faster convergence using DTFL compared to the baselines.

C. Understanding DTFL under Different Settings

Performance of DTFL with different numbers of clients. We evaluate the performance of DTFL with different numbers
TABLE III: Comparison of training time (in seconds) to baseline approaches with 10 clients on different datasets. The numbers represent the
training time used to achieve the target accuracy (i.e., CIFAR-10 I.I.D. 80%, CIFAR-10 non-I.I.D. 70%, CIFAR-100 I.I.D. 55%, CIFAR-100
non-I.I.D. 50%, CINIC-10 I.I.D. 75%, CINIC-10 non-I.I.D. 65%, and HAM10000 75%).
[25] W. Wu, M. Li, K. Qu, C. Zhou, X. Shen, W. Zhuang, X. Li, and W. Shi, "Split learning over wireless networks: Parallel design and resource management," IEEE Journal on Selected Areas in Communications, vol. 41, no. 4, pp. 1051–1066, 2023.
[26] Z. Zhang, A. Pinto, V. Turina, F. Esposito, and I. Matta, "Privacy and efficiency of communications in federated split learning," IEEE Transactions on Big Data, 2023.
[27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
[28] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, "Tackling the objective inconsistency problem in heterogeneous federated optimization," Advances in Neural Information Processing Systems, vol. 33, pp. 7611–7623, 2020.
[29] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan, "Adaptive federated optimization," arXiv preprint arXiv:2003.00295, 2020.
[30] M. Laskin, L. Metz, S. Nabarro, M. Saroufim, B. Noune, C. Luschi, J. Sohl-Dickstein, and P. Abbeel, "Parallel training of deep networks with local updates," arXiv preprint arXiv:2012.03837, 2020.
[31] S. U. Stich, "Local SGD converges fast and communicates little," arXiv preprint arXiv:1805.09767, 2018.
[32] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 5693–5700.
[33] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, "Scaffold: Stochastic controlled averaging for federated learning," in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.
[34] Q. Li, Y. Diao, Q. Chen, and B. He, "Federated learning on non-IID data silos: An experimental study," in 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2022, pp. 965–978.
[35] C. He, S. Li, J. So, X. Zeng, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu et al., "FedML: A research library and benchmark for federated machine learning," arXiv preprint arXiv:2007.13518, 2020.
[36] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni, "Federated learning with matched averaging," in International Conference on Learning Representations, 2020.
[37] Sajjadi Mohammadabadi, Seyed Mahmoud, "Dynamic tiering-based federated learning (DTFL)," https://github.com/mahmoudsajjadi/DTFL, accessed on 1 Sep 2024.
[38] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[39] Y. Tian, Y. Wan, L. Lyu, D. Yao, H. Jin, and L. Sun, "FedBERT: When federated learning meets pre-training," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–26, 2022.
[40] Z. Lit, S. Sit, J. Wang, and J. Xiao, "Federated split BERT for heterogeneous text classification," in 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022, pp. 1–8.
[41] H. Yin, A. Mallya, A. Vahdat, J. M. Alvarez, J. Kautz, and P. Molchanov, "See through gradients: Image batch recovery via GradInversion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16337–16346.
[42] L. Zhu, Z. Liu, and S. Han, "Deep leakage from gradients," Advances in Neural Information Processing Systems, vol. 32, 2019.
[43] J. Shen, N. Cheng, X. Wang, F. Lyu, W. Xu, Z. Liu, K. Aldubaikhy, and X. Shen, "RingSFL: An adaptive split federated learning towards taming client heterogeneity," IEEE Transactions on Mobile Computing, 2023.
[44] P. Vepakomma, A. Singh, O. Gupta, and R. Raskar, "NoPeek: Information leakage reduction to share activations in distributed deep learning," in 2020 International Conference on Data Mining Workshops (ICDMW). IEEE, 2020, pp. 933–942.
[45] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
[46] D. Yao, L. Xiang, H. Xu, H. Ye, and Y. Chen, "Privacy-preserving split learning via patch shuffling over transformers," in 2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022, pp. 638–647.
[47] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana, "Certified robustness to adversarial examples with differential privacy," in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 656–672.
[48] E. Erdogan, A. Küpçü, and A. E. Cicek, "SplitGuard: Detecting and mitigating training-hijacking attacks in split learning," in Proceedings of the 21st Workshop on Privacy in the Electronic Society, 2022, pp. 125–137.
[49] H. U. Sami and B. Güler, "Secure aggregation for clustered federated learning," in 2023 IEEE International Symposium on Information Theory (ISIT). IEEE, 2023, pp. 186–191.
[50] X. Qiu, H. Pan, W. Zhao, C. Ma, P. P. Gusmao, and N. D. Lane, "vFedSec: Efficient secure aggregation for vertical federated learning via secure layer," arXiv preprint arXiv:2305.16794, 2023.
[51] T. Wang, Y. Zhang, and R. Jia, "Improving robustness to model inversion attacks via mutual information regularization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 13, 2021, pp. 11666–11673.
[52] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar, "LEAF: A benchmark for federated settings," arXiv preprint arXiv:1812.01097, 2018.