
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2024.3487473

Speed Up Federated Learning in Heterogeneous Environments: A Dynamic Tiering Approach

Seyed Mahmoud Sajjadi Mohammadabadi, Graduate Student Member, IEEE, Syed Zawad, Member, IEEE, Feng Yan, Member, IEEE, Lei Yang, Senior Member, IEEE

Abstract—Federated learning enables collaborative training of a model while keeping the training data decentralized and private. However, in IoT systems, inherent heterogeneity in processing power, communication bandwidth, and task size can significantly hinder the efficient training of large models. Such heterogeneity produces vast variations in the training time of clients, lengthening overall training and wasting the resources of faster clients. To tackle these heterogeneity challenges, we propose Dynamic Tiering-based Federated Learning (DTFL), a novel system that leverages distributed optimization principles to improve edge learning performance. Based on clients' resources, DTFL dynamically offloads part of the global model to the server, alleviating resource constraints on slower clients and speeding up training. By leveraging Split Learning, DTFL offloads different portions of the global model to clients in different tiers and enables each client to update the models in parallel via local-loss-based training. This helps reduce the computation and communication demand on resource-constrained devices, mitigating the straggler problem. DTFL introduces a dynamic tier scheduler that uses tier profiling to estimate the expected training time of each client based on their historical training time, communication speed, and dataset size. The dynamic tier scheduler assigns clients to suitable tiers to minimize the overall training time in each round. We theoretically prove the convergence properties of DTFL and validate its effectiveness by training large models (ResNet-56 and ResNet-110) across varying numbers of clients (from 10 to 200) using popular image datasets (CIFAR-10, CIFAR-100, CINIC-10, and HAM10000) under both IID and non-IID settings. DTFL seamlessly integrates various privacy measures without sacrificing performance. Extensive experimental results show that, compared with state-of-the-art FL methods, DTFL can significantly reduce the training time by up to 80% while maintaining model accuracy.

Index Terms—Edge computing, federated learning, heterogeneous devices, split learning, distributed optimization

Seyed Mahmoud Sajjadi Mohammadabadi and Lei Yang are with the Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA (e-mail: [email protected]; [email protected]). S. Zawad is with IBM Research - Almaden, San Jose, CA, USA (e-mail: [email protected]). Feng Yan is with the Computer Science Department and Electrical and Computer Engineering Department of the University of Houston, TX, USA (e-mail: [email protected]).

I. INTRODUCTION

Federated Learning (FL) has become a popular privacy-preserving distributed learning paradigm, particularly in emerging edge computing scenarios like beyond-5G (B5G) Internet of Things (IoT) systems. FL enables collaborative training of a global model without requiring clients to share their sensitive data with others. In FL, clients update the global model using their locally trained weights to avoid sharing raw data with the server or other clients. Training large models in resource-constrained B5G IoT environments (e.g., mobile devices, sensors, and edge servers) is challenging due to device heterogeneity and dynamic wireless channels. IoT devices often have heterogeneous computational and communication capabilities, along with varying dataset sizes. Greater heterogeneity has a significant impact on training time, increasing the time needed to achieve accuracy comparable to non-heterogeneous settings [1], [2]. Motivated by the increasing demand for intelligent B5G IoT applications, we aim to address the challenges of training large models on resource-constrained heterogeneous devices.

To train large models with resource-constrained IoT devices, various methods have been proposed in the literature. One solution is to split the global model into a client-side model (i.e., the first few layers of the global model) and a server-side model, where the clients only need to train the small client-side model via Split Learning (SL) [3], [4]. Additionally, Liao et al. [5] enhance the model training speed in Split Federated Learning (SFL) by allowing local clients control over both the local updating frequency and batch size. Yet in SFL, each client needs to wait for the back-propagated gradients from the server to update its model, and the communication overhead for transmitting the forward/backward signals between the server and clients can be substantial at each training round (i.e., the time needed to complete a round of training). To address these issues, FedGKT [6] and COMET [7] use a knowledge transfer training algorithm to train small models on the client side and periodically transfer their knowledge via knowledge distillation to a large server-side model. Furthermore, Han et al. [8] develop a federated SL algorithm that addresses the latency and communication issues by integrating local-loss-based training into SL. However, the client-side models in earlier works [3]–[8] remain fixed throughout the training process, and choosing suitable client-side models in heterogeneous environments is challenging as the resources of clients may change over time. Another solution is to divide clients into tiers based on their training speed and select clients from the same tier in each training round to mitigate the straggler problem (i.e., some devices significantly lag in training) [9], [10]. Yet, existing tier-based works [9], [10] still require clients to train the entire global model, which is not suitable for training large models.

In this paper, we propose the Dynamic Tiering-based Federated Learning (DTFL) system, a novel approach that leverages distributed optimization principles to speed up training of large models in heterogeneous IoT environments. DTFL builds upon the strengths of SFL [8] and tier-based FL [9] while addressing their limitations, particularly latency issues and training time in heterogeneous edge computing scenarios. In DTFL, clients are categorized into different tiers, with each tier responsible for offloading specific portions of the global model from clients to the server. Then, each client and the server update the models in parallel using local-loss-based training [11], [12]. In heterogeneous environments, client training times may vary across rounds. Static tier assignments can lead to severe straggler issues, especially when resource-constrained clients (e.g., due to other concurrently running applications on IoT devices) are assigned to demanding tiers. To address this challenge, we propose a dynamic tier scheduler that assigns clients to suitable tiers based on their capacities, task size, and current training speed. The tier scheduler employs tier profiling to estimate client-side training time, using only the measured training time, communicated network speed, and observed dataset size. This low-overhead approach makes it suitable for real-world system implementation. The main contributions of this paper are summarized as follows:
• We propose DTFL to address the challenges of training large models on heterogeneous devices in FL by offloading different portions of the global model from clients to the server based on clients' resources and wireless communication capacity, thereby accelerating training and mitigating the straggler problem.
• We propose a dynamic tier scheduler that continuously assesses clients' training performance and dynamically adjusts clients' tier assignments in each round of training to minimize the training time. This approach optimizes resource allocation and reduces training time in dynamic environments.
• We theoretically show the convergence of DTFL on convex and non-convex loss functions under standard assumptions in FL [13], [14] and local-loss-based training [8], [12], [15].
• Using DTFL, we train large models (ResNet-56 and ResNet-110 [16]) on different numbers of clients (from 10 to 200) using popular datasets (CIFAR-10 [17], CIFAR-100 [17], CINIC-10 [18], and HAM10000 [19]) and their non-I.I.D. (non-identical and independent distribution) variants. Our extensive experiments demonstrate that DTFL not only maintains model accuracy comparable to state-of-the-art FL methods but also achieves up to an 80% reduction in training time compared to baselines.
• We evaluate DTFL's performance when employing privacy measures, such as minimizing the distance correlation between raw data and intermediate representations, and shuffling data patches. In comparison with a baseline accuracy of 87.1% without privacy measures, applying patch shuffling leads to a minor 1.7% accuracy decrease, while incorporating distance correlation with α = 0.5 results in a 3.6% accuracy reduction. This demonstrates DTFL's adaptability to privacy measures with minimal impact on performance.

The paper is organized as follows: Section II presents background and related work. Section III introduces the DTFL framework and provides a theoretical convergence analysis of DTFL. Section IV evaluates the performance of DTFL. Section V concludes the paper.

II. BACKGROUND AND RELATED WORKS

Federated Learning. Existing FL methods (see a comprehensive study of FL [20]) require clients to repeatedly download and update the global model, which is not suitable for training large models with resource-constrained devices in heterogeneous environments and may suffer a severe straggler problem. To address the straggler problem, Li et al. [13] select a smaller set of clients for training in each global iteration but require more training rounds. Bonawitz et al. [21] mitigate stragglers by neglecting the slowest 30% of clients, while FedProx [22] uses distinct local epoch numbers for clients. Both Bonawitz et al. [21] and FedProx [22] face the challenge of determining the right parameters (i.e., the percentage of slowest clients and the number of local epochs). Recently, tier-based FL methods [9], [10], [23] propose dividing clients into tiers based on their training speed and selecting clients from the same tier in each training round to mitigate the straggler problem. However, these methods require clients to train the entire global model and do not consider the impact of communication overhead and dynamic environmental changes, which creates significant hurdles in training large models on resource-constrained devices in heterogeneous environments.

Split Learning. To tackle the computational limitations of resource-constrained devices, SL splits the global model into a client-side model and a server-side model, so that, compared to FL, clients only need to update the small client-side model [3], [4]. To increase SL training speed, SplitFed [24] combined it with FL, while CPSL [25] proposed a first-parallel-then-sequential approach that clusters clients, sequentially trains a model in SL fashion within each cluster, and transfers the updated cluster model to the next cluster. In SL, clients must wait for the server's backpropagated gradients to update their models, which can cause significant communication overhead. To address these issues, He et al. [6] propose FedGKT to train small models at clients and periodically transfer their knowledge by knowledge distillation to a large server-side model. Han et al. [8] develop a federated SL algorithm that addresses latency and communication issues by integrating local-loss-based training. Clients train a model using local error signals, which eliminates the need to communicate with the server. However, the client-side models in current SL approaches [6], [8], [26] are fixed throughout the training process, and choosing suitable client-side models in heterogeneous environments is challenging as clients' resources may change over time. As opposed to these works, the proposed DTFL can dynamically adjust the size of the client-side model for each client over time, significantly reducing training time and mitigating the straggler problem.

III. DYNAMIC TIERING-BASED FEDERATED LEARNING

A. Problem Statement

We aim to collaboratively train a large model (e.g., ResNet or AlexNet) by K clients on a range of heterogeneous resource-constrained devices that lack powerful computation and communication resources, without centralizing the dataset on the server side. Let {(x_i, y_i)}_{i=1}^{N_k} denote the dataset of client k, where x_i denotes the i-th training sample, y_i is the associated label of x_i, and N_k is the number of samples in client k's dataset.


The FL problem can be formulated as a distributed optimization problem:

$$\min_{w} f(w) \;\overset{\mathrm{def}}{=}\; \min_{w} \sum_{k=1}^{K} \frac{N_k}{N} \, f_k(w) \qquad (1)$$

$$\text{where } f_k(w) = \frac{1}{N_k} \sum_{i=1}^{N_k} \ell\big((x_i, y_i); w\big) \qquad (2)$$

where w denotes the model parameters and N = Σ_{k=1}^{K} N_k. f(w) denotes the global objective function, and f_k(w) denotes the k-th client's local objective function, which evaluates the local loss over its dataset using loss function ℓ.

One main drawback of existing federated optimization techniques (e.g., [22], [27]–[29]) for solving (1) is that they cannot efficiently train large models on a variety of heterogeneous resource-constrained devices. Such heterogeneity leads to a severe straggler problem in which clients may have significantly different response latencies (i.e., the time between a client receiving the training task and returning the results) in the FL process, which would considerably slow down the training (see experimental results in Sec. IV-B).

To address these issues, we propose a Dynamic Tiering-based Federated Learning system (see Figure 1), in which we develop a dynamic tier scheduler that assigns clients to suitable tiers based on their training speed. In different tiers, DTFL offloads different portions of the global model to clients and enables each client to update the models in parallel via local-loss-based training, which can reduce the computation and communication demand on resource-constrained devices while mitigating the straggler problem. In contrast to existing works (e.g., [6], [8], [9]), which can be treated as a single-tier case in DTFL, DTFL provides more flexibility via multiple tiers to cater to a variety of heterogeneous resource-constrained devices in heterogeneous environments. As shown in the experimental results in Sec. IV-B, DTFL can significantly reduce the training time while maintaining model accuracy more efficiently than these methods.

Fig. 1: Overview of Dynamic Tiering-based Federated Learning. The purple layer at the client side represents the auxiliary network across tiers. In each round, clients download their assigned model (1), perform forward propagation with their local dataset (2), and compute the local loss and backpropagate (3), while the server continues training with the received data (4) and then aggregates the updated model (5).

TABLE I: Comparison of training time (in seconds) for 10 clients under different tiers when M = 6 to achieve 80% accuracy on the I.I.D. CIFAR-10 dataset using ResNet-110. In each experiment, all clients are assigned to the same tier, and clients are randomly assigned to different CPU and network speed profiles. Profiles in Case 1: 2 CPUs with 30 Mbps, 1 CPU with 30 Mbps, 0.2 CPU with 30 Mbps. Profiles in Case 2: 4 CPUs with 100 Mbps, 1 CPU with 30 Mbps, 0.1 CPU with 10 Mbps. The experimental setup can be found in Sec. IV.

Ex.            Tier 1   Tier 2   Tier 3   Tier 4   Tier 5   Tier 6   FedAvg
1   Comp.       4622     8106     9982    10681    11722    12250    13396
    Comm.       5911     5995     2187     2189     1018      908       16
    Overall    10533    14101    12170    12871    12741    13158    13408
2   Comp.       8384    14634    17993    19027    21428    22344    24428
    Comm.      17754    18090     6720     6762     2941     2653       43
    Overall    26138    32724    24713    25989    24369    24997    24471

B. Tiering Local-Loss-Based Training

To cater to heterogeneous resource-constrained devices, DTFL divides the clients into M tiers based on their training speed. In different tiers, DTFL offloads different portions of the global model w to the server and enables each client to update the models in parallel via local-loss-based training. Specifically, in tier m, the model w is split into a client-side model w^{c_m} and a server-side model w^{s_m}. Clients in tier m train the client-side model w^{c_m} and an auxiliary network w^{a_m}. The auxiliary network is the set of extra layers connected to the client-side model; this network is used to compute the local loss on the client side. By introducing the auxiliary network, we enable each client to update the models in parallel with the server [8], which avoids the severe synchronization and substantial communication in SL that significantly slows down the training process [3], [4]. In this paper, we use a few fully connected layers for the auxiliary network, as in previous works [8], [12], [30].

Under this setting, we define f_k^c(w^{c_m}, w^{a_m}) as the client-side loss function and f_k^s(w^{s_m}, w^{c_m}) as the corresponding server-side loss function in tier m. Our goal is to find w^{c_m*} and w^{a_m*} that minimize the client-side loss function in each tier m:

$$\min_{w^{c_m},\, w^{a_m}} \sum_{k \in \mathcal{A}^{c_m}} \frac{N_k}{N^m} \, f_k^c(w^{c_m}, w^{a_m}) \qquad (3)$$

where $f_k^c(w^{c_m}, w^{a_m}) = \frac{1}{N_k} \sum_{i=1}^{N_k} \ell\big((x_i, y_i); w^{c_m}, w^{a_m}\big)$ and $N^m = \sum_{k \in \mathcal{A}^{c_m}} N_k$. $\mathcal{A}^{c_m}$ denotes the set of clients in tier m. Given the optimal client-side model w^{c_m*}, the server finds w^{s_m*} that minimizes the server-side loss function:

$$\min_{w^{s_m}} \sum_{k \in \mathcal{A}^{c_m}} \frac{N_k}{N^m} \, f_k^s(w^{s_m}, w^{c_m*}) \qquad (4)$$

where $f_k^s(w^{s_m}, w^{c_m*}) = \frac{1}{N_k} \sum_{i=1}^{N_k} \ell\big((z_i, y_i); w^{s_m}\big)$ and $z_i = h_{w^{c_m*}}(x_i)$ is the intermediate output of the client-side model w^{c_m*} given the input x_i.
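To make the split training concrete, the following is a minimal, illustrative PyTorch sketch of one tier's update: a client-side sub-model plus a small auxiliary head (average pooling and a fully connected layer) computes the local loss of (3), so the client updates without waiting for server gradients, and the server separately trains its server-side model on the received activations as in (4). The class and function names, layer sizes, and optimizer settings are assumptions made for this example, not the paper's exact architecture; the actual per-tier ResNet splits are defined in the supplementary material.

```python
import torch
import torch.nn as nn

class ClientSideModel(nn.Module):
    """Illustrative stand-in for the first few layers of the global model (tier-dependent)."""
    def __init__(self, in_ch=3, width=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU())
    def forward(self, x):
        return self.features(x)

class AuxiliaryHead(nn.Module):
    """Auxiliary network w^{a_m} (avgpool + FC) used to compute the local loss on the client side."""
    def __init__(self, width=16, num_classes=10):
        super().__init__()
        self.pool, self.fc = nn.AdaptiveAvgPool2d(1), nn.Linear(width, num_classes)
    def forward(self, z):
        return self.fc(self.pool(z).flatten(1))

def client_local_step(client_model, aux_head, optimizer, x, y):
    """One local-loss-based client update for (3); returns detached activations for the server."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    z = client_model(x)                  # intermediate output z_i = h_{w^{c_m}}(x_i)
    loss = loss_fn(aux_head(z), y)       # local loss via the auxiliary head (no server gradient needed)
    loss.backward()
    optimizer.step()
    return z.detach(), y                 # payload (z_i, y_i) sent to the server

def server_side_step(server_model, optimizer, z, y):
    """Server-side update for (4) on the received activations (z, y)."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = loss_fn(server_model(z), y)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":  # usage sketch for one CIFAR-10-sized batch
    client, head = ClientSideModel(), AuxiliaryHead()
    server = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
    opt_c = torch.optim.SGD(list(client.parameters()) + list(head.parameters()), lr=0.01)
    opt_s = torch.optim.SGD(server.parameters(), lr=0.01)
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    z, labels = client_local_step(client, head, opt_c, x, y)
    server_side_step(server, opt_s, z, labels)
```

Because the client's backward pass stops at the auxiliary head, the client and the server can run their updates in parallel within a round, which is the property DTFL relies on to hide server-side computation behind client-side computation.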


Offloading the model to the server can effectively reduce the total training time, as illustrated in Table I. As a client offloads more layers to the server (moving towards tier m = 1), the model size on the client's side decreases, thereby reducing the computational workload. Meanwhile, this may increase the amount of data transmitted (i.e., the size of the intermediate data and partial model). As indicated in Table I, there exists a non-trivial tier assignment that minimizes the overall training time. To find the optimal tier assignment, DTFL needs to consider multiple factors, including the communication link speed between the server and the clients, the computation power of each client, and the local dataset size.

C. Dynamic Tier Scheduling

In a heterogeneous environment with multiple clients, the proposed dynamic tier scheduling aims to minimize the overall training time by determining the optimal tier assignments for each client.

Specifically, let m_k^{(r)} denote the tier of client k in training round r. T_k^c(m_k^{(r)}), T_k^{com}(m_k^{(r)}), and T_k^s(m_k^{(r)}) represent the training time of the client-side model, the communication time, and the training time of the server-side model of client k at round r, respectively. Using the proposed local-loss-based split training algorithm, each client and the server train the model in parallel. The overall training time T_k for client k in each round can be presented as:

$$T_k(m_k^{(r)}) = \max\Big\{ T_k^c(m_k^{(r)}) + T_k^{com}(m_k^{(r)}),\; T_k^s(m_k^{(r)}) + T_k^{com}(m_k^{(r)}) \Big\}. \qquad (5)$$

As clients train their models in parallel, the overall training time in each round r is determined by the slowest client (i.e., the straggler). To minimize the overall training time, we minimize the maximum training time of clients in each round:

$$\min_{\{m_k^{(r)}\}} \max_{k} \; T_k(m_k^{(r)}), \quad \text{subject to } m_k^{(r)} \in \mathcal{M} \;\; \forall k, \qquad (6)$$

where $\mathcal{M}$ denotes the set of tiers. Note that problem (6) is an integer programming problem. Solving (6) requires the knowledge of each client's training time {T_k(m_k^{(r)})} under each tier. As the capacities of each client in a heterogeneous environment may change over time, a static tier assignment may still lead to a severe straggler problem. The key question is how to efficiently solve (6) in a heterogeneous environment.

To optimize client assignments, we introduce a dynamic tier scheduler. This scheduler leverages tier profiling to estimate each client's training time under different tiers, efficiently placing them in the most suitable tier for each training round.

• Tier Profiling. Before the training starts, the server conducts tier profiling to estimate T_k^c(m_k^{(r)}), T_k^{com}(m_k^{(r)}), and T_k^s(m_k^{(r)}) for each client across different tiers. These estimates are used for tier profiling, a crucial component of the dynamic tier scheduler. Client-side training time is predicted using historical data and a normalized training time profile. Communication time is calculated based on transferred data size, client communication speed, and dataset size. Server-side training time is estimated using the observed relationship between the client's dataset size and server-side training times. Specifically, using a standard data batch, the server profiles the transferred data size (i.e., model parameter and intermediate data size) for each tier m as D_size(m_k^{(r)}). Then, for each client k in tier m, the communication time can be estimated as D_size(m_k^{(r)}) Ñ_k / ν_k^{(r)}, where ν_k^{(r)} represents the client's communication speed and Ñ_k denotes the number of data batches. To track clients' training times for their respective client-side models, the server maintains and updates the set of historical client-side training times for each client k in tier m, denoted as T_k^{c_m}. To mitigate measurement noise, the server uses an Exponential Moving Average (EMA) of the historical client-side training time (i.e., T̄_k^{c_m}(m_k^{(r)}) ← EMA(T_k^{c_m}(m_k^{(r)}))) as the current client's training time in tier m. One key challenge in tier profiling is capturing the dynamics of the training time of each client in a heterogeneous environment. We need knowledge of the training time of every client in each tier, but only the current tier's training time for each client is available each round. To address this, we analyze the relationship between normalized training times across tiers for each client, where the normalized training time refers to the model training time using a standard data batch. Table II shows normalized training times of different tiers relative to tier 1. Notably, the table indicates that, for any client within a specific model, the normalized training times relative to tier 1 for both client-side and server-side models, denoted as T^{cp} and T^{sp} respectively, are the same across different tiers. This is because the ratio between normalized training times in any two tiers depends solely on their respective model sizes, which do not change across clients. These model sizes remain constant within a specific model design, ensuring consistent normalized training times across tiers for clients. Based on this tier profiling, we can estimate the training times in other tiers using the observed training time of each client in the assigned tier (see lines 24 to 29 in Algorithm 1).

• Tier Scheduling. In each round, the tier scheduler minimizes the maximum training time of clients. First, it identifies the maximum time (i.e., the straggler training time), denoted as T_max, by estimating the maximum training time of all clients if each of them were assigned to the tier that minimizes its training time (see line 31 in Algorithm 1). Then, it assigns other clients to a tier with an estimated training time that is less than or equal to T_max (see line 33 in Algorithm 1). To better utilize the resources of each client, the tier scheduler selects the tier m that minimizes the offloading to the server while still ensuring that the estimated training time remains below T_max, i.e., m_k^{(r+1)} ← arg max_m {T̂_k(m_k^{(r+1)}) ≤ T_max}.

The dynamic tier scheduler is illustrated in Figure 2 and is detailed in the TierScheduler(·) function in Algorithm 1. The DTFL training process is described in Algorithm 1. This scheduler has a computational complexity of O(KM) and incurs negligible overhead compared to the computation and communication time.


TABLE II: The normalized training times for both client-side and server-side models in different tiers for each client relative to Tier 1, using ResNet-56 with 10 clients. In each experiment, all the clients are assigned to the same tier. We change the CPU capacities of clients in each experiment to evaluate the impact of CPU capacities.

Tier                        1            2            3            4            5            6
Client-side Training Time   1.00 ± 0.04  1.63 ± 0.10  2.16 ± 0.15  2.68 ± 0.22  3.30 ± 0.24  3.81 ± 0.28
Server-side Training Time   1.00 ± 0.07  0.82 ± 0.06  0.65 ± 0.06  0.51 ± 0.04  0.33 ± 0.03  0.20 ± 0.01

Fig. 2: Dynamic Tier Scheduler Overview: In each round of training for each client, the server monitors network speeds, client training time, and dataset sizes. Tier profiles show the transferred data for each tier. The tier profiler estimates client/server training and communication times using tier profiles, historical training time, and network speeds across different tiers. The tier scheduler assigns clients to tiers, minimizing the maximum training time while limiting server load.

Algorithm 1 DTFL's Training Process

Initialization
MainServer()
 1: for each round r = 0 to R − 1 do
 2:   m^{(r)} ← TierScheduler(T^{c_m}(m_k^{(r)}), ν^{(r)}, Ñ)
 3:   for each client k in parallel do
 4:     (z_k^{(r)}, y_k) ← ClientUpdate(w_k^{c_m,(r)})
 5:     Measure T_k^{c_m}(m_k^{(r)}), ν_k^{(r)}, and Ñ_k
        // the server updates the server-side model
 6:     Forward propagation of z_k^{(r)} on w_k^{s_m,(r)}
 7:     Calculate the loss, backpropagation on w_k^{s_m,(r)}
 8:     w_k^{s_m,(r+1)} ← w_k^{s_m,(r)} − η ∇f_k^s(w^{s_m}, w^{c_m*})
 9:     Receive the updated w_k^{c_m,(r+1)} from client k
10:     w_k^{(r+1)} = {w_k^{c_m,(r+1)}, w_k^{s_m,(r+1)}}
11:   end for
12:   w^{(r+1)} = (1/K) Σ_k w_k^{(r+1)}
13:   Update all models (w^{c_m,(r+1)} and w^{s_m,(r+1)}) in each tier using w^{(r+1)}
14: end for

ClientUpdate(w_k^{c_m,(r)})
15: Forward propagate on local data to calculate z_k^{(r)}
16: Send (z_k^{(r)}, y_k) to the server
17: Forward propagation to the auxiliary layer
18: Calculate the local loss, backpropagation
19: w_k^{c_m,(r+1)} ← w_k^{c_m,(r)} − η ∇f_k^{c_m}(w^{c_m}, w^{a_m})
20: Send w_k^{c_m,(r+1)} to the server

TierScheduler(T^{c_m}(m_k^{(r)}), ν^{(r)}, Ñ)
21: for all clients k do
22:   Add T_k^{c_m}(m_k^{(r)}) − D^m(m_k^{(r)}) Ñ_k / ν_k^{(r)} into T_k^{c_m}(m_k^{(r)})
23:   T̄_k^{c_m}(m_k^{(r)}) ← EMA(T_k^{c_m}(m_k^{(r)}))
24:   for all tiers m_k^{(r+1)} do
        // estimate T̂_k(m_k^{(r+1)})
25:     T̂_k^{com}(m_k^{(r+1)}) ← D_size(m_k^{(r)}) Ñ_k / ν_k^{(r)}
26:     T̂_k^c(m_k^{(r+1)}) ← (T^{cp}(m_k^{(r+1)}) / T^{cp}(m_k^{(r)})) · T̄_k^{c_m}(m_k^{(r)})
27:     T̂_k^s(m_k^{(r+1)}) ← T^{sp}(m_k^{(r+1)}) Ñ_k
28:     Compute T̂_k(m_k^{(r+1)}) using Equation (5)
29:   end for
30: end for
31: T_max ← max_k min_m {T̂_k(m_k^{(r+1)})}
32: for all clients k do
33:   m_k^{(r+1)} ← arg max_m {T̂_k(m_k^{(r+1)}) ≤ T_max}
34: end for
35: Return m^{(r+1)}

D. Convergence Analysis

We show the convergence of both client-side and server-side models in DTFL on convex and non-convex loss functions based on standard assumptions in FL and local-loss-based training. We assume that (A1) the client-side f_k^{c_m} and server-side f_k^{s_m} objective functions of each client in each tier are differentiable and L-smooth; (A2) the gradients of f_k^{c_m} and f_k^{s_m} have expected squared norm bounded by G_1^2; (A3) the variance of the gradients of f_k^{c_m} and f_k^{s_m} is bounded by σ^2; (A4) f_k^{c_m} and f_k^{s_m} are μ-convex for μ ≥ 0 for some results; (A5) the client-side objective functions are (G_2, B)-BGD (Bounded Gradient Dissimilarity); and (A6) the time-varying parameter satisfies d_{c_m}^{(r)} < ∞, where d_{c_m}^{(r)} represents the distance between the density function of the client-side model's output at round r and its converged state. These assumptions are well-established and frequently utilized in the machine learning literature for convergence analyses, as in previous works such as [12], [13], [31]–[33]. We adopt the approach of DGL [12] for local-loss-based training, where the server input distribution varies over time and depends on client-side model convergence.

Theorem 1 (Convergence of DTFL): Under assumptions (A1), (A2), (A3), and (A5), the convergence properties of DTFL for both convex and non-convex functions are summarized as follows. Convex: Under (A4), with $\eta \le \frac{1}{8L(1+B^2)}$ and $R \ge \frac{4L(1+B^2)}{\mu}$, the client-side model converges at the rate of
$$\mathcal{O}\Big( \mu D \exp\big(-\tfrac{\eta}{2}\mu R\big) + \frac{\eta H_1^2}{\mu R A_m} \Big)$$
and the server-side model converges at the rate of
$$\mathcal{O}\Big( \frac{C_1}{R} + \frac{H_2 \sqrt{F^0_{s_m}}}{\sqrt{R A_m}} + \frac{F^0_{s_m}}{\eta_{\max} R} \Big).$$
Non-convex: If both f^{c_m} and f^{s_m} are non-convex with $\eta \le \frac{1}{8L(1+B^2)}$, then the client-side model converges at the rate of
$$\mathcal{O}\Big( \frac{H_1 \sqrt{F^0_{c_m}}}{\sqrt{R A_m}} + \frac{F^0_{c_m}}{\eta_{\max} R} \Big)$$
and the server-side model converges at the rate of
$$\mathcal{O}\Big( \frac{C_2}{R} + \frac{H_2 \sqrt{F^0_{s_m}}}{\sqrt{R A_m}} + \frac{F^0_{s_m}}{\eta_{\max} R} \Big),$$
where $\eta_{\max}$ is the maximum learning rate $\eta$, $H_1^2 := \sigma^2 + \big(1 - \frac{A_m}{K}\big) G_2^2$, $H_2^2 := L^3 (B^2 + 1) F^0_{s_m} + \big(1 - \frac{A_m}{K}\big) L^2 G_2^2$, $D := \lVert w^{c_m,0} - w^{c_m,\star} \rVert$,
$F^0_{c_m} := f^{c_m}(w^{c_m,0})$, and $F^0_{s_m} := f^{s_m}(w^{s_m,0})$. $C_1 = G_1 \sqrt{G_2^2 + 2 L B^2 F^0_{s_m}} \, \sum_r d_{c_m}^{(r)}$ and $C_2 = G_1 \sqrt{G_2^2 + B^2 G_1^2} \, \sum_r d_{c_m}^{(r)}$ are convergent based on (A6). $A_m = \min_r \{A_{c_m}^{(r)} > 0\}$, where $A_{c_m}^{(r)}$ denotes the number of clients in tier m at round r. $d_{c_m}^{(r)}$ denotes the distance between the density function of the output of the client-side model and its converged state.

According to Theorem 1, both client-side and server-side models converge as the number of rounds R increases, with varying convergence rates across different tiers. Note that, as DTFL leverages local-loss-based split training, the convergence of the server-side model depends on the convergence of the client-side model, which is explicitly characterized by C_1 and C_2 in the analysis. The complete proof of the theorem is given in the supplementary material.

IV. EXPERIMENTAL EVALUATION

A. Experimental Setup

Dataset. We consider image classification on four public image datasets, including CIFAR-10 [17], CIFAR-100 [17], CINIC-10 [18], and HAM10000 [19]. We also study label distribution skew [34] (i.e., the distribution of labels varies across clients) to generate their non-I.I.D. variants using FedML [35]. The dataset distributions used in these experiments are provided in the supplementary materials.

Baselines. We compare DTFL with state-of-the-art FL/SL methods, including FedAvg [27], SplitFed [24], FedYogi [29], and FedGKT [6]. For the same reasons as in FedGKT [6], we do not compare with FedProx [22] and FedMA [36]: for large CNNs like ResNets, FedProx tends to underperform FedAvg, while FedMA's incompatibility with batch normalization layers limits its use in modern DNNs.

Implementation. We conduct the experiments using Python 3.11.3 and the PyTorch library version 1.13.1; the implementation is available online in the DTFL GitHub repository [37]. DTFL and the baselines are deployed on a server equipped with dual-socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz processors with hyper-threading disabled, four NVIDIA GeForce GTX 1080 Ti GPUs, and 64 GB of memory. Each client is assigned a different simulated CPU and communication resource to simulate heterogeneous resources (i.e., to simulate the training time of different CPU/network profiles). By using these resource profiles, we simulate a heterogeneous environment where clients' capacity varies in both cross-silo and cross-device FL settings. We consider 5 resource profiles: 4 CPUs with 100 Mbps, 2 CPUs with 30 Mbps, 1 CPU with 30 Mbps, 0.2 CPU with 30 Mbps, and 0.1 CPU with 10 Mbps communication speed to the server. Each client is assigned one resource profile at the beginning of the training, and the profile can be changed during the training process to simulate the dynamic environment.

Model Architecture. DTFL is a versatile approach suitable for training a wide range of neural network models (e.g., Multilayer Perceptrons (MLP), Recurrent Neural Networks (RNN), and CNNs), particularly benefiting large-scale models. In the experiments, we evaluate large CNN models, ResNet-56 and ResNet-110 [16], that work well on the selected datasets. Furthermore, DTFL can also be applied to large language models (LLMs) like BERT [38] via splitting techniques as proposed in FedBERT [39] and FedSplitBERT [40]. For each tier, we split the global model to create client- and server-side models. This split layer varies across tiers, and it progressively moves toward the last layer as the tier increases. For each client-side model, we add a fully connected (FC) and an average pooling (avgpool) layer as the auxiliary network. More details and alternative auxiliary network architectures can be found in the supplementary material. For the FedGKT implementation, we follow the same settings as in FedGKT [6]. We split the global model after module 2 (as defined in the supplementary material) for the SplitFed model.

B. Training Time Improvement of DTFL

Training time comparison of DTFL to baselines. In Table III, we summarize all experimental results of training a global model (i.e., ResNet-56 or ResNet-110) with 7 tiers (i.e., M = 7) when using different FL methods. The experiments are conducted on a heterogeneous client population, with 20% of the clients assigned to each profile at the outset. Every 50 rounds, the client profiles (i.e., number of simulated CPUs and communication speed) of 30% of the clients are randomly changed to simulate a dynamic environment, while all clients participate in every training round. The training time of each method to achieve the target accuracy is provided in Table III. In all cases, for both I.I.D. and non-I.I.D. settings, DTFL reduces the training time at a far higher rate than the baselines (FedAvg, SplitFed, FedYogi, FedGKT). For example, DTFL reduces the training time of FedAvg by 80% to reach the target accuracy on I.I.D. CIFAR-10 with ResNet-110. This experiment illustrates the capability of DTFL to significantly reduce training time when training on distributed heterogeneous clients. Figure 3 depicts the curve of the server test accuracy during the training process of all the methods for the I.I.D. CIFAR-10 case with ResNet-110, where we observe faster convergence using DTFL compared to the baselines.

Fig. 3: Comparing the training time curve (in seconds) of DTFL with baselines for various I.I.D. datasets. DTFL exhibits significantly faster convergence across all datasets.

TABLE III: Comparison of training time (in seconds) to baseline approaches with 10 clients on different datasets. The numbers represent the training time used to achieve the target accuracy (i.e., CIFAR-10 I.I.D. 80%, CIFAR-10 non-I.I.D. 70%, CIFAR-100 I.I.D. 55%, CIFAR-100 non-I.I.D. 50%, CINIC-10 I.I.D. 75%, CINIC-10 non-I.I.D. 65%, and HAM10000 75%).

Method    Global Model   CIFAR-10              CIFAR-100             CINIC-10               HAM10000
                         I.I.D.   non-I.I.D.   I.I.D.   non-I.I.D.   I.I.D.    non-I.I.D.
DTFL      ResNet-56       2750      3986        3585      6093        23968      40138        2353
          ResNet-110      4816      7054        5678      9874        42099      70469        3615
FedAvg    ResNet-56      13157     20773       19170     35350       114509     197926       11566
          ResNet-110     24471     39094       36360     66317       210468     395423       22328
SplitFed  ResNet-56      35877     46514       54174     97859       271873     510156       19549
          ResNet-110     67265     84342      101783    183122       521334     896627       43581
FedYogi   ResNet-56       9122     13130       12727     19216        82083     113464        8071
          ResNet-110     19299     25668       23978     35356       155212     219134       14932
FedGKT    ResNet-56      25458     30808       36838     59461       184589     218065       37181
          ResNet-110     39676     47458       64457     98754       321534     411259       61755
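To make the simulated heterogeneity concrete, the following sketch shows one way the client resource profiles described in the setup above could be assigned and periodically re-drawn. The profile values (CPU count, link speed) come from Section IV-A; the function names and the use of Python's random module are assumptions for this illustration, and the actual simulation code is available in the DTFL repository [37].

```python
import random

# (CPU cores, communication speed in Mbps): the five profiles from Sec. IV-A
RESOURCE_PROFILES = [(4, 100), (2, 30), (1, 30), (0.2, 30), (0.1, 10)]

def init_profiles(num_clients, seed=0):
    """Assign roughly 20% of the clients to each profile at the start of training."""
    rng = random.Random(seed)
    profiles = [RESOURCE_PROFILES[i % len(RESOURCE_PROFILES)] for i in range(num_clients)]
    rng.shuffle(profiles)
    return profiles

def perturb_profiles(profiles, fraction=0.3, seed=None):
    """Randomly re-draw the profile of `fraction` of the clients (done every 50 rounds
    in the experiments) to emulate a dynamic environment."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(profiles)), k=int(fraction * len(profiles)))
    for idx in chosen:
        profiles[idx] = rng.choice(RESOURCE_PROFILES)
    return profiles
```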


C. Understanding DTFL under Different Settings

Performance of DTFL with different numbers of clients. We evaluate the performance of DTFL with different numbers of clients to better understand the scalability of DTFL. Table IV shows the training time for various training methods using different numbers of clients on the I.I.D. CIFAR-10 dataset to reach a target accuracy of 80% with the ResNet-110 model. In these experiments, we randomly sample 10% of all clients to be involved in each round of the training process. Note that DTFL can also be employed with other FL client selection methods (e.g., TiFL [9], FedAT [10]). In general, increasing the number of clients has no adverse effect on DTFL performance and significantly reduces training time.

TABLE IV: DTFL exhibits superior performance on CIFAR-10 I.I.D., significantly reducing training time (in seconds) compared to baselines across all client sizes (20, 50, 100, 200).

# Clients   DTFL    FedAvg   SplitFed   FedYogi   FedGKT
20          1877     7950     21350      6341     14595
50          2547    10435     29026      8073     17872
100         3102    14032     36449     10760     24438
200         3594    16060     43942     12786     27632

Impact of the number of tiers on DTFL performance. We evaluate DTFL performance under different numbers of tiers while employing the global ResNet-110 model (model details under different tiers are provided in Table X in the supplementary material). In Figure 4, we present the total training time for the I.I.D. CIFAR-10 dataset and 10 clients with different numbers of tiers. We conducted experiments with two different cases, similar to those in Table I, where clients' CPU profiles randomly switch to another profile every 20 rounds of training within the profiles of the same case. Experiments show that, to reach the target accuracy of 80%, the training time generally decreases with the number of tiers, as DTFL has more flexibility to fine-tune the tier of each client based on its heterogeneous resources. It should be noted that the model under each tier needs to be carefully designed based on the structure of the global model. Arbitrarily splitting the global model might harm the model's accuracy. Thus, the optimal number of tiers is much less than the number of layers of a global model. For ResNet-110, we find that the 7 tiers provided in the supplementary material can significantly reduce the training time while maintaining the model accuracy.

Fig. 4: Impact of the number of tiers on total training time (in seconds) for two cases (similar to Table I). The figure shows a decreasing trend in training time as the number of tiers increases.

D. Privacy Discussion

Using DTFL, we can significantly reduce training time. However, exchanging hidden feature maps (i.e., the intermediate output z_i) may potentially leak privacy. A potential threat to DTFL is model inversion attacks, which extract client data by analyzing feature maps or model parameter transfers from clients to servers. Prior studies [41], [42] have shown that attackers need access to all model parameters or gradients to recover client data. This is not feasible with partial or fragmented models. Thus, similar to SplitFed [24], DTFL can use separate servers for model aggregation and training to prevent a single server from having access to all model parameters and intermediate data. Another potential threat to DTFL is that an attacker can infer client model parameters by inputting dummy data into the client's local model and training a replicating model on the resulting feature maps [43]. DTFL can prevent this attack by denying clients access to external datasets, query services, and dummy data, thereby preventing the attacker from obtaining the necessary data.

However, for attackers with strong eavesdropping capabilities, there may be potential privacy leakage. As DTFL is compatible with privacy-preserving FL approaches, existing data privacy protection methods can be easily integrated into DTFL to mitigate potential privacy leakage, e.g., distance correlation [44], differential privacy [45], patch shuffling [46], PixelDP [47], SplitGuard [48], and cryptographic techniques [49], [50]. For example, we can add a regularization term to the client's local training objective to reduce the mutual information between hidden feature maps and raw data [51], making it more difficult for attackers to reconstruct raw data.
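To illustrate the regularized client objective that is formalized just below (the NoPeek-style distance-correlation term [44]), the following sketch adds a sample-based (squared) distance-correlation penalty between a batch of raw inputs and the transmitted feature maps to the usual local loss. The helper follows the standard pairwise-distance definition of the statistic and is a simplified stand-in for the NoPeek formulation, not the paper's exact implementation; the function names and the default value of alpha are illustrative.

```python
import torch
import torch.nn as nn

def _centered_distance_matrix(x):
    """Pairwise Euclidean distances of flattened samples, double-centered."""
    x = x.flatten(1)
    d = torch.cdist(x, x, p=2)
    return d - d.mean(dim=0, keepdim=True) - d.mean(dim=1, keepdim=True) + d.mean()

def distance_correlation(x, z, eps=1e-9):
    """Squared sample distance correlation between a batch of inputs x and feature maps z."""
    a, b = _centered_distance_matrix(x), _centered_distance_matrix(z)
    dcov2_xz = (a * b).mean()
    dcov2_xx = (a * a).mean()
    dcov2_zz = (b * b).mean()
    return dcov2_xz / (torch.sqrt(dcov2_xx * dcov2_zz) + eps)

def private_client_loss(logits, y, x, z, alpha=0.25):
    """f_k^{c,private} = (1 - alpha) * local loss + alpha * DCor(x, z)."""
    task_loss = nn.CrossEntropyLoss()(logits, y)
    return (1 - alpha) * task_loss + alpha * distance_correlation(x, z)
```

In DTFL this penalized loss would simply replace the plain local loss in the client-side update of Section III-B; the server-side update is unchanged, and each client can pick its own alpha without informing the server.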


TABLE V: Integrating privacy protection into DTFL on the CIFAR-10 dataset using ResNet-56 with 20 clients. Distance correlation has minimal accuracy impact with smaller α; patch shuffling maintains similar accuracy.

Method      Distance Correlation (α)                          Patch Shuffling
            α = 0.00   α = 0.25   α = 0.50   α = 0.75
Accuracy      87.1       86.8       83.5       75.6               85.4

Each client decorrelates its input x_i and the related feature map z_i, i.e., f_k^{c,private}(w^{c_m}, w^{a_m}) = (1 − α) f_k^c(w^{c_m}, w^{a_m}) + α · DCor(x_i, z_i), where α balances the model performance and the data privacy, and DCor denotes the distance correlation defined in the NoPeek method [44]. Distance correlation enhances the privacy of DTFL against reconstruction attacks [44].

Integration of privacy protection methods. We evaluate the model accuracy and privacy trade-offs of DTFL when integrating distance correlation and patch shuffling techniques. Table V illustrates the model accuracy of DTFL with distance correlation, showing a decreasing trend as α increases. This suggests that integrating distance correlation can enhance data privacy without significant accuracy loss, especially for relatively small values of α. Notably, applying patch shuffling to the intermediate data with the same settings as those used by Yao et al. [46] has minimal impact on accuracy. The server lacks information about the clients' α values, which can vary between clients. This prevents the server from inferring the clients' data.

V. CONCLUSION

In this paper, we developed DTFL as an effective solution to address the challenges of training large models collaboratively in a heterogeneous environment. DTFL offloads different portions of the global model to clients in different tiers and allows each client to update the models in parallel using local-loss-based training, which can meet the computation and communication requirements of resource-constrained devices and mitigate the straggler problem. We crafted a dynamic tier scheduling algorithm that dynamically assigns clients to optimal tiers based on their training time. The convergence of DTFL is analyzed theoretically. Extensive experiments on large datasets with different numbers of highly heterogeneous clients show that DTFL effectively decreases training time while maintaining model accuracy compared to state-of-the-art FL methods. Furthermore, DTFL efficiently integrates privacy protection measures without compromising its performance.

Future research directions include refining the dynamic tier scheduling algorithm by incorporating predictive models for enhanced resource allocation and developing advanced privacy-preserving techniques for the feature maps exchanged during split learning.

REFERENCES

[1] C. Yang, Q. Wang, M. Xu, Z. Chen, K. Bian, Y. Liu, and X. Liu, "Characterizing impacts of heterogeneity in federated learning upon large-scale smartphone data," in Proceedings of the Web Conference 2021, 2021, pp. 935–946.
[2] A. M. Abdelmoniem, C.-Y. Ho, P. Papageorgiou, and M. Canini, "A comprehensive empirical study of heterogeneity in federated learning," IEEE Internet of Things Journal, 2023.
[3] O. Gupta and R. Raskar, "Distributed learning of deep neural network over multiple agents," Journal of Network and Computer Applications, vol. 116, pp. 1–8, 2018.
[4] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, "Split learning for health: Distributed deep learning without sharing raw patient data," arXiv preprint arXiv:1812.00564, 2018.
[5] Y. Liao, Y. Xu, H. Xu, Z. Yao, L. Wang, and C. Qiao, "Accelerating federated learning with data and model parallelism in edge computing," IEEE/ACM Transactions on Networking, 2023.
[6] C. He, M. Annavaram, and S. Avestimehr, "Group knowledge transfer: Federated learning of large CNNs at the edge," Advances in Neural Information Processing Systems, vol. 33, pp. 14068–14080, 2020.
[7] Y. J. Cho, J. Wang, T. Chirvolu, and G. Joshi, "Communication-efficient and model-heterogeneous personalized federated learning via clustered knowledge transfer," IEEE Journal of Selected Topics in Signal Processing, vol. 17, no. 1, pp. 234–247, 2023.
[8] D.-J. Han, H. I. Bhatti, J. Lee, and J. Moon, "Accelerating federated learning with split learning on locally generated losses," in ICML 2021 Workshop on Federated Learning for User Privacy and Data Confidentiality. ICML Board, 2021.
[9] Z. Chai, A. Ali, S. Zawad, S. Truex, A. Anwar, N. Baracaldo, Y. Zhou, H. Ludwig, F. Yan, and Y. Cheng, "TiFL: A tier-based federated learning system," in Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 2020, pp. 125–136.
[10] Z. Chai, Y. Chen, A. Anwar, L. Zhao, Y. Cheng, and H. Rangwala, "FedAT: A high-performance and communication-efficient federated learning system with asynchronous tiers," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–16.
[11] A. Nøkland and L. H. Eidnes, "Training neural networks with local error signals," in International Conference on Machine Learning. PMLR, 2019, pp. 4839–4850.
[12] E. Belilovsky, M. Eickenberg, and E. Oyallon, "Decoupled greedy learning of CNNs," in International Conference on Machine Learning. PMLR, 2020, pp. 736–745.
[13] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," in International Conference on Learning Representations, 2019.
[14] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, "FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization," in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2021–2031.
[15] Z. Huo, B. Gu, H. Huang et al., "Decoupled parallel backpropagation with convergence guarantee," in International Conference on Machine Learning. PMLR, 2018, pp. 2098–2106.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," 2009.
[18] L. N. Darlow, E. J. Crowley, A. Antoniou, and A. J. Storkey, "CINIC-10 is not ImageNet or CIFAR-10," arXiv preprint arXiv:1810.03505, 2018.
[19] P. Tschandl, C. Rosendahl, and H. Kittler, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions," Scientific Data, vol. 5, no. 1, pp. 1–9, 2018.
[20] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., "Advances and open problems in federated learning," Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
[21] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan et al., "Towards federated learning at scale: System design," Proceedings of Machine Learning and Systems, vol. 1, pp. 374–388, 2019.
[22] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020.
[23] J. Wu, F. Dong, H. Leung, Z. Zhu, J. Zhou, and S. Drew, "Topology-aware federated learning in edge computing: A comprehensive survey," ACM Computing Surveys, vol. 56, no. 10, pp. 1–41, 2024.
[24] C. Thapa, P. C. M. Arachchige, S. Camtepe, and L. Sun, "SplitFed: When federated learning meets split learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8485–8493.


[25] W. Wu, M. Li, K. Qu, C. Zhou, X. Shen, W. Zhuang, X. Li, and W. Shi, "Split learning over wireless networks: Parallel design and resource management," IEEE Journal on Selected Areas in Communications, vol. 41, no. 4, pp. 1051–1066, 2023.
[26] Z. Zhang, A. Pinto, V. Turina, F. Esposito, and I. Matta, "Privacy and efficiency of communications in federated split learning," IEEE Transactions on Big Data, 2023.
[27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
[28] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, "Tackling the objective inconsistency problem in heterogeneous federated optimization," Advances in Neural Information Processing Systems, vol. 33, pp. 7611–7623, 2020.
[29] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, "Adaptive federated optimization," arXiv preprint arXiv:2003.00295, 2020.
[30] M. Laskin, L. Metz, S. Nabarro, M. Saroufim, B. Noune, C. Luschi, J. Sohl-Dickstein, and P. Abbeel, "Parallel training of deep networks with local updates," arXiv preprint arXiv:2012.03837, 2020.
[31] S. U. Stich, "Local SGD converges fast and communicates little," arXiv preprint arXiv:1805.09767, 2018.
[32] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 5693–5700.
[33] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, "SCAFFOLD: Stochastic controlled averaging for federated learning," in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.
[34] Q. Li, Y. Diao, Q. Chen, and B. He, "Federated learning on non-IID data silos: An experimental study," in 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2022, pp. 965–978.
[35] C. He, S. Li, J. So, X. Zeng, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu et al., "FedML: A research library and benchmark for federated machine learning," arXiv preprint arXiv:2007.13518, 2020.
[36] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni, "Federated learning with matched averaging," in International Conference on Learning Representations, 2020.
[37] S. M. Sajjadi Mohammadabadi, "Dynamic tiering-based federated learning (DTFL)," https://github.com/mahmoudsajjadi/DTFL, accessed on 1 Sep 2024.
[38] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[39] Y. Tian, Y. Wan, L. Lyu, D. Yao, H. Jin, and L. Sun, "FedBERT: When federated learning meets pre-training," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–26, 2022.
[40] Z. Lit, S. Sit, J. Wang, and J. Xiao, "Federated split BERT for heterogeneous text classification," in 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022, pp. 1–8.
[41] H. Yin, A. Mallya, A. Vahdat, J. M. Alvarez, J. Kautz, and P. Molchanov, "See through gradients: Image batch recovery via GradInversion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16337–16346.
[42] L. Zhu, Z. Liu, and S. Han, "Deep leakage from gradients," Advances in Neural Information Processing Systems, vol. 32, 2019.
[43] J. Shen, N. Cheng, X. Wang, F. Lyu, W. Xu, Z. Liu, K. Aldubaikhy, and X. Shen, "RingSFL: An adaptive split federated learning towards taming client heterogeneity," IEEE Transactions on Mobile Computing, 2023.
[44] P. Vepakomma, A. Singh, O. Gupta, and R. Raskar, "NoPeek: Information leakage reduction to share activations in distributed deep learning," in 2020 International Conference on Data Mining Workshops (ICDMW). IEEE, 2020, pp. 933–942.
[45] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
[46] D. Yao, L. Xiang, H. Xu, H. Ye, and Y. Chen, "Privacy-preserving split learning via patch shuffling over transformers," in 2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022, pp. 638–647.
[47] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana, "Certified robustness to adversarial examples with differential privacy," in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 656–672.
[48] E. Erdogan, A. Küpçü, and A. E. Cicek, "SplitGuard: Detecting and mitigating training-hijacking attacks in split learning," in Proceedings of the 21st Workshop on Privacy in the Electronic Society, 2022, pp. 125–137.
[49] H. U. Sami and B. Güler, "Secure aggregation for clustered federated learning," in 2023 IEEE International Symposium on Information Theory (ISIT). IEEE, 2023, pp. 186–191.
[50] X. Qiu, H. Pan, W. Zhao, C. Ma, P. P. Gusmao, and N. D. Lane, "vFedSec: Efficient secure aggregation for vertical federated learning via secure layer," arXiv preprint arXiv:2305.16794, 2023.
[51] T. Wang, Y. Zhang, and R. Jia, "Improving robustness to model inversion attacks via mutual information regularization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 13, 2021, pp. 11666–11673.
[52] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar, "LEAF: A benchmark for federated settings," arXiv preprint arXiv:1812.01097, 2018.
