Clustering_Enhanced_Reinforcement_Learning_for_Adaptive_Offloading_in_Resource_Constrained_Devices (2)
Clustering_Enhanced_Reinforcement_Learning_for_Adaptive_Offloading_in_Resource_Constrained_Devices (2)
Abstract—The Federated Edge Artificial Intelligence (Edge occurs on the client side, and a separation layer breaks down
AI) deploys AI applications on Internet-of-Things (IoT) devices, the data for transmission to the server. The server then trains
addressing data privacy concerns in the real world. To achieve the remaining DNN layers using the decomposed data, sending
effective Federated Learning (FL), three challenges must be
addressed: i) limited computing power on devices, ii) non-uniform back the corrupted data gradient to devices through the sepa-
impacts on devices, and iii) adaptability to changing network ration layer. Devices use these gradients for backpropagation
conditions. This study introduces a new algorithm called Adaptive on the rest of the DNN. SL significantly reduces computation
Offloading Point (AOP), designed to accelerate local training on compared to FL by having fewer layers on the device side,
constrained devices. It decomposes deep neural network (DNN) where the entire DNN is trained.
layer blocks, enabling training on both client and server sides.
The novelty of the proposed method lies in using a reinforcement While SL benefits collaborators on specific devices, its
learning-based Gaussian mixture model (GMM) clustering to dy- practicality is limited due to time-intensive sequential training
namically determine DNN layer offloading, addressing nullability, on all devices. This limitation led to the development of
computation uniformity, training time, and network bandwidth split federated learning (SFL) [7], aiming for parallel FL and
variation issues. Experimental results on real devices, using vision accelerated device training within the SL algorithm. However,
transformer (ViT) models with an identification image dataset,
show that AOP’s training time is significantly faster than that of current SFL-based research needs more consideration for
previous baseline methods. optimal partitioning strategies and often requires hardware
Index Terms—Edge AI, Federated learning, Adaptive offload- configuration data for model partitioning before manual train-
ing point, Reinforcement learning, Vision transformer. ing. Furthermore, a static partitioning strategy may become
suboptimal as operating conditions fluctuate during training.
I. I NTRODUCTION Here, we list some of the challenges FL, SL are facing that
Federated learning (FL) has gained attention as a privacy- need to be addressed to be effective in edge AI applications:
preserving distributed learning technique and has recently 1) Data and Model Size Impact on Training Efficiency:
become popular [1]. In FL, machine learning (ML) models faces challenges with large DNN models on resource-
like deep neural networks (DNNs) run on IoT devices, known constrained IoT devices like Raspberry Pis or NVIDIA
for their compact and flexible nature (e.g., NVIDIA Jetson or Jetsons [8]. For instance, the lightweight MobileNetV1
Raspberry Pi) [2]. FL allows DNN model training on devices model (3.2 million parameters) takes over 8 hours for
without transmitting raw data to the server, instead sending in- one round and one epoch of training [8]. This study
termediate models. Despite its benefits, FL faces challenges in proposes SplitNN as an efficient alternative, demonstrat-
computational costs and on-device storage, hindering practical ing a significant reduction in training time to 2.5 hours
applications [3]. for one epoch on five Raspberry Pis. While opting for
Implementing FL in edge AI on resource-constrained de- lightweight models is a temporary solution, the study
vices is challenging due to DNN model complexity and highlights the importance of a sustainable, long-term
numerous parameters, often in the millions or hundreds of approach, advocating for model-sharing methods where
millions [4]. To address this, the concept of partitioning and training occurs partially on the client and partly on the
offloading the DNN model at the client edge has emerged. server.
The goal is to ease computational burdens and distribute tasks 2) Heterogeneous Devices Impacting System Operation: FL
among edge clients. For instance, DNN performance in this faces complications due to heterogeneous computing ca-
context has been explored [5], and split learning (SL) [6] is pabilities, architectures, sequential round-robin process-
an ML technique leveraging this concept. ing, and varying network conditions among IoT devices
In SL [6], the DNN model is split into head and tail [9]. The fusion server has to wait for all devices to
segments, located on the server and client sides. Training complete training, impacting model accuracy as each de-
vice contributes differently to the training process [10]. 1) Develop an adaptive offloading point (AOP) algorithm
Addressing this challenge requires finding a solution to generate optimal strategies for device offloading.
that ensures effective coordination and communication Minimize the impact of computational heterogeneity,
among diverse IoT devices during the FL process. ensuring efficiency for diverse devices.
3) Networking Infrastructure and Device Disparities: Dis- 2) Applying reinforcement learning (RL) [15] based on the
parities in IoT device configurations, processing speeds, GMM clustering algorithm offers outstanding efficiency
and network conditions pose challenges in FL system when compared to K-means in FedAdapt. This inno-
deployment for edge AI applications [11]. Mitigating vative approach facilitates faster identification of action
these challenges involves developing robust strategies points, mitigates excessive randomness during training,
to adapt to varying network speeds and geographical and accelerates the training process on devices.
considerations, ensuring stability and effectiveness in 3) Experimental results were conducted on five Raspberry
diverse edge AI applications. Pis 4 and one PC-desktop as an edge client, with
various parameter configurations applied to both VIT
SL integration with VIT models is gaining traction, pri-
and deepVIT transformer models [16], [17], using in-
marily explored on standard image classification datasets [12].
cident image datasets. The results demonstrate that the
However, practical applications and real-world testing on edge
AOP method is 7% faster in training time compared to
devices are limited. Despite SL’s potential, its application
FedAdapt and exhibits a 26% improvement compared
in practice is crucial, given the challenges posed by lower
to classical FL with the VIT model. Similarly, with
configurations on edge devices, emphasizing the ongoing
the deepVIT model, the AOP method achieves a 5.6%
evolution of SL applications in addressing constraints in edge
improvement compared to FedAdapt and a notable
environments.
25.6% improvement compared to classic FL.
The authors in [13] introduced the FedAdapt algorithm,
The rest of this paper is organized as follows: Section II
which combines reinforcement learning-based optimization
introduces the concepts of AOP proposed algorithm. Section III
and K-means clustering [14]. This algorithm dynamically allo-
presents the optimization training for AOP. Section IV offers
cates DNN classes for server offloading, effectively address-
the evaluation performance, and the last section, V, concludes
ing challenges of computational heterogeneity and network
the paper and highlights directions for future research.
bandwidth (BW) variability. Evaluation results indicate that
FedAdapt reduces training time by over 40% compared II. T HE AOP PROPOSED ALGORITHM
to classical FL. However, it uses K-means clustering with This section describes the design and implementation of our
limitations, identifying clusters based on training time per system optimized for real-world FL.
interaction (TTPi) values. The maximum value within each
cluster serves as the agent input (TTPi and FLOP). The output A. System overview
action is initially randomized using the multivariate typical The overarching context centers on employing GMM and
technique and iteratively refined through reward evaluation RL in adaptive settings, specifically for optimizing offloading
scoring, introducing inherent randomness that may slow con- strategies in FL scenarios. The procedure encompasses pre-
vergence and lead to sub-optimal action selection. processing, clustering, RL-based decision-making, and post-
In this study, we present the AOP algorithm, a novel processing to improve the efficiency of training models across
approach detailed in Section II. This algorithm not only tackles distributed devices, delineated in steps 1 to 4 and illustrated
the three challenges mentioned earlier but also expedites in Fig. 1.
FL, alleviates the impact of computational heterogeneity, and 1) FL Round and Pre-processing: Initialize a complete
adapts to varying network BW. Our proposed method employs FL process. After completing the FL round (round t
the GMM approach [14], seeking to identify clusters based on - 1), the pre-processing stage collects observations on
the combination of TTPi, BW, and FLOP values. The agent the state of devices, including computational capabili-
input (TTPi, BW, FLOP) represents the average values of the ties and network bandwidth between each device and
clusters. The output action is determined using a combination servers. Normalization of training time per iteration is
of multivariate normal and max-min scaling techniques. By performed during this stage.
comparing the means of central clusters, a scaling factor is 2) Clustering Module: Employ the clustering module to
obtained, adjusting the action. This process accelerates the group devices with similar training times based on com-
identification of suitable actions, ultimately enhancing the putational homogeneity and network bandwidth. This
reliability of reward scoring. In comparison to K-means, GMM clustering is crucial for RL to determine offloading
significantly reduces training time by constraining random strategies for each cluster.
amplitudes, preventing excessively large or time-consuming 3) Trained RL Agent and Offloading Decision: Use the
actions during training. This is particularly advantageous for trained RL Agent, equipped with group information and
weaker devices, which may otherwise struggle to handle large observations (referred to as state), to generate offloading
OP scores in a limited number of iterations. decisions (referred to as actions) for each device cluster.
This study makes the following contributions: This process involves a fully connected neural network,
for the device and the server is determined by the proportion
of the workload executed, {µct Wtc } and {1 − µct Wtc }, which
are offloaded to the client and server, respectively. Let S(µct )
denote the size of the feature maps transferred between the
device and server during the training of round t. S(µct )
depends on µct as the offloading strategy determines the size
of the transferred feature map. Finally, the training time for
device c in round t can be calculated as follows:
µct Wtc (1 − µct )Wtc S(µct )
Ttc = + + , (1)
Ctc Cts BWtc
where {µct Wtc /Ctc } and {1−µct Wtc /Cts } represent the train-
ing time on the device and server, respectively. {S(µct )/BWtc }
is the communication time during training. In round t, Wtc ,
Fig. 1: Automated splitting point determine algorithm for Cts , Ctc , and BWtc are either constants or variables that are
client-server model. not controlled by AOP. Since the offloading strategy for each
device µct is an offloading point and is controlled by AOP,
our optimization target is defined as minimizing the average
and the training details are covered in a dedicated training time for C devices in a round to reduce the overall
section. training time effectively.
4) Post-processing and Offloading Strategy: In the post- The optimization problem described in Eq. 2 is formulated
processing stage, apply the output of the trained RL as follows:
agent to implement offloading decisions for devices
within each cluster. All devices in a cluster execute the
C
same offloading strategy, determining the layers of the 1 X c
min T
DNN model residing on each device for the current FL c
µt C c=1 t
round (round t). (2)
µc W c (1 − µct )Wtc S(µct )
subject to Ttc = t ct + s + ,
B. Model architecture Ct Ct BWtc
To maximize the efficiency of each transformer model, we Here, the objective
Pis to minimize
the average training time
1 C c
employ a strategy of dividing them into smaller, evenly dis- for C devices C k=1 Tt in round t. The training time
tributed sub-models, layers, or blocks, referred to as offloading Ttc for each device is subject to constraints that consider
points (OPs). For example, we partition a VIT model into the offloading strategy, device capacities (Ctc and Cts ), and
9 OPs and a deepVIT model into 10 OPs, as illustrated in network bandwidth (BWtc ).
Table I. Importantly, this division is carried out meticulously While the optimization seeks to minimize the overall train-
to ensure a balanced selection of OPs without unnecessary re- ing time for all devices, AOP goes beyond optimizing the
dundancy. This precise approach aims to maintain equilibrium maximum training time, which may be constrained by strag-
in OP selection, thereby promoting optimal performance. gler devices. It also aims to reduce the training time for each
individual device. This approach ensures a more balanced
C. Problem formulation
distribution of the computation load across devices. Therefore,
In particular, the network BW between the device and the the defined objective in AOP is to minimize the average
server may vary across different FL rounds. To address this training time for C devices, contributing to the overarching
variability, we observe the bandwidth from the previous FL goal of reducing the total training time over all FL rounds.
round to formulate a load reduction strategy. Unlike a previous The optimization is carried out for µt ∈ [0, 1], representing
study mentioned in [11], where the goal was solely to reduce the offloading point in each round, accounting for variable
the load over the overall training period, our approach goes operational conditions such as Ctc , Cts , and BWtc .
beyond that. Our objective is to diminish the training time
per communication round by establishing consistent offload III. R EINFORCEMENT LEARNING AGENT FOR TRAINING
strategies for all devices and adapting to observed network METHOD IN AOP
changes. This section outlines the training process of the RL agent
In the context of the AOP algorithm, FL training is con- used in AOP to achieve the objectives of the GMM clustering
ducted on each of the C devices, each with a training workload method. RL, a powerful ML technique, is a sequential decision
Wtc per round t. In this FL task, server s has a training optimization method in various applications. Its primary func-
capacity Cts , the participating client c has a training capacity tion within AOP is to autonomously generate a well-considered
Ctc , and the network bandwidth between the device and the load reduction strategy for participating devices, ultimately
server is denoted as BWtc at round t. The offloading strategy maximizing rewards, i.e., the training time on AOP.
TABLE I: FLOPs, trainable parameters of offloading point (OP) for VIT/deepVIT transformer models.
Model VIT deepVIT
Trainable Parameters 7.750M/10.484M 3.878M/6.617M
FLOPs OP = [0.78M, e-6M, 1.1M, 1.1M, 1.1M, 1.1M, 1.1M, OP = [0.39M, e-6M, 0.55M, 0.55M, 0.55M, 0.55M,
1.1M, e-6M] 0.55M, 0..55M, e-6M, 0.128M]
FLOPs (%) OP = [10.5%, ∼ 0%, 14.9%, 14.9%, 14.9%, 14.9%, OP = [10.2%, ∼ 0% , 14.4%, 14.4%, 14.4%, 14.4%,
14.9%, 14.9%,∼ 0%] 14.4%, 14.4%,∼ 0%, 3.3%]
Offloading points 9 10
A. Optimizing for multiple tasks workload that the device performs. Once the f lopN n value
The offloading strategy within the AOP algorithm needs is obtained, the number of FLOPs is calculated and set as
to be adaptable to changing operating conditions. The the target workload on the device. The objective is to select
GMM approach aims to categorize devices and servers an operational workload closest to the target workload. Eq.
into buckets based on a combination of parameters X = 5 defines how the input state and the output action are
{T T P i, BW, F LOP }. This categorization is crucial for opti- represented at each time step T .
mizing the performance of AOP. While using the RL method,
it is essential to generate different output actions in response St = {T T P int , BWtn , F LOPt−1
n
}N
n=1 ,
to changes in multiple tasks. However, in practical scenar- At = {F LOPtn }N (5)
n=1 ,
ios, it has been observed that this approach may lead to
suboptimal offloading, especially when network bandwidth is subject to F LOPtn ∈ {0, 1}
not constrained. To address this challenge, the agent’s input C. Reward function
(comprising TTPi, BW, FLOP) represents the mean µK values
The reward function plays a crucial role in training the
of the N th groups. The output action is determined through
RL agent. The reward result after each training round in
a combination of the standard multivariate technique and rate
the FL system is denoted as Rt in Eq. 6. To ensure the
maximization. A scaling factor π is obtained by comparing
realization of the objective in Eq. 5, one alternative is to set
the parameters of the central groups, and this factor is used
the reward value as the average training time. Due to the
to adjust the action in the heterogeneous groups. This process
varying configurations of devices, the average training time
accelerates the identification of appropriate actions, ultimately
is referenced based on each device completing the base model
enhancing the reliability of the reward point. The RL agent’s
training entirely, denoted as B N or the whole backbone. The
training occurs in a controlled environment where the network
training time of each group generates an average reward score
bandwidth between the device and the pre-processing server
for the OFF sampling communication round.
automatically adjusts to represent the mean of the group
through the expectation-maximization (EM) algorithm. The N
1 X N
EM algorithm consists of two steps: the E-step and the M- Rt = B − T T P iN
t , (6)
step. N n=1
E-step: For each i, j set We chose Proximal Policy Optimization (PPO) for this
πk N (xn |µk , σk ) study due to its user-friendly nature and strong performance
γ(znk ) = K (3) in standard RL tasks. PPO belongs to the class of non-policy
Σj=1 πj N (xn |µj , σj
gradient RL algorithms, which efficiently leverage trajectory
M-step: Update the parameters data from past interactions. This makes PPO our primary focus
ΣN for training in this study.
n=1 γ(znk )xn
µk = ΣN γ(znk )
n=1
ΣN γ(z )(x −µ )(xn −µk )T (4) D. RL training methodology
Σk = n=1 nk ΣnN k γ(znk )
n=1
The RL task involves two networks, the actor and critic,
πk = N1 ΣN
n=1 γ(znk )
sharing a three-layer architecture. During RL training, the
B. State and action critic network assists the actor network. After training, only
The input states for the RL agent consist of the average the actor network develops the offload action. Ideally, the RL
training time and local bandwidth of devices within a group. model should be trained online during the FL task. However,
In each training round, the RL’s task is to generate each online training during FL requires waiting for each FL training
group’s corresponding load reduction action, denoted as yn . round to complete for reward calculation. To address this,
The action of each group is then processed to map to the we opt for offline RL model training before FL tasks. To
model of all devices in the group, resulting in the action expedite RL training, we reduce batches for each round,
denoted Ât . The output action for each group is designed as called ”truncated FL bullets.” Instead of using round time for
an absolute value (f lopN
n ) ranging from 0 to 1, allowing the input/output, we gather batch training times per device. The
RL agent to adapt to various backbone models. This value FL model is retrained with regular loops, excluding truncated
is mapped to a percentage of the total model computational ones if offload time surpasses the limit or training shows
Algorithm 1: Condition workload with EM algorithm Algorithm 2: AOP algorithm
for GMM and PPO 1 Requirement: |ω| = ωh,i + ωt ; ωh,i is a head model,
1 Input: a given data X = {x1 , x2 , . . . , xn }; and ωt is a tail model of OP model M = Mh,i + Mt,i
π = {π1 , π2 , . . . , πk } 2 η: learning rate
2 Output: µ = {µ1 , µ2 , . . . , µk }; Σ = {Σ1 , Σ2 , . . . , Σk } 3 unicast initialize - ωh,0 of basic training to full
3 for each i do offloading OPmax
4 x(ttpi, bw, flops) 4 for per i ∈ {R, . . . , R − 1} do
5 Randomly initial π, µ, Σ 5 to c-th client from Round Robin scheduling i for c
6 for t ∈ T do 6 /*run train on clients c ∈ K */
7 /*E-Step 7 generates smashed data sc,i by passing input data
8 for n ∈ N do xc,i through ωc,i
9 for k ∈ K do 8 produces sc,i , yc,i applying Mt,i model forward to
10 γ(znk ) from Eq. 3 the server ▷
11 end for 9 /*run train on server */
12 yn = argmax γ(znk ) 10 produces sc,i , yc,i via sh,i , yc,i continues training
k
13 end for for all c at i
14 /*M-Step 11 produces sc,i , yc,i applying Mh,i model from the
15 for k ∈ K do server ▷
16 for k ∈ K do 12 generates loss ΣR i Li by passing xc,i , yc,i through
17 µk , Σk , πk from Eq. 4 ωt in parallel
18 end for 13 updates |ω| via ωt ← η · ▽ωt (ΣR i Li ) ▷
19 µ mixture component means use to state of 14 backward c − th model gradh,i with cut-layer
environment for policy each i gradient for all c each at i ▷
20 µπ scale weight of mixture components 15 /*run update on server */
21 γ prediction is conducted by group label 16 send c − th model M with average weight
1 K
for client each i M Σc |ω| ▷
22 end for 17 if frequency f ∈ R then
23 end for 18 the actor and critic train of RL finds to OP
24 minibach size Mf for policy action for all c workload from Alg. 1
25 end for 19 end if
26 for k ∈ K do 20 /*run update on clients c ∈ K */
27 Input: initial policy parameters θ0 clipping 21 updates next round weight ωc,i+1 via Mc,h,i ← M
threshold ϵ = 0.2 with cut-layer gradient for all c ▷
28 for iteraction 1, 2, . . . , K do 22 end for
29 for actor 1, 2, . . . , N do
30 Run policy πold in environment for T
timesteps IV. P ERFORMANCE EVALUATION
31 Compute advantage estimates Â1 , . . . , ÂT
In this section, we assess the performance of the proposed
with condition GMM scale
32 end for method, AOP, using incident datasets. Initially, we elaborate on
33 Lt (θ) = the parameter settings for our experiments, encompassing the
min(rt (θ)Ât , clip(rt (θ)), 1 − ϵ, 1 + ϵ)Ât physical clients, dataset, and models. Subsequently, we eval-
34 Optimize surrogate L wrt. θ, with K epochs uate and discuss the performance of our proposed method in
and minibatch size Mf < N T comparison with classical FL and the FedAdapt algorithm,
35 θold ← θ focusing on aspects such as training time and offloading points
36 end for on each client.
37 OP action for all c
A. Simulation settings
38 end for
1) Heterogeneous edge clients: In this section, we examine
diverse heterogeneous devices with varying computing, mem-
ory, and communication capabilities. The ensemble comprises
low standard deviation in the offload action’s multivariable a PC desktop serving as the server (Intel i9-9900K CPU
Gaussian function. In training, parameters are set as follows: 3.60GHz, RAM 64GB) and six edge clients dispersed across
RL Agent discount factor is 0.9 for future state rewards; PPO different locations and distances, including five (client1 to
clip parameter is 0.2; actor and critic network learning rates client5) Raspberry Pis (ARM Cortex, RAM 8GB) and a
are 0.001; update policy for epochs is 50. PC laptop (client6) (Intel i7-CPU 1.10GHz, RAM 16GB).
(a) Animals (b) Collapse (c) Crash (d) Fire (e) Flooding (f) Landslide (g) Snow (h) Treefall
Fig. 2: Incident datasets with eight classes.
To ensure a comprehensive evaluation, we allocate different It is easy to see the following: For VIT, the training time is
bandwidths to devices with varying speeds. For example, in much faster than the two baseline methods on each commu-
client1, client2, we limit the client’s data transfer rate to a mere nication and all rounds. Applying AOP will train a part of the
10Mbps. In contrast, in other devices as client3 to client6, we VIT model at the client and the rest at the server faster than
denoted Inf , which means the speed is unspecified, indicat- classical FL. Furthermore, compared with FedAdapt, our
ing an arbitrary speed dependent on the network bandwidth results show that our AOP is much faster than an outstanding
environment. method using the offloading point technique used in Fig. 3a.
2) Dataset: This section have chosen the incident image Next, we analyze the accuracy. VIT’s accuracy is much higher
dataset for simulation to assess the efficacy of our novel than classical FL and FedAdapt due to faster convergence
offloading point concept. This dataset, previously employed time, see in Fig. 3b. Second, we compare total time training,
in numerous studies [18], comprises various street incident for deepVIT-AOP is still much faster than classical FL and
scenarios categorized into eight classes, as depicted in Fig. 2. slightly faster than FedAdapt. So far, we considered the
These images serve as a semantically rich resource for training testing average accuracy in our results, and it has higher
in the image dataset with image size is 128 × 128. Each class accuracy than classical FL and lower than FedAdapt. After
is anticipated to encompass hundreds of images, providing the training process is completed, the total training time of
a diverse spectrum and capturing commonplace objects and classical FL, VIT FedAdapt, and VIT AOP is 26h33m4s,
surroundings whenever feasible. 21h41m24s, and 19h59m13s, respectively. Thus, VIT AOP is
3) Models and baselines: We chose the transformer model, 7% faster than VIT FedAdapt and 26% faster than classical
specifically the VIT and deepVIT, for evaluating our incident FL.
image dataset due to their inherent parallel processing capa- Similarly, in Fig. 3c deepVIT AOP has a total training time
bilities in the encoder phase. This study focuses solely on of 12h35m12s, while deepVIT FedAdapt is 13h35m22s, and
image datasets, assessing the model’s reduction in training classical FL is 16h58m49s. So deepVIT AOP is 5.6% faster
time while maintaining accuracy. Detailed specifications for than deepVIT FedAdapt and 25.6% faster than classical FL,
the chosen models are provided in Section II and Tab. I. For a while Fig. 3d shows the deepVIT with the highest accuracy.
comprehensive comparison, classical FL and the FedAdapt Finally, in the AOP technique, training time is the most
baseline methods are included, considering aspects such as important when using offloading points to find bottlenecks
accuracy, total training time on devices and individual devices, based on bandwidth to split part of the deep learning model
and optimal offloading point selection. The dataset is randomly to client and server devices. Choosing the suitable model is
split into train/valid/test sets (7 : 2 : 1). All models and more critical because transformer models will be the deciding
baselines are implemented in Python 3.10.9 using the PyTorch factor in results such as training time, accuracy, etc.
library. Experiments are conducted on the server’s NVIDIA
GTX 1080Ti GPU and only CPU for clients. We use the C. Performance evaluation for time training on actual each
SGD optimizer with a learning rate of 0.1, a batch size of edge clients
32, and each client has a batch size of 5. Simulations involve
In this Section, we evaluate the training time on each device
500 communication rounds on each task, with each round
edge client (here, five Raspberry Pis 4 and one PC laptop)
comprising 1 epoch of local training. It’s crucial to note that,
installed in different locations. Consider Fig. 4a, in the case
for a fair comparison, all methods are trained in the same
of VIT-AOP, we see that the client6 is the PC laptop. With such
environment, ensuring identical simulation settings.
a powerful edge device, superior to Raspberry Pi 4, training is
faster, which is obvious. However, there are two Raspberry PI
B. Performance evaluation for total time training and average
as client1, client2. We must see if the retraining time exceeds
accuracy on actual edge clients
three Raspberry Pi edge devices. The reason is straightforward:
To assess the effectiveness of the AOP approach with the we assume/set the bandwidth limit to only be a maximum of
VIT and deepVIT transformer models, we compare it with two 10Mbps, so it is reasonable that the training time on these two
baseline methods: classical FL and FedAdapt. Analyzing the devices is longer than on other devices. That shows that the use
training time spent on each communication round using Algs. of AI applications on edge devices is the future trend, and more
1 and 2 (see Fig. 3), we observe the following: importantly, the devices need to have a robust configuration.
(a) (b)
(c) (d)
Fig. 3: The total training time (s) and average test accuracy (%) for each communication round are shown for classical FL,
FedAdapt, and AOP on two models: VIT (in figures (a) and (b)) and deepVIT (in figures (c) and (d)).
.
(a) (b)
Fig. 4: The training time with AOP algorithm on each client is depicted for two models: VIT (in figure (a)) and deepVIT (in
figure (b)) models.
(a) (b)
Fig. 5: The action mean with AOP algorithm on each client using two VIT (in figure (a)) and deepVIT (in figure (b)) models.
The bandwidth between the client and the server needs to be R EFERENCES
strong enough, and, most importantly, there must be an optimal [1] Q. Li et al., ”A Survey on Federated Learning Systems: Vision, Hype
solution to share tasks in edge computing, especially the AOP and Reality for Data Privacy and Protection,” in IEEE Transactions on
solution we introduce. Similarly, for the deepVIT case in Fig. Knowledge and Data Engineering, vol. 35, no. 4, pp. 3347-3366, 1 April
2023, doi: 10.1109/TKDE.2021.3124599.
4b, the results of training time on each edge device also re- [2] D. Wu, R. Ullah, P. Harvey, P. Kilpatrick, I. Spence and B. Varghese,
validated our focus, contribution and new ideas in this paper. ”FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning,”
in IEEE Internet of Things Journal, vol. 9, no. 21, pp. 20889-20901, 1
D. Performance evaluation for action mean on actual each Nov.1, 2022, doi: 10.1109/JIOT.2022.3176469.
[3] Matsubara, Y., Levorato, M., & Restuccia, F. (2021). Split Computing
edge clients and Early Exiting for Deep Learning Applications: Survey and Research
As highlighted in Section II, determining the workload score Challenges. ACM Computing Surveys, 55, 1 - 30.
[4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
for each device in the group is based on a percentage derived Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S.,
from the action mean. Eq. 1 can be applied to calculate the Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words:
workload score. We define the action mean as the percentage Transformers for Image Recognition at Scale. ArXiv, abs/2010.11929.
[5] L. Lockhart, P. Harvey, P. Imai, P. Willis and B. Varghese, ”Scis-
of FLOPs at layer i out of the total number of FLOPs in the sion: Performance-driven and Context-aware Cloud-Edge Distribution
model. This enables us to assess the impact and importance of of Deep Neural Networks,” in 2020 IEEE/ACM 13th International
each workload. Refer to Figs. 5a and 5b for VIT and deepVIT Conference on Utility and Cloud Computing (UCC), Leicester, United
Kingdom, 2020 pp. 257-268.
models, respectively. Higher workload values indicate that the [6] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning
device is adept at handling the offloading point selection and for health: Distributed deep learning without sharing raw patient data,”
related tasks. For instance, the workload attains its highest arXiv:1812.00564, 2018.
[7] C. Thapa, P. C. M. Arachchige, S. Camtepe, and L. Sun, “SplitFed:When
value for an edge device, a PC laptop is a client6. Additionally, federated learning meets split learning,” in Proc. AAAI Conf. Ar-
the workload values for each client reflect the environmental tif.Intell., Jun. 2022, pp. 8485–8493.
conditions influencing the cutting point on each device. This [8] Y. Gao et al., ”End-to-End Evaluation of Federated Learning and Split
Learning for Internet of Things,” 2020 International Symposium on
information provides insights into the device’s offloading point Reliable Distributed Systems (SRDS), Shanghai, China, 2020, pp. 91-
and the corresponding workload percentage (i.e., how much % 100, doi: 10.1109/SRDS51746.2020.00017.
of the maximum work). Similarly, simultaneous evaluation of [9] L. Li, H. Xiong, Z. Guo, J. Wang and C. -Z. Xu, ”SmartPC: Hierarchical
Pace Control in Real-Time Federated Learning System,” 2019 IEEE
the workload on different devices allows us to ascertain the Real-Time Systems Symposium (RTSS), Hong Kong, China, 2019, pp.
maximum value achievable by other edge devices. 406-418, doi: 10.1109/RTSS46320.2019.00043.
[10] Imteaj, A., Mamun Ahmed, K., Thakker, U., Wang, S., Li, J., Amini,
V. C ONCLUSIONS M.H. (2023). Federated Learning for Resource-Constrained IoT Devices:
Panoramas and State of the Art. In: Razavi-Far, R., Wang, B., Taylor,
By seamlessly integrating the transformer model from clas- M.E., Yang, Q. (eds) Federated and Transfer Learning. Adaptation,
sical FL with the offloading point technique in edge devices, Learning, and Optimization, vol 27. Springer, Cham.
[11] Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman,
we introduce a groundbreaking solution named adaptive of- A., Ivanov, V., Kiddon, C., Konecný, J., Mazzocchi, S., McMahan,
floading point (AOP). This novel approach is designed for H.B., Overveldt, T.V., Petrou, D., Ramage, D., & Roselander, J.
application in edge AI, offering enhanced model privacy (2019). Towards Federated Learning at Scale: System Design. ArXiv,
abs/1902.01046.
through network splitting and incorporating differential pri- [12] Almalik, Faris et al. “FeSViBS: Federated Split Learning of Vision
vate client-side model updates. AOP outperforms classical FL Transformer with Block Sampling.” International Conference on Medical
and FedAdapt by leveraging the actual network bandwidth, Image Computing and Computer-Assisted Intervention (2023).
[13] D. Wu, R. Ullah, P. Harvey, P. Kilpatrick, I. Spence and B. Varghese,
facilitating parallel processing across clients, and achieving ”FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning,”
significantly faster training times on real edge devices. Our in IEEE Internet of Things Journal, vol. 9, no. 21, pp. 20889-20901, 1
experimental results, conducted on a incident image dataset, Nov.1, 2022, doi: 10.1109/JIOT.2022.3176469.
[14] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learn-
showcase AOP’s superior performance in terms of training ing (Information Science and Statistics). Springer-Verlag, Berlin, Hei-
efficiency and accuracy. These findings hold promising im- delberg.
plications for practical applications, particularly in the realm [15] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning:
An Introduction. A Bradford Book, Cambridge, MA, USA.
of FL for Edge AI. [16] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S.,
ACKNOWLEDGMENT Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale. ArXiv, abs/2010.11929.
This R&D includes the results of ”Research and [17] Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Hou, Q., & Feng,
development of optimized AI technology by secure data J. (2021). DeepViT: Towards Deeper Vision Transformer. ArXiv,
coordination (JPMI00316)” by the Ministry of Internal abs/2103.11886.
[18] Levering, A., Tomko, M., Tuia, D., Khoshelham, K.: Detecting unsigned
Affairs and Communications (MIC), Japan. physical road incidents from driver-view images. IEEE Trans. Intell.
Veh. 6(1), 24–33 (2021)
We would like to thank Takamasa Mizoi, Senior Research
Engineer, and Isao Kikuchi, System Engineer, at the Big Data
Integration Research Center, National Institute of Information
and Communications Technology (NICT), for their great co-
operation in the development of the experimental system.