Towards Energy-Aware Federated Learning Via MARL: A Dual-Selection Approach For Model and Client
ABSTRACT

…ing for heterogeneous Artificial Intelligence of Things (AIoT) devices, their training performance and energy efficacy are severely restricted in practical battery-driven scenarios due to the "wooden barrel effect" caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, due to various kinds of differences among devices, it is hard for existing FL methods to conduct training effectively in energy-constrained scenarios, such as those imposed by the battery constraints of devices. To tackle the above issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike Vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively according to their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximize knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.

1 INTRODUCTION

The increasing popularity of Artificial Intelligence (AI) techniques, especially Deep Learning (DL), accelerates the significant evolution of the Internet of Things (IoT) toward the Artificial Intelligence of Things (AIoT), where various AIoT devices are equipped with DL models to enable accurate perception and intelligent control [2]. Although AIoT systems (e.g., autonomous driving, intelligent control [18], and healthcare systems [1, 23]) play an important role in various safety-critical domains, due to both the limited classification capabilities of local device models and the restricted access to private local data, it is hard to guarantee the training and inference performance of AIoT devices in Federated Learning (FL) [13], especially when they are powered by batteries and deployed within an uncertain dynamic environment [4]. To quickly improve the training and inference performance of devices, more and more large-scale AIoT systems resort to cloud computing [30], which offers tremendous computing power and flexible device management schemes. However, such a cloud-based architecture still cannot fundamentally improve the inference accuracy of AIoT devices, since they are not allowed to transmit private local data to each other. Due to concerns about data privacy, both the training and inference performance of local models are greatly suppressed.

As a promising collaborative machine learning paradigm, FL allows local DL model training among various devices without exposing raw data, where only the weights of local device models are uploaded to a cloud server for knowledge aggregation, thus enhancing both the training and inference capability of local models. Although FL is promising for knowledge sharing, it faces the problems of both large-scale deployment and quick adaptation to dynamic environments, where local models are required to be frequently trained to accommodate an ever-changing world. In practice, such problems are hard to solve, since classic Federated Averaging (i.e., FedAvg) methods require that all devices have homogeneous local models with the same architecture. According to the well-known "wooden barrel effect" caused by this homogeneous assumption, as shown in Figure 1, the energy consumption waste in Vanilla FL is usually due to two reasons, i.e., the mismatch between computing power and the homogeneous model, and the mismatch between power consumption and the homogeneous model. The former wastes device energy on waiting time, while the latter wastes device energy on useless training time (i.e., a device has only enough power to support training but not communication). Thus, such a homogeneous model assumption strongly limits the overall energy efficiency of the entire FL system, because the energy usage of the system should mainly be determined by how much power is spent on effective model learning rather than on waiting, which consumes energy without contributing to training or communication.

Figure 1: The energy consumption waste of the "wooden barrel effect" in Vanilla FL is usually due to the following two reasons, i.e., the mismatch between computing power and homogeneous model, and the mismatch between power consumption and homogeneous model. The former uses device energy for waiting time, while the latter uses device energy for useless training time (only enough power to support training but not communication).

Typically, an AIoT system involves various types of devices with different settings (i.e., computing power and remaining power). If all devices are equipped with homogeneous local models, the inference potential of devices with superior computing power will be eclipsed. Things become even worse when the devices of AIoT applications are powered by batteries. In this case, the devices with less battery energy will be reluctant to participate in frequent interactions with the cloud server. Moreover, if one device runs out of power at an early stage of the FL training, it is hard for the global model to achieve the expected inference performance. Meanwhile, the overall inference performance of the global model will be strongly deteriorated due to the absence of such exhausted devices in the following training process. Therefore, how to fully explore the potential of energy-constrained heterogeneous devices to enable high-performance and energy-efficient FL is becoming a major bottleneck in the design of an AIoT system.

Although various heterogeneous FL methods (e.g., HeteroFL [5], ScaleFL [8], PervasiveFL [24]) and energy-saving techniques [11, 12] have been investigated to address the above issue, most of them focus on either enabling effective knowledge sharing between heterogeneous models or reducing the energy consumption of devices. Based on the coarse-grained FedAvg operations, few
of the existing FL methods can substantially address the above challenges to quickly adapt to new environments within an energy-constrained scenario. Inspired by the concepts of BranchyNet [21] and multi-agent reinforcement learning [29], in this paper we propose a novel FL framework named DR-FL, which takes both the layer-wise structure information of DL models and the remaining energy of each client into account to enable energy-efficient federated training. Unlike traditional FedAvg-based FL methods that rely on homogeneous device models, DR-FL maintains a layer-wise global model on the cloud server, while each device only installs a subset of the layer-wise model according to its computing power and remaining battery. In this way, all the heterogeneous local models can effectively contribute to the global model based on their computing capabilities and remaining energy in a MARL-guided manner. Meanwhile, by adopting MARL, DR-FL can trade off training performance against energy consumption, thus ensuring energy-efficient FL training that accommodates various energy-constrained environments. This paper makes the following three major contributions:

• We establish a novel, lightweight cloud-based FL framework named DR-FL, which can be easily implemented and enables various heterogeneous DNNs on heterogeneous devices to share knowledge through layer-wise model aggregation without compromising their data privacy.
• We propose a dual-selection approach based on MARL to control energy-efficient learning from the perspectives of both layer-wise models and participating clients, which can maximize the efficacy of the entire AIoT system.
• Experimental results obtained from both simulation and real test-bed platforms show that, compared with various state-of-the-art approaches, DR-FL can not only achieve better inference performance within various non-IID scenarios, but also have superior scalability for large-scale AIoT systems.

The rest of this paper is organized as follows. Section 2 discusses related work on heterogeneous FL and energy-aware FL training. After giving the preliminaries of FL and multi-agent reinforcement learning in Section 3, Section 4 details our proposed DR-FL method. Section 5 presents experimental results on well-known benchmarks. Finally, Section 6 concludes the paper.

2 RELATED WORK

Although FL is good at knowledge sharing without compromising the data privacy of devices in AIoT system design, due to the homogeneous assumption that all the involved devices should have local DL models with the same architecture, Vanilla FL methods inevitably suffer from the problems of low inference accuracy and wasted energy consumption, thus impeding the deployment of FL methods in large-scale AIoT system designs [9, 16, 24, 31], especially for non-IID scenarios.

To enable collaborative learning among heterogeneous device models, various solutions have been extensively studied, which can be primarily classified into two categories, i.e., subnetwork aggregation-based methods and knowledge distillation-based methods. The basic idea of subnetwork aggregation-based methods is to allow knowledge aggregation on top of subnetworks of local device models, which enables knowledge sharing among heterogeneous device models. For instance, Diao et al. [5] presented an effective heterogeneous FL framework named HeteroFL, which can train heterogeneous local models with varying computation complexities but still produce a single global inference model, assuming that device models are subnetworks of the global model. By integrating FL and width-adjustable slimmable neural networks, Yun et al. [27]
proposed a novel learning framework named SlimFL, which jointly utilizes superposition coding for global model aggregation and superposition training for updating local models. In [24], Xia et al. developed a novel framework named PervasiveFL, which utilizes a small uniform model (i.e., a "modellet") to enable heterogeneous FL. Although all the above heterogeneous FL methods are promising, most of them focus on improving inference performance. Few of them take the issues of real-time training and energy efficiency into account.

Since a large-scale FL-based AIoT application typically involves a variety of devices that are powered by batteries, how to conduct energy-efficient FL training is becoming an important issue [19, 28]. To address this issue, various methods have been investigated to reduce the energy consumed by FL training and device-server communication. For example, Hamdi et al. [6] studied the FL deployment problem in an energy-harvesting wireless network, where a certain number of users may be unable to participate in FL due to interference and energy constraints. They formalized such a deployment scenario as a joint energy management and user scheduling problem over wireless systems, and solved it efficiently. In [20], Sun et al. presented an online energy-aware dynamic worker scheduling policy, which can maximize the average number of workers scheduled for gradient updates under a long-term energy constraint. In [26], Yang et al. formulated the energy-efficient transmission and computation resource allocation for FL over wireless communication networks as a joint learning and communication problem. To minimize system energy consumption under a latency constraint, they presented an iterative algorithm that can derive optimal solutions considering various factors (e.g., bandwidth allocation, power control, computation frequency, and learning accuracy). Although all the above energy-saving methods can effectively reduce energy consumption in both FL training and communication, few of them can guarantee the training time requirement of FL training within a complex dynamic environment.

To the best of our knowledge, DR-FL is the first attempt to investigate dual selection over both layer-wise models and participating clients based on MARL to enable fine-grained heterogeneous FL, where heterogeneous devices can adaptively and efficiently contribute to the global model based on their computing capabilities and remaining energy. Compared with state-of-the-art heterogeneous FL methods, DR-FL can not only maximize the knowledge sharing among various heterogeneous models under energy constraints but also significantly improve both the model performance of each involved device and the energy efficacy of the entire FL system.

3 PRELIMINARIES

3.1 Federated Learning

With the prosperity of distributed machine learning technologies [22], privacy-aware FL is proposed to effectively solve the problem of data silos, where multiple AIoT devices can achieve knowledge sharing without leaking their data privacy. Since the physical environment is volatile (i.e., high-latency networks and unstable connections) in real AIoT scenarios, Vanilla FL randomly selects a number of AIoT devices in each communication round to train a homogeneous DNN model. Suppose there are $N$ devices selected at the $t$-th communication round in FL. After the $t$-th communication round, the update process of each device model is defined as follows:

$\mathcal{W}^{n}_{t+1} \leftarrow \mathcal{W}^{n}_{t} - \eta \nabla \mathcal{W}^{n}_{t}$,   (1)

where $\mathcal{W}^{n}_{t}$ and $\mathcal{W}^{n}_{t+1}$ represent the models of the $n$-th device at round $t$ and round $t+1$, respectively, $\eta$ indicates the learning rate, and $\nabla \mathcal{W}^{n}_{t}$ is the gradient obtained by the $n$-th device model after the $t$-th training round. To protect data privacy, at the end of each communication round, FL uploads the weight differences (i.e., model gradients) of each device instead of the newly updated models to the cloud for aggregation. After gathering the gradients from all the participating devices, the cloud updates the parameters of the shared global model based on the FedAvg [14] algorithm, which is defined as follows:

$\mathcal{W}_{t+1} \leftarrow \mathcal{W}_{t} + \frac{\sum_{n=1}^{N} L_n \nabla \mathcal{W}^{n}_{t}}{N}$,   (2)

where $\frac{\sum_{n=1}^{N} \nabla \mathcal{W}^{n}_{t}}{N}$ denotes the average gradient of the $N$ participating devices in communication round $t$, $\mathcal{W}_{t}$ and $\mathcal{W}_{t+1}$ represent the global models after the $t$-th and $(t+1)$-th communication rounds, respectively, and $L_n$ denotes the training data size of device $n$. Although Vanilla FL methods (e.g., FedAvg) perform remarkably in distributed machine learning, they cannot be directly applied to AIoT scenarios. This is because heterogeneous AIoT devices lead to different training speeds in Vanilla FL, resulting in additional energy waste, which is unacceptable for an energy-constrained system.
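To make the round update in Equations (1) and (2) concrete, the following minimal PyTorch-style sketch (not the authors' implementation; all names are illustrative) shows a client-side step and a server-side aggregation that weights each uploaded update by the client's data size, following Equation (2) as written.

```python
from typing import Dict, List
import torch

def local_update(weights: Dict[str, torch.Tensor],
                 grads: Dict[str, torch.Tensor],
                 lr: float) -> Dict[str, torch.Tensor]:
    # Equation (1): W_{t+1}^n <- W_t^n - eta * grad(W_t^n)
    return {name: w - lr * grads[name] for name, w in weights.items()}

def server_aggregate(global_weights: Dict[str, torch.Tensor],
                     client_grads: List[Dict[str, torch.Tensor]],
                     data_sizes: List[int]) -> Dict[str, torch.Tensor]:
    # Equation (2): W_{t+1} <- W_t + (sum_n L_n * grad_n) / N, as written in the paper,
    # where each "gradient" is the accumulated weight difference uploaded by client n.
    n_clients = len(client_grads)
    updated = {}
    for name, w in global_weights.items():
        weighted_sum = sum(size * grads[name]
                           for size, grads in zip(data_sizes, client_grads))
        updated[name] = w + weighted_sum / n_clients
    return updated
```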
tions considering various factors (e.g., bandwidth allocation, power
control, computation frequency, and learning accuracy). Although 3.2 Multi-Agent Reinforcement Learning
all the above energy-saving methods can effectively reduce energy
In cooperative Multi-Agent Reinforcement Learning (MARL), a
consumption in both FL training and communication, few of them
set of 𝑁 agents is trained to produce optimal actions that lead to
can guarantee the training time requirement of FL training within
maximum team rewards. Specifically, at each timestamp 𝑡, each
a complex dynamic environment.
agent 𝑛 (where 1 ≤ 𝑛 ≤ 𝑁 ) observes its state 𝑠𝑡𝑛 and selects an
To the best of our knowledge, DR-FL is the first attempt to inves-
action 𝑎𝑛𝑡 based on 𝑠𝑡𝑛 . After all agents have completed their actions,
tigate the dual selection by both layer-wise models and the partici-
the team receives a joint reward 𝑟𝑡 and transitions to the next state
pated clients based on MARL to enable fine-grained heterogeneous
𝑠𝑡𝑛+1 . The goal is to maximize the total expected discounted reward
FL, where heterogeneous devices can adaptively and efficiently
𝑅 = 𝑇𝑡=1 𝛾𝑟𝑡 by selecting the optimal actions for each agent, where
Í
make contributions to the global model based on their computing
capabilities and remaining energy. Compared with state-of-the- 𝛾 ∈ [0, 1] is the discount factor.
art heterogeneous FL methods, DR-FL can not only maximize the Recently, QMIX [17] has emerged as a promising solution for
knowledge sharing among various heterogeneous models under jointly training agents in cooperative MARL. In QMIX, each agent
energy constraints but also significantly improve both the model 𝑛 employs a Deep Neural Network (DNN) to infer its actions. This
performance of each involved device and the energy efficacy of the DNN implements the 𝑄-function 𝑄 𝜃 (𝑠, 𝑎) = 𝐸 [𝑅𝑡 |𝑠𝑡𝑛 = 𝑠, 𝑎𝑛𝑡 = 𝑎],
where 𝜃 represents the parameters of the DNN, and 𝑅𝑡 = 𝑇𝑖=𝑡 𝛾𝑟𝑖
Í
entire FL system.
is the total discounted team reward received at 𝑡. During MARL
execution, each agent 𝑛 selects the action 𝑎 ∗ with the highest 𝑄-
3 PRELIMINARIES
value (i.e., 𝑎 ∗ = arg max𝑎 𝑄 𝜃 (𝑠𝑛 , 𝑎)).
3.1 Federated Learning To train the QMIX, a replay buffer is employed to store transi-
With the prosperity of distributed machine learning technologies tion tuples (𝑠𝑡𝑛 , 𝑎𝑛𝑡 , 𝑠𝑡𝑛+1, 𝑟𝑡 ) for each agent 𝑛. The joint 𝑄-function,
[22], privacy-aware FL is proposed to effectively solve the prob- 𝑄 tot (·), is represented as the element-wise summation of all individ-
lem of data silos, where multiple AIoT devices can achieve knowl- ual 𝑄-functions (i.e., 𝑄 tot (𝑠𝑡 , 𝑎𝑡 ) = 𝑛 𝑄𝑛𝜃 (𝑠𝑡𝑛 , 𝑎𝑛𝑡 )), where 𝑠𝑡 = {𝑠𝑡𝑛 }
Í
edge sharing without leaking their data privacy. Since the physical and 𝑎𝑡 = {𝑎𝑛𝑡 } are the states and actions collected from all agents
environment is volatile (i.e., high latency network and unstable 𝑛 ∈ 𝑁 at timestamp 𝑡. The agent DNNs can be recursively trained
connection) in real AIoT scenarios, Vanilla FL randomly selects a by minimizing the loss 𝐿 = 𝐸𝑠𝑡 ,𝑎𝑡 ,𝑟𝑡 ,𝑠𝑡 +1 [𝑦𝑡 − 𝑄 tot (𝑠𝑡 , 𝑎𝑡 )] 2 , where
′
number of AIoT devices for each communication round of training 𝑦𝑡 = 𝑟𝑡 + 𝛾 𝑛 max𝑎 𝑄𝑛𝜃 (𝑠𝑡𝑛+1, 𝑎) and 𝜃 ′ represents the parameters
Í
a homogeneous DNN model. Suppose there are 𝑁 devices selected of the target network, which are periodically copied from 𝜃 during
at the 𝑡 𝑡ℎ communication round in FL. After the 𝑡 𝑡ℎ communication the training phase.
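The following sketch illustrates the additive joint-Q training just described: each agent owns a GRU/MLP Q-network, the joint Q-value is the sum of per-agent Q-values, and the TD loss uses a periodically copied target network. Shapes, hyperparameters, and the replay-buffer format are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent Q-network: a GRU over the agent's trajectory feeds an MLP head."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(traj)          # traj: [batch, T, state_dim]
        return self.head(out[:, -1])     # Q-values at the last step: [batch, n_actions]

def td_loss(q_nets, target_nets, batch, gamma: float = 0.99) -> torch.Tensor:
    """L = E[(y_t - Q_tot)^2], with Q_tot as the sum of per-agent chosen-action Q-values."""
    states, actions, rewards, next_states = batch   # lists indexed by agent n
    q_tot, target_tot = 0.0, 0.0
    for n, (qn, tn) in enumerate(zip(q_nets, target_nets)):
        q_all = qn(states[n])                                    # [batch, n_actions]
        q_tot = q_tot + q_all.gather(1, actions[n]).squeeze(1)    # Q of the taken action
        with torch.no_grad():
            target_tot = target_tot + tn(next_states[n]).max(dim=1).values
    y = rewards + gamma * target_tot                              # TD target y_t
    return ((y - q_tot) ** 2).mean()
```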
[Architecture figure: guided dual-selection, in which per-agent Q-networks (Agent 1 … Agent n) built from GRU and MLP layers feed a device evaluation network, and the model selection for each device follows the maximum Q-value over its trajectory τ.]

…can choose an appropriate model for each AIoT device based on its remaining energy and computing capabilities, which can not only improve the efficiency of device resource usage but also ensure their active participation in FL (see more details in Section 4.3). Furthermore, apart from selecting a layer-wise model for each AIoT device, the selector can also adjust the computing capability…
4.3.2 MARL Agent State Design: The state of each MARL agent $D_n$ comprises three components: the remaining energy $E_{all}^{D_n}$, the computation capability available in each communication round $C_{D_n}$, and the size of the local training dataset $L_{D_n}$. At each training round $t$, each agent initially conducts the training procedure and transmits its gradients to the central server. Furthermore, to estimate the current training and communication delays at client device $n$, each MARL agent is equipped with a record of the training latency $T_{tra}^{D_n}$ and the communication latency $T_{com}^{D_n}$, where $T_{tra}^{D_n}$ and $T_{com}^{D_n}$ denote the latency of local training and of model uploading for agent $n$ during communication round $t$. As shown in Figure 3, the parameter τ represents the trajectory of historical training data, and $h$ represents the MLP layer used for knowledge extraction. Moreover, each MARL agent $n$ also calculates the energy consumption of training and communication based on Equation 7. This inclusion is crucial because these costs contribute to the overall energy cost, while the remaining energy of the agent influences both training latency and model accuracy. The state vector $s_t^n$ of agent $n$ in communication round $t$ is defined as:

$s_t^n = [L_t^n, C_{D_n}, E_{D_n}, t]$.   (9)

Finally, to decrease the storage overhead and accelerate agent convergence, all MLPs and GRUs within the MARL agents share their weights.
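As a small illustration of how the observation in Equation (9), together with the per-agent latency records mentioned above, could be assembled into an input for the Q-network, consider the following sketch; the field and function names are assumptions, not part of DR-FL's code.

```python
from dataclasses import dataclass
import torch

@dataclass
class ClientRecord:
    data_size: int        # L_t^n, size of the local training dataset
    compute_cap: float    # C_{D_n}, per-round computation capability
    energy_left: float    # E_{D_n}, remaining battery energy (J)
    t_train: float        # recorded local training latency (s)
    t_comm: float         # recorded model-upload latency (s)

def make_state(rec: ClientRecord, round_idx: int) -> torch.Tensor:
    # Core state of Eq. (9) plus the latency records each agent keeps.
    return torch.tensor([rec.data_size, rec.compute_cap, rec.energy_left,
                         round_idx, rec.t_train, rec.t_comm], dtype=torch.float32)
```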
4.3.3 Agent Action Design: Given the input state shown in Equation 9, each MARL agent $n$ determines which layers of the local model should be used for the local training process on each device. Specifically, the MARL agent generates $Q$-values for the current action set $[a_0, \ldots, a_M]$, where $M$ represents the number of model selections available to the client. Note that when the selected action is zero, the client device will run the first model, and when the selected action is $M$, the client will not participate in the FL round. After the layer-wise model has been selected for each heterogeneous device, the $Q$-values obtained by the agents are used to select the devices with the highest $Q$-values through a Top-K algorithm to participate in the FL process.
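Putting the two selection steps together, a rough sketch of the dual selection could look as follows: each agent first picks the model index with the maximum Q-value (with action index M standing for non-participation), and the server then keeps the Top-K clients ranked by their best Q-values. This is an illustrative reading of the procedure described above, not the released implementation.

```python
from typing import List, Tuple
import torch

def dual_selection(q_values: List[torch.Tensor], k: int,
                   no_participation: int) -> List[Tuple[int, int]]:
    """q_values[n]: Q-values over the action set [a_0, ..., a_M] for client n.

    Returns (client_id, model_action) pairs for at most k participating clients.
    """
    best_q, best_action = [], []
    for q in q_values:
        a = int(torch.argmax(q).item())      # per-agent model selection
        best_action.append(a)
        best_q.append(float(q[a]))
    # Rank clients by their best Q-value and keep the Top-K.
    order = sorted(range(len(q_values)), key=lambda n: best_q[n], reverse=True)[:k]
    # Drop clients whose best action is "do not participate" (action index M).
    return [(n, best_action[n]) for n in order if best_action[n] != no_participation]
```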
4.3.4 Reward Function Design: To optimize the objective described in Equation 8, the reward function should reflect the changes in model accuracy, processing latency (training, communication, and waiting latency), and processing energy consumption after executing the dual-selection strategy generated by the MARL agents. The reward $r_t$ at training round $t$ is defined as follows:

$r_t = w_1 \cdot (M_{Acc}^{t} - M_{Acc}^{t-1}) - w_2 \cdot (E_{all}^{t-1} - E_{all}^{t}) - w_3 \cdot \max_{1 \le n \le N} T_{all}^{t,n}$.   (10)

Here, $\max_{1 \le n \le N} T_{all}^{t,n}$ represents the total time needed for the local training of all selected devices. The MARL agents utilize the evaluation accuracy calculated on a small validation dataset on the cloud server to select the layer-wise model that will be dispatched to each local device, which then continues local training and uploads its model updates. Moreover, $w_1$, $w_2$, and $w_3$ are normalization ratios that balance the contributions of the three terms within the overall reward (we used $w_1 = 1000$, $w_2 = 0.01$, and $w_3 = 1$ in our experiments). $E_{all}^{t}$ is the total remaining energy at the $t$-th communication round as defined in Equation 6. The MARL agents are trained using QMIX as described in Figure 3.
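Assuming the cloud server tracks the validation accuracy of the global model, the total remaining energy, and the per-device round times, the reward in Equation (10) could be computed as in the sketch below (the weights come from the values reported above; all other names are illustrative).

```python
from typing import List

def round_reward(acc_t: float, acc_prev: float,
                 energy_t: float, energy_prev: float,
                 round_times: List[float],
                 w1: float = 1000.0, w2: float = 0.01, w3: float = 1.0) -> float:
    accuracy_gain = acc_t - acc_prev        # M_Acc^t - M_Acc^{t-1}
    energy_used = energy_prev - energy_t    # E_all^{t-1} - E_all^t
    slowest_device = max(round_times)       # max_n T_all^{t,n}
    return w1 * accuracy_gain - w2 * energy_used - w3 * slowest_device
```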
5 EXPERIMENTAL RESULTS

To evaluate the performance of our proposed method, we implemented the DR-FL algorithm using PyTorch (version 1.4.0). Similar to FedAvg, we assume that only 10% of the AIoT devices are involved in each round of FL communication during the training period. For DR-FL and the other heterogeneous FL methods, we set the mini-batch size to 32. The number of local training epochs and the initial learning rate were 5 and 0.05, respectively. To simulate a variety of energy-constrained scenarios, we assume that each device is powered by a battery with a maximum capacity of 7,560 joules, i.e., a battery capacity of 1500 mAh at a rated voltage of 5.04 V. We conducted comprehensive experiments to answer the following four Research Questions (RQs).

RQ1 (Superiority of DR-FL): What advantages can DR-FL achieve compared with state-of-the-art heterogeneous FL methods?
RQ2 (Benefits of MARL-based Dual-Selection): What benefits does MARL-based Dual-Selection provide during DR-FL learning, especially under constraints such as device energy and overall training time, compared with other SOTA heterogeneous FL methods?
RQ3 (Scalability of DR-FL): How does the number of AIoT devices participating in knowledge sharing affect the performance of DR-FL?
RQ4 (Exploration of the Validation Data Ratio): How does the proportion of validation data in MARL affect the performance of DR-FL?

5.1 Experimental Settings

5.1.1 Model Settings. We compared our DR-FL method with two typical state-of-the-art heterogeneous FL methods, i.e., HeteroFL [5] and ScaleFL [8], which belong to subnetwork aggregation-based methods and knowledge distillation-based methods, respectively. We set the ResNet-18 model [7] as the backbone, where each block of the ResNet-18 model is followed by a new pair of bottleneck and classifier, thus forming four new heterogeneous layer-wise models that simulate four types of heterogeneous models (i.e., Models 1-4 shown in Table 1). Note that each layer-wise model can be reused with the same backbone for the purpose of model inference.
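As a rough illustration of this model setting (in the spirit of the BranchyNet-style early exits cited earlier), the sketch below attaches an exit head, a small bottleneck followed by a classifier, after each of the four ResNet-18 stages, so that Model_1 through Model_4 correspond to exiting after stages 1 through 4. The layer sizes and head design are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LayerWiseResNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        backbone = resnet18()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        widths = [64, 128, 256, 512]  # output channels of the four ResNet-18 stages
        self.exits = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(w, 64), nn.ReLU(),      # "bottleneck"
                          nn.Linear(64, num_classes))        # classifier
            for w in widths])

    def forward(self, x: torch.Tensor, model_idx: int) -> torch.Tensor:
        """model_idx in {0,1,2,3} selects Model_1..Model_4 (exit after that stage)."""
        x = self.stem(x)
        for i in range(model_idx + 1):
            x = self.stages[i](x)
        return self.exits[model_idx](x)
```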
5.1.2 Dataset Settings. To evaluate the effectiveness of DR-FL, we considered four training datasets, i.e., CIFAR10, CIFAR100 [10], Street View House Numbers (SVHN) [15], and Fashion-MNIST [25]. CIFAR10: The CIFAR10 dataset consists of 60,000 32×32 colour images across ten classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 testing images. CIFAR100: The CIFAR100 dataset is similar to CIFAR10 but contains 100 classes instead of 10, with 600 images per class. The dataset also comprises 50,000 training images and 10,000 testing images. SVHN: The SVHN dataset is a real-world image dataset derived from house numbers in Google Street View images. It contains over 600,000 labelled digit images, where each image is a 32×32 colour image representing a single digit (0-9). Fashion-MNIST: The Fashion-MNIST dataset is a dataset of Zalando's article images, consisting of 70,000 28×28 grayscale images of 10 different fashion categories. In subsequent
Table 1: Test accuracy (%) comparison for different models and dataset settings under specific energy constraints with 40 clients.
Dataset CIFAR10
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 30.46 ± 1.10 46.11 ± 3.32 65.23 ± 1.45 29.25 ± 1.17 54.44 ± 0.87 58.15 ± 4.32 58.69 ± 0.73 59.01 ± 0.85 76.46 ± 0.12
Model_2 48.41 ± 1.24 62.55 ± 3.45 62.10 ± 3.24 41.66 ± 5.43 55.46 ± 3.87 71.48 ± 1.23 65.31 ± 1.54 75.93 ± 0.62 77.43 ± 2.77
Model_3 34.85 ± 5.79 65.01 ± 1.79 74.78 ± 2.76 39.92 ± 2.75 60.07 ± 0.68 70.83 ± 1.43 72.71 ± 0.58 70.64 ± 1.40 71.54 ± 1.54
Model_4 45.26 ± 3.68 69.65 ± 2.99 75.14 ± 1.13 46.59 ± 3.43 70.60 ± 4.54 73.90 ± 1.17 70.76 ± 1.30 69.37 ± 0.45 72.27 ± 1.73
Dataset CIFAR100
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 11.86 ± 0.78 22.56 ± 2.13 25.66 ± 1.13 13.14 ± 1.96 21.39 ± 1.59 17.58 ± 0.43 26.25 ± 0.23 33.59 ± 3.32 39.65 ± 1.35
Model_2 16.33 ± 3.34 25.98 ± 1.72 28.68 ± 0.57 12.67 ± 2.13 28.77 ± 4.33 29.84 ± 1.39 17.83 ±0.75 39.50 ± 1.08 33.55 ± 0.45
Model_3 14.18 ± 0.29 31.99 ± 0.53 31.31 ± 3.34 17.12 ± 2.88 30.04 ± 1.91 33.92 ± 2.34 26.46 ± 0.24 32.10 ± 1.12 33.40 ± 0.13
Model_4 15.66 ± 0.78 29.33 ± 0.85 35.44 ± 1.54 19.24 ± 1.22 30.29 ± 1.03 33.23 ± 1.32 22.55 ± 0.73 32.55 ± 1.45 33.80 ± 1.25
Dataset SVHN
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 60.08 ± 3.23 46.02 ± 3.32 60.38 ± 1.39 47.90 ± 0.53 85.79 ± 2.22 88.91 ± 1.11 67.19 ± 0.32 91.58 ± 0.21 68.78 ± 1.33
Model_2 65.11 ± 4.32 54.83 ± 1.28 68.90 ± 2.87 50.26 ± 2.21 86.82 ± 2.51 85.16 ± 4.13 79.86 ± 0.87 85.30 ± 1.19 91.72 ± 0.94
Model_3 65.93 ± 4.56 69.20 ± 4.19 75.97 ± 1.84 76.73 ± 2.23 84.91 ± 0.68 88.70 ± 3.25 91.47 ± 0.17 88.61 ± 1.72 93.45 ± 0.37
Model_4 66.31 ± 3.09 71.34 ± 0.79 76.14 ± 1.90 55.27 ± 3.23 86.10 ± 3.56 92.47 ± 0.51 91.11 ± 1.32 89.26 ± 0.75 92.78 ± 0.54
Dataset Fashion-MNIST
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 45.06 ± 2.01 85.58 ± 1.31 87.00 ± 1.93 53.78 ± 0.98 74.26 ± 2.34 87.29 ± 0.93 80.15 ± 0.23 82.25 ± 0.19 87.10 ± 0.37
Model_2 59.76 ± 0.46 85.75 ± 0.63 88.60 ± 0.34 57.19 ± 3.13 85.32 ± 2.51 87.44 ± 0.55 82.10 ± 0.39 88.76 ± 0.23 85.22 ± 0.34
Model_3 57.25 ± 0.98 83.26 ± 3.27 87.75 ± 1.25 62.26 ± 1.34 87.69 ± 1.07 88.47 ± 0.97 86.88 ± 0.23 89.34 ± 0.62 90.52 ± 0.13
Model_4 56.32 ± 4.07 87.82 ± 1.28 87.83 ± 0.56 55.85 ± 1.51 86.78 ± 3.27 88.40 ± 0.69 85.80 ± 0.17 89.36 ± 0.11 89.60 ± 0.29
experiments, we investigated three non-Independent and Identically Distributed (non-IID) distributions for each dataset. Similar to the work of HeteroFL [5], we constructed non-IID local training datasets using heterogeneous data splits following a Dirichlet distribution controlled by a variable α. Typically, a smaller value of α represents a higher degree of non-IID data distribution. Meanwhile, we used the same data augmentation technologies for the natural image datasets as the ones used in HeteroFL [5]. To enable MARL training on the cloud server in DR-FL, we used 4% of the overall training data as the validation set on the server. Note that the validation set on the server does not overlap with the local training datasets hosted by the AIoT devices.
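For readers who want to reproduce a similar split, the sketch below shows one common way to draw a Dirichlet(α) label partition across clients, where a smaller α concentrates each class on fewer clients; the exact procedure used by HeteroFL and in this paper may differ in details, and the function below is only an assumed illustration.

```python
import numpy as np

def dirichlet_split(labels: np.ndarray, n_clients: int, alpha: float,
                    seed: int = 0) -> list:
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

# Example: labels = np.array(cifar10_targets); parts = dirichlet_split(labels, 40, 0.1)
```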
Figure 4: Real test-bed platform for our experiment. (a) AIoT devices; (b) The server.

5.1.3 Test-bed Settings. Besides the simulation-based evaluation, we constructed a physical test-bed platform, as shown in Figure 4, to check the performance of our DR-FL in a real-world environment. The test-bed consists of four parts: i) the cloud server, which is built on top of an Ubuntu workstation equipped with an Intel i9 CPU, 32 GB memory, and a GTX3090 GPU; ii) the Jetson Nano boards, each of which has a quad-core ARM A57 CPU, a 128-core NVIDIA Maxwell GPU, and 4 GB LPDDR4 RAM; iii) the Jetson AGX Xavier boards, each of which is equipped with an 8-core CPU and a 512-core Volta GPU; and iv) an HP 9800 power meter (see the top-left part of Figure 4(a)) produced by Shenzhen HOPI Electronic Technology Ltd. Note that, along with the federated training process, we used the power meter to record the energy consumption of all the AIoT devices every second for the MARL environment construction.

5.2 Accuracy Comparison (RQ1)

To evaluate the effectiveness of our proposed DR-FL, Table 1 presents the best test accuracy of HeteroFL, ScaleFL, and our DR-FL under the specific energy constraints along the FL processes on the four datasets, assuming all the device batteries are initialized to be full. For each combination of dataset and FL method, we considered three kinds of data distributions for all local AIoT devices, where the non-IID settings follow the Dirichlet distributions controlled by α. Note that the baseline approaches (HeteroFL and ScaleFL) do not consider energy constraints in their FL procedure. To make a fair comparison, we added a greedy energy-aware model selection (each client selects the largest model that its remaining energy can still train) to the two baseline algorithms. The experiments were repeated five times to calculate the mean and variance.

From Table 1, it is evident that within the restricted battery energy conditions set for each device, DR-FL exhibits superior inference performance, surpassing the baseline algorithms in 29 out of the 36 evaluated scenarios. Specifically, no matter which dataset is used, in the scenario of α = 0.1 our method shows superior performance in comparison with the other baseline algorithms. Moreover, the performance of some models at α = 0.1 in DR-FL has even exceeded the performance of the two
baselines at α = 0.5. As an example shown in the non-IID scenario […], while HeteroFL only attains 66.31% and ScaleFL only gets 76.73% on Model_3. This is because our MARL-based dual-selection method can efficiently utilize the available energy of devices by assigning specific layer-wise models to participating devices that are more suitable for heterogeneous federated learning.

Figure 6: Learning curves of DR-FL and other baselines in AIoT systems with different numbers of devices under limited energy constraints. Subfigures include (a) CIFAR10 (α = 0.1) w/ 40 devices, (b) CIFAR10 (α = 0.1) w/ 60 devices, …, (e) SVHN (α = 0.1) w/ 40 devices, and (f) SVHN (α = 0.1) w/ 60 devices.

[Figure 5: remaining energy (J) of DR-FL and HeteroFL over communication rounds, broken down by device type (DR-FL_Nano, DR-FL_AGX, HeteroFL_Nano, HeteroFL_AGX).]

[…] subfigure, we use the notation X_Y to represent the total result of all […]; X denotes the total result involving all the devices. For example, in Figure 5(a), the legend DR-FL denotes the overall remaining energy of all 40 devices, while DR-FL_Nano represents the overall remaining energy of all the 20 Jetson Nano boards.

To explore the role of the validation set proportion in our method, we selected validation sets with different proportions (1%-10%) and used the non-IID dataset CIFAR10 (α = 0.1) as the exploration scenario. […]
…participate in global model training, where devices can effectively learn from each other through appropriate parts belonging to different layer-wise models. Comprehensive experiments performed on well-known datasets demonstrate the effectiveness of DR-FL in terms of inference performance, energy consumption, and scalability.

REFERENCES
[1] Saleh Baghersalimi, Tomás Teijeiro, David Atienza Alonso, and Amir Aminifar. 2021. Personalized Real-Time Federated Learning for Epileptic Seizure Detection. IEEE Journal of Biomedical and Health Informatics 26 (2021), 898–909. https://api.semanticscholar.org/CorpusID:235786959
[2] Kartikeya Bhardwaj, Wei Chen, and Radu Marculescu. 2020. INVITED: New Directions in Distributed Deep Learning: Bringing the Network at Forefront of IoT Design. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC). 1–6. https://api.semanticscholar.org/CorpusID:221293302
[3] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[4] Yangguang Cui, Kun Cao, Junlong Zhou, and Tongquan Wei. 2022. HELCFL: High-Efficiency and Low-Cost Federated Learning in Heterogeneous Mobile-Edge Computing. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1227–1232. https://api.semanticscholar.org/CorpusID:248922002
[5] Enmao Diao, Jie Ding, and Vahid Tarokh. 2021. HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients. In Proceedings of the International Conference on Learning Representations (ICLR).
[6] Rami Hamdi, Mingzhe Chen, Ahmed Ben Said, Marwa Qaraqe, and H. Vincent Poor. 2022. Federated Learning Over Energy Harvesting Wireless Networks. IEEE Internet of Things Journal 9, 1 (2022), 92–103.
[7] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[8] Fatih Ilhan, Gong Su, and Ling Liu. 2023. ScaleFL: Resource-Adaptive Federated Learning with Heterogeneous Clients. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Latif Ullah Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. 2020. Federated Learning for Internet of Things: Recent Advances, Taxonomy, and Open Challenges. IEEE Communications Surveys & Tutorials 23 (2020), 1759–1799. https://api.semanticscholar.org/CorpusID:221970627
[10] Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. https://api.semanticscholar.org/CorpusID:18268744
[11] Liang Li, Dian Shi, Ronghui Hou, Hui Li, Miao Pan, and Zhu Han. 2020. To Talk or to Work: Flexible Communication Compression for Energy Efficient Federated Learning over Heterogeneous Mobile Edge Devices. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 1–10. https://api.semanticscholar.org/CorpusID:229349304
[12] Li Li, Haoyi Xiong, Zhishan Guo, Jun Wang, and Chengzhong Xu. 2019. SmartPC: Hierarchical Pace Control in Real-Time Federated Learning System. In 2019 IEEE Real-Time Systems Symposium (RTSS). 406–418. https://api.semanticscholar.org/CorpusID:203582658
[13] H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics. https://api.semanticscholar.org/CorpusID:14955348
[14] H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
[15] Yuval Netzer, Tao Wang, Adam Coates, A. Bissacco, Bo Wu, and A. Ng. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. https://api.semanticscholar.org/CorpusID:16852518
[16] Dinh C. Nguyen, Ming Ding, Pubudu N. Pathirana, Aruna Prasad Seneviratne, Jun Li, and H. Vincent Poor. 2021. Federated Learning for Internet of Things: A Comprehensive Survey. IEEE Communications Surveys & Tutorials 23 (2021), 1622–1658. https://api.semanticscholar.org/CorpusID:233289549
[17] Tabish Rashid, Mikayel Samvelyan, C. S. D. Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ArXiv abs/1803.11485 (2018). https://api.semanticscholar.org/CorpusID:4533648
[18] Samarjit and Al Faruque. 2016. Automotive Cyber-Physical Systems: A Tutorial Introduction. https://api.semanticscholar.org/CorpusID:247235211
[19] Dian Shi, Liang Li, Rui Chen, Pavana Prakash, Miao Pan, and Yuguang Fang. 2021. Toward Energy-Efficient Federated Learning Over 5G+ Mobile Devices. IEEE Wireless Communications 29 (2021), 44–51. https://api.semanticscholar.org/CorpusID:231592874
[20] Yuxuan Sun, Sheng Zhou, and Deniz Gündüz. 2019. Energy-Aware Analog Aggregation for Federated Learning with Redundant Data. In ICC 2020 - 2020 IEEE International Conference on Communications (ICC). 1–7. https://api.semanticscholar.org/CorpusID:207869996
[21] Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2016. BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR). 2464–2469.
[22] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2019. A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR) 53 (2019), 1–33. https://api.semanticscholar.org/CorpusID:209439571
[23] Yawen Wu, Dewen Zeng, Zhepeng Wang, Yi Sheng, Lei Yang, Alaina J. James, Yiyu Shi, and Jingtong Hu. 2022. Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning. ArXiv abs/2202.07470 (2022). https://api.semanticscholar.org/CorpusID:245446614
[24] Jun Xia, Tian Liu, Zhiwei Ling, Ting Wang, Xin Fu, and Mingsong Chen. 2022. PervasiveFL: Pervasive Federated Learning for Heterogeneous IoT Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2022), 4100–4111.
[25] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747 (2017).
[26] Zhaohui Yang, Mingzhe Chen, Walid Saad, Choong Seon Hong, and Mohammad R. Shikh-Bahaei. 2019. Energy Efficient Federated Learning Over Wireless Communication Networks. IEEE Transactions on Wireless Communications 20 (2019), 1935–1949. https://api.semanticscholar.org/CorpusID:207880723
[27] Won Joon Yun, Yunseok Kwak, Hankyul Baek, Soyi Jung, Mingyue Ji, Mehdi Bennis, Jihong Park, and Joongheon Kim. 2023. SlimFL: Federated Learning With Superposition Coding Over Slimmable Neural Networks. IEEE/ACM Transactions on Networking (TON) 31, 6 (2023), 2499–2514.
[28] Jing Zhang and Dacheng Tao. 2020. Empowering Things With Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things. IEEE Internet of Things Journal 8 (2020), 7789–7817. https://api.semanticscholar.org/CorpusID:226975900
[29] Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. 2021. Self-Distillation: Towards Efficient and Compact Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 8 (2021), 4388–4403. https://api.semanticscholar.org/CorpusID:232302458
[30] Xinqian Zhang, Ming Hu, Jun Xia, Tongquan Wei, Mingsong Chen, and Shiyan Hu. 2021. Efficient Federated Learning for Cloud-Based AIoT Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40, 11 (2021), 2211–2223. https://doi.org/10.1109/TCAD.2020.3046665
[31] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. 2021. Data-Free Knowledge Distillation for Heterogeneous Federated Learning. Proceedings of Machine Learning Research 139 (2021), 12878–12889. https://api.semanticscholar.org/CorpusID:235125689