
Towards Energy-Aware Federated Learning via MARL: A Dual-Selection Approach for Model and Client

Jun Xia, University of Notre Dame, Notre Dame, IN, USA ([email protected])
Yiyu Shi, University of Notre Dame, Notre Dame, IN, USA ([email protected])

ABSTRACT

Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, its training performance and energy efficacy are severely restricted in practical battery-driven scenarios by the "wooden barrel effect" caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, due to the various differences among devices, it is hard for existing FL methods to conduct training effectively in energy-constrained scenarios, such as under device battery constraints. To tackle these issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike Vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively based on their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximize knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.

1 INTRODUCTION

The increasing popularity of Artificial Intelligence (AI) techniques, especially Deep Learning (DL), accelerates the significant evolution of the Internet of Things (IoT) toward the Artificial Intelligence of Things (AIoT), where various AIoT devices are equipped with DL models to enable accurate perception and intelligent control [2]. Although AIoT systems (e.g., autonomous driving, intelligent control [18], and healthcare systems [1, 23]) play an important role in various safety-critical domains, due to both the limited classification capabilities of local device models and the restricted access to private local data, it is hard to guarantee the training and inference performance of AIoT devices in Federated Learning (FL) [13], especially when they are powered by batteries and deployed within an uncertain dynamic environment [4]. To quickly improve the training and inference capability of devices, more and more large-scale AIoT systems resort to cloud computing [30], which offers tremendous computing power and flexible device management schemes. However, such a cloud-based architecture still cannot fundamentally improve the inference accuracy of AIoT devices, since they are not allowed to transmit private local data to each other. Due to concerns about data privacy, both the training and inference performance of local models are greatly suppressed.

As a promising collaborative machine learning paradigm, FL allows local DL model training among various devices without compromising their local data privacy. Instead of sharing local sensitive data among devices, FL only needs to send the gradients or weights of local device models to a cloud server for knowledge aggregation, thus enhancing both the training and inference capability of local models. Although FL is promising in knowledge sharing, it faces the problems of both large-scale deployment and quick adaptation to dynamic environments, where local models are required to be frequently trained to accommodate an ever-changing world. In practice, such problems are hard to solve, since classic Federated Averaging (e.g., FedAvg) methods require that all devices have homogeneous local models with the same architecture. According to the well-known "wooden barrel effect" caused by this homogeneous assumption, as shown in Figure 1, the energy consumption waste in Vanilla FL is usually due to two mismatches, i.e., the mismatch between computing power and the homogeneous model, and the mismatch between power consumption and the homogeneous model. The former wastes device energy on waiting time, while the latter wastes device energy on useless training time (only enough power to support training but not communication). Thus, such a homogeneous model assumption strongly limits the overall energy efficiency of the entire FL system, because the energy usage of the entire system is mainly determined by how much power is spent on effective model learning rather than on waiting, which consumes energy without contributing to training or communication.

Typically, an AIoT system involves various types of devices with different settings (i.e., computing power and remaining power). If all devices are equipped with homogeneous local models, the inference potential of devices with superior computing power will be eclipsed. Things become even worse when the devices of AIoT applications are powered by batteries. In this case, the devices with less battery energy will be reluctant to participate in frequent interactions with the cloud server. Moreover, if one device runs out of power at an early stage of FL training, it is hard for the global model to achieve the expected inference performance. Meanwhile, the overall inference performance of the global model will be strongly deteriorated by the absence of such exhausted devices in the following training process. Therefore, how to fully explore the potential of energy-constrained heterogeneous devices to enable high-performance and energy-efficient FL is becoming a major bottleneck in the design of an AIoT system.

Although various heterogeneous FL methods (e.g., HeteroFL [5], ScaleFL [8], PervasiveFL [24]) and energy-saving techniques [11, 12] have been investigated to address the above issue, most of them focus on either enabling effective knowledge sharing between heterogeneous models or reducing the energy consumption of devices.

Figure 1: The energy consumption waste of the "wooden barrel effect" in Vanilla FL is usually due to the following two reasons, i.e., the mismatch between computing power and the homogeneous model, and the mismatch between power consumption and the homogeneous model. The former uses device energy for waiting time, while the latter uses device energy for useless training time (only enough power to support training but not communication).

Based on the coarse-grained FedAvg operations, few of the existing FL methods can substantially address the above challenges and quickly adapt to new environments within an energy-constrained scenario. Inspired by the concepts of BranchyNet [21] and multi-agent reinforcement learning [29], in this paper we propose a novel FL framework named DR-FL, which takes both the layer-wise structure information of DL models and the remaining energy of each client into account to enable energy-efficient federated training. Unlike traditional FedAvg-based FL methods that rely on homogeneous device models, DR-FL maintains a layer-wise global model on the cloud server, while each device only installs a subset layer-wise model according to its computing power and remaining battery. In this way, all the heterogeneous local models can effectively contribute to the global model based on their computing capabilities and remaining energy in a MARL-based manner. Meanwhile, by adopting MARL, DR-FL can trade off training performance against energy consumption, thus ensuring energy-efficient FL training that accommodates various energy-constrained environments. This paper makes the following three major contributions:

• We establish a novel, lightweight cloud-based FL framework named DR-FL, which can be easily implemented and enables various heterogeneous DNNs to share knowledge without compromising their data privacy via layer-wise model aggregation.
• We propose a dual-selection approach based on MARL to control energy-efficient learning from the perspectives of both layer-wise models and participating clients, which can maximize the efficacy of the entire AIoT system.
• Experimental results obtained from both simulation and real test-bed platforms show that, compared with various state-of-the-art approaches, DR-FL can not only achieve better inference performance within various non-IID scenarios, but also have superior scalability for large-scale AIoT systems.

The rest of this paper is organized as follows. Section 2 discusses related work on heterogeneous FL and energy-aware FL training. After giving the preliminaries of FL and multi-agent reinforcement learning in Section 3, Section 4 details our proposed DR-FL method. Section 5 presents experimental results on well-known benchmarks. Finally, Section 6 concludes the paper.

2 RELATED WORK

Although FL is good at knowledge sharing without compromising the data privacy of devices in AIoT system design, due to the homogeneous assumption that all the involved devices should have local DL models with the same architecture, Vanilla FL methods inevitably suffer from the problems of low inference accuracy and invalid energy consumption, thus impeding the deployment of FL methods in large-scale AIoT system designs [9, 16, 24, 31], especially for non-IID scenarios.

To enable collaborative learning among heterogeneous device models, various solutions have been extensively studied, which can be primarily classified into two categories, i.e., subnetwork aggregation-based methods and knowledge distillation-based methods. The basic idea of subnetwork aggregation-based methods is to allow knowledge aggregation on top of subnetworks of local device models, which enables knowledge sharing among heterogeneous device models. For instance, Diao et al. [5] presented an effective heterogeneous FL framework named HeteroFL, which can train heterogeneous local models with varying computation complexities but still produce a single global inference model, assuming that device models are subnetworks of the global model. By integrating FL and width-adjustable slimmable neural networks, Yun et al. [27]

proposed a novel learning framework named SlimFL, which jointly utilizes superposition coding for global model aggregation and superposition training for updating local models. In [24], Xia et al. developed a novel framework named PervasiveFL, which utilizes a small uniform model (i.e., a "modellet") to enable heterogeneous FL. Although all the above heterogeneous FL methods are promising, most focus on improving inference performance; few of them take the issues of real-time training and energy efficiency into account.

Since a large-scale FL-based AIoT application typically involves a variety of devices powered by batteries, how to conduct energy-efficient FL training is becoming an important issue [19, 28]. To address this issue, various methods have been investigated to reduce the energy consumed by FL training and device-server communication. For example, Hamdi et al. [6] studied the FL deployment problem in an energy-harvesting wireless network, where a certain number of users may be unable to participate in FL due to interference and energy constraints. They formalized such a deployment scenario as a joint energy management and user scheduling problem over wireless systems, and solved it efficiently. In [20], Sun et al. presented an online energy-aware dynamic worker scheduling policy, which can maximize the average number of workers scheduled for gradient updates under a long-term energy constraint. In [26], Yang et al. formulated the energy-efficient transmission and computation resource allocation for FL over wireless communication networks as a joint learning and communication problem. To minimize system energy consumption under a latency constraint, they presented an iterative algorithm that can derive optimal solutions considering various factors (e.g., bandwidth allocation, power control, computation frequency, and learning accuracy). Although all the above energy-saving methods can effectively reduce energy consumption in both FL training and communication, few of them can guarantee the training time requirement of FL training within a complex dynamic environment.

To the best of our knowledge, DR-FL is the first attempt to investigate dual selection of both layer-wise models and participating clients based on MARL to enable fine-grained heterogeneous FL, where heterogeneous devices can adaptively and efficiently contribute to the global model based on their computing capabilities and remaining energy. Compared with state-of-the-art heterogeneous FL methods, DR-FL can not only maximize knowledge sharing among various heterogeneous models under energy constraints but also significantly improve both the model performance of each involved device and the energy efficacy of the entire FL system.

3 PRELIMINARIES

3.1 Federated Learning

With the prosperity of distributed machine learning technologies [22], privacy-aware FL has been proposed to effectively solve the problem of data silos, where multiple AIoT devices can achieve knowledge sharing without leaking their data privacy. Since the physical environment is volatile (i.e., high-latency networks and unstable connections) in real AIoT scenarios, Vanilla FL randomly selects a number of AIoT devices for each communication round of training a homogeneous DNN model. Suppose there are $N$ devices selected at the $t$-th communication round in FL. After the $t$-th communication round, the update process of each device model is defined as follows:

$$W_n^{t+1} \leftarrow W_n^t - \eta \nabla W_n^t, \qquad (1)$$

where $W_n^t$ and $W_n^{t+1}$ represent the global models at round $t$ and round $t+1$ on the $n$-th device, respectively, $\eta$ indicates the learning rate, and $\nabla W_n^t$ is the gradient obtained by the $n$-th device model after the $t$-th training round. To protect data privacy, at the end of each communication round, FL uploads the weight differences (i.e., model gradients) of each device, instead of the newly updated models, to the cloud for aggregation. After gathering the gradients from all the participating devices, the cloud updates the parameters of the shared global model based on the FedAvg [14] algorithm, which is defined as follows:

$$W^{t+1} \leftarrow W^t + \frac{\sum_{n=1}^{N} L_n \nabla W_n^t}{N}, \qquad (2)$$

where $\frac{\sum_{n=1}^{N} \nabla W_n^t}{N}$ denotes the average gradient of the $N$ participating devices in communication round $t$, $W^t$ and $W^{t+1}$ represent the global models after the $t$-th and $(t+1)$-th communication rounds, respectively, and $L_n$ denotes the training data size of device $n$. Although Vanilla FL methods (e.g., FedAvg) perform remarkably in distributed machine learning, they cannot be directly applied to AIoT scenarios. This is because heterogeneous AIoT devices lead to different training speeds in Vanilla FL, resulting in additional energy waste, which is unacceptable for an energy-constrained system.

3.2 Multi-Agent Reinforcement Learning

In cooperative Multi-Agent Reinforcement Learning (MARL), a set of $N$ agents is trained to produce optimal actions that lead to maximum team rewards. Specifically, at each timestamp $t$, each agent $n$ (where $1 \le n \le N$) observes its state $s_t^n$ and selects an action $a_t^n$ based on $s_t^n$. After all agents have completed their actions, the team receives a joint reward $r_t$ and transitions to the next state $s_{t+1}^n$. The goal is to maximize the total expected discounted reward $R = \sum_{t=1}^{T} \gamma^t r_t$ by selecting the optimal actions for each agent, where $\gamma \in [0, 1]$ is the discount factor.

Recently, QMIX [17] has emerged as a promising solution for jointly training agents in cooperative MARL. In QMIX, each agent $n$ employs a Deep Neural Network (DNN) to infer its actions. This DNN implements the $Q$-function $Q_\theta(s, a) = E[R_t \mid s_t^n = s, a_t^n = a]$, where $\theta$ represents the parameters of the DNN and $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$ is the total discounted team reward received from timestamp $t$. During MARL execution, each agent $n$ selects the action $a^*$ with the highest $Q$-value (i.e., $a^* = \arg\max_a Q_\theta(s^n, a)$).

To train QMIX, a replay buffer is employed to store transition tuples $(s_t^n, a_t^n, s_{t+1}^n, r_t)$ for each agent $n$. The joint $Q$-function $Q_{tot}(\cdot)$ is represented as the element-wise summation of all individual $Q$-functions (i.e., $Q_{tot}(s_t, a_t) = \sum_n Q_n^\theta(s_t^n, a_t^n)$), where $s_t = \{s_t^n\}$ and $a_t = \{a_t^n\}$ are the states and actions collected from all agents $n \in N$ at timestamp $t$. The agent DNNs can be recursively trained by minimizing the loss $L = E_{s_t, a_t, r_t, s_{t+1}}[(y_t - Q_{tot}(s_t, a_t))^2]$, where $y_t = r_t + \gamma \sum_n \max_a Q_n^{\theta'}(s_{t+1}^n, a)$ and $\theta'$ represents the parameters of the target network, which are periodically copied from $\theta$ during the training phase.
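For concreteness, Equations (1) and (2) can be sketched in a few lines of PyTorch-style Python. This is a minimal illustration; the helper names (local_step, fedavg_update) are ours, not part of any library:

```python
import copy
import torch

def local_step(model, loader, lr=0.05):
    """One round of local training (Eq. 1): W <- W - eta * grad."""
    criterion = torch.nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    before = copy.deepcopy(model.state_dict())
    for x, y in loader:
        opt.zero_grad()
        criterion(model(x), y).backward()
        opt.step()
    # Upload weight differences (the "gradients" of Eq. 2), not weights.
    after = model.state_dict()
    return {k: after[k] - before[k] for k in after}

def fedavg_update(global_weights, deltas, data_sizes):
    """Cloud-side FedAvg update (Eq. 2), weighting each delta by L_n."""
    n = len(deltas)
    for k in global_weights:
        global_weights[k] += sum(L * d[k] for L, d in zip(data_sizes, deltas)) / n
    return global_weights
```

Similarly, the additive joint $Q$-function and TD target of the QMIX-style training described above can be sketched as follows (network sizes and the batch layout are our assumptions for illustration):

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent Q-network mapping a state to Q-values over actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def td_loss(q_nets, target_nets, batch, gamma=0.99):
    """TD loss with an additive joint Q (Q_tot = sum_n Q_n).

    `batch` holds per-agent lists of states [B, state_dim], actions [B],
    and next states; `rewards` [B] is the shared team reward.
    """
    states, actions, rewards, next_states = batch
    q_tot = sum(net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                for net, s, a in zip(q_nets, states, actions))
    with torch.no_grad():  # target nets: params copied periodically
        next_max = sum(t(s2).max(dim=1).values
                       for t, s2 in zip(target_nets, next_states))
        y = rewards + gamma * next_max
    return torch.mean((y - q_tot) ** 2)
```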

4 METHOD

4.1 Problem Formulation

Assume that an energy-constrained FL system contains a cloud server and $N$ heterogeneous AIoT devices, represented as $D = \{D_1, ..., D_n, ..., D_N\}$. All these heterogeneous AIoT devices can be classified into three categories according to their computing capability, i.e., small, middle, and large, which denote the level of device computing and storage resources. In this paper, the performance of the entire FL system is significantly influenced by three key factors: running time, energy consumption, and model accuracy. Running time determines the training efficiency of the FL system in a real scenario. Energy consumption is also a significant factor, particularly for AIoT devices powered by limited energy resources. Lastly, model accuracy ensures that the system produces reliable and valuable predictions. Therefore, to optimize the overall performance of the FL system, it is crucial to balance these three factors.

Running Time Model: Considering the differences in network delay and computing resources of heterogeneous AIoT devices, the energy-constrained FL system aims to minimize the total running time $T_{all}$ among all the devices, which is defined as

$$T_{all} = \max_{\forall n} T_{all}^{D_n}. \qquad (3)$$

Let $T_{com}^{D_n}$ and $T_{tra}^{D_n}$ be the communication time of device $D_n$ and the training time of the layer-wise model on device $D_n$, respectively. Note that due to the abundant computing resources in the cloud server, its running time is negligible compared to that on devices. The total running time of each device, $T_{all}^{D_n}$, is defined as

$$T_{all}^{D_n} = T_{com}^{D_n} + T_{tra}^{D_n}. \qquad (4)$$

Here, the communication time of each device, $T_{com}^{D_n}$, can be regarded as the ratio of the size $S^{D_n}$ of a model with a given set of layers to the bandwidth speed $V_{net}$. Since the training time of each device, $T_{tra}^{D_n}$, is determined by the computation capability $C_{D_n}$ of the local device and the training data size $L_{D_n}$ on the device, we formalize the communication time $T_{com}^{D_n}$ and training time $T_{tra}^{D_n}$ as

$$T_{com}^{D_n} = \frac{S^{D_n}}{V_{net}}, \qquad T_{tra}^{D_n} = \frac{L_{D_n}}{C_{D_n}}, \qquad (5)$$

where the training workload $O_{D_n}$ of device $D_n$ is reflected by its computation capability $C_{D_n}$. We assume that the network transmission speed remains relatively stable.

Energy Consumption Model: The energy consumed by the overall FL system plays an important role in ensuring that the system operates smoothly. The total remaining energy can be expressed as

$$E_{all} = \sum_{n=1}^{N} \left( E_{remain}^{D_n} - E_{tra}^{D_n} - E_{com}^{D_n} \right). \qquad (6)$$

Note that both training and communication energy consumption are decided by two factors, i.e., the size of the training model and the power mode of the AIoT device. The training energy consumption $E_{tra}^{D_n}$ and communication energy consumption $E_{com}^{D_n}$ of device $D_n$ are calculated as

$$E_{tra}^{D_n} = P_{train} \times T_{tra}^{D_n}, \qquad E_{com}^{D_n} = P_{com} \times T_{com}^{D_n}, \qquad (7)$$

where $P_{train}$ is the energy consumption per unit training time, and $P_{com}$ is the energy consumption per unit network transmission time. Note that since actual energy consumption is intrinsically related to the size of the trained model, variations in model size lead to fluctuations in the energy consumed during both the training and communication processes. Therefore, it is of utmost importance to consider these energy dynamics when addressing the optimization model.

Model Accuracy: In a heterogeneous scenario, how to effectively leverage the heterogeneity of models and devices to enhance the performance of aggregated models is an urgent issue in FL. Furthermore, resource-constrained heterogeneous AIoT devices that participate in aggregation pose a considerable obstacle to the application of energy-constrained FL. Inspired by the work [12], where the inference performance of a model is affected by the number of successful aggregations of its device, we can deduce that the accuracy of heterogeneous models is proportional to the total number of aggregated models participating in each round. However, since devices consume energy every time they participate in a round of aggregation, how to reasonably select aggregation devices in an energy-constrained environment to improve model accuracy has become a major challenge in the design of an FL framework.

Optimization Objective: Taking energy information into account, we propose an optimization model for energy-constrained FL. This model aims to minimize the total running time $T_{all}$ and maximize the model accuracy $M_{acc}$ under a total energy consumption constraint on $E_{all}$, which is defined as follows:

$$\min T_{all}, \quad \max M_{acc}, \quad \text{s.t.} \quad E_{all} \leq E, \qquad (8)$$

where $E$ is the energy budget of the FL system.

4.2 Workflow of DR-FL

In DR-FL, heterogeneous AIoT devices and a cloud server cooperate to achieve high performance for the various layer-wise models deployed on edge devices. Before training, all devices participating in DR-FL initialize and install a layer-wise model, which is a subset of the layer-wise global model on the cloud server. Then, the cloud server sends a part of the global model to the AIoT devices for local training. At the end of local training, DR-FL performs layer-wise model aggregation on the cloud server. Note that hot-plug AIoT devices are permissible in DR-FL, where newly involved devices simply inherit the parameters of the global model on the cloud server. Figure 2 shows the workflow of DR-FL, which consists of the following five steps.

Step 1 (Battery or Model Information Upload): During the initialization step of DR-FL, each device intending to participate in FL should upload its device information to the cloud, which includes the power, computing, and storage capabilities of the device and the overclocking potential of its models. This collected information is used for the energy-aware dual-selection of layer-wise models and clients in subsequent steps, to optimize the entire system's energy efficiency.

Step 2 (Layer-Wise Model Aggregation): After receiving the participating devices' local model gradients, this step performs layer-aligned averaging (only the same parts of the network are aggregated) on such gradients and uses the previous round's global model stored on the server to construct a new global model, as sketched below.
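The timing/energy model of Section 4.1 and the layer-aligned averaging of Step 2 can both be made concrete with short sketches. Both are illustrative only, under the assumptions that devices are described by plain dictionaries and that each client uploads a state-dict-style mapping containing only the layers it trained; none of the names below are a fixed API:

```python
import torch

def system_cost(devices):
    """Total running time (Eq. 3) and remaining energy (Eqs. 4-7).

    `devices` is a list of dicts with illustrative keys:
    size, v_net, data, cap, e_remain, p_train, p_com.
    """
    t_all, e_all = 0.0, 0.0
    for d in devices:
        t_com = d["size"] / d["v_net"]    # T_com = S / V_net  (Eq. 5)
        t_tra = d["data"] / d["cap"]      # T_tra = L / C      (Eq. 5)
        t_all = max(t_all, t_com + t_tra)                     # Eqs. 3-4
        e_all += (d["e_remain"] - d["p_train"] * t_tra
                  - d["p_com"] * t_com)                       # Eqs. 6-7
    return t_all, e_all

def layer_aligned_aggregate(global_sd, client_deltas):
    """Step 2: average gradients per layer over the clients holding it.

    global_sd:     {layer_name: tensor} global model parameters
    client_deltas: list of {layer_name: tensor} partial gradient dicts
    """
    new_sd = {}
    for name, w in global_sd.items():
        # Only clients whose sub-model contains this layer contribute.
        deltas = [d[name] for d in client_deltas if name in d]
        if deltas:
            new_sd[name] = w + torch.stack(deltas).mean(dim=0)
        else:
            new_sd[name] = w.clone()  # untouched layers keep old values
    return new_sd
```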

Figure 2: Framework and workflow of our method.


Step 3 (Energy-Aware MARL-based Dual-Selection): Then, to prevent selected devices from dropping out of the FL process due to energy limitations, we design a MARL-based selector that can choose an appropriate model for each AIoT device based on its remaining energy and computing capabilities, which can not only improve the efficiency of device resource usage but also ensure their active participation in FL (see more details in Section 4.3). Furthermore, apart from selecting a layer-wise model for each AIoT device, the selector can also adjust the computing capability of AIoT devices, aiming to achieve a trade-off between energy consumption and computing efficiency.

Step 4 (Layer-Wise Model Dispatching): Based on the energy-aware MARL-based dual-selection strategy, the cloud server dispatches part of the global model parameters to each heterogeneous AIoT device.

Step 5 (Local Training): Based on the received global model parameters, each heterogeneous AIoT device builds an initial local model (i.e., a layer-wise model), which is trained with cross-entropy loss on local training samples to obtain the gradients of the local model for gradient upload.

DR-FL repeats all five steps above until the global model and all its local models converge.

Figure 3: Maximum Q Value Guided Dual-Selection. There are two networks, i.e., the model selection network and the device evaluation network. The model selection network is computed from the value $O$ observed by the agent from the environment and the action set $A_{t-1}$ of the previous round, thereby obtaining the latest action and its corresponding Q value. The device evaluation network obtains the Q values of all devices and then uses a hybrid network to combine all Q values and the current timestamp state $S_t$, through a two-layer weight matrix, into an overall Q value $Q_{tot}$. The network then uses the discounted rewards given by the environment for MARL, so that the agents can obtain their own rewards from the environment. $h$ denotes the MLP for extracting deep representations of states or actions; $|\cdot|$ denotes the dot product.

4.3 Dual-Selection for Local Model and Client

4.3.1 MARL Training Process: In DR-FL, an energy-aware MARL-based dual-selection method is used to select the participating devices and the layers of the corresponding local models running on them. To better capture the connections between long-term/short-term rewards and strategies, each MARL agent is designed with two Multi-Layer Perceptrons (MLPs) and a Gated Recurrent Unit (GRU) [3], as shown in Figure 3. During the training procedure of MARL, each agent acquires its current state $S_t$ and selects an action $a_t^n$ for each client. Based on both client selection and layer-wise model considerations, the central server computes team rewards by considering the validation accuracy improvement of the global model $M_{acc}$, the total runtime $T_{all}$, the computation capabilities $C$, and the remaining energy $E_{all}$ of each device. The MARL agents are then trained with the QMIX algorithm [17] to maximize the system rewards (see the design details in Section 4.3.4).
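Before detailing the state, action, and reward designs, the following sketch shows how one DR-FL round wires the Section 4.2 workflow to this MARL-based selector. Every function name here is a placeholder for the components described in the text, not part of our released code:

```python
def dr_fl_round(server, clients, agents, k):
    """One schematic DR-FL communication round (Steps 1-5)."""
    # Step 1: each client reports battery, computing and model info.
    states = [c.report_status() for c in clients]
    # Step 3: agents score (client, layer-wise model) pairs; the Top-K
    # clients with the highest Q-values join this round.
    q_values = [agent.q_values(s) for agent, s in zip(agents, states)]
    selection = server.top_k_dual_select(q_values, k)
    # Step 4: dispatch the selected subset of global-model layers.
    for client_id, layers in selection.items():
        clients[client_id].receive(server.sub_model(layers))
    # Step 5: local training with cross-entropy loss; upload gradients.
    grads = [clients[cid].train_local() for cid in selection]
    # Step 2: layer-aligned aggregation into the new global model.
    server.aggregate_layerwise(grads)
    # Team reward (Section 4.3.4) is computed from validation accuracy
    # gain, energy consumed, and the slowest device's running time.
```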

4.3.2 MARL Agent State Design: The state of each MARL agent $D_n$ comprises three components: the remaining energy $E_{all}^{D_n}$, the computation capability $C_{D_n}$ in each communication round, and the size of the local training dataset $L_{D_n}$. At each training round $t$, each agent initially conducts the training procedure and transmits its gradients to the central server. Furthermore, to estimate the current training and communication delays at client device $n$, each MARL agent keeps a record of the training latency $T_{tra}^{D_n}$ and communication latency $T_{com}^{D_n}$, which denote the latency of local training and model uploading for agent $n$ during communication round $t$. As shown in Figure 3, the parameter $\tau$ represents the trajectory of historical training data, and $h$ represents the MLP layer for knowledge extraction. Moreover, each MARL agent $n$ also calculates the energy consumption of training and communication based on Equation 7. This inclusion is crucial, as these energy costs contribute to the overall energy cost, while the remaining energy of the agent influences both training latency and model accuracy. The state vector $s_t^n$ of agent $n$ in communication round $t$ is defined as:

$$s_t^n = [L_t^n, C_{D_n}, E_{D_n}, t]. \qquad (9)$$

Finally, to decrease storage overhead and accelerate agent convergence, all MLPs and GRUs within the MARL agents share their weights.

4.3.3 Agent Action Design: Given the input state shown in Equation 9, each MARL agent $n$ determines which layers of the local model should be used for the local training process on each device. Specifically, the MARL agent generates $Q$ values for the current action set $[a_0, \ldots, a_M]$, where $M$ represents the number of model selections available to the client. Note that when the selected action is zero, the client device runs the first model, and when the selected action is $M$, the client does not participate in FL. After a layer-wise model is selected for each heterogeneous device, the devices with the highest $Q$ values among all agents are chosen through the Top-K algorithm to participate in the FL process.

4.3.4 Reward Function Design: To optimize the objective described in Equation 8, the reward function should reflect the changes in model accuracy, processing latency (training, communication and waiting latency), and processing energy consumption after executing the dual-selection strategy generated by the MARL agents. The reward $r_t$ at training round $t$ is defined as follows:

$$r_t = w_1 \cdot (M_{Acc}^{t} - M_{Acc}^{t-1}) - w_2 \cdot (E_{all}^{t-1} - E_{all}^{t}) - w_3 \cdot \max_{1 \le n \le N} T_{all}^{t,n}. \qquad (10)$$

Here, $\max_{1 \le n \le N} T_{all}^{t,n}$ represents the total time needed for the local training of all selected devices. The MARL agents utilize the evaluation accuracy calculated on a small validation dataset on the cloud server to select the layer-wise model that will be dispatched to each local device, which then continues local training and uploads its model updates. Moreover, $w_1$, $w_2$, and $w_3$ (we used $w_1 = 1000$, $w_2 = 0.01$, $w_3 = 1$ in our experiments) are normalization ratios that make all the reward terms play comparable roles in the overall reward. $E_{all}^{t}$ is the total remaining energy after the $t$-th communication round as defined in Equation 6. The MARL agents are trained using QMIX as described in Figure 3.

5 EXPERIMENTAL RESULTS

To evaluate the performance of our proposed method, we implemented the DR-FL algorithm using PyTorch (version 1.4.0). Similar to FedAvg, we assume that only 10% of the AIoT devices are involved in each round of FL communication during the training period. For DR-FL and the other heterogeneous FL methods, we set the mini-batch size to 32. The number of local training epochs and the initial learning rate were 5 and 0.05, respectively. To simulate a variety of energy-constrained scenarios, we assume that each device is powered by a battery with a maximum capacity of 7,560 joules; in other words, each battery has a capacity of 1500 mAh at a rated voltage of 5.04 V. We conducted comprehensive experiments to answer the following four Research Questions (RQs).

RQ1 (Superiority of DR-FL): What advantages can DR-FL achieve compared with state-of-the-art heterogeneous FL methods?
RQ2 (Benefits of MARL-based Dual-Selection): What benefits does MARL-based dual-selection provide during DR-FL learning, especially under constraints such as device energy and overall training time, compared with other SOTA heterogeneous FL methods?
RQ3 (Scalability of DR-FL): How does the number of AIoT devices participating in knowledge sharing affect the performance of DR-FL?
RQ4 (Exploration of the Validation Data Ratio): How does the proportion of validation data in MARL affect the performance of DR-FL?

5.1 Experimental Settings

5.1.1 Model Settings. We compared our DR-FL method with two typical state-of-the-art heterogeneous FL methods, i.e., HeteroFL [5] and ScaleFL [8], which belong to subnetwork aggregation-based methods and knowledge distillation-based methods, respectively. We set the ResNet-18 model [7] as the backbone, where each block of the ResNet-18 model is followed by a new pair of bottleneck and classifier, thus forming four heterogeneous layer-wise models that simulate four types of heterogeneous models (i.e., Models 1-4 shown in Table 1). Note that each layer-wise model can be reused with the same backbone for the purpose of model inference.

5.1.2 Dataset Settings. To evaluate the effectiveness of DR-FL, we considered four training datasets: CIFAR10, CIFAR100 [10], Street View House Numbers (SVHN) [15], and Fashion-MNIST [25]. CIFAR10: The CIFAR10 dataset consists of 60,000 32×32 colour images across ten classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 testing images. CIFAR100: The CIFAR100 dataset is similar to CIFAR10 but contains 100 classes instead of 10, with 600 images per class. The dataset also comprises 50,000 training images and 10,000 testing images. SVHN: The SVHN dataset is a real-world image dataset derived from house numbers in Google Street View images. It contains over 600,000 labelled digit images, where each image is a 32×32 colour image representing a single digit (0-9). Fashion-MNIST: The Fashion-MNIST dataset is a dataset of Zalando's article images, consisting of 70,000 28×28 grayscale images of 10 different fashion categories.

Table 1: Test accuracy (%) comparison for different models and dataset settings under specific energy constraints with 40 clients.

Dataset CIFAR10
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 30.46 ± 1.10 46.11 ± 3.32 65.23 ± 1.45 29.25 ± 1.17 54.44 ± 0.87 58.15 ± 4.32 58.69 ± 0.73 59.01 ± 0.85 76.46 ± 0.12
Model_2 48.41 ± 1.24 62.55 ± 3.45 62.10 ± 3.24 41.66 ± 5.43 55.46 ± 3.87 71.48 ± 1.23 65.31 ± 1.54 75.93 ± 0.62 77.43 ± 2.77
Model_3 34.85 ± 5.79 65.01 ± 1.79 74.78 ± 2.76 39.92 ± 2.75 60.07 ± 0.68 70.83 ± 1.43 72.71 ± 0.58 70.64 ± 1.40 71.54 ± 1.54
Model_4 45.26 ± 3.68 69.65 ± 2.99 75.14 ± 1.13 46.59 ± 3.43 70.60 ± 4.54 73.90 ± 1.17 70.76 ± 1.30 69.37 ± 0.45 72.27 ± 1.73
Dataset CIFAR100
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 11.86 ± 0.78 22.56 ± 2.13 25.66 ± 1.13 13.14 ± 1.96 21.39 ± 1.59 17.58 ± 0.43 26.25 ± 0.23 33.59 ± 3.32 39.65 ± 1.35
Model_2 16.33 ± 3.34 25.98 ± 1.72 28.68 ± 0.57 12.67 ± 2.13 28.77 ± 4.33 29.84 ± 1.39 17.83 ± 0.75 39.50 ± 1.08 33.55 ± 0.45
Model_3 14.18 ± 0.29 31.99 ± 0.53 31.31 ± 3.34 17.12 ± 2.88 30.04 ± 1.91 33.92 ± 2.34 26.46 ± 0.24 32.10 ± 1.12 33.40 ± 0.13
Model_4 15.66 ± 0.78 29.33 ± 0.85 35.44 ± 1.54 19.24 ± 1.22 30.29 ± 1.03 33.23 ± 1.32 22.55 ± 0.73 32.55 ± 1.45 33.80 ± 1.25
Dataset SVHN
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 60.08 ± 3.23 46.02 ± 3.32 60.38 ± 1.39 47.90 ± 0.53 85.79 ± 2.22 88.91 ± 1.11 67.19 ± 0.32 91.58 ± 0.21 68.78 ± 1.33
Model_2 65.11 ± 4.32 54.83 ± 1.28 68.90 ± 2.87 50.26 ± 2.21 86.82 ± 2.51 85.16 ± 4.13 79.86 ± 0.87 85.30 ± 1.19 91.72 ± 0.94
Model_3 65.93 ± 4.56 69.20 ± 4.19 75.97 ± 1.84 76.73 ± 2.23 84.91 ± 0.68 88.70 ± 3.25 91.47 ± 0.17 88.61 ± 1.72 93.45 ± 0.37
Model_4 66.31 ± 3.09 71.34 ± 0.79 76.14 ± 1.90 55.27 ± 3.23 86.10 ± 3.56 92.47 ± 0.51 91.11 ± 1.32 89.26 ± 0.75 92.78 ± 0.54
Dataset Fashion-MNIST
Methods HeteroFL [5] ScaleFL [8] DR-FL (Ours)
Distribution 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0 𝛼 =0.1 𝛼 =0.5 𝛼 =1.0
Model_1 45.06 ± 2.01 85.58 ± 1.31 87.00 ± 1.93 53.78 ± 0.98 74.26 ± 2.34 87.29 ± 0.93 80.15 ± 0.23 82.25 ± 0.19 87.10 ± 0.37
Model_2 59.76 ± 0.46 85.75 ± 0.63 88.60 ± 0.34 57.19 ± 3.13 85.32 ± 2.51 87.44 ± 0.55 82.10 ± 0.39 88.76 ± 0.23 85.22 ± 0.34
Model_3 57.25 ± 0.98 83.26 ± 3.27 87.75 ± 1.25 62.26 ± 1.34 87.69 ± 1.07 88.47 ± 0.97 86.88 ± 0.23 89.34 ± 0.62 90.52 ± 0.13
Model_4 56.32 ± 4.07 87.82 ± 1.28 87.83 ± 0.56 55.85 ± 1.51 86.78 ± 3.27 88.40 ± 0.69 85.80 ± 0.17 89.36 ± 0.11 89.60 ± 0.29

In subsequent experiments, we investigated three non-Independent and Identically Distributed (non-IID) distributions for each dataset. Similar to the work of HeteroFL [5], we constructed non-IID local training datasets using heterogeneous data splits following a Dirichlet distribution controlled by a variable 𝛼; a smaller value of 𝛼 corresponds to a higher degree of non-IID skew. Meanwhile, we used the same data augmentation technologies as HeteroFL [5] to fully utilize the natural image datasets. To enable MARL training on the cloud server in DR-FL, we used 4% of the overall training data as the validation set on the server. Note that the validation set on the server does not overlap with the local training datasets hosted by AIoT devices.
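As an illustration of this kind of Dirichlet-based split (our sketch, not the exact HeteroFL code), each class's samples can be spread over clients with proportions drawn from a Dirichlet(𝛼) distribution:

```python
import numpy as np

def dirichlet_split(labels, n_clients, alpha, seed=0):
    """Partition sample indices into non-IID client shards.

    For each class, a Dirichlet(alpha) draw decides how its samples
    are spread over clients; smaller alpha gives more skewed shards.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards
```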
Figure 4: Real test-bed platform for our experiment. (a) AIoT devices; (b) the server.

5.1.3 Test-bed Settings. Besides simulation-based evaluation, we constructed a physical test-bed platform, shown in Figure 4, to check the performance of DR-FL in a real-world environment. The test-bed consists of four parts: i) a cloud server built on top of an Ubuntu workstation equipped with an Intel i9 CPU, 32GB of memory, and a GTX 3090 GPU; ii) Jetson Nano boards, each with a quad-core ARM A57 CPU, a 128-core NVIDIA Maxwell GPU, and 4GB of LPDDR4 RAM; iii) Jetson AGX Xavier boards, each equipped with an 8-core CPU and a 512-core Volta GPU; and iv) an HP 9800 power meter (see the top-left part of Figure 4(a)) produced by Shenzhen HOPI Electronic Technology Ltd. Note that, along with the federated training process, we used the power meter to record the energy consumption of all the AIoT devices every second for the MARL environment construction.

5.2 Accuracy Comparison (RQ1)

To evaluate the effectiveness of our proposed DR-FL, Table 1 presents the best test accuracy of HeteroFL, ScaleFL, and DR-FL under the specific energy constraints along the FL processes on the four datasets, assuming all device batteries are initially full. For each combination of dataset and FL method, we considered three data distributions for the local AIoT devices, where the non-IID settings follow Dirichlet distributions controlled by 𝛼. Note that the baseline approaches (HeteroFL and ScaleFL) do not consider energy constraints in their FL procedures. To make a fair comparison, we added a greedy energy-awareness rule to the two baselines, whose model selection picks the largest model that can still be trained. The experiments were repeated five times to calculate the mean and variance.

From Table 1, it is evident that under the restricted battery energy conditions set for each device, DR-FL exhibits superior inference performance, surpassing the baselines in 29 out of the 36 evaluated scenarios. In particular, regardless of the dataset, our method outperforms the other baseline algorithms in the 𝛼 = 0.1 scenario.

Moreover, the performance of some models at 𝛼 = 0.1 in DR-FL even exceeds that of the two baselines at 𝛼 = 0.5. As an example, in the non-IID scenario of SVHN with 𝛼 = 0.1, the test accuracy of DR-FL reaches 91.47% on Model_3, while HeteroFL only attains 66.31% and ScaleFL only 76.73%. This is because our MARL-based dual-selection method can efficiently utilize the available energy of devices by assigning specific layer-wise models to the participating devices that are more suitable for heterogeneous federated learning.

5.3 Comparison of Energy Consumption (RQ2)

To validate the performance of our DR-FL method in terms of energy consumption and running time, we conducted an experiment involving a total of 40 devices (i.e., 20 Jetson Nano boards and 20 AGX Xavier boards). Figure 5 compares the total remaining energy variation and the running time of the federated training processes using HeteroFL (ScaleFL has the same energy consumption and running time under the greedy algorithm) and DR-FL, respectively. For each subfigure, we use the notation 𝑋_𝑌 to represent the total result of all the devices of type 𝑌 using method 𝑋. If 𝑌 is omitted, the notation 𝑋 denotes the total result involving all the devices. For example, in Figure 5(a), the legend DR-FL denotes the overall remaining energy of all 40 devices, while DR-FL_Nano represents the overall remaining energy of all 20 Jetson Nano boards.

Figure 5: Comparison of total energy consumption and running time. (a) Total energy variation; (b) total running time.

Figure 5(a) shows that our method can sustain more training rounds under the same energy constraints, thus leading to better overall test accuracy and energy efficacy. For example, for HeteroFL, the Jetson AGX Xavier-based devices ran out of battery in the 12th round, whereas for DR-FL they ran out of battery in the 18th round. Moreover, in Figure 5(b), we can clearly find an inflexion point at the 12th round for HeteroFL, after which only Jetson Nano-based devices are involved in federated training. For DR-FL, in contrast, we observe the inflexion point at the 15th round, indicating the effectiveness of the MARL algorithm in controlling the energy waste of devices by reducing useless waiting and training time.

5.4 Scalability Analysis (RQ3)

Figure 6 compares the test accuracy of the three methods (i.e., HeteroFL, ScaleFL, and DR-FL) for various non-IID scenarios with different numbers of devices under specific energy constraints. From this figure, we can observe that when more heterogeneous devices participate in FL, the superiority of DR-FL becomes more significant relative to the other two methods. For example, for the non-IID scenario of CIFAR10 (with 𝛼 = 0.1), DR-FL consistently achieves higher test accuracy than ScaleFL and HeteroFL.

Figure 6: Learning curves of DR-FL and other baselines in AIoT systems with different numbers of devices under limited energy constraints. (a) CIFAR10 (𝛼 = 0.1) w/ 40 devices; (b) CIFAR10 (𝛼 = 0.1) w/ 60 devices; (c) Fashion-MNIST (𝛼 = 0.1) w/ 40 devices; (d) Fashion-MNIST (𝛼 = 0.1) w/ 60 devices; (e) SVHN (𝛼 = 0.1) w/ 40 devices; (f) SVHN (𝛼 = 0.1) w/ 60 devices.

5.5 Ablation Study (RQ4)

To explore the role of the validation set proportion in our method, validation sets with different proportions (1%-10%) were evaluated on the non-IID dataset CIFAR10 (𝛼 = 0.1). From Table 2, we can see that as the size of the validation set increases, the overall test accuracy initially rises, and once the proportion exceeds 4%, the accuracy decreases. This phenomenon shows that the proportion can be used as an effective tuning knob to explore the trade-off between the amount of cloud validation data and the overall DR-FL performance. We found that a validation data ratio of 4% provided a reasonable balance, so we picked 4% and used it in all experiments.

Table 2: Average model accuracy with different percentages of the validation dataset
Percentage 1% 2% 3% 4% 5% 6% 7% 8% 9% 10%
Accuracy (%) 57.72 63.23 64.35 65.04 63.16 59.18 58.86 52.21 54.9975 55.69

6 CONCLUSION

Federated Learning (FL) is expected to enable privacy-preserving collaborative learning among Artificial Intelligence of Things (AIoT) devices. However, due to various heterogeneous settings (e.g., non-IID data, device models with different architectures) and device resource constraints (e.g., computing power and energy capacity), existing FL-based AIoT designs greatly suffer from the problems of low inference accuracy, rapid battery consumption, and long training time. To address these issues, this paper introduces a novel FL framework that enables efficient knowledge sharing between heterogeneous devices under specific energy constraints. Based on our proposed layer-wise aggregation method and MARL-based dual-selection mechanism, AIoT devices with different computational and energy capabilities can adaptively select appropriate local models to

participate in global model training, where devices can effectively learn from each other through the appropriate parts of different layer-wise models. Comprehensive experiments performed on well-known datasets demonstrate the effectiveness of DR-FL in terms of inference performance, energy consumption, and scalability.

REFERENCES
[1] Saleh Baghersalimi, Tomás Teijeiro, David Atienza Alonso, and Amir Aminifar. 2021. Personalized Real-Time Federated Learning for Epileptic Seizure Detection. IEEE Journal of Biomedical and Health Informatics 26 (2021), 898–909. https://api.semanticscholar.org/CorpusID:235786959
[2] Kartikeya Bhardwaj, Wei Chen, and Radu Marculescu. 2020. INVITED: New Directions in Distributed Deep Learning: Bringing the Network at Forefront of IoT Design. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC). 1–6. https://api.semanticscholar.org/CorpusID:221293302
[3] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
[4] Yangguang Cui, Kun Cao, Junlong Zhou, and Tongquan Wei. 2022. HELCFL: High-Efficiency and Low-Cost Federated Learning in Heterogeneous Mobile-Edge Computing. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1227–1232. https://api.semanticscholar.org/CorpusID:248922002
[5] Enmao Diao, Jie Ding, and Vahid Tarokh. 2021. HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients. In Proceedings of the International Conference on Learning Representations (ICLR).
[6] Rami Hamdi, Mingzhe Chen, Ahmed Ben Said, Marwa Qaraqe, and H. Vincent Poor. 2022. Federated Learning Over Energy Harvesting Wireless Networks. IEEE Internet of Things Journal 9, 1 (2022), 92–103.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[8] Fatih Ilhan, Gong Su, and Ling Liu. 2023. ScaleFL: Resource-Adaptive Federated Learning with Heterogeneous Clients. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Latif Ullah Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. 2020. Federated Learning for Internet of Things: Recent Advances, Taxonomy, and Open Challenges. IEEE Communications Surveys & Tutorials 23 (2020), 1759–1799. https://api.semanticscholar.org/CorpusID:221970627
[10] Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. https://api.semanticscholar.org/CorpusID:18268744
[11] Liang Li, Dian Shi, Ronghui Hou, Hui Li, Miao Pan, and Zhu Han. 2020. To Talk or to Work: Flexible Communication Compression for Energy Efficient Federated Learning over Heterogeneous Mobile Edge Devices. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 1–10. https://api.semanticscholar.org/CorpusID:229349304
[12] Li Li, Haoyi Xiong, Zhishan Guo, Jun Wang, and Chengzhong Xu. 2019. SmartPC: Hierarchical Pace Control in Real-Time Federated Learning System. In 2019 IEEE Real-Time Systems Symposium (RTSS). 406–418. https://api.semanticscholar.org/CorpusID:203582658
[13] H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). https://api.semanticscholar.org/CorpusID:14955348
[14] H. B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
[15] Yuval Netzer, Tao Wang, Adam Coates, A. Bissacco, Bo Wu, and A. Ng. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. https://api.semanticscholar.org/CorpusID:16852518
[16] Dinh C. Nguyen, Ming Ding, Pubudu N. Pathirana, Aruna Seneviratne, Jun Li, and H. Vincent Poor. 2021. Federated Learning for Internet of Things: A Comprehensive Survey. IEEE Communications Surveys & Tutorials 23 (2021), 1622–1658. https://api.semanticscholar.org/CorpusID:233289549
[17] Tabish Rashid, Mikayel Samvelyan, C. S. D. Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ArXiv abs/1803.11485 (2018). https://api.semanticscholar.org/CorpusID:4533648
[18] Samarjit Chakraborty and Mohammad Abdullah Al Faruque. 2016. Automotive Cyber-Physical Systems: A Tutorial Introduction. https://api.semanticscholar.org/CorpusID:247235211
[19] Dian Shi, Liang Li, Rui Chen, Pavana Prakash, Miao Pan, and Yuguang Fang. 2021. Toward Energy-Efficient Federated Learning Over 5G+ Mobile Devices. IEEE Wireless Communications 29 (2021), 44–51. https://api.semanticscholar.org/CorpusID:231592874
[20] Yuxuan Sun, Sheng Zhou, and Deniz Gündüz. 2019. Energy-Aware Analog Aggregation for Federated Learning with Redundant Data. In ICC 2020 - 2020 IEEE International Conference on Communications (ICC). 1–7. https://api.semanticscholar.org/CorpusID:207869996
[21] Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2016. BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR). 2464–2469.
[22] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2019. A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR) 53 (2019), 1–33. https://api.semanticscholar.org/CorpusID:209439571
[23] Yawen Wu, Dewen Zeng, Zhepeng Wang, Yi Sheng, Lei Yang, Alaina J. James, Yiyu Shi, and Jingtong Hu. 2022. Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning. ArXiv abs/2202.07470 (2022). https://api.semanticscholar.org/CorpusID:245446614
[24] Jun Xia, Tian Liu, Zhiwei Ling, Ting Wang, Xin Fu, and Mingsong Chen. 2022. PervasiveFL: Pervasive Federated Learning for Heterogeneous IoT Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2022), 4100–4111.
[25] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv:1708.07747 (2017).
[26] Zhaohui Yang, Mingzhe Chen, Walid Saad, Choong Seon Hong, and Mohammad R. Shikh-Bahaei. 2019. Energy Efficient Federated Learning Over Wireless Communication Networks. IEEE Transactions on Wireless Communications 20 (2019), 1935–1949. https://api.semanticscholar.org/CorpusID:207880723
[27] Won Joon Yun, Yunseok Kwak, Hankyul Baek, Soyi Jung, Mingyue Ji, Mehdi Bennis, Jihong Park, and Joongheon Kim. 2023. SlimFL: Federated Learning With Superposition Coding Over Slimmable Neural Networks. IEEE/ACM Transactions on Networking (TON) 31, 6 (2023), 2499–2514.
[28] Jing Zhang and Dacheng Tao. 2020. Empowering Things With Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things. IEEE Internet of Things Journal 8 (2020), 7789–7817. https://api.semanticscholar.org/CorpusID:226975900
[29] Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. 2021. Self-Distillation: Towards Efficient and Compact Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 8 (2021), 4388–4403. https://api.semanticscholar.org/CorpusID:232302458
[30] Xinqian Zhang, Ming Hu, Jun Xia, Tongquan Wei, Mingsong Chen, and Shiyan Hu. 2021. Efficient Federated Learning for Cloud-Based AIoT Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40, 11 (2021), 2211–2223. https://doi.org/10.1109/TCAD.2020.3046665
[31] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. 2021. Data-Free Knowledge Distillation for Heterogeneous Federated Learning. Proceedings of Machine Learning Research 139 (2021), 12878–12889. https://api.semanticscholar.org/CorpusID:235125689
