Auction Based Clustered Federated Learning in Mobile Edge Computing System
Renhao Lu, Weizhe Zhang, Senior Member, IEEE, Qiong Li,
Xiaoxiong Zhong, Member, IEEE, Athanasios V. Vasilakos, Senior Member, IEEE

arXiv:2103.07150v1 [cs.LG] 12 Mar 2021

Abstract—In recent years, mobile clients’ computing ability and storage capacity have greatly improved, allowing many applications to be handled efficiently on the device. Federated learning is a promising distributed machine learning solution that uses local computing and local data to train the Artificial Intelligence (AI) model. Combining local computing and federated learning can train a powerful AI model under the premise of ensuring local data privacy while making full use of mobile clients’ resources. However, the heterogeneity of local data, that is, Non-independent and identical distribution (Non-IID) and imbalance of local data size, may bring a bottleneck hindering the application of federated learning in the mobile edge computing (MEC) system. Inspired by this, we propose a cluster-based client selection method that can generate a federated virtual dataset satisfying the global distribution to offset the impact of data heterogeneity, and we prove that the proposed scheme can converge to an approximate optimal solution. Based on the clustering method, we propose an auction-based client selection scheme within each cluster that fully considers the system’s energy heterogeneity, and we give the Nash equilibrium solution of the proposed scheme for balancing the energy consumption and improving the convergence rate. The simulation results show that our proposed selection methods and auction-based federated learning can achieve better performance with the Convolutional Neural Network (CNN) model under different data distributions.

Index Terms—Federated learning, Auction mechanism, Cluster.

Manuscript received February 27, 2021. (Corresponding author: Weizhe Zhang.)
Renhao Lu is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Weizhe Zhang is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, and also with the Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen, China. (Email:[email protected])
Qiong Li is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Xiaoxiong Zhong is with the Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen, China.
Athanasios V. Vasilakos is with the School of Electrical and Data Engineering, University of Technology Sydney, Australia, with the Department of Computer Science and Technology, Fuzhou University, Fuzhou 350116, China, and with the Department of Computer Science, Electrical and Space Engineering, Lulea University of Technology, Lulea, 97187, Sweden. (Email:[email protected])

I. INTRODUCTION

MACHINE learning is applied to various fields, including medical care, autonomous driving, finance, etc. Massive data generated by mobile clients can effectively promote the improvement of machine learning technology. However, these data directly or indirectly reveal the privacy of users. With the increase of people’s awareness of privacy and the introduction of privacy protection laws, privacy data leakage has become one of the main bottlenecks hindering artificial intelligence development. Federated learning [1] is proposed as a promising distributed learning paradigm to alleviate the privacy leakage problem of machine learning. Unlike traditional machine learning and distributed machine learning, there is no need to centralize user data for AI model training. In the federated learning system, clients only need to transmit the model parameters or gradients trained on their local data to the aggregation server, thereby protecting data privacy.

Moreover, data heterogeneity brings new challenges to federated learning development. Wang et al. [2] proposed a reinforcement learning solution to federated learning in Non-IID scenarios. Since this scheme requires multiple rounds of reinforcement learning model training in advance for different scenarios, its generalization ability is weak. There are also some clustering solutions to this challenge [3], [4], [5], [6], [7]. For example, Sattler et al. [3] divided the clients into several groups based on the similarity of the local model and then trained within each group to improve the average accuracy. The above solutions have shown specific effects in dealing with Non-IID scenarios, but they ignore the imbalance of local data, that is, the size of data generated by different clients is inconsistent. Cai et al. [8] propose a scheme that uses dynamic samples to solve the problem of data imbalance without considering the impact of Non-IID.

Also, especially in wireless edge networks, the energy of mobile clients is usually limited. How to balance the clients’ energy consumption in the system is another core challenge for developing federated learning in the mobile edge computing system. Clients are required to perform calculations, making mobile clients unwilling to participate in federated learning due to limited energy. [9], [10] design incentive mechanisms that give part of the benefits to participating clients to encourage them to participate in federated learning. These schemes fully consider the resource status when selecting clients and are committed to minimizing the federated learning system’s overall resource consumption, but they ignore individual clients’ energy consumption.

A. Motivation

Federated learning can effectively solve the problem of user data privacy leakage in traditional machine learning. Furthermore, it can make full use of the remaining computing power of idle edge clients. On the one hand, federated learning systems’ data heterogeneity has become one of the main bottlenecks of federated learning development.
Wang et al. [2] observed that the clustering scheme could speed up the convergence of the global model compared with randomly selecting user local models on Non-IID data. They also verified the effectiveness of the clustering algorithm through experiments. However, in their experimental settings, each user has the same number of data samples, which is unrealistic in real system scenarios. In a real scene, different users have different data sizes, which means the scale of data owned by edge clients is imbalanced. Therefore, in our research, we fully consider the two aspects of data heterogeneity: Non-IID and imbalance of local data. Besides, they did not give a theoretical analysis. On the other hand, partial edge clients are selected for training in each iteration. Appropriate edge client selection can effectively improve the convergence rate of the global model. However, the energy of mobile edge clients is limited. Therefore, we propose an energy-balanced selection mechanism in this paper.

Fig. 1: An overview of auction based clustered federated learning.

B. Contribution

Fig. 1 gives an overview of auction based clustered federated learning, and the main contributions of our research are as follows:
• We propose a client selection scheme based on initial gradient clustering, which mainly includes the following improvements: 1) We introduce the concept of federated virtual datasets, whose goal is to transform the heterogeneity of distributed local data into the heterogeneity of virtual datasets. 2) To alleviate the impact of local data imbalance and ensure client clustering accuracy, we propose a sample window mechanism before clustering. 3) We give a theoretical analysis of the proposed scheme and prove that it can converge to an approximate optimal solution under the stochastic gradient descent algorithm.
• Given the uneven resource consumption caused by randomly selecting clients in the cluster, we propose a cluster-internal client selection scheme based on the auction mechanism, which fully considers the data heterogeneity and each client’s remaining energy. At the same time, we give the optimal solution for clients’ bidding, which satisfies the Nash equilibrium.
• We evaluate the performance of our scheme through simulation in a variety of different Non-IID scenarios. Furthermore, we introduce the metric of energy consumption balance in the federated learning scenario for the first time. The simulation results show that our scheme achieves good performance in convergence rate and energy consumption balance.

II. RELATED WORKS

In recent years, federated learning [1], as a special distributed machine learning approach, has been widely studied by researchers. On the one hand, the original intention of federated learning is to train the AI model while ensuring data privacy. [11], [12], [13], [14], [15] study federated learning from the perspective of protecting clients’ data privacy and the AI model. On the other hand, different from the traditional distributed machine learning system, the federated learning system’s communication environment is more complex and uncertain. Therefore, reducing communication overhead and improving communication efficiency is another core challenge of federated learning. [16], [17], [18], [19] are devoted to reducing the communication cost of federated learning or the communication rounds required for training. Recently, the heterogeneity of federated learning systems has become the main bottleneck of its development. FL heterogeneity is divided into data heterogeneity and structural heterogeneity [20].

Our research mainly solves the challenge of data heterogeneity in the federated learning model’s training process. In terms of training data samples, unlike conventional distributed machine learning, the training data samples of federated learning are generally Non-IID. McMahan et al. [1] proposed the Federated Averaging (FedAvg) algorithm, which is a deep network federated learning method based on iterative model averaging. They also pointed out that the FedAvg algorithm is still applicable when the data of clients is Non-IID.
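As a point of reference for the aggregation rule just described, the following is a minimal sketch of FedAvg-style weighted model averaging, where each client's weight is its local data share p_k = |ξ_k|/Σ|ξ|. The function and variable names (fedavg_aggregate, client_weights, client_sizes) are illustrative assumptions and not taken from the paper or any specific library.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style).

    client_weights: list of 1-D parameter vectors, one per selected client.
    client_sizes:   list of local dataset sizes |xi_k| for the same clients.
    Returns the aggregated global parameter vector w = sum_k p_k * w_k.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    p = sizes / sizes.sum()                      # p_k = |xi_k| / sum_j |xi_j|
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    return (p[:, None] * stacked).sum(axis=0)    # sum_k p_k * w_k

# Example: three clients with unequal data sizes.
w_global = fedavg_aggregate(
    client_weights=[[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]],
    client_sizes=[100, 400, 1200],
)
```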
Li et al. [21] theoretically analyzed the effectiveness of the FedAvg algorithm and showed that the convergence rate of the FedAvg algorithm under Non-IID is significantly worse than under IID. Zhao et al. [22] also verified by experiment that the accuracy of convolutional neural networks trained with the FedAvg algorithm decreases significantly under the Non-IID setting. Besides, they proposed a data-sharing strategy, which improves the accuracy of model training by sharing part of the data that meets the global distribution. However, without knowing the distribution of clients, it is not easy to make all client data evenly distributed by sharing data samples. [23] analyzed the convergence bounds of the gradient descent algorithm in the federated learning system under the Non-IID setting. Tian et al. [24] proposed the FedProx federated learning algorithm to tackle heterogeneity in federated systems. As an improved version of FedAvg, FedProx introduced a proximal term to cope with the heterogeneity of local model updating [25]. Wang et al. [26] proposed federated matched averaging (FedMA), a layer-wise federated learning algorithm designed for CNN and LSTM network architectures.

Sattler et al. [6] proposed a federated multi-task learning framework using a clustering strategy. They used the cosine similarity of the local models to cluster clients. Its purpose is to enable clients of different clusters to learn more specialized models. Ghosh et al. [7] demonstrated that each user has its learning task, and users with the same learning task can perform more efficient federated learning. Briggs et al. [27] classify clients based on local model updates, with the goal of training specialized machine learning models over distributed datasets. However, those specialized models trained with partial local data cannot make full use of the system’s data. Wang et al. [2] observed the efficiency of training a global model using a clustering strategy under a Non-IID setting but did not give a theoretical analysis.

III. CLUSTER BASED SELECTION METHOD

A. System model

Consider a mobile edge computing scenario that consists of a cloud server S and a set NL of N edge clients, which act as the service providers. With the help of the edge clients’ data services and computing services, the server S aggregates an AI model. Specifically, edge clients train on the local data for I rounds based on the model broadcast by the server S, and return the trained model parameters to the server. Server S performs aggregation operations on the collected model parameters to obtain an updated model. The server and edge clients iterate the above operations until the updated model reaches the required accuracy. Besides, due to link bandwidth, timeliness of model parameters, and other reasons, the server S only selects K clients from NL to participate in training in each iteration. Moreover, the local data samples of edge clients have heterogeneous properties, which contain two aspects: local data size and data sample distribution. Therefore, each client is assigned data of a different size in the system model, as shown in Fig. 2.

Fig. 2: System model.

B. Federated virtual dataset

The goal of federated learning is to jointly train a global machine learning model with massive edge clients. The training process is to gradually reduce the loss function f(∗), which is a distributed optimization problem:

w∗ = arg min_w f(w) = arg min_{w = Σ_k p_k w_k} f(w_k, ξ_k)    (1)

where w and w_k represent the global and local model parameters respectively, and w∗ is the optimal global model parameter. ξ_k and |ξ_k| respectively denote the local data and data size of client k, so p_k = |ξ_k| / Σ_k |ξ_k|. In our research, we use the stochastic gradient descent algorithm. Thus, for any client k participating in training, the update process is as in formula (2):

w^k_{t+1} = w^k_t − η_t ∇f(w_t, x^k_t)    (2)

where x^k_t ∈ ξ_k and η_t is the learning rate. Besides, let g(w_t) denote the aggregated value of the local update gradients in round t; then,

g(w_t) = Σ_{k ∈ NL} p_k ∇f(w_t, x^k_t)    (3)

The large number of clients and the limited communication links in the mobile edge computing network determine that not all clients can participate in each training round. Therefore, in a practical federated learning system, partial clients are selected to participate in training in each iteration, assuming that K clients are selected in each round.

Let ξ_t = ∪_{k=1}^{K} x^k_t, which is defined as a virtual dataset; then,

g(w_t, ξ_t) = Σ_{x^k_t ∈ ξ_t} p_k ∇f(w_t, x^k_t).    (4)

Therefore, the distributed stochastic gradient optimization algorithm of federated learning can be regarded as a traditional centralized stochastic batch gradient descent algorithm on the virtual dataset ξ_t.

The above analysis is based on the situation that the selected clients only train one local round. However, in order to alleviate the communication pressure of the system, it is common that the selected client performs I (I ≥ 2) local rounds of training based on its local dataset, that is, I local rounds of the stochastic gradient descent algorithm or stochastic mini-batch gradient descent algorithm.
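To make the virtual-dataset view of (4) concrete, here is a minimal sketch of one aggregation round in the single-local-round case. The names (virtual_dataset_step, grad_fn) and the uniformly random client selection are illustrative assumptions rather than the paper's implementation, and the weights are normalized over the selected clients for simplicity.

```python
import numpy as np

def virtual_dataset_step(w_t, clients, K, eta, rng):
    """One round in the spirit of formula (4): sample K clients, pool their
    local samples into the virtual dataset xi_t, and take a weighted gradient step."""
    selected = rng.choice(len(clients), size=K, replace=False)
    sizes = np.array([clients[k]["size"] for k in selected], dtype=float)
    p = sizes / sizes.sum()                      # p_k over the selected clients
    g = np.zeros_like(w_t)
    for p_k, k in zip(p, selected):
        # clients[k]["grad_fn"](w_t) returns the stochastic gradient on x_t^k
        g += p_k * clients[k]["grad_fn"](w_t)
    return w_t - eta * g                         # w_{t+1} = w_t - eta * g(w_t, xi_t)
```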
So, the model parameter update rule is as follows:

w_{t+1} = w_t − η_t Σ_{k=1}^{K} Σ_{i=0}^{I−1} p_k ∇f(w^k_{t,i}, x^k_{t,i})    (5)

In our research, we only analyze the case in which the learning rate η_t is fixed. Furthermore, we assume, from the perspective of expectation, that multi-step learning with a small learning rate is equivalent to a few steps with a large learning rate, as shown in formula (6):

E[ η_t Σ_{k=1}^{K} Σ_{i=0}^{I−1} p_k ∇f(w^k_{t,i}, x^k_{t,i}) ] ≈ E[ θ I η_t Σ_{k=1}^{K} p_k ∇f(w^k_t, ξ_k) ]    (6)

where 0 < θ ≤ 1. And we let ξ_t = ∪_{k=1}^{K} ξ_k; then,

g(w_t, ξ_t) = E[ Σ_{k=1}^{K} p_k ∇f(w_t, ξ_k) ]    (7)

w_{t+1} = w_t − θ I η_t g(w_t, ξ_t).    (8)

Based on this assumption, it returns to the situation of I = 1. Thus, we give the following analyses:
i) Because of local data heterogeneity, ξ_t also has heterogeneity.
ii) Unlike the fixed local data distribution, the distribution of ξ_t varies with the combination of selected clients and can be changed through the client selection scheme.
iii) At this point, solving the heterogeneity problem of local data can be transformed into solving the heterogeneity problem of ξ_t.
iv) As shown in Fig. 3, for any t, ξ_t is consistent with the global distribution. Then, the negative impact of data heterogeneity can be alleviated.

Fig. 3: Federated virtual dataset.

C. Cluster based clients selection

In this subsection, we propose a client selection scheme based on a clustering strategy so that the virtual dataset constructed in each round can meet the global distribution. Wang et al. [2] gave a client selection strategy based on initial model parameter clustering, precisely: globally grouping clients based on the similarity of local models and selecting a client from each group. They also verified the effectiveness of this scheme through experiments. However, there are still two problems with their strategy. Firstly, in their experimental settings, each client has the same number of data samples, while in a practical federated learning system different clients have different numbers of local data samples. Secondly, after only one epoch of SGD update, the local model parameters may not reflect the local distribution.

To solve the existing problems, we propose a client selection scheme based on initial gradient clustering. Our proposed scheme mainly includes two stages: the clustering stage and the training stage. In the clustering stage, there are three optimizations whose purpose is to reflect the clients’ distributions more accurately:
1) Our proposed scheme uses the local gradient to represent the data distribution.
2) We set a local sample window to limit the number of samples that the client uses in training, to offset the impact of data imbalance.
3) Before clustering, each client samples multiple times from its local data, trains on each sample, and calculates its average gradient.

In the training phase, we set a sample threshold to make the selected clients’ local data sizes of the same level, whose purpose is to lower the impact of local data imbalance. That is, we randomly select a client and use its local data size as the sample threshold. After that, all clients larger than this value can participate in the selection to ensure that the virtual dataset is closer to the global distribution.

Besides, we conducted a theoretical analysis of this scheme, and we prove that the proposed scheme can converge to an approximate optimal solution with local SGD in the β strongly convex setting, as shown in Theorem 1. Please refer to Appendix A for the details of the proof.

Theorem 1 Under Assumptions 1 to 4, which are defined in Appendix A, adopting the clustering clients sampling strategy with fixed step size η_t = η satisfies:

E[f(w_{t+1})] − f(w∗) ≤ (1 − B_1)^{t−1} (f(w_1) − f(w∗) − A_1) + A_1

where 0 < η ≤ µ_E/(L M_G), A_1 = 2ηθIL²M/(µ_E β²), B_1 = θIηµ_E β²/(4L).

Proof: see Appendix A for the proof.

We also verify the effectiveness of our proposed scheme (represented as Gradients Cluster Random) through experiments, and we use Weights Cluster Random to represent the scheme proposed by [2]. In order to ensure the fairness of the evaluation, both schemes adopt the K-means clustering scheme. We also compare with the FedAvg [1] scheme (represented as FedAvg Random), which randomly selects k clients for model training during each round. Furthermore, we train CNN models with PyTorch on three different datasets, MNIST, Fashion MNIST, and CIFAR-10. For the data distribution settings, each client only has data for one label but a random local data size. Specifically, we randomly assign a different number of samples to each client, and each client can have at least 100 data samples and a maximum of 1200 data samples.
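The clustering stage described above can be sketched in a few lines. This is only an illustrative outline under the stated assumptions (a fixed sample window, T0 repeated draws, and K-means on the averaged initial gradients); the helper names (initial_gradient_signature, cluster_clients) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_gradient_signature(grad_fn, local_data, s_mm, T0, rng):
    """Average gradient over T0 random windows of s_mm local samples (clustering stage)."""
    grads = []
    for _ in range(T0):
        window = rng.choice(len(local_data), size=min(s_mm, len(local_data)), replace=False)
        # grad_fn evaluates the gradient of the broadcast model w on the chosen window
        grads.append(grad_fn([local_data[i] for i in window]))
    return np.mean(grads, axis=0)

def cluster_clients(signatures, J):
    """Group clients into J clusters by their averaged initial gradients (K-means)."""
    labels = KMeans(n_clusters=J, n_init=10).fit_predict(np.stack(signatures))
    return [np.where(labels == j)[0].tolist() for j in range(J)]
```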
Fig. 4: Test accuracy and training loss v.s. communication rounds under 100 clients.

As shown in Fig. 4, the client selection schemes with the clustering strategy show better performance than the FedAvg scheme. The reason is that the virtual datasets constructed by the first two schemes are closer to the global distribution, and the distribution of the virtual dataset in each iteration is similar, thereby reducing the impact of data heterogeneity. Since our proposed scheme fully considers the imbalance of the data, its clustering is more accurate. Therefore, our scheme has a higher convergence rate than the other two schemes.

IV. AUCTION-BASED CLUSTERED FEDERATED LEARNING

In Section III, we experimentally verified that the clustering scheme can effectively improve global model training convergence under the Non-IID and imbalanced setting. However, in wireless edge networks, most clients are mobile clients whose energy is limited. Randomly selecting clients from each cluster for the next round of training will cause excessive consumption for some clients. Therefore, in this section, we propose an auction-based client selection scheme in each cluster. The auction mechanism is a game theory model which consists of two roles, auctioneers and bidders. As shown in Fig. 1, each group constitutes an auction system. Unlike traditional cloud computing systems, edge clients act as bidders in this auction system, providing data services and computing services. Also, the cloud aggregation server acts as the auctioneer, and the server trains a global AI model by purchasing data services and computing services from the bidders. After bidding, there will be K_j (K_j ≥ 1) winners in cluster j, whose bids are the lowest within the cluster.

Our auction-based federated learning system mainly consists of four parts: the energy consumption model, the cost function, the reward model, and the auction-based edge client selection algorithm. We describe them in detail as follows.

A. Energy consumption model

In the federated learning system, each selected client i trains the global model w_t on local data to obtain the local model w^i_t, and then all selected clients transmit the local model to the aggregation server. Therefore, in each iteration, the client’s energy consumption consists of two parts, communication consumption and computational consumption. The energy consumption of an edge client can be expressed as follows:

E^sum_{i,t} = E^cp_{i,t} + E^cm_{i,t}    (9)

where E^sum_{i,t} denotes the energy consumption of client i as a training client in round t, and E^cp_{i,t} and E^cm_{i,t} respectively represent the computation and communication consumption of the client’s energy. Formula (10) represents the communication energy consumption of client i:

E^cm_{i,t} = E^re_{i,t} + E^se_{i,t}    (10)

where E^re_{i,t} and E^se_{i,t} respectively represent the energy consumption of client i when receiving the global model and sending the local model. The computational energy consumption of client i is as follows:

E^cp_{i,t} = (Ns_i × ϱ)/100    (11)

where ϱ represents the energy consumption of each client for training 100 samples and Ns_i represents the size of the local data sample of client i.

B. Cost function

In our scheme, we design a cost function to determine which clients are selected for each training round. The cost function mainly depends on the client’s residual energy, the number of local data samples, and the client’s historical training rounds. Therefore, our cost function includes two parts, resource cost and service cost, which are represented by Cr_{i,t} and Cs_{i,t}, respectively.

1) The resource cost: The client’s resource cost is determined by the residual energy of client i and the energy required by the current training round t. Simultaneously, the resource cost is also dynamic, and the cost increases with the reduction of residual energy. So, the resource cost of client i in round t is defined as follows:

Cr_{i,t} = { φ^(E^res_{i,t} − E^cp_{i,t})   if E^res_{i,t} − E^cp_{i,t} > 0;   +∞   otherwise }    (12)

where 0 < φ < 1, and E^res_{i,t} represents the remaining energy of client i. Formula (12) also ensures that clients with sufficient energy resources have a greater probability of participating in this training round.
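A small sketch of the energy model (9)-(11) and the resource cost (12) may help. The defaults for rho and phi follow Table I, while e_receive and e_send are illustrative placeholders that the paper does not specify; the function names are assumptions for readability.

```python
import math

def round_energy(ns_i, rho=0.2, e_receive=0.01, e_send=0.01):
    """E_sum = E_cp + E_cm, with E_cp = Ns_i * rho / 100 and E_cm = E_re + E_se (formulas 9-11)."""
    e_cp = ns_i * rho / 100.0
    e_cm = e_receive + e_send
    return e_cp, e_cp + e_cm

def resource_cost(e_res, ns_i, phi=0.5, rho=0.2):
    """Cr = phi ** (E_res - E_cp) if the client can afford the round, else +inf (formula 12)."""
    e_cp, _ = round_energy(ns_i, rho)
    if e_res - e_cp > 0:
        return phi ** (e_res - e_cp)   # grows as the residual energy shrinks, since 0 < phi < 1
    return math.inf
```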
2) The service cost: The quality of service provided by clients determines the service cost. Moreover, a client’s quality of service depends on the quality and quantity of its samples. Clients with more data samples can accelerate the model’s convergence, which means that these clients have a better quality of service. To ensure that clients with higher service quality have a higher probability of participating in training, we set the sample size to be inversely proportional to the cost. Besides, for the global model, more data samples involved in training can improve the model’s generalization ability. Therefore, we record the historical participation rounds of clients. With the increase of a client’s participation rounds, our model appropriately reduces its service quality to ensure that clients with fewer samples can also participate in the training. So, our service cost is calculated as follows:

Cs_{i,t} = χ ϑ Ns_i + ζ (1 − log_a(co_{i,t} + a))    (13)

where χ + ζ = 1 (0 ≤ χ, ζ ≤ 1, a > 1), and co_{i,t} represents the historical training rounds of client i up to round t. So, our cost function is shown in formula (14):

c_{i,t} = α Cs_{i,t} + γ Cr_{i,t}    (14)

where α + γ = 1 (0 < α, γ < 1).

C. Reward model

In this subsection, we mainly design two kinds of reward models. One is that all the benefits belong only to the clients; the other is that the clients and the server share the AI model’s benefits. For any client i, its reward in each round is expressed as R^win_{i,t}. For the first one, in the federated learning system, the clients participating in the training share the global model’s benefits. In line with the idea of more work, more reward, we design a reward model suitable for federated learning, and the goal is to motivate more clients to participate in training. We set the total number of training rounds for the global model to achieve the target accuracy rate as Nr, and divide the profit of the global model into each round averagely. Therefore, our reward function is shown below:

R^win_{i,t} = { (Ns_i / Σ_{j∈Win(t)} Ns_j) × (Rg/Nr)   if i ∈ Win_t;   0   otherwise }    (15)

where Win_t represents the set of clients participating in the training in round t, and Rg is the total economic income of the AI model. For the second, the server and the clients share the economic profits of the AI model according to a certain proportional relationship. In other words, we divide the profit of the AI model into each round equally, and then divide each round’s profit according to the number of clients participating in the training in that round. After that, the server shares this benefit with each client participating in the training in a certain proportion. We take the bid of the client as the proportion of each benefit, so the reward function of the client in this case can be expressed as:

R^win_{i,t} = { b_{i,t} × (Rg/Nr)   if i ∈ Win_t;   0   otherwise. }    (16)

So the total return of client i is:

Re_i = Σ_{t=0}^{Nr} R^win_{i,t}.    (17)
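The following is a minimal, illustrative sketch of the cost (13)-(14) and the second reward model (16)-(17). The default parameter values loosely follow Table I, and the function names (service_cost, total_cost, round_reward) are assumptions for readability, not an interface defined by the paper.

```python
import math

def service_cost(ns_i, co_i, chi=0.7, zeta=0.3, vartheta=0.5, a=2):
    """Cs = chi * vartheta * Ns_i + zeta * (1 - log_a(co_i + a))   (formula 13)."""
    return chi * vartheta * ns_i + zeta * (1 - math.log(co_i + a, a))

def total_cost(cs, cr, alpha=0.7, gamma=0.3):
    """c = alpha * Cs + gamma * Cr   (formula 14)."""
    return alpha * cs + gamma * cr

def round_reward(bid, is_winner, R_g, N_r):
    """Second reward model (16): a winner receives bid * (R_g / N_r), others receive 0."""
    return bid * R_g / N_r if is_winner else 0.0
```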
D. Federated learning with auction-based selection

1) Optimal bid: The aggregation server selects a certain percentage of clients in each cluster to participate in the training, assuming that the number of clients in cluster j is N_j and that K_j clients participate in training in each round. Therefore, our scheme abstracts the clustering-based federated learning system into an auction scenario, where the aggregation server is the auctioneer and the clients are the bidders. Furthermore, the auction scenario of federated learning has the following conditions:
i) Client bids are independent of each other, and the bids follow a uniform distribution on [0, 1].
ii) The bid strategy is strictly monotonically increasing, and the clients follow a symmetric bidding strategy.
iii) The K_j clients with the lowest bids win; DL represents the set of winners.
iv) When bidders’ bids are the same, the winning client is selected according to the service cost, followed by the resource cost.

Therefore, the revenue function U_i(b_{i,t}, c_{i,t}) of a client in the cluster can be expressed as follows:

U_i(b_{i,t}, c_{i,t}) = { b_{i,t} − c_{i,t}   if i ∈ N_k;   0   otherwise }    (18)

where b_{i,t} and c_{i,t} respectively represent the price and cost of client i.

Theorem 2 In round t, there is an optimal bid b_{i,t} = 1/(N_j − K_j + 1) + (N_j − K_j)/(N_j − K_j + 1) · c_{i,t} for client i in cluster j, which satisfies the Nash equilibrium in the clustered federated learning system that uses the auction mechanism.

Proof: The goal of using the game model is to achieve federated learning system equilibrium. For the auction mechanism, reaching the system’s equilibrium state is to maximize the expected revenue of the clients participating in the auction:

max E(U_i(b_{i,t}, c_{i,t})) = (b_{i,t} − c_{i,t}) ∏_{j ∉ N_k} P(b_{i,t} < b_{j,t})    (19)

Since the client bids satisfy a uniform distribution, formula (19) can be derived as:

max E(U_i(b_{i,t}, c_{i,t})) = (b_{i,t} − c_{i,t}) ∏_{m ∉ N_k} (1 − f_m^{−1}(b_{i,t}))    (20)

where f_m is the bid strategy of client m; due to the symmetry of the bid strategy, f_m = f_i = f, so:

max E(U_i(b_{i,t}, c_{i,t})) = (b_{i,t} − c_{i,t}) (1 − f^{−1}(b_{i,t}))^{N_j − K_j}.    (21)

The first-order optimal auction condition of equation (20) can be expressed as:

(b_{i,t} − c_{i,t}) (N_j − K_j) (1 − f^{−1}(b_{i,t}))^{N_j−K_j−1} f^{−1′}(b_{i,t}) − (1 − f^{−1}(b_{i,t}))^{N_j−K_j} = 0.

Let b_{i,t} be the optimal bid of client i, so c_{i,t} = f^{−1}(b_{i,t}). We also introduce the differential factor (1 − c_{i,t})^{N_j−K_j−1}:

(b_{i,t} − c_{i,t}) (N_j − K_j) (1 − c_{i,t})^{N_j−K_j−1} f^{−1′}(b_{i,t}) − (1 − c_{i,t})^{N_j−K_j} = 0.    (22)

Equation (22) is a total differential equation, and its solution proceeds as follows:

(1 − c_{i,t})^{N_j−K_j} db_{i,t} − (b_{i,t} − c_{i,t}) (N_j − K_j) (1 − c_{i,t})^{N_j−K_j−1} dc_{i,t} = 0

d[(1 − c_{i,t})^{N_j−K_j} · b_{i,t}] + (N_j − K_j)/(N_j − K_j + 1) · d(1 − c_{i,t})^{N_j−K_j+1} − d(1 − c_{i,t})^{N_j−K_j} = 0

(1 − c_{i,t})^{N_j−K_j} · b_{i,t} + (N_j − K_j)/(N_j − K_j + 1) · (1 − c_{i,t})^{N_j−K_j+1} − (1 − c_{i,t})^{N_j−K_j} = 0

b_{i,t} = 1/(N_j − K_j + 1) + (N_j − K_j)/(N_j − K_j + 1) · c_{i,t} + C

where C is a constant; setting C to 0, the optimal bid is:

b_{i,t} = 1/(N_j − K_j + 1) + (N_j − K_j)/(N_j − K_j + 1) · c_{i,t}.    (23)

E. Auction-based clients selection algorithm

In this subsection, we describe our proposed edge client selection algorithm in detail. From the server’s perspective, the edge client selection algorithm’s function is to select appropriate edge clients to participate in federated learning training, thereby accelerating the global model’s convergence rate. From the perspective of the edge clients, the client selection algorithm’s goal is to maximize the benefits for users and avoid excessive energy consumption. So, we designed an auction-based edge client selection algorithm, as shown in Algorithm 1. Algorithm 1 has three main stages: gradient-based clustering, auction-based selection, and federated training.

The server initializes the model parameters and the sample threshold s_mm. All clients randomly select s_mm local samples to calculate the gradient based on w, repeat this T_0 times, and send the gradient average to the server. The purpose of this is to obtain the local data distribution. Based on the local data distributions, the clients are divided into J groups. So far, the first stage is completed.

Our auction model consists of two steps. In the first step, each client calculates its own cost according to formulas (9)-(14) and then bids according to formula (23). Then, the server randomly selects a group js and selects the K_j clients with the lowest bids as the winners in group js. The smallest local data size among these winners is then taken as the threshold s_min. In the second step, each group is an auction system; clients with a local data size greater than s_min get the right to participate in the auction. Then, the server determines the winners in each group according to the clients’ bids. So far, the second stage is completed.

In the final stage, federated training is carried out. The winning clients train on local data and send the locally updated model weights to the server. Finally, the server aggregates the client model updates and then repeats the second and third stages until the model converges or reaches the specified number of communication rounds.

V. EVALUATION

A. Simulation set up

We have implemented our scheme in a federated learning simulator developed from scratch using PyTorch on a device with a 3.0 GHz CPU frequency. We choose three classic picture datasets: MNIST, Fashion MNIST (represented by FMNIST), and CIFAR-10. These datasets all contain ten types of data samples; the former two datasets both have 70,000 images, 60,000 for training and 10,000 for testing, while the latter has 60,000 images, 50,000 for training and 10,000 for testing. In terms of the AI model, we trained three different models, including a CNN for MNIST¹, a CNN for Fashion MNIST², and a CNN for CIFAR-10³. Besides, since our scheme mainly considers client selection, we use FedAvg and FedProx respectively for model aggregation. Therefore, we mainly compare with the randomly selected FedAvg and FedProx, which are represented by Random FedAvg and Random FedProx, respectively.

Data distribution at different clients: For the Non-IID setting, we use the setting method of [2], that is, a percentage ν of the samples stored by each client has the same label, and the remaining data samples are randomly sampled. In our simulation, ν is set to 1, 0.8, and 0.5. Besides, for the imbalance setting, the number of local samples for each client is between $/6 and 2$, where $ is the average number of samples that each client can be allocated. Taking the 100-client scenario and the MNIST dataset as an example, the value of $ is 600, and each client has at least 100 data samples and at most 1200 data samples. Then, for each client’s local data, 80% is used for training, 10% is used for verification, and the last 10% is used for testing.

¹ The CNN for MNIST has 10 layers with the following structure: 5×5×10 Convolutional → 2×2 MaxPool → 5×5×20 Convolutional → Dropout → 2×2 MaxPool → Flatten → 320×50 Fully connected → Dropout → 50×10 Fully connected → softmax.
² The CNN for Fashion MNIST has 9 layers with the following structure: 5×5×16 Convolutional → Batch Normalization → 2×2 MaxPool → 5×5×32 Convolutional → Batch Normalization → 2×2 MaxPool → Flatten → 1568×10 Fully connected → softmax.
³ The CNN for CIFAR-10 has 8 layers with the following structure: 5×5×6 Convolutional → 2×2 MaxPool → 5×5×16 Convolutional → Flatten → 400×120 Fully connected → 120×84 Fully connected → 84×10 Fully connected → softmax.
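As a companion to Algorithm 1 below, here is a compact sketch of the client-side bidding step, turning the cost from (14) into the Nash-equilibrium bid of (23) and letting the lowest bidders in a cluster win. The function names and flat data structures are illustrative assumptions, not the simulator's actual interface.

```python
def optimal_bid(cost, n_j, k_j):
    """Nash-equilibrium bid of formula (23): b = 1/(N_j-K_j+1) + (N_j-K_j)/(N_j-K_j+1) * c."""
    m = n_j - k_j
    return 1.0 / (m + 1) + (m / (m + 1)) * cost

def select_winners(bids, k_j):
    """Auction step: the K_j lowest bidders in a cluster win (ties broken by list order here)."""
    ranked = sorted(bids, key=lambda item: item[1])   # bids: list of (client_id, bid)
    return [client_id for client_id, _ in ranked[:k_j]]

# Example: a cluster of N_j = 10 clients from which K_j = 2 winners are chosen.
costs = [0.3, 0.1, 0.5, 0.2, 0.4, 0.25, 0.15, 0.35, 0.45, 0.05]
bids = [(cid, optimal_bid(c, n_j=10, k_j=2)) for cid, c in enumerate(costs)]
winners = select_winners(bids, k_j=2)   # clients with the two smallest costs win
```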

Algorithm 1: Auction Federated Learning Based on Clustering Strategy
Input: list of all clients NL, number of clients N, number of clusters L, and proportion of selected clients Ratio
Output: list of selected clients DL
1: Server broadcasts w, s_mm to all clients;
2: for each client k ∈ NL do
3:   for t_0 = 0 to T_0 do
4:     Client k selects s_mm samples from local data;
5:     Client k computes its gradient ∇f(w, ξ^k_{t_0});
6:   Client k computes the mean value of ∇f(w, ξ^k_{t_0});
7:   Client k sends its mean gradient to the server;
8: Server clusters clients into J groups according to the clients’ gradients;
9: Server computes the number of selected clients, K;
10: for t = 0 to T do
11:   Server broadcasts the training request;
12:   for each client k ∈ NL do
13:     Client k computes its cost according to (9)-(14);
14:     Client k computes its bid b_{k,t} according to (23);
15:     Client k sends its bid b_{k,t} back to the server;
16:   Server computes the number of selected clients in each group j: K_j = K/J;
17:   Server randomly selects a group js;
18:   Server selects the K_j clients with the lowest bids in group js: KL_js;
19:   Server computes the minimum local data size of KL_js, s_min;
20:   for j = 0 to J do
21:     for each client k ∈ group j do
22:       if k_s ≥ s_min then
23:         JL_{j,t} = JL_{j,t} ∪ {k};
24:     Server selects the K_j clients with the lowest bids in JL_{j,t}: KL_{j,t};
25:     DL_t = DL_t ∪ KL_{j,t};
26:   for each client k ∈ DL_t do
27:     Client k trains on local data;
28:     Client k updates the model weights, w^k_{t+1};
29:     Client k sends w^k_{t+1} back to the server;
30:   Server aggregates the model parameters w_{t+1} according to (5).

TABLE I: Simulation parameters
Sample window size, s_mm: 50
Energy consumption per 100 samples, ϱ: 0.2
Cost parameter related to the residual energy, φ: 0.5
Cost parameter related to local samples, ϑ: 0.5
Weight parameter related to local samples, χ: 0.7
Cost parameter related to communication rounds, a: 2
Weight parameter related to communication rounds, ς: 0.3
Weight parameter related to the resource cost, α: 0.7
Weight parameter related to the service cost, λ: 0.3

Energy at different mobile clients: We assume that each client’s battery capacity in the system is the same, so that the percentage of energy represents the remaining energy of each client. For each client’s initial energy, we considered two scenarios. Case 1: each client has the same energy level, which we set to 100%. Case 2: we set the energy of all clients in the system to satisfy a normal distribution with an upper bound of 100%, a lower bound of 50%, a mean of 75%, and a standard deviation of 10. In other words, the energy of all clients in the system is between 50% and 100%.

B. Simulation results

1) Price and reward: We propose two reward models in Section IV. In our simulation, we choose the second reward model. Firstly, the AI model trained by the federated learning system has certain economic returns, which belong to the whole federated learning system; in other words, the server and the clients share the economic returns of the AI model. Secondly, we divide the economic returns equally according to the communication rounds required by the training model. Thirdly, we mainly measure all clients’ average bid as the clients’ price in each round of the system. The reward mainly includes two parts: the server side and the client side. For the server side, we record the server’s reward against the number of communication rounds. For the client side, we mainly consider the sum of each round of training’s benefits.

Fig. 5: Price and reward vs communication rounds on MNIST data (price (left), reward (right)).

Our simulation results show that, with the increase of the number of communication rounds, the clients’ bids also increase, as shown in Fig. 5. As the number of training rounds increases, the remaining energy of the clients decreases, so the cost and bid also increase. Besides, it also shows that the clients’ bids are lower when the battery is full, which is in line with economic theory. Fig. 5 also shows that with the increase of communication rounds, the server’s reward per round decreases. This is because the bids of the clients are increasing, and the proportion of the reward going to the clients is also increasing, so the server-side reward decreases.

2) Convergence rate: The convergence rate of the global model is represented by the test accuracy vs. the number of communication rounds. This simulation scenario mainly contains 100 clients, and in each round of training, 10% of the clients are selected to participate in the training; we then test each scheme’s accuracy with the change of communication rounds on the MNIST, Fashion MNIST, and CIFAR-10 datasets. The first two schemes are our proposed client selection schemes based on initial gradient clustering. The first scheme uses a random selection strategy in each cluster (represented as Gradient-Cluster-Random), and the second scheme adopts an auction-based client selection scheme in each cluster (represented as Gradient-Cluster-Auction). The third scheme is the classic FedAvg [1], which randomly selects clients in the system to participate in training (represented as Random-FedAvg). The last scheme is the classic FedProx [25], which also adopts random selection (represented as Random-FedProx).

Fig. 6: Test accuracy v.s. communication rounds on different levels of Non-IID under 100 clients, when using AVG.

Fig. 7: Test accuracy v.s. communication rounds on different levels of Non-IID under 100 clients, when using Prox.

On the one hand, we test the schemes’ accuracy under two different Non-IID settings: ν = 1 (left) and ν = 0.8 (right). As shown in Fig. 6 and Fig. 7, the proposed cluster-based client selection schemes are better than FedAvg and FedProx in convergence rate. The reason is that the first two schemes fully consider the local data distribution of different clients when sampling clients. Through the clustering sampling strategy, a virtual dataset is constructed in each round, which approximates the global distribution, thereby alleviating the influence of data heterogeneity on the global model’s convergence. However, for FedAvg and FedProx with the random sampling strategy, the distribution of the selected local data is different in different rounds, which has a negative impact on the convergence of the global model. Besides, compared to randomly selecting a client from each cluster after clustering, the auction-based client selection scheme shows a faster convergence rate. This is because our auction scheme fully considers the local data size when choosing clients, making the average number of training samples larger than in our proposed cluster random scheme. On the other hand, as shown in Fig. 8, our proposed clustering client selection scheme shows a better convergence rate when IID dominates (ν = 0.5). As the impact of data heterogeneity decreases, compared to the auction client selection scheme, the cluster random selection scheme can draw on more client portfolios.

3) Energy consumption: In terms of energy consumption, for the balance of remaining energy, we use the standard deviation of all clients’ remaining energy in the system as the metric to measure the balance of energy consumption. The system energy consumption is more balanced when the standard deviation is smaller. We set the number of clients to 100 and then verify the scheme’s effectiveness in three different Non-IID scenarios. As shown in Fig. 9 and Fig. 10, our proposed auction-based selection scheme shows a better balance than the other schemes. The reason is that the auction-based client selection scheme fully considers the clients’ remaining energy in the cost function and then bids according to the optimal solution that satisfies the Nash equilibrium, thereby achieving the balance of energy consumption in the system. Since the FedAvg and FedProx schemes use a random sample selection strategy, some clients with many local samples may consume too much energy.

VI. CONCLUSION

In our research, we are committed to solving the data heterogeneity problem faced by the federated learning system by designing a client selection scheme. Thus, we design a client selection method based on initial gradient clustering.
Fig. 8: Test accuracy v.s. communication rounds on Non-IID-0.5 under 100 clients (Avg (left) and Prox (right)).

This method mainly includes three innovations. Firstly, we introduce the concept of a federated virtual dataset for the first time to provide a solution to the data heterogeneity of the federated learning system. Secondly, we propose the sample window mechanism, whose goal is to alleviate the impact of local data imbalance on clustering accuracy. Thirdly, we prove that our proposed client selection method can converge to the approximate optimal solution under the stochastic gradient descent algorithm.

Furthermore, we designed an auction-based client selection algorithm in each cluster, which aims to solve the imbalance of resource consumption caused by the random selection of clients. At the same time, we introduced energy-balance metrics in the federated learning scenario for the first time. Simulation results show that our proposed client selection scheme can achieve better performance in the AI model’s convergence rate and the energy consumption balance of the system.

APPENDIX A
PROOF OF THEOREM 1

In this section, we give the assumptions, lemmas, and proofs related to Theorem 1.

A. Assumption

Firstly, we make some assumptions about the objective loss function f(∗).

Assumption 1 The objective function f(∗) is an L-smooth function, satisfying:

f(y) − f(x) ≤ ∇f(x)ᵀ(y − x) + (L/2) ‖y − x‖²₂

Assumption 2 The objective function is a β strongly convex function, satisfying:

f(y) − f(x) ≥ ∇f(x)ᵀ(y − x) + (β/2) ‖y − x‖²₂.

Gaurush et al. [28] make the following assumptions, which give the expected bounds of the sample gradient.

Assumption 3 Bounds of the gradient expectation: ∃ µ, µ_G (0 < µ ≤ µ_G) such that, for random sampling,

µ ‖∇f(w_t, D)‖ < E_{ξ_t∈D}[g(w_t, ξ_t)] < µ_G ‖∇f(w_t, D)‖

where ∇f(w_t, D) represents the gradient mean of all samples and D is the set of all local samples.

Assumption 4 Bounds of the second-order norm of the gradient: ∃ M, M_G (0 < M ≤ M_G) such that, for random sampling,

E_{ξ_t}[‖g(w_t, ξ_t)‖²] ≤ M + M_G ‖∇f‖²

where ∇f represents the gradient mean of all samples.

In addition, we assume that the data distributions in different clusters also meet Assumption 3 and Assumption 4. It is obvious that the corresponding factors µ, µ_G, M, M_G of different clusters are different in the gradient-based clustering scheme, and we use µ_j, µ_Gj, M_j, M_Gj to represent the corresponding parameters in cluster j.

B. Lemma

Lemma 1 The expected bound of the gradient under the cluster sampling strategy is:

µ_E ‖∇f(w_t, D)‖ ≤ E_{ξ_t}[g(w_t, ξ_t)] ≤ µ_GE ‖∇f(w_t, D)‖

where J is the total number of clusters, µ_E = (1/J) Σ_{j=1}^{J} µ_j, and µ_GE = (1/J) Σ_{j=1}^{J} µ_Gj.

Proof: According to Assumption 3, each cluster satisfies:

µ_j ‖∇f(w_t, D)‖ < E_{ξ_t∈D}[g(w_t, ξ_t)] < µ_Gj ‖∇f(w_t, D)‖.

Summing both sides of the inequality over the clusters:

(1/J) Σ_{j=1}^{J} µ_j ‖∇f(w_t, D)‖ ≤ (1/J) Σ_{j=1}^{J} E_{ξ∈D}[g(w_t, ξ_t)]

(1/J) Σ_{j=1}^{J} E_{ξ∈D}[g(w_t, ξ_t)] ≤ (1/J) Σ_{j=1}^{J} µ_Gj ‖∇f(w_t, D)‖

then, with µ_E = (1/J) Σ_{j=1}^{J} µ_j and µ_GE = (1/J) Σ_{j=1}^{J} µ_Gj,

µ_E ‖∇f(w_t, D)‖ ≤ E_{ξ_t}[g(w_t, ξ_t)] ≤ µ_GE ‖∇f(w_t, D)‖
(a) Fashion MNIST  (b) MNIST  (c) CIFAR-10
Fig. 9: The energy balance on different levels of Non-IID under 100 clients, when using Avg.

(a) Fashion MNIST  (b) MNIST  (c) CIFAR-10
Fig. 10: The energy balance on different levels of Non-IID under 100 clients, when using Prox.

Lemma 2 The expected bound of the second-order norm of the gradient under the cluster sampling strategy is:

E_{ξ_t}[‖g(w_t, ξ_t)‖²] ≤ M + M_GE ‖∇f‖²

where M_GE = (1/J) Σ_{j=1}^{J} M_Gj.

Proof: According to Assumption 4, each cluster satisfies:

E_{ξ_t}[‖g(w_t, ξ_t)‖²] ≤ M + M_Gj ‖∇f‖².

Summing both sides of the inequality over the clusters:

(1/J) Σ_{j=1}^{J} E_{ξ_t}[‖g(w_t, ξ_t)‖²] ≤ (1/J) Σ_{j=1}^{J} (M + M_Gj ‖∇f‖²) ≤ M + ((1/J) Σ_{j=1}^{J} M_Gj) ‖∇f‖² ≤ M + M_GE ‖∇f‖²

then, with M_GE = (1/J) Σ_{j=1}^{J} M_Gj,

E_{ξ_t}[‖g(w_t, ξ_t)‖²] ≤ M + M_GE ‖∇f‖².

Lemma 3 Under the assumption of an L-smooth function,

E[f(w_{t+1})] − f(w_t) ≤ −θIη_t ∇f(w_t)ᵀ E_{ξ_t}[g(w_t, ξ_t)] + (1/2) θ²I²η_t² L E_{ξ_t}[‖g(w_t, ξ_t)‖²]

Proof: According to Assumption 1,

f(w_{t+1}) − f(w_t) ≤ ∇f(w_t)ᵀ(w_{t+1} − w_t) + (L/2) ‖w_{t+1} − w_t‖²
= −∇f(w_t)ᵀ η_t Σ_{k=1}^{K} Σ_{i=0}^{I−1} p_k ∇f(w^k_{t,i}, x^k_{t,i}) + (1/2) η_t² L ‖ Σ_{k=1}^{K} Σ_{i=0}^{I−1} p_k ∇f(w^k_{t,i}, x^k_{t,i}) ‖²

Taking expectations on both sides of the inequality, we can conclude:

E[f(w_{t+1})] − f(w_t) ≤ −θIη_t ∇f(w_t)ᵀ E_{ξ_t}[g(w_t, ξ_t)] + (1/2) θ²I²η_t² L E_{ξ_t}[‖g(w_t, ξ_t)‖²]

Lemma 4 Under Assumptions 1, 3, and 4,

E[f(w_{t+1})] − f(w_t) ≤ −θIη_t (µ − (η_t/2) θILM_G) ‖∇f(w_t)‖² + (η_t²/2) θ²I²LM
12

Proof: From Lemma 3, it can be concluded that:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w_t) \le -\theta I \eta_t\,\nabla f(w_t)^{T}\,\mathbb{E}_{\xi_t}\left[g(w_t,\xi_t)\right] + \frac{1}{2}\theta^2 I^2 \eta_t^2 L\,\mathbb{E}_{\xi_t}\left[\|g(w_t,\xi_t)\|^2\right].$$

According to Assumption 3,

$$\nabla f(w_t)^{T}\,\mathbb{E}_{\xi_t}\left[g(w_t,\xi_t)\right] \ge \mu\,\|\nabla f(w_t)\|^2,$$

then

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w_t) \le -\theta I \eta_t\,\mu\,\|\nabla f(w_t)\|^2 + \frac{1}{2}\theta^2 I^2 \eta_t^2 L\,\mathbb{E}_{\xi_t}\left[\|g(w_t,\xi_t)\|^2\right].$$

According to Assumption 4,

$$\mathbb{E}_{\xi_t}\left[\|g(w_t,\xi_t)\|^2\right] \le M + M_G\,\|\nabla f(w_t)\|^2,$$

then:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w_t) \le -\theta I \eta_t\,\mu\,\|\nabla f(w_t)\|^2 + \frac{L}{2}\theta^2 I^2 \eta_t^2\left(M + M_G\,\|\nabla f(w_t)\|^2\right) \le -\theta I \eta_t\left(\mu - \frac{\eta_t}{2}\theta I L M_G\right)\|\nabla f(w_t)\|^2 + \frac{\theta^2 I^2 \eta_t^2}{2} L M.$$

Lemma 5. Under Assumption 1, with $w^{*}$ the optimal model parameter,

$$\|w_t - w^{*}\|^2 \ge \frac{2}{L}\left(f(w_t) - f(w^{*})\right).$$

Lemma 6. Under Assumption 2, with $w^{*}$ the optimal model parameter,

$$\|w_t - w^{*}\| \le \frac{2}{\beta}\,\|\nabla f(w_t)\|.$$

Gaurush et al. [28] gave proofs of Lemma 5 and Lemma 6.

Lemma 7. Under Assumption 1 and Assumption 2, with $w^{*}$ the optimal model parameter,

$$\|\nabla f(w_t)\|^2 \ge \frac{\beta^2}{2L}\left(f(w_t) - f(w^{*})\right).$$

Proof: According to Lemma 5:

$$\|w_t - w^{*}\|^2 \ge \frac{2}{L}\left(f(w_t) - f(w^{*})\right).$$

According to Lemma 6:

$$\|w_t - w^{*}\| \le \frac{2}{\beta}\,\|\nabla f(w_t)\| \;\Longrightarrow\; \|w^{*} - w_t\|^2 \le \frac{4}{\beta^2}\,\|\nabla f(w_t)\|^2.$$

Combining the two bounds,

$$\frac{4}{\beta^2}\,\|\nabla f(w_t)\|^2 \ge \|w^{*} - w_t\|^2 \ge \frac{2}{L}\left(f(w_t) - f(w^{*})\right),$$

so

$$\|\nabla f(w_t)\|^2 \ge \frac{\beta^2}{2L}\left(f(w_t) - f(w^{*})\right).$$
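As an illustration only (not part of the paper's proof), the sketch below checks the Lemma 7 inequality on a synthetic strongly convex quadratic; for this toy example the constants $L$ and $\beta$ are assumed to be the largest and smallest eigenvalues of the quadratic's Hessian, which is one way such constants can be instantiated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical strongly convex quadratic: f(w) = 0.5 * (w - w_star)^T Q (w - w_star).
d = 5
A = rng.standard_normal((d, d))
Q = A.T @ A + np.eye(d)            # symmetric positive definite Hessian
w_star = rng.standard_normal(d)    # minimizer of the quadratic

eigs = np.linalg.eigvalsh(Q)
L, beta = eigs.max(), eigs.min()   # assumed stand-ins for the Assumption 1 / 2 constants

def f(w):
    return 0.5 * (w - w_star) @ Q @ (w - w_star)

def grad(w):
    return Q @ (w - w_star)

# Lemma 7: ||grad f(w)||^2 >= (beta^2 / (2L)) * (f(w) - f(w*)), with f(w*) = 0 here.
for _ in range(1000):
    w = rng.standard_normal(d)
    lhs = np.linalg.norm(grad(w)) ** 2
    rhs = beta ** 2 / (2 * L) * (f(w) - f(w_star))
    assert lhs >= rhs - 1e-9, "Lemma 7 inequality violated on this sample"

print("Lemma 7 inequality held on all sampled points.")
```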
C. Theorem

Theorem 1. Under Assumptions 1 to 4, adopting the clustering clients sampling strategy while fixing the step size $\eta_t = \eta$ satisfies:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w^{*}) \le (1 - B_1)^{t-1}\left(f(w_1) - f(w^{*}) - A_1\right) + A_1,$$

where $0 < \eta \le \frac{\mu_E}{L M_G}$, $A_1 = \frac{2\eta\theta I L^2 M}{\mu_E \beta^2}$, and $B_1 = \frac{\theta I \eta \mu_E \beta^2}{4L}$.

Proof: According to Lemma 4:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w_t) \le -\theta I \eta\left(\mu - \frac{\eta}{2}\theta I L M_G\right)\|\nabla f(w_t)\|^2 + \frac{\eta^2}{2}\theta^2 I^2 L M.$$

Since $0 < \eta \le \frac{\mu_E}{L M_G}$,

$$-\theta I \eta\left(\mu - \frac{\eta}{2}\theta I L M_G\right) \le -\frac{\mu}{2}\theta I \eta,$$

therefore

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w_t) \le -\frac{\mu}{2}\theta I \eta\,\|\nabla f(w_t)\|^2 + \frac{\eta^2}{2}\theta^2 I^2 L M,$$

that is,

$$\mathbb{E}\left[f(w_{t+1})\right] \le f(w_t) - \frac{\mu}{2}\theta I \eta\,\|\nabla f(w_t)\|^2 + \frac{\eta^2}{2}\theta^2 I^2 L M.$$

Subtracting $f(w^{*})$ from both sides of the inequality:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w^{*}) \le f(w_t) - f(w^{*}) - \frac{\mu}{2}\theta I \eta\,\|\nabla f(w_t)\|^2 + \frac{\eta^2}{2}\theta^2 I^2 L M.$$

According to Lemma 7,

$$\|\nabla f(w_t)\|^2 \ge \frac{\beta^2}{2L}\left(f(w_t) - f(w^{*})\right).$$

Multiplying both sides of this inequality by $-\frac{\mu}{2}\theta I \eta$:

$$-\frac{\mu}{2}\theta I \eta\,\|\nabla f(w_t)\|^2 \le -\frac{\theta I \eta \mu \beta^2}{4L}\left(f(w_t) - f(w^{*})\right),$$

hence

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w^{*}) \le \left(1 - \frac{\theta I \eta \mu \beta^2}{4L}\right)\left(f(w_t) - f(w^{*})\right) + \frac{\eta^2}{2}\theta^2 I^2 L M.$$

Subtracting an introduced item $A$ from both sides of the inequality:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w^{*}) - A \le \left(1 - \frac{\theta I \eta \mu \beta^2}{4L}\right)\left(f(w_t) - f(w^{*})\right) + \frac{\eta^2}{2}\theta^2 I^2 L M - A.$$

Suppose the following equation holds:

$$\left(1 - \frac{\theta I \eta \mu \beta^2}{4L}\right)\left(f(w_t) - f(w^{*})\right) + \frac{\eta^2}{2}\theta^2 I^2 L M - A = \left(1 - \frac{\theta I \eta \mu \beta^2}{4L}\right)\left(f(w_t) - f(w^{*}) - A\right).$$

It can be deduced that

$$A = \frac{2\eta\theta I L^2 M}{\mu \beta^2},$$

then:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w^{*}) - \frac{2\eta\theta I L^2 M}{\mu \beta^2} \le \left(1 - \frac{I\eta\theta\mu\beta^2}{4L}\right)\left(f(w_t) - f(w^{*}) - \frac{2\eta\theta I L^2 M}{\mu \beta^2}\right) \le \left(1 - \frac{I\eta\theta\mu\beta^2}{4L}\right)^{2}\left(f(w_{t-1}) - f(w^{*}) - \frac{2\eta\theta I L^2 M}{\mu \beta^2}\right) \le \left(1 - \frac{\theta I \eta \mu \beta^2}{4L}\right)^{t-1}\left(f(w_1) - f(w^{*}) - \frac{2\eta\theta I L^2 M}{\mu \beta^2}\right),$$

so

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w^{*}) \le (1 - B)^{t-1}\left(f(w_1) - f(w^{*}) - A\right) + A,$$

where $B = \frac{I\eta\theta\mu\beta^2}{4L}$. When using the clustering clients sampling strategy, $\mu = \mu_E$, then:

$$\mathbb{E}\left[f(w_{t+1})\right] - f(w^{*}) \le \left(1 - \frac{\theta I \eta \mu_E \beta^2}{4L}\right)^{t-1}\left(f(w_1) - f(w^{*}) - \frac{2\eta\theta I L^2 M}{\mu_E \beta^2}\right) + \frac{2\eta\theta I L^2 M}{\mu_E \beta^2}.$$
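To make the shape of this bound concrete, the minimal sketch below (illustrative only; every constant is a hypothetical placeholder rather than a value taken from the paper or its experiments) evaluates the right-hand side of Theorem 1 over communication rounds and shows the geometric decay of the optimality gap toward the floor $A_1$:

```python
import numpy as np

# Hypothetical placeholder constants; the paper does not prescribe numeric values here.
L_smooth = 4.0   # smoothness constant L
beta = 1.0       # Assumption 2 constant
mu_E = 0.8       # lower-bound constant under cluster sampling (mu = mu_E)
M = 1.0          # additive gradient bound from Assumption 4
M_G = 2.0        # multiplicative gradient bound from Assumption 4
theta = 0.1      # fraction of clients selected per round
I = 5            # local iterations per round
gap_1 = 8.0      # assumed initial optimality gap f(w_1) - f(w*)

eta = mu_E / (L_smooth * M_G)  # step size inside the admissible range of Theorem 1
A1 = 2 * eta * theta * I * L_smooth ** 2 * M / (mu_E * beta ** 2)
B1 = theta * I * eta * mu_E * beta ** 2 / (4 * L_smooth)

# Theorem 1 upper bound on E[f(w_{t+1})] - f(w*) as a function of the round index t.
rounds = np.arange(1, 201)
bound = (1 - B1) ** (rounds - 1) * (gap_1 - A1) + A1

print(f"A1 (convergence floor) = {A1:.4f}, B1 (per-round contraction) = {B1:.6f}")
print("bound at t = 1, 50, 200:", bound[0], bound[49], bound[199])
```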
REFERENCES

[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
[2] H. Wang, Z. Kaplan, D. Niu, and B. Li, "Optimizing federated learning on non-iid data with reinforcement learning," in IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, 2020, pp. 1698–1707.
[3] F. Sattler, K.-R. Müller, T. Wiegand, and W. Samek, "On the byzantine robustness of clustered federated learning," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8861–8865.
[4] L. U. Khan, M. Alsenwi, Z. Han, and C. S. Hong, "Self organizing federated learning over wireless networks: A socially aware clustering approach," in 2020 International Conference on Information Networking (ICOIN). IEEE, 2020, pp. 453–458.
[5] R. Jiang and S. Zhou, "Cluster-based cooperative digital over-the-air aggregation for wireless federated edge learning," in 2020 IEEE/CIC International Conference on Communications in China (ICCC). IEEE, 2020, pp. 887–892.
[6] F. Sattler, K.-R. Müller, and W. Samek, "Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[7] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, "An efficient framework for clustered federated learning," arXiv preprint arXiv:2006.04088, 2020.
[8] L. Cai, D. Lin, J. Zhang, and S. Yu, "Dynamic sample selection for federated learning with heterogeneous data in fog computing," in ICC 2020-2020 IEEE International Conference on Communications (ICC). IEEE, 2020, pp. 1–6.
[9] L. U. Khan, S. R. Pandey, N. H. Tran, W. Saad, Z. Han, M. N. Nguyen, and C. S. Hong, "Federated learning for edge networks: Resource optimization and incentive mechanism," IEEE Communications Magazine, vol. 58, no. 10, pp. 88–93, 2020.
[10] T. H. T. Le, N. H. Tran, Y. K. Tun, Z. Han, and C. S. Hong, "Auction based incentive design for efficient federated learning in cellular wireless networks," in 2020 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2020, pp. 1–6.
[11] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, "Analyzing federated learning through an adversarial lens," in International Conference on Machine Learning. PMLR, 2019, pp. 634–643.
[12] C. Xie, K. Huang, P.-Y. Chen, and B. Li, "Dba: Distributed backdoor attacks against federated learning," in International Conference on Learning Representations, 2019.
[13] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, "Beyond inferring class representatives: User-level privacy leakage from federated learning," in IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 2019, pp. 2512–2520.
[14] V. Tolpegin, S. Truex, M. E. Gursoy, and L. Liu, "Data poisoning attacks against federated learning systems," in European Symposium on Research in Computer Security. Springer, 2020, pp. 480–501.
[15] J. Lin, M. Du, and J. Liu, "Free-riders in federated learning: Attacks and defenses," arXiv preprint arXiv:1911.12560, 2019.
[16] X. Yao, C. Huang, and L. Sun, "Two-stream federated learning: Reduce the communication costs," in 2018 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2018, pp. 1–4.
[17] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[18] H. Li and T. Han, "An end-to-end encrypted neural network for gradient updates transmission in federated learning," arXiv preprint arXiv:1908.08340, 2019.
[19] H. Zhu and Y. Jin, "Multi-objective evolutionary federated learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 4, pp. 1310–1322, 2019.
[20] L. Li, Y. Fan, and K. Lin, "A survey on federated learning," in 16th IEEE International Conference on Control & Automation, ICCA 2020, Singapore, October 9-11, 2020. IEEE, 2020, pp. 791–796.
[21] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of fedavg on non-iid data," arXiv preprint arXiv:1907.02189, 2019.
[22] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-iid data," arXiv preprint arXiv:1806.00582, 2018.
[23] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.
[24] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," arXiv preprint arXiv:1812.06127, 2018.
[25] V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, and G. Srivastava, "A survey on security and privacy of federated learning," Future Generation Computer Systems, vol. 115, pp. 619–640, 2020.
[26] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni, "Federated learning with matched averaging," arXiv preprint arXiv:2002.06440, 2020.
[27] C. Briggs, Z. Fan, and P. Andras, "Federated learning with hierarchical clustering of local updates to improve training on non-iid data," arXiv preprint arXiv:2004.11791, 2020.
[28] G. Hiranandani and P. Chiu, "Variations of the stochastic gradient descent for multi-label classification loss functions," Online, 2020. [Online]. Available: https://gaurush.com/assets/docs/ece 566.pdf