
Reproduction and Evaluation for Optimal Device Selection in Federated Learning using Deep Q-learning

Tian Liu (UIN: 525004380), Texas A&M University, TX, USA
Cheng Niu (UIN: 532008693), Texas A&M University, TX, USA
Yuting Cai (UIN: 232003637), Texas A&M University, TX, USA
Abstract

In this final course project for CSCE 689 Deep Reinforcement Learning, we tried to reproduce the work by Wang et al. [1], which applied deep reinforcement learning to optimize federated learning (FL) on non-IID data. Specifically, Wang et al. formulated the device selection problem during FL as a Markov Decision Process and applied a DQN model to select the participating devices in each communication round, aiming for faster convergence and lower communication cost. We used the given flsim git repository as the starting framework for the federated learning system and implemented the Double DQN model based on the code given in the homework. We implemented PCA for state compression and tested the original reward function as proposed in the paper.

However, the testing results on the MNIST dataset with different device settings (100 clients and 20 clients) show that our implemented DQN model cannot reproduce the training performance reported in the paper. Based on discussions with other researchers online, we found that the reward function as shown in the paper is not correct. We further proposed a new reward function and analyzed why DQN is not appropriate for the device selection problem during FL, which has strong dependency between actions. Similar to many other researchers' experiences, we are highly doubtful of the results shown in the paper, and the results of our experiments indicate that the DQN did not learn how to select devices.

Our GitHub repository is https://github.com/tian1327/flsim_dqn. Our 5-minute YouTube video presentation is available at https://youtu.be/ZNkAfigHkN0.

Keywords: federated learning, reinforcement learning, deep Q-learning

1 Introduction and Motivation

With the recent fast development of machine learning, there have been many deep learning applications in people's lives, such as human activity recognition through smart phones or smart watches, next-word prediction in keyboard typing, and voice recognition. The training of deep learning models is usually conducted in the cloud with powerful GPUs and requires each end device to send its local data to the central server. However, sending local data to the cloud has raised significant privacy concerns among users.

As a solution, federated learning is a new machine learning paradigm which enables multiple devices to collaboratively learn from each other while preserving the privacy of the user data on each device. As shown in Figure 1, each end device conducts local training based on its local data and then sends its local model update, instead of its local data, to the central server. The server then aggregates the model updates it received and sends the new updated model back to the end devices, which conduct more local training using the new model. Each communication between the end devices and the server is referred to as a round. The process is repeated until the model converges, thus resulting in multiple rounds. In such a manner, the privacy of the local data on each end device is preserved.

Figure 1. Illustration of Federated Learning System. (By Ouyang et al. [3])

1.1 Device Selection in Federated Learning System

However, in practice, when there are millions of end devices with different network conditions, having all devices participate in the training leads to long wait times and is thus impractical. A common practice is to randomly sample a certain number of devices in each communication round to participate in the training. However, the data at each device is usually not independent and identically distributed (non-IID). Randomly selecting participating devices will introduce unbalanced datasets, leading to poor model performance and slow convergence, i.e., more communication rounds [2].

1.2 Motivation of using DRL to optimize device selection

To optimize the performance of federated learning on non-IID data, Wang et al. [1] proposed applying deep reinforcement learning to learn to optimally select the participating devices for each training round and thus speed up convergence. In particular, based on the observation that there is an implicit connection between the distribution of training data on a local device and the model weights trained on that local data, Wang et al. proposed adopting a reinforcement learning model to learn to select a subset of devices in each communication round so as to maximize a reward that encourages increases in validation accuracy and penalizes the use of more communication rounds.

1.3 Our Contributions in this project

1. We successfully implemented the DQN model for device selection based on the paper.
2. We implemented principal component analysis (PCA) for state space compression based on the local weights of all clients and successfully reproduced the clustering of clients based on their local weight updates.
3. We tested our implemented DQN model for various device settings (100 clients selecting 10 each round, 20 clients selecting 4 each round) on the MNIST dataset. We analyzed its performance and why the DQN did not work.
4. We proposed a new reward function based on our analysis that the original reward function is incorrect. We tested the performance of the new reward function in different settings.
5. We successfully reproduced several figures in the paper, such as Figure 3 and Figure 4.
6. We fixed various bugs in the original flsim system, such as missing class members and various typos.
7. We shared our results and findings with other researchers who also struggled with implementing DQN in FL (https://github.com/iQua/flsim/issues/7).

1.4 Our Reference Code

1. We implemented the DQN server based on the provided flsim git repository at https://github.com/iQua/flsim. Notice that this original git repository only provides the simulation framework for the federated learning system. It does not contain the DQN server used in the paper, for some unknown/suspicious reasons.
2. We implemented the DQN training framework based on the DQN homework code given in the CSCE 689 Deep Reinforcement Learning course.

2 MDP Definition

Here we explain how the device selection problem in the federated learning framework can be modeled as a Markov Decision Process, following the work by Wang et al. [1].

2.1 Why the device selection problem is sequential?

The device selection problem can be defined as selecting K devices out of N total available devices to participate in each round of training for the federated learning job. Figure 2 shows the workflow of using reinforcement learning to select K devices for each round of the training.

After all N devices download the initial random model weights W_init, train, and return the resulting model weights w_1^(k), k ∈ [N] to the FL server, the FL server generates Q(s_t, a; θ) values for all devices. The agent then selects K devices based on the top-K of these values. These devices download the latest global weights w_t and train to obtain the resulting model weights w_2^(k), k ∈ [K]; the FL server generates new values Q(s_{t+1}, a; θ) from the devices' reports, and this process repeats sequentially.

Figure 2. Workflow of device selection problem. (By Wang et al. [1])

2.2 State space

In the device selection problem, the state space S consists of the global model weights and the model weights of each client device in each round. The state vector is S_t = (w_t, w_t^(1), ..., w_t^(N)), where w_t denotes the weights of the global model after round t, and w_t^(1), ..., w_t^(N) denote the model weights of the N devices, respectively.
Because the weights have large dimension, which makes the state space very large, Wang et al. [1] applied principal component analysis (PCA) in the paper to reduce the dimensionality down to 100. Thus, each model's weights are compressed to a vector of 100 dimensions for the state representation.
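As an illustration of this compression step, the following is a minimal sketch using scikit-learn's PCA; it is not the exact code in our repository, and the function and variable names are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_state(global_weights, client_weights, n_components=100):
    """Build the DQN state by PCA-compressing the flattened model weights.

    global_weights: list of per-layer numpy arrays of the global model.
    client_weights: list (one entry per device) of per-layer weight lists.
    Returns a 1-D state vector with up to n_components values per model.
    """
    flatten = lambda ws: np.concatenate([np.asarray(w).ravel() for w in ws])
    rows = [flatten(global_weights)] + [flatten(cw) for cw in client_weights]
    X = np.stack(rows)                         # shape: (N + 1, num_parameters)

    # n_components cannot exceed the number of models, hence the min().
    pca = PCA(n_components=min(n_components, X.shape[0]))
    compressed = pca.fit_transform(X)          # shape: (N + 1, n_components')
    return compressed.ravel()                  # concatenated state vector
```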

2.3 Action space

In the device selection problem, the action is to select a subset of K devices from the N devices. However, this selection would result in a large action space of size (N choose K), which complicates the reinforcement learning training, so we need a trick to reduce the action space size.

We train the agent by selecting only one out of the N devices to participate in FL per round, while in testing and application the agent samples a batch of the top-K clients to participate in FL. The action space is thus reduced to {1, 2, ..., N}, where action i means selecting device i to participate in FL. The other clients are selected through the top-K values of Q*(s_t, a) after training is complete.
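For illustration, the top-K selection used at test time can be sketched as below; this is a minimal sketch with our own names, and it assumes the trained network has already produced one Q-value per device.

```python
import numpy as np

def select_top_k_devices(q_values, k):
    """Return the indices of the k devices with the largest Q(s_t, a) values.

    q_values: 1-D array of length N, one estimated value per candidate device.
    """
    q_values = np.asarray(q_values)
    # argsort is ascending, so take the last k indices and reverse for descending order.
    return np.argsort(q_values)[-k:][::-1]

# Example: in the 4-out-of-20 setting, keep the 4 best-scored clients.
# selected = select_top_k_devices(dqn_q_values, k=4)
```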
2.4 Transition function

In the device selection problem, the transition function is stochastic. In every training round, a different set of K devices is selected to receive the new model weights; they then conduct new local training with the new model and send their updates to the server. Thus the new state is formed from the new local model updates received from the K devices.

2.5 Reward function

In the device selection problem, we define the reward function at the end of each round t as r_t = Ξ^(ω_t − Ω) − 1, t = 1, ..., T, where ω_t is the testing accuracy achieved by the global model on the held-out validation set after round t, and Ω is the target accuracy.

As ML training proceeds, the model accuracy increases at a slower pace, which means ω_t − ω_{t−1} decreases as round t increases. Therefore, Ξ is a positive constant that makes r_t grow exponentially, amplifying the marginal accuracy increases as FL progresses into later stages.

The term −1 encourages the agent to complete training in fewer rounds, because the more rounds it takes, the less cumulative reward the agent receives.

Thus, the reward function aims to reach the target accuracy in the fewest communication rounds.
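In code, the per-round reward is a one-liner; the sketch below uses Ξ = 64 and Ω = 0.99 purely as example values, since the concrete constants are not fixed by the description above.

```python
XI = 64            # example value for the positive base Ξ
TARGET_ACC = 0.99  # example target accuracy Ω

def original_reward(test_accuracy):
    """r_t = Ξ^(ω_t − Ω) − 1: negative while ω_t < Ω (for Ξ > 1), zero at the target."""
    return XI ** (test_accuracy - TARGET_ACC) - 1
```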
2.6 RL algorithms to use and baselines

In the device selection problem, we use the Double Deep Q-learning Network (DDQN) to learn the function Q*(s_t, a). The Q-learning algorithm provides a value estimate for each potential action a at state s_t, based on which devices are selected. Considering the limited available traces from federated learning jobs, the DQN can be trained more efficiently and can reuse data more effectively than policy gradient methods and actor-critic methods.

However, the original Q-learning algorithms can be unstable since they indirectly optimize the agent's performance by learning an approximator Q(s, a; θ_t) of the optimal action-value function Q*(s, a). DDQN adds another value function Q(s, a; θ_t⁻) (the target network) to stabilize the action-value estimation.

If time permits, we will try to implement other RL algorithms such as policy gradient or actor-critic methods.

2.7 Stretch goal

One of our stretch goals is to understand how the given code implements the federated learning system, since none of us had experience with FL coding before. Another stretch goal is to implement DQN on top of the FL system and try to reproduce the results reported in the paper.

3 Related Works

In this section, we introduce previous work on device selection problems in the federated learning framework and on different reinforcement learning models, and we discuss how our work of using reinforcement learning differs from other works.

3.1 Device Selection in Federated Learning System

Because of the large number of participating devices in a federated learning system, the limited network connectivity and bandwidth, as well as the long training time on the server, it is infeasible in practice to have all devices participate in every round of training. To address the device selection problem, McMahan et al. [4] first proposed the FedAvg algorithm, which selects a random fraction of the total devices to participate in each round of training. Specifically, at each round the server randomly selects k devices and sends the global model weights to each of these clients to conduct local training based on their local data. These k devices then send their local weight updates back to the server, and the server takes the average of these weight updates. The process repeats until the model converges. However, the randomly selected local datasets may not have the same distribution as the true data distribution, and each local model could be significantly different from the others, leading to slow convergence and reduced model accuracy [2].
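As a sketch of the aggregation step described above (not the flsim implementation; the names are ours), the server can average the clients' layer weights, weighted by their local sample counts:

```python
import numpy as np

def fedavg_aggregate(client_weight_sets, client_num_samples):
    """Weighted average of client weights, in the spirit of FedAvg.

    client_weight_sets: list over the k selected clients; each entry is a list of
    per-layer numpy arrays. client_num_samples: local sample count per client.
    Returns the new global per-layer weights.
    """
    total = float(sum(client_num_samples))
    new_global = []
    for layer_idx in range(len(client_weight_sets[0])):
        layer_avg = sum(
            (n / total) * np.asarray(ws[layer_idx])
            for ws, n in zip(client_weight_sets, client_num_samples)
        )
        new_global.append(layer_avg)
    return new_global
```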
To address the non-IID problem, Ouyang et al. [3] proposed ClusterFL, which groups the devices into clusters and conducts the averaging within each cluster. The key observation of ClusterFL is that there are intrinsic similarities between different devices due to the subjects' biological features, physical environments, and sensor biases. To be specific, the server clusters the devices based on the KL divergence of their local model updates, and only devices within the same cluster average their model weights. The paper also proposed dropping nodes that have less correlation to the other nodes in the same cluster and dropping the straggler nodes of each cluster to reduce the communication cost. Results show that ClusterFL achieved much faster convergence and reduced the communication rounds significantly.

Nishio et al. [5] proposed FedCS to address inefficient training among devices with heterogeneous resources. Essentially, FedCS selects the devices based on their resource conditions to maximize the number of participating devices in each round of training. The selection of devices is based on the estimated time for each device to finish the model distribution, update, and upload steps, given its computation resources and network bandwidth.

Cho et al. [6] proposed FLAME, a user-centered FL training approach that leverages the time alignment across the multiple devices owned by the same user. FLAME also conducts accuracy- and efficiency-aware device selection, which achieves higher accuracy, greater energy efficiency, and faster convergence. Cho et al. [7] also proposed POWER-OF-CHOICE, which adopts a higher sampling probability for clients with higher loss in each round to achieve faster convergence.

3.2 Reinforcement Learning Models

3.2.1 Model Based On Deep Q-Network. Q-learning is a kind of model-free reinforcement learning, which can also be cast as a kind of asynchronous dynamic programming [8]. With Q-learning, agents have the capability of learning to act optimally in Markovian domains by experiencing the consequences of their actions.

Deep Q-learning is an extension of the classical Q-learning algorithm that uses a deep neural network to approximate the action-value function. Classical Q-learning agents have limited applicability beyond domains where useful features can be handcrafted. To use reinforcement learning successfully in real-world situations, Mnih et al. [9] proposed a single algorithm able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence. They use neural networks with several layers of nodes to build up progressively more abstract representations of the data and learn concepts such as object categories directly from raw sensory data. As a consequence, the deep Q-network can learn successful policies directly from high-dimensional sensory inputs. The deep learning networks still have some stability problems to improve, and they do not use the same RMSProp definition that many deep learning libraries provide.

The theoretical foundation of DQN is less well understood. Fan et al. [10] provide theoretical guarantees for the Deep Q-Network from both algorithmic and statistical perspectives. The statistical error characterizes the bias and variance that arise from approximating the action-value function with a deep neural network, while the algorithmic error converges to zero at a geometric rate.

Mnih et al.'s paper [9] did not provide an implementation of the DQN algorithm and simply presented the results, which are often used as a benchmark. Melrose Roderick et al. [11] implemented a Deep Q-Network (DQN) to play the Atari games and replicated the results of Mnih et al. Their implementation is also designed to be flexible to different neural network architectures and other problem domains. They use the NVIDIA CUDA Deep Neural Network library (cuDNN) to run forward and backward passes on common neural network layers optimized specifically for NVIDIA GPUs.

3.2.2 Model Based On Double Q-Network. Since the DQN algorithm uses the same Q estimate for both selecting and evaluating actions, it suffers from maximization bias [12]. Zeng et al. [13] therefore proposed using a Double Q-Network to select devices and improve the performance of federated learning by decoupling the action selection and evaluation processes. The double-DQN algorithm in federated learning mentioned by Zhang et al. [14] is based on choosing the action with the largest Q value in the next state and using the selected action to update the target Q value, which can be expressed as:

Q_target = r + γ Q(s′, argmax_a Q(s′, a; θ); θ⁻)    (1)

where θ denotes the parameters of the evaluated (online) network and θ⁻ the parameters of the target network; θ⁻ is updated with θ after a certain number of steps. After applying double DQN, we expected to see the model converge and achieve the accuracy target faster, according to the simulation results obtained by Nguyen et al. [15].
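A sketch of how equation (1) translates into code, assuming PyTorch and two networks of identical architecture (online_net with parameters θ and target_net with parameters θ⁻); the names are ours:

```python
import torch

def double_dqn_target(online_net, target_net, reward, next_state, gamma, done):
    """Compute the double-DQN target of equation (1) for a batch of transitions.

    reward: tensor (batch,); done: float tensor (batch,), 1.0 if the episode ended;
    next_state: tensor (batch, state_dim).
    The online network selects the argmax action; the target network evaluates it.
    """
    with torch.no_grad():
        next_actions = online_net(next_state).argmax(dim=1, keepdim=True)   # uses θ
        next_q = target_net(next_state).gather(1, next_actions).squeeze(1)  # uses θ⁻
        return reward + gamma * (1.0 - done) * next_q
```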
4 Implementation Details and Experiments Setup

In this section, we discuss in detail our implementation and changes to the code, and the experiment setup we used to evaluate our DQN implementation. We completed the project in the following steps.

First, we ran the given flsim code to see whether the federated learning system would run as expected. However, the given code has many bugs and missing parts. For example, when the clients send their reports to the server, the locally updated weights were not included in the report. Thus we had to fix the code first. Other bugs include various typos in the code.

Second, after the federated learning system was set up, we wanted to reproduce Figure 3 of the paper, which compares the number of communication rounds required to reach 99% testing accuracy for FedAvg on IID and non-IID data settings. We also compared their performance with the K-means and K-center algorithms for device selection. We had to overcome many difficulties; for example, we had to understand the config file settings to set up the correct configurations for running IID and non-IID clients. The K-means server also has a bug that we had to fix.

Third, we implemented the PCA model to compress the state space as stated in the paper. The original code does not provide an implementation, so we had to implement it ourselves.
We tested our implementation by plotting Figure 4 of the paper, which shows the clustered clients obtained by projecting their weights into a 2-dimensional space. With the implementation of PCA, the roughly 410,000 original parameters of all devices and the global model were reduced to only about 10,000 parameters for the state representation.

Fourth, we implemented the DQN model for device selection. Specifically, we implemented two servers, the DQNTrain server and the DQN server. The DQNTrain server is used to train the DQN model through multiple episodes of FL processes. Once the DQN model is trained, the DQN server loads the trained model to conduct testing in a new run of federated learning. We recorded the training performance of the DQN using its total reward in each episode and its testing performance using the testing accuracy of each round.
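The overall structure of our DQNTrain loop is sketched below; env and agent are stand-ins for our wrapper around the flsim simulation and for the double-DQN agent with a replay buffer, and method names such as reset, step, act, remember, and update are placeholders rather than the exact API.

```python
def train_dqn(env, agent, num_episodes=200, max_rounds=50):
    """Train the device-selection DQN over repeated federated learning episodes."""
    for episode in range(num_episodes):
        state = env.reset()                   # start a fresh FL job with random initial weights
        episode_return = 0.0
        for _ in range(max_rounds):
            action = agent.act(state)         # epsilon-greedy choice of ONE client index
            next_state, reward, done = env.step(action)  # run one FL communication round
            agent.remember(state, action, reward, next_state, done)
            agent.update()                    # sample a replay batch, double-DQN update
            episode_return += reward
            state = next_state
            if done:                          # target accuracy reached or round limit hit
                break
        print(f"episode {episode}: return = {episode_return:.3f}")
```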
Fifth, we evaluated the DQN in two settings: selecting 10 devices out of 100 total clients, and selecting 4 devices out of 20 total clients. In the 100-client case each client has 600 data samples, while in the 20-client case each client has 3000 data samples. We compared the trained DQN performance against FedAvg, K-means, and K-center as baselines.

Sixth, after observing that the performance of the DQN was not as we expected, we analyzed why the DQN did not work well. We then proposed a new reward function based on the improvement in testing accuracy rather than on the difference to the target accuracy. We compare the training and testing performance of the two reward functions in the 100-client and 20-client settings mentioned above.

The experiments for training and testing the DQN model were done on a Linux server (64-core Intel Xeon Silver 4313 CPUs and 3 Nvidia A30 GPUs with 24 GB each). Each training episode has at most 50 steps (at most 50 FL communication rounds), which takes about 11 minutes per episode. We trained the DQN for about 200 episodes for each setting.

5 Evaluation Results and Discussions

5.1 FedAvg in Non-IID Settings

FedAvg is a client selection strategy which randomly selects ten clients from the client pool to participate in the training each round. K-center is a strategy which first performs k-cluster classification on the client pool and then chooses clients from the same cluster to participate in training. K-means is a strategy that chooses the k clients with minimum mean, which indicates these clients are the most similar to train. As shown in Figure 3, FedAvg takes only around 60 rounds to achieve 99% accuracy on IID data, but when it comes to non-IID data, all three methods take almost triple the communication rounds to achieve the target.

Figure 3. Training model on non-IID MNIST data.

5.2 Clustering Clients using PCA weights

As demonstrated in the sections above, we treat the weights from the clients as the DQN training state space. Since it is too large to train on, we implemented PCA to reduce the clients' weights to two dimensions. As shown in Figure 4, after dimension reduction the clients can still be classified by their two-dimensional weights, both in the 20-client pool and in the 100-client pool.

Figure 4. PCA on clients' weights: (a) PCA for 20 clients; (b) PCA for 100 clients.

5.3 DQN Training Performance with 2 Reward Functions

The original reward function, as discussed above, is

r_t = Ξ^(ω_t − Ω) − 1,  t = 1, ..., T    (2)

However, we discovered that the original reward function cannot comprehensively reflect the return from training, because our DQN model selects one client to participate in training each round. If we suppose the DQN model selects the same client every round, it is equivalent to using one client to train the global model. The global model accuracy will still increase during training, but the selection process does not get optimized at all, which is unreasonable. So we propose a new reward function:

r_t = Ξ^(ω_t − ω_{t−1})   if ω_t > ω_{t−1}
r_t = −Ξ^(ω_{t−1} − ω_t)  if ω_t < ω_{t−1},  t = 1, ..., T    (3)

As shown in equation (3), the new reward function provides a positive reward if the accuracy improves and a penalty if the accuracy drops.
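A minimal sketch of the new reward in code (again using Ξ = 64 only as an example value; equation (3) leaves the ω_t = ω_{t−1} case unspecified, so returning zero there is our assumption):

```python
XI = 64  # example value for Ξ

def new_reward(acc_now, acc_prev):
    """Equation (3): reward accuracy gains, penalize accuracy drops."""
    if acc_now > acc_prev:
        return XI ** (acc_now - acc_prev)
    if acc_now < acc_prev:
        return -(XI ** (acc_prev - acc_now))
    return 0.0  # assumption: no reward when the accuracy is unchanged
```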
In the experiments, we performed DQN training with both reward functions in two settings: picking 10 clients from 100 total clients (shown in Figure 5) and picking 4 clients from 20 total clients (shown in Figure 6). However, even though the new reward function provides a clearer change in reward during the training process, the results indicate that the DQN agent does not improve during training, because the reward does not increase as training proceeds in either setting.

Figure 5. 10-100 DQN training reward: (a) training with the new reward function; (b) training with the given reward function.

Figure 6. 4-20 DQN training reward: (a) training with the new reward function; (b) training with the given reward function.

5.4 DQN Inference Performance

In this section we compare our DQN strategy with traditional strategies in two settings. The first setting is 4-20, which represents picking the 4 best-performing clients out of 20 total clients; 10-100 represents picking the 10 best-performing clients out of 100 total clients. The results are shown in Figure 7.

Figure 7. DQN accuracy vs. communication rounds compared with traditional methods: (a) 10-100 comparison; (b) 4-20 comparison.

In the 4-20 setting, our DQN strategy reaches high accuracy in far fewer communication rounds than the K-means and K-center strategies. But this does not clearly prove that DQN is a better strategy, because K-means and K-center tend to underperform on small sample collections. The problem with DQN can be observed more clearly in the 10-100 setting. As the number of clients and the state space increase, the DQN strategy performs worse than every other strategy in both convergence speed and final accuracy.

In conclusion, randomly selecting clients for federated learning provides better results than selecting clients with the trained DQN, which does not improve accuracy or speed in our experiments. We discuss the potential reasons for the failure of DQN in the next section.

5.5 Why DQN did not work?

The major reason DQN does not work might be that we only select a single client in each round during training, so the return from each round has a huge bias. This also leads to the problem that the DQN agent does not explore enough: it is highly possible that many clients are never visited by the agent during training. But increasing the number of clients selected per round increases the cost of DQN combinatorially. For example, choosing 2 clients out of 100 clients increases the action space from 100 to 4950. Due to the limitations of time and hardware, we cannot afford such a huge cost.
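The combinatorial growth can be checked directly:

```python
from math import comb

print(comb(100, 1))   # 100 actions when selecting a single client
print(comb(100, 2))   # 4950 actions when selecting 2 of 100 clients
print(comb(100, 10))  # about 1.7e13 actions for the full 10-of-100 selection
```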
Also, the memory size for recording trajectories may be too small in our DQN model. Considering the roughly 10,100-dimensional state space and the 100 actions in this setting, only about 200 episodes of trajectories is relatively small, which may cause training to end before the agent has explored the solution with the best return. However, training 200 episodes already takes more than 30 hours, and increasing the number of training episodes would increase the time cost significantly, which we could not afford in this project.

Besides, even with the new reward function we proposed, the reward cannot accurately reflect the actual return of DQN training, because the test accuracy will still increase after each training round regardless of the selection of clients. For example, even if the server selects the same device for every communication round, the testing accuracy will still increase with more rounds (in this case it is like running multiple rounds of stochastic gradient descent with the same data used in every round).

Last but not least, DQN is not able to provide ideal performance for problems which have strong dependency between actions, such as some of the Atari games (Double Dunk and Montezuma's Revenge) mentioned in class. Our client selection problem is a typical problem of that type, since the earlier and later selections in FL are very strongly correlated in the learning result.

6 Future work

The DQN model in our current implementation did not reproduce the converging rewards shown in the paper. Thus, in our future work, we plan to:

1. Try a setting with a small number of total clients, such as selecting 2 devices each round out of 10 total devices.
2. Try policy gradient RL models such as Actor-Critic models.
3. Improve the DQN model, for example by increasing the replay memory size and designing more precise reward functions.

7 Conclusions

In conclusion, we successfully built a federated learning model with a client selection mechanism based on a Deep Q-Learning agent. Unfortunately, our experiment on implementing DQN to optimize federated learning on non-IID data did not reproduce the work shown in the paper. Based on our analysis, we believe the reward function proposed by the original paper is problematic and that DQN is not suitable for the device selection process during federated learning, which has strong dependency between actions. Nevertheless, applying not just DQN but reinforcement learning methods in general to improve the performance of federated learning is still an interesting research field with huge potential in the future.

References

[1] Hao Wang, Zakhary Kaplan, Di Niu, and Baochun Li. Optimizing federated learning on non-IID data with reinforcement learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 1698–1707. IEEE, 2020.
[2] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.
[3] Xiaomin Ouyang, Zhiyuan Xie, Jiayu Zhou, Jianwei Huang, and Guoliang Xing. ClusterFL: a similarity-aware federated learning system for human activity recognition. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 54–66, 2021.
[4] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[5] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1–7. IEEE, 2019.
[6] Hyunsung Cho, Akhil Mathur, and Fahim Kawsar. FLAME: Federated learning across multi-device environments. arXiv preprint arXiv:2202.08922, 2022.
[7] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243, 2020.
[8] Christopher J.C.H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[9] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb 2015.
[10] Jianqing Fan, Zhaoran Wang, Yuchen Xie, and Zhuoran Yang. A theoretical analysis of deep Q-learning. CoRR, abs/1901.00137, 2019.
[11] Melrose Roderick, James MacGlashan, and Stefanie Tellex. Implementing the deep Q-network. CoRR, abs/1711.07478, 2017.
[12] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[13] Rongfei Zeng, Chao Zeng, Xingwei Wang, Bo Li, and Xiaowen Chu. A comprehensive survey of incentive mechanism for federated learning. arXiv preprint arXiv:2106.15406, 2021.
[14] Jiaxiang Zhang, Yiming Liu, Xiaoqi Qin, and Xiaodong Xu. Energy-efficient federated learning framework for digital twin-enabled industrial internet of things. In 2021 IEEE 32nd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), pages 1160–1166. IEEE, 2021.
[15] Huy T. Nguyen, Nguyen Cong Luong, Jun Zhao, Chau Yuen, and Dusit Niyato. Resource allocation in mobility-aware federated learning networks: A deep reinforcement learning approach. In 2020 IEEE 6th World Forum on Internet of Things (WF-IoT), pages 1–6. IEEE, 2020.
