1.4 Our Reference Code
1. We implemented the DQN server based on the provided flsim git repository at https://github.com/iQua/flsim. Note that this original repository only provides the simulation framework for the federated learning system; it does not contain the DQN server used in the paper, for some unknown (and somewhat suspicious) reason.
2. We implemented the DQN training framework based on the DQN homework code given in the CSCE689 Deep Reinforcement Learning course.

Figure 2. Workflow of the device selection problem. (By Wang et al. [1])

2.2 State space
In the device selection problem, the state space 𝑆 consists of the global model weights and the model weights of each client device in each round. The state vector is 𝑆𝑡 = (𝑤𝑡, 𝑤𝑡^(1), . . . , 𝑤𝑡^(𝑁)), where 𝑤𝑡 denotes the weights of the global model after round 𝑡 and 𝑤𝑡^(1), . . . , 𝑤𝑡^(𝑁) denote the model weights of the 𝑁 devices, respectively. Because the weights have a very large dimension, which makes the state space very large, Wang et al. [1] applied principal component analysis (PCA) to reduce the dimensionality down to 100. Thus, the final state is a 100-dimensional vector.
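Since the public flsim repository does not include this PCA step, the following is only a rough sketch of how the compression could be wired up; the function names, the use of scikit-learn, and the idea of fitting PCA on state vectors collected from earlier rounds are our own assumptions, not code from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

STATE_DIM = 100  # target dimensionality used in the paper

def flatten_weights(weights):
    """Flatten a model's layer tensors into one 1-D vector."""
    return np.concatenate([np.asarray(w).ravel() for w in weights])

def raw_state(global_weights, client_weights):
    """Concatenate the flattened global and per-client weights into one
    long raw state vector (w_t, w_t^(1), ..., w_t^(N))."""
    parts = [flatten_weights(global_weights)]
    parts += [flatten_weights(w) for w in client_weights]
    return np.concatenate(parts)

def fit_state_pca(raw_states):
    """Fit PCA on a collection of raw state vectors.
    Needs at least STATE_DIM samples (e.g., states collected from
    earlier FL rounds) to extract 100 components."""
    pca = PCA(n_components=STATE_DIM)
    pca.fit(np.stack(raw_states))
    return pca

def compress_state(pca, global_weights, client_weights):
    """Project one raw state down to the 100-dimensional RL state."""
    raw = raw_state(global_weights, client_weights).reshape(1, -1)
    return pca.transform(raw)[0]
```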
2.3 Action space
In the device selection problem, the action is to select a subset of 𝐾 devices out of 𝑁 devices. Selecting the subset directly would result in an action space of size (𝑁 choose 𝐾), which complicates the reinforcement learning training, so we use a trick to reduce the action space size.
We train the agent to select only one out of the 𝑁 devices to participate in FL per round, while in testing and application the agent selects the top-𝐾 clients to participate in FL. The action space is thus reduced to {1, 2, . . . , 𝑁}, where action 𝑖 means selecting device 𝑖 to participate in FL. After training is complete, the other clients are selected through the top-𝐾 values of 𝑄∗(𝑠𝑡, 𝑎).
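As a minimal sketch of this deployment-time selection (the array of Q-values and the helper name are our own illustration, not flsim's API), picking the top-𝐾 devices from the trained action-value estimates could look like:

```python
import numpy as np

def select_top_k_devices(q_values, k):
    """Given Q*(s_t, a) estimates for all N devices (shape [N]),
    return the indices of the k devices with the highest values."""
    q_values = np.asarray(q_values)
    # argsort is ascending, so take the last k indices and reverse them
    return np.argsort(q_values)[-k:][::-1]

# Example: 10 devices in total, select 4 per round
q = np.random.rand(10)
chosen = select_top_k_devices(q, k=4)
```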
2.4 Transition function
In the device selection problem, the transition function is stochastic. In every training round, a different set of 𝐾 devices is selected to learn new model weights; the selected devices conduct local training with the new global model and send their updates to the server. The new state is then formed from the new local model updates received from the 𝐾 devices.
2.5 Reward function
In the device selection problem, we define the reward at the end of each round 𝑡 as 𝑟𝑡 = Ξ^(𝜔𝑡 − Ω) − 1, for 𝑡 = 1, . . . , 𝑇, where 𝜔𝑡 is the testing accuracy achieved by the global model on the held-out validation set after round 𝑡 and Ω is the target accuracy.
As ML training proceeds, the model accuracy increases at a slower pace, which means 𝜔𝑡 − 𝜔𝑡−1 decreases as round 𝑡 increases. Therefore, Ξ is a positive constant that makes 𝑟𝑡 grow exponentially in 𝜔𝑡, which amplifies the marginal accuracy increases as FL progresses into later stages.
The −1 term encourages the agent to complete training in fewer rounds, because the more rounds the agent takes, the less cumulative reward it receives.
Thus, the reward function aims to reach the target accuracy in the fewest communication rounds.
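A one-line implementation of this reward is sketched below; the value of Ξ shown is only a placeholder hyperparameter of ours, not a number taken from the paper.

```python
XI = 64.0  # positive base for the exponential reward; placeholder value

def round_reward(accuracy, target_accuracy, xi=XI):
    """Reward at the end of a round: r_t = xi^(accuracy - target) - 1.
    The reward is negative while accuracy < target and approaches 0
    from below as the global model closes in on the target accuracy,
    so taking fewer rounds yields a higher cumulative reward."""
    return xi ** (accuracy - target_accuracy) - 1.0
```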
2.6 RL algorithms to use and baselines
In the device selection problem, we use the double deep Q-learning network (DDQN) to learn the function 𝑄∗(𝑠𝑡, 𝑎). The Q-learning algorithm provides a value estimate for each potential action 𝑎 at state 𝑠𝑡, based on which devices are selected. Considering the limited traces available from federated learning jobs, DQN can be trained more efficiently and can reuse data more effectively than policy gradient and actor-critic methods. However, the original Q-learning algorithms can be unstable, since they indirectly optimize the agent's performance by learning an approximator 𝑄(𝑠, 𝑎; 𝜃𝑡) of the optimal action-value function 𝑄∗(𝑠, 𝑎). DDQN adds a second value function 𝑄′(𝑠, 𝑎; 𝜃′𝑡) to stabilize the action-value estimation. If time permits, we will try to implement other RL algorithms such as policy gradient or actor-critic methods.

2.7 Stretch goal
One of our stretch goals is to understand how the given code implements the federated learning system, since none of us had experience with FL coding before. Another stretch goal is to implement DQN on top of the FL system and try to reproduce the results reported in the paper.

3 Related Works
In this section, we introduce previous work on the device selection problem in the federated learning framework and on different reinforcement learning models, and we discuss how our use of reinforcement learning differs from other works.
3.1 Device Selection in Federated Learning System
Because of the large number of participating devices in a federated learning system, the limited network connectivity and bandwidth, and the long training time on the server, in practice it is infeasible to have all devices participate in every round of training. To address the device selection problem, McMahan et al. [4] first proposed the FedAvg algorithm, which selects a random fraction of the total devices to participate in each round of training. Specifically, at each round the server randomly selects 𝑘 devices and sends the global model weights to each of these clients, which then conduct local training on their local data. These 𝑘 devices send their local weight updates back to the server, and the server takes the average of these weight updates. The process repeats until the model converges. However, the randomly selected local datasets may not have the same distribution as the true data distribution, and the local models can differ significantly from one another, leading to slow convergence and reduced model accuracy [2].
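As an illustration of the aggregation step just described, here is a simplified, unweighted sketch of the server-side average; FedAvg proper weights each client by its number of local samples, and none of these names come from flsim.

```python
import numpy as np

def fedavg_aggregate(client_weight_lists):
    """Average the clients' model weights layer by layer.
    client_weight_lists: a list over clients, each a list of layer arrays."""
    num_clients = len(client_weight_lists)
    num_layers = len(client_weight_lists[0])
    return [
        sum(np.asarray(client[layer]) for client in client_weight_lists) / num_clients
        for layer in range(num_layers)
    ]
```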
To address the non-IID problem, Ouyang et al. [3] proposed ClusterFL, which groups the devices into clusters and conducts the averaging within each cluster. The key observation behind ClusterFL is that there are intrinsic similarities between different devices due to the subjects' biological features, physical environments, and sensor biases. Specifically, the server clusters the devices based on the KL divergence of their local model updates, and only devices within the same cluster average their model weights. The paper also proposes dropping nodes that have low correlation with the other nodes in their cluster, as well as straggler nodes in each cluster, to reduce the communication cost. Results show that ClusterFL achieved much faster convergence and reduced communication rounds significantly.
Nishio et al. [5] proposed FedCS to address the inefficient training among devices with heterogeneous resources. Essentially, FedCS selects devices based on their resource conditions so as to maximize the number of participating devices in each round of training. The selection is based on the estimated time for each device to finish its distribution, scheduled update, and upload phases, given its computation resources and network bandwidth.
Cho et al. [6] proposed FLAME, user-centered FL training that leverages the time alignment across the multiple devices owned by the same user. FLAME also conducts accuracy- and efficiency-aware device selection, which achieves higher accuracy, greater energy efficiency, and faster convergence. Cho et al. [7] also proposed POWER-OF-CHOICE, which adopts a higher sampling probability for clients with higher loss in each round to achieve faster convergence.
3.2 Reinforcement Learning Models
3.2.1 Model Based On Deep Q-Network. Q-learning is a model-free reinforcement learning method that can also be cast as a form of asynchronous dynamic programming [8]. With Q-learning, agents can learn to act optimally in Markovian domains by experiencing the consequences of their actions.
Deep Q-learning is an extension of the classical Q-learning algorithm that uses a deep neural network to approximate the action-value function. Classical Q-learning agents have limited applicability, mainly in domains where useful features can be handcrafted. To use reinforcement learning successfully in real-world situations, Mnih et al. [9] proposed a single algorithm able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence. They use neural networks with several layers of nodes to build progressively more abstract representations of the data and to learn concepts such as object categories directly from raw sensory data. As a consequence, the deep Q-network can learn successful policies directly from high-dimensional sensory inputs. The deep Q-learning networks still have stability problems to address, and the original work does not use the same RMSProp definition that many deep learning libraries provide.
The theoretical foundation of DQN is less well understood. Fan et al. [10] provide theoretical guarantees for the deep Q-network from both algorithmic and statistical perspectives: the statistical error characterizes the bias and variance that arise from approximating the action-value function with a deep neural network, while the algorithmic error converges to zero at a geometric rate.
Mnih et al.'s paper [9] does not provide an implementation of the DQN algorithm and simply presents the results, which are often used as a benchmark. Roderick et al. [11] implemented a deep Q-network to play the Atari games and replicated the results of Mnih et al. Their implementation is also designed to be flexible to different neural network architectures and other problem domains. They use the NVIDIA CUDA Deep Neural Network library (cuDNN) to run forward and backward passes on common neural network layers optimized specifically for NVIDIA GPUs.

3.2.2 Model Based On Double Q-Network. The DQN algorithm uses the same Q-value estimate for both selecting and evaluating actions, which leads to maximization bias [12]. Zeng et al. [13] therefore proposed using a double Q-network to select devices in order to improve the performance of federated learning by decoupling the action selection and evaluation processes. The double-DQN algorithm in federated learning described by Zhang et al. [14] chooses the action with the largest Q value in the next state and uses that action to update the target Q; it can be expressed as:

𝑄target = 𝑟 + 𝛾 𝑄(𝑠′, arg max_𝑎 𝑄(𝑠′, 𝑎; 𝜃); 𝜃⁻)   (1)

where 𝜃 denotes the parameters of the evaluation network and 𝜃⁻ the parameters of the target network; 𝜃⁻ is updated with 𝜃 after a certain number of steps. After applying double DQN, we expected the model to converge and reach the target accuracy faster, in line with the simulation results obtained by Nguyen et al. [15].
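A minimal PyTorch sketch of the target in Equation (1) is shown below, assuming q_net and target_net map a batch of states to per-device Q values; these tensor names and shapes are our own illustration, not code from the paper or flsim.

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute r + gamma * Q(s', argmax_a Q(s', a; theta); theta^-).
    rewards, dones: shape [B]; next_states: shape [B, state_dim]."""
    # Action selection with the online (evaluation) network, theta
    next_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # [B, 1]
    # Action evaluation with the target network, theta^-
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # [B]
    return rewards + gamma * (1.0 - dones.float()) * next_q
```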
4 Implementation Details and Experiments Setup
In this section, we discuss in detail our implementation of and changes to the code, as well as the experimental setup we used to evaluate our DQN implementation. We completed the project in the following steps.
First, we ran the given flsim code to see whether the federated learning system runs as expected. However, the given code has many bugs and missing parts. For example, when the clients send reports to the server, the locally updated weights were not included in the report, so we had to fix the code first. Other bugs include various typos in the code.
Second, after the federated learning system was set up, we wanted to reproduce Figure 3 in the paper, which compares the number of communication rounds required to reach 99% testing accuracy for FedAvg on the IID and non-IID data settings. We also compared their performance with the K-means and K-clusters device selection algorithms. We had to overcome many difficulties; for example, we had to understand the config file settings in order to set up the correct configurations for running IID and non-IID clients. The K-means server also had a bug that we had to fix.
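For context on the IID versus non-IID settings mentioned above, the sketch below shows one common way a label-skewed (non-IID) client partition is constructed: sort the samples by label and deal out a few label shards per client. This is only an illustration of the idea, not flsim's actual config mechanism or data loader.

```python
import numpy as np

def non_iid_partition(labels, num_clients, shards_per_client=2, seed=0):
    """Label-skewed split: sort sample indices by label, cut them into
    shards, and give each client a few shards so it only sees a couple
    of classes (the classic non-IID setting from McMahan et al. [4])."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)  # sample indices sorted by label
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(num_clients)
    ]
```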
Third, we implemented the PCA model to compress the state space as described in the paper. The original code does not provide an implementation, so we had to implement it ourselves.
Figure 5. 10-100 DQN training reward. (a) 10-100 DQN training with the new reward function; (b) 10-100 DQN training with the given reward function.

Figure 7. DQN accuracy vs. communication rounds compared with traditional methods. (a) 10-100 DQN comparison; (b) 4-20 DQN comparison.
6 Future work
The DQN model in our current implementation did not reproduce the converging rewards shown in the paper. Thus, in our future work, we plan to:
1. Try a setting with a small number of total clients, such as selecting 2 devices each round out of 10 total devices.
2. Try policy gradient RL models such as actor-critic methods.
3. Improve the DQN model, for example by increasing the replay memory size and designing more precise reward functions.

7 Conclusions
In conclusion, we successfully built the federated learning model with a client selection mechanism based on a deep Q-learning agent. Unfortunately, our experiment implementing DQN to optimize federated learning on non-IID data did not reproduce the results shown in the paper. Based on our analysis, we believe the reward function proposed by the original paper is problematic and that DQN is not well suited to the device selection process in federated learning, which has strong dependencies between actions. Nevertheless, applying not just DQN but reinforcement learning methods in general to improve the performance of federated learning remains an interesting research field with great potential.
References
[1] Hao Wang, Zakhary Kaplan, Di Niu, and Baochun Li. Optimizing federated learning on non-iid data with reinforcement learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 1698–1707. IEEE, 2020.
[2] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
[3] Xiaomin Ouyang, Zhiyuan Xie, Jiayu Zhou, Jianwei Huang, and Guoliang Xing. ClusterFL: A similarity-aware federated learning system for human activity recognition. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 54–66, 2021.
[4] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[5] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1–7. IEEE, 2019.
[6] Hyunsung Cho, Akhil Mathur, and Fahim Kawsar. FLAME: Federated learning across multi-device environments. arXiv preprint arXiv:2202.08922, 2022.
[7] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243, 2020.
[8] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[9] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
[10] Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. A theoretical analysis of deep Q-learning. CoRR, abs/1901.00137, 2019.
[11] Melrose Roderick, James MacGlashan, and Stefanie Tellex. Implementing the deep Q-network. CoRR, abs/1711.07478, 2017.
[12] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[13] Rongfei Zeng, Chao Zeng, Xingwei Wang, Bo Li, and Xiaowen Chu. A comprehensive survey of incentive mechanism for federated learning. arXiv preprint arXiv:2106.15406, 2021.
[14] Jiaxiang Zhang, Yiming Liu, Xiaoqi Qin, and Xiaodong Xu. Energy-efficient federated learning framework for digital twin-enabled industrial internet of things. In 2021 IEEE 32nd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), pages 1160–1166. IEEE, 2021.
[15] Huy T. Nguyen, Nguyen Cong Luong, Jun Zhao, Chau Yuen, and Dusit Niyato. Resource allocation in mobility-aware federated learning networks: A deep reinforcement learning approach. In 2020 IEEE 6th World Forum on Internet of Things (WF-IoT), pages 1–6. IEEE, 2020.