Resource Management Based On Reinforcement Learning For D2D Communication in Cellular Networks
Abstract— Recently, the integration of Device-to-Device (D2D) communication into cellular networks has become a vital task with the growth of mobile devices, as well as the requirements for enhanced network performance in terms of spectral efficiency, energy efficiency, and latency. In this paper, we propose a spectrum allocation framework based on Reinforcement Learning (RL) for joint mode selection, channel assignment, and power control in D2D communication. The objective is to maximize the overall throughput of the network while ensuring the quality of transmission and guaranteeing the low latency requirements of D2D communications. The proposed algorithm uses RL based on a Markov Decision Process (MDP) with a newly designed reward function to learn the policy by interacting with the D2D environment. An Actor-Critic Reinforcement Learning (AC-RL) approach is then used to solve the resource management problem. The simulation results show that our learning method performs well, can greatly improve the sum rate of D2D links, and converges quickly compared with the algorithms in the literature.

Keywords— Device-to-device (D2D) communication, Resource allocation, Reinforcement learning (RL), Markov Decision Process.

I. INTRODUCTION

Device-to-Device (D2D) communication is a promising component of next-generation cellular technologies. D2D communication in cellular networks is defined as allowing two cellular users to establish direct communication without relying on the Base Station (BS) or the core network. D2D communication is generally non-transparent to the cellular network, and it can occur on the cellular spectrum (i.e., inband) or the unlicensed spectrum (i.e., outband). With the increase of cellular mobile applications and the corresponding high requirements in terms of quality of service (QoS), connectivity, latency, energy, and spectral efficiency, D2D communication can be very helpful thanks to its proximity and reuse gains. Hence, D2D communication is recognized as a promising candidate for improving the architecture of cellular networks [1]-[5].

Recently, many resource allocation approaches have been proposed for D2D communications. Most of the works focus on throughput maximization under QoS and power constraints. Since the problem formulation involves binary channel assignment parameters, it leads to a non-convex, mixed-integer non-linear program (MINLP). Generally, such resource management problems are non-deterministic polynomial-time hard (NP-hard). Consequently, a common approach to solving this type of problem is to decompose the original problem into two or three sub-problems, such as the mode selection, power control, and channel assignment problems [6]-[9]. In [6], a joint mode selection and resource allocation approach was proposed to maximize the sum rate of the cellular network system. In [7], a joint admission control and resource allocation scheme was proposed, aiming at providing lasting Quality of Service (QoS) support to both CUEs and DUEs within the network. In [8], a resource allocation scheme for DUEs was proposed to maximize the overall throughput in a single cell. The scheme is based on three stages: admission control for D2D pairs, optimal power control, and finding the optimal reuse candidates using maximum weighted matching. A centralized joint mode selection and power control scheme was proposed in [9], using a heuristic algorithm for the light and medium load scenarios to maximize the overall system throughput while ensuring the SINR of both D2D and cellular users.

Various resource management schemes for joint mode selection and power control were developed as traditional optimization problems. The optimization complexity of these schemes is high, and they cannot be applied to complicated communication scenarios. Even though the above works can achieve significant efficiency, their solutions are not intelligent enough. Aside from conventional optimization techniques, several recent works have developed Reinforcement Learning (RL) approaches to address the mode selection and resource allocation problem. In [10], an RL framework was proposed to solve the joint mode selection and power adaptation problem in the V2V communication network in 5G. In [11]-[12], a novel dynamic neural Q-learning based scheduling algorithm was proposed for downlink transmission in LTE-A cellular networks, which aims to achieve a good trade-off between throughput and fairness. The proposed algorithm is based on the Q-learning algorithm and adapts to variations in channel conditions. For D2D-based V2V communication in LTE-A cellular networks, a dynamic neural Q-learning-based resource allocation and resource sharing algorithm was proposed in [13]. The proposed algorithm aims at optimizing the sum rate of cellular and vehicular users and reducing the interference of V2V links to cellular links while ensuring the QoS requirements of safety vehicular users. In [14], a reinforcement learning algorithm was proposed for adaptive power control to enhance system throughput and minimize interference while satisfying the communication quality of cellular and D2D users.

Recently, a lot of research has been conducted to adopt RL to address resource management in D2D communication [15]-
network: a set of K cellular user equipments (CUEs) is denoted by C = {1, 2, …, K}, and a set of D2D user equipment (DUE) pairs is denoted by D; all users are located in the coverage area of an eNB. Without loss of generality, each CUE occupies one RB, which can be shared by multiple DUE pairs, and one DUE pair can only occupy one RB. We assume the peer devices of a DUE pair can operate in one of three transmission modes. In the cellular mode, when the peer devices are far away from each other or the channel gain between them is poor, they cannot directly communicate with each other; thus, in this case, they communicate through the Base Station (BS), which acts as a relay, as traditional cellular users.
So, the SINR of DUE j in the cellular mode can be expressed as

    γ_j^c = P_j^c g_{j,B} / σ²    (4)

where g_{j,B} denotes the channel gain from the transmitter of DUE pair j to the BS and σ² is the noise power. In addition, the RB of a CUE will not suffer interference from DUEs when it is not currently reused by any DUE. Then, the SINR of CUE i may be expressed as

    γ_i = P_{c,i} g_{i,B} / σ²    (5)

The minimum data rate requirement constraints of CUE i and DUE pair j may be expressed as

    R_i ≥ R_{i,min}, ∀ i ∈ C, and R_j ≥ R_{j,min}, ∀ j ∈ D    (6)

The reliability of DUE pair j is assured by controlling the probability of outage. The outage probability is used to characterize the reliability; it can be defined as the probability that the transmission data rate R_j is less than the requirement of DUE pair j, and the corresponding constraint may be expressed as

    P_outage = Pr{R_j ≤ R_{j,min}} ≤ P_outage^max    (8)
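To illustrate how the quantities in (4)–(8) fit together, the following minimal Python sketch (illustrative only; the noise power, channel gain, transmit power, and rate requirement are placeholder assumptions, and the usual Shannon formula is assumed for the achievable rate) computes a link's SINR and rate and checks the minimum-rate condition of (6):

import math

# Placeholder link parameters (assumptions, not values from the paper).
BANDWIDTH_HZ = 180e3      # one resource block of 180 kHz
NOISE_POWER_W = 1e-13     # sigma^2 (assumed)
TX_POWER_W = 0.25         # roughly 24 dBm
CHANNEL_GAIN = 1e-9       # g, channel gain to the receiver (assumed)
INTERFERENCE_W = 0.0      # no co-channel interference, as in the case of (5)
R_MIN_BPS = 1e6           # minimum data rate requirement R_min (assumed)

def sinr(p_tx, gain, interference, noise):
    # SINR: received power divided by interference plus noise, cf. (4) and (5).
    return (p_tx * gain) / (interference + noise)

def rate_bps(bandwidth, sinr_value):
    # Achievable rate, assuming the Shannon formula R = B * log2(1 + SINR).
    return bandwidth * math.log2(1.0 + sinr_value)

gamma = sinr(TX_POWER_W, CHANNEL_GAIN, INTERFERENCE_W, NOISE_POWER_W)
rate = rate_bps(BANDWIDTH_HZ, gamma)
# Minimum data-rate check corresponding to constraint (6).
print(f"rate = {rate / 1e6:.2f} Mbps, requirement met: {rate >= R_MIN_BPS}")

An outage probability as in (8) would then be estimated by repeating this computation over many channel realizations and counting how often the rate falls below the requirement.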
The joint mode selection, channel assignment, and power control problem is then formulated as a network throughput maximization problem subject to a set of constraints. One of the constraints guarantees that the resource of an existing CUE may be shared by at most one D2D pair. Constraint (9c) guarantees that any DUE selects at most one of the three modes. Constraint (9d) indicates that the RBs used by the DUEs in the cellular mode and the dedicated mode should not exceed the total number of RBs. Constraints (9e) and (9f) guarantee that the transmit powers of the DUEs and CUEs do not exceed P_d^max and P_c^max, respectively. All the notations used are given in Table I.

TABLE I. NOTATIONS AND THEIR DEFINITIONS
Notation     Definition
R_i          The data rate of CUE i.
R_{j,min}    The minimum data rate requirement of DUE pair j.
g_{j,j}      Channel gain of D2D pair j.
P_c^max      The maximum transmit power of a CUE.
P_d^max      The maximum transmit power of a DUE.
P_{c,i}      Transmit power of CUE i.
P_j^c        Transmit power of DUE j in the cellular mode.
The resource management problem is modeled as a Markov Decision Process (MDP), defined by the state space S, the action space A, the transition probability P, the reward R, and the discount factor γ ∈ [0,1). In the RL framework, there are agent, environment, action, state, reward, and other elements, as illustrated in Fig. 2.

Fig. 2. Framework of RL for the spectrum allocation in D2D links.

Agent: Each D2D communication link is treated as an agent, which learns its policy by interacting with the D2D environment.

The state transition is deterministic: once action a is taken, the environment moves to the corresponding state with probability one,

    P(s'|s, a) = 1 if s' = state(a), and 0 otherwise.    (10)

Reward: The main target of using RL is to learn the optimal strategy by increasing the reward. Thus, it is very important to design an efficient reward function, which directly decides the optimal strategy that the agent finds and which actions it will take. Furthermore, we have built a new reward function for the resource management issue, which can be expressed as

    r = U(c*, P*) − λ1 Σ_{j∈D} (P_j^delay + P_j^outage) − λ2 Σ_{i∈C} (R_{i,min} − R_i) − λ3 Σ_{j∈D} (R_{j,min} − R_j)    (11)

where part 1 is the immediate utility (the throughput of the overall network), part 2 indicates the cost functions in terms of the unsatisfied latency and unsatisfied reliability of the D2D links, and parts 3 and 4 penalize the unsatisfied minimum data rate requirements of the CUEs and DUEs, respectively; λ1, λ2, and λ3 are positive weights.
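To make the structure of (11) concrete, the following minimal Python sketch (an illustration only; the weight names, the per-link dictionary fields, and the use of the overall throughput as the utility are placeholder assumptions, not the paper's exact definitions) computes a reward of this form from per-link measurements:

def reward(network_throughput, due_links, cue_links,
           w_qos=1.0, w_cue=1.0, w_due=1.0):
    # Composite reward in the spirit of (11): utility minus penalties.
    # due_links: list of dicts with 'delay_violation', 'outage_violation',
    #            'rate', 'rate_min' for each D2D link (assumed structure).
    # cue_links: list of dicts with 'rate' and 'rate_min' for each CUE.
    utility = network_throughput
    # Latency and reliability violations of the D2D links (part 2).
    qos_penalty = sum(d['delay_violation'] + d['outage_violation']
                      for d in due_links)
    # Unsatisfied minimum-rate requirements, counted only when violated (parts 3 and 4).
    cue_penalty = sum(max(0.0, c['rate_min'] - c['rate']) for c in cue_links)
    due_penalty = sum(max(0.0, d['rate_min'] - d['rate']) for d in due_links)
    return utility - w_qos * qos_penalty - w_cue * cue_penalty - w_due * due_penalty

With this shaping, an agent is rewarded for raising the network throughput and penalized whenever a latency, reliability, or minimum-rate requirement is violated.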
Policy: The policy is a function which decides the action selection in a given state. Let π(s) denote the policy; the value of a state s under policy π can then be expressed as

    V^π(s) = Σ_{a∈A} π(a|s) [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^π(s') ]    (12)

where R(s, a) is the reward obtained after action a is executed. Since the optimal policy maximizes the cumulative discounted reward from the beginning, it can be used to design the resource management scheme in D2D-enabled cellular networks.

B. Actor-Critic (AC) Learning for Resource Management

In this subsection, a model-free RL method is utilized to address the resource management problem and to learn the optimal strategy for the resource management of D2D communication through continuous interaction with the environment. D2D links may be regarded as agents, and the network represents the environment. Each agent observes the current network state and then decides which action to take based on its learned policy. Then, the D2D environment provides a new network state and the immediate reward r in (11) to the agents. According to this feedback, all agents learn a new policy in the next step, and so on.

1) Action selection: In the D2D environment, the D2D transmitter is set as an agent. The agent interacts with the environment and then takes an action. During the learning process, the agent continuously updates the policy until the optimal strategy is learned. The agent selects an action according to a stochastic strategy, the purpose of which is to enhance performance while explicitly balancing two competing objectives: (a) choosing the communication mode and (b) combining the channel assignment and power control (a brief illustrative sketch of such a composite action space is given below).
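As a purely illustrative sketch (the mode names, number of RBs, and discrete power levels are assumptions, not the paper's exact action definition; in particular the third mode is assumed here to be a reuse mode), a composite action can be enumerated as a (mode, RB, power level) tuple:

from itertools import product

# Assumed discrete choices; the paper's actual sets may differ.
MODES = ("cellular", "dedicated", "reuse")      # the three D2D transmission modes
RESOURCE_BLOCKS = range(10)                     # candidate RBs
POWER_LEVELS_DBM = (0, 6, 12, 18, 24)           # discrete transmit power levels

# Every action an agent can take: one mode, one RB, one power level.
ACTIONS = list(product(MODES, RESOURCE_BLOCKS, POWER_LEVELS_DBM))

print(len(ACTIONS))  # 3 * 10 * 5 = 150 candidate actions per agent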
The action is drawn from a Boltzmann (softmax) distribution over the action preferences p(s, a),

    π(s, a) = exp(p(s, a)/τ) / Σ_{a'∈A} exp(p(s, a')/τ)    (14)

where τ is a temperature parameter that controls the amount of exploration.

2) TD error: The critic evaluates the selected action through the temporal-difference (TD) error

    δ = r(s, a) + γ V(s') − V(s)    (15)

After that, the TD error is fed back to the actor, and the critic updates the estimated value of the visited state, while the values of the other states are kept unchanged.

3) Policy Update: The critic utilizes the TD error to evaluate the action selected by the actor, and the policy can be updated as [20]

    p(s, a) = p(s, a) − β(k(s, a, t)) δ(s, a)    (17)

where k(s, a, t) denotes the number of times action a has been executed in state s during the first t stages, and β(·) is a positive step-size parameter.
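The following minimal tabular sketch (illustrative only, not the paper's Algorithm 1; the state/action encodings, step sizes, and the env.reset/env.step interface are placeholder assumptions, and the conventional actor-critic preference update p ← p + β·δ is used) shows how the softmax selection of (14), the TD error of (15), and a preference-based policy update fit together:

import math
import random
from collections import defaultdict

def softmax_action(prefs, state, actions, tau=1.0):
    # Boltzmann selection over action preferences, as in (14).
    weights = [math.exp(prefs[(state, a)] / tau) for a in actions]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]

def actor_critic(env, actions, episodes=500, gamma=0.9, alpha=0.1, beta=0.1, tau=1.0):
    # Tabular actor-critic loop: the critic learns V, the actor learns preferences p(s, a).
    V = defaultdict(float)       # critic: state values
    prefs = defaultdict(float)   # actor: action preferences p(s, a)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = softmax_action(prefs, state, actions, tau)
            next_state, reward, done = env.step(action)   # assumed interface
            # TD error, as in (15).
            delta = reward + gamma * V[next_state] * (not done) - V[state]
            # Critic update of the visited state's value.
            V[state] += alpha * delta
            # Actor update: conventional preference rule (raise preference when delta > 0).
            prefs[(state, action)] += beta * delta
            state = next_state
    return V, prefs

Here env stands for a wrapper around the D2D network simulator: reset() would return an initial network observation, and step() would apply the chosen (mode, RB, power) action and return the next observation, the reward of (11), and a termination flag.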
Equations (14) and (17) ensure that an action in a specific state is selected with higher probability if we reach the highest reward, and with lower probability if we reach the minimum reward, i.e., if δ(s, a) < 0. If every action is executed infinitely often in each state and the learning strategy is greedy in the limit with infinite exploration, the value function V(s) and the policy function π(s, a) will eventually converge to V*(s) and π*, respectively, with probability 1. The complete proposed AC-RL approach is shown in Algorithm 1.

Algorithm 1: AC-RL Algorithm

IV. PERFORMANCE EVALUATION

TABLE II. SIMULATION PARAMETERS
Parameter                                                      Value
System bandwidth                                               5 MHz
Channel bandwidth                                              180 kHz
Maximum transmit power for CUE (P_c^max) and DUE (P_d^max)     24 dBm
Pathloss constant (κ)                                          10^-2

A snapshot of the distribution of CUEs and DUEs in a cell network is illustrated in Figure (3). The eNB is located at the origin of the cell, while the locations of CUEs and DUEs are randomly distributed within the serving cell coverage area.
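As a small illustration of such a random drop (the cell radius, user counts, and maximum D2D pair separation below are assumptions; the extracted parameter table does not specify them), the placement could be generated as follows:

import math
import random

# Assumed deployment parameters (not specified in the extracted text).
CELL_RADIUS_M = 500
NUM_CUES, NUM_DUE_PAIRS = 10, 6

def random_position(radius):
    # Uniform random position inside a disc centred at the eNB (origin).
    r = radius * math.sqrt(random.random())
    theta = 2 * math.pi * random.random()
    return (r * math.cos(theta), r * math.sin(theta))

def due_pair(radius, max_pair_dist=50):
    # A DUE pair: a transmitter and a nearby receiver (assumed maximum separation).
    tx, ty = random_position(radius)
    d = max_pair_dist * random.random()
    phi = 2 * math.pi * random.random()
    return (tx, ty), (tx + d * math.cos(phi), ty + d * math.sin(phi))

cues = [random_position(CELL_RADIUS_M) for _ in range(NUM_CUES)]
dues = [due_pair(CELL_RADIUS_M) for _ in range(NUM_DUE_PAIRS)]
print(f"dropped {len(cues)} CUEs and {len(dues)} DUE pairs")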
In Figure (4), the system throughput under different numbers of D2D users is analyzed. The results indicate an enhanced performance of the proposed algorithm over the existing algorithms. When the total system throughput is examined as a function of the number of D2D pairs and the AC-RL approach is compared with the two other approaches, it can be clearly observed that the total system throughput grows as the number of D2D pairs increases, and that the AC-RL approach achieves higher performance than the Q-learning approach as well as the random search approach. Figure (5) compares the learning processes of these algorithms; in particular, the proposed algorithm accomplishes the best performance in reward with the highest convergence rate.

Fig. 4. System throughput gain for different D2D numbers.

Fig. 5. Learning process comparisons of AC-RL algorithms.

Thus, in general, D2D communications will typically coexist and share RBs with cellular users for their data transmission. The proposed joint resource management can maximize the throughput whilst avoiding the interference caused by sharing the RBs of the cellular network. The agent continually upgrades its policy throughout the learning process to discover how to select power levels and allocate resources. Based on the simulation outcomes, every agent discovers a way to meet the cellular communication constraints whilst avoiding interference with D2D communications and increasing the throughput of the overall network. Therefore, the numerical results show that the approach has good convergence.

V. CONCLUSION

The integration of D2D communication into cellular networks has become a vital task with the growth of mobile devices, as well as the requirements for enhanced network performance in terms of spectral efficiency, energy efficiency, and latency. In this paper, we formulated a joint resource management (mode selection, resource block assignment, and transmit power control) problem with the constraints of the QoS requirements of D2D links, to maximize the throughput of the overall network in D2D communications. The resource management problem is solved with an RL framework based on an MDP. With the RL algorithm, D2D links are able to intelligently make adaptive selections to enhance their overall performance based on the immediate observations in D2D environments. The results show that the proposed solution can efficaciously guarantee the transmission quality, enhance the sum rate of the cellular and D2D users, and outperform other existing algorithms by having better convergence and overall network throughput. In future work, we will apply the RL approach to the joint resource allocation problem in multi-cell D2D communications underlaying cellular networks.
REFERENCES

[1] K. Doppler, M. Rinne, C. Wijting, C. Ribeiro, and K. Hugl, "Device-to-device communication as an underlay to LTE-advanced networks," IEEE Commun. Mag., vol. 47, no. 12, pp. 42–49, Dec. 2009.
[2] A. Asadi, Q. Wang, and V. Mancuso, "A survey on device-to-device communication in cellular networks," IEEE Commun. Surveys Tuts., vol. 16, no. 4, pp. 1801–1819, 2014.
[3] P. Phunchongharn, E. Hossain, and D. I. Kim, "Resource allocation for device-to-device communications underlaying LTE-advanced networks," IEEE Wireless Commun., vol. 20, no. 4, pp. 91–100, 2013.
[4] S. Marzieh, M. Mehrjoo, and M. Kazeminia, "Proximity mode selection method in device to device communications," in Proc. 2018 8th Int. Conf. on Computer and Knowledge Engineering (ICCKE), IEEE, 2018.
[5] L. Lei, Z. Zhong, C. Lin, and X. Shen, "Operator controlled device-to-device communications in LTE-advanced networks," IEEE Wireless Commun., vol. 19, no. 3, pp. 96–104, Jun. 2012.
[6] M. Azam et al., "Joint admission control, mode selection, and power allocation in D2D communication systems," IEEE Trans. Veh. Technol., vol. 65, no. 9, 2015.
[7] S. Cicalo and V. Tralli, "QoS-aware admission control and resource allocation for D2D communications underlaying cellular networks," IEEE Trans. Wireless Commun., vol. 17, no. 8, 2018.
[8] D. Feng, L. Lu, Y. Yuan-Wu, G. Y. Li, G. Feng, and S. Li, "Device-to-device communications underlaying cellular networks," IEEE Trans. Commun., vol. 61, no. 8, 2013.
[9] G. Yu, L. Xu, D. Feng, R. Yin, G. Y. Li, and Y. Jiang, "Joint mode selection and resource allocation for device-to-device communications," IEEE Trans. Commun., vol. 62, no. 11, pp. 3814–3824, Nov. 2014.
[10] D. Zhao et al., "A reinforcement learning method for joint mode selection and power adaptation in the V2V communication network in 5G," IEEE Trans. Cogn. Commun. Netw., 2020.
[11] S. Feki, F. Zarai, and A. Belghith, "A Q-learning-based scheduler technique for LTE and LTE-Advanced network," in Proc. WINSYS, 2017.
[12] S. Feki and F. Zarai, "Cell performance-optimization scheduling algorithm using reinforcement learning for LTE-advanced network," in Proc. 2017 IEEE/ACS 14th Int. Conf. on Computer Systems and Applications (AICCSA), IEEE, 2017.
[13] S. Feki, A. Belghith, and F. Zarai, "A reinforcement learning-based radio resource management algorithm for D2D-based V2V communication," in Proc. 2019 15th Int. Wireless Communications & Mobile Computing Conf. (IWCMC), IEEE, 2019.
[14] S. Gengtian et al., "Power control based on multi-agent deep Q network for D2D communication," in Proc. 2020 Int. Conf. on Artificial Intelligence in Information and Communication (ICAIIC), IEEE, 2020.
[15] W. Chen and J. Zheng, "A reinforcement learning based joint spectrum allocation and power control algorithm for D2D communication underlaying cellular networks," in Proc. Int. Conf. on Artificial Intelligence for Communications and Networks, Springer, Cham, 2019.
[16] W. Chen and J. Zheng, "A multi-agent reinforcement learning based power control algorithm for D2D communication underlaying cellular networks," in Proc. Int. Conf. on Artificial Intelligence for Communications and Networks, Springer, Cham, 2019.
[17] Y. Luo, Z. Shi, X. Zhou, Q. Liu, and Q. Yi, "Dynamic resource allocations based on Q-learning for D2D communication in cellular networks," in Proc. 2014 11th Int. Computer Conf. (ICCWAMTIP), IEEE, 2014.
[18] K. Zia et al., "A distributed multi-agent RL-based autonomous spectrum allocation scheme in D2D enabled multi-tier HetNets," IEEE Access, vol. 7, pp. 6733–6745, 2019.
[19] S. W. H. Shah et al., "On the impact of mode selection on effective capacity of device-to-device communication," IEEE Wireless Commun. Lett., vol. 8, no. 3, pp. 945–948, 2019.
[20] M. Sewak, Deep Reinforcement Learning, Springer Singapore, 2019.