Resource Management Based on Reinforcement Learning

for D2D Communication in Cellular Networks


Amamer Saied, Dongyu Qiu, Mahmoud Swessi
Department of Electrical and Computer Engineering, Concordia University, Montreal, QC, Canada

Abstract— Recently, the integration of Device-to-Device (D2D) communication into cellular networks has become a vital task with the growth of mobile devices and the requirements for enhanced network performance in terms of spectral efficiency, energy efficiency, and latency. In this paper, we propose a spectrum allocation framework based on Reinforcement Learning (RL) for joint mode selection, channel assignment, and power control in D2D communication. The objective is to maximize the overall throughput of the network while ensuring the quality of transmission and guaranteeing the low-latency requirements of D2D communications. The proposed algorithm uses RL based on a Markov Decision Process (MDP) with a newly proposed reward function to learn the policy by interacting with the D2D environment. An Actor-Critic Reinforcement Learning (AC-RL) approach is then used to solve the resource management problem. The simulation results show that our learning method performs well, greatly improves the sum rate of D2D links, and converges quickly compared with existing algorithms in the literature.

Keywords— Device-to-device (D2D) communication, Resource allocation, Reinforcement learning (RL), Markov Decision Process.

I. INTRODUCTION

Device-to-Device (D2D) communication is a promising component of next-generation cellular technologies. D2D communication in cellular networks is defined as allowing two cellular users to establish direct communication without relying on the Base Station (BS) or the core network. D2D communication is generally non-transparent to the cellular network, and it can occur on the cellular spectrum (i.e., inband) or the unlicensed spectrum (i.e., outband). With the increase of cellular mobile applications and the corresponding high requirements in terms of quality of service (QoS), connectivity, latency, energy, and spectral efficiency, D2D communication can be very helpful owing to its proximity and reuse gains. Hence, D2D communication is recognized as a promising candidate for improving the architecture of cellular networks [1]-[5].

Recently, many resource allocation approaches have been proposed for D2D communications. Most of these works focus on throughput maximization under QoS and power constraints. Since the problem formulation involves binary channel assignment parameters, it leads to a non-convex mixed-integer non-linear program (MINLP). Generally, resource management problems are non-deterministic polynomial-time hard (NP-hard). Consequently, a common approach to solve this type of problem is to decompose the original problem into two or three sub-problems, such as the mode selection, power control, and channel assignment problems [6]-[9]. In [6], a joint mode selection and resource allocation approach was proposed to maximize the sum rate of the cellular network system. In [7], a joint admission control and resource allocation scheme was proposed to provide lasting Quality of Service (QoS) support to both CUEs and DUEs within the network. In [8], a resource allocation scheme for DUEs was proposed to maximize the overall throughput in a single cell; the scheme is based on three stages: admission control for D2D pairs, optimal power control, and finding the optimal reuse candidates using maximum weighted matching. A centralized joint mode selection and power control scheme was proposed in [9], using a heuristic algorithm for light and medium load scenarios to maximize the overall system throughput while ensuring the SINR of both D2D and cellular users.

Various resource management schemes for joint mode selection and power control were developed as traditional optimization problems. The optimization complexity of these schemes is high, and they cannot be applied to complicated communication scenarios. Even though the above works achieve significant efficiency, the solutions are not intelligent enough. Aside from conventional optimization techniques, several recent works have developed Reinforcement Learning (RL) approaches to address the mode selection and resource allocation problem. In [10], an RL framework was proposed to solve the joint mode selection and power adaptation problem in the V2V communication network in 5G. In [11]-[12], a novel dynamic neural Q-learning based scheduling algorithm was proposed for downlink transmission in LTE-A cellular networks, which aims to achieve a good trade-off between throughput and fairness; the proposed algorithm is based on Q-learning and adapts to variations in channel conditions. For D2D-based V2V communication in LTE-A cellular networks, a dynamic neural Q-learning-based resource allocation and resource sharing algorithm was proposed in [13]. That algorithm aims at optimizing the sum rate of cellular and vehicular users and reducing the interference of V2V links to cellular links while ensuring the QoS requirements of safety vehicular users. In [14], a reinforcement learning algorithm was proposed for adaptive power control to enhance system throughput and minimize interference while satisfying the communication quality of cellular and D2D users.

Recently, a lot of research has been conducted on adopting RL to address resource management in D2D communication [15]-[19].



In [15], a distributed Q-learning based joint spectrum allocation and power control (SA-PC) algorithm was proposed for performing resource allocation and power control for each D2D pair. In [16], two multi-agent RL (MARL) based algorithms were proposed for performing power control of D2D pairs: a centralized Q-learning algorithm and a distributed Q-learning algorithm. In [17], a Q-learning based resource allocation was proposed for the case where CUEs and DUEs share the same resources, using a Q-learning based policy to maximize the overall throughput of the network. A distributed Q-learning based spectrum allocation scheme was proposed in [18], where D2D users act as agents that learn the environment and select resource blocks (RBs) autonomously while mitigating the interference to the CUEs and maximizing the network throughput. In [19], a mode selection scheme that decides whether two users in proximity should communicate using the D2D mode or the cellular mode was modeled as a Markov process; that work then investigated the impact of path-loss measurement errors on the maximum effective capacity of a D2D link for both overlay and underlay scenarios.

In this paper, we consider joint mode selection, resource block (RB) assignment, and power control. Q-learning has a low convergence speed and is not always suitable for continuous-valued state and action spaces. The Actor-Critic Reinforcement Learning (AC-RL) approach is therefore adopted to solve the resource management problem of D2D communication networks, maximizing the overall throughput while guaranteeing the low-latency requirements of D2D communications. The AC-RL approach can efficiently deal with continuous-valued state and action spaces (e.g., RB occupancy status, channel state information (CSI), etc.), where the actor is used to explore stochastic actions and the critic is applied to estimate the state-action value function. The main contributions of this paper can be summarized as follows:

• We formulate a joint resource allocation (resource block (RB) assignment, mode selection, and transmit power control) problem that considers the QoS requirements, to maximize the throughput of the overall network in D2D communications.

• The resource management problem is then modelled in an RL framework, so that D2D links are able to make adaptive decisions to improve their performance based on instant observations of the D2D environment.

The rest of this paper is organized as follows. Section II presents the system model and formulates the optimization problem. Section III models the resource management problem as an MDP and adopts AC learning to solve it. Section IV presents the simulation results. Finally, Section V concludes the paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model

As illustrated in Figure (1), we consider a single-cell cellular system consisting of one eNB with a two-tier cellular network: a set of K cellular user equipments (CUEs), denoted by $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$, is located in the coverage area of the eNB and shares the N orthogonal resource blocks (RBs), and a set of M D2D user equipments (DUEs) is denoted by $\mathcal{D} = \{d_1, d_2, \ldots, d_M\}$. Without loss of generality, each CUE occupies one RB, which can be shared by multiple DUE pairs, and one DUE pair can only occupy one RB. We assume that peer device discovery and session setup are completed before the resource allocation.

Fig. 1. System model for D2D communications network.

In this system model, we assume that the potential D2D users share the uplink resources of cellular users. The key parameters analyzed are the signal-to-interference-plus-noise ratio (SINR), the effect of variation in the D2D transmitter power, the sum rate, the outage probability, the mode selection of D2D communication, and the low-latency requirements of D2D links.

B. D2D Communication Modes

One important issue in D2D communication is the mode that devices use to communicate with each other, since a suitable communication mode increases the network throughput. As in [3]-[4], each D2D pair can choose one of the following three communication modes:

1) Reuse Mode: In this mode, when the two DUEs are close together, they communicate directly by sharing the uplink RB resources of the CUEs. In this case, even though the spectrum efficiency can be improved, interference is experienced between the DUEs and the cellular users. In reuse mode, if CUE $i$ shares its RB resource with a D2D pair, the CUE will suffer interference from the D2D pair. The SINR received at the eNB $b$ on the RB of CUE $i$ can then be expressed as:

$\gamma_{i,b}^{c,r} = \dfrac{P_i^{c,r}\, g_{i,b}}{\sum_{j \in \mathcal{D}} \rho_{j,i}\, P_j^{d,r}\, h_{j,b} + \sigma_N^2}$   (1)

Similarly, for a DUE pair $j$ that reuses the RB of CUE $i$, interference is caused on the reused RB by the co-channel CUE $i$ and by any other DUE pairs sharing the same RB. The SINR at the receiver of DUE pair $j$ is then given by:

$\gamma_{j,i}^{d,r} = \dfrac{P_j^{d,r}\, g_{j,j}}{\sum_{j' \in \mathcal{D},\, j' \neq j} \rho_{j',i}\, P_{j'}^{d,r}\, h_{j',j} + P_i^{c,r}\, h_{i,j} + \sigma_N^2}$   (2)

2) Dedicated Mode: The two DUEs communicate directly using an RB resource that is not currently in use. DUEs in this mode consume fewer channel resources compared to those in the cellular mode and can increase the spectrum efficiency thanks to the proximity gain. The SINR can then be expressed as:

$\gamma_j^{d,d} = \dfrac{P_j^{d,d}\, g_{j,j}}{\sigma_N^2}$   (3)

3) Cellular Mode: When the two DUEs are distant from each other or the channel gain between them is poor, they cannot communicate directly with each other. In this case, they communicate through the Base Station (eNB), which acts as a relay, like traditional cellular users. The SINR can be expressed as:

$\gamma_{j,b}^{d,c} = \dfrac{P_j^{d,c}\, g_{j,b}}{\sigma_N^2}$   (4)

In addition, the RB of a CUE does not suffer interference from DUEs when it is not currently reused by any DUE. In that case, the SINR may be expressed as:

$\gamma_{i,b}^{c} = \dfrac{P_i^{c}\, g_{i,b}}{\sigma_N^2}$   (5)
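As an illustration of Equations (1)-(5), the following Python sketch computes the per-mode SINR for one CUE and one D2D pair sharing an RB. The linear-scale powers, channel gains, and the example values are hypothetical placeholders, not values from the paper; only the noise power and maximum transmit power are taken from Table II.

```python
# Hedged sketch of the per-mode SINR expressions (1)-(5); all numbers are
# illustrative placeholders in linear scale (mW for powers, unitless gains).

def sinr_cue_reuse(p_cue, g_ib, dues, noise):
    """Eq. (1): SINR of CUE i at the eNB when DUE pairs reuse its RB.
    `dues` is a list of (p_due, h_jb) tuples for the co-channel D2D pairs."""
    interference = sum(p_due * h_jb for p_due, h_jb in dues)
    return (p_cue * g_ib) / (interference + noise)

def sinr_due_reuse(p_due, g_jj, p_cue, h_ij, other_dues, noise):
    """Eq. (2): SINR at the DUE receiver when reusing the RB of CUE i."""
    interference = p_cue * h_ij + sum(p * h for p, h in other_dues)
    return (p_due * g_jj) / (interference + noise)

def sinr_due_dedicated(p_due, g_jj, noise):
    """Eq. (3): dedicated mode, interference-free direct link."""
    return (p_due * g_jj) / noise

def sinr_due_cellular(p_due, g_jb, noise):
    """Eq. (4): cellular mode via the eNB (relay), interference-free."""
    return (p_due * g_jb) / noise

if __name__ == "__main__":
    noise = 10 ** (-114 / 10)          # -114 dBm noise power (Table II)
    p_cue = p_due = 10 ** (24 / 10)    # 24 dBm maximum transmit power (Table II)
    g_ib, g_jj, h_jb, h_ij = 1e-9, 1e-6, 1e-11, 1e-10  # placeholder gains
    print("reuse CUE SINR:", sinr_cue_reuse(p_cue, g_ib, [(p_due, h_jb)], noise))
    print("reuse DUE SINR:", sinr_due_reuse(p_due, g_jj, p_cue, h_ij, [], noise))
    print("dedicated SINR:", sinr_due_dedicated(p_due, g_jj, noise))
```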
The minimum data rate requirement constraints of CUE $i$ and DUE pair $j$ may be expressed as:

$R_i \geq R_{i,\min},\ \forall i \in \mathcal{C}, \quad \text{and} \quad R_j \geq R_{j,\min},\ \forall j \in \mathcal{D}$   (6)

Due to the latency requirement, let $T_{\max}$ denote the maximum tolerable latency threshold. The latency constraint of DUE pair $j$ is assured by controlling the probability of exceeding the threshold value, i.e., the probability that the transmission delay $T_j$ is beyond the threshold $T_{\max}$. This probability must be smaller than the tolerable threshold $P_{\max}^{\mathrm{latency}}$, which may be expressed as:

$P_j^{\mathrm{latency}} = \Pr\{T_j \geq T_{\max}\} \leq P_{\max}^{\mathrm{latency}}$   (7)

The outage probability is used to characterize the reliability requirement of DUE pair $j$, and it can be defined as the probability that the transmission data rate $R_j$ is less than the requirement threshold $R_{j,\min}$. Therefore, the outage probability must be below the tolerable outage probability $P_{\max}^{\mathrm{outage}}$, which may be expressed as:

$P_j^{\mathrm{outage}} = \Pr\{R_j \leq R_{j,\min}\} \leq P_{\max}^{\mathrm{outage}}$   (8)
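Since (7) and (8) are probabilistic constraints, one way to check them in a simulation is to estimate the probabilities empirically from samples of the delay and rate of a D2D link. The sketch below assumes such samples are available (e.g., collected over fading realizations); the tolerance values are illustrative placeholders, not values from the paper.

```python
# Hedged sketch: empirical check of the latency constraint (7) and the
# outage constraint (8) from simulated samples; tolerances are placeholders.

def latency_violation_prob(delay_samples, t_max):
    """Empirical Pr{T_j >= T_max} for a D2D link."""
    return sum(1 for t in delay_samples if t >= t_max) / len(delay_samples)

def outage_prob(rate_samples, r_min):
    """Empirical Pr{R_j <= R_j,min} for a D2D link."""
    return sum(1 for r in rate_samples if r <= r_min) / len(rate_samples)

def qos_satisfied(delay_samples, rate_samples, t_max, r_min,
                  p_lat_max=0.05, p_out_max=0.05):
    """True if both (7) and (8) hold for the given tolerances."""
    return (latency_violation_prob(delay_samples, t_max) <= p_lat_max
            and outage_prob(rate_samples, r_min) <= p_out_max)
```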
C. Problem Formulation

Our goal in this paper is to optimize the overall network throughput while guaranteeing the above-mentioned QoS criteria of the DUEs and satisfying the network's resource constraints. Therefore, the joint mode selection, channel assignment, and power control resource management problem can be formulated mathematically as:

$(\rho^{*}, P^{*}) = \arg\max_{\rho, P} \Big\{ \sum_{j \in \mathcal{D}} \rho_j^{c} \log_2\!\big(1 + \gamma_{j,b}^{d,c}\big) + \sum_{j \in \mathcal{D}} \rho_j^{d} \log_2\!\big(1 + \gamma_j^{d,d}\big) + \sum_{i \in \mathcal{C}} \sum_{j \in \mathcal{D}} \rho_{j,i} \log_2\!\big(1 + \gamma_{j,i}^{d,r}\big) + \sum_{i \in \mathcal{C}} \sum_{j \in \mathcal{D}} \rho_{j,i} \log_2\!\big(1 + \gamma_{i,b}^{c,r}\big) + \sum_{i \in \mathcal{C}} \Big[1 - \sum_{j \in \mathcal{D}} \rho_{j,i}\Big] \log_2\!\big(1 + \gamma_{i,b}^{c}\big) \Big\}$   (9)

s.t.
(6), (7), (8); $\rho_j^{c}, \rho_j^{d}, \rho_{j,i} \in \{0, 1\},\ \forall j \in \mathcal{D}$,   (9a)
$\sum_{j \in \mathcal{D}} \rho_{j,i} \leq 1,\ \forall i \in \mathcal{C}$,   (9b)
$\sum_{i \in \mathcal{C}} \rho_{j,i} + \rho_j^{c} + \rho_j^{d} \leq 1,\ \forall j \in \mathcal{D}$,   (9c)
$\sum_{j \in \mathcal{D}} \rho_j^{c} + \sum_{j \in \mathcal{D}} \rho_j^{d} \leq N$,   (9d)
$\sum_{i \in \mathcal{C}} \rho_{j,i} P_j^{d,r} + \rho_j^{d} P_j^{d,d} + \rho_j^{c} P_j^{d,c} \leq P_D^{\max},\ \forall j \in \mathcal{D}$,   (9e)
$\sum_{j \in \mathcal{D}} \rho_{j,i} P_i^{c,r} + \Big(1 - \sum_{j \in \mathcal{D}} \rho_{j,i}\Big) P_i^{c} \leq P_C^{\max},\ \forall i \in \mathcal{C}$.   (9f)

where $\rho_j^{c}$, $\rho_j^{d}$, and $\rho_{j,i}$ are the mode selection indicators representing the cellular mode, the dedicated mode, and the reuse mode, respectively. Constraint (9a) represents the channel reuse relationship between CUEs and DUEs combined with the resource partition model. Constraint (9b) makes sure that the resource of an existing CUE may be shared by at most one D2D pair. Constraint (9c) guarantees that any DUE selects at most one of the three modes. Constraint (9d) indicates that the RBs used by DUEs in the cellular mode and the dedicated mode must not exceed the total number of RBs. Constraints (9e) and (9f) guarantee that the transmit powers of DUEs and CUEs are within the maximum limits. All the notations used are given in Table I.

TABLE I. NOTATIONS AND THEIR DEFINITIONS

Notation | Definition
$R_i$ | The data rate of CUE i.
$R_{i,\min}$ | The minimum data rate requirement of CUE i.
$R_j$ | The data rate of DUE j.
$R_{j,\min}$ | The minimum data rate requirement of DUE j.
$g_{j,j}$ | Channel gain of D2D pair j.
$g_{i,b}$ | Channel gain of the link from CUE i to the eNB b.
$h_{j,b}$ | Interference gain of the link from DUE-TX j to the eNB b.
$h_{i,j}$ | Interference gain of the link from CUE i to DUE-RX j.
$\sigma_N^2$ | Additive white Gaussian noise (AWGN) power on each channel.
$P_C^{\max}$ | The maximum transmit power of a CUE.
$P_D^{\max}$ | The maximum transmit power of a DUE.
$P_i^{c}$ | Transmit power of CUE i.
$P_i^{c,r}$ | Transmit power of CUE i in the Reuse mode.
$P_j^{d,r}$ | Transmit power of DUE j in the Reuse mode.
$P_j^{d,d}$ | Transmit power of DUE j in the Dedicated mode.
$P_j^{d,c}$ | Transmit power of DUE j in the Cellular mode.
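To make the constraint structure of (9a)-(9f) concrete, the sketch below checks a candidate assignment for feasibility. The data structures (mode dictionaries, reuse map, power dictionaries) are hypothetical bookkeeping choices made for illustration, not the paper's solver.

```python
# Hedged sketch: feasibility check of a candidate mode/RB/power assignment
# against constraints (9b)-(9f). Data structures are illustrative only.

def feasible(mode, reuse_of, due_power, cue_power,
             p_c_max, p_d_max, num_rbs, cues):
    """mode[j] in {'reuse', 'dedicated', 'cellular'} for each DUE pair j;
    reuse_of[j] = CUE whose RB is reused (only meaningful in reuse mode);
    due_power[j], cue_power[i] = chosen transmit powers."""
    # (9b): each CUE's RB is reused by at most one D2D pair.
    reusers = {i: 0 for i in cues}
    for j, m in mode.items():
        if m == 'reuse':
            reusers[reuse_of[j]] += 1
    if any(n > 1 for n in reusers.values()):
        return False
    # (9c): one mode per DUE is implied by `mode` holding a single label.
    # (9d): RBs consumed by cellular- and dedicated-mode DUEs fit the budget.
    if sum(m in ('dedicated', 'cellular') for m in mode.values()) > num_rbs:
        return False
    # (9e)/(9f): transmit powers within the maximum limits.
    if any(p > p_d_max for p in due_power.values()):
        return False
    if any(p > p_c_max for p in cue_power.values()):
        return False
    return True
```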
III. REINFORCEMENT LEARNING (RL) FOR RESOURCE MANAGEMENT

We use a Markov decision process (MDP) to model the optimization problem in (9), which is hard to solve directly since it is a non-convex, combinatorial, NP-hard problem. The solution of the formulated MDP problem is then obtained by means of the Actor-Critic RL (AC-RL) algorithm [20]. In the following, the main elements of the RL formulation based on the MDP are specified together with the newly proposed reward function, and an AC-RL framework is used to solve the resource management problem.

A. Markov Decision Process (MDP) for Resource Management

Markov Decision Processes (MDPs) are widely used as optimization tools for determining optimal strategies in communication systems. We apply an MDP to model the strategy searching process in the RL framework. At each time step, the process is in some state, and the decision-maker (agent) may choose any action that is available in the current state. After selecting an action, the agent receives a reward associated with the played action in that state, and the process randomly moves to a new state according to some transition matrix. Our MDP is a 5-tuple $(S, A, P, R, \gamma)$, in which $S$ is the state space, $A$ is the action space, $P$ is the transition probability, $R$ is the reward, and $\gamma \in [0, 1]$ is the discount factor.

In D2D environments, the probabilities of state transitions and the expected rewards are generally unknown for all states. Thus, we formulate the resource allocation problem in D2D communication as a model-free reinforcement learning framework in which the MDP has a continuous action and state space. In such a reinforcement learning framework, as shown in Figure (2), there are agent, environment, action, state, reward, and other basic elements.
The agent, corresponding to a D2D pair, interacts with the environment and generates a trajectory: it changes the state $s \rightarrow s'$ by executing an action and receives a reward from the environment. By continuing these interactions, agents accumulate more and more experience and then update the policy. To be more precise, when an agent executes an action $a \in A$ and receives a reward $r \in R$, the environment transitions from state $s \in S$ to $s' \in S$; $R$ is the reward obtained after action $a$ is executed.

Fig. 2. Framework of RL for the spectrum allocation in D2D links.

Agent: Each communication link is an agent. The agent learns and makes decisions by interacting with the environment.

State: The system state can be described as $S = \{O_{\mathrm{CSI}}, O_{\mathrm{RB}}, O_{\mathrm{QoS}}\}$, where $O_{\mathrm{CSI}}$ is the observed channel information, $O_{\mathrm{RB}}$ denotes the RB occupancy status among users, and $O_{\mathrm{QoS}}$ indicates the QoS requirements (e.g., the latency, the minimum data rate, and the reliability requirements).

Action: Three actions are considered in each learning step, defined as $a = \{a_{\mathrm{MS}}, a_{\mathrm{PC}}, a_{\mathrm{RB}}\} \in A$. The agent takes the action $a \in A$ according to the current state $s$, after making a decision in terms of the mode selection ($a_{\mathrm{MS}}$), the transmit power control ($a_{\mathrm{PC}}$), and the RB assignment ($a_{\mathrm{RB}}$).

Transition probability: The transition probability $P(s'|s,a)$ describes the probability that the agent moves from state $s \in S$ to a new state $s' \in S$ when it takes action $a \in A$:

$P(s'|s,a) = \begin{cases} 1, & s' = \mathrm{state}(a) \\ 0, & \text{otherwise} \end{cases}$   (10)

Reward: The main target of using RL is to learn the optimal strategy by increasing the reward. Thus, it is very important to design an efficient reward function, since it directly decides the optimal strategy that the agent finds and the actions it will take. We therefore built a new reward function for the resource management problem, which can be expressed as:

$r = T(\rho^{*}, P^{*}) - \omega_1 \Big[\sum_{j \in \mathcal{D}} \big(P_j^{\mathrm{latency}} + P_j^{\mathrm{outage}}\big)\Big] - \omega_2 \Big[\sum_{i \in \mathcal{C}} \big(R_{i,\min} - R_i\big)\Big] - \omega_3 \Big[\sum_{j \in \mathcal{D}} \big(R_{j,\min} - R_j\big)\Big]$   (11)

where part 1 is the immediate utility (the throughput of the overall network), part 2 is the cost function in terms of the unsatisfied latency and unsatisfied reliability of the D2D links, and parts 3 and 4 are the cost functions in terms of the unsatisfied minimum data rate requirements of the cellular links and the D2D links, respectively. The coefficients $\omega_k$, $k \in \{1,2,3\}$, are the weights of the last three parts, which are used to balance the utility and the cost.

Policy: The policy is a function that decides the action selection for a given state. Let $\pi(s)$ denote a policy, $\pi(s): S \rightarrow A$, which is a mapping from the state space $S$ to the action space $A$. In the network, the objective of the agent is to choose a policy $\pi(s)$ that maximizes its expected reward. Let $V^{\pi}(s)$ denote the state-value function, also called the cumulative discounted reward, which is expressed as:

$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t \geq 0} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s, \pi\Big] = \mathbb{E}_{\pi}\Big[r(s, a_t) + \gamma \sum_{s' \in S} P(s'|s, a_t)\, V^{\pi}(s')\Big]$   (12)

The optimal policy $\pi^{*}(s)$ satisfies the Bellman equation [20] and is achieved by maximizing the cumulative discounted reward starting from the state $s$:

$V^{*}(s) = V^{\pi^{*}}(s) = \max_{a \in A} \Big\{\mathbb{E}_{\pi^{*}}\Big[r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{*}(s')\Big]\Big\}$   (13)

Since the optimal policy maximizes the cumulative discounted reward from the beginning, it contributes to the design of the resource management scheme in D2D communication cellular networks.
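As a concrete illustration of the MDP elements above, the following Python sketch encodes one agent's state, action, deterministic transition per (10), and a reward with the shape of (11). The field names, the weight values, and the `observe_after`/`network_throughput`/`qos_costs` helpers are hypothetical stand-ins for quantities that the paper obtains from the D2D environment.

```python
# Hedged sketch of the MDP elements (state, action, transition, reward).
# Names and numeric weights are illustrative, not taken from the paper.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class State:
    csi: Tuple[float, ...]         # O_CSI: observed channel gains
    rb_occupancy: Tuple[int, ...]  # O_RB: which RBs are occupied
    qos: Tuple[float, ...]         # O_QoS: latency / rate / reliability targets

@dataclass(frozen=True)
class Action:
    mode: str                      # a_MS: 'reuse' | 'dedicated' | 'cellular'
    power_level: int               # a_PC: index into a discretized power set
    rb: int                        # a_RB: assigned resource block

def transition(state: State, action: Action, env) -> State:
    # Eq. (10): the next state is determined by the chosen action and the
    # environment observation, i.e. P(s'|s,a) = 1 for that state, 0 otherwise.
    return env.observe_after(action)

def reward(env, w1=1.0, w2=0.5, w3=0.5) -> float:
    # Eq. (11): utility minus weighted QoS penalty terms (weights are
    # placeholders; the env-provided helpers are assumed, not from the paper).
    latency_cost, outage_cost, cue_rate_gap, due_rate_gap = env.qos_costs()
    return (env.network_throughput()
            - w1 * (latency_cost + outage_cost)
            - w2 * cue_rate_gap
            - w3 * due_rate_gap)
```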


B. Actor-Critic (AC) Learning for Resource Management

In this subsection, model-free RL is utilized to address the resource management problem and to learn the optimal strategy for the resource management of D2D communication with continuous actions. The actor-critic reinforcement learning method is one of the RL tools that combines the value-based RL method and the policy-based RL method, and it applies regardless of whether the environmental elements (i.e., the reward function and the state transition probability) are known. The actor-critic architecture is an RL algorithm based on the policy gradient, where the actor adopts a control policy that selects actions based on the observed network state, and the critic evaluates the chosen policy using the reward fed back from the environment [20].

We adopt Actor-Critic Reinforcement Learning (AC-RL) to optimize the policy numerically and solve intelligent resource management in D2D communications. In the network, D2D links are regarded as agents and the network represents the environment. Each agent observes the current network state and then decides its action based on its learned policy. The D2D environment then provides a new network state and the immediate reward $r$ in (11) to the agents. According to this feedback, all agents learn a new policy in the next step, and so on.

1) Action Selection: In the D2D environment, the D2D transmitter is set as an agent. The agent interacts with the environment and then takes an action. During the learning process, the agent continuously updates the policy until the optimal strategy is learned. The agent needs to select an action according to a stochastic strategy, the purpose of which is to enhance performance while explicitly balancing two competing objectives: (a) choosing the communication mode and (b) then combining the channel assignment and the power level, so the agent has two different kinds of decisions to achieve its goal. We adopt a softmax policy for long-term optimization. The policy $\pi(s,a)$, which determines the probability of taking action $a$, is obtained using the Boltzmann distribution as [20]:

$\pi(s,a) = \dfrac{\exp\!\big(p(s,a)/\tau\big)}{\sum_{a' \in A} \exp\!\big(p(s,a')/\tau\big)}$   (14)

where $\tau$ is a positive parameter called the temperature. In addition, $p(s,a)$ defines the affinity (preference) to select action $a$ at state $s$; it is updated after every iteration. The Boltzmann distribution is chosen to avoid jumping into the exploitation phase before testing each action in every state [20].
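A minimal sketch of the Boltzmann action selection in (14) is given below; the preference table, the temperature value, and the example actions are illustrative assumptions.

```python
# Hedged sketch of Boltzmann (softmax) action selection, Eq. (14).
import math
import random

def boltzmann_policy(preferences, tau=1.0):
    """preferences: dict action -> p(s,a) for the current state s.
    Returns a dict action -> probability pi(s,a)."""
    # subtract the max preference for numerical stability (softmax is invariant)
    m = max(preferences.values())
    exp_p = {a: math.exp((p - m) / tau) for a, p in preferences.items()}
    z = sum(exp_p.values())
    return {a: v / z for a, v in exp_p.items()}

def sample_action(preferences, tau=1.0):
    probs = boltzmann_policy(preferences, tau)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

# example with three hypothetical (mode, RB, power-level) actions
prefs = {('reuse', 3, 2): 0.4, ('dedicated', 7, 1): 0.1, ('cellular', 0, 0): -0.2}
print(sample_action(prefs, tau=0.5))
```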

2) State-Value Function Update: Once the agent chooses an action, the system changes from state $s \in S$ to a new state $s' \in S$ with the transition probability in (10). Meanwhile, the reward obtained for the taken action $a$ is $r(s,a)$. Consequently, the Temporal Difference (TD) error $\delta(s,a)$ is computed at the critic as the difference between the state-value function estimated at the preceding state, as in (12), and $r(s,a) + \gamma V(s')$:

$\delta(s,a) = r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, V(s') - V(s) = r(s,a) + \gamma\, V(s') - V(s)$   (15)

After that, the TD error is fed back to the actor, and the state-value function is updated as:

$V(s') = V(s) + \alpha\big(\beta(s,t)\big)\, \delta(s,a)$   (16)

Here, $\beta(s,t)$ indicates the number of occurrences of state $s$ in these $t$ stages, and $\alpha(\cdot)$ is a positive step-size parameter that affects the convergence rate. On the other hand, $V(s')$ remains equal to $V(s)$ in case $s \neq s'$.

3) Policy Update: The critic utilizes the TD error to evaluate the action selected by the actor, and the policy can be updated as [20]:

$p(s,a) = p(s,a) - \xi\big(\beta(s,a,t)\big)\, \delta(s,a)$   (17)

where $\beta(s,a,t)$ denotes the number of times action $a$ has been executed at state $s$ in these $t$ stages, and $\xi(\cdot)$ is a positive step-size parameter. Equations (14) and (17) ensure that an action under a specific state is selected with higher probability when $\delta(s,a) < 0$.

If every action is executed infinitely many times in each state and the learning strategy is greedy with infinite exploration, the value function $V(s)$ and the policy function $\pi(s,a)$ will eventually converge to $V^{*}(s)$ and $\pi^{*}$, respectively, with probability 1. The complete proposed AC-RL approach is shown in Algorithm 1.
Algorithm 1: AC-RL Algorithm
{Initialization}
1. for each $s \in S$, each $a \in A$ do
2.   Initialize the state-value function $V(s)$, the policy function $p(s,a)$, and the strategy function $\pi(s,a)$.
3. end for
4. Repeat until convergence:
5.   Choose an action $a$ in state $s$ according to $\pi(s,a)$ in (14);
6.   Observe the environment and receive the current reward using (11);
7.   Identify the network state, accordingly update the state $s \rightarrow s'$, and compute the TD error by (15);
8.   Update the state-value function by (16) for $s = s'$;
9.   Update the policy function by (17) for $s = s'$, $a = a'$, respectively;
10.  Update the strategy function $\pi'(s,a)$ using (14).
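A runnable, tabular rendering of Algorithm 1 is sketched below for a generic discrete environment. The environment interface (`reset`/`step`), the discretized state and action sets, the hyperparameters, and the fact that the critic update is applied to the visited state (the usual tabular reading of (16)) are assumptions made for illustration; the paper's actual environment is the D2D network simulator.

```python
# Hedged sketch of Algorithm 1 (tabular actor-critic) on a generic discrete
# environment with reset() and step(a) -> (next_state, reward). All
# hyperparameters (temperature, step sizes, episode counts) are illustrative.
import math
import random
from collections import defaultdict

def run_ac_rl(env, actions, episodes=200, steps=50,
              gamma=0.9, alpha=0.1, xi=0.1, tau=0.5):
    V = defaultdict(float)      # state-value function V(s)
    p = defaultdict(float)      # policy preferences p(s,a)

    def pi(s):                  # strategy pi(s,a), Eq. (14)
        m = max(p[(s, a)] for a in actions)
        w = [math.exp((p[(s, a)] - m) / tau) for a in actions]
        return random.choices(actions, weights=w, k=1)[0]

    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            a = pi(s)                                # step 5: action from (14)
            s_next, r = env.step(a)                  # step 6: reward, Eq. (11)
            delta = r + gamma * V[s_next] - V[s]     # step 7: TD error, Eq. (15)
            V[s] += alpha * delta                    # step 8: value update, Eq. (16)
            p[(s, a)] -= xi * delta                  # step 9: policy update with the
                                                     # sign written in Eq. (17)
            s = s_next                               # step 10: pi is refreshed
                                                     # implicitly since it reads p
    return V, p
```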
IV. PERFORMANCE EVALUATION

In this section, simulations are performed in MATLAB 2018a to evaluate the overall performance of our proposed resource management based on the AC-RL approach in the D2D environment. We compare it with the following baselines: the Q-learning approach (referred to as Q-learning) utilized in [15], and a random search approach (referred to as random search).

In our simulation, we consider a single-cell scenario with a radius of 500 m, where the CUEs are uniformly distributed in the cell. We adopt the clustered distribution model for D2D pairs, in which the transmitter (DUE-Tx) and the receiver (DUE-Rx) of each D2D pair are uniformly distributed in a cluster with radius r, and the clusters are uniformly distributed in the cell. Our simulation parameters are shown in Table II.

TABLE II. SIMULATION PARAMETERS

Parameter | Value
System bandwidth | 5 MHz
Channel bandwidth | 180 kHz
Number of cells | 1 cell
Cell radius | 500 m
Maximum distance between DUE-TX and DUE-RX | 70 m
Noise power ($\sigma_N^2$) | −114 dBm
Pathloss exponent (α) | 4
Pathloss constant | $10^{-2}$
Maximum transmit power for CUE ($P_C^{\max}$) and DUE ($P_D^{\max}$) | 24 dBm
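For reproducibility, a sketch of the drop geometry described above (uniform CUEs, clustered D2D pairs) is given below in Python rather than MATLAB. The cluster radius value is a placeholder, since the paper only denotes it by r; the user counts follow the Fig. 3 snapshot (K = 20, M = 10).

```python
# Hedged sketch of the simulation drop: CUEs uniform in a 500 m cell,
# D2D pairs clustered; the cluster radius is a placeholder value.
import math
import random

def uniform_point_in_disk(radius, center=(0.0, 0.0)):
    r = radius * math.sqrt(random.random())   # sqrt gives a uniform area density
    theta = 2 * math.pi * random.random()
    return (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))

def drop_network(num_cues=20, num_d2d=10, cell_radius=500.0, cluster_radius=30.0):
    """Both DUE-Tx and DUE-Rx are drawn uniformly inside their cluster, so a
    cluster radius of 30 m (placeholder) keeps pair distances below the 70 m
    limit of Table II."""
    cues = [uniform_point_in_disk(cell_radius) for _ in range(num_cues)]
    d2d_pairs = []
    for _ in range(num_d2d):
        center = uniform_point_in_disk(cell_radius)           # cluster center
        tx = uniform_point_in_disk(cluster_radius, center)    # DUE-Tx
        rx = uniform_point_in_disk(cluster_radius, center)    # DUE-Rx
        d2d_pairs.append((tx, rx))
    return cues, d2d_pairs

cues, pairs = drop_network()
print(len(cues), "CUEs,", len(pairs), "D2D pairs dropped")
```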

A snapshot of the distribution of CUEs and DUEs in the cell is illustrated in Figure (3). The eNB is located at the origin of the cell, while the locations of CUEs and DUEs are randomly distributed within the serving cell coverage area.

Fig. 3. Snapshot of the CUE and DUE distribution in a cell with radius 500 m, where K = 20 and M = 10.

In Figure (4), the system throughput is analyzed under different numbers of D2D users. The result indicates an enhanced performance of the proposed algorithm over the existing algorithms. When the total system throughput is plotted as a function of the number of D2D pairs and the AC-RL approach is compared with the two baseline approaches, it can be clearly observed that the total system throughput grows as the number of D2D pairs increases, and that the AC-RL approach achieves higher performance than the Q-learning approach as well as the random search approach.

Fig. 4. System throughput gain for different D2D numbers.

Figure (5) shows the learning process of the three approaches in terms of the reward performance when the number of DUEs is 10. We can see that the AC-RL approach greatly outperforms the Q-learning approach and the random search approach; in particular, the proposed algorithm achieves the best reward performance with the fastest convergence rate.

Fig. 5. Learning process comparisons of AC-RL algorithms.

Thus, in general, D2D communications will typically coexist and share RBs with cellular users for their data transmission. The proposed joint resource management can maximize the throughput while avoiding the interference caused by sharing the RBs of the cellular network. The agent continually updates the policy throughout the learning process to discover how to select power levels and allocate resources. Based on the simulation outcomes, every agent discovers a way to meet the cellular communication constraints while avoiding interference with D2D communications and increasing the throughput of the overall network. The numerical results therefore also show that the approach has good convergence.

V. CONCLUSION

The integration of D2D communication into cellular networks has become a vital task with the growth of mobile devices and the requirements for enhanced network performance in terms of spectral efficiency, energy efficiency, and latency. In this paper, we formulated a joint resource management (mode selection, resource block assignment, and transmit power control) problem under the QoS requirements of D2D links, to maximize the throughput of the overall network in D2D communications. The resource management problem is solved with an RL framework based on an MDP. With the RL algorithm, D2D links are able to intelligently make adaptive selections to enhance their overall performance based on immediate observations of the D2D environment. The results show that the proposed solution can effectively guarantee the transmission quality, enhance the sum rate of the cellular and D2D users, and outperform other existing algorithms in terms of convergence and overall network throughput. In future work, we will apply the RL approach to the joint resource allocation problem in multi-cell D2D communications underlaying cellular networks.

REFERENCES
[1] K. Doppler, M. Rinne, C. Wijting, C. Ribeiro, and K. Hugl, "Device-to-device communication as an underlay to LTE-advanced networks," IEEE Commun. Mag., vol. 47, no. 12, pp. 42–49, Dec. 2009.
[2] A. Asadi, Q. Wang, and V. Mancuso, "A survey on device-to-device communication in cellular networks," IEEE Commun. Surv. Tutor., vol. 16, no. 4, pp. 1801–1819, 2014.
[3] P. Phond, E. Hossain, and D. I. Kim, "Resource allocation for device-to-device communications underlaying LTE-advanced networks," IEEE Wireless Communications, vol. 20, no. 4, pp. 91–100, 2013.
[4] S. Marzieh, M. Mehrjoo, and M. Kazeminia, "Proximity mode selection method in device to device communications," in Proc. 8th Int. Conf. on Computer and Knowledge Engineering (ICCKE), IEEE, 2018.
[5] L. Lei, Z. Zhong, C. Lin, and X. Shen, "Operator controlled device-to-device communications in LTE-advanced networks," IEEE Wireless Commun., vol. 19, no. 3, pp. 96–104, Jun. 2012.
[6] M. Azam et al., "Joint admission control, mode selection, and power allocation in D2D communication systems," IEEE Transactions on Vehicular Technology, vol. 65, no. 9, 2015.
[7] S. Cicalo and V. Tralli, "QoS-aware admission control and resource allocation for D2D communications underlaying cellular networks," IEEE Transactions on Wireless Communications, vol. 17, no. 8, 2018.
[8] D. Feng, L. Lu, Y. Yuan-Wu, G. Y. Li, G. Feng, and S. Li, "Device-to-device communications underlaying cellular networks," IEEE Transactions on Communications, vol. 61, no. 8, 2013.
[9] G. Yu, L. Xu, D. Feng, R. Yin, G. Y. Li, and Y. Jiang, "Joint mode selection and resource allocation for device-to-device communications," IEEE Trans. Commun., vol. 62, no. 11, pp. 3814–3824, Nov. 2014.
[10] D. Zhao et al., "A reinforcement learning method for joint mode selection and power adaptation in the V2V communication network in 5G," IEEE Transactions on Cognitive Communications and Networking, 2020.
[11] S. Feki, F. Zarai, and A. Belghith, "A Q-learning-based scheduler technique for LTE and LTE-Advanced network," in Proc. WINSYS, 2017.
[12] S. Feki and F. Zarai, "Cell performance-optimization scheduling algorithm using reinforcement learning for LTE-advanced network," in Proc. 2017 IEEE/ACS 14th Int. Conf. on Computer Systems and Applications (AICCSA), IEEE, 2017.
[13] S. Feki, A. Belghith, and F. Zarai, "A reinforcement learning-based radio resource management algorithm for D2D-based V2V communication," in Proc. 2019 15th Int. Wireless Communications & Mobile Computing Conference (IWCMC), IEEE, 2019.
[14] S. Gengtian et al., "Power control based on multi-agent deep Q network for D2D communication," in Proc. 2020 Int. Conf. on Artificial Intelligence in Information and Communication (ICAIIC), IEEE, 2020.
[15] W. Chen and J. Zheng, "A reinforcement learning based joint spectrum allocation and power control algorithm for D2D communication underlaying cellular networks," in Proc. Int. Conf. on Artificial Intelligence for Communications and Networks, Springer, Cham, 2019.
[16] W. Chen and J. Zheng, "A multi-agent reinforcement learning based power control algorithm for D2D communication underlaying cellular networks," in Proc. Int. Conf. on Artificial Intelligence for Communications and Networks, Springer, Cham, 2019.
[17] Y. Luo, Z. Shi, X. Zhou, Q. Liu, and Q. Yi, "Dynamic resource allocations based on Q-learning for D2D communication in cellular networks," in Proc. 2014 11th International Computer Conference (ICCWAMTIP), IEEE, 2014.
[18] K. Zia et al., "A distributed multi-agent RL-based autonomous spectrum allocation scheme in D2D enabled multi-tier HetNets," IEEE Access, vol. 7, pp. 6733–6745, 2019.
[19] S. W. H. Shah et al., "On the impact of mode selection on effective capacity of device-to-device communication," IEEE Wireless Communications Letters, vol. 8, no. 3, pp. 945–948, 2019.
[20] M. Sewak, Deep Reinforcement Learning, Springer Singapore, 2019.
