

Deep Reinforcement Learning for Intelligent Reflecting Surface-assisted D2D Communications

Khoi Khac Nguyen, Antonino Masaracchia, Cheng Yin, Long D. Nguyen, Octavia A. Dobre, and Trung Q. Duong

arXiv:2108.02892v1 [eess.SP] 6 Aug 2021

K. K. Nguyen, A. Masaracchia, C. Yin, and T. Q. Duong are with Queen's University Belfast, UK (e-mail: {knguyen02,a.masaracchia,cyin01,trung.q.duong}@qub.ac.uk). L. D. Nguyen is with Duy Tan University, Vietnam (email: [email protected]). O. A. Dobre is with Memorial University, Canada (e-mail: [email protected]).

Abstract—In this paper, we propose a deep reinforcement learning (DRL) approach for solving the optimisation problem of the network's sum-rate in device-to-device (D2D) communications supported by an intelligent reflecting surface (IRS). The IRS is deployed to mitigate the interference and enhance the signal between the D2D transmitter and the associated D2D receiver. Our objective is to jointly optimise the transmit power at the D2D transmitter and the phase shift matrix at the IRS to maximise the network sum-rate. We formulate a Markov decision process and then propose the proximal policy optimisation for solving the maximisation game. Simulation results show impressive performance in terms of the achievable rate and processing time.

Index Terms—Intelligent reflecting surface (IRS), D2D communications, deep reinforcement learning.

I. INTRODUCTION

Device-to-device (D2D) communications play a critical role in 5G networks by allowing users to communicate directly without the involvement of base stations. This reduces latency and improves the information transmission efficiency [1], [2]. In [1], the power allocation at the D2D transmitters was optimised to maximise the energy efficiency (EE) performance by following a machine learning-based approach. In [2], the D2D transmitters harvest energy through the simultaneous wireless information and power transfer (SWIPT) protocol. Then, a game theory approach with pricing strategies was proposed to solve the power allocation and power splitting at SWIPT for maximising the network performance.

Intelligent reflecting surface (IRS), referring to the technology of massive elements with flexible reflection capability that are controlled by an intelligent unit, has recently attracted great attention from the research community as an efficient means to expand wireless coverage. The IRS can manage the incoming signal via a controller, which makes it possible to efficiently adapt the angle of passive reflection from the transmitters toward the receivers [3]–[6]. In [4], the IRS harvests energy from the access point (AP) and uses it for reflecting the signal in two phases. The AP beamforming vector, the IRS's phase scheduling, and the passive beamforming were optimised to maximise the information rate. In [5], a channel estimation scheme for a multi-user multiple-input multiple-output (MIMO) system was designed with the support of double IRS panels.

Some research works have investigated the efficiency of the IRS in assisting the D2D communications [7], [8]. In [7] and [8], two sub-problems with a fixed passive beamforming vector and a fixed phase shift matrix were considered. To solve the power allocation optimisation with the fixed phase shift matrix, the authors in [7] used the gradient descent method while the authors in [8] employed the Dinkelbach method. For the phase shift optimisation, a local search algorithm was proposed in [7] while fractional programming was utilised in [8]. However, these approaches assume a discrete phase shift and only reach a sub-optimal solution. Moreover, these works only consider perfect conditions, e.g., perfect channel state information (CSI). In addition, these algorithms cause large delays due to their high computational complexity.

Very recently, deep reinforcement learning (DRL) has been applied as an effective solution for solving complicated problems in wireless networks [9]–[14]. In [9], we defined discrete power levels and used a DRL algorithm to choose the transmit power at the D2D transmitter for maximising the EE. In [11], discrete and continuous action spaces were considered for the beamforming vector and the IRS phase shift in multiple-input single-output (MISO) communications. Then, two DRL algorithms were used to maximise the total throughput. In [12], a DRL-based method was used for optimising the unmanned aerial vehicle (UAV)'s altitude and the IRS diagonal matrix to minimise the sum age-of-information. In [13], the authors used the DRL technique to maximise the signal-to-noise ratio.

In this paper, we propose a DRL algorithm for solving the joint power allocation and phase shift matrix optimisation in IRS-assisted D2D communications. Firstly, we conceive a D2D communication system with the support of the IRS. The D2D channel is a combination of the direct link and the reflective link. The IRS is used for mitigating the interference and enhancing the information transmission channel. Secondly, we formulate a Markov decision process (MDP) [15] for the network throughput maximisation in the IRS-assisted D2D communications, in which the optimisation variables are the power at the D2D users and the phase shifts at the IRS. Then, a DRL algorithm is used to search for an optimal policy for maximising the network sum-rate. Finally, we compare the efficiency of our proposed method with other schemes in terms of the achievable network sum-rate.

II. SYSTEM MODEL AND PROBLEM FORMULATION

We consider an IRS-assisted wireless network with N pairs of D2D users distributed randomly and an IRS panel, as shown in Fig. 1. Each pair of D2D users comprises a single-antenna D2D transmitter (D2D-Tx) and a single-antenna D2D receiver (D2D-Rx). An IRS panel with K reflective elements is deployed to enhance the signal from the D2D-Tx to the associated D2D-Rx and mitigate the interference from other D2D-Txs. The IRS maps the incident signal through the value of the phase shift matrix controlled by an intelligent unit. The received signal at the D2D-Rx is therefore composed of a direct signal and a reflective one.

We denote the position of the nth D2D-Tx at time step t as X_n^t(Tx) = (x_n^t(Tx), y_n^t(Tx)), n = 1, ..., N, and that of the mth D2D-Rx as X_m^t(Rx) = (x_m^t(Rx), y_m^t(Rx)), m = 1, ..., N. The IRS is fixed at the position (x_IRS, y_IRS, z_IRS). The phase shift value of each element in the IRS belongs to [0, 2π].

Fig. 1. System model of the IRS-assisted D2D communications.

We denote the direct channel from the nth D2D-Tx to the mth D2D-Rx at time step t by h_{nm}^t, and the reflective channel by H_{nm}^t. The phase shift matrix at the IRS at time step t is defined by \Phi^t = \mathrm{diag}(\eta_1^t\theta_1^t, \eta_2^t\theta_2^t, \ldots, \eta_K^t\theta_K^t), where \eta_k^t \in [0, 1] and \theta_k^t \in [0, 2\pi] represent the amplitude and the phase shift value, respectively. In this paper, we assume that the amplitudes of all elements are set to \eta_k^t = 1.

The distance between the nth D2D-Tx and the mth D2D-Rx at time step t is defined as

d_{nm}^t = \sqrt{\big(x_n^t(\mathrm{Tx}) - x_m^t(\mathrm{Rx})\big)^2 + \big(y_n^t(\mathrm{Tx}) - y_m^t(\mathrm{Rx})\big)^2}.   (1)

Similarly, d_{n,\mathrm{IRS}}^t denotes the distance between the nth D2D-Tx and the IRS, and d_{\mathrm{IRS},m}^t the distance between the IRS and the mth D2D-Rx at time step t. The direct channel is formulated as

h_{nm}^t = \hat{h}_n \sqrt{\beta_0 (d_{nm}^t)^{-\kappa_0}},   (2)

where \beta_0 is the channel power gain at the reference distance d_0 = 1 m, \hat{h}_n is the small-scale fading, and \kappa_0 is the path-loss exponent of the D2D link. Here, we assume that the small-scale fading follows the Nakagami-m distribution with m as the fading severity parameter.

The reflective channel via the IRS from the nth D2D-Tx toward the mth D2D-Rx is modelled as a Rician fading channel at time step t, described by

H_{nm}^t = \sqrt{\frac{\beta_1}{1+\beta_1}}\,\tilde{h}_{nm}^{\mathrm{LoS}} + \sqrt{\frac{1}{\beta_1+1}}\,\tilde{h}_{nm}^{\mathrm{NLoS}},   (3)

where \beta_1 is the Rician factor, and \tilde{h}_{nm}^{\mathrm{LoS}}, \tilde{h}_{nm}^{\mathrm{NLoS}} are the line-of-sight (LoS) and non-line-of-sight (NLoS) components of the reflected channel, respectively. Specifically, the LoS component is defined as [7]

\tilde{h}_{nm}^{\mathrm{LoS}} = \sqrt{\beta_0 (d_{n,\mathrm{IRS}}^t d_{\mathrm{IRS},m}^t)^{-\kappa_0}}\, e^{-j\theta'},   (4)

where \theta' \in [0, 2\pi] is a random phase. The NLoS component is defined as

\tilde{h}_{nm}^{\mathrm{NLoS}} = \sqrt{\beta_0 (d_{n,\mathrm{IRS}}^t d_{\mathrm{IRS},m}^t)^{-\kappa_1}}\, \hat{h}_{nm}^{\mathrm{NLoS}},   (5)

where \kappa_1 is the path-loss exponent of the NLoS component and the small-scale fading \hat{h}_{nm}^{\mathrm{NLoS}} \sim \mathcal{CN}(0, 1) follows an i.i.d. complex Gaussian distribution with zero mean and unit variance.
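As a concrete illustration of the channel model in (2)–(5), the short Python sketch below draws the direct and per-element reflective channel gains for one D2D pair. It is only a minimal sketch: the parameter values (β_0 = −30 dB, κ_0 = 2.5, κ_1 = 3.6, β_1 = 4) are taken from Table I, the Nakagami severity m is an assumed value, and the function names are illustrative rather than taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0 = 10 ** (-30 / 10)   # channel power gain at d0 = 1 m (-30 dB, Table I)
kappa0, kappa1 = 2.5, 3.6  # path-loss exponents (direct/LoS and NLoS)
beta1 = 4.0                # Rician factor
m_fading = 2.0             # Nakagami-m severity parameter (assumed value)

def direct_channel(d_nm):
    """Direct D2D link, eq. (2): Nakagami-m small-scale fading times path loss."""
    # sqrt of a Gamma(m, 1/m) variable is Nakagami-m distributed with unit power
    h_hat = np.sqrt(rng.gamma(shape=m_fading, scale=1.0 / m_fading))
    return h_hat * np.sqrt(beta0 * d_nm ** (-kappa0))

def reflective_channel(d_n_irs, d_irs_m, K):
    """Per-element IRS channel, eqs. (3)-(5): Rician mix of LoS and NLoS parts."""
    theta_rand = rng.uniform(0.0, 2 * np.pi)          # random LoS phase, eq. (4)
    h_los = np.sqrt(beta0 * (d_n_irs * d_irs_m) ** (-kappa0)) * np.exp(-1j * theta_rand)
    h_nlos = (np.sqrt(beta0 * (d_n_irs * d_irs_m) ** (-kappa1))
              * (rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2))  # CN(0,1)
    # the LoS term is common to all K elements under this simplified sketch
    return np.sqrt(beta1 / (1 + beta1)) * h_los + np.sqrt(1 / (1 + beta1)) * h_nlos

# Example: one D2D pair 8 m apart, both nodes 50 m from the IRS, K = 20 elements.
h_direct = direct_channel(8.0)
H_reflect = reflective_channel(50.0, 50.0, 20)   # vector of K per-element gains
```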
The received signal at the nth D2D-Rx at time step t can be written as

s_n^t = \Big(h_{nn}^t + \sum_{k=1}^{K} H_{nn}^t \Phi^t\Big) \sqrt{p_n^t}\, u_n^t + \sum_{m \neq n}^{N} \Big(h_{mn}^t + \sum_{k=1}^{K} H_{mn}^t \Phi^t\Big) \sqrt{p_m^t}\, u_m^t + \varpi,   (6)

where p_n^t is the transmit power at the nth D2D-Tx at time step t, u_n^t is the transmitted symbol from the nth D2D-Tx, and \varpi \sim \mathcal{N}(0, \alpha^2) is the complex additive white Gaussian noise.

Accordingly, the received signal-to-interference-plus-noise ratio (SINR) at the nth D2D-Rx can be represented as

\gamma_n^t = \frac{\big|h_{nn}^t + \sum_{k=1}^{K} H_{nn}^t \Phi^t\big|^2 p_n^t}{\sum_{m \neq n, m \in N} \big|h_{mn}^t + \sum_{k=1}^{K} H_{mn}^t \Phi^t\big|^2 p_m^t + \alpha^2}.   (7)

The achievable sum-rate at the nth D2D pair during time step t is defined as

R_n^t = B \log_2(1 + \gamma_n^t),   (8)

where B is the bandwidth.

In this paper, we aim at optimising the power allocation of all N pairs of D2D users, P = \{p_1, p_2, \ldots, p_N\}, and the phase shift matrix \Phi of the IRS to maximise the network sum-rate while satisfying all the constraints. The considered network optimisation problem can be formulated as follows:

\max_{P, \Phi} \; R_{\mathrm{total}}^t = \sum_{n=1}^{N} R_n^t
\text{s.t.} \; 0 < p_n < P_{\max}, \; \forall n \in N,
\quad R_n^t \geq r_{\min}, \; \forall n \in N,
\quad \theta_k \in [0, 2\pi], \; \forall k \in K,   (9)

where P_max is the maximum transmit power at the D2D-Tx and the constraint R_n^t \geq r_{\min}, \forall n \in N, indicates the quality-of-service (QoS) requirement of the D2D communications.
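To make the objective in (7)–(9) concrete, the following sketch evaluates the SINR and the network sum-rate for given transmit powers and IRS phase shifts, treating the effective link of each pair as the direct gain plus the sum of the phase-shifted reflective elements. It assumes the channels are available as NumPy arrays; the variable names and the helper `sum_rate` are illustrative and not from the authors' implementation.

```python
import numpy as np

def sum_rate(h, H, p, theta, bandwidth=1e6, noise_power_dbm=-80.0):
    """Network sum-rate, the objective of eq. (9).

    h:     (N, N) complex direct channels, h[m, n] = link from Tx m to Rx n
    H:     (N, N, K) complex per-element reflective channels via the IRS
    p:     (N,) transmit powers in watts
    theta: (K,) IRS phase shifts in [0, 2*pi] (unit amplitudes, eta_k = 1)
    """
    alpha2 = 10 ** (noise_power_dbm / 10) / 1000.0        # noise power in watts
    phi = np.exp(1j * theta)                               # diagonal entries of Phi
    g = h + H @ phi                                        # effective (N, N) channel
    desired = np.abs(np.diagonal(g)) ** 2 * p              # numerator of eq. (7)
    received = np.abs(g) ** 2 * p[:, None]                 # power of Tx m seen at Rx n
    interference = received.sum(axis=0) - desired          # all other Txs at each Rx
    sinr = desired / (interference + alpha2)
    rates = bandwidth * np.log2(1.0 + sinr)                # eq. (8), bits/s per pair
    return rates.sum(), rates

# Toy example with random placeholder channels for N = 5 pairs and K = 20 elements.
rng = np.random.default_rng(1)
N, K = 5, 20
h = (rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))) * 1e-4
H = (rng.normal(size=(N, N, K)) + 1j * rng.normal(size=(N, N, K))) * 1e-5
total, per_pair = sum_rate(h, H, p=np.full(N, 0.1), theta=rng.uniform(0, 2 * np.pi, K))
```

In the learning problem below, this quantity also serves as the per-step reward, so the QoS constraint R_n^t ≥ r_min can be checked on the returned per-pair rates.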

III. JOINT OPTIMISATION OF POWER ALLOCATION AND PHASE SHIFT MATRIX

Given the optimisation problem (9), we formulate an MDP with the agent, the state space S, the action space A, the transition probability P, the reward function R, and the discount factor ζ. Let us denote P_{ss'}(a) as the probability that the agent takes action a_t \in A at the state s = s_t \in S and transfers to the next state s' = s_{t+1} \in S. In particular, we formulate the MDP game as follows:

• State space: The channel gains of the D2D users form the state space as

S = \Big\{ h_{11} + \sum_{k=1}^{K} H_{11}\Phi, \ldots, h_{1N} + \sum_{k=1}^{K} H_{1N}\Phi, \ldots, h_{nm} + \sum_{k=1}^{K} H_{nm}\Phi, \ldots, h_{nN} + \sum_{k=1}^{K} H_{nN}\Phi, \ldots, h_{N1} + \sum_{k=1}^{K} H_{N1}\Phi, \ldots, h_{NN} + \sum_{k=1}^{K} H_{NN}\Phi \Big\}.   (10)

• Action space: The D2D-Txs adjust the transmit power and the IRS changes the phase shifts for maximising the expected reward. Thus, the action space for the D2D users and the IRS is considered as follows:

A = \{p_1, p_2, \ldots, p_N, \theta_1, \theta_2, \ldots, \theta_K\}.   (11)

• Reward function: The agent needs to find an optimal policy for maximising the reward. In our problem, the objective is to maximise the network sum-rate; thus, the reward function is defined as

R = \sum_{n=1}^{N} B \log_2 \Bigg(1 + \frac{\big|h_{nn} + \sum_{k=1}^{K} H_{nn}\Phi\big|^2 p_n}{\sum_{m \neq n}^{N} \big|h_{mn} + \sum_{k=1}^{K} H_{mn}\Phi\big|^2 p_m + \alpha^2}\Bigg).   (12)

By following the MDP, the agent interacts with the environment and receives the response to achieve the best expected reward. Particularly, the state of the agent at time step t is s_t. The agent chooses and executes the action a_t under the policy π. The environment responds with the reward r_t. After taking the action a_t, the agent moves to the new state s_{t+1} with probability P_{ss'}(a). The interactions are iteratively executed and the policy is updated towards the optimal reward.

In this paper, we propose a DRL approach to search for an optimal policy that maximises the reward value in (12). The optimal policy can be obtained by modifying the estimation of the value function or by directly optimising the objective. We use an on-policy algorithm in this work, namely proximal policy optimisation (PPO) with the clipping surrogate technique [16]. Considering the probability ratio between the current policy and the previously obtained policy, p_\theta^t = \pi(s, a; \theta) / \pi(s, a; \theta_{\mathrm{old}}), we need to find the optimal policy that maximises the total expected reward as follows:

L(s, a; \theta) = \mathbb{E}\Big[ \frac{\pi(s, a; \theta)}{\pi(s, a; \theta_{\mathrm{old}})} A^\pi(s, a) \Big] = \mathbb{E}\big[ p_\theta^t A^\pi(s, a) \big],   (13)

where \mathbb{E}[\cdot] is the expectation operator and A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) denotes the advantage function [17]; V^\pi(s) denotes the state-value function while Q^\pi(s, a) is the action-value function.

In the PPO method, we limit the current policy so that it does not move too far from the previously obtained policy by using different techniques, e.g., the clipping technique or a Kullback-Leibler divergence penalty [17]. In this work, we use the clipping surrogate method to prevent excessive modification of the objective value, as follows:

L^{\mathrm{clip}}(s, a; \theta) = \mathbb{E}\Big[ \min\big( p_\theta^t A^\pi(s, a), \; \mathrm{clip}(p_\theta^t, 1 - \epsilon, 1 + \epsilon) A^\pi(s, a) \big) \Big],   (14)

where ε is a hyperparameter. When the advantage A^\pi(s, a) is positive, the term (1 + ε) caps the contribution of the ratio; when the advantage A^\pi(s, a) is negative, the term (1 − ε) bounds the objective value. Moreover, for the advantage function A^\pi(s, a), we use [18]:

A^\pi(s, a) = r_t + \zeta V^\pi(s_{t+1}) - V^\pi(s_t),   (15)

where the state-value function V^\pi(s) is obtained at the state s under the policy π as follows:

V^\pi(s) = \mathbb{E}\big[ R \,|\, s, \pi \big].   (16)

To train the policy network, we store the transitions into a mini-batch memory D and then use stochastic gradient descent (SGD) to maximise the objective. Denoting the policy parameter by θ, it is updated as

\theta_{d+1} = \arg\max_\theta \mathbb{E}\big[ L(s, a; \theta_d) \big].   (17)

The PPO algorithm for the joint optimisation of the transmit power and the phase shift matrix in the IRS-aided D2D communications is presented in Algorithm 1, where M denotes the maximum number of episodes and T is the number of iterations during a period of time.

Algorithm 1 Proposed approach based on the PPO algorithm for the IRS-assisted D2D communications.
1: Initialise the policy π with the parameter θ_π
2: Initialise other parameters
3: for episode = 1, ..., M do
4:    Receive initial observation state s_0
5:    for iteration = 1, ..., T do
6:       Obtain the action a_t at state s_t by following the current policy
7:       Execute the action a_t
8:       Receive the reward r_t according to (12)
9:       Observe the new state s_{t+1}
10:      Update the state s_t = s_{t+1}
11:      Collect the set of partial trajectories with D transitions
12:      Estimate the advantage function according to (15)
13:   end for
14:   Update the policy parameters using SGD with mini-batch D:
         \theta_{t+1} = \arg\max_\theta \frac{1}{D} \sum_{D} L^{\mathrm{clip}}(s, a; \theta_t)   (18)
15: end for
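To make the update step of Algorithm 1 concrete, the clipped surrogate objective (14) and the one-step advantage (15) can be sketched in a few lines. This is a minimal NumPy illustration under the assumption that the probability ratios, rewards, and value estimates are already produced by the policy and value networks; it is not the authors' TensorFlow implementation, and the helper names are ours.

```python
import numpy as np

def clipped_objective(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate of eq. (14), averaged over a mini-batch.

    ratio:     (B,) probability ratios pi(s, a; theta) / pi(s, a; theta_old)
    advantage: (B,) advantage estimates A(s, a)
    eps:       clipping parameter (0.2 in Table I)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

def one_step_advantage(rewards, values, next_values, zeta=0.9):
    """One-step advantage of eq. (15): r_t + zeta * V(s_{t+1}) - V(s_t)."""
    return rewards + zeta * next_values - values

# Toy mini-batch of 128 transitions with placeholder values.
rng = np.random.default_rng(2)
ratio = rng.uniform(0.7, 1.3, size=128)
adv = one_step_advantage(rng.normal(size=128), rng.normal(size=128), rng.normal(size=128))
objective = clipped_objective(ratio, adv)
```

In training, the negative of this objective would be minimised with SGD over the mini-batch D, as in (17) and (18), so that actions with positive advantage become more likely while the clip keeps the new policy close to the old one.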

IV. SIMULATION RESULTS

For the numerical results, we use TensorFlow 1.13.1 [19]. The IRS is deployed at (0, 0, 0), while the D2D devices are randomly distributed within a circle of 100 m from the centre. The maximum distance between the D2D-Tx and the associated D2D-Rx is set to 10 m. We assume d/λ = 1/2, and set the learning rate for the PPO algorithm to 0.0001. For the neural networks, we initialise two hidden layers with 128 and 64 units, respectively. All other parameters are provided in Table I. We consider the following algorithms in the numerical results.

• The proposed algorithm: We use the PPO algorithm with the clipping surrogate technique to solve the joint optimisation of the power allocation and the phase shift matrix of the IRS.
• Maximum power transmission (MPT): The D2D-Tx transmits information with the maximum power, P_max. We use the PPO algorithm to optimise the phase shift matrix of the IRS panel.
• Random phase shift matrix selection (RPS): We optimise the power allocation at the D2D-Tx with a random selection of the phase shift matrix Φ.
• Without IRS: The D2D-Tx transmits information without the support of the IRS. We optimise the power allocation by using the PPO algorithm.

TABLE I
SIMULATION PARAMETERS

Parameter                   Value
Bandwidth (B)               1 MHz
Path-loss exponents         κ_0 = 2.5, κ_1 = 3.6
Channel power gain (β_0)    −30 dB
Rician factor               β_1 = 4
Noise power                 α^2 = −80 dBm
Clipping parameter          ε = 0.2
Discount factor             ζ = 0.9
Max number of D2D pairs     10
Initial batch size          128

Firstly, we compare the achievable network sum-rate provided by our proposed algorithm with that of the other schemes. Fig. 2 plots the sum-rate versus different numbers of IRS elements, K, where the number of D2D pairs is set to N = 5. As can be observed from this figure, the PPO algorithm-based technique outperforms the other schemes and is followed by the MPT technique. The RPS and WithoutIRS schemes show poorer performance in terms of the network sum-rate. The achievable network sum-rate of our proposed algorithm and MPT improves with an increasing number of IRS elements. The results show that, with the monotonic increase in the value of K, the communication quality between the D2D-Tx and the associated D2D-Rx is enhanced, while the interference from other D2D-Txs is suppressed.

Fig. 2. The network sum-rate versus the number of IRS elements, K.

Next, the performance of the previously mentioned four schemes is compared while varying the number of D2D pairs, N, in Fig. 3. We set the number of IRS elements to K = 20 and take the average over 500 episodes to obtain the results. Our proposed algorithm shows the best performance, followed by MPT. With a higher number of D2D users, N ≥ 6, the performance attained by the proposed algorithm still increases while it decreases for the other schemes. The RPS and WithoutIRS models show the worst performance.

Fig. 3. The network sum-rate versus the number of D2D pairs, N.

Further, we set N = 5, K = 20 and compare the performance of the four schemes while changing the value of the QoS threshold, r_min, in Fig. 4. When the value of r_min increases towards infinity, the number of D2D pairs that satisfy the QoS constraint decreases and the sum-rate of all schemes tends to 0. The proposed algorithm outperforms the other schemes for all values of r_min. The gap between our algorithm and the others grows with increasing r_min when r_min ≥ 5. The MPT algorithm exhibits the worst performance when r_min ≥ 7. This suggests that the optimisation of the power allocation is important for efficient D2D communications.

Fig. 4. The network sum-rate versus the QoS threshold, r_min.
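All four curves in Figs. 2–5 are obtained from the same system model; the benchmark schemes differ only in which part of the action is optimised. The sketch below shows one way such baselines could be emulated on top of a sum-rate evaluator for (7)–(9). The scheme names mirror the bullet list above, while the function itself is an illustrative assumption rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(3)

def apply_scheme(scheme, p_opt, theta_opt, N, K, p_max=0.1):
    """Map a learned action (powers, phase shifts) onto one of the benchmark schemes.

    'proposed'   : use both the optimised powers and the optimised phase shifts
    'mpt'        : fix every transmit power to P_max, keep the optimised phases
    'rps'        : keep the optimised powers, draw the phase shifts at random
    'without_irs': keep the optimised powers; the caller drops the reflective term
    """
    if scheme == "proposed":
        return p_opt, theta_opt
    if scheme == "mpt":
        return np.full(N, p_max), theta_opt
    if scheme == "rps":
        return p_opt, rng.uniform(0.0, 2 * np.pi, K)
    if scheme == "without_irs":
        return p_opt, np.zeros(K)   # phases are unused once the IRS link is removed
    raise ValueError(f"unknown scheme: {scheme}")
```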

Next, we compare the total sum-rate of the four schemes by setting different maximum transmit powers at the D2D-Tx, P_max, in Fig. 5, with N = 5 and K = 20. As P_max varies from 100 mW to 400 mW, the performance of the four schemes increases with the same upward trend. The gap between our proposed algorithm and the other schemes grows with the increasing value of P_max, as we jointly optimise both the power allocation at the D2D-Tx and the IRS's phase shift matrix. It is clear that the proposed algorithm is more effective for mitigating interference and providing a better communication quality.

Fig. 5. The network sum-rate versus the maximum transmit power, P_max.

Furthermore, we use neural networks for establishing the DRL algorithm. Thus, after iterative interactions with the environment, the neural networks are trained to achieve an optimal solution. After training offline, the neural network can be deployed to the system for online execution. The online neural network can then determine the proper action for the IRS phase shift values and the D2D-Tx power allocation to maximise the network sum-rate in real time.

V. CONCLUSION

In this paper, we have presented a DRL-based optimal resource allocation scheme for IRS-assisted D2D communications. The PPO algorithm with the clipping surrogate technique has been proposed for the joint optimisation of the D2D-Tx power and the IRS's phase shift matrix. Numerical results have shown a significant improvement in the achievable network sum-rate compared with the benchmark schemes. Our proposed scheme demonstrates the superiority of using the IRS for mitigating the interference in D2D communications when compared with other existing schemes.

REFERENCES

[1] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and N. M. Nguyen, "Non-cooperative energy efficient power allocation game in D2D communication: A multi-agent deep reinforcement learning approach," IEEE Access, vol. 7, pp. 100480–100490, Jul. 2019.
[2] J. Huang, C.-C. Xing, and M. Guizani, "Power allocation for D2D communications with SWIPT," IEEE Trans. Wireless Commun., vol. 19, no. 4, pp. 2308–2320, Apr. 2020.
[3] H. Yu, H. D. Tuan, A. A. Nasir, T. Q. Duong, and H. V. Poor, "Joint design of reconfigurable intelligent surfaces and transmit beamforming under proper and improper Gaussian signaling," IEEE J. Select. Areas Commun., vol. 38, no. 11, pp. 2589–2603, Nov. 2020.
[4] Y. Zou, S. Gong, J. Xu, W. Cheng, D. T. Hoang, and D. Niyato, "Wireless powered intelligent reflecting surfaces for enhancing wireless communications," IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 12369–12373, Oct. 2020.
[5] B. Zheng, C. You, and R. Zhang, "Efficient channel estimation for double-IRS aided multi-user MIMO system," IEEE Trans. Commun., vol. 69, no. 6, pp. 3818–3832, Jun. 2021.
[6] K. K. Nguyen, S. Khosravirad, L. D. Nguyen, T. T. Nguyen, and T. Q. Duong, "Intelligent reconfigurable surface-assisted multi-UAV networks: Efficient resource allocation with deep reinforcement learning," 2021. [Online]. Available: https://arxiv.org/abs/2105.14142
[7] Y. Chen, B. Ai, H. Zhang, Y. Niu, L. Song, Z. Han, and H. V. Poor, "Reconfigurable intelligent surface assisted device-to-device communications," IEEE Trans. Wireless Commun., vol. 20, no. 5, pp. 2792–2804, May 2021.
[8] S. Jia, X. Yuan, and Y.-C. Liang, "Reconfigurable intelligent surfaces for energy efficiency in D2D communication network," IEEE Wireless Commun. Lett., vol. 10, no. 3, pp. 683–687, Mar. 2021.
[9] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and L. D. Nguyen, "Distributed deep deterministic policy gradient for power allocation control in D2D-based V2V communications," IEEE Access, vol. 7, pp. 164533–164543, Nov. 2019.
[10] K. K. Nguyen, N. A. Vien, L. D. Nguyen, M.-T. Le, L. Hanzo, and T. Q. Duong, "Real-time energy harvesting aided scheduling in UAV-assisted D2D networks relying on deep reinforcement learning," IEEE Access, vol. 9, pp. 3638–3648, Dec. 2021.
[11] C. Huang, R. Mo, and C. Yuen, "Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning," IEEE J. Select. Areas Commun., vol. 38, no. 8, pp. 1839–1850, Aug. 2020.
[12] M. Shokry, M. Elhattab, C. Assi, S. Sharafeddine, and A. Ghrayeb, "Optimizing age of information through aerial reconfigurable intelligent surfaces: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 70, no. 4, pp. 3978–3983, Apr. 2021.
[13] K. Feng, Q. Wang, X. Li, and C.-K. Wen, "Deep reinforcement learning based intelligent reflecting surface optimization for MISO communication systems," IEEE Wireless Commun. Lett., vol. 9, no. 5, pp. 745–749, May 2020.
[14] K. K. Nguyen, T. Q. Duong, T. Do-Duy, H. Claussen, and L. Hanzo, "3D UAV trajectory and data collection optimisation via deep reinforcement learning," 2021. [Online]. Available: https://arxiv.org/abs/2106.03129
[15] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995, vol. 1, no. 2.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
[17] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," in Proc. 4th Int. Conf. Learning Representations (ICLR), 2016.
[18] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn. (ICML), PMLR, 2016, pp. 1928–1937.
[19] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Design and Implementation (OSDI 16), Nov. 2016, pp. 265–283.
