Deep Reinforcement Learning For Intelligent Reflecting Surface-Aided D2D Communications
Abstract—In this paper, we propose a deep reinforcement learning (DRL) approach for solving the optimisation problem of the network's sum-rate in device-to-device (D2D) communications ...

Some research works have investigated the efficiency of the IRS in assisting the D2D communications [7], [8]. In [7] ...
... in Fig. 1. Each pair of D2D users comprises a single-antenna D2D transmitter (D2D-Tx) and a single-antenna D2D receiver (D2D-Rx). An IRS panel with K reflective elements is deployed to enhance the signal from the D2D-Tx to the associated D2D-Rx and to mitigate the interference from the other D2D-Txs. The reflective elements of the IRS map the incident signal to the receiver according to the phase shift matrix, which is controlled by an intelligent unit. The received signal at the D2D-Rx is composed of a direct signal and a reflected one. We denote the position of the nth D2D-Tx at time step t ...

... channel at time step t described by

H^t_{nm} = \sqrt{β_1/(1 + β_1)} h̃^{LoS}_{nm} + \sqrt{1/(β_1 + 1)} h̃^{NLoS}_{nm},   (3)

where β_1 is the Rician factor, and h̃^{LoS}_{nm}, h̃^{NLoS}_{nm} are the line-of-sight (LoS) and the non-line-of-sight (NLoS) components of the reflected channel, respectively. Specifically, the LoS component is defined as [7] ...
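To make the channel model concrete, the short Python/NumPy sketch below draws one realisation of the reflected channel in (3). It is only an illustrative sketch: the LoS term is truncated in the recovered text, so it is passed in as an argument, and the unit-variance Gaussian model of the NLoS term is an assumption.

import numpy as np

def reflected_channel(h_los, beta_1=4.0):
    # One realisation of the Rician reflected channel of Eq. (3):
    #   H = sqrt(beta_1 / (1 + beta_1)) * h_LoS + sqrt(1 / (beta_1 + 1)) * h_NLoS
    # h_los : complex array holding the deterministic LoS component (e.g. length K).
    # beta_1: Rician factor (Table I uses beta_1 = 4).
    # NLoS part: i.i.d. circularly-symmetric complex Gaussian with unit variance (assumption).
    h_nlos = (np.random.randn(*h_los.shape) + 1j * np.random.randn(*h_los.shape)) / np.sqrt(2.0)
    return np.sqrt(beta_1 / (1.0 + beta_1)) * h_los + np.sqrt(1.0 / (beta_1 + 1.0)) * h_nlos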
III. JOINT OPTIMISATION OF POWER ALLOCATION AND PHASE SHIFT MATRIX

Given the optimisation problem (9), we formulate the MDP with the agent, the state space S, the action space A, the transition probability P, the reward function R and the discount factor ζ. Let us denote P_{ss′}(a) as the probability that the agent takes action a_t ∈ A at the state s = s_t ∈ S and transfers to the next state s′ = s_{t+1} ∈ S. In particular, we formulate the MDP game as follows:

• State space: The channel gains of the D2D users form the state space as

S = { h_{11} + \sum_{k=1}^{K} H_{11}Φ, . . . , h_{1N} + \sum_{k=1}^{K} H_{1N}Φ, . . . , h_{nm} + \sum_{k=1}^{K} H_{nm}Φ, . . . , h_{nN} + \sum_{k=1}^{K} H_{nN}Φ, . . . , h_{N1} + \sum_{k=1}^{K} H_{N1}Φ, . . . , h_{NN} + \sum_{k=1}^{K} H_{NN}Φ }.   (10)
• Action space: The D2D-Txs adjust the transmit power and the IRS changes the phase shift for maximising the expected reward. Thus, the action space for the D2D users and the IRS is considered as follows:

A = { p_1, p_2, . . . , p_N, θ_1, θ_2, . . . , θ_K }.   (11)
• Reward function: The agent needs to find an optimal policy for maximising the reward. In our problem, our objective is to maximise the network sum-rate; thus, the reward function is defined as

R = B \sum_{n=1}^{N} \log_2 ( 1 + |h_{nn} + \sum_{k=1}^{K} H_{nn}Φ|^2 p_n / ( \sum_{m≠n}^{N} |h_{mn} + \sum_{k=1}^{K} H_{mn}Φ|^2 p_m + α^2 ) ).   (12)

A numerical sketch of this reward computation is given after this list.
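The following NumPy sketch evaluates the sum-rate reward of (12) for given channels, IRS phases and transmit powers. It is illustrative only: the indexing convention (h[m, n] being the channel from D2D-Tx m to D2D-Rx n) is an assumption, while the 1 MHz bandwidth and the −80 dBm (10^-11 W) noise power are taken from Table I.

import numpy as np

def sum_rate_reward(h_direct, H_refl, phi, p, bandwidth=1e6, noise_power=1e-11):
    # Sum-rate reward of Eq. (12).
    # h_direct: (N, N) direct channels, h_direct[m, n] = D2D-Tx m -> D2D-Rx n.
    # H_refl:   (N, N, K) reflected channels through the K IRS elements.
    # phi:      (K,) IRS reflection coefficients, e.g. exp(1j * theta_k).
    # p:        (N,) transmit powers.
    N = p.shape[0]
    eff_gain = np.abs(h_direct + H_refl @ phi) ** 2      # |h_mn + sum_k H_mn * phi_k|^2
    rate = 0.0
    for n in range(N):
        signal = eff_gain[n, n] * p[n]
        interference = sum(eff_gain[m, n] * p[m] for m in range(N) if m != n)
        rate += bandwidth * np.log2(1.0 + signal / (interference + noise_power))
    return rate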
By following the MDP, the agent interacts with the environment and receives the response to achieve the best expected reward. Particularly, the state of the agent at time step t is s_t. The agent chooses and executes the action a_t under the policy π. The environment responds with the reward r_t. After taking the action a_t, the agent moves to the new state s_{t+1} with probability P_{ss′}(a). The interactions are iteratively executed and the policy is updated for the optimal reward.
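As an illustration of this interaction loop, a minimal environment skeleton is sketched below. It is not the authors' implementation: the class and method names are hypothetical, the channel draws are placeholder Rayleigh samples (the reflected part would follow the Rician model of (3)), and the state is simply the flattened effective channel gains of (10); it reuses the sum_rate_reward sketch given earlier.

import numpy as np

class IRSD2DEnv:
    # Minimal MDP skeleton: state (10), action (11), reward (12). Names are illustrative.

    def __init__(self, n_pairs, n_elements, p_max):
        self.N, self.K, self.p_max = n_pairs, n_elements, p_max

    def reset(self):
        self.h, self.H = self._draw_channels()
        return self._state()

    def step(self, action):
        # action = (p_1, ..., p_N, theta_1, ..., theta_K), as in Eq. (11)
        p = np.clip(action[:self.N], 0.0, self.p_max)
        phi = np.exp(1j * action[self.N:])
        reward = sum_rate_reward(self.h, self.H, phi, p)   # Eq. (12)
        self.h, self.H = self._draw_channels()             # channels at time step t+1
        return self._state(), reward

    def _state(self):
        # effective channel gains of Eq. (10), flattened into a real vector (assumption)
        eff = self.h + self.H @ np.ones(self.K, dtype=complex)
        return np.concatenate([eff.real.ravel(), eff.imag.ravel()])

    def _draw_channels(self):
        # placeholder i.i.d. complex Gaussian draws
        h = (np.random.randn(self.N, self.N) + 1j * np.random.randn(self.N, self.N)) / np.sqrt(2)
        H = (np.random.randn(self.N, self.N, self.K)
             + 1j * np.random.randn(self.N, self.N, self.K)) / np.sqrt(2)
        return h, H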
In this paper, we propose a DRL approach to search for an optimal policy for maximising the reward value in (12). The optimal policy can be obtained by modifying the estimation of the value function or directly by the objective. We use an on-policy algorithm for our work, namely proximal policy optimisation (PPO) with the clipping surrogate technique [16]. Considering the probability ratio of the current policy and the obtained policy, p^t_θ = π(s, a; θ)/π(s, a; θ_old), we need to find the optimal policy to maximise the total expected reward as follows:

L(s, a; θ) = E[ π(s, a; θ)/π(s, a; θ_old) · A^π(s, a) ] = E[ p^t_θ A^π(s, a) ],   (13)

where E[·] is the expectation operation and A^π(s, a) = Q^π(s, a) − V^π(s) denotes the advantage function [17]; V^π(s) denotes the state-value function while Q^π(s, a) is the action-value function.
In the PPO method, we limit the current policy such that it does not go far from the obtained policy by using different techniques, e.g., the clipping technique and the Kullback-Leibler divergence [17]. In this work, we use the clipping surrogate method to prevent excessive modification of the objective value, as follows:

L^{clip}(s, a; θ) = E[ min( p^t_θ A^π(s, a), clip(p^t_θ, 1 − ε, 1 + ε) A^π(s, a) ) ],   (14)

where ε is a hyperparameter.
When the advantage A^π(s, a) is positive, the term (1 + ε) clips the probability ratio so that the objective cannot grow beyond (1 + ε)A^π(s, a). Meanwhile, for the negative case of the advantage A^π(s, a), the term (1 − ε) sets a ceiling to limit the objective value. Moreover, for the advantage function A^π(s, a), we use [18]:

A^π(s, a) = r_t + ζ V^π(s_{t+1}) − V^π(s_t),   (15)

where the state-value function V^π(s) is obtained at the state s under the policy π as follows:

V^π(s) = E[ R | s, π ].   (16)

To train the policy network, we store the transitions into a mini-batch memory D and then use the stochastic gradient descent (SGD) method to maximise the objective. By denoting the policy parameter by θ, it is updated as

θ_{d+1} = arg max E[ L(s, a; θ_d) ].   (17)
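To make (13)–(15) concrete, the short sketch below computes the probability ratio, the one-step advantage of (15) and the clipped surrogate of (14) for a mini-batch of transitions. It is a minimal NumPy illustration with hypothetical names, not the paper's TensorFlow implementation; in practice, θ is then updated by ascending the gradient of this objective, as in (17).

import numpy as np

def one_step_advantage(r, v, v_next, zeta=0.9):
    # A^pi(s_t, a_t) = r_t + zeta * V^pi(s_{t+1}) - V^pi(s_t), Eq. (15)
    return r + zeta * v_next - v

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    # Clipped PPO objective of Eq. (14), averaged over the mini-batch.
    ratio = np.exp(logp_new - logp_old)                  # p_theta^t of Eq. (13)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))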
The PPO algorithm for the joint optimisation of the transmit power and the phase shift matrix in the IRS-aided D2D communications is presented in Algorithm 1, where M denotes the maximum number of episodes and T is the number of iterations during a period of time.

Algorithm 1 (fragment):
4: ...
5: for iteration = 1, . . . , T do
6:     Obtain the action a_t at state s_t by following the current policy
7:     Execute the action a_t
8:     Receive the reward r_t according to (12)
9:     Observe the new state s_{t+1}
10:    Update the state s_t = s_{t+1}
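A schematic Python rendering of this loop is given below; env and agent are hypothetical objects (e.g. the environment skeleton sketched earlier and a policy/value network exposing act and update), and the update step is assumed to ascend the clipped objective of (14) over the stored mini-batch, as in (17).

def train(env, agent, M=1000, T=128):
    # Schematic of Algorithm 1; all names are illustrative.
    for episode in range(M):
        state = env.reset()
        batch = []                                   # mini-batch memory D
        for iteration in range(T):
            action, logp = agent.act(state)          # step 6: follow the current policy
            next_state, reward = env.step(action)    # steps 7-8: execute a_t, receive r_t via (12)
            batch.append((state, action, logp, reward, next_state))
            state = next_state                       # steps 9-10: observe and update the state
        agent.update(batch)                          # SGD on the clipped surrogate, Eqs. (14), (17)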
IV. SIMULATION RESULTS

For numerical results, we use Tensorflow 1.13.1 [19]. The IRS is deployed at (0, 0, 0), while the D2D devices are randomly distributed within a circle of 100 m from the center. The maximum distance between the D2D-Tx and the associated D2D-Rx is set to 10 m. We assume d/λ = 1/2, and set the learning rate for the PPO algorithm to 0.0001. For the neural networks, we initialise two hidden layers with 128 and 64 units, respectively. All other parameters are provided in Table I. We consider the following algorithms in the numerical results.

• The proposed algorithm: We use the PPO algorithm with the clipping surrogate technique to solve the joint optimisation of the power allocation and the phase shift matrix of the IRS.
• Maximum power transmission (MPT): The D2D-Tx transmits information with maximum power, P_max. We use the PPO algorithm to optimise the phase shift matrix of the IRS panel.

TABLE I
SIMULATION PARAMETERS

Parameter                   Value
Bandwidth (W)               1 MHz
Path-loss parameters        κ_0 = 2.5, κ_1 = 3.6
Channel power gain          −30 dB
Rician factor               β_1 = 4
Noise power                 α^2 = −80 dBm
Clipping parameter          ε = 0.2
Discount factor             ζ = 0.9
Max number of D2D pairs     10
Initial batch size          K = 128

[Figure: Sum-rate (bits/s/Hz) versus the number of D2D pairs N (N = 2, . . . , 10), comparing the proposed algorithm, MPT, RPS, and Without IRS.]
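For convenience, the Table I values and the hyperparameters quoted in the text can be collected into a single configuration, as in the illustrative snippet below; the dictionary keys are hypothetical and the dB quantities are converted to linear scale here.

# Simulation parameters of Table I and the text (illustrative names).
SIM_PARAMS = {
    "bandwidth_hz": 1e6,                 # W = 1 MHz
    "path_loss_exponents": (2.5, 3.6),   # kappa_0, kappa_1
    "channel_power_gain": 1e-3,          # -30 dB
    "rician_factor": 4.0,                # beta_1
    "noise_power_w": 1e-11,              # alpha^2 = -80 dBm
    "clip_epsilon": 0.2,                 # PPO clipping parameter
    "discount_factor": 0.9,              # zeta
    "max_d2d_pairs": 10,
    "batch_size": 128,                   # initial batch size K
    "learning_rate": 1e-4,               # from the text
    "hidden_units": (128, 64),           # two hidden layers, from the text
}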