DQN 1
5, MAY 2021
Abstract— In this letter, a deep Q-learning network (DQN) based resource allocation (RA) scheme is proposed for massive multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) systems. A reinforcement learning (RL) framework is developed to build an iterative optimization structure for user clustering, power allocation and beamforming. Specifically, a DQN is designed to group the users based on the reward calculated after power allocation and beamforming. The objective is to maximize this reward, i.e., the system throughput. Then, a back propagation neural network (BPNN) is used to realize the power allocation. During the training of the BPNN, the exhaustive search results over the quantized power set are taken as the output labels. Simulation experiments show that the proposed scheme achieves a system spectrum efficiency close to that of exhaustive search-based user clustering and power allocation.

Index Terms— Non-orthogonal multiple access, massive multiple-input multiple-output, resource allocation, deep Q-learning network, back propagation neural network.

I. INTRODUCTION

decoder detects the signal according to the power orders of users. Thus, error accumulation is an inherent problem of the SIC decoder. The more users are superimposed, the more serious the error propagation becomes, which greatly limits the transmission efficiency. In addition, each beam has to cover all the users in one cluster in a MIMO-NOMA system, rather than one user in a MIMO-OMA system. Then, the tradeoff between enhancing the intra-cluster coverage and eliminating the inter-cluster interference becomes more difficult for beamforming. Based on the above analysis, joint optimization of user clustering, power allocation (PA) and beamforming becomes more urgent for massive MIMO-NOMA.

Unfortunately, such a joint RA problem with multiple variables has been proven to be an NP-hard problem [5]. Alternating optimization over the three parts is usually used. For the user clustering sub-problem, the optimal solution is obtained by searching all clustering combinations exhaustively. For the scenario with a large number of users, some heuristic cluster-
Authorized licensed use limited to: London School of Economics & Political Science. Downloaded on May 16,2021 at 23:25:39 UTC from IEEE Xplore. Restrictions apply.
CAO et al.: DEEP Q-NETWORK BASED-RESOURCE ALLOCATION SCHEME FOR MASSIVE MIMO-NOMA 1545
extremely low. Therefore, a deep Q-learning network (DQN) is adopted to realize the joint RA scheme in scenarios with a large number of users. Then, a deeply coupled iterative structure involving three functional modules, namely user clustering, power allocation and beamforming, is established based on the deep reinforcement learning (DRL) framework. In the user clustering stage, the DQN is used to gradually adjust the clustering results to maximize the system throughput, which is calculated by the environment evaluator based on the previous clustering result, power factors and beamforming vectors. In the power assignment, a BPNN is designed to learn the relationship between the power allocation factors and the users' channel state information (CSI) for each user cluster. As for the beamforming, traditional methods such as the zero forcing (ZF) algorithm and other optimized beamforming schemes can be directly used among different user clusters. Simulation experiments show that the iterative process converges after dozens of iterations and that the system performance approximates that of the exhaustive search scheme.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a single-cell multi-user downlink system. The base station (BS) is equipped with $N_t$ antennas and serves $L$ single-antenna users. All users in the cell are divided into $N$ clusters, each of which includes $K$ users. The users' data in one cluster are transmitted in the form of a power-domain NOMA signal and preprocessed by the same beamforming vector. Assume that the BS deploys antennas on the Y-Z plane as a uniform planar array (UPA). Then, the channel vector from the BS to user $m$ can be modeled similarly as in [14], which is given by

$$
\mathbf{h}_m = \left(\frac{d_m}{d_0}\right)^{\mu} \frac{1}{\sqrt{L_u}} \sum_{l=1}^{L_u} g_{m,l}\, \mathbf{b}(v_{m,l}) \otimes \mathbf{a}(u_{m,l}) \tag{1}
$$

Here $d_0$ is the radius of the cell and $d_m$ is the distance between the user and the BS. $\mu$ is the large-scale fading factor. $\mathbf{b}(v_{m,l})$ and $\mathbf{a}(u_{m,l})$ are the vertical and horizontal array response vectors of the UPA antenna, respectively. The symbol "$\otimes$" in formula (1) represents the Kronecker product of two matrices. $L_u$ is the number of scattering paths and $g_{m,l}$ denotes the small-scale fading coefficient.

Assume that the data sent by the BS are $\mathbf{X} = [x_1 \cdots x_N]^T \in \mathbb{C}^{N \times 1}$, where $x_n = \sum_{k=1}^{K} \sqrt{\alpha_{n,k} P_n}\, s_{n,k}$ is the superposed NOMA signal of the $K$ users in the $n$-th cluster. Here, $P_n$ is the total power of the $n$-th cluster. $\alpha_{n,k}$ and $s_{n,k}$ are the power allocation factor and the transmitted symbol of the $k$-th user in the $n$-th cluster, denoted by $U_{n,k}$, respectively. It is assumed that $E[|s_{n,k}|^2] = 1$. The received signal of user $U_{n,k}$ is

$$
\begin{aligned}
y_{n,k} &= \mathbf{h}_{n,k}\mathbf{W}\mathbf{X} + z_{n,k} \\
&= \mathbf{h}_{n,k}\mathbf{w}_n\sqrt{\alpha_{n,k}P_n}\,s_{n,k} + \mathbf{h}_{n,k}\mathbf{w}_n\sum_{j=1,\,j\neq k}^{K}\sqrt{\alpha_{n,j}P_n}\,s_{n,j} + \mathbf{h}_{n,k}\sum_{i=1,\,i\neq n}^{N}\mathbf{w}_i x_i + z_{n,k}
\end{aligned}
\tag{2}
$$

where $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_N] \in \mathbb{C}^{N_t \times N}$ is a beamforming matrix, and $\mathbf{h}_{n,k} \in \mathbb{C}^{1 \times N_t}$ is the channel vector for user $U_{n,k}$. $z_{n,k}$ is complex white Gaussian noise with zero mean and variance $\sigma^2$. The second and third terms after the second equal sign in (2) represent the intra-cluster and inter-cluster interference, respectively. The beamforming vector is designed to eliminate inter-cluster interference, which should satisfy $\mathbf{h}_n \mathbf{w}_i = 0$, $i \neq n$. However, it is difficult to achieve such an ideal result in practice and the inter-cluster interference cannot be ignored. Moreover, suppose that the SIC detector in the receiver can ideally cancel the interference from the previous users, and that the user channel quality in the $n$-th user cluster ranks as $\|\mathbf{h}_{n,1}\|^2 \le \|\mathbf{h}_{n,2}\|^2 \le \cdots \le \|\mathbf{h}_{n,K}\|^2$. Thus, the signal to interference plus noise ratio of user $U_{n,k}$ is

$$
\Phi_{n,k} = \frac{|\mathbf{h}_{n,k}\mathbf{w}_n|^2\,\alpha_{n,k}P_n}{\sum_{i=1,\,i\neq n}^{N}|\mathbf{h}_{n,k}\mathbf{w}_i|^2 P_i + |\mathbf{h}_{n,k}\mathbf{w}_n|^2\sum_{j=k+1}^{K}\alpha_{n,j}P_n + \sigma^2} \tag{3}
$$

Furthermore, with the objective of maximum sum rate, the joint optimization problem can be written as follows:

$$
\begin{aligned}
\max_{\{\alpha_{n,k}\},\{\mathbf{w}_n\},\{U_{n,k}\}} \quad & R_{\mathrm{sum}} = \sum_{n=1}^{N}\sum_{k=1}^{K} B\log_2\left(1+\Phi_{n,k}\right) \\
\text{s.t.} \quad & C1: \sum_{k=1}^{K}\alpha_{n,k} \le 1,\ \alpha_{n,k}\in[0,1] \\
& C2: \sum_{n=1}^{N} P_n \le P \\
& C3: R_{n,k} \ge R_{\min} \\
& C4: \|\mathbf{W}\|^2 = 1
\end{aligned}
\tag{4}
$$

where $B$ is the bandwidth of one user channel. C1 is the power allocation factor constraint for each user cluster and C2 denotes the total power constraint of the BS over one user channel. C3 ensures the minimum data rate for each user and C4 is the norm constraint of the beamforming matrix.

Because the joint optimization problem (4) is non-convex [5], the traditional solutions are always heuristic or alternating iterative methods. These methods suffer from high complexity or limited performance. In contrast, DRL can not only fully explore the hidden information of big data to improve its own learning performance, but also realize dynamic real-time interaction. This method has strong generalization ability, which is an advantage in wireless RA. Therefore, this letter proposes a joint optimization method based on DRL to solve problem (4) in Section III.

III. DQN BASED-RESOURCE ALLOCATION SCHEME

The proposed RA network based on DQN is shown in Fig. 1 and includes three parts: user clustering, power allocation and beamforming. This section mainly concerns the first two parts.

A. User Clustering Based on DQN

The user clustering problem is modeled as an RL task, which consists of the agent and the environment. Specifically, the user clustering module is taken as the agent and the performance of the massive MIMO-NOMA system is the environment. The actions $\{a_t\}$ taken by the agent are based on the expected rewards from the environment. According to the considered system, each part of the RL framework is described as follows:
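In this framing, the reward returned by the environment is the system throughput, i.e., the sum rate in (4) built from the SINR in (3). A minimal sketch of that evaluation is given below, assuming NumPy. The function name `sinr_and_sum_rate`, the array layout and the toy channel values are illustrative assumptions rather than part of the proposed scheme, and the constraints C1 to C4 are not enforced.

```python
import numpy as np


def sinr_and_sum_rate(H, W, alpha, P, sigma2=1.0, B=1.0):
    """Evaluate the SINR of (3) and the sum rate of (4).

    H:      complex array, shape (N, K, Nt); H[n, k] is the row vector h_{n,k}.
            Users in each cluster are assumed sorted by increasing channel gain.
    W:      complex array, shape (Nt, N); column n is the beamforming vector w_n.
    alpha:  real array, shape (N, K); power allocation factors alpha_{n,k}.
    P:      real array, shape (N,); cluster powers P_n.
    sigma2: noise variance (illustrative default).
    B:      bandwidth of one user channel (illustrative default).
    """
    N, K, _ = H.shape
    # gain[n, k, i] = |h_{n,k} w_i|^2
    gain = np.abs(H @ W) ** 2
    sinr = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            own = gain[n, k, n]
            signal = own * alpha[n, k] * P[n]
            # Inter-cluster interference: beams of all other clusters.
            inter = sum(gain[n, k, i] * P[i] for i in range(N) if i != n)
            # Intra-cluster interference left after ideal SIC: users j > k.
            intra = own * alpha[n, k + 1:].sum() * P[n]
            sinr[n, k] = signal / (inter + intra + sigma2)
    return sinr, float(np.sum(B * np.log2(1.0 + sinr)))


# Toy example (assumed values): N = 2 clusters, K = 2 users, Nt = 2 antennas,
# with the two clusters on orthogonal beams so inter-cluster interference is zero.
H = np.array([[[1, 0], [2, 0]],
              [[0, 1], [0, 2]]], dtype=complex)
W = np.eye(2, dtype=complex)
alpha = np.array([[0.75, 0.25], [0.75, 0.25]])
P = np.array([4.0, 4.0])
sinr, r_sum = sinr_and_sum_rate(H, W, alpha, P)
print(sinr, r_sum)
```

In the toy example the weaker user of each cluster receives the larger power factor and treats the stronger user's signal as interference, while the stronger user sees only noise after ideal SIC. Replacing `W` with a matrix whose columns are not orthogonal to the other cluster's channels makes the inter-cluster term in the denominator of (3) nonzero and lowers the SINR.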
1546 IEEE COMMUNICATIONS LETTERS, VOL. 25, NO. 5, MAY 2021
TABLE I
SIMULATION PARAMETERS
Fig. 5. CDF curves of the user's spectrum efficiency.

a performance comparable to the ES-ESPA case and has a significant gain over the NLUPA-FTPA method for the scenarios with more users. Furthermore, we also find that the total system spectrum efficiency increases by about 7 bit/s/Hz at a power of 4 W when the number of users increases from 8 to 12.

In addition, we change the number of intra-cluster users and observe its effect on the system performance, as shown in Fig. 4. Here, $K$ increases from 2 to 4 while $L$ remains 8. In this case, the BPNN needs to be retrained. The network is adjusted to have 5 hidden layers, where the numbers of nodes are 32, 64, 128, 64 and 32, respectively. It can be observed that our scheme has a larger performance gain over the NLUPA-FTPA method in this case. However, we find that the system spectrum efficiency of all the schemes under $K = 4$ is significantly lower than in Fig. 2 ($K = 2$). This is caused by the serious intra-cluster error propagation in the SIC decoder. The simulation in Fig. 4 shows that the given BPNN-based PA is applicable to other values of $K$, and that $K$ is suitably set lower than 4 because of the intra-cluster error propagation.

Fig. 5 shows the CDF curves of the users' spectrum efficiency obtained by the proposed scheme and the ES-FTPA scheme under a total transmission power of 4 W, $L = 8$ and $K = 2$. The dotted line is the performance when the inter-cluster interference is ignored. It shows that the DQN-BPNN scheme obtains an obvious gain in system spectrum efficiency over the traditional method. However, the performance of the edge users improves only slightly and has a big gap from the ideal situation. Although the minimum rate constraint for users has been considered in the BPNN, the inter-cluster interference, which cannot be suppressed by ZF beamforming, still worsens the final performance. Moreover, even in the ideal situation without inter-cluster interference, the lowest user's spectrum efficiency cannot be kept at $R_{\min}$. This is because FTPA rather than ESPA is used to generate the training labels in some extreme cases. In such cases, $R_{\min}$ cannot be realized for a few users, no matter how the users' power within a cluster is adjusted, due to their very poor channel conditions. Nevertheless, nearly 90% of users still achieve better performance than $R_{\min}$ with the proposed scheme.

V. CONCLUSION

This work mainly studies the downlink RA problem in the massive MIMO-NOMA system. To maximize the system spectrum efficiency while guaranteeing the worst-user performance constraint, a deep Q-learning network and a BP neural network are designed to realize the joint user clustering and the intra-cluster power allocation, respectively. The simulation results demonstrate the advantage of our scheme in improving system spectrum efficiency.

REFERENCES

[1] F. Tang, Y. Kawamoto, N. Kato, and J. Liu, “Future intelligent and secure vehicular network toward 6G: Machine-learning approaches,” Proc. IEEE, vol. 108, no. 2, pp. 292–307, Feb. 2020.
[2] Z. Shi, W. Gao, S. Zhang, J. Liu, and N. Kato, “AI-enhanced cooperative spectrum sensing for non-orthogonal multiple access,” IEEE Wireless Commun., vol. 27, no. 2, pp. 173–179, Apr. 2020.
[3] G. Zhang et al., “Interference management by vertical beam control combined with coordinated pilot assignment and power allocation in 3D massive MIMO systems,” KSII Trans. Internet Inf. Syst., vol. 9, no. 8, pp. 2797–2820, Aug. 2015.
[4] Z. Ding, F. Adachi, and H. V. Poor, “The application of MIMO to non-orthogonal multiple access,” IEEE Trans. Wireless Commun., vol. 15, no. 1, pp. 537–552, Jan. 2016.
[5] Y. Sun, D. W. K. Ng, Z. Ding, and R. Schober, “Optimal joint power and subcarrier allocation for full-duplex multicarrier non-orthogonal multiple access systems,” IEEE Trans. Commun., vol. 65, no. 3, pp. 1077–1091, Mar. 2017.
[6] S. M. R. Islam, M. Zeng, O. A. Dobre, and K.-S. Kwak, “Resource allocation for downlink NOMA systems: Key techniques and open issues,” IEEE Wireless Commun., vol. 25, no. 2, pp. 40–47, Apr. 2018.
[7] A. Benjebbour, A. Li, Y. Saito, Y. Kishiyama, A. Harada, and T. Nakamura, “System-level performance of downlink NOMA for future LTE enhancements,” in Proc. IEEE Globecom Workshops (GC Wkshps), Atlanta, GA, USA, Dec. 2013, pp. 66–70.
[8] S. Chinnadurai, P. Selvaprabhu, and M. H. Lee, “A novel joint user pairing and dynamic power allocation scheme in MIMO-NOMA system,” in Proc. Int. Conf. Inf. Commun. Technol. Converg. (ICTC), Jeju, South Korea, Oct. 2017, pp. 951–953.
[9] Y. Liu, M. Elkashlan, Z. Ding, and G. K. Karagiannidis, “Fairness of user clustering in MIMO non-orthogonal multiple access systems,” IEEE Commun. Lett., vol. 20, no. 7, pp. 1465–1468, Jul. 2016.
[10] R. Zhu and G. Zhang, “A segment-average based channel estimation scheme for one-bit massive MIMO systems with deep neural network,” in Proc. IEEE 19th Int. Conf. Commun. Technol. (ICCT), Xi’an, China, Oct. 2019, pp. 81–86.
[11] C. He, Y. Hu, Y. Chen, and B. Zeng, “Joint power allocation and channel assignment for NOMA with deep reinforcement learning,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2200–2210, Oct. 2019.
[12] M. Liu, T. Song, and G. Gui, “Deep cognitive perspective: Resource allocation for NOMA-based heterogeneous IoT with imperfect SIC,” IEEE Internet Things J., vol. 6, no. 2, pp. 2885–2894, Apr. 2019.
[13] G. Gui, H. Huang, Y. Song, and H. Sari, “Deep learning for an effective nonorthogonal multiple access scheme,” IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8440–8450, Sep. 2018.
[14] D. Ying, F. W. Vook, T. A. Thomas, D. J. Love, and A. Ghosh, “Kronecker product correlation model and limited feedback codebook design in a 3D channel model,” in Proc. IEEE Int. Conf. Commun. (ICC), Sydney, NSW, Australia, Jun. 2014, pp. 5865–5870.
[15] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., Stateline, NV, USA, Dec. 2012, pp. 1097–1105.
[17] TP for Classification of MUST Schemes, document R1-154999, 3GPP, 2015.
[18] J.-M. Kang, I.-M. Kim, and C.-J. Chun, “Deep learning-based MIMO-NOMA with imperfect SIC decoding,” IEEE Syst. J., vol. 14, no. 3, pp. 3414–3417, Sep. 2020.