
Received 19 September 2023; accepted 27 September 2023. Date of publication 6 October 2023; date of current version 20 October 2023.

Digital Object Identifier 10.1109/OJCOMS.2023.3322047

Energy-Efficient Rate-Splitting Multiple Access: A Deep Reinforcement Learning-Based Framework
MARIA DIAMANTI¹ (Member, IEEE), GEORGIOS KAPSALIS¹, EIRINI ELENI TSIROPOULOU² (Senior Member, IEEE), AND SYMEON PAPAVASSILIOU¹ (Senior Member, IEEE)
1 Institute of Communication and Computer Systems, School of Electrical and Computer Engineering,
National Technical University of Athens, 15780 Zografou, Greece
2 Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA

CORRESPONDING AUTHOR: S. PAPAVASSILIOU (e-mail: [email protected])


This work was supported in part by the European Commission through the Horizon Europe/JU SNS Project Hexa-X-II under Grant 101095759.

ABSTRACT Rate-Splitting Multiple Access (RSMA) has been recognized as an effective technique to reconcile the tradeoff between decoding interference and treating interference as noise in 6G and beyond networks. In this paper, in line with the need for network sustainability, we study the energy-efficient power and rate allocation of the common and private messages transmitted in the downlink of a single-cell single-antenna RSMA network. Contrary to the literature that resorts to heuristic approaches to deal with the joint problem, we transform the formulated energy efficiency maximization problem into a multi-agent Deep Reinforcement Learning (DRL) problem, based on which each transmitted private message represents a different DRL agent. Each agent explores its own state-action space, the size of which is fixed and independent of the number of agents, and shares the experience gained by exploration with a common neural network. Two DRL algorithms, namely the value-based Deep Q-Learning (DQL) and the policy-based REINFORCE, are properly configured and utilized to solve the problem. The adaptation of the proposed DRL framework is also demonstrated for the treatment of the considered network's sum-rate maximization objective. Numerical results obtained via modeling and simulation verify the effectiveness of the proposed DRL framework in concluding a solution to the joint problem under both optimization objectives, outperforming existing heuristic approaches and algorithms from the literature.

INDEX TERMS Energy efficiency maximization, rate-splitting multiple access (RSMA), deep reinforcement learning (DRL).

I. INTRODUCTION
6G and beyond communication networks must deal with the ever more challenging issue of multi-user interference, given the requirements for massive connectivity to be supported over the same physical resources. In this context, Rate-Splitting Multiple Access (RSMA) has been recognized as a promising technique to transcend the immense controversy between decoding interference and treating interference as noise in such multi-user communication systems [1]. The rate-splitting lies in splitting a message into two or more parts that can be flexibly decoded at one or more receivers, respectively. The common message – as it is called – is intended for and decoded by all the involved users in the transmission, contrary to the private message intended for each user separately. As a result, when decoding the private message, the interference originating from the other users' private messages is treated as noise. By smartly controlling the split among the common and private messages, an acceptable tradeoff between efficient spectrum usage, multi-user interference management, and signal processing complexity at the receivers is achieved [2].

In light of elucidating the performance limits of the RSMA technique, systematic attempts have focused on resource optimization in RSMA-based wireless networks. Accordingly, the power control, precoder design, and rate allocation should be jointly studied in single or multi-antenna systems, resulting in highly non-convex and combinatorial optimization problems that are difficult to solve optimally using conventional optimization techniques [3].
Moreover, the network complexity in the number of wireless connections calls for robust optimization techniques that can scale well and dynamically adapt to the environment. Deep Reinforcement Learning (DRL) has been broadly considered in communications and networking to handle the complexity, scalability, and autonomicity issues therein [4]. Leveraging the power of deep neural networks, DRL algorithms explore a vast state-action space and conclude near-optimal solutions to non-convex problems while allowing the network's self-adaptation based on the trained model.

In this article, we target energy efficiency maximization in a single-antenna RSMA-based wireless network. To the best of the authors' knowledge, this is the first time in the literature to design and propose a DRL-based framework for energy-efficient power and rate allocation of the common and private messages transmitted in the downlink. The optimization problem is transformed into a multi-agent DRL problem, such that each agent autonomously explores its own state-action space and contributes its gained experience to a commonly trained neural network. Two different DRL algorithms are properly configured and utilized to solve it, namely the value-based Deep Q-Learning (DQL) and the policy-based REINFORCE algorithm. The algorithms are evaluated in terms of effectiveness in determining a solution to the problem by comparison against other existing heuristic approaches from the literature. Complementary to this, and for better revealing the benefits and tradeoffs of the obtained solution when aiming at energy efficiency, we also analyze and assess the proposed framework under the objective of sum-rate maximization considering the same network setting, which is again a problem that has not been similarly targeted in the literature so far.

A. RELATED WORK
RSMA provides a generalization of several existing orthogonal and non-orthogonal multiple access techniques, leading to superior performance in terms of achieved throughput and spectral efficiency, as has been theoretically proved for two-user Single-Input Single-Output (SISO) [5] and Multiple-Input Single-Output (MISO) [2] broadcast channels. The existence of such theoretical analyses provoked active research around RSMA lately, with an emphasis on resource allocation under various network settings. In [6] and [7], the sum-rate and weighted sum-rate maximization in the downlink of multi-user SISO and MISO systems are targeted, respectively, by jointly performing power control/precoder design and rate allocation. Other works, e.g., [8], [9], are devoted to achieving a tradeoff between energy and spectral efficiency in downlink single-cell and multi-cell MISO systems. The aforementioned tradeoff is formulated as a multi-objective optimization problem that is either approximated by the weighted sum of the two contradicting objectives [8] or decomposed into two subproblems solved iteratively [9]. Subsequently, the method of Successive Convex Approximation (SCA) is used to convexify the resulting problems and obtain a solution.

Toward accounting for sustainability and not restricting the resource allocation procedure to achieving high data rates, a different line of research pursues the maximization of the studied system's energy efficiency while potentially ensuring some minimum rate requirements, e.g., [10], [11], [12], [13], [14]. Similar to the above, the joint power control/beamforming and common-rate allocation constitute fundamental problems studied in SISO [10] and MISO [11] broadcast channels under the energy efficiency optimization objective. Both [10] and [11] conclude with suboptimal solutions, contrariwise to [12] that, under a similar MISO setting with [11], manages to obtain a globally optimal solution based on the Successive Incumbent Transcending (SIT) Branch and Bound (BB) algorithm. Continuing with more complex network settings, the authors in [13] and [14] investigate the application of the RSMA technique in a Cloud Radio Access Network (C-RAN) and a Reconfigurable Intelligent Surface (RIS)-assisted network, accordingly. In the former, the typical power control and rate allocation problem is addressed toward energy efficiency maximization subject to the additional per-base station's transmission power and common fronthaul links' capacity constraints, whereas, in the latter, the RIS's phase-shift optimization is considered along.

The overwhelming majority of research works in the field of RSMA network optimization has relied on model-oriented and heuristic algorithms that (i) conclude suboptimal solutions, (ii) are characterized by high computational complexity as the network scales, and (iii) prohibit adaptability to the network's unpredictable changes. To tackle these challenges, the application of DRL algorithms is becoming increasingly popular. The works in [15], [16], [17], [18] provide representative examples of DRL algorithms successfully implemented to solve optimization problems in various communication environments. In [15] and [16], the power control toward sum-rate maximization is modeled as a multi-agent DRL problem, according to which the transmitter of each wireless link, i.e., agent, autonomously executes its action in selecting an appropriate transmission power level based on a commonly trained neural network, which is a paradigm referred to as "centralized training and distributed execution" in the literature. Value-based DQL, policy-based REINFORCE, and actor-critic Deep Deterministic Policy Gradient (DDPG) algorithms are then implemented and tested in this context. In [17], the DQL algorithm is used to derive the user pairing in the downlink of a Non-Orthogonal Multiple Access (NOMA) network, while the joint channel selection and power control problem is treated in [18] under both value-based and actor-critic-based DRL algorithm implementations. Both works in [17], [18] consider the sum-rate maximization objective.
Regarding the application of DRL algorithms for resource optimization in RSMA networks, only a handful of research works can be found in the literature, i.e., [19], [20], [21], [22], [23]. In [19] and [20], two similar policy-based DRL algorithms are proposed to determine the beamforming in the downlink of an RSMA network, targeting the system's sum-rate maximization. Under the same optimization objective, the joint problem of uplink-downlink user association and beamforming is tackled in [21] for a multiple Unmanned Aerial Vehicle (UAV)-assisted RSMA network using an actor-critic DRL algorithm. In [22], an actor-critic DRL algorithm is introduced to perform computation offloading decision-making, power allocation, and decoding order optimization in the uplink of an RSMA-assisted Mobile Edge Computing (MEC) network while aiming for the minimization of the weighted sum of latency and consumed energy. Last, accounting for communications powered by energy harvesting, the authors in [23] design a DRL framework to perform harvested power allocation from a UAV to end-user devices, and then the beamforming in the RSMA network is determined using the Minimum Mean Square Error (MMSE) technique. It should be noted that none of the aforementioned works in [19], [20], [21], [22], [23] has inherited the paradigm of centralized training and distributed execution by following a multi-agent DRL modeling, while both continuous [19], [21], [22], [23] and discrete [20], [21] action spaces have been scrutinized. In the meantime, the energy efficiency maximization in RSMA-based networks via DRL algorithms has been significantly overlooked, creating a research gap.

B. CONTRIBUTIONS & OUTLINE
In this article, a DRL framework for energy-efficient power and rate allocation of the common and private messages transmitted in the downlink of a single-antenna RSMA-based network is proposed for the first time in the literature. Different from the existing works in the intersection of RSMA and DRL, multi-agent DRL modeling is adopted, according to which each private stream plays the role of a different DRL agent that contributes its personal experience from interacting with the environment toward training a common neural network. Two different DRL algorithms are then utilized to solve the formulated DRL problem, namely the value-based DQL and the policy-based REINFORCE. The key contributions of this article are summarized as follows.
1) The non-convex energy efficiency maximization problem is converted into a multi-agent DRL problem by properly designing the states, actions, and rewards to capture the problem's objective and constraints and ultimately obtain the joint power and rate solution sought, while modeling each private stream as a different DRL agent.
2) The multi-agent DRL modeling, the adoption of the centralized training and distributed execution paradigm, and the appropriate discretization of the action space – for DRL algorithms' application purposes – result in a computationally scalable, though robust, DRL framework that is independent of the number of users in the network.
3) The applicability and adaptation of the proposed DRL framework are also demonstrated for the treatment of the system's sum-rate maximization, which serves as a basis for highlighting the benefits and tradeoffs of the obtained solution when targeting energy efficiency.
4) The overall DRL framework's performance is evaluated via modeling and simulation, and numerical results are presented that verify its superiority under both optimization objectives when compared against existing heuristic approaches from the literature.

The remainder of this article is organized as follows. Section II presents the system model and the energy efficiency maximization problem formulation. In Section III, the multi-agent DRL modeling and distributed DRL architecture are discussed along with the description of the DQL and REINFORCE algorithms. In Section IV, the sum-rate maximization benchmark problem's formulation and solution are analyzed. Section V presents the numerical evaluation, and Section VI concludes the paper.

II. PROBLEM STATEMENT
A. SYSTEM MODEL
We consider a single-cell single-antenna wireless network consisting of a set of users N = {1, ..., N} served by a base station positioned at the center of the cell. The multiplexing of data transmissions for different users in the downlink is performed over the same frequency band by employing the RSMA technique. The message intended for user n is denoted as W_n, which is further divided into two parts: a common part W_n^c and a private part W_n^p. The common parts intended for the different users, i.e., W_1^c, ..., W_n^c, ..., W_N^c, are combined and encoded into a single common stream v_0 that is transmitted to all users with downlink transmission power p_0 [Watt]. On the other hand, the remaining private messages W_n^p, ∀n ∈ N, are encoded into separate private streams v_n and transmitted individually with power p_n [Watt], ∀n ∈ N. Given that the system operates on a per-time-slot basis, the transmitted signal by the base station at time slot t is:

x^{(t)} = \sqrt{p_0^{(t)}} v_0^{(t)} + \sum_{n=1}^{N} \sqrt{p_n^{(t)}} v_n^{(t)}.   (1)

The received signal by each user n is:

y_n^{(t)} = \sqrt{G_n^{(t)}} \left( \sqrt{p_0^{(t)}} v_0^{(t)} + \sum_{j=1}^{N} \sqrt{p_j^{(t)}} v_j^{(t)} \right) + z_n^{(t)},   (2)

where G_n^{(t)} denotes the channel gain from the base station to user n and z_n^{(t)} ~ CN(0, σ^2) is the corresponding Additive White Gaussian Noise (AWGN). An overview of a simplified two-user RSMA-based network is presented in Fig. 1.

FIGURE 1. Overview of simplified two-user RSMA-based network.
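To make the signal model concrete, the following minimal Python/NumPy sketch builds the superposed downlink signal of Eq. (1) and the received signals of Eq. (2) for one time slot, assuming the square-root (amplitude-domain) reading of Eqs. (1)-(2) with unit-power symbols. The number of users, power values, and channel-gain distribution are illustrative placeholders, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                    # number of users (illustrative)
sigma2 = 1e-12                           # noise power sigma^2 in Watt (illustrative)

p0 = 0.2                                 # common-stream power p_0 [Watt]
p = np.array([0.05, 0.04, 0.03, 0.02])   # private-stream powers p_n [Watt]
G = rng.exponential(1e-9, size=N)        # channel gains G_n (illustrative draw)

# Unit-power complex symbols for the common stream v_0 and the private streams v_n.
v0 = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
v = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

# Eq. (1): superposed transmit signal of the base station.
x = np.sqrt(p0) * v0 + np.sum(np.sqrt(p) * v)

# Eq. (2): signal received by each user n, corrupted by AWGN of variance sigma2.
z = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = np.sqrt(G) * x + z
print(np.abs(y))
```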

With reference to the channel gain modeling, in this article, block fading is adopted, such that:

G_n^{(t)} = |h_n^{(t)}|^2 β_n,   (3)

where β_n is the large-scale fading that can remain the same over several time slots, whereas the term h_n^{(t)} represents the small-scale Rayleigh fading. To model the time-varying nature of the channel, Jake's model [24] is used and the small-scale Rayleigh fading is expressed as a first-order Gaussian-Markov process:

h_n^{(t)} = ρ h_n^{(t-1)} + \sqrt{1 - ρ^2} ζ_n^{(t)},   (4)

where ζ_n^{(t)} ~ CN(0, 1 - ρ^2) is an independent and identically distributed random variable. The correlation parameter ρ is ρ = J_0(2π f_d T), where J_0 is the zero-order Bessel function, f_d is the maximum Doppler frequency, and T is the time slot over which the correlated channel variation occurs.

Following the above, the achievable rate for decoding the common stream v_0^{(t)} transmitted by the base station to user n is calculated as:

r_n^{c(t)} = \log_2 \left( 1 + \frac{G_n^{(t)} p_0^{(t)}}{G_n^{(t)} \sum_{j=1}^{N} p_j^{(t)} + σ^2} \right) [bps/Hz].   (5)

To guarantee the successful decoding of the common stream v_0^{(t)} by all users n ∈ N, the allocated decoding rates c_n^{(t)} must adhere to the following condition:

\sum_{n=1}^{N} c_n^{(t)} ≤ \min_{n ∈ N} r_n^{c(t)},   (6)

where \min_{n ∈ N} r_n^{c(t)} = r_1^{c(t)}, given the channel gains sorted as G_1^{(t)} ≤ ... ≤ G_n^{(t)} ≤ ... ≤ G_N^{(t)}.

Furthermore, to ensure the successful implementation of the Successive Interference Cancellation (SIC) technique at the receiver of each user n, the following condition must be met:

G_n^{(t)} p_0^{(t)} - G_n^{(t)} \sum_{j=1}^{N} p_j^{(t)} ≥ p_{tol},   (7)

with p_{tol} [Watt] indicating the receivers' SIC decoding tolerance/sensitivity that is assumed to be the same for all users. Eq. (7) is rewritten as G_1^{(t)} p_0^{(t)} - G_1^{(t)} \sum_{n=1}^{N} p_n^{(t)} ≥ p_{tol}, based on the ordering of the channel gains.

After decoding the common stream, the decoding of the corresponding private stream v_n^{(t)} takes place at the receiver of each user, the achievable rate of which is:

r_n^{p(t)} = \log_2 \left( 1 + \frac{G_n^{(t)} p_n^{(t)}}{G_n^{(t)} \sum_{j=1, j≠n}^{N} p_j^{(t)} + σ^2} \right) [bps/Hz].   (8)

As a result, the total achievable data rate of a user n in the downlink of an RSMA-based network is:

R_n^{(t)} = c_n^{(t)} + r_n^{p(t)} = c_n^{(t)} + \log_2 \left( 1 + \frac{G_n^{(t)} p_n^{(t)}}{G_n^{(t)} \sum_{j=1, j≠n}^{N} p_j^{(t)} + σ^2} \right).   (9)

B. PROBLEM FORMULATION
In this article, the energy efficiency maximization is targeted in the downlink of a single-antenna RSMA-based wireless network, defined as the ratio between the sum of the total achievable data rates of all users in the system, i.e., \sum_{n=1}^{N} R_n^{(t)}, and the total consumed power by the base station, i.e., p_0^{(t)} + \sum_{n=1}^{N} p_n^{(t)}. Toward achieving this objective, the common-stream rates c^{(t)} = [c_1^{(t)}, ..., c_n^{(t)}, ..., c_N^{(t)}]^T, the private-stream powers p^{(t)} = [p_1^{(t)}, ..., p_n^{(t)}, ..., p_N^{(t)}]^T, and the common-stream power p_0^{(t)} allocated by the base station to the users are optimized. Specifically, the corresponding optimization problem to be solved by the base station is formally written as follows:

\max_{c^{(t)}, p^{(t)}, p_0^{(t)}}  EE = \frac{\sum_{n=1}^{N} R_n^{(t)}}{p_0^{(t)} + \sum_{n=1}^{N} p_n^{(t)}}   (10a)
s.t.  \sum_{n=1}^{N} c_n^{(t)} ≤ r_1^{c(t)},   (10b)
G_1^{(t)} p_0^{(t)} - \left( G_1^{(t)} \sum_{n=1}^{N} p_n^{(t)} + σ^2 \right) ≥ p_{tol},   (10c)
p_0^{(t)} + \sum_{n=1}^{N} p_n^{(t)} ≤ p_{max},   (10d)
c_n^{(t)}, p_n^{(t)} ≥ 0, ∀n, and p_0^{(t)} ≥ 0.   (10e)

Eq. (10b) and Eq. (10c) represent the required constraints over the allocated common-stream rates and powers, respectively, for the successful decoding and implementation of the SIC technique at the receivers of the users, as described earlier in Section II-A. Eq. (10d) indicates the base station's maximum power budget p_{max} [Watt], while Eq. (10e) defines the feasible range of values of the different optimization variables.
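As a quick illustration of how the objective in (10a) is evaluated, the following Python sketch computes the common-stream and private-stream rates of Eqs. (5) and (8), the per-user totals of Eq. (9), and the resulting energy efficiency for a given allocation. The numerical values and function names are illustrative placeholders and do not come from the paper's simulation setup.

```python
import numpy as np

def energy_efficiency(G, p, p0, c, sigma2):
    """Return (EE, R) for one time slot, following Eqs. (5), (8), (9), (10a).

    G: (N,) channel gains, p: (N,) private powers, p0: common power,
    c: (N,) common-stream rate shares, sigma2: noise power.
    """
    G, p, c = map(np.asarray, (G, p, c))
    total_private = p.sum()
    # Eq. (5): rate of decoding the common stream at each user.
    r_c = np.log2(1.0 + G * p0 / (G * total_private + sigma2))
    # Eq. (8): private-stream rate; interference comes from the other private streams.
    r_p = np.log2(1.0 + G * p / (G * (total_private - p) + sigma2))
    # Eq. (6)/(10b): the common-rate shares must not exceed the worst user's r_c.
    assert c.sum() <= r_c.min() + 1e-9, "constraint (10b) violated"
    R = c + r_p                                   # Eq. (9)
    return R.sum() / (p0 + total_private), R      # Eq. (10a)

G = np.array([2e-10, 5e-10, 1e-9, 3e-9])          # sorted so that user 1 is the weakest
ee, R = energy_efficiency(G, p=[0.02, 0.02, 0.03, 0.03], p0=0.15,
                          c=[0.1, 0.1, 0.1, 0.1], sigma2=1e-12)
print(ee, R)
```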
III. PROBLEM SOLUTION
In this section, the formulated energy efficiency maximization problem is equivalently transformed into a multi-agent DRL problem to capitalize on the architectural paradigm of centralized training and distributed execution. Subsequently, the application of the value-based DQL and policy-based REINFORCE algorithms is analyzed and discussed to solve the multi-agent DRL problem.

A. MULTI-AGENT DRL MODEL & ARCHITECTURE
A typical multi-agent DRL problem is characterized by the set of agents, the environment's state space, the agents' action spaces, and the reward function. The definition of the aforementioned constituent elements in the context of the studied optimization problem is as follows.

Agents: Each private stream v_n^{(t)} of the signal transmitted by the base station to the users in the downlink is regarded as a distinct agent in the considered transformation. Given that there exists a one-to-one correspondence between the users and the private streams, which are henceforth termed DRL agents, we denote the set of agents as N = {1, ..., N} and use index n to refer to a particular agent.

State: At each time slot, the agents observe specific characteristics of the environment and create a corresponding representation known as the state. In more detail, the local state s_n^{(t)} observed by agent n encompasses information relevant to the transmission of its corresponding private stream v_n^{(t)}. Given that the power levels of the common and private streams undergo changes at the end of each time slot and remain constant during the subsequent slot [15], the agent's n state s_n^{(t)} at the beginning of time slot t is a tuple of the following eight components:
1) the channel gain G_n^{(t)} at time slot t;
2) the channel gain G_n^{(t-1)} at time slot t-1;
3) the interference sensed from the rest of the private streams at the beginning of time slot t, i.e., G_n^{(t)} \sum_{j ∈ N, j≠n} p_j^{(t-1)} + σ^2;
4) the interference sensed from the rest of the private streams at the beginning of time slot t-1, i.e., G_n^{(t-1)} \sum_{j ∈ N, j≠n} p_j^{(t-2)} + σ^2;
5) the power p_n^{(t-1)} of the private stream;
6) the power p_0^{(t-1)} of the common stream;
7) the data rate r_n^{p(t)} of the private stream at the beginning of time slot t, calculated considering p_n^{(t-1)}, ∀n ∈ N;
8) the data rate c_n^{(t)} of the common stream.

Action: Each agent chooses and performs an action a_n^{(t)} ∈ A_n from its set of possible actions A_n following some policy π(a_n^{(t)} | s_n^{(t)}) conditioned on the current state s_n^{(t)}. Specifically, the agent's n action space is formally defined as:

A_n = \left\{ 0, p_{n,min}, p_{n,min} \cdot \left( \frac{p_{n,max}}{p_{n,min}} \right)^{\frac{1}{A_n - 2}}, ..., p_{n,max} \right\},   (11)

where p_{n,max} = \frac{p_{max}}{N+1} is the maximum allowable transmission power of the private stream v_n^{(t)}, with p_{max} denoting the base station's maximum power budget, and p_{n,min} is a corresponding minimum allowable power level. Also, A_n indicates the cardinality of the set A_n.

After determining the selected actions a_n^{(t)} ∈ A_n of all agents at time slot t, the optimal values of (c^{(t)}, p_0^{(t)}) that maximize the system's energy efficiency can be obtained by analytically and exhaustively solving the following optimization problem:

\max_{c^{(t)}, p_0^{(t)}}  \frac{\sum_{n=1}^{N} c_n^{(t)}}{p_0^{(t)}}   (12a)
s.t.  \sum_{n=1}^{N} c_n^{(t)} ≤ r_1^{c(t)},   (12b)
c_n^{(t)} ≥ 0, ∀n, and p_0^{(t)} ∈ P_0,   (12c)

where by P_0 we denote the set of feasible values of p_0^{(t)}:

P_0 = \left\{ \frac{p_{max} - \sum_{n ∈ N} a_n^{(t)}}{P_0}, \frac{p_{max} - \sum_{n ∈ N} a_n^{(t)}}{P_0 - 1}, ..., \frac{p_{max} - \sum_{n ∈ N} a_n^{(t)}}{1} \right\},

and P_0 represents its cardinality. The problem in (12) reduces to a linear programming problem for the different values of p_0^{(t)} that can be, in turn, optimally solved in polynomial time. It is remarkable that the obtained solution for (c^{(t)}, p_0^{(t)}) satisfies constraints (10b) and (10d) owing to the proper definition of problem (12). The satisfaction of the remaining constraint in Eq. (10c) is guaranteed later by the definition of the DRL problem's reward function.

Reward: As a consequence of the chosen action a_n^{(t)}, each agent n transitions to a new state s_n^{(t+1)} and receives a scalar reward feedback signal f_n^{(t+1)}. Aiming to maximize the energy efficiency of the system, the agent's feedback signal increases with an increase in the normalized energy efficiency EE/N, while it decreases with the level of violation of constraint (10c). Specifically, if constraint (10c) is satisfied, the reward f_n^{(t+1)} is given by:

f_n^{(t+1)} = \frac{EE}{N},   (13)

otherwise, it is calculated as follows:

f_n^{(t+1)} = \frac{EE}{N} \cdot \left( 1 + \tanh\left( p_0^{(t)} - \sum_{j=1}^{N} p_j^{(t)} - \frac{p_{tol} + σ^2}{G_1^{(t)}} \right) \right).   (14)

The function tanh(x) approaches -1 as x tends to negative values. Hence, considering the definition of the reward in Eq. (14), it follows that the latter tends to zero as the violation of constraint (10c) grows. This behavior allows the agent to learn the negative impact of constraint violation.
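The per-slot search described around problem (12) can be sketched as follows: for each candidate common-stream power in P_0, the best feasible common-rate sum is simply r_1^{c} (the optimum of the linear program (12a)-(12c)), so the search reduces to a one-dimensional enumeration. The helper below also evaluates the reward of Eqs. (13)-(14). The equal splitting of the common rate among users, and all names and values, are illustrative choices of ours rather than the paper's exact procedure.

```python
import numpy as np

def solve_common_allocation(G, a, p_max, P0, sigma2):
    """Enumerate p0 over the set P_0 of problem (12) and return the pair
    (c, p0) maximizing sum(c)/p0, with sum(c) = r_1^c(p0)."""
    G = np.sort(np.asarray(G))             # G[0] is the weakest user (user 1)
    a = np.asarray(a)                      # private powers chosen by the agents
    budget = p_max - a.sum()               # power left for the common stream
    candidates = budget / np.arange(P0, 0, -1)
    best, best_val = None, -np.inf
    for p0 in candidates:
        r1c = np.log2(1.0 + G[0] * p0 / (G[0] * a.sum() + sigma2))
        if r1c / p0 > best_val:            # objective (12a) with sum(c) = r1c
            best_val, best = r1c / p0, (np.full(len(a), r1c / len(a)), p0)
    return best                            # (c vector, p0)

def reward(EE, N, p0, p, G1, p_tol, sigma2):
    """Per-agent reward of Eqs. (13)-(14)."""
    slack = p0 - np.sum(p) - (p_tol + sigma2) / G1
    if slack >= 0:                         # constraint (10c) satisfied
        return EE / N
    return (EE / N) * (1.0 + np.tanh(slack))

c, p0 = solve_common_allocation(G=[2e-10, 5e-10, 1e-9, 3e-9],
                                a=[0.02, 0.02, 0.03, 0.03],
                                p_max=0.3, P0=100, sigma2=1e-12)
print(c, p0)
```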
Based on the proposed multi-agent DRL problem modeling described above, the centralized training and distributed execution architectural paradigm can be adopted [15], [25]. Following this paradigm, a single general-purpose model is trained centrally and shared among the distributed agents. The agents interact with their environment and utilize the learned actions (or policies, depending on the employed DRL algorithm), generating experience samples that are then provided as feedback to the centralized model trainer (see Fig. 2). This approach allows leveraging the advantages of multi-agent DRL modeling in terms of reduced action and state spaces that require less memory, computational resources, and execution time, while maintaining the stability and efficiency of a centralized solution. Each agent explores its own state-action space, which in our case consists of eight components that describe the state of the agent and A_n power levels, i.e., actions, that are independent of the number of users existing in the network, combating the curse of dimensionality issue of discrete state-action space modeling in DRL frameworks. Undoubtedly, the design of the reward feedback signal is crucial to effectively optimize the global objective by the agents' distributed decisions and actions. However, upon its successful definition, the agents can quickly learn a more general model, benefiting from one another. The centralized model training can also be performed offline using data from a simulated wireless environment and be further fine-tuned in real scenarios. In this way, the burden of online training from the inherent large volumes of data is eliminated.

FIGURE 2. Overview of proposed multi-agent DRL architecture.

B. DEEP Q-LEARNING: A VALUE-BASED ALGORITHM
DQL is a value-based algorithm that approximates the Q-function Q^π(s, a), i.e., the expected reward when choosing an action a in state s according to some policy π. The definition of the Q-function Q^π(s, a) is given as follows:

Q^π(s, a) = E\left[ \sum_{τ=0}^{∞} γ^τ f^{(t+τ+1)} \,\Big|\, s^{(t)} = s, a^{(t)} = a \right],   (15)

where γ is the discount rate that determines the importance of future rewards, with γ ∈ [0, 1]. In the special case that γ = 0, only the instantaneous reward is considered.

The Q-function satisfies the recursive Bellman equation:

Q^π(s, a) = E\left[ f^{(t)} + γ Q^π(s', a') \,\Big|\, s^{(t)} = s, a^{(t)} = a \right],   (16)

describing the relationship of the value in state s with the values in all states s' that are likely to follow in the next time slots. By solving Eq. (16), the optimal state-action value Q^*(s, a) = \max_a Q^π(s, a) can be determined, implying the optimal policy π^* = \arg\max_a Q^*(s, a). In the preceding definitions, the subscripts n referring to the different agents have been dropped for notation convenience.

To approximate the optimal Q-function Q^*(s, a), a neural network with parameter vector θ_q is used, referred to as Deep Q-Network (DQN). Consequently, solving the DRL problem reduces to determining the optimal parameter vector θ_q, regardless of the dimensions of the state-action space. The DQN is trained from the experiences gained by the distributed agents interacting with the environment. Specifically, to combat potential instability issues of the DQL algorithm due to the high correlation of the successive states observed by a particular agent, the experience replay mechanism [26] is used. Based on this mechanism, N different First In First Out (FIFO) queues of size M are used, in which each agent n separately stores the experience acquired at time step t of training, represented by the tuple e_n^{(t)} = (s_n^{(t-1)}, a_n^{(t-1)}, f_n^{(t)}, s_n^{(t)}). A minibatch D^{(t)} of size D of experiences is randomly created at time slot t by a common randomizer, comprising an equal number of experiences from the different agents' queues, to eliminate training the DQN over correlated agent experiences.

Given a minibatch D^{(t)}, the least-square error of the trained DQN with parameters θ_q is calculated as:

L(θ_q^{(t)}) = \sum_{(s, a, f, s') ∈ D^{(t)}} \left( y_{DQN}^{(t)} - Q^π(s, a; θ_q^{(t)}) \right)^2.   (17)

The target state-action value y_{DQN}^{(t)} is given by:

y_{DQN}^{(t)} = f + γ \max_{a'} Q^π(s', a'; w^{(t)}),   (18)

where w^{(t)} is the parameter vector of a second "target" DQN – as it is called – that is updated to be equal to the trained DQN, i.e., w^{(t)} = θ_q^{(t)}, once every T_u time slots. The idea behind creating a second instance of the DQN that is sporadically updated serves the purpose of eliminating the correlation between the trained and the targeted state-action value. In the special case that γ = 0, the target state-action value coincides with the agent's immediate reward f and, thus, there is no need to keep a target DQN instance.

To progressively derive a better approximation of the Q-function, the trained DQN's parameters θ_q are updated via the gradient descent method with learning rate η_q ∈ (0, 1]:

θ_q^{(t+1)} = θ_q^{(t)} - η_q ∇_{θ_q} L(θ_q^{(t)}).   (19)

Given the updated DQN's parameters and the agent's state, the action that is selected at each time slot t of the designed DQL algorithm follows a dynamic ε-greedy policy. Let N_e denote the number of episodes, each comprising N_t time slots; then the exploration probability of randomly selecting an action different from the optimal one a^* = \arg\max_a Q^π(s, a; θ_q^{(t)}) is given by:

ε_k = e^{-λk},  k = 1, 2, ..., N_e,   (20)

where λ ∈ [0, 1] is a decay parameter controlling the exploration probability. The proposed DQL algorithm is summarized in Algorithm 1.
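A compact PyTorch sketch of the update described by Eqs. (17)-(20) is given below: experiences are drawn from per-agent FIFO buffers, the target of Eq. (18) is computed with a periodically copied target network, and one gradient step of Eq. (19) is applied. The network shape, the replay interface, and every hyper-parameter value here are illustrative stand-ins rather than the authors' exact implementation.

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 10          # 8 state features, A_n power levels

def make_dqn():
    return nn.Sequential(nn.Linear(STATE_DIM, 200), nn.ReLU(),
                         nn.Linear(200, 100), nn.ReLU(),
                         nn.Linear(100, 40), nn.ReLU(),
                         nn.Linear(40, N_ACTIONS))

dqn, target_dqn = make_dqn(), make_dqn()
target_dqn.load_state_dict(dqn.state_dict())           # w <- theta_q
optimizer = torch.optim.SGD(dqn.parameters(), lr=1e-2)  # eta_q in Eq. (19)
gamma = 0.5                                             # discount rate

# One FIFO replay queue of size M per agent, as in the experience replay scheme.
buffers = [deque(maxlen=1000) for _ in range(4)]

def train_step(batch_per_agent=125):
    batch = [e for buf in buffers
             for e in random.sample(buf, min(batch_per_agent, len(buf)))]
    if not batch:
        return
    s, a, f, s_next = map(torch.as_tensor, map(list, zip(*batch)))
    q = dqn(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                               # Eq. (18): target value
        y = f.float() + gamma * target_dqn(s_next.float()).max(dim=1).values
    loss = ((y - q) ** 2).sum()                         # Eq. (17)
    optimizer.zero_grad()
    loss.backward()                                     # Eq. (19): gradient step
    optimizer.step()

def epsilon(k, lam=0.05):
    """Exploration probability of Eq. (20) at episode k."""
    return torch.exp(torch.tensor(-lam * k)).item()
```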
Algorithm 1 Deep Q-Learning Algorithm
1: Initialize N_e, N_t, η_q, λ, M, D.
2: Randomly initialize the DQN's parameters θ_q.
3: for k = 1 to N_e do
4:   Update ε_k based on Eq. (20).
5:   Derive initial agents' states s_n^{(1)}, ∀n.
6:   for t = 1 to N_t do
7:     if rand() ≤ ε_k then
8:       Randomly select action a_n^{(t)} ∈ A_n, ∀n.
9:     else
10:      Select a_n^{(t)} = \arg\max_{a_n} Q^π(s_n^{(t)}, a_n; θ_q^{(t)}), ∀n.
11:    end if
12:    Set p^{(t)} = [a_1^{(t)}, ..., a_n^{(t)}, ..., a_N^{(t)}] and calculate (c^{(t)}, p_0^{(t)}) by solving problem (12).
13:    Assign the (p^{(t)}, c^{(t)}, p_0^{(t)}) solution to the base station and observe new states s_n^{(t+1)} and rewards f_n^{(t+1)}, ∀n.
14:    Obtain and store experience e_n^{(t)}, ∀n, in the corresponding agent's n queue.
15:    Create a minibatch D^{(t)} and calculate ∇_{θ_q} L(θ_q^{(t)}).
16:    Update the DQN's parameters θ_q^{(t+1)} based on Eq. (19).
17:    Set s_n^{(t)} ← s_n^{(t+1)}, ∀n.
18:  end for
19: end for

Algorithm 2 REINFORCE Algorithm
1: Initialize N_e, N_t, η_π.
2: Randomly initialize the DPN's parameters θ_π.
3: for k = 1 to N_e do
4:   Derive initial agents' states s_n^{(1)}, ∀n.
5:   for t = 1 to N_t do
6:     Select action a_n^{(t)} ∈ A_n, ∀n, based on π(a_n | s_n^{(t)}; θ_π).
7:     Set p^{(t)} = [a_1^{(t)}, ..., a_n^{(t)}, ..., a_N^{(t)}] and calculate (c^{(t)}, p_0^{(t)}) by solving problem (12).
8:     Assign the (p^{(t)}, c^{(t)}, p_0^{(t)}) solution to the base station and observe new states s_n^{(t+1)} and rewards f_n^{(t+1)}, ∀n.
9:     Calculate μ_f^{(t)}, σ_f^{(t)}, and f̂_n^{(t)}, ∀n, based on Eq. (23).
10:    Calculate ∇_{θ_π} J(θ_π^{(t)}) using f̂_n^{(t)}, ∀n.
11:    Update the DPN's parameters θ_π^{(t+1)} based on Eq. (22).
12:    Set s_n^{(t)} ← s_n^{(t+1)}, ∀n.
13:  end for
14: end for

C. REINFORCE: A POLICY-BASED ALGORITHM
REINFORCE is a policy-based algorithm that directly generates the stochastic policy π(a|s) using a Deep Policy Network (DPN), with θ_π being the corresponding parameter vector. Therefore, the goal at each time slot t is to derive the parameter vector θ_π^{(t)} that maximizes the agents' expected mean immediate reward, defined as:

J(θ_π) = E\left[ \frac{\sum_{n=1}^{N} f_n^{(t)}}{N} \right].   (21)

Then, the optimal policy π^*(s, a; θ_π) = \arg\max_π J^*(θ_π) is derived, which is applied by each agent to determine its action a^{(t+1)} at the next time slot.

To progressively conclude the parameters θ_π that maximize J, the gradient ascent method is used, such that:

θ_π^{(t+1)} = θ_π^{(t)} + η_π ∇_{θ_π} J(θ_π^{(t)}),   (22)

where η_π ∈ (0, 1] is the corresponding learning rate.

Due to the exploration of the algorithm in the state-action space during the training phase, there is a high probability that the values of the mean immediate rewards J(θ_π) obtained between sequential time slots diverge significantly from each other. This behavior affects the algorithm's performance, resulting in its instability. To circumvent this issue, each agent's reward is normalized:

f̂_n^{(t)} = \frac{f_n^{(t)} - μ_f^{(t)}}{σ_f^{(t)}},   (23)

where μ_f^{(t)} = \frac{\sum_{i=1}^{N} f_i^{(t)}}{N} and σ_f^{(t)} = \sqrt{\frac{\sum_{i=1}^{N} (f_i^{(t)} - μ_f^{(t)})^2}{N}} represent the mean value and the dispersion of the agents' rewards at time slot t. The proposed REINFORCE algorithm is outlined in Algorithm 2.
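The policy-gradient step of Eqs. (21)-(23) can be sketched in PyTorch as follows: the shared Deep Policy Network outputs a distribution over the A_n discrete power levels, the per-agent rewards are z-score normalized as in Eq. (23), and the parameters are moved along the gradient of the normalized-reward-weighted log-probabilities, which performs the ascent of Eq. (22). The network shape and all numbers are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 10
dpn = nn.Sequential(nn.Linear(STATE_DIM, 200), nn.ReLU(),
                    nn.Linear(200, 100), nn.ReLU(),
                    nn.Linear(100, 40), nn.ReLU(),
                    nn.Linear(40, N_ACTIONS))
optimizer = torch.optim.SGD(dpn.parameters(), lr=1e-3)   # eta_pi in Eq. (22)

def act(states):
    """Sample one action per agent from pi(a|s; theta_pi)."""
    with torch.no_grad():
        return torch.distributions.Categorical(logits=dpn(states)).sample()

def reinforce_step(states, actions, rewards):
    """Policy-gradient update from one time slot of agent experience."""
    log_prob = torch.distributions.Categorical(logits=dpn(states)).log_prob(actions)
    # Eq. (23): normalize the agents' rewards for this time slot.
    f_hat = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
    loss = -(f_hat * log_prob).mean()     # ascending the objective J of Eq. (21)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # Eq. (22), via minimizing -J

s = torch.randn(4, STATE_DIM)             # illustrative states of N = 4 agents
a = act(s)
reinforce_step(s, a, torch.rand(4))
```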
IV. SUM-RATE MAXIMIZATION BENCHMARK
In this section, we extend our proposed DRL framework, analyzed in detail in Section III, to account for an alternative objective, namely the sum-rate maximization in the considered downlink RSMA-based communication network. On the one hand, we aim to corroborate the applicability, effectiveness, and efficiency of the devised DRL framework under different optimization objectives, given that the problem of sum-rate maximization has not been treated similarly by the literature so far. On the other hand, we seek to macroscopically identify and promote the significance of targeting energy efficiency, resulting in a better tradeoff between resource utilization, system performance, and algorithmic complexity.

The formal representation of the corresponding sum-rate maximization problem toward optimizing the vectors of allocated common-stream rates c^{(t)} = [c_1^{(t)}, ..., c_n^{(t)}, ..., c_N^{(t)}]^T, and the private and common-stream transmission powers p^{(t)} = [p_1^{(t)}, ..., p_n^{(t)}, ..., p_N^{(t)}]^T and p_0^{(t)}, is as follows:

\max_{c^{(t)}, p^{(t)}, p_0^{(t)}}  \sum_{n=1}^{N} R_n^{(t)}   (24a)
s.t.  \sum_{n=1}^{N} c_n^{(t)} ≤ r_1^{c(t)},   (24b)
G_1^{(t)} p_0^{(t)} - \left( G_1^{(t)} \sum_{n=1}^{N} p_n^{(t)} + σ^2 \right) ≥ p_{tol},   (24c)
p_0^{(t)} + \sum_{n=1}^{N} p_n^{(t)} ≤ p_{max},   (24d)
c_n^{(t)}, p_n^{(t)} ≥ 0, ∀n, and p_0^{(t)} ≥ 0.   (24e)

The definition of problem (24) is in accordance with its energy efficiency counterpart, and a similar approach with Section III-A can be followed for its transformation into a multi-agent DRL scenario. Each private stream v_n^{(t)} of the downlink transmitted signal constitutes a different agent whose description of the local state s_n^{(t)} comprises the eight components analyzed in Section III-A. Each agent autonomously chooses an action a_n^{(t)} ∈ A_n from the set of possible actions A_n in Eq. (11) after evaluating its state. Based on the agents' chosen actions, the values of (c^{(t)}, p_0^{(t)}) that maximize the sum rate can be obtained by setting p_0^{(t)} = p_{max} - \sum_{n=1}^{N} p_n^{(t)} and solving the following linear programming problem:

\max_{c_n^{(t)} ≥ 0, ∀n}  \sum_{n=1}^{N} c_n^{(t)}   (25a)
s.t.  \sum_{n=1}^{N} c_n^{(t)} ≤ r_1^{c(t)}.   (25b)

It should be noted that the common stream does not interfere with the private streams and, thus, the allocation of all available power, i.e., p_{max} - \sum_{n=1}^{N} p_n^{(t)}, to the common stream maximizes the sum rate [6]. This observation can be easily derived by closely examining Eq. (5) and (6).

Last, to target the system's sum-rate maximization, the reward feedback signals provided to the agents should be redefined accordingly. Following a similar rationale with the one in Section III-A, if constraint (10c) is satisfied, the reward f_n^{(t+1)} provided to agent n at time slot t+1 about the action a_n^{(t)} chosen at the previous time slot t is captured by its normalized achieved data rate, i.e.,

f_n^{(t+1)} = \frac{R_n^{(t)}}{N},   (26)

whereas, in case of constraint violation, the reward is:

f_n^{(t+1)} = \frac{R_n^{(t)}}{N} \cdot \left( 1 + \tanh\left( p_0^{(t)} - \sum_{j=1}^{N} p_j^{(t)} - \frac{p_{tol} + σ^2}{G_1^{(t)}} \right) \right).   (27)

The physical meaning and interpretation of the designed reward are identical with Eq. (13) and (14) described earlier. Subsequently, the proposed DRL framework based on the value-based DQL algorithm or the policy-based REINFORCE alternative can be directly applied to render a solution to the sum-rate maximization problem.
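A minimal sketch of the sum-rate variant follows: the leftover power budget is assigned to the common stream, the common-rate sum is set to the weakest user's common-stream rate (the optimum of the linear program (25)), and the reward of Eqs. (26)-(27) is returned. The equal split of the common rate among users and all numerical values are illustrative assumptions.

```python
import numpy as np

def sumrate_allocation(G, a, p_max, sigma2):
    """Common-stream allocation for the sum-rate benchmark of Section IV."""
    G, a = np.sort(np.asarray(G)), np.asarray(a)
    p0 = p_max - a.sum()                  # all remaining power to the common stream
    r1c = np.log2(1.0 + G[0] * p0 / (G[0] * a.sum() + sigma2))
    c = np.full(len(a), r1c / len(a))     # any split with sum(c) = r_1^c solves (25)
    return c, p0

def sumrate_reward(R_n, N, p0, p, G1, p_tol, sigma2):
    """Per-agent reward of Eqs. (26)-(27), mirroring Eqs. (13)-(14)."""
    slack = p0 - np.sum(p) - (p_tol + sigma2) / G1
    base = R_n / N                        # Eq. (26)
    return base if slack >= 0 else base * (1.0 + np.tanh(slack))   # Eq. (27)

c, p0 = sumrate_allocation(G=[2e-10, 5e-10, 1e-9, 3e-9],
                           a=[0.02, 0.02, 0.03, 0.03], p_max=0.3, sigma2=1e-12)
print(c, p0)
```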
TABLE 1. Simulation parameters.

V. EVALUATION & RESULTS
In this section, the performance of the proposed DRL framework for energy-efficient power and rate allocation in the downlink of single-cell single-antenna RSMA networks is evaluated via modeling and simulation. Throughout our experiments, we consider N = 4 users randomly spatially distributed with minimum and maximum distances from the base station set as 10 m and 500 m, respectively. The channel gain between the users and the base station is calculated considering the log-distance path loss model PL = 120.9 + 37.6 log(d), with d measured in km, and log-normal shadowing with standard deviation equal to 8 dB [6]. The maximum Doppler frequency is f_d = 10 Hz and the time slot duration is T = 20 ms [15]. The rest of the communication-related parameters are summarized in Table 1.

Considering the definition of the action space in the multi-agent DRL problem, a number of A_n = 10, ∀n, and P_0 = 100 discrete power levels for the private and common streams is considered unless otherwise explicitly stated. The structure of the neural networks used as part of the DQL and REINFORCE algorithms is similar and is as follows. A feedforward neural network with 3 hidden layers is chosen, having 200, 100, and 40 neurons, respectively. The input layer has 8 neurons, i.e., one neuron for each state feature, while the output layer has A_n neurons, equal to the number of power levels of the private streams. The Rectified Linear Unit (ReLU) is chosen as the activation function, while the specific values used for the DQL and REINFORCE algorithms' hyper-parameters are listed in Table 1. A comprehensive numerical analysis is included in the following, justifying the selection of the latter values.

To characterize the effectiveness of the proposed DRL algorithms in concluding a solution under both optimization objectives, two heuristic approaches from the literature are also considered and simulated. First, a heuristic algorithm to solve the energy-efficient power and rate allocation is used as a benchmark, where the decoupling of the joint problem into distinct subproblems is performed. The respective algorithm is presented in [10] and is referred to as "Heuristic" henceforth. Furthermore, regarding the sum-rate maximization objective, a modified version of the Weighted Minimum Mean Square Error (WMMSE) [27] algorithm is used to solve the power allocation problem and, then, determine the rate splitting for the RSMA network. The latter benchmarking heuristic is denoted as "WMMSE".

In the sequel, the plotted values of the energy efficiency and sum-rate metrics have been normalized with the number of users in the system to capture the average achieved energy efficiency and rate per user. This representation serves the purpose of accurately reflecting the performance of the system under the specific number of served users. To ensure reasonable system performance, we consider as successful and valid those network and algorithm settings that allow each user to achieve at least 1 Mbps/Hz downlink data rate [6].

A. DRL ALGORITHMS' HYPER-PARAMETER ANALYSIS
First, we perform a numerical analysis over different values of the DRL algorithms' hyper-parameters, reflecting their impact on the algorithms' behavior over the training episodes. The obtained results are indicatively presented for the energy efficiency optimization objective, while similar observations can be rendered considering the sum-rate maximization of the system. In Fig. 3(a) and 3(b), the achieved energy efficiency is illustrated as a function of the training episodes for different values of the DQL and REINFORCE algorithms' learning rates η_q and η_π, respectively. In more detail, the learning rate controls the adjustment level of the parameter vector, i.e., the neural network's weights, in response to the estimated error at each time slot. As a consequence, large values of the learning rate, i.e., η_q = η_π = 10^-1, result in suboptimal solutions, whereas smaller values, i.e., η_q = η_π = 10^-5, 10^-6, may prevent optimization and cause the algorithms' training to get stuck.

FIGURE 3. Average energy efficiency per user under the (a) DQL and (b) REINFORCE algorithms for different values of the learning rate when targeting energy efficiency maximization.

There is a turning point where optimal performance in the achieved energy efficiency can be reached for both the DQL and REINFORCE algorithms. The DQL algorithm performs best for η_q = 10^-2, 10^-3, 10^-4, whereas η_π = 10^-3, 10^-4 are the values of the learning rate parameter yielding the best performance for the REINFORCE algorithm. Under these particular values, for which training is performed successfully, both algorithms present stable performance and reach almost identical energy efficiency levels. However, the REINFORCE algorithm requires fewer episodes to conclude, exhibiting stable performance from the very beginning. Concluding, based on the results of Fig. 3(a) and 3(b), the learning rate parameters are set equal to η_q = 10^-2 and η_π = 10^-3 for the rest of the simulation experiments.

FIGURE 4. Average energy efficiency per user under the DQL algorithm for different values of the minibatch size when targeting energy efficiency maximization.

Especially with reference to the DQL algorithm, the hyper-parameter related to the size of the minibatch of experiences used as input to the DQN should be additionally configured. For this purpose, different values of the minibatch size D are scrutinized, and the performance of the DQL algorithm in the achieved energy efficiency is observed over the training episodes. The results are presented in Fig. 4, where a similar tradeoff between small and large values of the minibatch size hyper-parameter is depicted. A minibatch with inadequate experience samples, i.e., D = 50, 100, may cause the trained model to converge to a local maximum, whereas a large minibatch size, i.e., D = 700, 1000, may have the opposite effect and result in the DQN's overtraining during the very first episodes. This prohibits the DQN from learning actions from experiences gained at later episodes, yielding solutions of lower achieved energy efficiency compared to the optimal hyper-parameter setting. The latter optimal setting is found for a minibatch size of D = 500 experience samples, officially selected for our experiments.
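Before examining the effect of the number of power levels A_n, the geometric discretization of Eq. (11) that generates each agent's action set can be sketched as follows; the minimum power level and the power budget used here are illustrative values, not the paper's settings.

```python
import numpy as np

def action_set(p_max, N, A_n, p_n_min=1e-4):
    """Discrete power levels of Eq. (11): zero, then a geometric grid from
    p_n_min up to p_n_max = p_max / (N + 1), for a total of A_n levels."""
    p_n_max = p_max / (N + 1)
    ratio = (p_n_max / p_n_min) ** (1.0 / (A_n - 2))
    levels = p_n_min * ratio ** np.arange(A_n - 1)   # p_n_min, ..., p_n_max
    return np.concatenate(([0.0], levels))

for A_n in (5, 10, 15, 20):                          # illustrative values of A_n
    print(A_n, action_set(p_max=0.3, N=4, A_n=A_n))
```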
Apart from properly configuring the DRL algorithms, the design of the state-action space is crucial for the solution outcome. In this context, controlling the size of the agents' action space is also performed numerically to strike a balance between achieved energy efficiency and algorithmic complexity.

FIGURE 5. Average energy efficiency per user under the DQL and REINFORCE algorithms for different numbers of power levels when targeting energy efficiency maximization.

Fig. 5 illustrates the achieved energy efficiency under different numbers of power levels A_n for the private streams for both DQL and REINFORCE algorithms. The results reveal that there exists an "optimal" number of power levels where the tradeoff between exploring different actions and complexity in the exploration is optimal for both algorithms, which in our case is A_n = 10, ∀n, as used in the experiments overall.

Concluding, the trained DRL models follow the configuration that resulted from the hyper-parameter analysis so far. The results presented for both DQL and REINFORCE algorithms from this point and on correspond to the average energy efficiency (and rate accordingly) given as output from the trained deep model over N_e = 100 randomly simulated episodes, comprising N_t = 500 time slots each.

B. SCALABILITY ANALYSIS
Subsequently, we conduct a scalability analysis considering an increasing number of users in the cell, aiming to evaluate the performance of the proposed DRL framework as the network size increases while comparing at the same time against the "Heuristic" and "WMMSE" approaches. The range considered regarding the number of users is N = [2, 7], in alignment with good practices followed in the existing literature of RSMA, e.g., [6], [8]. It should be noted that the common and private streams transmitted by the base station to the users are multiplexed over the same frequency resources, resulting in interference between them, as expressed in Eq. (5) and Eq. (8), respectively. Therefore, for the interference not to become unbearable, an upper bound in the number of users sharing the same frequency band is considered in the literature, equal to N = 7. In case more users should be considered in the simulation topology, then the same problem with the proposed one will be solved independently for different clusters of users that operate over a different frequency band.

FIGURE 6. Average (a) energy efficiency and (b) rate per user under the DQL, REINFORCE, "Heuristic", and "WMMSE" approaches for different numbers of users when targeting (a) energy efficiency and (b) sum-rate maximization.

Fig. 6(a) demonstrates the achieved energy efficiency per user for different numbers of users in the horizontal axis when targeting the energy efficiency maximization of the system. As expected, the results present a decaying trend under all approaches and algorithms as the number of users gets higher due to the increased interference and the total transmission power required in the downlink by the base station. A significant gap is shown between the DRL-based algorithms and the "Heuristic" approach for a small number of users transmitting over the same frequency band, i.e., N = 2, 3. For larger values of N, when the system is congested and constrained, the DRL algorithms and the "Heuristic" perform closely. Especially for N = 6, 7, the majority (if not all) of the comparative scenarios are unable to conclude a solution that provides at least 1 Mbps per user. For this reason, their achieved energy efficiency value is set equal to 0. The latter justifies that the number of users sharing the same frequency resource cannot be arbitrarily increased. Fig. 6(b) depicts the achieved average rate per user for different numbers of users when seeking the sum-rate maximization. In this simulation case, it is remarkable that the "WMMSE" approach fails to conclude a solution that secures a data rate higher than 1 Mbps for each user for N ≥ 4, contrariwise to the proposed DRL algorithms that can provide an effective resource allocation solution for at least five users under the same frequency band. In this way, the power of DRL to explore a vast state-action space is further demonstrated.

The outcome of the scalability analysis so far is that DRL is more successful in deriving an energy-efficient power and rate allocation in RSMA networks than a heuristic approach under both optimization objectives.
In the following, we also measure the resulting testing time, i.e., the execution time of the resource allocation procedure based on the pre-trained deep neural network over the testing dataset that includes simulated channel gain distributions of the users that are different from the ones used during pre-training. The obtained numerical results are listed in Table 2.

TABLE 2. Resulting testing time under the DQL, REINFORCE, "Heuristic", and "WMMSE" approaches for different numbers of users.

The results reveal that the two DRL algorithms behave similarly in the resulting testing time. However, both of them outperform the "Heuristic" approach, whose mean execution time is 96.15 sec under the energy efficiency optimization objective. On the other hand, although the "WMMSE" approach proves to be significantly faster than the DRL algorithms during their testing, its ability to conclude a solution is limited and restricted to a very small number of users. Note that the cells missing numerical values refer to the specific simulation cases with N = 6, 7, where a minimum acceptable rate of 1 Mbps for each user cannot be secured by some of the different comparative algorithms and approaches.

Our scalability analysis is complemented by a comparison against the well-known Q-Learning algorithm [4], which allows for further justifying the need for solutions based on deep neural networks to tackle optimization problems of the scale and complexity of the examined one. Based on the Q-Learning algorithm, the optimal Q-function is derived after exhaustive exploration and calculation of its value for the different state-action pairs, contrary to the proposed DQL algorithm that employs a deep neural network to perform function approximation. The calculated value of the Q-function for each state-action pair is stored in a lookup table, i.e., the Q-table. For the implementation of the Q-Learning algorithm, the modeling of the reward function and the discrete action space in Section III-A are kept unchanged, while the only differentiation lies in the design of the state space that is discretized to facilitate the construction of the Q-table. Directly discretizing the state space of our proposed DRL framework that comprises eight distinct components (see Section III-A) leads to the creation of a huge Q-table. For this reason, inspired by the majority of Q-Learning applications in wireless networks from the literature, we consider that an agent's n state is completely captured by its channel gain G_n^{(t)} at a particular time slot t, i.e., s_n^{(t)} = G_n^{(t)} [4]. The agent's state, i.e., channel gain, is further quantized into 10 value ranges, each of which creates a separate row in the Q-table, while the discrete actions form different columns.

TABLE 3. Average energy efficiency per user and resulting training time under the DQL, REINFORCE, and Q-Learning approaches for different numbers of users.

Table 3 includes the obtained numerical results regarding the achieved energy efficiency and resulting training time. The training time of the DQL, REINFORCE, and Q-Learning algorithms has been measured considering 4000, 200, and 100 episodes, respectively, where convergence is reached. Also, a small number of users N has been considered owing to the inherent difficulty of constructing a Q-table of all combinations of state-action pairs for all users in the system. Despite the small scale of the simulated system, the Q-Learning algorithm still concludes a resource allocation solution of notably low energy efficiency, i.e., approximately 49 times lower when N = 2 and 26 times lower when N = 3 compared to the DRL algorithms.
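For completeness, the tabular baseline described above can be sketched as follows: the scalar channel-gain state is quantized into 10 bins forming the Q-table rows, the discrete power levels form the columns, and the standard Q-Learning update is applied. The bin edges, learning rate, and discount factor are illustrative choices rather than the paper's configuration.

```python
import numpy as np

N_BINS, N_ACTIONS = 10, 10
rng = np.random.default_rng(0)
Q = np.zeros((N_BINS, N_ACTIONS))                 # Q-table: rows = channel-gain bins
bin_edges = np.logspace(-12, -8, N_BINS - 1)      # illustrative gain quantization

def state_index(G_n):
    return int(np.digitize(G_n, bin_edges))       # s_n^(t) = quantized G_n^(t)

def q_update(G, a, f, G_next, alpha=0.1, gamma=0.5):
    """Standard Q-Learning update for one agent transition."""
    s, s_next = state_index(G), state_index(G_next)
    Q[s, a] += alpha * (f + gamma * Q[s_next].max() - Q[s, a])

def epsilon_greedy(G, eps):
    s = state_index(G)
    return rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())

a = epsilon_greedy(G=3e-10, eps=0.3)
q_update(G=3e-10, a=a, f=0.8, G_next=5e-10)
```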
C. NETWORK OPTIMIZATION OBJECTIVES ANALYSIS
To gain more insight into the impact of energy efficiency optimization on the overall network's performance, we proceed to a comparative examination between the performance of the proposed DRL framework under (a) energy efficiency and (b) sum-rate maximization objectives. In particular, Fig. 7(a) and 7(b) demonstrate the achieved values under both metrics when (a) energy efficiency and (b) sum-rate maximization is targeted, respectively. To render this comparison even more plausible, we also account for different values of the base station's maximum power budget within the range p_max = [20, 40] dBm, characterizing its total maximum emitted transmission power in the downlink at each simulation scenario. Note that for each 5 dBm increment of p_max, we increase the number of power levels A_n concluded from Fig. 5 by five, to fairly maintain the sensitivity of exploration within the action space A_n, ∀n. The number of users considered in this simulation case is N = 4.

FIGURE 7. Average energy efficiency and rate per user under the DQL and REINFORCE algorithms for different values of the base station's maximum power budget pmax when targeting (a) energy efficiency and (b) sum-rate maximization.

Under the energy efficiency objective, both DRL algorithms "stick" to the pursued minimum data rate requirement for each user (see right part of Fig. 7(a)) and target to maximize the achieved energy efficiency without necessarily spending the total amount of power p_max available. Apparently, there exists a turning point regarding the available maximum power budget and the resulting number of power levels, where the DRL algorithms find the best solution to the problem. Specifically, both algorithms manage to achieve a maximum energy efficiency level approximately equal to 46.5 bits/J/Hz, as shown in the left part of Fig. 7(a), and coincide in that this is found for p_max = 25 dBm. Regarding the sum-rate maximization, the two designed DRL algorithms exhibit identical performance (see Fig. 7(b)). Higher values of the parameter p_max allow for achieving higher user data rates (right part of Fig. 7(b)) while decreasing the corresponding energy efficiency of the system (left part of Fig. 7(b)). To be more specific, when p_max is increased from 20 dBm to 40 dBm, a small increment of two times is observed in the user data rate due to the higher interference sensed by the users, which, in conjunction with the higher sum of transmission powers in the denominator of the energy efficiency function, rapidly decreases the energy efficiency by almost 15 times. Furthermore, closely inspecting the right parts of Fig. 7(a) and 7(b), it can be easily seen that for an average rate equal to 1.5 bps/Hz per user, the concluded energy efficiency under the sum-rate maximization objective is 15 bits/J/Hz, whereas a value close to 46.5 bits/J/Hz could be achieved if pursuing the energy efficiency maximization, following the results of Fig. 7(a). Interestingly, this comes with the cost of 31 times lower achieved energy efficiency when myopically targeting the system's sum-rate maximization, highlighting the need to focus on energy-efficient resource allocation approaches.

VI. CONCLUSION AND FUTURE WORK
In this paper, the problem of energy efficiency maximization was investigated in a single-cell single-antenna RSMA network. Specifically, the joint power and rate allocation of the common and private messages transmitted in the downlink of the RSMA network was designed to maximize the system's energy efficiency. To manage such a combinatorial problem, a multi-agent DRL modeling was proposed, according to which the DRL agents were mapped to the private streams that explore the wireless network via their actions, i.e., private-stream power allocations. The DRL agents contribute their gained experiences to training a common neural network, at which point two different DRL algorithms were properly configured and utilized. The first DRL algorithm regarded the value-based DQL, while the second corresponded to the policy-based REINFORCE. The output of the respective DRL algorithm, which is the optimal private-stream power allocations of the DRL agents, was then used as input to a linear programming problem that directly derived the common-stream power and rate allocations for the considered network setting. The same multi-agent DRL modeling, architecture, and algorithms were also evaluated under a different network optimization objective, namely the sum-rate maximization of the considered RSMA network. The proposed DRL framework was shown to adapt well to both optimization settings and to conclude solutions that are closer to optimal when compared against existing approaches and algorithms from the literature.

Our current and future work focuses on the design and testing of actor-critic-based algorithms over the same network setup. Furthermore, the extension of the networking setting to account for multiple antenna transmissions will be targeted by adapting both the multi-agent DRL modeling and architecture, as well as the employed DRL algorithms.

REFERENCES
[1] H. Joudeh and B. Clerckx, "Robust transmission in downlink multiuser MISO systems: A rate-splitting approach," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6227-6242, Dec. 2016.
[2] B. Clerckx, Y. Mao, R. Schober, and H. V. Poor, "Rate-splitting unifying SDMA, OMA, NOMA, and multicasting in MISO broadcast channel: A simple two-user rate analysis," IEEE Wireless Commun. Lett., vol. 9, no. 3, pp. 349-353, Mar. 2020.
[3] Y. Mao, O. Dizdar, B. Clerckx, R. Schober, P. Popovski, and H. V. Poor, "Rate-splitting multiple access: Fundamentals, survey, and future research trends," IEEE Commun. Surveys Tuts., vol. 24, no. 4, pp. 2073-2126, 4th Quart., 2022.
[4] N. C. Luong et al., "Applications of deep reinforcement learning in communications and networking: A survey," IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3133-3174, 4th Quart., 2019.
[5] T. Han and K. Kobayashi, "A new achievable rate region for the interference channel," IEEE Trans. Inf. Theory, vol. 27, no. 1, pp. 49-60, Jan. 1981.
[6] Z. Yang, M. Chen, W. Saad, and M. Shikh-Bahaei, "Optimization of rate allocation and power control for rate splitting multiple access (RSMA)," IEEE Trans. Commun., vol. 69, no. 9, pp. 5988-6002, Sep. 2021.
[7] H. Xia, Y. Mao, B. Clerckx, X. Zhou, S. Han, and C. Li, "Weighted sum-rate maximization for rate-splitting multiple access based secure communication," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), 2022, pp. 19-24.
2408 VOLUME 4, 2023


MARIA DIAMANTI (Member, IEEE) received the Diploma degree in electrical and computer engineering from the Aristotle University of Thessaloniki in 2018. She is currently pursuing the Ph.D. degree with the School of Electrical and Computer Engineering, National Technical University of Athens, where she is also a Research Assistant. Her research interests lie in the areas of 5G/6G wireless networks, resource management and optimization, game theory, contract theory, and reinforcement learning.

GEORGIOS KAPSALIS received the Diploma degree in electrical and computer engineering from the National Technical University of Athens in 2022. His Diploma thesis focused on the topic of resource allocation in rate-splitting multiple access networks with the use of optimization and reinforcement learning techniques. His overall research interests lie in the broader area of resource optimization in 5G/6G wireless communications systems.

EIRINI ELENI TSIROPOULOU (Senior Member, IEEE) is currently an Associate Professor with the Department of Electrical and Computer Engineering, University of New Mexico. Her main research interests lie in the area of cyber–physical social systems and wireless heterogeneous networks, with emphasis on network modeling and optimization, resource orchestration in interdependent systems, reinforcement learning, game theory, network economics, and Internet of Things. Four of her papers received the Best Paper Award at IEEE WCNC in 2012, ADHOCNETS in 2015, IEEE/IFIP WMNC 2019, and INFOCOM 2019 by the IEEE ComSoc Technical Committee on Communications Systems Integration and Modeling. She was selected by the IEEE Communication Society—N2Women—as one of the top ten Rising Stars of 2017 in the communications and networking field. She received the NSF CRII Award in 2019 and the Early Career Award by the IEEE Communications Society Internet Technical Committee in 2019.

SYMEON PAPAVASSILIOU (Senior Member, IEEE) is currently a Professor with the School of ECE, National Technical University of Athens. From 1995 to 1999, he was a Senior Technical Staff Member with AT&T Laboratories, Middletown, NJ, USA. In August 1999, he joined the ECE Department, New Jersey Institute of Technology, USA, where he was an Associate Professor until 2004. He has an established record of publications in his field of expertise, with more than 400 technical journal and conference published papers. His main research interests lie in the area of computer communication networks, with emphasis on the analysis, optimization, and performance evaluation of mobile and distributed systems, wireless networks, and complex systems. He received the Best Paper Award in IEEE INFOCOM 94, the AT&T Division Recognition and Achievement Award in 1997, the U.S. National Science Foundation Career Award in 2003, the Best Paper Award in IEEE WCNC 2012, the Excellence in Research Grant in Greece in 2012, the Best Paper Awards in ADHOCNETS 2015, ICT 2016 and IEEE/IFIP WMNC 2019, IEEE Globecom 2022, as well as the 2019 IEEE ComSoc Technical Committee on Communications Systems Integration and Modeling Best Paper Award (for his INFOCOM 2019 paper).