Energy-Efficient Rate-Splitting Multiple Access: A Deep Reinforcement Learning-Based Framework
ABSTRACT Rate-Splitting Multiple Access (RSMA) has been recognized as an effective technique to reconcile the tradeoff between decoding interference and treating interference as noise in 6G and beyond networks. In this paper, in line with the need for network sustainability, we study the energy-efficient power and rate allocation of the common and private messages transmitted in the downlink of a single-cell single-antenna RSMA network. Contrary to the literature that resorts to heuristic approaches to deal with the joint problem, we transform the formulated energy efficiency maximization problem into a multi-agent Deep Reinforcement Learning (DRL) problem, in which each transmitted private message is represented by a different DRL agent. Each agent explores its own state-action space, whose size is fixed and independent of the number of agents, and shares the experience gained through exploration with a common neural network. Two DRL algorithms, namely the value-based Deep Q-Learning (DQL) and the policy-based REINFORCE, are properly configured and utilized to solve the resulting problem. The adaptation of the proposed DRL framework to the considered network's sum-rate maximization objective is also demonstrated. Numerical results obtained via modeling and simulation verify the effectiveness of the proposed DRL framework in concluding a solution to the joint problem under both optimization objectives, outperforming existing heuristic approaches and algorithms from the literature.
INDEX TERMS Energy efficiency maximization, rate-splitting multiple access (RSMA), deep reinforce-
ment learning (DRL).
systems, resulting in highly non-convex and combinatorial optimization problems that are difficult to solve optimally using conventional optimization techniques [3]. Moreover, the network complexity, in terms of the number of wireless connections, calls for robust optimization techniques that can scale well and dynamically adapt to the environment. Deep Reinforcement Learning (DRL) has been broadly considered in communications and networking to handle the complexity, scalability, and autonomicity issues therein [4]. Leveraging the power of deep neural networks, DRL algorithms explore a vast state-action space and conclude near-optimal solutions to non-convex problems, while allowing the network's self-adaptation based on the trained model.

In this article, we target energy efficiency maximization in a single-antenna RSMA-based wireless network. To the best of the authors' knowledge, this is the first time in the literature that a DRL-based framework is designed and proposed for the energy-efficient power and rate allocation of the common and private messages transmitted in the downlink. The optimization problem is transformed into a multi-agent DRL problem, such that each agent autonomously explores its own state-action space and contributes its gained experience to a commonly trained neural network. Two different DRL algorithms are properly configured and utilized to solve it, namely the value-based Deep Q-Learning (DQL) and the policy-based REINFORCE algorithm. The algorithms are evaluated in terms of their effectiveness in determining a solution to the problem by comparison against existing heuristic approaches from the literature. Complementary to this, and to better reveal the benefits and tradeoffs of the obtained solution when aiming at energy efficiency, we also analyze and assess the proposed framework under the objective of sum-rate maximization in the same network setting, which is again a problem that has not been similarly targeted in the literature so far.

A. RELATED WORK
RSMA provides a generalization of several existing orthogonal and non-orthogonal multiple access techniques, leading to superior performance in terms of achieved throughput and spectral efficiency, as has been theoretically proved for two-user Single-Input Single-Output (SISO) [5] and Multiple-Input Single-Output (MISO) [2] broadcast channels. The existence of such theoretical analyses has lately provoked active research around RSMA, with an emphasis on resource allocation under various network settings. In [6] and [7], the sum-rate and weighted sum-rate maximization in the downlink of multi-user SISO and MISO systems are targeted, respectively, by jointly performing power control/precoder design and rate allocation. Other works, e.g., [8], [9], are devoted to achieving a tradeoff between energy and spectral efficiency in downlink single-cell and multi-cell MISO systems. The aforementioned tradeoff is formulated as a multi-objective optimization problem that is either approximated by the weighted sum of the two contradicting objectives [8] or decomposed into two subproblems solved iteratively [9]. Subsequently, the method of Successive Convex Approximation (SCA) is used to convexify the resulting problems and obtain a solution.

Toward accounting for sustainability and not restricting the resource allocation procedure to achieving high data rates, a different line of research pursues the maximization of the studied system's energy efficiency while potentially ensuring some minimum rate requirements, e.g., [10], [11], [12], [13], [14]. Similar to the above, the joint power control/beamforming and common-rate allocation constitute fundamental problems studied in SISO [10] and MISO [11] broadcast channels under the energy efficiency optimization objective. Both [10] and [11] conclude with suboptimal solutions, in contrast to [12], which, under a MISO setting similar to [11], manages to obtain a globally optimal solution based on the Successive Incumbent Transcending (SIT) Branch and Bound (BB) algorithm. Continuing with more complex network settings, the authors in [13] and [14] investigate the application of the RSMA technique in a Cloud Radio Access Network (C-RAN) and a Reconfigurable Intelligent Surface (RIS)-assisted network, respectively. In the former, the typical power control and rate allocation problem is addressed toward energy efficiency maximization, subject to the additional per-base-station transmission power and common fronthaul links' capacity constraints, whereas, in the latter, the RIS's phase-shift optimization is additionally considered.

The overwhelming majority of research works in the field of RSMA network optimization have relied on model-oriented and heuristic algorithms that (i) conclude suboptimal solutions, (ii) are characterized by high computational complexity as the network scales, and (iii) prohibit adaptability to the network's unpredictable changes. To tackle these challenges, the application of DRL algorithms is becoming increasingly popular. The works in [15], [16], [17], [18] provide representative examples of DRL algorithms successfully implemented to solve optimization problems in various communication environments. In [15] and [16], the power control toward sum-rate maximization is modeled as a multi-agent DRL problem, according to which the transmitter of each wireless link, i.e., agent, autonomously executes its action in selecting an appropriate transmission power level based on a commonly trained neural network, a paradigm referred to as "centralized training and distributed execution" in the literature. Value-based DQL, policy-based REINFORCE, and actor-critic Deep Deterministic Policy Gradient (DDPG) algorithms are then implemented and tested in this context. In [17], the DQL algorithm is used to derive the user pairing in the downlink of a Non-Orthogonal Multiple Access (NOMA) network, while the joint channel selection and power control problem is treated in [18] under both value-based and actor-critic-based DRL algorithm implementations. Both works in [17], [18] consider the sum-rate maximization objective.

Regarding the application of DRL algorithms for resource optimization in RSMA networks, only a handful of research
1) The non-convex energy efficiency maximization problem is converted into a multi-agent DRL problem to capture the problem's objective and constraints and ultimately obtain the joint power and rate solution sought, while modeling each private stream as a different DRL agent.
2) The multi-agent DRL modeling, the adoption of the centralized training and distributed execution paradigm, and the appropriate discretization of the action space – for DRL algorithms' application purposes – result in a computationally scalable, though

The received signal by each user n is:

y_n^{(t)} = G_n^{(t)} p_0^{(t)} v_0^{(t)} + G_n^{(t)} Σ_{j=1}^{N} p_j^{(t)} v_j^{(t)} + z_n^{(t)},   (2)

where G_n^{(t)} denotes the channel gain from the base station to user n and z_n^{(t)} ~ CN(0, σ^2) is the corresponding Additive White Gaussian Noise (AWGN). An overview of a simplified two-user RSMA-based network is presented in Fig. 1.

With reference to the channel gain modeling, in this article, block fading is adopted, such that:

G_n^{(t)} = |h_n^{(t)}|^2 β_n,   (3)

where β_n is the large-scale fading that can remain the same over several time slots, whereas the term h_n^{(t)} represents the small-scale Rayleigh fading. To model the time-varying nature of the channel, Jake's model [24] is used and the small-scale Rayleigh fading is expressed as a first-order Gaussian-Markov process:

h_n^{(t)} = ρ h_n^{(t-1)} + √(1 − ρ^2) ζ_n^{(t)},   (4)

where ζ_n^{(t)} ~ CN(0, 1 − ρ^2) is an independent and identically distributed random variable. The correlation parameter ρ is ρ = J_0(2π f_d T), where J_0 is the zero-order Bessel function, f_d is the maximum Doppler frequency, and T is the time slot over which the correlated channel variation occurs.
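To make the adopted channel model concrete, the following minimal NumPy sketch draws correlated channel gains according to Eqs. (3) and (4). The function name, the default number of users and time slots, and the unit large-scale fading values are illustrative assumptions; a unit-variance innovation scaled by √(1 − ρ^2) is used so that the small-scale fading keeps unit average power.

import numpy as np
from scipy.special import j0  # zero-order Bessel function of the first kind

def simulate_channel_gains(N=4, num_slots=1000, fd=10.0, T=0.02, beta=None, seed=0):
    """Generate G_n^(t) = |h_n^(t)|^2 * beta_n (Eq. (3)) over num_slots time slots,
    with h_n^(t) evolving as the first-order Gaussian-Markov process of Eq. (4)."""
    rng = np.random.default_rng(seed)
    beta = np.ones(N) if beta is None else np.asarray(beta)  # large-scale fading (placeholder values)
    rho = j0(2 * np.pi * fd * T)                              # correlation parameter rho = J0(2*pi*fd*T)
    # initial small-scale Rayleigh fading h_n^(0) ~ CN(0, 1)
    h = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    gains = np.empty((num_slots, N))
    for t in range(num_slots):
        gains[t] = np.abs(h) ** 2 * beta                      # Eq. (3)
        # unit-variance complex Gaussian innovation; the sqrt(1 - rho^2) factor of Eq. (4)
        # then preserves E[|h_n^(t)|^2] = 1 across time slots
        zeta = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
        h = rho * h + np.sqrt(1 - rho ** 2) * zeta            # Eq. (4)
    return gains

# Example usage: gains = simulate_channel_gains(); gains[t, n] gives G_n^(t).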
Following the above, the achievable rate for decoding the common stream v_0^{(t)} transmitted by the base station to user n is calculated as:

r_n^{c(t)} = log_2(1 + G_n^{(t)} p_0^{(t)} / (G_n^{(t)} Σ_{j=1}^{N} p_j^{(t)} + σ^2)) [bps/Hz].   (5)

To guarantee the successful decoding of the common stream v_0^{(t)} by all users n ∈ N, the allocated decoding rates c_n^{(t)} must adhere to the following condition:

Σ_{n=1}^{N} c_n^{(t)} ≤ min_{n∈N} r_n^{c(t)},   (6)

where min_{n∈N} r_n^{c(t)} = r_1^{c(t)}, given the channel gains sorted as G_1^{(t)} ≤ · · · ≤ G_n^{(t)} ≤ · · · ≤ G_N^{(t)}.

Furthermore, to ensure the successful implementation of the Successive Interference Cancellation (SIC) technique at the receiver of each user n, the following condition must be met:

G_n^{(t)} p_0^{(t)} − G_n^{(t)} Σ_{j=1}^{N} p_j^{(t)} ≥ p_tol,   (7)

with p_tol [Watt] indicating the receivers' SIC decoding tolerance/sensitivity that is assumed to be the same for all users.

B. PROBLEM FORMULATION
In this article, the energy efficiency maximization is targeted in the downlink of a single-antenna RSMA-based wireless network, where the energy efficiency is defined as the ratio between the sum of the achievable data rates of all N users in the system, i.e., Σ_{n=1}^{N} R_n^{(t)}, and the total power consumed by the base station, i.e., p_0^{(t)} + Σ_{n=1}^{N} p_n^{(t)}. Toward achieving this objective, the common-stream rates c^{(t)} = [c_1^{(t)}, . . . , c_n^{(t)}, . . . , c_N^{(t)}]^T, the private-stream powers p^{(t)} = [p_1^{(t)}, . . . , p_n^{(t)}, . . . , p_N^{(t)}]^T, and the common-stream power p_0^{(t)} allocated by the base station to the users are optimized. Specifically, the corresponding optimization problem to be solved by the base station is formally written as follows:

max_{c^{(t)}, p^{(t)}, p_0^{(t)}}  EE^{(t)} = Σ_{n=1}^{N} R_n^{(t)} / (p_0^{(t)} + Σ_{n=1}^{N} p_n^{(t)})   (10a)
s.t.  Σ_{n=1}^{N} c_n^{(t)} ≤ r_1^{c(t)},   (10b)
      G_1^{(t)} p_0^{(t)} − (G_1^{(t)} Σ_{n=1}^{N} p_n^{(t)} + σ^2) ≥ p_tol,   (10c)
      p_0^{(t)} + Σ_{n=1}^{N} p_n^{(t)} ≤ p_max,   (10d)
      c_n^{(t)}, p_n^{(t)} ≥ 0, ∀n, and p_0^{(t)} ≥ 0.   (10e)

Eq. (10b) and Eq. (10c) represent the required constraints over the allocated common-stream rates and powers, respectively, for the successful decoding and implementation of the SIC technique at the receivers of the users, as described earlier in Section II-A. Eq. (10d) indicates the base station's maximum power budget p_max [Watt], while Eq. (10e) defines the feasible range of values of the different optimization variables.

III. PROBLEM SOLUTION
In this section, the formulated energy efficiency maximization problem is equivalently transformed into a multi-agent DRL problem to capitalize on the architectural
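For concreteness, the following small NumPy sketch evaluates the quantities defined by Eqs. (5) and (10): the common-stream rates, a feasibility check of constraints (10b)-(10e), and the energy-efficiency objective for a candidate allocation. The helper names are illustrative, and the per-user total rate R_n^{(t)} is assumed to be the allocated common-stream rate c_n^{(t)} plus a private-stream rate obtained by treating the remaining private streams as noise, an expression not reproduced in this excerpt.

import numpy as np

def common_stream_rates(G, p0, p_priv, sigma2):
    """Per-user achievable rate for decoding the common stream, Eq. (5)."""
    return np.log2(1.0 + G * p0 / (G * np.sum(p_priv) + sigma2))

def private_stream_rates(G, p_priv, sigma2):
    # Assumption: after SIC removes the common stream, each private stream is decoded
    # by treating the other private streams as noise (standard single-antenna RSMA form).
    other = np.sum(p_priv) - p_priv
    return np.log2(1.0 + G * p_priv / (G * other + sigma2))

def energy_efficiency(G, c, p0, p_priv, sigma2, p_tol, p_max):
    """Objective (10a) together with a feasibility check of constraints (10b)-(10e)."""
    r_c = common_stream_rates(G, p0, p_priv, sigma2)
    weakest = np.argmin(G)                                    # user 1 after sorting the gains
    feasible = (
        np.sum(c) <= np.min(r_c)                                              # (10b)
        and G[weakest] * p0 - (G[weakest] * np.sum(p_priv) + sigma2) >= p_tol  # (10c)
        and p0 + np.sum(p_priv) <= p_max                                       # (10d)
        and p0 >= 0 and np.all(p_priv >= 0) and np.all(c >= 0)                 # (10e)
    )
    R = c + private_stream_rates(G, p_priv, sigma2)           # assumed per-user total rate R_n
    return np.sum(R) / (p0 + np.sum(p_priv)), feasible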
p_0^{(t)} + Σ_{n=1}^{N} p_n^{(t)} ≤ p_max,   (24d)
c_n^{(t)}, p_n^{(t)} ≥ 0, ∀n, and p_0^{(t)} ≥ 0.   (24e)

The definition of problem (24) is in accordance with its energy efficiency counterpart, and a similar approach to that of Section III-A can be followed for its transformation into a multi-agent DRL scenario. Each private stream v_n^{(t)} of the downlink transmitted signal constitutes a different agent, whose description of the local state s_n^{(t)} comprises the eight components analyzed in Section III-A. Each agent autonomously chooses an action a_n^{(t)} ∈ A_n from the set of possible actions A_n in Eq. (11) after evaluating its state. Based on the agents' chosen actions, the values of (c^{(t)}, p_0^{(t)}) that maximize the sum rate can be obtained by setting p_0^{(t)} = p_max − Σ_{n=1}^{N} p_n^{(t)} and solving the following linear programming problem:

max_{c_n^{(t)} ≥ 0, ∀n}  Σ_{n=1}^{N} c_n^{(t)}   (25a)
s.t.  Σ_{n=1}^{N} c_n^{(t)} ≤ r_1^{c(t)}.   (25b)

It should be noted that the common stream does not interfere with the private streams and, thus, the allocation of all available power, i.e., p_max − Σ_{n=1}^{N} p_n^{(t)}, to the common stream maximizes the sum rate [6]. This observation can be easily derived by closely examining Eq. (5) and (6).

Last, to target the system's sum-rate maximization, the reward feedback signals provided to the agents should be redefined accordingly. Following a similar rationale to the one in Section III-A, if constraint (10c) is satisfied, the reward f_n^{(t+1)} provided to agent n at time slot t + 1 about the action a_n^{(t)} chosen at the previous time slot t is captured by its normalized achieved data rate, i.e.,

f_n^{(t+1)} = R_n^{(t)} / N,   (26)

whereas, in case of the constraint violation, the reward is:

f_n^{(t+1)} = (R_n^{(t)} / N) · (1 + tanh(p_0^{(t)} − Σ_{j=1}^{N} p_j^{(t)} − (p_tol + σ^2) / G_1^{(t)})).   (27)

The physical meaning and interpretation of the designed reward are identical with those of Eq. (13) and (14) described earlier. Subsequently, the proposed DRL framework based on the value-based DQL algorithm or the policy-based REINFORCE alternative can be directly applied to render a solution to the sum-rate maximization problem.

V. EVALUATION & RESULTS
In this section, the performance of the proposed DRL framework for energy-efficient power and rate allocation in the downlink of single-cell single-antenna RSMA networks is evaluated via modeling and simulation. Throughout our experiments, we consider N = 4 users randomly spatially distributed with minimum and maximum distance from the base station set as 10 m and 500 m, respectively. The channel gain between the users and the base station is calculated considering the log-distance path loss model PL = 120.9 + 37.6 log(d), with d measured in km, and log-normal shadowing with standard deviation equal to 8 dB [6]. The maximum Doppler frequency is f_d = 10 Hz and the time slot duration is T = 20 ms [15]. The rest of the communication-related parameters are summarized in Table 1.

TABLE 1. Simulation parameters.

Considering the definition of the action space in the multi-agent DRL problem, a number of A_n = 10, ∀n, and P_0 = 100 discrete power levels for the private and common streams, respectively, is considered unless otherwise explicitly stated. The structure of the neural networks used as part of the DQL and REINFORCE algorithms is similar and is as follows. A feedforward neural network with 3 hidden layers is chosen, having 200, 100, and 40 neurons, respectively. The input layer has 8 neurons, i.e., one neuron for each state feature, while the output layer has A_n neurons, equal to the number of power levels of the private streams. The Rectified Linear Unit (ReLU) is chosen as the activation function, while the specific values used for the DQL and REINFORCE algorithms' hyper-parameters are listed in Table 1. A comprehensive numerical analysis justifying the selection of the latter values is included in the following.

To characterize the effectiveness of the proposed DRL algorithms in concluding a solution under both optimization objectives, two heuristic approaches from the literature are also considered and simulated. First, a heuristic algorithm to solve the energy-efficient power and rate allocation is used as a benchmark, where the decoupling of the joint problem into distinct subproblems is performed. The respective algorithm is presented in [10] and is referred to as "Heuristic" henceforth. Furthermore, regarding the sum-rate maximization objective, a modified version of the Weighted Minimum-Mean Square Error (WMMSE) [27] algorithm is used to solve the power allocation problem and, then,
FIGURE 5. Average energy efficiency per user under the DQL and REINFORCE algorithms for different numbers of power levels when targeting energy efficiency maximization.
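As an illustration of the network structure described in Section V above, the following sketch builds the shared feedforward model; the use of PyTorch and the function name are assumptions of this sketch, and for REINFORCE a softmax over the outputs would yield the action probabilities.

import torch.nn as nn

def build_agent_network(num_state_features=8, num_actions=10):
    """Shared feedforward network: an 8-neuron input layer (one per state feature),
    hidden layers of 200, 100, and 40 neurons with ReLU activations, and A_n outputs,
    one per discrete private-stream power level."""
    return nn.Sequential(
        nn.Linear(num_state_features, 200), nn.ReLU(),
        nn.Linear(200, 100), nn.ReLU(),
        nn.Linear(100, 40), nn.ReLU(),
        nn.Linear(40, num_actions),  # Q-values for DQL, or logits for REINFORCE
    )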