Federated Deep Reinforcement Learning For User Access Control in Open Radio Access Networks
Abstract—The Open Radio Access Network (O-RAN), introducing a particular unit known as the RAN Intelligent Controller (RIC), has been regarded as a revolutionary paradigm to support the multi-class wireless services required in the fifth and sixth generation (5G/6G) networks. By installing various machine learning (ML) algorithms in RICs, a RAN is able to intelligently configure resources and communications to support any vertical application over any operating scenario. However, to practically deploy this RAN paradigm, the O-RAN still suffers from two critical issues, load balancing and handover control, and therefore the very first ML algorithm for the O-RAN should effectively address them. In this paper, inspired by the superior performance of deep reinforcement learning (DRL) in tackling sequential decision-making tasks, we develop an intelligent user access control scheme with the facilitation of deep Q-networks (DQNs). A federated DRL-based scheme is further proposed to train the parameters of multiple DQNs in the O-RAN, so as to maximize the long-term throughput while avoiding frequent user handovers with a limited amount of signaling overhead. The simulation results demonstrate the outstanding performance over the state of the art, serving the urgent needs in the standardization of the O-RAN.

Index Terms—Open radio access networks (O-RANs), user access control, deep reinforcement learning (DRL), federated learning (FL)

This work is supported in part by the National Natural Science Foundation of China under Grants 61631005 and U1801261, the National Key R&D Program of China under Grant 2018YFB180115, the Key Areas of Research and Development Program of Guangdong Province, China, under Grant 2018B010114001, the Fundamental Research Funds for the Central Universities under Grant ZYGX2019Z022, and the Programme of Introducing Talents of Discipline to Universities under Grant B20064.

I. INTRODUCTION

Since the 5G era, supporting wireless services with different levels of data rates, latency, reliability, and mobility has been a mandatory requirement for RANs. The 5G/6G RANs therefore should autonomously find the optimum resource and communication configurations to optimize the performance of any vertical application (e.g., smart manufacturing, unmanned/remote driving, everything reality, etc.) over any operating scenario. For this purpose, the future RANs should cognize the service requirements of vertical applications, the deployment conditions, as well as the characteristics of each user equipment (UE), which may rely on the recent development of ML algorithms. To enable this desired capability, the O-RAN has been regarded as a novel paradigm to evolve current RANs towards intelligence [1]. In the O-RAN, the functions of a RAN are divided into four units [2], including the Radio Unit (RU), Central Unit (CU), Distributed Unit (DU) and RICs, connected by standard interfaces. With the facilitation of RICs, ML algorithms can be installed in a RAN [3] to intelligently perform resource and communication configurations.

Although the O-RAN provides a revolutionary architecture to implement intelligent RANs, user access control (also known as user association), which assigns each UE to a proper serving BS so as to maximize the overall throughput (i.e., the sum of the individual throughputs of the UEs), is still a challenging issue when base stations (BSs) are massively and densely deployed. Traditionally, user access control is performed based on specific metrics such as received signal strength (RSS) and link capacity [4], [5]. However, due to the mobility of UEs, such mechanisms would incur two critical engineering concerns, i.e., frequent handovers and load balancing. To address these two issues, intensive research works have been advocated, which can be categorized into centralized and distributed schemes. For centralized schemes, there exists a centralized controller adopting traditional optimization methods such as convex optimization to obtain the optimal user access policy if global network information is available [6]. For distributed schemes, the access selection of each UE can be formulated as a Markov decision process (MDP), which can be solved by dynamic programming (DP) algorithms if prior knowledge of the state transition probabilities is available [7]. Additionally, game-theoretic schemes such as matching games have also been introduced to construct distributed decision-making mechanisms [8]. However, the mobility of UEs leads to a highly dynamic network environment, in which global network information, and thus the state transition probabilities, are unavailable in practice.

With its superior performance in tackling sequential decision-making tasks, deep reinforcement learning (DRL) has been shown to be an effective technology for developing intelligent algorithms for wireless communication systems [9]. By applying DRL in each UE, the optimum policy can be learned to maximize the long-term throughput, rather than the instantaneous throughput, through continuous interactions with the environment. However, it may take a long time to achieve convergence when multiple UEs operate DRL distributively.
Fig. 1: The general deployment of O-RANs considered in this paper, in which there are N BSs and M UEs. (The figure depicts the RIC connected over the O1/E2 interfaces to BSs, each split into a CU (RRC, PDCP), a DU (RLC, MAC, Higher PHY) and an RU (Lower PHY, RF) linked by the F1-C/F1-U and open fronthaul interfaces, serving moving UEs through downlink transmissions.)

To achieve convergence within a practical duration, a limited amount of information exchange among UEs may be inevitable. Instead of solely exchanging information, the exchanged information can be further “processed” to enhance the performance, which inspires us to adopt the federated learning (FL) technique for training DQNs in a distributed manner [10]. In FL, multiple UEs train their local model parameters independently with their own observations, and a global model server collects and aggregates the local model parameters from the UEs to generate the global model parameters, which are later sent back to the UEs. With the iteratively updated global model parameters, the UEs can derive the optimum policies.

To successfully deploy the O-RAN, the very first intelligent scheme for the O-RAN should effectively address the issue of user access control. In this paper, we therefore develop a federated DRL-based scheme, in which each UE, acting as an agent, makes access decisions independently based on a DQN trained by the federated training algorithm performed on the RIC, as shown in Fig. 1. Additionally, although UEs in physical proximity can be homogeneous in terms of their consensus of the environment, these UEs making decisions independently may have biases in the training phase. To tackle this engineering concern, a dueling structure is introduced to construct the DQN in the proposed scheme, and the parameters of the DQN are decomposed into three parts, i.e., common-part parameters, value-function parameters and advantage-function parameters; only the common-part and value-function parameters are used to perform model aggregation. With the proposed scheme, each UE is able to maximize the overall throughput while avoiding frequent handovers.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model

In this paper, the general deployment of the O-RAN shown in Fig. 1 is considered, in which N BSs are deployed to serve M UEs in downlink transmissions. The BS set and UE set are denoted as B = {0, 1, · · · , N − 1} and U = {0, · · · , M − 1}, respectively. The time-domain resources are divided into time slots of equal length as per the frame structure of 3GPP New Radio (NR) [11]. Adopting orthogonal frequency division multiple access (OFDMA), a spectrum resource pool is composed of K equal-length resource blocks (RBs), denoted as K = {0, · · · , K − 1}, and these K RBs are shared by all the BSs. At each time slot, a single UE is permitted to access only one BS and is allocated a fixed number of RBs (e.g., one as an example) for downlink transmissions. Specifically, u_{i,j}(t) indicates whether UE i chooses to access BS j at time slot t or not, i.e., u_{i,j}(t) = 1 if UE i chooses BS j at time slot t, and u_{i,j}(t) = 0 otherwise. The RB allocation indicator f_{i,k}(t) is further introduced to indicate whether UE i is allocated RB k at time slot t or not. Particularly, if UE i is allocated RB k at time slot t, f_{i,k}(t) = 1, and f_{i,k}(t) = 0 otherwise.

1) Channel Model: The channel gain g^k_{i,j}(t) between UE i and BS j in RB k at time slot t is composed of the large-scale fading component l_{i,j}(t) and the small-scale fading component h^k_{i,j}(t). l_{i,j}(t) is mainly determined by the distance d_{i,j}(t) = sqrt((x_i(t) − x_j(t))^2 + (y_i(t) − y_j(t))^2) between UE i located at (x_i(t), y_i(t)) and BS j located at (x_j(t), y_j(t)), while h^k_{i,j}(t) remains unchanged within one time slot. The Jakes model captures the variation of the small-scale fading between two contiguous time slots, i.e.,

h^k_{i,j}(t) = ρ h^k_{i,j}(t − 1) + δ^k_{i,j}(t),    (1)

where ρ is the coherence coefficient between two contiguous time slots, h^k_{i,j}(0) is a Gaussian random variable with zero mean and unit variance, i.e., h^k_{i,j}(0) ∼ CN(0, 1), and δ^k_{i,j}(t) is a Gaussian random variable with zero mean and a variance of 1 − ρ^2, i.e., δ^k_{i,j}(t) ∼ CN(0, 1 − ρ^2). Hence, the channel gain g^k_{i,j}(t) from BS j to UE i in RB k at time slot t is given by

g^k_{i,j}(t) = l_{i,j}(t) |h^k_{i,j}(t)|^2.    (2)

2) Transmission Rates: If multiple UEs accessing different BSs are allocated the same RB, co-channel interference should be considered. Denoting the transmit power of BS j as P_j, the signal-to-interference-plus-noise ratio (SINR) at UE i in RB k from BS j at time slot t can be expressed as

γ^k_{i,j}(t) = P_j f_{i,k}(t) g^k_{i,j}(t) / ( Σ_{m∈B\{j}} P_m f_{i,k}(t) g^k_{i,m}(t) + σ^2 ),    (3)

where σ^2 is the noise power at UE i. Therefore, the achievable rate of UE i from BS j in RB k at time slot t can be written as

c^k_{i,j}(t) = B log_2(1 + γ^k_{i,j}(t)),    (4)

where B is the bandwidth of each RB. Based on (4), the total transmission rate of UE i at time slot t is the sum of the achievable rates over all the allocated RBs, i.e.,

R_{i,j}(t) = Σ_{k∈K} f_{i,k}(t) c^k_{i,j}(t).    (5)
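To make the channel model concrete, the following Python sketch (an illustration written for this text, not code from the paper) applies the first-order update in (1) and the gain in (2) with numpy; the 120 m distance and the reuse of the 34 + 40 log(d) large-scale model from Table I as a dB loss are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_small_scale(h_prev, rho, rng):
    # One step of (1): h(t) = rho * h(t-1) + delta(t), with delta ~ CN(0, 1 - rho^2)
    # (the complex-Gaussian variance is split evenly over real and imaginary parts).
    std = np.sqrt((1.0 - rho ** 2) / 2.0)
    delta = rng.normal(0.0, std, h_prev.shape) + 1j * rng.normal(0.0, std, h_prev.shape)
    return rho * h_prev + delta

def channel_gain(l, h):
    # Gain in (2): large-scale component times the small-scale power |h|^2.
    return l * np.abs(h) ** 2

# Example: one UE-BS pair over K = 10 RBs with coherence coefficient rho = 0.9.
K, rho = 10, 0.9
h = rng.normal(0.0, np.sqrt(0.5), K) + 1j * rng.normal(0.0, np.sqrt(0.5), K)  # h(0) ~ CN(0, 1)
path_loss_db = 34 + 40 * np.log10(120.0)   # assumed UE-BS distance of 120 m
l = 10.0 ** (-path_loss_db / 10.0)
for _ in range(5):                         # advance five time slots
    h = update_small_scale(h, rho, rng)
g = channel_gain(l, h)                     # per-RB gains g^k_{i,j}(t)
```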
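Similarly, (3)-(5) can be evaluated in a few lines; the array layout (a per-UE gain matrix indexed by BS and RB) and the 180 kHz RB bandwidth taken from Table I are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def sinr(p_tx, gain, f, serving_bs, rb, noise_power):
    # SINR in (3) for one UE on RB `rb` when served by `serving_bs`:
    # p_tx[j] is BS j's transmit power (W), gain[j, k] the gain from BS j to this
    # UE in RB k, and f[k] the UE's RB allocation indicator f_{i,k}(t).
    signal = p_tx[serving_bs] * f[rb] * gain[serving_bs, rb]
    interference = sum(p_tx[m] * f[rb] * gain[m, rb]
                       for m in range(len(p_tx)) if m != serving_bs)
    return signal / (interference + noise_power)

def total_rate(p_tx, gain, f, serving_bs, noise_power, rb_bandwidth=180e3):
    # Rate in (4)-(5): B * log2(1 + SINR) summed over the RBs allocated to the UE.
    return sum(rb_bandwidth * np.log2(1.0 + sinr(p_tx, gain, f, serving_bs, k, noise_power))
               for k in range(len(f)) if f[k] == 1)
```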
B. Problem Formulation

Due to the mobility of UEs, the throughput from a BS, determined by the dynamic channel gains, would change drastically. Additionally, frequent handovers would also happen, which is harmful to practical RAN deployment. Hence, it is necessary for each UE to access a proper BS so as to optimize its throughput performance. Jointly considering the handover cost and the achieved throughput, the utility function of each UE can be defined as

Γ_{i,j}(t) = { R_{i,j}(t),        u_{i,j}(t) = u_{i,j}(t − 1),
             { R_{i,j}(t) − C,    u_{i,j}(t) ≠ u_{i,j}(t − 1),    (6)

where C is the equivalent throughput cost of a handover, which is incurred by the signaling overheads and power consumptions.

Optimization 1. In this paper, we aim to maximize the long-term utilities of all the UEs from t = 0 to t = T − 1. Such optimization is given by

max_{U,F} Σ_{t=0}^{T−1} Σ_{i∈U} Σ_{j∈B} u_{i,j}(t) Γ_{i,j}(t)    (7)

s.t. Σ_{j∈B} u_{i,j}(t) = 1, ∀i ∈ U,    (8)

     Σ_{k∈K} f_{i,k}(t) = 1, ∀i ∈ U,    (9)

     u_{i,j}(t) ∈ {0, 1}, ∀(i, j) ∈ U × B,    (10)

     f_{i,k}(t) ∈ {0, 1}, ∀(i, k) ∈ U × K,    (11)

where U = [u_{i,j}(t), ∀i, j, t] and F = [f_{i,k}(t), ∀i, k, t].

The constraints (8) and (9) limit each UE to accessing only one BS and being allocated one RB (as an example) at each time slot. Since Optimization 1 is an integer programming problem and the objective is to maximize the long-term performance, it is hard to solve with conventional optimization methods such as convex optimization. These concerns thus drive us to focus on DRL to develop the access control algorithms for UEs in the O-RAN. To this end, Optimization 1 is transferred to an MDP. However, as aforementioned, the global network information, i.e., the channel state information, may not be practically available due to the mobility of UEs. Moreover, the throughput performance of each UE is influenced by the other UEs' access policies. Hence, in order to construct a global DQN so as to maximize the long-term overall throughput with limited communication overheads, FL is introduced for training the parameters of the DQN installed in each UE in a distributed manner, with the aid of the global model server installed in the RIC.

III. FEDERATED DRL-BASED SCHEME FOR USER ACCESS CONTROL

A. Preliminary of DRL

In the basic framework of DRL, the objective of the agent is to maximize the cumulated long-term reward via continuous interactions with the environment. Specifically, the agent visits a state s(t) ∈ S by observing the environment at time slot t. According to the policy π(s(t)), the agent selects an action a(t) ∈ A for execution. The agent obtains an immediate reward r(s(t), a(t)) and a new state s(t + 1) at the beginning of the next time slot. Through interacting with the environment, the agent aims to learn the optimal policy that maximizes the cumulated long-term reward, which is also called the Q-function, i.e.,

Q(s, a) = E[ Σ_{t=0}^{∞} β^t r(s(t), a(t)) | s(0) = s ],    (12)

where E[·] is the expectation operator and β is the discount factor.

A recent innovation to obtain the optimal Q-function is the deep Q-learning (DQL) algorithm [9], in which a deep neural network (DNN) is adopted to approximate the Q-function. In DQL, two neural networks are constructed, i.e., the trained network Q(s, a; θ) with parameters θ and the target network Q(s, a; θ^−) with parameters θ^−. The agent obtains estimated Q-values by inputting the current state s(t) into the trained network. Furthermore, a replay memory is adopted to store the interaction experiences {s(t), a(t), r(s(t), a(t)), s(t + 1)}, and a mini-batch of experiences is sampled randomly from the replay memory as the training data in each training round. The parameters θ can be updated by using the stochastic gradient descent (SGD) method to minimize the loss function over the mini-batch of random samples. To this end, the loss function can be defined as the mean square error between the target value and the estimated value, i.e.,

L(θ) = E[ (y^{DQN} − Q(s, a; θ))^2 ],    (13)

where y^{DQN} = r(s, a) + β max_{a′∈A} Q(s′, a′; θ^−) is the target value.
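As a minimal illustration of (13), the sketch below computes the targets y^DQN and the mini-batch loss; `q_trained` and `q_target` are assumed callables that map a state to a vector of Q-values over the action space, and the actual gradient step on θ would be delegated to whichever DNN library implements the two networks.

```python
import numpy as np

def dqn_targets(batch, q_target, beta):
    # Targets y = r + beta * max_a' Q(s', a'; theta^-) for a mini-batch of
    # experiences stored as (s, a, r, s_next) tuples.
    return np.array([r + beta * np.max(q_target(s_next)) for _, _, r, s_next in batch])

def dqn_loss(batch, q_trained, q_target, beta=0.9):
    # Mean square error in (13) between the targets and the trained network's
    # estimates Q(s, a; theta) for the actions actually taken.
    y = dqn_targets(batch, q_target, beta)
    q_sa = np.array([q_trained(s)[a] for s, a, _, _ in batch])
    return np.mean((y - q_sa) ** 2)
```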
B. Foundations of FL

FL is an efficient method to build a distributed framework for complex machine learning models such as DNNs. In FL, multiple UEs develop a global learning model in a distributed manner, and this model is stored in the global model server. Generally, there are two phases in each training round of FL, i.e., local model updating and global model aggregation. Specifically, in the local model updating phase, some UEs are selected to download the global model parameters from the model server, and then each of the selected UEs updates its local parameters by using its own training data. Next, the updated local parameters are presented to the global model server, which aggregates all the received parameters to compute the global model parameters in the model aggregation phase. The updated global model parameters w^t_G are then sent back to all the UEs, and the UEs adopt the newly updated global model parameters as their local model parameters. By iteratively aggregating the trained models from different UEs, the final global model is obtained once the global model parameters w^t_G achieve a desirable accuracy.

C. The Proposed Federated DRL-based Scheme

In the proposed scheme, since the action space is discrete, the DQN can be adopted as the DRL model, and each UE, acting as an independent agent, makes access decisions based on a global DQN, which is trained by a federated training algorithm. A global model server installed in the RIC is responsible for selecting UEs to perform local training, and for aggregating the collected parameters from the selected UEs to update the global parameters. In the following, the state space, action space and reward function of the designed DRL algorithm are developed.

Definition 1. (State space) The state of UE i at time slot t is denoted as s_i(t), which is composed of five components: (i) the BS access indicators at time slot t − 1, (ii) the RB allocation indicators at time slot t − 1, (iii) the channel gains in different RBs from BS ĵ = arg max_{j∈B} U_i(t − 1) at time slot t − 1, where U_i(t − 1) = {u_{i,j}(t − 1), j ∈ B}, (iv) the group of channel gains from different BSs in RB k̂ = arg max_{k∈K} F_i(t − 1) at time slot t − 1, where F_i(t − 1) = {f_{i,k}(t − 1), k ∈ K}, and (v) the transmission rate achieved at time slot t − 1, ω_i(t − 1) = Σ_{j∈B} R_{i,j}(t − 1). Hence, the state can be defined by

s_i(t) = {u_{i,0}(t − 1), · · · , u_{i,N−1}(t − 1),
          f_{i,0}(t − 1), · · · , f_{i,K−1}(t − 1),
          g^0_{i,ĵ}(t − 1), · · · , g^{K−1}_{i,ĵ}(t − 1),
          g^{k̂}_{i,0}(t − 1), · · · , g^{k̂}_{i,N−1}(t − 1),
          ω_i(t − 1)}.    (14)

Definition 2. (Action space) In each time slot, UE i needs to choose a proper BS to access and to request an RB for downlink transmissions, and therefore the action of UE i at time slot t can be defined by

a_i(t) = {u_{i,0}(t), · · · , u_{i,N−1}(t), f_{i,0}(t), · · · , f_{i,K−1}(t)}.    (15)

Definition 3. (Reward function) Let r_i(s_i(t), a_i(t)) denote the reward of UE i after taking action a_i(t) at s_i(t). Since the objective of each UE is to maximize its throughput while avoiding frequent handovers, the reward of UE i at time slot t is a balance between the achievable throughput and the handover cost, i.e.,

r_i(s_i(t), a_i(t)) = { ω_i(t),         U_i(t) = U_i(t − 1),
                      { ω_i(t) − ηC,    U_i(t) ≠ U_i(t − 1),    (16)

where η is a punishment factor to balance the throughput and the handover cost.

For homogeneous UEs located in the same area, the global DQN model can be regarded as consensus knowledge of the environment, i.e., Q(s, a; θ_i) ≈ y_c(s, a; θ_i). Although the aggregated DQN model parameters can present the characteristics of most UEs, a local agent may have its own bias χ_i(s, a; θ_i) based on its interaction statistics, which cannot be estimated adequately by the aggregated model. In this case, we introduce the dueling structure to develop the DQN of each UE. Due to the dueling structure [12], the DQN parameters are composed of the common-part parameters θ^c_i, the value-function parameters θ^v_i, and the advantage-function parameters θ^a_i, and the Q-function of the local UE can be regarded as the sum of a value function V(s; θ^c_i, θ^v_i) and an advantage function A(s, a; θ^c_i, θ^a_i), i.e., Q_i(s, a; θ^c_i, θ^v_i, θ^a_i) = V(s; θ^c_i, θ^v_i) + A(s, a; θ^c_i, θ^a_i). Since the value function in the dueling structure is only related to the state of UE i, the value-function parameters and the common-part parameters are used to represent the consensus on the decision policies for the environment. The advantage-function parameters are used to represent the bias χ_i(s, a; θ_i), which corresponds to the actions of each UE. To this end, only partial DQN parameters of the UEs should be aggregated as the consensus of all the UEs, while the bias function is trained by each UE locally. Additionally, to improve the discrimination between the value function and the advantage function, the Q-function can be written as

Q_i(s, a; θ^c_i, θ^v_i, θ^a_i) = V(s; θ^c_i, θ^v_i) + A(s, a; θ^c_i, θ^a_i) − (1/|A|) Σ_{a′∈A} A(s, a′; θ^c_i, θ^a_i).    (17)
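Before moving to the training procedure, Definitions 1 and 3 translate directly into code; the variable names in the following sketch are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def build_state(u_prev, f_prev, g_rbs_from_serving, g_bss_on_rb, omega_prev):
    # State in (14): previous BS access indicators, previous RB indicators, the
    # gains over all RBs from the previously serving BS, the gains over all BSs
    # on the previously used RB, and the previous rate, as one feature vector.
    return np.concatenate([u_prev, f_prev, g_rbs_from_serving, g_bss_on_rb, [omega_prev]])

def reward(omega_t, U_t, U_prev, C, eta):
    # Reward in (16): omega_i(t) if the access vector is unchanged, and
    # omega_i(t) - eta * C if a handover occurred (U_i(t) != U_i(t-1)).
    return omega_t if np.array_equal(U_t, U_prev) else omega_t - eta * C
```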
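The dueling decomposition in (17) can likewise be sketched in a few lines; the single hidden layer and the 33-dimensional state (N = 6, K = 10) used here are simplifications relative to the layer widths later listed in Table I, so this is an assumption-laden illustration rather than the exact network.

```python
import numpy as np

def dueling_q(state, theta_c, theta_v, theta_a):
    # Q-values as in (17): a common trunk, a scalar value head and a per-action
    # advantage head, with the mean advantage subtracted so that the value and
    # advantage functions remain distinguishable.
    (Wc, bc), (Wv, bv), (Wa, ba) = theta_c, theta_v, theta_a
    z = np.maximum(Wc @ state + bc, 0.0)   # common-part features (ReLU)
    v = Wv @ z + bv                        # V(s; theta_c, theta_v), shape (1,)
    a = Wa @ z + ba                        # A(s, .; theta_c, theta_a), one entry per action
    return v + a - a.mean()

# Example with random, untrained parameters: 33-dimensional state, N*K = 60 actions.
rng = np.random.default_rng(1)
theta_c = (0.05 * rng.normal(size=(256, 33)), np.zeros(256))
theta_v = (0.05 * rng.normal(size=(1, 256)), np.zeros(1))
theta_a = (0.05 * rng.normal(size=(60, 256)), np.zeros(60))
q_values = dueling_q(rng.normal(size=33), theta_c, theta_v, theta_a)
greedy_action = int(np.argmax(q_values))
```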
At the beginning of the initialization stage, each UE constructs a DQN Q_i(s_i, a_i; θ_i) with arbitrarily initialized parameters θ_i and a replay memory D_i to store its interaction experience {s_i(t), a_i(t), r_i(s_i(t), a_i(t)), s_i(t + 1)} in each time slot. In the first N_b slots, UE i makes access decisions using the RSS-based method, and subsequently receives the N_b corresponding rewards. These N_b interaction experiences are then stored into its replay memory as the initialization training data.

At each time slot t (t > N_b) of the training stage, UE i visits a state s_i(t), and then makes its access decision a_i(t) by using a policy such as ε-greedy for action selection based on its trained DQN Q_i(s_i, a_i; θ_i). If two UEs request the same RB from a specific BS, the BS accepts the request with the stronger signal strength and allocates the RB to the corresponding UE. At the next time slot, each UE receives an immediate reward r_i(s_i(t), a_i(t)) and a new state s_i(t + 1), and this interaction experience is stored into its replay memory. Let t_start denote a local record of the time slot of the last training round. Every n_f time slots (i.e., when t − t_start = n_f), the global model server randomly selects UEs to train the global DQN, and each of the selected UEs samples a mini-batch of N_b experiences randomly from its replay memory. The parameters θ_i(t) of each selected UE are updated by using the SGD algorithm based on the sampled data. After local training, the selected UEs present only their common-part parameters θ^c_i(t) and value-function parameters θ^v_i(t) to the global model server through uplink reports.

Proposition 1. In the global model aggregation stage, the global common-part parameters and value-function parameters are obtained by averaging the corresponding received parameters, i.e.,

θ^c_G(t) = (1/|N_s(t)|) Σ_{i∈N_s(t)} θ^c_i(t),    (18)

θ^v_G(t) = (1/|N_s(t)|) Σ_{i∈N_s(t)} θ^v_i(t),    (19)

where N_s(t) is the set of selected UEs at time slot t.

Next, the obtained parameters θ^c_G(t) and θ^v_G(t) are sent back to each UE. UE i combines the newly obtained parameters and its locally trained advantage-function parameters θ^a_i(t) as the new parameters of its local DQN, i.e., θ_i(t + 1) = {θ^a_i(t), θ^c_G(t), θ^v_G(t)}. In the subsequent time slots, the UEs make access decisions by using the ε-greedy method based on the new local DQN. If the global loss achieves a desirable value, the training stage is closed, and all the UEs make access decisions with the final global DQN parameters in the testing stage. The above procedure of the proposed scheme is summarized in Algorithm 1.

Algorithm 1 The Proposed Federated DRL-based Scheme
1: Initialization Stage:
2: Each UE constructs a DQN Q_i(s_i, a_i; θ_i) with randomly initialized parameters θ_i and a replay memory D_i.
3: Each UE makes access decisions with the RSS-based method in the first N_b time slots and stores the resulting N_b interaction experiences into D_i as the initialization training data.
4: Training Stage:
5: t_start ← t
6: repeat
7:   Each UE visits a state s_i(t) at each time slot, and makes an access decision a_i(t) using the ε-greedy method based on the previously trained DQN.
8:   Each UE obtains a reward r_i(s_i(t), a_i(t)) and a new state s_i(t + 1).
9:   Each UE stores the interaction experience into its replay memory D_i.
10:  t ← t + 1
11:  if t − t_start == n_f then
12:    The global model server randomly selects UEs from the UE set U for updating the global DQN parameters.
13:    Each selected UE randomly samples a mini-batch of experiences from its replay memory D_i.
14:    Each selected UE updates its local DQN parameters by minimizing the loss function defined in (13) with the SGD algorithm.
15:    Each selected UE reports its trained common-part parameters θ^c_i(t) and value-function parameters θ^v_i(t) to the global model server.
16:    The global server aggregates the local parameters from the selected UEs based on (18) and (19).
17:    The server sends the newly aggregated parameters back to all the UEs.
18:    Each UE combines its local advantage-function parameters θ^a_i(t) and the newly obtained parameters as its new local DQN parameters for access decision-making.
19:  end if
20: until the global loss achieves a desirable value
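To tie Proposition 1 and Algorithm 1 together, the following condensed sketch shows the partial aggregation of (18)-(19) and one ε-greedy training slot. The `ue`, `env` and `server` objects and their methods are assumed interfaces invented for this illustration (the paper does not define them), and each parameter group is treated as a single array, whereas a real DQN would hold a list of layer tensors averaged element-wise.

```python
import numpy as np

def aggregate_partial(local_params):
    # (18)-(19): average only the common-part and value-function parameters of
    # the selected UEs; advantage-function parameters are never uploaded.
    theta_c_G = np.mean([p["common"] for p in local_params], axis=0)
    theta_v_G = np.mean([p["value"] for p in local_params], axis=0)
    return theta_c_G, theta_v_G

def epsilon_greedy(q_values, epsilon, rng):
    # Random action with probability epsilon, greedy action otherwise.
    return int(rng.integers(len(q_values))) if rng.random() < epsilon else int(np.argmax(q_values))

def run_training(ues, env, server, T, n_f, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    t_start = 0
    for t in range(T):
        # Steps 7-10: each UE acts on its local dueling DQN and stores the transition.
        actions = [epsilon_greedy(ue.q_values(ue.state), epsilon, rng) for ue in ues]
        rewards, next_states = env.step(actions)      # BSs resolve RB conflicts by RSS
        for ue, a, r, s_next in zip(ues, actions, rewards, next_states):
            ue.replay.append((ue.state, a, r, s_next))
            ue.state = s_next
        # Steps 11-18: every n_f slots, local SGD on (13) followed by partial aggregation.
        if t - t_start == n_f:
            selected = server.select(ues, rng)
            for ue in selected:
                ue.local_sgd_step()                   # minimize (13) on a sampled mini-batch
            theta_c_G, theta_v_G = aggregate_partial([ue.params for ue in selected])
            for ue in ues:                            # theta_i(t+1) = {theta_a_i, theta_c_G, theta_v_G}
                ue.params = {"common": theta_c_G, "value": theta_v_G,
                             "advantage": ue.params["advantage"]}
            t_start = t
```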
IV. SIMULATION RESULTS

A. Simulation Settings

To evaluate the performance of the proposed scheme, we consider a two-tier O-RAN, in which N = 6 BSs share K = 10 RBs to serve M = 20 UEs located randomly in a 1000 × 1000 m^2 square area. Each UE moves according to the Gauss-Markov mobility model [14], and chooses another moving direction if the boundary is reached. All the hyperparameters of the DRL algorithm are determined by the cross-validation method, as shown in Table I.

TABLE I: Parameters in the simulation.
Parameter                     Value
BS transmit power             30 dBm / 40 dBm
Noise power density           -174 dBm/Hz
Bandwidth of each RB          180 kHz
BS large-scale fading [13]    34 + 40 log(d) / 37 + 30 log(d)
Common-network part           256 × 256
Value-function part           128 × 64 × 1
Advantage-function part       256 × 128 × 60
Optimizer                     Adam
Carrier frequency             2 GHz
Learning rate                 0.001
Replay memory size            2000
Mini-batch size               32

B. Results and Analysis

Firstly, we investigate the throughput performance of the proposed scheme with different DQN-parameter exchange methods in Fig. 2. In this simulation, we set n_f = 10, and a moving average of the results over the previous 500 slots is shown. From Fig. 2, we can observe that the overall throughput achieved by the proposed scheme grows significantly with time, which means that UEs with the proposed scheme are able to learn proper access policies from their interactions with the environment. Furthermore, the curve of the partial parameter exchange method is higher than that of the complete DQN parameter exchange method. This indicates that exchanging only partial parameters brings greater generalization capabilities to the UEs in the proposed scheme.

Fig. 2: Moving average throughput with different DQN parameter exchange principles.

Subsequently, we provide the throughput performances of the proposed scheme under different coherence coefficients in Fig. 3. For performance comparison, we provide the performances of the distributed training scheme and the centralized training scheme. In the distributed training scheme, each UE trains its DQN with its own local interaction experiences independently, and all the UEs make access decisions according to their own trained DQNs. In the centralized training scheme, there exists a centralized trainer responsible for training a global DQN for the UEs in the O-RAN, and the trainer updates the parameters of the DQN based on the collected experiences and later sends the updated parameters to all the UEs. When ρ = 0.9, the small-scale fading changes slowly, leading to a slowly varying channel gain, whereas the channel gain varies fast when ρ = 0.2. From Fig. 3, we can observe that the curves of all the algorithms when ρ = 0.9 are higher than those when ρ = 0.2, since the change pattern of the small-scale fading is easier to learn in a slowly changing environment. Additionally, under different coefficients, the proposed scheme always achieves a higher throughput than the other algorithms. This justifies the effectiveness and robustness of the proposed scheme in terms of throughput performance.

Fig. 3: Moving average throughput with different coherence coefficients: (a) ρ = 0.9, (b) ρ = 0.2.

Finally, we evaluate the impact of the handover punishment factor η in the reward function in terms of the overall throughput and the number of handovers in Fig. 4. For performance comparison, we also evaluate the performances of the RSS-based method, the Q-learning algorithm, the distributed training scheme and the centralized training scheme. In the RSS-based method, each UE accesses the BS and RB with the strongest RSS in each time slot, while in the Q-learning algorithm, each UE makes access decisions according to its estimated Q-table. All the results are calculated over 500 time slots in which the performances of all the considered algorithms have converged. From Fig. 4, we can observe that UEs with the proposed scheme achieve the highest overall throughput and the fewest handovers among all the considered algorithms, indicating that the proposed scheme outperforms the other algorithms in terms of the overall throughput and the number of handovers. Additionally, the overall throughput and the number of handovers decrease as the value of η increases. Hence, a larger η leads to a lower overall throughput and fewer handovers, and different trade-offs between the overall throughput and the number of handovers can be achieved by adopting different values of η.

Fig. 4: Overall throughput and number of handovers with different punishment factors in the reward function.

V. CONCLUSIONS

In this paper, we propose a federated DRL-based scheme for user access control in the O-RAN. In the proposed scheme, each UE acts as an independent agent to make access decisions with the facilitation of a global model server, and the server installed in the RIC is responsible for updating the global DQN parameters by averaging the DQN parameters obtained from selected UEs. Additionally, to achieve convergence for independent agents, the dueling structure is introduced to decompose the parameters of the DQN, and only partial DQN parameters are exchanged in the proposed scheme to decrease the communication overheads. With the proposed scheme, each UE is able to access proper BSs and RBs for downlink transmissions to maximize its long-term throughput and avoid frequent handovers. The simulation results have shown that a higher overall throughput and a smaller number of handovers can be achieved by the proposed scheme compared with the RSS-based method, the distributed training scheme and the centralized training scheme.

REFERENCES

[1] O-RAN White Paper, "O-RAN: Towards an open and smart RAN," Jan. 2018.
[2] O-RAN-WG1-O-RAN Architecture Description-v01.00.00, "O-RAN architecture description," Oct. 2020.
[3] O-RAN-WG2-AIML-v01.01, "AI/ML workflow description and requirements," Mar. 2020.
[4] H. A. Mahmoud, I. Guvenc, and F. Watanabe, "Performance of open access femtocell networks with different cell-selection methods," in Proc. IEEE VTC, May 2010, pp. 1-5.
[5] G. Yang, C. K. Ho, and Y. L. Guan, "Multi-antenna wireless energy transfer for backscatter communication systems," IEEE J. Sel. Areas Commun., vol. 33, no. 12, pp. 2974-2987, Dec. 2015.
[6] Y. Lin, W. Bao, W. Yu, and B. Liang, "Optimizing user association and spectrum allocation in HetNets: A utility perspective," IEEE J. Sel. Areas Commun., vol. 33, no. 6, pp. 1025-1039, June 2015.
[7] C. Shen and M. van der Schaar, "A learning approach to frequent handover mitigations in 3GPP mobility protocols," in Proc. IEEE WCNC, Mar. 2017, pp. 1-6.
[8] S. Bayat, R. H. Y. Louie, Z. Han, B. Vucetic, and Y. Li, "Distributed user association and femtocell allocation in heterogeneous wireless networks," IEEE Trans. Commun., vol. 62, no. 8, pp. 3027-3043, Aug. 2014.
[9] Y.-C. Liang, Dynamic Spectrum Management: From Cognitive Radio to Blockchain and Artificial Intelligence. Springer Singapore, 2020. [Online]. Available: https://doi.org/10.1007/978-981-15-0776-2
[10] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, "Federated learning in mobile edge networks: A comprehensive survey," IEEE Commun. Surveys Tuts., early access, 2020, doi: 10.1109/COMST.2020.2986024.
[11] S. Lien, S. Shieh, Y. Huang, B. Su, Y. Hsu, and H. Wei, "5G new radio: Waveform, frame structure, multiple access, and initial access," IEEE Commun. Mag., vol. 55, no. 6, pp. 64-71, 2017.
[12] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling network architectures for deep reinforcement learning," in Proc. International Conference on Machine Learning, Jun. 2016.
[13] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, "Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141-5152, Nov. 2019.
[14] T. Camp, J. Boleng, and V. Davies, "A survey of mobility models for ad hoc network research," Wireless Commun. Mobile Comput., vol. 2, no. 5, pp. 483-502, Sep. 2002.