
IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 23, NO. 2, FEBRUARY 2024

Joint Delay-Energy Optimization for Multi-Priority Random Access in Machine-Type Communications

Wenbo Fan, Pingzhi Fan, Fellow, IEEE, and Yan Long, Member, IEEE

Abstract—Cellular-based networks are deemed one solution for providing communication links for the Internet of Things (IoT) due to their high reliability and wide coverage. However, because of the overloaded machine-type devices in IoT, the existing random access procedure in cellular networks suffers from a significant preamble collision problem and can hardly meet the requirement of large-scale random access. Despite the effort to cope with the preamble collision problem in conventional random access control schemes, other important performance requirements in random access are not well addressed, including access delay, energy consumption, and service priority. To improve the random access control scheme, we propose a novel hierarchical hybrid (HH) access class barring (ACB) and back-off (BO) scheme (HH ACB-BO scheme), where the hybrid ACB-BO is exploited to balance the delay-energy tradeoff, and the hierarchical structure is proposed to prioritize communication services. We mathematically formulate this random access control scheme to optimize the delay and energy performance jointly. With fixed priority weights for each service priority, a closed-form result for the optimal ACB factors and BO indicators is derived. Moreover, in order to realize adaptive prioritized random access control, we apply deep reinforcement learning (DRL) to the proposed random access scheme to dynamically adjust the ACB factors and BO indicators in an online manner. Considering the hierarchical structure and the action space complexity in DRL, a multi-agent DRL algorithm is designed for the HH ACB-BO scheme (multi-agent HH-DRL algorithm), where online policy transfer is applied to guarantee policy effectiveness in practical networks. Finally, simulation results verify the effectiveness of the proposed HH ACB-BO scheme and reveal that the multi-agent HH-DRL algorithm outperforms other algorithms in terms of average access success probability and energy consumption performance.

Index Terms—Cellular IoT networks, random access control, energy-delay tradeoff, priority, deep reinforcement learning.

I. INTRODUCTION

WITH the explosive growth of machine-type devices (MTDs) in the Internet of Things (IoT), machine-type communication (MTC) has been identified as one of the three main application scenarios of 5th Generation and beyond mobile network services. Due to their wide coverage, mobility support, and high reliability, cellular-based IoT systems, including NarrowBand-IoT and Long-Term Evolution (LTE)-Machine to Machine, are considered key solutions for MTC. According to third generation partnership project (3GPP) reports, MTC should support a connection density of 1 million devices per square kilometer [1]. Thus, it is crucial for cellular-based IoT systems to handle dense MTC scenarios involving a large number of MTDs.

In conventional cellular networks, grant-based four-step random access (RA) is normally used to establish the communication connection between the MTD and the eNodeB (eNB) on the random access channel (RACH) [2]. However, due to the limited preamble resources in each random access opportunity (RAO), a collision occurs when two devices select the same preamble, resulting in a serious preamble collision problem and large access delay in cellular IoT networks with a large number of MTDs. Therefore, reducing access delay is critical for RA in the overloaded case. In addition, since MTDs are usually portable and battery powered, energy is a valuable resource for MTDs. With the increasing energy-saving demands of IoT applications, it is essential to reduce the energy consumption of MTDs in RACH. Moreover, machine-type services vary with the specific application. The MTDs in IoT applications may have different delay requirements, such as delay-sensitive services including eHealth, self-driven vehicles, and public security, and delay-tolerant services including factory management and city pollution detection [3]. For this reason, it is necessary to consider service priority in terms of delay requirements to further improve the RA performance. Based on this analysis, the RA control scheme should be investigated from three aspects: access delay, energy consumption, and service priority, especially when a cellular IoT scenario with overloaded MTDs is concerned.

Manuscript received 13 November 2022; revised 8 April 2023; accepted 11 June 2023. Date of publication 30 June 2023; date of current version 13 February 2024. This work was supported by NSFC under Project 62020106001. The work of Yan Long was supported in part by the Sichuan Science and Technology Program under Grant 2022NSFSC0912 and Grant 2023NSFSC0459. The associate editor coordinating the review of this article and approving it for publication was J. Yang. (Corresponding author: Pingzhi Fan.)

The authors are with the Information Coding and Transmission Key Laboratory of Sichuan Province, CSNMT Int. Coop. Res. Centre (MoST), Southwest Jiaotong University, Chengdu 611756, China (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TWC.2023.3289314.

Digital Object Identifier 10.1109/TWC.2023.3289314
There are some existing works focusing on the RA control scheme from the above three aspects. In order to reduce the access delay, Access Class Barring (ACB) [4] was proposed to alleviate the RA pressure by restricting RA requests from MTDs. Duan et al. in [5] derived an optimal closed-form solution of the ACB factor to minimize the access delay under ideal conditions. This solution depends on the number of activated MTDs and the preamble resources. However, the number of activated MTDs is usually unknown in time-varying traffic networks. Thus, the ACB factor is dynamically adjusted according to the results of activated-MTD estimation methods such as maximum-likelihood estimation [6], pseudo-Bayesian estimation [7], and learning automata-based estimation [8]. In addition, deep reinforcement learning (DRL), as a promising method, has been developed to dynamically adjust the ACB factor by interacting with the RA process [9], [10].

However, the aforementioned schemes only focus on the access delay performance without involving the energy consumption in the RA process. Recently, energy efficiency has been considered in many works [11], [12], [13]. In [11], by performing multiple access computing offloads sequentially, a swapping-heuristic based algorithm was proposed to minimize energy consumption in multi-access mobile edge computing. Cao et al. in [12] investigated the energy efficiency of RA-based orthogonal and non-orthogonal multiple access networks and discovered that lower data rates have beneficial effects on energy efficiency. Zhao et al. in [13] studied the performance limit of energy efficiency, evaluated via the lifetime throughput under an age-of-information constraint, where the optimal channel access probability and packet arrival rate are derived to achieve the maximum lifetime throughput. In dense MTC scenarios involving a large number of MTDs, due to the overloaded MTDs in RA, the probability of passing the ACB check is trivial, leading to huge energy consumption from frequent ACB checks. Concerning the energy consumption problem, based on the 3GPP test cases, Gerasimenko et al. in [14] conducted a thorough analysis and simulations of the RA performance, including energy consumption, in the overloaded case. It reveals the capability of the back-off (BO) scheme to reduce energy consumption. Specifically, according to the BO scheme, the MTDs that fail the ACB check are assigned to future RAOs to retry RA. Before re-access, these failed MTDs are inactive and do not receive system information. This greatly mitigates the energy wasted on frequent ACB checks in the overloaded case. Hence, on the one hand, the ACB scheme reduces the access delay while introducing energy consumption in the ACB check process. On the other hand, the BO scheme reduces the energy consumed in receiving system information by deactivating these failed MTDs. As a result, it is natural to combine the two aspects, and a hybrid ACB-BO scheme is studied in [15], [16], and [17]. Furthermore, Jiang et al. in [18] developed a DRL algorithm to dynamically adjust the ACB factor and BO indicator by maximizing a long-term joint reward composed of both access delay and energy consumption.

Although the hybrid ACB-BO scheme can balance the access delay and energy consumption, it can hardly serve IoT networks directly without considering service priorities with various delay requirements. Therefore, priority-based RA has been studied in recent works [19], [20], [21], [22]. Zhang et al. in [19] proposed an analytical framework to optimize the network throughput by dividing MTDs into multiple groups according to their throughput requirements. Moreover, DRL has also been developed to provide different ACB factors for different types of MTDs [20], [21]. More recently, Liu et al. in [22] proposed an online preamble control algorithm, where the preamble resources are assigned to MTDs with different priorities according to soft priority weights. Although these priority-based schemes can improve the RA performance by dynamically assigning RACH resources to MTDs with different delay requirements, they ignore the energy consumption in RA. Recalling the hybrid ACB-BO scheme introduced for RA control, service priority should also be integrated in cellular IoT to pursue access delay, energy consumption, and service priority performance jointly.

Machine learning has recently emerged as a promising tool for dealing with complex practical networks. In [23], federated learning is applied in the wireless computing power network, minimizing the sum energy consumption of all computing nodes by orchestrating the computing and networking resources. Similarly, Wang et al. in [24] proposed decentralized federated learning for the mobile and heterogeneous wireless computing power network, where nodes can freely participate in or leave the federated training. Moreover, using multi-agent DRL, Lee et al. in [25] proposed a novel contention-based RA solution for satellite networks, where each satellite has a sole agent. Jadoon et al. in [26] applied DRL to slotted ALOHA RA, balancing the throughput and fairness performances. In [27], by assigning a preamble to each user online, a DRL-enabled intelligent RA management scheme was proposed to reduce access latency and access failures. Although there are existing works focusing on DRL-based solutions for complex control problems in RA, only single-policy training schemes are studied. However, a single-policy training scheme can hardly adapt to practical networks, especially in MTC scenarios with multiple traffic modes. To be specific, when bursty traffic occurs, the above DRL algorithms with single-policy training can hardly train the policy in a short time and, even worse, may not converge by the end of the bursty traffic. To this end, policy transfer (PT), as an important part of transfer reinforcement learning [28], can train multiple policies through policy reuse. In the face of bursty MTC traffic, the agent may use online PT and switch to the corresponding policy in time to avoid re-adapting to the bursty traffic. In this way, the policy effectiveness can be ensured even in practical networks with time-varying traffic modes.

To this end, in this paper, all the MTDs in the system are first classified into different priority groups based on the different delay requirements of practical applications. Then, a novel hierarchical hybrid (HH) ACB and BO random access control scheme (HH ACB-BO scheme) is proposed to provide service priority and balance the delay-energy tradeoff. Under fixed priority weights for each service priority, a closed-form solution of the ACB factors and BO indicators is obtained under ideal conditions (HH-ideal). In order to realize adaptive priority control, a multi-agent DRL algorithm is applied to the HH ACB-BO scheme (multi-agent HH-DRL algorithm), where online PT is applied to guarantee the policy effectiveness in the face of time-varying traffic modes in practical networks. Our contributions can be summarized as follows:
1) To overcome the serious preamble collision problem and enhance the RA performance, we propose an HH ACB-BO scheme by combining multi-priority with the hybrid ACB-BO scheme, where the hybrid ACB-BO scheme is exploited to balance the delay-energy tradeoff, and the hierarchical structure is used to enable service priority for different types of MTDs.
2) Under fixed priority weights for each service priority, a joint delay-energy optimization problem is formulated for the proposed HH ACB-BO scheme. Firstly, a mathematical analysis of the access delay and energy consumption is carried out to develop an objective function. Then, we approximate and scale the objective function to obtain a closed-form solution of the optimal ACB factors and BO indicators. Simulation results show that the RA performance of the proposed HH-ideal algorithm can be effectively improved. However, in some extreme cases, unfavorable fixed priority weights lead to a shrinking average access success probability.
3) In order to realize adaptive priority control and improve the average access success probability, we apply the DRL algorithm to the proposed HH ACB-BO scheme to dynamically adjust the ACB factors and BO indicators in an online manner. Taking advantage of the aforementioned mathematical analysis of the HH ACB-BO scheme, we introduce a joint reward function to guarantee RA performance, where a punishment sub-reward is designed for the adaptive priority control. In addition, we develop a multi-agent HH-DRL algorithm inspired by the hierarchical structure of the HH ACB-BO scheme, which facilitates multi-priority expansion and reduces the complexity of the action space in the training process.
4) Since MTC usually has multiple traffic modes in practical networks, a single-policy DRL scheme can hardly train the policy in a short time when the MTC traffic happens to be bursty. In order to ensure the policy effectiveness when the traffic mode changes, a PT-based multi-policy online training scheme is applied in the proposed multi-agent HH-DRL algorithm, which can switch to the corresponding policy in time and avoid re-adapting to the changing traffic.

The rest of the paper is organized as follows. In Section II, we present the multi-priority RA system model. The HH ACB-BO scheme is proposed and mathematically analyzed in Section III. Section IV applies the DRL algorithm to the HH ACB-BO scheme and proposes a multi-agent HH-DRL algorithm. In Section V, we provide extensive simulation results to evaluate the performance of the proposed scheme, and the conclusion is finally given in Section VI.

II. SYSTEM MODEL

In this paper, we consider a cellular-based mIoT network where each eNB serves a large number of MTDs, similar to most existing works [18], [22]. When an MTD performs access to the eNB, it is active; otherwise, it is inactive. An activated MTD executes the RA control scheme and the four-step RA procedure to establish a connection with the eNB in RACH. In the RA process, we regard an RAO as an access slot, and all access slots have the same duration. According to the 3GPP standard [29], there are 64 preambles available in each RACH in cellular networks, of which only 54 are used for contention-based four-step RA.

A. Activated MTD Traffics and Priorities

Since MTDs in different applications can be activated randomly, the activated MTD traffic can follow a variety of possible statistics. 3GPP defines two types of activated MTD traffic models, Beta distribution traffic and uniform distribution traffic [29]. Beta distribution traffic describes a large number of MTDs accessing the network within a concentrated short period of time. Uniform distribution traffic represents MTDs accessing the network uniformly over a period of time. This provides an example of how to determine the activation traffic model for different types of MTDs.

In this study, without loss of generality, we classify MTDs into two categories: high-priority MTDs and low-priority MTDs. The high-priority MTDs mainly serve delay-sensitive services with bursty traffic, such as eHealth, self-driven vehicles, and public security applications. This type of activated MTD traffic is best represented by Beta distribution traffic. Thus, we assume that each high-priority MTD is activated at time 0 < t < T with probability density f(t). Following a Beta distribution with parameters α = 3 and β = 4, it can be expressed as

f(t) = t^{α−1} (T − t)^{β−1} / (T^{α+β−1} B(α, β)),   (1)

where B(·) represents the Beta function and T indicates the activation period. Assuming that the duration of each RAO is τ, the number of newly activated high-priority MTDs at the i-th RAO is denoted as

ν_i = N_h ∫_{(i−1)τ}^{iτ} f(t) dt,   i = 1, 2, . . . , T/τ,   (2)

where N_h is the total number of high-priority MTDs.

Oppositely, the low-priority MTDs include consumer electronics, factory management sensors, delay-tolerant city pollution detection, etc., which feature looser delay constraints. This type of activated MTD traffic can be represented by uniform distribution traffic, in which the MTDs are uniformly activated during a time period. The number of newly activated low-priority MTDs at the i-th RAO is defined as µ_i ∼ U(0, 2N_l/T) over the activation period (0, T), where N_l represents the total number of low-priority MTDs. Overall, both activated high-priority and low-priority MTDs coexist in the cellular IoT network and perform RA attempts to establish links with the eNB.
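As a concrete illustration of the two activation models in (1)-(2), the following Python sketch generates per-RAO activation counts. The values of N_h, N_l and the activation period are illustrative assumptions, not the paper's simulation settings, and measuring T in RAOs (τ = 1) is our reading of the per-RAO uniform model for µ_i.

import numpy as np
from scipy.stats import beta

# Example values only (not the paper's Table II settings).
N_h, N_l = 3000, 2000      # total high-/low-priority MTDs
T = 2000                    # activation period measured in RAOs, i.e. tau = 1 (assumption)

# High priority, eq. (2): nu_i = N_h * integral of the Beta(3,4) density f(t) over the i-th RAO.
edges = np.arange(T + 1) / T
nu = N_h * np.diff(beta.cdf(edges, a=3, b=4))

# Low priority: mu_i ~ U(0, 2*N_l/T) newly activated MTDs per RAO.
rng = np.random.default_rng(0)
mu = rng.uniform(0.0, 2.0 * N_l / T, size=T)

A = nu + mu                 # eq. (3): total newly activated MTDs per RAO
print(round(nu.sum()), round(mu.sum()))   # ~N_h exactly and, on average, ~N_l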
B. RA Procedure

In the conventional LTE standards, the contention-based RA procedure is divided into four steps in each RAO, described in the following:
• Step 1: RA preamble transmission: By randomly selecting an orthogonal preamble from the preamble resource pool, the activated user equipment (UE) transmits the access request to the eNB. This preamble resource pool is generated by Zadoff-Chu sequence cyclic shifts.
• Step 2: Random access response (RAR): Once the eNB detects the preamble, it broadcasts the RAR message to all UEs, which contains the assigned granted uplink resources, timing instructions, and a temporary identity for each detected preamble.
• Step 3: Connection request: After receiving the RAR message, the UE transmits its connection request message containing the temporary identity to the eNB using the assigned granted uplink resource.
• Step 4: Contention resolution: If the eNB correctly demodulates the UE's connection request message, the contention resolution message contains its temporary identity and is broadcast to the UEs as a response. The UE determines that RA has failed if the contention resolution message does not contain its temporary identity.

Obviously, in this four-step RA process, as long as two or more UEs select the same preamble, a preamble collision occurs, eventually leading to RA failure. To simplify the study of the RA procedure, we assume that RA succeeds if the preamble is selected by an MTD without collision.

C. RA Control Schemes

In order to improve the RA performance, the activated MTDs need to execute the RA control scheme before performing the RA procedure. The most common RA control schemes include the ACB, BO, and hybrid schemes. The details are as follows:
1) ACB Scheme: The ACB scheme is a mechanism to control RA congestion by restricting RA requests in each RAO. Firstly, the eNB broadcasts the ACB factor P_ACB before each RAO. Then each activated MTD randomly generates a number q ∈ [0, 1] and compares it with the received ACB factor P_ACB. The activated MTD executes the four-step RA only when q ≤ P_ACB. Otherwise, the MTD fails the ACB check and repeats the ACB check in the next RAO. In [5], an optimal ACB factor could reduce the access delay, but receiving system information frequently brings huge energy consumption.
2) BO Scheme: In the 3GPP standards, when an MTD fails RA, it starts the BO scheme. The MTD in the BO scheme uniformly generates a BO time T_BO ∼ U(0, BI), where BI is the BO indicator. The MTD must wait for the T_BO time before retrying the RA attempt. However, in most previous research, the BO indicator BI was considered fixed. In fact, BI has a critical impact on the RA performance. For example, in the overloaded case, a small BI could cause the MTDs to perform the ACB check frequently and thus waste the MTDs' energy, while a large BI increases the access delay and even wastes preamble resources. Therefore, an appropriate BI value is crucial for the BO scheme and should be designed jointly with the ACB scheme to improve the energy performance.
3) Hybrid Scheme: A hybrid scheme is a mechanism that combines two or more RA control schemes. In this paper, we focus on the most common hybrid ACB-BO scheme. In the beginning, the activated MTD receives the system information containing the ACB factor P_ACB and the BO indicator BI. Then it executes the ACB check according to P_ACB and performs the four-step RA once the ACB check is successful. Finally, if an MTD fails the ACB check or the four-step RA, it generates a backoff time T_BO to postpone re-access according to the received BI.

III. HH ACB-BO SCHEME AND PROBLEM FORMULATION

Recently, access delay, energy consumption, and service priority have been widely studied and considered as three key performance indicators (KPIs) in RACH. Although the hybrid ACB-BO scheme can balance the access delay and energy consumption, it ignores the priority aspect, while the priority-based RA control schemes rarely consider the delay-energy tradeoff problem. Therefore, in this paper, we propose a novel HH ACB-BO scheme to optimize these three KPIs in RACH, in which the hybrid ACB-BO scheme is exploited to balance the delay-energy tradeoff, and the hierarchical structure is utilized to serve MTDs with different priorities.

A. HH ACB-BO Scheme

The proposed HH ACB-BO scheme is shown in Fig. 1. Generally, it contains the ACB and BO schemes and prioritizes services in RACH. For the ACB scheme, the MTD receives the system information containing the ACB factor and performs the ACB check. For the BO scheme, the MTDs that failed in RA are defined as backlogged MTDs, and the backlogged MTDs are assigned to future RAOs to retry RA attempts, which mitigates the energy wasted on receiving the system information. Thus, the hybrid ACB-BO scheme is exploited to balance the delay-energy tradeoff. In addition, considering the service priority, we use P_h and BI_h to indicate the ACB factor and BO indicator for high-priority MTDs, and P_l and BI_l for low-priority MTDs, where h and l represent the high priority and low priority, respectively. The corresponding explanations of the notations in the HH ACB-BO scheme are given in Table I.

In the i-th RAO, the number of newly activated MTDs can be expressed as

A_i = ν_i + µ_i,   (3)

where ν_i represents the number of newly activated high-priority MTDs and µ_i represents the number of newly activated low-priority MTDs. The number of backlogged MTDs can be expressed as

B_i = Σ_{j=1}^{i} P^{i,j} (N_{Af}^j + N_{Rf}^j),   (4)

where P^{i,j} represents the probability that the backlogged MTDs of the j-th RAO back off to the i-th RAO according to the BO scheme, which can be denoted as

P^{i,j} = { 0,        if i − j > BI^j,
          { 1/BI^j,   if i − j ≤ BI^j.   (5)
Fig. 1. The proposed HH ACB-BO scheme.
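To make the per-RAO mechanics concrete, the following Python sketch simulates one RAO of the hybrid ACB-BO flow for a single priority class: the ACB check, random preamble selection with collision detection, and the uniform backoff draw that feeds the backlog model of (4)-(5). Function and variable names are illustrative, not from the paper.

import random
from collections import Counter

def rao_round(active_mtds, p_acb, bo_indicator, n_preambles=54):
    # One RAO of the hybrid ACB-BO scheme for a single priority class (illustrative only).
    passed, failed_acb = [], []
    for m in active_mtds:                     # ACB check: pass iff the random draw q <= P_ACB
        (passed if random.random() <= p_acb else failed_acb).append(m)
    choice = {m: random.randrange(n_preambles) for m in passed}    # random preamble selection
    load = Counter(choice.values())
    success = [m for m in passed if load[choice[m]] == 1]          # collision-free preambles succeed
    failed_ra = [m for m in passed if load[choice[m]] > 1]         # collided MTDs fail the RA procedure
    # Failed MTDs (ACB or RA) back off by a uniformly drawn number of RAOs, cf. eqs. (4)-(5).
    backoff = {m: random.randint(1, bo_indicator) for m in failed_acb + failed_ra}
    return success, backoff

wins, backlog = rao_round(active_mtds=list(range(120)), p_acb=0.4, bo_indicator=20)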

TABLE I: Summary of Main Notations.

To facilitate the mathematical analysis, we have the following definition:

N = A + B = N^h + N^l,   (6)

where N is the total number of activated MTDs. In detail, it can be expressed as the summation of the newly activated MTDs A and the backlogged MTDs B. Also, it is the summation of the high-priority activated MTDs N^h and the low-priority activated MTDs N^l. Similarly, the MTDs that pass the ACB check, N_As, and those that pass the four-step RA, N_Rs, are also composed of high-priority and low-priority MTDs, which can be expressed as follows:

N_As = N_As^h + N_As^l,
N_Rs = N_Rs^h + N_Rs^l.   (7)

We also assume that once a preamble collision occurs, none of the MTDs in this collision can complete the four-step RA procedure in this RAO. In addition, the MTDs without preamble collision can always successfully pass the four-step RA procedure.

B. Problem Formulation

In this part, a joint delay-energy optimization problem is formulated for the proposed HH ACB-BO scheme. Firstly, with fixed priority weights for each service priority, a mathematical analysis of the access delay and energy consumption is carried out, and the joint delay-energy optimization objective function is formulated. Then, we approximate and scale the objective function to obtain a closed-form solution of the optimal ACB factors and BO indicators in our HH ACB-BO scheme. More details are as follows.

1) Access Delay: The access delay is defined as the number of RAOs for an MTD from its newly activated state to the connection state with the eNB. Thus, minimizing the access delay is equivalent to minimizing the number of RAOs required for MTDs to establish wireless links with the eNB. At the same time, in order to minimize the number of RAOs consumed by MTDs to establish wireless links, each RAO needs to reach the maximum number of successful accesses [30]. Therefore, the access delay minimization problem can be converted into the success access maximization in each RAO, which can be formulated as

arg min τ_d = arg max E[N_Rs | N = n],   (8)

where τ_d represents the access delay. We assume that N^h and N^l are independent. Considering the service priority, the success access maximization problem can be further converted into two sub-problems, i.e., the success access maximization for high-priority and for low-priority MTDs in each RAO, respectively. Therefore, we have

E[N_Rs | N = n] = E[N_Rs^h | N^h = n_h] + E[N_Rs^l | N^l = n_l].   (9)

For the high-priority MTDs, a successful access means that the MTD passes both the ACB check and the RA procedure. Thus, in order to calculate the number of successfully accessing high-priority MTDs, we have

E[N_Rs^h | N^h = n_h] = Σ_{λ_h=1}^{n_h} P(N_As^h = λ_h | N^h = n_h) · E[N_Rs^h | N_As^h = λ_h].   (10)

The ACB check can be viewed as a Bernoulli experiment with ACB check success probability P_h. Thus, the probability mass function of N_As^h = λ_h can be expressed as

P(N_As^h = λ_h | N^h = n_h) = C(n_h, λ_h) P_h^{λ_h} (1 − P_h)^{n_h − λ_h}.   (11)
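A quick Monte-Carlo check of the binomial model in (11) can be run as follows; n_h, P_h and λ are illustrative values chosen only to show that the empirical frequency matches the closed-form probability mass.

import numpy as np
from math import comb

rng = np.random.default_rng(1)
n_h, P_h, lam, trials = 200, 0.2, 40, 200_000
passes = rng.binomial(n_h, P_h, size=trials)          # MTDs passing the ACB check in each trial
empirical = np.mean(passes == lam)
analytic = comb(n_h, lam) * P_h**lam * (1 - P_h)**(n_h - lam)   # eq. (11)
print(empirical, analytic)                             # the two values agree closely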
After the ACB check, the number of MTDs that enter the RA procedure is N_As^h = λ_h. In the RA procedure, a successful access means that the generated preamble does not conflict with other MTDs, and the success probability in the RA procedure, P_m, is

P_m = C(λ_h, 1) · (1/(φ_h M)) · ((φ_h M − 1)/(φ_h M))^{λ_h − 1},   (12)

where M is the number of preamble resources and φ_h represents the high-priority weight [22]. In addition, the number of successfully accessing MTDs is equal to the number of generated preambles without collision in each RAO, which can also be viewed as a Bernoulli experiment. Therefore, the number of successfully accessing high-priority MTDs in each RAO is expressed as

E[N_Rs^h | N_As^h = λ_h] = φ_h M P_m = φ_h M · C(λ_h, 1) · (1/(φ_h M)) · ((φ_h M − 1)/(φ_h M))^{λ_h − 1} = λ_h ((φ_h M − 1)/(φ_h M))^{λ_h − 1}.   (13)

Therefore, according to (11) and (13), we have

E[N_Rs^h | N^h = n_h] = Σ_{λ_h=1}^{n_h} C(n_h, λ_h) P_h^{λ_h} (1 − P_h)^{n_h − λ_h} · λ_h ((φ_h M − 1)/(φ_h M))^{λ_h − 1}
= Σ_{j=1}^{n_h} n_h P_h C(n_h − 1, j − 1) (1 − P_h)^{n_h − j} (P_h − P_h/(φ_h M))^{j − 1}
= n_h P_h (1 − P_h/(φ_h M))^{n_h − 1}.   (14)

Similarly, in each RAO, the number of successful accesses for low-priority MTDs is expressed as

E[N_Rs^l | N^l = n_l] = n_l P_l (1 − P_l/(φ_l M))^{n_l − 1},   (15)

where φ_l represents the low-priority weight and φ_h + φ_l = 1. Accordingly, in each RAO, the total number of successful accesses can be described as

E[N_Rs | N = n] = n_h P_h (1 − P_h/(φ_h M))^{n_h − 1} + n_l P_l (1 − P_l/(φ_l M))^{n_l − 1}.   (16)

2) Energy Consumption: In each RAO, the energy consumption of an MTD consists of two parts: the energy for receiving system information in the ACB scheme and the energy consumed in the RA procedure. As long as the MTD is in the active state, it needs to receive the system information for the ACB check, and only an MTD that passes the ACB check can execute the RA procedure. According to [18], the energy consumption of receiving system information, E_si, can be expressed as

E_si = T_si P_si.   (17)

The energy consumption in the RA procedure, E_ra, can be described as

E_ra = T_msg1 P_msg1 + T_msg2 P_msg2 + T_msg3 P_msg3 + T_msg4 P_msg4,   (18)

where T and P represent the consumed time and average power of each component in E_si and E_ra, respectively. Therefore, the energy consumption in each RAO is denoted as

E = n E_si + λ E_ra = (n_h + n_l) E_si + (n_h P_h + n_l P_l) E_ra,   (19)

where n represents the number of activated MTDs and λ represents the number of MTDs passing the ACB check.

3) Joint Delay-Energy Optimization: As discussed above, the access delay and energy consumption are directly affected by the ACB factors and BO indicators. This part aims to obtain the optimal ACB factors and BO indicators of the HH ACB-BO scheme. According to (16) and (19), in each RAO, the joint delay-energy optimization problem can be formulated as

max_{P_h, P_l, BI_h, BI_l}  n_h P_h (1 − P_h/(φ_h M))^{n_h − 1} + n_l P_l (1 − P_l/(φ_l M))^{n_l − 1} − (n_h + n_l) E_si − (n_h P_h + n_l P_l) E_ra,
s.t. (a): n_h = C_h + N_f^h / BI_h,
     (b): n_l = C_l + N_f^l / BI_l,
     (c): 0 ≤ P_h, P_l ≤ 1,
     (d): 1 ≤ BI_h, BI_l ≤ 100,   (20)

where n_h and n_l represent the numbers of high-priority and low-priority activated MTDs, respectively. Taking the high priority as an example, in the i-th RAO, the activated MTDs consist of the newly activated MTDs A_h and the backlogged MTDs B_h. We assume that the number of backlogged MTDs from previous RAOs is specified as a constant B_{i−1,h}, and the latest backlogged MTDs are denoted as N_f^h / BI_h, where N_f = N_Af + N_Rf represents the number of MTDs that failed in the ACB check or the RA procedure in the last RAO. Therefore, we have C_h = A_h + B_{i−1,h} and C_l = A_l + B_{i−1,l}. We also assume that 100 is the upper limit of the BO indicator.

For simplicity, we omit the energy consumption of the RA procedure in the objective function. Therefore, the original problem in (20) can be approximately transformed into (21):

max_{P_h, P_l, BI_h, BI_l}  n_h P_h (1 − P_h/(φ_h M))^{n_h − 1} + n_l P_l (1 − P_l/(φ_l M))^{n_l − 1} − (n_h + n_l) E_si,
s.t. (a), (b), (c), (d).   (21)

We utilize the joint derivative to solve the optimization problem in (21) and obtain the optimal solution in Proposition 1.

Proposition 1: Note that A_i ≥ 0, B_{i−1} ≥ 0, N_f^h ≥ 0, N_f^l ≥ 0, M ≥ 1, BI_h ≥ 1, and BI_l ≥ 1. These parameters are all integers in practice. Under the fixed priority weights φ_h and φ_l, by analyzing the joint derivative of the objective function, a closed-form solution of the optimal ACB factors and BO indicators is obtained in (22) below.

Proof: See Appendix A.
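The four-case rule of Proposition 1, stated as equation (22) just below, applies the same test to each priority class independently and can be transcribed directly; the following Python sketch does so, where the guard for N_f = 0 is our addition to respect the constraint BI ≥ 1 and the example numbers are illustrative.

from math import ceil

def hh_ideal_update(M, phi_h, phi_l, C_h, C_l, Nf_h, Nf_l, bi_max=100):
    # Per-priority application of eq. (22); bi_max = 100 is the BO upper limit from constraint (d).
    def per_class(phi, C, Nf):
        if phi * M <= C + Nf / bi_max:                  # overloaded: throttle with ACB, back off maximally
            return phi * M / (C + Nf / bi_max), bi_max
        # underloaded: admit everyone, back off just enough failed MTDs per RAO
        bi = ceil(Nf / (phi * M - C)) if Nf > 0 else 1  # Nf = 0 guard added to keep BI >= 1
        return 1.0, bi
    P_h, BI_h = per_class(phi_h, C_h, Nf_h)
    P_l, BI_l = per_class(phi_l, C_l, Nf_l)
    return P_h, P_l, BI_h, BI_l

# Example: M = 54 preambles, weights 4/5 and 1/5, an overloaded low-priority class.
print(hh_ideal_update(54, 0.8, 0.2, C_h=20, C_l=300, Nf_h=50, Nf_l=400))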
The closed-form solution of Proposition 1 is

P_h = φ_h M / (C_h + N_f^h/100),  P_l = φ_l M / (C_l + N_f^l/100),  BI_h = 100,  BI_l = 100,   if φ_h M ≤ C_h + N_f^h/100 and φ_l M ≤ C_l + N_f^l/100;
P_h = 1,  P_l = φ_l M / (C_l + N_f^l/100),  BI_h = ⌈N_f^h / (φ_h M − C_h)⌉,  BI_l = 100,   if φ_h M > C_h + N_f^h/100 and φ_l M ≤ C_l + N_f^l/100;
P_h = φ_h M / (C_h + N_f^h/100),  P_l = 1,  BI_h = 100,  BI_l = ⌈N_f^l / (φ_l M − C_l)⌉,   if φ_h M ≤ C_h + N_f^h/100 and φ_l M > C_l + N_f^l/100;
P_h = 1,  P_l = 1,  BI_h = ⌈N_f^h / (φ_h M − C_h)⌉,  BI_l = ⌈N_f^l / (φ_l M − C_l)⌉,   if φ_h M > C_h + N_f^h/100 and φ_l M > C_l + N_f^l/100.   (22)

Although a closed-form solution of the optimal ACB factors and BO indicators is obtained in (22), the exact prior information (such as A and B) needs to be acquired in the RA process, which is hard in practice. Moreover, the BO indicator has a long-term impact on the joint delay-energy optimization according to (4) and (5), while for simplicity of analysis, the number of backlogged MTDs in the previous RAOs affected by the previous BO indicators was specified as the constant B_{i−1}. Furthermore, since the fixed high-priority weight is larger than the low-priority weight, the RA performance shrinks when the number of low-priority MTDs far exceeds that of high-priority MTDs. For these reasons, we apply the DRL algorithm to the proposed HH ACB-BO scheme to adjust the ACB factors and BO indicators in an online manner, as shown in Sec. IV.

IV. DEEP REINFORCEMENT LEARNING FOR HH ACB-BO SCHEME

DRL is capable of handling long-term decision optimization problems and has been widely applied in various fields of wireless communication networks, including channel selection [31], power optimization [32], etc. In this study, we also use DRL to dynamically adjust the ACB factors and BO indicators. Taking advantage of the aforementioned mathematical analysis of the HH ACB-BO scheme, a joint reward function is developed to optimize the three KPIs by interacting with the RA process, and a punishment sub-reward is designed for the adaptive priority control. Moreover, to facilitate multi-priority expansion and reduce the complexity of the action space in the training process, we design a multi-agent DRL framework, where two DRL agents serve the high-priority and low-priority MTDs, respectively. In the following, we first introduce the preliminary DRL definitions. Next, we introduce the DRL algorithm and propose a multi-agent HH-DRL algorithm for the HH ACB-BO scheme.

A. Definitions

In order to implement the DRL algorithm, the eNB needs to collect some important information in each RAO: for example, the number of generated preambles without collision n_s, the number of collided preambles n_c, the number of idle preambles n_i, the number of MTDs passing the ACB check λ, and the number of high-priority MTDs that fail access n_f. Note that this information has been widely used in previous works [8], [20], [21], [33], and obtaining it is beyond the scope of this study. The details of the DRL definitions applied in the HH ACB-BO scheme are presented as follows.

Agent: As the main body of the DRL algorithm, the agent has the ability to collect the environment information and execute the decision-making. In the RA process, the agent is deployed at the eNB to collect the aforementioned important information, implement the DRL algorithm, and adjust the ACB factors and BO indicators online.

State: The state space is defined as S = [n_s, n_c, n_i], and these states are related to the number of activated MTDs in each RAO.

Action: The action space is defined as A = [P_h, P_l, BI_h, BI_l], which includes the ACB factors (P_h, P_l) ∈ [0, 1] and the BO indicators (BI_h, BI_l) ∈ {10, 20, . . . , 100}. By implementing the DRL algorithm, the ACB factors and BO indicators are dynamically adjusted in an online manner to adapt to the time-varying traffic. Although the ACB factor is continuous and the BO indicator is discrete, in order to integrate the two schemes into one agent, we also discretize the ACB factor with a step of 0.1.

Reward function: In order to optimize the three KPIs in RACH, a specific joint reward function is designed as the driver of the DRL algorithm to update the decision-making policy. According to the joint delay-energy optimization in (20) and considering the adaptive priority control, the joint reward function can be expressed as

R = φ_d R_d + φ_e R_e − φ_p R_p,   (23)

where φ_d, φ_e, and φ_p are the weights of the access delay sub-reward R_d, the energy consumption sub-reward R_e, and the priority sub-reward R_p, respectively. According to (8), minimizing the access delay is equivalent to maximizing the number of successfully accessing MTDs in each RAO; thus R_d can be denoted by the number of generated preambles without collision, that is, R_d = n_s. In addition, the energy consumption in each RAO can be denoted as E according to (19); thus R_e = 1/E. For adaptive priority control, we set a punishment mechanism to indicate access failure. The priority sub-reward can be described by R_p = n_f, where n_f is the number of high-priority MTDs that fail access. Accordingly, the specific joint reward function is formulated as

R = φ_d n_s + φ_e / E − φ_p n_f.   (24)
Fig. 2. The multi-agent DRL framework for HH ACB-BO scheme.
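A minimal sketch of the state, reward, and discretized action definitions above is given below; the sub-reward weights are placeholders and not the paper's values.

def rao_state(n_success, n_collided, n_idle):
    # State s = [n_s, n_c, n_i] observed by the eNB after each RAO.
    return [n_success, n_collided, n_idle]

def joint_reward(n_s, energy, n_f, phi_d=1.0, phi_e=1.0, phi_p=1.0):
    # Joint reward of eq. (24): R = phi_d * n_s + phi_e / E - phi_p * n_f.
    return phi_d * n_s + phi_e / energy - phi_p * n_f

# Discretized per-priority sub-action space: ACB factor in steps of 0.1 and BO indicator
# in {10, 20, ..., 100}, i.e. 100 sub-actions per agent.
SUB_ACTIONS = [(p / 10, bi) for p in range(1, 11) for bi in range(10, 101, 10)]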

B. Deep Reinforcement Learning

Since traditional RL (e.g., Q-learning) cannot deal with large-dimensional state and action spaces in the Q-function, it has limitations as the RL scale increases [34]. DRL, as a promising method, can approximate the Q-function by a neural network to overcome the curse of dimensionality. In the DRL algorithm, a DQN [35] is exploited to replace the Q-function, that is, Q(s, a, θ), where θ represents the neural network parameters. Similar to Q-learning, the agent collects the information (s, a, r, s′) to construct an experience replay buffer used to update the neural network parameters θ. Moreover, double DQN [36] is used to avoid the over-estimation problem caused by the inaccuracy of the predicted Q-value in DQN. In double DQN, the target network Q(s, a, θ_target) is used to calculate the real Q-value while the current network Q(s, a, θ_current) is used to update the DQN parameters. Based on the dataset sampled from the experience replay buffer, the loss function of the current network is defined by

L(θ_current) = (1/K) Σ_{j=1}^{K} (Y_j − Q(s_j, a_j; θ_current))^2,   (25)

where K is the size of the dataset and θ_current represents the parameters of the current network. Y_j is the label Q-value calculated by the target network, which is defined as

Y_j = R_j + γ Q(s′, arg max_{a′} Q(s′, a′, θ_current); θ_target).   (26)

Finally, the current network parameters θ_current are updated by using the gradient descent algorithm, which can be expressed as

θ_current^{t+1} = θ_current^t − η ∇L(θ_current^t),   (27)

where η is the deep learning rate and ∇ represents the gradient of the loss function with respect to θ_current.

C. Proposed Multi-Agent HH-DRL Algorithm

According to the definition of the action, the complexity of the action space can be expressed as O(100^P), where 100 is the complexity of the action space for a single priority level and P is the number of priority levels. As a result, with an increasing number of service priorities, a single DRL agent may be unsuitable due to the huge action space. To solve this problem, we propose a multi-agent HH-DRL algorithm which provides a sole DRL agent for each priority. The multi-agent DRL framework designed for the HH ACB-BO scheme is shown in Fig. 2. The action is divided into two sub-actions corresponding to each priority, where a_h represents the sub-action of the high-priority agent and a_l represents the sub-action of the low-priority agent. Thus, the complexity of the action space can be expressed as O(100 × P). By introducing this multi-agent structure, the complexity of the action space is greatly reduced for each agent, and the scheme can better adapt to future networks with growing numbers of services. In the following, we introduce the workflow of the proposed multi-agent HH-DRL algorithm, as shown in Algorithm 1.

At the beginning of the multi-agent HH-DRL algorithm, the eNB initializes the DRL agents for each priority, including the DQN structure and the hyperparameters in Table II. Generally speaking, there are two processes: the interaction process and the DRL training process. In each RAO, the interaction process between the agents and the environment is shown in Steps 1-3 of Fig. 2. Note that different DRL agents share a common reward and state when interacting with the RA process. Firstly, the DRL agents observe the state s, which is fed into the current network of each agent to calculate the Q-values. Then, the sub-actions are selected to adjust the ACB factors and BO indicators based on the Q-values according to the ϵ-greedy algorithm

π(s, a) = { ϵ/m + 1 − ϵ,  if a = arg max_{a ∈ A} Q_π(s, a),
          { ϵ/m,          otherwise.   (28)
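As a sketch of the double-DQN update in (25)-(26) and the ϵ-greedy selection in (28), the following PyTorch code is one possible realization; the paper does not specify its deep learning framework or network sizes, so the layer dimensions are assumptions.

import torch
import torch.nn as nn

class QNet(nn.Module):
    # Small fully connected Q-network; layer sizes are illustrative, not the paper's.
    def __init__(self, state_dim=3, n_actions=100, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, x):
        return self.net(x)

def double_dqn_loss(current, target, batch, gamma=0.9):
    s, a, r, s2 = batch                                            # tensors sampled from the replay buffer
    q_sa = current(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s_j, a_j; theta_current)
    with torch.no_grad():
        a_star = current(s2).argmax(dim=1, keepdim=True)            # action picked by the current network
        y = r + gamma * target(s2).gather(1, a_star).squeeze(1)     # label Q-value of eq. (26)
    return nn.functional.mse_loss(q_sa, y)                          # loss of eq. (25)

def epsilon_greedy(qnet, state, eps, n_actions=100):
    # Sampling form of the epsilon-greedy policy in eq. (28).
    if torch.rand(1).item() < eps:
        return torch.randint(n_actions, (1,)).item()
    with torch.no_grad():
        return qnet(state.unsqueeze(0)).argmax(dim=1).item()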
Algorithm 1 Multi-Agent HH-DRL Algorithm
Input: DQN structure, iterations I, state dimension and action dimension.
1: Set the algorithm hyperparameters: learning rate α, discount factor γ, initial exploration rate ϵ_i, final exploration rate ϵ_f, deep learning factor η, dataset size K, synchronization frequency Z;
2: Initialize each agent's Q-network and clear the experience replay sets D;
3: for each iteration i = 1, 2, . . . , I do
4:   ----- Interaction with the RA process -----
5:   Each agent observes the state s = [n_s, n_c, n_i];
6:   Input state s into each agent's current Q-network and obtain the action Q-values;
7:   Each agent selects its action (a_h, a_l) according to the ϵ-greedy algorithm (28);
8:   After executing action (a_h, a_l), the agents obtain the immediate reward R based on (24) and observe the new state s′;
9:   Save (s, a_h, R, s′) and (s, a_l, R, s′) into the corresponding experience replay sets (D_h, D_l);
10:  ----- DRL training -----
11:  Randomly sample K data from D_h and D_l, and obtain the label Q-values by calculating the target Q-values based on (26);
12:  Update the current network loss function by (25), and perform the gradient descent algorithm to update the current network parameters θ_current by (27);
13:  if i is an integer multiple of Z then
14:    Synchronize the target network parameters θ_target with the current network parameters θ_current;
15:  else
16:    Set s = s′ and go back to step 6 until the end of the activation period;
17:  end
18: end
19: Return the well-trained network models and obtain the policy π* with the maximum Q-value by (29).

After executing the current action (a_h, a_l), the agents obtain the immediate reward R from the joint reward function in (24) and observe the new state s′. Finally, the experience data (s, a_h, R, s′) and (s, a_l, R, s′) are stored in the replay buffers for DRL training.

The training process of the multi-agent DRL is shown in Steps 4-7 of Fig. 2. Firstly, mini-batch datasets are randomly sampled from the replay buffer according to the experience replay mechanism, which not only reuses the existing empirical data but also scatters the original sequences and eliminates their correlation. Then the mean-square loss function L(θ_current) is calculated from the current network Q-value and the target network Q-value according to (25). After that, the current network parameters are updated by the gradient descent algorithm in (27). In addition, the target network parameters are synchronized with the current network parameters every Z iterations. The agents obtain the well-trained DQN networks when the reward value is stable and the algorithm converges. Then the optimal policy π* is obtained according to the greedy rule

π*(a|s) = { 1,  if a = arg max_{a ∈ A} Q*(s, a),
          { 0,  otherwise.   (29)

Finally, the policy π* is deployed at the eNB to dynamically adjust the ACB factors and BO indicators in an online manner.

The complexity of Algorithm 1 is mainly determined by the number of agents and the two neural networks. The number of agents is equal to the number of priorities, i.e., P. Considering that the current Q-network and the target Q-network both contain J fully connected layers, for each agent the time complexity can be calculated as [37]

Σ_{j=0}^{J−1} n_{cur,j} n_{cur,j+1} + Σ_{j=0}^{J−1} n_{tar,j} n_{tar,j+1} = O(2 Σ_{j=0}^{J−1} n_{cur,j} n_{cur,j+1}).   (30)

Thus, the time complexity related to all agents is O(2P Σ_{j=0}^{J−1} n_{cur,j} n_{cur,j+1}), where n_{cur,j} denotes the number of units in the j-th DNN layer of the current Q-network. Similarly, n_{tar,j} is the number of units in the j-th DNN layer of the target Q-network, and n_{cur,j} = n_{tar,j}.

D. Online Policy Transfer

In practical networks, MTC has multiple traffic modes (e.g., a uniform traffic mode and a bursty traffic mode), and the traffic modes are time-varying. Single-policy-based online training in DRL can hardly guarantee the policy effectiveness when the traffic mode changes. For example, when the MTC traffic happens to be bursty, if the agent continues online training based on the previous uniform-traffic-mode policy, the agent needs to re-adapt to the bursty traffic. It takes a long time to adjust the policy, and the policy may even fail to converge by the end of the bursty traffic. Similarly, the same problem arises when the bursty traffic changes back to uniform traffic. In this section, multi-policy online training with PT is applied to the proposed multi-agent HH-DRL algorithm, as shown in Fig. 3. When the MTC traffic mode changes, the agent switches to the corresponding policy in time through online PT to avoid the policy readjustment.
Fig. 3. Online PT in the proposed algorithm.
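The interaction part of Algorithm 1 combined with the policy switching of online PT can be sketched as follows; the policy bank, traffic-mode labels, and env_step callback are illustrative names (the DQN policy itself is stubbed out).

import random
from collections import deque, defaultdict

class PriorityAgent:
    # One agent per priority level; the DQN of Section IV-B is stubbed by a random policy here.
    def __init__(self, actions, buffer_size=10_000):
        self.actions = actions
        self.replay = deque(maxlen=buffer_size)
    def act(self, state):
        return random.choice(self.actions)       # placeholder for epsilon-greedy DQN selection, eq. (28)
    def remember(self, s, a, r, s2):
        self.replay.append((s, a, r, s2))         # per-priority replay sets D_h, D_l (Algorithm 1, step 9)

ACTIONS = [(p / 10, bi) for p in range(1, 11) for bi in range(10, 101, 10)]

# One policy (pair of agents) per traffic mode, reused through online PT when the mode changes.
policy_bank = defaultdict(lambda: {"high": PriorityAgent(ACTIONS), "low": PriorityAgent(ACTIONS)})

def rao_step(mode, state, env_step):
    # One RAO of Algorithm 1 with online PT: pick the policy of the current traffic mode,
    # let both agents act on the shared state, and store the shared reward.
    agents = policy_bank[mode]
    actions = {p: agent.act(state) for p, agent in agents.items()}
    reward, next_state = env_step(actions)        # runs the HH ACB-BO RA process for this RAO
    for p, agent in agents.items():
        agent.remember(state, actions[p], reward, next_state)
    return next_state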

For simple networks, there are two service priorities (i.e., high priority and low priority) and the scale of the traffic does not change. The policy changing point is the moment at which the agent performs PT. It occurs when bursty traffic appears in the network (e.g., the high-priority MTDs are activated) or disappears from the network (e.g., the high-priority MTDs are deactivated). For complex networks, there are multiple service priorities and the traffic scale changes over time. The eNB then needs to provide more policies for the different service priorities and traffic scales. In addition, the eNB also provides a traffic mode recognition function to help the agent determine the policy changing points. This traffic mode recognition function, which is beyond the scope of this paper, can be realized actively by the eNB identifying the devices, or passively by the agent identifying the reward changes. Without loss of generality, we analyze the performance of the multi-policy scheme with PT under the simple network in Section V.

V. PERFORMANCE EVALUATION

In this section, we provide simulation results to compare the performance of the proposed scheme with the existing schemes in terms of success access, access delay, energy consumption, and service priority. These schemes are implemented using Python 3.5 on a 64-bit computer with a Core i5 processor and 8 GB RAM. The cellular-based IoT system parameters provided by 3GPP are listed in Table II, where the traffic model parameters refer to [29] and [38], and the energy consumption parameters refer to [18]. In addition, the access failure conditions of MTDs with different priorities are defined according to the delay requirements in [21]. Under this setting, the eNB can provide 200 RAOs per second for the MTDs. Following the Beta distribution with parameters (α = 3, β = 4) and the uniform distribution µ_i ∼ U(0, 20), Fig. 4 plots the number of newly activated high-priority and low-priority MTDs during the activation time. It shows that the number of MTDs with different priorities changes in real time. Moreover, we compare the proposed multi-agent HH-DRL algorithm with the following existing algorithms:
1) ACB-ideal [5]: This algorithm adjusts the ACB factor under the assumption that the exact number of activated MTDs is pre-known in each RAO.
2) Soft priority-ACB [22]: In each RAO, this algorithm adaptively assigns the preamble resources to each service priority according to both the fixed priority weights and the number of activated MTDs.
3) Hybrid-DRL [18]: Using the DRL method, this algorithm dynamically configures the same ACB factor and BO indicator for all service priorities without considering service variety.
4) HH-ideal: This algorithm assigns the preamble resources to each service priority according to the fixed priority weights, and dynamically configures the ACB factors and BO indicators for each service priority according to the pre-known number of activated MTDs based on (22).

Fig. 4. Number of newly activated MTDs during the activation time, N_h = 3000, N_l = 2000.

In summary, the hybrid-based schemes include the hybrid-DRL, HH-ideal, and multi-agent HH-DRL algorithms. The priority-based schemes include the soft priority-ACB, HH-ideal, and multi-agent HH-DRL algorithms. To make fair comparisons, the fixed priority weights in the soft priority-ACB algorithm are the same as those in the HH-ideal algorithm, as used in (22), with φ_h = 4/5 and φ_l = 1/5. In addition, both the hybrid-DRL algorithm and the proposed multi-agent HH-DRL algorithm use the same DRL hyperparameters, which are presented in Table II.

A. Success Access Performance

The success access performance includes the average number of successfully accessing MTDs per RAO and the average access success probability, where the access success probability is the probability that an MTD successfully establishes a link with the eNB within its delay requirement.
TABLE II: System Parameters.

Fig. 5. Average number of success accesses for different RA control schemes, M = 54.

Fig. 6. Average access success probability for different RA control schemes, M = 54.

Fig. 5 compares the average number of successful accesses per RAO under the different RA control schemes. Overall, the ACB-ideal algorithm outperforms the other algorithms because it can obtain the optimal ACB factor with the pre-known number of activated MTDs. More specifically, with the increase of high-priority MTDs as shown in Fig. 5(a), the performance of the HH-ideal algorithm and the multi-agent HH-DRL algorithm is slightly lower than that of the ACB-ideal algorithm. In addition, it can be seen that the number of successful accesses per RAO of the proposed multi-agent HH-DRL algorithm is stable at 18.

However, in Fig. 5(b), with the increase of low-priority MTDs, the average number of successful accesses per RAO decreases gradually under the HH-ideal algorithm. This seriously affects the utilization of the preamble resources, since the unfavorable fixed priority weights provide too many preamble resources for the high-priority MTDs, which are the minority in the network, while providing few for the low-priority MTDs, which are the majority. Conversely, the proposed multi-agent HH-DRL algorithm can ensure the preamble resource utilization through its adaptive priority control.
Fig. 7. Average access delay for different RA control schemes, M = 54.

Fig. 8. Average energy consumption for different RA control schemes, M = 54.

Fig. 6 presents the performance of average access success probability for different RA control schemes. With the increase of high-priority MTDs as shown in Fig. 6(a), the performance of the priority-based scheme is better than that of other schemes. This is because the priority-based schemes could provide more preamble resources for high-priority MTDs without affecting the performance of low-priority MTDs. However, in Fig. 6(b), with the increase of low-priority MTDs, the unfavorable fixed priority weights under the HH-ideal algorithm can hardly satisfy the delay requirement of low-priority MTDs, leading to a sharp drop in performance once the number of low-priority MTDs is far more than that of high-priority MTDs.

B. Access Delay and Energy Consumption

Fig. 7 shows the average access delay performance under different RA control schemes. The average access delay is defined as the average number of RAOs for an MTD from its newly activated state to access success (i.e., establishing a link with the eNB) or access failure. With the increase of high-priority MTDs as shown in Fig. 7(a), the access delay performance of the proposed HH-ideal and multi-agent HH-DRL algorithms is close to that of the ACB-ideal algorithm. However, with the increase of low-priority MTDs as shown in Fig. 7(b), the performance of the HH-ideal algorithm deteriorates. From (22), it can be seen that the low-priority ACB factor of HH-ideal is $P_l = \phi_l M / n_l$ when the low-priority MTDs are overloaded. Due to the small low-priority weight $\phi_l$ and a large number of low-priority activated MTDs $n_l$, the probability of the low-priority MTDs passing the ACB check is trivial. This seriously affects the access delay performance of low-priority MTDs. Accordingly, the average access delay of the HH-ideal algorithm is also affected and deteriorated. However, the soft priority-ACB and the proposed multi-agent HH-DRL algorithms could guarantee the access delay of both low-priority and high-priority MTDs through the online adjustment of $P_h$ and $P_l$ with adaptive priority control.

Fig. 8 compares the performance of different RA control schemes in terms of the average energy consumption. Note that the BO scheme could prevent a large number of backlogged MTDs from frequently executing the ACB check in the overloaded case. It is clearly observed that the HH ACB-BO algorithm achieves much lower energy consumption than the ACB-ideal and soft priority-ACB algorithms in both Fig. 8(a) and Fig. 8(b).

C. Priority Performance Evaluation

Fig. 9 further depicts the average access delay performance. Fig. 9(a) and Fig. 9(b) show that the priority-based schemes could reduce the access delay for high-priority MTDs, since these schemes provide more preamble resources for high-priority MTDs. However, the cost is that the access delay of low-priority MTDs is increased, as shown in Fig. 9(c) and Fig. 9(d).


Fig. 9. Average access delay of prioritized MTDs for different RA control schemes, M = 54.
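For concreteness, the access delay metric defined above (RAOs elapsed from activation to success or failure) and the energy metric compared in Fig. 8 can be tallied with a per-RAO event loop of the following shape. This is a minimal, hedged Python sketch: the ACB factor, BO indicator, energy units, and the exact barring/back-off bookkeeping are assumptions for illustration, not the simulator or the parameter values used in this paper. It does, however, reflect the point made above that backed-off MTDs skip the ACB check, which is where the energy saving of the BO mechanism comes from.

import random

# Assumed constants for illustration only (not the paper's values).
M = 54            # preambles per RAO
P_ACB = 0.6       # ACB factor broadcast by the eNB (kept fixed here)
BI = 20           # back-off indicator, in RAOs (kept fixed here)
E_CHECK, E_TX = 1.0, 5.0   # assumed energy units per ACB check / preamble transmission

class MTD:
    def __init__(self):
        self.backoff = 0      # remaining back-off RAOs
        self.delay = 0        # RAOs elapsed since activation
        self.energy = 0.0     # accumulated energy
        self.done = False
        self.preamble = None

def run_rao(mtds):
    """One RAO: backed-off MTDs wait; the rest run the ACB check and contend."""
    contenders = []
    for d in (m for m in mtds if not m.done):
        d.delay += 1
        if d.backoff > 0:             # backed-off MTDs skip the ACB check entirely
            d.backoff -= 1
            continue
        d.energy += E_CHECK
        if random.random() <= P_ACB:  # passed the ACB check: transmit a preamble
            d.energy += E_TX
            d.preamble = random.randrange(M)
            contenders.append(d)
        else:                         # barred: back off before retrying
            d.backoff = random.randint(1, BI)
    chosen = {}
    for d in contenders:
        chosen.setdefault(d.preamble, []).append(d)
    for group in chosen.values():
        if len(group) == 1:           # collision-free preamble => access success
            group[0].done = True
        else:
            for d in group:           # collided MTDs back off and retry later
                d.backoff = random.randint(1, BI)

mtds = [MTD() for _ in range(2000)]
for _ in range(500):
    run_rao(mtds)
succ = [m for m in mtds if m.done]
print(len(succ), "of", len(mtds), "MTDs succeeded")
print("average access delay (RAOs):", sum(m.delay for m in succ) / max(1, len(succ)))
print("average energy per MTD:", sum(m.energy for m in mtds) / len(mtds))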

In addition, since the minority high-priority MTDs have a higher priority weight in the HH-ideal scheme, their access delay performance outperforms other schemes as shown in Fig. 9(b), and meanwhile the access delay of low-priority MTDs is relatively poor in Fig. 9(d). Compared with the HH-ideal algorithm, the soft priority-ACB and the proposed multi-agent HH-DRL algorithms could improve the access delay performance of high-priority MTDs without causing serious performance loss to the low-priority MTDs by adaptive priority control, even when the number of low-priority MTDs is large. However, it is worth noting that the soft priority-ACB algorithm hardly provides any optimization of energy consumption.

Overall, these simulation results verify that the proposed multi-agent HH-DRL algorithm is able to optimize the access delay, energy consumption, and service priority in multi-priority RA. Moreover, the performance of the proposed multi-agent HH-DRL algorithm is very close to that of the HH-ideal algorithm, and it even outperforms the HH-ideal algorithm in some extreme scenarios by providing the adaptive priority control.

D. Convergence Performance Evaluation

Fig. 10. Average reward of each iteration in the training process, M = 54, Nh = 4000, Nl = 2000.

Fig. 10 plots the convergence behavior of the average reward under the proposed algorithm (i.e., the multi-agent HH-DRL algorithm with PT) and the multi-agent HH-DRL algorithm without PT (i.e., the single-policy scheme), where bursty traffic occurs after 10 iterations of the uniform traffic period. Due to PT, the proposed algorithm can switch to the corresponding policy when encountering different traffic modes. As a result, it can be seen that after 50 iterations of the bursty traffic period, the proposed policy gradually converges. Besides, slight fluctuations can be observed under the proposed algorithm. The reason may be the following: in the proposed algorithm, multiple agents are utilized, and each agent explores only part of the action space; therefore, the training result of each agent, obtained over a partial action space, may reach a local optimum, resulting in the convergence fluctuations. We can also see that, when bursty traffic occurs, the single-policy training scheme without PT cannot converge in a short time. Moreover, the single policy needs to be retrained when the bursty traffic is converted back into uniform traffic.
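To illustrate the PT idea discussed above, the following is a hedged Python sketch; the traffic-mode detector, the toy traffic trace, and the dictionary-style "policies" are our own illustrative assumptions, not the paper's implementation. One policy is pre-trained offline per known traffic mode, and whenever the detected mode changes, online fine-tuning is warm-started from the matching stored policy instead of retraining a single policy from scratch.

import random
from copy import deepcopy

pretrained = {                       # policies pre-trained offline per traffic mode
    "uniform": {"mode": "uniform", "updates": 0},
    "bursty":  {"mode": "bursty",  "updates": 0},
}

def detect_traffic_mode(arrivals_per_rao):
    """Toy detector (assumption): call a period bursty if its peak arrival count
    is far above its mean."""
    avg = sum(arrivals_per_rao) / len(arrivals_per_rao)
    return "bursty" if max(arrivals_per_rao) > 3 * avg else "uniform"

def train_one_iteration(policy, arrivals_per_rao):
    """Stand-in for one DRL training iteration that fine-tunes the active policy."""
    policy["updates"] += 1

# Toy trace: 10 iterations of uniform traffic followed by 50 iterations of bursty traffic.
trace = [[random.randint(5, 15) for _ in range(20)] for _ in range(10)]
trace += [[0] * 19 + [400] for _ in range(50)]

current_mode, policy = None, None
for arrivals in trace:
    mode = detect_traffic_mode(arrivals)
    if mode != current_mode:                  # traffic mode changed:
        current_mode = mode                   # warm-start from the stored policy
        policy = deepcopy(pretrained[mode])   # instead of retraining from scratch
    train_one_iteration(policy, arrivals)
print("final mode:", current_mode, "| fine-tuning updates on it:", policy["updates"])

This mirrors the behavior reported in Fig. 10: the warm-started policy only needs a short fine-tuning phase after the traffic mode switches, whereas a single policy trained without PT must effectively re-learn.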

VI. CONCLUSION

In this paper, we have studied the joint delay-energy optimization problem for multi-priority RA. Firstly, a novel HH ACB-BO scheme is proposed to overcome the serious preamble collision problem and enhance the RA performance. Under the fixed priority weights for each service priority, a joint delay-energy optimization problem is formulated for the proposed HH ACB-BO scheme, and the closed form of the optimal ACB factor and BO indicator adjustment is derived as the HH-ideal algorithm. Moreover, in order to realize the adaptive priority control, the DRL algorithm is applied to the proposed HH ACB-BO scheme to adjust the ACB factors and BO indicators. Specifically, we develop a multi-agent HH-DRL algorithm inspired by the hierarchical structure of the HH ACB-BO scheme, where online PT is applied to guarantee the policy effectiveness in practical networks. Finally, we conduct extensive experiments based on the 3GPP reports. The simulation results verify the effectiveness of the proposed multi-agent HH-DRL algorithm in terms of average access success probability, access delay, and energy consumption. Besides, the performance of the proposed multi-agent HH-DRL algorithm is very close to that of the HH-ideal algorithm, and it even outperforms the HH-ideal algorithm in some extreme scenarios by providing the adaptive priority control.

APPENDIX A
PROOF OF PROPOSITION 1

To simplify the analysis, we convert the original problem by substituting $n_h = C_h + N_f^h/BI_h$ and $n_l = C_l + N_f^l/BI_l$ into the objective function in (21), yielding

$$ f(X) = P_h\Big(C_h + \frac{N_f^h}{BI_h}\Big)\Big(1 - \frac{P_h}{\phi_h M}\Big)^{C_h + \frac{N_f^h}{BI_h} - 1} + P_l\Big(C_l + \frac{N_f^l}{BI_l}\Big)\Big(1 - \frac{P_l}{\phi_l M}\Big)^{C_l + \frac{N_f^l}{BI_l} - 1} - \Big(C_h + C_l + \frac{N_f^h}{BI_h} + \frac{N_f^l}{BI_l}\Big)E_{si}, \quad (A.1) $$

where $X$ denotes the set of ACB factors and BO indicators $\{P_h, P_l, BI_h, BI_l\}$, and the optimal $X$ could be defined as

$$ X^{*} = \arg\max_{X \in \{(c),\,(d)\}} f(X), \quad (A.2) $$

where $\{(c), (d)\}$ are the constraints of the ACB factors and BO indicators in problem (20).

Taking the first-order derivatives of $f(X)$, we have

$$ \frac{\partial f(X)}{\partial P_h} = n_h\Big(1 - \frac{P_h}{\phi_h M}\Big)^{n_h - 2}\Big(1 - \frac{P_h n_h}{\phi_h M}\Big), \qquad \frac{\partial f(X)}{\partial P_l} = n_l\Big(1 - \frac{P_l}{\phi_l M}\Big)^{n_l - 2}\Big(1 - \frac{P_l n_l}{\phi_l M}\Big), $$

$$ \frac{\partial f(X)}{\partial BI_h} = -\frac{P_h N_f^h}{BI_h^2}\Big(1 - \frac{P_h}{\phi_h M}\Big)^{n_h - 1}\Big(1 + n_h \log\Big(1 - \frac{P_h}{\phi_h M}\Big)\Big) + \frac{N_f^h}{BI_h^2}E_{sr}, $$

$$ \frac{\partial f(X)}{\partial BI_l} = -\frac{P_l N_f^l}{BI_l^2}\Big(1 - \frac{P_l}{\phi_l M}\Big)^{n_l - 1}\Big(1 + n_l \log\Big(1 - \frac{P_l}{\phi_l M}\Big)\Big) + \frac{N_f^l}{BI_l^2}E_{sr}. \quad (A.3) $$

By setting $\partial f(X)/\partial P_h = 0$ and $\partial f(X)/\partial P_l = 0$, we have $P_h = \phi_h M/n_h$ and $P_l = \phi_l M/n_l$. Then, by substituting these results into $\partial f(X)/\partial BI_h$ and $\partial f(X)/\partial BI_l$, we have $1 + n\log(1 - 1/n) < 0$ when $n > 1$. Thus, both $\partial f(X)/\partial BI_h > 0$ and $\partial f(X)/\partial BI_l > 0$, and the function $f(X)$ monotonically increases with $BI_h$ and $BI_l$. In the following, we mainly focus on the effect of $P_h$ and $P_l$ on the extreme value of $f(X)$. According to the constraint (c) in problem (21), we have

$$ 0 \le \frac{\phi_h M}{n_h},\ \frac{\phi_l M}{n_l} \le 1. \quad (A.4) $$

Because $M > 1$ and $n > 1$, the maximum value of $f(X)$ can be obtained as long as $\phi_l M/n_l \le 1$ and $\phi_h M/n_h \le 1$. By substituting $n_h = C_h + N_f^h/BI_h$ and $n_l = C_l + N_f^l/BI_l$, we have

$$ \phi_h M - C_h \le \frac{N_f^h}{BI_h}, \qquad \phi_l M - C_l \le \frac{N_f^l}{BI_l}. \quad (A.5) $$

Furthermore, since $1 \le BI_h, BI_l \le 100$, we have the following results.

1) When $\phi_h M - C_h \le N_f^h/100$ and $\phi_l M - C_l \le N_f^l/100$, the activated MTDs are far more than the available preamble resources. In order to satisfy the constraint (d) in problem (21), and since $f(X)$ monotonically increases with $BI_h$ and $BI_l$, it is obvious that $BI_h = 100$ and $BI_l = 100$, and we have

$$ P_h = \frac{\phi_h M}{C_h + \frac{N_f^h}{100}}, \qquad P_l = \frac{\phi_l M}{C_l + \frac{N_f^l}{100}}. \quad (A.6) $$

2) When $\phi_h M - C_h > N_f^h/100$ and $\phi_l M - C_l \le N_f^l/100$, the activated high-priority MTDs are less than the preamble resources, but the activated low-priority MTDs are far more than the available preamble resources. Thus, in order to satisfy (A.5), we have

$$ BI_h = \Big\lceil \frac{N_f^h}{\phi_h M - C_h} \Big\rceil, \qquad BI_l = 100, \quad (A.7) $$

where $\lceil\cdot\rceil$ denotes rounding up to an integer. Because the constraint (c) is satisfied, we have

$$ P_h = 1, \qquad P_l = \frac{\phi_l M}{C_l + \frac{N_f^l}{100}}. \quad (A.8) $$

3) When $\phi_h M - C_h \le N_f^h/100$ and $\phi_l M - C_l > N_f^l/100$, the activated high-priority MTDs are far more than the preamble resources, but the activated low-priority MTDs are less than the available preamble resources. Thus, in order to satisfy (A.5), we have

$$ BI_h = 100, \qquad BI_l = \Big\lceil \frac{N_f^l}{\phi_l M - C_l} \Big\rceil. \quad (A.9) $$

Because the constraint (c) is satisfied, we have

$$ P_h = \frac{\phi_h M}{C_h + \frac{N_f^h}{100}}, \qquad P_l = 1. \quad (A.10) $$


4) When $\phi_h M - C_h > N_f^h/100$ and $\phi_l M - C_l > N_f^l/100$, the activated MTDs are less than the preamble resources. Thus, in order to satisfy (A.5), we have

$$ BI_h = \Big\lceil \frac{N_f^h}{\phi_h M - C_h} \Big\rceil, \qquad BI_l = \Big\lceil \frac{N_f^l}{\phi_l M - C_l} \Big\rceil. \quad (A.11) $$

Because the constraint (c) is satisfied, we have $P_h = 1$ and $P_l = 1$.

Combining these four cases, under the fixed priority weights, the closed form of the optimal ACB factors and BO indicators for the proposed HH ACB-BO scheme is obtained as shown in (22).
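To make the four cases concrete, the per-class closed form can be evaluated directly; note that the high-priority and low-priority classes decouple, so a single helper covers cases 1)-4). The following Python sketch uses our own variable names and made-up example numbers purely for illustration; the authoritative statement of the result is (22) together with the cases above.

from math import ceil

BI_MAX = 100  # upper bound on the back-off indicator used in the proof

def hh_ideal_class(phi, M, C, Nf, bi_max=BI_MAX):
    """Closed-form ACB factor P and BO indicator BI for one priority class,
    following cases 1)-4) above (variable names are ours, not the paper's code)."""
    if phi * M - C <= Nf / bi_max:
        # Heavily loaded: the activated MTDs exceed the class's preamble share
        # even at BI = bi_max, so use the largest back-off and P = phi*M / n.
        bi = bi_max
        p = phi * M / (C + Nf / bi_max)
    else:
        # Lightly loaded: pick BI so that n = C + Nf/BI matches the preamble
        # share, and the ACB factor saturates at 1 (no barring needed).
        bi = ceil(Nf / (phi * M - C))
        p = 1.0
    return p, bi

# Example (all numbers are made up for illustration): M = 54 preambles,
# priority weights phi_h = 0.6 and phi_l = 0.4.
Ph, BIh = hh_ideal_class(phi=0.6, M=54, C=20, Nf=8000)   # heavily loaded class
Pl, BIl = hh_ideal_class(phi=0.4, M=54, C=10, Nf=300)    # lightly loaded class
print(f"high: P={Ph:.3f}, BI={BIh}   low: P={Pl:.3f}, BI={BIl}")

In the heavily loaded branch the class backs off as much as possible and throttles admissions through the ACB factor, while in the lightly loaded branch no barring is needed and the back-off indicator alone matches the contention level to the class's preamble share.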


Wenbo Fan received the B.S. degree in communication engineering from Southwest Jiaotong University, Chengdu, China, in 2018, where he is currently pursuing the Ph.D. degree with the School of Information Science and Technology. His research interests include massive random access, machine learning, and compressed sensing.

Yan Long (Member, IEEE) received the B.E. degree in electrical and information engineering and the Ph.D. degree in communication and information systems from Xidian University, Xi’an, China, in 2009 and 2015, respectively. From September 2011 to March 2013, she was a Visiting Student with the Department of Electrical and Computer Engineering, University of Florida, USA. She is currently a Lecturer with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. Her research interests include distributed machine learning in wireless networks, the next generation of WLAN, 5G/6G cellular networks, and wireless resource optimization.

Pingzhi Fan (Fellow, IEEE) received the M.Sc. degree in computer science from Southwest Jiao-
tong University (SWJTU), China, in 1987, and the
Ph.D. degree in electronic engineering from Hull
University, U.K., in 1994. He has been a Presidential
Professor with SWJTU, the Honorary Dean of the
SWJTU-Leeds Joint School, and a Visiting Professor
with Leeds University, U.K., since 1997. He served
as an EXCOM Member for the IEEE Region 10, the
IET (IEE) Council, and the IET Asia Pacific Region.
He was a recipient of the U.K. ORS Award in 1992,
the National Science Fund for Distinguished Young Scholars (NSFC) in 1998,
the IEEE VT Society Jack Neubauer Memorial Award in 2018, the IEEE SP
Society SPL Best Paper Award in 2018, the IEEE/CIC ICCC2020 Best Paper
Award, the IEEE WCSP2022 Best Paper Award, and the IEEE ICC2023 Best
Paper Award. He served as a Chief Scientist of the National 973 Plan Project
(National Basic Research Program of China) from January 2012 to December
2016. His research interests include high mobility wireless communications,
massive random-access techniques, and signal design and coding. He is an
IEEE VTS Distinguished Speaker (2019–2025) and a fellow of IET, CIE, and
CIC.
