Joint Delay-Energy Optimization For Multi-Priority Random Access in Machine-Type Communications
Abstract— Cellular-based networks are deemed one solution to provide communication links for the Internet of Things (IoT) due to their high reliability and wide coverage. However, due to the overloaded machine-type devices (MTDs) in IoT, the existing random access procedure in cellular networks suffers from a significant preamble collision problem and hardly meets the requirement of massive random access. Despite the effort to cope with the preamble collision problem in conventional random access control schemes, other important performance requirements in random access are not well addressed, including access delay, energy consumption, and service priority. To improve the random access control scheme, we propose a novel hierarchical hybrid (HH) access class barring (ACB) and back-off (BO) scheme (HH ACB-BO scheme), where the hybrid ACB-BO is exploited to balance the delay-energy tradeoff, and the hierarchical structure is proposed to prioritize communication services. We mathematically formulate this random access control scheme to optimize the delay and energy performance jointly. With fixed priority weights for each service priority, a closed-form solution for the optimal ACB factors and BO indicators is derived. Moreover, in order to realize adaptive prioritized random access control, we apply deep reinforcement learning (DRL) to the proposed random access scheme to dynamically adjust the ACB factors and BO indicators in an online manner. Considering the hierarchical structure and the action space complexity in DRL, a multi-agent DRL algorithm is designed for the HH ACB-BO scheme (multi-agent HH-DRL algorithm), where online policy transfer is applied to guarantee policy effectiveness in practical networks. Finally, simulation results verify the effectiveness of the proposed HH ACB-BO scheme and reveal that the multi-agent HH-DRL algorithm outperforms other algorithms in terms of average access success probability and energy consumption.

Index Terms— Cellular IoT networks, random access control, energy-delay tradeoff, priority, deep reinforcement learning.

I. INTRODUCTION

Machine-type communication (MTC) has been identified as one of the three main application scenarios of the 5th Generation and beyond mobile network services. Due to the wide coverage, mobility support, and high reliability, cellular-based IoT systems, including NarrowBand-IoT and Long-Term Evolution (LTE)-Machine to Machine, are considered the key solutions for MTC. According to the third generation partnership project (3GPP) reports, MTC should meet a connection density of 1 million devices per square kilometer [1]. Thus, it is crucial for cellular-based IoT systems to support dense MTC scenarios involving a large number of MTDs.

In conventional cellular networks, grant-based four-step random access (RA) is normally used to establish the communication connection between the MTD and the eNodeB (eNB) over the random access channel (RACH) [2]. However, due to the limited preamble resources in each random access opportunity (RAO), a collision occurs when two devices select the same preamble, resulting in a serious preamble collision problem and large access delay in cellular IoT networks with a large number of MTDs. Therefore, reducing access delay is critical for RA under the overloaded case. In addition, since MTDs are usually portable and powered by batteries, energy is a valuable resource for MTDs. With the increasing energy-saving demands of IoT applications, it is essential to reduce the energy consumption of MTDs in RACH. Moreover, machine-type services vary depending on the specific applications. The MTDs in IoT applications may have different delay requirements, such as delay-sensitive services including eHealth, self-driven vehicles, and public security, and delay-tolerant services including factory management and city pollution detection [3]. For this reason, it is necessary to consider service priority in terms of delay requirements.
In networks with time-varying traffic, the ACB factor is dynamically adjusted according to the results of activated-MTD estimation methods such as maximum-likelihood estimation [6], pseudo-Bayesian estimation [7], and learning automata-based estimation [8]. In addition, deep reinforcement learning (DRL) has been developed as a promising method to dynamically adjust the ACB factor by interacting with the RA process [9], [10].

However, the aforementioned schemes only focused on the access delay performance without involving the energy consumption in the RA process. Recently, energy efficiency was considered in many works [11], [12], [13]. In [11], by performing multiple access computing offloads sequentially, a swapping-heuristic based algorithm was proposed to minimize energy consumption in multi-access mobile edge computing. Cao et al. in [12] investigated the energy efficiency of RA-based orthogonal and non-orthogonal multiple access networks and discovered that lower data rates have beneficial effects on energy efficiency. Zhao et al. in [13] studied the performance limit of energy efficiency, evaluated via the lifetime throughput under an age-of-information constraint, where the optimal channel access probability and packet arrival rate are derived to achieve maximum lifetime throughput. In dense MTC scenarios involving a large number of MTDs, the probability of passing the ACB check is trivial due to the overloaded MTDs in RA, leading to huge energy consumption from frequent ACB checks. Concerning the energy consumption problem, based on the 3GPP test cases, Gerasimenko et al. in [14] conducted a thorough analysis and simulation of the RA performance, including energy consumption, under the overloaded case. It reveals the capability of the back-off (BO) scheme to reduce energy consumption. Specifically, according to the BO scheme, the MTDs that fail the ACB check are assigned to future RAOs to retry RA. Before re-access, these failed MTDs are inactive and do not receive system information, which greatly mitigates the energy waste of frequent ACB checks under the overloaded case. Hence, on the one hand, the ACB scheme reduces the access delay while introducing energy consumption in the ACB check process. On the other hand, the BO scheme reduces the energy consumed in receiving system information by inactivating these failed MTDs. As a result, it is natural to combine the two aspects, and a hybrid ACB-BO scheme is studied in [15], [16], and [17]. Furthermore, Jiang et al. in [18] developed a DRL algorithm to dynamically adjust the ACB factor and BO indicator by maximizing a long-term joint reward composed of both access delay and energy consumption.

Although the hybrid ACB-BO scheme can balance access delay and energy consumption, it can hardly cater to IoT networks directly without considering the service priority associated with various delay requirements. Therefore, priority-based RA has been studied in recent works [19], [20], [21], [22]. Zhang et al. in [19] proposed an analytical framework to optimize the network throughput by dividing MTDs into multiple groups according to their throughput requirements. Moreover, DRL has also been developed to provide different ACB factors for different types of MTDs [20], [21]. More recently, Liu et al. in [22] proposed an online preamble control algorithm, where the preamble resources are assigned to MTDs with different priorities according to soft priority weights. Although these priority-based schemes could improve the RA performance by dynamically assigning RACH resources to MTDs with different delay requirements, they ignore the energy consumption in RA. Recalling the hybrid ACB-BO scheme introduced for RA control, service priority should also be integrated in cellular IoT to pursue access delay, energy consumption, and service priority performance jointly.

Machine learning has recently emerged as a promising tool to deal with complex practical networks. In [23], federated learning is applied in the wireless computing power network, minimizing the sum energy consumption of all computing nodes by orchestrating the computing and networking resources. Similarly, Wang et al. in [24] proposed a decentralized federated learning scheme for the mobile and heterogeneous wireless computing power network, where nodes can freely participate in or leave the federated training. Moreover, using multi-agent DRL, Lee et al. in [25] proposed a novel contention-based RA solution for satellite networks, where each satellite has a sole agent. Jadoon et al. in [26] applied DRL to slotted ALOHA RA, balancing the throughput and fairness performances. In [27], by assigning preambles to each user online, a DRL-enabled intelligent RA management scheme was proposed to reduce access latency and access failures. Although existing works focus on DRL-based solutions to handle complex control problems in RA, they study a single-policy training scheme. However, the single-policy training scheme can hardly adapt to practical networks, especially in MTC scenarios with multiple traffic modes. To be specific, when bursty traffic occurs, the above DRL algorithms with single-policy training can hardly train the policy in a short time and, even worse, may not converge by the end of the bursty traffic. To this end, policy transfer (PT), as an important part of transfer reinforcement learning [28], can train multiple policies through policy reuse. In the face of bursty MTC traffic, the agent may use online PT to switch to the corresponding policy timely and avoid re-adapting to the bursty traffic. In this way, policy effectiveness can be ensured even in practical networks with time-varying traffic modes.

To this end, in this paper, all the MTDs in the system are firstly classified into different priority groups based on the different delay requirements of practical applications. Then, a novel hierarchical hybrid (HH) ACB and BO random access control scheme (HH ACB-BO scheme) is proposed to provide service priority and balance the delay-energy tradeoff. Under fixed priority weights for each service priority, a closed-form solution for the ACB factors and BO indicators is obtained under ideal situations (HH-ideal). In order to realize adaptive priority control, a multi-agent DRL algorithm is applied to the HH ACB-BO scheme (multi-agent HH-DRL algorithm), where online PT is applied to guarantee policy effectiveness in the face of time-varying traffic modes in practical networks. Our contributions can be summarized as follows:
1) To overcome the serious preamble collision problem and enhance the RA performance, we propose a HH ACB-BO scheme that combines multi-priority with the hybrid ACB-BO scheme, where the hybrid ACB-BO scheme is exploited to balance the delay-energy tradeoff, and the hierarchical structure is used to enable service priority for different types of MTDs.

2) Under fixed priority weights for each service priority, a joint delay-energy optimization problem is formulated for the proposed HH ACB-BO scheme. Firstly, a mathematical analysis of the access delay and energy consumption is carried out to develop an objective function. Then, we approximate and scale the objective function to obtain a closed-form solution for the optimal ACB factors and BO indicators. Simulation results show that the RA performance of the proposed HH-ideal algorithm is effectively improved. However, in some extreme cases, unfavorable fixed priority weights cause the average access success probability to shrink.

3) In order to realize adaptive priority control and improve the average access success probability, we apply the DRL algorithm to the proposed HH ACB-BO scheme to dynamically adjust the ACB factors and BO indicators in an online manner. Taking advantage of the aforementioned mathematical analysis of the HH ACB-BO scheme, we introduce a joint reward function to guarantee RA performance, where a punishment sub-reward is designed for the adaptive priority control. In addition, we develop a multi-agent HH-DRL algorithm inspired by the hierarchical structure of the HH ACB-BO scheme, which facilitates multi-priority expansion and reduces the complexity of the action space in the training process.

4) Since MTC usually has multiple traffic modes in practical networks, the single-policy scheme in DRL can hardly train the policy in a short time when MTC traffic happens to be bursty. In order to ensure policy effectiveness when the traffic mode changes, a PT-based multi-policy online training scheme is applied in the proposed multi-agent HH-DRL algorithm, which can switch to the corresponding policy timely and avoid re-adapting to the changing traffic.

The rest of the paper is organized as follows. In Section II, we present the multi-priority RA system model. The HH ACB-BO scheme is proposed and mathematically analyzed in Section III. Section IV applies the DRL algorithm to the HH ACB-BO scheme and proposes a multi-agent HH-DRL algorithm. In Section V, we provide extensive simulation results to evaluate the performance of the proposed scheme, and the conclusion is finally given in Section VI.

II. SYSTEM MODEL

According to the 3GPP standard [29], there are 64 preambles available at each RACH in cellular networks, of which only 54 are used for contention-based four-step RA.

A. Activated MTD Traffics and Priorities

Since MTDs in different applications can be activated randomly, the activated MTD traffic can follow a variety of possible statistics. 3GPP defines two different types of activated MTD traffic models: Beta distribution traffic and uniform distribution traffic [29]. The Beta distribution traffic describes a large number of MTDs accessing the network in a concentrated short period of time. The uniform distribution traffic represents MTDs accessing the network uniformly over a period of time. This provides an example of how to determine the activated traffic model for different types of MTDs.

In this study, without loss of generality, we classify MTDs into two categories: high-priority MTDs and low-priority MTDs. The high-priority MTDs mainly serve delay-sensitive services with bursty traffic, such as eHealth, self-driven vehicles, and public security applications. This type of activated MTD traffic is best represented by the Beta distribution traffic. Thus, we assume that each high-priority MTD is activated at time 0 < t < T with probability f(t). Following the Beta distribution with parameters α = 3 and β = 4, it can be expressed as

f(t) = \frac{t^{\alpha-1} (T-t)^{\beta-1}}{T^{\alpha+\beta-1} B(\alpha, \beta)},   (1)

where B(·) represents the Beta function and T indicates the activation period. Assuming that the duration of each RAO is τ, the number of newly activated high-priority MTDs at the i-th RAO is denoted as

\nu_i = N_h \int_{(i-1)\tau}^{i\tau} f(t)\, dt, \quad i = 1, 2, \ldots, T/\tau,   (2)

where N_h is the total number of high-priority MTDs.

Oppositely, the low-priority MTDs include consumer electronics, factory management sensors, delay-tolerant city pollution detection, etc., which feature looser delay constraints. This type of activated MTD traffic can be represented by uniform distribution traffic, in which the MTDs are uniformly activated during a time period. The number of newly activated low-priority MTDs at the i-th RAO is defined as µ_i ∼ U(0, 2N_l/T) over the activation period (0, T), where N_l represents the total number of low-priority MTDs. Overall, both activated high-priority and low-priority MTDs coexist in the cellular IoT network and perform RA attempts to establish links with the eNB.
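As a concrete illustration of the traffic models in (1) and (2), the following Python sketch generates per-RAO activation counts for both priority classes. The parameter values and the convention of one time unit per RAO are assumptions for illustration, not the paper's simulation settings.

```python
import math
import random

ALPHA, BETA = 3.0, 4.0

def beta_pdf(t, T):
    """Activation pdf f(t) in (1); B(a,b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    B = math.gamma(ALPHA) * math.gamma(BETA) / math.gamma(ALPHA + BETA)
    return t**(ALPHA - 1) * (T - t)**(BETA - 1) / (T**(ALPHA + BETA - 1) * B)

def high_priority_arrivals(N_h, T, tau):
    """nu_i in (2): N_h times the integral of f over the i-th RAO (midpoint rule)."""
    out = []
    for i in range(1, int(T / tau) + 1):
        dt = tau / 100
        integ = sum(beta_pdf((i - 1) * tau + (k + 0.5) * dt, T) * dt for k in range(100))
        out.append(N_h * integ)
    return out

def low_priority_arrivals(N_l, T, n_raos):
    """mu_i ~ U(0, 2*N_l/T); with T measured in RAOs the draws sum to about N_l."""
    return [random.uniform(0.0, 2.0 * N_l / T) for _ in range(n_raos)]

# Illustrative parameters (assumed): a 200-RAO activation period, tau = 1 RAO.
T, tau = 200.0, 1.0
nu = high_priority_arrivals(N_h=3000, T=T, tau=tau)
mu = low_priority_arrivals(N_l=5000, T=T, n_raos=int(T / tau))
print(round(sum(nu)), round(sum(mu)))  # both close to N_h and N_l, since f integrates to 1
```

Because f(t) integrates to one over (0, T), the Beta arrivals concentrate around t ≈ T·(α−1)/(α+β−2) while still summing to N_h, which is exactly the bursty behavior the high-priority class is meant to model.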
B. Four-Step RA Procedure

• Step 1: Preamble transmission: The activated MTD randomly selects a preamble from the preamble resource pool and transmits it as an access request to the eNB. This preamble resource pool is generated by Zadoff-Chu sequence cyclic shifts.
• Step 2: Random access response (RAR): Once the eNB detects the preamble, it broadcasts the RAR message to all UEs, which contains the assigned granted uplink resources, timing instructions, and a temporary identity for each detected preamble.
• Step 3: Connection request: After receiving the RAR message, the UE transmits its connection request message containing the temporary identity to the eNB using the assigned granted uplink resource.
• Step 4: Contention resolution: If the eNB correctly demodulates the UE's connection request message, the contention resolution message containing its temporary identity is broadcast to UEs as a response. The UE determines an RA failure if the contention resolution message does not contain its temporary identity.

Obviously, in this four-step RA process, as long as two or more UEs select the same preamble, a preamble collision occurs, eventually leading to RA failure. To simplify the study of the RA procedure, we assume that RA succeeds if the preamble is selected by an MTD without collision.

C. RA Control Schemes

In order to improve the RA performance, the activated MTDs need to execute the RA control scheme before performing the RA procedure. The most common RA control schemes include the ACB, BO, and hybrid schemes. The details are as follows:

1) ACB Scheme: The ACB scheme is a mechanism to control RA congestion by restricting RA requests in each RAO. Firstly, the eNB broadcasts the ACB factor P_ACB before each RAO. Then each activated MTD randomly generates a number q ∈ [0, 1] and compares it with the received ACB factor P_ACB. The activated MTD executes the four-step RA only when q ≤ P_ACB. Otherwise, the MTD fails the ACB check and repeats the ACB check in the next RAO. As shown in [5], an optimal ACB factor can reduce the access delay, but frequently receiving system information brings huge energy consumption.

2) BO Scheme: In the 3GPP standards, when an MTD's RA fails, it starts the BO scheme. The MTD in the BO scheme uniformly generates a BO time T_BO ∼ U(0, BI), where BI is the BO indicator. The MTD must wait for the T_BO time before retrying the RA attempt. However, in most previous research, the BO indicator BI was considered fixed. In fact, BI has a critical impact on the RA performance. For example, in the overloaded case, a small BI could cause the MTDs to perform the ACB check frequently and thus waste the MTDs' energy, while a large BI increases the access delay and even wastes preamble resources. Therefore, an appropriate BI value is crucial for the BO scheme and should be designed jointly with the ACB scheme to improve the energy performance.

3) Hybrid Scheme: A hybrid scheme is a mechanism that combines two or more RA control schemes. In this paper, we focus on the most common hybrid ACB-BO scheme. In the beginning, the activated MTD receives the system information containing the ACB factor P_ACB and BO indicator BI. Then it executes the ACB check according to P_ACB and performs the four-step RA once the ACB check succeeds. Finally, if an MTD fails the ACB check or the four-step RA, it generates a backoff time T_BO to postpone re-access according to the received BI.

III. HH ACB-BO SCHEME AND PROBLEM FORMULATION

Recently, access delay, energy consumption, and service priority have been widely studied and considered as three key performance indicators (KPIs) in RACH. Although the hybrid ACB-BO scheme can balance access delay and energy consumption, it ignores the priority aspect, while priority-based RA control schemes rarely consider the delay-energy tradeoff problem. Therefore, in this paper, we propose a novel HH ACB-BO scheme to optimize these three KPIs in RACH, in which the hybrid ACB-BO scheme is exploited to balance the delay-energy tradeoff, and the hierarchical structure is utilized to serve MTDs with different priorities.

A. HH ACB-BO Scheme

The proposed HH ACB-BO scheme is shown in Fig. 1. Generally, it contains the ACB and BO schemes and prioritizes services in RACH. For the ACB scheme, the MTD receives the system information containing the ACB factor and performs the ACB check. For the BO scheme, the MTDs that failed in RA are defined as backlogged MTDs, and the backlogged MTDs are assigned to future RAOs to retry RA attempts, which mitigates the energy waste of receiving the system information. Thus, the hybrid ACB-BO scheme is exploited to balance the delay-energy tradeoff. In addition, considering the service priority, we use P_h and BI_h to denote the ACB factor and BO indicator for high-priority MTDs, and P_l and BI_l for low-priority MTDs, where h and l represent the high priority and low priority, respectively. The corresponding explanations of the notations in the HH ACB-BO scheme are given in Table I.

In the i-th RAO, the number of newly activated MTDs can be expressed as

A_i = \nu_i + \mu_i,   (3)

where ν_i represents the number of newly activated high-priority MTDs and µ_i represents the number of newly activated low-priority MTDs. The number of backlogged MTDs can be expressed as

B_i = \sum_{j=1}^{i} P^{i,j} (N_{Af}^j + N_{Rf}^j),   (4)

where P^{i,j} represents the probability that the backlogged MTDs in the j-th RAO back off to the i-th RAO according to the BO scheme, which can be denoted as

P^{i,j} = \begin{cases} 0, & i - j > BI^j \\ \frac{1}{BI^j}, & i - j \leq BI^j. \end{cases}   (5)
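To see how the ACB check, the simplified four-step RA success rule, and the backoff draw of (5) interact, here is a minimal single-RAO simulation sketch of the hybrid ACB-BO scheme for one priority class. The parameter values are assumed for illustration, and the collision model follows the simplification stated above (a preamble succeeds only if chosen by exactly one MTD).

```python
import random
from collections import Counter

def simulate_rao(active_mtds, p_acb, bi, M=54):
    """One RAO of the hybrid ACB-BO scheme for a single priority class.

    active_mtds: ids of MTDs contending in this RAO.
    Returns (succeeded, backlog), where backlog maps MTD id -> backoff delay in RAOs.
    """
    # ACB check: each MTD draws q ~ U[0, 1] and proceeds only if q <= P_ACB.
    passed = [m for m in active_mtds if random.random() <= p_acb]

    # Simplified four-step RA: each passing MTD picks one of M preambles;
    # a preamble chosen by exactly one MTD means success (no collision).
    choice = {m: random.randrange(M) for m in passed}
    counts = Counter(choice.values())
    succeeded = {m for m, c in choice.items() if counts[c] == 1}

    # MTDs failing the ACB check or colliding back off by T_BO ~ U(0, BI),
    # matching the uniform re-access probability 1/BI in (5).
    failed = [m for m in active_mtds if m not in succeeded]
    backlog = {m: 1 + int(random.uniform(0, bi)) for m in failed}
    return succeeded, backlog

ok, backlog = simulate_rao(list(range(200)), p_acb=0.3, bi=20)
print(len(ok), "succeeded;", len(backlog), "backlogged")
```

Running this repeatedly while feeding the backlog back into later RAOs reproduces the backlog dynamics of (4): each failed MTD reappears in exactly one of the next BI RAOs with probability 1/BI.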
TABLE I
SUMMARY OF MAIN NOTATIONS

To facilitate the mathematical analysis, we have the following definition:

N = A + B = N^h + N^l,   (6)

where N is the total number of activated MTDs. In detail, it can be expressed as the summation of the newly activated MTDs A and the backlogged MTDs B. It is also the summation of the high-priority activated MTDs N^h and the low-priority activated MTDs N^l. Similarly, the MTDs that pass the ACB check, N_As, and those that pass the four-step RA, N_Rs, are also composed of high-priority and low-priority MTDs, which can be expressed as follows:

N_{As} = N_{As}^h + N_{As}^l, \quad N_{Rs} = N_{Rs}^h + N_{Rs}^l.   (7)

We also assume that once a preamble collision occurs, none of the MTDs in this collision can complete the four-step RA procedure in this RAO. In addition, the MTDs without preamble collision can always successfully pass the four-step RA procedure.

B. Problem Formulation

In this part, a joint delay-energy optimization problem is formulated for the proposed HH ACB-BO scheme. Firstly, with the fixed priority weights for each service priority, a mathematical analysis of the access delay and energy consumption is carried out, and the joint delay-energy optimization objective function is formulated. Then, we approximate and scale the objective function to obtain a closed-form solution for the optimal ACB factors and BO indicators in our HH ACB-BO scheme. More details are as follows.

1) Access Delay: The access delay is defined as the number of RAOs for an MTD from its newly activated state to the connected state with the eNB. Thus, minimizing the access delay is equivalent to minimizing the number of RAOs required for MTDs to establish wireless links with the eNB. At the same time, in order to minimize the number of RAOs consumed by MTDs to establish wireless links, each RAO needs to reach the maximum number of successful accesses [30]. Therefore, the access delay minimization problem can be converted into the success access maximization in each RAO, which can be formulated as

\arg\min \tau_d = \arg\max E[N_{Rs} \mid N = n],   (8)

where τ_d represents the access delay. We assume that N^h and N^l are independent. Considering the service priority, the success access maximization problem can be further converted into two sub-problems, i.e., the success access maximization for high-priority and low-priority MTDs in each RAO, respectively. Therefore, we have

E[N_{Rs} \mid N = n] = E[N_{Rs}^h \mid N^h = n_h] + E[N_{Rs}^l \mid N^l = n_l].   (9)

For the high-priority MTDs, a successful access means that the MTD passes both the ACB check and the RA procedure. Thus, in order to calculate the number of successfully accessing high-priority MTDs, we have

E[N_{Rs}^h \mid N^h = n_h] = \sum_{\lambda_h=1}^{n_h} P(N_{As}^h = \lambda_h \mid N^h = n_h) \, E[N_{Rs}^h \mid N_{As}^h = \lambda_h].   (10)

The ACB check can be viewed as a Bernoulli experiment with ACB check success probability P_h. Thus, the probability mass function of N_As^h = λ_h can be expressed as

P(N_{As}^h = \lambda_h \mid N^h = n_h) = \binom{n_h}{\lambda_h} P_h^{\lambda_h} (1 - P_h)^{n_h - \lambda_h}.   (11)
After the ACB check, the number of MTDs that enter the RA procedure is N_As^h = λ_h. In the RA procedure, a successful access means that the generated preamble does not conflict with other MTDs, and the probability of successful access in the RA procedure, P_m, is

P_m = \binom{\lambda_h}{1} \frac{1}{\phi_h M} \left( \frac{\phi_h M - 1}{\phi_h M} \right)^{\lambda_h - 1},   (12)

where M is the number of preamble resources and φ_h represents the high-priority weight [22]. In addition, the number of successfully accessing MTDs is equal to the number of generated preambles without collision in each RAO, which can also be viewed as a Bernoulli experiment. Therefore, the number of successfully accessing high-priority MTDs in each RAO is expressed as

E[N_{Rs}^h \mid N_{As}^h = \lambda_h] = \phi_h M P_m = \phi_h M \binom{\lambda_h}{1} \frac{1}{\phi_h M} \left( \frac{\phi_h M - 1}{\phi_h M} \right)^{\lambda_h - 1} = \lambda_h \left( \frac{\phi_h M - 1}{\phi_h M} \right)^{\lambda_h - 1}.   (13)

Therefore, according to (11) and (13), we have

E[N_{Rs}^h \mid N^h = n_h] = \sum_{\lambda_h=1}^{n_h} \binom{n_h}{\lambda_h} P_h^{\lambda_h} (1 - P_h)^{n_h - \lambda_h} \lambda_h \left( \frac{\phi_h M - 1}{\phi_h M} \right)^{\lambda_h - 1}
= \sum_{j=1}^{n_h} n_h P_h \binom{n_h - 1}{j - 1} (1 - P_h)^{n_h - j} \left( P_h - \frac{P_h}{\phi_h M} \right)^{j - 1}
= n_h P_h \left( 1 - \frac{P_h}{\phi_h M} \right)^{n_h - 1}.   (14)

Similarly, in each RAO, the number of successful accesses for low-priority MTDs is expressed as

E[N_{Rs}^l \mid N^l = n_l] = n_l P_l \left( 1 - \frac{P_l}{\phi_l M} \right)^{n_l - 1},   (15)

where φ_l represents the low-priority weight, and φ_h + φ_l = 1. Accordingly, in each RAO, the total number of successful accesses can be described as

E[N_{Rs} \mid N = n] = n_h P_h \left( 1 - \frac{P_h}{\phi_h M} \right)^{n_h - 1} + n_l P_l \left( 1 - \frac{P_l}{\phi_l M} \right)^{n_l - 1}.   (16)

2) Energy Consumption: In each RAO, the energy consumption of an MTD consists of two parts: the energy for receiving system information in the ACB scheme and the energy consumed in the RA procedure. As long as the MTD is in the active state, it needs to receive the system information for the ACB check, and only the MTD that passes the ACB check can execute the RA procedure. According to [18], the energy consumption of receiving system information, E_si, can be expressed as

E_{si} = T_{si} P_{si}.   (17)

The energy consumption in the RA procedure, E_ra, can be described as

E_{ra} = T_{msg1} P_{msg1} + T_{msg2} P_{msg2} + T_{msg3} P_{msg3} + T_{msg4} P_{msg4},   (18)

where T and P represent the consumed time and average power in E_si and E_ra, respectively. Therefore, the energy consumption in each RAO is denoted as

E = n E_{si} + \lambda E_{ra} = (n_h + n_l) E_{si} + (n_h P_h + n_l P_l) E_{ra},   (19)

where n represents the number of activated MTDs and λ represents the number of MTDs passing the ACB check.

3) Joint Delay-Energy Optimization: As discussed above, access delay and energy consumption are directly affected by the ACB factors and BO indicators. This part aims to obtain the optimal ACB factors and BO indicators of the HH ACB-BO scheme. According to (16) and (19), in each RAO, the joint delay-energy optimization problem can be formulated as

\max_{P_h, P_l, BI_h, BI_l} \; n_h P_h \left( 1 - \frac{P_h}{\phi_h M} \right)^{n_h - 1} + n_l P_l \left( 1 - \frac{P_l}{\phi_l M} \right)^{n_l - 1} - (n_h + n_l) E_{si} - (n_h P_h + n_l P_l) E_{ra},
s.t. \; (a): n_h = C_h + \frac{N_f^h}{BI_h}, \quad (b): n_l = C_l + \frac{N_f^l}{BI_l}, \quad (c): 0 \leq P_h, P_l \leq 1, \quad (d): 1 \leq BI_h, BI_l \leq 100,   (20)

where n_h and n_l represent the numbers of high-priority and low-priority activated MTDs, respectively. Taking the high priority as an example, in the i-th RAO, the activated MTDs consist of the newly activated MTDs A_h and the backlogged MTDs B_h. We assume that the number of backlogged MTDs in previous RAOs is specified as a constant B_{i-1,h}, and the latest backlogged MTDs are denoted as N_f^h / BI_h, where N_f = N_{Af} + N_{Rf} represents the number of MTDs that fail the ACB check and the RA procedure in the last RAO. Therefore, we have C_h = A_h + B_{i-1,h} and C_l = A_l + B_{i-1,l}. We also assume that 100 is the upper limit of the BO indicator.

For simplicity, we omit the energy consumption of the RA procedure in the objective function. Therefore, the original problem in (20) can be approximately transformed into (21):

\max_{P_h, P_l, BI_h, BI_l} \; n_h P_h \left( 1 - \frac{P_h}{\phi_h M} \right)^{n_h - 1} + n_l P_l \left( 1 - \frac{P_l}{\phi_l M} \right)^{n_l - 1} - (n_h + n_l) E_{si},
s.t. \; (a), (b), (c), (d).   (21)

We utilize the joint derivative to solve the optimization problem in (21) and obtain the optimal solution in Proposition 1.

Proposition 1: Note that A_i ≥ 0, B_{i-1} ≥ 0, N_f^h ≥ 0, N_f^l ≥ 0, M ≥ 1, BI_h ≥ 1, and BI_l ≥ 1; these parameters are all integers in practice. Under the fixed priority weights φ_h and φ_l, by analyzing the joint derivative of the objective function, a closed-form solution of the optimal ACB factors and BO indicators is obtained as

(P_h, P_l, BI_h, BI_l) =
\begin{cases}
\frac{\phi_h M}{C_h + N_f^h/100}, \; \frac{\phi_l M}{C_l + N_f^l/100}, \; 100, \; 100, & \phi_h M \leq C_h + \frac{N_f^h}{100}, \; \phi_l M \leq C_l + \frac{N_f^l}{100}; \\
1, \; \frac{\phi_l M}{C_l + N_f^l/100}, \; \left\lceil \frac{N_f^h}{\phi_h M - C_h} \right\rceil, \; 100, & \phi_h M > C_h + \frac{N_f^h}{100}, \; \phi_l M \leq C_l + \frac{N_f^l}{100}; \\
\frac{\phi_h M}{C_h + N_f^h/100}, \; 1, \; 100, \; \left\lceil \frac{N_f^l}{\phi_l M - C_l} \right\rceil, & \phi_h M \leq C_h + \frac{N_f^h}{100}, \; \phi_l M > C_l + \frac{N_f^l}{100}; \\
1, \; 1, \; \left\lceil \frac{N_f^h}{\phi_h M - C_h} \right\rceil, \; \left\lceil \frac{N_f^l}{\phi_l M - C_l} \right\rceil, & \phi_h M > C_h + \frac{N_f^h}{100}, \; \phi_l M > C_l + \frac{N_f^l}{100}.
\end{cases}   (22)

Proof: See Appendix A.

Although a closed-form solution of the optimal ACB factors and BO indicators is obtained in (22), the exact prior information (such as A and B) needs to be acquired in the RA process, which is hard to obtain in practice.
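For reference, the four cases of (22) can be evaluated directly per priority class; the Python sketch below does so, with the demo inputs being illustrative assumptions and the max(1, ·) guard simply enforcing constraint (d).

```python
import math

def hh_ideal(C_h, C_l, Nf_h, Nf_l, phi_h, M=54, BI_MAX=100):
    """Closed-form ACB factors and BO indicators from (22).

    Per class: if the backoff-diluted load C + Nf/BI_MAX still exceeds the
    preamble share phi*M, the ACB factor throttles and BI stays at its maximum;
    otherwise the ACB gate opens (P = 1) and BI shrinks just enough that the
    expected load C + Nf/BI matches phi*M.
    """
    def one_class(C, Nf, phi):
        share = phi * M
        if share <= C + Nf / BI_MAX:       # overloaded: ACB does the throttling
            return share / (C + Nf / BI_MAX), BI_MAX
        return 1.0, max(1, math.ceil(Nf / (share - C)))  # underloaded: BO matches
    P_h, BI_h = one_class(C_h, Nf_h, phi_h)
    P_l, BI_l = one_class(C_l, Nf_l, 1.0 - phi_h)
    return P_h, P_l, BI_h, BI_l

# Illustrative example: light high-priority load, heavy low-priority load.
print(hh_ideal(C_h=10, C_l=300, Nf_h=40, Nf_l=500, phi_h=0.4))
# -> roughly (1.0, 0.106, 4, 100): the high-priority gate is fully open while
#    the overloaded low-priority class is throttled by its ACB factor.
```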
Moreover, the BO indicator has a long-term impact on the joint delay-energy optimization according to (4) and (5), while for simple analysis the number of backlogged MTDs in the previous RAOs affected by the previous BO indicators is specified as the constant B_{i-1}. Furthermore, since the fixed high-priority weight is larger than the low-priority weight, the RA performance shrinks when the number of low-priority MTDs far exceeds that of high-priority MTDs. For these reasons, we apply the DRL algorithm to the proposed HH ACB-BO scheme to adjust the ACB factors and BO indicators in an online manner, as shown in Sec. IV.

IV. DEEP REINFORCEMENT LEARNING FOR HH ACB-BO SCHEME

DRL is capable of handling long-term decision optimization problems and has been widely applied in various fields of wireless communication networks, including channel selection [31], power optimization [32], etc. In this study, we also use DRL to dynamically adjust the ACB factors and BO indicators. Taking advantage of the aforementioned mathematical analysis of the HH ACB-BO scheme, a joint reward function is developed to optimize the three KPIs by interacting with the RA process, and a punishment sub-reward is designed for the adaptive priority control. Moreover, to facilitate multi-priority expansion and reduce the complexity of the action space in the training process, we design a multi-agent DRL framework, where two DRL agents serve the high-priority and low-priority MTDs, respectively. In the following, we first introduce the preliminary DRL definitions. Next, we introduce the DRL algorithm and propose a multi-agent HH-DRL algorithm for the HH ACB-BO scheme.

A. Definitions

In order to implement the DRL algorithm, the eNB needs to collect some important information in each RAO: the number of generated preambles without collision n_s, the number of collided preambles n_c, the number of idle preambles n_i, the number of MTDs passing the ACB check λ, and the number of high-priority MTDs that fail access n_f. Note that this information has been widely used in previous works [8], [20], [21], [33], and obtaining it is beyond the scope of this study. The details of the definitions of DRL applied in the HH ACB-BO scheme are presented as follows.

Agent: As the main body of the DRL algorithm, the agent has the ability to collect the environment information and execute the decision-making. In the RA process, the agent is deployed at the eNB to collect the aforementioned important information, implement the DRL algorithm, and adjust the ACB factors and BO indicators online.

State: The state space is defined as S = [n_s, n_c, n_i], and these states are related to the number of activated MTDs in each RAO.

Action: The action space is defined as A = [P_h, P_l, BI_h, BI_l], which includes the ACB factors (P_h, P_l) ∈ [0, 1] and the BO indicators (BI_h, BI_l) ∈ {10, 20, . . . , 100}. By implementing the DRL algorithm, the ACB factors and BO indicators are dynamically adjusted in an online manner to adapt to the time-varying traffic. Although the ACB factor is continuous and the BO indicator is discrete, in order to integrate these two schemes into one agent, we also discretize the ACB factor with a granularity of 0.1.

Reward function: In order to optimize the three KPIs in RACH, a specific joint reward function is designed as the driver of the DRL algorithm to update the decision-making policy. According to the joint delay-energy optimization in (20) and considering the adaptive priority control, the joint reward function can be expressed as

R = \phi_d R_d + \phi_e R_e - \phi_p R_p,   (23)

where φ_d, φ_e, and φ_p are the weights of the access delay sub-reward R_d, the energy consumption sub-reward R_e, and the priority sub-reward R_p, respectively. According to (8), minimizing the access delay is equivalent to maximizing the number of successfully accessing MTDs in each RAO; thus R_d can be denoted by the number of generated preambles without collision, that is, R_d = n_s. In addition, the energy consumption in each RAO is denoted as E according to (19); thus R_e = 1/E. For adaptive priority control, we set a punishment mechanism to indicate access failure. The priority sub-reward is described by R_p = n_f, where n_f is the number of high-priority MTDs that fail access. Accordingly, the specific joint reward function is formulated as

R = \phi_d n_s + \phi_e / E - \phi_p n_f.   (24)
Algorithm 1 Multi-Agent HH-DRL Algorithm
Input: DQN structure, iterations I, state dimension, and action dimension.
1 Set algorithm hyperparameters: learning rate α, discount factor γ, initial exploration rate ϵ_i, final exploration rate ϵ_f, deep learning factor η, dataset size K, synchronization frequency Z;
2 Initialize each agent's Q-network and clear the experience replay set D;
3 for each iteration i = 1, 2, . . . , I do
4   ----- Interaction with RA process -----
5   Each agent observes the state s = [n_s, n_c, n_i];
6   Input state s into each agent's current Q-network and obtain the action Q-values;
7   Each agent selects its action (a_h, a_l) according to the ϵ-greedy algorithm (28);
8   After executing action (a_h, a_l), the agents obtain the immediate reward R based on (24) and observe the new state s′;
9   Save (s, a_h, R, s′) and (s, a_l, R, s′) into the experience replay set D;
10  ----- Training process -----
11  Randomly sample K experience datasets from D, and calculate the label Q-value from the target Q-value based on (26);
12  Update the current network loss function by (25), and perform the gradient descent algorithm to update the current network parameters θ_train by (27);
13  if i is an integer multiple of Z then
14    The target network parameters θ_target are synchronized with the current network parameters θ_train;
15  else
16    Set s = s′ and go back to step 6 until the end of the activation period;
17  end
18 end
19 Return the well-trained network models and obtain the policy π∗ with maximum Q-value by (29).

After executing the current action (a_h, a_l), the agents obtain the immediate reward R from the joint reward function in (24) and observe the new state s′. Finally, the experience data (s, a_h, R, s′) and (s, a_l, R, s′) are stored in the replay buffer for DRL training.

The training process of multi-agent DRL is shown in Steps 4-7 of Fig. 2. Firstly, mini-batch datasets are randomly sampled from the replay buffer according to the experience replay mechanism, which not only reuses the existing empirical data but also scatters the original sequences and eliminates their correlation. Then the loss function L(θ_train) is calculated from the current network Q-value and the target network Q-value according to (25). After that, the current network parameters are updated by the gradient descent algorithm in (27). In addition, the target network parameters are synchronized with the current network parameters every Z iterations. The agents obtain the well-trained DQN when the reward value is stable and the algorithm converges. Then the optimal policy π∗ is obtained according to the greedy algorithm

\pi^*(a \mid s) = \begin{cases} 1, & \text{if } a = \arg\max_{a \in A} Q^*(s, a) \\ 0, & \text{else.} \end{cases}   (29)

Finally, the policy π∗ is deployed at the eNB to dynamically adjust the ACB factors and BO indicators in an online manner.

The complexity of Algorithm 1 is mainly determined by the number of agents and the two neural networks. The number of agents is equal to the number of priorities, i.e., P. Considering that the current Q-network and the target Q-network both contain J fully connected layers, for each agent the time complexity can be calculated as [37]

O\left( \sum_{j=0}^{J-1} n_{cur,j} n_{cur,j+1} + \sum_{j=0}^{J-1} n_{tar,j} n_{tar,j+1} \right).

Thus, the time complexity related to all agents is O(2P \sum_{j=0}^{J-1} n_{cur,j} n_{cur,j+1}), where n_{cur,j} denotes the number of units in the j-th DNN layer of the current Q-network. Similarly, n_{tar,j} is the number of units in the j-th DNN layer of the target Q-network, and n_{cur,j} = n_{tar,j}.
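For concreteness, the following PyTorch sketch shows one agent's DQN machinery in the spirit of Algorithm 1: epsilon-greedy selection, experience replay, an MSE loss against the target network, and periodic synchronization. The framework choice, network sizes, and hyperparameters are assumptions for illustration, and equations (25)-(27) of the paper are not reproduced verbatim.

```python
import random
import torch
import torch.nn as nn

class Agent:
    """One HH-DRL agent (one per priority class), mirroring Algorithm 1's structure."""
    def __init__(self, n_states=3, n_actions=100, lr=1e-3, gamma=0.9):
        make_net = lambda: nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(),
                                         nn.Linear(64, n_actions))
        self.q_net, self.target_net = make_net(), make_net()
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.opt = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.gamma, self.replay = gamma, []

    def act(self, state, eps):
        """Epsilon-greedy action over the discretized (ACB factor, BI) pairs (step 7)."""
        if random.random() < eps:
            return random.randrange(self.q_net[-1].out_features)
        with torch.no_grad():
            return int(self.q_net(torch.tensor(state, dtype=torch.float32)).argmax())

    def train_step(self, batch_size=32):
        """Sample a mini-batch, build label Q-values, take one SGD step (steps 11-12)."""
        if len(self.replay) < batch_size:
            return
        batch = random.sample(self.replay, batch_size)
        s, a, r, s2 = map(torch.tensor, zip(*batch))
        q = self.q_net(s.float()).gather(1, a.view(-1, 1)).squeeze(1)
        with torch.no_grad():  # label Q-value from the frozen target network
            label = r.float() + self.gamma * self.target_net(s2.float()).max(1).values
        loss = nn.functional.mse_loss(q, label)
        self.opt.zero_grad(); loss.backward(); self.opt.step()

    def sync_target(self):
        """Copy current parameters into the target network (steps 13-14)."""
        self.target_net.load_state_dict(self.q_net.state_dict())
```

In the multi-agent HH-DRL setting, one such agent would be instantiated per priority class, each observing the shared state s = [n_s, n_c, n_i] and selecting its own (ACB factor, BO indicator) pair, with sync_target() invoked every Z iterations.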
D. Online Policy Transfer

In practical networks, MTC has multiple traffic modes (e.g., a uniform traffic mode and a bursty traffic mode), and the traffic modes are time-varying. Single-policy-based online training in DRL can hardly guarantee policy effectiveness when the traffic mode changes. For example, when the MTC traffic happens to be bursty, if the agent continues online training based on the previous uniform-traffic-mode policy, the agent needs to re-adapt to the bursty traffic. It takes a long time to adjust the policy, which may not even converge by the end of the bursty traffic. The same problem arises when the bursty traffic changes back to uniform traffic. In this section, multi-policy online training with PT is applied to the proposed multi-agent HH-DRL algorithm, as shown in Fig. 3. When the MTC traffic mode changes, the agent switches to the corresponding policy timely through online PT to avoid policy readjustment.

For simple networks, there are two service priorities (i.e., high priority and low priority) and the scale of traffic does not change. The policy changing point is the moment when the agent performs PT. It occurs when bursty traffic appears in the network (e.g., the high-priority MTDs are activated) or disappears from the network (e.g., the high-priority MTDs are deactivated). For complex networks, there are multiple service priorities and the traffic scale changes over time. Then the eNB needs to provide more policies for the different service priorities and traffic scales.
TABLE II
SYSTEM PARAMETERS

Fig. 5. Average number of success accesses for different RA control schemes, M = 54.

Fig. 6. Average access success probability for different RA control schemes, M = 54.

The ACB-ideal algorithm outperforms the other algorithms because it can obtain the optimal ACB factor from the pre-known number of activated MTDs. More specifically, with the increase of high-priority MTDs shown in Fig. 5(a), the performance of the HH-ideal and multi-agent HH-DRL algorithms is slightly lower than that of the ACB-ideal algorithm. In addition, the number of success accesses per RAO of the proposed multi-agent HH-DRL algorithm is stable at 18.

However, in Fig. 5(b), with the increase of low-priority MTDs, the average number of success accesses per RAO decreases gradually under the HH-ideal algorithm. This seriously affects the utilization of preamble resources, since the unfavorable fixed priority weights provide too many preamble resources for the high-priority MTDs, which are the minority in the network, and too few for the low-priority MTDs, which are the majority. Conversely, the proposed multi-agent HH-DRL algorithm ensures the preamble resource utilization with adaptive priority control.

Fig. 6 presents the average access success probability for the different RA control schemes.
With the increase of high-priority MTDs shown in Fig. 6(a), the performance of the priority-based schemes is better than that of the other schemes. This is because the priority-based schemes can provide more preamble resources for high-priority MTDs without affecting the performance of low-priority MTDs. However, in Fig. 6(b), with the increase of low-priority MTDs, the unfavorable fixed priority weights under the HH-ideal algorithm can hardly satisfy the delay requirement of low-priority MTDs, leading to a sharp drop in performance once the number of low-priority MTDs far exceeds that of high-priority MTDs.

B. Access Delay and Energy Consumption

Fig. 7 shows the average access delay performance under different RA control schemes. The average access delay is defined as the average number of RAOs for an MTD from its newly activated state to access success (i.e., establishing a link with the eNB) or access failure. With the increase of high-priority MTDs shown in Fig. 7(a), the access delay performance of the proposed HH-ideal and multi-agent HH-DRL algorithms is close to that of the ACB-ideal algorithm. However, with the increase of low-priority MTDs shown in Fig. 7(b), the performance of the HH-ideal algorithm deteriorates. From (22), it can be seen that the low-priority ACB factor of HH-ideal is P_l = φ_l M / n_l when the low-priority MTDs are overloaded. Due to the small low-priority weight φ_l and the large number of low-priority activated MTDs n_l, the probability of the low-priority MTDs passing the ACB check is trivial. This seriously affects the access delay performance of low-priority MTDs. Accordingly, the average access delay of the HH-ideal algorithm is also affected and deteriorates. However, the soft priority-ACB and the proposed multi-agent HH-DRL algorithms can guarantee the access delay of both low-priority and high-priority MTDs through the online adjustment of P_h and P_l with adaptive priority control.

Fig. 8 compares the performance of the different RA control schemes in terms of average energy consumption. Note that the BO scheme can prevent a large number of backlogged MTDs from frequently executing the ACB check in the overloaded case. It is clearly observed that the HH ACB-BO algorithm achieves much lower energy consumption than the ACB-ideal and soft priority-ACB algorithms in both Fig. 8(a) and Fig. 8(b).

C. Priority Performance Evaluation

Fig. 9 further depicts the average access delay performance. Fig. 9(a) and Fig. 9(b) show that the priority-based schemes can reduce the access delay for high-priority MTDs, since these schemes provide more preamble resources for high-priority MTDs. However, the cost is that the access delay of low-priority MTDs is increased, as shown in Fig. 9(c) and Fig. 9(d).

In addition, since the minority high-priority MTDs have a higher priority weight in the HH-ideal scheme, their access delay performance outperforms the other schemes, as shown in Fig. 9(a) and Fig. 9(b).
Fig. 9. Average access delay of prioritized MTDs for different RA control schemes, M = 54.
4) When φ_h M − C_h > N_f^h/100 and φ_l M − C_l > N_f^l/100, the activated MTDs are fewer than the preamble resources. Thus, in order to satisfy (A.5), we have

BI_h = \left\lceil \frac{N_f^h}{\phi_h M - C_h} \right\rceil, \quad BI_l = \left\lceil \frac{N_f^l}{\phi_l M - C_l} \right\rceil.   (A.11)

Because constraint (c) is satisfied, we have

P_h = 1, \quad P_l = 1.

Combining these four cases, under the fixed priority weights, the closed-form solution of the optimal ACB factors and BO indicators is obtained for the proposed HH ACB-BO scheme, as shown in (22).
REFERENCES

[1] Z. Zhang et al., "6G wireless networks: Vision, requirements, architecture, and key technologies," IEEE Veh. Technol. Mag., vol. 14, no. 3, pp. 28-41, Sep. 2019.
[2] I. Leyva-Mayorga, L. Tello-Oquendo, V. Pla, J. Martinez-Bauset, and V. Casares-Giner, "On the accurate performance evaluation of the LTE-A random access procedure and the access class barring scheme," IEEE Trans. Wireless Commun., vol. 16, no. 12, pp. 7785-7799, Dec. 2017.
[3] Y. Sim and D. Cho, "Performance analysis of priority-based access class barring scheme for massive MTC random access," IEEE Syst. J., vol. 14, no. 4, pp. 5245-5252, Dec. 2020.
[4] H. He, Q. Du, H. Song, W. Li, Y. Wang, and P. Ren, "Traffic-aware ACB scheme for massive access in machine-to-machine networks," in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2015, pp. 617-622.
[5] S. Duan, V. Shah-Mansouri, Z. Wang, and V. W. S. Wong, "D-ACB: Adaptive congestion control algorithm for bursty M2M traffic in LTE networks," IEEE Trans. Veh. Technol., vol. 65, no. 12, pp. 9847-9861, Dec. 2016.
[6] M. Tavana, V. Shah-Mansouri, and V. W. S. Wong, "Congestion control for bursty M2M traffic in LTE networks," in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2015, pp. 5815-5820.
[7] H. Jin, W. T. Toor, B. C. Jung, and J. Seo, "Recursive pseudo-Bayesian access class barring for M2M communications in LTE systems," IEEE Trans. Veh. Technol., vol. 66, no. 9, pp. 8595-8599, Sep. 2017.
[8] C. Di, B. Zhang, Q. Liang, S. Li, and Y. Guo, "Learning automata-based access class barring scheme for massive random access in machine-to-machine communications," IEEE Internet Things J., vol. 6, no. 4, pp. 6007-6017, Aug. 2019.
[9] D. Zhang, J. Liu, and W. Zhou, "ACB scheme based on reinforcement learning in M2M communication," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2020, pp. 1-6.
[10] L. Tello-Oquendo, D. Pacheco-Paramo, V. Pla, and J. Martinez-Bauset, "Reinforcement learning-based ACB in LTE-A networks for handling massive M2M and H2H communications," in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1-7.
[11] L. P. Qian, Y. Wu, N. Yu, D. Wang, F. Jiang, and W. Jia, "Energy-efficient multi-access mobile edge computing with secrecy provisioning," IEEE Trans. Mobile Comput., vol. 22, no. 1, pp. 237-252, Jan. 2023.
[12] S. Cao and F. Hou, "On the maximum energy efficiency of random access-based OMA and NOMA in multirate environment," IEEE Trans. Wireless Commun., vol. 21, no. 12, pp. 10438-10454, Dec. 2022.
[13] F. Zhao, X. Sun, W. Zhan, X. Wang, J. Gong, and X. Chen, "Age-energy tradeoff in random-access Poisson networks," IEEE Trans. Green Commun. Netw., vol. 6, no. 4, pp. 2055-2072, Dec. 2022.
[14] M. Gerasimenko, V. Petrov, O. Galinina, S. Andreev, and Y. Koucheryavy, "Energy and delay analysis of LTE-advanced RACH performance under MTC overload," in Proc. IEEE Globecom Workshops, Dec. 2012, pp. 1632-1637.
[15] N. Jiang, Y. Deng, A. Nallanathan, X. Kang, and T. Q. S. Quek, "Analyzing random access collisions in massive IoT networks," IEEE Trans. Wireless Commun., vol. 17, no. 10, pp. 6853-6870, Oct. 2018.
[16] Y. Liu, Y. Deng, N. Jiang, M. Elkashlan, and A. Nallanathan, "Analysis of random access in NB-IoT networks with three coverage enhancement groups: A stochastic geometry approach," IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 549-564, Jan. 2021.
[17] W. Zhan and L. Dai, "Massive random access of machine-to-machine communications in LTE networks: Modeling and throughput optimization," IEEE Trans. Wireless Commun., vol. 17, no. 4, pp. 2771-2785, Apr. 2018.
[18] N. Jiang, Y. Deng, A. Nallanathan, and J. Yuan, "A decoupled learning strategy for massive access optimization in cellular IoT networks," IEEE J. Sel. Areas Commun., vol. 39, no. 3, pp. 668-685, Mar. 2021.
[19] C. Zhang, X. Sun, J. Zhang, and H. Zhu, "Priority-based massive random access of M2M communications in LTE networks: Throughput analysis and optimization," in Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), Aug. 2019, pp. 472-477.
[20] M. R. Chowdhury and S. De, "Delay-aware priority access classification for massive machine-type communication," IEEE Trans. Veh. Technol., vol. 70, no. 12, pp. 13238-13254, Dec. 2021.
[21] Z. Chen and D. B. Smith, "Heterogeneous machine-type communications in cellular networks: Random access optimization by deep reinforcement learning," in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1-6.
[22] J. Liu, M. Agiwal, M. Qu, and H. Jin, "Online control of preamble groups with priority in massive IoT networks," IEEE J. Sel. Areas Commun., vol. 39, no. 3, pp. 700-713, Mar. 2021.
[23] W. Sun, Z. Li, Q. Wang, and Y. Zhang, "FedTAR: Task and resource-aware federated learning for wireless computing power networks," IEEE Internet Things J., vol. 10, no. 5, pp. 4257-4270, Mar. 2023.
[24] P. Wang, W. Sun, H. Zhang, W. Ma, and Y. Zhang, "Distributed and secure federated learning for wireless computing power networks," IEEE Trans. Veh. Technol., early access, Feb. 22, 2023, doi: 10.1109/TVT.2023.3247859.
[25] J. Lee, H. Seo, J. Park, M. Bennis, and Y. Ko, "Learning emergent random access protocol for LEO satellite networks," IEEE Trans. Wireless Commun., vol. 22, no. 1, pp. 257-269, Jan. 2023.
[26] M. A. Jadoon, A. Pastore, M. Navarro, and F. Perez-Cruz, "Deep reinforcement learning for random access in machine-type communication," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Apr. 2022, pp. 2553-2558.
[27] J. Bai, H. Song, Y. Yi, and L. Liu, "Multiagent reinforcement learning meets random access in massive cellular Internet of Things," IEEE Internet Things J., vol. 8, no. 24, pp. 17417-17428, Dec. 2021.
[28] W. Zhao, J. P. Queralta, and T. Westerlund, "Sim-to-real transfer in deep reinforcement learning for robotics: A survey," in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Dec. 2020, pp. 737-744.
[29] Study on RAN Improvements for Machine-Type Communications, document TR 37.868, version 11.0.0, 3GPP, Sep. 2011.
[30] Z. Wang and V. W. S. Wong, "Optimal access class barring for stationary machine type communication devices with timing advance information," IEEE Trans. Wireless Commun., vol. 14, no. 10, pp. 5374-5387, Oct. 2015.
[31] O. Naparstek and K. Cohen, "Deep multi-user reinforcement learning for distributed dynamic spectrum access," IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310-323, Jan. 2019.
[32] Z. Ding, R. Schober, and H. V. Poor, "No-pain no-gain: DRL assisted optimization in energy-constrained CR-NOMA networks," IEEE Trans. Commun., vol. 69, no. 9, pp. 5917-5932, Sep. 2021.
[33] L. Tello-Oquendo, V. Pla, I. Leyva-Mayorga, J. Martinez-Bauset, V. Casares-Giner, and L. Guijarro, "Efficient random access channel evaluation and load estimation in LTE-A with massive MTC," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1998-2002, Feb. 2019.
[34] R. Sutton and A. Barto. (2017). Reinforcement Learning: An Introduction (Draft). [Online]. Available: http://www.incompleteideas.net/book/bookdraft2017nov5.pdf
[35] D. Silver. (2016). Tutorial: Deep Reinforcement Learning. [Online]. Available: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
[36] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," 2015, arXiv:1509.06461.
[37] Q. Luo, T. H. Luan, W. Shi, and P. Fan, "Deep reinforcement learning based computation offloading and trajectory planning for multi-UAV cooperative target search," IEEE J. Sel. Areas Commun., vol. 41, no. 2, pp. 504-520, Feb. 2023.
[38] J. Cheng, C. Lee, and T. Lin, "Prioritized random access with dynamic access barring for RAN overload in 3GPP LTE-A networks," in Proc. IEEE GLOBECOM Workshops (GC Wkshps), Dec. 2011, pp. 368-372.
Wenbo Fan received the B.S. degree in communication engineering from Southwest Jiaotong University, Chengdu, China, in 2018, where he is currently pursuing the Ph.D. degree with the School of Information Science and Technology. His research interests include massive random access, machine learning, and compressed sensing.

Yan Long (Member, IEEE) received the B.E. degree in electrical and information engineering and the Ph.D. degree in communication and information systems from Xidian University, Xi'an, China, in 2009 and 2015, respectively. From September 2011 to March 2013, she was a Visiting Student with the Department of Electrical and Computer Engineering, University of Florida, USA. She is currently a Lecturer with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. Her research interests include distributed machine learning in wireless networks, the next generation of WLAN, 5G/6G cellular networks, and wireless resource optimization.