Deep Reinforcement Learning Based Transmission Scheduling For Sensing Aware Control
Abstract: Massive field data is wirelessly transmitted to the edge side to facilitate sensing and control in the emerging Industrial Internet of Things (IIoT) systems. Under the expanding transmission scheduling space and dynamic network conditions, balancing control performance and limited transmission resources is a fundamental challenge. For this problem, we propose a novel deep reinforcement learning (DRL)-based transmission scheduling method (DTSM), where sensing performance guarantee is introduced for its criticality in ensuring complete system observation and effective control. Specifically, taking system observability as the key metric, the time slots for multi-sensor data transmission under different control demands are properly reserved with theoretically guaranteed performance. Then, the primal-dual DRL framework is adopted to further improve the overall performance of system control and resource utilization by dynamically scheduling the transmission number of each sensor. The scheduling is based on the real-time states of sensing and wireless network, and the action space is determined according to our reserved time slots. Besides, after primal-dual updates, the scheduling results can satisfy the estimation error-evaluated constraint imposed for the ultimate control effect. Finally, the proposed method is applied to the industrial laminar cooling process and its effectiveness is fully demonstrated.

Note to Practitioners: This paper is motivated by the requirement of balancing control performance and scarce transmission resources in industrial automation fields such as steel manufacturing, where massive sensor data is transmitted to the edge side through wireless networks. The expanding transmission scheduling space and dynamic network conditions have led to increased interest in advanced deep reinforcement learning (DRL) methods. However, few previous works have explored the impact of control demands on intelligent transmission scheduling design. For these issues, we propose a novel DRL-based transmission scheduling method (DTSM), where the time slots for multi-sensor data transmission are delicately reserved according to different control demands and dynamic scheduling is realized based on real-time states of sensing and wireless network. The overall performance of system control and resource utilization is improved, and practitioners can easily adjust method parameters to achieve the desired balance between the two aspects according to practical demands. Case studies in the industrial hot rolling process demonstrate the superiority of DTSM. Our future work will consider the joint scheduling of uplink-downlink transmissions and design the collaboration among multiple edge computing nodes (ECNs) to address the limitations of centralized learning methods. Besides, the proposed method can be extended to other industrial applications such as flight control system testing.

Index Terms: Sensing and control, deep reinforcement learning (DRL), dynamic transmission scheduling, industrial Internet of Things (IIoT).

NOMENCLATURE

Symbols
$(\cdot)^\top$    Transpose of a vector or a matrix.
$\mathbb{E}[\cdot]$    Expectation of a random variable.
$\otimes$    Kronecker product.
$\operatorname{tr}(\cdot)$, $\operatorname{rank}(\cdot)$    Trace and rank of a matrix.
$\bar{\lambda}_A$, $\underline{\lambda}_A$    The largest and smallest eigenvalues of a matrix $A$.
$I_m$, $0_m$    $m \times m$ identity matrix and null matrix.
$1_m$    $m \times 1$ vector containing all ones.
$e_m^j$    $m \times 1$ vector with 1 as its $j$-th component and 0 elsewhere.
$A \succeq 0_m$    $A \in \mathbb{R}^{m \times m}$ being positive semidefinite.
$[A]_{i,j}$, $[A]_{i,:}$, $[A]_{:,j}$, $[a]_i$    The $(i,j)$-th entry, $i$-th row, $j$-th column of a matrix $A$, and the $i$-th entry of a vector $a$.
$\mathbb{N}_n$    Set $\{1, 2, \cdots, n\}$.

Variables
$x_k$, $y_k$, $u_k$    System state, measurement value, and control action at time period $k$.
$w_k$, $v_k$    Process noise and measurement noise at time period $k$.
$\Gamma_k$, $\Xi_k$    Transmission reliability indicator and network state at time period $k$.
$[\phi_k]_j$    Scheduled number of transmissions for sensor $j$ during time period $k$.
$[\bar{\phi}]_j$    Total number of slots reserved for sensor $j$.
$J_{c,k}$, $J_{r,k}$    Control and transmission costs at time period $k$.

Received 6 November 2024; accepted 10 January 2025. Date of publication 16 January 2025; date of current version 8 April 2025. This article was recommended for publication by Associate Editor C. Zhang and Editor Q. Zhao upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Grant 62025305, Grant 62432009, Grant 61933009, Grant 92167205, and Grant 62103268. (Corresponding author: Cailian Chen.)
The authors are with the Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China, and also with the Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TASE.2025.3530409
1558-3783 © 2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence
and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on June 06,2025 at 08:40:48 UTC from IEEE Xplore. Restrictions apply.
10906 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, VOL. 22, 2025
JIN et al.: DRL BASED TRANSMISSION SCHEDULING FOR SENSING AWARE CONTROL 10907
$$x_{k+1} = A x_k + B u_k + B' u'_k + w_k, \tag{1}$$
$$y_k = \Gamma_k (C G x_k + v_k), \tag{2}$$

where $k \in \mathbb{N}_T$ with $T$ being the time horizon. $x_k \in \mathbb{R}^m$, $y_k \in \mathbb{R}^N$, and $u_k \in \mathbb{R}^q$ are the system state, measurement value, and control action at time period $k$, respectively. $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times q}$, and $G \in \mathbb{R}^{l \times m}$ are the system, control, and observation matrices, respectively. $C \in \mathbb{R}^{N \times l}$ is the binary sensing matrix satisfying that $C 1_l = 1_N$ and $C^\top 1_N > 0_l$. The diagonal matrix $\Gamma_k \in \mathbb{R}^{N \times N}$ is the transmission reliability indicator, where $[\Gamma_k]_{j,j} = 1$ if the measurement $[C]_{j,:} G x_k + [v_k]_j$ of sensor $j$ is received at the ECN and $[\Gamma_k]_{j,j} = 0$ otherwise. $u'_k \in \mathbb{R}^{q'}$ with input matrix $B' \in \mathbb{R}^{m \times q'}$ is the exogenous input that cannot be changed [31]. The process noise $w_k \sim \mathcal{N}(0, \Sigma_w)$, measurement noise $v_k \sim \mathcal{N}(0, \Sigma_v)$, and initial state $x_1 \sim \mathcal{N}(\bar{x}_1, \Sigma_x)$, where $\Sigma_w$, $\Sigma_v$, and $\Sigma_x$ are positive definite. $w_k$, $v_k$, and $x_1$ are mutually uncorrelated.

Here, within all observable system dimensions captured by $G$, each sensor $j \in \mathbb{N}_N$ obtains its measurement according to the sensing vector $[C]_{j,:}$, which depends on the sensor's spatial distribution. Notice that the potentially high-dimensional $x_k$ cannot be directly obtained, and it is necessary to fuse the measurements from the sensor network for accurate state estimation.

Assumption 1 is provided to ensure the system controllability and observability, which is the basis for subsequent transmission policy optimization.

Assumption 1: The pair $(A, B)$ is controllable, and $(A, G)$ is observable.

2) Network Model: We consider the sensors transmitting their measurements to the ECN over a shared block-fading wireless channel [32], where the beacon-enabled IEEE 802.15.4 protocol [14] is adopted. The channel gain is assumed constant within each time period but may vary period by period [7]. The Markov network state process $\{\Xi_k\}_{k=0}^{T}$ with $\Xi_k \in \mathcal{E} \triangleq \{0, \cdots, M_n - 1\}$ is then defined to capture the gain variation, where $\Xi_0$ is known and the transition probabilities are

$$\mathbb{P}[\Xi_k = \varepsilon' \mid \Xi_{k-1} = \varepsilon] = [E]_{\varepsilon,\varepsilon'}, \quad \forall \varepsilon, \varepsilon' \in \mathcal{E}. \tag{3}$$

As shown in Fig. 2, the beacon interval contains active and inactive periods. The active period further includes a contention access period (CAP) and a collision free period (CFP). Since the real-time sensing performance guarantee is required, we focus on the scheduling of CFP, which employs the deterministic time-division multiple access (TDMA). The limitation that the CFP can include up to 7 guaranteed time slots is relaxed in [33], which allows the number of slots assigned to the CFP to be adjustable. Each time slot supports one transmission of any sensor's measurement. Then, for each sensor $j$, $[\phi_k]_j$ denotes the scheduled number of transmissions during time period $k$, and $[\bar{\phi}]_j$ is the total number of slots reserved for it. Denoting the failure rate of per transmission from sensor $j$ to the ECN given $\Xi_k = \varepsilon$ as $\mu_{j,\varepsilon}$, the packet reception ratio is as follows:

$$\mathbb{P}\big[[\Gamma_k]_{j,j} = 1 \mid \Xi_k = \varepsilon, [\phi_k]_j = p\big] = 1 - \mu_{j,\varepsilon}^{p}. \tag{4}$$

In addition, when the beacon containing downlink scheduling commands is lost in practice, we can, for example, set $[\phi_k]_j = [\bar{\phi}]_j$ for each sensor in a performance-first manner. This still avoids collisions among data transmissions due to our time slot division. In the case of unreliable control action transmissions, we can let the actuator record and adopt the last received control action, or introduce a smart actuator for local control. Interested readers may refer to [14] and [15] for more details about actuation packet scheduling. We leave these for future work.

Referring to [34], the following assumption is adopted for the transmission process:

Assumption 2: The transmission indicators $\{[\Gamma_k]_{j,j}\}$ are conditionally independent given the network states and the transmission numbers.

B. Problem Formulation

The considered problem of transmission scheduling for sensing-aware control is as follows:

$$\text{Problem 1}: \quad \min_{\{\phi_k, u_k\}} \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[\sum_{k=1}^{T} \kappa_c J_{c,k} + \kappa_r J_{r,k}\right] \tag{5}$$
$$\text{s.t.} \quad \{\phi_k\} \in \mathcal{D}_v \cap \mathcal{D}_s. \tag{6}$$

In this problem, the control goal is to drive the system states to the desired $\{z_k\}$ while reducing the control usage, thus the control cost $J_{c,k} = \|x_{k+1} - z_{k+1}\|_Q^2 + \|u_k\|_R^2$, where $Q$ and $R$ are positive definite matrices. Besides, each transmission from the sensor consumes its limited resources such as battery energy, then the total transmission cost is defined as $J_{r,k} = \sum_{j=1}^{N} [\phi_k]_j$. $\kappa_c$ and $\kappa_r$ are positive weight factors reflecting the preferences in terms of control performance and transmission cost, respectively. $\mathcal{D}_v$ is the reservation constraint meaning that a suitable number of transmissions are further scheduled only within the reserved slots $\bar{\phi}$. $\mathcal{D}_s$ is the sensing performance-related constraint imposed for the ultimate control effect.
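The network model in (3) and (4) can be simulated in a few lines. The following is a minimal sketch, not the authors' implementation: the function names, the list-based transition matrix `E`, and the failure-rate row `mu_row` (holding $\mu_{j,\varepsilon}$ for the current network state) are illustrative assumptions. It samples the next network state from one row of $E$ and draws each reception indicator $[\Gamma_k]_{j,j}$ independently, as Assumption 2 permits.

```python
import random
from typing import List, Sequence


def reception_prob(mu: float, p: int) -> float:
    """Eq. (4): probability that at least one of p transmissions is received."""
    return 1.0 - mu ** p


def step_network_state(state: int, E: Sequence[Sequence[float]],
                       rng: random.Random) -> int:
    """Sample the next network state from row `state` of E, as in eq. (3)."""
    u = rng.random()
    acc = 0.0
    for nxt, prob in enumerate(E[state]):
        acc += prob
        if u < acc:
            return nxt
    # Guard against rounding when the row sums to slightly less than 1.
    return len(E[state]) - 1


def sample_reception(mu_row: Sequence[float], phi: Sequence[int],
                     rng: random.Random) -> List[int]:
    """Draw the diagonal of Gamma_k, one independent draw per sensor."""
    return [1 if rng.random() < reception_prob(mu, p) else 0
            for mu, p in zip(mu_row, phi)]
```

With $p = 0$ scheduled transmissions the reception probability is zero, and each extra slot multiplies the miss probability by $\mu_{j,\varepsilon}$, which is why reserving more slots trades transmission cost for sensing reliability.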
which necessarily exists since our assumptions satisfy the conditions in [35]. Also, we have

$$\Theta_\infty = A^\top U_\infty B \left(B^\top U_\infty B + R\right)^{-1} B^\top U_\infty A. \tag{17}$$

Ignoring the fixed terms, the objective function of Problem 1 can be rewritten as

$$\tilde{g}_1 = \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[\sum_{k=1}^{T} \kappa_c \operatorname{tr}(\Theta_\infty P_k) + \kappa_r J_{r,k}\right]. \tag{18}$$

Thus, using the optimal control law (12), the control performance is expressed based on the sensing outcomes $\{P_k\}$, which are dependent on the designed $\{\phi_k\}$. The bridge between transmission and control is constructed. For clarity, the order of edge sensing, transmission, and control processes is shown as follows:

$$\cdots \to x_k \xrightarrow{(9),(10)} \hat{x}_{k|k-1}, P_{k|k-1} \xrightarrow{(i)} \phi_k \xrightarrow{(4)} y_k, \Gamma_k \xrightarrow{(7),(8)} \hat{x}_k, P_k \xrightarrow{(12)} u_k \xrightarrow{(1)} x_{k+1} \to \cdots, \tag{19}$$

where (i) adopts the DRL-based scheduling designed subsequently.

Therefore, considering that $P_{k|k-1}$ is available at the beginning of time period $k$ and the network state also affects the packet reception ratio, the MDP state for DRL is defined as

$$s_k \triangleq \left(\frac{P_{k|k-1}}{\bar{\lambda}_{\Sigma_x}}, \frac{\Xi_{k-1}}{M_n}\right), \tag{20}$$

where $\bar{\lambda}_{\Sigma_x}$ and $M_n$ are the normalization terms. Based on (18), the reward at time period $k$ is defined as

$$r_k \triangleq -\kappa_c \operatorname{tr}(\Theta_\infty P_k) - \kappa_r J_{r,k}, \tag{21}$$

and the stage cost is correspondingly denoted as $c_k = -r_k$.

B. DRL Action Space Design: Observability-Based Time Slot Reservation

As an important concept in estimation theory, system observability is one of the fundamental conditions to realize complete sensing [25]. For system (1) and (2), the observability matrix within time interval $\{k, \cdots, k+d-1\}$ is defined as

$$\mathcal{O}_k \triangleq \begin{bmatrix} \Gamma_k C G I_m \\ \Gamma_{k+1} C G A \\ \cdots \\ \Gamma_{k+d-1} C G A^{d-1} \end{bmatrix}, \tag{22}$$

where $d$ is the smallest positive integer such that matrix $[G^\top, (GA)^\top, \cdots, (GA^{d-1})^\top]^\top$ is full rank. It follows from Assumption 1 that $d$ exists and $d \le m$. Considering that $\mathcal{O}_k$ contains random variables, the observability probability [5] is defined as follows:

$$\varphi(k, \{\varepsilon_i\}, \{p_i\}) \triangleq \mathbb{P}\big[\operatorname{rank}(\mathcal{O}_k) = m \mid \{\Xi_{k-1+i}\} = \{\varepsilon_i\}, \{\phi_{k-1+i}\} = \{p_i\}\big], \tag{23}$$

where the range $i = 1$ to $d$ of $\{\cdot\}$ is omitted for notation brevity. It is conditioned upon the network states and the scheduled numbers. Then, Theorem 1 is proposed to analyze and bound this probability.

Theorem 1: The observability probability is lower and upper bounded as follows:

$$\left[\min_{t \in \mathbb{N}_d} \prod_{i=1}^{l} (1 - \mu'_{i,t})\right]^{d} \le \varphi\big(k, \{\varepsilon_i\}_{i=1}^{d}, \{p_i\}_{i=1}^{d}\big) \le \sum_{o=m}^{ld} \sum_{h=1}^{\binom{ld}{o}} \prod_{(i,t) \in \mathcal{I}_{o,h}} (1 - \mu'_{i,t}) \prod_{(i,t) \in \bar{\mathcal{I}}_{o,h}} \mu'_{i,t}. \tag{24}$$

Here, $\mathcal{S}_1, \cdots, \mathcal{S}_l$ are the sensor sets obtained by merging sensors with the same $[C]_{j,:}$. $\mu'_{i,t} = \prod_{j \in \mathcal{S}_i} \mu_{j,\varepsilon_t}^{[p_t]_j}$. $\mathcal{I}_{o,h}$ is the $h$-th element in set $\hat{\mathcal{I}}_o = \{\mathcal{I} \mid \mathcal{I} \subseteq \mathbb{N}_l \times \mathbb{N}_d, |\mathcal{I}| = o\}$, and $\bar{\mathcal{I}}_{o,h} = (\mathbb{N}_l \times \mathbb{N}_d) \setminus \mathcal{I}_{o,h}$.

Proof: Define the following auxiliary transmission indicator:

$$[\gamma'_k]_i = \begin{cases} 1, & \exists j \in \mathcal{S}_i, [\Gamma_k]_{j,j} = 1, \\ 0, & \text{otherwise}. \end{cases} \tag{25}$$

Let $\mathcal{G}_1 = \{[\gamma'_t]_i = 1, \forall i \in \mathbb{N}_l, \forall t \in \{k, \cdots, k+d-1\}\}$, and $\mathcal{G}_2 = \{\sum_{t=k}^{k+d-1} \sum_{i=1}^{l} [\gamma'_t]_i \ge m\}$. Since for each $t \in \{k, \cdots, k+d-1\}$ and $i \in \mathbb{N}_l$, $[C]_{j,:} G A^{t-k}$ remains the same for every $j \in \mathcal{S}_i$, we have $\mathcal{G}_1$ ensures that all distinct row vectors in $C G A^{t-k}$ are retained in $\Gamma_t C G A^{t-k}$. Also it holds that $\operatorname{rank}([(CG)^\top, \cdots, (CGA^{d-1})^\top]^\top) = m$ since $\operatorname{rank}(C) = l$. Thus, based on matrix row transformation, we have $\mathbb{P}[\operatorname{rank}(\mathcal{O}_k) = m \mid \mathcal{G}_1] = 1$. Besides, it holds that $\operatorname{rank}(\mathcal{O}_k) \le \sum_{t=k}^{k+d-1} \operatorname{rank}(\Gamma_t C) = \sum_{t=k}^{k+d-1} \sum_{i=1}^{l} [\gamma'_t]_i$ and thus $\mathbb{P}[\operatorname{rank}(\mathcal{O}_k) = m \mid \bar{\mathcal{G}}_2] = 0$.

Similar to [36], letting $\mathcal{G}_o = \{\{\Xi_{k-1+i}\} = \{\varepsilon_i\}, \{\phi_{k-1+i}\} = \{p_i\}\}$, we have $\mathbb{P}[\mathcal{G}_1 \mid \mathcal{G}_o] \le \varphi(k, \{\varepsilon_i\}, \{p_i\}) \le \mathbb{P}[\mathcal{G}_2 \mid \mathcal{G}_o]$. Based on Assumption 2, it is obtained that

$$\begin{aligned} \mathbb{P}[\mathcal{G}_1 \mid \mathcal{G}_o] &= \prod_{t=k}^{k+d-1} \prod_{i=1}^{l} \mathbb{P}\big[[\gamma'_t]_i = 1 \mid \Xi_t = \varepsilon_{t-k+1}, \phi_t = p_{t-k+1}\big] \\ &= \prod_{t=k}^{k+d-1} \prod_{i=1}^{l} \Big(1 - \prod_{j \in \mathcal{S}_i} \mu_{j,\varepsilon_{t-k+1}}^{[p_{t-k+1}]_j}\Big) \\ &= \prod_{t=1}^{d} \prod_{i=1}^{l} (1 - \mu'_{i,t}) \ge \left[\min_{t \in \mathbb{N}_d} \prod_{i=1}^{l} (1 - \mu'_{i,t})\right]^{d}. \end{aligned}$$

Then, by traversing the feasible outcomes of $\{[\gamma'_t]_i\}$, $\mathbb{P}[\mathcal{G}_2 \mid \mathcal{G}_o]$ can be calculated as the rightmost term in (24). The proof is completed. ■

As a comparison, [25] proposes deterministic system observability conditions regarding whether each sensor transmits or not under the assumption of reliable transmission. We instead further consider the relationship between the continuous observability probability and each sensor's adjustable transmission number under imperfect transmission conditions. Compared with our preliminary work [36], the form of the lower bound here is more concise, which facilitates subsequent calculations.

To improve edge sensing for desired control performance, a certain number of time slots need to be reserved for adequate measurement acquisition. With observability probability as the
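key metric, the target is to design a $\bar{\phi}$ satisfying the following condition for each $k \in \mathbb{N}_T$ and $\{\varepsilon_i\}_{i=1}^{d} \in \mathcal{E}^d$:

$$\varphi(k, \{\varepsilon_i\}, \{\bar{\phi}\}) \ge 1 - \frac{\varrho}{\bar{\lambda}^{d}_{A^\top A}}, \tag{26}$$

where $\bar{\lambda}_{A^\top A}$ is assumed to be positive. $\varrho \in (0, 1)$ reflects the improvement of sensing performance, and $\varrho < 1$ is the condition to ensure the boundedness of $\mathbb{E}[\operatorname{tr}(P_k)]$, which will be proved in Theorem 2. As $\varrho$ further decreases toward 0, the expected observability probability gradually approaches 1, and the sensing requirement is higher. To efficiently solve for $\bar{\phi}$, we utilize the derived probability lower bound and greedy-based ideas [27]. For every $\varepsilon \in \mathcal{E}$, initialized with $\bar{\phi}^{\varepsilon} = 0_N$, the sensor with the best transmission condition is first selected in each $\mathcal{S}_i$, $i \in \mathbb{N}_l$:

$$\bar{\phi}^{\varepsilon} = \bar{\phi}^{\varepsilon} + e_N^{j_0}, \quad j_0 \in \operatorname*{argmin}_{j \in \mathcal{S}_i} \mu_{j,\varepsilon}. \tag{27}$$

Then, define $f_g(\phi, \varepsilon) = \big[\prod_{i=1}^{l} (1 - \prod_{j \in \mathcal{S}_i} \mu_{j,\varepsilon}^{[\phi]_j})\big]^{d}$ as the objective function. Until $f_g(\bar{\phi}^{\varepsilon}, \varepsilon) \ge 1 - \varrho / \bar{\lambda}^{d}_{A^\top A}$, each update is as follows:

$$+\, \kappa_r \sum_{j=1}^{N} [\bar{\phi}]_j. \tag{31}$$

For the coefficients, there exist positive constants $\alpha_1$ and $\alpha_2$ such that

$$\zeta_1 = \frac{\max\big(\bar{\lambda}^{d-1}_{A^\top A}, \bar{\lambda}^{0}_{A^\top A}\big)}{1 - \varrho}\, \zeta_{1,1} + m \bar{\lambda}_{\Sigma_w} \sum_{i=0}^{d-2} \bar{\lambda}^{i}_{A^\top A},$$
$$\zeta_2 = \frac{\bar{\lambda}_{BB^\top} \bar{\lambda}_{A^\top A}}{\underline{\lambda}_Q \underline{\lambda}_{B^\top B} + \underline{\lambda}_R} \left(\bar{\lambda}_Q + \frac{\underline{\lambda}_R \bar{\lambda}_{A^\top A}}{\alpha_2 \zeta_{2,1}^{n-1}}\right)^{2},$$

where

$$\zeta_{1,1} = \frac{\underline{\lambda}_{\Sigma_v} m}{\alpha_1 \zeta_{1,2}^{d-1}} + \frac{\varrho}{\bar{\lambda}^{d}_{A^\top A}}\, m \bar{\lambda}_{\Sigma_w} \sum_{i=0}^{d-1} \bar{\lambda}^{i}_{A^\top A},$$
$$\zeta_{1,2} = \left(1 + \frac{\bar{\lambda}_{\Sigma_w} \underline{\lambda}_{\Sigma_v} + \bar{\lambda}_{\Sigma_w} \underline{\lambda}_{\Sigma_w} \bar{\lambda}_{G^\top G} \max_i |\mathcal{S}_i|}{\underline{\lambda}_{A^\top A} \underline{\lambda}_{\Sigma_w} \underline{\lambda}_{\Sigma_v}}\right)^{-1},$$
$$\zeta_{2,1} = \left(1 + \frac{\bar{\lambda}_Q \underline{\lambda}_R + \bar{\lambda}_Q \underline{\lambda}_Q \bar{\lambda}_{BB^\top}}{\underline{\lambda}_{A^\top A} \underline{\lambda}_Q \underline{\lambda}_R}\right)^{-1}.$$

Proof: We first discuss the sensing part. Referring to [37], since $C^\top \Gamma_k^\top \Gamma_k C = \sum_{j=1}^{N} [\Gamma_k]_{j,j} [C]_{j,:}^\top [C]_{j,:} \preceq \max_i |\mathcal{S}_i| I_l$, we have $P_{k+1|k}^{-1} \succeq \zeta_{1,2} A^{-\top} P_k^{-1} A^{-1}$. Then for $k \ge d$, when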
For (ii), it is known that there exists a positive constant $\alpha'_2$ such that $CC^\top \succeq \alpha'_2 I_m$, and $\alpha_2$ is calculated as

$$\alpha_2 = \begin{cases} \alpha'_2 / \bar{\lambda}^{n-1}_{A^\top A}, & n \ge 2, \\ \alpha'_2, & n = 1. \end{cases} \tag{33}$$

Based on (13), it holds that $\underline{\lambda}_Q I_m \preceq U_k \preceq \Big(\bar{\lambda}_Q + \frac{\underline{\lambda}_R \bar{\lambda}_{A^\top A}}{\alpha_2 \zeta_{2,1}^{n-1}}\Big) I_m$, then we have

$$\Theta_k = A^\top U_k B (B^\top U_k B + R)^{-1} B^\top U_k A \preceq A^\top U_k B (\underline{\lambda}_Q \underline{\lambda}_{B^\top B} + \underline{\lambda}_R)^{-1} B^\top U_k A \preceq \zeta_2 I_m.$$

Since $\Theta_\infty$ is the steady-state value of $\Theta_k$, $\Theta_\infty \preceq \zeta_2 I_m$ holds. Finally, as $\mathbb{E}[\operatorname{tr}(\Theta_\infty P_k)] \le \bar{\lambda}_{\Theta_\infty} \mathbb{E}[\operatorname{tr}(P_k)]$, (31) is directly obtained. The proof is completed. ■

Since $\varrho < 1$, the bound on the right-hand side of (31) is finite as $k \to \infty$. Also, the bound is nonincreasing with the decrease of $\varrho$. This theoretically describes the effect of observability probability guarantee on overall performance, and justifies the use of (26) as a target. In essence, Theorem 2 reflects the comprehensive effects of controllability, observability, and transmission design on overall performance. Specifically, the control relevant $\Theta_k$ is analyzed to be bounded based on the assumptions of system controllability and reliable control action transmission. Due to imperfect sensing data transmission, a sufficient $\bar{\phi}$ is needed to ensure the observability probability. Then, based on the sensing performance analysis under the observable condition, it is ultimately guaranteed that the sensing relevant $\operatorname{tr}(P_k)$ is bounded in a probabilistic sense. Compared with [5], [36], we introduce the control part and further consider the active transmission design in the sensing part. Besides, the rationality of the involved requirement for $A$ to be invertible is well-discussed in [38].

C. DRL-Based Transmission Scheduling Method

Based on the state, reward, and action space designed above, the MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \beta_d)$ is constructed as follows:

• The state space $\mathcal{S} \triangleq \mathbb{R}^{m \times m} \times \{0, \frac{1}{M_n}, \cdots, \frac{M_n - 1}{M_n}\}$ consists of the sensing outcome and the network state. The state $s_k$ at time period $k$ is as defined in (20).
• The action space $\mathcal{A}$ as defined in (30) is the combination of all possible scheduled transmission numbers for each sensor. The action at time period $k$ is $a_k \triangleq \phi_k$.
• The transition probability $\mathcal{P}[s' \mid s, a]$ consists of two components of sensing outcome transition and network state transition. Based on the independence of the two components and conditional probability properties, $\mathcal{P}$ is expressed as

$$\begin{aligned} \mathcal{P} &\triangleq \mathbb{P}\Big[s_{k+1} = \Big(\frac{P'}{\bar{\lambda}_{\Sigma_x}}, \frac{\varepsilon'}{M_n}\Big) \,\Big|\, s_k = \Big(\frac{P}{\bar{\lambda}_{\Sigma_x}}, \frac{\varepsilon}{M_n}\Big), a_k = a\Big] \\ &= \mathbb{P}[P_{k+1|k} = P' \mid P_{k|k-1} = P, \Xi_k = \varepsilon', a_k = a] \cdot \mathbb{P}[\Xi_k = \varepsilon' \mid \Xi_{k-1} = \varepsilon], \end{aligned}$$

where the second term is directly calculated as $[E]_{\varepsilon,\varepsilon'}$ by (3). Denoting the mapping $P_{k|k-1}, \Gamma_k \to P_{k+1|k}, \forall k$ determined by (7), (9) as $f_s(\cdot, \cdot)$, we have

$$\mathbb{P}[P_{k+1|k} = P' \mid P_{k|k-1} = P, \Xi_k = \varepsilon', a_k = a] = \begin{cases} \prod_{j=1}^{N} \mathbb{P}\big[[\Gamma_k]_{j,j} = [o]_j \mid \Xi_k = \varepsilon', [\phi_k]_j = [a]_j\big], & \text{if } P' = f_s(P, o), [o]_j \in \{0, 1\}, \\ 0, & \text{otherwise}, \end{cases}$$

where probability $\mathbb{P}\big[[\Gamma_k]_{j,j} = [o]_j \mid \Xi_k = \varepsilon', [\phi_k]_j = [a]_j\big]$ is calculated as in (4).
• The reward is as defined in (21).
• $\beta_d$ is the discount factor with $\beta_d \in (0, 1)$.

In this MDP, the reward should be formally denoted as $\mathbb{E}[r_k \mid s_k = s, a_k = a]$ which depends on $s_k$ and $a_k$, and here we use the real outcomes of $P_k$ instead to avoid calculating the expectation. Since the DRL framework is adopted, the transition probability (matrix $E$ involved) is allowed to be unknown, and the specific expression of $\mathcal{P}$ is given to clarify the Markov properties of the considered problem. Moreover, the curse of dimensionality is effectively overcome compared with traditional policy and value iteration methods.

For the constraints, it is direct to obtain that $\mathcal{D}_v = \{\{\phi_k\} \mid \phi_k \in \mathcal{A}, \forall k\}$, which is already addressed by the action space construction. Then, to better regulate the control performance, we design the constraint $\mathcal{D}_s$ evaluated by $\operatorname{tr}(P_k)$ as follows:

$$\mathcal{D}_s = \left\{\{\phi_k\} \,\Big|\, \mathbb{E}\left[\sum_{k=1}^{T} \beta_d^{k-1} \mathbb{1}\{\operatorname{tr}(P_k) > \bar{b}\}\right] \le T_c\right\}, \tag{34}$$

where $T_c$ is the maximum number of time periods allowed for $\operatorname{tr}(P_k)$ to be greater than the threshold $\bar{b}$. That is, we expect to limit the time that the covariance matrix is outside the desired region $\{P_k \mid \operatorname{tr}(P_k) \le \bar{b}\}$ [24], [39]. Here, we take into account that $\operatorname{tr}(P_k)$ quantifies the mean square estimation error, i.e., $\operatorname{tr}(P_k) = \mathbb{E}[\|e_k\|^2]$, and it directly affects the control cost (18).

To solve the considered MDP with DRL methods, we introduce the following discounted cost form [21], which is a common setting in DRL and approximates the effect of using the original cost function (18):

$$\text{Problem 2}: \quad \min_{\pi \in \Pi_{RS}} \lim_{T \to \infty} \mathbb{E}\left[\sum_{k=1}^{T} \beta_d^{k-1} c_k\right] \tag{35}$$
$$\text{s.t.} \quad \{\phi_k\} \in \mathcal{D}_s. \tag{36}$$

Here, $\Pi_{RS}$ is our concerned randomized stationary (Markov) policies set [30]. $\pi(\cdot \mid s; \theta_a)$ indicates the probability distribution of the scheduled transmission numbers given current state $s$, which is parameterized with a vector $\theta_a$. Considering the imposed constraint (36), we introduce the Lagrangian for Problem 2 with dual variable $\nu \in \mathbb{R}$:

$$\mathcal{L}(\theta_a, \nu) = \mathbb{E}\left[\sum_{k=1}^{T} \beta_d^{k-1} (c_k + \nu \omega_k)\right], \tag{37}$$

where $\omega_k = \mathbb{1}\{\operatorname{tr}(P_k) > \bar{b}\} - (1 - \beta_d) T_c$, and denote $c_k + \nu \omega_k$ as $\tilde{c}_k$. The corresponding dual function is as follows:

$$\mathcal{H}(\nu) = \min_{\theta_a} \mathcal{L}(\theta_a, \nu). \tag{38}$$
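The sensing-outcome factor of the MDP transition probability is a product of per-sensor reception probabilities from (4). As a minimal sketch under assumed names (the function `outcome_probability` and the argument `mu_row`, holding $\mu_{j,\varepsilon'}$ for the current network state, are illustrative and not from the paper), the probability of one reception pattern $o \in \{0,1\}^N$ under action $a$ can be evaluated as:

```python
from typing import Sequence


def outcome_probability(o: Sequence[int], a: Sequence[int],
                        mu_row: Sequence[float]) -> float:
    """Probability of reception pattern o given scheduled numbers a.

    Multiplies the per-sensor terms of eq. (4): sensor j is received with
    probability 1 - mu_j ** a_j and missed with probability mu_j ** a_j.
    The factorisation relies on Assumption 2 (conditional independence).
    """
    prob = 1.0
    for o_j, a_j, mu_j in zip(o, a, mu_row):
        p_recv = 1.0 - mu_j ** a_j
        prob *= p_recv if o_j == 1 else (1.0 - p_recv)
    return prob
```

Summing this quantity over all $2^N$ patterns gives 1, which is a quick sanity check when tabulating the transition model for a small number of sensors.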
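To make the Lagrangian (37) concrete, the sketch below computes the constraint signal $\omega_k$, the augmented stage cost $\tilde{c}_k = c_k + \nu \omega_k$, and a single-trajectory estimate of the discounted sum. It is an illustration under assumptions: the function names are hypothetical, and `costs` and `traces` stand for recorded values of $c_k$ and $\operatorname{tr}(P_k)$ along one episode.

```python
from typing import List, Sequence, Tuple


def lagrangian_stage_costs(costs: Sequence[float], traces: Sequence[float],
                           nu: float, b_bar: float, T_c: float,
                           beta_d: float) -> Tuple[List[float], List[float]]:
    """Return (omega, c_tilde) for one trajectory.

    omega_k = 1{tr(P_k) > b_bar} - (1 - beta_d) * T_c, as defined below (37);
    c_tilde_k = c_k + nu * omega_k is the cost handed to the inner RL problem.
    """
    omega = [(1.0 if tr > b_bar else 0.0) - (1.0 - beta_d) * T_c
             for tr in traces]
    c_tilde = [c + nu * w for c, w in zip(costs, omega)]
    return omega, c_tilde


def discounted_sum(values: Sequence[float], beta_d: float) -> float:
    """Sum of beta_d ** (k - 1) * values_k for k = 1, ..., T."""
    return sum(beta_d ** idx * v for idx, v in enumerate(values))
```

Applying `discounted_sum` to `c_tilde` gives a sample estimate of $\mathcal{L}(\theta_a, \nu)$, while applying it to `omega` gives the sample quantity that drives the dual variable: a positive value indicates that the trajectory spent too long above the threshold $\bar{b}$.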
Then, the dual optimization problem is to maximize the dual function with respect to dual variable $\nu$:

$$\max_{\nu \ge 0} \mathcal{H}(\nu) = \max_{\nu \ge 0} \min_{\theta_a} \mathcal{L}(\theta_a, \nu). \tag{39}$$

It can be seen that for $\nu$ fixed, the inner optimization (38) can be solved with standard RL frameworks by setting the reward to $\tilde{r}_k = -\tilde{c}_k$. The outer optimization (39) is convex since the maximization objective function is concave and the constraint set is convex.

To avoid dealing with the potentially huge discrete action space, we introduce a virtual continuous action $\breve{a}_k \in (0, 1)^N$ and the transformation $[a_k]_j = \lfloor [\breve{a}_k]_j ([\bar{\phi}]_j + 1) \rfloor, \forall j \in \mathbb{N}_N$ to obtain the scheduling result. Then, we incorporate the state-of-the-art PPO method [40] into the primal-dual framework, which effectively generates the virtual action $\breve{a}_k$ and has a simplified structure. In PPO, the actor network outputs the mean and the standard deviation of a Gaussian distribution from which $\breve{a}_k$ is sampled. The critic network denoted as $V_c(s; \theta_c)$ estimates the state-value function. To ensure that $[\breve{a}_k]_j$ falls within $(0, 1)$, the last layer of the mean network uses a sigmoid activation function, and the sampled $\breve{a}_k$ is truncated between 0 and 1. The action transformation can essentially be regarded as part of the complete policy and does not affect the domains of $r_k$ and $\omega_k$. The flow from $s_k$ to the scheduling result is illustrated in Fig. 4.

Fig. 4. Scheduling result generation flow.

Based on the trajectory $\{s_1, \breve{a}_1, \tilde{r}_1, \cdots, s_T, \breve{a}_T, \tilde{r}_T, s_{T+1}\}$ collected at the ECN in each episode, the generalized advantage estimation (GAE) [40] is as follows:

$$\Psi_k = \sum_{t=k}^{T} (\beta_d \beta_g)^{t-k} \delta_t, \tag{40}$$

where $\beta_g$ is the GAE parameter, and $\delta_t = \tilde{r}_t + \beta_d V_c(s_{t+1}; \theta_c) - V_c(s_t; \theta_c)$ is the temporal difference error. By extracting a mini-batch of $M_b$ transitions $\{s_{k_t}, \breve{a}_{k_t}, \tilde{r}_{k_t}, s_{k_t+1}\}_{t=1}^{M_b}$, the loss functions for optimizing the actor and critic networks are

$$L_a(\theta_a) = -\frac{1}{M_b} \sum_{t=1}^{M_b} \min\big(\iota_{k_t}(\theta_a) \Psi_{k_t},\, f_{p,\epsilon}(\iota_{k_t}(\theta_a)) \Psi_{k_t}\big), \tag{41}$$
$$L_c(\theta_c) = \frac{1}{M_b} \sum_{t=1}^{M_b} \delta_{k_t}^2, \tag{42}$$

where the probability ratio $\iota_k(\theta_a) = \frac{\breve{\pi}(\breve{a}_k \mid s_k; \theta_a)}{\breve{\pi}(\breve{a}_k \mid s_k; \theta_a^{old})}$. $\breve{\pi}(\breve{a}_k \mid s_k; \cdot)$ is the probability that $\breve{a}_k$ is sampled under $s_k$, and $\theta_a^{old}$ is the actor network parameter before the update. The clip function is $f_{p,\epsilon}(x) = \max(\min(x, 1 + \epsilon), 1 - \epsilon)$, which restricts the ratio to $[1 - \epsilon, 1 + \epsilon]$. Then, with $\eta_a$ and $\eta_c$ as learning rates, the parameters of actor and critic networks are updated $M_p$ times in one episode by gradient descent. The dual variable $\nu$ is updated as follows:

$$\nu_{p+1} = \big[\nu_p + \eta_\nu \hat{\nabla}_\nu \mathcal{L}(\theta_a, \nu_p)\big]_+, \tag{43}$$

where $\eta_\nu$ is the learning rate, $\nu_{p+1}$ is the dual variable after episode $p \in \mathbb{N}_{M_l}$, and $\hat{\nabla}_\nu \mathcal{L}(\theta, \nu) = \sum_{k=1}^{T} \beta_d^{k-1} \omega_k$ is the approximate gradient using the sample trajectory. The whole process of DTSM is provided in Algorithm 1.

Algorithm 1 DTSM
Input: $\varrho$, $\bar{b}$, $T_c$, involved system and learning process parameters;
Output: $\{\phi_k\}$, $\{u_k\}$, and optimized policy $\pi$;
/* Time slot reservation */
1 for $\varepsilon = 0, 1, \cdots, M_n - 1$ do
2     Initialize $\bar{\phi}^{\varepsilon} = 0_N$;
3     Set $\bar{\phi}^{\varepsilon}$ according to (27) for $i = 1, 2, \cdots, l$;
4     while $f_g(\bar{\phi}^{\varepsilon}, \varepsilon) < 1 - \varrho / \bar{\lambda}^{d}_{A^\top A}$ do
5         Set $\bar{\phi}^{\varepsilon}$ according to (28);
6     end
7 end
8 Set $\bar{\phi}$ according to (29);
/* Dynamic transmission scheduling */
9 Set $\mathcal{D}_v$ and $\mathcal{D}_s$ according to $\bar{\phi}$, $\bar{b}$, and $T_c$;
10 Randomly initialize $\theta_a$ and $\theta_c$, and initialize $\nu = 0$;
11 for episode $p = 1, 2, \cdots, M_l$ do
12     Initialize $s_1 = (\Sigma_x / \bar{\lambda}_{\Sigma_x}, \Xi_0 / M_n)$;
13     for $k = 1, 2, \cdots, T$ do
14         Infer $\phi_k$ using current $\pi(\cdot \mid \cdot; \theta_a)$ and deliver it;
15         Collect $y_k$ from the field sensors and perform estimation (7), (8);
16         Calculate $u_k$ using (12) and deliver it;
17         Perform estimation (9), (10), and collect sample $\{s_k, \breve{a}_k, \tilde{r}_k, s_{k+1}\}$;
18     end
19     Compute the GAE sequences according to (40);
20     for epoch $i = 1, 2, \cdots, M_p$ do
21         Extract a mini-batch of $M_b$ transitions from the sample trajectory;
22         Update $\theta_a$ and $\theta_c$ by minimizing (41) and (42);
23     end
24     Update dual variable according to (43);
25 end

In DTSM, the constraint (34) formed by the selected parameters $\bar{b}$ and $T_c$ should at least be satisfied by the policy $a_k = \bar{\phi}, \forall k$. $\bar{b}$ and $T_c$ can further be jointly tuned to characterize the required performance constraint. In this way, the policy $a_k = \bar{\phi}, \forall k$, whose overall performance is guaranteed by Theorem 2, is a feasible solution in the DRL-based scheduling space. Thus, once the DTSM is well-trained, a policy with desired overall performance that outperforms the benchmark $a_k = \bar{\phi}, \forall k$ can be found. Moreover, for online deployment, only a learned actor network is required to support the execution of steps 14-17. Combined with the actual scenario, these steps correspond to procedures ①-④ in Fig. 5, respectively.
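The numerical pieces of the dynamic scheduling loop (the action transformation, the GAE recursion (40), the ratio clipping used in (41), and the projected dual ascent (43)) can be sketched without any deep learning library. The code below is illustrative only: the function names are hypothetical, critic values are passed in as a plain list, the clip is the standard PPO clip to $[1-\epsilon, 1+\epsilon]$, and the clamp in `to_schedule` is an added assumption so that a virtual action truncated to exactly 1 still maps inside the reserved slots.

```python
import math
from typing import List, Sequence


def to_schedule(a_virtual: Sequence[float], phi_bar: Sequence[int]) -> List[int]:
    """[a_k]_j = floor([a_virtual]_j * ([phi_bar]_j + 1)).

    The result is clamped to [phi_bar]_j (assumption, covers a_virtual == 1).
    """
    return [min(int(math.floor(a * (pb + 1))), pb)
            for a, pb in zip(a_virtual, phi_bar)]


def gae(rewards: Sequence[float], values: Sequence[float],
        beta_d: float, beta_g: float) -> List[float]:
    """Eq. (40) via the backward recursion Psi_k = delta_k + beta_d*beta_g*Psi_{k+1}.

    `values` holds V_c(s_1), ..., V_c(s_{T+1}), one more entry than `rewards`.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + beta_d * values[t + 1] - values[t]
        running = delta + beta_d * beta_g * running
        adv[t] = running
    return adv


def clip_ratio(x: float, eps: float) -> float:
    """Standard PPO clip of the probability ratio to [1 - eps, 1 + eps]."""
    return max(min(x, 1.0 + eps), 1.0 - eps)


def dual_update(nu: float, eta_nu: float, omegas: Sequence[float],
                beta_d: float) -> float:
    """Eq. (43): projected ascent step using the sampled discounted omega sum."""
    grad = sum(beta_d ** idx * w for idx, w in enumerate(omegas))
    return max(0.0, nu + eta_nu * grad)
```

In a full implementation the actor and critic updates of lines 20 to 23 of Algorithm 1 would consume the `gae` output inside an automatic differentiation framework; only the bookkeeping around them is shown here.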
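The time slot reservation phase of Algorithm 1 (lines 1 to 7) can be illustrated for a single network state. This is a sketch under stated assumptions rather than the paper's procedure: the update rule (28) is not reproduced in this section, so the loop below assumes a plain greedy step that adds one slot to whichever sensor increases $f_g$ the most. The names `f_g`, `reserve_slots`, `groups` (the sets $\mathcal{S}_i$ as lists of sensor indices), `target` (standing for $1 - \varrho/\bar{\lambda}^{d}_{A^\top A}$), and `max_slots` are all hypothetical.

```python
from typing import List, Sequence


def f_g(phi: Sequence[int], mu: Sequence[float],
        groups: Sequence[Sequence[int]], d: int) -> float:
    """Lower-bound objective: [prod_i (1 - prod_{j in S_i} mu_j ** phi_j)] ** d."""
    prod = 1.0
    for group in groups:
        miss = 1.0
        for j in group:
            miss *= mu[j] ** phi[j]
        prod *= 1.0 - miss
    return prod ** d


def reserve_slots(mu: Sequence[float], groups: Sequence[Sequence[int]],
                  d: int, target: float, max_slots: int = 1000) -> List[int]:
    """Greedy reservation for one network state (assumed form of update (28))."""
    n = len(mu)
    phi = [0] * n
    # Eq. (27): one slot for the most reliable sensor of every group.
    for group in groups:
        j0 = min(group, key=lambda j: mu[j])
        phi[j0] += 1
    # Assumed greedy step: add the slot with the largest gain in f_g.
    while f_g(phi, mu, groups, d) < target:
        if sum(phi) >= max_slots:
            raise ValueError("target not reached within max_slots")
        best = max(range(n),
                   key=lambda j: f_g(phi[:j] + [phi[j] + 1] + phi[j + 1:],
                                     mu, groups, d))
        phi[best] += 1
    return phi
```

For example, with failure rates `[0.5, 0.1, 0.3, 0.4]`, groups `[[0, 1], [2, 3]]`, `d = 1`, and `target = 0.9`, the sketch first gives one slot each to sensors 1 and 2 and then adds slots until the bound is met. Repeating the call for every $\varepsilon$ and combining the results as in (29) would complete the reservation phase.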
Remark 1: Based on the above constraint value selection, we know that there exists a feasible solution $\pi^*$ achieving the optimal value of Problem 2. One can reasonably restrict the stage cost as $c_k = \min(c_k, M_c)$, where $M_c$ is a large number. Combined with the boundedness of the constraint $\omega_k$, the conditions in [24, Theorem 2] are satisfied. The primal-dual update thus converges to a neighborhood of $\pi^*$ in case (38) is solved exactly, which also implies the constraint satisfaction of the solutions. Although DRL methods are not theoretically guaranteed to obtain the global optimal solution to (38), extensive experience suggests that they can converge to solutions with little suboptimality, and the convergence is preserved under mild assumptions [39].

Remark 2: We discuss the scalability of DTSM from the perspective of computational complexity. For the time slot reservation process, the computational complexity is $O(M_n M_\phi N^2)$, which scales quadratically with the number of sensors $N$. Since both the actor (the mean and standard deviation parts being consistent) and critic networks adopt fully connected structures, the complexity for the input layer in each of them is $O(M_b M_{h,1}(m^2 + 1))$, where $m^2 + 1$ and $M_{h,1}$ are the numbers of the input layer's nodes and the first hidden layer's nodes, respectively [41]. This scales quadratically with the system dimension $m$. Also, denoting the number of the last hidden layer's nodes as $M_{h,-1}$, the complexity for the output layer of actor network is $O(M_b M_{h,-1} N)$, which scales linearly with the number of sensors $N$.

IV. APPLICATION AND EVALUATION

In this section, the proposed DTSM is applied to the industrial hot rolling process through digital twin (DT) technology. The application scenario, physical model, and system implementation are introduced in detail. Moreover, the effectiveness of the proposed method is fully discussed.

A. Laminar Cooling Scenario and Physical Model

The application scenario is the laminar cooling process in industrial hot rolling. As shown in Fig. 5, the sensing of strip steel temperature and cooling water control are integrated into this IIoT system for desired temperature regulation. This process is extremely important since the mechanical properties of steel, e.g., yield strength, toughness, etc., are directly determined by the cooling curve [42]. Here, multiple temperature sensors such as infrared thermometers deployed in the field device layer aggregate measurements to the ECN and receive scheduling commands from the ECN through wireless transmission. The control actions are sent by the ECN to the actuators, i.e., the water valves, in a wired manner to regulate their opening degrees. More adequate measurement transmission improves temperature estimation accuracy and thus water flow control performance, but this comes at the expense of network resource usage and sensor battery energy consumption. Therefore, it is necessary to adopt our proposed DTSM to improve the overall system performance.

An open thermodynamic system with strip steel surfaces, cooling zone inlet and outlet as boundaries is obtained as shown in Fig. 5. The temperature variation model is then as follows [3]:

$$\frac{\partial x}{\partial t} = \frac{\lambda}{\rho c} \frac{\partial^2 x}{\partial \sigma_z^2} - v_c \frac{\partial x}{\partial \sigma_n}, \tag{44}$$

where $x$ denotes the temperature value. $t$, $\sigma_z$, and $\sigma_n$ are the coordinates along time, thickness, and length, respectively. $v_c$ is the coiling speed. $\lambda$, $\rho$, and $c$ are the thermal conductivity, density, and specific heat capacity of the strip steel, respectively. The boundary conditions are

$$\left.\frac{\partial x}{\partial \sigma_z}\right|_{\sigma_z = 0} = \frac{h_w}{\lambda}(x - x_w), \quad x|_{\sigma_n = 0} = x_f,$$
$$\left.\frac{\partial x}{\partial \sigma_z}\right|_{\sigma_z = \bar{\sigma}_z} = -\frac{h_w}{\lambda}(x - x_w),$$

where $\bar{\sigma}_z$ is the maximum coordinate value along the thickness. $h_w$ is the water cooling heat transfer coefficient depending on the opening degrees of water valves, and $x_w$ is the cooling water temperature. $x_f$ is the inlet temperature.

Then, the finite difference method is used to discretize (44). See the system division in Fig. 5, where the entire thermodynamic system is divided into $\tau_z$ and $\tau_n$ volumes in thickness and length, respectively. Here, $N$ infrared thermometers are uniformly deployed above the $\tau_n$ upper surface volumes. For each upper or lower surface volume, the nozzles above or below it are controlled by an independent regulating valve. Thus, the total number of valves is $2\tau_n$.

Denoting $x_{p,j,k}$ as the temperature of $(p, j)$-th volume ($1 \le p \le \tau_z$, $1 \le j \le \tau_n$) at time period $k$, the discretized model is as follows:

$$x_{p,j,k+1} = x_{p,j,k} + \vartheta_n (x_{p,j,k} - x_{p,j-1,k}) + \vartheta_z (x_{p+1,j,k} - 2 x_{p,j,k} + x_{p-1,j,k}),$$
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on June 06,2025 at 08:40:48 UTC from IEEE Xplore. Restrictions apply.
JIN et al.: DRL BASED TRANSMISSION SCHEDULING FOR SENSING AWARE CONTROL 10915
Fig. 6. Digital twin system implementation.
Fig. 7. Time slot reservation process. (a) Case of ε = 0. (b) Case of ε = 1.
where $\vartheta_z = \frac{\Delta t\,\lambda}{\Delta_z^2 \rho c}$, $\vartheta_n = \frac{\Delta t\, v_c}{\Delta_n}$. $\Delta_z$, $\Delta_n$ are the discretization steps and $\Delta t$ is the sampling period. The discretized equations at the surface are obtained similarly based on the boundary conditions. Defining $x_k = [(x_k^1)^\top, \cdots, (x_k^j)^\top, \cdots, (x_k^{\tau_n})^\top]^\top$ with $x_k^j = [x_{1,j,k}, \cdots, x_{\tau_z,j,k}]^\top$, the stacked form of the model is

$$x_{k+1} = A x_k + B u_k + B' u'_k + w_k, \tag{45}$$

where $A = A_n + A_z + I_{\tau_n \tau_z}$, $B = I_{\tau_n} \otimes Q_B$. Further, $A_n = (\vartheta_n Q_n) \otimes I_{\tau_z}$, $A_z = I_{\tau_n} \otimes (\vartheta_z Q_z)$, $Q_B = [[1, \cdots, 0]^\top, [0, \cdots, 1]^\top]$, and

$$Q_n = \begin{bmatrix} -1 & 0 & \cdots & \cdots & 0 \\ 1 & -1 & \ddots & & \vdots \\ 0 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -1 \end{bmatrix}, \qquad Q_z = \begin{bmatrix} -2 & 2 & 0 & \cdots & 0 \\ 1 & -2 & 1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & 1 & -2 & 1 \\ 0 & \cdots & 0 & 2 & -2 \end{bmatrix}.$$

Besides, $u_k$ is the control action regarding the opening degrees of water valves. $B' = I_{\tau_n \tau_z}$, and $u'_k = [\vartheta_n x_f^\top, 0, \cdots, 0]^\top$ with $x_f \in \mathbb{R}^{\tau_z}$ being the available inlet temperature.

B. System Implementation and Parameter Settings

1) DT System Implementation: The DT system, as illustrated in Fig. 6, is constructed to realize the interaction between the real-world space of high-strength steel production and the twin space integrating DTSM. The DT system is built based on Unity 2020 using the Intel Core Ultra 7 processor, which includes several scenarios such as rough rolling, finishing rolling, and laminar cooling. Based on the actual deployed sensor information and physical mechanisms, the DT system reproduces the running status of the strip. Then in the twin space, by linking the proposed DTSM that is transformed into Python code, the DRL training is realized and the strip cooling effect brought by the designed scheduling and control policies can be evaluated. The well-trained policies are eventually deployed in the ECN to guide actual production. In general, for the self-contained and safety-critical steel manufacturing where it is difficult to directly modify production instructions [29], such a virtual-real interaction framework effectively supports the application of the proposed DTSM at the control level.

2) Parameter Settings: The values of the main parameters are selected as follows. The thermodynamic parameters $\lambda = 40$ W/(m·K), $\rho = 7.9 \times 10^3$ kg/m$^3$, $c = 0.46 \times 10^3$ J/(kg·K) are determined according to [43], and the temperature and cooling zone parameters $v_c = 10$ m/s, $\tau_z = 2$, $\tau_n = 4$, $\Delta_z = 5$ mm, $\Delta_n = 5$ m, $\bar{x}_1 = [1123, 1123, 1113, 1113, 1103, 1103, 1093, 1093]^\top$ K, $x_w = 293$ K, $x_f = [1123, 1123]^\top$ K are determined with reference to [42] and [44]. Based on system modeling and properties, $N = 8$, $T = 100$, $m = 8$, $q = 8$, $l = 4$, $\Sigma_x = 20 I_m$, $\Sigma_w = I_m$, $\Sigma_v = I_N$, $Q = 10 I_m$, $R = I_q$ are determined. The desired $z_k$ gradually decreases from $\bar{x}_1$ to $[1023, 1023, 973, 973, 923, 923, 873, 873]^\top$ K over time. Based on the sensor deployment, the observation and sensing matrices are given as $G = I_{\tau_n} \otimes [0, 1]$, $C = [e_l^1, e_l^1, \cdots, e_l^l, e_l^l]^\top$. Thus, we have $S_1 = \{1, 2\}$, $S_2 = \{3, 4\}$, $S_3 = \{5, 6\}$, $S_4 = \{7, 8\}$. It is verified that Assumption 1 holds and $d = 2$, $n = 1$. $\kappa_c$, $\kappa_r$ are subsequently specified according to different performance demands.

For the wireless network between the ECN and the sensors, we consider that the CFP contains 16 slots and the beacon slot is used to generate and broadcast the schedule $\phi_k$. The slot duration is selected as 3.84 ms [45]. The inactive period lasts for 10 slot durations, during which the sensors may turn off their radios to save energy. This period is used in turn for the ECN to perform posterior estimation, control, deliver control commands (through wired communication), and prior estimation, as well as for the field sensors to measure the temperature. The sampling period $\Delta t$ is consistent with the beacon interval, which is around 100 ms. Two network states $\mathcal{E} = \{0, 1\}$ are considered, and the transition probability matrix is $E = [[0.1, 0.9]^\top, [0.9, 0.1]^\top]$. Under $\varepsilon = 0$, the values of $\mu_{2,0}$, $\mu_{4,0}$, $\mu_{6,0}$, $\mu_{8,0}$ are set higher than $\mu_{1,0}$, $\mu_{3,0}$, $\mu_{5,0}$, $\mu_{7,0}$. Under $\varepsilon = 1$, the values of $\mu_{1,0}$, $\mu_{3,0}$, $\mu_{6,0}$, $\mu_{8,0}$ are set higher than $\mu_{2,0}$, $\mu_{4,0}$, $\mu_{5,0}$, $\mu_{7,0}$. For the learning process, both the actor and critic networks use two hidden layers with 256 nodes, and the Adam optimizer is adopted. Hyper-parameters $\beta_d = 0.95$, $\beta_g = 0.95$, $\epsilon = 0.2$, $M_l = 4000$, $M_p = 10$, $M_b = 100$, $\eta_a = 10^{-5}$, $\eta_c = 10^{-4}$, $\eta_\nu = 10^{-5}$, etc. are selected by referring to [22] and [24].

C. Performance Evaluation

1) Time Slot Reservation: Fig. 7 shows the number of reserved time slots $[\bar{\phi}_\varepsilon]_j$ and the value of $f_g(\bar{\phi}_\varepsilon, \varepsilon)$ during the reservation process of DTSM. For clarity, only sensors with non-zero time slots are shown. Under both cases of $\varepsilon = 0$ and $\varepsilon = 1$, the number of time slots reserved for the sensors with better transmission conditions increases alternately. When the overall number is less than 4, $f_g(\bar{\phi}_\varepsilon, \varepsilon)$ are all 0 since $\exists i, \prod_{j \in S_i} \mu_{j,\varepsilon}^{[\bar{\phi}_\varepsilon]_j} = 1$. This practical system satisfies that
10916 IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, VOL. 22, 2025
Fig. 8. Normalized average cost and constraint violation comparisons. (a) Case of κ ′ = 0.1 without constraint. (b) Case of κ ′ = 0.6 without constraint.
(c) and (d) Case of κ ′ = 0.1 with constraint.
Fig. 9. Tracking performance and transmission cost comparisons in the case of κ ′ = 0.1. (a) and (b) Without constraint. (c) and (d) With constraint.
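As a rough cross-check of the parameter settings, the following sketch assembles the matrix $A$ of the stacked model (45) from the stated definitions of $Q_n$ and $Q_z$ and estimates the largest eigenvalue of $A^\top A$ by power iteration. It is an illustration under stated assumptions, not the authors' implementation: the sampling period is taken as exactly 0.1 s (the text only says "around 100 ms"), the coefficients are read as $\vartheta_z = \Delta t\,\lambda/(\Delta_z^2 \rho c)$ and $\vartheta_n = \Delta t\, v_c/\Delta_n$, and all function names are hypothetical.

```python
# Sketch: build A = A_n + A_z + I of the stacked model (45) and estimate the
# largest eigenvalue of A^T A. ASSUMPTIONS: dt = 0.1 s ("around 100 ms" in the
# text); theta_z = dt*lam/(dz^2*rho*c) and theta_n = dt*v_c/dn.

def eye(n):
    """Identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(P, Q):
    """Kronecker product of two matrices given as lists of rows."""
    return [[p * q for p in prow for q in qrow] for prow in P for qrow in Q]

def q_n(n):
    """Q_n: -1 on the diagonal and 1 on the first sub-diagonal."""
    return [[-1.0 if i == j else 1.0 if i == j + 1 else 0.0
             for j in range(n)] for i in range(n)]

def q_z(n):
    """Q_z: second difference with the modified first and last rows."""
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -2.0
        if i > 0:
            Q[i][i - 1] = 2.0 if i == n - 1 else 1.0
        if i < n - 1:
            Q[i][i + 1] = 2.0 if i == 0 else 1.0
    return Q

def build_A(tau_z, tau_n, theta_z, theta_n):
    """A = (theta_n Q_n) kron I_tau_z + I_tau_n kron (theta_z Q_z) + I."""
    A_n = kron([[theta_n * v for v in row] for row in q_n(tau_n)], eye(tau_z))
    A_z = kron(eye(tau_n), [[theta_z * v for v in row] for row in q_z(tau_z)])
    size = tau_z * tau_n
    I = eye(size)
    return [[A_n[i][j] + A_z[i][j] + I[i][j] for j in range(size)]
            for i in range(size)]

def largest_eig_AtA(A, iters=500):
    """Power iteration on A^T A, started from the all-ones vector."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        w = [sum(A[i][j] * Av[i] for i in range(n)) for j in range(n)]
        lam = sum(x * x for x in w) ** 0.5
        v = [x / lam for x in w]
    return lam

# Parameter values quoted in the text (SI units).
lam_th, rho, c = 40.0, 7.9e3, 0.46e3
v_c, dz, dn, dt = 10.0, 5e-3, 5.0, 0.1
tau_z, tau_n = 2, 4

theta_z = dt * lam_th / (dz ** 2 * rho * c)
theta_n = dt * v_c / dn
A = build_A(tau_z, tau_n, theta_z, theta_n)
lam_max = largest_eig_AtA(A)

print(f"theta_z = {theta_z:.4f}, theta_n = {theta_n:.4f}")
print(f"largest eigenvalue of A^T A = {lam_max:.4f}")
```

Under these assumptions the estimate stays below 1, which agrees with the statement that the practical system satisfies $\bar{\lambda}_{A^\top A} < 1$. Power iteration is used only to keep the sketch dependency-free; a numerical library would normally compute the singular values directly.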
$\bar{\lambda}_{A^\top A} < 1$, and when $\varrho$ is set close to 1, (26) is arbitrarily satisfied since $1 - \varrho/\bar{\lambda}^{d}_{A^\top A} < 0$. Here, for a better control effect, $\varrho$ is selected such that $1 - \varrho/\bar{\lambda}^{d}_{A^\top A} = 0.3$. Then, the reserved results are $\bar{\phi}_0 = [1, 0, 1, 0, 2, 0, 2, 0]^\top$, $\bar{\phi}_1 = [0, 2, 0, 2, 2, 0, 2, 0]^\top$, and finally $\bar{\phi} = [1, 2, 1, 2, 2, 0, 2, 0]^\top$. It can be seen that sensors 6 and 8 are not assigned any time slots because, under both $\varepsilon = 0$ and $\varepsilon = 1$, they have worse transmission conditions in $S_3$ and $S_4$, respectively. This reflects that DTSM effectively saves time slot resources. Besides, the simulation of $\varphi(k, \{\bar{\phi}_\varepsilon\}_{i=1}^{2}, \{\varepsilon\}_{i=1}^{2})$ by the Monte Carlo method is also plotted, which is almost the same as $f_g(\bar{\phi}_\varepsilon, \varepsilon)$. This verifies the lower bound proposed in Theorem 1, and implies that the real value reaches the bound at this time.

2) Overall Transmission and Control Performance: We compare the proposed DTSM with several other methods: 1) the random policy (RND) which randomly selects an action in $\mathcal{A}$; 2) the control performance first policy (CPF) which adopts $\phi_k = \bar{\phi}, \forall k$; 3) the greedy policy on control-aware error covariance (GCEC) [27] which schedules a transmission for the sensor with the minimal $\mathrm{tr}(\Theta_\infty (P_{k|k-1}^{-1} + G^\top [C]_{j,:}^\top [\Sigma_v]_{j,j}^{-1} [C]_{j,:} G)^{-1})$. Meanwhile, we conduct ablation studies to better validate the importance of each design consideration: 1) remove the observability-based slot reservation (marked as AS-1), i.e., adopt the intelligent policy with full time slots (IFTS) [14] which performs the scheduling described in Algorithm 1 with all available time slots (uniformly) reserved for the sensors; 2) remove the fine-grained action space partitioning (marked as AS-2), i.e., consider the “On-Off” case where each sensor transmits 0 or $[\bar{\phi}]_j$ times; 3) remove the dual update process (marked as AS-3), where a fixed $\bar{\nu}$ is used to indicate the cost of constraint violation. These methods use the same transmission and control decoupling and optimal control law as in DTSM.

Letting $\frac{\kappa_r}{\kappa_c} = \kappa' \frac{\mathrm{tr}(\Sigma_x \Theta_\infty)}{\sum_{j=1}^{N} [\bar{\phi}]_j}$ with $\frac{\mathrm{tr}(\Sigma_x \Theta_\infty)}{\sum_{j=1}^{N} [\bar{\phi}]_j}$ being the normalization item, we consider different transmission-control weight combinations by adjusting $\kappa'$. The training process is repeated several times. It can be seen that DTSM converges relatively stably under various settings. First, without considering constraint (34), the normalized average cost comparisons under $\kappa' = 0.1$ and $\kappa' = 0.6$ are shown in Fig. 8(a)-(b). The normalized cost is calculated through $\frac{(\sum_{k=1}^{T} c_k)/T - J_{\min}}{J_{\max}}$, where $J_{\min}$ and $J_{\max}$ are the minimum and maximum average costs of the episodes in all repeated experiments, respectively. It can be seen that in both cases, the performance of DTSM is better than that of RND, CPF, and GCEC. GCEC, designed based on empirical rules, performs relatively well under $\kappa' = 0.6$, but it is finally outperformed by the continuously trained DTSM. For the ablated methods, AS-2 performs worse than DTSM due to abandoning the search for a wider range of solutions. Although DTSM performs slightly worse than IFTS (AS-1) under $\kappa' = 0.1$ (nearly consistent under $\kappa' = 0.6$), it requires 37.5% fewer reserved time slots than IFTS (AS-1). This means the network can support more other applications, or extend the inactive period to save energy.

Taking the case of $\kappa' = 0.1$ as an example to further consider the constraint (34) with $\bar{b} = 13$ and $T_c = \frac{1 - \beta_d^{18}}{1 - \beta_d}$, the normalized average cost and constraint violation comparisons are shown in Fig. 8(c)-(d). Due to the enhanced sensing performance demands, neither GCEC nor RND can meet the constraint here. For DTSM, the learned policy converges to a feasible solution after about 1000 episodes, and it can further optimize the overall performance while satisfying the constraint. The goal of interest here is to ultimately design a scheduling policy that satisfies the constraints without requiring constraint satisfaction during the training process, which matches our experimental results. Besides, although $\bar{\nu}$ is empirically selected in AS-3 to ensure the constraint, it is too conservative to achieve the optimal performance goal. In contrast, $\nu$ is systematically adjusted in DTSM for the optimization goal. In summary, each key design consideration
TABLE I
INFERENCE TIME OVERHEAD WITHOUT CONSTRAINT (C̄) AND WITH CONSTRAINT (C)
TABLE II
AVERAGE COST COMPARISON UNDER DIFFERENT NUMERICAL SETTINGS (✓ AND × REPRESENT CONSTRAINT SATISFACTION AND VIOLATION)
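The slot figures quoted in the evaluation can be reproduced with a few lines of bookkeeping. The sketch below is illustrative only and rests on assumptions that the text does not state explicitly: the beacon is counted as one slot of the same 3.84 ms duration, the final reservation is treated as the elementwise maximum of the two per-state reservations (which happens to match the reported vector), and IFTS (AS-1) is taken to reserve all 16 CFP slots. The helper names are hypothetical.

```python
# Sketch: slot bookkeeping for the reported evaluation settings.
# ASSUMPTIONS: one beacon slot of the same duration as a CFP slot; the final
# reservation is the elementwise maximum of the per-state reservations; the
# IFTS (AS-1) baseline reserves every CFP slot.

SLOT_MS = 3.84        # slot duration quoted in the text
CFP_SLOTS = 16        # CFP length quoted in the text
BEACON_SLOTS = 1      # assumption
INACTIVE_SLOTS = 10   # inactive period quoted in the text

phi_0 = [1, 0, 1, 0, 2, 0, 2, 0]   # reservation under network state 0
phi_1 = [0, 2, 0, 2, 2, 0, 2, 0]   # reservation under network state 1

def combine_reservations(*reservations):
    """Per-sensor maximum over the per-state reservations."""
    return [max(slots) for slots in zip(*reservations)]

def saving_vs_full_cfp(reserved, cfp_slots=CFP_SLOTS):
    """Fraction of CFP slots left unreserved."""
    return (cfp_slots - sum(reserved)) / cfp_slots

phi_bar = combine_reservations(phi_0, phi_1)
idle_sensors = [j + 1 for j, slots in enumerate(phi_bar) if slots == 0]
beacon_interval_ms = (BEACON_SLOTS + CFP_SLOTS + INACTIVE_SLOTS) * SLOT_MS
saving = saving_vs_full_cfp(phi_bar)

print("combined reservation:", phi_bar)
print("sensors without slots:", idle_sensors)
print(f"beacon interval: {beacon_interval_ms:.2f} ms")
print(f"saving versus full CFP: {saving:.1%}")
```

With these assumptions the combined vector reserves 10 of the 16 CFP slots, which corresponds to the reported 37.5% reduction, and the beacon interval comes out slightly above 100 ms, in line with "around 100 ms".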
[5] D. E. Quevedo, A. Ahlen, and K. H. Johansson, “State estimation over sensor networks with correlated wireless fading channels,” IEEE Trans. Autom. Control, vol. 58, no. 3, pp. 581–593, Mar. 2013.
[6] X. Guan, C. Chen, B. Yang, C. Hua, L. Lyu, and S. Zhu, “Towards the integration of sensing, transmission and control for industrial network systems: Challenges and recent developments,” Acta Autom. Sin., vol. 45, no. 1, pp. 27–38, Jan. 2019.
[7] W. Liu, D. E. Quevedo, Y. Li, K. H. Johansson, and B. Vucetic, “Remote state estimation with smart sensors over Markov fading channels,” IEEE Trans. Autom. Control, vol. 67, no. 6, pp. 2743–2757, Jun. 2022.
[8] C. Hu, X. Xie, S. Ding, and Y. Jing, “Distributed set-membership fusion estimation for complex networks with communication constraints,” IEEE Trans. Autom. Sci. Eng., early access, May 20, 2024, doi: 10.1109/TASE.2024.3401740.
[9] Y. Kan, H. Yang, F. Qu, and Y. Li, “Sensor power control for remote state estimation with historical data re-transmission,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 3, pp. 4058–4069, Jul. 2024.
[10] L. Chen, B. Hu, Z.-H. Guan, L. Zhao, and D.-X. Zhang, “Control-aware transmission scheduling for industrial network systems over a shared communication medium,” IEEE Internet Things J., vol. 9, no. 13, pp. 11299–11310, Jul. 2022.
[11] Y. Wu, Q. Yang, H. Li, K. S. Kwak, and V. C. M. Leung, “Control-aware energy-efficient transmissions for wireless control systems with short packets,” IEEE Internet Things J., vol. 8, no. 19, pp. 14920–14933, Oct. 2021.
[12] K. Gatsis, A. Ribeiro, and G. J. Pappas, “Random access design for wireless control systems,” Automatica, vol. 91, pp. 1–9, May 2018.
[13] T. Shi, P. Shi, and J. Chambers, “Dynamic event-triggered model predictive control under channel fading and denial-of-service attacks,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 4, pp. 6448–6459, Oct. 2024.
[14] Y. Ma et al., “Optimal dynamic transmission scheduling for wireless networked control systems,” IEEE Trans. Control Syst. Technol., vol. 30, no. 6, pp. 2360–2376, Nov. 2022.
[15] Y. Ma et al., “Smart actuation for end-edge industrial control systems,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 1, pp. 269–283, Jan. 2024.
[16] K. Huang, W. Liu, Y. Li, B. Vucetic, and A. Savkin, “Optimal Downlink–Uplink scheduling of wireless networked control for industrial IoT,” IEEE Internet Things J., vol. 7, no. 3, pp. 1756–1772, Mar. 2020.
[17] C. Li, X. Zhao, M. Chen, W. Xing, N. Zhao, and G. Zong, “Dynamic periodic event-triggered control for networked control systems under packet dropouts,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 1, pp. 906–920, Jan. 2024.
[18] S. Luo, L. Zhang, and Y. Fan, “Real-time scheduling for dynamic partial-no-wait multiobjective flexible job shop by deep reinforcement learning,” IEEE Trans. Autom. Sci. Eng., vol. 19, no. 4, pp. 3020–3038, Oct. 2022.
[19] S. Roshanravan and S. Shamaghdari, “Adaptive fault-tolerant tracking control for affine nonlinear systems with unknown dynamics via reinforcement learning,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 1, pp. 569–580, Jan. 2024.
[20] L. Yang, Y. Xu, Z. Huang, H. Rao, and D. E. Quevedo, “Learning optimal stochastic sensor scheduling for remote estimation with channel capacity constraint,” IEEE Trans. Ind. Informat., vol. 19, no. 3, pp. 2565–2573, Mar. 2023.
[21] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, Mar. 2020, Art. no. 108759.
[22] G. Pang, W. Liu, Y. Li, and B. Vucetic, “DRL-based resource allocation in remote state estimation,” IEEE Trans. Wireless Commun., vol. 22, no. 7, pp. 4434–4448, Jul. 2023.
[23] Z. Zhao, W. Liu, D. E. Quevedo, Y. Li, and B. Vucetic, “Deep learning for wireless-networked systems: A joint estimation-control-scheduling approach,” IEEE Internet Things J., vol. 11, no. 3, pp. 4535–4550, Feb. 2024.
[24] V. Lima, M. Eisen, K. Gatsis, and A. Ribeiro, “Model-free design of control systems over wireless fading channels,” Signal Process., vol. 197, Aug. 2022, Art. no. 108540.
[25] Z. Ji, C. Chen, J. He, S. Zhu, and X. Guan, “Edge sensing and control co-design for industrial cyber-physical systems: Observability guaranteed method,” IEEE Trans. Cybern., vol. 52, no. 12, pp. 13350–13362, Dec. 2022.
[26] X. Wen et al., “Age-of-task-aware co-design of sampling, scheduling, and control for industrial IoT systems,” IEEE Internet Things J., vol. 11, no. 3, pp. 4227–4242, Feb. 2024.
[27] V. Tzoumas, L. Carlone, G. J. Pappas, and A. Jadbabaie, “LQG control and sensing co-design,” IEEE Trans. Autom. Control, vol. 66, no. 4, pp. 1468–1483, Apr. 2021.
[28] L. Zheng, M. Liu, S. Zhang, Z. Liu, and S. Dong, “End-to-end multi-sensor fusion method based on deep reinforcement learning in UASNs,” Ocean Eng., vol. 305, Aug. 2024, Art. no. 117904.
[29] Z. Ji, C. Chen, S. Zhu, Y. Ma, and X. Guan, “Intelligent edge sensing and control co-design for industrial cyber-physical system,” IEEE Trans. Signal Inf. Process. Over Netw., vol. 9, pp. 175–189, 2023.
[30] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov Control Processes: Basic Optimality Criteria, vol. 30. Cham, Switzerland: Springer, 2012.
[31] A. K. Singh and B. C. Pal, “An extended linear quadratic regulator for LTI systems with exogenous inputs,” Automatica, vol. 76, pp. 10–16, Feb. 2017.
[32] R. A. Berry and R. G. Gallager, “Communication over fading channels with delay constraints,” IEEE Trans. Inf. Theory, vol. 48, no. 5, pp. 1135–1149, May 2002.
[33] J. Araújo, M. Mazo, A. Anta, P. Tabuada, and K. H. Johansson, “System architectures, protocols and algorithms for aperiodic wireless control systems,” IEEE Trans. Ind. Informat., vol. 10, no. 1, pp. 175–184, Feb. 2014.
[34] D. E. Quevedo, A. Ahlén, A. S. Leong, and S. Dey, “On Kalman filtering over fading wireless channels with controlled transmission powers,” Automatica, vol. 48, no. 7, pp. 1306–1316, Jul. 2012.
[35] T. Farjam, H. Wymeersch, and T. Charalambous, “Distributed channel access for control over unknown memoryless communication channels,” IEEE Trans. Autom. Control, vol. 67, no. 12, pp. 6445–6459, Dec. 2022.
[36] T. Jin, Y. Ma, Z. Ji, and C. Chen, “Intelligent transmission scheduling for edge sensing in industrial IoT systems,” in Proc. IEEE Global Commun. Conf., Dec. 2023, pp. 7037–7042.
[37] W. Li, G. Wei, D. Ding, Y. Liu, and F. E. Alsaadi, “A new look at boundedness of error covariance of Kalman filtering,” IEEE Trans. Syst. Man, Cybern. Syst., vol. 48, no. 2, pp. 309–314, Feb. 2018.
[38] G. Battistelli and L. Chisci, “Kullback–Leibler average, consensus on probability densities, and distributed state estimation with guaranteed stability,” Automatica, vol. 50, no. 3, pp. 707–718, Mar. 2014.
[39] S. Paternain, M. Calvo-Fullana, L. F. O. Chamon, and A. Ribeiro, “Safe policies for reinforcement learning via primal-dual methods,” IEEE Trans. Autom. Control, vol. 68, no. 3, pp. 1321–1336, Mar. 2023.
[40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347.
[41] B. Zhao and X. Zhao, “Deep reinforcement learning resource allocation in wireless sensor networks with energy harvesting and relay,” IEEE Internet Things J., vol. 9, no. 3, pp. 2330–2345, Feb. 2022.
[42] Y. Zheng, N. Li, and S. Li, “Hot-rolled strip laminar cooling process plant-wide temperature monitoring and control,” Control Eng. Pract., vol. 21, no. 1, pp. 23–30, Jan. 2013.
[43] L. Lyu, C. Chen, S. Zhu, and X. Guan, “5G enabled codesign of energy-efficient transmission and estimation for industrial IoT systems,” IEEE Trans. Ind. Informat., vol. 14, no. 6, pp. 2690–2704, Jun. 2018.
[44] Y. Zheng, S. Li, and X. Wang, “Distributed model predictive control for plant-wide hot-rolled strip laminar cooling process,” J. Process Control, vol. 19, no. 9, pp. 1427–1437, Oct. 2009.
[45] F. Kauer, M. Köstler, T. Lübkert, and V. Turau, “Formal analysis and verification of the IEEE 802.15.4 DSME slot allocation,” in Proc. 19th ACM Int. Conf. Model., Anal. Simul. Wireless Mobile Syst., New York, NY, USA, Nov. 2016, pp. 140–147.

Tiankai Jin (Graduate Student Member, IEEE) received the B.Eng. degree from Southwest Jiaotong University, Chengdu, China, in 2020. He is currently pursuing the Ph.D. degree in control science and engineering with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China.
His current research interests include the co-design of sensing, transmission and control for industrial cyber-physical systems, and the reinforcement learning under network systems.
Cailian Chen (Senior Member, IEEE) received the B.Eng. and M.Eng. degrees in automatic control from Yanshan University, China, in 2000 and 2002, respectively, and the Ph.D. degree in control and systems from the City University of Hong Kong, Hong Kong, SAR, in 2006.
She has been with the Department of Automation, Shanghai Jiao Tong University, since 2008. She is currently a Distinguished Professor. She has authored three research monographs and over 100 referred international journal articles. She is the inventor of more than 30 patents. Her research interests include industrial wireless networks and computational intelligence and the Internet of Vehicles.
Prof. Chen received the prestigious IEEE Transactions on Fuzzy Systems Outstanding Paper Award in 2008, the IEEE TCCPS Industrial Technical Excellence Award in 2022, and five conference best paper awards. She was awarded the N2Women Star in Computer Networking and Communications in 2022. She won the Second Prize of National Natural Science Award from the State Council of China in 2018, the First Prize of Natural Science Award from The Ministry of Education of China in 2006 and 2016, respectively, and the First Prize of Technological Invention of Shanghai Municipal, China, in 2017. She was honored “National Outstanding Young Researcher” by NSF of China in 2020, “Changjiang Young Scholar” in 2015, and China Young Women Scientists Award in 2023. She has been actively involved in various professional services. She is a Distinguished Lecturer of IEEE VTS. She serves as the Deputy Editor for National Science Open and an Associate Editor for IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY and IET Cyber-Physical Systems: Theory and Applications.

Xinping Guan (Fellow, IEEE) is currently a Chair Professor with Shanghai Jiao Tong University, Shanghai, China, where he is the Dean of the School of Electronic, Information and Electrical Engineering, and the Director of the Key Laboratory of Systems Control and Information Processing, Ministry of Education of China. Before that, he was the Executive Director of the Office of Research Management, Shanghai Jiao Tong University, and a Full Professor and the Dean of the Electrical Engineering, Yanshan University, Qinhuangdao, China.
He has authored and/or co-authored five research monographs, more than 200 articles in peer-reviewed journals, and numerous conference papers. As a Principal Investigator, he has finished/been working on more than 20 national key projects. He is the Leader of the prestigious Innovative Research Team of the National Natural Science Foundation of China (NSFC). His current research interests include industrial network systems, smart manufacturing, and underwater networks.
Dr. Guan is an Executive Committee Member of Chinese Automation Association Council and Chinese Artificial Intelligence Association Council. He received the Second Prize of the National Natural Science Award of China in both 2008 and 2018 and the First Prize of Natural Science Award from the Ministry of Education of China and Municipal of Shanghai, China, for four times. He was a recipient of the “IEEE Transactions on Fuzzy Systems Outstanding Paper Award” in 2008 and the IEEE TCCPS Industrial Technical Excellence Award in 2022. He was honored “National Outstanding Youth” by NSF of China, “Changjiang Scholar” by the Ministry of Education of China, and “State-Level Scholar” of “New Century Bai Qianwan Talent Program” of China.