
Deep Reinforcement Learning Based Transmission Scheduling For Sensing Aware Control

The document presents a novel deep reinforcement learning (DRL)-based transmission scheduling method (DTSM) aimed at optimizing data transmission in Industrial Internet of Things (IIoT) systems, balancing control performance and limited resources. It emphasizes the importance of ensuring system observability and dynamically scheduling sensor transmissions based on real-time conditions. The effectiveness of DTSM is demonstrated through applications in industrial processes, particularly in laminar cooling, while addressing the challenges of centralized learning methods and resource utilization.

IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, VOL. 22, 2025 10905

Deep Reinforcement Learning Based Transmission Scheduling for Sensing Aware Control

Tiankai Jin, Graduate Student Member, IEEE, Cailian Chen, Senior Member, IEEE, Yehan Ma, Senior Member, IEEE, and Xinping Guan, Fellow, IEEE

Abstract—Massive field data is wirelessly transmitted to the edge side to facilitate sensing and control in the emerging Industrial Internet of Things (IIoT) systems. Under the expanding transmission scheduling space and dynamic network conditions, balancing control performance and limited transmission resources is a fundamental challenge. For this problem, we propose a novel deep reinforcement learning (DRL)-based transmission scheduling method (DTSM), where sensing performance guarantee is introduced for its criticality in ensuring complete system observation and effective control. Specifically, taking system observability as the key metric, the time slots for multi-sensor data transmission under different control demands are properly reserved with theoretically guaranteed performance. Then, the primal-dual DRL framework is adopted to further improve the overall performance of system control and resource utilization by dynamically scheduling the transmission number of each sensor. The scheduling is based on the real-time states of sensing and wireless network, and the action space is determined according to our reserved time slots. Besides, after primal-dual updates, the scheduling results can satisfy the estimation error-evaluated constraint imposed for the ultimate control effect. Finally, the proposed method is applied to the industrial laminar cooling process and its effectiveness is fully demonstrated.

Note to Practitioners—This paper is motivated by the requirement of balancing control performance and scarce transmission resources in industrial automation fields such as steel manufacturing, where massive sensor data is transmitted to the edge side through wireless networks. The expanding transmission scheduling space and dynamic network conditions have led to increased interest in advanced deep reinforcement learning (DRL) methods. However, few previous works have explored the impact of control demands on intelligent transmission scheduling design. For these issues, we propose a novel DRL-based transmission scheduling method (DTSM), where the time slots for multi-sensor data transmission are delicately reserved according to different control demands and dynamic scheduling is realized based on real-time states of sensing and wireless network. The overall performance of system control and resource utilization is improved, and practitioners can easily adjust method parameters to achieve the desired balance between the two aspects according to practical demands. Case studies in the industrial hot rolling process demonstrate the superiority of DTSM. Our future work will consider the joint scheduling of uplink-downlink transmissions and design the collaboration among multiple edge computing nodes (ECNs) to address the limitations of centralized learning methods. Besides, the proposed method can be extended to other industrial applications such as flight control system testing.

Index Terms—Sensing and control, deep reinforcement learning (DRL), dynamic transmission scheduling, industrial Internet of Things (IIoT).

Received 6 November 2024; accepted 10 January 2025. Date of publication 16 January 2025; date of current version 8 April 2025. This article was recommended for publication by Associate Editor C. Zhang and Editor Q. Zhao upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Grant 62025305, Grant 62432009, Grant 61933009, Grant 92167205, and Grant 62103268. (Corresponding author: Cailian Chen.)
The authors are with the Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China, and also with the Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TASE.2025.3530409

NOMENCLATURE

Symbols
$(\cdot)^\top$: Transpose of a vector or a matrix.
$\mathbb{E}[\cdot]$: Expectation of a random variable.
$\otimes$: Kronecker product.
$\mathrm{tr}(\cdot)$, $\mathrm{rank}(\cdot)$: Trace and rank of a matrix.
$\bar{\lambda}_A$, $\underline{\lambda}_A$: The largest and smallest eigenvalues of a matrix $A$.
$I_m$, $0_m$: $m \times m$ identity matrix and null matrix.
$1_m$: $m \times 1$ vector containing all ones.
$e_m^j$: $m \times 1$ vector with 1 as its $j$-th component and 0 elsewhere.
$A \succeq 0_m$: $A \in \mathbb{R}^{m \times m}$ being positive semidefinite.
$[A]_{i,j}$, $[A]_{i,:}$, $[A]_{:,j}$, $[a]_i$: The $(i,j)$-th entry, $i$-th row, $j$-th column of a matrix $A$, and the $i$-th entry of a vector $a$.
$\mathbb{N}_n$: Set $\{1, 2, \cdots, n\}$.

Variables
$x_k$, $y_k$, $u_k$: System state, measurement value, and control action at time period $k$.
$w_k$, $v_k$: Process noise and measurement noise at time period $k$.
$\Gamma_k$, $\Xi_k$: Transmission reliability indicator and network state at time period $k$.
$[\phi_k]_j$: Scheduled number of transmissions for sensor $j$ during time period $k$.
$[\bar{\phi}]_j$: Total number of slots reserved for sensor $j$.
$J_{c,k}$, $J_{r,k}$: Control and transmission costs at time period $k$.
1558-3783 © 2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence
and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on June 06,2025 at 08:40:48 UTC from IEEE Xplore. Restrictions apply.

$\mathcal{D}_v$, $\mathcal{D}_s$: Reservation and sensing performance-related constraints.
$\hat{x}_{k|k-1}$, $\hat{x}_k$, $P_{k|k-1}$, $P_k$: Prior and posterior estimates with error covariance matrices at time period $k$.
$d$: The smallest positive integer such that $[G^\top, (GA)^\top, \cdots, (GA^{d-1})^\top]^\top$ is full rank.
$O_k$: Observability matrix within time interval $\{k, \cdots, k+d-1\}$.
$\varphi$, $\varrho$: Observability probability and performance improvement indicator.
$\mathcal{C}$, $n$: Controllability matrix and the smallest positive integer such that it is full rank.
$\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$: State space, action space, and transition probability for DRL.
$s_k$, $a_k$, $r_k$, $c_k$: State, action, reward and stage cost for DRL at time period $k$.
$\pi$, $V_c$: Policy and estimated state-value function.
$\theta_a$, $\theta_c$: Trainable parameters of actor and critic networks.
$\mathcal{L}$, $\mathcal{H}$, $\nu$: Lagrangian, dual function, and dual variable for Problem 2.
$L_a$, $L_c$: Loss functions for actor and critic networks.

Parameters
$N$: The number of sensors.
$T$: Time horizon.
$A$, $B$, $G$, $C$: System, control, measurement, and sensing matrices.
$u'_k$, $B'$: Exogenous input and input matrix.
$\Sigma_w$, $\Sigma_v$, $\Sigma_x$: Covariance matrices.
$m$, $q$, $l$, $q'$: System dimensions.
$\mathcal{E}$, $M_n$, $E$: Set, number, and transition probability matrix of the network state.
$\mu_{j,\varepsilon}$: Failure rate of per transmission from sensor $j$ to the ECN given $\Xi_k = \varepsilon$.
$z_k$: Desired system state at time period $k$.
$Q$, $R$: Weight matrices for control performance and control usage.
$\kappa_c$, $\kappa_r$: Weight factors for control performance and transmission cost.
$T_c$, $\bar{b}$: Constraint values in $\mathcal{D}_s$.
$\lambda$, $\rho$, $c$, $v_c$: Thermal conductivity, density, specific heat capacity, and coiling speed of the strip steel.
$\tau_z$, $\tau_n$, $\Delta_z$, $\Delta_n$: Discretization sizes and steps.
$\Delta t$: Sampling period.
$x_w$, $x_f$, $\bar{x}_1$: Cooling water, thermodynamic system inlet and initial temperatures.
$\beta_d$, $\beta_g$, $\epsilon$: Discounted factor, GAE parameter, and clip factor.
$\eta_a$, $\eta_c$, $\eta_\nu$: Learning rates for $\theta_a$, $\theta_c$, and $\nu$.
$M_l$, $M_p$, $M_b$: Numbers of overall episodes, epochs per episode, and transitions in a mini-batch.

I. INTRODUCTION

WITH the evolution of Industry 4.0, industrial automation is empowered by the Internet of Things which integrates ubiquitous sensing, transmission, control, and learning [1], [2]. Based on edge computing and wireless transmission technologies [3], field sensor information can be more flexibly obtained at the edge side for subsequent sensing and production control. The more measurements the sensors transmit, the more accurate sensing becomes, and the better the final control performance will be. However, sufficient data transmission inevitably brings about the occupation of scarce network resources such as bandwidths and time slots, as well as the energy consumption of field sensors. Therefore, it is urgent to achieve the optimal balance between control performance and transmission cost.

Wireless transmission is usually unreliable and the channel status may vary with time due to fading and noises, etc. [4], [5]. Given these implications, exploring the interactions between transmission scheduling and system performance is thus more challenging [6]. First, many works have been devoted to transmission scheduling for both sensing [7], [8], [9] and control [10], [11], [12] performances. Specifically, the stability condition [7], event-based transmitted data coding mechanism [8], and joint transmission power control scheme of the latest and historical packets [9] for sensing performance are studied. The sensor access allocation [10], power control [11], and the optimal access policy structure are also analyzed [12] for system control. Further, to more tightly couple the transmission and control processes for overall performance improvement, transmission and control co-design methods have been proposed [13], [14], [15], [16], [17]. For example, both the dynamic event-triggered sensor data scheduling and model predictive control mechanisms are proposed in [13] to improve the transmission efficiency and system robustness. Turning to downlink transmissions, control performance and network resources are balanced [14] by scheduling the transmission number for each control loop, and a smart actuator is introduced in the industrial field to overcome the potential packet drops from the edge controller [15]. Besides, considering the packet drops in both uplink and downlink transmissions, the tradeoff between sensing data and control command transmissions is analyzed [16], and the novel bilateral dynamic scheduling mechanism is proposed together with an output feedback controller [17].

Recently, due to the expanding scheduling space as well as partially unknown model parameters in industrial automation, deep reinforcement learning (DRL) methods have received increasing attention [18], [19]. Under remote estimation scenarios of scheduling sensor transmissions for multiple independent subsystems, deep deterministic policy gradient (DDPG)-based stochastic sensor scheduling [20], deep Q-network (DQN)-based channel assignment [21], and proximal policy optimization (PPO)-based joint design of channel assignment and power allocation [22] are widely studied. With unknown system models, a joint design framework of estimation method, control policy, and sensor transmission scheduling is proposed in [23] to balance control performance and energy consumption. Additional long-term constraints on


resource or control performance are also addressed by the primal-dual DRL method [24]. However, in these intelligent scheduling frameworks, although the conditions for bounded sensing cost [20], for control stabilization [24], and for the existence of solutions ensuring system stability [21], [22] are discussed, how to regulate the scheduling space according to actual production control demands has not been explored. This may result in poor resource utilization due to excessive resource reservation or unsatisfactory control performance due to insufficient action space.

Meanwhile, sensing completeness is not only the basic requirement to achieve accurate state estimation, but also the guarantee of the ultimate control performance [25]. Most of the aforementioned works assume that a single sensor has sufficient ability to estimate the state and support subsequent control decisions. For instance, despite the multi-sensor and multi-system setting adopted in [20], [21], and [22], it is still essentially considered that one sensor can fully sense an independent system. Whereas this is overly stringent on the sensing abilities of individual sensors when applied to real complex Industrial Internet of Things (IIoT) systems. Certainly, there are some attempts to consider the scheduling and fusion of multi-sensor information [2], [3], [25], [26], [27], [28], [29]. With edge device-assisted sensing fusion, an on-demand transmission scheme is designed in [3] to optimize the overall control and transmission performance, and the concept of Age of Task (AoT) [26] is proposed to measure the timeliness of multi-element computationally intensive tasks and is associated with system control. Sensing and control co-design mechanisms and greedy-based sensor scheduling methods are proposed in [25] and [27]. In addition, based on intelligent learning methods, the determination of optimal fusion coefficients of multi-sensor information under unknown correlations [28], and the joint design of sensing and control under unknown system models [2], [29] are studied. However, these works mainly have threefold limitations including the simplified sensing process [3], [26], the perfect uplink measurement transmission assumption [2], [25], [27], [29], and the lack of system control analysis [28]. Thus, how to comprehensively incorporate these factors into transmission scheduling remains unexplored.

To overcome the above deficiencies, by introducing the edge computing node (ECN) with strong computing power, we consider the timely sensing fusion of sensor networks and intelligent transmission scheduling on the edge side for final industrial production control. Dynamic network conditions are considered in the real-time sensing process, and the optimal control law is derived for tracking desired system states, based on which the transmission scheduling problem is formulated as a Markov decision process (MDP) [30]. Especially, by introducing the key sensing metric of observability probability [5], we delicately reserve the transmission resources under different control performance demands by adjusting its guarantee degree. This addresses the limitations of existing DRL-based scheduling methods [20], [21], [22], [23], [24]. In summary, the major contributions of this paper are as follows:
1) A novel DRL-based framework is proposed to realize the transmission scheduling and fusion of multi-sensor information for system control under dynamic network conditions. Based on the optimal sensing-aware control design, the complete transmission scheduling problem is formulated as an MDP and thus incorporated into the DRL framework.
2) Previous setting of fixed scheduling space is no longer required. Through the bridge of the derived observability probability bounds, the time slots for multi-sensor data transmission are delicately reserved to match different control demands, thereby achieving DRL action space regulation.
3) The performance using the reserved slots is theoretically characterized. With this as a benchmark, we further schedule the sensor transmissions in a spatial–temporal dynamic manner based on primal-dual DRL. The overall cost of transmission and control is optimized and the estimation error-related constraint is satisfied after training.

The rest of this paper is organized as follows. In Section II, the IIoT architecture, system models, and considered problem are introduced. In Section III, the novel DRL-based transmission scheduling method (DTSM) is proposed with detailed MDP construction. In Section IV, the application of DTSM in the laminar cooling process is discussed in detail. Finally, this work is concluded in Section V.

II. PRELIMINARIES AND PROBLEM FORMULATION

In this section, the IIoT architecture, system model, network model, and basic assumptions are first introduced. Then, the problem of transmission scheduling for sensing-aware control is formulated.

A. Architecture and Models

1) IIoT Architecture and System Model: As illustrated in Fig. 1, the edge computing-supported IIoT system mainly includes the field device layer (FDL) and the edge computing layer (ECL). In the FDL, a large number of sensors (thermal imagers, infrared thermometers, environment sensors, etc.) and actuators (robotic arms, water supply valves, etc.) are deployed. They provide rich measurement data to the ECL and execute the received control actions. In the ECL, the deployed ECNs support the execution of sensing, control, and DRL methods. Optimized transmission policies and control actions are delivered to the field sensors and actuators, respectively.

Fig. 1. Edge computing-supported IIoT system.


In this paper, we focus on the sensing and control under one ECN. The total $N$ sensors communicate with the ECN via a shared wireless channel. The downlink transmissions of scheduling commands are assumed to be perfect due to factors such as the ECN having more sufficient energy. Besides, to ensure timely and reliable control of industrial systems, the control actions from the ECN to actuators are transmitted via dedicated wired connections. Similar settings can be found in [11], [21], [23], etc.

The considered system dynamics are as follows:

$$x_{k+1} = A x_k + B u_k + B' u'_k + w_k, \tag{1}$$
$$y_k = \Gamma_k (C G x_k + v_k), \tag{2}$$

where $k \in \mathbb{N}_T$ with $T$ being the time horizon. $x_k \in \mathbb{R}^m$, $y_k \in \mathbb{R}^N$, and $u_k \in \mathbb{R}^q$ are the system state, measurement value, and control action at time period $k$, respectively. $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times q}$, and $G \in \mathbb{R}^{l \times m}$ are the system, control, and observation matrices, respectively. $C \in \mathbb{R}^{N \times l}$ is the binary sensing matrix satisfying that $C 1_l = 1_N$ and $C^\top 1_N > 0_l$. The diagonal matrix $\Gamma_k \in \mathbb{R}^{N \times N}$ is the transmission reliability indicator, where $[\Gamma_k]_{j,j} = 1$ if the measurement $[C]_{j,:} G x_k + [v_k]_j$ of sensor $j$ is received at the ECN and $[\Gamma_k]_{j,j} = 0$ otherwise. $u'_k \in \mathbb{R}^{q'}$ with input matrix $B' \in \mathbb{R}^{m \times q'}$ is the exogenous input that cannot be changed [31]. The process noise $w_k \sim \mathcal{N}(0, \Sigma_w)$, measurement noise $v_k \sim \mathcal{N}(0, \Sigma_v)$, and initial state $x_1 \sim \mathcal{N}(\bar{x}_1, \Sigma_x)$, where $\Sigma_w$, $\Sigma_v$, and $\Sigma_x$ are positive definite. $w_k$, $v_k$, and $x_1$ are mutually uncorrelated.

Here, within all observable system dimensions captured by $G$, each sensor $j \in \mathbb{N}_N$ obtains its measurement according to the sensing vector $[C]_{j,:}$, which depends on the sensor's spatial distribution. Notice that the potentially high-dimensional $x_k$ cannot be directly obtained, and it is necessary to fuse the measurements from the sensor network for accurate state estimation.

Assumption 1 is provided to ensure the system controllability and observability, which is the basis for subsequent transmission policy optimization.

Assumption 1: The pair $(A, B)$ is controllable, and $(A, G)$ is observable.

2) Network Model: We consider the sensors transmitting their measurements to the ECN over a shared block-fading wireless channel [32], where the beacon-enabled IEEE 802.15.4 protocol [14] is adopted. The channel gain is assumed constant within each time period but may vary period by period [7]. The Markov network state process $\{\Xi_k\}_{k=0}^{T}$ with $\Xi_k \in \mathcal{E} \triangleq \{0, \cdots, M_n - 1\}$ is then defined to capture the gain variation, where $\Xi_0$ is known and the transition probabilities are

$$\mathbb{P}\left[\Xi_k = \varepsilon' \mid \Xi_{k-1} = \varepsilon\right] = [E]_{\varepsilon,\varepsilon'}, \quad \forall \varepsilon, \varepsilon' \in \mathcal{E}. \tag{3}$$

As shown in Fig. 2, the beacon interval contains active and inactive periods. The active period further includes a contention access period (CAP) and a collision free period (CFP). Since the real-time sensing performance guarantee is required, we focus on the scheduling of CFP, which employs the deterministic time-division multiple access (TDMA). The limitation that the CFP can include up to 7 guaranteed time slots is relaxed in [33], which allows the number of slots assigned to the CFP to be adjustable. Each time slot supports one transmission of any sensor's measurement. Then, for each sensor $j$, $[\phi_k]_j$ denotes the scheduled number of transmissions during time period $k$, and $[\bar{\phi}]_j$ is the total number of slots reserved for it. Denoting the failure rate of per transmission from sensor $j$ to the ECN given $\Xi_k = \varepsilon$ as $\mu_{j,\varepsilon}$, the packet reception ratio is as follows:

$$\mathbb{P}\left[[\Gamma_k]_{j,j} = 1 \mid \Xi_k = \varepsilon, [\phi_k]_j = p\right] = 1 - \mu_{j,\varepsilon}^{p}. \tag{4}$$

Fig. 2. IEEE 802.15.4 superframe structure.

In addition, when the beacon containing downlink scheduling commands is lost in practice, we can, for example, set $[\phi_k]_j = [\bar{\phi}]_j$ for each sensor in a performance-first manner. This still avoids collisions among data transmissions due to our time slot division. In the case of unreliable control action transmissions, we can let the actuator record and adopt the last received control action, or introduce a smart actuator for local control. Interested readers may refer to [14] and [15] for more details about actuation packet scheduling. We leave these for future work.

Referring to [34], the following assumption is adopted for the transmission process:

Assumption 2: The transmission indicators $\{[\Gamma_k]_{j,j}\}$ are conditionally independent given the network states and the transmission numbers.

B. Problem Formulation

The considered problem of transmission scheduling for sensing-aware control is as follows:

$$\textbf{Problem 1}: \quad \min_{\{\phi_k, u_k\}} \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{k=1}^{T} \kappa_c J_{c,k} + \kappa_r J_{r,k}\right] \tag{5}$$
$$\text{s.t.} \quad \{\phi_k\} \in \mathcal{D}_v \cap \mathcal{D}_s. \tag{6}$$

In this problem, the control goal is to drive the system states to the desired $\{z_k\}$ while reducing the control usage, thus the control cost $J_{c,k} = \|x_{k+1} - z_{k+1}\|_Q^2 + \|u_k\|_R^2$, where $Q$ and $R$ are positive definite matrices. Besides, each transmission from the sensor consumes its limited resources such as battery energy, then the total transmission cost is defined as $J_{r,k} = \sum_{j=1}^{N} [\phi_k]_j$. $\kappa_c$ and $\kappa_r$ are positive weight factors reflecting the preferences in terms of control performance and transmission cost, respectively. $\mathcal{D}_v$ is the reservation constraint meaning that a suitable number of transmissions are further scheduled only within the reserved slots $\bar{\phi}$. $\mathcal{D}_s$ is the sensing performance-related constraint imposed for the ultimate control effect.
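The network model above can be illustrated with a minimal sketch in plain Python. It shows the Markov network state transition of (3), the packet reception ratio of (4), and the transmission cost $J_{r,k}$. The function names and all numerical values (the transition matrix `E`, the failure rates `mu`, and the schedule `phi`) are hypothetical choices for illustration only and do not come from the paper.

```python
import random
from typing import List, Sequence


def reception_probability(mu: float, p: int) -> float:
    """Eq. (4): probability that at least one of p transmissions is received."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must be a probability")
    if p < 0:
        raise ValueError("p must be non-negative")
    return 1.0 - mu ** p


def next_network_state(state: int, E: Sequence[Sequence[float]],
                       rng: random.Random) -> int:
    """Eq. (3): sample the next network state from row `state` of E."""
    return rng.choices(range(len(E)), weights=E[state], k=1)[0]


def transmission_cost(phi: Sequence[int]) -> int:
    """J_{r,k}: total number of transmissions scheduled in one time period."""
    return sum(phi)


def sample_reception(phi: Sequence[int], mu: Sequence[Sequence[float]],
                     state: int, rng: random.Random) -> List[int]:
    """Draw the diagonal of Gamma_k: entry j is 1 if sensor j is received."""
    return [int(rng.random() < reception_probability(mu[j][state], p))
            for j, p in enumerate(phi)]


if __name__ == "__main__":
    rng = random.Random(0)
    E = [[0.9, 0.1], [0.3, 0.7]]    # hypothetical two-state channel
    mu = [[0.1, 0.5], [0.2, 0.6]]   # hypothetical mu[j][eps] per sensor and state
    phi = [2, 1]                    # hypothetical schedule: slots per sensor
    state = 0
    for k in range(1, 4):
        state = next_network_state(state, E, rng)
        gamma = sample_reception(phi, mu, state, rng)
        print(k, state, gamma, transmission_cost(phi))
```

Under these assumptions, scheduling more transmissions for a sensor raises its reception probability geometrically while the cost grows linearly, which is the tradeoff that Problem 1 weighs through $\kappa_c$ and $\kappa_r$.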


There is a tradeoff between the two costs $J_{c,k}$ and $J_{r,k}$, because reducing $J_{c,k}$ requires more sufficient measurements to improve the estimation accuracy of $x_k$, which is at the expense of a larger $J_{r,k}$. If the IIoT system focuses more on the control effect of the industrial process, the ratio of $\kappa_c$ to $\kappa_r$ is increased. Conversely, if resource conservation is more of a concern, the ratio is decreased. Such a setup effectively reflects the performance objectives under different demands, and is traceable to a wide range of industrial scenarios, such as [3], [14], [23], and [26], etc. $\mathcal{D}_v$ and $\mathcal{D}_s$ need to be delicately designed. If $\bar{\phi}$ is reserved too small, it is difficult to realize sufficient sensing for control demands, whereas if it is too large, too many network resources are pre-occupied. $\mathcal{D}_s$ is required to reasonably represent the system performance demands. Besides, the optimal $u_k$ needs to be jointly designed since it depends on the transmitted measurements.

For these concerns, the basic design idea of our proposed DTSM is shown in Fig. 3. We will formulate the considered Problem 1 into an MDP and use the advanced DRL method to solve it. Thus, we can obtain an optimized policy to perform dynamic scheduling based on the real-time DRL state. Specifically, we first provide the edge sensing method and decouple the design of transmission and control while deriving the optimal control law (12). The sensing process is embedded into the control cost expression through the error covariance matrix. The state and reward for DRL are thus determined. Then, with observability probability and estimation error covariance as the key metrics of sensing performance, we respectively design the time slot reservation $\bar{\phi}$ and sensing constraint $\mathcal{D}_s$, where $\bar{\phi}$ determines the action space. Innovatively, Theorem 1 provides the observability probability bounds (24), which facilitate the calculation during the reservation process. Theorem 2 reflects the upper bound (31) on overall control and transmission cost when setting $\phi_k = \bar{\phi}$, which quantitatively illustrates the rationality of the designed reservation. Finally, the primal-dual DRL framework is adopted to dynamically schedule the transmission numbers for overall performance improvement while satisfying the imposed constraints.

Fig. 3. Basic design idea of DTSM.

III. TRANSMISSION SCHEDULING WITH DEEP REINFORCEMENT LEARNING

In this section, we first introduce the sensing and control methods, based on which the state and reward for DRL are designed. Then, the system observability and time slot reservation for different control demands are analyzed, which determines the action space. Finally, the complete DTSM is provided based on the results above.

A. DRL State and Reward Design: Sensing-Aware Control

Based on the collected measurements, the Kalman filtering is applied by the ECN to give the minimum mean square error estimate of the system state:

$$P_k = \left(P_{k|k-1}^{-1} + G^\top C^\top \Gamma_k^\top \Sigma_v^{-1} \Gamma_k C G\right)^{-1}, \tag{7}$$
$$\hat{x}_k = P_k \left(P_{k|k-1}^{-1} \hat{x}_{k|k-1} + G^\top C^\top \Gamma_k^\top \Sigma_v^{-1} y_k\right), \tag{8}$$
$$P_{k+1|k} = A P_k A^\top + \Sigma_w, \tag{9}$$
$$\hat{x}_{k+1|k} = A \hat{x}_k + B u_k + B' u'_k, \tag{10}$$

where $\hat{x}_{k|k-1}$ and $\hat{x}_k$ are the prior and posterior estimates at time period $k$, respectively. $P_{k|k-1}$ and $P_k$ are the prior and posterior error covariance matrices at time period $k$, respectively. The initialization is $\hat{x}_{1|0} = \bar{x}_1$ and $P_{1|0} = \Sigma_x$.

We then decouple the design of transmission scheduling and control. Define $g_k$ as the optimal control cost function given any transmission scheduling $\phi_k$:

$$g_k = \min_{u_k} \frac{1}{T}\, \mathbb{E}\left[\kappa_c J_{c,k} + \kappa_r J_{r,k} + g_{k+1}\right], \tag{11}$$

where $g_{T+1} = 0$. Letting $e_k = x_k - \hat{x}_k$, $g_k$ can be calculated recursively:

$$\begin{aligned}
g_k &= \min_{u_k} \frac{1}{T}\, \mathbb{E}\Bigg[\kappa_c \Big(\|u_k - K_k (A x_k - U_k^{-1} L_k + B' u'_k)\|_{M_k}^2 + \|w_k\|_{U_k}^2 + \|A x_k - U_k^{-1} L_k + B' u'_k\|_{H_k}^2 - L_k^\top U_k^{-1} L_k + D_k\Big) \\
&\qquad\qquad + \kappa_r J_{r,k} + \sum_{t=k+1}^{T} \Big(\kappa_c \|e_t\|_{\Theta_t}^2 + \kappa_c \|w_t\|_{U_t}^2 + \kappa_r J_{r,t}\Big)\Bigg] \\
&\overset{(i)}{=} \frac{1}{T}\, \mathbb{E}\Bigg[\kappa_c \Big(\|A x_k - U_k^{-1} L_k + B' u'_k\|_{H_k}^2 - L_k^\top U_k^{-1} L_k + D_k\Big) + \sum_{t=k}^{T} \Big(\kappa_c \|e_t\|_{\Theta_t}^2 + \kappa_c \|w_t\|_{U_t}^2 + \kappa_r J_{r,t}\Big)\Bigg],
\end{aligned}$$

where (i) holds by substituting the optimal $u_k^*$:

$$u_k^* = K_k (A \hat{x}_k - U_k^{-1} L_k + B' u'_k). \tag{12}$$

The quantities involved are recursive as follows with the boundary conditions being $U_T = Q$, $L_T = Q z_{T+1}$, and $D_T = z_{T+1}^\top Q z_{T+1}$:

$$U_k = Q + A^\top H_{k+1} A, \quad H_k = \left(U_k^{-1} + B R^{-1} B^\top\right)^{-1}, \tag{13}$$
$$L_k = Q z_{k+1} + A^\top H_{k+1} \left(U_{k+1}^{-1} L_{k+1} - B' u'_k\right), \tag{14}$$
$$D_k = D_{k+1} - L_{k+1}^\top U_{k+1}^{-1} L_{k+1} + z_{k+1}^\top Q z_{k+1} + \|U_{k+1}^{-1} L_{k+1} - B' u'_k\|_{H_k}^2, \tag{15}$$

and $M_k = B^\top U_k B + R$, $K_k = -M_k^{-1} B^\top U_k$, $\Theta_k = A^\top U_k B M_k^{-1} B^\top U_k A$. Since the nearly infinite horizon is considered, $U_k$ can be replaced by the steady-state $U_\infty$ for simplicity, which is the positive semi-definite solution of the following equation:

$$U_\infty = A^\top U_\infty A + Q - A^\top U_\infty B \left(B^\top U_\infty B + R\right)^{-1} B^\top U_\infty A, \tag{16}$$


which necessarily exists since our assumptions satisfy the Theorem 1: The observability probability is lower and
conditions in [35]. Also, we have upper bounded as follows:
−1
2∞ = A⊤ U∞ B B ⊤ U∞ B + R B ⊤ U∞ A. (17) " l
Y
#d
min (1 − µi,t

)
Ignoring the fixed terms, the objective function of Problem t∈Nd
i=1
1 can be rewritten as
≤ φ k, {εi }i=1
d
, { pi }i=1
d

" T #
1 X
(ldo )
 
κc tr(2∞ Pk ) + κr Jr,k .

g̃ 1 = lim E (18) ld X
T →∞ T
X Y Y
k=1 ≤  (1 − µi,t

) µi,t
′ 
. (24)
o=m h=1 (i,t)∈Io,h (i,t)∈Ī o,h
Thus, using the optimal control law (12), the control performance is expressed in terms of the sensing outcomes $\{P_k\}$, which depend on the designed $\{\phi_k\}$. The bridge between transmission and control is thereby constructed. For clarity, the order of the edge sensing, transmission, and control processes is as follows:

$$\cdots \to x_k \xrightarrow{(9),(10)} \hat{x}_{k|k-1}, P_{k|k-1} \xrightarrow{(i)} \phi_k \xrightarrow{(4)} y_k, \Gamma_k \xrightarrow{(7),(8)} \hat{x}_k, P_k \xrightarrow{(12)} u_k \xrightarrow{(1)} x_{k+1} \to \cdots, \tag{19}$$

where (i) adopts the DRL-based scheduling designed subsequently.

Therefore, considering that $P_{k|k-1}$ is available at the beginning of time period $k$ and that the network state also affects the packet reception ratio, the MDP state for DRL is defined as

$$s_k \triangleq \left( \frac{P_{k|k-1}}{\bar{\lambda}_{\Sigma_x}}, \frac{\Xi_{k-1}}{M_n} \right), \tag{20}$$

where $\bar{\lambda}_{\Sigma_x}$ and $M_n$ are the normalization terms. Based on (18), the reward at time period $k$ is defined as

$$r_k \triangleq -\kappa_c \operatorname{tr}(\Theta_\infty P_k) - \kappa_r J_{r,k}, \tag{21}$$

and the stage cost is correspondingly denoted as $c_k = -r_k$.

B. DRL Action Space Design: Observability-Based Time Slot Reservation

As an important concept in estimation theory, system observability is one of the fundamental conditions to realize complete sensing [25]. For system (1) and (2), the observability matrix within time interval $\{k, \cdots, k+d-1\}$ is defined as

$$O_k \triangleq \begin{bmatrix} \Gamma_k C G I_m \\ \Gamma_{k+1} C G A \\ \cdots \\ \Gamma_{k+d-1} C G A^{d-1} \end{bmatrix}, \tag{22}$$

where $d$ is the smallest positive integer such that matrix $[G^\top, (GA)^\top, \cdots, (GA^{d-1})^\top]^\top$ is full rank. It follows from Assumption 1 that $d$ exists and $d \le m$. Considering that $O_k$ contains random variables, the observability probability [5] is defined as follows:

$$\varphi(k, \{\varepsilon_i\}, \{p_i\}) \triangleq \mathbb{P}\big[\operatorname{rank}(O_k) = m \mid \{\Xi_{k-1+i}\} = \{\varepsilon_i\}, \{\phi_{k-1+i}\} = \{p_i\}\big], \tag{23}$$

where the range $i = 1$ to $d$ of $\{\cdot\}$ is omitted for notation brevity. It is conditioned upon the network states and the scheduled numbers. Then, Theorem 1 is proposed to analyze and bound this probability.

In (24), $S_1, \cdots, S_l$ are the sensor sets obtained by merging sensors with the same $[C]_{j,:}$. $\tilde{\mu}_{i,t} = \prod_{j \in S_i} \mu_{j,\varepsilon_t}^{[p_t]_j}$. $I_{o,h}$ is the $h$-th element in set $\widehat{I}_o = \{I \mid I \subseteq \mathcal{N}_l \times \mathcal{N}_d, |I| = o\}$, and $\bar{I}_{o,h} = (\mathcal{N}_l \times \mathcal{N}_d) \setminus I_{o,h}$.

Proof: Define the following auxiliary transmission indicator:

$$[\gamma'_k]_i = \begin{cases} 1, & \exists j \in S_i, [\Gamma_k]_{j,j} = 1, \\ 0, & \text{otherwise.} \end{cases} \tag{25}$$

Let $G_1 = \{[\gamma'_t]_i = 1, \forall i \in \mathcal{N}_l, \forall t \in \{k, \cdots, k+d-1\}\}$, and $G_2 = \{\sum_{t=k}^{k+d-1} \sum_{i=1}^{l} [\gamma'_t]_i \ge m\}$. Since for each $t \in \{k, \cdots, k+d-1\}$ and $i \in \mathcal{N}_l$, $[C]_{j,:} G A^{t-k}$ remains the same for every $j \in S_i$, $G_1$ ensures that all distinct row vectors in $C G A^{t-k}$ are retained in $\Gamma_t C G A^{t-k}$. Also, it holds that $\operatorname{rank}([(CG)^\top, \cdots, (CGA^{d-1})^\top]^\top) = m$ since $\operatorname{rank}(C) = l$. Thus, based on matrix row transformation, we have $\mathbb{P}[\operatorname{rank}(O_k) = m \mid G_1] = 1$. Besides, it holds that $\operatorname{rank}(O_k) \le \sum_{t=k}^{k+d-1} \operatorname{rank}(\Gamma_t C) = \sum_{t=k}^{k+d-1} \sum_{i=1}^{l} [\gamma'_t]_i$ and thus $\mathbb{P}[\operatorname{rank}(O_k) = m \mid \bar{G}_2] = 0$.

Similar to [36], letting $G_o = \{\{\Xi_{k-1+i}\} = \{\varepsilon_i\}, \{\phi_{k-1+i}\} = \{p_i\}\}$, we have $\mathbb{P}[G_1 \mid G_o] \le \varphi(k, \{\varepsilon_i\}, \{p_i\}) \le \mathbb{P}[G_2 \mid G_o]$. Based on Assumption 2, it is obtained that

$$\begin{aligned} \mathbb{P}[G_1 \mid G_o] &= \prod_{t=k}^{k+d-1} \prod_{i=1}^{l} \mathbb{P}\big[\gamma'_{i,t} = 1 \mid \Xi_t = \varepsilon_{t-k+1}, \phi_t = p_{t-k+1}\big] \\ &= \prod_{t=k}^{k+d-1} \prod_{i=1}^{l} \Big(1 - \prod_{j \in S_i} \mu_{j,\varepsilon_{t-k+1}}^{[p_{t-k+1}]_j}\Big) \\ &= \prod_{t=1}^{d} \prod_{i=1}^{l} (1 - \tilde{\mu}_{i,t}) \ge \Big[\min_{t \in \mathcal{N}_d} \prod_{i=1}^{l} (1 - \tilde{\mu}_{i,t})\Big]^d. \end{aligned}$$

Then, by traversing the feasible outcomes of $\{\gamma'_{i,t}\}$, $\mathbb{P}[G_2 \mid G_o]$ can be calculated as the rightmost term in (24). The proof is completed. ■

As a comparison, [25] proposes deterministic system observability conditions regarding whether each sensor transmits or not under the assumption of reliable transmission. We instead further consider the relationship between the continuous observability probability and each sensor's adjustable transmission number under imperfect transmission conditions. Compared with our preliminary work [36], the form of the lower bound here is more concise, which facilitates subsequent calculations.

To improve edge sensing for desired control performance, a certain number of time slots need to be reserved for adequate measurement acquisition. With observability probability as the
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on June 06,2025 at 08:40:48 UTC from IEEE Xplore. Restrictions apply.
JIN et al.: DRL BASED TRANSMISSION SCHEDULING FOR SENSING AWARE CONTROL 10911

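key metric, the target is to design a $\bar{\phi}$ satisfying the following condition for each $k \in \mathcal{N}_T$ and $\{\varepsilon_i\}_{i=1}^{d} \in \mathcal{E}^d$:

$$\varphi(k, \{\varepsilon_i\}, \{\bar{\phi}\}) \ge 1 - \frac{\varrho}{\bar{\lambda}^d_{A^\top A}}, \tag{26}$$

where $\bar{\lambda}_{A^\top A}$ is assumed to be positive. $\varrho \in (0, 1)$ reflects the improvement of sensing performance, and $\varrho < 1$ is the condition to ensure the boundedness of $\mathbb{E}[\operatorname{tr}(P_k)]$, which will be proved in Theorem 2. As $\varrho$ decreases toward 0, the expected observability probability gradually approaches 1, and the sensing requirement becomes higher. To efficiently solve for $\bar{\phi}$, we utilize the derived probability lower bound and greedy-based ideas [27]. For every $\varepsilon \in \mathcal{E}$, initialized with $\bar{\phi}^\varepsilon = 0_N$, the sensor with the best transmission condition is first selected in each $S_i$, $i \in \mathcal{N}_l$:

$$\bar{\phi}^\varepsilon = \bar{\phi}^\varepsilon + e^N_{j_0}, \quad j_0 \in \operatorname*{argmin}_{j \in S_i} \mu_{j,\varepsilon}. \tag{27}$$

Then, define $f_g(\phi, \varepsilon) = \big[\prod_{i=1}^{l} \big(1 - \prod_{j \in S_i} \mu_{j,\varepsilon}^{[\phi]_j}\big)\big]^d$ as the objective function. Until $f_g(\bar{\phi}^\varepsilon, \varepsilon) \ge 1 - \varrho/\bar{\lambda}^d_{A^\top A}$, each update is as follows:

$$\bar{\phi}^\varepsilon = \bar{\phi}^\varepsilon + e^N_{j_g}, \quad j_g \in \operatorname*{argmax}_{j \in \mathcal{N}_N} f_g(\bar{\phi}^\varepsilon + e^N_j, \varepsilon). \tag{28}$$

Combining the cases under various channel states, $\bar{\phi}$ is constructed as

$$[\bar{\phi}]_j = \max_{\varepsilon \in \mathcal{E}} [\bar{\phi}^\varepsilon]_j, \quad j \in \mathcal{N}_N. \tag{29}$$

Thus, the action space for transmission scheduling is finally defined as follows:

$$\mathcal{A} \triangleq \underbrace{\{0, \cdots, [\bar{\phi}]_1\} \times \cdots \times \{0, \cdots, [\bar{\phi}]_N\}}_{N \text{ terms}}. \tag{30}$$

Since $f_g(\bar{\phi}^\varepsilon, \varepsilon)$ is nondecreasing in $[\bar{\phi}^\varepsilon]_j$, (29) ensures that $\forall \varepsilon$, $f_g(\bar{\phi}, \varepsilon) \ge 1 - \varrho/\bar{\lambda}^d_{A^\top A}$. Thus, the lower bound proposed in Theorem 1 for $\varphi(k, \{\varepsilon_i\}, \{\bar{\phi}\})$ is greater than or equal to $1 - \varrho/\bar{\lambda}^d_{A^\top A}$, and finally (26) holds. We assume that the total number of reserved time slots does not exceed the maximum value $M_\phi$ allowed by the adopted communication protocol. If this is not satisfied, the required $\varrho$ needs to be increased appropriately. The complete process is listed in the "Time slot reservation" part of Algorithm 1.

Instead of arbitrary or uniform reservation [36], we use observability probability as the bridge to allocate time slots more precisely to suitable sensors. This avoids the need to directly analyze the intractable control cost $\mathbb{E}[\operatorname{tr}(\Theta_\infty P_k)]$ in advance, and realizes the time slot reservation more concisely and practically with our derived probability bounds.

Further, based on Assumption 1, denote $n \le m$ as the smallest positive integer such that the controllability matrix $\mathcal{C} = [B, AB, \cdots, A^{n-1}B]$ is full rank. Theorem 2 is proposed to quantify the overall performance of the designed time slot reservation.

Theorem 2: Assuming $\lambda_{A^\top A} > 0$ and setting $\phi_k = \bar{\phi}$, $\forall k \in \mathcal{N}_T$, the expectation of stage cost $c_k$ is then upper bounded as follows:

$$\mathbb{E}[c_k] \le \kappa_c \Big( \max(\bar{\lambda}^{d-1}_{A^\top A}, \bar{\lambda}^{0}_{A^\top A}) \varrho^{\lceil k/d \rceil - 1} m \bar{\lambda}_{\Sigma_x} + \zeta_1 \Big) \zeta_2 + \kappa_r \sum_{j=1}^{N} [\bar{\phi}]_j. \tag{31}$$

For the coefficients, there exist positive constants $\alpha_1$ and $\alpha_2$ such that

$$\zeta_1 = \frac{\max(\bar{\lambda}^{d-1}_{A^\top A}, \bar{\lambda}^{0}_{A^\top A})}{1 - \varrho} \zeta_{1,1} + m \bar{\lambda}_{\Sigma_w} \sum_{i=0}^{d-2} \bar{\lambda}^{i}_{A^\top A},$$

$$\zeta_2 = \frac{\bar{\lambda}_{B B^\top} \bar{\lambda}_{A^\top A}}{\lambda_Q \lambda_{B^\top B} + \lambda_R} \Big( \bar{\lambda}_Q + \frac{\lambda_R \bar{\lambda}_{A^\top A}}{\alpha_2 \zeta_{2,1}^{n-1}} \Big)^2,$$

where

$$\zeta_{1,1} = \frac{\lambda_{\Sigma_v} m}{\alpha_1 \zeta_{1,2}^{d-1}} + \frac{\varrho}{\bar{\lambda}^{d}_{A^\top A}} m \bar{\lambda}_{\Sigma_w} \sum_{i=0}^{d-1} \bar{\lambda}^{i}_{A^\top A},$$

$$\zeta_{1,2} = \Big( 1 + \frac{\bar{\lambda}_{\Sigma_w} \lambda_{\Sigma_v} + \bar{\lambda}_{\Sigma_w} \lambda_{\Sigma_w} \bar{\lambda}_{G^\top G} \max_i |S_i|}{\lambda_{A^\top A} \lambda_{\Sigma_w} \lambda_{\Sigma_v}} \Big)^{-1},$$

$$\zeta_{2,1} = \Big( 1 + \frac{\bar{\lambda}_Q \lambda_R + \bar{\lambda}_Q \lambda_Q \bar{\lambda}_{B B^\top}}{\lambda_{A^\top A} \lambda_Q \lambda_R} \Big)^{-1}.$$

Proof: We first discuss the sensing part. Referring to [37], since $C^\top \Gamma_k^\top \Gamma_k C = \sum_{j=1}^{N} [\Gamma_k]_{j,j} C_{j,:}^\top C_{j,:} \preceq \max_i |S_i| I_l$, we have $P^{-1}_{k+1|k} \succeq \zeta_{1,2} A^{-\top} P_k^{-1} A^{-1}$. Then for $k \ge d$, when $\operatorname{rank}(O_{k-d+1}) = m$, it holds that

$$\begin{aligned} P_k^{-1} &\succeq \zeta_{1,2} A^{-\top} P_{k-1}^{-1} A^{-1} + G^\top C^\top \Gamma_k^\top \Sigma_v^{-1} \Gamma_k C G \succeq \cdots \\ &\succeq \sum_{i=0}^{d-1} \frac{\zeta_{1,2}^{i}}{\lambda_{\Sigma_v}} (A^\top)^{-i} G^\top C^\top \Gamma_{k-i}^\top \Gamma_{k-i} C G A^{-i} \overset{(i)}{\succeq} \frac{\alpha_1 \zeta_{1,2}^{d-1}}{\lambda_{\Sigma_v}} I_m. \end{aligned}$$

For (i), it is easy to obtain that there exists a positive constant $\alpha'_1$ such that $O_k^\top O_k \succeq \alpha'_1 I_m$ holds for any $O_k$ of full column rank. Thus, $(A^\top)^{-d+1} O_k^\top O_k A^{-d+1} \succeq \alpha_1 I_m$ holds with

$$\alpha_1 = \begin{cases} \alpha'_1 / \bar{\lambda}^{d-1}_{A^\top A}, & d \ge 2, \\ \alpha'_1, & d = 1, \end{cases} \tag{32}$$

and (i) is obtained by taking in $\alpha_1$. Let $k_t = (t-1)d + 1$, $t = 1, 2, \cdots$, $V_t = \operatorname{tr}(P_{k_t})$, $R_{k_t} = \operatorname{rank}(O_{k_t+1})$. It holds that

$$\begin{aligned} \mathbb{E}[V_{t+1} \mid P_{k_t} = P] &= \mathbb{E}[V_{t+1} \mid P_{k_t} = P, R_{k_t} = m] \, \mathbb{P}[R_{k_t} = m \mid P_{k_t} = P] \\ &\quad + \mathbb{E}[V_{t+1} \mid P_{k_t} = P, R_{k_t} < m] \, \mathbb{P}[R_{k_t} < m \mid P_{k_t} = P] \\ &\le \varrho \operatorname{tr}(P) + \zeta_{1,1}. \end{aligned}$$

Taking the expectation on both sides, we have $\mathbb{E}[V_t] \le \varrho^{t-1} \mathbb{E}[V_1] + \sum_{i=0}^{t-2} \varrho^i \zeta_{1,1}$. Then $\forall p \in \mathcal{N}_{d-1}$, it holds that

$$\begin{aligned} \mathbb{E}[\operatorname{tr}(P_{k_t + p})] &\le \bar{\lambda}^{p}_{A^\top A} \mathbb{E}[V_t] + \sum_{i=0}^{p-1} \bar{\lambda}^{i}_{A^\top A} \operatorname{tr}(\Sigma_w) \\ &\le \max(\bar{\lambda}^{d-1}_{A^\top A}, \bar{\lambda}^{0}_{A^\top A}) \varrho^{t-1} \mathbb{E}[V_1] + \zeta_1. \end{aligned}$$

$\forall k = (t-1)d + p$, $p \in \mathcal{N}_d$, $t$ is calculated as $\lceil k/d \rceil$. Also, $\mathbb{E}[V_1] \le \mathbb{E}[\operatorname{tr}(P_{1|0})] \le m \bar{\lambda}_{\Sigma_x}$, thus we have $\mathbb{E}[\operatorname{tr}(P_k)] \le \max(\bar{\lambda}^{d-1}_{A^\top A}, \bar{\lambda}^{0}_{A^\top A}) \varrho^{\lceil k/d \rceil - 1} m \bar{\lambda}_{\Sigma_x} + \zeta_1$.

We then discuss the control part. Similarly, we have $U_k^{-1} \succeq \zeta_{2,1} A^{-1} H_k^{-1} A^{-\top}$. Also, for $k \le T - n + 1$, it holds that

$$\begin{aligned} H_k^{-1} &\succeq \zeta_{2,1} A^{-1} H_{k+1}^{-1} A^{-\top} + B R^{-1} B^\top \succeq \cdots \\ &\succeq \sum_{i=0}^{n-1} \zeta_{2,1}^{i} A^{-i} B R^{-1} B^\top (A^\top)^{-i} \overset{(ii)}{\succeq} \frac{\alpha_2 \zeta_{2,1}^{n-1}}{\lambda_R} I_m. \end{aligned}$$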


For (ii), it is known that there exists a positive constant $\alpha'_2$ such that $\mathcal{C}\mathcal{C}^\top \succeq \alpha'_2 I_m$, and $\alpha_2$ is calculated as

$$\alpha_2 = \begin{cases} \alpha'_2 / \bar{\lambda}^{n-1}_{A^\top A}, & n \ge 2, \\ \alpha'_2, & n = 1. \end{cases} \tag{33}$$

Based on (13), it holds that $\lambda_Q I_m \preceq U_k \preceq \big(\bar{\lambda}_Q + \frac{\lambda_R \bar{\lambda}_{A^\top A}}{\alpha_2 \zeta_{2,1}^{n-1}}\big) I_m$, then we have

$$\begin{aligned} \Theta_k &= A^\top U_k B (B^\top U_k B + R)^{-1} B^\top U_k A \\ &\preceq A^\top U_k B (\lambda_Q \lambda_{B^\top B} + \lambda_R)^{-1} B^\top U_k A \preceq \zeta_2 I_m. \end{aligned}$$

Since $\Theta_\infty$ is the steady-state value of $\Theta_k$, $\Theta_\infty \preceq \zeta_2 I_m$ holds. Finally, as $\mathbb{E}(\operatorname{tr}(\Theta_\infty P_k)) \le \bar{\lambda}_{\Theta_\infty} \mathbb{E}[\operatorname{tr}(P_k)]$, (31) is directly obtained. The proof is completed. ■

Since $\varrho < 1$, the bound on the right-hand side of (31) is finite as $k \to \infty$. Also, the bound is nonincreasing as $\varrho$ decreases. This theoretically describes the effect of the observability probability guarantee on overall performance, and justifies the use of (26) as a target. In essence, Theorem 2 reflects the comprehensive effects of controllability, observability, and transmission design on overall performance. Specifically, the control-relevant $\Theta_k$ is shown to be bounded based on the assumptions of system controllability and reliable control action transmission. Due to imperfect sensing data transmission, a sufficient $\bar{\phi}$ is needed to ensure the observability probability. Then, based on the sensing performance analysis under the observable condition, it is ultimately guaranteed that the sensing-relevant $\operatorname{tr}(P_k)$ is bounded in a probabilistic sense. Compared with [5], [36], we introduce the control part and further consider the active transmission design in the sensing part. Besides, the rationality of the involved requirement for $A$ to be invertible is well-discussed in [38].

C. DRL-Based Transmission Scheduling Method

Based on the state, reward, and action space designed above, the MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \beta_d)$ is constructed as follows:

• The state space $\mathcal{S} \triangleq \mathbb{R}^{m \times m} \times \{0, \frac{1}{M_n}, \cdots, \frac{M_n - 1}{M_n}\}$ consists of the sensing outcome and the network state. The state $s_k$ at time period $k$ is as defined in (20).

• The action space $\mathcal{A}$ as defined in (30) is the combination of all possible scheduled transmission numbers for each sensor. The action at time period $k$ is $a_k \triangleq \phi_k$.

• The transition probability $\mathcal{P}[s' \mid s, a]$ consists of two components of sensing outcome transition and network state transition. Based on the independence of the two components and conditional probability properties, $\mathcal{P}$ is expressed as

$$\begin{aligned} \mathcal{P} &\triangleq \mathbb{P}\Big[s_{k+1} = \Big(\frac{P'}{\bar{\lambda}_{\Sigma_x}}, \frac{\varepsilon'}{M_n}\Big) \,\Big|\, s_k = \Big(\frac{P}{\bar{\lambda}_{\Sigma_x}}, \frac{\varepsilon}{M_n}\Big), a_k = a\Big] \\ &= \mathbb{P}[P_{k+1|k} = P' \mid P_{k|k-1} = P, \Xi_k = \varepsilon', a_k = a] \cdot \mathbb{P}[\Xi_k = \varepsilon' \mid \Xi_{k-1} = \varepsilon], \end{aligned}$$

where the second term is directly calculated as $E_{\varepsilon,\varepsilon'}$ by (3). Denoting the mapping $P_{k|k-1}, \Gamma_k \to P_{k+1|k}$, $\forall k$ determined by (7), (9) as $f_s(\cdot, \cdot)$, we have

$$\mathbb{P}[P_{k+1|k} = P' \mid P_{k|k-1} = P, \Xi_k = \varepsilon', a_k = a] = \begin{cases} \prod_{j=1}^{N} \mathbb{P}\big[[\Gamma_k]_{j,j} = [o]_j \mid \Xi_k = \varepsilon', [\phi_k]_j = [a]_j\big], & \text{if } P' = f_s(P, o), [o]_j \in \{0, 1\}, \\ 0, & \text{otherwise,} \end{cases}$$

where probability $\mathbb{P}[[\Gamma_k]_{j,j} = [o]_j \mid \Xi_k = \varepsilon', [\phi_k]_j = [a]_j]$ is calculated as in (4).

• The reward is as defined in (21).

• $\beta_d$ is the discount factor with $\beta_d \in (0, 1)$.

In this MDP, the reward should be formally denoted as $\mathbb{E}[r_k \mid s_k = s, a_k = a]$, which depends on $s_k$ and $a_k$, and here we use the real outcomes of $P_k$ instead to avoid calculating the expectation. Since the DRL framework is adopted, the transition probability (matrix $E$ involved) is allowed to be unknown, and the specific expression of $\mathcal{P}$ is given to clarify the Markov properties of the considered problem. Moreover, the curse of dimensionality is effectively overcome compared with traditional policy and value iteration methods.

For the constraints, it is direct to obtain that $\mathcal{D}_v = \{\{\phi_k\} \mid \phi_k \in \mathcal{A}, \forall k\}$, which is already addressed by the action space construction. Then, to better regulate the control performance, we design the constraint $\mathcal{D}_s$ evaluated by $\operatorname{tr}(P_k)$ as follows:

$$\mathcal{D}_s = \Big\{ \{\phi_k\} \,\Big|\, \mathbb{E}\Big[ \sum_{k=1}^{T} \beta_d^{k-1} \mathbb{1}\{\operatorname{tr}(P_k) > \bar{b}\} \Big] \le T_c \Big\}, \tag{34}$$

where $T_c$ is the maximum number of time periods allowed for $\operatorname{tr}(P_k)$ to be greater than the threshold $\bar{b}$. That is, we expect to limit the time that the covariance matrix is outside the desired region $\{P_k \mid \operatorname{tr}(P_k) \le \bar{b}\}$ [24], [39]. Here, we take into account that $\operatorname{tr}(P_k)$ quantifies the mean square estimation error, i.e., $\operatorname{tr}(P_k) = \mathbb{E}[\|e_k\|^2]$, and it directly affects the control cost (18).

To solve the considered MDP with DRL methods, we introduce the following discounted cost form [21], which is a common setting in DRL and approximates the effect of using the original cost function (18):

$$\text{Problem 2}: \quad \min_{\pi \in \Pi_{RS}} \lim_{T \to \infty} \mathbb{E}\Big[ \sum_{k=1}^{T} \beta_d^{k-1} c_k \Big] \tag{35}$$

$$\text{s.t.} \quad \{\phi_k\} \in \mathcal{D}_s. \tag{36}$$

Here, $\Pi_{RS}$ is our concerned randomized stationary (Markov) policies set [30]. $\pi(\cdot \mid s; \theta_a)$ indicates the probability distribution of the scheduled transmission numbers given current state $s$, which is parameterized with a vector $\theta_a$. Considering the imposed constraint (36), we introduce the Lagrangian for Problem 2 with dual variable $\nu \in \mathbb{R}$:

$$L(\theta_a, \nu) = \mathbb{E}\Big[ \sum_{k=1}^{T} \beta_d^{k-1} (c_k + \nu \omega_k) \Big], \tag{37}$$

where $\omega_k = \mathbb{1}\{\operatorname{tr}(P_k) > \bar{b}\} - (1 - \beta_d) T_c$, and denote $c_k + \nu \omega_k$ as $\tilde{c}_k$. The corresponding dual function is as follows:

$$H(\nu) = \min_{\theta_a} L(\theta_a, \nu). \tag{38}$$
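As an illustration of the Lagrangian (37), the following minimal Python sketch evaluates the discounted sum of $\tilde{c}_k = c_k + \nu \omega_k$ along one sampled trajectory. The function name and argument names are hypothetical and not from the paper; the single-trajectory sum is only a sample estimate of the expectation in (37).

```python
def lagrangian_estimate(costs, trace_P, nu, beta_d, b_bar, T_c):
    """Single-trajectory estimate of L(theta_a, nu) in (37).

    costs[k-1]   : stage cost c_k for k = 1..T
    trace_P[k-1] : tr(P_k) for k = 1..T
    All names are illustrative assumptions.
    """
    total = 0.0
    for k, (c_k, tr_k) in enumerate(zip(costs, trace_P)):
        # omega_k = 1{tr(P_k) > b_bar} - (1 - beta_d) * T_c
        omega_k = (1.0 if tr_k > b_bar else 0.0) - (1.0 - beta_d) * T_c
        # enumerate starts at 0, so beta_d**k equals beta_d^(k-1) for k = 1..T
        total += beta_d ** k * (c_k + nu * omega_k)
    return total
```

With $\nu = 0$ the estimate reduces to the discounted cost in (35); a larger $\nu$ penalizes time periods in which $\operatorname{tr}(P_k)$ exceeds $\bar{b}$ more heavily.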


Then, the dual optimization problem is to maximize the dual function with respect to dual variable $\nu$:

$$\max_{\nu \ge 0} H(\nu) = \max_{\nu \ge 0} \min_{\theta_a} L(\theta_a, \nu). \tag{39}$$

It can be seen that for $\nu$ fixed, the inner optimization (38) can be solved with standard RL frameworks by setting the reward to $\tilde{r}_k = -\tilde{c}_k$. The outer optimization (39) is convex since the maximization objective function is concave and the constraint set is convex.

To avoid dealing with the potentially huge discrete action space, we introduce a virtual continuous action $\breve{a}_k \in (0, 1)^N$ and the transformation $[a_k]_j = \lfloor [\breve{a}_k]_j ([\bar{\phi}]_j + 1) \rfloor$, $\forall j \in \mathcal{N}_N$ to obtain the scheduling result. Then, we incorporate the state-of-the-art PPO method [40] into the primal-dual framework, which effectively generates the virtual action $\breve{a}_k$ and has a simplified structure. In PPO, the actor network outputs the mean and the standard deviation of a Gaussian distribution from which $\breve{a}_k$ is sampled. The critic network denoted as $V_c(s; \theta_c)$ estimates the state-value function. To ensure that $[\breve{a}_k]_j$ falls within $(0, 1)$, the last layer of the mean network uses a sigmoid activation function, and the sampled $\breve{a}_k$ is truncated between 0 and 1. The action transformation can essentially be regarded as part of the complete policy and does not affect the domains of $r_k$ and $\omega_k$. The flow from $s_k$ to the scheduling result is illustrated in Fig. 4.

Fig. 4. Scheduling result generation flow.

Based on the trajectory $\{s_1, \breve{a}_1, \tilde{r}_1, \cdots, s_T, \breve{a}_T, \tilde{r}_T, s_{T+1}\}$ collected at the ECN in each episode, the generalized advantage estimation (GAE) [40] is as follows:

$$\Psi_k = \sum_{t=k}^{T} (\beta_d \beta_g)^{t-k} \delta_t, \tag{40}$$

where $\beta_g$ is the GAE parameter and $\delta_t = \tilde{r}_t + \beta_d V_c(s_{t+1}; \theta_c) - V_c(s_t; \theta_c)$ is the temporal difference error. By extracting a mini-batch of $M_b$ transitions $\{s_{k_t}, \breve{a}_{k_t}, \tilde{r}_{k_t}, s_{k_t+1}\}_{t=1}^{M_b}$, the loss functions for optimizing the actor and critic networks are

$$L_a(\theta_a) = -\frac{1}{M_b} \sum_{t=1}^{M_b} \min\big(\iota_{k_t}(\theta_a) \Psi_{k_t}, f_{p,\epsilon}(\iota_{k_t}(\theta_a)) \Psi_{k_t}\big), \tag{41}$$

$$L_c(\theta_c) = \frac{1}{M_b} \sum_{t=1}^{M_b} \delta_{k_t}^2, \tag{42}$$

where the probability ratio is $\iota_k(\theta_a) = \frac{\breve{\pi}(\breve{a}_k \mid s_k; \theta_a)}{\breve{\pi}(\breve{a}_k \mid s_k; \theta_a^{old})}$, $\breve{\pi}(\breve{a}_k \mid s_k; \cdot)$ is the probability that $\breve{a}_k$ is sampled under $s_k$, and $\theta_a^{old}$ is the actor network parameter before the update. The clip function is $f_{p,\epsilon}(x) = \max(\min(x, 1 + \epsilon), 1 - \epsilon)$. Then, with $\eta_a$ and $\eta_c$ as learning rates, the parameters of the actor and critic networks are updated $M_p$ times in one episode by gradient descent. The dual variable $\nu$ is updated as follows:

$$\nu_{p+1} = \big[\nu_p + \eta_\nu \widehat{\nabla}_\nu L(\theta_a, \nu_p)\big]_+, \tag{43}$$

where $\eta_\nu$ is the learning rate, $\nu_{p+1}$ is the dual variable after episode $p \in \mathcal{N}_{M_l}$, and $\widehat{\nabla}_\nu L(\theta, \nu) = \sum_{k=1}^{T} \beta_d^{k-1} \omega_k$ is the approximate gradient using the sample trajectory. The whole process of DTSM is provided in Algorithm 1.

Algorithm 1 DTSM
Input: $\varrho$, $\bar{b}$, $T_c$, involved system and learning process parameters;
Output: $\{\phi_k\}$, $\{u_k\}$, and optimized policy $\pi$;
/* Time slot reservation */
1  for $\varepsilon = 0, 1, \cdots, M_n - 1$ do
2      Initialize $\bar{\phi}^\varepsilon = 0_N$;
3      Set $\bar{\phi}^\varepsilon$ according to (27) for $i = 1, 2, \cdots, l$;
4      while $f_g(\bar{\phi}^\varepsilon, \varepsilon) < 1 - \varrho/\bar{\lambda}^d_{A^\top A}$ do
5          Set $\bar{\phi}^\varepsilon$ according to (28);
6      end
7  end
8  Set $\bar{\phi}$ according to (29);
/* Dynamic transmission scheduling */
9  Set $\mathcal{D}_v$ and $\mathcal{D}_s$ according to $\bar{\phi}$, $\bar{b}$, and $T_c$;
10 Randomly initialize $\theta_a$ and $\theta_c$, and initialize $\nu = 0$;
11 for episode $p = 1, 2, \cdots, M_l$ do
12     Initialize $s_1 = (\Sigma_x/\bar{\lambda}_{\Sigma_x}, \Xi_0/M_n)$;
13     for $k = 1, 2, \cdots, T$ do
14         Infer $\phi_k$ using current $\pi(\cdot \mid \cdot; \theta_a)$ and deliver it;
15         Collect $y_k$ from the field sensors and perform estimation (7), (8);
16         Calculate $u_k$ using (12) and deliver it;
17         Perform estimation (9), (10), and collect sample $\{s_k, \breve{a}_k, \tilde{r}_k, s_{k+1}\}$;
18     end
19     Compute the GAE sequences according to (40);
20     for epoch $i = 1, 2, \cdots, M_p$ do
21         Extract a mini-batch of $M_b$ transitions from the sample trajectory;
22         Update $\theta_a$ and $\theta_c$ by minimizing (41) and (42);
23     end
24     Update dual variable according to (43);
25 end

In DTSM, the constraint (34) formed by the selected parameters $\bar{b}$ and $T_c$ should at least be satisfied by the policy $a_k = \bar{\phi}$, $\forall k$. $\bar{b}$ and $T_c$ can further be jointly tuned to characterize the required performance constraint. In this way, the policy $a_k = \bar{\phi}$, $\forall k$, whose overall performance is guaranteed via Theorem 2, is a feasible solution in the DRL-based scheduling space. Thus, once the DTSM is well-trained, a policy with desired overall performance that outperforms the benchmark $a_k = \bar{\phi}$, $\forall k$ can be found. Moreover, for online deployment, only a learned actor network is required to support the execution of steps 14-17. Combined with the actual scenario, these steps correspond to procedures ①-④ in Fig. 5, respectively.
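The "Time slot reservation" part of Algorithm 1 (lines 1-8, using (27)-(29)) can be sketched in Python as below. This is a minimal sketch under stated assumptions: `mu[eps][j]` is read as the per-transmission loss probability $\mu_{j,\varepsilon}$ of sensor $j$ under network state $\varepsilon$ (as the argmin in (27) suggests), `target` stands for $1 - \varrho/\bar{\lambda}^d_{A^\top A}$, sensors are indexed from 0, and the function names and the `max_slots` safeguard are hypothetical additions rather than part of the paper.

```python
import math

def f_g(phi, mu_eps, sensor_sets, d):
    """Objective f_g(phi, eps) = [prod_i (1 - prod_{j in S_i} mu_j^phi_j)]^d."""
    prod = 1.0
    for S in sensor_sets:
        all_lost = math.prod(mu_eps[j] ** phi[j] for j in S)
        prod *= 1.0 - all_lost
    return prod ** d

def reserve_slots(mu, sensor_sets, d, target, max_slots=64):
    """Greedy reservation per network state, then element-wise max as in (29)."""
    N = len(mu[0])
    per_state = []
    for mu_eps in mu:
        phi = [0] * N
        # (27): one slot for the best-conditioned sensor of every set S_i
        for S in sensor_sets:
            phi[min(S, key=lambda j: mu_eps[j])] += 1
        # (28): add the slot that increases f_g the most until the target is met
        while f_g(phi, mu_eps, sensor_sets, d) < target:
            if sum(phi) >= max_slots:  # hypothetical guard standing in for M_phi
                raise ValueError("target not reachable within max_slots")
            j_g = max(
                range(N),
                key=lambda j: f_g(phi[:j] + [phi[j] + 1] + phi[j + 1:],
                                  mu_eps, sensor_sets, d),
            )
            phi[j_g] += 1
        per_state.append(phi)
    # (29): combine the network states
    phi_bar = [max(p[j] for p in per_state) for j in range(N)]
    return phi_bar, per_state
```

Ties in the argmin and argmax are broken by the lowest sensor index here, which is one arbitrary choice among the admissible selections in (27) and (28).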

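The per-episode computations of the dynamic scheduling part, namely the action transformation, the GAE (40), and the dual update (43), can be sketched with plain Python lists as follows. The helper names are hypothetical, the actor and critic networks are left out, and the clamp in `to_schedule` is an added safeguard for the boundary value $\breve{a} = 1$ that the paper's open interval $(0, 1)$ excludes.

```python
import math

def to_schedule(a_virtual, phi_bar):
    """Map virtual actions in (0, 1) to transmission numbers:
    [a]_j = floor(a_virtual_j * (phi_bar_j + 1)), clamped to phi_bar_j."""
    return [min(math.floor(a * (pb + 1)), pb) for a, pb in zip(a_virtual, phi_bar)]

def gae(rewards, values, beta_d, beta_g):
    """Advantages Psi_k = sum_{t>=k} (beta_d*beta_g)^(t-k) * delta_t, as in (40).

    rewards has length T; values has length T+1 (critic value of s_1..s_{T+1}).
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + beta_d * values[t + 1] - values[t]  # TD error
        running = delta + beta_d * beta_g * running
        adv[t] = running
    return adv

def dual_update(nu, omegas, beta_d, eta_nu):
    """Projected update (43): nu <- [nu + eta_nu * sum_k beta_d^(k-1) omega_k]_+."""
    grad = sum(beta_d ** k * w for k, w in enumerate(omegas))
    return max(0.0, nu + eta_nu * grad)
```

For example, with $\bar{\phi} = [1, 2, 2]$ a virtual action of $[0.1, 0.5, 0.99]$ maps to the schedule $[0, 1, 2]$, so each sensor's unit interval is split into $[\bar{\phi}]_j + 1$ equal bins.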

Fig. 5. Application of DTSM in the laminar cooling process.

Remark 1: Based on the above constraint value selection, we know that there exists a feasible solution $\pi^*$ achieving the optimal value of Problem 2. One can reasonably restrict the stage cost as $c_k = \min(c_k, M_c)$, where $M_c$ is a large number. Combined with the boundedness of the constraint $\omega_k$, the conditions in [24, Theorem 2] are satisfied. The primal-dual update thus converges to a neighborhood of $\pi^*$ in case (38) is solved exactly, which also implies the constraint satisfaction of the solutions. Although DRL methods are not theoretically guaranteed to obtain the global optimal solution to (38), extensive experience suggests that they can converge to solutions with little suboptimality, and the convergence is preserved under mild assumptions [39].

Remark 2: We discuss the scalability of DTSM from the perspective of computational complexity. For the time slot reservation process, the computational complexity is $O(M_n M_\phi N^2)$, which scales quadratically with the number of sensors $N$. Since both the actor (the mean and standard deviation parts being consistent) and critic networks adopt fully connected structures, the complexity for the input layer in each of them is $O(M_b M_{h,1}(m^2 + 1))$, where $m^2 + 1$ and $M_{h,1}$ are the numbers of the input layer's nodes and the first hidden layer's nodes, respectively [41]. This scales quadratically with the system dimension $m$. Also, denoting the number of the last hidden layer's nodes as $M_{h,-1}$, the complexity for the output layer of the actor network is $O(M_b M_{h,-1} N)$, which scales linearly with the number of sensors $N$.

IV. APPLICATION AND EVALUATION

In this section, the proposed DTSM is applied to the industrial hot rolling process through digital twin (DT) technology. The application scenario, physical model, and system implementation are introduced in detail. Moreover, the effectiveness of the proposed method is fully discussed.

A. Laminar Cooling Scenario and Physical Model

The application scenario is the laminar cooling process in industrial hot rolling. As shown in Fig. 5, the sensing of strip steel temperature and cooling water control are integrated into this IIoT system for desired temperature regulation. This process is extremely important since the mechanical properties of steel, e.g., yield strength, toughness, etc., are directly determined by the cooling curve [42]. Here, multiple temperature sensors such as infrared thermometers deployed in the field device layer aggregate measurements to the ECN and receive scheduling commands from the ECN through wireless transmission. The control actions are sent by the ECN to the actuators, i.e., the water valves, in a wired manner to regulate their opening degrees. More adequate measurement transmission improves temperature estimation accuracy and thus water flow control performance, but this comes at the expense of network resource usage and sensor battery energy consumption. Therefore, it is necessary to adopt our proposed DTSM to improve the overall system performance.

An open thermodynamic system with strip steel surfaces, cooling zone inlet and outlet as boundaries is obtained as shown in Fig. 5. The temperature variation model is then as follows [3]:

$$\frac{\partial x}{\partial t} = \frac{\lambda}{\rho c} \frac{\partial^2 x}{\partial \sigma_z^2} - v_c \frac{\partial x}{\partial \sigma_n}, \tag{44}$$

where $x$ denotes the temperature value. $t$, $\sigma_z$, and $\sigma_n$ are the coordinates along time, thickness, and length, respectively. $v_c$ is the coiling speed. $\lambda$, $\rho$, and $c$ are the thermal conductivity, density, and specific heat capacity of the strip steel, respectively. The boundary conditions are

$$\frac{\partial x}{\partial \sigma_z}\Big|_{\sigma_z = 0} = \frac{h_w}{\lambda}(x - x_w), \quad x|_{\sigma_n = 0} = x_f, \quad \frac{\partial x}{\partial \sigma_z}\Big|_{\sigma_z = \bar{\sigma}_z} = -\frac{h_w}{\lambda}(x - x_w),$$

where $\bar{\sigma}_z$ is the maximum coordinate value along the thickness. $h_w$ is the water cooling heat transfer coefficient depending on the opening degrees of water valves, and $x_w$ is the cooling water temperature. $x_f$ is the inlet temperature.

Then, the finite difference method is used to discretize (44). See the system division in Fig. 5, where the entire thermodynamic system is divided into $\tau_z$ and $\tau_n$ volumes in thickness and length, respectively. Here, $N$ infrared thermometers are uniformly deployed above the $\tau_n$ upper surface volumes. For each upper or lower surface volume, the nozzles above or below it are controlled by an independent regulating valve. Thus, the total number of valves is $2\tau_n$.

Denoting $x_{p,j,k}$ as the temperature of the $(p, j)$-th volume ($1 \le p \le \tau_z$, $1 \le j \le \tau_n$) at time period $k$, the discretized model is as follows:

$$x_{p,j,k+1} = x_{p,j,k} + \vartheta_n (x_{p,j,k} - x_{p,j-1,k}) + \vartheta_z (x_{p+1,j,k} - 2 x_{p,j,k} + x_{p-1,j,k}),$$


where $\vartheta_z = \frac{\Delta_t \lambda}{\Delta_z^2 \rho c}$, $\vartheta_n = \frac{\Delta_t v_c}{\Delta_n}$. $\Delta_z$, $\Delta_n$ are the discretization steps and $\Delta_t$ is the sampling period. The discretized equations at the surface are obtained similarly based on the boundary conditions. Defining $x_k = [(x_k^1)^\top, \cdots, (x_k^j)^\top, \cdots, (x_k^{\tau_n})^\top]^\top$ with $x_k^j = [x_{1,j,k}, \cdots, x_{\tau_z,j,k}]^\top$, the stacked form of the model is

$$x_{k+1} = A x_k + B u_k + B' u'_k + w_k, \tag{45}$$

where $A = A_n + A_z + I_{\tau_n \tau_z}$, $B = I_{\tau_n} \otimes Q_B$. Further, $A_n = (\vartheta_n Q_n) \otimes I_{\tau_z}$, $A_z = I_{\tau_n} \otimes (\vartheta_z Q_z)$, $Q_B = [[1, \cdots, 0]^\top, [0, \cdots, 1]^\top]$, and

$$Q_n = \begin{bmatrix} -1 & 0 & \cdots & \cdots & 0 \\ 1 & -1 & \ddots & & \vdots \\ 0 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -1 \end{bmatrix}, \quad Q_z = \begin{bmatrix} -2 & 2 & 0 & \cdots & 0 \\ 1 & -2 & 1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & 1 & -2 & 1 \\ 0 & \cdots & 0 & 2 & -2 \end{bmatrix}.$$

Besides, $u_k$ is the control action regarding the opening degrees of water valves. $B' = I_{\tau_n \tau_z}$, and $u'_k = [\vartheta_n x_f^\top, 0, \cdots, 0]^\top$ with $x_f \in \mathbb{R}^{\tau_z}$ being the available inlet temperature.

B. System Implementation and Parameter Settings

1) DT System Implementation: The DT system, as illustrated in Fig. 6, is constructed to realize the interaction between the real-world space of high-strength steel production and the twin space integrating DTSM. The DT system is built based on Unity 2020 using the Intel Core Ultra 7 processor, and includes several scenarios such as rough rolling, finishing rolling, and laminar cooling. Based on the actual deployed sensor information and physical mechanisms, the DT system reproduces the running status of the strip. Then in the twin space, by linking the proposed DTSM that is transformed into Python code, the DRL training is realized and the strip cooling effect brought by the designed scheduling and control policies can be evaluated. The well-trained policies are eventually deployed in the ECN to guide actual production. In general, for the self-contained and safety-critical steel manufacturing where it is difficult to directly modify production instructions [29], such a virtual-real interaction framework effectively supports the application of the proposed DTSM at the control level.

Fig. 6. Digital twin system implementation.

2) Parameter Settings: The values of the main parameters are selected as follows. The thermodynamic parameters $\lambda = 40$ W/(m·K), $\rho = 7.9 \times 10^3$ kg/m³, $c = 0.46 \times 10^3$ J/(kg·K) are determined according to [43], the temperature and cooling zone parameters $v_c = 10$ m/s, $\tau_z = 2$, $\tau_n = 4$, $\Delta_z = 5$ mm, $\Delta_n = 5$ m, $\bar{x}_1 = [1123, 1123, 1113, 1113, 1103, 1103, 1093, 1093]^\top$ K, $x_w = 293$ K, $x_f = [1123, 1123]^\top$ K are determined with reference to [42] and [44]. Based on system modeling and properties, $N = 8$, $T = 100$, $m = 8$, $q = 8$, $l = 4$, $\Sigma_x = 20 I_m$, $\Sigma_w = I_m$, $\Sigma_v = I_N$, $Q = 10 I_m$, $R = I_q$ are determined. The desired $z_k$ gradually decreases from $\bar{x}_1$ to $[1023, 1023, 973, 973, 923, 923, 873, 873]^\top$ K over time. Based on the sensor deployment, the observation and sensing matrices are given as $G = I_{\tau_n} \otimes [0, 1]$, $C = [e_1^l, e_1^l, \cdots, e_l^l, e_l^l]^\top$. Thus, we have $S_1 = \{1, 2\}$, $S_2 = \{3, 4\}$, $S_3 = \{5, 6\}$, $S_4 = \{7, 8\}$. It is verified that Assumption 1 holds and $d = 2$, $n = 1$. $\kappa_c$, $\kappa_r$ are subsequently specified according to different performance demands.

For the wireless network between the ECN and the sensors, we consider that the CFP contains 16 slots and the beacon slot is used to generate and broadcast the schedule $\phi_k$. The slot duration is selected as 3.84 ms [45]. The inactive period lasts for 10 slot durations, during which the sensors may turn off their radios to save energy. This period is used in turn for the ECN to perform posterior estimation, control, deliver control commands (through wired communication), and prior estimation, as well as for the field sensors to measure the temperature. The sampling period $\Delta_t$ is consistent with the beacon interval, which is around 100 ms. Two network states $\mathcal{E} = \{0, 1\}$ are considered, and the transition probability matrix is $E = [[0.1, 0.9]^\top, [0.9, 0.1]^\top]$. Under $\varepsilon = 0$, the values of $\mu_{2,0}$, $\mu_{4,0}$, $\mu_{6,0}$, $\mu_{8,0}$ are set higher than $\mu_{1,0}$, $\mu_{3,0}$, $\mu_{5,0}$, $\mu_{7,0}$. Under $\varepsilon = 1$, the values of $\mu_{1,1}$, $\mu_{3,1}$, $\mu_{6,1}$, $\mu_{8,1}$ are set higher than $\mu_{2,1}$, $\mu_{4,1}$, $\mu_{5,1}$, $\mu_{7,1}$. For the learning process, both the actor and critic networks use two hidden layers with 256 nodes, and the Adam optimizer is adopted. Hyper-parameters $\beta_d = 0.95$, $\beta_g = 0.95$, $\epsilon = 0.2$, $M_l = 4000$, $M_p = 10$, $M_b = 100$, $\eta_a = 10^{-5}$, $\eta_c = 10^{-4}$, $\eta_\nu = 10^{-5}$, etc. are selected by referring to [22] and [24].

Fig. 7. Time slot reservation process. (a) Case of $\varepsilon = 0$. (b) Case of $\varepsilon = 1$.

C. Performance Evaluation

1) Time Slot Reservation: Fig. 7 shows the number of reserved time slots $[\bar{\phi}^\varepsilon]_j$ and the value of $f_g(\bar{\phi}^\varepsilon, \varepsilon)$ during the reservation process of DTSM. For clarity, only sensors with non-zero time slots are shown. Under both cases of $\varepsilon = 0$ and $\varepsilon = 1$, the number of time slots reserved for the sensors with better transmission conditions increases alternately. When the overall number is less than 4, $f_g(\bar{\phi}^\varepsilon, \varepsilon)$ are all 0 since $\exists i$, $\prod_{j \in S_i} \mu_{j,\varepsilon}^{[\bar{\phi}^\varepsilon]_j} = 1$. This practical system satisfies that


$\bar{\lambda}_{A^\top A} < 1$, and when $\varrho$ is set close to 1, (26) is trivially satisfied since $1 - \varrho/\bar{\lambda}^d_{A^\top A} < 0$. Here, for a better control effect, $\varrho$ is selected such that $1 - \varrho/\bar{\lambda}^d_{A^\top A} = 0.3$. Then, the reserved results are $\bar{\phi}^0 = [1, 0, 1, 0, 2, 0, 2, 0]^\top$, $\bar{\phi}^1 = [0, 2, 0, 2, 2, 0, 2, 0]^\top$, and finally $\bar{\phi} = [1, 2, 1, 2, 2, 0, 2, 0]^\top$. It can be seen that sensors 6 and 8 are not assigned any time slots, because under both $\varepsilon = 0$ and $\varepsilon = 1$ they have worse transmission conditions in $S_3$ and $S_4$, respectively. This reflects that DTSM effectively saves time slot resources. Besides, the simulation of $\varphi(k, \{\bar{\phi}^\varepsilon\}_{i=1}^{2}, \{\varepsilon\}_{i=1}^{2})$ by the Monte Carlo method is also plotted, which is almost the same as $f_g(\bar{\phi}^\varepsilon, \varepsilon)$. This verifies the lower bound proposed in Theorem 1, and suggests that the real value reaches the bound at this time.

Fig. 8. Normalized average cost and constraint violation comparisons. (a) Case of $\kappa' = 0.1$ without constraint. (b) Case of $\kappa' = 0.6$ without constraint. (c) and (d) Case of $\kappa' = 0.1$ with constraint.

Fig. 9. Tracking performance and transmission cost comparisons in the case of $\kappa' = 0.1$. (a) and (b) Without constraint. (c) and (d) With constraint.

2) Overall Transmission and Control Performance: We compare the proposed DTSM with several other methods: 1) the random policy (RND), which randomly selects an action in $\mathcal{A}$; 2) the control performance first policy (CPF), which adopts $\phi_k = \bar{\phi}$, $\forall k$; 3) the greedy policy on control-aware error covariance (GCEC) [27], which schedules a transmission for the sensor with the minimal $\operatorname{tr}\big(\Theta_\infty (P_{k|k-1}^{-1} + G^\top [C]_{j,:}^\top [\Sigma_v]_{j,j}^{-1} [C]_{j,:} G)^{-1}\big)$. Meanwhile, we conduct ablation studies to better validate the importance of each design consideration: 1) remove the observability-based slot reservation (marked as AS-1), i.e., adopt the intelligent policy with full time slots (IFTS) [14], which performs the scheduling described in Algorithm 1 with all available time slots (uniformly) reserved for the sensors; 2) remove the fine-grained action space partitioning (marked as AS-2), i.e., consider the "On-Off" case where each sensor transmits 0 or $[\bar{\phi}]_j$ times; 3) remove the dual update process (marked as AS-3), where a fixed $\bar{\nu}$ is used to indicate the cost of constraint violation. These methods use the same transmission and control decoupling and optimal control law as in DTSM.

Letting $\frac{\kappa_r}{\kappa_c} = \kappa' \frac{\operatorname{tr}(\Sigma_x \Theta_\infty)}{\sum_{j=1}^{N} [\bar{\phi}]_j}$ with $\frac{\operatorname{tr}(\Sigma_x \Theta_\infty)}{\sum_{j=1}^{N} [\bar{\phi}]_j}$ being the normalization term, we then consider different transmission-control weight combinations by adjusting $\kappa'$. The training process is repeated several times. It can be seen that DTSM converges relatively stably under various settings. First, without considering constraint (34), the normalized average cost comparisons under $\kappa' = 0.1$ and $\kappa' = 0.6$ are shown in Fig. 8(a)-(b). The normalized cost is calculated through $\frac{(\sum_{k=1}^{T} c_k)/T - J_{\min}}{J_{\max}}$, where $J_{\min}$ and $J_{\max}$ are the minimum and maximum average costs of the episodes in all repeated experiments, respectively. It can be seen that in both cases, the performance of DTSM is better than that of RND, CPF, and GCEC. GCEC, designed based on empirical rules, performs relatively well under $\kappa' = 0.6$, but it is finally outperformed by the continuously trained DTSM. For the ablated methods, AS-2 performs worse than DTSM due to abandoning the search for a wider range of solutions. Although DTSM performs slightly worse than IFTS (AS-1) under $\kappa' = 0.1$ (nearly consistent under $\kappa' = 0.6$), it requires 37.5% fewer reserved time slots than IFTS (AS-1). This means the network can support more other applications, or extend the inactive period to save energy.

Taking the case of $\kappa' = 0.1$ as an example to further consider the constraint (34) with $\bar{b} = 13$ and $T_c = \frac{1 - \beta_d^{18}}{1 - \beta_d}$, the normalized average cost and constraint violation comparisons are shown in Fig. 8(c)-(d). Due to the enhanced sensing performance demands, neither GCEC nor RND can meet the constraint here. For DTSM, the learned policy converges to a feasible solution after about 1000 episodes, and it can further optimize the overall performance while satisfying the constraint. The goal of interest here is to ultimately design a scheduling policy that satisfies the constraints without requiring constraint satisfaction during the training process, which matches our experimental results. Besides, although $\bar{\nu}$ is empirically selected in AS-3 to ensure the constraint, it is too conservative to achieve the optimal performance goal. In contrast, $\nu$ is systematically adjusted in DTSM for the optimization goal. In summary, each key design consideration

Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on June 06,2025 at 08:40:48 UTC from IEEE Xplore. Restrictions apply.
JIN et al.: DRL BASED TRANSMISSION SCHEDULING FOR SENSING AWARE CONTROL 10917

TABLE I TABLE II
I NFERENCE T IME OVERHEAD W ITHOUT C ONSTRAINT (C̄) AVERAGE C OST C OMPARISON U NDER D IFFERENT N UMERICAL
AND W ITH C ONSTRAINT (C) S ETTINGS (✓ AND × R EPRESENT C ONSTRAINT
S ATISFACTION AND V IOLATION )

plays a corresponding role, and our proposed DTSM achieves


a better balance between control performance and resource
utilization.
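
The contrast between the fixed ν̄ of AS-3 and the adjusted ν of DTSM can be illustrated with a minimal sketch of a projected dual-ascent step. This is not the update rule of Algorithm 1, which is not reproduced here. The function names, the step size, the per-episode constraint costs, and the assumed constraint form "constraint cost ≤ bound" are all hypothetical and chosen only for illustration.

```python
def dual_ascent_step(nu, constraint_cost, bound, step_size):
    """One projected dual-ascent step on the multiplier nu (assumed form).

    nu grows when the measured constraint cost exceeds the bound and
    shrinks, but never below zero, when the constraint holds with slack.
    """
    return max(0.0, nu + step_size * (constraint_cost - bound))


def penalized_cost(objective_cost, constraint_cost, bound, nu):
    """Lagrangian-style cost that a primal (policy) update would minimize."""
    return objective_cost + nu * (constraint_cost - bound)


if __name__ == "__main__":
    nu = 0.0
    bound = 13.0  # plays the role of b̄ = 13; the constraint form is assumed
    # Hypothetical per-episode constraint costs: violated early, met later.
    episode_constraint_costs = [20.0, 18.0, 15.0, 13.5, 12.0, 11.0]
    for cost in episode_constraint_costs:
        nu = dual_ascent_step(nu, cost, bound, step_size=0.1)
        print(f"constraint cost {cost:5.1f} -> nu = {nu:.2f}")
```

In this toy run the multiplier rises while the constraint is violated and relaxes once it is met, whereas a fixed ν̄ would keep penalizing at the same level throughout, which is one way to read the conservativeness reported for AS-3.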
After the training phase, we adopt the policy learned by DTSM that satisfies the constraint and has the lowest average cost. The comparisons of strip steel temperature ((τz, 1)-th volume), tracking error ∥xk − zk∥2, and transmission cost $\sum_{j=1}^{N} [\phi_k]_j$ are illustrated in Fig. 9. The cases with and without constraints under κ′ = 0.1 are compared. Although Fig. 9(a)-(b) show that DTSM is not optimal in any single aspect of tracking error and transmission cost, combined with the curves in the figure and the overall performance comparison (Fig. 8(a)), it is known that DTSM achieves a desirable tradeoff between these two aspects. Besides, it can be seen from Fig. 9(b) that the transmission number scheduled by DTSM is larger in earlier time periods (partly due to greater estimation uncertainty), which reflects the dynamic property of DTSM. Comparing Fig. 9(c)-(d) with Fig. 9(a)-(b), the control performance is improved after imposing the sensing performance-related constraint, and the consumed transmission cost is also increased. This reflects the effectiveness of constraint (34) in regulating the final control performance.

3) Running Time: For the training phase of DTSM, the time overhead of training steps 19-24 in each episode is around 0.1 s. For the inference time overhead of ϕk in 100 repeated experiments with the learned policy, the median and worst values with and without constraint (34) under κ′ = 0.1 are recorded in Table I. The time overheads of DTSM and IFTS are close, and they are greater than those of other comparison methods due to the forward propagation of neural networks. In general, DTSM satisfies the requirements of practical deployment, since at this time the intelligent transmission scheduling only relies on the well-trained π(·|·; θa) to infer ϕk, and the worst time overhead is less than the selected time slot duration.

4) Robustness Verification: The average costs and constraint satisfactions under different numerical settings of system scales (m, N), weighting factors (κ′), and whether or not constraints are set (C̄, C) are shown in Table II. The system scale varies through different division granularities. The reserved time slots for m = 4 and m = 12 are ϕ̄ = [2, 2, 2, 0] and ϕ̄ = [1, 2, 1, 2, 1, 1, 1, 0, 2, 0, 1, 0], respectively. The other parameters are modified according to the case of m = 8. Here, DTSM and IFTS converge under various settings and the final converged values are recorded. It can be seen that DTSM performs best among the comparison methods (except for IFTS with more reserved slots) in all cases, which effectively verifies the robustness of DTSM. Moreover, the settings of (4, 4, 0, C̄), (8, 8, 0, C̄), and (12, 12, 0, C̄) can be regarded as specific cases for verifying the optimality. Since here the optimization goal only contains the control cost, CPF is the optimal solution under the slot reservation ϕ̄. The optimality of DTSM is then verified because DTSM and CPF perform almost the same.

V. CONCLUSION

The transmission scheduling for sensing-aware control in IIoT systems is investigated in this paper. Taking into account the significance of edge sensing, the novel DTSM is proposed to balance the control performance and transmission cost through network resource reservation and intelligent transmission scheduling. The time slots for measurement transmission are reserved with observability probability as the key metric, and the transmission number of each sensor is further dynamically scheduled adopting the primal-dual DRL framework. The advantages of DTSM in terms of resource utilization and overall performance improvement are demonstrated through the application in the industrial laminar cooling process. The limitations of this work lie in that the overall intelligent scheduling of uplink and downlink transmissions has not been considered, nor has the collaboration among multiple ECNs in the ECL been discussed. Therefore, in future works, we would like to analyze the tradeoff between multi-sensor and multi-actuator data transmissions, and explore dynamic scheduling in a larger scope including the entire ECL based on distributed approaches.

REFERENCES

[1] D. Kozma, P. Varga, and F. Larrinaga, “Dynamic multilevel workflow management concept for industrial IoT systems,” IEEE Trans. Autom. Sci. Eng., vol. 18, no. 3, pp. 1354–1366, Jul. 2021.
[2] Z. Ji, C. Chen, J. He, S. Zhu, and X. Guan, “Learning-based edge sensing and control co-design for industrial cyber-physical system,” IEEE Trans. Autom. Sci. Eng., vol. 20, no. 1, pp. 59–73, Jan. 2023.
[3] C. Chen, L. Lyu, S. Zhu, and X. Guan, “On-demand transmission for edge-assisted remote control in industrial network systems,” IEEE Trans. Ind. Informat., vol. 16, no. 7, pp. 4842–4854, Jul. 2020.
[4] L. Schenato, B. Sinopoli, M. Franceschetti, K. Poolla, and S. S. Sastry, “Foundations of control and estimation over lossy networks,” Proc. IEEE, vol. 95, no. 1, pp. 163–187, Jan. 2007.


[5] D. E. Quevedo, A. Ahlen, and K. H. Johansson, “State estimation over sensor networks with correlated wireless fading channels,” IEEE Trans. Autom. Control, vol. 58, no. 3, pp. 581–593, Mar. 2013.
[6] X. Guan, C. Chen, B. Yang, C. Hua, L. Lyu, and S. Zhu, “Towards the integration of sensing, transmission and control for industrial network systems: Challenges and recent developments,” Acta Autom. Sin., vol. 45, no. 1, pp. 27–38, Jan. 2019.
[7] W. Liu, D. E. Quevedo, Y. Li, K. H. Johansson, and B. Vucetic, “Remote state estimation with smart sensors over Markov fading channels,” IEEE Trans. Autom. Control, vol. 67, no. 6, pp. 2743–2757, Jun. 2022.
[8] C. Hu, X. Xie, S. Ding, and Y. Jing, “Distributed set-membership fusion estimation for complex networks with communication constraints,” IEEE Trans. Autom. Sci. Eng., early access, May 20, 2024, doi: 10.1109/TASE.2024.3401740.
[9] Y. Kan, H. Yang, F. Qu, and Y. Li, “Sensor power control for remote state estimation with historical data re-transmission,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 3, pp. 4058–4069, Jul. 2024.
[10] L. Chen, B. Hu, Z.-H. Guan, L. Zhao, and D.-X. Zhang, “Control-aware transmission scheduling for industrial network systems over a shared communication medium,” IEEE Internet Things J., vol. 9, no. 13, pp. 11299–11310, Jul. 2022.
[11] Y. Wu, Q. Yang, H. Li, K. S. Kwak, and V. C. M. Leung, “Control-aware energy-efficient transmissions for wireless control systems with short packets,” IEEE Internet Things J., vol. 8, no. 19, pp. 14920–14933, Oct. 2021.
[12] K. Gatsis, A. Ribeiro, and G. J. Pappas, “Random access design for wireless control systems,” Automatica, vol. 91, pp. 1–9, May 2018.
[13] T. Shi, P. Shi, and J. Chambers, “Dynamic event-triggered model predictive control under channel fading and denial-of-service attacks,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 4, pp. 6448–6459, Oct. 2024.
[14] Y. Ma et al., “Optimal dynamic transmission scheduling for wireless networked control systems,” IEEE Trans. Control Syst. Technol., vol. 30, no. 6, pp. 2360–2376, Nov. 2022.
[15] Y. Ma et al., “Smart actuation for end-edge industrial control systems,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 1, pp. 269–283, Jan. 2024.
[16] K. Huang, W. Liu, Y. Li, B. Vucetic, and A. Savkin, “Optimal Downlink–Uplink scheduling of wireless networked control for industrial IoT,” IEEE Internet Things J., vol. 7, no. 3, pp. 1756–1772, Mar. 2020.
[17] C. Li, X. Zhao, M. Chen, W. Xing, N. Zhao, and G. Zong, “Dynamic periodic event-triggered control for networked control systems under packet dropouts,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 1, pp. 906–920, Jan. 2024.
[18] S. Luo, L. Zhang, and Y. Fan, “Real-time scheduling for dynamic partial-no-wait multiobjective flexible job shop by deep reinforcement learning,” IEEE Trans. Autom. Sci. Eng., vol. 19, no. 4, pp. 3020–3038, Oct. 2022.
[19] S. Roshanravan and S. Shamaghdari, “Adaptive fault-tolerant tracking control for affine nonlinear systems with unknown dynamics via reinforcement learning,” IEEE Trans. Autom. Sci. Eng., vol. 21, no. 1, pp. 569–580, Jan. 2024.
[20] L. Yang, Y. Xu, Z. Huang, H. Rao, and D. E. Quevedo, “Learning optimal stochastic sensor scheduling for remote estimation with channel capacity constraint,” IEEE Trans. Ind. Informat., vol. 19, no. 3, pp. 2565–2573, Mar. 2023.
[21] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, Mar. 2020, Art. no. 108759.
[22] G. Pang, W. Liu, Y. Li, and B. Vucetic, “DRL-based resource allocation in remote state estimation,” IEEE Trans. Wireless Commun., vol. 22, no. 7, pp. 4434–4448, Jul. 2023.
[23] Z. Zhao, W. Liu, D. E. Quevedo, Y. Li, and B. Vucetic, “Deep learning for wireless-networked systems: A joint estimation-control-scheduling approach,” IEEE Internet Things J., vol. 11, no. 3, pp. 4535–4550, Feb. 2024.
[24] V. Lima, M. Eisen, K. Gatsis, and A. Ribeiro, “Model-free design of control systems over wireless fading channels,” Signal Process., vol. 197, Aug. 2022, Art. no. 108540.
[25] Z. Ji, C. Chen, J. He, S. Zhu, and X. Guan, “Edge sensing and control co-design for industrial cyber-physical systems: Observability guaranteed method,” IEEE Trans. Cybern., vol. 52, no. 12, pp. 13350–13362, Dec. 2022.
[26] X. Wen et al., “Age-of-task-aware co-design of sampling, scheduling, and control for industrial IoT systems,” IEEE Internet Things J., vol. 11, no. 3, pp. 4227–4242, Feb. 2024.
[27] V. Tzoumas, L. Carlone, G. J. Pappas, and A. Jadbabaie, “LQG control and sensing co-design,” IEEE Trans. Autom. Control, vol. 66, no. 4, pp. 1468–1483, Apr. 2021.
[28] L. Zheng, M. Liu, S. Zhang, Z. Liu, and S. Dong, “End-to-end multi-sensor fusion method based on deep reinforcement learning in UASNs,” Ocean Eng., vol. 305, Aug. 2024, Art. no. 117904.
[29] Z. Ji, C. Chen, S. Zhu, Y. Ma, and X. Guan, “Intelligent edge sensing and control co-design for industrial cyber-physical system,” IEEE Trans. Signal Inf. Process. Over Netw., vol. 9, pp. 175–189, 2023.
[30] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov Control Processes: Basic Optimality Criteria, vol. 30. Cham, Switzerland: Springer, 2012.
[31] A. K. Singh and B. C. Pal, “An extended linear quadratic regulator for LTI systems with exogenous inputs,” Automatica, vol. 76, pp. 10–16, Feb. 2017.
[32] R. A. Berry and R. G. Gallager, “Communication over fading channels with delay constraints,” IEEE Trans. Inf. Theory, vol. 48, no. 5, pp. 1135–1149, May 2002.
[33] J. Araújo, M. Mazo, A. Anta, P. Tabuada, and K. H. Johansson, “System architectures, protocols and algorithms for aperiodic wireless control systems,” IEEE Trans. Ind. Informat., vol. 10, no. 1, pp. 175–184, Feb. 2014.
[34] D. E. Quevedo, A. Ahlén, A. S. Leong, and S. Dey, “On Kalman filtering over fading wireless channels with controlled transmission powers,” Automatica, vol. 48, no. 7, pp. 1306–1316, Jul. 2012.
[35] T. Farjam, H. Wymeersch, and T. Charalambous, “Distributed channel access for control over unknown memoryless communication channels,” IEEE Trans. Autom. Control, vol. 67, no. 12, pp. 6445–6459, Dec. 2022.
[36] T. Jin, Y. Ma, Z. Ji, and C. Chen, “Intelligent transmission scheduling for edge sensing in industrial IoT systems,” in Proc. IEEE Global Commun. Conf., Dec. 2023, pp. 7037–7042.
[37] W. Li, G. Wei, D. Ding, Y. Liu, and F. E. Alsaadi, “A new look at boundedness of error covariance of Kalman filtering,” IEEE Trans. Syst. Man, Cybern. Syst., vol. 48, no. 2, pp. 309–314, Feb. 2018.
[38] G. Battistelli and L. Chisci, “Kullback–Leibler average, consensus on probability densities, and distributed state estimation with guaranteed stability,” Automatica, vol. 50, no. 3, pp. 707–718, Mar. 2014.
[39] S. Paternain, M. Calvo-Fullana, L. F. O. Chamon, and A. Ribeiro, “Safe policies for reinforcement learning via primal-dual methods,” IEEE Trans. Autom. Control, vol. 68, no. 3, pp. 1321–1336, Mar. 2023.
[40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347.
[41] B. Zhao and X. Zhao, “Deep reinforcement learning resource allocation in wireless sensor networks with energy harvesting and relay,” IEEE Internet Things J., vol. 9, no. 3, pp. 2330–2345, Feb. 2022.
[42] Y. Zheng, N. Li, and S. Li, “Hot-rolled strip laminar cooling process plant-wide temperature monitoring and control,” Control Eng. Pract., vol. 21, no. 1, pp. 23–30, Jan. 2013.
[43] L. Lyu, C. Chen, S. Zhu, and X. Guan, “5G enabled codesign of energy-efficient transmission and estimation for industrial IoT systems,” IEEE Trans. Ind. Informat., vol. 14, no. 6, pp. 2690–2704, Jun. 2018.
[44] Y. Zheng, S. Li, and X. Wang, “Distributed model predictive control for plant-wide hot-rolled strip laminar cooling process,” J. Process Control, vol. 19, no. 9, pp. 1427–1437, Oct. 2009.
[45] F. Kauer, M. Köstler, T. Lübkert, and V. Turau, “Formal analysis and verification of the IEEE 802.15.4 DSME slot allocation,” in Proc. 19th ACM Int. Conf. Model., Anal. Simul. Wireless Mobile Syst., New York, NY, USA, Nov. 2016, pp. 140–147.

Tiankai Jin (Graduate Student Member, IEEE) received the B.Eng. degree from Southwest Jiaotong University, Chengdu, China, in 2020. He is currently pursuing the Ph.D. degree in control science and engineering with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China.
His current research interests include the co-design of sensing, transmission and control for industrial cyber-physical systems, and the reinforcement learning under network systems.


Cailian Chen (Senior Member, IEEE) received the B.Eng. and M.Eng. degrees in automatic control from Yanshan University, China, in 2000 and 2002, respectively, and the Ph.D. degree in control and systems from the City University of Hong Kong, Hong Kong, SAR, in 2006.
She has been with the Department of Automation, Shanghai Jiao Tong University, since 2008. She is currently a Distinguished Professor. She has authored three research monographs and over 100 refereed international journal articles. She is the inventor of more than 30 patents. Her research interests include industrial wireless networks and computational intelligence and the Internet of Vehicles.
Prof. Chen received the prestigious IEEE Transactions on Fuzzy Systems Outstanding Paper Award in 2008, the IEEE TCCPS Industrial Technical Excellence Award in 2022, and five conference best paper awards. She was awarded the N2Women Star in Computer Networking and Communications in 2022. She won the Second Prize of National Natural Science Award from the State Council of China in 2018, the First Prize of Natural Science Award from The Ministry of Education of China in 2006 and 2016, respectively, and the First Prize of Technological Invention of Shanghai Municipal, China, in 2017. She was honored “National Outstanding Young Researcher” by NSF of China in 2020, “Changjiang Young Scholar” in 2015, and China Young Women Scientists Award in 2023. She has been actively involved in various professional services. She is a Distinguished Lecturer of IEEE VTS. She serves as the Deputy Editor for National Science Open and an Associate Editor for IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY and IET Cyber-Physical Systems: Theory and Applications.

Yehan Ma (Senior Member, IEEE) received the B.S. and M.S. degrees from Harbin Institute of Technology in 2013 and 2015, respectively, and the Ph.D. degree in computer science from Washington University in St. Louis, in 2020. She is currently an Associate Professor with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. Her research interests include the industrial Internet of Things, cyber-physical systems, and edge computing. She broadly investigated techniques and solutions for holistic management of computation, communication, and control in cyber-physical systems.

Xinping Guan (Fellow, IEEE) is currently a Chair Professor with Shanghai Jiao Tong University, Shanghai, China, where he is the Dean of the School of Electronic, Information and Electrical Engineering, and the Director of the Key Laboratory of Systems Control and Information Processing, Ministry of Education of China. Before that, he was the Executive Director of the Office of Research Management, Shanghai Jiao Tong University, and a Full Professor and the Dean of the Electrical Engineering, Yanshan University, Qinhuangdao, China. He has authored and/or co-authored five research monographs, more than 200 articles in peer-reviewed journals, and numerous conference papers. As a Principal Investigator, he has finished/been working on more than 20 national key projects. He is the Leader of the prestigious Innovative Research Team of the National Natural Science Foundation of China (NSFC). His current research interests include industrial network systems, smart manufacturing, and underwater networks.
Dr. Guan is an Executive Committee Member of Chinese Automation Association Council and Chinese Artificial Intelligence Association Council. He received the Second Prize of the National Natural Science Award of China in both 2008 and 2018 and the First Prize of Natural Science Award from the Ministry of Education of China and Municipal of Shanghai, China, for four times. He was a recipient of the “IEEE Transactions on Fuzzy Systems Outstanding Paper Award” in 2008 and the IEEE TCCPS Industrial Technical Excellence Award in 2022. He was honored “National Outstanding Youth” by NSF of China, “Changjiang Scholar” by the Ministry of Education of China, and “State-Level Scholar” of “New Century Bai Qianwan Talent Program” of China.
