0% found this document useful (0 votes)
12 views13 pages

Optimal Transmission Strategy For Multiple Markovian Fading Channels Existence, Structure, and Approximation

Uploaded by

xhx123kv
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views13 pages

Optimal Transmission Strategy For Multiple Markovian Fading Channels Existence, Structure, and Approximation

Uploaded by

xhx123kv
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Automatica 158 (2023) 111312

Contents lists available at ScienceDirect

Automatica
journal homepage: www.elsevier.com/locate/automatica

Optimal transmission strategy for multiple Markovian fading


channels: Existence, structure, and approximation✩

Yong Xu a , Haoxiang Xiang a , Lixin Yang a , Renquan Lu a , , Daniel E. Quevedo b
a
Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control, School of Automation, Guangdong University of
Technology, Guangzhou 510006, China
b
School of Electrical Engineering and Robotics, Queensland University of Technology, Brisbane 4001, Australia

article info a b s t r a c t

Article history: This paper investigates the optimal transmission strategy for remote state estimation over multiple
Received 7 December 2022 Markovian fading channels. A smart sensor is used to obtain a local state estimate of a system,
Received in revised form 25 June 2023 and transmits it to a remote estimator. A new transmission strategy is proposed by co-designing
Accepted 20 August 2023
the channel allocation and the transmission power control. The co-designing problem is modeled as
Available online xxxx
a constrained Markov decision process (CMDP) to minimize the expected average estimation error
Keywords: covariance subject to the energy constraint over an infinite horizon. The CMDP is then relaxed as
Networked control systems an unconstrained Markov decision process (UMDP) using the Lagrange multiplier method. Sufficient
Transmission strategy conditions for the existence of the optimal stationary policy for the UMDP are established to obtain
Remote state estimation the optimal transmission strategy. The structure of the optimal transmission power control policy
Markov decision process
for the UMDP with discounted cost is also elucidated. Taking account of the discrete-continuous
Reinforcement learning
hybrid action space, a parameterized deep Q-network (P-DQN) algorithm is employed to obtain an
approximate optimal policy for the UMDP. Finally, a moving vehicle example is introduced to illustrate
the effectiveness of the developed methods.
© 2023 Published by Elsevier Ltd.

1. Introduction interference and other factors, data packets may be delayed or


even lost during remote transmission. This raises the issue of
Networked control systems (NCSs), wherein information is how to improve remote estimation performance over unreliable
exchanged by a shared wireless communication network, have wireless channels.
received extensive attention in the past two decades (Zhang, Han, NCSs generally use a shared wireless network with limited
& Yu, 2015). NCSs enable many remote tasks, hence play an channel capacity. Although some transmission strategies have
important role in a number of fields, such as factory automation, been proposed to deal with the channel capacity constraint (Leong,
monitoring, space exploration, and so on. In the field of monitor- Ramaswamy, Quevedo, Karl, & Shi, 2020; Wu, Ding, Cheng, &
ing, remote state estimation is a basic supporting technology (Li Shi, 2020; Yang, Xu, Huang, Rao, & Quevedo, 2022), data trans-
et al., 2017), where a wireless sensor observers a physical process, mission with a single channel is often not robust. Therefore, a
and then transmits its measurement or a local state estimate to multi-channel technology is introduced, where the sensor can
a remote estimator. However, due to channel fading, external choose different communication channels at each time instant.
Multi-channel transmissions can maintain the stability of trans-
mission in emergencies such as extreme weather and natural
✩ This work was supported in part by the National Natural Science Foun- disasters (George et al., 2010; Ray & Turuk, 2017). There are
dation of China under Grants (62121004, U22A2044, 62206063), the Local two types of channel allocation protocols for multi-channel. One
Innovative and Research Teams Project of Guangdong Special Support Program
is called fixed channel allocation (FCA), that is, each sensor is
(2019BT02X353), Key Area Research and Development Program of Guangdong
Province (2021B0101410005), and the Natural Science Foundation of Guangdong configured with a fixed channel independently. FCA does not
Province (2021B1515420008). The material in this paper was not presented at require a complex algorithm since it is static (Leong, Dey, Nair,
any conference. This paper was recommended for publication in revised form & Sharma, 2011). However, the FCA does not make full use
by Associate Editor Luca Schenato under the direction of Editor Christos G. of available channels (Soua & Minet, 2015). A more advanced
Cassandras.
∗ Corresponding author. protocol is dynamic channel allocation (DCA), where each sensor
E-mail addresses: [email protected] (Y. Xu), [email protected]
can dynamically select a channel before transmitting data. For
(H. Xiang), [email protected] (L. Yang), [email protected] (R. Lu), instance, Ding, Li, Quevedo, Dey, and Shi (2017) addressed a
[email protected] (D.E. Quevedo). game problem with DCA under denial-of-service (DoS) attacks,

https://doi.org/10.1016/j.automatica.2023.111312
0005-1098/© 2023 Published by Elsevier Ltd.
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

and a Nash Q-learning algorithm was proposed to obtain optimal action space, where the Deep Q-Network (DQN) is a ground
strategies for sensor and jammer. Ni, Leong, Quevedo, and Shi breaking work, and it has been demonstrated to have powerful
(2019) extended the preceding study to an infinite time horizon, decision-making ability, including conquering Atari games and
and a closed-form expression of the optimal pricing policy was Go (Mnih et al., 2013). Recently, DQN was used to solve the com-
obtained. Moreover, Liu (2019) developed the Stackelberg game plicated sensor scheduling problem for remote state estimation
under DoS attack with multi-channel, and specific steps of de- in Leong et al. (2020) and Yang, Rao, Lin, Xu, and Shi (2022). The
signing optimal strategies for both players were given. Similar other type is the continuous action space, where the deterministic
to Ding et al. (2017), Zhang, Gan, Shao, Zhang, and Cheng (2020) policy gradient (DPG) and its deep version DDPG (Lillicrap et al.,
studied multi-hop relay channels in a game-theoretic framework, 2015) are effective algorithms. DDPG is widely applied in the field
and the existence of the equilibrium was proved. Even though the of power system and mobile robot path planning (Khodayar, Liu,
above DCAs provide remarkable performance, most of them focus Wang, & Khodayar, 2020). However, in practical applications, the
on allocating the limited channel bandwidth to multiple sensors. action space may be hybrid with discrete and continuous actions.
However, in practice, a single smart sensor can utilize different For this issue, a novel parameterized deep Q-network (P-DQN)
frequency bands, or say channels, to transmit its data, where the algorithm is introduced to learn a near-optimal policy for the
channels have various Markovian fading characteristics (Adamu, transmission strategy (Xiong et al., 2018).
López-Benítez, & Zhang, 2023). There is a lack of understanding Motivated by the above discussions, this work investigates the
on how to choose the frequency band for the sensor. It is worth optimal transmission strategy for remote state estimation over
noting that stability issues of state estimation with Markovian multiple Markovian fading channels. The major challenge lies
channels are highly non-trivial (Liu et al., 2021), and require in co-designing the power control and channel allocation. This
further studies. corresponds to a hybrid action space, which is difficult to handle.
The sensor’s transmission power has a significant influence The main contributions are summarized as follows.
on the remote estimation performance. Thus, power control has
(1) Co-design of power control and channel allocation: The
become a hot topic in the field of remote state estimation (Pan-
transmission quality is closely related to the channel con-
tazis & Vergados, 2007; Quevedo, Ahlén, & Ostergaard, 2010).
dition and the transmission power. Compared to most ex-
Although the traditional constant power control method is easy
isting works that only considered channel allocation with
to implement, it cannot adapt to a complex environment (Ren,
fixed power or a limited number of power levels (Ding
Wu, Johansson, Shi, & Shi, 2017). By contrast, dynamic methods
et al., 2017; Ni et al., 2019; Zhang et al., 2020), the cur-
can effectively improve the performance and the efficiency of
rent work investigates the co-designing of dynamic chan-
power control (Lin et al., 2016). In recent years, Markov deci-
nel allocation and continuous power control over multiple
sion processes (MDP) have been used in designing a dynamic
Markov fading channels for remote state estimation. The
power controller to optimize the energy consumption. For exam-
goal is to minimize the average infinite-horizon estimation
ple, Nourian, Leong, and Dey (2014) studied optimal transmission
error with energy constraint. The problem is formulated as
energy allocation with imperfect feedback channel and energy
a constrained Markov decision process (CMDP).
harvesting constraints. The strategy was formulated as an MDP,
(2) Existence and structure of the optimal policy: To ad-
and the optimal energy allocation strategy was obtained by the
dress the co-design problem, the CMDP is relaxed as an
dynamic programming technique. Similarly, Li et al. (2016) ad-
unconstrained Markov decision process (UMDP). Then, the
dressed power control with energy harvesting constraints, and
original problem and the relaxed problem are both ana-
a novel approximate solution was developed. Furthermore, Li,
lyzed. The existence of an optimal scheduling policy for
Mehr, and Chen (2019) considered the collaborative power con-
UMDP is proved in Theorem 1, and a saddle point for CMDP
trol for multiple sensors, and a Markov game framework was em-
is verified in Theorem 2. The threshold structure of the
ployed to investigate the optimal transmission strategy for each
optimal scheduling policy for UMDP with discounted cost
sensor. Pezzutto, Schenato, and Dey (2022) explored power allo-
is then established in Theorem 3 using monotonicity and
cation with multi-packet reception capability, and the existence
submodularity concepts.
of a stationary optimal policy was proved based on MDP theory.
(3) Near-optimal policy calculation using P-DQN: If the state
However, the above mentioned works only implemented the
transition probability of CMDP is unknown, the relaxed
power control on a single channel, which motivates researchers
problem can be reformulated as an RL problem. Different
to consider the multi-channel case. Another limitation of most
from traditional RL problems (Leong et al., 2020), the action
works is that they consider the energy constraint or limited
space in this work is hybrid with discrete and continuous
channel capacity, separately. Thus, how to codesign transmission
actions. A novel Parameterized Deep Q-Network (P-DQN)
power control and channel allocation strategy remains largely
algorithm is adopted to tackle this issue, which combines
open.
the advantages of DQN and DDPG. Simulations show that
For an MDP, how to attain its optimal policy has been widely
the P-DQN algorithm outperforms alternative transmission
investigated. If the environmental information is known, the op-
strategies.
timal policy can be obtained by solving the Bellman optimality
equation, using dynamic programming (DP), value iteration (VI) The remainder of this paper is organized as follows. The sys-
and policy iteration (PI) (Hernández-Lerma & Lasserre, 2012). tem model and the problem of interest are given in Section 2.
However, complete knowledge of the environment is often un- Section 3 presents the CMDP formulation and the main results.
available. Reinforcement learning (RL) can cope with this issue. Q- Section 4 describes the P-DQN algorithm to solve the trans-
learning and Sarsa are two typical RL algorithms (Sutton & Barto, mission strategy when transition probabilities are unknown. A
2018), which have been widely applied to deal with transmission numerical example is included in Section 5. Section 6 concludes
scheduling problem (Wu, Ren, Jia, Johansson, & Shi, 2019). Due the paper. Proofs are given in the Appendix.
to the curse of dimensionality, these methods are impractical Notations: R, R+ , N and N+ stand for real numbers, non-
when the state or action space becomes very large. Hence, neural negative real numbers, integers and non-negative integers, re-
networks are introduced to RL, leading to deep reinforcement spectively. Rn×m denotes the set of n × m real matrices. For a
learning (DRL). According to the types of action spaces, DRL can matrix X , X T , ρ (X ) and tr(X ) represent its transpose, spectral
generally be divided into two types. The first type is the discrete radius and trace, respectively. For a positive definite (positive
2
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

Fig. 1. The diagram of the transmission strategy for remote state estimation
over multiple Markovian fading channels.

semi-definite) matrix A, we use the notation A ≻ 0 (A ⪰ 0), and


Sn+ denotes the set of positive semidefinite matrices belonging to
Rn×n . diag {·} indicates a diagonal matrix. [Y ] stands for the cardi-
nality of a discrete set Y . P[·] and P[·|·] represent the probability Fig. 2. Markovian fading channels.
and the conditional probability of an event, respectively, and E[·]
is the expectation of a random variable. For any functions f1 , and
f2 , we define f1 ◦ f2 (X ) ≜ f1 (f2 (X )). For any function f , f t is defined s s s
x̂k+1|k+1 = x̂k+1|k + Kk+1 (y k+1 − C x̂k+1|k ),
as f t (X ) ≜ f ( · · · (f (X ))) and f 0 (X ) ≜ X .
   Pks+1|k+1 = (I − Kk+1 C )Pks+1|k .
t times
s s
For simplicity, we shall use x̂k and Pks to represent x̂k|k and Pks|k
2. Problem formulation in the following analysis. It is convenient to define a Lyapunov
operator f (·) : Sn+ → Sn+ and a Riccati operator g(·) : Sn+ → Sn+ as
2.1. System description
f (X ) ≜ AXAT + Q , (3)
As shown in Fig. 1, transmission design for remote state esti- T
g(X ) ≜ X − XC (CXC + R) T −1
CX . (4)
mation over multiple Markovian fading channels is considered in
this work. The system is described as As a consequence of Assumption 1, the local Kalman filter con-
verges to a steady state exponentially fast (Kailath et al., 2000).
xk+1 = Axk + wk , (1)
Without loss of generality, assume that the local Kalman filter has
where A ∈ Rn×n is a known system matrix, whose spectral radius been in its steady state P s at the initial time instant, that is,
is ρ (A) > 1, i.e., the system is unstable. xk ∈ Rn is the system
state at time k, wk ∈ Rn is the system noise, which is a zero- Pks = P s , ∀k ≥ 0. (5)
mean Gaussian vector with covariance matrix Q ⪰ 0. Moreover, s
In (5), P is the unique positive semi-definite solution of the
the initial state x0 is also Gaussian with zero mean and covariance equation g ◦ f (X ) = X , which can be computed offline.
matrix P0 ⪰ 0. The measurement y k ∈ Rp observed by a smart
sensor is given by
2.2. Multiple Markovian fading channels
y k = C xk + vk , (2)
As shown in Fig. 2, the sensor transmits the local state esti-
where the measurement noise vk ∈ Rp is a zero-mean Gaussian s
mate x̂k to a remote estimator over a wireless network, which
vector with covariance matrix R ≻ 0, and C ∈ Rp×n is a known
contains m additive white Gaussian noisy (AWGN) channels. The
observation matrix. The noises {wk }, {vk } and the initial state x0
data transmission is subject to channel fading. This can be ex-
are mutually independent. The following assumption is needed
pressed as follows for each of the ith channels with i ∈ M ≜
for the convergence of the Kalman filter located at the sensor.
{1, . . . , m}:
Assumption 1. For the system (1) and (2), the pair (A, Q 1/2 ) is ϕout ,i,k = gi,k ϕin,i,k + ni,k . (6)
controllable, and the pair (A, C ) is observable.
In (6), gi,k is a random complex number, and ni,k is an additive
The smart sensor possesses sufficient computational capacity white Gaussian noise (Goldsmith, 2005). The signals ϕin,i,k , and
to run a Kalman filter to generate a local state estimate. Define ϕout ,i,k are the channel input ⏐ and ⏐2 output, respectively. The channel
the a priori and posterior state estimates and the corresponding
gain is defined as hi,k ≜ ⏐gi,k ⏐ , which takes values in a finite set
estimation error covariances as
H ≜ {h(1) , . . . , h(l) }, that is, hi,k ∈ H, for all i ∈ M. The channel gain
s
x̂k|k−1 ≜ E[xk |y 0 , . . . , y k−1 ], set is assumed to be known. Without loss of generality, assume
s s that there exists a total order in the set H, that is, h(1) > h(2) >
Pks|k−1 ≜ E[(xk − x̂k|k−1 )(xk − x̂k|k−1 )T |y 0 , . . . , y k−1 ],
s
· · · > h(l) . Each channel gain i (i ∈ M) is governed by its own
x̂k|k ≜ E[xk |y 0 , . . . , y k ], m finite state time-homogenous Markov chain (FSMC) with the
s s
Pks|k ≜ E[(xk − x̂k|k )(xk − x̂k|k )T |y 0 , . . . , y k ]. following transition probability:

The associated Kalman filter recursions (Kailath, Sayed, & Hassibi, Pi (h(ȷ) |h(ı) ) ≜ P[hi,k+1 = h(ȷ) |hi,k = h(ı) ],
2000) are given by where the function Pi (·|·) is given a priori. Before proceeding, the
s s
x̂k+1|k = ,
Ax̂k|k following assumptions are presented.

Pks+1|k = APks|k AT Q + , Assumption 2. For multiple Markovian fading channels, the


Kk+1 = Pk+1|k C (CPks+1|k C T
s T
+ R)−1 , following conditions hold:
3
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

(1) The statistical properties of each channel are mutually in-


dependent.
(2) By means of channel reciprocity techniques (Goldsmith,
2005), at any time k, the sensor is aware of all channel gains
hi,k , i = 1, . . . , m.
(3) The channels are block fading, that is, the channel gains are Fig. 3. Correlations.
invariant during a transmission period, and vary between
blocks (Quevedo, Ahlen, & Johansson, 2012).
(4) The channel gains are uncorrelated with the system param- concrete form of the function q(·, ·) depends on the modulation
eters (Sinopoli et al., 2004). scheme (Quevedo, Ahlén, Leong, & Dey, 2012). An example of
Due to the impact of channel fading, the local estimate may be the modulation scheme is the binary phase shift keying (BPSK)
lost during the transmission from the smart sensor to the remote transmission (Proakis & Salehi, 2001), where the packet arrival
estimator. A binary random process γk is introduced to indicate probability is computed by
whether the packet is received successfully or not, that is, (∫ √ )b
hu
1 −t 2 /2
{
1,
s
if x̂k is received error-free at time k, f (hu) = √ e dt ,
γk = (7) 0 2π
0, otherwise (regarded as dropout).
and each packet contains b bits. It is evident that this packet
2.3. Transmission strategy arrival probability model is monotone increasing in the channel
gain and the transmission power.
As shown in Fig. 1, the smart sensor chooses a channel to
s
transmit the local state estimate x̂k to the remote estimator with 2.4. Remote state estimator
certain transmission power. Denote uk ∈ M and pk ∈ P ≜
[0, pmax ] as the selected channel and the transmission power at We assume that the smart sensor is aware of the channel gains
time instant k, respectively. The smart sensor has limited energy hi,k for all i ∈ M before the sensor sends data at each time instant,
such that it cannot use more than pk = pmax to transmit data at which can be achieved by channel reciprocity techniques (Gold-
any time instant. In general, smaller channel gains require higher smith, 2005). Moreover, assume that the remote estimator knows
transmission power levels to achieve a certain packet arrival the system parameters and whether or not the data packet was
probability. Conversely, for very good channels (with large gains), successfully received at each time instant. Therefore, the state es-
small power levels are sufficient, see also Gatsis, Ribeiro, and timate x̂k and its corresponding error covariance Pk are updated as
Pappas (2014) and Quevedo, Østergaard, and Ahlen (2013). This
s
{
observation motivates us to restrict the power levels, depending x̂k , if γk = 1,
on the current channel gain as follows: x̂k = (10)
Ax̂k−1 , if γk = 0,
pk ∈ Pi ≜ {0} ∪ (p[i−1] , p[i] ], if huk ,k = h(i) , (8)
and
where i ∈ L ≜ {1, 2, . . . , l}, and 0 ≤ p[i−1] ≤ p[i] ≤ pmax .
{
P s, if γk = 1,
As illustrated in Section 5, this restriction reduces the compu- Pk = (11)
tational complexity whilst providing good performance. To this f (Pk−1 ), if γk = 0.
end, a strategy is embedded in the smart sensor to govern the Define τk as the holding time, which represents the elapsed time
channel allocation and the power control. The implementation since the last successful reception, i.e.,
of the power control is illustrated in Fig. 3 with p[0] = 0 and
p[l] = pmax . For convenience, write huk ,k as huk , which already τk ≜ min{0 ≤ t ≤ k : γk−t = 1}. (12)
t
contains information of the time index. Define the packet arrival
Hence, the estimation error covariance of the remote estimator
probability as
can be equivalently written as
q(pk , huk ) ≜ P[γk = 1|pk , huk ]. (9)
Pk = f τk (P s ). (13)
We shall adopt the following assumptions with respect to the
Assume that the smart sensor has enough resources to make
packet arrival probability q(·, ·) (Gatsis et al., 2014):
decisions, and the remote estimator sends an ACK of the packet
arrival to the smart sensor through a perfect channel at each time
Assumption 3.
instant. Thus, the smart sensor can compute Pk by (13).
(1) The function q(p̌, hu ) is continuous in p̌. Moreover, q(0, hu )
= 0, ∀u ∈ M, and q(p̌, hu ) > 0, ∀p̌ ∈ P and ∀u ∈ M. 2.5. Problem of interest
(2) The function q(p̌, hu ) is monotone non-decreasing in both
p̌ and hu . The objective of the transmission strategy is to utilize as little
(3) The packet arrival probabilities are conditionally indepen- energy as possible to improve the remote estimation perfor-
dent, that is, mance. According to Assumption 3, the larger pk and huk are,
the larger q(pk , huk ) is, and the estimation performance is im-
P[γk = lk , . . . , γ1 = l1 |p1 , . . . , pk , hu1 , . . . , huk ] proved (Sinopoli et al., 2004). However, due to the coupling
k relationship between the transmission power and the channel

= P[γt = lt |pt , hut ]. gains in (8), the trade-off between them should be considered.
t =1 Thus the following problem is considered in this work.

Example 1. Based on (9), the packet arrival probability is deter- Problem 1. Find an optimal transmission strategy to optimize
mined by the channel gain huk and the transmission power pk . The estimation performance with energy constraint by co-designing
4
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312
∏m
channel allocation and power control, that is, where Ph (h′ |h) ≜ ′
i=1 Pi (hi |hi ). Based on the transition
T −1 probability (16), the next state is only dependent on the
1∑ current state and the current action, hence the state sk is
inf lim sup E [tr (Pk )] ,
{uk ,pk } T →∞ T Markovian.
k=0
T −1
• Cost function: Define the cost function as c(sk , ak ) ≜ ce
1∑ (sk , ak ) + cp (ak ), where ce (sk , ak ) is the estimation cost, which
s.t. lim sup pk ≤ p̄, (14)
T →∞ T is defined as
k=0

where p̄ ∈ [0, pmax ] is the average energy constraint. ce (sk , ak ) ≜ E[tr(Pk )], (17)

and the other term is the power cost, that is,


Remark 1. We consider both maximum energy constraints and
average energy constraints for the smart sensor. Specifically, the cp (ak ) ≜ pk . (18)
smart sensor’s transmission power at each time satisfies the max-
imum energy constraint, i.e., p̄ ∈ [0, pmax ], while the long-term Note that the estimation cost is the expectation of the estimation
transmission power error covariance Pk . The evolution of the CMDP is as follows. At
∑isT −restricted
1
by the average energy constraint,
i.e., lim supT →∞ T1 k=0 p k ≤ p̄. time instant k, the smart sensor observes the state sk , and takes
an action ak . Then, the system moves to a new state sk+1 at the
3. Optimal transmission strategy with known transition prob- next time instant, and the sensor compiles the costs ce (sk , ak ) and
ability cp (ak ).
Define a feasible randomized Markov policy as θ ≜ {θk }∞ k=1 ∈
This section models the optimal transmission problem as a Θ M , where θk is a map from the state sk to the probability
CMDP, which is then relaxed as a UMDP by a Lagrange multiplier distribution over the action space, and Θ M is the set of all such
method. The existence of an optimal stationary policy for the policies. For each policy θ ∈ Θ M , define an expected average trace
UMDP is studied. Finally, the structure of the optimal transmis- of the estimation error covariance and an expected average power
sion power control policy for the UMDP with discounted cost is consumption as
obtained.
T −1
1∑
3.1. Constrained markov decision process formulation Je (θ ) ≜ lim sup E[ce (sk , ak )], (19)
T →∞ T
k=0

This subsection focuses on formulating Problem 1 as a CMDP. T −1


1∑
Each element of the CMDP {S, A, {A(s)|s ∈ S}, P (·|·, ·), c(·, ·)} is Jp (θ ) ≜ lim sup E[cp (ak )]. (20)
T →∞ T
defined. k=0

• State space: Define the state of the CMDP as sk ≜ (τk−1 , hk ), Based on the above analysis, Problem 1 can be expressed as
where τk−1 ∈ N is the holding time, and hk ∈ HM ≜ the following constrained MDP problem, that is, find an optimal
[h1,k , . . . , hm,k ]T is a random vector representing the joint randomized Markov policy θ ∗ satisfying
fading channel gain at time k. Hence, the state space S ≜
CP : inf Je (θ )
N × HM is a Cartesian product of an infinitely countable set θ
and M finite sets. s.t. Jp (θ ) ≤ p̄, (21)
• Action space: The action space of the CMDP is defined as
A ≜ M × P, which is a discrete-continuous hybrid space. where p̄ ∈ [0, pmax ] is the average energy constraint.
Due to the coupling relationship between channel allocation
and power control defined in (8), the feasible action cannot 3.2. Unconstrained Markov decision process
be chosen arbitrarily by the smart sensor. Hence, define a
set A(s) ≜ {(u, phu )|phu ∈ Pi with hu = h(i) , ∀u ∈ M}, which
contains all feasible actions for each state s ∈ S. Clearly, A(s) This subsection utilizes the Lagrange multiplier method to
is a Borel subset of the entire action space A. It is convenient transform the CMDP (21) into an UMDP, i.e.,
to define UP : inf sup Je (θ ) + λ(Jp (θ ) − p̄), (22)
θ λ≥0
K ≜ {(s, a)|s ∈ S, a ∈ A(s)} (15)
where λ ≥ 0 is a Lagrange multiplier. Define a cost function
as the set of feasible state–action pairs, which is a Borel
cλ : K → R as
subset of the set S × A.
• Transition probability: After taking an action, the current cλ (sk , ak ) ≜ ce (sk , ak ) + λ(cp (ak ) − p̄). (23)
state transits to a new state at the next time instant. This
transition is stochastic, so we define P (sk+1 |sk , ak ) : S × Then, problem (22) can be expressed as a saddle point problem,
A × S → [0, 1] as the state transition probability, which T −1
1∑
is expressed as inf sup lim sup E[cλ (sk , ak )]. (24)
θ λ≥0 T →∞ T
P (sk+1 |sk , ak ) k=0

=P (τk , hk+1 |τk−1 , hk , ak ) The general approach to deal with this saddle point problem is
to relax it by fixing λ, and then solving an optimization problem
≜P[τk = τ ′ , hk+1 = h′ |τk−1 = τ , hk = h, ak = (u, p)]
with fixed weight (Iwaki, Wu, Wu, Sandberg, & Johansson, 2017).
⎨ Ph (h |h)q(p, hu ), if τ ′ = 0, ak ∈ A(sk ),


⎪ However, there commonly is a gap between this relaxed problem
= Ph (h′ |h)(1 − q(p, hu )), if τ ′ = τ + 1, ak ∈ A(sk ), and the original one. Thus, the relaxed problem with fixed λ and
the problem (24) are both analyzed in this work, and the proof
0, otherwise,


of the existence of the optimal policy for them will be given in
(16) Section 3.4.
5
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

Remark 2. By introducing Lagrange multipliers, the CMDP is control, the stability of state estimation with Markovian channels
converted into an UMDP, where the original constraints are in- is highly non-trivial to elucidate (Liu et al., 2021). To avoid this
corporated into the objective function as penalty terms. Note that issues, Assumption 4 needs to be satisfied.
these two problems are not exactly equivalent. The UMDP does
not enforce the original constraints explicitly, but incorporates Lemma 2 (Shi, Johansson, & Qiu, 2011). The Lyapunov operator f (·)
them through the Lagrange multipliers as penalties. The solution in (3) has the following properties for 0 ≤ k1 ≤ k2 :
obtained by the UMDP might not satisfy the original constraints
(1) f k1 (P s ) ≤ f k2 (P s ),
exactly, but it provides an approximation that balances the op-
timization objective and the constraint violation (Altman, 1999; (2) tr(f k1 (P s )) ≤ tr(f k2 (P s )).
Bertsekas & Tsitsiklis, 1996).
We consider the relaxed problem (24) with a given λ ≥ 0, that
3.3. Existence of optimal stationary policy is, we seek to find an optimal policy satisfying
T −1
1∑
In the randomized Markov policy θ = {θk }∞ k=1 , the mapping inf lim sup E[cλ (sk , ak )]. (26)
θk to randomly select the action is different at each time instant, θ T →∞ T
k=0
limiting its practical use. By contrast, a stationary policy elimi-
For the optimization problem (26), a different λ corresponds to
nates the need for exploration or random decision-making, which
allows one to act quickly and efficiently. In addition, stationary a different optimization result with trade-off between estima-
policies with a unique mapping are often easier to implement tion performance and power consumption. The existence of the
in practical applications. Therefore, it is of practical significance optimal stationary policy for UP with fixed λ is presented next.
to establish an optimal stationary policy for the MDP, which
facilitates efficient decision-making. Denote Θ S as a set of all Theorem 1. Under Assumption 4, there exists a stationary policy
feasible stationary policies, that is, θ ∗ ∈ Θ S for the problem (26) to solve the following average cost
optimality equation (ACOE):
k=1 |θk = θ̃, ∀k},
Θ S ≜ {θ = {θk }∞ { }
J ⋆+V (s)= min cλ (s, a)+

V (s′ )P (s′ |s, a) . (27)
with P[θ̃ (sk ) = ak ] = 1. In addition, Θ RS is a set of all ran- a∈A(s)
s′
domized stationary policies with a unique mapping from a state
to the probability distribution over the action space. Due to the In (27), J ⋆ is an optimal average cost per stage, and V (s) is a
complexity of the MDP, the optimal stationary policy does not relative value function, which quantifies the value of a particular
always exist. Thus, whether there exists an optimal stationary state relative to a reference state.
policy is a core issue for the MDP. In this subsection, we focus on
The proof is given in Appendix B.
the existence of an optimal stationary policy for the UMDP (24).
For the problem (24), the existence of an optimal saddle point
Some assumption and lemmas are needed.
is summarized as follows.
Assumption 4. The average energy constraint lies in the interval
Theorem 2. Under Assumption 4, there exists a saddle point
p[σ −1] ≤ p̄ ≤ p[σ ] , and there exists at least one channel i such
(θ ∗ , λ∗ ) for the problem (24), where θ ∗ ∈ Θ RS is the optimal ran-
that the channel gains h(1) , . . . , h(l) , the transmission power in
(8), the transition probability Pi (h(ȷ) |h(ı) ), and the average energy domized stationary policy, and λ∗ is the optimal Lagrange multiplier.
constraint p̄ satisfy the following inequality: The proof is given in Appendix C.
Furthermore, we provide a necessary condition of the exis-
∑ ∑
(1 − q(p[ı] , h ))Pi (h |h ) +
(ı) (ı) l (ȷ)
Pi (h |h )l
tence of the optimal policy for problem (21).
h(ı) h(ȷ)
1≤ı<σ σ <ȷ≤l
ϱ Proposition 1. If there exists an optimal policy θ ∗ solving the
+ (1 − q(p̄, h(σ ) ))Pi (h(σ ) |hl ) ≤ . (25)
problem (21), then it satisfies
∥A∥2
Jp (θ ∗ ) = p̄. (28)
Lemma 1. If Assumption 4 holds, then the optimal solution for the
CMDP (21) is bounded, i.e. infθ Je (θ ) < ∞ with the average energy The proof is given in Appendix D.
constraint Jp (θ ) ≤ p̄.
The proof is given in Appendix A. 3.4. Structural results
Assumption 4 offers a sufficient condition for the boundedness
of the one-stage cost ce (s, a) under the constraint (21). Intuitively, Theorem 1 has established the existence of a stationary policy
we construct a stable suboptimal policy, where the sensor uses θ ∗ ∈ Θ S for the problem (26). Based on the ACOE, the closed-form
the maximum available power of each interval in (8) when the optimal policy can be computed by
channel gain is small, and does not transmit data when the
{ ∑ }
channel gain is large. The condition requires that the packet θ ∗ (s) = arg min cλ (s, a)+ V (s′ )P (s′ |s, a)
a∈A(s)
dropout probability with the channel gain h(ı) , 1 ≤ ı < σ , and the s′

transition probability from h(l) to h(ȷ) , σ < ȷ ≤ l are small enough where the function V (·) can be obtained by the relative value
such that the inequality (25) holds. This is because the sensor iteration algorithm. However, this algorithm faces a high com-
does not transmit data when the channel gain is larger than h(σ ) . putational complexity in the case of (infinitely) countable state
Hence, the condition avoids, as much as possible, entering such spaces. In this section, a monotonic structure of the optimal
channel gains. power control policy is obtained for the problem (26) with dis-
counted cost, which reduces the computational complexity. The
Remark 3. Kalman filtering with data packet loss may lead to discounted optimization problem is formalized as follows:
unbounded estimation error covariances (Sinopoli et al., 2004).
Even in the simpler case without channel allocation and power DP : inf vδ (s, θ ), (29)
θ
6
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

Algorithm 1 P-DQN for transmission scheduling


1: Initialize the Q-network Q (s, u, p; ω) and parameter network
p(s; θ ) with weights w and θ .
2: Initialize the target network Q ′ and p′ with weights ω′ ←
ω, θ ′ ← θ.
3: Initialize step sizes {αt , βt }t >0 , action exploration parameter
ϵ , action probability distribution ξ , and replay buffer B with
capacity B.
4: for episode = 1, 2, . . . , E do
5: Initialize state s0 .
6: for t = 0, 1, . . . , T do
7: Obtain the continuous action parameter phu ←
phut (st ; θt ).
8: Choose action at = (ut , phut ) according to the ϵ -greedy
) from distribution ξ (with probability
policy: at (is sampled ) ϵ,
and at = ut , phut , where ut = arg maxQ st , u, phu ; ωt with
u∈M
probability 1 − ϵ .
Fig. 4. The architecture of P-DQN. 9: Execute action at , and observe reward rt and next state
st +1 .
10: Store transition (st , at , rt , st +1 ) into B.
where 11: Randomly sample batch of N transitions
T −1 (sn , an , rn , sn+1 )n∈[N(] in B. (Compute the target value
Eθs [δ k cλ (sk , ak )],

vδ (s, θ ) ≜ lim sup yn = rn + max δ Q sn+1 , u, phu sn+1 ; θt′ ; ωt′ .
) )
T →∞ u∈M
k=0
12: Use {yn , sn , an }n∈[N ] to obtain the stochastic gradients
with the initial state s, and a discount factor 0 < δ < 1. Define ∇ω ℓt (ω) and ∇θ ℓt (θ ). Update the weights by
the discounted value function as vδ∗ (s) ≜ infθ vδ (s, θ ), ∀s ∈ S. To
elucidate the structure, the following definition and lemmas are ωt +1 ← ωt − αt ∇ω ℓt (ωt ) ,
provided, which are crucial in the subsequent analysis. θt +1 ← θt − βt ∇θ ℓt (θt ) .
13: Update the target networks by
Definition 1 (Puterman, 2014). Let X and Y be partially ordered
sets, and g(x, y) a real-valued function on X × Y. The function ω′ ← τ1 ω + (1 − τ1 )ω′ ,
g(x, y) is submodular, if the inequality
θ ′ ← τ2 θ + (1 − τ2 )θ ′ .
g (x1 , y1 ) + g (x2 , y2 ) ≤ g (x2 , y1 ) + g (x1 , y2 ) (30)
14: end for
holds for all x2 ≥ x1 and y2 ≥ y1 with x1 , x2 ∈ X, and y1 , y2 ∈ Y. 15: end for

Lemma 3 (Puterman, 2014). Suppose that g(x, y) is a submodular


function on X × Y, and for each x ∈ X, miny∈Y g(x, y) exists. Then,
arg miny∈Y g(x, y) is monotone non-decreasing in x. δ tends to 1, the average cost problem (26) and the discounted
cost problem (29) are equivalent by Abel’s theorem (Leong et al.,
2020). In most cases, the packet arrival probability q(pk , huk ) in (9)
Lemma 4. The discounted value function vδ∗ (s) is non-decreasing
is unknown, which means that the smart sensor cannot obtain
in τ for all h ∈ HM .
the state transition probability P (sk+1 |sk , ak ) of CMDP in (16).
The proof is given in Appendix E. This section explores how to learn a near-optimal policy with
The structural results of the optimal transmission strategy for unknown transition probability for the relaxed problem (29). A
the discounted problem (29) is stated as follows. novel P-DQN algorithm (Xiong et al., 2018) that combines the
advantages of DQN and DDPG is employed to deal with the
Theorem 3. Given a fixed channel state h and a channel gain hu , the discrete-continuous hybrid action space. Firstly, problem (29) is
optimal transmission power control policy of the discounted problem converted into a discounted cost problem as
(29) is non-decreasing in the holding time τ .
T −1

The proof is given in Appendix F. inf lim sup δ k E[tr (Pk ) + λ(pk − p̄)]. (31)
Thanks to the monotonic structure of the optimal power con- θ T →∞
k=0
trol policy, one only needs to compute and store the function
from the stopping time to the transmission power for each chan- Besides, an equivalent maximization problem is given as
nel gain. This significantly reduces the computational complexity. T −1

The optimal policy can be implemented online simply by one-to- sup lim inf δ k E[r(sk , ak )], (32)
one correspondence. Specialized algorithms can be developed to θ T →∞
k=0
approximate this specific function, which is more efficient than
the relative value iteration algorithm. where r(sk , ak ) ≜ −[tr (Pk ) + λ(pk − p̄)] is the so-called reward
function.
4. Approximate optimal policy with unknown transition prob- The optimization problem (32) is standard within the RL
ability framework. We introduce the P-DQN algorithm, whose architec-
ture is depicted in Fig. 4. At each iteration, the new state is fed
The discounted cost problem can be solved by deep reinforce- into the parameter network p(s; θ ). The latter then exports the
ment learning techniques. In addition, when the discount factor corresponding optimal continuous actions (i.e., parameters) for all
7
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

actions in the discrete action space. These parameters enter into Table 1
the Q-network Q (s, u, p; ω) along with the state, and the optimal The hyperparameters of P-DQN algorithm.

discrete-continuous action pairs are exported. The loss functions Hyperparameters Value
of these networks are defined as Max steps of each episode T 200
1[ ( ]2 Number of episode E 100
ℓ t (ω ) ≜ Q st , ut , phut ; ω − yt ,
)
2 Replay memory size B 10000
M
∑ Batch size N 128
ℓ t (θ ) ≜ − Q st , u, phu (st ; θ) ; ωt , Discounted factor δ
( )
(33) 0.99
u=1 Action exploration parameter ϵ 0.1
Learning rate of Q-network α 0.001
where yt = rt + maxu∈M δ Q st +1 , u, phu st +1 ; θt ; ωt is a tar-
( ( ′
) ′
)
Learning rate of Parameter network β 0.0001
get value. Q (s, u, p; ω′ ) and p(s; θ ′ ) are target networks, which Soft update factor of Q-network τ1 0.1
are duplicates of the Q-network and the parameter network. Soft update factor of Parameter network τ2 0.001
They utilize the ‘‘soft’’ update technology to update the weights, Number of hidden layer of each network 1
i.e., ω′ ← τ1 ω + (1 − τ1 )ω′ , θ ′ ← τ2 θ + (1 − τ2 )θ ′ , which im- Number of nodes of hidden layer 64
proves convergence of the learning process (Lillicrap et al., 2015).
Moreover, the weights of Q (s, u, p; ω) and p(s; θ ) are updated by
stochastic gradient methods with the gradients computed in (33).
In addition, similar to DQN, the replay buffer is used to reduce
the correlation between samples and improve the utilization of
samples. The entire execution process of P-DQN is summarized
in Algorithm 1.

Remark 4. Algorithm 1 is updated in two timescales, i.e., the pa-


rameter ωt is updated with step size αt , while θt is updated with
step size βt . In particular, the update process of ωt is much faster
than θt . Based on the stochastic approximation theory (Borkar,
2009), these step ∑sizes are required
∑∞ to 2satisfy the Robbins–Monro
∞ ∑∞
∑∞ 2 i.e., t =0 αt = ∞, t =0 αt < ∞, and t =0 βt = ∞,
conditions,
t =0 βt < ∞. Fig. 5. The average rewards over each episode of P-DQN algorithm and other
strategies in the case with one channel.
5. Numerical simulation of vehicle moving

In this section, we consider a point in 2-D space as a moving and the covariance matrix of the zero-mean Gaussian noise vk
vehicle (Gupta, Chung, Hassibi, & Murray, 2006), which is em- is R = diag {2.3, 0.1}. Assume that the channel gain set is H =
ployed to verify the effectiveness of the P-DQN algorithm in the {0.3, 0.2, 0.1}, and the channel gains satisfy the following one-
transmission strategy design. Denote the positions of the vehicle step state transition probability:
y y
on horizontal and vertical axes as sxk , sk , and velocities as vkx , vk .
0.5 0.3 0.2
[ ]
The vehicle dynamics is expressed as:
Ph = 0.4 0.3 0.3 .
sk+1 = sxk + ∆t vkx , 0.3 0.4 0.3
⎧ x


⎨ sy = sy + ∆ t v y ,
The power space of the smart sensor is P = [0, 10]. We assume

k+1 k k
⎪ vkx+1 = ax vkx , that the channel gains {0.3,0.2,0.1} correspond to the available
power space [0, 3.33], (3.33, 6.66], (6.66, 10], respectively. The

⎩ y

vk+1 = ay vkx , smart sensor uses binary phase shift keying transmission with
where ∆t = 0.02 is the sampling period, ax = ay = 0.99 are b = 4 bits per packet, which is unknown environmental informa-
constant accelerations of the vehicle. The state of the vehicle can tion. The Lagrange multiplier is set to be λ = 1.9, and the average
y y energy constraint is set to be p̄ = 7. The hyperparameters of the
be defined as xk ≜ [sxk , sk , vkx , vk ]T . The state space model is of the
form P-DQN algorithm are listed in Table 1.
The P-DQN algorithm is employed to learn a near-optimal
1.00 0.00 0.02 0.00
⎡ ⎤
transmission strategy for three cases with different channel num-
⎢ 0.00 1.00 0.00 0.02 ⎥
xk+1 =⎣ x + wk , bers. The average rewards over each episode are plotted in
0.00 0.00 0.99 0.00 ⎦ k Figs. 5–7. These are compared with the following strategies:
0.00 0.00 0.00 0.99
(1) Random strategy: the channel allocation and the power
where the zero-mean Gaussian noise process wk has been in- control are all uniformly random at each time instant.
cluded to model uncertainties. The covariance matrix Q is chosen
as: (2) Greedy power strategy: the channel allocation is random,
0.3200

0.0000 0.0040 0.0060
⎤ and the power is selected as a maximum value within the
⎢ 0.0000 0.3200 0.0000 0.0004 ⎥ selected range.
Q =⎣ .
0.0040 0.0000 0.0200 0.0000 ⎦ (3) Greedy channel strategy: the channel with the largest chan-
0.0060 0.0004 0.0000 0.0200 nel gain is selected at each time instant, and the power
There is a smart sensor observing the vehicle states, and the control is random.
measurement is described as (4) Greedy strategy: the channel with the largest channel gain
[ ]
1 0 1 0 is selected at each time instant, and the power is selected
yk = xk + vk , (34)
0 1 0 1 as a maximum value within the selected range.
8
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

Fig. 6. The average rewards over each episode of P-DQN algorithm and other
strategies in the case with two channels.

Fig. 8. The monotonic structure of the optimal transmission power policy for
fixed channel gain. (a) h = 0.1, (b) h = 0.2, (c) h = 0.3.

Fig. 7. The average rewards over each episode of P-DQN algorithm and other
strategies in the case with three channels.

The algorithms are compared to single channel transmis-


Fig. 9. The average rewards over each episode of P-DQN algorithm compared
sions, where the sensor only needs to determine the transmis-
to a fixed channel without power restriction.
sion power. Hence, P-DQN algorithm degenerates into the tra-
ditional DQN algorithm, and the greedy channel strategy is the
same as random strategy, and the greedy strategy is the same
as greedy power strategy. As shown in Fig. 5, the P-DQN algo- 6. Conclusion
rithm shows the best performance from the beginning, and soon
exhibits stationary behavior. This paper has investigated transmission strategies for re-
The algorithms are compared in scenarios with multiple mote state estimation over multiple Markovian fading channels.
channels, where the sensor needs to determine the channel and Co-design of dynamic channel allocation and continuous trans-
the transmission power at the same time. From Figs. 6 and 7, the mission power control has been considered. The problem has
P-DQN algorithm can still achieve the best performance through been formulated as a CMDP to minimize the estimation error with
short-time learning, which verifies the effectiveness of the P-DQN average energy constraint, which has been relaxed as a UMDP by
algorithm in the transmission strategy designing. Lagrange multiplier method. The existence of the optimal station-
Once the training is over 100 episodes, the optimal transmis- ary policy for the UMDP has been verified. A structural result that
sion strategy can be extracted by establishes that the optimal power control policy is monotonic
(u∗ , p∗ ) = arg maxQ (s, u, p(s; θ ); ω). (35) in holding time with fixed channel state and channel allocation
u,p has also been obtained. In view of the discrete-continuous hybrid
To verify the structure of the optimal power control policy, as- action space, the P-DQN algorithm has been adopted to learn a
sume that there is only one channel, and the infinite action space near-optimal policy for the UMDP. Finally, an example of moving
P = [0, 10] is discretized as a set with two power levels, i.e., P = vehicle has been employed to illustrate the effectiveness of the
{0, 2}. The policy iteration algorithm is utilized to obtain an developed results.
optimal power control policy over a finite time horizon (Chadès,
Chapron, Cros, Garcia, & Sabbadin, 2014). Fig. 8 plots the optimal Appendix A. Proof of Lemma 1
transmission power p∗ for different channel gains h. It is evident
that there is a monotonic structure for the optimal transmission
To prove Lemma 1, it suffices to make the one-stage cost
power p∗ in the holding time τ , which confirms the analysis.
ce (sk , ak ) = E[tr(Pk )] bounded. Assume that there is only one
To further illustrate the effectiveness of the transmission strat-
channel i to be used. According to Quevedo, Ahlen, and Johansson
egy proposed in this work, we compare it with the optimal power
(2012, Theorem 1), a sufficient condition for the exponential
allocation for a fixed channel (and using all possible transmission
stability of the expected estimation error covariance Pk in norm
power levels). The comparison of average rewards over each
episode of P-DQN algorithm is shown in Fig. 9. The results verifies is given by
that the proposed transmission strategy has a better performance ϱ
sup P(γk = 0|hi,k−1 = h(o) ) ≤ , (36)
than the one of a fixed channel without power restriction. h(o) ∈H ∥A∥2
9
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

for some scalars 0 ≤ ϱ < 1. In other words, if the inequality (36) for some η ≥ 0. In view of the lemma 4.1 in Schäl (1993), we
holds, the estimation error covariance Pk satisfies have
[ ζ −1 ]
E[∥Pk ∥] ≤ ãϱk + b̃, ∀k ≥ 0, (37) ∑
wδ (s) ≤ η + inf E cλ (sk , ak ) , (39)
with some non-negative constants ã and b̃. In the present for- θ
k=0
mulation, the left side of the condition (36) can be computed
for η ≥ 0, and δ < 1. Given η ≥ 0 and a random initial state s0 , we
by
consider a sub-optimal policy, that is, the sensor always transmits
P(γk = 0|hi,k−1 = h(l) ) estimates with power p̃i when the gain of selected channel is
∑ huk = h(i) , until vδ (sN ) ≤ mδ + η is satisfied at some time N.
= P(γk = 0|hi,k = h(ȷ) )Pi (h(ȷ) |h(l) ). Due to the property of exponentially forgetting of the initial state
h(ȷ) ∈H for Kalman filtering, we have N < ∞ with probability 1, and
Assume that the average energy constraint satisfies p[σ −1] ≤ p̄ ≤ E[N ] < ∞. As a consequence of ζ ≤ N, we have E[ζ ] < ∞.
p[σ ] . In the following, a suboptimal policy is adopted, where the Therefore, the inequality in (39) satisfies
transmission power at time instant k is pk = p[ı] if the channel [ ζ −1 ]
gain is hi,k = h(ı) for all ı < σ , and pk = p̄ if hi,k = h(σ ) , and

wδ (s) ≤η + inf E cλ (sk , ak )
pk = 0 if hi,k = h(ȷ) for all ȷ > σ . Under this suboptimal policy, θ
k=0
one has
≤η + E[ζ ](Z + λp̃1 ) < ∞, (40)
P(γk = 0|hi,k−1 = h(l) )
∑ where the second inequality holds by Wald’s equation, and Z is an
= (1 − q(p[ı] , h(ı) ))Pi (h(ı) |h(l) ) upper bound of the trace of the expected error covariance, which
h(ı) exists by Lemma 1. Therefore, condition B is verified.
1≤ı<σ

+ (1 − q(p̄, h(σ ) ))Pi (h(σ ) |h(l) ) +



Pi (h(ȷ) |h(l) ). Appendix C. Proof of Theorem 2
h(ȷ)
σ <ȷ≤l To prove Theorem 2, we first need to verify the following two
Therefore, the inequality (25) is a sufficient condition for the conditions (Altman, 1999, Chapter 12.6):
exponential stability of the expected estimation error covariance C1: ce (s, ·) and cp (·) are continuous on A(s).
C2: The transition probabilities are continuous on A(s), i.e.,
Pk in norm. Since the average transmission power is always less
limn→∞ P (z |s, an ) = P (z |s, a), ∀s, z ∈ S, if an → a for a, an ∈ A(s).
than p̄, and extra channels do not increase the average cost, the
In view of the analysis in Theorem 1, it follows that the
inequality (25) is also sufficient to prove the boundedness of
above two conditions are satisfied. Furthermore, the following
the one-stage cost ce (s, a) under the constraint (21). The proof is
two important conditions need to be verified.
completed.
Growth condition: There exists a sequence of increasing compact
subsets Ki of K such that ∪i Ki = K, and lim infi→∞ {ce (κ ); κ ∈ / Ki }
Appendix B. Proof of Theorem 1
= ∞.
Slater condition: There exists a feasible policy θ such that Jp (θ ) <
Consider the discounted optimization problem (29). For a
p̄.
given λ, the δ -discounted optimal value function is vδ∗ (s). We
A sufficient condition for the growth condition is that, the set
utilize the conditions W and B proposed in Schäl (1993) to prove
Theorem 1. The conditions W are listed as follows. { }
W1: The state space S is locally compact with countable base. s ∈ S : inf ce (s, a) < ℓ (41)
W2: The set of all feasible actions in state s, i.e., A(s) is compact, a∈A(s)

and the mapping s → A(s) is upper semi-continuous. is finite for all ℓ ∈ R (Altman, 1999, Chapter 11.3). Taking account
W3: The transition law P (s′ |·, ·) is weakly continuous in (s, a) ∈ of the estimation error covariance in (13), and the transition
K .1 probability (16), the estimation cost in (17) can be rewritten as
W4: The one-stage cost function cλ (s, a) is lower semi-continuous
(l.s.c.). ce (sk , ak ) =E[tr(hτk (P s ))] = q(pk , huk ) tr(P s )
The condition B is given as + (1 − q(pk , huk )) tr(hτk−1 +1 (P s )), (42)
B: The condition supδ<1 wδ (s) < ∞ holds for s ∈ S, where τ +1
wδ (s) ≜ vδ∗ (s) − mδ is a relative discounted value function, and
s
where tr(P ) is a constant, tr(h (P )) is increasing in τ , and
s

mδ is defined as mδ ≜ infs∈S vδ∗ (s). q(phu , hu ) ∈ [0, 1] is the packet arrival probability. Notice that
Based on the Heine–Borel theorem, the Euclidean space Rn if τ → ∞, then infa∈A(s) ce (s, a) → ∞. Hence, to make the
is locally compact (Johnsonbaugh & Pfaffenberger, 2012). Rn is a inequality infa∈A(s) ce (s, a) < l, ∀l ∈ R hold, a necessary condition
completely separable space, so its topology has a countable base. is τ < ∞. Because HM is the product of M finite sets, the set (41)
Since the state space S is a subset of RM +1 , W1 holds. The sets is finite.
M and Pi , i ∈ {1, . . . , l} are both compact, and the Cartesian The Slater condition holds by Lemma 1. All conditions have
product of two compact sets is still compact. The action space been verified. Hence, the proof is completed.
A(s) = {u, phu } with u ∈ M and phu ∈ Pi is compact. Therefore,
Appendix D. Proof of Proposition 1
conditions W2 and W3 are both verified. Because cλ (s, a) defined
in (23) is continuous, the condition W4 holds.
In Theorem 2, the Slater condition has been proven. Hence, the
Next, we prove the condition B. First define a stopping time as
saddle point (θ ∗ , λ∗ ) satisfies the Kuhn–Tucker condition (Altman,
ζ ≜ inf {k ≥ 0, vδ (sk ) ≤ mδ + η} , (38) 1999, Theorem 12.8), that is
λ∗ (Jp (θ ∗ ) − p̄) = 0. (43)
1 We say P (s′ |·, ·) is weakly continuous in (s, a), if v ′ (s, a) ≜ v (y)P (dy|s, a)

The constraint Jp (θ ) ≤ p̄ is not satisfied if λ = 0. Hence, it is
∗ ∗
S
is continuous and bounded on (s, a) ∈ K for every continuous bounded function
v on S. certain that λ ̸ = 0. In other words, it holds that Jp (θ ∗ ) − p̄ = 0. □
10
Y. Xu, H. Xiang, L. Yang et al. Automatica 158 (2023) 111312

Appendix E. Proof of Lemma 4 which implies that vn+1 (s) is also non-decreasing in τ for all
h ∈ HM . Since vδ∗ (s) = limn→∞ vn (s), Lemma 4 is verified.
The discounted optimal value function vδ∗ (s) can be obtained
by the value iteration algorithm (Puterman, 2014), i.e., vδ∗ (s) = Appendix F. Proof of Theorem 3
limn→∞ vn (s), and
Similar to the average cost problem, the discounted cost opti-
vn+1 (s) = min cλ (s, a) + δ Es′ [vn (s′ )|s, a] ,
{ }
(44) mality equation with fixed u is given as
a∈A(s)

vδ∗ (s) = min cλ (s, p, u) + δ Es′ [vδ∗ (s′ )|s, p, u] ,


{ }
where s = (τ , h), s′ = (τ ′ , h′ ), and Es′ [vn (s′ )|s, a] ≜ s′ vn (s )


p∈Pi
P (s |s, a). Because of this iterative relationship, we use the math-

ematical induction to prove the lemma. First, observe that the with h u = h(i) , i ∈ M, and Es′ [vδ∗ (s′ )|s,
function v0 (s) = 0 is non-decreasing in τ , and assume that vn (s) p, u] ≜ s′ vδ (s )P (s |s, p, u). Define a state–action value function
∗ ′ ′

is non-decreasing in τ for all h ∈ HM . Then, let h be fixed, and with fixed u as
suppose two states s1 ≜ (τ1 , h), s2 ≜ (τ2 , h) with τ1 < τ2 .
Appendix F. Proof of Theorem 3

Similar to the average cost problem, the discounted cost optimality equation with fixed u is given as

v_δ^∗(s) = min_{p∈P_i} { c_λ(s, p, u) + δ E_{s′}[v_δ^∗(s′)|s, p, u] },

with h_u = h^{(i)}, i ∈ M, and E_{s′}[v_δ^∗(s′)|s, p, u] ≜ Σ_{s′} v_δ^∗(s′)P(s′|s, p, u). Define a state–action value function with fixed u as

Q(s, p) ≜ c_λ(s, p, u) + δ E_{s′}[v_δ^∗(s′)|s, p, u].  (49)

The optimal power control policy is to choose the transmission power that minimizes the state–action value function, i.e., p^∗ = arg min_{p∈P_i} Q(s, p) with h_u = h^{(i)}, i ∈ M. We now prove that Q(·, ·) is a submodular function on N × P, that is, given fixed channel condition h and channel allocation u, it holds that

Q(s_1, p_1) + Q(s_2, p_2) ≤ Q(s_1, p_2) + Q(s_2, p_1),  (50)

where s_1 = (τ_1, h) and s_2 = (τ_2, h) with τ_2 ≥ τ_1 and p_2 ≥ p_1. According to the definition in (49), we have

Q(s_1, p_1) + Q(s_2, p_2) − Q(s_1, p_2) − Q(s_2, p_1)
  = c_λ(s_1, p_1, u) + c_λ(s_2, p_2, u) − c_λ(s_1, p_2, u) − c_λ(s_2, p_1, u)
    + δ ( E_{s′}[v_δ^∗(s′)|s_1, p_1, u] + E_{s′}[v_δ^∗(s′)|s_2, p_2, u] − E_{s′}[v_δ^∗(s′)|s_1, p_2, u] − E_{s′}[v_δ^∗(s′)|s_2, p_1, u] ).  (51)

We divide Eq. (51) into two parts for computation. The first part is

c_λ(s_1, p_1, u) + c_λ(s_2, p_2, u) − c_λ(s_1, p_2, u) − c_λ(s_2, p_1, u)
  = E_{τ′}[tr(h^{τ′}(P^s))|τ_1, p_1] + E_{τ′}[tr(h^{τ′}(P^s))|τ_2, p_2] − E_{τ′}[tr(h^{τ′}(P^s))|τ_1, p_2] − E_{τ′}[tr(h^{τ′}(P^s))|τ_2, p_1]
    + λ(p_1 − p̄) + λ(p_2 − p̄) − λ(p_2 − p̄) − λ(p_1 − p̄)
  = ( tr(h^{τ_2+1}(P^s)) − tr(h^{τ_1+1}(P^s)) ) ( q̃(p_2, h_u) − q̃(p_1, h_u) ) ≤ 0,  (52)

where q̃(p, h_u) ≜ 1 − q(p, h_u). The above relations hold by Lemma 2 and the fact that q(·, ·) is monotonically non-decreasing in p. For the second part, we have

E_{s′}[v_δ^∗(s′)|s_1, p_1, u] + E_{s′}[v_δ^∗(s′)|s_2, p_2, u] − E_{s′}[v_δ^∗(s′)|s_1, p_2, u] − E_{s′}[v_δ^∗(s′)|s_2, p_1, u]
  = Σ_{h′} P_h(h′|h) ( q(p_1, h_u)v_δ^∗(s_0) + q(p_2, h_u)v_δ^∗(s_0) − q(p_2, h_u)v_δ^∗(s_0) − q(p_1, h_u)v_δ^∗(s_0)
    + q̃(p_1, h_u)v_δ^∗(s_1 + 1) + q̃(p_2, h_u)v_δ^∗(s_2 + 1) − q̃(p_2, h_u)v_δ^∗(s_1 + 1) − q̃(p_1, h_u)v_δ^∗(s_2 + 1) )
  = Σ_{h′} P_h(h′|h) ( ( q̃(p_1, h_u) − q̃(p_2, h_u) ) ( v_δ^∗(s_1 + 1) − v_δ^∗(s_2 + 1) ) ) ≤ 0,  (53)

where v_δ^∗(s_0) ≜ v_δ^∗(0, h′), v_δ^∗(s_1 + 1) ≜ v_δ^∗(τ_1 + 1, h′), and v_δ^∗(s_2 + 1) ≜ v_δ^∗(τ_2 + 1, h′). The above inequality holds by Lemma 4 and the fact that the function q(·, ·) is monotone non-decreasing in p. Hence, the inequality (50) holds, and therefore Q(·, ·) is a submodular function on N × P. Taking account of Lemma 3, p^∗ = arg min_{p∈P_i} Q(s, p) with h_u = h^{(i)}, i ∈ M is non-decreasing in the holding time τ, which concludes Theorem 3.
in the holding time τ , which concludes Theorem 3.
2128–2143.
Pantazis, N. A., & Vergados, D. D. (2007). A survey on power control issues
References in wireless sensor networks. IEEE Communications Surveys & Tutorials, 9(4),
86–107.
Pezzutto, M., Schenato, L., & Dey, S. (2022). Transmission power allocation for
Adamu, P. U., López-Benítez, M., & Zhang, J. (2023). Hybrid transmission scheme
remote estimation with multi-packet reception capabilities. Automatica, 140,
for improving link reliability in mmwave URLLC communications. IEEE
Article 110257.
Transactions on Wireless Communication.
Proakis, J. G., & Salehi, M. (2001). Digital communications, vol. 4. McGraw-hill
Altman, E. (1999). Constrained Markov decision processes, vol. 7. CRC Press.
New York.
Bertsekas, D., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena
Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic
Scientific.
programming. John Wiley & Sons.
Borkar, V. S. (2009). Stochastic approximation: A dynamical systems viewpoint, vol. Quevedo, D. E., Ahlen, A., & Johansson, K. H. (2012). State estimation over
48. Springer. sensor networks with correlated wireless fading channels. IEEE Transactions
Chadès, I., Chapron, G., Cros, M.-J., Garcia, F., & Sabbadin, R. (2014). MDPtool- on Automatic Control, 58(3), 581–593.
box: a multi-platform toolbox to solve stochastic dynamic programming Quevedo, D. E., Ahlén, A., Leong, A. S., & Dey, S. (2012). On Kalman filtering over
problems. Ecography, 37(9), 916–920. fading wireless channels with controlled transmission powers. Automatica,
Ding, K., Li, Y., Quevedo, D. E., Dey, S., & Shi, L. (2017). A multi-channel trans- 48(7), 1306–1316.
mission schedule for remote state estimation under DoS attacks. Automatica, Quevedo, D. E., Ahlén, A., & Ostergaard, J. (2010). Energy efficient state estimation
78, 194–201. with wireless sensors through the use of predictive power control and
Gatsis, K., Ribeiro, A., & Pappas, G. J. (2014). Optimal power management coding. IEEE Transactions on Signal Processing, 58(9), 4811–4823.
in wireless control systems. IEEE Transactions on Automatic Control, 59(6), Quevedo, D. E., Østergaard, J., & Ahlen, A. (2013). Power control and coding
1495–1510. formulation for state estimation with wireless sensors. IEEE Transactions on
George, S. M., Zhou, W., Chenji, H., Won, M., Lee, Y. O., Pazarloglou, A., et al. Control Systems Technology, 22(2), 413–427.
(2010). DistressNet: a wireless ad hoc and sensor network architecture for Ray, N. K., & Turuk, A. K. (2017). A framework for post-disaster communication
situation management in disaster response. IEEE Communications Magazine, using wireless ad hoc networks. Integration, 58, 274–285.
48(3), 128–136. Ren, X., Wu, J., Johansson, K. H., Shi, G., & Shi, L. (2017). Infinite horizon optimal
Goldsmith, A. (2005). Wireless communications. Cambridge University Press. transmission power control for remote state estimation over fading channels.
Gupta, V., Chung, T. H., Hassibi, B., & Murray, R. M. (2006). On a stochastic IEEE Transactions on Automatic Control, 63(1), 85–100.
sensor selection algorithm with applications in sensor scheduling and sensor Schäl, M. (1993). Average optimality in dynamic programming with general state
coverage. Automatica, 42(2), 251–260. space. Mathematics of Operations Research, 18(1), 163–172.
Hernández-Lerma, O., & Lasserre, J. B. (2012). Discrete-time Markov control Shi, L., Johansson, K. H., & Qiu, L. (2011). Time and event-based sensor schedul-
processes: basic optimality criteria, vol. 30. Springer Science & Business Media. ing for networks with limited communication resources. IFAC Proceedings
Iwaki, T., Wu, Y., Wu, J., Sandberg, H., & Johansson, K. H. (2017). Wireless sensor Volumes, 44(1), 13263–13268.
network scheduling for remote estimation under energy constraints. In 2017 Sinopoli, B., Schenato, L., Franceschetti, M., Poolla, K., Jordan, M. I., & Sastry, S.
IEEE 56th annual conference on decision and control. S. (2004). Kalman filtering with intermittent observations. IEEE Transactions
Johnsonbaugh, R., & Pfaffenberger, W. E. (2012). Foundations of mathematical on Automatic Control, 49(9), 1453–1464.
analysis. Courier Corporation. Soua, R., & Minet, P. (2015). Multichannel assignment protocols in wireless
Kailath, T., Sayed, A. H., & Hassibi, B. (2000). Linear estimation, no. BOOK. Prentice sensor networks: A comprehensive survey. Pervasive and Mobile Computing,
Hall. 16, 2–21.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT
Khodayar, M., Liu, G., Wang, J., & Khodayar, M. E. (2020). Deep learning in power
Press.
systems research: A review. CSEE Journal of Power and Energy Systems, 7(2),
209–220. Wu, S., Ding, K., Cheng, P., & Shi, L. (2020). Optimal scheduling of multiple
sensors over lossy and bandwidth limited channels. IEEE Transactions on
Leong, A. S., Dey, S., Nair, G. N., & Sharma, P. (2011). Power allocation for outage
Control of Network Systems, 7(3), 1188–1200.
minimization in state estimation over fading channels. IEEE Transactions on
Wu, S., Ren, X., Jia, Q.-S., Johansson, K. H., & Shi, L. (2019). Learning optimal
Signal Processing, 59(7), 3382–3397.
scheduling policy for remote state estimation under uncertain channel
Leong, A. S., Ramaswamy, A., Quevedo, D. E., Karl, H., & Shi, L. (2020). Deep
condition. IEEE Transactions on Control of Network Systems, 7(2), 579–591.
reinforcement learning for wireless sensor scheduling in cyber–physical
Xiong, J., Wang, Q., Yang, Z., Sun, P., Han, L., Zheng, Y., et al. (2018). Parametrized
systems. Automatica, 113, Article 108759.
deep q-networks learning: Reinforcement learning with discrete-continuous
Li, X., Li, D., Wan, J., Vasilakos, A. V., Lai, C.-F., & Wang, S. (2017). A review of
hybrid action space. arXiv preprint arXiv:1810.06394.
industrial wireless networks in the context of industry 4.0. Wireless Networks,
Yang, L., Rao, H., Lin, M., Xu, Y., & Shi, P. (2022). Optimal sensor scheduling
23(1), 23–41.
for remote state estimation with limited bandwidth: a deep reinforcement
Li, Y., Mehr, A. S., & Chen, T. (2019). Multi-sensor transmission power con- learning approach. Information Sciences, 588, 279–292.
trol for remote estimation through a SINR-based communication channel. Yang, L., Xu, Y., Huang, Z., Rao, H., & Quevedo, D. E. (2022). Learning optimal
Automatica, 101, 78–86. stochastic sensor scheduling for remote estimation with channel capacity
Li, Y., Zhang, F., Quevedo, D. E., Lau, V., Dey, S., & Shi, L. (2016). Power control constraint. IEEE Transactions on Industrial Informatics.
of an energy harvesting sensor for remote state estimation. IEEE Transactions Zhang, Y., Gan, R., Shao, J., Zhang, H., & Cheng, Y. (2020). Path selection with
on Automatic Control, 62(1), 277–290. Nash Q-learning for remote state estimation over multihop relay network.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). International Journal of Robust and Nonlinear Control, 30(11), 4331–4344.
Continuous control with deep reinforcement learning. arXiv preprint arXiv: Zhang, X.-M., Han, Q.-L., & Yu, X. (2015). Survey on recent advances in
1509.02971. networked control systems. IEEE Transactions on Industrial Informatics, 12(5),
Lin, S., Miao, F., Zhang, J., Zhou, G., Gu, L., He, T., et al. (2016). ATPC: Adaptive 1740–1752.
transmission power control for wireless sensor networks. ACM Transactions
on Sensor Networks, 12(1), 1–31.
Liu, H. (2019). SINR-based multi-channel power schedule under DoS attacks: A
stackelberg game approach with incomplete information. Automatica, 100,
274–280. Yong Xu was born in Zhejiang Province, China, in 1983.
Liu, W., Quevedo, D. E., Li, Y., Johansson, K. H., & Vucetic, B. (2021). Remote He received the B.S. degree in information engineer-
state estimation with smart sensors over Markov fading channels. IEEE ing from Nanchang Hangkong University, Nanchang,
Transactions on Automatic Control, 67(6), 2743–2757. China, in 2007, the M.S. degree in control science
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et and engineering from Hangzhou Dianzi University,
al. (2013). Playing atari with deep reinforcement learning. arXiv preprint Hangzhou, China, in 2010, and the Ph.D. degree in
arXiv:1312.5602. control science and engineering from Zhejiang Uni-
Ni, Y., Leong, A. S., Quevedo, D. E., & Shi, L. (2019). Pricing and selection of versity, Hangzhou, China, in 2014. He was a visiting
channels for remote state estimation using a stackelberg game framework. internship student with the department of Electronic
IEEE Transactions on Signal and Information Processing over Networks, 5(4), and Computer Engineering, Hong Kong University of
657–668. Science and Technology, Hong Kong, China, from June

Yong Xu was born in Zhejiang Province, China, in 1983. He received the B.S. degree in information engineering from Nanchang Hangkong University, Nanchang, China, in 2007, the M.S. degree in control science and engineering from Hangzhou Dianzi University, Hangzhou, China, in 2010, and the Ph.D. degree in control science and engineering from Zhejiang University, Hangzhou, China, in 2014. He was a visiting internship student with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China, from June 2013 to November 2013, where he was a Research Fellow from February 2018 to August 2018. He is now a professor with the School of Automation, Guangdong University of Technology, Guangzhou, China. His research interests include PID control, networked control systems, state estimation, and positive systems.

Haoxiang Xiang was born in Guangdong, China, in 2000. He received the B.S. degree in Automation from Guangdong University of Technology, Guangdong, China, in 2022. He is currently working toward the Master's degree in Control Engineering at the University of Science and Technology of China, Hefei, China. His research interests include state estimation, control strategies, and health management of energy storage systems.

Lixin Yang received the B.S. degree from the School of Automation, Guangdong University of Technology, Guangzhou, China, in 2018, and the Ph.D. degree in control science and engineering from the School of Automation, Guangdong University of Technology, Guangzhou, China, in 2023. He was a visiting internship student in the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, from October 2019 to March 2020. He is currently a Postdoctoral Research Fellow with the School of Electrical Engineering and Robotics, Queensland University of Technology, Brisbane, Australia. His research interests include networked control systems, sensor scheduling, and reinforcement learning.

Renquan Lu received the Ph.D. degree in control science and engineering from Zhejiang University, Hangzhou, China, in 2004. He was supported by the National Science Fund for Distinguished Young Scientists of China in 2014, and was appointed Distinguished Professor of the Pearl River Scholars Program of Guangdong Province and Distinguished Professor of the Yangtze River Scholars Program by the Ministry of Education of China in 2015 and 2017, respectively. He is currently a Professor of the School of Automation, Guangdong University of Technology, Guangzhou, China. His research interests include complex systems, networked control systems, and nonlinear systems.

Daniel E. Quevedo received Ingeniero Civil Electrónico and M.Sc. degrees from Universidad Técnica Federico Santa María, Valparaíso, Chile, in 2000, and the Ph.D. degree from the University of Newcastle, Australia, in 2005. He is Professor of Cyberphysical Systems at the School of Electrical Engineering and Robotics, Queensland University of Technology (QUT), Australia. Before joining QUT, he established and led the Chair in Automatic Control at Paderborn University, Germany. Prof. Quevedo's research interests are in networked control systems, control of power converters, and cyberphysical systems security. He currently serves as Associate Editor for IEEE Control Systems and on the Editorial Board of the International Journal of Robust and Nonlinear Control. From 2015 to 2018 he was Chair of the IEEE Control Systems Society Technical Committee on Networks & Communication Systems. In 2003, he received the IEEE Conference on Decision and Control Best Student Paper Award, and was also a finalist in 2002. Prof. Quevedo is co-recipient of the 2018 IEEE Transactions on Automatic Control George S. Axelby Outstanding Paper Award. He is a Fellow of the IEEE.